ZooKeeper approved by Apache Board as TLP!
We are now officially an Apache TLP! http://bit.ly/9czN2x As part of the process for moving out from under Hadoop and into full TLP status we need to work through the following: http://incubator.apache.org/guides/graduation.html#new-project-hand-over If you are involved with the project, especially on the dev side, please review these sections. Notice that a number of things will be changing: mailing list, source repo, wiki, etc. I'll be sending out updates as we work through these. Regards and congratulations everyone! Patrick
Re: number of clients/watchers
Camille, that's a very good question. Largest cluster I've heard about is 10k sessions. Jeremy - largest I've ever tested was a 3-server cluster with ~500 sessions. Each session created 10k znodes (100 bytes each znode) and set 5 watches on each. So 5 million znodes and 25 million watches. I then had the sessions delete the znodes and looked for the notifications. They were processed by the clients quite quickly (order of seconds) iirc. Note: this required some GC tuning on the servers to operate correctly (in particular CMS and incremental GC were turned on and sufficient memory was allocated for the heaps). Here's a similar test setup I used: http://wiki.apache.org/hadoop/ZooKeeper/ServiceLatencyOverview and this is the latency tester tool: https://github.com/phunt/zk-smoketest Patrick On Thu, Nov 18, 2010 at 9:44 AM, Fournier, Camille F. [Tech] camille.fourn...@gs.com wrote: Can you clarify what you mean when you say 10-100K watchers? Do you mean 10-100K clients with 1 active watch, or some lesser number of clients with more watches, or a few clients doing a lot of watches and other clients doing other things? -Original Message- From: Jeremy Hanna [mailto:jeremy.hanna1...@gmail.com] Sent: Thursday, November 18, 2010 12:15 PM To: zookeeper-user@hadoop.apache.org Subject: number of clients/watchers I had a question about number of clients against a zookeeper cluster. I was looking at having between 10,000 and 100,000 (towards 100,000) watchers within a single datacenter at a given time. Assuming that some fraction of that number are active clients and the r/w ratio is well within the zookeeper norms, is that number within the realm of possibility for zookeeper? We're going to do testing and benchmarking and things, but I didn't want to go down a rabbit hole if this is simply too much for a single zookeeper cluster to handle. The numbers I've seen in blog posts vary and I saw that the observers feature may be useful in this kind of setting.
Maybe I'm underestimating zookeeper or maybe I don't have enough information to tell. I'm just trying to see if zookeeper is a good fit for our use case. Thanks.
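The scale in Patrick's test can be sanity-checked with simple arithmetic. This sketch just multiplies out the figures from his description (500 sessions, 10k znodes per session, 100-byte payloads, 5 watches per znode); it is an illustration, not part of the test harness:

```java
public class WatchScaleEstimate {
    public static void main(String[] args) {
        int sessions = 500;              // client sessions in the test
        int znodesPerSession = 10_000;   // znodes created by each session
        int bytesPerZnode = 100;         // payload size per znode
        int watchesPerZnode = 5;         // watches set on each znode

        long totalZnodes = (long) sessions * znodesPerSession;
        long totalWatches = totalZnodes * watchesPerZnode;
        long dataBytes = totalZnodes * bytesPerZnode;

        System.out.println(totalZnodes);  // 5 million znodes
        System.out.println(totalWatches); // 25 million watches
        System.out.println(dataBytes);    // ~500 MB of payload, before per-znode overhead
    }
}
```

The ~500 MB figure is payload only; the heap tuning Patrick mentions is needed because each znode and watch carries additional server-side bookkeeping on top of that.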
Re: number of clients/watchers
fyi: I haven't heard of anyone running over 10k sessions. I've tried 20k before and had issues, so you may want to look at this sooner rather than later. * Server GC tuning will be an issue (be sure to use CMS/incremental). * Be sure to disable clients accessing the leader (server configuration param). * You may need to use the Observers feature to scale out this large. Patrick On Thu, Nov 18, 2010 at 10:31 AM, Jeremy Hanna jeremy.hanna1...@gmail.com wrote: Can you clarify what you mean when you say 10-100K watchers? Do you mean 10-100K clients with 1 active watch, or some lesser number of clients with more watches, or a few clients doing a lot of watches and other clients doing other things? Probably 10-100K clients each with 1 or 2 active watches. The clients will respond to watch events and sometimes initiate actions of their own. here's a similar test setup I used: Thanks Patrick - it's really nice to have those numbers and test harness basis. We're still in architecture mode so some of the details are still in flux, but I think this gives us an idea. Thanks very much. On Nov 18, 2010, at 11:51 AM, Patrick Hunt wrote: Camille, that's a very good question. Largest cluster I've heard about is 10k sessions. Jeremy - largest I've ever tested was a 3-server cluster with ~500 sessions. Each session created 10k znodes (100 bytes each znode) and set 5 watches on each. So 5 million znodes and 25 million watches. I then had the sessions delete the znodes and looked for the notifications. They were processed by the clients quite quickly (order of seconds) iirc. Note: this required some GC tuning on the servers to operate correctly (in particular CMS and incremental GC were turned on and sufficient memory was allocated for the heaps). here's a similar test setup I used: http://wiki.apache.org/hadoop/ZooKeeper/ServiceLatencyOverview this is the latency tester tool https://github.com/phunt/zk-smoketest Patrick On Thu, Nov 18, 2010 at 9:44 AM, Fournier, Camille F.
[Tech] camille.fourn...@gs.com wrote: Can you clarify what you mean when you say 10-100K watchers? Do you mean 10-100K clients with 1 active watch, or some lesser number of clients with more watches, or a few clients doing a lot of watches and other clients doing other things? -Original Message- From: Jeremy Hanna [mailto:jeremy.hanna1...@gmail.com] Sent: Thursday, November 18, 2010 12:15 PM To: zookeeper-user@hadoop.apache.org Subject: number of clients/watchers I had a question about number of clients against a zookeeper cluster. I was looking at having between 10,000 and 100,000 (towards 100,000) watchers within a single datacenter at a given time. Assuming that some fraction of that number are active clients and the r/w ratio is well within the zookeeper norms, is that number within the realm of possibility for zookeeper? We're going to do testing and benchmarking and things, but I didn't want to go down a rabbit hole if this is simply too much for a single zookeeper cluster to handle. The numbers I've seen in blog posts vary and I saw that the observers feature may be useful in this kind of setting. Maybe I'm underestimating zookeeper or maybe I don't have enough information to tell. I'm just trying to see if zookeeper is a good fit for our use case. Thanks.
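The "disable clients accessing the leader" knob Patrick mentions is the leaderServes server property. A minimal zoo.cfg sketch (hostnames and paths are placeholders):

```
# zoo.cfg (fragment)
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
# Keep the leader dedicated to coordination; only followers serve clients.
leaderServes=no
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
```

With leaderServes=no the leader spends its cycles on proposals and acks rather than client reads, which helps at the session counts discussed above.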
Re: number of clients/watchers
On Thu, Nov 18, 2010 at 3:46 PM, Jeremy Hanna jeremy.hanna1...@gmail.com wrote: Unless I misunderstand, active watches aren't open sessions. If that's the case, I don't think we'll hit the 10K-20K number of open sessions at a given time. However, that's a good boundary to keep in mind as we put the system together. Right. A session is represented by a ZooKeeper object. One session per object. So if you have 10 client hosts each creating its own ZooKeeper instance you'll have 10 sessions. This is regardless of the number of znodes, watches, etc... Watches were designed to be lightweight and you can maintain a large number of them. (25 million spread across 500 sessions in my example) Patrick On 11/18/10 2:06 PM, Fournier, Camille F. [Tech] camille.fourn...@gs.com wrote: We tested up to the ulimit (~16K) of connections against a single server and performance was ok, but I would definitely try to do some serious load testing before I put a system into production that I knew was going to have that load from the get-go. The system degrades VERY ungracefully when you hit the ulimit for the process, so be sure to have enough ensemble nodes to spread those connections across so that this won't happen. I think maybe there's a JIRA out to deal with this issue, not sure what the status is. C -Original Message- From: Patrick Hunt [mailto:ph...@apache.org] Sent: Thursday, November 18, 2010 2:57 PM To: zookeeper-user@hadoop.apache.org Subject: Re: number of clients/watchers fyi: I haven't heard of anyone running over 10k sessions. I've tried 20k before and had issues, you may want to look at this sooner rather than later. * Server GC tuning will be an issue (be sure to use CMS/incremental). * Be sure to disable clients accessing the leader (server configuration param). * You may need to use the Observers feature to scale out this large.
Patrick On Thu, Nov 18, 2010 at 10:31 AM, Jeremy Hanna jeremy.hanna1...@gmail.com wrote: Can you clarify what you mean when you say 10-100K watchers? Do you mean 10-100K clients with 1 active watch, or some lesser number of clients with more watches, or a few clients doing a lot of watches and other clients doing other things? Probably 10-100K clients each with 1 or 2 active watches. The clients will respond to watch events and sometimes initiate actions of their own. here's a similar test setup I used: Thanks Patrick - it's really nice to have those numbers and test harness basis. We're still in architecture mode so some of the details are still in flux, but I think this gives us an idea. Thanks very much. On Nov 18, 2010, at 11:51 AM, Patrick Hunt wrote: Camille, that's a very good question. Largest cluster I've heard about is 10k sessions. Jeremy - largest I've ever tested was a 3-server cluster with ~500 sessions. Each session created 10k znodes (100 bytes each znode) and set 5 watches on each. So 5 million znodes and 25 million watches. I then had the sessions delete the znodes and looked for the notifications. They were processed by the clients quite quickly (order of seconds) iirc. Note: this required some GC tuning on the servers to operate correctly (in particular CMS and incremental GC were turned on and sufficient memory was allocated for the heaps). here's a similar test setup I used: http://wiki.apache.org/hadoop/ZooKeeper/ServiceLatencyOverview this is the latency tester tool https://github.com/phunt/zk-smoketest Patrick On Thu, Nov 18, 2010 at 9:44 AM, Fournier, Camille F. [Tech] camille.fourn...@gs.com wrote: Can you clarify what you mean when you say 10-100K watchers? Do you mean 10-100K clients with 1 active watch, or some lesser number of clients with more watches, or a few clients doing a lot of watches and other clients doing other things?
-Original Message- From: Jeremy Hanna [mailto:jeremy.hanna1...@gmail.com] Sent: Thursday, November 18, 2010 12:15 PM To: zookeeper-user@hadoop.apache.org Subject: number of clients/watchers I had a question about number of clients against a zookeeper cluster. I was looking at having between 10,000 and 100,000 (towards 100,000) watchers within a single datacenter at a given time. Assuming that some fraction of that number are active clients and the r/w ratio is well within the zookeeper norms, is that number within the realm of possibility for zookeeper? We're going to do testing and benchmarking and things, but I didn't want to go down a rabbit hole if this is simply too much for a single zookeeper cluster to handle. The numbers I've seen in blog posts vary and I saw that the observers feature may be useful in this kind of setting. Maybe I'm underestimating zookeeper or maybe I don't have enough information to tell. I'm just trying to see if zookeeper is a good fit for our use case. Thanks.
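Camille's ulimit warning is easy to check on each server host. A quick sketch (the 65536 target is just an example; how you raise the limit permanently depends on your init system or limits.conf):

```shell
#!/bin/sh
# Show the current per-process open-file limit; every client connection
# to the ZooKeeper server counts against this number.
current=$(ulimit -n)
echo "current open-file limit: $current"

# Raising it is typically done in the server's init script or limits.conf;
# from a shell it would look like:
#   ulimit -n 65536
```

Because the server degrades ungracefully at the limit, it is worth alerting well before connection counts approach it.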
Re: Verifying Changes
Perhaps something similar to what Ben detailed here? (rendezvous) http://developer.yahoo.com/blogs/hadoop/posts/2009/05/using_zookeeper_to_tame_system/ Change the key, then add child znode(s) that are deleted by the notified client(s) once they've read the changed value. Some details need to be worked out but it seems reasonable. Patrick On Tue, Nov 9, 2010 at 6:42 PM, Ben Hall b...@zynga.com wrote: Hi All... Long time reader... First time writer... Hehe... I am curious to know what successes people have had with verifying zookeeper changes across a pool of clients, i.e., being able to verify that your changed key did in fact get pushed out to all of the subscribed clients. We are looking at creating a hash of the finished key value and comparing that with what is on the ZK server... But curious if anyone has any smarter ideas. Thanks Ben
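Stripped of the ZooKeeper calls, the coordinator's bookkeeping in that rendezvous pattern is just tracking which acknowledgements remain outstanding. A sketch under the assumption that each notified client deletes an ack child named after itself (class and method names here are made up for illustration):

```java
import java.util.HashSet;
import java.util.Set;

// Tracks outstanding acknowledgements for one config change. In the real
// pattern, addExpected() mirrors creating an ack child znode per client,
// and ackReceived() mirrors the watch firing when that child is deleted.
public class ChangeRendezvous {
    private final Set<String> outstanding = new HashSet<>();

    public void addExpected(String clientId) {
        outstanding.add(clientId);
    }

    public void ackReceived(String clientId) {
        outstanding.remove(clientId);
    }

    // True once every subscribed client has read the changed value.
    public boolean allClientsCaughtUp() {
        return outstanding.isEmpty();
    }
}
```

The hash comparison Ben describes would complement this: the ack child could carry the hash the client computed, letting the coordinator verify content as well as delivery.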
Re: Key factors for production readiness of Hedwig
On Wed, Nov 10, 2010 at 10:58 AM, Erwin Tam e...@yahoo-inc.com wrote: 1. Ops tools including monitoring and administration. The command port (4-letter words) for monitoring has worked extremely well for ZK. Whatever you do, put the command port on a separate port, and make it a full-fledged feature rather than a hack (allow clients to maintain sessions, allow more complex requests than just a 4-letter word, etc...). Perhaps in today's world you should just go with a REST interface (easy using Jersey) rather than try to implement a 4-letter word. You get json/xml/text for free, and it's easy to integrate with any monitoring app or ad hoc script. Patrick
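The shape of such a REST monitoring endpoint can be sketched with the JDK's built-in HttpServer, without pulling in Jersey. The /stats path and the JSON fields here are invented for illustration; a real service would expose its actual counters:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class MonitoringEndpoint {
    // Starts a tiny HTTP server answering GET /stats with JSON.
    public static HttpServer start(int port) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/stats", exchange -> {
            // In a real service these numbers would come from server state.
            String json = "{\"status\":\"ok\",\"sessions\":42}";
            byte[] body = json.getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
        return server;
    }
}
```

As Patrick notes, the win over a 4-letter-word protocol is that any monitoring app or ad hoc script can consume this without a custom client.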
[Discussion] Some proposed logging (log4j) JIRAs
I wanted to highlight a couple of recent JIRAs that may have impact on users (API consumers AND admins of the service) in the 3.4 timeframe. If you want to weigh in please comment on the respective JIRA: 1) proposal to move to slf4j (remove/replace log4j) https://issues.apache.org/jira/browse/ZOOKEEPER-850 From the user perspective not much should change, as slf4j has full support for log4j as an engine. But I'm not fully versed on every particular. Note that HBase is in the process of moving https://issues.apache.org/jira/browse/HBASE-2608 and Avro has already moved to slf4j; not sure about some of the other Hadoop ex-subprojects. 2) On a related note, we did a bunch of work in the 3.3 timeframe to improve logging where the severity levels of log messages tended to be too verbose (many items which should have been debug/trace were info). Much of this was based on feedback we received from the HBase community. However there are still some rough edges. A recent JIRA https://issues.apache.org/jira/browse/ZOOKEEPER-912 is proposing some additional changes. It would be good for users/admins (consumers of the client API and those involved with running the service itself) to weigh in if they have any insights/preferences. My primary concern is that we are still able to help users when they run into trouble - i.e. sufficient logging at info level, not losing critical detail in the weeds of debug/trace level. It's unfortunate that we only have 3 levels to play with here. Feel free to weigh in. Regards, Patrick
Re: Running cluster behind load balancer
Hi Chang, thanks for the insights, if you have a few minutes would you mind updating the FAQ with some of this detail? http://wiki.apache.org/hadoop/ZooKeeper/FAQ Thanks! Patrick On Thu, Nov 4, 2010 at 6:27 AM, Chang Song tru64...@me.com wrote: Sorry, I made a mistake on retry timeout in the load balancer section of my answer. The same timeout applies to the load balancer case as well (depends on the recv timeout). Thank you Chang On Nov 4, 2010, at 10:22 PM, Chang Song wrote: I would like to add some info on this. This may not be very important, but there are subtle differences. Two cases: 1. server hardware failure or kernel panic 2. zookeeper Java daemon process down In the former case, the timeout will be based on the timeout argument in zookeeper_init(), partially based on the ZK heartbeat algorithm. It recognizes the server is down in 2/3 of the timeout, then retries at every timeout. For example, if the timeout is 9000 msec, it first times out in 6 seconds, and retries every 9 seconds. In the latter case (Java process down), since the socket connect immediately returns refused connection, it can retry immediately. On top of that: - Hardware load balancer: If an ensemble cluster is serviced with a hardware load balancer, the zookeeper client will retry every 2 seconds since we only have one IP to try. - DNS RR: Make sure that nscd on your linux box is off, since it is most likely that the DNS cache returns the same IP many times. This is actually worse than the above since the ZK client will retry the same dead server every 2 seconds for some time. I think it is best not to use a load balancer for ZK clients, since ZK clients will try the next server immediately if the previous one fails for some reason (based on the timeout above). And this is especially true if your cluster works in a pseudo-realtime environment where tickTime is set very low. Chang On Nov 4, 2010, at 9:17 AM, Ted Dunning wrote: DNS round-robin works as well.
On Wed, Nov 3, 2010 at 3:45 PM, Benjamin Reed br...@yahoo-inc.com wrote: it would have to be a TCP-based load balancer to work with ZooKeeper clients, but other than that it should work really well. The clients will be doing heartbeats so the TCP connections will be long-lived. The client library does random connection load balancing anyway. ben On 11/03/2010 12:19 PM, Luka Stojanovic wrote: What would be the expected behavior if a three node cluster is put behind a load balancer? It would ease deployment because all clients would be configured to target zookeeper.example.com regardless of actual cluster configuration, but I have the impression that the client-server connection is stateful and that jumping randomly from server to server could bring strange behavior. Cheers, -- Luka Stojanovic lu...@vast.com Platform Engineering
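Chang's 2/3 rule is easy to state in code. A sketch mirroring the 9000 ms example above (the constants are illustrative; the actual behavior lives inside the C client's heartbeat logic):

```java
public class SessionTimeoutMath {
    // The client detects a dead server after roughly 2/3 of the
    // negotiated session timeout, then retries once per full timeout.
    static long detectionMillis(long sessionTimeoutMillis) {
        return sessionTimeoutMillis * 2 / 3;
    }

    public static void main(String[] args) {
        long timeout = 9000; // zookeeper_init() timeout argument, in ms
        System.out.println(detectionMillis(timeout)); // 6000: first timeout after 6 s
        System.out.println(timeout);                  // 9000: retry interval thereafter
    }
}
```

This is why a very low tickTime (and hence low session timeouts) makes the client-side failover so fast that an external load balancer adds little.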
Re: JUnit tests do not produce logs if the JVM crashes
In addition to what Mahadev suggested, you can also change the log4j.properties to log to a file rather than the CONSOLE. Although that just redirects the logs; if there is some output to stdout/stderr then junit buffering is still in play. Patrick On Thu, Nov 4, 2010 at 8:15 AM, Mahadev Konar maha...@yahoo-inc.com wrote: Hi Andras, JUnit will always buffer the logs unless you print them out to the console. To do that, try running: ant test -Dtest.output=yes This will print out the logs to the console as they are logged. Thanks mahadev On 11/4/10 3:33 AM, András Kövi allp...@gmail.com wrote: Hi all, I'm new to Zookeeper and ran into an issue while trying to run the tests with ant. It seems like the log output is buffered until the complete test suite finishes and it is flushed into its specific file only after then. I had to make some changes to the code (no JNI or similar) that resulted in JVM crashes. Since the logs are lost in this case, it is a little hard to debug the issue. Do you have any idea how I could disable the buffering? Thanks, Andras
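A minimal log4j.properties along the lines Patrick suggests, swapping the CONSOLE appender for a file appender (the appender name and file path are placeholders):

```
# Route everything to a file so output survives a JVM crash.
log4j.rootLogger=INFO, FILE
log4j.appender.FILE=org.apache.log4j.FileAppender
log4j.appender.FILE.File=build/test/logs/zookeeper-test.log
# Flush each event immediately rather than buffering in the appender.
log4j.appender.FILE.ImmediateFlush=true
log4j.appender.FILE.layout=org.apache.log4j.PatternLayout
log4j.appender.FILE.layout.ConversionPattern=%d{ISO8601} [%t] %-5p %c - %m%n
```

With ImmediateFlush on, each log event hits disk as it is emitted, so a crash loses at most the event in flight rather than the whole run.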
Re: Running cluster behind load balancer
Great, thanks! On Thu, Nov 4, 2010 at 10:04 PM, Chang Song tru64...@me.com wrote: Benjamin. It looks like ZK clients can handle a list of IPs from a DNS query correctly. Yes you are right. I am updating the wiki per Patrick's request. Thanks a lot. Chang On Nov 5, 2010, at 1:10 AM, Benjamin Reed wrote: one thing to note: if you are using a DNS load balancer, some load balancers will return the list of resolved addresses in different orders to do the balancing. the zookeeper client will shuffle that list before it is used, so in reality, using a single DNS hostname resolving to all the server addresses will probably work just as well as most DNS-based load balancers. ben On 11/04/2010 08:26 AM, Patrick Hunt wrote: Hi Chang, thanks for the insights, if you have a few minutes would you mind updating the FAQ with some of this detail? http://wiki.apache.org/hadoop/ZooKeeper/FAQ Thanks! Patrick On Thu, Nov 4, 2010 at 6:27 AM, Chang Song tru64...@me.com wrote: Sorry, I made a mistake on retry timeout in the load balancer section of my answer. The same timeout applies to the load balancer case as well (depends on the recv timeout). Thank you Chang On Nov 4, 2010, at 10:22 PM, Chang Song wrote: I would like to add some info on this. This may not be very important, but there are subtle differences. Two cases: 1. server hardware failure or kernel panic 2. zookeeper Java daemon process down In the former case, the timeout will be based on the timeout argument in zookeeper_init(), partially based on the ZK heartbeat algorithm. It recognizes the server is down in 2/3 of the timeout, then retries at every timeout. For example, if the timeout is 9000 msec, it first times out in 6 seconds, and retries every 9 seconds. In the latter case (Java process down), since the socket connect immediately returns refused connection, it can retry immediately. On top of that: - Hardware load balancer: If an ensemble cluster is serviced with a hardware load balancer, the zookeeper client will retry every 2 seconds since we only have one IP to try.
- DNS RR: Make sure that nscd on your linux box is off, since it is most likely that the DNS cache returns the same IP many times. This is actually worse than the above since the ZK client will retry the same dead server every 2 seconds for some time. I think it is best not to use a load balancer for ZK clients, since ZK clients will try the next server immediately if the previous one fails for some reason (based on the timeout above). And this is especially true if your cluster works in a pseudo-realtime environment where tickTime is set very low. Chang On Nov 4, 2010, at 9:17 AM, Ted Dunning wrote: DNS round-robin works as well. On Wed, Nov 3, 2010 at 3:45 PM, Benjamin Reed br...@yahoo-inc.com wrote: it would have to be a TCP-based load balancer to work with ZooKeeper clients, but other than that it should work really well. The clients will be doing heartbeats so the TCP connections will be long-lived. The client library does random connection load balancing anyway. ben On 11/03/2010 12:19 PM, Luka Stojanovic wrote: What would be the expected behavior if a three node cluster is put behind a load balancer? It would ease deployment because all clients would be configured to target zookeeper.example.com regardless of actual cluster configuration, but I have the impression that the client-server connection is stateful and that jumping randomly from server to server could bring strange behavior. Cheers, -- Luka Stojanovic lu...@vast.com Platform Engineering
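Ben's point about client-side shuffling can be illustrated with plain stdlib code. The hostnames are placeholders, and the real shuffle happens inside the ZooKeeper client when it parses the connect string; this sketch only mimics that behavior:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ConnectStringShuffle {
    // Mimics the client: split the connect string into host:port entries,
    // then shuffle so each client spreads its connection attempts
    // across the ensemble instead of hammering the first entry.
    static List<String> shuffledServers(String connectString) {
        List<String> servers = new ArrayList<>(List.of(connectString.split(",")));
        Collections.shuffle(servers);
        return servers;
    }

    public static void main(String[] args) {
        System.out.println(shuffledServers("zk1:2181,zk2:2181,zk3:2181"));
    }
}
```

Because every client does this independently, a plain comma-separated server list (or one DNS name resolving to all the addresses) already balances connections without a load balancer in the path.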
Re: question about watcher
Definitely check out the 4-letter words then (wch*). Keep in mind getting this data may be expensive (if there's a lot of it) and that watches are local, so servers only know about the watches from sessions established through them (server 1 doesn't know about watches of sessions connected on server 2, 3, etc...). Patrick On Wed, Nov 3, 2010 at 1:13 AM, Qian Ye yeqian@gmail.com wrote: thanks Patrick, I want to know all watches set by all clients. I would open a jira and write up some design thinking about it later. On Tue, Nov 2, 2010 at 11:53 PM, Patrick Hunt ph...@apache.org wrote: Hi Qian Ye, yes you should open a JIRA for this. If you want to work on a patch we could advise you. One thing not clear to me, are you interested in just the watches set by the particular client, or all watches set by all clients? The first should be relatively easy to get, the second would be more involved (the difference between getting local watches and having to talk to the server to get all watches). Does this have to be a client API or more administrative in nature? Also see http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_zkCommands specifically the wchs, wchp, wchc 4-letter words. Regards, Patrick On Tue, Nov 2, 2010 at 4:11 AM, Qian Ye yeqian@gmail.com wrote: Hi all, Is there any progress about this issue? Should we open a new JIRA for it? We really need a way to know who set watchers on a specific node. thanks~ On Thu, Aug 6, 2009 at 11:01 PM, Qian Ye yeqian@gmail.com wrote: Thanks Mahadev, I think it is a useful feature for many scenarios. On Thu, Aug 6, 2009 at 12:59 PM, Mahadev Konar maha...@yahoo-inc.com wrote: Hi Qian, There isn't any such API. We have been thinking about adding an API for cancelling a client's watches. We have been thinking about adding a proc filesystem wherein a client will have a list of all the watches. This data can be used to know which clients are watching what znode, but this has always been in the future discussions for us.
We DO NOT have anything planned in the near future for this. Thanks mahadev On 8/5/09 6:57 PM, Qian Ye yeqian@gmail.com wrote: Hi all: Is there a client API for querying the watchers' owner for a specific znode? In some situation, we want to find out who set watchers on the znode. thx -- With Regards! Ye, Qian Made in Zhejiang University -- With Regards! Ye, Qian -- With Regards! Ye, Qian
Re: question about watcher
Hi Qian Ye, yes you should open a JIRA for this. If you want to work on a patch we could advise you. One thing not clear to me, are you interested in just the watches set by the particular client, or all watches set by all clients? The first should be relatively easy to get, the second would be more involved (the difference between getting local watches and having to talk to the server to get all watches). Does this have to be a client API or more administrative in nature? Also see http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_zkCommands specifically the wchs, wchp, wchc 4-letter words. Regards, Patrick On Tue, Nov 2, 2010 at 4:11 AM, Qian Ye yeqian@gmail.com wrote: Hi all, Is there any progress about this issue? Should we open a new JIRA for it? We really need a way to know who set watchers on a specific node. thanks~ On Thu, Aug 6, 2009 at 11:01 PM, Qian Ye yeqian@gmail.com wrote: Thanks Mahadev, I think it is a useful feature for many scenarios. On Thu, Aug 6, 2009 at 12:59 PM, Mahadev Konar maha...@yahoo-inc.com wrote: Hi Qian, There isn't any such API. We have been thinking about adding an API for cancelling a client's watches. We have been thinking about adding a proc filesystem wherein a client will have a list of all the watches. This data can be used to know which clients are watching what znode, but this has always been in the future discussions for us. We DO NOT have anything planned in the near future for this. Thanks mahadev On 8/5/09 6:57 PM, Qian Ye yeqian@gmail.com wrote: Hi all: Is there a client API for querying the watchers' owner for a specific znode? In some situation, we want to find out who set watchers on the znode. thx -- With Regards! Ye, Qian Made in Zhejiang University -- With Regards! Ye, Qian
Re: Getting a node exists code on a sequence create
Hi Jeremy, this sounds like a bug to me; I don't think you should be getting NodeExists when the sequence flag is set. Looking at the code briefly, we use the parent's cversion (incremented each time the child list is changed, added/removed). Did you see this error each time you called create, or just once? If you look at the cversion in the Stat of the znode /zkrsm on each of the servers, what does it show? You can use the java CLI to connect to each of your servers and access this information. It would be interesting to see if the data was out of sync only for a short period of time, or forever. Is this repeatable? Ben/Flavio do you see anything here? Patrick On Thu, Oct 28, 2010 at 6:06 PM, Jeremy Stribling st...@nicira.com wrote: Hi everyone, Is there any situation in which creating a new ZK node with the SEQUENCE flag should result in a node exists error? I'm seeing this happening after a failure of a ZK node that appeared to have been the master; when the new master takes over, my app is unable to create a new SEQUENCE node under an existing parent node. I'm using Zookeeper 3.2.2.
Here's a representative log snippet:
--
3050756 [ProcessThread:-1] TRACE org.apache.zookeeper.server.PrepRequestProcessor - :Psessionid:0x12bf518350f0001 type:create cxid:0x4cca0691 zxid:0xfffe txntype:unknown /zkrsm/_record
3050756 [ProcessThread:-1] WARN org.apache.zookeeper.server.PrepRequestProcessor - Got exception when processing sessionid:0x12bf518350f0001 type:create cxid:0x4cca0691 zxid:0xfffe txntype:unknown n/a
org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists
    at org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:245)
    at org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:114)
3050756 [ProcessThread:-1] DEBUG org.apache.zookeeper.server.quorum.CommitProcessor - Processing request:: sessionid:0x12bf518350f0001 type:create cxid:0x4cca0691 zxid:0x5027e txntype:-1 n/a
3050756 [ProcessThread:-1] DEBUG org.apache.zookeeper.server.quorum.Leader - Proposing:: sessionid:0x12bf518350f0001 type:create cxid:0x4cca0691 zxid:0x5027e txntype:-1 n/a
3050756 [SyncThread:0] TRACE org.apache.zookeeper.server.quorum.Leader - Ack zxid: 0x5027e
3050757 [SyncThread:0] TRACE org.apache.zookeeper.server.quorum.Leader - outstanding proposal: 0x5027e
3050757 [SyncThread:0] TRACE org.apache.zookeeper.server.quorum.Leader - outstanding proposals all
3050757 [SyncThread:0] DEBUG org.apache.zookeeper.server.quorum.Leader - Count for zxid: 0x5027e is 1
3050757 [FollowerHandler-/172.16.0.28:48776] TRACE org.apache.zookeeper.server.quorum.Leader - Ack zxid: 0x5027e
3050757 [FollowerHandler-/172.16.0.28:48776] TRACE org.apache.zookeeper.server.quorum.Leader - outstanding proposal: 0x5027e
3050757 [FollowerHandler-/172.16.0.28:48776] TRACE org.apache.zookeeper.server.quorum.Leader - outstanding proposals all
3050757 [FollowerHandler-/172.16.0.28:48776] DEBUG org.apache.zookeeper.server.quorum.Leader - Count for zxid: 0x5027e is 2
3050757 [FollowerHandler-/172.16.0.28:48776] DEBUG org.apache.zookeeper.server.quorum.CommitProcessor - Committing request:: sessionid:0x12bf518350f0001 type:create cxid:0x4cca0691 zxid:0x5027e txntype:-1 n/a
3050757 [CommitProcessor:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor - Processing request:: sessionid:0x12bf518350f0001 type:create cxid:0x4cca0691 zxid:0x5027e txntype:-1 n/a
3050757 [CommitProcessor:0] TRACE org.apache.zookeeper.server.FinalRequestProcessor - :Esessionid:0x12bf518350f0001 type:create cxid:0x4cca0691 zxid:0x5027e txntype:-1 n/a
3050757 [FollowerHandler-/172.16.0.28:41062] TRACE org.apache.zookeeper.server.quorum.Leader - Ack zxid: 0x5027e
3050757 [FollowerHandler-/172.16.0.28:41062] TRACE org.apache.zookeeper.server.quorum.Leader - outstanding proposals all
3050757 [FollowerHandler-/172.16.0.28:41062] DEBUG org.apache.zookeeper.server.quorum.Leader - outstanding is 0
--
I'm still a n00b at understanding ZK log messages, so maybe there's something obvious going on. I looked in the JIRA and did my best to search the mailing list archives, but couldn't find anything related to this. Any ideas? Thanks very much, Jeremy
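For context on why Patrick asks about the parent's cversion: the server builds a sequential node's name by appending the parent's child-change counter, zero-padded to ten digits. A rough sketch of that naming (treat the class itself as illustration; only the padding scheme mirrors what ZooKeeper produces):

```java
public class SequenceName {
    // A sequential create appends the parent znode's cversion, zero-padded
    // to ten digits. If a newly elected leader starts from a stale cversion,
    // the generated name can collide with an existing child -> NodeExists.
    static String sequentialName(String path, int parentCVersion) {
        return path + String.format("%010d", parentCVersion);
    }

    public static void main(String[] args) {
        System.out.println(sequentialName("/zkrsm/_record", 42));
        // -> /zkrsm/_record0000000042
    }
}
```

Comparing the cversion of /zkrsm across the servers, as Patrick suggests, would show whether the new leader is generating names from a counter the old leader had already moved past.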
Re: Setting the heap size
Actually if you are going to admin your own ZK it's probably a good idea to review that Admin doc fully. Some other good detail in there (backups and cleaning the datadir for example). Regards, Patrick On Fri, Oct 29, 2010 at 7:22 AM, Tim Robertson timrobertson...@gmail.com wrote: Great - thanks Patrick! On Thu, Oct 28, 2010 at 6:13 PM, Patrick Hunt ph...@apache.org wrote: Tim, one other thing you might want to be aware of: http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_supervision Patrick On Thu, Oct 28, 2010 at 9:11 AM, Patrick Hunt ph...@apache.org wrote: On Thu, Oct 28, 2010 at 2:52 AM, Tim Robertson timrobertson...@gmail.com wrote: We are setting up a small 13-node Hadoop cluster running 1 HDFS master, 9 region servers for HBase and 3 map reduce nodes, and are just installing zookeeper to perform the HBase coordination and to manage a few simple process locks for other tasks we run. Could someone please advise what kind of heap we should give to our single ZK node and also (ahem) how does one actually set this? It's not immediately obvious in the docs or config. The amount of heap necessary will be dependent on the application(s) using ZK; also, configuration of the heap is dependent on what packaging you are using to start ZK. Are you using zkServer.sh from our distribution? If so then you probably want to set the JVMFLAGS env variable. We pass this through to the jvm, see -Xmx in the man page (http://www.manpagez.com/man/1/java/) Given this is HBase (which I'm reasonably familiar with) the default heap should be fine. However you might want to check with the HBase team on that. I'd also encourage you to enter a JIRA on the (lack of) doc issue you highlighted: https://issues.apache.org/jira/browse/ZOOKEEPER Regards, Patrick
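Setting the heap via JVMFLAGS, as Patrick describes, looks like this in practice (the 1 GB value is just an example; zkServer.sh passes the variable straight through to the JVM):

```shell
#!/bin/sh
# Give the ZooKeeper server a fixed 1 GB heap, then start it.
JVMFLAGS="-Xmx1g -Xms1g"
export JVMFLAGS
echo "JVMFLAGS=$JVMFLAGS"
# bin/zkServer.sh start   # uncomment on a real installation
```

Pinning -Xms to -Xmx avoids heap-resize pauses, which matters for a latency-sensitive service like ZooKeeper.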
Re: Setting the heap size
On Thu, Oct 28, 2010 at 2:52 AM, Tim Robertson timrobertson...@gmail.com wrote: We are setting up a small Hadoop 13 node cluster running 1 HDFS master, 9 region servers for HBase and 3 map reduce nodes, and are just installing zookeeper to perform the HBase coordination and to manage a few simple process locks for other tasks we run. Could someone please advise what kind of heap we should give to our single ZK node and also (ahem) how does one actually set this? It's not immediately obvious in the docs or config. The amount of heap necessary will be dependent on the application(s) using ZK, also configuration of the heap is dependent on what packaging you are using to start ZK. Are you using zkServer.sh from our distribution? If so then you probably want to set the JVMFLAGS env variable. We pass this through to the jvm, see -Xmx in the man page (http://www.manpagez.com/man/1/java/) Given this is HBase (which I'm reasonably familiar with) the default heap should be fine. However you might want to check with the HBase team on that. I'd also encourage you to enter a JIRA on the (lack of) doc issue you highlighted: https://issues.apache.org/jira/browse/ZOOKEEPER Regards, Patrick
Re: Setting the heap size
Tim, one other thing you might want to be aware of: http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_supervision Patrick On Thu, Oct 28, 2010 at 9:11 AM, Patrick Hunt ph...@apache.org wrote: On Thu, Oct 28, 2010 at 2:52 AM, Tim Robertson timrobertson...@gmail.com wrote: We are setting up a small Hadoop 13 node cluster running 1 HDFS master, 9 region servers for HBase and 3 map reduce nodes, and are just installing zookeeper to perform the HBase coordination and to manage a few simple process locks for other tasks we run. Could someone please advise what kind of heap we should give to our single ZK node and also (ahem) how does one actually set this? It's not immediately obvious in the docs or config. The amount of heap necessary will be dependent on the application(s) using ZK, also configuration of the heap is dependent on what packaging you are using to start ZK. Are you using zkServer.sh from our distribution? If so then you probably want to set the JVMFLAGS env variable. We pass this through to the jvm, see -Xmx in the man page (http://www.manpagez.com/man/1/java/) Given this is HBase (which I'm reasonably familiar with) the default heap should be fine. However you might want to check with the HBase team on that. I'd also encourage you to enter a JIRA on the (lack of) doc issue you highlighted: https://issues.apache.org/jira/browse/ZOOKEEPER Regards, Patrick
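The JVMFLAGS advice in this thread, as a minimal sketch. The 1 GiB value is an illustrative assumption, not a recommendation from the list, and the zkServer.sh path is assumed to be on your PATH:

```shell
# zkServer.sh from the ZooKeeper distribution passes the JVMFLAGS environment
# variable through to the JVM, so the server's max heap can be capped via -Xmx.
# 1024m here is purely illustrative; size it for your own application.
export JVMFLAGS="-Xmx1024m"
# ./zkServer.sh start   # the server would now start with a 1 GiB heap ceiling
```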
Re: Retrying sequential znode creation
On Wed, Oct 20, 2010 at 3:27 PM, Ted Dunning ted.dunn...@gmail.com wrote: These corner cases are relatively rare, I would think (I personally keep logs around for days or longer). A concern I would have is that it does add complexity, would be hard to debug... Would it be possible to get a partial solution in place that invokes the current behavior if logs aren't available? Seems like it's possible. Finding a viable solution (one where, for example, the memory overhead is limited) is still an issue though. In the end it wouldn't really help the end user, given they would still have to code for this corner case. Patrick On Wed, Oct 20, 2010 at 10:42 AM, Patrick Hunt phu...@gmail.com wrote: Hi Ted, Mahadev is in the best position to comment (he looked at it last) but iirc when we started looking into implementing this we immediately ran into some big questions. One was what to do if the logs had been cleaned up and the individual transactions were no longer available. This could be overcome by changes wrt cleanup, log rotation, etc... There was another, more bulletproof option, essentially to keep all the changes in memory that might be necessary to implement 22, however this might mean a significant increase in mem requirements and general bookkeeping. It turned out (again, correct me if I'm wrong) that more thought was going to be necessary, esp around ensuring correct operation in any/all special cases. Patrick On Wed, Oct 13, 2010 at 12:49 PM, Ted Dunning ted.dunn...@gmail.com wrote: Patrick, What are these hurdles? The last comment on ZK-22 was last winter. Back then, it didn't sound like it was going to be that hard. On Wed, Oct 13, 2010 at 12:08 PM, Patrick Hunt ph...@apache.org wrote: 22 would help with this issue https://issues.apache.org/jira/browse/ZOOKEEPER-22 however there are some real hurdles to implementing 22 successfully.
Re: Reading znodes directly from snapshot and log files
Sounds like a useful utility, the closest that I know of is this: http://hadoop.apache.org/zookeeper/docs/current/api/org/apache/zookeeper/server/LogFormatter.html but it just dumps the txn log. Seems like it would be cool to be able to open a shell on the datadir and query it (separate from running a server). Another option is to just copy the datadir and start a standalone zk instance on it. You can then use the std zk shell to query it. Patrick ps. I had worked on something similar in python a while back: http://github.com/phunt/zk-txnlog-tools/blob/master/parse_txnlog.py On Thu, Oct 21, 2010 at 2:31 PM, Vishal K vishalm...@gmail.com wrote: Hi, Is it possible to read znodes directly from snapshot and log files instead of using the ZooKeeper API? In case a ZK ensemble is not available, can I log in to all available nodes and run a utility that will dump all znodes? Thanks. -Vishal
Re: Stale value for read request
On Sat, Oct 23, 2010 at 9:03 PM, jingguo yao yaojing...@gmail.com wrote: Read requests are handled locally at each Zookeeper server. So it is possible for a read request to return a stale value even though a more recent update to the same znode has been committed. Does this statement still hold if the Zookeeper follower serving the read request is the one which has just served the recent update request? It's probably good to start with the explicit guarantees: http://hadoop.apache.org/zookeeper/docs/current/zookeeperProgrammers.html#ch_zkGuarantees Yes (it could still get stale data from a quorum perspective). The leader may have committed a new change that has not yet been seen by the follower (ie two changes in quick succession). For example, client A connects to follower X. And client A issues a request to update znode /a from 0 to 1. After receiving this request, follower X forwards this request to the leader. Then the leader broadcasts this update proposal to all the Zookeeper servers. After a quorum of the followers commit the update request, the update succeeds. Then client A issues a read request to get the value of znode /a. And follower X receives this read request. So if follower X is not among the quorum and follower X has not committed the update to catch up with the leader, it is still possible for client A to get a stale value of znode /a. In this case, the return value is 0. Is my understanding correct? That's correct. See the NOTE in the section (link) I provided above. Patrick
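The scenario confirmed in this thread can be made concrete with a toy two-server model (this is purely illustrative bookkeeping, not the real replication protocol):

```python
# Toy model of the stale-read case: the leader has committed /a = 1, but this
# follower was outside the quorum that acknowledged the commit and has not yet
# applied it, so a client reading through the follower still sees the old 0.
leader = {"/a": 1}    # committed state on the leader
follower = {"/a": 0}  # lagging follower, commit not yet applied

def read(server, path):
    # Reads are served locally by whichever server the client is attached to.
    return server[path]

stale_value = read(follower, "/a")  # client A gets 0 despite the committed 1
```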
Re: Unusual exception
EOS means that the client closed the connection (from the point of view of the server). The server then tries to clean up by closing the socket explicitly; in some cases that results in the debug messages you see subsequently. EndOfStreamException: Unable to read additional data from client sessionid 0x0, likely client has closed socket Notice that the session id is 0 - so either this is a zk client that failed before establishing a session, or more likely it's a monitoring/4letterword command (which never establishes a session). Patrick On Wed, Oct 13, 2010 at 2:49 PM, Avinash Lakshman avinash.laksh...@gmail.com wrote: I started seeing a bunch of these exceptions. What do these mean? 2010-10-13 14:01:33,426 - WARN [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:5001:nioserverc...@606] - EndOfStreamException: Unable to read additional data from client sessionid 0x0, likely client has closed socket 2010-10-13 14:01:33,426 - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:5001:nioserverc...@1286] - Closed socket connection for client /10.138.34.195:55738 (no session established for client) 2010-10-13 14:01:33,426 - DEBUG [CommitProcessor:1:finalrequestproces...@78 ] - Processing request:: sessionid:0x12b9d1f8b907a44 type:closeSession cxid:0x0 zxid:0x600193996 txntype:-11 reqpath:n/a 2010-10-13 14:01:33,427 - WARN [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:5001:nioserverc...@606] - EndOfStreamException: Unable to read additional data from client sessionid 0x12b9d1f8b907a5d, likely client has closed socket 2010-10-13 14:01:33,427 - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:5001:nioserverc...@1286] - Closed socket connection for client /10.138.34.195:55979 which had sessionid 0x12b9d1f8b907a5d 2010-10-13 14:01:33,427 - DEBUG [QuorumPeer:/0.0.0.0:5001 :commitproces...@159] - Committing request:: sessionid:0x52b90ab45bd51af type:createSession cxid:0x0 zxid:0x600193cf9 txntype:-10 reqpath:n/a 2010-10-13 14:01:33,427 - DEBUG [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:5001:nioserverc...@1302] - ignoring 
exception during output shutdown java.net.SocketException: Transport endpoint is not connected at sun.nio.ch.SocketChannelImpl.shutdown(Native Method) at sun.nio.ch.SocketChannelImpl.shutdownOutput(SocketChannelImpl.java:651) at sun.nio.ch.SocketAdaptor.shutdownOutput(SocketAdaptor.java:368) at org.apache.zookeeper.server.NIOServerCnxn.closeSock(NIOServerCnxn.java:1298) at org.apache.zookeeper.server.NIOServerCnxn.close(NIOServerCnxn.java:1263) at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:609) at org.apache.zookeeper.server.NIOServerCnxn$Factory.run(NIOServerCnxn.java:262) 2010-10-13 14:01:33,428 - DEBUG [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:5001:nioserverc...@1310] - ignoring exception during input shutdown java.net.SocketException: Transport endpoint is not connected at sun.nio.ch.SocketChannelImpl.shutdown(Native Method) at sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:640) at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360) at org.apache.zookeeper.server.NIOServerCnxn.closeSock(NIOServerCnxn.java:1306) at org.apache.zookeeper.server.NIOServerCnxn.close(NIOServerCnxn.java:1263) at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:609) at org.apache.zookeeper.server.NIOServerCnxn$Factory.run(NIOServerCnxn.java:262) 2010-10-13 14:01:33,428 - WARN [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:5001:nioserverc...@606] - EndOfStreamException: Unable to read additional data from client sessionid 0x0, likely client has closed socket 2010-10-13 14:01:33,428 - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:5001:nioserverc...@1286] - Closed socket connection for client /10.138.34.195:55731 (no session established for client)
Re: zxid integer overflow
I'm not aware of sustained 1k/sec, Ben might know how long the 20k/sec test runs for (and for how long that rate is sustained). You'd definitely want to tune the GC, GC related pauses would be the biggest obstacle for this (assuming you are using a dedicated log device for the transaction logs). Patrick On Tue, Oct 19, 2010 at 3:14 PM, Sandy Pratt prat...@adobe.com wrote: Follow up question: does anyone have a production cluster that handles a similar sustained rate of changes? -Original Message- From: Benjamin Reed [mailto:br...@yahoo-inc.com] Sent: Tuesday, October 19, 2010 2:53 PM To: zookeeper-user@hadoop.apache.org Subject: Re: zxid integer overflow we should put in a test for that. it is certainly a plausible scenario. in theory it will just flow into the next epoch and everything will be fine, but we should try it and see. ben On 10/19/2010 11:33 AM, Sandy Pratt wrote: Just as a thought experiment, I was pondering the following: ZK stamps each change to its managed state with a zxid ( http://hadoop.apache.org/zookeeper/docs/r3.2.1/zookeeperInternals.html). That ID consists of a 64 bit number in which the upper 32 bits are the epoch, which changes when the leader does, and the bottom 32 bits are a counter, which is incremented by the leader with every change. If 1000 changes are made to ZK state each second (which is 1/20th of the peak rate advertised), then the counter portion will roll over in 2^32 / (86400 * 1000) = 49 days. Now, assuming that my math is correct, is this an actual concern? For example, if I'm using ZK to provide locking for a key value store that handles transactions at about that rate, am I setting myself up for failure? Thanks, Sandy
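The rollover arithmetic from the original question can be checked directly; this sketch uses the 1000 changes/sec rate from the thread:

```python
# The zxid packs a 32-bit epoch in the high half and a 32-bit counter in the
# low half; the counter increments once per committed change and resets when
# the leader (and hence the epoch) changes. At a sustained 1000 changes/sec:
COUNTER_BITS = 32
changes_per_sec = 1000

seconds_to_rollover = (1 << COUNTER_BITS) / changes_per_sec
days_to_rollover = seconds_to_rollover / 86400  # ~49.7 days, matching the estimate above
```

So absent a leader change, the counter half wraps in under two months at that rate, which is why testing the flow into the next epoch matters.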
Re: Testing zookeeper outside the source distribution?
You might check out a tool I built a while back to be used by operations teams deploying ZooKeeper: http://bit.ly/a6tGVJ It's really two tools actually, a smoketester and a latency tester, both of which are important to verify when deploying a new cluster. Patrick On Mon, Oct 18, 2010 at 9:50 AM, Ted Dunning ted.dunn...@gmail.com wrote: Generally, I think a better way to do this is to use a standard mock object framework. Then you don't have to fake up an interface. But the original poster probably has a need to do integration tests more than unit tests. In such tests, they need to test against a real ZK to make sure that their assumptions about the semantics of ZK are valid. On Mon, Oct 18, 2010 at 8:53 AM, David Rosenstrauch dar...@darose.net wrote: Consequently, the way I write my code for ZooKeeper is against a more generic interface that provides operations for open, close, getData, and setData. When unit testing, I substitute in a dummy implementation that just stores data in memory (i.e., a HashMap); when running live code I use an implementation that talks to ZooKeeper.
Re: What does this mean?
On Mon, Oct 11, 2010 at 4:16 PM, Avinash Lakshman avinash.laksh...@gmail.com wrote: tickTime = 2000, initLimit = 3000 and the data is around 11GB this is log + snapshot. So if I need to add a new observer can I transfer state from the ensemble manually before starting it? If so which files do I need to transfer? You can't really do it manually. As part of the bring up process for a server it communicates with the current leader and downloads the appropriate data (either a diff of the recent changes or a full snapshot if too far behind). Try increasing your initLimit to 15 or so (btw, that's in ticks, not milliseconds, so if you have 3000 now that's probably not the issue ;-) ). You might also want to increase the syncLimit at the same time. Here's from the sample conf that ships with the release: # The number of ticks that the initial # synchronization phase can take initLimit=10 # The number of ticks that can pass between # sending a request and getting an acknowledgement syncLimit=5 Patrick Thanks On Mon, Oct 11, 2010 at 10:16 AM, Benjamin Reed br...@yahoo-inc.com wrote: how big is your data? you may be running into the problem where it takes too long to do the state transfer and times out. check the initLimit and the size of your data. ben On 10/10/2010 08:57 AM, Avinash Lakshman wrote: Thanks Ben. I am not mixing processes of different clusters. I just double checked that. I have ZK deployed in a 5 node cluster and I have 20 observers. I just started the 5 node cluster w/o starting the observers. I still see the same issue. Now my cluster won't start up. So what is the correct workaround to get this going? How can I find out who the leader is and who the follower to get more insight? Thanks A On Sun, Oct 10, 2010 at 8:33 AM, Benjamin Reed br...@yahoo-inc.com wrote: this usually happens when a follower closes its connection to the leader. it is usually caused by the follower shutting down or failing. you may get further insight by looking at the follower logs. 
you should really run with timestamps on so that you can correlate the logs of the leader and follower. one thing that is strange is the wide divergence between zxid of follower and leader. are you mixing processes of different clusters? ben From: Avinash Lakshman [avinash.laksh...@gmail.com] Sent: Sunday, October 10, 2010 8:18 AM To: zookeeper-user Subject: What does this mean? I see this exception and the servers not doing anything. java.io.IOException: Channel eof at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:630) ERROR - 124554051584(higestZxid) 21477836646(next log) for type -11 WARN - Sending snapshot last zxid of peer is 0xe zxid of leader is 0x1e WARN - Sending snapshot last zxid of peer is 0x18 zxid of leader is 0x1eg WARN - Sending snapshot last zxid of peer is 0x5002dc766 zxid of leader is 0x1e WARN - Sending snapshot last zxid of peer is 0x1c zxid of leader is 0x1e ERROR - Unexpected exception causing shutdown while sock still open java.net.SocketException: Broken pipe at java.net.SocketOutputStream.socketWrite0(Native Method) at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92) at java.net.SocketOutputStream.write(SocketOutputStream.java:136) at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65) at java.io.BufferedOutputStream.write(BufferedOutputStream.java:78) at java.io.DataOutputStream.writeInt(DataOutputStream.java:180) at org.apache.jute.BinaryOutputArchive.writeInt(BinaryOutputArchive.java:55) at org.apache.zookeeper.data.StatPersisted.serialize(StatPersisted.java:116) at org.apache.zookeeper.server.DataNode.serialize(DataNode.java:167) at org.apache.jute.BinaryOutputArchive.writeRecord(BinaryOutputArchive.java:123) at org.apache.zookeeper.server.DataTree.serializeNode(DataTree.java:967) at org.apache.zookeeper.server.DataTree.serializeNode(DataTree.java:982) at org.apache.zookeeper.server.DataTree.serializeNode(DataTree.java:982) at 
org.apache.zookeeper.server.DataTree.serializeNode(DataTree.java:982) at org.apache.zookeeper.server.DataTree.serialize(DataTree.java:1031) at org.apache.zookeeper.server.util.SerializeUtils.serializeSnapshot(SerializeUtils.java:104) at org.apache.zookeeper.server.ZKDatabase.serializeSnapshot(ZKDatabase.java:426) at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:331) WARN - *** GOODBYE /10.138.34.212:33272 Avinash
Re: Retrying sequential znode creation
On Wed, Oct 13, 2010 at 5:58 AM, Vishal K vishalm...@gmail.com wrote: However, it gets trickier because there is no explicit way (to my knowledge) to get CreateMode for a znode. As a result, we cannot tell whether a node is sequential or not. Sequentials are really just regular znodes with fancy naming applied by the cluster at create time, subsequently it makes no distinction. Using the format of the name would be the only/best way I know of if you want to distinguish yourself. (or put some data into the znode itself) 22 would help with this issue https://issues.apache.org/jira/browse/ZOOKEEPER-22 however there are some real hurdles to implementing 22 successfully. Patrick Thanks. -Vishal On Tue, Oct 12, 2010 at 5:36 PM, Ted Dunning ted.dunn...@gmail.com wrote: Yes. This is indeed a problem. I generally try to avoid sequential nodes unless they are ephemeral and if I get an error on creation, I generally have to either tear down the connection (losing all other ephemeral nodes in the process) or scan through all live nodes trying to determine if mine got created. Neither is a very acceptable answer so I try to avoid the problem. Your UUID answer is one option. At least you know what file got created (or not) and with good naming you can pretty much guarantee no collisions. You don't have to scan all children since you can simply check for the existence of the file of interest. There was a JIRA filed that was supposed to take care of this problem, but I don't know the state of play there. On Tue, Oct 12, 2010 at 12:11 PM, Vishal K vishalm...@gmail.com wrote: Hi, What is the best approach to have an idempotent create() operation for a sequential node? Suppose a client is trying to create a sequential node and it gets a ConnectionLoss KeeperException, it cannot know for sure whether the request succeeded or not. If in the meantime, the client's session is re-established, the client would like to create a sequential znode again. 
However, the client needs to know if its earlier request has succeeded or not. If it did, then the client does not need to retry. To my understanding ZooKeeper does not provide this feature. Can someone confirm this? External to ZooKeeper, the client can either set a unique UUID in the path to the create call or write the UUID as part of its data. Before retrying, it can read back all the children of the parent znode and go through the list to determine if its earlier request had succeeded. This doesn't sound that appealing to me. I am guessing this is a common problem that many would have faced. Can folks give a feedback on what their approach was? Thanks. -Vishal
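The UUID workaround discussed in this thread can be sketched as follows. This is illustrative logic only, with hypothetical helper names; the actual create()/getChildren() calls against a live ensemble are omitted, and the naming scheme is a convention the client imposes, not an API ZooKeeper provides:

```python
import uuid

# Embed a unique token in the name passed to a sequential create. After a
# ConnectionLoss, list the parent's children and look for the token to learn
# whether the earlier create actually went through before retrying.

def make_prefix(base="lock-"):
    # ZooKeeper appends a 10-digit monotonic sequence number to this prefix.
    return base + uuid.uuid4().hex + "-"

def find_existing(children, prefix):
    """Return our znode's name if the earlier create succeeded, else None."""
    matches = [c for c in children if c.startswith(prefix)]
    return matches[0] if matches else None
```

With this, the retry loop becomes: if find_existing() returns a name, reuse it; otherwise issue the create again with the same prefix.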
Re: Changing configuration
You probably want to do a rolling restart, this is preferable over restarting the cluster as the service will not go down. http://wiki.apache.org/hadoop/ZooKeeper/FAQ#A6 Patrick On Wed, Oct 6, 2010 at 9:49 PM, Avinash Lakshman avinash.laksh...@gmail.com wrote: Suppose I have a 3 node ZK cluster composed of machines A, B and C. Now for whatever reason I lose C forever and the machine needs to be replaced. How do I handle this situation? Update the config with D in place of C and restart the cluster? Also if I am interested in reading just the ZAB portions which packages should I be looking at? Cheers A
Re: snapshots
Simplified: when a server comes back up it checks its local snaps/logs to reconstruct as much of the current state as possible. It then checks with the leader to see how far behind it is, at which point it either gets a diff or gets a full snapshot (from the leader) depending on how far behind it is. Patrick On Wed, Oct 6, 2010 at 8:11 PM, Avinash Lakshman avinash.laksh...@gmail.com wrote: Hi All Are snapshots serialized dumps of the DataTree taken whenever a log rolls over? So when a server goes down and comes back up does it construct the data tree from the snapshots? What if I am running this on a machine with SSD as extended RAM, how does it affect anything? Cheers A
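The diff-versus-snapshot decision described in this reply can be caricatured like this (a hypothetical function; the real server-side catch-up logic weighs more factors than a single zxid window):

```python
# If the rejoining server's last-seen zxid still falls inside the leader's
# in-memory committed-log window, the leader can replay just a DIFF of the
# missed transactions; otherwise it must ship a full SNAP of the data tree.
def catchup_mode(follower_zxid, log_lo, log_hi):
    return "DIFF" if log_lo <= follower_zxid <= log_hi else "SNAP"
```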
Re: znode inconsistencies across ZooKeeper servers
Vishal, this sounds like a bug in ZK to me. Can you create a JIRA with this description, your configuration files from all servers, and the log files from all servers during the time of the incident? If you could run the servers in DEBUG level logging during the time you reproduce the issue that would probably help: https://issues.apache.org/jira/browse/ZOOKEEPER Thanks! Patrick On Wed, Oct 6, 2010 at 2:57 PM, Vishal K vishalm...@gmail.com wrote: Hi Patrick, You are correct, the test restarts both ZooKeeper server and the client. The client opens a new connection after restarting. So we would expect that the ephmeral znode (/foo) to expire after the session timeout. However, the client with the new session creates the ephemeral znode (/foo) again after it reboots (it sets a watch for /foo and recreates /foo if it is deleted or doesn't exist). The client is not reusing the session ID. What I expect to see is that the older /foo should expire after which a new /foo should get created. Is my expectation correct? What confuses me is the following output of 3 successive getstat /foo requests on A (the zxid, time and owner fields). Notice that the older znode reappeared. At the same time when I do getstat at B and C, I see the newer /foo. log4j:WARN No appenders could be found for logger (org.apache.zookeeper.ZooKeeper). log4j:WARN Please initialize the log4j system properly. cZxid = 0x105ef ctime = Tue Oct 05 15:00:50 UTC 2010 mZxid = 0x105ef mtime = Tue Oct 05 15:00:50 UTC 2010 pZxid = 0x105ef cversion = 0 dataVersion = 0 aclVersion = 0 ephemeralOwner = 0x2b7ce57ce4 dataLength = 54 numChildren = 0 log4j:WARN No appenders could be found for logger (org.apache.zookeeper.ZooKeeper). log4j:WARN Please initialize the log4j system properly. 
cZxid = 0x10607 ctime = Tue Oct 05 15:01:07 UTC 2010 mZxid = 0x10607 mtime = Tue Oct 05 15:01:07 UTC 2010 pZxid = 0x10607 cversion = 0 dataVersion = 0 aclVersion = 0 ephemeralOwner = 0x2b7ce5bda4 dataLength = 54 numChildren = 0 log4j:WARN No appenders could be found for logger (org.apache.zookeeper.ZooKeeper). log4j:WARN Please initialize the log4j system properly. cZxid = 0x105ef ctime = Tue Oct 05 15:00:50 UTC 2010 mZxid = 0x105ef mtime = Tue Oct 05 15:00:50 UTC 2010 pZxid = 0x105ef cversion = 0 dataVersion = 0 aclVersion = 0 ephemeralOwner = 0x2b7ce57ce4 dataLength = 54 numChildren = 0 Thanks for your help. -Vishal On Wed, Oct 6, 2010 at 4:45 PM, Patrick Hunt ph...@apache.org wrote: Vishal the attachment seems to be getting removed by the list daemon (I don't have it), can you create a JIRA and attach? Also this is a good question for the ppl on zookeeper-user. (ccing) You are aware that ephemeral znodes are tied to the session? And that sessions only expire after the session timeout period? At which time any znodes created during that session are then deleted. The fact that you are killing your client process leads me to believe that you are not closing the session cleanly (meaning that it will eventually expire after the session timeout period), in which case the ephemeral znodes _should_ reappear when A is restarted and successfully rejoins the cluster. (at least until the session timeout is exceeded) Patrick On Tue, Oct 5, 2010 at 11:04 AM, Vishal K vishalm...@gmail.com wrote: Hi, I have a 3 node ZK cluster (A, B, C). On one of the nodes (node A), I have a ZK client running that connects to the local server and creates an ephemeral znode to indicate to clients on other nodes that it is online. I have a test script that reboots the zookeeper server as well as the client on A. The test does a getstat on the ephemeral znode created by the client on A. I am seeing that the view of znodes on A is different from the other 2 nodes. 
I can tell this from the session ID that the client gets after reconnecting to the local ZK server. So the test is simple: - kill zookeeper server and client process - wait for a few seconds - do zkCli.sh stat ... test.out What I am seeing is that the ephemeral znode with old zxid, time, and session ID is reappearing on node A. I have attached the output of 3 consecutive getstat requests of the test (see client_getstat.out). Notice that the third output is the same as the first one. That is, the old ephemeral znode reappeared at A. However, both B and C are showing the latest znode with correct time, zxid and session ID (output not attached). After this point, all following getstat requests on A are showing the old znode. Whereas, B and C show the correct znode every time the client on A comes online. This is something very perplexing. Earlier I thought this was a bug in my client implementation. But the test shows that the ZK server on A after reboot is out of sync with the rest of the servers.
Re: Too many connections
On Tue, Oct 5, 2010 at 10:23 AM, Avinash Lakshman avinash.laksh...@gmail.com wrote: So shouldn't all servers in another DC just have one session? So even if I have 50 observers in another DC that should be 50 sessions established since the IP doesn't change correct? Am I missing something? In some ZK clients I see the following exception even though they are in the same DC. This really depends on how you implemented your client. Each time you create a ZooKeeper object a new session is established. If you have 50 clients each creating a ZooKeeper object then you have 50 sessions. Patrick
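The point in this reply, one session per ZooKeeper handle, in toy form (a stand-in class, not the real client library):

```python
import itertools

# Each constructed handle performs its own session establishment with the
# ensemble; sessions are per-handle, not per-host or per-datacenter. So 50
# clients each creating one ZooKeeper object means 50 distinct sessions.
_session_counter = itertools.count(1)

class FakeZooKeeperHandle:
    def __init__(self):
        # A new session is established on construction of each handle.
        self.session_id = next(_session_counter)

sessions = {FakeZooKeeperHandle().session_id for _ in range(50)}
```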
Re: znode inconsistencies across ZooKeeper servers
Vishal the attachment seems to be getting removed by the list daemon (I don't have it), can you create a JIRA and attach? Also this is a good question for the ppl on zookeeper-user. (ccing) You are aware that ephemeral znodes are tied to the session? And that sessions only expire after the session timeout period? At which time any znodes created during that session are then deleted. The fact that you are killing your client process leads me to believe that you are not closing the session cleanly (meaning that it will eventually expire after the session timeout period), in which case the ephemeral znodes _should_ reappear when A is restarted and successfully rejoins the cluster. (at least until the session timeout is exceeded) Patrick On Tue, Oct 5, 2010 at 11:04 AM, Vishal K vishalm...@gmail.com wrote: Hi, I have a 3 node ZK cluster (A, B, C). On one of the nodes (node A), I have a ZK client running that connects to the local server and creates an ephemeral znode to indicate to clients on other nodes that it is online. I have a test script that reboots the zookeeper server as well as the client on A. The test does a getstat on the ephemeral znode created by the client on A. I am seeing that the view of znodes on A is different from the other 2 nodes. I can tell this from the session ID that the client gets after reconnecting to the local ZK server. So the test is simple: - kill zookeeper server and client process - wait for a few seconds - do zkCli.sh stat ... test.out What I am seeing is that the ephemeral znode with old zxid, time, and session ID is reappearing on node A. I have attached the output of 3 consecutive getstat requests of the test (see client_getstat.out). Notice that the third output is the same as the first one. That is, the old ephemeral znode reappeared at A. However, both B and C are showing the latest znode with correct time, zxid and session ID (output not attached). After this point, all following getstat requests on A are showing the old znode. 
Whereas, B and C show the correct znode every time the client on A comes online. This is something very perplexing. Earlier I thought this was a bug in my client implementation. But the test shows that the ZK server on A after reboot is out of sync with the rest of the servers. The stat command to each server shows that the servers are in sync as far as zxid's are concerned (see stat.out). So there is something wrong with A's local database that is causing this problem. Has anyone seen this before? I will be doing more debugging in the next few days. Comments/suggestions for further debugging are welcomed. -Vishal
Re: Zookeeper on 60+Gb mem
Tuning GC is going to be critical, otw all the sessions will timeout (and potentially expire) during GC pauses. Patrick On Tue, Oct 5, 2010 at 1:18 PM, Maarten Koopmans maar...@vrijheid.net wrote: Yes, and syncing after a crash will be interesting as well. Of note; I am running it with a 6GB heap now, but it's not filled yet. I do have smoke tests though, so maybe I'll give it a try. On 5 Oct 2010, at 21:13, Benjamin Reed br...@yahoo-inc.com wrote: you will need to time how long it takes to read all that state back in and adjust the initTime accordingly. it will probably take a while to pull all that data into memory. ben On 10/05/2010 11:36 AM, Avinash Lakshman wrote: I have run it over 5 GB of heap with over 10M znodes. We will definitely run it with over 64 GB of heap. Technically I do not see any limitation. However I will let the experts chime in. Avinash On Tue, Oct 5, 2010 at 11:14 AM, Mahadev Konar maha...@yahoo-inc.com wrote: Hi Maarten, I definitely know of a group which uses around 3GB of memory heap for zookeeper but never heard of someone with such huge requirements. I would say it definitely would be a learning experience with such high memory which I definitely think would be very very useful for others in the community as well. Thanks mahadev On 10/5/10 11:03 AM, Maarten Koopmans maar...@vrijheid.net wrote: Hi, I just wondered: has anybody ever ran zookeeper to the max on a 68GB quadruple extra large high memory EC2 instance? With, say, 60GB allocated or so? Because EC2 with EBS is a nice way to grow your zookeeper cluster (data on the EBS volumes, upgrade as your memory utilization grows) - I just wonder what the limits are there, or if I am going where angels fear to tread... --Maarten
Re: ZK compatability
Historically, major releases can have non-backward-compatible changes. However, if you look back through the release history you'll see that the last time that happened was October 2008, when we moved the project from SourceForge to Apache. Patrick On Tue, Sep 28, 2010 at 11:37 AM, Jun Rao jun...@gmail.com wrote: What about major releases going forward? Thanks, Jun On Mon, Sep 27, 2010 at 10:32 PM, Patrick Hunt ph...@apache.org wrote: In general yes, minor and bug fix releases are fully backward compatible. Patrick On Sun, Sep 26, 2010 at 9:11 PM, Jun Rao jun...@gmail.com wrote: Hi, Does ZK support (and plan to support in the future) backward compatibility (so that a new client can talk to an old server and vice versa)? Thanks Jun
Re: c client 0 state?
Seems like a bug to me. Please enter a JIRA (if you haven't already). Thanks, Patrick On Fri, Sep 17, 2010 at 9:10 AM, Michael Xu mx2...@gmail.com wrote: Hi everyone, in the C client API: is it normal for zoo_state() to return zero (not one of the valid state consts) when it is handling socket errors? In the C code, handle_error(), which handles socket errors, sets zh->state to zero: if (!is_unrecoverable(zh)) zh->state = 0; If the handle is recoverable, why is the state set to zero, which is not even a valid state const? Here's a use case where the state should be connecting, but instead is zero: 1) c client connects to a zkserver 2) shutdown zkserver 3) zoo_state() returns zero on a valid zookeeper handle. We are using zoo_state() to get the state of the connection, and this is a surprising return value from this function. Thanks, michael
Re: zkfuse
Sounds like you have an old version of autoconf; try upgrading. See a similar issue here: http://www.mail-archive.com/thrift-u...@incubator.apache.org/msg00673.html Patrick 2010/9/24 俊贤 junx...@taobao.com Hi mahadev, My OS is Linux localhost.localdomain 2.6.18-164.el5 #1 SMP Thu Sep 3 03:33:56 EDT 2009 i686 i686 i386 GNU/Linux. The error occurred when I ran the autoreconf -if command mentioned in the README file. Here is the error info: configure.ac:51: error: possibly undefined macro: AC_TYPE_INT64_T configure.ac:58: error: possibly undefined macro: AC_TYPE_UINT32_T configure.ac:59: error: possibly undefined macro: AC_TYPE_UINT64_T configure.ac:60: error: possibly undefined macro: AC_TYPE_UINT8_T Thank you! junxian From: Mahadev Konar [maha...@yahoo-inc.com] Sent: 25 September 2010 06:19 To: zookeeper-user@hadoop.apache.org Subject: Re: zkfuse Hi Jun, I haven't seen people using zkfuse recently. What kind of issues are you facing? Thanks mahadev
Re: processResults
I believe what the author is trying to say is that if the getData were to fail (such as in the example you give) the watch set as part of the original call will fire, and this will notify the client that the node was deleted (call to process(event)). Patrick On Mon, Sep 27, 2010 at 6:56 PM, Milind Parikh milindpar...@gmail.com wrote: In the explanation of the Java binding, it is mentioned: If the file (or znode) exists, it gets the data from the znode, and then invokes the exists() callback of Executor if the state has changed. Note, it doesn't have to do any exception processing for the getData call because it has watches pending for anything that could cause an error: if the node is deleted before it calls ZooKeeper.getData(), the watch event set by ZooKeeper.exists() triggers a callback. I read this to mean that if I insert a Thread.sleep() before the getData call and remove the node from the CLI, somehow (magically) there would be no error. But of course, that does not happen: Sleeps for 10 seconds org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /zk_test at org.apache.zookeeper.KeeperException.create(KeeperException.java:102) at org.apache.zookeeper.KeeperException.create(KeeperException.java:42) at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:921) at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:950) at DataMonitor.processResult(DataMonitor.java:114) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:512) Am I doing something wrong (or reading something wrong)? -- Milind
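Patrick's point, that a NoNode result from getData can be safely ignored because the pending exists() watch will deliver the deletion event, can be sketched with simplified stand-in types (hypothetical names, not the real org.apache.zookeeper classes):

```java
// Sketch of the callback pattern described above. Hypothetical simplified
// types stand in for the org.apache.zookeeper classes.
public class DataMonitorSketch {
    static final int OK = 0;
    static final int NONODE = -101; // matches KeeperException.Code.NONODE

    interface Listener {
        void onData(byte[] data);
        void onDeleted();
    }

    private final Listener listener;

    DataMonitorSketch(Listener listener) { this.listener = listener; }

    /** Async getData completion. NONODE is not treated as an error here:
     *  the watch set by the earlier exists() call will fire onDeleted(). */
    void processResult(int rc, byte[] data) {
        if (rc == OK) {
            listener.onData(data);
        } else if (rc == NONODE) {
            // Do nothing: the pending watch delivers the NodeDeleted event.
        } else {
            throw new IllegalStateException("unexpected rc " + rc);
        }
    }

    /** Watch callback: a NodeDeleted event reaches the listener here. */
    void processDeletedEvent() {
        listener.onDeleted();
    }
}
```

The key design choice is that deletion is reported exactly once, through the watch path, rather than once per in-flight read that happens to race with it.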
Re: possible bug in zookeeper ?
That is unusual. I don't recall anyone reporting a similar issue, and looking at the code I don't see any issues off hand. Can you try the following? 1) on that particular zk client machine, resolve the hosts zook1/zook2/zook3 -- what IP addresses do they resolve to? (try dig) 2) try running the client using the 3.3.1 jar file (just replace the jar on the client); it includes more log4j information, so turn on DEBUG or TRACE logging. Patrick On Tue, Sep 14, 2010 at 8:44 AM, Yatir Ben Shlomo yat...@outbrain.com wrote: zook1:2181,zook2:2181,zook3:2181 -Original Message- From: Ted Dunning [mailto:ted.dunn...@gmail.com] Sent: Tuesday, September 14, 2010 4:11 PM To: zookeeper-user@hadoop.apache.org Subject: Re: possible bug in zookeeper ? What was the list of servers that was given originally to open the connection to ZK? On Tue, Sep 14, 2010 at 6:15 AM, Yatir Ben Shlomo yat...@outbrain.com wrote: Hi, I am using solrCloud which uses an ensemble of 3 zookeeper instances. I am performing survivability tests: taking one of the zookeeper instances down, I would expect the client to use a different zookeeper server instance. But as you can see in the logs attached below, depending on which instance I choose to take down (in my case, the last one in the list of zookeeper servers) the client constantly insists on the same zookeeper server (Attempting connection to server zook3/192.168.252.78:2181) and does not switch to a different one. The problem seems to arise from ClientCnxn.java. Anyone have an idea on this?
SolrCloud is currently using zookeeper-3.2.2.jar. Is this a known bug that was fixed in later versions (3.3.1)? Thanks in advance, Yatir Logs: Sep 14, 2010 9:02:20 AM org.apache.log4j.Category warn WARNING: Ignoring exception during shutdown input java.nio.channels.ClosedChannelException at sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:638) at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360) at org.apache.zookeeper.ClientCnxn$SendThread.cleanup(zookeeper:ClientCnxn.java):999) at org.apache.zookeeper.ClientCnxn$SendThread.run(zookeeper:ClientCnxn.java):970) Sep 14, 2010 9:02:20 AM org.apache.log4j.Category warn WARNING: Ignoring exception during shutdown output java.nio.channels.ClosedChannelException at sun.nio.ch.SocketChannelImpl.shutdownOutput(SocketChannelImpl.java:649) at sun.nio.ch.SocketAdaptor.shutdownOutput(SocketAdaptor.java:368) at org.apache.zookeeper.ClientCnxn$SendThread.cleanup(zookeeper:ClientCnxn.java):1004) at org.apache.zookeeper.ClientCnxn$SendThread.run(zookeeper:ClientCnxn.java):970) Sep 14, 2010 9:02:22 AM org.apache.log4j.Category info INFO: Attempting connection to server zook3/192.168.252.78:2181 Sep 14, 2010 9:02:22 AM org.apache.log4j.Category warn WARNING: Exception closing session 0x32b105244a20001 to sun.nio.ch.selectionkeyi...@3ca58cbf java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.$$YJP$$checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.checkConnect(SocketChannelImpl.java) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574) at org.apache.zookeeper.ClientCnxn$SendThread.run(zookeeper:ClientCnxn.java):933) Sep 14, 2010 9:02:22 AM org.apache.log4j.Category warn WARNING: Ignoring exception during shutdown input java.nio.channels.ClosedChannelException at sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:638) at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360) at 
org.apache.zookeeper.ClientCnxn$SendThread.cleanup(zookeeper:ClientCnxn.java):999) at org.apache.zookeeper.ClientCnxn$SendThread.run(zookeeper:ClientCnxn.java):970) Sep 14, 2010 9:02:22 AM org.apache.log4j.Category warn WARNING: Ignoring exception during shutdown output java.nio.channels.ClosedChannelException at sun.nio.ch.SocketChannelImpl.shutdownOutput(SocketChannelImpl.java:649) at sun.nio.ch.SocketAdaptor.shutdownOutput(SocketAdaptor.java:368) at org.apache.zookeeper.ClientCnxn$SendThread.cleanup(zookeeper:ClientCnxn.java):1004) at org.apache.zookeeper.ClientCnxn$SendThread.run(zookeeper:ClientCnxn.java):970) Sep 14, 2010 9:02:22 AM org.apache.log4j.Category info INFO: Attempting connection to server zook3/192.168.252.78:2181 Sep 14, 2010 9:02:22 AM org.apache.log4j.Category warn WARNING: Exception closing session 0x32b105244a2 to sun.nio.ch.selectionkeyi...@3960f81b java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.$$YJP$$checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.checkConnect(SocketChannelImpl.java) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
Re: Spew after call to close
No worries, let us know if something else pops up. Patrick On Tue, Sep 7, 2010 at 3:10 PM, Stack st...@duboce.net wrote: Nevermind. I figured it out. It was an hbase issue. We were leaking a client reference. Sorry for the noise, St.Ack On Sat, Sep 4, 2010 at 10:58 AM, Stack st...@duboce.net wrote: That's right -- client is shut down first, then server... How do I stop the client trying to come back from the dead? Good on you Mahadev? St.Ack On Fri, Sep 3, 2010 at 8:36 PM, Mahadev Konar maha...@yahoo-inc.com wrote: Hi Stack, Looks like you are shutting down the server and shutting down the client at the same time? Is that the issue? Thanks mahadev On 9/3/10 4:47 PM, Stack st...@duboce.net wrote: Have you fellas seen this before? I call close on zookeeper but it insists on throwing the below exceptions. Why is it doing this 'Session 0x12ad9dccda30002 for server null, unexpected error, closing socket connection and attempting reconnect'? This would seem to come after the close has been noticed, and looking at the code, I'd think we'd not do this since the close flag should be set to true after the call to close? Thanks lads (The below looks ugly in our logs... 
this is zk 3.3.1), St.Ack 2010-09-03 16:09:52,369 INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /fe80:0:0:0:0:0:0:1%1:56941 which had sessionid 0x12ad9dccda30001 2010-09-03 16:09:52,369 INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /127.0.0.1:56942 which had sessionid 0x12ad9dccda30002 2010-09-03 16:09:52,370 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x12ad9dccda30001, likely server has closed socket, closing socket connection and attempting reconnect 2010-09-03 16:09:52,370 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x12ad9dccda30002, likely server has closed socket, closing socket connection and attempting reconnect 2010-09-03 16:09:52,370 INFO org.apache.zookeeper.server.NIOServerCnxn: NIOServerCnxn factory exited run method 2010-09-03 16:09:52,370 INFO org.apache.zookeeper.server.PrepRequestProcessor: PrepRequestProcessor exited loop! 2010-09-03 16:09:52,370 INFO org.apache.zookeeper.server.SyncRequestProcessor: SyncRequestProcessor exited! 
2010-09-03 16:09:52,370 INFO org.apache.zookeeper.server.FinalRequestProcessor: shutdown of request processor complete 2010-09-03 16:09:52,470 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: localhost:/hbase Received ZooKeeper Event, type=None, state=Disconnected, path=null 2010-09-03 16:09:52,470 INFO org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: localhost:/hbase Received Disconnected from ZooKeeper, ignoring 2010-09-03 16:09:52,471 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: localhost:/hbase Received ZooKeeper Event, type=None, state=Disconnected, path=null 2010-09-03 16:09:52,471 INFO org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: localhost:/hbase Received Disconnected from ZooKeeper, ignoring 2010-09-03 16:09:52,857 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server localhost/0:0:0:0:0:0:0:1:2181 2010-09-03 16:09:52,858 WARN org.apache.zookeeper.ClientCnxn: Session 0x12ad9dccda30001 for server null, unexpected error, closing socket connection and attempting reconnect java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1078) 2010-09-03 16:09:53,149 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server localhost/fe80:0:0:0:0:0:0:1%1:2181 2010-09-03 16:09:53,150 WARN org.apache.zookeeper.ClientCnxn: Session 0x12ad9dccda30002 for server null, unexpected error, closing socket connection and attempting reconnect java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1078) 2010-09-03 16:09:53,576 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181 2010-09-03 
16:09:53,576 WARN org.apache.zookeeper.ClientCnxn: Session 0x12ad9dccda30001 for server null, unexpected error, closing socket connection and attempting reconnect java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1078) 2010-09-03 16:09:54,000 INFO
Re: getting created child on NodeChildrenChanged event
It is good to keep things simple, but we have seen some requests related to the client api for children use cases that seem reasonable. In particular the issue of handling large numbers of children efficiently is currently a problem (a queue, say). We've seen proposals on this before; no one's followed through with them yet. I personally think there's room for improvement, perhaps the current client api is too simple: https://issues.apache.org/jira/browse/ZOOKEEPER-423 Patrick On Fri, Sep 3, 2010 at 11:18 PM, Mahadev Konar maha...@yahoo-inc.com wrote: Hi Todd, We have always tried to lean on the side of keeping things lightweight and the api simple. The only way you would be able to do this is with sequential creates. 1. create nodes like /queueelement-$i where i is a monotonically increasing number. You could use the sequential flag of zookeeper to do this. 2. when deleting a node, you would remove the node and create a deleted node at /deletedqueueelements/queueelement-$i 2.1 on notification you would go to /deletedqueueelements/ and find out which ones were deleted. The above only works if you are ok with monotonically unique queue elements. 3. the above method allows folks to see the deltas using deletedqueueelements, which can be garbage collected by some clean-up process (you can be smarter about this as well) Would something like this work? Thanks mahadev On 8/31/10 3:55 PM, Todd Nine t...@spidertracks.co.nz wrote: Hi Dave, Thanks for the response. I understand your point about missed events during a watch reset period. I may be off; here is the functionality I was thinking of. I'm not sure if the ZK internal versioning process could possibly support something like this. 1. A watch is placed on children 2. The event is fired to the client. The client receives the Stat object as part of the event for the current state of the node when the event was created. We'll call this Stat A with version 1 3. The client performs processing. 
Meanwhile the node has had several children change; versions are incremented to version 2 and version 3 4. Client resets the watch 5. A node is added 6. The event is fired to the client. Client receives Stat B with version 4 7. Client calls deltaChildren(Stat A, Stat B) 8. zookeeper returns the nodes added between the stats, and also the nodes deleted between them. This would handle the missed-event problem since the client would have the 2 states it needs to compare. It also allows clients dealing with large data sets to only deal with the delta over time (like a git replay). Our number of queues could get quite large, and I'm concerned that keeping my previous event's children in a set to perform the delta may become quite memory- and processor-intensive. Would a feature like this be possible without over-complicating the Zookeeper core? Thanks, Todd On Tue, 2010-08-31 at 09:23 -0400, Dave Wright wrote: Hi Todd - The general explanation for why Zookeeper doesn't pass the event information with the event notification is that an event notification is only triggered once, and thus may indicate multiple events. For example, if you do a GetChildren and set a watch, then multiple children are added at about the same time, the first one triggers a notification, but the second (or later) ones do not. When you do another GetChildren() request to get the list and reset the watch, you'll see all the changed nodes; however, if you had just been told about the first change in the notification you would have missed the others. To do what you are wanting, you would really need persistent watches that send notifications every time a change occurs and don't need to be reset, so you can't miss events. That isn't the design that was chosen for Zookeeper and I don't think it's likely to be implemented. -Dave Wright On Tue, Aug 31, 2010 at 3:49 AM, Todd Nine t...@spidertracks.co.nz wrote: Hi all, I'm writing a distributed queue monitoring class for our leader node in the cluster. 
We're queueing messages per input hardware device; each queue is then assigned to the node with the least load in our cluster. To do this, I maintain 2 persistent znodes with the following format: data queue: /dataqueue/devices/unit id/data packet; processing follower: /dataqueue/nodes/node name/unit id. The queue monitor watches for changes on the path /dataqueue/devices. When the first packet from a unit is received, the queue writer creates the queue with the unit id. This triggers the watch event on the monitoring class, which in turn creates the znode for the path with the least-loaded node. This path is watched for child node creation, and the node creates a queue consumer to consume messages from the new queue. Our list of queues can become quite large, and I would prefer not to maintain a list
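Step 2.1 of Mahadev's scheme above, diffing the children of the deleted-elements node between scans, can be sketched with plain sets standing in for getChildren() results (the znode names are hypothetical, following the /deletedqueueelements/queueelement-$i convention from the thread):

```java
// Sketch of the delta computation in Mahadev's scheme: compare the children
// of the deleted-elements znode against the previous scan to find which
// queue elements were deleted since then.
import java.util.Set;
import java.util.TreeSet;

public class DeletedDelta {
    /** Returns names that appeared under the deleted-elements node since
     *  the previous scan, i.e. the queue elements deleted in between. */
    static Set<String> newlyDeleted(Set<String> previousScan, Set<String> currentScan) {
        Set<String> delta = new TreeSet<>(currentScan);
        delta.removeAll(previousScan);
        return delta;
    }
}
```

Because the names are monotonically increasing sequence numbers, a clean-up process can safely garbage-collect any marker older than the last scan every consumer has acknowledged.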
Re: election recipe
Hi Andrei, the answer may not be as simple as that. In the case of a passive leader you might want to just wait until you're reconnected before taking any action. Connection loss indicates that you aren't currently connected to a server; it doesn't mean that you've lost leadership (if you get expired, that would mean you lost leadership). However, for an active leader you might want to stop acting as leader immediately upon connection loss (given you don't know if you're the leader any longer). The active vs passive leader distinction indicates whether the leader is the one taking the action (active), or the followers are the ones taking the action (passive). For example, in the active case the leader may be sending out commands to the followers; in the passive case the leader might be getting requests from the followers. In the first case you want to stop as soon as you are not sure you're the leader; in the passive case the followers will stop talking to you on their own if a leadership change does take place. Patrick On Sat, Sep 4, 2010 at 11:16 AM, Andrei Savu savu.and...@gmail.com wrote: You should also be careful how you handle connection loss events. The leader should suspend itself and re-run the election process when the connection is reestablished. On Sat, Sep 4, 2010 at 8:37 AM, Mahadev Konar maha...@yahoo-inc.com wrote: Hi Eric, As Ted and you yourself mentioned, it's mostly to avoid the herd effect. A herd effect would usually mean 1000s of clients notified of some change, all trying to create the same node on notification. With just 10s of clients you don't need to worry about this herd effect at all. Thanks mahadev On 9/2/10 3:40 PM, Ted Dunning ted.dunn...@gmail.com wrote: You are correct that this simpler recipe will work for smaller populations and correct that the complications are to avoid the herd effect. On Thu, Sep 2, 2010 at 12:55 PM, Eric van Orsouw eric.van.ors...@gmail.com wrote: Hi there, I would like to use zookeeper to implement an election scheme. 
There is a recipe on the homepage, but it is relatively complex. I was wondering what is wrong with the following pseudocode:

forever {
    zookeeper.create -e /election my_ip_address
    if creation succeeded then {
        // do the leader thing
    } else {
        // wait for change in /election using watcher mechanism
    }
}

My assumption is that the recipe is more elaborate to eliminate the flood of requests if the leader falls away. But if there are only a handful of leader candidates, then that should not be a problem. Is this correct, or am I missing something? Thanks, Eric -- Andrei Savu -- http://www.andreisavu.ro/
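The "more elaborate" recipe Eric refers to avoids the herd by having each candidate create a sequential ephemeral znode and watch only its immediate predecessor, so a leader failure wakes exactly one candidate instead of all of them. The selection logic can be sketched over plain znode names (hypothetical names; assumes the zero-padded sequence suffixes that ZooKeeper's sequential flag produces, which sort lexically):

```java
// Sketch of the herd-avoiding leader-election selection step: the lowest
// sequence number is the leader; everyone else watches only the candidate
// immediately before them.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ElectionSketch {
    /** Returns null if myNode holds the lowest sequence (it is the leader);
     *  otherwise returns the single predecessor node myNode should watch. */
    static String nodeToWatch(String myNode, List<String> allCandidates) {
        List<String> sorted = new ArrayList<>(allCandidates);
        Collections.sort(sorted); // zero-padded suffixes sort numerically
        int idx = sorted.indexOf(myNode);
        return idx <= 0 ? null : sorted.get(idx - 1);
    }
}
```

With Eric's simple recipe, every waiting candidate watches the one /election node, so a leader failure triggers N simultaneous create attempts; with this chain, it triggers one.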
Re: closing session on socket close vs waiting for timeout
That's a good point; however, with suitable documentation, warnings and such it seems like a reasonable feature to provide for those users who require it. Used in moderation it seems fine to me. Perhaps we also make it configurable at the server level for those administrators/ops who don't want to deal with it (disable the feature entirely, or only enable on particular servers, etc...). Patrick On Mon, Sep 6, 2010 at 2:10 PM, Benjamin Reed br...@yahoo-inc.com wrote: if this mechanism were used very often, we would get a huge number of session expirations when a server fails. you are trading the ability to tolerate temporary network and server outages for fast error detection. to be honest, this seems like something that sounds like it will work in theory, but once deployed we will start getting session expirations in cases that we really do not want or expect. ben On 09/01/2010 12:47 PM, Patrick Hunt wrote: Ben, in this case the session would be tied directly to the connection, we'd explicitly deny session re-establishment for this session type (so 4 would fail). Would that address your concern, others? Patrick On 09/01/2010 10:03 AM, Benjamin Reed wrote: i'm a bit skeptical that this is going to work out properly. a server may receive a socket reset even though the client is still alive: 1) client sends a request to a server 2) client is partitioned from the server 3) server starts trying to send response 4) client reconnects to a different server 5) partition heals 6) server gets a reset from client at step 6 i don't think you want to delete the ephemeral nodes. ben On 08/31/2010 01:41 PM, Fournier, Camille F. [Tech] wrote: Yes that's right. Which network issues can cause the socket to close without the initiating process closing the socket? In my limited experience in this area network issues were more prone to leave dead sockets open rather than vice versa so I don't know what to look out for. 
Thanks, Camille -Original Message- From: Dave Wright [mailto:wrig...@gmail.com] Sent: Tuesday, August 31, 2010 1:14 PM To: zookeeper-user@hadoop.apache.org Subject: Re: closing session on socket close vs waiting for timeout I think he's saying that if the socket closes because of a crash (i.e. not a normal zookeeper close request) then the session stays alive until the session timeout, which is of course true since ZK allows reconnection and resumption of the session in case of disconnect due to network issues. -Dave Wright On Tue, Aug 31, 2010 at 1:03 PM, Ted Dunning ted.dunn...@gmail.com wrote: That doesn't sound right to me. Is there a Zookeeper expert in the house? On Tue, Aug 31, 2010 at 8:58 AM, Fournier, Camille F. [Tech] camille.fourn...@gs.com wrote: I foolishly did not investigate the ZK code closely enough and it seems that closing the socket still waits for the session timeout to remove the session.
Re: Logs and in memory operations
On Mon, Aug 30, 2010 at 1:11 PM, Avinash Lakshman avinash.laksh...@gmail.com wrote: From my understanding, when a znode is updated/created a write happens into the local transaction logs and then some in-memory data structure is updated to serve future reads. Where in the source code can I find this? Also, how can I decide when it is ok for me to delete the logs off disk? The bits where the in-memory db is updated are here: org.apache.zookeeper.server.FinalRequestProcessor.processRequest(Request) Regarding datadir cleanup, see this section of the docs: http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#Ongoing+Data+Directory+Cleanup Basically, there's a tool for that, but you should back up the current state of the database before doing the cleanup. Patrick
Re: Zookeeper shell
Depending on your classpath setup: java org.apache.zookeeper.ZooKeeperMain -server 127.0.0.1:2181 if jline jar is in your classpath (included in the zk release distribution) you'll get history, auto-complete and such. Patrick On 08/31/2010 03:08 PM, Michi Mutsuzaki wrote: Hello, I'm looking for a good zookeeper shell. So far I've only used cli_mt (c client), but it's not very user friendly. Are there any alternatives? In particular, I'm looking for: - command history with reverse search - auto-complete znode path Thanks! --Michi
Re: IllegalArgumentException: Path cannot be null
The client (solr in this case) is passing a null path to the ZooKeeper.getChildren(path, ... ) call. java.lang.IllegalArgumentException: Path cannot be null at org.apache.zookeeper.common.PathUtils.validatePath(PathUtils.java:45) at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1196) at org.apache.solr.common.cloud.SolrZkClient.getChildren(SolrZkClient.java:200) I'm afraid you'll have to work with the solr team to determine the cause of this. Patrick On Thu, Aug 26, 2010 at 12:15 AM, Yatir Ben Shlomo yat...@outbrain.com wrote: I am running a zookeeper ensemble of 3 zookeeper instances and established a solrCloud to work with it (2 masters, 2 slaves). On one of the masters I keep noticing ZooKeeper related exceptions which I can't understand. And the other is java.lang.IllegalArgumentException: Path cannot be null (PathUtils.java:45) Here are my logs (I set the log level to FINE on the zookeeper package). Can anyone identify the issue? (I could not yet get any help from the solrCloud community) FINE: Reading reply sessionid:0x12a97312613010b, packet:: clientPath:null serverPath:null finished:false header:: -8,101 replyHeader:: -8,-1,0 request:: 30064776552,v{'/collections},v{},v{'/collections/ENPwl/shards/ENPWL1,'/collections/ENPwl/shards/ENPWL4,'/collections/ENPwl/shards/ENPWL2,'/collections,'/collections/ENPwl/shards/ENPWL3,'/collections/ENPwlMaster/shards/ENPWLMaster_3,'/collections/ENPwlMaster/shards/ENPWLMaster_4,'/live_nodes,'/collections/ENPwlMaster/shards/ENPWLMaster_1,'/collections/ENPwlMaster/shards/ENPWLMaster_2} response:: null Aug 25, 2010 5:18:19 AM org.apache.log4j.Category debug FINE: Reading reply sessionid:0x12a97312613010b, packet:: clientPath:null serverPath:null finished:false header:: 540,8 replyHeader:: 540,-1,0 request:: '/collections,F response:: v{'ENPwl,'ENPwlMaster} Aug 25, 2010 5:18:19 AM org.apache.solr.common.cloud.ZkStateReader updateCloudState INFO: Cloud state update for ZooKeeper already scheduled Aug 25, 2010 
5:18:19 AM org.apache.log4j.Category error SEVERE: Error while calling watcher java.lang.IllegalArgumentException: Path cannot be null at org.apache.zookeeper.common.PathUtils.validatePath(PathUtils.java:45) at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1196) at org.apache.solr.common.cloud.SolrZkClient.getChildren(SolrZkClient.java:200) at org.apache.solr.common.cloud.ZkStateReader$5.process(ZkStateReader.java:315) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:425) Aug 25, 2010 5:18:19 AM org.apache.solr.common.cloud.ZkStateReader$4 process INFO: Detected a shard change under ShardId:ENPWL3 in collection:ENPwl Aug 25, 2010 5:18:19 AM org.apache.solr.common.cloud.ZkStateReader updateCloudState INFO: Cloud state update for ZooKeeper already scheduled Aug 25, 2010 5:18:19 AM org.apache.solr.common.cloud.ZkStateReader$4 process INFO: Detected a shard change under ShardId:ENPWL4 in collection:ENPwl Aug 25, 2010 5:18:19 AM org.apache.solr.common.cloud.ZkStateReader updateCloudState INFO: Cloud state update for ZooKeeper already scheduled Aug 25, 2010 5:18:19 AM org.apache.solr.common.cloud.ZkStateReader$4 process INFO: Detected a shard change under ShardId:ENPWL1 in collection:ENPwl Aug 25, 2010 5:18:19 AM org.apache.solr.common.cloud.ZkStateReader updateCloudState INFO: Cloud state update for ZooKeeper already scheduled Aug 25, 2010 5:18:19 AM org.apache.solr.cloud.ZkController$2 process INFO: Updating live nodes:org.apache.solr.common.cloud.solrzkcli...@55308275 Aug 25, 2010 5:18:19 AM org.apache.solr.common.cloud.ZkStateReader updateCloudState INFO: Updating live nodes from ZooKeeper... 
Aug 25, 2010 5:18:19 AM org.apache.log4j.Category debug FINE: Reading reply sessionid:0x12a97312613010b, packet:: clientPath:null serverPath:null finished:false header:: 541,8 replyHeader:: 541,-1,0 request:: '/live_nodes,F response:: v{'ob1078.nydc1.outbrain.com:8983_solr2,'ob1078.nydc1.outbrain.com:8983_solr1,'ob1061.nydc1.outbrain.com:8983_solr2,'ob1062.nydc1.outbrain.com:8983_solr1,'ob1062.nydc1.outbrain.com:8983_solr2,'ob1061.nydc1.outbrain.com:8983_solr1,'ob1077.nydc1.outbrain.com:8983_solr2,'ob1077.nydc1.outbrain.com:8983_solr1} Aug 25, 2010 5:18:19 AM org.apache.log4j.Category error SEVERE: Error while calling watcher java.lang.IllegalArgumentException: Path cannot be null at org.apache.zookeeper.common.PathUtils.validatePath(PathUtils.java:45) at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1196) at org.apache.solr.common.cloud.SolrZkClient.getChildren(SolrZkClient.java:200) at org.apache.solr.cloud.ZkController$2.process(ZkController.java:321) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:425) Aug 25, 2010 5:18:19 AM
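A client-side guard can catch the null path before it ever reaches ZooKeeper. The sketch below (Python for brevity) mirrors the first few checks that PathUtils.validatePath performs; the helper name and the rule set shown are illustrative, not the exhaustive server-side validation.

```python
def validate_path(path):
    """Guard mirroring the initial checks in ZooKeeper's
    PathUtils.validatePath (a sketch, not the full rule set)."""
    if path is None:
        raise ValueError("Path cannot be null")
    if not path.startswith("/"):
        raise ValueError("Path must start with / character")
    if len(path) > 1 and path.endswith("/"):
        raise ValueError("Path must not end with / character")
    return path
```

Calling such a guard at the point where the watch path is computed turns the opaque "Error while calling watcher" into a failure at the actual source of the null.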
Re: Receiving create events for self with synchronous create
On line 64 are you ensuring that the ZooKeeper session is active before executing that sequence? zookeeper = new ZooKeeper(...) is async - it returns before you're actually connected to the server (you get notified of this in your watcher). If you execute this sequence quickly enough your zk.create operation is queued until the zookeeper session is actually established. Patrick On Thu, Aug 26, 2010 at 8:09 PM, Todd Nine t...@spidertracks.co.nz wrote: Sure thing. The FollowerWatcher class is instantiated by the IClusterManager implementation.It then performs the following FollowerWatcher.init() which is intended to do the following. 1. Create our follower node so that other nodes know we exist at path /com/spidertracks/aviator/cluster/follower/10.0.1.1 where the last node is an ephemeral node with the internal IP address of the node. These are lines 67 through 72. 2. Signal to the clusterManager that the cluster has changed (line 79). Ultimately the clusterManager will perform a barrier for partitioning data ( a separate watcher) 3. Register a watcher to receive all future events on the follower path /com/spidertracks/aviator/cluster/follower/ line 81. Then we have the following characteristics in the watcher 1. If a node has been added or deleted from the children of /com/spidertracks/aviator/cluster/follower then continue. Otherwise, ignore the event. Lines 33 through 44 2. If this was an event we should process our cluster has changed, signal to the CusterManager that a node has either been added or removed. line 51. I'm trying to encapsulate the detection of additions and deletions of child nodes within this Watcher. All other events that occur due to a node being added or deleted should be handled externally by the clustermanager. Thanks, Todd On Thu, 2010-08-26 at 19:26 -0700, Mahadev Konar wrote: Hi Todd, The code that you point to, I am not able to make out the sequence of steps. Can you be more clear on what you are trying to do in terms of zookeeper api? 
Thanks mahadev On 8/26/10 5:58 PM, Todd Nine t...@spidertracks.co.nz wrote: Hi all, I'm running into a strange issue I could use a hand with. I've implemented leader election, and this is working well. I'm now implementing a follower queue with ephemeral nodes. I have an interface IClusterManager which simply has the api clusterChanged. I don't care if nodes are added or deleted, I always want to fire this event. I have the following basic algorithm. init Create a path with /follower/+mynode name fire the clusterChangedEvent Watch set the event watcher on the path /follower. watch: reset the watch on /follower if event is not a NodeDeleted or NodeCreated, ignore fire the clustermanager event this seems pretty straightforward. Here is what I'm expecting 1. Create my node path 2. fire the clusterChanged event 3. Set watch on /follower 4. Receive watch events for changes from any other nodes. What's actually happening 1. Create my node path 2. fire the clusterChanged event 3. Set Watch on /follower 4. Receive watch event for node created in step 1 5. Receive future watch events for changes from any other nodes. Here is my code. Since I set the watch after I create the node, I'm not expecting to receive the event for it. Am I doing something incorrectly in creating my watch? Here is my code. http://pastebin.com/zDXgLagd Thanks, Todd
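The pattern Patrick describes -- block until the connection watcher reports SyncConnected, then create the ephemeral node, and only afterwards register the children watch -- can be sketched like this. Python with a threading.Event stands in for the Java client; the watcher signature and the simulated event thread are illustrative, not the real ZooKeeper API.

```python
import threading
import time

# new ZooKeeper(...) returns before the session exists, so gate all
# create/getChildren calls on an event set by the connection watcher.
connected = threading.Event()

def connection_watcher(state):
    # illustrative; the real watcher receives a WatchedEvent object
    if state == "SyncConnected":
        connected.set()

def fake_server():
    # stands in for the server delivering the SyncConnected notification
    time.sleep(0.05)
    connection_watcher("SyncConnected")

threading.Thread(target=fake_server).start()

# Block (with a bound) until the session is established; only then is it
# safe to create the ephemeral node and then set the children watch.
assert connected.wait(timeout=5.0)
print("session established; safe to call zk.create(...)")
```

Creating the node strictly before setting the watch (rather than racing the two) also avoids seeing your own create event, which is the behavior Todd observed.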
Re: Exception causing close of session
No, by reset I meant purging the ZK database (rm -fr /zkdatadir). I've seen a number of cases like this now, where a user plays with hbase for a while and wants to reset back to a state with no data in hbase. They shutdown some of the hbase/zk processes but not all of them (and as a result old zk sessions are hanging around). Really we should invalidate the session: https://issues.apache.org/jira/browse/ZOOKEEPER-583 Patrick On Fri, Aug 27, 2010 at 12:00 PM, Ted Dunning ted.dunn...@gmail.com wrote: Patrick, Can you clarify what reset means? It doesn't mean just restart, does it? On Thu, Aug 26, 2010 at 5:05 PM, Patrick Hunt ph...@apache.org wrote: Client has seen zxid 0xfa4 our last zxid is 0x42 Someone reset the zk server database without restarting the clients. As a result the client is forward in time relative to the cluster. Patrick On 08/26/2010 04:03 PM, Ted Yu wrote: Hi, zookeeper-3.2.2 is used out of HBase 0.20.5 Linux sjc1-.com 2.6.18-92.el5 #1 SMP Tue Jun 10 18:51:06 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux In hbase-hadoop-zookeeper-sjc1-cml-grid00.log, I see a lot of the following: 2010-08-26 22:58:01,930 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x0 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.201.9.40:2181 remote=/10.201.9.22:63316] 2010-08-26 22:58:02,097 INFO org.apache.zookeeper.server.NIOServerCnxn: Connected to /10.201.9.22:63317 lastZxid 4004 2010-08-26 22:58:02,097 WARN org.apache.zookeeper.server.NIOServerCnxn: Client has seen zxid 0xfa4 our last zxid is 0x42 2010-08-26 22:58:02,097 WARN org.apache.zookeeper.server.NIOServerCnxn: Exception causing close of session 0x0 due to java.io.IOException: Client has seen zxid 0xfa4 our last zxid is 0x42 If you can shed some thought on root cause, that would be great.
Re: Zookeeper stops
+1 on that Ted. I frequently see this issue crop up as I just rebooted my server and lost all my data ... -- many os's will cleanup tmp on reboot. :-) Patrick On 08/19/2010 07:43 AM, Ted Dunning wrote: Also, /tmp is not a great place to keep things that are intended for persistence. On Thu, Aug 19, 2010 at 7:34 AM, Mahadev Konarmaha...@yahoo-inc.comwrote: Hi Wim, It mostly looks like that zookeeper is not able to create files on the /tmp filesystem. Is there is a space shortage or is it possible the file is being deleted as its being written to? Sometimes admins have a crontab on /tmp that cleans up the /tmp filesystem. Thanks mahadev On 8/19/10 1:15 AM, Wim Jongmanwim.jong...@gmail.com wrote: Hi, I have a zookeeper server running that can sometimes run for days and then quits: Is there somebody with a clue to the problem? I am running 64 bit Ubuntu with java version 1.6.0_18 OpenJDK Runtime Environment (IcedTea6 1.8) (6b18-1.8-0ubuntu1) OpenJDK 64-Bit Server VM (build 14.0-b16, mixed mode) Zookeeper 3.3.0 The log below has some context before it shows the fatal error. Our component.id=40676 indicates that it is the 40676th time that I ask ZK to publish this information. It has been seen to go up to half a million before stopping. Regards, Wim ZooDiscovery Service Unpublished: Aug 18, 2010 11:17:28 PM. 
ServiceInfo[uri=osgiservices:// 188.40.116.87:3282/svc_19q0FmlQF0wEwjSl6SpUTJRlV5g=;id=ServiceID[type=ServiceTypeID[typeName=_osgiservices._tcp.default._iana];location=osgiservices://188.40.116.87:3282/svc_19q0FmlQF0wEwjSl6SpUTJRlV5g=;full=_osgiservices._tcp.default._i...@osgiservices://188.40.116.87:3282/svc_19q0FmlQF0wEwjSl6SpUTJRlV5g=];priority=0;weight=0;props=ServiceProperties[{ecf.rsvc.ns=ecf.namespace.generic.remoteservice , osgi.remote.service.interfaces=org.eclipse.ecf.services.quotes.QuoteService, ecf.sp.cns=org.eclipse.ecf.core.identity.StringID, ecf.rsvc.id =org.eclipse.ecf.discovery.serviceproperties$bytearraywrap...@68a1e081, component.name=Star Wars Quotes Service, ecf.sp.ect=ecf.generic.server, component.id=40676, ecf.sp.cid=org.eclipse.ecf.discovery.serviceproperties$bytearraywrap...@5b9a6ad1 }]] ZooDiscovery Service Published: Aug 18, 2010 11:17:29 PM. ServiceInfo[uri=osgiservices:// 188.40.116.87:3282/svc_u2GpWmF3YKSlTauWcwOMsDgiBxs=;id=ServiceID[type=ServiceTypeID[typeName=_osgiservices._tcp.default._iana];location=osgiservices://188.40.116.87:3282/svc_u2GpWmF3YKSlTauWcwOMsDgiBxs=;full=_osgiservices._tcp.default._i...@osgiservices://188.40.116.87:3282/svc_u2GpWmF3YKSlTauWcwOMsDgiBxs=];priority=0;weight=0;props=ServiceProperties[{ecf.rsvc.ns=ecf.namespace.generic.remoteservice , osgi.remote.service.interfaces=org.eclipse.ecf.services.quotes.QuoteService, ecf.sp.cns=org.eclipse.ecf.core.identity.StringID, ecf.rsvc.id =org.eclipse.ecf.discovery.serviceproperties$bytearraywrap...@71bfa0a4, component.name=Eclipse Twitter, ecf.sp.ect=ecf.generic.server, component.id=40677, ecf.sp.cid=org.eclipse.ecf.discovery.serviceproperties$bytearraywrap...@5bcba953 }]] [log;+0200 2010.08.18 23:17:29:545;INFO;org.eclipse.ecf.remoteservice;org.eclipse.core.runtime.Status[plugin=org.eclipse.ecf.remoteservice;code=0;message=No async remote service interface found with name=org.eclipse.ecf.services.quotes.QuoteServiceAsync for proxy service 
class=org.eclipse.ecf.services.quotes.QuoteService;severity2;exception=null;children=[]]] 2010-08-18 23:17:37,057 - FATAL [Snapshot Thread:zookeeperser...@262] - Severe unrecoverable error, exiting java.io.FileNotFoundException: /tmp/zookeeperData/version-2/snapshot.13e2e (No such file or directory) at java.io.FileOutputStream.open(Native Method) at java.io.FileOutputStream.init(FileOutputStream.java:209) at java.io.FileOutputStream.init(FileOutputStream.java:160) at org.apache.zookeeper.server.persistence.FileSnap.serialize(FileSnap.java:224) at org.apache.zookeeper.server.persistence.FileTxnSnapLog.save(FileTxnSnapLog.java:211) at org.apache.zookeeper.server.ZooKeeperServer.takeSnapshot(ZooKeeperServer.java:260) at org.apache.zookeeper.server.SyncRequestProcessor$1.run(SyncRequestProcessor.java:120) ZooDiscovery Service Unpublished: Aug 18, 2010 11:17:37 PM. ServiceInfo[uri=osgiservices:// 188.40.116.87:3282/svc_u2GpWmF3YKSlTauWcwOMsDgiBxs=;id=ServiceID[type=ServiceTypeID[typeName=_osgiservices._tcp.default._iana];location=osgiservices://188.40.116.87:3282/svc_u2GpWmF3YKSlTauWcwOMsDgiBxs=;full=_osgiservices._tcp.default._i...@osgiservices://188.40.116.87:3282/svc_u2GpWmF3YKSlTauWcwOMsDgiBxs=];priority=0;weight=0;props=ServiceProperties[{ecf.rsvc.ns=ecf.namespace.generic.remoteservice , osgi.remote.service.interfaces=org.eclipse.ecf.services.quotes.QuoteService, ecf.sp.cns=org.eclipse.ecf.core.identity.StringID, ecf.rsvc.id =org.eclipse.ecf.discovery.serviceproperties$bytearraywrap...@71bfa0a4, component.name=Eclipse Twitter, ecf.sp.ect=ecf.generic.server, component.id=40677,
Re: Zookeeper stops
No. You configure it in the server configuration file. Patrick On 08/19/2010 01:19 PM, Wim Jongman wrote: Hi, But zk does default to /tmp? Regards, Wim On Thursday, August 19, 2010, Patrick Huntph...@apache.org wrote: +1 on that Ted. I frequently see this issue crop up as I just rebooted my server and lost all my data ... -- many os's will cleanup tmp on reboot. :-) Patrick On 08/19/2010 07:43 AM, Ted Dunning wrote: Also, /tmp is not a great place to keep things that are intended for persistence. On Thu, Aug 19, 2010 at 7:34 AM, Mahadev Konarmaha...@yahoo-inc.comwrote: Hi Wim, It mostly looks like that zookeeper is not able to create files on the /tmp filesystem. Is there is a space shortage or is it possible the file is being deleted as its being written to? Sometimes admins have a crontab on /tmp that cleans up the /tmp filesystem. Thanks mahadev On 8/19/10 1:15 AM, Wim Jongmanwim.jong...@gmail.comwrote: Hi, I have a zookeeper server running that can sometimes run for days and then quits: Is there somebody with a clue to the problem? I am running 64 bit Ubuntu with java version 1.6.0_18 OpenJDK Runtime Environment (IcedTea6 1.8) (6b18-1.8-0ubuntu1) OpenJDK 64-Bit Server VM (build 14.0-b16, mixed mode) Zookeeper 3.3.0 The log below has some context before it shows the fatal error. Our component.id=40676 indicates that it is the 40676th time that I ask ZK to publish this information. It has been seen to go up to half a million before stopping. Regards, Wim ZooDiscoveryService Unpublished: Aug 18, 2010 11:17:28 PM. 
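The fix is a one-line change in the server configuration file. A minimal zoo.cfg sketch (paths and values illustrative) with dataDir pointed somewhere persistent instead of the /tmp location shipped in the sample config:

```
# zoo.cfg -- keep ZooKeeper's db out of /tmp, which many OSs wipe
# on reboot or via a periodic cleanup cron
tickTime=2000
clientPort=2181
dataDir=/var/lib/zookeeper
```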
Re: ZK monitoring
Maybe we should have a contrib pkg for utilities such as this? I could see a python script that, given 1 server (might require addl 4letter words but this would be useful regardless), could collect such information from the cluster. Create a JIRA? Patrick On 08/17/2010 12:14 PM, Andrei Savu wrote: It's not possible. You need to query all the servers in order to know who is the current leader. It should be pretty simple to implement this by parsing the output from the 'stat' 4-letter command. On Tue, Aug 17, 2010 at 9:50 PM, Jun Raojun...@gmail.com wrote: Hi, Is there a way to see the current leader and a list of followers from a single node in the ZK quorum? It seems that ZK monitoring (JMX, 4-letter commands) only provides info local to a node. Thanks, Jun -- Andrei Savu
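Pending such a contrib utility, the approach Andrei describes is a few lines of Python: send the stat four-letter command to each server and report the one whose Mode: line says leader. The host list and function names below are illustrative.

```python
import socket

def four_letter(host, port, cmd=b"stat", timeout=5.0):
    """Send a four-letter command and return the server's reply."""
    with socket.create_connection((host, port), timeout=timeout) as s:
        s.sendall(cmd)
        chunks = []
        while True:
            data = s.recv(4096)
            if not data:          # server closes the connection when done
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", "replace")

def mode_of(stat_output):
    """Pull leader/follower/standalone out of 'stat' output."""
    for line in stat_output.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None

def find_leader(servers):
    """Return the first (host, port) reporting Mode: leader, else None."""
    for host, port in servers:
        try:
            if mode_of(four_letter(host, port)) == "leader":
                return (host, port)
        except OSError:
            continue              # unreachable server; keep probing
    return None
```

Usage would be find_leader([("zk1", 2181), ("zk2", 2181), ("zk3", 2181)]); as Andrei notes, every server must be queried because each node only reports its own role.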
Re: A question about Watcher
All servers keep a copy - so you can shutdown the zk service entirely (all servers) and restart it and the sessions are maintained. Patrick On 08/16/2010 06:34 PM, Qian Ye wrote: Thx Mahadev and Benjamin, it seems that I've got some misunderstanding about the client. I will check it out. Another relevant question. I noticed that the master zookeeper server keep a track of all the client session which connects to every zookeeper server in the same cluster. So when a slave zookeeper server failed, the clients it served, can switch to another zookeeper server and keep their old session (the new zookeeper server can get the session information from the master). My question is, if the master failed, does that means some session information will definitely be lost? thx~ On Tue, Aug 17, 2010 at 12:40 AM, Benjamin Reedbr...@yahoo-inc.com wrote: the client does keep track of the watches that it has outstanding. when it reconnects to a new server it tells the server what it is watching for and the last view of the system that it had. ben On 08/16/2010 09:28 AM, Qian Ye wrote: thx for explaination. Since the watcher can be preserved when the client switch the zookeeper server it connects to, does that means all the watchers information will be saved on all the zookeeper servers? I didn't find any source of the client can hold the watchers information. On Tue, Aug 17, 2010 at 12:21 AM, Ted Dunningted.dunn...@gmail.com wrote: I should correct this. The watchers will deliver a session expiration event, but since the connection is closed at that point no further events will be delivered and the cluster will remove them. This is as good as the watchers disappearing. On Mon, Aug 16, 2010 at 9:20 AM, Ted Dunningted.dunn...@gmail.com wrote: The other is session expiration. Watchers do not survive this. This happens when a client does not provide timely evidence that it is alive and is marked as having disappeared by the cluster.
Re: How to handle Node does not exist error?
Try using the logs, stat command or JMX to verify that each ZK server is indeed a leader/follower as expected. You should have one leader and n-1 followers. Verify that you don't have any standalone servers (this is the most frequent error I see - misconfiguration of a server such that it thinks it's a standalone server; I often see where a user has 3 standalone servers which they think is a single quorum, all of the servers will therefore be inconsistent to each other). Patrick On 08/12/2010 05:42 PM, Ted Dunning wrote: On Thu, Aug 12, 2010 at 4:57 PM, Dr Hao Heh...@softtouchit.com wrote: hi, Ted, I am a little bit confused here. So, is the node inconsistency problem that Vishal and I have seen here most likely caused by configurations or embedding? If it is the former, I'd appreciate if you can point out where those silly mistakes have been made and the correct way to embed ZK. I think it is likely due to misconfiguration, but I don't know what the issue is exactly. I think that another poster suggested that you ape the normal ZK startup process more closely. That sounds good but it may be incompatible with your goals of integrating all configuration into a single XML file and not using the normal ZK configuration process. Your thought about forking ZK is a good one since there are calls to System.exit() that could wreak havoc. Although I agree with your comments about the architectural issues that embedding may lead to and we are aware of those, I do not agree that embedding will always lead to those issues. I agree that embedding won't always lead to those issues and your application is a reasonable counter-example. As is common, I think that the exception proves the rule since your system is really just another way to launch an independent ZK cluster rather than an example of ZK being embedded into an application.
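The accidental-standalone misconfiguration Patrick mentions usually means the server.N lines are missing, or differ between machines. A sketch of a consistent 3-server zoo.cfg (hostnames are placeholders) that every member must share, plus the per-server myid file:

```
# zoo.cfg (identical on all three machines)
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
# each server also needs a dataDir/myid file containing just its own N (1, 2 or 3)
```

A server started without any server.N lines silently comes up standalone, which produces exactly the mutually inconsistent "ensemble" described above.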
Re: client failure detectionin ZK
The session timeout is used for this: http://hadoop.apache.org/zookeeper/docs/current/zookeeperProgrammers.html#ch_zkSessions Patrick On 08/16/2010 01:47 PM, Jun Rao wrote: Hi, What config parameters in ZK determine how soon a failed client is detected? Thanks, Jun
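Concretely, failure detection is bounded by the negotiated session timeout: the server clamps whatever timeout the client requests into a window derived from tickTime (by default between 2 and 20 ticks). A small sketch of that rule, with the default bounds as an assumption:

```python
def negotiated_session_timeout(requested_ms, tick_time_ms=2000):
    """Clamp a client-requested session timeout into the server's
    accepted window: [2 * tickTime, 20 * tickTime] by default."""
    lo, hi = 2 * tick_time_ms, 20 * tick_time_ms
    return max(lo, min(hi, requested_ms))
```

So with the common tickTime of 2000 ms, a crashed client is declared dead somewhere between 4 and 40 seconds after its last heartbeat, depending on what it asked for.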
Re: Backing up zk data files
On 08/11/2010 06:49 PM, Adam Rosien wrote: http://hadoop.apache.org/zookeeper/docs/r3.3.1/zookeeperAdmin.html#sc_dataFileManagement says that one can copy the contents of the data directory and use it on another machine. The example states the other instance is not in the server list; what would happen if one did copy it to an offline member of the quorum that then starts up? The previously offline member will contact the quorum leader and see that it has an older version of the db; it will then synchronize with the leader as usual (either by downloading a diff or, if too far behind, getting a full snapshot). Do the docs imply that one can copy the data directory as-is as a backup method? Is it restorable to any crashed/hosed server, or only the one with the same server id? It can be copied as-is. Keep in mind though this is only needed for catastrophic failures (the entire zk serving cluster is lost) - not the case where a single server loses its HD for example; in that case you just restart the server - it will contact the leader and synchronize as I detailed above. What is a valid backup method for zk data? Copy the data directory (snapshots and logs). Patrick
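As a concrete sketch of that backup method -- paths are illustrative, and it simply copies whatever snapshot and txn log files are present (a fuzzy snapshot taken from a live server is still recoverable, since the logs are replayed on restore):

```python
import pathlib
import shutil
import time

def backup_datadir(data_dir, backup_root):
    """Copy the ZooKeeper data directory (snapshots + txn logs) into a
    timestamped subdirectory of backup_root and return the new path."""
    dest = pathlib.Path(backup_root) / time.strftime("zk-backup-%Y%m%d-%H%M%S")
    shutil.copytree(data_dir, dest)
    return dest
```

Run periodically (e.g. from cron) this gives the catastrophic-failure safety net Patrick describes; single-server losses need no restore at all.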
Re: zookeeper seems to hang
Great bug report Ted, the stack trace in particular is very useful. It looks like a timing bug where the client is not shutting down cleanly on the close call. I reviewed the code in question but nothing pops out at me. Also the logs just show us shutting down, nothing else from zk in there. Create a jira and attach all the detail you have available. Patrick On 08/11/2010 03:21 PM, Ted Yu wrote: Hi, Using HBase 0.20.6 (with HBASE-2473) we encountered a situation where Regionserver process was shutting down and seemed to hang. Here is the bottom of region server log: http://pastebin.com/YYawJ4jA zookeeper-3.2.2 is used. Your comment is welcome. Here is relevant portion from jstack - I attempted to attach jstack twice in my email to d...@hbase.apache.org but failed: DestroyJavaVM prio=10 tid=0x2aabb849c800 nid=0x6c60 waiting on condition [0x] java.lang.Thread.State: RUNNABLE regionserver/10.32.42.245:60020 prio=10 tid=0x2aabb84ce000 nid=0x6c81 in Object.wait() [0x43755000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on0x2aaab76633c0 (a org.apache.zookeeper.ClientCnxn$Packet) at java.lang.Object.wait(Object.java:485) at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1099) - locked0x2aaab76633c0 (a org.apache.zookeeper.ClientCnxn$Packet) at org.apache.zookeeper.ClientCnxn.close(ClientCnxn.java:1077) at org.apache.zookeeper.ZooKeeper.close(ZooKeeper.java:505) - locked0x2aaabf5e0c30 (a org.apache.zookeeper.ZooKeeper) at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.close(ZooKeeperWrapper.java:681) at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:654) at java.lang.Thread.run(Thread.java:619) main-EventThread daemon prio=10 tid=0x43474000 nid=0x6c80 waiting on condition [0x413f3000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for0x2aaabf6e9150 (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987) at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:414) RMI TCP Accept-0 daemon prio=10 tid=0x2aabb822c800 nid=0x6c7d runnable [0x40752000] java.lang.Thread.State: RUNNABLE at java.net.PlainSocketImpl.socketAccept(Native Method) at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:390) - locked0x2aaabf585578 (a java.net.SocksSocketImpl) at java.net.ServerSocket.implAccept(ServerSocket.java:453) at java.net.ServerSocket.accept(ServerSocket.java:421) at sun.management.jmxremote.LocalRMIServerSocketFactory$1.accept(LocalRMIServerSocketFactory.java:34) at sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(TCPTransport.java:369) at sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run(TCPTransport.java:341) at java.lang.Thread.run(Thread.java:619)
Re: Clarification on async calls in a cluster
On 08/11/2010 03:25 PM, Jordan Zimmerman wrote: If I use an async version of a call in a cluster (ensemble), what happens if the server I'm connected to goes down? Does ZK transparently resubmit the call to the next server in the cluster and call my async callback, or is there something I need to do? The docs aren't clear on this and searching the archive didn't give me the answer. Another source of confusion here is that the non-async versions do not resubmit the call - I need to do that manually. Thanks! Hi Jordan, the callbacks have an rc parameter that details the result of the request (result code); this will be one of KeeperException.Code, in this case CONNECTIONLOSS. You receive a connection loss result when the client has sent a request to the server but loses the connection before the server responds. You must resubmit this request manually (usually once you reconnect to the cluster), same as for sync calls. See these sections in the faq: http://wiki.apache.org/hadoop/ZooKeeper/FAQ#A2 also some detail in http://hadoop.apache.org/zookeeper/docs/current/zookeeperProgrammers.html#ch_zkSessions I agree the docs could be improved here. The javadoc for the callbacks is esp. embarrassing (there is none). Please enter JIRAs for any areas you'd like to see improved, including adding javadoc to the callbacks. Regards, Patrick
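A minimal resubmit wrapper, sketched in Python with a generic ConnectionError standing in for the CONNECTIONLOSS result code. Note the caveat the FAQ spells out: the original request may have already been applied by the server before the connection dropped (notably sequential creates), so the operation must be safe to reissue.

```python
import time

def resubmit_on_connection_loss(op, retries=3, backoff_s=0.01):
    """Re-run `op` when it fails with connection loss, backing off
    between attempts. `op` must be safe to reissue: the server may have
    applied the original request before the connection dropped."""
    for attempt in range(retries):
        try:
            return op()
        except ConnectionError:
            if attempt == retries - 1:
                raise
            time.sleep(backoff_s * (2 ** attempt))
```

For non-idempotent operations the usual recipe is to tag the request (e.g. embed a client id in the znode name) and check for the tag after reconnecting before retrying.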
Re: Sequence Number Generation With Zookeeper
Great! Basic details are here (create a jira, attach a patch, click submit and someone will review and help you get it into a state which we can commit). Probably you'd put your code into src/recipes or src/contrib (recipes sounds reasonable). http://wiki.apache.org/hadoop/ZooKeeper/HowToContribute Patrick On 08/10/2010 09:59 AM, David Rosenstrauch wrote: Good news! I got approval to release this code! (Man, I love working for a startup!!!) :-) So anyone know: what's the next step? Do I need to obtain commit privileges? Or do I deliver the code to someone who has commit privs who shepherds this for me? Also, what (if anything) do I need to tweak in the code to make it release-ready? (e.g., Change package names? Slap an Apache license on it? etc.) Thanks, DR On 08/06/2010 10:39 PM, David Rosenstrauch wrote: I'll run it by my boss next week. DR On 08/06/2010 07:30 PM, Mahadev Konar wrote: Hi David, I think it would be really useful. It would be very helpful for someone looking at generating unique tokens/generation ids (I can think of plenty of applications for this). Please do consider contributing it back to the community! Thanks mahadev On 8/6/10 7:10 AM, David Rosenstrauch dar...@darose.net wrote: Perhaps. I'd have to ask my boss for permission to release the code. Is this something that would be interesting/useful to other people? If so, I can ask about it. DR On 08/05/2010 11:02 PM, Jonathan Holloway wrote: Hi David, We did discuss potentially doing this as well. It would be nice to get some recipes for Zookeeper done for this area, if people think it's useful. Were you thinking of submitting this back as a recipe? If not, then I could potentially work on such a recipe instead. Many thanks, Jon. I just ran into this exact situation, and handled it like so: I wrote a library that uses the option (b) you described above. 
Only instead of requesting a single sequence number, you request a block of them at a time from Zookeeper, and then locally use them up one by one from the block you retrieved. Retrieving IDs by the block, rather than one at a time, eliminates the contention issue. Then, if you're finished assigning IDs from that block but still have a bunch of IDs left in it, the library has another function to push back the unused IDs. They'll then get pulled again in the next block retrieval. We don't actually have this code running in production yet, so I can't vouch for how well it works. But the design was reviewed and given the thumbs up by the core developers on the team, and the implementation passes all my unit tests. HTH. Feel free to email back with specific questions if you'd like more details. DR
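The block-allocation scheme David describes can be sketched as below. `BlockIdAllocator` and `fetch_block` are hypothetical names; in a real implementation `fetch_block` would be one ZooKeeper round trip (e.g., an atomic compare-and-set on a counter znode), and the push-back of unused IDs he mentions is omitted here.

```python
BLOCK_SIZE = 1000  # illustrative block size; one ZK round trip per block

class BlockIdAllocator:
    """Hand out IDs locally from a reserved block.

    fetch_block is a stand-in for the ZooKeeper call that atomically
    reserves the next block of IDs (e.g., a version-checked setData on
    a counter znode). Contention only occurs once per block, not per ID.
    """
    def __init__(self, fetch_block):
        self._fetch = fetch_block
        self._next = 0
        self._limit = 0  # first ID *not* in the current block

    def next_id(self):
        if self._next >= self._limit:
            start = self._fetch(BLOCK_SIZE)  # the only ZK round trip
            self._next, self._limit = start, start + BLOCK_SIZE
        nid = self._next
        self._next += 1
        return nid
```

With an in-memory counter standing in for the znode, 2000 `next_id` calls cost only two fetches.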
Re: Too many KeeperErrorCode = Session moved messages
I suspect this is a bug with the sync call and session moved (the code path for sync is a bit special). Please enter a JIRA for this. Thanks. Patrick On 08/05/2010 01:20 PM, Vishal K wrote: Hi All, I am seeing a lot of these messages in our application. I would like to know if I am doing something wrong or this is a ZK bug. Setup: - Server environment:zookeeper.version=3.3.0-925362 - 3 node cluster - Each node has few clients that connect to the local server using 127.0.0.1 as the host IP. - The application first forms a ZK cluster. Once the ZK cluster is formed, each node establish sessions with local ZK servers. The clients do not know about remote server so sessions are always with the local server. As soon as ZK clients connected to their respective follower, the ZK leader starts spitting the following messages: 2010-07-01 10:55:36,733 - INFO [ProcessThread:-1:preprequestproces...@405] - Got user-level KeeperException when processing sessionid:0x298d3b1fa9 type:sync: cxid:0x6 zxid:0xfffe txntype:unknown reqpath:/ Error Path:null Error:KeeperErrorCode = Session moved 2010-07-01 10:55:36,748 - INFO [ProcessThread:-1:preprequestproces...@405] - Got user-level KeeperException when processing sessionid:0x298d3b1fa9 type:sync: cxid:0x9 zxid:0xfffe txntype:unknown reqpath:/ Error Path:null Error:KeeperErrorCode = Session moved 2010-07-01 10:55:36,755 - INFO [ProcessThread:-1:preprequestproces...@405] - Got user-level KeeperException when processing sessionid:0x298d3b1fa9 type:sync: cxid:0xb zxid:0xfffe txntype:unknown reqpath:/ Error Path:null Error:KeeperErrorCode = Session moved 2010-07-01 10:55:36,795 - INFO [ProcessThread:-1:preprequestproces...@405] - Got user-level KeeperException when processing sessionid:0x298d3b1fa9 type:sync: cxid:0x10 zxid:0xfffe txntype:unknown reqpath:/ Error Path:null Error:KeeperErrorCode = Session moved 2010-07-01 10:55:36,850 - INFO [ProcessThread:-1:preprequestproces...@405] - Got user-level KeeperException when processing 
sessionid:0x298d3b1fa90001 type:sync: cxid:0x1 zxid:0xfffe txntype:unknown reqpath:/ Error Path:null Error:KeeperErrorCode = Session moved 2010-07-01 10:55:36,910 - INFO [ProcessThread:-1:preprequestproces...@405] - Got user-level KeeperException when processing sessionid:0x298d3b1fa9 type:sync: cxid:0x1b zxid:0xfffe txntype:unknown reqpath:/ Error Path:null Error:KeeperErrorCode = Session moved 2010-07-01 10:55:36,920 - INFO [ProcessThread:-1:preprequestproces...@405] - Got user-level KeeperException when processing sessionid:0x298d3b1fa9 type:sync: cxid:0x20 zxid:0xfffe txntype:unknown reqpath:/ Error Path:null Error:KeeperErrorCode = Session moved 2010-07-01 10:55:37,019 - INFO [ProcessThread:-1:preprequestproces...@405] - Got user-level KeeperException when processing sessionid:0x298d3b1fa9 type:sync: cxid:0x29 zxid:0xfffe txntype:unknown reqpath:/ Error Path:null Error:KeeperErrorCode = Session moved 2010-07-01 10:55:37,030 - INFO [ProcessThread:-1:preprequestproces...@405] - Got user-level KeeperException when processing sessionid:0x298d3b1fa9 type:sync: cxid:0x2c zxid:0xfffe txntype:unknown reqpath:/ Error Path:null Error:KeeperErrorCode = Session moved 2010-07-01 10:55:37,035 - INFO [ProcessThread:-1:preprequestproces...@405] - Got user-level KeeperException when processing sessionid:0x298d3b1fa9 type:sync: cxid:0x2e zxid:0xfffe txntype:unknown reqpath:/ Error Path:null Error:KeeperErrorCode = Session moved 2010-07-01 10:55:37,065 - INFO [ProcessThread:-1:preprequestproces...@405] - Got user-level KeeperException when processing sessionid:0x298d3b1fa9 type:sync: cxid:0x33 zxid:0xfffe txntype:unknown reqpath:/ Error Path:null Error:KeeperErrorCode = Session moved 2010-07-01 10:55:38,840 - INFO [ProcessThread:-1:preprequestproces...@405] - Got user-level KeeperException when processing sessionid:0x298d3b1fa90001 type:sync: cxid:0x4 zxid:0xfffe txntype:unknown reqpath:/ Error Path:null Error:KeeperErrorCode = Session moved 20 These sessions were established on 
the follower: 2010-07-01 08:59:09,890 - INFO [CommitProcessor:0:nioserverc...@1431] - Established session 0x298d3b1fa9 with negotiated timeout 9000 for client /127.0.0.1:50773 2010-07-01 08:59:09,890 - INFO [SvaDefaultBLC-SendThread(localhost.localdom:2181):clientcnxn$sendthr...@701] - Session establishment complete on server localhost.localdom/127.0.0.1:2181, sessionid = 0x298d3b1fa9, negotiated timeout = 9000 The server is spitting out these messages for every session that it does not own (session established by clients with followers). The messages are always seen for a sync request. No other issues are seen with the cluster. I am wondering what would be the cause of this problem? Looking at PrepRequestProcessor, it seems like this message is printed when the owner of the
Re: Using watcher for being notified of children addition/removal
You may want to consider adding a distributed queue to your use of ZK. As was mentioned previously, watches don't notify you of every change, just that a change was made; for example, multiple changes may be visible by the time you get the notification. A distributed queue would allow you to log every change and have your watcher easily process the result. The only issue I could see is one of atomicity, but depending on your use case(s) that may not be an issue, or perhaps one that can be worked around. Patrick On 08/02/2010 09:18 AM, Ted Dunning wrote: Another option besides Steve's excellent one would be to keep something like 1000 nodes in your list per znode. Many update patterns will give you the same number of updates, but the ZK transactions that result (getChildren, read znode) will likely be more efficient, especially the getChildren call. Remember, it is not a requirement that you have a one-to-one mapping between your in-memory objects and in-zookeeper znodes. If that works, fine. If not, feel free to be creative. On Mon, Aug 2, 2010 at 7:45 AM, Steve Gury steve.g...@mimesis-republic.com wrote: Is there any recipe that would provide this feature (or a work around)?
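Because a watch fires once per registration and may cover several coalesced changes, a common pattern is to re-read the children on each notification and diff against the last known set to recover additions and removals. A minimal sketch (`diff_children` is a hypothetical helper, not part of any ZK API):

```python
def diff_children(previous, current):
    """Return (added, removed) between two child listings.

    A watch notification only says "something changed"; calling
    getChildren again and diffing against the last known set recovers
    which children appeared or disappeared, even when several changes
    were coalesced into a single notification.
    """
    prev, cur = set(previous), set(current)
    return sorted(cur - prev), sorted(prev - cur)
```

The watcher would call this with the cached listing and the fresh getChildren result, then re-register the watch.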
Re: JMX error while starting ZooKeeper
On 07/19/2010 05:04 PM, Rakesh Aggarwal wrote: javax.management.MBeanServer was not found Sounds like you are missing rt.jar for some reason (it contains that class). Try running java -verbose -version and see what jars are being picked up; I see a number of lines containing: ... /usr/lib/jvm/java-6-sun-1.6.0.20/jre/lib/rt.jar ... Patrick
Re: Errors with Python bindings
Hi Rich, the version string looks useful to have, thanks! Would you mind submitting this via JIRA? Do an svn diff (looks like you did already), create a JIRA and attach the diff, then click the submit link on the JIRA. We'll review and work on getting it into a future release. http://wiki.apache.org/hadoop/ZooKeeper/HowToContribute Thanks! Patrick On 07/15/2010 05:24 PM, Rich Schumacher wrote: Hey Henry, Good to know! I was under the impression the 3.3.0 release had the updated bindings but it seems I was mistaken. I'll get those built and then see what happens. Just curious, have you ever run into or heard of this? A quick Google search didn't return anything interesting. As for the version in the Python bindings, how about this trivial patch: Index: src/c/zookeeper.c === --- src/c/zookeeper.c (revision 964617) +++ src/c/zookeeper.c (working copy) @@ -1510,6 +1510,11 @@ PyModule_AddObject(module, "ZooKeeperException", ZooKeeperException); Py_INCREF(ZooKeeperException); + char version_str[32]; + sprintf(version_str, "%i.%i.%i", ZOO_MAJOR_VERSION, ZOO_MINOR_VERSION, ZOO_PATCH_VERSION); + + PyModule_AddStringConstant(module, "__version__", version_str); + ADD_INTCONSTANT(PERM_READ); ADD_INTCONSTANT(PERM_WRITE); ADD_INTCONSTANT(PERM_CREATE); On Jul 14, 2010, at 2:57 PM, Henry Robinson wrote: Hi Rich - No, there's not a very easy way to verify the Python bindings version afaik - would be a useful feature to have though. My first suggestion is to move to the bindings shipped with 3.3.1 - we fixed a lot of problems with the Python bindings which improved their stability a lot. Could you try that and then let us know if you continue to see problems? cheers, Henry On 14 July 2010 13:14, Rich Schumacher rich.s...@gmail.com wrote: I'm running a Tornado webserver and using ZooKeeper to store some metadata and occasionally the ZooKeeper connection will error out irrevocably. Any subsequent calls to ZooKeeper from this process will result in a SystemError.
Here is the relevant portion of the Python traceback: snip... File /usr/lib/pymodules/python2.5/zuul/storage/zoo.py, line 69, in call return getattr(zookeeper, name)(self.handle, *args) SystemError: NULL result without error in PyObject_Call I found this in the ZooKeeper server logs: 2010-07-13 06:52:46,488 - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:nioservercnxn$fact...@251] - Accepted socket connection from /10.2.128.233:54779 2010-07-13 06:52:46,489 - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:nioserverc...@742] - Client attempting to renew session 0x429b865a6270003 at /10.2.128.233:54779 2010-07-13 06:52:46,489 - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:lear...@95] - Revalidating client: 299973596915630083 2010-07-13 06:52:46,793 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:nioserverc...@1424] - Invalid session 0x429b865a6270003 for client /10.2.128.233:54779, probably expired 2010-07-13 06:52:46,794 - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:nioserverc...@1286] - Closed socket connection for client /10.2.128.233:54779 which had sessionid 0x429b865a6270003 The ZooKeeper ensemble is healthy; each node responds as expected to the four letter word commands and a simple restart of the Tornado processes fixes this. My question is, if this really is due to session expiration why is a SessionExpiredException not raised? Another question, is there an easy way to determine the version of the ZooKeeper Python bindings I'm using? I built the 3.3.0 bindings but I just want to be able to verify that. Thanks for the help, Rich -- Henry Robinson Software Engineer Cloudera 415-994-6679
Re: total # of zknodes
I've done some tests with ~600 clients creating 5 million znodes (size 100 bytes iirc) and 25 million watches. I was using 8GB of memory for this, however --- in this scenario it's critical that you tune the GC; in particular you need to turn on the CMS and incremental GC options. Otherwise when the GC collects it will collect for long periods of time and all of your clients will then time out. Keep an eye on the max latency of your servers; that's usually the most obvious indication of GC hits (it will spike up). You can use the latency tester from here to do the quick benchmarks Ben suggested: http://github.com/phunt/zk-smoketest also see: http://bit.ly/4ekN8G Patrick On 07/15/2010 08:57 AM, Benjamin Reed wrote: i think there is a wiki page on this, but for the short answer: the number of znodes impacts two things: memory footprint and recovery time. there is a base overhead to znodes to store its path, pointers to the data, pointers to the acl, etc. i believe that is around 100 bytes. you can't just divide your memory by 100+1K (for data) though, because the GC needs to be able to run and collect things and maintain a free space. if you use 3/4 of your available memory, that would mean with 4G you can store about three million znodes. when there is a crash and you recover, servers may need to read this data back off the disk or over the network. that means it will take about a minute to read 3G from the disk and perhaps a bit more to read it over the network, so you will need to adjust your initLimit accordingly. of course this is all back-of-the-envelope. i would suggest doing some quick benchmarks to test and make sure your results are in line with expectation. ben On 07/15/2010 02:56 AM, Maarten Koopmans wrote: Hi, I am mapping a filesystem to ZooKeeper, and use it for locking and mapping a filesystem namespace to a flat data object space (like S3).
So assuming proper nesting and small ZooKeeper nodes (< 1KB), how many nodes could a cluster with a few GBs of memory per instance realistically hold in total? Thanks, Maarten
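Ben's back-of-the-envelope numbers above can be wrapped in a tiny helper. The function and its defaults are just a restatement of his figures — ~100 bytes of per-znode overhead, ~1KB of data, and ~3/4 of the heap usable to leave GC headroom — not a ZooKeeper-provided formula:

```python
def estimate_max_znodes(heap_bytes, data_bytes=1024, overhead_bytes=100,
                        usable_fraction=0.75):
    """Rough znode capacity estimate from heap size.

    Uses ~100 bytes of per-znode overhead plus the data size, and only
    3/4 of the heap so the GC has room to run. Back-of-the-envelope only;
    benchmark before trusting the number.
    """
    return int(heap_bytes * usable_fraction) // (overhead_bytes + data_bytes)
```

A 4GB heap works out to roughly three million 1KB znodes, matching Ben's estimate.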
Re: Suggested way to simulate client session expiration in unit tests?
If you want to simulate expiration use the example I sent. http://github.com/phunt/zkexamples Another option is to use a mock. Patrick On 07/06/2010 05:42 PM, Jeremy Davis wrote: Thanks! That seems to work, but it is approximately the same as zooKeeper.close() in that there is no SessionExpired event that comes up through the default Watcher. Maybe I'm assuming more from ZK than I should, but should a paranoid lock implementation periodically test its session by reading or writing a value? Regards, -JD On Tue, Jul 6, 2010 at 10:32 AM, Mahadev Konar maha...@yahoo-inc.com wrote: Hi Jeremy, zk.disconnect() is the right way to disconnect from the servers. For session expiration you just have to make sure that the client stays disconnected for more than the session expiration interval. Hope that helps. Thanks mahadev On 7/6/10 9:09 AM, Jeremy Davis jerdavis.cassan...@gmail.com wrote: Is there a recommended way of simulating a client session expiration in unit tests? I see a TestableZooKeeper.java, with a pauseCnxn() method that does cause the connection to timeout/disconnect and reconnect. Is there an easy way to push this all the way through to session expiration? Thanks, -JD
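One technique for forcing a real SessionExpired event in tests is to connect a throwaway client with the victim's session id and password and then close it; the server invalidates the session, and the original client sees the expiration. The sketch below models only the bookkeeping with a hypothetical `FakeEnsemble` — none of these names are real API, and a real test would use the actual client handles:

```python
# Sketch of the "second client" session-expiration trick. FakeEnsemble is
# a hypothetical in-memory stand-in for a live ensemble, used only to
# illustrate the session bookkeeping.
class FakeEnsemble:
    def __init__(self):
        self._sessions = {}
        self._next_id = 1

    def connect(self, credentials=None):
        if credentials is None:
            sid = self._next_id
            self._next_id += 1
            self._sessions[sid] = "live"
            return sid
        return credentials  # attach to the existing session

    def close(self, sid):
        self._sessions[sid] = "expired"  # closing invalidates the session

    def state(self, sid):
        return self._sessions[sid]

def expire_session(ensemble, victim_sid):
    """Attach a throwaway client to the victim's session, then close it."""
    throwaway = ensemble.connect(credentials=victim_sid)
    ensemble.close(throwaway)
```

After `expire_session`, the original client's next interaction would fail with a session-expired error, which is the event Jeremy wanted to observe.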
Re: Zookeeper outage recap questions
Hi Travis, as Flavio suggested it would be great to get the logs. A few questions: 1) how did you eventually recover, restart the zk servers? 2) was the cluster losing quorum during this time? leader re-election? 3) Any chance this could have been initially triggered by a long GC pause on one of the servers? (is gc logging turned on, any sort of heap monitoring?) Has the GC been tuned on the servers, for example CMS and incremental? 4) what are the clients using for timeout on the sessions? 3.4 is probably not for a few months yet, but we are planning for a 3.3.2 in a few weeks to fix a couple of critical issues (which don't seem related to what you saw). If we can identify the problem here we should be able to include it in any fix release we do. Fixing something like 517 might help, but it's not clear how we got to this state in the first place; fixing 517 might not have any effect if the root cause is not addressed. 662 has only ever been reported once afaik, and we weren't able to identify the root cause for that one. One thing we might also consider is modifying the zk client lib to back off connection attempts if they keep failing (timing out, say). Today the clients are pretty aggressive on reconnection attempts. Having some sort of backoff (exponential?) would provide more breathing room to the server in this situation. Patrick On 06/30/2010 11:13 PM, Travis Crawford wrote: Hey zookeepers - We just experienced a total zookeeper outage, and here's a quick post-mortem of the issue, and some questions about preventing it going forward. Quick overview of the setup: - RHEL5 2.6.18 kernel - Zookeeper 3.3.0 - ulimit raised to 65k files - 3 cluster members - 4-5k connections in steady-state - Primarily C and python clients, plus some java In chronological order, the issue manifested itself as an alert about RW tests failing. Logs were full of "too many open files" errors, and the output of netstat showed lots of CLOSE_WAIT and SYN_RECV sockets. CPU was 100%.
Application logs showed lots of connection timeouts. This suggests an event happened that caused applications to dogpile on Zookeeper, and eventually the CLOSE_WAIT timeout caused file handles to run out and basically game over. I looked through lots of logs (clients+servers) and did not see a clear indication of what happened. Graphs show a sudden decrease in network traffic when the outage began; zookeeper goes cpu bound and runs out of file descriptors. Clients are primarily a couple thousand C clients using default connection parameters, and a couple thousand python clients using default connection parameters. Digging through Jira we see two issues that probably contributed to this outage: https://issues.apache.org/jira/browse/ZOOKEEPER-662 https://issues.apache.org/jira/browse/ZOOKEEPER-517 Both are tagged for the 3.4.0 release. Anyone know if that's still the case, and when 3.4.0 is roughly scheduled to ship? Thanks! Travis
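The exponential backoff Patrick floats for the client library can be sketched as a simple delay schedule. This is a proposal sketch, not how the ZK client of that era actually behaved; the function names and constants are made up for illustration:

```python
import random

BASE_DELAY = 0.1   # seconds before the first retry (assumed value)
MAX_DELAY = 30.0   # cap so delays stop growing (assumed value)

def reconnect_delay(attempt):
    """Exponential backoff: 0.1s, 0.2s, 0.4s, ... capped at 30s.

    The exponential growth stops thousands of clients from hammering a
    recovering server in lockstep, which is the dogpile scenario Travis
    describes; the cap keeps long outages from producing absurd waits.
    """
    return min(MAX_DELAY, BASE_DELAY * (2 ** attempt))

def jittered_delay(attempt):
    """Add jitter so clients don't all reconnect at the same instant."""
    return reconnect_delay(attempt) * random.uniform(0.5, 1.0)
```

Jitter matters as much as the exponent here: without it, synchronized clients would still arrive in waves, just less frequent ones.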
Re: Guaranteed message delivery until session timeout?
On 06/30/2010 09:37 AM, Ted Dunning wrote: Which API are you talking about? C? I think that the difference between connection loss and session expiration might mess you up slightly in your disjunction here. On Wed, Jun 30, 2010 at 7:45 AM, Bryan Thompson br...@systap.com wrote: I am wondering what guarantees (if any) zookeeper provides for reliable messaging for operation return codes up to a session timeout (in particular, see the timeliness guarantee at http://hadoop.apache.org/zookeeper/docs/current/zookeeperProgrammers.html#ch_zkGuarantees). Basically, I would like to know whether a zookeeper client can rely on observing the return code for a successful operation which creates an ephemeral (or ephemeral sequential) znode -or- have a guarantee that its session was timed out and the ephemeral znode destroyed. That is, does zookeeper provide this? Any ephemeral node(s) associated with a session will be deleted when the session is invalidated (session expiration or client close request). Patrick
Re: Receive timed out error while starting zookeeper server
On 06/26/2010 06:53 AM, Peeyush Kumar wrote: I have a 6 node cluster (5 slaves and 1 master). I am trying to You typically want an odd number given that zk works by majority (even is fine, but not optimal). So 5 would be great (7 is a bit of overkill). 3 is fine too, but 5 allows you to take 1 server down for scheduled maintenance and still experience an unexpected failure w/o impact to service availability. In your exception I see DatagramSocket; this is unusual. What are you running for ZK version? As Lei suggested please include your config file so that we can review that as well (if you are overriding electionAlg this might be part of the problem; current versions of ZK servers use tcp for connections by default, that's why this is unusual). Most likely there is either a config problem or perhaps you have a firewall that's blocking communication between the servers? Try verifying server-to-server connectivity on the ports you've selected. Patrick start the zookeeper server on the cluster.
when I issue this command: $ java -cp zookeeper.jar:lib/log4j-1.2.15.jar:conf \ org.apache.zookeeper.server.quorum.QuorumPeerMain zoo.cfg I get the following error: 2010-06-26 18:09:17,468 - INFO [main:quorumpeercon...@80] - Reading configuration from: conf/zoo.cfg 2010-06-26 18:09:17,483 - INFO [main:quorumpeercon...@232] - Defaulting to majority quorums 2010-06-26 18:09:17,545 - INFO [main:quorumpeerm...@118] - Starting quorum peer 2010-06-26 18:09:17,585 - INFO [QuorumPeer:/0.0.0.0:2179:quorump...@514] - LOOKING 2010-06-26 18:09:17,589 - INFO [QuorumPeer:/0.0.0.0:2179:leaderelect...@154] - Server address: master.cf.net/192.168.1.1:2180 2010-06-26 18:09:17,589 - INFO [QuorumPeer:/0.0.0.0:2179:leaderelect...@154] - Server address: slave01.cf.net/192.168.1.2:2180 2010-06-26 18:09:17,792 - WARN [QuorumPeer:/0.0.0.0:2179:leaderelect...@194] - Ignoring exception while looking for leader java.net.SocketTimeoutException: Receive timed out at java.net.PlainDatagramSocketImpl.receive0(Native Method) at java.net.PlainDatagramSocketImpl.receive(PlainDatagramSocketImpl.java:136) at java.net.DatagramSocket.receive(DatagramSocket.java:725) at org.apache.zookeeper.server.quorum.LeaderElection.lookForLeader(LeaderElection.java:170) at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:515) 2010-06-26 18:09:17,794 - INFO [QuorumPeer:/0.0.0.0:2179:leaderelect...@154] - Server address: slave02.cf.net/192.168.1.3:2180 2010-06-26 18:09:17,995 - WARN [QuorumPeer:/0.0.0.0:2179:leaderelect...@194] - Ignoring exception while looking for leader java.net.SocketTimeoutException: Receive timed out at java.net.PlainDatagramSocketImpl.receive0(Native Method) at java.net.PlainDatagramSocketImpl.receive(PlainDatagramSocketImpl.java:136) at java.net.DatagramSocket.receive(DatagramSocket.java:725) at org.apache.zookeeper.server.quorum.LeaderElection.lookForLeader(LeaderElection.java:170) at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:515) 2010-06-26 
18:09:17,996 - INFO [QuorumPeer:/0.0.0.0:2179:leaderelect...@154] - Server address: slave03.cf.net/192.168.1.4:2180 2010-06-26 18:09:18,197 - WARN [QuorumPeer:/0.0.0.0:2179:leaderelect...@194] - Ignoring exception while looking for leader java.net.SocketTimeoutException: Receive timed out at java.net.PlainDatagramSocketImpl.receive0(Native Method) at java.net.PlainDatagramSocketImpl.receive(PlainDatagramSocketImpl.java:136) at java.net.DatagramSocket.receive(DatagramSocket.java:725) at org.apache.zookeeper.server.quorum.LeaderElection.lookForLeader(LeaderElection.java:170) at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:515) 2010-06-26 18:09:18,200 - INFO [QuorumPeer:/0.0.0.0:2179:leaderelect...@154] - Server address: slave04.cf.net/192.168.1.5:2180 2010-06-26 18:09:18,401 - WARN [QuorumPeer:/0.0.0.0:2179:leaderelect...@194] - Ignoring exception while looking for leader java.net.SocketTimeoutException: Receive timed out at java.net.PlainDatagramSocketImpl.receive0(Native Method) at java.net.PlainDatagramSocketImpl.receive(PlainDatagramSocketImpl.java:136) at java.net.DatagramSocket.receive(DatagramSocket.java:725) at org.apache.zookeeper.server.quorum.LeaderElection.lookForLeader(LeaderElection.java:170) at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:515) 2010-06-26 18:09:18,402 - INFO [QuorumPeer:/0.0.0.0:2179:leaderelect...@154] - Server address: slave05.cf.net/192.168.1.6:2180 2010-06-26 18:09:18,604 - WARN [QuorumPeer:/0.0.0.0:2179:leaderelect...@194] - Ignoring exception while looking for leader java.net.SocketTimeoutException: Receive timed out at java.net.PlainDatagramSocketImpl.receive0(Native Method) at
Re: Reply: Starting zookeeper in replicated mode
There are 3 ports that need to be opened: 1) the client port (between clients and servers) 2/3) the quorum and election ports - only between servers. You are setting these three ports in your config file (clientPort defaults to 2181 iirc, unless you override it) Patrick On 06/22/2010 06:17 AM, Erik Test wrote: Thanks for your help. The missing file issue is resolved. I was confused by how to start zookeeper because a firewall is blocking connections between nodes. The odd thing is hadoop can run on its own with the configured iptables but doesn't work with zookeeper for some reason. The problem here is I can't turn off the firewall and need to configure the firewall so that zookeeper can work correctly. I'm going to work on the iptables to open connections needed by zookeeper. If anyone knows of a way to do this, or even just a link to configuring iptables with zookeeper in mind, I'd appreciate it. Thanks again for the help. Erik On 21 June 2010 20:56, Joe Zou j...@hz.webex.com wrote: Hi: You are missing the file; see the Caused by: java.lang.IllegalArgumentException: /var/zookeeper/myid file is missing at thanks Joe Zou -----Original Message----- From: Erik Test [mailto:erik.shi...@gmail.com] Sent: Tuesday, June 22, 2010 3:05 AM To: zookeeper-user@hadoop.apache.org Subject: Starting zookeeper in replicated mode Hi All, I'm having a problem with installing zookeeper on a cluster with 6 nodes in replicated mode. I was able to install and run zookeeper in standalone mode but I'm unable to run zookeeper in replicated mode. I've added a list of servers in zoo.cfg as suggested by the ZooKeeper Getting Started Guide but I get these logs displayed to screen: *[r...@master1 bin]# ./zkServer.sh start JMX enabled by default Using config: /root/zookeeper-3.2.2/bin/../conf/zoo.cfg Starting zookeeper ...
STARTED [r...@master1 bin]# 2010-06-21 12:25:23,738 - INFO [main:quorumpeercon...@80] - Reading configuration from: /root/zookeeper-3.2.2/bin/../conf/zoo.cfg 2010-06-21 12:25:23,743 - INFO [main:quorumpeercon...@232] - Defaulting to majority quorums 2010-06-21 12:25:23,745 - FATAL [main:quorumpeerm...@82] - Invalid config, exiting abnormally org.apache.zookeeper.server.quorum.QuorumPeerConfig$ConfigException: Error processing /root/zookeeper-3.2.2/bin/../conf/zoo.cfg at org.apache.zookeeper.server.quorum.QuorumPeerConfig.parse(QuorumPeerConfig.java:100) at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:98) at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:75) Caused by: java.lang.IllegalArgumentException: /var/zookeeper/myid file is missing at org.apache.zookeeper.server.quorum.QuorumPeerConfig.parseProperties(QuorumPeerConfig.java:238) at org.apache.zookeeper.server.quorum.QuorumPeerConfig.parse(QuorumPeerConfig.java:96) ... 2 more Invalid config, exiting abnormally* And here is my config file: * # The number of milliseconds of each tick tickTime=2000 # The number of ticks that the initial # synchronization phase can take initLimit=5 # The number of ticks that can pass between # sending a request and getting an acknowledgement syncLimit=2 # the directory where the snapshot is stored. dataDir=/var/zookeeper # the port at which the clients will connect clientPort=2181 server.1=master1:2888:3888 server.2=slave2:2888:3888 server.3=slave3:2888:3888 * I'm a little confused as to why this doesn't work and I haven't had any luck finding answers to some questions I have. Am I supposed to have an instance of ZooKeeper on each node started before running in replication mode? Should I have each node that will be running ZK listed in the config file? Should I be using an IP address to point to a server instead of a hostname? Thanks for your time. Erik
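For Erik's firewall question: given the example config above (clientPort 2181, quorum port 2888, election port 3888), a minimal iptables sketch might look like the following. The port numbers come from the config; the 192.168.0.0/24 server subnet is an assumption you'd replace with your actual server addresses, and a real ruleset would be tightened further.

```shell
# Allow client connections to ZooKeeper from anywhere clients run.
iptables -A INPUT -p tcp --dport 2181 -j ACCEPT
# Allow quorum traffic (followers connect to the leader) between servers only.
# 192.168.0.0/24 is an assumed server subnet - substitute your own.
iptables -A INPUT -p tcp -s 192.168.0.0/24 --dport 2888 -j ACCEPT
# Allow leader-election traffic between servers only.
iptables -A INPUT -p tcp -s 192.168.0.0/24 --dport 3888 -j ACCEPT
```

All three ports are plain TCP in current ZK versions, so no UDP rules are needed unless you are on the old UDP-based election algorithm.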
Re: Free Software Solution to continuously load a large number of feeds with several servers?
I've seen a number of these built as proprietary solutions using ZooKeeper. It would be great to see something open sourced. HBase/ZK seems like a good fit. You might also consider ZooKeeper/BookKeeper. Patrick On 06/18/2010 11:01 AM, Thomas Koch wrote: http://stackoverflow.com/questions/3072042/free-software-solution-to-continuously-load-a-large-number-of-feeds-with-several I need a system that schedules and conducts the loading of a large number of feeds. The scheduling should consider priority values for feeds provided by me and the history of past publish frequency of the feed. Later the system should make use of pubsub where available. Currently I'm planning to implement my own system based on HBase and ZooKeeper. If there isn't any free software solution by now, then I'd propose at work to develop our solution as Free Software. Thank you for any hints, Thomas Koch, http://www.koch.ro
Re: zookeeper crash
We are unable to reproduce this issue. If you can provide the server logs (all servers) and attach them to the jira it would be very helpful. Some detail on the approx time of the issue so we can correlate to the logs would help too (summary of what you did/do to cause it, etc... anything that might help us nail this one down). https://issues.apache.org/jira/browse/ZOOKEEPER-335 Some detail on ZK version, OS, Java version, HW info, etc... would also be of use to us. Patrick On 06/16/2010 02:49 PM, Vishal K wrote: Hi, We are running into this bug very often (almost 60-75% hit rate) while testing our newly developed application over ZK. This is almost a blocker for us. Will the fix be simplified if backward compatibility were not an issue? Considering that this bug is rarely reported, I am wondering why we are running into this problem so often. Also, on a side note, I am curious why the systest that comes with ZooKeeper did not detect this bug. Can anyone please give an overview of the problem? Thanks. -Vishal On Wed, Jun 2, 2010 at 8:17 PM, Charity Majors char...@shopkick.com wrote: Sure thing. We got paged this morning because backend services were not able to write to the database. Each server discovers the DB master using zookeeper, so when zookeeper goes down, they assume they no longer know who the DB master is and stop working. When we realized there were no problems with the database, we logged in to the zookeeper nodes. We weren't able to connect to zookeeper using zkCli.sh from any of the three nodes, so we decided to restart them all, starting with node one. However, after restarting node one, the cluster started responding normally again. (The timestamps on the zookeeper processes on nodes two and three *are* dated today, but none of us restarted them. We checked shell histories and sudo logs, and they seem to back us up.)
We tried getting node one to come back up and join the cluster, but that's when we realized we weren't getting any logs, because log4j.properties was in the wrong location. Sorry -- I REALLY wish I had those logs for you. We put log4j back in place, and that's when we saw the spew I pasted in my first message. I'll tack this on to ZK-335. On Jun 2, 2010, at 4:17 PM, Benjamin Reed wrote: charity, do you mind going through your scenario again to give a timeline for the failure? i'm a bit confused as to what happened. ben On 06/02/2010 01:32 PM, Charity Majors wrote: Thanks. That worked for me. I'm a little confused about why it threw the entire cluster into an unusable state, though. I said before that we restarted all three nodes, but tracing back, we actually didn't. The zookeeper cluster was refusing all connections until we restarted node one. But once node one had been dropped from the cluster, the other two nodes formed a quorum and started responding to queries on their own. Is that expected as well? I didn't see it in ZOOKEEPER-335, so thought I'd mention it. On Jun 2, 2010, at 11:49 AM, Patrick Hunt wrote: Hi Charity, unfortunately this is a known issue not specific to 3.3 that we are working to address. See this thread for some background: http://zookeeper-user.578899.n2.nabble.com/odd-error-message-td4933761.html I've raised the JIRA level to blocker to ensure we address this asap. As Ted suggested you can remove the datadir -- only on the effected server -- and then restart it. That should resolve the issue (the server will d/l a snapshot of the current db from the leader). Patrick On 06/02/2010 11:11 AM, Charity Majors wrote: I upgraded my zookeeper cluster last week from 3.2.1 to 3.3.1, in an attempt to get away from a client bug that was crashing my backend services. Unfortunately, this morning I had a server crash, and it brought down my entire cluster. 
I don't have the logs leading up to the crash, because -- argghffbuggle -- log4j wasn't set up correctly. But I restarted all three nodes, and nodes two and three came back up and formed a quorum. Node one, meanwhile, does this: 2010-06-02 17:04:56,446 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@620] - LOOKING 2010-06-02 17:04:56,446 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:files...@82] - Reading snapshot /services/zookeeper/data/zookeeper/version-2/snapshot.a0045 2010-06-02 17:04:56,476 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@649] - New election. My id = 1, Proposed zxid = 47244640287 2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@689] - Notification: 1, 47244640287, 4, 1, LOOKING, LOOKING, 1 2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification: 3, 38654707048, 3, 1, LOOKING, LEADING, 3 2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification: 3, 38654707048, 3, 1, LOOKING, FOLLOWING, 2 2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0
Re: Debugging help for SessionExpiredException
I'm not very experienced personally with running zk on ec2 smalls, Ted usually has the ec2 related insight. Given these boxes are not loaded or lightly loaded, and you've ruled out gc/swap, the only thing I can think of is that something is going on under the covers at the vm level that's causing the high latency you're seeing. You're seeing 15 _minutes_ max latency. I can't think of what would cause that inside zk. Any chance that the VM is shutting down or freezing during that period? I don't know. Are you monitoring that system from a second system? Perhaps that might shed some light (monitor the cpu/disk activity using some monitoring tool like ganglia, nagios, etc... or even more primitive, perhaps doing a ping to that system and tracking the round trip time/packet loss, dump to a file and review the next day, etc...) Patrick On 06/15/2010 03:59 PM, Jordan Zimmerman wrote: They're small instances. The thing is that these machines are doing next to no work. We're just running simple little tests. The session expiration has not happened while I've been watching. It tends to happen over night. -JZ On Jun 15, 2010, at 1:50 PM, Ted Dunning wrote: As usual, the ZK team provides the best feedback. I would be bold enough to ask what kind of ec2 instances you are running on. Small instances are small chunks of larger machines and are sometimes subject to competition for resources from the other tenants. On Tue, Jun 15, 2010 at 12:30 PM, Patrick Hunt ph...@apache.org wrote: 3) under-provisioned virtual machines (ie vmware) ... Given that you've ruled out the gc (most common), disk utilization would be the next thing to check.
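Patrick's suggestion of monitoring the box from outside can be approximated with a simple gap detector: sample a clock on a fixed interval, then look for samples that arrive far later than scheduled. This is a hypothetical sketch (not part of any ZooKeeper tooling); the function name and the sample data are made up for illustration:

```python
def find_stalls(timestamps, expected_interval, tolerance=2.0):
    """Given monotonic sample times from a loop that should tick once per
    expected_interval seconds, return the gaps that overshot it by more
    than `tolerance` x -- a crude way to spot VM freezes or scheduling
    stalls after the fact."""
    stalls = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        gap = cur - prev
        if gap > expected_interval * tolerance:
            stalls.append(gap)
    return stalls

# Hypothetical samples: one per second, with a 15-minute (900 s) freeze
# like the max latency discussed in this thread.
samples = [0, 1, 2, 902, 903]
stalls = find_stalls(samples, 1.0)
```

In a real deployment the sampler would run on a second machine (dumping `time.monotonic()` readings to a file, as Patrick suggests), so a freeze of the monitored VM shows up as a gap in ping replies rather than in the sampler's own clock.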
Re: zookeeper watch triggered multiple times on same event
I don't think this should be possible (if it happens it's a bug in zk). Perhaps, for some reason, there really are 2 change actions (children created, or the same child created twice) and not just one? Re-registering the watch inside the watch is fine. The server sends watch notifications as one way messages, when it notices a znode child list has changed it fires off change messages to all the registered clients. The client then receives the notification and calls the handler. Patrick On 06/15/2010 05:47 PM, Jun Rao wrote: Hi, I get a quick question on ZK 3.2.2. Here is a sequence of events during a test: 1. client 1 creates an ephemeral node under /a 2. client 1 sets a watch using getChildren on /a 3. client 2 creates an ephemeral node under /a 4. client 1's watch gets triggered (a node change event). Inside the watch, client 1 does getChildren on /a and sets the watch. 5. client 1's watch gets triggered again (a node change event) My question is why the same node change event gets triggered twice. It seems that step 5 shouldn't have happened. Thanks, Jun
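The fire-once/re-register cycle Patrick describes can be modeled without a real server. The sketch below is a toy pure-Python model (the class and event names are invented, not the ZooKeeper API) showing that each registration yields at most one notification, so a handler that re-registers sees one event per change:

```python
class OneShotWatchNode:
    """Toy model of a znode whose child-list watches fire at most once
    per registration, mimicking ZooKeeper's one-shot watch semantics."""

    def __init__(self):
        self.children = set()
        self.watchers = []  # callbacks registered since the last change

    def get_children(self, watcher=None):
        # Reading the children optionally (re)registers a one-shot watch.
        if watcher is not None:
            self.watchers.append(watcher)
        return sorted(self.children)

    def create_child(self, name):
        self.children.add(name)
        # Fire every registered watch exactly once, then drop them all:
        # clients see one notification per registration, not per event.
        pending, self.watchers = self.watchers, []
        for w in pending:
            w("NodeChildrenChanged")

events = []
node = OneShotWatchNode()

def watcher(event):
    events.append(event)
    node.get_children(watcher)  # re-register inside the handler

node.get_children(watcher)      # client 1 sets the initial watch
node.create_child("member-1")   # fires once
node.create_child("member-2")   # fires once more (watch was re-set)
```

Under this model the sequence in Jun's message would produce exactly one notification per child creation; a second notification for the same change would indeed indicate either a second real change or a bug.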
Re: Debugging help for SessionExpiredException
Session expiration is due to the server not hearing heartbeats from the client. So either the client is partitioned from the server, or the client is not sending heartbeats for some reason, typically this is due to the client JVM gc'ing or swapping. Patrick On 06/10/2010 04:14 PM, Ted Dunning wrote: Uh the options I was recommending were for your CLIENT. You should have similar settings on ZK, but it is your client that is likely to be pausing. On Thu, Jun 10, 2010 at 4:08 PM, Jordan Zimmermanjzimmer...@proofpoint.com wrote: The thing is, this is a test instance (on AWS/EC2) that isn't getting a lot of traffic. i.e. 1 zookeeper instance that we're testing with. On Jun 10, 2010, at 4:06 PM, Ted Dunning wrote: Possibly. I have seen GC times of 4 minutes on some large processes. Better to set the GC parameters so you don't get long pauses. On http://wiki.apache.org/hadoop/ZooKeeper/Troubleshooting it mentions using the -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC options. I recommend adding -XX:+UseParNewGC -XX:+CMSParallelRemarkEnabled -XX:+DisableExplicitGC You may want to tune the actual parameters of the GC itself. These should not be used in general, but might be helpful for certain kinds of servers: -XX:MaxTenuringThreshold=6 -XX:SurvivorRatio=6 -XX:CMSInitiatingOccupancyFraction=60 -XX:+UseCMSInitiatingOccupancyOnly Finally, you should always add options for lots of GC diagnostics: -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution On Thu, Jun 10, 2010 at 3:49 PM, Jordan Zimmerman jzimmer...@proofpoint.com wrote: If I set my session timeout very high (1 minute) this shouldn't happen, right?
Re: Debugging help for SessionExpiredException
100mb partition? sounds like virtualization. resource starvation (worse in virtualized env) is a common cause of this. Are your clients gcing/swapping at all? If a client gc's for long periods of time the heartbeat thread won't be able to run and the server will expire the session. There is a min/max cap that the server places on the client timeouts (it's negotiated), check the client log for detail on what timeout it negotiated (logged in 3.3 releases) take a look at this and see if you can make progress: http://wiki.apache.org/hadoop/ZooKeeper/Troubleshooting My guess is that your client is gcing for long periods of time - you can rule this in/out by turning on gc logging in your clients and then viewing the results after another such incident happens (try gchisto for graphical view) Patrick On 06/09/2010 11:36 AM, Jordan Zimmerman wrote: We have a test system using Zookeeper. There is a single Zookeeper server node and 4 clients. There is very little activity in this system. After a day's testing we start to see SessionExpiredException on the client. Things I've tried: * Increasing the session timeout to 1 minute * Making sure all JVMs are running in a 100MB partition Any help debugging this problem would be appreciated. What kind of diagnostics can I add? Are there more config parameters that I should try? -JZ
Re: Debugging help for SessionExpiredException
On 06/09/2010 03:35 PM, Lei Zhang wrote: We've consistently run into issues with vmware workstation (CentOS as guest OS) on Windows host: just leaving the cluster idle overnight leads to zk session expire issue. My theory is: windows may have gone to hibernation, the zk heartbeat logic hibernates, session expire exception is thrown the moment windows is taken out of hibernation. That sounds like a possible scenario. On EC2 (still CentOS as guest OS), we consistently run into zk session expire issue when our cluster is under heavy load. I am planning to raise scheduling priority of zk server, but haven't done testing. Before you take any action you might examine a few things to identify what's biting you: this has some good general detail on issues other users have seen: http://wiki.apache.org/hadoop/ZooKeeper/Troubleshooting In particular you might look at GC/swapping on your clients, that's the most common case we see for session expiration (apart from the obvious -- network level connectivity failures). In one case I remember there was very heavy network load for a period of time once per day, this was causing some issue on the switches which would result in occasional session expiration, but only during this short window. This was pretty hard to track down. Are you monitoring network connectivity in general? Is it possible that temporary network outages are causing this? Perhaps take a look at both your server and client ZK logs, see if the client is seeing anything other than the session expiration (is the client seeing session TIMED OUT for example, this happens when the client doesn't hear back from the server, while session expiration happens because the server doesn't hear from the client). Good luck, Patrick
Re: Simulating failures?
Here's how to test session expiration (haven't tried this in a while): http://github.com/phunt/zkexamples It would be great to have some test infrastructure/examples/docs/strategies available for developers (zk client users). If someone would be interested in working on/contributing this we'd be pretty psyched to work with you on it. Patrick On 06/04/2010 11:28 AM, Stephen Green wrote: Now that I've got things working pretty smoothly with my ZooKeeper setup in normal operation, I'd like to test some of the recovery stuff that I've put into my application. I'd like to make sure that if a connection to ZK fails, then my application recovers appropriately (possibly by giving up). Obviously I could do some of this by shutting off the server and restarting it, but I'd like to be a bit more systematic, if possible. Is there any way to inject failures into the ZK client so that I can test without having to randomly kill servers/clients? Thanks, Steve
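One systematic alternative to killing servers is a test-double wrapper that injects failures on chosen calls. This is a hypothetical pattern sketch, not the zkexamples approach; all class and method names are invented, and `ConnectionLossError` stands in for the client library's connection-loss exception:

```python
class ConnectionLossError(Exception):
    pass

class FlakyClient:
    """Test double: delegates to a real (or fake) client but raises
    ConnectionLossError on chosen call numbers, so recovery paths can
    be exercised deterministically instead of by killing servers."""

    def __init__(self, delegate, fail_on_calls):
        self.delegate = delegate
        self.fail_on = set(fail_on_calls)
        self.calls = 0

    def get_data(self, path):
        self.calls += 1
        if self.calls in self.fail_on:
            raise ConnectionLossError(path)
        return self.delegate.get_data(path)

class FakeStore:
    """Stand-in for the real client, so the sketch runs without a server."""
    def get_data(self, path):
        return b"payload"

def read_with_retry(client, path, attempts=3):
    """The application-side recovery logic under test."""
    for _ in range(attempts):
        try:
            return client.get_data(path)
        except ConnectionLossError:
            continue
    raise RuntimeError("gave up")

# Fail exactly the second get_data() call; the retry loop should absorb it.
client = FlakyClient(FakeStore(), fail_on_calls={2})
first = read_with_retry(client, "/a")
second = read_with_retry(client, "/a")  # hits the injected failure, retries
```

The same wrapper idea extends to other operations (create, exists, watches), giving repeatable coverage of the recovery code without touching the servers.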
Re: Locking and Partial Failure
Hi Charles, any luck with this? Re the issues you found with the recipes please enter a JIRA, it would be good to address the problem(s) you found. https://issues.apache.org/jira/browse/ZOOKEEPER re use of session/thread id, might you use some sort of unique token that's dynamically assigned to the thread making a request on the shared session? The calling code could then be identified by that token in recovery cases. Patrick On 05/28/2010 08:28 AM, Charles Gordon wrote: Hello, I am new to using Zookeeper and I have a quick question about the locking recipe that can be found here: http://hadoop.apache.org/zookeeper/docs/r3.1.2/recipes.html#sc_recipes_Locks It appears to me that there is a flaw in this algorithm related to partial failure, and I am curious to know how to fix it. The algorithm follows these steps: 1. Call create() with a pathname like /some/path/to/parent/child-lock-. 2. Call getChildren() on the lock node without the watch flag set. 3. If the path created in step (1) has the lowest sequence number, you are the master (skip the next steps). 4. Otherwise, call exists() with the watch flag set on the child with the next lowest sequence number. 5. If exists() returns false, go to step (2), otherwise wait for a notification from the path, then go to step (2). The scenario that seems to be faulty is a partial failure in step (1). Assume that my client program follows step (1) and calls create(). Assume that the call succeeds on the Zookeeper server, but there is a ConnectionLoss event right as the server sends the response (e.g., a network partition, some dropped packets, the ZK server goes down, etc). Assume further that the client immediately reconnects, so the session is not timed out. At this point there is a child node that was created by my client, but that my client does not know about (since it never received the response). 
Since my client doesn't know about the child, it won't know to watch the previous child to it, and it also won't know to delete it. That means all clients using that lock will fail to make progress as soon as the orphaned child is the lowest sequence number. This state will continue until my client closes its session (which may be a while if I have a long lived session, as I would like to have). Correctness is maintained here, but liveness is not. The only good solution I have found for this problem is to establish a new session with Zookeeper before acquiring a lock, and to close that session immediately upon any connection loss in step (1). If everything works, the session could be re-used, but you'd need to guarantee that the session was closed if there was a failure during creation of the child node. Are there other good solutions? I looked at the sample code that comes with the Zookeeper distribution (I'm using 3.2.2 right now), and it uses the current session ID as part of the child node name. Then, if there is a failure during creation, it tries to look up the child using that session ID. This isn't really helpful in the environment I'm using, where a single session could be shared by multiple threads, any of which could request a lock (so I can't uniquely identify a lock by session ID). I could use thread ID, but then I run the risk of a thread being reused and getting the wrong lock. In any case, there is also the risk that a second failure prevents me from looking up the lock after a connection loss, so I'm right back to an orphaned lock child, as above. I could, presumably, be careful enough with try/catch logic to prevent even that case, but it makes for pretty bug-prone code. Also, as a side note, that code appears to be sorting the child nodes by the session ID first, then the sequence number, which could cause locks to be ordered incorrectly. Thanks for any help you can provide! Charles Gordon
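Steps 2-4 of the recipe reduce to: sort the children by sequence number and either hold the lock (lowest) or watch the immediate predecessor. The sketch below models just that decision, using hypothetical child names with a session-id prefix like the sample code discussed above; it also illustrates Charles's side note, since sorting by the full name (session ID first) would order these children incorrectly:

```python
def sequence_number(znode):
    """Sort key for sequential znodes like 'lock-<sessionid>-0000000003':
    use only the trailing sequence counter, never the session-id prefix,
    or locks can be granted out of order (the bug noted above)."""
    return int(znode.rsplit("-", 1)[-1])

def lock_decision(children, my_znode):
    """Steps 2-4 of the recipe: return (have_lock, znode_to_watch)."""
    ordered = sorted(children, key=sequence_number)
    idx = ordered.index(my_znode)
    if idx == 0:
        return True, None              # lowest sequence: we hold the lock
    return False, ordered[idx - 1]     # watch the next-lowest predecessor

# Hypothetical children: note a plain lexicographic sort would put the
# two 0x0a entries ahead of 0xff, violating the acquisition order.
children = ["lock-0xff-0000000002",
            "lock-0x0a-0000000010",
            "lock-0x0a-0000000003"]

holder = lock_decision(children, "lock-0xff-0000000002")
waiter = lock_decision(children, "lock-0x0a-0000000003")
```

The orphaned-child failure Charles describes lives outside this function: if create() succeeded but the response was lost, `children` contains a node no live client claims as `my_znode`, and nothing here will ever delete it.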
Re: Securing ZooKeeper connections
On 05/27/2010 09:47 AM, Benjamin Reed wrote: actually pat hunt took over that issue: ZOOKEEPER-733. pat has made a lot of progress and the patch looks close to being ready. This is just the server side though, still need to make similar changes on the client. That will likely be a separate jira. But yes, it's coming along. ps - actually, to be clear the patch adds netty support. the idea is that once we have netty in and netty supports SSL quite transparently, it should be easy to get SSL in. SSL/netty part seems pretty simple, however there's also the key mgmt portion which looks more complicated (need to integrate not-quite-commons-ssl or something like that, haven't gotten that far yet) On 05/26/2010 04:44 PM, Mahadev Konar wrote: Hi Vishal, Ben (Benjamin Reed) has been working on a netty based client server protocol in ZooKeeper. I think there is an open jira for it. My network connection is pretty slow so am finding it hard to search for it. We have been thinking about enabling secure connections via these netty based connections in zookeeper. Thanks mahadev On 5/25/10 12:20 PM, Vishal K vishalm...@gmail.com wrote: Hi All, Since ZooKeeper does not support secure network connections yet, I thought I would poll and see what people are doing to address this problem. Is anyone running ZooKeeper over secure channels (client - server and server- server authentication/encryption)? If yes, can you please elaborate how you do it? Thanks. Regards, -Vishal
Re: Securing ZooKeeper connections
Short of someone else stepping up I have it on my todo list. ;-) Still quite a bit of work to do on 733 though getting it back into shape. (not to mention layering the ssl on top). Then there's also the server-server connectivity that also needs to have netty support added (quorum/election port I mean, 733 only adds netty to the server side client port). Set a watch on 733 and subscribe to the dev list if you want to follow along. Patrick On 05/27/2010 10:46 AM, Gustavo Niemeyer wrote: actually pat hunt took over that issue: ZOOKEEPER-733. pat has made a lot of progress and the patch looks close to being ready. This is just the server side though, still need to make similar changes on the client. That will likely be a separate jira. But yes, it's coming along. Oh, that's great news Patrick. Thanks for pushing this forward! Do you think the client side might see some attention soon as well? Or, in other words, do you plan to shift over to the client side once you're done with the server?
Re: Question about concurrent primitives library
Hi, this was originally proposed as a google summer of code project, the slots for gsoc have already been given out, this was not one of the projects chosen by apache. So you could still work on this if you like, but not under the gsoc umbrella. We (zk contributor community) would be happy to work with you. See the following for the recipes we currently ship with: http://svn.apache.org/viewvc/hadoop/zookeeper/trunk/src/recipes/ You might also check JIRA, such as: https://issues.apache.org/jira/browse/ZOOKEEPER-767 There's still a lot of work to be done in this area. Such as: * identifying and documenting what components the library might/should contain. See this for the current list: http://hadoop.apache.org/zookeeper/docs/current/recipes.html * even the existing recipes could benefit, improved documentation for example. * queues/locks are implemented in src/recipes, however the other recipes are not * add python implementations in addition to c/java? * not all recipes are black/white, but rather there are many variations to each. We could add these to the docs/implementation there's probably a lot more that could be done that I haven't identified, this is fertile ground. Would be great if you were interested and would like to contribute. Feel free to create some jiras and contribute patches! I encourage you to move the discussion further to the zookeeper-dev list, that's where we discuss futures and unreleased software. Regards, Patrick On 05/25/2010 11:34 PM, Chia-Hung Lin wrote: Hi, I read the page at http://wiki.apache.org/hadoop/ZooKeeper/SoC2010Ideas saying there would be a mentor if one were to work on those projects. I am interested in the `Concurrent Primitives Library' and would like to work on it. Is this project still available? Or is there any procedure I need to follow in order to participate? Thanks, ChiaHung
Re: Ping and client session timeouts
Hi Stephen, my comments inline below: On 05/21/2010 09:31 AM, Stephen Green wrote: I feel like I'm missing something fairly fundamental here. I'm building a clustered application that uses ZooKeeper (3.3.1) to store its configuration information. There are 33 nodes in the cluster (Amazon EC2 instances, if that matters), and I'm currently using a single ZooKeeper instance. When a node starts up, it makes a connection to ZK, sets the data on a few paths and makes an ephemeral node for itself. I keep the connection open while the node is running so that I can use watches to find out if a node disappears, but after the initial setup, the node usually won't write or read anything from ZK. My understanding (having had a quick look at the code) is that the client connection will send a ping every sessionTimeout * 2/3 ms or so to keep the session alive, but I keep seeing sessions dying. On the Actually the client sends a ping every 1/3 the timeout, and then looks for a response before another 1/3 elapses. This allows time to reconnect to a different server (and still maintain the session) if the current server were to become unavailable. client side I'll see something like the following sequence of events: [05/21/10 15:59:40.753] INFO Initiating client connection, connectString=zookeeper:2200 sessionTimeout=30000 watcher=com.echonest.cluster.zoocontai...@1eb3319f [05/21/10 15:59:40.767] INFO Socket connection established to zookeeper/10.255.9.187:2200, initiating session [05/21/10 15:59:40.787] INFO Session establishment complete on server zookeeper/10.255.9.187:2200, sessionid = 0x128bb7b828d004c, negotiated timeout = 30000 Ok, this (^^^) says that the timeout is set to 30sec.
[05/21/10 16:13:03.729] INFO Client session timed out, have not heard from server in 33766ms for sessionid 0x128bb7b828d004c, closing socket connection and attempting reconnect [05/21/10 16:13:19.268] INFO Initiating client connection, connectString=zookeeper:2200 sessionTimeout=30000 watcher=com.echonest.cluster.zoocontai...@1eb3319f [05/21/10 16:14:12.326] INFO Client session timed out, have not heard from server in 53058ms for sessionid 0x128bb7b828d004c, closing socket connection and attempting reconnect This (^^^) is very suspicious, in particular "have not heard from server in 53058ms". This means that the client heartbeat code didn't notice that the heartbeat was exceeded for 53 seconds! This should never happen, the client does a select with a timeout of 1/3 the session timeout (10sec here). The fact that the select is taking 43 addl seconds (53-10sec select timeout) tells me that your client jvm is not allowing the heartbeat thread to run. The most common reason for this is GC. Is your client application very memory intensive? Heavy on GC? You should turn on your GC logging and review the output after reproducing this issue (turning on CMS/incremental GC mode usually resolves this issue, but you should verify first). What we typically see here is that the client JVM is running GC for very long periods of time, this blocks all the threads, and as a result the heartbeat is not sent by the client! As you are running in a virtualized environment this could also be a factor (it's def an issue from a GC perspective). But I suspect that gc is the issue here, look at that first. See this page for some common issues users have faced in the past: http://bit.ly/5WwS44 If I'm reading this correctly, the connection gets set up and then the server experiences an error trying to read from the client, so it closes the connection.
It's not clear if this causes the session timeout or vice-versa (these systems are both running ntp, but I doubt that we can count on interleaving those log times correctly.) Yes, this is due to the client not getting the heartbeat, so it will close the connection and attempt to reestablish. I started out with a session timeout of 10,000ms, but as you can see, I have the same problem at 30,000ms. You may not need to use this after you resolve the GC issue. Do I have a fundamental misunderstanding? What else should I do to figure out what's going on here? As I suggested above, give GC logging a try. I found 'gchisto' a very useful tool for reviewing the resulting log files. http://sysadminsjourney.com/content/2008/09/15/profile-your-java-gc-performance-gchisto Regards, Patrick
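The 1/3-timeout schedule Patrick describes reduces to simple arithmetic. This sketch (invented function, not the actual client code) works through the numbers from the log above:

```python
def heartbeat_schedule(session_timeout_ms):
    """ZooKeeper clients ping at ~1/3 of the session timeout and expect
    a reply within another 1/3, leaving the final 1/3 to reconnect to a
    different server before the session is lost."""
    ping_interval = session_timeout_ms // 3
    read_deadline = 2 * session_timeout_ms // 3   # ping + reply window
    return ping_interval, read_deadline

# With the 30 s timeout negotiated in the log above:
ping, deadline = heartbeat_schedule(30000)

# A select() that takes 53058 ms (as in the log) overshoots the 10 s
# ping interval by ~43 s: the client JVM, not the network, stalled
# the heartbeat thread -- which is why GC is the prime suspect.
overshoot_ms = 53058 - ping
```

Note the schedule is measured on the client's own clock, so anything that stops client threads (a stop-the-world GC, a frozen VM) looks exactly like a dead server from the session's point of view.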
Re: Ping and client session timeouts
On 05/21/2010 11:32 AM, Stephen Green wrote: Right. The system can be very memory-intensive, but at the time these are occurring, it's not under a really heavy load, and there's plenty of heap available. However, while looking at a thread dump from one of the nodes, I realized that a very poor decision meant that I had more than 1200 threads running. I expect this is more of a problem than the GC at this point. I'm taking steps to correct this problem now. Lately, I've had fewer and fewer problems with GC. In a former life, I sat down the hall from the folks who wrote Hotspot's GC and they're pretty sharp folks :-) GC as a cause is very common, however had you mentioned 1200 threads I would have guessed that to be a potential issue. ;-) Right. I'd like to have as small a timeout as possible so that I notice quickly when things disappear. What's a reasonable minimum? I notice recommendations in other messages on the list that 2 is a good value. The setting you should use typically is determined by your sla requirements. How soon do you want ephemeral nodes to be cleaned up if a client fails? Say you were doing leader election, this would gate re-election in the case where the current leader failed (set it lower and you are more responsive (faster), but also more susceptible to false positives (such as temp network glitch). Set it higher and you ride over the network glitches however it takes longer to recover when a client really does go down). In some cases (hbase, solr) we've seen that the timeout had to be set artificially high due to the limitations of the current JVM GC algos. For example some hbase users were seeing GC pause times of 4 minutes. So this raises the question - do you consider this a failure or not? (I could reboot the machine faster than it takes to run that GC...) Good luck, Patrick
Re: Concurrent reads and writes on BookKeeper
On 05/20/2010 08:42 AM, Flavio Junqueira wrote: We have such a mechanism already, as Utkarsh mentions in the jira. The question is if we need more sophisticated mechanisms implemented, or if we should leave it to the application to implement. For now, we haven't felt the need for such extra mechanisms implemented along with BK, but I'd certainly be happy to hear a different perspective. Ok, was just saying that we shouldn't be too strict about it (impls available out of the box). Otherwise we run into situations similar to zk recipes where multiple users were re-implementing common patterns. Having said that, we have interesting projects to get folks involved with BK, but I don't have it clear that this is one of them. It would be great if you could enter JIRAs on this (projects), perhaps also a wiki 'interesting projects around bk (or hedwig, etc...)' page that catalogs those JIRAs. Thanks! Patrick -Flavio On May 20, 2010, at 1:36 AM, Patrick Hunt wrote: On 05/19/2010 01:23 PM, Flavio Junqueira wrote: Hi Andre, To guarantee that two clients that read from a ledger will read the same sequence of entries, we need to make sure that there is agreement on the end of the sequence. A client is still able to read from an open ledger, though. We have an open jira about informing clients of the progress of an open ledger (ZOOKEEPER-462), but we haven't reached agreement on it yet. Some folks think that it is best that each application use the mechanism it finds best. One option is to have the writer writing periodically to a ZooKeeper znode to inform of its progress. Hi Flavio. Seems like wrapping up a couple/few of these options in the client library (or a client library) would be useful for users -- reuse rather than everyone reinventing. Similar to how we now provide recipes in the zk source base rather than everyone rewriting the basic locks/queues... Would be a great project I would think for someone interested in getting started with bk (and to some extent zk) development.
Patrick I would need to know more detail of your application before recommending you stick with BookKeeper or switch to ZooKeeper. If your workload is dominated by writes, then BookKeeper might be a better option. -Flavio On May 19, 2010, at 1:29 AM, André Oriani wrote: Sorry, I forgot the subject on my last message :| Hi all, I was considering BookKeeper to implement some server replicated application having one primary server as writer and many backup servers reading from BookKeeper concurrently. The last documentation I had access to says "This writer has to execute a close ledger operation before any other client can read from it." So readers cannot read any entry on the ledger, even the already committed ones, until the writer stops writing to the ledger, i.e., closes it. Is my understanding right? Should I then use Zookeeper directly to achieve what I want? Thanks for the attention, André Oriani
[ANNOUNCE] Apache ZooKeeper 3.3.1
The Apache ZooKeeper team is proud to announce Apache ZooKeeper version 3.3.1 ZooKeeper is a high-performance coordination service for distributed applications. It exposes common services - such as naming, configuration management, synchronization, and group services - in a simple interface so you don't have to write them from scratch. You can use it off-the-shelf to implement consensus, group management, leader election, and presence protocols. And you can build on it for your own, specific needs. For ZooKeeper release details and downloads, visit: http://hadoop.apache.org/zookeeper/releases.html ZooKeeper 3.3.1 Release Notes are at: http://hadoop.apache.org/zookeeper/docs/r3.3.1/releasenotes.html Regards, The ZooKeeper Team
Re: Using ZooKeeper for managing solrCloud
Mahadev pointed out the ZK monitoring details, but on the solr side of the house I don't think we can provide much insight as solr is acting as a client of the zk service. Your best bet would be to ask on the solr user list. Regards, Patrick On 05/14/2010 04:09 AM, Rakhi Khatwani wrote: Hi, I just went through the zookeeper tutorial and successfully managed to run the zookeeper server. How do we monitor the zookeeper server? Is there a url for it? I pasted the following urls in a browser, but all I get is a blank page: http://localhost:2181 http://localhost:2181/zookeeper I actually needed zookeeper for managing a solr cloud externally, but now if I have 2 solr servers running, how do I configure zookeeper to manage them? Regards, Raakhi
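The blank page is expected: the ZooKeeper client port speaks its own protocol, not HTTP. Monitoring is done by sending the "four-letter" admin commands (e.g. ruok, stat) over a plain TCP connection; a healthy server answers ruok with imok. Below is a hedged sketch of that exchange; since it can't assume a live server, it talks to a tiny in-process stub that mimics the ruok/imok handshake (the stub and all names here are illustrative, not ZooKeeper code):

```python
import socket
import threading

def four_letter_word(host, port, cmd=b"ruok", timeout=5.0):
    """Send a four-letter admin command over plain TCP and return the
    raw reply bytes. The server answers and closes the connection."""
    with socket.create_connection((host, port), timeout=timeout) as s:
        s.sendall(cmd)
        chunks = []
        while True:
            data = s.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)

# Stub standing in for a ZooKeeper server, so the snippet runs anywhere:
def _stub(server_sock):
    conn, _ = server_sock.accept()
    with conn:
        if conn.recv(4) == b"ruok":
            conn.sendall(b"imok")

listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]
threading.Thread(target=_stub, args=(listener,), daemon=True).start()

reply = four_letter_word("127.0.0.1", port)
```

Against a real server you would point this at host:2181 (or simply run `echo ruok | nc localhost 2181` from a shell) and use commands like stat for per-connection and latency detail.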
Re: Xid out of order. Got 8 expected 7
Hi Jordan, you've seen this once or frequently? (having the server + client logs will help a lot) Patrick On 05/12/2010 11:08 AM, Jordan Zimmerman wrote: Sure - if you think it's a bug. We were using Zookeeper without issue. I then refactored a bunch of code and this new behavior started. I'm starting ZK using zkServer start and haven't made any changes to the code at all. I'll get the logs together and post a JIRA. -JZ On May 12, 2010, at 10:59 AM, Mahadev Konar wrote: Hi Jordan, Can you create a jira for this? And attach all the server logs and client logs related to this timeline? How did you start up the servers? Are there some changes you might have made accidentally to the servers? Thanks mahadev On 5/12/10 10:49 AM, Jordan Zimmerman jzimmer...@proofpoint.com wrote: We've just started seeing an odd error and are having trouble determining the cause. Xid out of order. Got 8 expected 7 Any hints on what can cause this? Any ideas on how to debug? We're using ZK 3.3.0. The error occurs in ClientCnxn.java line 781 -Jordan
Re: Xid out of order. Got 8 expected 7
I'm still interested though... Are you using the new getChildren api that was added to the client in 3.3.0? (it provides a Stat object on return, the old getChildren did not). While we don't officially support a 3.3.0 client with a 3.2.2 server (we do support the other way around), there shouldn't be the type of problem you describe with this configuration. I'd still be interested for you to create that jira. Regards, Patrick On 05/12/2010 11:23 AM, Jordan Zimmerman wrote: Apologies... I thought I was running the 3.3.0 server, but was running the 3.2.2 server with the 3.3.0 client. I upgraded the server and now all works again. Sorry to trouble y'all. -Jordan On May 12, 2010, at 11:11 AM, Patrick Hunt wrote: Hi Jordan, you've seen this once or frequently? (having the server + client logs will help a lot) Patrick On 05/12/2010 11:08 AM, Jordan Zimmerman wrote: Sure - if you think it's a bug. We were using Zookeeper without issue. I then refactored a bunch of code and this new behavior started. I'm starting ZK using zkServer start and haven't made any changes to the code at all. I'll get the logs together and post a JIRA. -JZ On May 12, 2010, at 10:59 AM, Mahadev Konar wrote: Hi Jordan, Can you create a jira for this? And attach all the server logs and client logs related to this timeline? How did you start up the servers? Are there some changes you might have made accidentally to the servers? Thanks mahadev On 5/12/10 10:49 AM, Jordan Zimmerman jzimmer...@proofpoint.com wrote: We've just started seeing an odd error and are having trouble determining the cause. Xid out of order. Got 8 expected 7 Any hints on what can cause this? Any ideas on how to debug? We're using ZK 3.3.0. The error occurs in ClientCnxn.java line 781 -Jordan
Re: Xid out of order. Got 8 expected 7
I think that explains it then - the server is probably dropping the new (3.3.0) getChildren message (xid 7) as it (3.2.2 server) doesn't know about that message type. Then the server responds to the client for a subsequent operation (xid 8), and at that point the client notices that getChildren (xid 7) got lost. Patrick On 05/12/2010 11:30 AM, Jordan Zimmerman wrote: Oh, OK. When I get a moment I'll restart the 3.2.2 and post logs, etc. Yes, we're calling getChildren with the callback. -JZ
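One way to see why the client reports "Xid out of order. Got 8 expected 7" is to model the bookkeeping Patrick describes: requests carry increasing xids and the client expects replies back in the same order, so a silently dropped request surfaces when the reply to the *next* request arrives. This is a toy sketch (hypothetical class, not the real ClientCnxn code):

```java
import java.util.ArrayDeque;

// Toy model of the ZooKeeper client's xid bookkeeping (illustrative only;
// the real logic lives in ClientCnxn). The client assigns increasing xids
// to outgoing requests and expects replies in exactly the same order.
public class XidOrderDemo {
    private final ArrayDeque<Integer> pending = new ArrayDeque<>();
    private int nextXid = 7; // start at 7 to match the numbers in the thread

    public int send() {                 // queue a request, return its xid
        int xid = nextXid++;
        pending.addLast(xid);
        return xid;
    }

    public void receive(int replyXid) { // a reply must match the oldest pending xid
        int expected = pending.removeFirst();
        if (replyXid != expected) {
            throw new IllegalStateException(
                "Xid out of order. Got " + replyXid + " expected " + expected);
        }
    }

    public static void main(String[] args) {
        XidOrderDemo client = new XidOrderDemo();
        client.send();                  // xid 7: silently dropped by the old server
        client.send();                  // xid 8: the server answers this one
        try {
            client.receive(8);          // reply for 8 arrives; 7 was never answered
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // Xid out of order. Got 8 expected 7
        }
    }
}
```

The point of the sketch is that the error is detected one request late: the client only notices the loss of xid 7 when the reply to xid 8 shows up at the head of the pending queue.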
Re: Xid out of order. Got 8 expected 7
I think Ben meant that the unknown operation itself (from the server's perspective) should result in an error directly on both client and server. Patrick On 05/12/2010 11:45 AM, Jordan Zimmerman wrote: Technically, there is an error generated. IMO - a more descriptive error would be helpful. -JZ On May 12, 2010, at 11:41 AM, Benjamin Reed wrote: is this a bug? shouldn't we be returning an error? ben
Re: Xid out of order. Got 8 expected 7
Hm, if you don't mind entering that jira - I'd still like to verify by looking at the logs. Patrick On 05/12/2010 11:52 AM, Jordan Zimmerman wrote: So, I'm off the Jira hook then? -JZ On May 12, 2010, at 11:49 AM, Patrick Hunt wrote: You're right. Ben, would you mind entering a JIRA? Patrick
Re: Pathological ZK cluster: 1 server verbosely WARN'ing, other 2 servers pegging CPU
On 05/12/2010 08:30 PM, Aaron Crow wrote: I may have a better idea of what caused the trouble. I way, WAY underestimated the number of nodes we collect over time. Right now we're at 1.9 million. This isn't a bug in our application; it's actually a feature (but perhaps an ill-conceived one). The most recent snapshot from a Zookeeper db is 227MB. If I scp it over to one of the other Zookeeper hosts, it takes about 4 seconds. Nice. You probably hold the record for largest (znode count) production ZK repo. Largest I've heard of at least. Now, there are some things I can do to limit the number of nodes we collect. My question is, how deadly could this node count be for us? Patrick mentioned to me that he's run Zookeeper with this many nodes, but you need to be careful about tuning. We're currently running with the recommended JVM settings (see below). We're using different drives for the 2 different kinds of data dirs that Zookeeper needs. We may also have the option of running on a 64 bit OS with added RAM, if it's worth it. What about timeout settings? I'm copying in our current settings below, are those ok? As long as you have enough memory/disk/IO you should be ok. Are you monitoring the operation latency on the servers? (via 4letter words, such as stat?) You might increase the init/sync limits a bit to ensure that the followers have enough time to d/l the snapshot, deserialize it, and get set up with the leader (if this takes too long the quorum will fail and reelect a new leader, which might happen indefinitely). Or should we just figure out how to keep our node count much lower? And how low is definitely pretty safe? There's really no max - it's just dependent on your resources. Memory in particular. You should turn on incremental GC mode though (-XX:+CMSIncrementalMode), otherwise large GC pauses will wreck your latencies.
Check out this link (below); verbose gc is also useful for tracking down issues later (if something bad happens you can use it to rule GC in or out as an issue) http://java.sun.com/docs/hotspot/gc5.0/gc_tuning_5.html#0.0.0.0.Incremental%20mode%7Coutline Regards, Patrick === some current settings === -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC -Xms2560m -Xmx2560m tickTime=2000 initLimit=10 syncLimit=5 Many thanks in advance for any good advice. Aaron On Wed, Apr 28, 2010 at 10:47 PM, Patrick Hunt ph...@apache.org wrote: Hi Aaron, some questions/comments below: On 04/28/2010 06:29 PM, Aaron Crow wrote: We were running version 3.2.2 for about a month and it was working well for us. Then late this past Saturday night, our cluster went pathological. One of the 3 ZK servers spewed many WARNs (see below), and the other 2 servers were almost constantly pegging the CPU. All three servers are on separate machines. From what we could tell, the machines were fine... networking fine, disk fine, etc. The ZK clients were completely unable to complete their connections to ZK. These machines are locally (not wan) connected then? What OS and java version are you using? Do you see any FATAL or ERROR level messages in the logs? It would help to look at your zk config files for these servers. Could you provide them? (you might want to create a JIRA first, then just attach configs and other details/collateral to that, easier than dealing with email) If you have logs for the time period and can share them, that would be most useful. (again, gzip and attach to the jira) We tried all sorts of restarts, running zkCleanup, etc. We even completely shut down our clients... and the pathology continued. Our workaround was to do an urgent upgrade to version 3.3.0. The new ZK cluster with 3.3.0 has been running well for us... so far... Off hand, and with the data we have so far, nothing sticks out that 3.3 would have resolved (JIRA has conveniently been down for the last hour or so, so I can't review right now).
Although there were some changes to reduce memory consumption (see below). I realize that, sadly, this message doesn't contain nearly enough details to trace exactly what happened. I guess I'm wondering if anyone has seen this general scenario, and/or knows how to prevent it? Is there anything we might be doing client side to trigger this? Our application level request frequency is maybe a few requests to Zookeeper per second, times 5 client applications. If we detect a SESSION EXPIRED, we simply create a new client and use that instead. And we were seeing this happen occasionally. What are the clients doing? Do you have a large number/size of znodes? Do you see any OutOfMemoryError in the logs? Could the ZK server java process be swapping? Are you monitoring GC, perhaps large GC pauses are happening? I have a suspicion that one of a few things might be happening. I see the following in your original email: :followerhand...@302] - Sending snapshot last zxid of peer is 0xd0007d66d zxid of leader is 0xf 2010-04-24 23:06:03,254 - ERROR
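Patrick's suggestion to raise the init/sync limits is easier to reason about in concrete units: both limits are expressed in ticks, so the wall-clock budget is the limit multiplied by tickTime. A minimal sketch of that arithmetic, using the settings quoted in this thread (the conversion follows the ZooKeeper admin docs; the class name is illustrative):

```java
// Sketch of how initLimit/syncLimit translate to wall-clock time.
// Both limits are counted in ticks; multiplying by tickTime (ms per
// tick) gives the actual window a follower has.
public class QuorumTimeouts {
    public static long toMillis(int limitTicks, int tickTimeMs) {
        return (long) limitTicks * tickTimeMs;
    }

    public static void main(String[] args) {
        int tickTime = 2000;  // ms per tick, from the config in the thread
        int initLimit = 10;   // ticks a follower may take to sync with the leader
        int syncLimit = 5;    // ticks a follower may lag before being dropped
        System.out.println("init window: " + toMillis(initLimit, tickTime) + " ms");
        System.out.println("sync window: " + toMillis(syncLimit, tickTime) + " ms");
    }
}
```

With these settings a follower gets a 20-second init window, which has to cover downloading and deserializing the 227MB snapshot mentioned above; that is why bumping initLimit is on the table.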
Re: zookeeper-3.2.2:Cannot open channel to X at election address / Connection refused
In the cases where we've seen this reported in the past, the user tracked the issue down to a firewall problem; I'm not sure what the issue is here given you've verified that's not the problem. The log is clearly saying: Thread:quorumcnxmana...@336] - Cannot open channel to 2 at election address /192.168.1.3:3888 java.net.ConnectException: Connection refused which means that the server is attempting to open a connection to 192.168.1.3 port 3888, but the server at that ip/port is not accepting the connection. Are you sure that both servers are up/running at the same time? The log that you included, this was for server 1 right (192.168.1.2)? You might use netstat -a to verify that each server is bound to the correct ports on each host, then take a look at the logs to see if this connection refused is still happening (it can appear in the logs if server 1 starts but server 2 is not yet started, but should rectify once both servers are bound and accepting connections). If you still have issues create a jira, attach both configs and both log files and we'll take a closer look.
https://issues.apache.org/jira/browse/ZOOKEEPER Good Luck, Patrick On 05/10/2010 08:07 PM, chen peng wrote: Thank you for your kind reply, but I think the port(s) work well; it is not a problem. Any other suggestions? PS: In that case, I installed zookeeper-3.2.2 but hbase.
Re: zookeeper-3.2.2:Cannot open channel to X at election address / Connection refused
Ok, great, good luck! Patrick On 05/10/2010 11:20 PM, chen peng wrote: My question has been resolved. I did not know that bin/zkServer start should be executed on each machine! I took it to be very close in function to hadoop (start-all.sh). tks!
Re: New ZooKeeper client library Cages
Hi Dominic, this looks really interesting - thanks for open sourcing it. I really like the idea of providing higher level concepts. I only just looked at the code; it wasn't clear on first pass what happens if you multilock on 3 paths and the first 2 succeed but the third fails. How are the locks cleared? How about the case where the client loses connectivity to the cluster - what happens in this case (both if partial locks are acquired, and the case where all the locks were acquired; for example, how does the caller know if the locks are still held or were released due to the client being partitioned from the cluster, etc...)? I'll try d/l'ing the code and looking at it more; I see some javadoc in there as well so that's great. Regards, Patrick On 05/11/2010 04:02 PM, Dominic Williams wrote: Anyone looking for a Java client library for ZooKeeper, please check out: Cages - http://cages.googlecode.com The library will be expanded and feedback will be helpful. Many thanks, Dominic ria101.wordpress.com
Re: zookeeper-3.2.2:Cannot open channel to X at election address / Connection refused
Often this is related to the port(s) being blocked by a firewall. Perhaps you could check this (2888/3888) in both directions? Telnet can help: https://help.maximumasp.com/KB/a445/connectivity-testing-with-ping-telnet-tracert-and-pathping-.aspx Patrick 2010/5/7 chen peng chenpeng0...@hotmail.com Hi all I have a question: after installation of the zookeeper according to the doc for zookeeper( http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_systemReq), abnormalities emerge as follows: -- JMX enabled by default Using config: /home/baeeq/hadoop-0.20.2/zookeeper-3.2.2/bin/../conf/zoo.cfg Starting zookeeper ... STARTED 2010-05-08 13:37:28,273 - INFO [main:quorumpeercon...@80] - Reading configuration from: /home/baeeq/hadoop-0.20.2/zookeeper-3.2.2/bin/../conf/zoo.cfg 2010-05-08 13:37:28,284 - INFO [main:quorumpeercon...@232] - Defaulting to majority quorums 2010-05-08 13:37:28,299 - INFO [main:quorumpeerm...@118] - Starting quorum peer 2010-05-08 13:37:28,331 - INFO [Thread-1:quorumcnxmanager$liste...@409] - My election bind port: 3888 2010-05-08 13:37:28,342 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@514] - LOOKING 2010-05-08 13:37:28,345 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@579] - New election: -1 2010-05-08 13:37:28,351 - WARN [WorkerSender Thread:quorumcnxmana...@336] - Cannot open channel to 2 at election address /192.168.1.3:3888 java.net.ConnectException: Connection refused at sun.nio.ch.Net.connect(Native Method) at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:507) at java.nio.channels.SocketChannel.open(SocketChannel.java:146) at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:323) at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:302) at org.apache.zookeeper.server.quorum.FastLeaderElection $Messenger$WorkerSender.process(FastLeaderElection.java:323) at org.apache.zookeeper.server.quorum.FastLeaderElection 
$Messenger$WorkerSender.run(FastLeaderElection.java:296) at java.lang.Thread.run(Thread.java:619) 2010-05-08 13:37:28,352 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@618] - Notification: 1, -1, 1, 1, LOOKING, LOOKING, 1 2010-05-08 13:37:28,353 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@642] - Adding vote 2010-05-08 13:37:28,557 - WARN [QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorumcnxmana...@336] - Cannot open channel to 2 at election address /192.168.1.3:3888 java.net.ConnectException: Connection refused at sun.nio.ch.Net.connect(Native Method) at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:507) at java.nio.channels.SocketChannel.open(SocketChannel.java:146) at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:323) at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectAll(QuorumCnxManager.java:356) at org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:603) at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:515) 2010-05-08 13:37:28,559 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@612] - Notification time out: 400 2010-05-08 13:37:28,961 - WARN [QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorumcnxmana...@336] - Cannot open channel to 2 at election address /192.168.1.3:3888 java.net.ConnectException: Connection refused at sun.nio.ch.Net.connect(Native Method) at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:507) at java.nio.channels.SocketChannel.open(SocketChannel.java:146) at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:323). 
-- fileinfo for zoo.cfg is listed below: tickTime=2000 initLimit=10 syncLimit=5 dataDir=/home/baeeq/hadoop-0.20.2/zookeeper-data clientPort=2181 server.1=192.168.1.2:2888:3888 server.2=192.168.1.3:2888:3888 PS: It works well on the single computer after deleting server.1=192.168.1.2:2888:3888 server.2=192.168.1.3:2888:3888
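The server.N lines in the zoo.cfg above encode exactly which ports each peer must be able to reach: the first port (2888) is the quorum/peer port and the second (3888) is the leader-election port named in the "Cannot open channel" warning. A small illustrative parser (hypothetical class, not part of ZooKeeper) makes the layout explicit:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative parser for the server.N lines in a zoo.cfg. Each entry
// names a peer (quorum) port and a leader-election port; every other
// server in the ensemble must be able to reach both.
public class ServerEntryParser {
    public static List<String> electionEndpoints(List<String> cfgLines) {
        List<String> out = new ArrayList<>();
        for (String line : cfgLines) {
            if (!line.startsWith("server.")) continue;
            String[] kv = line.split("=", 2);    // server.1 = host:peerPort:electionPort
            String[] parts = kv[1].split(":");
            out.add(parts[0] + ":" + parts[2]);  // host:electionPort
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> cfg = List.of(
            "tickTime=2000",
            "server.1=192.168.1.2:2888:3888",
            "server.2=192.168.1.3:2888:3888");
        // Prints the host:port pairs the election warnings refer to.
        System.out.println(electionEndpoints(cfg));
    }
}
```

Those host:port pairs are precisely what to probe (e.g. with telnet, as Patrick suggests) when checking that no firewall is blocking election traffic.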
Re: ZKClient
Thanks Travis, I've slated this for 3.4.0. I think it would be useful to add more examples, so feel free to add more if you have any ideas for useful ones. For future reference, we ask that contributions come in the form of a patch: http://wiki.apache.org/hadoop/ZooKeeper/HowToContribute It's fine this time around, but in future it would be helpful. (also click on the submit patch link when you are ready for review - that pushes it through the process, incl automated testing/verification; that's why we ask for a patch off the root btw) Thanks! Patrick On 05/04/2010 04:00 PM, Travis Crawford wrote: On Tue, May 4, 2010 at 3:45 PM, Ted Dunning ted.dunn...@gmail.com wrote: Travis, Attachments are stripped from this mailing list. Can you file a JIRA and put your attachment on that instead? Here is a link to get you started: https://issues.apache.org/jira/browse/ZOOKEEPER Whoops. Filed: https://issues.apache.org/jira/browse/ZOOKEEPER-765 --travis On Tue, May 4, 2010 at 3:43 PM, Travis Crawford traviscrawf...@gmail.com wrote: Attached is a skeleton application I extracted from a script I use -- perhaps we could add this as a recipe? If there are issues I'm more than happy to fix them, or add more comments, whatever. It took a while to figure this out and I'd love to save others that time in the future. --travis On Tue, May 4, 2010 at 3:16 PM, Mahadev Konar maha...@yahoo-inc.com wrote: Hi Adam, I don't think zk is very very hard to get right. There are examples in src/recipes which implement locks/queues/others. There is ZOOKEEPER-22 to make it even easier for applications to use. Regarding re-registration of watches, you can definitely write code and submit it as part of a well documented contrib module which lays out the assumptions/design of it. It could very well be useful for others. It's just that folks haven't had much time to focus on these areas as yet.
Thanks mahadev On 5/4/10 2:58 PM, Adam Rosien a...@rosien.net wrote: I use zkclient in my work at kaChing and I have mixed feelings about it. On one hand it makes easy things easy, which is great, but on the other hand I have very few ideas about what assumptions it makes under the hood. I also dislike some of the design choices, such as unchecked exceptions, but that's neither here nor there. It would take some extensive documentation work by the authors to really enumerate the model and assumptions, but the project doesn't seem to be active (either from it being adequate for its current users or just inactive). I'm not sure I could derive the assumptions myself. I'm a bit frustrated that zk is very, very hard to really get right. At a project level, can't we create structures to avoid most of these errors? Can there be a standard model with detailed assumptions and implementations of all the recipes? How can we start this? Is there something that makes this too hard? I feel like a recipe page is a big fail; wouldn't an example app that uses locks and barriers be that much more compelling? For the common FAQ items like you need to re-register the watch, can't we just create code that implements this pattern? My goal is to live up to the motto: a good API is impossible to use incorrectly. .. Adam On Tue, May 4, 2010 at 2:21 PM, Ted Dunning ted.dunn...@gmail.com wrote: In general, writing this sort of layer on top of ZK is very, very hard to get really right for general use. In a simple use-case you can probably nail it, but distributed systems are a Zoo, to coin a phrase. The problem is that you are fundamentally changing the metaphors in use, so assumptions can come unglued or be introduced pretty easily. One example of this is the fact that ZK watches *don't* fire for every change, but when you write listener oriented code, you kind of expect that they will. That makes it really, really easy to introduce that assumption in the heads of programmers using an event listener library on top of ZK. Another example is how the atomicity of ZK's get-content/set-watch call is easy to violate in an event driven architecture, because the thread that watches ZK probably resets the watch. If you assume that the listener will read the data, then you have introduced a timing mismatch between the read of the data and the resetting of the watch. That might be OK or it might not be. The point is that these changes are subtle and tricky to get exactly right. On Tue, May 4, 2010 at 1:48 PM, Jonathan Holloway jonathan.hollo...@gmail.com wrote: Is there any reason why this isn't part of the Zookeeper trunk already?
Re: ZKClient
While I agree DS is hard, I don't think we should lose the useful feedback given by Jonathan/Adam - that getting started with ZK is challenging and can be frustrating. We need to learn from this feedback and create some action items to address. One of the main things I've heard so far that we can act on today is that we should add examples/docs to round things out. I agree with this. Also the recipes page should be updated to point to the recipe implementations we recently added to the release. One suggestion, it's much easier for new contributors/users to contribute to the examples than it is to jump into ZK core development. New users feel the pain most directly (recently), I'd encourage you to contribute back by creating an example or two. I'm sure the existing contributors would be happy to work with you to get them committed and released. Regards, Patrick On 05/04/2010 03:43 PM, Ted Dunning wrote: Creating recipes is a great thing, but that doesn't change the fact that distributed systems are inherently a bit tricky, especially if you start with the assumption (as many people do) that Peter Deutsch was wrong. One of the great contributions of MapReduce style parallelism or the java concurrent package is that it provides safe trails in a pretty scary forest. Good Zookeeper recipes could provide similar guidance with similar positive effects. On Tue, May 4, 2010 at 3:24 PM, Adam Rosiena...@rosien.net wrote: I'll check it out, but it is repeated in this list and on the web site that it's not as easy as it seems. I just want to enumerate the failure points and create abstractions to avoid them. .. Adam On Tue, May 4, 2010 at 3:16 PM, Mahadev Konarmaha...@yahoo-inc.com wrote: Hi Adam, I don't think zk is very very hard to get right. There are exmaples in src/recipes which implements locks/queues/others. There is ZOOKEEPER-22 to make it even more easier for application to use. 
Regarding re registration of watches, you can deifnitely write code and submit is as a part of well documented contrib module which lays out the assumptions/design of it. It could very well be useful for others. Its just that folks havent had much time to focus on these areas as yet. Thanks mahadev On 5/4/10 2:58 PM, Adam Rosiena...@rosien.net wrote: I use zkclient in my work at kaChing and I have mixed feelings about it. On one hand it makes easy things easy which is great, but on the other hand I very few ideas what assumptions it makes under the hood. I also dislike some of the design choices such as unchecked exceptions, but that's neither here nor there. It would take some extensive documentation work by the authors to really enumerate the model and assumptions, but the project doesn't seem to be active (either from it being adequate for its current users or just inactive). I'm not sure I could derive the assumptions myself. I'm a bit frustrated that zk is very, very hard to really get right. At a project level, can't we create structures to avoid most of these errors? Can there be a standard model with detailed assumptions and implementations of all the recipes? How can we start this? Is there something that makes this too hard? I feel like a recipe page is a big fail; wouldn't an example app that uses locks and barriers be that much more compelling? For the common FAQ items like you need to re-register the watch, can't we just create code that implements this pattern? My goal is to live up to the motto: a good API is impossible to use incorrectly. .. Adam On Tue, May 4, 2010 at 2:21 PM, Ted Dunningted.dunn...@gmail.com wrote: In general, writing this sort of layer on top of ZK is very, very hard to get really right for general use. In a simple use-case, you can probably nail it but distributed systems are a Zoo, to coin a phrase. 
The problem is that you are fundamentally changing the metaphors in use, so assumptions can come unglued or be introduced pretty easily. One example of this is the fact that ZK watches *don't* fire for every change, but when you write listener-oriented code, you kind of expect that they will. That makes it really, really easy to introduce that assumption in the heads of the programmers using an event listener library on top of ZK. Another example is that the atomicity of ZK's combined get-content/set-watch call is easy to violate in an event-driven architecture, because the thread that watches ZK probably resets the watch. If you assume that the listener will read the data, then you have introduced a timing mismatch between the read of the data and the resetting of the watch. That might be OK or it might not be. The point is that these changes are subtle and tricky to get exactly right.

On Tue, May 4, 2010 at 1:48 PM, Jonathan Holloway jonathan.hollo...@gmail.com wrote: Is there any reason why this isn't part of the Zookeeper trunk already?
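[Editor's note] The one-shot watch semantics Ted describes can be illustrated without a live ZooKeeper cluster. The sketch below uses a hypothetical `FakeZK` stand-in (not the real client API) to show the re-register-inside-the-callback pattern from the thread: a watch fires once per registration, so a callback that forgets to re-register silently misses later changes.

```python
# Minimal stand-in for a ZooKeeper-like client, only to illustrate
# one-shot watch semantics. FakeZK and its methods are illustrative
# inventions, not the zkpython or Java client API.
class FakeZK:
    def __init__(self):
        self.data = {}
        self.watches = {}  # path -> list of one-shot callbacks

    def get(self, path, watch=None):
        # Reading a znode optionally registers a one-shot watch.
        if watch is not None:
            self.watches.setdefault(path, []).append(watch)
        return self.data.get(path)

    def set(self, path, value):
        self.data[path] = value
        # Watches fire exactly once, then are discarded.
        for cb in self.watches.pop(path, []):
            cb(path)

zk = FakeZK()
events = []

def on_change(path):
    # The recipe: re-read AND re-register inside the callback.
    # Note the value read here may already reflect updates newer
    # than the one that fired the watch.
    events.append(zk.get(path, watch=on_change))

zk.get("/config", watch=on_change)  # initial read + watch
zk.set("/config", "v1")             # fires the watch once
zk.set("/config", "v2")             # re-registered watch fires again
```

After the two sets, `events` holds both values; drop the `watch=on_change` argument inside the callback and the second update would go unnoticed, which is exactly the listener-library trap described above.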
Re: ZKClient
Take a look at this thread for some background: http://www.mail-archive.com/zookeeper-user@hadoop.apache.org/msg00917.html There were some concerns at the time; not sure if they have been addressed since (it has been a while since that discussion). Patrick

On 05/04/2010 01:48 PM, Jonathan Holloway wrote: It looks good. Having written a client already myself, I'd rather use this than have to roll my own each time. Is there any reason why this isn't part of the Zookeeper trunk already? It would make working with Zookeeper a bit easier (at least from my perspective)... Jon.

On 4 May 2010 12:57, Ted Dunning ted.dunn...@gmail.com wrote: This is used as part of Katta, where it gets a fair bit of exercise at low update rates with small data. It is used for managing the state of the search cluster. I don't think it has had much external review or use for purposes apart from Katta. Katta generally has pretty decent code, though.

On Tue, May 4, 2010 at 12:39 PM, Jonathan Holloway jonathan.hollo...@gmail.com wrote: I came across this project on Github http://github.com/sgroschupf/zkclient for working with the Zookeeper API. Has anybody used it in the past? Is it a better way of interacting with a Zookeeper cluster? Many thanks, Jon.
Re: avoiding deadlocks on client handle close w/ python/c api
Thanks Kapil. Mahadev, perhaps you could take a look at this as well? Patrick

On 05/04/2010 06:36 AM, Kapil Thangavelu wrote: I've constructed a simple example just using the zkpython library with condition variables that will deadlock. I've filed a new ticket for it: https://issues.apache.org/jira/browse/ZOOKEEPER-763 The gdb stack traces look suspiciously like the ones in 591, but sans the watchers: https://issues.apache.org/jira/browse/ZOOKEEPER-591 The attached example on the ticket will deadlock in zk 3.3.0 (which has the fix for 591) and trunk. -kapil

On Mon, May 3, 2010 at 9:48 PM, Kapil Thangavelu kapil.f...@gmail.com wrote: Hi Folks, I'm constructing an async API on top of the zookeeper python bindings for twisted. The intent was to make a thin wrapper that would wrap the existing async api with one that allows for integration with the twisted python event loop (http://www.twistedmatrix.com), primarily using the async apis. One issue I'm running into while developing unit tests: deadlocks occur if we attempt to close a handle while there are any outstanding async requests (aget, acreate, etc.). Normally on close both the IO thread and the completion thread are terminated and joined; however, with outstanding async requests, the completion thread won't be in a joinable state, and we effectively hang when the main thread does the join. I'm curious if this would be considered a bug; afaics the ideal behavior would be, on close of a handle, to effectively clear out any remaining callbacks and let the completion thread terminate. I've tried adding some bookkeeping to the API to guard against closing while there is an outstanding completion request, but it's an imperfect solution due to the nature of the event loop integration. The problem is that the python callback invoked by the completion thread in turn schedules a function for the main thread.
In Twisted the API for this is implemented by appending the function to a list attribute on the reactor and then writing a byte to a pipe to wake up the main thread. If a thread switch to the main thread occurs before the completion thread callback returns, the scheduled function runs and the rest of the application keeps processing, of which the last step for the unit tests is to close the connection, which results in a deadlock. I've included some of the client log and gdb stack traces from a deadlocked client process. Thanks, Kapil
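[Editor's note] The hang described here has a classic shape: the main thread joins the completion thread while the completion thread is waiting for the main thread to service a callback it scheduled. The following is a library-free analogue of that shape using plain `threading` (it does not use zkpython or Twisted, and a join timeout is added so the sketch terminates; in the real case the join has no timeout and hangs forever):

```python
import threading

# Stand-in for the Twisted-style handoff described above: the
# completion thread schedules a callback for the main loop, then
# waits for the main loop to acknowledge it before returning.
scheduled = []
acked = threading.Event()

def completion_thread():
    scheduled.append(lambda: None)  # "schedule" a callback
    acked.wait()                    # block until the main loop runs it

t = threading.Thread(target=completion_thread)
t.start()

# The main thread acts like a close(): it joins the completion
# thread WITHOUT draining the scheduled callbacks first.
t.join(timeout=0.5)
deadlocked = t.is_alive()  # join timed out: both threads are stuck

# The fix Kapil suggests: on close, clear out remaining callbacks
# so the completion thread can terminate, then join.
for cb in scheduled:
    cb()
acked.set()
t.join()
```

This is only an analogue of the control-flow shape, not the actual zkpython internals (which involve the C client's IO/completion threads and the GIL), but it shows why draining pending callbacks before the join lets close complete cleanly.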