Re: Guaranteed message delivery until session timeout?
When a connection loss happens, all the watches are triggered with an event saying that a connection loss occurred. But on reconnect the watches are reset automagically on the new server: they will fire immediately if the change has already happened, or will remain set otherwise. I hope that answers your question.

Thanks
mahadev

On 6/30/10 5:11 PM, "Ted Dunning" wrote:
> I think that you are correct, but a real ZK person should answer this.
>
> On Wed, Jun 30, 2010 at 4:48 PM, Bryan Thompson wrote:
>
>> For example, if a client registers a watch, and a state change which would
>> trigger that watch occurs _after_ the client has successfully registered the
>> watch with the zookeeper quorum, is it possible that the client would not
>> observe the watch trigger due to communication failure, etc., even while the
>> client's session remains valid? It sounds like the answer is "no" per the
>> timeliness guarantee. Is that correct?
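The reconnect behaviour described above (outstanding watches carried over to the new server, fired at once if the watched node already changed while the client was disconnected) can be sketched as a toy Python model. `ToyWatchClient` and its version bookkeeping are illustrative inventions for this sketch, not the real ZooKeeper client library:

```python
class ToyWatchClient:
    """Toy model of a client that re-registers watches after a reconnect."""

    def __init__(self):
        self.watches = {}    # path -> callback
        self.last_seen = {}  # path -> version the client last observed

    def watch(self, path, version, callback):
        """Register a one-shot watch on `path`, remembering the version seen."""
        self.watches[path] = callback
        self.last_seen[path] = version

    def reconnect(self, server_state):
        """On reconnect, compare server versions with what we last saw.

        If a node changed while we were disconnected, fire its watch
        immediately (watches are one-shot, so it is then removed);
        otherwise the watch stays registered on the new server.
        """
        fired = []
        for path, callback in list(self.watches.items()):
            if server_state.get(path, 0) != self.last_seen[path]:
                callback(path)          # change happened while we were away
                fired.append(path)
                del self.watches[path]  # one-shot: consumed by firing
        return fired
```

In this model, a client that watched `/config` at version 1 and reconnects to a server holding version 2 sees the watch fire immediately; if the version is unchanged, the watch simply remains set.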
Re: Zookeeper outage recap & questions
I've moved this thread to: https://issues.apache.org/jira/browse/ZOOKEEPER-801

--travis

On Thu, Jul 1, 2010 at 12:37 AM, Patrick Hunt wrote:
> Hi Travis, as Flavio suggested it would be great to get the logs. A few
> questions:
>
> 1) How did you eventually recover, restart the zk servers?
>
> 2) Was the cluster losing quorum during this time? Leader re-election?
>
> 3) Any chance this could have been initially triggered by a long GC pause on
> one of the servers? (Is gc logging turned on, any sort of heap monitoring?)
> Has the GC been tuned on the servers, for example CMS and incremental?
>
> 4) What are the clients using for timeout on the sessions?
>
> 3.4 is probably not due for a few months yet, but we are planning a 3.3.2 in a
> few weeks to fix a couple of critical issues (which don't seem related to what
> you saw). If we can identify the problem here we should be able to include
> it in any fix release we do.
>
> Fixing something like 517 might help, but it's not clear how we got into this
> state in the first place, and fixing 517 might have no effect if the root
> cause is not addressed. 662 has only ever been reported once afaik, and we
> weren't able to identify the root cause for that one.
>
> One thing we might also consider is modifying the zk client lib to back off
> connection attempts if they keep failing (say, timing out). Today the clients
> are pretty aggressive on reconnection attempts. Having some sort of backoff
> (exponential?) would give the server more breathing room in this
> situation.
>
> Patrick
>
> On 06/30/2010 11:13 PM, Travis Crawford wrote:
>>
>> Hey zookeepers -
>>
>> We just experienced a total zookeeper outage; here's a quick
>> post-mortem of the issue, and some questions about preventing it going
>> forward.
>>
>> Quick overview of the setup:
>>
>> - RHEL5 2.6.18 kernel
>> - Zookeeper 3.3.0
>> - ulimit raised to 65k files
>> - 3 cluster members
>> - 4-5k connections in steady state
>> - Primarily C and Python clients, plus some Java
>>
>> In chronological order, the issue manifested itself as an alert about RW
>> tests failing. Logs were full of "too many open files" errors, and netstat
>> showed lots of sockets in CLOSE_WAIT and SYN_RECV. CPU was at 100%, and
>> application logs showed lots of connection timeouts. This suggests an
>> event happened that caused applications to dogpile on Zookeeper; eventually
>> the CLOSE_WAIT timeout caused file handles to run out, and it was basically
>> game over.
>>
>> I looked through lots of logs (clients + servers) and did not see a
>> clear indication of what happened. Graphs show a sudden decrease in
>> network traffic when the outage began; zookeeper becomes CPU bound and
>> runs out of file descriptors.
>>
>> Clients are primarily a couple of thousand C clients using default
>> connection parameters, plus a couple of thousand Python clients, also
>> with default connection parameters.
>>
>> Digging through Jira, we see two issues that probably contributed to this
>> outage:
>>
>> https://issues.apache.org/jira/browse/ZOOKEEPER-662
>> https://issues.apache.org/jira/browse/ZOOKEEPER-517
>>
>> Both are tagged for the 3.4.0 release. Anyone know if that's still the
>> case, and when 3.4.0 is roughly scheduled to ship?
>>
>> Thanks!
>> Travis
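The exponential backoff Patrick suggests for the client library might look something like the sketch below. The function name and parameters are hypothetical, not part of any ZooKeeper client API; the jitter option is one common way to keep thousands of clients from reconnecting in lockstep during an outage like the one described:

```python
import random

def backoff_delays(attempts, base=0.5, cap=30.0, jitter=False, rng=random):
    """Return reconnect delays in seconds: doubling per attempt, capped at `cap`.

    With jitter=True each delay is drawn uniformly from [0, computed delay]
    ("full jitter"), which spreads a dogpile of reconnecting clients over time.
    """
    delays = []
    for attempt in range(attempts):
        delay = min(cap, base * (2 ** attempt))
        if jitter:
            delay = rng.uniform(0.0, delay)
        delays.append(delay)
    return delays
```

A reconnect loop would sleep for each successive delay until a connection succeeds, then reset the attempt counter; the cap keeps a long outage from growing the wait unboundedly.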
Solr Cloud/ Solr integration with zookeeper
Hi,

I want to use Solr Cloud. I downloaded the code from the trunk and successfully executed the examples as shown in the wiki, but when I try the same with multicore I cannot access:

http://localhost:8983/solr/collection1/admin/zookeeper.jsp

It says page not found. Following is my configuration of solr.xml in the multicore directory:

I have created zoo.cfg in the multicore directory with the following configuration:

# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
# dataDir=/opt/zookeeper/data
# NOTE: Solr defaults the dataDir to /zoo_data
# the port at which the clients will connect
# clientPort=2181
# NOTE: Solr sets this based on zkRun / zkHost params

The command I run is:

java -Dsolr.solr.home=multicore -Dcollection.configName=myconf1 -DzkRun -jar start.jar

Am I going wrong anywhere? Alternatively, can we run an external zookeeper server which communicates with the solr servers (as clients) and shows the status of each solr server?

Regards,
Raakhi
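For reference, the tick-based settings in a zoo.cfg like the one above translate into wall-clock windows by multiplying by tickTime. A quick sanity check of what those particular values mean in milliseconds (a sketch using only the numbers from the config quoted above):

```python
# Values from the zoo.cfg quoted above.
tick_time_ms = 2000  # length of one tick in milliseconds
init_limit = 10      # ticks allowed for followers to sync with the leader
sync_limit = 5       # ticks allowed between a request and its acknowledgement

init_window_ms = tick_time_ms * init_limit  # follower sync window: 20000 ms
sync_window_ms = tick_time_ms * sync_limit  # request/ack window:   10000 ms

print(init_window_ms, sync_window_ms)  # 20000 10000
```

So with this config a follower has 20 seconds to complete initial synchronization, and may fall at most 10 seconds behind before being dropped.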