Errors while running systest
Hi, I am trying to run systest on a 3-node cluster (http://svn.apache.org/repos/asf/hadoop/zookeeper/trunk/src/java/systest/README.txt). When I reach the 4th step, which is to actually run the test, I get the exception shown below.

Exception in thread "main" java.lang.NoClassDefFoundError: junit/framework/TestCase
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClassCond(ClassLoader.java:632)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:616)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
    at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:169)
    at org.apache.zookeeper.util.FatJarMain.main(FatJarMain.java:97)
Caused by: java.lang.ClassNotFoundException: junit.framework.TestCase
    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
    ... 15 more

It looks like it is not able to find the JUnit classes. However, my classpath is set right:

:/opt/zookeeper-3.3.0/zookeeper.jar:/opt/zookeeper-3.3.0/lib/junit-4.4.jar:/opt/zookeeper-3.3.0/lib/log4j-1.2.15.jar:/opt/zookeeper-3.3.0/build/test/lib/junit-4.8.1.jar

Any suggestions on how I can get around this problem? Thanks.
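One quick way to check what the JVM can actually resolve (as opposed to what the CLASSPATH string appears to contain) is a small probe like the following. This is an illustrative sketch, not part of systest; if it prints false for a class whose jar seems to be listed, the jar path is wrong or the variable was not exported to the process that matters.

```java
// Minimal probe: ask the JVM's application classloader whether a class
// resolves, which is what ultimately matters (not the CLASSPATH string).
public class ClasspathProbe {
    static boolean resolves(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The effective classpath this JVM was actually started with.
        System.out.println(System.getProperty("java.class.path"));
        System.out.println(resolves("junit.framework.TestCase"));
    }
}
```

Running this with the exact command line used to launch the test shows whether the failing process sees the same classpath as your shell.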
Re: Embedding ZK in another application
Hi, Good question. We are planning to do something similar as well, and it will be great to know if there are any issues with embedding the ZK server in an app. We simply use QuorumPeerMain and QuorumPeer from our app to start/stop the ZK server. Is this not a good way to do it?

On Fri, Apr 23, 2010 at 1:28 PM, Asankha C. Perera asan...@apache.org wrote:

Hi All, I'm very new to ZK, and am looking at embedding ZK into an app that needs cluster management - the objective is to use ZK to propagate application cluster control operations (e.g. shutdown etc) across nodes. I came across this post [1] from the user list by Ted Dunning from some months back:

"My experience with Katta has led me to believe that embedding a ZK in a product is almost always a bad idea. The problems are that you can't administer the Zookeeper cluster independently and that the cluster typically goes down when the associated service goes down."

However, I believe that both of the above are fine to live with for the application under consideration, as ZK will be used only to coordinate the larger application. Is there anything else that needs to be considered? And can I safely shut down the clientPort, since the application is always in the same JVM - but if I do that, how would I connect to ZK thereafter?

thanks and regards
asankha

[1] http://markmail.org/message/tjonwec7p7dhfpms
Re: Embedding ZK in another application
Hi Mahadev, Ted, Thanks for the feedback.

On Fri, Apr 23, 2010 at 3:02 PM, Ted Dunning ted.dunn...@gmail.com wrote:

It is, of course, your decision, but a key coordination function is to determine whether your application is up or not. That is very hard to do if Zookeeper is inside your application.

On Fri, Apr 23, 2010 at 10:28 AM, Asankha C. Perera asan...@apache.org wrote:

However, I believe that both of the above are fine to live with for the application under consideration, as ZK will be used only to coordinate the larger application. Is there anything else that needs to be considered - and can I safely shut down the clientPort, since the application is always in the same JVM - but if I do that, how would I connect to ZK thereafter?
Re: Embedding ZK in another application
Hi, Well, it looks like FastLeaderElection.shutdown() is not invoked. This has been the case in 3.3.0. Should have checked on that earlier :-)

On Thu, Apr 29, 2010 at 10:13 AM, Vishal K vishalm...@gmail.com wrote:

Hi Ted, We want the application that embeds the ZK server to keep running even after the ZK server is shut down, so we don't want to restart the application. Also, we prefer not to use zkServer.sh/zkServer.cmd because these are OS dependent (our application will run on Windows as well as Linux). Instead, we thought that calling QuorumPeerMain.initializeAndRun() and QuorumPeerMain.shutdown() would suffice to start and shut down a ZK server, and we wouldn't have to worry about checking the OS. Is there a way to cleanly shut down the ZK server (by invoking the ZK server API) when it is embedded in the application, without actually restarting the application process? Thanks.

On Thu, Apr 29, 2010 at 1:54 AM, Ted Dunning ted.dunn...@gmail.com wrote:

Hmm, it isn't quite clear what you mean by restart without restarting. Why is killing the server and restarting it not an option? It is common to do a rolling restart on a ZK cluster - just restart one server at a time. This is often used during system upgrades.

On Wed, Apr 28, 2010 at 8:22 PM, Vishal K vishalm...@gmail.com wrote:

What is a good way to restart a ZK server (standalone and quorum) without having to restart it? Currently, I have the ZK server embedded in another java application.
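The start/stop pattern being discussed can be sketched generically as a background thread running a blocking server loop plus a cooperative shutdown with a bounded join. This is a self-contained illustration only: the EmbeddedServer interface and SleepingServer stand-in below are hypothetical names; in a real embedding, run() would delegate to the QuorumPeerMain start path and stop() to its shutdown(), which this thread suggests (in 3.3.0) may not tear everything down cleanly.

```java
import java.util.concurrent.CountDownLatch;

// Generic embed-and-shutdown lifecycle sketch. The interface and stub are
// stand-ins for a real embedded ZK server, not actual ZooKeeper API.
public class EmbeddedRunner {
    interface EmbeddedServer {
        void run() throws InterruptedException; // blocks until stopped
        void stop();                            // requests shutdown
    }

    // Stand-in server: blocks on a latch until stop() is called.
    static class SleepingServer implements EmbeddedServer {
        private final CountDownLatch stopLatch = new CountDownLatch(1);
        public void run() throws InterruptedException { stopLatch.await(); }
        public void stop() { stopLatch.countDown(); }
    }

    private final EmbeddedServer server;
    private Thread thread;

    EmbeddedRunner(EmbeddedServer server) { this.server = server; }

    void start() {
        thread = new Thread(() -> {
            try {
                server.run();
            } catch (InterruptedException ignored) {
                // interruption treated as a shutdown request
            }
        }, "embedded-zk");
        thread.setDaemon(true);
        thread.start();
    }

    // Returns true if the server thread exited within the timeout.
    // Bounding the join makes a slow or hung shutdown visible.
    boolean stop(long timeoutMs) throws InterruptedException {
        server.stop();
        thread.join(timeoutMs);
        return !thread.isAlive();
    }
}
```

The join-with-timeout is the part worth keeping even with the real API: it turns an unexpectedly slow shutdown into an observable failure instead of an indefinite wait.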
Securing ZooKeeper connections
Hi All, Since ZooKeeper does not support secure network connections yet, I thought I would poll and see what people are doing to address this problem. Is anyone running ZooKeeper over secure channels (client-server and server-server authentication/encryption)? If yes, can you please elaborate on how you do it? Thanks. Regards, -Vishal
cleanup ZK takes 40-60 seconds
Hi, We have embedded the ZK server in our application. We start a thread in our application and call QuorumPeerMain.InitializeArgs(). When cleaning up ZK, we call QuorumPeerMain.shutdown() and wait for the thread that called InitializeArgs() to finish. These two steps take around 60 seconds. I could probably not wait for InitializeArgs() to finish, and that might speed things up. However, I am not sure why the cleanup should take such a long time. Can anyone comment on this? Thanks. -Vishal
Too many KeeperErrorCode = Session moved messages
Hi All, I am seeing a lot of these messages in our application. I would like to know if I am doing something wrong or if this is a ZK bug.

Setup:
- Server environment: zookeeper.version=3.3.0-925362
- 3-node cluster
- Each node has a few clients that connect to the local server using 127.0.0.1 as the host IP.
- The application first forms a ZK cluster. Once the ZK cluster is formed, each node establishes sessions with its local ZK server. The clients do not know about remote servers, so sessions are always with the local server.

As soon as the ZK clients connected to their respective followers, the ZK leader started spitting out the following messages:

2010-07-01 10:55:36,733 - INFO [ProcessThread:-1:preprequestproces...@405] - Got user-level KeeperException when processing sessionid:0x298d3b1fa9 type:sync: cxid:0x6 zxid:0xfffe txntype:unknown reqpath:/ Error Path:null Error:KeeperErrorCode = Session moved
2010-07-01 10:55:36,748 - INFO [ProcessThread:-1:preprequestproces...@405] - Got user-level KeeperException when processing sessionid:0x298d3b1fa9 type:sync: cxid:0x9 zxid:0xfffe txntype:unknown reqpath:/ Error Path:null Error:KeeperErrorCode = Session moved
2010-07-01 10:55:36,755 - INFO [ProcessThread:-1:preprequestproces...@405] - Got user-level KeeperException when processing sessionid:0x298d3b1fa9 type:sync: cxid:0xb zxid:0xfffe txntype:unknown reqpath:/ Error Path:null Error:KeeperErrorCode = Session moved
2010-07-01 10:55:36,795 - INFO [ProcessThread:-1:preprequestproces...@405] - Got user-level KeeperException when processing sessionid:0x298d3b1fa9 type:sync: cxid:0x10 zxid:0xfffe txntype:unknown reqpath:/ Error Path:null Error:KeeperErrorCode = Session moved
2010-07-01 10:55:36,850 - INFO [ProcessThread:-1:preprequestproces...@405] - Got user-level KeeperException when processing sessionid:0x298d3b1fa90001 type:sync: cxid:0x1 zxid:0xfffe txntype:unknown reqpath:/ Error Path:null Error:KeeperErrorCode = Session moved
2010-07-01 10:55:36,910 - INFO [ProcessThread:-1:preprequestproces...@405] - Got user-level KeeperException when processing sessionid:0x298d3b1fa9 type:sync: cxid:0x1b zxid:0xfffe txntype:unknown reqpath:/ Error Path:null Error:KeeperErrorCode = Session moved
2010-07-01 10:55:36,920 - INFO [ProcessThread:-1:preprequestproces...@405] - Got user-level KeeperException when processing sessionid:0x298d3b1fa9 type:sync: cxid:0x20 zxid:0xfffe txntype:unknown reqpath:/ Error Path:null Error:KeeperErrorCode = Session moved
2010-07-01 10:55:37,019 - INFO [ProcessThread:-1:preprequestproces...@405] - Got user-level KeeperException when processing sessionid:0x298d3b1fa9 type:sync: cxid:0x29 zxid:0xfffe txntype:unknown reqpath:/ Error Path:null Error:KeeperErrorCode = Session moved
2010-07-01 10:55:37,030 - INFO [ProcessThread:-1:preprequestproces...@405] - Got user-level KeeperException when processing sessionid:0x298d3b1fa9 type:sync: cxid:0x2c zxid:0xfffe txntype:unknown reqpath:/ Error Path:null Error:KeeperErrorCode = Session moved
2010-07-01 10:55:37,035 - INFO [ProcessThread:-1:preprequestproces...@405] - Got user-level KeeperException when processing sessionid:0x298d3b1fa9 type:sync: cxid:0x2e zxid:0xfffe txntype:unknown reqpath:/ Error Path:null Error:KeeperErrorCode = Session moved
2010-07-01 10:55:37,065 - INFO [ProcessThread:-1:preprequestproces...@405] - Got user-level KeeperException when processing sessionid:0x298d3b1fa9 type:sync: cxid:0x33 zxid:0xfffe txntype:unknown reqpath:/ Error Path:null Error:KeeperErrorCode = Session moved
2010-07-01 10:55:38,840 - INFO [ProcessThread:-1:preprequestproces...@405] - Got user-level KeeperException when processing sessionid:0x298d3b1fa90001 type:sync: cxid:0x4 zxid:0xfffe txntype:unknown reqpath:/ Error Path:null Error:KeeperErrorCode = Session moved

These sessions were established on the follower:

2010-07-01 08:59:09,890 - INFO [CommitProcessor:0:nioserverc...@1431] - Established session 0x298d3b1fa9 with negotiated timeout 9000 for client /127.0.0.1:50773
2010-07-01 08:59:09,890 - INFO [SvaDefaultBLC-SendThread(localhost.localdom:2181):clientcnxn$sendthr...@701] - Session establishment complete on server localhost.localdom/127.0.0.1:2181, sessionid = 0x298d3b1fa9, negotiated timeout = 9000

The leader is spitting out these messages for every session that it does not own (sessions established by clients with followers). The messages are always seen for a sync request. No other issues are seen with the cluster. I am wondering what the cause of this problem could be. Looking at PrepRequestProcessor, it seems like this message is printed when the owner of the request is not the same as the session owner. But in our application this should never happen, since clients always connect to their local server. Any ideas? Thanks. -Vishal
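For context, the ownership check described above can be modeled roughly as follows. This is a simplified, self-contained model of the comparison, not the actual ZK implementation (the real logic lives in PrepRequestProcessor and the session tracker and differs in detail): the leader records one owner per session, and a request arriving with a different owner is flagged as "session moved".

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of the "session moved" check: each session has one
// owner (the connection/server it came in on); a request arriving with a
// different owner is rejected as moved. Illustrative only.
public class SessionOwnerModel {
    private final Map<Long, Object> owners = new HashMap<>();

    // The first request seen for a session fixes its owner; subsequent
    // requests must carry the same owner or the session is "moved".
    public void submit(long sessionId, Object requestOwner) {
        Object owner = owners.putIfAbsent(sessionId, requestOwner);
        if (owner != null && owner != requestOwner) {
            throw new IllegalStateException(
                "KeeperErrorCode = Session moved, sessionid=0x"
                + Long.toHexString(sessionId));
        }
    }
}
```

In this model, the symptom in the logs corresponds to sync requests reaching the leader with an owner different from the one the leader recorded for the session.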
Re: How to handle Node does not exist error?
Hi, I don't intend to hijack Dr. Hao's email thread here, but I would like to point out two things:

1. I use an embedded server as well, but I don't use any setters. We extend QuorumPeerMain and call the initializeAndRun() function, so we are doing pretty much the same thing that QuorumPeerMain does. However, note that I am seeing the same problem (in ZK 3.3.0) that Dr Hao is seeing. I haven't debugged the cause yet. I assumed that this was my implementation error (and it could still be). Nevertheless, this could turn out to be a bug as well.

2. With respect to Ted's point about backward compatibility, I would suggest taking the approach of having an API to support embedded ZK instead of asking users not to embed ZK.

-Vishal

On Thu, Aug 12, 2010 at 3:18 PM, Ted Dunning ted.dunn...@gmail.com wrote:

It doesn't. But running a ZK cluster that is incorrectly configured can cause this problem, and configuring ZK using setters is likely to be subject to changes in what configuration is needed. Thus, your style of code is more subject to decay over time than is nice. The rest of my comments detail *other* reasons why embedding a coordination layer in the code being coordinated is a bad idea.

On Thu, Aug 12, 2010 at 6:33 AM, Vishal K vishalm...@gmail.com wrote:

Hi Ted, Can you explain why running ZK in embedded mode can cause znode inconsistencies? Thanks. -Vishal

On Thu, Aug 12, 2010 at 12:01 AM, Ted Dunning ted.dunn...@gmail.com wrote:

Try running the server in non-embedded mode. Also, you are assuming that you know everything about how to configure the quorumPeer. That is going to change, and your code will break at that time. If you use a non-embedded cluster, this won't be a problem and you will be able to upgrade the ZK version without having to restart your service. My own opinion is that running an embedded ZK is a serious architectural error.
Since I don't know your particular situation, it might be different, but there is an inherent contradiction involved in running a coordination layer as part of the thing being coordinated. Whatever your software does, it isn't what ZK does. As such, it is better to factor out the ZK functionality and make it completely stable. That gives you a much simpler world and will make it easier for you to troubleshoot your system. The simple fact that you can't take down your service without affecting the reliability of your ZK layer makes this a very bad idea. The problems you are having now are only a preview of what this architectural error leads to. There will be more problems, and many of them are likely to be more subtle and lead to service interruptions and lots of wasted time.

On Wed, Aug 11, 2010 at 8:49 PM, Dr Hao He h...@softtouchit.com wrote:

hi, Ted and Mahadev, Here are some more details about my setup. I run zookeeper in embedded mode with the following code:

quorumPeer = new QuorumPeer();
quorumPeer.setClientPort(getClientPort());
quorumPeer.setTxnFactory(new FileTxnSnapLog(new File(getDataLogDir()), new File(getDataDir())));
quorumPeer.setQuorumPeers(getServers());
quorumPeer.setElectionType(getElectionAlg());
quorumPeer.setMyid(getServerId());
quorumPeer.setTickTime(getTickTime());
quorumPeer.setInitLimit(getInitLimit());
quorumPeer.setSyncLimit(getSyncLimit());
quorumPeer.setQuorumVerifier(getQuorumVerifier());
quorumPeer.setCnxnFactory(cnxnFactory);
quorumPeer.start();

The configuration values are read from the following XML document for server 1:

<cluster tickTime="1000" initLimit="10" syncLimit="5" clientPort="2181" serverId="1">
  <member id="1" host="192.168.2.6:2888:3888"/>
  <member id="2" host="192.168.2.3:2888:3888"/>
  <member id="3" host="192.168.2.4:2888:3888"/>
</cluster>

The other servers have the same configuration except for their ids being changed to 2 and 3. The error occurred on server 3 when I batch-loaded some messages into server 1. However, this error does not always happen. I am not sure exactly what triggered this error yet. I also performed the stat operation on one of the nodes reported as not existing and got:

stat /xpe/queues/3bd7851e79381ef4bfd1a5857b5e34c04e5159e5/msgs/msg001583
Exception in thread "main" java.lang.NullPointerException
    at org.apache.zookeeper.ZooKeeperMain.printStat(ZooKeeperMain.java:129)
    at org.apache.zookeeper.ZooKeeperMain.processZKCmd(ZooKeeperMain.java:715)
    at org.apache.zookeeper.ZooKeeperMain.processCmd(ZooKeeperMain.java:579
Re: How to handle Node does not exist error?
In my case, I am pretty sure that the configuration was right. I will reproduce it and post more info later. Thanks.

On Mon, Aug 16, 2010 at 1:08 PM, Patrick Hunt ph...@apache.org wrote:

Try using the logs, the stat command, or JMX to verify that each ZK server is indeed a leader/follower as expected. You should have one leader and n-1 followers. Verify that you don't have any standalone servers (this is the most frequent error I see - misconfiguration of a server such that it thinks it's a standalone server; I often see a user with 3 standalone servers which they think form a single quorum, and all of the servers are therefore inconsistent with each other).

Patrick

On 08/12/2010 05:42 PM, Ted Dunning wrote:

On Thu, Aug 12, 2010 at 4:57 PM, Dr Hao He h...@softtouchit.com wrote:

hi, Ted, I am a little bit confused here. So, is the node inconsistency problem that Vishal and I have seen here most likely caused by configuration or by embedding? If it is the former, I'd appreciate it if you can point out where those silly mistakes have been made and the correct way to embed ZK.

I think it is likely due to misconfiguration, but I don't know what the issue is exactly. I think that another poster suggested that you ape the normal ZK startup process more closely. That sounds good, but it may be incompatible with your goals of integrating all configuration into a single XML file and not using the normal ZK configuration process. Your thought about forking ZK is a good one, since there are calls to System.exit() that could wreak havoc.

Although I agree with your comments about the architectural issues that embedding may lead to, and we are aware of those, I do not agree that embedding will always lead to those issues.

I agree that embedding won't always lead to those issues, and your application is a reasonable counter-example.
As is common, I think that the exception proves the rule since your system is really just another way to launch an independent ZK cluster rather than an example of ZK being embedded into an application.
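Patrick's "use the stat command" suggestion refers to ZooKeeper's four-letter admin words (e.g. stat, srvr), which are sent over a plain TCP connection to the client port; each server's reply includes a Mode line (leader/follower/standalone). Below is a hedged sketch of such a probe. The stub server exists only to make the example self-contained; the exact reply format of a real server depends on the ZK version.

```java
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Sends a four-letter admin word (e.g. "stat") to a ZooKeeper client port
// and returns the reply. The stub below stands in for a real ZK server.
public class FourLetterProbe {
    public static String send(String host, int port, String cmd) throws Exception {
        try (Socket s = new Socket(host, port)) {
            OutputStream out = s.getOutputStream();
            out.write(cmd.getBytes(StandardCharsets.US_ASCII));
            out.flush();
            s.shutdownOutput(); // the server replies, then closes
            InputStream in = s.getInputStream();
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return new String(buf.toByteArray(), StandardCharsets.US_ASCII);
        }
    }

    // Stub that answers one connection with a fixed "Mode:" line, the way
    // a real server reports leader/follower/standalone in its stat reply.
    public static ServerSocket startStub() throws Exception {
        ServerSocket server = new ServerSocket(0);
        Thread t = new Thread(() -> {
            try (Socket c = server.accept()) {
                c.getOutputStream().write(
                    "Mode: follower\n".getBytes(StandardCharsets.US_ASCII));
            } catch (Exception ignored) {
            }
        });
        t.setDaemon(true);
        t.start();
        return server;
    }
}
```

Against a real ensemble you would run send(host, 2181, "stat") for each configured server and verify that exactly one reports Mode: leader and none report Mode: standalone.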
Re: Weird ephemeral node issue
Hi Qing, Can you list the znodes from the monitor and from the node that the monitor is restarting (run zkCli.sh on both machines)? I am curious to see if the node that did not receive the SESSION_EXPIRED event still has the znode in its database. Also, can you describe your setup? Can you send out the logs and the zoo.cfg file? Thanks. -Vishal

On Tue, Aug 17, 2010 at 3:31 AM, Qing Yan qing...@gmail.com wrote:

Forgot to mention: the process looks fine - normal memory footprint and CPU usage, and it generates the expected results. The only thing is the missing ephemeral node in ZK.
Re: Session expiration caused by time change
Hi, I remember Ben had opened a jira for clock jumps earlier: https://issues.apache.org/jira/browse/ZOOKEEPER-366. It is not uncommon to have clocks jump forward in virtualized environments. It is desirable to modify ZooKeeper to handle this situation (as much as possible) internally. It would need to be done for both client-server and server-server connections. One obvious solution is to retry a few times (send a ping) after getting a timeout. Another way is to count the number of pings that have been sent after receiving the timeout. If the number of pings does not match the expected number (say, 5 ping attempts should have finished for a 5-second timeout), then wait until all the pings are finished. In effect, do not completely rely on the clock. Any comments? -Vishal

On Thu, Aug 19, 2010 at 3:52 AM, Qing Yan qing...@gmail.com wrote:

Oh.. our servers are also running in a virtualized environment.

On Thu, Aug 19, 2010 at 2:58 PM, Martin Waite waite@gmail.com wrote:

Hi, I have tripped over similar problems testing Red Hat Cluster in virtualised environments. I don't know whether recent Linux kernels have improved their interaction with VMWare, but in our environments clock drift caused by lost ticks can be substantial, requiring NTP to sometimes jump the clock rather than control acceleration. In one of our internal production rigs, the local NTP servers themselves were virtualised - causing absolute mayhem when heavy loads hit the other guests on the same physical hosts. The effect on RHCS (v2.0) is quite dramatic. A forward jump in time by 10 seconds always causes a member to prematurely time out on a network read, causing the member to drop out and trigger a cluster reconfiguration. Apparently NTP is integrated with RHCS version 3, but I don't know what is meant by that. I guess this post is not entirely relevant to ZK, but I am just making the point that virtualisation (of NTP servers and/or clients) can cause repeated premature timeouts.
On Linux, I believe that there is a class of timers provided that is immune to this, but I doubt that there is a platform-independent way of coping with it. My 2p. regards, Martin

On 18 August 2010 16:53, Patrick Hunt ph...@apache.org wrote:

Do you expect the time to be wrong frequently? If ntp is running, it should never get out of sync by more than a small amount. As long as this is less than ~your timeout, you should be fine. Patrick

On 08/18/2010 01:04 AM, Qing Yan wrote:

Hi, The test case is fairly simple. We have a client which connects to ZK, registers an ephemeral node, and watches on it. Now change the client machine's time - session killed. Here is the log:

2010-08-18 04:24:57,782 INFO com.taobao.timetunnel2.cluster.service.AgentService: Host name kgbtest1.corp.alimama.com
2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client environment:zookeeper.version=3.2.2-888565, built on 12/08/2009 21:51 GMT
2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client environment:host.name=kgbtest1.corp.alimama.com
2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.version=1.6.0_13
2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.vendor=Sun Microsystems Inc.
2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.home=/usr/java/jdk1.6.0_13/jre
2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.class.path=/home/admin/TimeTunnel2/cluster/bin/../conf/agent/:/home/admin/TimeTunnel2/cluster/bin/../lib/slf4j-log4j12-1.5.2.jar:/home/admin/TimeTunnel2/cluster/bin/../lib/slf4j-api-1.5.2.jar:/home/admin/TimeTunnel2/cluster/bin/../lib/timetunnel2-cluster-0.0.1-SNAPSHOT.jar:/home/admin/TimeTunnel2/cluster/bin/../lib/zookeeper-3.2.2.jar:/home/admin/TimeTunnel2/cluster/bin/../lib/log4j-1.2.14.jar:/home/admin/TimeTunnel2/cluster/bin/../lib/gson-1.4.jar:/home/admin/TimeTunnel2/cluster/bin/../lib/zk-recipes.jar
2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.library.path=/usr/java/jdk1.6.0_13/jre/lib/amd64/server:/usr/java/jdk1.6.0_13/jre/lib/amd64:/usr/java/jdk1.6.0_13/jre/../lib/amd64:/usr/java/packages/lib/amd64:/lib:/usr/lib
2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.io.tmpdir=/tmp
2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.compiler=NA
2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client environment:os.name=Linux
2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client environment:os.arch=amd64
2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client environment:os.version=2.6.18-164.el5
2010-08-18 04:24:57,789 INFO
Re: Session expiration caused by time change
Hi Ted, I haven't given it serious thought yet, but I don't think it is necessary for the cluster to keep track of time. A node can make its own decision. For the sake of argument, let's say that we have a client and a server with the following policy:

1. The client is supposed to send a ping to the server every 1 second.
2. If the server does not hear from the client for 5 seconds, the server declares that the client is dead.
3. Similarly, if the client cannot communicate with the server for 5 seconds, the client declares that the server is dead.

If the client receives a timeout (say while doing some IO) because of a time jump, it should check the number of pings that have failed with the server. If the number is 5, then this is a true failure. If the number is less than 5, then this is because of a time drift. On the server side, the server can attempt to reconnect (or send a ping to the client) after it receives a timeout. Thus, if the timeout occurred because of time drift, the server will reconnect and continue. We should of course have an upper bound on the number of retries, etc.

For ZK, it is important to handle time jumps on the ZK leader. I believe that the pattern of these problems is a slow slippage behind and a sudden jump forward. You won't see the slippage; you will mainly see the jump forward. Note that with a large enough number of nodes, multiple nodes could see their time jumping forward. Therefore, comparing time between two servers may not help.

On Thu, Aug 19, 2010 at 7:51 AM, Vishal K vishalm...@gmail.com wrote:

Hi, I remember Ben had opened a jira for clock jumps earlier: https://issues.apache.org/jira/browse/ZOOKEEPER-366. It is not uncommon to have clocks jump forward in virtualized environments. It is desirable to modify ZooKeeper to handle this situation (as much as possible) internally. It would need to be done for both client-server and server-server connections. One obvious solution is to retry a few times (send a ping) after getting a timeout.
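The ping-counting policy discussed in this thread can be sketched as follows. This is a self-contained illustration, not ZooKeeper's actual client code: a reported timeout is only trusted if the expected number of pings actually went unanswered, and deadlines are taken from the monotonic System.nanoTime rather than the wall clock, which is what makes the decision robust against forward clock jumps.

```java
import java.util.concurrent.TimeUnit;

// Sketch of the policy described above: distrust a timeout unless the
// expected number of pings has actually failed. Illustrative only.
public class PingCountTimeout {
    private final int pingsPerTimeout; // e.g. 5 pings for a 5s timeout
    private int failedPings = 0;

    public PingCountTimeout(int pingsPerTimeout) {
        this.pingsPerTimeout = pingsPerTimeout;
    }

    public void onPingAcked() { failedPings = 0; }
    public void onPingFailed() { failedPings++; }

    // Called when an IO operation reports a timeout. Only declare the peer
    // dead if enough pings have actually failed; otherwise the "timeout"
    // was likely a clock jump, and we should keep pinging.
    public boolean isPeerDead() {
        return failedPings >= pingsPerTimeout;
    }

    // Monotonic deadline helper: System.nanoTime is unaffected by
    // wall-clock changes, unlike System.currentTimeMillis.
    public static long monotonicDeadline(long timeoutMs) {
        return System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    }
}
```

A timeout that fires after only 2 of 5 pings have failed would, under this policy, be treated as a clock artifact and retried rather than escalated to session expiry.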
Re: Session expiration caused by time change
Hi Ben, Comments inline.

On Thu, Aug 19, 2010 at 5:33 PM, Benjamin Reed br...@yahoo-inc.com wrote:

if we can't rely on the clock, we cannot say things like "if ... for 5 seconds".

"if ... for 5 seconds" indicates the timeout given by the socket library. After the timeout, we can verify that the timeout received was not a side effect of a time jump by looking at the number of ping attempts.

also, clients connect to servers, not vice versa, so we cannot say things like "the server can attempt to reconnect".

In the scenario described below, wouldn't it be OK for the server to just send a ping request to see if the client is really dead?

ben

On 08/19/2010 10:17 AM, Vishal K wrote:

Hi Ted, I haven't given it serious thought yet, but I don't think it is necessary for the cluster to keep track of time. A node can make its own decision.
Understanding ZooKeeper data file management and LogFormatter
Hi All, Can you please share your experience regarding ZK snapshot retention and recovery policies? We have an application where we never need to roll back (i.e., revert to a previous state by using old snapshots). Given this, I am trying to understand under what circumstances we would ever need to use old ZK snapshots. I understand a lot of these decisions depend on the application and the amount of redundancy used at every level (e.g., the RAID level where the snapshots are stored) in the product. To simplify the discussion, I would like to rule out application characteristics and focus mainly on data consistency.

- Assuming that we have a 3-node cluster, I am trying to figure out when I would really need to use old snapshot files. With 3 nodes we already have at least 2 servers with a consistent database. If I lose files on one of the servers, I can use the files from another; in fact, the ZK server join will take care of this. I can remove files from a faulty node and reboot that node, and the faulty node will sync with the leader.

- The old files will be useful if the current snapshot and/or log files are lost or corrupted on all 3 servers. If the loss is due to a disaster (the case where we lose all 3 servers), one would have to keep the snapshots on some external storage to recover. However, if the current snapshot file is corrupted on all 3 servers, then the most likely cause would be a bug in ZK, in which case, how can I trust the consistency of the old snapshots?

- Given a set of snapshot and log files, how can I verify the correctness of these files? For example, what if one of the intermediate snapshot files is corrupt?

- The Admin's guide says "Using older log and snapshot files, you can look at the previous state of ZooKeeper servers and even restore that state. The LogFormatter class allows an administrator to look at the transactions in a log." Is there a tool that does this for the admin? The LogFormatter only displays the transactions in the log file.

- Has anyone ever had to play with the snapshot files in production?

Thanks in advance. Regards, -Vishal
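On the verification question above: ZooKeeper's transaction log stores an Adler32 checksum alongside each serialized record, which is how corruption in a log file is detected when the log is replayed. The sketch below shows that style of check in isolation; the record layout here is invented for illustration and only the checksum technique mirrors the real on-disk format.

```java
import java.util.zip.Adler32;
import java.util.zip.Checksum;

// Simplified sketch: ZooKeeper's transaction log checksums each record
// with Adler32 so that corruption is detectable on replay. The
// "payload + stored checksum" shape here is illustrative only; it is
// not the actual TxnLog record layout.
public class RecordVerifier {
    public static long checksumOf(byte[] payload) {
        Checksum crc = new Adler32();
        crc.update(payload, 0, payload.length);
        return crc.getValue();
    }

    /** Returns true if the stored checksum matches the payload bytes. */
    public static boolean verify(byte[] payload, long storedChecksum) {
        return checksumOf(payload) == storedChecksum;
    }
}
```

A single flipped byte in the payload changes the Adler32 value, so a scan that recomputes and compares checksums can flag a corrupt intermediate file without interpreting its contents.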
Re: znode inconsistencies across ZooKeeper servers
Hi Patrick, You are correct, the test restarts both the ZooKeeper server and the client. The client opens a new connection after restarting. So we would expect the ephemeral znode (/foo) to expire after the session timeout. However, the client with the new session creates the ephemeral znode (/foo) again after it reboots (it sets a watch for /foo and recreates /foo if it is deleted or doesn't exist). The client is not reusing the session ID. What I expect to see is that the older /foo should expire, after which a new /foo should get created. Is my expectation correct?

What confuses me is the following output of 3 successive getstat /foo requests on A (note the zxid, time, and owner fields). Notice that the older znode reappeared. At the same time, when I do getstat at B and C, I see the newer /foo.

log4j:WARN No appenders could be found for logger (org.apache.zookeeper.ZooKeeper).
log4j:WARN Please initialize the log4j system properly.
cZxid = 0x105ef
ctime = Tue Oct 05 15:00:50 UTC 2010
mZxid = 0x105ef
mtime = Tue Oct 05 15:00:50 UTC 2010
pZxid = 0x105ef
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x2b7ce57ce4
dataLength = 54
numChildren = 0

log4j:WARN No appenders could be found for logger (org.apache.zookeeper.ZooKeeper).
log4j:WARN Please initialize the log4j system properly.
cZxid = 0x10607
ctime = Tue Oct 05 15:01:07 UTC 2010
mZxid = 0x10607
mtime = Tue Oct 05 15:01:07 UTC 2010
pZxid = 0x10607
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x2b7ce5bda4
dataLength = 54
numChildren = 0

log4j:WARN No appenders could be found for logger (org.apache.zookeeper.ZooKeeper).
log4j:WARN Please initialize the log4j system properly.
cZxid = 0x105ef
ctime = Tue Oct 05 15:00:50 UTC 2010
mZxid = 0x105ef
mtime = Tue Oct 05 15:00:50 UTC 2010
pZxid = 0x105ef
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x2b7ce57ce4
dataLength = 54
numChildren = 0

Thanks for your help.
-Vishal

On Wed, Oct 6, 2010 at 4:45 PM, Patrick Hunt ph...@apache.org wrote:

Vishal, the attachment seems to be getting removed by the list daemon (I don't have it); can you create a JIRA and attach it? Also, this is a good question for the ppl on zookeeper-user. (ccing) You are aware that ephemeral znodes are tied to the session? And that sessions only expire after the session timeout period? At that time any znodes created during that session are deleted. The fact that you are killing your client process leads me to believe that you are not closing the session cleanly (meaning that it will eventually expire after the session timeout period), in which case the ephemeral znodes _should_ reappear when A is restarted and successfully rejoins the cluster (at least until the session timeout is exceeded). Patrick

On Tue, Oct 5, 2010 at 11:04 AM, Vishal K vishalm...@gmail.com wrote:

Hi, I have a 3-node ZK cluster (A, B, C). On one of the nodes (node A), I have a ZK client running that connects to the local server and creates an ephemeral znode to indicate to clients on other nodes that it is online. I have a test script that reboots the ZooKeeper server as well as the client on A. The test does a getstat on the ephemeral znode created by the client on A. I am seeing that the view of znodes on A is different from that of the other 2 nodes. I can tell this from the session ID that the client gets after reconnecting to the local ZK server. So the test is simple:

- kill the ZooKeeper server and client process
- wait for a few seconds
- do zkCli.sh stat ... test.out

What I am seeing is that the ephemeral znode with the old zxid, time, and session ID is reappearing on node A. I have attached the output of 3 consecutive getstat requests from the test (see client_getstat.out). Notice that the third output is the same as the first one. That is, the old ephemeral znode reappeared at A. However, both B and C show the latest znode with the correct time, zxid, and session ID (output not attached).

After this point, all subsequent getstat requests on A show the old znode, whereas B and C show the correct znode every time the client on A comes online. This is very perplexing. Earlier I thought this was a bug in my client implementation, but the test shows that the ZK server on A is out of sync with the rest of the servers after a reboot. The stat command to each server shows that the servers are in sync as far as zxids are concerned (see stat.out). So there is something wrong with A's local database that is causing this problem. Has anyone seen this before? I will be doing more debugging in the next few days. Comments/suggestions for further debugging are welcome. -Vishal
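One detail behind Patrick's point that "sessions only expire after the session timeout period": the server does not expire each session at an exact instant. Expiry deadlines are rounded up into tick-sized buckets, so a session (and its ephemeral znodes) can legitimately outlive the nominal timeout by up to one extra tick. The sketch below shows only that rounding idea, loosely modeled on the bucketing in ZooKeeper's session tracker (SessionTrackerImpl); it omits everything else and the class name here is invented.

```java
// Simplified sketch of session-expiry bucketing: deadlines are rounded
// UP to the next interval boundary, so a session is never expired early
// and may survive up to one extra interval past its nominal timeout.
// Loosely modeled on ZooKeeper's SessionTrackerImpl; greatly simplified.
public class ExpiryBuckets {
    private final long expirationIntervalMs; // typically the tickTime

    public ExpiryBuckets(long expirationIntervalMs) {
        this.expirationIntervalMs = expirationIntervalMs;
    }

    /** Round a raw deadline up to the next bucket boundary. */
    public long roundToInterval(long timeMs) {
        return (timeMs / expirationIntervalMs + 1) * expirationIntervalMs;
    }

    /** Bucket in which a session touched at nowMs with this timeout expires. */
    public long expiryBucket(long nowMs, long sessionTimeoutMs) {
        return roundToInterval(nowMs + sessionTimeoutMs);
    }
}
```

So in a test like the one above, an ephemeral znode from an abandoned session is expected to linger slightly beyond the session timeout; only a znode that persists well past that window indicates a real inconsistency.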
Reading znodes directly from snapshot and log files
Hi, Is it possible to read znodes directly from the snapshot and log files instead of using the ZooKeeper API? In case a ZK ensemble is not available, can I log in to all available nodes and run a utility that will dump all znodes? Thanks. -Vishal