Re: Session expiration caused by time change
Hi,

In our testing of Red Hat Cluster, we could reproduce the NTP impact by jumping the clock backwards and forwards, just using the date command in a tight-ish loop:

    use strict;
    use warnings;

    # Jump the system clock 20 seconds forwards, then 20 seconds backwards, forever.
    my $dir = 1;
    while (1) {
        jump_time($dir);
        $dir = $dir * -1;
    }

    sub jump_time {
        my ($dir) = @_;
        my $step = 20 * $dir;
        my $time = scalar localtime( time() + $step );
        # Setting the date needs root; print date's output and its exit status.
        print `/bin/date -s "$time"`, $?, "\n";
        select( undef, undef, undef, 0.3 );    # sleep for 0.3 seconds
    }

Obviously not a realistic test, but it soon flushes out problems.

regards,
Martin

On 19 August 2010 23:51, Benjamin Reed br...@yahoo-inc.com wrote:
> i'm updating ZOOKEEPER-366 with this discussion and try to get a patch out. Qing (or anyone else), can you reproduce it pretty easily?
>
> thanx
> ben
>
> On 08/19/2010 09:29 AM, Ted Dunning wrote:
>> Nice (modulo inverting the > in your text). Option 2 seems very simple. That always attracts me.

Re: Session expiration caused by time change
i put up a patch that should address the problem. now i need to write a test case. the only way i can think of is to change the call to System.currentTimeMillis to a utility class that calls System.currentTimeMillis that i can mock for testing. any better ideas?

ben

On 08/19/2010 03:53 PM, Ted Dunning wrote:
> Put in a four letter command that will put the server to sleep for 15 seconds! :-)
>
> On Thu, Aug 19, 2010 at 3:51 PM, Benjamin Reed br...@yahoo-inc.com wrote:
>> i'm updating ZOOKEEPER-366 with this discussion and try to get a patch out. Qing (or anyone else), can you reproduce it pretty easily?

Re: Session expiration caused by time change
Mocking the time via a utility was my thought. Mocking System itself is scary.

On Aug 20, 2010, at 1:18 PM, Benjamin Reed br...@yahoo-inc.com wrote:
> i put up a patch that should address the problem. now i need to write a test case. the only way i can think of is to change the call to System.currentTimeMillis to a utility class that calls System.currentTimeMillis that i can mock for testing. any better ideas?
>
> ben

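The utility-class idea discussed above could look roughly like the following. This is only a sketch: TimeSource, SystemTimeSource and FakeTimeSource are hypothetical names made up for illustration, and nothing in this thread shows what the actual ZOOKEEPER-366 patch does.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical time source that code would call instead of System.currentTimeMillis(). */
interface TimeSource {
    long currentTimeMillis();
}

/** Production implementation: simply delegates to the wall clock. */
final class SystemTimeSource implements TimeSource {
    @Override
    public long currentTimeMillis() {
        return System.currentTimeMillis();
    }
}

/** Test implementation: time only moves when the test says so. */
final class FakeTimeSource implements TimeSource {
    private final AtomicLong now;

    FakeTimeSource(long startMillis) {
        this.now = new AtomicLong(startMillis);
    }

    @Override
    public long currentTimeMillis() {
        return now.get();
    }

    /** Simulate a clock jump; a negative delta jumps backwards. */
    void jump(long deltaMillis) {
        now.addAndGet(deltaMillis);
    }
}
```

A test could then hand a FakeTimeSource to the code under test and call jump(20000) to imitate the 20 second jumps from the date loop earlier in the thread, without touching the real system clock.
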
Re: Session expiration caused by time change
You can always increase your timeouts a bit.

On Thu, Aug 19, 2010 at 12:52 AM, Qing Yan qing...@gmail.com wrote:
> Oh.. our servers are also running in a virtualized environment.
>
> On Thu, Aug 19, 2010 at 2:58 PM, Martin Waite waite@gmail.com wrote:
>> Hi,
>>
>> I have tripped over similar problems testing Red Hat Cluster in virtualised environments. I don't know whether recent linux kernels have improved their interaction with VMWare, but in our environments clock drift caused by lost ticks can be substantial, requiring NTP to sometimes jump the clock rather than control acceleration. In one of our internal production rigs, the local NTP servers themselves were virtualised - causing absolute mayhem when heavy loads hit the other guests on the same physical hosts.
>>
>> The effect on RHCS (v2.0) is quite dramatic. A forward jump in time by 10 seconds always causes a member to prematurely time-out on a network read, causing the member to drop out and trigger a cluster reconfiguration. Apparently NTP is integrated with RHCS version 3, but I don't know what is meant by that.
>>
>> I guess this post is not entirely relevant to ZK, but I am just making the point that virtualisation (of NTP servers and/or clients) can cause repeated premature timeouts. On Linux, I believe that there is a class of timers provided that is immune to this, but I doubt that there is a platform independent way of coping with this.
>>
>> My 2p.
>>
>> regards,
>> Martin
>>
>> On 18 August 2010 16:53, Patrick Hunt ph...@apache.org wrote:
>>> Do you expect the time to be wrong frequently? If ntp is running it should never get out of sync more than a small amount. As long as this is less than ~your timeout you should be fine.
>>>
>>> Patrick
>>>
>>> On 08/18/2010 01:04 AM, Qing Yan wrote:
>>>> Hi,
>>>>
>>>> The testcase is fairly simple. We have a client which connects to ZK, registers an ephemeral node and watches on it. Now change the client machine's time - session killed..
>>>>
>>>> Here is the log:
>>>>
>>>> 2010-08-18 04:24:57,782 INFO com.taobao.timetunnel2.cluster.service.AgentService: Host name kgbtest1.corp.alimama.com
>>>> 2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client environment:zookeeper.version=3.2.2-888565, built on 12/08/2009 21:51 GMT
>>>> 2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client environment:host.name=kgbtest1.corp.alimama.com
>>>> 2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.version=1.6.0_13
>>>> 2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.vendor=Sun Microsystems Inc.
>>>> 2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.home=/usr/java/jdk1.6.0_13/jre
>>>> 2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.class.path=/home/admin/TimeTunnel2/cluster/bin/../conf/agent/:/home/admin/TimeTunnel2/cluster/bin/../lib/slf4j-log4j12-1.5.2.jar:/home/admin/TimeTunnel2/cluster/bin/../lib/slf4j-api-1.5.2.jar:/home/admin/TimeTunnel2/cluster/bin/../lib/timetunnel2-cluster-0.0.1-SNAPSHOT.jar:/home/admin/TimeTunnel2/cluster/bin/../lib/zookeeper-3.2.2.jar:/home/admin/TimeTunnel2/cluster/bin/../lib/log4j-1.2.14.jar:/home/admin/TimeTunnel2/cluster/bin/../lib/gson-1.4.jar:/home/admin/TimeTunnel2/cluster/bin/../lib/zk-recipes.jar
>>>> 2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.library.path=/usr/java/jdk1.6.0_13/jre/lib/amd64/server:/usr/java/jdk1.6.0_13/jre/lib/amd64:/usr/java/jdk1.6.0_13/jre/../lib/amd64:/usr/java/packages/lib/amd64:/lib:/usr/lib
>>>> 2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.io.tmpdir=/tmp
>>>> 2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.compiler=NA
>>>> 2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client environment:os.name=Linux
>>>> 2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client environment:os.arch=amd64
>>>> 2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client environment:os.version=2.6.18-164.el5
>>>> 2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client environment:user.name=admin
>>>> 2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client environment:user.home=/home/admin
>>>> 2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client environment:user.dir=/home/admin/TimeTunnel2/cluster/log
>>>> 2010-08-18 04:24:57,790 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=xentest10-vm5.corp.alimama.com:2181, xentest10-vm6.corp.alimama.com:2181, xentest10-vm9.corp.alimama.com:2181 sessionTimeout=60 watcher=com.taobao.timetunnel2.cluster.service.agentserv...@48d6c16c
>>>> 2010-08-18 04:24:57,791 INFO org.apache.zookeeper.ClientCnxn: zookeeper.disableAutoWatchReset is false
>>>> 2010-08-18

Re: Session expiration caused by time change
Hi,

I remember Ben had opened a jira for clock jumps earlier: https://issues.apache.org/jira/browse/ZOOKEEPER-366. It is not uncommon to have clocks jump forward in virtualized environments.

It is desirable to modify ZooKeeper to handle this situation (as much as possible) internally. It would need to be done for both client - server connections and server - server connections.

One obvious solution is to retry a few times (send ping) after getting a timeout. Another way is to count the number of pings that have been sent after receiving the timeout. If the number of pings does not match the expected number (say, 5 ping attempts should have finished for a 5 sec timeout), then wait till all the pings are finished. In effect, do not rely completely on the clock. Any comments?

-Vishal

On Thu, Aug 19, 2010 at 3:52 AM, Qing Yan qing...@gmail.com wrote:
> Oh.. our servers are also running in a virtualized environment.

Re: Session expiration caused by time change
Another option would be for the cluster to compare times and note when one member seems to be lagging. Restoration of that lag would then be less remarkable.

I believe that the pattern of these problems is a slow slippage behind and a sudden jump forward.

On Thu, Aug 19, 2010 at 7:51 AM, Vishal K vishalm...@gmail.com wrote:
> Hi,
>
> I remember Ben had opened a jira for clock jumps earlier: https://issues.apache.org/jira/browse/ZOOKEEPER-366. It is not uncommon to have clocks jump forward in virtualized environments.
>
> It is desirable to modify ZooKeeper to handle this situation (as much as possible) internally. It would need to be done for both client - server connections and server - server connections.
>
> One obvious solution is to retry a few times (send ping) after getting a timeout. Another way is to count the number of pings that have been sent after receiving the timeout. If the number of pings does not match the expected number (say, 5 ping attempts should have finished for a 5 sec timeout), then wait till all the pings are finished. In effect, do not rely completely on the clock. Any comments?
>
> -Vishal

Re: Session expiration caused by time change
Hi,

I'm not sure if you mean the timers I was on about earlier. If so, http://linux.die.net/man/3/clock_gettime

    Sufficiently recent versions of GNU libc and the Linux kernel support the following clocks:
    ...
    CLOCK_MONOTONIC
        Clock that cannot be set and represents monotonic time since some unspecified starting point.

Although re-reading that now, I might have applied wishful thinking to my interpretation.

regards,
Martin

On 19 August 2010 16:13, Benjamin Reed br...@yahoo-inc.com wrote:
> do you have a pointer to those timers?
>
> thanx
> ben
>
> On 08/18/2010 11:58 PM, Martin Waite wrote:
>> On Linux, I believe that there is a class of timers provided that is immune to this, but I doubt that there is a platform independent way of coping with this.

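For the Java side of this discussion, the closest standard counterpart to a monotonic clock is System.nanoTime(), which the Javadoc describes as usable only for measuring elapsed time and unrelated to wall-clock time. On Linux it is commonly backed by CLOCK_MONOTONIC, though that is a JVM implementation detail rather than a guarantee. A minimal sketch of a timeout tracked that way, where MonotonicDeadline is a hypothetical helper name and not anything from ZooKeeper:

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: tracks a timeout against System.nanoTime() instead of the wall clock. */
final class MonotonicDeadline {
    private final long startNanos;
    private final long timeoutNanos;

    MonotonicDeadline(long timeout, TimeUnit unit) {
        this.startNanos = System.nanoTime();
        this.timeoutNanos = unit.toNanos(timeout);
    }

    /** Elapsed time since construction; setting the system date does not change it. */
    long elapsedMillis() {
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
    }

    /** True once the timeout has elapsed. Comparing the difference copes with nanoTime wrap-around. */
    boolean expired() {
        return System.nanoTime() - startNanos >= timeoutNanos;
    }
}
```

Whether such a clock keeps ticking correctly inside a virtual machine that loses ticks is a separate question that this sketch does not address.
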
Re: Session expiration caused by time change
True. But it knows that there has been a jump.

Quiet time can be distinguished from clock shift by assuming that members of the cluster don't all jump at the same time.

I would imagine that a recent clock jump estimate could be kept and buckets that would otherwise expire due to such a jump could be given a bit of a second lease on life, delaying all of their expiration. Since time-outs are relatively short, the server would be able to forget about the bump very shortly.

On Thu, Aug 19, 2010 at 8:22 AM, Benjamin Reed br...@yahoo-inc.com wrote:
> if we try to use network messages to detect and correct the situation, it seems like we would recreate the problem we are having with ntp, since that is exactly what it does.

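The "second lease on life" idea could be sketched as moving every pending expiration bucket forward by the estimated jump. The sketch below assumes the buckets are held in a TreeMap keyed by expiration time in milliseconds; that layout, and the names BucketShifter and shiftForward, are assumptions for illustration only and say nothing about how ZooKeeper actually stores its session sets.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: gives every expiration bucket a later deadline after a clock jump. */
final class BucketShifter {
    private BucketShifter() {
    }

    /** Returns a copy of the buckets with each expiration time moved forward by jumpMillis. */
    static <V> TreeMap<Long, V> shiftForward(TreeMap<Long, V> buckets, long jumpMillis) {
        TreeMap<Long, V> shifted = new TreeMap<>();
        for (Map.Entry<Long, V> entry : buckets.entrySet()) {
            shifted.put(entry.getKey() + jumpMillis, entry.getValue());
        }
        return shifted;
    }
}
```

After an estimated 20 second forward jump, a bucket due at 2000 would become due at 22000, so sessions in it keep their remaining lifetime instead of expiring at once.
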
Re: Session expiration caused by time change
yes, you are right. we could do this. it turns out that the expiration code is very simple:

    while (running) {
        currentTime = System.currentTimeMillis();
        // not yet time for the next bucket: sleep until it is due
        if (nextExpirationTime > currentTime) {
            this.wait(nextExpirationTime - currentTime);
            continue;
        }
        // expire every session in the bucket that has come due
        SessionSet set;
        set = sessionSets.remove(nextExpirationTime);
        if (set != null) {
            for (SessionImpl s : set.sessions) {
                sessionsById.remove(s.sessionId);
                expirer.expire(s);
            }
        }
        nextExpirationTime += expirationInterval;
    }

so we can detect a jump very easily: if nextExpirationTime > currentTime, we have jumped ahead in time.

now the question is, what do we do with this information?

option 1) we could figure out the jump (nextExpirationTime-currentTime is a good estimate) and move all of the sessions forward by that amount.

option 2) we could converge on the time by having a policy to always wait at least a half a tick time.

there probably are other options as well. i kind of like option 2. worst case is it will make the sessions expire in half the time that they should, but this shouldn't be too much of a problem since clients send a ping if they are idle for 1/3 of their session timeout.

ben

On 08/19/2010 08:39 AM, Ted Dunning wrote:
> True. But it knows that there has been a jump.
>
> Quiet time can be distinguished from clock shift by assuming that members of the cluster don't all jump at the same time.
>
> I would imagine that a recent clock jump estimate could be kept and buckets that would otherwise expire due to such a jump could be given a bit of a second lease on life, delaying all of their expiration. Since time-outs are relatively short, the server would be able to forget about the bump very shortly.

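One possible reading of option 2, as a minimal sketch rather than the actual patch: keep the bucket bookkeeping from the loop above, but never wait less than half an expiration interval before expiring the next bucket. ExpirationPacer and its method names are hypothetical, and treating "more than one full interval behind" as a forward jump is an assumed threshold, not something stated in the thread.

```java
/** Hypothetical pacing logic for option 2: always wait at least half a tick between buckets. */
final class ExpirationPacer {
    private final long expirationInterval;
    private long nextExpirationTime;

    ExpirationPacer(long startMillis, long expirationInterval) {
        this.expirationInterval = expirationInterval;
        this.nextExpirationTime = startMillis + expirationInterval;
    }

    /** Assumed heuristic: the clock jumped ahead if we are more than one interval late. */
    boolean jumpedAhead(long currentTime) {
        return currentTime - nextExpirationTime > expirationInterval;
    }

    /** Milliseconds to wait before expiring the next bucket, never less than half a tick. */
    long waitBeforeNextBucket(long currentTime) {
        long untilDue = nextExpirationTime - currentTime;
        return Math.max(untilDue, expirationInterval / 2);
    }

    /** Call after the current bucket has been expired. */
    void bucketExpired() {
        nextExpirationTime += expirationInterval;
    }

    long nextExpirationTime() {
        return nextExpirationTime;
    }
}
```

With a 2000 ms interval and a 20 second forward jump, each wait is 1000 ms while the next expiration time advances by 2000 ms, so the backlog shrinks by half a tick per bucket. During that catch-up, buckets expire twice as fast as normal, which corresponds to the worst case described above of sessions expiring in half the time they should.
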
Re: Session expiration caused by time change
Nice (modulo inverting the > in your text).

Option 2 seems very simple. That always attracts me.

On Thu, Aug 19, 2010 at 9:19 AM, Benjamin Reed br...@yahoo-inc.com wrote:
> yes, you are right. we could do this. it turns out that the expiration code is very simple:
>
>     while (running) {
>         currentTime = System.currentTimeMillis();
>         if (nextExpirationTime > currentTime) {
>             this.wait(nextExpirationTime - currentTime);
>             continue;
>         }
>         SessionSet set;
>         set = sessionSets.remove(nextExpirationTime);
>         if (set != null) {
>             for (SessionImpl s : set.sessions) {
>                 sessionsById.remove(s.sessionId);
>                 expirer.expire(s);
>             }
>         }
>         nextExpirationTime += expirationInterval;
>     }
>
> so we can detect a jump very easily: if nextExpirationTime > currentTime, we have jumped ahead in time.
>
> now the question is, what do we do with this information?
>
> option 1) we could figure out the jump (nextExpirationTime-currentTime is a good estimate) and move all of the sessions forward by that amount.
>
> option 2) we could converge on the time by having a policy to always wait at least a half a tick time.
>
> there probably are other options as well. i kind of like option 2. worst case is it will make the sessions expire in half the time that they should, but this shouldn't be too much of a problem since clients send a ping if they are idle for 1/3 of their session timeout.
>
> ben

Re: Session expiration caused by time change
Hi Ted,

I haven't given it serious thought yet, but I don't think it is necessary for the cluster to keep track of time. A node can make its own decision. For the sake of argument, let's say that we have a client and a server with the following policy:

1. Client is supposed to send a ping to server every 1 sec.
2. If server does not hear from client for 5 seconds, then the server declares that the client is dead.
3. Similarly, if the client cannot communicate with the server for 5 seconds, the client declares that the server is dead.

If the client receives a timeout (say while doing some IO) because of a time jump, it should check the number of pings that have failed with the server. If the number is 5, then this is a true failure. If the number is less than 5, then this is because of a time drift.

At the server side, the server can attempt to reconnect (or send a ping to the client) after it receives a timeout. Thus, if the timeout occurred because of time drift, the server will reconnect and continue. We should of course have an upper bound on the number of retries, etc.

For ZK, it is important to handle time jumps on the ZK leader.

> I believe that the pattern of these problems is a slow slippage behind and a sudden jump forward.

You won't see the slippage. You will mainly see a jump forward. Note that with a large enough number of nodes, multiple nodes could see their time jumping forward. Therefore, comparing time between two servers may not help.

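The ping-counting policy proposed in this message could be sketched as follows. PingTimeoutCheck, its Verdict values and its method names are all hypothetical, and the sketch only covers the decision taken when a timeout fires; how the pings themselves are paced without relying on the clock is left open, and it does not touch ZooKeeper's real session protocol.

```java
/** Hypothetical check: was a timeout a real failure, or did the clock jump? */
final class PingTimeoutCheck {
    enum Verdict { PEER_DEAD, CLOCK_JUMP_SUSPECTED }

    private final int expectedPings;
    private int failedPings;

    /** For a 5000 ms timeout and 1000 ms ping interval, 5 failed pings are expected. */
    PingTimeoutCheck(long timeoutMillis, long pingIntervalMillis) {
        this.expectedPings = (int) (timeoutMillis / pingIntervalMillis);
    }

    void pingFailed() {
        failedPings++;
    }

    /** Any successful ping proves the peer is alive, so the count starts over. */
    void pingSucceeded() {
        failedPings = 0;
    }

    /** Called when the timeout fires: too few failed pings suggests the clock moved. */
    Verdict onTimeout() {
        return failedPings >= expectedPings ? Verdict.PEER_DEAD : Verdict.CLOCK_JUMP_SUSPECTED;
    }
}
```

With the numbers from the example policy, a timeout that fires after only two failed pings would be treated as a suspected clock jump, and the caller would keep pinging up to some retry bound before declaring the peer dead.
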
Re: Session expiration caused by time change
if we can't rely on the clock, we cannot say things like if ... for 5 seconds. also, clients connect to servers, not visa-versa, so we cannot say things like server can attempt to reconnect. ben On 08/19/2010 10:17 AM, Vishal K wrote: Hi Ted, I haven't give it a serious thought yet, but I don't think it is neccessary for the cluster to keep track of time. A node can make its own decision. For the sake of argument, lets say that we have a client and a server with following policy: 1. Client is supposed to send a ping to server every 1 sec. 2. If server does not hear from client for 5 seconds, then the server declares that the client is dead. 3. Similary if the client cannot communicate with the server for 5 seconds client declares that the server is dead. If the client receives a timeout (say while doing some IO) because of a time jump, it should check the number of pings that has failed with the server. If the number is 5, then this is a true failure, If the number is less than 5, then this is because of a time drift. At the server side, the server can attempt to reconnect (or send a ping to the client) after it receives a timeout. Thus, if the timeout occured because of time drift, the server will reconnect and continue. We should ofcourse have an upper bound in number of retries, etc. For ZK, it is important to handle time jumps on ZK leader. I believe that the pattern of these problems is a slow slippage behind and a sudden jump forward. You won't see the slippage. You will mainly see a jump forward. Note with large enough number of nodes, multiple nodes could see their time jumping forward. Therefore, checking comparing time between two servers may not help. On Thu, Aug 19, 2010 at 7:51 AM, Vishal Kvishalm...@gmail.com wrote: Hi, I remember Ben had opened a jira for clock jumps earlier: https://issues.apache.org/jira/browse/ZOOKEEPER-366. It is not uncommon to have clocks jump forward in virtualized environments. 
It is desirable to modify ZooKeeper to handle this situation (as much as possible) internally. It would need to be done for both client - server connections and server - server connections. One obvious solution is to retry a few times (send ping) after getting a timeout. Another way is to count the number of pings that have been sent after receiving the timeout. If number of pings do not match the expected number (say 5 ping attempt should be finished for a 5 sec timeout), then wait till all the pings are finished. In effect do not completely rely on the clock. Any comments? -Vishal On Thu, Aug 19, 2010 at 3:52 AM, Qing Yanqing...@gmail.com wrote: Oh.. our servers are also running in a virtualized environment. On Thu, Aug 19, 2010 at 2:58 PM, Martin Waitewaite@gmail.com wrote: Hi, I have tripped over similar problems testing Red Hat Cluster in virtualised environments. I don't know whether recent linux kernels have improved their interaction with VMWare, but in our environments clock drift caused by lost ticks can be substantial, requiring NTP to sometimes jump the clock rather than control acceleration. In one of our internal production rigs, the local NTP servers themselves were virtualised - causing absolute mayhem when heavy loads hit the other guests on the same physical hosts. The effect on RHCS (v2.0) is quite dramatic. A forward jump in time by 10 seconds always causes a member to prematurely time-out on a network read, causing the member to drop out and trigger a cluster reconfiguration. Apparently NTP is integrated with RHCS version 3, but I don't know what is meant by that. I guess this post is not entirely relevent to ZK, but I am just making the point that virtualisation (of NTP servers and or clients) can cause repeated premature timeouts. On Linux, I believe that there is a class of timers provided that is immune to this, but I doubt that there is a platform independent way of coping with this. My 2p. 
regards,
Martin

On 18 August 2010 16:53, Patrick Hunt ph...@apache.org wrote:

Do you expect the time to be wrong frequently? If ntp is running it should never get out of sync more than a small amount. As long as this is less than ~your timeout you should be fine.

Patrick

On 08/18/2010 01:04 AM, Qing Yan wrote:

Hi,

The testcase is fairly simple. We have a client which connects to ZK, registers an ephemeral node and watches on it. Now change the client machine's time - session killed.. Here is the log: *2010-08-18
Re: Session expiration caused by time change
Hi Ben,

Comments inline..

On Thu, Aug 19, 2010 at 5:33 PM, Benjamin Reed br...@yahoo-inc.com wrote:

if we can't rely on the clock, we cannot say things like "if ... for 5 seconds".

"if ... for 5 seconds" indicates the timeout given by the socket library. After the timeout we can verify that the timeout received was not a side effect of a time jump by looking at the number of ping attempts.

also, clients connect to servers, not vice versa, so we cannot say things like "server can attempt to reconnect".

In the scenario described below, wouldn't it be ok for the server to just send a ping request to see if the client is really dead?

ben

On 08/19/2010 10:17 AM, Vishal K wrote:

Hi Ted,

I haven't given it serious thought yet, but I don't think it is necessary for the cluster to keep track of time. A node can make its own decision. For the sake of argument, let's say that we have a client and a server with the following policy:

1. The client is supposed to send a ping to the server every 1 sec.
2. If the server does not hear from the client for 5 seconds, the server declares that the client is dead.
3. Similarly, if the client cannot communicate with the server for 5 seconds, the client declares that the server is dead.

If the client receives a timeout (say while doing some IO) because of a time jump, it should check the number of pings to the server that have failed. If the number is 5, then this is a true failure. If the number is less than 5, then this is because of a time drift. At the server side, the server can attempt to reconnect (or send a ping to the client) after it receives a timeout. Thus, if the timeout occurred because of time drift, the server will reconnect and continue. We should of course have an upper bound on the number of retries, etc. For ZK, it is important to handle time jumps on the ZK leader.

I believe that the pattern of these problems is a slow slippage behind and a sudden jump forward. You won't see the slippage. You will mainly see a jump forward.
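The ping-counting policy quoted above (one ping per second, 5 second timeout) can be sketched roughly as follows. This is only an illustration: `PingTimeoutCheck`, its method names and the `Verdict` enum are hypothetical and do not exist in ZooKeeper, and the sketch assumes the caller reports every ping sent and every reply received.

```java
/**
 * Hypothetical helper (not part of ZooKeeper): decides whether a timeout is
 * credible by counting consecutive unanswered pings instead of trusting the clock.
 */
final class PingTimeoutCheck {
    enum Verdict { PEER_DEAD, CLOCK_JUMP_SUSPECTED }

    private final int expectedPings;
    private int unansweredPings;

    PingTimeoutCheck(long timeoutMillis, long pingIntervalMillis) {
        if (timeoutMillis <= 0 || pingIntervalMillis <= 0) {
            throw new IllegalArgumentException("timeout and ping interval must be positive");
        }
        // e.g. a 5 s timeout with one ping per second expects 5 pings
        this.expectedPings = (int) Math.max(1L, timeoutMillis / pingIntervalMillis);
    }

    /** Call each time a ping is sent to the peer. */
    synchronized void onPingSent() {
        unansweredPings++;
    }

    /** Call when any reply arrives; the peer is evidently alive. */
    synchronized void onReplyReceived() {
        unansweredPings = 0;
    }

    /** Call when the socket layer reports a timeout. */
    synchronized Verdict onTimeout() {
        // Fewer unanswered pings than the timeout should have allowed means
        // the timer fired early, which is what a forward clock jump looks like.
        return unansweredPings >= expectedPings
                ? Verdict.PEER_DEAD
                : Verdict.CLOCK_JUMP_SUSPECTED;
    }
}
```

With a 5000 ms timeout and a 1000 ms ping interval, a timeout that arrives after only two unanswered pings is classified as a suspected clock jump, and the caller would keep pinging (up to some retry bound) rather than declare the peer dead.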
Re: Session expiration caused by time change
Ben's approach is really simpler. The client already sends keep-alive messages and we know that some have gone missing or a time shift has happened. Those two possibilities are cleanly distinguished by Ben's suggestion of comparing current time to the bucket expiration. If current time is significantly after the bucket expiration, we know something strange happened and can reschedule the next few buckets.

As Ben mentioned, this has a cleanly bounded maximum error and is very, very simple. He didn't mention that it doesn't require any more information than is already known and doesn't require any machine interaction.

On Thu, Aug 19, 2010 at 3:16 PM, Vishal K vishalm...@gmail.com wrote:

On Thu, Aug 19, 2010 at 5:33 PM, Benjamin Reed br...@yahoo-inc.com wrote:

if we can't rely on the clock, we cannot say things like "if ... for 5 seconds".

"if ... for 5 seconds" indicates the timeout given by the socket library. After the timeout we can verify that the timeout received was not a side effect of a time jump by looking at the number of ping attempts.

also, clients connect to servers, not vice versa, so we cannot say things like "server can attempt to reconnect".

In the scenario described below, wouldn't it be ok for the server to just send a ping request to see if the client is really dead?
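A rough sketch of the check described at the top of this message: compare the current time with the bucket expiration, and if the current time is far beyond it, push the pending buckets back. Everything here is an assumption made for illustration: the class `BucketRescheduler`, the tolerance parameter, and the representation of buckets as a `TreeMap` from expiration time to session ids are hypothetical, not ZooKeeper's actual data structures.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical sketch, not ZooKeeper code: detect a forward clock jump and push buckets back. */
final class BucketRescheduler {

    /**
     * Returns the estimated forward jump in milliseconds, or 0 if the expirer is
     * only late by an amount that ordinary scheduling delay could explain.
     */
    static long estimateForwardJump(long nextExpirationTime, long currentTime, long toleranceMillis) {
        long lateness = currentTime - nextExpirationTime;
        return lateness > toleranceMillis ? lateness : 0L;
    }

    /**
     * Returns a copy of the buckets with every expiration time moved forward by
     * the jump, rounded up to a whole number of intervals so keys stay aligned.
     */
    static TreeMap<Long, Set<Long>> shiftBuckets(
            TreeMap<Long, Set<Long>> buckets, long jumpMillis, long expirationInterval) {
        long intervals = (jumpMillis + expirationInterval - 1) / expirationInterval;
        long offset = intervals * expirationInterval;
        TreeMap<Long, Set<Long>> shifted = new TreeMap<>();
        for (Map.Entry<Long, Set<Long>> bucket : buckets.entrySet()) {
            shifted.put(bucket.getKey() + offset, bucket.getValue());
        }
        return shifted;
    }
}
```

The tolerance is the judgment call: the expirer is always a little late after waking up, so only lateness well beyond one interval should be read as a jump. The error is bounded because the shift is rounded up by at most one interval.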
Re: Session expiration caused by time change
i'm updating ZOOKEEPER-366 with this discussion and try to get a patch out. Qing (or anyone else, can you reproduce it pretty easily?)

thanx
ben

On 08/19/2010 09:29 AM, Ted Dunning wrote:

Nice (modulo inverting the > in your text). Option 2 seems very simple. That always attracts me.

On Thu, Aug 19, 2010 at 9:19 AM, Benjamin Reed br...@yahoo-inc.com wrote:

yes, you are right. we could do this. it turns out that the expiration code is very simple:

    while (running) {
        currentTime = System.currentTimeMillis();
        // next bucket is not due yet: sleep until it is, then re-check
        if (nextExpirationTime > currentTime) {
            this.wait(nextExpirationTime - currentTime);
            continue;
        }
        // the bucket is due: expire every session in it
        SessionSet set;
        set = sessionSets.remove(nextExpirationTime);
        if (set != null) {
            for (SessionImpl s : set.sessions) {
                sessionsById.remove(s.sessionId);
                expirer.expire(s);
            }
        }
        nextExpirationTime += expirationInterval;
    }

so we can detect a jump very easily: if nextExpirationTime > currentTime, we have jumped ahead in time. now the question is, what do we do with this information?

option 1) we could figure out the jump (nextExpirationTime-currentTime is a good estimate) and move all of the sessions forward by that amount.

option 2) we could converge on the time by having a policy to always wait at least a half a tick time.

there probably are other options as well. i kind of like option 2. worst case is it will make the sessions expire in half the time that they should, but this shouldn't be too much of a problem since clients send a ping if they are idle for 1/3 of their session timeout.

ben

On 08/19/2010 08:39 AM, Ted Dunning wrote:

True. But it knows that there has been a jump. Quiet time can be distinguished from clock shift by assuming that members of the cluster don't all jump at the same time. I would imagine that a recent clock jump estimate could be kept and buckets that would otherwise expire due to such a jump could be given a bit of a second lease on life, delaying all of their expiration.
Since time-outs are relatively short, the server would be able to forget about the bump very shortly.

On Thu, Aug 19, 2010 at 8:22 AM, Benjamin Reed br...@yahoo-inc.com wrote:

if we try to use network messages to detect and correct the situation, it seems like we would recreate the problem we are having with ntp, since that is exactly what it does.
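One possible reading of option 2 from the message above, as a sketch only. It assumes the wait is computed once per bucket, right after the previous bucket has been processed, and that a forward jump shows up as the current time being already past `nextExpirationTime`. The class and method names are hypothetical; this is not the patch that went into ZOOKEEPER-366.

```java
/** Hypothetical sketch of "option 2"; names are illustrative, not ZooKeeper's. */
final class HalfTickPolicy {

    /**
     * How long to wait before expiring the bucket due at nextExpirationTime.
     * Intended to be computed once per bucket, right after the previous bucket
     * was processed, so a forward jump cannot expire many buckets back to back.
     */
    static long waitBeforeNextBucket(long nextExpirationTime, long currentTime, long expirationInterval) {
        long untilDue = nextExpirationTime - currentTime; // zero or negative after a forward jump
        long minimumWait = expirationInterval / 2;        // "at least half a tick"
        return Math.max(untilDue, minimumWait);
    }
}
```

With a 2000 ms interval, a bucket due at 10000 when the clock reads 8000 is waited for the full 2000 ms. If the clock has jumped to 30000, the wait is still 1000 ms, so the backlog of buckets is expired one per half tick and the expirer gains half a tick on the clock with every bucket. That is where the worst case of sessions expiring in half their nominal time comes from.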
Re: Session expiration caused by time change
Put in a four-letter command that will put the server to sleep for 15 seconds! :-)

On Thu, Aug 19, 2010 at 3:51 PM, Benjamin Reed br...@yahoo-inc.com wrote:

i'm updating ZOOKEEPER-366 with this discussion and try to get a patch out. Qing (or anyone else, can you reproduce it pretty easily?)
Re: Session expiration caused by time change
If NTP is changing your time by more than a few milliseconds then you have other problems (big ones).

On Wed, Aug 18, 2010 at 1:04 AM, Qing Yan qing...@gmail.com wrote:

I guess ZK might rely on timestamp to keep sessions alive, but we have NTP daemon running so machine time can get changed automatically, is there a conflict?
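On the question of timestamps versus an adjustable machine clock: in Java, `System.nanoTime()` is documented as measuring elapsed time only and as unrelated to wall-clock time, so a timeout built on it does not directly depend on what `date` or NTP sets the clock to. The sketch below shows that idea under stated assumptions: `ElapsedDeadline` is a hypothetical name, this is not how the ZooKeeper code in this thread works (it uses `System.currentTimeMillis()`), and how the underlying timer behaves on a virtualised guest is not something the sketch verifies.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical deadline based on elapsed time rather than wall-clock time. */
final class ElapsedDeadline {
    private final long deadlineNanos;

    private ElapsedDeadline(long deadlineNanos) {
        this.deadlineNanos = deadlineNanos;
    }

    /** Creates a deadline the given amount of elapsed time from now. */
    static ElapsedDeadline after(long timeout, TimeUnit unit) {
        return new ElapsedDeadline(System.nanoTime() + unit.toNanos(timeout));
    }

    /** True once the timeout has elapsed. */
    boolean hasExpired() {
        // Compare by subtraction, as the nanoTime() Javadoc advises, so the
        // result stays correct even if the counter overflows.
        return System.nanoTime() - deadlineNanos >= 0;
    }

    /** Milliseconds left before expiry, never negative. */
    long remainingMillis() {
        return Math.max(0L, TimeUnit.NANOSECONDS.toMillis(deadlineNanos - System.nanoTime()));
    }
}
```

A session tracker could hold one such deadline per bucket and ask `hasExpired()` instead of comparing against `System.currentTimeMillis()`.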