Zookeeper stops

2010-08-19 Thread Wim Jongman
Hi, I have a zookeeper server running that can sometimes run for days and then quits: Is there somebody with a clue to the problem? I am running 64 bit Ubuntu with java version 1.6.0_18 OpenJDK Runtime Environment (IcedTea6 1.8) (6b18-1.8-0ubuntu1) OpenJDK 64-Bit Server VM (build 14.0-b16,

Re: Session expiration caused by time change

2010-08-19 Thread Ted Dunning
You can always increase your timeouts a bit. On Thu, Aug 19, 2010 at 12:52 AM, Qing Yan qing...@gmail.com wrote: Oh.. our servers are also running in a virtualized environment. On Thu, Aug 19, 2010 at 2:58 PM, Martin Waite waite@gmail.com wrote: Hi, I have tripped over similar

Re: Zookeeper stops

2010-08-19 Thread Mahadev Konar
Hi Wim, It mostly looks like zookeeper is not able to create files on the /tmp filesystem. Is there a space shortage or is it possible the file is being deleted as it's being written to? Sometimes admins have a crontab on /tmp that cleans up the /tmp filesystem. Thanks mahadev On

Re: Zookeeper stops

2010-08-19 Thread Ted Dunning
Also, /tmp is not a great place to keep things that are intended for persistence. On Thu, Aug 19, 2010 at 7:34 AM, Mahadev Konar maha...@yahoo-inc.com wrote: Hi Wim, It mostly looks like zookeeper is not able to create files on the /tmp filesystem. Is there a space shortage or is it

Re: Zookeeper stops

2010-08-19 Thread Wim Jongman
Ah, thanks guys! I did not realize that this was a user setting. Will try. Best regards, Wim On Thu, Aug 19, 2010 at 4:43 PM, Ted Dunning ted.dunn...@gmail.com wrote: Also, /tmp is not a great place to keep things that are intended for persistence. On Thu, Aug 19, 2010 at 7:34 AM, Mahadev

Re: Session expiration caused by time change

2010-08-19 Thread Vishal K
Hi, I remember Ben had opened a jira for clock jumps earlier: https://issues.apache.org/jira/browse/ZOOKEEPER-366. It is not uncommon to have clocks jump forward in virtualized environments. It is desirable to modify ZooKeeper to handle this situation (as much as possible) internally. It would

Re: Session expiration caused by time change

2010-08-19 Thread Ted Dunning
Another option would be for the cluster to compare times and note when one member seems to be lagging. Restoration of that lag would then be less remarkable. I believe that the pattern of these problems is a slow slippage behind and a sudden jump forward. On Thu, Aug 19, 2010 at 7:51 AM, Vishal

Re: Session expiration caused by time change

2010-08-19 Thread Martin Waite
Hi, I'm not sure if you mean the timers I was on about earlier. If so, http://linux.die.net/man/3/clock_gettime Sufficiently recent versions of GNU libc and the Linux kernel support the following clocks: ... *CLOCK_MONOTONIC* Clock that cannot be set and represents monotonic time since some
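For readers following the monotonic-clock suggestion, here is a minimal Java sketch of the same idea. It assumes `System.nanoTime()` is an acceptable stand-in for `CLOCK_MONOTONIC` (the JDK documents it as a source for measuring elapsed time that is unrelated to wall-clock time). The class name `MonotonicDeadline` and its methods are hypothetical and are not part of ZooKeeper.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: a deadline tracked on the JVM's elapsed-time clock, not the wall clock. */
public final class MonotonicDeadline {
    private final long deadlineNanos;

    private MonotonicDeadline(long deadlineNanos) {
        this.deadlineNanos = deadlineNanos;
    }

    /** Creates a deadline timeoutMillis from now, measured with System.nanoTime(). */
    public static MonotonicDeadline after(long timeoutMillis) {
        return new MonotonicDeadline(
                System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis));
    }

    /** Compares via subtraction so the check stays correct if nanoTime wraps around. */
    public boolean hasExpired() {
        return System.nanoTime() - deadlineNanos >= 0;
    }

    /** Milliseconds left before the deadline, or 0 once it has passed. */
    public long remainingMillis() {
        long remaining = deadlineNanos - System.nanoTime();
        return remaining <= 0 ? 0 : TimeUnit.NANOSECONDS.toMillis(remaining);
    }
}
```

Because the deadline never consults `System.currentTimeMillis()`, setting the system clock forward or back should not by itself change when `hasExpired()` flips to true.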

Re: Session expiration caused by time change

2010-08-19 Thread Ted Dunning
True. But it knows that there has been a jump. Quiet time can be distinguished from clock shift by assuming that members of the cluster don't all jump at the same time. I would imagine that a recent clock jump estimate could be kept and buckets that would otherwise expire due to such a jump

Re: Session expiration caused by time change

2010-08-19 Thread Benjamin Reed
yes, you are right. we could do this. it turns out that the expiration code is very simple: while (running) { currentTime = System.currentTimeMillis(); if (nextExpirationTime > currentTime) { this.wait(nextExpirationTime -

Re: Session expiration caused by time change

2010-08-19 Thread Ted Dunning
Nice (modulo inverting the in your text). Option 2 seems very simple. That always attracts me. On Thu, Aug 19, 2010 at 9:19 AM, Benjamin Reed br...@yahoo-inc.com wrote: yes, you are right. we could do this. it turns out that the expiration code is very simple: while (running) {

Re: Session expiration caused by time change

2010-08-19 Thread Vishal K
Hi Ted, I haven't given it serious thought yet, but I don't think it is necessary for the cluster to keep track of time. A node can make its own decision. For the sake of argument, let's say that we have a client and a server with the following policy: 1. Client is supposed to send a ping to server

Re: Zookeeper stops

2010-08-19 Thread Patrick Hunt
+1 on that Ted. I frequently see this issue crop up as "I just rebooted my server and lost all my data ..." -- many OSes will clean up tmp on reboot. :-) Patrick On 08/19/2010 07:43 AM, Ted Dunning wrote: Also, /tmp is not a great place to keep things that are intended for persistence. On Thu,

Re: Zookeeper stops

2010-08-19 Thread Wim Jongman
Hi, But zk does default to /tmp? Regards, Wim On Thursday, August 19, 2010, Patrick Hunt ph...@apache.org wrote: +1 on that Ted. I frequently see this issue crop up as "I just rebooted my server and lost all my data ..." -- many OSes will clean up tmp on reboot. :-) Patrick On

Re: Zookeeper stops

2010-08-19 Thread Patrick Hunt
No. You configure it in the server configuration file. Patrick On 08/19/2010 01:19 PM, Wim Jongman wrote: Hi, But zk does default to /tmp? Regards, Wim On Thursday, August 19, 2010, Patrick Hunt ph...@apache.org wrote: +1 on that Ted. I frequently see this issue crop up as "I just
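The setting in question is the `dataDir` key in the server configuration file. As a rough illustration, the Java sketch below parses configuration text in that key=value format and reports whether `dataDir` points at or under /tmp. The class `DataDirCheck`, its method, and the example paths are hypothetical; only the `dataDir`, `tickTime` and `clientPort` key names come from ZooKeeper's configuration format.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical pre-flight check for a ZooKeeper-style key=value configuration. */
public final class DataDirCheck {
    private DataDirCheck() {}

    /** Returns true when dataDir is /tmp itself or a path below it. */
    public static boolean dataDirUnderTmp(String configText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(configText));
        } catch (IOException e) {
            // Reading from an in-memory string should not fail; rethrow unchecked if it does.
            throw new UncheckedIOException(e);
        }
        String dataDir = props.getProperty("dataDir");
        if (dataDir == null) {
            throw new IllegalArgumentException("dataDir is not set");
        }
        String dir = dataDir.trim();
        return dir.equals("/tmp") || dir.startsWith("/tmp/");
    }
}
```

For example, `dataDirUnderTmp("tickTime=2000\ndataDir=/tmp/zookeeper\nclientPort=2181\n")` flags the configuration, while the same text with a directory on a persistent filesystem does not.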

Re: ZK monitoring

2010-08-19 Thread Patrick Hunt
Maybe we should have a contrib pkg for utilities such as this? I could see a python script that, given 1 server (might require addl 4letter words but this would be useful regardless), could collect such information from the cluster. Create a JIRA? Patrick On 08/17/2010 12:14 PM, Andrei Savu

Re: Session expiration caused by time change

2010-08-19 Thread Benjamin Reed
if we can't rely on the clock, we cannot say things like "if ... for 5 seconds". also, clients connect to servers, not vice versa, so we cannot say things like "server can attempt to reconnect". ben On 08/19/2010 10:17 AM, Vishal K wrote: Hi Ted, I haven't given it serious thought yet, but I

Re: ZK monitoring

2010-08-19 Thread Ted Dunning
It would be nice if it took a list of servers and verified that they all thought that they were part of the same cluster. On Thu, Aug 19, 2010 at 1:46 PM, Patrick Hunt ph...@apache.org wrote: Maybe we should have a contrib pkg for utilities such as this? I could see a python script that, given
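A utility along these lines could start from the existing four-letter-word interface. The sketch below, in Java, sends a command such as `stat` to one server's client port and extracts the `Mode:` line from the reply. The class and method names are hypothetical, it assumes the server closes the connection after answering, and it does not attempt the cross-server membership check described above.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for querying a ZooKeeper server with four-letter words. */
public final class FourLetterWord {
    private FourLetterWord() {}

    /** Sends a command such as "stat" or "ruok" and returns the full reply. */
    public static String send(String host, int port, String word, int timeoutMillis)
            throws IOException {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            socket.setSoTimeout(timeoutMillis);
            socket.getOutputStream().write(word.getBytes(StandardCharsets.US_ASCII));
            socket.getOutputStream().flush();

            // Read until the server closes the connection.
            ByteArrayOutputStream reply = new ByteArrayOutputStream();
            InputStream in = socket.getInputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                reply.write(buffer, 0, n);
            }
            return new String(reply.toByteArray(), StandardCharsets.US_ASCII);
        }
    }

    /** Extracts the value of the "Mode:" line from stat output, or null if absent. */
    public static String parseMode(String statOutput) {
        for (String line : statOutput.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.startsWith("Mode:")) {
                return trimmed.substring("Mode:".length()).trim();
            }
        }
        return null;
    }
}
```

Calling `send` for each host in a list and comparing the parsed modes would be one possible first step toward the consistency check; what else to compare is left open here.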

Re: Session expiration caused by time change

2010-08-19 Thread Vishal K
Hi Ben, Comments inline.. On Thu, Aug 19, 2010 at 5:33 PM, Benjamin Reed br...@yahoo-inc.com wrote: if we can't rely on the clock, we cannot say things like "if ... for 5 seconds". "if ... for 5 seconds" indicates the timeout given by the socket library. After the timeout we can verify that the

Re: Session expiration caused by time change

2010-08-19 Thread Ted Dunning
Ben's approach is really simpler. The client already sends keep-alive messages and we know that some have gone missing or a time shift has happened. Those two possibilities are cleanly distinguished by Ben's suggestion of comparing current time to the bucket expiration. If current time is
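The comparison described here can be sketched as a small decision function. This Java example is an assumption-laden illustration, not the patch discussed in the thread: the rule that an overshoot beyond some threshold indicates a clock jump, the `jumpThresholdMillis` parameter, and all names are hypothetical, since the exact condition is cut off in the message above.

```java
/** Hypothetical decision step for one pass of a session-expiration loop. */
public final class ExpirationCheck {
    private ExpirationCheck() {}

    public enum Action { WAIT, EXPIRE, SKIP_AFTER_JUMP }

    /**
     * Decides what to do with the bucket due at nextExpirationMillis.
     * Assumption: being later than the bucket by more than jumpThresholdMillis
     * is treated as a clock jump rather than as missed keep-alives.
     */
    public static Action decide(long nowMillis, long nextExpirationMillis,
                                long jumpThresholdMillis) {
        if (nextExpirationMillis > nowMillis) {
            return Action.WAIT;            // bucket not due yet
        }
        if (nowMillis - nextExpirationMillis > jumpThresholdMillis) {
            return Action.SKIP_AFTER_JUMP; // far past due: suspect a time shift
        }
        return Action.EXPIRE;              // due within the normal window
    }
}
```

How a caller reacts to `SKIP_AFTER_JUMP` (for instance, by granting sessions another interval before expiring them) is a design choice this sketch leaves open.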

Re: Session expiration caused by time change

2010-08-19 Thread Benjamin Reed
i'm updating ZOOKEEPER-366 with this discussion and will try to get a patch out. Qing (or anyone else), can you reproduce it pretty easily? thanx ben On 08/19/2010 09:29 AM, Ted Dunning wrote: Nice (modulo inverting the in your text). Option 2 seems very simple. That always attracts me. On

Re: Session expiration caused by time change

2010-08-19 Thread Ted Dunning
Put in a four letter command that will put the server to sleep for 15 seconds! :-) On Thu, Aug 19, 2010 at 3:51 PM, Benjamin Reed br...@yahoo-inc.com wrote: i'm updating ZOOKEEPER-366 with this discussion and will try to get a patch out. Qing (or anyone else), can you reproduce it pretty easily?