Re: Serious problem processing hearbeat on login stampede

2011-04-14 Thread Chang Song
Patrick and Ted. Unless ZooKeeper clients add this feature, it is not easy for us to implement. We only provide a platform for many services within our org. Their batch servers will fire off whatever clients they want. We have no control over it. But 8 second latency during stampede is

Re: Serious problem processing hearbeat on login stampede

2011-04-14 Thread Chang Song
On 2011-04-14 at 10:30 AM, Patrick Hunt wrote: 2011/4/13 Chang Song tru64...@me.com: Patrick. Thank you for the reply. We are very aware of all the things you mentioned below. None of those. Not GC (we monitor every possible resource in JVM and system). No IO. No swapping. No VM guest

Re: Serious problem processing hearbeat on login stampede

2011-04-14 Thread Chang Song
On 2011-04-14 at 1:53 PM, Patrick Hunt wrote: two additional thoughts come to mind: 1) try running the ensemble with a single zk server, does this help at all? (it might provide a short term workaround, it also might provide some insight into what's causing the issue) We are going to try this

Re: Serious problem processing hearbeat on login stampede

2011-04-14 Thread Patrick Hunt
2011/4/14 Chang Song tru64...@me.com: 2) regarding IO, if you run 'iostat -x 2' on the zk servers while your issue is happening, what's the %util of the disk? what's the iowait look like? Again, no I/O at all. 0% This is simply not possible. Sessions are persistent. Each time a session

Re: Serious problem processing hearbeat on login stampede

2011-04-14 Thread Benjamin Reed
chang, if the problem is on client startup, then it isn't the heartbeat stampede, it is session establishment. the heartbeats are very light weight, so i can't imagine them causing any issues. the two key issues we need to know are: 1) the version of the server you are running, and 2) if you are

Re: Serious problem processing hearbeat on login stampede

2011-04-14 Thread Chang Song
On 2011-04-15 at 1:04 AM, Patrick Hunt wrote: 2011/4/14 Chang Song tru64...@me.com: 2) regarding IO, if you run 'iostat -x 2' on the zk servers while your issue is happening, what's the %util of the disk? what's the iowait look like? Again, no I/O at all. 0% This is simply not

Re: Serious problem processing hearbeat on login stampede

2011-04-14 Thread Benjamin Reed
when you file the jira can you also note the logging level you are using? thanx ben 2011/4/14 Chang Song tru64...@me.com: Yes, Ben. If you read my emails carefully, I already said it is not heartbeat, it is session establishment / closing gets stamped. Since all the requests' response gets

Re: Serious problem processing hearbeat on login stampede

2011-04-14 Thread Chang Song
Sure, I will. Thank you. Chang On 2011-04-15 at 7:16 AM, Benjamin Reed wrote: when you file the jira can you also note the logging level you are using? thanx ben 2011/4/14 Chang Song tru64...@me.com: Yes, Ben. If you read my emails carefully, I already said it is not heartbeat, it is

Re: Serious problem processing hearbeat on login stampede

2011-04-14 Thread Ted Dunning
2011/4/14 Chang Song tru64...@me.com You need to understand that most apps can tolerate delay in connect/close, but we cannot tolerate ping delay since we are using the ZK heartbeat timeout for sole failure detection. What about using multiple ZK clusters for this, then? But it really sounds like

Re: Serious problem processing hearbeat on login stampede

2011-04-14 Thread Ted Dunning
You said that, but there was some skepticism from others about this. You need to try the monitoring that was suggested. 5 minute averages are not useful. What does the stat four letter command return? ( http://zookeeper.apache.org/doc/r3.1.2/zookeeperAdmin.html#sc_zkCommands ) 2011/4/14 Chang
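
For readers following the suggestion above to check the stat four letter command, here is a minimal sketch of how it could be polled from a script instead of by hand. It assumes the server listens on the default client port 2181 and that the reply contains a line of the form "Latency min/avg/max: a/b/c"; the helper names four_letter_word and parse_latency are made up for this illustration and are not part of any ZooKeeper client library.

```python
import re
import socket


def four_letter_word(host, port=2181, command=b"stat", timeout=5.0):
    """Send a four letter command to a ZooKeeper server and return the reply text.

    The server answers and then closes the connection, so we read until EOF.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closed the connection: reply is complete
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


# Assumed reply format: a line such as "Latency min/avg/max: 0/3/8012".
# The numbers are treated as floats in case a server reports a fractional average.
_LATENCY_RE = re.compile(
    r"^Latency min/avg/max:\s*([\d.]+)/([\d.]+)/([\d.]+)", re.MULTILINE
)


def parse_latency(stat_text):
    """Return (min, avg, max) latency from a stat reply, or None if absent."""
    match = _LATENCY_RE.search(stat_text)
    if match is None:
        return None
    return tuple(float(group) for group in match.groups())


# Example use against a hypothetical host (not run here):
#   reply = four_letter_word("zk1.example.com")
#   print(parse_latency(reply))
```

Calling this every second or two during a login stampede would give per-sample latency figures rather than 5 minute averages, which is the kind of data asked for in the thread.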