Re: Serious problem processing hearbeat on login stampede

2011-07-05 Thread Patrick Hunt
Vishal brought up an issue at the ZK post-summit meetup that might also be (partially?) resolved by this patch. Thanks again Chang Song! Patrick 2011/7/1 Chang Song tru64...@me.com: No problem. Glad to contribute. Thanks a lot. On Jul 2, 2011, at 1:03 AM, Ted Dunning wrote: Thanks for the

Re: Serious problem processing hearbeat on login stampede

2011-07-05 Thread Chang Song
Actually, netspider (Chisu Ryu) on my team fixed it. Thanks, Chisu. Chang On Jul 6, 2011, at 3:04 AM, Patrick Hunt wrote: Vishal brought up an issue at the ZK post-summit meetup that might also be (partially?) resolved by this patch. Thanks again Chang Song! Patrick 2011/7/1 Chang Song

Re: Serious problem processing hearbeat on login stampede

2011-07-01 Thread Jared Cantwell
As a note, I believe we just used this patch to solve a major issue we were seeing. We were having problems when power to a node was pulled, and thus hung tcp sessions on the servers. With many connections, each close operation was taking 2 seconds and held up the server significantly enough to

Re: Serious problem processing hearbeat on login stampede

2011-07-01 Thread Ted Dunning
Thanks for the feedback Jared! (and thanks to Chang as well!) On Fri, Jul 1, 2011 at 8:06 AM, Jared Cantwell jared.cantw...@gmail.com wrote: As a note, I believe we just used this patch to solve a major issue ... Thanks Chang! ~Jared On Tue, Apr 19, 2011 at 10:59 AM, Ted Dunning

Re: Serious problem processing hearbeat on login stampede

2011-07-01 Thread Chang Song
No problem. Glad to contribute. Thanks a lot. On Jul 2, 2011, at 1:03 AM, Ted Dunning wrote: Thanks for the feedback Jared! (and thanks to Chang as well!) On Fri, Jul 1, 2011 at 8:06 AM, Jared Cantwell jared.cantw...@gmail.com wrote: As a note, I believe we just used this patch to solve a

Re: Serious problem processing hearbeat on login stampede

2011-04-19 Thread Chang Song
Problem solved. It was the socket linger option, set to a 2-second timeout. We have verified that the original problem goes away when we turn off the linger option. No longer a mystery ;) https://issues.apache.org/jira/browse/ZOOKEEPER-1049 Chang On Apr 19, 2011, at 3:16 AM, Mahadev Konar wrote: Camille, Ted,
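
For readers wondering where such an option lives: at the socket level, linger is the SO_LINGER option. With linger enabled and a positive timeout, close() on a blocking socket can wait up to that timeout while unsent data is still queued, which is consistent with the 2-second close times reported in this thread. The sketch below is Python rather than ZooKeeper's own Java code, assumes a Linux-style `struct linger` made of two C ints, and uses hypothetical helper names (`set_linger`, `get_linger`). It only illustrates the option itself; the actual ZooKeeper change is the one tracked in ZOOKEEPER-1049.

```python
import socket
import struct

# Assumption: Linux-style struct linger { int l_onoff; int l_linger; }.
_LINGER_FMT = "ii"


def set_linger(sock, enabled, seconds):
    """Enable or disable SO_LINGER on a socket (hypothetical helper)."""
    value = struct.pack(_LINGER_FMT, 1 if enabled else 0, seconds)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, value)


def get_linger(sock):
    """Return (l_onoff, l_linger) as currently set on the socket."""
    raw = sock.getsockopt(
        socket.SOL_SOCKET, socket.SO_LINGER, struct.calcsize(_LINGER_FMT)
    )
    return struct.unpack(_LINGER_FMT, raw)


# Usage sketch (not executed here):
#   s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
#   set_linger(s, True, 2)    # close() may now block for up to 2 seconds
#   set_linger(s, False, 0)   # close() returns without waiting on queued data
```

In Java, the corresponding call is `Socket.setSoLinger(boolean on, int linger)`.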

Re: Serious problem processing hearbeat on login stampede

2011-04-19 Thread Ted Dunning
Where is this set? Why does this cause this problem? 2011/4/19 Chang Song tru64...@me.com: Problem solved. It was the socket linger option, set to a 2-second timeout. We have verified that the original problem goes away when we turn off the linger option. No longer a mystery ;)

Re: Serious problem processing hearbeat on login stampede

2011-04-18 Thread Ted Dunning
Interesting. It does seem to suggest that session expiration is expensive. There is a concurrent table in guava that provides very good multi-threaded performance. I think that is achieved by using a number of locks and then distributing threads across the locks according to the hash slot
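
The idea described above (several locks, with each key mapped to one of them by its hash) is often called lock striping. Below is a minimal Python sketch of that idea, written only for illustration. It is not Guava's implementation and not ZooKeeper code; the class name `StripedDict` and the default of 16 stripes are assumptions made for this example.

```python
import threading


class StripedDict:
    """Hypothetical dict guarded by several locks chosen by key hash."""

    def __init__(self, stripes=16):
        # One lock and one shard per stripe; threads touching keys in
        # different stripes do not contend with each other.
        self._locks = [threading.Lock() for _ in range(stripes)]
        self._shards = [{} for _ in range(stripes)]

    def _index(self, key):
        return hash(key) % len(self._locks)

    def put(self, key, value):
        i = self._index(key)
        with self._locks[i]:
            self._shards[i][key] = value

    def get(self, key, default=None):
        i = self._index(key)
        with self._locks[i]:
            return self._shards[i].get(key, default)

    def remove(self, key):
        i = self._index(key)
        with self._locks[i]:
            return self._shards[i].pop(key, None)

    def __len__(self):
        # Take each stripe's lock in turn; the total is a snapshot only
        # if no writers are running concurrently.
        total = 0
        for lock, shard in zip(self._locks, self._shards):
            with lock:
                total += len(shard)
        return total
```

The trade-off is that operations spanning all keys, such as `len`, must visit every stripe, while single-key operations contend only with other keys in the same stripe.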

Re: Serious problem processing hearbeat on login stampede

2011-04-18 Thread Mahadev Konar
Camille, Ted, Can we continue the discussion on https://issues.apache.org/jira/browse/ZOOKEEPER-1049? We should track all the suggestions/issues on the jira. thanks mahadev On Mon, Apr 18, 2011 at 9:03 AM, Ted Dunning ted.dunn...@gmail.com wrote: Interesting. It does seem to suggest the

Re: Serious problem processing hearbeat on login stampede

2011-04-17 Thread Chang Song
Ted. Please be patient. I didn't say I won't post the data. I am not doing the test myself. My team does. I saw iostat result when they did the test. I cannot cut-and-paste what I don't have. I cannot force them to come in on weekends to do the testing. and let me add. There is no magic in

Re: Serious problem processing hearbeat on login stampede

2011-04-16 Thread Ted Dunning
That isn't the issue. The issue is that there is something here that is a mystery. You aren't seeing the answer. If you could, you would have seen it already and wouldn't have a question to ask. If you want somebody else to see the answer, you need to show them the raw data and not just tell

Re: Serious problem processing hearbeat on login stampede

2011-04-16 Thread Ted Dunning
How many ephemeral files have to be deleted when a session closes or expires? 2011/4/15 Chang Song tru64...@me.com It is not login, it is session expiring and closing process.

Re: Serious problem processing hearbeat on login stampede

2011-04-15 Thread Chang Song
I have filed a JIRA bug: https://issues.apache.org/jira/browse/ZOOKEEPER-1049 We have measured I/O wait again, but found no I/O activity due to ZK. Just the regular page cache sync daemon at work: 0-3%. I will have my team attach the ZK stat result. Thanks a lot. Let's move this discussion to

Re: Serious problem processing hearbeat on login stampede

2011-04-15 Thread Ted Dunning
You know, I think it would help if you would answer some of the questions that people have posed. You say that it takes 1000 clients over 8 seconds to register. That is about 100 transactions per second. That is two orders of magnitude slower than others have observed ZK to be. This is a
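
The arithmetic behind the figures quoted above can be checked quickly. The sketch below uses only the numbers stated in the message (1000 clients, 8 seconds, "two orders of magnitude"). The variable names are made up for this example, and the implied rate is derived from that phrase, not a measured value.

```python
# Numbers taken from the message above.
clients = 1000
seconds = 8

# Observed registration rate: 125 per second, i.e. "about 100".
observed_rate = clients / seconds

# "Two orders of magnitude" faster implies a rate roughly 100x higher.
# This is an inference from the wording, not a benchmark result.
implied_typical_rate = observed_rate * 100

print(f"observed: {observed_rate:.0f}/s, implied typical: {implied_typical_rate:.0f}/s")
```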

Re: Serious problem processing hearbeat on login stampede

2011-04-14 Thread Chang Song
Patrick and Ted. Unless ZooKeeper clients add this feature, it is not easy for us to implement. We only provide a platform for many services within our org. Their batch servers will fire off whatever clients they want. We have no control over it. But 8 second latency during stampede is

Re: Serious problem processing hearbeat on login stampede

2011-04-14 Thread Chang Song
On Apr 14, 2011, at 10:30 AM, Patrick Hunt wrote: 2011/4/13 Chang Song tru64...@me.com: Patrick. Thank you for the reply. We are very aware of all the things you mentioned below. None of those. Not GC (we monitor every possible resource in JVM and system) No IO. No Swapping. No VM guest

Re: Serious problem processing hearbeat on login stampede

2011-04-14 Thread Chang Song
On Apr 14, 2011, at 1:53 PM, Patrick Hunt wrote: two additional thoughts come to mind: 1) try running the ensemble with a single zk server, does this help at all? (it might provide a short term workaround, it also might provide some insight into what's causing the issue) We are going to try this

Re: Serious problem processing hearbeat on login stampede

2011-04-14 Thread Patrick Hunt
2011/4/14 Chang Song tru64...@me.com: 2) regarding IO, if you run 'iostat -x 2' on the zk servers while your issue is happening, what's the %util of the disk? what's the iowait look like? Again, no I/O at all. 0% This is simply not possible. Sessions are persistent. Each time a session

Re: Serious problem processing hearbeat on login stampede

2011-04-14 Thread Benjamin Reed
chang, if the problem is on client startup, then it isn't the heartbeat stampede, it is session establishment. the heartbeats are very light weight, so i can't imagine them causing any issues. the two key issues we need to know are: 1) the version of the server you are running, and 2) if you are

Re: Serious problem processing hearbeat on login stampede

2011-04-14 Thread Chang Song
On Apr 15, 2011, at 1:04 AM, Patrick Hunt wrote: 2011/4/14 Chang Song tru64...@me.com: 2) regarding IO, if you run 'iostat -x 2' on the zk servers while your issue is happening, what's the %util of the disk? what's the iowait look like? Again, no I/O at all. 0% This is simply not

Re: Serious problem processing hearbeat on login stampede

2011-04-14 Thread Benjamin Reed
when you file the jira can you also note the logging level you are using? thanx ben 2011/4/14 Chang Song tru64...@me.com: Yes, Ben. If you read my emails carefully, I already said it is not heartbeat, it is session establishment / closing that gets stampeded. Since all the requests' response gets

Re: Serious problem processing hearbeat on login stampede

2011-04-14 Thread Chang Song
Sure, I will. Thank you. Chang On Apr 15, 2011, at 7:16 AM, Benjamin Reed wrote: when you file the jira can you also note the logging level you are using? thanx ben 2011/4/14 Chang Song tru64...@me.com: Yes, Ben. If you read my emails carefully, I already said it is not heartbeat, it is

Re: Serious problem processing hearbeat on login stampede

2011-04-14 Thread Ted Dunning
2011/4/14 Chang Song tru64...@me.com: You need to understand that most apps can tolerate delay in connect/close, but we cannot tolerate ping delay since we are using the ZK heartbeat timeout as our sole failure detection. What about using multiple ZK clusters for this, then? But it really sounds like

Re: Serious problem processing hearbeat on login stampede

2011-04-14 Thread Ted Dunning
You said that, but there was some skepticism from others about this. You need to try the monitoring that was suggested. 5 minute averages are not useful. What does the stat four letter command return? ( http://zookeeper.apache.org/doc/r3.1.2/zookeeperAdmin.html#sc_zkCommands ) 2011/4/14 Chang
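
For anyone following along, the stat four letter command is sent as the plain four bytes `stat` over a TCP connection to the server's client port; the server writes its reply and closes the connection. The Python sketch below shows one way to issue it. The function name `four_letter_word`, the 5-second timeout, and the host and port in the usage comment are assumptions for this example (2181 is the usual default client port). Newer ZooKeeper releases may require the command to be enabled in the server configuration before it will answer.

```python
import socket


def four_letter_word(host, port, word=b"stat", timeout=5.0):
    """Send a four letter command and return the server's full reply as text.

    Hypothetical helper: reads until the server closes the connection.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(word)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closed the connection: reply is complete
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


# Usage sketch (assumes a ZooKeeper server listening on localhost:2181):
#   print(four_letter_word("localhost", 2181))
```

Running this repeatedly while the problem is occurring gives point-in-time readings instead of multi-minute averages.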

Re: Serious problem processing hearbeat on login stampede

2011-04-13 Thread Chang Song
Patrick. Thank you for the reply. We are very aware of all the things you mentioned below. None of those. Not GC (we monitor every possible resource in JVM and system) No IO. No Swapping. No VM guest OS. No logging. Oh, one thing I should mention is that it is not 1000 clients, 1000

Re: Serious problem processing hearbeat on login stampede

2011-04-13 Thread Ted Dunning
This is a more powerful idea than it looks like at first glance. The reason is that there is often a highly non-linear and adverse impact to response time due to higher load. I have never been able to properly account for this using queuing models in a system that is not swapping, but it is