[ https://issues.apache.org/jira/browse/ZOOKEEPER-662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829920#action_12829920 ]
Qian Ye commented on ZOOKEEPER-662:
-----------------------------------

Hi Patrick, the C clients all run in a Linux environment; the kernels are 2.6.9. Some of the servers are 32-bit machines and some are 64-bit. It seems that the client on the server 10.81.14.81 has some problem that causes it to fail frequently. Because there is a monitor app that restarts the C client whenever it fails, the client on 10.81.14.81 keeps restarting and reconnecting to the ZooKeeper servers.

You mentioned that some of the responses to the "stat" request didn't reach the client. That looks like the behavior of a TCP connection with the SO_LINGER option on. In that situation, the server only puts the response on the wire and closes; the response packet may be discarded, and the TCP/IP stack won't re-send it. Is that the scenario we hit here?

> Too many CLOSE_WAIT socket state on a server
> --------------------------------------------
>
>                 Key: ZOOKEEPER-662
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-662
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.2.1
>         Environment: Linux 2.6.9
>            Reporter: Qian Ye
>             Fix For: 3.3.0
>
>         Attachments: zookeeper.log.2010020105, zookeeper.log.2010020106
>
>
> I have a ZooKeeper cluster with 5 servers, ZooKeeper version 3.2.1. Here is
> the content of the configuration file, zoo.cfg:
> ======
> # The number of milliseconds of each tick
> tickTime=2000
> # The number of ticks that the initial
> # synchronization phase can take
> initLimit=5
> # The number of ticks that can pass between
> # sending a request and getting an acknowledgement
> syncLimit=2
> # the directory where the snapshot is stored.
> dataDir=./data/
> # the port at which the clients will connect
> clientPort=8181
> # zookeeper cluster list
> server.100=10.23.253.43:8887:8888
> server.101=10.23.150.29:8887:8888
> server.102=10.23.247.141:8887:8888
> server.200=10.65.20.68:8887:8888
> server.201=10.65.27.21:8887:8888
> =====
> Before the problem happened, server.200 was the leader. Yesterday
> morning, I found that there were many sockets in the CLOSE_WAIT state on
> the clientPort (8181), more than about 120 in total. Because of these
> CLOSE_WAIT sockets, server.200 could not accept more connections from the
> clients. The only thing I could do in this situation was restart
> server.200, at about 2010-02-01 06:06:35. The related log is attached to the
> issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
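
For anyone trying to reproduce the CLOSE_WAIT count described in the report, here is a minimal sketch in Python. It is not part of ZooKeeper; the function name `count_close_wait` is made up for illustration. It assumes a Linux host where `/proc/net/tcp` is readable, and relies on the kernel reporting the CLOSE_WAIT state as hex `08` in the `st` column. If the server listens on an IPv6 socket, the entries may appear in `/proc/net/tcp6` instead.

```python
CLOSE_WAIT = 0x08  # state code for CLOSE_WAIT in the "st" column of /proc/net/tcp


def count_close_wait(proc_net_tcp_text, port):
    """Count sockets in CLOSE_WAIT whose local port equals `port`.

    `proc_net_tcp_text` is the full text of /proc/net/tcp (or /proc/net/tcp6).
    """
    count = 0
    for line in proc_net_tcp_text.splitlines()[1:]:  # first row is the header
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or truncated rows
        # fields[1] is "HEXADDR:HEXPORT" for the local end of the socket
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        state = int(fields[3], 16)
        if local_port == port and state == CLOSE_WAIT:
            count += 1
    return count
```

On the affected server this could be called as `count_close_wait(open("/proc/net/tcp").read(), 8181)`, with 8181 being the clientPort from the zoo.cfg above. Sampling it periodically would show whether the count grows as the failing client reconnects.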