Patrick Hunt commented on ZOOKEEPER-662:

I agree, this is a potentially serious issue. Unfortunately, based on the
information we have, I don't see how I can provide more insight. Keep in mind
that we have many users in similar situations, yet this is the first time we
have ever heard of this type of issue (not that that diminishes your issue).
So I just don't have much to go on.

I would suggest that you check your monitoring script and ensure it handles all 
error cases, such as failing to connect to the server, or getting a partial 
response due to things like the linger issue.
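
As a rough illustration of the kind of error handling meant above, here is a minimal sketch of a health check, assuming the monitoring script uses ZooKeeper's "ruok" four-letter command (which answers "imok") on the client port. The function names, the timeout value, and the wording of the status strings are hypothetical and not taken from the reporter's actual script.

```python
import socket


def classify_reply(reply):
    """Classify the raw bytes returned for a 'ruok' request."""
    if reply == b"imok":
        return True, "imok"
    if reply == b"":
        # Connection was accepted but closed without any data.
        return False, "empty response"
    if b"imok".startswith(reply):
        # Only a prefix of the expected answer arrived before the close.
        return False, "partial response: %r" % reply
    return False, "unexpected response: %r" % reply


def zk_ruok(host, port, timeout=5.0):
    """Send 'ruok' and return (ok, detail); never raises on network errors."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"ruok")
            chunks = []
            while True:
                data = sock.recv(1024)
                if not data:  # server closed the connection
                    break
                chunks.append(data)
    except socket.timeout:
        return False, "timed out"
    except OSError as exc:
        # Covers refused connections, resets, unreachable hosts, etc.
        return False, "connect/io error: %s" % exc
    return classify_reply(b"".join(chunks))
```

Calling zk_ruok("localhost", 8181) would then report a failed connect, a timeout, or a truncated answer as distinct cases instead of treating them all the same way.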

Also ensure that you can capture the server and client logs if this happens
again. If it does, capture the full netstat output (netstat -a, I guess) so
that we can get detailed information about the socket states.
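
One possible way to capture and summarize that output is sketched below. It assumes a Linux netstat from net-tools, where -a, -n and -t select all sockets, numeric addresses and TCP only, and where the local address and state are the fourth and sixth columns. The function names are hypothetical.

```python
import subprocess
from collections import Counter


def capture_netstat():
    """Return the raw text of 'netstat -ant' (assumes Linux net-tools)."""
    result = subprocess.run(
        ["netstat", "-ant"], capture_output=True, text=True, check=True
    )
    return result.stdout


def count_tcp_states(netstat_text, port):
    """Count TCP socket states for sockets whose local port is `port`."""
    states = Counter()
    suffix = ":%d" % port
    for line in netstat_text.splitlines():
        fields = line.split()
        # Skip headers and non-TCP lines; TCP rows have at least 6 columns.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local_addr, state = fields[3], fields[5]
        if local_addr.endswith(suffix):
            states[state] += 1
    return states
```

Saving the text from capture_netstat() to a file keeps the full detail for later, while count_tcp_states(text, 8181) gives a quick per-state tally (for example, how many sockets are in CLOSE_WAIT) for the client port.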

You might also make sure to save the transaction logs if this happens again.
Not the log4j logs, but the transaction logs that are kept in the dataDir.
Those can actually be scanned so we can see what was going on (changes to
znodes as well as session info).
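
A small sketch of how the data directory could be preserved before restarting the server follows. It simply copies the whole directory tree to a timestamped location, so it makes no assumption about the file layout inside it. The function name and the "zk-datadir-" prefix are hypothetical, and if the transaction logs are kept in a separate directory, that directory would need the same treatment.

```python
import shutil
import time
from pathlib import Path


def preserve_data_dir(data_dir, dest_root):
    """Copy data_dir into dest_root/zk-datadir-<timestamp> and return the path."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(dest_root) / ("zk-datadir-" + stamp)
    # copytree creates missing parent directories and fails if dest exists,
    # so an earlier capture is never overwritten.
    shutil.copytree(str(data_dir), str(dest))
    return dest
```

For the configuration quoted below, this would be called with "./data/" as data_dir and any directory with enough free space as dest_root.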

Can you think of anything else that would help here? Have you been able to
reproduce the problem, or have you tried and been unable to? That's all I
can think of currently.

> Too many CLOSE_WAIT socket state on a server
> --------------------------------------------
>                 Key: ZOOKEEPER-662
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-662
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.2.1
>         Environment: Linux 2.6.9
>            Reporter: Qian Ye
>             Fix For: 3.3.0
>         Attachments: zookeeper.log.2010020105, zookeeper.log.2010020106
> I have a ZooKeeper cluster with 5 servers, ZooKeeper version 3.2.1. Here is
> the content of the configuration file, zoo.cfg:
> ======
> # The number of milliseconds of each tick
> tickTime=2000
> # The number of ticks that the initial 
> # synchronization phase can take
> initLimit=5
> # The number of ticks that can pass between 
> # sending a request and getting an acknowledgement
> syncLimit=2
> # the directory where the snapshot is stored.
> dataDir=./data/
> # the port at which the clients will connect
> clientPort=8181
> # zookeeper cluster list
> server.100=
> server.101=
> server.102=
> server.200=
> server.201=
> =====
> Before the problem happened, server.200 was the leader. Yesterday
> morning, I found that there were many sockets in the CLOSE_WAIT state on
> the clientPort (8181); the total was about 120 or more. Because of these
> CLOSE_WAIT sockets, server.200 could not accept more connections from
> clients. The only thing I could do in this situation was restart
> server.200, at about 2010-02-01 06:06:35. The related log is attached to the
> issue.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
