Tres Seaver wrote:
Paul Williams wrote:
OK, here is what we have. I did a netstat on both machines, client and server. The client sees an established connection and the server does not. In the server log there is a disconnect. As for hardware between them, there is a switch (Dell PowerConnect 6024). The Web Services Directors might get hold of the traffic, but there are no hops on traceroute. Traceroute only shows the client machine and the server machine.

So the client is just continuously polling the connection but getting nothing back.

That sounds like some weird kernel / networking problem to me:  I don't
see how Zope could be able to keep calling 'select' on a socket after
the other side has closed it.

We agree.  This is a strange situation that none of us have seen before.

However, we have until tomorrow to do something, and replacing hardware is not feasible.

Is there any possibility that some kind of failover / IP takeover has
happened, such that the storage server now running is not the same host
/ instance as the one to which the clients originally connected?  Are
you using LVS + heartbeat, or some kind of hardware load balancer to
manage such redundancy?

We do have Web Services Directors that do load balancing, but in this particular case the storage server is not set up for load balancing. I am not aware of any features that make the ZODB capable of clustering, except for the replication services offered through Zope.

We are not sure whether the traffic is going to the Web Services Directors or not. Even if it is, there are thousands of settings and no one available who knows what to change.

The storage server is a simple NAS server with a static IP address.

What we are thinking about doing is changing the code in zrpc/ so that wait (line 638, Zope version 2.9.5) closes the connection if the wait time gets too large or the poll has happened too many times.
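
To make the idea concrete, here is a minimal standalone sketch of a poll loop that gives up after a time or poll-count budget. It is not the zrpc code: it works on a plain socket with select, and the names wait_for_reply, WaitTimeout, max_total and max_polls are all hypothetical, as are the default limits. It is written for current Python, not the Python that ships with Zope 2.9.5.

```python
import select
import socket


class WaitTimeout(Exception):
    """Hypothetical error: the peer stayed silent past the allowed budget."""


def wait_for_reply(sock, max_total=30.0, max_polls=50):
    """Poll sock with a doubling delay; close it and raise when over budget.

    max_total (seconds) and max_polls are illustrative limits, not values
    taken from ZEO.
    """
    delay = 0.001   # start at 1 ms
    waited = 0.0    # sum of the select timeouts used so far
    polls = 0
    while True:
        readable, _, _ = select.select([sock], [], [], delay)
        if readable:
            # Data arrived, or the peer closed (recv then returns b"").
            return sock.recv(4096)
        polls += 1
        waited += delay
        if waited >= max_total or polls >= max_polls:
            # Give up on this connection so the caller can reconnect.
            sock.close()
            raise WaitTimeout(
                "no reply after %d polls (%.3f s)" % (polls, waited))
        delay = min(delay * 2, 1.0)  # back off, capped at 1 s
```

The caller would catch WaitTimeout and open a fresh connection. Whether the real client recovers cleanly when its connection is closed this way from inside wait is exactly the open question, and the sketch does not answer it.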

We are great at Plone development, but have very little backend Zope development experience. Would someone please advise me as to whether this is going to cause more problems?

According to the log message you posted earlier in the thread, your
appservers are spewing thousands of log messages from the connection's
'pending' method, although your deadlock debugger output shows the one
thread blocked on 'select' inside of the connection's 'wait' method.
There should be lots of log messages at TRACE level for the wait call,
including a doubling / backoff of the delay value from 1 ms to 1 sec.
Do you see those log messages, as well?

These messages are there, and you can see the time doubling. This is where we were thinking of breaking the connection once the delay gets to a certain point, to make Zope reconnect.
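
For choosing that point, it may help to tabulate how quickly a doubling delay reaches its ceiling. The sketch below assumes a 1 ms start and a cap of exactly 1 s, as described in the thread; the real code may cap slightly differently, and backoff_schedule is a hypothetical helper, not part of ZEO.

```python
def backoff_schedule(start=0.001, cap=1.0, polls=15):
    """Return (delay, cumulative_wait) pairs for a doubling backoff."""
    schedule = []
    delay = start
    total = 0.0
    for _ in range(polls):
        total += delay
        schedule.append((delay, total))
        delay = min(delay * 2, cap)  # double, but never exceed the cap
    return schedule


if __name__ == "__main__":
    for n, (delay, total) in enumerate(backoff_schedule(), start=1):
        print("poll %2d: delay %.3f s, waited %.3f s" % (n, delay, total))
```

Under these assumptions the delay hits the cap on the 11th poll, after roughly one second of total waiting, and every later poll adds one more second. A threshold counted in polls and one counted in seconds are therefore nearly interchangeable once the cap is reached.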

This solves our hung connection problem, we think. However, I am hoping someone can let me know if I am breaking something else by doing this.

