On 2/27/07, Paul Williams <[EMAIL PROTECTED]> wrote:
Tres Seaver wrote:
> Paul Williams wrote:
>> Ok, here is what we have. I did a netstat on both machines, client and
>> server. The client sees an established connection and the server does
>> not. In the server log there is a disconnect. As far as hardware
>> between them, there is a switch (Dell PowerConnect 6024). Web Server
>> Directors might get hold of it but there are no hops on traceroute.
>> Traceroute only shows the client machine and the server machine.
>> So the client is just continuously polling the connection but getting
>> nothing back.
> That sounds like some weird kernel / networking problem to me: I don't
> see how Zope could be able to keep calling 'select' on a socket after
> the other side has closed it.
We agree. This is a strange situation that none of us have seen before.
However, we have until tomorrow to do something and replacing hardware
is not feasible.
> Is there any possibility that some kind of failover / IP takeover has
> happened, such that the storage server now running is not the same host
> / instance as the one to which the clients originally connected? Are
> you using LVS + heartbeat, or some kind of hardware load balancer to
> manage such redundancy?
We do have Web Services Directors that do load balancing, but in this
particular case, the storage server is not set up for load balancing. I
am not aware of any features that make the ZODB capable of clustering
except for replication services offered through Zope.
We are not sure whether the traffic is going to the Web Services
Directors or not. Even if it is, there are thousands of settings and
there is no one available who knows what to change.
The storage server is a simple NAS server with a static IP address.
>> What we are thinking about doing is changing the code in
>> zrpc/connection.py to close the connection in wait (line 638 Zope
>> version 2.9.5) if the wait time gets too large or the poll has happened
>> too many times.
>> We are great at Plone development, but have done very little backend Zope
>> development. Would someone please advise me as to whether this is going
>> to cause more problems?
> According to the log message you posted earlier in the thread, your
> appservers are spewing thousands of log messages from the connection's
> 'pending' method, although your deadlock debugger output shows the one
> thread blocked on 'select' inside of the connection's 'wait' method.
> There should be lots of log messages at TRACE level for the wait call,
> including a doubling / backoff of the delay value from 1 ms to 1 sec.
> Do you see those log messages, as well?
These messages are there. You can see the time doubling. This is where
we were thinking of breaking the connection once the delay gets to a
certain point and making Zope reconnect.
This solves our hung connection problem, we think. However, I am hoping
someone can let me know if I am breaking something else by doing this.
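
The idea described above (give up once the backoff has run for too long,
then drop the connection so the client can reconnect) could be sketched
roughly as follows. This is not the ZEO code from zrpc/connection.py: the
names wait_with_limit, ReplyTimeout, poll, has_reply, close and max_wait
are all hypothetical, and the 30 second limit is an arbitrary assumption.
Only the 1 ms to 1 sec doubling comes from the thread.

```python
class ReplyTimeout(Exception):
    """Hypothetical error: no reply arrived within the allowed time."""


def wait_with_limit(poll, has_reply, close, max_wait=30.0):
    """Poll until has_reply() is true, doubling the delay from 1 ms to 1 s.

    poll(delay) is assumed to block for at most `delay` seconds, the way a
    select-based poll would.  Once the delays handed to poll() add up to
    max_wait seconds or more, close() is called and ReplyTimeout is raised
    so that the caller can decide how to reconnect.
    """
    delay = 0.001   # start at 1 ms
    waited = 0.0    # sum of the delays passed to poll() so far
    while not has_reply():
        if waited >= max_wait:
            close()
            raise ReplyTimeout("no reply after %.3f s" % waited)
        poll(delay)
        waited += delay
        delay = min(delay * 2, 1.0)   # back off, capped at 1 s
```

Whoever calls such a function has to catch the exception and retry or
fail the request, since the call that was waiting never gets its reply.
Whether the real client handles that cleanly is exactly the open question
here, so the sketch only shows the control flow.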
I don't remember whether you already mentioned it, but did you try
to monitor the outgoing and incoming traffic? I mean, setting some
iptables rules and/or using something like tcpdump to monitor what is
going on here?