[Zope] Re: zope unresponsive
No, we haven't done that yet. That is something else we may try.

Marco Bizzarri wrote:
> I don't remember if you already mentioned it. However: did you try to
> monitor the outgoing and incoming traffic? I mean, setting some iptables
> rules and/or using something like tcpdump to see what is going on here?
>
> Regards
> Marco

___
Zope maillist - Zope@zope.org
http://mail.zope.org/mailman/listinfo/zope
** No cross posts or HTML encoding! **
(Related lists -
 http://mail.zope.org/mailman/listinfo/zope-announce
 http://mail.zope.org/mailman/listinfo/zope-dev )
Re: [Zope] Re: zope unresponsive
On 2/27/07, Paul Williams <[EMAIL PROTECTED]> wrote:
> These messages are there. You can see the time doubling. This is where
> we were thinking of breaking the connection once it gets to a certain
> point and making zope reconnect. This solves our hung connection
> problem, we think. However, I am hoping someone can let me know if I am
> breaking something else by doing this.

I don't remember if you already mentioned it. However: did you try to monitor the outgoing and incoming traffic? I mean, setting some iptables rules and/or using something like tcpdump to see what is going on here?

Regards
Marco

--
Marco Bizzarri
http://iliveinpisa.blogspot.com/
[Zope] Re: zope unresponsive
Tres Seaver wrote:
> Paul Williams wrote:
>> Ok, here is what we have. I did a netstat on both machines, client and
>> server. The client sees an established connection and the server does
>> not. In the server log there is a disconnect. As far as hardware
>> between them, there is a switch (dell powerconnect 6024). Web Server
>> Directors might get hold of it but there are no hops on traceroute.
>> Traceroute only shows the client machine and the server machine.
>>
>> So the client is just continuously polling the connection but getting
>> nothing back.
>
> That sounds like some weird kernel / networking problem to me: I don't
> see how Zope could be able to keep calling 'select' on a socket after
> the other side has closed it.

We agree. This is a strange situation that none of us have seen before. However, we have until tomorrow to do something and replacing hardware is not feasible.

> Is there any possibility that some kind of failover / IP takeover has
> happened, such that the storage server now running is not the same host
> / instance as the one to which the clients originally connected? Are
> you using LVS + heartbeat, or some kind of hardware load balancer to
> manage such redundancy?

We do have Web Services Directors that do load balancing, but in this particular case the storage server is not set up for load balancing. I am not aware of any features that make the zodb capable of clustering except for replication services offered through zope. We are not sure whether the traffic is going to the Web Services Directors or not. Even if it is, there are thousands of settings and there is no one available who knows what to change. The storage server is a simple nas server with a static ip address.

>> What we are thinking about doing is changing the code in
>> zrpc/connection.py to close the connection in wait (line 638 zope
>> version 2.9.5) if the wait time gets too large or the poll has happened
>> too many times.
>>
>> We are great at plone development, but have very little backend zope
>> development. Would someone please advise me as to whether this is going
>> to cause more problems?
>
> According to the log message you posted earlier in the thread, your
> appservers are spewing thousands of log messages from the connection's
> 'pending' method, although your deadlock debugger output shows the one
> thread blocked on 'select' inside of the connection's 'wait' method.
> There should be lots of log messages at TRACE level for the wait call,
> including a doubling / backoff of the delay value from 1 ms to 1 sec.
> Do you see those log messages, as well?

These messages are there. You can see the time doubling. This is where we were thinking of breaking the connection once it gets to a certain point and making zope reconnect. This solves our hung connection problem, we think. However, I am hoping someone can let me know if I am breaking something else by doing this.

> Tres.
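A minimal sketch of the cutoff Paul describes, for readers who want to see the policy in isolation. It is not the ZEO code in zrpc/connection.py: `backoff_delays`, `wait_with_cutoff`, `poll_once` and `max_total_wait` are hypothetical names, and the 1 ms to 1 s doubling is taken from Tres's description of the TRACE messages. The caller is assumed to close the connection and reconnect when the function returns False.

```python
def backoff_delays(start=0.001, cap=1.0):
    """Yield poll delays that double from `start` up to `cap` seconds."""
    delay = start
    while True:
        yield delay
        delay = min(delay * 2, cap)


def wait_with_cutoff(poll_once, max_total_wait=300.0):
    """Poll with a doubling delay and give up after `max_total_wait` seconds.

    `poll_once(delay)` is a hypothetical callable that waits up to `delay`
    seconds and returns True once the reply has arrived.
    Returns True if the reply arrived, False if the cutoff was reached.
    """
    waited = 0.0
    for delay in backoff_delays():
        if poll_once(delay):
            return True
        # Counts the nominal delay, not measured wall-clock time.
        waited += delay
        if waited >= max_total_wait:
            return False
```

Whether closing the connection at this point is safe inside ZEO is exactly the open question in the message above; the sketch only shows the give-up condition, not its side effects on pending requests.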
[Zope] Re: zope unresponsive
Paul Williams wrote:
> Ok, here is what we have. I did a netstat on both machines, client and
> server. The client sees an established connection and the server does
> not. In the server log there is a disconnect. As far as hardware
> between them, there is a switch (dell powerconnect 6024). Web Server
> Directors might get hold of it but there are no hops on traceroute.
> Traceroute only shows the client machine and the server machine.
>
> So the client is just continuously polling the connection but getting
> nothing back.

That sounds like some weird kernel / networking problem to me: I don't see how Zope could be able to keep calling 'select' on a socket after the other side has closed it.

Is there any possibility that some kind of failover / IP takeover has happened, such that the storage server now running is not the same host / instance as the one to which the clients originally connected? Are you using LVS + heartbeat, or some kind of hardware load balancer to manage such redundancy?

> What we are thinking about doing is changing the code in
> zrpc/connection.py to close the connection in wait (line 638 zope
> version 2.9.5) if the wait time gets too large or the poll has happened
> too many times.
>
> We are great at plone development, but have very little backend zope
> development. Would someone please advise me as to whether this is going
> to cause more problems?

According to the log message you posted earlier in the thread, your appservers are spewing thousands of log messages from the connection's 'pending' method, although your deadlock debugger output shows the one thread blocked on 'select' inside of the connection's 'wait' method. There should be lots of log messages at TRACE level for the wait call, including a doubling / backoff of the delay value from 1 ms to 1 sec. Do you see those log messages, as well?

Tres.
--
Tres Seaver          +1 540-429-0999          [EMAIL PROTECTED]
Palladion Software   "Excellence by Design"   http://palladion.com
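The netstat comparison Paul describes (client reports an established connection, server does not) could be scripted roughly as follows. This is a sketch under assumptions: it expects the text of Linux `netstat -tn` output captured on each machine, the function names are hypothetical, port 8100 is taken from the trace message in the thread, and it assumes both machines see the same addresses (no NAT in between).

```python
def established_to_port(netstat_output, port):
    """Return (local, remote) address pairs in ESTABLISHED state on `port`."""
    pairs = []
    wanted = str(port)
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected columns: Proto Recv-Q Send-Q Local Foreign State
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, remote, state = fields[3], fields[4], fields[5]
        if state != "ESTABLISHED":
            continue
        if wanted in (local.rsplit(":", 1)[-1], remote.rsplit(":", 1)[-1]):
            pairs.append((local, remote))
    return pairs


def half_open(client_output, server_output, port=8100):
    """Connections the client lists as established but the server does not."""
    # Flip the server's view so it is comparable with the client's.
    server_view = {(remote, local)
                   for local, remote in established_to_port(server_output, port)}
    return [pair for pair in established_to_port(client_output, port)
            if pair not in server_view]
```

A non-empty result from `half_open` would match the situation reported here; it does not say which device dropped the connection.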
[Zope] Re: zope unresponsive
Ok, here is what we have. I did a netstat on both machines, client and server. The client sees an established connection and the server does not. In the server log there is a disconnect. As far as hardware between them, there is a switch (dell powerconnect 6024). Web Server Directors might get hold of it but there are no hops on traceroute. Traceroute only shows the client machine and the server machine.

So the client is just continuously polling the connection but getting nothing back.

What we are thinking about doing is changing the code in zrpc/connection.py to close the connection in wait (line 638 zope version 2.9.5) if the wait time gets too large or the poll has happened too many times.

We are great at plone development, but have very little backend zope development. Would someone please advise me as to whether this is going to cause more problems?

Thanks,
Paul Williams

Paul Williams wrote:
> I have posted this several times, but have not until now been able to
> get DeadlockDebugger installed. I see several people have had this
> problem, but no one has posted a solution.
>
> zope 2.9.5 + zeo
> python 2.4.3
> Red Hat RHEL 4
> Plone 2.5.1
>
> Our zeo clients hang intermittently. We have no way of reproducing the
> problem, but it occurs daily. The client hangs and a restart seems to
> fix the problem.
>
> In the event log with tracing on we get
>
> Trace zeo.zrpc.Connection(C) wait(16697) {server:8100} pending, async=0
>
> There are hundreds to thousands of these until the server is restarted.
>
> In the zeo log we get
>
> Error caught in asyncore
> asyncore.py error:(110,'Connection timed out')
>
> We have been trying to track this down and have had no luck. Does
> anyone have any suggestions?
> Below is our deadlock debugger output
>
> Threads traceback dump at 2007-02-23 15:26:50
>
> Thread -1269564496 (GET /VirtualHostBase/https/soawds:443/VirtualHostRoot/Content///training):
>   File "/var/tmp/Zope-2.9.5-1-buildroot/apps1/zope2.9.5/lib/python/ZServer/PubCore/ZServerPublisher.py", line 23, in __init__
>   File "/var/tmp/Zope-2.9.5-1-buildroot/apps1/zope2.9.5/lib/python/ZPublisher/Publish.py", line 395, in publish_module
>   File "/var/tmp/Zope-2.9.5-1-buildroot/apps1/zope2.9.5/lib/python/ZPublisher/Publish.py", line 196, in publish_module_standard
>   File "/apps1/zope2.9.5/navo_instance/Products/PlacelessTranslationService/PatchStringIO.py", line 34, in new_publish
>     x = Publish.old_publish(request, module_name, after_list, debug)
>   File "/var/tmp/Zope-2.9.5-1-buildroot/apps1/zope2.9.5/lib/python/ZPublisher/Publish.py", line 115, in publish
>   File "/var/tmp/Zope-2.9.5-1-buildroot/apps1/zope2.9.5/lib/python/ZPublisher/mapply.py", line 88, in mapply
>   File "/var/tmp/Zope-2.9.5-1-buildroot/apps1/zope2.9.5/lib/python/ZPublisher/Publish.py", line 41, in call_object
>   File "/var/tmp/Zope-2.9.5-1-buildroot/apps1/zope2.9.5/lib/python/Shared/DC/Scripts/Bindings.py", line 311, in __call__
>   File "/var/tmp/Zope-2.9.5-1-buildroot/apps1/zope2.9.5/lib/python/Shared/DC/Scripts/Bindings.py", line 348, in _bindAndExec
>   File "/apps1/zope2.9.5/navo_instance/Products/CMFCore/FSPageTemplate.py", line 195, in _exec
>     result = self.pt_render(extra_context=bound_names)
>   File "/apps1/zope2.9.5/navo_instance/Products/CacheSetup/patch_cmf.py", line 38, in FSPT_pt_render
>     result = FSPageTemplate.inheritedAttribute('pt_render')(
>   File "/apps1/zope2.9.5/navo_instance/Products/CacheSetup/patch_cmf.py", line 92, in PT_pt_render
>     tal=not source, strictinsert=0)()
>   File "/var/tmp/Zope-2.9.5-1-buildroot/apps1/zope2.9.5/lib/python/TAL/TALInterpreter.py", line 238, in __call__
>   File "/var/tmp/Zope-2.9.5-1-buildroot/apps1/zope2.9.5/lib/python/TAL/TALInterpreter.py", line 281, in interpret
>   File "/var/tmp/Zope-2.9.5-1-buildroot/apps1/zope2.9.5/lib/python/TAL/TALInterpreter.py", line 749, in do_useMacro
>   File "/var/tmp/Zope-2.9.5-1-buildroot/apps1/zope2.9.5/lib/python/TAL/TALInterpreter.py", line 281, in interpret
>   File "/var/tmp/Zope-2.9.5-1-buildroot/apps1/zope2.9.5/lib/python/TAL/TALInterpreter.py", line 457, in do_optTag_tal
>   File "/var/tmp/Zope-2.9.5-1-buildroot/apps1/zope2.9.5/lib/python/TAL/TALInterpreter.py", line 442, in do_optTag
>   File "/var/tmp/Zope-2.9.5-1-buildroot/apps1/zope2.9.5/lib/python/TAL/TALInterpreter.py", line 437, in no_tag
>   File "/var/tmp/Zope-2.9.5-1-buildroot/apps1/zope2.9.5/lib/python/TAL/TALInterpreter.py", line 281, in interpret
>   File "/var/tmp/Zope-2.9.5-1-buildroot/apps1/zope2.9.5/lib/python/TAL/TALInterpreter.py", line 749, in do_useMacro
>   File "/var/tmp/Zope-2.9.5-1-buildroot/apps1/zope2.9.5/lib/python/TAL/TALInterpreter.py", line 281, in interpret
>   File "/var/tmp/Zope-2.9.5-1-buildroot/apps1/zope2.9.5/lib/python/TAL/TALInterpreter.py", line 507, in do_setLocal_tal
>   File "/var/tmp/Zope-2.9.5-1-buildroot/apps1/zope2.9.5/lib/python/Products/PageTempl
[Zope] Re: zope unresponsive
Paul Williams wrote:
> I know that there is a switch between zeo and zope and probably a
> firewall too, but how do I prove this is the problem? This is on a
> production server in a military installation. I have major problems
> getting any kind of troubleshooting support. First we don't get
> access, and second no kind of debugging is allowed. You couldn't
> imagine the paperwork and the three months it took for me to get
> deadlockdebugger installed.

Create a parallel testing environment (in fact, any place that locked down should already have one), and try tweaking the firewall settings in that environment. Another possibility is to have the appserver serve only from its cache for a "long" time (e.g., while repeatedly serving the same page), and then request a page (e.g., search, or something) which needs objects not already in cache.

Once you can provoke the problem at will in the test environment, *then* you can begin tweaking the firewall to keep it from closing the socket between the appserver and the storage server (if that is your problem). Once you can prevent that in test, *then* you can push to have that change rolled to production.

Tres.
--
Tres Seaver          +1 540-429-0999          [EMAIL PROTECTED]
Palladion Software   "Excellence by Design"   http://palladion.com
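One way the reproduction Tres suggests might be scripted in a test environment: keep requesting one page so the appserver answers from its cache and its storage connection sits idle, then request a page that needs uncached objects. This is only a sketch. The function names, URLs, interval and timeout are hypothetical, and whether a given page is really served from cache has to be checked on the instance itself.

```python
import time
import urllib.request


def fetch(url, timeout):
    """Return True if `url` answers within `timeout` seconds, else False.

    HTTP error statuses and connection failures both count as failures.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()
        return True
    except OSError:
        return False


def provoke_idle_hang(cached_url, uncached_url, idle_seconds, interval=5.0,
                      timeout=30.0, fetch=fetch, sleep=time.sleep,
                      clock=time.monotonic):
    """Request `cached_url` for `idle_seconds`, then request `uncached_url`.

    Returns True if the final request answered, False if it failed or
    timed out (which is what a hung storage connection would look like).
    """
    deadline = clock() + idle_seconds
    while clock() < deadline:
        fetch(cached_url, timeout)
        sleep(interval)
    return fetch(uncached_url, timeout)
```

Running it with increasing `idle_seconds` values would give a rough idea of how long the connection has to stay idle before something between the appserver and the storage server drops it, if that is indeed the cause.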
Re: [Zope] Re: zope unresponsive
I'll chime in with a "me too" (see my thread within the last week on the same list). I haven't looked into it as deeply as you, but I have tried the DeadlockDebugger, which itself was inaccessible during the time when zope was spinning. Nothing in the logs. My install is Zope 2.8.5 on RHEL 4 without Zeo.

Florent Guillaume wrote:
> Try DeadlockDebugger.
>
> Florent
[Zope] Re: zope unresponsive
Try DeadlockDebugger.

Florent

Andy Altepeter wrote:
> Hey All,
>
> I'm experiencing hanging issues with my Zope-2.8.6+zeo setup / RHEL 4.
> The hanging isn't characterized by 100% cpu usage. Actually, I had the
> same issues using 2.8.5, but I've upgraded since then.
>
> Here's the situation: I have one zeo client connected to a zeo server
> on the same box. Apache sits in front, using RewriteRules to request
> data from zope. After some time (could be 2 minutes or an hour), the
> zeo client stops responding. Apparently this is called a deadlock or a
> "spinning zope".
>
> I've tried using gdb to attach to the zeo client pid, and use the
> recipe http://zopelabs.com/cookbook/1073504990 to print a traceback,
> but the call always aborted with SIGABRT.
>
> I've captured all of the requests sent to zope during an uptime window
> (via Z2.log), and used wget to "replay" the requests. I've also pulled
> from apache's rewrite log all requests proxied to zope, thinking the
> Z2.log only writes finished requests. I set up another zeo client (on
> the same box, different port) and used wget to replay these captures as
> well. Just running these captures does not cause zope to hang. In
> fact, I have not been able to cause zope to hang by replaying. There
> doesn't seem to be any one url or sequence of urls that causes zope to
> hang.
>
> I've tried reinstalling the zope instance, but that didn't help.
>
> I've tried using requestprofiler.py to inspect the trace.log. This
> shows a high number of "hangs", but not on a url that actually triggers
> a spinning zope.
>
> Basically, that's where I'm stuck. Is there anything else I can try?
> Am I missing something?
>
> Thanks for the help,
> Andy

--
Florent Guillaume, Nuxeo (Paris, France)   Director of R&D
+33 1 40 33 71 59   http://nuxeo.com   [EMAIL PROTECTED]
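The replay Andy describes (pulling requests out of Z2.log and feeding them to wget) might look roughly like this in script form. It is a sketch, assuming the access log uses the common or combined log format with the request line in double quotes; `urls_to_replay` and the base URL are hypothetical, and only GET requests are kept because replaying POSTs would need the original request bodies.

```python
import re

# Matches the quoted request line, e.g. "GET /some/path HTTP/1.1"
REQUEST = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+"')


def urls_to_replay(log_lines, base_url):
    """Return the GET requests found in `log_lines` as URLs under `base_url`."""
    urls = []
    for line in log_lines:
        match = REQUEST.search(line)
        if match and match.group("method") == "GET":
            urls.append(base_url.rstrip("/") + match.group("path"))
    return urls
```

The resulting list can be written to a file and passed to `wget -i`, or fetched in order from Python. As the message notes, a log that only records finished requests will not contain the request that was in flight when the hang started.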