On 07/06/2010 13:57, Chris McDonough wrote:
> On Mon, 2010-06-07 at 10:00 +0100, Phillip Oldham wrote:
>> We noticed an odd error over the weekend, and would like some advice.
>>
>> One of our "services", a Python thrift[1] server, which binds to a port
>> had an error and stopped responding to requests. Supervisord "saw" this,
>> and tried to bring up another instance.
> I think you might mean that superlance httpok saw this and tried to
> bring up another instance?  "Raw" supervisor doesn't monitor process
> behavior, only process up/down status.

We're still trying to understand the issue ourselves. We're not (yet) using superlance.

Supervisord is set to start a number of Python thrift servers, some of which communicate with each other. It seems that one of the services, which is called by the others, had an issue and stopped responding. As far as we can tell, this caused supervisord to spawn many instances of the calling service; it's also possible that thrift was spawning the processes and ignoring the hard limit of 10.

This is the log entry for the exiting service:
2010-06-05 12:26:57,493 CRIT uncaptured python exception, closing channel <POutputDispatcher at 121784512 for <Subprocess at 121308512 with name PDFService in state RUNNING> (stderr)> (<type 'exceptions.OSError'>:[Errno 2] No such file or directory [/usr/local/lib/python2.6/site-packages/supervisor-3.0a7-py2.6.egg/supervisor/supervisord.py|runforever|241] [/usr/local/lib/python2.6/site-packages/supervisor-3.0a7-py2.6.egg/supervisor/dispatchers.py|handle_read_event|242] [/usr/local/lib/python2.6/site-packages/supervisor-3.0a7-py2.6.egg/supervisor/dispatchers.py|record_output|176] [/usr/local/lib/python2.6/site-packages/supervisor-3.0a7-py2.6.egg/supervisor/dispatchers.py|_log|152] [/usr/local/lib/python2.6/site-packages/supervisor-3.0a7-py2.6.egg/supervisor/loggers.py|info|281] [/usr/local/lib/python2.6/site-packages/supervisor-3.0a7-py2.6.egg/supervisor/loggers.py|log|299] [/usr/local/lib/python2.6/site-packages/supervisor-3.0a7-py2.6.egg/supervisor/loggers.py|emit|194] [/usr/local/lib/python2.6/site-packages/supervisor-3.0a7-py2.6.egg/supervisor/loggers.py|doRollover|228])
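
One plausible reading of this traceback, offered as an assumption rather than a confirmed diagnosis: the last frame is `doRollover`, and a rotation step that renames a log file fails with `[Errno 2]` if that file has been removed from disk in the meantime (for example by a cleanup job). The sketch below reproduces only that general failure mode with a hypothetical `rotate` helper and a made-up file name; it is not supervisor's actual code.

```python
import errno
import os
import tempfile


def rotate(path):
    """Hypothetical, simplified rotation step: move `path` to `path.1`."""
    os.rename(path, path + ".1")


def try_rotate(path):
    """Return None on success, or the errno if the rotation fails."""
    try:
        rotate(path)
    except OSError as exc:
        return exc.errno
    return None


workdir = tempfile.mkdtemp()
log_path = os.path.join(workdir, "service-stderr.log")  # made-up name

# Normal case: the log file exists, so the rename succeeds.
with open(log_path, "w") as handle:
    handle.write("some output\n")
ok_result = try_rotate(log_path)

# Failure case: the file is deleted behind the logger's back before rotation.
with open(log_path, "w") as handle:
    handle.write("more output\n")
os.remove(log_path)
missing_result = try_rotate(log_path)

print(ok_result, missing_result == errno.ENOENT)
```

If this is what happened, checking whether anything deletes files in the directory holding the child logs would be a reasonable first step.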

The service didn't come back up, however. I believe supervisorctl was still showing the service as RUNNING when we realised where the error was and restarted it manually.

Is there a better way to catch such problems so that we can restart the service?

>> However the original instance
>> hadn't actually exited, so was still running and was still bound to the
>> port. Over the weekend supervisord brought up a number of instances of
>> the service, so in total we found ~30 running instances none of which
>> were responding correctly.
>>
>> We are about to script a plug-in for supervisord to "ping" the service
>> to monitor the connection. How would we then kill/restart the service if
>> it doesn't respond as expected?
> I think you probably need to answer the above question and maybe provide
> your current config so we can figure out what's going on before any
> other advice can be given.
>
> - C
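
For reference, a rough sketch of how the "ping" plug-in mentioned above could be built as a supervisor event listener that restarts the service over supervisord's XML-RPC interface. Several things here are assumptions: the listener would be registered in an `[eventlistener:...]` section subscribed to `TICK_60`, the XML-RPC interface would be enabled (the URL below assumes an `[inet_http_server]` on port 9001), and `run_listener`, `restart_via_xmlrpc` and the `is_healthy` callback are names made up for illustration. It is written for Python 3 (on 2.6, `xmlrpc.client` is `xmlrpclib`), and whether 3.0a7 supports all of this would need checking.

```python
import io
import xmlrpc.client


def parse_header(line):
    """Turn an event header 'key:value key:value ...' into a dict."""
    return dict(token.split(":", 1) for token in line.split())


def run_listener(stdin, stdout, is_healthy, restart, max_events=None):
    """Speak the event listener protocol; call restart() on a failed check."""
    handled = 0
    while max_events is None or handled < max_events:
        stdout.write("READY\n")         # tell supervisord we can take an event
        stdout.flush()
        header_line = stdin.readline()
        if not header_line:             # supervisord closed our stdin
            break
        header = parse_header(header_line)
        stdin.read(int(header["len"]))  # consume the payload (unused here)
        if header["eventname"].startswith("TICK") and not is_healthy():
            restart()
        stdout.write("RESULT 2\nOK")    # acknowledge the event
        stdout.flush()
        handled += 1
    return handled


def restart_via_xmlrpc(name, url="http://localhost:9001/RPC2"):
    """Stop then start `name` via XML-RPC (the URL is an assumption)."""
    server = xmlrpc.client.ServerProxy(url)
    server.supervisor.stopProcess(name)
    server.supervisor.startProcess(name)


if __name__ == "__main__":
    # Dry run with fake streams and a fake restart; no supervisord needed.
    fake_in = io.StringIO(
        "ver:3.0 server:supervisor serial:1 pool:pinger poolserial:1 "
        "eventname:TICK_60 len:15\nwhen:1275740760")
    fake_out = io.StringIO()
    restarts = []
    run_listener(fake_in, fake_out, is_healthy=lambda: False,
                 restart=lambda: restarts.append("PDFService"), max_events=1)
    print(restarts, repr(fake_out.getvalue()))
```

In a real deployment `stdin` and `stdout` would be `sys.stdin` and `sys.stdout`, `restart` would call `restart_via_xmlrpc` with the process name, and any diagnostics would have to go to stderr, since stdout carries the protocol.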


--

*Phillip B Oldham*
ActivityHQ