On 07/06/2010 16:40, Chris McDonough wrote:
On Mon, 2010-06-07 at 15:55 +0100, Phillip Oldham wrote:
On 07/06/2010 13:57, Chris McDonough wrote:
On Mon, 2010-06-07 at 10:00 +0100, Phillip Oldham wrote:
We noticed an odd error over the weekend, and would like some advice.
One of our "services", a Python thrift[1] server, which binds to a port
had an error and stopped responding to requests. Supervisord "saw" this,
and tried to bring up another instance.
I think you might mean that superlance httpok saw this and tried to
bring up another instance? "Raw" supervisor doesn't monitor process
behavior, only process up/down status.
We're still trying to understand the issue ourselves. We're not (yet)
using superlance.
Supervisord is set to fire-up a number of Python thrift servers, some
of which communicate between themselves. It seems that one of the
services which is called from the other services had an issue, and
stopped responding. This caused many instances of the calling service
to be spawned by supervisord - as far as we can tell; it's possible
that thrift was spawning the processes and ignoring the hard-limit of
10.
I think your diagnosis about the cause of the spawning of more instances
isn't correct. Supervisor doesn't monitor whether a process responds or
not; it doesn't know anything about this sort of high-level detail. It
only monitors the state of the process as known by the operating system
(basically: up or down). If one of your services "stopped responding",
supervisor would neither know nor care as long as the process that
represented that service was still in a state that indicated it was
running (e.g. the process did not exit or die).
For example, if the service was an HTTP service, supervisor wouldn't try
to restart it if it stopped answering HTTP requests but still appeared
to the operating system that it was otherwise up and running. A new
instance of a process is only spawned by supervisor when it detects that
the old instance of that process has exited unexpectedly.
I see. It must have been the thrift server spawning new processes.
This is the log entry for the exiting service:
2010-06-05 12:26:57,493 CRIT uncaptured python exception, closing
channel<POutputDispatcher at 121784512 for<Subprocess at 121308512
with name PDFService in state RUNNING> (stderr)> (<type
'exceptions.OSError'>:[Errno 2] No such file or directory
[/usr/local/lib/python2.6/site-packages/supervisor-3.0a7-py2.6.egg/supervisor/supervisord.py|runforever|241]
[/usr/local/lib/python2.6/site-packages/supervisor-3.0a7-py2.6.egg/supervisor/dispatchers.py|handle_read_event|242]
[/usr/local/lib/python2.6/site-packages/supervisor-3.0a7-py2.6.egg/supervisor/dispatchers.py|record_output|176]
[/usr/local/lib/python2.6/site-packages/supervisor-3.0a7-py2.6.egg/supervisor/dispatchers.py|_log|152]
[/usr/local/lib/python2.6/site-packages/supervisor-3.0a7-py2.6.egg/supervisor/loggers.py|info|281]
[/usr/local/lib/python2.6/site-packages/supervisor-3.0a7-py2.6.egg/supervisor/loggers.py|log|299]
[/usr/local/lib/python2.6/site-packages/supervisor-3.0a7-py2.6.egg/supervisor/loggers.py|emit|194]
[/usr/local/lib/python2.6/site-packages/supervisor-3.0a7-py2.6.egg/supervisor/loggers.py|doRollover|228])
This output seems to indicate that there is a bug in supervisor itself.
It's pretty obscure, and would result from a combination of these
circumstances:
- a process log ("stdout_logfile"/"stderr_logfile") or the supervisor
log file itself ("logfile")
had grown beyond the bounds of its maxbytes size, and needed to
be "rolled over" (a foo.log.1 file needed to be created).
-AND-
- One of:
- the directory holding the log data was deleted while supervisor
was running.
-OR-
- the file holding the log data was deleted while supervisor was
running.
This may be responsible for misbehavior of your configuration. But it
also might be unrelated.
The service didn't come back up, however. I believe the supervisorctl
was still showing the service as RUNNING at the point we realised
where the error was and restarted the service manually.
We're wondering if there's a better way to catch such problems so we
can restart the service?
It's unclear what relationship the logging error has to the symptom
you're describing. Unfortunately we need to remove that possibility
from the picture before diagnosing any further.
Here's what I'd do in order to remove that possibility:
- In the [supervisord] section configure "logfile_maxbytes" to 0
- In the [supervisord] section configure "logfile_backups" to 0
- For any programs configured, configure their "stdout_logfile" and
and "stderr_logfile" parameters to 'NONE'.
This will effectively turn off logfile rotation/backups for the main
supervisor log (this log file will grow without bound). It will also
turn off individual logs for each process. If you run supervisor in
"-edebug" mode, however, the process log info will still go to the main
log (supervisord.log) itself, so you'll be able to see what the
processes are doing, if that's necessary.
Do this, then leave the system up to see if the symptom goes away. If
the symptom remains, we know the issue is unrelated to any logging bug.
Note that if you later reconfigure supervisor to do logfile backups, try
to make sure you don't delete the log files created by supervisor or the
directories in which they live while supervisor is running. Supervisor
should handle this better, but it's still a fairly bad thing to do to a
program.
Note that also you are running an older version of supervisor: the
latest is 3.0a8. There were a number of bugs fixed between these two
versions although AFACIT none related to the logging bug you reported;
nonetheless an upgrade is recommended.
In the meantime, I'll try to fix the underlying log rollover bug on the
supervisor trunk for the next release.
Thanks - I'll do as you suggest and see whether the bug happens again.
--
*Phillip B Oldham*
ActivityHQ
[email protected] <mailto:[email protected]>
------------------------------------------------------------------------
*Policies*
This e-mail and its attachments are intended for the above named
recipient(s) only and may be confidential. If they have come to you in
error, please reply to this e-mail and highlight the error. No action
should be taken regarding content, nor must you copy or show them to anyone.
This e-mail has been created in the knowledge that Internet e-mail is
not a 100% secure communications medium, and we have taken steps to
ensure that this e-mail and attachments are free from any virus. We must
advise that in keeping with good computing practice the recipient should
ensure they are completely virus free, and that you understand and
observe the lack of security when e-mailing us.
------------------------------------------------------------------------
_______________________________________________
Supervisor-users mailing list
[email protected]
http://lists.supervisord.org/mailman/listinfo/supervisor-users