On Mon, 2010-06-07 at 15:55 +0100, Phillip Oldham wrote:
> On 07/06/2010 13:57, Chris McDonough wrote:
> > On Mon, 2010-06-07 at 10:00 +0100, Phillip Oldham wrote:
> >
> > > We noticed an odd error over the weekend, and would like some advice.
> > >
> > > One of our "services", a Python thrift[1] server, which binds to a port
> > > had an error and stopped responding to requests. Supervisord "saw" this,
> > > and tried to bring up another instance.
> > >
> > I think you might mean that superlance httpok saw this and tried to
> > bring up another instance? "Raw" supervisor doesn't monitor process
> > behavior, only process up/down status.
> >
>
> We're still trying to understand the issue ourselves. We're not (yet)
> using superlance.
>
> Supervisord is set to fire-up a number of Python thrift servers, some
> of which communicate between themselves. It seems that one of the
> services which is called from the other services had an issue, and
> stopped responding. This caused many instances of the calling service
> to be spawned by supervisord - as far as we can tell; it's possible
> that thrift was spawning the processes and ignoring the hard-limit of
> 10.
I think your diagnosis about the cause of the spawning of more instances
isn't correct. Supervisor doesn't monitor whether a process responds or
not; it doesn't know anything about this sort of high-level detail. It
only monitors the state of the process as known by the operating system
(basically: up or down). If one of your services "stopped responding",
supervisor would neither know nor care as long as the process that
represented that service was still in a state that indicated it was
running (e.g. the process did not exit or die).
For example, if the service was an HTTP service, supervisor wouldn't try
to restart it if it stopped answering HTTP requests but still appeared
to the operating system that it was otherwise up and running. A new
instance of a process is only spawned by supervisor when it detects that
the old instance of that process has exited unexpectedly.
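As an aside: if you want something to restart a process that is up but no longer responding, that's the job of the superlance plugins rather than supervisor itself. A hedged sketch of an httpok event listener config follows; the program name and URL are hypothetical, and note that httpok only speaks HTTP, so a thrift service would need a custom listener built along the same lines:

```ini
[eventlistener:pdfservice_monitor]
; httpok wakes up on each TICK event, issues a GET against the URL,
; and restarts the named program if it does not get a 200 response.
; (Program name and URL are examples, not from your config.)
command=httpok -p pdfservice http://localhost:9090/
events=TICK_60
```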
> This is the log entry for the exiting service:
> 2010-06-05 12:26:57,493 CRIT uncaptured python exception, closing
> channel <POutputDispatcher at 121784512 for <Subprocess at 121308512
> with name PDFService in state RUNNING> (stderr)> (<type
> 'exceptions.OSError'>:[Errno 2] No such file or directory
> [/usr/local/lib/python2.6/site-packages/supervisor-3.0a7-py2.6.egg/supervisor/supervisord.py|runforever|241]
>
> [/usr/local/lib/python2.6/site-packages/supervisor-3.0a7-py2.6.egg/supervisor/dispatchers.py|handle_read_event|242]
>
> [/usr/local/lib/python2.6/site-packages/supervisor-3.0a7-py2.6.egg/supervisor/dispatchers.py|record_output|176]
>
> [/usr/local/lib/python2.6/site-packages/supervisor-3.0a7-py2.6.egg/supervisor/dispatchers.py|_log|152]
>
> [/usr/local/lib/python2.6/site-packages/supervisor-3.0a7-py2.6.egg/supervisor/loggers.py|info|281]
>
> [/usr/local/lib/python2.6/site-packages/supervisor-3.0a7-py2.6.egg/supervisor/loggers.py|log|299]
>
> [/usr/local/lib/python2.6/site-packages/supervisor-3.0a7-py2.6.egg/supervisor/loggers.py|emit|194]
>
> [/usr/local/lib/python2.6/site-packages/supervisor-3.0a7-py2.6.egg/supervisor/loggers.py|doRollover|228])
This output seems to indicate that there is a bug in supervisor itself.
It's pretty obscure, and would result from a combination of these
circumstances:
- a process log ("stdout_logfile"/"stderr_logfile") or the main
  supervisor log file ("logfile") had grown beyond the bounds of its
  "maxbytes" size and needed to be "rolled over" (a foo.log.1 file
  needed to be created),
-AND-
- one of:
  - the directory holding the log data was deleted while supervisor
    was running,
  -OR-
  - the file holding the log data was deleted while supervisor was
    running.
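To make the failure mode concrete: the traceback above bottoms out in doRollover(), and a rollover (roughly speaking) renames the current log file to a ".1" backup. If the file or its directory has been deleted out from under supervisord, that rename raises exactly the OSError you logged. This sketch is illustrative only, not supervisor code:

```python
import errno
import os
import tempfile

# Simulate a log file that was deleted while the daemon was running:
# we point at a path in a fresh directory but never create the file.
tmpdir = tempfile.mkdtemp()
missing_log = os.path.join(tmpdir, "foo.log")

err = None
try:
    # Roughly what a rollover does: rename foo.log -> foo.log.1.
    os.rename(missing_log, missing_log + ".1")
except OSError as exc:
    err = exc

print(err)  # [Errno 2] No such file or directory
```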
This bug may be responsible for the misbehavior you're seeing, but it
might also be unrelated.
> The service didn't come back up, however. I believe the supervisorctl
> was still showing the service as RUNNING at the point we realised
> where the error was and restarted the service manually.
>
> We're wondering if there's a better way to catch such problems so we
> can restart the service?
It's unclear what relationship the logging error has to the symptom
you're describing. Unfortunately we need to remove that possibility
from the picture before diagnosing any further.
Here's what I'd do in order to remove that possibility:
- In the [supervisord] section configure "logfile_maxbytes" to 0
- In the [supervisord] section configure "logfile_backups" to 0
- For any programs configured, set their "stdout_logfile" and
  "stderr_logfile" parameters to 'NONE'.
This will effectively turn off logfile rotation/backups for the main
supervisor log (this log file will grow without bound). It will also
turn off individual logs for each process. If you run supervisor in
"-e debug" mode, however, the process log info will still go to the main
log (supervisord.log) itself, so you'll be able to see what the
processes are doing, if that's necessary.
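Concretely, the changes above would look like this in supervisord.conf (the program name here is hypothetical; apply the same two lines to each of your [program:x] sections):

```ini
[supervisord]
logfile=/var/log/supervisord.log
logfile_maxbytes=0        ; never roll the main log over
logfile_backups=0         ; keep no numbered backups

[program:pdfservice]      ; hypothetical program name
command=/usr/local/bin/pdfservice
stdout_logfile=NONE       ; discard process stdout entirely
stderr_logfile=NONE       ; discard process stderr entirely
```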
Do this, then leave the system up to see if the symptom goes away. If
the symptom remains, we know the issue is unrelated to any logging bug.
Note that if you later reconfigure supervisor to do logfile backups, try
to make sure you don't delete the log files created by supervisor or the
directories in which they live while supervisor is running. Supervisor
should handle this better, but it's still a fairly bad thing to do to a
program.
Note also that you are running an older version of supervisor: the
latest is 3.0a8. A number of bugs were fixed between these two
versions, although AFAICT none related to the logging bug you reported;
nonetheless an upgrade is recommended.
In the meantime, I'll try to fix the underlying log rollover bug on the
supervisor trunk for the next release.
- C
_______________________________________________
Supervisor-users mailing list
[email protected]
http://lists.supervisord.org/mailman/listinfo/supervisor-users