On 16/11/16 06:29, Roberto De Ioris wrote:
On 15/11/16 09:29, Roberto De Ioris wrote:
Hi there,

We've been having a problem recently where it appears that the uWSGI Emperor is restarting all vassals -- or, perhaps, is restarting itself.

Our setup is that on a given web server, all Python code for the running vassals is mounted over NFS. Sometimes there are glitches on the NFS server, so it stops responding for a minute or so. When it comes back, all vassals appear to restart; this puts so much load on the machine that it's essentially non-responsive for some time. So what would be an outage lasting a couple of minutes (bad) can take up to 20 minutes to clear (very, very bad).

Vassals are all configured via .ini files in a specific directory, so it appears that either:

* The emperor is restarting all of its vassals, or
* The emperor is restarting itself, which means it's shutting down all of the existing vassals as it exits, then reading all of the vassal files on startup.

Is there something in the emperor that makes it restart either all vassals or itself when a subset of the vassals stop responding for some time?

All the best,

Giles
Hi Roberto,
Hi Giles, do you have the emperor logs from during the NFS outage?
I do have the logs, though there's a lot of stuff in there -- is there anything I can search for?
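Emperor lifecycle events are logged with an `[emperor]` tag (e.g. the "removed uwsgi instance ..." lines that turn up later in this thread), so one search worth trying is a sketch like this -- the log path is a placeholder, and the sample file here exists only to make the snippet self-contained:

```shell
# build a tiny sample log so the search can be demonstrated (placeholder content)
cat > /tmp/uwsgi-sample.log <<'EOF'
2016-11-12 18:50:54 +0000 EMPEROR - [emperor] removed uwsgi instance www.example.com.ini
[uWSGI] getting INI configuration from www.example.com.ini
EOF

# count emperor lifecycle lines; on a real box, point this at the emperor's log file
grep -c '\[emperor\]' /tmp/uwsgi-sample.log   # prints 1
```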
Are the vassal files physically on the NFS storage?
No, the vassal files are on a local filesystem. NFS handles the source code and the working directories for the vassals. I believe a number of them are using SQLite for storage, so that would definitely lock up during an NFS outage.
How did you mount the NFS filesystem? (I mean the options, like sync, intr and so on.)
The options are nfsvers=3,soft,timeo=100,retry=5,rw,intr
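For context, those options would correspond to an /etc/fstab entry along these lines -- the server address and mount points are placeholders, not the real ones. Note that with `soft`, the client eventually returns an I/O error to the application instead of retrying forever, which is what would surface the outage to anything (like SQLite) doing I/O on the mount:

```
# hypothetical fstab entry; server address and paths are placeholders
XXX.XXX.XXX.XXX:/export/home  /home  nfs  nfsvers=3,soft,timeo=100,retry=5,rw,intr  0  0
```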
It would be interesting to know if the emperor crashed (and systemd/upstart/whatever restarted it) or whether the single instances crashed and the emperor restarted them one by one.
Hi Roberto,
Thanks for looking into this!
Now that I look at it, it looks like it must have been the single
vassal instances that were restarted. Working via the system logs:
* syslog shows the NFS outage starting at 18:48:52, ending at
18:49:33; over this period we get messages saying "nfs: server
XXX.XXX.XXX.XXX not responding, timed out"
* Sometime shortly after, we logged in and found lots of vassals
restarting (load average > 600), so in order to get the server
back online ASAP we rebooted it; syslog shows it started coming
up at 18:53.
* Looking at the upstart logs, there is just one entry in uwsgi.log
saying "[uWSGI] getting INI configuration from
/etc/uwsgi/uwsgi.ini", and the file timestamp is 18:53
From that, it looks like the only time uwsgi started via upstart that
day was as the machine came back from the reboot, so that would
suggest it didn't crash itself -- it was just the vassals. Also, see
my notes about the uwsgi.log below.
From your NFS configuration, after 100 seconds the I/O operation will be interrupted, so if a lot of users are accessing SQLite this could lead to a crash of the single apps.
Maybe -- though I note that the NFS outage was only 41 seconds long,
at least assuming that the messages in syslog started appearing as
soon as the problem occurred (as opposed to, for example, after the
100 second timeout had elapsed).
Maybe you should check the logs of a single vassal that got restarted during the NFS outage and look for a segmentation fault backtrace.
I don't see any segfaults. However, looking at the logs closely, what appears to be happening is that every vassal is being shut down, then restarted. Everything looks fine up until 18:49:33 -- in particular, there are no messages during the period while NFS was down (18:48:52 - 18:49:33) -- and then at that point I see a bunch of vassals being cursed and removed, then restarted: perhaps 5% of the vassals on the machine. This stops at 18:49:46. After that, at a rate of about three a second, I see bits of log like this (identifying information replaced with placeholders):
2016-11-12 18:50:54 +0000 EMPEROR - [emperor] removed uwsgi instance www.example.com.ini
*** has_emperor mode detected (fd: 816) ***
[uWSGI] getting INI configuration from www.example.com.ini
[uwsgi-static] added check for /var/www/static
[uwsgi-static] added mapping for /static/ => /home/username/directory/examplecode/static/
[log-encoder] registered format www.example.com ${strftime:%F %T} ${msg}
Although I can't be sure, I think that there are enough of these to
say that it probably did that for every vassal for which there was a
.ini file. This triggered the fork-bomb-like effect that we saw.
It's worth noting that the static file mapping points to a directory
(/home/username/) that is hosted on NFS.
If it is something triggered by uWSGI we could improve it to be smarter about NFS problems; otherwise I think the best approach is tuning the emperor throttling system to not start too many vassals concurrently.
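For what it's worth, a sketch of the emperor-side tuning being suggested here -- the option names are real uWSGI ones, but the values and the vassal directory are illustrative placeholders, not recommendations:

```ini
[uwsgi]
; placeholder vassal directory
emperor = /etc/uwsgi/vassals
; throttling level in milliseconds applied to misbehaving vassals (illustrative value)
emperor-throttle = 2000
; ceiling on the throttle as uWSGI backs off on repeated failures (illustrative value)
emperor-max-throttle = 180000
```

This wouldn't stop the restarts, but it should spread the respawns out enough to avoid the fork-bomb-like load spike described above.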
We could certainly look into that, but if we could stop the vassals
from being stopped and restarted it would definitely be better :-)
Is there anything we could usefully do to get more details into the log about why the vassals were being removed?
All the best,
Giles