On 16/11/16 13:14, Giles Thomas wrote:
On 16/11/16 13:00, Giles Thomas wrote:
On 16/11/16 06:29, Roberto De Ioris wrote:
On 15/11/16 09:29, Roberto De Ioris wrote:
Hi there,
We've been having a problem recently where it appears that the uWSGI Emperor is restarting all vassals -- or perhaps is restarting itself. Our setup is that on a given web server, all Python code for the running vassals is mounted over NFS. Sometimes there are glitches on the NFS server, so it stops responding for a minute or so. When it comes back, all vassals appear to restart; this puts so much load on the machine that it's essentially non-responsive for some time. So what would be an outage lasting a couple of minutes (bad) can take up to 20 minutes to clear (very, very bad).
Vassals are all configured via .ini files in a specific directory, so it appears that either:
* The emperor is restarting all of its vassals, or
* The emperor is restarting itself, which means it's shutting down all of the existing vassals as it exits, then reading all of the vassal files on startup.
Is there something in the emperor that makes it restart either all
vassals or itself when a subset of the vassals stop responding for some
time?
All the best,
Giles
Hi Roberto,
Hi Giles, do you have the emperor logs from during the NFS outage?
I do have the logs, though there's a lot of stuff in there -- is there
anything I can search for?
Are the vassal files physically on the NFS storage?
No, the vassal files are on a local filesystem. NFS handles the source
code and the working directories for the vassals. I believe a number of
them are using SQLite for storage, so that would definitely lock up
during an NFS outage.
How did you mount the NFS filesystem? (I mean the options, like sync, intr and so on.)
The options are nfsvers=3,soft,timeo=100,retry=5,rw,intr
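(For context, that corresponds to an /etc/fstab entry along these lines -- server name and export path simplified here:

    fileserver:/export/users  /mnt/fileserverXXX  nfs  nfsvers=3,soft,timeo=100,retry=5,rw,intr  0  0

If I'm reading nfs(5) correctly, timeo is in tenths of a second, and with soft the client eventually gives up and returns an I/O error to the application instead of retrying forever.)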
It would be interesting to know whether the emperor crashed (and systemd/upstart/whatever restarted it) or the single instances crashed and the emperor restarted them one by one.
Hi Roberto,
Thanks for looking into this!
Now that I look at it, it looks like it must have been the individual vassal instances that were restarted. Working through the system logs:
* syslog shows the NFS outage starting at 18:48:52 and ending at 18:49:33; over this period we get messages saying "nfs: server XXX.XXX.XXX.XXX not responding, timed out".
* Sometime shortly after, we logged in and found lots of vassals restarting (load average > 600), so in order to get the server back online ASAP we rebooted it; syslog shows it started coming up at 18:53.
* Looking at the upstart logs, there is just one entry in uwsgi.log, saying "[uWSGI] getting INI configuration from /etc/uwsgi/uwsgi.ini", and the file timestamp is 18:53.
From that, it looks like the only time uwsgi started via upstart that day was when the machine came back from the reboot, which suggests that the emperor itself didn't crash -- it was just the vassals. Also, see my notes about the uwsgi.log below.
From your nfs configuration, after 100 seconds the i/o operation will be interrupted, so if lots of users are accessing sqlite this could lead to crashes in the single apps.
Maybe -- though I note that the NFS outage was only 41 seconds long, at least assuming that the messages in syslog started appearing as soon as the problem occurred (as opposed to, for example, after the 100-second timeout had elapsed).
Maybe you should check the logs of a single vassal that got restarted during the nfs outage and look for a segmentation fault backtrace.
I don't see any segfaults. However, looking at the logs closely, what appears to be happening is that every vassal is being shut down, then restarted. Everything looks fine up until 18:49:33 -- in particular, there are no messages during the period while NFS was down (18:48:52 - 18:49:33) -- and then at that point I see a bunch of vassals being cursed and removed, then restarted: perhaps 5% of the vassals on the machine. This stops at 18:49:46. After that, at a rate of about three a second, I see bits of log like this (identifying information replaced with placeholders):
2016-11-12 18:50:54 +0000 EMPEROR - [emperor] removed uwsgi instance www.example.com.ini
*** has_emperor mode detected (fd: 816) ***
[uWSGI] getting INI configuration from www.example.com.ini
[uwsgi-static] added check for /var/www/static
[uwsgi-static] added mapping for /static/ => /home/username/directory/examplecode/static/
[log-encoder] registered format www.example.com ${strftime:%F %T} ${msg}
Although I can't be sure, I think that there are enough of these to
say that it probably did that for every vassal for which there was a
.ini file. This triggered the fork-bomb-like effect that we saw.
It's worth noting that the static file mapping points to a directory
(/home/username/) that is hosted on NFS.
If it is something triggered by uWSGI we could improve it to be smarter about nfs problems; otherwise I think the best approach is tuning the emperor throttling system so that it doesn't start too many vassals concurrently.
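For example, something along these lines in the emperor's own config (the vassals path is a placeholder; tune the values for your machine):

    [uwsgi]
    ; directory scanned for vassal .ini files (placeholder path)
    emperor = /etc/uwsgi/vassals
    ; throttle level in milliseconds applied when respawning dying vassals
    emperor-throttle = 1000
    ; upper bound reached by the incremental throttling
    emperor-max-throttle = 180000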
We could certainly look into that, but if we could stop the vassals
from being stopped and restarted it would definitely be better :-)
Is there anything we could usefully do to get more detail into the log about why the vassals were being removed?
All the best,
Giles
Just had a chat with the team here, and one possibility has popped
up. For each vassal, we have the setting:
touch-reload = /var/www/www_example_com_wsgi.py
These files are, ultimately, on NFS. (Strictly, they're on NFS as
something like
/mnt/fileserverXXX/username/var/www/var/www/www_example_com_wsgi.py,
and that directory is bind-mounted into
/mnt/sandbox/username/var/www/, then the vassal is chrooted to
/mnt/sandbox/username using the "chroot" option in the ini file).
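So the relevant fragment of each vassal's ini file looks roughly like this (names are placeholders, as before):

    [uwsgi]
    chroot = /mnt/sandbox/username
    ; resolved inside the chroot, i.e. on the host this is
    ; /mnt/sandbox/username/var/www/www_example_com_wsgi.py
    touch-reload = /var/www/www_example_com_wsgi.py

which, via the bind mount, means the watched file ultimately lives on NFS.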
How are files monitored for touch-reload? Might the NFS outage be
somehow tricking the emperor into thinking that every one of those
files has been touched?
Actually, now that I think about it more -- presumably the touch-reload
stuff, being in the vassal's own ini file, is handled by the vassal. So
perhaps that's what's happening -- it's not the vassals all being
stopped and restarted, it's the vassals all suddenly thinking that they
need to reload, and independently shutting down all of their workers and
starting them up again at the same time.
That would explain why we're getting the fork-bomb-like effects; the
emperor throttling option only limits the speed at which the emperor
will spin up new vassals, just as its name suggests. There's nothing to
stop every vassal from suddenly reloading, and there's no throttling there.
So, to reword my question in the last email: how are files monitored for
touch-reload? Might the NFS outage be somehow tricking all of the
vassals into thinking that their files have been touched?
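For what it's worth, here's a rough Python model of how I imagine the check works -- this is just my mental model, not uWSGI's actual C implementation:

    import os

    class TouchWatcher:
        # Poll a file's mtime and report a "touch" when it changes.
        def __init__(self, path):
            self.path = path
            self.last_mtime = os.stat(path).st_mtime

        def touched(self):
            try:
                mtime = os.stat(self.path).st_mtime
            except OSError:
                # During an NFS outage, stat() on a soft mount can hang
                # and then fail -- what should happen in this case?
                return False
            if mtime != self.last_mtime:
                self.last_mtime = mtime
                return True
            return False

If the real code treats a failed stat() -- or a confusing mtime once the server comes back -- as "the file was touched", that would explain every vassal deciding to reload at the same moment.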
All the best,
Giles
--
Giles Thomas <[email protected]>
PythonAnywhere: Develop and host Python from your browser
<https://www.pythonanywhere.com/>
A product from PythonAnywhere LLP
17a Clerkenwell Road, London EC1M 5RD, UK
VAT No.: GB 893 5643 79
Registered in England and Wales as company number OC378414.
Registered address: 28 Ely Place, 3rd Floor, London EC1N 6TD, UK