On 16/11/16 06:29, Roberto De Ioris wrote:
On 15/11/16 09:29, Roberto De Ioris wrote:
Hi there,

We've been having a problem recently where it appears that the uWSGI
Emperor is restarting all vassals -- or, perhaps is restarting itself.

Our setup is that on a given web server, all Python code for the running
vassals is mounted over NFS.  Sometimes there are glitches on the NFS
server, so it stops responding for a minute or so.  When it comes back,
all vassals appear to restart; this puts so much load on the machine
that it's essentially non-responsive for some time. So what would be an
outage lasting a couple of minutes (bad) can take up to 20 minutes to
clear (very, very bad).

Vassals are all configured via .ini files in a specific directory, so it
appears that either:

    * The emperor is restarting all of its vassals, or
    * The emperor is restarting itself, which means it's shutting down all
      of the existing vassals as it exits, then reading all of the vassal
      files on startup.

Is there something in the emperor that makes it restart either all
vassals or itself when a subset of the vassals stop responding for some
time?


All the best,

Giles


Hi Roberto,

Hi Giles, do you have emperor logs during the NFS outage ?
I do have the logs, though there's a lot of stuff in there -- is there
anything I can search for?

Are the vassal files physically on the nfs storage?
No, the vassal files are on a local filesystem.  NFS handles the source
code and the working directories for the vassals.  I believe a number of
them are using SQLite for storage, so that would definitely lock up
during an NFS outage.

How did you mount the nfs filesystem? (I mean the options, like sync,
intr and so on)
The options are nfsvers=3,soft,timeo=100,retry=5,rw,intr
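For concreteness, the relevant /etc/fstab entry looks something like this (the server address and paths are placeholders, not our real ones):

```
# hypothetical /etc/fstab line; server address and paths are placeholders
XXX.XXX.XXX.XXX:/export/home  /home  nfs  nfsvers=3,soft,timeo=100,retry=5,rw,intr  0  0
```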


It would be interesting to know if the emperor crashed (and
systemd/upstart/whatever restarted it) or the single instances crashed and
the emperor restarted them one by one.

Hi Roberto,

Thanks for looking into this!

Now that I dig into it, it looks like it must have been the individual vassal instances that were restarted. Working through the system logs:

 * syslog shows the NFS outage starting at 18:48:52, ending at
   18:49:33; over this period we get messages saying "nfs: server
   XXX.XXX.XXX.XXX not responding, timed out"
 * Sometime shortly after, we logged in and found lots of vassals
   restarting (load average > 600), so in order to get the server back
   online ASAP we rebooted it; syslog shows it started coming up at 18:53.
 * Looking at the upstart logs, there is just one entry in uwsgi.log
   saying "[uWSGI] getting INI configuration from
   /etc/uwsgi/uwsgi.ini", and the file timestamp is 18:53

From that, it looks like the only time uwsgi started via upstart that day was as the machine came back from the reboot, so that would suggest it didn't crash itself -- it was just the vassals. Also, see my notes about the uwsgi.log below.


 From your nfs configuration, after 100 seconds the i/o operation will be
interrupted, so if a lot of users are accessing sqlite this could lead to
crashes in the single apps.

Maybe -- though I note that the NFS outage was only 41 seconds long, at least assuming that the messages in syslog started appearing as soon as the problem occurred (as opposed to, for example, after the 100 second timeout had elapsed).

Maybe you should check the logs of a single vassal that got restarted
during the nfs outage and look for a segmentation fault backtrace.

I don't see any segfaults. However, looking at the logs closely, what appears to be happening is that every vassal is being shut down, then restarted. Everything looks fine up until 18:49:33 -- in particular, there are no messages during the period while NFS was down (18:48:52 - 18:49:33). At that point I see a bunch of vassals being cursed and removed, then restarted -- perhaps 5% of the vassals on the machine. This stops at 18:49:46. After that, at a rate of about three a second, I see bits of log like this (identifying information replaced with placeholders):

   2016-11-12 18:50:54 +0000 EMPEROR - [emperor] removed uwsgi instance
   www.example.com.ini
   *** has_emperor mode detected (fd: 816) ***
   [uWSGI] getting INI configuration from www.example.com.ini
   [uwsgi-static] added check for /var/www/static
   [uwsgi-static] added mapping for /static/ =>
   /home/username/directory/examplecode/static/
   [log-encoder] registered format www.example.com ${strftime:%F %T} ${msg}

Although I can't be sure, I think that there are enough of these to say that it probably did that for every vassal for which there was a .ini file. This triggered the fork-bomb-like effect that we saw.
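For what it's worth, here is roughly how I counted the removal events, bucketed by minute. The sample log built below is just a stand-in for our real uwsgi.log, with placeholder instance names:

```shell
# Build a tiny stand-in for the real emperor log so this is self-contained.
cat > /tmp/sample-emperor.log <<'EOF'
2016-11-12 18:50:54 +0000 EMPEROR - [emperor] removed uwsgi instance www.example.com.ini
2016-11-12 18:50:54 +0000 EMPEROR - [emperor] removed uwsgi instance api.example.com.ini
2016-11-12 18:51:02 +0000 EMPEROR - [emperor] removed uwsgi instance blog.example.com.ini
EOF

# Count "removed uwsgi instance" events per minute: date plus HH:MM.
grep 'removed uwsgi instance' /tmp/sample-emperor.log \
  | awk '{print $1, substr($2, 1, 5)}' | sort | uniq -c
```

With the sample data this prints a count of 2 for 18:50 and 1 for 18:51; on the real log the spike around 18:50 is what stood out.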

It's worth noting that the static file mapping points to a directory (/home/username/) that is hosted on NFS.

If it is something triggered by uWSGI we could improve it to be smarter
about nfs problems; otherwise I think the best approach is tuning the
emperor throttling system to not start too many vassals concurrently.

We could certainly look into that, but if we could stop the vassals from being stopped and restarted it would definitely be better :-)
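If we do end up tuning the throttling, I believe the relevant knobs are the emperor throttle options; something like this sketch (the values here are illustrative, not tested):

```
; sketch of an emperor config with throttling -- values are illustrative
[uwsgi]
emperor = /etc/uwsgi/vassals
; pause (in milliseconds) between respawns of badly behaving vassals
emperor-throttle = 1000
; cap on how far the throttle can back off
emperor-max-throttle = 60000
```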

Is there anything we could usefully do to get more detail into the log about why the vassals were being removed?


All the best,

Giles

--
Giles Thomas <[email protected]>

PythonAnywhere: Develop and host Python from your browser
<https://www.pythonanywhere.com/>

A product from PythonAnywhere LLP
17a Clerkenwell Road, London EC1M 5RD, UK
VAT No.: GB 893 5643 79
Registered in England and Wales as company number OC378414.
Registered address: 28 Ely Place, 3rd Floor, London EC1N 6TD, UK

_______________________________________________
uWSGI mailing list
[email protected]
http://lists.unbit.it/cgi-bin/mailman/listinfo/uwsgi
