On 16/11/16 13:14, Giles Thomas wrote:
On 16/11/16 13:00, Giles Thomas wrote:
On 16/11/16 06:29, Roberto De Ioris wrote:
On 15/11/16 09:29, Roberto De Ioris wrote:
Hi there,
We've been having a problem recently where it appears that the uWSGI Emperor is restarting all vassals -- or perhaps is restarting itself. Our setup is that on a given web server, all Python code for the running vassals is mounted over NFS. Sometimes there are glitches on the NFS server, so it stops responding for a minute or so. When it comes back, all vassals appear to restart; this puts so much load on the machine that it's essentially non-responsive for some time. So what would be an outage lasting a couple of minutes (bad) can take up to 20 minutes to clear (very, very bad).
Vassals are all configured via .ini files in a specific directory, so it appears that either:
* The emperor is restarting all of its vassals, or
* The emperor is restarting itself, which means it's shutting down all of the existing vassals as it exits, then reading all of the vassal files on startup.
Is there something in the emperor that makes it restart either all
vassals or itself when a subset of the vassals stop responding for some
time?
All the best,
Giles
Hi Roberto,
Hi Giles, do you have the emperor logs from during the NFS outage?
I do have the logs, though there's a lot of stuff in there -- is there
anything I can search for?
Are the vassal files physically on the NFS storage?
No, the vassal files are on a local filesystem. NFS handles the source
code and the working directories for the vassals. I believe a number of
them are using SQLite for storage, so that would definitely lock up
during an NFS outage.
How did you mount the NFS filesystem? (I mean the options, like sync, intr and so on.)
The options are nfsvers=3,soft,timeo=100,retry=5,rw,intr
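(For context, that corresponds to an /etc/fstab entry along these lines -- server name and export path simplified here:

    fileserver:/export/users  /mnt/fileserverXXX  nfs  nfsvers=3,soft,timeo=100,retry=5,rw,intr  0  0

If I'm reading nfs(5) correctly, timeo is in tenths of a second, and with soft the client eventually gives up and returns an I/O error to the application instead of retrying forever.)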
It would be interesting to know whether the emperor crashed (and systemd/upstart/whatever restarted it) or the single instances crashed and the emperor restarted them one by one.
Hi Roberto,
Thanks for looking into this!
Now that I look at it, it looks like it must have been the individual vassal instances that were restarted. Working through the system logs:
* syslog shows the NFS outage starting at 18:48:52 and ending at 18:49:33; over this period we get messages saying "nfs: server XXX.XXX.XXX.XXX not responding, timed out".
* Sometime shortly after, we logged in and found lots of vassals restarting (load average > 600), so in order to get the server back online ASAP we rebooted it; syslog shows it started coming up at 18:53.
* Looking at the upstart logs, there is just one entry in uwsgi.log, saying "[uWSGI] getting INI configuration from /etc/uwsgi/uwsgi.ini", and the file timestamp is 18:53.
From that, it looks like the only time uwsgi started via upstart that day was when the machine came back from the reboot, which suggests that the emperor itself didn't crash -- it was just the vassals. Also, see my notes about the uwsgi.log below.
From your nfs configuration, after 100 seconds the i/o operation will be interrupted, so if lots of users are accessing sqlite this could lead to crashes in the single apps.
Maybe -- though I note that the NFS outage was only 41 seconds long, at least assuming that the messages in syslog started appearing as soon as the problem occurred (as opposed to, for example, after the 100-second timeout had elapsed).
Maybe you should check the logs of a single vassal that got restarted during the nfs outage and look for a segmentation fault backtrace.
I don't see any segfaults. However, looking at the logs closely, what appears to be happening is that every vassal is being shut down, then restarted. Everything looks fine up until 18:49:33 -- in particular, there are no messages during the period while NFS was down (18:48:52 - 18:49:33) -- and then at that point I see a bunch of vassals being cursed and removed, then restarted: perhaps 5% of the vassals on the machine. This stops at 18:49:46. After that, at a rate of about three a second, I see bits of log like this (identifying information replaced with placeholders):
2016-11-12 18:50:54 +0000 EMPEROR - [emperor] removed uwsgi instance www.example.com.ini
*** has_emperor mode detected (fd: 816) ***
[uWSGI] getting INI configuration from www.example.com.ini
[uwsgi-static] added check for /var/www/static
[uwsgi-static] added mapping for /static/ => /home/username/directory/examplecode/static/
[log-encoder] registered format www.example.com ${strftime:%F %T} ${msg}
Although I can't be sure, I think that there are enough of these to
say that it probably did that for every vassal for which there was a
.ini file. This triggered the fork-bomb-like effect that we saw.
It's worth noting that the static file mapping points to a directory
(/home/username/) that is hosted on NFS.
If it is something triggered by uWSGI we could improve it to be smarter about nfs problems; otherwise I think the best approach is tuning the emperor throttling system so that it doesn't start too many vassals concurrently.
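For example, something along these lines in the emperor's own config (the vassals path is a placeholder; tune the values for your machine):

    [uwsgi]
    ; directory scanned for vassal .ini files (placeholder path)
    emperor = /etc/uwsgi/vassals
    ; throttle level in milliseconds applied when respawning dying vassals
    emperor-throttle = 1000
    ; upper bound reached by the incremental throttling
    emperor-max-throttle = 180000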
We could certainly look into that, but if we could stop the vassals
from being stopped and restarted it would definitely be better :-)
Is there anything we could usefully do to get more detail into the log about why the vassals were being removed?
All the best,
Giles
Just had a chat with the team here, and one possibility has popped
up. For each vassal, we have the setting:
touch-reload = /var/www/www_example_com_wsgi.py
These files are, ultimately, on NFS. (Strictly, they're on NFS as
something like
/mnt/fileserverXXX/username/var/www/var/www/www_example_com_wsgi.py,
and that directory is bind-mounted into
/mnt/sandbox/username/var/www/, then the vassal is chrooted to
/mnt/sandbox/username using the "chroot" option in the ini file).
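So the relevant fragment of each vassal's ini file looks roughly like this (names are placeholders, as before):

    [uwsgi]
    chroot = /mnt/sandbox/username
    ; resolved inside the chroot, i.e. on the host this is
    ; /mnt/sandbox/username/var/www/www_example_com_wsgi.py
    touch-reload = /var/www/www_example_com_wsgi.py

which, via the bind mount, means the watched file ultimately lives on NFS.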
How are files monitored for touch-reload? Might the NFS outage be
somehow tricking the emperor into thinking that every one of those
files has been touched?
Actually, now that I think about it more -- presumably the touch-reload
stuff, being in the vassal's own ini file, is handled by the vassal. So
perhaps that's what's happening -- it's not the vassals all being
stopped and restarted, it's the vassals all suddenly thinking that they
need to reload, and independently shutting down all of their workers and
starting them up again at the same time.
That would explain why we're getting the fork-bomb-like effects; the
emperor throttling option only limits the speed at which the emperor
will spin up new vassals, just as its name suggests. There's nothing to
stop every vassal from suddenly reloading, and there's no throttling there.
So, to reword my question in the last email: how are files monitored for
touch-reload? Might the NFS outage be somehow tricking all of the
vassals into thinking that their files have been touched?
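For what it's worth, here's a rough Python model of how I imagine the check works -- this is just my mental model, not uWSGI's actual C implementation:

    import os

    class TouchWatcher:
        # Poll a file's mtime and report a "touch" when it changes.
        def __init__(self, path):
            self.path = path
            self.last_mtime = os.stat(path).st_mtime

        def touched(self):
            try:
                mtime = os.stat(self.path).st_mtime
            except OSError:
                # During an NFS outage, stat() on a soft mount can hang
                # and then fail -- what should happen in this case?
                return False
            if mtime != self.last_mtime:
                self.last_mtime = mtime
                return True
            return False

If the real code treats a failed stat() -- or a confusing mtime once the server comes back -- as "the file was touched", that would explain every vassal deciding to reload at the same moment.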
All the best,
Giles
--
Giles Thomas <[email protected]>
PythonAnywhere: Develop and host Python from your browser
<https://www.pythonanywhere.com/>
A product from PythonAnywhere LLP
17a Clerkenwell Road, London EC1M 5RD, UK
VAT No.: GB 893 5643 79
Registered in England and Wales as company number OC378414.
Registered address: 28 Ely Place, 3rd Floor, London EC1N 6TD, UK