> On 16/11/16 13:14, Giles Thomas wrote:
>> On 16/11/16 13:00, Giles Thomas wrote:
>>> On 16/11/16 06:29, Roberto De Ioris wrote:
>>>>> On 15/11/16 09:29, Roberto De Ioris wrote:
>>>>>>> Hi there,
>>>>>>>
>>>>>>> We've been having a problem recently where it appears that the
>>>>>>> uWSGI
>>>>>>> Emperor is restarting all vassals -- or, perhaps is restarting
>>>>>>> itself.
>>>>>>>
>>>>>>> Our setup is that on a given web server, all Python code for the
>>>>>>> running
>>>>>>> vassals is mounted over NFS.  Sometimes there are glitches on the
>>>>>>> NFS
>>>>>>> server, so it stops responding for a minute or so.  When it comes
>>>>>>> back,
>>>>>>> all vassals appear to restart; this puts so much load on the
>>>>>>> machine
>>>>>>> that it's essentially non-responsive for some time. So what would
>>>>>>> be an
>>>>>>> outage lasting a couple of minutes (bad) can take up to 20 minutes
>>>>>>> to
>>>>>>> clear (very, very bad).
>>>>>>>
>>>>>>> Vassals are all configured via .ini files in a specific directory,
>>>>>>> so
>>>>>>> it
>>>>>>> appears that either:
>>>>>>>
>>>>>>>     * The emperor is restarting all of its vassals, or
>>>>>>>     * The emperor is restarting itself, which means it's shutting
>>>>>>> down
>>>>>>> all
>>>>>>>       of the existing vassals as it exits, then reading all of the
>>>>>>> vassal
>>>>>>>       files on startup.
>>>>>>>
>>>>>>> Is there something in the emperor that makes it restart either all
>>>>>>> vassals or itself when a subset of the vassals stop responding for
>>>>>>> some
>>>>>>> time?
>>>>>>>
>>>>>>>
>>>>>>> All the best,
>>>>>>>
>>>>>>> Giles
>>>>>>>
>>>>>>>
>>>>> Hi Roberto,
>>>>>
>>>>>> Hi Giles, do you have emperor logs during the NFS outage ?
>>>>> I do have the logs, though there's a lot of stuff in there -- is
>>>>> there
>>>>> anything I can search for?
>>>>>
>>>>>> Are the vassals files physically in the nfs storage ?
>>>>> No, the vassal files are on a local filesystem.  NFS handles the
>>>>> source
>>>>> code and the working directories for the vassals.  I believe a number
>>>>> of
>>>>> them are using SQLite for storage, so that would definitely lock up
>>>>> during an NFS outage.
>>>>>
>>>>>> How you mounted the nfs filesystem ? (i mean the options, like sync,
>>>>>> intr
>>>>>> and so on)
>>>>> The options are nfsvers=3,soft,timeo=100,retry=5,rw,intr
>>>>>
>>>>>
>>>> It would be interesting to know if the emperor crashed (and
>>>> systemd/upstart/whatever restarted it) or the single instances crashed
>>>> and
>>>> the emperor restarted them one by one.
>>>
>>> Hi Roberto,
>>>
>>> Thanks for looking into this!
>>>
>>> Now that I look at it, it looks like it must have been the single
>>> vassal instances that were restarted.  Working via the system logs:
>>>
>>>   * syslog shows the NFS outage starting at 18:48:52, ending at
>>>     18:49:33; over this period we get messages saying "nfs: server
>>>     XXX.XXX.XXX.XXX not responding, timed out"
>>>   * Sometime shortly after, we logged in and found lots of vassals
>>>     restarting (load average > 600), so in order to get the server
>>>     back online ASAP we rebooted it; syslog shows it started coming
>>>     up at 18:53.
>>>   * Looking at the upstart logs, there is just one entry in uwsgi.log
>>>     saying "[uWSGI] getting INI configuration from
>>>     /etc/uwsgi/uwsgi.ini", and the file timestamp is 18:53
>>>
>>> From that, it looks like the only time uwsgi started via upstart that
>>> day was as the machine came back from the reboot, so that would
>>> suggest it didn't crash itself -- it was just the vassals.  Also, see
>>> my notes about the uwsgi.log below.
>>>
>>>
>>>>  From your nfs configuration, after 100 seconds the i/o operation
>>>> will be interrupted, so if lots of users are accessing sqlite this
>>>> could lead to a crash of the single apps.
>>>
>>> Maybe -- though I note that the NFS outage was only 41 seconds long,
>>> at least assuming that the messages in syslog started appearing as
>>> soon as the problem occurred (as opposed to, for example, after the
>>> 100 second timeout had elapsed).
>>>
>>>> Maybe you should check the logs of a single vassal that got
>>>> restarted during the nfs outage, and look for a segmentation
>>>> fault backtrace there.
>>>
>>> I don't see any segfaults.  However, looking at the logs closely,
>>> what appears to be happening is that every vassal is being shut down,
>>> then restarted.  Everything looks fine up until 18:49:33 -- in
>>> particular, there are no messages during the period while NFS was
>>> down (18:48:52 - 18:49:33) and then at that point I see bunch of
>>> vassals being cursed and removed, then restarted -- perhaps 5% of the
>>> vassals on the machine.  This stops at 18:49:46.  After that, at a
>>> rate of about three a second, I see bits of log like this
>>> (identifying information replaced with placeholders):
>>>
>>>     2016-11-12 18:50:54 +0000 EMPEROR - [emperor] removed uwsgi
>>>     instance www.example.com.ini
>>>     *** has_emperor mode detected (fd: 816) ***
>>>     [uWSGI] getting INI configuration from www.example.com.ini
>>>     [uwsgi-static] added check for /var/www/static
>>>     [uwsgi-static] added mapping for /static/ =>
>>>     /home/username/directory/examplecode/static/
>>>     [log-encoder] registered format www.example.com ${strftime:%F %T}
>>>     ${msg}
>>>
>>> Although I can't be sure, I think that there are enough of these to
>>> say that it probably did that for every vassal for which there was a
>>> .ini file.   This triggered the fork-bomb-like effect that we saw.
>>>
>>> It's worth noting that the static file mapping points to a directory
>>> (/home/username/) that is hosted on NFS.
>>>
>>>> If it is something triggered by uWSGI we could improve it to be
>>>> smarter about nfs problems; otherwise I think the best approach
>>>> is tuning the emperor throttling system to not start too many
>>>> vassals concurrently.
>>>
>>> We could certainly look into that, but if we could stop the vassals
>>> from being stopped and restarted it would definitely be better :-)
>>>
>>> Is there anything we could usefully do to get more detail into the
>>> log about why the vassals were being removed?
>>>
>>>
>>> All the best,
>>>
>>> Giles
>>
>> Just had a chat with the team here, and one possibility has popped
>> up.  For each vassal, we have the setting:
>>
>>     touch-reload = /var/www/www_example_com_wsgi.py
>>
>> These files are, ultimately, on NFS.  (Strictly, they're on NFS as
>> something like
>> /mnt/fileserverXXX/username/var/www/var/www/www_example_com_wsgi.py,
>> and that directory is bind-mounted into
>> /mnt/sandbox/username/var/www/, then the vassal is chrooted to
>> /mnt/sandbox/username using the "chroot" option in the ini file).
>>
>> How are files monitored for touch-reload?  Might the NFS outage be
>> somehow tricking the emperor into thinking that every one of those
>> files has been touched?
>
> Actually, now I think about it more -- presumably the touch-reload
> stuff, being in the vassal's own ini file, is handled by the vassal.  So
> perhaps that's what's happening -- it's not the vassals all being
> stopped and restarted, it's the vassals all suddenly thinking that they
> need to reload, and independently shutting down all of their workers and
> starting them up again at the same time.
>
> That would explain why we're getting the fork-bomb-like effects; the
> emperor throttling option only limits the speed at which the emperor
> will spin up new vassals, just as its name suggests.  There's nothing to
> stop every vassal from suddenly reloading, and there's no throttling
> there.
>
> So, to reword my question in the last email: how are files monitored for
> touch-reload?  Might the NFS outage be somehow tricking all of the
> vassals into thinking that their files have been touched?
>
>
> All the best,
>
> Giles
>
>

Hi Giles, could it be that you have the heartbeat system enabled
between the emperor and the vassals? If the master is stuck during the
stat() for touch_reload, it will not be able to send the heartbeat back
in time. In such a case you should see something like this in the logs:

[emperor] vassal %s sent no heartbeat in last %d seconds, brutally
respawning it.
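
The heartbeat is opt-in per vassal. As a sketch (option names as in the
uWSGI docs; treat the values as placeholders to tune):

```ini
; vassal ini (sketch): have the master send a heartbeat
; to the emperor every 10 seconds
[uwsgi]
heartbeat = 10
```

On the emperor side, --emperor-required-heartbeat controls how long the
emperor tolerates silence before brutally respawning the vassal (30
seconds by default, if I remember correctly).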

Another option would be the touch_reload check itself failing during
stat(), but again, in the logs you should see something like this:

%s has been touched... grace them all !!!
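
For context, touch_reload is essentially an mtime check: the master
stat()s each monitored file on every cycle and triggers a reload when
the mtime moves forward. A minimal Python sketch of that logic
(illustrative only, not uWSGI's actual code; note the deliberate choice
here to treat a failed stat(), e.g. during an NFS outage, as "no
change" rather than as a touch):

```python
import os
import tempfile
import time


def make_touch_checker(path):
    """Roughly mimic a touch-reload style check: remember the
    file's mtime, then on each call stat() it again and report
    a "touch" only when the mtime has moved forward."""
    last_mtime = os.stat(path).st_mtime

    def check():
        nonlocal last_mtime
        try:
            mtime = os.stat(path).st_mtime
        except OSError:
            # stat() failed (e.g. NFS mount not responding):
            # treat it as "no change" instead of a spurious touch.
            return False
        if mtime > last_mtime:
            last_mtime = mtime
            return True
        return False

    return check
```

If the real check instead treated a stat() error, or an mtime jump seen
once the mount recovers, as a touch, that would explain every vassal
deciding to reload at the same moment.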


-- 
Roberto De Ioris
http://unbit.com
_______________________________________________
uWSGI mailing list
[email protected]
http://lists.unbit.it/cgi-bin/mailman/listinfo/uwsgi
