> On 16/11/16 13:14, Giles Thomas wrote:
>> On 16/11/16 13:00, Giles Thomas wrote:
>>> On 16/11/16 06:29, Roberto De Ioris wrote:
>>>>> On 15/11/16 09:29, Roberto De Ioris wrote:
>>>>>>> Hi there,
>>>>>>>
>>>>>>> We've been having a problem recently where it appears that the
>>>>>>> uWSGI
>>>>>>> Emperor is restarting all vassals -- or, perhaps is restarting
>>>>>>> itself.
>>>>>>>
>>>>>>> Our setup is that on a given web server, all Python code for the
>>>>>>> running
>>>>>>> vassals is mounted over NFS. Sometimes there are glitches on the
>>>>>>> NFS
>>>>>>> server, so it stops responding for a minute or so. When it comes
>>>>>>> back,
>>>>>>> all vassals appear to restart; this puts so much load on the
>>>>>>> machine
>>>>>>> that it's essentially non-responsive for some time. So what would
>>>>>>> be an
>>>>>>> outage lasting a couple of minutes (bad) can take up to 20 minutes
>>>>>>> to
>>>>>>> clear (very, very bad).
>>>>>>>
>>>>>>> Vassals are all configured via .ini files in a specific directory,
>>>>>>> so
>>>>>>> it
>>>>>>> appears that either:
>>>>>>>
>>>>>>> * The emperor is restarting all of its vassals, or
>>>>>>> * The emperor is restarting itself, which means it's shutting
>>>>>>> down
>>>>>>> all
>>>>>>> of the existing vassals as it exits, then reading all of the
>>>>>>> vassal
>>>>>>> files on startup.
>>>>>>>
>>>>>>> Is there something in the emperor that makes it restart either all
>>>>>>> vassals or itself when a subset of the vassals stop responding for
>>>>>>> some
>>>>>>> time?
>>>>>>>
>>>>>>>
>>>>>>> All the best,
>>>>>>>
>>>>>>> Giles
>>>>>>>
>>>>>>>
>>>>> Hi Roberto,
>>>>>
>>>>>> Hi Giles, do you have emperor logs during the NFS outage ?
>>>>> I do have the logs, though there's a lot of stuff in there -- is
>>>>> there
>>>>> anything I can search for?
>>>>>
>>>>>> Are the vassals files physically in the nfs storage ?
>>>>> No, the vassal files are on a local filesystem. NFS handles the
>>>>> source
>>>>> code and the working directories for the vassals. I believe a number
>>>>> of
>>>>> them are using SQLite for storage, so that would definitely lock up
>>>>> during an NFS outage.
>>>>>
>>>>>> How did you mount the NFS filesystem? (I mean the options, like
>>>>>> sync, intr, and so on)
>>>>> The options are nfsvers=3,soft,timeo=100,retry=5,rw,intr
>>>>>
>>>>>
>>>> It would be interesting to know if the emperor crashed (and
>>>> systemd/upstart/whatever restarted it) or the single instances crashed
>>>> and
>>>> the emperor restarted them one by one.
>>>
>>> Hi Roberto,
>>>
>>> Thanks for looking into this!
>>>
>>> Now that I look at it, it seems it must have been the individual
>>> vassal instances that were restarted. Working through the system logs:
>>>
>>> * syslog shows the NFS outage starting at 18:48:52, ending at
>>> 18:49:33; over this period we get messages saying "nfs: server
>>> XXX.XXX.XXX.XXX not responding, timed out"
>>> * Sometime shortly after, we logged in and found lots of vassals
>>> restarting (load average > 600), so in order to get the server
>>> back online ASAP we rebooted it; syslog shows it started coming
>>> up at 18:53.
>>> * Looking at the upstart logs, there is just one entry in uwsgi.log
>>> saying "[uWSGI] getting INI configuration from
>>> /etc/uwsgi/uwsgi.ini", and the file timestamp is 18:53
>>>
>>> From that, it looks like the only time uwsgi started via upstart that
>>> day was as the machine came back from the reboot, so that would
>>> suggest it didn't crash itself -- it was just the vassals. Also, see
>>> my notes about the uwsgi.log below.
>>>
>>>
>>>> From your NFS configuration, after 100 seconds the I/O operation
>>>> will be interrupted, so if lots of users are accessing SQLite this
>>>> could lead to a crash of the individual apps.
>>>
>>> Maybe -- though I note that the NFS outage was only 41 seconds long,
>>> at least assuming that the messages in syslog started appearing as
>>> soon as the problem occurred (as opposed to, for example, after the
>>> 100 second timeout had elapsed).
>>>
>>>> Maybe you should check the logs of a single vassal that got
>>>> restarted during the NFS outage and look for a segmentation fault
>>>> backtrace.
>>>
>>> I don't see any segfaults. However, looking at the logs closely,
>>> what appears to be happening is that every vassal is being shut down,
>>> then restarted. Everything looks fine up until 18:49:33 -- in
>>> particular, there are no messages during the period while NFS was
>>> down (18:48:52 - 18:49:33) and then at that point I see a bunch of
>>> vassals being cursed and removed, then restarted -- perhaps 5% of the
>>> vassals on the machine. This stops at 18:49:46. After that, at a
>>> rate of about three a second, I see bits of log like this
>>> (identifying information replaced with placeholders):
>>>
>>> 2016-11-12 18:50:54 +0000 EMPEROR - [emperor] removed uwsgi
>>> instance www.example.com.ini
>>> *** has_emperor mode detected (fd: 816) ***
>>> [uWSGI] getting INI configuration from www.example.com.ini
>>> [uwsgi-static] added check for /var/www/static
>>> [uwsgi-static] added mapping for /static/ =>
>>> /home/username/directory/examplecode/static/
>>> [log-encoder] registered format www.example.com ${strftime:%F %T}
>>> ${msg}
>>>
>>> Although I can't be sure, I think that there are enough of these to
>>> say that it probably did that for every vassal for which there was a
>>> .ini file. This triggered the fork-bomb-like effect that we saw.
>>>
>>> It's worth noting that the static file mapping points to a directory
>>> (/home/username/) that is hosted on NFS.
>>>
>>>> If it is something triggered by uWSGI we could improve it to be
>>>> smarter about NFS problems; otherwise I think the best approach is
>>>> tuning the emperor throttling system so it does not start too many
>>>> vassals concurrently.
>>>
>>> We could certainly look into that, but if we could stop the vassals
>>> from being stopped and restarted it would definitely be better :-)
>>>
>>> Is there anything we could usefully do to get more detail into the
>>> log as to why the vassals were being removed?
>>>
>>>
>>> All the best,
>>>
>>> Giles
>>
>> Just had a chat with the team here, and one possibility has popped
>> up. For each vassal, we have the setting:
>>
>> touch-reload = /var/www/www_example_com_wsgi.py
>>
>> These files are, ultimately, on NFS. (Strictly, they're on NFS as
>> something like
>> /mnt/fileserverXXX/username/var/www/var/www/www_example_com_wsgi.py,
>> and that directory is bind-mounted into
>> /mnt/sandbox/username/var/www/, then the vassal is chrooted to
>> /mnt/sandbox/username using the "chroot" option in the ini file).
>>
>> How are files monitored for touch-reload? Might the NFS outage be
>> somehow tricking the emperor into thinking that every one of those
>> files has been touched?
>
> Actually, now that I think about it more -- presumably the touch-reload
> stuff, being in the vassal's own ini file, is handled by the vassal. So
> perhaps that's what's happening -- it's not the vassals all being
> stopped and restarted, it's the vassals all suddenly thinking that they
> need to reload, and independently shutting down all of their workers and
> starting them up again at the same time.
>
> That would explain why we're getting the fork-bomb-like effects; the
> emperor throttling option only limits the speed at which the emperor
> will spin up new vassals, just as its name suggests. There's nothing to
> stop every vassal from suddenly reloading, and there's no throttling
> there.
>
> So, to reword my question in the last email: how are files monitored for
> touch-reload? Might the NFS outage be somehow tricking all of the
> vassals into thinking that their files have been touched?
>
>
> All the best,
>
> Giles
>
>
Hi Giles, could it be that you have the heartbeat system enabled between
the emperor and the vassals? If the master is stuck in the stat() for
touch-reload, it will not be able to send the heartbeat back in time. In
that case you should see something like this in the logs:
[emperor] vassal %s sent no heartbeat in last %d seconds, brutally
respawning it.
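For context, the heartbeat is opt-in on the vassal side, and the emperor
can throttle respawns so a mass reload does not fork-bomb the machine.
A sketch of the relevant settings (the option names are from the uWSGI
docs; the values are only illustrative, not recommendations):

```ini
; vassal .ini -- heartbeat is opt-in; if enabled, a master blocked on a
; hung stat() (e.g. during an NFS outage) can miss heartbeats and be
; brutally respawned by the emperor
heartbeat = 30

; emperor .ini -- minimum delay (milliseconds) between vassal respawns,
; limiting how fast the emperor restarts instances
emperor-throttle = 1000
```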
Another possibility is the touch-reload check itself failing during the
stat(), but again, in that case you should see something like this in
the logs:
%s has been touched... grace them all !!!
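As a rough illustration (this is not uWSGI's actual C implementation,
and `make_touch_checker` is a hypothetical helper), touch-reload boils
down to remembering the mtime from an initial stat() and comparing it on
every poll; an NFS outage can make that stat() hang or fail, and an
mtime that looks different after recovery is indistinguishable from a
real touch:

```python
import os


def make_touch_checker(path):
    """Return a closure that reports True when `path` looks touched.

    Hypothetical sketch of mtime-polling reload detection: remember the
    last observed mtime and compare it against a fresh stat() on each
    call.
    """
    last_mtime = os.stat(path).st_mtime

    def check():
        nonlocal last_mtime
        try:
            mtime = os.stat(path).st_mtime
        except OSError:
            # NFS outage: stat() can fail outright on soft/intr mounts.
            # Here we treat that as "not touched"; a less careful
            # implementation might treat it as a change and reload.
            return False
        if mtime != last_mtime:
            last_mtime = mtime
            return True
        return False

    return check
```

With a polling loop like this in every vassal's master, all vassals that
watch files on the same NFS mount would see the anomaly at the same
moment, and reload simultaneously with no throttling between them.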
--
Roberto De Ioris
http://unbit.com
_______________________________________________
uWSGI mailing list
[email protected]
http://lists.unbit.it/cgi-bin/mailman/listinfo/uwsgi