> On 15/11/16 09:29, Roberto De Ioris wrote:
>>> Hi there,
>>>
>>> We've been having a problem recently where it appears that the uWSGI
>>> Emperor is restarting all vassals -- or, perhaps, is restarting itself.
>>>
>>> Our setup is that on a given web server, all Python code for the
>>> running vassals is mounted over NFS. Sometimes there are glitches on
>>> the NFS server, so it stops responding for a minute or so. When it
>>> comes back, all vassals appear to restart; this puts so much load on
>>> the machine that it's essentially non-responsive for some time. So
>>> what would be an outage lasting a couple of minutes (bad) can take up
>>> to 20 minutes to clear (very, very bad).
>>>
>>> Vassals are all configured via .ini files in a specific directory, so
>>> it appears that either:
>>>
>>> * The emperor is restarting all of its vassals, or
>>> * The emperor is restarting itself, which means it's shutting down
>>>   all of the existing vassals as it exits, then reading all of the
>>>   vassal files on startup.
>>>
>>> Is there something in the emperor that makes it restart either all
>>> vassals or itself when a subset of the vassals stops responding for
>>> some time?
>>>
>>> All the best,
>>>
>>> Giles
>
> Hi Roberto,
>
>> Hi Giles, do you have emperor logs during the NFS outage?
>
> I do have the logs, though there's a lot of stuff in there -- is there
> anything I can search for?
>
>> Are the vassal files physically on the NFS storage?
>
> No, the vassal files are on a local filesystem. NFS holds the source
> code and the working directories for the vassals. I believe a number of
> them are using SQLite for storage, so that would definitely lock up
> during an NFS outage.
>
>> How did you mount the NFS filesystem? (I mean the options, like sync,
>> intr and so on)
>
> The options are nfsvers=3,soft,timeo=100,retry=5,rw,intr
It would be interesting to know whether the emperor itself crashed (and
systemd/upstart/whatever restarted it), or whether the single instances
crashed and the emperor restarted them one by one.

From your nfs configuration, the i/o operation will be interrupted once
the timeout expires (timeo is in tenths of a second, so timeo=100 on a
soft mount means each attempt gives up after 10 seconds), so if a lot of
users are accessing sqlite this could lead to a crash of the single apps.

Maybe you should check the logs of a single vassal that got restarted
during the nfs outage and look for a segmentation fault backtrace.

If it is something triggered by uWSGI we could improve it to be smarter
about nfs problems; otherwise I think the best approach is tuning the
emperor throttling system so that it does not start too many vassals
concurrently.

--
Roberto De Ioris
http://unbit.com
_______________________________________________
uWSGI mailing list
[email protected]
http://lists.unbit.it/cgi-bin/mailman/listinfo/uwsgi
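The throttling tuning suggested above could be sketched in the emperor's own ini file using uWSGI's emperor-throttle and emperor-max-throttle options; the directory path and the millisecond values below are illustrative assumptions, not recommendations:

```ini
[uwsgi]
; directory watched for vassal .ini files (placeholder path)
emperor = /etc/uwsgi/vassals

; initial delay between vassal respawns, in milliseconds;
; the emperor increases this delay when vassals keep dying,
; spreading restarts out instead of launching them all at once
emperor-throttle = 2000

; upper bound for the throttle delay, in milliseconds,
; so a long NFS outage cannot push the delay arbitrarily high
emperor-max-throttle = 60000
```

With something like this, a mass die-off after an NFS glitch would come back as a staggered trickle of respawns rather than a simultaneous stampede, which is what was making the 2-minute outage take 20 minutes to clear.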
