On Thu, Sep 11, 2014 at 11:02 PM, Luis R. Rodriguez <mcg...@suse.com> wrote: > On Thu, Sep 11, 2014 at 1:07 PM, Tom Gundersen <t...@jklm.no> wrote: >> On Mon, Aug 11, 2014 at 7:19 PM, Luis R. Rodriguez <mcg...@suse.com> wrote: >>> On Mon, Aug 11, 2014 at 12:57 PM, Lennart Poettering >>> <lenn...@poettering.net> wrote: >>>> On Mon, 11.08.14 18:39, Luis R. Rodriguez (mcg...@suse.com) wrote: >>>> >>>>> > This looks really wrong. We shouldn't permit worker processes to be >>>>> > blocked indefinitely without any timeout applied. Designing a worker >>>>> > process system like that is simply wrong. It's one thing to allow >>>>> > changing the specific timeout applied, it's a very different thing to >>>>> > allow broken drivers to completely stall the worker process logic. >>>>> >>>>> OK what if we enable customizations then on the timeout by the built-in >>>>> cmd type and we use a high multiplier for now for kmod ? A multiplier >>>>> for kmod of 10 would set the kmod timeout to 5 minutes then, as we >>>>> sweep up and clean drivers we can reduce this over time. This would also >>>>> enable us to keep the default timeout for the other type of workers. >>>> >>>> Why this complexity? >>>> >>>> I mean, it sounds much simpler to simply increase the default timeout a >>>> bit, if it turns out to be too low for the current cases... >>> >>> True, there's two things here and one of which this v2 patch didn't address: >>> >>> 1) It'd be good for defaults on systemd to work on most systems based >>> on upstream kernels today, right now folks deploying systemd would >>> need to modify the default timeout. Are we up to bump the default up >>> considerably? If its high, would that be unfair for classes of workers >>> we know shouldn't take that long, or wouldn't that allow folks >>> developing new workers to take longer? >>> >>> 2) We want chatty logs to allow us to keep track of drivers that need >>> attention. Ideally right now we should strive for 30 seconds init and >>> work on asynching most work, we want to do this in a non fatal way. >>> Overriding the timeout won't let us to keep track of buggy drivers >>> that need love from systemd's perspective to stay within the 30 second >>> bound time. We can have chatty logs from the kernel but using a >>> timeout on the driver core seems a bit overkill specially if systemd >>> is already keeping track of driver's init time, so it'd be better if >>> we could collect offending drivers from systemd. I could have >>> implemented support for this in this v2 patch by simply adding another >>> check using the default timeout. >> >> Hi Luis, >> >> Just following up on this "old" thread: I have now bumped the default >> kill timeout (for all workers) to 180 seconds, > > Thanks Tom, this should help quite a lot while we try to get in the > kernel what systemd had assumed was there which is async probe for all > drivers. I'll poke back once we have that option available on the > kernel and that should help mitigate other issues I suppose.
Great! >> and added a warning >> which is triggered after a third of the timeout. > > This is great! What commit merged this?I just looked at the latest > commits and couldn't find one associated with this. I also checked the > mailing list for pending patches. Oops, my push failed. Just pushed now: <http://cgit.freedesktop.org/systemd/systemd/commit/?id=671174136525ddf208cdbe75d6d6bd159afa961f>. > Also what would we use to hunt for > these? Would they show up on: > > journalctl -b -0 -u systemd-udevd Precisely. Generated by making processing hang for all network devices: Sep 11 21:26:58 tomegun-x240 systemd-udevd[404]: worker [406] /devices/virtual/net/lo is taking a long time Sep 11 21:26:58 tomegun-x240 systemd-udevd[404]: worker [407] /devices/pci0000:00/0000:00:1c.1/0000:03:00.0/net/wlan0 is taking a long time Sep 11 21:26:58 tomegun-x240 systemd-udevd[404]: worker [411] /devices/pci0000:00/0000:00:19.0/net/enp0s25 is taking a long time Sep 11 21:26:58 tomegun-x240 systemd-udevd[404]: worker [412] /devices/pci0000:00/0000:00:14.0/usb2/2-4/2-4:1.6/net/wwan0 is taking a long time Sep 11 21:28:58 tomegun-x240 systemd-udevd[404]: worker [406] /devices/virtual/net/lo timeout; kill it Sep 11 21:28:58 tomegun-x240 systemd-udevd[404]: seq 2214 '/devices/virtual/net/lo' killed Sep 11 21:28:58 tomegun-x240 systemd-udevd[404]: worker [407] /devices/pci0000:00/0000:00:1c.1/0000:03:00.0/net/wlan0 timeout; kill it Sep 11 21:28:58 tomegun-x240 systemd-udevd[404]: seq 2502 '/devices/pci0000:00/0000:00:1c.1/0000:03:00.0/net/wlan0' killed Sep 11 21:28:58 tomegun-x240 systemd-udevd[404]: worker [411] /devices/pci0000:00/0000:00:19.0/net/enp0s25 timeout; kill it Sep 11 21:28:58 tomegun-x240 systemd-udevd[404]: seq 2015 '/devices/pci0000:00/0000:00:19.0/net/enp0s25' killed Sep 11 21:28:58 tomegun-x240 systemd-udevd[404]: worker [412] /devices/pci0000:00/0000:00:14.0/usb2/2-4/2-4:1.6/net/wwan0 timeout; kill it Sep 11 21:28:58 tomegun-x240 systemd-udevd[404]: seq 2455 '/devices/pci0000:00/0000:00:14.0/usb2/2-4/2-4:1.6/net/wwan0' killed Sep 11 21:28:58 tomegun-x240 systemd-udevd[404]: worker [406] terminated by signal 9 (Killed) Sep 11 21:28:58 tomegun-x240 systemd-udevd[404]: worker [407] terminated by signal 9 (Killed) Sep 11 21:28:58 tomegun-x240 systemd-udevd[404]: worker [411] terminated by signal 9 (Killed) Sep 11 21:28:58 tomegun-x240 systemd-udevd[404]: worker [412] terminated by signal 9 (Killed) > I ask as when I tried to add something similar I was unable to capture > anything with the above. We'd use this to then monitor for these prior > to assuming other issues then. > >> Let me know if this covers what you need, or if there is anything else >> we should do on the systemd side. > > Thanks! I realize that at this point perhaps removing the sigkill may > not be possible at all given that this is part of an original design > consideration but if we can at least address all other possible > concerns this would be great. I know for instance James has concerns > with any arbitrary timeout in general, but perhaps he can elaborate > more on that. Also Jiri made this one point about reboot and the > timeout: > >>>> This aproach is actually introducing a user-visible regressions. Look, for >>>> example, exec() never times out. Therefore if your system is on its knees, >>>> heavily overloaded (or completely broken), you are likely to be able to >>>> `reboot' it, because exec("/sbin/reboot") ultimately succeeds. >>>> >>>> But with all the timeouts, dbus, "Failed to issue method call: Did >>>> not receive a reply" messages, this is getting close to impossible. _______________________________________________ systemd-devel mailing list systemd-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/systemd-devel