Bug#793116: systemd-journald exiting with SIGUSR1
I came to the same conclusion as Daniel Povey in Message #56. RestartSec in systemd-journal-flush does not help in this case. it is a one-shot service and it will not be restarted. Note that in my case systemd-journald is killed due to a watchdog timeout when the server is under heavy load. As a workaround, I set "Type=notify" (instead of the default "Type=simple") in systemd-journald's [Service] section . I assume it should delay systemd-journal-flush.service until systemd-journald is ready to handle the signal. Worst case scenario : systemd-journald does notify that it has started before installing the signal-handler. Then the workaround would only minimize the effect and I guess it would be an upstream bug. But to stay on the safe side I also added these settings to give systemd-journald a chance to restart at a latter time (still in the [Service] section): # wait 5s at restart. # fails if restarted more than 100 times within 10mn RestartSec=5 StartLimitBurst=100 StartLimitInterval=600 Finally I increased WatchdogSec in order to be more tolerant to high load but I think it is out of the scope of this bug. ___ Pkg-systemd-maintainers mailing list Pkg-systemd-maintainers@lists.alioth.debian.org http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/pkg-systemd-maintainers
Bug#793116: systemd-journald exiting with SIGUSR1
To follow up: this did not fix the problem. On Fri, Oct 21, 2016 at 2:45 PM, Daniel Povey wrote: > I just want to follow up on this that I believe I have found the reason > for this bug and I have a solution which I am testing out. > > In the output of `systemctl status systemd-journald` it says: >Active: failed (Result: start-limit) > and this is related to the systemd mechanism to stop cycles where services > keep restarting. It's related to the configs > StartLimitIntervalSec= [defaults to DefaultStartLimitIntervalSec=10 sec > by default] > StartLimitBurst= [defaults to DefaultStartLimitBurst=5 by default] > > so that if a service is restarted 5 times within 10 seconds, it's not > restarted any more to avoid wasting resources. > What seems to happen is that if systemd-journal-flush.service sends the > signal SIGUSR1 to systemd-journald and systemd-journald wasn't ready to > handle it for some reason (e.g. had just been started, or there was some > hangup), then systemd-journald is restarted, and > systemd-journal-flush.service immediately restarts also. Again > systemd-journald won't be ready to handle the signal as it hasn't had time > to run the signal handler, and the cycle repeats 5 times until the > systemd-journald service hits the StartLimitBurst and dies. I verified > using `lastcomm` (we have process accounting enabled) that this did in fact > happen. > > The way I am trying to fix it is to introduce a restart delay of 10 > seconds into the systemd-journal-flush service. See the config file below; > the only new line that differs from the defaults in > /lib/systemd/system/systemd-journal-flush.service is the line > RestartSec=10. > > Dan > > > > == > cat /etc/systemd/system/systemd-journal-flush.service > # This file is part of systemd. > # > # systemd is free software; you can redistribute it and/or modify it > # under the terms of the GNU Lesser General Public License as published by > # the Free Software Foundation; either version 2.1 of the License, or > # (at your option) any later version. > > [Unit] > Description=Trigger Flushing of Journal to Persistent Storage > Documentation=man:systemd-journald.service(8) man:journald.conf(5) > DefaultDependencies=no > Requires=systemd-journald.service > After=systemd-journald.service local-fs.target remote-fs.target > Before=systemd-user-sessions.service > # Don't re-start this service too often, or we can get into a cycle where > # if this service was run just after systemd-journald was started, it > # fails to handle the signal, causing that service to be restarted > # and this servie to be immediately restarted also, until we hit > # the limit of StartLimitBurst=5 restarts and systemd-journald > # dies. > RestartSec=10 > > [Service] > ExecStart=/bin/systemctl kill --kill-who=main --signal=SIGUSR1 > systemd-journald.service > Type=oneshot > > > On Sat, Oct 15, 2016 at 9:48 PM, Daniel Povey wrote: > >> Another observation about this bug, which might be helpful. >> >> If the signal is sent to systemd-journald via >> /bin/systemctl kill --kill-who=main --signal=SIGUSR1 >> systemd-journald.service >> then messages like the following show up in the kernel messages from >> `dmesg -T`, like: >> >> [Sat Oct 15 21:02:35 2016] systemd-journald[26517]: Received request to >> flush runtime journal from PID 1 >> >> but they don't show up in the output of 'journalctl -r'. >> In /etc/systemd/journald.conf, it says: >> >> #MaxLevelStore=debug >> #MaxLevelSyslog=debug >> >> so I would have thought the same messages would go to both places. >> I don't know if I'm misunderstanding something here.. >> >> Dan >> >> >> >> >> >> On Sat, Oct 15, 2016 at 9:07 PM, Daniel Povey wrote: >> >>> BTW, I attach the output from `systemd-analyze dump`, as dump.txt. >>> It would be great if the debian people could help us look into this. >>> Lennart has a policy that his team will only look into bug reports in >>> the latest two versions of systemd, and obviously we are well behind that. >>> >>> Dan >>> >>> >>> On Sat, Oct 15, 2016 at 8:53 PM, Daniel Povey wrote: >>> I just want to report that we are suffering from this bug, and it is quite frequent. This is with version 215-17+deb8u5 . root@a12:~# systemctl status systemd-journald *** systemd-journald.service - Journal Service Loaded: loaded (/lib/systemd/system/systemd-journald.service; static) Active: *failed* (Result: start-limit) since Sat 2016-10-15 12:01:57 EDT; 8h ago Docs: man:systemd-journald.service(8) man:journald.conf(5) Process: 51561 ExecStart=/lib/systemd/systemd-journald *(code=killed, signal=USR1)* Main PID: 51561 (code=killed, signal=USR1) >>> >> > ___ Pkg-systemd-maintainers mailing list Pkg-systemd-maintainers@lists.alioth.debian.org http://lists.alioth.debian.org/cgi-bin/mailman/listinfo
Bug#793116: systemd-journald exiting with SIGUSR1
I just want to follow up on this that I believe I have found the reason for this bug and I have a solution which I am testing out. In the output of `systemctl status systemd-journald` it says: Active: failed (Result: start-limit) and this is related to the systemd mechanism to stop cycles where services keep restarting. It's related to the configs StartLimitIntervalSec= [defaults to DefaultStartLimitIntervalSec=10 sec by default] StartLimitBurst= [defaults to DefaultStartLimitBurst=5 by default] so that if a service is restarted 5 times within 10 seconds, it's not restarted any more to avoid wasting resources. What seems to happen is that if systemd-journal-flush.service sends the signal SIGUSR1 to systemd-journald and systemd-journald wasn't ready to handle it for some reason (e.g. had just been started, or there was some hangup), then systemd-journald is restarted, and systemd-journal-flush.service immediately restarts also. Again systemd-journald won't be ready to handle the signal as it hasn't had time to run the signal handler, and the cycle repeats 5 times until the systemd-journald service hits the StartLimitBurst and dies. I verified using `lastcomm` (we have process accounting enabled) that this did in fact happen. The way I am trying to fix it is to introduce a restart delay of 10 seconds into the systemd-journal-flush service. See the config file below; the only new line that differs from the defaults in /lib/systemd/system/systemd-journal-flush.service is the line RestartSec=10. Dan == cat /etc/systemd/system/systemd-journal-flush.service # This file is part of systemd. # # systemd is free software; you can redistribute it and/or modify it # under the terms of the GNU Lesser General Public License as published by # the Free Software Foundation; either version 2.1 of the License, or # (at your option) any later version. [Unit] Description=Trigger Flushing of Journal to Persistent Storage Documentation=man:systemd-journald.service(8) man:journald.conf(5) DefaultDependencies=no Requires=systemd-journald.service After=systemd-journald.service local-fs.target remote-fs.target Before=systemd-user-sessions.service # Don't re-start this service too often, or we can get into a cycle where # if this service was run just after systemd-journald was started, it # fails to handle the signal, causing that service to be restarted # and this servie to be immediately restarted also, until we hit # the limit of StartLimitBurst=5 restarts and systemd-journald # dies. RestartSec=10 [Service] ExecStart=/bin/systemctl kill --kill-who=main --signal=SIGUSR1 systemd-journald.service Type=oneshot On Sat, Oct 15, 2016 at 9:48 PM, Daniel Povey wrote: > Another observation about this bug, which might be helpful. > > If the signal is sent to systemd-journald via > /bin/systemctl kill --kill-who=main --signal=SIGUSR1 > systemd-journald.service > then messages like the following show up in the kernel messages from > `dmesg -T`, like: > > [Sat Oct 15 21:02:35 2016] systemd-journald[26517]: Received request to > flush runtime journal from PID 1 > > but they don't show up in the output of 'journalctl -r'. > In /etc/systemd/journald.conf, it says: > > #MaxLevelStore=debug > #MaxLevelSyslog=debug > > so I would have thought the same messages would go to both places. > I don't know if I'm misunderstanding something here.. > > Dan > > > > > > On Sat, Oct 15, 2016 at 9:07 PM, Daniel Povey wrote: > >> BTW, I attach the output from `systemd-analyze dump`, as dump.txt. >> It would be great if the debian people could help us look into this. >> Lennart has a policy that his team will only look into bug reports in the >> latest two versions of systemd, and obviously we are well behind that. >> >> Dan >> >> >> On Sat, Oct 15, 2016 at 8:53 PM, Daniel Povey wrote: >> >>> >>> I just want to report that we are suffering from this bug, and it is >>> quite frequent. >>> This is with version 215-17+deb8u5 . >>> >>> root@a12:~# systemctl status systemd-journald >>> >>> *** systemd-journald.service - Journal Service >>> >>>Loaded: loaded (/lib/systemd/system/systemd-journald.service; static) >>> >>>Active: *failed* (Result: start-limit) since Sat 2016-10-15 12:01:57 >>> EDT; 8h ago >>> >>> Docs: man:systemd-journald.service(8) >>> >>>man:journald.conf(5) >>> >>> Process: 51561 ExecStart=/lib/systemd/systemd-journald *(code=killed, >>> signal=USR1)* >>> >>> Main PID: 51561 (code=killed, signal=USR1) >>> >>> >>> >> > ___ Pkg-systemd-maintainers mailing list Pkg-systemd-maintainers@lists.alioth.debian.org http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/pkg-systemd-maintainers
Bug#793116: systemd-journald exiting with SIGUSR1
Another observation about this bug, which might be helpful. If the signal is sent to systemd-journald via /bin/systemctl kill --kill-who=main --signal=SIGUSR1 systemd-journald.service then messages like the following show up in the kernel messages from `dmesg -T`, like: [Sat Oct 15 21:02:35 2016] systemd-journald[26517]: Received request to flush runtime journal from PID 1 but they don't show up in the output of 'journalctl -r'. In /etc/systemd/journald.conf, it says: #MaxLevelStore=debug #MaxLevelSyslog=debug so I would have thought the same messages would go to both places. I don't know if I'm misunderstanding something here.. Dan On Sat, Oct 15, 2016 at 9:07 PM, Daniel Povey wrote: > BTW, I attach the output from `systemd-analyze dump`, as dump.txt. > It would be great if the debian people could help us look into this. > Lennart has a policy that his team will only look into bug reports in the > latest two versions of systemd, and obviously we are well behind that. > > Dan > > > On Sat, Oct 15, 2016 at 8:53 PM, Daniel Povey wrote: > >> >> I just want to report that we are suffering from this bug, and it is >> quite frequent. >> This is with version 215-17+deb8u5 . >> >> root@a12:~# systemctl status systemd-journald >> >> *** systemd-journald.service - Journal Service >> >>Loaded: loaded (/lib/systemd/system/systemd-journald.service; static) >> >>Active: *failed* (Result: start-limit) since Sat 2016-10-15 12:01:57 >> EDT; 8h ago >> >> Docs: man:systemd-journald.service(8) >> >>man:journald.conf(5) >> >> Process: 51561 ExecStart=/lib/systemd/systemd-journald *(code=killed, >> signal=USR1)* >> >> Main PID: 51561 (code=killed, signal=USR1) >> >> >> > ___ Pkg-systemd-maintainers mailing list Pkg-systemd-maintainers@lists.alioth.debian.org http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/pkg-systemd-maintainers
Bug#793116: systemd-journald exiting with SIGUSR1
I just want to report that we are suffering from this bug, and it is quite frequent. This is with version 215-17+deb8u5 . root@a12:~# systemctl status systemd-journald *** systemd-journald.service - Journal Service Loaded: loaded (/lib/systemd/system/systemd-journald.service; static) Active: *failed* (Result: start-limit) since Sat 2016-10-15 12:01:57 EDT; 8h ago Docs: man:systemd-journald.service(8) man:journald.conf(5) Process: 51561 ExecStart=/lib/systemd/systemd-journald *(code=killed, signal=USR1)* Main PID: 51561 (code=killed, signal=USR1) ___ Pkg-systemd-maintainers mailing list Pkg-systemd-maintainers@lists.alioth.debian.org http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/pkg-systemd-maintainers