Bug#793116: systemd-journald exiting with SIGUSR1

2017-02-09 Thread Joseph Landry
I came to the same conclusion as Daniel Povey in Message #56.

RestartSec in systemd-journal-flush does not help in this case. it is a 
one-shot service and it will not be restarted.

Note that in my case systemd-journald is killed due to a watchdog timeout when 
the server is under heavy load.

As a workaround, I set "Type=notify" (instead of the default "Type=simple") in 
systemd-journald's [Service] section . I assume it should delay 
systemd-journal-flush.service until systemd-journald is ready to handle the 
signal.
Worst case scenario : systemd-journald does notify that it has started before 
installing the signal-handler. Then the workaround would only minimize the 
effect and I guess it would be an upstream bug. 

But to stay on the safe side I also added these settings to give 
systemd-journald a chance to restart at a latter time (still in the [Service] 
section):
# wait 5s at restart.
# fails if restarted more than 100 times within 10mn
RestartSec=5
StartLimitBurst=100
StartLimitInterval=600

Finally I increased WatchdogSec in order to be more tolerant to high load but I 
think it is out of the scope of this bug.



Bug#793116: systemd-journald exiting with SIGUSR1

2016-10-28 Thread Daniel Povey
To follow up: this did not fix the problem.

On Fri, Oct 21, 2016 at 2:45 PM, Daniel Povey  wrote:

> I just want to follow up on this that I believe I have found the reason
> for this bug and I have a solution which I am testing out.
>
> In the output of `systemctl status systemd-journald` it says:
>Active: failed (Result: start-limit)
> and this is related to the systemd mechanism to stop cycles where services
> keep restarting.  It's related to the configs
> StartLimitIntervalSec=   [defaults to DefaultStartLimitIntervalSec=10 sec
> by default]
> StartLimitBurst=  [defaults to DefaultStartLimitBurst=5 by default]
>
> so that if a service is restarted 5 times within 10 seconds, it's not
> restarted any more to avoid wasting resources.
> What seems to happen is that if systemd-journal-flush.service sends the
> signal SIGUSR1 to systemd-journald and systemd-journald wasn't ready to
> handle it for some reason (e.g. had just been started, or there was some
> hangup), then systemd-journald is restarted, and
> systemd-journal-flush.service immediately restarts also.  Again
> systemd-journald won't be ready to handle the signal as it hasn't had time
> to run the signal handler, and the cycle repeats 5 times until the
> systemd-journald service hits the StartLimitBurst and dies.  I verified
> using `lastcomm` (we have process accounting enabled) that this did in fact
> happen.
>
> The way I am trying to fix it is to introduce a restart delay of 10
> seconds into the systemd-journal-flush service.  See the config file below;
> the only new line that differs from the defaults in
>  /lib/systemd/system/systemd-journal-flush.service is the line
> RestartSec=10.
>
> Dan
>
>
>
> ==
> cat /etc/systemd/system/systemd-journal-flush.service
> #  This file is part of systemd.
> #
> #  systemd is free software; you can redistribute it and/or modify it
> #  under the terms of the GNU Lesser General Public License as published by
> #  the Free Software Foundation; either version 2.1 of the License, or
> #  (at your option) any later version.
>
> [Unit]
> Description=Trigger Flushing of Journal to Persistent Storage
> Documentation=man:systemd-journald.service(8) man:journald.conf(5)
> DefaultDependencies=no
> Requires=systemd-journald.service
> After=systemd-journald.service local-fs.target remote-fs.target
> Before=systemd-user-sessions.service
> # Don't re-start this service too often, or we can get into a cycle where
> # if this service was run just after systemd-journald was started, it
> # fails to handle the signal, causing that service to be restarted
> # and this servie to be immediately restarted also, until we hit
> # the limit of StartLimitBurst=5 restarts and systemd-journald
> # dies.
> RestartSec=10
>
> [Service]
> ExecStart=/bin/systemctl kill --kill-who=main --signal=SIGUSR1
> systemd-journald.service
> Type=oneshot
>
>
> On Sat, Oct 15, 2016 at 9:48 PM, Daniel Povey  wrote:
>
>> Another observation about this bug, which might be helpful.
>>
>> If the signal is sent to systemd-journald via
>> /bin/systemctl kill --kill-who=main --signal=SIGUSR1
>> systemd-journald.service
>> then messages like the following show up in the kernel messages from
>> `dmesg -T`, like:
>>
>> [Sat Oct 15 21:02:35 2016] systemd-journald[26517]: Received request to
>> flush runtime journal from PID 1
>>
>> but they don't show up in the output of 'journalctl -r'.
>> In  /etc/systemd/journald.conf, it says:
>>
>> #MaxLevelStore=debug
>> #MaxLevelSyslog=debug
>>
>> so I would have thought the same messages would go to both places.
>> I don't know if I'm misunderstanding something here..
>>
>> Dan
>>
>>
>>
>>
>>
>> On Sat, Oct 15, 2016 at 9:07 PM, Daniel Povey  wrote:
>>
>>> BTW, I attach the output from `systemd-analyze dump`, as dump.txt.
>>> It would be great if the debian people could help us look into this.
>>> Lennart has a policy that his team will only look into bug reports in
>>> the latest two versions of systemd, and obviously we are well behind that.
>>>
>>> Dan
>>>
>>>
>>> On Sat, Oct 15, 2016 at 8:53 PM, Daniel Povey  wrote:
>>>

 I just want to report that we are suffering from this bug, and it is
 quite frequent.
 This is with version  215-17+deb8u5 .

 root@a12:~# systemctl status systemd-journald

 *** systemd-journald.service - Journal Service

Loaded: loaded (/lib/systemd/system/systemd-journald.service;
 static)

Active: *failed* (Result: start-limit) since Sat 2016-10-15
 12:01:57 EDT; 8h ago

  Docs: man:systemd-journald.service(8)

man:journald.conf(5)

   Process: 51561 ExecStart=/lib/systemd/systemd-journald *(code=killed,
 signal=USR1)*

  Main PID: 51561 (code=killed, signal=USR1)



>>>
>>
>


Bug#793116: systemd-journald exiting with SIGUSR1

2016-10-21 Thread Daniel Povey
I just want to follow up on this that I believe I have found the reason for
this bug and I have a solution which I am testing out.

In the output of `systemctl status systemd-journald` it says:
   Active: failed (Result: start-limit)
and this is related to the systemd mechanism to stop cycles where services
keep restarting.  It's related to the configs
StartLimitIntervalSec=   [defaults to DefaultStartLimitIntervalSec=10 sec
by default]
StartLimitBurst=  [defaults to DefaultStartLimitBurst=5 by default]

so that if a service is restarted 5 times within 10 seconds, it's not
restarted any more to avoid wasting resources.
What seems to happen is that if systemd-journal-flush.service sends the
signal SIGUSR1 to systemd-journald and systemd-journald wasn't ready to
handle it for some reason (e.g. had just been started, or there was some
hangup), then systemd-journald is restarted, and
systemd-journal-flush.service immediately restarts also.  Again
systemd-journald won't be ready to handle the signal as it hasn't had time
to run the signal handler, and the cycle repeats 5 times until the
systemd-journald service hits the StartLimitBurst and dies.  I verified
using `lastcomm` (we have process accounting enabled) that this did in fact
happen.

The way I am trying to fix it is to introduce a restart delay of 10 seconds
into the systemd-journal-flush service.  See the config file below; the
only new line that differs from the defaults in
 /lib/systemd/system/systemd-journal-flush.service is the line
RestartSec=10.

Dan



==
cat /etc/systemd/system/systemd-journal-flush.service
#  This file is part of systemd.
#
#  systemd is free software; you can redistribute it and/or modify it
#  under the terms of the GNU Lesser General Public License as published by
#  the Free Software Foundation; either version 2.1 of the License, or
#  (at your option) any later version.

[Unit]
Description=Trigger Flushing of Journal to Persistent Storage
Documentation=man:systemd-journald.service(8) man:journald.conf(5)
DefaultDependencies=no
Requires=systemd-journald.service
After=systemd-journald.service local-fs.target remote-fs.target
Before=systemd-user-sessions.service
# Don't re-start this service too often, or we can get into a cycle where
# if this service was run just after systemd-journald was started, it
# fails to handle the signal, causing that service to be restarted
# and this servie to be immediately restarted also, until we hit
# the limit of StartLimitBurst=5 restarts and systemd-journald
# dies.
RestartSec=10

[Service]
ExecStart=/bin/systemctl kill --kill-who=main --signal=SIGUSR1
systemd-journald.service
Type=oneshot


On Sat, Oct 15, 2016 at 9:48 PM, Daniel Povey  wrote:

> Another observation about this bug, which might be helpful.
>
> If the signal is sent to systemd-journald via
> /bin/systemctl kill --kill-who=main --signal=SIGUSR1
> systemd-journald.service
> then messages like the following show up in the kernel messages from
> `dmesg -T`, like:
>
> [Sat Oct 15 21:02:35 2016] systemd-journald[26517]: Received request to
> flush runtime journal from PID 1
>
> but they don't show up in the output of 'journalctl -r'.
> In  /etc/systemd/journald.conf, it says:
>
> #MaxLevelStore=debug
> #MaxLevelSyslog=debug
>
> so I would have thought the same messages would go to both places.
> I don't know if I'm misunderstanding something here..
>
> Dan
>
>
>
>
>
> On Sat, Oct 15, 2016 at 9:07 PM, Daniel Povey  wrote:
>
>> BTW, I attach the output from `systemd-analyze dump`, as dump.txt.
>> It would be great if the debian people could help us look into this.
>> Lennart has a policy that his team will only look into bug reports in the
>> latest two versions of systemd, and obviously we are well behind that.
>>
>> Dan
>>
>>
>> On Sat, Oct 15, 2016 at 8:53 PM, Daniel Povey  wrote:
>>
>>>
>>> I just want to report that we are suffering from this bug, and it is
>>> quite frequent.
>>> This is with version  215-17+deb8u5 .
>>>
>>> root@a12:~# systemctl status systemd-journald
>>>
>>> *** systemd-journald.service - Journal Service
>>>
>>>Loaded: loaded (/lib/systemd/system/systemd-journald.service; static)
>>>
>>>Active: *failed* (Result: start-limit) since Sat 2016-10-15 12:01:57
>>> EDT; 8h ago
>>>
>>>  Docs: man:systemd-journald.service(8)
>>>
>>>man:journald.conf(5)
>>>
>>>   Process: 51561 ExecStart=/lib/systemd/systemd-journald *(code=killed,
>>> signal=USR1)*
>>>
>>>  Main PID: 51561 (code=killed, signal=USR1)
>>>
>>>
>>>
>>
>


Bug#793116: systemd-journald exiting with SIGUSR1

2016-10-15 Thread Daniel Povey
Another observation about this bug, which might be helpful.

If the signal is sent to systemd-journald via
/bin/systemctl kill --kill-who=main --signal=SIGUSR1
systemd-journald.service
then messages like the following show up in the kernel messages from `dmesg
-T`, like:

[Sat Oct 15 21:02:35 2016] systemd-journald[26517]: Received request to
flush runtime journal from PID 1

but they don't show up in the output of 'journalctl -r'.
In  /etc/systemd/journald.conf, it says:

#MaxLevelStore=debug
#MaxLevelSyslog=debug

so I would have thought the same messages would go to both places.
I don't know if I'm misunderstanding something here..

Dan





On Sat, Oct 15, 2016 at 9:07 PM, Daniel Povey  wrote:

> BTW, I attach the output from `systemd-analyze dump`, as dump.txt.
> It would be great if the debian people could help us look into this.
> Lennart has a policy that his team will only look into bug reports in the
> latest two versions of systemd, and obviously we are well behind that.
>
> Dan
>
>
> On Sat, Oct 15, 2016 at 8:53 PM, Daniel Povey  wrote:
>
>>
>> I just want to report that we are suffering from this bug, and it is
>> quite frequent.
>> This is with version  215-17+deb8u5 .
>>
>> root@a12:~# systemctl status systemd-journald
>>
>> *** systemd-journald.service - Journal Service
>>
>>Loaded: loaded (/lib/systemd/system/systemd-journald.service; static)
>>
>>Active: *failed* (Result: start-limit) since Sat 2016-10-15 12:01:57
>> EDT; 8h ago
>>
>>  Docs: man:systemd-journald.service(8)
>>
>>man:journald.conf(5)
>>
>>   Process: 51561 ExecStart=/lib/systemd/systemd-journald *(code=killed,
>> signal=USR1)*
>>
>>  Main PID: 51561 (code=killed, signal=USR1)
>>
>>
>>
>


Bug#793116: systemd-journald exiting with SIGUSR1

2016-10-15 Thread Daniel Povey
I just want to report that we are suffering from this bug, and it is quite
frequent.
This is with version  215-17+deb8u5 .

root@a12:~# systemctl status systemd-journald

*** systemd-journald.service - Journal Service

   Loaded: loaded (/lib/systemd/system/systemd-journald.service; static)

   Active: *failed* (Result: start-limit) since Sat 2016-10-15 12:01:57
EDT; 8h ago

 Docs: man:systemd-journald.service(8)

   man:journald.conf(5)

  Process: 51561 ExecStart=/lib/systemd/systemd-journald *(code=killed,
signal=USR1)*

 Main PID: 51561 (code=killed, signal=USR1)