Re: [systemd-devel] Triggering the HW Watchdog
Partially. It shows that systemd is handling the watchdog as I expect it to here, but it also means that the "dysfunctional" times where the system isn't resetting properly are _not_ due to the watchdog triggering, but are a "normal system" according to systemd.

Which is a worse case for me, since it's harder to debug.

So, conclusion:
systemd seems to handle the watchdog properly
systemd seems to not die properly when we expect it to, leaving us with more debugging to do.

I hope that makes more sense than less.

On Tue, Feb 27, 2018 at 5:34 PM, Mantas Mikulėnas wrote:
> On Tue, Feb 27, 2018 at 6:25 PM, D.S. Ljungmark wrote:
>>
>> On 27/02/18 15:21, Lennart Poettering wrote:
>> > On Di, 27.02.18 15:12, D.S. Ljungmark (ljungm...@modio.se) wrote:
>> >
>> >>> I figure you can send SIGSTOP to PID 1, no? (there are some signals
>> >>> the kernel blocks for PID 1, but I think SIGSTOP is not among them,
>> >>> please try)
>> >>
>> >> It seems that SIGSTOP is being filtered, because nothing appears to
>> >> happen, and the system certainly isn't rebooting.
>> >
>> > You should be able to trigger an abort in PID 1 by sending it SIGABRT
>> > or SIGQUIT or so. If PID 1 aborts it will actually enter a freeze loop
>> > in which it stops pinging the hw watchdog.
>> >
>> > Lennart
>>
>> ABRT works, or well..
>>
>> systemd[1]: Caught <ABRT>, core dump failed (child 3844, code=killed,
>> status=6/ABRT).
>>
>> And then a broadcast, freezing execution
>>
>> And after that, what I was afraid of:
>>
>> [25417.186351] watchdog: watchdog0: watchdog did not stop!
>
> Isn't that exactly the result you asked for?
>
> --
> Mantas Mikulėnas

___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel
Re: [systemd-devel] Triggering the HW Watchdog
On Tue, Feb 27, 2018 at 6:25 PM, D.S. Ljungmark wrote:
> On 27/02/18 15:21, Lennart Poettering wrote:
> > On Di, 27.02.18 15:12, D.S. Ljungmark (ljungm...@modio.se) wrote:
> >
> >>> I figure you can send SIGSTOP to PID 1, no? (there are some signals
> >>> the kernel blocks for PID 1, but I think SIGSTOP is not among them,
> >>> please try)
> >>
> >> It seems that SIGSTOP is being filtered, because nothing appears to
> >> happen, and the system certainly isn't rebooting.
> >
> > You should be able to trigger an abort in PID 1 by sending it SIGABRT
> > or SIGQUIT or so. If PID 1 aborts it will actually enter a freeze loop
> > in which it stops pinging the hw watchdog.
> >
> > Lennart
>
> ABRT works, or well..
>
> systemd[1]: Caught <ABRT>, core dump failed (child 3844, code=killed,
> status=6/ABRT).
>
> And then a broadcast, freezing execution
>
> And after that, what I was afraid of:
>
> [25417.186351] watchdog: watchdog0: watchdog did not stop!

Isn't that exactly the result you asked for?

--
Mantas Mikulėnas
Re: [systemd-devel] Triggering the HW Watchdog
On 27/02/18 15:21, Lennart Poettering wrote:
> On Di, 27.02.18 15:12, D.S. Ljungmark (ljungm...@modio.se) wrote:
>
>>> I figure you can send SIGSTOP to PID 1, no? (there are some signals
>>> the kernel blocks for PID 1, but I think SIGSTOP is not among them,
>>> please try)
>>
>> It seems that SIGSTOP is being filtered, because nothing appears to
>> happen, and the system certainly isn't rebooting.
>
> You should be able to trigger an abort in PID 1 by sending it SIGABRT
> or SIGQUIT or so. If PID 1 aborts it will actually enter a freeze loop
> in which it stops pinging the hw watchdog.
>
> Lennart

ABRT works, or well..

systemd[1]: Caught <ABRT>, core dump failed (child 3844, code=killed, status=6/ABRT).

And then a broadcast, freezing execution

And after that, what I was afraid of:

[25417.186351] watchdog: watchdog0: watchdog did not stop!

Well, that gives me a tool to debug this with. Thank you!

//D.S

--
8362 CB14 98AD 11EF CEB6 FA81 FCC3 7674 449E 3CFC
Re: [systemd-devel] Triggering the HW Watchdog
( re-send as I forgot the list )

On 27/02/18 13:20, Lennart Poettering wrote:
> On Di, 27.02.18 12:44, D.S. Ljungmark (ljungm...@modio.se) wrote:
>
>> Hi list!
>>
>> We're using systemd to control the hardware watchdog, and would want to
>> induce fail state to _verify_ that the shutdown/reboot process works as
>> expected.
>>
>> How do we make systemd "fail" to ping the watchdog?
>
> I figure you can send SIGSTOP to PID 1, no? (there are some signals
> the kernel blocks for PID 1, but I think SIGSTOP is not among them,
> please try)

It seems that SIGSTOP is being filtered, because nothing appears to happen, and the system certainly isn't rebooting.

>> How do we control which states ( root fs not available, etc) cause
>> systemd to not ping the hardware watchdog?
>
> The watchdog is for detecting software hanging. Root fs not being
> available does not really qualify as "software hanging". If you want
> to reboot the machine if it fails to bring everything up, then use
> JobTimeoutAction= on some suitable action, for example local-fs.target
> or multi-user.target.
>
> Lennart

Thanks, I'm trying to get to a state where the machine fails over and triggers watchdog on known things, rather than triggering the rescue shell or similar.

I'll try with a jobtimeout on multi-user.

//D.S.

--
8362 CB14 98AD 11EF CEB6 FA81 FCC3 7674 449E 3CFC
Re: [systemd-devel] Triggering the HW Watchdog
On Di, 27.02.18 15:12, D.S. Ljungmark (ljungm...@modio.se) wrote:

> > I figure you can send SIGSTOP to PID 1, no? (there are some signals
> > the kernel blocks for PID 1, but I think SIGSTOP is not among them,
> > please try)
>
> It seems that SIGSTOP is being filtered, because nothing appears to
> happen, and the system certainly isn't rebooting.

You should be able to trigger an abort in PID 1 by sending it SIGABRT or SIGQUIT or so. If PID 1 aborts it will actually enter a freeze loop in which it stops pinging the hw watchdog.

Lennart

--
Lennart Poettering, Red Hat
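For anyone wanting to script the SIGABRT test described in this message, here is a minimal Python sketch. The function name `abort_pid1` and its `dry_run` flag are made up for illustration and are not from the thread. It assumes a test machine, root privileges, and that PID 1 is systemd with the hardware watchdog enabled. It defaults to a dry run because the real call is expected to freeze PID 1.

```python
import os
import signal


def abort_pid1(dry_run: bool = True) -> str:
    """Send SIGABRT to PID 1; with dry_run=True only describe the action."""
    sig = signal.SIGABRT
    if dry_run:
        # Nothing is sent in this branch; safe to run anywhere.
        return f"would send {sig.name} to PID 1"
    # Needs root. Per the message above, systemd is then expected to enter
    # its freeze loop and stop pinging the hardware watchdog.
    os.kill(1, sig)
    return f"sent {sig.name} to PID 1"


if __name__ == "__main__":
    print(abort_pid1(dry_run=True))
```

Calling `abort_pid1(dry_run=False)` is only sensible on a machine you are prepared to see reset by the watchdog.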
Re: [systemd-devel] Triggering the HW Watchdog
On Di, 27.02.18 12:44, D.S. Ljungmark (ljungm...@modio.se) wrote:

> Hi list!
>
> We're using systemd to control the hardware watchdog, and would want to
> induce fail state to _verify_ that the shutdown/reboot process works as
> expected.
>
> How do we make systemd "fail" to ping the watchdog?

I figure you can send SIGSTOP to PID 1, no? (there are some signals the kernel blocks for PID 1, but I think SIGSTOP is not among them, please try)

> How do we control which states ( root fs not available, etc) cause
> systemd to not ping the hardware watchdog?

The watchdog is for detecting software hanging. Root fs not being available does not really qualify as "software hanging". If you want to reboot the machine if it fails to bring everything up, then use JobTimeoutAction= on some suitable action, for example local-fs.target or multi-user.target.

Lennart

--
Lennart Poettering, Red Hat
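As a rough illustration of the JobTimeoutAction= suggestion, the Python sketch below renders the text of a unit drop-in. The helper name `job_timeout_dropin`, the `10min` timeout, the `reboot-force` action, and the drop-in file name are all assumptions chosen for the example; none of them come from the thread. JobTimeoutSec= and JobTimeoutAction= are the [Unit] settings involved; suitable values depend on how long a normal boot takes on the machine in question.

```python
def job_timeout_dropin(timeout: str = "10min", action: str = "reboot-force") -> str:
    """Return the text of a [Unit] drop-in setting a job timeout and action."""
    return (
        "[Unit]\n"
        f"JobTimeoutSec={timeout}\n"
        f"JobTimeoutAction={action}\n"
    )


if __name__ == "__main__":
    # Hypothetical destination; systemd reads drop-ins from <unit>.d/*.conf.
    print("# /etc/systemd/system/multi-user.target.d/10-job-timeout.conf")
    print(job_timeout_dropin(), end="")
```

The printed text would be saved under the target's drop-in directory, followed by `systemctl daemon-reload`. The script itself writes nothing to disk, so it can be inspected before anything is changed.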
[systemd-devel] Triggering the HW Watchdog
Hi list!

We're using systemd to control the hardware watchdog, and would want to induce fail state to _verify_ that the shutdown/reboot process works as expected.

How do we make systemd "fail" to ping the watchdog?

How do we control which states ( root fs not available, etc) cause systemd to not ping the hardware watchdog?

//D.S.

--
8362 CB14 98AD 11EF CEB6 FA81 FCC3 7674 449E 3CFC