Hello systemd experts and developers,
I recently stumbled over the bug with the watchdog mechanism that has already
been reported to free desktop bugzilla (56109).
I analyzed the bug and came to a simple solution for solving it.
First, what I think is going on:
- watchdog timeout is detected in service_handle_watchdog(),
service_enter_dead(…) is called
- service_enter_dead() sets the service state to auto_restart
- triggered by a timer, service_enter_restart is called
- service_enter_restart schedules a restart job
- systemd splits up the jobs into a stop and a start job and schedules
both
- the stop job lasts to a call of service_stop()
- here it begins to get interesting:
- based on the AUTO_RESTART state, this function decides to go directly
into dead state, nothing of the normal stopping procedure is done. This is
probably because in most cases that cause a restart to be scheduled the stop
proceeding is done automatically (for instance in case of a killed or normally
exiting service.). But this is not true for a watchdog timeout. Nothing of the
stop proceeding is executed in case of such a timeout. So the process that
missed to send the watchdog event is going on to life (in which state ever). No
one is cleaning up. A second instance of the service is started.
My suggestion to solve this:
Changes are needed in service.c in service_stop(…).
change:
/* A restart will be scheduled or is in progress. */
if (s->state == SERVICE_AUTO_RESTART) {
service_set_state(s, SERVICE_DEAD);
return 0;
}
to:
/* A restart will be scheduled or is in progress.
In all cases but the watchdog timeout, stop is already progressed by
systemd automatically*/
if (s->state == SERVICE_AUTO_RESTART && s->result !=
SERVICE_FAILURE_WATCHDOG) {
service_set_state(s, SERVICE_DEAD);
return 0;
}
and change:
assert(s->state == SERVICE_RUNNING ||
s->state == SERVICE_EXITED);
to:
assert(s->state == SERVICE_RUNNING ||
s->state == SERVICE_AUTO_RESTART ||
s->state == SERVICE_EXITED);
I tested the following:
- the watchdog mechanism is now actually stopping / killing the service
in case it is not sending the watchdog event right in time
- a restart triggered by a killed service works like before
Hopefully, I didn’t miss some side effects caused by my changes.
Any opinions on my proposed changes?
Kind regards,
Marko Hoyer
---
Alle Postfächer an einem Ort. Jetzt wechseln und E-Mail-Adresse mitnehmen!
Rundum glücklich mit freenetMail
_______________________________________________
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel