Hello systemd experts and developers,
 
I recently stumbled over a bug in the watchdog mechanism that has already been
reported to the freedesktop.org Bugzilla (bug 56109).
 
I analyzed the bug and came up with a simple fix.
 
First, what I think is going on:
- a watchdog timeout is detected in service_handle_watchdog(), which calls service_enter_dead(…)
- service_enter_dead() sets the service state to SERVICE_AUTO_RESTART
- when a timer fires, service_enter_restart() is called
- service_enter_restart() schedules a restart job
- systemd splits the restart job into a stop job and a start job and schedules both
- the stop job leads to a call of service_stop()
- here it gets interesting:
- because the service is in the SERVICE_AUTO_RESTART state, this function goes directly into the dead state; none of the normal stop procedure is executed. This is probably because in most cases that lead to a scheduled restart, the stop has already happened on its own (for instance, when the service was killed or exited normally). But this is not true for a watchdog timeout: nothing of the stop procedure is executed in that case. So the process that failed to send the watchdog notification keeps running (in whatever state it is in), no one cleans it up, and a second instance of the service is started.
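To make the sequence above easier to follow, here is a deliberately simplified C model of it. All type and function names (model_service, model_handle_watchdog() and so on) are hypothetical and not part of systemd; only the state names mirror those in service.c, and "live processes" stands in for whatever is left in the service's control group. It only illustrates how the AUTO_RESTART shortcut in service_stop() leaves the hung process alive, so the start job produces a second instance.

```c
#include <assert.h>

/* Hypothetical, heavily simplified model; NOT systemd's real types.
 * Only the state names mirror those used in service.c. */
typedef enum {
        MODEL_RUNNING,
        MODEL_EXITED,
        MODEL_AUTO_RESTART,
        MODEL_DEAD,
} model_state;

typedef enum {
        MODEL_SUCCESS,
        MODEL_FAILURE_WATCHDOG,
} model_result;

typedef struct {
        model_state state;
        model_result result;
        int live_processes; /* processes of this service still running */
} model_service;

/* Watchdog timeout: as with service_enter_dead(), only the state
 * changes; the hung process is never signalled. */
static void model_handle_watchdog(model_service *s) {
        s->result = MODEL_FAILURE_WATCHDOG;
        s->state = MODEL_AUTO_RESTART;
}

/* Current service_stop() logic: AUTO_RESTART jumps straight to DEAD. */
static void model_stop_original(model_service *s) {
        if (s->state == MODEL_AUTO_RESTART) {
                s->state = MODEL_DEAD;
                return;
        }
        assert(s->state == MODEL_RUNNING || s->state == MODEL_EXITED);
        s->live_processes = 0; /* stands in for the normal stop procedure */
        s->state = MODEL_DEAD;
}

/* Start job of the restart: spawns a fresh main process. */
static void model_start(model_service *s) {
        s->live_processes++;
        s->state = MODEL_RUNNING;
        s->result = MODEL_SUCCESS;
}

/* Watchdog timeout followed by the stop and start jobs of the restart.
 * Returns the number of processes alive afterwards. */
static int model_watchdog_restart_original(void) {
        model_service s = { MODEL_RUNNING, MODEL_SUCCESS, 1 };

        model_handle_watchdog(&s);
        model_stop_original(&s);
        model_start(&s);
        return s.live_processes;
}
```

In this model, model_watchdog_restart_original() returns 2: the old, unresponsive process plus the newly started one.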
 
My suggestion to solve this:
 
Changes are needed in service.c in service_stop(…).
 
change:
        /* A restart will be scheduled or is in progress. */
        if (s->state == SERVICE_AUTO_RESTART) {
                service_set_state(s, SERVICE_DEAD);
                return 0;
        }
 
to:
        /* A restart will be scheduled or is in progress.
         * In all cases except a watchdog timeout, systemd has already
         * carried out the stop automatically. */
        if (s->state == SERVICE_AUTO_RESTART &&
            s->result != SERVICE_FAILURE_WATCHDOG) {
                service_set_state(s, SERVICE_DEAD);
                return 0;
        }
 
and change:
 
        assert(s->state == SERVICE_RUNNING ||
               s->state == SERVICE_EXITED);
 
 
to:
        assert(s->state == SERVICE_RUNNING ||
               s->state == SERVICE_AUTO_RESTART ||
               s->state == SERVICE_EXITED);
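The two proposed changes can be checked against the same kind of simplified model. Again, all names are hypothetical and only mimic service.c; "live processes" stands in for the control group, the real stop procedure (ExecStop=, then signals) is collapsed into a single assignment, and MODEL_FAILURE_SIGNAL is an illustrative result for a service that was killed, whose process is assumed to be gone already when the stop job runs. It covers the two cases that matter here: a restart after a watchdog timeout and a restart after the service was killed.

```c
#include <assert.h>

/* Hypothetical, heavily simplified model; NOT systemd's real types. */
typedef enum {
        MODEL_RUNNING,
        MODEL_EXITED,
        MODEL_AUTO_RESTART,
        MODEL_DEAD,
} model_state;

typedef enum {
        MODEL_SUCCESS,
        MODEL_FAILURE_SIGNAL,
        MODEL_FAILURE_WATCHDOG,
} model_result;

typedef struct {
        model_state state;
        model_result result;
        int live_processes; /* processes of this service still running */
} model_service;

/* Patched service_stop() logic: the shortcut no longer applies after a
 * watchdog timeout, so the normal stop path must accept AUTO_RESTART. */
static void model_stop_patched(model_service *s) {
        if (s->state == MODEL_AUTO_RESTART &&
            s->result != MODEL_FAILURE_WATCHDOG) {
                s->state = MODEL_DEAD;
                return;
        }
        assert(s->state == MODEL_RUNNING ||
               s->state == MODEL_AUTO_RESTART ||
               s->state == MODEL_EXITED);
        s->live_processes = 0; /* stands in for the normal stop procedure */
        s->state = MODEL_DEAD;
}

/* Start job of the restart: spawns a fresh main process. */
static void model_start(model_service *s) {
        s->live_processes++;
        s->state = MODEL_RUNNING;
        s->result = MODEL_SUCCESS;
}

/* Runs the stop and start jobs of a restart for a service that entered
 * AUTO_RESTART with the given result and number of surviving processes.
 * Returns the number of processes alive afterwards. */
static int model_restart(model_result why, int survivors) {
        model_service s = { MODEL_AUTO_RESTART, why, survivors };

        model_stop_patched(&s);
        model_start(&s);
        return s.live_processes;
}
```

Both model_restart(MODEL_FAILURE_WATCHDOG, 1) (hung process still alive) and model_restart(MODEL_FAILURE_SIGNAL, 0) (process already killed) return 1, i.e. a single instance after the restart.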
 
I tested the following:
- the watchdog mechanism now actually stops/kills the service if it does not send the watchdog notification in time
- a restart triggered by a killed service works as before
 
Hopefully I didn’t miss any side effects of my changes.
 
 
Any opinions on my proposed changes?
 
Kind regards,
 
Marko Hoyer
 
 
 
 
 


_______________________________________________
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel
