I'm seeing a problem where a service sometimes fails to start because its 
cgroup is missing.
After some debugging I've made the following observations:

After exec_spawn() forks, the child sets the sticky bit on the cgroup (in 
cg_set_task_access()), but sometimes the cgroup is already gone (lstat() 
returns "No such file or directory").

The cgroup is always created, but the main process calls cg_trim() (via 
cgroup_bonding_trim <- cgroup_bonding_trim_list <- cgroup_notify_empty <- 
private_bus_message_filter ...), which removes the cgroup if the sticky bit 
isn't set.

This looks like a race condition. If the child sets the sticky bit first, the 
parent leaves the cgroup alone. But if the main process reaches cg_trim() 
first, the cgroup is removed and the child fails.
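To make the two orderings concrete, here is a minimal, self-contained sketch 
of the check-then-remove race as I understand it. It does not use systemd or 
cgroupfs: a temporary directory stands in for the cgroup, and the helpers 
claim() and trim_if_unclaimed() are hypothetical stand-ins that only mimic 
what cg_set_task_access() and cg_trim() appear to do with the sticky bit. A 
pipe forces each ordering so both outcomes can be reproduced deterministically.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for cg_trim(): remove the directory unless the
 * sticky bit says it has been claimed. Returns 1 if it was removed. */
static int trim_if_unclaimed(const char *path) {
    struct stat st;
    if (lstat(path, &st) < 0)
        return -errno;
    if (st.st_mode & S_ISVTX)
        return 0;                          /* claimed: leave it alone */
    return rmdir(path) < 0 ? -errno : 1;
}

/* Hypothetical stand-in for cg_set_task_access(): set the sticky bit.
 * Fails with -ENOENT if the directory was already trimmed. */
static int claim(const char *path) {
    struct stat st;
    if (lstat(path, &st) < 0)
        return -errno;
    if (chmod(path, (st.st_mode & 07777) | S_ISVTX) < 0)
        return -errno;
    return 0;
}

/* One fork with the ordering forced through a pipe.
 * trim_first != 0: the parent trims before the child claims.
 * Returns the child's exit code: 0 = claim succeeded, 1 = claim failed. */
static int run_trial(int trim_first) {
    char path[] = "/tmp/cgrace.XXXXXX";
    int fds[2], status = 0;
    char token = 'x';
    pid_t pid;

    if (mkdtemp(path) == NULL)
        return -1;
    if (pipe(fds) < 0) {
        rmdir(path);
        return -1;
    }
    pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        rmdir(path);
        return -1;
    }
    if (pid == 0) {                        /* child, like exec_spawn()'s child */
        close(fds[trim_first ? 1 : 0]);    /* keep only the end we use */
        if (trim_first && read(fds[0], &token, 1) != 1)
            _exit(2);                      /* wait until the parent trimmed */
        int r = claim(path);
        if (!trim_first && write(fds[1], &token, 1) != 1)
            _exit(2);                      /* tell the parent we claimed */
        _exit(r == 0 ? 0 : 1);
    }

    /* parent, like the manager handling the "cgroup empty" notification */
    close(fds[trim_first ? 0 : 1]);
    if (!trim_first && read(fds[0], &token, 1) != 1) {
        /* child exited without signalling; status is collected below */
    }
    trim_if_unclaimed(path);
    if (trim_first && write(fds[1], &token, 1) != 1) {
        /* child will see EOF once the write end is closed below */
    }
    close(fds[trim_first ? 1 : 0]);        /* close before waiting */
    waitpid(pid, &status, 0);
    rmdir(path);                           /* clean up if it survived */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

run_trial(1) models the failing boot (the trim wins and the child's lstat() 
gets ENOENT), while run_trial(0) models the normal case (the sticky bit is set 
first and the trim leaves the directory alone). Note that claim() itself has 
a small window between lstat() and chmod(), just like any check-then-act pair.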

We're using systemd 197. I've tried 198, but there the child dies with 
SIGSEGV, so it's harder to debug what's happening.
The problem appeared when we switched from Linux 3.4 to 3.7, but since this 
looks like a race in systemd, I'm not sure whether our local kernel tree is to 
blame or whether the version bump just changed the timing enough to trigger 
the race.

Since I'm not familiar with systemd internals or cgroups, I would appreciate 
some help resolving this.

I can reproduce this pretty easily, usually within 5-10 boots. It's always the 
same service that fails, and the services started before it never fail.

/Anders
_______________________________________________
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel
