Hey, I've found a tricky race condition in the startup of kdbusaddons and I need some thoughts on how to deal with it....
Let's assume binary foo is a dbus unique service on org.kde.foo that uses kcrash with autorestart enabled: - foo1 gets started - registers org.kde.foo - runs normally - encounters segfault - kcrash sig handler starts up - restarts service via kdeinit or qprocess as appropriate - starts drkonqi ^ it never releases the org.kde.foo service name. Furthermore I am not sure it ever could, the only point at which it knows to unregister is in the sig handler and in there it's not save to unregister the name (surely at the very least it'd have potential to mangle the stack) - foo2 starts up - registers org.kde.foo - fails (org.kde.foo is still claimed by foo1) - exists assuming the service is already running dbus-daemon however seems to correctly detect that foo1 is defunct and after a while will forcefully take away the service names, thus making it available for registration again. As examples both plasmashell and powerdevil are unique auto-restartable services. But, plasmashell (at least on neon) indeed has just enough delay in the startup that in the autorestart scenario the well-known unique names are already made available again by dbus-daemon. Powerdevil on the other hand is getting to the registration request so quickly that the (now-crashed) previous instance still holds the name and service registration fails. We can see as much when dbus-monitoring the session [1]. This ultimately results in the powerdevil daemon failing to restart because it fails to acquire its unique service name and as a result exits with the assumption that the service is already running. A reliable way for me to reproduce it is to start powerdevil and then kill SEGV it. I do see a number of ways we can deal with this but they are all a bit meh. # Queue registration kdbusaddons attempts a queued registration with *some* timeout. If a unique service gets started and fails to register it does it's activation dance (thinking it is already running) if the activation dance also fails (because the running service is unresponsive) the registration is attempted with some timeout so as to possibly reclaim the name once dbus-daemon freed it. an example is in [2]. Notable disadvantages with this is that we need a blocking eventloop and it'll still be always racey unless we make the timeout very long. I am not sure a long timeout is necessarily a problem at this point, at the very least it may need controlling from the outside since failed registration doesn't necessarily equal exit(). Could possibly do away with the eventloop by doing a for loop with a sleep(). # Replacement registration on crash kcrash already sets an environment variable on auto-restarted instances to indicate that they were restarted. We could check the variable and if register with ReplaceExistingService. I am not sure that actually works how I think it does though, but I think this basically sets a hint on the daemon that the service name may be taken away by another process. Disadvantage is that restarts caused by something other than kcrash would not necessarily set the variable and still suffer from a race (e.g. custom sig handler instead of kcrash). # Reshuffle kcrash's restart order Currently kcrash does the autorestart as one of the first things, even before starting drkonqi. By delaying this a bit we may just be able to get enough time out everything to not race. Granted I do not know why it does the autostart so early, there may well be a reason for that. I didn't quite manage to figure out how dbus-daemon knows that the process is "lost" but I am somewhat confident that it has to do with the FD closing. So, closing earlier might help too. # Reshuffled long queue on crash Would combine elements of all previous points where Unique service registration will wait for a bit to successfully register, and wait for a much longer time if the kcrash env var for autorestart was set and by reshuffling kcrash's order the overall delay between startup and name becoming available may be super small. Only likely downside with this is the blocking qeventloop for queued registration. # Drkonqi sledgehammer When drkonqi is available, and run, the autorestart could be delayed until drkonqi quits. This can be a very long time and drkonqi may not be available. So, not exactly a reliable solution. [1] https://phabricator.kde.org/P324 [2] https://phabricator.kde.org/P325