Bug#887563: corosync prerm will stop pacemaker and not start it again
Nish Aravamudan writes:

>> Nishanth Aravamudan writes:
>>
>>> Now, neither of these actually fix the existing packages
>>> unfortunately, which will stop pacemaker on the upgrade to a fixed
>>> package and thus stop pacemaker. I have no idea if there actually is
>>> any way to fix this for existing packages, since the 'old' prerm
>>> will be invoked by dpkg on the upgrade path.
>
> in theory if a fix lands, it's the last time this happens

I have to make a correction here. The problem is not that Pacemaker is
stopped when Corosync is upgraded, but that Pacemaker is not started
after the Corosync upgrade is complete. So the old prerm stopping
Corosync is not a problem: the new postinst will *restart* Corosync
(even though it's stopped already), and the restart operation does start
Pacemaker again. A simple start operation does not, but a restart does.
I haven't decided yet whether this is a systemd bug, a quirk or a
feature.
--
Regards,
Feri
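The start-versus-restart observation above could be checked on a
disposable test node with a sequence like the one below. This is only a
sketch: the helper names `run` and `simulate_upgrade` and the `RUN`
variable are made up for illustration, and the script is dry-run by
default because stopping Corosync on a live cluster node can get the
node fenced. No outcome is asserted here; the last command in each
sequence is simply the state to compare.

```shell
#!/bin/sh
# Dry-run by default: commands are only printed unless RUN=1 is set.
# (RUN, run and simulate_upgrade are hypothetical names for this sketch.)
run() {
    if [ "${RUN:-0}" = 1 ]; then
        "$@"
    else
        echo "$*"
    fi
}

# Mimic an upgrade: the old prerm stops corosync, then the postinst
# performs the action given as $1 (either "start" or "restart").
simulate_upgrade() {
    action=$1
    run systemctl start pacemaker      # pulls in corosync via Requires=
    run systemctl stop corosync        # what the old prerm does
    run systemctl "$action" corosync   # what the postinst does
    run systemctl is-active pacemaker  # the state to compare afterwards
}

simulate_upgrade start
simulate_upgrade restart
```

Run it as is to see the command sequences, and with RUN=1 (as root, on
a throwaway node only) to actually execute them.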
Bug#887563: corosync prerm will stop pacemaker and not start it again
Nish Aravamudan writes:

> I spent some time reading the manpage myself and this is how I
> interpret the relevant section(s):
>
> Requires=
>     Configures requirement dependencies on other units. If this unit
>     gets activated, the units listed here will be activated as well.
>     ...
>
> This means, since pacemaker.service Requires=corosync.service, that
> when pacemaker is started, corosync is started (and, iirc, since
> pacemaker.service also has an After=corosync.service, systemd will
> start corosync.service first).

Agreed.

> This does not imply anything further

Not agreed (if you mean "Requires" under "this"). Version 232-25+deb9u2
of the systemd.unit man page continues with:

    If one of the other units gets deactivated or its activation fails,
    this unit will be deactivated.

> and in the default package configuration, pacemaker has a *hard*
> dependency on corosync (afaict).

Even more so: Pacemaker always has a hard dependency on Corosync in
Debian. We don't compile in support for any other messaging layers.
This means Pacemaker can't start without Corosync and exits immediately
if Corosync is stopped under it.

> Thus, the first line of the next paragraph in the manpage is relevant:
>
>     Note that this dependency type does not imply that the other unit
>     always has to be in active state when this unit is running

I couldn't find this text, but my understanding is that systemd won't do
anything if a Required unit suddenly stops on its own (no systemd job
here). This is what BindsTo adds to Requires, but in our case it does
not matter anyway, as Pacemaker immediately exits if Corosync stops.

This exit is not particularly pretty, though. The pacemakerd master
daemon notices the loss of Corosync connection and commences a regular
shutdown of the constituent daemons. However, most daemons themselves
also notice the loss of Corosync connection (or the exit of their peer
daemons) and exit with various failure codes. In the end, pacemakerd
seems to ignore these errors and exits successfully. This successful
exit is strange in my opinion, but at least it does not let the
Restart=on-failure setting of Pacemaker muddy the waters even more.

> PartOf=
>     Configures dependencies similar to Requires=, but limited to
>     stopping and restarting of units. When systemd stops or restarts
>     the units listed here, the action is propagated to this unit.
>
> So, actually, PartOf is *different* than Requires, and in the case of
> pacemaker and corosync's dependency relationship, helps reflect the
> other half of the requirements, but not the existing ones :)

I can't follow you here. My reading is that PartOf is less than
Requires, because starting a PartOf something does not start that
something itself. However, stopping that something stops all PartsOf
it, just as if it was Required.

> What we want to express (I believe) is:
>
> 1) Corosync can be started/stopped on its own
> 2) If pacemaker is started, corosync must be started

Yes. And if Corosync is stopped, Pacemaker must be stopped beforehand.

> 3) If corosync is restarted, pacemaker should be restarted

Rather: if Corosync is restarted, Pacemaker must be stopped beforehand
and started afterwards. Otherwise you'll get fenced.

> pacemaker.service Requires=corosync.service says "When pacemaker is
> started, corosync should be started. When corosync is stopped,
> pacemaker is stopped."

Yes. Though proper ordering is necessary as well, so Pacemaker needs an
After=corosync.service directive. With that in place, Pacemaker is
stopped before Corosync is restarted, and started afterwards. So this
achieves 1), 2) and my version of 3), which is exactly what I want.

> pacemaker.service PartOf=corosync.service says "When corosync is
> restarted, pacemaker is restarted. When corosync is stopped, pacemaker
> is stopped."

These are also true when Pacemaker Requires=corosync.service. But the
start constraint is not present with PartOf.

> pacemaker.service BindsTo=corosync.service says "Every state
> transition corosync goes through, pacemaker will also go through."

I'd rather say: BindsTo implies all constraints Requires does, and
another one, which is very different in nature: "that this unit is
stopped when any of the units listed suddenly disappears." Note that
all other dependencies are between systemd *jobs*, while this one
involves a *state*.
--
Regards,
Feri
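To compare Requires=, PartOf= and BindsTo= on a test node without
editing the packaged unit file, one option is a systemd drop-in for
pacemaker.service. The sketch below only writes such a file; the
function name `write_pacemaker_dropin` and the file name
`corosync-dependency.conf` are invented for this example, and it
assumes the usual drop-in location under /etc/systemd/system. Note that
dependency directives in a drop-in are added to those already in the
unit, so this does not remove the packaged Requires=.

```shell
#!/bin/sh
# Write a drop-in adding "<directive>=corosync.service" plus the ordering
# directive After=corosync.service to pacemaker.service.
#   $1: base directory (assumed: /etc/systemd/system on a live system)
#   $2: directive to try: Requires, PartOf or BindsTo
# (Function and file names are hypothetical, chosen for this sketch.)
write_pacemaker_dropin() {
    base_dir=$1
    directive=$2
    case "$directive" in
        Requires|PartOf|BindsTo) ;;
        *) echo "unsupported directive: $directive" >&2; return 1 ;;
    esac
    dropin_dir="$base_dir/pacemaker.service.d"
    mkdir -p "$dropin_dir" || return 1
    cat > "$dropin_dir/corosync-dependency.conf" <<EOF
[Unit]
$directive=corosync.service
After=corosync.service
EOF
}

# Usage on a test node (as root), followed by a reload of unit files:
#   write_pacemaker_dropin /etc/systemd/system BindsTo
#   systemctl daemon-reload
```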
Bug#887563: corosync prerm will stop pacemaker and not start it again
Hi Ferenc!

On Fri, Apr 20, 2018 at 7:59 AM, Ferenc Wágner wrote:
> Control: fixed -1 2.4.4-1
>
> Nishanth Aravamudan writes:
>
>> I believe this is because the prerm of corosync.service [...]
>> unconditionally stops corosync for all Debian and Ubuntu releases
>> (as the init script is installed even if unused by systemd). When
>> corosync stops, pacemaker fails to connect to corosync (and the
>> pacemaker systemd unit file specifies that pacemaker Requires corosync)
>> and also stops.
>>
>> When the postinst for corosync runs [...] corosync will start, but
>> there is no connection between corosync starting (systemd or SysV) and
>> pacemaker.
>
> Right.
>
>> I think there are two necessary changes to the packaging/upstream to fix
>> this:
>>
>> 1) The systemd unit file should indicate pacemaker is PartOf corosync,
>> which will propagate restarts of corosync to pacemaker. This will also
>> propagate stops, but as mentioned above, pacemaker already stops when
>> corosync stops, so I think it's harmless.
>
> How would this help? Currently pacemaker.service Requires
> corosync.service, which is a stronger (stricter) constraint than PartOf
> would be if I read systemd.unit(5) correctly.

You are right, and I'm sorry for not updating the Debian bug sooner --
we ended up moving to "BindsTo" not "PartOf" to resolve this in Ubuntu.

I spent some time reading the manpage myself and this is how I
interpret the relevant section(s):

Requires=
    Configures requirement dependencies on other units. If this unit
    gets activated, the units listed here will be activated as well.
    ...

This means, since pacemaker.service Requires=corosync.service, that
when pacemaker is started, corosync is started (and, iirc, since
pacemaker.service also has an After=corosync.service, systemd will
start corosync.service first). This does not imply anything further,
though, and in the default package configuration, pacemaker has a
*hard* dependency on corosync (afaict). Thus, the first line of the
next paragraph in the manpage is relevant:

    Note that this dependency type does not imply that the other unit
    always has to be in active state when this unit is running

This section also mentions the use of BindsTo=, however that only
affects the stopping of units, per the manpage.

Finally, from PartOf:

PartOf=
    Configures dependencies similar to Requires=, but limited to
    stopping and restarting of units. When systemd stops or restarts
    the units listed here, the action is propagated to this unit.

So, actually, PartOf is *different* than Requires, and in the case of
pacemaker and corosync's dependency relationship, helps reflect the
other half of the requirements, but not the existing ones :)

What we want to express (I believe) is:

1) Corosync can be started/stopped on its own
2) If pacemaker is started, corosync must be started
3) If corosync is restarted, pacemaker should be restarted

pacemaker.service Requires=corosync.service says "When pacemaker is
started, corosync should be started. When corosync is stopped,
pacemaker is stopped."

pacemaker.service PartOf=corosync.service says "When corosync is
restarted, pacemaker is restarted. When corosync is stopped, pacemaker
is stopped."

pacemaker.service BindsTo=corosync.service says "Every state
transition corosync goes through, pacemaker will also go through."

>> Additionally, the SysV init file should be updated to check if the
>> pacemaker SysV status was running before stopping corosync in the
>> restart path and start pacemaker as well after starting corosync.
>
> I don't intend to go there. If you stop Corosync under Pacemaker,
> Pacemaker will fail and the node will be fenced. Systemd helps with
> this by cleanly stopping Pacemaker (and any other service declaring a
> Requires relation to Corosync) beforehand; SysV init has no comparable
> mechanisms. And you can't expect the Corosync init script to take care
> of all possible dependent services (Pacemaker, DLM, cLVM,
> corosync-notifyd, whatever). This is part of the reason why I don't
> really support SysV init in the HA stack.

Yeah, I'm fine with this; I only mentioned it for completeness wrt. the
ordering.

>> 2) d/rules should call dh_installinit with --restart-after-upgrade. This
>> is the default in compat >= 10 (2.4.2-3 is still at 9). That will change
>> the prerm and postinst to not stop/start the service on upgrade, but
>> simply restart it in the postinst (removals will still stop the
>> service).
>
> Corosync 2.4.4-1 has switched to compat 11, so this is done.

Great!

>> Now, neither of these actually fix the existing packages unfortunately,
>> which will stop pacemaker on the upgrade to a fixed package and thus
>> stop pacemaker. I have no idea if there actually is any way to fix this
>> for existing packages, since the 'old' prerm will be invoked by dpkg on
>> the upgrade path.
>
> I
Bug#887563: corosync prerm will stop pacemaker and not start it again
Control: fixed -1 2.4.4-1

Nishanth Aravamudan writes:

> I believe this is because the prerm of corosync.service [...]
> unconditionally stops corosync for all Debian and Ubuntu releases
> (as the init script is installed even if unused by systemd). When
> corosync stops, pacemaker fails to connect to corosync (and the
> pacemaker systemd unit file specifies that pacemaker Requires corosync)
> and also stops.
>
> When the postinst for corosync runs [...] corosync will start, but
> there is no connection between corosync starting (systemd or SysV) and
> pacemaker.

Right.

> I think there are two necessary changes to the packaging/upstream to fix
> this:
>
> 1) The systemd unit file should indicate pacemaker is PartOf corosync,
> which will propagate restarts of corosync to pacemaker. This will also
> propagate stops, but as mentioned above, pacemaker already stops when
> corosync stops, so I think it's harmless.

How would this help? Currently pacemaker.service Requires
corosync.service, which is a stronger (stricter) constraint than PartOf
would be if I read systemd.unit(5) correctly.

> Additionally, the SysV init file should be updated to check if the
> pacemaker SysV status was running before stopping corosync in the
> restart path and start pacemaker as well after starting corosync.

I don't intend to go there. If you stop Corosync under Pacemaker,
Pacemaker will fail and the node will be fenced. Systemd helps with
this by cleanly stopping Pacemaker (and any other service declaring a
Requires relation to Corosync) beforehand; SysV init has no comparable
mechanisms. And you can't expect the Corosync init script to take care
of all possible dependent services (Pacemaker, DLM, cLVM,
corosync-notifyd, whatever). This is part of the reason why I don't
really support SysV init in the HA stack.

> 2) d/rules should call dh_installinit with --restart-after-upgrade. This
> is the default in compat >= 10 (2.4.2-3 is still at 9). That will change
> the prerm and postinst to not stop/start the service on upgrade, but
> simply restart it in the postinst (removals will still stop the
> service).

Corosync 2.4.4-1 has switched to compat 11, so this is done.

> Now, neither of these actually fix the existing packages unfortunately,
> which will stop pacemaker on the upgrade to a fixed package and thus
> stop pacemaker. I have no idea if there actually is any way to fix this
> for existing packages, since the 'old' prerm will be invoked by dpkg on
> the upgrade path.

I don't find this too serious a problem. Inconvenient, yes, but if
you're running Corosync, then you probably have a highly available
setup where even a prolonged node outage does not lead to service
interruption. Your monitoring system delivers a warning, you start
Pacemaker or reboot and everything is back to normal.

Anders Kaseorg writes:

> This just bit me on a Stretch cluster when upgrading corosync from 2.4.2-3
> to 2.4.2-3+deb9u1. Marking as such.

I really should have put a warning about this into the DSA.

> Please apply the suggested fixes as soon as possible.

See above; I'm really not sure about fixing this in stable. Changing
the restart behavior would be possible, but doing an update just for
this would be silly, because the old prerm would stop Corosync for one
last time anyway.
--
Regards,
Feri
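For the "monitoring system delivers a warning" part, a check along the
following lines might be enough to catch the state this bug leaves
behind (Corosync running, Pacemaker not). It is a minimal sketch, not
taken from any existing monitoring plugin: the function name
`check_ha_stack` and the message format are assumptions, and the two
arguments are expected to be the strings printed by
`systemctl is-active`.

```shell
#!/bin/sh
# Compare the two unit states and warn when corosync runs without pacemaker.
#   $1: state of corosync.service  (e.g. "active", "inactive", "failed")
#   $2: state of pacemaker.service
# Returns 1 in the warning case, 0 otherwise.
# (check_ha_stack is a hypothetical name used only in this sketch.)
check_ha_stack() {
    corosync_state=$1
    pacemaker_state=$2
    if [ "$corosync_state" = active ] && [ "$pacemaker_state" != active ]; then
        echo "WARNING: corosync is active but pacemaker is $pacemaker_state"
        return 1
    fi
    echo "OK: corosync=$corosync_state pacemaker=$pacemaker_state"
    return 0
}

# Usage on a systemd node:
#   check_ha_stack "$(systemctl is-active corosync)" \
#                  "$(systemctl is-active pacemaker)"
```

Keeping the decision separate from the systemctl calls makes the logic
easy to try out with hand-written states before wiring it into a real
check.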
Bug#887563: corosync prerm will stop pacemaker and not start it again
Control: found 887563 2.4.2-3
Control: severity 887563 important

This just bit me on a Stretch cluster when upgrading corosync from
2.4.2-3 to 2.4.2-3+deb9u1. Marking as such. Please apply the suggested
fixes as soon as possible.

Anders
Bug#887563: corosync prerm will stop pacemaker and not start it again
Source: corosync
Severity: normal

Dear Maintainer,

We have a report in Ubuntu,
https://bugs.launchpad.net/charm-hacluster/+bug/1740892, which I believe
is reproducible in Debian Sid as well.

In particular, I set up a Sid LXD:

# apt install corosync pacemaker
...
# systemctl status corosync pacemaker
● corosync.service - Corosync Cluster Engine
...
   Active: active (running) since Wed 2018-01-17 23:14:56 UTC; 9s ago
...
● pacemaker.service - Pacemaker High Availability Cluster Manager
...
   Active: active (running) since Wed 2018-01-17 23:15:00 UTC; 5s ago

# apt install --reinstall corosync
...
# systemctl status corosync pacemaker
● corosync.service - Corosync Cluster Engine
...
   Active: active (running) since Wed 2018-01-17 23:15:23 UTC; 3s ago
● pacemaker.service - Pacemaker High Availability Cluster Manager
...
   Active: inactive (dead) since Wed 2018-01-17 23:15:22 UTC; 4s ago

I believe this is because the prerm of corosync.service has

    # Automatically added by dh_installinit
    if [ -x "/etc/init.d/corosync" ]; then
        invoke-rc.d corosync stop || exit $?
    fi

which unconditionally stops corosync for all Debian and Ubuntu releases
(as the init script is installed even if unused by systemd). When
corosync stops, pacemaker fails to connect to corosync (and the
pacemaker systemd unit file specifies that pacemaker Requires corosync)
and also stops.

When the postinst for corosync runs:

    if [ "$1" = "configure" ] || [ "$1" = "abort-upgrade" ]; then
        if [ -x "/etc/init.d/corosync" ]; then
            update-rc.d corosync defaults >/dev/null
            invoke-rc.d corosync start || exit $?
        fi
    fi

corosync will start, but there is no connection between corosync
starting (systemd or SysV) and pacemaker.

I think there are two necessary changes to the packaging/upstream to fix
this:

1) The systemd unit file should indicate pacemaker is PartOf corosync,
which will propagate restarts of corosync to pacemaker. This will also
propagate stops, but as mentioned above, pacemaker already stops when
corosync stops, so I think it's harmless.

Additionally, the SysV init file should be updated to check if the
pacemaker SysV status was running before stopping corosync in the
restart path and start pacemaker as well after starting corosync.

2) d/rules should call dh_installinit with --restart-after-upgrade. This
is the default in compat >= 10 (2.4.2-3 is still at 9). That will change
the prerm and postinst to not stop/start the service on upgrade, but
simply restart it in the postinst (removals will still stop the
service).

Now, neither of these actually fix the existing packages unfortunately,
which will stop pacemaker on the upgrade to a fixed package and thus
stop pacemaker. I have no idea if there actually is any way to fix this
for existing packages, since the 'old' prerm will be invoked by dpkg on
the upgrade path.

-- System Information:
Debian Release: buster/sid
  APT prefers bionic
  APT policy: (500, 'bionic')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 4.13.0-25-generic (SMP w/4 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8), LANGUAGE=en_US:en (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

--
Nishanth Aravamudan
Ubuntu Server
Canonical Ltd
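The behavior change described in item 2) can be sketched as two small
decision functions. This is an illustration of the description in the
report (no stop on upgrade, restart in the postinst, stop only on
removal), not the snippets debhelper actually generates; the function
names `postinst_action` and `prerm_action` are invented here. The
arguments follow the usual maintainer-script convention: $1 is the
action dpkg passes in, and for `configure` $2 is the previously
configured version, which is empty on a fresh install.

```shell
#!/bin/sh
# Decide what the postinst should do with the service.
#   $1: maintainer-script action ("configure", "abort-upgrade", ...)
#   $2: version argument passed by dpkg (empty on a fresh install)
# (postinst_action and prerm_action are hypothetical names.)
postinst_action() {
    case "$1" in
        configure|abort-upgrade)
            if [ -n "$2" ]; then
                echo restart   # upgrade: one restart instead of stop + start
            else
                echo start     # fresh install: nothing is running yet
            fi
            ;;
        *)
            echo none
            ;;
    esac
}

# Decide what the prerm should do: stop only when the package is removed.
prerm_action() {
    if [ "$1" = remove ]; then
        echo stop
    else
        echo none              # upgrade: leave the service running
    fi
}

# Sketch of use inside a postinst:
#   action=$(postinst_action "$1" "$2")
#   [ "$action" = none ] || invoke-rc.d corosync "$action" || exit 1
```

With this shape, an upgrade from an already fixed package never passes
through a separate stop followed by a plain start, which is the
sequence that leaves Pacemaker down.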