Bug#887563: corosync prerm will stop pacemaker and not start it again

2018-04-24 Thread Ferenc Wágner
Nish Aravamudan  writes:

>> Nishanth Aravamudan  writes:
>>
>>> Now, neither of these actually fixes the existing packages,
>>> unfortunately: they will stop corosync on the upgrade to a fixed
>>> package and thus stop pacemaker. I have no idea if there actually is
>>> any way to fix this for existing packages, since the 'old' prerm
>>> will be invoked by dpkg on the upgrade path.
>
> In theory, if a fix lands, it's the last time this happens.

I have to make a correction here.  The problem is not that Pacemaker is
stopped when Corosync is upgraded, but that Pacemaker is not started
after the Corosync upgrade is complete.  So the old prerm stopping
Corosync is not a problem: the new postinst will *restart* Corosync
(even though it's stopped already), and the restart operation does start
Pacemaker again.  A simple start operation does not, but a restart does.
I haven't decided yet whether this is a systemd bug, a quirk or a
feature.
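
A minimal sketch for reproducing the start/restart difference described
above. RUN and check_after are hypothetical names introduced here. With
RUN left at its default of echo the commands are only printed; setting
RUN="" would execute them for real, which should only be done as root on
a test node.

```shell
#!/bin/sh
# Dry run by default: every command is printed instead of executed.
# RUN and check_after are illustrative names, not part of any package.
RUN="${RUN-echo}"

check_after() {
    # $1: the systemctl verb applied to the stopped corosync.service
    $RUN systemctl start pacemaker.service    # pulls in corosync via Requires=
    $RUN systemctl stop corosync.service      # roughly what the old prerm does
    $RUN systemctl "$1" corosync.service      # "start" or "restart"
    $RUN systemctl is-active pacemaker.service
}

check_after start
check_after restart
```

Going by the observation above, the final is-active would be expected to
report inactive after a plain start and active after a restart; the
sketch only prints or runs the commands and does not assert either
outcome.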
-- 
Regards,
Feri



Bug#887563: corosync prerm will stop pacemaker and not start it again

2018-04-21 Thread Ferenc Wágner
Nish Aravamudan  writes:

> I spent some time reading the manpage myself and this is how I
> interpret the relevant section(s):
>
>  Requires=
>    Configures requirement dependencies on other units. If this unit
>    gets activated, the units listed here will be activated as well.
> ...
>
> This means, since pacemaker.service Requires=corosync.service, that
> when pacemaker is started, corosync is started (and, iirc, since
> pacemaker.service also has an After=corosync.service, systemd will
> start corosync.service first).

Agreed.

> This does not imply anything further

Not agreed (if you mean "Requires" under "this").  Version 232-25+deb9u2
of the systemd.unit man page continues with:

  If one of the other units gets deactivated or its activation fails,
  this unit will be deactivated.

> and in the default package configuration, pacemaker has a *hard*
> dependency on corosync (afaict).

Even: Pacemaker always has a hard dependency on Corosync in Debian.  We
don't compile in support for any other messaging layers.  This means
Pacemaker can't start without Corosync and exits immediately if Corosync
is stopped under it.

> Thus, the first line of the next paragraph in the manpage is relevant:
>
>   Note that this dependency type does not imply that the other unit
>   always has to be in active state when this unit is running

I couldn't find this text, but my understanding is that systemd won't do
anything if a Required unit suddenly stops on its own (no systemd job
here).  This is what BindsTo adds to Requires, but in our case it does
not matter anyway, as Pacemaker immediately exits if Corosync stops.

This exit is not particularly pretty, though.  The pacemakerd master
daemon notices the loss of Corosync connection and commences a regular
shutdown of the constituent daemons.  However, most daemons themselves
also notice the loss of Corosync connection (or the exit of their peer
daemons) and exit with various failure codes.  In the end, pacemakerd
seems to ignore these errors and exits successfully.

This successful exit is strange in my opinion, but at least it does not
let the Restart=on-failure setting of Pacemaker muddy the waters even
more.

>   PartOf=
>     Configures dependencies similar to Requires=, but limited to
>     stopping and restarting of units.  When systemd stops or restarts
>     the units listed here, the action is propagated to this unit.
>
> So, actually, PartOf is *different* than Requires, and in the case of
> pacemaker and corosync's dependency relationship, helps reflect the
> other half of the requirements, but not the existing ones :)

I can't follow you here.  My reading is that PartOf is weaker than
Requires, because starting a unit that is PartOf another does not start
that other unit.  However, stopping the other unit stops everything that
is PartOf it, just as if it were Required.

> What we want to express (I believe) is:
>
> 1) Corosync can be started/stopped on its own
> 2) If pacemaker is started, corosync must be started

Yes.  And if Corosync is stopped, Pacemaker must be stopped beforehand.

> 3) If corosync is restarted, pacemaker should be restarted

Rather: if Corosync is restarted, Pacemaker must be stopped beforehand
and started afterwards.  Otherwise you'll get fenced.

> pacemaker.service Requires=corosync.service says "When pacemaker is
> started, corosync should be started. When corosync is stopped,
> pacemaker is stopped."

Yes.  Though proper ordering is necessary as well, so Pacemaker needs an
After=corosync.service directive.  With that in place, Pacemaker is
stopped before Corosync is restarted, and started afterwards.  So this
achieves 1), 2) and my version of 3), which is exactly what I want.
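
The Requires= plus After= combination discussed above could be written
as a unit drop-in along these lines. This is only a sketch: per this
thread the packaged pacemaker.service already carries both directives,
the file name corosync-deps.conf is made up, and DROPIN_DIR is a
hypothetical variable that defaults to a scratch directory rather than a
real systemd path.

```shell
#!/bin/sh
# DROPIN_DIR is a hypothetical knob; the default keeps the sketch harmless.
DROPIN_DIR="${DROPIN_DIR:-./pacemaker.service.d}"
mkdir -p "$DROPIN_DIR"

# Requires= pulls corosync in when pacemaker starts and propagates stop and
# restart jobs of corosync to pacemaker; After= orders the two units.
cat > "$DROPIN_DIR/corosync-deps.conf" <<'EOF'
[Unit]
Requires=corosync.service
After=corosync.service
EOF

cat "$DROPIN_DIR/corosync-deps.conf"
```

On a real system the file would live under
/etc/systemd/system/pacemaker.service.d/ and take effect after
"systemctl daemon-reload".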

> pacemaker.service PartOf=corosync.service says "When corosync is
> restarted, pacemaker is restarted. When corosync is stopped, pacemaker
> is stopped."

These are also true when Pacemaker Requires=corosync.service.  But the
start constraint is not present with PartOf.

> pacemaker.service BindsTo=corosync.service says "Every state
> transition corosync goes through, pacemaker will also go through."

I'd rather say: BindsTo implies all constraints Requires does, and
another one, which is very different in nature: "that this unit is
stopped when any of the units listed suddenly disappears."  Note that
all other dependencies are between systemd *jobs*, while this one
involves a *state*.
-- 
Regards,
Feri



Bug#887563: corosync prerm will stop pacemaker and not start it again

2018-04-20 Thread Nish Aravamudan
Hi Ferenc!

On Fri, Apr 20, 2018 at 7:59 AM, Ferenc Wágner  wrote:
> Control: fixed -1 2.4.4-1
>
> Nishanth Aravamudan  writes:
>
>> I believe this is because the prerm of corosync.service [...]
>> unconditionally stops corosync for all Debian and Ubuntu releases
>> (as the init script is installed even if unused by systemd). When
>> corosync stops, pacemaker fails to connect to corosync (and the
>> pacemaker systemd unit file specifies that pacemaker Requires corosync)
>> and also stops.
>>
>> When the postinst for corosync runs [...] corosync will start, but
>> there is no connection between corosync starting (systemd or SysV) and
>> pacemaker.
>
> Right.
>
>> I think there are two necessary changes to the packaging/upstream to fix
>> this:
>>
>> 1) The systemd unit file should indicate pacemaker is PartOf corosync,
>> which will propagate restarts of corosync to pacemaker. This will also
>> propagate stops, but as mentioned above, pacemaker already stops when
>> corosync stops, so I think it's harmless.
>
> How would this help?  Currently pacemaker.service Requires
> corosync.service, which is a stronger (stricter) constraint than PartOf
> would be if I read systemd.unit(5) correctly.

You are right, and I'm sorry for not updating the Debian bug sooner --
we ended up moving to "BindsTo" not "PartOf" to resolve this in
Ubuntu.

I spent some time reading the manpage myself and this is how I
interpret the relevant section(s):

 Requires=
   Configures requirement dependencies on other units. If this unit
   gets activated, the units listed here will be activated as well.
...

This means, since pacemaker.service Requires=corosync.service, that
when pacemaker is started, corosync is started (and, iirc, since
pacemaker.service also has an After=corosync.service, systemd will
start corosync.service first).

This does not imply anything further, though, and in the default
package configuration, pacemaker has a *hard* dependency on corosync
(afaict). Thus, the first line of the next paragraph in the manpage is
relevant:

  Note that this dependency type does not imply that the other unit
  always has to be in active state when this unit is running

This section also mentions the use of BindsTo=, however that only
affects the stopping of units, per the manpage.

Finally, from PartOf:

   PartOf=
     Configures dependencies similar to Requires=, but limited to
     stopping and restarting of units.  When systemd stops or restarts
     the units listed here, the action is propagated to this unit.

So, actually, PartOf is *different* than Requires, and in the case of
pacemaker and corosync's dependency relationship, helps reflect the
other half of the requirements, but not the existing ones :)

What we want to express (I believe) is:

1) Corosync can be started/stopped on its own
2) If pacemaker is started, corosync must be started
3) If corosync is restarted, pacemaker should be restarted

pacemaker.service Requires=corosync.service says "When pacemaker is
started, corosync should be started. When corosync is stopped,
pacemaker is stopped."
pacemaker.service PartOf=corosync.service says "When corosync is
restarted, pacemaker is restarted. When corosync is stopped, pacemaker
is stopped."
pacemaker.service BindsTo=corosync.service says "Every state
transition corosync goes through, pacemaker will also go through."

>> Additionally, the SysV init file should be updated to check if the
>> pacemaker SysV status was running before stopping corosync in the
>> restart path and start pacemaker as well after starting corosync.
>
> I don't intend to go there.  If you stop Corosync under Pacemaker,
> Pacemaker will fail and the node will be fenced.  Systemd helps with
> this by cleanly stopping Pacemaker (and any other service declaring a
> Requires relation to Corosync) beforehand; SysV init has no comparable
> mechanisms.  And you can't expect the Corosync init script to take care
> of all possible dependent services (Pacemaker, DLM, cLVM, corosync-notifyd,
> whatever).  This is part of the reason why I don't really support SysV
> init in the HA stack.

Yeah, I'm fine with this; I only mentioned it for completeness wrt.
the ordering.

>> 2) d/rules should call dh_installinit with --restart-after-upgrade. This
>> is the default in compat >= 10 (2.4.2-3 is still at 9). That will change
>> the prerm and postinst to not stop/start the service on upgrade, but
>> simply restart it in the postinst (removals will still stop the
>> service).
>
> Corosync 2.4.4-1 has switched to compat 11, so this is done.

Great!

>> Now, neither of these actually fixes the existing packages,
>> unfortunately: they will stop corosync on the upgrade to a fixed
>> package and thus stop pacemaker. I have no idea if there actually is
>> any way to fix this for existing packages, since the 'old' prerm will
>> be invoked by dpkg on the upgrade path.
>
> I 

Bug#887563: corosync prerm will stop pacemaker and not start it again

2018-04-20 Thread Ferenc Wágner
Control: fixed -1 2.4.4-1

Nishanth Aravamudan  writes:

> I believe this is because the prerm of corosync.service [...]
> unconditionally stops corosync for all Debian and Ubuntu releases
> (as the init script is installed even if unused by systemd). When
> corosync stops, pacemaker fails to connect to corosync (and the
> pacemaker systemd unit file specifies that pacemaker Requires corosync)
> and also stops.
>
> When the postinst for corosync runs [...] corosync will start, but
> there is no connection between corosync starting (systemd or SysV) and
> pacemaker.

Right.

> I think there are two necessary changes to the packaging/upstream to fix
> this:
>
> 1) The systemd unit file should indicate pacemaker is PartOf corosync,
> which will propagate restarts of corosync to pacemaker. This will also
> propagate stops, but as mentioned above, pacemaker already stops when
> corosync stops, so I think it's harmless.

How would this help?  Currently pacemaker.service Requires
corosync.service, which is a stronger (stricter) constraint than PartOf
would be if I read systemd.unit(5) correctly.

> Additionally, the SysV init file should be updated to check if the
> pacemaker SysV status was running before stopping corosync in the
> restart path and start pacemaker as well after starting corosync.

I don't intend to go there.  If you stop Corosync under Pacemaker,
Pacemaker will fail and the node will be fenced.  Systemd helps with
this by cleanly stopping Pacemaker (and any other service declaring a
Requires relation to Corosync) beforehand; SysV init has no comparable
mechanisms.  And you can't expect the Corosync init script to take care
of all possible dependent services (Pacemaker, DLM, cLVM, corosync-notifyd,
whatever).  This is part of the reason why I don't really support SysV
init in the HA stack.

> 2) d/rules should call dh_installinit with --restart-after-upgrade. This
> is the default in compat >= 10 (2.4.2-3 is still at 9). That will change
> the prerm and postinst to not stop/start the service on upgrade, but
> simply restart it in the postinst (removals will still stop the
> service).

Corosync 2.4.4-1 has switched to compat 11, so this is done.

> Now, neither of these actually fixes the existing packages,
> unfortunately: they will stop corosync on the upgrade to a fixed
> package and thus stop pacemaker. I have no idea if there actually is
> any way to fix this for existing packages, since the 'old' prerm will
> be invoked by dpkg on the upgrade path.

I don't find this too serious a problem.  Inconvenient, yes, but if
you're running Corosync, then you probably have a highly available setup
where even a prolonged node outage does not lead to service interruption.
Your monitoring system delivers a warning, you start Pacemaker or reboot
and everything is back to normal.
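
The manual recovery mentioned above can be sketched as a small check.
recover_pacemaker and RUN are hypothetical names; the function takes the
two unit states as arguments (in the form printed by "systemctl
is-active") so that the decision logic can be tried without a cluster,
and RUN defaults to echo so nothing is actually started.

```shell
#!/bin/sh
# RUN defaults to echo (dry run); set RUN="" to really start the unit.
RUN="${RUN-echo}"

recover_pacemaker() {
    # $1: state of corosync.service, $2: state of pacemaker.service,
    # both in the form printed by `systemctl is-active`
    if [ "$1" = "active" ] && [ "$2" != "active" ]; then
        $RUN systemctl start pacemaker.service
    fi
}

# State right after the problematic upgrade: corosync up, pacemaker down.
recover_pacemaker active inactive
```

On a live node the arguments would come from the real unit states, e.g.
recover_pacemaker "$(systemctl is-active corosync.service)"
"$(systemctl is-active pacemaker.service)".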

Anders Kaseorg  writes:

> This just bit me on a Stretch cluster when upgrading corosync from 2.4.2-3 
> to 2.4.2-3+deb9u1.  Marking as such.

I really should have put a warning about this into the DSA.

> Please apply the suggested fixes as soon as possible.

See above; I'm really not sure about fixing this in stable.  Changing
the restart behavior would be possible, but doing an update just for
this would be silly, because the old prerm would stop Corosync for one
last time anyway.
-- 
Regards,
Feri



Bug#887563: corosync prerm will stop pacemaker and not start it again

2018-04-19 Thread Anders Kaseorg
Control: found 887563 2.4.2-3
Control: severity 887563 important

This just bit me on a Stretch cluster when upgrading corosync from 2.4.2-3 
to 2.4.2-3+deb9u1.  Marking as such.  Please apply the suggested fixes as 
soon as possible.

Anders



Bug#887563: corosync prerm will stop pacemaker and not start it again

2018-01-17 Thread Nishanth Aravamudan
Source: corosync
Severity: normal

Dear Maintainer,

We have a report in Ubuntu,
https://bugs.launchpad.net/charm-hacluster/+bug/1740892, which I believe
is reproducible in Debian Sid as well. In particular, I set up a Sid
LXD container:

# apt install corosync pacemaker
...
# systemctl status corosync pacemaker
● corosync.service - Corosync Cluster Engine
...
   Active: active (running) since Wed 2018-01-17 23:14:56 UTC; 9s ago
...
● pacemaker.service - Pacemaker High Availability Cluster Manager
...
   Active: active (running) since Wed 2018-01-17 23:15:00 UTC; 5s ago
# apt install --reinstall corosync
...
# systemctl status corosync pacemaker
● corosync.service - Corosync Cluster Engine
...
   Active: active (running) since Wed 2018-01-17 23:15:23 UTC; 3s ago
● pacemaker.service - Pacemaker High Availability Cluster Manager
...
   Active: inactive (dead) since Wed 2018-01-17 23:15:22 UTC; 4s ago

I believe this is because the prerm of corosync.service has

# Automatically added by dh_installinit
if [ -x "/etc/init.d/corosync" ]; then
    invoke-rc.d corosync stop || exit $?
fi

which unconditionally stops corosync for all Debian and Ubuntu releases
(as the init script is installed even if unused by systemd). When
corosync stops, pacemaker fails to connect to corosync (and the
pacemaker systemd unit file specifies that pacemaker Requires corosync)
and also stops.

When the postinst for corosync runs:

if [ "$1" = "configure" ] || [ "$1" = "abort-upgrade" ]; then
    if [ -x "/etc/init.d/corosync" ]; then
        update-rc.d corosync defaults >/dev/null
        invoke-rc.d corosync start || exit $?
    fi
fi

corosync will start, but there is no connection between corosync
starting (systemd or SysV) and pacemaker.

I think there are two necessary changes to the packaging/upstream to fix
this:

1) The systemd unit file should indicate pacemaker is PartOf corosync,
which will propagate restarts of corosync to pacemaker. This will also
propagate stops, but as mentioned above, pacemaker already stops when
corosync stops, so I think it's harmless. Additionally, the SysV init
file should be updated to check if the pacemaker SysV status was running
before stopping corosync in the restart path and start pacemaker as well
after starting corosync.

2) d/rules should call dh_installinit with --restart-after-upgrade. This
is the default in compat >= 10 (2.4.2-3 is still at 9). That will change
the prerm and postinst to not stop/start the service on upgrade, but
simply restart it in the postinst (removals will still stop the
service).
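
To illustrate point 2), the following sketch mimics the decision a
postinst makes under --restart-after-upgrade: start on a fresh install,
restart on an upgrade. It is written from memory of what debhelper
generates and is not the literal snippet. postinst_action is a
hypothetical helper, and its two parameters stand for the postinst's $1
(the action) and $2 (the previously configured version, empty on first
install).

```shell
#!/bin/sh
postinst_action() {
    # $1: maintainer script action, $2: previously configured version
    case "$1" in
        configure|abort-upgrade)
            if [ -n "$2" ]; then
                echo restart    # upgrade: prerm left the service running
            else
                echo start      # fresh install
            fi
            ;;
        *)
            echo none           # other actions do not touch the service
            ;;
    esac
}

postinst_action configure 2.4.2-3   # upgrade from 2.4.2-3
postinst_action configure ""        # first installation
```

A real postinst would then pass the chosen verb to invoke-rc.d, while the
matching prerm would stop the service only on removal.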

Now, neither of these actually fixes the existing packages,
unfortunately: they will stop corosync on the upgrade to a fixed
package and thus stop pacemaker. I have no idea if there actually is any
way to fix this for existing packages, since the 'old' prerm will be
invoked by dpkg on the upgrade path.

-- System Information:
Debian Release: buster/sid
  APT prefers bionic
  APT policy: (500, 'bionic')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 4.13.0-25-generic (SMP w/4 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8), 
LANGUAGE=en_US:en (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

-- 
Nishanth Aravamudan
Ubuntu Server
Canonical Ltd