Re: [ClusterLabs] Howto stonith in the case of any interface failure?

2019-10-09 Thread Andrei Borzenkov
On Wed, Oct 9, 2019 at 10:59 AM Kadlecsik József wrote:
>
> Hello,
>
> The nodes in our cluster have got backend and frontend interfaces: the
> former ones are for the storage and cluster (corosync) traffic and the
> latter ones are for the public services of KVM guests only.
>
> One of the nodes had a failure ("watchdog: BUG: soft lockup - CPU#7
> stuck for 23s"), with the result that the node could process traffic on
> the backend interface but not on the frontend one. Thus the services
> became unavailable, but the cluster thought the node was all right and
> did not stonith it.
>
> How could we protect the cluster against such failures?
>
> We could configure a second corosync ring, but that would be a redundancy
> ring only.
>
> We could set up a second, independent corosync configuration for a second
> pacemaker just with stonith agents. Is it enough to specify the cluster
> name in the corosync config to pair pacemaker to corosync? And how can we
> tell pacemaker to connect to this second corosync instance?
>
> Which is the best way to solve the problem?
>

That really depends on what "node could process traffic" means. If it is
just about basic IP connectivity, you can use the ocf:pacemaker:ping
resource to monitor network availability and move resources if the current
node is considered "unconnected". This is actually documented in Pacemaker
Explained, section 8.3.2, "Moving Resources Due to Connectivity Changes".
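
For illustration, a minimal sketch of that approach in crm shell syntax
(the gateway address, scores and resource names are placeholders, not
taken from this thread):

  # inside "crm configure": ping the front-end gateway and record the
  # result in the node attribute "pingd"
  primitive p_ping ocf:pacemaker:ping \
      params host_list="192.0.2.1" multiplier="1000" dampen="5s" \
      op monitor interval="10s"
  clone cl_ping p_ping
  # keep the guest resources off nodes without front-end connectivity
  location l_connected g_kvm_guests \
      rule -inf: not_defined pingd or pingd lte 0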

If "process traffic" means something else, you need custom agent that
implements whatever checks are necessary to decide that node cannot
process traffic anymore.

[ClusterLabs] SBD with shared device - loss of both interconnect and shared device?

2019-10-09 Thread Andrei Borzenkov
What happens if a node loses both the interconnect and the shared device? I
assume the node will reboot, correct?

Now, assuming (in a two-node cluster) the second node can still access the
shared device, it will fence the lost node (via SBD) and continue with the
takeover, right?

If both nodes lose the shared device, both will reboot, and if access to the
shared device is not restored, cluster services will simply not come up on
either node, which means a total outage. Correct?
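
(The exact outcome depends on the SBD device and watchdog timeouts in play;
purely as a sketch, with device path and values as placeholders:)

  # /etc/sysconfig/sbd (path and values are examples only)
  SBD_DEVICE="/dev/disk/by-id/scsi-EXAMPLE-sbd"
  SBD_WATCHDOG_DEV="/dev/watchdog"
  SBD_WATCHDOG_TIMEOUT="5"
  # the msgwait/watchdog timeouts written on the shared device can be
  # checked with:
  #   sbd -d /dev/disk/by-id/scsi-EXAMPLE-sbd dump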


Re: [ClusterLabs] Howto stonith in the case of any interface failure?

2019-10-09 Thread Digimer
On 2019-10-09 3:58 a.m., Kadlecsik József wrote:
> Hello,
> 
> The nodes in our cluster have got backend and frontend interfaces: the 
> former ones are for the storage and cluster (corosync) traffic and the 
> latter ones are for the public services of KVM guests only.
> 
> One of the nodes had a failure ("watchdog: BUG: soft lockup - CPU#7
> stuck for 23s"), with the result that the node could process traffic on
> the backend interface but not on the frontend one. Thus the services
> became unavailable, but the cluster thought the node was all right and
> did not stonith it.
> 
> How could we protect the cluster against such failures?
> 
> We could configure a second corosync ring, but that would be a redundancy 
> ring only.
> 
> We could set up a second, independent corosync configuration for a second
> pacemaker just with stonith agents. Is it enough to specify the cluster
> name in the corosync config to pair pacemaker to corosync? And how can we
> tell pacemaker to connect to this second corosync instance?
> 
> Which is the best way to solve the problem? 
> 
> Best regards,
> Jozsef

We use mode=1 (active-passive) bonded network interfaces for each
network connection (we also have a back-end, front-end and a storage
network). Each bond has a link going to one switch and the other link to
a second switch. For fence devices, we use IPMI fencing connected via
switch 1 and PDU fencing as the backup method connected on switch 2.

With this setup, no matter what might fail, one of the fence methods
will still be available. It's saved us in the field a few times now.
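
A rough sketch of that layered fencing in crm shell syntax, assuming
per-node IPMI and PDU stonith resources already exist (all names below are
placeholders):

  # inside "crm configure": try IPMI first, fall back to the PDU
  fencing_topology \
      node1: st_ipmi_node1 st_pdu_node1 \
      node2: st_ipmi_node2 st_pdu_node2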

-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould

[ClusterLabs] change of the configuration of a resource which is part of a clone

2019-10-09 Thread Lentes, Bernd
Hi,

I finally managed to find out how I can simulate configuration changes and
see their results before committing them. OMG, that makes life much more
relaxed.
I need to change the configuration of a resource which is part of a group;
the group runs as a clone on all nodes.
Unfortunately the resource is a prerequisite for several other resources,
and those other resources will restart when I commit the changes, which I
definitely want to avoid.
What can I do?
I have a two node cluster on SLES 12 SP4, with 
pacemaker-1.1.19+20181105.ccd6b5b10-3.13.1.x86_64 and 
corosync-2.3.6-9.13.1.x86_64.
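
For anyone wondering, the simulation workflow mentioned above can look
roughly like this in the crm shell (a sketch only; the sandbox name is
arbitrary):

  crm
    cib new sandbox      # work on a copy (shadow) of the live CIB
    configure edit       # make the changes in the copy
    configure ptest      # show what the cluster would do with them
    cib commit sandbox   # push the copy to the live CIB once satisfied
    cib use live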

Bernd


-- 

Bernd Lentes 
Systemadministration 
Institut für Entwicklungsgenetik 
Gebäude 35.34 - Raum 208 
HelmholtzZentrum münchen 
bernd.len...@helmholtz-muenchen.de 
phone: +49 89 3187 1241 
phone: +49 89 3187 3827 
fax: +49 89 3187 2294 
http://www.helmholtz-muenchen.de/idg 

Perfect is he who makes no mistakes.
So the dead are perfect.
 

Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Chair of the Supervisory Board: MinDir'in Prof. Dr. Veronika von Messling
Management: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Kerstin Guenther
Commercial register: Amtsgericht Muenchen HRB 6466
VAT ID: DE 129521671


Re: [ClusterLabs] Howto stonith in the case of any interface failure?

2019-10-09 Thread Kadlecsik József
On Wed, 9 Oct 2019, Ken Gaillot wrote:

> > One of the nodes had a failure ("watchdog: BUG: soft lockup -
> > CPU#7 stuck for 23s"), with the result that the node could process
> > traffic on the backend interface but not on the frontend one. Thus the
> > services became unavailable, but the cluster thought the node was all
> > right and did not stonith it.
> > 
> > How could we protect the cluster against such failures?
> 
> See the ocf:heartbeat:ethmonitor agent (to monitor the interface itself) 
> and/or the ocf:pacemaker:ping agent (to monitor reachability of some IP 
> such as a gateway)

This looks really promising, thank you! Does the cluster regard it as a
failure when an ocf:heartbeat:ethmonitor clone instance is not running on
a node? :-)

Best regards,
Jozsef
--
E-mail : kadlecsik.joz...@wigner.mta.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: Wigner Research Centre for Physics
 H-1525 Budapest 114, POB. 49, Hungary


Re: [ClusterLabs] Howto stonith in the case of any interface failure?

2019-10-09 Thread Kadlecsik József
Hi,

On Wed, 9 Oct 2019, Jan Pokorný wrote:

> On 09/10/19 09:58 +0200, Kadlecsik József wrote:
> > The nodes in our cluster have got backend and frontend interfaces: the 
> > former ones are for the storage and cluster (corosync) traffic and the 
> > latter ones are for the public services of KVM guests only.
> > 
> > One of the nodes had a failure ("watchdog: BUG: soft lockup - CPU#7
> > stuck for 23s"), with the result that the node could process traffic on
> > the backend interface but not on the frontend one. Thus the services
> > became unavailable, but the cluster thought the node was all right and
> > did not stonith it.
> 
> > Which is the best way to solve the problem? 
> 
> Looks like heuristics of corosync-qdevice that would ping/attest your
> frontend interface could be a way to go.  You'd need an additional
> host in your setup, though.

As far as I can see, corosync-qdevice can add/increase the votes for a node
but cannot decrease them. I hope I'm wrong; I wouldn't mind adding an
additional host :-)

Best regards,
Jozsef
--
E-mail : kadlecsik.joz...@wigner.mta.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: Wigner Research Centre for Physics
 H-1525 Budapest 114, POB. 49, Hungary

Re: [ClusterLabs] Howto stonith in the case of any interface failure?

2019-10-09 Thread Ken Gaillot
On Wed, 2019-10-09 at 09:58 +0200, Kadlecsik József wrote:
> Hello,
> 
> The nodes in our cluster have got backend and frontend interfaces: the
> former ones are for the storage and cluster (corosync) traffic and the
> latter ones are for the public services of KVM guests only.
> 
> One of the nodes had a failure ("watchdog: BUG: soft lockup - CPU#7
> stuck for 23s"), with the result that the node could process traffic on
> the backend interface but not on the frontend one. Thus the services
> became unavailable, but the cluster thought the node was all right and
> did not stonith it.
> 
> How could we protect the cluster against such failures?

See the ocf:heartbeat:ethmonitor agent (to monitor the interface
itself) and/or the ocf:pacemaker:ping agent (to monitor reachability of
some IP such as a gateway)
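
A minimal sketch of the ethmonitor part in crm shell syntax (interface,
resource and group names are placeholders; by default the agent publishes
its result in a node attribute named ethmonitor-<interface>):

  primitive p_ethmon ocf:heartbeat:ethmonitor \
      params interface="eth1" \
      op monitor interval="10s"
  clone cl_ethmon p_ethmon
  # ban resources from nodes where the front-end NIC is reported down
  location l_frontend_up g_kvm_guests \
      rule -inf: not_defined ethmonitor-eth1 or ethmonitor-eth1 eq 0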

> 
> We could configure a second corosync ring, but that would be a redundancy
> ring only.
> 
> We could set up a second, independent corosync configuration for a second
> pacemaker just with stonith agents. Is it enough to specify the cluster
> name in the corosync config to pair pacemaker to corosync? And how can we
> tell pacemaker to connect to this second corosync instance?
> 
> Which is the best way to solve the problem? 
> 
> Best regards,
> Jozsef
> --
> E-mail : kadlecsik.joz...@wigner.mta.hu
> PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
> Address: Wigner Research Centre for Physics
>  H-1525 Budapest 114, POB. 49, Hungary
-- 
Ken Gaillot 


Re: [ClusterLabs] [ClusterLabs Developers] FYI: looks like there are DNS glitches with clusterlabs.org subdomains

2019-10-09 Thread Ken Gaillot
Due to a mix-up, all of clusterlabs.org is currently without DNS
service. :-(

List mail may continue to work for a while as mail servers rely on DNS
caches, so hopefully this reaches most of our subscribers.

No estimate yet for when it will be recovered.

On Wed, 2019-10-09 at 11:06 +0200, Jan Pokorný wrote:
> Neither bugs.c.o nor lists.c.o works for me ATM.
> Either it will resolve by itself, or Ken will intervene, I believe.
-- 
Ken Gaillot 


[ClusterLabs] announcement: schedule for resource-agents release 4.4.0

2019-10-09 Thread Oyvind Albrigtsen

Hi,

This is a tentative schedule for resource-agents v4.4.0:
4.4.0-rc1: October 16.
4.4.0: October 23.

I've modified the corresponding milestones at
https://github.com/ClusterLabs/resource-agents/milestones

If there's anything you think should be part of the release
please open an issue, a pull request, or a bugzilla, as you see
fit.

If there's anything that hasn't received due attention, please
let us know.

Finally, if you can help with resolving issues consider yourself
invited to do so. There are currently 105 issues and 54 pull
requests still open.


Cheers,
Oyvind Albrigtsen


Re: [ClusterLabs] Howto stonith in the case of any interface failure?

2019-10-09 Thread Jan Pokorný
On 09/10/19 09:58 +0200, Kadlecsik József wrote:
> The nodes in our cluster have got backend and frontend interfaces: the 
> former ones are for the storage and cluster (corosync) traffic and the 
> latter ones are for the public services of KVM guests only.
> 
> One of the nodes had a failure ("watchdog: BUG: soft lockup - CPU#7
> stuck for 23s"), with the result that the node could process traffic on
> the backend interface but not on the frontend one. Thus the services
> became unavailable, but the cluster thought the node was all right and
> did not stonith it.
> 
> How could we protect the cluster against such failures?
> 
> We could configure a second corosync ring, but that would be a redundancy 
> ring only.
> 
> We could set up a second, independent corosync configuration for a second
> pacemaker just with stonith agents. Is it enough to specify the cluster
> name in the corosync config to pair pacemaker to corosync? And how can we
> tell pacemaker to connect to this second corosync instance?

Such pairing happens via a system-wide singleton Unix socket. IOW, two
instances of corosync on the same machine would conflict; only a single
daemon can run at a time.

> Which is the best way to solve the problem? 

Looks like heuristics of corosync-qdevice that would ping/attest your
frontend interface could be a way to go.  You'd need an additional
host in your setup, though.
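
A rough corosync.conf sketch of that idea (host name, address, command path
and timeouts are placeholders):

  quorum {
      provider: corosync_votequorum
      device {
          model: net
          net {
              host: qnetd.example.org
              algorithm: ffsplit
          }
          heuristics {
              mode: on
              timeout: 5000
              # the result is reported to qnetd, which prefers partitions
              # where the check passes
              exec_check_frontend: /usr/bin/ping -q -c 1 -W 1 192.0.2.1
          }
      }
  }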

-- 
Poki



[ClusterLabs] Howto stonith in the case of any interface failure?

2019-10-09 Thread Kadlecsik József
Hello,

The nodes in our cluster have got backend and frontend interfaces: the 
former ones are for the storage and cluster (corosync) traffic and the 
latter ones are for the public services of KVM guests only.

One of the nodes had a failure ("watchdog: BUG: soft lockup - CPU#7
stuck for 23s"), with the result that the node could process traffic on
the backend interface but not on the frontend one. Thus the services
became unavailable, but the cluster thought the node was all right and
did not stonith it.

How could we protect the cluster against such failures?

We could configure a second corosync ring, but that would be a redundancy 
ring only.

We could set up a second, independent corosync configuration for a second
pacemaker just with stonith agents. Is it enough to specify the cluster
name in the corosync config to pair pacemaker to corosync? And how can we
tell pacemaker to connect to this second corosync instance?

Which is the best way to solve the problem? 

Best regards,
Jozsef
--
E-mail : kadlecsik.joz...@wigner.mta.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: Wigner Research Centre for Physics
 H-1525 Budapest 114, POB. 49, Hungary