Fencing is done using fence_scsi. Config details are as follows:

 Resource: scsi (class=stonith type=fence_scsi)
  Attributes: devices=/dev/mapper/mpatha pcmk_host_list="node1 node2" pcmk_monitor_action=metadata pcmk_reboot_action=off
  Meta Attrs: provides=unfencing
  Operations: monitor interval=60s (scsi-monitor-interval-60s)
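For reference, a pcs command along these lines should produce that stonith resource (just a sketch; the device path, host names and timings are taken from the config output above and may need adjusting for another setup):

    pcs stonith create scsi fence_scsi \
        devices=/dev/mapper/mpatha pcmk_host_list="node1 node2" \
        pcmk_monitor_action=metadata pcmk_reboot_action=off \
        op monitor interval=60s \
        meta provides=unfencing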
On Mon, Feb 15, 2021 at 7:17 AM Ulrich Windl <ulrich.wi...@rz.uni-regensburg.de> wrote:

> >>> shivraj dongawe <shivraj...@gmail.com> schrieb am 14.02.2021 um 12:03 in Nachricht
> <calpaho--3erfwst70mbl-wm9g6yh3ytd-wda1r_cknbrsxu...@mail.gmail.com>:
> > We are running a two-node cluster on Ubuntu 20.04 LTS. Cluster-related
> > package version details are as follows:
> > pacemaker/focal-updates,focal-security 2.0.3-3ubuntu4.1 amd64
> > pacemaker/focal 2.0.3-3ubuntu3 amd64
> > corosync/focal 3.0.3-2ubuntu2 amd64
> > pcs/focal 0.10.4-3 all
> > fence-agents/focal 4.5.2-1 amd64
> > gfs2-utils/focal 3.2.0-3 amd64
> > dlm-controld/focal 4.0.9-1build1 amd64
> > lvm2-lockd/focal 2.03.07-1ubuntu1 amd64
> >
> > Cluster configuration details:
> > 1. The cluster has shared storage mounted as a GFS2 filesystem with the help of DLM and lvmlockd.
> > 2. Corosync is configured to use knet for transport.
> > 3. Fencing is configured using fence_scsi on the shared storage that backs the GFS2 filesystem.
> > 4. The two main resources are the cluster/virtual IP and postgresql-12; postgresql-12 is configured as a systemd resource.
> > We had done failover testing of the cluster (rebooting/shutting down a node, link failure) and observed that resources were migrated properly to the remaining active node.
> >
> > Recently we came across an issue that has occurred repeatedly within a span of two days. Details are below:
> > 1. The out-of-memory killer is invoked on the active node and starts killing processes. A sample message:
> > postgres invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
> > 2. On one occasion it started by killing pacemaker, on another postgresql. It does not stop after killing a single process; it goes on to kill others as well (the killing of cluster-related processes being the most worrying). We have observed that swap space on that node is 2 GB against 96 GB of RAM, and we are in the process of increasing swap to see if that resolves the issue. Postgres is configured with shared_buffers of 32 GB (well below the 96 GB of RAM). We are not yet sure which process is suddenly consuming that much memory.
> > 3. As a result of the killed processes on node1, node2 tries to fence node1, thereby initiating the stopping of cluster resources on node1.
>
> How is fencing being done?
>
> > 4. At this point we reach a stage where node1 is assumed to be down, and the application resources (cluster IP and postgresql) are being started on node2.
> > 5. Postgresql on node2 fails to start within 60 seconds (the start operation timeout) and is declared failed. During the start operation of postgres we found many messages about fencing failures, and about other resources such as dlm and the VG waiting for fencing to complete. The syslog messages from node2 during this event are attached in a file.
> > 6. After this point we are in a state where node1 and node2 both end up fenced and all resources on both nodes are unrecoverable.
> >
> > Now, the out-of-memory issue on node1 can be taken care of by increasing swap, finding the process responsible for the huge memory usage, and taking the necessary actions to minimize that usage; but the other issue that remains unclear is why the cluster did not fail over to node2 cleanly and instead became unrecoverable.
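On the memory question above: until the culprit is known, a simple snapshot of per-process resident memory around the time of the OOM events can help identify it (only a rough sketch; any equivalent tooling works, e.g. run periodically from cron and logged to a file):

    # Top 15 processes by resident memory, plus overall memory/swap usage
    ps -eo pid,user,rss,vsz,comm --sort=-rss | head -n 15
    free -h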