Thank you for your valuable feedback. I will check that ordering and move lvmlockd after dlm_controld in the group.
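For reference, a rough sketch of how I plan to make that change with pcs; the resource IDs ("dlm", "lvmlockd") and the group name ("locking") below are placeholders, the actual names are in the attached configuration:

  # with the cluster in maintenance mode, reorder the group so that
  # lvmlockd comes after dlm_controld (placeholder resource/group IDs)
  pcs property set maintenance-mode=true
  pcs resource group remove locking lvmlockd
  pcs resource group add locking lvmlockd --after dlm
  pcs resource config locking        # confirm dlm is now listed before lvmlockd
  pcs property set maintenance-mode=false

The idea being that lvmlockd needs dlm_controld up before it can create its DLM lockspaces, so within the group dlm should start before lvmlockd (and stop after it).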
On Thu, Feb 25, 2021 at 5:21 PM Ken Gaillot <kgail...@redhat.com> wrote:
> On Thu, 2021-02-25 at 06:34 +0000, shivraj dongawe wrote:
> > @Ken Gaillot, thanks for sharing your inputs on the possible behavior
> > of the cluster.
> > We have reconfirmed that dlm on the healthy node was waiting for
> > fencing of the faulty node, and that shared storage access on the
> > healthy node was blocked during this process.
> > Kindly let me know whether this is the natural behavior or the result
> > of some misconfiguration.
>
> Your configuration looks perfect to me, except for one thing: I believe
> lvmlockd should be *after* dlm_controld in the group. I don't know if
> that's causing the problem, but it's worth trying.
>
> It is expected that DLM will wait for fencing, but it should be happy
> after fencing completes, so something is not right.
>
> > As asked, I am sharing the configuration information as an attachment
> > to this mail.
> >
> > On Fri, Feb 19, 2021 at 11:28 PM Ken Gaillot <kgail...@redhat.com> wrote:
> > > On Fri, 2021-02-19 at 07:48 +0530, shivraj dongawe wrote:
> > > > Any update on this? Is there any issue in the configuration that we
> > > > are using?
> > > >
> > > > On Mon, Feb 15, 2021, 14:40 shivraj dongawe <shivraj...@gmail.com> wrote:
> > > > > Kindly read "fencing is done using fence_scsi" from the previous
> > > > > message as "fencing is configured".
> > > > >
> > > > > As per the error messages, we have analyzed that node2 initiated
> > > > > fencing of node1 because many cluster-related processes on node1
> > > > > had been killed by the OOM killer and node1 was marked as down.
> > > > > Many resources on node2 then waited for fencing of node1, as seen
> > > > > from the following messages in node2's syslog:
> > > > > dlm_controld[1616]: 91659 lvm_postgres_db_vg wait for fencing
> > > > > dlm_controld[1616]: 91659 lvm_global wait for fencing
> > > > >
> > > > > These messages appeared while the postgresql-12 service was being
> > > > > started on node2. As the postgresql service depends on these
> > > > > services (dlm, lvmlockd and gfs2), it did not start in time on
> > > > > node2, and node2 fenced itself after declaring that the services
> > > > > could not be started on it.
> > > > >
> > > > > On Mon, Feb 15, 2021 at 9:00 AM Ulrich Windl <ulrich.wi...@rz.uni-regensburg.de> wrote:
> > > > > > >>> shivraj dongawe <shivraj...@gmail.com> wrote on 15.02.2021
> > > > > > at 08:27 in message
> > > > > > <CALpaHO_6LsYM=t76CifsRkFeLYDKQc+hY3kz7PRKp7b4se=-a...@mail.gmail.com>:
> > > > > > > Fencing is done using fence_scsi.
> > > > > > > Config details are as follows:
> > > > > > > Resource: scsi (class=stonith type=fence_scsi)
> > > > > > >   Attributes: devices=/dev/mapper/mpatha pcmk_host_list="node1 node2"
> > > > > > >     pcmk_monitor_action=metadata pcmk_reboot_action=off
> > > > > > >   Meta Attrs: provides=unfencing
> > > > > > >   Operations: monitor interval=60s (scsi-monitor-interval-60s)
> > > > > > >
> > > > > > > On Mon, Feb 15, 2021 at 7:17 AM Ulrich Windl <ulrich.wi...@rz.uni-regensburg.de> wrote:
> > > > > > >> >>> shivraj dongawe <shivraj...@gmail.com> wrote on 14.02.2021
> > > > > > >> at 12:03 in message
> > > > > > >> <calpaho--3erfwst70mbl-wm9g6yh3ytd-wda1r_cknbrsxu...@mail.gmail.com>:
> > > > > > >> > We are running a two-node cluster on Ubuntu 20.04 LTS. Cluster-related
> > > > > > >> > package versions are as follows:
> > > > > > >> > pacemaker/focal-updates,focal-security 2.0.3-3ubuntu4.1 amd64
> > > > > > >> > pacemaker/focal 2.0.3-3ubuntu3 amd64
> > > > > > >> > corosync/focal 3.0.3-2ubuntu2 amd64
> > > > > > >> > pcs/focal 0.10.4-3 all
> > > > > > >> > fence-agents/focal 4.5.2-1 amd64
> > > > > > >> > gfs2-utils/focal 3.2.0-3 amd64
> > > > > > >> > dlm-controld/focal 4.0.9-1build1 amd64
> > > > > > >> > lvm2-lockd/focal 2.03.07-1ubuntu1 amd64
> > > > > > >> >
> > > > > > >> > Cluster configuration details:
> > > > > > >> > 1. The cluster has shared storage mounted through a gfs2 filesystem,
> > > > > > >> > with the help of dlm and lvmlockd.
> > > > > > >> > 2. Corosync is configured to use knet for transport.
> > > > > > >> > 3. Fencing is configured using fence_scsi on the shared storage that is
> > > > > > >> > used for the gfs2 filesystem.
> > > > > > >> > 4. The two main resources configured are the cluster/virtual IP and
> > > > > > >> > postgresql-12; postgresql-12 is configured as a systemd resource.
> > > > > > >> > We had done failover testing of the cluster (rebooting/shutting down a
> > > > > > >> > node, link failure) and had observed that resources were migrated
> > > > > > >> > properly to the active node.
> > > > > > >> >
> > > > > > >> > Recently we came across an issue that has occurred repeatedly in a span
> > > > > > >> > of two days. Details are below:
> > > > > > >> > 1. The out-of-memory killer is invoked on the active node and starts
> > > > > > >> > killing processes. A sample is as follows:
> > > > > > >> > postgres invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE),
> > > > > > >> > order=0, oom_score_adj=0
> > > > > > >> > 2. At one instance it started by killing pacemaker, and at another by
> > > > > > >> > killing postgresql. It does not stop at a single process; it goes on
> > > > > > >> > killing others (more concerning is the killing of cluster-related
> > > > > > >> > processes) as well. We have observed that swap space on that node is
> > > > > > >> > 2 GB against 96 GB of RAM, and we are in the process of increasing swap
> > > > > > >> > space to see if this resolves the issue.
> > > > > > >> > Postgres is configured with a shared_buffers value of 32 GB (which is
> > > > > > >> > way less than 96 GB). We are not yet sure which process is suddenly
> > > > > > >> > eating up that much memory.
> > > > > > >> > 3. As a result of the killed processes on node1, node2 tries to fence
> > > > > > >> > node1 and thereby initiates stopping of the cluster resources on node1.
> > > > > > >>
> > > > > > >> How is fencing being done?
> > > > > > >>
> > > > > > >> > 4. At this point we reach a stage where it is assumed that node1 is
> > > > > > >> > down, and the application resources, cluster IP and postgresql are
> > > > > > >> > being started on node2.
> > > > > >
> > > > > > This is why I was asking: Is your fencing successful ("assumed that
> > > > > > node1 is down"), or isn't it?
> > > > > >
> > > > > > >> > 5. Postgresql on node2 fails to start within 60 seconds (the start
> > > > > > >> > operation timeout) and is declared failed. During the start operation
> > > > > > >> > of postgres, we found many messages related to the failure of fencing,
> > > > > > >> > and other resources such as dlm and the VG waiting for fencing to
> > > > > > >> > complete.
> > >
> > > It does seem that DLM is where the problem occurs.
> > >
> > > Note that fencing is scheduled in two separate ways, once by DLM and
> > > once by the cluster itself, when node1 is lost.
> > >
> > > The fencing scheduled by the cluster completes successfully:
> > >
> > > Feb 13 11:07:56 DB-2 pacemaker-controld[2451]: notice: Peer node1 was
> > > terminated (reboot) by node2 on behalf of pacemaker-controld.2451: OK
> > >
> > > but DLM just attempts fencing over and over, eventually causing
> > > resource timeouts. Those timeouts cause the cluster to schedule
> > > resource recovery (stop+start), but the stops time out for the same
> > > reason, and it is those stop timeouts that cause node2 to be fenced.
> > >
> > > I'm not familiar enough with DLM to know what might keep it from being
> > > able to contact Pacemaker for fencing.
> > >
> > > Can you attach your configuration as well (with any sensitive info
> > > removed)? I assume you've created an ocf:pacemaker:controld clone, and
> > > that the other resources are layered on top of that with colocation
> > > and ordering constraints.
> > >
> > > > > > >> > Details of the syslog messages of node2 during this event are
> > > > > > >> > attached in a file.
> > > > > > >> > 6. After this point we are in a state where node1 and node2 both go
> > > > > > >> > into a fenced state and the resources are unrecoverable (all
> > > > > > >> > resources on both nodes).
> > > > > > >> >
> > > > > > >> > Now, the out-of-memory issue on node1 can be taken care of by
> > > > > > >> > increasing swap, finding the process responsible for such huge memory
> > > > > > >> > usage, and taking the necessary actions to minimize that usage; but
> > > > > > >> > the other issue that remains unclear is why the cluster did not shift
> > > > > > >> > to node2 cleanly and became unrecoverable.
> --
> Ken Gaillot <kgail...@redhat.com>
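One more note from my side: the next time the "wait for fencing" messages appear, I will capture DLM's and Pacemaker's view of fencing on the surviving node before anything times out. A rough checklist (standard dlm_tool / stonith_admin commands, nothing specific to our configuration):

  dlm_tool ls                   # lockspace states and pending changes
  dlm_tool status               # dlm_controld daemon/node status, including fence results
  dlm_tool dump                 # dlm_controld debug buffer with the fencing attempts
  stonith_admin --history '*'   # fencing actions as recorded by Pacemaker

Hopefully that will show why dlm_controld keeps retrying fencing even after the cluster reports it as complete.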
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/