Any update on this? Is there any issue in the configuration that we are using?
On Mon, Feb 15, 2021, 14:40 shivraj dongawe <shivraj...@gmail.com> wrote:

> Kindly read "fencing is done using fence_scsi" from the previous message
> as "fencing is configured".
>
> As per the error messages, we have analyzed that node2 initiated fencing
> of node1 because many cluster-related processes on node1 had been killed
> by the oom killer and node1 was marked as down.
> Many resources on node2 then waited for the fencing of node1, as seen in
> the following syslog messages of node2:
>
> dlm_controld[1616]: 91659 lvm_postgres_db_vg wait for fencing
> dlm_controld[1616]: 91659 lvm_global wait for fencing
>
> These messages appeared while the postgresql-12 service was being started
> on node2.
>
> As the postgresql service depends on these services (dlm, lvmlockd and
> gfs2), it did not start in time on node2.
>
> And node2 fenced itself after declaring that the services could not be
> started on it.
>
>
> On Mon, Feb 15, 2021 at 9:00 AM Ulrich Windl
> <ulrich.wi...@rz.uni-regensburg.de> wrote:
>
>> >>> shivraj dongawe <shivraj...@gmail.com> wrote on 15.02.2021 at 08:27
>> in message
>> <CALpaHO_6LsYM=t76CifsRkFeLYDKQc+hY3kz7PRKp7b4se=-a...@mail.gmail.com>:
>> > Fencing is done using fence_scsi.
>> > Config details are as follows:
>> >  Resource: scsi (class=stonith type=fence_scsi)
>> >   Attributes: devices=/dev/mapper/mpatha pcmk_host_list="node1 node2"
>> >               pcmk_monitor_action=metadata pcmk_reboot_action=off
>> >   Meta Attrs: provides=unfencing
>> >   Operations: monitor interval=60s (scsi-monitor-interval-60s)
>> >
>> > On Mon, Feb 15, 2021 at 7:17 AM Ulrich Windl
>> > <ulrich.wi...@rz.uni-regensburg.de> wrote:
>> >
>> >> >>> shivraj dongawe <shivraj...@gmail.com> wrote on 14.02.2021 at
>> >> 12:03 in message
>> >> <calpaho--3erfwst70mbl-wm9g6yh3ytd-wda1r_cknbrsxu...@mail.gmail.com>:
>> >> > We are running a two-node cluster on Ubuntu 20.04 LTS. The
>> >> > cluster-related package versions are as follows:
>> >> >   pacemaker/focal-updates,focal-security 2.0.3-3ubuntu4.1 amd64
>> >> >   pacemaker/focal 2.0.3-3ubuntu3 amd64
>> >> >   corosync/focal 3.0.3-2ubuntu2 amd64
>> >> >   pcs/focal 0.10.4-3 all
>> >> >   fence-agents/focal 4.5.2-1 amd64
>> >> >   gfs2-utils/focal 3.2.0-3 amd64
>> >> >   dlm-controld/focal 4.0.9-1build1 amd64
>> >> >   lvm2-lockd/focal 2.03.07-1ubuntu1 amd64
>> >> >
>> >> > Cluster configuration details:
>> >> > 1. The cluster has shared storage mounted through a gfs2 filesystem
>> >> > with the help of dlm and lvmlockd.
>> >> > 2. Corosync is configured to use knet for transport.
>> >> > 3. Fencing is configured using fence_scsi on the shared storage
>> >> > that is used for the gfs2 filesystem.
>> >> > 4. The two main resources configured are the cluster/virtual IP and
>> >> > postgresql-12; postgresql-12 is configured as a systemd resource.
>> >> > We had done failover testing of the cluster (rebooting/shutting
>> >> > down a node, link failure) and had observed that resources were
>> >> > migrated properly to the surviving node.
>> >> >
>> >> > Recently we came across an issue which has occurred repeatedly in a
>> >> > span of two days. Details are below:
>> >> > 1. The out-of-memory killer is invoked on the active node and
>> >> > starts killing processes. A sample is as follows:
>> >> > postgres invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE),
>> >> > order=0, oom_score_adj=0
>> >> > 2. On one occasion it started by killing pacemaker and on another
>> >> > by killing postgresql. It does not stop with the killing of a
>> >> > single process; it goes on killing others as well (more concerning
>> >> > is the killing of cluster-related processes). We have observed that
>> >> > swap space on that node is 2 GB against 96 GB of RAM, and we are in
>> >> > the process of increasing the swap space to see if this resolves
>> >> > the issue. Postgres is configured with a shared_buffers value of
>> >> > 32 GB (which is way less than 96 GB).
>> >> > We are not yet sure which process is suddenly eating up that much
>> >> > memory.
>> >> > 3. As a result of the killed processes on node1, node2 tries to
>> >> > fence node1 and thereby initiates the stopping of cluster resources
>> >> > on node1.
>> >>
>> >> How is fencing being done?
>> >>
>> >> > 4. At this point we reach a stage where it is assumed that node1 is
>> >> > down, and the application resources, cluster IP and postgresql, are
>> >> > being started on node2.
>>
>> This is why I was asking: Is your fencing successful ("assumed that
>> node1 is down"), or isn't it?
>>
>> >> > 5. Postgresql on node2 fails to start within 60 sec (the start
>> >> > operation timeout) and is declared failed. During the start
>> >> > operation of postgres, we found many messages related to the
>> >> > failure of fencing and to other resources such as dlm and the VG
>> >> > waiting for fencing to complete.
>> >> > Details of the syslog messages of node2 during this event are
>> >> > attached in a file.
>> >> > 6. After this point we are in a state where node1 and node2 both
>> >> > end up fenced and the resources are unrecoverable (all resources on
>> >> > both nodes).
>> >> >
>> >> > Now my question: the out-of-memory issue on node1 can be taken care
>> >> > of by increasing swap, finding the process responsible for such
>> >> > huge memory usage and taking the necessary actions to minimize that
>> >> > memory usage; but the other issue that remains unclear is why the
>> >> > cluster did not shift to node2 cleanly and became unrecoverable.
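For anyone reproducing this setup, the stonith resource quoted above corresponds roughly to the pcs command below, and the SCSI-3 persistent reservations that fence_scsi relies on can be inspected with sg_persist from sg3-utils. This is only a sketch based on the attributes shown in the thread; exact pcs syntax can differ slightly between releases, and the device path and node names are the ones quoted above.

# Recreate the stonith resource described above (pcs 0.10.x syntax):
pcs stonith create scsi fence_scsi \
    devices=/dev/mapper/mpatha \
    pcmk_host_list="node1 node2" \
    pcmk_monitor_action=metadata \
    pcmk_reboot_action=off \
    op monitor interval=60s \
    meta provides=unfencing

# Inspect the registered keys and the reservation on the shared device;
# after a successful fence_scsi action the fenced node's key should be gone:
sg_persist --in --read-keys --device=/dev/mapper/mpatha
sg_persist --in --read-reservation --device=/dev/mapper/mpatha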
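When dlm_controld logs "wait for fencing", as in the messages quoted above, it helps to confirm whether the fencer actually completed an action and whether the DLM lockspaces are still blocked. A hedged sketch of the usual checks (availability of the pcs subcommand depends on the installed pcs version):

# Fence history as recorded by the fencer:
stonith_admin --history '*' --verbose
pcs stonith history

# DLM lockspace state; lockspaces stuck waiting for fencing show up here:
dlm_tool ls
dlm_tool status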
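The 60-second start timeout mentioned in point 5 is easy to exceed while dlm/lvmlockd are still waiting for fencing, and PostgreSQL crash recovery alone can take longer than that. A minimal sketch of raising it, assuming the resource is called postgresql-12 (substitute the real ID shown by "pcs resource config"; 300s is only an example value):

# Give the systemd resource more time to start:
pcs resource update postgresql-12 op start timeout=300s

# Verify the start operation now carries the new timeout:
pcs resource config postgresql-12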
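Since the immediate trigger is the OOM killer taking out cluster daemons, one common mitigation (an assumption on my part, not something discussed in the thread) is to lower their OOM score via a systemd drop-in so that PostgreSQL backends are chosen as victims before corosync/pacemaker. Recent corosync builds already protect themselves, so check the current values first.

# What the running daemons currently have:
cat /proc/$(pidof corosync)/oom_score_adj
cat /proc/$(pidof pacemakerd)/oom_score_adj

# Hypothetical drop-in lowering the score (repeat for pacemaker.service);
# OOMScoreAdjust= is a standard systemd [Service] directive:
mkdir -p /etc/systemd/system/corosync.service.d
printf '[Service]\nOOMScoreAdjust=-1000\n' \
    > /etc/systemd/system/corosync.service.d/oom.conf
systemctl daemon-reload    # takes effect on the next daemon restart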
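On the question of which process suddenly eats the memory: shared_buffers is not the whole story, because every backend can additionally allocate up to work_mem per sort/hash operation. A back-of-envelope check, with purely illustrative numbers (the real values must come from the server's configuration):

# Current settings:
sudo -u postgres psql -c "SHOW shared_buffers;" -c "SHOW work_mem;" \
    -c "SHOW max_connections;" -c "SHOW maintenance_work_mem;"

# Illustrative worst case: 32 GB shared_buffers + 500 connections x 64 MB
# work_mem = 64 GB, before the OS page cache, dlm and the cluster stack are
# counted - enough to trigger the OOM killer on a 96 GB node with 2 GB swap.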
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/