>>> Ken Gaillot <kgail...@redhat.com> wrote on 25.02.2021 at 18:21 in message
<f1a534dc3ba1e2e933916b270e4570847c4e7ecd.ca...@redhat.com>:
> On Thu, 2021-02-25 at 06:34 +0000, shivraj dongawe wrote:
>
>> @Ken Gaillot, thanks for sharing your inputs on the possible behavior
>> of the cluster.
>> We have reconfirmed that dlm on the healthy node was waiting for
>> fencing of the faulty node, and that shared storage access on the
>> healthy node was blocked during this process.
>> Kindly let me know whether this is the natural behavior or the result
>> of some misconfiguration.
>
> Your configuration looks perfect to me, except for one thing: I believe
> lvmlockd should be *after* dlm_controld in the group. I don't know if
> that's causing the problem, but it's worth trying.

Definitely!
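For anyone wanting to try that reordering, a minimal sketch with pcs; the
group name "locking" and the resource names "dlm" and "lvmlockd" are
assumptions, so substitute the names from the attached configuration:

  # Take lvmlockd out of the group and re-add it after dlm_controld,
  # so the start order becomes dlm -> lvmlockd -> GFS2 filesystem.
  pcs resource ungroup locking lvmlockd
  pcs resource group add locking lvmlockd --after dlm

Afterwards, pcs resource config should show dlm listed before lvmlockd
inside the group.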
> It is expected that DLM will wait for fencing, but it should be happy
> after fencing completes, so something is not right.

For comparison, I see this on a clustered MD setup:

h16 dlm_controld[4760]: 12943 fence wait 118 pid 41722 running
h16 dlm_controld[4760]: 12943 a9641017-8070-bba7-a956-77de0f0a wait for fencing
...
h16 dlm_controld[4760]: 12992 fence result 118 pid 41722 result 0 exit status
h16 dlm_controld[4760]: 12992 fence status 118 receive 0 from 116 walltime 1610028681 local 12992
h16 kernel: dlm: a9641017-8070-bba7-a956-77de0f0a5c0a: dlm_recover 7
h16 kernel: dlm: a9641017-8070-bba7-a956-77de0f0a5c0a: remove member 118

Regards,
Ulrich

>> As asked, I am sharing the configuration information as an attachment
>> to this mail.
>>
>> On Fri, Feb 19, 2021 at 11:28 PM Ken Gaillot <kgail...@redhat.com> wrote:
>> > On Fri, 2021-02-19 at 07:48 +0530, shivraj dongawe wrote:
>> > > Any update on this?
>> > > Is there any issue in the configuration that we are using?
>> > >
>> > > On Mon, Feb 15, 2021, 14:40 shivraj dongawe <shivraj...@gmail.com> wrote:
>> > > > Kindly read "fencing is done using fence_scsi" from the previous
>> > > > message as "fencing is configured".
>> > > >
>> > > > From the error messages, we concluded that node2 initiated fencing
>> > > > of node1 because many cluster-related processes on node1 had been
>> > > > killed by the OOM killer and node1 was marked as down.
>> > > > Many resources on node2 then waited for fencing of node1, as seen
>> > > > in the following messages from the syslog of node2:
>> > > > dlm_controld[1616]: 91659 lvm_postgres_db_vg wait for fencing
>> > > > dlm_controld[1616]: 91659 lvm_global wait for fencing
>> > > >
>> > > > These messages appeared while the postgresql-12 service was being
>> > > > started on node2.
>> > > > As the postgresql service depends on these services (dlm, lvmlockd
>> > > > and gfs2), it did not start in time on node2, and node2 fenced
>> > > > itself after declaring that the services could not be started on it.
>> > > >
>> > > > On Mon, Feb 15, 2021 at 9:00 AM Ulrich Windl <
>> > > > ulrich.wi...@rz.uni-regensburg.de> wrote:
>> > > > > >>> shivraj dongawe <shivraj...@gmail.com> wrote on 15.02.2021
>> > > > > at 08:27 in message
>> > > > > <CALpaHO_6LsYM=t76CifsRkFeLYDKQc+hY3kz7PRKp7b4se=-a...@mail.gmail.com>:
>> > > > > > Fencing is done using fence_scsi.
>> > > > > > Config details are as follows:
>> > > > > > Resource: scsi (class=stonith type=fence_scsi)
>> > > > > >  Attributes: devices=/dev/mapper/mpatha pcmk_host_list="node1 node2"
>> > > > > >   pcmk_monitor_action=metadata pcmk_reboot_action=off
>> > > > > >  Meta Attrs: provides=unfencing
>> > > > > >  Operations: monitor interval=60s (scsi-monitor-interval-60s)
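For anyone reproducing this setup, the listing above corresponds roughly to
a pcs command like the one below; this is a sketch reconstructed from the
quoted output, not the poster's actual command:

  pcs stonith create scsi fence_scsi \
      devices=/dev/mapper/mpatha \
      pcmk_host_list="node1 node2" \
      pcmk_monitor_action=metadata \
      pcmk_reboot_action=off \
      op monitor interval=60s \
      meta provides=unfencing

Note that fence_scsi cannot power-cycle or reboot a node; it only revokes
the node's SCSI reservations on the shared device, which is presumably why
pcmk_reboot_action=off is set here.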
>> > > > > >
>> > > > > > On Mon, Feb 15, 2021 at 7:17 AM Ulrich Windl <
>> > > > > > ulrich.wi...@rz.uni-regensburg.de> wrote:
>> > > > > >
>> > > > > >> >>> shivraj dongawe <shivraj...@gmail.com> wrote on 14.02.2021
>> > > > > >> at 12:03 in message
>> > > > > >> <CALpaHO--3ERfwST70mBL-Wm9g6yH3YtD-wda1r_cknbrsxu...@mail.gmail.com>:
>> > > > > >> > We are running a two node cluster on Ubuntu 20.04 LTS.
>> > > > > >> > Cluster-related package versions are as follows:
>> > > > > >> > pacemaker/focal-updates,focal-security 2.0.3-3ubuntu4.1 amd64
>> > > > > >> > pacemaker/focal 2.0.3-3ubuntu3 amd64
>> > > > > >> > corosync/focal 3.0.3-2ubuntu2 amd64
>> > > > > >> > pcs/focal 0.10.4-3 all
>> > > > > >> > fence-agents/focal 4.5.2-1 amd64
>> > > > > >> > gfs2-utils/focal 3.2.0-3 amd64
>> > > > > >> > dlm-controld/focal 4.0.9-1build1 amd64
>> > > > > >> > lvm2-lockd/focal 2.03.07-1ubuntu1 amd64
>> > > > > >> >
>> > > > > >> > Cluster configuration details:
>> > > > > >> > 1. The cluster has shared storage mounted as a gfs2 filesystem
>> > > > > >> > with the help of dlm and lvmlockd.
>> > > > > >> > 2. Corosync is configured to use knet for transport.
>> > > > > >> > 3. Fencing is configured using fence_scsi on the shared storage
>> > > > > >> > that holds the gfs2 filesystem.
>> > > > > >> > 4. The two main resources configured are the cluster/virtual IP and
>> > > > > >> > postgresql-12; postgresql-12 is configured as a systemd resource.
>> > > > > >> > We had done failover testing (rebooting/shutting down a node, link
>> > > > > >> > failure) and had observed that resources were migrated properly to
>> > > > > >> > the active node.
>> > > > > >> >
>> > > > > >> > Recently we came across an issue that has occurred repeatedly in a
>> > > > > >> > span of two days. Details are below:
>> > > > > >> > 1. The out-of-memory killer is invoked on the active node and starts
>> > > > > >> > killing processes. A sample is as follows:
>> > > > > >> > postgres invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE),
>> > > > > >> > order=0, oom_score_adj=0
>> > > > > >> > 2. At one instance it started by killing pacemaker, at another by
>> > > > > >> > killing postgresql. It does not stop at a single process; it goes on
>> > > > > >> > killing others (more concerning is the killing of cluster-related
>> > > > > >> > processes) as well. We have observed that swap space on that node is
>> > > > > >> > 2 GB against 96 GB of RAM, and we are in the process of increasing
>> > > > > >> > swap to see if this resolves the issue. Postgres is configured with a
>> > > > > >> > shared_buffers value of 32 GB (which is way less than 96 GB).
>> > > > > >> > We are not yet sure which process is suddenly eating up that much
>> > > > > >> > memory.
>> > > > > >> > 3. As a result of the killed processes on node1, node2 tries to fence
>> > > > > >> > node1, thereby initiating the stopping of cluster resources on node1.
>> > > > > >>
>> > > > > >> How is fencing being done?
>> > > > > >>
>> > > > > >> > 4. At this point we reach a stage where it is assumed that node1 is
>> > > > > >> > down, and the application resources, cluster IP and postgresql are
>> > > > > >> > being started on node2.
>> > > > >
>> > > > > This is why I was asking: Is your fencing successful ("assumed that
>> > > > > node1 is down"), or isn't it?
>> > > > >
>> > > > > >> > 5. Postgresql on node2 fails to start within 60 sec (start operation
>> > > > > >> > timeout) and is declared failed. During the start operation of
>> > > > > >> > postgres, we found many messages about the failure of fencing and
>> > > > > >> > about other resources, such as dlm and the VG, waiting for fencing
>> > > > > >> > to complete.
>> >
>> > It does seem that DLM is where the problem occurs.
>> >
>> > Note that fencing is scheduled in two separate ways, once by DLM and
>> > once by the cluster itself, when node1 is lost.
>> >
>> > The fencing scheduled by the cluster completes successfully:
>> >
>> > Feb 13 11:07:56 DB-2 pacemaker-controld[2451]: notice: Peer node1 was
>> > terminated (reboot) by node2 on behalf of pacemaker-controld.2451: OK
>> >
>> > but DLM just attempts fencing over and over, eventually causing
>> > resource timeouts. Those timeouts cause the cluster to schedule
>> > resource recovery (stop+start), but the stops time out for the same
>> > reason, and it is those stop timeouts that cause node2 to be fenced.
>> >
>> > I'm not familiar enough with DLM to know what might keep it from being
>> > able to contact Pacemaker for fencing.
>> >
>> > Can you attach your configuration as well (with any sensitive info
>> > removed)? I assume you've created an ocf:pacemaker:controld clone, and
>> > that the other resources are layered on top of that with colocation and
>> > ordering constraints.
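For context, the DLM/lvmlockd/GFS2 layering Ken refers to is usually built
along the following lines. This is only an illustrative sketch: every
resource name, device path and mount point below is made up, and the real
definitions are in the attached configuration:

  # Lock infrastructure as a cloned group: dlm_controld first, then lvmlockd
  pcs resource create dlm ocf:pacemaker:controld op monitor interval=30s on-fail=fence
  pcs resource create lvmlockd ocf:heartbeat:lvmlockd op monitor interval=30s on-fail=fence
  pcs resource group add locking dlm lvmlockd
  pcs resource clone locking interleave=true

  # GFS2 filesystem layered on top through ordering and colocation
  pcs resource create shared_fs ocf:heartbeat:Filesystem \
      device=/dev/postgres_db_vg/db_lv directory=/srv/pgdata fstype=gfs2 \
      op monitor interval=30s on-fail=fence clone interleave=true
  pcs constraint order start locking-clone then shared_fs-clone
  pcs constraint colocation add shared_fs-clone with locking-clone

With this layering, anything that depends on the filesystem (here, the
postgresql-12 resource and the virtual IP) can only start once DLM and
lvmlockd are up, which is also why everything above them stalls while DLM
is waiting for fencing.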
>> > > > > >> > Details of the syslog messages of node2 during this event are
>> > > > > >> > attached in a file.
>> > > > > >> > 6. After this point we reach a state where node1 and node2 both end
>> > > > > >> > up fenced and resources are unrecoverable (all resources on both
>> > > > > >> > nodes).
>> > > > > >> >
>> > > > > >> > Now, the out-of-memory issue on node1 can be taken care of by
>> > > > > >> > increasing swap and finding the process responsible for such huge
>> > > > > >> > memory usage, but the other issue that remains unclear is why the
>> > > > > >> > cluster did not shift cleanly to node2 and became unrecoverable.
>
> --
> Ken Gaillot <kgail...@redhat.com>

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/