Re: [ClusterLabs] silence resource ? - PGSQL

2023-06-28 Thread Andrei Borzenkov

On 28.06.2023 14:11, lejeczek via Users wrote:

Hi guys.

Having 'pgsql' set up in what I'd say is a vanilla-default
config, pacemaker's journal log is flooded with:
...
pam_unix(runuser:session): session closed for user postgres
pam_unix(runuser:session): session opened for user
postgres(uid=26) by (uid=0)
pam_unix(runuser:session): session closed for user postgres
pam_unix(runuser:session): session opened for user
postgres(uid=26) by (uid=0)
pam_unix(runuser:session): session closed for user postgres
pam_unix(runuser:session): session opened for user
postgres(uid=26) by (uid=0)
pam_unix(runuser:session): session closed for user postgres
...

Would you have a working fix or even a suggestion on how to
silence those?



Did you try "man pam_unix"? Maybe one of its options does what you need?
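
For example - untested here, and the exact PAM stack differs between distros -
the pam_unix(8) man page lists a "quiet" option that is meant to turn off the
informational session open/close messages, so adding it to the pam_unix session
line of whichever PAM service runuser goes through (the path below is an
assumption) might be enough:

    # /etc/pam.d/runuser -- path and surrounding lines depend on your distro
    # ... keep the existing auth/session lines as they are ...
    session    required     pam_unix.so quiet

Alternatively you could filter those messages at the syslog/journald level
instead of touching PAM.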



Re: [ClusterLabs] Pacemaker logs written on message which is not expected as per configuration

2023-06-28 Thread S Sathish S via Users
Thanks Klaus and Ken for your quick support.

Regards,
S Sathish S


[ClusterLabs] silence resource ? - PGSQL

2023-06-28 Thread lejeczek via Users

Hi guys.

Having 'pgsql' set up in what I'd say is a vanilla-default 
config, pacemaker's journal log is flooded with:

...
pam_unix(runuser:session): session closed for user postgres
pam_unix(runuser:session): session opened for user 
postgres(uid=26) by (uid=0)

pam_unix(runuser:session): session closed for user postgres
pam_unix(runuser:session): session opened for user 
postgres(uid=26) by (uid=0)

pam_unix(runuser:session): session closed for user postgres
pam_unix(runuser:session): session opened for user 
postgres(uid=26) by (uid=0)

pam_unix(runuser:session): session closed for user postgres
...

Would you have a working fix or even a suggestion on how to 
silence those?


many thanks, L.


Re: [ClusterLabs] no-quorum-policy=ignore is (Deprecated ) and replaced with other options but not an effective solution

2023-06-28 Thread Klaus Wenninger
On Wed, Jun 28, 2023 at 7:38 AM Klaus Wenninger  wrote:

>
>
> On Wed, Jun 28, 2023 at 3:30 AM Priyanka Balotra <
> priyanka.14balo...@gmail.com> wrote:
>
>> I am using SLES 15 SP4. Is the no-quorum-policy still supported?
>>
>>
>> Thanks
>> Priyanka
>>
>> On Wed, 28 Jun 2023 at 12:46 AM, Ken Gaillot  wrote:
>>
>>> On Tue, 2023-06-27 at 22:38 +0530, Priyanka Balotra wrote:
>>> > In this case stonith has been configured as a resource,
>>> > primitive stonith-sbd stonith:external/sbd
>>>
>>
> Then the error scenario you described looks like everybody lost connection
> to the shared storage. The nodes that rebooted then probably suicided rather
> than reading the poison-pill. And the quorate partition stays alive because
> it is quorate, but since it can't see the shared storage it can't verify that
> it was able to write the poison-pill, which makes the other nodes stay unclean.
> But again, just guessing ...
>

That said, without knowing the details of your scenario and the failure
scenarios you want to cover, you might consider watchdog-fencing. AFAIK SUSE
has supported that as well for a while now.
It gives you service recovery from nodes that are cut off via the network,
including from their physical fencing devices. Poison-pill fencing should do
that as well, as long as the quorate part of the cluster is able to access
the shared disk, but in your scenario that doesn't seem to be the case.
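
Very rough sketch of what watchdog-only (diskless) SBD looks like - names and
values below are examples, so check the SLES HA documentation for your release
rather than copying this verbatim:

    # /etc/sysconfig/sbd -- leave SBD_DEVICE unset/empty for diskless mode
    SBD_WATCHDOG_DEV=/dev/watchdog
    SBD_WATCHDOG_TIMEOUT=5

    # cluster side: stonith-watchdog-timeout should be larger than the
    # watchdog timeout (commonly about twice its value)
    crm configure property stonith-enabled=true
    crm configure property stonith-watchdog-timeout=10s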
Just out of curiosity: are you using poison-pill with multiple shared disks?
Asking because in that case the poison-pill may still be delivered via a
single disk and the target would reboot, but the side that initiated fencing
might not recover resources, as it might not have been able to write the
poison-pill to a quorate number of disks.
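
If you want to check that, something like the following (the device paths are
just placeholders) shows whether all configured disks are reachable and what
is currently written in their slots:

    # list slots and any pending messages on each configured device
    sbd -d /dev/disk/by-id/sbd-disk-1 -d /dev/disk/by-id/sbd-disk-2 -d /dev/disk/by-id/sbd-disk-3 list
    # dump the on-disk header (timeouts etc.) of one device
    sbd -d /dev/disk/by-id/sbd-disk-1 dump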

Klaus

>
>
>> >
>>> > For it to function properly, the resource needs to be up, which
>>> > is only possible if the system is quorate.
>>>
>>> Pacemaker can use a fence device even if its resource is not active.
>>> The resource being active just allows Pacemaker to monitor the device
>>> regularly.
>>>
>>> >
>>> > Hence our requirement is to make the system quorate even if only one node
>>> > of the cluster is up.
>>> > Stonith will then take care of any split-brain scenarios.
>>>
>>> In that case it sounds like no-quorum-policy=ignore is actually what
>>> you want.
>>>
>>
> Still dangerous without something like wait-for-all - right?
> With LMS I guess you should get the same effect without having specified it
> explicitly, though.
>
> Klaus
>
>
>>
>>> >
>>> > Thanks
>>> > Priyanka
>>> >
>>> > On Tue, Jun 27, 2023 at 9:06 PM Klaus Wenninger 
>>> > wrote:
>>> > >
>>> > > On Tue, Jun 27, 2023 at 5:24 PM Andrei Borzenkov <
>>> > > arvidj...@gmail.com> wrote:
>>> > > > On 27.06.2023 07:21, Priyanka Balotra wrote:
>>> > > > > Hi Andrei,
>>> > > > > After this state the system went through some more fencings and we
>>> > > > > saw the following state:
>>> > > > >
>>> > > > > :~ # crm status
>>> > > > > Cluster Summary:
>>> > > > >* Stack: corosync
>>> > > > >* Current DC: FILE-2 (version
>>> > > > > 2.1.2+20211124.ada5c3b36-150400.2.43-2.1.2+20211124.ada5c3b36) - partition with quorum
>>> > > >
>>> > > > It says "partition with quorum" so what exactly is the problem?
>>> > >
>>> > > I guess the problem is that resources aren't being recovered on the
>>> > > nodes in the quorate partition.
>>> > > The reason for that is probably that - as Ken was already suggesting -
>>> > > fencing isn't working properly or the fencing devices used are simply
>>> > > inappropriate for the purpose (e.g. onboard IPMI).
>>> > > The fact that a node is rebooting isn't enough. The node that initiated
>>> > > fencing has to know that it did actually work. But we're just guessing
>>> > > here. Logs should show what is actually going on.
>>> > >
>>> > > Klaus
>>> > > > >* Last updated: Mon Jun 26 12:44:15 2023
>>> > > > >* Last change:  Mon Jun 26 12:41:12 2023 by root via cibadmin on FILE-2
>>> > > > >* 4 nodes configured
>>> > > > >* 11 resource instances configured
>>> > > > >
>>> > > > > Node List:
>>> > > > >* Node FILE-1: UNCLEAN (offline)
>>> > > > >* Node FILE-4: UNCLEAN (offline)
>>> > > > >* Online: [ FILE-2 ]
>>> > > > >* Online: [ FILE-3 ]
>>> > > > >
>>> > > > > At this stage FILE-1 and FILE-4 were continuously getting fenced
>>> > > > > (we have device based stonith configured but the resource was not up).
>>> > > > > Two nodes were online and two were offline. So quorum wasn't attained
>>> > > > > again.
>>> > > > > 1) For such a scenario we need help to be able to have one cluster live.
>>> > > > > 2) And in cases where only one node of the cluster is up and others are
>>> > > > > down, we need the resources and cluster to be up.
>>> > > > >
>>> > > > > Thanks
>>> > > > > Priyanka
>>> > > > >
>>> > > > > On Tue, Jun 27, 2023 at 12:25 AM Andrei