Sent from my iPhone

> On 13 Aug 2019, at 0:17, Michael Powell <michael.pow...@harmonicinc.com> 
> wrote:
> 
> Yes, I have tried that.  I used crm_resource --meta -p resource-stickiness -v 
> 0 -r SS16201289RN00023 to disable resource stickiness and then kill -9 <pid> 
> to kill the application associated with the master resource.  The results are 
> the same:  the slave resource remains a slave while the failed resource is 
> restarted and becomes master again.
>  

Does the slave have a master score? Your logs show only one node with a 
master score. To be selected as the new master, the other node needs a 
non-zero master score as well.
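
A quick way to check is to query the transient master-score node attribute 
directly, or to dump the scheduler's allocation and promotion scores; a 
sketch, assuming the usual master-<resource> attribute naming and the node 
names from earlier in this thread:

    crm_attribute -N mgraid-16201289RN00023-1 -n master-SS16201289RN00023 -G -l reboot
    crm_simulate -sL | grep -i master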


> One approach that seems to work is to run crm_resource -M -r 
> ms-SS16201289RN00023 -H mgraid-16201289RN00023-1 to move the resource to the 
> other node (assuming that the master is running on node 
> mgraid-16201289RN00023-0).  My original understanding was that this would 
> “restart” the resource on the destination node, but that was apparently a 
> misunderstanding.  I can change our scripts to use this approach, but a) I 
> thought that maintaining the approach of demoting the master resource and 
> promoting the slave to master was more generic, and b) I am unsure of any 
> potential side effects of moving the resource.  Given what I’m trying to 
> accomplish, is this in fact the preferred approach?
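
Regarding side effects: crm_resource -M works by injecting a location 
constraint (typically named cli-prefer-<resource>) into the CIB, and that 
constraint pins the resource to the destination node until it is explicitly 
removed. If you script the move, remember to clear the constraint afterwards; 
a sketch, using the resource name from your message:

    crm_resource -U -r ms-SS16201289RN00023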
>  
> Regards,
>     Michael
>  
>  
> -----Original Message-----
> From: Users <users-boun...@clusterlabs.org> On Behalf Of 
> users-requ...@clusterlabs.org
> Sent: Monday, August 12, 2019 1:10 PM
> To: users@clusterlabs.org
> Subject: [EXTERNAL] Users Digest, Vol 55, Issue 19
>  
> Send Users mailing list submissions to
>                 users@clusterlabs.org
>  
> To subscribe or unsubscribe via the World Wide Web, visit
>                 https://lists.clusterlabs.org/mailman/listinfo/users
> or, via email, send a message with subject or body 'help' to
>                 users-requ...@clusterlabs.org
>  
> You can reach the person managing the list at
>                 users-ow...@clusterlabs.org
>  
> When replying, please edit your Subject line so it is more specific than "Re: 
> Contents of Users digest..."
>  
>  
> Today's Topics:
>  
>    1. why is node fenced ? (Lentes, Bernd)
>    2. Postgres HA - pacemaker RA do not support auto failback (Shital A)
>    3. Re: why is node fenced ? (Chris Walker)
>    4. Re: Master/slave failover does not work as expected
>       (Andrei Borzenkov)
>  
>  
> ----------------------------------------------------------------------
>  
> Message: 1
> Date: Mon, 12 Aug 2019 18:09:24 +0200 (CEST)
> From: "Lentes, Bernd" <bernd.len...@helmholtz-muenchen.de>
> To: Pacemaker ML <users@clusterlabs.org>
> Subject: [ClusterLabs] why is node fenced ?
> Message-ID:
>                 
> <546330844.1686419.1565626164456.javamail.zim...@helmholtz-muenchen.de>
>                
> Content-Type: text/plain; charset=utf-8
>  
> Hi,
>  
> last Friday (9th of August) I had to install patches on my two-node cluster.
> I put one of the nodes (ha-idg-2) into standby (crm node standby ha-idg-2), 
> patched it, rebooted, started the cluster (systemctl start pacemaker) again, 
> put the node online again, everything fine.
>  
> Then I wanted to do the same procedure with the other node (ha-idg-1).
> I put it in standby, patched it, rebooted, started pacemaker again.
> But then ha-idg-1 fenced ha-idg-2; it said the node was unclean.
> I know that nodes which are unclean need to be shut down, that's logical.
>  
> But I don't know where the conclusion that the node is unclean comes from, 
> or why it is unclean; I searched the logs and didn't find any hint.
>  
> I put the syslog and the pacemaker log on a Seafile share; I'd be very 
> thankful if you'd have a look.
> https://hmgubox.helmholtz-muenchen.de/d/53a10960932445fb9cfe/
>  
> Here the cli history of the commands:
>  
> 17:03:04  crm node standby ha-idg-2
> 17:07:15  zypper up (install Updates on ha-idg-2)
> 17:17:30  systemctl reboot
> 17:25:21  systemctl start pacemaker.service
> 17:25:47  crm node online ha-idg-2
> 17:26:35  crm node standby ha-idg-1
> 17:30:21  zypper up (install Updates on ha-idg-1)
> 17:37:32  systemctl reboot
> 17:43:04  systemctl start pacemaker.service
> 17:44:00  ha-idg-1 is fenced
>  
> Thanks.
>  
> Bernd
>  
> OS is SLES 12 SP4, pacemaker 1.1.19, corosync 2.3.6-9.13.1
>  
>  
> --
>  
> Bernd Lentes
> Systemadministration
> Institut für Entwicklungsgenetik
> Gebäude 35.34 - Raum 208
> Helmholtz Zentrum München
> bernd.len...@helmholtz-muenchen.de
> phone: +49 89 3187 1241
> phone: +49 89 3187 3827
> fax: +49 89 3187 2294
> http://www.helmholtz-muenchen.de/idg
>  
> Perfect is whoever makes no mistakes.
> So the dead are perfect.
>  
> Helmholtz Zentrum Muenchen
> Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
> Ingolstaedter Landstr. 1
> 85764 Neuherberg
> www.helmholtz-muenchen.de
> Aufsichtsratsvorsitzende: MinDir'in Prof. Dr. Veronika von Messling
> Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Heinrich 
> Bassler, Kerstin Guenther
> Registergericht: Amtsgericht Muenchen HRB 6466
> USt-IdNr: DE 129521671
>  
>  
>  
> ------------------------------
>  
> Message: 2
> Date: Mon, 12 Aug 2019 12:24:02 +0530
> From: Shital A <brightuser2...@gmail.com>
> To: pgsql-gene...@postgresql.com, Users@clusterlabs.org
> Subject: [ClusterLabs] Postgres HA - pacemaker RA do not support auto
>                 failback
> Message-ID:
>                 
> <camp7vw_kf2em_buh_fpbznc9z6pvvx+7rxjymhfmcozxuwg...@mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>  
> Hello,
>  
> Postgres version: 9.6
> OS: RHEL 7.6
>  
> We are working on an HA setup for a two-node Postgres cluster in
> active-passive mode.
>  
> Installed:
> Pacemaker 1.1.19
> Corosync 2.4.3
>  
> The Pacemaker resource agent in this installation doesn't support automatic
> failback. What I mean by that is explained below:
> 1. The cluster is set up as A - B, with A as master.
> 2. Kill the services on A; node B comes up as master.
> 3. When node A is ready to rejoin the cluster, we have to delete the lock
> file it creates on one of the nodes and run the cleanup command to bring the
> node back as standby (see the sketch below).
>  
> Step 3 is manual, so HA is not achieved in the real sense.
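> 
> For reference, the manual recovery in step 3 typically looks like the
> following, assuming the stock pgsql resource agent with its default tmpdir
> (the lock-file path and the resource name are placeholders to adapt):
> 
>     rm /var/lib/pgsql/tmp/PGSQL.lock
>     pcs resource cleanup <pgsql-resource>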
>  
> Please help to check:
> 1. Is there any version of the resource agent which supports automatic
> failback, so that the lock file is not generated and does not have to be
> deleted by hand?
>  
> 2. If there is no such support and we need this functionality, do we have
> to modify the existing code?
>  
> How can this be achieved? Please suggest.
> Thanks.
>  
>  
> ------------------------------
>  
> Message: 3
> Date: Mon, 12 Aug 2019 17:47:02 +0000
> From: Chris Walker <cwal...@cray.com>
> To: Cluster Labs - All topics related to open-source clustering
>                 welcomed <users@clusterlabs.org>
> Subject: Re: [ClusterLabs] why is node fenced ?
> Message-ID: <eafef777-5a49-4c06-a2f6-8711f528b...@cray.com>
> Content-Type: text/plain; charset="utf-8"
>  
> When ha-idg-1 started Pacemaker around 17:43, it did not see ha-idg-2, for 
> example,
>  
> Aug 09 17:43:05 [6318] ha-idg-1 pacemakerd:     info: 
> pcmk_quorum_notification: Quorum retained | membership=1320 members=1
>  
> after ~20s (dc-deadtime parameter), ha-idg-2 is marked 'unclean' and 
> STONITHed as part of startup fencing.
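> 
> Both knobs involved here are cluster properties; as a sketch (the value is
> just an example, shown with crmsh as used elsewhere in this thread), the
> startup wait can be raised so that a slow-booting peer is not declared
> unclean:
> 
>     crm configure property dc-deadtime=120s
> 
> (startup-fencing could also be set to false, but disabling startup fencing
> is generally unsafe and not recommended.)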
>  
> There is nothing in ha-idg-2's HA logs around 17:43 indicating that it saw 
> ha-idg-1 either, so it appears that there was no communication at all between 
> the two nodes.
>  
> I'm not sure exactly why the nodes did not see one another, but there are 
> indications of network issues around this time
>  
> 2019-08-09T17:42:16.427947+02:00 ha-idg-2 kernel: [ 1229.245533] bond1: now 
> running without any active interface!
>  
> so perhaps that's related.
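> 
> One way to check the corosync side the next time this happens is to look at
> the ring status on each node, e.g.:
> 
>     corosync-cfgtool -s
> 
> which reports whether each ring is active and free of faults.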
>  
> HTH,
> Chris
>  
>  
> On 8/12/19, 12:09 PM, "Users on behalf of Lentes, Bernd" 
> <users-boun...@clusterlabs.org on behalf of 
> bernd.len...@helmholtz-muenchen.de> wrote:
>  
>     [Bernd's original message quoted here in full; trimmed, see Message 1 above.]
>  
>  
> ------------------------------
>  
> Message: 4
> Date: Mon, 12 Aug 2019 23:09:31 +0300
> From: Andrei Borzenkov <arvidj...@gmail.com>
> To: Cluster Labs - All topics related to open-source clustering
>                 welcomed <users@clusterlabs.org>
> Cc: Venkata Reddy Chappavarapu <venkata.chappavar...@harmonicinc.com>
> Subject: Re: [ClusterLabs] Master/slave failover does not work as
>                 expected
> Message-ID:
>                 
> <CAA91j0WxSxt_eVmUvXgJ_0goBkBw69r3o-VesRvGc6atg6o=j...@mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>  
> On Mon, Aug 12, 2019 at 4:12 PM Michael Powell <
> michael.pow...@harmonicinc.com> wrote:
>  
> > At 07:44:49, the ss agent discovers that the master instance has failed on
> > node *mgraid-16201289RN00023-0* as a result of a failed *ssadm* request in
> > response to an *ss_monitor()* operation.  It issues a *crm_master -Q -D*
> > command with the intent of demoting the master and promoting the slave, on
> > the other node, to master.  The *ss_demote()* function finds that the
> > application is no longer running and returns *OCF_NOT_RUNNING* (7).  In the
> > older product, this was sufficient to promote the other instance to master,
> > but in the current product, that does not happen.  Currently, the failed
> > application is restarted, as expected, and is promoted to master, but this
> > takes tens of seconds.
> > 
> > 
> > 
>  
> Did you try to disable resource stickiness for this ms?
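> 
> Note that in the test quoted above, stickiness was cleared on the primitive
> (SS16201289RN00023); if stickiness is also set on the ms wrapper, it would
> need to be cleared there too, using the same crm_resource form, e.g.:
> 
>     crm_resource --meta -p resource-stickiness -v 0 -r ms-SS16201289RN00023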
>  
> ------------------------------
>  
> Subject: Digest Footer
>  
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>  
> ClusterLabs home: https://www.clusterlabs.org/
>  
> ------------------------------
>  
> End of Users Digest, Vol 55, Issue 19
> *************************************
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
