On 09/01/2017 11:45 PM, Ken Gaillot wrote:
> On Fri, 2017-09-01 at 15:06 +0530, Abhay B wrote:
>> Are you sure the monitor stopped? Pacemaker only logs recurring
>> monitors when the status changes. Any successful monitors after this
>> wouldn't be logged.
>>
>> Yes. Since there were no logs which said "RecurringOp: Start
>> recurring monitor" on the node after it had failed.
>> Also there were no logs for any actions pertaining to
>> The problem was that even though the one node was failing, the
>> resources were never moved to the other node (the node on which I
>> suspect monitoring had stopped).
>>
>> There are a lot of resource action failures, so I'm not sure where
>> the issue is, but I'm guessing it has to do with
>> migration-threshold=1 -- once a resource has failed once on a node,
>> it won't be allowed back on that node until the failure is cleaned
>> up. Of course you also have failure-timeout=1s, which should clean
>> it up immediately, so I'm not sure.
>>
>> migration-threshold=1
>> failure-timeout=1s
>>
>> cluster-recheck-interval=2s
>>
>> first, set "two_node: 1" in corosync.conf and let no-quorum-policy
>> default in pacemaker
>>
>> This is already configured.
>> # cat /etc/corosync/corosync.conf
>> totem {
>>     version: 2
>>     secauth: off
>>     cluster_name: SVSDEHA
>>     transport: udpu
>>     token: 5000
>> }
>>
>> nodelist {
>>     node {
>>         ring0_addr: 2.0.0.10
>>         nodeid: 1
>>     }
>>
>>     node {
>>         ring0_addr: 2.0.0.11
>>         nodeid: 2
>>     }
>> }
>>
>> quorum {
>>     provider: corosync_votequorum
>>     two_node: 1
>> }
>>
>> logging {
>>     to_logfile: yes
>>     logfile: /var/log/cluster/corosync.log
>>     to_syslog: yes
>> }
>>
>> let no-quorum-policy default in pacemaker; then, get stonith
>> configured, tested, and enabled
>>
>> By not configuring no-quorum-policy, would it ignore quorum for a
>> 2-node cluster?
> With two_node, corosync always provides quorum to pacemaker, so
> pacemaker doesn't see any quorum loss. The only significant difference
> from ignoring quorum is that corosync won't form a cluster from a cold
> start unless both nodes can reach each other (a safety feature).
>
>> For my use case I don't need stonith enabled. My intention is to have
>> a highly available system all the time.
> Stonith is the only way to recover from certain types of failure, such
> as the "split brain" scenario, or a resource that fails to stop.
>
> If your nodes are physical machines with hardware watchdogs, you can
> set up sbd for fencing without needing any extra equipment.

Small caveat here:
If I understand it correctly, you have a 2-node setup. In that case the
watchdog-only sbd setup would not be usable, as it relies on 'real'
quorum. In 2-node setups, sbd needs at least a single shared disk.
For the single-shared-disk sbd setup to work with two_node, you need the
patch from https://github.com/ClusterLabs/sbd/pull/23 in place.
(Saw you mentioning the RHEL documentation - RHEL 7.4 has had it in
since GA.)

Regards,
Klaus
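As an illustration (not from Klaus's message): on CentOS/RHEL 7 the
disk-based sbd variant roughly comes down to the steps below. The device
path is a placeholder for whatever small LUN both nodes can see, and the
exact agent options should be checked against 'pcs stonith describe
fence_sbd' and the sbd man page for your versions.

On both nodes, point sbd at the shared disk and the watchdog in
/etc/sysconfig/sbd:

    SBD_DEVICE="/dev/disk/by-id/my-shared-lun"
    SBD_WATCHDOG_DEV=/dev/watchdog

Initialize the disk once, from either node, and enable the daemon on
both nodes so it starts with the cluster stack:

    # sbd -d /dev/disk/by-id/my-shared-lun create
    # systemctl enable sbd

Then add a fencing resource that uses the same disk and re-enable
stonith:

    # pcs stonith create fence-sbd fence_sbd devices=/dev/disk/by-id/my-shared-lun
    # pcs property set stonith-enabled=true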
>
>> I will test my RA again as suggested with no-quorum-policy=default.
>>
>> One more doubt.
>> Why do we see this in 'pcs property'?
>> last-lrm-refresh: 1504090367
>>
>> Never seen this on a healthy cluster.
>> From the RHEL documentation:
>>
>>     last-lrm-refresh    Last refresh of the Local Resource Manager,
>>                         given in units of seconds since epoch. Used
>>                         for diagnostic purposes; not
>>                         user-configurable.
>>
>> Doesn't explain much.
> Whenever a cluster property changes, the cluster rechecks the current
> state to see if anything needs to be done. last-lrm-refresh is just a
> dummy property that the cluster uses to trigger that. It's set in
> certain rare circumstances when a resource cleanup is done. You should
> see a line in your logs like "Triggering a refresh after ... deleted
> ... from the LRM". That might give some idea of why.
>
>> Also, does avg. CPU load impact resource monitoring?
>>
>> Regards,
>> Abhay
> Well, it could cause the monitor to take so long that it times out. The
> only direct effect of load on pacemaker is that the cluster might lower
> the number of agent actions that it can execute simultaneously.
>
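For anyone reading along, a few illustrative commands (not from the
thread itself) to poke at this on CentOS 7 with pcs, using the resource
name and log path from the configuration shown earlier:

    # pcs resource failcount show SVSDEHA
    # pcs resource cleanup SVSDEHA
    # pcs property | grep last-lrm-refresh
    # grep "Triggering a refresh" /var/log/cluster/corosync.log

The cleanup clears the fail counts that migration-threshold counts
against; in the cases Ken describes it also bumps last-lrm-refresh, and
the recheck that follows is what re-evaluates where resources are
allowed to run.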
>
>> On Thu, 31 Aug 2017 at 20:11 Ken Gaillot <kgail...@redhat.com> wrote:
>>
>> On Thu, 2017-08-31 at 06:41 +0000, Abhay B wrote:
>> > Hi,
>> >
>> > I have a 2-node HA cluster configured on CentOS 7 with the pcs
>> > command.
>> >
>> > Below are the properties of the cluster:
>> >
>> > # pcs property
>> > Cluster Properties:
>> >  cluster-infrastructure: corosync
>> >  cluster-name: SVSDEHA
>> >  cluster-recheck-interval: 2s
>> >  dc-deadtime: 5
>> >  dc-version: 1.1.15-11.el7_3.5-e174ec8
>> >  have-watchdog: false
>> >  last-lrm-refresh: 1504090367
>> >  no-quorum-policy: ignore
>> >  start-failure-is-fatal: false
>> >  stonith-enabled: false
>> >
>> > PFA the cib.
>> > Also attached is the corosync.log around the time the below issue
>> > happened.
>> >
>> > After around 10 hrs and multiple failures, pacemaker stops
>> > monitoring the resource on one of the nodes in the cluster.
>> >
>> > So even though the resource on the other node fails, it is never
>> > migrated to the node on which the resource is not monitored.
>> >
>> > Wanted to know what could have triggered this and how to avoid
>> > getting into such scenarios.
>> > I am going through the logs and couldn't find why this happened.
>> >
>> > After this log the monitoring stopped.
>> >
>> > Aug 29 11:01:44 [16500] TPC-D12-10-002.phaedrus.sandvine.com
>> > crmd: info: process_lrm_event: Result of monitor operation for
>> > SVSDEHA on TPC-D12-10-002.phaedrus.sandvine.com: 0 (ok) | call=538
>> > key=SVSDEHA_monitor_2000 confirmed=false cib-update=50013
>>
>> Are you sure the monitor stopped? Pacemaker only logs recurring
>> monitors when the status changes. Any successful monitors after this
>> wouldn't be logged.
>>
>> > Below log says the resource is leaving the cluster.
>> > Aug 29 11:01:44 [16499] TPC-D12-10-002.phaedrus.sandvine.com
>> > pengine: info: LogActions: Leave SVSDEHA:0 (Slave
>> > TPC-D12-10-002.phaedrus.sandvine.com)
>>
>> This means that the cluster will leave the resource where it is
>> (i.e. it doesn't need a start, stop, move, demote, promote, etc.).
>>
>> > Let me know if anything more is needed.
>> >
>> > Regards,
>> > Abhay
>> >
>> > PS: 'pcs resource cleanup' brought the cluster back into a good
>> > state.
>>
>> There are a lot of resource action failures, so I'm not sure where
>> the issue is, but I'm guessing it has to do with
>> migration-threshold=1 -- once a resource has failed once on a node,
>> it won't be allowed back on that node until the failure is cleaned
>> up. Of course you also have failure-timeout=1s, which should clean it
>> up immediately, so I'm not sure.
>>
>> My gut feeling is that you're trying to do too many things at once.
>> I'd start over from scratch and proceed more slowly: first, set
>> "two_node: 1" in corosync.conf and let no-quorum-policy default in
>> pacemaker; then, get stonith configured, tested, and enabled; then,
>> test your resource agent manually on the command line to make sure it
>> conforms to the expected return values
>> (http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#ap-ocf);
>> then add your resource to the cluster without migration-threshold or
>> failure-timeout, and work out any issues with frequent failures; then
>> finally set migration-threshold and failure-timeout to reflect how
>> you want recovery to proceed.
>> --
>> Ken Gaillot <kgail...@redhat.com>

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
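To illustrate the "test your resource agent manually" step from Ken's
list: an OCF agent can be driven directly from a shell, roughly as
below. The provider/agent path and the parameter name are placeholders,
and ocf-tester ships with the resource-agents package.

    # export OCF_ROOT=/usr/lib/ocf
    # export OCF_RESKEY_<param>=<value>
    # /usr/lib/ocf/resource.d/<provider>/<agent> start; echo $?
    # /usr/lib/ocf/resource.d/<provider>/<agent> monitor; echo $?
    # /usr/lib/ocf/resource.d/<provider>/<agent> stop; echo $?

    # ocf-tester -n SVSDEHA -o <param>=<value> \
          /usr/lib/ocf/resource.d/<provider>/<agent>

For a master/slave resource like SVSDEHA, monitor should return 0 for a
running slave, 8 for a running master, and 7 when the resource is
cleanly stopped, per the return codes in the Pacemaker Explained
appendix linked above.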