Hi Ken,

Thanks for the reply.
On Thu, Oct 8, 2015 at 8:13 PM, Ken Gaillot <[email protected]> wrote:

> On 10/02/2015 01:47 PM, Pritam Kharat wrote:
> > Hi,
> >
> > I have set up an ACTIVE/PASSIVE HA
> >
> > *Issue 1)*
> >
> > *corosync.conf* file is
> >
> > # Please read the openais.conf.5 manual page
> > totem {
> >         version: 2
> >
> >         # How long before declaring a token lost (ms)
> >         token: 10000
> >
> >         # How many token retransmits before forming a new configuration
> >         token_retransmits_before_loss_const: 20
> >
> >         # How long to wait for join messages in the membership protocol (ms)
> >         join: 10000
> >
> >         # How long to wait for consensus to be achieved before starting a
> >         # new round of membership configuration (ms)
> >         consensus: 12000
> >
> >         # Turn off the virtual synchrony filter
> >         vsftype: none
> >
> >         # Number of messages that may be sent by one processor on receipt
> >         # of the token
> >         max_messages: 20
> >
> >         # Limit generated nodeids to 31-bits (positive signed integers)
> >         clear_node_high_bit: yes
> >
> >         # Disable encryption
> >         secauth: off
> >
> >         # How many threads to use for encryption/decryption
> >         threads: 0
> >
> >         # Optionally assign a fixed node id (integer)
> >         # nodeid: 1234
> >
> >         # This specifies the mode of redundant ring, which may be none,
> >         # active, or passive.
> >         rrp_mode: none
> >
> >         interface {
> >                 # The following values need to be set based on your environment
> >                 ringnumber: 0
> >                 bindnetaddr: 192.168.101.0
> >                 mcastport: 5405
> >         }
> >
> >         transport: udpu
> > }
> >
> > amf {
> >         mode: disabled
> > }
> >
> > quorum {
> >         # Quorum for the Pacemaker Cluster Resource Manager
> >         provider: corosync_votequorum
> >         expected_votes: 1
>
> If you're using a recent version of corosync, use "two_node: 1" instead
> of "expected_votes: 1", and get rid of "no-quorum-policy: ignore" in the
> pacemaker cluster options.

-> We are using corosync version 2.3.3. Do we need to make the above-mentioned change for this version?
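If that change does apply to 2.3.3, I am assuming the quorum section would end up looking roughly like the sketch below, and that we would then set no-quorum-policy back to its default instead of ignoring quorum; please correct me if I have this wrong:

quorum {
        # Quorum for the Pacemaker Cluster Resource Manager
        provider: corosync_votequorum
        # two_node also enables wait_for_all by default, as far as I
        # understand from votequorum(5)
        two_node: 1
}

# and in pacemaker, revert the override ("stop" is the default policy)
crm configure property no-quorum-policy=stop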
> > }
> >
> > nodelist {
> >         node {
> >                 ring0_addr: 192.168.101.73
> >         }
> >
> >         node {
> >                 ring0_addr: 192.168.101.74
> >         }
> > }
> >
> > aisexec {
> >         user: root
> >         group: root
> > }
> >
> > logging {
> >         fileline: off
> >         to_stderr: yes
> >         to_logfile: yes
> >         to_syslog: yes
> >         syslog_facility: daemon
> >         logfile: /var/log/corosync/corosync.log
> >         debug: off
> >         timestamp: on
> >         logger_subsys {
> >                 subsys: AMF
> >                 debug: off
> >                 tags: enter|leave|trace1|trace2|trace3|trace4|trace6
> >         }
> > }
> >
> > And I have added 5 resources - 1 is a VIP and 4 are upstart jobs.
> > Node names are configured as -> sc-node-1 (ACTIVE) and sc-node-2 (PASSIVE)
> > Resources are running on the ACTIVE node
> >
> > Default cluster properties -
> >
> > <cluster_property_set id="cib-bootstrap-options">
> >   <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.10-42f2063"/>
> >   <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="corosync"/>
> >   <nvpair name="no-quorum-policy" value="ignore" id="cib-bootstrap-options-no-quorum-policy"/>
> >   <nvpair name="stonith-enabled" value="false" id="cib-bootstrap-options-stonith-enabled"/>
> >   <nvpair name="cluster-recheck-interval" value="3min" id="cib-bootstrap-options-cluster-recheck-interval"/>
> >   <nvpair name="default-action-timeout" value="120s" id="cib-bootstrap-options-default-action-timeout"/>
> > </cluster_property_set>
> >
> > But sometimes, after 2-3 migrations from ACTIVE to STANDBY and then from
> > STANDBY to ACTIVE, both nodes become OFFLINE and Current DC becomes None.
> > I have disabled the stonith property and even quorum is ignored.
>
> Disabling stonith isn't helping you. The cluster needs stonith to
> recover from difficult situations, so it's easier to get into weird
> states like this without it.
>
> > root@sc-node-2:/usr/lib/python2.7/dist-packages/sc# crm status
> > Last updated: Sat Oct  3 00:01:40 2015
> > Last change: Fri Oct  2 23:38:28 2015 via crm_resource on sc-node-1
> > Stack: corosync
> > Current DC: NONE
> > 2 Nodes configured
> > 5 Resources configured
> >
> > OFFLINE: [ sc-node-1 sc-node-2 ]
> >
> > What is going wrong here? What is the reason for Current DC suddenly
> > becoming None? Is corosync.conf okay? Are the default cluster properties
> > fine? Help will be appreciated.
>
> I'd recommend seeing how the problem behaves with stonith enabled, but
> in any case you'll need to dive into the logs to figure out what starts
> the chain of events.

-> We are seeing this issue when we try rebooting the VMs.
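We will retry the reboot test with stonith enabled and collect the logs. Just to check my understanding of the shape of it, I imagine it would be roughly the following once we have picked a fence agent that can actually power-cycle these VMs (fence_ipmilan and all the parameter values below are only placeholders for illustration, not our real setup):

# 1) configure a fence device; the agent and its params depend on the environment
crm configure primitive fence-sc-node-1 stonith:fence_ipmilan \
        params pcmk_host_list="sc-node-1" ipaddr="<bmc-address>" \
        login="<user>" passwd="<secret>" \
        op monitor interval=60s

# 2) keep the fence device off the node it is meant to fence
crm configure location fence-sc-node-1-placement fence-sc-node-1 -inf: sc-node-1

# 3) only then turn stonith back on
crm configure property stonith-enabled=true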
> > *Issue 2)*
> > Command used to add an upstart job is
> >
> > crm configure primitive service upstart:service meta allow-migrate=true \
> >         migration-threshold=5 failure-timeout=30s op monitor interval=15s timeout=60s
> >
> > But still sometimes I see the fail count going to INFINITY. Why? How can we
> > avoid it? The resource should have migrated as soon as it reached the
> > migration threshold.
> >
> > * Node sc-node-2:
> >    service: migration-threshold=5 fail-count=1000000 last-failure='Fri Oct  2 23:38:53 2015'
> >    service1: migration-threshold=5 fail-count=1000000 last-failure='Fri Oct  2 23:38:53 2015'
> >
> > Failed actions:
> >     service_start_0 (node=sc-node-2, call=-1, rc=1, status=Timed Out,
> >         last-rc-change=Fri Oct  2 23:38:53 2015, queued=0ms, exec=0ms): unknown error
> >     service1_start_0 (node=sc-node-2, call=-1, rc=1, status=Timed Out,
> >         last-rc-change=Fri Oct  2 23:38:53 2015, queued=0ms, exec=0ms
>
> migration-threshold is used for monitor failures, not (by default) start
> or stop failures.
>
> This is a start failure, which (by default) makes the fail-count go to
> infinity. The rationale is that a monitor failure indicates some sort of
> temporary error, but failing to start could well mean that something is
> wrong with the installation or configuration.
>
> You can tell the cluster to apply migration-threshold to start failures
> too, by setting the start-failure-is-fatal=false cluster option.
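-> Thank you, that explains the INFINITY fail-count. So if we want migration-threshold to cover start failures as well, I understand we would set the cluster option like this and then clear the old fail counts (untested on our setup so far):

crm configure property start-failure-is-fatal=false

# clear the existing fail counts so the resources can be tried again
crm resource cleanup service
crm resource cleanup service1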
--
Thanks and Regards,
Pritam Kharat.

_______________________________________________
Users mailing list: [email protected]
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
