Could someone please reply to this query?
On Sat, Oct 3, 2015 at 12:17 AM, Pritam Kharat <[email protected]> wrote:

> Hi,
>
> I have set up an ACTIVE/PASSIVE HA.
>
> *Issue 1)*
>
> The *corosync.conf* file is:
>
> # Please read the openais.conf.5 manual page
> totem {
>         version: 2
>
>         # How long before declaring a token lost (ms)
>         token: 10000
>
>         # How many token retransmits before forming a new configuration
>         token_retransmits_before_loss_const: 20
>
>         # How long to wait for join messages in the membership protocol (ms)
>         join: 10000
>
>         # How long to wait for consensus to be achieved before starting
>         # a new round of membership configuration (ms)
>         consensus: 12000
>
>         # Turn off the virtual synchrony filter
>         vsftype: none
>
>         # Number of messages that may be sent by one processor on receipt of the token
>         max_messages: 20
>
>         # Limit generated nodeids to 31-bits (positive signed integers)
>         clear_node_high_bit: yes
>
>         # Disable encryption
>         secauth: off
>
>         # How many threads to use for encryption/decryption
>         threads: 0
>
>         # Optionally assign a fixed node id (integer)
>         # nodeid: 1234
>
>         # This specifies the mode of redundant ring, which may be none, active, or passive.
>         rrp_mode: none
>
>         interface {
>                 # The following values need to be set based on your environment
>                 ringnumber: 0
>                 bindnetaddr: 192.168.101.0
>                 mcastport: 5405
>         }
>
>         transport: udpu
> }
>
> amf {
>         mode: disabled
> }
>
> quorum {
>         # Quorum for the Pacemaker Cluster Resource Manager
>         provider: corosync_votequorum
>         expected_votes: 1
> }
>
> nodelist {
>         node {
>                 ring0_addr: 192.168.101.73
>         }
>
>         node {
>                 ring0_addr: 192.168.101.74
>         }
> }
>
> aisexec {
>         user: root
>         group: root
> }
>
> logging {
>         fileline: off
>         to_stderr: yes
>         to_logfile: yes
>         to_syslog: yes
>         syslog_facility: daemon
>         logfile: /var/log/corosync/corosync.log
>         debug: off
>         timestamp: on
>         logger_subsys {
>                 subsys: AMF
>                 debug: off
>                 tags: enter|leave|trace1|trace2|trace3|trace4|trace6
>         }
> }
>
> And I have added 5 resources: 1 is a VIP and 4 are upstart jobs.
> Node names are configured as sc-node-1 (ACTIVE) and sc-node-2 (PASSIVE).
> Resources are running on the ACTIVE node.
>
> Default cluster properties:
>
> <cluster_property_set id="cib-bootstrap-options">
>   <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.10-42f2063"/>
>   <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="corosync"/>
>   <nvpair name="no-quorum-policy" value="ignore" id="cib-bootstrap-options-no-quorum-policy"/>
>   <nvpair name="stonith-enabled" value="false" id="cib-bootstrap-options-stonith-enabled"/>
>   <nvpair name="cluster-recheck-interval" value="3min" id="cib-bootstrap-options-cluster-recheck-interval"/>
>   <nvpair name="default-action-timeout" value="120s" id="cib-bootstrap-options-default-action-timeout"/>
> </cluster_property_set>
>
> But sometimes, after 2-3 migrations from ACTIVE to STANDBY and then from STANDBY back to ACTIVE,
> both nodes become OFFLINE and the Current DC becomes None. I have disabled the stonith property
> and quorum is ignored.
>
> root@sc-node-2:/usr/lib/python2.7/dist-packages/sc# crm status
> Last updated: Sat Oct 3 00:01:40 2015
> Last change: Fri Oct 2 23:38:28 2015 via crm_resource on sc-node-1
> Stack: corosync
> Current DC: NONE
> 2 Nodes configured
> 5 Resources configured
>
> OFFLINE: [ sc-node-1 sc-node-2 ]
>
> What is going wrong here? What is the reason for the Current DC suddenly becoming None?
> Is corosync.conf okay? Are the default cluster properties fine?
> Help will be appreciated.
>
> *Issue 2)*
>
> The command used to add an upstart job is:
>
> crm configure primitive service upstart:service meta allow-migrate=true \
>     migration-threshold=5 failure-timeout=30s op monitor interval=15s timeout=60s
>
> But still, sometimes I see the fail count going to INFINITY. Why? How can we avoid it?
> The resource should have migrated as soon as it reached the migration threshold.
>
> * Node sc-node-2:
>    service: migration-threshold=5 fail-count=1000000 last-failure='Fri Oct 2 23:38:53 2015'
>    service1: migration-threshold=5 fail-count=1000000 last-failure='Fri Oct 2 23:38:53 2015'
>
> Failed actions:
>     service_start_0 (node=sc-node-2, call=-1, rc=1, status=Timed Out,
>         last-rc-change=Fri Oct 2 23:38:53 2015, queued=0ms, exec=0ms): unknown error
>     service1_start_0 (node=sc-node-2, call=-1, rc=1, status=Timed Out,
>         last-rc-change=Fri Oct 2 23:38:53 2015, queued=0ms, exec=0ms)
>
> --
> Thanks and Regards,
> Pritam Kharat.

--
Thanks and Regards,
Pritam Kharat.
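
[Editor's note on Issue 1: for reference, a two-node votequorum stanza is commonly written along the
following lines. This is only an illustrative sketch using standard corosync votequorum options; it is
not taken from the post above and not necessarily what this particular setup needs.]

quorum {
        # Quorum for the Pacemaker Cluster Resource Manager
        provider: corosync_votequorum
        # Two votes are expected in a two-node cluster; two_node: 1 lets the
        # cluster retain quorum when one node is lost (it also implies
        # wait_for_all at startup).
        expected_votes: 2
        two_node: 1
}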
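
[Editor's note on Issue 2: the failed actions above are start timeouts, and by default Pacemaker treats
a failed start as fatal on that node (start-failure-is-fatal defaults to true), which pushes the fail
count straight to INFINITY regardless of migration-threshold. For reference, a primitive with explicit
start/stop operation timeouts can be written roughly as follows; this is a sketch that reuses the
resource name from the command above, and the start/stop timeout values are assumptions, not values
from the original post.]

crm configure primitive service upstart:service \
        meta allow-migrate=true migration-threshold=5 failure-timeout=30s \
        op start timeout=120s interval=0 \
        op stop timeout=60s interval=0 \
        op monitor interval=15s timeout=60s

[A stale fail count for such a resource can then be cleared with, for example,
"crm resource cleanup service".]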
