Hi Ken,

Please see my inline comments on the last mail below.
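Before the inline comments: to confirm my understanding of the quorum suggestion, I believe the change for corosync 2.x would look like the sketch below (based on my reading of the votequorum documentation; not yet applied on our nodes):

```
quorum {
        # Quorum for the Pacemaker Cluster Resource Manager
        provider: corosync_votequorum

        # On corosync 2.x, two_node: 1 replaces expected_votes: 1 for
        # two-node clusters; it also enables wait_for_all by default
        two_node: 1
}
```

With this in place, my understanding is that the no-quorum-policy=ignore property in Pacemaker should no longer be needed.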
On Thu, Oct 8, 2015 at 8:25 PM, Pritam Kharat <pritam.kha...@oneconvergence.com> wrote:
> Hi Ken,
>
> Thanks for the reply.
>
> On Thu, Oct 8, 2015 at 8:13 PM, Ken Gaillot <kgail...@redhat.com> wrote:
>> On 10/02/2015 01:47 PM, Pritam Kharat wrote:
>> > Hi,
>> >
>> > I have set up an ACTIVE/PASSIVE HA.
>> >
>> > *Issue 1)*
>> >
>> > The *corosync.conf* file is:
>> >
>> > # Please read the openais.conf.5 manual page
>> >
>> > totem {
>> >         version: 2
>> >
>> >         # How long before declaring a token lost (ms)
>> >         token: 10000
>> >
>> >         # How many token retransmits before forming a new configuration
>> >         token_retransmits_before_loss_const: 20
>> >
>> >         # How long to wait for join messages in the membership protocol (ms)
>> >         join: 10000
>> >
>> >         # How long to wait for consensus to be achieved before starting
>> >         # a new round of membership configuration (ms)
>> >         consensus: 12000
>> >
>> >         # Turn off the virtual synchrony filter
>> >         vsftype: none
>> >
>> >         # Number of messages that may be sent by one processor on
>> >         # receipt of the token
>> >         max_messages: 20
>> >
>> >         # Limit generated nodeids to 31-bits (positive signed integers)
>> >         clear_node_high_bit: yes
>> >
>> >         # Disable encryption
>> >         secauth: off
>> >
>> >         # How many threads to use for encryption/decryption
>> >         threads: 0
>> >
>> >         # Optionally assign a fixed node id (integer)
>> >         # nodeid: 1234
>> >
>> >         # This specifies the mode of redundant ring, which may be none,
>> >         # active, or passive.
>> >         rrp_mode: none
>> >
>> >         interface {
>> >                 # The following values need to be set based on your
>> >                 # environment
>> >                 ringnumber: 0
>> >                 bindnetaddr: 192.168.101.0
>> >                 mcastport: 5405
>> >         }
>> >
>> >         transport: udpu
>> > }
>> >
>> > amf {
>> >         mode: disabled
>> > }
>> >
>> > quorum {
>> >         # Quorum for the Pacemaker Cluster Resource Manager
>> >         provider: corosync_votequorum
>> >         expected_votes: 1
>>
>> If you're using a recent version of corosync, use "two_node: 1" instead
>> of "expected_votes: 1", and get rid of "no-quorum-policy: ignore" in the
>> Pacemaker cluster options.
>
> -> We are using corosync version 2.3.3. Do we need the above-mentioned
> change for this version?
>
>> > }
>> >
>> > nodelist {
>> >         node {
>> >                 ring0_addr: 192.168.101.73
>> >         }
>> >
>> >         node {
>> >                 ring0_addr: 192.168.101.74
>> >         }
>> > }
>> >
>> > aisexec {
>> >         user: root
>> >         group: root
>> > }
>> >
>> > logging {
>> >         fileline: off
>> >         to_stderr: yes
>> >         to_logfile: yes
>> >         to_syslog: yes
>> >         syslog_facility: daemon
>> >         logfile: /var/log/corosync/corosync.log
>> >         debug: off
>> >         timestamp: on
>> >         logger_subsys {
>> >                 subsys: AMF
>> >                 debug: off
>> >                 tags: enter|leave|trace1|trace2|trace3|trace4|trace6
>> >         }
>> > }
>> >
>> > And I have added 5 resources: 1 is a VIP and 4 are upstart jobs.
>> > Node names are configured as sc-node-1 (ACTIVE) and sc-node-2 (PASSIVE).
>> > Resources are running on the ACTIVE node.
>> >
>> > Default cluster properties:
>> >
>> > <cluster_property_set id="cib-bootstrap-options">
>> >   <nvpair id="cib-bootstrap-options-dc-version" name="dc-version"
>> >           value="1.1.10-42f2063"/>
>> >   <nvpair id="cib-bootstrap-options-cluster-infrastructure"
>> >           name="cluster-infrastructure" value="corosync"/>
>> >   <nvpair name="no-quorum-policy" value="ignore"
>> >           id="cib-bootstrap-options-no-quorum-policy"/>
>> >   <nvpair name="stonith-enabled" value="false"
>> >           id="cib-bootstrap-options-stonith-enabled"/>
>> >   <nvpair name="cluster-recheck-interval"
>> >           value="3min" id="cib-bootstrap-options-cluster-recheck-interval"/>
>> >   <nvpair name="default-action-timeout" value="120s"
>> >           id="cib-bootstrap-options-default-action-timeout"/>
>> > </cluster_property_set>
>> >
>> > But sometimes, after 2-3 migrations from ACTIVE to STANDBY and then from
>> > STANDBY to ACTIVE, both nodes become OFFLINE and the Current DC becomes
>> > None. I have disabled the stonith property and even quorum is ignored.
>>
>> Disabling stonith isn't helping you. The cluster needs stonith to
>> recover from difficult situations, so it's easier to get into weird
>> states like this without it.
>>
>> > root@sc-node-2:/usr/lib/python2.7/dist-packages/sc# crm status
>> > Last updated: Sat Oct 3 00:01:40 2015
>> > Last change: Fri Oct 2 23:38:28 2015 via crm_resource on sc-node-1
>> > Stack: corosync
>> > Current DC: NONE
>> > 2 Nodes configured
>> > 5 Resources configured
>> >
>> > OFFLINE: [ sc-node-1 sc-node-2 ]
>> >
>> > What is going wrong here? What is the reason for the Current DC suddenly
>> > becoming None? Is corosync.conf okay? Are the default cluster properties
>> > fine? Help will be appreciated.
>>
>> I'd recommend seeing how the problem behaves with stonith enabled, but
>> in any case you'll need to dive into the logs to figure out what starts
>> the chain of events.
>
> -> We are seeing this issue when we try rebooting the VMs.
>
>> > *Issue 2)*
>> >
>> > The command used to add an upstart job is:
>> >
>> > crm configure primitive service upstart:service meta allow-migrate=true
>> > migration-threshold=5 failure-timeout=30s op monitor interval=15s
>> > timeout=60s
>> >
>> > But still I sometimes see the fail count going to INFINITY. Why? How can
>> > we avoid it? The resource should have migrated as soon as it reached the
>> > migration threshold.
>> > * Node sc-node-2:
>> >    service: migration-threshold=5 fail-count=1000000 last-failure='Fri
>> > Oct 2 23:38:53 2015'
>> >    service1: migration-threshold=5 fail-count=1000000 last-failure='Fri
>> > Oct 2 23:38:53 2015'
>> >
>> > Failed actions:
>> >     service_start_0 (node=sc-node-2, call=-1, rc=1, status=Timed Out,
>> > last-rc-change=Fri Oct 2 23:38:53 2015, queued=0ms, exec=0ms
>> > ): unknown error
>> >     service1_start_0 (node=sc-node-2, call=-1, rc=1, status=Timed Out,
>> > last-rc-change=Fri Oct 2 23:38:53 2015, queued=0ms, exec=0ms
>> > ): unknown error
>>
>> migration-threshold is used for monitor failures, not (by default) start
>> or stop failures.
>>
>> This is a start failure, which (by default) makes the fail-count go to
>> infinity. The rationale is that a monitor failure indicates some sort of
>> temporary error, but failing to start could well mean that something is
>> wrong with the installation or configuration.
>>
>> You can tell the cluster to apply migration-threshold to start failures
>> too, by setting the start-failure-is-fatal=false cluster option.
>
> --
> Thanks and Regards,
> Pritam Kharat.

--
Thanks and Regards,
Pritam Kharat.
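P.S. For anyone following along, this is what I plan to run for Issue 2 based on the suggestion above (a sketch using the same crm shell as the primitive command; it assumes our resource names service and service1, and is not yet verified on our setup):

```shell
# Apply migration-threshold to start failures as well, instead of
# treating a start failure as fatal (fail-count=INFINITY)
crm configure property start-failure-is-fatal=false

# Clear the existing INFINITY fail-counts so the resources can be retried
crm resource cleanup service
crm resource cleanup service1
```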
_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org