Re: [ClusterLabs] Cluster node getting stopped from other node(resending mail)

2015-07-07 Thread Ken Gaillot
On 07/07/2015 03:58 AM, Arjun Pandey wrote:
> Hi Ken
> 
> If I look at the logs on the other node around the same time, I see this. I
> can't figure out the reason based on these. Attaching the corosync.log for
> the other node as well.

I don't see anything there either. The relevant part could be earlier or
later.

Are you using the same version of pacemaker and corosync on both nodes?
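A quick way to confirm is to compare the `rpm -q` output from each node. A minimal sketch (the helper name and the illustrative version strings are mine, not from the thread; on the real cluster the two arguments would be the `rpm -q` output gathered from each node, e.g. over ssh):

```shell
# Hedged sketch: flag a package version mismatch between two nodes.
# The version strings below are only illustrative examples.
same_version() {
  # Compare two rpm -q outputs; report a mismatch when they differ.
  if [ "$1" = "$2" ]; then
    echo "OK: $1"
  else
    echo "MISMATCH: $1 vs $2"
  fi
}

same_version "pacemaker-1.1.12-4.el6.x86_64" "pacemaker-1.1.12-4.el6.x86_64"
same_version "corosync-1.4.1-17.el6.x86_64"  "corosync-1.4.1-14.el6.x86_64"
```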

> Jun 01 13:55:44 [1965] messi   crmd: info: do_dc_join_offer_one:
>  An unknown node joined - (re-)offer to any unconfirmed nodes
> Jun 01 13:55:44 [1965] messi   crmd: info: join_make_offer:
> Making join offers based on membership 224
> Jun 01 13:55:44 [1965] messi   crmd: info: join_make_offer:
> Skipping messi: already known 4
> Jun 01 13:55:44 [1965] messi   crmd: info: join_make_offer:
> join-2: Sending offer to ronaldo
> Jun 01 13:55:44 [1960] messi    cib: info: cib_process_request:
> Forwarding cib_modify operation for section status to master
> (origin=local/crmd/138)
> Jun 01 13:55:44 [1960] messi    cib: info: cib_perform_op:
>  Diff: --- 0.80.2 2
> Jun 01 13:55:44 [1960] messi    cib: info: cib_perform_op:
>  Diff: +++ 0.80.3 (null)
> Jun 01 13:55:44 [1965] messi   crmd: info: crm_update_peer_join:
>  join_make_offer: Node ronaldo[2] - join-2 phase 0 -> 1
> Jun 01 13:55:44 [1965] messi   crmd: info: abort_transition_graph:
>  Transition aborted: Peer Halt (source=do_te_invoke:158, 1)
> Jun 01 13:55:44 [1960] messi    cib: info: cib_perform_op:  +
>  /cib:  @num_updates=3
> Jun 01 13:55:44 [1960] messi    cib: info: cib_perform_op:  +
>  /cib/status/node_state[@id='ronaldo']:  @crmd=online,
> @crm-debug-origin=peer_update_callback
> Jun 01 13:55:44 [1960] messi    cib: info: cib_process_request:
> Completed cib_modify operation for section status: OK (rc=0,
> origin=messi/crmd/138, version=0.80.3)
> Jun 01 13:55:44 [1960] messi    cib: info: cib_process_request:
> Completed cib_modify operation for section nodes: OK (rc=0,
> origin=ronaldo/crmd/3, version=0.80.3)
> Jun 01 13:55:45 [1965] messi   crmd: info: do_dc_join_offer_one:
>  join-2: Processing join_announce request from ronaldo in state
> S_INTEGRATION
> Jun 01 13:55:45 [1965] messi   crmd: info: crm_update_peer_join:
>  do_dc_join_offer_one: Node ronaldo[2] - join-2 phase 1 -> 0
> Jun 01 13:55:45 [1965] messi   crmd: info: join_make_offer:
> join-2: Sending offer to ronaldo
> Jun 01 13:55:45 [1965] messi   crmd: info: crm_update_peer_join:
>  join_make_offer: Node ronaldo[2] - join-2 phase 0 -> 1
> Jun 01 13:55:45 [1965] messi   crmd: info: crm_update_peer_join:
>  join_make_offer: Node messi[1] - join-2 phase 4 -> 0
> Jun 01 13:55:45 [1965] messi   crmd: info: join_make_offer:
> join-2: Sending offer to messi
> Jun 01 13:55:45 [1965] messi   crmd: info: crm_update_peer_join:
>  join_make_offer: Node messi[1] - join-2 phase 0 -> 1
> Jun 01 13:55:45 [1965] messi   crmd: info: abort_transition_graph:
>  Transition aborted: Node join (source=do_dc_join_offer_one:236, 1)
> Jun 01 13:55:45 [1965] messi   crmd: info: crm_update_peer_join:
>  do_dc_join_filter_offer: Node messi[1] - join-2 phase 1 -> 2
> Jun 01 13:55:46 [1965] messi   crmd: info: crm_update_peer_join:
>  do_dc_join_filter_offer: Node ronaldo[2] - join-2 phase 1 -> 2
> Jun 01 13:55:46 [1965] messi   crmd: info:
> crm_update_peer_expected:do_dc_join_filter_offer: Node ronaldo[2] -
> expected state is now member (was down)
> Jun 01 13:55:46 [1965] messi   crmd: info: do_state_transition:
> State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED
> cause=C_FSA_INTERNAL origin=check_join_state ]
> Jun 01 13:55:46 [1965] messi   crmd: info: crmd_join_phase_log:
> join-2: messi=integrated
> Jun 01 13:55:46 [1965] messi   crmd: info: crmd_join_phase_log:
> join-2: ronaldo=integrated
> Jun 01 13:55:46 [1965] messi   crmd: info: do_dc_join_finalize:
> join-2: Syncing our CIB to the rest of the cluster
> Jun 01 13:55:46 [1965] messi   crmd: info: crm_update_peer_join:
>  finalize_join_for: Node messi[1] - join-2 phase 2 -> 3
> Jun 01 13:55:46 [1965] messi   crmd: info: crm_update_peer_join:
>  finalize_join_for: Node ronaldo[2] - join-2 phase 2 -> 3
> Jun 01 13:55:46 [1965] messi   crmd: info: crm_update_peer_join:
>  do_dc_join_ack: Node messi[1] - join-2 phase 3 -> 4
> Jun 01 13:55:46 [1965] messi   crmd: info: do_dc_join_ack:
>  join-2: Updating node state to member for messi
> Jun 01 13:55:46 [1965] messi   crmd: info: erase_status_tag:
>  Deleting xpath: //node_state[@uname='messi']/lrm
> Jun 01 13:55:46 [1960] messi    cib: info: cib_process_replace:
> Digest matched on replace from messi: 5138b696984c7b834dd2b528dadabe

Re: [ClusterLabs] Cluster node getting stopped from other node(resending mail)

2015-07-01 Thread Ken Gaillot
On 06/30/2015 11:30 PM, Arjun Pandey wrote:
> Hi
> 
> I am running a 2-node cluster with this config on CentOS 6.5/6.6:
> 
> Master/Slave Set: foo-master [foo]
> Masters: [ messi ]
> Stopped: [ ronaldo ]
>  eth1-CP    (ocf::pw:IPaddr):   Started messi
>  eth2-UP    (ocf::pw:IPaddr):   Started messi
>  eth3-UPCP  (ocf::pw:IPaddr):   Started messi
> 
> where I have a multi-state resource foo running in master/slave mode, and
> the IPaddr RA is just a modified IPaddr2 RA. Additionally I have a
> colocation constraint for the IP addresses to be colocated with the master.
> 
> Sometimes when I set up the cluster, I find that one of the nodes (the
> second node that joins) gets stopped, and I find this log.
> 
> 2015-06-01T13:55:46.153941+05:30 ronaldo pacemaker: Starting Pacemaker
> Cluster Manager
> 2015-06-01T13:55:46.233639+05:30 ronaldo attrd[25988]:   notice:
> attrd_trigger_update: Sending flush op to all hosts for: shutdown (0)
> 2015-06-01T13:55:46.234162+05:30 ronaldo crmd[25990]:   notice:
> do_state_transition: State transition S_PENDING -> S_NOT_DC [
> input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
> 2015-06-01T13:55:46.234701+05:30 ronaldo attrd[25988]:   notice:
> attrd_local_callback: Sending full refresh (origin=crmd)
> 2015-06-01T13:55:46.234708+05:30 ronaldo attrd[25988]:   notice:
> attrd_trigger_update: Sending flush op to all hosts for: shutdown (0)
> **** This looks to be the likely reason ****
> 2015-06-01T13:55:46.254310+05:30 ronaldo crmd[25990]:   error:
> handle_request: We didn't ask to be shut down, yet our DC is telling us too.
> ****

Hi Arjun,

I'd check the other node's logs at this time, to see why it requested
the shutdown.
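In practice that means grepping the DC's log around the event timestamp. A sketch of the kind of filtering involved (the excerpt below is copied from this thread; on the real DC you would run the same greps against its corosync.log, whose path varies by distribution):

```shell
# Sketch: filter log lines around the event time for join/shutdown messages.
# The excerpt is taken from the logs quoted in this thread.
log_excerpt='Jun 01 13:55:44 [1965] messi crmd: info: join_make_offer: join-2: Sending offer to ronaldo
Jun 01 13:55:46 [1965] messi crmd: info: do_dc_join_finalize: join-2: Syncing our CIB to the rest of the cluster'

# Narrow to the 13:55:44-46 window, then to join/shutdown-related lines.
printf '%s\n' "$log_excerpt" | grep -E '13:55:4[4-6]' | grep -iE 'shutdown|join'
```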

> 2015-06-01T13:55:46.254577+05:30 ronaldo crmd[25990]:   notice:
> do_state_transition: State transition S_NOT_DC -> S_STOPPING [ input=I_STOP
> cause=C_HA_MESSAGE origin=route_message ]
> 2015-06-01T13:55:46.255134+05:30 ronaldo crmd[25990]:   notice:
> lrm_state_verify_stopped: Stopped 0 recurring operations at shutdown...
> waiting (2 ops remaining)
> 
> Based on the logs, pacemaker on the active node was stopping the secondary
> node every time it joined the cluster. This issue seems similar to
> http://pacemaker.oss.clusterlabs.narkive.com/rVvN8May/node-sends-shutdown-request-to-other-node-error
> 
> Packages used:
> pacemaker-1.1.12-4.el6.x86_64
> pacemaker-libs-1.1.12-4.el6.x86_64
> pacemaker-cli-1.1.12-4.el6.x86_64
> pacemaker-cluster-libs-1.1.12-4.el6.x86_64
> pacemaker-debuginfo-1.1.10-14.el6.x86_64
> pcsc-lite-libs-1.5.2-13.el6_4.x86_64
> pcs-0.9.90-2.el6.centos.2.noarch
> pcsc-lite-1.5.2-13.el6_4.x86_64
> pcsc-lite-openct-0.6.19-4.el6.x86_64
> corosync-1.4.1-17.el6.x86_64
> corosynclib-1.4.1-17.el6.x86_64


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Cluster node getting stopped from other node(resending mail)

2015-06-30 Thread Arjun Pandey
Hi

I am running a 2-node cluster with this config on CentOS 6.5/6.6:

Master/Slave Set: foo-master [foo]
Masters: [ messi ]
Stopped: [ ronaldo ]
 eth1-CP    (ocf::pw:IPaddr):   Started messi
 eth2-UP    (ocf::pw:IPaddr):   Started messi
 eth3-UPCP  (ocf::pw:IPaddr):   Started messi

where I have a multi-state resource foo running in master/slave mode, and
the IPaddr RA is just a modified IPaddr2 RA. Additionally I have a
colocation constraint for the IP addresses to be colocated with the master.
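For reference, such a constraint is typically created with pcs along these lines. This is only an illustrative sketch using the resource names from the status output above; the exact argument order differs between pcs versions, so verify against your pcs documentation before use:

```shell
# Illustrative only: colocate each IP resource with the master role of
# foo-master. Resource names are from the status output above.
pcs constraint colocation add eth1-CP with master foo-master INFINITY
pcs constraint colocation add eth2-UP with master foo-master INFINITY
pcs constraint colocation add eth3-UPCP with master foo-master INFINITY
```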

Sometimes when I set up the cluster, I find that one of the nodes (the
second node that joins) gets stopped, and I find this log.

2015-06-01T13:55:46.153941+05:30 ronaldo pacemaker: Starting Pacemaker
Cluster Manager
2015-06-01T13:55:46.233639+05:30 ronaldo attrd[25988]:   notice:
attrd_trigger_update: Sending flush op to all hosts for: shutdown (0)
2015-06-01T13:55:46.234162+05:30 ronaldo crmd[25990]:   notice:
do_state_transition: State transition S_PENDING -> S_NOT_DC [
input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
2015-06-01T13:55:46.234701+05:30 ronaldo attrd[25988]:   notice:
attrd_local_callback: Sending full refresh (origin=crmd)
2015-06-01T13:55:46.234708+05:30 ronaldo attrd[25988]:   notice:
attrd_trigger_update: Sending flush op to all hosts for: shutdown (0)
**** This looks to be the likely reason ****
2015-06-01T13:55:46.254310+05:30 ronaldo crmd[25990]:   error:
handle_request: We didn't ask to be shut down, yet our DC is telling us too.
****

2015-06-01T13:55:46.254577+05:30 ronaldo crmd[25990]:   notice:
do_state_transition: State transition S_NOT_DC -> S_STOPPING [ input=I_STOP
cause=C_HA_MESSAGE origin=route_message ]
2015-06-01T13:55:46.255134+05:30 ronaldo crmd[25990]:   notice:
lrm_state_verify_stopped: Stopped 0 recurring operations at shutdown...
waiting (2 ops remaining)

Based on the logs, pacemaker on the active node was stopping the secondary
node every time it joined the cluster. This issue seems similar to
http://pacemaker.oss.clusterlabs.narkive.com/rVvN8May/node-sends-shutdown-request-to-other-node-error

Packages used:
pacemaker-1.1.12-4.el6.x86_64
pacemaker-libs-1.1.12-4.el6.x86_64
pacemaker-cli-1.1.12-4.el6.x86_64
pacemaker-cluster-libs-1.1.12-4.el6.x86_64
pacemaker-debuginfo-1.1.10-14.el6.x86_64
pcsc-lite-libs-1.5.2-13.el6_4.x86_64
pcs-0.9.90-2.el6.centos.2.noarch
pcsc-lite-1.5.2-13.el6_4.x86_64
pcsc-lite-openct-0.6.19-4.el6.x86_64
corosync-1.4.1-17.el6.x86_64
corosynclib-1.4.1-17.el6.x86_64



Thanks in advance for your help

Regards
Arjun