Re: [ClusterLabs] Previous DC fenced prior to integration

2016-08-01 Thread Nate Clark
On Fri, Jul 29, 2016 at 4:09 PM, Ken Gaillot  wrote:
> On 07/28/2016 01:48 PM, Nate Clark wrote:
>> On Mon, Jul 25, 2016 at 2:48 PM, Nate Clark  wrote:
>>> On Mon, Jul 25, 2016 at 11:20 AM, Ken Gaillot  wrote:
 On 07/23/2016 10:14 PM, Nate Clark wrote:
> On Sat, Jul 23, 2016 at 1:06 AM, Andrei Borzenkov  
> wrote:
>> 23.07.2016 01:37, Nate Clark wrote:
>>> Hello,
>>>
>>> I am running pacemaker 1.1.13 with corosync and think I may have
>>> encountered a startup timing issue on a two-node cluster. I didn't
>>> notice anything in the changelogs for 1.1.14 or 1.1.15 that looked
>>> similar to this, or any open bugs.
>>>
>>> The rough outline of what happened:
>>>
>>> Module 1 and 2 running
>>> Module 1 is DC
>>> Module 2 shuts down
>>> Module 1 updates node attributes used by resources
>>> Module 1 shuts down
>>> Module 2 starts up
>>> Module 2 votes itself as DC
>>> Module 1 starts up
>>> Module 2 sees module 1 in corosync and notices it has quorum
>>> Module 2 enters the policy engine state.
>>> Module 2 policy engine decides to fence module 1
>>> Module 2 then continues and starts resources on itself based upon the
>>> old state
>>>
>>> For some reason the integration never occurred and module 2 started to
>>> perform actions based on stale state.
>>>
>>> Here are the full logs:
>>> Jul 20 16:29:06.376805 module-2 crmd[21969]:   notice: Connecting to
>>> cluster infrastructure: corosync
>>> Jul 20 16:29:06.386853 module-2 crmd[21969]:   notice: Could not
>>> obtain a node name for corosync nodeid 2
>>> Jul 20 16:29:06.392795 module-2 crmd[21969]:   notice: Defaulting to
>>> uname -n for the local corosync node name
>>> Jul 20 16:29:06.403611 module-2 crmd[21969]:   notice: Quorum lost
>>> Jul 20 16:29:06.409237 module-2 stonith-ng[21965]:   notice: Watching
>>> for stonith topology changes
>>> Jul 20 16:29:06.409474 module-2 stonith-ng[21965]:   notice: Added
>>> 'watchdog' to the device list (1 active devices)
>>> Jul 20 16:29:06.413589 module-2 stonith-ng[21965]:   notice: Relying
>>> on watchdog integration for fencing
>>> Jul 20 16:29:06.416905 module-2 cib[21964]:   notice: Defaulting to
>>> uname -n for the local corosync node name
>>> Jul 20 16:29:06.417044 module-2 crmd[21969]:   notice:
>>> pcmk_quorum_notification: Node module-2[2] - state is now member (was
>>> (null))
>>> Jul 20 16:29:06.421821 module-2 crmd[21969]:   notice: Defaulting to
>>> uname -n for the local corosync node name
>>> Jul 20 16:29:06.422121 module-2 crmd[21969]:   notice: Notifications 
>>> disabled
>>> Jul 20 16:29:06.422149 module-2 crmd[21969]:   notice: Watchdog
>>> enabled but stonith-watchdog-timeout is disabled
>>> Jul 20 16:29:06.422286 module-2 crmd[21969]:   notice: The local CRM
>>> is operational
>>> Jul 20 16:29:06.422312 module-2 crmd[21969]:   notice: State
>>> transition S_STARTING -> S_PENDING [ input=I_PENDING
>>> cause=C_FSA_INTERNAL origin=do_started ]
>>> Jul 20 16:29:07.416871 module-2 stonith-ng[21965]:   notice: Added
>>> 'fence_sbd' to the device list (2 active devices)
>>> Jul 20 16:29:08.418567 module-2 stonith-ng[21965]:   notice: Added
>>> 'ipmi-1' to the device list (3 active devices)
>>> Jul 20 16:29:27.423578 module-2 crmd[21969]:  warning: FSA: Input
>>> I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
>>> Jul 20 16:29:27.424298 module-2 crmd[21969]:   notice: State
>>> transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC
>>> cause=C_TIMER_POPPED origin=election_timeout_popped ]
>>> Jul 20 16:29:27.460834 module-2 crmd[21969]:  warning: FSA: Input
>>> I_ELECTION_DC from do_election_check() received in state S_INTEGRATION
>>> Jul 20 16:29:27.463794 module-2 crmd[21969]:   notice: Notifications 
>>> disabled
>>> Jul 20 16:29:27.463824 module-2 crmd[21969]:   notice: Watchdog
>>> enabled but stonith-watchdog-timeout is disabled
>>> Jul 20 16:29:27.473285 module-2 attrd[21967]:   notice: Defaulting to
>>> uname -n for the local corosync node name
>>> Jul 20 16:29:27.498464 module-2 pengine[21968]:   notice: Relying on
>>> watchdog integration for fencing
>>> Jul 20 16:29:27.498536 module-2 pengine[21968]:   notice: We do not
>>> have quorum - fencing and resource management disabled
>>> Jul 20 16:29:27.502272 module-2 pengine[21968]:  warning: Node
>>> module-1 is unclean!
>>> Jul 20 16:29:27.502287 module-2 pengine[21968]:   notice: Cannot fence
>>> unclean nodes until quorum is attained (or no-quorum-policy is set to
>>> ignore)

 The above two messages indicate that module-2 cannot see module-1 at
 startup, therefore it must assume it is potentially misbehaving and must
 be shot. However, since it does not have quorum with only one out of two
 nodes, it must wait until module-1 joins before it can shoot it!

Re: [ClusterLabs] Previous DC fenced prior to integration

2016-07-29 Thread Ken Gaillot
On 07/28/2016 01:48 PM, Nate Clark wrote:
> On Mon, Jul 25, 2016 at 2:48 PM, Nate Clark  wrote:
>> On Mon, Jul 25, 2016 at 11:20 AM, Ken Gaillot  wrote:
>>> On 07/23/2016 10:14 PM, Nate Clark wrote:
 On Sat, Jul 23, 2016 at 1:06 AM, Andrei Borzenkov  
 wrote:
> 23.07.2016 01:37, Nate Clark wrote:
>> Hello,
>>
>> I am running pacemaker 1.1.13 with corosync and think I may have
>> encountered a startup timing issue on a two-node cluster. I didn't
>> notice anything in the changelogs for 1.1.14 or 1.1.15 that looked
>> similar to this, or any open bugs.
>>
>> The rough outline of what happened:
>>
>> Module 1 and 2 running
>> Module 1 is DC
>> Module 2 shuts down
>> Module 1 updates node attributes used by resources
>> Module 1 shuts down
>> Module 2 starts up
>> Module 2 votes itself as DC
>> Module 1 starts up
>> Module 2 sees module 1 in corosync and notices it has quorum
>> Module 2 enters the policy engine state.
>> Module 2 policy engine decides to fence module 1
>> Module 2 then continues and starts resources on itself based upon the old
>> state
>>
>> For some reason the integration never occurred and module 2 started to
>> perform actions based on stale state.
>>
>> Here are the full logs:
>> Jul 20 16:29:06.376805 module-2 crmd[21969]:   notice: Connecting to
>> cluster infrastructure: corosync
>> Jul 20 16:29:06.386853 module-2 crmd[21969]:   notice: Could not
>> obtain a node name for corosync nodeid 2
>> Jul 20 16:29:06.392795 module-2 crmd[21969]:   notice: Defaulting to
>> uname -n for the local corosync node name
>> Jul 20 16:29:06.403611 module-2 crmd[21969]:   notice: Quorum lost
>> Jul 20 16:29:06.409237 module-2 stonith-ng[21965]:   notice: Watching
>> for stonith topology changes
>> Jul 20 16:29:06.409474 module-2 stonith-ng[21965]:   notice: Added
>> 'watchdog' to the device list (1 active devices)
>> Jul 20 16:29:06.413589 module-2 stonith-ng[21965]:   notice: Relying
>> on watchdog integration for fencing
>> Jul 20 16:29:06.416905 module-2 cib[21964]:   notice: Defaulting to
>> uname -n for the local corosync node name
>> Jul 20 16:29:06.417044 module-2 crmd[21969]:   notice:
>> pcmk_quorum_notification: Node module-2[2] - state is now member (was
>> (null))
>> Jul 20 16:29:06.421821 module-2 crmd[21969]:   notice: Defaulting to
>> uname -n for the local corosync node name
>> Jul 20 16:29:06.422121 module-2 crmd[21969]:   notice: Notifications 
>> disabled
>> Jul 20 16:29:06.422149 module-2 crmd[21969]:   notice: Watchdog
>> enabled but stonith-watchdog-timeout is disabled
>> Jul 20 16:29:06.422286 module-2 crmd[21969]:   notice: The local CRM
>> is operational
>> Jul 20 16:29:06.422312 module-2 crmd[21969]:   notice: State
>> transition S_STARTING -> S_PENDING [ input=I_PENDING
>> cause=C_FSA_INTERNAL origin=do_started ]
>> Jul 20 16:29:07.416871 module-2 stonith-ng[21965]:   notice: Added
>> 'fence_sbd' to the device list (2 active devices)
>> Jul 20 16:29:08.418567 module-2 stonith-ng[21965]:   notice: Added
>> 'ipmi-1' to the device list (3 active devices)
>> Jul 20 16:29:27.423578 module-2 crmd[21969]:  warning: FSA: Input
>> I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
>> Jul 20 16:29:27.424298 module-2 crmd[21969]:   notice: State
>> transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC
>> cause=C_TIMER_POPPED origin=election_timeout_popped ]
>> Jul 20 16:29:27.460834 module-2 crmd[21969]:  warning: FSA: Input
>> I_ELECTION_DC from do_election_check() received in state S_INTEGRATION
>> Jul 20 16:29:27.463794 module-2 crmd[21969]:   notice: Notifications 
>> disabled
>> Jul 20 16:29:27.463824 module-2 crmd[21969]:   notice: Watchdog
>> enabled but stonith-watchdog-timeout is disabled
>> Jul 20 16:29:27.473285 module-2 attrd[21967]:   notice: Defaulting to
>> uname -n for the local corosync node name
>> Jul 20 16:29:27.498464 module-2 pengine[21968]:   notice: Relying on
>> watchdog integration for fencing
>> Jul 20 16:29:27.498536 module-2 pengine[21968]:   notice: We do not
>> have quorum - fencing and resource management disabled
>> Jul 20 16:29:27.502272 module-2 pengine[21968]:  warning: Node
>> module-1 is unclean!
>> Jul 20 16:29:27.502287 module-2 pengine[21968]:   notice: Cannot fence
>> unclean nodes until quorum is attained (or no-quorum-policy is set to
>> ignore)
>>>
>>> The above two messages indicate that module-2 cannot see module-1 at
>>> startup, therefore it must assume it is potentially misbehaving and must
>>> be shot. However, since it does not have quorum with only one out of two
>>> nodes, it must wait until module-1 joins before it can shoot it!
>>>
>>> This is a special problem with quorum in a two-node cluster.

Re: [ClusterLabs] Previous DC fenced prior to integration

2016-07-28 Thread Nate Clark
On Mon, Jul 25, 2016 at 2:48 PM, Nate Clark  wrote:
> On Mon, Jul 25, 2016 at 11:20 AM, Ken Gaillot  wrote:
>> On 07/23/2016 10:14 PM, Nate Clark wrote:
>>> On Sat, Jul 23, 2016 at 1:06 AM, Andrei Borzenkov  
>>> wrote:
 23.07.2016 01:37, Nate Clark wrote:
> Hello,
>
> I am running pacemaker 1.1.13 with corosync and think I may have
> encountered a startup timing issue on a two-node cluster. I didn't
> notice anything in the changelogs for 1.1.14 or 1.1.15 that looked
> similar to this, or any open bugs.
>
> The rough outline of what happened:
>
> Module 1 and 2 running
> Module 1 is DC
> Module 2 shuts down
> Module 1 updates node attributes used by resources
> Module 1 shuts down
> Module 2 starts up
> Module 2 votes itself as DC
> Module 1 starts up
> Module 2 sees module 1 in corosync and notices it has quorum
> Module 2 enters the policy engine state.
> Module 2 policy engine decides to fence module 1
> Module 2 then continues and starts resources on itself based upon the old
> state
>
> For some reason the integration never occurred and module 2 started to
> perform actions based on stale state.
>
> Here are the full logs:
> Jul 20 16:29:06.376805 module-2 crmd[21969]:   notice: Connecting to
> cluster infrastructure: corosync
> Jul 20 16:29:06.386853 module-2 crmd[21969]:   notice: Could not
> obtain a node name for corosync nodeid 2
> Jul 20 16:29:06.392795 module-2 crmd[21969]:   notice: Defaulting to
> uname -n for the local corosync node name
> Jul 20 16:29:06.403611 module-2 crmd[21969]:   notice: Quorum lost
> Jul 20 16:29:06.409237 module-2 stonith-ng[21965]:   notice: Watching
> for stonith topology changes
> Jul 20 16:29:06.409474 module-2 stonith-ng[21965]:   notice: Added
> 'watchdog' to the device list (1 active devices)
> Jul 20 16:29:06.413589 module-2 stonith-ng[21965]:   notice: Relying
> on watchdog integration for fencing
> Jul 20 16:29:06.416905 module-2 cib[21964]:   notice: Defaulting to
> uname -n for the local corosync node name
> Jul 20 16:29:06.417044 module-2 crmd[21969]:   notice:
> pcmk_quorum_notification: Node module-2[2] - state is now member (was
> (null))
> Jul 20 16:29:06.421821 module-2 crmd[21969]:   notice: Defaulting to
> uname -n for the local corosync node name
> Jul 20 16:29:06.422121 module-2 crmd[21969]:   notice: Notifications 
> disabled
> Jul 20 16:29:06.422149 module-2 crmd[21969]:   notice: Watchdog
> enabled but stonith-watchdog-timeout is disabled
> Jul 20 16:29:06.422286 module-2 crmd[21969]:   notice: The local CRM
> is operational
> Jul 20 16:29:06.422312 module-2 crmd[21969]:   notice: State
> transition S_STARTING -> S_PENDING [ input=I_PENDING
> cause=C_FSA_INTERNAL origin=do_started ]
> Jul 20 16:29:07.416871 module-2 stonith-ng[21965]:   notice: Added
> 'fence_sbd' to the device list (2 active devices)
> Jul 20 16:29:08.418567 module-2 stonith-ng[21965]:   notice: Added
> 'ipmi-1' to the device list (3 active devices)
> Jul 20 16:29:27.423578 module-2 crmd[21969]:  warning: FSA: Input
> I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
> Jul 20 16:29:27.424298 module-2 crmd[21969]:   notice: State
> transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC
> cause=C_TIMER_POPPED origin=election_timeout_popped ]
> Jul 20 16:29:27.460834 module-2 crmd[21969]:  warning: FSA: Input
> I_ELECTION_DC from do_election_check() received in state S_INTEGRATION
> Jul 20 16:29:27.463794 module-2 crmd[21969]:   notice: Notifications 
> disabled
> Jul 20 16:29:27.463824 module-2 crmd[21969]:   notice: Watchdog
> enabled but stonith-watchdog-timeout is disabled
> Jul 20 16:29:27.473285 module-2 attrd[21967]:   notice: Defaulting to
> uname -n for the local corosync node name
> Jul 20 16:29:27.498464 module-2 pengine[21968]:   notice: Relying on
> watchdog integration for fencing
> Jul 20 16:29:27.498536 module-2 pengine[21968]:   notice: We do not
> have quorum - fencing and resource management disabled
> Jul 20 16:29:27.502272 module-2 pengine[21968]:  warning: Node
> module-1 is unclean!
> Jul 20 16:29:27.502287 module-2 pengine[21968]:   notice: Cannot fence
> unclean nodes until quorum is attained (or no-quorum-policy is set to
> ignore)
>>
>> The above two messages indicate that module-2 cannot see module-1 at
>> startup, therefore it must assume it is potentially misbehaving and must
>> be shot. However, since it does not have quorum with only one out of two
>> nodes, it must wait until module-1 joins before it can shoot it!
>>
>> This is a special problem with quorum in a two-node cluster. There are a
>> variety of ways to deal with it, but the simplest is to set "two_node:
>> 1" in corosync.conf (with corosync 2 or later).

Re: [ClusterLabs] Previous DC fenced prior to integration

2016-07-25 Thread Nate Clark
On Mon, Jul 25, 2016 at 11:20 AM, Ken Gaillot  wrote:
> On 07/23/2016 10:14 PM, Nate Clark wrote:
>> On Sat, Jul 23, 2016 at 1:06 AM, Andrei Borzenkov  
>> wrote:
>>> 23.07.2016 01:37, Nate Clark wrote:
 Hello,

 I am running pacemaker 1.1.13 with corosync and think I may have
 encountered a startup timing issue on a two-node cluster. I didn't
 notice anything in the changelogs for 1.1.14 or 1.1.15 that looked
 similar to this, or any open bugs.

 The rough outline of what happened:

 Module 1 and 2 running
 Module 1 is DC
 Module 2 shuts down
 Module 1 updates node attributes used by resources
 Module 1 shuts down
 Module 2 starts up
 Module 2 votes itself as DC
 Module 1 starts up
 Module 2 sees module 1 in corosync and notices it has quorum
 Module 2 enters the policy engine state.
 Module 2 policy engine decides to fence module 1
 Module 2 then continues and starts resources on itself based upon the old
 state

 For some reason the integration never occurred and module 2 started to
 perform actions based on stale state.

 Here are the full logs:
 Jul 20 16:29:06.376805 module-2 crmd[21969]:   notice: Connecting to
 cluster infrastructure: corosync
 Jul 20 16:29:06.386853 module-2 crmd[21969]:   notice: Could not
 obtain a node name for corosync nodeid 2
 Jul 20 16:29:06.392795 module-2 crmd[21969]:   notice: Defaulting to
 uname -n for the local corosync node name
 Jul 20 16:29:06.403611 module-2 crmd[21969]:   notice: Quorum lost
 Jul 20 16:29:06.409237 module-2 stonith-ng[21965]:   notice: Watching
 for stonith topology changes
 Jul 20 16:29:06.409474 module-2 stonith-ng[21965]:   notice: Added
 'watchdog' to the device list (1 active devices)
 Jul 20 16:29:06.413589 module-2 stonith-ng[21965]:   notice: Relying
 on watchdog integration for fencing
 Jul 20 16:29:06.416905 module-2 cib[21964]:   notice: Defaulting to
 uname -n for the local corosync node name
 Jul 20 16:29:06.417044 module-2 crmd[21969]:   notice:
 pcmk_quorum_notification: Node module-2[2] - state is now member (was
 (null))
 Jul 20 16:29:06.421821 module-2 crmd[21969]:   notice: Defaulting to
 uname -n for the local corosync node name
 Jul 20 16:29:06.422121 module-2 crmd[21969]:   notice: Notifications 
 disabled
 Jul 20 16:29:06.422149 module-2 crmd[21969]:   notice: Watchdog
 enabled but stonith-watchdog-timeout is disabled
 Jul 20 16:29:06.422286 module-2 crmd[21969]:   notice: The local CRM
 is operational
 Jul 20 16:29:06.422312 module-2 crmd[21969]:   notice: State
 transition S_STARTING -> S_PENDING [ input=I_PENDING
 cause=C_FSA_INTERNAL origin=do_started ]
 Jul 20 16:29:07.416871 module-2 stonith-ng[21965]:   notice: Added
 'fence_sbd' to the device list (2 active devices)
 Jul 20 16:29:08.418567 module-2 stonith-ng[21965]:   notice: Added
 'ipmi-1' to the device list (3 active devices)
 Jul 20 16:29:27.423578 module-2 crmd[21969]:  warning: FSA: Input
 I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
 Jul 20 16:29:27.424298 module-2 crmd[21969]:   notice: State
 transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC
 cause=C_TIMER_POPPED origin=election_timeout_popped ]
 Jul 20 16:29:27.460834 module-2 crmd[21969]:  warning: FSA: Input
 I_ELECTION_DC from do_election_check() received in state S_INTEGRATION
 Jul 20 16:29:27.463794 module-2 crmd[21969]:   notice: Notifications 
 disabled
 Jul 20 16:29:27.463824 module-2 crmd[21969]:   notice: Watchdog
 enabled but stonith-watchdog-timeout is disabled
 Jul 20 16:29:27.473285 module-2 attrd[21967]:   notice: Defaulting to
 uname -n for the local corosync node name
 Jul 20 16:29:27.498464 module-2 pengine[21968]:   notice: Relying on
 watchdog integration for fencing
 Jul 20 16:29:27.498536 module-2 pengine[21968]:   notice: We do not
 have quorum - fencing and resource management disabled
 Jul 20 16:29:27.502272 module-2 pengine[21968]:  warning: Node
 module-1 is unclean!
 Jul 20 16:29:27.502287 module-2 pengine[21968]:   notice: Cannot fence
 unclean nodes until quorum is attained (or no-quorum-policy is set to
 ignore)
>
> The above two messages indicate that module-2 cannot see module-1 at
> startup, therefore it must assume it is potentially misbehaving and must
> be shot. However, since it does not have quorum with only one out of two
> nodes, it must wait until module-1 joins before it can shoot it!
>
> This is a special problem with quorum in a two-node cluster. There are a
> variety of ways to deal with it, but the simplest is to set "two_node:
> 1" in corosync.conf (with corosync 2 or later). This will make each node
> wait for the other at startup, meaning both nodes must be started before
> the cluster 
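
A minimal corosync.conf quorum stanza for that setting might look like the
sketch below (an illustration only, not the original poster's actual
configuration; the rest of the file is assumed unchanged):

    quorum {
        provider: corosync_votequorum
        two_node: 1
    }

In votequorum, two_node: 1 also implies wait_for_all, so each node waits to
see the other once at startup; the resulting quorum state can be inspected
with "corosync-quorumtool -s".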

Re: [ClusterLabs] Previous DC fenced prior to integration

2016-07-23 Thread Nate Clark
On Sat, Jul 23, 2016 at 1:06 AM, Andrei Borzenkov  wrote:
> 23.07.2016 01:37, Nate Clark wrote:
>> Hello,
>>
>> I am running pacemaker 1.1.13 with corosync and think I may have
>> encountered a startup timing issue on a two-node cluster. I didn't
>> notice anything in the changelogs for 1.1.14 or 1.1.15 that looked
>> similar to this, or any open bugs.
>>
>> The rough outline of what happened:
>>
>> Module 1 and 2 running
>> Module 1 is DC
>> Module 2 shuts down
>> Module 1 updates node attributes used by resources
>> Module 1 shuts down
>> Module 2 starts up
>> Module 2 votes itself as DC
>> Module 1 starts up
>> Module 2 sees module 1 in corosync and notices it has quorum
>> Module 2 enters the policy engine state.
>> Module 2 policy engine decides to fence module 1
>> Module 2 then continues and starts resources on itself based upon the old
>> state
>>
>> For some reason the integration never occurred and module 2 started to
>> perform actions based on stale state.
>>
>> Here are the full logs:
>> Jul 20 16:29:06.376805 module-2 crmd[21969]:   notice: Connecting to
>> cluster infrastructure: corosync
>> Jul 20 16:29:06.386853 module-2 crmd[21969]:   notice: Could not
>> obtain a node name for corosync nodeid 2
>> Jul 20 16:29:06.392795 module-2 crmd[21969]:   notice: Defaulting to
>> uname -n for the local corosync node name
>> Jul 20 16:29:06.403611 module-2 crmd[21969]:   notice: Quorum lost
>> Jul 20 16:29:06.409237 module-2 stonith-ng[21965]:   notice: Watching
>> for stonith topology changes
>> Jul 20 16:29:06.409474 module-2 stonith-ng[21965]:   notice: Added
>> 'watchdog' to the device list (1 active devices)
>> Jul 20 16:29:06.413589 module-2 stonith-ng[21965]:   notice: Relying
>> on watchdog integration for fencing
>> Jul 20 16:29:06.416905 module-2 cib[21964]:   notice: Defaulting to
>> uname -n for the local corosync node name
>> Jul 20 16:29:06.417044 module-2 crmd[21969]:   notice:
>> pcmk_quorum_notification: Node module-2[2] - state is now member (was
>> (null))
>> Jul 20 16:29:06.421821 module-2 crmd[21969]:   notice: Defaulting to
>> uname -n for the local corosync node name
>> Jul 20 16:29:06.422121 module-2 crmd[21969]:   notice: Notifications disabled
>> Jul 20 16:29:06.422149 module-2 crmd[21969]:   notice: Watchdog
>> enabled but stonith-watchdog-timeout is disabled
>> Jul 20 16:29:06.422286 module-2 crmd[21969]:   notice: The local CRM
>> is operational
>> Jul 20 16:29:06.422312 module-2 crmd[21969]:   notice: State
>> transition S_STARTING -> S_PENDING [ input=I_PENDING
>> cause=C_FSA_INTERNAL origin=do_started ]
>> Jul 20 16:29:07.416871 module-2 stonith-ng[21965]:   notice: Added
>> 'fence_sbd' to the device list (2 active devices)
>> Jul 20 16:29:08.418567 module-2 stonith-ng[21965]:   notice: Added
>> 'ipmi-1' to the device list (3 active devices)
>> Jul 20 16:29:27.423578 module-2 crmd[21969]:  warning: FSA: Input
>> I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
>> Jul 20 16:29:27.424298 module-2 crmd[21969]:   notice: State
>> transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC
>> cause=C_TIMER_POPPED origin=election_timeout_popped ]
>> Jul 20 16:29:27.460834 module-2 crmd[21969]:  warning: FSA: Input
>> I_ELECTION_DC from do_election_check() received in state S_INTEGRATION
>> Jul 20 16:29:27.463794 module-2 crmd[21969]:   notice: Notifications disabled
>> Jul 20 16:29:27.463824 module-2 crmd[21969]:   notice: Watchdog
>> enabled but stonith-watchdog-timeout is disabled
>> Jul 20 16:29:27.473285 module-2 attrd[21967]:   notice: Defaulting to
>> uname -n for the local corosync node name
>> Jul 20 16:29:27.498464 module-2 pengine[21968]:   notice: Relying on
>> watchdog integration for fencing
>> Jul 20 16:29:27.498536 module-2 pengine[21968]:   notice: We do not
>> have quorum - fencing and resource management disabled
>> Jul 20 16:29:27.502272 module-2 pengine[21968]:  warning: Node
>> module-1 is unclean!
>> Jul 20 16:29:27.502287 module-2 pengine[21968]:   notice: Cannot fence
>> unclean nodes until quorum is attained (or no-quorum-policy is set to
>> ignore)
>> Jul 20 16:29:27.503521 module-2 pengine[21968]:   notice: Start
>> fence_sbd(module-2 - blocked)
>> Jul 20 16:29:27.503539 module-2 pengine[21968]:   notice: Start
>> ipmi-1(module-2 - blocked)
>> Jul 20 16:29:27.503559 module-2 pengine[21968]:   notice: Start
>> SlaveIP(module-2 - blocked)
>> Jul 20 16:29:27.503582 module-2 pengine[21968]:   notice: Start
>> postgres:0(module-2 - blocked)
>> Jul 20 16:29:27.503597 module-2 pengine[21968]:   notice: Start
>> ethmonitor:0(module-2 - blocked)
>> Jul 20 16:29:27.503618 module-2 pengine[21968]:   notice: Start
>> tomcat-instance:0(module-2 - blocked)
>> Jul 20 16:29:27.503629 module-2 pengine[21968]:   notice: Start
>> ClusterMonitor:0(module-2 - blocked)
>> Jul 20 16:29:27.506945 module-2 pengine[21968]:  warning: Calculated
>> Transition 0: /var/lib/pacemaker/pengine/pe-warn-0.bz2
>> 
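
The pe-warn file named in that last log line can be replayed offline to see
exactly what the policy engine calculated; a minimal sketch, assuming the
file is still present on module-2:

    crm_simulate -S -x /var/lib/pacemaker/pengine/pe-warn-0.bz2

This only re-runs the scheduler against the saved CIB snapshot and does not
change the live cluster.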

Re: [ClusterLabs] Previous DC fenced prior to integration

2016-07-22 Thread Andrei Borzenkov
23.07.2016 01:37, Nate Clark wrote:
> Hello,
> 
> I am running pacemaker 1.1.13 with corosync and think I may have
> encountered a startup timing issue on a two-node cluster. I didn't
> notice anything in the changelogs for 1.1.14 or 1.1.15 that looked
> similar to this, or any open bugs.
> 
> The rough outline of what happened:
> 
> Module 1 and 2 running
> Module 1 is DC
> Module 2 shuts down
> Module 1 updates node attributes used by resources
> Module 1 shuts down
> Module 2 starts up
> Module 2 votes itself as DC
> Module 1 starts up
> Module 2 sees module 1 in corosync and notices it has quorum
> Module 2 enters the policy engine state.
> Module 2 policy engine decides to fence module 1
> Module 2 then continues and starts resources on itself based upon the old state
> 
> For some reason the integration never occurred and module 2 started to
> perform actions based on stale state.
> 
> Here are the full logs:
> Jul 20 16:29:06.376805 module-2 crmd[21969]:   notice: Connecting to
> cluster infrastructure: corosync
> Jul 20 16:29:06.386853 module-2 crmd[21969]:   notice: Could not
> obtain a node name for corosync nodeid 2
> Jul 20 16:29:06.392795 module-2 crmd[21969]:   notice: Defaulting to
> uname -n for the local corosync node name
> Jul 20 16:29:06.403611 module-2 crmd[21969]:   notice: Quorum lost
> Jul 20 16:29:06.409237 module-2 stonith-ng[21965]:   notice: Watching
> for stonith topology changes
> Jul 20 16:29:06.409474 module-2 stonith-ng[21965]:   notice: Added
> 'watchdog' to the device list (1 active devices)
> Jul 20 16:29:06.413589 module-2 stonith-ng[21965]:   notice: Relying
> on watchdog integration for fencing
> Jul 20 16:29:06.416905 module-2 cib[21964]:   notice: Defaulting to
> uname -n for the local corosync node name
> Jul 20 16:29:06.417044 module-2 crmd[21969]:   notice:
> pcmk_quorum_notification: Node module-2[2] - state is now member (was
> (null))
> Jul 20 16:29:06.421821 module-2 crmd[21969]:   notice: Defaulting to
> uname -n for the local corosync node name
> Jul 20 16:29:06.422121 module-2 crmd[21969]:   notice: Notifications disabled
> Jul 20 16:29:06.422149 module-2 crmd[21969]:   notice: Watchdog
> enabled but stonith-watchdog-timeout is disabled
> Jul 20 16:29:06.422286 module-2 crmd[21969]:   notice: The local CRM
> is operational
> Jul 20 16:29:06.422312 module-2 crmd[21969]:   notice: State
> transition S_STARTING -> S_PENDING [ input=I_PENDING
> cause=C_FSA_INTERNAL origin=do_started ]
> Jul 20 16:29:07.416871 module-2 stonith-ng[21965]:   notice: Added
> 'fence_sbd' to the device list (2 active devices)
> Jul 20 16:29:08.418567 module-2 stonith-ng[21965]:   notice: Added
> 'ipmi-1' to the device list (3 active devices)
> Jul 20 16:29:27.423578 module-2 crmd[21969]:  warning: FSA: Input
> I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
> Jul 20 16:29:27.424298 module-2 crmd[21969]:   notice: State
> transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC
> cause=C_TIMER_POPPED origin=election_timeout_popped ]
> Jul 20 16:29:27.460834 module-2 crmd[21969]:  warning: FSA: Input
> I_ELECTION_DC from do_election_check() received in state S_INTEGRATION
> Jul 20 16:29:27.463794 module-2 crmd[21969]:   notice: Notifications disabled
> Jul 20 16:29:27.463824 module-2 crmd[21969]:   notice: Watchdog
> enabled but stonith-watchdog-timeout is disabled
> Jul 20 16:29:27.473285 module-2 attrd[21967]:   notice: Defaulting to
> uname -n for the local corosync node name
> Jul 20 16:29:27.498464 module-2 pengine[21968]:   notice: Relying on
> watchdog integration for fencing
> Jul 20 16:29:27.498536 module-2 pengine[21968]:   notice: We do not
> have quorum - fencing and resource management disabled
> Jul 20 16:29:27.502272 module-2 pengine[21968]:  warning: Node
> module-1 is unclean!
> Jul 20 16:29:27.502287 module-2 pengine[21968]:   notice: Cannot fence
> unclean nodes until quorum is attained (or no-quorum-policy is set to
> ignore)
> Jul 20 16:29:27.503521 module-2 pengine[21968]:   notice: Start
> fence_sbd(module-2 - blocked)
> Jul 20 16:29:27.503539 module-2 pengine[21968]:   notice: Start
> ipmi-1(module-2 - blocked)
> Jul 20 16:29:27.503559 module-2 pengine[21968]:   notice: Start
> SlaveIP(module-2 - blocked)
> Jul 20 16:29:27.503582 module-2 pengine[21968]:   notice: Start
> postgres:0(module-2 - blocked)
> Jul 20 16:29:27.503597 module-2 pengine[21968]:   notice: Start
> ethmonitor:0(module-2 - blocked)
> Jul 20 16:29:27.503618 module-2 pengine[21968]:   notice: Start
> tomcat-instance:0(module-2 - blocked)
> Jul 20 16:29:27.503629 module-2 pengine[21968]:   notice: Start
> ClusterMonitor:0(module-2 - blocked)
> Jul 20 16:29:27.506945 module-2 pengine[21968]:  warning: Calculated
> Transition 0: /var/lib/pacemaker/pengine/pe-warn-0.bz2
> Jul 20 16:29:27.507976 module-2 crmd[21969]:   notice: Initiating
> action 4: monitor fence_sbd_monitor_0 on module-2 (local)
> Jul 20 16:29:27.509282 module-2 crmd[21969]: