Re: [ClusterLabs] Previous DC fenced prior to integration

On Mon, Jul 25, 2016 at 11:20 AM, Ken Gaillot wrote:
> On 07/23/2016 10:14 PM, Nate Clark wrote:
>> [snip -- the full report and startup log are in the original message
>> below]
>> Jul 20 16:29:27.498536 module-2 pengine[21968]: notice: We do not
>> have quorum - fencing and resource management disabled
>> Jul 20 16:29:27.502272 module-2 pengine[21968]: warning: Node
>> module-1 is unclean!
>> Jul 20 16:29:27.502287 module-2 pengine[21968]: notice: Cannot fence
>> unclean nodes until quorum is attained (or no-quorum-policy is set to
>> ignore)
>
> The above two messages indicate that module-2 cannot see module-1 at
> startup, therefore it must assume it is potentially misbehaving and
> must be shot. However, since it does not have quorum with only one out
> of two nodes, it must wait until module-1 joins before it can shoot it!
>
> This is a special problem with quorum in a two-node cluster. There are
> a variety of ways to deal with it, but the simplest is to set
> "two_node: 1" in corosync.conf (with corosync 2 or later). This will
> make each node wait for the other at startup, meaning both nodes must
> be started before the cluster can have quorum.
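For reference, the setting Ken describes lives in the quorum section of
corosync.conf. A minimal sketch for a corosync 2.x two-node cluster
(node and interface details omitted):

    quorum {
        provider: corosync_votequorum
        # two_node: 1 lets the pair stay quorate with a single member,
        # and it implies wait_for_all: 1, so at startup a node will not
        # claim quorum until it has seen its peer at least once.
        two_node: 1
    }

After restarting corosync on both nodes, "corosync-quorumtool -s" should
list the 2Node and WaitForAll flags, confirming the behavior described
above.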
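One related aside before the original report: the startup log below
repeatedly warns "Watchdog enabled but stonith-watchdog-timeout is
disabled", which means the watchdog cannot actually be used for
self-fencing. If SBD is running with a hardware watchdog, the cluster
property can be enabled; a sketch using pcs (the 10s value is only an
example and should be larger than, commonly about twice, the
SBD_WATCHDOG_TIMEOUT set in /etc/sysconfig/sbd):

    pcs property set stonith-watchdog-timeout=10s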
Re: [ClusterLabs] Previous DC fenced prior to integration

23.07.2016 01:37, Nate Clark wrote:
> Hello,
>
> I am running pacemaker 1.1.13 with corosync and think I may have
> encountered a start-up timing issue on a two-node cluster. I didn't
> notice anything in the changelogs for 1.1.14 or 1.1.15 that looked
> similar to this, or any open bugs.
>
> The rough outline of what happened:
>
> Modules 1 and 2 running
> Module 1 is DC
> Module 2 shuts down
> Module 1 updates node attributes used by resources
> Module 1 shuts down
> Module 2 starts up
> Module 2 votes itself as DC
> Module 1 starts up
> Module 2 sees module 1 in corosync and notices it has quorum
> Module 2 enters the policy engine state
> Module 2's policy engine decides to fence module 1
> Module 2 then continues and starts resources on itself based upon the
> old state
>
> For some reason the integration never occurred and module 2 started to
> perform actions based on stale state.
>
> Here are the full logs:
> Jul 20 16:29:06.376805 module-2 crmd[21969]: notice: Connecting to cluster infrastructure: corosync
> Jul 20 16:29:06.386853 module-2 crmd[21969]: notice: Could not obtain a node name for corosync nodeid 2
> Jul 20 16:29:06.392795 module-2 crmd[21969]: notice: Defaulting to uname -n for the local corosync node name
> Jul 20 16:29:06.403611 module-2 crmd[21969]: notice: Quorum lost
> Jul 20 16:29:06.409237 module-2 stonith-ng[21965]: notice: Watching for stonith topology changes
> Jul 20 16:29:06.409474 module-2 stonith-ng[21965]: notice: Added 'watchdog' to the device list (1 active devices)
> Jul 20 16:29:06.413589 module-2 stonith-ng[21965]: notice: Relying on watchdog integration for fencing
> Jul 20 16:29:06.416905 module-2 cib[21964]: notice: Defaulting to uname -n for the local corosync node name
> Jul 20 16:29:06.417044 module-2 crmd[21969]: notice: pcmk_quorum_notification: Node module-2[2] - state is now member (was (null))
> Jul 20 16:29:06.421821 module-2 crmd[21969]: notice: Defaulting to uname -n for the local corosync node name
> Jul 20 16:29:06.422121 module-2 crmd[21969]: notice: Notifications disabled
> Jul 20 16:29:06.422149 module-2 crmd[21969]: notice: Watchdog enabled but stonith-watchdog-timeout is disabled
> Jul 20 16:29:06.422286 module-2 crmd[21969]: notice: The local CRM is operational
> Jul 20 16:29:06.422312 module-2 crmd[21969]: notice: State transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ]
> Jul 20 16:29:07.416871 module-2 stonith-ng[21965]: notice: Added 'fence_sbd' to the device list (2 active devices)
> Jul 20 16:29:08.418567 module-2 stonith-ng[21965]: notice: Added 'ipmi-1' to the device list (3 active devices)
> Jul 20 16:29:27.423578 module-2 crmd[21969]: warning: FSA: Input I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
> Jul 20 16:29:27.424298 module-2 crmd[21969]: notice: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_TIMER_POPPED origin=election_timeout_popped ]
> Jul 20 16:29:27.460834 module-2 crmd[21969]: warning: FSA: Input I_ELECTION_DC from do_election_check() received in state S_INTEGRATION
> Jul 20 16:29:27.463794 module-2 crmd[21969]: notice: Notifications disabled
> Jul 20 16:29:27.463824 module-2 crmd[21969]: notice: Watchdog enabled but stonith-watchdog-timeout is disabled
> Jul 20 16:29:27.473285 module-2 attrd[21967]: notice: Defaulting to uname -n for the local corosync node name
> Jul 20 16:29:27.498464 module-2 pengine[21968]: notice: Relying on watchdog integration for fencing
> Jul 20 16:29:27.498536 module-2 pengine[21968]: notice: We do not have quorum - fencing and resource management disabled
> Jul 20 16:29:27.502272 module-2 pengine[21968]: warning: Node module-1 is unclean!
> Jul 20 16:29:27.502287 module-2 pengine[21968]: notice: Cannot fence unclean nodes until quorum is attained (or no-quorum-policy is set to ignore)
> Jul 20 16:29:27.503521 module-2 pengine[21968]: notice: Start fence_sbd (module-2 - blocked)
> Jul 20 16:29:27.503539 module-2 pengine[21968]: notice: Start ipmi-1 (module-2 - blocked)
> Jul 20 16:29:27.503559 module-2 pengine[21968]: notice: Start SlaveIP (module-2 - blocked)
> Jul 20 16:29:27.503582 module-2 pengine[21968]: notice: Start postgres:0 (module-2 - blocked)
> Jul 20 16:29:27.503597 module-2 pengine[21968]: notice: Start ethmonitor:0 (module-2 - blocked)
> Jul 20 16:29:27.503618 module-2 pengine[21968]: notice: Start tomcat-instance:0 (module-2 - blocked)
> Jul 20 16:29:27.503629 module-2 pengine[21968]: notice: Start ClusterMonitor:0 (module-2 - blocked)
> Jul 20 16:29:27.506945 module-2 pengine[21968]: warning: Calculated Transition 0: /var/lib/pacemaker/pengine/pe-warn-0.bz2
> Jul 20 16:29:27.507976 module-2 crmd[21969]: notice: Initiating action 4: monitor fence_sbd_monitor_0 on module-2 (local)
> Jul 20 16:29:27.509282 module-2 crmd[21969]:
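The "Calculated Transition 0: /var/lib/pacemaker/pengine/pe-warn-0.bz2"
line above names the saved policy-engine input, so the decision to fence
module-1 and start resources can be replayed offline. A sketch, assuming
the file is still present on module-2:

    # Replay the saved transition and print what the policy engine decided
    crm_simulate --simulate --xml-file /var/lib/pacemaker/pengine/pe-warn-0.bz2

    # Add --show-scores to see the placement scores behind each decision
    crm_simulate --simulate --show-scores --xml-file /var/lib/pacemaker/pengine/pe-warn-0.bz2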