Re: [ClusterLabs] Pengine always trying to start the resource on the standby node.

2018-06-13 Thread Ken Gaillot
On Wed, 2018-06-13 at 17:09 +0800, Albert Weng wrote:
> Hi All,
> 
> Thanks for reply.
> 
> Recently, I ran the following command:
> (clustera) # crm_simulate --xml-file pe-warn.last
> 
> It returned the following results:
>    error: crm_abort:    xpath_search: Triggered assert at xpath.c:153
> : xml_top != NULL
>    error: crm_element_value:    Couldn't find validate-with in NULL

It looks like pe-warn.last somehow got corrupted. It appears to not be
a full CIB file.

If the original was compressed (.gz/.bz2 extension), and you didn't
uncompress it, re-add the extension -- that's how pacemaker knows to
uncompress it.
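
For example, assuming the file came from /var/lib/pacemaker/pengine/
and was bzip2-compressed (the usual case -- adjust the extension if the
original was gzip):

(clustera) # mv pe-warn.last pe-warn.last.bz2
(clustera) # crm_simulate --xml-file pe-warn.last.bz2

# or uncompress it first and point crm_simulate at the plain XML
(clustera) # bunzip2 -k pe-warn.last.bz2
(clustera) # crm_simulate --xml-file pe-warn.last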

>    error: crm_abort:    crm_element_value: Triggered assert at
> xml.c:5135 : data != NULL
>    Configuration validation is currently disabled. It is highly
> encouraged and prevents many common cluster issues.
>    error: crm_element_value:    Couldn't find validate-with in NULL
>    error: crm_abort:    crm_element_value: Triggered assert at
> xml.c:5135 : data != NULL
>    error: crm_element_value:    Couldn't find ignore-dtd in NULL
>    error: crm_abort:    crm_element_value: Triggered assert at
> xml.c:5135 : data != NULL
>    error: crm_abort:    validate_with: Triggered assert at
> schemas.c:522 : xml != NULL
>    error: crm_abort:    validate_with: Triggered assert at
> schemas.c:522 : xml != NULL
>    error: crm_abort:    validate_with: Triggered assert at
> schemas.c:522 : xml != NULL
>    error: crm_abort:    validate_with: Triggered assert at
> schemas.c:522 : xml != NULL
>    error: crm_abort:    validate_with: Triggered assert at
> schemas.c:522 : xml != NULL
>    error: crm_abort:    validate_with: Triggered assert at
> schemas.c:522 : xml != NULL
>    error: crm_abort:    validate_with: Triggered assert at
> schemas.c:522 : xml != NULL
>    error: crm_abort:    validate_with: Triggered assert at
> schemas.c:522 : xml != NULL
>    error: crm_abort:    validate_with: Triggered assert at
> schemas.c:522 : xml != NULL
>    error: crm_abort:    validate_with: Triggered assert at
> schemas.c:522 : xml != NULL
>    error: crm_abort:    validate_with: Triggered assert at
> schemas.c:522 : xml != NULL
>    error: crm_abort:    validate_with: Triggered assert at
> schemas.c:522 : xml != NULL
>    error: crm_abort:    validate_with: Triggered assert at
> schemas.c:522 : xml != NULL
>    error: crm_abort:    validate_with: Triggered assert at
> schemas.c:522 : xml != NULL
>    error: crm_abort:    validate_with: Triggered assert at
> schemas.c:522 : xml != NULL
>    error: crm_abort:    validate_with: Triggered assert at
> schemas.c:522 : xml != NULL
>    error: crm_abort:    validate_with: Triggered assert at
> schemas.c:522 : xml != NULL
>    error: crm_abort:    crm_xml_add: Triggered assert at xml.c:2494 :
> node != NULL
>    error: write_xml_stream: Cannot write NULL to
> /var/lib/pacemaker/cib/shadow.20008
>    Could not create '/var/lib/pacemaker/cib/shadow.20008': Success
> 
> Could anyone help me understand these messages and what's going on
> with my server?
> 
> Thanks a lot..
> 
> 
> On Fri, Jun 8, 2018 at 4:49 AM, Ken Gaillot wrote:
> > On Thu, 2018-06-07 at 08:37 +0800, Albert Weng wrote:
> > > Hi Andrei,
> > > 
> > > Thanks for your quick reply. I still need help, as below:
> > > 
> > > On Wed, Jun 6, 2018 at 11:58 AM, Andrei Borzenkov wrote:
> > > > 06.06.2018 04:27, Albert Weng writes:
> > > > >  Hi All,
> > > > > 
> > > > > I have created an active/passive pacemaker cluster on RHEL 7.
> > > > > 
> > > > > Here are my environment:
> > > > > clustera : 192.168.11.1 (passive)
> > > > > clusterb : 192.168.11.2 (master)
> > > > > clustera-ilo4 : 192.168.11.10
> > > > > clusterb-ilo4 : 192.168.11.11
> > > > > 
> > > > > cluster resource status :
> > > > >      cluster_fs        started on clusterb
> > > > >      cluster_vip       started on clusterb
> > > > >      cluster_sid       started on clusterb
> > > > >      cluster_listnr    started on clusterb
> > > > > 
> > > > > Both cluster nodes are online.
> > > > > 
> > > > > I found my corosync.log contains many records like the following:
> > > > > 
> > > > > clustera        pengine:     info: determine_online_status_fencing: Node clusterb is active
> > > > > clustera        pengine:     info: determine_online_status: Node clusterb is online
> > > > > clustera        pengine:     info: determine_online_status_fencing: Node clustera is active
> > > > > clustera        pengine:     info: determine_online_status: Node clustera is online
> > > > > 
> > > > > *clustera        pengine:  warning: unpack_rsc_op_failure: Processing failed op start for cluster_sid on clustera: unknown error (1)*
> > > > > *=> Question: Why is pengine always trying to start cluster_sid on the passive node? How do I fix it?*
> > > > > 
> > > > 
> > > > pacemaker does not have a concept of "passive" or 

Re: [ClusterLabs] Pengine always trying to start the resource on the standby node.

2018-06-13 Thread Albert Weng
Hi All,

Thanks for reply.

Recently, I ran the following command:
(clustera) # crm_simulate --xml-file pe-warn.last

It returned the following results:
   error: crm_abort: xpath_search: Triggered assert at xpath.c:153 : xml_top != NULL
   error: crm_element_value: Couldn't find validate-with in NULL
   error: crm_abort: crm_element_value: Triggered assert at xml.c:5135 : data != NULL
   Configuration validation is currently disabled. It is highly encouraged and prevents many common cluster issues.
   error: crm_element_value: Couldn't find validate-with in NULL
   error: crm_abort: crm_element_value: Triggered assert at xml.c:5135 : data != NULL
   error: crm_element_value: Couldn't find ignore-dtd in NULL
   error: crm_abort: crm_element_value: Triggered assert at xml.c:5135 : data != NULL
   error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
   error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
   error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
   error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
   error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
   error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
   error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
   error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
   error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
   error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
   error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
   error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
   error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
   error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
   error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
   error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
   error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
   error: crm_abort: crm_xml_add: Triggered assert at xml.c:2494 : node != NULL
   error: write_xml_stream: Cannot write NULL to /var/lib/pacemaker/cib/shadow.20008
   Could not create '/var/lib/pacemaker/cib/shadow.20008': Success

Could anyone help me understand these messages and what's going on with my
server?

Thanks a lot..


On Fri, Jun 8, 2018 at 4:49 AM, Ken Gaillot wrote:

> On Thu, 2018-06-07 at 08:37 +0800, Albert Weng wrote:
> > Hi Andrei,
> >
> > Thanks for your quick reply. I still need help, as below:
> >
> > On Wed, Jun 6, 2018 at 11:58 AM, Andrei Borzenkov wrote:
> > > 06.06.2018 04:27, Albert Weng writes:
> > > >  Hi All,
> > > >
> > > > I have created an active/passive pacemaker cluster on RHEL 7.
> > > >
> > > > Here are my environment:
> > > > clustera : 192.168.11.1 (passive)
> > > > clusterb : 192.168.11.2 (master)
> > > > clustera-ilo4 : 192.168.11.10
> > > > clusterb-ilo4 : 192.168.11.11
> > > >
> > > > cluster resource status :
> > > >  cluster_fs        started on clusterb
> > > >  cluster_vip       started on clusterb
> > > >  cluster_sid       started on clusterb
> > > >  cluster_listnr    started on clusterb
> > > >
> > > > Both cluster nodes are online.
> > > >
> > > > I found my corosync.log contains many records like the following:
> > > >
> > > > clustera        pengine:     info: determine_online_status_fencing: Node clusterb is active
> > > > clustera        pengine:     info: determine_online_status: Node clusterb is online
> > > > clustera        pengine:     info: determine_online_status_fencing: Node clustera is active
> > > > clustera        pengine:     info: determine_online_status: Node clustera is online
> > > >
> > > > *clustera        pengine:  warning: unpack_rsc_op_failure: Processing failed op start for cluster_sid on clustera: unknown error (1)*
> > > > *=> Question: Why is pengine always trying to start cluster_sid on the passive node? How do I fix it?*
> > > >
> > >
> > > pacemaker does not have a concept of "passive" or "master" nodes - it
> > > is up to you to decide resource placement when you configure the
> > > cluster. By default pacemaker will attempt to spread resources across
> > > all eligible nodes. You can influence node selection by using
> > > constraints. See
> > > https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_deciding_which_nodes_a_resource_can_run_on.html
> > > for details.
> > >
> > > But in any case - all your resources MUST be capable of running on
> > > both nodes, otherwise the cluster makes no sense. If one resource A
> > > depends
> > 

Re: [ClusterLabs] Pengine always trying to start the resource on the standby node.

2018-06-07 Thread Ken Gaillot
On Thu, 2018-06-07 at 08:37 +0800, Albert Weng wrote:
> Hi Andrei,
> 
> Thanks for your quick reply. I still need help, as below:
> 
> On Wed, Jun 6, 2018 at 11:58 AM, Andrei Borzenkov wrote:
> > 06.06.2018 04:27, Albert Weng writes:
> > >  Hi All,
> > > 
> > > I have created an active/passive pacemaker cluster on RHEL 7.
> > > 
> > > Here are my environment:
> > > clustera : 192.168.11.1 (passive)
> > > clusterb : 192.168.11.2 (master)
> > > clustera-ilo4 : 192.168.11.10
> > > clusterb-ilo4 : 192.168.11.11
> > > 
> > > cluster resource status :
> > >      cluster_fs        started on clusterb
> > >      cluster_vip       started on clusterb
> > >      cluster_sid       started on clusterb
> > >      cluster_listnr    started on clusterb
> > > 
> > > Both cluster nodes are online.
> > > 
> > > I found my corosync.log contains many records like the following:
> > > 
> > > clustera        pengine:     info: determine_online_status_fencing: Node clusterb is active
> > > clustera        pengine:     info: determine_online_status: Node clusterb is online
> > > clustera        pengine:     info: determine_online_status_fencing: Node clustera is active
> > > clustera        pengine:     info: determine_online_status: Node clustera is online
> > > 
> > > *clustera        pengine:  warning: unpack_rsc_op_failure: Processing failed op start for cluster_sid on clustera: unknown error (1)*
> > > *=> Question: Why is pengine always trying to start cluster_sid on the passive node? How do I fix it?*
> > > 
> > 
> > pacemaker does not have a concept of "passive" or "master" nodes - it
> > is up to you to decide resource placement when you configure the
> > cluster. By default pacemaker will attempt to spread resources across
> > all eligible nodes. You can influence node selection by using
> > constraints. See
> > https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_deciding_which_nodes_a_resource_can_run_on.html
> > for details.
> > 
> > But in any case - all your resources MUST be capable of running on
> > both nodes, otherwise the cluster makes no sense. If one resource A
> > depends on something that another resource B provides and can be
> > started only together with resource B (and after it is ready), you
> > must tell pacemaker by using resource colocations and ordering. See
> > the same document for details.
> > 
> > > clustera        pengine:     info: native_print:   ipmi-fence-clustera (stonith:fence_ipmilan):        Started clustera
> > > clustera        pengine:     info: native_print:   ipmi-fence-clusterb (stonith:fence_ipmilan):        Started clustera
> > > clustera        pengine:     info: group_print:     Resource Group: cluster
> > > clustera        pengine:     info: native_print:        cluster_fs (ocf::heartbeat:Filesystem):    Started clusterb
> > > clustera        pengine:     info: native_print:        cluster_vip (ocf::heartbeat:IPaddr2):       Started clusterb
> > > clustera        pengine:     info: native_print:        cluster_sid (ocf::heartbeat:oracle):        Started clusterb
> > > clustera        pengine:     info: native_print:        cluster_listnr (ocf::heartbeat:oralsnr):       Started clusterb
> > > clustera        pengine:     info: get_failcount_full:    cluster_sid has failed INFINITY times on clustera
> > > 
> > > 
> > > *clustera        pengine:  warning: common_apply_stickiness: Forcing cluster_sid away from clustera after 100 failures (max=100)*
> > > *=> Question: Did too many failed attempts result in the resource being forbidden to start on clustera?*
> > > 
> > 
> > Yes.
> 
> How can I find the root cause of the 100 failures? Which log will
> contain the error message?

As an aside, 1,000,000 is "infinity" to pacemaker. It could mean
1,000,000 actual failures, or a "fatal" failure that causes pacemaker
to set the fail count to infinity.

The most recent failure of each resource will be shown in the status
display (crm_mon, pcs status, etc.). They will have a basic exit code
(which you can use to distinguish a timeout from an error received from
the agent), and if the agent provided one, an "exit-reason". That's the
first place to look.

Failures will remain in the status display, and affect the placement of
resources, until one of two things happens: you manually clean up the
failure (crm_resource --cleanup, pcs resource cleanup, etc.), or, if
you configured a failure-timeout for the resource, that much time has
passed with no more failures.
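
For example, a minimal sketch with the tools mentioned above (resource
name taken from this thread; the failure-timeout value is only an
illustration):

# show the most recent failure, exit code, and any exit-reason
(clustera) # crm_mon -1
(clustera) # pcs resource failcount show cluster_sid

# clear the fail count so clustera becomes eligible again
(clustera) # pcs resource cleanup cluster_sid
# equivalently: crm_resource --cleanup --resource cluster_sid

# optionally let failures expire automatically after 10 minutes
(clustera) # pcs resource meta cluster_sid failure-timeout=600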

For deeper investigation, check the system log (wherever it's kept on
your distro). You can use the timestamp from the failure in the status
to know where to look.
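
On RHEL 7 that normally means /var/log/messages or the journal, for
example (the timestamp is just an illustration):

(clustera) # journalctl -u pacemaker --since "2018-06-13 10:00"
(clustera) # grep cluster_sid /var/log/messages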

For even more detail, you can look at pacemaker's detail log (the one
you posted excerpts from). This will have additional messages beyond
the system log, but they are 

Re: [ClusterLabs] Pengine always trying to start the resource on the standby node.

2018-06-06 Thread Albert Weng
Hi Andrei,

Thanks for your quick reply. I still need help, as below:

On Wed, Jun 6, 2018 at 11:58 AM, Andrei Borzenkov wrote:

> 06.06.2018 04:27, Albert Weng writes:
> >  Hi All,
> >
> > I have created an active/passive pacemaker cluster on RHEL 7.
> >
> > Here are my environment:
> > clustera : 192.168.11.1 (passive)
> > clusterb : 192.168.11.2 (master)
> > clustera-ilo4 : 192.168.11.10
> > clusterb-ilo4 : 192.168.11.11
> >
> > cluster resource status :
> >  cluster_fs        started on clusterb
> >  cluster_vip       started on clusterb
> >  cluster_sid       started on clusterb
> >  cluster_listnr    started on clusterb
> >
> > Both cluster nodes are online.
> >
> > I found my corosync.log contains many records like the following:
> >
> > clustera        pengine:     info: determine_online_status_fencing: Node clusterb is active
> > clustera        pengine:     info: determine_online_status: Node clusterb is online
> > clustera        pengine:     info: determine_online_status_fencing: Node clustera is active
> > clustera        pengine:     info: determine_online_status: Node clustera is online
> >
> > *clustera        pengine:  warning: unpack_rsc_op_failure: Processing failed op start for cluster_sid on clustera: unknown error (1)*
> > *=> Question: Why is pengine always trying to start cluster_sid on the passive node? How do I fix it?*
> >
>
> pacemaker does not have a concept of "passive" or "master" nodes - it is
> up to you to decide resource placement when you configure the cluster. By
> default pacemaker will attempt to spread resources across all eligible
> nodes. You can influence node selection by using constraints. See
> https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_deciding_which_nodes_a_resource_can_run_on.html
> for details.
>
> But in any case - all your resources MUST be capable of running on both
> nodes, otherwise the cluster makes no sense. If one resource A depends on
> something that another resource B provides and can be started only
> together with resource B (and after it is ready), you must tell pacemaker
> by using resource colocations and ordering. See the same document for
> details.
>
> > clustera        pengine:     info: native_print:   ipmi-fence-clustera (stonith:fence_ipmilan):        Started clustera
> > clustera        pengine:     info: native_print:   ipmi-fence-clusterb (stonith:fence_ipmilan):        Started clustera
> > clustera        pengine:     info: group_print:     Resource Group: cluster
> > clustera        pengine:     info: native_print:        cluster_fs (ocf::heartbeat:Filesystem):    Started clusterb
> > clustera        pengine:     info: native_print:        cluster_vip (ocf::heartbeat:IPaddr2):       Started clusterb
> > clustera        pengine:     info: native_print:        cluster_sid (ocf::heartbeat:oracle):        Started clusterb
> > clustera        pengine:     info: native_print:        cluster_listnr (ocf::heartbeat:oralsnr):       Started clusterb
> > clustera        pengine:     info: get_failcount_full:    cluster_sid has failed INFINITY times on clustera
> >
> >
> > *clustera        pengine:  warning: common_apply_stickiness: Forcing cluster_sid away from clustera after 100 failures (max=100)*
> > *=> Question: Did too many failed attempts result in the resource being forbidden to start on clustera?*
> >
>
> Yes.
>

How can I find the root cause of the 100 failures? Which log will contain
the error message?


>
> > A couple of days ago, clusterb was fenced (stonith) for an unknown
> > reason, but only "cluster_fs" and "cluster_vip" moved to clustera
> > successfully; "cluster_sid" and "cluster_listnr" went to "Stopped"
> > status. In messages like the ones below, is this related to "op start
> > for cluster_sid on clustera..."?
> >
>
> Yes. Node clustera is now marked as incapable of running the resource,
> so if node clusterb fails, the resource cannot be started anywhere.
>
> How could I fix it? I need some hints for troubleshooting.


> > clustera        pengine:  warning: unpack_rsc_op_failure: Processing failed op start for cluster_sid on clustera: unknown error (1)
> > clustera        pengine:     info: native_print:   ipmi-fence-clustera (stonith:fence_ipmilan):        Started clustera
> > clustera        pengine:     info: native_print:   ipmi-fence-clusterb (stonith:fence_ipmilan):        Started clustera
> > clustera        pengine:     info: group_print:     Resource Group: cluster
> > clustera        pengine:     info: native_print:        cluster_fs (ocf::heartbeat:Filesystem):    Started clusterb (UNCLEAN)
> > clustera        pengine:     info: native_print:        cluster_vip (ocf::heartbeat:IPaddr2):       Started clusterb (UNCLEAN)
> > clustera        pengine:     info: native_print:        cluster_sid (ocf::heartbeat:oracle):        Started clusterb (UNCLEAN)
> > clustera        pengine:     info: native_print:        cluster_listnr
> > 

[ClusterLabs] pengine always trying to start the resource on the standby node.

2018-06-06 Thread Albert Weng
Hi All,

I have created an active/passive pacemaker cluster on RHEL 7.

Here are my environment:
clustera : 192.168.11.1 (passive)
clusterb : 192.168.11.2 (master)
clustera-ilo4 : 192.168.11.10
clusterb-ilo4 : 192.168.11.11

cluster resource status :
 cluster_fs        started on clusterb
 cluster_vip       started on clusterb
 cluster_sid       started on clusterb
 cluster_listnr    started on clusterb

Both cluster nodes are online.

I found my corosync.log contains many records like the following:

clustera        pengine:     info: determine_online_status_fencing: Node clusterb is active
clustera        pengine:     info: determine_online_status: Node clusterb is online
clustera        pengine:     info: determine_online_status_fencing: Node clustera is active
clustera        pengine:     info: determine_online_status: Node clustera is online

*clustera        pengine:  warning: unpack_rsc_op_failure: Processing failed op start for cluster_sid on clustera: unknown error (1)*
*=> Question: Why is pengine always trying to start cluster_sid on the passive node? How do I fix it?*

clustera        pengine:     info: native_print:   ipmi-fence-clustera (stonith:fence_ipmilan):        Started clustera
clustera        pengine:     info: native_print:   ipmi-fence-clusterb (stonith:fence_ipmilan):        Started clustera
clustera        pengine:     info: group_print:     Resource Group: cluster
clustera        pengine:     info: native_print:        cluster_fs (ocf::heartbeat:Filesystem):    Started clusterb
clustera        pengine:     info: native_print:        cluster_vip (ocf::heartbeat:IPaddr2):       Started clusterb
clustera        pengine:     info: native_print:        cluster_sid (ocf::heartbeat:oracle):        Started clusterb
clustera        pengine:     info: native_print:        cluster_listnr (ocf::heartbeat:oralsnr):       Started clusterb
clustera        pengine:     info: get_failcount_full:    cluster_sid has failed INFINITY times on clustera


*clustera        pengine:  warning: common_apply_stickiness: Forcing cluster_sid away from clustera after 100 failures (max=100)*
*=> Question: Did too many failed attempts result in the resource being forbidden to start on clustera?*

A couple of days ago, clusterb was fenced (stonith) for an unknown reason,
but only "cluster_fs" and "cluster_vip" moved to clustera successfully;
"cluster_sid" and "cluster_listnr" went to "Stopped" status.
In messages like the ones below, is this related to "op start for
cluster_sid on clustera..."?

clustera        pengine:  warning: unpack_rsc_op_failure: Processing failed op start for cluster_sid on clustera: unknown error (1)
clustera        pengine:     info: native_print:   ipmi-fence-clustera (stonith:fence_ipmilan):        Started clustera
clustera        pengine:     info: native_print:   ipmi-fence-clusterb (stonith:fence_ipmilan):        Started clustera
clustera        pengine:     info: group_print:     Resource Group: cluster
clustera        pengine:     info: native_print:        cluster_fs (ocf::heartbeat:Filesystem):    Started clusterb (UNCLEAN)
clustera        pengine:     info: native_print:        cluster_vip (ocf::heartbeat:IPaddr2):       Started clusterb (UNCLEAN)
clustera        pengine:     info: native_print:        cluster_sid (ocf::heartbeat:oracle):        Started clusterb (UNCLEAN)
clustera        pengine:     info: native_print:        cluster_listnr (ocf::heartbeat:oralsnr):       Started clusterb (UNCLEAN)
clustera        pengine:     info: get_failcount_full:    cluster_sid has failed INFINITY times on clustera
clustera        pengine:  warning: common_apply_stickiness: Forcing cluster_sid away from clustera after 100 failures (max=100)
clustera        pengine:     info: rsc_merge_weights:  cluster_fs: Rolling back scores from cluster_sid
clustera        pengine:     info: rsc_merge_weights:  cluster_vip: Rolling back scores from cluster_sid
clustera        pengine:     info: rsc_merge_weights:  cluster_sid: Rolling back scores from cluster_listnr
clustera        pengine:     info: native_color:   Resource cluster_sid cannot run anywhere
clustera        pengine:     info: native_color:   Resource cluster_listnr cannot run anywhere
clustera        pengine:  warning: custom_action:  Action cluster_fs_stop_0 on clusterb is unrunnable (offline)
clustera        pengine:     info: RecurringOp: Start recurring monitor (20s) for cluster_fs on clustera
clustera        pengine:  warning: custom_action:  Action cluster_vip_stop_0 on clusterb is unrunnable (offline)
clustera        pengine:     info: RecurringOp: Start recurring monitor (10s) for cluster_vip on clustera
clustera        pengine:  warning: custom_action:  Action cluster_sid_stop_0 on clusterb is unrunnable (offline)
clustera        pengine:  warning: custom_action:  Action cluster_sid_stop_0 on clusterb is unrunnable (offline)
clustera        pengine:  warning: custom_action:  Action cluster_listnr_stop_0 on clusterb is unrunnable (offline)
clustera        pengine:  warning: custom_action:  

Re: [ClusterLabs] Pengine always trying to start the resource on the standby node.

2018-06-05 Thread Andrei Borzenkov
06.06.2018 04:27, Albert Weng writes:
>  Hi All,
> 
> I have created an active/passive pacemaker cluster on RHEL 7.
> 
> Here are my environment:
> clustera : 192.168.11.1 (passive)
> clusterb : 192.168.11.2 (master)
> clustera-ilo4 : 192.168.11.10
> clusterb-ilo4 : 192.168.11.11
> 
> cluster resource status :
>  cluster_fs        started on clusterb
>  cluster_vip       started on clusterb
>  cluster_sid       started on clusterb
>  cluster_listnr    started on clusterb
> 
> Both cluster nodes are online.
> 
> I found my corosync.log contains many records like the following:
> 
> clustera        pengine:     info: determine_online_status_fencing: Node clusterb is active
> clustera        pengine:     info: determine_online_status: Node clusterb is online
> clustera        pengine:     info: determine_online_status_fencing: Node clustera is active
> clustera        pengine:     info: determine_online_status: Node clustera is online
> 
> *clustera        pengine:  warning: unpack_rsc_op_failure: Processing failed op start for cluster_sid on clustera: unknown error (1)*
> *=> Question: Why is pengine always trying to start cluster_sid on the passive node? How do I fix it?*
> 

pacemaker does not have a concept of "passive" or "master" nodes - it is
up to you to decide resource placement when you configure the cluster. By
default pacemaker will attempt to spread resources across all eligible
nodes. You can influence node selection by using constraints. See
https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_deciding_which_nodes_a_resource_can_run_on.html
for details.
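
For example, with pcs, a location constraint preferring clusterb for
the resource group from this thread would look like this (the score of
100 is only an illustration):

(clustera) # pcs constraint location cluster prefers clusterb=100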

But in any case - all your resources MUST be capable of running on both
nodes, otherwise the cluster makes no sense. If one resource A depends on
something that another resource B provides and can be started only
together with resource B (and after it is ready), you must tell pacemaker
by using resource colocations and ordering. See the same document for
details.
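
A resource group already colocates and orders its members, but for
standalone resources the equivalent constraints would look something
like this (resource names taken from this thread):

# keep the VIP with the filesystem, and start the filesystem first
(clustera) # pcs constraint colocation add cluster_vip with cluster_fs INFINITY
(clustera) # pcs constraint order cluster_fs then cluster_vip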

> clustera        pengine:     info: native_print:   ipmi-fence-clustera (stonith:fence_ipmilan):        Started clustera
> clustera        pengine:     info: native_print:   ipmi-fence-clusterb (stonith:fence_ipmilan):        Started clustera
> clustera        pengine:     info: group_print:     Resource Group: cluster
> clustera        pengine:     info: native_print:        cluster_fs (ocf::heartbeat:Filesystem):    Started clusterb
> clustera        pengine:     info: native_print:        cluster_vip (ocf::heartbeat:IPaddr2):       Started clusterb
> clustera        pengine:     info: native_print:        cluster_sid (ocf::heartbeat:oracle):        Started clusterb
> clustera        pengine:     info: native_print:        cluster_listnr (ocf::heartbeat:oralsnr):       Started clusterb
> clustera        pengine:     info: get_failcount_full:    cluster_sid has failed INFINITY times on clustera
> 
> 
> *clustera        pengine:  warning: common_apply_stickiness: Forcing cluster_sid away from clustera after 100 failures (max=100)*
> *=> Question: Did too many failed attempts result in the resource being forbidden to start on clustera?*
> 

Yes.

> A couple of days ago, clusterb was fenced (stonith) for an unknown
> reason, but only "cluster_fs" and "cluster_vip" moved to clustera
> successfully; "cluster_sid" and "cluster_listnr" went to "Stopped"
> status. In messages like the ones below, is this related to "op start
> for cluster_sid on clustera..."?
> 

Yes. Node clustera is now marked as incapable of running the resource, so
if node clusterb fails, the resource cannot be started anywhere.

> clustera        pengine:  warning: unpack_rsc_op_failure: Processing failed op start for cluster_sid on clustera: unknown error (1)
> clustera        pengine:     info: native_print:   ipmi-fence-clustera (stonith:fence_ipmilan):        Started clustera
> clustera        pengine:     info: native_print:   ipmi-fence-clusterb (stonith:fence_ipmilan):        Started clustera
> clustera        pengine:     info: group_print:     Resource Group: cluster
> clustera        pengine:     info: native_print:        cluster_fs (ocf::heartbeat:Filesystem):    Started clusterb (UNCLEAN)
> clustera        pengine:     info: native_print:        cluster_vip (ocf::heartbeat:IPaddr2):       Started clusterb (UNCLEAN)
> clustera        pengine:     info: native_print:        cluster_sid (ocf::heartbeat:oracle):        Started clusterb (UNCLEAN)
> clustera        pengine:     info: native_print:        cluster_listnr (ocf::heartbeat:oralsnr):       Started clusterb (UNCLEAN)
> clustera        pengine:     info: get_failcount_full:    cluster_sid has failed INFINITY times on clustera
> clustera        pengine:  warning: common_apply_stickiness: Forcing cluster_sid away from clustera after 100 failures (max=100)
> clustera        pengine:     info: rsc_merge_weights:  cluster_fs: Rolling back scores from cluster_sid
> clustera        pengine:     info: rsc_merge_weights:  

[ClusterLabs] Pengine always trying to start the resource on the standby node.

2018-06-05 Thread Albert Weng
 Hi All,

I have created an active/passive pacemaker cluster on RHEL 7.

Here are my environment:
clustera : 192.168.11.1 (passive)
clusterb : 192.168.11.2 (master)
clustera-ilo4 : 192.168.11.10
clusterb-ilo4 : 192.168.11.11

cluster resource status :
 cluster_fs        started on clusterb
 cluster_vip       started on clusterb
 cluster_sid       started on clusterb
 cluster_listnr    started on clusterb

Both cluster nodes are online.

I found my corosync.log contains many records like the following:

clustera        pengine:     info: determine_online_status_fencing: Node clusterb is active
clustera        pengine:     info: determine_online_status: Node clusterb is online
clustera        pengine:     info: determine_online_status_fencing: Node clustera is active
clustera        pengine:     info: determine_online_status: Node clustera is online

*clustera        pengine:  warning: unpack_rsc_op_failure: Processing failed op start for cluster_sid on clustera: unknown error (1)*
*=> Question: Why is pengine always trying to start cluster_sid on the passive node? How do I fix it?*

clustera        pengine:     info: native_print:   ipmi-fence-clustera (stonith:fence_ipmilan):        Started clustera
clustera        pengine:     info: native_print:   ipmi-fence-clusterb (stonith:fence_ipmilan):        Started clustera
clustera        pengine:     info: group_print:     Resource Group: cluster
clustera        pengine:     info: native_print:        cluster_fs (ocf::heartbeat:Filesystem):    Started clusterb
clustera        pengine:     info: native_print:        cluster_vip (ocf::heartbeat:IPaddr2):       Started clusterb
clustera        pengine:     info: native_print:        cluster_sid (ocf::heartbeat:oracle):        Started clusterb
clustera        pengine:     info: native_print:        cluster_listnr (ocf::heartbeat:oralsnr):       Started clusterb
clustera        pengine:     info: get_failcount_full:    cluster_sid has failed INFINITY times on clustera


*clustera        pengine:  warning: common_apply_stickiness: Forcing cluster_sid away from clustera after 100 failures (max=100)*
*=> Question: Did too many failed attempts result in the resource being forbidden to start on clustera?*

A couple of days ago, clusterb was fenced (stonith) for an unknown reason,
but only "cluster_fs" and "cluster_vip" moved to clustera successfully;
"cluster_sid" and "cluster_listnr" went to "Stopped" status.
In messages like the ones below, is this related to "op start for
cluster_sid on clustera..."?

clustera        pengine:  warning: unpack_rsc_op_failure: Processing failed op start for cluster_sid on clustera: unknown error (1)
clustera        pengine:     info: native_print:   ipmi-fence-clustera (stonith:fence_ipmilan):        Started clustera
clustera        pengine:     info: native_print:   ipmi-fence-clusterb (stonith:fence_ipmilan):        Started clustera
clustera        pengine:     info: group_print:     Resource Group: cluster
clustera        pengine:     info: native_print:        cluster_fs (ocf::heartbeat:Filesystem):    Started clusterb (UNCLEAN)
clustera        pengine:     info: native_print:        cluster_vip (ocf::heartbeat:IPaddr2):       Started clusterb (UNCLEAN)
clustera        pengine:     info: native_print:        cluster_sid (ocf::heartbeat:oracle):        Started clusterb (UNCLEAN)
clustera        pengine:     info: native_print:        cluster_listnr (ocf::heartbeat:oralsnr):       Started clusterb (UNCLEAN)
clustera        pengine:     info: get_failcount_full:    cluster_sid has failed INFINITY times on clustera
clustera        pengine:  warning: common_apply_stickiness: Forcing cluster_sid away from clustera after 100 failures (max=100)
clustera        pengine:     info: rsc_merge_weights:  cluster_fs: Rolling back scores from cluster_sid
clustera        pengine:     info: rsc_merge_weights:  cluster_vip: Rolling back scores from cluster_sid
clustera        pengine:     info: rsc_merge_weights:  cluster_sid: Rolling back scores from cluster_listnr
clustera        pengine:     info: native_color:   Resource cluster_sid cannot run anywhere
clustera        pengine:     info: native_color:   Resource cluster_listnr cannot run anywhere
clustera        pengine:  warning: custom_action:  Action cluster_fs_stop_0 on clusterb is unrunnable (offline)
clustera        pengine:     info: RecurringOp: Start recurring monitor (20s) for cluster_fs on clustera
clustera        pengine:  warning: custom_action:  Action cluster_vip_stop_0 on clusterb is unrunnable (offline)
clustera        pengine:     info: RecurringOp: Start recurring monitor (10s) for cluster_vip on clustera
clustera        pengine:  warning: custom_action:  Action cluster_sid_stop_0 on clusterb is unrunnable (offline)
clustera        pengine:  warning: custom_action:  Action cluster_sid_stop_0 on clusterb is unrunnable (offline)
clustera        pengine:  warning: custom_action:  Action cluster_listnr_stop_0 on clusterb is unrunnable (offline)
clustera        pengine:  warning: custom_action: