Re: [ClusterLabs] Questions about SBD behavior

2018-06-13 Thread 井上 和徳
Thanks for the response.

I understand now that, as of v1.3.1 and later, real quorum is required.
I also read this:
https://wiki.clusterlabs.org/wiki/Using_SBD_with_Pacemaker#Watchdog-based_self-fencing_with_resource_recovery
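
For context, a watchdog-only (diskless) SBD setup typically looks roughly like
the sketch below -- values are placeholders only, and the actual watchdog
device and timeouts depend on the hardware:

  # /etc/sysconfig/sbd (no SBD_DEVICE configured => watchdog-only mode)
  SBD_WATCHDOG_DEV=/dev/watchdog
  SBD_WATCHDOG_TIMEOUT=5

  # tell Pacemaker to rely on watchdog-based self-fencing
  crm_attribute --type crm_config --name stonith-watchdog-timeout --update 10
  crm_attribute --type crm_config --name stonith-enabled --update true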

In connection with this behavior, we are verifying the following
known issue before moving to pacemaker-2.0.

* When SIGSTOP is sent to the pacemaker process, no failure of the
  resource will be detected.
  https://lists.clusterlabs.org/pipermail/users/2016-September/011146.html
  https://lists.clusterlabs.org/pipermail/users/2016-October/011429.html

  I expected SBD to handle this, but nothing detected that the
  following processes were frozen. Therefore, no resource failure
  was detected either.
  - pacemaker-based
  - pacemaker-execd
  - pacemaker-attrd
  - pacemaker-schedulerd
  - pacemaker-controld

  I checked the following slides, but could not find the current
  status of how this is being addressed.
  
https://wiki.clusterlabs.org/w/images/1/1a/Recent_Work_and_Future_Plans_for_SBD_1.1.pdf

As a result of our discussion, we want SBD to detect it and reset the
machine.

Also, for users who have neither a shared disk nor qdevice,
we need an option that works even without real quorum.
(Fence races can be avoided with a delay attribute:
 https://access.redhat.com/solutions/91653
 https://access.redhat.com/solutions/1293523)
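
(For illustration, one way to set such a delay -- a sketch only; "fence-clusterb"
is a hypothetical stonith resource name, and pcmk_delay_max/delay are the usual
per-device parameters:)

  # random delay of up to 15s on one node's fence device, to avoid a fence race
  pcs stonith update fence-clusterb pcmk_delay_max=15
  # or a fixed delay, supported by most fence agents
  pcs stonith update fence-clusterb delay=15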

Best Regards,
Kazunori INOUE

> -Original Message-
> From: Users [mailto:users-boun...@clusterlabs.org] On Behalf Of Klaus 
> Wenninger
> Sent: Friday, May 25, 2018 4:08 PM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Questions about SBD behavior
> 
> On 05/25/2018 07:31 AM, 井上 和徳 wrote:
> > Hi,
> >
> > I am checking the watchdog function of SBD (without shared block-device).
> > In a two-node cluster, if the cluster is stopped on one node, the watchdog
> > is triggered on the remaining node.
> > Is this the designed behavior?
> 
> SBD without a shared block-device doesn't really make sense on
> a two-node cluster.
> The basic idea is - e.g. in a case of a networking problem -
> that a cluster splits up in a quorate and a non-quorate partition.
> The quorate partition stays over while SBD guarantees a
> reliable watchdog-based self-fencing of the non-quorate partition
> within a defined timeout.
> This idea of course doesn't work with just 2 nodes.
> Taking quorum info from the 2-node feature of corosync (automatically
> switching on wait-for-all) doesn't help in this case but instead
> would lead to split-brain.
> What you can do - and what e.g. pcs does automatically - is enable
> the auto-tie-breaker instead of two-node in corosync. But that
> still doesn't give you a higher availability than the one of the
> winner of auto-tie-breaker. (Maybe interesting if you are going
> for a load-balancing-scenario that doesn't affect availability or
> for a transient state while setting up a cluster node-by-node ...)
> What you can do though is using qdevice to still have 'real-quorum'
> info with just 2 full cluster-nodes.
> 
> There was quite a lot of discussion round this topic on this
> thread previously if you search the history.
> 
> Regards,
> Klaus


Re: [ClusterLabs] Pengine always trying to start the resource on the standby node.

2018-06-13 Thread Albert Weng
Hi All,

Thanks for reply.

Recently, I ran the following command:
(clustera) # crm_simulate --xml-file pe-warn.last

It returned the following results:
   error: crm_abort:    xpath_search: Triggered assert at xpath.c:153 : xml_top != NULL
   error: crm_element_value:    Couldn't find validate-with in NULL
   error: crm_abort:    crm_element_value: Triggered assert at xml.c:5135 : data != NULL
   Configuration validation is currently disabled. It is highly encouraged and prevents many common cluster issues.
   error: crm_element_value:    Couldn't find validate-with in NULL
   error: crm_abort:    crm_element_value: Triggered assert at xml.c:5135 : data != NULL
   error: crm_element_value:    Couldn't find ignore-dtd in NULL
   error: crm_abort:    crm_element_value: Triggered assert at xml.c:5135 : data != NULL
   error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
   error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
   error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
   error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
   error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
   error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
   error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
   error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
   error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
   error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
   error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
   error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
   error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
   error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
   error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
   error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
   error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
   error: crm_abort:    crm_xml_add: Triggered assert at xml.c:2494 : node != NULL
   error: write_xml_stream: Cannot write NULL to /var/lib/pacemaker/cib/shadow.20008
   Could not create '/var/lib/pacemaker/cib/shadow.20008': Success

Could anyone help me interpret these messages and explain what is going on
with my server?

Thanks a lot..


On Fri, Jun 8, 2018 at 4:49 AM, Ken Gaillot  wrote:

> On Thu, 2018-06-07 at 08:37 +0800, Albert Weng wrote:
> > Hi Andrei,
> >
> > Thanks for your quick reply. I still need help, as below:
> >
> > On Wed, Jun 6, 2018 at 11:58 AM, Andrei Borzenkov wrote:
> > > On 06.06.2018 04:27, Albert Weng wrote:
> > > >  Hi All,
> > > >
> > > > I have created active/passive pacemaker cluster on RHEL 7.
> > > >
> > > > Here are my environment:
> > > > clustera : 192.168.11.1 (passive)
> > > > clusterb : 192.168.11.2 (master)
> > > > clustera-ilo4 : 192.168.11.10
> > > > clusterb-ilo4 : 192.168.11.11
> > > >
> > > > cluster resource status :
> > > >  cluster_fs       started on clusterb
> > > >  cluster_vip      started on clusterb
> > > >  cluster_sid      started on clusterb
> > > >  cluster_listnr   started on clusterb
> > > >
> > > > Both cluster nodes are online.
> > > >
> > > > I found that my corosync.log contains many records like the ones below:
> > > >
> > > > clustera    pengine:     info: determine_online_status_fencing: Node clusterb is active
> > > > clustera    pengine:     info: determine_online_status: Node clusterb is online
> > > > clustera    pengine:     info: determine_online_status_fencing: Node clustera is active
> > > > clustera    pengine:     info: determine_online_status: Node clustera is online
> > > >
> > > > *clustera    pengine:  warning: unpack_rsc_op_failure: Processing failed op start for cluster_sid on clustera: unknown error (1)*
> > > > *=> Question: Why does pengine always try to start cluster_sid on the passive node? How can I fix it?*
> > > >
> > >
> > > pacemaker does not have a concept of "passive" or "master" node - it is
> > > up to you to decide when you configure resource placement. By default
> > > pacemaker will attempt to spread resources across all eligible
> > > nodes.
> > > You can influence node selection by using constraints. See
> > > https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pace
> > > maker_Explained/_deciding_which_nodes_a_resource_can_run_on.html
> > > for details.
> > >
> > > But in any case - all your resources MUST be capable of running on
> > > both nodes, otherwise the cluster makes no sense. If one resource A
> > > depends
> > 

[ClusterLabs] resource stopped unexpectedly

2018-06-13 Thread Stefan Krueger
Hello,

I have a problem with my cluster. When I use 'pcs cluster standby serv3', it
moves all resources to serv4, which works fine. But when I restart a node, the
resource ha-ip ends up stopped and I don't know why. Can somebody give me a
hint why this happens and how to resolve it?

By the way, I am following this guide: https://github.com/ewwhite/zfs-ha/wiki
The log file is here (I guess it is too long for the mailing list):
https://paste.debian.net/hidden/2e001867/

Thanks for any help!

best regards
Stefan


pcs status
Cluster name: zfs-vmstorage
Stack: corosync
Current DC: zfs-serv3 (version 1.1.16-94ff4df) - partition with quorum
Last updated: Tue Jun 12 16:56:45 2018
Last change: Tue Jun 12 16:44:52 2018 by hacluster via crm_attribute on 
zfs-serv3

2 nodes configured
3 resources configured

Online: [ zfs-serv3 zfs-serv4 ]

Full list of resources:

 fence-vm_storage   (stonith:fence_scsi):   Started zfs-serv3
 Resource Group: zfs-storage
 vm_storage (ocf::heartbeat:ZFS):   Started zfs-serv3
 ha-ip  (ocf::heartbeat:IPaddr2):   Stopped

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled



pcs config
Cluster Name: zfs-vmstorage
Corosync Nodes:
 zfs-serv3 zfs-serv4
Pacemaker Nodes:
 zfs-serv3 zfs-serv4

Resources:
 Group: zfs-storage
  Resource: vm_storage (class=ocf provider=heartbeat type=ZFS)
   Attributes: pool=vm_storage importargs="-d /dev/disk/by-vdev/"
   Operations: monitor interval=5s timeout=30s (vm_storage-monitor-interval-5s)
   start interval=0s timeout=90 (vm_storage-start-interval-0s)
   stop interval=0s timeout=90 (vm_storage-stop-interval-0s)
  Resource: ha-ip (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=172.16.101.73 cidr_netmask=16
   Operations: start interval=0s timeout=20s (ha-ip-start-interval-0s)
   stop interval=0s timeout=20s (ha-ip-stop-interval-0s)
   monitor interval=10s timeout=20s (ha-ip-monitor-interval-10s)

Stonith Devices:
 Resource: fence-vm_storage (class=stonith type=fence_scsi)
  Attributes: pcmk_monitor_action=metadata 
pcmk_host_list=172.16.101.74,172.16.101.75 devices=" 
/dev/disk/by-vdev/j3d03-hdd /dev/disk/by-vdev/j4d03-hdd 
/dev/disk/by-vdev/j3d04-hdd /dev/disk/by-vdev/j4d04-hdd 
/dev/disk/by-vdev/j3d05-hdd /dev/disk/by-vdev/j4d05-hdd 
/dev/disk/by-vdev/j3d06-hdd /dev/disk/by-vdev/j4d06-hdd 
/dev/disk/by-vdev/j3d07-hdd /dev/disk/by-vdev/j4d07-hdd 
/dev/disk/by-vdev/j3d08-hdd /dev/disk/by-vdev/j4d08-hdd 
/dev/disk/by-vdev/j3d09-hdd /dev/disk/by-vdev/j4d09-hdd 
/dev/disk/by-vdev/j3d10-hdd /dev/disk/by-vdev/j4d10-hdd 
/dev/disk/by-vdev/j3d11-hdd /dev/disk/by-vdev/j4d11-hdd 
/dev/disk/by-vdev/j3d12-hdd /dev/disk/by-vdev/j4d12-hdd 
/dev/disk/by-vdev/j3d13-hdd /dev/disk/by-vdev/j4d13-hdd 
/dev/disk/by-vdev/j3d14-hdd /dev/disk/by-vdev/j4d14-hdd 
/dev/disk/by-vdev/j3d15-hdd /dev/disk/by-vdev/j4d15-hdd 
/dev/disk/by-vdev/j3d16-hdd /dev/disk/by-vdev/j4d16-hdd 
/dev/disk/by-vdev/j3d17-hdd /dev/disk/by-vdev/j4d17-hdd 
/dev/disk/by-vdev/j3d18-hdd /dev/disk/by-vdev/j4d18-hdd 
/dev/disk/by-vdev/j3d19-hdd /dev/disk/by-vdev/j4d19-hdd log 
/dev/disk/by-vdev/j3d00-ssd /dev/disk/by-vdev/j4d00-ssd cache 
/dev/disk/by-vdev/j3d02-ssd"
  Meta Attrs: provides=unfencing 
  Operations: monitor interval=60s (fence-vm_storage-monitor-interval-60s)
Fencing Levels:

Location Constraints:
Ordering Constraints:
Colocation Constraints:
Ticket Constraints:

Alerts:
 No alerts defined

Resources Defaults:
 resource-stickiness: 100
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: zfs-vmstorage
 dc-version: 1.1.16-94ff4df
 have-watchdog: false
 last-lrm-refresh: 1528814481
 no-quorum-policy: ignore

Quorum:
  Options:



Re: [ClusterLabs] Questions about SBD behavior

2018-06-13 Thread Klaus Wenninger
On 06/13/2018 10:58 AM, 井上 和徳 wrote:
> Thanks for the response.
>
> I understand now that, as of v1.3.1 and later, real quorum is required.
> I also read this:
> https://wiki.clusterlabs.org/wiki/Using_SBD_with_Pacemaker#Watchdog-based_self-fencing_with_resource_recovery
>
> In connection with this behavior, we are verifying the following
> known issue before moving to pacemaker-2.0.
>
> * When SIGSTOP is sent to the pacemaker process, no failure of the
>   resource will be detected.
>   https://lists.clusterlabs.org/pipermail/users/2016-September/011146.html
>   https://lists.clusterlabs.org/pipermail/users/2016-October/011429.html
>
>   I expected SBD to handle this, but nothing detected that the
>   following processes were frozen. Therefore, no resource failure
>   was detected either.
>   - pacemaker-based
>   - pacemaker-execd
>   - pacemaker-attrd
>   - pacemaker-schedulerd
>   - pacemaker-controld
>
>   I checked the following slides, but could not find the current
>   status of how this is being addressed.
>   
> https://wiki.clusterlabs.org/w/images/1/1a/Recent_Work_and_Future_Plans_for_SBD_1.1.pdf
You are right. The issue was already known when I created these slides,
so a plan for improving the observation of the pacemaker daemons
probably should have gone in there.

Thanks for bringing this to the table.
Guess the issue got a little bit neglected recently.

>
> As a result of our discussion, we want SBD to detect it and reset the
> machine.

Implementation-wise I would go for some kind of split
solution between pacemaker & SBD: Pacemaker observing
its sub-daemons by itself, while some kind of heartbeat
(implicitly via corosync or explicitly) between pacemaker
& SBD assures that this internal observation is doing its
job properly.

>
> Also, for users who have neither a shared disk nor qdevice,
> we need an option that works even without real quorum.
> (Fence races can be avoided with a delay attribute:
>  https://access.redhat.com/solutions/91653
>  https://access.redhat.com/solutions/1293523)
I'm not sure if I get your point here.
Watchdog-fencing on a 2-node-cluster without
additional qdevice or shared disk is like denying
the laws of physics in my mind.
At the moment I don't see why auto_tie_breaker
wouldn't work on a 4-node and up cluster here.
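
(For reference, a minimal corosync.conf quorum sketch for the auto_tie_breaker
approach mentioned above -- option names as in votequorum(5), values are
placeholders:)

  quorum {
      provider: corosync_votequorum
      # two_node must not be combined with auto_tie_breaker
      auto_tie_breaker: 1
      # break ties in favour of the lowest node id (the default)
      auto_tie_breaker_node: lowest
  }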

Regards,
Klaus
>
> Best Regards,
> Kazunori INOUE
>
>> -Original Message-
>> From: Users [mailto:users-boun...@clusterlabs.org] On Behalf Of Klaus 
>> Wenninger
>> Sent: Friday, May 25, 2018 4:08 PM
>> To: users@clusterlabs.org
>> Subject: Re: [ClusterLabs] Questions about SBD behavior
>>
>> On 05/25/2018 07:31 AM, 井上 和徳 wrote:
>>> Hi,
>>>
>>> I am checking the watchdog function of SBD (without shared block-device).
>>> In a two-node cluster, if the cluster is stopped on one node, the watchdog
>>> is triggered on the remaining node.
>>> Is this the designed behavior?
>> SBD without a shared block-device doesn't really make sense on
>> a two-node cluster.
>> The basic idea is - e.g. in a case of a networking problem -
>> that a cluster splits up in a quorate and a non-quorate partition.
>> The quorate partition stays over while SBD guarantees a
>> reliable watchdog-based self-fencing of the non-quorate partition
>> within a defined timeout.
>> This idea of course doesn't work with just 2 nodes.
>> Taking quorum info from the 2-node feature of corosync (automatically
>> switching on wait-for-all) doesn't help in this case but instead
>> would lead to split-brain.
>> What you can do - and what e.g. pcs does automatically - is enable
>> the auto-tie-breaker instead of two-node in corosync. But that
>> still doesn't give you a higher availability than the one of the
>> winner of auto-tie-breaker. (Maybe interesting if you are going
>> for a load-balancing-scenario that doesn't affect availability or
>> for a transient state while setting up a cluster node-by-node ...)
>> What you can do though is using qdevice to still have 'real-quorum'
>> info with just 2 full cluster-nodes.
>>
>> There was quite a lot of discussion round this topic on this
>> thread previously if you search the history.
>>
>> Regards,
>> Klaus


 



[ClusterLabs] Limit of concurrent resources to start?

2018-06-13 Thread Michael Schwartzkopff
Hi,

we have a cluster with several IP addresses that are supposed to start after
another resource. In the logs we see that only 2 IP addresses start in
parallel, not all of them. Can anyone please explain why not all IP addresses
start in parallel?

Config:
primitive resProc ocf:myprovider:Proc
(ten times:) primitive resIP1 ocf:heartbeat:IPaddr2 params ip="192.168.100.1"
order ord_Proc_IP Mandatory: resProc ( resIP1 resIP2 ... )
collocation col_IP_Proc inf: (resIP1 resIP2 ...) resProc

No batch-limit in properties.
Any ideas? Thanks.

Michael



Re: [ClusterLabs] Limit of concurrent resources to start?

2018-06-13 Thread Michael Schwartzkopff
On Wednesday, June 13, 2018 10:01 CEST, "Michael Schwartzkopff"  
wrote: 
 
> Hi,
> 
> we have a cluster with several IP addresses that are supposed to start after
> another resource. In the logs we see that only 2 IP addresses start in
> parallel, not all of them. Can anyone please explain why not all IP addresses
> start in parallel?
> 
> Config:
> primitive resProc ocf:myprovider:Proc
> (ten times:) primitive resIP1 ocf:heartbeat:IPaddr2 params ip="192.168.100.1"
> order ord_Proc_IP Mandatory: resProc ( resIP1 resIP2 ... )
> collocation col_IP_Proc inf: (resIP1 resIP2 ...) resProc
> 
> No batch-limit in properties.
> Any ideas? Thanks.
> 
> Michael

Hi,

additional remark:

With some tweaks I made my cluster start two resources (i.e. IP1 and IP2) at
the same time. But it takes about 4 seconds until the cluster starts the next
resources (i.e. IP3 and IP4).

Has anybody seen this behaviour before?

Why does my cluster not start all "parallel" resources together?

Michael.



Re: [ClusterLabs] Pengine always trying to start the resource on the standby node.

2018-06-13 Thread Ken Gaillot
On Wed, 2018-06-13 at 17:09 +0800, Albert Weng wrote:
> Hi All,
> 
> Thanks for reply.
> 
> Recently, I ran the following command:
> (clustera) # crm_simulate --xml-file pe-warn.last
> 
> It returned the following results:
>    error: crm_abort:    xpath_search: Triggered assert at xpath.c:153 : xml_top != NULL
>    error: crm_element_value:    Couldn't find validate-with in NULL

It looks like pe-warn.last somehow got corrupted. It appears to not be
a full CIB file.

If the original was compressed (.gz/.bz2 extension), and you didn't
uncompress it, re-add the extension -- that's how pacemaker knows to
uncompress it.
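
For example (a sketch only -- the file name and path below are placeholders,
assuming the file was copied from the scheduler's input directory and was
originally bzip2-compressed):

  # restore the extension so the tools know to uncompress it
  mv pe-warn.last pe-warn.last.bz2
  crm_simulate --xml-file pe-warn.last.bz2
  # or point crm_simulate directly at the original input file
  crm_simulate --xml-file /var/lib/pacemaker/pengine/pe-warn-0.bz2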

>    error: crm_abort:    crm_element_value: Triggered assert at xml.c:5135 : data != NULL
>    Configuration validation is currently disabled. It is highly encouraged and prevents many common cluster issues.
>    error: crm_element_value:    Couldn't find validate-with in NULL
>    error: crm_abort:    crm_element_value: Triggered assert at xml.c:5135 : data != NULL
>    error: crm_element_value:    Couldn't find ignore-dtd in NULL
>    error: crm_abort:    crm_element_value: Triggered assert at xml.c:5135 : data != NULL
>    error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
>    error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
>    error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
>    error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
>    error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
>    error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
>    error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
>    error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
>    error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
>    error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
>    error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
>    error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
>    error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
>    error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
>    error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
>    error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
>    error: crm_abort:    validate_with: Triggered assert at schemas.c:522 : xml != NULL
>    error: crm_abort:    crm_xml_add: Triggered assert at xml.c:2494 : node != NULL
>    error: write_xml_stream: Cannot write NULL to /var/lib/pacemaker/cib/shadow.20008
>    Could not create '/var/lib/pacemaker/cib/shadow.20008': Success
> 
> Could anyone help me interpret these messages and explain what is going on
> with my server?
> 
> Thanks a lot..
> 
> 
> On Fri, Jun 8, 2018 at 4:49 AM, Ken Gaillot 
> wrote:
> > On Thu, 2018-06-07 at 08:37 +0800, Albert Weng wrote:
> > > Hi Andrei,
> > > 
> > > Thanks for your quick reply. I still need help, as below:
> > > 
> > > > On Wed, Jun 6, 2018 at 11:58 AM, Andrei Borzenkov wrote:
> > > > 06.06.2018 04:27, Albert Weng пишет:
> > > > >  Hi All,
> > > > > 
> > > > > I have created active/passive pacemaker cluster on RHEL 7.
> > > > > 
> > > > > Here are my environment:
> > > > > clustera : 192.168.11.1 (passive)
> > > > > clusterb : 192.168.11.2 (master)
> > > > > clustera-ilo4 : 192.168.11.10
> > > > > clusterb-ilo4 : 192.168.11.11
> > > > > 
> > > > > cluster resource status :
> > > > >      cluster_fs        started on clusterb
> > > > >      cluster_vip       started on clusterb
> > > > >      cluster_sid       started on clusterb
> > > > >      cluster_listnr    started on clusterb
> > > > > 
> > > > > Both cluster nodes are online.
> > > > > 
> > > > > > I found that my corosync.log contains many records like the ones below:
> > > > > 
> > > > > > clustera        pengine:     info: determine_online_status_fencing: Node clusterb is active
> > > > > > clustera        pengine:     info: determine_online_status: Node clusterb is online
> > > > > > clustera        pengine:     info: determine_online_status_fencing: Node clustera is active
> > > > > > clustera        pengine:     info: determine_online_status: Node clustera is online
> > > > > 
> > > > > > *clustera        pengine:  warning: unpack_rsc_op_failure: Processing failed op start for cluster_sid on clustera: unknown error (1)*
> > > > > > *=> Question: Why does pengine always try to start cluster_sid on the passive node? How can I fix it?*
> > > > > 
> > > > 
> > > > > pacemaker does not have a concept of "passive" or 

Re: [ClusterLabs] Limit of concurrent resources to start?

2018-06-13 Thread Ken Gaillot
On Wed, 2018-06-13 at 14:25 +0200, Michael Schwartzkopff wrote:
> On Wednesday, June 13, 2018 10:01 CEST, "Michael Schwartzkopff" wrote:
>  
> > Hi,
> > 
> > we have a cluster with several IP addresses that are supposed to start
> > after another resource. In the logs we see that only 2 IP addresses start
> > in parallel, not all of them. Can anyone please explain why not all IP
> > addresses start in parallel?
> > 
> > Config:
> > primitive resProc ocf:myprovider:Proc
> > (ten times:) primitive resIP1 ocf:heartbeat:IPaddr2 params
> > ip="192.168.100.1"
> > order ord_Proc_IP Mandatory: resProc ( resIP1 resIP2 ... )
> > collocation col_IP_Proc inf: (resIP1 resIP2 ...) resProc
> > 
> > No batch-limit in properties.
> > Any ideas? Thanks.
> > 
> > Michael

Each node has a limit of how many jobs it can execute in parallel. In
order of most preferred to least, it will be:

* The value of the (undocumented) PCMK_node_action_limit environment
variable on that node (no limit if not set)

* The value of the (also undocumented) node-action-limit cluster
property (defaulting to 0 meaning no limit)

* Twice the node's number of CPU cores (as reported by /proc/stat)

Also, the cluster will auto-calculate a cluster-wide batch-limit if
high load is observed on any node.

So, you could mostly override throttling by setting a high node-action-
limit.
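
For example (a sketch only -- exact syntax depends on your tooling, and the
values here are placeholders):

  # per node, in the environment file read by pacemaker
  # (/etc/sysconfig/pacemaker or /etc/default/pacemaker):
  PCMK_node_action_limit=20

  # or cluster-wide, as a cluster property:
  crm_attribute --type crm_config --name node-action-limit --update 20

  # an explicit cluster-wide ceiling can also be set:
  crm_attribute --type crm_config --name batch-limit --update 20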

> Hi,
> 
> additional remark:
> 
> With some tweaks I made my cluster start two resources (i.e. IP1 and
> IP2) at the same time. But it takes about 4 seconds until the
> cluster starts the next resources (i.e. IP3 and IP4).
>
> Has anybody seen this behaviour before?
>
> Why does my cluster not start all "parallel" resources together?
> 
> Michael.

-- 
Ken Gaillot 


Re: [ClusterLabs] Limit of concurrent resources to start?

2018-06-13 Thread Michael Schwartzkopff
On 13.06.2018 at 16:18, Ken Gaillot wrote:
> On Wed, 2018-06-13 at 14:25 +0200, Michael Schwartzkopff wrote:
>> On Wednesday, June 13, 2018 10:01 CEST, "Michael Schwartzkopff" wrote:
>>  
>>> Hi,
>>>
>>> we have a cluster with several IP addresses that are supposed to start
>>> after another resource. In the logs we see that only 2 IP addresses start
>>> in parallel, not all of them. Can anyone please explain why not all IP
>>> addresses start in parallel?
>>>
>>> Config:
>>> primitive resProc ocf:myprovider:Proc
>>> (ten times:) primitive resIP1 ocf:heartbeat:IPaddr2 params
>>> ip="192.168.100.1"
>>> order ord_Proc_IP Mandatory: resProc ( resIP1 resIP2 ... )
>>> collocation col_IP_Proc inf: (resIP1 resIP2 ...) resProc
>>>
>>> No batch-limit in properties.
>>> Any ideas? Thanks.
>>>
>>> Michael
> Each node has a limit of how many jobs it can execute in parallel. In
> order of most preferred to least, it will be:
>
> * The value of the (undocumented) PCMK_node_action_limit environment
> variable on that node (no limit if not set)
>
> * The value of the (also undocumented) node-action-limit cluster
> property (defaulting to 0 meaning no limit)
>
> * Twice the node's number of CPU cores (as reported by /proc/stat)
>
> Also, the cluster will auto-calculate a cluster-wide batch-limit if
> high load is observed on any node.
>
> So, you could mostly override throttling by setting a high node-action-
> limit.
>
>> Hi,
>>
>> additional remark:
>>
>> With some tweaks I made my cluster start two resources (i.e. IP1 and
>> IP2) at the same time. But it takes about 4 seconds until the
>> cluster starts the next resources (i.e. IP3 and IP4).
>>
>> Has anybody seen this behaviour before?
>>
>> Why does my cluster not start all "parallel" resources together?
>>
>> Michael.
> Ken Gaillot 

Thanks for this clarification.

Kind regards,

-- 

[*] sys4 AG
 
https://sys4.de, +49 (89) 30 90 46 64
Schleißheimer Straße 26/MG,80333 München
 
Registered office: Munich, Amtsgericht München: HRB 199263
Board of directors: Patrick Ben Koetter, Marc Schiffbauer, Wolfgang Stief
Chairman of the supervisory board: Florian Kirstein



