Re: [ClusterLabs] Singleton resource not being migrated

2016-08-05 Thread Nikita Koshikov
Thanks for the reply, Andreas.


On Fri, Aug 5, 2016 at 1:48 AM, Andreas Kurz  wrote:

> Hi,
>
> On Fri, Aug 5, 2016 at 2:08 AM, Nikita Koshikov 
> wrote:
>
>> Hello list,
>>
>> Can you please help me debug one resource that is not being started
>> after a node failover?
>>
>> Here is the configuration that I'm testing, on a 3-node (KVM VM) cluster:
>>
>> node 10: aic-controller-58055.test.domain.local
>> node 6: aic-controller-50186.test.domain.local
>> node 9: aic-controller-12993.test.domain.local
>> primitive cmha cmha \
>> params conffile="/etc/cmha/cmha.conf" daemon="/usr/bin/cmhad"
>> pidfile="/var/run/cmha/cmha.pid" user=cmha \
>> meta failure-timeout=30 resource-stickiness=1 target-role=Started
>> migration-threshold=3 \
>> op monitor interval=10 on-fail=restart timeout=20 \
>> op start interval=0 on-fail=restart timeout=60 \
>> op stop interval=0 on-fail=block timeout=90
>>
>
> What is the output of crm_mon -1frA once a node is down ... any failed
> actions?
>

No errors/failed actions. This is a slightly different lab (the names have
changed), but it shows the same effect:

root@aic-controller-57150:~# crm_mon -1frA
Last updated: Fri Aug  5 20:14:05 2016  Last change: Fri Aug  5 19:38:34 2016 by root via crm_attribute on aic-controller-44151.test.domain.local
Stack: corosync
Current DC: aic-controller-57150.test.domain.local (version 1.1.14-70404b0) - partition with quorum
3 nodes and 7 resources configured

Online: [ aic-controller-57150.test.domain.local aic-controller-58381.test.domain.local ]
OFFLINE: [ aic-controller-44151.test.domain.local ]

Full list of resources:

 sysinfo_aic-controller-44151.test.domain.local (ocf::pacemaker:SysInfo): Stopped
 sysinfo_aic-controller-57150.test.domain.local (ocf::pacemaker:SysInfo): Started aic-controller-57150.test.domain.local
 sysinfo_aic-controller-58381.test.domain.local (ocf::pacemaker:SysInfo): Started aic-controller-58381.test.domain.local
 Clone Set: clone_p_heat-engine [p_heat-engine]
     Started: [ aic-controller-57150.test.domain.local aic-controller-58381.test.domain.local ]
 cmha   (ocf::heartbeat:cmha):  Stopped

Node Attributes:
* Node aic-controller-57150.test.domain.local:
+ arch  : x86_64
+ cpu_cores : 3
+ cpu_info  : Intel(R) Xeon(R) CPU E5-2680 v3 @
2.50GHz
+ cpu_load  : 1.04
+ cpu_speed : 4994.21
+ free_swap : 5150
+ os: Linux-3.13.0-85-generic
+ ram_free  : 750
+ ram_total : 5000
+ root_free : 45932
+ var_log_free  : 431543
* Node aic-controller-58381.test.domain.local:
+ arch  : x86_64
+ cpu_cores : 3
+ cpu_info  : Intel(R) Xeon(R) CPU E5-2680 v3 @
2.50GHz
+ cpu_load  : 1.16
+ cpu_speed : 4994.21
+ free_swap : 5150
+ os: Linux-3.13.0-85-generic
+ ram_free  : 750
+ ram_total : 5000
+ root_free : 45932
+ var_log_free  : 431542

Migration Summary:
* Node aic-controller-57150.test.domain.local:
* Node aic-controller-58381.test.domain.local:


>
>> primitive sysinfo_aic-controller-12993.test.domain.local
>> ocf:pacemaker:SysInfo \
>> params disk_unit=M disks="/ /var/log" min_disk_free=512M \
>> op monitor interval=15s
>> primitive sysinfo_aic-controller-50186.test.domain.local
>> ocf:pacemaker:SysInfo \
>> params disk_unit=M disks="/ /var/log" min_disk_free=512M \
>> op monitor interval=15s
>> primitive sysinfo_aic-controller-58055.test.domain.local
>> ocf:pacemaker:SysInfo \
>> params disk_unit=M disks="/ /var/log" min_disk_free=512M \
>> op monitor interval=15s
>>
>
> You can use a clone for this sysinfo resource and a symmetric cluster for
> a more compact configuration; then you can skip all these location
> constraints.
>
>
>> location cmha-on-aic-controller-12993.test.domain.local cmha 100:
>> aic-controller-12993.test.domain.local
>> location cmha-on-aic-controller-50186.test.domain.local cmha 100:
>> aic-controller-50186.test.domain.local
>> location cmha-on-aic-controller-58055.test.domain.local cmha 100:
>> aic-controller-58055.test.domain.local
>> location sysinfo-on-aic-controller-12993.test.domain.local
>> sysinfo_aic-controller-12993.test.domain.local inf:
>> aic-controller-12993.test.domain.local
>> location sysinfo-on-aic-controller-50186.test.domain.local
>> sysinfo_aic-controller-50186.test.domain.local inf:
>> aic-controller-50186.test.domain.local
>> 

Re: [ClusterLabs] Minimal metadata for fencing agent

2016-08-05 Thread Ken Gaillot
On 08/05/2016 06:16 AM, Maciej Kopczyński wrote:
> Thanks for your answer, Tomas, and sorry for messing up the layout of the
> messages - I was trying to write from a mobile phone using Gmail... I
> was able to put something together using what I found on the web and my
> own writing. My agent seems to do what it has to, except for sending
> proper metadata. I found the following information: "Output of
> fence_agent -o metadata should be validated by relax-ng schema
> (available at fence/agents/lib/metadata.rng)." I checked this location,
> but I am a total noob when it comes to XML. There is a pretty extensive
> structure there, and what I need is to prepare a minimal agent to be
> used locally by me, just to check whether the whole thing makes sense at
> all.
> 
> Do you have any idea what the minimal set of XML data is that a fence
> agent has to send to stdout? Or any way to work around this?
> Just for testing purposes.

The easiest thing to do is to look at an existing fence agent and mimic
what it does. The key parts are what parameters the agent accepts, and
what actions it supports.
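
For a rough idea of the shape, here is a hand-written sketch modelled on
what typical fence agents print for "-o metadata" (an illustration only,
not a verified minimal example; validate whatever you write against
fence/agents/lib/metadata.rng):

<?xml version="1.0" ?>
<resource-agent name="fence_example" shortdesc="Example fence agent">
  <longdesc>Minimal hand-written example, for local testing only.</longdesc>
  <parameters>
    <parameter name="ipaddr" unique="0" required="1">
      <getopt mixed="-a, --ip=[ip]"/>
      <content type="string"/>
      <shortdesc lang="en">IP address or hostname of the fencing device</shortdesc>
    </parameter>
    <parameter name="action" unique="0" required="0">
      <getopt mixed="-o, --action=[action]"/>
      <content type="string" default="reboot"/>
      <shortdesc lang="en">Fencing action</shortdesc>
    </parameter>
  </parameters>
  <actions>
    <action name="on"/>
    <action name="off"/>
    <action name="reboot"/>
    <action name="status"/>
    <action name="monitor"/>
    <action name="metadata"/>
  </actions>
</resource-agent>

Comparing your output against that of an installed agent (e.g.
"fence_ipmilan -o metadata") is the quickest sanity check.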

> 
> Best regards,
> Maciek
> 
>> Hi,
>>
>> That is because pcs doesn't work well with external stonith agents, see
>> this github issue https://github.com/ClusterLabs/pcs/issues/81
>>
>> Regards,
>> Tomas
> 
>>> Thanks!
>>>
>>> I ran into more problems though. When configuring a stonith resource using
>>> pcs with stonith:external/libvirt I am getting "Unable to create resource
>>> (...), it is not installed on this system." I have installed the cluster_glue
>>> RPM package (I am running CentOS) and the file is present on the system;
>>> should I enable it somehow for pacemaker?
>>>
>>> Thanks,
>>> Maciek
>>>
>>>
 Hello,

 Sorry if it is a trivial question, but I am facing a wall here. I am
 trying to configure fencing on a cluster running Hyper-V. I need to
 modify the source code of the external/libvirt plugin, but I have no
 idea which package provides it and cannot find anything by Googling.
 Do you have any idea?

 Thanks in advance,
 Maciek


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Singleton resource not being migrated

2016-08-05 Thread Ken Gaillot
On 08/05/2016 03:48 AM, Andreas Kurz wrote:
> Hi,
> 
> On Fri, Aug 5, 2016 at 2:08 AM, Nikita Koshikov wrote:
> 
> Hello list,
> 
> Can you please help me debug one resource that is not being started
> after a node failover?
> 
> Here is the configuration that I'm testing, on a 3-node (KVM VM) cluster:
> 
> node 10: aic-controller-58055.test.domain.local
> node 6: aic-controller-50186.test.domain.local
> node 9: aic-controller-12993.test.domain.local
> primitive cmha cmha \
> params conffile="/etc/cmha/cmha.conf"
> daemon="/usr/bin/cmhad" pidfile="/var/run/cmha/cmha.pid" user=cmha \
> meta failure-timeout=30 resource-stickiness=1
> target-role=Started migration-threshold=3 \
> op monitor interval=10 on-fail=restart timeout=20 \
> op start interval=0 on-fail=restart timeout=60 \
> op stop interval=0 on-fail=block timeout=90
> 
> 
> What is the output of crm_mon -1frA once a node is down ... any failed
> actions?
>  
> 
> primitive sysinfo_aic-controller-12993.test.domain.local
> ocf:pacemaker:SysInfo \
> params disk_unit=M disks="/ /var/log" min_disk_free=512M \
> op monitor interval=15s
> primitive sysinfo_aic-controller-50186.test.domain.local
> ocf:pacemaker:SysInfo \
> params disk_unit=M disks="/ /var/log" min_disk_free=512M \
> op monitor interval=15s
> primitive sysinfo_aic-controller-58055.test.domain.local
> ocf:pacemaker:SysInfo \
> params disk_unit=M disks="/ /var/log" min_disk_free=512M \
> op monitor interval=15s
> 
> 
> You can use a clone for this sysinfo resource and a symmetric cluster
> for a more compact configuration; then you can skip all these
> location constraints.
> 
> 
> location cmha-on-aic-controller-12993.test.domain.local cmha 100:
> aic-controller-12993.test.domain.local
> location cmha-on-aic-controller-50186.test.domain.local cmha 100:
> aic-controller-50186.test.domain.local
> location cmha-on-aic-controller-58055.test.domain.local cmha 100:
> aic-controller-58055.test.domain.local
> location sysinfo-on-aic-controller-12993.test.domain.local
> sysinfo_aic-controller-12993.test.domain.local inf:
> aic-controller-12993.test.domain.local
> location sysinfo-on-aic-controller-50186.test.domain.local
> sysinfo_aic-controller-50186.test.domain.local inf:
> aic-controller-50186.test.domain.local
> location sysinfo-on-aic-controller-58055.test.domain.local
> sysinfo_aic-controller-58055.test.domain.local inf:
> aic-controller-58055.test.domain.local
> property cib-bootstrap-options: \
> have-watchdog=false \
> dc-version=1.1.14-70404b0 \
> cluster-infrastructure=corosync \
> cluster-recheck-interval=15s \
> 
> 
> Never tried such a low cluster-recheck-interval ... wouldn't do that. I
> saw setups with low intervals burning a lot of cpu cycles in bigger
> cluster setups and side-effects from aborted transitions. If you do this
> to "clean up" the cluster state because you see resource-agent errors,
> you would be better off fixing the resource agent.

Strongly agree -- your recheck interval is lower than the various action
timeouts. The only reason recheck interval should ever be set less than
about 5 minutes is if you have time-based rules that you want to trigger
with a finer granularity.
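
For example, a value in that range could be set with (crmsh syntax, as a
sketch):

crm configure property cluster-recheck-interval=5min

Keep in mind that timer-driven things such as failure-timeout expiry are
generally only re-evaluated when this timer pops or when some other event
triggers a new transition, so there is a trade-off in raising it.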

Your issue does not appear to be coming from recheck interval, otherwise
it would go away after the recheck interval passed.

> Regards,
> Andreas
>  
> 
> no-quorum-policy=stop \
> stonith-enabled=false \
> start-failure-is-fatal=false \
> symmetric-cluster=false \
> node-health-strategy=migrate-on-red \
> last-lrm-refresh=1470334410
> 
> When all 3 nodes are online everything seems OK; this is the output of
> scoreshow.sh:
>
> Resource  Score      Node                                     Stickiness  #Fail  Migration-Threshold
> cmha      -INFINITY  aic-controller-12993.test.domain.local   1           0
> cmha      101        aic-controller-50186.test.domain.local   1           0
> cmha      -INFINITY

Everything is not OK; cmha has -INFINITY scores on two nodes, meaning it
won't be allowed to run on them. This is why it won't start after the
one allowed node goes down, and why cleanup gets it working again
(cleanup removes bans caused by resource failures).

It's likely the resource previously failed the maximum allowed times
(migration-threshold=3) on those two nodes.
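
A quick way to confirm that theory might look like this (a sketch; node
names are taken from the quoted output above, and the log location varies
by distribution):

# per-node failure counts show up in the Migration Summary section
crm_mon -1frA

# or query the fail count for cmha on one node via crmsh
crm resource failcount cmha show aic-controller-12993.test.domain.local

# then look for the failed start/monitor operations in the logs
grep -i 'cmha' /var/log/syslog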

The next step would be to figure out why the resource is failing. The
pacemaker logs will 

Re: [ClusterLabs] Fencing with a 3-node (1 for quorum only) cluster

2016-08-05 Thread Dan Swartzendruber


A lot of good suggestions here.  Unfortunately, my budget is tapped out
for the near future at least (this is a home lab/SOHO setup).  I'm
inclined to go with Digimer's two-node approach, with IPMI fencing.  I
understand motherboards can die and such; in such a long-shot scenario,
manual intervention is fine.  So, when I get a chance, I need to remove
the quorum node from the cluster and switch it to two_node mode.  Thanks
for the info!
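
For reference, the two_node switch lives in the quorum section of
corosync.conf; roughly like this, assuming corosync 2.x with votequorum:

quorum {
    provider: corosync_votequorum
    two_node: 1
}

After dropping the quorum node from the configuration, corosync needs a
restart on both remaining nodes; note that two_node implies wait_for_all
by default, so both nodes have to be seen once at startup before the
cluster will start services.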




Re: [ClusterLabs] Minimal metadata for fencing agent

2016-08-05 Thread Tomas Jelinek

On 5.8.2016 at 13:16, Maciej Kopczyński wrote:

Thanks for your answer, Tomas, and sorry for messing up the layout of the
messages - I was trying to write from a mobile phone using Gmail... I
was able to put something together using what I found on the web and my
own writing. My agent seems to do what it has to, except for sending
proper metadata. I found the following information: "Output of
fence_agent -o metadata should be validated by relax-ng schema
(available at fence/agents/lib/metadata.rng)." I checked this location,
but I am a total noob when it comes to XML. There is a pretty extensive
structure there, and what I need is to prepare a minimal agent to be
used locally by me, just to check whether the whole thing makes sense at
all.

Do you have any idea what the minimal set of XML data is that a fence
agent has to send to stdout?


Hi,

Sorry, I don't really know that.


Or any way to work around this?
Just for testing purposes.


Did you try using --force with pcs when creating the stonith resource? 
Like this:

pcs stonith create stonith_name external/libvirt --force

Or maybe some other fence agent would work for you, like fence_virt,
which is available without cluster_glue.
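
To see which fence agents are actually packaged on the node, something
like this should work (pcs syntax):

pcs stonith list

and for details on a particular agent:

pcs stonith describe fence_virt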


Regards,
Tomas



Best regards,
Maciek


Hi,

That is because pcs doesn't work well with external stonith agents, see
this github issue https://github.com/ClusterLabs/pcs/issues/81

Regards,
Tomas



Thanks!

I ran into more problems though. When configuring a stonith resource using
pcs with stonith:external/libvirt I am getting "Unable to create resource
(...), it is not installed on this system." I have installed the cluster_glue
RPM package (I am running CentOS) and the file is present on the system;
should I enable it somehow for pacemaker?

Thanks,
Maciek



Hello,

Sorry if it is a trivial question, but I am facing a wall here. I am
trying to configure fencing on a cluster running Hyper-V. I need to
modify the source code of the external/libvirt plugin, but I have no
idea which package provides it and cannot find anything by Googling.
Do you have any idea?

Thanks in advance,
Maciek







[ClusterLabs] Minimal metadata for fencing agent

2016-08-05 Thread Maciej Kopczyński
Thanks for your answer, Tomas, and sorry for messing up the layout of the
messages - I was trying to write from a mobile phone using Gmail... I
was able to put something together using what I found on the web and my
own writing. My agent seems to do what it has to, except for sending
proper metadata. I found the following information: "Output of
fence_agent -o metadata should be validated by relax-ng schema
(available at fence/agents/lib/metadata.rng)." I checked this location,
but I am a total noob when it comes to XML. There is a pretty extensive
structure there, and what I need is to prepare a minimal agent to be
used locally by me, just to check whether the whole thing makes sense at
all.

Do you have any idea what the minimal set of XML data is that a fence
agent has to send to stdout? Or any way to work around this?
Just for testing purposes.

Best regards,
Maciek

> Hi,
>
> That is because pcs doesn't work well with external stonith agents, see
> this github issue https://github.com/ClusterLabs/pcs/issues/81
>
> Regards,
> Tomas

>> Thanks!
>>
>> I ran into more problems though. When configuring a stonith resource using
>> pcs with stonith:external/libvirt I am getting "Unable to create resource
>> (...), it is not installed on this system." I have installed the cluster_glue
>> RPM package (I am running CentOS) and the file is present on the system;
>> should I enable it somehow for pacemaker?
>>
>> Thanks,
>> Maciek
>>
>>
>>> Hello,
>>>
>>> Sorry if it is a trivial question, but I am facing a wall here. I am
>>> trying to configure fencing on a cluster running Hyper-V. I need to
>>> modify the source code of the external/libvirt plugin, but I have no
>>> idea which package provides it and cannot find anything by Googling.
>>> Do you have any idea?
>>>
>>> Thanks in advance,
>>> Maciek



Re: [ClusterLabs] Fencing with a 3-node (1 for quorum only) cluster

2016-08-05 Thread Andrei Borzenkov
On Fri, Aug 5, 2016 at 7:08 AM, Digimer  wrote:
> On 04/08/16 11:44 PM, Andrei Borzenkov wrote:
>> On 05.08.2016 at 02:33, Digimer wrote:
>>> On 04/08/16 07:21 PM, Dan Swartzendruber wrote:
 On 2016-08-04 19:03, Digimer wrote:
> On 04/08/16 06:56 PM, Dan Swartzendruber wrote:
>> I'm setting up an HA NFS server to serve up storage to a couple of
>> vsphere hosts.  I have a virtual IP, and it depends on a ZFS resource
>> agent which imports or exports a pool.
>>
>> ...
>>
>>>
>>> Note: if you lose power to the mainboard (which we've seen; a failed
>>> mainboard voltage regulator did this once), you lose the IPMI (DRAC)
>>> BMC. This scenario will leave your cluster blocked without an external
>>> secondary fence method, like switched PDUs.
>>>
>>
>> As in this case there is shared storage (at least, so I understood),
>> using persistent SCSI reservations or SBD as secondary channel can be
>> considered.
>
> Yup. That would be fabric fencing though, or are you talking about using
> it under watchdog timers?

Fabric is the third possibility :) No, I rather meant something like fence_scsi.

Although the practical problem with both fabric and SCSI fencing is that
it only prevents concurrent access to the shared storage; it does not
guarantee that other resources are also cleaned up, so you may end up with
a duplicated IP or similar.

> If fabric, then my worry is always a panic'ed
> admin clearing it without properly verifying the state of the lost node.
> With watchdog, it's fine, just slow.
>

As it is a last resort, better slow than never.
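
As a rough illustration, a storage-based second level on top of IPMI
could be configured along these lines (pcs syntax, untested; the device
path and the node/resource names are placeholders):

# storage-based fence device using SCSI-3 persistent reservations
pcs stonith create fence_storage fence_scsi \
    devices=/dev/disk/by-id/SHARED-LUN pcmk_host_list="node1 node2" \
    meta provides=unfencing

# try IPMI first, fall back to cutting the node off the shared storage
pcs stonith level add 1 node1 fence_ipmi_node1
pcs stonith level add 2 node1 fence_storage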



Re: [ClusterLabs] Singleton resource not being migrated

2016-08-05 Thread Andreas Kurz
Hi,

On Fri, Aug 5, 2016 at 2:08 AM, Nikita Koshikov  wrote:

> Hello list,
>
> Can you please help me debug one resource that is not being started
> after a node failover?
>
> Here is the configuration that I'm testing, on a 3-node (KVM VM) cluster:
>
> node 10: aic-controller-58055.test.domain.local
> node 6: aic-controller-50186.test.domain.local
> node 9: aic-controller-12993.test.domain.local
> primitive cmha cmha \
> params conffile="/etc/cmha/cmha.conf" daemon="/usr/bin/cmhad"
> pidfile="/var/run/cmha/cmha.pid" user=cmha \
> meta failure-timeout=30 resource-stickiness=1 target-role=Started
> migration-threshold=3 \
> op monitor interval=10 on-fail=restart timeout=20 \
> op start interval=0 on-fail=restart timeout=60 \
> op stop interval=0 on-fail=block timeout=90
>

What is the output of crm_mon -1frA once a node is down ... any failed
actions?


> primitive sysinfo_aic-controller-12993.test.domain.local
> ocf:pacemaker:SysInfo \
> params disk_unit=M disks="/ /var/log" min_disk_free=512M \
> op monitor interval=15s
> primitive sysinfo_aic-controller-50186.test.domain.local
> ocf:pacemaker:SysInfo \
> params disk_unit=M disks="/ /var/log" min_disk_free=512M \
> op monitor interval=15s
> primitive sysinfo_aic-controller-58055.test.domain.local
> ocf:pacemaker:SysInfo \
> params disk_unit=M disks="/ /var/log" min_disk_free=512M \
> op monitor interval=15s
>

You can use a clone for this sysinfo resource and a symmetric cluster for a
more compact configuration; then you can skip all these location
constraints (a rough sketch follows below).
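
Untested sketch of what that could look like (crmsh syntax; the primitive
and clone names are placeholders):

crm configure primitive p_sysinfo ocf:pacemaker:SysInfo \
        params disk_unit=M disks="/ /var/log" min_disk_free=512M \
        op monitor interval=15s
crm configure clone clone_p_sysinfo p_sysinfo
crm configure property symmetric-cluster=true

With symmetric-cluster=true the clone runs an instance on every online
node, so the per-node sysinfo primitives and their location constraints
are no longer needed.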


> location cmha-on-aic-controller-12993.test.domain.local cmha 100:
> aic-controller-12993.test.domain.local
> location cmha-on-aic-controller-50186.test.domain.local cmha 100:
> aic-controller-50186.test.domain.local
> location cmha-on-aic-controller-58055.test.domain.local cmha 100:
> aic-controller-58055.test.domain.local
> location sysinfo-on-aic-controller-12993.test.domain.local
> sysinfo_aic-controller-12993.test.domain.local inf:
> aic-controller-12993.test.domain.local
> location sysinfo-on-aic-controller-50186.test.domain.local
> sysinfo_aic-controller-50186.test.domain.local inf:
> aic-controller-50186.test.domain.local
> location sysinfo-on-aic-controller-58055.test.domain.local
> sysinfo_aic-controller-58055.test.domain.local inf:
> aic-controller-58055.test.domain.local
> property cib-bootstrap-options: \
> have-watchdog=false \
> dc-version=1.1.14-70404b0 \
> cluster-infrastructure=corosync \
> cluster-recheck-interval=15s \
>

Never tried such a low cluster-recheck-interval ... wouldn't do that. I saw
setups with low intervals burning a lot of cpu cycles in bigger cluster
setups and side-effects from aborted transitions. If you do this to
"clean up" the cluster state because you see resource-agent errors, you
would be better off fixing the resource agent.

Regards,
Andreas


> no-quorum-policy=stop \
> stonith-enabled=false \
> start-failure-is-fatal=false \
> symmetric-cluster=false \
> node-health-strategy=migrate-on-red \
> last-lrm-refresh=1470334410
>
> When all 3 nodes are online everything seems OK; this is the output of scoreshow.sh:
>
> Resource                                         Score      Node                                     Stickiness  #Fail  Migration-Threshold
> cmha                                             -INFINITY  aic-controller-12993.test.domain.local   1           0
> cmha                                             101        aic-controller-50186.test.domain.local   1           0
> cmha                                             -INFINITY  aic-controller-58055.test.domain.local   1           0
> sysinfo_aic-controller-12993.test.domain.local   INFINITY   aic-controller-12993.test.domain.local   0           0
> sysinfo_aic-controller-50186.test.domain.local   -INFINITY  aic-controller-50186.test.domain.local   0           0
> sysinfo_aic-controller-58055.test.domain.local   INFINITY   aic-controller-58055.test.domain.local   0           0
>
> The problem starts when one node (aic-controller-50186) goes offline: the
> cmha resource is stuck in the Stopped state.
> Here are the showscores:
>
> Resource  Score      Node                                     Stickiness  #Fail  Migration-Threshold
> cmha      -INFINITY  aic-controller-12993.test.domain.local   1           0
> cmha      -INFINITY  aic-controller-50186.test.domain.local   1           0
> cmha      -INFINITY  aic-controller-58055.test.domain.local   1           0
>
> Even though it has target-role=Started, pacemaker is skipping this
> resource. And in the logs I see:
> pengine: info: native_print:  cmha

Re: [ClusterLabs] Antw: Fencing with a 3-node (1 for quorum only) cluster

2016-08-05 Thread Digimer
On 05/08/16 02:19 AM, Ulrich Windl wrote:
 Dan Swartzendruber wrote on 05.08.2016 at 00:56 in
> message <32eabeb57268bed57081646c77224...@druber.com>:
>> I'm setting up an HA NFS server to serve up storage to a couple of 
>> vsphere hosts.  I have a virtual IP, and it depends on a ZFS resource 
>> agent which imports or exports a pool.  So far, with stonith disabled, 
>> it all works perfectly.  I was dubious about a 2-node solution, so I 
>> created a 3rd node which runs as a virtual machine on one of the hosts.  
>> All it is for is quorum.  So, looking at fencing next.  The primary 
> 
> I wonder what happens if the machine where the VM runs crashes (2 of 3 nodes 
> down).

Losing 2 of 3 nodes means loss of quorum. The surviving node stops
offering cluster services even though it could otherwise have kept
running. Another (small) benefit of a 2-node cluster.

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?



[ClusterLabs] Antw: Fencing with a 3-node (1 for quorum only) cluster

2016-08-05 Thread Ulrich Windl
>>> Dan Swartzendruber wrote on 05.08.2016 at 00:56 in
message <32eabeb57268bed57081646c77224...@druber.com>:
> I'm setting up an HA NFS server to serve up storage to a couple of 
> vsphere hosts.  I have a virtual IP, and it depends on a ZFS resource 
> agent which imports or exports a pool.  So far, with stonith disabled, 
> it all works perfectly.  I was dubious about a 2-node solution, so I 
> created a 3rd node which runs as a virtual machine on one of the hosts.  
> All it is for is quorum.  So, looking at fencing next.  The primary 

I wonder what happens if the machine where the VM runs crashes (2 of 3 nodes 
down).

> server is a poweredge R905, which has DRAC for fencing.  The backup 
> storage node is a Supermicro X9-SCL-F (with IPMI).  So I would be using 
> the DRAC agent for the former and the ipmilan for the latter?  I was 
> reading about location constraints, where you tell each instance of the 
> fencing agent not to run on the node that would be getting fenced.  So, 
> my first thought was to configure the drac agent and tell it not to 
> fence node 1, and configure the ipmilan agent and tell it not to fence 
> node 2.  The thing is, there is no agent available for the quorum node.  
> Would it make more sense instead to tell the drac agent to only run on 
> node 2, and the ipmilan agent to only run on node 1?  Thanks!







[ClusterLabs] Singleton resource not being migrated

2016-08-05 Thread Nikita Koshikov
Hello list,

Can you please help me debug one resource that is not being started after
a node failover?

Here is the configuration that I'm testing, on a 3-node (KVM VM) cluster:

node 10: aic-controller-58055.test.domain.local
node 6: aic-controller-50186.test.domain.local
node 9: aic-controller-12993.test.domain.local
primitive cmha cmha \
params conffile="/etc/cmha/cmha.conf" daemon="/usr/bin/cmhad"
pidfile="/var/run/cmha/cmha.pid" user=cmha \
meta failure-timeout=30 resource-stickiness=1 target-role=Started
migration-threshold=3 \
op monitor interval=10 on-fail=restart timeout=20 \
op start interval=0 on-fail=restart timeout=60 \
op stop interval=0 on-fail=block timeout=90
primitive sysinfo_aic-controller-12993.test.domain.local
ocf:pacemaker:SysInfo \
params disk_unit=M disks="/ /var/log" min_disk_free=512M \
op monitor interval=15s
primitive sysinfo_aic-controller-50186.test.domain.local
ocf:pacemaker:SysInfo \
params disk_unit=M disks="/ /var/log" min_disk_free=512M \
op monitor interval=15s
primitive sysinfo_aic-controller-58055.test.domain.local
ocf:pacemaker:SysInfo \
params disk_unit=M disks="/ /var/log" min_disk_free=512M \
op monitor interval=15s

location cmha-on-aic-controller-12993.test.domain.local cmha 100:
aic-controller-12993.test.domain.local
location cmha-on-aic-controller-50186.test.domain.local cmha 100:
aic-controller-50186.test.domain.local
location cmha-on-aic-controller-58055.test.domain.local cmha 100:
aic-controller-58055.test.domain.local
location sysinfo-on-aic-controller-12993.test.domain.local
sysinfo_aic-controller-12993.test.domain.local inf:
aic-controller-12993.test.domain.local
location sysinfo-on-aic-controller-50186.test.domain.local
sysinfo_aic-controller-50186.test.domain.local inf:
aic-controller-50186.test.domain.local
location sysinfo-on-aic-controller-58055.test.domain.local
sysinfo_aic-controller-58055.test.domain.local inf:
aic-controller-58055.test.domain.local
property cib-bootstrap-options: \
have-watchdog=false \
dc-version=1.1.14-70404b0 \
cluster-infrastructure=corosync \
cluster-recheck-interval=15s \
no-quorum-policy=stop \
stonith-enabled=false \
start-failure-is-fatal=false \
symmetric-cluster=false \
node-health-strategy=migrate-on-red \
last-lrm-refresh=1470334410

When all 3 nodes are online everything seems OK; this is the output of scoreshow.sh:

Resource                                         Score      Node                                     Stickiness  #Fail  Migration-Threshold
cmha                                             -INFINITY  aic-controller-12993.test.domain.local   1           0
cmha                                             101        aic-controller-50186.test.domain.local   1           0
cmha                                             -INFINITY  aic-controller-58055.test.domain.local   1           0
sysinfo_aic-controller-12993.test.domain.local   INFINITY   aic-controller-12993.test.domain.local   0           0
sysinfo_aic-controller-50186.test.domain.local   -INFINITY  aic-controller-50186.test.domain.local   0           0
sysinfo_aic-controller-58055.test.domain.local   INFINITY   aic-controller-58055.test.domain.local   0           0

The problem starts when one node (aic-controller-50186) goes offline: the
cmha resource is stuck in the Stopped state.
Here are the showscores:

Resource  Score      Node                                     Stickiness  #Fail  Migration-Threshold
cmha      -INFINITY  aic-controller-12993.test.domain.local   1           0
cmha      -INFINITY  aic-controller-50186.test.domain.local   1           0
cmha      -INFINITY  aic-controller-58055.test.domain.local   1           0

Even though it has target-role=Started, pacemaker is skipping this resource.
And in the logs I see:
pengine: info: native_print:  cmha (ocf::heartbeat:cmha): Stopped
pengine: info: native_color:  Resource cmha cannot run anywhere
pengine: info: LogActions:    Leave   cmha (Stopped)

To recover the cmha resource I need to run either:
1) crm resource cleanup cmha
2) crm resource reprobe

After either of the above commands, the resource gets picked up by pacemaker
and I see valid scores:

Resource  Score      Node                                     Stickiness  #Fail  Migration-Threshold
cmha      100        aic-controller-58055.test.domain.local   1           0      3
cmha      101        aic-controller-12993.test.domain.local   1           0      3
cmha      -INFINITY  aic-controller-50186.test.domain.local   1