Re: [ClusterLabs] A stop job is running for pacemaker high availability cluster manager

2017-02-02 Thread Ken Gaillot
On 02/02/2017 03:06 PM, Oscar Segarra wrote:
> Hi Ken, 
> 
> I have checked /var/log/cluster/corosync.log and there is no
> information about why the system hangs while stopping...
> 
> Can you be more specific about which logs to check?
> 
> Thanks a lot.

Check there, and also /var/log/messages, which sometimes has relevant
messages from non-cluster components.

You'd want to look for messages like "Caught 'Terminated' signal" and
"Shutting down", as well as resources being stopped ("_stop_0"), then
various "Disconnect" and "Stopping" messages as individual daemons exit.

> 2017-02-02 21:10 GMT+01:00 Ken Gaillot:
> 
> On 02/02/2017 12:35 PM, Oscar Segarra wrote:
> > Hi,
> >
> > I have a two-node cluster... when I try to shut down the physical host I
> > get the following message on the console: "a stop job is running for
> > pacemaker high availability cluster manager" and it never stops...
> 
> That would be a message from systemd. You'll need to check the pacemaker
> status and/or logs to see why pacemaker can't shut down.
> 
> Without stonith enabled, pacemaker will be unable to recover if a
> resource fails to stop. That could lead to a hang.
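
While the stop job is hanging, it may also help to look at the live status and
the recent daemon logs from another shell (a hedged sketch; the unit names
assume the usual pacemaker/corosync systemd services):

```
# Sketch: inspect cluster state and recent daemon logs while the stop job hangs
crm_mon -1                                   # one-shot status; look for a resource stuck stopping
journalctl -u pacemaker -u corosync -n 200   # recent messages from the cluster services
```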
> 
> > This is my configuration:
> >
> > [root@vdicnode01 ~]# pcs config
> > Cluster Name: vdic-cluster
> > Corosync Nodes:
> >  vdicnode01-priv vdicnode02-priv
> > Pacemaker Nodes:
> >  vdicnode01-priv vdicnode02-priv
> >
> > Resources:
> >  Resource: nfs-vdic-mgmt-vm-vip (class=ocf provider=heartbeat type=IPaddr)
> >   Attributes: ip=192.168.100.200 cidr_netmask=24
> >   Operations: start interval=0s timeout=20s (nfs-vdic-mgmt-vm-vip-start-interval-0s)
> >               stop interval=0s timeout=20s (nfs-vdic-mgmt-vm-vip-stop-interval-0s)
> >               monitor interval=10s (nfs-vdic-mgmt-vm-vip-monitor-interval-10s)
> >  Clone: nfs_setup-clone
> >   Resource: nfs_setup (class=ocf provider=heartbeat type=ganesha_nfsd)
> >    Attributes: ha_vol_mnt=/var/run/gluster/shared_storage
> >    Operations: start interval=0s timeout=5s (nfs_setup-start-interval-0s)
> >                stop interval=0s timeout=5s (nfs_setup-stop-interval-0s)
> >                monitor interval=0 timeout=5s (nfs_setup-monitor-interval-0)
> >  Clone: nfs-mon-clone
> >   Resource: nfs-mon (class=ocf provider=heartbeat type=ganesha_mon)
> >    Operations: start interval=0s timeout=40s (nfs-mon-start-interval-0s)
> >                stop interval=0s timeout=40s (nfs-mon-stop-interval-0s)
> >                monitor interval=10s timeout=10s (nfs-mon-monitor-interval-10s)
> >  Clone: nfs-grace-clone
> >   Meta Attrs: notify=true
> >   Resource: nfs-grace (class=ocf provider=heartbeat type=ganesha_grace)
> >    Meta Attrs: notify=true
> >    Operations: start interval=0s timeout=40s (nfs-grace-start-interval-0s)
> >                stop interval=0s timeout=40s (nfs-grace-stop-interval-0s)
> >                monitor interval=5s timeout=10s (nfs-grace-monitor-interval-5s)
> >  Resource: vm-vdicone01 (class=ocf provider=heartbeat type=VirtualDomain)
> >   Attributes: hypervisor=qemu:///system config=/mnt/nfs-vdic-mgmt-vm/vdicone01.xml migration_network_suffix=tcp:// migration_transport=ssh
> >   Meta Attrs: allow-migrate=true target-role=Stopped
> >   Utilization: cpu=1 hv_memory=512
> >   Operations: start interval=0s timeout=90 (vm-vdicone01-start-interval-0s)
> >               stop interval=0s timeout=90 (vm-vdicone01-stop-interval-0s)
> >               monitor interval=20s role=Stopped (vm-vdicone01-monitor-interval-20s)
> >               monitor interval=30s (vm-vdicone01-monitor-interval-30s)
> >  Resource: vm-vdicsunstone01 (class=ocf provider=heartbeat type=VirtualDomain)
> >   Attributes: hypervisor=qemu:///system config=/mnt/nfs-vdic-mgmt-vm/vdicsunstone01.xml migration_network_suffix=tcp:// migration_transport=ssh
> >   Meta Attrs: allow-migrate=true target-role=Stopped
> >   Utilization: cpu=1 hv_memory=1024
> >   Operations: start interval=0s timeout=90 (vm-vdicsunstone01-start-interval-0s)
> >               stop interval=0s timeout=90 (vm-vdicsunstone01-stop-interval-0s)
> >               monitor interval=20s role=Stopped (vm-vdicsunstone01-monitor-interval-20s)
> >               monitor interval=30s (vm-vdicsunstone01-monitor-interval-30s)
> >  Resource: vm-vdicdb01 (class=ocf provider=heartbeat type=VirtualDomain)
> >   Attributes: hypervisor=qemu:///system config=/mnt/nfs-vdic-mgmt-vm/vdicdb01.xml migration_network_suffix=tcp:// migration_transport=ssh
> >   Meta Attrs: allow-migrate=true target-role=Stopped
> >   

Re: [ClusterLabs] Huge amount of files in /var/lib/pacemaker/pengine

2017-02-02 Thread Oscar Segarra
Hi Ken,

I have set the three values to 100.

I think that may be enough to diagnose problems!

Thanks a lot!

2017-02-02 21:19 GMT+01:00 Ken Gaillot:

> On 02/02/2017 12:49 PM, Oscar Segarra wrote:
> > Hi,
> >
> > A lot of files appear in /var/lib/pacemaker/pengine and fulls my hard
> disk.
> >
> > Is there any way to avoid such amount of files in that directory?
> >
> > Thanks in advance!
>
> Pacemaker saves the cluster state at each calculated transition. This
> can come in handy when investigating after a problem occurs, to see what
> changed and how the cluster responded.
>
> The files are not necessary for cluster operation, so you can clean them
> as desired. The cluster can clean them for you based on cluster options;
> see pe-error-series-max, pe-warn-series-max, and pe-input-series-max:
>
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/s-cluster-options.html
>
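
For reference, a sketch of setting those three options with pcs (the value 100
mirrors what was chosen above; any limit that fits your disk works):

```
# Sketch: cap how many policy-engine files pacemaker keeps per category
pcs property set pe-input-series-max=100
pcs property set pe-warn-series-max=100
pcs property set pe-error-series-max=100
```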


[ClusterLabs] two node cluster with clvm and virtual machines

2017-02-02 Thread Lentes, Bernd
Hi,

I'm implementing a two-node cluster with SLES 11 SP4. I have shared storage
(FC-SAN) and I'm planning to use clvm.
What I have done already:
- Connected the SAN to the hosts
- Created a volume on the SAN
- The volume is visible on both nodes (through multipath and device-mapper)
My pacemaker config looks like this:

crm(live)# configure show
node ha-idg-1
node ha-idg-2
primitive prim_clvmd ocf:lvm2:clvmd \
op stop interval=0 timeout=100 \
op start interval=0 timeout=90 \
op monitor interval=20 timeout=20
primitive prim_dlm ocf:pacemaker:controld \
op start interval=0 timeout=90 \
op stop interval=0 timeout=100 \
op monitor interval=60 timeout=60
primitive prim_stonith_ilo_ha-idg-1 stonith:external/riloe \
params ilo_hostname=SUNHB65279 hostlist=ha-idg-1 ilo_user=root ilo_password= \
op monitor interval=60m timeout=120s \
meta target-role=Started
primitive prim_stonith_ilo_ha-idg-2 stonith:external/riloe \
params ilo_hostname=SUNHB58820-3 hostlist=ha-idg-2 ilo_user=root ilo_password= \
op monitor interval=60m timeout=120s \
meta target-role=Started
primitive prim_vg_cluster_01 LVM \
params volgrpname=vg_cluster_01 \
op monitor interval=60 timeout=60 \
op start interval=0 timeout=30 \
op stop interval=0 timeout=30
group group_prim_dlm_clvmd_vg_cluster_01 prim_dlm prim_clvmd prim_vg_cluster_01
clone clone_group_prim_dlm_clvmd_vg_cluster_01 group_prim_dlm_clvmd_vg_cluster_01 \
meta target-role=Started
location loc_prim_stonith_ilo_ha-idg-1 prim_stonith_ilo_ha-idg-1 -inf: ha-idg-1
location loc_prim_stonith_ilo_ha-idg-2 prim_stonith_ilo_ha-idg-2 -inf: ha-idg-2
property cib-bootstrap-options: \
dc-version=1.1.12-f47ea56 \
cluster-infrastructure="classic openais (with plugin)" \
expected-quorum-votes=2 \
no-quorum-policy=ignore \
last-lrm-refresh=1485872095 \
stonith-enabled=true \
default-resource-stickiness=100 \
start-failure-is-fatal=true \
is-managed-default=true \
stop-orphan-resources=true
rsc_defaults rsc-options: \
target-role=stopped \
resource-stickiness=100 \
failure-timeout=0
op_defaults op-options: \
on-fail=restart


This is the status:
crm(live)# status
Last updated: Thu Feb  2 19:14:10 2017
Last change: Thu Feb  2 19:05:26 2017 by root via cibadmin on ha-idg-2
Stack: classic openais (with plugin)
Current DC: ha-idg-2 - partition with quorum
Version: 1.1.12-f47ea56
2 Nodes configured, 2 expected votes
8 Resources configured


Online: [ ha-idg-1 ha-idg-2 ]

 Clone Set: clone_group_prim_dlm_clvmd_vg_cluster_01 [group_prim_dlm_clvmd_vg_cluster_01]
 Started: [ ha-idg-1 ha-idg-2 ]

Failed actions:
    prim_stonith_ilo_ha-idg-1_start_0 on ha-idg-2 'unknown error' (1): call=100, status=Timed Out, exit-reason='none', last-rc-change='Tue Jan 31 15:14:34 2017', queued=0ms, exec=20004ms
    prim_stonith_ilo_ha-idg-2_start_0 on ha-idg-1 'unknown error' (1): call=107, status=Error, exit-reason='none', last-rc-change='Tue Jan 31 15:14:55 2017', queued=0ms, exec=11584ms

Up to here everything is fine. The stonith resources currently have wrong
passwords for the ILO adapters: it's difficult enough to establish an HA cluster
for the first time, and for now I don't want my hosts rebooting all the time
because of my mistakes in the configuration.

I created a VG and an LV; both are visible on both nodes.
My plan is to use a dedicated LV for each VM. The VMs should run on both nodes,
some on node A, some on node B.
If the cluster takes care of mounting the filesystem inside the LV (I'm planning
to use btrfs), I should not need a cluster filesystem, right? Because the cluster
makes sure the filesystem is only ever mounted on one node at a time - that's
what I've been told.
I'd like to use btrfs because of its snapshot capability, which is great.
Should I now create a resource group with the LV, the filesystem and the VM
(see the sketch below)?
I also stumbled across sfex. It seems to provide an additional layer of
protection for access to shared storage (my LV?).
Is it sensible, and does anyone have experience with it?
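
Not an authoritative answer, but a minimal crmsh sketch of what such a per-VM
group could look like, assuming one dedicated LV per VM as described. The
resource names, device path, mount point, fstype and XML path below are
placeholders, not taken from the actual setup:

```
# Sketch only: one group per VM = filesystem on the dedicated LV + the VM itself.
# Keeping the Filesystem primitive non-cloned means the cluster mounts it on one node at a time.
primitive prim_fs_vm01 Filesystem \
        params device=/dev/vg_cluster_01/lv_vm01 directory=/vm/vm01 fstype=btrfs \
        op monitor interval=20 timeout=40
primitive prim_vm_vm01 VirtualDomain \
        params config=/vm/vm01/vm01.xml hypervisor=qemu:///system \
        op monitor interval=30 timeout=60 \
        op start interval=0 timeout=120 \
        op stop interval=0 timeout=120
group group_vm01 prim_fs_vm01 prim_vm_vm01
# The LV is only usable where dlm/clvmd/the VG are active, so tie the group to that clone:
order ord_clvm_before_vm01 inf: clone_group_prim_dlm_clvmd_vg_cluster_01 group_vm01
colocation col_vm01_with_clvm inf: group_vm01 clone_group_prim_dlm_clvmd_vg_cluster_01
```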

Btw: SUSE recommends
(https://www.suse.com/documentation/sle_ha/book_sleha/data/sec_ha_clvm_config.html)
creating a mirrored LV.
Is that really necessary or advisable? My LVs reside on a SAN with a RAID5
configuration. I don't see the benefit of or the need for a mirrored LV, just
the disadvantage of wasted disk space. Besides the RAID we have a backup, and
before changes to the VMs I will create a btrfs snapshot.
Unfortunately I'm not able to create a snapshot inside the VMs because they run
older versions of SUSE that don't support btrfs. Of course I could recreate the
VMs with an LVM configuration inside them - maybe, if I have enough time. Then
I could create snapshots with LVM tools.
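
For the snapshot-before-change part, a hedged sketch (the paths are
hypothetical, and the VM image directory must itself be a btrfs subvolume for
this to work):

```
# Sketch: take a read-only snapshot of a VM's subvolume before touching the VM
btrfs subvolume snapshot -r /vm/vm01 /vm/.snapshots/vm01-$(date +%F)
```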

Thanks.


Bernd

-- 
Bernd Lentes 

Systemadministration 
institute of developmental genetics 

Re: [ClusterLabs] [Question] About a change of crm_failcount.

2017-02-02 Thread Ken Gaillot
On 02/02/2017 12:23 PM, renayama19661...@ybb.ne.jp wrote:
> Hi All,
> 
> Because of the following fix, the user is no longer able to set any value
> other than zero with crm_failcount.
> 
>  - [Fix: tools: implement crm_failcount command-line options correctly]
>    - https://github.com/ClusterLabs/pacemaker/commit/95db10602e8f646eefed335414e40a994498cafd#diff-6e58482648938fd488a920b9902daac4
> 
> However, pgsql RA sets INFINITY in a script.
> 
> ```
> (snip)
> CRM_FAILCOUNT="${HA_SBIN_DIR}/crm_failcount"
> (snip)
> ocf_exit_reason "My data is newer than new master's one. New master's location : $master_baseline"
> exec_with_retry 0 $CRM_FAILCOUNT -r $OCF_RESOURCE_INSTANCE -U $NODENAME -v INFINITY
> return $OCF_ERR_GENERIC
> (snip)
> ```
> 
> As far as we can tell, only pgsql is affected.
> 
> Can you revise crm_failcount so that values other than zero can be set again?
> If it cannot be revised, we will modify the pgsql RA to use crm_attribute instead.
> 
> Best Regards,
> Hideo Yamauchi.

Hmm, I didn't realize that was used. I changed it because it's not a
good idea to set fail-count without also changing last-failure and
having a failed op in the LRM history. I'll have to think about what the
best alternative is.
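
For reference, a hedged sketch of the crm_attribute route mentioned above (it
writes the fail-count status attribute directly, with the same caveat as above
that last-failure and the LRM history are not updated):

```
# Sketch only: rough equivalent of "crm_failcount -v INFINITY" via crm_attribute
crm_attribute --type status --node "$NODENAME" \
    --name "fail-count-${OCF_RESOURCE_INSTANCE}" --update INFINITY
```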



Re: [ClusterLabs] fence_vbox '--action=' not executing action

2017-02-02 Thread durwin
Kristoffer Grönlund wrote on 02/01/2017 10:49:54 PM:

> From: Kristoffer Grönlund 
> To: dur...@mgtsciences.com, users@clusterlabs.org
> Date: 02/01/2017 11:23 PM
> Subject: Re: [ClusterLabs] fence_vbox '--action=' not executing action
> 
> dur...@mgtsciences.com writes:
> 
> > I have 2 Fedora 24 VirtualBox machines running on a Windows 10 host.  On the
> > host, from a DOS shell, I can start 'node1' with,
> >
> > VBoxManage.exe startvm node1 --type headless
> >
> > I can shut it down with,
> >
> > VBoxManage.exe controlvm node1 acpipowerbutton
> >
> > But running fence_vbox from 'node2' does not work correctly.  Below are
> > two commands and their output.  The first action is 'status', the second is
> > 'off'.  They both get the list of running nodes, but 'off' does *not* shut
> > down or kill the node.
> >
> > Any ideas?
> 
> I haven't tested with Windows as the host OS for fence_vbox (I wrote the
> initial implementation of the agent). My guess from looking at your
> usage is that passing "cmd" to --ssh-options might not be
> sufficient to get it to work in that environment, but I have no idea
> what the right arguments might be.
> 
> Another possibility is that the command that fence_vbox tries to run
> doesn't work for you for some reason. It will either call
> 
> VBoxManage startvm  --type headless
> 
> or
> 
> VBoxManage controlvm  poweroff
> 
> when passed on or off as the --action parameter.
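
One way to narrow this down might be to run the same poweroff command by hand
over the same ssh transport and watch whether the VM actually stops (a sketch
reusing the host, user and key from the commands below; quoting may need
adjusting for the Windows cmd shell):

```
# Sketch: does the Windows side honour a manual poweroff over ssh?
ssh -i /root/.ssh/id_rsa.pub durwin@172.23.93.249 "VBoxManage controlvm node1 poweroff"
```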

If there is no further work being done on fence_vbox, is there a 'dummy' fence
agent I might use to make STONITH happy in my configuration? It would only need
to send the correct signals to STONITH so that I can create an active/active
cluster to experiment with. This is only an experimental configuration.
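
If a placeholder is acceptable for pure experimentation, the fence-agents
package usually ships a testing-only agent called fence_dummy; a hedged sketch
(check `pcs stonith describe fence_dummy` first, since parameters vary by
distribution and it never actually powers anything off):

```
# Sketch only: a no-op fence device, suitable for lab experiments and nothing else
pcs stonith create test-fence fence_dummy
pcs property set stonith-enabled=true
```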

Thank you,

Durwin

> 
> Cheers,
> Kristoffer
> 
> >
> > Thank you,
> >
> > Durwin
> >
> >
> > 02:04 PM root@node2 ~
> > fc25> fence_vbox --verbose --ip=172.23.93.249 --username=durwin 
> > --identity-file=/root/.ssh/id_rsa.pub --password= --plug="node1" 
> > --ssh-options="cmd" --command-prompt='>' --login-timeout=10 
> > --shell-timeout=20 --action=status
> > Running command: /usr/bin/ssh  durwin@172.23.93.249 -i 
> > /root/.ssh/id_rsa.pub -p 22 cmd
> > Received: Enter passphrase for key '/root/.ssh/id_rsa.pub':
> > Sent:
> >
> > Received:
> > stty: 'standard input': Inappropriate ioctl for device
> > Microsoft Windows [Version 10.0.14393]
> > (c) 2016 Microsoft Corporation. All rights reserved.
> >
> > D:\home\durwin>
> > Sent: VBoxManage list runningvms
> >
> > Received: VBoxManage list runningvms
> > VBoxManage list runningvms
> >
> > D:\home\durwin>
> > Sent: VBoxManage list vms
> >
> > Received: VBoxManage list vms
> > VBoxManage list vms
> > "node2" {14bff1fe-bd26-4583-829d-bc3a393b2a01}
> > "node1" {5a029c3c-4549-48be-8e80-c7a67584cd98}
> >
> > D:\home\durwin>
> > Status: OFF
> > Sent: quit
> >
> >
> >
> > 02:05 PM root@node2 ~
> > fc25> fence_vbox --verbose --ip=172.23.93.249 --username=durwin 
> > --identity-file=/root/.ssh/id_rsa.pub --password= --plug="node1" 
> > --ssh-options="cmd" --command-prompt='>' --login-timeout=10 
> > --shell-timeout=20 --action=off
> > Delay 0 second(s) before logging in to the fence device
> > Running command: /usr/bin/ssh  durwin@172.23.93.249 -i 
> > /root/.ssh/id_rsa.pub -p 22 cmd
> > Received: Enter passphrase for key '/root/.ssh/id_rsa.pub':
> > Sent:
> >
> > Received:
> > stty: 'standard input': Inappropriate ioctl for device
> > Microsoft Windows [Version 10.0.14393]
> > (c) 2016 Microsoft Corporation. All rights reserved.
> >
> > D:\home\durwin>
> > Sent: VBoxManage list runningvms
> >
> > Received: VBoxManage list runningvms
> > VBoxManage list runningvms
> >
> > D:\home\durwin>
> > Sent: VBoxManage list vms
> >
> > Received: VBoxManage list vms
> > VBoxManage list vms
> > "node2" {14bff1fe-bd26-4583-829d-bc3a393b2a01}
> > "node1" {5a029c3c-4549-48be-8e80-c7a67584cd98}
> >
> > D:\home\durwin>
> > Success: Already OFF
> > Sent: quit
> >
> >
> > Durwin F. De La Rue
> > Management Sciences, Inc.
> > 6022 Constitution Ave. NE
> > Albuquerque, NM  87110
> > Phone (505) 255-8611
> >
> >
> 
> -- 
> // 

[ClusterLabs] [Question] About a change of crm_failcount.

2017-02-02 Thread renayama19661014
Hi All,

Because of the following fix, the user is no longer able to set any value
other than zero with crm_failcount.

 - [Fix: tools: implement crm_failcount command-line options correctly]
   - https://github.com/ClusterLabs/pacemaker/commit/95db10602e8f646eefed335414e40a994498cafd#diff-6e58482648938fd488a920b9902daac4

However, pgsql RA sets INFINITY in a script.

```
(snip)
    CRM_FAILCOUNT="${HA_SBIN_DIR}/crm_failcount"
(snip)
    ocf_exit_reason "My data is newer than new master's one. New master's location : $master_baseline"
    exec_with_retry 0 $CRM_FAILCOUNT -r $OCF_RESOURCE_INSTANCE -U $NODENAME -v INFINITY
    return $OCF_ERR_GENERIC
(snip)
```

As far as we can tell, only pgsql is affected.

Can you revise crm_failcount so that values other than zero can be set again?
If it cannot be revised, we will modify the pgsql RA to use crm_attribute instead.

Best Regards,
Hideo Yamauchi.



[ClusterLabs] resource-agents v4.0.1

2017-02-02 Thread Oyvind Albrigtsen

ClusterLabs is happy to announce resource-agents v4.0.1.
Source code is available at:
https://github.com/ClusterLabs/resource-agents/releases/tag/v4.0.1

This release includes the following bugfixes:
- galera: remove "long SST monitoring" support due to corner-case issues
- exportfs: improve regexp handling of clientspec (only strip brackets from edges to support IPv6)

The full list of changes for resource-agents is available at:
https://github.com/ClusterLabs/resource-agents/blob/v4.0.1/ChangeLog

Everyone is encouraged to download and test the new release.
We do many regression tests and simulations, but we can't cover all
possible use cases, so your feedback is important and appreciated.

Many thanks to all the contributors to this release.


Best,
The resource-agents maintainers
