[ClusterLabs] Questions about SBD behavior

2018-05-24 Thread 井上 和徳
Hi,

I am checking the watchdog function of SBD (without shared block-device).
In a two-node cluster, when the cluster stack is stopped on one node, the
watchdog is triggered on the remaining node.
Is this the designed behavior?


[vmrh75b]# cat /etc/corosync/corosync.conf
(snip)
quorum {
provider: corosync_votequorum
two_node: 1
}
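
For context (an aside, not from the original mail): with corosync_votequorum,
two_node: 1 implicitly enables wait_for_all, so the block above behaves like:

quorum {
provider: corosync_votequorum
two_node: 1
wait_for_all: 1
}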

[vmrh75b]# cat /etc/sysconfig/sbd
# This file has been generated by pcs.
SBD_DELAY_START=no
## SBD_DEVICE="/dev/vdb1"
SBD_OPTS="-vvv"
SBD_PACEMAKER=yes
SBD_STARTMODE=always
SBD_WATCHDOG_DEV=/dev/watchdog
SBD_WATCHDOG_TIMEOUT=5
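
As a side note, on a watchdog-only setup like this one the watchdog devices
that sbd can use may be listed with sbd itself (output depends on the host):

[vmrh75b]# sbd query-watchdog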

[vmrh75b]# crm_mon -r1
Stack: corosync
Current DC: vmrh75a (version 2.0.0-0.1.rc4.el7-2.0.0-rc4) - partition with 
quorum
Last updated: Fri May 25 13:36:07 2018
Last change: Fri May 25 13:35:22 2018 by root via cibadmin on vmrh75a

2 nodes configured
0 resources configured

Online: [ vmrh75a vmrh75b ]

No resources

[vmrh75b]# pcs property show
Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: my_cluster
 dc-version: 2.0.0-0.1.rc4.el7-2.0.0-rc4
 have-watchdog: true
 stonith-enabled: false

[vmrh75b]# ps -ef | egrep "sbd|coro|pace"
root      2169     1  0 13:34 ?        00:00:00 sbd: inquisitor
root      2170  2169  0 13:34 ?        00:00:00 sbd: watcher: Pacemaker
root      2171  2169  0 13:34 ?        00:00:00 sbd: watcher: Cluster
root      2172     1  0 13:34 ?        00:00:00 corosync
root      2179     1  0 13:34 ?        00:00:00 /usr/sbin/pacemakerd -f
haclust+  2180  2179  0 13:34 ?        00:00:00 /usr/libexec/pacemaker/pacemaker-based
root      2181  2179  0 13:34 ?        00:00:00 /usr/libexec/pacemaker/pacemaker-fenced
root      2182  2179  0 13:34 ?        00:00:00 /usr/libexec/pacemaker/pacemaker-execd
haclust+  2183  2179  0 13:34 ?        00:00:00 /usr/libexec/pacemaker/pacemaker-attrd
haclust+  2184  2179  0 13:34 ?        00:00:00 /usr/libexec/pacemaker/pacemaker-schedulerd
haclust+  2185  2179  0 13:34 ?        00:00:00 /usr/libexec/pacemaker/pacemaker-controld

[vmrh75b]# pcs cluster stop vmrh75a
vmrh75a: Stopping Cluster (pacemaker)...
vmrh75a: Stopping Cluster (corosync)...

[vmrh75b]# tail -F /var/log/messages
May 25 13:37:00 vmrh75b pacemaker-controld[2185]: notice: Our peer on the DC 
(vmrh75a) is dead
May 25 13:37:00 vmrh75b pacemaker-controld[2185]: notice: State transition 
S_NOT_DC -> S_ELECTION
May 25 13:37:00 vmrh75b pacemaker-controld[2185]: notice: State transition 
S_ELECTION -> S_INTEGRATION
May 25 13:37:00 vmrh75b pacemaker-attrd[2183]: notice: Node vmrh75a state is 
now lost
May 25 13:37:00 vmrh75b pacemaker-attrd[2183]: notice: Removing all vmrh75a 
attributes for peer loss
May 25 13:37:00 vmrh75b pacemaker-attrd[2183]: notice: Lost attribute writer 
vmrh75a
May 25 13:37:00 vmrh75b pacemaker-attrd[2183]: notice: Purged 1 peer with id=1 
and/or uname=vmrh75a from the membership cache
May 25 13:37:00 vmrh75b pacemaker-fenced[2181]: notice: Node vmrh75a state is 
now lost
May 25 13:37:00 vmrh75b pacemaker-fenced[2181]: notice: Purged 1 peer with id=1 
and/or uname=vmrh75a from the membership cache
May 25 13:37:00 vmrh75b pacemaker-based[2180]: notice: Node vmrh75a state is 
now lost
May 25 13:37:00 vmrh75b pacemaker-based[2180]: notice: Purged 1 peer with id=1 
and/or uname=vmrh75a from the membership cache
May 25 13:37:00 vmrh75b pacemaker-controld[2185]: warning: Input I_ELECTION_DC 
received in state S_INTEGRATION from do_election_check
May 25 13:37:01 vmrh75b sbd[2171]:   cluster:  warning: set_servant_health: 
Connected to corosync but requires both nodes present
May 25 13:37:01 vmrh75b sbd[2171]:   cluster:  warning: notify_parent: 
Notifying parent: UNHEALTHY (6)
May 25 13:37:01 vmrh75b sbd[2169]: warning: inquisitor_child: cluster health 
check: UNHEALTHY
May 25 13:37:01 vmrh75b sbd[2169]: warning: inquisitor_child: Servant cluster 
is outdated (age: 226)
May 25 13:37:01 vmrh75b sbd[2170]:  pcmk:   notice: unpack_config: Watchdog 
will be used via SBD if fencing is required
May 25 13:37:01 vmrh75b sbd[2170]:  pcmk: info: 
determine_online_status: Node vmrh75b is online
May 25 13:37:01 vmrh75b sbd[2170]:  pcmk: info: unpack_node_loop: Node 
2 is already processed
May 25 13:37:01 vmrh75b sbd[2170]:  pcmk: info: unpack_node_loop: Node 
2 is already processed
May 25 13:37:01 vmrh75b sbd[2171]:   cluster:  warning: notify_parent: 
Notifying parent: UNHEALTHY (6)
May 25 13:37:01 vmrh75b corosync[2172]: [TOTEM ] A new membership 
(192.168.28.132:5712) was formed. Members left: 1
May 25 13:37:01 vmrh75b corosync[2172]: [QUORUM] Members[1]: 2
May 25 13:37:01 vmrh75b corosync[2172]: [MAIN  ] Completed service 
synchronization, ready to provide service.
May 25 13:37:01 vmrh75b pacemakerd[2179]: notice: Node vmrh75a state is now lost
May 25 13:37:01 vmrh75b pacemaker-controld[2185]: notice: Node vmrh75a state is 
now lost
May 25 13:37:01 vmrh75b pacemaker-controld[2185]: warning: Stonith/shutdown of 
node vmrh75a was not expected
May 25 13:37:02 vmrh75b sbd[2171]:   cluster:  warning: 

Re: [ClusterLabs] DLM fencing

2018-05-24 Thread Jason Gauthier
On Thu, May 24, 2018 at 10:40 AM, Ken Gaillot  wrote:
> On Thu, 2018-05-24 at 16:14 +0200, Klaus Wenninger wrote:
>> On 05/24/2018 04:03 PM, Ken Gaillot wrote:
>> > On Thu, 2018-05-24 at 06:47 -0400, Jason Gauthier wrote:
>> > > On Thu, May 24, 2018 at 12:19 AM, Andrei Borzenkov wrote:
>> > > > 24.05.2018 02:57, Jason Gauthier wrote:
>> > > > > I'm fairly new to clustering under Linux.  I basically have
>> > > > > one shared storage resource right now, using dlm, and gfs2.
>> > > > > I'm using fibre channel and when both of my nodes are up (2
>> > > > > node
>> > > > > cluster)
>> > > > > dlm and gfs2 seem to be operating perfectly.
>> > > > > If I reboot node B, node A works fine and vice-versa.
>> > > > >
>> > > > > When node B goes offline unexpectedly, and becomes unclean,
>> > > > > dlm seems to block all IO to the shared storage.
>> > > > >
>> > > > > dlm knows node B is down:
>> > > > >
>> > > > > # dlm_tool status
>> > > > > cluster nodeid 1084772368 quorate 1 ring seq 32644 32644
>> > > > > daemon now 865695 fence_pid 18186
>> > > > > fence 1084772369 nodedown pid 18186 actor 1084772368 fail
>> > > > > 1527119246 fence
>> > > > > 0 now 1527119524
>> > > > > node 1084772368 M add 861439 rem 0 fail 0 fence 0 at 0 0
>> > > > > node 1084772369 X add 865239 rem 865416 fail 865416 fence 0
>> > > > > at 0
>> > > > > 0
>> > > > >
>> > > > > on the same server, I see these messages in my daemon.log
>> > > > > May 23 19:52:47 alpha stonith-api[18186]: stonith_api_kick:
>> > > > > Could
>> > > > > not kick
>> > > > > (reboot) node 1084772369/(null) : No route to host (-113)
>> > > > > May 23 19:52:47 alpha dlm_stonith[18186]: kick_helper error
>> > > > > -113
>> > > > > nodeid
>> > > > > 1084772369
>> > > > >
>> > > > > I can recover from the situation by forcing it (or bring the
>> > > > > other node
>> > > > > back online)
>> > > > > dlm_tool fence_ack 1084772369
>> > > > >
>> > > > > cluster config is pretty straightforward.
>> > > > > node 1084772368: alpha
>> > > > > node 1084772369: beta
>> > > > > primitive p_dlm_controld ocf:pacemaker:controld \
>> > > > > op monitor interval=60 timeout=60 \
>> > > > > meta target-role=Started \
>> > > > > params args="-K -L -s 1"
>> > > > > primitive p_fs_gfs2 Filesystem \
>> > > > > params device="/dev/sdb2" directory="/vms"
>> > > > > fstype=gfs2
>> > > > > primitive stonith_sbd stonith:external/sbd \
>> > > > > params pcmk_delay_max=30 sbd_device="/dev/sdb1" \
>> > > > > meta target-role=Started
>> > > >
>> > > > What is the status of stonith resource? Did you configure SBD
>> > > > fencing
>> > > > properly?
>> > >
>> > > I believe so.  It's shown above in my cluster config.
>> > >
>> > > > Is sbd daemon up and running with proper parameters?
>> > >
>> > > Well, no, apparently sbd isn't running.  With dlm and gfs2, the
>> > > cluster handles launching the daemons.
>> > > I assumed the same here, since the resource shows that it is up.
>> >
>> > Unlike other services, sbd must be up before the cluster starts in
>> > order for the cluster to use it properly. (Notice the "have-
>> > watchdog=false" in your cib-bootstrap-options ... that means the
>> > cluster didn't find sbd running.)
>> >
>> > Also, even storage-based sbd requires a working hardware watchdog
>> > for
>> > the actual self-fencing. SBD_WATCHDOG_DEV in /etc/sysconfig/sbd
>> > should
>> > list the watchdog device. Also sbd_device in your cluster config
>> > should
>> > match SBD_DEVICE in /etc/sysconfig/sbd.
>> >
>> > If you want the cluster to recover services elsewhere after a node
>> > self-fences (which I'm sure you do), you also need to set the
>> > stonith-
>> > watchdog-timeout cluster property to something greater than the
>> > value
>> > of SBD_WATCHDOG_TIMEOUT in /etc/sysconfig/sbd. The cluster will
>> > wait
>> > that long and then assume the node fenced itself.

Thanks.  So, for whatever reason, sbd was not running. I went ahead
and got /etc/default/sbd (Debian) configured.
I couldn't start the service manually due to dependencies, but I rebooted
node B and it came up.
Node A would not, so I ended up rebooting both nodes at the same time,
and sbd was running on both.

I forced a failure of node B, and after a few seconds node A was able
to access the shared storage.
Definite improvement!



>> Actually, in the case where there is a shared disk, a successful
>> fencing attempt via the sbd fencing resource should be enough
>> for the node to be assumed down.
>> In case of a 2-node setup I would even discourage setting
>> stonith-watchdog-timeout, as we need a real quorum mechanism
>> for that to work.
>
> Ah, thanks -- I've updated the wiki how-to, feel free to clarify
> further:
>
> https://wiki.clusterlabs.org/wiki/Using_SBD_with_Pacemaker
>
>>
>> Regards,
>> Klaus
>>
>> >
>> > > Online: [ alpha beta ]
>> > >
>> > > Full 

Re: [ClusterLabs] pcsd processes using 100% CPU

2018-05-24 Thread Casey & Gina
> gcore is part of gdb:
> https://packages.ubuntu.com/xenial/amd64/gdb/filelist
> 
> Note that using the utility should have no observable influence
> on the running process in question.

When I ran gcore on the pid, it produced a whole bunch of memory read errors 
like this:

warning: Memory read failed for corefile section, 8192 bytes at 0x93177000.
warning: Memory read failed for corefile section, 1048576 bytes at 0x93179000.
warning: Memory read failed for corefile section, 4096 bytes at 0x93378000.
warning: Memory read failed for corefile section, 4096 bytes at 0x93379000.
warning: Memory read failed for corefile section, 8192 bytes at 0x9337a000.
warning: Memory read failed for corefile section, 1048576 bytes at 0x9337c000.
warning: Memory read failed for corefile section, 4096 bytes at 0x9357c000.
warning: Memory read failed for corefile section, 4096 bytes at 0x9357d000.
warning: Memory read failed for corefile section, 8192 bytes at 0x9357e000.
warning: Memory read failed for corefile section, 1048576 bytes at 0x9358.
warning: Memory read failed for corefile section, 4096 bytes at 0x9377f000.
warning: Memory read failed for corefile section, 4096 bytes at 0x9378.
warning: Memory read failed for corefile section, 4096 bytes at 0x93982000.
warning: Memory read failed for corefile section, 4096 bytes at 0x93983000.
warning: Memory read failed for corefile section, 4096 bytes at 0x93fac000.
warning: Memory read failed for corefile section, 4096 bytes at 0x93fad000.
warning: Memory read failed for corefile section, 4096 bytes at 0x941b6000.
warning: Memory read failed for corefile section, 4096 bytes at 0x941b7000.
warning: Memory read failed for corefile section, 188416 bytes at 0x941b8000.
warning: Memory read failed for corefile section, 4096 bytes at 0x943e8000.
warning: Memory read failed for corefile section, 4096 bytes at 0x943e9000.
warning: Memory read failed for corefile section, 4096 bytes at 0x94668000.
warning: Memory read failed for corefile section, 4096 bytes at 0x94669000.

Nonetheless, it did produce a core file.

> load it to gdb (together with appropriate
> debug symbols) and from here, you can investigate further.

I'm sorry, but I don't really know how to do that.  Could you provide some
instructions on what I should do?

I tried running `gdb --core=core.24923` but that threw errors:

Failed to read a valid object file image from memory.
Core was generated by `/usr/bin/ruby'.
Program terminated with signal SIGSTOP, Stopped (signal).
#0  0x0254a790 in ?? ()
[Current thread is 1 (LWP 24923)]

When I then looked at ps again, the process was no longer running, so I guess 
one of the above commands killed it.  So, I don't have a running process to 
work with anymore.

The core file is 1 GB!  It gzipped down to 5 MB.  Would it be useful for me
to send it to you to look at directly?

Thanks,
-- 
Casey
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] DLM fencing

2018-05-24 Thread Ken Gaillot
On Thu, 2018-05-24 at 16:14 +0200, Klaus Wenninger wrote:
> On 05/24/2018 04:03 PM, Ken Gaillot wrote:
> > On Thu, 2018-05-24 at 06:47 -0400, Jason Gauthier wrote:
> > > On Thu, May 24, 2018 at 12:19 AM, Andrei Borzenkov wrote:
> > > > 24.05.2018 02:57, Jason Gauthier wrote:
> > > > > I'm fairly new to clustering under Linux.  I basically have
> > > > > one shared storage resource right now, using dlm, and gfs2.
> > > > > I'm using fibre channel and when both of my nodes are up (2
> > > > > node
> > > > > cluster)
> > > > > dlm and gfs2 seem to be operating perfectly.
> > > > > If I reboot node B, node A works fine and vice-versa.
> > > > > 
> > > > > When node B goes offline unexpectedly, and becomes unclean,
> > > > > dlm seems to block all IO to the shared storage.
> > > > > 
> > > > > dlm knows node B is down:
> > > > > 
> > > > > # dlm_tool status
> > > > > cluster nodeid 1084772368 quorate 1 ring seq 32644 32644
> > > > > daemon now 865695 fence_pid 18186
> > > > > fence 1084772369 nodedown pid 18186 actor 1084772368 fail
> > > > > 1527119246 fence
> > > > > 0 now 1527119524
> > > > > node 1084772368 M add 861439 rem 0 fail 0 fence 0 at 0 0
> > > > > node 1084772369 X add 865239 rem 865416 fail 865416 fence 0
> > > > > at 0
> > > > > 0
> > > > > 
> > > > > on the same server, I see these messages in my daemon.log
> > > > > May 23 19:52:47 alpha stonith-api[18186]: stonith_api_kick:
> > > > > Could
> > > > > not kick
> > > > > (reboot) node 1084772369/(null) : No route to host (-113)
> > > > > May 23 19:52:47 alpha dlm_stonith[18186]: kick_helper error
> > > > > -113
> > > > > nodeid
> > > > > 1084772369
> > > > > 
> > > > > I can recover from the situation by forcing it (or bring the
> > > > > other node
> > > > > back online)
> > > > > dlm_tool fence_ack 1084772369
> > > > > 
> > > > > cluster config is pretty straightforward.
> > > > > node 1084772368: alpha
> > > > > node 1084772369: beta
> > > > > primitive p_dlm_controld ocf:pacemaker:controld \
> > > > > op monitor interval=60 timeout=60 \
> > > > > meta target-role=Started \
> > > > > params args="-K -L -s 1"
> > > > > primitive p_fs_gfs2 Filesystem \
> > > > > params device="/dev/sdb2" directory="/vms"
> > > > > fstype=gfs2
> > > > > primitive stonith_sbd stonith:external/sbd \
> > > > > params pcmk_delay_max=30 sbd_device="/dev/sdb1" \
> > > > > meta target-role=Started
> > > > 
> > > > What is the status of stonith resource? Did you configure SBD
> > > > fencing
> > > > properly?
> > > 
> > > I believe so.  It's shown above in my cluster config.
> > > 
> > > > Is sbd daemon up and running with proper parameters?
> > > 
> > > Well, no, apparently sbd isn't running.  With dlm and gfs2, the
> > > cluster handles launching the daemons.
> > > I assumed the same here, since the resource shows that it is up.
> > 
> > Unlike other services, sbd must be up before the cluster starts in
> > order for the cluster to use it properly. (Notice the "have-
> > watchdog=false" in your cib-bootstrap-options ... that means the
> > cluster didn't find sbd running.)
> > 
> > Also, even storage-based sbd requires a working hardware watchdog
> > for
> > the actual self-fencing. SBD_WATCHDOG_DEV in /etc/sysconfig/sbd
> > should
> > list the watchdog device. Also sbd_device in your cluster config
> > should
> > match SBD_DEVICE in /etc/sysconfig/sbd.
> > 
> > If you want the cluster to recover services elsewhere after a node
> > self-fences (which I'm sure you do), you also need to set the
> > stonith-
> > watchdog-timeout cluster property to something greater than the
> > value
> > of SBD_WATCHDOG_TIMEOUT in /etc/sysconfig/sbd. The cluster will
> > wait
> > that long and then assume the node fenced itself.
> 
> Actually, in the case where there is a shared disk, a successful
> fencing attempt via the sbd fencing resource should be enough
> for the node to be assumed down.
> In case of a 2-node setup I would even discourage setting
> stonith-watchdog-timeout, as we need a real quorum mechanism
> for that to work.

Ah, thanks -- I've updated the wiki how-to, feel free to clarify
further:

https://wiki.clusterlabs.org/wiki/Using_SBD_with_Pacemaker

> 
> Regards,
> Klaus
>  
> > 
> > > Online: [ alpha beta ]
> > > 
> > > Full list of resources:
> > > 
>> > >  stonith_sbd (stonith:external/sbd): Started alpha
> > >  Clone Set: cl_gfs2 [g_gfs2]
> > >  Started: [ alpha beta ]
> > > 
> > > 
> > > > What is output of
> > > > sbd -d /dev/sdb1 dump
> > > > sbd -d /dev/sdb1 list
> > > 
> > > Both nodes seem fine.
> > > 
>> > > 0   alpha   test    beta
>> > > 1   beta    test    alpha
> > > 
> > > 
> > > > on both nodes? Does
> > > > 
> > > > sbd -d /dev/sdb1 message  test
> > > > 
> > > > work in both directions?
> > > 
> > > It doesn't return an error, yet without a daemon running, I don't
> > > think the message is 

Re: [ClusterLabs] DLM fencing

2018-05-24 Thread Klaus Wenninger
On 05/24/2018 04:03 PM, Ken Gaillot wrote:
> On Thu, 2018-05-24 at 06:47 -0400, Jason Gauthier wrote:
>> On Thu, May 24, 2018 at 12:19 AM, Andrei Borzenkov wrote:
>>> 24.05.2018 02:57, Jason Gauthier wrote:
 I'm fairly new to clustering under Linux.  I basically have one shared
 storage resource right now, using dlm, and gfs2.
 I'm using fibre channel and when both of my nodes are up (2 node
 cluster)
 dlm and gfs2 seem to be operating perfectly.
 If I reboot node B, node A works fine and vice-versa.

 When node B goes offline unexpectedly, and becomes unclean, dlm
 seems to block all IO to the shared storage.

 dlm knows node B is down:

 # dlm_tool status
 cluster nodeid 1084772368 quorate 1 ring seq 32644 32644
 daemon now 865695 fence_pid 18186
 fence 1084772369 nodedown pid 18186 actor 1084772368 fail
 1527119246 fence
 0 now 1527119524
 node 1084772368 M add 861439 rem 0 fail 0 fence 0 at 0 0
 node 1084772369 X add 865239 rem 865416 fail 865416 fence 0 at 0
 0

 on the same server, I see these messages in my daemon.log
 May 23 19:52:47 alpha stonith-api[18186]: stonith_api_kick: Could
 not kick
 (reboot) node 1084772369/(null) : No route to host (-113)
 May 23 19:52:47 alpha dlm_stonith[18186]: kick_helper error -113
 nodeid
 1084772369

 I can recover from the situation by forcing it (or bring the
 other node
 back online)
 dlm_tool fence_ack 1084772369

 cluster config is pretty straightforward.
 node 1084772368: alpha
 node 1084772369: beta
 primitive p_dlm_controld ocf:pacemaker:controld \
 op monitor interval=60 timeout=60 \
 meta target-role=Started \
 params args="-K -L -s 1"
 primitive p_fs_gfs2 Filesystem \
 params device="/dev/sdb2" directory="/vms" fstype=gfs2
 primitive stonith_sbd stonith:external/sbd \
 params pcmk_delay_max=30 sbd_device="/dev/sdb1" \
 meta target-role=Started
>>> What is the status of stonith resource? Did you configure SBD
>>> fencing
>>> properly?
>> I believe so.  It's shown above in my cluster config.
>>
>>> Is sbd daemon up and running with proper parameters?
>> Well, no, apparently sbd isn't running.  With dlm and gfs2, the
>> cluster handles launching the daemons.
>> I assumed the same here, since the resource shows that it is up.
> Unlike other services, sbd must be up before the cluster starts in
> order for the cluster to use it properly. (Notice the "have-
> watchdog=false" in your cib-bootstrap-options ... that means the
> cluster didn't find sbd running.)
>
> Also, even storage-based sbd requires a working hardware watchdog for
> the actual self-fencing. SBD_WATCHDOG_DEV in /etc/sysconfig/sbd should
> list the watchdog device. Also sbd_device in your cluster config should
> match SBD_DEVICE in /etc/sysconfig/sbd.
>
> If you want the cluster to recover services elsewhere after a node
> self-fences (which I'm sure you do), you also need to set the stonith-
> watchdog-timeout cluster property to something greater than the value
> of SBD_WATCHDOG_TIMEOUT in /etc/sysconfig/sbd. The cluster will wait
> that long and then assume the node fenced itself.

Actually, in the case where there is a shared disk, a successful
fencing attempt via the sbd fencing resource should be enough
for the node to be assumed down.
In case of a 2-node setup I would even discourage setting
stonith-watchdog-timeout, as we need a real quorum mechanism
for that to work.

Regards,
Klaus
 
>
>> Online: [ alpha beta ]
>>
>> Full list of resources:
>>
>>  stonith_sbd (stonith:external/sbd): Started alpha
>>  Clone Set: cl_gfs2 [g_gfs2]
>>  Started: [ alpha beta ]
>>
>>
>>> What is output of
>>> sbd -d /dev/sdb1 dump
>>> sbd -d /dev/sdb1 list
>> Both nodes seem fine.
>>
>> 0   alpha   test    beta
>> 1   beta    test    alpha
>>
>>
>>> on both nodes? Does
>>>
>>> sbd -d /dev/sdb1 message  test
>>>
>>> work in both directions?
>> It doesn't return an error, yet without a daemon running, I don't
>> think the message is received either.
>>
>>
>>> Does manual fencing using stonith_admin work?
>> I'm not sure at the moment.  I think I need to look into why the
>> daemon isn't running.
>>
 group g_gfs2 p_dlm_controld p_fs_gfs2
 clone cl_gfs2 g_gfs2 \
 meta interleave=true target-role=Started
 location cli-prefer-cl_gfs2 cl_gfs2 role=Started inf: alpha
 property cib-bootstrap-options: \
 have-watchdog=false \
 dc-version=1.1.16-94ff4df \
 cluster-infrastructure=corosync \
 cluster-name=zeta \
 last-lrm-refresh=1525523370 \
 stonith-enabled=true \
 stonith-timeout=20s

 Any pointers would be appreciated. I feel like this should be
 working but
 I'm not sure if I've missed 

Re: [ClusterLabs] DLM fencing

2018-05-24 Thread Ken Gaillot
On Thu, 2018-05-24 at 06:47 -0400, Jason Gauthier wrote:
> On Thu, May 24, 2018 at 12:19 AM, Andrei Borzenkov wrote:
> > 24.05.2018 02:57, Jason Gauthier wrote:
> > > I'm fairly new to clustering under Linux.  I basically have one shared
> > > storage resource right now, using dlm, and gfs2.
> > > I'm using fibre channel and when both of my nodes are up (2 node
> > > cluster)
> > > dlm and gfs2 seem to be operating perfectly.
> > > If I reboot node B, node A works fine and vice-versa.
> > > 
> > > When node B goes offline unexpectedly, and becomes unclean, dlm
> > > seems to block all IO to the shared storage.
> > > 
> > > dlm knows node B is down:
> > > 
> > > # dlm_tool status
> > > cluster nodeid 1084772368 quorate 1 ring seq 32644 32644
> > > daemon now 865695 fence_pid 18186
> > > fence 1084772369 nodedown pid 18186 actor 1084772368 fail
> > > 1527119246 fence
> > > 0 now 1527119524
> > > node 1084772368 M add 861439 rem 0 fail 0 fence 0 at 0 0
> > > node 1084772369 X add 865239 rem 865416 fail 865416 fence 0 at 0
> > > 0
> > > 
> > > on the same server, I see these messages in my daemon.log
> > > May 23 19:52:47 alpha stonith-api[18186]: stonith_api_kick: Could
> > > not kick
> > > (reboot) node 1084772369/(null) : No route to host (-113)
> > > May 23 19:52:47 alpha dlm_stonith[18186]: kick_helper error -113
> > > nodeid
> > > 1084772369
> > > 
> > > I can recover from the situation by forcing it (or bring the
> > > other node
> > > back online)
> > > dlm_tool fence_ack 1084772369
> > > 
> > > cluster config is pretty straightforward.
> > > node 1084772368: alpha
> > > node 1084772369: beta
> > > primitive p_dlm_controld ocf:pacemaker:controld \
> > > op monitor interval=60 timeout=60 \
> > > meta target-role=Started \
> > > params args="-K -L -s 1"
> > > primitive p_fs_gfs2 Filesystem \
> > > params device="/dev/sdb2" directory="/vms" fstype=gfs2
> > > primitive stonith_sbd stonith:external/sbd \
> > > params pcmk_delay_max=30 sbd_device="/dev/sdb1" \
> > > meta target-role=Started
> > 
> > What is the status of stonith resource? Did you configure SBD
> > fencing
> > properly?
> 
> I believe so.  It's shown above in my cluster config.
> 
> > Is sbd daemon up and running with proper parameters?
> 
> Well, no, apparently sbd isn't running.  With dlm and gfs2, the
> cluster handles launching the daemons.
> I assumed the same here, since the resource shows that it is up.

Unlike other services, sbd must be up before the cluster starts in
order for the cluster to use it properly. (Notice the "have-
watchdog=false" in your cib-bootstrap-options ... that means the
cluster didn't find sbd running.)

Also, even storage-based sbd requires a working hardware watchdog for
the actual self-fencing. SBD_WATCHDOG_DEV in /etc/sysconfig/sbd should
list the watchdog device. Also sbd_device in your cluster config should
match SBD_DEVICE in /etc/sysconfig/sbd.

If you want the cluster to recover services elsewhere after a node
self-fences (which I'm sure you do), you also need to set the stonith-
watchdog-timeout cluster property to something greater than the value
of SBD_WATCHDOG_TIMEOUT in /etc/sysconfig/sbd. The cluster will wait
that long and then assume the node fenced itself.
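
For illustration only - the values below are placeholders, not taken from
this thread - the pieces described above would look roughly like this on a
sysconfig-based distribution:

  # /etc/sysconfig/sbd (illustrative values)
  SBD_DEVICE="/dev/sdb1"            # must match sbd_device in the stonith resource
  SBD_WATCHDOG_DEV=/dev/watchdog    # hardware watchdog used for self-fencing
  SBD_WATCHDOG_TIMEOUT=5

  # cluster property, set higher than SBD_WATCHDOG_TIMEOUT
  # (see Klaus's caveat about two-node clusters elsewhere in this thread)
  crm configure property stonith-watchdog-timeout=10s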

> 
> Online: [ alpha beta ]
> 
> Full list of resources:
> 
>  stonith_sbd (stonith:external/sbd): Started alpha
>  Clone Set: cl_gfs2 [g_gfs2]
>  Started: [ alpha beta ]
> 
> 
> > What is output of
> > sbd -d /dev/sdb1 dump
> > sbd -d /dev/sdb1 list
> 
> Both nodes seem fine.
> 
> 0   alpha   test    beta
> 1   beta    test    alpha
> 
> 
> > on both nodes? Does
> > 
> > sbd -d /dev/sdb1 message  test
> > 
> > work in both directions?
> 
> It doesn't return an error, yet without a daemon running, I don't
> think the message is received either.
> 
> 
> > Does manual fencing using stonith_admin work?
> 
> I'm not sure at the moment.  I think I need to look into why the
> daemon isn't running.
> 
> > > group g_gfs2 p_dlm_controld p_fs_gfs2
> > > clone cl_gfs2 g_gfs2 \
> > > meta interleave=true target-role=Started
> > > location cli-prefer-cl_gfs2 cl_gfs2 role=Started inf: alpha
> > > property cib-bootstrap-options: \
> > > have-watchdog=false \
> > > dc-version=1.1.16-94ff4df \
> > > cluster-infrastructure=corosync \
> > > cluster-name=zeta \
> > > last-lrm-refresh=1525523370 \
> > > stonith-enabled=true \
> > > stonith-timeout=20s
> > > 
> > > Any pointers would be appreciated. I feel like this should be
> > > working but
> > > I'm not sure if I've missed something.
> > > 
> > > Thanks,
> > > 
> > > Jason
> > > 
> > > 
> > > 

Re: [ClusterLabs] pcsd processes using 100% CPU

2018-05-24 Thread Jan Pokorný
On 23/05/18 12:43 -0600, Casey & Gina wrote:
> I don't have gcore installed and don't know which package might
> provide it.  I also don't have experience with gdb but am happy to
> try anything suggested to help figure out what's going on.

gcore is part of gdb:
https://packages.ubuntu.com/xenial/amd64/gdb/filelist

Note that using the utility should have no observable influence
on the running process in question.
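
In case it helps, a minimal sketch of that workflow (using the PID from the
earlier message; the output file name is arbitrary, and ruby debug symbols
are assumed to be installed if available):

  gcore -o pcsd.core 24923            # writes pcsd.core.24923 without stopping the process
  gdb /usr/bin/ruby pcsd.core.24923   # load the core together with the interpreter binary
  (gdb) thread apply all bt           # back-traces of every thread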

> The pcs version is 0.9.149, as packaged by Debian and inherited by
> Ubuntu.

Thanks, this will at least help to pinpoint any other similar
observations.

-- 
Jan (Poki)


___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] gfs2-utils 3.2.0 released

2018-05-24 Thread Andrew Price

Hi all,

I am happy to announce the 3.2.0 release of gfs2-utils. This is an 
important release adding support for new on-disk features introduced in 
the 4.16 kernel. In fact it is required when building against 4.16 and 
later kernel headers due to poor assumptions made by earlier gfs2-utils 
relating to structure size changes. Building earlier gfs2-utils against 
4.16 headers will result in test suite failures. (Thanks to Valentin 
Vidic for reporting these issues.)


This release adds basic support for new gfs2 on-disk features:

  * Resource group header CRCs
  * "Next resource group" pointers in resource groups
  * Journal log header block CRCs
  * Journal log header timestamp fields
  * Statfs accounting fields in journal log headers

Future releases will build on this work to take advantage of these new 
features, particularly for improving checking and performance.


Other notable changes:

  * mkfs.gfs2 now scales down the journal size to make better use of 
small devices by default

  * Better detection of bad device topology
  * fsck.gfs2 no longer accepts conflicting -p, -n and -y options
  * Saving of symlinks in gfs2_edit savemeta has been fixed
  * Fixes for issues caught by static analysis and new compiler warnings
  * New test cases in the testsuite
  * Various minor code cleanups and improvements

See below for a complete list of changes.

The source tarball is available from:

  https://releases.pagure.org/gfs2-utils/gfs2-utils-3.2.0.tar.gz

Please report bugs against the gfs2-utils component of Fedora rawhide: 
https://bugzilla.redhat.com/enter_bug.cgi?product=Fedora&component=gfs2-utils&version=rawhide


Regards,
Andy


Changes since version 3.1.10:

Andrew Price (66):
  gfs2_grow: Disable rgrp alignment when dev topology is unsuitable
  mkfs.gfs2: Free unnecessary cached pages, disable readahead
  mkfs.gfs2: Fix resource group alignment issue
  libgfs2: Issue one write per rgrp when creating them
  libgfs2: Switch gfs2_dinode_out to use a char buffer
  libgfs2: Switch gfs2_log_header_out to use a char buffer
  gfs2-utils: Change instances of "gfs2_fsck" to "fsck.gfs2"
  gfs2_convert: Fix fgets return value warning
  fsck.gfs2: Fix snprintf truncation warning
  fsck.gfs2: Fix unchecked return value warning
  gfs2_grow: Fix unchecked ftruncate return value warning
  gfs2_grow: Remove unnecessary nesting in fix_rindex()
  gfs2_edit savemeta: Fix up saving of dinodes/symlinks
  gfs2_edit savemeta: Use size_t for saved structure lengths
  fsck.gfs2: Make -p, -n and -y conflicting options
  gfs2_edit: Print offsets of indirect pointers
  gfs2-utils configure: Check for rg_skip
  libgfs2: Add rgrp_skip support
  mkfs.gfs2: Pull place_journals() out of place_rgrps()
  mkfs.gfs2: Set the rg_skip field in new rgrps
  gfs2-utils configure: Check for rg_data0, rg_data and rg_bitbytes
  libgfs2: Add support for rg_data0, rg_data and rg_bitbytes
  mkfs.gfs2: Set the rg_data0, rg_data and rg_bitbytes fields
  libgfs2: Add support for rg_crc
  Add basic support for v2 log headers
  mkfs.gfs2: Scale down journal size for smaller devices
  gfs2-utils: Remove make-tarball.sh
  glocktop: Remove a non-existent flag from the usage string
  fsck.gfs2: Don't check lh_crc for older filesystems
  libgfs2: Remove unused lock* fields from gfs2_sbd
  libgfs2: Remove sb_addr from gfs2_sbd
  libgfs2: Plug an alignment hole in gfs2_sbd
  libgfs2: Plug an alignment hole in gfs2_buffer_head
  libgfs2: Plug an alignment hole in gfs2_inode
  libgfs2: Remove gfs2_meta_header_out_bh()
  libgfs2: Don't pass an extlen to block_map where not required
  libgfs2: Don't use a buffer_head in gfs2_meta_header_in
  libgfs2: Don't use buffer_heads in gfs2_sb_in
  libgfs2: Don't use buffer_heads in gfs2_rgrp_in
  libgfs2: Remove gfs2_rgrp_out_bh
  libgfs2: Don't use buffer_heads in gfs2_dinode_in
  libgfs2: Remove gfs2_dinode_out_bh
  libgfs2: Don't use buffer_heads in gfs2_leaf_{in,out}
  libgfs2: Don't use buffer_heads in gfs2_log_header_in
  libgfs2: Remove gfs2_log_header_out_bh
  libgfs2: Don't use buffer_heads in gfs2_log_descriptor_{in,out}
  libgfs2: Don't use buffer_heads in gfs2_quota_change_{in,out}
  libgfs2: Fix two unused variable warnings
  mkfs.gfs2: Silence an integer overflow warning
  libgfs2: Fix a thinko in write_journal()
  gfs2-utils tests: Add a fsck.gfs2 test for rebuilding journals
  gfs2_edit: Fix null pointer deref in dump_journal()
  libgfs2: Remove dead code from gfs2_rgrp_read()
  fsck.gfs2: Avoid int overflow in find_next_rgrp_dist
  gfs2_edit: Avoid potential int overflow in find_journal_block()
  gfs2_edit: Avoid a potential int overflow in dump_journal
  gfs2-utils: Avoid some more potential int overflows
  libgfs2: Fix a memory leak in lgfs2_build_jindex
  glocktop: Fix memory leak 

Re: [ClusterLabs] DLM fencing

2018-05-24 Thread Jason Gauthier
On Thu, May 24, 2018 at 12:19 AM, Andrei Borzenkov  wrote:
> 24.05.2018 02:57, Jason Gauthier wrote:
>> I'm fairly new to clustering under Linux.  I basically have one shared
>> storage resource right now, using dlm, and gfs2.
>> I'm using fibre channel and when both of my nodes are up (2 node cluster)
>> dlm and gfs2 seem to be operating perfectly.
>> If I reboot node B, node A works fine and vice-versa.
>>
>> When node B goes offline unexpectedly, and becomes unclean, dlm seems to
>> block all IO to the shared storage.
>>
>> dlm knows node B is down:
>>
>> # dlm_tool status
>> cluster nodeid 1084772368 quorate 1 ring seq 32644 32644
>> daemon now 865695 fence_pid 18186
>> fence 1084772369 nodedown pid 18186 actor 1084772368 fail 1527119246 fence
>> 0 now 1527119524
>> node 1084772368 M add 861439 rem 0 fail 0 fence 0 at 0 0
>> node 1084772369 X add 865239 rem 865416 fail 865416 fence 0 at 0 0
>>
>> on the same server, I see these messages in my daemon.log
>> May 23 19:52:47 alpha stonith-api[18186]: stonith_api_kick: Could not kick
>> (reboot) node 1084772369/(null) : No route to host (-113)
>> May 23 19:52:47 alpha dlm_stonith[18186]: kick_helper error -113 nodeid
>> 1084772369
>>
>> I can recover from the situation by forcing it (or bring the other node
>> back online)
>> dlm_tool fence_ack 1084772369
>>
>> cluster config is pretty straightforward.
>> node 1084772368: alpha
>> node 1084772369: beta
>> primitive p_dlm_controld ocf:pacemaker:controld \
>> op monitor interval=60 timeout=60 \
>> meta target-role=Started \
>> params args="-K -L -s 1"
>> primitive p_fs_gfs2 Filesystem \
>> params device="/dev/sdb2" directory="/vms" fstype=gfs2
>> primitive stonith_sbd stonith:external/sbd \
>> params pcmk_delay_max=30 sbd_device="/dev/sdb1" \
>> meta target-role=Started
>
> What is the status of stonith resource? Did you configure SBD fencing
> properly?

I believe so.  It's shown above in my cluster config.

> Is sbd daemon up and running with proper parameters?

Well, no, apparently sbd isn't running.  With dlm and gfs2, the
cluster handles launching the daemons.
I assumed the same here, since the resource shows that it is up.

Online: [ alpha beta ]

Full list of resources:

 stonith_sbd (stonith:external/sbd): Started alpha
 Clone Set: cl_gfs2 [g_gfs2]
 Started: [ alpha beta ]


> What is output of
> sbd -d /dev/sdb1 dump
> sbd -d /dev/sdb1 list

Both nodes seem fine.

0   alpha   test    beta
1   beta    test    alpha


> on both nodes? Does
>
> sbd -d /dev/sdb1 message  test
>
> work in both directions?

It doesn't return an error, yet without a daemon running, I don't
think the message is received either.


> Does manual fencing using stonith_admin work?

I'm not sure at the moment.  I think I need to look into why the
daemon isn't running.
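
As an aside, a manual fencing test along the lines Andrei suggests might look
like the following (node name taken from this cluster; option spelling may
vary between pacemaker versions):

  stonith_admin --reboot beta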

>> group g_gfs2 p_dlm_controld p_fs_gfs2
>> clone cl_gfs2 g_gfs2 \
>> meta interleave=true target-role=Started
>> location cli-prefer-cl_gfs2 cl_gfs2 role=Started inf: alpha
>> property cib-bootstrap-options: \
>> have-watchdog=false \
>> dc-version=1.1.16-94ff4df \
>> cluster-infrastructure=corosync \
>> cluster-name=zeta \
>> last-lrm-refresh=1525523370 \
>> stonith-enabled=true \
>> stonith-timeout=20s
>>
>> Any pointers would be appreciated. I feel like this should be working but
>> I'm not sure if I've missed something.
>>
>> Thanks,
>>
>> Jason
>>
>>
>>
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] DLM fencing

2018-05-24 Thread Klaus Wenninger
On 05/24/2018 06:19 AM, Andrei Borzenkov wrote:
> 24.05.2018 02:57, Jason Gauthier wrote:
>> I'm fairly new to clustering under Linux.  I basically have one shared
>> storage resource right now, using dlm, and gfs2.
>> I'm using fibre channel and when both of my nodes are up (2 node cluster)
>> dlm and gfs2 seem to be operating perfectly.
>> If I reboot node B, node A works fine and vice-versa.
>>
>> When node B goes offline unexpectedly, and becomes unclean, dlm seems to
>> block all IO to the shared storage.
>>
>> dlm knows node B is down:
>>
>> # dlm_tool status
>> cluster nodeid 1084772368 quorate 1 ring seq 32644 32644
>> daemon now 865695 fence_pid 18186
>> fence 1084772369 nodedown pid 18186 actor 1084772368 fail 1527119246 fence
>> 0 now 1527119524
>> node 1084772368 M add 861439 rem 0 fail 0 fence 0 at 0 0
>> node 1084772369 X add 865239 rem 865416 fail 865416 fence 0 at 0 0
>>
>> on the same server, I see these messages in my daemon.log
>> May 23 19:52:47 alpha stonith-api[18186]: stonith_api_kick: Could not kick
>> (reboot) node 1084772369/(null) : No route to host (-113)
>> May 23 19:52:47 alpha dlm_stonith[18186]: kick_helper error -113 nodeid
>> 1084772369
>>
>> I can recover from the situation by forcing it (or bring the other node
>> back online)
>> dlm_tool fence_ack 1084772369
>>
>> cluster config is pretty straightforward.
>> node 1084772368: alpha
>> node 1084772369: beta
>> primitive p_dlm_controld ocf:pacemaker:controld \
>> op monitor interval=60 timeout=60 \
>> meta target-role=Started \
>> params args="-K -L -s 1"
>> primitive p_fs_gfs2 Filesystem \
>> params device="/dev/sdb2" directory="/vms" fstype=gfs2
>> primitive stonith_sbd stonith:external/sbd \
>> params pcmk_delay_max=30 sbd_device="/dev/sdb1" \
>> meta target-role=Started
> What is the status of stonith resource? Did you configure SBD fencing
> properly?  Is sbd daemon up and running with proper parameters? What is
> output of
>
> sbd -d /dev/sdb1 dump
> sbd -d /dev/sdb1 list
>
> on both nodes? Does
>
> sbd -d /dev/sdb1 message  test
>
> work in both directions?
>
> Does manual fencing using stonith_admin work?

And check that your sbd is new enough (1.3.1 to be on the safe
side), otherwise it won't work properly with 2-node enabled in
corosync.
But this wouldn't explain your problem - it would rather be the
other way round: you would still have access to the device while
it might not be assured that the sbd-fenced node would properly
watchdog-suicide in case it loses access to the storage.
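
For reference, a quick way to check which sbd version is installed (plain
package queries; the Debian form is shown first since this cluster runs
Debian):

  dpkg -s sbd | grep Version    # Debian/Ubuntu
  rpm -q sbd                    # RHEL/CentOS/Fedora/SUSE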

Regards,
Klaus
 
>
>> group g_gfs2 p_dlm_controld p_fs_gfs2
>> clone cl_gfs2 g_gfs2 \
>> meta interleave=true target-role=Started
>> location cli-prefer-cl_gfs2 cl_gfs2 role=Started inf: alpha
>> property cib-bootstrap-options: \
>> have-watchdog=false \
>> dc-version=1.1.16-94ff4df \
>> cluster-infrastructure=corosync \
>> cluster-name=zeta \
>> last-lrm-refresh=1525523370 \
>> stonith-enabled=true \
>> stonith-timeout=20s
>>
>> Any pointers would be appreciated. I feel like this should be working but
>> I'm not sure if I've missed something.
>>
>> Thanks,
>>
>> Jason
>>
>>
>>

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Re: pcsd processes using 100% CPU

2018-05-24 Thread Casey & Gina
Tried that, it doesn't seem to do anything but prefix the lines with the pid:

[pid 24923] sched_yield()   = 0
[pid 24923] sched_yield()   = 0
[pid 24923] sched_yield()   = 0

Regards,
-- 
Casey

> On May 23, 2018, at 11:40 PM, Ulrich Windl wrote:
> 
Casey & Gina wrote on 23.05.2018 at 20:43 in
> message <3b8567a0-ef36-44af-bbad-0d494b08f...@icloud.com>:
> [...]
>> I ran `strace -p `, and the screen filled with the following line 
>> repeating as fast as my terminal can render:
>> sched_yield()   = 0
>> sched_yield()   = 0
>> sched_yield()   = 0
> 
> I wonder whether such a process is multi-threaded and whether adding option
> "-f" to strace would make a difference...
> 
> [...]
> Regards,
> Ulrich
> 

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org