Re: [ClusterLabs] Can't do anything right; how do I start over?

2016-10-15 Thread Dimitri Maziuk
On 10/15/2016 12:27 PM, Dmitri Maziuk wrote:
> On 2016-10-15 01:56, Jay Scott wrote:
> 
>> So, what's wrong?  (I'm a newbie, of course.)
> 
> Here's what worked for me on centos 7:
> http://octopus.bmrb.wisc.edu/dokuwiki/doku.php?id=sysadmin:pacemaker
> YMMV and all that.

PS. I can't in all honesty recommend this setup for running NFS clusters
at this point.

About 1 in 3 times I do 'pcs standby ' I get

> Oct 15 15:31:52 lionfish crmd[1137]:  notice: Initiating action 46: stop 
> drbd_filesystem_stop_0 on lionfish (local)
> Oct 15 15:31:52 lionfish Filesystem(drbd_filesystem)[32120]: INFO: Running 
> stop for /dev/drbd0 on /raid
> Oct 15 15:31:52 lionfish Filesystem(drbd_filesystem)[32120]: INFO: Trying to 
> unmount /raid
> Oct 15 15:31:52 lionfish Filesystem(drbd_filesystem)[32120]: ERROR: Couldn't 
> unmount /raid; trying cleanup with TERM
> Oct 15 15:31:52 lionfish Filesystem(drbd_filesystem)[32120]: INFO: No 
> processes on /raid were signalled. force_unmount is set to 'yes'
> Oct 15 15:31:53 lionfish Filesystem(drbd_filesystem)[32120]: ERROR: Couldn't 
> unmount /raid; trying cleanup with TERM
> Oct 15 15:31:53 lionfish Filesystem(drbd_filesystem)[32120]: INFO: No 
> processes on /raid were signalled. force_unmount is set to 'yes'
> Oct 15 15:31:54 lionfish Filesystem(drbd_filesystem)[32120]: ERROR: Couldn't 
> unmount /raid; trying cleanup with TERM
> Oct 15 15:31:54 lionfish Filesystem(drbd_filesystem)[32120]: INFO: No 
> processes on /raid were signalled. force_unmount is set to 'yes'
> Oct 15 15:31:56 lionfish Filesystem(drbd_filesystem)[32120]: ERROR: Couldn't 
> unmount /raid; trying cleanup with KILL
> Oct 15 15:31:56 lionfish Filesystem(drbd_filesystem)[32120]: INFO: No 
> processes on /raid were signalled. force_unmount is set to 'yes'
> Oct 15 15:31:57 lionfish Filesystem(drbd_filesystem)[32120]: ERROR: Couldn't 
> unmount /raid; trying cleanup with KILL
> Oct 15 15:31:57 lionfish Filesystem(drbd_filesystem)[32120]: INFO: No 
> processes on /raid were signalled. force_unmount is set to 'yes'
> Oct 15 15:31:58 lionfish Filesystem(drbd_filesystem)[32120]: ERROR: Couldn't 
> unmount /raid; trying cleanup with KILL
> Oct 15 15:31:58 lionfish Filesystem(drbd_filesystem)[32120]: INFO: No 
> processes on /raid were signalled. force_unmount is set to 'yes'
> Oct 15 15:31:59 lionfish Filesystem(drbd_filesystem)[32120]: ERROR: Couldn't 
> unmount /raid, giving up!
> Oct 15 15:32:00 lionfish lrmd[1134]:  notice: 
> drbd_filesystem_stop_0:32120:stderr [ umount: /raid: target is busy. ]
> Oct 15 15:32:00 lionfish lrmd[1134]:  notice: 
> drbd_filesystem_stop_0:32120:stderr [ (In some cases useful info 
> about processes that use ]
> Oct 15 15:32:00 lionfish lrmd[1134]:  notice: 
> drbd_filesystem_stop_0:32120:stderr [  the device is found by lsof(8) 
> or fuser(1)) ]
> Oct 15 15:32:00 lionfish lrmd[1134]:  notice: 
> drbd_filesystem_stop_0:32120:stderr [ ocf-exit-reason:Couldn't unmount /raid; 
> trying cleanup with TERM ]
> Oct 15 15:32:00 lionfish lrmd[1134]:  notice: 
> drbd_filesystem_stop_0:32120:stderr [ umount: /raid: target is busy. ]
> Oct 15 15:32:00 lionfish lrmd[1134]:  notice: 
> drbd_filesystem_stop_0:32120:stderr [ (In some cases useful info 
> about processes that use ]
> Oct 15 15:32:00 lionfish lrmd[1134]:  notice: 
> drbd_filesystem_stop_0:32120:stderr [  the device is found by lsof(8) 
> or fuser(1)) ]
> Oct 15 15:32:00 lionfish lrmd[1134]:  notice: 
> drbd_filesystem_stop_0:32120:stderr [ ocf-exit-reason:Couldn't unmount /raid; 
> trying cleanup with TERM ]
> Oct 15 15:32:00 lionfish lrmd[1134]:  notice: 
> drbd_filesystem_stop_0:32120:stderr [ umount: /raid: target is busy. ]
> Oct 15 15:32:00 lionfish lrmd[1134]:  notice: 
> drbd_filesystem_stop_0:32120:stderr [ (In some cases useful info 
> about processes that use ]
> Oct 15 15:32:00 lionfish lrmd[1134]:  notice: 
> drbd_filesystem_stop_0:32120:stderr [  the device is found by lsof(8) 
> or fuser(1)) ]
> Oct 15 15:32:00 lionfish lrmd[1134]:  notice: 
> drbd_filesystem_stop_0:32120:stderr [ ocf-exit-reason:Couldn't unmount /raid; 
> trying cleanup with TERM ]
> Oct 15 15:32:00 lionfish lrmd[1134]:  notice: 
> drbd_filesystem_stop_0:32120:stderr [ umount: /raid: target is busy. ]
> Oct 15 15:32:00 lionfish lrmd[1134]:  notice: 
> drbd_filesystem_stop_0:32120:stderr [ (In some cases useful info 
> about processes that use ]
> Oct 15 15:32:00 lionfish lrmd[1134]:  notice: 
> drbd_filesystem_stop_0:32120:stderr [  the device is found by lsof(8) 
> or fuser(1)) ]
> Oct 15 15:32:00 lionfish lrmd[1134]:  notice: 
> drbd_filesystem_stop_0:32120:stderr [ ocf-exit-reason:Couldn't unmount /raid; 
> trying cleanup with KILL ]
> Oct 15 15:32:00 lionfish lrmd[1134]:  notice: 
> drbd_filesystem_stop_0:32120:stderr [ umount: /raid: target is busy. ]
> Oct 15 15:32:00 lionfish lrmd[1134]:  notice: 
> drbd_filesystem_stop_0:32120:stderr [ (In some cases useful info 
> about p

Re: [ClusterLabs] Can't do anything right; how do I start over?

2016-10-15 Thread Dmitri Maziuk

On 2016-10-15 01:56, Jay Scott wrote:


So, what's wrong?  (I'm a newbie, of course.)


Here's what worked for me on centos 7: 
http://octopus.bmrb.wisc.edu/dokuwiki/doku.php?id=sysadmin:pacemaker

YMMV and all that.

cheers,
Dima


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Can't do anything right; how do I start over?

2016-10-15 Thread Jay Scott
Greetings,

Heh.  Well, the comment in corosync.conf makes sense to me now.
Thanks, I've fixed that.


Here's my corosync.conf

totem {
version: 2

crypto_cipher: none
crypto_hash: none

interface {
ringnumber: 0
bindnetaddr: 10.1.0.0
mcastaddr: 239.255.1.1
mcastport: 5405
ttl: 1
}
cluster_name: pecan
}

logging {
fileline: off
to_stderr: no
to_logfile: yes
logfile: /var/log/cluster/corosync.log
to_syslog: yes
debug: off
timestamp: on
logger_subsys {
subsys: QUORUM
debug: off
}
}

quorum {
provider: corosync_votequorum
two_node: 1
wait_for_all: 1
}
service {
name: pacemaker
ver: 1
}
nodelist {
  node {
ring0_addr: smoking
nodeid: 1
   }
  node {
ring0_addr: mars
nodeid: 2
   }
}


And a few things are behaving better than they did before.

At the moment my goal is to set up a partition as drbd.
In the interest of bandwidth I will show the commands that
I use and the result I finally get.


pcs cluster auth smoking mars
pcs property set stonith-enabled=true
stonith_admin --metadata --agent fence_pcmk
cibadmin -C -o resources --xml-file stonith.xml
pcs resource create floating_ip IPaddr2 ip=10.1.2.101 cidr_netmask=32
pcs resource  defaults resource-stickiness=100


And at this point, all appears well.  My pcs status output looks like
I think it should.

Now, of course, I admit that setting up the floating_ip is
not relevant to my goal of a drbd backed filesystem, but I've been
doing it as a sanity check.

On to drbd

modprobe drbd
systemctl start drbd.service
[root@smoking cluster]#  cat /proc/drbd
version: 8.4.8-1 (api:1/proto:86-101)
GIT-hash: 22b4c802192646e433d3f7399d578ec7fecc6272 build by mockbuild@,
2016-10-
13 19:58:26
 0: cs:Connected ro:Secondary/Secondary ds:Diskless/Diskless C r-
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
 1: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-
ns:0 nr:10574 dw:10574 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f
oos:0
 2: cs:Connected ro:Secondary/Secondary ds:Diskless/Diskless C r-
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

Again, this is stuff that hung around from the previous incarnation.
But it looks okay to me.  I'm planning to use the '1' device.
The above is run on the secondary machine, so Secondary/Primary is
correct.  And UpToDate/UpToDate looks right to me.

Now it goes south.  The mkfs.xfs appears to work, but that's not
relevant anyway, right?

pcs  resource create BravoSpace \
  ocf:linbit:drbd drbd_resource=bravo \
  op monitor interval=60s

[root@smoking ~]# pcs status
Cluster name: pecan
Last updated: Sat Oct 15 01:33:37 2016Last change: Sat Oct 15
01:18:56
 2016 by root via cibadmin on mars
Stack: corosync
Current DC: mars (version 1.1.13-10.el7_2.4-44eb2dd) - partition with quorum
2 nodes and 3 resources configured

Node mars: UNCLEAN (online)
Node smoking: UNCLEAN (online)

Full list of resources:

 Fencing(stonith:fence_pcmk):Started mars
 floating_ip(ocf::heartbeat:IPaddr2):Started mars
 BravoSpace(ocf::linbit:drbd):FAILED[ smoking mars ]

Failed Actions:
* BravoSpace_stop_0 on smoking 'not configured' (6): call=18,
status=complete, e
xitreason='none',
last-rc-change='Sat Oct 15 01:18:56 2016', queued=0ms, exec=63ms
* BravoSpace_stop_0 on mars 'not configured' (6): call=18, status=complete,
exit
reason='none',
last-rc-change='Sat Oct 15 01:18:56 2016', queued=0ms, exec=60ms


PCSD Status:
  smoking: Online
  mars: Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/disabled

I've looked in /var/log/cluster/corosync.log and it doesn't seem
happy but I don't know what I'm looking at.  On the primary
machine it's 1800+ lines on the secondary it's 600+ lines.
There are 337 lines just with BravoSpace in them.
One of them says
drbd(BravoSpace)[3295]:2016/10/15_01:18:56 ERROR: meta parameter
misconfigured,
 expected clone-max -le 2, but found unset.
But I tried adding clone-max=2 but the command barfed-- that's not a legal
parameter.

So, what's wrong?  (I'm a newbie, of course.)

I did a pcs resource cleanup .  That shut down fencing and the IP.
I tried pcs cluster start to get them back, no help.
I did pcs cluster standby smoking, and then unstandby smoking.
The ip started, but fencing has failed on BOTH machines.
I can't see what I'm doing wrong.

Thanks.  I realize I'm consuming your time on the cheap.


On Fri, Oct 14, 2016 at 3:33 PM, Dimitri Maziuk 
wrote:

> On 10/14/2016 02:48 PM, Jay Scott 

Re: [ClusterLabs] Can't do anything right; how do I start over?

2016-10-14 Thread Dimitri Maziuk
On 10/14/2016 02:48 PM, Jay Scott wrote:

> When I "start over" I stop all the services, delete the packages,
> empty the configs and logs as best I know how.  But this doesn't
> completely clear everything:  the drbd metadata is evidently still
> on the partitions I've set aside for it.

If it's small enough, dd if=/dev/zero of=/your/partition

Get DRBD working and fully sync'ed outside of the cluster before you
start adding it.

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu



signature.asc
Description: OpenPGP digital signature
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Can't do anything right; how do I start over?

2016-10-14 Thread Ken Gaillot
On 10/14/2016 02:48 PM, Jay Scott wrote:
> I've been trying a lot of things from the introductory manual.
> I have updated the instructions (on my hardcopy) to the versions
> of corosync etc. that I'm using.  I can't get hardly anything to
> work reliably beyond the ClusterIP.
> 
> So I start over -- I had been reinstalling the machines but I've
> grown tired of that.  So, before I start in on my other tales of woe,
> I figured I should find out how to start over "according to Hoyle".
> 
> When I "start over" I stop all the services, delete the packages,
> empty the configs and logs as best I know how.  But this doesn't
> completely clear everything:  the drbd metadata is evidently still
> on the partitions I've set aside for it.
> 
> 
> 
> Oh, before I forget, in particular:
> in corosync.conf:
> totem {
> interface {
> # This is normally the *network* address of the
> # interface to bind to. This ensures that you can use
> # identical instances of this configuration file
> # across all your cluster nodes, without having to
> # modify this option.
> bindnetaddr: 10.1.1.22
> [snip]
> }
> }
> bindnetaddr:  I've tried using an address on ONE of the machines
> (everywhere),
> and I've tried using an address that's on each participating machine,
> thus a diff corosync.conf file for each machine (but otherwise identical).
> What's the right thing?  From the comment it seems that there should
> be one address used among all machines.  But I kept getting messages
> about addresses already in use, so I thought I'd try to "fix" it.

The comment may be unclear ... bindnetaddr isn't an address *on* the
network, it's the address *of* the network.

For example, if you're using a /24 subnet (255.255.255.0 netmask), the
above bindnetaddr should be 10.1.1.0, which would cover any hosts with
addresses in the range 10.1.1.1 - 10.1.1.254.

> 
> This is my burn script.
> Am I missing something?  Doing it wrong?
> 
> #!/bin/bash
> pkill -9 -f pacemaker
> systemctl stop pacemaker.service
> systemctl stop corosync.service
> systemctl stop pcsd.service
> drbdadm down alpha
> drbdadm down bravo
> drbdadm down delta
> systemctl stop drbd.service
> 
> rpm -e drbd84-utils kmod-drbd84
> rpm -e pcs
> rpm -e pacemaker
> rpm -e pacemaker-cluster-libs
> rpm -e pacemaker-cli
> rpm -e pacemaker-libs
> rpm -e pacemaker-doc
> rpm -e lvm2-cluster
> rpm -e dlm
> rpm -e corosynclib corosync
> cd /var/lib/pacemaker
> rm cib/*
> rm pengine/*
> cd
> nullfile /var/log/cluster/corosync.conf

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org