Re: [ClusterLabs] Can't do anything right; how do I start over?
On 10/15/2016 12:27 PM, Dmitri Maziuk wrote:
> On 2016-10-15 01:56, Jay Scott wrote:
>> So, what's wrong? (I'm a newbie, of course.)
>
> Here's what worked for me on centos 7:
> http://octopus.bmrb.wisc.edu/dokuwiki/doku.php?id=sysadmin:pacemaker
> YMMV and all that.

PS. I can't in all honesty recommend this setup for running NFS clusters at
this point. About 1 in 3 times I do 'pcs standby ' I get

Oct 15 15:31:52 lionfish crmd[1137]: notice: Initiating action 46: stop drbd_filesystem_stop_0 on lionfish (local)
Oct 15 15:31:52 lionfish Filesystem(drbd_filesystem)[32120]: INFO: Running stop for /dev/drbd0 on /raid
Oct 15 15:31:52 lionfish Filesystem(drbd_filesystem)[32120]: INFO: Trying to unmount /raid
Oct 15 15:31:52 lionfish Filesystem(drbd_filesystem)[32120]: ERROR: Couldn't unmount /raid; trying cleanup with TERM
Oct 15 15:31:52 lionfish Filesystem(drbd_filesystem)[32120]: INFO: No processes on /raid were signalled. force_unmount is set to 'yes'
Oct 15 15:31:53 lionfish Filesystem(drbd_filesystem)[32120]: ERROR: Couldn't unmount /raid; trying cleanup with TERM
Oct 15 15:31:53 lionfish Filesystem(drbd_filesystem)[32120]: INFO: No processes on /raid were signalled. force_unmount is set to 'yes'
Oct 15 15:31:54 lionfish Filesystem(drbd_filesystem)[32120]: ERROR: Couldn't unmount /raid; trying cleanup with TERM
Oct 15 15:31:54 lionfish Filesystem(drbd_filesystem)[32120]: INFO: No processes on /raid were signalled. force_unmount is set to 'yes'
Oct 15 15:31:56 lionfish Filesystem(drbd_filesystem)[32120]: ERROR: Couldn't unmount /raid; trying cleanup with KILL
Oct 15 15:31:56 lionfish Filesystem(drbd_filesystem)[32120]: INFO: No processes on /raid were signalled. force_unmount is set to 'yes'
Oct 15 15:31:57 lionfish Filesystem(drbd_filesystem)[32120]: ERROR: Couldn't unmount /raid; trying cleanup with KILL
Oct 15 15:31:57 lionfish Filesystem(drbd_filesystem)[32120]: INFO: No processes on /raid were signalled. force_unmount is set to 'yes'
Oct 15 15:31:58 lionfish Filesystem(drbd_filesystem)[32120]: ERROR: Couldn't unmount /raid; trying cleanup with KILL
Oct 15 15:31:58 lionfish Filesystem(drbd_filesystem)[32120]: INFO: No processes on /raid were signalled. force_unmount is set to 'yes'
Oct 15 15:31:59 lionfish Filesystem(drbd_filesystem)[32120]: ERROR: Couldn't unmount /raid, giving up!
Oct 15 15:32:00 lionfish lrmd[1134]: notice: drbd_filesystem_stop_0:32120:stderr [ umount: /raid: target is busy. ]
Oct 15 15:32:00 lionfish lrmd[1134]: notice: drbd_filesystem_stop_0:32120:stderr [ (In some cases useful info about processes that use ]
Oct 15 15:32:00 lionfish lrmd[1134]: notice: drbd_filesystem_stop_0:32120:stderr [ the device is found by lsof(8) or fuser(1)) ]
Oct 15 15:32:00 lionfish lrmd[1134]: notice: drbd_filesystem_stop_0:32120:stderr [ ocf-exit-reason:Couldn't unmount /raid; trying cleanup with TERM ]
Oct 15 15:32:00 lionfish lrmd[1134]: notice: drbd_filesystem_stop_0:32120:stderr [ umount: /raid: target is busy. ]
Oct 15 15:32:00 lionfish lrmd[1134]: notice: drbd_filesystem_stop_0:32120:stderr [ (In some cases useful info about processes that use ]
Oct 15 15:32:00 lionfish lrmd[1134]: notice: drbd_filesystem_stop_0:32120:stderr [ the device is found by lsof(8) or fuser(1)) ]
Oct 15 15:32:00 lionfish lrmd[1134]: notice: drbd_filesystem_stop_0:32120:stderr [ ocf-exit-reason:Couldn't unmount /raid; trying cleanup with TERM ]
Oct 15 15:32:00 lionfish lrmd[1134]: notice: drbd_filesystem_stop_0:32120:stderr [ umount: /raid: target is busy. ]
Oct 15 15:32:00 lionfish lrmd[1134]: notice: drbd_filesystem_stop_0:32120:stderr [ (In some cases useful info about processes that use ]
Oct 15 15:32:00 lionfish lrmd[1134]: notice: drbd_filesystem_stop_0:32120:stderr [ the device is found by lsof(8) or fuser(1)) ]
Oct 15 15:32:00 lionfish lrmd[1134]: notice: drbd_filesystem_stop_0:32120:stderr [ ocf-exit-reason:Couldn't unmount /raid; trying cleanup with TERM ]
Oct 15 15:32:00 lionfish lrmd[1134]: notice: drbd_filesystem_stop_0:32120:stderr [ umount: /raid: target is busy. ]
Oct 15 15:32:00 lionfish lrmd[1134]: notice: drbd_filesystem_stop_0:32120:stderr [ (In some cases useful info about processes that use ]
Oct 15 15:32:00 lionfish lrmd[1134]: notice: drbd_filesystem_stop_0:32120:stderr [ the device is found by lsof(8) or fuser(1)) ]
Oct 15 15:32:00 lionfish lrmd[1134]: notice: drbd_filesystem_stop_0:32120:stderr [ ocf-exit-reason:Couldn't unmount /raid; trying cleanup with KILL ]
Oct 15 15:32:00 lionfish lrmd[1134]: notice: drbd_filesystem_stop_0:32120:stderr [ umount: /raid: target is busy. ]
Oct 15 15:32:00 lionfish lrmd[1134]: notice: drbd_filesystem_stop_0:32120:stderr [ (In some cases useful info about p
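[Editor's note: the log's own hint points at lsof(8)/fuser(1). A minimal
sketch for finding what keeps the mountpoint busy before putting a node in
standby; /raid is the mountpoint from the logs above, and the `|| true`
guards keep the commands harmless when nothing holds the mount (or the
path does not exist on this machine).]

```shell
# Identify processes holding the mount that the Filesystem agent
# is failing to unmount.
mnt=/raid
# May print nothing and exit non-zero when the mount is idle or absent.
fuser -vm "$mnt" || true
lsof "$mnt" || true
# If the filesystem is exported over NFS, the kernel nfsd can also hold
# it busy; an ordering constraint that stops the NFS server resource
# before the Filesystem resource avoids that.
```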
Re: [ClusterLabs] Can't do anything right; how do I start over?
On 2016-10-15 01:56, Jay Scott wrote:
> So, what's wrong? (I'm a newbie, of course.)

Here's what worked for me on centos 7:
http://octopus.bmrb.wisc.edu/dokuwiki/doku.php?id=sysadmin:pacemaker
YMMV and all that.

cheers,
Dima

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: [ClusterLabs] Can't do anything right; how do I start over?
Greetings,

Heh. Well, the comment in corosync.conf makes sense to me now. Thanks,
I've fixed that. Here's my corosync.conf:

totem {
    version: 2
    crypto_cipher: none
    crypto_hash: none
    interface {
        ringnumber: 0
        bindnetaddr: 10.1.0.0
        mcastaddr: 239.255.1.1
        mcastport: 5405
        ttl: 1
    }
    cluster_name: pecan
}
logging {
    fileline: off
    to_stderr: no
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
    debug: off
    timestamp: on
    logger_subsys {
        subsys: QUORUM
        debug: off
    }
}
quorum {
    provider: corosync_votequorum
    two_node: 1
    wait_for_all: 1
}
service {
    name: pacemaker
    ver: 1
}
nodelist {
    node {
        ring0_addr: smoking
        nodeid: 1
    }
    node {
        ring0_addr: mars
        nodeid: 2
    }
}

And a few things are behaving better than they did before. At the moment
my goal is to set up a partition as drbd. In the interest of bandwidth I
will show the commands that I use and the result I finally get.

pcs cluster auth smoking mars
pcs property set stonith-enabled=true
stonith_admin --metadata --agent fence_pcmk
cibadmin -C -o resources --xml-file stonith.xml
pcs resource create floating_ip IPaddr2 ip=10.1.2.101 cidr_netmask=32
pcs resource defaults resource-stickiness=100

And at this point, all appears well. My pcs status output looks like I
think it should. Now, of course, I admit that setting up the floating_ip
is not relevant to my goal of a drbd-backed filesystem, but I've been
doing it as a sanity check.
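[Editor's note: before layering resources on top, it is worth confirming
that corosync itself is healthy. These are standard corosync/pcs status
commands; they need a running cluster node, so this sketch only lists
them rather than executing them.]

```shell
# Ring, quorum, and membership checks to run on a live cluster node.
for check in \
    "corosync-cfgtool -s" \
    "corosync-quorumtool -s" \
    "pcs status corosync"; do
  echo "$check"   # on a cluster node, run each command directly
done
```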
On to drbd:

modprobe drbd
systemctl start drbd.service

[root@smoking cluster]# cat /proc/drbd
version: 8.4.8-1 (api:1/proto:86-101)
GIT-hash: 22b4c802192646e433d3f7399d578ec7fecc6272 build by mockbuild@, 2016-10-13 19:58:26
 0: cs:Connected ro:Secondary/Secondary ds:Diskless/Diskless C r-
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
 1: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-
    ns:0 nr:10574 dw:10574 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
 2: cs:Connected ro:Secondary/Secondary ds:Diskless/Diskless C r-
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

Again, this is stuff that hung around from the previous incarnation. But
it looks okay to me. I'm planning to use the '1' device. The above is run
on the secondary machine, so Secondary/Primary is correct. And
UpToDate/UpToDate looks right to me. Now it goes south. The mkfs.xfs
appears to work, but that's not relevant anyway, right?

pcs resource create BravoSpace \
    ocf:linbit:drbd drbd_resource=bravo \
    op monitor interval=60s

[root@smoking ~]# pcs status
Cluster name: pecan
Last updated: Sat Oct 15 01:33:37 2016
Last change: Sat Oct 15 01:18:56 2016 by root via cibadmin on mars
Stack: corosync
Current DC: mars (version 1.1.13-10.el7_2.4-44eb2dd) - partition with quorum
2 nodes and 3 resources configured

Node mars: UNCLEAN (online)
Node smoking: UNCLEAN (online)

Full list of resources:
 Fencing     (stonith:fence_pcmk):     Started mars
 floating_ip (ocf::heartbeat:IPaddr2): Started mars
 BravoSpace  (ocf::linbit:drbd):       FAILED [ smoking mars ]

Failed Actions:
* BravoSpace_stop_0 on smoking 'not configured' (6): call=18, status=complete,
  exitreason='none', last-rc-change='Sat Oct 15 01:18:56 2016', queued=0ms, exec=63ms
* BravoSpace_stop_0 on mars 'not configured' (6): call=18, status=complete,
  exitreason='none', last-rc-change='Sat Oct 15 01:18:56 2016', queued=0ms, exec=60ms

PCSD Status:
  smoking: Online
  mars: Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/disabled

I've looked in /var/log/cluster/corosync.log and it doesn't seem happy,
but I don't know what I'm looking at. On the primary machine it's 1800+
lines, on the secondary it's 600+ lines. There are 337 lines just with
BravoSpace in them. One of them says:

drbd(BravoSpace)[3295]: 2016/10/15_01:18:56 ERROR: meta parameter misconfigured, expected clone-max -le 2, but found unset.

I tried adding clone-max=2, but the command barfed -- that's not a legal
parameter. So, what's wrong? (I'm a newbie, of course.) I did a pcs
resource cleanup. That shut down fencing and the IP. I tried pcs cluster
start to get them back, no help. I did pcs cluster standby smoking, and
then unstandby smoking. The ip started, but fencing has failed on BOTH
machines. I can't see what I'm doing wrong.

Thanks. I realize I'm consuming your time on the cheap.

On Fri, Oct 14, 2016 at 3:33 PM, Dimitri Maziuk wrote:
> On 10/14/2016 02:48 PM, Jay Scott
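[Editor's note: the "expected clone-max -le 2, but found unset" message is
the linbit:drbd agent complaining that it was created as a plain
primitive. The agent is designed to run inside a master/slave (multi-state)
clone, which is what supplies the clone-max meta attribute -- clone-max is
not a parameter of the primitive itself, which is why adding it was
rejected. A hedged sketch of the missing step, using the pcs 0.9 syntax
shipped with CentOS 7 and the BravoSpace name from the thread; the wrapper
name BravoSpaceClone is hypothetical, and the command is only echoed here
since it needs a live cluster.]

```shell
# Wrap the existing BravoSpace primitive in a master/slave resource;
# the wrapper is what carries clone-max and the other clone meta attrs.
# BravoSpaceClone is a hypothetical name for the new wrapper.
cmd="pcs resource master BravoSpaceClone BravoSpace master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true"
echo "$cmd"
# On a live cluster node: eval "$cmd"
```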
Re: [ClusterLabs] Can't do anything right; how do I start over?
On 10/14/2016 02:48 PM, Jay Scott wrote:
> When I "start over" I stop all the services, delete the packages,
> empty the configs and logs as best I know how. But this doesn't
> completely clear everything: the drbd metadata is evidently still
> on the partitions I've set aside for it.

If it's small enough, dd if=/dev/zero of=/your/partition

Get DRBD working and fully sync'ed outside of the cluster before you
start adding it.

--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
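[Editor's note: besides zeroing the whole partition, DRBD can remove just
its own on-disk metadata with drbdadm wipe-md. A sketch using the 'bravo'
resource name from the thread; these commands are destructive and need a
node with drbd-utils installed, so they are only echoed here. /dev/sdXN is
a placeholder for the backing partition.]

```shell
# Tear the resource down, then wipe DRBD's metadata (run per resource).
for step in \
    "drbdadm down bravo" \
    "drbdadm wipe-md bravo"; do
  echo "$step"      # on a live node: eval "$step"
done
# Zeroing the backing partition also works, but rewrites everything:
# dd if=/dev/zero of=/dev/sdXN bs=1M
```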
Re: [ClusterLabs] Can't do anything right; how do I start over?
On 10/14/2016 02:48 PM, Jay Scott wrote:
> I've been trying a lot of things from the introductory manual.
> I have updated the instructions (on my hardcopy) to the versions
> of corosync etc. that I'm using. I can't get hardly anything to
> work reliably beyond the ClusterIP.
>
> So I start over -- I had been reinstalling the machines but I've
> grown tired of that. So, before I start in on my other tales of woe,
> I figured I should find out how to start over "according to Hoyle".
>
> When I "start over" I stop all the services, delete the packages,
> empty the configs and logs as best I know how. But this doesn't
> completely clear everything: the drbd metadata is evidently still
> on the partitions I've set aside for it.
>
> Oh, before I forget, in particular, in corosync.conf:
>
> totem {
>     interface {
>         # This is normally the *network* address of the
>         # interface to bind to. This ensures that you can use
>         # identical instances of this configuration file
>         # across all your cluster nodes, without having to
>         # modify this option.
>         bindnetaddr: 10.1.1.22
>         [snip]
>     }
> }
>
> bindnetaddr: I've tried using an address on ONE of the machines
> (everywhere), and I've tried using an address that's on each
> participating machine, thus a different corosync.conf file for each
> machine (but otherwise identical). What's the right thing? From the
> comment it seems that there should be one address used among all
> machines. But I kept getting messages about addresses already in use,
> so I thought I'd try to "fix" it.

The comment may be unclear ... bindnetaddr isn't an address *on* the
network, it's the address *of* the network. For example, if you're using
a /24 subnet (255.255.255.0 netmask), the above bindnetaddr should be
10.1.1.0, which would cover any hosts with addresses in the range
10.1.1.1 - 10.1.1.254.

> This is my burn script.
> Am I missing something? Doing it wrong?
>
> #!/bin/bash
> pkill -9 -f pacemaker
> systemctl stop pacemaker.service
> systemctl stop corosync.service
> systemctl stop pcsd.service
> drbdadm down alpha
> drbdadm down bravo
> drbdadm down delta
> systemctl stop drbd.service
>
> rpm -e drbd84-utils kmod-drbd84
> rpm -e pcs
> rpm -e pacemaker
> rpm -e pacemaker-cluster-libs
> rpm -e pacemaker-cli
> rpm -e pacemaker-libs
> rpm -e pacemaker-doc
> rpm -e lvm2-cluster
> rpm -e dlm
> rpm -e corosynclib corosync
> cd /var/lib/pacemaker
> rm cib/*
> rm pengine/*
> cd
> nullfile /var/log/cluster/corosync.conf
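[Editor's note: the /24 arithmetic in the bindnetaddr explanation above
can be checked mechanically. A small POSIX-shell sketch that masks a host
address down to the network address corosync wants, using the 10.1.1.22
address from the quoted config.]

```shell
# Mask host address 10.1.1.22 with a /24 prefix to get the network
# address for bindnetaddr.
ip=10.1.1.22
prefix=24
# Split the dotted quad into octets.
oldIFS=$IFS; IFS=.; set -- $ip; IFS=$oldIFS
addr=$(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
net=$(( addr & mask ))
echo "$(( (net >> 24) & 255 )).$(( (net >> 16) & 255 )).$(( (net >> 8) & 255 )).$(( net & 255 ))"
# prints 10.1.1.0
```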