Re: [ClusterLabs] Trouble with drbd/pacemaker: switch to secondary/secondary
Yikes. I don't have any suggestions. This is beyond me. Sorry.

J.

On Sat, Oct 15, 2016 at 4:48 AM, Anne Nicolas wrote:
>
> Anne
> http://mageia.org
>
> On 15 Oct 2016, 9:02 AM, "Jay Scott" wrote:
> >
> > Well, I'm a newbie myself. But this:
> >
> >     drbdadm primary --force ___the name of the drbd res___
> >
> > has worked for me. But I'm having lots of trouble myself, so...
> > Then there's this:
> >
> >     drbdadm -- --overwrite-data-of-peer primary bravo
> >
> > (bravo happens to be my drbd res) and that should also
> > strongarm one machine or another to be the primary.
>
> Well, I used those commands; it goes to primary, but I can see pacemaker
> switching it to secondary after some seconds.
>
> > j.
> >
> > On Fri, Oct 14, 2016 at 3:22 PM, Anne Nicolas wrote:
> >>
> >> Hi!
> >>
> >> I'm having trouble with a 2-node cluster used for DRBD / Apache / Samba
> >> and some other services.
> >>
> >> Whatever I do, it always goes to the following state:
> >>
> >> Last updated: Fri Oct 14 17:41:38 2016
> >> Last change: Thu Oct 13 10:42:29 2016 via cibadmin on bzvairsvr
> >> Stack: corosync
> >> Current DC: bzvairsvr (168430081) - partition with quorum
> >> Version: 1.1.8-9.mga5-394e906
> >> 2 Nodes configured, unknown expected votes
> >> 13 Resources configured.
> >>
> >> Online: [ bzvairsvr bzvairsvr2 ]
> >>
> >> Master/Slave Set: drbdservClone [drbdserv]
> >>     Slaves: [ bzvairsvr bzvairsvr2 ]
> >> Clone Set: fencing [st-ssh]
> >>     Started: [ bzvairsvr bzvairsvr2 ]
> >>
> >> When I reboot bzvairsvr2, this one goes primary again, but after a while
> >> becomes secondary also.
> >> I use a very basic fencing system based on ssh. It's not optimal, but
> >> enough for the current tests.
> >>
> >> Here is information about the configuration:
> >>
> >> node 168430081: bzvairsvr
> >> node 168430082: bzvairsvr2
> >> primitive apache apache \
> >>     params configfile="/etc/httpd/conf/httpd.conf" \
> >>     op start interval=0 timeout=120s \
> >>     op stop interval=0 timeout=120s
> >> primitive clusterip IPaddr2 \
> >>     params ip=192.168.100.1 cidr_netmask=24 nic=eno1 \
> >>     meta target-role=Started
> >> primitive clusterroute Route \
> >>     params destination="0.0.0.0/0" gateway=192.168.100.254
> >> primitive drbdserv ocf:linbit:drbd \
> >>     params drbd_resource=server \
> >>     op monitor interval=30s role=Slave \
> >>     op monitor interval=29s role=Master start-delay=30s
> >> primitive fsserv Filesystem \
> >>     params device="/dev/drbd/by-res/server" directory="/Server" fstype=ext4 \
> >>     op start interval=0 timeout=60s \
> >>     op stop interval=0 timeout=60s \
> >>     meta target-role=Started
> >> primitive libvirt-guests systemd:libvirt-guests
> >> primitive libvirtd systemd:libvirtd
> >> primitive mysql systemd:mysqld
> >> primitive named systemd:named
> >> primitive samba systemd:smb
> >> primitive st-ssh stonith:external/ssh \
> >>     params hostlist="bzvairsvr bzvairsvr2"
> >> group iphd clusterip clusterroute \
> >>     meta target-role=Started
> >> group services libvirtd libvirt-guests apache named mysql samba \
> >>     meta target-role=Started
> >> ms drbdservClone drbdserv \
> >>     meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started
> >> clone fencing st-ssh
> >> colocation fs_on_drbd inf: fsserv drbdservClone:Master
> >> colocation iphd_on_services inf: iphd services
> >> colocation services_on_fsserv inf: services fsserv
> >> order fsserv-after-drbdserv inf: drbdservClone:promote fsserv:start
> >> order services_after_fsserv inf: fsserv services
> >> property cib-bootstrap-options: \
> >>     dc-version=1.1.8-9.mga5-394e906 \
> >>     cluster-infrastructure=corosync \
> >>     no-quorum-policy=ignore \
> >>     stonith-enabled=true
> >>
> >> cluster logs are flooded by:
> >> Oct 14 17:42:28 [3445] bzvairsvr attrd: notice:
> >> attrd_trigger_update: Sending flush op to all hosts for:
Re: [ClusterLabs] Trouble with drbd/pacemaker: switch to secondary/secondary
Well, I'm a newbie myself. But this:

    drbdadm primary --force ___the name of the drbd res___

has worked for me. But I'm having lots of trouble myself, so...
Then there's this:

    drbdadm -- --overwrite-data-of-peer primary bravo

(bravo happens to be my drbd res) and that should also strongarm one
machine or another to be the primary.

j.

On Fri, Oct 14, 2016 at 3:22 PM, Anne Nicolas wrote:
>
> Hi!
>
> I'm having trouble with a 2-node cluster used for DRBD / Apache / Samba
> and some other services.
>
> Whatever I do, it always goes to the following state:
>
> Last updated: Fri Oct 14 17:41:38 2016
> Last change: Thu Oct 13 10:42:29 2016 via cibadmin on bzvairsvr
> Stack: corosync
> Current DC: bzvairsvr (168430081) - partition with quorum
> Version: 1.1.8-9.mga5-394e906
> 2 Nodes configured, unknown expected votes
> 13 Resources configured.
>
> Online: [ bzvairsvr bzvairsvr2 ]
>
> Master/Slave Set: drbdservClone [drbdserv]
>     Slaves: [ bzvairsvr bzvairsvr2 ]
> Clone Set: fencing [st-ssh]
>     Started: [ bzvairsvr bzvairsvr2 ]
>
> When I reboot bzvairsvr2, this one goes primary again, but after a while
> becomes secondary also.
> I use a very basic fencing system based on ssh. It's not optimal, but
> enough for the current tests.
>
> Here is information about the configuration:
>
> node 168430081: bzvairsvr
> node 168430082: bzvairsvr2
> primitive apache apache \
>     params configfile="/etc/httpd/conf/httpd.conf" \
>     op start interval=0 timeout=120s \
>     op stop interval=0 timeout=120s
> primitive clusterip IPaddr2 \
>     params ip=192.168.100.1 cidr_netmask=24 nic=eno1 \
>     meta target-role=Started
> primitive clusterroute Route \
>     params destination="0.0.0.0/0" gateway=192.168.100.254
> primitive drbdserv ocf:linbit:drbd \
>     params drbd_resource=server \
>     op monitor interval=30s role=Slave \
>     op monitor interval=29s role=Master start-delay=30s
> primitive fsserv Filesystem \
>     params device="/dev/drbd/by-res/server" directory="/Server" fstype=ext4 \
>     op start interval=0 timeout=60s \
>     op stop interval=0 timeout=60s \
>     meta target-role=Started
> primitive libvirt-guests systemd:libvirt-guests
> primitive libvirtd systemd:libvirtd
> primitive mysql systemd:mysqld
> primitive named systemd:named
> primitive samba systemd:smb
> primitive st-ssh stonith:external/ssh \
>     params hostlist="bzvairsvr bzvairsvr2"
> group iphd clusterip clusterroute \
>     meta target-role=Started
> group services libvirtd libvirt-guests apache named mysql samba \
>     meta target-role=Started
> ms drbdservClone drbdserv \
>     meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started
> clone fencing st-ssh
> colocation fs_on_drbd inf: fsserv drbdservClone:Master
> colocation iphd_on_services inf: iphd services
> colocation services_on_fsserv inf: services fsserv
> order fsserv-after-drbdserv inf: drbdservClone:promote fsserv:start
> order services_after_fsserv inf: fsserv services
> property cib-bootstrap-options: \
>     dc-version=1.1.8-9.mga5-394e906 \
>     cluster-infrastructure=corosync \
>     no-quorum-policy=ignore \
>     stonith-enabled=true
>
> cluster logs are flooded by:
> Oct 14 17:42:28 [3445] bzvairsvr attrd: notice:
> attrd_trigger_update: Sending flush op to all hosts for:
> master-drbdserv (1)
> Oct 14 17:42:28 [3445] bzvairsvr attrd: notice:
>     attrd_perform_update: Sent update master-drbdserv=1 failed:
>     Transport endpoint is not connected
> Oct 14 17:42:28 [3445] bzvairsvr attrd: notice:
>     attrd_perform_update: Sent update -107: master-drbdserv=1
> Oct 14 17:42:28 [3445] bzvairsvr attrd: warning:
>     attrd_cib_callback: Update master-drbdserv=1 failed: Transport
>     endpoint is not connected
> Oct 14 17:42:59 [3445] bzvairsvr attrd: notice:
>     attrd_trigger_update: Sending flush op to all hosts for:
>     master-drbdserv (1)
> Oct 14 17:42:59 [3445] bzvairsvr attrd: notice:
>     attrd_perform_update: Sent update master-drbdserv=1 failed:
>     Transport endpoint is not connected
> Oct 14 17:42:59 [3445] bzvairsvr attrd: notice:
>     attrd_perform_update: Sent update -107: master-drbdserv=1
> Oct 14 17:42:59 [3445] bzvairsvr attrd: warning:
>     attrd_cib_callback: Update master-drbdserv=1 failed: Transport
>     endpoint is not connected
>
> And here is dmesg:
>
> [34067.547147] block drbd0: peer( Secondary -> Primary )
> [34091.023206] block drbd0: peer( Primary -> Secondary )
> [34096.616319] drbd server: peer( Secondary -> Unknown ) conn( Connected -> TearDown ) pdsk( UpToDate -> DUnknown )
> [34096.616353] drbd server: asender terminated
> [34096.616358] drbd server: Terminating drbd_a_server
> [34096.682874] drbd server: Connection closed
> [34096.682894] drbd server: conn( TearDown -> Unconnected )
> [34096.682897] drbd s
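[A note on the logs quoted above: the attrd "Transport endpoint is not
connected" errors mean the master score for drbdserv never reaches the CIB,
so Pacemaker has no basis to promote either node — which would explain the
permanent Slaves/Slaves state. A sketch of commands that could confirm
this, assuming the stock pacemaker/crmsh CLI tools on the nodes; run on a
cluster node, resource names taken from the thread:]

```shell
# Is a master score for drbdserv recorded in the CIB at all?
# If nothing comes back, attrd never managed to write one.
cibadmin --query | grep master-drbdserv

# Show the scores the policy engine is working with, including
# promotion scores, against the live cluster (-L) with scores (-s).
crm_simulate -sL | grep -i drbdserv

# Verify DRBD itself is healthy and connected outside Pacemaker's view.
cat /proc/drbd
```

These are read-only diagnostics; they require a running cluster, so treat
them as a sketch rather than something to paste blindly.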
Re: [ClusterLabs] Can't do anything right; how do I start over?
Greetings,

Heh. Well, the comment in corosync.conf makes sense to me now. Thanks,
I've fixed that. Here's my corosync.conf:

totem {
    version: 2
    crypto_cipher: none
    crypto_hash: none
    interface {
        ringnumber: 0
        bindnetaddr: 10.1.0.0
        mcastaddr: 239.255.1.1
        mcastport: 5405
        ttl: 1
    }
    cluster_name: pecan
}

logging {
    fileline: off
    to_stderr: no
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
    debug: off
    timestamp: on
    logger_subsys {
        subsys: QUORUM
        debug: off
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1
    wait_for_all: 1
}

service {
    name: pacemaker
    ver: 1
}

nodelist {
    node {
        ring0_addr: smoking
        nodeid: 1
    }
    node {
        ring0_addr: mars
        nodeid: 2
    }
}

And a few things are behaving better than they did before. At the moment
my goal is to set up a partition as drbd. In the interest of bandwidth I
will show the commands that I use and the result I finally get.

pcs cluster auth smoking mars
pcs property set stonith-enabled=true
stonith_admin --metadata --agent fence_pcmk
cibadmin -C -o resources --xml-file stonith.xml
pcs resource create floating_ip IPaddr2 ip=10.1.2.101 cidr_netmask=32
pcs resource defaults resource-stickiness=100

And at this point, all appears well. My pcs status output looks like I
think it should. Now, of course, I admit that setting up the floating_ip
is not relevant to my goal of a drbd-backed filesystem, but I've been
doing it as a sanity check.
On to drbd:

modprobe drbd
systemctl start drbd.service

[root@smoking cluster]# cat /proc/drbd
version: 8.4.8-1 (api:1/proto:86-101)
GIT-hash: 22b4c802192646e433d3f7399d578ec7fecc6272 build by mockbuild@, 2016-10-13 19:58:26
 0: cs:Connected ro:Secondary/Secondary ds:Diskless/Diskless C r-
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
 1: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-
    ns:0 nr:10574 dw:10574 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
 2: cs:Connected ro:Secondary/Secondary ds:Diskless/Diskless C r-
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

Again, this is stuff that hung around from the previous incarnation. But
it looks okay to me. I'm planning to use the '1' device. The above is run
on the secondary machine, so Secondary/Primary is correct. And
UpToDate/UpToDate looks right to me.

Now it goes south. The mkfs.xfs appears to work, but that's not relevant
anyway, right?

pcs resource create BravoSpace \
    ocf:linbit:drbd drbd_resource=bravo \
    op monitor interval=60s

[root@smoking ~]# pcs status
Cluster name: pecan
Last updated: Sat Oct 15 01:33:37 2016
Last change: Sat Oct 15 01:18:56 2016 by root via cibadmin on mars
Stack: corosync
Current DC: mars (version 1.1.13-10.el7_2.4-44eb2dd) - partition with quorum
2 nodes and 3 resources configured

Node mars: UNCLEAN (online)
Node smoking: UNCLEAN (online)

Full list of resources:

Fencing        (stonith:fence_pcmk):       Started mars
floating_ip    (ocf::heartbeat:IPaddr2):   Started mars
BravoSpace     (ocf::linbit:drbd):         FAILED [ smoking mars ]

Failed Actions:
* BravoSpace_stop_0 on smoking 'not configured' (6): call=18,
  status=complete, exitreason='none',
  last-rc-change='Sat Oct 15 01:18:56 2016', queued=0ms, exec=63ms
* BravoSpace_stop_0 on mars 'not configured' (6): call=18,
  status=complete, exitreason='none',
  last-rc-change='Sat Oct 15 01:18:56 2016', queued=0ms, exec=60ms

PCSD Status:
  smoking: Online
  mars: Online

Daemon Status:
  corosync: active/disabled
  pacemaker:
active/disabled
  pcsd: active/disabled

I've looked in /var/log/cluster/corosync.log and it doesn't seem happy,
but I don't know what I'm looking at. On the primary machine it's 1800+
lines; on the secondary it's 600+ lines. There are 337 lines just with
BravoSpace in them. One of them says:

drbd(BravoSpace)[3295]: 2016/10/15_01:18:56 ERROR: meta parameter
misconfigured, expected clone-max -le 2, but found unset.

But I tried adding clone-max=2 and the command barfed: that's not a
legal parameter. So, what's wrong? (I'm a newbie, of course.) I did a
pcs resource cleanup. That shut down fencing and the IP. I tried pcs
cluster start to get them back, no help. I did pcs cluster standby
smoking, and then unstandby smoking. The IP started, but fencing has
failed on BOTH machines. I can't see what I'm doing wrong. T
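[On the "expected clone-max -le 2, but found unset" error above: the
ocf:linbit:drbd agent is designed to run as a master/slave clone, not as a
plain primitive, and clone-max is clone metadata — which is why pcs rejects
it on resource create. A sketch of the likely fix with pcs 0.9.x syntax as
shipped on CentOS 7; the name BravoSpaceClone is made up here:]

```shell
# Wrap the existing BravoSpace primitive in a master/slave clone so the
# drbd agent sees clone-max/master-max metadata (names are examples).
pcs resource master BravoSpaceClone BravoSpace \
    master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
```

This requires a running cluster with the BravoSpace resource defined, so
it is a sketch of the shape of the command rather than a tested recipe.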
[ClusterLabs] Can't do anything right; how do I start over?
I've been trying a lot of things from the introductory manual. I have
updated the instructions (on my hardcopy) to the versions of corosync
etc. that I'm using. I can hardly get anything to work reliably beyond
the ClusterIP. So I start over; I had been reinstalling the machines,
but I've grown tired of that. So, before I start in on my other tales of
woe, I figured I should find out how to start over "according to Hoyle".

When I "start over" I stop all the services, delete the packages, and
empty the configs and logs as best I know how. But this doesn't
completely clear everything: the drbd metadata is evidently still on the
partitions I've set aside for it.

Oh, before I forget, in particular, in corosync.conf:

totem {
    interface {
        # This is normally the *network* address of the
        # interface to bind to. This ensures that you can use
        # identical instances of this configuration file
        # across all your cluster nodes, without having to
        # modify this option.
        bindnetaddr: 10.1.1.22
        [snip]
    }
}

bindnetaddr: I've tried using an address on ONE of the machines
(everywhere), and I've tried using an address that's on each
participating machine, thus a different corosync.conf file for each
machine (but otherwise identical). What's the right thing? From the
comment it seems that there should be one address used among all
machines. But I kept getting messages about addresses already in use, so
I thought I'd try to "fix" it.

This is my burn script. Am I missing something? Doing it wrong?
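[On the bindnetaddr question: the comment means the *network* (subnet)
address, not any host's address. For hosts 10.1.1.22/24 and 10.1.1.23/24
it would be 10.1.1.0 on both nodes, which is exactly why one identical
corosync.conf works everywhere; using a host address is what produces the
"address already in use" symptoms. A small sketch in plain bash — nothing
cluster-specific, the function name is made up — of how the network
address falls out of IP AND netmask:]

```shell
#!/bin/bash
# bindnetaddr is the network address: the interface IP bitwise-ANDed
# with the netmask. The same value is valid on every node in the subnet.
netaddr() {
    local IFS=.
    read -r i1 i2 i3 i4 <<<"$1"   # interface IP, e.g. 10.1.1.22
    read -r m1 m2 m3 m4 <<<"$2"   # netmask, e.g. 255.255.255.0
    echo "$((i1 & m1)).$((i2 & m2)).$((i3 & m3)).$((i4 & m4))"
}

netaddr 10.1.1.22 255.255.255.0   # -> 10.1.1.0
```

Run it for each node's IP; if they're on the same /24, every node gets
the same answer, and that's the one value to put in corosync.conf.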
#!/bin/bash
pkill -9 -f pacemaker
systemctl stop pacemaker.service
systemctl stop corosync.service
systemctl stop pcsd.service
drbdadm down alpha
drbdadm down bravo
drbdadm down delta
systemctl stop drbd.service
rpm -e drbd84-utils kmod-drbd84
rpm -e pcs
rpm -e pacemaker
rpm -e pacemaker-cluster-libs
rpm -e pacemaker-cli
rpm -e pacemaker-libs
rpm -e pacemaker-doc
rpm -e lvm2-cluster
rpm -e dlm
rpm -e corosynclib corosync
cd /var/lib/pacemaker
rm cib/*
rm pengine/*
cd
nullfile /var/log/cluster/corosync.conf

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
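[Two things the burn script above misses, sketched here with the resource
names from the script. First, the pacemaker/corosync configuration can be
wiped without uninstalling any packages. Second, drbd metadata lives
inside the backing partitions, so removing packages never touches it; it
has to be wiped per resource while drbd-utils is still installed — and
note drbdmeta asks for a typed confirmation before wiping:]

```shell
# Sanctioned "start over" for the cluster stack: stops the daemons and
# removes the cluster configuration on this node, packages untouched.
pcs cluster destroy          # or: pcs cluster destroy --all, once, from one node

# Wipe drbd metadata from the backing partitions, per resource,
# before removing the drbd packages.
for res in alpha bravo delta; do
    drbdadm down "$res"
    drbdadm wipe-md "$res"
done
```

This must run before the `rpm -e drbd84-utils` line, since it needs the
drbd tools; it requires real drbd resources, so treat it as a sketch.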
Re: [ClusterLabs] newbie questions
Hooray for me, but, how? I got about 3/4 of Digimer's list done and got
stuck. I did a pcs cluster status, and, behold, the cluster was up. I
pinged the ClusterIP and it answered. I didn't know what to do with the
'delay="x"' part; that's the thing I couldn't figure out. (I've been
assuming the delay part is a big deal.) However, there are more things
for me to read and more experiments for me to try, so I'm good for now.
Thanks to everyone for the prompt help.

j.

On Tue, May 31, 2016 at 5:22 PM, Ken Gaillot wrote:
> On 05/31/2016 03:59 PM, Jay Scott wrote:
> > Greetings,
> >
> > Cluster newbie
> > Centos 7
> > trying to follow the "Clusters from Scratch" intro.
> > 2 nodes (yeah, I know, but I'm just learning)
> >
> > [root@smoking ~]# pcs status
> > Cluster name:
> > Last updated: Tue May 31 15:32:18 2016
> > Last change: Tue May 31 15:02:21 2016 by root via cibadmin on smoking
> > Stack: unknown
>
> "Stack: unknown" is a big problem. The cluster isn't aware of the
> corosync configuration. Did you do the "pcs cluster setup" step?
>
> > Current DC: NONE
> > 2 nodes and 1 resource configured
> >
> > OFFLINE: [ mars smoking ]
> >
> > Full list of resources:
> >
> > ClusterIP    (ocf::heartbeat:IPaddr2):    Stopped
> >
> > PCSD Status:
> >   smoking: Online
> >   mars: Online
> >
> > Daemon Status:
> >   corosync: active/enabled
> >   pacemaker: active/enabled
> >   pcsd: active/enabled
> >
> > What concerns me at the moment: I did
> >     pcs resource enable ClusterIP
> > while simultaneously doing
> >     tail -f /var/log/cluster/corosync.log
> > (the only log in there)
>
> The system log (/var/log/messages or whatever your system has
> configured) is usually the best place to start. The cluster software
> sends messages of interest to end users there, and it includes messages
> from all components (corosync, pacemaker, resource agents, etc.).
>
> /var/log/cluster/corosync.log (and in some configurations,
> /var/log/pacemaker.log) have more detailed log information for debugging.
> > and nothing happens in the log, but the ClusterIP stays "Stopped".
> > Should I be able to ping that addr? I can't.
> > It also says OFFLINE: and both of my machines are offline,
> > though the PCSD says they're online. Which do I trust?
>
> The first online/offline output is most important, and refers to the
> node's status in the actual cluster; the "PCSD" online/offline output
> simply tells whether the pcs daemon is running. Typically, the pcs
> daemon is enabled at boot and is always running. The pcs daemon is not
> part of the clustering itself; it's a front end to configuring and
> administering the cluster.
>
> > [root@smoking ~]# pcs property show stonith-enabled
> > Cluster Properties:
> >   stonith-enabled: false
> >
> > yet I see entries in the corosync.log referring to stonith.
> > I'm guessing that's normal.
>
> Yes, you can enable stonith at any time, so the stonith daemon will
> still run, to stay aware of the cluster status.
>
> > My corosync.conf file says the quorum is off.
> >
> > I also don't know what to include in this for any of you to
> > help me debug.
> >
> > Ahh, also, is this considered "long", and if so, where would I post
> > to the web?
> >
> > thx.
> >
> > j.
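[For the record, the "pcs cluster setup" step Ken points to — the one
whose absence produces "Stack: unknown" — looks roughly like this on
CentOS 7 with pcs 0.9.x. Node and cluster names are taken from the
thread; it assumes the hacluster password has been set on both nodes:]

```shell
# Authenticate pcsd between the nodes, then generate corosync.conf and
# start the stack; afterwards pcs status should report "Stack: corosync".
pcs cluster auth smoking mars
pcs cluster setup --name pecan smoking mars
pcs cluster start --all
```

Run once, from either node; the setup step distributes the generated
corosync.conf to both machines. This needs real nodes, so it is a sketch.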
[ClusterLabs] newbie questions
Greetings,

Cluster newbie
Centos 7
trying to follow the "Clusters from Scratch" intro.
2 nodes (yeah, I know, but I'm just learning)

[root@smoking ~]# pcs status
Cluster name:
Last updated: Tue May 31 15:32:18 2016
Last change: Tue May 31 15:02:21 2016 by root via cibadmin on smoking
Stack: unknown
Current DC: NONE
2 nodes and 1 resource configured

OFFLINE: [ mars smoking ]

Full list of resources:

ClusterIP    (ocf::heartbeat:IPaddr2):    Stopped

PCSD Status:
  smoking: Online
  mars: Online

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

What concerns me at the moment: I did
    pcs resource enable ClusterIP
while simultaneously doing
    tail -f /var/log/cluster/corosync.log
(the only log in there)
and nothing happens in the log, but the ClusterIP stays "Stopped".
Should I be able to ping that addr? I can't.
It also says OFFLINE: and both of my machines are offline, though the
PCSD says they're online. Which do I trust?

[root@smoking ~]# pcs property show stonith-enabled
Cluster Properties:
  stonith-enabled: false

yet I see entries in the corosync.log referring to stonith. I'm guessing
that's normal.

My corosync.conf file says the quorum is off.

I also don't know what to include in this for any of you to help me
debug.

Ahh, also, is this considered "long", and if so, where would I post to
the web?

thx.

j.