Re: [ClusterLabs] Trouble with drbd/pacemaker: switch to secondary/secondary

2016-10-15 Thread Jay Scott
Yikes.  I don't have any suggestions.  This is beyond me.
Sorry.

J.

On Sat, Oct 15, 2016 at 4:48 AM, Anne Nicolas  wrote:

> Anne
> http://mageia.org
>
> On 15 Oct 2016, 9:02 AM, "Jay Scott"  wrote:
> >
> >
> > Well, I'm a newbie myself.  But this:
> > drbdadm primary --force ___the name of the drbd res___
> > has worked for me.  But I'm having lots of trouble myself,
> > so...
> > then there's this:
> > drbdadm -- --overwrite-data-of-peer primary bravo
> > (bravo happens to be my drbd res) and that should also
> > strongarm one machine or another to be the primary.
> >
>
> Well, I used those commands and it goes to primary, but then I can see
> Pacemaker switching it back to secondary after a few seconds.
>
> > j.
> >
> > On Fri, Oct 14, 2016 at 3:22 PM, Anne Nicolas  wrote:
> >>
> >> Hi!
> >>
> >> I'm having trouble with a 2-node cluster used for DRBD / Apache / Samba
> >> and some other services.
> >>
> >> Whatever I do, it always goes to the following state:
> >>
> >> Last updated: Fri Oct 14 17:41:38 2016
> >> Last change: Thu Oct 13 10:42:29 2016 via cibadmin on bzvairsvr
> >> Stack: corosync
> >> Current DC: bzvairsvr (168430081) - partition with quorum
> >> Version: 1.1.8-9.mga5-394e906
> >> 2 Nodes configured, unknown expected votes
> >> 13 Resources configured.
> >>
> >>
> >> Online: [ bzvairsvr bzvairsvr2 ]
> >>
> >>  Master/Slave Set: drbdservClone [drbdserv]
> >>  Slaves: [ bzvairsvr bzvairsvr2 ]
> >>  Clone Set: fencing [st-ssh]
> >>  Started: [ bzvairsvr bzvairsvr2 ]
> >>
> >> When I reboot bzvairsvr2, it goes primary again, but after a while it
> >> becomes secondary as well.
> >> I use a very basic fencing system based on ssh. It's not optimal but
> >> enough for the current tests.
> >>
> >> Here is some information about the configuration:
> >>
> >> node 168430081: bzvairsvr
> >> node 168430082: bzvairsvr2
> >> primitive apache apache \
> >> params configfile="/etc/httpd/conf/httpd.conf" \
> >> op start interval=0 timeout=120s \
> >> op stop interval=0 timeout=120s
> >> primitive clusterip IPaddr2 \
> >> params ip=192.168.100.1 cidr_netmask=24 nic=eno1 \
> >> meta target-role=Started
> >> primitive clusterroute Route \
> >> params destination="0.0.0.0/0" gateway=192.168.100.254
> >> primitive drbdserv ocf:linbit:drbd \
> >> params drbd_resource=server \
> >> op monitor interval=30s role=Slave \
> >> op monitor interval=29s role=Master start-delay=30s
> >> primitive fsserv Filesystem \
> >> params device="/dev/drbd/by-res/server" directory="/Server"
> >> fstype=ext4 \
> >> op start interval=0 timeout=60s \
> >> op stop interval=0 timeout=60s \
> >> meta target-role=Started
> >> primitive libvirt-guests systemd:libvirt-guests
> >> primitive libvirtd systemd:libvirtd
> >> primitive mysql systemd:mysqld
> >> primitive named systemd:named
> >> primitive samba systemd:smb
> >> primitive st-ssh stonith:external/ssh \
> >> params hostlist="bzvairsvr bzvairsvr2"
> >> group iphd clusterip clusterroute \
> >> meta target-role=Started
> >> group services libvirtd libvirt-guests apache named mysql samba \
> >> meta target-role=Started
> >> ms drbdservClone drbdserv \
> >> meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1
> >> notify=true target-role=Started
> >> clone fencing st-ssh
> >> colocation fs_on_drbd inf: fsserv drbdservClone:Master
> >> colocation iphd_on_services inf: iphd services
> >> colocation services_on_fsserv inf: services fsserv
> >> order fsserv-after-drbdserv inf: drbdservClone:promote fsserv:start
> >> order services_after_fsserv inf: fsserv services
> >> property cib-bootstrap-options: \
> >> dc-version=1.1.8-9.mga5-394e906 \
> >> cluster-infrastructure=corosync \
> >> no-quorum-policy=ignore \
> >> stonith-enabled=true \
> >>
> >> The cluster logs are flooded with:
> >> Oct 14 17:42:28 [3445] bzvairsvr  attrd:   notice:
> >> attrd_trigger_update:Sending flush op to all hosts for:
> >> master-drbdserv (1)

Re: [ClusterLabs] Trouble with drbd/pacemaker: switch to secondary/secondary

2016-10-15 Thread Jay Scott
Well, I'm a newbie myself.  But this:
drbdadm primary --force ___the name of the drbd res___
has worked for me.  But I'm having lots of trouble myself,
so...
then there's this:
drbdadm -- --overwrite-data-of-peer primary bravo
(bravo happens to be my drbd res) and that should also
strongarm one machine or another to be the primary.
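
Either way, it's worth checking what DRBD itself reports afterwards --
if pacemaker keeps demoting it, DRBD will show the flip.  These
subcommands have been in drbdadm for ages (bravo is just my resource
name; substitute yours):

drbdadm role bravo     # local/peer roles, e.g. Primary/Secondary
drbdadm dstate bravo   # disk states, e.g. UpToDate/UpToDate
cat /proc/drbd         # full kernel-side status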

j.

On Fri, Oct 14, 2016 at 3:22 PM, Anne Nicolas  wrote:

> Hi!
>
> I'm having trouble with a 2-node cluster used for DRBD / Apache / Samba
> and some other services.
>
> Whatever I do, it always goes to the following state:
>
> Last updated: Fri Oct 14 17:41:38 2016
> Last change: Thu Oct 13 10:42:29 2016 via cibadmin on bzvairsvr
> Stack: corosync
> Current DC: bzvairsvr (168430081) - partition with quorum
> Version: 1.1.8-9.mga5-394e906
> 2 Nodes configured, unknown expected votes
> 13 Resources configured.
>
>
> Online: [ bzvairsvr bzvairsvr2 ]
>
>  Master/Slave Set: drbdservClone [drbdserv]
>  Slaves: [ bzvairsvr bzvairsvr2 ]
>  Clone Set: fencing [st-ssh]
>  Started: [ bzvairsvr bzvairsvr2 ]
>
> When I reboot bzvairsvr2, it goes primary again, but after a while it
> becomes secondary as well.
> I use a very basic fencing system based on ssh. It's not optimal but
> enough for the current tests.
>
> Here is some information about the configuration:
>
> node 168430081: bzvairsvr
> node 168430082: bzvairsvr2
> primitive apache apache \
> params configfile="/etc/httpd/conf/httpd.conf" \
> op start interval=0 timeout=120s \
> op stop interval=0 timeout=120s
> primitive clusterip IPaddr2 \
> params ip=192.168.100.1 cidr_netmask=24 nic=eno1 \
> meta target-role=Started
> primitive clusterroute Route \
> params destination="0.0.0.0/0" gateway=192.168.100.254
> primitive drbdserv ocf:linbit:drbd \
> params drbd_resource=server \
> op monitor interval=30s role=Slave \
> op monitor interval=29s role=Master start-delay=30s
> primitive fsserv Filesystem \
> params device="/dev/drbd/by-res/server" directory="/Server"
> fstype=ext4 \
> op start interval=0 timeout=60s \
> op stop interval=0 timeout=60s \
> meta target-role=Started
> primitive libvirt-guests systemd:libvirt-guests
> primitive libvirtd systemd:libvirtd
> primitive mysql systemd:mysqld
> primitive named systemd:named
> primitive samba systemd:smb
> primitive st-ssh stonith:external/ssh \
> params hostlist="bzvairsvr bzvairsvr2"
> group iphd clusterip clusterroute \
> meta target-role=Started
> group services libvirtd libvirt-guests apache named mysql samba \
> meta target-role=Started
> ms drbdservClone drbdserv \
> meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1
> notify=true target-role=Started
> clone fencing st-ssh
> colocation fs_on_drbd inf: fsserv drbdservClone:Master
> colocation iphd_on_services inf: iphd services
> colocation services_on_fsserv inf: services fsserv
> order fsserv-after-drbdserv inf: drbdservClone:promote fsserv:start
> order services_after_fsserv inf: fsserv services
> property cib-bootstrap-options: \
> dc-version=1.1.8-9.mga5-394e906 \
> cluster-infrastructure=corosync \
> no-quorum-policy=ignore \
> stonith-enabled=true \
>
> The cluster logs are flooded with:
> Oct 14 17:42:28 [3445] bzvairsvr  attrd:   notice:
> attrd_trigger_update:Sending flush op to all hosts for:
> master-drbdserv (1)
> Oct 14 17:42:28 [3445] bzvairsvr  attrd:   notice:
> attrd_perform_update:Sent update master-drbdserv=1 failed:
> Transport endpoint is not connected
> Oct 14 17:42:28 [3445] bzvairsvr  attrd:   notice:
> attrd_perform_update:Sent update -107: master-drbdserv=1
> Oct 14 17:42:28 [3445] bzvairsvr  attrd:  warning:
> attrd_cib_callback:  Update master-drbdserv=1 failed: Transport
> endpoint is not connected
> Oct 14 17:42:59 [3445] bzvairsvr  attrd:   notice:
> attrd_trigger_update:Sending flush op to all hosts for:
> master-drbdserv (1)
> Oct 14 17:42:59 [3445] bzvairsvr  attrd:   notice:
> attrd_perform_update:Sent update master-drbdserv=1 failed:
> Transport endpoint is not connected
> Oct 14 17:42:59 [3445] bzvairsvr  attrd:   notice:
> attrd_perform_update:Sent update -107: master-drbdserv=1
> Oct 14 17:42:59 [3445] bzvairsvr  attrd:  warning:
> attrd_cib_callback:  Update master-drbdserv=1 failed: Transport
> endpoint is not connected
>
>
> And here is dmesg
>
> [34067.547147] block drbd0: peer( Secondary -> Primary )
> [34091.023206] block drbd0: peer( Primary -> Secondary )
> [34096.616319] drbd server: peer( Secondary -> Unknown ) conn( Connected
> -> TearDown ) pdsk( UpToDate -> DUnknown )
> [34096.616353] drbd server: asender terminated
> [34096.616358] drbd server: Terminating drbd_a_server
> [34096.682874] drbd server: Connection closed
> [34096.682894] drbd server: conn( TearDown -> Unconnected )
> [34096.682897] drbd s

Re: [ClusterLabs] Can't do anything right; how do I start over?

2016-10-15 Thread Jay Scott
Greetings,

Heh.  Well, the comment in corosync.conf makes sense to me now.
Thanks, I've fixed that.


Here's my corosync.conf

totem {
    version: 2

    crypto_cipher: none
    crypto_hash: none

    interface {
        ringnumber: 0
        bindnetaddr: 10.1.0.0
        mcastaddr: 239.255.1.1
        mcastport: 5405
        ttl: 1
    }
    cluster_name: pecan
}

logging {
    fileline: off
    to_stderr: no
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
    debug: off
    timestamp: on
    logger_subsys {
        subsys: QUORUM
        debug: off
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1
    wait_for_all: 1
}

service {
    name: pacemaker
    ver: 1
}

nodelist {
    node {
        ring0_addr: smoking
        nodeid: 1
    }
    node {
        ring0_addr: mars
        nodeid: 2
    }
}
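
A quick sanity check that corosync itself is healthy with this config
(both commands ship with corosync 2.x and should show the two nodes
joined; nothing here is assumed beyond the conf above):

corosync-cfgtool -s               # ring status on this node
corosync-cmapctl | grep members   # current membership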


And a few things are behaving better than they did before.

At the moment my goal is to set up a partition as drbd.
In the interest of bandwidth I will show the commands that
I use and the result I finally get.


pcs cluster auth smoking mars
pcs property set stonith-enabled=true
stonith_admin --metadata --agent fence_pcmk
cibadmin -C -o resources --xml-file stonith.xml
pcs resource create floating_ip IPaddr2 ip=10.1.2.101 cidr_netmask=32
pcs resource defaults resource-stickiness=100


And at this point, all appears well.  My pcs status output looks like
I think it should.

Now, of course, I admit that setting up the floating_ip is
not relevant to my goal of a drbd backed filesystem, but I've been
doing it as a sanity check.

On to drbd

modprobe drbd
systemctl start drbd.service
[root@smoking cluster]#  cat /proc/drbd
version: 8.4.8-1 (api:1/proto:86-101)
GIT-hash: 22b4c802192646e433d3f7399d578ec7fecc6272 build by mockbuild@, 2016-10-13 19:58:26
 0: cs:Connected ro:Secondary/Secondary ds:Diskless/Diskless C r-
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
 1: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-
ns:0 nr:10574 dw:10574 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f
oos:0
 2: cs:Connected ro:Secondary/Secondary ds:Diskless/Diskless C r-
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

Again, this is stuff that hung around from the previous incarnation.
But it looks okay to me.  I'm planning to use the '1' device.
The above is run on the secondary machine, so Secondary/Primary is
correct.  And UpToDate/UpToDate looks right to me.
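
One thing that stands out in that /proc/drbd output, though: devices 0
and 2 are ds:Diskless/Diskless, meaning no backing disk is attached on
either node.  If those resources are supposed to have local storage,
something like this should reattach them (guessing the names alpha and
delta from the burn script below; only bravo, device 1, looks healthy):

drbdadm attach alpha
drbdadm attach delta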

Now it goes south.  The mkfs.xfs appears to work, but that's not
relevant anyway, right?

pcs  resource create BravoSpace \
  ocf:linbit:drbd drbd_resource=bravo \
  op monitor interval=60s

[root@smoking ~]# pcs status
Cluster name: pecan
Last updated: Sat Oct 15 01:33:37 2016
Last change: Sat Oct 15 01:18:56 2016 by root via cibadmin on mars
Stack: corosync
Current DC: mars (version 1.1.13-10.el7_2.4-44eb2dd) - partition with quorum
2 nodes and 3 resources configured

Node mars: UNCLEAN (online)
Node smoking: UNCLEAN (online)

Full list of resources:

 Fencing        (stonith:fence_pcmk):     Started mars
 floating_ip    (ocf::heartbeat:IPaddr2): Started mars
 BravoSpace     (ocf::linbit:drbd):       FAILED [ smoking mars ]

Failed Actions:
* BravoSpace_stop_0 on smoking 'not configured' (6): call=18,
status=complete, exitreason='none',
last-rc-change='Sat Oct 15 01:18:56 2016', queued=0ms, exec=63ms
* BravoSpace_stop_0 on mars 'not configured' (6): call=18,
status=complete, exitreason='none',
last-rc-change='Sat Oct 15 01:18:56 2016', queued=0ms, exec=60ms


PCSD Status:
  smoking: Online
  mars: Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/disabled

I've looked in /var/log/cluster/corosync.log and it doesn't seem
happy, but I don't know what I'm looking at.  On the primary
machine it's 1800+ lines; on the secondary it's 600+ lines.
There are 337 lines just with BravoSpace in them.
One of them says:
drbd(BravoSpace)[3295]:    2016/10/15_01:18:56 ERROR: meta parameter
misconfigured, expected clone-max -le 2, but found unset.
I tried adding clone-max=2, but the command barfed -- that's not a legal
parameter.

So, what's wrong?  (I'm a newbie, of course.)

I did a pcs resource cleanup.  That shut down fencing and the IP.
I tried pcs cluster start to get them back; no help.
I did pcs cluster standby smoking, and then unstandby smoking.
The IP started, but fencing has failed on BOTH machines.
I can't see what I'm doing wrong.
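
For what it's worth, that "meta parameter misconfigured" ERROR is the
linbit agent refusing to run as a plain primitive: ocf:linbit:drbd
expects to live inside a master/slave resource, and clone-max is a meta
attribute of that wrapper, not a parameter of the primitive -- which is
why pcs rejected it on the create.  A sketch of the shape it wants, in
the pcs master/slave syntax of this pacemaker generation (the Clone
name is just an example):

pcs resource create BravoSpace ocf:linbit:drbd \
  drbd_resource=bravo op monitor interval=60s
pcs resource master BravoSpaceClone BravoSpace \
  master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 \
  notify=true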

T

[ClusterLabs] Can't do anything right; how do I start over?

2016-10-14 Thread Jay Scott
I've been trying a lot of things from the introductory manual.
I have updated the instructions (on my hardcopy) to the versions
of corosync etc. that I'm using.  I can hardly get anything to
work reliably beyond the ClusterIP.

So I start over -- I had been reinstalling the machines but I've
grown tired of that.  So, before I start in on my other tales of woe,
I figured I should find out how to start over "according to Hoyle".

When I "start over" I stop all the services, delete the packages,
empty the configs and logs as best I know how.  But this doesn't
completely clear everything:  the drbd metadata is evidently still
on the partitions I've set aside for it.



Oh, before I forget, in particular:
in corosync.conf:
totem {
interface {
# This is normally the *network* address of the
# interface to bind to. This ensures that you can use
# identical instances of this configuration file
# across all your cluster nodes, without having to
# modify this option.
bindnetaddr: 10.1.1.22
[snip]
}
}
bindnetaddr:  I've tried using an address on ONE of the machines
(everywhere),
and I've tried using an address that's on each participating machine,
thus a diff corosync.conf file for each machine (but otherwise identical).
What's the right thing?  From the comment it seems that there should
be one address used among all machines.  But I kept getting messages
about addresses already in use, so I thought I'd try to "fix" it.
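
As the comment says, bindnetaddr wants the *network* address, not a
host address, so the same value works unchanged on every node.
Assuming these machines sit on a /24, the snippet's 10.1.1.22 would
translate to:

bindnetaddr: 10.1.1.0

(which is also the idea behind the 10.1.0.0 in the conf that behaves
better above -- same rule, wider network).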

This is my burn script.
Am I missing something?  Doing it wrong?

#!/bin/bash
pkill -9 -f pacemaker
systemctl stop pacemaker.service
systemctl stop corosync.service
systemctl stop pcsd.service
drbdadm down alpha
drbdadm down bravo
drbdadm down delta
systemctl stop drbd.service

rpm -e drbd84-utils kmod-drbd84
rpm -e pcs
rpm -e pacemaker
rpm -e pacemaker-cluster-libs
rpm -e pacemaker-cli
rpm -e pacemaker-libs
rpm -e pacemaker-doc
rpm -e lvm2-cluster
rpm -e dlm
rpm -e corosynclib corosync
cd /var/lib/pacemaker
rm cib/*
rm pengine/*
cd
nullfile /var/log/cluster/corosync.log
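
One likely gap: removing the packages never touches the DRBD metadata
on the backing partitions, which would explain it surviving a
reinstall.  Something like this, run before the rpm -e lines while
drbdadm is still installed, should clear it (wipe-md is a standard
drbdadm subcommand; it prompts before destroying anything):

drbdadm wipe-md alpha
drbdadm wipe-md bravo
drbdadm wipe-md delta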


Re: [ClusterLabs] newbie questions

2016-05-31 Thread Jay Scott
hooray for me, but, how?

I got about 3/4 of Digimer's list done and got stuck.
I did a pcs cluster status, and, behold, the cluster was up.
I pinged the ClusterIP and it answered.  I didn't know what
to do with the 'delay="x"' part; that's the thing I couldn't figure
out.  (I've been assuming the delay part is a big deal.)
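
As far as I can tell, delay is a parameter of the fence device itself;
the usual two-node trick is to give one node's device a head start so
the two nodes can't shoot each other at the same moment.  A sketch,
assuming a stonith resource named Fencing and an agent that honors the
standard delay parameter:

pcs stonith update Fencing delay=15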

However, there are more things for me to read and more experiments
for me to try so I'm good for now.

Thanks to everyone for the prompt help.

j.

On Tue, May 31, 2016 at 5:22 PM, Ken Gaillot  wrote:

> On 05/31/2016 03:59 PM, Jay Scott wrote:
> > Greetings,
> >
> > Cluster newbie
> > Centos 7
> > trying to follow the "Clusters from Scratch" intro.
> > 2 nodes (yeah, I know, but I'm just learning)
> > 
> > [root@smoking ~]# pcs status
> > Cluster name:
> > Last updated: Tue May 31 15:32:18 2016
> > Last change: Tue May 31 15:02:21 2016 by root via cibadmin on smoking
> > Stack: unknown
>
> "Stack: unknown" is a big problem. The cluster isn't aware of the
> corosync configuration. Did you do the "pcs cluster setup" step?
>
> > Current DC: NONE
> > 2 nodes and 1 resource configured
> >
> > OFFLINE: [ mars smoking ]
> >
> > Full list of resources:
> >
> >  ClusterIP    (ocf::heartbeat:IPaddr2):    Stopped
> >
> > PCSD Status:
> >   smoking: Online
> >   mars: Online
> >
> > Daemon Status:
> >   corosync: active/enabled
> >   pacemaker: active/enabled
> >   pcsd: active/enabled
> > 
> >
> > What concerns me at the moment:
> > I did
> > pcs resource enable ClusterIP
> > while simultaneously doing
> > tail -f /var/log/cluster/corosync.log
> > (the only log in there)
>
> The system log (/var/log/messages or whatever your system has
> configured) is usually the best place to start. The cluster software
> sends messages of interest to end users there, and it includes messages
> from all components (corosync, pacemaker, resource agents, etc.).
>
> /var/log/cluster/corosync.log (and in some configurations,
> /var/log/pacemaker.log) have more detailed log information for debugging.
>
> > and nothing happens in the log, but the ClusterIP
> > stays "Stopped".  Should I be able to ping that addr?
> > I can't.
> > It also says OFFLINE:  and both of my machines are offline,
> > though the PCSD says they're online.  Which do I trust?
>
> The first online/offline output is most important, and refers to the
> node's status in the actual cluster; the "PCSD" online/offline output
> simply tells whether the pcs daemon is running. Typically, the pcs
> daemon is enabled at boot and is always running. The pcs daemon is not
> part of the clustering itself; it's a front end to configuring and
> administering the cluster.
>
> > [root@smoking ~]# pcs property show stonith-enabled
> > Cluster Properties:
> >  stonith-enabled: false
> >
> > yet I see entries in the corosync.log referring to stonith.
> > I'm guessing that's normal.
>
> Yes, you can enable stonith at any time, so the stonith daemon will
> still run, to stay aware of the cluster status.
>
> > My corosync.conf file says the quorum is off.
> >
> > I also don't know what to include in this for any of you to
> > help me debug.
> >
> > Ahh, also, is this considered "long", and if so, where would I post
> > to the web?
> >
> > thx.
> >
> > j.
>


[ClusterLabs] newbie questions

2016-05-31 Thread Jay Scott
Greetings,

Cluster newbie
Centos 7
trying to follow the "Clusters from Scratch" intro.
2 nodes (yeah, I know, but I'm just learning)

[root@smoking ~]# pcs status
Cluster name:
Last updated: Tue May 31 15:32:18 2016
Last change: Tue May 31 15:02:21 2016 by root via cibadmin on smoking
Stack: unknown
Current DC: NONE
2 nodes and 1 resource configured

OFFLINE: [ mars smoking ]

Full list of resources:

 ClusterIP    (ocf::heartbeat:IPaddr2):    Stopped

PCSD Status:
  smoking: Online
  mars: Online

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled


What concerns me at the moment:
I did
pcs resource enable ClusterIP
while simultaneously doing
tail -f /var/log/cluster/corosync.log
(the only log in there)
and nothing happens in the log, but the ClusterIP
stays "Stopped".  Should I be able to ping that addr?
I can't.
It also says OFFLINE:  and both of my machines are offline,
though the PCSD says they're online.  Which do I trust?
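
(The empty "Cluster name:" and "Stack: unknown" above are usually signs
that the corosync config was never generated, i.e. the pcs cluster
setup step was skipped.  Roughly, on this platform -- node names as in
the status output, and the cluster name is just an example, matching
the cluster_name used elsewhere in these threads:

pcs cluster setup --name pecan smoking mars
pcs cluster start --all )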

[root@smoking ~]# pcs property show stonith-enabled
Cluster Properties:
 stonith-enabled: false

yet I see entries in the corosync.log referring to stonith.
I'm guessing that's normal.

My corosync.conf file says the quorum is off.

I also don't know what to include in this for any of you to
help me debug.

Ahh, also, is this considered "long", and if so, where would I post
to the web?

thx.

j.
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org