Re: [ClusterLabs] DRBD split-brain investigations, automatic fixes and manual intervention...

2021-10-21 Thread Ian Diddams via Users
 

On Wednesday, 20 October 2021, 18:08:50 BST, Andrei Borzenkov wrote:
It depends on what hardware you have. For physical systems, IPMI may be
available, or managed power outlets; both allow cutting power to the other
node over the LAN.

For virtual machines you may use a fencing agent that contacts the hypervisor
or a management instance (such as vCenter).

There is also SBD, which requires shared storage; a third node with an iSCSI
target can provide it.

Qdevice with a watchdog may be an option.

Personally I prefer SBD, which is the most hardware-agnostic solution.


Thanks again.  As I feared, these are whole new areas that I've never even heard
of, so this is going to be a long road.

In the meantime, while I try to find out what does what, and how to implement
it, if anybody (based on the previous info provided) has a list of "do this:

* run A
* run B
* run C"

to get this implementation covered, I'd be grateful.
cheers
ian
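
Not a definitive recipe, but as a rough illustration of the shape such a list
could take for IPMI-based fencing (node names match the cluster discussed later
in this archive; the IPMI addresses and credentials are placeholders):

# one stonith device per node, each pointing at that node's IPMI/iLO interface
pcs stonith create fence_estrela fence_ipmilan pcmk_host_list="estrela" \
    ipaddr="192.0.2.11" login="admin" passwd="secret" lanplus=1 \
    op monitor interval=60s
pcs stonith create fence_rafeiro fence_ipmilan pcmk_host_list="rafeiro" \
    ipaddr="192.0.2.12" login="admin" passwd="secret" lanplus=1 \
    op monitor interval=60s

# keep each fence device off the node it is meant to kill
pcs constraint location fence_estrela avoids estrela
pcs constraint location fence_rafeiro avoids rafeiro

# only once test fencing has been seen to work:
pcs property set stonith-enabled=true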


Re: [ClusterLabs] DRBD split-brain investigations, automatic fixes and manual intervention...

2021-10-20 Thread Ian Diddams via Users
 

On Wednesday, 20 October 2021, 11:15:48 BST, Andrei Borzenkov wrote:

>You cannot resolve split brain without fencing. This is as simple as
>that. Your pacemaker configuration (from another mail) shows

> pcs -f clust_cfg property set stonith-enabled=false
> pcs -f clust_cfg property set no-quorum-policy=ignore

>This is a recipe for data corruption.
Thanks Andrei - I appreciate your feedback.  So the obvious question from this
bloke who's trying to get up to speed, etc., is

...  how do I set up fencing to

* avoid data corruption
* enable automatic resolution of split-brains?

cheers
ian
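
For what it's worth, once a working fence device exists, the counterpart of the
two pcs lines quoted above would be something along these lines (a sketch, not
a drop-in fix):

pcs property set stonith-enabled=true
# with fencing in place, stop ignoring quorum loss; a 2-node corosync cluster
# normally relies on "two_node: 1" in corosync.conf rather than on ignore
pcs property set no-quorum-policy=stop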



Re: [ClusterLabs] Antw: [EXT] DRBD split‑brain investigations, automatic fixes and manual intervention...

2021-10-20 Thread Ian Diddams via Users
FWIW, here is the basis for my implementation, being the "best" and most easily
followed DRBD/clustering guide/explanation I could find when I searched:
Lisenet.com :: Linux | Security | Networking | Admin Blog

cheers
ian


Re: [ClusterLabs] Antw: [EXT] DRBD split‑brain investigations, automatic fixes and manual intervention...

2021-10-20 Thread Ian Diddams via Users
 

  
>>So you drive without safety-belt and airbag (read: fencing)?

possibly?  probably?

As I said, I'm flying blind with all this - I was asked to try and implement it,
I've tried the best I can to implement it, but for all I know the how-tos and
advice I found have missed what may be needed.

I've looked for online tutorials but have failed to come up with anything much
aside from "do these commands and there you have it", which may not include
more belts and braces.

>>
You're asking the wrong person.  I'm given two systems whose disks are on a
SAN.  Mine is not to reason why, etc.
>> I wondered where the cluster is in those logs.
Sorry - I've not understood the question here.


Happy to provide extracts from logs etc.  Below I've appended the "set up"
commands/steps used to implement DRBD + pcs + corosync on the systems, if that
helps outline things any more.

I'm just the guy that fires the bullets, in effect, trying to aim as best he
can...

ian

# prep
umount /var/lib/mysql
  - and remove /var/lib/mysql from /etc/fstab
yum remove mysql-community-server
cd /var/lib/mysql; rm -rf *
mkdir /var/lib/mysql
chown mysql:mysql /var/lib/mysql
chmod 755 /var/lib/mysql
reboot

yum makecache fast
yum -y install wget mlocate telnet lsof
updatedb


# Install Pacemaker and Corosync
yum install -y pcs
yum install -y policycoreutils-python
echo "passwd" | passwd hacluster --stdin
systemctl start pcsd.service
systemctl enable pcsd.service

#Configure Corosync
#[estrela]
pcs cluster auth estrela rafeiro -u hacluster -p passwd
pcs cluster setup --name mysql_cluster estrela rafeiro
pcs cluster start --all

## Install DRBD
## BOTH

yum install -y kmod-drbd90
yum install -y drbd90-utils
systemctl enable corosync
systemctl enable pacemaker
reboot

# when back up
modprobe drbd
systemctl status pcsd
systemctl status corosync
systemctl status pacemaker

cat << EOL >/etc/drbd.d/mysql01.res
resource mysql01 {
 protocol C;
 meta-disk internal;
 device /dev/drbd0;
 disk   /dev/vg_mysql/lv_mysql;
 handlers {
  split-brain "/usr/lib/drbd/notify-split-brain.sh root";
 }
 net {
  allow-two-primaries no;
  # automatic split-brain recovery policies:
  after-sb-0pri discard-zero-changes;  # no primaries: if only one side changed data, sync from it
  after-sb-1pri discard-secondary;     # one primary: discard the secondary's changes
  after-sb-2pri disconnect;            # two primaries: do not auto-resolve, drop the connection
  rr-conflict disconnect;
 }
 disk {
  on-io-error detach;
 }
 syncer {
  verify-alg sha1;
 }
 on estrela {
  address  10.108.248.165:7789;
 }
 on rafeiro {
  address  10.108.248.166:7789;
 }
}
EOL

# clear previously created filesystem
dd if=/dev/zero of=/dev/vg_mysql/lv_mysql bs=1M count=128   


drbdadm create-md mysql01
systemctl start drbd
systemctl enable drbd
systemctl status drbd

#[estrela]
drbdadm primary --force mysql01
estrela: cat /sys/kernel/debug/drbd/resources/mysql01/connections/rafeiro/0/proc_drbd
rafeiro:  cat /sys/kernel/debug/drbd/resources/mysql01/connections/estrela/0/proc_drbd
drbdadm status

# WAIT UNTIL DRBD IS SYNCED

#[estrela]
mkfs.xfs -f  -L drbd /dev/drbd0
mount /dev/drbd0 /mnt


## INSTALL MYSQL on all
## BOTH
yum install mysql-server -y

# [estrela]
mysql_install_db --datadir=/mnt --user=mysql
systemctl stop mysqld
umount /mnt

#BOTH  
cp -p /etc/my.cnf /etc/my.cnf.ORIG
# set up my.cnf as needed (migrated from the existing mysql server)

#BOTH
mv /var/lib/mysql /var/lib/mysql.orig
mkdir /var/lib/mysql
chown mysql:mysql /var/lib/mysql
chmod 751 /var/lib/mysql
mkdir  /var/lib/mysql/innodb
chown mysql:mysql /var/lib/mysql/innodb
chmod 755 /var/lib/mysql/innodb


# estrela
mount /dev/drbd0 /var/lib/mysql
systemctl start mysqld

# set up mysql
grep 'temporary password' /var/log/mysqld.log
mysql_secure_installation
rm /root/.mysql_secret


# set up grants

flush privileges;

# test grants
[estrela]# mysql -uroot --skip-column-names -A -e"SELECT CONCAT('SHOW GRANTS FOR ''',user,'''@''',host,''';') FROM mysql.user WHERE user<>''" | mysql -uroot --skip-column-names -A | sed 's/$/;/g'
[rafeiro]# mysql -h estrela -uroot --skip-column-names -A -e"SELECT CONCAT('SHOW GRANTS FOR ''',user,'''@''',host,''';') FROM mysql.user WHERE user<>''" | mysql -hestrela -uroot --skip-column-names -A | sed 's/$/;/g'
mysql -h mysqldbdynabookHA -uroot --skip-column-names -A -e"SELECT CONCAT('SHOW GRANTS FOR ''',user,'''@''',host,''';') FROM mysql.user WHERE user<>''" | mysql -hestrela -uroot --skip-column-names -A | sed 's/$/;/g'
# stop test_200

# [estrela]
systemctl stop mysqld
umount /var/lib/mysql

# snapshot servers - pre-clustered

# Configure Pacemaker Cluster
# [estrela]
pcs cluster cib clust_cfg


pcs -f clust_cfg property set stonith-enabled=false
pcs -f clust_cfg property set no-quorum-policy=ignore
pcs -f clust_cfg resource defaults resource-stickiness=200
pcs -f clust_cfg resource create mysql_data01 ocf:linbit:drbd \
    drbd_resource=mysql01 op monitor interval=30s
pcs -f clust_cfg resource master MySQLClone01 mysql_data01 master-max=1 \
    master-node-max=1 clone-max=2 clone-node-max=1 notify=true
#
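
For completeness, the remainder of this style of configuration typically looks
something like the following.  This is a hedged sketch only, not the poster's
actual config: the resource names are guesses based on names seen elsewhere in
this archive, and the VIP address, mount and mysql paths are placeholders.

pcs -f clust_cfg resource create mysql_fs01 ocf:heartbeat:Filesystem \
    device="/dev/drbd0" directory="/var/lib/mysql" fstype="xfs"
pcs -f clust_cfg resource create mysql_service01 ocf:heartbeat:mysql \
    binary="/usr/bin/mysqld_safe" config="/etc/my.cnf" datadir="/var/lib/mysql" \
    pid="/var/run/mysqld/mysqld.pid" socket="/var/run/mysqld/mysqld.sock" \
    op start timeout=60s op stop timeout=60s op monitor interval=20s timeout=30s
pcs -f clust_cfg resource create mysql_VIP01 ocf:heartbeat:IPaddr2 \
    ip=192.0.2.100 cidr_netmask=32 op monitor interval=30s
pcs -f clust_cfg constraint colocation add mysql_fs01 with MySQLClone01 \
    INFINITY with-rsc-role=Master
pcs -f clust_cfg constraint order promote MySQLClone01 then start mysql_fs01
pcs -f clust_cfg constraint colocation add mysql_service01 with mysql_fs01 INFINITY
pcs -f clust_cfg constraint order mysql_fs01 then mysql_service01
pcs -f clust_cfg constraint colocation add mysql_VIP01 with mysql_service01 INFINITY
pcs -f clust_cfg constraint order mysql_service01 then mysql_VIP01
pcs cluster cib-push clust_cfg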

[ClusterLabs] DRBD split-brain investigations, automatic fixes and manual intervention...

2021-10-20 Thread Ian Diddams via Users
I've been testing an implementation of an HA mysql cluster for a few months now.
I came to this project with no prior knowledge of what was concerned/needed,
and have learned organically via various online how-tos and web sites which in
many cases were slightly out of date or missing large chunks of pertinent
information.  That's not a criticism at all of those still-helpful aids, but
more an indication of how there are huge holes in my knowledge...

So with that background ...

The cluster consists of 2 CentOS 7 servers (estrela and rafeiro) running:
DRBD 9.0
corosync 2.4.5
pacemaker 0.9.169

On the whole it's all running fine, with some squeaks that we are hoping are
down to underlying SAN issues.

However...
Earlier this week we had some split-brain issues - some of which seem to have
fixed themselves, others not.  What we did notice is that while the split-brain
was being reported, the overall cluster remained up (of course?), in that the
VIP remained up and the mysql instance remained available via the VIP on port
3306.  The underlying concern being, of course, that had a "flip" occurred from
the previous master to the previous slave, the new master's drbd device
(mounted on /var/lib/mysql) may well be out of sync and thus contain "old" data.

So - system logs recently show this

ESTRELA
Oct 18
Oct 18 04:04:28 wp-vldyn-estrela kernel: [584651.491139] drbd mysql01/0 drbd0: Split-Brain detected, 1 primaries, automatically solved. Sync from peer node

Oct 19
Oct 19 03:45:43 wp-vldyn-estrela kernel: [47892.092191] drbd mysql01/0 drbd0: Split-Brain detected but unresolved, dropping connection!


RAFEIRO
Oct 18
Oct 18 04:04:28 wp-vldyn-rafeiro kernel: [584652.907126] drbd mysql01/0 drbd0: Split-Brain detected, 1 primaries, automatically solved. Sync from this node

Oct 19
Oct 19 03:45:43 wp-vldyn-rafeiro kernel: [47864.401284] drbd mysql01/0 drbd0: Split-Brain detected but unresolved, dropping connection!



So on the 18th the split-brain issue was detected but (automatically?) fixed.
But on the 19th it wasn't...

Any ideas how to investigate why it worked on the 18th and not the 19th?  I am
presuming the drbd config is set up to automatically fix stuff, but maybe we
just got lucky and it isn't?  (I've googled automatic fixes but I'm afraid I
can't follow what I'm being told/reading :-(  )

drbd config below
ta
ian

==
ESTRELA
resource mysql01 {
 protocol C;
 meta-disk internal;
 device /dev/drbd0;
 disk   /dev/vg_mysql/lv_mysql;
 handlers {
  split-brain "/usr/lib/drbd/notify-split-brain.sh root";
 }
 net {
  allow-two-primaries no;
  after-sb-0pri discard-zero-changes;
  after-sb-1pri discard-secondary;
  after-sb-2pri disconnect;
  rr-conflict disconnect;
 }
 disk {
  on-io-error detach;
 }
 syncer {
  verify-alg sha1;
 }
 on estrela {
  address  10.108.248.165:7789;
 }
 on rafeiro {
  address  10.108.248.166:7789;
 }
}



RAFEIRO
resource mysql01 {
 protocol C;
 meta-disk internal;
 device /dev/drbd0;
 disk   /dev/vg_mysql/lv_mysql;
 handlers {
  split-brain "/usr/lib/drbd/notify-split-brain.sh root";
 }
 net {
  allow-two-primaries no;
  after-sb-0pri discard-zero-changes;
  after-sb-1pri discard-secondary;
  after-sb-2pri disconnect;
  rr-conflict disconnect;
 }
 disk {
  on-io-error detach;
 }
 syncer {
  verify-alg sha1;
 }
 on estrela {
  address  10.108.248.165:7789;
 }
 on rafeiro {
  address  10.108.248.166:7789;
 }
}
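
For what it's worth, "Split-Brain detected but unresolved, dropping connection!"
means the after-sb policies above declined to pick a winner, and recovery is
then manual.  The DRBD user guide's procedure is roughly the following sketch;
deciding which node is the split-brain "victim" (whose changes are discarded)
is an operator call, so this is not something to copy blindly:

# on the node whose changes are to be discarded (the "victim"):
drbdadm disconnect mysql01
drbdadm secondary mysql01
drbdadm connect --discard-my-data mysql01

# on the surviving node (only needed if it is sitting in StandAlone):
drbdadm connect mysql01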








[ClusterLabs] Is there a DRBD forum?

2021-10-19 Thread Ian Diddams via Users
Rather than clog up what I perceive as a pacemaker/corosync forum, is there a
DRBD forum I could ask a query on?

(FWIW I'm trying to find a way to log drbd specifically to a separate log,
other than the system log it reaches via its kernel logging?)
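
One hedged possibility for the logging question, since DRBD messages arrive via
the kernel log: filter them out with rsyslog.  The file name and target path
below are just an example:

# /etc/rsyslog.d/drbd.conf
:msg, contains, "drbd" /var/log/drbd.log
& stop     # drop this line if the messages should also stay in the main syslog

# then: systemctl restart rsyslog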


[ClusterLabs] what is the point of pcs status error messages while the VIP is still up and service is retained?

2021-10-07 Thread Ian Diddams
I'm trying to find out exactly what the impact/point is of cluster error
messages such as this one from running "pcs status":


mysql_VIP01_monitor_3 on wp-vldyn-rafeiro 'unknown error' (1): call=467, status=Timed Out, exitreason='', last-rc-change='Mon Oct 4 17:04:07 2021', queued=0ms, exec=0ms

This error may be reported for several hours, until "pcs resource refresh" is
run, when it just, of course, goes away.

This server is also reporting as the master at the same time, and other (xymon)
monitoring checks showed

* the VIP as "up" (conn check remained green, i.e. it was pingable)
* a mysql connection via the VIP was successful all the time (test run every
minute from another system: "mysqlshow -h -uroot -p mysql")

So I'm not getting what the point of the error is, given it doesn't seem to
actually affect the service and is just a visual obfuscation that doesn't seem
to help?
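
Not an answer to the "why", but if the aim is simply for a transient monitor
timeout like this to age out on its own instead of sitting in pcs status until
a manual refresh, something along these lines may help (resource name taken
from the error above; the values are examples only):

# let the failure record expire automatically once the resource has been OK for 5 minutes
pcs resource meta mysql_VIP01 failure-timeout=5min
# and/or give the monitor more headroom before it is declared timed out
pcs resource update mysql_VIP01 op monitor interval=30s timeout=60s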




Re: [ClusterLabs] Users Digest, Vol 46, Issue 8

2018-11-09 Thread Ian Underhill
Yep, all my pcs commands run on a live cluster.  The design needs resources
to respond in specific ways before moving on to other shutdown requests.

So it seems that these pcs commands, run on different nodes at the same time,
are the root cause of this issue; anything that changes the live CIB at the
same time seems to cause pacemaker to just skip\throw away actions that have
been requested.

I have to admit this behaviour is very hard to work with, though in a simple
system using a shadow CIB would avoid these issues; that would suggest a
central point of control anyway.

Luckily I have been able to redesign my approach to bring all the commands that
affect the live CIB (on cluster shutdown\startup) to be run from a single node
within the cluster (and added --waits to commands where possible).

This approach removes all these issues, and things behave as expected.
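
For anyone following along, the shadow-CIB batching mentioned above can be done
with pcs -f against an offline copy of the CIB; a rough sketch (the file and
group names here are placeholders):

# take an offline copy of the live CIB
pcs cluster cib shutdown_batch
# queue all the changes against the file instead of the live cluster
pcs -f shutdown_batch resource disable groupA
pcs -f shutdown_batch resource disable groupB
# push everything back as one change
pcs cluster cib-push shutdown_batch

The per-command --wait flag the poster mentions (e.g. pcs resource disable
groupA --wait=60) is the other half: it blocks until the transition settles
before the next command runs.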


On Fri, Nov 9, 2018 at 12:00 PM  wrote:

> Today's Topics:
>
>1. Re: Pacemaker auto restarts disabled groups (Ian Underhill)
>2. Re: Pacemaker auto restarts disabled groups (Ken Gaillot)
>
>
> ------
>
> Message: 1
> Date: Thu, 8 Nov 2018 12:14:33 +
> From: Ian Underhill 
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Pacemaker auto restarts disabled groups
>
> seems this issue has been raised before, but has gone quite, with no
> solution
>
> https://lists.clusterlabs.org/pipermail/users/2017-October/006544.html
>
> I know my resource agents successfully return the correct status to the
> start\stop\monitor requests
>
> On Thu, Nov 8, 2018 at 11:40 AM Ian Underhill 
> wrote:
>
> > Sometimes Im seeing that a resource group that is in the process of being
> > disable is auto restarted by pacemaker.
> >
> > When issuing pcs disable command to disable different resource groups at
> > the same time (on different nodes, at the group level) the result is that
> > sometimes the resource is stopped and restarted straight away. i'm using
> a
> > balanced placement strategy.
> >
> > looking into the daemon log, pacemaker is aborting transtions due to
> > config change of the meta attributes of target-role changing?
> >
> > Transition 2838 (Complete=25, Pending=0, Fired=0, Skipped=3,
> > Incomplete=10, Source=/var/lib/pacemaker/pengine/pe-input-704.bz2):
> Stopped
> >
> > could somebody explain Complete/Pending/Fired/Skipped/Incomplete and is
> > there a way of displaying Skipped actions?
> >
> > ive used crm_simulate --xml-file  -run to see the actions, and I see
> > this extra start request
> >
> > regards
> >
> > /Ian.
> >
>
> --
>
> Message: 2
> Date: Thu, 08 Nov 2018 10:58:52 -0600
> From: Ken Gaillot 
> Subject: Re: [ClusterLabs] Pacemaker auto restarts disabled groups
>
> On Thu, 2018-11-08 at 12:14 +, Ian Underhill wrote:
> > seems this issue has been raised before, but has gone quite, with no
> > solution
> >
> > https://lists.clusterlabs.org/pipermail/users/2017-October/006544.html
>
> In that case, something appeared to be explicitly re-enabling the
> disabled resources. You can search your logs for "target-role" to see
> whether that's happening.
>
> > I know my resource agents successfully return the correct status to
> > the start\stop\monitor requests
> >
> > On Thu, Nov 8, 2018 at 11:40 AM Ian Underhill wrote:
> > > Sometimes Im seeing that a resource group that is in the process of
> > > being disable is auto restarted

Re: [ClusterLabs] Pacemaker auto restarts disabled groups

2018-11-08 Thread Ian Underhill
Seems this issue has been raised before, but has gone quiet, with no
solution:

https://lists.clusterlabs.org/pipermail/users/2017-October/006544.html

I know my resource agents successfully return the correct status to the
start\stop\monitor requests

On Thu, Nov 8, 2018 at 11:40 AM Ian Underhill wrote:

> Sometimes Im seeing that a resource group that is in the process of being
> disable is auto restarted by pacemaker.
>
> When issuing pcs disable command to disable different resource groups at
> the same time (on different nodes, at the group level) the result is that
> sometimes the resource is stopped and restarted straight away. i'm using a
> balanced placement strategy.
>
> looking into the daemon log, pacemaker is aborting transtions due to
> config change of the meta attributes of target-role changing?
>
> Transition 2838 (Complete=25, Pending=0, Fired=0, Skipped=3,
> Incomplete=10, Source=/var/lib/pacemaker/pengine/pe-input-704.bz2): Stopped
>
> could somebody explain Complete/Pending/Fired/Skipped/Incomplete and is
> there a way of displaying Skipped actions?
>
> ive used crm_simulate --xml-file  -run to see the actions, and I see
> this extra start request
>
> regards
>
> /Ian.
>


[ClusterLabs] Pacemaker auto restarts disabled groups

2018-11-08 Thread Ian Underhill
Sometimes I'm seeing that a resource group that is in the process of being
disabled is auto-restarted by pacemaker.

When issuing pcs disable commands to disable different resource groups at
the same time (on different nodes, at the group level), the result is that
sometimes a resource is stopped and restarted straight away. I'm using a
balanced placement strategy.

Looking into the daemon log, pacemaker is aborting transitions due to a config
change of the target-role meta attribute changing?

Transition 2838 (Complete=25, Pending=0, Fired=0, Skipped=3, Incomplete=10,
Source=/var/lib/pacemaker/pengine/pe-input-704.bz2): Stopped

Could somebody explain Complete/Pending/Fired/Skipped/Incomplete, and is
there a way of displaying Skipped actions?

I've used crm_simulate --xml-file  -run to see the actions, and I see
this extra start request.

regards

/Ian.


[ClusterLabs] Colocation dependencies (dislikes)

2018-09-23 Thread Ian Underhill
I'm trying to design a resource layout that has different "dislike"
colocation scores between the various resources within the cluster.

1) When I start to have multiple colocation dependencies from a single
resource, strange behaviour starts to happen in scenarios where resources
have to bunch together.

Consider the example (2-node system), 3 resources:
C->B->A

constraints
B->A -10
C->B -INFINITY
C->A -10

So on paper I would expect A and C to run together and B to run on its own;
what you actually get is A and B running and C stopped?

crm_simulate -Ls says the score for C running on the same node as A is -10,
so why doesn't it start it?

Ideas?
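
For reference, expressed as CIB constraints the scenario above corresponds to
something like the XML below (shown only to make the scores explicit; the ids
are arbitrary).  crm_simulate -Ls against the live CIB then prints the
resulting allocation scores:

<rsc_colocation id="col-B-with-A" rsc="B" with-rsc="A" score="-10"/>
<rsc_colocation id="col-C-with-B" rsc="C" with-rsc="B" score="-INFINITY"/>
<rsc_colocation id="col-C-with-A" rsc="C" with-rsc="A" score="-10"/>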


[ClusterLabs] Q: Resource Groups vs Resources for stickiness and colocation?

2018-08-29 Thread Ian Underhill
I'm guessing this is just a "feature", but it's something that will probably
stop me using groups.

Scenario1 (working):
1) Two nodes (1,2) within a cluster (default-stickiness = INFINITY)
2) Two resources (A,B) in a cluster running on different nodes
3) colocation constraint between resources of A->B score=-1

a) pcs standby node2, the resource B moves to node 1
b) pcs unstandby node2, the resource B stays on node 1 - this is good and
expected

Scenario 2 (working):
1) exactly the same as above but the resource exist within their own group
(G1,G2)
2) the colocation constraint is between the groups

Scenario 3 (not working):
1) Same as above however each group has two resources in them

 Resource Group: A_grp
 A (ocf::test:fallover): Started mac-devl03
 A_2 (ocf::test:fallover): Started mac-devl03
 Resource Group: B_grp
 B (ocf::test:fallover): Started mac-devl11
 B_2 (ocf::test:fallover): Started mac-devl11

a) pcs standby node2, the group moves to node 1
b) pcs unstandby node2, the group moves back to node 2, even though I have
INFINITY stickiness (maybe I need INFINITY+1 ;) )

crm_simulate -sL doesn't really explain why there is a difference.

Any ideas?  (environment: pacemaker-cluster-libs-1.1.16-12.el7.x86_64)

/Ian
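
One thing that may be worth trying as a workaround, assuming the group names
from scenario 3 (this is a guess, not a known fix): set the stickiness
explicitly on the groups themselves rather than relying only on the default,
then re-check the scores:

pcs resource meta A_grp resource-stickiness=INFINITY
pcs resource meta B_grp resource-stickiness=INFINITY
crm_simulate -sL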


[ClusterLabs] Understanding\Manually adjusting node-health-strategy

2018-07-26 Thread Ian Underhill
When a resource fails on a node I would like to mark the node unhealthy, so
other resources don't start up on it.

I believe I can achieve this, ignoring the concept of fencing at the moment.

I have tried to set my cluster to have a node-health-strategy of only_green.

However, trying to manually adjust the node's health: I believe I can set an
attribute #health on a node (see ref docs), but trying to set any #health
attribute fails?

# sudo crm_attribute --node myNode --name=#health --update=red
Error setting #health=red (section=nodes, set=nodes-3): Update does not conform to the configured schema
Error performing operation: Update does not conform to the configured schema

I'm slightly surprised I don't get "something for free" regarding pacemaker
auto-adjusting the health of a node when resources fail on it? Am I missing
a setting, or is this done by hand?

Thanks

Ian.

*ref docs*
http://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-node-health.html
https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-cluster-options.html
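
A hedged note on the schema error above: one thing to try for a manual test is
writing the attribute into the transient (status) section rather than the
permanent nodes section, for example:

# transient node attribute, cleared when the node reboots
crm_attribute --node myNode --name '#health-manual' --update red --lifetime reboot
# or directly via the attribute daemon
attrd_updater --name '#health-manual' --update red --node myNode

As for "something for free": the stock ocf:pacemaker:HealthCPU and HealthSMART
agents feed #health-* attributes automatically, but marking a node unhealthy
because an arbitrary resource failed on it is not built in, as far as I know.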


[ClusterLabs] OCF Return codes OCF_NOT_RUNNING

2018-07-11 Thread Ian Underhill
I'm trying to understand the behaviour of pacemaker when a resource monitor
returns OCF_NOT_RUNNING instead of OCF_ERR_GENERIC, and whether pacemaker
really cares.

The documentation states that a return code of OCF_NOT_RUNNING from a monitor
will not result in a stop being called on that resource, as it believes the
node is still clean:

https://www.clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-ocf-return-codes.html

This makes sense; however, in practice that is not what happens (unless I'm
doing something wrong :) ).

When my resource returns OCF_NOT_RUNNING for a monitor call (after a start has
been performed), a stop is called.

If I have a resource threshold set >1, I get a start->monitor->stop cycle
until the threshold is consumed.

/Ian.
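
To make the distinction concrete, here is a minimal (hypothetical) monitor
action showing the exit codes in question; the daemon and pid file names are
placeholders:

# exit codes as defined by the OCF spec / ocf-shellfuncs
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

my_monitor() {
    if [ -f /var/run/mydaemon.pid ] && kill -0 "$(cat /var/run/mydaemon.pid)" 2>/dev/null; then
        return $OCF_SUCCESS        # running
    fi
    if [ -f /var/run/mydaemon.pid ]; then
        return $OCF_ERR_GENERIC    # stale pid file: it died uncleanly
    fi
    return $OCF_NOT_RUNNING        # cleanly stopped / never started here
}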


[ClusterLabs] Pacemaker alert framework

2018-07-06 Thread Ian Underhill
Requirement:
When a resource fails, perform an action - run a script on all nodes within
the cluster before the resource is relocated, i.e. gather information on why
the resource failed.

What I have looked into:
1) Use the monitor call within the resource to SSH to all nodes; again, SSH
config is needed.
2) Alert framework: this only seems to be triggered on nodes involved in
the relocation of the resource, i.e. if a resource moves from node1 to node2,
node3 doesn't know. So back to the SSH solution :(
3) Sending a custom alert to all nodes in the cluster? Is this possible? I
have not found a way.

The only solution I have:
1) Use SSH within an alert agent (stop) to SSH onto all nodes to perform
the action; the nodes could be configured using the alert's recipients,
but I would still need to configure SSH users and certs etc.
 1.a) This doesn't seem to be usable if the resource is relocated back
to the same node, as the start\stop alerts are run at the "same time", i.e.
I need to delay the start till the SSH has completed.

What I would like:
1) Delay the start\relocation of the resource until the information from
all nodes is complete, using only pacemaker behaviour\config.

Any ideas?

Thanks

/Ian.
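
Not a solution to the ordering problem, and it doesn't change the observation
that alerts only fire on the nodes involved in the event, but for reference
the alert-agent plumbing itself is small; paths and names below are
hypothetical:

#!/bin/sh
# /usr/local/bin/gather_diags.sh - pacemaker calls this with CRM_alert_* env vars
if [ "$CRM_alert_kind" = "resource" ] && [ "${CRM_alert_rc:-0}" != "0" ]; then
    logger "resource ${CRM_alert_rsc} op ${CRM_alert_task} failed on ${CRM_alert_node} (rc=${CRM_alert_rc})"
    # local information gathering goes here
fi

# register it on the cluster:
pcs alert create path=/usr/local/bin/gather_diags.sh id=gather_diags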


Re: [ClusterLabs] New website design and new-new logo

2017-09-21 Thread Ian
Looks great to me!  I like Kai's suggestions, too.

On Sep 21, 2017 6:08 AM, "Kai Dupke"  wrote:

> On 09/21/2017 01:53 AM, Ken Gaillot wrote:
> > Check it out at https://clusterlabs.org/
>
> Two comments
>
> - I would like to see the logo used by as many
> people/projects/marketingers, so I propose to link the Logo to a Logo
> page with some prepared Logos - at least with one big to download and a
> license info
>
> - Should we not add a word about the license to the FAQ on top? I mean,
> I am with Open Source for quite some time but some others might not and
> we want to get fresh members to the community, right? I'm not sure all
> subproject share the same license, but if so then it should be written
> down.
>
> Best regards,
> Kai Dupke
> Senior Product Manager
> SUSE Linux Enterprise 15
> --
> Sell not virtue to purchase wealth, nor liberty to purchase power.
> Phone:  +49-(0)5102-9310828 Mail: kdu...@suse.com
> Mobile: +49-(0)173-5876766  WWW:  www.suse.com
>
> SUSE Linux GmbH - Maxfeldstr. 5 - 90409 Nuernberg (Germany)
> GF:Felix Imendörffer,Jane Smithard,Graham Norton,HRB 21284 (AG Nürnberg)
>


Re: [ClusterLabs] DRBD or SAN ?

2017-07-17 Thread Ian
DRBD is absolutely compatible with mysql/postgres.

In my experience, with a 10G pipe for block replication, there's also
basically no performance degradation compared to native disk writes, even
with an SSD array.

> I've always favored native replication over disk replication for
databases. I'm not sure that's a necessity, but I would think the
biggest advantage is that with disk replication, you couldn't run the
database server all the time, you'd have to start it (and its VM in your
case) after disk failover.

This is true, you have to start mysql/the vm during failover, but in my
experience usually that is very fast (depending on mysql/vm
configuration).  Also, depending on what replication tools you are using,
you'd still have to promote the slave to a master, which might save seconds
or less compared to starting mysql (which, I recognize, could be
important).  Note that I am unfamiliar with methods of postgres
replication.

I think a big advantage compared to native replication is that DRBD offers
synchronous replication at the block level as opposed to the transaction
level.  I have not run my own tests, but my vicarious experience with tools
such as Galera or other synchronous forms of SQL replication indicate that
there may be a significant performance hit based on workload.  Obviously,
there's no significant performance hit if you are doing mysql native
asynchronous replication (I guess as long as you aren't spending all your
i/o on the master on your binlogs), but then you are relying on
asynchronous processes to keep your data safe and available.  Perhaps that
is not a big deal, I am not well versed in that level of replication theory.

> physical proximity so that environmental factors could
take down the whole thing.

Literal server fires come to mind.

@OP, I agree with Ken that a multi-datacenter setup is ideal if your
application can deal with its various caveats, and it may be worth
investigating, since moving to a DRBD setup doesn't eliminate any more
problems than a multi-DC setup would, as long as your SAN is already set up on
independent electrical circuits and separate networking stacks to begin with.
E.g., both a multi-server and a multi-datacenter setup would protect from small
disasters that take out the whole server, but a DRBD setup does not add much
more than that, whereas a multi-datacenter setup would also protect from a
large-scale disaster.  DRBD and a SAN would also both suffer from a
building-wide power outage.  Do you have generators?

On Mon, Jul 17, 2017 at 1:30 PM, Digimer  wrote:

> On 2017-07-17 05:51 AM, Lentes, Bernd wrote:
> > Hi,
> >
> > i established a two node cluster with two HP servers and SLES 11 SP4.
> I'd like to start now with a test period. Resources are virtual machines.
> The vm's reside on a FC SAN. The SAN has two power supplies, two storage
> controller, two network interfaces for configuration. Each storage
> controller has two FC connectors. On each server i have one FC controller
> with two connectors in a multipath configuration. Each connector from the
> SAN controller inside the server is connected to a different storage
> controller from the SAN. But isn't a SAN, despite all that redundancy, a
> SPOF ?
> > I'm asking myself if a DRBD configuration wouldn't be more redundant and
> high available. There i have two completely independent instances of the vm.
> > We have one web application with a databse which is really crucial for
> us. Downtime should be maximum one or two hours, if longer we run in
> trouble.
> > Is DRBD in conjuction with a database (MySQL or Postgres) possible ?
> >
> >
> > Bernd
>
> DRBD any day.
>
> Yes, even with all the redundancy, it's a single electrical/mechanical
> device that can be taken offline by a bad firmware update, user error,
> etc. DRBD gives you full mechanical and electrical replication of the
> data and has survived some serious in-the-field faults in our Anvil!
> system (including a case where three drives were ejected at the same
> time from the node hosting the VMs, and the servers lived).
>
> --
> Digimer
> Papers and Projects: https://alteeve.com/w/
> "I am, somehow, less interested in the weight and convolutions of
> Einstein’s brain than in the near certainty that people of equal talent
> have lived and died in cotton fields and sweatshops." - Stephen Jay Gould
>


Re: [ClusterLabs] 2-Node Cluster Pointless?

2017-04-17 Thread Ian
>  maybe I need another coffee?

No, I don't understand how it's relevant to the specific topic of avoiding
split-brains, either.  I suppose it's possible that I also need coffee.

On Mon, Apr 17, 2017 at 1:45 PM, Dimitri Maziuk 
wrote:

> On 04/17/2017 11:58 AM, Digimer wrote:
>
> > ... Unless I am misunderstanding, your comment is related to
> > serviceability of clusters in general. I'm failing to link the contexts.
> > Similarly, I'm not sure how this relates to "new" vs. "best"...
>
> You can't know if *a* customer can access the service it provides. You
> can know if the service access point is up and connected to the server
> process.
>
> Take a simple example of shared-nothing read-only cluster: all you need
> to know is that the daemon is bound to '*' and the floating ip is bound
> to eth0.
>
> This is the "best" in that it's simple, stupid, does all you you
> need/can do and nothing that doesn't make your cluster run any "better".
>  It's also very unexciting.
>
> --
> Dimitri Maziuk
> Programmer/sysadmin
> BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
>
>
>


Re: [ClusterLabs] start a resource

2016-05-06 Thread Ian
Are you getting any other errors now that you've fixed the config?

What does your config file look like?
On May 6, 2016 10:33 AM, "Dmitri Maziuk"  wrote:

> On 2016-05-05 23:50, Moiz Arif wrote:
>
>> Hi Dimitri,
>>
>> Try cleanup of the fail count for the resource with the any of the below
>> commands:
>>
>> via pcs : pcs resource cleanup rsyncd
>>
>
> Tried it, didn't work. Tried pcs resource debug-start rsyncd -- got no
> errors, resource didn't start. Tried disable/enable.
>
> So far the only way I've been able to do this is pcs cluster stop ; pcs
> cluster start which is ridiculous on a production cluster with drbd and a
> database etc. (And it killed my ssh connection to the other node, again.)
>
> Ay other suggestions?
> Thanks,
> Dima
>
>


Re: [ClusterLabs] Installed Galera, now HAProxy won't start

2016-03-19 Thread Ian
> configure MariaDB server to bind to all available ports (
http://docs.openstack.org/ha-guide/controller-ha-galera-config.html, scroll
to "Database Configuration," note that bind-address is 0.0.0.0.). If
MariaDB binds to the virtual IP address, then HAProxy can't bind to that
address and therefore won't start. Right?

That is correct as far as my understanding goes.  By binding to port 3306
on all IPs (0.0.0.0), you are effectively preventing HAProxy from being
able to use port 3306 on its own IP and vice-versa.

Try setting specific bind addresses for your Galera nodes; I would be
surprised and interested if it didn't work.
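
A minimal sketch of what that would look like on one Galera node, reusing the
controller1 address from the HAProxy config quoted below (the config file path
varies by distribution):

# e.g. /etc/my.cnf.d/galera.cnf on controller1
[mysqld]
# bind to the node's own address, not 0.0.0.0 and not the 10.0.0.10 VIP
bind-address = 10.0.0.11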


On Wed, Mar 16, 2016 at 6:10 PM, Matthew Mucker  wrote:

> Sorry, folks, for being a pest here, but I'm finding the learning curve on
> this clustering stuff to be pretty steep.
>
>
> I'm following the docs to set up a three-node Openstack Controller
> cluster. I got Pacemaker running and I had two resources, the virtual IP
> and HAProxy, up and running and I could move these resources to any of the
> three nodes. Success!
>
>
> I then moved on to installing Galera.
>
>
> The MariaDB engine started fine on 2 of the 3 nodes but refused to start
> on the third. After some digging and poking (and swearing), I found that
> HAProxy was listening on the virtual IP on the mySQL port, which prevented
> MariaDB from listening on that port. Makes sense. So I moved HAProxy to
> another node and started MariaDB on my third node and now I have a
> three-node Galera cluster.
>
>
> But.
>
>
> Now HAPRoxy won't start on any node. I imagine it's because MariaDB is
> already listening on the same IP:Port combination that Galera wants. (After
> all, HAProxy is supposed to proxy that IP:Port, right?) Unfortunately, I
> don't see anything useful in the HAProxy.log file so I don't really know
> what's wrong.
>
>
> So thinking this through logically, it seems to me that the Openstack
> docs were wrong in telling me to configure MariaDB server to bind to all
> available ports (
> http://docs.openstack.org/ha-guide/controller-ha-galera-config.html,
> scroll to "Database Configuration," note that bind-address is 0.0.0.0.). If
> MariaDB binds to the virtual IP address, then HAProxy can't bind to that
> address and therefore won't start. Right?
>
>
> Am I thinking correctly here, or is something else wrong with my setup? In
> general, I've found that the OpenStack documents tend to be right, but in
> this case my understanding of the concepts involved makes me wonder.
>
>
> In any case, I'm having difficulty getting HAProxy and Galera running on
> the same nodes. My HAProxy config file is:
>
>
> global
>   chroot  /var/lib/haproxy
>   daemon
>   group  haproxy
>   maxconn  4000
>   pidfile  /var/run/haproxy.pid
>   user  haproxy
>
> defaults
>   log  global
>   maxconn  4000
>   option  redispatch
>   retries  3
>   timeout  http-request 10s
>   timeout  queue 1m
>   timeout  connect 10s
>   timeout  client 1m
>   timeout  server 1m
>   timeout  check 10s
>
> listen galera_cluster
>   bind 10.0.0.10:3306
>   balance  source
>   option  httpchk
>   server controller1 10.0.0.11:3306 check port 9200 inter 2000 rise 2
> fall 5
>   server controller2 10.0.0.12:3306 backup check port 9200 inter 2000
> rise 2 fall 5
>   server controller3 10.0.0.13:3306 backup check port 9200 inter 2000
> rise 2 fall 5
>
>
>
> Does the server name under "listen galera_cluster" need to match the
> hostname of the node? What else could be causing these two daemons to not
> play nicely together?
>
>
> Thanks!
>
>
> -Matthew
>
>