[ClusterLabs] RHEL 7.4 cluster cannot commit suicide (sbd)
Hello everyone, as this is my first post to the mailing list, please excuse me. Here is why I'm writing. I have 3 VMs (KVM/QEMU) with a watchdog of type 'i6300esb', RHEL 7.4, and an iSCSI target as shared storage. I have created a 3-node cluster, and the poison pill (pcs stonith fence node_name) works, but I can't make the sbd daemon self-fence the node once the network is cut off (firewall-cmd --panic-on).

The strange thing is that the sbd daemon detects that the storage is offline (I've stripped out the clutter):

sbd[pid]: warning: inquisitor_child: Servant is outdated (age: 4)
sbd[pid]: warning: inquisitor_child: Majority of devices lost - surviving on pacemaker
sbd[pid]: error: header_get: Unable to read header from device 6

The servant keeps being restarted, but there is no self-fencing. I thought the issue was in the watchdog, but immediately after killing the sbd main pid the node gets reset (as expected).

This is the configuration in "/etc/sysconfig/sbd":

SBD_DELAY_START=no
SBD_DEVICE="/full/path/to/by-id/iscsi"
SBD_OPTS="-n harhel1"
SBD_PACEMAKER=yes
SBD_STARTMODE=always
SBD_WATCHDOG_DEV=/dev/watchdog
SBD_WATCHDOG_TIMEOUT=5

I used the following example for setting up sbd: https://access.redhat.com/articles/3099231

Thank you for reading this long e-mail. I would be grateful if someone finds my mistake.

Best Regards,
Strahil Nikolov

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
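A few commands that may help narrow a problem like this down (a sketch only; `dump` and `list` read the SBD header and node slots on the shared device, and the timeouts shown there should be consistent with SBD_WATCHDOG_TIMEOUT — substitute your actual by-id path):

```shell
# What SBD wrote to the shared device: watchdog/msgwait timeouts, sector sizes
sbd -d /dev/disk/by-id/YOUR-ISCSI-DEVICE dump

# Per-node slots and any pending poison-pill messages
sbd -d /dev/disk/by-id/YOUR-ISCSI-DEVICE list

# Confirm which watchdog devices SBD can see
sbd query-watchdog

# CAUTION: this should hard-reset the node if the watchdog works
sbd test-watchdog
```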
Re: [ClusterLabs] fencing on iscsi device not working
Have you checked this article: "Using SCSI Persistent Reservation Fencing (fence_scsi) with pacemaker in a Red Hat High Availability cluster" on the Red Hat Customer Portal? It describes how to properly configure fence_scsi and the requirements for using it.

Have you checked whether your storage supports persistent reservations?

Best Regards,
Strahil Nikolov

On Wednesday, 30 October 2019 at 8:42:16 GMT-4, RAM PRASAD TWISTED ILLUSIONS wrote:

Hi everyone,

I am trying to set up a storage cluster with two nodes, both running Debian buster. The two nodes, called duke and miles, have a LUN residing on a SAN box as their shared storage device. As you can see in the output of pcs status, all the daemons are active and I can get the nodes online without any issues. However, I cannot get the fencing resources to start.

These two nodes were running Debian jessie before and had access to the same LUN in a storage cluster configuration. Now I am trying to recreate a similar setup with both nodes running the latest Debian. I am not sure if this is relevant, but this LUN already has a shared VG with data on it. I am wondering if this could be the cause of the trouble? Should I be creating my stonith device on a different/fresh LUN?
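One way to check persistent-reservation support directly from a node is with sg_persist from the sg3_utils package (a sketch; the device path is the one used elsewhere in this thread, and the exact output depends on the array):

```shell
# Query the LUN's persistent-reservation capabilities
sg_persist --in --report-capabilities -d /dev/disk/by-id/wwn-0x600c0ff0001e8e3c89601b580100

# Show any existing registrations and the current reservation holder
sg_persist --in --read-keys -d /dev/disk/by-id/wwn-0x600c0ff0001e8e3c89601b580100
sg_persist --in --read-reservation -d /dev/disk/by-id/wwn-0x600c0ff0001e8e3c89601b580100
```

If `--report-capabilities` shows no PR support, fence_scsi cannot work on that LUN regardless of the pacemaker configuration.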
### pcs status
Cluster name: jazz
Stack: corosync
Current DC: duke (version 2.0.1-9e909a5bdd) - partition with quorum
Last updated: Wed Oct 30 11:58:19 2019
Last change: Wed Oct 30 11:28:28 2019 by root via cibadmin on duke

2 nodes configured
2 resources configured

Online: [ duke miles ]

Full list of resources:

 fence_duke (stonith:fence_scsi): Stopped
 fence_miles (stonith:fence_scsi): Stopped

Failed Fencing Actions:
* unfencing of duke failed: delegate=, client=pacemaker-controld.1703, origin=duke, last-failed='Wed Oct 30 11:43:29 2019'
* unfencing of miles failed: delegate=, client=pacemaker-controld.1703, origin=duke, last-failed='Wed Oct 30 11:43:29 2019'

Daemon Status:
 corosync: active/enabled
 pacemaker: active/enabled
 pcsd: active/enabled
###

I used the following commands to add the two fencing devices and set their location constraints.

###
sudo pcs cluster cib test_cib_cfg
pcs -f test_cib_cfg stonith create fence_duke fence_scsi pcmk_host_list=duke pcmk_reboot_action="off" devices="/dev/disk/by-id/wwn-0x600c0ff0001e8e3c89601b580100" meta provides="unfencing"
pcs -f test_cib_cfg stonith create fence_miles fence_scsi pcmk_host_list=miles pcmk_reboot_action="off" devices="/dev/disk/by-id/wwn-0x600c0ff0001e8e3c89601b580100" delay=15 meta provides="unfencing"
pcs -f test_cib_cfg constraint location fence_duke avoids duke=INFINITY
pcs -f test_cib_cfg constraint location fence_miles avoids miles=INFINITY
pcs cluster cib-push test_cib_cfg
###

Here is the output in /var/log/pacemaker/pacemaker.log after adding the fencing resources:

Oct 30 12:06:02 duke pacemaker-schedulerd[1702] (determine_online_status_fencing) info: Node miles is active
Oct 30 12:06:02 duke pacemaker-schedulerd[1702] (determine_online_status) info: Node miles is online
Oct 30 12:06:02 duke pacemaker-schedulerd[1702] (determine_online_status_fencing) info: Node duke is active
Oct 30 12:06:02 duke pacemaker-schedulerd[1702] (determine_online_status) info: Node duke is online
Oct 30 12:06:02 duke pacemaker-schedulerd[1702] (unpack_node_loop) info: Node 2 is already processed
Oct 30 12:06:02 duke pacemaker-schedulerd[1702] (unpack_node_loop) info: Node 1 is already processed
Oct 30 12:06:02 duke pacemaker-schedulerd[1702] (unpack_node_loop) info: Node 2 is already processed
Oct 30 12:06:02 duke pacemaker-schedulerd[1702] (unpack_node_loop) info: Node 1 is already processed
Oct 30 12:06:02 duke pacemaker-schedulerd[1702] (common_print) info: fence_duke (stonith:fence_scsi): Stopped
Oct 30 12:06:02 duke pacemaker-schedulerd[1702] (common_print) info: fence_miles (stonith:fence_scsi): Stopped
Oct 30 12:06:02 duke pacemaker-schedulerd[1702] (RecurringOp) info: Start recurring monitor (60s) for fence_duke on miles
Oct 30 12:06:02 duke pacemaker-schedulerd[1702] (RecurringOp) info: Start recurring monitor (60s) for fence_miles on duke
Oct 30 12:06:02 duke pacemaker-schedulerd[1702] (LogNodeActions) notice: * Fence (on) miles 'required by fence_duke monitor'
Oct 30 12:06:02 duke pacemaker-schedulerd[1702] (LogNodeActions) notice: * Fence (on) duke 'required by fence_duke monitor'
Oct 30 12:06:02 duke pacemaker-schedulerd[1702] (LogAction) notice: * Start fence_duke ( miles )
Oct 30 12:06:02 duke pacemaker-schedulerd[1702] (LogAction) notice: * Start fence_miles ( duke )
Oct 30 12:06:02 duke pacemaker-schedulerd[1702] (proc
Re: [ClusterLabs] connection timed out fence_virsh monitor stonith
> ;call_id:85 exit-code:1 exec-time:5449ms queue-time:0ms
>
> which I concluded was a problem with the login timeout (which was 5 seconds).
>
> I have therefore increased this timeout to 20 seconds, but the timeout persisted:
>
> Feb 23 00:00:21 [24633] zc-mail-2.zylacloud.com stonith-ng: notice: operation_finished: fence_virsh_monitor_1:20006:stderr [ 2020-02-23 00:00:21,102 ERROR: Connection timed out ]
> Feb 23 00:00:21 [24633] zc-mail-2.zylacloud.com stonith-ng: notice: operation_finished: fence_virsh_monitor_1:20006:stderr [ ]
> Feb 23 00:00:21 [24633] zc-mail-2.zylacloud.com stonith-ng: notice: operation_finished: fence_virsh_monitor_1:20006:stderr [ ]
> Feb 23 00:00:21 [24633] zc-mail-2.zylacloud.com stonith-ng: warning: log_action: fence_virsh[20006] stderr: [ 2020-02-23 00:00:21,102 ERROR: Connection timed out ]
> Feb 23 00:00:21 [24633] zc-mail-2.zylacloud.com stonith-ng: warning: log_action: fence_virsh[20006] stderr: [ ]
> Feb 23 00:00:21 [24633] zc-mail-2.zylacloud.com stonith-ng: warning: log_action: fence_virsh[20006] stderr: [ ]
> Feb 23 00:00:21 [24633] zc-mail-2.zylacloud.com stonith-ng: notice: log_operation: Operation 'monitor' [20006] for device 'fence_zc-mail-1_virsh' returned: -62 (Timer expired)
> Feb 23 00:00:21 [24637] zc-mail-2.zylacloud.com crmd: error: process_lrm_event: Result of monitor operation for fence_zc-mail-1_virsh on zc-mail-2-ha: Timed Out | call=30 key=fence_zc-mail-1_virsh_monitor_6 timeout=2ms
>
> There is also a constraint, as shown below, so that each fencing "agent" runs on the node opposite the one it restarts:
>
> # pcs constraint show --full
>
> Location Constraints:
>   Resource: fence_zc-mail-1_virsh
>     Enabled on: zc-mail-2-ha (score:INFINITY) (role: Started) (id:cli-prefer-fence_zc-mail-1_virsh)
>   Resource: fence_zc-mail-2_virsh
>     Enabled on: zc-mail-1-ha (score:INFINITY) (role: Started) (id:cli-prefer-fence_zc-mail-2_virsh)
> Ordering Constraints:
> Colocation Constraints:
> Ticket Constraints:

I notice that the issue happens at 00:00 on both days. Have you checked for a backup or other cron job that is 'overloading' the virtualization host? Anything in the libvirt logs or in the host's /var/log/messages?

Best Regards,
Strahil Nikolov

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
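To see whether the host really is slow at that time, it may help to run the same action the monitor performs by hand and time it (a sketch; the host, user and password are placeholders for this setup, and `-x` selects ssh as the transport):

```shell
# Run the 'status' action fence_virsh uses for its monitor, with timing
time fence_virsh -a kvm-host.example.com -x -l fenceuser -p secret \
    -n zc-mail-1 -o status
```

Running this from cron at 00:00 for a few nights would show whether the connection time spikes exactly when the monitor fails.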
Re: [ClusterLabs] Coming in Pacemaker 2.0.4: shutdown locks
On February 26, 2020 12:30:24 AM GMT+02:00, Ken Gaillot wrote:
>Hi all,
>
>We are a couple of months away from starting the release cycle for Pacemaker 2.0.4. I'll highlight some new features between now and then.
>
>First we have shutdown locks. This is a narrow use case that I don't expect a lot of interest in, but it helps give pacemaker feature parity with proprietary HA systems, which can help users feel more comfortable switching to pacemaker and open source.
>
>The use case is a large organization with few cluster experts and many junior system administrators who reboot hosts for OS updates during planned maintenance windows, without any knowledge of what the host does. The cluster runs services that have a preferred node and take a very long time to start.
>
>In this scenario, pacemaker's default behavior of moving the service to a failover node when the node shuts down, and moving it back when the node comes back up, results in needless downtime compared to just leaving the service down for the few minutes needed for a reboot.
>
>The goal could be accomplished with existing pacemaker features. Maintenance mode wouldn't work because the node is being rebooted. But you could figure out what resources are active on the node, and use a location constraint with a rule to ban them on all other nodes before shutting down. That's a lot of work for something the cluster can figure out automatically.
>
>Pacemaker 2.0.4 will offer a new cluster property, shutdown-lock, defaulting to false to keep the current behavior. If shutdown-lock is set to true, any resources active on a node when it is cleanly shut down will be "locked" to the node (kept down rather than recovered elsewhere). Once the node comes back up and rejoins the cluster, they will be "unlocked" (free to move again if circumstances warrant).
>
>An additional cluster property, shutdown-lock-limit, allows you to set a timeout for the locks so that if the node doesn't come back within that time, the resources are free to be recovered elsewhere. This defaults to no limit.
>
>If you decide while the node is down that you need the resource to be recovered, you can manually clear a lock with "crm_resource --refresh" specifying both --node and --resource.
>
>There are some limitations using shutdown locks with Pacemaker Remote nodes, so I'd avoid that with the upcoming release, though it is possible.

Hi Ken,

Can it be 'shutdown-lock-timeout' instead of 'shutdown-lock-limit'? Also, I think the default value could be something more reasonable, like 30 min. Usually 30 min is OK if you don't patch the firmware, and 180 min is the maximum if you do patch the firmware.

The use case is odd. I have been in the same situation, and our solution was to train the team (internally) instead of using such a feature.

The interesting part will be the behaviour of the local cluster stack when updates happen. The risk is high that the node gets fenced due to unresponsiveness (during the update), or if corosync/pacemaker use an old function changed in the libs.

Best Regards,
Strahil Nikolov
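Based on Ken's description, day-to-day use of the feature would look roughly like this with pcs (a sketch only; 2.0.4 was unreleased at the time of writing, so the final property names may differ, and the resource/node names are placeholders):

```shell
# Keep resources locked to a cleanly shut-down node (proposed 2.0.4 property)
pcs property set shutdown-lock=true

# Optionally give up the lock if the node is not back within 30 minutes
pcs property set shutdown-lock-limit=30min

# While the node is still down, manually release one resource's lock
# so it can be recovered elsewhere
crm_resource --refresh --resource my_slow_db --node node1
```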
Re: [ClusterLabs] Q: rule-based operation pause/freeze?
Hi Ulrich,

For HA NFS, you should expect no more than 90s (after the failover is complete) for NFSv4 clients to recover. Because of that, I think all resources (in the same cluster or another one) that depend on it should have a longer monitoring interval, maybe something like 179s.

Of course, if your NFS will be down for a longer period, you can set all HA resources that depend on it with "on-fail=ignore" and remove that once the maintenance is over. After all, you want the cluster not to react for that specific time, but you should keep track of such changes, as it is easy to forget such a setting.

Another approach is to leave the monitoring interval high enough that the cluster won't catch the downtime. But imagine that the downtime of the NFS has to be extended: do you believe you would be able to change all affected resources in time?

Best Regards,
Strahil Nikolov

On Thursday, 5 March 2020 at 14:25:36 GMT+2, Ulrich Windl wrote:

Hi!

I'm wondering whether it's possible to pause/freeze specific resource operations through rules. The idea is something like this: if your monitor operation needs (e.g.) some external NFS server, and that NFS server is known to be down, it seems better to delay the monitor operation until NFS is up again, rather than forcing a monitor timeout that will most likely be followed by a stop operation that will also time out, eventually killing the node (which has no problem itself).

As I guess it's not possible right now, what would be needed to make this work? In case it's possible, how would an example scenario look?

Regards,
Ulrich
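The temporary on-fail=ignore approach could be scripted along these lines with pcs (a sketch; the resource name and intervals are placeholders for whatever depends on the NFS export):

```shell
# Before the planned NFS outage: stop reacting to monitor failures
pcs resource update my_app op monitor interval=179s on-fail=ignore

# ... NFS maintenance happens here ...

# After the outage: restore normal behaviour and clear any recorded failures
pcs resource update my_app op monitor interval=60s on-fail=restart
pcs resource cleanup my_app
```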
Re: [ClusterLabs] Antw: [EXT] Re: clusterlabs.org upgrade done
On March 4, 2020 9:50:10 AM GMT+02:00, Ulrich Windl wrote:
>>>> Valentin Vidic wrote on 03.03.2020 at 16:52:
>> On Sat, Feb 29, 2020 at 03:44:50PM -0600, Ken Gaillot wrote:
>>> The clusterlabs.org server OS upgrade is (mostly) done.
>>>
>>> Services are back up, with the exception of some cosmetic issues and the source code continuous integration testing for ClusterLabs github projects (ci.kronosnet.org). Those will be dealt with at a more reasonable time :)
>>
>> Regarding the upgrade, perhaps the mailman config for the list should be updated to work better with SPF and DKIM checks?
>
>How do you define "work better"?
>
>> --
>> Valentin

Maybe I will be unsubscribed every 10th email instead of every 5th one.

Best Regards,
Strahil Nikolov
Re: [ClusterLabs] DRBD not failing over
On February 26, 2020 2:36:46 PM GMT+02:00, "Nickle, Richard" wrote:
>I spent many, many hours tackling the two-node problem, and I had exactly the same symptoms (only able to get the resource to move if I moved it manually) until I did the following:
>
>* Switched to DRBD 9 (added the LINBIT repo, because DRBD 8 is the default in the Ubuntu repo)
>* Built a third diskless quorum arbitration node.
>
>My DRBD configuration now looks like this:
>
>hatst2:$ sudo drbdadm status
>r0 role:Primary
>  disk:UpToDate
>  hatst1 role:Secondary
>    peer-disk:UpToDate
>  hatst4 role:Secondary
>    peer-disk:Diskless
>
>On Wed, Feb 26, 2020 at 6:59 AM Jaap Winius wrote:
>
>> Hi folks,
>>
>> My 2-node test system has a DRBD resource that is configured as follows:
>>
>> ~# pcs resource defaults resource-stickiness=100 ; \
>>      pcs resource create drbd ocf:linbit:drbd drbd_resource=r0 \
>>        op monitor interval=60s ; \
>>      pcs resource master drbd master-max=1 master-node-max=1 \
>>        clone-max=2 clone-node-max=1 notify=true
>>
>> The resource-stickiness setting is to prevent failbacks. I've got that to work with NFS and VIP resources, but not with DRBD. Moreover, when configured as shown above, the DRBD master does not even want to fail over when the node it started up on is shut down.
>>
>> Any idea what I'm missing or doing wrong?
>>
>> Thanks,
>>
>> Jaap
>>
>> PS -- I can only get it to fail over if I first move the DRBD resource to the other node, which creates a "cli-prefer-drbd-master" location constraint for that node, but then it ignores the resource-stickiness setting and always performs the failbacks.
>>
>> PPS -- I'm using CentOS 7.7.1908, DRBD 9.10.0, Corosync 2.4.3, Pacemaker 1.1.20 and PCS 0.9.167.

Is your DRBD used as an LVM PV, e.g. as a disk for an iSCSI LUN? If yes, ensure that you have an LVM global filter for the /dev/drbdXYZ device, the physical devices (like /dev/sdXYZ) and the wwid.

Best Regards,
Strahil Nikolov
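A global filter of the kind mentioned above might look like this in /etc/lvm/lvm.conf (a sketch with placeholder patterns; the reject regexes must match your actual backing disks and by-id/wwid aliases, or LVM may still scan the raw device underneath DRBD):

```
devices {
    # Accept DRBD devices; reject the raw backing disks and their by-id aliases,
    # so LVM only ever sees the PV through /dev/drbdX
    global_filter = [ "a|^/dev/drbd.*|", "r|^/dev/sd.*|", "r|^/dev/disk/by-id/wwn-.*|" ]
}
```

After changing the filter, `pvs` should list the PV only on the /dev/drbdX path.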
Re: [ClusterLabs] Antw: Re: Antw: [EXT] Coming in Pacemaker 2.0.4: shutdown locks
>...recovered on another node.
>
>> excessive precaution? Is it just to avoid it moving somewhere else when exiting maintenance-mode? If the resource has a preferred node, I suppose the location constraint should take care of this, isn't it?
>
>Having a preferred node doesn't prevent the resource from starting elsewhere if the preferred node is down (or in standby, or otherwise ineligible to run the resource). Even a +INFINITY constraint allows recovery elsewhere if the node is not available. To keep a resource from being recovered, you have to put a ban (-INFINITY location constraint) on any nodes that could otherwise run it.
>
>>>>> I wonder: where is it different from a time-limited "ban" (wording also exists already)? If you ban all resources from running on a specific node, resources would move away, and when booting the node, resources won't come back.
>>
>> It actually is equivalent to this process:
>>
>> 1. Determine what resources are active on the node about to be shut down.
>> 2. For each of those resources, configure a ban (location constraint with -INFINITY score) using a rule where node name is not the node being shut down.
>> 3. Apply the updates and reboot the node. The cluster will stop the resources (due to shutdown) and not start them anywhere else (due to the bans).
>
>> In maintenance mode, this would not move either.
>
>The problem with maintenance mode for this scenario is that the reboot would uncleanly terminate any active resources.
>
>> 4. Wait for the node to rejoin and the resources to start on it again, then remove all the bans.
>>
>> The advantage is automation, and in particular the sysadmin applying the updates doesn't need to even know that the host is part of a cluster.
>
>> Could you elaborate? I suppose the operator still needs to issue a command to set the shutdown-lock before reboot, isn't it?
>
>Ah, no -- this is intended as a permanent cluster configuration setting, always in effect.
>
>> Moreover, if shutdown-lock is just a matter of setting ±infinity constraints on nodes, maybe a higher-level tool can take care of this?
>
>In this case, the operator applying the reboot may not even know what pacemaker is, much less what command to run. The goal is to fully automate the process so a cluster-aware administrator does not need to be present.
>
>I did consider a number of alternative approaches, but they all had problematic corner cases. For a higher-level tool or anything external to pacemaker, one such corner case is a "time-of-check/time-of-use" problem -- determining the list of active resources has to be done separately from configuring the bans, and it's possible the list could change in the meantime.
>
>>>> This is the standby mode.
>>
>> Standby mode will stop all resources on a node, but it doesn't prevent recovery elsewhere.
>
>> Yes, I was just commenting on Ulrich's description (history context crop'ed here).
>
>--
>Ken Gaillot

Hi Ken,

Can you tell me the logic of that feature? So far it looks like:

1. Resources/groups that will be affected by the feature are marked.
2. Resources/groups are stopped (target-role=stopped).
3. The node exits the cluster cleanly when no resources are running any more.
4. The node rejoins the cluster after the reboot.
5. A positive (on the rebooted node) & negative (ban on the rest of the nodes) constraint is created for the resources marked in step 1.
6. target-role is set back to started and the resources are back up and running.
7. When each resource group (or standalone resource) is back online, the mark from step 1 is removed and any location constraints (cli-ban & cli-prefer) are removed for the resource/group.

Yet, if that feature attracts more end users (or even enterprises), I think it will be positive for the stack.

Best Regards,
Strahil Nikolov
[ClusterLabs] SBD on shared disk
Hello Community,

I'm preparing for my EX436 and I was wondering if there are any drawbacks if a shared LUN is split into 2 partitions, where the first partition is used for SBD and the second one for a shared file system (either XFS for active/passive, or GFS2 for active/active).

Do you see any drawback in such an implementation? Thanks in advance.

Best Regards,
Strahil Nikolov
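For reference, the layout in question could be sketched like this (device path and sizes are placeholders; SBD itself only needs a few MiB, and the data partition would then be formatted with XFS or GFS2):

```shell
# Partition the shared LUN: a tiny first partition for SBD, the rest for data
parted -s /dev/disk/by-id/SHARED-LUN mklabel gpt \
    mkpart sbd 1MiB 10MiB \
    mkpart data 10MiB 100%

# Initialize the SBD header on the first partition and verify it
sbd -d /dev/disk/by-id/SHARED-LUN-part1 create
sbd -d /dev/disk/by-id/SHARED-LUN-part1 dump
```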
Re: [ClusterLabs] Why Do Nodes Leave the Cluster?
On February 6, 2020 4:18:15 AM GMT+02:00, Eric Robinson wrote:
>Hi Strahil --
>
>I think you may be right about the token timeouts being too short. I've also noticed that periods of high load can cause drbd to disconnect. What would you recommend as changes to the timeouts?
>
>I'm running Red Hat's Corosync Cluster Engine, version 2.4.3. The config is relatively simple. The corosync config looks like this...
>
>totem {
>    version: 2
>    cluster_name: 001db01ab
>    secauth: off
>    transport: udpu
>}
>
>nodelist {
>    node {
>        ring0_addr: 001db01a
>        nodeid: 1
>    }
>    node {
>        ring0_addr: 001db01b
>        nodeid: 2
>    }
>}
>
>quorum {
>    provider: corosync_votequorum
>    two_node: 1
>}
>
>logging {
>    to_logfile: yes
>    logfile: /var/log/cluster/corosync.log
>    to_syslog: yes
>}
>
>From: Users On Behalf Of Strahil Nikolov
>Sent: Wednesday, February 5, 2020 6:39 PM
>To: Cluster Labs - All topics related to open-source clustering welcomed; Andrei Borzenkov
>Subject: Re: [ClusterLabs] Why Do Nodes Leave the Cluster?
>
>Hi Andrei,
>
>don't trust Azure so much :D . I've seen stuff that was way more unbelievable.
>Can you check whether other systems in the same subnet reported any issues? Yet, pcs most probably won't report any short-term issues. I have noticed that the RHEL7 defaults for token and consensus are quite small, and any short-term disruption could cause an issue.
>Actually, when I tested live migration on oVirt, the other hosts fenced the node that was being migrated.
>What is your corosync config and OS version?
>
>Best Regards,
>Strahil Nikolov
>
>On Thursday, 6 February 2020 at 01:44:55 GMT+2, Eric Robinson <eric.robin...@psmnv.com> wrote:
>
>Hi Strahil --
>
>I can't prove there was no network loss, but:
>
> 1. There were no dmesg indications of ethernet link loss.
> 2. Other than corosync, there are no other log messages about connectivity issues.
> 3. Wouldn't pcsd say something about connectivity loss?
> 4. Both servers are in Azure.
> 5. There are many other servers in the same Azure subscription, including other corosync clusters, none of which had issues.
>
>So I guess it's possible, but it seems unlikely.
>
>--Eric
>
>From: Users <users-boun...@clusterlabs.org> On Behalf Of Strahil Nikolov
>Sent: Wednesday, February 5, 2020 3:13 PM
>To: Cluster Labs - All topics related to open-source clustering welcomed <users@clusterlabs.org>; Andrei Borzenkov <arvidj...@gmail.com>
>Subject: Re: [ClusterLabs] Why Do Nodes Leave the Cluster?
>
>Hi Eric,
>
>what has led you to think that there was no network loss?
>
>Best Regards,
>Strahil Nikolov
>
>On Wednesday, 5 February 2020 at 22:59:56 GMT+2, Eric Robinson <eric.robin...@psmnv.com> wrote:
>
>>> -----Original Message-----
>>> From: Users <users-boun...@clusterlabs.org> On Behalf Of Strahil Nikolov
>>> Sent: Wednesday, February 5, 2020 1:59 PM
>>> To: Andrei Borzenkov <arvidj...@gmail.com>; users@clusterlabs.org
>>> Subject: Re: [ClusterLabs] Why Do Nodes Leave the Cluster?
>>>
>>> On February 5, 2020 8:14:06 PM GMT+02:00, Andrei Borzenkov <arvidj...@gmail.com> wrote:
>>> >On 05.02.2020 20:55, Eric Robinson wrote:
>>> >> The two servers 001db01a and 001db01b were up and responsive. Neither had been rebooted and neither were under heavy load. There's no indication in the logs of loss of network connectivity. Any ideas on why both nodes seem to think the other one is at fault?
>>> >
>>> >The very fact that nodes lost connection to each other *is* an indication of network problems. Your logs start too late, after any problem already happened.
>>> >
>>> >> (Yes, it's a 2-node cluster without quorum. A 3-node cluster is not an option at this time.)
>>> >>
>>> >> Log from 001db01a:
>>> >>
>>> >> Feb 5 08:01:02 001db01a corosync[1306]: [TOTEM ] A processor failed, forming new configuration.
>>> >> Feb 5 08:01:03 001db01a corosync[1306]: [TOTEM ] A new membership (10.51.14.33:960) was formed. Members left: 2
>>> >> Feb 5 08:01:03 001db01a corosync[1306]: [TOTEM ] Failed to receive the leave message.
Re: [ClusterLabs] Why Do Nodes Leave the Cluster?
On February 6, 2020 7:35:53 PM GMT+02:00, Eric Robinson wrote:
>Hi Nikolov --
>
>> Defaults are 1s token, 1.2s consensus, which is too small.
>> In SUSE, token is 10s, while consensus is 1.2 * token -> 12s.
>> With these settings, the cluster will not react for 22s.
>>
>> I think it's a good start for your cluster.
>> Don't forget to put the cluster in maintenance (pcs property set maintenance-mode=true) before restarting the stack, or even better, get some downtime.
>>
>> You can use the following article to run a simulation before removing the maintenance:
>> https://www.suse.com/support/kb/doc/?id=7022764
>
>Thanks for the suggestions. Any thoughts on timeouts for DRBD?
>
>--Eric

Hi Eric,

The timeouts can be treated as 'how much time to wait before taking any action'. The workload is not very important (HANA is something different). You can try with 10s (token) and 12s (consensus), and adjust if needed.

Warning: use a 3-node cluster, or at least 2 DRBD nodes + a qdisk. A 2-node cluster is vulnerable to split brain, especially when one of the nodes is syncing (for example after patching) and the sync source is fenced/lost/disconnected. It's very hard to extract data from a semi-synced DRBD.

Also, if you need guidance for SELinux, I can point you to my guide in the CentOS forum.

Best Regards,
Strahil Nikolov
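Applied to the corosync.conf quoted earlier in this thread, the suggested values would look roughly like this (a sketch; corosync timeouts are given in milliseconds, and the rest of the file stays unchanged):

```
totem {
    version: 2
    cluster_name: 001db01ab
    secauth: off
    transport: udpu
    # suggested: token 10s, consensus = 1.2 * token
    token: 10000
    consensus: 12000
}
```

After editing the file on both nodes, the stack has to be restarted (with the cluster in maintenance mode, as noted above) for the new totem values to take effect.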
Re: [ClusterLabs] Fedora 31 - systemd based resources don't start
On February 20, 2020 12:49:43 PM GMT+02:00, Maverick wrote:
>> You really need to debug the start & stop of the resource.
>>
>> Please try the debug procedure and provide the output:
>> https://wiki.clusterlabs.org/wiki/Debugging_Resource_Failures
>>
>> Best Regards,
>> Strahil Nikolov
>
>Hi,
>
>Correct me if I'm wrong, but I think that procedure doesn't work for systemd class resources; I don't know which OCF script is responsible for handling systemd class resources.
>
>Also, the crm command doesn't exist in RHEL/Fedora; I've seen the crm command only in SUSE.
>
>On 19/02/2020 19:23, Strahil Nikolov wrote:
>> On February 19, 2020 7:21:12 PM GMT+02:00, Maverick wrote:
>>> How is it possible that pacemaker reports that it takes 4.2 minutes (254930ms) to execute the start of the httpd systemd unit?
>>>
>>> Feb 19 17:04:09 boss1 pacemaker-execd [1514] (log_execute) info: executing - rsc:apache action:start call_id:25
>>> Feb 19 17:04:09 boss1 pacemaker-execd [1514] (systemd_unit_exec) debug: Performing asynchronous start op on systemd unit httpd named 'apache'
>>> Feb 19 17:04:09 boss1 pacemaker-execd [1514] (systemd_unit_exec_with_unit) debug: Calling StartUnit for apache: /org/freedesktop/systemd1/unit/httpd_2eservice
>>> Feb 19 17:04:10 boss1 pacemaker-execd [1514] (action_complete) notice: Giving up on apache start (rc=0): timeout (elapsed=254930ms, remaining=-154930ms)
>>> Feb 19 17:04:10 boss1 pacemaker-execd [1514] (log_finished) debug: finished - rsc:apache action:monitor call_id:25 exit-code:198 exec-time:254935ms queue-time:235ms
>>>
>>> Starting manually works fine and fast:
>>>
>>> # time systemctl start httpd
>>> real 0m0.144s
>>> user 0m0.005s
>>> sys  0m0.008s
>>>
>>> On 17/02/2020 22:47, Mvrk wrote:
>>>> In the attachment is the pacemaker.log. In the log I can see that the cluster tries to start, the start fails, then it tries to stop, and the stop fails also.
>>>>
>>>> One more thing: my cluster was working fine on Fedora 28; I started having this problem after the upgrade to Fedora 31.
>>>>
>>>> On 17/02/2020 21:30, Ricardo Esteves wrote:
>>>>> Hi,
>>>>>
>>>>> Yes, I also don't understand why it's trying to stop them first.
>>>>>
>>>>> SELinux is disabled:
>>>>>
>>>>> # getenforce
>>>>> Disabled
>>>>>
>>>>> All systemd services controlled by the cluster are disabled from starting at boot:
>>>>>
>>>>> # systemctl is-enabled httpd
>>>>> disabled
>>>>>
>>>>> # systemctl is-enabled openvpn-server@01-server
>>>>> disabled
>>>>>
>>>>> On 17/02/2020 20:28, Ken Gaillot wrote:
>>>>>> On Mon, 2020-02-17 at 17:35 +0000, Maverick wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> When i start my cluster, most of my systemd resources won't start:
>>>>>>>
>>>>>>> Failed Resource Actions:
>>>>>>> * apache_stop_0 on boss1 'OCF_TIMEOUT' (198): call=82, status='Timed Out', exitreason='', last-rc-change='1970-01-01 01:00:54 +01:00', queued=29ms, exec=197799ms
>>>>>>> * openvpn_stop_0 on boss1 'OCF_TIMEOUT' (198): call=61, status='Timed Out', exitreason='', last-rc-change='1970-01-01 01:00:54 +01:00', queued=1805ms, exec=198841ms
>>>>>>
>>>>>> These show that attempts to stop failed, rather than start.
>>>>>>
>>>>>>> So every time i reboot my node, i need to start the resources manually using systemd, for example:
>>>>>>>
>>>>>>> systemd start apache
>>>>>>>
>>>>>>> and then pcs resource cleanup
>>>>>>>
>>>>>>> Resources configuration:
>>>>>>>
>>>>>>> Clone: apache-clone
>>>>>>> Meta Attrs: maintenance=false
>>>>>>>
Re: [ClusterLabs] Fedora 31 - systemd based resources don't start
On February 20, 2020 9:35:07 PM GMT+02:00, Maverick wrote: > >Manually it starts ok, no problems: > >pcs resource debug-start apache --full >(unpack_config) warning: Blind faith: not fencing unseen nodes >Operation start for apache (systemd::httpd) returned: 'ok' (0) > > >On 20/02/2020 16:46, Strahil Nikolov wrote: >> On February 20, 2020 12:49:43 PM GMT+02:00, Maverick >wrote: >>>> You really need to debug the start & stop of tthe resource . >>>> >>>> Please try the debug procedure and provide the output: >>>> https://wiki.clusterlabs.org/wiki/Debugging_Resource_Failures >>>> >>>> Best Regards, >>>> Strahil Nikolov >>> >>> Hi, >>> >>> Correct me if i'm wrong, but i think that procedure doesn't work for >>> systemd class resources, i don't know which OCF script is >responsible >>> for handling systemd class resources. >>> >>> Also crm command doesn't exist in RHEL/Fedora, i've seen the crm >>> command >>> only in SUSE. >>> >>> >>> >>> On 19/02/2020 19:23, Strahil Nikolov wrote: >>>> On February 19, 2020 7:21:12 PM GMT+02:00, Maverick >>> wrote: >>>>> How is it possible that pacemaker is reporting that takes 4.2 >>> minutes >>>>> (254930ms) to execute the start of httpd systemd unit? 
>>>>> >>>>> Feb 19 17:04:09 boss1 pacemaker-execd [1514] (log_execute) >>>>> info: >>>>> executing - rsc:apache action:start call_id:25 >>>>> Feb 19 17:04:09 boss1 pacemaker-execd [1514] >(systemd_unit_exec) >>>>> >>>>> debug: Performing asynchronous start op on systemd unit httpd >named >>>>> 'apache' >>>>> Feb 19 17:04:09 boss1 pacemaker-execd [1514] >>>>> (systemd_unit_exec_with_unit) debug: Calling StartUnit for >>> apache: >>>>> /org/freedesktop/systemd1/unit/httpd_2eservice >>>>> Feb 19 17:04:10 boss1 pacemaker-execd [1514] (action_complete) >>> >>>>> notice: Giving up on apache start (rc=0): timeout >(elapsed=254930ms, >>>>> remaining=-154930ms) >>>>> Feb 19 17:04:10 boss1 pacemaker-execd [1514] (log_finished) > >>>>> debug: finished - rsc:apache action:monitor call_id:25 >>> exit-code:198 >>>>> exec-time:254935ms queue-time:235ms >>>>> >>>>> >>>>> Starting manually works fine and fast: >>>>> >>>>> # time systemctl start httpd >>>>> real 0m0.144s >>>>> user 0m0.005s >>>>> sys 0m0.008s >>>>> >>>>> >>>>> On 17/02/2020 22:47, Mvrk wrote: >>>>>> In attachment the pacemaker.log. On the log i can see that the >>>>> cluster >>>>>> tries to start, the start fails, then tries to stop, and the stop >>>>> also >>>>>> fails also. >>>>>> >>>>>> One more thing, my cluster was working fine on Fedora 28, i >started >>>>>> having this problem after upgrade to Fedora 31. >>>>>> >>>>>> On 17/02/2020 21:30, Ricardo Esteves wrote: >>>>>>> Hi, >>>>>>> >>>>>>> Yes, i also don't understand why is trying to stop them first. 
>>>>>>> >>>>>>> SELinux is disabled: >>>>>>> >>>>>>> # getenforce >>>>>>> Disabled >>>>>>> >>>>>>> All systemd services controlled by the cluster are disabled from >>>>>>> starting at boot: >>>>>>> >>>>>>> # systemctl is-enabled httpd >>>>>>> disabled >>>>>>> >>>>>>> # systemctl is-enabled openvpn-server@01-server >>>>>>> disabled >>>>>>> >>>>>>> >>>>>>> On 17/02/2020 20:28, Ken Gaillot wrote: >>>>>>>> On Mon, 2020-02-17 at 17:35 +, Maverick wrote: >>>>>>>>> Hi, >>>>>>>>> >>>>>>>>> When i start my cluster, most of my systemd resources won't >>> start: >>>>>>>>> Failed Resource Actions: >>>>>>>>> * apache_stop_0 on boss1 'OCF_TIMEOUT' (198): call=82
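As a side note on the numbers in the thread above: pacemaker-execd reports remaining = configured timeout - elapsed, so the two figures in the "Giving up on apache start" line reveal the operation timeout that was actually in effect. A quick sanity check in plain shell, with the values copied from the log (this is just arithmetic on the reported numbers, not a claim about pacemaker defaults):

```shell
# Numbers taken from the "Giving up on apache start" log line above
elapsed=254930      # ms counted by pacemaker-execd
remaining=-154930   # ms left when it gave up
# remaining = timeout - elapsed, so the configured timeout must have been:
timeout=$((elapsed + remaining))
echo "configured start timeout: ${timeout} ms"   # i.e. a 100s start timeout was in effect
```

If the configured timeout checks out as sane, the huge elapsed value itself (4+ minutes for a start that systemd completes in under a second) is the anomaly worth chasing.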
Re: [ClusterLabs] "apache httpd program not found" "environment is invalid, resource considered stopped"
On February 19, 2020 6:27:54 PM GMT+02:00, Paul Alberts wrote: I hope that the URL is wrong due to copy/paste: httpd://local/host:1090/server-status Otherwise, check the protocol. As the status URL should be reachable only from 127.0.0.1, you can use 'http' instead. Best Regards, Strahil Nikolov ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
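For reference, a localhost-only status page as suggested above might look like this in the httpd configuration (a hypothetical snippet; the location and file name are illustrative, not taken from the original poster's setup):

```apache
# Hypothetical /etc/httpd/conf.d/status.conf - restrict mod_status to localhost
<Location "/server-status">
    SetHandler server-status
    Require local          # only 127.0.0.1 / ::1 may query it
</Location>
```

The ocf:heartbeat:apache resource's statusurl parameter would then point at plain http://127.0.0.1/server-status, matching the "use 'http' instead" advice.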
Re: [ClusterLabs] Ugrading Ubuntu 14.04 to 16.04 with corosync/pacemaker failed
On February 19, 2020 6:31:19 PM GMT+02:00, Rasca wrote: >Hi, > >we run a 2-system cluster for Samba with Ubuntu 14.04 and Samba, >Corosync and Pacemaker from the Ubuntu repos. We wanted to update >to Ubuntu 16.04 but it failed: > >I checked the versions before and because of just minor updates >of corosync and pacemaker I thought it should be possible to >update node by node. > >* Put srv2 into standby >* Upgraded srv2 to Ubuntu 16.04 with reboot and so on >* Added a nodelist to corosync.conf because it looked > like corosync on srv2 didn't know the names of the > node ids anymore > >But still it does not work on srv2. srv1 (the active >server with ubuntu 14.04) ist fine. It looks like >it's an upstart/systemd issue, but may be even more. >Why does srv1 says UNCLEAN about srv2? On srv2 I see >corosync sees both systems. But srv2 says srv1 is >OFFLINE!? > >crm status > > >srv1 >Last updated: Wed Feb 19 17:22:03 2020 >Last change: Tue Feb 18 11:05:47 2020 via crm_attribute on srv2 >Stack: corosync >Current DC: srv1 (1084766053) - partition with quorum >Version: 1.1.10-42f2063 >2 Nodes configured >9 Resources configured > > >Node srv2 (1084766054): UNCLEAN (offline) >Online: [ srv1 ] > > Resource Group: samba_daemons > samba-nmbd(upstart:nmbd): Started srv1 >[..] > > >srv2 >Last updated: Wed Feb 19 17:25:14 2020 Last change: Tue Feb 18 >18:29:29 >2020 by hacluster via crmd on srv2 >Stack: corosync >Current DC: srv2 (version 1.1.14-70404b0) - partition with quorum >2 nodes and 9 resources configured > >Node srv2: standby >OFFLINE: [ srv1 ] > >Full list of resources: > > Resource Group: samba_daemons > samba-nmbd(upstart:nmbd): Stopped >[..] > >Failed Actions: >* samba-nmbd_monitor_0 on srv2 'not installed' (5): call=5, status=Not >installed, exitreason='none', >last-rc-change='Wed Feb 19 14:13:20 2020', queued=0ms, exec=1ms >[..] > > >Any suggestions, ideas? Is the a nice HowTo for this upgrade situation? 
> >Regards, > Rasca Are you sure that there is no cluster protocol mismatch? A major OS version upgrade (even if supported by the vendor) must be done offline (with proper testing in advance). What happens when you upgrade the other node, or when you roll back the upgrade? Best Regards, Strahil Nikolov
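A protocol mismatch as suspected above is easiest to rule out by comparing the stack versions on both nodes first. On a live node you would obtain them with `pacemakerd --features` and `corosync -v`; the sketch below hard-codes the two pacemaker versions reported in this thread (1.1.10 on srv1, 1.1.14 on srv2) to illustrate the check:

```shell
# Versions as reported in the crm status output above (assumed, not queried live)
v_srv1="1.1.10"
v_srv2="1.1.14"
# A rolling node-by-node upgrade needs compatible on-wire/CIB feature sets;
# a differing minor version is exactly where mismatches tend to bite.
if [ "$v_srv1" = "$v_srv2" ]; then
    echo "versions match - rolling upgrade is plausible"
else
    echo "version mismatch ($v_srv1 vs $v_srv2) - check the upgrade compatibility notes"
fi
```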
Re: [ClusterLabs] Fedora 31 - systemd based resources don't start
On February 20, 2020 10:29:54 PM GMT+02:00, Maverick wrote: > >> Hi Maverick, >> >> >> According this thread: >> >https://lists.clusterlabs.org/pipermail/users/2016-December/021053.html >> >> You have 'startup-fencing' is set to false. >> >> Check it out - maybe this is your reason. >> >> Best Regards, >> Strahil Nikolov > >Yes, i have stonith disabled, because as soon as the resources startup >fail on boot, node was rebooted. > > >Anyway, i was checking the pacemaker logs and the journal log, and i >see >that the service actually starts ok but for some reason pacemaker >thinks >it has timeout and then because of that tries to stop and also thinks >it >has timeout but actually stops it: > >pacemaker.log: > >Feb 20 19:39:52 boss1 pacemaker-execd [1499] (log_execute) info: >executing - rsc:apache action:start call_id:25 >Feb 20 19:39:52 boss1 pacemaker-execd [1499] (systemd_unit_exec) >debug: Performing asynchronous start op on systemd unit httpd named >'apache' >Feb 20 19:39:52 boss1 pacemaker-execd [1499] >(systemd_unit_exec_with_unit) debug: Calling StartUnit for apache: >/org/freedesktop/systemd1/unit/httpd_2eservice >Feb 20 19:39:52 boss1 pacemaker-execd [1499] (action_complete) >notice: Giving up on apache start (rc=0): timeout (elapsed=248199ms, >remaining=-148199ms) >Feb 20 19:39:52 boss1 pacemaker-execd [1499] (log_finished) >debug: finished - rsc:apache action:monitor call_id:25 exit-code:198 >exec-time:248205ms queue-time:216ms > >Feb 20 19:40:00 boss1 pacemaker-execd [1499] (log_execute) info: >executing - rsc:apache action:stop call_id:81 >Feb 20 19:40:00 boss1 pacemaker-execd [1499] (systemd_unit_exec) >debug: Performing asynchronous stop op on systemd unit httpd named >'apache' >Feb 20 19:40:00 boss1 pacemaker-execd [1499] >(systemd_unit_exec_with_unit) debug: Calling StopUnit for apache: >/org/freedesktop/systemd1/unit/httpd_2eservice >Feb 20 19:40:01 boss1 pacemaker-execd [1499] (action_complete) >notice: Giving up on apache stop (rc=0): timeout 
(elapsed=304539ms, >remaining=-204539ms) >Feb 20 19:40:01 boss1 pacemaker-execd [1499] (log_finished) >debug: finished - rsc:apache action:monitor call_id:81 exit-code:198 >exec-time:304545ms queue-time:240ms > > >system journal: > >Feb 20 19:39:52 boss1 systemd[1]: Starting Cluster Controlled httpd... >Feb 20 19:39:53 boss1 systemd[1]: Started Cluster Controlled httpd. >Feb 20 19:39:53 boss1 httpd[2145]: Server configured, listening on: >port >443, port 80 > >Feb 20 19:40:01 boss1 systemd[1]: Stopping The Apache HTTP Server... >Feb 20 19:40:02 boss1 systemd[1]: httpd.service: Succeeded. >Feb 20 19:40:02 boss1 systemd[1]: Stopped The Apache HTTP Server. > > > > >On 20/02/2020 21:02, Strahil Nikolov wrote: >> On February 20, 2020 9:35:07 PM GMT+02:00, Maverick >wrote: >>> Manually it starts ok, no problems: >>> >>> pcs resource debug-start apache --full >>> (unpack_config) warning: Blind faith: not fencing unseen nodes >>> Operation start for apache (systemd::httpd) returned: 'ok' (0) >>> >>> >>> On 20/02/2020 16:46, Strahil Nikolov wrote: >>>> On February 20, 2020 12:49:43 PM GMT+02:00, Maverick >>> wrote: >>>>>> You really need to debug the start & stop of tthe resource . >>>>>> >>>>>> Please try the debug procedure and provide the output: >>>>>> https://wiki.clusterlabs.org/wiki/Debugging_Resource_Failures >>>>>> >>>>>> Best Regards, >>>>>> Strahil Nikolov >>>>> Hi, >>>>> >>>>> Correct me if i'm wrong, but i think that procedure doesn't work >for >>>>> systemd class resources, i don't know which OCF script is >>> responsible >>>>> for handling systemd class resources. >>>>> >>>>> Also crm command doesn't exist in RHEL/Fedora, i've seen the crm >>>>> command >>>>> only in SUSE. >>>>> >>>>> >>>>> >>>>> On 19/02/2020 19:23, Strahil Nikolov wrote: >>>>>> On February 19, 2020 7:21:12 PM GMT+02:00, Maverick > >>>>> wrote: >>>>>>> How is it possible that pacemaker is reporting that takes 4.2 >>>>> minutes >>>>>>> (254930ms) to execute the start of httpd systemd unit? 
>>>>>>> >>>>>>> Feb 19 17:04:09 boss1 pacemaker-execd [1514] (log_execute)
Re: [ClusterLabs] How to unfence without reboot (fence_mpath)
wondering if the setup is correctly done. Storage in this test setup is a Highly Available iSCSI Cluster on top of DRBD /RHEL 7 again/, and it seems that SCSI Reservations Support is OK. Best Regards, Strahil Nikolov On Sunday, February 16, 2020, 23:11:40 GMT-5, Ondrej wrote: Hello Strahil, On 2/17/20 11:54 AM, Strahil Nikolov wrote: > Hello Community, > > This is my first interaction with pacemaker and SCSI reservations and I was > wondering how to unfence a node without rebooting it ? For a first encounter with SCSI reservations I would recommend 'fence_scsi' over 'fence_mpath', for the reason that it is easier to configure :) If everything works correctly then a simple restart of the cluster on the fenced node should be enough. Side NOTE: There was a discussion last year about a change that introduced the ability to choose what happens when a node is fenced by a storage-based fence agent (like fence_mpath/fence_scsi); as of now this defaults to 'shut down the cluster'. In newer pacemaker versions there is an option that can change this to 'shut down the cluster and panic the node, making it reboot'. > I tried to stop & start the cluster stack - it just powers off itself. > Adding the reservation before starting the cluster stack - same. It sounds like maybe after the start the node was fenced again, or at least fencing was attempted. Are there any errors (/var/log/cluster/corosync.log or similar) in the logs about fencing/stonith from around the time when the cluster is started again on the node? > Only a reboot works. What does the state of the cluster look like on the living node when the other node is fenced? I wonder if the fenced node is reported as Offline or UNCLEAN - you can use 'crm_mon -1f' to get the current cluster state on the living node, including the failures. > > Thanks for answering my question.
> > > Best Regards, > Strahil Nikolov -- Ondrej Famera
Re: [ClusterLabs] How to unfence without reboot (fence_mpath)
On February 17, 2020 3:36:27 PM GMT+02:00, Ondrej wrote: >Hello Strahil, > >On 2/17/20 3:39 PM, Strahil Nikolov wrote: >> Hello Ondrej, >> >> thanks for your reply. I really appreciate that. >> >> I have picked fence_multipath as I'm preparing for my EX436 and I >can't know what agent will be useful on the exam. >> Also ,according to https://access.redhat.com/solutions/3201072 , >there could be a race condition with fence_scsi. > >I believe that exam is about testing knowledge in configuration and not > >testing knowledge in knowing which race condition bugs are present and >how to handle them :) >If you have access to learning materials for EX436 exam I would >recommend trying those ones out - they have labs and comprehensive >review exercises that are useful in preparation for exam. > >> So, I've checked the cluster when fencing and the node immediately >goes offline. >> Last messages from pacemaker are: >> >> Feb 17 08:21:57 node1.localdomain stonith-ng[23808]: notice: Client >stonith_admin.controld.23888.b57ceee7 wants to fence (reboot) >'node1.localdomain' with device '(any)' >> Feb 17 08:21:57 node1.localdomain stonith-ng[23808]: notice: >Requesting peer fencing (reboot) of node1.localdomain >> Feb 17 08:21:57 node1.localdomain stonith-ng[23808]: notice: >FENCING can fence (reboot) node1.localdomain (aka. '1'): static-list >> Feb 17 08:21:58 node1.localdomain stonith-ng[23808]: notice: >Operation reboot of node1.localdomain by node2.localdomain for >stonith_admin.controld.23888@node1.localdomain.ede38ffb: OK >- This part looks OK - meaning the fencing looks like a success. >> Feb 17 08:21:58 node1.localdomain crmd[23812]: crit: We were >allegedly just fenced by node2.localdomain for node1.localdomai >- this is also normal as node just announces that it was fenced by >other >node > >> >> >> Which for me means - node1 just got fenced again. Actually fencing >works ,as I/O is immediately blocked and the reservation is removed. 
>> >> I've used https://access.redhat.com/solutions/2766611 to setup the >fence_mpath , but I could have messed up something. >- note related to exam: you will not have Internet on exam, so I would > >expect that you would have to configure something that would not >require >access to this (and as Dan Swartzendruber pointed out in other email - >we cannot* even see RH links without account) > >* you can get free developers account to read them, but ideally that >should be not needed and is certainly inconvenient for wide public >audience > >> >> Cluster config is: >> [root@node3 ~]# pcs config show >> Cluster Name: HACLUSTER2 >> Corosync Nodes: >> node1.localdomain node2.localdomain node3.localdomain >> Pacemaker Nodes: >> node1.localdomain node2.localdomain node3.localdomain >> >> Resources: >> Clone: dlm-clone >> Meta Attrs: interleave=true ordered=true >> Resource: dlm (class=ocf provider=pacemaker type=controld) >> Operations: monitor interval=30s on-fail=fence >(dlm-monitor-interval-30s) >> start interval=0s timeout=90 (dlm-start-interval-0s) >> stop interval=0s timeout=100 (dlm-stop-interval-0s) >> Clone: clvmd-clone >> Meta Attrs: interleave=true ordered=true >> Resource: clvmd (class=ocf provider=heartbeat type=clvm) >> Operations: monitor interval=30s on-fail=fence >(clvmd-monitor-interval-30s) >> start interval=0s timeout=90s >(clvmd-start-interval-0s) >> stop interval=0s timeout=90s (clvmd-stop-interval-0s) >> Clone: TESTGFS2-clone >> Meta Attrs: interleave=true >> Resource: TESTGFS2 (class=ocf provider=heartbeat type=Filesystem) >> Attributes: device=/dev/TEST/gfs2 directory=/GFS2 fstype=gfs2 >options=noatime run_fsck=no >> Operations: monitor interval=15s on-fail=fence OCF_CHECK_LEVEL=20 >(TESTGFS2-monitor-interval-15s) >> notify interval=0s timeout=60s >(TESTGFS2-notify-interval-0s) >> start interval=0s timeout=60s >(TESTGFS2-start-interval-0s) >> stop interval=0s timeout=60s >(TESTGFS2-stop-interval-0s) >> >> Stonith Devices: >> Resource: FENCING 
(class=stonith type=fence_mpath) >> Attributes: devices=/dev/mapper/36001405cb123d000 >pcmk_host_argument=key >pcmk_host_map=node1.localdomain:1;node2.localdomain:2;node3.localdomain:3 >pcmk_monitor_action=metadata pcmk_reboot_action=off >> Meta Attrs: provides=unfencing >> Operations: monitor interval=60s (FENCING-monitor-inte
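The FENCING configuration above encodes each node's reservation key in pcmk_host_map, and manual unfencing with fence_mpath needs that key. A small sketch extracting it (pure shell, using the map from the configuration above; the fence_mpath invocation is shown only as a comment and is a rough illustration, not a verified command line):

```shell
# pcmk_host_map as configured in the FENCING stonith resource above
map="node1.localdomain:1;node2.localdomain:2;node3.localdomain:3"
node="node1.localdomain"
# Pull this node's key out of the "host:key;host:key" list
key=$(printf '%s\n' "$map" | tr ';' '\n' | awk -F: -v n="$node" '$1 == n { print $2 }')
echo "reservation key for $node: $key"
# Manual re-registration (unfencing) would then look roughly like:
#   fence_mpath -o on -k "$key" -d /dev/mapper/36001405cb123d000
```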
[ClusterLabs] Saving secret locally
Hello Community, I have been using pacemaker for the last 2 years on SUSE, which uses crmsh, and now I struggle to recall some of the knowledge I had. The cluster is RHEL 7.7 on oVirt/RHV. Can someone tell me the pcs command that matches this one, as I don't want the password for the fencing user in the CIB: crm resource secret set I've been searching in pcs --help and on https://github.com/ClusterLabs/pacemaker/blob/master/doc/pcs-crmsh-quick-ref.md , but it seems it's not there or I can't find it. Thanks in advance. Best Regards, Strahil Nikolov
Re: [ClusterLabs] pcs stonith fence - Error: unable to fence
Sorry for the spam. I figured out that I forgot to specify the domain for 'drbd1' and thus it reacted like that. The strange thing is that pcs allows me to fence a node that is not in the cluster :) Do you think that this behaviour is a bug? If yes, I can open an issue upstream. Best Regards, Strahil Nikolov On Sunday, January 19, 2020, 00:01:11 GMT+2, Strahil Nikolov wrote: Hi All, I am building a test cluster with the fence_rhevm stonith agent on RHEL 7.7 and oVirt 4.3. When I fenced drbd3 from drbd1 using 'pcs stonith fence drbd3' - the fence action was successful. So then I decided to test the fencing the opposite way and it partially failed. 1. In oVirt the machine was powered off and then powered on properly - so the communication with the engine is OK 2. The command on drbd3 to fence drbd1 got stuck and was then reported as a failure, despite the VM being reset. Now 'pcs status' is reporting the following: Failed Fencing Actions: * reboot of drbd1 failed: delegate=drbd3.localdomain, client=stonith_admin.1706, origin=drbd3.localdomain, last-failed='Sat Jan 18 23:18:24 2020' My stonith is configured as follows: Stonith Devices: Resource: ovirt_FENCE (class=stonith type=fence_rhevm) Attributes: ipaddr=engine.localdomain login=fencerdrbd@internal passwd=I_have_replaced_that pcmk_host_map=drbd1.localdomain:drbd1;drbd2.localdomain:drbd2;drbd3.localdomain:drbd3 power_wait=3 ssl=1 ssl_secure=1 Operations: monitor interval=60s (ovirt_FENCE-monitor-interval-60s) Fencing Levels: Do I need to add some other settings to the fence_rhevm stonith agent? Manually running the status command from drbd2/drbd3 is OK: [root@drbd3 ~]# fence_rhevm -o status --ssl --ssl-secure -a engine.localdomain --username='fencerdrbd@internal' --password=I_have_replaced_that -n drbd1 Status: ON I'm attaching the logs from drbd2 (the DC) and drbd3. Thanks in advance for your suggestions.
Best Regards, Strahil Nikolov
Re: [ClusterLabs] Antw: [EXT] Multiple nfsserver resource groups
gt;>> >>> Resource group 2: >>> >>> * /dev/mapper/mpathd ‑ shared volume 3 >>> * /dev/mapper/mpathe ‑ shared volume 4 >>> * /dev/mapper/mpathf ‑ nfs_shared_infodir for resource group 2 >>> >>> Resource group 3: >>> >>> * /dev/mapper/mpathg ‑ shared volume 5 >>> * /dev/mapper/mpathh ‑ shared volume 6 >>> * /dev/mapper/mpathi ‑ nfs_shared_infodir for resource group 3 >>> >>> >>> >>> My concern is that when I run a df command on the active node, the >>> last ocf_heartbeat_nfsserver volume (/dev/mapper/mpathi) mounted to >> /var/lib/nfs. >>> I understand that I cannot change this, but I can change the >location >>> of >> the >>> rpc_pipefs folder. >>> >>> >>> >>> I have had this setup running with 2 resource groups in our >>> development environment, and have not noticed any issues, but since >>> we're planning to move to production and add a 3rd resource group, I > >>> want to make sure that this setup will not cause any issues. I am by > >>> no means an expert on NFS, so some insight is appreciated. >>> >>> >>> >>> If this kind of setup is not supported or recommended, I have 2 >>> alternate plans in mind: >>> >>> 1. Have all resources in the same resource group, in a setup that >will >>> look like this: >>> >>> a. 1x ocf_heartbeat_IPaddr2 resource for the Virtual IP that exposes >>> the NFS share. >>> b. 7x ocf_heartbeat_Filesystem resources (1 is for the >>> nfs_shared_infodir and 6 exposed via the NFS server) >>> c. 1x ocf_heartbeat_nfsserver resource that uses the aforementioned >>> nfs_shared_infodir. >>> d. 6x ocf_heartbeat_exportfs resources that expose the other 6 >>> filesystems as NFS shares. Use the clientspec option to restrict to >>> IPs and prevent unwanted mounts. >>> e. 1x ocf_heartbeat_nfsnotify resource that has the Virtual IP set >as >>> its own source_host. >>> >>> 2. Setup 2 more clusters to accommodate our needs >>> >>> >>> >>> I really want to avoid #2, due to the fact that it will be overkill >>> for our case. 
>> >> Things you might consider is to get reid of the groups and use >> explicit colocation and orderings. The advantages will be that you >can >> execute >several >> agents in parallel (e.g. prepare all fileysstems in parallel). In the > >> past >we >> had made the experience that exportfs resources can take quite some >> time and > >> if you have like 20 or more of them, it delays the shutdown/startup >> significatly. >> So we moved to using netgroups provided by LDAP instead, and we could > >> reduce > >> the number of exportfs statements drastically. >> However we have one odd problem (SLES12 SP4): The NFS resource using >> systemd > >> does not shut down clearly due some unmount issue related the shared >> info dir. >> >>> >>> Thanks >>> >>> >>> >>> Christoforos Christoforou >>> >>> Senior Systems Administrator >>> >>> Global Reach Internet Productions >>> >>> <http://www.twitter.com/globalreach> Twitter | >>> <http://www.facebook.com/globalreach> Facebook | >>> <https://www.linkedin.com/company/global‑reach‑internet‑productions> >>> LinkedIn >>> >>> p (515) 996‑0996 | <http://www.globalreach.com/> globalreach.com >>> >>> > > > > > > > >___ >Manage your subscription: >https://lists.clusterlabs.org/mailman/listinfo/users > >ClusterLabs home: https://www.clusterlabs.org/ Actually Red Hat's documentation uses the nfsnotify variant . Best Regards, Strahil Nikolov ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
Re: [ClusterLabs] temporary loss of quorum when member starts to rejoin
On April 8, 2020 8:32:59 PM GMT+03:00, Sherrard Burton wrote: > > >On 4/8/20 1:09 PM, Andrei Borzenkov wrote: >> 08.04.2020 10:12, Jan Friesse пишет: >>> Sherrard, >>> >>>> i could not determine which of these sub-threads to include this >in, >>>> so i am going to (reluctantly) top-post it. >>>> >>>> i switched the transport to udp, and in limited testing i seem to >not >>>> be hitting the race condition. of course i have no idea whether >this >>>> will behave consistently, or which part of the knet vs udp setup >makes >>>> the most difference. >>>> >>>> ie, is it the overhead of the crypto handshakes/setup? is there >some >>>> other knet layer that imparts additional delay in establishing >>>> connection to other nodes? is the delay on the rebooted node, the >>>> standing node, or both? >>>> >>> >>> Very high level, what is happening in corosync when using udpu: >>> - Corosync started and begins in gather state -> sends "multicast" >>> (emulated by unicast to all expected members) message telling "I'm >here >>> and this is my view of live nodes"). >>> - In this state, corosync waits for answers >>> - When node receives this message it "multicast" same message with >>> updated view of live nodes >>> - After all nodes agrees, they move to next state (commit/recovery >and >>> finally operational) >>> >>> With udp, this happens instantly so most of the time corosync >doesn't >>> even create single node membership, which would be created if no >other >>> nodes exists and/or replies wouldn't be delivered on time. >>> >> >> Is it possible to delay "creating single node membership" until some >> reasonable initial timeout after corosync starts to ensure node view >of >> cluster is up to date? It is clear that there will always be some >corner >> cases, but at least this would make "obviously correct" configuration >to >> behave as expected. >> >> Corosync already must have timeout to declare peers unreachable - it >> sounds like most logical to use in this case. 
>> > >i tossed that idea around in my head as well. basically if there was an > >analogue client_leaving called client_joining that could be used to >allowed the qdevice to return 'ask later'. > >i think the trade-off here is that you sacrifice some responsiveness in > >your failover times, since (i'm guessing) the timeout for declaring >peers unreachable errors on the side of caution. > >the other hairy bit is determining the difference between a new >(illegitimate) single-node membership, and the existing (legitimate) >single-node membership. both are equally legitimate from the standpoint > >of each client, which can see the qdevice, but not the peer, and from >the standpoint of the qdevice, which can see both clients. > >as such, i suspect that this all comes right back to figuring out how >to >implement issue #7. > > >>> >>> Knet adds a layer which monitors links between each of the node and >it >>> will make line active after it received configured number of "pong" >>> packets. Idea behind is to have evidence of reasonable stable line. >As >>> long as line is not active no data packet goes thru (corosync >traffic is >>> just "data"). This basically means, that initial corosync multicast >is >>> not delivered to other nodes so corosync creates single node >membership. >>> After line becomes active "multicast" is delivered to other nodes >and >>> they move to gather state. >>> >> >> I would expect "reasonable timeout" to also take in account knet >delay. >> >>> So to answer you question. "Delay" is on both nodes side because >link is >>> not established between the nodes. >>> >> >> knet was expected to improve things, was not it? :) >> >___ >Manage your subscription: >https://lists.clusterlabs.org/mailman/listinfo/users > >ClusterLabs home: https://www.clusterlabs.org/ I would have increased the consensus with several seconds. 
Best Regards, Strahil Nikolov
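Raising the consensus timeout as suggested is done in the totem section of corosync.conf; the values below are illustrative only (per corosync.conf(5), consensus defaults to 1.2 * token when unset, and must remain larger than token):

```
totem {
    version: 2
    token: 3000        # ms; illustrative value, not from the thread
    consensus: 6000    # ms; several seconds above the 1.2 * token default
}
```

A larger consensus gives a rejoining node more time to be heard before the remaining nodes settle on a membership, at the cost of slower membership formation after a real failure.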
Re: [ClusterLabs] When the active node enters the standby state, what should be done to make the VIP not automatically jump
On April 15, 2020 2:28:33 PM GMT+03:00, "邴洪涛" <695097494p...@gmail.com> wrote: >hi: > We now have a strange requirement. When the active node enters standby >mode, virtual_ip must not automatically move to the healthy node, but >should instead require a manual operation to move virtual_ip > The mode we use is Active/Passive mode > The Resource Agent we use is ocf:heartbeat:IPaddr2 > Hope you can solve my confusion Hello, Can you provide the version of the stack, your config, and the command you run to put the node in standby? Best Regards, Strahil Nikolov
Re: [ClusterLabs] NFS in different subnets
https://github.com/ClusterLabs/OCF-spec/blob/master/ra/1.0/resource-agent-api.md >> >> digimer >> >> >> -- >> Digimer >> Papers and Projects: >> https://alteeve.com/w/ >> "I am, somehow, less interested in the weight and convolutions of >> Einstein's brain than in the near certainty that people of equal >talent >> have lived and died in cotton fields and sweatshops." - Stephen Jay >Gould > > >-- >Digimer >Papers and Projects: https://alteeve.com/w/ >"I am, somehow, less interested in the weight and convolutions of >Einstein's brain than in the near certainty that people of equal talent >have lived and died in cotton fields and sweatshops." - Stephen Jay >Gould I don't get something. Why can't this be done? One node is in siteA, one in siteB, qnet on a third location. Routing between the 2 subnets is established and symmetrical. Fencing via IPMI or SBD (for example from an HA iSCSI cluster) is configured. The NFS resource is started on 1 node and a special RA is used for the DNS records. If node1 dies, the cluster will fence it and node2 will power up the NFS and update the records. Of course, updating DNS from only 1 side must work for both sites. Best Regards, Strahil Nikolov
Re: [ClusterLabs] Verifying DRBD Run-Time Configuration
On April 12, 2020 10:58:39 AM GMT+03:00, Eric Robinson wrote: >> -Original Message- >> From: Strahil Nikolov >> Sent: Sunday, April 12, 2020 2:54 AM >> To: Cluster Labs - All topics related to open-source clustering >welcomed >> ; Eric Robinson >> Subject: Re: [ClusterLabs] Verifying DRBD Run-Time Configuration >> >> On April 11, 2020 6:17:14 PM GMT+03:00, Eric Robinson >> wrote: >> >If I want to know the current DRBD runtime settings such as timeout, >> >ping-int, or connect-int, how do I check that? I'm assuming they may >> >not be the same as what shows in the config file. >> > >> >--Eric >> > >> > >> > >> > >> >Disclaimer : This email and any files transmitted with it are >> >confidential and intended solely for intended recipients. If you are >> >not the named addressee you should not disseminate, distribute, copy >or >> >alter this email. Any views or opinions presented in this email are >> >solely those of the author and might not represent those of >Physician >> >Select Management. Warning: Although Physician Select Management has >> >taken reasonable precautions to ensure no viruses are present in >this >> >email, the company cannot accept responsibility for any loss or >damage >> >arising from the use of this email or attachments. >> >> You can get everything the cluster know via 'cibadmin -Q >> /tmp/cluster_conf.xml' >> >> Then you can examine it. >> > >As usual, I guess there's more than one way to get things. Someone >suggested 'drbdsetup show --show-defaults' and that works great. > > >Disclaimer : This email and any files transmitted with it are >confidential and intended solely for intended recipients. If you are >not the named addressee you should not disseminate, distribute, copy or >alter this email. Any views or opinions presented in this email are >solely those of the author and might not represent those of Physician >Select Management. 
Warning: Although Physician Select Management has >taken reasonable precautions to ensure no viruses are present in this >email, the company cannot accept responsibility for any loss or damage >arising from the use of this email or attachments. You left the impression that you need the cluster data (this is the clusterlabs mailing list after all) :) Best Regards, Strahil Nikolov ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
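For readers skimming the archive, the two inspection commands from this thread side by side (a sketch only; the exact output depends on the installed DRBD and pacemaker versions):

```
# Show the effective DRBD runtime configuration, including values still
# at their defaults (timeout, ping-int, connect-int, ...):
drbdsetup show --show-defaults

# Dump everything the cluster itself knows into a file for offline review:
cibadmin --query > /tmp/cluster_conf.xml
```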
Re: [ClusterLabs] When the active node enters the standby state, what should be done to make the VIP not automatically jump
On April 16, 2020 12:57:05 PM GMT+03:00, "邴洪涛" <695097494p...@gmail.com> wrote: >>*hi: >*>* We now get a strange requirement. When the active node enters >standby >*>*mode, virtual_ip will not automatically jump to the normal node, but >*>*requires manual operation to achieve the jump of virtual_ip >*>* The mode we use is Active / Passive mode >*>* The Resource Agent we use is ocf: heartbeat: IPaddr2 >*>* Hope you can solve my confusion >* >Hello, > >Can you provide the version of the stack, your config and the command >you run to put the node in sandby ? > >Best Regards, >Strahil Nikolov > >- > >Sorry, I don't know how to reply correctly, so I pasted the previous >chat content on it > >The following are the commands we use > >pcs property set stonith-enabled=false > >pcs property set no-quorum-policy=ignore >pcs resource create virtual_ip ocf:heartbeat:IPaddr2 ip=${VIP} >cidr_netmask=32 op monitor interval="10s" > >pcs resource create docker systemd:docker op monitor interval="10s" >timeout="15s" op start interval="0" timeout="1200s" op stop >interval="0" >timeout="1200s" >pcs constraint colocation add docker virtual_ip INFINITY >pcs constraint order virtual_ip then docker >pcs constraint location docker prefers ${MASTER_NAME}=50 > >pcs resource create lsyncd systemd:lsyncd op monitor interval="10s" >timeout="15s" op start interval="0" timeout="120s" op stop interval="0" >timeout="60s" > pcs constraint colocation add lsyncd virtual_ip INFINITY > >The version we use is > Pacemaker 1.1.20-5.el7_7.2 > Written by Andrew Beekhof If you need to enter a node in standby mode and still keep the IP on that node - I don't think that you can do it at all. Best Regards, Strahil Nikolov ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
Re: [ClusterLabs] temporary loss of quorum when member starts to rejoin
On April 7, 2020 12:21:50 AM GMT+03:00, Sherrard Burton wrote: > > >On 4/6/20 4:10 PM, Andrei Borzenkov wrote: >> 06.04.2020 20:57, Sherrard Burton пишет: >>> >>> >>> On 4/6/20 1:20 PM, Sherrard Burton wrote: >>>> >>>> >>>> On 4/6/20 12:35 PM, Andrei Borzenkov wrote: >>>>> 06.04.2020 17:05, Sherrard Burton пишет: >>>>>> >>>>>> from the quorum node: >>>> ... >>>>>> Apr 05 23:10:17 debug Client :::192.168.250.50:54462 >(cluster >>>>>> xen-nfs01_xen-nfs02, node_id 1) sent quorum node list. >>>>>> Apr 05 23:10:17 debug msg seq num = 6 >>>>>> Apr 05 23:10:17 debug quorate = 0 >>>>>> Apr 05 23:10:17 debug node list: >>>>>> Apr 05 23:10:17 debug node_id = 1, data_center_id = 0, >node_state >>>>>> = member >>>>> >>>>> Oops. How comes that node that was rebooted formed cluster all by >>>>> itself, without seeing the second node? Do you have two_nodes >and/or >>>>> wait_for_all configured? >>>>> >>> >>> i never thought to check the logs on the rebooted server. hopefully >>> someone can extract some further useful information here: >>> >>> >>> https://pastebin.com/imnYKBMN >>> >> >> It looks like some timing issue or race condition. After reboot node >> manages to contact qnetd first, before connection to other node is >> established. Qnetd behaves as documented - it sees two equal size >> partitions and favors the partition that includes tie breaker (lowest >> node id). So existing node goes out of quorum. Second later both >nodes >> see each other and so quorum is regained. > > >thank you for taking the time to troll through my debugging output. >your >explanation seems to accurately describe what i am experiencing. of >course i have no idea how to remedy it. :-) > >> >> I cannot reproduce it, but I also do not use knet. From documentation >I >> have impression that knet has artificial delay before it considers >links >> operational, so may be that is the reason. > >i will do some reading on how knet factors into all of this and respond > >with any questions or discoveries. 
> >> >>>> >>>> BTW, great eyes. i had not picked up on that little nuance. i had >>>> poured through this particular log a number of times, but it was >very >>>> hard for me to discern the starting and stopping points for each >>>> logical group of messages. the indentation made some of it clear. >but >>>> when you have a series of lines beginning in the left-most column, >it >>>> is not clear whether they belong to the previous group, the next >>>> group, or they are their own group. >>>> >>>> just wanted to note my confusion in case the relevant maintainer >>>> happens across this thread. >>>> >>>> thanks again >>>> ___ >>>> Manage your subscription: >>>> https://lists.clusterlabs.org/mailman/listinfo/users >>>> >>>> ClusterLabs home: https://www.clusterlabs.org/ >> >> ___ >> Manage your subscription: >> https://lists.clusterlabs.org/mailman/listinfo/users >> >> ClusterLabs home: https://www.clusterlabs.org/ >> >___ >Manage your subscription: >https://lists.clusterlabs.org/mailman/listinfo/users > >ClusterLabs home: https://www.clusterlabs.org/ Hi Sherrard, Have you tried to increase the qnet timers in the corosync.conf ? Best Regards, Strahil Nikolov ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
Re: [ClusterLabs] Retrofit MySQL with pacemaker?
On May 3, 2020 11:47:35 AM GMT+03:00, "Niels Kobschätzki" wrote: >Hi, > >I have here several master/slave MySQL instances which I want to >retrofit with corosync/Pacemaker while they are in production and I >want to do it with minimal downtime. >Is that even possible? > >I would set up a 2-node corosync-cluster on both. Then I have the >problem that there is already a virtual IP and I want to continue using >it. Do I deconfigure the VIP on the current active master, then >configure the resource? Or is there a better way? How would I go in >configuring the MySQL-resource? > >Did anyone here do something like that before or do you always start >with a new setup (which begs the question how do you get the data from >the old to the new - dump it out or can you attach the new one to the >old one somehow and then switch at a certain point?)? > >Best, > >Niels >___ >Manage your subscription: >https://lists.clusterlabs.org/mailman/listinfo/users > >ClusterLabs home: https://www.clusterlabs.org/ Hi Niels, I believe that you can do it even without downtime, but it will be hard. Most probably something like this can work, but I have never done it: 1. Setup and install the cluster without the resources 2. Set the cluster in maintenance 3. Either use the shadow cib or the pcs counterpart command to define the whole cluster 4. Load the shadow cib from the file 5. Run a crm_simulate to verify the cluster's actions 6. Remove the maintenance A safer approach is to create the cluster with the replication in advance and then either use replication to transfer the data or use an export/import for that purpose. If you have a spare pair of systems, you can try the first approach (building a fresh cluster and configuring it while the DB is running). Best Regards, Strahil Nikolov ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
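The shadow-CIB workflow sketched in the reply above looks roughly like this with crmsh. This is a hedged command outline, not a tested recipe: 'mysql-setup' is a placeholder shadow name, the resource definitions are left to the reader, and the shadow file path shown is only the usual default location.

```
crm configure property maintenance-mode=true   # freeze resource management first

crm cib new mysql-setup                        # create a shadow copy of the CIB
# ... define the VIP, MySQL and replication resources against the shadow CIB ...

# Ask pacemaker what it WOULD do with the candidate configuration:
crm_simulate --simulate --xml-file /var/lib/pacemaker/cib/shadow.mysql-setup

crm cib commit mysql-setup                     # push the shadow CIB live
crm configure property maintenance-mode=false  # resume management
```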
Re: [ClusterLabs] heartbeat IP chenged to 127.0.0.1
On May 12, 2020 3:10:28 PM GMT+03:00, kexue wrote: >Hi, > >I have a two-nodes cluster on Centos7,and heartbeat interfaces connects > >directly. > >Execute "corosync-cfgtool -s" command : > ># corosync-cfgtool -s >Printing ring status. >Local node ID 2 >RING ID 0 > id = 192.168.44.35 > status = ring 0 active with no faults >RING ID 1 > id = 192.168.55.35 > status = ring 1 active with no faults > >Unpluged the heartbeat network cable,Execute "corosync-cfgtool -s" >command : > ># corosync-cfgtool -s >Printing ring status. >Local node ID 2 >RING ID 0 > id = 127.0.0.1 > status = ring 0 active with no faults >RING ID 1 > id = 127.0.0.1 > status = ring 1 active with no faults > >What is wrong with this? Could you give me some advice. > >Thanks. How do you 'unplug' the cable ? Best Regards, Strahil Nikolov ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
Re: [ClusterLabs] Antw: [EXT] Re: heartbeat IP chenged to 127.0.0.1
On May 13, 2020 9:57:46 AM GMT+03:00, Ulrich Windl wrote: >>>> kexue schrieb am 13.05.2020 um 08:46 in >Nachricht ><940_1589352410_5ebb97da_940_560_1_826b6602-7157-4f1a-7c64-3f6583ff6...@163.com> > >> more configure details: >> >> /etc/hosts: >> >> 172.60.60.34 centosA >> 192.168.44.34 centosA-0 >> 192.168.55.34 centosA-1 >> >> 172.60.60.35 centosB >> 192.168.44.35 centosB-0 >> 192.168.55.35 centosB-1 > >Hi! > >I don't know whether it matters, but the canonical form of host file >entries >is > []... > >If you are also using a nameserver, make sure the FQHN in the host file >matches the name in DNS. > >> >> corosync.conf: >> >> totem { >> version: 2 >> secauth: on >> cluster_name: mycluster >> transport: udpu >> rrp_mode: passive >> } >> >> nodelist { >> node { >> ring0_addr: centosA-0 >> ring1_addr: centosA-1 >> nodeid: 1 >> } >> >> node { >> ring0_addr: centosB-0 >> ring1_addr: centosB-1 >> nodeid: 2 >> } >> } > >Also we run a different version, but we also set "name" of the node. >ANd we >have the address explicitly, not the network (host) name... > >Regards, >Ulrich > >> >> Disconnect the heartbeat network cable ,and corosync-cfgtool -s: >> >> RING ID 0 >> id= 127.0.0.1 >> status= ring 0 active with no faults >> RING ID 1 >> id= 127.0.0.1 >> status= ring 1 active with no faults >> >> heartbeat ip binding to 127.0.0.1 >> >> >> 在 2020/5/13 下午2:32, kexue 写道: >>> >>> Thanks. >>> >>> Each node has 2 similarly connected/configured NIC's. Both nodes are >>> connected each other by two network cables. >>> >>> 'unplug' means Disconnect the network cable >>> >>> >>> 在 2020/5/13 下午2:12, Strahil Nikolov 写道: >>>> On May 12, 2020 3:10:28 PM GMT+03:00, kexue >wrote: >>>>> Hi, >>>>> >>>>> I have a two-nodes cluster on Centos7,and heartbeat interfaces >connects >>>>> >>>>> directly. >>>>> >>>>> Execute "corosync-cfgtool -s" command : >>>>> >>>>> # corosync-cfgtool -s >>>>> Printing ring status. 
>>>>> Local node ID 2 >>>>> RING ID 0 >>>>> id = 192.168.44.35 >>>>> status= ring 0 active with no faults >>>>> RING ID 1 >>>>> id= 192.168.55.35 >>>>> status= ring 1 active with no faults >>>>> >>>>> Unpluged the heartbeat network cable,Execute "corosync-cfgtool >-s" >>>>> command : >>>>> >>>>> # corosync-cfgtool -s >>>>> Printing ring status. >>>>> Local node ID 2 >>>>> RING ID 0 >>>>> id= 127.0.0.1 >>>>> status= ring 0 active with no faults >>>>> RING ID 1 >>>>> id= 127.0.0.1 >>>>> status= ring 1 active with no faults >>>>> >>>>> What is wrong with this? Could you give me some advice. >>>>> >>>>> Thanks. >>>> How do you 'unplug' the cable ? >>>> >>>> Best Regards, >>>> Strahil Nikolov >>> -- >>> kexue >>> = >>> -岂曰无衣- >>> E-mail: kexue...@163.com >> -- >> kexue >> = >> -岂曰无衣- >> E-mail: kexue...@163.com > > > >___ >Manage your subscription: >https://lists.clusterlabs.org/mailman/listinfo/users > >ClusterLabs home: https://www.clusterlabs.org/ I have seen such behaviour with 'ifdown' of the main interface. Have you setup some NIC monitoring resource in the cluster ? Best Regards, Strahil Nikolov ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
Re: [ClusterLabs] Removing DRBD w/out Data Loss?
I would play it safe and leave drbd running, but in single-node mode (no peers). As it won't replicate, it should be as close to bare metal as possible. Best Regards, Strahil Nikolov В сряда, 9 септември 2020 г., 15:11:06 Гринуич+3, Eric Robinson написа: Valentin -- With DRBD stopped, wipefs only showed one signature... [root@001db01 ~]# wipefs /dev/vg0/lv0 offset type 0x438 ext4 [filesystem] UUID: 2035aabe-b6e1-4313-a022-9518eb7489e6 So I just mounted the filesystem... [root@001db01 ~]# mount /dev/vg0/lv0 /mnt [ 7712.478905] EXT4-fs (dm-2): 35 orphan inodes deleted [ 7712.484594] EXT4-fs (dm-2): recovery complete [ 7712.506344] EXT4-fs (dm-2): mounted filesystem with ordered data mode. Opts: (null) ...and all my data seems to be there? Did I miss something? I don't understand. Where did the DRBD partition go? > > After stopping DRBD, wipefs /dev/vg1/lv1 should list the signatures on > > the device. Removing *only* the DRBD signature will give a filesystem > > accessible directly on /dev/vg1/lv1. Make sure to use the --backup option. > > > > -- > > Valentin Disclaimer : This email and any files transmitted with it are confidential and intended solely for intended recipients. If you are not the named addressee you should not disseminate, distribute, copy or alter this email. Any views or opinions presented in this email are solely those of the author and might not represent those of Physician Select Management. Warning: Although Physician Select Management has taken reasonable precautions to ensure no viruses are present in this email, the company cannot accept responsibility for any loss or damage arising from the use of this email or attachments. ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
Re: [ClusterLabs] Maintenance mode status in CIB
Yep , both work without affecting the resources : crm cluster stop pcs cluster stop Once your maintenance is over , you can start the cluster and everything will be back in maintenance. Best Regards, Strahil Nikolov В вторник, 13 октомври 2020 г., 19:15:27 Гринуич+3, Digimer написа: On 2020-10-13 11:59 a.m., Strahil Nikolov wrote: > Also, it's worth mentioning that you can set the whole cluster in global > maintenance and power off the stack on all nodes without affecting your > resources. > I'm not sure if that is ever possible in node maintenance. > > Best Regards, > Strahil Nikolov Can you clarify what you mean by "power off the stack on all nodes"? Do you mean stop pacemaker/corosync/knet daemon themselves without issue? -- Digimer Papers and Projects: https://alteeve.com/w/ "I am, somehow, less interested in the weight and convolutions of Einstein’s brain than in the near certainty that people of equal talent have lived and died in cotton fields and sweatshops." - Stephen Jay Gould ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
Re: [ClusterLabs] Maintenance mode status in CIB
Also, it's worth mentioning that you can set the whole cluster in global maintenance and power off the stack on all nodes without affecting your resources. I'm not sure if that is ever possible in node maintenance. Best Regards, Strahil Nikolov В вторник, 13 октомври 2020 г., 12:49:38 Гринуич+3, Digimer написа: On 2020-10-13 5:41 a.m., Jehan-Guillaume de Rorthais wrote: > On Tue, 13 Oct 2020 04:48:04 -0400 > Digimer wrote: > >> On 2020-10-13 4:32 a.m., Jehan-Guillaume de Rorthais wrote: >>> On Mon, 12 Oct 2020 19:08:39 -0400 >>> Digimer wrote: >>> >>>> Hi all, >>> >>> Hi you, >>> >>>> >>>> I noticed that there appear to be a global "maintenance mode" >>>> attribute under cluster_property_set. This seems to be independent of >>>> node maintenance mode. It seemed to not change even when using >>>> 'pcs node maintenance --all' >>> >>> You can set maintenance-mode using: >>> >>> pcs property set maintenance-mode=true >>> >>> You can read about "maintenance-mode" cluster attribute and "maintenance" >>> node attribute in chapters: >>> >>> >>>https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html/Pacemaker_Explained/s-cluster-options.html >>> >>> >>>https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html/Pacemaker_Explained/_special_node_attributes.html >>> >>> I would bet the difference is that "maintenance-mode" applies to all nodes >>> in one single action. Using 'pcs node maintenance --all', each pcsd daemon >>> apply the local node maintenance independently. >>> >>> With the later, I suppose you might have some lag between nodes to actually >>> start the maintenance, depending on external factors. Moreover, you can >>> start/exit the maintenance mode independently on each nodes. >> >> Thanks for this. 
>> >> A question remains; Is it possible that: >> >> > name="maintenance-mode" value="false"/> >> >> Could be set, and a given node could be: >> >> >> >> >> >> >> >> That is to say; If the cluster is set to maintenance mode, does that >> mean I should consider all nodes to also be in maintenance mode, >> regardless of what their individual maintenance mode might be set to? > > I remember a similar discussion happening some months ago. I believe Ken > answered your question there: > > https://lists.clusterlabs.org/pipermail/developers/2019-November/002242.html > > The whole answer is informative, but the conclusion might answer your > question: > > >> There is some room for coming up with better option naming and meaning. >For > >> example maybe the cluster-wide "maintenance-mode" should be something > >> like "force-maintenance" to make clear it takes precedence over node and > >> resource maintenance. > > I understand here that "maintenance-mode" takes precedence over individual > node > maintenance mode. > > Regards, Very helpful, thank you kindly! -- Digimer Papers and Projects: https://alteeve.com/w/ "I am, somehow, less interested in the weight and convolutions of Einstein’s brain than in the near certainty that people of equal talent have lived and died in cotton fields and sweatshops." - Stephen Jay Gould ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/ ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
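The precedence rule discussed in this thread can be condensed into a tiny illustrative function (a sketch of the documented behaviour, not pacemaker's actual code):

```python
def effective_maintenance(cluster_maintenance_mode: bool,
                          node_maintenance: bool) -> bool:
    """Return True if a node should be treated as in maintenance.

    The cluster-wide "maintenance-mode" property takes precedence over
    the per-node "maintenance" attribute: if it is set, every node is
    effectively in maintenance regardless of its individual setting.
    """
    return cluster_maintenance_mode or node_maintenance

# Cluster property set: the node attribute is irrelevant.
assert effective_maintenance(True, False) is True
# Cluster property unset: the node attribute decides.
assert effective_maintenance(False, True) is True
assert effective_maintenance(False, False) is False
```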
Re: [ClusterLabs] Open Source Linux Load Balancer with HA and Split Brain Prevention?
Actually, everything that works on the OS is OK for pacemaker. You've got 2 options: - use a systemd service for your loadbalancer (for example HAProxy) - create your own script which just requires 'start', 'stop' and 'monitor' methods so pacemaker can control it Based on my very fast search on the web, HAProxy has a ready-to-go resource agent 'ocf:heartbeat:haproxy', so you can give it a try. Best Regards, Strahil Nikolov В неделя, 4 октомври 2020 г., 22:41:59 Гринуич+3, Eric Robinson написа: Greetings! We are looking for an open-source Linux load-balancing solution that supports high availability on 2 nodes with split-brain prevention. We’ve been using corosync+pacemaker+ldirectord+quorumdevice, and that works fine, but ldirectord isn’t available for CentOS 7 or 8, and we need to move along to something that’s still in active development. Any suggestions? -Eric Disclaimer : This email and any files transmitted with it are confidential and intended solely for intended recipients. If you are not the named addressee you should not disseminate, distribute, copy or alter this email. Any views or opinions presented in this email are solely those of the author and might not represent those of Physician Select Management. Warning: Although Physician Select Management has taken reasonable precautions to ensure no viruses are present in this email, the company cannot accept responsibility for any loss or damage arising from the use of this email or attachments. ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
Re: [ClusterLabs] Two ethernet adapter within same subnet causing issue on Qdevice
I agree, it's more of a routing problem. Actually a static route should fix the issue. Best Regards, Strahil Nikolov В вторник, 6 октомври 2020 г., 10:50:24 Гринуич+3, Jan Friesse написа: Richard , > To clarify my problem, this is more on Qdevice issue I want to fix. The question is, how much it is really qdevice problem and if so, if there is really something we can do about the problem. Qdevice itself is just using standard connect(2) call and standard TCP socket. So from qdevice point of view it is really kernel problem where to route packet to reach qnetd. It is clear that ifdown made qdevice to lost connection with qnetd (that's why ip changed from ens192 to ens256) and standard qdevice behavior is to try reconnect. Qdevice itself is not binding to any specific address (it is really just a client) so after calling connect(2) qdevice reached qnetd via other (working) interface. So I would suggest to try method recommended by Andrei (add host route). Regards, Honza > See below for more detail. > Thank you, > Richard > > - Original message - > From: Andrei Borzenkov > Sent by: "Users" > To: users@clusterlabs.org > Cc: > Subject: [EXTERNAL] Re: [ClusterLabs] Two ethernet adapter within same > subnet causing issue on Qdevice > Date: Thu, Oct 1, 2020 2:45 PM > 01.10.2020 20:09, Richard Seo пишет: > > Hello everyone, > > I'm trying to setup a cluster with two hosts: > > both have two ethernet adapters all within the same subnet. > > I've created resources for an adapter for each hosts. 
> > Here is the example: > > Stack: corosync > > Current DC: (version 2.0.2-1.el8-744a30d655) - partition with >quorum > > Last updated: Thu Oct 1 12:50:48 2020 > > Last change: Thu Oct 1 12:32:53 2020 by root via cibadmin on > > 2 nodes configured > > 2 resources configured > > Online: [ ] > > Active resources: > > db2__ens192 (ocf::heartbeat:db2ethmon): Started > > db2__ens192 (ocf::heartbeat:db2ethmon): Started > > I also have a qdevice setup: > > # corosync-qnetd-tool -l > > Cluster "hadom": > > Algorithm: LMS > > Tie-breaker: Node with lowest node ID > > Node ID 2: > > Client address: ::::40044 > > Configured node list: 1, 2 > > Membership node list: 1, 2 > > Vote: ACK (ACK) > > Node ID 1: > > Client address: :::<*ip for ens192 for host >1*>:37906 > > Configured node list: 1, 2 > > Membership node list: 1, 2 > > Vote: ACK (ACK) > > When I ifconfig down ens192 for host 1, looks like qdevice changes the >Client > > address to the other adapter and still give quorum to the lowest node >ID > (which > > is host 1 in this case) even when the network is down for host 1. > > Network on host 1 is obviously not down as this host continues to > communicate with the outside world. Network may be down for your > specific application but then it is up to resource agent for this > application to detect it and initiate failover. > The Network (ens192) on host 1 is down. host1 can still communicate with >the > world, because host1 has another network adapter (ens256). However, only > ens192 was configured as a resource. I've also configured specifically > ens192 ip address in the corsync.conf. > I want the network on host 1 down. that way, I can reproduce the problem > where quorum is given to a wrong node. 
> > > Cluster "hadom": > > Algorithm: LMS > > Tie-breaker: Node with lowest node ID > > Node ID 2: > > Client address: ::::40044 > > Configured node list: 1, 2 > > Membership node list: 1, 2 > > Vote: ACK (ACK) > > Node ID 1: > > Client address: :::<*ip for ens256 for host >1*>:37906 > > Configured node list: 1, 2 > > Membership node list: 1, 2 > > Vote: ACK (ACK) > > Is there a way we can force qdevice to only route through a specified >adapter > > (ens192 in this case)? > > Create host route via specific device. > I've looked over the docs, haven't found a way to do this. I'v
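To make the suggested host route persistent on RHEL/CentOS, an interface route file works. A sketch, with placeholder values: 10.0.0.50 stands in for the qnetd host's address and ens192 for the preferred interface (substitute your own).

```
# /etc/sysconfig/network-scripts/route-ens192
# Host route pinning traffic to the qnetd server to ens192.
# 10.0.0.50 is a placeholder for the qnetd host's address.
10.0.0.50/32 dev ens192
```

The same route can be added at runtime with 'ip route add 10.0.0.50/32 dev ens192'.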
Re: [ClusterLabs] Active-Active cluster CentOS 8
There is a topic about that at https://bugs.centos.org/view.php?id=16939 Based on the comments you can obtain it from https://koji.mbox.centos.org/koji/buildinfo?buildID=4801 , but I haven't tested it. Best Regards, Strahil Nikolov В петък, 21 август 2020 г., 18:30:31 Гринуич+3, Mark Battistella написа: Hi, I was wondering if I could get some help when following along with the Clusters from Scratch part 9: https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html-single/Clusters_from_Scratch/index.html#_install_cluster_filesystem_software The first step is to install dlm, which can't be found - but dlm-lib can. Is there any update to the installation or an alternative? I'd love to be able to have an active-active filesystem. I've looked through the repo to see if there were any answers but I can't find anything. Thanks, Mark ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
Re: [ClusterLabs] node utilization attributes are lost during upgrade
Won't it be easier if: - set a node in standby - stop a node - remove the node - add again with the new hostname Best Regards, Strahil Nikolov На 18 август 2020 г. 17:15:49 GMT+03:00, Ken Gaillot написа: >On Tue, 2020-08-18 at 14:35 +0200, Kadlecsik József wrote: >> Hi, >> >> On Mon, 17 Aug 2020, Ken Gaillot wrote: >> >> > On Mon, 2020-08-17 at 12:12 +0200, Kadlecsik József wrote: >> > > >> > > At upgrading a corosync/pacemaker/libvirt/KVM cluster from >> > > Debian >> > > stretch to buster, all the node utilization attributes were >> > > erased >> > > from the configuration. However, the same attributes were kept at >> > > the >> > > VirtualDomain resources. This resulted that all resources with >> > > utilization attributes were stopped. >> > >> > Ouch :( >> > >> > There are two types of node attributes, transient and permanent. >> > Transient attributes last only until pacemaker is next stopped on >> > the >> > node, while permanent attributes persist between reboots/restarts. >> > >> > If you configured the utilization attributes with crm_attribute >> > -z/ >> > --utilization, it will default to permanent, but it's possible to >> > override that with -l/--lifetime reboot (or equivalently, -t/ >> > --type >> > status). >> >> The attributes were defined by "crm configure edit", simply stating: >> >> node 1084762113: atlas0 \ >> utilization hv_memory=192 cpu=32 \ >> attributes standby=off >> ... >> node 1084762119: atlas6 \ >> utilization hv_memory=192 cpu=32 \ >> >> But I believe now that corosync caused the problem, because the nodes >> had >> been renumbered: > >Ah yes, that would do it. Pacemaker would consider them different nodes >with the same names. The "other" node's attributes would not apply to >the "new" node. > >The upgrade procedure would be similar except that you would start >corosync by itself after each upgrade. 
After all nodes were upgraded, >you would modify the CIB on one node (while pacemaker is not running) >with: > >CIB_file=/var/lib/pacemaker/cib/cib.xml cibadmin --modify --scope=nodes >-X '...' > >where '...' is a XML entry from the CIB with the "id" value >changed to the new ID, and repeat that for each node. Then, start >pacemaker on that node and wait for it to come up, then start pacemaker >on the other nodes. > >> >> node 3232245761: atlas0 >> ... >> node 3232245767: atlas6 >> >> The upgrade process was: >> >> for each node do >> set the "hold" mark on the corosync package >> put the node standby >> wait for the resources to be migrated off >> upgrade from stretch to buster >> reboot >> put the node online >> wait for the resources to be migrated (back) >> done >> >> Up to this point all resources were running fine. >> >> In order to upgrade corosync, we followed the next steps: >> >> enable maintenance mode >> stop pacemaker and corosync on all nodes >> for each node do >> delete the hold mark and upgrade corosync >> install new config file (nodeid not specified) >> restart corosync, start pacemaker >> done >> >> We could see that all resources were running unmanaged. When >> disabling the >> maintenance mode, then those were stopped. >> >> So I think corosync renumbered the nodes and I suspect the reason for >> that >> was that "clear_node_high_bit: yes" was not specified in the new >> config >> file. It means it was an admin error then. >> >> Best regards, >> Jozsef >> -- >> E-mail : kadlecsik.joz...@wigner.hu >> PGP key: https://wigner.hu/~kadlec/pgp_public_key.txt >> Address: Wigner Research Centre for Physics >> H-1525 Budapest 114, POB. 49, Hungary >-- >Ken Gaillot > >___ >Manage your subscription: >https://lists.clusterlabs.org/mailman/listinfo/users > >ClusterLabs home: https://www.clusterlabs.org/ ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
Re: [ClusterLabs] why is node fenced ?
Hi Bernd, Given the support phase SLES 12 is in, I guess SUSE will provide fixes only for SLES 15. It will be best if you open a case with them and ask about that. Best Regards, Strahil Nikolov На 19 август 2020 г. 17:29:32 GMT+03:00, "Lentes, Bernd" написа: > >- On Aug 19, 2020, at 4:04 PM, kgaillot kgail...@redhat.com wrote: >>> This appears to be a scheduler bug. >> >> Fix is in master branch and will land in 2.0.5 expected at end of the >> year >> >> https://github.com/ClusterLabs/pacemaker/pull/2146 > >A principal question: >I have SLES 12 and I'm using the pacemaker version provided with the >distribution. >Whether this fix is backported depends on Suse. > >If I install and update pacemaker manually (not the version provided by >Suse), >I lose my support from them, but always have the most recent code and >fixes. > >If I stay with the version from Suse I have support from them, but >maybe not all fixes and not the most recent code. > >What is your approach? >Recommendations? > >Thanks. > >Bernd >Helmholtz Zentrum München > >Helmholtz Zentrum Muenchen >Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH) >Ingolstaedter Landstr. 1 >85764 Neuherberg >www.helmholtz-muenchen.de >Aufsichtsratsvorsitzende: MinDir.in Prof. Dr. Veronika von Messling >Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Kerstin >Guenther >Registergericht: Amtsgericht Muenchen HRB 6466 >USt-IdNr: DE 129521671 > > >___ >Manage your subscription: >https://lists.clusterlabs.org/mailman/listinfo/users > >ClusterLabs home: https://www.clusterlabs.org/ ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
Re: [ClusterLabs] Format of '--lifetime' in 'pcs resource move'
Have you tried ISO 8601 format. For example: 'PT20M' The ISo format is described at: https://manpages.debian.org/testing/crmsh/crm.8.en.html Best Regards, Strahil Nikolov На 20 август 2020 г. 13:40:16 GMT+03:00, Digimer написа: >Hi all, > > Reading the pcs man page for the 'move' action, it talks about >'--lifetime' switch that appears to control when the location >constraint >is removed; > > > move [destination node] [--master] [life‐ > time=] [--wait[=n]] > Move the resource off the node it is currently running > on by creating a -INFINITY location constraint to ban > the node. If destination node is specified the resource > will be moved to that node by creating an INFINITY loca‐ > tion constraint to prefer the destination node. If > --master is used the scope of the command is limited to > the master role and you must use the promotable clone id > (instead of the resource id). If lifetime is specified > then the constraint will expire after that time, other‐ > wise it defaults to infinity and the constraint can be > cleared manually with 'pcs resource clear' or 'pcs con‐ > straint delete'. If --wait is specified, pcs will wait > up to 'n' seconds for the resource to move and then > return 0 on success or 1 on error. If 'n' is not speci‐ > fied it defaults to 60 minutes. If you want the resource > to preferably avoid running on some nodes but be able to > failover to them use 'pcs constraint location avoids'. > > >I think I want to use this, as we move resources manually for various >reasons where the old host is still able to host the resource should a >node failure occur. So we'd love to immediately remove the location >constraint as soon as the move completes. > >I tries using '--lifetime=60' as a test, assuming the format was >'seconds', but that was invalid. How is this switch meant to be used? 
> >Cheers > >-- >Digimer >Papers and Projects: https://alteeve.com/w/ >"I am, somehow, less interested in the weight and convolutions of >Einstein’s brain than in the near certainty that people of equal talent >have lived and died in cotton fields and sweatshops." - Stephen Jay >Gould >___ >Manage your subscription: >https://lists.clusterlabs.org/mailman/listinfo/users > >ClusterLabs home: https://www.clusterlabs.org/
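To make the ISO 8601 suggestion in the reply concrete, a minimal sketch follows; the resource and node names are hypothetical, and the commands assume a live pcs-managed cluster:

```shell
# Move 'myres' to node2; the ban/prefer constraint pcs creates
# expires automatically after 20 minutes (an ISO 8601 duration).
pcs resource move myres node2 lifetime=PT20M

# Other ISO 8601 durations: PT60S (60 seconds), PT1H (1 hour), P1D (1 day).
# Inspect the time-limited constraint while it is active:
pcs constraint location --full
```

Note that the duration is passed as `lifetime=...` (a positional option per the man page quoted above), not as a bare number of seconds.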
Re: [ClusterLabs] Resources restart when a node joins in
Hi Quentin, in order to get help it will be easier if you provide both the corosync and pacemaker configuration. Best Regards, Strahil Nikolov On Thursday, 27 August 2020, 17:10:01 GMT+3, Citron Vert wrote: Hi, Sorry for using this email address, my name is Quentin. Thank you for your reply. I have already tried the stickiness solution (with the deprecated value). I tried the one you gave me, and it does not change anything. Resources don't seem to move from node to node (I don't see the changes with the crm_mon command). In the logs I found this line: "error: native_create_actions: Resource SERVICE1 is active on 2 nodes" Which led me to contact you to understand and learn a little more about this cluster. And why there are running resources on the passive node. You will find attached the logs during the reboot of the passive node and my cluster configuration. I think I'm missing out on something in the configuration / logs that I don't understand. Thank you in advance for your help, Quentin On 26/08/2020 at 20:16, Reid Wahl wrote: > > > Hi, Citron. > > > > > Based on your description, it sounds like some resources **might** be moving > from node 1 to node 2, failing on node 2, and then moving back to node 1. If > that's what's happening (and even if it's not), then it's probably smart to > set some resource stickiness as a resource default. The below command sets a > resource stickiness score of 1. > > > > > > # pcs resource defaults resource-stickiness=1 > > > > > > Also note that the "default-resource-stickiness" cluster property is > deprecated and should not be used. > > > > > Finally, an explicit default resource stickiness score of 0 can interfere > with the placement of cloned resource instances. If you don't want any > stickiness, then it's better to leave stickiness unset. That way, primitives > will have a stickiness of 0, but clone instances will have a stickiness of 1. 
> > > > > > If adding stickiness does not resolve the issue, can you share your cluster > configuration and some logs that show the issue happening? Off the top of my > head I'm not sure why resources would start and stop on node 2 without moving > away from node1, unless they're clone instances that are starting and then > failing a monitor operation on node 2. > > > > > On Wed, Aug 26, 2020 at 8:42 AM Citron Vert wrote: > > >> >> >> Hello, >> I am contacting you because I have a problem with my cluster and I cannot >> find (nor understand) any information that can help me. >> >> I have a 2-node cluster (pacemaker, corosync, pcs) installed on CentOS 7 >> with a set of configuration. >> Everything seems to work fine, but here is what happens: >> >> * Node1 and Node2 are running well with Node1 as primary >> * I reboot Node2 which is passive (no changes on Node1) >> * Node2 comes back in the cluster as passive >> * corosync logs show resources getting started then stopped on Node2 >> * "crm_mon" command shows some resources on Node1 getting restarted >> >> >> I don't understand how it should work. >> If a node comes back, and becomes passive (since Node1 is running primary), >> there is no reason for the resources to be started then stopped on the new >> passive node ? >> >> >> One of my resources becomes unstable because it gets started and then stopped >> too quickly on Node2, which seems to make it restart on Node1 without a >> failover. >> >> I tried several things and solutions proposed by different sites and forums >> but without success. >> >> >> >> >> Is there a way so that the node, which joins the cluster as passive, does >> not start its own resources ? 
>> >> >> >> >> thanks in advance >> >> >> >> >> Here are some information just in case : >> >> $ rpm -qa | grep -E "corosync|pacemaker|pcs" >> corosync-2.4.5-4.el7.x86_64 >> pacemaker-cli-1.1.21-4.el7.x86_64 >> pacemaker-1.1.21-4.el7.x86_64 >> pcs-0.9.168-4.el7.centos.x86_64 >> corosynclib-2.4.5-4.el7.x86_64 >> pacemaker-libs-1.1.21-4.el7.x86_64 >> pacemaker-cluster-libs-1.1.21-4.el7.x86_64 >> >> >> >> >> > name="stonith-enabled" value="false"/> >> > name="no-quorum-policy" value="ignore"/> >> > value="120s"/> >> > name="have-watchdog&
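The stickiness advice in the thread above can be sketched in one command, using the pcs syntax current on CentOS 7 (pcs 0.9); this assumes a live cluster:

```shell
# Give every resource a small default stickiness so it stays where it
# is when a rebooted node rejoins, instead of bouncing between nodes.
pcs resource defaults resource-stickiness=1

# Verify the default took effect:
pcs resource defaults
```

On pcs 0.10.7 and later the same thing is spelled `pcs resource defaults update resource-stickiness=1`.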
Re: [ClusterLabs] DRBD resource not starting
And how did you define the drbd resource? Best Regards, Strahil Nikolov На 14 август 2020 г. 18:48:32 GMT+03:00, Gerry Kernan написа: >Hi >Im trying to add a drbd resource to pacemaker cluster on centos 7 > > >But getting this error on pcs status >drbd_r0(ocf::linbit:drbd): ORPHANED FAILED (blocked)[ vps1 >vps2 ] > > >if I try and the resource from cli I get out out below > >[root@VPS1 drbd-utils]# pcs resource debug-start rep-r0 >Operation start for rep-r0:0 (ocf:linbit:drbd) returned: 'unknown >error' (1) >> stdout: drbdsetup - Configure the DRBD kernel module. >> stdout: >> stdout: USAGE: drbdsetup command {arguments} [options] >> stdout: >> stdout: Commands: >> stdout: primary - Change the role of a node in a resource to >primary. >> stdout: secondary - Change the role of a node in a resource to >secondary. >> stdout: attach - Attach a lower-level device to an existing >replicated device. >> stdout: disk-options - Change the disk options of an attached >lower-level device. >> stdout: detach - Detach the lower-level device of a replicated >device. >> stdout: connect - Attempt to (re)establish a replication link to >a peer host. >> stdout: new-peer - Make a peer host known to a resource. >> stdout: del-peer - Remove a connection to a peer host. >> stdout: new-path - Add a path (endpoint address pair) where a >peer host should be >> stdout: reachable. >> stdout: del-path - Remove a path (endpoint address pair) from a >connection to a peer >> stdout: host. >> stdout: net-options - Change the network options of a >connection. >> stdout: disconnect - Unconnect from a peer host. >> stdout: resize - Reexamine the lower-level device sizes to >resize a replicated >> stdout: device. >> stdout: resource-options - Change the resource options of an >existing resource. >> stdout: peer-device-options - Change peer-device options. >> stdout: new-current-uuid - Generate a new current UUID. >> stdout: invalidate - Replace the local data of a volume with >that of a peer. 
>> stdout: invalidate-remote - Replace a peer's data of a volume >with the local data. >> stdout: pause-sync - Stop resynchronizing between a local and a >peer device. >> stdout: resume-sync - Allow resynchronization to resume on a >replicated device. >> stdout: suspend-io - Suspend I/O on a replicated device. >> stdout: resume-io - Resume I/O on a replicated device. >> stdout: outdate - Mark the data on a lower-level device as >outdated. >> stdout: verify - Verify the data on a lower-level device against >a peer device. >> stdout: down - Take a resource down. >> stdout: role - Show the current role of a resource. >> stdout: cstate - Show the current state of a connection. >> stdout: dstate - Show the current disk state of a lower-level >device. >> stdout: show-gi - Show the data generation identifiers for a >device on a particular >> stdout: connection, with explanations. >> stdout: get-gi - Show the data generation identifiers for a >device on a particular >> stdout: connection. >> stdout: show - Show the current configuration of a resource, or >of all resources. >> stdout: status - Show the state of a resource, or of all >resources. >> stdout: check-resize - Remember the current size of a >lower-level device. >> stdout: events2 - Show the current state and all state changes >of a resource, or of >> stdout: all resources. >> stdout: wait-sync-volume - Wait until resync finished on a >volume. >> stdout: wait-sync-connection - Wait until resync finished on all >volumes of a >> stdout: connection. >> stdout: wait-sync-resource - Wait until resync finished on all >volumes. >> stdout: wait-connect-volume - Wait until a device on a peer is >visible. >> stdout: wait-connect-connection - Wait until all peer volumes of >connection are >> stdout: visible. >> stdout: wait-connect-resource - Wait until all connections are >establised. >> stdout: new-resource - Create a new resource. >> stdout: new-minor - Create a new replicated device within a >resource. 
>> stdout: del-minor - Remove a replicated device. >> stdout: del-resource - Remove a resource. >> stdout: forget-peer - Completely remove any reference to a >unconnected peer from >> stdout: meta-d
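The thread never shows how rep-r0 was actually defined, which is what the reply asks about. DRBD must run as a master/slave (promotable) clone, not a plain primitive, or it stays blocked. A hedged sketch for CentOS 7 / pacemaker 1.1 follows; the DRBD resource name `r0` is an assumption and must match the name in /etc/drbd.d/*.res:

```shell
# Define the DRBD primitive ('r0' is hypothetical; use your DRBD
# resource name from /etc/drbd.d/).
pcs resource create rep-r0 ocf:linbit:drbd drbd_resource=r0 \
    op monitor interval=29s role=Master \
    op monitor interval=31s role=Slave

# Wrap it in a master/slave clone so pacemaker can promote one side:
pcs resource master rep-r0-master rep-r0 \
    master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
```

The "ORPHANED FAILED (blocked)" state in the original post is also consistent with a resource id in the CIB that no longer matches the agent's expectations, so checking `drbdadm status r0` outside pacemaker first is worthwhile.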
Re: [ClusterLabs] SBD fencing not working on my two-node cluster
Replace /dev/sde1 with /dev/disk/by-id/scsi-36000c292840d37bd13eb6be46d3af4ab-part1 : - in /etc/sysconfig/sbd - in the cib (via crm configure edit) Also, I don't see 'stonith-enabled=true', which could be your actual problem. I think you can set it via: crm configure property stonith-enabled=true P.S.: Consider setting the 'resource-stickiness' to '1'. Using partitions is not the best option but is better than nothing. Best Regards, Strahil Nikolov On Tuesday, 22 September 2020, 02:06:10 GMT+3, Philippe M Stedman wrote: Hi Strahil, Here is the output of those commands. I appreciate the help! # crm config show node 1: ceha03 \ attributes ethmonitor-ens192=1 node 2: ceha04 \ attributes ethmonitor-ens192=1 (...) primitive stonith_sbd stonith:fence_sbd \ params devices="/dev/sde1" \ meta is-managed=true (...) property cib-bootstrap-options: \ have-watchdog=true \ dc-version=2.0.2-1.el8-744a30d655 \ cluster-infrastructure=corosync \ cluster-name=ps_dom \ stonith-enabled=true \ no-quorum-policy=ignore \ stop-all-resources=false \ cluster-recheck-interval=60 \ symmetric-cluster=true \ stonith-watchdog-timeout=0 rsc_defaults rsc-options: \ is-managed=false \ resource-stickiness=0 \ failure-timeout=1min # cat /etc/sysconfig/sbd SBD_DEVICE="/dev/sde1" SBD_PACEMAKER=yes SBD_STARTMODE=always SBD_DELAY_START=no SBD_WATCHDOG_DEV=/dev/watchdog SBD_WATCHDOG_TIMEOUT=5 SBD_TIMEOUT_ACTION=flush,reboot SBD_MOVE_TO_ROOT_CGROUP=auto SBD_OPTS= # systemctl status sbd sbd.service - Shared-storage based fencing daemon Loaded: loaded (/usr/lib/systemd/system/sbd.service; enabled; vendor preset: disabled) Active: active (running) since Mon 2020-09-21 18:36:28 EDT; 15min ago Docs: man:sbd(8) Process: 12810 ExecStart=/usr/sbin/sbd $SBD_OPTS -p /var/run/sbd.pid watch (code=exited, status=0/SUCCESS) Main PID: 12812 (sbd) Tasks: 4 (limit: 26213) Memory: 14.5M CGroup: /system.slice/sbd.service ├─12812 sbd: inquisitor ├─12814 sbd: watcher: /dev/sde1 - slot: 0 - uuid: 94d67f15-e301-4fa9-89ae-e3ce2e82c9e7 ├─12815 sbd: watcher: Pacemaker └─12816 sbd: watcher: Cluster Sep 21 18:36:27 ceha03.canlab.ibm.com systemd[1]: Starting Shared-storage based fencing daemon... Sep 21 18:36:27 ceha03.canlab.ibm.com sbd[12810]: notice: main: Doing flush + writing 'b' to sysrq on timeout Sep 21 18:36:27 ceha03.canlab.ibm.com sbd[12815]: pcmk: notice: servant_pcmk: Monitoring Pacemaker health Sep 21 18:36:27 ceha03.canlab.ibm.com sbd[12816]: cluster: notice: servant_cluster: Monitoring unknown cluster health Sep 21 18:36:27 ceha03.canlab.ibm.com sbd[12814]: /dev/sde1: notice: servant_md: Monitoring slot 0 on disk /dev/sde1 Sep 21 18:36:28 ceha03.canlab.ibm.com sbd[12812]: notice: watchdog_init: Using watchdog device '/dev/watchdog' Sep 21 18:36:28 ceha03.canlab.ibm.com sbd[12816]: cluster: notice: sbd_get_two_node: Corosync is in 2Node-mode Sep 21 18:36:28 ceha03.canlab.ibm.com sbd[12812]: notice: inquisitor_child: Servant cluster is healthy (age: 0) Sep 21 18:36:28 ceha03.canlab.ibm.com systemd[1]: Started Shared-storage based fencing daemon. 
# sbd -d /dev/disk/by-id/scsi- dump [root@ceha03 by-id]# sbd -d /dev/disk/by-id/scsi-36000c292840d37bd13eb6be46d3af4ab-part1 dump ==Dumping header on disk /dev/disk/by-id/scsi-36000c292840d37bd13eb6be46d3af4ab-part1 Header version : 2.1 UUID : 94d67f15-e301-4fa9-89ae-e3ce2e82c9e7 Number of slots : 255 Sector size : 512 Timeout (watchdog) : 5 Timeout (allocate) : 2 Timeout (loop) : 1 Timeout (msgwait) : 10 ==Header on disk /dev/disk/by-id/scsi-36000c292840d37bd13eb6be46d3af4ab-part1 is dumped Thanks, Phil Stedman Db2 High Availability Development and Support Email: pmste...@us.ibm.com Strahil Nikolov ---09/21/2020 01:41:10 PM---Can you provide (replace sensitive data) : crm configure show From: Strahil Nikolov To: "users@clusterlabs.org" Date: 09/21/2020 01:41 PM Subject: [EXTERNAL] Re: [ClusterLabs] SBD fencing not working on my two-node cluster Sent by: "Users" Can you provide (replace sensitive data) : crm configure show cat /etc/sysconfig/sbd systemctl status sbd sbd -d /dev/disk/by-id/scsi- dump P.S.: It is very bad practice to use "/dev/sdXYZ" as these are not permanent.Always use persistent names like those inside "/dev/disk/by-XYZ/". Also , SBD needs max 10MB block device and yours seems unnecessarily big. Most probably /dev/sde1 is your problem. Best Regards, Strahil Nikolov В понеделник, 21 септември 2020 г., 23:19:47 Гринуич+3, Philippe M Stedman написа: Hi, I have been following the instructions on the following page to try and configure SBD fencing on my two-node cluster: https://documentation.suse.com/sle-ha/15-SP1/html/SLE-HA-all/cha-ha-storage-protect.html I am able to get through all the steps successfully, I am using the following device (/dev/sde1) as my shared disk: Disk
Re: [ClusterLabs] SBD fencing not working on my two-node cluster
Can you provide (replace sensitive data) : crm configure show cat /etc/sysconfig/sbd systemctl status sbd sbd -d /dev/disk/by-id/scsi- dump P.S.: It is very bad practice to use "/dev/sdXYZ" as these are not permanent. Always use persistent names like those inside "/dev/disk/by-XYZ/". Also, SBD needs at most a 10 MB block device, and yours seems unnecessarily big. Most probably /dev/sde1 is your problem. Best Regards, Strahil Nikolov On Monday, 21 September 2020, 23:19:47 GMT+3, Philippe M Stedman wrote: Hi, I have been following the instructions on the following page to try and configure SBD fencing on my two-node cluster: https://documentation.suse.com/sle-ha/15-SP1/html/SLE-HA-all/cha-ha-storage-protect.html I am able to get through all the steps successfully. I am using the following device (/dev/sde1) as my shared disk: Disk /dev/sde: 20 GiB, 21474836480 bytes, 41943040 sectors Units: sectors of 1 * 512 = 512 bytes Sector size (logical/physical): 512 bytes / 512 bytes I/O size (minimum/optimal): 512 bytes / 512 bytes Disklabel type: gpt Disk identifier: 43987868-1C0B-41CE-8AF8-C522AB259655 Device Start End Sectors Size Type /dev/sde1 48 41942991 41942944 20G IBM General Parallel Fs Since I don't have a hardware watchdog at my disposal, I am using the software watchdog (softdog) instead. Having said this, I am able to get through all the steps successfully... I create the fence agent resource successfully, it shows as Started in crm status output: stonith_sbd (stonith:fence_sbd): Started ceha04 The problem is when I run crm node fence ceha04 to test out fencing a host in my cluster. 
From the crm status output, I see that the reboot action has failed and furthermore, in the system logs, I see the following messages: Sep 21 14:12:33 ceha04 pacemaker-controld[24146]: notice: Requesting fencing (reboot) of node ceha04 Sep 21 14:12:33 ceha04 pacemaker-fenced[24142]: notice: Client pacemaker-controld.24146.5ff1ac0c wants to fence (reboot) 'ceha04' with device '(any)' Sep 21 14:12:33 ceha04 pacemaker-fenced[24142]: notice: Requesting peer fencing (reboot) of ceha04 Sep 21 14:12:33 ceha04 pacemaker-fenced[24142]: notice: Couldn't find anyone to fence (reboot) ceha04 with any device Sep 21 14:12:33 ceha04 pacemaker-fenced[24142]: error: Operation reboot of ceha04 by for pacemaker-controld.24146@ceha04.1bad3987: No such device Sep 21 14:12:33 ceha04 pacemaker-controld[24146]: notice: Stonith operation 3/1:4317:0:ec560474-96ea-4984-b801-400d11b5b3ae: No such device (-19) Sep 21 14:12:33 ceha04 pacemaker-controld[24146]: notice: Stonith operation 3 for ceha04 failed (No such device): aborting transition. Sep 21 14:12:33 ceha04 pacemaker-controld[24146]: warning: No devices found in cluster to fence ceha04, giving up Sep 21 14:12:33 ceha04 pacemaker-controld[24146]: notice: Transition 4317 aborted: Stonith failed Sep 21 14:12:33 ceha04 pacemaker-controld[24146]: notice: Peer ceha04 was not terminated (reboot) by on behalf of pacemaker-controld.24146: No such device I don't know why Pacemaker isn't able to discover my fencing resource, why isn't it able to find anyone to fence the host from the cluster? Any help is greatly appreciated. I can provide more details as required. Thanks, Phil Stedman Db2 High Availability Development and Support Email: pmste...@us.ibm.com ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/ ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
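Putting the advice from the reply together (use a persistent device name everywhere sbd is referenced), the fix might look like the following sketch with crmsh; the by-id path is the one from the thread, and the commands assume a live cluster:

```shell
DISK=/dev/disk/by-id/scsi-36000c292840d37bd13eb6be46d3af4ab-part1

# 1. Point /etc/sysconfig/sbd at the persistent path (repeat on every node):
sed -i "s|^SBD_DEVICE=.*|SBD_DEVICE=\"$DISK\"|" /etc/sysconfig/sbd

# 2. Update the fencing resource so the agent sees the same device:
crm resource param stonith_sbd set devices "$DISK"

# 3. Make sure fencing is actually enabled cluster-wide:
crm configure property stonith-enabled=true
```

sbd only re-reads its configuration at startup, so the cluster stack has to be restarted node by node after changing SBD_DEVICE.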
Re: [ClusterLabs] Pacemaker not starting
What is the output of 'corosync-quorumtool -s' on both nodes ? What is your cluster's configuration : 'crm configure show' or 'pcs config' Best Regards, Strahil Nikolov On Wednesday, 23 September 2020, 16:07:16 GMT+3, Ambadas Kawle wrote: Hello All We have a 2-node MySQL cluster and we are not able to start pacemaker on one of the nodes (the slave node). We are getting the error "waiting for quorum... timed-out waiting for cluster" Following are the package details: pacemaker pacemaker-1.1.15-5.el6.x86_64 pacemaker-libs-1.1.15-5.el6.x86_64 pacemaker-cluster-libs-1.1.15-5.el6.x86_64 pacemaker-cli-1.1.15-5.el6.x86_64 Corosync corosync-1.4.7-6.el6.x86_64 corosynclib-1.4.7-6.el6.x86_64 Mysql mysql-5.1.73-7.el6.x86_64 mysql-connector-python-2.0.4-1.el6.noarch Your help is appreciated. Thanks Ambadas Kawle ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
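The diagnostics requested in the reply can be gathered in one pass; a sketch, to be run on both nodes and compared (availability of each tool depends on the corosync build, which on el6 may vary):

```shell
corosync-quorumtool -s     # votes, expected votes, quorate or not
corosync-cfgtool -s        # ring/interface status per node
pcs config                 # full cluster configuration (or: crm configure show)
grep -A5 'quorum' /etc/corosync/corosync.conf   # quorum provider settings
```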
Re: [ClusterLabs] Is the "allow_downscale" option supported by Corosync/Pacemaker?
I would use 'last_man_standing: 1' + 'wait_for_all: 1'. When you shut down a node gracefully, the quorum is recalculated. You can check the manpage for an explanation. Best Regards, Strahil Nikolov On Friday, 25 September 2020, 01:19:09 GMT+3, Philippe M Stedman wrote: Hi, I was reading through the following votequorum manpage: https://www.systutorials.com/docs/linux/man/5-votequorum/ and read the following about the allow_downscale feature: --- allow_downscale: 1 Enables allow downscale (AD) feature (default: 0). THIS FEATURE IS INCOMPLETE AND CURRENTLY UNSUPPORTED. The general behaviour of votequorum is to never decrease expected votes or quorum. When AD is enabled, both expected votes and quorum are recalculated when a node leaves the cluster in a clean state (normal corosync shutdown process) down to configured expected_votes. --- I am interested in using this option for my cluster to allow hosts to be gracefully shut down for maintenance operations for prolonged amounts of time without adversely affecting quorum for the remaining active hosts in the cluster. Does anybody know when this feature will be supported? Thanks, Phil Stedman Db2 High Availability Development and Support Email: pmste...@us.ibm.com ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
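The settings suggested in the reply live in the quorum section of /etc/corosync/corosync.conf; a sketch (the window value shown is the documented default):

```
quorum {
    provider: corosync_votequorum
    last_man_standing: 1
    # time window in ms after a clean leave before expected_votes
    # and quorum are recalculated (default 10000)
    last_man_standing_window: 10000
    wait_for_all: 1
}
```

Per votequorum(5), this combination lets quorum shrink as nodes leave cleanly, while wait_for_all prevents a freshly booted partition from claiming quorum on its own.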
Re: [ClusterLabs] Resources always return to original node
Resource stickiness for a group is the sum of all member resources' stickiness -> 5 resources x 100 score (the configured default stickiness) = 500 score. If your location constraint has a bigger number -> it wins :) Best Regards, Strahil Nikolov On Saturday, 26 September 2020, 12:22:32 GMT+3, Michael Ivanov wrote: Hallo, I have a strange problem: when I reset the node on which my resources are running, they are correctly migrated to the other node. But when I turn the failed node back on, then as soon as it is up all resources are returned back to it. I have set the resource-stickiness default value to 100. When this did not help I also set the resource-stickiness meta attr to 100 for all my resources. Still, when the failed node recovers the resources are migrated back to it! Where should I look to try to understand this situation? Here's the configuration of my cluster: root@node1# pcs status Cluster name: gcluster Cluster Summary: * Stack: corosync * Current DC: node1 (version 2.0.4-2deceaa3ae) - partition with quorum * Last updated: Sat Sep 26 11:12:34 2020 * Last change: Sat Sep 26 10:39:16 2020 by root via cibadmin on node1 * 2 nodes configured * 14 resource instances configured (1 DISABLED) Node List: * Online: [ node1 node2 ] Full List of Resources: * ilo5_node1 (stonith:fence_ilo5_ssh): Started node2 * ilo5_node2 (stonith:fence_ilo5_ssh): Started node1 * Resource Group: VirtIP: * PrimaryIP (ocf::heartbeat:IPaddr2): Started node2 * PrimaryIP6 (ocf::heartbeat:IPv6addr): Started node2 * AliasIP (ocf::heartbeat:IPaddr2): Started node2 * BackupFS (ocf::redhat:netfs.sh): Started node2 * Clone Set: MailVolume-clone [MailVolume] (promotable): * Masters: [ node2 ] * Slaves: [ node1 ] * MailFS (ocf::heartbeat:Filesystem): Started node2 * apache (ocf::heartbeat:apache): Started node2 * postfix (ocf::heartbeat:postfix): Started node2 * amavis (service:amavis): Started node2 * dovecot (service:dovecot): Started node2 * openvpn (service:openvpn): Stopped (disabled) And resources: 
root@node1# pcs resource config Group: VirtIP Meta Attrs: resource-stickiness=100 Resource: PrimaryIP (class=ocf provider=heartbeat type=IPaddr2) Attributes: cidr_netmask=16 ip=xx.xx.xx.20 nic=br0 Meta Attrs: resource-stickiness=100 Operations: monitor interval=30s (PrimaryIP-monitor-interval-30s) start interval=0s timeout=20s (PrimaryIP-start-interval-0s) stop interval=0s timeout=20s (PrimaryIP-stop-interval-0s) Resource: PrimaryIP6 (class=ocf provider=heartbeat type=IPv6addr) Attributes: cidr_netmask=64 ipv6addr=::::0:0:0:20 nic=br0 Meta Attrs: resource-stickiness=100 Operations: monitor interval=30s (PrimaryIP6-monitor-interval-30s) start interval=0s timeout=15s (PrimaryIP6-start-interval-0s) stop interval=0s timeout=15s (PrimaryIP6-stop-interval-0s) Resource: AliasIP (class=ocf provider=heartbeat type=IPaddr2) Attributes: cidr_netmask=16 ip=xx.xx.yy.20 nic=br0 Meta Attrs: resource-stickiness=100 Operations: monitor interval=30s (AliasIP-monitor-interval-30s) start interval=0s timeout=20s (AliasIP-start-interval-0s) stop interval=0s timeout=20s (AliasIP-stop-interval-0s) Resource: BackupFS (class=ocf provider=redhat type=netfs.sh) Attributes: export=/Backup/Gateway fstype=nfs host=atlas mountpoint=/Backup options=noatime,async Meta Attrs: resource-stickiness=100 Operations: monitor interval=1m timeout=10 (BackupFS-monitor-interval-1m) monitor interval=5m timeout=30 OCF_CHECK_LEVEL=10 (BackupFS-monitor-interval-5m) monitor interval=10m timeout=30 OCF_CHECK_LEVEL=20 (BackupFS-monitor-interval-10m) start interval=0s timeout=900 (BackupFS-start-interval-0s) stop interval=0s timeout=30 (BackupFS-stop-interval-0s) Clone: MailVolume-clone Meta Attrs: clone-max=2 clone-node-max=1 notify=true promotable=true promoted-max=1 promoted-node-max=1 resource-stickiness=100 Resource: MailVolume (class=ocf provider=linbit type=drbd) Attributes: drbd_resource=mail Meta Attrs: resource-stickiness=100 Operations: demote interval=0s timeout=90 (MailVolume-demote-interval-0s) monitor 
interval=60s (MailVolume-monitor-interval-60s) notify interval=0s timeout=90 (MailVolume-notify-interval-0s) promote interval=0s timeout=90 (MailVolume-promote-interval-0s) reload interval=0s timeout=30 (MailVolume-reload-interval-0s) start interval=0s timeout=240 (MailVolume-start-interval-0s) stop interval=0s timeout=100 (MailVolume-stop-interval-0s) Resource: MailFS (class=ocf provider=heartbeat type=Filesystem) Attributes: device=/dev/drbd0 directory=/var/mail fstype=btrfs Meta Attrs: re
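To see which score is actually winning in a situation like the one above (the group's summed stickiness versus a location preference), the constraint scores and pacemaker's computed allocation scores can be inspected; a sketch, with the group name VirtIP taken from the thread:

```shell
# List all location constraints with their scores
# (look for leftover 'prefers' entries with scores like INFINITY or 200):
pcs constraint location --full

# Show pacemaker's computed per-node allocation scores for each resource:
crm_simulate -sL | grep -i virtip
```

If a constraint's score exceeds the group's total stickiness (e.g. 500 for five members at 100 each), the resources will move back when the preferred node returns.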
Re: [ClusterLabs] How to stop removed resources when replacing cib.xml via cibadmin or crm_shadow
That's the strangest request I have heard so far ... What is the reason not to use crmsh or pcs to manage the cluster ? About your question, have you tried to load a cib with the old resources stopped, and then another one with the stopped resources removed ? Best Regards, Strahil Nikolov On Thursday, 1 October 2020, 00:49:37 GMT+3, Igor Tverdovskiy wrote: Hi All, I have a necessity to apply the whole cib.xml instead of using command line tools like > crm configure ... I have generated proper XML according to predefined templates and apply it via > cibadmin --replace --xml-file cib.xml However I have encountered an issue where resources (in particular VIP addresses) which were completely removed/replaced by others continue to run on the interface after the XML replacement. Orphaned resources appear and the VIP actually is kept started. ``` haproxy-10.0.0.111 (ocf::haproxy): ORPHANED FAILED 738741-ip2 (blocked) vip-10.0.0.111 (ocf::IPaddr2): ORPHANED FAILED 738741-ip2 (blocked) ``` What I have tried: 1. Replace pure cib.xml without element (only ) 2. Take active CIB from "cibadmin -Q" and replace only element while keeping as is. 3. Create shadow replace cib.xml with/without status and commit. Indeed crm_simulate -LS shows the intention to stop vip-1.1.1.1, but in fact it does not after the shadow commit. Sometimes I can manage to automatically clear removed/replaced VIP addresses from the interface after replacing the XML, so it is definitely possible, but I cannot understand how to achieve this. So I wonder, is there a way to replace the XML but at the same time stop resources which are removed in the new XML? Thanks! ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
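The two-phase approach suggested in the reply can be sketched with cibadmin; the file names are hypothetical, and the commands assume a live cluster:

```shell
# Phase 1: push a CIB in which the doomed resources still exist but
# carry meta target-role=Stopped, so pacemaker stops them cleanly.
cibadmin --replace --xml-file cib-stopped.xml

# Wait for the cluster to settle (all pending actions complete):
crm_resource --wait

# Phase 2: push the final CIB with those resources removed entirely.
cibadmin --replace --xml-file cib-final.xml
```

This avoids orphans because a resource is never deleted from the configuration while it is still active; pacemaker only treats a running resource as ORPHANED when its definition vanishes out from under it.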
Re: [ClusterLabs] fence_mpath in latest fence-agents: single reservation after fence
I don't see the reservation key in multipath.conf . Have you set it up in unique way (each host has it's own key)? Best Regards, Strahil Nikolov На 1 юни 2020 г. 16:04:32 GMT+03:00, Rafael David Tinoco написа: >Hello again, > >Long time I don't show up... I was finishing up details of Ubuntu 20.04 >HA packages (with lots of other stuff), so sorry for not being active >until now (about to change). During my regression lab preparation, as I >spoke in latest HA conf, I'm facing a situation I'd like to have some >inputs on if anyone has... > >I'm clearing up needed fence_mpath/fence_iscsi setup for all Ubuntu >versions: > >https://bugs.launchpad.net/ubuntu/+source/fence-agents/+bug/1864404 > >and I just faced this: > >- 3 x node cluster setup >- 3 x nodes share 4 paths to /dev/mapper/volume{00..10} >- Using /dev/mapper/volume01 for fencing tests >- softdog configured for /dev/watchdog >- fence_mpath_check installed in /etc/watchdog.d/ > > > >(k)rafaeldtinoco@clusterg01:~$ crm configure show >node 1: clusterg01 >node 2: clusterg02 >node 3: clusterg03 >primitive fence-mpath-clusterg01 stonith:fence_mpath \ > params pcmk_on_timeout=70 pcmk_off_timeout=70 >pcmk_host_list=clusterg01 pcmk_monitor_action=metadata >pcmk_reboot_action=off key=5945 devices="/dev/mapper/volume01" >power_wait=65 \ > meta provides=unfencing target-role=Started >primitive fence-mpath-clusterg02 stonith:fence_mpath \ > params pcmk_on_timeout=70 pcmk_off_timeout=70 >pcmk_host_list=clusterg02 pcmk_monitor_action=metadata >pcmk_reboot_action=off key=59450001 devices="/dev/mapper/volume01" >power_wait=65 \ > meta provides=unfencing target-role=Started >primitive fence-mpath-clusterg03 stonith:fence_mpath \ > params pcmk_on_timeout=70 pcmk_off_timeout=70 >pcmk_host_list=clusterg03 pcmk_monitor_action=metadata >pcmk_reboot_action=off key=59450002 devices="/dev/mapper/volume01" >power_wait=65 \ > meta provides=unfencing target-role=Started >property cib-bootstrap-options: \ > have-watchdog=false \ > 
dc-version=2.0.3-4b1f869f0f \ > cluster-infrastructure=corosync \ > cluster-name=clusterg \ > stonith-enabled=true \ > no-quorum-policy=stop \ > last-lrm-refresh=1590773755 > > > >(k)rafaeldtinoco@clusterg03:~$ crm status >Cluster Summary: > * Stack: corosync > * Current DC: clusterg02 (version 2.0.3-4b1f869f0f) - partition with >quorum > * Last updated: Mon Jun 1 12:55:13 2020 > * Last change: Mon Jun 1 04:35:07 2020 by root via cibadmin on >clusterg03 > * 3 nodes configured > * 3 resource instances configured > >Node List: > * Online: [ clusterg01 clusterg02 clusterg03 ] > >Full List of Resources: > * fence-mpath-clusterg01 (stonith:fence_mpath): Started >clusterg02 > * fence-mpath-clusterg02 (stonith:fence_mpath): Started >clusterg03 > * fence-mpath-clusterg03 (stonith:fence_mpath): Started >clusterg01 > > > >(k)rafaeldtinoco@clusterg03:~$ sudo mpathpersist --in -r >/dev/mapper/volume01 > PR generation=0x2d, Reservation follows: > Key = 0x59450001 > scope = LU_SCOPE, type = Write Exclusive, registrants only > >(k)rafaeldtinoco@clusterg03:~$ sudo mpathpersist --in -k >/dev/mapper/volume01 > PR generation=0x2d, 12 registered reservation keys follow: > 0x59450001 > 0x59450001 > 0x59450001 > 0x59450001 > 0x59450002 > 0x59450002 > 0x59450002 > 0x59450002 > 0x5945 > 0x5945 > 0x5945 > 0x5945 > > > >You can see that everything looks fine. 
If I disable the 2 >interconnects >I have for corosync: > >(k)rafaeldtinoco@clusterg01:~$ sudo corosync-quorumtool -a >Quorum information >-- >Date: Mon Jun 1 12:56:00 2020 >Quorum provider: corosync_votequorum >Nodes: 3 >Node ID: 1 >Ring ID: 1.120 >Quorate: Yes > >Votequorum information >-- >Expected votes: 3 >Highest expected: 3 >Total votes: 3 >Quorum: 2 >Flags: Quorate > >Membership information >-- > Nodeid Votes Name > 1 1 clusterg01, clusterg01bkp (local) > 2 1 clusterg02, clusterg02bkp > 3 1 clusterg03, clusterg03bkp > >for node clusterg01 I have it fenced correctly: > >Pending Fencing Actions: > * reboot of clusterg01 pending: client=pacemaker-controld.906, >origin=clusterg02 > >(k)rafaeldtinoco@clusterg03:~$ sudo mpathpersist --in -r >/dev/mapper/volume01 > PR generation=0x2e, Reservation follows: > Key = 0x59
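The reservation key the reply asks about is normally configured per node in /etc/multipath.conf, and it must match the key= parameter of that node's fence_mpath resource. A sketch using the keys from the thread (each host gets its own unique value):

```
defaults {
    # unique per host: 0x5945 on clusterg01,
    # 0x59450001 on clusterg02, 0x59450002 on clusterg03
    reservation_key 0x5945
}
```

After editing, multipathd has to be reloaded on each node so the key is registered on the device paths.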
Re: [ClusterLabs] Triggering script on cib change
Theoretically the CIB is a file on each node, so a script that watches that file's timestamps, or the cluster's logs, should work. Yet, I think that 'ocf:pacemaker:ClusterMon' can be used to notify and/or execute an external program (I guess a script) that will do your logic. A simple example is available at: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/high_availability_add-on_reference/s1-eventnotification-haar Best Regards, Strahil Nikolov On Wednesday, 16 September 2020, 09:20:44 GMT+3, Digimer wrote: Is there a way to invoke a script when something happens with the cluster? Be it a simple transition, stonith action, resource dis/enable or recovery, etc? -- Digimer Papers and Projects: https://alteeve.com/w/ "I am, somehow, less interested in the weight and convolutions of Einstein’s brain than in the near certainty that people of equal talent have lived and died in cotton fields and sweatshops." - Stephen Jay Gould ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
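A minimal sketch of the ClusterMon approach mentioned in the reply; the script path is hypothetical, and -E passes an external agent that ClusterMon executes on each cluster event:

```shell
# Run ClusterMon on every node as a clone; it polls cluster status
# every 30s and calls the external script on events.
pcs resource create ClusterMon-notify ocf:pacemaker:ClusterMon \
    user=root update=30 \
    extra_options="-E /usr/local/bin/cluster_event.sh" --clone
```

The external script receives event details in CRM_notify_* environment variables (e.g. CRM_notify_node, CRM_notify_rsc, CRM_notify_task). On recent pacemaker versions, the alerts framework (`pcs alert create`) is the preferred replacement for this pattern.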
Re: [ClusterLabs] Adding a node to an active cluster
Both SUSE and Red Hat provide utilities to add a node without editing the configs manually. What is your distro?

Best Regards,
Strahil Nikolov

On Wednesday, 21 October 2020, 17:03:19 GMT+3, Jiaqi Tian1 wrote:

Hi, I'm trying to add a new node to an active pacemaker cluster with resources up and running. After these steps:
1. update corosync.conf files on all hosts in the cluster, including the new node
2. copy the corosync auth file to the new node
3. enable corosync and pacemaker on the new node
4. add the new node to the node list in /var/lib/pacemaker/cib/cib.xml

Then I run crm status; the new node is displayed as offline. It will not become online unless we restart corosync and pacemaker on all nodes in the cluster. But this is not what we want, since we want to keep the existing nodes and resources up and running. Also, in this case crm_node -l doesn't list the new node. So my questions are:
1. Is there another approach that makes the existing nodes aware of the new node, and has crm status show the node online, while keeping the other nodes and resources up and running?
2. Which config file does the crm_node command read?

Thanks,
Jiaqi Tian
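On RHEL, the utility-based flow looks roughly like the sketch below. The hostname is a placeholder, and the syntax differs slightly between RHEL 7 ("pcs cluster auth") and RHEL 8 ("pcs host auth"), so verify against your pcs version; run the commands from an existing cluster node.

```shell
# RHEL 8-style sketch; "newnode" is a placeholder hostname.
pcs host auth newnode                          # authenticate pcsd on the new node
pcs cluster node add newnode --start --enable  # updates and reloads corosync.conf
pcs status nodes                               # verify the node shows up online
```

This avoids hand-editing corosync.conf and cib.xml entirely, which is why the new node comes online without restarting the existing nodes.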
Re: [ClusterLabs] Upgrading/downgrading cluster configuration
Usually I prefer to use "crm configure show" and later "crm configure edit" to replace the config. I am not sure whether this will work in such a downgrade scenario, but it shouldn't be a problem.

Best Regards,
Strahil Nikolov

On Thursday, 22 October 2020, 21:30:55 GMT+3, Vitaly Zolotusky wrote:

Thanks for the reply. I do see a backup config command in pcs, but not in crmsh. What would that be in crmsh? Would something like this work after corosync started in the state with all resources inactive? I'll try this:
crm configure save
crm configure load

Thank you!
_Vitaly Zolotusky

> On October 22, 2020 1:54 PM Strahil Nikolov wrote:
>
> Have you tried to back up the config via crmsh/pcs and, when you downgrade, to restore from it?
>
> Best Regards,
> Strahil Nikolov
>
> On Thursday, 22 October 2020, 15:40:43 GMT+3, Vitaly Zolotusky wrote:
>
> Hello,
> We are trying to upgrade our product from Corosync 2.X to Corosync 3.X. Our procedure includes an upgrade where we stop the cluster, replace RPMs and restart the cluster. The upgrade works fine, but we also need to implement rollback in case something goes wrong.
> When we roll back and reload the old RPMs, the cluster says that there are no active resources. It looks like there is a problem with the cluster configuration version.
> Here is the output of crm_mon:
>
> d21-22-left.lab.archivas.com /opt/rhino/sil/bin # crm_mon -A1
> Stack: corosync
> Current DC: NONE
> Last updated: Thu Oct 22 12:39:37 2020
> Last change: Thu Oct 22 12:04:49 2020 by root via crm_attribute on d21-22-left.lab.archivas.com
>
> 2 nodes configured
> 15 resources configured
>
> Node d21-22-left.lab.archivas.com: UNCLEAN (offline)
> Node d21-22-right.lab.archivas.com: UNCLEAN (offline)
>
> No active resources
>
> Node Attributes:
>
> ***
> What would be the best way to implement a downgrade of the configuration? Should we just change the crm feature set, or do we need to rebuild the whole config?
> Thanks!
> _Vitaly Zolotusky
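A hedged sketch of the crmsh backup/restore flow being discussed. The file path is a placeholder, and "crm configure load" supports a replace method in current crmsh, but verify the exact syntax against your crmsh version before relying on it in a rollback procedure.

```shell
# Before the upgrade: save the current configuration to a file.
crm configure save /root/cib-pre-upgrade.crm

# After a rollback: replace the whole configuration from that file.
crm configure load replace /root/cib-pre-upgrade.crm

# Verify what is active now.
crm configure show
```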
Re: [ClusterLabs] VirtualDomain does not stop via "crm resource stop" - modify RA ?
Why don't you work with something like this: 'op stop interval=300 timeout=600'? The stop operation will then time out at your requirements without modifying the script.

Best Regards,
Strahil Nikolov

On Thursday, 22 October 2020, 23:30:08 GMT+3, Lentes, Bernd wrote:

Hi guys,

occasionally stopping a VirtualDomain resource via "crm resource stop" does not work, and in the end the node is fenced, which is ugly. I had a look at the RA to see what it does. After trying to stop the domain via "virsh shutdown ..." within a configurable time, it switches to "virsh destroy". I assume "virsh destroy" sends a SIGKILL to the respective process. But when the host is doing heavy IO, it's possible that the process is in "D" state (uninterruptible sleep), in which it can't be finished with a SIGKILL. The node the domain is running on is then fenced due to that. I dug deeper and found out that the signal is often delivered a bit later (just some seconds) and the process is killed, but pacemaker has already decided to fence the node. It's all about this excerpt in the RA:

force_stop()
{
    local out ex translate
    local status=0

    ocf_log info "Issuing forced shutdown (destroy) request for domain ${DOMAIN_NAME}."
    out=$(LANG=C virsh $VIRSH_OPTIONS destroy ${DOMAIN_NAME} 2>&1)
    ex=$?
    translate=$(echo $out|tr 'A-Z' 'a-z')
    echo >&2 "$translate"
    case $ex$translate in
        *"error:"*"domain is not running"*|*"error:"*"domain not found"*|\
        *"error:"*"failed to get domain"*)
            : ;; # unexpected path to the intended outcome, all is well
        [!0]*)
            ocf_exit_reason "forced stop failed"
            return $OCF_ERR_GENERIC ;;
        0*)
            while [ $status != $OCF_NOT_RUNNING ]; do
                VirtualDomain_status
                status=$?
            done ;;
    esac
    return $OCF_SUCCESS
}

I'm thinking about the following: how about letting the script wait a bit after "virsh destroy"? I saw that usually it just takes some seconds until "virsh destroy" is successful. I'm thinking about this change:

ocf_log info "Issuing forced shutdown (destroy) request for domain ${DOMAIN_NAME}."
out=$(LANG=C virsh $VIRSH_OPTIONS destroy ${DOMAIN_NAME} 2>&1)
ex=$?
sleep 10    <---- (or maybe configurable)
translate=$(echo $out|tr 'A-Z' 'a-z')

What do you think?

Bernd

--
Bernd Lentes
Systemadministration
Institute for Metabolism and Cell Death (MCD)
Building 25 - office 122
HelmholtzZentrum München
bernd.len...@helmholtz-muenchen.de
phone: +49 89 3187 1241
phone: +49 89 3187 3827
fax: +49 89 3187 2294
http://www.helmholtz-muenchen.de/mcd

stay healthy

Helmholtz Zentrum München
Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir.in Prof. Dr. Veronika von Messling
Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Kerstin Guenther
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671
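Instead of the fixed sleep proposed above, the wait could be bounded and poll for the domain actually going away, so the stop returns as soon as the SIGKILL lands. This is a hypothetical sketch of the idea, not the shipped resource-agent code; DESTROY_WAIT and the use of "virsh domstate" for the check are illustrative assumptions.

```shell
# Hypothetical variant of the post-destroy wait: poll instead of sleeping blindly.
DESTROY_WAIT=${DESTROY_WAIT:-10}       # assumed tunable, in seconds

out=$(LANG=C virsh $VIRSH_OPTIONS destroy "${DOMAIN_NAME}" 2>&1)
ex=$?

# Give the SIGKILL time to be delivered even under heavy IO:
# re-check the domain once per second up to DESTROY_WAIT seconds.
i=0
while [ "$i" -lt "$DESTROY_WAIT" ]; do
    if ! virsh $VIRSH_OPTIONS domstate "${DOMAIN_NAME}" 2>/dev/null | grep -q running; then
        break                          # domain is gone, stop waiting early
    fi
    sleep 1
    i=$((i + 1))
done
```

Compared with a plain "sleep 10", this returns early in the common case and only uses the full wait when the process really is stuck in D state.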
Re: [ClusterLabs] Adding a node to an active cluster
On RHEL, I would use "pcs cluster auth"/"pcs host auth" && "pcs cluster node add". For cluster node auth, you can check:
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/considerations_in_adopting_rhel_8/high-availability-and-clusters_considerations-in-adopting-rhel-8#new_commands_for_authenticating_nodes_in_a_cluster

Best Regards,
Strahil Nikolov

On Tuesday, 27 October 2020, 18:06:06 GMT+2, Jiaqi Tian1 wrote:

Hi Xin,

Thank you. The crmsh version is 4.1.0.0, the OS is RHEL 8.0. I have tried crm cluster init -y, but it seems it cannot be run when the cluster is already up and running with resources on it. Is the crm cluster join command used for this situation?

Thanks,
Jiaqi

> ----- Original message -----
> From: Xin Liang
> Sent by: "Users"
> To: "users@clusterlabs.org"
> Subject: [EXTERNAL] Re: [ClusterLabs] Adding a node to an active cluster
> Date: Tue, Oct 27, 2020 3:29 AM
>
> Hi Jiaqi,
>
> Which OS version do you use and which crmsh version do you use?
> I highly recommend you update your crmsh to the latest version.
>
> Besides that, did you mean you already have the ceha03 and ceha04 nodes running the cluster service?
>
> From ceha03's ha-cluster-bootstrap.log, I can't see a record of the cluster being initialized successfully.
>
> Ideally, you should run:
>
> 1. on ceha03: crm cluster init -y
> 2. on ceha01: crm cluster join -c ceha03 -y
>
> From: Users on behalf of Jiaqi Tian1
> Sent: Tuesday, October 27, 2020 12:15 AM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Adding a node to an active cluster
>
> Hi Xin,
>
> I have ceha03 and ceha04 in the cluster and am trying to add ceha01 to the cluster. Running crm cluster join -c ceha03 -y on ceha01. Here are the logs on ceha03 and ceha01; the log file on ceha04 is empty.
> Thanks,
> Jiaqi
>
>> ----- Original message -----
>> From: Xin Liang
>> Sent by: "Users"
>> To: "users@clusterlabs.org"
>> Subject: [EXTERNAL] Re: [ClusterLabs] Adding a node to an active cluster
>> Date: Mon, Oct 26, 2020 8:43 AM
>>
>> Hi Jiaqi
>>
>> Could you give me your "/var/log/crmsh/ha-cluster-bootstrap.log" or
>> "/var/log/ha-cluster-bootstrap.log" on these 3 nodes?
>>
>> Thanks
>>
>> From: Users on behalf of Jiaqi Tian1
>> Sent: Saturday, October 24, 2020 5:48 AM
>> To: users@clusterlabs.org
>> Subject: Re: [ClusterLabs] Adding a node to an active cluster
>>
>> Hi,
>>
>> Thank you for your suggestion. The case I have is: I have host1 and host2 in a cluster that has resources running, then I try to join host3 to the cluster by running "crm cluster join -c host1 -y". But I get this "Configuring csync2...ERROR: cluster.join: Can't invoke crm cluster init init csync2_remote on host3" issue. Are there any other requirements for running this command?
>>
>> Thanks
>>
>> Jiaqi Tian
>>
>>> ----- Original message -----
>>> From: Xin Liang
>>> Sent by: "Users"
>>> To: Cluster Labs - All topics related to open-source clustering welcomed
>>> Subject: [EXTERNAL] Re: [ClusterLabs] Adding a node to an active cluster
>>> Date: Wed, Oct 21, 2020 9:44 PM
>>>
>>> Hi Jiaqi,
>>>
>>> Assuming you already have node1 running resources, you can try to run this command on node2:
>>>
>>> "crm cluster join -c node1 -y"
>>>
>>> From: Users on behalf of Jiaqi Tian1
>>> Sent: Wednesday, October 21, 2020 10:03 PM
>>> To: users@clusterlabs.org
>>> Subject: [ClusterLabs] Adding a node to an active cluster
>>>
>>> Hi,
>>>
>>> I'm trying to add a new node into an active pacemaker cluster with resources up and running.
>>>
>>> After steps:
>>>
>>> 1. update corosync.conf files among all hosts in cluster including the new
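The bootstrap flow recommended in this thread can be summarized as the sketch below. The node names are the ones from the thread; run each command on the node indicated. The point of "crm cluster join" is that it reuses the existing cluster's corosync.conf and csync2 setup, so no manual config edits or cluster-wide restarts are needed.

```shell
# On a node that is already in the cluster (ceha03 in the thread),
# the cluster must have been set up with the bootstrap tooling:
#   crm cluster init -y
# On the node being added (ceha01 in the thread):
crm cluster join -c ceha03 -y
# Then verify membership from any node:
crm status
crm_node -l
```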
Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: [EXT] Re: VirtualDomain does not stop via "crm resource stop" - modify RA ?
Ulrich, do you mean '--queue'?

Best Regards,
Strahil Nikolov

On Tuesday, 27 October 2020, 12:15:16 GMT+2, Ulrich Windl wrote:

>>> "Lentes, Bernd" wrote on 26.10.2020 at 21:44 in message <1480408662.7194527.1603745092927.javamail.zim...@helmholtz-muenchen.de>:
>
> ----- On Oct 26, 2020, at 4:09 PM, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:
>
>> AFAIK you can even kill processes in Linux that are in "D" state (contrary to
>> other operating systems).
>
> How?

man 1 kill

> Bernd
> Helmholtz Zentrum München
>
> Helmholtz Zentrum Muenchen
> Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
> Ingolstaedter Landstr. 1
> 85764 Neuherberg
> www.helmholtz-muenchen.de
> Aufsichtsratsvorsitzende: MinDir.in Prof. Dr. Veronika von Messling
> Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Kerstin Guenther
> Registergericht: Amtsgericht Muenchen HRB 6466
> USt-IdNr: DE 129521671
Re: [ClusterLabs] Antw: [EXT] Re: Upgrading/downgrading cluster configuration
>>> Strahil Nikolov wrote on 23.10.2020 at 17:04 in message <362944335.2019534.1603465466...@mail.yahoo.com>:
> Usually I prefer to use "crm configure show" and later "crm configure edit"
> and replace the config.

>I guess you use "edit" because of the lack of support for massive configuration changes (like "replace all timeouts for... with ..."), right?

Yep, it's quite simple and it never complains about anything (as long as you copy & paste without missing a letter ;) )

Best Regards,
Strahil Nikolov
Re: [ClusterLabs] Antw: [EXT] Re: VirtualDomain does not stop via "crm resource stop" - modify RA ?
I think it's useful - for example a HANA powers up in 10-15 min (even more, depending on storage tier) - so the default will time out and the fun starts there. Maybe the cluster is just showing them without using them, but it looked quite the opposite.

Best Regards,
Strahil Nikolov

On Monday, 26 October 2020, 09:34:31 GMT+2, Ulrich Windl wrote:

>>> Strahil Nikolov wrote on 23.10.2020 at 17:06 in message <428616368.2019191.1603465603...@mail.yahoo.com>:
> why don't you work with something like this: 'op stop interval=300
> timeout=600'.

I always thought "interval=" does not make any sense for "start" and "stop", but only for "monitor"...

> The stop operation will time out at your requirements without modifying the
> script.
>
> Best Regards,
> Strahil Nikolov
>
> On Thursday, 22 October 2020, 23:30:08 GMT+3, Lentes, Bernd wrote:
>
> Hi guys,
>
> occasionally stopping a VirtualDomain resource via "crm resource stop" does
> not work, and in the end the node is fenced, which is ugly.
> I had a look at the RA to see what it does. After trying to stop the domain
> via "virsh shutdown ..." within a configurable time it switches to "virsh destroy".
> I assume "virsh destroy" sends a SIGKILL to the respective process. But when
> the host is doing heavy IO it's possible that the process is in "D" state
> (uninterruptible sleep) in which it can't be finished with a SIGKILL.
> The node the domain is running on is then fenced due to that.
> I dug deeper and found out that the signal is often delivered a bit later
> (just some seconds) and the process is killed, but pacemaker has already
> decided to fence the node.
> It's all about this excerpt in the RA:
>
> force_stop()
> {
>     local out ex translate
>     local status=0
>
>     ocf_log info "Issuing forced shutdown (destroy) request for domain ${DOMAIN_NAME}."
>     out=$(LANG=C virsh $VIRSH_OPTIONS destroy ${DOMAIN_NAME} 2>&1)
>     ex=$?
>     translate=$(echo $out|tr 'A-Z' 'a-z')
>     echo >&2 "$translate"
>     case $ex$translate in
>         *"error:"*"domain is not running"*|*"error:"*"domain not found"*|\
>         *"error:"*"failed to get domain"*)
>             : ;; # unexpected path to the intended outcome, all is well
>         [!0]*)
>             ocf_exit_reason "forced stop failed"
>             return $OCF_ERR_GENERIC ;;
>         0*)
>             while [ $status != $OCF_NOT_RUNNING ]; do
>                 VirtualDomain_status
>                 status=$?
>             done ;;
>     esac
>     return $OCF_SUCCESS
> }
>
> I'm thinking about the following:
> How about letting the script wait a bit after "virsh destroy"? I saw that
> usually it just takes some seconds until "virsh destroy" is successful.
> I'm thinking about this change:
>
> ocf_log info "Issuing forced shutdown (destroy) request for domain ${DOMAIN_NAME}."
> out=$(LANG=C virsh $VIRSH_OPTIONS destroy ${DOMAIN_NAME} 2>&1)
> ex=$?
> sleep 10    <---- (or maybe configurable)
> translate=$(echo $out|tr 'A-Z' 'a-z')
>
> What do you think?
>
> Bernd
>
> --
> Bernd Lentes
> Systemadministration
> Institute for Metabolism and Cell Death (MCD)
> Building 25 - office 122
> HelmholtzZentrum München
> bernd.len...@helmholtz-muenchen.de
> phone: +49 89 3187 1241
> phone: +49 89 3187 3827
> fax: +49 89 3187 2294
> http://www.helmholtz-muenchen.de/mcd
>
> stay healthy
>
> Helmholtz Zentrum München
> Helmholtz Zentrum Muenchen
> Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
> Ingolstaedter Landstr. 1
> 85764 Neuherberg
> www.helmholtz-muenchen.de
> Aufsichtsratsvorsitzende: MinDir.in Prof. Dr. Veronika von Messling
> Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c.
Matthias Tschoep, Kerstin Guenther
> Registergericht: Amtsgericht Muenchen HRB 6466
> USt-IdNr: DE 129521671
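Taking Ulrich's point above, a VirtualDomain resource with a generous stop timeout would look roughly like this in crmsh syntax. The resource name, config path and the 600 s value are illustrative; interval=0 is the conventional setting for start/stop operations, since an interval is only meaningful for monitor.

```shell
crm configure primitive vm_test ocf:heartbeat:VirtualDomain \
    params config="/etc/libvirt/qemu/vm_test.xml" \
    op start timeout=120 interval=0 \
    op stop timeout=600 interval=0 \
    op monitor timeout=30 interval=10
```

A longer stop timeout gives "virsh shutdown" and the subsequent "virsh destroy" time to complete before pacemaker escalates to fencing.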
Re: [ClusterLabs] fence_virt architecture? (was: Re: Still Beginner STONITH Problem)
My understanding is that fence_xvm is reaching each hypervisor via multicast (otherwise why multicast?)... yet I could be simply fooling myself. If the VMs are behind NAT, I think that the simplest way to STONITH is to use SBD over iSCSI. Yet my KVM knowledge is limited and I didn't see any proof that I'm right (libvirt network was in NAT mode) or wrong (VMs using the host's bond in a bridged network).

Best Regards,
Strahil Nikolov

On 19 July 2020, 9:45:29 GMT+03:00, Andrei Borzenkov wrote:

>18.07.2020 03:36, Reid Wahl wrote:
>> I'm not sure that the libvirt backend is intended to be used in this way,
>> with multiple hosts using the same multicast address. From the
>> fence_virt.conf man page:
>>
>> ~~~
>> BACKENDS
>>    libvirt
>>        The libvirt plugin is the simplest plugin. It is used in
>>        environments where routing fencing requests between multiple hosts
>>        is not required, for example by a user running a cluster of virtual
>>        machines on a single desktop computer.
>>    libvirt-qmf
>>        The libvirt-qmf plugin acts as a QMFv2 Console to the libvirt-qmf
>>        daemon in order to route fencing requests over AMQP to the
>>        appropriate computer.
>>    cpg
>>        The cpg plugin uses corosync CPG and libvirt to track virtual
>>        machines and route fencing requests to the appropriate computer.
>> ~~~
>>
>> I'm not an expert on fence_xvm or libvirt. It's possible that this is a
>> viable configuration with the libvirt backend.
>>
>> However, when users want to configure fence_xvm for multiple hosts with the
>> libvirt backend, I have typically seen them configure multiple fence_xvm
>> devices (one per host) and configure a different multicast address on each
>> host.
>>
>> If you have a Red Hat account, see also:
>> - https://access.redhat.com/solutions/2386421

>What's the point in using a multicast listener if every host will have a
>unique multicast address and there will be a separate stonith agent for
>each host using this unique address?
>That's not what everyone expects seeing "multicast" as the communication protocol.
>
>This is a serious question. If the intention is to avoid TCP overhead, why not
>simply use UDP with a unique address? Or is a single multicast address still
>possible, and this article describes "what I once set up and it worked for me"
>and not "how it is designed to work"?
>
>Also what is not clear - which fence_virtd instance on a host will be
>contacted by the stonith agent on a cluster node? I.e. consider
>
>three hosts host1, host2, host3
>three VMs vm1, vm2, vm3, each active on the corresponding host
>
>vm1 on host1 wants to fence vm3 on host3. Will it
>a) contact fence_virtd on host1, and fence_virtd on host1 will forward the
>request to host3? Or
>b) is it mandatory for vm1 to have connectivity to fence_virtd on host3?
>
>If we combine the existence of local-only listeners (like serial or vsock)
>and a distributed backend (like cpg), it strongly suggests that vm1
>-(listener)-> host1 -(backend)-> host3 -(fence)-> vm3 is possible.
>
>If each cluster node always directly contacts fence_virtd on the *target*
>host, then the libvirt backend is still perfectly usable for multi-host
>configurations, as every fence_virtd will only ever fence a local VM.
>
>Is there any high-level architecture overview (maybe a presentation from
>some conference)?
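The per-host pattern Reid describes (one fence_xvm stonith device per hypervisor, each with its own multicast address) can be sketched like this. Hostnames, VM names and multicast addresses are placeholders, and the exact attribute names should be checked against `pcs stonith describe fence_xvm` on your system.

```shell
# One stonith device per hypervisor, each on its own multicast address
# (the fence_virtd on each host must be configured to listen on it).
pcs stonith create fence_vm1 fence_xvm \
    port="vm1" multicast_address="225.0.0.12" pcmk_host_list="vm1"
pcs stonith create fence_vm2 fence_xvm \
    port="vm2" multicast_address="225.0.0.13" pcmk_host_list="vm2"
```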
Re: [ClusterLabs] Antw: [EXT] Stonith failing
You got plenty of options:
- IPMI-based fencing like HP iLO, DELL iDRAC
- SCSI-3 persistent reservations (which can be extended to fence the node when the reservation(s) were removed)
- Shared disk (even iSCSI) using SBD (a.k.a. poison pill) -> in case your hardware has no watchdog, you can use the softdog kernel module for Linux

Best Regards,
Strahil Nikolov

On 29 July 2020, 9:01:22 GMT+03:00, Gabriele Bulfon wrote:

>That one was taken from a specific implementation on Solaris 11.
>The situation is a dual-node server with a shared storage controller:
>both nodes see the same disks concurrently.
>Here we must be sure that the two nodes are not going to import/mount
>the same zpool at the same time, or we will encounter data corruption:
>node 1 will be preferred for pool 1, node 2 for pool 2; only in case
>one of the nodes goes down or is taken offline should the resources be
>first freed by the leaving node and taken by the other node.
>
>Would you suggest one of the available stonith agents in this case?
>
>Thanks!
>Gabriele
>
>Sonicle S.r.l. : http://www.sonicle.com
>Music: http://www.gabrielebulfon.com
>Quantum Mechanics : http://www.cdbaby.com/cd/gabrielebulfon

>From: Strahil Nikolov
>To: Cluster Labs - All topics related to open-source clustering welcomed, Gabriele Bulfon
>Date: 29 July 2020, 6.39.08 CEST
>Subject: Re: [ClusterLabs] Antw: [EXT] Stonith failing

>Do you have a reason not to use any stonith already available?
>Best Regards,
>Strahil Nikolov

>On 28 July 2020, 13:26:52 GMT+03:00, Gabriele Bulfon wrote:

>Thanks, I attach here the script.
>It basically runs ssh on the other node with no password (must be
>preconfigured via authorization keys) with commands.
>This was taken from a script by OpenIndiana (I think).
>As it states in the comments, we don't want to halt or boot via ssh,
>only reboot.
>Maybe this is the problem; we should at least have it shut down when
>asked for.
>Actually if I stop corosync on node 2, I don't want it to shut down the
>system, but just let node 1 keep control of all resources.
>Same if I just shut down node 2 manually:
>node 1 should keep control of all resources and release them back on reboot.
>Instead, when I stopped corosync on node 2, the log was showing the
>attempt to stonith node 2: why?
>
>Thanks!
>Gabriele
>
>Sonicle S.r.l. : http://www.sonicle.com
>Music: http://www.gabrielebulfon.com
>Quantum Mechanics : http://www.cdbaby.com/cd/gabrielebulfon

>From: Reid Wahl
>To: Cluster Labs - All topics related to open-source clustering welcomed
>Date: 28 July 2020, 12.03.46 CEST
>Subject: Re: [ClusterLabs] Antw: [EXT] Stonith failing

>Gabriele,
>
>"No route to host" is a somewhat generic error message when we can't
>find anyone to fence the node. It doesn't mean there's necessarily a
>network routing issue at fault; no need to focus on that error message.
>
>I agree with Ulrich about needing to know what the script does. But
>based on your initial message, it sounds like your custom fence agent
>returns 1 in response to "on" and "off" actions. Am I understanding
>correctly? If so, why does it behave that way? Pacemaker is trying to
>run a poweroff action based on the logs, so it needs your script to
>support an off action.

>On Tue, Jul 28, 2020 at 2:47 AM Ulrich Windl
>ulrich.wi...@rz.uni-regensburg.de wrote:

>Gabriele Bulfon gbul...@sonicle.com wrote on 28.07.2020 at 10:56 in message:
>Hi, now I have my two nodes (xstha1 and xstha2) with IPs configured by
>Corosync.
>To check how stonith would work, I turned off the Corosync service on the
>second node.
>The first node attempts to stonith the 2nd node and take over its
>resources, but this fails.
>The stonith action is configured to run a custom script to run ssh commands.

>I think you should explain what that script does exactly.
>[...]
>--
>Regards,
>Reid Wahl, RHCA
>Software Maintenance Engineer, Red Hat
>CEE - Platform Support Delivery - ClusterHA
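Since SBD comes up repeatedly in this thread as the fallback when no IPMI is available, here is a hedged sketch of a minimal poison-pill setup on a shared iSCSI disk. The device path and the resource name are placeholders; check `man sbd` and your distro's documentation before use.

```shell
# Initialize the shared disk for SBD (placeholder device path).
sbd -d /dev/disk/by-id/example-iscsi-lun create

# /etc/sysconfig/sbd (fragment):
#   SBD_DEVICE="/dev/disk/by-id/example-iscsi-lun"
#   SBD_WATCHDOG_DEV=/dev/watchdog
#   SBD_WATCHDOG_TIMEOUT=5

# If there is no hardware watchdog, load the software one:
modprobe softdog

# Then add the fencing resource in pacemaker:
pcs stonith create sbd-fence fence_sbd devices="/dev/disk/by-id/example-iscsi-lun"
```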
Re: [ClusterLabs] Pacemaker crashed and produce a coredump file
Early systemd bugs caused dbus issues and session files not being cleaned up properly. At least EL 7.4 and older were affected. What is your OS and version?

P.S.: I know your pain. I am still fighting to explain that without planned downtime, the end users will definitely get unplanned downtime.

Best Regards,
Strahil Nikolov

On 29 July 2020, 12:46:16 GMT+03:00, lkxjtu wrote:

>Hi Reid Wahl,
>
>There is more log information below. The reason seems to be that communication with DBus timed out. Any suggestions?
>
>1672712 Jul 24 21:20:17 [3945305] B0610011 lrmd: info: pcmk_dbus_timeout_dispatch: Timeout 0x147bbd0 expired
>1672713 Jul 24 21:20:17 [3945305] B0610011 lrmd: info: pcmk_dbus_find_error: LoadUnit error 'org.freedesktop.DBus.Error.NoReply': Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
>1672714 Jul 24 21:20:17 [3945305] B0610011 lrmd: error: systemd_loadunit_result: Unexepcted DBus type, expected o in 's' instead of s
>1672715 Jul 24 21:20:17 [3945305] B0610011 lrmd: error: crm_abort: systemd_unit_exec_with_unit: Triggered fatal assert at systemd.c:514 : unit
>1672716 2020-07-24T21:20:17.701484+08:00 B0610011 lrmd[3945305]: error: systemd_loadunit_result: Unexepcted DBus type, expected o in 's' instead of s
>1672717 2020-07-24T21:20:17.701517+08:00 B0610011 lrmd[3945305]: error: crm_abort: systemd_unit_exec_with_unit: Triggered fatal assert at systemd.c:514 : unit
>1672718 Jul 24 21:20:17 [3945306] B0610011 crmd: error: crm_ipc_read: Connection to lrmd failed
>
>> Hi,
>>
>> It looks like this is a bug that was fixed in later releases. The `path`
>> variable was a null pointer when it was passed to
>> `systemd_unit_exec_with_unit` as the `unit` argument.
>> Commit 62a0d26a
>> <https://github.com/ClusterLabs/pacemaker/commit/62a0d26a8f85fbcee9b56524ea3f1ae0171cbe52#diff-00b989f66499e2081134c17c06d2b359R201>
>> adds a null check to the `path` variable before using it to call
>> `systemd_unit_exec_with_unit`.
>>
>> I believe pacemaker-1.1.15-11.el7 is the first RHEL pacemaker release that
>> contains the fix. Can you upgrade and see if the issue is resolved?
Re: [ClusterLabs] Antw: [EXT] Stonith failing
This one links to how to power-fence when reservations are removed:
https://access.redhat.com/solutions/4526731

Best Regards,
Strahil Nikolov

On 30 July 2020, 9:28:51 GMT+03:00, Andrei Borzenkov wrote:

>30.07.2020 08:42, Strahil Nikolov wrote:
>> You got plenty of options:
>> - IPMI-based fencing like HP iLO, DELL iDRAC
>> - SCSI-3 persistent reservations (which can be extended to fence the node when the reservation(s) were removed)
>
>SCSI reservation prevents data corruption due to concurrent access; it
>cannot be used as a replacement for proper STONITH, as it does not affect
>all other non-disk resources.
>
>You need to combine SCSI reservation with something that actually
>eliminates the node, at which point it becomes much simpler to use SBD.
>
>> - Shared disk (even iSCSI) using SBD (a.k.a. poison pill) -> in case your hardware has no watchdog, you can use the softdog kernel module for Linux.
>>
>> Best Regards,
>> Strahil Nikolov
Re: [ClusterLabs] Antw: [EXT] Re: Maximum cluster size with Pacemaker 2.x and Corosync 3.x, and scaling to hundreds of nodes
When I joined the previous company, we were just decommissioning a 41-node Scale-out HANA /SLES 11/ in favour of a 21-node /SLES 12/ cluster. The most popular was a 2-node cluster, but we had a lot of issues. For me, 2-node clusters with qnetd will be the most popular.

Best Regards,
Strahil Nikolov

On 31 July 2020, 8:57:29 GMT+03:00, Ulrich Windl wrote:

>>>> Ken Gaillot wrote on 30.07.2020 at 16:43 in message
><93b973947008b62c4848f8a799ddc3f0949451e8.ca...@redhat.com>:
>> On Wed, 2020-07-29 at 23:12 +, Toby Haynes wrote:
>>> In Corosync 1.x there was a limit on the maximum number of active
>>> nodes in a corosync cluster - browsing the mailing list says 64
>>> hosts. The Pacemaker 1.1 documentation says scalability goes up to 16
>>> nodes. The Pacemaker 2.0 documentation says the same, although I
>>> can't find a maximum number of nodes in Corosync 3.
>>
>> My understanding is that there is no theoretical limit, only practical
>> limits, so giving a single number is somewhat arbitrary.
>>
>> There is a huge difference between full cluster nodes (running corosync
>> and all pacemaker daemons) and Pacemaker Remote nodes (running only
>> pacemaker-remoted).
>>
>> Corosync uses a ring model where a token has to be passed in a very
>> short amount of time, and also has message guarantees (i.e. every node
>> has to confirm receiving a message before it is made available), so
>> there is a low practical limit to full cluster nodes. The 16 or 32
>> number comes from what enterprise providers are willing to support, and
>> is a good ballpark for a real-world comfort zone. Even at 32 you need a
>
>What I'd like to see is some table with recommended parameters, depending on
>the number of nodes and the maximum acceptable network delay.
>
>The other thing I'd like to see is a world-wide histogram (x-axis: number of
>nodes, y-axis: number of installations) of pacemaker clusters.
>Here we have a configuration of two 2-node clusters and one 3-node cluster.
>Initially we had planned to make one 7-node cluster, but basically stability
>(common fencing) and configuration issues (becoming complex) prevented that.
>
>> dedicated fast network and likely some tuning tweaks. Going beyond that
>> is possible but depends on hardware and tuning, and becomes sensitive
>> to slight disturbances.
>>
>> Pacemaker Remote nodes on the other hand are lightweight. They
>> communicate with only a single cluster node, with relatively low
>> traffic. The upper bound is unknown; some people report getting strange
>> errors with as few as 40 remote nodes, while others run over 100 with
>> no problems. So it may well depend on network and hardware capabilities
>
>See the parameter table requested above.
>
>> at high numbers, and you can run far more in VMs or containers than on
>> bare metal, since traffic will (usually) be internal rather than over
>> the network.
>>
>> I would expect a cluster with 16-32 full nodes and several hundred
>> remotes (maybe even thousands in VMs or containers) to be feasible with
>> the right hardware and tuning.
>
>I wonder: Do such configurations have a lot of identical or similar resources,
>or do they do massive load balancing, or do they run many different resources?
>
>> Since remotes don't run all the daemons, they can't do things like
>> directly execute fence devices or contribute to cluster quorum, but
>> remotes on bare metal or VMs are not really in a hierarchy as far as
>> the services being clustered go. A resource can move between cluster
>> and remote nodes, and a remote's connection can move from one cluster
>> node to another without interrupting the services on the remote.
>
>Regards,
>Ulrich
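For the 2-node-plus-qnetd layout mentioned at the top of this thread, the setup can be sketched roughly as follows. The arbiter hostname is a placeholder; on RHEL/CentOS the packages are corosync-qdevice on the cluster nodes and corosync-qnetd on the arbiter, and the exact pcs syntax should be verified against your pcs version.

```shell
# On the arbiter host (outside the cluster):
#   dnf install corosync-qnetd
#   pcs qdevice setup model net --enable --start
# On one cluster node (corosync-qdevice installed on both nodes):
pcs quorum device add model net host=qnetd-host algorithm=ffsplit
pcs quorum status    # should now show the qdevice vote
```

The ffsplit algorithm gives exactly one of the two partitions the extra vote on a split, avoiding the fence races that plain two_node setups can suffer.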
Re: [ClusterLabs] About the log indicating RA execution
Usually I check the logs on the Designated Coordinator , especially when it was not fenced. Best Regards, Strahil Nikolov На 2 юли 2020 г. 12:12:04 GMT+03:00, "井上和徳" написа: >Hi all, > >We think it is desirable to output the log indicating the start and >finish of RA execution to syslog on the same node. (End users >monitoring the syslog are requesting that the output be on the same >node.) > >Currently, the start[1] and finish[2] logs may be output by different >nodes. > >Cluster Summary: > * Stack: corosync > * Current DC: r81-2 (version 2.0.4-556cef416) - partition with quorum > * Last updated: Thu Jul 2 12:42:17 2020 >* Last change: Thu Jul 2 12:42:13 2020 by hacluster via crmd on r81-2 > * 2 nodes configured > * 3 resource instances configured > >Node List: > * Online: [ r81-1 r81-2 ] > >Full List of Resources: > * dummy1 (ocf::pacemaker:Dummy): Started r81-1 > * fence1-ipmilan (stonith:fence_ipmilan): Started r81-2 > * fence2-ipmilan (stonith:fence_ipmilan): Started r81-1 > >*1 >Jul 2 12:42:15 r81-2 pacemaker-controld[18009]: > notice: Initiating start operation dummy1_start_0 on r81-1 >*2 >Jul 2 12:42:15 r81-1 pacemaker-controld[10109]: > notice: Result of start operation for dummy1 on r81-1: ok > >As a suggestion, > >1) change the following log levels to NOTICE and output the start and > finish logs to syslog on the node where RA was executed. > >Jul 02 12:42:15 r81-1 pacemaker-execd [10106] (log_execute) > info: executing - rsc:dummy1 action:start call_id:10 >Jul 02 12:42:15 r81-1 pacemaker-execd [10106] (log_finished) > info: dummy1 start (call 10, PID 10164) exited with status 0 >(execution time 91ms, queue time 0ms) > >2) alternatively, change the following log levels to NOTICE and > output a log indicating the finish at the DC node. > >Jul 02 12:42:15 r81-2 pacemaker-controld [18009] (process_graph_event) > info: Transition 2 action 7 (dummy1_start_0 on r81-1) confirmed: ok >| rc=0 call-id=10 > >What do you think about this? 
(Do you have a better idea?) > >Best Regards, >Kazunori INOUE > > >___ >Manage your subscription: >https://lists.clusterlabs.org/mailman/listinfo/users > >ClusterLabs home: https://www.clusterlabs.org/ ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
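As a stop-gap until/unless those log levels change upstream, an end user who wants the execd start/finish messages visible locally can raise verbosity on the node itself via Pacemaker's sysconfig options (option names per the Pacemaker documentation; verify against your version — the detail log is per-node by design):

```ini
# /etc/sysconfig/pacemaker (Debian: /etc/default/pacemaker)
# Per-node detail log; info-level messages such as the quoted
# log_execute/log_finished lines land here on the node that ran the RA.
PCMK_logfile=/var/log/pacemaker/pacemaker.log
# Lowest severity forwarded to syslog (default: notice)
PCMK_logpriority=info
```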
Re: [ClusterLabs] Still Beginner STONITH Problem
I can't find fence_virtd for Ubuntu18, but it is available for Ubuntu20. Your other option is to get an iSCSI LUN from your quorum system and use that for SBD. For the watchdog, you can use the 'softdog' kernel module or you can use KVM to present one to the VMs. You can also check the '-P' flag for SBD. Best Regards, Strahil Nikolov On 7 July 2020, 10:11:38 GMT+03:00, "stefan.schm...@farmpartner-tec.com" wrote: > >What does 'virsh list' > >give you on the 2 hosts? Hopefully different names for > >the VMs ... > >Yes, each host shows its own > ># virsh list > Id   Name     State > > 2    kvm101   running > ># virsh list > Id   Name     State > > 1    kvm102   running > > > > >Did you try 'fence_xvm -a {mcast-ip} -o list' on the > >guests as well? > >fence_xvm sadly does not work on the Ubuntu guests. The howto said to >install "yum install fence-virt fence-virtd" which do not exist as >such >in Ubuntu 18.04. After we tried to find the appropriate packages we >installed "libvirt-clients" and "multipath-tools". Is there maybe >something missing or completely wrong? >Though we can connect to both hosts using "nc -z -v -u 192.168.1.21 >1229", that just works fine. > > > >Usually, the biggest problem is the multicast traffic - as in many > >environments it can be dropped by firewalls. > >To make sure, I have requested our datacenter techs to verify that >multicast traffic can move unhindered in our local network. In the >past on multiple occasions they have confirmed that local traffic is >not filtered in any way, but until now I had never specifically asked >about multicast traffic, which I now did. I am waiting for an answer to >that question. > > >kind regards >Stefan Schmitz > >On 06.07.2020 at 11:24, Klaus Wenninger wrote: >> On 7/6/20 10:10 AM, stefan.schm...@farmpartner-tec.com wrote: >>> Hello, >>> >>>>> # fence_xvm -o list >>>>> kvm102 >bab3749c-15fc-40b7-8b6c-d4267b9f0eb9 >>>>> on >>> >>>> This should show both VMs, so getting to that point will likely >solve >>>> your problem.
fence_xvm relies on multicast, there could be some >>>> obscure network configuration to get that working on the VMs. >> You said you tried on both hosts. What does 'virsh list' >> give you on the 2 hosts? Hopefully different names for >> the VMs ... >> Did you try 'fence_xvm -a {mcast-ip} -o list' on the >> guests as well? >> Did you try pinging via the physical network that is >> connected to the bridge configured to be used for >> fencing? >> If I got it right fence_xvm should support collecting >> answers from multiple hosts but I found a suggestion >> to do a setup with 2 multicast-addresses & keys for >> each host. >> Which route did you go? >> >> Klaus >>> >>> Thank you for pointing me in that direction. We have tried to solve >>> that but with no success. We were using a howto provided here >>> https://wiki.clusterlabs.org/wiki/Guest_Fencing >>> >>> Problem is, it specifically states that the tutorial does not yet >>> support the case where guests are running on multiple hosts. There >are >>> some short hints at what might be necessary to do, but working through >>> those sadly just did not work, nor were there any clues which would >>> help us find a solution ourselves. So now we are completely stuck >>> here. >>> >>> Does someone have the same configuration with Guest VMs on multiple hosts? >>> And how did you manage to get that to work? What do we need to do to >>> resolve this? Is there maybe even someone who would be willing to >take >>> a closer look at our server? Any help would be greatly appreciated! >>> >>> Kind regards >>> Stefan Schmitz >>> >>> >>> >>> On 03.07.2020 at 02:39, Ken Gaillot wrote: >>>> On Thu, 2020-07-02 at 17:18 +0200, >stefan.schm...@farmpartner-tec.com >>>> wrote: >>>>> Hello, >>>>> >>>>> I hope someone can help with this problem. We are (still) trying >to >>>>> get >>>>> Stonith to achieve a running active/active HA Cluster, but sadly >to >>>>> no >>>>> avail. >>>>>
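For the SBD route suggested at the top of this message, the setup usually boils down to a watchdog module plus a shared-disk entry in the sbd sysconfig. A minimal sketch — the device path is a placeholder, and softdog should only be used if neither a hardware nor a KVM-provided watchdog is available:

```ini
# /etc/modules-load.d/softdog.conf -- fallback software watchdog
softdog

# /etc/sysconfig/sbd (Debian: /etc/default/sbd)
SBD_DEVICE="/dev/disk/by-id/<your-iscsi-lun>"
SBD_WATCHDOG_DEV=/dev/watchdog
SBD_WATCHDOG_TIMEOUT=5
SBD_PACEMAKER=yes
```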
Re: [ClusterLabs] Still Beginner STONITH Problem
>With kvm please use the qemu-watchdog and try to >prevent using softdog with SBD. >Especially if you are aiming for a production-cluster ... You can tell that to the previous company I worked for :D . All clusters were using softdog on SLES 11/12 even though the hardware had its own. We had no issues with fencing, but we got plenty of SAN issues to test the fencing :) Best Regards, Strahil Nikolov
Re: [ClusterLabs] Still Beginner STONITH Problem
As far as I know fence_xvm supports multiple hosts, but you need to open the port on both the hypervisor (udp) and the guest (tcp). 'fence_xvm -o list' should provide a list of VMs from all hosts that responded (and have the key). Usually, the biggest problem is the multicast traffic - as in many environments it can be dropped by firewalls. Best Regards, Strahil Nikolov On 6 July 2020, 12:24:08 GMT+03:00, Klaus Wenninger wrote: >On 7/6/20 10:10 AM, stefan.schm...@farmpartner-tec.com wrote: >> Hello, >> >> >> # fence_xvm -o list >> >> kvm102 >bab3749c-15fc-40b7-8b6c-d4267b9f0eb9 >> >> on >> >> >This should show both VMs, so getting to that point will likely >solve >> >your problem. fence_xvm relies on multicast, there could be some >> >obscure network configuration to get that working on the VMs. >You said you tried on both hosts. What does 'virsh list' >give you on the 2 hosts? Hopefully different names for >the VMs ... >Did you try 'fence_xvm -a {mcast-ip} -o list' on the >guests as well? >Did you try pinging via the physical network that is >connected to the bridge configured to be used for >fencing? >If I got it right fence_xvm should support collecting >answers from multiple hosts but I found a suggestion >to do a setup with 2 multicast-addresses & keys for >each host. >Which route did you go? > >Klaus >> >> Thank you for pointing me in that direction. We have tried to solve >> that but with no success. We were using a howto provided here >> https://wiki.clusterlabs.org/wiki/Guest_Fencing >> >> Problem is, it specifically states that the tutorial does not yet >> support the case where guests are running on multiple hosts. There >are >> some short hints at what might be necessary to do, but working through >> those sadly just did not work, nor were there any clues which would >> help us find a solution ourselves. So now we are completely stuck >> here. >> >> Does someone have the same configuration with Guest VMs on multiple hosts?
>> And how did you manage to get that to work? What do we need to do to >> resolve this? Is there maybe even someone who would be willing to >take >> a closer look at our server? Any help would be greatly appreciated! >> >> Kind regards >> Stefan Schmitz >> >> >> >> Am 03.07.2020 um 02:39 schrieb Ken Gaillot: >>> On Thu, 2020-07-02 at 17:18 +0200, >stefan.schm...@farmpartner-tec.com >>> wrote: >>>> Hello, >>>> >>>> I hope someone can help with this problem. We are (still) trying to >>>> get >>>> Stonith to achieve a running active/active HA Cluster, but sadly to >>>> no >>>> avail. >>>> >>>> There are 2 Centos Hosts. On each one there is a virtual Ubuntu VM. >>>> The >>>> Ubuntu VMs are the ones which should form the HA Cluster. >>>> >>>> The current status is this: >>>> >>>> # pcs status >>>> Cluster name: pacemaker_cluster >>>> WARNING: corosync and pacemaker node names do not match (IPs used >in >>>> setup?) >>>> Stack: corosync >>>> Current DC: server2ubuntu1 (version 1.1.18-2b07d5c5a9) - partition >>>> with >>>> quorum >>>> Last updated: Thu Jul 2 17:03:53 2020 >>>> Last change: Thu Jul 2 14:33:14 2020 by root via cibadmin on >>>> server4ubuntu1 >>>> >>>> 2 nodes configured >>>> 13 resources configured >>>> >>>> Online: [ server2ubuntu1 server4ubuntu1 ] >>>> >>>> Full list of resources: >>>> >>>> stonith_id_1 (stonith:external/libvirt): Stopped >>>> Master/Slave Set: r0_pacemaker_Clone [r0_pacemaker] >>>> Masters: [ server4ubuntu1 ] >>>> Slaves: [ server2ubuntu1 ] >>>> Master/Slave Set: WebDataClone [WebData] >>>> Masters: [ server2ubuntu1 server4ubuntu1 ] >>>> Clone Set: dlm-clone [dlm] >>>> Started: [ server2ubuntu1 server4ubuntu1 ] >>>> Clone Set: ClusterIP-clone [ClusterIP] (unique) >>>> ClusterIP:0 (ocf::heartbeat:IPaddr2): Started >>>> server2ubuntu1 >>>> ClusterIP:1 (ocf::heartbeat:IPaddr2): Started >>>> server4ubuntu1 >>>> Clone Set: WebFS-clone [WebFS] >>>> Started: [ server4ubuntu1 ] >>>> Stopped: [ server2ubuntu1 ] >>>> Clone Set
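Strahil's port summary from this thread translates into firewall rules roughly like the following — fence_virt's default port is 1229; the zone defaults and tools depend on the distribution:

```sh
# On each CentOS hypervisor: fence_virtd receives multicast requests on 1229/udp
firewall-cmd --permanent --add-port=1229/udp
firewall-cmd --reload

# On each Ubuntu guest: the host connects back to the guest on 1229/tcp
ufw allow 1229/tcp
```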
Re: [ClusterLabs] qnetd and booth arbitrator running together in a 3rd geo site
And what about SBD (a.k.a. poison pill)? I've used it reliably with 3 SBDs on a stretched cluster. Never failed to kill the node. Best Regards, Strahil Nikolov On 14 July 2020, 14:18:56 GMT+03:00, Rohit Saini wrote: >I don't think my question was very clear. I am strictly NO for STONITH. >STONITH is limited only to kvm or HP machines. That's the reason I >don't >want to use STONITH. >What my question is: can I use booth with nodes of a single cluster also >(similar to qdevice)? So the idea is to use the booth arbitrator for a cluster of >clusters AS WELL AS for a single cluster. > > >On Tue, Jul 14, 2020 at 4:42 PM Jan Friesse >wrote: >> Rohit, >> >> > Thanks Honza. That's helpful. >> > Let's say I don't use qnetd, can I achieve the same with a booth >arbitrator? >> >> That means to have two two-node clusters. A two-node cluster without >> fencing is strictly no. >> >> > Booth arbitrator works for geo-clusters, can the same arbitrator be >> reused >> > for local clusters as well? >> >> I'm not sure that I understand the question. Booth just gives a ticket to >> (maximally) one of the booth sites. >> >> >> > Is it even possible technically? >> >> The question is what you are trying to achieve. If a geo-cluster, then >> stonith for sites + booth is probably the best solution. If the cluster >is >> more like a stretch cluster, then qnetd + stonith is enough. >> >> And of course your idea (original one) should work too. >> >> Honza >> >> >> > >> > Regards, >> > Rohit >> > >> > On Tue, Jul 14, 2020 at 3:32 PM Jan Friesse >wrote: >> > >> >> Rohit, >> >> >> >>> Hi Team, >> >>> Can I execute corosync-qnetd and booth-arbitrator on the same VM >in a >> >>> different geo site? What's the recommendation? Will it have any >> >> limitations >> >>> in a production deployment? >> >> >> >> There is no technical limitation. Both qnetd and booth are very >> >> lightweight and work just fine with high latency links.
>> >> >> >> But I don't really have any real-life experience with deployments >where >> >> both booth and qnetd are used. It should work, but I would >recommend >> >> proper testing - especially what happens when the arbitrator node >> disappears. >> >> >> >>> Due to my architecture limitation, I have only one arbitrator >available >> >>> which is on a 3rd site. To handle cluster split-brain errors, I >am >> >> thinking >> >>> to use the same arbitrator for the local cluster as well. >> >>> STONITH is not useful in my case as it is limited only to ILO and >VIRT. >> >> >> >> Keep in mind that neither qdevice nor booth is a "replacement" for >> stonith. >> >> >> >> Regards, >> >> Honza >> >> >> >>> >> >>> Thanks, >> >>> Rohit >> >>> >> >>> >> >>> ___ >> >>> Manage your subscription: >> >>> https://lists.clusterlabs.org/mailman/listinfo/users >> >>> >> >>> ClusterLabs home: https://www.clusterlabs.org/ >> >>> >> >> >> >> >> > >> >>
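For the single-cluster half of Rohit's question, the qnetd arbitration Honza refers to is configured on the cluster side roughly like this — the arbiter host name is a placeholder, and `pcs quorum device add model net host=... algorithm=ffsplit` generates the equivalent:

```ini
# corosync.conf on the cluster nodes; corosync-qnetd runs on the 3rd site
quorum {
    provider: corosync_votequorum
    device {
        model: net
        votes: 1
        net {
            host: arbiter.example.com
            algorithm: ffsplit    # or lms
        }
    }
}
```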
Re: [ClusterLabs] Still Beginner STONITH Problem
How did you configure the network on your ubuntu 20.04 Hosts ? I tried to setup bridged connection for the test setup , but obviously I'm missing something. Best Regards, Strahil Nikolov На 14 юли 2020 г. 11:06:42 GMT+03:00, "stefan.schm...@farmpartner-tec.com" написа: >Hello, > > >Am 09.07.2020 um 19:10 Strahil Nikolov wrote: > >Have you run 'fence_virtd -c' ? >Yes I had run that on both Hosts. The current config looks like that >and >is identical on both. > >cat fence_virt.conf >fence_virtd { > listener = "multicast"; > backend = "libvirt"; > module_path = "/usr/lib64/fence-virt"; >} > >listeners { > multicast { > key_file = "/etc/cluster/fence_xvm.key"; > address = "225.0.0.12"; > interface = "bond0"; > family = "ipv4"; > port = "1229"; > } > >} > >backends { > libvirt { > uri = "qemu:///system"; > } > >} > > >The situation is still that no matter on what host I issue the >"fence_xvm -a 225.0.0.12 -o list" command, both guest systems receive >the traffic. The local guest, but also the guest on the other host. I >reckon that means the traffic is not filtered by any network device, >like switches or firewalls. Since the guest on the other host receives >the packages, the traffic must reach te physical server and >networkdevice and is then routed to the VM on that host. >But still, the traffic is not shown on the host itself. > >Further the local firewalls on both hosts are set to let each and every > >traffic pass. Accept to any and everything. Well at least as far as I >can see. > > >Am 09.07.2020 um 22:34 Klaus Wenninger wrote: > > makes me believe that > > the whole setup doesn't lookas I would have > > expected (bridges on each host where theguest > > has a connection to and where ethernet interfaces > > that connect the 2 hosts are part of as well > >On each physical server the networkcards are bonded to achieve failure >safety (bond0). 
The guests are connected over a bridge (br0) but >apparently our virtualization software creates its own device named >after the guest (kvm101.0). >There is no direct connection between the servers, but as I said >earlier, the multicast traffic does reach the VMs so I assume there is >no problem with that. > > >Am 09.07.2020 um 20:18 Vladislav Bogdanov wrote: > > First, you need to ensure that your switch (or all switches in the > > path) have igmp snooping enabled on host ports (and probably > > interconnects along the path between your hosts). > > >> Second, you need an igmp querier to be enabled somewhere near (better >> to have it enabled on a switch itself). Please verify that you see >its > > queries on hosts. > > > > Next, you probably need to make your hosts use IGMPv2 (not 3) as > > many switches still cannot understand v3. This is doable by sysctl; > > find it on the internet, there are many articles. > > >I have sent a query to our data center techs, who were already >analyzing whether multicast traffic is somewhere blocked >or hindered. So far the answer is, "multicast is explicitly allowed in >the local network and no packets are filtered or dropped". I am still >waiting for a final report though. > >In the meantime I have switched from IGMPv3 to IGMPv2 on every involved >server, hosts and guests, via the mentioned sysctl. The switching itself > >was successful, according to "cat /proc/net/igmp", but sadly did not >improve the behavior. It actually led to no VM receiving the >multicast traffic anymore at all. > >kind regards >Stefan Schmitz > > >On 09.07.2020 at 22:34, Klaus Wenninger wrote: >> On 7/9/20 5:17 PM, stefan.schm...@farmpartner-tec.com wrote: >>> Hello, >>> >>>> Well, theory still holds I would say. >>>> >>>> I guess that the multicast-traffic from the other host >>>> or the guests doesn't get to the daemon on the host. >>>> Can't you just simply check if there are any firewall >>>> rules configured on the host kernel?
>>> >>> I hope I did understand you correctly and you are referring to >iptables? >> I didn't say iptables because it might have been >> nftables - but yes that is what I was referring to. >> Guess to understand the config the output is >> lacking verbosity but it makes me believe that >> the whole setup doesn
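For reference, the IGMP version switch discussed in this thread is done via sysctl and verified per interface in /proc/net/igmp:

```sh
# Force IGMPv2 on all interfaces (0 restores the kernel default, IGMPv3)
sysctl -w net.ipv4.conf.all.force_igmp_version=2
sysctl -w net.ipv4.conf.default.force_igmp_version=2
# Verify which version each interface is currently using
cat /proc/net/igmp
```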
Re: [ClusterLabs] Still Beginner STONITH Problem
By default libvirt is using NAT and not routed network - in such case, vm1 won't receive data from host2. Can you provide the Networks' xml ? Best Regards, Strahil Nikolov На 15 юли 2020 г. 13:19:59 GMT+03:00, Klaus Wenninger написа: >On 7/15/20 11:42 AM, stefan.schm...@farmpartner-tec.com wrote: >> Hello, >> >> >> Am 15.07.2020 um 06:32 Strahil Nikolov wrote: >>> How did you configure the network on your ubuntu 20.04 Hosts ? I >>> tried to setup bridged connection for the test setup , but >obviously >>> I'm missing something. >>> >>> Best Regards, >>> Strahil Nikolov >>> >> >> on the hosts (CentOS) the bridge config looks like that.The bridging >> and configuration is handled by the virtualization software: >> >> # cat ifcfg-br0 >> DEVICE=br0 >> TYPE=Bridge >> BOOTPROTO=static >> ONBOOT=yes >> IPADDR=192.168.1.21 >> NETMASK=255.255.0.0 >> GATEWAY=192.168.1.1 >> NM_CONTROLLED=no >> IPV6_AUTOCONF=yes >> IPV6_DEFROUTE=yes >> IPV6_PEERDNS=yes >> IPV6_PEERROUTES=yes >> IPV6_FAILURE_FATAL=no >> >> >> >> Am 15.07.2020 um 09:50 Klaus Wenninger wrote: >> > Guess it is not easy to have your servers connected physically for >a >> try. >> > But maybe you can at least try on one host to have virt_fenced & VM >> > on the same bridge - just to see if that basic pattern is working. >> >> I am not sure if I understand you correctly. What do you by having >> them on the same bridge? The bridge device is configured on the host >> by the virtualization software. >I meant to check out which bridge the interface of the VM is enslaved >to and to use that bridge as interface in /etc/fence_virt.conf. >Get me right - just for now - just to see if it is working for this one >host and the corresponding guest. >> >> >> >Well maybe still sbdy in the middle playing IGMPv3 or the request >for >> >a certain source is needed to shoot open some firewall or >switch-tables. >> >> I am still waiting for the final report from our Data Center techs. I >> hope that will clear up somethings. 
>> >> >> Additionally I have just noticed that apparently since switching >from >> IGMPv3 to IGMPv2 and back, the command "fence_xvm -a 225.0.0.12 -o >> list" is now completely broken. >> Before that switch this command at least returned the local VM. Now >it >> returns: >> Timed out waiting for response >> Operation failed >> >> I am a bit confused by that, because all we did was running commands >> like "sysctl -w net.ipv4.conf.all.force_igmp_version =" with the >> different version numbers, and #cat /proc/net/igmp shows that V3 is >used >> again on every device just like before...?! >> >> kind regards >> Stefan Schmitz >> >> >>> On 14 July 2020, 11:06:42 GMT+03:00, >>> "stefan.schm...@farmpartner-tec.com" >>> wrote: >>>> Hello, >>>> >>>> >>>> Am 09.07.2020 um 19:10 Strahil Nikolov wrote: >>>>> Have you run 'fence_virtd -c' ? >>>> Yes I had run that on both Hosts. The current config looks like >that >>>> and >>>> is identical on both. >>>> >>>> cat fence_virt.conf >>>> fence_virtd { >>>> listener = "multicast"; >>>> backend = "libvirt"; >>>> module_path = "/usr/lib64/fence-virt"; >>>> } >>>> >>>> listeners { >>>> multicast { >>>> key_file = "/etc/cluster/fence_xvm.key"; >>>> address = "225.0.0.12"; >>>> interface = "bond0"; >>>> family = "ipv4"; >>>> port = "1229"; >>>> } >>>> >>>> } >>>> >>>> backends { >>>> libvirt { >>>> uri = "qemu:///system"; >>>> } >>>> >>>> } >>>> >>>> >>>> The situation is still that no matter on what host I issue the >>>> "fence_xvm -a 225.0.0.12 -o list" command, both guest systems >receive >>>> the traffic. The local guest, but also the guest on the other host. >I >>>> reckon that means the traffic is not filtered by any network >device, >>>>
Re: [ClusterLabs] Still Beginner STONITH Problem
If it is created by libvirt - this is NAT and you will never receive output from the other host. Best Regards, Strahil Nikolov On 15 July 2020, 15:05:48 GMT+03:00, "stefan.schm...@farmpartner-tec.com" wrote: >Hello, > >Am 15.07.2020 um 13:42 Strahil Nikolov wrote: >> By default libvirt is using NAT and not a routed network - in such >case, vm1 won't receive data from host2. >> >> Can you provide the network's XML? >> >> Best Regards, >> Strahil Nikolov >> > ># cat default.xml >(XML tags stripped by the mail archive; only the network name "default" survives) > >I just checked this and the file is identical on both hosts. > >kind regards >Stefan Schmitz > > >> On 15 July 2020, 13:19:59 GMT+03:00, Klaus Wenninger > wrote: >>> On 7/15/20 11:42 AM, stefan.schm...@farmpartner-tec.com wrote: >>>> Hello, >>>> >>>> >>>> Am 15.07.2020 um 06:32 Strahil Nikolov wrote: >>>>> How did you configure the network on your ubuntu 20.04 Hosts? I >>>>> tried to set up a bridged connection for the test setup, but >>> obviously >>>>> I'm missing something. >>>>> >>>>> Best Regards, >>>>> Strahil Nikolov >>>>> >>>> >>>> on the hosts (CentOS) the bridge config looks like that. The bridging >>>> and configuration is handled by the virtualization software: >>>> >>>> # cat ifcfg-br0 >>>> DEVICE=br0 >>>> TYPE=Bridge >>>> BOOTPROTO=static >>>> ONBOOT=yes >>>> IPADDR=192.168.1.21 >>>> NETMASK=255.255.0.0 >>>> GATEWAY=192.168.1.1 >>>> NM_CONTROLLED=no >>>> IPV6_AUTOCONF=yes >>>> IPV6_DEFROUTE=yes >>>> IPV6_PEERDNS=yes >>>> IPV6_PEERROUTES=yes >>>> IPV6_FAILURE_FATAL=no >>>> >>>> >>>> >>>> Am 15.07.2020 um 09:50 Klaus Wenninger wrote: >>>>> Guess it is not easy to have your servers connected physically for >>> a >>>> try. >>>>> But maybe you can at least try on one host to have virt_fenced & >VM >>>>> on the same bridge - just to see if that basic pattern is working. >>>> >>>> I am not sure if I understand you correctly. What do you mean by having >>>> them on the same bridge?
The bridge device is configured on the >host >>>> by the virtualization software. >>> I meant to check out which bridge the interface of the VM is >enslaved >>> to and to use that bridge as interface in /etc/fence_virt.conf. >>> Get me right - just for now - just to see if it is working for this >one >>> host and the corresponding guest. >>>> >>>> >>>>> Well maybe still sbdy in the middle playing IGMPv3 or the request >>> for >>>>> a certain source is needed to shoot open some firewall or >>> switch-tables. >>>> >>>> I am still waiting for the final report from our Data Center techs. >I >>>> hope that will clear up somethings. >>>> >>>> >>>> Additionally I have just noticed that apparently since switching >>> from >>>> IGMPv3 to IGMPv2 and back the command "fence_xvm -a 225.0.0.12 -o >>>> list" is no completely broken. >>>> Before that switch this command at least returned the local VM. Now >>> it >>>> returns: >>>> Timed out waiting for response >>>> Operation failed >>>> >>>> I am a bit confused by that, because all we did was running >commands >>>> like "sysctl -w net.ipv4.conf.all.force_igmp_version =" with the >>>> different Version umbers and #cat /proc/net/igmp shows that V3 is >>> used >>>> again on every device just like before...?! >>>> >>>> kind regards >>>> Stefan Schmitz >>>> >>>> >>>>> На 14 юли 2020 г. 11:06:42 GMT+03:00, >>>>> "stefan.schm...@farmpartner-tec.com" >>>>> написа: >>>>>> Hello, >>>>>> >>>>>> >>>>>> Am 09.07.2020 um 19:10 Strahil Nikolov wrote: >>>>>>> Have you run 'fence_virtd -c' ? >>>>>> Yes I had run that on both Hosts. The current config looks like >>> that >>>>>> and >>>>>> is identical on both. >>>>>> &
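The XML in the quoted message was stripped by the mail archive. For orientation, the stock libvirt `default` network is the NAT definition Strahil describes, while a bridged definition attaches guests directly to the host bridge (br0 is taken from the ifcfg-br0 shown earlier in the thread; both snippets are illustrative — check `virsh net-dumpxml` on the actual hosts):

```xml
<!-- typical stock NAT network: guests are masqueraded, so multicast
     from the other host never reaches them -->
<network>
  <name>default</name>
  <forward mode='nat'/>
  <bridge name='virbr0' stp='on' delay='0'/>
  <ip address='192.168.122.1' netmask='255.255.255.0'>
    <dhcp><range start='192.168.122.2' end='192.168.122.254'/></dhcp>
  </ip>
</network>

<!-- bridged alternative: guests sit on the host bridge and can see
     multicast from the physical LAN -->
<network>
  <name>hostbridge</name>
  <forward mode='bridge'/>
  <bridge name='br0'/>
</network>
```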
Re: [ClusterLabs] jquery in pcs package
Firewalld's add-service (without zone definition) will add it on the default zone which by default is public. If you have public and private zones , and the cluster is supposed to communicate over the private VLAN, you can open the port only there. Best Regards, Strahil Nikolov На 2 юли 2020 г. 13:40:02 GMT+03:00, Tony Stocker написа: >On Wed, Jul 1, 2020 at 1:44 PM Tony Stocker >wrote: >> >> So, first question: is this jquery something that is maintained, >> promulgated by/with the Pacemaker installation? Or is this something >> special that Red Hat is doing when they package it? > >So, investigating the source code in GitHub, the inclusion of this >jquery is part of Pacemaker/pcs and related to the Web UI. So this >should be the proper forum to address it. > >> Second, if this is Pacemaker-maintained (not Red Hat) part of code, >is >> there a reason that it's such an old version, given that the current >> version is 3.5.0, is used? > >Based on the GitHub check-in date, it appears that this section of >code hasn't been updated in 7 years. > >> Finally, if this is Pacemaker-maintained (not Red Hat) part of code, >> where can I find the documentation regarding the patching that's been >> done to address the various cross-site scripting vulnerabilities? I'm >> working under the assumption that the binary has been patched and the >> vulnerabilities are no longer present, in which case I have to >> document it with security. Obviously if the code has not been patched >> and it's still vulnerable, that's a whole different issue. > >So, one would assume since there haven't been any updates to the code >that this code is indeed vulnerable to all the XSS vulnerabilities, >which is not good. Regardless of anything else below, does anyone know >if there are any plans to update this part of the code to deal with >these security issues? 
> >What appears to be worse is that this Web UI interface is not >optional, and runs on the communication port (default=2224) across all >interfaces on a system. So, even though I set up a cluster using host >names/addresses which are on a private lan, the security scanner tool >is still finding the Web UI running on port 2224 on the public IP >interface of the system. This can't be the correct/intended behavior, >can it? I'm thinking that this has to do with the setup step that I >see in pretty much all how-to documents that looks like this one from >the Red Hat 8 "Configuring and Maintaining High Availability Clusters" >document, section 4.7: > >"If you are running the firewalld daemon, execute the following >commands to enable the ports that are required by the Red Hat High >Availability Add-On. ># firewall-cmd --permanent --add-service=high-availability ># firewall-cmd --add-service=high-availability" > >Here is the description in the same document for Port 2224/tcp: >"Default pcsd port required on all nodes (needed by the pcsd Web UI >and required for node-to-node communication). You can configure the >pcsd port by means of the PCSD_PORT parameter in the >/etc/sysconfig/pcsd file. > >It is crucial to open port 2224 in such a way that pcs from any node >can talk to all nodes in the cluster, including itself. When using the >Booth cluster ticket manager or a quorum device you must open port >2224 on all related hosts, such as Booth arbiters or the quorum device >host. " > >Executing this command appears to add the 'high-availability' >"service" to all zones in firewalld, which I don't believe is needed, >or am I wrong? If you have nodes with multiple network interfaces (in >my test case each node is attached to 3 networks,) do the nodes have >to have pcsd access across all the networks? 
> >Even if I can mitigate things by only allowing 'high-availability' >service ports on a single, private LAN, is there any way to DISABLE >the Web UI so that it doesn't run at all? I don't use it, nor have any >intention of doing so, and having a separate, unmaintained (as in >patched for vulnerabilities) http service running on a 'random' port >is not something our project management, and certainly the security >division approves of doing. > >Thanks. >___ >Manage your subscription: >https://lists.clusterlabs.org/mailman/listinfo/users > >ClusterLabs home: https://www.clusterlabs.org/
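Strahil's zone-scoping advice at the top of this message, spelled out as commands — 'internal' and 'eth1' are placeholders for the zone and interface of the private VLAN:

```sh
# Bind the cluster interface to a non-public zone, then open the
# high-availability service only in that zone
firewall-cmd --permanent --zone=internal --change-interface=eth1
firewall-cmd --permanent --zone=internal --add-service=high-availability
firewall-cmd --reload
```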
Re: [ClusterLabs] Still Beginner STONITH Problem
Have you run 'fence_virtd -c' ? I made a silly mistake last time when I deployed it and the daemon was not listening on the right interface. Netstat can check this out. Also, as far as I know, hosts use unicast to reply to the VMs (thus tcp/1229 and not udp/1229). If you have a developer account for Red Hat, you can check https://access.redhat.com/solutions/917833 Best Regards, Strahil Nikolov On 9 July 2020, 17:01:13 GMT+03:00, "stefan.schm...@farmpartner-tec.com" wrote: >Hello, > >thanks for the advice. I have worked through that list as follows: > > > - key deployed on the Hypervisors > > - key deployed on the VMs >I created the key file a while ago once on one host and distributed it >to every other host and guest. Right now it resides on all 4 machines >in >the same path: /etc/cluster/fence_xvm.key >Is there maybe a corosync/Stonith or other function which checks the >keyfiles for any corruption or errors? > > > > - fence_virtd running on both Hypervisors >It is running on each host: ># ps aux |grep fence_virtd >root 62032 0.0 0.0 251568 4496 ? Ss Jun29 0:00 >fence_virtd > > >> - Firewall opened (1229/udp for the hosts, 1229/tcp for the >guests) > >Command on one host: >fence_xvm -a 225.0.0.12 -o list > >tcpdump on the guest residing on the other host: >host2.55179 > 225.0.0.12.1229: [udp sum ok] UDP, length 176 >host2 > igmp.mcast.net: igmp v3 report, 1 group record(s) [gaddr >225.0.0.12 to_in { }] >host2 > igmp.mcast.net: igmp v3 report, 1 group record(s) [gaddr >225.0.0.12 to_in { }] >At least to me it looks like the VMs are reachable by the multicast >traffic.
> > - fence_xvm on both VMs >fence_xvm is installed on both VMs ># which fence_xvm >/usr/sbin/fence_xvm > >Could you please advise on how to proceed? Thank you in advance. >Kind regards >Stefan Schmitz > >On 08.07.2020 at 20:24, Strahil Nikolov wrote: >> Erm...network/firewall is always "green". Run tcpdump on Host1 and >VM2 (not on the same host). >> Then run again 'fence_xvm -o list' and check what is captured. >> >> In summary, you need: >> - key deployed on the Hypervisors >> - key deployed on the VMs >> - fence_virtd running on both Hypervisors >> - Firewall opened (1229/udp for the hosts, 1229/tcp for the >guests) >> - fence_xvm on both VMs >> >> In your case, the primary suspect is multicast traffic. >> >> Best Regards, >> Strahil Nikolov >> >> On 8 July 2020, 16:33:45 GMT+03:00, >"stefan.schm...@farmpartner-tec.com" > wrote: >>> Hello, >>> >>>> I can't find fence_virtd for Ubuntu18, but it is available for >>>> Ubuntu20. >>> >>> We have now upgraded our Server to Ubuntu 20.04 LTS and installed >the >>> packages fence-virt and fence-virtd. >>> >>> The command "fence_xvm -a 225.0.0.12 -o list" on the Hosts still >just >>> returns the single local VM. >>> >>> The same command on both VMs results in: >>> # fence_xvm -a 225.0.0.12 -o list >>> Timed out waiting for response >>> Operation failed >>> >>> But just as before, trying to connect from the guest to the host via >nc >>> >>> just works fine. >>> #nc -z -v -u 192.168.1.21 >>> 1229 >>> Connection to 192.168.1.21 1229 port [udp/*] succeeded! >>> >>> So the hosts and service basically are reachable. >>> >>> I have spoken to our Firewall tech, he has assured me that no local >>> traffic is hindered by anything. Be it multicast or not. >>> Software Firewalls are not present/active on any of our servers.
>>> >>> Ubuntu guests: >>> # ufw status >>> Status: inactive >>> >>> CentOS hosts: >>> systemctl status firewalld >>> ● firewalld.service - firewalld - dynamic firewall daemon >>> Loaded: loaded (/usr/lib/systemd/system/firewalld.service; >disabled; >>> vendor preset: enabled) >>> Active: inactive (dead) >>> Docs: man:firewalld(1) >>> >>> >>> Any hints or help on how to remedy this problem would be greatly >>> appreciated! >>> >>> Kind regards >>>
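The five-item checklist being worked through above condenses to a few commands per machine. A sketch of those checks (the interface, multicast address, and key path are the ones used in this thread; adapt to your setup):

```shell
# On every host and guest: key must be byte-identical everywhere
md5sum /etc/cluster/fence_xvm.key

# On each KVM host: daemon running and bound to the intended interface
systemctl status fence_virtd
ss -lnup | grep 1229

# On a host, while querying from the other host, watch both multicast
# request and unicast reply traffic:
tcpdump -i bond0 -n 'port 1229 or host 225.0.0.12'

# Final check: this should list the guests of BOTH hypervisors
fence_xvm -a 225.0.0.12 -o list
```

If the request shows up on the guest but no reply ever reaches the querying host, the tcp/1229 return path from guest to host is the usual suspect, which matches the unicast-reply point made at the top of this message.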
Re: [ClusterLabs] Still Beginner STONITH Problem
Erm...network/firewall is always "green". Run tcpdump on Host1 and VM2 (not on the same host). Then run again 'fence_xvm -o list' and check what is captured. In summary, you need: - key deployed on the Hypervisours - key deployed on the VMs - fence_virtd running on both Hypervisours - Firewall opened (1229/udp for the hosts, 1229/tcp for the guests) - fence_xvm on both VMs In your case , the primary suspect is multicast traffic. Best Regards, Strahil Nikolov На 8 юли 2020 г. 16:33:45 GMT+03:00, "stefan.schm...@farmpartner-tec.com" написа: >Hello, > >>I can't find fence_virtd for Ubuntu18, but it is available for >>Ubuntu20. > >We have now upgraded our Server to Ubuntu 20.04 LTS and installed the >packages fence-virt and fence-virtd. > >The command "fence_xvm -a 225.0.0.12 -o list" on the Hosts still just >returns the single local VM. > >The same command on both VMs results in: ># fence_xvm -a 225.0.0.12 -o list >Timed out waiting for response >Operation failed > >But just as before, trying to connect from the guest to the host via nc > >just works fine. >#nc -z -v -u 192.168.1.21 1229 >Connection to 192.168.1.21 1229 port [udp/*] succeeded! > >So the hosts and service basically is reachable. > >I have spoken to our Firewall tech, he has assured me, that no local >traffic is hindered by anything. Be it multicast or not. >Software Firewalls are not present/active on any of our servers. > >Ubuntu guests: ># ufw status >Status: inactive > >CentOS hosts: >systemctl status firewalld >● firewalld.service - firewalld - dynamic firewall daemon > Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; >vendor preset: enabled) >Active: inactive (dead) > Docs: man:firewalld(1) > > >Any hints or help on how to remedy this problem would be greatly >appreciated! > >Kind regards >Stefan Schmitz > > >Am 07.07.2020 um 10:54 schrieb Klaus Wenninger: >> On 7/7/20 10:33 AM, Strahil Nikolov wrote: >>> I can't find fence_virtd for Ubuntu18, but it is available for >Ubuntu20. 
>>> >>> Your other option is to get an iSCSI from your quorum system and use >that for SBD. >>> For watchdog, you can use 'softdog' kernel module or you can use KVM >to present one to the VMs. >>> You can also check the '-P' flag for SBD. >> With kvm please use the qemu-watchdog and try to >> prevent using softdogwith SBD. >> Especially if you are aiming for a production-cluster ... >> >> Adding something like that to libvirt-xml should do the trick: >> >> > function='0x0'/> >> >> >>> >>> Best Regards, >>> Strahil Nikolov >>> >>> На 7 юли 2020 г. 10:11:38 GMT+03:00, >"stefan.schm...@farmpartner-tec.com" > написа: >>>>> What does 'virsh list' >>>>> give you onthe 2 hosts? Hopefully different names for >>>>> the VMs ... >>>> Yes, each host shows its own >>>> >>>> # virsh list >>>> IdName Status >>>> >>>> 2 kvm101 running >>>> >>>> # virsh list >>>> IdName State >>>> >>>> 1 kvm102 running >>>> >>>> >>>> >>>>> Did you try 'fence_xvm -a {mcast-ip} -o list' on the >>>>> guests as well? >>>> fence_xvm sadly does not work on the Ubuntu guests. The howto said >to >>>> install "yum install fence-virt fence-virtd" which do not exist as >>>> such >>>> in Ubuntu 18.04. After we tried to find the appropiate packages we >>>> installed "libvirt-clients" and "multipath-tools". Is there maybe >>>> something misisng or completely wrong? >>>> Though we can connect to both hosts using "nc -z -v -u >192.168.1.21 >>>> 1229", that just works fine. >>>> >> without fence-virt you can't expect the whole thing to work. >> maybe you can build it for your ubuntu-version from sources of >> a package for another ubuntu-version if it doesn't exist yet. >> btw. which pacemaker-version are you using? >> There was a convenience-fix on the master-branch for at least >> a couple of days (sometimes during 2.0.4 release-cycle) that >> wasn't compatible with fence_xvm. >>>>> Usually, the biggest p
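The libvirt snippet Klaus posted arrived mangled by the mail wrapping. A reconstructed version of the qemu watchdog definition for the domain XML (the PCI slot number here is illustrative, not taken from the thread):

```xml
<watchdog model='i6300esb' action='reset'>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
</watchdog>
```

With this in place the guest gets a hardware-emulated watchdog at /dev/watchdog, which is the device SBD should use instead of the softdog module.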
Re: [ClusterLabs] pacemaker together with ovirt or Kimchi ?
It won't make sense: oVirt has built-in HA for virtual machines. Best Regards, Strahil Nikolov On Saturday, July 11, 2020, 17:50:18 GMT+3, Lentes, Bernd wrote: Hi, I'm having a two-node cluster with pacemaker and about 10 virtual domains as resources. It's running fine. I configure/administrate everything with the crm shell. But I'm also looking for a web interface. I'm not much impressed by HAWK. Is it possible to use Kimchi or oVirt together with a pacemaker HA cluster? Bernd -- Bernd Lentes Systemadministration Institute for Metabolism and Cell Death (MCD) Building 25 - office 122 Helmholtz Zentrum München bernd.len...@helmholtz-muenchen.de phone: +49 89 3187 1241 phone: +49 89 3187 3827 fax: +49 89 3187 2294 http://www.helmholtz-muenchen.de/mcd stay healthy stay at home Helmholtz Zentrum München Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH) Ingolstaedter Landstr. 1 85764 Neuherberg www.helmholtz-muenchen.de Aufsichtsratsvorsitzende: MinDir.in Prof. Dr. Veronika von Messling Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Kerstin Guenther Registergericht: Amtsgericht Muenchen HRB 6466 USt-IdNr: DE 129521671 ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
Re: [ClusterLabs] Antw: [EXT] Failed fencing monitor process (fence_vmware_soap) RHEL 8
What about second fencing mechanism ? You can add a shared (independent) vmdk as an sbd device. The reconfiguration will require cluster downtime, but this is only necessary once. Once 2 fencing mechanisms are available - you can configure the order easily. Best Regards, Strahil Nikolov В четвъртък, 18 юни 2020 г., 10:29:22 Гринуич+3, Ulrich Windl написа: Hi! I can't give much detailed advice, but I think any network service should have a timeout of at least 30 Sekonds (you have timeout=2ms). And "after 100 failures" is symbolic, not literal: It means it failed too often, so I won't retry. Regards, Ulrich >>> Howard schrieb am 17.06.2020 um 21:05 in Nachricht <2817_1592420740_5EEA6983_2817_3_1_CAO51vj6oXjfvhGQz7oOu=Pi+D_cKh5M1gfDL_2tAbKmw mq...@mail.gmail.com>: > Hello, recently I received some really great advice from this community > regarding changing the token timeout value in corosync. Thank you! Since > then the cluster has been working perfectly with no errors in the log for > more than a week. > > This morning I logged in to find a stopped stonith device. If I'm reading > the log right, it looks like it failed 1 million times in ~20 seconds then > gave up. If you wouldn't mind looking at the logs below, is there some way > that I can make this more robust so that it can recover? I'll be > investigating the reason for the timeout but would like to help the system > recover on its own. 
> > Servers: RHEL 8.2 > > Cluster name: cluster_pgperf2 > Stack: corosync > Current DC: srv1 (version 2.0.2-3.el8_1.2-744a30d655) - partition with > quorum > Last updated: Wed Jun 17 11:47:42 2020 > Last change: Tue Jun 16 22:00:29 2020 by root via crm_attribute on srv1 > > 2 nodes configured > 4 resources configured > > Online: [ srv1 srv2 ] > > Full list of resources: > > Clone Set: pgsqld-clone [pgsqld] (promotable) > Masters: [ srv1 ] > Slaves: [ srv2 ] > pgsql-master-ip (ocf::heartbeat:IPaddr2): Started srv1 > vmfence (stonith:fence_vmware_soap): Stopped > > Failed Resource Actions: > * vmfence_start_0 on srv2 'OCF_TIMEOUT' (198): call=19, status=Timed Out, > exitreason='', > last-rc-change='Wed Jun 17 08:34:16 2020', queued=7ms, exec=20184ms > * vmfence_start_0 on srv1 'OCF_TIMEOUT' (198): call=44, status=Timed Out, > exitreason='', > last-rc-change='Wed Jun 17 08:33:55 2020', queued=0ms, exec=20008ms > > Daemon Status: > corosync: active/disabled > pacemaker: active/disabled > pcsd: active/enabled > > pcs resource config > Clone: pgsqld-clone > Meta Attrs: notify=true promotable=true > Resource: pgsqld (class=ocf provider=heartbeat type=pgsqlms) > Attributes: bindir=/usr/bin pgdata=/var/lib/pgsql/data > Operations: demote interval=0s timeout=120s (pgsqld-demote-interval-0s) > methods interval=0s timeout=5 (pgsqld-methods-interval-0s) > monitor interval=15s role=Master timeout=60s > (pgsqld-monitor-interval-15s) > monitor interval=16s role=Slave timeout=60s > (pgsqld-monitor-interval-16s) > notify interval=0s timeout=60s (pgsqld-notify-interval-0s) > promote interval=0s timeout=30s (pgsqld-promote-interval-0s) > reload interval=0s timeout=20 (pgsqld-reload-interval-0s) > start interval=0s timeout=60s (pgsqld-start-interval-0s) > stop interval=0s timeout=60s (pgsqld-stop-interval-0s) > monitor interval=60s timeout=60s > (pgsqld-monitor-interval-60s) > Resource: pgsql-master-ip (class=ocf provider=heartbeat type=IPaddr2) > Attributes: cidr_netmask=24 
ip=xxx.xxx.xxx.xxx > Operations: monitor interval=10s (pgsql-master-ip-monitor-interval-10s) > start interval=0s timeout=20s > (pgsql-master-ip-start-interval-0s) > stop interval=0s timeout=20s > (pgsql-master-ip-stop-interval-0s) > > pcs stonith config > Resource: vmfence (class=stonith type=fence_vmware_soap) > Attributes: ipaddr=xxx.xxx.xxx.xxx login=\ > passwd_script= pcmk_host_map=srv1:x;srv2:y ssl=1 > ssl_insecure=1 > Operations: monitor interval=60s (vmfence-monitor-interval-60s) > > pcs resource failcount show > Failcounts for resource 'vmfence' > srv1: INFINITY > srv2: INFINITY > > Here are the versions installed: > [postgres@srv1 cluster]$ rpm -qa|grep > "pacemaker\|pcs\|corosync\|fence-agents-vmware-soap\|paf" > corosync-3.0.2-3.el8_1.1.x86_64 > corosync-qdevice-3.0.0-2.el8.x86_64 > corosync-qnetd-3.0.0-2.el8.x86_64 > corosynclib-3.0.2-3.el8_1.1.x86_64 > fence-agents-vmware-soap-4.2.1-41.el8.noarch > pacemaker-2.0.2-3.el8_1.2.x86_64 > pacemaker-
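Given the failcount output above, one mitigation while the vCenter slowness is investigated is to lengthen the monitor timeout and let failures expire instead of accumulating to INFINITY. A sketch (option names should be verified against the pcs version shipped with RHEL 8; these are not commands from the thread):

```shell
# Give the fence_vmware_soap monitor more time before it is declared failed
pcs resource update vmfence op monitor interval=60s timeout=60s

# Expire vmfence failures after 10 minutes so the device can recover on its own
pcs resource meta vmfence failure-timeout=600s

# Clear the current INFINITY failcounts so the device can start again now
pcs resource cleanup vmfence
```

As Ken notes later in this digest, a failed start only stops monitoring; the cluster can still use the device to fence, so this tuning is about visibility and self-healing rather than fencing capability.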
Re: [ClusterLabs] Antw: Re: Antw: [EXT] Two node cluster and extended distance/site failure
Instead of NFS, iSCSI is also an option. Best Regards, Strahil Nikolov On June 24, 2020, 13:42:26 GMT+03:00, Andrei Borzenkov wrote: >24.06.2020 12:20, Ulrich Windl wrote: >>> >>> How does Service Guard handle loss of shared storage? >> >> When a node is up it would log the event; if a node is down it >wouldn't care; >> if a node detects a communication problem with the other node, it >would fence >> itself. >> > >So in case of split brain without a witness, both nodes fence themselves and >become unavailable. Which is exactly what I'd like to avoid if >possible. > >> But honestly: What sense does it make to run a node if the shared >storage is >> unavailable? >> > >Cluster nodes may use NFS, which is not suitable for SBD (although I >wonder if a shared file on NFS may work), and shared SAN storage used for >witness only. In this case it makes all sorts of sense to continue when >the witness is not available. >___ >Manage your subscription: >https://lists.clusterlabs.org/mailman/listinfo/users > >ClusterLabs home: https://www.clusterlabs.org/
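When sizing an SBD witness like the iSCSI option above, the timers have to nest: the on-disk msgwait must outlive the watchdog timeout, and pacemaker's stonith-timeout must outlive msgwait. A small sketch of the commonly cited rule of thumb (the multipliers are conventional guidance, not an official formula), using the SBD_WATCHDOG_TIMEOUT=5 from the first message in this digest:

```shell
watchdog=5                              # SBD_WATCHDOG_TIMEOUT
msgwait=$((watchdog * 2))               # poison pill must outlive a watchdog reset
stonith_timeout=$((msgwait * 12 / 10))  # pacemaker waits ~20% longer than msgwait
echo "watchdog=${watchdog} msgwait=${msgwait} stonith-timeout=${stonith_timeout}"
# prints: watchdog=5 msgwait=10 stonith-timeout=12
```

The msgwait value is what you pass to `sbd create` when initializing the device; stonith-timeout is a cluster property.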
Re: [ClusterLabs] Beginner with STONITH Problem
Hello Stefan, There are multiple options for stonith, but it depends on the environment. Are the VMs in the same VLAN like the hosts? I am asking this , as the most popular candidate is 'fence_xvm' but it requires the VM to send fencing request to the KVM host (multicast) where the partner VM is hosted . Another approach is to use a shared disk (either over iSCSI or SAN) and use sbd for power-based fencing, or use SCSI3 Persistent Reservations (which can also be converted into a power-based fencing). Best Regards, Strahil Nikolov На 24 юни 2020 г. 13:44:27 GMT+03:00, "stefan.schm...@farmpartner-tec.com" написа: >Hello, > >I am an absolute beginner trying to setup our first HA Cluster. >So far I have been working with the "Pacemaker 1.1 Clusters from >Scratch" Guide wich worked for me perfectly up to the Point where I >need >to install and configure STONITH. > >Curerent Situation is:2 Ubuntu Server as the cluster. Both of those >Servers are virtual machines running on 2 Centos KVM Hosts. >Those are the devices or ressources we can use for a STONITH >implementation. In this and other guides I do read a lot about external > >devices and in the "pcs stonith list" there are some XEN but sadly I >cannot find anything about KVM. At this point I am stumped and have no >clue in how to proceed, I am not even sure what further inforamtion I >shopuld provide that would be useful for giving advise? > >The current pcs status is: > ># pcs status >Cluster name: pacemaker_cluster >WARNING: corosync and pacemaker node names do not match (IPs used in >setup?) 
>Stack: corosync >Current DC: server2ubuntu1 (version 1.1.18-2b07d5c5a9) - partition with > >quorum >Last updated: Wed Jun 24 12:43:24 2020 >Last change: Wed Jun 24 12:35:17 2020 by root via cibadmin on >server4ubuntu1 > >2 nodes configured >12 resources configured > >Online: [ server2ubuntu1 server4ubuntu1 ] > >Full list of resources: > > Master/Slave Set: r0_pacemaker_Clone [r0_pacemaker] > Masters: [ server4ubuntu1 ] > Slaves: [ server2ubuntu1 ] > Clone Set: dlm-clone [dlm] > Stopped: [ server2ubuntu1 server4ubuntu1 ] > Clone Set: ClusterIP-clone [ClusterIP] (unique) > ClusterIP:0(ocf::heartbeat:IPaddr2): Started >server4ubuntu1 > ClusterIP:1(ocf::heartbeat:IPaddr2): Started >server4ubuntu1 > Master/Slave Set: WebDataClone [WebData] > Masters: [ server2ubuntu1 server4ubuntu1 ] > Clone Set: WebFS-clone [WebFS] > Stopped: [ server2ubuntu1 server4ubuntu1 ] > Clone Set: WebSite-clone [WebSite] > Stopped: [ server2ubuntu1 server4ubuntu1 ] > >Failed Actions: >* dlm_start_0 on server2ubuntu1 'not configured' (6): call=437, >status=complete, exitreason='', > last-rc-change='Wed Jun 24 12:35:30 2020', queued=0ms, exec=86ms >* r0_pacemaker_monitor_6 on server2ubuntu1 'master' (8): call=438, >status=complete, exitreason='', > last-rc-change='Wed Jun 24 12:36:30 2020', queued=0ms, exec=0ms >* dlm_start_0 on server4ubuntu1 'not configured' (6): call=441, >status=complete, exitreason='', > last-rc-change='Wed Jun 24 12:35:30 2020', queued=0ms, exec=74ms > > >Daemon Status: > corosync: active/disabled > pacemaker: active/disabled > pcsd: active/enabled > > > >I have researched the shown dlm Problem but everything I have found >says >that configuring STONITH would solve that issue. >Could please someone advise on how to proceed? > >Thank you in advance! 
> >Kind regards >Stefan Schmitz > > >___ >Manage your subscription: >https://lists.clusterlabs.org/mailman/listinfo/users > >ClusterLabs home: https://www.clusterlabs.org/ ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
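For the two-host layout described above, once fence_xvm answers on the command line, wiring it into pacemaker is one stonith resource per hypervisor, along these lines (a sketch: resource names are invented, and pcmk_host_map pairs the cluster node name with the `virsh list` domain name):

```shell
# Fencing for the guest on host 1
pcs stonith create fence_host1 fence_xvm \
    multicast_address=225.0.0.12 key_file=/etc/cluster/fence_xvm.key \
    pcmk_host_map="server2ubuntu1:kvm101"

# Fencing for the guest on host 2
pcs stonith create fence_host2 fence_xvm \
    multicast_address=225.0.0.12 key_file=/etc/cluster/fence_xvm.key \
    pcmk_host_map="server4ubuntu1:kvm102"
```

With stonith configured and enabled, the 'not configured' dlm failures shown in the pcs status output should also clear after a cleanup.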
Re: [ClusterLabs] Beginner with STONITH Problem
Hi Stefan, this sounds like a firewall issue. Check that port udp/1229 is opened for the hypervisors and tcp/1229 for the VMs. P.S.: The protocols are based on my fading memory, so double-check them. Best Regards, Strahil Nikolov On June 25, 2020, 18:18:46 GMT+03:00, "stefan.schm...@farmpartner-tec.com" wrote: >Hello, > >I have now tried to use that "how to" to make things work. Sadly I have > >run into a couple of problems. > >I have installed and configured fence_xvm as described in the >walk-through, but as expected fence_virtd does not find all VMs, >only >the one installed on itself. >In the configuration I have chosen "bond0" as the listener's interface >since the hosts have bonding configured. I have appended the complete >fence_virt.conf at the end of the mail. >All 4 servers, CentOS hosts and Ubuntu VMs, are in the same network. >Also >the generated key is present on all 4 servers. > >Still the "fence_xvm -o list" command only results in showing the local >VM > ># fence_xvm -o list >kvm101 beee402d-c6ac-4df4-9b97-bd84e637f2e7 >on > >I have tried the "Alternative configuration for guests running on >multiple hosts" but this fails right from the start, because the >packages libvirt-qpid are not available > ># yum install -y libvirt-qpid qpidd >[...] >No package libvirt-qpid available. >No package qpidd available. > >Could anyone please advise on how to proceed to get both nodes >recognized by the CentOS hosts? As a side note, all 4 servers can ping >each other, so they are present and available in the same network. > >In addition, I can't seem to find the correct packages for Ubuntu 18.04 >to install on the VMs. Trying to install fence_virt and/or fence_xvm >just results in "E: Unable to locate package fence_xvm/fence_virt". >Are those packages available at all for Ubuntu 18.04? I could only find >them for 20.04, or are they just called completely differently, so that I >am >not able to find them? > >Thank you in advance for your help! 
> >Kind regards >Stefan Schmitz > > >The current /etc/fence_virt.conf: > >fence_virtd { > listener = "multicast"; > backend = "libvirt"; > module_path = "/usr/lib64/fence-virt"; >} > >listeners { > multicast { > key_file = "/etc/cluster/fence_xvm.key"; > address = "225.0.0.12"; > interface = "bond0"; > family = "ipv4"; > port = "1229"; > } > >} > >backends { > libvirt { > uri = "qemu:///system"; > } > >} > > > > > > > >Am 25.06.2020 um 10:28 schrieb stefan.schm...@farmpartner-tec.com: >> Hello and thank you both for the help, >> >> >> Are the VMs in the same VLAN like the hosts? >> Yes the VMs and Hosts are all in the same VLan. So I will try the >> fence_xvm solution. >> >> > https://wiki.clusterlabs.org/wiki/Guest_Fencing >> Thank you for the pointer to that walk-through. Sadly every VM is on >its >> own host which is marked as "Not yet supported" but still this how to >is >> a good starting point and I will try to work and tweak my way through >it >> for out setup. >> >> Thanks again! >> >> Kind regards >> Stefan Schmitz >> >> Am 24.06.2020 um 15:51 schrieb Ken Gaillot: >>> On Wed, 2020-06-24 at 15:47 +0300, Strahil Nikolov wrote: >>>> Hello Stefan, >>>> >>>> There are multiple options for stonith, but it depends on the >>>> environment. >>>> Are the VMs in the same VLAN like the hosts? I am asking this , as >>>> the most popular candidate is 'fence_xvm' but it requires the VM to >>>> send fencing request to the KVM host (multicast) where the partner >VM >>>> is hosted . >>> >>> FYI a fence_xvm walk-through for the simple case is available on the >>> ClusterLabs wiki: >>> >>> https://wiki.clusterlabs.org/wiki/Guest_Fencing >>> >>>> Another approach is to use a shared disk (either over iSCSI or >>>> SAN) and use sbd for power-based fencing, or use SCSI3 >Persistent >>>> Reservations (which can also be converted into a power-based >>>> fencing). >>>> >>>> >>>> Best Regards, >>>> Strahil Nikolov >>>> >>>> >>>> На 24 юни
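The port advice at the top of this message translates to firewall rules like the following (hedged: verify the UDP/TCP split against your fence-virt version, since as noted it is from memory):

```shell
# On each CentOS KVM host: fence_virtd listens for multicast requests on udp/1229
firewall-cmd --permanent --add-port=1229/udp
firewall-cmd --reload

# On each Ubuntu guest: fence_xvm waits for the host's unicast reply on tcp/1229
ufw allow 1229/tcp
```

If software firewalls are genuinely inactive on every machine, as reported later in this digest, these rules are moot and the remaining suspect is multicast handling on the network path between the hosts.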
Re: [ClusterLabs] DRBD sync stalled at 100% ?
i was thinking about a github issue, but it seems that only 'linstor-server' has an issue section. Best Regards, Strahil Nikolov На 28 юни 2020 г. 20:13:21 GMT+03:00, Eric Robinson написа: >I could if linbit had per-incident pricing. Unfortunately, they only >offer yearly contracts, which is way more than I need. > >Get Outlook for Android<https://aka.ms/ghei36> > >________ >From: Strahil Nikolov >Sent: Sunday, June 28, 2020 4:11:47 AM >To: Eric Robinson ; Cluster Labs - All topics >related to open-source clustering welcomed >Subject: RE: [ClusterLabs] DRBD sync stalled at 100% ? > >I guess you can open an issue to linbit, as you still have the logs. > >Best Regards, >Strahil Nikolov > >На 28 юни 2020 г. 8:19:59 GMT+03:00, Eric Robinson > написа: >>I fixed it with a drbd down/up. >> >>From: Users On Behalf Of Eric Robinson >>Sent: Saturday, June 27, 2020 4:32 PM >>To: Cluster Labs - All topics related to open-source clustering >>welcomed ; Strahil Nikolov >> >>Subject: Re: [ClusterLabs] DRBD sync stalled at 100% ? >> >>Thanks for the feedback. I was hoping for a non-downtime solution. No >>way to do that? >>Get Outlook for Android<https://aka.ms/ghei36> >> >> >>From: Strahil Nikolov >>mailto:hunter86...@yahoo.com>> >>Sent: Saturday, June 27, 2020 2:40:38 PM >>To: Cluster Labs - All topics related to open-source clustering >>welcomed mailto:users@clusterlabs.org>>; Eric >>Robinson mailto:eric.robin...@psmnv.com>> >>Subject: Re: [ClusterLabs] DRBD sync stalled at 100% ? >> >>I've seen this on a test setup after multiple network >disruptions. >>I managed to fix it by stopping drbd on all nodes and starting it >>back. >> >>I guess you can get downtime and try that approach. >> >> >>Best Regards, >>Strahil Nikolov >> >> >> >>На 27 юни 2020 г. 16:36:10 GMT+03:00, Eric Robinson >>mailto:eric.robin...@psmnv.com>> написа: >>>I'm not seeing anything on Google about this. Two DRBD nodes lost >>>communication with each other, and then reconnected and started sync. 
>>>But then it got to 100% and is just stalled there. >>> >>>The nodes are 001db03a, 001db03b. >>> >>>On 001db03a: >>> >>>[root@001db03a ~]# drbdadm status >>>ha01_mysql role:Primary >>> disk:UpToDate >>> 001db03b role:Secondary >>>replication:SyncSource peer-disk:Inconsistent done:100.00 >>> >>>ha02_mysql role:Secondary >>> disk:UpToDate >>> 001db03b role:Primary >>>peer-disk:UpToDate >>> >>>On 001drbd03b: >>> >>>[root@001db03b ~]# drbdadm status >>>ha01_mysql role:Secondary >>> disk:Inconsistent >>> 001db03a role:Primary >>>replication:SyncTarget peer-disk:UpToDate done:100.00 >>> >>>ha02_mysql role:Primary >>> disk:UpToDate >>> 001db03a role:Secondary >>>peer-disk:UpToDate >>> >>> >>>On 001db03a, here are the DRBD messages from the onset of the problem >>>until now. >>> >>>Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql 001db03b: PingAck >did >>>not arrive in time. >>>Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql 001db03b: conn( >>>Connected -> NetworkFailure ) peer( Primary -> Unknown ) >>>Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql/0 drbd1: disk( >>>UpToDate -> Consistent ) >>>Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql/0 drbd1 001db03b: >>>pdsk( UpToDate -> DUnknown ) repl( Established -> Off ) >>>Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql 001db03b: >>ack_receiver >>>terminated >>>Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql 001db03b: >Terminating >>>ack_recv thread >>>Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql: Preparing >>>cluster-wide state change 2946943372 (1->-1 0/0) >>>Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql: Committing >>>cluster-wide state change 2946943372 (6ms) >>>Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql/0 drbd1: disk( >>>Consistent -> UpToDate ) >>>Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql 001db03b: Connection >>>closed >>>Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql 001db03b: conn( >>>NetworkFailure -> Unconnected ) >>>Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql 001db03b: Restarting 
>>>receiver threa
Re: [ClusterLabs] Suggestions for multiple NFS mounts as LSB script
NFS mounting ... this sounds like the perfect candidate for autofs or systemd's '.automount'. Have you thought about systemd automounting your NFS? It will allow you to automatically mount on demand and unmount based on inactivity, to prevent stale NFS mounts on network issues. If you still wish to use your script, you can create a systemd service to call it and ensure (via pacemaker) that the service is always running. Best Regards, Strahil Nikolov On June 29, 2020, 16:15:42 GMT+03:00, Tony Stocker wrote: >Hello > >We have a system which has become critical in nature and that >management wants to be made into a highly available pair of servers. We >are building on CentOS-8 and using Pacemaker to accomplish this. > >Without going into too much detail as to why it's being done, and to >avoid any comments/suggestions about changing it which I cannot do, >the system currently uses a script (which is not LSB compliant) to >mount 133 NFS mounts. Yes, it's a crap ton of NFS mounts. No, I cannot >do anything to alter, change, or reduce it. I must implement a >Pacemaker 2-node high-availability pair which mounts those 133 NFS >mounts. This list of mounts also changes over time as some are removed >(rarely) and others added (much too frequently) and occasionally >changed. > >It seems to me that manually putting each individual NFS mount in >using the 'pcs' command as an individual ocf:heartbeat:Filesystem >resource would be time-consuming and ultimately futile given the >frequency of changes. > >Also, the reason that we don't put all of these mounts in the >/etc/fstab file is to speed up boot times and ensure that the systems >can actually come up into a usable state (and not hang forever) >during a period when the NFS mounts might not be available for >whatever reason (e.g. archive maintenance periods.) > >So, I'm left with trying to turn my coworker's bare-minimum bash >script that mounts these volumes into a functional LSB script. 
I've >read: >https://www.clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html/Pacemaker_Explained/_linux_standard_base.html >and >http://refspecs.linux-foundation.org/LSB_3.0.0/LSB-Core-generic/LSB-Core-generic/iniscrptact.html > >My first question is: is there any kind of script within the Pacemaker >world that one can use to verify that one's script passes muster and >is compliant without actually trying to run it as a resource? ~8 years >ago there used to be a script called ocf-tester that one used to check >OCF scripts, but I notice that that doesn't seem to be available any >more - and really I need one for Pacemaker-compatible LSB script >testing. > >Second, just what is Pacemaker expecting from the script? Does it >'exercise' it looking for all available options? Or is it simply >relying on it to provide the correct responses when it calls 'start', >'stop', and 'status'? > >Thanks in advance for help. >___ >Manage your subscription: >https://lists.clusterlabs.org/mailman/listinfo/users > >ClusterLabs home: https://www.clusterlabs.org/ ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
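On the second question above: pacemaker does not exercise every action in the script. For an LSB resource it calls start, stop, and status and trusts the LSB exit codes, and the classic mistake is status, which must return 0 when running and 3 when cleanly stopped. A runnable miniature of that contract, with the real NFS mount loop stubbed out as comments (the state file standing in for a /proc/mounts check is an assumption made purely for the demo):

```shell
# Demo of the LSB start/stop/status contract pacemaker relies on.
STATE_FILE="${STATE_FILE:-/tmp/nfs-mounts.demo.state}"

nfs_start() {
    # real script: while read src dst; do mount -t nfs "$src" "$dst"; done < mounts.list
    touch "$STATE_FILE"
}
nfs_stop() {
    # real script: umount each mountpoint in reverse order
    rm -f "$STATE_FILE"
}
nfs_status() {
    # real script: verify every entry is present in /proc/mounts
    if [ -f "$STATE_FILE" ]; then return 0; else return 3; fi  # 3 = not running (LSB)
}

nfs_start
nfs_status && echo "after start: rc=0"    # prints: after start: rc=0
nfs_stop
nfs_status || echo "after stop: rc=$?"    # prints: after stop: rc=3
```

Keeping the mount list in an external file, as this sketch assumes, also means the frequently changing list can be edited without touching the script pacemaker manages.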
Re: [ClusterLabs] Antw: [EXT] Failed fencing monitor process (fence_vmware_soap) RHEL 8
Nice to know. Yet, if the monitoring of that fencing device failed - most probably the Vcenter was not responding/unreachable - that's why I offered sbd . Best Regards, Strahil Nikolov На 18 юни 2020 г. 18:24:48 GMT+03:00, Ken Gaillot написа: >Note that a failed start of a stonith device will not prevent the >cluster from using that device for fencing. It just prevents the >cluster from monitoring the device. > >On Thu, 2020-06-18 at 08:20 +, Strahil Nikolov wrote: >> What about second fencing mechanism ? >> You can add a shared (independent) vmdk as an sbd device. The >> reconfiguration will require cluster downtime, but this is only >> necessary once. >> Once 2 fencing mechanisms are available - you can configure the order >> easily. >> Best Regards, >> Strahil Nikolov >> >> >> >> >> >> >> В четвъртък, 18 юни 2020 г., 10:29:22 Гринуич+3, Ulrich Windl < >> ulrich.wi...@rz.uni-regensburg.de> написа: >> >> >> >> >> >> Hi! >> >> I can't give much detailed advice, but I think any network service >> should have a timeout of at least 30 Sekonds (you have >> timeout=2ms). >> >> And "after 100 failures" is symbolic, not literal: It means it >> failed too often, so I won't retry. >> >> Regards, >> Ulrich >> >> > > > Howard schrieb am 17.06.2020 um 21:05 in >> > > > Nachricht >> >> <2817_1592420740_5EEA6983_2817_3_1_CAO51vj6oXjfvhGQz7oOu=Pi+D_cKh5M1g >> fDL_2tAbKmw >> mq...@mail.gmail.com>: >> > Hello, recently I received some really great advice from this >> > community >> > regarding changing the token timeout value in corosync. Thank you! >> > Since >> > then the cluster has been working perfectly with no errors in the >> > log for >> > more than a week. >> > >> > This morning I logged in to find a stopped stonith device. If I'm >> > reading >> > the log right, it looks like it failed 1 million times in ~20 >> > seconds then >> > gave up. If you wouldn't mind looking at the logs below, is there >> > some way >> > that I can make this more robust so that it can recover? 
I'll be >> > investigating the reason for the timeout but would like to help the >> > system >> > recover on its own. >> > >> > Servers: RHEL 8.2 >> > >> > Cluster name: cluster_pgperf2 >> > Stack: corosync >> > Current DC: srv1 (version 2.0.2-3.el8_1.2-744a30d655) - partition >> > with >> > quorum >> > Last updated: Wed Jun 17 11:47:42 2020 >> > Last change: Tue Jun 16 22:00:29 2020 by root via crm_attribute on >> > srv1 >> > >> > 2 nodes configured >> > 4 resources configured >> > >> > Online: [ srv1 srv2 ] >> > >> > Full list of resources: >> > >> > Clone Set: pgsqld-clone [pgsqld] (promotable) >> > Masters: [ srv1 ] >> > Slaves: [ srv2 ] >> > pgsql-master-ip(ocf::heartbeat:IPaddr2): Started >> > srv1 >> > vmfence(stonith:fence_vmware_soap):Stopped >> > >> > Failed Resource Actions: >> > * vmfence_start_0 on srv2 'OCF_TIMEOUT' (198): call=19, >> > status=Timed Out, >> > exitreason='', >> > last-rc-change='Wed Jun 17 08:34:16 2020', queued=7ms, >> > exec=20184ms >> > * vmfence_start_0 on srv1 'OCF_TIMEOUT' (198): call=44, >> > status=Timed Out, >> > exitreason='', >> > last-rc-change='Wed Jun 17 08:33:55 2020', queued=0ms, >> > exec=20008ms >> > >> > Daemon Status: >> > corosync: active/disabled >> > pacemaker: active/disabled >> > pcsd: active/enabled >> > >> > pcs resource config >> > Clone: pgsqld-clone >> > Meta Attrs: notify=true promotable=true >> > Resource: pgsqld (class=ocf provider=heartbeat type=pgsqlms) >> > Attributes: bindir=/usr/bin pgdata=/var/lib/pgsql/data >> > Operations: demote interval=0s timeout=120s (pgsqld-demote- >> > interval-0s) >> > methods interval=0s timeout=5 (pgsqld-methods- >> > interval-0s) >> > monitor interval=15s role=Master timeout=60s >> > (pgsqld-monitor-interval-15s) >> > monitor interval=16s role=Slave timeout=60s >> > (pgsqld-monitor-interval-16s) >> > notify interval=0s t
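Strahil's suggestion above (sbd as a second fencing mechanism) maps onto Pacemaker's fencing topology, so a failed vmfence attempt falls through to sbd automatically. A minimal, hedged sketch, assuming a shared disk for sbd has already been prepared; the fence_sbd device path and the `fence_sbd_dev` id are hypothetical, while `vmfence` and the node names srv1/srv2 come from the thread:

```shell
# Hypothetical sketch: register sbd as a second fencing level, tried only
# when the vmware_soap device (level 1) fails, e.g. when vCenter is down.
pcs stonith create fence_sbd_dev fence_sbd \
    devices=/dev/disk/by-id/example-shared-vmdk
pcs stonith level add 1 srv1 vmfence
pcs stonith level add 2 srv1 fence_sbd_dev
pcs stonith level add 1 srv2 vmfence
pcs stonith level add 2 srv2 fence_sbd_dev
pcs stonith level    # verify the resulting topology
```

With this topology, the cluster can still self-recover from a fencing need even while vCenter is unreachable, which is the failure mode the monitor timeouts above point at.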
Re: [ClusterLabs] DRBD sync stalled at 100% ?
I've seen this on a test setup after multiple network disruptions. I managed to fix it by stopping drbd on all nodes and starting it back. I guess you can get downtime and try that approach. Best Regards, Strahil Nikolov На 27 юни 2020 г. 16:36:10 GMT+03:00, Eric Robinson написа: >I'm not seeing anything on Google about this. Two DRBD nodes lost >communication with each other, and then reconnected and started sync. >But then it got to 100% and is just stalled there. > >The nodes are 001db03a, 001db03b. > >On 001db03a: > >[root@001db03a ~]# drbdadm status >ha01_mysql role:Primary > disk:UpToDate > 001db03b role:Secondary >replication:SyncSource peer-disk:Inconsistent done:100.00 > >ha02_mysql role:Secondary > disk:UpToDate > 001db03b role:Primary >peer-disk:UpToDate > >On 001drbd03b: > >[root@001db03b ~]# drbdadm status >ha01_mysql role:Secondary > disk:Inconsistent > 001db03a role:Primary >replication:SyncTarget peer-disk:UpToDate done:100.00 > >ha02_mysql role:Primary > disk:UpToDate > 001db03a role:Secondary >peer-disk:UpToDate > > >On 001db03a, here are the DRBD messages from the onset of the problem >until now. > >Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql 001db03b: PingAck did >not arrive in time. 
>Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql 001db03b: conn( >Connected -> NetworkFailure ) peer( Primary -> Unknown ) >Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql/0 drbd1: disk( >UpToDate -> Consistent ) >Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql/0 drbd1 001db03b: >pdsk( UpToDate -> DUnknown ) repl( Established -> Off ) >Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql 001db03b: ack_receiver >terminated >Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql 001db03b: Terminating >ack_recv thread >Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql: Preparing >cluster-wide state change 2946943372 (1->-1 0/0) >Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql: Committing >cluster-wide state change 2946943372 (6ms) >Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql/0 drbd1: disk( >Consistent -> UpToDate ) >Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql 001db03b: Connection >closed >Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql 001db03b: conn( >NetworkFailure -> Unconnected ) >Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql 001db03b: Restarting >receiver thread >Jun 26 22:34:27 001db03a kernel: drbd ha02_mysql 001db03b: conn( >Unconnected -> Connecting ) >Jun 26 22:34:30 001db03a kernel: drbd ha01_mysql 001db03b: PingAck did >not arrive in time. 
>Jun 26 22:34:30 001db03a kernel: drbd ha01_mysql 001db03b: conn( >Connected -> NetworkFailure ) peer( Secondary -> Unknown ) >Jun 26 22:34:30 001db03a kernel: drbd ha01_mysql/0 drbd0 001db03b: >pdsk( UpToDate -> DUnknown ) repl( Established -> Off ) >Jun 26 22:34:30 001db03a kernel: drbd ha01_mysql 001db03b: ack_receiver >terminated >Jun 26 22:34:30 001db03a kernel: drbd ha01_mysql 001db03b: Terminating >ack_recv thread >Jun 26 22:34:30 001db03a kernel: drbd ha01_mysql/0 drbd0: new current >UUID: D07A3D4B2F99832D weak: FFFD >Jun 26 22:34:30 001db03a kernel: drbd ha01_mysql 001db03b: Connection >closed >Jun 26 22:34:30 001db03a kernel: drbd ha01_mysql 001db03b: conn( >NetworkFailure -> Unconnected ) >Jun 26 22:34:30 001db03a kernel: drbd ha01_mysql 001db03b: Restarting >receiver thread >Jun 26 22:34:30 001db03a kernel: drbd ha01_mysql 001db03b: conn( >Unconnected -> Connecting ) >Jun 26 22:34:33 001db03a pengine[1474]: notice: * Start >p_drbd0:1( 001db03b ) >Jun 26 22:34:33 001db03a crmd[1475]: notice: Initiating notify >operation p_drbd0_pre_notify_start_0 locally on 001db03a >Jun 26 22:34:33 001db03a crmd[1475]: notice: Result of notify >operation for p_drbd0 on 001db03a: 0 (ok) >Jun 26 22:34:33 001db03a crmd[1475]: notice: Initiating start >operation p_drbd0_start_0 on 001db03b >Jun 26 22:34:34 001db03a kernel: drbd ha02_mysql 001db03b: Handshake to >peer 0 successful: Agreed network protocol version 113 >Jun 26 22:34:34 001db03a kernel: drbd ha02_mysql 001db03b: Feature >flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME >WRITE_ZEROES. 
>Jun 26 22:34:34 001db03a kernel: drbd ha02_mysql 001db03b: Starting >ack_recv thread (from drbd_r_ha02_mys [2116]) >Jun 26 22:34:34 001db03a kernel: drbd ha02_mysql 001db03b: Preparing >remote state change 3920461435 >Jun 26 22:34:34 001db03a kernel: drbd ha02_mysql 001db03b: Committing >remote state change 3920461435 (primary_nodes=1) >Jun 26 22:34:34 001db03a kernel: drbd ha02_mysql 001db03b: conn( >Connecting -> Connected ) peer( Unknown -> Primary ) >
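The stop/start workaround Strahil describes in the reply above would look roughly like this for the stalled resource (resource name taken from the logs; this is a downtime operation, so make sure Pacemaker is not managing DRBD while you do it, e.g. with the cluster stopped or the resource unmanaged):

```shell
# Run on each node while DRBD is out of the cluster's control.
drbdadm down ha01_mysql      # tear down the stalled resource
drbdadm up ha01_mysql        # bring it back; resync state is re-evaluated
drbdadm status ha01_mysql    # confirm it leaves SyncSource/SyncTarget
```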
Re: [ClusterLabs] DRBD sync stalled at 100% ?
I guess you can open an issue with Linbit, as you still have the logs. Best Regards, Strahil Nikolov На 28 юни 2020 г. 8:19:59 GMT+03:00, Eric Robinson написа: >I fixed it with a drbd down/up. > >From: Users On Behalf Of Eric Robinson >Sent: Saturday, June 27, 2020 4:32 PM >To: Cluster Labs - All topics related to open-source clustering >welcomed ; Strahil Nikolov > >Subject: Re: [ClusterLabs] DRBD sync stalled at 100% ? > >Thanks for the feedback. I was hoping for a non-downtime solution. No >way to do that? >Get Outlook for Android<https://aka.ms/ghei36> > >____ > >[forwarded copies of the earlier messages in this thread trimmed; see the original post above]
Re: [ClusterLabs] Redundant Ring Network failure
Are you using multicast? Best Regards, Strahil Nikolov На 9 юни 2020 г. 10:28:25 GMT+03:00, "ROHWEDER-NEUBECK, MICHAEL (EXTERN)" написа: >Hello, >We have massive problems with the redundant ring operation of our >Corosync / pacemaker 3 Node NFS clusters. > >Most of the nodes either have an entire ring offline or only 1 node in >a ring. >Example: (Node1 Ring0 333 Ring1 n33 | Node2 Ring0 033 Ring1 3n3 | Node3 >Ring0 333 Ring 1 33n) > >corosync-cfgtool -R doesn't help >All nodes are VMs that build the ring together using 2 VLANs. >Which logs do you need to hopefully help me? > >Corosync Cluster Engine, version '3.0.1' >Copyright (c) 2006-2018 Red Hat, Inc. >Debian Buster > > >-- >Mit freundlichen Grüßen > Michael Rohweder-Neubeck > >NSB GmbH – Nguyen Softwareentwicklung & Beratung GmbH Röntgenstraße 27 >D-64291 Darmstadt >E-Mail: >m...@nsb-software.de >Manager: Van-Hien Nguyen, Jörg Jaspert >USt-ID: DE 195 703 354; HRB 7131 Amtsgericht Darmstadt > > > > >Sitz der Gesellschaft / Corporate Headquarters: Deutsche Lufthansa >Aktiengesellschaft, Koeln, Registereintragung / Registration: >Amtsgericht Koeln HR B 2168 >Vorsitzender des Aufsichtsrats / Chairman of the Supervisory Board: Dr. >Karl-Ludwig Kley >Vorstand / Executive Board: Carsten Spohr (Vorsitzender / Chairman), >Thorsten Dirks, Christina Foerster, Harry Hohmeister, Dr. Detlef >Kayser, Dr. Michael Niggemann ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
Re: [ClusterLabs] Redundant Ring Network failure
It is hard to guess whether you are using sctp or udp/udpu. If possible share the corosync.conf (you can remove sensitive data, but make it meaningful). Are you using a firewall? If yes, check: 1. The node firewall is not blocking the communication on the specific interfaces 2. Verify with tcpdump that the heartbeats are received from the remote side. 3. Check for retransmissions or packet loss. Usually you can find more details in the log specified in corosync.conf or in /var/log/messages (and also the journal). Best Regards, Strahil Nikolov На 9 юни 2020 г. 21:11:02 GMT+03:00, "ROHWEDER-NEUBECK, MICHAEL (EXTERN)" написа: >Hi, > >we are using unicast ("knet") > >Greetings > >Michael > > > > > >-Ursprüngliche Nachricht- >Von: Strahil Nikolov >Gesendet: Dienstag, 9. Juni 2020 19:30 >An: Cluster Labs - All topics related to open-source clustering >welcomed ; ROHWEDER-NEUBECK, MICHAEL (EXTERN) > >Betreff: Re: [ClusterLabs] Redundant Ring Network failure > >Are you using multicast? > >Best Regards, >Strahil Nikolov > >На 9 юни 2020 г. 10:28:25 GMT+03:00, "ROHWEDER-NEUBECK, MICHAEL >(EXTERN)" написа: >>Hello, >>We have massive problems with the redundant ring operation of our >>Corosync / pacemaker 3 Node NFS clusters. >> >>Most of the nodes either have an entire ring offline or only 1 node in > >>a ring. >>Example: (Node1 Ring0 333 Ring1 n33 | Node2 Ring0 033 Ring1 3n3 | >Node3 >>Ring0 333 Ring 1 33n) >> >>corosync-cfgtool -R doesn't help >>All nodes are VMs that build the ring together using 2 VLANs. >>Which logs do you need to hopefully help me? 
>>[remainder of quoted message trimmed; see the original post above]
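For reference, redundant rings on corosync 3.x with knet are expressed as per-node link addresses in the nodelist. A hedged sketch of a two-link configuration (hostnames and addresses are hypothetical; the two VLANs mirror the setup described above), written out and sanity-checked here as a plain file:

```shell
# Sketch of a corosync 3.x nodelist with two knet links (one per VLAN).
# On a live cluster, 'corosync-cfgtool -s' then shows status per link.
cat > /tmp/nodelist-fragment.conf <<'EOF'
nodelist {
    node {
        name: node1
        nodeid: 1
        ring0_addr: 10.0.1.11
        ring1_addr: 10.0.2.11
    }
    node {
        name: node2
        nodeid: 2
        ring0_addr: 10.0.1.12
        ring1_addr: 10.0.2.12
    }
}
EOF
grep -c 'ring1_addr' /tmp/nodelist-fragment.conf   # prints 2
```

With knet, corosync keeps both links up and fails over between them automatically, which is why a persistently dead ring usually points at the network (firewall, VLAN, packet loss) rather than at corosync itself.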
Re: [ClusterLabs] New user needs some help stabilizing the cluster
What are your corosync.conf timeouts (especially token & consensus)? Last time I did a live migration of a RHEL 7 node with the default values, the cluster fenced it - thus I set the token to 10s and also raised the consensus (check 'man corosync.conf') above the default. Also, start your investigation from the virtualization layer, as a lot of backups are going on during the nights. Last week I got a cluster node fenced because it failed to respond for 40s. Thankfully that was just a QA cluster, so it wasn't a big deal. The most common reasons for a VM to fail to respond are: - CPU starvation due to high CPU utilisation on the host - I/O issues causing the VM to pause - Lots of backups eating the bandwidth on any of the hypervisors or on a switch between them (if you have a single heartbeat network) With RHEL 8, corosync allows using more than 2 heartbeat rings and new features like SCTP. P.S.: You can use a second fencing mechanism like 'sbd', a.k.a. "poison pill" - just make the vmdk shared & independent. This way your cluster can operate even when the vCenter is unreachable for any reason. Best Regards, Strahil Nikolov На 10 юни 2020 г. 20:06:28 GMT+03:00, Howard написа: >Good morning. Thanks for reading. We have a requirement to provide >high >availability for PostgreSQL 10. I have built a two node cluster with a >quorum device as the third vote, all running on RHEL 8. 
> >Here are the versions installed: >[postgres@srv2 cluster]$ rpm -qa|grep >"pacemaker\|pcs\|corosync\|fence-agents-vmware-soap\|paf" >corosync-3.0.2-3.el8_1.1.x86_64 >corosync-qdevice-3.0.0-2.el8.x86_64 >corosync-qnetd-3.0.0-2.el8.x86_64 >corosynclib-3.0.2-3.el8_1.1.x86_64 >fence-agents-vmware-soap-4.2.1-41.el8.noarch >pacemaker-2.0.2-3.el8_1.2.x86_64 >pacemaker-cli-2.0.2-3.el8_1.2.x86_64 >pacemaker-cluster-libs-2.0.2-3.el8_1.2.x86_64 >pacemaker-libs-2.0.2-3.el8_1.2.x86_64 >pacemaker-schemas-2.0.2-3.el8_1.2.noarch >pcs-0.10.2-4.el8.x86_64 >resource-agents-paf-2.3.0-1.noarch > >These are vmare VMs so I configured the cluster to use the ESX host as >the >fencing device using fence_vmware_soap. > >Throughout each day things generally work very well. The cluster >remains >online and healthy. Unfortunately, when I check pcs status in the >mornings, >I see that all kinds of things went wrong overnight. It is hard to >pinpoint what the issue is as there is so much information being >written to >the pacemaker.log. Scrolling through pages and pages of informational >log >entries trying to find the lines that pertain to the issue. Is there a >way >to separate the logs out to make it easier to scroll through? Or maybe >a >list of keywords to GREP for? > >It is clearly indicating that the server lost contact with the other >node >and also the quorum device. Is there a way to make this configuration >more >robust or able to recover from a connectivity blip? 
> >Here are the pacemaker and corosync logs for this morning's failures: >pacemaker.log >/var/log/pacemaker/pacemaker.log:Jun 10 00:06:42 srv2 pacemakerd > [10573] (pcmk_quorum_notification) warning: Quorum lost | >membership=952 members=1 >/var/log/pacemaker/pacemaker.log:Jun 10 00:06:42 srv2 >pacemaker-controld > [10579] (pcmk_quorum_notification) warning: Quorum lost | >membership=952 members=1 >/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 >pacemaker-schedulerd[10578] (pe_fence_node) warning: Cluster node srv1 >will be fenced: peer is no longer part of the cluster >/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 >pacemaker-schedulerd[10578] (determine_online_status)warning: >Node >srv1 is unclean >/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 >pacemaker-schedulerd[10578] (custom_action) warning: Action >pgsqld:1_demote_0 on srv1 is unrunnable (offline) >/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 >pacemaker-schedulerd[10578] (custom_action) warning: Action >pgsqld:1_stop_0 on srv1 is unrunnable (offline) >/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 >pacemaker-schedulerd[10578] (custom_action) warning: Action >pgsqld:1_demote_0 on srv1 is unrunnable (offline) >/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 >pacemaker-schedulerd[10578] (custom_action) warning: Action >pgsqld:1_stop_0 on srv1 is unrunnable (offline) >/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 >pacemaker-schedulerd[10578] (custom_action) warning: Action >pgsqld:1_demote_0 on srv1 is unrunnable (offline) >/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 >pacemaker-schedulerd[10578] (custom_action) warning: Action >pgsqld:1_stop_0 on srv1 is unrunnable (offline) >/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 >pacemaker-sch
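The token/consensus advice from the first reply in this thread would land in corosync.conf roughly as below. The values follow the thread (10 s token; consensus explicitly above its default, which corosync.conf(5) gives as 1.2 × token); the fragment is written out and sanity-checked here as a plain file:

```shell
# Hedged sketch of the totem timeout changes discussed above.
cat > /tmp/totem-fragment.conf <<'EOF'
totem {
    version: 2
    # Raise token from the default so short VM pauses (snapshot backups,
    # live migration) do not trigger fencing.
    token: 10000
    # consensus defaults to 1.2 * token; when raising token, keep
    # consensus above that (here 1.3 * token).
    consensus: 13000
}
EOF
grep 'token:\|consensus:' /tmp/totem-fragment.conf
```

The trade-off: a larger token means the cluster tolerates longer freezes, but also takes correspondingly longer to notice a genuinely dead node.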
Re: [ClusterLabs] Rolling upgrade from Corosync 2.3+ to Corosync 2.99+ or Corosync 3.0+?
Hi Vitaly, have you considered something like this: 1. Set up a new cluster. 2. Present the same shared storage on the new cluster. 3. Prepare the resource configuration but do not apply it yet. 4. Power down all resources on the old cluster. 5. Deploy the resources on the new cluster and immediately bring them up. 6. Remove access to the shared storage for the old cluster. 7. Wipe the old cluster. Downtime will be way shorter. Best Regards, Strahil Nikolov На 11 юни 2020 г. 17:48:47 GMT+03:00, Vitaly Zolotusky написа: >Thank you very much for quick reply! >I will try to either build new version on Fedora 22, or build the old >version on CentOs 8 and do a HA stack upgrade separately from my full >product/OS upgrade. A lot of my customers would be extremely unhappy >with even short downtime, so I can't really do the full upgrade >offline. >Thanks again! >_Vitaly > >> On June 11, 2020 10:14 AM Jan Friesse wrote: >> >> >> > Thank you very much for your help! >> > We did try to go to V3.0.3-5 and then dropped to 2.99 in hope that >it may work with rolling upgrade (we were fooled by the same major >version (2)). Our fresh install works fine on V3.0.3-5. >> > Do you know if it is possible to build Pacemaker 3.0.3-5 and >Corosync 2.0.3 on Fedora 22 so that I >> >> Good question. Fedora 22 is quite old but close to RHEL 7 for which >we >> build packages automatically (https://kronosnet.org/builds/) so it >> should be possible. But you are really on your own, because I don't >> think anybody ever tried it. >> >> Regards, >>Honza >> >> >> >> upgrade the stack before starting "real" upgrade of the product? >> > Then I can do the following sequence: >> > 1. "quick" full shutdown for HA stack upgrade to 3.0 version >> > 2. start HA stack on the old OS and product version with Pacemaker >3.0.3 and bring the product online >> > 3. start rolling upgrade for product upgrade to the new OS and >product version >> > Thanks again for your help! 
>> > _Vitaly >> > >> >> On June 11, 2020 3:30 AM Jan Friesse wrote: >> >> >> >> >> >> Vitaly, >> >> >> >>> Hello everybody. >> >>> We are trying to do a rolling upgrade from Corosync 2.3.5-1 to >Corosync 2.99+. It looks like they are not compatible and we are >getting messages like: >> >> >> >> Yes, they are not wire compatible. Also please do not use 2.99 >versions, >> >> these were alfa/beta/rc before 3.0 and 3.0 is actually quite a >long time >> >> released (3.0.4 is latest and I would recommend using it - there >were >> >> quite a few important bugfixes between 3.0.0 and 3.0.4) >> >> >> >> >> >>> Jun 11 02:10:20 d21-22-left corosync[6349]: [TOTEM ] Message >received from 172.18.52.44 has bad magic number (probably sent by >Corosync 2.3+).. Ignoring >> >>> on the upgraded node and >> >>> Jun 11 01:02:37 d21-22-right corosync[14912]: [TOTEM ] Invalid >packet data >> >>> Jun 11 01:02:38 d21-22-right corosync[14912]: [TOTEM ] Incoming >packet has different crypto type. Rejecting >> >>> Jun 11 01:02:38 d21-22-right corosync[14912]: [TOTEM ] Received >message has invalid digest... ignoring. >> >>> on the pre-upgrade node. >> >>> >> >>> Is there a good way to do this upgrade? >> >> >> >> Usually best way is to start from scratch in testing environment >to make >> >> sure everything works as expected. Then you can shutdown current >> >> cluster, upgrade and start it again - config file is mostly >compatible, >> >> you may just consider changing transport to knet. I don't think >there is >> >> any definitive guide to do upgrade without shutting down whole >cluster, >> >> but somebody else may have idea. >> >> >> >> Regards, >> >> Honza >> >> >> >>> I would appreciate it very much if you could point me to any >documentation or articles on this issue. >> >>> Thank you very much! 
>> >>> _Vitaly
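The cutover plan at the top of this thread might look roughly like the following with pcs. This is only a sketch: the CIB file name is hypothetical, and the storage re-mapping steps depend entirely on the array/iSCSI target, so they are omitted.

```shell
# Old cluster: stop all resources and the stack.
pcs cluster stop --all

# New cluster (already set up, storage already presented): push the
# prepared configuration, then start and verify.
pcs cluster cib-push prepared_cib.xml --config
pcs cluster start --all
pcs status    # confirm resources came up on the new cluster
```

Because the resource configuration is prepared in advance with `pcs -f` against a CIB file, the downtime window shrinks to roughly the stop-on-old plus start-on-new interval.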
Re: [ClusterLabs] New user needs some help stabilizing the cluster
Don't forget to increase the consensus! Best Regards, Strahil Nikolov На 11 юни 2020 г. 22:11:09 GMT+03:00, Howard написа: >This is interesting. So it seems that 13,000 ms or 13 seconds is how >long >the VM was frozen during the snapshot backup and 0.8 seconds is the >threshold. We will be disabling the snapshot backups and may increase >the >token timeout a bit since these systems are not so critical. > >Thanks Honza for helping me understand. > >Howard > >On Thu, Jun 11, 2020 at 12:36 AM Jan Friesse >wrote: > >> Howard, >> >> >> ... >> >> The most important info is following line: >> >> > Jun 10 00:06:41 [10558] srv2 corosync warning [MAIN ] Corosync >main >> > process was not scheduled for 13006.0615 ms (threshold is 800. >ms). >> > Consider token timeout increase. >> >> There are more of these, so you can either make sure VM is not paused >> for such a long time or increase token timeout so corosync is able to >> handle such pause. >> >> Regards, >>Honza >> >>
Re: [ClusterLabs] New user needs some help stabilizing the cluster
And I forgot to ask ... Are you using a memory-based snapshot? It shouldn't take so long. Best Regards, Strahil Nikolov На 12 юни 2020 г. 7:10:38 GMT+03:00, Strahil Nikolov написа: >Don't forget to increase the consensus! > >Best Regards, >Strahil Nikolov > >[earlier quoted messages in this thread trimmed; see above]
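A quick sanity check on the numbers in this thread: the warning fired at an 800 ms threshold alongside the default 1000 ms token, which suggests the scheduling-pause threshold is 0.8 × token (an assumption inferred from those two values, not confirmed in the thread). The minimum token needed to ride out the observed 13006 ms pause is then:

```shell
# Back-of-envelope: smallest token keeping a 13006 ms scheduling gap
# under the (assumed) 0.8 * token warning threshold.
pause_ms=13006
min_token_ms=$(( pause_ms * 10 / 8 ))   # pause / 0.8 in integer math
echo "token should exceed ${min_token_ms} ms"
```

That works out to roughly 16.3 s, well above the 10 s suggested earlier in the thread, which is why disabling the memory snapshots (rather than only raising token) is the more robust fix.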
Re: [ClusterLabs] Setting up HA cluster on Raspberry pi4 with ubuntu 20.04 aarch64 architecture
Out of curiosity, are you running it on SLES/openSUSE? I think it is easier with 'crm cluster start'. Otherwise you can run 'journalctl -u pacemaker.service -e' to find which dependency has failed. Another one is: 'systemctl list-dependencies pacemaker.service' Best Regards, Strahil Nikolov На 11 юни 2020 г. 9:10:22 GMT+03:00, Jayadeva DB написа: >Hi , >I have installed ubuntu 20.04 aarch64 OS on raspberry pi4. >I want to set up HA cluster using pacemaker corosync and crm . >I have been following this link >https://clusterlabs.org/quickstart-ubuntu.html . >I am able to install everything. But when I give "service pacemaker >start", the error pops up > >"A dependency job for pacemaker.service failed. See 'journalctl -xe' >for >details". >Can I get complete documentation as to how to make it live and >happening . > >Regards, >Jayadev
Re: [ClusterLabs] Still Beginner STONITH Problem
The simplest way to check whether libvirt's network is NAT (or not) is to try to ssh from the first VM to the second one. I must admit that I was lost when I tried to create a routed network in KVM, so I can't help with that. Best Regards, Strahil Nikolov На 17 юли 2020 г. 16:56:44 GMT+03:00, "stefan.schm...@farmpartner-tec.com" написа: >Hello, > >I have now managed to get # fence_xvm -a 225.0.0.12 -o list to list at >least its local Guest again. It seems the fence_virtd was not working >properly anymore. > >Regarding the Network XML config > ># cat default.xml > >[the XML elements of the network definition were stripped by the mail archive; only the network name "default" survives] > >I have used "virsh net-edit default" to test other network Devices on >the hosts but this did not change anything. > >Regarding the statement > > > If it is created by libvirt - this is NAT and you will never > > receive output from the other host. > >I am at a loss and do not know why this is NAT. I am aware what NAT >means, but what am I supposed to reconfigure here to solve the problem? >Any help would be greatly appreciated. >Thank you in advance. > >Kind regards >Stefan Schmitz > > >Am 15.07.2020 um 16:48 schrieb stefan.schm...@farmpartner-tec.com: >> >> Am 15.07.2020 um 16:29 schrieb Klaus Wenninger: >>> On 7/15/20 4:21 PM, stefan.schm...@farmpartner-tec.com wrote: >>>> Hello, >>>> >>>> >>>> Am 15.07.2020 um 15:30 schrieb Klaus Wenninger: >>>>> On 7/15/20 3:15 PM, Strahil Nikolov wrote: >>>>>> If it is created by libvirt - this is NAT and you will never >>>>>> receive output from the other host. >>>>> And twice the same subnet behind NAT is probably giving >>>>> issues at other places as well. >>>>> And if using DHCP you have to at least enforce that both sides >>>>> don't go for the same IP at least. >>>>> But all no explanation why it doesn't work on the same host. >>>>> Which is why I was asking for running the service on the >>>>> bridge to check if that would work at least. So that we >>>>> can go forward step by step. 
>>>> >>>> I just now finished trying and testing it on both hosts. >>>> I ran # fence_virtd -c on both hosts and entered different network >>>> devices. On both I tried br0 and the kvm10x.0. >>> According to your libvirt-config I would have expected >>> the bridge to be virbr0. >> >> I understand that, but an "virbr0" Device does not seem to exist on >any >> of the two hosts. >> >> # ip link show >> 1: lo: mtu 65536 qdisc noqueue state UNKNOWN >mode >> DEFAULT group default qlen 1000 >> link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 >> 2: eno1: mtu 1500 qdisc mq >> master bond0 state UP mode DEFAULT group default qlen 1000 >> link/ether 0c:c4:7a:fb:30:1a brd ff:ff:ff:ff:ff:ff >> 3: enp216s0f0: mtu 1500 qdisc noop state DOWN >mode >> DEFAULT group default qlen 1000 >> link/ether ac:1f:6b:26:69:dc brd ff:ff:ff:ff:ff:ff >> 4: eno2: mtu 1500 qdisc mq >> master bond0 state UP mode DEFAULT group default qlen 1000 >> link/ether 0c:c4:7a:fb:30:1a brd ff:ff:ff:ff:ff:ff >> 5: enp216s0f1: mtu 1500 qdisc noop state DOWN >mode >> DEFAULT group default qlen 1000 >> link/ether ac:1f:6b:26:69:dd brd ff:ff:ff:ff:ff:ff >> 6: bond0: mtu 1500 qdisc >> noqueue master br0 state UP mode DEFAULT group default qlen 1000 >> link/ether 0c:c4:7a:fb:30:1a brd ff:ff:ff:ff:ff:ff >> 7: br0: mtu 1500 qdisc noqueue >state >> UP mode DEFAULT group default qlen 1000 >> link/ether 0c:c4:7a:fb:30:1a brd ff:ff:ff:ff:ff:ff >> 8: kvm101.0: mtu 1500 qdisc >pfifo_fast >> master br0 state UNKNOWN mode DEFAULT group default qlen 1000 >> link/ether fe:16:3c:ba:10:6c brd ff:ff:ff:ff:ff:ff >> >> >> >>>> >>>> After each reconfiguration I ran #fence_xvm -a 225.0.0.12 -o list >>>> On the second server it worked with each device. After that I >>>> reconfigured back to the normal device, bond0, on which it did not >>>> work anymore, it worked now again! 
>>>> # fence_xvm -a 225.0.0.12 -o list >>>> kvm102 >bab3749c-15fc-40b7-8b6c-d4267b9f0eb9 on >>>> >>>> But it still did not work on the first server, with any >device. >>>> # fence_xvm -a 225.0.0.12 -o list always resulted in >>>>
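One quick way to check Strahil's NAT hypothesis without ssh'ing between guests is to inspect the libvirt network definition directly. The network name `default` comes from the thread; the commands are standard libvirt tooling:

```shell
# Dump the runtime definition of the 'default' libvirt network and look
# at the <forward> element: mode='nat' means guests sit behind NAT, and
# fence_xvm multicast traffic from the other host will never reach them.
virsh net-dumpxml default | grep -i '<forward'

# A bridged setup would show <forward mode='bridge'/> instead, or the
# guest would be attached directly to br0 with no libvirt network at all.
```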
Re: [ClusterLabs] Antw: [EXT] Stonith failing
Do you have a reason not to use any stonith agent already available? Best Regards, Strahil Nikolov On 28 July 2020, 13:26:52 GMT+03:00, Gabriele Bulfon wrote: >Thanks, I attach here the script. >It basically runs ssh on the other node with no password (must be >preconfigured via authorization keys) with commands. >This was taken from a script by OpenIndiana (I think). >As is stated in the comments, we don't want to halt or boot via ssh, >only reboot. >Maybe this is the problem; we should at least have it shut down when >asked to. > >Actually, if I stop corosync on node 2, I don't want it to shut down the >system but just let node 1 keep control of all resources. >The same if I just shut down node 2 manually: >node 1 should keep control of all resources and release them back on >reboot. >Instead, when I stopped corosync on node 2, the log was showing the >attempt to stonith node 2: why? > >Thanks! >Gabriele > > > >Sonicle S.r.l. >: >http://www.sonicle.com >Music: >http://www.gabrielebulfon.com >Quantum Mechanics : >http://www.cdbaby.com/cd/gabrielebulfon >From: >Reid Wahl >To: >Cluster Labs - All topics related to open-source clustering welcomed >Date: >28 July 2020 12.03.46 CEST >Subject: >Re: [ClusterLabs] Antw: [EXT] Stonith failing >Gabriele, > >"No route to host" is a somewhat generic error message when we can't >find anyone to fence the node. It doesn't mean there's necessarily a >network routing issue at fault; no need to focus on that error message. > >I agree with Ulrich about needing to know what the script does. But >based on your initial message, it sounds like your custom fence agent >returns 1 in response to "on" and "off" actions. Am I understanding >correctly? If so, why does it behave that way? Pacemaker is trying to >run a poweroff action based on the logs, so it needs your script to >support an off action. 
>On Tue, Jul 28, 2020 at 2:47 AM Ulrich Windl >ulrich.wi...@rz.uni-regensburg.de >wrote: >Gabriele Bulfon >gbul...@sonicle.com >wrote on 28.07.2020 at 10:56 in >message >: >Hi, now I have my two nodes (xstha1 and xstha2) with IPs configured by >Corosync. >To check how stonith would work, I turned off the Corosync service on the >second >node. >The first node attempts to stonith the 2nd node and take over its >resources, but this fails. >The stonith action is configured to run a custom script that runs ssh >commands. >I think you should explain what that script does exactly. >[...] >___ >Manage your subscription: >https://lists.clusterlabs.org/mailman/listinfo/users >ClusterLabs home: >https://www.clusterlabs.org/ >-- >Regards, >Reid Wahl, RHCA >Software Maintenance Engineer, Red Hat >CEE - Platform Support Delivery - ClusterHA ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
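For orientation, Pacemaker expects a fence agent to implement at least `off`, `on`, `monitor`, and `metadata` actions; a reboot is composed of off followed by on. Below is a minimal sketch of the dispatch logic only — the agent name, the `FENCE_NODE` variable, and the ssh commands are illustrative placeholders, not the script from the thread:

```shell
# Sketch of the action dispatch a custom fence agent needs.
# A real 'off' must also VERIFY the node lost power before reporting success.
fence_action() {
    case "$1" in
        metadata)
            # Pacemaker reads this XML to discover the agent's parameters.
            printf '<?xml version="1.0"?><resource-agent name="fence_ssh_sketch"/>\n'
            ;;
        off)
            # placeholder: e.g. ssh root@"$FENCE_NODE" poweroff, then poll
            echo "off requested for ${FENCE_NODE:-unknown}"
            ;;
        on)
            # placeholder: power the node back on via IPMI / the hypervisor
            echo "on requested for ${FENCE_NODE:-unknown}"
            ;;
        reboot)
            # reboot is off followed by on, in that order
            fence_action off && fence_action on
            ;;
        status|monitor)
            # report whether the fence device itself is usable
            return 0
            ;;
        *)
            return 1
            ;;
    esac
}
```

An agent that returns 1 for `off`, as Reid suspects here, makes every reboot attempt fail at the first step.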
Re: [ClusterLabs] Antw: [EXT] Stonith failing
SBD can use iSCSI (for example, the target is also the quorum node), a disk partition or an LVM LV, so I guess it can also use a ZFS volume dedicated to SBD (10MB is enough). In your case IPMI is quite suitable. About the power fencing when persistent reservations are removed -> it's just a script started by the watchdog.service on the node itself. It should be usable on all Linuxes and many UNIX-like OSes. Best Regards, Strahil Nikolov On 30 July 2020, 12:05:39 GMT+03:00, Gabriele Bulfon wrote: >Reading about sbd from SuSE I saw that it requires a special block to write >information to; I don't think this is possible here. > >It's a dual node ZFS storage running our own XStreamOS/illumos >distribution, and here we're trying to add HA capabilities. >We can move IPs, ZFS Pools and COMSTAR/iSCSI/FC, and now looking for a >stable way to manage stonith. > >The hardware system is this: > >https://www.supermicro.com/products/system/1u/1029/SYS-1029TP-DC0R.cfm > >and it features a shared SAS3 backplane, so both nodes can see all the >discs concurrently. > >Gabriele > > >Sonicle S.r.l. >: >http://www.sonicle.com >Music: >http://www.gabrielebulfon.com >Quantum Mechanics : >http://www.cdbaby.com/cd/gabrielebulfon >From: >Reid Wahl >To: >Cluster Labs - All topics related to open-source clustering welcomed >Date: >30 July 2020 6.38.58 CEST >Subject: >Re: [ClusterLabs] Antw: [EXT] Stonith failing >I don't know of a stonith method that acts upon a filesystem directly. >You'd generally want to act upon the power state of the node or upon >the underlying shared storage. > >What kind of hardware or virtualization platform are these systems >running on? If there is a hardware watchdog timer, then sbd is >possible. The fence_sbd agent (poison-pill fencing via block device) >requires shared block storage, but sbd itself only requires a hardware >watchdog timer. > >Additionally, there may be an existing fence agent that can connect to >the controller you mentioned. What kind of controller is it? 
>On Wed, Jul 29, 2020 at 5:24 AM Gabriele Bulfon >gbul...@sonicle.com >wrote: >Thanks a lot for the extensive explanation! >Any idea about a ZFS stonith? > >Gabriele > > >Sonicle S.r.l. >: >http://www.sonicle.com >Music: >http://www.gabrielebulfon.com >Quantum Mechanics : >http://www.cdbaby.com/cd/gabrielebulfon >Da: >Reid Wahl >nw...@redhat.com >A: >Cluster Labs - All topics related to open-source clustering welcomed >users@clusterlabs.org >Data: >29 luglio 2020 11.39.35 CEST >Oggetto: >Re: [ClusterLabs] Antw: [EXT] Stonith failing >"As it stated in the comments, we don't want to halt or boot via ssh, >only reboot." > >Generally speaking, a stonith reboot action consists of the following >basic sequence of events: >Execute the fence agent with the "off" action. >Poll the power status of the fenced node until it is powered off. >Execute the fence agent with the "on" action. >Poll the power status of the fenced node until it is powered on. >So a custom fence agent that supports reboots, actually needs to >support off and on actions. > > >As Andrei noted, ssh is **not** a reliable method by which to ensure a >node gets rebooted or stops using cluster-managed resources. You can't >depend on the ability to SSH to an unhealthy node that needs to be >fenced. > >The only way to guarantee that an unhealthy or unresponsive node stops >all access to shared resources is to power off or reboot the node. (In >the case of resources that rely on shared storage, I/O fencing instead >of power fencing can also work, but that's not ideal.) > >As others have said, SBD is a great option. Use it if you can. There >are also power fencing methods (one example is fence_ipmilan, but the >options available depend on your hardware or virt platform) that are >reliable under most circumstances. > >You said that when you stop corosync on node 2, Pacemaker tries to >fence node 2. There are a couple of possible reasons for that. 
One >possibility is that you stopped or killed corosync without stopping >Pacemaker first. (If you use pcs, then try `pcs cluster stop`.) Another >possibility is that resources failed to stop during cluster shutdown on >node 2, causing node 2 to be fenced. >On Wed, Jul 29, 2020 at 12:47 AM Andrei Borzenkov >arvidj...@gmail.com >wrote: > >On Wed, Jul 29, 2020 at 9:01 AM Gabriele Bulfon >gbul...@sonicle.com >wrote: >That one was taken from a specific implementation on Solaris 11. >The situation is a dual node server with shared storage controller: >both nodes see the same disks concurrently. >Here we must be sure tha
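The off → verify → on → verify sequence Reid describes can be sketched generically. Here `power_set`/`power_get` are stand-ins for real fence-device calls (IPMI, virsh, etc.), not any actual agent:

```shell
# Simulated power state; a real agent would query the fence device.
POWER=on
power_set() { POWER="$1"; }
power_get() { echo "$POWER"; }

wait_for_state() {
    # Poll the power status until the requested state is reached.
    for _ in 1 2 3 4 5; do
        [ "$(power_get)" = "$1" ] && return 0
        sleep 1
    done
    return 1
}

stonith_reboot() {
    # Reboot = off action, verify off, on action, verify on.
    power_set off && wait_for_state off || return 1
    power_set on && wait_for_state on || return 1
    echo "reboot complete"
}
```

The key point the sketch encodes is that success is only reported after the power state has been *verified*, never merely because an ssh command returned.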
Re: [ClusterLabs] Antw: [EXT] Re: Preferred node for a service (not constrained)
The problem with INFINITY is that the moment the node is back, there will be a second failover. This is bad for bulky DBs that power down/up in more than 30 min (15 min down, 15 min up). Best Regards, Strahil Nikolov On Thursday, 3 December 2020, 10:32:18 GMT+2, Andrei Borzenkov wrote: On Thu, Dec 3, 2020 at 11:11 AM Ulrich Windl wrote: > > >>> Strahil Nikolov wrote on 02.12.2020 at 22:42 in > message <311137659.2419591.1606945369...@mail.yahoo.com>: > > Constraint scores vary from: > > INFINITY, which equals a score of 1000000, > > to: > > -INFINITY, which equals a score of -1000000. > > > > You can usually set a positive score on the preferred node which is bigger > > than on the other node. > > > > For example, setting a location constraint like this will prefer node1: > > node1 - score 10000 > > node2 - score 5000 > > > > The bad thing with those numbers is that you are never sure which number to > use: > Is 50 enough? 100 maybe? 1000? 10000? > A +INFINITY score guarantees that the resource will always be active on the preferred node as long as this node is available, but allows the resource to be started on another node if the preferred node is down. ... > > I believe I used the value infinity, so it will prefer the 2nd host over > > the 1st if at all possible. My 'pcs constraint': > > And this was the very first answer to this question. > > [root@centos-vsa2 ~]# pcs constraint > > Location Constraints: > > Resource: group-zfs > > Enabled on: centos-vsa2 (score:INFINITY) > > Ordering Constraints: > > Colocation Constraints: > > Ticket Constraints: > >
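With pcs, the finite-score-plus-stickiness approach Strahil describes looks roughly like this. The resource and node names are illustrative, not from the thread:

```shell
# Prefer node1 with a finite score instead of INFINITY.
pcs constraint location my-db prefers node1=10000

# Stickiness higher than the location score keeps the resource where it
# is when node1 comes back, avoiding the second failover on a bulky DB.
pcs resource meta my-db resource-stickiness=20000
```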
Re: [ClusterLabs] Q: high-priority messages from DLM?
It's more interesting why you got the connection close... Are you sure you didn't get network issues? What is corosync saying in the logs? Offtopic: Are you using DLM with OCFS2? Best Regards, Strahil Nikolov At 10:33 -0800 on 04.12.2020 (Fri), Reid Wahl wrote: > On Fri, Dec 4, 2020 at 10:32 AM Reid Wahl wrote: > > I'm inclined to agree, although maybe there's a good reason. These > > get > > logged with KERN_ERR priority. > > I hit Enter and that email sent instead of line-breaking... anyway. > > https://github.com/torvalds/linux/blob/master/fs/dlm/dlm_internal.h#L61-L62 > https://github.com/torvalds/linux/blob/master/fs/dlm/lowcomms.c#L1250 > > > On Fri, Dec 4, 2020 at 5:32 AM Ulrich Windl > > wrote: > > > Hi! > > > > > > Logging into a server via iDRAC, I see several messages from > > > "dlm:" at the console screen. My obvious explanation is that they > > > are on the screen, because journald (SLES15 SP2) treats them as > > > high-priority messages that should go to the screen. However IMHO > > > they are not: > > > > > > [83035.82] dlm: closing connection to node 118 > > > [84756.045008] dlm: closing connection to node 118 > > > [160906.211673] dlm: Using SCTP for communications > > > [160906.239357] dlm: connecting to 118 > > > [160906.239807] dlm: connecting to 116 > > > [160906.241432] dlm: connected to 116 > > > [160906.241448] dlm: connected to 118 > > > [174464.522831] dlm: closing connection to node 116 > > > [174670.058912] dlm: connecting to 116 > > > [174670.061373] dlm: connected to 116 > > > [175561.816821] dlm: closing connection to node 118 > > > [175617.654995] dlm: connecting to 118 > > > [175617.665153] dlm: connected to 118 > > > [175695.310971] dlm: closing connection to node 118 > > > [175695.311039] dlm: closing connection to node 116 > > > [175695.311084] dlm: closing connection to node 119 > > > [175759.045564] dlm: Using SCTP for communications > > > [175759.052075] dlm: connecting to 118 > > > [175759.052623] dlm: connecting to 116 > > > 
[175759.052917] dlm: connected to 116 > > > [175759.053847] dlm: connected to 118 > > > [432217.637844] dlm: closing connection to node 119 > > > [432217.637912] dlm: closing connection to node 118 > > > [432217.637953] dlm: closing connection to node 116 > > > [438872.495086] dlm: Using SCTP for communications > > > [438872.499832] dlm: connecting to 118 > > > [438872.500340] dlm: connecting to 116 > > > [438872.500600] dlm: connected to 116 > > > [438872.500642] dlm: connected to 118 > > > [779424.346316] dlm: closing connection to node 116 > > > [780017.597844] dlm: connecting to 116 > > > [780017.616321] dlm: connected to 116 > > > [783118.476060] dlm: closing connection to node 116 > > > [783318.744036] dlm: connecting to 116 > > > [783318.756923] dlm: connected to 116 > > > [784893.793366] dlm: closing connection to node 118 > > > [785082.619709] dlm: connecting to 118 > > > [785082.633263] dlm: connected to 118 > > > > > > Regards, > > > Ulrich > > > > > > > > > > > > ___ > > > Manage your subscription: > > > https://lists.clusterlabs.org/mailman/listinfo/users > > > > > > ClusterLabs home: https://www.clusterlabs.org/ > > > > > > > -- > > Regards, > > > > Reid Wahl, RHCA > > Senior Software Maintenance Engineer, Red Hat > > CEE - Platform Support Delivery - ClusterHA > > ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
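Whether those KERN_ERR "dlm:" lines reach the console is decided by the kernel's console log level rather than by journald alone; it can be inspected and lowered with standard sysctl/dmesg usage:

```shell
# The first value of kernel.printk is the console log level.
sysctl kernel.printk

# KERN_ERR is priority 3. With console level 3, only emerg/alert/crit
# (priorities 0-2) still reach the console, so the dlm errors stop there.
dmesg -n 3
```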
Re: [ClusterLabs] Antw: [EXT] Re: Q: high-priority messages from DLM?
Nope, but if you don't use a clustered FS, you could also use plain LVM + tags. As far as I know you need dlm and clvmd for a clustered FS. Best Regards, Strahil Nikolov On Tuesday, 8 December 2020, 10:15:39 GMT+2, Ulrich Windl wrote: >>> Strahil Nikolov wrote on 05.12.2020 at 18:51 in message : > It's more interesting why you got the connection close... > Are you sure you didn't get network issues? What is corosync saying in > the logs? > > Offtopic: Are you using DLM with OCFS2 ? Hi! I'm using OCFS2, but I tend to ask "Can I use OCFS2 _without_ DLM?". ;-) Regards, Ulrich > > Best Regards, > Strahil Nikolov > > At 10:33 -0800 on 04.12.2020 (Fri), Reid Wahl wrote: >> On Fri, Dec 4, 2020 at 10:32 AM Reid Wahl wrote: >> > I'm inclined to agree, although maybe there's a good reason. These >> > get >> > logged with KERN_ERR priority. >> >> I hit Enter and that email sent instead of line-breaking... anyway. >> >> https://github.com/torvalds/linux/blob/master/fs/dlm/dlm_internal.h#L61-L62 >> https://github.com/torvalds/linux/blob/master/fs/dlm/lowcomms.c#L1250 >> >> > On Fri, Dec 4, 2020 at 5:32 AM Ulrich Windl >> > wrote: >> > > Hi! >> > > >> > > Logging into a server via iDRAC, I see several messages from >> > > "dlm:" at the console screen. My obvious explanation is that they >> > > are on the screen, because journald (SLES15 SP2) treats them as >> > > high priority messages that should go to the screen. 
However IMHO >> > > they are not: >> > > >> > > [83035.82] dlm: closing connection to node 118 >> > > [84756.045008] dlm: closing connection to node 118 >> > > [160906.211673] dlm: Using SCTP for communications >> > > [160906.239357] dlm: connecting to 118 >> > > [160906.239807] dlm: connecting to 116 >> > > [160906.241432] dlm: connected to 116 >> > > [160906.241448] dlm: connected to 118 >> > > [174464.522831] dlm: closing connection to node 116 >> > > [174670.058912] dlm: connecting to 116 >> > > [174670.061373] dlm: connected to 116 >> > > [175561.816821] dlm: closing connection to node 118 >> > > [175617.654995] dlm: connecting to 118 >> > > [175617.665153] dlm: connected to 118 >> > > [175695.310971] dlm: closing connection to node 118 >> > > [175695.311039] dlm: closing connection to node 116 >> > > [175695.311084] dlm: closing connection to node 119 >> > > [175759.045564] dlm: Using SCTP for communications >> > > [175759.052075] dlm: connecting to 118 >> > > [175759.052623] dlm: connecting to 116 >> > > [175759.052917] dlm: connected to 116 >> > > [175759.053847] dlm: connected to 118 >> > > [432217.637844] dlm: closing connection to node 119 >> > > [432217.637912] dlm: closing connection to node 118 >> > > [432217.637953] dlm: closing connection to node 116 >> > > [438872.495086] dlm: Using SCTP for communications >> > > [438872.499832] dlm: connecting to 118 >> > > [438872.500340] dlm: connecting to 116 >> > > [438872.500600] dlm: connected to 116 >> > > [438872.500642] dlm: connected to 118 >> > > [779424.346316] dlm: closing connection to node 116 >> > > [780017.597844] dlm: connecting to 116 >> > > [780017.616321] dlm: connected to 116 >> > > [783118.476060] dlm: closing connection to node 116 >> > > [783318.744036] dlm: connecting to 116 >> > > [783318.756923] dlm: connected to 116 >> > > [784893.793366] dlm: closing connection to node 118 >> > > [785082.619709] dlm: connecting to 118 >> > > [785082.633263] dlm: connected to 118 >> > > >> > > Regards, 
>> > > Ulrich >> >> > -- >> > Regards, >> > >> > Reid Wahl, RHCA >> > Senior Software Maintenance Engineer, Red Hat >> > CEE - Platform Support Delivery - ClusterHA
Re: [ClusterLabs] Preferred node for a service (not constrained)
Constraint scores range from INFINITY, which equals a score of 1000000, down to -INFINITY, which equals a score of -1000000. You can usually set a positive score on the preferred node which is bigger than on the other node. For example, setting a location constraint like this will prefer node1: node1 - score 10000 node2 - score 5000 In order to prevent unnecessary downtime, you should also consider setting stickiness. For example, a stickiness of 20000 will overwhelm the location score of 10000 on the recently recovered node1 and will prevent the resource from being stopped and relocated from node2 back to node1. Note: the default stickiness is per resource, while the total stickiness score of a group is calculated based on the scores of all resources in it. Best Regards, Strahil Nikolov On Wednesday, 2 December 2020, 16:54:43 GMT+2, Dan Swartzendruber wrote: On 2020-11-30 23:21, Petr Bena wrote: > Hello, > > Is there a way to set up a preferred node for a service? I know how to > create a constraint that will make it possible to run a service ONLY on > a certain node, or a constraint that will make it impossible to run 2 > services on the same node, but I don't want any of that, as in > catastrophic scenarios when services would have to be located > together > on the same node, this would instead disable it. > > Essentially what I want is for the service to always be started on the > preferred > node when it is possible, but if it's not possible (e.g. the node is down) > it > would freely run on any other node, with no restrictions, and when the node > is back up, it would migrate back. > > How can I do that? I do precisely this for an active/passive NFS/ZFS storage appliance pair. One of the VSAs has more memory and is less used, so I have it set to prefer that host. https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/_prefer_one_node_over_another.html I believe I used the value infinity, so it will prefer the 2nd host over the 1st if at all possible. 
My 'pcs constraint': [root@centos-vsa2 ~]# pcs constraint Location Constraints: Resource: group-zfs Enabled on: centos-vsa2 (score:INFINITY) Ordering Constraints: Colocation Constraints: Ticket Constraints:
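For reference, the constraint Dan shows can be created (or given a finite score instead of INFINITY) with pcs like this, using the names from his output:

```shell
# Prefer centos-vsa2 absolutely, as in Dan's setup:
pcs constraint location group-zfs prefers centos-vsa2=INFINITY

# Or with a finite score, so other placement factors can still win:
pcs constraint location group-zfs prefers centos-vsa2=10000

# Inspect the result, including constraint ids:
pcs constraint --full
```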
Re: [ClusterLabs] Changing order in resource group after it's created
Use the same 'pcs resource group add' syntax as if your resource was never in the group, and use '--before/--after' to specify the new position. Best Regards, Strahil Nikolov On Thursday, 17 December 2020, 13:21:55 GMT+2, Tony Stocker wrote: I have a resource group that has a number of entries. If I want to reorder them, how do I do that? I tried doing this: pcs resource update FileMount --after InternalIP but got this error: Error: Specified option '--after' is not supported in this command Is there a way to change this? I don't want to have to completely destroy and reenter everything, but at the moment that appears to be the only option.
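Concretely, Strahil's suggestion with the resource names from the question would look something like this (the group name is a placeholder, since it wasn't given in the thread):

```shell
# Re-adding a member to the group it already belongs to moves it;
# --before/--after position it relative to another member.
pcs resource group add <groupname> FileMount --after InternalIP
```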
Re: [ClusterLabs] Cannot allocate memory in pgsql_monitor
systemd services do not use ulimit, so you need to check "systemctl show pacemaker.service" for any clues. I have seen similar error in SLES 12 SP2 when the maximum tasks was reduced and we were hitting the limit. Best Regards, Strahil Nikolov В четвъртък, 10 декември 2020 г., 13:14:14 Гринуич+2, Ларионов Андрей Валентинович написа: Hello, We have PostgreSQL under pacemaker/corosync management and sometimes we catch problem like this: Dec 9 18:51:15 prod-inside-tvh-pgsql1 lrmd[13405]: notice: pgsql_monitor_8000:34884:stderr [ /usr/lib/ocf/lib/heartbeat/ocf-shellfuncs: fork: Cannot allocate memory ] Dec 9 18:51:15 prod-inside-tvh-pgsql1 crmd[13409]: notice: prod-inside-tvh-pgsql1-pgsql_monitor_8000:13 [ /usr/lib/ocf/lib/heartbeat/ocf-shellfuncs: fork: Cannot allocate memory\n ] Dec 9 18:51:15 prod-inside-tvh-pgsql1 crmd[13409]: notice: Result of notify operation for pgsql on prod-inside-tvh-pgsql1: 0 (ok) Dec 9 18:51:15 prod-inside-tvh-pgsql1 crmd[13409]: notice: prod-inside-tvh-pgsql1-pgsql_monitor_8000:13 [ /usr/lib/ocf/lib/heartbeat/ocf-shellfuncs: fork: Cannot allocate memory\n ] Dec 9 18:51:15 prod-inside-tvh-pgsql1 pgsql(pgsql)[35130]: INFO: stop_escalate(or stop_escalate_in_slave) time is adjusted to 50 based on the configured timeout. Dec 9 18:51:15 prod-inside-tvh-pgsql1 pgsql(pgsql)[35130]: INFO: server shutting down Dec 9 18:51:17 prod-inside-tvh-pgsql1 pgsql(pgsql)[35130]: INFO: PostgreSQL is down Dec 9 18:51:17 prod-inside-tvh-pgsql1 pgsql(pgsql)[35130]: INFO: Changing pgsql-status on prod-inside-tvh-pgsql1 : HS:async->STOP. Dec 9 18:51:17 prod-inside-tvh-pgsql1 crmd[13409]: notice: Result of stop operation for pgsql on prod-inside-tvh-pgsql1: 0 (ok) Main - “/usr/lib/ocf/lib/heartbeat/ocf-shellfuncs: fork: Cannot allocate memory”. But memory is always enough. This is virtual machine with Linux 4.14.35-1818.4.7.el7uek.x86_64 and pacemaker 1.1.19-8.el7_6.4 corosync 2.4.3-4.el7 (all from linux repository). 
Please help us to resolve this problem with “Cannot allocate memory”. Maybe we need some additional OS settings, or an upgrade to a fresh pacemaker/corosync version where this issue is resolved, or something else? -- WBR, Andrey Larionov
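Following Strahil's hint, the effective task limit can be inspected and, if it turns out to be the bottleneck, raised with a systemd drop-in. These are stock systemd mechanics, not something given in the thread:

```shell
# Show the effective task limit for pacemaker.
systemctl show pacemaker.service -p TasksMax

# Raise it with a drop-in instead of editing the shipped unit file.
mkdir -p /etc/systemd/system/pacemaker.service.d
cat > /etc/systemd/system/pacemaker.service.d/tasks.conf <<'EOF'
[Service]
TasksMax=infinity
EOF
systemctl daemon-reload && systemctl restart pacemaker.service
```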
Re: [ClusterLabs] Q: LVM-activate a shared LV
I think that dlm + clvmd was enough to take care of OCFS2. Have you tried that? Best Regards, Strahil Nikolov On Thursday, 10 December 2020, 16:55:52 GMT+2, Ulrich Windl wrote: Hi! I configured a clustered LV (I think) for activation on three nodes, but it won't work. The error is: LVM-activate(prm_testVG0_test-jeos_activate)[48844]: ERROR: LV locked by other host: testVG0/test-jeos Failed to lock logical volume testVG0/test-jeos. primitive prm_testVG0_test-jeos_activate LVM-activate \ params vgname=testVG0 lvname=test-jeos activation_mode=shared vg_access_mode=lvmlockd \ op start timeout=90s interval=0 \ op stop timeout=90s interval=0 \ op monitor interval=60s timeout=90s clone cln_testVG0_test-jeos_activate prm_testVG0_test-jeos_activate \ meta interleave=true Is this a software bug, or am I using the wrong RA or configuration? Regards, Ulrich
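For a vg_access_mode=lvmlockd setup, the VG's lockspace has to be started on every node before a shared activation can succeed. A plausible manual check, using the VG/LV names from Ulrich's config and standard lvmlockd-era LVM commands, is:

```shell
# Start the lvmlockd lockspace for the shared VG on this node.
vgchange --lockstart testVG0

# Try a shared activation by hand ('-asy' = activate shared), which is
# what LVM-activate with activation_mode=shared does on each node.
lvchange -asy testVG0/test-jeos
```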