[ClusterLabs] thanks! to mailman owners/admins...

2021-08-31 Thread lejeczek via Users

...for fixing DMARC issues (my Yahoo address).
many thanks, L.

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Qemu VM resources - cannot acquire state change lock

2021-08-26 Thread lejeczek via Users

Hi guys.

I sometimes - and I think I can see a pattern as to when - 
get resources stuck on one node (two-node cluster), with 
these entries in libvirtd's logs:

...
Cannot start job (query, none, none) for domain 
c8kubermaster1; current job is (modify, none, none) owned by 
(192261 qemuProcessReconnect, 0 , 0  
(flags=0x0)) for (1093s, 0s, 0s)
Cannot start job (query, none, none) for domain ubuntu-tor; 
current job is (modify, none, none) owned by (192263 
qemuProcessReconnect, 0 , 0  (flags=0x0)) for 
(1093s, 0s, 0s)
Timed out during operation: cannot acquire state change lock 
(held by monitor=qemuProcessReconnect)
Timed out during operation: cannot acquire state change lock 
(held by monitor=qemuProcessReconnect)

...

When this happens, and if the resource is meant to run on 
the other node, I have to disable the resource first; the 
node on which the resource is stuck will then shut down the 
VM, and only after I re-enable the resource does it start 
on that other, the second node.
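
For reference, this is roughly the sequence I mean - a minimal 
sketch, assuming the stuck resource is c8kubermaster1:

-> $ pcs resource disable c8kubermaster1
# wait until the node that held it has shut the VM down
-> $ pcs resource enable c8kubermaster1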


I think this problem occurs if I restart 'libvirtd' via systemd.

Any thoughts on this guys?
many thanks, L.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Qemu VM resources - cannot acquire state change lock

2021-08-28 Thread lejeczek via Users



On 26/08/2021 10:35, Klaus Wenninger wrote:



On Thu, Aug 26, 2021 at 11:13 AM lejeczek via Users 
<users@clusterlabs.org> wrote:


Hi guys.

I sometimes - I think I know when in terms of any
pattern -
get resources stuck on one node (two-node cluster) with
these in libvirtd's logs:
...
Cannot start job (query, none, none) for domain
c8kubermaster1; current job is (modify, none, none)
owned by
(192261 qemuProcessReconnect, 0 , 0 
(flags=0x0)) for (1093s, 0s, 0s)
Cannot start job (query, none, none) for domain
ubuntu-tor;
current job is (modify, none, none) owned by (192263
qemuProcessReconnect, 0 , 0  (flags=0x0)) for
(1093s, 0s, 0s)
Timed out during operation: cannot acquire state
change lock
(held by monitor=qemuProcessReconnect)
Timed out during operation: cannot acquire state
change lock
(held by monitor=qemuProcessReconnect)
...

when this happens, and if the resource is meant to be the
other node, I have to disable the resource first, then
the node on which resources are stuck will shut down
the VM
and then I have to re-enable that resource so it
would, only
then, start on that other, the second node.

I think this problem occurs if I restart 'libvirtd'
via systemd.

Any thoughts on this guys?


What are the logs on the pacemaker-side saying?
An issue with migration?

Klaus


I'll have to tidy up the "protocol" with my stuff before I 
can call it all reproducible; at the moment it only feels 
that way.


I'm on CentOS Stream with a 2-node cluster running KVM 
resources, plus a 2-node glusterfs cluster on the same hosts 
(physically it is all just two machines).


1) I power down one node in an orderly manner and the other 
node is last-man-standing.
2) after a while (not sure if the time period is also a key 
here) I bring that first node back up.
3) the last-man-standing node's libvirtd becomes unresponsive 
(I don't know yet if that happens only after the first node 
came back up) to virsh commands and probably everything else; 
the pacemaker log says:

...
pacemaker-controld[2730]:  error: Result of probe operation 
for c8kubernode2 on dzien: Timed Out

...
and the libvirtd log does not say anything useful (with default 
debug levels)
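
In case it helps, a hedged sketch of how the pacemaker-side 
evidence Klaus asked about could be gathered (stock paths/units 
on CentOS Stream assumed):

-> $ journalctl -u pacemaker -u corosync --since "2021-08-26"
-> $ grep -iE 'error|timed out|migrat' /var/log/pacemaker/pacemaker.log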


4) could glusterfs play any role? Healing of the volume(s) 
has finished at this point, completed successfully.


This is the moment where I would manually 'systemctl restart 
libvirtd' on that unresponsive node (was last-man-standing) and 
get the original error messages.


There is plenty of room for anybody to make guesses, obviously.
Is it 'libvirtd' going haywire because the glusterfs volume is 
in an unhealthy state and needs healing?
Is it the pacemaker last-man-standing which makes 'libvirtd' go 
haywire?

etc...

I cannot add much more that is concrete at this moment, but 
will appreciate any thoughts you want to share.

thanks, L





___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] no systemd - ?

2021-12-18 Thread lejeczek via Users

hi guys

I've always been a RHEL/Fedora user and memories of the times 
before 'systemd' have almost completely vacated my brain - 
nowadays, is it possible to have HA without systemd, would 
you know, and if so, how would that work?


many thanks, L.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] VirtualDomain - unable to migrate

2021-12-29 Thread lejeczek via Users

Hi guys

I'm having problems with a cluster on new CentOS Stream 9 and 
I'd be glad if you could share your thoughts.


-> $ pcs resource move c8kubermaster2 swir
Location constraint to move resource 'c8kubermaster2' has 
been created

Waiting for the cluster to apply configuration changes...
Location constraint created to move resource 
'c8kubermaster2' has been removed

Waiting for the cluster to apply configuration changes...
Error: resource 'c8kubermaster2' is running on node 'whale'
Error: Errors have occurred, therefore pcs is unable to continue

Not much in pacemaker logs, perhaps nothing at all.
VM does migrate with 'virsh' just fine.

-> $ pcs resource config c8kubermaster1
 Resource: c8kubermaster1 (class=ocf provider=heartbeat 
type=VirtualDomain)
  Attributes: 
config=/var/lib/pacemaker/conf.d/c8kubermaster1.xml 
hypervisor=qemu:///system migration_transport=ssh

  Meta Attrs: allow-migrate=true failure-timeout=30s
  Operations: migrate_from interval=0s timeout=90s 
(c8kubermaster1-migrate_from-interval-0s)
  migrate_to interval=0s timeout=90s 
(c8kubermaster1-migrate_to-interval-0s)
  monitor interval=30s 
(c8kubermaster1-monitor-interval-30s)
  start interval=0s timeout=60s 
(c8kubermaster1-start-interval-0s)
  stop interval=0s timeout=60s 
(c8kubermaster1-stop-interval-0s)
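
As an aside, a hedged sketch of what one could check for 
leftover move/ban constraints or blocked migrations (standard 
pcs/pacemaker commands, nothing specific to my setup):

-> $ pcs constraint location --full
-> $ pcs resource clear c8kubermaster2
-> $ crm_simulate -sL | grep -i c8kubermaster2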


Any and all suggestions & thoughts are much appreciated.
many thanks, L
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] VirtualDomain - started but... not really

2021-12-10 Thread lejeczek via Users

Hi guys.

I quite often... well, too frequently to my mind, see a VM 
about which the cluster says:

-> $ pcs resource status | grep -v disabled
...
  * c8kubermaster2    (ocf::heartbeat:VirtualDomain): 
 Started dzien

..

but that is false, and the cluster itself confirms it:
-> $ pcs resource debug-monitor c8kubermaster2
crm_resource: Error performing operation: Not running
Operation force-check for c8kubermaster2 
(ocf:heartbeat:VirtualDomain) returned: 'not running' (7)


What is - or might be - the issue here & how best to troubleshoot it?

-> $ pcs resource config c8kubermaster2
 Resource: c8kubermaster2 (class=ocf provider=heartbeat 
type=VirtualDomain)
  Attributes: 
config=/var/lib/pacemaker/conf.d/c8kubermaster2.xml 
hypervisor=qemu:///system migration_transport=ssh

  Meta Attrs: allow-migrate=true failure-timeout=30s
  Operations: migrate_from interval=0s timeout=180s 
(c8kubermaster2-migrate_from-interval-0s)
  migrate_to interval=0s timeout=180s 
(c8kubermaster2-migrate_to-interval-0s)
  monitor interval=30s 
(c8kubermaster2-monitor-interval-30s)
  start interval=0s timeout=90s 
(c8kubermaster2-start-interval-0s)
  stop interval=0s timeout=90s 
(c8kubermaster2-stop-interval-0s)
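
For what it's worth, a hedged sketch of commands one could try 
when the status and a fresh probe disagree (plain pcs, resource 
name taken from above):

-> $ pcs resource cleanup c8kubermaster2
-> $ pcs resource refresh c8kubermaster2

('refresh' re-probes the current state; on older pcs 'cleanup' 
alone did both jobs.)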


many thanks, L.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] VirtualDomain - started but... not really

2021-12-11 Thread lejeczek via Users




On 10/12/2021 21:17, Ken Gaillot wrote:

On Fri, 2021-12-10 at 16:33 +, lejeczek via Users wrote:

Hi guys.

I quite often.. well, to frequently in my mind, see a VM
which cluster says:
-> $ pcs resource status | grep -v disabled
...
* c8kubermaster2(ocf::heartbeat:VirtualDomain):
   Started dzien
..

but that is false, also cluster itself confirms it:
-> $ pcs resource debug-monitor c8kubermaster2
crm_resource: Error performing operation: Not running
Operation force-check for c8kubermaster2
(ocf:heartbeat:VirtualDomain) returned: 'not running' (7)

What is the issue here, might be & how best to troubleshoot it?

Is there a recurring monitor configured for the resource?
I have already shown the config for the resource. Is there 
not?

-> $ pcs resource config c8kubermaster2
   Resource: c8kubermaster2 (class=ocf provider=heartbeat
type=VirtualDomain)
Attributes:
config=/var/lib/pacemaker/conf.d/c8kubermaster2.xml
hypervisor=qemu:///system migration_transport=ssh
Meta Attrs: allow-migrate=true failure-timeout=30s
Operations: migrate_from interval=0s timeout=180s
(c8kubermaster2-migrate_from-interval-0s)
migrate_to interval=0s timeout=180s
(c8kubermaster2-migrate_to-interval-0s)
monitor interval=30s
(c8kubermaster2-monitor-interval-30s)
start interval=0s timeout=90s
(c8kubermaster2-start-interval-0s)
stop interval=0s timeout=90s
(c8kubermaster2-stop-interval-0s)

many thanks, L.


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] VirtualDomain - started but... not really

2021-12-14 Thread lejeczek via Users




Hi!

My guess is that you checked the corresponding logs already; why not show them 
here?
I can imagine that the VMs die rather early after start.

Regards,
Ulrich


lejeczek via Users wrote on 10.12.2021 at 17:33 in

message:

Hi guys.

I quite often.. well, to frequently in my mind, see a VM
which cluster says:
-> $ pcs resource status | grep -v disabled
...
* c8kubermaster2(ocf::heartbeat:VirtualDomain):
   Started dzien
..

but that is false, also cluster itself confirms it:
-> $ pcs resource debug-monitor c8kubermaster2
crm_resource: Error performing operation: Not running
Operation force-check for c8kubermaster2
(ocf:heartbeat:VirtualDomain) returned: 'not running' (7)

What is the issue here, might be & how best to troubleshoot it?

-> $ pcs resource config c8kubermaster2
   Resource: c8kubermaster2 (class=ocf provider=heartbeat
type=VirtualDomain)
Attributes:
config=/var/lib/pacemaker/conf.d/c8kubermaster2.xml
hypervisor=qemu:///system migration_transport=ssh
Meta Attrs: allow-migrate=true failure-timeout=30s
Operations: migrate_from interval=0s timeout=180s
(c8kubermaster2-migrate_from-interval-0s)
migrate_to interval=0s timeout=180s
(c8kubermaster2-migrate_to-interval-0s)
monitor interval=30s
(c8kubermaster2-monitor-interval-30s)
start interval=0s timeout=90s
(c8kubermaster2-start-interval-0s)
stop interval=0s timeout=90s
(c8kubermaster2-stop-interval-0s)

many thanks, L.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
There is not much in the logs that I could see (which is 
probably why the cluster decides the resource is okay).
Isn't that exactly what the resource's monitor is for - to 
check the state of the resource? Whether it dies early or 
late should not matter.
What suffices to "fix" such a resource false-positive: I do a 
quick disable/enable of the resource, or, as in this very 
instance, rpm updates which restarted the node.
Again, how can the cluster think the resource is okay while 
debug-monitor shows it is not?
I only do not know how to reproduce this in a controlled, 
orderly manner.


thanks, L
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Resource(s) for mysql+galera cluster

2021-12-14 Thread lejeczek via Users

Hi guys

I have failed to find any good - or any, for that matter - 
info on present-day mariadb/mysql galera cluster setups. Would 
you know of any docs which discuss such a scenario comprehensively?
I see that among the resource agents there are 'mariadb' and 
'galera', but I'm still confused about which one to use.


many thanks, L.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Resource(s) for mysql+galera cluster

2021-12-15 Thread lejeczek via Users




On 15/12/2021 08:16, Michele Baldessari wrote:

Hi,

On Tue, Dec 14, 2021 at 03:09:56PM +, lejeczek via Users wrote:

I failed to find any good or any for that matter info, on present-day
mariadb/mysql galera cluster setups - would you know of any docs which
discuss such scenario comprehensively?
I see there among resources are 'mariadb' and 'galera' but which one to use
I'm still confused.

in the deployment of OpenStack via TripleO[1] we set up galera in
multi-master mode using the galera resource agent[2] (it runs in bundles
aka containers in our specific case, but that is tangential).

Here is an example of the pcs commands we use to start it (these are
a bit old, things might have changed since):

pcs property set --node controller-1 galera-role=true
pcs resource bundle create galera-bundle container docker 
image=192.168.24.1:8787/rhosp13/openstack-mariadb:pcmklatest replicas=3 masters=3 
options="--user=root --log-driver=journald -e KOLLA_CONFIG_STRATEGY=COPY_ALWAYS" 
run-command="/bin/bash /usr/local/bin/kolla_start" network=host storage-map 
id=mysql-cfg-files source-dir=/var/lib/kolla/config_files/mysql.json 
target-dir=/var/lib/kolla/config_files/config.json options=ro storage-map id=mysql-cfg-data 
source-dir=/var/lib/config-data/puppet-generated/mysql/ target-dir=/var/lib/kolla/config_files/src 
options=ro storage-map id=mysql-hosts source-dir=/etc/hosts target-dir=/etc/hosts options=ro 
storage-map id=mysql-localtime source-dir=/etc/localtime target-dir=/etc/localtime options=ro 
storage-map id=mysql-lib source-dir=/var/lib/mysql target-dir=/var/lib/mysql options=rw storage-map 
id=mysql-log-mariadb source-dir=/var/log/mariadb target-dir=/var/log/mariadb options=rw storage-map 
id=mysql-log source-dir=/var/log/containers/mysql 
target-dir=/var/log/mysql options=rw storage-map id=mysql-dev-log 
source-dir=/dev/log target-dir=/dev/log options=rw network control-port=3123 
--disabled


pcs constraint location galera-bundle rule resource-discovery=exclusive score=0 
galera-role eq true

pcs resource create galera ocf:heartbeat:galera log='/var/log/mysql/mysqld.log' 
additional_parameters='--open-files-limit=16384' enable_creation=true 
wsrep_cluster_address='gcomm://controller-0.internalapi.localdomain,controller-1.internalapi.localdomain,controller-2.internalapi.localdomain'
 
cluster_host_map='controller-0:controller-0.internalapi.localdomain;controller-1:controller-1.internalapi.localdomain;controller-2:controller-2.internalapi.localdomain'
 meta master-max=3 ordered=true container-attribute-target=host op promote 
timeout=300s on-fail=block bundle galera-bundle

I don't think we have any docs discussing tradeoffs around this setup,
but maybe this is a start. If you want multi-master galera then you can
do it with the galera RA. If you need to look at more recent configs
check out [3] (Click on a job and the logs and you can check something
like [4] for the full pcs config)

hth,
Michele

[1] https://docs.openstack.org/tripleo-docs/latest/
[2] 
https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/galera.in
[3] 
https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-master
[4] 
https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-master/31594c2/logs/overcloud-controller-0/var/log/extra/pcs.txt.gz
I had used the 'galera' OCF resource, which worked, following 
its man page, but if I can share an opinion - it seems that 
most (if not all) developers "suffer" from the same problem - 
namely, an inability to write solid, very good man/docs 
pages, which is the case with HA/Pacemaker. (there are 
exceptions such as 'lvm2', 'systemd', 'firewalld' & more 
(I'm a CentOS/Fedora user))


Man pages especially suffer - even though developers know that 
it's the first place a user/admin goes to.


So I struggle with the multi-master setup - many thanks for the 
examples. A "regular" master/slave setup works, but that gives 
only one node having 'mariadb' up, whereas the remaining 
nodes keep the resource in 'standby' mode - as I understand it.
When I try to fiddle with configs similar to yours, then 
'galera' tells me:

..
Galera must be configured as a multistate Master/Slave resource.
..
and my setup should be simpler - no containers, no bundles - 
only 3 nodes.
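
For the record, a hedged sketch of the shape I am aiming for - a 
plain, bundle-less promotable clone across 3 nodes; the node 
names and gcomm address below are placeholders, and the exact 
meta syntax may differ between pcs versions:

-> $ pcs resource create galera ocf:heartbeat:galera 
enable_creation=true 
wsrep_cluster_address='gcomm://node1,node2,node3' 
promotable promoted-max=3 notify=true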


many thanks, L
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] ethernet link up/down - ?

2022-02-15 Thread lejeczek via Users




On 07/02/2022 19:21, Antony Stone wrote:

On Monday 07 February 2022 at 20:09:02, lejeczek via Users wrote:


Hi guys

How do you guys go about doing link up/down as a resource?

I apply or remove addresses on the interface, using "IPaddr2" and "IPv6addr",
which I know is not the same thing.

Why do you separately want to control link up/down?  I can't think what I
would use this for.


Antony.

Kind of similar - the tcp/ip and those layers' configs are 
delivered by DHCP.
I'd think it would have to be a clone resource with one 
master and without any constraints, where the cluster freely 
decides where to put the master (link up) - which is the node 
that then gets dhcp-served.
But I wonder if that would mean writing a new resource agent - 
I don't think there is anything like that included in the 
ready-made pcs/ocf packages.
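
If it did come to writing one, a very rough sketch of what such 
an agent could look like - hypothetical and untested, just to 
illustrate the start/stop/monitor shape an OCF agent needs 
('linkctl' and the default interface are made-up names):

#!/bin/sh
# linkctl - hypothetical minimal OCF agent that brings an ethernet link up/down
: ${OCF_ROOT=/usr/lib/ocf}
. ${OCF_ROOT}/lib/heartbeat/ocf-shellfuncs

IFACE="${OCF_RESKEY_interface:-eth1}"   # 'interface' parameter; the default is an assumption

case "$1" in
  start)   ip link set "$IFACE" up   || exit $OCF_ERR_GENERIC; exit $OCF_SUCCESS ;;
  stop)    ip link set "$IFACE" down || exit $OCF_ERR_GENERIC; exit $OCF_SUCCESS ;;
  monitor) ip link show "$IFACE" | grep -q 'state UP' \
             && exit $OCF_SUCCESS || exit $OCF_NOT_RUNNING ;;
  meta-data) echo '<resource-agent name="linkctl"/>'; exit $OCF_SUCCESS ;;   # a real agent needs full metadata here
  *) exit $OCF_ERR_UNIMPLEMENTED ;;
esac

(A promotable clone on top of something like this would still 
need promote/demote actions, so this is only the bare skeleton.)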


many thanks, L
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] VirtualDomain + GlusterFS - troubles coming with CentOS 9

2022-02-15 Thread lejeczek via Users

Hi guys.

With CentOS 9's packages & binaries, libgfapi is removed from 
libvirt/qemu, which means that if you want to use GlusterFS 
for VM image storage you have to expose its volumes via an FS 
mount point - that is how I understand these changes - and 
that seems to cause quite a problem for the HA.
Maybe it's only my setup, yet I'd think it'll be the only 
way forward for everybody here - if on Gluster - having a 
simple mount point with VM resources running off it causes the 
HA cluster to fail to live-migrate, but only at system reboot, 
and makes such a reboot/shutdown take "ages".
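
For context, a hedged sketch of the kind of setup I mean - the 
volume name, mount point and VM resource name below are 
placeholders, not my actual config:

-> $ pcs resource create gluster-vmstore ocf:heartbeat:Filesystem 
device='localhost:/vmstore' directory='/mnt/vmstore' fstype='glusterfs' clone
-> $ pcs constraint order start gluster-vmstore-clone then start c8kubermaster1
-> $ pcs constraint colocation add c8kubermaster1 with gluster-vmstore-clone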


Does anybody see the same or a similar problem?
many thanks, L.


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] PAF with postgresql 13?

2022-03-08 Thread lejeczek via Users




On 08/03/2022 10:21, Jehan-Guillaume de Rorthais wrote:
op start timeout=60s \ op stop timeout=60s \ op promote timeout=30s \
op demote timeout=120s \ op monitor interval=15s timeout=10s role="Master" meta master-max=1 \
op monitor interval=16s timeout=10s role="Slave" \ op notify timeout=60s meta notify=true

Because "op" appears, we are back in resource ("pgsqld") context; 
anything after is interpreted as resource and operation attributes, 
even the "meta notify=true". That's why your pgsqld-clone doesn't 
have the meta attribute "notify=true" set.
Here is a one-liner that should do it - adding, as per the 
'debug-' suggestion, 'master-max=1':


-> $ pcs resource create pgsqld ocf:heartbeat:pgsqlms 
bindir=/usr/bin pgdata=/var/lib/pgsql/data op start 
timeout=60s op stop timeout=60s op promote timeout=30s op 
demote timeout=120s op monitor interval=15s timeout=10s 
role="Master" op monitor interval=16s timeout=10s 
role="Slave" op notify timeout=60s promotable notify=true 
master-max=1 && pcs constraint colocation add HA-10-1-1-226 
with master pgsqld-clone INFINITY && pcs constraint order 
promote pgsqld-clone then start HA-10-1-1-226 
symmetrical=false kind=Mandatory && pcs constraint order 
demote pgsqld-clone then stop HA-10-1-1-226 
symmetrical=false kind=Mandatory


but ... ! this "issue" is reproducible! So, once you have a 
working 'pgsqlms', do:


-> $ pcs resource delete pgsqld

The '-clone' should get removed too, so now there are no 
'pgsqld' resources, but the cluster - weirdly, in my mind - 
leaves the node attributes on.
I see 'master-pgsqld' on each node and do not see why 
'node attributes' should be kept (certainly shown) for 
non-existent resources (to which alone those attributes are 
intrinsic).
So, if you want to "clean" that up, perhaps because for now you 
are not going to have/use 'pgsqlms', you can do that with:


-> $ pcs node attribute node1 master-pgsqld="" # same for 
remaining nodes


now ... ! repeat the one-liner which worked just a moment 
ago and you should get the exact same or similar errors (while 
all nodes are stuck as 'slave'):


-> $ pcs resource debug-promote pgsqld
crm_resource: Error performing operation: Error occurred
Operation force-promote for pgsqld (ocf:heartbeat:pgsqlms) 
returned 1 (error: Can not get current node LSN location)

/tmp:5432 - accepting connections

ocf-exit-reason:Can not get current node LSN location

You have to 'cib-push' to "fix" this very problem.
In my (an admin's) opinion this is a 100% candidate for a bug - 
whether in PCS or PAF - perhaps the authors may wish to comment?


many thanks, L.



___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] PAF with postgresql 13?

2022-03-08 Thread lejeczek via Users




On 08/03/2022 16:20, Jehan-Guillaume de Rorthais wrote:

Removing the node attributes with the resource might be legit from the
Pacemaker point of view, but I'm not sure how they can track the dependency
(ping Ken?).

PAF has no way to know the ressource is being deleted and can not remove its
node attribute before hand.

Maybe PCS can look for promotable score and remove them during the "resource
delete" command (ping Tomas)?

A bit of a catch-22, no?
To those of us for whom it's a first foray into all "this" it 
might be - it was to me. I see "attributes" hanging on for 
no reason, the resource does not exist, so the first thing I 
want to do is a "cleanup" - natural - but for this to result in 
an inability to re-create the resource (all remain slaves) at a 
later time with a 'pcs' one-liner (no cib-push) is a no-no, 
should be not...
... because, at that later time, why should resource 
re-creation with 'pcs resource create' not bring back those 
'node attrs'?


Unless... 'pgsqlms' can only be created by 'cib-push' under 
every circumstance - so, a) after manual node attr removal, 
b) on a clean cluster, the very first time, with no prior 
'pgsqlms' resource deployment.


yes, it would be great to have this 
improvement/enhancement incorporated in future releases - 
those node attrs removed at resource removal time.


many thanks, L.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] PAF with postgresql 13?

2022-02-22 Thread lejeczek via Users

On 21/02/2022 16:01, Jehan-Guillaume de Rorthais wrote:

On Mon, 21 Feb 2022 09:04:27 +
CHAMPAGNE Julie  wrote:
...

The last release is 2 years old, is it still in development?

There's no activity because there's not much to do on it. PAF is mainly in
maintenance (bug fix) mode.

I have few ideas here and there. It might land soon or later, but nothing
really fancy. It just works.

The current effort is to reborn the old workshop that was written few years ago
to translate it to english.
Perhaps as the author(s) you can chip in and/or help via comments to 
rectify this:


...

 Problem: package resource-agents-paf-4.9.0-7.el8.x86_64 requires 
resource-agents = 4.9.0-7.el8, but none of the providers can be installed
  - cannot install both resource-agents-4.9.0-7.el8.x86_64 and 
resource-agents-4.9.0-13.el8.x86_64
  - cannot install the best update candidate for package 
resource-agents-paf-2.3.0-1.noarch
  - cannot install the best update candidate for package 
resource-agents-4.9.0-13.el8.x86_64
(try to add '--allowerasing' to command line to replace conflicting 
packages or '--skip-broken' to skip uninstallable packages or '--nobest' 
to use not only best candidate packages)

...

The way I understand it, the CentOS guys/community have PAF in 
the 'resilientstorage' repo.


many thanks, L

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] ethernet link up/down - ?

2022-02-07 Thread lejeczek via Users

Hi guys

How do you guys go about doing link up/down as a resource?

many thanks, L.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] PAF / PGSQLMS and wal_level ?

2023-09-13 Thread lejeczek via Users




On 08/09/2023 17:29, Jehan-Guillaume de Rorthais wrote:

On Fri, 8 Sep 2023 16:52:53 +0200
lejeczek via Users  wrote:


Hi guys.

Before I start fiddling and break things I wonder if
somebody knows if:
pgSQL can work with wal_level = archive for PAF?
Or a more general question which pertains to wal_level - can
_barman_ be used with pgSQL "under" PAF?

PAF needs "wal_level = replica" (or "hot_standby" on very old versions) so it
can have hot standbys where it can connects and query there status.

Wal level "replica" includes the archive level, so you can set up archiving.

Of course you can use barman or any other tools to manage your PITR Backups,
even when Pacemaker/PAF is looking at your instances. This is even the very
first step you should focus on during your journey to HA.

Regards,
and with _barman_ specifically - is one method preferred, 
recommended over another: streaming VS rsync - for/with PAF?
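
For context, a hedged illustration of what WAL archiving to 
barman could look like in postgresql.conf - the host and server 
names are placeholders, and 'barman-wal-archive' assumes the 
barman-cli package is installed:

wal_level = replica
archive_mode = on
archive_command = 'barman-wal-archive backup-host pgsql-paf %p'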

many thanks, L.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] PAF / PGSQLMS on Ubuntu

2023-09-07 Thread lejeczek via Users

Hi guys.

I'm trying to set ocf_heartbeat_pgsqlms agent but I get:
...
Failed Resource Actions:
  * PGSQL-PAF-5433 stop on ubusrv3 returned 'invalid 
parameter' because 'Parameter "recovery_target_timeline" 
MUST be set to 'latest'. It is currently set to ''' at Thu 
Sep  7 13:58:06 2023 after 54ms


I'm new to Ubuntu and I see that Ubuntu has a somewhat different 
approach to paths (in comparison to how CentOS does it).

I see separation between config & data, eg.

14  paf 5433 down   postgres /var/lib/postgresql/14/paf 
/var/log/postgresql/postgresql-14-paf.log


I create the resource like here:

-> $ pcs resource create PGSQL-PAF-5433 
ocf:heartbeat:pgsqlms pgport=5433 bindir=/usr/bin 
pgdata=/etc/postgresql/14/paf 
datadir=/var/lib/postgresql/14/paf meta failure-timeout=30s 
master-max=1 op start timeout=60s op stop timeout=60s op 
promote timeout=30s op demote timeout=120s op monitor 
interval=15s timeout=10s role="Promoted" op monitor 
interval=16s timeout=10s role="Unpromoted" op notify 
timeout=60s promotable notify=true failure-timeout=30s 
master-max=1 --disable


Ubuntu 22.04.3 LTS
What am I missing can you tell?
many thanks, L.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] PAF / PGSQLMS on Ubuntu

2023-09-07 Thread lejeczek via Users



On 07/09/2023 16:09, Andrei Borzenkov wrote:

On Thu, Sep 7, 2023 at 5:01 PM lejeczek via Users  wrote:

Hi guys.

I'm trying to set ocf_heartbeat_pgsqlms agent but I get:
...
Failed Resource Actions:
   * PGSQL-PAF-5433 stop on ubusrv3 returned 'invalid parameter' because 'Parameter 
"recovery_target_timeline" MUST be set to 'latest'. It is currently set to ''' 
at Thu Sep  7 13:58:06 2023 after 54ms

I'm new to Ubuntu and I see that Ubuntu has a bit different approach to paths 
(in comparison to how Centos do it).
I see separation between config & data, eg.

14  paf 5433 down   postgres /var/lib/postgresql/14/paf 
/var/log/postgresql/postgresql-14-paf.log

I create the resource like here:

-> $ pcs resource create PGSQL-PAF-5433 ocf:heartbeat:pgsqlms pgport=5433 bindir=/usr/bin 
pgdata=/etc/postgresql/14/paf datadir=/var/lib/postgresql/14/paf meta failure-timeout=30s master-max=1 
op start timeout=60s op stop timeout=60s op promote timeout=30s op demote timeout=120s op monitor 
interval=15s timeout=10s role="Promoted" op monitor interval=16s timeout=10s 
role="Unpromoted" op notify timeout=60s promotable notify=true failure-timeout=30s 
master-max=1 --disable

Ubuntu 22.04.3 LTS
What am I missing can you tell?

Exactly what the message tells you. You need to set recovery_target_timeline=latest.

and does having it in 'postgresql.conf' make it all work for you?
I've had it there and got those errors anyway - perhaps it has 
to be set some place else.

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] PAF / PGSQLMS and wal_level ?

2023-09-08 Thread lejeczek via Users

Hi guys.

Before I start fiddling and break things, I wonder if 
somebody knows whether:

pgSQL can work with wal_level = archive for PAF?
Or, a more general question which pertains to wal_level - can 
_barman_ be used with pgSQL "under" PAF?


many thanks, L.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] PAF / PGSQLMS on Ubuntu

2023-09-08 Thread lejeczek via Users



On 07/09/2023 16:20, lejeczek via Users wrote:



On 07/09/2023 16:09, Andrei Borzenkov wrote:
On Thu, Sep 7, 2023 at 5:01 PM lejeczek via Users 
 wrote:

Hi guys.

I'm trying to set ocf_heartbeat_pgsqlms agent but I get:
...
Failed Resource Actions:
   * PGSQL-PAF-5433 stop on ubusrv3 returned 'invalid 
parameter' because 'Parameter "recovery_target_timeline" 
MUST be set to 'latest'. It is currently set to ''' at 
Thu Sep  7 13:58:06 2023 after 54ms


I'm new to Ubuntu and I see that Ubuntu has a bit 
different approach to paths (in comparison to how Centos 
do it).

I see separation between config & data, eg.

14  paf 5433 down   postgres 
/var/lib/postgresql/14/paf 
/var/log/postgresql/postgresql-14-paf.log


I create the resource like here:

-> $ pcs resource create PGSQL-PAF-5433 
ocf:heartbeat:pgsqlms pgport=5433 bindir=/usr/bin 
pgdata=/etc/postgresql/14/paf 
datadir=/var/lib/postgresql/14/paf meta 
failure-timeout=30s master-max=1 op start timeout=60s op 
stop timeout=60s op promote timeout=30s op demote 
timeout=120s op monitor interval=15s timeout=10s 
role="Promoted" op monitor interval=16s timeout=10s 
role="Unpromoted" op notify timeout=60s promotable 
notify=true failure-timeout=30s master-max=1 --disable


Ubuntu 22.04.3 LTS
What am I missing can you tell?
Exactly what the message tells you. You need to set 
recovery_target_timeline=latest.

and having it in 'postgresql.conf' make it all work for you?
I've had it and got those errors - perhaps that has to be 
set some place else.


In case anybody else ends up in this situation - I was missing 
one important bit: _bindir_.
Ubuntu's pgSQL binaries live under a different path - and what 
the resource/agent returns as errors is utterly confusing.
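
For anyone else hitting this: on Ubuntu/Debian the versioned 
server binaries live under /usr/lib/postgresql/<version>/bin 
(courtesy of the postgresql-common packaging), so - hedged, 
from memory - the create command needed something like:

-> $ pcs resource create PGSQL-PAF-5433 ocf:heartbeat:pgsqlms 
pgport=5433 bindir=/usr/lib/postgresql/14/bin 
pgdata=/etc/postgresql/14/paf datadir=/var/lib/postgresql/14/paf ...

(the trailing '...' stands for the same op/meta options as in 
the original command above)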

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] PAF / pgSQL fails after OS/system shutdown

2023-11-07 Thread lejeczek via Users

hi guys

I have a 3-node pgSQL cluster with PAF - when all three 
systems are shut down at virtually the same time, PAF 
fails to start once the HA cluster is operational again.


from status:
...
Migration Summary:
  * Node: ubusrv2 (2):
    * PGSQL-PAF-5433: migration-threshold=100 
fail-count=100 last-failure='Tue Nov  7 17:52:38 2023'

  * Node: ubusrv3 (3):
    * PGSQL-PAF-5433: migration-threshold=100 
fail-count=100 last-failure='Tue Nov  7 17:52:38 2023'

  * Node: ubusrv1 (1):
    * PGSQL-PAF-5433: migration-threshold=100 
fail-count=100 last-failure='Tue Nov  7 17:52:38 2023'


Failed Resource Actions:
  * PGSQL-PAF-5433_stop_0 on ubusrv2 'error' (1): call=90, 
status='complete', exitreason='Unexpected state for instance 
"PGSQL-PAF-5433" (returned 1)', last-rc-change='Tue Nov  7 
17:52:38 2023', queued=0ms, exec=84ms
  * PGSQL-PAF-5433_stop_0 on ubusrv3 'error' (1): call=82, 
status='complete', exitreason='Unexpected state for instance 
"PGSQL-PAF-5433" (returned 1)', last-rc-change='Tue Nov  7 
17:52:38 2023', queued=0ms, exec=82ms
  * PGSQL-PAF-5433_stop_0 on ubusrv1 'error' (1): call=86, 
status='complete', exitreason='Unexpected state for instance 
"PGSQL-PAF-5433" (returned 1)', last-rc-change='Tue Nov  7 
17:52:38 2023', queued=0ms, exec=108ms


and all three pgSQLs show virtually identical logs:
...
2023-11-07 16:54:45.532 UTC [24936] LOG:  starting 
PostgreSQL 14.9 (Ubuntu 14.9-0ubuntu0.22.04.1) on 
x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 
11.4.0-1ubuntu1~22.04) 11.4.0, 64-bit
2023-11-07 16:54:45.532 UTC [24936] LOG:  listening on IPv4 
address "0.0.0.0", port 5433
2023-11-07 16:54:45.532 UTC [24936] LOG:  listening on IPv6 
address "::", port 5433
2023-11-07 16:54:45.535 UTC [24936] LOG:  listening on Unix 
socket "/var/run/postgresql/.s.PGSQL.5433"
2023-11-07 16:54:45.547 UTC [24938] LOG:  database system 
was interrupted while in recovery at log time 2023-11-07 
15:30:56 UTC
2023-11-07 16:54:45.547 UTC [24938] HINT:  If this has 
occurred more than once some data might be corrupted and you 
might need to choose an earlier recovery target.

2023-11-07 16:54:45.819 UTC [24938] LOG:  entering standby mode
2023-11-07 16:54:45.824 UTC [24938] FATAL:  could not open 
directory "/var/run/postgresql/14-paf.pg_stat_tmp": No such 
file or directory
2023-11-07 16:54:45.825 UTC [24936] LOG:  startup process 
(PID 24938) exited with exit code 1
2023-11-07 16:54:45.825 UTC [24936] LOG:  aborting startup 
due to startup process failure
2023-11-07 16:54:45.826 UTC [24936] LOG:  database system is 
shut down


Is this "test" case's result, as I showed above, expected? 
It reproduces every time.

If not - what might it be I'm missing?

many thanks, L.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] PAF / pgSQL fails after OS/system shutdown - FIX

2023-11-10 Thread lejeczek via Users



On 07/11/2023 17:57, lejeczek via Users wrote:

hi guys

Having 3-node pgSQL cluster with PAF - when all three 
systems are shutdown at virtually the same time then PAF 
fails to start when HA cluster is operational again.


from status:
...
Migration Summary:
  * Node: ubusrv2 (2):
    * PGSQL-PAF-5433: migration-threshold=100 
fail-count=100 last-failure='Tue Nov  7 17:52:38 2023'

  * Node: ubusrv3 (3):
    * PGSQL-PAF-5433: migration-threshold=100 
fail-count=100 last-failure='Tue Nov  7 17:52:38 2023'

  * Node: ubusrv1 (1):
    * PGSQL-PAF-5433: migration-threshold=100 
fail-count=100 last-failure='Tue Nov  7 17:52:38 2023'


Failed Resource Actions:
  * PGSQL-PAF-5433_stop_0 on ubusrv2 'error' (1): call=90, 
status='complete', exitreason='Unexpected state for 
instance "PGSQL-PAF-5433" (returned 1)', 
last-rc-change='Tue Nov  7 17:52:38 2023', queued=0ms, 
exec=84ms
  * PGSQL-PAF-5433_stop_0 on ubusrv3 'error' (1): call=82, 
status='complete', exitreason='Unexpected state for 
instance "PGSQL-PAF-5433" (returned 1)', 
last-rc-change='Tue Nov  7 17:52:38 2023', queued=0ms, 
exec=82ms
  * PGSQL-PAF-5433_stop_0 on ubusrv1 'error' (1): call=86, 
status='complete', exitreason='Unexpected state for 
instance "PGSQL-PAF-5433" (returned 1)', 
last-rc-change='Tue Nov  7 17:52:38 2023', queued=0ms, 
exec=108ms


and all three pgSQLs show virtually identical logs:
...
2023-11-07 16:54:45.532 UTC [24936] LOG:  starting 
PostgreSQL 14.9 (Ubuntu 14.9-0ubuntu0.22.04.1) on 
x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 
11.4.0-1ubuntu1~22.04) 11.4.0, 64-bit
2023-11-07 16:54:45.532 UTC [24936] LOG:  listening on 
IPv4 address "0.0.0.0", port 5433
2023-11-07 16:54:45.532 UTC [24936] LOG:  listening on 
IPv6 address "::", port 5433
2023-11-07 16:54:45.535 UTC [24936] LOG:  listening on 
Unix socket "/var/run/postgresql/.s.PGSQL.5433"
2023-11-07 16:54:45.547 UTC [24938] LOG:  database system 
was interrupted while in recovery at log time 2023-11-07 
15:30:56 UTC
2023-11-07 16:54:45.547 UTC [24938] HINT:  If this has 
occurred more than once some data might be corrupted and 
you might need to choose an earlier recovery target.
2023-11-07 16:54:45.819 UTC [24938] LOG:  entering standby 
mode
2023-11-07 16:54:45.824 UTC [24938] FATAL:  could not open 
directory "/var/run/postgresql/14-paf.pg_stat_tmp": No 
such file or directory
2023-11-07 16:54:45.825 UTC [24936] LOG:  startup process 
(PID 24938) exited with exit code 1
2023-11-07 16:54:45.825 UTC [24936] LOG:  aborting startup 
due to startup process failure
2023-11-07 16:54:45.826 UTC [24936] LOG:  database system 
is shut down


Is this "test" case's result, as I showed above, expected? 
It reproduces every time.

If not - what might it be I'm missing?

many thanks, L.

___

To share my "fix" for it - perhaps the problem was introduced 
by OS/package (Ubuntu 22) updates - ? - as opposed to the 
resource agent itself.


As the logs point out, pg_stat_tmp is missing, and from 
what I see it's only the master, within a cluster, doing 
those stats.
That directive "appeared" - I use the word because I did not 
put it into the configs - on all nodes.

fix = do not use the _pg_stat_tmp_ directive/option at all.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] PAF / pgSQL fails after OS/system shutdown - FIX

2023-11-10 Thread lejeczek via Users



On 10/11/2023 13:13, Jehan-Guillaume de Rorthais wrote:

On Fri, 10 Nov 2023 12:27:24 +0100
lejeczek via Users  wrote:
...
  

to share my "fix" for it - perhaps it was introduced by
OS/packages (Ubuntu 22) updates - ? - as oppose to resource
agent itself.

As the logs point out - pg_stat_tmp - is missing and from
what I see it's only the master, within a cluster, doing
those stats.
That appeared, I use the word for I did not put it into
configs, on all nodes.
fix = to not use _pg_stat_tmp_ directive/option at all.

Of course you can use "pg_stat_tmp", just make sure the temp folder exists:

   cat <<EOF > /etc/tmpfiles.d/postgresql-part.conf
   # Directory for PostgreSQL temp stat files
   d /var/run/postgresql/14-paf.pg_stat_tmp 0700 postgres postgres - -
   EOF

To take this file in consideration immediately without rebooting the server,
run the following command:

   systemd-tmpfiles --create /etc/tmpfiles.d/postgresql-part.conf
Then there must be something else at play here with Ubuntu, 
for none of the nodes has any extra/additional configs for 
those paths & I'm sure that those were not created manually.
Perhaps pgSQL created these on its own, outside of the 
HA cluster. Also, one node has the path missing but the 
other two have it & between those two, only - as it seems to 
me - the master PG actually put any data there, and if that 
is the case - does the existence of the path alone guarantee 
anything, is the question.

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] PAF / pgSQL fails after OS/system shutdown - FIX

2023-11-10 Thread lejeczek via Users




On 10/11/2023 18:16, Jehan-Guillaume de Rorthais wrote:

On Fri, 10 Nov 2023 17:17:41 +0100
lejeczek via Users  wrote:

...

Of course you can use "pg_stat_tmp", just make sure the temp folder exists:

cat <<EOF > /etc/tmpfiles.d/postgresql-part.conf
# Directory for PostgreSQL temp stat files
d /var/run/postgresql/14-paf.pg_stat_tmp 0700 postgres postgres - -
EOF

To take this file in consideration immediately without rebooting the server,
run the following command:

systemd-tmpfiles --create /etc/tmpfiles.d/postgresql-part.conf

Then there must be something else at play here with Ubuntus,
for none of the nodes has any extra/additional configs for
those paths & I'm sure that those were not created manually.

Indeed.

This parameter is usually set by pg_createcluster command and the folder
created by both pg_createcluster and pg_ctlcluster commands when needed.

This is explained in PAF tutorial there:

https://clusterlabs.github.io/PAF/Quick_Start-Debian-10-pcs.html#postgresql-and-cluster-stack-installation

These commands comes from the postgresql-common wrapper, used in all Debian
related distros, allowing to install, create and use multiple PostgreSQL
versions on the same server.


Perhpaphs pgSQL created these on it's own outside of HA-cluster.

No, the Debian packaging did.

Just create the config file I pointed you in my previous answer,
systemd-tmpfiles will take care of it and you'll be fine.


yes, that was a weird trip.
I could not take the whole cluster down & as I kept fiddling 
with it, trying to "fix" it - within & outside of _pcs_ - those 
paths were created by wrappers from the distro packages - I now 
see that - but at first it did not look that way.
Still, that directive for the stats - I wonder if it got 
introduced, injected somewhere in between, because the PG 
cluster - which I have had for a while - did not experience 
this "issue" from the start, not until "recently".

thanks, L.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] PAF / pgSQL fails after OS/system shutdown

2023-11-09 Thread lejeczek via Users



On 07/11/2023 17:57, lejeczek via Users wrote:

hi guys

Having 3-node pgSQL cluster with PAF - when all three 
systems are shutdown at virtually the same time then PAF 
fails to start when HA cluster is operational again.


from status:
...
Migration Summary:
  * Node: ubusrv2 (2):
    * PGSQL-PAF-5433: migration-threshold=100 
fail-count=100 last-failure='Tue Nov  7 17:52:38 2023'

  * Node: ubusrv3 (3):
    * PGSQL-PAF-5433: migration-threshold=100 
fail-count=100 last-failure='Tue Nov  7 17:52:38 2023'

  * Node: ubusrv1 (1):
    * PGSQL-PAF-5433: migration-threshold=100 
fail-count=100 last-failure='Tue Nov  7 17:52:38 2023'


Failed Resource Actions:
  * PGSQL-PAF-5433_stop_0 on ubusrv2 'error' (1): call=90, 
status='complete', exitreason='Unexpected state for 
instance "PGSQL-PAF-5433" (returned 1)', 
last-rc-change='Tue Nov  7 17:52:38 2023', queued=0ms, 
exec=84ms
  * PGSQL-PAF-5433_stop_0 on ubusrv3 'error' (1): call=82, 
status='complete', exitreason='Unexpected state for 
instance "PGSQL-PAF-5433" (returned 1)', 
last-rc-change='Tue Nov  7 17:52:38 2023', queued=0ms, 
exec=82ms
  * PGSQL-PAF-5433_stop_0 on ubusrv1 'error' (1): call=86, 
status='complete', exitreason='Unexpected state for 
instance "PGSQL-PAF-5433" (returned 1)', 
last-rc-change='Tue Nov  7 17:52:38 2023', queued=0ms, 
exec=108ms


and all three pgSQLs show virtually identical logs:
...
2023-11-07 16:54:45.532 UTC [24936] LOG:  starting 
PostgreSQL 14.9 (Ubuntu 14.9-0ubuntu0.22.04.1) on 
x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 
11.4.0-1ubuntu1~22.04) 11.4.0, 64-bit
2023-11-07 16:54:45.532 UTC [24936] LOG:  listening on 
IPv4 address "0.0.0.0", port 5433
2023-11-07 16:54:45.532 UTC [24936] LOG:  listening on 
IPv6 address "::", port 5433
2023-11-07 16:54:45.535 UTC [24936] LOG:  listening on 
Unix socket "/var/run/postgresql/.s.PGSQL.5433"
2023-11-07 16:54:45.547 UTC [24938] LOG:  database system 
was interrupted while in recovery at log time 2023-11-07 
15:30:56 UTC
2023-11-07 16:54:45.547 UTC [24938] HINT:  If this has 
occurred more than once some data might be corrupted and 
you might need to choose an earlier recovery target.
2023-11-07 16:54:45.819 UTC [24938] LOG:  entering standby 
mode
2023-11-07 16:54:45.824 UTC [24938] FATAL:  could not open 
directory "/var/run/postgresql/14-paf.pg_stat_tmp": No 
such file or directory
2023-11-07 16:54:45.825 UTC [24936] LOG:  startup process 
(PID 24938) exited with exit code 1
2023-11-07 16:54:45.825 UTC [24936] LOG:  aborting startup 
due to startup process failure
2023-11-07 16:54:45.826 UTC [24936] LOG:  database system 
is shut down


Is this "test" case's result, as I showed above, expected? 
It reproduces every time.

If not - what might it be I'm missing?

many thanks, L.

Actually, the resource fails to start on a single node - as 
opposed to after an entire cluster shutdown, as I noted 
originally - which was powered down in an orderly fashion 
and powered back on.
At the time of the power-cycle that node was the PAF resource 
master; it fails:

...
2023-11-09 20:35:04.439 UTC [17727] LOG:  starting 
PostgreSQL 14.9 (Ubuntu 14.9-0ubuntu0.22.04.1) on 
x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 
11.4.0-1ubuntu1~22.04) 11.4.0, 64-bit
2023-11-09 20:35:04.439 UTC [17727] LOG:  listening on IPv4 
address "0.0.0.0", port 5433
2023-11-09 20:35:04.439 UTC [17727] LOG:  listening on IPv6 
address "::", port 5433
2023-11-09 20:35:04.442 UTC [17727] LOG:  listening on Unix 
socket "/var/run/postgresql/.s.PGSQL.5433"
2023-11-09 20:35:04.452 UTC [17731] LOG:  database system 
was interrupted while in recovery at log time 2023-11-09 
20:25:21 UTC
2023-11-09 20:35:04.452 UTC [17731] HINT:  If this has 
occurred more than once some data might be corrupted and you 
might need to choose an earlier recovery target.

2023-11-09 20:35:04.809 UTC [17731] LOG:  entering standby mode
2023-11-09 20:35:04.813 UTC [17731] FATAL:  could not open 
directory "/var/run/postgresql/14-paf.pg_stat_tmp": No such 
file or directory
2023-11-09 20:35:04.814 UTC [17727] LOG:  startup process 
(PID 17731) exited with exit code 1
2023-11-09 20:35:04.814 UTC [17727] LOG:  aborting startup 
due to startup process failure
2023-11-09 20:35:04.815 UTC [17727] LOG:  database system is 
shut down



The master role, at the time the node was shut down, did get 
moved over to the standby/slave node properly.


I'm on Ubuntu with:

ii  corosync  3.1.6-1ubuntu1   amd64    
cluster engine daemon and utilities
ii  pacemaker 2.1.2-1ubuntu3.1 amd64    
cluster resource manager
ii  pacemaker-cli-utils   2.1.2-1ubuntu3.1 amd64    
cluster resource manager command line utilities
ii  pacemaker-common  2.1.2-1ubuntu3.1 all  
cluster resource manager common files
ii  pacemaker-resource-agents 2.1.2-1ubuntu3.1 all  
cluster res

[ClusterLabs] Redis OCF agent - cannot decide which is master - ?

2022-07-07 Thread lejeczek via Users

Hi guys.

I have a peculiar case - to me at least - here.
Unless I tell the cluster to move the resource to a node 
(two-node cluster) with '--master', the cluster keeps both 
nodes as "slaves".

As soon as I remove such a constraint both nodes start to log:
...
3442363:M 07 Jul 2022 20:05:31.561 # Setting secondary 
replication ID to b14b339d4b0162c3bcd73c9591b60fa491f475dd, 
valid up to offset: 435. New replication ID is 
d704c3a0815fdca98c22392341c7a05e2bb89429
3442363:M 07 Jul 2022 20:05:31.561 * MASTER MODE enabled 
(user request from 'id=66 addr=/run/redis/redis.sock:0 
laddr=/run/redis/redis.sock:0 fd=8 name= age=0 idle=0 
flags=U db=0 sub=0 psub=0 multi=-1 qbuf=34 qbuf-free=40920 
argv-mem=12 obl=0 oll=0 omem=0 tot-mem=61476 events=r 
cmd=slaveof user=default redir=-1')
3442363:M 07 Jul 2022 20:05:32.269 * Replica 10.0.1.7:6379 
asks for synchronization
3442363:M 07 Jul 2022 20:05:32.269 * Partial 
resynchronization request from 10.0.1.7:6379 accepted. 
Sending 0 bytes of backlog starting from offset 435.
3442363:S 07 Jul 2022 20:11:19.175 # Connection with replica 
10.0.1.7:6379 lost.
3442363:S 07 Jul 2022 20:11:19.175 * Before turning into a 
replica, using my own master parameters to synthesize a 
cached master: I may be able to synchronize with the new 
master with just a partial transfer.
3442363:S 07 Jul 2022 20:11:19.175 * Connecting to MASTER 
no-such-master:6379
3442363:S 07 Jul 2022 20:11:19.177 # Unable to connect to 
MASTER: (null)
3442363:S 07 Jul 2022 20:11:19.177 * REPLICAOF 
no-such-master:6379 enabled (user request from 'id=90 
addr=/run/redis/redis.sock:0 laddr=/run/redis/redis.sock:0 
fd=10 name= age=0 idle=0 flags=U db=0 sub=0 psub=0 multi=-1 
qbuf=48 qbuf-free=40906 argv-mem=25 obl=0 oll=0 omem=0 
tot-mem=61489 events=r cmd=slaveof user=default redir=-1')


and then infamous:
...
3442363:S 07 Jul 2022 20:11:24.184 # Unable to connect to 
MASTER: (null)
3442363:S 07 Jul 2022 20:11:25.187 * Connecting to MASTER 
no-such-master:6379

..

What is PCS/agent getting wrong there?
btw. OCF Redis does not need 'redis-sentinel.conf' , right?
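
In case it's useful context, a hedged sketch of how one can 
check what promotion scores the agent is actually setting 
(standard tools, resource name assumed to be 'redis'):

-> $ pcs status --full
-> $ crm_simulate -sL | grep -i redis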

many thanks, L.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] resource agent OCF_HEARTBEAT_GALERA issue/broken - ?

2022-07-26 Thread lejeczek via Users

Hi guys

I set up a clone of a new instance of mariadb galera - which otherwise, 
outside of pcs, works - but I see something weird.


Firstly cluster claims it's all good:

-> $ pcs status --full
...

  * Clone Set: mariadb-apps-clone [mariadb-apps] (promotable):
    * mariadb-apps    (ocf::heartbeat:galera):     Master 
sucker.internal.ccn
    * mariadb-apps    (ocf::heartbeat:galera):     Master 
drunk.internal.ccn


but that mariadb is _not_ started actually.

In clone's attr I set:

config=/apps/etc/mariadb-server.cnf

I also for peace of mind set:

datadir=/apps/mysql/data

even though '/apps/etc/mariadb-server.cnf' declares that & other bits - 
again, it works outside of pcs.


Then I see in pacemaker logs:

notice: mariadb-apps_star...@drunk.internal.ccn output [ 220726 11:56:13 
mysqld_safe Logging to '/tmp/tmp.On5VnzOyaF'.\n220726 11:56:13 
mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql\n ]


.. and I think what the F?

resource-agents-4.9.0-22.el8.x86_64

All thoughts share are much appreciated.

many thanks, L.


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] resource agent OCF_HEARTBEAT_GALERA issue/broken - ?

2022-07-29 Thread lejeczek via Users

On 28/07/2022 00:33, Reid Wahl wrote:

On Wed, Jul 27, 2022 at 2:08 AM lejeczek via Users
 wrote:



On 26/07/2022 20:56, Reid Wahl wrote:

On Tue, Jul 26, 2022 at 4:21 AM lejeczek via Users
 wrote:

Hi guys

I set up a clone of a new instance of mariadb galera - which otherwise,
outside of pcs works - but I see something weird.

Firstly cluster claims it's all good:

-> $ pcs status --full
...

 * Clone Set: mariadb-apps-clone [mariadb-apps] (promotable):
   * mariadb-apps(ocf::heartbeat:galera): Master
sucker.internal.ccn
   * mariadb-apps(ocf::heartbeat:galera): Master
drunk.internal.ccn

Clearly the problem is that your server is drunk.


but that mariadb is _not_ started actually.

In clone's attr I set:

config=/apps/etc/mariadb-server.cnf

I also for peace of mind set:

datadir=/apps/mysql/data

even tough '/apps/etc/mariadb-server.cnf' declares that & other bits -
again, works outside of pcs.

Then I see in pacemaker logs:

notice: mariadb-apps_star...@drunk.internal.ccn output [ 220726 11:56:13
mysqld_safe Logging to '/tmp/tmp.On5VnzOyaF'.\n220726 11:56:13
mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql\n ]

.. and I think what the F?

resource-agents-4.9.0-22.el8.x86_64

All thoughts share are much appreciated.

many thanks, L.


How do you start mariadb outside of pacemaker's control?

simply by:
-> $ sudo -u mysql /usr/libexec/mysqld
--defaults-file=/apps/etc/mariadb-server.cnf
--wsrep-new-clusterrep

Got it. FWIW, `--wsrep-new-clusterrep` doesn't appear to be a valid
option (no search results and nothing in the help output). Maybe this
is a typo in the email; `--wsrep-new-cluster` is valid.


It seems that *something* is running, based on the "Starting" message
and the fact that the resources are still in Started state...

Yes, a "regular" instance of galera (which outside of
pacemaker works with "regular" systemd) is running as ofc
one resource already.


The logic to start and promote the galera resource is contained within
/usr/lib/ocf/resource.d/heartbeat/galera and
/usr/lib/ocf/lib/heartbeat/mysql-common.sh. I encourage you to inspect
those for any relevant differences between your own startup method and
that of the resource agent.

As one example, note that the resource agent uses mysqld_safe by
default. This is configurable via the `binary` option for the
resource. Be sure that you've looked at all the available options
(`pcs resource describe ocf:heartbeat:galera`) and configured any of
them that you need. You've definitely at least started that process
with config and datadir.

log snippet I pasted was not to emphasize 'mysqld_safe' but
the fact that resource does:
...
Starting mysqld daemon with databases from /var/lib/mysql
...

Yes, I did check the man pages for options, and if you suggest
checking the source code then I say that's a case for a bug report.

For something like pacemaker, I wouldn't point a user to the code.
galera and mysql-common.sh are shell scripts, so they're easier to
interpret. The idea was "maybe you can find a discrepancy between how
the scripts are starting mysql, and how you're manually starting
mysql."

Filing a bug might be reasonable. I don't have much hands-on
experience with mysql or with this resource agent.


I tried my best to make it clear; I doubt I can do it better,
but I'll re-try:
a) bits needed to run a "new/second" instance are all in
'mariadb-server.cnf' and galera works outside of pacemaker
(cmd above)
b) if I use attrs: 'datadir' & 'config' yet resource tells
me "..daemon with databases from /var/lib/mysql..." then..
people looking into the code should perhaps be you (as
Redhat employee) and/or other authors/developers - if I
filed it as BZ - no?

It should be easily reproducible, have:
1) a galera (as a 2nd/3rd/etc. instance, outside & with
different settings from "standard" OS's rpm installation)
work & tested
2) set up a resource:
-> $ pcs resource create mariadb-apps ocf:heartbeat:galera
cluster_host_map="drunk.internal.ccn:10.0.1.6;sucker.internal.ccn:10.0.1.7"
config="/apps/etc/mariadb-server.cnf"
wsrep_cluster_address="gcomm://10.0.1.6:5567,10.0.1.7:5567"
user=mysql group=mysql check_user="pacemaker"
check_passwd="pacemaker#9897" op monitor OCF_CHECK_LEVEL="0"
timeout="30s" interval="20s" op monitor role="Master"
OCF_CHECK_LEVEL="0" timeout="30s" interval="10s" op monitor
role="Slave" OCF_CHECK_LEVEL="0" timeout="30s"
interval="30s" promotable promoted-max=2 meta
failure-timeout=30s

The datadir option isn't specified in the resource create command
above. If datadir not explicitly set, then the resource agent uses
/var/lib/mysql by default:
```
mysql_common_start()
{
 local mysql_extra_params="$1"
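
As a hedged restatement of the point above: the same create 
command as before, but with datadir passed explicitly, e.g.

-> $ pcs resource create mariadb-apps ocf:heartbeat:galera 
config="/apps/etc/mariadb-server.cnf" datadir="/apps/mysql/data" ...

(the '...' standing for the remaining options exactly as in the 
original command)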

Re: [ClusterLabs] resource agent OCF_HEARTBEAT_GALERA issue/broken - ?

2022-07-27 Thread lejeczek via Users



On 26/07/2022 20:56, Reid Wahl wrote:

On Tue, Jul 26, 2022 at 4:21 AM lejeczek via Users
 wrote:

Hi guys

I set up a clone of a new instance of mariadb galera - which otherwise,
outside of pcs works - but I see something weird.

Firstly cluster claims it's all good:

-> $ pcs status --full
...

* Clone Set: mariadb-apps-clone [mariadb-apps] (promotable):
  * mariadb-apps(ocf::heartbeat:galera): Master
sucker.internal.ccn
  * mariadb-apps(ocf::heartbeat:galera): Master
drunk.internal.ccn

Clearly the problem is that your server is drunk.


but that mariadb is _not_ started actually.

In clone's attr I set:

config=/apps/etc/mariadb-server.cnf

I also for peace of mind set:

datadir=/apps/mysql/data

even tough '/apps/etc/mariadb-server.cnf' declares that & other bits -
again, works outside of pcs.

Then I see in pacemaker logs:

notice: mariadb-apps_star...@drunk.internal.ccn output [ 220726 11:56:13
mysqld_safe Logging to '/tmp/tmp.On5VnzOyaF'.\n220726 11:56:13
mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql\n ]

.. and I think what the F?

resource-agents-4.9.0-22.el8.x86_64

All thoughts shared are much appreciated.

many thanks, L.


How do you start mariadb outside of pacemaker's control?

simply by:
-> $ sudo -u mysql /usr/libexec/mysqld 
--defaults-file=/apps/etc/mariadb-server.cnf 
--wsrep-new-cluster


It seems that *something* is running, based on the "Starting" message
and the fact that the resources are still in Started state...
Yes, a "regular" instance of galera (which outside of 
pacemaker works with "regular" systemd) is running as ofc 
one resource already.




The logic to start and promote the galera resource is contained within
/usr/lib/ocf/resource.d/heartbeat/galera and
/usr/lib/ocf/lib/heartbeat/mysql-common.sh. I encourage you to inspect
those for any relevant differences between your own startup method and
that of the resource agent.

As one example, note that the resource agent uses mysqld_safe by
default. This is configurable via the `binary` option for the
resource. Be sure that you've looked at all the available options
(`pcs resource describe ocf:heartbeat:galera`) and configured any of
them that you need. You've definitely at least started that process
with config and datadir.
log snippet I pasted was not to emphasize 'mysqld_safe' but 
the fact that resource does:

...
Starting mysqld daemon with databases from /var/lib/mysql
...

Yes, I did check the man pages for options, and if you suggest 
checking the source code then I'd say that's a case for a bug report.
I tried my best to make it clear, doubt can do that better 
but I'll re-try:
a) bits needed to run a "new/second" instance are all in 
'mariadb-server.cnf' and galera works outside of pacemaker 
(cmd above)
b) if I use attrs: 'datadir' & 'config' yet resource tells 
me "..daemon with databases from /var/lib/mysql..." then..
people looking into the code should perhaps be you (as 
Redhat employee) and/or other authors/developers - if I 
filed it as BZ - no?


It should be easily reproducible, have:
1) a galera (as a 2nd/3rd/etc. instance, outside & with 
different settings from "standard" OS's rpm installation) 
work & tested

2) set up a resource:
-> $ pcs resource create mariadb-apps ocf:heartbeat:galera 
cluster_host_map="drunk.internal.ccn:10.0.1.6;sucker.internal.ccn:10.0.1.7" 
config="/apps/etc/mariadb-server.cnf" 
wsrep_cluster_address="gcomm://10.0.1.6:5567,10.0.1.7:5567" 
user=mysql group=mysql check_user="pacemaker" 
check_passwd="pacemaker#9897" op monitor OCF_CHECK_LEVEL="0" 
timeout="30s" interval="20s" op monitor role="Master" 
OCF_CHECK_LEVEL="0" timeout="30s" interval="10s" op monitor 
role="Slave" OCF_CHECK_LEVEL="0" timeout="30s" 
interval="30s" promotable promoted-max=2 meta 
failure-timeout=30s


and you should get (give version & evn is the same) see what 
I see - weird "misbehavior"


many thanks, L.

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Ordering - clones & relocation

2022-09-01 Thread lejeczek via Users

Hi guys.

This might be tricky but also can be trivial, I'm hoping for 
the latter of course.


Ordering Constraints:
...
start non-clone then start a-clone-resource (kind:Mandatory)
...

Now, when 'non-clone' gets relocated what then happens to 
'a-clone-resource' ?
or.. for that matter, to any resource when 'ordering' is in 
play?
Do such "dependent" resources get restarted, when source 
resource 'moves' and if yes, is it possible to have PCS do 
not that?


many thanks, L.___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] podman containers as resources - ? - with a twist

2022-12-28 Thread lejeczek via Users

Hi guys.

I have a situation which begins to look like quite the 
pickle and I'm in it, with no possible, or at least no 
elegant, way out.

I'm hoping you guys can share your thoughts.
My cluster mounts a path, in two steps
1) runs systemd luks service
2) mount that unlocked luks device under a certain path
now...
that certain path is where user(s) home dir resides and... 
as the result of all that 'systemd' does not pick up user's 
systemd units. (must be way too late for 'systemd')


How would you fix that?
Right now I manually poke systemd with, as that given user:
-> $ systemctl --user daemon-reload
only then 'systemd' picks up user units - until then 
'systemd' says "Unit ...  could not be found"

Naturally, I do not want 'manually'.

I'm thinking...
somehow have cluster make OS's 'systemd' redo that user 
systemd bits, after the resource successful start, or...
have cluster somehow manage that user's systemd units 
directly, on its own.


In case it might make it bit more clear - those units are 
'podman' containers, non-root containers.


All thoughts shared are much appreciated.
many thanks, L.

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] podman containers as resources - ? - with a twist

2022-12-28 Thread lejeczek via Users




On 28/12/2022 21:53, Reid Wahl wrote:

On Wed, Dec 28, 2022 at 6:08 AM lejeczek via Users
 wrote:

Hi guys.

I have a situation which begins to look like quite the pickle and I'm in it, 
with no possible or no elegant at least, way out.
I'm hoping you guys can share your thoughts.
My cluster mounts a path, in two steps
1) runs systemd luks service
2) mount that unlocked luks device under a certain path
now...
that certain path is where user(s) home dir resides and... as the result of all 
that 'systemd' does not pick up user's systemd units. (must be way too late for 
'systemd')

How would you fix that?
Right now I manually poke systemd with, as that given user:
-> $ systemctl --user daemon-reload
only then 'systemd' picks up user units - until then 'systemd' says "Unit ...  could 
not be found"
Naturally, I do not want 'manually'.

I'm thinking...
somehow have cluster make OS's 'systemd' redo that user systemd bits, after the 
resource successful start, or...
have cluster somehow manage that user's systemd units directly, on its own.

In case it might make it bit more clear - those units are 'podman' containers, 
non-root containers.

You might be able to manage these containers via the
ocf:heartbeat:podman resource agent, with `--user=` in the
`run_opts` resource option along with any other relevant options.

If something like that doesn't work, then you could write a simple
lsb-class or systemd-class cluster resource to do what you're
currently doing manually.

There may be other options; those are the two that come to mind.
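A rough sketch of that second option - every name below is made up, and it assumes `loginctl enable-linger` has been run for the user so their systemd instance (and /run/user/<uid>) exists:

```
# /etc/systemd/system/user-units-reload.service  (hypothetical unit)
[Unit]
Description=Re-read user systemd units once the HA-managed home mount is up

[Service]
Type=oneshot
RemainAfterExit=yes
User=appuser
Environment=XDG_RUNTIME_DIR=/run/user/1001
ExecStart=/usr/bin/systemctl --user daemon-reload
```

which the cluster can then own and order after the mount resource (names again hypothetical):

-> $ pcs resource create user-units-reload systemd:user-units-reload
-> $ pcs constraint order start home-mount then start user-units-reload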

From the man pages for that resource I assumed the agent 
creates a new container (from "scratch") each time and - if 
that is indeed the case - I need to control/manage existing ones.


I was thinking about putting these bits I do manually into a 
'systemd' service but I cannot see how to go about it - I 
understand there is a clear separation between the OS/root 
systemd and users' systemd, and I don't think - I hope I'm 
wrong - it is possible to have the OS/root systemd (from that 
namespace) drive the users' systemd/namespaces.


many thanks, L.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] multiple resources - pgsqlms - and IP(s)

2023-01-03 Thread lejeczek via Users

Hi guys.

To get/have a Postgresql cluster with the 'pgsqlms' resource, such 
a cluster needs a 'master' IP - what do you guys do when/if 
you have multiple resources of this agent?
I wonder if it is possible to keep just one IP and have all 
those resources go to it - probably 'scoring' would be very 
tricky then, or perhaps not?
Or do you use a separate IP for each 'pgsqlms' resource - the 
easiest way out?


all thoughts shared are much appreciated.
many thanks, L.___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] multiple resources - pgsqlms - and IP(s)

2023-01-03 Thread lejeczek via Users



On 03/01/2023 17:03, Jehan-Guillaume de Rorthais wrote:

Hi,

On Tue, 3 Jan 2023 16:44:01 +0100
lejeczek via Users  wrote:


To get/have Postgresql cluster with 'pgsqlms' resource, such
cluster needs a 'master' IP - what do you guys do when/if
you have multiple resources off this agent?
I wonder if it is possible to keep just one IP and have all
those resources go to it - probably 'scoring' would be very
tricky then, or perhaps not?

That would mean all promoted pgsql MUST be on the same node at any time.
If one of your instance got some troubles and need to failover, *ALL* of them
would failover.

This imply not just a small failure time window for one instance, but for all
of them, all the users.


Or you do separate IP for each 'pgsqlms' resource - the
easiest way out?

That looks like a better option to me, yes.

Regards,
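For what it's worth, a minimal sketch of the one-IP-per-instance approach (resource names are placeholders): colocate the VIP with the promoted clone and order it around promote/demote, e.g.

-> $ pcs resource create pgsql-apps-vip ocf:heartbeat:IPaddr2 ip=10.1.1.230 cidr_netmask=24 op monitor interval=10s
-> $ pcs constraint colocation add pgsql-apps-vip with Promoted pgsqld-apps-clone INFINITY
-> $ pcs constraint order promote pgsqld-apps-clone then start pgsql-apps-vip symmetrical=false kind=Mandatory
-> $ pcs constraint order demote pgsqld-apps-clone then stop pgsql-apps-vip symmetrical=false kind=Mandatory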

Not related - Is this an old bug?:

-> $ pcs resource create pgsqld-apps ocf:heartbeat:pgsqlms 
bindir=/usr/bin pgdata=/apps/pgsql/data op start timeout=60s 
op stop timeout=60s op promote timeout=30s op demote 
timeout=120s op monitor interval=15s timeout=10s 
role="Master" op monitor interval=16s timeout=10s 
role="Slave" op notify timeout=60s meta promotable=true 
notify=true master-max=1 --disable

Error: Validation result from agent (use --force to override):
  ocf-exit-reason:You must set meta parameter notify=true 
for your master resource

Error: Errors have occurred, therefore pcs is unable to continue
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] multiple resources - pgsqlms - and IP(s)

2023-01-03 Thread lejeczek via Users




On 03/01/2023 21:44, Ken Gaillot wrote:

On Tue, 2023-01-03 at 18:18 +0100, lejeczek via Users wrote:

On 03/01/2023 17:03, Jehan-Guillaume de Rorthais wrote:

Hi,

On Tue, 3 Jan 2023 16:44:01 +0100
lejeczek via Users  wrote:


To get/have Postgresql cluster with 'pgsqlms' resource, such
cluster needs a 'master' IP - what do you guys do when/if
you have multiple resources off this agent?
I wonder if it is possible to keep just one IP and have all
those resources go to it - probably 'scoring' would be very
tricky then, or perhaps not?

That would mean all promoted pgsql MUST be on the same node at any
time.
If one of your instance got some troubles and need to failover,
*ALL* of them
would failover.

This imply not just a small failure time window for one instance,
but for all
of them, all the users.


Or you do separate IP for each 'pgsqlms' resource - the
easiest way out?

That looks like a better option to me, yes.

Regards,

Not related - Is this an old bug?:

-> $ pcs resource create pgsqld-apps ocf:heartbeat:pgsqlms
bindir=/usr/bin pgdata=/apps/pgsql/data op start timeout=60s
op stop timeout=60s op promote timeout=30s op demote
timeout=120s op monitor interval=15s timeout=10s
role="Master" op monitor interval=16s timeout=10s
role="Slave" op notify timeout=60s meta promotable=true
notify=true master-max=1 --disable
Error: Validation result from agent (use --force to override):
ocf-exit-reason:You must set meta parameter notify=true
for your master resource
Error: Errors have occurred, therefore pcs is unable to continue

pcs now runs an agent's validate-all action before creating a resource.
In this case it's detecting a real issue in your command. The options
you have after "meta" are clone options, not meta options of the
resource being cloned. If you just change "meta" to "clone" it should
work.

Nope. Exact same error message.
If I remember correctly there was a bug specifically 
pertaining to 'notify=true'.

I'm on C8S with resource-agents-paf-4.9.0-35.el8.x86_64.

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] cluster with redundant links - PCSD offline

2023-02-26 Thread lejeczek via Users

Hi guys.

I have a simple 2-node cluster with redundant links and I 
wonder why status reports like this:

...
Node List:
  * Node swir (1): online, feature set 3.16.2
  * Node whale (2): online, feature set 3.16.2
...
PCSD Status:
  swir: Online
  whale: Offline
...

Cluster's config:
...
Nodes:
  swir:
    Link 0 address: 10.1.1.100
    Link 1 address: 10.3.1.100
    nodeid: 1
  whale:
    Link 0 address: 10.1.1.101
    Link 1 address: 10.3.1.101
    nodeid: 2
...

Is that all normal for a cluster when 'Link 0' is down?
I thought redundant link(s) would be for that exact reason - 
so cluster would remain fully operational.


many thanks, L.___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] ocf:heartbeat:galera - SELinux

2023-02-28 Thread lejeczek via Users

Hi guys.

I'd prefer to avoid putting 'mysqld_t' on the permissive side 
of SELinux - which apparently would be the easiest way out - 
so I wonder: does anybody here have an SELinux module which 
would make the Galera agent run successfully?

Devel/authors, I think, ignored that part - that critical part.

many thanks, L.___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] VirtualDomain - node map - ?

2023-04-17 Thread lejeczek via Users




On 17/04/2023 06:17, Andrei Borzenkov wrote:

On 16.04.2023 16:29, lejeczek via Users wrote:



On 16/04/2023 12:54, Andrei Borzenkov wrote:

On 16.04.2023 13:40, lejeczek via Users wrote:

Hi guys

Some agents do employ that concept of node/host map which I
do not see in any manual/docs that this agent does - would
you suggest some technique or tips on how to achieve similar?
I'm thinking specifically of 'migrate' here, as I understand
'migration' just uses OS' own resolver to call migrate_to node.



No, pacemaker decides where to migrate the resource and
calls agents on the current source and then on the
intended target passing this information.



Yes pacemaker does that but - as I mentioned - some agents
do employ that "internal" nodes map "technique".
I see no mention of that/similar in the manual for
VirtualDomain so I asked if perhaps somebody had an idea of
how to archive such result by some other means.



What about showing example of these "some agents" or 
better describing what you want to achieve?



'ocf_heartbeat_galera' is one such agent.
But after more fiddling with 'VirtualDomain' I understand 
this agent does offer this functionality, though not in that 
clear a way nor to that full extent, but it's there.

So, what I needed was possible to achieve.
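(For the record, that would be the agent's migration_* parameters - e.g. a hedged sketch, assuming every node resolves '<nodename>-mig' to an address on the dedicated migration network, and with 'VM-foo' as a placeholder resource name:)

-> $ pcs resource update VM-foo migration_transport=ssh migration_network_suffix=-mig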
thanks, L.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] manner in which cluster migrates VirtualDomain - ?

2023-04-18 Thread lejeczek via Users




On 18/04/2023 18:22, Ken Gaillot wrote:

On Tue, 2023-04-18 at 14:58 +0200, lejeczek via Users wrote:

Hi guys.

When it's done by the cluster itself, eg. a node goes 'standby' - how
do clusters migrate VirtualDomain resources?

1. Call resource agent migrate_to action on original node
2. Call resource agent migrate_from action on new node
3. Call resource agent stop action on original node


Do users have any control over it and if so then how?

The allow-migrate resource meta-attribute (true/false)


I'd imagine there must be some docs - I failed to find

It's sort of scattered throughout Pacemaker Explained -- the main one
is:

https://clusterlabs.org/pacemaker/doc/2.1/Pacemaker_Explained/html/advanced-options.html#migrating-resources


Especially in large deployments one obvious question would be - I'm
guessing as my setup is rather SOHO - can VMs migrate in sequence or
it is(always?) a kind of 'swarm' migration?

The migration-limit cluster property specifies how many live migrations
may be initiated at once (the default of -1 means unlimited).
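For example (a sketch, with 'VM-foo' as a placeholder resource name):

-> $ pcs resource update VM-foo meta allow-migrate=true
-> $ pcs property set migration-limit=2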
But if this is a cluster property - unless I got it wrong, 
hopefully - then this governs any/all resources.
If so, can such a limit be narrowed down to an RA type or 
perhaps a group of resources?


many thanks, L.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] manner in which cluster migrates VirtualDomain - ?

2023-04-19 Thread lejeczek via Users




On 18/04/2023 21:02, Ken Gaillot wrote:

On Tue, 2023-04-18 at 19:36 +0200, lejeczek via Users wrote:

On 18/04/2023 18:22, Ken Gaillot wrote:

On Tue, 2023-04-18 at 14:58 +0200, lejeczek via Users wrote:

Hi guys.

When it's done by the cluster itself, eg. a node goes 'standby' -
how
do clusters migrate VirtualDomain resources?

1. Call resource agent migrate_to action on original node
2. Call resource agent migrate_from action on new node
3. Call resource agent stop action on original node


Do users have any control over it and if so then how?

The allow-migrate resource meta-attribute (true/false)


I'd imagine there must be some docs - I failed to find

It's sort of scattered throughout Pacemaker Explained -- the main
one
is:

https://clusterlabs.org/pacemaker/doc/2.1/Pacemaker_Explained/html/advanced-options.html#migrating-resources


Especially in large deployments one obvious question would be -
I'm
guessing as my setup is rather SOHO - can VMs migrate in sequence
or
it is(always?) a kind of 'swarm' migration?

The migration-limit cluster property specifies how many live
migrations
may be initiated at once (the default of -1 means unlimited).

But if this is cluster property - unless I got it wrong,
hopefully - then this govern any/all resources.
If so, can such a limit be rounded down to RA type or
perhaps group of resources?

many thanks, L.

No, it's global
To me it feels so intuitive, so natural & obvious that I 
will ask - has nobody yet suggested that such a feature be 
available to smaller divisions of the cluster, independently 
of the global rule?
In the vastness of resource types many are polar opposites - 
should they all be treated the same?
It would be great to have some way to tell the cluster to run 
different migration/relocation limits for, e.g., compute-heavy 
resources vs light-weight ones - where to "file" such an 
enhancement suggestion, Bugzilla?


many thanks, L.

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] FEEDBACK WANTED: possible deprecation of nagios-class resources

2023-04-20 Thread lejeczek via Users




On 19/04/2023 21:08, Ken Gaillot wrote:

Hi all,

I am considering deprecating Pacemaker's support for nagios-class
resources.

This has nothing to do with nagios monitoring of a Pacemaker cluster,
which would be unaffected. This is about Pacemaker's ability to use
nagios plugin scripts as a type of resource.

Nagios-class resources act as a sort of proxy monitor for another
resource. You configure the main resource (typically a VM or container)
as normal, then configure a nagios: resource to monitor it. If
the nagios monitor fails, the main resource is restarted.

Most of that use case is now better covered by Pacemaker Remote and
bundles. The only advantage of nagios-class resources these days is if
you have a VM or container image that you can't modify (to add
Pacemaker Remote). If we deprecate nagios-class resources, the
preferred alternative would be to write a custom OCF agent with the
monitoring you want (which can even call a nagios plugin).

Does anyone here use nagios-class resources? If it's actively being
used, I'm willing to keep it around. But if there's no demand, we'd
rather not have to maintain that (poorly tested) code forever.
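For anyone weighing that alternative, a bare-bones sketch of such a wrapper could look like the following - the plugin path, host/port and agent name are made up, and a real agent would want parameters and proper validation:

```
#!/bin/sh
# check_proxy: minimal OCF-style wrapper around a nagios plugin (sketch only)
: ${OCF_ROOT=/usr/lib/ocf}
: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs

PLUGIN="/usr/lib64/nagios/plugins/check_tcp"    # hypothetical plugin path

meta_data() {
cat <<EOF
<?xml version="1.0"?>
<resource-agent name="check_proxy" version="0.1">
  <version>1.0</version>
  <longdesc lang="en">Proxy monitor that calls a nagios plugin.</longdesc>
  <shortdesc lang="en">nagios plugin wrapper</shortdesc>
  <parameters/>
  <actions>
    <action name="start" timeout="20s"/>
    <action name="stop" timeout="20s"/>
    <action name="monitor" timeout="20s" interval="30s"/>
    <action name="meta-data" timeout="5s"/>
  </actions>
</resource-agent>
EOF
}

case "$1" in
  meta-data)  meta_data; exit $OCF_SUCCESS ;;
  start|stop) exit $OCF_SUCCESS ;;              # nothing to run; monitor-only proxy
  monitor)    "$PLUGIN" -H 127.0.0.1 -p 5432 >/dev/null 2>&1 \
                && exit $OCF_SUCCESS || exit $OCF_NOT_RUNNING ;;
  *)          exit $OCF_ERR_UNIMPLEMENTED ;;
esac
```

Colocating and ordering such a resource with the VM or container resource then roughly approximates the old nagios-class behaviour.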
Myself, regarding ha/pacemaker/vm, I've not dealt with nagios 
ever.

thanks, L.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] OCF_HEARTBEAT_PGSQL - any good with current Postgres- ?

2023-04-24 Thread lejeczek via Users

Hi guys.

I've been looking up and fiddling with this RA, but 
unsuccessfully so far, so I wonder - is it any good for 
current versions of pgSQL?


many thanks, L.___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] PAF / pgsqlms - lags behind in terms of RA specs?

2023-04-26 Thread lejeczek via Users

Hi guys.

Anybody here use PAF with an up-to-date CentOS?
I see this RA fails from the cluster's perspective; I've filed a 
report over at GitHub, no comment there, so I thought I'd ask 
around.


many thanks, L.___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] manner in which cluster migrates VirtualDomain - ?

2023-04-19 Thread lejeczek via Users




On 19/04/2023 16:16, Ken Gaillot wrote:

On Wed, 2023-04-19 at 08:00 +0200, lejeczek via Users wrote:

On 18/04/2023 21:02, Ken Gaillot wrote:

On Tue, 2023-04-18 at 19:36 +0200, lejeczek via Users wrote:

On 18/04/2023 18:22, Ken Gaillot wrote:

On Tue, 2023-04-18 at 14:58 +0200, lejeczek via Users wrote:

Hi guys.

When it's done by the cluster itself, eg. a node goes
'standby' -
how
do clusters migrate VirtualDomain resources?

1. Call resource agent migrate_to action on original node
2. Call resource agent migrate_from action on new node
3. Call resource agent stop action on original node


Do users have any control over it and if so then how?

The allow-migrate resource meta-attribute (true/false)


I'd imagine there must be some docs - I failed to find

It's sort of scattered throughout Pacemaker Explained -- the
main
one
is:

https://clusterlabs.org/pacemaker/doc/2.1/Pacemaker_Explained/html/advanced-options.html#migrating-resources


Especially in large deployments one obvious question would be
-
I'm
guessing as my setup is rather SOHO - can VMs migrate in
sequence
or
it is(always?) a kind of 'swarm' migration?

The migration-limit cluster property specifies how many live
migrations
may be initiated at once (the default of -1 means unlimited).

But if this is cluster property - unless I got it wrong,
hopefully - then this govern any/all resources.
If so, can such a limit be rounded down to RA type or
perhaps group of resources?

many thanks, L.

No, it's global

To me it feels so intuitive, so natural & obvious that I
will ask - nobody yet suggested that such feature be
available to smaller divisions of cluster independently of
global rule?
In the vastness of resource types many are polar opposites
and to treat them all the same?
Would be great to have some way to tell cluster to run
different migration/relocation limits on for eg.
compute-heavy resources VS light-weight ones - where to
"file" such a enhancement suggestion, Bugzilla?

many thanks, L.

Looking at the code, I see it's a little different than I originally
thought.

First, I overlooked that it's correctly documented as a per-node limit
rather than a cluster-wide limit.

That highlights the complexity of allowing different values for
different resources; if rscA has a migration limit of 2, and rscB has a
migration limit of 5, do we allow up to 2 rscA migrations and 5 rscB
migrations simultaneously, or do we weight them relative to each other
so the total capacity is still constrained (for example limiting it to
1 rscA migration and 2 rscB migrations together)?
My first thoughts were - I cannot comment on the code, only 
inasmuch as an admin would care - perhaps to introduce, if it 
would not require a total overhaul of the business logic, 
"migration groups" (while not being another resource type) of 
which a resource could then be a member.
Or perhaps marry 'migration-limit' to 'resource group', which 
would take priority over the global/node-wide rule.
One way or another, simple for end-users - the user/admin 
sets a limit of N resources in such a group that can be 
live-migrated at one time, say...


in this given group the cluster may attempt to live-migrate 
only 2 resources simultaneously, then wait for success or 
failure - but wait for the result - and only then proceed to 
the next & ...




We would almost need something like the node utilization feature, being
able to define a node's total migration capacity and then how much of
that capacity is taken up by the migration of a specific resource. That
seems overcomplicated to me, especially since there aren't that many
resource types that support live migration.
For those types which do support live migration and are 
compute-heavy, I really wonder how large consumers do 
VirtualDomain migration, as one good example.
Say a Virtual/Cloud provider - there a chunky host node 
might host hundreds of VMs - there, but anywhere else too, 
timeouts, all/any of them, must be some real, fixed number.
As of right now, how intuitive is what the cluster does when 
it swarms - say equally - those hundreds of VMs onto the 
remaining available nodes...
... even with fast inter-node connectivity many live 
migrations - without migration-limit - will time out.
Is the cluster capable of some very clever heuristics, so 
humans could leave it to the machine to ensure that such a 
mass migration will not fail simply due to an overall 
bottleneck in the underlying infrastructure?
... and could the cluster alone do that? Would the 
VirtualDomain agent not have to gather comprehensive metric 
data on each VM in the first place, to feed it to the 
cluster's internal logic?
I would see something similar to what I mentioned above as a 
relatively effective and surely down-to-earth, practical aid 
to alleviate cases such as VM "mass migration".




Second, any actions on a Pacemaker Remote node count toward the
throttling limit of its connection host, and aren't checked for
migration-limit at all. That's an interesting 

[ClusterLabs] manner in which cluster migrates VirtualDomain - ?

2023-04-18 Thread lejeczek via Users

Hi guys.

When it's done by the cluster itself, eg. a node goes 
'standby' - how do clusters migrate VirtualDomain resources?

Do users have any control over it and if so then how?
I'd imagine there must be some docs - I failed to find
Especially in large deployments one obvious question would 
be - I'm guessing as my setup is rather SOHO - can VMs 
migrate in sequence or it is(always?) a kind of 'swarm' 
migration?


many thanks, L.___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] 99-VirtualDomain-libvirt.conf under control - ?

2023-04-29 Thread lejeczek via Users

Hi guys.

I presume these are a consequence of having a resource of 
VirtualDomain type set up (& enabled) - but where and how can 
users control the presence & content of those?


many thanks, L.___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] VirtualDomain - node map - ?

2023-04-16 Thread lejeczek via Users




On 16/04/2023 12:54, Andrei Borzenkov wrote:

On 16.04.2023 13:40, lejeczek via Users wrote:

Hi guys

Some agents do employ that concept of node/host map which I
do not see in any manual/docs that this agent does - would
you suggest some technique or tips on how to achieve 
similar?

I'm thinking specifically of 'migrate' here, as I understand
'migration' just uses OS' own resolver to call migrate_to 
node.




No, pacemaker decides where to migrate the resource and 
calls agents on the current source and then on the 
intended target passing this information.




Yes pacemaker does that but - as I mentioned - some agents 
do employ that "internal" nodes map "technique".
I see no mention of that/similar in the manual for 
VirtualDomain so I asked if perhaps somebody had an idea of 
how to archive such result by some other means.


thanks, L.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] VirtualDomain - node map - ?

2023-04-16 Thread lejeczek via Users

Hi guys

Some agents do employ that concept of node/host map which I 
do not see in any manual/docs that this agent does - would 
you suggest some technique or tips on how to achieve similar?
I'm thinking specifically of 'migrate' here, as I understand 
'migration' just uses OS' own resolver to call migrate_to node.


many thanks, L.___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] location constraint does not move promoted resource ?

2023-07-03 Thread lejeczek via Users

Hi guys.

I have pgsql which I constrain like so:

-> $ pcs constraint location PGSQL-clone rule role=Promoted 
score=-1000 gateway-link ne 1


and I have a few more location constraints with that 
ethmonitor & those work, but this one does not seem to.
When the constraint is created the cluster is silent, no errors nor 
warnings, but the relocation does not take place.
I can move promoted resource manually just fine, to that 
node where 'location' should move it.


All thoughts shared are much appreciated.
many thanks, L.

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] location constraint does not move promoted resource ?

2023-07-03 Thread lejeczek via Users




On 03/07/2023 11:16, Andrei Borzenkov wrote:

On 03.07.2023 12:05, lejeczek via Users wrote:

Hi guys.

I have pgsql with I constrain like so:

-> $ pcs constraint location PGSQL-clone rule role=Promoted
score=-1000 gateway-link ne 1

and I have a few more location constraints with that
ethmonitor & those work, but this one does not seem to.
When contraint is created cluster is silent, no errors nor
warning, but relocation does not take place.
I can move promoted resource manually just fine, to that
node where 'location' should move it.



Instance to promote is selected according to promotion 
scores which are normally set by resource agent. 
Documentation implies that standard location constraints 
are also taken in account, but there is no explanation how 
promotion scores interoperate with location scores. It is 
possible that promotion score in this case takes precedence.

It seems to have kicked in with score=-1 but.. that was me just guessing.
Indeed it would be great to know how those are calculated, 
in a way which would be admin-friendly or just obvious.
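For what it's worth, the scores the scheduler actually works with can at least be dumped - the output format varies by version, but it lists allocation and, for promotable clones, promotion scores per node:

-> $ crm_simulate -sL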


thanks, L.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] location constraint does not move promoted resource ?

2023-07-03 Thread lejeczek via Users



On 03/07/2023 18:55, Andrei Borzenkov wrote:

On 03.07.2023 19:39, Ken Gaillot wrote:

On Mon, 2023-07-03 at 19:22 +0300, Andrei Borzenkov wrote:

On 03.07.2023 18:07, Ken Gaillot wrote:
On Mon, 2023-07-03 at 12:20 +0200, lejeczek via Users 
wrote:

On 03/07/2023 11:16, Andrei Borzenkov wrote:

On 03.07.2023 12:05, lejeczek via Users wrote:

Hi guys.

I have pgsql with I constrain like so:

-> $ pcs constraint location PGSQL-clone rule 
role=Promoted

score=-1000 gateway-link ne 1

and I have a few more location constraints with that
ethmonitor & those work, but this one does not seem to.
When contraint is created cluster is silent, no 
errors nor

warning, but relocation does not take place.
I can move promoted resource manually just fine, to 
that

node where 'location' should move it.



Instance to promote is selected according to promotion
scores which are normally set by resource agent.
Documentation implies that standard location constraints
are also taken in account, but there is no 
explanation how
promotion scores interoperate with location scores. 
It is
possible that promotion score in this case takes 
precedence.

It seems to have kicked in with score=-1 but..
that was me just guessing.
Indeed it would be great to know how those are 
calculated,

in a way which would' be admin friendly or just obvious.

thanks, L.


It's a longstanding goal to have some sort of tool for 
explaining how scores interact in a given situation. However 
it's a challenging problem and there's never enough time ...

Basically, all scores are added together for each node, and 
the node with the highest score runs the resource, subject to 
any placement strategy configured. These mainly include 
stickiness, location constraints, colocation constraints, and 
node health. Nodes may be


And you omitted the promotion scores which was the main 
question.


Oh right -- first, the above is used to determine the nodes 
on which clone instances will be placed. After that, an 
appropriate number of nodes are selected for the promoted 
role, based on promotion scores and location and colocation 
constraints for the promoted role.



I am sorry but it does not really explain anything. Let's 
try concrete examples


a) master clone instance has location score -1000 for a 
node and promotion score 1000. Is this node eligible for 
promoting clone instance (assuming no other scores are 
present)?


b) promotion score is equal on two nodes A and B, but node 
A has better location score than node B. Is it guaranteed 
that clone will be promoted on A?




a real-life example:
...
Colocation Constraints:
  Started resource 'HA-10-1-1-253' with Promoted resource 'PGSQL-clone' (id: colocation-HA-10-1-1-253-PGSQL-clone-INFINITY)
    score=INFINITY
...
Order Constraints:
  promote resource 'PGSQL-clone' then start resource 'HA-10-1-1-253' (id: order-PGSQL-clone-HA-10-1-1-253-Mandatory)
    symmetrical=0 kind=Mandatory
  demote resource 'PGSQL-clone' then stop resource 'HA-10-1-1-253' (id: order-PGSQL-clone-HA-10-1-1-253-Mandatory-1)
    symmetrical=0 kind=Mandatory

I had to bump this one up to:
...
  resource 'PGSQL-clone' (id: location-PGSQL-clone)
    Rules:
      Rule: role=Promoted score=-1 (id: location-PGSQL-clone-rule)
        Expression: gateway-link ne 1 (id: location-PGSQL-clone-rule-expr)

'-1000' did not seem to be good enough, '-1' was just a "lucky" guess.


As earlier: I was able to 'move' the promoted instance and I 
think 'prefers' also worked.
I don't know if 'pgsql' would work with any other 
constraints, or if it would be safe to try.


many thanks, L.


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] newly created clone waits for off-lined node

2023-07-12 Thread lejeczek via Users

Hi guys.

I have a fresh new 'galera' clone and that one would not 
start & cluster says:

...
INFO: Waiting on node  to report database status 
before Master instances can start.

...

Is that only for newly created resources - which I guess it 
must be - and if so then why?
Naturally, next question would be - how to make such 
resource start in that very circumstance?


many thanks, L.

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] newly created clone waits for off-lined node

2023-07-17 Thread lejeczek via Users




On 13/07/2023 17:33, Ken Gaillot wrote:

On Wed, 2023-07-12 at 21:08 +0200, lejeczek via Users wrote:

Hi guys.

I have a fresh new 'galera' clone and that one would not start &
cluster says:
...
INFO: Waiting on node  to report database status before Master
instances can start.
...

Is that only for newly created resources - which I guess it must be -
and if so then why?
Naturally, next question would be - how to make such resource start
in that very circumstance?

many thank, L.

That is part of the agent rather than Pacemaker. Looking at the agent
code, it's based on a node attribute the agent sets, so it is only
empty for newly created resources that haven't yet run on a node. I'm
not sure if there's a way around it. (Anyone else have experience with
that?)
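(A hedged aside: the attribute in question is a transient node attribute, so something like the following at least shows which node the agent is still waiting on - the exact attribute name depends on the resource id:)

-> $ crm_mon -1 -A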

Would any expert/devel agree that this qualifies as a bug?
To add to my last message - mariadb starts okay outside of 
the cluster, but also if I manually change 
'safe_to_bootstrap' to '1' on either of the - in my case - two 
nodes, then the cluster will also start the clone OK.
But then... disable (the Galera cluster seems to get shut down 
OK) & enable the clone, and the clone fails to start with those 
errors as earlier - waiting for the off-lined node - and leaving 
the clone as 'Unpromoted'.

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] newly created clone waits for off-lined node

2023-07-13 Thread lejeczek via Users




On 13/07/2023 17:33, Ken Gaillot wrote:

On Wed, 2023-07-12 at 21:08 +0200, lejeczek via Users wrote:

Hi guys.

I have a fresh new 'galera' clone and that one would not start &
cluster says:
...
INFO: Waiting on node  to report database status before Master
instances can start.
...

Is that only for newly created resources - which I guess it must be -
and if so then why?
Naturally, next question would be - how to make such resource start
in that very circumstance?

many thank, L.

That is part of the agent rather than Pacemaker. Looking at the agent
code, it's based on a node attribute the agent sets, so it is only
empty for newly created resources that haven't yet run on a node. I'm
not sure if there's a way around it. (Anyone else have experience with
that?)
But this is a bit weird -- even with 'promoted-max' & 
'clone-max' set to exclude (& on top of that, ban the resource 
on) the off-lined node, the clone would not start -- no?
Perhaps, if not an obvious bug, I'd think this could/should 
be improved in the code.
If anybody wants to reproduce it - have a "regular" galera 
clone working okay, off-line a node, then remove & re-create the clone.

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] cluster okey but errors when tried to move resource - ?

2023-06-09 Thread lejeczek via Users



On 09/06/2023 09:04, Reid Wahl wrote:

On Thu, Jun 8, 2023 at 10:55 PM lejeczek via Users
 wrote:



On 09/06/2023 01:38, Reid Wahl wrote:

On Thu, Jun 8, 2023 at 2:24 PM lejeczek via Users  wrote:



Ouch.

Let's see the full output of the move command, with the whole CIB that
failed to validate.


For a while there I thought perhaps it was just that one
pgsql resource, but it seems that any - though only a few
are set up - (only promoted clone?) resource fails to move.
Perhaps it is primarily to do with 'pcs'.

-> $ pcs resource move REDIS-clone --promoted podnode3
Error: cannot move resource 'REDIS-clone'
  1 <cib validate-with="pacemaker-3.6" ...>

This is the problem: `validate-with="pacemaker-3.6"`. That old schema
doesn't support role="Promoted" in a location constraint. Support
begins with version 3.7 of the schema:
https://github.com/ClusterLabs/pacemaker/commit/e7f1424df49ac41b2d38b72af5ff9ad5121432d2.

You'll need at least Pacemaker 2.1.0.

I have:
corosynclib-3.1.7-1.el9.x86_64
corosync-3.1.7-1.el9.x86_64
pacemaker-schemas-2.1.6-2.el9.noarch
pacemaker-libs-2.1.6-2.el9.x86_64
pacemaker-cluster-libs-2.1.6-2.el9.x86_64
pacemaker-cli-2.1.6-2.el9.x86_64
pacemaker-2.1.6-2.el9.x86_64
pcs-0.11.5-2.el9.x86_64
pacemaker-remote-2.1.6-2.el9.x86_64
and the rest is CentOS 9, up to what is in the repos

If all your cluster nodes are at those versions, try `pcs cluster cib-upgrade`
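A quick way to check which schema the CIB is currently validated against, before and after the upgrade:

-> $ cibadmin --query | grep -o 'validate-with="[^"]*"'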


'cib-upgrade' might be a fix for this "issue" - but should 
it not happen at rpm update time?

Differently but still 'move' errored out:
-> $ pcs resource move REDIS-clone --promoted podnode2
Location constraint to move resource 'REDIS-clone' has been 
created

Waiting for the cluster to apply configuration changes...
Location constraint created to move resource 'REDIS-clone' 
has been removed

Waiting for the cluster to apply configuration changes...
Error: resource 'REDIS-clone' is promoted on node 
'podnode3'; unpromoted on nodes 'podnode1', 'podnode2'

Error: Errors have occurred, therefore pcs is unable to continue

Then the node with the promoted resource going into 'standby' or being 
rebooted - for the first time - seemed to "fix" 'move'.

-> $ pcs resource move REDIS-clone --promoted podnode2
Location constraint to move resource 'REDIS-clone' has been 
created

Waiting for the cluster to apply configuration changes...
Location constraint created to move resource 'REDIS-clone' 
has been removed

Waiting for the cluster to apply configuration changes...
resource 'REDIS-clone' is promoted on node 'podnode2'; 
unpromoted on nodes 'podnode1', 'podnode3'


quite puzzling, sharing in case others experience this/similar.
thanks, L.

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] cluster okey but errors when tried to move resource - ?

2023-06-08 Thread lejeczek via Users



On 09/06/2023 01:38, Reid Wahl wrote:

On Thu, Jun 8, 2023 at 2:24 PM lejeczek via Users  wrote:




Ouch.

Let's see the full output of the move command, with the whole CIB that
failed to validate.


For a while there I thought perhaps it was just that one
pgsql resource, but it seems that any - though only a few
are set up - (only promoted clone?) resource fails to move.
Perhaps it is primarily to do with 'pcs'.

-> $ pcs resource move REDIS-clone --promoted podnode3
Error: cannot move resource 'REDIS-clone'
 1 <cib validate-with="pacemaker-3.6" ...>

This is the problem: `validate-with="pacemaker-3.6"`. That old schema
doesn't support role="Promoted" in a location constraint. Support
begins with version 3.7 of the schema:
https://github.com/ClusterLabs/pacemaker/commit/e7f1424df49ac41b2d38b72af5ff9ad5121432d2.

You'll need at least Pacemaker 2.1.0.

I have:
corosynclib-3.1.7-1.el9.x86_64
corosync-3.1.7-1.el9.x86_64
pacemaker-schemas-2.1.6-2.el9.noarch
pacemaker-libs-2.1.6-2.el9.x86_64
pacemaker-cluster-libs-2.1.6-2.el9.x86_64
pacemaker-cli-2.1.6-2.el9.x86_64
pacemaker-2.1.6-2.el9.x86_64
pcs-0.11.5-2.el9.x86_64
pacemaker-remote-2.1.6-2.el9.x86_64
and the reset is Centos 9 up-to-what-is-in-repos



   [lines 2-216 of the quoted CIB dump followed here; the XML was stripped by the list archive]
 

Re: [ClusterLabs] cluster okey but errors when tried to move resource - ?

2023-06-08 Thread lejeczek via Users




Ouch.

Let's see the full output of the move command, with the whole CIB that
failed to validate.

For a while there I thought perhaps it was just that one 
pgsql resource, but it seems that any - though only a few 
are set up - (only promoted clone?) resource fails to move.

Perhaps it is primarily to do with 'pcs'.

-> $ pcs resource move REDIS-clone --promoted podnode3
Error: cannot move resource 'REDIS-clone'
   1 <cib validate-with="pacemaker-3.6" epoch="8212" num_updates="0" admin_epoch="0" cib-last-written="Thu Jun  8 21:59:53 2023" update-origin="podnode1" update-client="crm_attribute" have-quorum="1" update-user="root" dc-uuid="1">
   [lines 2-91 of the CIB dump - cluster properties, node attributes and the HA-10-1-1-226, HA-10-3-1-226 and REDIS resource definitions - were mangled by the list archive (element tags stripped) and are omitted]
  

Re: [ClusterLabs] cluster okey but errors when tried to move resource - ?

2023-06-05 Thread lejeczek via Users




On 05/06/2023 16:23, Ken Gaillot wrote:

On Sat, 2023-06-03 at 15:09 +0200, lejeczek via Users wrote:

Hi guys.

I've something which I'm new to entirely - cluster which is seemingly
okey errors, fails to move a resource.

What pcs version are you using? I believe there was a move regression
in a recent push.


I'd won't contaminate here just yet with long json cluster spits when
fails but a snippet:

-> $ pcs resource move PGSQL-clone --promoted podnode1
Error: cannot move resource 'PGSQL-clone'
   [a 17-line CIB snippet followed here, but the XML was stripped by the list archive]
...
crm_resource: Error performing operation: Invalid configuration

This one line: (might be more)

puzzles me, as there is no such node/member in the cluster and so I
try:

That's not a problem. Pacemaker allows "custom" values in both cluster
options and resource/action meta-attributes. I don't know whether redis
is actually using that or not.


-> $ pcs property unset redis_REPL_INFO --force
Warning: Cannot remove property 'redis_REPL_INFO', it is not present
in property set 'cib-bootstrap-options'

That's because the custom options are in their own
cluster_property_set. I believe pcs can only manage the options in the
cluster_property_set with id="cib-bootstrap-options", so you'd have to
use "pcs cluster edit" or crm_attribute to remove the custom ones.
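A hedged example of removing the stale value with crm_attribute - the set name is whatever id shows up around that nvpair in `pcs cluster cib` (often something like 'redis_replication'):

-> $ crm_attribute --type crm_config --set-name redis_replication --name redis_REPL_INFO --delete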


Any & all suggestions on how to fix this are much appreciated.
many thanks, L.
Well, that would make sense - those must have gone in & 
through to the repos' rpms - as I suspected, those broke after 
recent 'dnf' updates.
I'll fiddle with 'downgrades' later on - but if true, then 
rather critical, no?

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] cluster okey but errors when tried to move resource - ?

2023-06-06 Thread lejeczek via Users




On 05/06/2023 16:23, Ken Gaillot wrote:

On Sat, 2023-06-03 at 15:09 +0200, lejeczek via Users wrote:

Hi guys.

I've something which I'm new to entirely - cluster which is seemingly
okey errors, fails to move a resource.

What pcs version are you using? I believe there was a move regression
in a recent push.


I'd won't contaminate here just yet with long json cluster spits when
fails but a snippet:

-> $ pcs resource move PGSQL-clone --promoted podnode1
Error: cannot move resource 'PGSQL-clone'
   [a 17-line CIB snippet followed here, but the XML was stripped by the list archive]
...
crm_resource: Error performing operation: Invalid configuration

This one line: (might be more)

puzzles me, as there is no such node/member in the cluster and so I
try:

That's not a problem. Pacemaker allows "custom" values in both cluster
options and resource/action meta-attributes. I don't know whether redis
is actually using that or not.


-> $ pcs property unset redis_REPL_INFO --force
Warning: Cannot remove property 'redis_REPL_INFO', it is not present
in property set 'cib-bootstrap-options'

That's because the custom options are in their own
cluster_property_set. I believe pcs can only manage the options in the
cluster_property_set with id="cib-bootstrap-options", so you'd have to
use "pcs cluster edit" or crm_attribute to remove the custom ones.


Any & all suggestions on how to fix this are much appreciated.
many thanks, L.

I've downgraded back to:
pacemaker-2.1.6-1.el9.x86_64
pcs-0.11.4-7.el9.x86_64
but it's either not enough - if the bugs are in those, that is - 
or the issues are somewhere else, for 'move' still fails the same:

...
crm_resource: Error performing operation: Invalid configuration


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] cluster okey but errors when tried to move resource - ?

2023-06-03 Thread lejeczek via Users

Hi guys.

I've got something which is entirely new to me - a cluster which 
is seemingly okay yet errors out, failing to move a resource.
I won't contaminate this just yet with the long JSON the cluster 
spits out when it fails, but here is a snippet:


-> $ pcs resource move PGSQL-clone --promoted podnode1
Error: cannot move resource 'PGSQL-clone'
   1 <cib validate-with="pacemaker-3.6" epoch="8109" num_updates="0" admin_epoch="0" cib-last-written="Sat Jun  3 13:49:34 2023" update-origin="podnode2" update-client="cibadmin" have-quorum="1" update-user="root" dc-uuid="2">
   2-4   [element tags stripped by the list archive]
   5   ... id="cib-bootstrap-options-have-watchdog" name="have-watchdog" value="false"/>
   6   ... name="dc-version" value="2.1.6-1.el9-802a72226be"/>
   7   ... id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="corosync"/>
   8   ... id="cib-bootstrap-options-cluster-name" name="cluster-name" value="podnodes"/>
   9   ... id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="false"/>
  10   ... id="cib-bootstrap-options-last-lrm-refresh" name="last-lrm-refresh" value="1683293193"/>
  11   ... id="cib-bootstrap-options-maintenance-mode" name="maintenance-mode" value="false"/>
  12-13 [element tags stripped]
  14   ... name="redis_REPL_INFO" value="c8kubernode1"/>
  15   ... name="REDIS_REPL_INFO" value="podnode3"/>
  16-17 [element tags stripped]
...
crm_resource: Error performing operation: Invalid configuration

This one line: (might be more)
name="redis_REPL_INFO" value="c8kubernode1"/>
puzzles me, as there is no such node/member in the cluster 
and so I try:


-> $ pcs property unset redis_REPL_INFO --force
Warning: Cannot remove property 'redis_REPL_INFO', it is not 
present in property set 'cib-bootstrap-options'


Any & all suggestions on how to fix this are much appreciated.
many thanks, L.___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] last man - issue

2023-07-22 Thread lejeczek via Users

Hi guys.

That below should work, right?

-> $ pcs quorum update last_man_standing=1 --skip-offline
Checking corosync is not running on nodes...
Warning: Unable to connect to dzien (Failed to connect to 
dzien port 2224: No route to host)

Warning: dzien: Unable to check if corosync is not running
Error: swir: corosync is running
Error: whale: corosync is running
Error: Errors have occurred, therefore pcs is unable to continue

if it should, why does it not or what else am I missing?
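(A sketch of the usual sequence - pcs only changes quorum options while corosync is stopped on every node, which is what the 'corosync is running' errors point at; the unreachable node would still have to be dealt with:)

-> $ pcs cluster stop --all
-> $ pcs quorum update last_man_standing=1
-> $ pcs cluster start --all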

many thanks, L
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] silence resource ? - PGSQL

2023-06-28 Thread lejeczek via Users

Hi guys.

Having 'pgsql' set up in what I'd say is a vanilla-default 
config, pacemaker's journal log is flooded with:

...
pam_unix(runuser:session): session closed for user postgres
pam_unix(runuser:session): session opened for user 
postgres(uid=26) by (uid=0)

pam_unix(runuser:session): session closed for user postgres
pam_unix(runuser:session): session opened for user 
postgres(uid=26) by (uid=0)

pam_unix(runuser:session): session closed for user postgres
pam_unix(runuser:session): session opened for user 
postgres(uid=26) by (uid=0)

pam_unix(runuser:session): session closed for user postgres
...

Would you have a working fix or even a suggestion on how to 
silence those?


many thanks, L.___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] OCF_HEARTBEAT_PGSQL - any good with current Postgres- ?

2023-05-05 Thread lejeczek via Users




On 05/05/2023 10:41, Jehan-Guillaume de Rorthais wrote:

On Fri, 5 May 2023 10:08:17 +0200
lejeczek via Users  wrote:


On 25/04/2023 14:16, Jehan-Guillaume de Rorthais wrote:

Hi,

On Mon, 24 Apr 2023 12:32:45 +0200
lejeczek via Users  wrote:
  

I've been looking up and fiddling with this RA but
unsuccessfully so far, that I wonder - is it good for
current versions of pgSQLs?

As far as I know, the pgsql agent is still supported, last commit on it
happen in Jan 11th 2023. I don't know about its compatibility with latest
PostgreSQL versions.

I've been testing it many years ago, I just remember it was quite hard to
setup, understand and manage from the maintenance point of view.

Also, this agent is fine in a shared storage setup where it only
start/stop/monitor the instance, without paying attention to its role
(promoted or not).
  

It's not only that it's hard - which is purely due to
piss-poor man page in my opinion - but it really sounds
"expired".

I really don't know. My feeling is that the manpage might be expired, which
really doesn't help with this agent, but not the RA itself.


Eg. man page speaks of 'recovery.conf' which - as I
understand it - newer/current versions of pgSQL do not! even
use... which makes one wonder.

This has been fixed in late 2019, but with no documentation associated :/
See:
https://github.com/ClusterLabs/resource-agents/commit/a43075be72683e1d4ddab700ec16d667164d359c

Regards,


Right.. as if going to the code is the very first thing the 
rest of us admins/users do :)
The RA/resource seems to work, but I sincerely urge the 
programmer(s)/author(s) responsible to update the man pages so 
they reflect the state of affairs as of today - there is 
nothing more discouraging, to the rest of us 
sysadmins/end-users, than man pages which read like a stingy, 
conservative programmer's note.
What the setup of this RA does is misleading in parts - it 
still creates part of the config in a file called 'recovery.conf'


on that note - does this RA not require a way too open 
pgSQL setup, meaning not very secure? Would anybody know?
I cannot see - again, from the regular man pages and not the 
code - how replication could be locked down, by not using the 
user 'postgres' itself and then perhaps adding authentication 
with passwords, at least.
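
(In plain PostgreSQL terms - outside of anything the RA 
documents - I would expect the usual recipe to apply: a 
dedicated replication role plus scram auth; names and subnet 
below are made up:

-- on the primary:
CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'secret';

# pg_hba.conf on every node:
host  replication  replicator  10.1.1.0/24  scram-sha-256

plus a matching ~postgres/.pgpass (or primary_conninfo) entry 
on the standbys - whether the RA then picks that up cleanly 
is exactly what the man page does not tell me.)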


many thanks, L.

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] ocf_heartbeat_pgsql - lock file

2023-05-07 Thread lejeczek via Users

Hi guys.

I have a resource seemingly running OK, but when a node 
gets rebooted the cluster then finds itself unable to 
start the resource on it.


Failed Resource Actions:
  * PGSQL start on podnode3 returned 'error' (My data may 
be inconsistent. You have to remove 
/var/lib/pgsql/tmp/PGSQL.lock file to force start.) at Sun 
May  7 11:48:43 2023 after 121ms


and indeed, with manual intervention - after removal of that 
file - the cluster seems happy to rejoin the node into the 
pgsql cluster.
Is that intentional, by design, and either way - why does it 
happen?
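
(the lock file is at least easy enough to deal with once you 
know it's there; what I do by hand - nothing exotic, just:

-> $ rm /var/lib/pgsql/tmp/PGSQL.lock
-> $ pcs resource cleanup PGSQL

I just wish the agent said when it considers that safe to do.)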


many thanks, L.___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] 99-VirtualDomain-libvirt.conf under control - ?

2023-05-05 Thread lejeczek via Users



On 05/05/2023 10:08, Andrei Borzenkov wrote:

On Fri, May 5, 2023 at 11:03 AM lejeczek via Users
 wrote:



On 29/04/2023 21:02, Reid Wahl wrote:

On Sat, Apr 29, 2023 at 3:34 AM lejeczek via Users
 wrote:

Hi guys.

I presume these are a consequence of having a resource of VirtualDomain type set up (& 
enabled) - but where, how can users control presence & content of those?

Yep: 
https://github.com/ClusterLabs/resource-agents/blob/v4.12.0/heartbeat/VirtualDomain#L674-L680

You can't really control the content, since it's set by the resource
agent. (You could change it after creation but that defeats the
purpose.) However, you can view it at
/run/systemd/system/resource-agents-deps.target.d/libvirt.conf.

You can see the systemd_drop_in definition here:
https://github.com/ClusterLabs/resource-agents/blob/v4.12.0/heartbeat/ocf-shellfuncs.in#L654-L673


I wonder how much of an impact those bits have on the
cluster(?)
Take '99-VirtualDomain-libvirt.conf' - that one poses
questions: with c9s, 'libvirtd.service' is not really used, or
should not be, since the new modular daemon approach is introduced there.
So, with 'resource-agents' having:
After=libvirtd.service
and users not being able to manage those bits - is that not
asking for trouble?


it does no harm (missing units are simply ignored) but it certainly
does not do anything useful either. OTOH modular approach is also
optional, so you could still use monolithic libvirtd on cluster nodes.
So it is more a documentation issue.
Not sure what you mean by 'missing unit' - the unit is there, 
only it is not used, it is disabled. What does 
'resource-agents-deps' do with that?
I don't suppose upstream, redhat & others made that effort - 
those changes, with the suggestions to us consumers - only to 
go back to the "old" stuff.
I'd suggest, if devel/contributors read here - and I'd 
imagine other users would reckon as well - to enhance RAs, 
certainly VirtualDomain, with a parameter/attribute with 
which users could, at least to a certain extent, control those 
"outside of cluster" dependencies.


thanks, L.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] 99-VirtualDomain-libvirt.conf under control - ?

2023-05-05 Thread lejeczek via Users



On 29/04/2023 21:02, Reid Wahl wrote:

On Sat, Apr 29, 2023 at 3:34 AM lejeczek via Users
 wrote:

Hi guys.

I presume these are a consequence of having a resource of VirtualDomain type set up (& 
enabled) - but where, how can users control presence & content of those?

Yep: 
https://github.com/ClusterLabs/resource-agents/blob/v4.12.0/heartbeat/VirtualDomain#L674-L680

You can't really control the content, since it's set by the resource
agent. (You could change it after creation but that defeats the
purpose.) However, you can view it at
/run/systemd/system/resource-agents-deps.target.d/libvirt.conf.

You can see the systemd_drop_in definition here:
https://github.com/ClusterLabs/resource-agents/blob/v4.12.0/heartbeat/ocf-shellfuncs.in#L654-L673

I wonder how much of an impact those bits have on the 
cluster(?)
Take '99-VirtualDomain-libvirt.conf' - that one poses 
questions: with c9s, 'libvirtd.service' is not really used, or 
should not be, since the new modular daemon approach is introduced there.

So, with 'resource-agents' having:
After=libvirtd.service
and users not being able to manage those bits - is that not 
asking for trouble?


thanks, L.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] OCF_HEARTBEAT_PGSQL - any good with current Postgres- ?

2023-05-05 Thread lejeczek via Users




On 25/04/2023 14:16, Jehan-Guillaume de Rorthais wrote:

Hi,

On Mon, 24 Apr 2023 12:32:45 +0200
lejeczek via Users  wrote:


I've been looking up and fiddling with this RA but
unsuccessfully so far, that I wonder - is it good for
current versions of pgSQLs?

As far as I know, the pgsql agent is still supported, last commit on it happen
in Jan 11th 2023. I don't know about its compatibility with latest PostgreSQL
versions.

I've been testing it many years ago, I just remember it was quite hard to
setup, understand and manage from the maintenance point of view.

Also, this agent is fine in a shared storage setup where it only
start/stop/monitor the instance, without paying attention to its role (promoted
or not).

It's not only that it's hard - which is purely due to 
piss-poor man page in my opinion - but it really sounds 
"expired".
Eg. man page speaks of 'recovery.conf' which - as I 
understand it - newer/current versions of pgSQL do not! even 
use... which makes one wonder.


thanks, L.

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] colocation constraint - do I get it all wrong?

2024-02-05 Thread lejeczek via Users




On 01/01/2024 18:28, Ken Gaillot wrote:

On Fri, 2023-12-22 at 17:02 +0100, lejeczek via Users wrote:

hi guys.

I have a colocation constraint:

-> $ pcs constraint ref DHCPD
Resource: DHCPD
   colocation-DHCPD-GATEWAY-NM-link-INFINITY

and the trouble is... I thought DHCPD is to follow GATEWAY-NM-link,
always!
If that is true, then I see very strange behavior, namely:
when there is an issue with the DHCPD resource and it cannot be started, then
GATEWAY-NM-link gets tossed around by the cluster.

Is that normal & expected - is my understanding of _colocation_
completely wrong - or my cluster is indeed "broken"?
many thanks, L.


Pacemaker considers the preferences of colocated resources when
assigning a resource to a node, to ensure that as many resources as
possible can run. So if a colocated resource becomes unable to run on a
node, the primary resource might move to allow the colocated resource
to run.
So what is the way to "fix" this - is it simply a lower score 
for such a constraint?
In my case _dhcpd_ is important but it fails sometimes, as 
it's often tampered with, so... make _dhcpd_ follow 
_gateway_link_ but just let _dhcpd_ fail (if it keeps failing) 
and leave _gateway_link_ alone if/where it's good.
Or perhaps there is a global config/param for whole-cluster 
behaviour?
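
(If a lower score is indeed the answer, I suppose it would 
look something like this - the score picked arbitrarily:

-> $ pcs constraint delete colocation-DHCPD-GATEWAY-NM-link-INFINITY
-> $ pcs constraint colocation add DHCPD with GATEWAY-NM-link 500

so DHCPD still prefers to sit with GATEWAY-NM-link, but a 
failing DHCPD should no longer be able to drag 
GATEWAY-NM-link off a healthy node.)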


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] trigger something at ?

2024-02-08 Thread lejeczek via Users



On 31/01/2024 16:37, lejeczek via Users wrote:



On 31/01/2024 16:06, Jehan-Guillaume de Rorthais wrote:

On Wed, 31 Jan 2024 16:02:12 +0100
lejeczek via Users  wrote:



On 29/01/2024 17:22, Ken Gaillot wrote:
On Fri, 2024-01-26 at 13:55 +0100, lejeczek via Users 
wrote:

Hi guys.

Is it possible to trigger some... action - I'm 
thinking specifically

at shutdown/start.
If not within the cluster then - if you do that - 
perhaps outside.
I would like to create/remove constraints, when 
cluster starts &

stops, respectively.

many thanks, L.

You could use node status alerts for that, but it's 
risky for alert
agents to change the configuration (since that may 
result in more

alerts and potentially some sort of infinite loop).

Pacemaker has no concept of a full cluster start/stop, 
only node
start/stop. You could approximate that by checking 
whether the node

receiving the alert is the only active node.

Another possibility would be to write a resource agent 
that does what
you want and order everything else after it. However 
it's even more

risky for a resource agent to modify the configuration.

Finally you could write a systemd unit to do what you 
want and order it

after pacemaker.

What's wrong with leaving the constraints permanently 
configured?

yes, that would be for a node start/stop
I struggle with using constraints to move pgsql (PAF) 
master

onto a given node - seems that co/locating paf's master
results in troubles (replication breaks) at/after node
shutdown/reboot (not always, but way too often)
What? What's wrong with colocating PAF's masters exactly? 
How does it break any

replication? What's these constraints you are dealing with?

Could you share your configuration?
Constraints beyond/above of what is required by PAF agent 
itself, say...
you have multiple pgSQL cluster with PAF - thus multiple 
(separate, for each pgSQL cluster) masters and you want to 
spread/balance those across HA cluster
(or in other words - avoid having more that 1 pgsql master 
per HA node)
These below, I've tried, those move the master onto chosen 
node but.. then the issues I mentioned.


-> $ pcs constraint location PGSQL-PAF-5438-clone prefers 
ubusrv1=1002

or
-> $ pcs constraint colocation set PGSQL-PAF-5435-clone 
PGSQL-PAF-5434-clone PGSQL-PAF-5433-clone role=Master 
require-all=false setoptions score=-1000


Wanted to share an observation - not a measurement of 
anything, I did not take any - about a different, latest pgSQL 
version which I put in place of version 14, which I had been 
using all this time.
(also with that upgrade - from Postgres' own repos - came an 
update of PAF)
So, with pgSQL ver. 16 and everything else the same - 
the paf/pgSQL resources now behave a lot, lot better, surviving 
just fine all those cases - with (!) the extra constraints, of 
course - where previously they had replication failures.

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] clone_op_key pcmk__notify_key - Triggered fatal assertion

2024-02-17 Thread lejeczek via Users

Hi guys.

Everything seems to be working a ok yet pacemakers logs
...
 error: clone_op_key: Triggered fatal assertion at 
pcmk_graph_producer.c:207 : (n_type != NULL) && (n_task != NULL)
 error: pcmk__notify_key: Triggered fatal assertion at 
actions.c:187 : op_type != NULL
 error: clone_op_key: Triggered fatal assertion at 
pcmk_graph_producer.c:207 : (n_type != NULL) && (n_task != NULL)
 error: pcmk__notify_key: Triggered fatal assertion at 
actions.c:187 : op_type != NULL

...
 error: pcmk__create_history_xml: Triggered fatal assertion 
at pcmk_sched_actions.c:1163 : n_type != NULL
 error: pcmk__create_history_xml: Triggered fatal assertion 
at pcmk_sched_actions.c:1164 : n_task != NULL
 error: pcmk__notify_key: Triggered fatal assertion at 
actions.c:187 : op_type != NULL

...

Looks critical - is it? Would you know?
many thanks, L.___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] pcsd web interface not working on EL 9.3

2024-02-19 Thread lejeczek via Users




On 19/02/2024 09:06, Strahil Nikolov via Users wrote:

Hi All,

Is there a specific setup I missed in order to setup the 
web interface ?


Usually, you just login with the hacluster user on 
https://fqdn:2224 but when I do a curl, I get an empty 
response.


Best Regards,
Strahil Nikolov

___
Was it giving out some stuff before? I've never done curl on 
it - won't guess why you'd do that.
Yes, it is _empty_ for me too - though I use it with a reverse 
proxy.

While the _main_ URL returns no content to curl, this one does:
-> $ curl https://pcs-ha.mine.priv/ui/

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] colocation constraint - do I get it all wrong?

2023-12-22 Thread lejeczek via Users

hi guys.

I have a colocation constraint:

-> $ pcs constraint ref DHCPD
Resource: DHCPD
  colocation-DHCPD-GATEWAY-NM-link-INFINITY

and the trouble is... I thought DHCPD is to follow 
GATEWAY-NM-link, always!

If that is true, then I see very strange behavior, namely:
when there is an issue with the DHCPD resource and it cannot be 
started, then GATEWAY-NM-link gets tossed around by the cluster.


Is that normal & expected - is my understanding of 
_colocation_ completely wrong - or my cluster is indeed 
"broken"?

many thanks, L.___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] colocate Redis - weird

2023-12-19 Thread lejeczek via Users

hi guys,

Is this below not the weirdest thing?

-> $ pcs constraint ref PGSQL-PAF-5435
Resource: PGSQL-PAF-5435
  colocation-HA-10-1-1-84-PGSQL-PAF-5435-clone-INFINITY
  colocation-REDIS-6385-clone-PGSQL-PAF-5435-clone-INFINITY
  order-PGSQL-PAF-5435-clone-HA-10-1-1-84-Mandatory
  order-PGSQL-PAF-5435-clone-HA-10-1-1-84-Mandatory-1
  colocation_set_PePePe

Here the Redis master should follow the pgSQL master,
with such a constraint:

-> $ pcs resource status PGSQL-PAF-5435
  * Clone Set: PGSQL-PAF-5435-clone [PGSQL-PAF-5435] 
(promotable):

    * Promoted: [ ubusrv1 ]
    * Unpromoted: [ ubusrv2 ubusrv3 ]
-> $ pcs resource status REDIS-6385-clone
  * Clone Set: REDIS-6385-clone [REDIS-6385] (promotable):
    * Unpromoted: [ ubusrv1 ubusrv2 ubusrv3 ]

If I remove that constrain:
-> $ pcs constraint delete 
colocation-REDIS-6385-clone-PGSQL-PAF-5435-clone-INFINITY

-> $ pcs resource status REDIS-6385-clone
  * Clone Set: REDIS-6385-clone [REDIS-6385] (promotable):
    * Promoted: [ ubusrv1 ]
    * Unpromoted: [ ubusrv2 ubusrv3 ]

and ! I can manually move Redis master around, master moves 
to each server just fine.

I again, add that constraint:

-> $ pcs constraint colocation add master REDIS-6385-clone 
with master PGSQL-PAF-5435-clone


and the same...

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] colocate Redis - weird

2023-12-20 Thread lejeczek via Users



On 19/12/2023 19:13, lejeczek via Users wrote:

hi guys,

Is this below not the weirdest thing?

-> $ pcs constraint ref PGSQL-PAF-5435
Resource: PGSQL-PAF-5435
  colocation-HA-10-1-1-84-PGSQL-PAF-5435-clone-INFINITY
  colocation-REDIS-6385-clone-PGSQL-PAF-5435-clone-INFINITY
  order-PGSQL-PAF-5435-clone-HA-10-1-1-84-Mandatory
  order-PGSQL-PAF-5435-clone-HA-10-1-1-84-Mandatory-1
  colocation_set_PePePe

Here the Redis master should follow the pgSQL master,
with such a constraint:

-> $ pcs resource status PGSQL-PAF-5435
  * Clone Set: PGSQL-PAF-5435-clone [PGSQL-PAF-5435] 
(promotable):

    * Promoted: [ ubusrv1 ]
    * Unpromoted: [ ubusrv2 ubusrv3 ]
-> $ pcs resource status REDIS-6385-clone
  * Clone Set: REDIS-6385-clone [REDIS-6385] (promotable):
    * Unpromoted: [ ubusrv1 ubusrv2 ubusrv3 ]

If I remove that constrain:
-> $ pcs constraint delete 
colocation-REDIS-6385-clone-PGSQL-PAF-5435-clone-INFINITY

-> $ pcs resource status REDIS-6385-clone
  * Clone Set: REDIS-6385-clone [REDIS-6385] (promotable):
    * Promoted: [ ubusrv1 ]
    * Unpromoted: [ ubusrv2 ubusrv3 ]

and ! I can manually move Redis master around, master 
moves to each server just fine.

I again, add that constraint:

-> $ pcs constraint colocation add master REDIS-6385-clone 
with master PGSQL-PAF-5435-clone


and the same...


What might there be about that one node? - resource removed, 
created anew, and the cluster insists on keeping the master there.
I can manually move the master anywhere, but if I _clear_ the 
resource - no constraints - then the cluster moves it back to 
the same node.


I wonder about: a) "transient" node attrs & b) whether this 
cluster is somewhat broken.
On a) - can we read more about those somewhere? (not the 
code/internals)
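
(what I have found so far for peeking at them - stock 
pacemaker tooling as far as I know, attribute/node names 
taken from my own setup:

-> $ crm_mon -A1    # the "Node Attributes" section includes the transient ones
-> $ attrd_updater --query --name master-REDIS-6385 --node ubusrv1

they live in the status section of the CIB, so they are gone 
once a node leaves the cluster.)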

thanks, L.___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] resource-agents and VMs

2023-12-15 Thread lejeczek via Users

Hi guys.

my resource-agents-deps.target dependencies look like so:

resource-agents-deps.target
○ ├─00\\x2dVMsy.mount
● └─virt-guest-shutdown.target

when I reboot a node, VMs seem to migrate off it live OK, 
but..
when the node comes back up after the reboot, VMs fail to 
migrate back to it live.

I see on such node

-> $ journalctl -lf -o cat -u virtqemud.service
Starting Virtualization qemu daemon...
Started Virtualization qemu daemon.
libvirt version: 9.5.0, package: 6.el9 (buil...@centos.org, 
2023-08-25-08:53:56, )

hostname: dzien.mine.priv
Path '/00-VMsy/enc.podnode3.qcow2' is not accessible: No 
such file or directory


and I wonder if it's indeed the fact that the _path_ is 
still absent at the moment the cluster - just after node start 
- tries to migrate the VM resource...
Is it possible to somehow - which my 
_resource-agents-deps.target_ seemingly does not do - assure 
that the cluster, perhaps on a per-resource basis, will wait 
for/check a path first?
BTW, that path is available and is made sure to be available 
to the system - it's a glusterfs mount.
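
(one thing I'm going to try is telling systemd explicitly 
that the target needs the mount - a sketch, the path being 
mine of course:

-> $ systemctl edit resource-agents-deps.target
# add:
[Unit]
RequiresMountsFor=/00-VMsy

which should keep pacemaker from starting - and so from 
probing/migrating VMs - before the glusterfs mount is 
actually there.)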


many thanks, L

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] replica vol when a peer is lost/offed & qcow2 ?

2024-01-17 Thread lejeczek via Users

Hi guys.

I wonder if you might have any tips/tweaks for the 
volume/cluster to make it more resilient - more accommodating 
- to qcow2 files, but specifically when a peer is lost or missing?
I have a 3-peer cluster/volume: 2 + 1 arbiter, & my experience 
is such that when all is good then.. well, all is good, but...
when one peer is lost - even if it's the arbiter - then 
VMs begin to suffer from, & to report, filesystem errors.
Perhaps it's not the volume's properties/config but the whole 
cluster? Or perhaps a 3-peer vol is not good for things such 
as qcow2s?
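
(for what it's worth, the one volume-side tweak I keep seeing 
recommended for VM images is gluster's stock 'virt' profile 
plus checking the quorum options - a sketch, the volume name 
is just my guess:

-> $ gluster volume set VMsy group virt
-> $ gluster volume get VMsy cluster.quorum-type

no idea yet whether that alone saves the qcow2s when the 
arbiter drops out, but it at least applies the options 
gluster itself ships for this workload.)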


many thanks, L.___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] make promoted follow promoted resource ?

2023-11-26 Thread lejeczek via Users



On 26/11/2023 10:32, lejeczek via Users wrote:

Hi guys.

With these:

-> $ pcs resource status REDIS-6381-clone
  * Clone Set: REDIS-6381-clone [REDIS-6381] (promotable):
    * Promoted: [ ubusrv2 ]
    * Unpromoted: [ ubusrv1 ubusrv3 ]

-> $ pcs resource status PGSQL-PAF-5433-clone
  * Clone Set: PGSQL-PAF-5433-clone [PGSQL-PAF-5433] 
(promotable):

    * Promoted: [ ubusrv1 ]
    * Unpromoted: [ ubusrv2 ubusrv3 ]

-> $ pcs constraint ref REDIS-6381-clone
Resource: REDIS-6381-clone
  colocation-REDIS-6381-clone-PGSQL-PAF-5433-clone-INFINITY

basically promoted Redis should follow promoted pgSQL but 
it's not happening, usually it does.
I presume pcs/cluster does something internally which 
results in disobeying/ignoring that _colocation_ 
constraint for these resources.

I presume scoring might play a role:
  REDIS-6385-clone with PGSQL-PAF-5435-clone (score:1001) 
(rsc-role:Master) (with-rsc-role:Master)

but usually, that scoring works, only "now" it does not.
Any comments I appreciate much.
thanks, L.

I looked at the pacemaker log - snippet below after 
REDIS-6381-clone re-enabled - but cannot see explanation 
for this.

...
 notice: Calculated transition 110, saving inputs in 
/var/lib/pacemaker/pengine/pe-input-3729.bz2
 notice: Transition 110 (Complete=0, Pending=0, Fired=0, 
Skipped=0, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-input-3729.bz2): Complete

 notice: State transition S_TRANSITION_ENGINE -> S_IDLE
 notice: State transition S_IDLE -> S_POLICY_ENGINE
 notice: Actions: Start  REDIS-6381:0 
(    ubusrv2 )
 notice: Actions: Start  REDIS-6381:1 
(    ubusrv3 )
 notice: Actions: Start  REDIS-6381:2 
(    ubusrv1 )
 notice: Calculated transition 111, saving inputs in 
/var/lib/pacemaker/pengine/pe-input-3730.bz2
 notice: Initiating start operation REDIS-6381_start_0 
locally on ubusrv2
 notice: Requesting local execution of start operation for 
REDIS-6381 on ubusrv2

(to redis) root on none
pam_unix(su:session): session opened for user 
redis(uid=127) by (uid=0)
pam_sss(su:session): Request to sssd failed. Connection 
refused

pam_unix(su:session): session closed for user redis
pam_sss(su:session): Request to sssd failed. Connection 
refused

 notice: Setting master-REDIS-6381[ubusrv2]: (unset) -> 1000
 notice: Transition 111 aborted by 
status-2-master-REDIS-6381 doing create 
master-REDIS-6381=1000: Transient attribute change

INFO: demote: Setting master to 'no-such-master'
 notice: Result of start operation for REDIS-6381 on 
ubusrv2: ok
 notice: Transition 111 (Complete=4, Pending=0, Fired=0, 
Skipped=1, Incomplete=14, 
Source=/var/lib/pacemaker/pengine/pe-input-3730.bz2): Stopped
 notice: Actions: Promote    REDIS-6381:0 ( 
Unpromoted -> Promoted ubusrv2 )
 notice: Actions: Start  REDIS-6381:1 
(    ubusrv1 )
 notice: Actions: Start  REDIS-6381:2 
(    ubusrv3 )
 notice: Calculated transition 112, saving inputs in 
/var/lib/pacemaker/pengine/pe-input-3731.bz2
 notice: Initiating notify operation 
REDIS-6381_pre_notify_start_0 locally on ubusrv2
 notice: Requesting local execution of notify operation 
for REDIS-6381 on ubusrv2
 notice: Result of notify operation for REDIS-6381 on 
ubusrv2: ok
 notice: Initiating start operation REDIS-6381_start_0 on 
ubusrv1
 notice: Initiating start operation REDIS-6381:2_start_0 
on ubusrv3
 notice: Initiating notify operation 
REDIS-6381_post_notify_start_0 locally on ubusrv2
 notice: Requesting local execution of notify operation 
for REDIS-6381 on ubusrv2
 notice: Initiating notify operation 
REDIS-6381_post_notify_start_0 on ubusrv1
 notice: Initiating notify operation 
REDIS-6381:2_post_notify_start_0 on ubusrv3
 notice: Result of notify operation for REDIS-6381 on 
ubusrv2: ok
 notice: Initiating notify operation 
REDIS-6381_pre_notify_promote_0 locally on ubusrv2
 notice: Requesting local execution of notify operation 
for REDIS-6381 on ubusrv2
 notice: Initiating notify operation 
REDIS-6381_pre_notify_promote_0 on ubusrv1
 notice: Initiating notify operation 
REDIS-6381:2_pre_notify_promote_0 on ubusrv3
 notice: Result of notify operation for REDIS-6381 on 
ubusrv2: ok
 notice: Initiating promote operation REDIS-6381_promote_0 
locally on ubusrv2
 notice: Requesting local execution of promote operation 
for REDIS-6381 on ubusrv2
 notice: Result of promote operation for REDIS-6381 on 
ubusrv2: ok
 notice: Initiating notify operation 
REDIS-6381_post_notify_promote_0 locally on ubusrv2
 notice: Requesting local execution of notify operation 
for REDIS-6381 on ubusrv2
 notice: Initiating notify operation 
REDIS-6381_post_notify_promote_0 on ubusrv1
 notice: Initiating notify operation 
REDIS-6381:2_post_notify_promote_0 on ubusrv3
 notice: Result of notify operation for REDIS-6381 on 
ubusrv2: ok

 notice: Setting master-REDIS-6381[ubusrv3]: (unset) -> 1
 notice: Transition 112

[ClusterLabs] make promoted follow promoted resource ?

2023-11-26 Thread lejeczek via Users

Hi guys.

With these:

-> $ pcs resource status REDIS-6381-clone
  * Clone Set: REDIS-6381-clone [REDIS-6381] (promotable):
    * Promoted: [ ubusrv2 ]
    * Unpromoted: [ ubusrv1 ubusrv3 ]

-> $ pcs resource status PGSQL-PAF-5433-clone
  * Clone Set: PGSQL-PAF-5433-clone [PGSQL-PAF-5433] 
(promotable):

    * Promoted: [ ubusrv1 ]
    * Unpromoted: [ ubusrv2 ubusrv3 ]

-> $ pcs constraint ref REDIS-6381-clone
Resource: REDIS-6381-clone
  colocation-REDIS-6381-clone-PGSQL-PAF-5433-clone-INFINITY

basically promoted Redis should follow promoted pgSQL but 
it's not happening, usually it does.
I presume pcs/cluster does something internally which 
results in disobeying/ignoring that _colocation_ constraint 
for these resources.

I presume scoring might play a role:
  REDIS-6385-clone with PGSQL-PAF-5435-clone (score:1001) 
(rsc-role:Master) (with-rsc-role:Master)

but usually, that scoring works, only "now" it does not.
Any comments I appreciate much.
thanks, L.

I looked at the pacemaker log - snippet below after 
REDIS-6381-clone re-enabled - but cannot see explanation for 
this.

...
 notice: Calculated transition 110, saving inputs in 
/var/lib/pacemaker/pengine/pe-input-3729.bz2
 notice: Transition 110 (Complete=0, Pending=0, Fired=0, 
Skipped=0, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-input-3729.bz2): Complete

 notice: State transition S_TRANSITION_ENGINE -> S_IDLE
 notice: State transition S_IDLE -> S_POLICY_ENGINE
 notice: Actions: Start  REDIS-6381:0 
(    ubusrv2 )
 notice: Actions: Start  REDIS-6381:1 
(    ubusrv3 )
 notice: Actions: Start  REDIS-6381:2 
(    ubusrv1 )
 notice: Calculated transition 111, saving inputs in 
/var/lib/pacemaker/pengine/pe-input-3730.bz2
 notice: Initiating start operation REDIS-6381_start_0 
locally on ubusrv2
 notice: Requesting local execution of start operation for 
REDIS-6381 on ubusrv2

(to redis) root on none
pam_unix(su:session): session opened for user redis(uid=127) 
by (uid=0)

pam_sss(su:session): Request to sssd failed. Connection refused
pam_unix(su:session): session closed for user redis
pam_sss(su:session): Request to sssd failed. Connection refused
 notice: Setting master-REDIS-6381[ubusrv2]: (unset) -> 1000
 notice: Transition 111 aborted by 
status-2-master-REDIS-6381 doing create 
master-REDIS-6381=1000: Transient attribute change

INFO: demote: Setting master to 'no-such-master'
 notice: Result of start operation for REDIS-6381 on 
ubusrv2: ok
 notice: Transition 111 (Complete=4, Pending=0, Fired=0, 
Skipped=1, Incomplete=14, 
Source=/var/lib/pacemaker/pengine/pe-input-3730.bz2): Stopped
 notice: Actions: Promote    REDIS-6381:0 ( 
Unpromoted -> Promoted ubusrv2 )
 notice: Actions: Start  REDIS-6381:1 
(    ubusrv1 )
 notice: Actions: Start  REDIS-6381:2 
(    ubusrv3 )
 notice: Calculated transition 112, saving inputs in 
/var/lib/pacemaker/pengine/pe-input-3731.bz2
 notice: Initiating notify operation 
REDIS-6381_pre_notify_start_0 locally on ubusrv2
 notice: Requesting local execution of notify operation for 
REDIS-6381 on ubusrv2
 notice: Result of notify operation for REDIS-6381 on 
ubusrv2: ok
 notice: Initiating start operation REDIS-6381_start_0 on 
ubusrv1
 notice: Initiating start operation REDIS-6381:2_start_0 on 
ubusrv3
 notice: Initiating notify operation 
REDIS-6381_post_notify_start_0 locally on ubusrv2
 notice: Requesting local execution of notify operation for 
REDIS-6381 on ubusrv2
 notice: Initiating notify operation 
REDIS-6381_post_notify_start_0 on ubusrv1
 notice: Initiating notify operation 
REDIS-6381:2_post_notify_start_0 on ubusrv3
 notice: Result of notify operation for REDIS-6381 on 
ubusrv2: ok
 notice: Initiating notify operation 
REDIS-6381_pre_notify_promote_0 locally on ubusrv2
 notice: Requesting local execution of notify operation for 
REDIS-6381 on ubusrv2
 notice: Initiating notify operation 
REDIS-6381_pre_notify_promote_0 on ubusrv1
 notice: Initiating notify operation 
REDIS-6381:2_pre_notify_promote_0 on ubusrv3
 notice: Result of notify operation for REDIS-6381 on 
ubusrv2: ok
 notice: Initiating promote operation REDIS-6381_promote_0 
locally on ubusrv2
 notice: Requesting local execution of promote operation 
for REDIS-6381 on ubusrv2
 notice: Result of promote operation for REDIS-6381 on 
ubusrv2: ok
 notice: Initiating notify operation 
REDIS-6381_post_notify_promote_0 locally on ubusrv2
 notice: Requesting local execution of notify operation for 
REDIS-6381 on ubusrv2
 notice: Initiating notify operation 
REDIS-6381_post_notify_promote_0 on ubusrv1
 notice: Initiating notify operation 
REDIS-6381:2_post_notify_promote_0 on ubusrv3
 notice: Result of notify operation for REDIS-6381 on 
ubusrv2: ok

 notice: Setting master-REDIS-6381[ubusrv3]: (unset) -> 1
 notice: Transition 112 aborted by 
status-3-master-REDIS-6381 doing create master-REDIS-6381=1: 
Transient 

Re: [ClusterLabs] make promoted follow promoted resource ?

2023-11-26 Thread lejeczek via Users



On 26/11/2023 12:20, Reid Wahl wrote:

On Sun, Nov 26, 2023 at 1:32 AM lejeczek via Users
 wrote:

Hi guys.

With these:

-> $ pcs resource status REDIS-6381-clone
   * Clone Set: REDIS-6381-clone [REDIS-6381] (promotable):
 * Promoted: [ ubusrv2 ]
 * Unpromoted: [ ubusrv1 ubusrv3 ]

-> $ pcs resource status PGSQL-PAF-5433-clone
   * Clone Set: PGSQL-PAF-5433-clone [PGSQL-PAF-5433] (promotable):
 * Promoted: [ ubusrv1 ]
 * Unpromoted: [ ubusrv2 ubusrv3 ]

-> $ pcs constraint ref REDIS-6381-clone
Resource: REDIS-6381-clone
   colocation-REDIS-6381-clone-PGSQL-PAF-5433-clone-INFINITY

basically promoted Redis should follow promoted pgSQL but it's not happening, 
usually it does.
I presume pcs/cluster does something internally which results in 
disobeying/ignoring that _colocation_ constraint for these resources.
I presume scoring might play a role:
   REDIS-6385-clone with PGSQL-PAF-5435-clone (score:1001) (rsc-role:Master) 
(with-rsc-role:Master)
but usually, that scoring works, only "now" it does not.
Any comments I appreciate much.
thanks, L.

I looked at the pacemaker log - snippet below after REDIS-6381-clone re-enabled - but 
cannot see explanation for this.
...
  notice: Calculated transition 110, saving inputs in 
/var/lib/pacemaker/pengine/pe-input-3729.bz2
  notice: Transition 110 (Complete=0, Pending=0, Fired=0, Skipped=0, 
Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3729.bz2): Complete
  notice: State transition S_TRANSITION_ENGINE -> S_IDLE
  notice: State transition S_IDLE -> S_POLICY_ENGINE
  notice: Actions: Start  REDIS-6381:0 (
ubusrv2 )
  notice: Actions: Start  REDIS-6381:1 (
ubusrv3 )
  notice: Actions: Start  REDIS-6381:2 (
ubusrv1 )
  notice: Calculated transition 111, saving inputs in 
/var/lib/pacemaker/pengine/pe-input-3730.bz2
  notice: Initiating start operation REDIS-6381_start_0 locally on ubusrv2
  notice: Requesting local execution of start operation for REDIS-6381 on 
ubusrv2
(to redis) root on none
pam_unix(su:session): session opened for user redis(uid=127) by (uid=0)
pam_sss(su:session): Request to sssd failed. Connection refused
pam_unix(su:session): session closed for user redis
pam_sss(su:session): Request to sssd failed. Connection refused
  notice: Setting master-REDIS-6381[ubusrv2]: (unset) -> 1000
  notice: Transition 111 aborted by status-2-master-REDIS-6381 doing create 
master-REDIS-6381=1000: Transient attribute change
INFO: demote: Setting master to 'no-such-master'
  notice: Result of start operation for REDIS-6381 on ubusrv2: ok
  notice: Transition 111 (Complete=4, Pending=0, Fired=0, Skipped=1, 
Incomplete=14, Source=/var/lib/pacemaker/pengine/pe-input-3730.bz2): Stopped
  notice: Actions: PromoteREDIS-6381:0 ( Unpromoted -> Promoted 
ubusrv2 )
  notice: Actions: Start  REDIS-6381:1 (
ubusrv1 )
  notice: Actions: Start  REDIS-6381:2 (
ubusrv3 )
  notice: Calculated transition 112, saving inputs in 
/var/lib/pacemaker/pengine/pe-input-3731.bz2
  notice: Initiating notify operation REDIS-6381_pre_notify_start_0 locally on 
ubusrv2
  notice: Requesting local execution of notify operation for REDIS-6381 on 
ubusrv2
  notice: Result of notify operation for REDIS-6381 on ubusrv2: ok
  notice: Initiating start operation REDIS-6381_start_0 on ubusrv1
  notice: Initiating start operation REDIS-6381:2_start_0 on ubusrv3
  notice: Initiating notify operation REDIS-6381_post_notify_start_0 locally on 
ubusrv2
  notice: Requesting local execution of notify operation for REDIS-6381 on 
ubusrv2
  notice: Initiating notify operation REDIS-6381_post_notify_start_0 on ubusrv1
  notice: Initiating notify operation REDIS-6381:2_post_notify_start_0 on 
ubusrv3
  notice: Result of notify operation for REDIS-6381 on ubusrv2: ok
  notice: Initiating notify operation REDIS-6381_pre_notify_promote_0 locally 
on ubusrv2
  notice: Requesting local execution of notify operation for REDIS-6381 on 
ubusrv2
  notice: Initiating notify operation REDIS-6381_pre_notify_promote_0 on ubusrv1
  notice: Initiating notify operation REDIS-6381:2_pre_notify_promote_0 on 
ubusrv3
  notice: Result of notify operation for REDIS-6381 on ubusrv2: ok
  notice: Initiating promote operation REDIS-6381_promote_0 locally on ubusrv2
  notice: Requesting local execution of promote operation for REDIS-6381 on 
ubusrv2
  notice: Result of promote operation for REDIS-6381 on ubusrv2: ok
  notice: Initiating notify operation REDIS-6381_post_notify_promote_0 locally 
on ubusrv2
  notice: Requesting local execution of notify operation for REDIS-6381 on 
ubusrv2
  notice: Initiating notify operation REDIS-6381_post_notify_promote_0 on 
ubusrv1
  notice: Initiating notify operation REDIS-6381:2_post_notify_promote_0 on 
ubusrv3
  notice: Result of notify operation for RED

Re: [ClusterLabs] ethernet link up/down - ?

2023-11-28 Thread lejeczek via Users



On 16/02/2022 10:37, Klaus Wenninger wrote:



On Tue, Feb 15, 2022 at 5:25 PM lejeczek via Users 
 wrote:




On 07/02/2022 19:21, Antony Stone wrote:
> On Monday 07 February 2022 at 20:09:02, lejeczek via
    Users wrote:
>
>> Hi guys
>>
>> How do you guys go about doing link up/down as a
resource?
> I apply or remove addresses on the interface, using
"IPaddr2" and "IPv6addr",
> which I know is not the same thing.
>
> Why do you separately want to control link up/down? 
I can't think what I
> would use this for.

Just out of curiosity and as I haven't seen an answer in 
the thread yet - maybe

I overlooked something ...
Is this to control some link-triggered redundancy setup 
with switches?



Revisiting my own question/thread.
Yes. Very close to what Klaus wondered - it's a device over 
which I have no control and from that device perspective 
it's simply - link is up then I'll "serve" it.
I've been thinking lowest possible layer shall be the safest 
way - thus asked about controlling eth link that way: 
down/up by means of electric power, ideally.
As opposed to ha-cluster calling some middle men such as 
network managers.

I read some eth nics/drivers can power down a port.
Is there an agent & a way to do that?



>
>
> Antony.
>
Kind of similar - tcp/ip and those layers configs are
delivered by DHCP.
I'd think it would have to be a clone resource with one
master without any constraints where cluster freely
decides
where to put master(link up) on - which is when link gets
dhcp-served.
But I wonder if that would mean writing up a new
resource -
I don't think there is anything like that included in
ready-made pcs/ocf packages.

many thanks, L
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] make promoted follow promoted resource ?

2023-11-26 Thread lejeczek via Users



On 26/11/2023 17:44, Andrei Borzenkov wrote:

On 26.11.2023 12:32, lejeczek via Users wrote:

Hi guys.

With these:

-> $ pcs resource status REDIS-6381-clone
    * Clone Set: REDIS-6381-clone [REDIS-6381] (promotable):
      * Promoted: [ ubusrv2 ]
      * Unpromoted: [ ubusrv1 ubusrv3 ]

-> $ pcs resource status PGSQL-PAF-5433-clone
    * Clone Set: PGSQL-PAF-5433-clone [PGSQL-PAF-5433]
(promotable):
      * Promoted: [ ubusrv1 ]
      * Unpromoted: [ ubusrv2 ubusrv3 ]

-> $ pcs constraint ref REDIS-6381-clone
Resource: REDIS-6381-clone
    
colocation-REDIS-6381-clone-PGSQL-PAF-5433-clone-INFINITY


basically promoted Redis should follow promoted pgSQL but
it's not happening, usually it does.
I presume pcs/cluster does something internally which
results in disobeying/ignoring that _colocation_ constraint
for these resources.
I presume scoring might play a role:
    REDIS-6385-clone with PGSQL-PAF-5435-clone (score:1001)
(rsc-role:Master) (with-rsc-role:Master)
but usually, that scoring works, only "now" it does not.
Any comments I appreciate much.
thanks, L.

I looked at the pacemaker log - snippet below after
REDIS-6381-clone re-enabled - but cannot see explanation for
this.
...
   notice: Calculated transition 110, saving inputs in
/var/lib/pacemaker/pengine/pe-input-3729.bz2
   notice: Transition 110 (Complete=0, Pending=0, Fired=0,
Skipped=0, Incomplete=0,
Source=/var/lib/pacemaker/pengine/pe-input-3729.bz2): 
Complete

   notice: State transition S_TRANSITION_ENGINE -> S_IDLE
   notice: State transition S_IDLE -> S_POLICY_ENGINE
   notice: Actions: Start  REDIS-6381:0
(    ubusrv2 )
   notice: Actions: Start  REDIS-6381:1
(    ubusrv3 )
   notice: Actions: Start  REDIS-6381:2
(    ubusrv1 )
   notice: Calculated transition 111, saving inputs in
/var/lib/pacemaker/pengine/pe-input-3730.bz2
   notice: Initiating start operation REDIS-6381_start_0
locally on ubusrv2
   notice: Requesting local execution of start operation for
REDIS-6381 on ubusrv2
(to redis) root on none
pam_unix(su:session): session opened for user redis(uid=127)
by (uid=0)
pam_sss(su:session): Request to sssd failed. Connection 
refused

pam_unix(su:session): session closed for user redis
pam_sss(su:session): Request to sssd failed. Connection 
refused
   notice: Setting master-REDIS-6381[ubusrv2]: (unset) -> 
1000


This is the only line that sets master score, so 
apparently ubusrv2 is the only node where your clone *can* 
be promoted. Whether pacemaker is expected to fail this 
operation because it violates constraint I do not know.



   notice: Transition 111 aborted by
status-2-master-REDIS-6381 doing create
master-REDIS-6381=1000: Transient attribute change
INFO: demote: Setting master to 'no-such-master'
   notice: Result of start operation for REDIS-6381 on
ubusrv2: ok
   notice: Transition 111 (Complete=4, Pending=0, Fired=0,
Skipped=1, Incomplete=14,
Source=/var/lib/pacemaker/pengine/pe-input-3730.bz2): 
Stopped

   notice: Actions: Promote    REDIS-6381:0 (
Unpromoted -> Promoted ubusrv2 )
   notice: Actions: Start  REDIS-6381:1
(    ubusrv1 )
   notice: Actions: Start  REDIS-6381:2
(    ubusrv3 )
   notice: Calculated transition 112, saving inputs in
/var/lib/pacemaker/pengine/pe-input-3731.bz2
   notice: Initiating notify operation
REDIS-6381_pre_notify_start_0 locally on ubusrv2
   notice: Requesting local execution of notify operation 
for

REDIS-6381 on ubusrv2
   notice: Result of notify operation for REDIS-6381 on
ubusrv2: ok
   notice: Initiating start operation REDIS-6381_start_0 on
ubusrv1
   notice: Initiating start operation 
REDIS-6381:2_start_0 on

ubusrv3
   notice: Initiating notify operation
REDIS-6381_post_notify_start_0 locally on ubusrv2
   notice: Requesting local execution of notify operation 
for

REDIS-6381 on ubusrv2
   notice: Initiating notify operation
REDIS-6381_post_notify_start_0 on ubusrv1
   notice: Initiating notify operation
REDIS-6381:2_post_notify_start_0 on ubusrv3
   notice: Result of notify operation for REDIS-6381 on
ubusrv2: ok
   notice: Initiating notify operation
REDIS-6381_pre_notify_promote_0 locally on ubusrv2
   notice: Requesting local execution of notify operation 
for

REDIS-6381 on ubusrv2
   notice: Initiating notify operation
REDIS-6381_pre_notify_promote_0 on ubusrv1
   notice: Initiating notify operation
REDIS-6381:2_pre_notify_promote_0 on ubusrv3
   notice: Result of notify operation for REDIS-6381 on
ubusrv2: ok
   notice: Initiating promote operation REDIS-6381_promote_0
locally on ubusrv2
   notice: Requesting local execution of promote operation
for REDIS-6381 on ubusrv2
   notice: Result of promote operation for REDIS-6381 on
ubusrv2: ok
   notice: Initiating notify operation
REDIS-6381_post_notify_promote_0 locally on ubusrv2
   notice: Requesting local execution of notify operation 
for

REDIS-6381 o

Re: [ClusterLabs] [EXT] moving VM live fails?

2023-11-25 Thread lejeczek via Users




On 24/11/2023 08:33, Windl, Ulrich wrote:

Hi!

So you have different CPUs in the cluster? We once had a similar situation with Xen using 
live migration: Migration failed, and the cluster "wasn't that smart" handling 
the situation. The solution was (with the help of support) to add some CPU flags masking 
in the VM configuration so that the newer CPU features were not used.
Before migration worked from the older to the newer CPU and back if the VM had 
been started on the older CPU initially, but when it had been started on the 
newer CPU it could not migrate to the older one. I vaguely remember the details 
but the hardware was a ProLiant 380 G7 vs. 380 G8 or so.
My recommendation is: If you want to size a cluster, but all the nodes you 
need, because if you want to add another node maybe in two years or so, the 
older model may not be available any more. IT lifetime is short.

Regards,
Ulrich

-Original Message-
From: Users  On Behalf Of lejeczek via Users
Sent: Friday, November 17, 2023 12:55 PM
To: users@clusterlabs.org
Cc: lejeczek 
Subject: [EXT] [ClusterLabs] moving VM live fails?

Hi guys.

I have a resource which when asked to 'move' then it fails with:

  virtqemud[3405456]: operation failed: guest CPU doesn't match specification: 
missing features: xsave

but VM domain does not require (nor disable) the feature:

   
 https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
This must go down - same environment - all the way to the 
kernel. I knew that but had forgotten - one box did change its 
default kernel & after a reboot booted with CentOS's default, 
as opposed to the other boxes running with ver. 6.x.

So, my bad, not really an issue.
Thanks, L.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] make promoted follow promoted resource ?

2023-12-06 Thread lejeczek via Users



On 26/11/2023 12:20, Reid Wahl wrote:

On Sun, Nov 26, 2023 at 1:32 AM lejeczek via Users
 wrote:

Hi guys.

With these:

-> $ pcs resource status REDIS-6381-clone
   * Clone Set: REDIS-6381-clone [REDIS-6381] (promotable):
 * Promoted: [ ubusrv2 ]
 * Unpromoted: [ ubusrv1 ubusrv3 ]

-> $ pcs resource status PGSQL-PAF-5433-clone
   * Clone Set: PGSQL-PAF-5433-clone [PGSQL-PAF-5433] (promotable):
 * Promoted: [ ubusrv1 ]
 * Unpromoted: [ ubusrv2 ubusrv3 ]

-> $ pcs constraint ref REDIS-6381-clone
Resource: REDIS-6381-clone
   colocation-REDIS-6381-clone-PGSQL-PAF-5433-clone-INFINITY

basically promoted Redis should follow promoted pgSQL but it's not happening, 
usually it does.
I presume pcs/cluster does something internally which results in 
disobeying/ignoring that _colocation_ constraint for these resources.
I presume scoring might play a role:
   REDIS-6385-clone with PGSQL-PAF-5435-clone (score:1001) (rsc-role:Master) 
(with-rsc-role:Master)
but usually, that scoring works, only "now" it does not.
Any comments I appreciate much.
thanks, L.

I looked at the pacemaker log - snippet below after REDIS-6381-clone re-enabled - but 
cannot see explanation for this.
...
  notice: Calculated transition 110, saving inputs in 
/var/lib/pacemaker/pengine/pe-input-3729.bz2
  notice: Transition 110 (Complete=0, Pending=0, Fired=0, Skipped=0, 
Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3729.bz2): Complete
  notice: State transition S_TRANSITION_ENGINE -> S_IDLE
  notice: State transition S_IDLE -> S_POLICY_ENGINE
  notice: Actions: Start  REDIS-6381:0 (
ubusrv2 )
  notice: Actions: Start  REDIS-6381:1 (
ubusrv3 )
  notice: Actions: Start  REDIS-6381:2 (
ubusrv1 )
  notice: Calculated transition 111, saving inputs in 
/var/lib/pacemaker/pengine/pe-input-3730.bz2
  notice: Initiating start operation REDIS-6381_start_0 locally on ubusrv2
  notice: Requesting local execution of start operation for REDIS-6381 on 
ubusrv2
(to redis) root on none
pam_unix(su:session): session opened for user redis(uid=127) by (uid=0)
pam_sss(su:session): Request to sssd failed. Connection refused
pam_unix(su:session): session closed for user redis
pam_sss(su:session): Request to sssd failed. Connection refused
  notice: Setting master-REDIS-6381[ubusrv2]: (unset) -> 1000
  notice: Transition 111 aborted by status-2-master-REDIS-6381 doing create 
master-REDIS-6381=1000: Transient attribute change
INFO: demote: Setting master to 'no-such-master'
  notice: Result of start operation for REDIS-6381 on ubusrv2: ok
  notice: Transition 111 (Complete=4, Pending=0, Fired=0, Skipped=1, 
Incomplete=14, Source=/var/lib/pacemaker/pengine/pe-input-3730.bz2): Stopped
  notice: Actions: PromoteREDIS-6381:0 ( Unpromoted -> Promoted 
ubusrv2 )
  notice: Actions: Start  REDIS-6381:1 (
ubusrv1 )
  notice: Actions: Start  REDIS-6381:2 (
ubusrv3 )
  notice: Calculated transition 112, saving inputs in 
/var/lib/pacemaker/pengine/pe-input-3731.bz2
  notice: Initiating notify operation REDIS-6381_pre_notify_start_0 locally on 
ubusrv2
  notice: Requesting local execution of notify operation for REDIS-6381 on 
ubusrv2
  notice: Result of notify operation for REDIS-6381 on ubusrv2: ok
  notice: Initiating start operation REDIS-6381_start_0 on ubusrv1
  notice: Initiating start operation REDIS-6381:2_start_0 on ubusrv3
  notice: Initiating notify operation REDIS-6381_post_notify_start_0 locally on 
ubusrv2
  notice: Requesting local execution of notify operation for REDIS-6381 on 
ubusrv2
  notice: Initiating notify operation REDIS-6381_post_notify_start_0 on ubusrv1
  notice: Initiating notify operation REDIS-6381:2_post_notify_start_0 on 
ubusrv3
  notice: Result of notify operation for REDIS-6381 on ubusrv2: ok
  notice: Initiating notify operation REDIS-6381_pre_notify_promote_0 locally 
on ubusrv2
  notice: Requesting local execution of notify operation for REDIS-6381 on 
ubusrv2
  notice: Initiating notify operation REDIS-6381_pre_notify_promote_0 on ubusrv1
  notice: Initiating notify operation REDIS-6381:2_pre_notify_promote_0 on 
ubusrv3
  notice: Result of notify operation for REDIS-6381 on ubusrv2: ok
  notice: Initiating promote operation REDIS-6381_promote_0 locally on ubusrv2
  notice: Requesting local execution of promote operation for REDIS-6381 on 
ubusrv2
  notice: Result of promote operation for REDIS-6381 on ubusrv2: ok
  notice: Initiating notify operation REDIS-6381_post_notify_promote_0 locally 
on ubusrv2
  notice: Requesting local execution of notify operation for REDIS-6381 on 
ubusrv2
  notice: Initiating notify operation REDIS-6381_post_notify_promote_0 on 
ubusrv1
  notice: Initiating notify operation REDIS-6381:2_post_notify_promote_0 on 
ubusrv3
  notice: Result of notify operation for RED

[ClusterLabs] how to colocate promoted resources ?

2023-12-06 Thread lejeczek via Users

Hi guys.

How do you colocate your promoted resources with balancing the 
underlying resources as a priority?

With a simple scenario, say
3 nodes and 3 pgSQL clusters
what would be best possible way - I'm thinking most gentle 
at the same time, if that makes sense.


many thanks, L.___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] IPaddr2 Started (disabled) ?

2023-12-04 Thread lejeczek via Users

hi guys.

A cluster thinks the resource is up:
...
  * HA-10-1-1-80    (ocf:heartbeat:IPaddr2):     Started 
ubusrv3 (disabled)

..
while it is not the case. What might it mean?
Config is simple:
-> $ pcs resource config HA-10-1-1-80
 Resource: HA-10-1-1-80 (class=ocf provider=heartbeat 
type=IPaddr2)

  Attributes: cidr_netmask=24 ip=10.1.1.80
  Meta Attrs: failure-timeout=20s target-role=Stopped
  Operations: monitor interval=10s timeout=20s 
(HA-10-1-1-80-monitor-interval-10s)
  start interval=0s timeout=20s 
(HA-10-1-1-80-start-interval-0s)
  stop interval=0s timeout=20s 
(HA-10-1-1-80-stop-interval-0s)


I expected, well.. the cluster to behave.
This results in trouble - because of exactly that, I think 
- for when the resource is re-enabled, that IP/resource is 
not instantiated.

Is its monitor not working?
Is there a way to "harden" the cluster, perhaps via the resource config?
The cluster is Ubuntu VM boxes.
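
(when it happens, what seems to clear the stale state for me 
is forcing a re-probe - standard tooling as far as I know:

-> $ pcs resource refresh HA-10-1-1-80
# or, to ask the agent directly, bypassing the cluster:
-> $ crm_resource --resource HA-10-1-1-80 --force-check

neither explains why the monitor believed the address was 
still up, though.)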

many thanks, L.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] how to colocate promoted resources ?

2023-12-08 Thread lejeczek via Users




On 08/12/2023 13:25, Jehan-Guillaume de Rorthais wrote:

Hi,

On Wed, 6 Dec 2023 10:36:39 +0100
lejeczek via Users  wrote:


How do you colocate your promoted resources with balancing
underlying resources as priority?

What do you mean?


With a simple scenario, say
3 nodes and 3 pgSQL clusters
what would be best possible way - I'm thinking most gentle
at the same time, if that makes sense.

I'm not sure it answers your question (as I don't understand it), but here is a
doc explaining how to create and move two IP supposed to each start on
secondaries, avoiding the primary node if possible, as long as secondaries nodes
exists

https://clusterlabs.github.io/PAF/CentOS-7-admin-cookbook.html#adding-ips-on-standbys-nodes

++

Apologies, perhaps I was quite vague.
I was thinking - having a 3-node HA cluster and 3-node 
single-master->slaves pgSQL, now..
say, I want the pgSQL masters to spread across the HA cluster, 
so in theory - each HA node being identical hardware-wise - 
the masters' resources would nicely balance out across the HA cluster.
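
(the closest I have come up with so far - only a sketch, the 
utilization numbers are made up - is to let the placement 
strategy do the balancing instead of hard constraints:

-> $ pcs property set placement-strategy=balanced
-> $ pcs node utilization ubusrv1 cpu=10
-> $ pcs node utilization ubusrv2 cpu=10
-> $ pcs node utilization ubusrv3 cpu=10
-> $ pcs resource utilization PGSQL-PAF-5433 cpu=3
-> $ pcs resource utilization PGSQL-PAF-5434 cpu=3
-> $ pcs resource utilization PGSQL-PAF-5435 cpu=3

whether that actually spreads the promoted instances, rather 
than just the clone copies, is what I still need to verify.)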

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] ethernet link up/down - ?

2023-12-07 Thread lejeczek via Users



On 04/12/2023 20:58, Reid Wahl wrote:

On Thu, Nov 30, 2023 at 10:30 AM lejeczek via Users
 wrote:



On 07/02/2022 20:09, lejeczek via Users wrote:

Hi guys

How do you guys go about doing link up/down as a resource?

many thanks, L.


With simple tests I confirmed that indeed Linux - on my
hardware at least - can easily power down an eth link - if
a @devel reads this:
Is there an agent in the suite which a non-programmer could
easily (for most safely) adopt for such purpose?
I understand such agent has to be cloneable & promotable.

The iface-bridge resource appears to do something similar for bridges.
I don't see anything currently for links in general.



Where can I find that agent?

Any comment on the idea of adding/introducing such a 
"link" agent into the agents in the future?

Should I perhaps go to _github_ and suggest it there?
Naturally, done by "devel" it would be ideal, as opposed to 
by us users/admins.

thanks, L.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] ethernet link up/down - ?

2023-11-30 Thread lejeczek via Users




On 07/02/2022 20:09, lejeczek via Users wrote:

Hi guys

How do you guys go about doing link up/down as a resource?

many thanks, L.



With simple tests I confirmed that indeed Linux - on my 
hardware at least - can easily power down an eth link - if 
a @devel reads this:
Is there an agent in the suite which a non-programmer could 
easily (for most safely) adopt for such purpose?

I understand such agent has to be cloneable & promotable.

btw. I think many would consider that a really neat & useful 
addition to resource-agents - if authors/devel made it into 
the core package.


many thanks, L.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] trigger something at ?

2024-01-31 Thread lejeczek via Users



On 31/01/2024 17:13, Jehan-Guillaume de Rorthais wrote:

On Wed, 31 Jan 2024 16:37:21 +0100
lejeczek via Users  wrote:



On 31/01/2024 16:06, Jehan-Guillaume de Rorthais wrote:

On Wed, 31 Jan 2024 16:02:12 +0100
lejeczek via Users  wrote:


On 29/01/2024 17:22, Ken Gaillot wrote:

On Fri, 2024-01-26 at 13:55 +0100, lejeczek via Users wrote:

Hi guys.

Is it possible to trigger some... action - I'm thinking specifically
at shutdown/start.
If not within the cluster then - if you do that - perhaps outside.
I would like to create/remove constraints, when cluster starts &
stops, respectively.

many thanks, L.


You could use node status alerts for that, but it's risky for alert
agents to change the configuration (since that may result in more
alerts and potentially some sort of infinite loop).

Pacemaker has no concept of a full cluster start/stop, only node
start/stop. You could approximate that by checking whether the node
receiving the alert is the only active node.

Another possibility would be to write a resource agent that does what
you want and order everything else after it. However it's even more
risky for a resource agent to modify the configuration.

Finally you could write a systemd unit to do what you want and order it
after pacemaker.

What's wrong with leaving the constraints permanently configured?

yes, that would be for a node start/stop
I struggle with using constraints to move pgsql (PAF) master
onto a given node - seems that co/locating paf's master
results in troubles (replication breaks) at/after node
shutdown/reboot (not always, but way too often)

What? What's wrong with colocating PAF's masters exactly? How does it break
any replication? What's these constraints you are dealing with?

Could you share your configuration?

Constraints beyond/above of what is required by PAF agent
itself, say...
you have multiple pgSQL cluster with PAF - thus multiple
(separate, for each pgSQL cluster) masters and you want to
spread/balance those across HA cluster
(or in other words - avoid having more that 1 pgsql master
per HA node)

ok


These below, I've tried, those move the master onto chosen
node but.. then the issues I mentioned.

You just mentioned it breaks the replication, but there so little information
about your architecture and configuration, it's impossible to imagine how this
could break the replication.

Could you add details about the issues ?


-> $ pcs constraint location PGSQL-PAF-5438-clone prefers
ubusrv1=1002
or
-> $ pcs constraint colocation set PGSQL-PAF-5435-clone
PGSQL-PAF-5434-clone PGSQL-PAF-5433-clone role=Master
require-all=false setoptions score=-1000

I suppose "collocation" constraint is the way to go, not the "location" one.
This should be easy to replicate, 3 x VMs, Ubuntu 22.04 in 
my case


-> $ pcs resource config PGSQL-PAF-5438-clone
 Clone: PGSQL-PAF-5438-clone
  Meta Attrs: failure-timeout=60s master-max=1 notify=true promotable=true
  Resource: PGSQL-PAF-5438 (class=ocf provider=heartbeat type=pgsqlms)
   Attributes: bindir=/usr/lib/postgresql/16/bin datadir=/var/lib/postgresql/16/paf-5438 maxlag=1000 pgdata=/etc/postgresql/16/paf-5438 pgport=5438
   Operations: demote interval=0s timeout=120s (PGSQL-PAF-5438-demote-interval-0s)
               methods interval=0s timeout=5 (PGSQL-PAF-5438-methods-interval-0s)
               monitor interval=15s role=Master timeout=10s (PGSQL-PAF-5438-monitor-interval-15s)
               monitor interval=16s role=Slave timeout=10s (PGSQL-PAF-5438-monitor-interval-16s)
               notify interval=0s timeout=60s (PGSQL-PAF-5438-notify-interval-0s)
               promote interval=0s timeout=30s (PGSQL-PAF-5438-promote-interval-0s)
               reload interval=0s timeout=20 (PGSQL-PAF-5438-reload-interval-0s)
               start interval=0s timeout=60s (PGSQL-PAF-5438-start-interval-0s)
               stop interval=0s timeout=60s (PGSQL-PAF-5438-stop-interval-0s)


so, regarding PAF - 1 master + 2 slaves: have a healthy
pgSQL/PAF cluster to begin with, then
make the resource prefer a specific node (with the simplest variant
of the constraints I tried):
-> $ pcs constraint location PGSQL-PAF-5438-clone prefers ubusrv1=1002


and play with it, rebooting node(s) with the OS' _reboot_.
At some point I get the HA resource unable to start pgSQL,
unable to elect a master (logs saying the replication is
broken), and I have to "fix" the pgSQL cluster outside of PAF,
using _pg_basebackup_
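
For reference, the "fix" outside of PAF is roughly the following - a minimal
sketch only; the primary host name and the replication user are assumptions
from my setup, nothing PAF itself mandates:

# on the broken standby, with the resource kept stopped/unmanaged on this node
sudo -iu postgres
# move the damaged data directory aside and re-clone from the current primary
mv /var/lib/postgresql/16/paf-5438 /var/lib/postgresql/16/paf-5438.broken
pg_basebackup -h ubusrv1 -p 5438 -U replicator \
    -D /var/lib/postgresql/16/paf-5438 \
    -X stream -R -P -c fast
# -R writes standby.signal plus primary_conninfo, so the instance comes
# back as a standby of the current primary
exit
pcs resource cleanup PGSQL-PAF-5438    # let Pacemaker probe/start it again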



___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] trigger something at ?

2024-02-01 Thread lejeczek via Users



On 31/01/2024 18:11, Ken Gaillot wrote:

On Wed, 2024-01-31 at 16:37 +0100, lejeczek via Users wrote:

On 31/01/2024 16:06, Jehan-Guillaume de Rorthais wrote:

On Wed, 31 Jan 2024 16:02:12 +0100
lejeczek via Users  wrote:


On 29/01/2024 17:22, Ken Gaillot wrote:

On Fri, 2024-01-26 at 13:55 +0100, lejeczek via Users wrote:

Hi guys.

Is it possible to trigger some... action - I'm thinking
specifically
at shutdown/start.
If not within the cluster then - if you do that - perhaps
outside.
I would like to create/remove constraints, when cluster
starts &
stops, respectively.

many thanks, L.


You could use node status alerts for that, but it's risky for
alert
agents to change the configuration (since that may result in
more
alerts and potentially some sort of infinite loop).

Pacemaker has no concept of a full cluster start/stop, only
node
start/stop. You could approximate that by checking whether the
node
receiving the alert is the only active node.

Another possibility would be to write a resource agent that
does what
you want and order everything else after it. However it's even
more
risky for a resource agent to modify the configuration.

Finally you could write a systemd unit to do what you want and
order it
after pacemaker.

What's wrong with leaving the constraints permanently
configured?

yes, that would be for a node start/stop
I struggle with using constraints to move a pgsql (PAF) master
onto a given node - it seems that co/locating PAF's master
results in trouble (replication breaks) at/after node
shutdown/reboot (not always, but way too often)

What? What's wrong with colocating PAF's masters exactly? How does
it break any
replication? What are these constraints you are dealing with?

Could you share your configuration?

Constraints beyond/above what is required by the PAF agent
itself, say...
you have multiple pgSQL clusters with PAF - thus multiple
(separate, one for each pgSQL cluster) masters - and you want to
spread/balance those across the HA cluster
(or in other words - avoid having more than 1 pgsql master
per HA node)
I've tried the ones below; they move the master onto the chosen
node, but then the issues I mentioned appear.

-> $ pcs constraint location PGSQL-PAF-5438-clone prefers
ubusrv1=1002
or
-> $ pcs constraint colocation set PGSQL-PAF-5435-clone
PGSQL-PAF-5434-clone PGSQL-PAF-5433-clone role=Master
require-all=false setoptions score=-1000


Anti-colocation sets tend to be tricky currently -- if the first
resource can't be assigned to a node, none of them can. We have an idea
for a better implementation:

  https://projects.clusterlabs.org/T383

In the meantime, a possible workaround is to use
placement-strategy=balanced and define utilization for the clones only. The
promoted roles will each get a slight additional utilization, and the
cluster should spread them out across nodes whenever possible. I don't
know if that will avoid the replication issues but it may be worth a
try.

using _balanced_ causes a bit of mayhem for PAF/pgsql:

-> $ pcs property
Cluster Properties:
 REDIS-6380_REPL_INFO: ubusrv3
 REDIS-6381_REPL_INFO: ubusrv2
 REDIS-6382_REPL_INFO: ubusrv2
 REDIS-6385_REPL_INFO: ubusrv1
 REDIS_REPL_INFO: ubusrv1
 cluster-infrastructure: corosync
 cluster-name: ubusrv
 dc-version: 2.1.2-ada5c3b36e2
 have-watchdog: false
 last-lrm-refresh: 1706711588
 placement-strategy: default
 stonith-enabled: false

-> $ pcs resource utilization PGSQL-PAF-5438 cpu="20"

-> $ pcs property set placement-strategy=balanced
# when I set this, the resource stops

I change it back:
-> $ pcs property set placement-strategy=default
and pgSQL/PAF works again

I've not used _utilization_ nor _placement-strategy_ before,
so there is a solid chance I'm missing something.
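
For the record, this is how I read the suggestion - a minimal sketch only,
assuming the nodes need a capacity defined before _balanced_ can place
anything (the numbers are made up, and whether the missing node capacity is
what stopped my resource is just my guess):

# give every node the same made-up capacity...
pcs node utilization ubusrv1 cpu=100
pcs node utilization ubusrv2 cpu=100
pcs node utilization ubusrv3 cpu=100
# ...and a small utilization on each PAF clone, so the promoted instances
# get spread across nodes rather than stacked on one
pcs resource utilization PGSQL-PAF-5433 cpu=20
pcs resource utilization PGSQL-PAF-5434 cpu=20
pcs resource utilization PGSQL-PAF-5435 cpu=20
pcs resource utilization PGSQL-PAF-5438 cpu=20
pcs property set placement-strategy=balanced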

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] trigger something at ?

2024-02-01 Thread lejeczek via Users




On 01/02/2024 15:02, Jehan-Guillaume de Rorthais wrote:

On Wed, 31 Jan 2024 18:23:40 +0100
lejeczek via Users  wrote:


On 31/01/2024 17:13, Jehan-Guillaume de Rorthais wrote:

On Wed, 31 Jan 2024 16:37:21 +0100
lejeczek via Users  wrote:
  

On 31/01/2024 16:06, Jehan-Guillaume de Rorthais wrote:

On Wed, 31 Jan 2024 16:02:12 +0100
lejeczek via Users  wrote:
  

On 29/01/2024 17:22, Ken Gaillot wrote:

On Fri, 2024-01-26 at 13:55 +0100, lejeczek via Users wrote:

Hi guys.

Is it possible to trigger some... action - I'm thinking specifically
at shutdown/start.
If not within the cluster then - if you do that - perhaps outside.
I would like to create/remove constraints, when cluster starts &
stops, respectively.

many thanks, L.
  

You could use node status alerts for that, but it's risky for alert
agents to change the configuration (since that may result in more
alerts and potentially some sort of infinite loop).

Pacemaker has no concept of a full cluster start/stop, only node
start/stop. You could approximate that by checking whether the node
receiving the alert is the only active node.

Another possibility would be to write a resource agent that does what
you want and order everything else after it. However it's even more
risky for a resource agent to modify the configuration.

Finally you could write a systemd unit to do what you want and order it
after pacemaker.

What's wrong with leaving the constraints permanently configured?

yes, that would be for a node start/stop
I struggle with using constraints to move a pgsql (PAF) master
onto a given node - it seems that co/locating PAF's master
results in trouble (replication breaks) at/after node
shutdown/reboot (not always, but way too often)

What? What's wrong with colocating PAF's masters exactly? How does it
break any replication? What are these constraints you are dealing with?

Could you share your configuration?

Constraints beyond/above what is required by the PAF agent
itself, say...
you have multiple pgSQL clusters with PAF - thus multiple
(separate, one for each pgSQL cluster) masters - and you want to
spread/balance those across the HA cluster
(or in other words - avoid having more than 1 pgsql master
per HA node)

ok
  

I've tried the ones below; they move the master onto the chosen
node, but then the issues I mentioned appear.

You just mentioned it breaks the replication, but there is so little
information about your architecture and configuration, it's impossible to
imagine how this could break the replication.

Could you add details about the issues ?
  

-> $ pcs constraint location PGSQL-PAF-5438-clone prefers
ubusrv1=1002
or
-> $ pcs constraint colocation set PGSQL-PAF-5435-clone
PGSQL-PAF-5434-clone PGSQL-PAF-5433-clone role=Master
require-all=false setoptions score=-1000

I suppose "collocation" constraint is the way to go, not the "location"
one.

This should be easy to replicate, 3 x VMs, Ubuntu 22.04 in
my case

No, this is not easy to replicate. I have no idea how you set up your PostgreSQL
replication, nor do I have your full pacemaker configuration.

Please provide either detailed setups and/or ansible and/or terraform and/or
vagrant, then a detailed scenario showing how it breaks. This is how you can
help and motivate devs to reproduce your issue and work on it.

I will not try to poke around for hours until I find an issue that might not
even be the same as yours.
How about starting with the basics - there is a strange inclination to
complicate things when they are not complicated, I hear - that's what
I was doing when I stumbled upon these "issues".

How about just:
a) set up a vanilla-default pgSQL in Ubuntu (or perhaps any other
OS of your choice); I use _pg_createcluster_
b) follow the official PAF guide (a single PAF resource should
suffice)
Have a healthy pgSQL cluster, OS-_reboot_ the nodes - play with
that; all should be OK, moving around/electing the master should
work fine.
Then add and play with "additional" co/location constraints, then
OS reboots - things should begin breaking (a condensed sketch of
the scenario follows below).
I have a 3-node HA cluster & a 3-node PAF resource = 1 master +
2 slaves.
The only thing I deliberately set, to help the pgsql
replication, was _wal_keep_size_ - I increased that, but this
is subjective.
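
Condensed into commands, the scenario looks roughly like this - a sketch
only; version, port and names are from my setup, and the replication/PAF
setup itself is exactly as in the official guide, so not repeated here:

# a) vanilla cluster on each Ubuntu node (Debian/Ubuntu wrapper)
pg_createcluster -p 5438 16 paf-5438
# b) streaming replication + the pgsqlms resource per the PAF guide,
#    ending up with PGSQL-PAF-5438-clone (1 master + 2 slaves) as shown earlier
# healthy so far - reboots and master moves work fine;
# now add the "extra" constraint and start rebooting nodes one at a time
pcs constraint location PGSQL-PAF-5438-clone prefers ubusrv1=1002
reboot
# after a few such reboots the clone can end up with no electable master
pcs status --full
crm_mon -1A    # one-shot view including the node attributes PAF maintains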


It's fine with me if you don't feel like doing this.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] to heal gfid which is not there -- IGNORE wrong list

2024-02-04 Thread lejeczek via Users



On 04/02/2024 12:57, lejeczek via Users wrote:

hi guys.

So, I've managed to make my volume go haywire, here:

-> $ gluster volume heal VMAIL info
Brick 10.1.1.100:/devs/00.GLUSTERs/VMAIL
Status: Connected
Number of entries: 0

Brick 10.1.1.101:/devs/00.GLUSTERs/VMAIL
/dovecot-uidlist
Status: Connected
Number of entries: 1

Brick 10.1.1.99:/devs/00.GLUSTERs/VMAIL-arbiter
Status: Connected
Number of entries: 0

Neither that _gfid_ nor _dovecot-uidlist_ exists on any of
the bricks, but I've noticed that all bricks have:

.glusterfs-anonymous-inode-462a1850-a31a-4a17-934d-26f3996dc9b8
which is an empty dir.

I fear that volume won't heal on its own and might need
some intervention - would you know how/where to start?


many thanks, L.

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] to heal gfid which is not there

2024-02-04 Thread lejeczek via Users

hi guys.

So, I've managed to make my volume go haywire, here:

-> $ gluster volume heal VMAIL info
Brick 10.1.1.100:/devs/00.GLUSTERs/VMAIL
Status: Connected
Number of entries: 0

Brick 10.1.1.101:/devs/00.GLUSTERs/VMAIL
/dovecot-uidlist
Status: Connected
Number of entries: 1

Brick 10.1.1.99:/devs/00.GLUSTERs/VMAIL-arbiter
Status: Connected
Number of entries: 0

Neither that _gfid_ nor _dovecot-uidlist_ exists on any of the
bricks, but I've noticed that all bricks have:

.glusterfs-anonymous-inode-462a1850-a31a-4a17-934d-26f3996dc9b8
which is an empty dir.

I fear that volume won't heal on its own and might need some
intervention - would you know how/where to start?
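
For reference, the sort of commands I assume would be the starting point - a
sketch only, and the split-brain angle is just a guess on my part, I'm not
sure it even applies here:

# summary / split-brain view of the pending entries
gluster volume heal VMAIL info summary
gluster volume heal VMAIL info split-brain
# inspect the afr xattrs of the file directly on each brick (run on each server)
getfattr -d -m . -e hex /devs/00.GLUSTERs/VMAIL/dovecot-uidlist
# kick off a full heal crawl
gluster volume heal VMAIL full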


many thanks, L.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] trigger something at ?

2024-01-31 Thread lejeczek via Users




On 29/01/2024 17:22, Ken Gaillot wrote:

On Fri, 2024-01-26 at 13:55 +0100, lejeczek via Users wrote:

Hi guys.

Is it possible to trigger some... action - I'm thinking specifically
at shutdown/start.
If not within the cluster then - if you do that - perhaps outside.
I would like to create/remove constraints, when cluster starts &
stops, respectively.

many thanks, L.


You could use node status alerts for that, but it's risky for alert
agents to change the configuration (since that may result in more
alerts and potentially some sort of infinite loop).

Pacemaker has no concept of a full cluster start/stop, only node
start/stop. You could approximate that by checking whether the node
receiving the alert is the only active node.

Another possibility would be to write a resource agent that does what
you want and order everything else after it. However it's even more
risky for a resource agent to modify the configuration.

Finally you could write a systemd unit to do what you want and order it
after pacemaker.

What's wrong with leaving the constraints permanently configured?

yes, that would be for a node start/stop
I struggle with using constraints to move a pgsql (PAF) master
onto a given node - it seems that co/locating PAF's master
results in trouble (replication breaks) at/after node
shutdown/reboot (not always, but way too often)


Ideally I'm hoping that,
at node stop, the stopping node could check whether it hosts PAF's master
and, if so, remove the given constraints (a rough sketch below).
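
Along the lines of the systemd-unit idea above, this is roughly what I have
in mind - entirely untested; the unit/script names and the constraint id are
made up, and whether querying the cluster this late in shutdown is safe is
exactly what I'm unsure about:

cat > /etc/systemd/system/paf-constraints.service <<'EOF'
[Unit]
Description=Drop PAF location constraint before pacemaker stops on this node
After=pacemaker.service
Requires=pacemaker.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/bin/true
ExecStop=/usr/local/sbin/drop-paf-constraint.sh

[Install]
WantedBy=multi-user.target
EOF

cat > /usr/local/sbin/drop-paf-constraint.sh <<'EOF'
#!/bin/sh
# if this node currently hosts the promoted PGSQL-PAF-5438 instance,
# remove the extra location constraint so the master can move cleanly
if crm_resource --locate -r PGSQL-PAF-5438-clone 2>/dev/null \
     | grep -Ei 'master|promoted' | grep -qw "$(uname -n)"; then
    # constraint id is made up here - check the real one with: pcs constraint --full
    pcs constraint remove location-PGSQL-PAF-5438-clone-ubusrv1-1002 || true
fi
EOF
chmod +x /usr/local/sbin/drop-paf-constraint.sh
systemctl daemon-reload
systemctl enable paf-constraints.service

Since the unit is ordered After=pacemaker.service, systemd stops it (and so
runs ExecStop) before it stops pacemaker on that node.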

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] trigger something at ?

2024-01-31 Thread lejeczek via Users




On 31/01/2024 16:06, Jehan-Guillaume de Rorthais wrote:

On Wed, 31 Jan 2024 16:02:12 +0100
lejeczek via Users  wrote:



On 29/01/2024 17:22, Ken Gaillot wrote:

On Fri, 2024-01-26 at 13:55 +0100, lejeczek via Users wrote:

Hi guys.

Is it possible to trigger some... action - I'm thinking specifically
at shutdown/start.
If not within the cluster then - if you do that - perhaps outside.
I would like to create/remove constraints, when cluster starts &
stops, respectively.

many thanks, L.


You could use node status alerts for that, but it's risky for alert
agents to change the configuration (since that may result in more
alerts and potentially some sort of infinite loop).

Pacemaker has no concept of a full cluster start/stop, only node
start/stop. You could approximate that by checking whether the node
receiving the alert is the only active node.

Another possibility would be to write a resource agent that does what
you want and order everything else after it. However it's even more
risky for a resource agent to modify the configuration.

Finally you could write a systemd unit to do what you want and order it
after pacemaker.

What's wrong with leaving the constraints permanently configured?

yes, that would be for a node start/stop
I struggle with using constraints to move a pgsql (PAF) master
onto a given node - it seems that co/locating PAF's master
results in trouble (replication breaks) at/after node
shutdown/reboot (not always, but way too often)

What? What's wrong with colocating PAF's masters exactly? How does it break any
replication? What are these constraints you are dealing with?

Could you share your configuration?
Constraints beyond/above what is required by the PAF agent
itself, say...
you have multiple pgSQL clusters with PAF - thus multiple
(separate, one for each pgSQL cluster) masters - and you want to
spread/balance those across the HA cluster
(or in other words - avoid having more than 1 pgsql master
per HA node)
I've tried the ones below; they move the master onto the chosen
node, but then the issues I mentioned appear.


-> $ pcs constraint location PGSQL-PAF-5438-clone prefers 
ubusrv1=1002

or
-> $ pcs constraint colocation set PGSQL-PAF-5435-clone 
PGSQL-PAF-5434-clone PGSQL-PAF-5433-clone role=Master 
require-all=false setoptions score=-1000




___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] trigger something at ?

2024-01-26 Thread lejeczek via Users

Hi guys.

Is it possible to trigger some... action - I'm thinking 
specifically at shutdown/start.
If not within the cluster then - if you do that - perhaps 
outside.
I would like to create/remove constraints, when cluster 
starts & stops, respectively.


many thanks, L.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] non-existent attribute ?

2023-11-19 Thread lejeczek via Users

Hi guys.

My 3-node cluster had one node absent for a long time and 
now when it's back I cannot get _mariadb_ to start on that node.

...
    * MARIADB    (ocf:heartbeat:galera):     ORPHANED Stopped
...
    * MARIADB-last-committed  : 147
    * MARIADB-safe-to-bootstrap   : 0

I wanted to start with a _cleanup_ of the node's attributes - mariadb
started on that node outside of HA/pcs runs OK - but I get:


-> $ pcs node attribute dzien MARIADB-last-committed=''
Error: attribute: 'MARIADB-last-committed' doesn't exist for 
node: 'dzien'


Is it possible to clean that up safely?
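
My (unverified) assumption is that the galera agent keeps these as transient
node attributes in the status section, which 'pcs node attribute' does not
touch - so this is what I was going to try, a sketch only, please correct me
if it is a bad idea:

# show the transient (status-section) attribute for the node
crm_attribute --type status --node dzien --name MARIADB-last-committed --query
# delete it; attrd_updater is the lower-level way to do the same
crm_attribute --type status --node dzien --name MARIADB-last-committed --delete
attrd_updater --delete --name MARIADB-safe-to-bootstrap --node dzien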

many thanks, L.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] [EXT] Re: PAF / pgSQL fails after OS/system shutdown - FIX

2023-11-18 Thread lejeczek via Users




On 13/11/2023 13:08, Jehan-Guillaume de Rorthais via Users 
wrote:

On Mon, 13 Nov 2023 11:39:45 +
"Windl, Ulrich"  wrote:


But shouldn't the RA check for that (and act appropriately)?

Interesting. I'm open to discuss this. Below my thoughts so far.

Why should the RA check that? There are so many ways to set up the system and
PostgreSQL, where should the RA stop checking for all the possible ways to break it?

The RA checks various (maybe too many) things related to the instance itself
already.

I know various other PostgreSQL setups that would trigger errors in the cluster
if the dba doesn't check everything is correct. I'm really reluctant to
add add a fair amount of code in the RA to correctly parse and check the
complex PostgreSQL's setup. This would add complexity and bugs. Or maybe I
could add a specific OCF_CHECK_LEVEL sysadmins can trigger by hand before
starting the cluster. But I wonder if it worth the pain, how many people will
know about this and actually run it?

The problem here is that few users actually realize how the postgresql-common
wrapper works and what it actually does behind your back. I really appreciate
this wrapper, I do. But when you setup a Pacemaker cluster, you either have to
bend to it when setting up PAF (as documented), or avoid it completely.

PAF is all about drawing a clear line between the sysadmin job and the
dba one. Dba must build a cluster of instances ready to start/replicate with
standard binaries (not wrappers) before sysadmin can set up the resource in your
cluster.

Thoughts?


I would be of the same/similar mind - which is, adding more
code to account for more/all _config_ cases may not be the
healthiest approach. However!!


Just like some code writers out there around the globe,
some here too seem to not appreciate, or to
completely ignore, the parts which are meant to alleviate the
TCO burden of any software - the best bits of documentation, which
for Unix/Linux... are MAN PAGES.
This is rhetorical but I'll ask - what is a project worth
when its own man pages miss critical _gotchas_ and similar
gimmicks?


If you are reading this and are a prospective/future code writer
for Linux, please take note - your man pages ought to
be at least!! as good as your code.

many thanks, L.

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] moving VM live fails?

2023-11-17 Thread lejeczek via Users

Hi guys.

I have a resource which, when asked to 'move', fails with:

 virtqemud[3405456]: operation failed: guest CPU doesn't 
match specification: missing features: xsave


but the VM domain does not require (nor disable) that feature:


what is even more interesting, _virsh_ migrate does the
live migration OK.


I'm on CentOS 9 with packages up to date.
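
One thing I was going to compare - purely an assumption on my part that the
resource agent feeds libvirt a different XML than the one virsh migrates with
(DOMAIN and RESOURCE below are placeholders):

# on the source node: persistent definition vs live one
virsh dumpxml --inactive DOMAIN > /tmp/DOMAIN.persistent.xml
virsh dumpxml DOMAIN > /tmp/DOMAIN.live.xml
diff /tmp/DOMAIN.persistent.xml /tmp/DOMAIN.live.xml
# the XML file the VirtualDomain resource actually uses (its "config" attribute)
pcs resource config RESOURCE
diff /tmp/DOMAIN.persistent.xml /path/shown/as/config
# and whether the destination host exposes xsave at all
virsh domcapabilities | grep -i xsave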

All thoughts much appreciated.
many thanks, L.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

