Re: [ClusterLabs] Antw: [EXT] Re: VirtualDomain & "deeper" monitors - what/how?

2021-06-06 Thread Kyle O'Donnell
Let me know if there is a better approach to the following problem: when the
virtual machine does not respond to a state query, I want the cluster to kick
it (i.e., restart it).

I could not find any useful docs for using the nagios plugins. After reading
the documentation about running a custom script via the "monitor" function in
the RA, I determined that it would not meet my requirements, as it's only run
on start and migrate (unless I read it incorrectly?).

Here is what I did (I'm on Ubuntu 20.04):

cp /usr/lib/ocf/resource.d/heartbeat/VirtualDomain \
   /usr/lib/ocf/resource.d/heartbeat/MyVirtDomain
cp /usr/share/resource-agents/ocft/configs/VirtualDomain \
   /usr/share/resource-agents/ocft/configs/MyVirtDomain
sed -i 's/VirtualDomain/MyVirtDomain/g' \
   /usr/lib/ocf/resource.d/heartbeat/MyVirtDomain
sed -i 's/VirtualDomain/MyVirtDomain/g' \
   /usr/share/resource-agents/ocft/configs/MyVirtDomain
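
To sanity-check that the cluster can actually see the copied agent before using
it (crmsh syntax, since I use crm configure edit anyway):

crm ra info ocf:heartbeat:MyVirtDomain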

I edited the *MyVirtDomain_status* function in
/usr/lib/ocf/resource.d/heartbeat/MyVirtDomain, changing the
*running|paused|idle|blocked|"in shutdown")* case of the status check as
follows:

FROM
running|paused|idle|blocked|"in shutdown")
        # running: domain is currently actively consuming cycles
        # paused: domain is paused (suspended)
        # idle: domain is running but idle
        # blocked: synonym for idle used by legacy Xen versions
        # in shutdown: the domain is in process of shutting down,
        #   but has not completely shutdown or crashed.

        ocf_log debug "Virtual domain $DOMAIN_NAME is currently $status."
        rc=$OCF_SUCCESS
TO
running|paused|idle|blocked|"in shutdown")
        # running: domain is currently actively consuming cycles
        # paused: domain is paused (suspended)
        # idle: domain is running but idle
        # blocked: synonym for idle used by legacy Xen versions
        # in shutdown: the domain is in process of shutting down,
        #   but has not completely shutdown or crashed.

        custom_chk=$(/path/to/myscript.sh -H $DOMAIN_NAME -C guest-get-time -l 25 -w 1)
        custom_rc=$?
        if [ ${custom_rc} -eq 0 ]; then
                ocf_log debug "Virtual domain $DOMAIN_NAME is currently $status."
                rc=$OCF_SUCCESS
        else
                ocf_log debug "Virtual domain $DOMAIN_NAME is currently ${custom_chk}."
                rc=$OCF_ERR_GENERIC
        fi

The custom script uses the qemu-guest-agent in my guest, passing the parameter
to grab the guest's time (that seems to be the most universal command across
Windows, CentOS 6, Ubuntu, and CentOS 7). It runs up to 25 loops, sleeping 1
second between iterations, exits 0 as soon as the agent responds with the time,
and exits 1 after the 25th loop; those map to OCF_SUCCESS and OCF_ERR_GENERIC
per the docs.
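
In case it helps, a rough sketch of what such a wrapper might look like (purely
illustrative, not the actual script; only the option letters match the
invocations shown below):

#!/bin/sh
# Illustrative sketch: poll the qemu guest agent until it answers or we run
# out of tries.  -H domain, -C agent command, -l max tries, -w seconds between
# tries.
while getopts "H:C:l:w:" opt; do
    case "$opt" in
        H) DOMAIN=$OPTARG ;;
        C) CMD=$OPTARG ;;
        l) LOOPS=$OPTARG ;;
        w) WAIT=$OPTARG ;;
    esac
done

i=0
while [ "$i" -lt "$LOOPS" ]; do
    out=$(virsh qemu-agent-command "$DOMAIN" "{\"execute\":\"$CMD\"}" 2>&1)
    if [ $? -eq 0 ]; then
        echo "[GOOD] - $DOMAIN virsh qemu-agent-command $CMD output: $out"
        exit 0
    fi
    echo "[BAD] - $DOMAIN virsh qemu-agent-command $CMD output: $out"
    i=$((i + 1))
    sleep "$WAIT"
done
exit 1

(The real script presumably also bails out immediately when the domain does not
exist at all, as in the last example below.)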

# /path/to/myscript.sh -H myvm -C guest-get-time -l 25 -w 1
[GOOD] - myvm virsh qemu-agent-command guest-get-time output: {"return":1623011582178375000}

or when it's not responding:
# /path/to/myscript.sh -H myvm -C guest-get-time -l 25 -w 1
[BAD] - myvm virsh qemu-agent-command guest-get-time output: error: Guest agent is not responding: QEMU guest agent is not connected
[BAD] - myvm virsh qemu-agent-command guest-get-time output: error: Guest agent is not responding: QEMU guest agent is not connected
[BAD] - myvm virsh qemu-agent-command guest-get-time output: error: Guest agent is not responding: QEMU guest agent is not connected
[BAD] - myvm virsh qemu-agent-command guest-get-time output: error: Guest agent is not responding: QEMU guest agent is not connected
... (exits non-zero after the 25th try, or eventually recovers:)
[GOOD] - myvm virsh qemu-agent-command guest-get-time output: {"return":1623011582178375000}

and when the VM isn't running:
# /path/to/myscript.sh -H myvm -C guest-get-time -l 25 -w 1
[BAD] - myvm virsh qemu-agent-command guest-get-time output: error: failed to get domain 'myvm'

I updated my test VM to use the new RA and raised the status timeout to 40s
from the default of 30s, just in case.
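
In crm shell syntax the resource looks roughly like this (the config path and
the start/stop values are illustrative; the 40s monitor timeout is the change
mentioned above, assuming "status timeout" means the monitor op timeout):

primitive myvm ocf:heartbeat:MyVirtDomain \
        params config="/etc/libvirt/qemu/myvm.xml" hypervisor="qemu:///system" \
        op monitor interval="10s" timeout="40s" \
        op start timeout="90s" interval="0" \
        op stop timeout="90s" interval="0"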

I'd like to be able to update the parameters to *myscript.sh* via crm configure 
edit at some point, but will figure that out later...
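
One way that might work (sketched with a made-up parameter name) is to read an
OCF instance attribute in MyVirtDomain_status and give it a default:

# hypothetical: expose the loop count as an instance attribute
: "${OCF_RESKEY_guest_check_loops:=25}"
custom_chk=$(/path/to/myscript.sh -H $DOMAIN_NAME -C guest-get-time \
        -l "$OCF_RESKEY_guest_check_loops" -w 1)

It could then be set as params guest_check_loops=... via crm configure edit
(adding a matching <parameter> entry to the agent's meta-data keeps validation
from complaining).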

My test:

Reboot the VM from within the OS and hit Escape so that I end up at the boot
mode prompt. After ~30 seconds the cluster decides the resource is having a
problem, marks it as failed, and restarts the virtual machine (on the same node
-- which in my case is desirable). Once the guest is back up and responding,
the cluster reports the VM as Started.
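
For reference, the failure stays in the resource history after the recovery; it
can be inspected and then cleared with something like:

crm_mon -1rf                # one-shot status including fail counts
crm resource cleanup myvm   # clear the failure once the VM is healthy again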

I still have plenty more testing to do and will keep the list posted on how it
goes.

Re: [ClusterLabs] One Failed Resource = Failover the Cluster?

2021-06-06 Thread Strahil Nikolov
Based on the constraint rules you have mentioned, failure of mysql should not
cause a failover to another node. For better insight, you have to be able to
reproduce the issue and share the logs with the community.

Best Regards,
Strahil Nikolov
 
 
On Sat, Jun 5, 2021 at 23:33, Eric Robinson wrote:
> -Original Message-
> From: Users  On Behalf Of
> kgail...@redhat.com
> Sent: Friday, June 4, 2021 4:49 PM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> 
> Subject: Re: [ClusterLabs] One Failed Resource = Failover the Cluster?
>
> On Fri, 2021-06-04 at 19:10 +, Eric Robinson wrote:
> > Sometimes it seems like Pacemaker fails over an entire cluster when
> > only one resource has failed, even though no other resources are
> > dependent on it. Is that expected behavior?
> >
> > For example, suppose I have the following colocation constraints…
> >
> > filesystem with drbd master
> > vip with filesystem
> > mysql_01 with filesystem
> > mysql_02 with filesystem
> > mysql_03 with filesystem
>
> By default, a resource that is colocated with another resource will influence
> that resource's location. This ensures that as many resources are active as
> possible.
>
> So, if any one of the above resources fails and meets its migration-threshold,
> all of the resources will move to another node so a recovery attempt can be
> made for the failed resource.
>
> No resource will be *stopped* due to the failed resource unless it depends
> on it.
>

Thanks, but I'm confused by your previous two paragraphs. On one hand, "if any 
one of the above resources fails and meets its migration-threshold, all of the 
resources will move to another node." Obviously moving resources requires 
stopping them. But then, "No resource will be *stopped* due to the failed 
resource unless it depends on it." Those two statements seem contradictory to 
me. Not trying to be argumentative. Just trying to understand.

> As of the forthcoming 2.1.0 release, the new "influence" option for
> colocation constraints (and "critical" resource meta-attribute) controls
> whether this effect occurs. If influence is turned off (or the resource made
> non-critical), then the failed resource will just stop, and the other
> resources won't move to try to save it.
>

That sounds like the feature I'm waiting for. In the example configuration I 
provided, I would not want the failure of any mysql instance to cause cluster 
failover. I would only want the cluster to fail over if the filesystem or drbd 
resources failed. Basically, if a resource breaks or fails to stop, I don't 
want the whole cluster to fail over if nothing depends on that resource. Just 
let it stay down until someone can manually intervene. But if an underlying 
resource fails that everything else is dependent on (drbd or filesystem), then 
go ahead and fail over the cluster.
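
When 2.1.0 is available, I'm guessing the change on my side would be something
like this (pcs syntax as an example, untested), repeated for each mysql
instance, so that a failed instance just stops in place:

pcs resource meta mysql_02 critical=false   # hypothetical, per the description above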

> >
> > …and the following order constraints…
> >
> > promote drbd, then start filesystem
> > start filesystem, then start vip
> > start filesystem, then start mysql_01
> > start filesystem, then start mysql_02
> > start filesystem, then start mysql_03
> >
> > Now, if something goes wrong with mysql_02, will Pacemaker try to fail
> > over the whole cluster? And if mysql_02 can’t be run on either
> > cluster, then does Pacemaker refuse to run any resources?
> >
> > I’m asking because I’ve seen some odd behavior like that over the
> > years. Could be my own configuration mistakes, of course.
> >
> > -Eric
> --
> Ken Gaillot 
>
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/