Let me know if there is a better approach to the following problem. When the 
virtual machine does not respond to a state query, I want the cluster to kick it.

I could not find any useful docs for using the nagios plugins. After reading 
the documentation about running a custom script via the "monitor" function in 
the RA, I determined that it would not meet my requirements, as it's only run 
on start and migrate (unless I read it incorrectly?).

Here is what I did (I'm on Ubuntu 20.04):

cp /usr/lib/ocf/resource.d/heartbeat/VirtualDomain /usr/lib/ocf/resource.d/heartbeat/MyVirtDomain
cp /usr/share/resource-agents/ocft/configs/VirtualDomain /usr/share/resource-agents/ocft/configs/MyVirtDomain
sed -i 's/VirtualDomain/MyVirtDomain/g' /usr/lib/ocf/resource.d/heartbeat/MyVirtDomain
sed -i 's/VirtualDomain/MyVirtDomain/g' /usr/share/resource-agents/ocft/configs/MyVirtDomain

I edited the function *MyVirtDomain_status* in 
/usr/lib/ocf/resource.d/heartbeat/MyVirtDomain, adding the following to the 
status case *running|paused|idle|blocked|"in shutdown")*:

FROM
                        running|paused|idle|blocked|"in shutdown")
                                # running: domain is currently actively consuming cycles
                                # paused: domain is paused (suspended)
                                # idle: domain is running but idle
                                # blocked: synonym for idle used by legacy Xen versions
                                # in shutdown: the domain is in process of shutting down, but has not completely shutdown or crashed.

                                ocf_log debug "Virtual domain $DOMAIN_NAME is currently $status."
                                rc=$OCF_SUCCESS

TO
                        running|paused|idle|blocked|"in shutdown")
                                # running: domain is currently actively consuming cycles
                                # paused: domain is paused (suspended)
                                # idle: domain is running but idle
                                # blocked: synonym for idle used by legacy Xen versions
                                # in shutdown: the domain is in process of shutting down, but has not completely shutdown or crashed.
                                # additionally probe the guest agent; a non-zero exit marks the resource as failed
                                custom_chk=$(/path/to/myscript.sh -H "$DOMAIN_NAME" -C guest-get-time -l 25 -w 1)
                                custom_rc=$?
                                if [ ${custom_rc} -eq 0 ]; then
                                  ocf_log debug "Virtual domain $DOMAIN_NAME is currently $status."
                                  rc=$OCF_SUCCESS
                                else
                                  ocf_log debug "Virtual domain $DOMAIN_NAME is currently ${custom_chk}."
                                  rc=$OCF_ERR_GENERIC
                                fi

The custom script uses the qemu-guest-agent in my guest, passing the parameter 
to grab the guest's time (which seems to be the most universal [Windows, 
CentOS 6, Ubuntu, CentOS 7]). It runs up to 25 loops, sleeping 1 second between 
iterations. It exits 0 as soon as the agent responds with the time and exits 1 
after the 25th loop; based on the docs, these correspond to OCF_SUCCESS and 
OCF_ERR_GENERIC.
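
A minimal sketch of a retry loop along these lines, for anyone wanting to try the same idea. The actual myscript.sh is not shown in this post, so the function names check_guest and probe_agent and the PROBE override (used only to swap in a fake probe for testing) are assumptions made for illustration. This sketch always retries up to the loop limit and does not try to detect a domain that is not defined or not running.

```shell
#!/bin/sh
# Hypothetical sketch of a guest-agent retry wrapper; not the original myscript.sh.

# Ask the guest agent to run one command; virsh exits non-zero on error.
probe_agent() {
    # $1 = domain name, $2 = guest agent command (e.g. guest-get-time)
    virsh qemu-agent-command "$1" "{\"execute\":\"$2\"}" 2>&1
}

# Retry the probe up to $3 times, sleeping $4 seconds between attempts.
# Returns 0 on the first good answer, 1 once all attempts have failed.
check_guest() {
    domain=$1 cmd=$2 loops=$3 wait=$4
    i=1
    while [ "$i" -le "$loops" ]; do
        # PROBE is an assumed hook so the probe can be replaced in tests
        if out=$("${PROBE:-probe_agent}" "$domain" "$cmd"); then
            echo "[GOOD] - $domain virsh qemu-agent-command $cmd output: $out"
            return 0
        fi
        echo "[BAD] - $domain virsh qemu-agent-command $cmd output: $out"
        i=$((i + 1))
        # no need to sleep after the final attempt
        if [ "$i" -le "$loops" ]; then
            sleep "$wait"
        fi
    done
    return 1
}
```

With these definitions, `check_guest myvm guest-get-time 25 1` would correspond to the `-H myvm -C guest-get-time -l 25 -w 1` invocation used above.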

# /path/to/myscript.sh -H myvm -C guest-get-time -l 25 -w 1
[GOOD] - myvm virsh qemu-agent-command guest-get-time output: {"return":1623011582178375000}

or when it's not responding:
# /path/to/myscript.sh -H myvm -C guest-get-time -l 25 -w 1
[BAD] - myvm virsh qemu-agent-command guest-get-time output: error: Guest agent is not responding: QEMU guest agent is not connected
[BAD] - myvm virsh qemu-agent-command guest-get-time output: error: Guest agent is not responding: QEMU guest agent is not connected
[BAD] - myvm virsh qemu-agent-command guest-get-time output: error: Guest agent is not responding: QEMU guest agent is not connected
[BAD] - myvm virsh qemu-agent-command guest-get-time output: error: Guest agent is not responding: QEMU guest agent is not connected
... (exits after the 25th attempt, or as soon as the agent responds:)
[GOOD] - myvm virsh qemu-agent-command guest-get-time output: {"return":1623011582178375000}

and when the VM isn't running:
# /path/to/myscript.sh -H myvm -C guest-get-time -l 25 -w 1
[BAD] - myvm virsh qemu-agent-command guest-get-time output: error: failed to get domain 'myvm'

I updated my test VM to use the new RA and raised the status timeout from the 
default of 30s to 40s, just in case.

I'd like to be able to update the parameters to *myscript.sh* via crm configure 
edit at some point, but will figure that out later...
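
One possible direction, sketched under assumptions: Pacemaker passes a resource's instance parameters to an OCF agent as OCF_RESKEY_<name> environment variables, so the script arguments could be read from those with defaults. The parameter names guest_check_cmd, guest_check_loops and guest_check_wait and the helper guest_check_args are hypothetical; they are not part of the stock VirtualDomain RA, and they would likely also need to be declared in the agent's meta-data before the cluster tools accept them.

```shell
#!/bin/sh
# Hypothetical parameters (guest_check_*); fall back to the values
# currently hard-coded in the status check when they are not set.
: "${OCF_RESKEY_guest_check_cmd:=guest-get-time}"
: "${OCF_RESKEY_guest_check_loops:=25}"
: "${OCF_RESKEY_guest_check_wait:=1}"

# Build the argument string for myscript.sh from the resource parameters.
guest_check_args() {
    # $1 = domain name
    printf '%s\n' "-H $1 -C $OCF_RESKEY_guest_check_cmd -l $OCF_RESKEY_guest_check_loops -w $OCF_RESKEY_guest_check_wait"
}
```

Inside the status function, the call could then become something like `custom_chk=$(/path/to/myscript.sh $(guest_check_args "$DOMAIN_NAME"))`, with the inner substitution left unquoted on purpose so that it splits into separate arguments.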

My test:

I rebooted the VM from within the OS and hit escape so that I entered the boot 
mode prompt. After ~30 seconds the cluster decides the resource is having a 
problem, marks it as failed, and restarts the virtual machine (on the same 
node, which in my case is desirable). Once the guest is back up and responding, 
the cluster reports the VM as Started.

I still have plenty more testing to do and will keep the list posted on 
progress.

-Kyle

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐

On Thursday, May 27th, 2021 at 05:34, Kyle O'Donnell <ky...@0b10.mx> wrote:

> guest-get-fsinfo doesn't seem to work on older agents (CentOS 6); I've found 
> guest-get-time more universal.
>
> Also, found this helpful thread on using monitor_scripts which is part of the 
> VirtualDomain RA
>
> https://linux-ha-dev.linux-ha.narkive.com/yxvySDA2/monitor-scripts-parameter-for-the-virtualdomain-ra-was-re-linux-ha-ocf-resource-agent-for-kvm
>
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>
> On Sunday, May 16th, 2021 at 22:49, Kyle O'Donnell ky...@0b10.mx wrote:
>
> > I am thinking about using the qemu-guest-agent to run one of the available 
> > commands to determine the health of the OS inside
> >
> > virsh qemu-agent-command myvm --pretty '{"execute":"guest-get-fsinfo"}'
> >
> > https://qemu-project.gitlab.io/qemu/interop/qemu-ga-ref.html
> >
> > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> >
> > On Thursday, May 13th, 2021 at 01:28, Andrei Borzenkov arvidj...@gmail.com 
> > wrote:
> >
> > > On 03.05.2021 09:48, Ulrich Windl wrote:
> > >
> > > > > > > > Ken Gaillot kgail...@redhat.com wrote on 30.04.2021 at 16:57 in
> > > > > > > > message 3acef4bc31923fb019619c713300444c2dcd354a.ca...@redhat.com:
> > > > > > > >
> > > > > > > > On Fri, 2021-04-30 at 11:00 +0100, lejeczek wrote:
> > > > >
> > > > > > > Hi guys
> > > > > > >
> > > > > > > I'd like to ask around for thoughts & suggestions on any
> > > > > > > semi/official ways to monitor VirtualDomain.
> > > > > > > Something beyond what included RA does - such as actual
> > > > > > > health testing of and communication with VM's OS.
> > > > > > >
> > > > > > > many thanks, L.
> > > > >
> > > > > > This use case led to a Pacemaker feature many moons ago ...
> > > > > >
> > > > > > Pacemaker supports nagios plug-ins as a resource type (e.g.
> > > > > > nagios:check_apache_status). These are service checks usually used
> > > > > > with monitoring software such as nagios, icinga, etc.
> > > > > >
> > > > > > If the service being monitored is inside a VirtualDomain, named vm1
> > > > > > for example, you can configure the nagios resource with the resource
> > > > > > meta-attribute container="vm1". If the nagios check fails, Pacemaker
> > > > > > will restart vm1.
> > > >
> > > > > "check fails" means WARNING, CRITICAL, or UNKNOWN? ;-)
> > >
> > > > switch (rc) {
> > > >     case NAGIOS_STATE_OK:
> > > >         return PCMK_OCF_OK;
> > > >     case NAGIOS_INSUFFICIENT_PRIV:
> > > >         return PCMK_OCF_INSUFFICIENT_PRIV;
> > > >     case NAGIOS_NOT_INSTALLED:
> > > >         return PCMK_OCF_NOT_INSTALLED;
> > > >     case NAGIOS_STATE_WARNING:
> > > >     case NAGIOS_STATE_CRITICAL:
> > > >     case NAGIOS_STATE_UNKNOWN:
> > > >     case NAGIOS_STATE_DEPENDENT:
> > > >     default:
> > > >         return PCMK_OCF_UNKNOWN_ERROR;
> > > > }
> > > >
> > > > return PCMK_OCF_UNKNOWN_ERROR;
> > >
> > > Manage your subscription:
> > >
> > > https://lists.clusterlabs.org/mailman/listinfo/users
> > >
> > > ClusterLabs home: https://www.clusterlabs.org/