Hi! I wonder: Shouldn't "OCF_RESOURCE_INSTANCE" help you to identify what is going to be monitored? (Reasonable naming assumed ;-))
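To illustrate what I mean, here is a minimal sketch of how a script listed in monitor_scripts might pick up the name from the environment. It assumes the Pacemaker resource ID is identical to the libvirt domain name (that is the "reasonable naming" part), and the helper name domain_from_env is made up for this example. Stripping a ":N" suffix would only matter if the resource were ever cloned.

```shell
#!/bin/sh
# Sketch only: derive the domain name from the environment that the
# resource agent passes on to monitor_scripts.
# Assumption: resource ID == libvirt domain name.
# domain_from_env is a hypothetical helper name, not part of any RA.
domain_from_env() {
    name=${OCF_RESOURCE_INSTANCE:-}
    # Drop a clone instance suffix such as ":0", if there is one.
    name=${name%%:*}
    if [ -z "$name" ]; then
        echo "OCF_RESOURCE_INSTANCE is not set" >&2
        return 1
    fi
    printf '%s\n' "$name"
}
```

A check script could then run DOMAIN_NAME=$(domain_from_env) || exit 1 before calling /path/to/myscript.sh -H "$DOMAIN_NAME" -C guest-get-time -l 25 -w 1, so that one script can serve every VirtualDomain resource.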
Regards,
Ulrich

>>> Kyle O'Donnell <ky...@0b10.mx> wrote on 26.10.2021 at 13:53 in message
<uNHregOAnWaFxn5xMCQhuxDLxi-E_norRLuSuhxjZTewjFtRwmq_hVWHwJN4Lo4ybtKSmwYqAbkk9zf 57Gp0Hmww903ZC09_P2QrHeneW0=@0b10.mx>:
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Tuesday, October 26th, 2021 at 03:04, Klaus Wenninger <kwenn...@redhat.com> wrote:
>
>> On Mon, Oct 25, 2021 at 9:34 PM Kyle O'Donnell <ky...@0b10.mx> wrote:
>>
>>> Finally got around to working on this.
>>>
>>> I spoke with someone on the #clusterlabs IRC channel who mentioned that the
>>> monitor_scripts param does indeed run at some frequency (op monitor timeout=?
>>> interval=?), not just during the "start" and "migrate_from" actions.
>>>
>>> The monitor_scripts param does not support scripts with command line args,
>>> just a space delimited list for running multiple scripts. This means that
>>> each VirtualDomain resource needs its own script to be able to define the
>>> ${DOMAIN_NAME}. I found that a bit annoying so I created a symlink to a
>>> wrapper script using the ${DOMAIN_NAME} as the first part of the filename and
>>> a separator for awk:
>>
>> The scripts being called by the monitor operation should inherit the
>> environment from the monitor so that you should be able to use these
>> variables.
>>
>> Klaus
>
> Thanks!
>
> I tried referencing the ${DOMAIN_NAME} variable initially but that did not
> work. I tried running the function that creates the variable
> (VirtualDomain_getconfig); it also did not work.
>
> After some debugging it looks like the following variables are available
> from the parent script:
> error output [ OCF_ROOT=/usr/lib/ocf ]
> error output [ OCF_RESKEY_crm_feature_set=3.2.1 ]
> error output [ HA_LOGFACILITY=daemon ]
> error output [ PCMK_debug=0 ]
> error output [ HA_debug=0 ]
> error output [ PWD=/var/lib/pacemaker/cores ]
> error output [ OCF_RESKEY_hypervisor=qemu:///system ]
> error output [ HA_logfile=/var/log/pacemaker/pacemaker.log ]
> error output [ HA_logfacility=daemon ]
> error output [ OCF_EXIT_REASON_PREFIX=ocf-exit-reason: ]
> error output [ OCF_RESOURCE_PROVIDER=heartbeat ]
> error output [ PCMK_service=pacemaker-execd ]
> error output [ PCMK_mcp=true ]
> error output [ OCF_RESKEY_monitor_scripts=/path/to/myvmhostname____wrap_check.sh ]
> error output [ OCF_RA_VERSION_MAJOR=1 ]
> error output [ VALGRIND_OPTS=--leak-check=full --trace-children=no --vgdb=no --num-callers=25 --log-file=/var/lib/pacemaker/valgrind-%p --suppressions=/usr/share/pacemaker/tests/valgrind-pcmk.suppressions --gen-suppressions=all ]
> error output [ HA_cluster_type=corosync ]
> error output [ INVOCATION_ID=652062571c8f415a9a7a228c5ad77b20 ]
> error output [ OCF_RESKEY_CRM_meta_interval=10000 ]
> error output [ OCF_RESOURCE_INSTANCE=myvmhostname ]
> error output [ HA_quorum_type=corosync ]
> error output [ OCF_RA_VERSION_MINOR=0 ]
> error output [ HA_mcp=true ]
> error output [ OCF_RESKEY_config=/path/to/myvmhostname/myvmhostname.xml ]
> error output [ PCMK_quorum_type=corosync ]
> error output [ OCF_RESKEY_CRM_meta_name=monitor ]
> error output [ OCF_RESKEY_migration_transport=ssh ]
> error output [ SHLVL=1 ]
> error output [ OCF_RESKEY_CRM_meta_on_node=node02 ]
> error output [ PCMK_watchdog=false ]
> error output [ PCMK_logfile=/var/log/pacemaker/pacemaker.log ]
> error output [ OCF_RESKEY_CRM_meta_timeout=40000 ]
> error output [ OCF_RESOURCE_TYPE=VirtualDomain ]
> error output [ PCMK_logfacility=daemon ]
> error output [ LC_ALL=C ]
> error output [ HA_LOGFILE=/var/log/pacemaker/pacemaker.log ]
> error output [ JOURNAL_STREAM=9:42440 ]
> error output [ OCF_RESKEY_CRM_meta_on_node_uuid=2 ]
> error output [ PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/ucb ]
> error output [ OCF_RESKEY_force_stop=false ]
> error output [ PCMK_cluster_type=corosync ]
> error output [ _=/usr/bin/env ]
>
> The most helpful variable is:
> error output [ OCF_RESKEY_config=/path/to/myvmhostname/myvmhostname.xml ]
>
> So I copied part of the "VirtualDomain_getconfig" function from the resource
> script to populate the variable in the same way:
> DOMAIN_NAME=`egrep '[[:space:]]*<name>.*</name>[[:space:]]*$' ${OCF_RESKEY_config} 2>/dev/null | sed -e 's/[[:space:]]*<name>\(.*\)<\/name>[[:space:]]*$/\1/'`
>
> and now it's working without the hacky symlink
>
>>> ln -s /path/to/wrapper_script.sh /path/to/wrapper/myvmhostname_____wrapper_script.sh
>>>
>>> and in my wrapper_script.sh:
>>> #!/bin/bash
>>> DOMAIN_NAME=$(basename "$0" |awk -F'____' '{print $1}')
>>> /path/to/myscript.sh -H ${DOMAIN_NAME} -C guest-get-time -l 25 -w 1
>>>
>>> (a bit hack-y but better than creating 1 script per vm resource and
>>> modifying it with the ${DOMAIN_NAME})
>>>
>>> Then creating the cluster resource:
>>> pcs resource create myvmhostname VirtualDomain \
>>>     config="/path/to/myvmhostname/myvmhostname.xml" hypervisor="qemu:///system" \
>>>     migration_transport="ssh" force_stop="false" \
>>>     monitor_scripts="/path/to/wrapper/myvmhostname_____wrapper_script.sh" \
>>>     meta allow-migrate="true" target-role="Stopped" \
>>>     op migrate_from timeout=90s interval=0s \
>>>     op migrate_to timeout=120s interval=0s \
>>>     op monitor timeout=40s interval=10s \
>>>     op start timeout=90s interval=0s \
>>>     op stop timeout=90s interval=0s
>>>
>>> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>>>
>>> On Sunday, June 6th, 2021 at 16:56, Kyle O'Donnell <ky...@0b10.mx> wrote:
>>>
>>>> Let me know if there is a better approach to the following problem.
>>>> When the virtual machine does not respond to a state query I want the
>>>> cluster to kick it.
>>>>
>>>> I could not find any useful docs for using the nagios plugins. After reading
>>>> the documentation about running a custom script via the "monitor" function in
>>>> the RA I determined that would not meet my requirements as it's only run on
>>>> start and migrate (unless I read it incorrectly?).
>>>>
>>>> Here is what I did (I'm on Ubuntu 20.04):
>>>>
>>>> cp /usr/lib/ocf/resource.d/heartbeat/VirtualDomain /usr/lib/ocf/resource.d/heartbeat/MyVirtDomain
>>>>
>>>> cp /usr/share/resource-agents/ocft/configs/VirtualDomain /usr/share/resource-agents/ocft/configs/MyVirtDomain
>>>>
>>>> sed -i 's/VirtualDomain/MyVirtDomain/g' /usr/lib/ocf/resource.d/heartbeat/MyVirtDomain
>>>>
>>>> sed -i 's/VirtualDomain/MyVirtDomain/g' /usr/share/resource-agents/ocft/configs/MyVirtDomain
>>>>
>>>> edited function MyVirtDomain_status in
>>>> /usr/lib/ocf/resource.d/heartbeat/MyVirtDomain, adding the following to the
>>>> status case running|paused|idle|blocked|"in shutdown")
>>>>
>>>> FROM
>>>>
>>>> running|paused|idle|blocked|"in shutdown")
>>>>     # running: domain is currently actively consuming cycles
>>>>     # paused: domain is paused (suspended)
>>>>     # idle: domain is running but idle
>>>>     # blocked: synonym for idle used by legacy Xen versions
>>>>     # in shutdown: the domain is in process of shutting down, but has not
>>>>     # completely shutdown or crashed.
>>>>     ocf_log debug "Virtual domain $DOMAIN_NAME is currently $status."
>>>>     rc=$OCF_SUCCESS
>>>>
>>>> TO
>>>>
>>>> running|paused|idle|blocked|"in shutdown")
>>>>     # running: domain is currently actively consuming cycles
>>>>     # paused: domain is paused (suspended)
>>>>     # idle: domain is running but idle
>>>>     # blocked: synonym for idle used by legacy Xen versions
>>>>     # in shutdown: the domain is in process of shutting down, but has not
>>>>     # completely shutdown or crashed.
>>>>
>>>>     custom_chk=$(/path/to/myscript.sh -H $DOMAIN_NAME -C guest-get-time -l 25 -w 1)
>>>>     custom_rc=$?
>>>>     if [ ${custom_rc} -eq 0 ]; then
>>>>         ocf_log debug "Virtual domain $DOMAIN_NAME is currently $status."
>>>>         rc=$OCF_SUCCESS
>>>>     else
>>>>         ocf_log debug "Virtual domain $DOMAIN_NAME is currently ${custom_chk}."
>>>>         rc=$OCF_ERR_GENERIC
>>>>     fi
>>>>
>>>> The custom script uses the qemu-guest-agent in my guest, passing the
>>>> parameter to grab the guest's time (seems to be most universal [windows,
>>>> centos6, ubuntu, centos 7]). It runs 25 loops, sleeps 1 second between
>>>> iterations, exits 0 as soon as the agent responds with the time and exits 1
>>>> after the 25th loop, which are OCF_SUCCESS and OCF_ERR_GENERIC based on docs.
>>>>
>>>> /path/to/myscript.sh -H myvm -C guest-get-time -l 25 -w 1
>>>> =========================================================
>>>>
>>>> [GOOD] - myvm virsh qemu-agent-command guest-get-time output: {"return":1623011582178375000}
>>>>
>>>> or when it's not responding:
>>>>
>>>> /path/to/myscript.sh -H myvm -C guest-get-time -l 25 -w 1
>>>> =========================================================
>>>>
>>>> [BAD] - myvm virsh qemu-agent-command guest-get-time output: error: Guest agent is not responding: QEMU guest agent is not connected
>>>> [BAD] - myvm virsh qemu-agent-command guest-get-time output: error: Guest agent is not responding: QEMU guest agent is not connected
>>>> [BAD] - myvm virsh qemu-agent-command guest-get-time output: error: Guest agent is not responding: QEMU guest agent is not connected
>>>> [BAD] - myvm virsh qemu-agent-command guest-get-time output: error: Guest agent is not responding: QEMU guest agent is not connected
>>>>
>>>> ... (exits after the 25th loop, or as soon as it sees)
>>>>
>>>> [GOOD] - myvm virsh qemu-agent-command guest-get-time output: {"return":1623011582178375000}
>>>>
>>>> and when the vm isn't running:
>>>>
>>>> /path/to/myscript.sh -H myvm -C guest-get-time -l 25 -w 1
>>>> =========================================================
>>>>
>>>> [BAD] - myvm virsh qemu-agent-command guest-get-time output: error: failed to get domain 'myvm'
>>>>
>>>> I updated my test vm to use the new RA, updated the status timeout to 40s
>>>> from default of 30s just in case.
>>>>
>>>> I'd like to be able to update the parameters to myscript.sh via crm
>>>> configure edit at some point, but will figure that out later...
>>>>
>>>> My test:
>>>>
>>>> reboot the VM from within the OS, hit escape so that I enter the boot mode
>>>> prompt... after ~30 seconds the cluster decides the resource is having a
>>>> problem, marks it as failed, and restarts the virtual machine (on the same
>>>> node -- which in my case is desirable), once the guest is back up and
>>>> responding the cluster reports the VM as Started
>>>>
>>>> I still have plenty more testing to do and will keep the list posted on
>>>> progress.
>>>>
>>>> -Kyle
>>>>
>>>> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>>>>
>>>> On Thursday, May 27th, 2021 at 05:34, Kyle O'Donnell ky...@0b10.mx wrote:
>>>>
>>>> > guest-get-fsinfo doesn't seem to work on older agents (centos6); I've found
>>>> > guest-get-time more universal.
>>>> >
>>>> > Also, found this helpful thread on using monitor_scripts which is part of
>>>> > the VirtualDomain RA
>>>> >
>>>> > https://linux-ha-dev.linux-ha.narkive.com/yxvySDA2/monitor-scripts-parameter-for-the-virtualdomain-ra-was-re-linux-ha-ocf-resource-agent-for-kvm
>>>> >
>>>> > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>>>> >
>>>> > On Sunday, May 16th, 2021 at 22:49, Kyle O'Donnell ky...@0b10.mx wrote:
>>>> >
>>>> > > I am thinking about using the qemu-guest-agent to run one of the available
>>>> > > commands to determine the health of the OS inside
>>>> > >
>>>> > > virsh qemu-agent-command myvm --pretty '{"execute":"guest-get-fsinfo"}'
>>>> > >
>>>> > > https://qemu-project.gitlab.io/qemu/interop/qemu-ga-ref.html
>>>> > >
>>>> > > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>>>> > >
>>>> > > On Thursday, May 13th, 2021 at 01:28, Andrei Borzenkov arvidj...@gmail.com wrote:
>>>> > >
>>>> > > > On 03.05.2021 09:48, Ulrich Windl wrote:
>>>> > > >
>>>> > > > > Ken Gaillot kgail...@redhat.com wrote on 30.04.2021 at 16:57 in message
>>>> > > > > 3acef4bc31923fb019619c713300444c2dcd354a.ca...@redhat.com:
>>>> > > > >
>>>> > > > > > On Fri, 2021‑04‑30 at 11:00 +0100, lejeczek wrote:
>>>> > > > > >
>>>> > > > > > > Hi guys
>>>> > > > > > >
>>>> > > > > > > I'd like to ask around for thoughts & suggestions on any
>>>> > > > > > > semi/official ways to monitor VirtualDomain.
>>>> > > > > > > Something beyond what included RA does ‑ such as actual
>>>> > > > > > > health testing of and communication with VM's OS.
>>>> > > > > > >
>>>> > > > > > > many thanks, L.
>>>> > > > > >
>>>> > > > > > This use case led to a Pacemaker feature many moons ago ...
>>>> > > > > >
>>>> > > > > > Pacemaker supports nagios plug‑ins as a resource type (e.g.
>>>> > > > > > nagios:check_apache_status). These are service checks usually used with
>>>> > > > > > monitoring software such as nagios, icinga, etc.
>>>> > > > > >
>>>> > > > > > If the service being monitored is inside a VirtualDomain, named vm1 for
>>>> > > > > > example, you can configure the nagios resource with the resource
>>>> > > > > > meta‑attribute container="vm1". If the nagios check fails, Pacemaker will
>>>> > > > > > restart vm1.
>>>> > > > >
>>>> > > > > "check fails" means WARNING, CRITICAL, or UNKNOWN? ;-)
>>>> > > >
>>>> > > > switch (rc) {
>>>> > > >     case NAGIOS_STATE_OK:
>>>> > > >         return PCMK_OCF_OK;
>>>> > > >     case NAGIOS_INSUFFICIENT_PRIV:
>>>> > > >         return PCMK_OCF_INSUFFICIENT_PRIV;
>>>> > > >     case NAGIOS_NOT_INSTALLED:
>>>> > > >         return PCMK_OCF_NOT_INSTALLED;
>>>> > > >     case NAGIOS_STATE_WARNING:
>>>> > > >     case NAGIOS_STATE_CRITICAL:
>>>> > > >     case NAGIOS_STATE_UNKNOWN:
>>>> > > >     case NAGIOS_STATE_DEPENDENT:
>>>> > > >     default:
>>>> > > >         return PCMK_OCF_UNKNOWN_ERROR;
>>>> > > > }
>>>> > > >
>>>> > > > return PCMK_OCF_UNKNOWN_ERROR;
>>>> > > >
>>>> > > > Manage your subscription:
>>>> > > > https://lists.clusterlabs.org/mailman/listinfo/users
>>>> > > >
>>>> > > > ClusterLabs home: https://www.clusterlabs.org/

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/