This seems to be a bug: when collectd does not respond (because it is waiting for a sudo password), OpenNebula does not move the host to ERROR. The probes are designed not to start another collectd process, but we should probably also check that the running one is actually working and, if it is not, send the ERROR message to OpenNebula.
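For the sake of discussion, the shepherd could treat the client as dead when no report has been produced for more than 3 monitor steps, and emit an ERROR record instead of staying silent. A minimal sketch of such a staleness test (the timestamp file path, the 60-second budget, and the helper name are my assumptions, not the shipped probe code):

```shell
# Sketch of a staleness check (hypothetical helper, not the actual probe).
# Arguments: last-report time and current time (epoch seconds), plus the
# maximum allowed gap. Succeeds (exit 0) when the data is stale.
is_stale() {
    last=$1
    now=$2
    max_gap=$3
    [ $((now - last)) -gt "$max_gap" ]
}

# Example: 3 monitor steps of 20s each => a 60s budget. The timestamp
# file is an assumed location where the client would record its last run.
if is_stale "$(cat /tmp/one-collectd-client.last 2>/dev/null || echo 0)" \
            "$(date +%s)" 60; then
    # A real probe would also check the collectd-client pid here and
    # kill a hung process so the shepherd can restart it.
    echo "ERROR MESSAGE=\"collectd client not reporting\""
fi
```

The point is that the check fails loudly on the front-end instead of leaving the host in a stale MONITORED state.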
Pointer to the issue: http://dev.opennebula.org/issues/3118

Cheers

On Wed, Jul 30, 2014 at 4:53 PM, Steven Timm <[email protected]> wrote:
> On Wed, 30 Jul 2014, Ruben S. Montero wrote:
>
>> Hi,
>>
>> 1.- monitor_ds.sh may use LVM commands (vgdisplay) that need sudo
>> access. It should be automatically set up by the opennebula node
>> packages.
>>
>> 2.- It is not a real daemon; the first time a host is monitored, a process
>> is left behind to periodically send information. OpenNebula
>> restarts it if no information is received in 3 monitor steps. Nothing
>> needs to be set up...
>>
>> Cheers
>>
> On further inspection I found that this collectd was running on my nodes,
> and obviously failing up until now because the sudoers was not set up
> correctly. But there was nothing to warn us about it. Nothing on
> the opennebula head node to even tell us that the information was stale.
> No log file on the node to show the errors we were getting. In short,
> it was just quietly dying and we had no idea. How do we make sure this
> doesn't happen again in the future?
>
> Steve Timm
>
>> On Wed, Jul 30, 2014 at 3:50 PM, Steven Timm <[email protected]> wrote:
>> On Wed, 30 Jul 2014, Ruben S. Montero wrote:
>>
>> Maybe you could try to execute the monitor probes in the node:
>>
>> 1. ssh to the node
>> 2. Go to /var/tmp/one/im
>> 3. Execute run_probes kvm-probes
>>
>> When I do that (using sh -x) I get the following:
>>
>> -bash-4.1$ sh -x ./run_probes kvm-probes
>> ++ dirname ./run_probes
>> + source ./../scripts_common.sh
>> ++ export LANG=C
>> ++ LANG=C
>> ++ export PATH=/bin:/sbin:/usr/bin:/usr/krb5/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin
>> ++ PATH=/bin:/sbin:/usr/bin:/usr/krb5/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin
>> ++ AWK=awk
>> ++ BASH=bash
>> ++ CUT=cut
>> ++ DATE=date
>> ++ DD=dd
>> ++ DF=df
>> ++ DU=du
>> ++ GREP=grep
>> ++ ISCSIADM=iscsiadm
>> ++ LVCREATE=lvcreate
>> ++ LVREMOVE=lvremove
>> ++ LVRENAME=lvrename
>> ++ LVS=lvs
>> ++ LN=ln
>> ++ MD5SUM=md5sum
>> ++ MKFS=mkfs
>> ++ MKISOFS=genisoimage
>> ++ MKSWAP=mkswap
>> ++ QEMU_IMG=qemu-img
>> ++ RADOS=rados
>> ++ RBD=rbd
>> ++ READLINK=readlink
>> ++ RM=rm
>> ++ SCP=scp
>> ++ SED=sed
>> ++ SSH=ssh
>> ++ SUDO=sudo
>> ++ SYNC=sync
>> ++ TAR=tar
>> ++ TGTADM=tgtadm
>> ++ TGTADMIN=tgt-admin
>> ++ TGTSETUPLUN=tgt-setup-lun-one
>> ++ TR=tr
>> ++ VGDISPLAY=vgdisplay
>> ++ VMKFSTOOLS=vmkfstools
>> ++ WGET=wget
>> +++ uname -s
>> ++ '[' xLinux = xLinux ']'
>> ++ SED='sed -r'
>> +++ basename ./run_probes
>> ++ SCRIPT_NAME=run_probes
>> + export LANG=C
>> + LANG=C
>> + HYPERVISOR_DIR=kvm-probes.d
>> + ARGUMENTS=kvm-probes
>> ++ dirname ./run_probes
>> + SCRIPTS_DIR=.
>> + cd .
>> ++ '[' -d kvm-probes.d ']'
>> ++ run_dir kvm-probes.d
>> ++ cd kvm-probes.d
>> +++ ls architecture.sh collectd-client-shepherd.sh cpu.sh kvm.rb monitor_ds.sh name.sh poll.sh version.sh
>> ++ for i in '`ls *`'
>> ++ '[' -x architecture.sh ']'
>> ++ ./architecture.sh kvm-probes
>> ++ EXIT_CODE=0
>> ++ '[' x0 '!=' x0 ']'
>> ++ for i in '`ls *`'
>> ++ '[' -x collectd-client-shepherd.sh ']'
>> ++ ./collectd-client-shepherd.sh kvm-probes
>> ++ EXIT_CODE=0
>> ++ '[' x0 '!=' x0 ']'
>> ++ for i in '`ls *`'
>> ++ '[' -x cpu.sh ']'
>> ++ ./cpu.sh kvm-probes
>> ++ EXIT_CODE=0
>> ++ '[' x0 '!=' x0 ']'
>> ++ for i in '`ls *`'
>> ++ '[' -x kvm.rb ']'
>> ++ ./kvm.rb kvm-probes
>> ++ EXIT_CODE=0
>> ++ '[' x0 '!=' x0 ']'
>> ++ for i in '`ls *`'
>> ++ '[' -x monitor_ds.sh ']'
>> ++ ./monitor_ds.sh kvm-probes
>> [sudo] password for oneadmin:
>>
>> and it stays hung on the password for oneadmin.
>>
>> What's going on?
>>
>> Also, you mentioned a collectd -- are you saying that OpenNebula 4.6
>> now needs to run a daemon on every single VM host?
>> Where is it documented how to set it up?
>>
>> Steve
>>
>> Make sure you do not have a host using the same hostname
>> fgtest14 and running a collectd process
>>
>> On Jul 29, 2014 4:35 PM, "Steven Timm" <[email protected]> wrote:
>>
>> I am still trying to debug a nasty monitoring inconsistency.
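As an aside on the sudo hang above: that prompt means the node is missing the sudoers entry the opennebula-node packages are supposed to install. Something along these lines restores non-interactive vgdisplay for oneadmin (the file path and command path are my assumptions; check what your distribution's package actually ships):

```
# /etc/sudoers.d/opennebula -- assumed location; edit with visudo -f
Defaults:oneadmin !requiretty
oneadmin ALL=(ALL) NOPASSWD: /sbin/vgdisplay
```

A handy way to check for this from a script is `sudo -n vgdisplay`, which fails immediately with a non-zero exit status instead of hanging on a password prompt.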
>>
>> -bash-4.1$ onevm list | grep fgtest14
>>   26 oneadmin oneadmin fgt6x4-26    runn    6    4G fgtest14 117d 19h50
>>   27 oneadmin oneadmin fgt5x4-27    runn   10    4G fgtest14 117d 17h57
>>   28 oneadmin oneadmin fgt1x1-28    runn   10  4.1G fgtest14 117d 16h59
>>   30 oneadmin oneadmin fgt5x1-30    runn    0    4G fgtest14 116d 23h50
>>   33 oneadmin oneadmin ip6sl5vda-33 runn    6    4G fgtest14 116d 19h57
>>
>> -bash-4.1$ onehost list
>>   ID NAME     CLUSTER  RVM  ALLOCATED_CPU     ALLOCATED_MEM     STAT
>>    3 fgtest11 ipv6       0  0 / 400 (0%)      0K / 15.7G (0%)   on
>>    4 fgtest12 ipv6       0  0 / 400 (0%)      0K / 15.7G (0%)   on
>>    7 fgtest13 ipv6       0  0 / 800 (0%)      0K / 23.6G (0%)   on
>>    8 fgtest14 ipv6       5  0 / 800 (0%)      0K / 23.6G (0%)   on
>>    9 fgtest20 ipv6       3  300 / 800 (37%)   12G / 31.4G (38%) on
>>   11 fgtest19 ipv6       0  0 / 800 (0%)      0K / 31.5G (0%)   on
>>
>> -bash-4.1$ onehost show 8
>> HOST 8 INFORMATION
>> ID                    : 8
>> NAME                  : fgtest14
>> CLUSTER               : ipv6
>> STATE                 : MONITORED
>> IM_MAD                : kvm
>> VM_MAD                : kvm
>> VN_MAD                : dummy
>> LAST MONITORING TIME  : 07/29 09:25:45
>>
>> HOST SHARES
>> TOTAL MEM             : 23.6G
>> USED MEM (REAL)       : 876.4M
>> USED MEM (ALLOCATED)  : 0K
>> TOTAL CPU             : 800
>> USED CPU (REAL)       : 0
>> USED CPU (ALLOCATED)  : 0
>> RUNNING VMS           : 5
>>
>> LOCAL SYSTEM DATASTORE #102 CAPACITY
>> TOTAL:                : 548.8G
>> USED:                 : 175.3G
>> FREE:                 : 345.6G
>>
>> MONITORING INFORMATION
>> ARCH="x86_64"
>> CPUSPEED="2992"
>> HOSTNAME="fgtest14.fnal.gov"
>> HYPERVISOR="kvm"
>> MODELNAME="Intel(R) Xeon(R) CPU E5450 @ 3.00GHz"
>> NETRX="234844577"
>> NETTX="21553126"
>> RESERVED_CPU=""
>> RESERVED_MEM=""
>> VERSION="4.6.0"
>>
>> VIRTUAL MACHINES
>>
>>   ID USER     GROUP    NAME         STAT UCPU  UMEM HOST     TIME
>>   26 oneadmin oneadmin fgt6x4-26    runn    6    4G fgtest14 117d 19h50
>>   27 oneadmin oneadmin fgt5x4-27    runn   10    4G fgtest14 117d 17h57
>>   28 oneadmin oneadmin fgt1x1-28    runn   10  4.1G fgtest14 117d 17h00
>>   30 oneadmin oneadmin fgt5x1-30    runn    0    4G fgtest14 116d 23h50
>>   33 oneadmin oneadmin ip6sl5vda-33 runn    6    4G fgtest14 116d 19h57
>> -----------------------------------------------------------------------------
>>
>> All of this looks great, right?
>> Just one problem: there are no VMs running on fgtest14, and
>> there haven't been for 4 days.
>>
>> [root@fgtest14 ~]# virsh list
>>  Id Name                 State
>> ----------------------------------------------------
>>
>> [root@fgtest14 ~]#
>>
>> -------------------------------------------------------------------------
>> Yet the monitoring reports no errors:
>>
>> Tue Jul 29 09:28:10 2014 [InM][D]: Host fgtest14 (8) successfully monitored.
>>
>> -----------------------------------------------------------------------------
>> At the same time, there is no evidence that ONE is actually trying to,
>> or succeeding in, monitoring these five VMs, yet they are still stuck in
>> "runn", which means I can't do a onevm restart to restart them.
>> (The VM images of these 5 VMs are still out there on the VM host, and
>> I would like to save and restart them if I can.)
>>
>> What is the remotes command that ONE 4.6 would use to monitor this host?
>> Can I do it manually and see what output I get?
>>
>> Are we dealing with some kind of a bug, or just a very confused system?
>> Any help is appreciated. I have to get this sorted out before
>> I dare deploy one4.x in production.
>>
>> Steve Timm
>>
>> ------------------------------------------------------------------
>> Steven C. Timm, Ph.D (630) 840-8525
>> [email protected] http://home.fnal.gov/~timm/
>> Fermilab Scientific Computing Division, Scientific Computing Services Quad.
>> Grid and Cloud Services Dept., Associate Dept. Head for Cloud Computing
>> _______________________________________________
>> Users mailing list
>> [email protected]
>> http://lists.opennebula.org/listinfo.cgi/users-opennebula.org
>>
>> --
>> --
>> Ruben S. Montero, PhD
>> Project co-Lead and Chief Architect OpenNebula - Flexible Enterprise Cloud Made Simple
>> www.OpenNebula.org | [email protected] | @OpenNebula
>>
> ------------------------------------------------------------------
> Steven C. Timm, Ph.D (630) 840-8525
> [email protected] http://home.fnal.gov/~timm/
> Fermilab Scientific Computing Division, Scientific Computing Services Quad.
> Grid and Cloud Services Dept., Associate Dept. Head for Cloud Computing

--
--
Ruben S. Montero, PhD
Project co-Lead and Chief Architect OpenNebula - Flexible Enterprise Cloud Made Simple
www.OpenNebula.org | [email protected] | @OpenNebula
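Back at the top level: a quick way to catch this kind of inconsistency from the front-end is to compare the RUNNING VMS counter OpenNebula reports with what libvirt actually sees on the node. A sketch under stated assumptions (the helper names are mine; it relies only on the `onehost show` text layout quoted above and on `virsh list --name` printing one VM name per line):

```shell
# Count non-empty lines; virsh list --name prints one running VM per line.
count_names() {
    grep -c .
}

# Extract the RUNNING VMS value from `onehost show` text output.
running_vms() {
    awk -F: '/RUNNING VMS/ { gsub(/[[:space:]]/, "", $2); print $2 }'
}

# Hypothetical usage on the front-end (hostname is an example):
#   one=$(onehost show fgtest14 | running_vms)
#   real=$(ssh fgtest14 virsh list --name | count_names)
#   [ "$one" = "$real" ] || echo "MISMATCH: OpenNebula=$one libvirt=$real"
```

Run periodically from cron, a check like this would have flagged the fgtest14 situation (5 reported, 0 actually running) days earlier.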
