Sorry for the noise, just saw it in the other thread...
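To make the check proposed below (issue #3118) concrete: the shepherd would need some way to tell a live collectd-client from a wedged one, e.g. by checking that its state file is still being updated. A minimal sketch, assuming a state-file path and a staleness threshold that are illustrative only (this is not the shipped probe code):

```shell
#!/bin/sh
# Sketch of a collectd-client freshness check. The state-file location and
# the 90s threshold (~3 monitor steps) are assumptions, not OpenNebula's.

MAX_AGE=90   # seconds without an update before we call the client dead

# is_fresh FILE -> success if FILE exists and was modified within MAX_AGE s
is_fresh() {
    [ -f "$1" ] || return 1
    now=$(date +%s)
    mtime=$(stat -c %Y "$1" 2>/dev/null) || return 1
    [ $((now - mtime)) -le "$MAX_AGE" ]
}

# Demo on a temporary stand-in for the collectd state file
state=$(mktemp)
if is_fresh "$state"; then
    echo "collectd-client OK"
else
    # Report ERROR instead of letting the host silently go stale
    echo "ERROR: stale or missing state file $state" >&2
fi
rm -f "$state"
```

A shepherd that emitted the ERROR branch to oned would have surfaced the dead client in this thread instead of leaving the host MONITORED.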
On Wed, Jul 30, 2014 at 5:01 PM, Ruben S. Montero <[email protected]> wrote:

> BTW, could you paste the output of the run_probes command once it finishes?
>
> On Wed, Jul 30, 2014 at 4:58 PM, Ruben S. Montero <[email protected]> wrote:
>
>> This seems to be a bug: when collectd does not respond (because it is
>> waiting for a sudo password), OpenNebula does not move the host to ERROR.
>> The probes are designed not to start another collectd process, but we
>> should probably check that a running one is actually working and send the
>> ERROR message to OpenNebula.
>>
>> Pointer to the issue:
>> http://dev.opennebula.org/issues/3118
>>
>> Cheers
>>
>> On Wed, Jul 30, 2014 at 4:53 PM, Steven Timm <[email protected]> wrote:
>>
>>> On Wed, 30 Jul 2014, Ruben S. Montero wrote:
>>>
>>>> Hi,
>>>>
>>>> 1.- monitor_ds.sh may use LVM commands (vgdisplay) that need sudo
>>>> access. This should be set up automatically by the opennebula-node
>>>> packages.
>>>>
>>>> 2.- It is not a real daemon: the first time a host is monitored, a
>>>> process is left behind to periodically send information. OpenNebula
>>>> restarts it if no information is received in 3 monitor steps. Nothing
>>>> needs to be set up...
>>>>
>>>> Cheers
>>>
>>> On further inspection I found that this collectd was running on my
>>> nodes, and had obviously been failing up until now because sudoers was
>>> not set up correctly. But there was nothing to warn us about it: nothing
>>> on the opennebula head node to even tell us that the information was
>>> stale, and no log file on the node to show the errors we were getting.
>>> In short, it was just quietly dying and we had no idea. How do we make
>>> sure this doesn't happen again in the future?
>>>
>>> Steve Timm
>>>
>>>> On Wed, Jul 30, 2014 at 3:50 PM, Steven Timm <[email protected]> wrote:
>>>> On Wed, 30 Jul 2014, Ruben S. Montero wrote:
>>>>
>>>> Maybe you could try to execute the monitor probes in the node:
>>>>
>>>> 1. ssh to the node
>>>> 2. Go to /var/tmp/one/im
>>>> 3. Execute run_probes kvm-probes
>>>>
>>>> When I do that (using sh -x) I get the following:
>>>>
>>>> -bash-4.1$ sh -x ./run_probes kvm-probes
>>>> ++ dirname ./run_probes
>>>> + source ./../scripts_common.sh
>>>> ++ export LANG=C
>>>> ++ LANG=C
>>>> ++ export PATH=/bin:/sbin:/usr/bin:/usr/krb5/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin
>>>> ++ PATH=/bin:/sbin:/usr/bin:/usr/krb5/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin
>>>> ++ AWK=awk
>>>> ++ BASH=bash
>>>> ++ CUT=cut
>>>> ++ DATE=date
>>>> ++ DD=dd
>>>> ++ DF=df
>>>> ++ DU=du
>>>> ++ GREP=grep
>>>> ++ ISCSIADM=iscsiadm
>>>> ++ LVCREATE=lvcreate
>>>> ++ LVREMOVE=lvremove
>>>> ++ LVRENAME=lvrename
>>>> ++ LVS=lvs
>>>> ++ LN=ln
>>>> ++ MD5SUM=md5sum
>>>> ++ MKFS=mkfs
>>>> ++ MKISOFS=genisoimage
>>>> ++ MKSWAP=mkswap
>>>> ++ QEMU_IMG=qemu-img
>>>> ++ RADOS=rados
>>>> ++ RBD=rbd
>>>> ++ READLINK=readlink
>>>> ++ RM=rm
>>>> ++ SCP=scp
>>>> ++ SED=sed
>>>> ++ SSH=ssh
>>>> ++ SUDO=sudo
>>>> ++ SYNC=sync
>>>> ++ TAR=tar
>>>> ++ TGTADM=tgtadm
>>>> ++ TGTADMIN=tgt-admin
>>>> ++ TGTSETUPLUN=tgt-setup-lun-one
>>>> ++ TR=tr
>>>> ++ VGDISPLAY=vgdisplay
>>>> ++ VMKFSTOOLS=vmkfstools
>>>> ++ WGET=wget
>>>> +++ uname -s
>>>> ++ '[' xLinux = xLinux ']'
>>>> ++ SED='sed -r'
>>>> +++ basename ./run_probes
>>>> ++ SCRIPT_NAME=run_probes
>>>> + export LANG=C
>>>> + LANG=C
>>>> + HYPERVISOR_DIR=kvm-probes.d
>>>> + ARGUMENTS=kvm-probes
>>>> ++ dirname ./run_probes
>>>> + SCRIPTS_DIR=.
>>>> + cd .
>>>> ++ '[' -d kvm-probes.d ']'
>>>> ++ run_dir kvm-probes.d
>>>> ++ cd kvm-probes.d
>>>> +++ ls architecture.sh collectd-client-shepherd.sh cpu.sh kvm.rb monitor_ds.sh name.sh poll.sh version.sh
>>>> ++ for i in '`ls *`'
>>>> ++ '[' -x architecture.sh ']'
>>>> ++ ./architecture.sh kvm-probes
>>>> ++ EXIT_CODE=0
>>>> ++ '[' x0 '!=' x0 ']'
>>>> ++ for i in '`ls *`'
>>>> ++ '[' -x collectd-client-shepherd.sh ']'
>>>> ++ ./collectd-client-shepherd.sh kvm-probes
>>>> ++ EXIT_CODE=0
>>>> ++ '[' x0 '!=' x0 ']'
>>>> ++ for i in '`ls *`'
>>>> ++ '[' -x cpu.sh ']'
>>>> ++ ./cpu.sh kvm-probes
>>>> ++ EXIT_CODE=0
>>>> ++ '[' x0 '!=' x0 ']'
>>>> ++ for i in '`ls *`'
>>>> ++ '[' -x kvm.rb ']'
>>>> ++ ./kvm.rb kvm-probes
>>>> ++ EXIT_CODE=0
>>>> ++ '[' x0 '!=' x0 ']'
>>>> ++ for i in '`ls *`'
>>>> ++ '[' -x monitor_ds.sh ']'
>>>> ++ ./monitor_ds.sh kvm-probes
>>>> [sudo] password for oneadmin:
>>>>
>>>> and it stays hung on the password for oneadmin.
>>>>
>>>> What's going on?
>>>>
>>>> Also, you mentioned a collectd -- are you saying that OpenNebula 4.6
>>>> now needs to run a daemon on every single VM host? Where is it
>>>> documented how to set that up?
>>>>
>>>> Steve
>>>>
>>>> Make sure you do not have a host using the same hostname fgtest14 and
>>>> running a collectd process
>>>>
>>>> On Jul 29, 2014 4:35 PM, "Steven Timm" <[email protected]> wrote:
>>>>
>>>> I am still trying to debug a nasty monitoring inconsistency.
>>>> -bash-4.1$ onevm list | grep fgtest14
>>>>   26 oneadmin oneadmin fgt6x4-26    runn    6   4G fgtest14 117d 19h50
>>>>   27 oneadmin oneadmin fgt5x4-27    runn   10   4G fgtest14 117d 17h57
>>>>   28 oneadmin oneadmin fgt1x1-28    runn   10 4.1G fgtest14 117d 16h59
>>>>   30 oneadmin oneadmin fgt5x1-30    runn    0   4G fgtest14 116d 23h50
>>>>   33 oneadmin oneadmin ip6sl5vda-33 runn    6   4G fgtest14 116d 19h57
>>>> -bash-4.1$ onehost list
>>>>   ID NAME     CLUSTER RVM  ALLOCATED_CPU   ALLOCATED_MEM     STAT
>>>>    3 fgtest11 ipv6      0    0 / 400 (0%)   0K / 15.7G (0%)  on
>>>>    4 fgtest12 ipv6      0    0 / 400 (0%)   0K / 15.7G (0%)  on
>>>>    7 fgtest13 ipv6      0    0 / 800 (0%)   0K / 23.6G (0%)  on
>>>>    8 fgtest14 ipv6      5    0 / 800 (0%)   0K / 23.6G (0%)  on
>>>>    9 fgtest20 ipv6      3 300 / 800 (37%)  12G / 31.4G (38%) on
>>>>   11 fgtest19 ipv6      0    0 / 800 (0%)   0K / 31.5G (0%)  on
>>>> -bash-4.1$ onehost show 8
>>>> HOST 8 INFORMATION
>>>> ID                    : 8
>>>> NAME                  : fgtest14
>>>> CLUSTER               : ipv6
>>>> STATE                 : MONITORED
>>>> IM_MAD                : kvm
>>>> VM_MAD                : kvm
>>>> VN_MAD                : dummy
>>>> LAST MONITORING TIME  : 07/29 09:25:45
>>>>
>>>> HOST SHARES
>>>> TOTAL MEM             : 23.6G
>>>> USED MEM (REAL)       : 876.4M
>>>> USED MEM (ALLOCATED)  : 0K
>>>> TOTAL CPU             : 800
>>>> USED CPU (REAL)       : 0
>>>> USED CPU (ALLOCATED)  : 0
>>>> RUNNING VMS           : 5
>>>>
>>>> LOCAL SYSTEM DATASTORE #102 CAPACITY
>>>> TOTAL:                : 548.8G
>>>> USED:                 : 175.3G
>>>> FREE:                 : 345.6G
>>>>
>>>> MONITORING INFORMATION
>>>> ARCH="x86_64"
>>>> CPUSPEED="2992"
>>>> HOSTNAME="fgtest14.fnal.gov"
>>>> HYPERVISOR="kvm"
>>>> MODELNAME="Intel(R) Xeon(R) CPU E5450 @ 3.00GHz"
>>>> NETRX="234844577"
>>>> NETTX="21553126"
>>>> RESERVED_CPU=""
>>>> RESERVED_MEM=""
>>>> VERSION="4.6.0"
>>>>
>>>> VIRTUAL MACHINES
>>>>
>>>>   ID USER     GROUP    NAME         STAT UCPU UMEM HOST     TIME
>>>>   26 oneadmin oneadmin fgt6x4-26    runn    6   4G fgtest14 117d 19h50
>>>>   27 oneadmin oneadmin fgt5x4-27    runn   10   4G fgtest14 117d 17h57
>>>>   28 oneadmin oneadmin fgt1x1-28    runn   10 4.1G fgtest14 117d 17h00
>>>>   30 oneadmin oneadmin fgt5x1-30    runn    0   4G fgtest14 116d 23h50
>>>>   33 oneadmin oneadmin ip6sl5vda-33 runn    6   4G fgtest14 116d 19h57
>>>> -----------------------------------------------------------------------------
>>>>
>>>> All of this looks great, right? Just one problem: there are no VMs
>>>> running on fgtest14, and there haven't been for 4 days.
>>>>
>>>> [root@fgtest14 ~]# virsh list
>>>>  Id    Name                           State
>>>> ----------------------------------------------------
>>>>
>>>> [root@fgtest14 ~]#
>>>>
>>>> -------------------------------------------------------------------------
>>>> Yet the monitoring reports no errors:
>>>>
>>>> Tue Jul 29 09:28:10 2014 [InM][D]: Host fgtest14 (8) successfully monitored.
>>>>
>>>> -----------------------------------------------------------------------------
>>>> At the same time, there is no evidence that ONE is actually trying to
>>>> (or succeeding in) monitoring these five VMs, yet they are still stuck
>>>> in "runn", which means I can't do a onevm restart to restart them.
>>>> (The VM images of these 5 VMs are still out there on the VM host, and
>>>> I would like to save and restart them if I can.)
>>>>
>>>> What is the remotes command that ONE 4.6 would use to monitor this
>>>> host? Can I do it manually and see what output I get?
>>>>
>>>> Are we dealing with some kind of a bug, or just a very confused
>>>> system? Any help is appreciated. I have to get this sorted out before
>>>> I dare deploy one4.x in production.
>>>>
>>>> Steve Timm
>>>>
>>>> ------------------------------------------------------------------
>>>> Steven C. Timm, Ph.D  (630) 840-8525
>>>> [email protected]  http://home.fnal.gov/~timm/
>>>> Fermilab Scientific Computing Division, Scientific Computing Services Quad.
>>>> Grid and Cloud Services Dept., Associate Dept. Head for Cloud Computing

--
Ruben S. Montero, PhD
Project co-Lead and Chief Architect
OpenNebula - Flexible Enterprise Cloud Made Simple
www.OpenNebula.org | [email protected] | @OpenNebula
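A defensive workaround for the hang seen in the trace above is to run each probe under GNU coreutils `timeout`, so a stray sudo password prompt fails fast instead of blocking monitoring forever. A sketch under stated assumptions: the shipped run_probes does not do this, and `run_probe_safely`/`PROBE_TIMEOUT` are made-up names:

```shell
#!/bin/sh
# Sketch: wrap a probe in a hard timeout (illustrative, not the shipped
# run_probes logic; PROBE_TIMEOUT and the error text are assumptions).

PROBE_TIMEOUT="${PROBE_TIMEOUT:-30}"   # seconds

run_probe_safely() {
    probe="$1"; shift
    # GNU `timeout` exits with 124 when it kills a command for overrunning
    timeout "$PROBE_TIMEOUT" "$probe" "$@"
    rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "ERROR: probe $probe timed out after ${PROBE_TIMEOUT}s" >&2
    fi
    return "$rc"
}

# Demo: a hanging probe (stands in for monitor_ds.sh waiting on sudo)
PROBE_TIMEOUT=1
rc=0
run_probe_safely sleep 5 || rc=$?
echo "probe exit code: $rc"    # prints: probe exit code: 124
```

With a wrapper like this, the hung monitor_ds.sh would have produced an ERROR line (and a non-zero exit code OpenNebula could act on) instead of a silent stall.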
_______________________________________________
Users mailing list
[email protected]
http://lists.opennebula.org/listinfo.cgi/users-opennebula.org
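For the record, the root cause on the node was missing passwordless sudo for the probes' LVM commands. The opennebula-node packages normally install a sudoers entry along these lines; the exact file name and command paths vary by distro, so treat this fragment as illustrative rather than copied from the package:

```
# /etc/sudoers.d/opennebula-node (illustrative; check your distro's package)
oneadmin ALL = NOPASSWD: /sbin/vgdisplay, /sbin/lvs
Defaults:oneadmin !requiretty
```

With an entry like this in place, monitor_ds.sh's `sudo vgdisplay` runs without prompting and run_probes completes instead of hanging.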
