Thanks Ruben.
onedb fsck turned up and fixed a bunch of problems, including the main
one: fgtest14 had once been host ID 10, and I had mistakenly
re-inserted it into the db as host ID 8. I had to manually modify the
mysql on those 5 entries in the VM pool to change the <HID> from
10 to 8, but once I did, opennebula finally detected
that they were down and now shows them as UNKN.
There is one remaining problem, and it is the following:
to successfully modify the BODY field in the vm_pool of the mysql database,
it was necessary to strip out some newlines and single quotes that
were in the XML, so I now have XML that doesn't actually work to
start a VM.
(I issued a mysql command
update vm_pool set body='a bunch of xml' where oid=nnn;
and the mysql syntax supported neither newlines nor single quotes. That's
a problem, because some of the things we are using need single quotes,
and maybe newlines too.)
Does anyone have an XML editor that can more easily modify the
text of the body field in the opennebula database?
Steve Timm
(P.S.--before, the XML in question looked like this:
<devices>
<serial type='pty'>
<target port='0'/>
</serial>
<console type='pty'>
<target type='serial' port='0'/>
</console>
</devices>
And now it looks like this:
< <devices> <serial type=pty>
<target port=0/> </serial> <console type=pty> <target type=serial port=0/>
</console> </devices>
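For what it's worth, the quoting problem can be sidestepped entirely by editing the body with an XML library and writing it back through a parameterized query, so the driver handles newlines and single quotes as data. A minimal sketch (my own helper, not an OpenNebula tool; the MySQLdb/mysql-connector driver is assumed in the comment):

```python
import xml.etree.ElementTree as ET

# The serial/console block that was lost to quoting (from the original template):
DEVICES = """<devices>
  <serial type='pty'>
    <target port='0'/>
  </serial>
  <console type='pty'>
    <target type='serial' port='0'/>
  </console>
</devices>"""

def set_raw_data(body_xml, raw_xml):
    """Put raw_xml back into RAW/DATA of a vm_pool body, newlines and quotes intact.

    Note: ElementTree re-emits the payload entity-escaped rather than
    CDATA-wrapped; the two forms are equivalent to any XML parser.
    """
    root = ET.fromstring(body_xml)
    data = root.find(".//RAW/DATA")
    if data is None:
        raise ValueError("body has no RAW/DATA element")
    data.text = raw_xml
    return ET.tostring(root, encoding="unicode")

# Writing the result back with a parameterized UPDATE (MySQLdb or
# mysql-connector assumed) sends the body as bound data, so nothing
# needs to be escaped by hand:
#
#   cur.execute("UPDATE vm_pool SET body=%s WHERE oid=%s", (new_body, 26))
```

As always with direct table surgery: stop oned first, and run onedb fsck afterwards.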
---
Steve Timm
On Wed, 30 Jul 2014, Ruben S. Montero wrote:
This seems to be a problem when upgrading the DB. See the inconsistency in
fgtest14:
<RUNNING_VMS>5</RUNNING_VMS>....<VMS></VMS>
That's the reason for not seeing any action taken on VM 26: it is not registered in
the host (empty <VMS> element).
I suggest stopping oned and executing onedb fsck.
Cheers
On Wed, Jul 30, 2014 at 4:44 PM, Steven Timm <[email protected]> wrote:
OK--I have now installed the opennebula-node-kvm rpm on
all of the VM hosts (SURPRISE), made sure that the collectd
that is running is the current one from opennebula 4.6,
and verified that run_probes kvm-probes can
run interactively as oneadmin on all of the nodes. The one on
fgtest14 correctly reports that there are no running VMs,
and the two machines that do have running VMs correctly report
that they do.
The only problem is that the five virtual machines opennebula still thinks
are running on fgtest14 still report back as running,
even though opennebula hasn't made any attempt to monitor them.
How do we get things back into sync and tell opennebula that VM #26
isn't really running anymore? Is there a way to force this VM into
the "unknown" state so we can do a onevm boot on it?
Database hackery included? Even better, has someone come up with an XML
hacker to do the XML substitution of one field in the huge mysql body field?
Even more important: it's clear that the monitoring was failing,
and had been failing for a long time, because we didn't have the
sudoers file there that the opennebula-node-kvm rpm provides.
But there was absolutely no warning of that; as far as the
head node was concerned, we were happy as a clam.
----
The important pieces of output from run_probes kvm-probes:
fgtest19
ARCH=x86_64
MODELNAME="Intel(R) Xeon(R) CPU E5450 @ 3.00GHz"
HYPERVISOR=kvm
TOTALCPU=800
CPUSPEED=2992
TOTALMEMORY=33010680
USEDMEMORY=1586216
FREEMEMORY=31424464
FREECPU=800.0
USEDCPU=0.0
NETRX=5958104400
NETTX=2323329968
DS_LOCATION_USED_MB=1924
DS_LOCATION_TOTAL_MB=280380
DS_LOCATION_FREE_MB=264129
DS = [
ID = 102,
USED_MB = 1924,
TOTAL_MB = 280380,
FREE_MB = 264129
]
HOSTNAME=fgtest19.fnal.gov
VM_POLL=YES
VM=[
ID=55,
DEPLOY_ID=one-55,
POLL="NETRX=25289118 USEDCPU=0.0 NETTX=214808 USEDMEMORY=4194304
STATE=a" ]
VERSION="4.6.0"
fgtest20
ARCH=x86_64
MODELNAME="Intel(R) Xeon(R) CPU E5450 @ 3.00GHz"
HYPERVISOR=kvm
TOTALCPU=800
CPUSPEED=2992
TOTALMEMORY=32875804
USEDMEMORY=8801100
FREEMEMORY=24074704
FREECPU=793.6
USEDCPU=6.39999999999998
NETRX=184155823062
NETTX=58685116817
DS_LOCATION_USED_MB=50049
DS_LOCATION_TOTAL_MB=281012
DS_LOCATION_FREE_MB=216499
DS = [
ID = 102,
USED_MB = 50049,
TOTAL_MB = 281012,
FREE_MB = 216499
]
HOSTNAME=fgtest20.fnal.gov
VM_POLL=YES
VM=[
ID=31,
DEPLOY_ID=one-31,
POLL="NETRX=71728978887 USEDCPU=0.5 NETTX=54281255903 USEDMEMORY=4270812
STATE=a" ]
VM=[
ID=24,
DEPLOY_ID=one-24,
POLL="NETRX=2383960153 USEDCPU=0.0 NETTX=17345416 USEDMEMORY=4194304
STATE=a" ]
VM=[
ID=48,
DEPLOY_ID=one-48,
POLL="NETRX=2546074171 USEDCPU=0.0 NETTX=145782495 USEDMEMORY=4194304
STATE=a" ]
VERSION="4.6.0"
fgtest14
ARCH=x86_64
MODELNAME="Intel(R) Xeon(R) CPU E5450 @ 3.00GHz"
HYPERVISOR=kvm
TOTALCPU=800
CPUSPEED=2992
TOTALMEMORY=24736796
USEDMEMORY=937004
FREEMEMORY=23799792
FREECPU=800.0
USEDCPU=0.0
NETRX=285471609
NETTX=25467521
DS_LOCATION_USED_MB=179498
DS_LOCATION_TOTAL_MB=561999
DS_LOCATION_FREE_MB=353864
DS = [
ID = 102,
USED_MB = 179498,
TOTAL_MB = 561999,
FREE_MB = 353864
]
-------------------------
And the appropriate excerpts from oned.log:
/var/log/one/oned.log.20140728111811:Fri Jul 25 15:22:05 2014 [DiM][D]:
Restarting VM 26
/var/log/one/oned.log.20140728111811:Fri Jul 25 15:22:05 2014 [DiM][E]:
Could not restart VM 26, wrong state.
/var/log/one/oned.log.20140728111811:Fri Jul 25 15:37:48 2014 [DiM][D]:
Stopping VM 26
/var/log/one/oned.log.20140728111811:Fri Jul 25 15:37:48 2014 [VMM][D]:
VM 26 successfully monitored: STATE=-
-----------------------------------
This is the mysql row in host_pool for host fgtest14
mysql>
mysql> select * from host_pool where oid=8 \G
*************************** 1. row ***************************
oid: 8
name: fgtest14
body:<HOST><ID>8</ID><NAME>fgtest14</NAME><STATE>2</STATE><IM_MAD>kvm</IM_MAD><VM_MAD>kvm</VM_MAD><VN_MAD>dummy</VN_MAD><LAST_MON_TIME>1
406731190</LAST_MON_TIME><CLUSTER_ID>101</CLUSTER_ID><CLUSTER>ipv6</CLUSTER><HOST_SHARE><DISK_USAGE>0</DISK_USAGE><MEM_USAGE>0</MEM
_USAGE><CPU_USAGE>0</CPU_USAGE><MAX_DISK>561999</MAX_DISK><MAX_MEM>24736796</MAX_MEM><MAX_CPU>800</MAX_CPU><FREE_DISK>353864</FREE_
DISK><FREE_MEM>23802216</FREE_MEM><FREE_CPU>800</FREE_CPU><USED_DISK>179498</USED_DISK><USED_MEM>934580</USED_MEM><USED_CPU>0</USED
_CPU><RUNNING_VMS>5</RUNNING_VMS><DATASTORES><DS><FREE_MB><![CDATA[353864]]></FREE_MB><ID><![CDATA[102]]></ID><TOTAL_MB><![CDATA[56
1999]]></TOTAL_MB><USED_MB><![CDATA[179498]]></USED_MB></DS></DATASTORES></HOST_SHARE><VMS></VMS><TEMPLATE><ARCH><![CDATA[x86_64]]>
</ARCH><CPUSPEED><![CDATA[2992]]></CPUSPEED><HOSTNAME><![CDATA[fgtest14.fnal.gov]]></HOSTNAME><HYPERVISOR><![CDATA[kvm]]></HYPERVIS
OR><MODELNAME><![CDATA[Intel(R) Xeon(R) CPU E5450
@3.00GHz]]></MODELNAME><NETRX><![CDATA[285677608]]></NETRX><NETTX><![CDATA[25489275]]></NETTX><RESERVED_CPU><![CDATA[]]></RESERVED_C
PU><RESERVED_MEM><![CDATA[]]></RESERVED_MEM><VERSION><![CDATA[4.6.0]]></VERSION></TEMPLATE></HOST>
state: 2
last_mon_time: 1406731190
uid: 0
gid: 0
owner_u: 1
group_u: 0
other_u: 0
cid: 101
1 row in set (0.00 sec)
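The inconsistency Ruben points out is visible in this body: <RUNNING_VMS> says 5 while the <VMS> element is empty. It can be checked mechanically with a few lines of Python (a sketch of my own, not part of any OpenNebula tooling):

```python
import xml.etree.ElementTree as ET

def host_vm_mismatch(host_body_xml):
    """Return (running_vms, registered_vms) parsed from a host_pool body.

    A healthy host has the two numbers equal; RUNNING_VMS=5 alongside an
    empty <VMS> element is exactly the kind of inconsistency that
    onedb fsck repairs.
    """
    root = ET.fromstring(host_body_xml)
    running = int(root.findtext(".//RUNNING_VMS", default="0"))
    vms = root.find("VMS")
    registered = 0 if vms is None else len(list(vms))
    return running, registered
```

Run against each row of host_pool, any host where the two numbers differ is a candidate for fsck.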
And this is the row in vm_pool for VM ID 26:
*************************** 1. row ***************************
oid: 26
name: fgt6x4-26
body:<VM><ID>26</ID><UID>0</UID><GID>0</GID><UNAME>oneadmin</UNAME><GNAME>oneadmin</GNAME><NAME>fgt6x4-26</NAME><PERMISSIONS><OWNER_U>1<
/OWNER_U><OWNER_M>1</OWNER_M><OWNER_A>0</OWNER_A><GROUP_U>0</GROUP_U><GROUP_M>0</GROUP_M><GROUP_A>0</GROUP_A><OTHER_U>0</OTHER_U><O
THER_M>0</OTHER_M><OTHER_A>0</OTHER_A></PERMISSIONS><LAST_POLL>1406320668</LAST_POLL><STATE>3</STATE><LCM_STATE>3</LCM_STATE><RESCH
ED>0</RESCHED><STIME>1396463735</STIME><ETIME>0</ETIME><DEPLOY_ID>one-26</DEPLOY_ID><MEMORY>4194304</MEMORY><CPU>6</CPU><NET_TX>748
982286</NET_TX><NET_RX>1588690678</NET_RX><TEMPLATE><AUTOMATIC_REQUIREMENTS><![CDATA[CLUSTER_ID = 101
& !(PUBLIC_CLOUD
=YES)]]></AUTOMATIC_REQUIREMENTS><CONTEXT><CTX_USER><![CDATA[PFVTRVI+PElEPjA8L0lEPjxHSUQ+MDwvR0lEPjxHUk9VUFM+PElEPjA8L0lEPjwvR1JPVVB
TPjxHTkFNRT5vbmVhZG1pbjwvR05BTUU+PE5BTUU+b25lYWRtaW48L05BTUU+PFBBU1NXT1JEPjFmNjQxYzdlMzZkZWU5MmUzNDQ0Mjk2NmI1OTYwMGJkMGE3ZmU5ZDQ8L1
BBU1NXT1JEPjxBVVRIX0RSSVZFUj5jb3JlPC9BVVRIX0RSSVZFUj48RU5BQkxFRD4xPC9FTkFCTEVEPjxURU1QTEFURT48VE9LRU5fUEFTU1dPUkQ+PCFbQ0RBVEFbNzFhY
zU0OWM5MzhmNjA0NmY3NDEzMDI4Y2ZhOGNjODU2YzI2ZGNhNV1dPjwvVE9LRU5fUEFTU1dPUkQ+PC9URU1QTEFURT48REFUQVNUT1JFX1FVT1RBPjwvREFUQVNUT1JFX1FV
T1RBPjxORVRXT1JLX1FVT1RBPjwvTkVUV09SS19RVU9UQT48Vk1fUVVPVEE+PC9WTV9RVU9UQT48SU1BR0VfUVVPVEE+PC9JTUFHRV9RVU9UQT48L1VTRVI+]]></CTX_US
ER><DISK_ID><![CDATA[2]]></DISK_ID><ETH0_DNS><![CDATA[131.225.0.254]]></ETH0_DNS><ETH0_GATEWAY><![CDATA[131.225.41.200]]></ETH0_GAT
EWAY><ETH0_IP><![CDATA[131.225.41.169]]></ETH0_IP><ETH0_IPV6><![CDATA[2001:400:2410:29::169]]></ETH0_IPV6><ETH0_MAC><![CDATA[00:16:
3e:06:06:04]]></ETH0_MAC><ETH0_MASK><![CDATA[255.255.255.128]]></ETH0_MASK><FILES><![CDATA[/cloud/images/OpenNebula/scripts/one3.2/
contextualization/init.sh
/cloud/images/OpenNebula/scripts/one3.2/contextualization/credentials.sh/cloud/images/OpenNebula/scripts/one3.2/contextualization/kerberos.sh]]></FILES><GATEWAY><![CDATA[131.225.41.200]]></GATEWAY><INIT_
SCRIPTS><![CDATA[init.sh
credentials.shkerberos.sh]]></INIT_SCRIPTS><IP_PUBLIC><![CDATA[131.225.41.169]]></IP_PUBLIC><NETMASK><![CDATA[255.255.255.128]]></NETMASK><NETWOR
K><![CDATA[YES]]></NETWORK><ROOT_PUBKEY><![CDATA[id_dsa.pub]]></ROOT_PUBKEY><TARGET><![CDATA[hdc]]></TARGET><USERNAME><![CDATA[open
nebula]]></USERNAME><USER_PUBKEY><![CDATA[id_dsa.pub]]></USER_PUBKEY></CONTEXT><CPU><![CDATA[1]]></CPU><DISK><CLONE><![CDATA[NO]]><
/CLONE><CLONE_TARGET><![CDATA[SYSTEM]]></CLONE_TARGET><CLUSTER_ID><![CDATA[101]]></CLUSTER_ID><DATASTORE><![CDATA[ip6_img_ds]]></DA
TASTORE><DATASTORE_ID><![CDATA[101]]></DATASTORE_ID><DEV_PREFIX><![CDATA[hd]]></DEV_PREFIX><DISK_ID><![CDATA[0]]></DISK_ID><IMAGE><
![CDATA[fgt6x4_os]]></IMAGE><IMAGE_ID><![CDATA[5]]></IMAGE_ID><IMAGE_UNAME><![CDATA[oneadmin]]></IMAGE_UNAME><LN_TARGET><![CDATA[SY
STEM]]></LN_TARGET><PERSISTENT><![CDATA[YES]]></PERSISTENT><READONLY><![CDATA[NO]]></READONLY><SAVE><![CDATA[YES]]></SAVE><SIZE><![
CDATA[46080]]></SIZE><SOURCE><![CDATA[/var/lib/one//datastores/101/3078b4235100008fbdbf9dff7eea95b1]]></SOURCE><TARGET><![CDATA[vda
]]></TARGET><TM_MAD><![CDATA[ssh]]></TM_MAD><TYPE><![CDATA[FILE]]></TYPE></DISK><DISK><DEV_PREFIX><![CDATA[hd]]></DEV_PREFIX><DISK_
ID><![CDATA[1]]></DISK_ID><SIZE><![CDATA[5120]]></SIZE><TARGET><![CDATA[vdb]]></TARGET><TYPE><![CDATA[swap]]></TYPE></DISK><FEATURE
S><ACPI><![CDATA[yes]]></ACPI></FEATURES><GRAPHICS><AUTOPORT><![CDATA[yes]]></AUTOPORT><KEYMAP><![CDATA[en-us]]></KEYMAP><LISTEN><!
[CDATA[127.0.0.1]]></LISTEN><PORT><![CDATA[5926]]></PORT><TYPE><![CDATA[vnc]]></TYPE></GRAPHICS><MEMORY><![CDATA[4096]]></MEMORY><N
IC><BRIDGE><![CDATA[br0]]></BRIDGE><CLUSTER_ID><![CDATA[101]]></CLUSTER_ID><IP><![CDATA[131.225.41.169]]></IP><IP6_LINK><![CDATA[fe
80::216:3eff:fe06:604]]></IP6_LINK><MAC><![CDATA[00:16:3e:06:06:04]]></MAC><MODEL><![CDATA[virtio]]></MODEL><NETWORK><![CDATA[Stati
c_IPV6_Public]]></NETWORK><NETWORK_ID><![CDATA[1]]></NETWORK_ID><NETWORK_UNAME><![CDATA[oneadmin]]></NETWORK_UNAME><NIC_ID><![CDATA
[0]]></NIC_ID><VLAN><![CDATA[NO]]></VLAN></NIC><OS><ARCH><![CDATA[x86_64]]></ARCH></OS><RAW><DATA><![CDATA[
<devices>
<serial type='pty'>
<target port='0'/>
</serial>
<console type='pty'>
<target type='serial' port='0'/>
</console>
</devices>]]></DATA><TYPE><![CDATA[kvm]]></TYPE></RAW><TEMPLATE_ID><![CDATA[6]]></TEMPLATE_ID><VCPU><![CDATA[2]]></VCPU><VMID><![CD
ATA[26]]></VMID></TEMPLATE><USER_TEMPLATE><ERROR><![CDATA[Fri Jul 25
15:37:48 2014 : Error saving VM state: Could not
save one-26
to/var/lib/one/datastores/102/26/checkpoint]]></ERROR><NPTYPE><![CDATA[NPERNLM]]></NPTYPE><RANK><![CDATA[FREEMEMORY]]></RANK><USERVO>
<![CDATA[test181818]]></USERVO></USER_TEMPLATE><HISTORY_RECORDS><HISTORY><OID>26</OID><SEQ>0</SEQ><HOSTNAME>fgtest14</HOSTNAME><HID
>10</HID><CID>101</CID><STIME>1396463752</STIME><ETIME>0</ETIME><VMMMAD>kvm</VMMMAD><VNMMAD>dummy</VNMMAD><TMMAD>ssh</TMMAD><DS_LOC
ATION>/var/lib/one/datastores</DS_LOCATION><DS_ID>102</DS_ID><PSTIME>1396463752</PSTIME><PETIME>1396465032</PETIME><RSTIME>13964650
32</RSTIME><RETIME>0</RETIME><ESTIME>0</ESTIME><EETIME>0</EETIME><REASON>0</REASON><ACTION>0</ACTION></HISTORY></HISTORY_RECORDS></
VM>
uid: 0
gid: 0
last_poll: 1406320668
state: 3
lcm_state: 3
owner_u: 1
group_u: 0
other_u: 0
1 row in set (0.00 sec)
-------------------------------
On Wed, 30 Jul 2014, Steven Timm wrote:
On Wed, 30 Jul 2014, Ruben S. Montero wrote:
Not really sure what can be going on... The monitor scripts return the
information of all VMs running in the node. In 4.6 the monitoring
system uses a push approach, through UDP, so you may have the
information being reported by misbehaved monitoring daemons.
Sometimes this may happen in dev environments if you are resetting
the DB...
When we ran the update to take this database from ONE 4.4 to ONE 4.6,
one host (the aforementioned fgtest14) and one datastore (image store
101) got wiped out of the database. I re-inserted them both and
restarted opennebula.
Steve Timm
On Jul 28, 2014 6:32 PM, "Steven Timm" <[email protected]> wrote:
I am currently dealing with an unexplained monitoring question
in OpenNebula 4.6 on my development cloud.

I frequently see OpenNebula return that the status of a ONe host is
"ON" even in the case of a system misconfiguration where, given the
credentials, it is impossible for opennebula to even ssh into the node
as oneadmin. I've fixed all those instances and restarted OpenNebula,
but opennebula still reports a number of VMs in state "running" even
though the node they are running on was rebooted three days ago and is
running no virtual machines whatsoever.

I think I could be dealing with database corruption of some type
(generated on the one4.4->one4.6 update), or there could be some
problem with the remote scripts on the nodes. I saw, and I think I
fixed, the problems with the database corruption (namely, one of the
hosts and one of the datastores got knocked out of the database for
reasons unknown, and I re-inserted them). But in any case there is
some error handling that is not working in the monitoring, and
something is exiting with status 0 that shouldn't be.

Ideas? Has anyone else seen something like this?
Steve Timm
------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525
[email protected]  http://home.fnal.gov/~timm/
Fermilab Scientific Computing Division, Scientific Computing Services Quad.
Grid and Cloud Services Dept., Associate Dept. Head for Cloud Computing
_______________________________________________
Users mailing list
[email protected]
http://lists.opennebula.org/listinfo.cgi/users-opennebula.org
--
Ruben S. Montero, PhD
Project co-Lead and Chief Architect, OpenNebula - Flexible Enterprise Cloud Made Simple
www.OpenNebula.org | [email protected] | @OpenNebula