Re: [one-users] VM remains in boot state
Hello Rubén,

Does this happen all the time, or just when deploying many VMs simultaneously? Have you tried running 'virsh create deployment.0' (if you're using KVM) or 'xm create deployment.0' (if you're using Xen) manually from the worker nodes, to see what happens?

Regards,
Jaime

On Thu, Jul 29, 2010 at 2:40 PM, Ruben Diez <rd...@cesga.es> wrote:
> Hi:
>
> When we attempt to launch a VM, it remains in boot state. The deployment.0 file is generated on the OpenNebula side, but it is not copied to the $ONE_LOCATION/var/xx/images directory...
>
> The last message in the vm.log file is:
>
>     Generating deployment file: /srv/cloud/one/var/41/deployment.0
>
> Nothing strange appears in the oned.log file.
>
> Any ideas about what is happening? Anything we can probe to debug this?
>
> Thanks
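For anyone following along, a minimal sketch of the manual test Jaime suggests, for a KVM node (illustrative only: 'workernode' is a placeholder hostname, and the VM id 41 comes from the vm.log line quoted above; since the deployment file never reached the node in this case, it is copied over by hand first):

    # Copy the generated deployment file from the front-end to the node,
    # then try to start the domain directly to surface any libvirt error:
    $ scp $ONE_LOCATION/var/41/deployment.0 oneadmin@workernode:/tmp/
    $ ssh oneadmin@workernode virsh --connect qemu:///system create /tmp/deployment.0

If the domain starts this way, the problem is likely in the transfer/driver stage rather than in the deployment file itself.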
Re: [one-users] oned crashed - general protection fault
Hi Neil,

Well, that shouldn't happen ;)

Could you please send us a backtrace from gdb? If the process can still be attached to, you can use:

    $ gdb $ONE_LOCATION/bin/oned <oned-pid>

then, at the (gdb) prompt:

    (gdb) bt

and send us the output.

If this is not possible, could you please set your system to produce core files, and send one to us? (It should appear under $ONE_LOCATION/var after oned crashes again.) To achieve this:

    $ ulimit -c unlimited

Best regards,

-Tino

--
Constantino Vázquez Blanco | dsa-research.org/tinova
Virtualization Technology Engineer / Researcher
OpenNebula Toolkit | opennebula.org

On Fri, Jul 30, 2010 at 11:07 AM, Neil Mooney <neil.moo...@sara.nl> wrote:
> Oned just crashed, leaving just this message in the log:
>
>     [2046071.765934] oned[17361] general protection ip:7feaaee82e0a sp:7feaa5fcc940 error:0 in libc-2.11.2.so[7feaaee11000+158000]
>
> Our distro is Debian squeeze running a 2.6.32-5-amd64 kernel. We are seeing crashes quite often now, and are considering auto-restarting from Nagios or similar...
>
> Cheers,
> Neil
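As a follow-up to Tino's instructions, a sketch of reading the backtrace out of a core file offline (the file name is an assumption: where the core lands, and what it is called, depends on the kernel's core_pattern setting):

    # Load the core dump together with the matching oned binary; `bt full`
    # also prints local variables, which helps pinpoint the crash site.
    $ gdb $ONE_LOCATION/bin/oned $ONE_LOCATION/var/core
    (gdb) bt full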
Re: [one-users] oned crashed - general protection fault
Hi Neil,

We faced a similar problem (general protection fault). Moving libxmlrpc to the latest super-stable version, 1.06.40, resolved the situation for us. We had to build libxmlrpc 1.06.40 from source.

Thanks and Regards,
Saurav Lahiri

On Fri, Jul 30, 2010 at 3:36 PM, Tino Vazquez <tin...@fdi.ucm.es> wrote:
> [...]
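Saurav doesn't give the build steps; a sketch of the usual autoconf sequence for xmlrpc-c (assumptions: the tarball name, and that the standard configure/make flow applies to the 1.06.40 "super stable" series; check the xmlrpc-c project's download page):

    $ tar xzf xmlrpc-c-1.06.40.tgz
    $ cd xmlrpc-c-1.06.40
    $ ./configure --prefix=/usr/local
    $ make
    $ sudo make install
    # afterwards, rebuild OpenNebula so oned links against the new library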
Re: [one-users] VMs are going to pending state despite enough resources
Hi Carlos,

Please find below the details for diagnosing the VM pending state.

oneadmin@onefrontend:~$ onehost list
  ID NAME              RVM  TCPU  FCPU  ACPU    TMEM    FMEM STAT
   0 192.168.138.241     0   400   400   400 8189640 7408148   on

oneadmin@onefrontend:~$ onevm list
  ID USER     NAME     STAT CPU    MEM HOSTNAME        TIME

Template file details:

oneadmin@onefrontend:~/one-template$ cat ttylinux.one
NAME   = ttylinux1
CPU    = 1
MEMORY = 128
DISK   = [
  source   = "/srv/cloud/one/one-template/ttylinux.img",
  target   = "hda",
  readonly = "no" ]
NIC = [ NETWORK = "Small network" ]
FEATURES = [ acpi = "no" ]
#CONTEXT = [
#  hostname    = "$NAME",
#  ip_public   = "PUBLIC_IP",
#  files       = "/path/to/init.sh /path/to/id_dsa.pub",
#  target      = "hdc",
#  root_pubkey = "id_dsa.pub",
#  username    = "opennebula",
#  user_pubkey = "id_dsa.pub"
#]
# --- VNC server ---
GRAPHICS = [
  type   = "vnc",
  listen = "0.0.0.0",
  passwd = "passwd",
  port   = "501" ]

oneadmin@onefrontend:~/one-template$ onevm list
  ID USER     NAME     STAT CPU    MEM HOSTNAME        TIME
  54 oneadmin ttylinux runn   0 131072 192.168.138.241 00 00:07:51
  55 oneadmin ttylinux runn   0 131072 192.168.138.241 00 00:07:24
  56 oneadmin ttylinux runn   0 131072 192.168.138.241 00 00:06:08
  57 oneadmin ttylinux runn   0 131072 192.168.138.241 00 00:05:00
  58 oneadmin ttylinux pend   0      0                 00 00:03:31
  59 oneadmin ttylinux pend   0      0                 00 00:01:51

oneadmin@onefrontend:~/one-template$ onehost list
  ID NAME              RVM  TCPU  FCPU  ACPU    TMEM    FMEM STAT
   0 192.168.138.241     4   400   394   394 8189640 7343376   on

oneadmin@onefrontend:~/var$ onehost show 0
HOST 0 INFORMATION
ID                    : 0
NAME                  : 192.168.138.241
STATE                 : MONITORED
IM_MAD                : im_kvm
VM_MAD                : vmm_kvm
TM_MAD                : tm_nfs

HOST SHARES
MAX MEM               : 8189640
USED MEM (REAL)       : 1016876
USED MEM (ALLOCATED)  : 524288
MAX CPU               : 400
USED CPU (REAL)       : 10
USED CPU (ALLOCATED)  : 400
RUNNING VMS           : 4

MONITORING INFORMATION
ARCH=x86_64
CPUSPEED=2499
FREECPU=389.2
FREEMEMORY=7343080
HOSTNAME=onenode1
HYPERVISOR=kvm
MODELNAME=Intel(R) Xeon(R) CPU X3323 @ 2.50GHz
NETRX=0
NETTX=0
TOTALCPU=400
TOTALMEMORY=8189640
USEDCPU=10.8
USEDMEMORY=1016876

oneadmin@onefrontend:~/var$ cat sched.log
Fri Jul 30 19:31:01 2010 [SCHED][I]: Dispatching virtual machine 56 to HID: 0
Fri Jul 30 19:31:31 2010 [HOST][D]: Discovered Hosts (enabled): 0
Fri Jul 30 19:31:31 2010 [VM][D]: Pending virtual machines :
Fri Jul 30 19:31:31 2010 [SCHED][I]: Select hosts
    PRI HID
    ---

Fri Jul 30 19:32:01 2010 [HOST][D]: Discovered Hosts (enabled): 0
Fri Jul 30 19:32:01 2010 [VM][D]: Pending virtual machines : 57
Fri Jul 30 19:32:01 2010 [RANK][W]: No rank defined for VM
Fri Jul 30 19:32:01 2010 [SCHED][I]: Select hosts
    PRI HID
    ---
    Virtual Machine: 57
    0   0

Fri Jul 30 19:32:01 2010 [SCHED][I]: Dispatching virtual machine 57 to HID: 0
Fri Jul 30 19:32:31 2010 [HOST][D]: Discovered Hosts (enabled): 0
Fri Jul 30 19:32:31 2010 [VM][D]: Pending virtual machines :
Fri Jul 30 19:32:31 2010 [SCHED][I]: Select hosts
    PRI HID
    ---

Fri Jul 30 19:33:01 2010 [HOST][D]: Discovered Hosts (enabled): 0
Fri Jul 30 19:33:01 2010 [VM][D]: Pending virtual machines :
Fri Jul 30 19:33:01 2010 [SCHED][I]: Select hosts
    PRI HID
    ---

Fri Jul 30 19:33:31 2010 [HOST][D]: Discovered Hosts (enabled): 0
Fri Jul 30 19:33:31 2010 [VM][D]: Pending virtual machines : 58
Fri Jul 30 19:33:31 2010 [RANK][W]: No rank defined for VM
Fri Jul 30 19:33:31 2010 [SCHED][I]: Select hosts
    PRI HID
    ---
    Virtual Machine: 58

Fri Jul 30 19:34:01 2010 [HOST][D]: Discovered Hosts (enabled): 0
Fri Jul 30 19:34:01 2010 [VM][D]: Pending virtual machines : 58
Fri Jul 30 19:34:01 2010 [RANK][W]: No rank defined for VM
Fri Jul 30 19:34:01 2010 [SCHED][I]: Select hosts
    PRI HID
    ---
    Virtual Machine: 58

Fri Jul 30 19:34:31 2010 [HOST][D]: Discovered Hosts (enabled): 0
Fri Jul 30 19:34:31 2010 [VM][D]: Pending virtual machines : 58
Fri Jul 30 19:34:31 2010 [RANK][W]: No rank defined for VM
Fri Jul 30 19:34:31 2010 [SCHED][I]: Select hosts
    PRI HID
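One plausible reading of the numbers above (an inference, not something stated in the thread): with CPU = 1 in the template, each ttylinux VM reserves a full CPU, i.e. 100 of the host's 400 allocated units, so four running VMs take USED CPU (ALLOCATED) to 400/400 and the scheduler finds no suitable host for VMs 58 and 59, which is why the host list under "Virtual Machine: 58" is empty. A quick way to check:

    # Scheduling decisions use the allocated figures, not real usage:
    $ onehost show 0 | grep -E 'MAX CPU|USED CPU \(ALLOCATED\)'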
[one-users] In Onemc LCM state is unknown
Hi,

For some reason my frontend and node machines got powered off. After restarting the systems, all running images in onemc show up as below, and I am unable to connect to them. Please help me bring the images back up.

    Id  User      Name   VM State  LCM State  Cpu  Memory  Host             VNC Port  Time
    44  oneadmin  tty9a  active    unknown    0    131072  192.168.138.241  6         3d 2:26:35

Thanks & Regards,
Waseem
Re: [one-users] In Onemc LCM state is unknown
Hi,

Just a tip: you may try restarting your VMs, e.g.:

    onevm restart 44

This worked for us after powering off a VM from within the VM itself. Your situation is not the same, but similar...

Hope it helps,

Cheers,
Gyula

From: users-boun...@lists.opennebula.org [users-boun...@lists.opennebula.org] on behalf of Mirza Baig [waseem_...@yahoo.com]
Sent: 30 July 2010 16:33
To: users@lists.opennebula.org
Subject: [one-users] In Onemc LCM state is unknown

> [...]
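To complement Gyula's tip, a sketch of checking what libvirt still knows about the domain before restarting it (assumptions: the node is reachable over ssh as oneadmin, and the domain follows OpenNebula's usual one-<vmid> naming, here one-44):

    # On the node, see whether the domain survived the power cycle:
    $ ssh oneadmin@192.168.138.241 virsh --connect qemu:///system list --all
    # If the domain is gone, ask OpenNebula to boot it again:
    $ onevm restart 44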
Re: [one-users] Monitoring/Deployment issues: Virsh fails often
Hi Floris,

We are not really proposing that you decrease the VM polling time; that's just a new default in the new version, and it can be changed at will. Furthermore, I think we are mixing two issues here:

Monitoring Issue
----------------

OpenNebula performs a dominfo request to extract VM information. The way it does this now blocks the libvirt socket, which causes issues with other operations being performed simultaneously. This can be solved by making the request read-only, as you proposed. Another possible solution, perfectly aligned with and contemplated in the OpenNebula design, is to develop new probes for the Information Manager, as other large deployments do, that use SNMP, Ganglia, Nagios or similar tools inside the VMs and/or the physical hosts to avoid saturating a particular hypervisor (as in this case).

Deployment Issue
----------------

OpenNebula performs a domain create operation, which also blocks the socket. This causes basically the same behavior as the monitoring issue, but it cannot be solved with the read-only flag, since the operation is not possible in read-only mode. What OpenNebula provides here is a means to circumvent the limitations shown by libvirt, by spacing out the domain create operations.

Summarizing: libvirt doesn't like more than one simultaneous operation on VMs, since each one blocks the socket. The monitoring issue can be solved with the read-only flag or by creating new monitoring probes, so the VM polling frequency can be raised at will if you feel the default frequency doesn't meet your infrastructure's needs. The deployment issue cannot be solved by any SNMP-like mechanism and needs to be handled with care; we know that the current OpenNebula approach works for large deployments simply by spacing out the deployments.

Best regards,

-Tino

--
Constantino Vázquez Blanco | dsa-research.org/tinova
Virtualization Technology Engineer / Researcher
OpenNebula Toolkit | opennebula.org

On Fri, Jul 30, 2010 at 12:49 PM, Floris Sluiter <floris.slui...@sara.nl> wrote:
> Hi Tino and List,
>
> For us these methods are not very acceptable in a production environment. We feel that we do need to monitor the status of the Cloud, both the hosts and the VMs, at least once every minute for each component. Stopping or drastically reducing the monitoring of VMs is not the way to go for us. If the method of monitoring causes the Cloud to fail, then the method needs changing, not the frequency of it...
>
> I'll see what we can come up with to improve on this; we already donated the SNMP driver for the hosts, and maybe something similar can be done for the VMs (we are testing the read-only method for virsh).
>
> Kind regards,
>
> Floris
>
> -----Original Message-----
> From: tinov...@gmail.com [mailto:tinov...@gmail.com] On Behalf Of Tino Vazquez
> Sent: Thursday, 29 July 2010 19:29
> To: Floris Sluiter
> Cc: DuDu; users@lists.opennebula.org
> Subject: Re: Monitoring/Deployment issues: Virsh fails often
>
> Dear Floris,
>
> We noticed this behavior in the scalability tests we put OpenNebula through; for 2.0 a ticket was opened regarding this [1]. It happens with libvirt and also with the Xen hypervisor. It is by no means an OpenNebula scalability issue, since the intended libvirt behavior is not the one exposed. Notwithstanding, in the 2.0 version we introduced means to avoid this unpredictable behavior. We have limited the number of simultaneous deployments to the same host to one, to avoid the blocked-socket issue. This can be configured through the scheduler configuration file; more information can be found in [2]. Also, we have introduced a limitation on simultaneous VM polling requests, with the same purpose. With the changes detailed above, OpenNebula is able to deploy tens of thousands of virtual machines on a pool of hundreds of physical servers at the same time.
>
> The read-only method of performing the polling is a very neat and interesting proposal; we will evaluate its inclusion in the 2.0 version. Thanks a lot for this valuable feedback.
>
> Best regards,
>
> -Tino
>
> [1] http://dev.opennebula.org/issues/261
> [2] http://www.opennebula.org/documentation:rel2.0:schg
>
> --
> Constantino Vázquez Blanco | dsa-research.org/tinova
> Virtualization Technology Engineer / Researcher
> OpenNebula Toolkit | opennebula.org
>
> On Thu, Jul 29, 2010 at 6:28 PM, Floris Sluiter <floris.slui...@sara.nl> wrote:
>> Hi,
>>
>> We again had issues when having many VMs deployed on many hosts at the same time (log excerpts below), and when deploying more. We saw over 25 runaway VMs left behind running from the last two weeks that OpenNebula had marked as DONE; deploy, copy and stop also failed randomly quite often. It starts to be a major problem when we can't run OpenNebula in a stable and predictable manner on larger Clouds. We have the following intervals configured; we do need to monitor more often than every 10 minutes, we feel.
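For concreteness, the read-only polling Floris mentions testing could look like the following (a sketch, assuming a KVM node; one-44 stands in for whatever one-<vmid> domain is being monitored):

    # A read-only libvirt connection can answer dominfo queries without
    # holding the read-write socket that create/destroy operations need:
    $ virsh -r --connect qemu:///system dominfo one-44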
Re: [one-users] Monitoring/Deployment issues: Virsh fails often
Hi Tino,

I do not think that spacing alone will solve it. Maybe a solution would be to detect whether virsh is locked, then wait and retry until the lock is freed (for example with a mutex mechanism)? Deploying to a host where another user is busy with a VM will probably still result in a failure, as will multiple users deploying/deleting on the same host. It is not a scheduling issue, it is a concurrency issue...

Kind regards,

Floris

-----Original Message-----
From: tinov...@gmail.com [mailto:tinov...@gmail.com] On Behalf Of Tino Vazquez
Sent: Friday, 30 July 2010 18:07
To: Floris Sluiter
Cc: users@lists.opennebula.org
Subject: Re: Monitoring/Deployment issues: Virsh fails often

> [...]
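Floris's mutex idea could be prototyped with a trivial wrapper script (a hypothetical sketch, not an OpenNebula facility: flock(1) from util-linux blocks until the lock file is free, so concurrent virsh calls on a host queue up instead of colliding):

    #!/bin/sh
    # one-virsh: serialize every virsh invocation on this host behind a
    # single lock, waiting (rather than failing) while another holds it.
    exec flock /var/lock/one-virsh.lock virsh "$@"

Pointing the driver's virsh calls at such a wrapper would trade some throughput for predictability, in line with the concurrency problem Floris describes.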