Re: [ovirt-users] How to create more than 1 vm from template
On Mon, Jun 9, 2014 at 1:45 PM, John Xue xgxj...@gmail.com wrote: Dear all, As you know, we can create one VM from a template, but how can we create many VMs at the same time? We call that a pool. -- Regards, John Xue ___ Users mailing list Users@ovirt.org http://lists.ovirt.org/mailman/listinfo/users
Re: [ovirt-users] Ovirt Guest Agent Windows 7
On 06/05/2014 10:21 PM, Jeff Clay wrote: I have the spice guest agent/tools installed, but I'm reading that I also need to install/set up the ovirt-guest-agent to get proper reporting of resources, etc. I'm following the instructions in https://github.com/oVirt/ovirt-guest-agent/blob/master/ovirt-guest-agent/README-windows.txt I am confused by "Update the AGENT_CONFIG global variable in OVirtGuestService.py to point to the right configuration location." I can find the file without issue; the value I'm asked to change has a default of: AGENT_CONFIG = 'ovirt-guest-agent.ini' I cannot locate a file named ovirt-guest-agent.ini within the C:\ovirt-guest-agent-master\ovirt-guest-agent folder, so I'm not sure what to set this value to. The file is not located in ovirt-guest-agent-master\configurations\ either.

Please copy all *.ini files into the same folder as the executable. Then it should work. -- Regards, Vinzenz Feenstra | Senior Software Engineer Red Hat Engineering Virtualization R&D Phone: +420 532 294 625 IRC: vfeenstr or evilissimo Better technology. Faster innovation. Powered by community collaboration. See how it works at redhat.com ___ Users mailing list Users@ovirt.org http://lists.ovirt.org/mailman/listinfo/users
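To make that "copy all *.ini files" step concrete, here is a minimal sketch (an assumption, not code from this thread) that walks the unpacked source tree, finds any .ini files, and copies them next to OVirtGuestService.py so the default AGENT_CONFIG = 'ovirt-guest-agent.ini' resolves. Both paths are illustrative and should be adjusted to your checkout:

import fnmatch
import os
import shutil

SRC_ROOT = r'C:\ovirt-guest-agent-master'                    # unpacked sources (assumed location)
DST_DIR = r'C:\ovirt-guest-agent-master\ovirt-guest-agent'   # folder containing OVirtGuestService.py

for root, _dirs, files in os.walk(SRC_ROOT):
    for name in fnmatch.filter(files, '*.ini'):
        src = os.path.join(root, name)
        if os.path.normcase(root) != os.path.normcase(DST_DIR):
            shutil.copy(src, DST_DIR)
            print('copied %s -> %s' % (src, DST_DIR))

Running this once before installing and starting the service should be enough; afterwards the agent can read its configuration from its own folder.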
Re: [ovirt-users] How to create more than 1 vm from template
Hi John, You are right: if you want to create many VMs from a template you can create a pool. I think the main difference between creating a single VM and creating a pool is that in a pool you cannot create VMs with cloned disks. regards, Maor On 06/09/2014 08:45 AM, John Xue wrote: Dear all, As you know, we can create one VM from a template, but how can we create many VMs at the same time? We call that a pool. -- Regards, John Xue ___ Users mailing list Users@ovirt.org http://lists.ovirt.org/mailman/listinfo/users
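For completeness, a pool can also be created programmatically. Below is a rough sketch using the oVirt Python SDK (ovirtsdk, 3.x API); the engine URL, credentials, cluster, template and pool names are placeholders, not values from this thread:

from ovirtsdk.api import API
from ovirtsdk.xml import params

api = API(url='https://engine.example.com/api',         # engine URL - placeholder
          username='admin@internal', password='secret',
          insecure=True)                                 # skips CA verification; use ca_file= in production
try:
    # One call creates the pool and its VMs, all based on the given template.
    api.vmpools.add(params.VmPool(
        name='web-pool',
        size=10,                                         # number of VMs in the pool
        cluster=api.clusters.get(name='Default'),
        template=api.templates.get(name='my-template'),
    ))
finally:
    api.disconnect()

The size parameter controls how many VMs the pool contains; users are then handed a VM from the pool instead of a dedicated clone.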
[ovirt-users] iSCSI and multipath
Hi, Context here: 2 setups (2 datacenters) in oVirt 3.4.1 with CentOS 6.4 and 6.5 hosts, connected to some LUNs in iSCSI on a dedicated physical network.

Every host has two interfaces used for management and end-user LAN activity. Every host also has 4 additional NICs dedicated to the iSCSI network. Those 4 NICs were set up from the oVirt web GUI in a bond with a single IP address and connected to the SAN. Everything is working fine. I just had to manually tweak some points (MTU, other small things) but it is working.

Recently, our SAN dealer told us that using bonding in an iSCSI context was terrible, and the recommendation is to use multipathing. My previous experience pre-oVirt was to agree with that. Long story short, when setting up the host from oVirt it was so convenient to click and set up bonding, and observe it working, that I did not pay further attention (and we seem to have no bottleneck yet).

Anyway, I dedicated a host to experiment, but things are not clear to me. I know how to set up NICs, iSCSI and multipath to present the host OS with a partition or a logical volume, using multipathing instead of bonding. But in this precise case, what is disturbing me is that many of the layers described above are managed by oVirt (mount/unmount of LVs, creation of bridges on top of bonded interfaces, managing the WWIDs amongst the cluster). And I see nothing related to multipath at the NIC level. Though I can set up everything fine on the host, this setup does not match what oVirt is expecting: oVirt expects a bridge named after the iSCSI network and able to connect to the SAN. My multipathing offers access to the partitions of the LUNs; it is not the same.

I saw that multipathing is discussed here: http://www.ovirt.org/Feature/iSCSI-Multipath There I read: "Add an iSCSI Storage to the Data Center. Make sure the Data Center contains networks. Go to the Data Center main tab and choose the specific Data Center. At the sub tab choose iSCSI Bond." The only tabs I see are Storage/Logical Networks/Network QoS/Clusters/Permissions. In this datacenter, I have one iSCSI master storage domain, two iSCSI storage domains and one NFS export domain. What did I miss? "Press the new button to add a new iSCSI Bond. Configure the networks you want to add to the new iSCSI Bond."

Anyway, I'm not sure I understand the point of this wiki page and this implementation: it looks like a much higher level of multipathing over virtual networks, and not at all what I'm talking about above...? Well, as you see, I need enlightenment. -- Nicolas Ecarnot ___ Users mailing list Users@ovirt.org http://lists.ovirt.org/mailman/listinfo/users
Re: [ovirt-users] VM HostedEngine is down. Exit message: internal error Failed to acquire lock error -243
I just blocked the connection to storage for testing, but as a result I had this error: Failed to acquire lock: error -243, so I added it to the reproduction steps. If you know other steps to reproduce this error, without blocking the connection to storage, it would be wonderful if you could provide them. Thanks

- Original Message - From: Andrew Lau and...@andrewklau.com To: combuster combus...@archlinux.us Cc: users users@ovirt.org Sent: Monday, June 9, 2014 3:47:00 AM Subject: Re: [ovirt-users] VM HostedEngine is down. Exit message: internal error Failed to acquire lock error -243

I just ran a few extra tests. I had a 2-host hosted-engine setup running for a day; they both had a score of 2400. Migrated the VM through the UI multiple times, all worked fine. I then added the third host, and that's when it all fell to pieces. The other two hosts have a score of 0 now. I'm also curious, in the BZ there's a note about: where engine-vm block connection to storage domain (via iptables -I INPUT -s sd_ip -j DROP) What's the purpose of that?

On Sat, Jun 7, 2014 at 4:16 PM, Andrew Lau and...@andrewklau.com wrote: Ignore that, the issue came back after 10 minutes. I've even tried a gluster mount + nfs server on top of that, and the same issue has come back.

On Fri, Jun 6, 2014 at 6:26 PM, Andrew Lau and...@andrewklau.com wrote: Interesting, I put it all into global maintenance, shut it all down for ~10 minutes, and it regained its sanlock control and doesn't seem to have that issue coming up in the log.

On Fri, Jun 6, 2014 at 4:21 PM, combuster combus...@archlinux.us wrote: It was pure NFS on a NAS device. They all had different ids (had no redeployments of nodes before the problem occurred). Thanks Jirka.

On 06/06/2014 08:19 AM, Jiri Moskovcak wrote: I've seen that problem in other threads; the common denominator was nfs on top of gluster. So if you have this setup, then it's a known problem. Or you should double check that your hosts have different ids, otherwise they would be trying to acquire the same lock. --Jirka

On 06/06/2014 08:03 AM, Andrew Lau wrote: Hi Ivan, Thanks for the in-depth reply. I've only seen this happen twice, and only after I added a third host to the HA cluster. I wonder if that's the root problem. Have you seen this happen on all your installs or only just after your manual migration? It's a little frustrating this is happening as I was hoping to get this into a production environment. It was all working except that log message :( Thanks, Andrew

On Fri, Jun 6, 2014 at 3:20 PM, combuster combus...@archlinux.us wrote: Hi Andrew, this is something that I saw in my logs too, first on one node and then on the other three. When that happened on all four of them, the engine was corrupted beyond repair. First of all, I think that message is saying that sanlock can't get a lock on the shared storage that you defined for the hosted engine during installation. I got this error when I tried to manually migrate the hosted engine. There is an unresolved bug there and I think it's related to this one: [Bug 1093366 - Migration of hosted-engine vm put target host score to zero] https://bugzilla.redhat.com/show_bug.cgi?id=1093366 This is a blocker bug (or should be) for the self-hosted engine and, from my own experience with it, it shouldn't be used in a production environment (not until it's fixed). Nothing that I did could fix the fact that the score for the target node was zero: I tried to reinstall the node, rebooted the node, restarted several services, tailed tons of logs etc., but to no avail.
When only one node was left (the one actually running the hosted engine), I brought the engine's VM down gracefully (hosted-engine --vm-shutdown I believe) and after that, when I tried to start the VM, it wouldn't load. Running VNC showed that the filesystem inside the VM was corrupted, and when I ran fsck and finally got it started, it was too badly damaged. I succeeded in starting the engine itself (after repairing the postgresql service that wouldn't start) but the database was damaged enough that it acted pretty weird (showed that storage domains were down but the VMs were running fine, etc.). Luckily, I had already exported all of the VMs at the first sign of trouble, then installed ovirt-engine on a dedicated server and attached the export domain. So while it is really a useful feature, and it's working (for the most part, i.e., automatic migration works), manually migrating the hosted-engine VM will lead to trouble. I hope that my experience with it will be of use to you. It happened to me two weeks ago, ovirt-engine was current (3.4.1) and there was no fix available. Regards, Ivan On 06/06/2014 05:12 AM, Andrew Lau wrote: Hi, I'm seeing this weird message in my engine log 2014-06-06 03:06:09,380 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-79)
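For reference, the reproduction Artyom describes at the top of this message (blocking the engine VM's connection to the storage domain) can be scripted. A minimal sketch, with the storage address as a placeholder and a fixed observation window, might look like this (run as root on the engine VM):

import subprocess
import time

SD_IP = '192.168.1.50'                                  # storage domain address - placeholder
RULE = ['INPUT', '-s', SD_IP, '-j', 'DROP']

subprocess.check_call(['iptables', '-I'] + RULE)        # drop traffic from the storage domain
try:
    time.sleep(600)                                     # watch the agent/sanlock logs meanwhile
finally:
    subprocess.check_call(['iptables', '-D'] + RULE)    # always restore connectivity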
Re: [ovirt-users] Recommended setup for a FC based storage domain
OK, I have good news and bad news :) Good news is that I can run different VMs on different nodes when all of their drives are on the FC storage domain. I don't think that all of the I/O is running through the SPM, but I need to test that. Simply put, for every virtual disk that you create on the shared FC storage domain, oVirt will present that vdisk only to the node which is running the VM itself. They can all see the domain infrastructure (inbox, outbox, metadata) but the LV for the virtual disk itself is visible only to the node that is running that particular VM. There is no limitation (except for the free space on the storage).

Bad news! I can create the virtual disk on the FC storage for a VM, but when I start the VM itself, the node which hosts the VM I'm starting goes non-operational, and quickly comes up again (the ilo fencing agent checks whether the node is ok and brings it back up). During that time the VM starts on another node (the Default Host parameter was ignored - the assigned host was not available). I can manually migrate it later to the intended node; that works. Lucky me, on two nodes (of the four) in the cluster there were no VMs running (I tried this on both, with two different VMs created from scratch, and I got the same result). I've killed everything above WARNING because it was killing the performance of the cluster. vdsm.log : [code]
Thread-305::WARNING::2014-06-09 12:15:53,236::persistentDict::256::Storage.PersistentDict::(refresh) data has no embedded checksum - trust it as it is
55809e40-ccf3-4f7c-aeec-802bc1c326a7::WARNING::2014-06-09 12:17:25,013::utils::129::root::(rmFile) File: /rhev/data-center/a0500f5c-e8d9-42f1-8f04-15b23514c8ed/55338570-e537-412b-97a9-635eea1ecb10/images/90659ad8-bd90-4a0a-bb4e-7c6afe90e925/242a1bce-a434-4246-ad24-b62f99c03a05 already removed
55809e40-ccf3-4f7c-aeec-802bc1c326a7::WARNING::2014-06-09 12:17:25,074::blockSD::761::Storage.StorageDomain::(_getOccupiedMetadataSlots) Could not find mapping for lv 55338570-e537-412b-97a9-635eea1ecb10/242a1bce-a434-4246-ad24-b62f99c03a05
Thread-305::WARNING::2014-06-09 12:20:54,341::persistentDict::256::Storage.PersistentDict::(refresh) data has no embedded checksum - trust it as it is
Thread-305::WARNING::2014-06-09 12:25:55,378::persistentDict::256::Storage.PersistentDict::(refresh) data has no embedded checksum - trust it as it is
Thread-305::WARNING::2014-06-09 12:30:56,424::persistentDict::256::Storage.PersistentDict::(refresh) data has no embedded checksum - trust it as it is
Thread-1857::WARNING::2014-06-09 12:32:45,639::libvirtconnection::116::root::(wrapper) connection to libvirt broken. ecode: 1 edom: 7
Thread-1857::CRITICAL::2014-06-09 12:32:45,640::libvirtconnection::118::root::(wrapper) taking calling process down.
Thread-17704::WARNING::2014-06-09 12:32:48,009::libvirtconnection::116::root::(wrapper) connection to libvirt broken. ecode: 1 edom: 7
Thread-17704::CRITICAL::2014-06-09 12:32:48,013::libvirtconnection::118::root::(wrapper) taking calling process down.
Thread-17704::ERROR::2014-06-09 12:32:48,018::vm::2285::vm.Vm::(_startUnderlyingVm) vmId=`2bee9d79-b8d1-4a5a-a4f7-8092d1c803d9`::The vm start process failed Traceback (most recent call last): File /usr/share/vdsm/vm.py, line 2245, in _startUnderlyingVm self._run() File /usr/share/vdsm/vm.py, line 3185, in _run self._connection.createXML(domxml, flags), File /usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py, line 110, in wrapper __connections.get(id(target)).pingLibvirt() File /usr/lib64/python2.6/site-packages/libvirt.py, line 3389, in getLibVersion if ret == -1: raise libvirtError ('virConnectGetLibVersion() failed', conn=self) libvirtError: internal error client socket is closed Thread-1857::WARNING::2014-06-09 12:32:50,673::vm::1963::vm.Vm::(_set_lastStatus) vmId=`2bee9d79-b8d1-4a5a-a4f7-8092d1c803d9`::trying to set state to Powering down when already Down Thread-1857::WARNING::2014-06-09 12:32:50,815::utils::129::root::(rmFile) File: /var/lib/libvirt/qemu/channels/2bee9d79-b8d1-4a5a-a4f7-8092d1c803d9.com.redhat.rhevm.vdsm already removed Thread-1857::WARNING::2014-06-09 12:32:50,816::utils::129::root::(rmFile) File: /var/lib/libvirt/qemu/channels/2bee9d79-b8d1-4a5a-a4f7-8092d1c803d9.org.qemu.guest_agent.0 already removed MainThread::WARNING::2014-06-09 12:33:03,770::fileUtils::167::Storage.fileUtils::(createdir) Dir /rhev/data-center/mnt already exists MainThread::WARNING::2014-06-09 12:33:05,738::clientIF::181::vds::(_prepareBindings) Unable to load the json rpc server module. Please make sure it is installed. storageRefresh::WARNING::2014-06-09 12:33:06,133::fileUtils::167::Storage.fileUtils::(createdir) Dir /rhev/data-center/hsm-tasks already exists Thread-35::ERROR::2014-06-09 12:33:08,375::sdc::137::Storage.StorageDomainCache::(_findDomain) looking for unfetched domain 55338570-e537-412b-97a9-635eea1ecb10 Thread-35::ERROR::2014-06-09
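As a side note, the behaviour described at the top of this message (the image LV being activated only on the node that runs the VM) can be spot-checked per host. A small sketch, assuming the storage domain UUID from the log above is the VG name on your hosts, could be:

import subprocess

SD_VG = '55338570-e537-412b-97a9-635eea1ecb10'          # storage domain UUID = VG name (example from the log)

proc = subprocess.Popen(['lvs', '--noheadings', '-o', 'lv_name,lv_attr', SD_VG],
                        stdout=subprocess.PIPE)
out, _ = proc.communicate()
for line in out.decode().splitlines():
    fields = line.split()
    if len(fields) == 2:
        name, attr = fields
        state = 'active' if attr[4] == 'a' else 'inactive'   # 5th lv_attr character is the activation state
        print('%s: %s' % (name, state))

Run on the node hosting the VM the image LV should show as active; on the other nodes it should be inactive.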
Re: [ovirt-users] Recommended setup for a FC based storage domain
Bad news happens only when running a VM for the first time, if it helps... On 06/09/2014 01:30 PM, combuster wrote: OK, I have good news and bad news :) Good news is that I can run different VM's on different nodes when all of their drives are on FC Storage domain. I don't think that all of I/O is running through SPM, but I need to test that. Simply put, for every virtual disk that you create on the shared fc storage domain, ovirt will present that vdisk only to the node wich is running the VM itself. They all can see domain infrastructure (inbox,outbox,metadata) but the LV for the virtual disk itself for that VM is visible only to the node that is running that particular VM. There is no limitation (except for the free space on the storage). Bad news! I can create the virtual disk on the fc storage for a vm, but when I start the VM itself, node wich hosts the VM that I'm starting is going non-operational, and quickly goes up again (ilo fencing agent checks if the node is ok and bring it back up). During that time, vm starts on another node (Default Host parameter was ignored - assigned Host was not available). I can manualy migrate it later to the intended node, that works. Lucky me, on two nodes (of the four) in the cluster, there were no vm's running (i tried this on both, with two different vm's created from scratch and i got the same result. I've killed everything above WARNING because it was killing the performance of the cluster. vdsm.log : [code] Thread-305::WARNING::2014-06-09 12:15:53,236::persistentDict::256::Storage.PersistentDict::(refresh) data has no embedded checksum - trust it as it is 55809e40-ccf3-4f7c-aeec-802bc1c326a7::WARNING::2014-06-09 12:17:25,013::utils::129::root::(rmFile) File: /rhev/data-center/a0500f5c-e8d9-42f1-8f04-15b23514c8ed/55338570-e537-412b-97a9-635eea1ecb10/images/90659ad8-bd90-4a0a-bb4e-7c6afe90e925/242a1bce-a434-4246-ad24-b62f99c03a05 already removed 55809e40-ccf3-4f7c-aeec-802bc1c326a7::WARNING::2014-06-09 12:17:25,074::blockSD::761::Storage.StorageDomain::(_getOccupiedMetadataSlots) Could not find mapping for lv 55338570-e537-412b-97a9-635eea1ecb10/242a1bce-a434-4246-ad24-b62f99c03a05 Thread-305::WARNING::2014-06-09 12:20:54,341::persistentDict::256::Storage.PersistentDict::(refresh) data has no embedded checksum - trust it as it is Thread-305::WARNING::2014-06-09 12:25:55,378::persistentDict::256::Storage.PersistentDict::(refresh) data has no embedded checksum - trust it as it is Thread-305::WARNING::2014-06-09 12:30:56,424::persistentDict::256::Storage.PersistentDict::(refresh) data has no embedded checksum - trust it as it is Thread-1857::WARNING::2014-06-09 12:32:45,639::libvirtconnection::116::root::(wrapper) connection to libvirt broken. ecode: 1 edom: 7 Thread-1857::CRITICAL::2014-06-09 12:32:45,640::libvirtconnection::118::root::(wrapper) taking calling process down. Thread-17704::WARNING::2014-06-09 12:32:48,009::libvirtconnection::116::root::(wrapper) connection to libvirt broken. ecode: 1 edom: 7 Thread-17704::CRITICAL::2014-06-09 12:32:48,013::libvirtconnection::118::root::(wrapper) taking calling process down. 
Thread-17704::ERROR::2014-06-09 12:32:48,018::vm::2285::vm.Vm::(_startUnderlyingVm) vmId=`2bee9d79-b8d1-4a5a-a4f7-8092d1c803d9`::The vm start process failed Traceback (most recent call last): File /usr/share/vdsm/vm.py, line 2245, in _startUnderlyingVm self._run() File /usr/share/vdsm/vm.py, line 3185, in _run self._connection.createXML(domxml, flags), File /usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py, line 110, in wrapper __connections.get(id(target)).pingLibvirt() File /usr/lib64/python2.6/site-packages/libvirt.py, line 3389, in getLibVersion if ret == -1: raise libvirtError ('virConnectGetLibVersion() failed', conn=self) libvirtError: internal error client socket is closed Thread-1857::WARNING::2014-06-09 12:32:50,673::vm::1963::vm.Vm::(_set_lastStatus) vmId=`2bee9d79-b8d1-4a5a-a4f7-8092d1c803d9`::trying to set state to Powering down when already Down Thread-1857::WARNING::2014-06-09 12:32:50,815::utils::129::root::(rmFile) File: /var/lib/libvirt/qemu/channels/2bee9d79-b8d1-4a5a-a4f7-8092d1c803d9.com.redhat.rhevm.vdsm already removed Thread-1857::WARNING::2014-06-09 12:32:50,816::utils::129::root::(rmFile) File: /var/lib/libvirt/qemu/channels/2bee9d79-b8d1-4a5a-a4f7-8092d1c803d9.org.qemu.guest_agent.0 already removed MainThread::WARNING::2014-06-09 12:33:03,770::fileUtils::167::Storage.fileUtils::(createdir) Dir /rhev/data-center/mnt already exists MainThread::WARNING::2014-06-09 12:33:05,738::clientIF::181::vds::(_prepareBindings) Unable to load the json rpc server module. Please make sure it is installed. storageRefresh::WARNING::2014-06-09 12:33:06,133::fileUtils::167::Storage.fileUtils::(createdir) Dir /rhev/data-center/hsm-tasks already exists Thread-35::ERROR::2014-06-09
Re: [ovirt-users] iSCSI and multipath
Hi Nicolas, Which DC level are you using? iSCSI multipath should be supported only from DC with compatibility version of 3.4 regards, Maor On 06/09/2014 01:06 PM, Nicolas Ecarnot wrote: Hi, Context here : - 2 setups (2 datacenters) in oVirt 3.4.1 with CentOS 6.4 and 6.5 hosts - connected to some LUNs in iSCSI on a dedicated physical network Every host has two interfaces used for management and end-user LAN activity. Every host also have 4 additional NICs dedicated to the iSCSI network. Those 4 NICs were setup from the oVirt web GUI in a bonding with a unique IP address and connected to the SAN. Everything is working fine. I just had to manually tweak some points (MTU, other small things) but it is working. Recently, our SAN dealer told us that using bonding in an iSCSI context was terrible, and the recommendation is to use multipathing. My previous experience pre-oVirt was to agree with that. Long story short is just that when setting up the host from oVirt, it was so convenient to click and setup bonding, and observe it working that I did not pay further attention. (and we seem to have no bottleneck yet). Anyway, I dedicated a host to experiment, I things are not clear to me. I know how to setup NICs, iSCSI and multipath to present the host OS a partition or a logical volume, using multipathing instead of bonding. But in this precise case, what is disturbing me is that many layers described above are managed by oVirt (mount/unmount of LV, creation of bridges on top of bonded interfaces, managing the WWID amongst the cluster). And I see nothing related to multipath at the NICs level. Though I can setup everything fine in the host, this setup does not match what oVirt is expecting : oVirt is expecting a bridge named as the iSCSI network, and able to connect to the SAN. My multipathing is offering the access to the partition of the LUNs, it is not the same. I saw that multipathing is talked here : http://www.ovirt.org/Feature/iSCSI-Multipath I here read : Add an iSCSI Storage to the Data Center Make sure the Data Center contains networks. Go to the Data Center main tab and choose the specific Data Center At the sub tab choose iSCSI Bond The only tabs I see are Storage/Logical Networks/Network QoS/Clusters/Permissions. In this datacenter, I have one iSCSI master storage domain, two iSCSI storage domains and one NFS export domain. What did I miss? Press the new button to add a new iSCSI Bond Configure the networks you want to add to the new iSCSI Bond. Anyway, I'm not sure to understand the point of this wiki page and this implementation : it looks like a much higher level of multipathing over virtual networks, and not at all what I'm talking about above...? Well as you see, I need enlightenments. ___ Users mailing list Users@ovirt.org http://lists.ovirt.org/mailman/listinfo/users
Re: [ovirt-users] VM HostedEngine is down. Exit message: internal error Failed to acquire lock error -243
Interesting, my storage network is a L2 only and doesn't run on the ovirtmgmt (which is the only thing HostedEngine sees) but I've only seen this issue when running ctdb in front of my NFS server. I previously was using localhost as all my hosts had the nfs server on it (gluster). On Mon, Jun 9, 2014 at 9:15 PM, Artyom Lukianov aluki...@redhat.com wrote: I just blocked connection to storage for testing, but on result I had this error: Failed to acquire lock error -243, so I added it in reproduce steps. If you know another steps to reproduce this error, without blocking connection to storage it also can be wonderful if you can provide them. Thanks - Original Message - From: Andrew Lau and...@andrewklau.com To: combuster combus...@archlinux.us Cc: users users@ovirt.org Sent: Monday, June 9, 2014 3:47:00 AM Subject: Re: [ovirt-users] VM HostedEngie is down. Exist message: internal error Failed to acquire lock error -243 I just ran a few extra tests, I had a 2 host, hosted-engine running for a day. They both had a score of 2400. Migrated the VM through the UI multiple times, all worked fine. I then added the third host, and that's when it all fell to pieces. Other two hosts have a score of 0 now. I'm also curious, in the BZ there's a note about: where engine-vm block connection to storage domain(via iptables -I INPUT -s sd_ip -j DROP) What's the purpose for that? On Sat, Jun 7, 2014 at 4:16 PM, Andrew Lau and...@andrewklau.com wrote: Ignore that, the issue came back after 10 minutes. I've even tried a gluster mount + nfs server on top of that, and the same issue has come back. On Fri, Jun 6, 2014 at 6:26 PM, Andrew Lau and...@andrewklau.com wrote: Interesting, I put it all into global maintenance. Shut it all down for 10~ minutes, and it's regained it's sanlock control and doesn't seem to have that issue coming up in the log. On Fri, Jun 6, 2014 at 4:21 PM, combuster combus...@archlinux.us wrote: It was pure NFS on a NAS device. They all had different ids (had no redeployements of nodes before problem occured). Thanks Jirka. On 06/06/2014 08:19 AM, Jiri Moskovcak wrote: I've seen that problem in other threads, the common denominator was nfs on top of gluster. So if you have this setup, then it's a known problem. Or you should double check if you hosts have different ids otherwise they would be trying to acquire the same lock. --Jirka On 06/06/2014 08:03 AM, Andrew Lau wrote: Hi Ivan, Thanks for the in depth reply. I've only seen this happen twice, and only after I added a third host to the HA cluster. I wonder if that's the root problem. Have you seen this happen on all your installs or only just after your manual migration? It's a little frustrating this is happening as I was hoping to get this into a production environment. It was all working except that log message :( Thanks, Andrew On Fri, Jun 6, 2014 at 3:20 PM, combuster combus...@archlinux.us wrote: Hi Andrew, this is something that I saw in my logs too, first on one node and then on the other three. When that happend on all four of them, engine was corrupted beyond repair. First of all, I think that message is saying that sanlock can't get a lock on the shared storage that you defined for the hostedengine during installation. I got this error when I've tried to manually migrate the hosted engine. 
There is an unresolved bug there and I think it's related to this one: [Bug 1093366 - Migration of hosted-engine vm put target host score to zero] https://bugzilla.redhat.com/show_bug.cgi?id=1093366 This is a blocker bug (or should be) for the selfhostedengine and, from my own experience with it, shouldn't be used in the production enviroment (not untill it's fixed). Nothing that I've done couldn't fix the fact that the score for the target node was Zero, tried to reinstall the node, reboot the node, restarted several services, tailed a tons of logs etc but to no avail. When only one node was left (that was actually running the hosted engine), I brought the engine's vm down gracefully (hosted-engine --vm-shutdown I belive) and after that, when I've tried to start the vm - it wouldn't load. Running VNC showed that the filesystem inside the vm was corrupted and when I ran fsck and finally started up - it was too badly damaged. I succeded to start the engine itself (after repairing postgresql service that wouldn't want to start) but the database was damaged enough and acted pretty weird (showed that storage domains were down but the vm's were running fine etc). Lucky me, I had already exported all of the VM's on the first sign of trouble and then installed ovirt-engine on the dedicated server and attached the export domain. So while really a usefull feature, and it's working (for the most part ie, automatic migration works), manually migrating VM with the hosted-engine will lead to troubles. I hope that my experience
Re: [ovirt-users] iSCSI and multipath
Le 09-06-2014 13:55, Maor Lipchuk a écrit : Hi Nicolas, Which DC level are you using? iSCSI multipath should be supported only from DC with compatibility version of 3.4 Hi Maor, Oops you're right, my both 3.4 datacenters are using 3.3 level. I migrated recently. How safe or risky is it to increase this DC level ? regards, Maor On 06/09/2014 01:06 PM, Nicolas Ecarnot wrote: Hi, Context here : - 2 setups (2 datacenters) in oVirt 3.4.1 with CentOS 6.4 and 6.5 hosts - connected to some LUNs in iSCSI on a dedicated physical network Every host has two interfaces used for management and end-user LAN activity. Every host also have 4 additional NICs dedicated to the iSCSI network. Those 4 NICs were setup from the oVirt web GUI in a bonding with a unique IP address and connected to the SAN. Everything is working fine. I just had to manually tweak some points (MTU, other small things) but it is working. Recently, our SAN dealer told us that using bonding in an iSCSI context was terrible, and the recommendation is to use multipathing. My previous experience pre-oVirt was to agree with that. Long story short is just that when setting up the host from oVirt, it was so convenient to click and setup bonding, and observe it working that I did not pay further attention. (and we seem to have no bottleneck yet). Anyway, I dedicated a host to experiment, I things are not clear to me. I know how to setup NICs, iSCSI and multipath to present the host OS a partition or a logical volume, using multipathing instead of bonding. But in this precise case, what is disturbing me is that many layers described above are managed by oVirt (mount/unmount of LV, creation of bridges on top of bonded interfaces, managing the WWID amongst the cluster). And I see nothing related to multipath at the NICs level. Though I can setup everything fine in the host, this setup does not match what oVirt is expecting : oVirt is expecting a bridge named as the iSCSI network, and able to connect to the SAN. My multipathing is offering the access to the partition of the LUNs, it is not the same. I saw that multipathing is talked here : http://www.ovirt.org/Feature/iSCSI-Multipath I here read : Add an iSCSI Storage to the Data Center Make sure the Data Center contains networks. Go to the Data Center main tab and choose the specific Data Center At the sub tab choose iSCSI Bond The only tabs I see are Storage/Logical Networks/Network QoS/Clusters/Permissions. In this datacenter, I have one iSCSI master storage domain, two iSCSI storage domains and one NFS export domain. What did I miss? Press the new button to add a new iSCSI Bond Configure the networks you want to add to the new iSCSI Bond. Anyway, I'm not sure to understand the point of this wiki page and this implementation : it looks like a much higher level of multipathing over virtual networks, and not at all what I'm talking about above...? Well as you see, I need enlightenments. -- Nicolas Ecarnot ___ Users mailing list Users@ovirt.org http://lists.ovirt.org/mailman/listinfo/users
Re: [ovirt-users] iSCSI and multipath
basically, you should upgrade your DC to 3.4, and then upgrade the clusters you desire also to 3.4. You might need to upgrade your hosts to be compatible with the cluster emulated machines, or they might become non-operational if qemu-kvm does not support it. ether way, you can always ask for advice in the mailing list if you encounter any problem. Regards, Maor On 06/09/2014 03:30 PM, Nicolas Ecarnot wrote: Le 09-06-2014 13:55, Maor Lipchuk a écrit : Hi Nicolas, Which DC level are you using? iSCSI multipath should be supported only from DC with compatibility version of 3.4 Hi Maor, Oops you're right, my both 3.4 datacenters are using 3.3 level. I migrated recently. How safe or risky is it to increase this DC level ? regards, Maor On 06/09/2014 01:06 PM, Nicolas Ecarnot wrote: Hi, Context here : - 2 setups (2 datacenters) in oVirt 3.4.1 with CentOS 6.4 and 6.5 hosts - connected to some LUNs in iSCSI on a dedicated physical network Every host has two interfaces used for management and end-user LAN activity. Every host also have 4 additional NICs dedicated to the iSCSI network. Those 4 NICs were setup from the oVirt web GUI in a bonding with a unique IP address and connected to the SAN. Everything is working fine. I just had to manually tweak some points (MTU, other small things) but it is working. Recently, our SAN dealer told us that using bonding in an iSCSI context was terrible, and the recommendation is to use multipathing. My previous experience pre-oVirt was to agree with that. Long story short is just that when setting up the host from oVirt, it was so convenient to click and setup bonding, and observe it working that I did not pay further attention. (and we seem to have no bottleneck yet). Anyway, I dedicated a host to experiment, I things are not clear to me. I know how to setup NICs, iSCSI and multipath to present the host OS a partition or a logical volume, using multipathing instead of bonding. But in this precise case, what is disturbing me is that many layers described above are managed by oVirt (mount/unmount of LV, creation of bridges on top of bonded interfaces, managing the WWID amongst the cluster). And I see nothing related to multipath at the NICs level. Though I can setup everything fine in the host, this setup does not match what oVirt is expecting : oVirt is expecting a bridge named as the iSCSI network, and able to connect to the SAN. My multipathing is offering the access to the partition of the LUNs, it is not the same. I saw that multipathing is talked here : http://www.ovirt.org/Feature/iSCSI-Multipath I here read : Add an iSCSI Storage to the Data Center Make sure the Data Center contains networks. Go to the Data Center main tab and choose the specific Data Center At the sub tab choose iSCSI Bond The only tabs I see are Storage/Logical Networks/Network QoS/Clusters/Permissions. In this datacenter, I have one iSCSI master storage domain, two iSCSI storage domains and one NFS export domain. What did I miss? Press the new button to add a new iSCSI Bond Configure the networks you want to add to the new iSCSI Bond. Anyway, I'm not sure to understand the point of this wiki page and this implementation : it looks like a much higher level of multipathing over virtual networks, and not at all what I'm talking about above...? Well as you see, I need enlightenments. ___ Users mailing list Users@ovirt.org http://lists.ovirt.org/mailman/listinfo/users
Re: [ovirt-users] iSCSI and multipath
Le 09-06-2014 14:44, Maor Lipchuk a écrit : basically, you should upgrade your DC to 3.4, and then upgrade the clusters you desire also to 3.4. Well, that seems to have worked, except I had to raise the cluster level first, then the DC level. Now, I can see the iSCSI multipath tab has appeared. But I confirm what I wrote below : I saw that multipathing is talked here : http://www.ovirt.org/Feature/iSCSI-Multipath Add an iSCSI Storage to the Data Center Make sure the Data Center contains networks. Go to the Data Center main tab and choose the specific Data Center At the sub tab choose iSCSI Bond Press the new button to add a new iSCSI Bond Configure the networks you want to add to the new iSCSI Bond. Anyway, I'm not sure to understand the point of this wiki page and this implementation : it looks like a much higher level of multipathing over virtual networks, and not at all what I'm talking about above...? I am actually trying to know whether bonding interfaces (at low level) for the iSCSI network is a bad thing, as was told by my storage provider? -- Nicolas Ecarnot ___ Users mailing list Users@ovirt.org http://lists.ovirt.org/mailman/listinfo/users
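For what it's worth, the compatibility bump Nicolas describes (cluster level first, then the data center) can also be scripted with the oVirt Python SDK. A hedged sketch with placeholder names, and the usual caveat that the hosts must support the new level before you do this on production:

from ovirtsdk.api import API
from ovirtsdk.xml import params

api = API(url='https://engine.example.com/api',          # engine URL - placeholder
          username='admin@internal', password='secret', insecure=True)
try:
    new_level = params.Version(major=3, minor=4)

    for cluster in api.clusters.list():                  # clusters first, as Nicolas found...
        cluster.set_version(new_level)
        cluster.update()

    dc = api.datacenters.get(name='Default')             # ...then the data center
    dc.set_version(new_level)
    dc.update()
finally:
    api.disconnect()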
Re: [ovirt-users] iSCSI and multipath
On Mon, Jun 9, 2014 at 9:23 AM, Nicolas Ecarnot nico...@ecarnot.net wrote: Le 09-06-2014 14:44, Maor Lipchuk a écrit : basically, you should upgrade your DC to 3.4, and then upgrade the clusters you desire also to 3.4. Well, that seems to have worked, except I had to raise the cluster level first, then the DC level. Now, I can see the iSCSI multipath tab has appeared. But I confirm what I wrote below : I saw that multipathing is talked here : http://www.ovirt.org/Feature/iSCSI-Multipath Add an iSCSI Storage to the Data Center Make sure the Data Center contains networks. Go to the Data Center main tab and choose the specific Data Center At the sub tab choose iSCSI Bond Press the new button to add a new iSCSI Bond Configure the networks you want to add to the new iSCSI Bond. Anyway, I'm not sure to understand the point of this wiki page and this implementation : it looks like a much higher level of multipathing over virtual networks, and not at all what I'm talking about above...? I am actually trying to know whether bonding interfaces (at low level) for the iSCSI network is a bad thing, as was told by my storage provider? -- Nicolas Ecarnot Hi Nicolas, I think the naming of the managed iscsi multipathing feature a bond might be a bit confusing. It's not an ethernet/nic bond, but a way to group networks and targets together, so it's not bonding interfaces Behind the scenes what it does is creates iscsi ifaces(/var/lib/iscsi/ifaces) and changes the way the iscsiadm calls are constructed to use those ifaces (instead of the default) to connect and login to the targets Hope that helps. -John ___ Users mailing list Users@ovirt.org http://lists.ovirt.org/mailman/listinfo/users
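To illustrate what John describes, here is a rough manual equivalent (an assumption, not oVirt's actual code) of what the iSCSI "bond" does behind the scenes: create one iSCSI iface per dedicated NIC and log in to the target through each of them, so that multipath ends up with one path per interface. Portal, target and NIC names are placeholders:

import subprocess

PORTAL = '10.0.0.10:3260'                                # SAN portal - placeholder
TARGET = 'iqn.2001-05.com.example:storage.lun1'          # target IQN - placeholder
NICS = ['eth2', 'eth3', 'eth4', 'eth5']                  # the dedicated iSCSI NICs

def iscsiadm(*args):
    subprocess.check_call(['iscsiadm'] + list(args))

for nic in NICS:
    iface = 'iface-%s' % nic
    iscsiadm('-m', 'iface', '-I', iface, '--op', 'new')            # creates /var/lib/iscsi/ifaces/iface-ethX
    iscsiadm('-m', 'iface', '-I', iface, '--op', 'update',
             '-n', 'iface.net_ifacename', '-v', nic)               # bind the iface to the physical NIC
    iscsiadm('-m', 'discovery', '-t', 'sendtargets', '-p', PORTAL, '-I', iface)
    iscsiadm('-m', 'node', '-T', TARGET, '-p', PORTAL, '-I', iface, '--login')

# Each login gives the kernel one SCSI path per NIC; device-mapper-multipath
# then aggregates them into a single multipathed device.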
[ovirt-users] oVirt - Node install on CentOS
Could anyone please confirm the correct process to run oVirt node on a standard CentOS install, rather than using the node iso? I'm currently doing the following:
- Install CentOS 6.5
- Install qemu-kvm-rhev rpm's to resolve live snapshot issues on the CentOS supplied rpm's
- Yum install vdsm ovirt-node-plugin-vdsm vdsm-reg
  - I have to remove noexec from /tmp or the config fails
- I then add the node from the ovirt-engine gui
After resolving some problems with group memberships and vdsm requiring sudo access, all is working. Live snapshots and storage migration are OK (tested NFS and Gluster as well). I couldn't really find any docs on how to do this so I just wanted to confirm if what I am doing makes sense. I also don't have the text configuration interface that I would normally get on the oVirt node iso. Can I install this and use it on a non node iso install? Many thanks for any assistance. Simon ___ Users mailing list Users@ovirt.org http://lists.ovirt.org/mailman/listinfo/users
[ovirt-users] Hacking in Ceph rather than Gluster.
So I understand that the news is still fresh and there may not be much going on yet in making Ceph work with oVirt, but I thought I would reach out and see if it was possible to hack them together and still use librbd rather than NFS. I know, why not just use Gluster... the problem is I have tried to use Gluster for VM storage for years and I still don't think it is ready. Ceph still has work to do in other areas, but this is one area where I think it shines. This is a new lab cluster and I would like to try to use Ceph over Gluster if possible. Unless I am missing something, can anyone tell me they are happy with Gluster as a backend image store? This will be a small 16 node 10 gig cluster of shared compute / storage (yes I know people want to keep them separate). nathan stratton | vp technology | broadsoft, inc | +1-240-404-6580 | www.broadsoft.com ___ Users mailing list Users@ovirt.org http://lists.ovirt.org/mailman/listinfo/users
Re: [ovirt-users] oVirt - Node install on CentOS
Simon Barrett wrote: Could anyone please confirm the correct process to run oVirt node on a standard CentOS install, rather than using the node iso? I'm currently doing the following: - Install CentOS 6.5 - Install qemu-kvm-rhev rpm's to resolve live snapshot issues on the CentOS supplied rpm's - Yum install vdsm ovirt-node-plugin-vdsm vdsm-reg o I have to remove noexec from /tmp or the config fails - I then add the node from the ovirt-engine gui After resolving some problems with group memberships and vdsm requiring sudo access, all is working. Live snapshots and storage migration are OK (tested NFS and Gluster as well). I couldn't really find any docs on how to do this so I just wanted to confirm if what I am doing makes sense. I also don't have the text configuration interface that I would normally get on the oVirt node iso. Can I install this and use it on a non node iso install? If you install a minimal CentOS 6.5, add the oVirt repository, and then add the host using the engine web UI, it will install all the needed packages (vdsm/libvirt/kvm) and you're done. You can then replace the standard qemu with the one that will do live snapshots. Depending on where your storage is located, you shouldn't have to tinker with memberships etc. Regards, Joop ___ Users mailing list Users@ovirt.org http://lists.ovirt.org/mailman/listinfo/users
Re: [ovirt-users] Windows guest agent
Hi Bob, Thanks for your feedback. We fixed the issue and the new version of the oVirt WGT ISO (3.5-2 alpha) is now available from the oVirt website: http://resources.ovirt.org/pub/ovirt-master-snapshot-static/iso/ovirt-guest-tools/ovirt-guest-tools-3.5-2.iso as well as the updated installer: http://resources.ovirt.org/pub/ovirt-master-snapshot-static/exe/ovirt-guest-tools/ovirt-guest-tools-3.5-2.exe Please also note that currently upgrades between versions require manually stopping all the relevant services (SPICE and oVirt agents) before performing the upgrade; we're working on getting this fixed as well. Thanks in advance, Lev Veyde.

- Original Message - From: Bob Doolittle b...@doolittle.us.com To: Sandro Bonazzola sbona...@redhat.com, Maurice James mja...@media-node.com, Joop jvdw...@xs4all.nl Cc: Lev Veyde lve...@redhat.com, users@ovirt.org Sent: Friday, June 6, 2014 5:44:42 PM Subject: Re: [ovirt-users] Windows guest agent

Just gave this a try on Windows Server 2008 R2, and it worked almost perfectly! The one small problem I had: https://bugzilla.redhat.com/show_bug.cgi?id=1105624 The service was configured with type Manual rather than Autostart, so it did not restart upon reboot. Easy workaround. Thanks guys - this will be an enormous help! :) -Bob P.S. On my system the "What would you like me to do with this CD?" AutoPlay dialog has a goofy option - Import photos and videos (using Dropbox). Not sure if that's something you can control.

On 06/06/2014 09:56 AM, Sandro Bonazzola wrote: On 06/06/2014 15:29, Maurice James wrote: I think I got it. Just a few key steps that are not obvious for us Python-on-Windows virgins. I will send in some screenshots with text so that someone with write access to the wiki can edit and post it. I suggest trying the shiny new ovirt-guest-tools iso. You can find more info here: http://www.ovirt.org/Features/oVirt_Windows_Guest_Tools

-- From: Joop jvdw...@xs4all.nl To: users@ovirt.org Sent: Friday, June 6, 2014 8:27:53 AM Subject: Re: [ovirt-users] Windows guest agent

On 6-6-2014 14:14, Karli Sjöberg wrote: On 6 Jun 2014 13:39, Maurice James mja...@media-node.com wrote: Yes that FM in particular :) The step of py2exe doesn't work: can't open file 'setup.py': [Errno 2] No such file or directory My cd is where README-windows is located and the archive of today (~10 min ago). Just copying the parent dir to 'Program Files' and executing what is in the README will work though. That's how I have done it all the time. Joop ___ Users mailing list Users@ovirt.org http://lists.ovirt.org/mailman/listinfo/users
Re: [ovirt-users] Hacking in Ceph rather than Gluster.
On 06/09/2014 01:28 PM, Nathan Stratton wrote: So I understand that the news is still fresh and there may not be much going on yet in making Ceph work with ovirt, but I thought I would reach out and see if it was possible to hack them together and still use librbd rather than NFS. I know, why not just use Gluster... the problem is I have tried to use Gluster for VM storage for years and I still don't think it is ready. Ceph still has work in other areas, but this is one area where I think it shines. This is a new lab cluster and I would like to try to use ceph over gluster if possible. Unless I am missing something, can anyone tell me they are happy with Gluster as a backend image store? This will be a small 16 node 10 gig cluster of shared compute / storage (yes I know people want to keep them separate). nathan stratton | vp technology | broadsoft, inc | +1-240-404-6580 | www.broadsoft.com http://www.broadsoft.com ___ Users mailing list Users@ovirt.org http://lists.ovirt.org/mailman/listinfo/users

There was a thread about this recently. afaict, Ceph support will require adding a specific Ceph storage domain to engine and vdsm, which is a full-blown feature (I assume you could try and hack it somewhat with a custom hook). Waiting for the next version planning cycle to see if/how it gets pushed. ___ Users mailing list Users@ovirt.org http://lists.ovirt.org/mailman/listinfo/users
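To make the "custom hook" idea concrete, below is a very rough sketch of a vdsm before_vm_start hook that rewrites one disk into a libvirt network disk backed by RBD. Everything in it (the custom property name, pool/image, monitor address) is hypothetical, and real Ceph support would also need cephx secret handling, which this sketch omits:

#!/usr/bin/python
# before_vm_start hook sketch - rewrites the first disk to an RBD-backed network disk.
import os
import hooking

RBD_IMAGE = os.environ.get('rbd_image')        # custom property, e.g. "rbd/vm-disk-1" (hypothetical)
MON_HOST, MON_PORT = '10.0.0.20', '6789'       # Ceph monitor - placeholder

if RBD_IMAGE:
    domxml = hooking.read_domxml()
    disk = domxml.getElementsByTagName('disk')[0]          # naive: only handles the first disk
    disk.setAttribute('type', 'network')
    source = disk.getElementsByTagName('source')[0]
    for attr in ('file', 'dev'):                           # drop the original file/block source
        if source.hasAttribute(attr):
            source.removeAttribute(attr)
    source.setAttribute('protocol', 'rbd')
    source.setAttribute('name', RBD_IMAGE)
    host = domxml.createElement('host')
    host.setAttribute('name', MON_HOST)
    host.setAttribute('port', MON_PORT)
    source.appendChild(host)
    hooking.write_domxml(domxml)

Dropped into /usr/libexec/vdsm/hooks/before_vm_start/ on each host, something along these lines would let you experiment with RBD-backed disks while proper storage domain support is still being planned.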
[ovirt-users] Live migration - guest VM stall
Hello, at the moment we are investigating stalls of Windows XP VMs during live migration. Our environment consists of:
- FC20 hypervisor nodes
- qemu 1.6.2
- oVirt 3.4.1
- Guest: Windows XP SP2
- VM disks: Virtio and IDE tested
- SPICE / VNC: both tested
- Balloon: with and without tested
- Cluster compatibility: 3.4
- CPU Nehalem
After 2-10 live migrations the Windows XP guest is no longer responsive. First of all we thought that it might be related to SPICE, because we were no longer able to log on to the console. So we installed the XP telnet server in the VM, but that showed similar behaviour:
- The telnet welcome dialogue is always available (network seems ok)
- Sometimes after a live migration, if you enter the password, telnet gives no response. In parallel the SPICE console allows open windows to be moved, but as soon as one clicks on the Start menu the system gives no response.
Even after updating to qemu 2.0 with the virt-preview repositories the behaviour stays the same. Looks like the system cannot access
Any ideas? Markus
This e-mail may contain confidential and/or privileged information. If you are not the intended recipient (or have received this e-mail in error) please notify the sender immediately and destroy this e-mail. Any unauthorized copying, disclosure or distribution of the material in this e-mail is strictly forbidden. E-mails sent over the internet may have been written under a wrong name or been manipulated. That is why this message sent as an e-mail is not a legally binding declaration of intention. Collogia Unternehmensberatung AG Ubierring 11 D-50678 Köln executive board: Kadir Akin Dr. Michael Höhnerbach President of the supervisory board: Hans Kristian Langva Registry office: district court Cologne Register number: HRB 52 497 ___ Users mailing list Users@ovirt.org http://lists.ovirt.org/mailman/listinfo/users
Re: [ovirt-users] Hacking in Ceph rather than Gluster.
Thanks, I will take a look at it, anyone else currently using Gluster for backend images in production? nathan stratton | vp technology | broadsoft, inc | +1-240-404-6580 | www.broadsoft.com On Mon, Jun 9, 2014 at 2:55 PM, Itamar Heim ih...@redhat.com wrote: On 06/09/2014 01:28 PM, Nathan Stratton wrote: So I understand that the news is still fresh and there may not be much going on yet in making Ceph work with ovirt, but I thought I would reach out and see if it was possible to hack them together and still use librdb rather then NFS. I know, why not just use Gluster... the problem is I have tried to use Gluster for VM storage for years and I still don't think it is ready. Ceph still has work in other areas, but this is one area where I think it shines. This is a new lab cluster and I would like to try to use ceph over gluster if possible. Unless I am missing something, can anyone tell me they are happy with Gluster as a backend image store? This will be a small 16 node 10 gig cluster of shared compute / storage (yes I know people want to keep them separate). nathan stratton | vp technology | broadsoft, inc | +1-240-404-6580 | www.broadsoft.com http://www.broadsoft.com ___ Users mailing list Users@ovirt.org http://lists.ovirt.org/mailman/listinfo/users there was a threat about this recently. afaict, ceph support will require adding a specific ceph storage domain to engine and vdsm, which is a full blown feature (I assume you could try and hack it somewhat with a custom hook). waiting for next version planning cycle to see if/how it gets pushed. ___ Users mailing list Users@ovirt.org http://lists.ovirt.org/mailman/listinfo/users
Re: [ovirt-users] VM HostedEngine is down. Exit message: internal error Failed to acquire lock error -243
So after adding the L3 capabilities to my storage network, I'm no longer seeing this issue anymore. So the engine needs to be able to access the storage domain it sits on? But that doesn't show up in the UI? Ivan, was this also the case with your setup? Engine couldn't access storage domain? On Mon, Jun 9, 2014 at 9:56 PM, Andrew Lau and...@andrewklau.com wrote: Interesting, my storage network is a L2 only and doesn't run on the ovirtmgmt (which is the only thing HostedEngine sees) but I've only seen this issue when running ctdb in front of my NFS server. I previously was using localhost as all my hosts had the nfs server on it (gluster). On Mon, Jun 9, 2014 at 9:15 PM, Artyom Lukianov aluki...@redhat.com wrote: I just blocked connection to storage for testing, but on result I had this error: Failed to acquire lock error -243, so I added it in reproduce steps. If you know another steps to reproduce this error, without blocking connection to storage it also can be wonderful if you can provide them. Thanks - Original Message - From: Andrew Lau and...@andrewklau.com To: combuster combus...@archlinux.us Cc: users users@ovirt.org Sent: Monday, June 9, 2014 3:47:00 AM Subject: Re: [ovirt-users] VM HostedEngie is down. Exist message: internal error Failed to acquire lock error -243 I just ran a few extra tests, I had a 2 host, hosted-engine running for a day. They both had a score of 2400. Migrated the VM through the UI multiple times, all worked fine. I then added the third host, and that's when it all fell to pieces. Other two hosts have a score of 0 now. I'm also curious, in the BZ there's a note about: where engine-vm block connection to storage domain(via iptables -I INPUT -s sd_ip -j DROP) What's the purpose for that? On Sat, Jun 7, 2014 at 4:16 PM, Andrew Lau and...@andrewklau.com wrote: Ignore that, the issue came back after 10 minutes. I've even tried a gluster mount + nfs server on top of that, and the same issue has come back. On Fri, Jun 6, 2014 at 6:26 PM, Andrew Lau and...@andrewklau.com wrote: Interesting, I put it all into global maintenance. Shut it all down for 10~ minutes, and it's regained it's sanlock control and doesn't seem to have that issue coming up in the log. On Fri, Jun 6, 2014 at 4:21 PM, combuster combus...@archlinux.us wrote: It was pure NFS on a NAS device. They all had different ids (had no redeployements of nodes before problem occured). Thanks Jirka. On 06/06/2014 08:19 AM, Jiri Moskovcak wrote: I've seen that problem in other threads, the common denominator was nfs on top of gluster. So if you have this setup, then it's a known problem. Or you should double check if you hosts have different ids otherwise they would be trying to acquire the same lock. --Jirka On 06/06/2014 08:03 AM, Andrew Lau wrote: Hi Ivan, Thanks for the in depth reply. I've only seen this happen twice, and only after I added a third host to the HA cluster. I wonder if that's the root problem. Have you seen this happen on all your installs or only just after your manual migration? It's a little frustrating this is happening as I was hoping to get this into a production environment. It was all working except that log message :( Thanks, Andrew On Fri, Jun 6, 2014 at 3:20 PM, combuster combus...@archlinux.us wrote: Hi Andrew, this is something that I saw in my logs too, first on one node and then on the other three. When that happend on all four of them, engine was corrupted beyond repair. 
First of all, I think that message is saying that sanlock can't get a lock on the shared storage that you defined for the hostedengine during installation. I got this error when I've tried to manually migrate the hosted engine. There is an unresolved bug there and I think it's related to this one: [Bug 1093366 - Migration of hosted-engine vm put target host score to zero] https://bugzilla.redhat.com/show_bug.cgi?id=1093366 This is a blocker bug (or should be) for the selfhostedengine and, from my own experience with it, shouldn't be used in the production enviroment (not untill it's fixed). Nothing that I've done couldn't fix the fact that the score for the target node was Zero, tried to reinstall the node, reboot the node, restarted several services, tailed a tons of logs etc but to no avail. When only one node was left (that was actually running the hosted engine), I brought the engine's vm down gracefully (hosted-engine --vm-shutdown I belive) and after that, when I've tried to start the vm - it wouldn't load. Running VNC showed that the filesystem inside the vm was corrupted and when I ran fsck and finally started up - it was too badly damaged. I succeded to start the engine itself (after repairing postgresql service that wouldn't want to start) but the database was damaged enough and acted pretty weird (showed that storage domains were down but the vm's were running fine etc).
Re: [ovirt-users] VM HostedEngine is down. Exit message: internal error Failed to acquire lock error -243
nvm, just as I hit send the error has returned. Ignore this.. On Tue, Jun 10, 2014 at 9:01 AM, Andrew Lau and...@andrewklau.com wrote: So after adding the L3 capabilities to my storage network, I'm no longer seeing this issue anymore. So the engine needs to be able to access the storage domain it sits on? But that doesn't show up in the UI? Ivan, was this also the case with your setup? Engine couldn't access storage domain? On Mon, Jun 9, 2014 at 9:56 PM, Andrew Lau and...@andrewklau.com wrote: Interesting, my storage network is a L2 only and doesn't run on the ovirtmgmt (which is the only thing HostedEngine sees) but I've only seen this issue when running ctdb in front of my NFS server. I previously was using localhost as all my hosts had the nfs server on it (gluster). On Mon, Jun 9, 2014 at 9:15 PM, Artyom Lukianov aluki...@redhat.com wrote: I just blocked connection to storage for testing, but on result I had this error: Failed to acquire lock error -243, so I added it in reproduce steps. If you know another steps to reproduce this error, without blocking connection to storage it also can be wonderful if you can provide them. Thanks - Original Message - From: Andrew Lau and...@andrewklau.com To: combuster combus...@archlinux.us Cc: users users@ovirt.org Sent: Monday, June 9, 2014 3:47:00 AM Subject: Re: [ovirt-users] VM HostedEngie is down. Exist message: internal error Failed to acquire lock error -243 I just ran a few extra tests, I had a 2 host, hosted-engine running for a day. They both had a score of 2400. Migrated the VM through the UI multiple times, all worked fine. I then added the third host, and that's when it all fell to pieces. Other two hosts have a score of 0 now. I'm also curious, in the BZ there's a note about: where engine-vm block connection to storage domain(via iptables -I INPUT -s sd_ip -j DROP) What's the purpose for that? On Sat, Jun 7, 2014 at 4:16 PM, Andrew Lau and...@andrewklau.com wrote: Ignore that, the issue came back after 10 minutes. I've even tried a gluster mount + nfs server on top of that, and the same issue has come back. On Fri, Jun 6, 2014 at 6:26 PM, Andrew Lau and...@andrewklau.com wrote: Interesting, I put it all into global maintenance. Shut it all down for 10~ minutes, and it's regained it's sanlock control and doesn't seem to have that issue coming up in the log. On Fri, Jun 6, 2014 at 4:21 PM, combuster combus...@archlinux.us wrote: It was pure NFS on a NAS device. They all had different ids (had no redeployements of nodes before problem occured). Thanks Jirka. On 06/06/2014 08:19 AM, Jiri Moskovcak wrote: I've seen that problem in other threads, the common denominator was nfs on top of gluster. So if you have this setup, then it's a known problem. Or you should double check if you hosts have different ids otherwise they would be trying to acquire the same lock. --Jirka On 06/06/2014 08:03 AM, Andrew Lau wrote: Hi Ivan, Thanks for the in depth reply. I've only seen this happen twice, and only after I added a third host to the HA cluster. I wonder if that's the root problem. Have you seen this happen on all your installs or only just after your manual migration? It's a little frustrating this is happening as I was hoping to get this into a production environment. It was all working except that log message :( Thanks, Andrew On Fri, Jun 6, 2014 at 3:20 PM, combuster combus...@archlinux.us wrote: Hi Andrew, this is something that I saw in my logs too, first on one node and then on the other three. 
When that happened on all four of them, the engine was corrupted beyond repair. First of all, I think that message is saying that sanlock can't get a lock on the shared storage that you defined for the hosted engine during installation. I got this error when I tried to manually migrate the hosted engine. There is an unresolved bug there and I think it's related to this one: [Bug 1093366 - Migration of hosted-engine vm put target host score to zero] https://bugzilla.redhat.com/show_bug.cgi?id=1093366 This is a blocker bug (or should be) for the self-hosted engine and, from my own experience with it, it shouldn't be used in a production environment (not until it's fixed). Nothing that I did could fix the fact that the score for the target node was zero; I tried to reinstall the node, rebooted the node, restarted several services, tailed tons of logs, etc., but to no avail. When only one node was left (the one actually running the hosted engine), I brought the engine's vm down gracefully (hosted-engine --vm-shutdown, I believe) and after that, when I tried to start the vm - it wouldn't load. Running VNC showed that the filesystem inside the vm was corrupted, and when I ran fsck and finally started it up - it was too badly damaged. I succeeded in starting the engine itself (after repairing postgresql service that wouldn't
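Checking the two conditions mentioned in this thread - duplicate host ids and sanlock's hold on the hosted-engine lockspace - can be done roughly as sketched below. The configuration path assumes a default oVirt 3.4 hosted-engine deployment; the commands are illustrative, not a fix for the bug.
[code]
# Each HA host must have a unique host_id - compare this value across all hosts
# (path assumes a default hosted-engine deployment).
grep host_id /etc/ovirt-hosted-engine/hosted-engine.conf

# Show the lockspaces and resources sanlock currently holds on this host.
sanlock client status

# Overall hosted-engine HA state, including the score each host reports.
hosted-engine --vm-status
[/code]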
Re: [ovirt-users] VM HostedEngie is down. Exist message: internal error Failed to acquire lock error -243
Nah, I've explicitly allowed the hosted-engine vm to access the NAS device serving the NFS share itself, before the deploy procedure even started. But I'm puzzled at how you can reproduce the bug; all was well on my setup before I started manual migration of the engine's vm. Even auto migration worked before that (tested it). Does it just happen without any procedure on the engine itself? Is the score 0 for just one node, or for two of the three of them? On 06/10/2014 01:02 AM, Andrew Lau wrote: nvm, just as I hit send the error has returned. Ignore this.. On Tue, Jun 10, 2014 at 9:01 AM, Andrew Lau and...@andrewklau.com wrote: So after adding the L3 capabilities to my storage network, I'm no longer seeing this issue anymore. So the engine needs to be able to access the storage domain it sits on? But that doesn't show up in the UI? Ivan, was this also the case with your setup? Engine couldn't access the storage domain?
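The reproduce step Artyom referred to earlier in the thread (blocking the host's connection to the storage domain) is essentially a single iptables rule on the host. A minimal sketch, with the storage address as a placeholder:
[code]
# Placeholder for the address of the storage server backing the hosted-engine
# storage domain; substitute the real address.
SD_IP=192.0.2.10

# Blocking traffic from the storage address starves sanlock of its lease and
# reproduces the "Failed to acquire lock error -243" message.
iptables -I INPUT -s "$SD_IP" -j DROP

# Remove the rule again to restore access once the test is done.
iptables -D INPUT -s "$SD_IP" -j DROP
[/code]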
Re: [ovirt-users] Recommended setup for a FC based storage domain
Hm, another update on this one. If I create another VM with another virtual disk on a node that already has a vm running from the FC storage, then libvirt doesn't break. I guess it just happens the first time on any of the nodes. If this is the case, I would have to move all of the VMs off the other two nodes in this four-node cluster and start a VM from the FC storage on each, just to make sure it doesn't break during working hours. I guess it would be fine then. It seems to me that this is some sort of a timeout issue that happens when I start a vm for the first time on the FC SD; this could have something to do with FC card driver settings, or libvirt won't wait for ovirt-engine to present the new LV to the targeted node. I don't see why ovirt-engine waits for the first-time launch of the vm to present the LV at all - shouldn't it be doing this at the time of the virtual disk creation, in case I have selected to run from a specific node? On 06/09/2014 01:49 PM, combuster wrote: Bad news happens only when running a VM for the first time, if it helps... On 06/09/2014 01:30 PM, combuster wrote: OK, I have good news and bad news :) Good news is that I can run different VMs on different nodes when all of their drives are on the FC storage domain. I don't think that all of the I/O is running through the SPM, but I need to test that. Simply put, for every virtual disk that you create on the shared FC storage domain, ovirt will present that vdisk only to the node which is running the VM itself. They all can see the domain infrastructure (inbox, outbox, metadata), but the LV for the virtual disk of that VM is visible only to the node that is running that particular VM. There is no limitation (except for the free space on the storage). Bad news! I can create the virtual disk on the FC storage for a vm, but when I start the VM itself, the node which hosts the VM that I'm starting goes non-operational, and quickly comes up again (the iLO fencing agent checks if the node is ok and brings it back up). During that time, the vm starts on another node (the Default Host parameter was ignored - the assigned Host was not available). I can manually migrate it later to the intended node; that works. Lucky me, on two nodes (of the four) in the cluster there were no VMs running (I tried this on both, with two different VMs created from scratch, and I got the same result). I've cut the logging down to WARNING and above because it was killing the performance of the cluster.
vdsm.log : [code]
Thread-305::WARNING::2014-06-09 12:15:53,236::persistentDict::256::Storage.PersistentDict::(refresh) data has no embedded checksum - trust it as it is
55809e40-ccf3-4f7c-aeec-802bc1c326a7::WARNING::2014-06-09 12:17:25,013::utils::129::root::(rmFile) File: /rhev/data-center/a0500f5c-e8d9-42f1-8f04-15b23514c8ed/55338570-e537-412b-97a9-635eea1ecb10/images/90659ad8-bd90-4a0a-bb4e-7c6afe90e925/242a1bce-a434-4246-ad24-b62f99c03a05 already removed
55809e40-ccf3-4f7c-aeec-802bc1c326a7::WARNING::2014-06-09 12:17:25,074::blockSD::761::Storage.StorageDomain::(_getOccupiedMetadataSlots) Could not find mapping for lv 55338570-e537-412b-97a9-635eea1ecb10/242a1bce-a434-4246-ad24-b62f99c03a05
Thread-305::WARNING::2014-06-09 12:20:54,341::persistentDict::256::Storage.PersistentDict::(refresh) data has no embedded checksum - trust it as it is
Thread-305::WARNING::2014-06-09 12:25:55,378::persistentDict::256::Storage.PersistentDict::(refresh) data has no embedded checksum - trust it as it is
Thread-305::WARNING::2014-06-09 12:30:56,424::persistentDict::256::Storage.PersistentDict::(refresh) data has no embedded checksum - trust it as it is
Thread-1857::WARNING::2014-06-09 12:32:45,639::libvirtconnection::116::root::(wrapper) connection to libvirt broken. ecode: 1 edom: 7
Thread-1857::CRITICAL::2014-06-09 12:32:45,640::libvirtconnection::118::root::(wrapper) taking calling process down.
Thread-17704::WARNING::2014-06-09 12:32:48,009::libvirtconnection::116::root::(wrapper) connection to libvirt broken. ecode: 1 edom: 7
Thread-17704::CRITICAL::2014-06-09 12:32:48,013::libvirtconnection::118::root::(wrapper) taking calling process down.
Thread-17704::ERROR::2014-06-09 12:32:48,018::vm::2285::vm.Vm::(_startUnderlyingVm) vmId=`2bee9d79-b8d1-4a5a-a4f7-8092d1c803d9`::The vm start process failed
Traceback (most recent call last):
  File /usr/share/vdsm/vm.py, line 2245, in _startUnderlyingVm
    self._run()
  File /usr/share/vdsm/vm.py, line 3185, in _run
    self._connection.createXML(domxml, flags),
  File /usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py, line 110, in wrapper
    __connections.get(id(target)).pingLibvirt()
  File /usr/lib64/python2.6/site-packages/libvirt.py, line 3389, in getLibVersion
    if ret == -1: raise libvirtError ('virConnectGetLibVersion() failed', conn=self)
libvirtError: internal error client socket is closed
Thread-1857::WARNING::2014-06-09 12:32:50,673::vm::1963::vm.Vm::(_set_lastStatus)
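One way to check, on a given host, whether the LV backing a virtual disk has actually been activated there (which is what oVirt does lazily at VM start on block storage) is sketched below. On block storage domains the VG name matches the storage domain UUID; the UUID here is taken from the log above and is illustrative only.
[code]
# List the LVs of the block storage domain and their activation state on this
# host; an 'a' in the fifth character of lv_attr means the LV is active here.
lvs -o lv_name,lv_attr,lv_size 55338570-e537-412b-97a9-635eea1ecb10

# Confirm the underlying FC LUN itself is visible through multipath.
multipath -ll
[/code]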
Re: [ovirt-users] VM HostedEngie is down. Exist message: internal error Failed to acquire lock error -243
I'm really having a hard time finding out why it's happening.. If I set the cluster to global maintenance for a minute or two, the scores will reset back to 2400. Set the maintenance mode to none, and all will be fine until a migration occurs. It seems it tries to migrate, fails, and sets the score to 0 permanently rather than for the 10(?) minutes mentioned in one of the oVirt slides. When I have two hosts, the score is 0 only when a migration occurs (just on the host which doesn't have the engine up). The score only goes to 0 when it has tried to migrate after I set the host to local maintenance. Migrating the VM from the UI has worked quite a few times, but it's recently started to fail. When I have three hosts, after ~5 minutes of them all being up the score will hit 0 on the hosts not running the VM. It doesn't even have to attempt a migration before the score goes to 0. Stopping the ha agent on one host and resetting it with the global maintenance method brings it back to the 2-host scenario above. I may move on and just go back to a standalone engine, as I'm not having much luck with this.. On Tue, Jun 10, 2014 at 3:11 PM, combuster combus...@archlinux.us wrote: Nah, I've explicitly allowed the hosted-engine vm to access the NAS device serving the NFS share itself, before the deploy procedure even started. But I'm puzzled at how you can reproduce the bug; all was well on my setup before I started manual migration of the engine's vm. Even auto migration worked before that (tested it). Does it just happen without any procedure on the engine itself? Is the score 0 for just one node, or for two of the three of them?
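The "global maintenance method" used here to reset the scores is, in practice, roughly the sequence below, run from one of the HA hosts (service name as packaged on EL6 hosts; this is a workaround sketch, not a fix for the underlying bug):
[code]
# Stop the HA agents from acting while things are reset.
hosted-engine --set-maintenance --mode=global

# Restart the agent on the host whose score is stuck at 0.
service ovirt-ha-agent restart

# Re-enable HA; healthy hosts should return to a score of 2400.
hosted-engine --set-maintenance --mode=none
[/code]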
Re: [ovirt-users] VM HostedEngie is down. Exist message: internal error Failed to acquire lock error -243
On 06/10/2014 07:19 AM, Andrew Lau wrote: I'm really having a hard time finding out why it's happening.. [...] I may move on and just go back to a standalone engine, as I'm not having much luck with this.. Well, I've done this already. I can't really afford to have so much unplanned downtime on my critical VMs, especially since it would take me several hours (even a whole day) to install a dedicated engine, then set up the nodes if need be, and then import the VMs from the export domain. I would love to help more to resolve this one, but I was pressed for time; I already had ovirt 3.3 running (for a year and a half, rock solid stable, started from 3.1 I think), and I couldn't spare more than a day trying to get around this bug (I had to have a setup running by the end of the weekend). I wasn't using gluster at all, so at least we now know that gluster is not a must in the mix. Besides, Artyom already described it nicely in the bug report; I haven't had anything to add. You were lucky, Andrew - when I tried the global maintenance method and restarted the VM, I got a corrupted filesystem on the engine VM and it wouldn't even start on the one node that had a good score. It was bad health or unknown state on all of the nodes. I managed to repair the fs on the vm via VNC and then just barely bring the services online, but the postgres db was too badly damaged, so the engine misbehaved. At the time, I explained it to myself :) by thinking that the locking mechanism didn't prevent one node from trying to start (or write to) the vm while it was already running on another node, because the filesystem was so damaged that I couldn't believe it - in 15 years I've never seen an extX fs so badly damaged, and the fact that this happened during migration just amped this thought up.
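For reference, the in-VM repair described above boils down to getting a console on the engine VM and running fsck against its root filesystem. The device name below is purely an assumption for illustration; check the actual layout with lsblk or blkid first.
[code]
# From a rescue shell inside the engine VM. /dev/vda1 is an assumed root
# partition; -f forces the check, -y answers yes to all repair prompts.
fsck.ext4 -fy /dev/vda1
[/code]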