Re: [ovirt-users] Servers Hang at 100% CPU On Migration
I'll give that a try. Thanks.

On 08/23/2015 10:04 PM, Patrick Russell wrote:
> We had this exact issue on that same build. Upgrading to oVirt Node - 3.5 - 0.999.201507082312.el7.centos made the issue disappear for us. It was one of the 3.5.3 builds. Hope this helps.
>
> -Patrick
>
> On Aug 19, 2015, at 1:15 PM, Chris Jones - BookIt.com Systems Administrator <chris.jo...@bookit.com> wrote:
>> oVirt Node - 3.5 - 0.999.201504280931.el7.centos
>>
>> When migrating servers using an iSCSI storage domain, about 75% of the time they will become unresponsive and stuck at 100% CPU after migration. This does not happen with direct LUNs, however. What causes this? How do I stop it from happening?
>>
>> Thanks
[ovirt-users] Servers Hang at 100% CPU On Migration
oVirt Node - 3.5 - 0.999.201504280931.el7.centos

When migrating servers using an iSCSI storage domain, about 75% of the time they will become unresponsive and stuck at 100% CPU after migration. This does not happen with direct LUNs, however. What causes this? How do I stop it from happening?

Thanks
Re: [ovirt-users] Servers Hang at 100% CPU On Migration
I forgot to mention that the VMs have to be forcefully restarted when this happens.

On 08/19/2015 02:15 PM, Chris Jones - BookIt.com Systems Administrator wrote:
> oVirt Node - 3.5 - 0.999.201504280931.el7.centos
>
> When migrating servers using an iSCSI storage domain, about 75% of the time they will become unresponsive and stuck at 100% CPU after migration. This does not happen with direct LUNs, however. What causes this? How do I stop it from happening?
>
> Thanks
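(For anyone else hitting this: when a stuck VM no longer responds to the engine, it can usually be killed from the host itself with vdsClient. Note that "destroy" is forceful and skips any guest shutdown; the VM ID is whatever "list table" reports for the wedged guest.)

# vdsClient -s 0 list table       # find the stuck VM's ID on this host
# vdsClient -s 0 destroy <vm-id>  # forcefully stop it, bypassing the guest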
Re: [ovirt-users] Live VM Backups
Thanks, Soeren. I'll give it a look.

On 07/08/2015 03:36 PM, Soeren Malchow wrote:
> Dear Chris,
>
> It is not true; you can snapshot a machine, then clone the snapshot and export it for backup purposes. After that you can remove the snapshot, all on the live VM. However, you need newer versions of libvirt to do that. Right now we are using CentOS 7.1, and the libvirt that comes with it is capable of doing live merge, which is necessary to achieve this.
>
> But I have to warn you: we are experiencing a problem when removing the snapshots (that part is commented out in the attached script). It sometimes kills virtual machines in a way that makes it necessary to put the hypervisor into maintenance and then restart vdsmd and libvirtd before you can start that VM again. There is a bug filed already and it is in progress: https://bugzilla.redhat.com/show_bug.cgi?id=1231754
>
> I also have to add that a newer version of libvirt (on Fedora 20 with the libvirt preview repo) did not have that problem, so I am confident that this will be solved soon.
>
> Last but not least, there is a plan to be able to export snapshots right away for backup, without having to clone them first. That would be a huge step forward for the backup procedure in terms of the time needed and the load on the storage and hypervisor systems.
>
> I would really appreciate it if you would help improve that script (we are not Python developers); I will see that I make this a GitHub project or something like that.
>
> Cheers
> Soeren
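(For reference, here is a minimal sketch of the snapshot -> clone -> export -> cleanup flow Soeren describes, against the oVirt 3.x Python SDK (ovirtsdk). The engine URL, credentials, VM name, and export domain name are placeholders, and the polling between steps is elided, so treat it as an outline rather than a working backup script.)

#!/usr/bin/python
# Sketch of a live VM backup via snapshot + clone + export, assuming
# the oVirt 3.x Python SDK. Real code must poll each step until the
# snapshot/clone/export reaches a stable state before continuing.
from ovirtsdk.api import API
from ovirtsdk.xml import params

api = API(url='https://engine.example.com/api',
          username='admin@internal', password='secret', insecure=True)

vm = api.vms.get(name='myvm')

# 1. Snapshot the live VM.
snap = vm.snapshots.add(params.Snapshot(description='backup-snap'))
# ... wait until the snapshot status is 'ok' ...

# 2. Clone a new VM from the snapshot.
clone = api.vms.add(params.VM(
    name='myvm-backup',
    cluster=vm.get_cluster(),
    snapshots=params.Snapshots(snapshot=[params.Snapshot(id=snap.get_id())])))
# ... wait until the clone is down and unlocked ...

# 3. Export the clone to the export domain.
clone.export(params.Action(
    storage_domain=api.storagedomains.get(name='export_domain')))
# ... wait for the export to finish ...

# 4. Remove the clone and the snapshot (live merge needs recent libvirt).
clone.delete()
vm.snapshots.get(id=snap.get_id()).delete()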
[ovirt-users] Live VM Backups
From what I can tell, you can only back up a VM to an export domain if the VM is shut down. Is a live VM backup not possible through oVirt? If not, why not? Most other virtualization tools can handle this.

If it is possible, how do I do it through the backup API? api.vms.myvm.export requires the VM to be shut down, so what would the alternative be?

Thanks.
Re: [ovirt-users] How do I get discard working?
Looks like I need to learn how to use VDSM hooks. I'll start there. Thanks everyone.

On 06/10/2015 04:18 PM, Amador Pahim wrote:
> On 06/10/2015 03:24 PM, Fabian Deutsch wrote:
>> ----- Original Message -----
>>> oVirt Node - 3.5 - 0.999.201504280931.el7.centos
>>>
>>> Using our shared storage via bare metal (stock CentOS 7) iSCSI, I can successfully issue fstrim commands. With oVirt at the VM level, even with direct LUNs, trim commands are not supported, despite having the LVM config in the VMs set up to allow it.
>>
>> Hey,
>>
>> IIUIC you are trying to get discard working for VMs? That is, if fstrim is used inside the VM, it gets passed down?
>>
>> The command line needed for qemu to support discard is:
>>
>>   $ qemu ... -drive if=virtio,cache=unsafe,discard=unmap,file=disk ...
>>
>> I'm not sure which qemu disk drivers/buses support this, but at least virtio does; I'm using it for development. You could try a vdsm hook to modify the qemu command line when the VM is spawned. Let me know if you can come up with a hook to realize this!
>>
>> Greetings
>> fabian
>
> There's this hook in code review intended to do so:
> https://gerrit.ovirt.org/#/c/29770/
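(For the archives, here is roughly what such a hook looks like. This is only a sketch of the mechanism using VDSM's standard hooking module -- it is not the gerrit change linked above -- and it assumes a libvirt new enough (>= 1.0.6) to honor the discard attribute on disk driver elements.)

#!/usr/bin/python
# /usr/libexec/vdsm/hooks/before_vm_start/50_discard  (illustrative path)
# Sketch of a before_vm_start hook: ask libvirt to pass discard/TRIM
# from the guest down to storage by tagging every disk driver element.
import hooking

domxml = hooking.read_domxml()  # the VM's libvirt domain XML (minidom)
for disk in domxml.getElementsByTagName('disk'):
    for driver in disk.getElementsByTagName('driver'):
        driver.setAttribute('discard', 'unmap')
hooking.write_domxml(domxml)    # hand the modified XML back to vdsm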
[ovirt-users] How do I get discard working?
oVirt Node - 3.5 - 0.999.201504280931.el7.centos

Using our shared storage via bare metal (stock CentOS 7) iSCSI, I can successfully issue fstrim commands. With oVirt at the VM level, even with direct LUNs, trim commands are not supported, despite having the LVM config in the VMs set up to allow it.

Thanks
Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath
Since this thread shows up at the top of the search results for "oVirt Compellent", I should mention that this has been solved. The problem was a bad disk in the Compellent's tier 2 storage. The multipath.conf and iscsid.conf advice is still valid, though, and made oVirt more resilient while the Compellent was struggling.
Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath
Let's continue this on Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1225162
Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath
> Is there maybe some IO problem on the iSCSI target side? IIUIC the problem is some timeout, which could indicate that the target is overloaded.

Maybe. I need to check with Dell. I did manage to get it to be a little more stable with this config:

defaults {
    polling_interval        10
    path_selector           "round-robin 0"
    path_grouping_policy    multibus
    getuid_callout          "/usr/lib/udev/scsi_id --whitelisted --replace-whitespace --device=/dev/%n"
    path_checker            readsector0
    rr_min_io_rq            100
    max_fds                 8192
    rr_weight               priorities
    failback                immediate
    no_path_retry           fail
    user_friendly_names     no
}

devices {
    device {
        vendor          "COMPELNT"
        product         "Compellent Vol"
        path_checker    tur
        no_path_retry   fail
    }
}

I referenced it from http://en.community.dell.com/techcenter/enterprise-solutions/w/oracle_solutions/1315.how-to-configure-device-mapper-multipath. I modified it a bit, since that guide is Red Hat 5 specific and there have been some changes.

It's not crashing anymore, but I'm still seeing storage warnings in engine.log. I'm going to enable jumbo frames and talk with Dell to figure out whether it's something on the Compellent side. I'll update here once I find something out.

Thanks again for all the help.
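(One note for anyone following along: multipathd does not pick up an edited /etc/multipath.conf on its own. On EL7, something like the following should reload it -- the -k form talks to the daemon's interactive interface:)

# multipathd -k"reconfigure"   # re-read /etc/multipath.conf
# multipath -ll                # confirm the COMPELNT paths use the new policy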
[ovirt-users] vdsmd fails to start on boot
Running oVirt Node - 3.5 - 0.999.201504280931.el7.centos. On first boot, vdsmd fails to load with "Dependency failed for Virtual Desktop Server Manager." When I run "systemctl start vdsmd" it loads fine. This happens on every reboot.

Looks like there is an old bug for this from 3.4: https://bugzilla.redhat.com/show_bug.cgi?id=1055153
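(To pin down which dependency is actually failing, plain systemd tooling is usually enough -- nothing here is oVirt-specific:)

# systemctl --failed                  # units that failed during this boot
# systemctl list-dependencies vdsmd   # what vdsmd waits on before starting
# journalctl -b -u vdsmd              # vdsmd's messages from this boot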
Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath
On 05/21/2015 03:49 PM, Chris Jones - BookIt.com Systems Administrator wrote:
> I've applied the multipath.conf and iscsid.conf changes you recommended. It seems to be running better. I was able to bring up all the hosts and VMs without it falling apart.

I take it back. This did not solve the issue. I tried batch-starting the VMs and half the nodes went down due to the same storage issues. VDSM logs again: https://www.dropbox.com/s/12sudzhaily72nb/vdsm_failures.log.gz?dl=1
Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath
I've applied the multipath.conf and iscsid.conf changes you recommended. It seems to be running better. I was able to bring up all the hosts and VMs without it falling apart.

I'm still seeing the "domain ... in problem" and "recovered from problem" warnings in engine.log, though. They were happening only when hosts were activating and when I was mass-launching many VMs. Is this normal?

2015-05-21 15:31:32,264 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-13) domain c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 in problem. vds: blade6c2.ism.ld
2015-05-21 15:31:47,468 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-4) Domain c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 recovered from problem. vds: blade6c2.ism.ld

Here's the vdsm log from a node the engine was warning about: https://www.dropbox.com/s/yaubaxax1w499f1/vdsm2.log.gz?dl=1. It's trimmed to just before and after it happened.

What is that repostat command from your previous email, Nir?

> repostat vdsm.log

I don't see it on the engine or the node. Is it used to parse the log? Where can I find it?

Thanks again.
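(As far as I can tell, repostat is a helper script Nir has posted to this list for summarizing the repoStats results that vdsm logs periodically; it is not shipped in any package. If you just want the gist while you look for it, a rough stand-in follows -- this is my own sketch, not Nir's script, and the regex is guessed from the 3.5 vdsm.log format, so adjust it to what your log actually contains.)

#!/usr/bin/python
# Rough repostat stand-in: summarize per-domain 'delay' values from the
# repoStats dicts that vdsm logs. Usage: ./repostat-lite.py vdsm.log
import re
import sys

# Matches fragments like: '<36-char-uuid>': {... 'delay': '0.000748' ...}
pat = re.compile(r"'([0-9a-f-]{36})':\s*\{[^}]*'delay':\s*'([0-9.]+)'")

delays = {}
with open(sys.argv[1]) as log:
    for line in log:
        if 'repoStats' not in line:
            continue
        for uuid, delay in pat.findall(line):
            delays.setdefault(uuid, []).append(float(delay))

for uuid, vals in sorted(delays.items()):
    print('%s  checks=%d  avg=%.4f  max=%.4f' %
          (uuid, len(vals), sum(vals) / len(vals), max(vals)))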
Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath
Sorry for the delay on this. I am in the process of reproducing the error to get the logs.

On 05/19/2015 07:31 PM, Douglas Schilling Landgraf wrote:
> Hello Chris,
>
> On 05/19/2015 06:19 PM, Chris Jones - BookIt.com Systems Administrator wrote:
>> [quoted original post trimmed -- engine/node versions, Compellent storage layout, the engine.log "in problem"/"recovered from problem" excerpt, and the troubleshooting steps; see the original "oVirt Instability with Dell Compellent via iSCSI/Multipath" post below]
>
> vdsm.log on the node side will help here too.
Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath
> vdsm.log on the node side will help here too.

https://www.dropbox.com/s/zvnttmylmrd0hyx/vdsm.log.gz?dl=0. This log contains only the messages at and after the point when a host became unresponsive due to storage issues.

> # rpm -qa | grep -i vdsm might help too.

vdsm-cli-4.16.14-0.el7.noarch
vdsm-reg-4.16.14-0.el7.noarch
ovirt-node-plugin-vdsm-0.2.2-5.el7.noarch
vdsm-python-zombiereaper-4.16.14-0.el7.noarch
vdsm-xmlrpc-4.16.14-0.el7.noarch
vdsm-yajsonrpc-4.16.14-0.el7.noarch
vdsm-4.16.14-0.el7.x86_64
vdsm-gluster-4.16.14-0.el7.noarch
vdsm-hook-ethtool-options-4.16.14-0.el7.noarch
vdsm-python-4.16.14-0.el7.noarch
vdsm-jsonrpc-4.16.14-0.el7.noarch

> Hey Chris, please open a bug [1] for this, then we can track it and we can help to identify the issue.

I will do so.
Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath
> Chris, as you are using ovirt-node, after Nir's suggestions please also execute the command below to save the settings changes across reboots:
>
> # persist /etc/iscsi/iscsid.conf

Thanks. I will do so, but first I have to resolve not being able to update multipath.conf, as described in my previous email.
Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath
> Another issue may be that the setting for COMPELNT/Compellent Vol is wrong; the setting we ship is missing a lot of settings that exist in the built-in setting, and this may have a bad effect. If your devices match this, I would try this multipath configuration instead of the one vdsm configures:
>
> device {
>     vendor                  "COMPELNT"
>     product                 "Compellent Vol"
>     path_grouping_policy    multibus
>     path_checker            tur
>     features                "0"
>     hardware_handler        "0"
>     prio                    const
>     failback                immediate
>     rr_weight               uniform
>     no_path_retry           fail
> }

I wish I could. We're using the CentOS 7 ovirt-node-iso. The multipath.conf is less than ideal, but when I tried updating it, oVirt instantly overwrote it. To be clear: yes, I know changes do not survive reboots, and yes, I know about persist, but it changes the file while running. Live! Persist won't help there.

I also tried building a CentOS 7 thick client, where I set up CentOS 7 first, added the oVirt repo, then let the engine provision it. Same problem with multipath.conf being overwritten with the default oVirt setup.

So I tried to be slick about it. I made multipath.conf immutable. That prevented the engine from being able to activate the node. It would fail on a vds command that gets the node's capabilities; part of what that does is read and then overwrite multipath.conf.

How do I safely update multipath.conf?

> To verify that your devices match this, you can check the devices' vendor and product strings in the output of multipath -ll.

I would like to see the output of this command.

multipath -ll (default setup) can be seen here: http://paste.linux-help.org/view/430c7538

> Another platform issue is the bad default SCSI node.session.timeo.replacement_timeout value, which is set to 120 seconds. This setting means that the SCSI layer will wait 120 seconds for io to complete on one path before failing the io request. So you may have one bad path causing a 120 second delay, while you could complete the request using another path. Multipath is trying to set this value to 5 seconds, but the value reverts to the default 120 seconds after a device has trouble. There is an open bug about this which we hope to get fixed in rhel/centos 7.2: https://bugzilla.redhat.com/1139038
>
> This issue together with no_path_retry queue is a very bad mix for ovirt. You can fix this timeout by setting:
>
> # /etc/iscsi/iscsid.conf
> node.session.timeo.replacement_timeout = 5

I'll see if that's possible with persist. Will this change survive node upgrades?

Thanks for the reply and the suggestions.
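(If memory serves, vdsm only rewrites a multipath.conf that still carries its own "RHEV REVISION" header; tagging the file private makes vdsm leave it alone. Worth verifying against your vdsm version, but the recipe is roughly:)

1. Add the private tag near the top of /etc/multipath.conf (commonly as the second line, in place of the "# RHEV REVISION" line vdsm wrote):

   # RHEV PRIVATE

2. Make your edits, then persist the file on ovirt-node so they survive reboots:

   # persist /etc/multipath.conf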
[ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath
Engine: oVirt Engine Version: 3.5.2-1.el7.centos
Nodes: oVirt Node - 3.5 - 0.999.201504280931.el7.centos
Remote storage: Dell Compellent SC8000
Storage setup: 2 NICs connected to the Compellent. Several domains backed by LUNs. Several VM disks using direct LUN.
Networking: Dell 10 Gb/s switches

I've been struggling with oVirt completely falling apart due to storage related issues. By falling apart I mean most to all of the nodes suddenly losing contact with the storage domains. This results in an endless loop of the VMs on the failed nodes being migrated and remigrated as the nodes flap between responsive and unresponsive. During these times, engine.log looks like this:

2015-05-19 03:09:42,443 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-50) domain c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 in problem. vds: blade6c1.ism.ld
2015-05-19 03:09:42,560 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-38) domain 0b1d36e4-7992-43c7-8ac0-740f7c2cadb7:ovirttest1 in problem. vds: blade2c1.ism.ld
2015-05-19 03:09:45,497 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-24) domain 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 in problem. vds: blade3c2.ism.ld
2015-05-19 03:09:51,713 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-46) domain b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 in problem. vds: blade4c2.ism.ld
2015-05-19 03:09:57,647 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-13) Domain c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 recovered from problem. vds: blade6c1.ism.ld
2015-05-19 03:09:57,782 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-6) domain 26929b89-d1ca-4718-90d6-b3a6da585451:generic_data_1 in problem. vds: blade2c1.ism.ld
2015-05-19 03:09:57,783 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-6) Domain 0b1d36e4-7992-43c7-8ac0-740f7c2cadb7:ovirttest1 recovered from problem. vds: blade2c1.ism.ld
2015-05-19 03:10:00,639 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-31) Domain c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 recovered from problem. vds: blade4c1.ism.ld
2015-05-19 03:10:00,703 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-17) domain 64101f40-0f10-471d-9f5f-44591f9e087d:logging_1 in problem. vds: blade1c1.ism.ld
2015-05-19 03:10:00,712 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-4) Domain 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 recovered from problem. vds: blade3c2.ism.ld
2015-05-19 03:10:06,931 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-48) Domain 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 recovered from problem. vds: blade4c2.ism.ld
2015-05-19 03:10:06,931 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-48) Domain 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 has recovered from problem. No active host in the DC is reporting it as problematic, so clearing the domain recovery timer.
2015-05-19 03:10:06,932 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-48) Domain b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 recovered from problem. vds: blade4c2.ism.ld
2015-05-19 03:10:06,933 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-48) Domain b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 has recovered from problem. No active host in the DC is reporting it as problematic, so clearing the domain recovery timer.
2015-05-19 03:10:09,929 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-16) domain b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 in problem. vds: blade3c1.ism.ld

My troubleshooting steps so far:

1. Tailing engine.log for "in problem" and "recovered from problem".
2. Shutting down all the VMs.
3. Shutting down all but one node.
4. Bringing up one node at a time to see what the log reports.

When only one node is active, everything is fine. When a second node comes up, I begin to see the log output shown above.

I've been struggling with this for over a month. I'm sure others have used oVirt with a Compellent and encountered (and worked around) similar problems. I'm looking for some help in figuring out whether it's oVirt or something that I'm doing wrong. We're close to giving up on oVirt completely because of this.

P.S. I've tested via bare metal and Proxmox with the Compellent. Not at the same scale, but it seems to work fine there.
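(For step 1, a grep keeps the noise down -- nothing oVirt-specific, just the engine log's usual location on the engine host:)

# tail -f /var/log/ovirt-engine/engine.log | grep -E 'in problem|recovered from problem'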