Re: [ovirt-users] Servers Hang at 100% CPU On Migration

2015-08-25 Thread Chris Jones - BookIt.com Systems Administrator

I'll give that a try. Thanks.

On 08/23/2015 10:04 PM, Patrick Russell wrote:

We had this exact issue on that same build. Upgrading to oVirt Node - 3.5 - 
0.999.201507082312.el7.centos made the issue disappear for us. It was one of 
the 3.5.3 builds.

Hope this helps.

-Patrick


On Aug 19, 2015, at 1:15 PM, Chris Jones - BookIt.com Systems Administrator 
chris.jo...@bookit.com wrote:

oVirt Node - 3.5 - 0.999.201504280931.el7.centos

When migrating servers using an iSCSI storage domain, about 75% of the time 
they will become unresponsive and stuck at 100% CPU after migration. This does 
not happen with direct LUNs, however.

What causes this? How do I stop it from happening?

Thanks



[ovirt-users] Servers Hang at 100% CPU On Migration

2015-08-19 Thread Chris Jones - BookIt.com Systems Administrator

oVirt Node - 3.5 - 0.999.201504280931.el7.centos

When migrating servers using an iSCSI storage domain, about 75% of the 
time they will become unresponsive and stuck at 100% CPU after 
migration. This does not happen with direct LUNs, however.


What causes this? How do I stop it from happening?

Thanks



Re: [ovirt-users] Servers Hang at 100% CPU On Migration

2015-08-19 Thread Chris Jones - BookIt.com Systems Administrator
I forgot to mention that the VMs have to be forcefully restarted when this happens.


On 08/19/2015 02:15 PM, Chris Jones - BookIt.com Systems Administrator 
wrote:

oVirt Node - 3.5 - 0.999.201504280931.el7.centos

When migrating servers using an iSCSI storage domain, about 75% of the
time they will become unresponsive and stuck at 100% CPU after
migration. This does not happen with direct LUNs, however.

What causes this? How do I stop it from happening?

Thanks




Re: [ovirt-users] Live VM Backups

2015-07-10 Thread Chris Jones - BookIt.com Systems Administrator

Thanks, Soeren. I'll give it a look.

On 07/08/2015 03:36 PM, Soeren Malchow wrote:

Dear Chris,

That is not true: you can snapshot a machine, then clone the snapshot and export the clone for backup purposes; after that you can remove the snapshot, all on the live VM.
However, you need a newer version of libvirt to do that. Right now we are using CentOS 7.1, and the libvirt that comes with it is capable of doing live merge, which is necessary to achieve this.
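For illustration, the snapshot / clone / export flow with the oVirt Python SDK (3.x series) looks roughly like the sketch below. It is a sketch only: the names (myvm, Default, Blank, export_domain), the URL and the credentials are placeholders, and the status polling between steps is omitted.

from ovirtsdk.api import API
from ovirtsdk.xml import params

# connect to the engine (URL and credentials are placeholders)
api = API(url='https://engine.example.com/api',
          username='admin@internal', password='secret', insecure=True)

vm = api.vms.get(name='myvm')

# 1. snapshot the running VM (memory state is not needed for a disk backup)
vm.snapshots.add(params.Snapshot(description='backup',
                                 persist_memorystate=False))
# ... poll here until the snapshot status is 'ok' ...
snap = [s for s in vm.snapshots.list()
        if s.get_description() == 'backup'][0]

# 2. clone a new VM from the snapshot
api.vms.add(params.VM(name='myvm-backup',
                      cluster=api.clusters.get(name='Default'),
                      template=api.templates.get(name='Blank'),
                      snapshots=params.Snapshots(
                          snapshot=[params.Snapshot(id=snap.get_id())])))
# ... poll here until the clone's disks have been created ...

# 3. export the clone to the export domain
clone = api.vms.get(name='myvm-backup')
clone.export(params.Action(
    storage_domain=api.storagedomains.get(name='export_domain')))

# 4. remove the temporary snapshot (this is the live merge step)
snap.delete()

api.disconnect()

Whether that last delete step behaves is exactly the live merge caveat described next.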

But I have to warn you: we are experiencing a problem when removing the snapshots (that part is commented out in the attached script). It sometimes kills virtual machines in a way that makes it necessary to put the hypervisor into maintenance and then restart vdsmd and libvirtd before you can start that VM again.

There is a bug filed already and it is in progress:

https://bugzilla.redhat.com/show_bug.cgi?id=1231754

I also have to add that a newer version of libvirt (on Fedora 20 with the libvirt preview repo) did not have that problem, so I am confident that this will be solved soon.

Last but not least, there is a plan to be able to export snapshots directly for backup without having to clone them first. This is a huge step forward for the backup procedure in terms of the time needed and the load on the storage and hypervisor systems.

I would really appreciate it if you would help improve that script (we are not Python developers). I will see about making this a GitHub project or something like that.

Cheers
Soeren



[ovirt-users] Live VM Backups

2015-07-08 Thread Chris Jones - BookIt.com Systems Administrator
From what I can tell, you can only back up a VM to an export domain if the VM is shut down. Is a live VM backup not possible through oVirt? If not, why not? Most other virtualization tools can handle this.


If it is possible, how do I do it through the backup API? api.vms.myvm.export requires the VM to be shut down, so what would the alternative be?


Thanks.



Re: [ovirt-users] How do I get discard working?

2015-06-11 Thread Chris Jones - BookIt.com Systems Administrator
Looks like I need to learn how to use VDSM hooks. I'll start there. 
Thanks everyone.


On 06/10/2015 04:18 PM, Amador Pahim wrote:

On 06/10/2015 03:24 PM, Fabian Deutsch wrote:

- Original Message -

oVirt Node - 3.5 - 0.999.201504280931.el7.centos

Using our shared storage from bare metal (stock CentOS 7) over iSCSI, I can successfully issue fstrim commands. With oVirt at the VM level, even with direct LUNs, trim commands are not supported, despite having the LVM config in the VMs set up to allow it.

Hey,

IIUIC you are trying to get discard working for VMs? That means that if fstrim is used inside the VM, it should get passed down?

The command line needed for qemu to support discards is:

$ qemu … -drive if=virtio,cache=unsafe,discard,file=disk …

I'm not sure which qemu disk drivers/busses support this, but at least
virtio does so.
I'm using it for development.

You could try a vdsm hook to modify the qemu command which is called
when the VM is spawned.

Let me know if you can come up with a hook to realize this!


There's this hook in code review intended to do so:
https://gerrit.ovirt.org/#/c/29770/
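For a feel of the shape such a hook takes, here is a minimal sketch -- not the hook from that review, just an illustration assuming vdsm's standard hooking module and a disk bus for which qemu/libvirt honour discard (e.g. virtio-scsi):

#!/usr/bin/python
# before_vm_start sketch: add discard='unmap' to every disk <driver>
# element in the domain XML. Error handling and per-disk filtering
# are deliberately omitted.
import hooking

domxml = hooking.read_domxml()
for disk in domxml.getElementsByTagName('disk'):
    for driver in disk.getElementsByTagName('driver'):
        driver.setAttribute('discard', 'unmap')
hooking.write_domxml(domxml)

Dropped into /usr/libexec/vdsm/hooks/before_vm_start/ on each host (and persisted on oVirt Node), something along these lines runs at every VM start; whether the guest actually gets working discard still depends on the disk bus and the qemu version.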



Greetings
fabian


[ovirt-users] How do I get discard working?

2015-06-10 Thread Chris Jones - BookIt.com Systems Administrator

oVirt Node - 3.5 - 0.999.201504280931.el7.centos

Using our shared storage from bare metal (stock CentOS 7) over iSCSI, I can successfully issue fstrim commands. With oVirt at the VM level, even with direct LUNs, trim commands are not supported, despite having the LVM config in the VMs set up to allow it.
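For reference, the kind of check behind that statement, run inside a guest (the mount point is just an example):

lsblk --discard   (non-zero DISC-GRAN/DISC-MAX means the device advertises discard)
fstrim -v /       (without discard support this fails with "the discard operation is not supported")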


Thanks



Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath

2015-06-02 Thread Chris Jones - BookIt.com Systems Administrator
Since this thread shows up at the top of a search for "oVirt Compellent", I should mention that this has been solved. The problem was a bad disk in the Compellent's tier 2 storage. The multipath.conf and iscsi.conf advice is still valid, though, and made oVirt more resilient when the Compellent was struggling.




Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath

2015-05-26 Thread Chris Jones - BookIt.com Systems Administrator

Let's continue this on Bugzilla:


https://bugzilla.redhat.com/show_bug.cgi?id=1225162


Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath

2015-05-22 Thread Chris Jones - BookIt.com Systems Administrator



Is there maybe some IO problem on the iSCSI target side?
IIUIC the problem is some timeout, which could indicate that the target
is overloaded.


Maybe. I need to check with Dell. I did manage to get it to be a little 
more stable with this config.


defaults {
    polling_interval        10
    path_selector           "round-robin 0"
    path_grouping_policy    multibus
    getuid_callout          "/usr/lib/udev/scsi_id --whitelisted --replace-whitespace --device=/dev/%n"
    path_checker            readsector0
    rr_min_io_rq            100
    max_fds                 8192
    rr_weight               priorities
    failback                immediate
    no_path_retry           fail
    user_friendly_names     no
}

devices {
    device {
        vendor          COMPELNT
        product         "Compellent Vol"
        path_checker    tur
        no_path_retry   fail
    }
}

I based it on http://en.community.dell.com/techcenter/enterprise-solutions/w/oracle_solutions/1315.how-to-configure-device-mapper-multipath. I modified it a bit, since that guide is Red Hat 5 specific and there have been some changes since then.


It's not crashing anymore but I'm still seeing storage warnings in 
engine.log. I'm going to be enabling jumbo frames and talking with Dell 
to figure out if it's something on the Compellent side. I'll update here 
once I find something out.


Thanks again for all the help.


[ovirt-users] vdsmd fails to start on boot

2015-05-21 Thread Chris Jones - BookIt.com Systems Administrator
Running oVirt Node - 3.5 - 0.999.201504280931.el7.centos. On first boot, vdsmd fails to start with "Dependency failed for Virtual Desktop Server Manager."


When I run "systemctl start vdsmd" it starts fine. This happens on every reboot. Looks like there is an old bug for this from 3.4: https://bugzilla.redhat.com/show_bug.cgi?id=1055153




Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath

2015-05-21 Thread Chris Jones - BookIt.com Systems Administrator
On 05/21/2015 03:49 PM, Chris Jones - BookIt.com Systems Administrator 
wrote:

I've applied the multipath.conf and iscsi.conf changes you recommended.
It seems to be running better. I was able to bring up all the hosts and
VMs without it falling apart.


I take it back. This did not solve the issue. I tried batch-starting the VMs and half the nodes went down due to the same storage issues. VDSM logs again: https://www.dropbox.com/s/12sudzhaily72nb/vdsm_failures.log.gz?dl=1



Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath

2015-05-21 Thread Chris Jones - BookIt.com Systems Administrator
I've applied the multipath.conf and iscsi.conf changes you recommended. 
It seems to be running better. I was able to bring up all the hosts and 
VMs without it falling apart.


I'm still seeing the "domain ... in problem" and "recovered from problem" warnings in engine.log, though. They were happening only when hosts were activating and when I was mass-launching many VMs. Is this normal?


2015-05-21 15:31:32,264 WARN 
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] 
(org.ovirt.thread.pool-8-thread-13) domain 
c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 in problem. vds: 
blade6c2.ism.ld
2015-05-21 15:31:47,468 INFO 
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] 
(org.ovirt.thread.pool-8-thread-4) Domain 
c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 recovered from problem. 
vds: blade6c2.ism.ld


Here's the vdsm log from a node the engine was warning about: https://www.dropbox.com/s/yaubaxax1w499f1/vdsm2.log.gz?dl=1. It's trimmed to just before and after it happened.


What is that repostat command you mentioned in your previous email, Nir ("repostat vdsm.log")? I don't see it on the engine or the node. Is it used to parse the log? Where can I find it?

Thanks again.


Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath

2015-05-20 Thread Chris Jones - BookIt.com Systems Administrator
Sorry for the delay on this. I am in the process of reproducing the 
error to get the logs.


On 05/19/2015 07:31 PM, Douglas Schilling Landgraf wrote:

Hello Chris,

On 05/19/2015 06:19 PM, Chris Jones - BookIt.com Systems Administrator 
wrote:

Engine: oVirt Engine Version: 3.5.2-1.el7.centos
Nodes: oVirt Node - 3.5 - 0.999.201504280931.el7.centos
Remote storage: Dell Compellent SC8000
Storage setup: 2 NICs connected to the Compellent. Several domains backed by LUNs. Several VM disks using direct LUNs.
Networking: Dell 10 Gb/s switches

I've been struggling with oVirt completely falling apart due to storage-related issues. By falling apart I mean most to all of the nodes suddenly losing contact with the storage domains. This results in an endless loop of the VMs on the failed nodes trying to be migrated and remigrated as the nodes flap between responsive and unresponsive. During
these times, engine.log looks like this.

2015-05-19 03:09:42,443 WARN
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
(org.ovirt.thread.pool-8-thread-50) domain
c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 in problem. vds:
blade6c1.ism.ld
2015-05-19 03:09:42,560 WARN
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
(org.ovirt.thread.pool-8-thread-38) domain
0b1d36e4-7992-43c7-8ac0-740f7c2cadb7:ovirttest1 in problem. vds:
blade2c1.ism.ld
2015-05-19 03:09:45,497 WARN
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
(org.ovirt.thread.pool-8-thread-24) domain
05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 in problem. vds:
blade3c2.ism.ld
2015-05-19 03:09:51,713 WARN
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
(org.ovirt.thread.pool-8-thread-46) domain
b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 in problem. vds:
blade4c2.ism.ld
2015-05-19 03:09:57,647 INFO
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
(org.ovirt.thread.pool-8-thread-13) Domain
c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 recovered from problem.
vds: blade6c1.ism.ld
2015-05-19 03:09:57,782 WARN
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
(org.ovirt.thread.pool-8-thread-6) domain
26929b89-d1ca-4718-90d6-b3a6da585451:generic_data_1 in problem. vds:
blade2c1.ism.ld
2015-05-19 03:09:57,783 INFO
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
(org.ovirt.thread.pool-8-thread-6) Domain
0b1d36e4-7992-43c7-8ac0-740f7c2cadb7:ovirttest1 recovered from problem.
vds: blade2c1.ism.ld
2015-05-19 03:10:00,639 INFO
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
(org.ovirt.thread.pool-8-thread-31) Domain
c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 recovered from problem.
vds: blade4c1.ism.ld
2015-05-19 03:10:00,703 WARN
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
(org.ovirt.thread.pool-8-thread-17) domain
64101f40-0f10-471d-9f5f-44591f9e087d:logging_1 in problem. vds:
blade1c1.ism.ld
2015-05-19 03:10:00,712 INFO
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
(org.ovirt.thread.pool-8-thread-4) Domain
05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 recovered from problem.
vds: blade3c2.ism.ld
2015-05-19 03:10:06,931 INFO
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
(org.ovirt.thread.pool-8-thread-48) Domain
05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 recovered from problem.
vds: blade4c2.ism.ld
2015-05-19 03:10:06,931 INFO
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
(org.ovirt.thread.pool-8-thread-48) Domain
05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 has recovered from
problem. No active host in the DC is reporting it as problematic, so
clearing the domain recovery timer.
2015-05-19 03:10:06,932 INFO
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
(org.ovirt.thread.pool-8-thread-48) Domain
b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 recovered from problem.
vds: blade4c2.ism.ld
2015-05-19 03:10:06,933 INFO
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
(org.ovirt.thread.pool-8-thread-48) Domain
b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 has recovered from
problem. No active host in the DC is reporting it as problematic, so
clearing the domain recovery timer.
2015-05-19 03:10:09,929 WARN
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
(org.ovirt.thread.pool-8-thread-16) domain
b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 in problem. vds:
blade3c1.ism.ld


My troubleshooting steps so far:

 1. Tailing engine.log for "in problem" and "recovered from problem"
 2. Shutting down all the VMs.
 3. Shutting down all but one node.
 4. Bringing up one node at a time to see what the log reports.


vdsm.log on the node side will help here too.


When only one node is active everything is fine. When a second node
comes up, I begin to see the log output as shown above. I've been
struggling with this for over a month. I'm sure others have used oVirt
with a Compellent and encountered (and worked around) similar problems.
I'm looking for some help in figuring out if it's oVirt or something
that I'm doing wrong.

We're close to giving up on oVirt completely because of this.

Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath

2015-05-20 Thread Chris Jones - BookIt.com Systems Administrator

vdsm.log on the node side will help here too.


https://www.dropbox.com/s/zvnttmylmrd0hyx/vdsm.log.gz?dl=0. This log contains only the messages from the point when a host became unresponsive due to storage issues.



# rpm -qa | grep -i vdsm
might help too.


vdsm-cli-4.16.14-0.el7.noarch
vdsm-reg-4.16.14-0.el7.noarch
ovirt-node-plugin-vdsm-0.2.2-5.el7.noarch
vdsm-python-zombiereaper-4.16.14-0.el7.noarch
vdsm-xmlrpc-4.16.14-0.el7.noarch
vdsm-yajsonrpc-4.16.14-0.el7.noarch
vdsm-4.16.14-0.el7.x86_64
vdsm-gluster-4.16.14-0.el7.noarch
vdsm-hook-ethtool-options-4.16.14-0.el7.noarch
vdsm-python-4.16.14-0.el7.noarch
vdsm-jsonrpc-4.16.14-0.el7.noarch



Hey Chris,

please open a bug [1] for this, then we can track it and we can help to
identify the issue.


I will do so.





Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath

2015-05-20 Thread Chris Jones - BookIt.com Systems Administrator



Chris, as you are using ovirt-node, after Nir's suggestions please also execute the command below to save the settings changes across reboots:

# persist /etc/iscsi/iscsid.conf


Thanks. I will do so, but first I have to resolve not being able to 
update multipath.conf as described in my previous email.



Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath

2015-05-20 Thread Chris Jones - BookIt.com Systems Administrator

Another issue may be that the settings for COMPELNT/Compellent Vol are wrong; the setting we ship is missing a lot of settings that exist in the built-in setting, and this may have a bad effect. If your devices match this, I would try this multipath configuration instead of the one vdsm configures.

device {
    vendor                  COMPELNT
    product                 "Compellent Vol"
    path_grouping_policy    multibus
    path_checker            tur
    features                "0"
    hardware_handler        "0"
    prio                    const
    failback                immediate
    rr_weight               uniform
    no_path_retry           fail
}


I wish I could. We're using the CentOS 7 ovirt-node-iso. The multipath.conf is less than ideal, but when I tried updating it, oVirt instantly overwrote it. To be clear: yes, I know changes do not survive reboots, and yes, I know about persist, but oVirt changes the file while the node is running. Live! Persist won't help there.


I also tried building a CentOS 7 thick client where I set up CentOS 7 
first, added the oVirt repo, then let the engine provision it. Same 
problem with multipath.conf being overwritten with the default oVirt setup.


So I tried to be slick about it and made multipath.conf immutable. That prevented the engine from being able to activate the node: it would fail on a vds command that gets the node's capabilities, and part of what that command does is read and then overwrite multipath.conf.


How do I safely update multipath.conf?




To verify that your devices match this, you can check the device's vendor and product strings in the output of multipath -ll. I would like to see the output of this command.


multipath -ll output (default setup) can be seen here: http://paste.linux-help.org/view/430c7538


Another platform issue is the bad default SCSI node.session.timeo.replacement_timeout value, which is set to 120 seconds. This setting means that the SCSI layer will wait 120 seconds for I/O to complete on one path before failing the I/O request. So you may have one bad path causing a 120-second delay, while you could have completed the request using another path.

Multipath tries to set this value to 5 seconds, but the value reverts to the default 120 seconds after a device has trouble. There is an open bug about this which we hope to get fixed in RHEL/CentOS 7.2:
https://bugzilla.redhat.com/1139038

This issue, together with "no_path_retry queue", is a very bad mix for oVirt.

You can fix this timeout by setting:

# /etc/iscsi/iscsid.conf
node.session.timeo.replacement_timeout = 5
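(A hedged aside: with the stock open-iscsi tools the same value can also be pushed into node records that already exist, so the next login picks it up without editing the file by hand:

iscsiadm -m node -o update -n node.session.timeo.replacement_timeout -v 5

and "iscsiadm -m session -P 3" shows, under Timeouts, what the active sessions are actually using.)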


I'll see if that's possible with persist. Will this change survive node 
upgrades?


Thanks for the reply and the suggestions.


[ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath

2015-05-19 Thread Chris Jones - BookIt.com Systems Administrator

Engine: oVirt Engine Version: 3.5.2-1.el7.centos
Nodes: oVirt Node - 3.5 - 0.999.201504280931.el7.centos
Remote storage: Dell Compellent SC8000
Storage setup: 2 NICs connected to the Compellent. Several domains backed by LUNs. Several VM disks using direct LUNs.

Networking: Dell 10 Gb/s switches

I've been struggling with oVirt completely falling apart due to storage-related issues. By falling apart I mean most to all of the nodes suddenly losing contact with the storage domains. This results in an endless loop of the VMs on the failed nodes trying to be migrated and remigrated as the nodes flap between responsive and unresponsive. During
these times, engine.log looks like this.


2015-05-19 03:09:42,443 WARN 
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] 
(org.ovirt.thread.pool-8-thread-50) domain 
c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 in problem. vds: 
blade6c1.ism.ld
2015-05-19 03:09:42,560 WARN 
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] 
(org.ovirt.thread.pool-8-thread-38) domain 
0b1d36e4-7992-43c7-8ac0-740f7c2cadb7:ovirttest1 in problem. vds: 
blade2c1.ism.ld
2015-05-19 03:09:45,497 WARN 
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] 
(org.ovirt.thread.pool-8-thread-24) domain 
05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 in problem. vds: 
blade3c2.ism.ld
2015-05-19 03:09:51,713 WARN 
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] 
(org.ovirt.thread.pool-8-thread-46) domain 
b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 in problem. vds: 
blade4c2.ism.ld
2015-05-19 03:09:57,647 INFO 
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] 
(org.ovirt.thread.pool-8-thread-13) Domain 
c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 recovered from problem. 
vds: blade6c1.ism.ld
2015-05-19 03:09:57,782 WARN 
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] 
(org.ovirt.thread.pool-8-thread-6) domain 
26929b89-d1ca-4718-90d6-b3a6da585451:generic_data_1 in problem. vds: 
blade2c1.ism.ld
2015-05-19 03:09:57,783 INFO 
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] 
(org.ovirt.thread.pool-8-thread-6) Domain 
0b1d36e4-7992-43c7-8ac0-740f7c2cadb7:ovirttest1 recovered from problem. 
vds: blade2c1.ism.ld
2015-05-19 03:10:00,639 INFO 
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] 
(org.ovirt.thread.pool-8-thread-31) Domain 
c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 recovered from problem. 
vds: blade4c1.ism.ld
2015-05-19 03:10:00,703 WARN 
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] 
(org.ovirt.thread.pool-8-thread-17) domain 
64101f40-0f10-471d-9f5f-44591f9e087d:logging_1 in problem. vds: 
blade1c1.ism.ld
2015-05-19 03:10:00,712 INFO 
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] 
(org.ovirt.thread.pool-8-thread-4) Domain 
05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 recovered from problem. 
vds: blade3c2.ism.ld
2015-05-19 03:10:06,931 INFO 
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] 
(org.ovirt.thread.pool-8-thread-48) Domain 
05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 recovered from problem. 
vds: blade4c2.ism.ld
2015-05-19 03:10:06,931 INFO 
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] 
(org.ovirt.thread.pool-8-thread-48) Domain 
05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 has recovered from 
problem. No active host in the DC is reporting it as problematic, so 
clearing the domain recovery timer.
2015-05-19 03:10:06,932 INFO 
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] 
(org.ovirt.thread.pool-8-thread-48) Domain 
b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 recovered from problem. 
vds: blade4c2.ism.ld
2015-05-19 03:10:06,933 INFO 
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] 
(org.ovirt.thread.pool-8-thread-48) Domain 
b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 has recovered from 
problem. No active host in the DC is reporting it as problematic, so 
clearing the domain recovery timer.
2015-05-19 03:10:09,929 WARN 
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] 
(org.ovirt.thread.pool-8-thread-16) domain 
b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 in problem. vds: 
blade3c1.ism.ld



My troubleshooting steps so far:

1. Tailing engine.log for "in problem" and "recovered from problem"
2. Shutting down all the VMs.
3. Shutting down all but one node.
4. Bringing up one node at a time to see what the log reports.

When only one node is active everything is fine. When a second node 
comes up, I begin to see the log output as shown above. I've been 
struggling with this for over a month. I'm sure others have used oVirt 
with a Compellent and encountered (and worked around) similar problems. 
I'm looking for some help in figuring out if it's oVirt or something 
that I'm doing wrong.


We're close to giving up on oVirt completely because of this.

P.S.

I've tested the Compellent via bare metal and Proxmox. Not at the same scale, but it seems to work fine there.


