Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath

2015-06-02 Thread Chris Jones - BookIt . com Systems Administrator
Since this thread shows up at the top of the search "oVirt compellent", 
I should mention that this has been solved. The problem was a bad disk 
in the Compellent's tier 2 storage. The multipath.conf and iscsi.conf 
advice is still valid, though, and made oVirt more resilient when the 
Compellent was struggling.


___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath

2015-05-26 Thread Chris Jones - BookIt . com Systems Administrator

Let's continue this on bugzilla.


https://bugzilla.redhat.com/show_bug.cgi?id=1225162
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath

2015-05-22 Thread Nir Soffer
- Original Message -
> From: "Chris Jones - BookIt.com Systems Administrator" 
> 
> To: users@ovirt.org
> Sent: Friday, May 22, 2015 8:55:37 PM
> Subject: Re: [ovirt-users] oVirt Instability with Dell Compellent via 
> iSCSI/Multipath
> 
> 
> > Is there maybe some IO problem on the iSCSI target side?
> > IIUIC the problem is some timeout, which could indicate that the target
> > is overloaded.
> 
> Maybe. I need to check with Dell. I did manage to get it to be a little
> more stable with this config.
> 
> defaults {
>  polling_interval 10
>  path_selector "round-robin 0"
>  path_grouping_policy multibus
>  getuid_callout  "/usr/lib/udev/scsi_id --whitelisted
> --replace-whitespace --device=/dev/%n"
>  path_checker readsector0
>  rr_min_io_rq 100
>  max_fds 8192
>  rr_weight priorities
>  failback immediate
>  no_path_retry fail
>  user_friendly_names no

You should keep the defaults section unchanged, and add the device-specific
settings under the device section.

> }
> devices {
>device {
>  vendor   COMPELNT
>  product  "Compellent Vol"
>  path_checker tur
>  no_path_retryfail

This is most likely missing some settings. You are *not* getting the settings
from the "defaults" section above.

For example, since you did not specify "failback immediate" here, failback for
this device falls back to multipath's builtin default, not the value set
in "defaults" above.

>}
> }
> 
> I referenced it from
> http://en.community.dell.com/techcenter/enterprise-solutions/w/oracle_solutions/1315.how-to-configure-device-mapper-multipath.
> I modified it a bit since that is Red Hat 5 specific and there have been
> some changes.
> 
> It's not crashing anymore but I'm still seeing storage warnings in
> engine.log. I'm going to be enabling jumbo frames and talking with Dell
> to figure out if it's something on the Compellent side. I'll update here
> once I find something out.

Let's continue this on bugzilla.

See also this patch:
https://gerrit.ovirt.org/41244

Nir
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath

2015-05-22 Thread Nir Soffer
- Original Message -
> From: "Chris Jones - BookIt.com Systems Administrator" 
> 
> To: users@ovirt.org
> Sent: Friday, May 22, 2015 12:32:01 AM
> Subject: Re: [ovirt-users] oVirt Instability with Dell Compellent via 
> iSCSI/Multipath
> 
> On 05/21/2015 03:49 PM, Chris Jones - BookIt.com Systems Administrator
> wrote:
> > I've applied the multipath.conf and iscsi.conf changes you recommended.
> > It seems to be running better. I was able to bring up all the hosts and
> > VMs without it falling apart.
> 
> I take it back. This did not solve the issue. I tried batch starting the
> VMs and half the nodes went down due to the same storage issues. VDSM
> Logs again.
> https://www.dropbox.com/s/12sudzhaily72nb/vdsm_failures.log.gz?dl=1

It is possible that the multipath configuration I suggested is not correctly
optimized for your storage server, or that it is too old (last updated in 2013).

Or you have some issue in the network or the storage server.

I would continue with the storage vendor.

Nir
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath

2015-05-22 Thread Nir Soffer


- Original Message -
> From: "Chris Jones - BookIt.com Systems Administrator" 
> 
> To: users@ovirt.org
> Sent: Thursday, May 21, 2015 10:49:23 PM
> Subject: Re: [ovirt-users] oVirt Instability with Dell Compellent via 
> iSCSI/Multipath
> 
> I've applied the multipath.conf and iscsi.conf changes you recommended.
> It seems to be running better. I was able to bring up all the hosts and
> VMs without it falling apart.
> 
> I'm still seeing the domain "in problem" and "recovered from problem"
> warnings in engine.log, though. They were happening only when hosts were
> activating and when I was mass launching many VMs. Is this normal?
> 
> 2015-05-21 15:31:32,264 WARN
> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
> (org.ovirt.thread.pool-8-thread-13) domain
> c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 in problem. vds:
> blade6c2.ism.ld
> 2015-05-21 15:31:47,468 INFO
> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
> (org.ovirt.thread.pool-8-thread-4) Domain
> c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 recovered from problem.
> vds: blade6c2.ism.ld
> 
> Here's the vdsm log from a node the engine was warning about
> https://www.dropbox.com/s/yaubaxax1w499f1/vdsm2.log.gz?dl=1. It's
> trimmed to just before and after it happened.
> 
> What is that repostat command from your previous email, Nir ("repostat
> vdsm.log")? I don't see it on the engine or the node. Is it used to parse
> the log? Where can I find it?

It is available here:
https://gerrit.ovirt.org/38749

Nir
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath

2015-05-22 Thread Chris Jones - BookIt . com Systems Administrator



Is there maybe some IO problem on the iSCSI target side?
IIUIC the problem is some timeout, which could indicate that the target
is overloaded.


Maybe. I need to check with Dell. I did manage to get it to be a little 
more stable with this config.


defaults {
    polling_interval 10
    path_selector "round-robin 0"
    path_grouping_policy multibus
    getuid_callout "/usr/lib/udev/scsi_id --whitelisted --replace-whitespace --device=/dev/%n"
    path_checker readsector0
    rr_min_io_rq 100
    max_fds 8192
    rr_weight priorities
    failback immediate
    no_path_retry fail
    user_friendly_names no
}
devices {
    device {
        vendor        COMPELNT
        product       "Compellent Vol"
        path_checker  tur
        no_path_retry fail
    }
}

I referenced it from 
http://en.community.dell.com/techcenter/enterprise-solutions/w/oracle_solutions/1315.how-to-configure-device-mapper-multipath. 
I modified it a bit since that is Red Hat 5 specific and there have been 
some changes.


It's not crashing anymore but I'm still seeing storage warnings in 
engine.log. I'm going to be enabling jumbo frames and talking with Dell 
to figure out if it's something on the Compellent side. I'll update here 
once I find something out.


Thanks again for all the help.
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath

2015-05-22 Thread Fabian Deutsch
- Original Message -
> On 05/21/2015 03:49 PM, Chris Jones - BookIt.com Systems Administrator
> wrote:
> > I've applied the multipath.conf and iscsi.conf changes you recommended.
> > It seems to be running better. I was able to bring up all the hosts and
> > VMs without it falling apart.
> 
> I take it back. This did not solve the issue. I tried batch starting the
> VMs and half the nodes went down due to the same storage issues. VDSM

Is there maybe some IO problem on the iSCSI target side?
IIUIC the problem is some timeout, which could indicate that the target
is overloaded.

But maybe I'm getting something wrong ...

- fabian

> Logs again.
> https://www.dropbox.com/s/12sudzhaily72nb/vdsm_failures.log.gz?dl=1
> ___
> Users mailing list
> Users@ovirt.org
> http://lists.ovirt.org/mailman/listinfo/users
> 
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath

2015-05-21 Thread Chris Jones - BookIt . com Systems Administrator
On 05/21/2015 03:49 PM, Chris Jones - BookIt.com Systems Administrator 
wrote:

I've applied the multipath.conf and iscsi.conf changes you recommended.
It seems to be running better. I was able to bring up all the hosts and
VMs without it falling apart.


I take it back. This did not solve the issue. I tried batch starting the 
VMs and half the nodes went down due to the same storage issues. VDSM 
Logs again. 
https://www.dropbox.com/s/12sudzhaily72nb/vdsm_failures.log.gz?dl=1

___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath

2015-05-21 Thread Chris Jones - BookIt . com Systems Administrator
I've applied the multipath.conf and iscsi.conf changes you recommended. 
It seems to be running better. I was able to bring up all the hosts and 
VMs without it falling apart.


I'm still seeing the domain "in problem" and "recovered from problem" 
warnings in engine.log, though. They were happening only when hosts were 
activating and when I was mass launching many VMs. Is this normal?


2015-05-21 15:31:32,264 WARN 
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] 
(org.ovirt.thread.pool-8-thread-13) domain 
c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 in problem. vds: 
blade6c2.ism.ld
2015-05-21 15:31:47,468 INFO 
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] 
(org.ovirt.thread.pool-8-thread-4) Domain 
c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 recovered from problem. 
vds: blade6c2.ism.ld


Here's the vdsm log from a node the engine was warning about 
https://www.dropbox.com/s/yaubaxax1w499f1/vdsm2.log.gz?dl=1. It's 
trimmed to just before and after it happened.


What is that repostat command from your previous email, Nir ("repostat 
vdsm.log")? I don't see it on the engine or the node. Is it used to parse 
the log? Where can I find it?


Thanks again.
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath

2015-05-21 Thread Daniel Helgenberger


On 21.05.2015 02:48, Chris Jones - BookIt.com Systems Administrator wrote:
>> Another issue may be that the settings for COMPELNT/Compellent Vol are wrong;
>> the setting we ship is missing a lot of settings that exist in the builtin
>> setting, and this may have a bad effect. If your devices match this, I would
>> try this multipath configuration, instead of the one vdsm configures.
>>
>>  device {
>>  vendor "COMPELNT"
>>  product "Compellent Vol"
>>  path_grouping_policy "multibus"
>>  path_checker "tur"
>>  features "0"
>>  hardware_handler "0"
>>  prio "const"
>>  failback "immediate"
>>  rr_weight "uniform"
>>  no_path_retry fail
>>  }
>
> I wish I could. We're using the CentOS 7 ovirt-node-iso. The
> multipath.conf is less than ideal

I have this issue too. I'm thinking about opening a BZ ;)

> but when I tried updating it, oVirt
> instantly overwrites it. To be clear, yes I know changes do not survive
> reboots and yes I know about persist, but it changes it while running.
> Live! Persist won't help there.
>
> I also tried building a CentOS 7 "thick client" where I set up CentOS 7
> first, added the oVirt repo, then let the engine provision it. Same
> problem with multipath.conf being overwritten with the default oVirt setup.
>
> So I tried to be slick about it. I made the multipath.conf immutable.
> That prevented the engine from being able to activate the node. It would
> fail on a vds command that gets the nodes capabilities and part of what
> it does is reads then overwrites multipath.conf.
>
> How do I safely update multipath.conf?

In the second line of your multipath.conf, add:
# RHEV PRIVATE

Then host deploy will ignore the file and never change it.
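
For example, the top of /etc/multipath.conf would then start like this (the
"# RHEV REVISION 1.1" header line is taken from Jorick's note elsewhere in
this thread; your revision number may differ):

# RHEV REVISION 1.1
# RHEV PRIVATE
# (the rest of your multipath.conf follows unchanged)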

>
>
>>
> >> To verify that your devices match this, you can check the device's vendor
> >> and product strings in the output of "multipath -ll". I would like to see
> >> the output of this command.
>
> multipath -ll (default setup) can be seen here.
> http://paste.linux-help.org/view/430c7538
>
> >> Another platform issue is the bad default SCSI
> >> node.session.timeo.replacement_timeout value, which is set to 120 seconds.
> >> This setting means that the SCSI layer will wait 120 seconds for io to
> >> complete on one path before failing the io request. So you may have one
> >> bad path causing a 120 second delay, while you could complete the request
> >> using another path.
>>
>> Multipath is trying to set this value to 5 seconds, but this value is 
>> reverting
>> to the default 120 seconds after a device has trouble. There is an open bug 
>> about
>> this which we hope to get fixed in the rhel/centos 7.2.
>> https://bugzilla.redhat.com/1139038
>>
>> This issue together with "no_path_retry queue" is a very bad mix for ovirt.
>>
>> You can fix this timeout by setting:
>>
>> # /etc/iscsi/iscsid.conf
>> node.session.timeo.replacement_timeout = 5
>
> I'll see if that's possible with persist. Will this change survive node
> upgrades?
>
> Thanks for the reply and the suggestions.
> ___
> Users mailing list
> Users@ovirt.org
> http://lists.ovirt.org/mailman/listinfo/users
>

-- 
Daniel Helgenberger
m box bewegtbild GmbH

P: +49/30/2408781-22
F: +49/30/2408781-10

ACKERSTR. 19
D-10115 BERLIN


www.m-box.de  www.monkeymen.tv

Geschäftsführer: Martin Retschitzegger / Michaela Göllner
Handeslregister: Amtsgericht Charlottenburg / HRB 112767
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath

2015-05-21 Thread Jorick Astrego


On 05/21/2015 02:47 AM, Chris Jones - BookIt.com Systems Administrator
wrote:
>> Another issue may be that the settings for COMPELNT/Compellent Vol are
>> wrong; the setting we ship is missing a lot of settings that exist in the
>> builtin setting, and this may have a bad effect. If your devices match
>> this, I would try this multipath configuration, instead of the one vdsm
>> configures.
>>
>> device {
>> vendor "COMPELNT"
>> product "Compellent Vol"
>> path_grouping_policy "multibus"
>> path_checker "tur"
>> features "0"
>> hardware_handler "0"
>> prio "const"
>> failback "immediate"
>> rr_weight "uniform"
>> no_path_retry fail
>> }
>
> I wish I could. We're using the CentOS 7 ovirt-node-iso. The
> multipath.conf is less than ideal but when I tried updating it, oVirt
> instantly overwrites it. To be clear, yes I know changes do not
> survive reboots and yes I know about persist, but it changes it while
> running. Live! Persist won't help there.
>
> I also tried building a CentOS 7 "thick client" where I set up CentOS
> 7 first, added the oVirt repo, then let the engine provision it. Same
> problem with multipath.conf being overwritten with the default oVirt
> setup.
>
> So I tried to be slick about it. I made the multipath.conf immutable.
> That prevented the engine from being able to activate the node. It
> would fail on a vds command that gets the nodes capabilities and part
> of what it does is reads then overwrites multipath.conf.
>
> How do I safely update multipath.conf?
>

Somehow the multipath.conf that oVirt generates causes my HDD RAID
controller disks to be renamed from /dev/sdb* and /dev/sdc* (they get claimed
by multipath), so I had to blacklist them, as sketched below.

I was able to persist it by adding "# RHEV PRIVATE" right below the
"# RHEV REVISION 1.1" line.
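
A minimal sketch of what that looks like in /etc/multipath.conf (assuming the
local RAID disks really are sdb and sdc on your host; adjust the devnode
regex to your devices):

# RHEV REVISION 1.1
# RHEV PRIVATE

blacklist {
    devnode "^sd[bc][0-9]*"
}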

Hope this helps






Met vriendelijke groet, With kind regards,

Jorick Astrego

Netbulae Virtualization Experts

Tel: 053 20 30 270   i...@netbulae.eu   Staalsteden 4-3A   KvK 08198180
Fax: 053 20 30 271   www.netbulae.eu    7547 TA Enschede   BTW NL821234584B01



___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath

2015-05-20 Thread Chris Jones - BookIt . com Systems Administrator



Chris, as you are using ovirt-node, after Nir's suggestions please also
execute the command below to save the settings changes across reboots:

# persist /etc/iscsi/iscsid.conf


Thanks. I will do so, but first I have to resolve not being able to 
update multipath.conf as described in my previous email.

___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath

2015-05-20 Thread Chris Jones - BookIt . com Systems Administrator

Another issue may be that the settings for COMPELNT/Compellent Vol are wrong;
the setting we ship is missing a lot of settings that exist in the builtin
setting, and this may have a bad effect. If your devices match this, I would
try this multipath configuration, instead of the one vdsm configures.

device {
vendor "COMPELNT"
product "Compellent Vol"
path_grouping_policy "multibus"
path_checker "tur"
features "0"
hardware_handler "0"
prio "const"
failback "immediate"
rr_weight "uniform"
no_path_retry fail
}


I wish I could. We're using the CentOS 7 ovirt-node-iso. The 
multipath.conf is less than ideal but when I tried updating it, oVirt 
instantly overwrites it. To be clear, yes I know changes do not survive 
reboots and yes I know about persist, but it changes it while running. 
Live! Persist won't help there.


I also tried building a CentOS 7 "thick client" where I set up CentOS 7 
first, added the oVirt repo, then let the engine provision it. Same 
problem with multipath.conf being overwritten with the default oVirt setup.


So I tried to be slick about it. I made the multipath.conf immutable. 
That prevented the engine from being able to activate the node. It would 
fail on a vds command that gets the nodes capabilities and part of what 
it does is reads then overwrites multipath.conf.


How do I safely update multipath.conf?




To verify that your devices match this, you can check the device's vendor and
product strings in the output of "multipath -ll". I would like to see the
output of this command.


multipath -ll (default setup) can be seen here.
http://paste.linux-help.org/view/430c7538


Another platform issue is the bad default SCSI
node.session.timeo.replacement_timeout value, which is set to 120 seconds.
This setting means that the SCSI layer will wait 120 seconds for io to
complete on one path before failing the io request. So you may have one bad
path causing a 120 second delay, while you could complete the request using
another path.

Multipath is trying to set this value to 5 seconds, but this value is reverting
to the default 120 seconds after a device has trouble. There is an open bug
about this, which we hope to get fixed in rhel/centos 7.2.
https://bugzilla.redhat.com/1139038

This issue together with "no_path_retry queue" is a very bad mix for ovirt.

You can fix this timeout by setting:

# /etc/iscsi/iscsid.conf
node.session.timeo.replacement_timeout = 5


I'll see if that's possible with persist. Will this change survive node 
upgrades?
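
For reference, on oVirt Node the change would look roughly like this (a sketch
combining the iscsid.conf value above with the persist command quoted earlier
in the thread; whether it survives node upgrades is exactly the open question
here):

# edit /etc/iscsi/iscsid.conf on the node:
#   node.session.timeo.replacement_timeout = 5
# then persist the file so it survives reboots:
persist /etc/iscsi/iscsid.conf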


Thanks for the reply and the suggestions.
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath

2015-05-20 Thread Douglas Schilling Landgraf

On 05/20/2015 07:10 PM, Nir Soffer wrote:

- Original Message -

From: "Chris Jones - BookIt.com Systems Administrator" 
To: users@ovirt.org
Sent: Thursday, May 21, 2015 12:49:50 AM
Subject: Re: [ovirt-users] oVirt Instability with Dell Compellent via   
iSCSI/Multipath


vdsm.log in the node side, will help here too.


https://www.dropbox.com/s/zvnttmylmrd0hyx/vdsm.log.gz?dl=0. This log
contains only the messages at and after the point when a host became
unresponsive due to storage issues.


According to the log, you have a real issue accessing storage from the host:

[nsoffer@thin untitled (master)]$ repostat vdsm.log
domain: 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2
   delay  avg: 0.000856 min: 0.00 max: 0.001168
   last check avg: 11.51 min: 0.30 max: 64.10
domain: 64101f40-0f10-471d-9f5f-44591f9e087d
   delay  avg: 0.008358 min: 0.00 max: 0.040269
   last check avg: 11.86 min: 0.30 max: 63.40
domain: 31e97cc8-6a10-4a45-8f25-95eba88b4dc0
   delay  avg: 0.007793 min: 0.000819 max: 0.041316
   last check avg: 11.47 min: 0.00 max: 70.20
domain: 842edf83-22c6-46cd-acaa-a1f76d61e545
   delay  avg: 0.000493 min: 0.000374 max: 0.000698
   last check avg: 4.86 min: 0.20 max: 9.90
domain: b050c455-5ab1-4107-b055-bfcc811195fc
   delay  avg: 0.002080 min: 0.00 max: 0.040142
   last check avg: 11.83 min: 0.00 max: 63.70
domain: c46adffc-614a-4fa2-9d2d-954f174f6a39
   delay  avg: 0.004798 min: 0.00 max: 0.041006
   last check avg: 18.42 min: 1.40 max: 102.90
domain: 0b1d36e4-7992-43c7-8ac0-740f7c2cadb7
   delay  avg: 0.001002 min: 0.00 max: 0.001199
   last check avg: 11.56 min: 0.30 max: 61.70
domain: 20153412-f77a-4944-b252-ff06a78a1d64
   delay  avg: 0.003748 min: 0.00 max: 0.040903
   last check avg: 12.18 min: 0.00 max: 67.20
domain: 26929b89-d1ca-4718-90d6-b3a6da585451
   delay  avg: 0.000963 min: 0.00 max: 0.001209
   last check avg: 10.99 min: 0.00 max: 64.30
domain: 0137183b-ea40-49b1-b617-256f47367280
   delay  avg: 0.000881 min: 0.00 max: 0.001227
   last check avg: 11.086667 min: 0.10 max: 63.20

Note the high last check maximum values (e.g. 102 seconds).

Vdsm has a monitor thread for each domain, doing a read from one of the storage
domain's special disks every 10 seconds. When we see a high last check value, it
means that the monitor thread is stuck reading from the disk.

This is an indicator that vms may have trouble accessing these storage domains,
and engine handles this by making the host non-operational, or, if all hosts
cannot access the domain, by making the domain inactive.

One of the known issues that can be related is bad multipath configuration.
Some storage servers have bad builtin configuration embedded into multipath,
in particular "no_path_retry queue" or "no_path_retry 60". These settings mean
that when the SCSI layer fails and multipath does not have any active path, it
will queue io forever (queue) or retry many times (e.g. 60) before failing the
io request.

This can lead to stuck processes, doing a read or write that never fails or
takes many minutes to fail. Vdsm is not designed to handle such delays - a
stuck thread may block other unrelated threads.

Vdsm includes special configuration for your storage vendor (COMPELNT), but 
maybe
it does not match the product (Compellent Vol).
See 
https://github.com/oVirt/vdsm/blob/master/lib/vdsm/tool/configurators/multipath.py#L57

 device {
 vendor  "COMPELNT"
 product "Compellent Vol"
 no_path_retry   fail
 }

Another issue may be that the settings for COMPELNT/Compellent Vol are wrong;
the setting we ship is missing a lot of settings that exist in the builtin
setting, and this may have a bad effect. If your devices match this, I would
try this multipath configuration, instead of the one vdsm configures.

device {
vendor "COMPELNT"
product "Compellent Vol"
path_grouping_policy "multibus"
path_checker "tur"
features "0"
hardware_handler "0"
prio "const"
failback "immediate"
rr_weight "uniform"
no_path_retry fail
}

To verify that your devices match this, you can check the device's vendor and
product strings in the output of "multipath -ll". I would like to see the
output of this command.

Another platform issue is the bad default SCSI
node.session.timeo.replacement_timeout value, which is set to 120 seconds.
This setting means that the SCSI layer will wait 120 seconds for io to
complete on one path before failing the io request. So you may have one bad
path causing a 120 second delay, while you could complete the request using
another path.

Multipath is trying to set this value to 5 seconds, but this value is
reverting to the default 120 seconds after a device has trouble.

Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath

2015-05-20 Thread Nir Soffer
- Original Message -
> From: "Chris Jones - BookIt.com Systems Administrator" 
> 
> To: users@ovirt.org
> Sent: Thursday, May 21, 2015 12:49:50 AM
> Subject: Re: [ovirt-users] oVirt Instability with Dell Compellent via 
> iSCSI/Multipath
> 
> >> vdsm.log in the node side, will help here too.
> 
> https://www.dropbox.com/s/zvnttmylmrd0hyx/vdsm.log.gz?dl=0. This log
> contains only the messages at and after the point when a host became
> unresponsive due to storage issues.

According to the log, you have a real issue accessing storage from the host:

[nsoffer@thin untitled (master)]$ repostat vdsm.log 
domain: 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2
  delay  avg: 0.000856 min: 0.00 max: 0.001168
  last check avg: 11.51 min: 0.30 max: 64.10
domain: 64101f40-0f10-471d-9f5f-44591f9e087d
  delay  avg: 0.008358 min: 0.00 max: 0.040269
  last check avg: 11.86 min: 0.30 max: 63.40
domain: 31e97cc8-6a10-4a45-8f25-95eba88b4dc0
  delay  avg: 0.007793 min: 0.000819 max: 0.041316
  last check avg: 11.47 min: 0.00 max: 70.20
domain: 842edf83-22c6-46cd-acaa-a1f76d61e545
  delay  avg: 0.000493 min: 0.000374 max: 0.000698
  last check avg: 4.86 min: 0.20 max: 9.90
domain: b050c455-5ab1-4107-b055-bfcc811195fc
  delay  avg: 0.002080 min: 0.00 max: 0.040142
  last check avg: 11.83 min: 0.00 max: 63.70
domain: c46adffc-614a-4fa2-9d2d-954f174f6a39
  delay  avg: 0.004798 min: 0.00 max: 0.041006
  last check avg: 18.42 min: 1.40 max: 102.90
domain: 0b1d36e4-7992-43c7-8ac0-740f7c2cadb7
  delay  avg: 0.001002 min: 0.00 max: 0.001199
  last check avg: 11.56 min: 0.30 max: 61.70
domain: 20153412-f77a-4944-b252-ff06a78a1d64
  delay  avg: 0.003748 min: 0.00 max: 0.040903
  last check avg: 12.18 min: 0.00 max: 67.20
domain: 26929b89-d1ca-4718-90d6-b3a6da585451
  delay  avg: 0.000963 min: 0.00 max: 0.001209
  last check avg: 10.99 min: 0.00 max: 64.30
domain: 0137183b-ea40-49b1-b617-256f47367280
  delay  avg: 0.000881 min: 0.00 max: 0.001227
  last check avg: 11.086667 min: 0.10 max: 63.20

Note the high last check maximum values (e.g. 102 seconds).

Vdsm has a monitor thread for each domain, doing a read from one of the storage
domain's special disks every 10 seconds. When we see a high last check value, it
means that the monitor thread is stuck reading from the disk.

This is an indicator that vms may have trouble accessing these storage domains,
and engine handles this by making the host non-operational, or, if all hosts
cannot access the domain, by making the domain inactive.

One of the known issues that can be related is bad multipath configuration.
Some storage servers have bad builtin configuration embedded into multipath,
in particular "no_path_retry queue" or "no_path_retry 60". These settings mean
that when the SCSI layer fails and multipath does not have any active path, it
will queue io forever (queue) or retry many times (e.g. 60) before failing the
io request.

This can lead to stuck processes, doing a read or write that never fails or
takes many minutes to fail. Vdsm is not designed to handle such delays - a
stuck thread may block other unrelated threads.

Vdsm includes special configuration for your storage vendor (COMPELNT), but 
maybe
it does not match the product (Compellent Vol).
See 
https://github.com/oVirt/vdsm/blob/master/lib/vdsm/tool/configurators/multipath.py#L57

device {
vendor  "COMPELNT"
product "Compellent Vol"
no_path_retry   fail
}

Another issue may be that the settings for COMPELNT/Compellent Vol are wrong;
the setting we ship is missing a lot of settings that exist in the builtin
setting, and this may have a bad effect. If your devices match this, I would
try this multipath configuration, instead of the one vdsm configures.

    device {
        vendor "COMPELNT"
        product "Compellent Vol"
        path_grouping_policy "multibus"
        path_checker "tur"
        features "0"
        hardware_handler "0"
        prio "const"
        failback "immediate"
        rr_weight "uniform"
        no_path_retry fail
    }

To verify that your devices match this, you can check the device's vendor and
product strings in the output of "multipath -ll". I would like to see the
output of this command.

Another platform issue is the bad default SCSI
node.session.timeo.replacement_timeout value, which is set to 120 seconds.
This setting means that the SCSI layer will wait 120 seconds for io to
complete on one path before failing the io request.

Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath

2015-05-20 Thread Chris Jones - BookIt . com Systems Administrator

vdsm.log in the node side, will help here too.


https://www.dropbox.com/s/zvnttmylmrd0hyx/vdsm.log.gz?dl=0. This log 
contains only the messages at and after the point when a host became 
unresponsive due to storage issues.



# rpm -qa | grep -i vdsm
might help too.


vdsm-cli-4.16.14-0.el7.noarch
vdsm-reg-4.16.14-0.el7.noarch
ovirt-node-plugin-vdsm-0.2.2-5.el7.noarch
vdsm-python-zombiereaper-4.16.14-0.el7.noarch
vdsm-xmlrpc-4.16.14-0.el7.noarch
vdsm-yajsonrpc-4.16.14-0.el7.noarch
vdsm-4.16.14-0.el7.x86_64
vdsm-gluster-4.16.14-0.el7.noarch
vdsm-hook-ethtool-options-4.16.14-0.el7.noarch
vdsm-python-4.16.14-0.el7.noarch
vdsm-jsonrpc-4.16.14-0.el7.noarch



Hey Chris,

please open a bug [1] for this, then we can track it and we can help to
identify the issue.


I will do so.



___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath

2015-05-20 Thread Chris Jones - BookIt . com Systems Administrator
Sorry for the delay on this. I am in the process of reproducing the 
error to get the logs.


On 05/19/2015 07:31 PM, Douglas Schilling Landgraf wrote:

Hello Chris,

On 05/19/2015 06:19 PM, Chris Jones - BookIt.com Systems Administrator 
wrote:

Engine: oVirt Engine Version: 3.5.2-1.el7.centos
Nodes: oVirt Node - 3.5 - 0.999.201504280931.el7.centos
Remote storage: Dell Compellent SC8000
Storage setup: 2 nics connected to the Compellent. Several domains
backed by LUNs. Several VM disk using direct LUN.
Networking: Dell 10 Gb/s switches

I've been struggling with oVirt completely falling apart due to storage
related issues. By falling apart I mean most to all of the nodes
suddenly losing contact with the storage domains. This results in an
endless loop of the VMs on the failed nodes trying to be migrated and
remigrated as the nodes flap between response and unresponsive. During
these times, engine.log looks like this.

2015-05-19 03:09:42,443 WARN
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
(org.ovirt.thread.pool-8-thread-50) domain
c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 in problem. vds:
blade6c1.ism.ld
2015-05-19 03:09:42,560 WARN
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
(org.ovirt.thread.pool-8-thread-38) domain
0b1d36e4-7992-43c7-8ac0-740f7c2cadb7:ovirttest1 in problem. vds:
blade2c1.ism.ld
2015-05-19 03:09:45,497 WARN
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
(org.ovirt.thread.pool-8-thread-24) domain
05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 in problem. vds:
blade3c2.ism.ld
2015-05-19 03:09:51,713 WARN
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
(org.ovirt.thread.pool-8-thread-46) domain
b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 in problem. vds:
blade4c2.ism.ld
2015-05-19 03:09:57,647 INFO
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
(org.ovirt.thread.pool-8-thread-13) Domain
c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 recovered from problem.
vds: blade6c1.ism.ld
2015-05-19 03:09:57,782 WARN
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
(org.ovirt.thread.pool-8-thread-6) domain
26929b89-d1ca-4718-90d6-b3a6da585451:generic_data_1 in problem. vds:
blade2c1.ism.ld
2015-05-19 03:09:57,783 INFO
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
(org.ovirt.thread.pool-8-thread-6) Domain
0b1d36e4-7992-43c7-8ac0-740f7c2cadb7:ovirttest1 recovered from problem.
vds: blade2c1.ism.ld
2015-05-19 03:10:00,639 INFO
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
(org.ovirt.thread.pool-8-thread-31) Domain
c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 recovered from problem.
vds: blade4c1.ism.ld
2015-05-19 03:10:00,703 WARN
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
(org.ovirt.thread.pool-8-thread-17) domain
64101f40-0f10-471d-9f5f-44591f9e087d:logging_1 in problem. vds:
blade1c1.ism.ld
2015-05-19 03:10:00,712 INFO
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
(org.ovirt.thread.pool-8-thread-4) Domain
05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 recovered from problem.
vds: blade3c2.ism.ld
2015-05-19 03:10:06,931 INFO
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
(org.ovirt.thread.pool-8-thread-48) Domain
05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 recovered from problem.
vds: blade4c2.ism.ld
2015-05-19 03:10:06,931 INFO
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
(org.ovirt.thread.pool-8-thread-48) Domain
05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 has recovered from
problem. No active host in the DC is reporting it as problematic, so
clearing the domain recovery timer.
2015-05-19 03:10:06,932 INFO
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
(org.ovirt.thread.pool-8-thread-48) Domain
b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 recovered from problem.
vds: blade4c2.ism.ld
2015-05-19 03:10:06,933 INFO
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
(org.ovirt.thread.pool-8-thread-48) Domain
b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 has recovered from
problem. No active host in the DC is reporting it as problematic, so
clearing the domain recovery timer.
2015-05-19 03:10:09,929 WARN
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
(org.ovirt.thread.pool-8-thread-16) domain
b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 in problem. vds:
blade3c1.ism.ld


My troubleshooting steps so far:

 1. Tailing engine.log for "in problem" and "recovered from problem"
 2. Shutting down all the VMs.
 3. Shutting down all but one node.
 4. Bringing up one node at a time to see what the log reports.


vdsm.log in the node side, will help here too.


When only one node is active everything is fine. When a second node
comes up, I begin to see the log output as shown above. I've been
struggling with this for over a month. I'm sure others have used oVirt
with a Compellent and encountered (and worked around) similar problems.
I'm looking for some help in figuring out if it's oVirt or something
that I'm doing wrong.

We're close to giving up on oVirt completely because of this.

Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath

2015-05-20 Thread Andrea Ghelardi
Hi Chris,
I have an Ovirt + Dell Compellent similar to yours (previous model, not
SC8000) and sometimes I faced issues similar to yours.
From my experience I can advise you to:
A) Check the links between the SAN and the servers: all paths, all
configuration, cabling. Everything should be set up correctly (all redundant
paths green, server mappings, etc.) BEFORE installing oVirt. We had a running
KVM environment before "upgrading" it to oVirt 3.5.1.
B) Also check that fencing is working both manually and automatically
(connections to iDRAC etc.). This is a kind of prerequisite to have HA
working.
C) I also noticed that when something is not going well on one of the shared
storage domains, it brings down the whole cluster (the VMs keep running, but a
lot of headaches follow). Note that oVirt tries to stabilize the situation
itself for ~15 minutes or more; it is slow in re-fencing etc. Sometimes it
enters a loop and you have to locate the problematic storage. You want to
check that multipath is working correctly on every server, as in the quick
check sketched below.
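
A minimal per-host sanity check along those lines might be (plain commands,
nothing oVirt-specific; exact output varies with your multipath and
iscsi-initiator-utils versions):

# run on each hypervisor
multipath -ll        # every Compellent LUN should show all expected paths active/ready
iscsiadm -m session  # expect one iSCSI session per configured target portal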

If you are having problems with just two nodes, I guess something is not
really OK at the configuration level. I have 2 clusters, 12 hosts, and a lot
of shared storage domains working, and usually when something goes wrong it is
because of human error (like when I deleted the LUN on the SAN before
destroying the storage domain in the oVirt interface).

On the other hand, I have the overall impression that the system is not
forgiving at all and that it is far from rock solid.

Cheers
AG
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath

2015-05-20 Thread Fabian Deutsch
- Original Message -
> Hello Chris,
> 
> On 05/19/2015 06:19 PM, Chris Jones - BookIt.com Systems Administrator
> wrote:
> > Engine: oVirt Engine Version: 3.5.2-1.el7.centos
> > Nodes: oVirt Node - 3.5 - 0.999.201504280931.el7.centos
> > Remote storage: Dell Compellent SC8000
> > Storage setup: 2 nics connected to the Compellent. Several domains
> > backed by LUNs. Several VM disk using direct LUN.
> > Networking: Dell 10 Gb/s switches
> >
> > I've been struggling with oVirt completely falling apart due to storage
> > related issues. By falling apart I mean most to all of the nodes
> > suddenly losing contact with the storage domains. This results in an
> > endless loop of the VMs on the failed nodes trying to be migrated and
> > remigrated as the nodes flap between response and unresponsive. During
> > these times, engine.log looks like this.
> >
> > 2015-05-19 03:09:42,443 WARN
> > [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
> > (org.ovirt.thread.pool-8-thread-50) domain
> > c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 in problem. vds:
> > blade6c1.ism.ld
> > 2015-05-19 03:09:42,560 WARN
> > [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
> > (org.ovirt.thread.pool-8-thread-38) domain
> > 0b1d36e4-7992-43c7-8ac0-740f7c2cadb7:ovirttest1 in problem. vds:
> > blade2c1.ism.ld
> > 2015-05-19 03:09:45,497 WARN
> > [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
> > (org.ovirt.thread.pool-8-thread-24) domain
> > 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 in problem. vds:
> > blade3c2.ism.ld
> > 2015-05-19 03:09:51,713 WARN
> > [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
> > (org.ovirt.thread.pool-8-thread-46) domain
> > b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 in problem. vds:
> > blade4c2.ism.ld
> > 2015-05-19 03:09:57,647 INFO
> > [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
> > (org.ovirt.thread.pool-8-thread-13) Domain
> > c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 recovered from problem.
> > vds: blade6c1.ism.ld
> > 2015-05-19 03:09:57,782 WARN
> > [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
> > (org.ovirt.thread.pool-8-thread-6) domain
> > 26929b89-d1ca-4718-90d6-b3a6da585451:generic_data_1 in problem. vds:
> > blade2c1.ism.ld
> > 2015-05-19 03:09:57,783 INFO
> > [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
> > (org.ovirt.thread.pool-8-thread-6) Domain
> > 0b1d36e4-7992-43c7-8ac0-740f7c2cadb7:ovirttest1 recovered from problem.
> > vds: blade2c1.ism.ld
> > 2015-05-19 03:10:00,639 INFO
> > [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
> > (org.ovirt.thread.pool-8-thread-31) Domain
> > c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 recovered from problem.
> > vds: blade4c1.ism.ld
> > 2015-05-19 03:10:00,703 WARN
> > [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
> > (org.ovirt.thread.pool-8-thread-17) domain
> > 64101f40-0f10-471d-9f5f-44591f9e087d:logging_1 in problem. vds:
> > blade1c1.ism.ld
> > 2015-05-19 03:10:00,712 INFO
> > [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
> > (org.ovirt.thread.pool-8-thread-4) Domain
> > 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 recovered from problem.
> > vds: blade3c2.ism.ld
> > 2015-05-19 03:10:06,931 INFO
> > [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
> > (org.ovirt.thread.pool-8-thread-48) Domain
> > 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 recovered from problem.
> > vds: blade4c2.ism.ld
> > 2015-05-19 03:10:06,931 INFO
> > [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
> > (org.ovirt.thread.pool-8-thread-48) Domain
> > 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 has recovered from
> > problem. No active host in the DC is reporting it as problematic, so
> > clearing the domain recovery timer.
> > 2015-05-19 03:10:06,932 INFO
> > [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
> > (org.ovirt.thread.pool-8-thread-48) Domain
> > b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 recovered from problem.
> > vds: blade4c2.ism.ld
> > 2015-05-19 03:10:06,933 INFO
> > [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
> > (org.ovirt.thread.pool-8-thread-48) Domain
> > b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 has recovered from
> > problem. No active host in the DC is reporting it as problematic, so
> > clearing the domain recovery timer.
> > 2015-05-19 03:10:09,929 WARN
> > [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
> > (org.ovirt.thread.pool-8-thread-16) domain
> > b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 in problem. vds:
> > blade3c1.ism.ld
> >
> >
> > My troubleshooting steps so far:
> >
> >  1. Tailing engine.log for "in problem" and "recovered from problem"
> >  2. Shutting down all the VMs.
> >  3. Shutting down all but one node.
> >  4. Bringing up one node at a time to see what the log reports.
> 
> vdsm.log in the node side, will help here too.
> 
> > When only one node is active everything is fine. When a second node
> > comes up, I begin to see the log output as shown above.

Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath

2015-05-19 Thread Douglas Schilling Landgraf

Hello Chris,

On 05/19/2015 06:19 PM, Chris Jones - BookIt.com Systems Administrator 
wrote:

Engine: oVirt Engine Version: 3.5.2-1.el7.centos
Nodes: oVirt Node - 3.5 - 0.999.201504280931.el7.centos
Remote storage: Dell Compellent SC8000
Storage setup: 2 nics connected to the Compellent. Several domains
backed by LUNs. Several VM disk using direct LUN.
Networking: Dell 10 Gb/s switches

I've been struggling with oVirt completely falling apart due to storage
related issues. By falling apart I mean most to all of the nodes
suddenly losing contact with the storage domains. This results in an
endless loop of the VMs on the failed nodes trying to be migrated and
remigrated as the nodes flap between response and unresponsive. During
these times, engine.log looks like this.

2015-05-19 03:09:42,443 WARN
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
(org.ovirt.thread.pool-8-thread-50) domain
c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 in problem. vds:
blade6c1.ism.ld
2015-05-19 03:09:42,560 WARN
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
(org.ovirt.thread.pool-8-thread-38) domain
0b1d36e4-7992-43c7-8ac0-740f7c2cadb7:ovirttest1 in problem. vds:
blade2c1.ism.ld
2015-05-19 03:09:45,497 WARN
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
(org.ovirt.thread.pool-8-thread-24) domain
05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 in problem. vds:
blade3c2.ism.ld
2015-05-19 03:09:51,713 WARN
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
(org.ovirt.thread.pool-8-thread-46) domain
b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 in problem. vds:
blade4c2.ism.ld
2015-05-19 03:09:57,647 INFO
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
(org.ovirt.thread.pool-8-thread-13) Domain
c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 recovered from problem.
vds: blade6c1.ism.ld
2015-05-19 03:09:57,782 WARN
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
(org.ovirt.thread.pool-8-thread-6) domain
26929b89-d1ca-4718-90d6-b3a6da585451:generic_data_1 in problem. vds:
blade2c1.ism.ld
2015-05-19 03:09:57,783 INFO
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
(org.ovirt.thread.pool-8-thread-6) Domain
0b1d36e4-7992-43c7-8ac0-740f7c2cadb7:ovirttest1 recovered from problem.
vds: blade2c1.ism.ld
2015-05-19 03:10:00,639 INFO
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
(org.ovirt.thread.pool-8-thread-31) Domain
c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 recovered from problem.
vds: blade4c1.ism.ld
2015-05-19 03:10:00,703 WARN
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
(org.ovirt.thread.pool-8-thread-17) domain
64101f40-0f10-471d-9f5f-44591f9e087d:logging_1 in problem. vds:
blade1c1.ism.ld
2015-05-19 03:10:00,712 INFO
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
(org.ovirt.thread.pool-8-thread-4) Domain
05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 recovered from problem.
vds: blade3c2.ism.ld
2015-05-19 03:10:06,931 INFO
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
(org.ovirt.thread.pool-8-thread-48) Domain
05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 recovered from problem.
vds: blade4c2.ism.ld
2015-05-19 03:10:06,931 INFO
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
(org.ovirt.thread.pool-8-thread-48) Domain
05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 has recovered from
problem. No active host in the DC is reporting it as problematic, so
clearing the domain recovery timer.
2015-05-19 03:10:06,932 INFO
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
(org.ovirt.thread.pool-8-thread-48) Domain
b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 recovered from problem.
vds: blade4c2.ism.ld
2015-05-19 03:10:06,933 INFO
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
(org.ovirt.thread.pool-8-thread-48) Domain
b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 has recovered from
problem. No active host in the DC is reporting it as problematic, so
clearing the domain recovery timer.
2015-05-19 03:10:09,929 WARN
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
(org.ovirt.thread.pool-8-thread-16) domain
b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 in problem. vds:
blade3c1.ism.ld


My troubleshooting steps so far:

 1. Tailing engine.log for "in problem" and "recovered from problem"
 2. Shutting down all the VMs.
 3. Shutting down all but one node.
 4. Bringing up one node at a time to see what the log reports.


vdsm.log in the node side, will help here too.


When only one node is active everything is fine. When a second node
comes up, I begin to see the log output as shown above. I've been
struggling with this for over a month. I'm sure others have used oVirt
with a Compellent and encountered (and worked around) similar problems.
I'm looking for some help in figuring out if it's oVirt or something
that I'm doing wrong.

We're close to giving up on oVirt completely because of this.

P.S.

I've tested via bare metal and Proxmox with the Compellent. Not at the
same scale but it seems to work fine there.

[ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath

2015-05-19 Thread Chris Jones - BookIt . com Systems Administrator

Engine: oVirt Engine Version: 3.5.2-1.el7.centos
Nodes: oVirt Node - 3.5 - 0.999.201504280931.el7.centos
Remote storage: Dell Compellent SC8000
Storage setup: 2 NICs connected to the Compellent. Several domains 
backed by LUNs. Several VM disks using direct LUNs.

Networking: Dell 10 Gb/s switches

I've been struggling with oVirt completely falling apart due to storage 
related issues. By falling apart I mean most to all of the nodes 
suddenly losing contact with the storage domains. This results in an 
endless loop of the VMs on the failed nodes trying to be migrated and 
remigrated as the nodes flap between responsive and unresponsive. During 
these times, engine.log looks like this.


2015-05-19 03:09:42,443 WARN 
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] 
(org.ovirt.thread.pool-8-thread-50) domain 
c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 in problem. vds: 
blade6c1.ism.ld
2015-05-19 03:09:42,560 WARN 
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] 
(org.ovirt.thread.pool-8-thread-38) domain 
0b1d36e4-7992-43c7-8ac0-740f7c2cadb7:ovirttest1 in problem. vds: 
blade2c1.ism.ld
2015-05-19 03:09:45,497 WARN 
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] 
(org.ovirt.thread.pool-8-thread-24) domain 
05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 in problem. vds: 
blade3c2.ism.ld
2015-05-19 03:09:51,713 WARN 
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] 
(org.ovirt.thread.pool-8-thread-46) domain 
b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 in problem. vds: 
blade4c2.ism.ld
2015-05-19 03:09:57,647 INFO 
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] 
(org.ovirt.thread.pool-8-thread-13) Domain 
c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 recovered from problem. 
vds: blade6c1.ism.ld
2015-05-19 03:09:57,782 WARN 
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] 
(org.ovirt.thread.pool-8-thread-6) domain 
26929b89-d1ca-4718-90d6-b3a6da585451:generic_data_1 in problem. vds: 
blade2c1.ism.ld
2015-05-19 03:09:57,783 INFO 
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] 
(org.ovirt.thread.pool-8-thread-6) Domain 
0b1d36e4-7992-43c7-8ac0-740f7c2cadb7:ovirttest1 recovered from problem. 
vds: blade2c1.ism.ld
2015-05-19 03:10:00,639 INFO 
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] 
(org.ovirt.thread.pool-8-thread-31) Domain 
c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 recovered from problem. 
vds: blade4c1.ism.ld
2015-05-19 03:10:00,703 WARN 
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] 
(org.ovirt.thread.pool-8-thread-17) domain 
64101f40-0f10-471d-9f5f-44591f9e087d:logging_1 in problem. vds: 
blade1c1.ism.ld
2015-05-19 03:10:00,712 INFO 
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] 
(org.ovirt.thread.pool-8-thread-4) Domain 
05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 recovered from problem. 
vds: blade3c2.ism.ld
2015-05-19 03:10:06,931 INFO 
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] 
(org.ovirt.thread.pool-8-thread-48) Domain 
05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 recovered from problem. 
vds: blade4c2.ism.ld
2015-05-19 03:10:06,931 INFO 
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] 
(org.ovirt.thread.pool-8-thread-48) Domain 
05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 has recovered from 
problem. No active host in the DC is reporting it as problematic, so 
clearing the domain recovery timer.
2015-05-19 03:10:06,932 INFO 
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] 
(org.ovirt.thread.pool-8-thread-48) Domain 
b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 recovered from problem. 
vds: blade4c2.ism.ld
2015-05-19 03:10:06,933 INFO 
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] 
(org.ovirt.thread.pool-8-thread-48) Domain 
b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 has recovered from 
problem. No active host in the DC is reporting it as problematic, so 
clearing the domain recovery timer.
2015-05-19 03:10:09,929 WARN 
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] 
(org.ovirt.thread.pool-8-thread-16) domain 
b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 in problem. vds: 
blade3c1.ism.ld



My troubleshooting steps so far:

1. Tailing engine.log for "in problem" and "recovered from problem"
2. Shutting down all the VMs.
3. Shutting down all but one node.
4. Bringing up one node at a time to see what the log reports.

When only one node is active everything is fine. When a second node 
comes up, I begin to see the log output as shown above. I've been 
struggling with this for over a month. I'm sure others have used oVirt 
with a Compellent and encountered (and worked around) similar problems. 
I'm looking for some help in figuring out if it's oVirt or something 
that I'm doing wrong.


We're close to giving up on oVirt completely because of this.

P.S.

I've tested via bare metal and Proxmox with the Compellent. Not at the 
same scale but it seems to work fine there.


