Re: [ovirt-users] Ovirt 4.0.6 guests 'Not Responding'

2017-02-07 Thread Mark Greenall
Bug 1419856 Submitted


From: Mark Greenall
Sent: 06 February 2017 17:32
To: 'Pavel Gashev' <p...@acronis.com>; users@ovirt.org
Subject: RE: [ovirt-users] Ovirt 4.0.6 guests 'Not Responding'

Ok, thanks Pavel. I’ll file a bug report with the logs and report back once 
done.

From: Pavel Gashev [mailto:p...@acronis.com]
Sent: 06 February 2017 17:11
To: Mark Greenall <m.green...@iontrading.com>; users@ovirt.org
Subject: Re: [ovirt-users] Ovirt 4.0.6 guests 'Not Responding'

Mark,

In your case all 30 workers were busy with vdsm.virt.sampling.HostMonitor tasks discarded by timeout, and there were 3000 tasks in the queue. I have encountered the same problem; in my case the ISO domain was not responding.

The issue is that the vdsm executor doesn't remove discarded workers. This is a bug.
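
If you want to confirm the state on a host, the discarded HostMonitor workers and the queue state should be visible in vdsm.log (default log location assumed; the exact message wording can differ between vdsm versions, so treat the pattern as a starting point):

grep -E 'executor queue full|discarded' /var/log/vdsm/vdsm.log | tail -n 20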



From: Mark Greenall <m.green...@iontrading.com>
Date: Monday 6 February 2017 at 18:20
To: Pavel Gashev <p...@acronis.com>, "users@ovirt.org" <users@ovirt.org>
Subject: RE: [ovirt-users] Ovirt 4.0.6 guests 'Not Responding'

Hi Pavel,

Thanks for responding. I bounced the vdsmd service; the guests recovered, and the monitor and queue-full messages also cleared. However, we kept getting intermittent “Guest x Not Responding” messages from the Hosted Engine, although in most cases the guests would recover almost immediately. On the odd occasion a guest stayed “Not Responding” and I had to bounce the vdsmd service again. The host had a memory load of around 85% (out of 768GB) and a CPU load of around 65% (48 cores). I have since added another host to that cluster and spread the guests between the two hosts. This seems to have cleared the messages completely (at least for the last 5 days anyway).

I suspect the problem is load related. At what capacity would Ovirt regard a 
host as being ‘full’?

Thanks,
Mark

From: Pavel Gashev [mailto:p...@acronis.com]
Sent: 31 January 2017 15:19
To: Mark Greenall <m.green...@iontrading.com>; users@ovirt.org
Subject: Re: [ovirt-users] Ovirt 4.0.6 guests 'Not Responding'

Mark,

Could you please file a bug report?

Restarting the vdsmd service should resolve the “executor queue full” state.
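
On CentOS 7 that is, for example:

systemctl restart vdsmd

Restarting vdsmd should not affect running guests, only host monitoring for a short time.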


From: <users-boun...@ovirt.org> on behalf of Mark Greenall <m.green...@iontrading.com>
Date: Monday 30 January 2017 at 15:26
To: "users@ovirt.org" <users@ovirt.org>
Subject: [ovirt-users] Ovirt 4.0.6 guests 'Not Responding'

Hi,

Host server: Dell PowerEdge R815 (40 cores and 768GB memory)
Storage: Dell Equallogic (Firmware V8.1.4)
OS: CentOS 7.3 (although the same thing happens on 7.2)
Ovirt: 4.0.6.3-1

We have several Ovirt clusters. Two of the hosts (in separate clusters) are showing as Up in the Hosted Engine, but the guests running on them are showing as Not Responding. I can connect to the guests via ssh, etc., but I can't interact with them from the Ovirt GUI. Everything was fine on Saturday (28th Jan) morning, but it looks like something happened on Sunday morning around 07:14, as we suddenly see the following in engine.log on one host:

2017-01-29 07:14:26,952 INFO  
[org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] 
(DefaultQuartzScheduler1) [53ca8dc5] VM 
'd0aa990f-e6aa-4e79-93ce-011fe1372fb0'(lnd-ion-lindev-01) moved from 'Up' --> 
'NotResponding'
2017-01-29 07:14:27,069 WARN  
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] 
(DefaultQuartzScheduler1) [53ca8dc5] Correlation ID: null, Call Stack: null, 
Custom Event ID: -1, Message: VM lnd-ion-lindev-01 is not responding.
2017-01-29 07:14:27,070 INFO  
[org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] 
(DefaultQuartzScheduler1) [53ca8dc5] VM 
'788bfc0e-1712-469e-9a0a-395b8bb3f369'(lnd-ion-windev-02) moved from 'Up' --> 
'NotResponding'
2017-01-29 07:14:27,088 WARN  
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] 
(DefaultQuartzScheduler1) [53ca8dc5] Correlation ID: null, Call Stack: null, 
Custom Event ID: -1, Message: VM lnd-ion-windev-02 is not responding.
2017-01-29 07:14:27,089 INFO  
[org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] 
(DefaultQuartzScheduler1) [53ca8dc5] VM 
'd7eaa4ec-d65e-45c0-bc4f-505100658121'(lnd-ion-windev-04) moved from 'Up' --> 
'NotResponding'
2017-01-29 07:14:27,103 WARN  
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] 
(DefaultQuartzScheduler1) [53ca8dc5] Correlation ID: null, Call Stack: null, 
Custom Event ID: -1, Message: VM lnd-ion-windev-04 is not responding.
2017-01-29 07:14:27,104 INFO  
[org.ovirt.engine.core.vdsbroker.monitori


Re: [ovirt-users] Ovirt host activation and lvm looping with high CPU load trying to mount iSCSI storage

2017-01-16 Thread Mark Greenall
Hi,

To try and get a baseline here, I've reverted most of the changes we've made and am running the host with just the following iSCSI-related configuration settings. The tweaks had been made over time to try and alleviate several storage-related problems, but it's possible that fixes in Ovirt (we've gradually gone from early 3.x to 4.0.6) make them redundant now and that they simply compound the problem. I'll start with these configuration settings and then move on to trying the vdsm patch.

/etc/multipath.conf (note: polling_interval and max_fds would not be accepted in the devices section; I think they are valid in the defaults section only):

# VDSM REVISION 1.3
# VDSM PRIVATE

blacklist {
   devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
   devnode "^hd[a-z]"
   devnode "^sda$"
}

defaults {
    deferred_remove         yes
    dev_loss_tmo            30
    fast_io_fail_tmo        5
    flush_on_last_del       yes
    max_fds                 4096
    no_path_retry           fail
    polling_interval        5
    user_friendly_names     no
}

devices {
    device {
        vendor                  "EQLOGIC"
        product                 "100E-00"

        # Ovirt defaults
        deferred_remove         yes
        dev_loss_tmo            30
        fast_io_fail_tmo        5
        flush_on_last_del       yes
        #polling_interval       5
        user_friendly_names     no

        # Local settings
        #max_fds                8192
        path_checker            tur
        path_grouping_policy    multibus
        path_selector           "round-robin 0"

        # Using 4 retries provides an additional 20 seconds of grace time when no
        # path is available before the device is disabled (assuming a 5 second
        # polling interval). This may prevent VMs from pausing when there is a
        # short outage on the storage server or network.
        no_path_retry           4
    }

    device {
        # These settings override built-in device settings. They do not apply
        # to devices without built-in settings (those use the settings in the
        # "defaults" section), or to devices defined in the "devices" section.
        all_devs                yes
        no_path_retry           fail
    }
}
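
As a sanity check on what multipathd actually ends up using once the defaults, built-in and devices sections are merged, the running configuration can be dumped and filtered (standard device-mapper-multipath tooling on CentOS 7; the grep is just an example):

multipathd show config | grep -A 20 EQLOGIC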


/etc/iscsi/iscsid.conf default apart from:

node.session.initial_login_retry_max = 12
node.session.cmds_max = 1024
node.session.queue_depth = 128
node.startup = manual
node.session.iscsi.FastAbort = No
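
As far as I know, these defaults are only picked up by node records created at discovery time and by new logins, so the per-node values actually in use can be double-checked with something like the following (the target IQN and portal are placeholders; use one of the Equallogic targets shown elsewhere in this thread):

iscsiadm -m node -T <target_iqn> -p <portal_ip>:3260 | grep -E 'cmds_max|queue_depth|FastAbort'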




The following settings have been commented out / removed:

/etc/sysctl.conf:

# For more information, see sysctl.conf(5) and sysctl.d(5).
# Prevent ARP Flux for multiple NICs on the same subnet:
#net.ipv4.conf.all.arp_ignore = 1
#net.ipv4.conf.all.arp_announce = 2
# Loosen RP Filter to allow multiple iSCSI connections
#net.ipv4.conf.all.rp_filter = 2


/lib/udev/rules.d:

# Various settings for Dell Equallogic disks based on the Dell Optimizing SAN Environment for Linux Guide
#
# Modify disk scheduler mode to noop
#ACTION=="add|change", SUBSYSTEM=="block", ATTRS{vendor}=="EQLOGIC", RUN+="/bin/sh -c 'echo noop > /sys/${DEVPATH}/queue/scheduler'"
# Modify disk timeout value to 60 seconds
#ACTION!="remove", SUBSYSTEM=="block", ATTRS{vendor}=="EQLOGIC", RUN+="/bin/sh -c 'echo 60 > /sys/%p/device/timeout'"
# Modify read ahead value to 1024
#ACTION!="remove", SUBSYSTEM=="block", ATTRS{vendor}=="EQLOGIC", RUN+="/bin/sh -c 'echo 1024 > /sys/${DEVPATH}/bdi/read_ahead_kb'"
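
If we re-enable these later, the effect can be spot-checked per device (sdX being any of the EQLOGIC paths listed by multipath -ll), e.g.:

cat /sys/block/sdX/queue/scheduler
cat /sys/block/sdX/queue/read_ahead_kb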

I've also removed our defined iSCSI interfaces and have simply left the Ovirt 'default' interface.

Rebooted and 'Activated' the host:

16:09 - Host Activated
16:10 - Host goes Non Operational, saying it can't access storage domain 'Unknown'
16:12 - Host Activated again
16:12 - Host not responding, goes 'Connecting'
16:15 - Can't access ALL the storage domains; host goes Non Operational again
16:17 - Host Activated again
16:18 - Can't access ALL the storage domains; host goes Non Operational again
16:20 - Host auto-recovers and goes Activating again

That cycle repeated until I started getting VDSM timeout messages, the constant LVM processes and high CPU load. At 16:30 I rebooted the host and set its status to maintenance.

A second host activation attempt just resulted in the same cycle as above. The host now doesn't come online at all.

Next step will be to try the vdsm patch.

Mark


Re: [ovirt-users] Ovirt host activation and lvm looping with high CPU load trying to mount iSCSI storage

2017-01-13 Thread Mark Greenall
I've just been catching up with all the threads and saw mention of some iscsid.conf settings, which reminded me that we also changed some of those from their defaults, as per the previously mentioned Dell Optimizing SAN Environment for Linux Guide.

Changed from default in /etc/iscsi/iscsid.conf
node.session.initial_login_retry_max = 12
node.session.cmds_max = 1024
node.session.queue_depth = 128
node.startup = manual
node.session.iscsi.FastAbort = No

As mentioned by a couple of people, I do just hope this is a case of an optimization conflict between Ovirt and the Equallogic. I just don't understand why every now and again a host will come up and stay up. In the Ovirt Equallogic cluster I have currently battled to get three of the hosts up (and running guests); I am left with the fourth host, which I'm using for this call, and it just refuses to stay up. It may not be specifically related to Ovirt 4.x, but I do know we never used to have this much of a battle getting nodes online. I'm quite happy to change settings on this one host, but I can't make cluster-wide changes as that would likely bring all the guests down.

As some added information, here are the iSCSI connection details for one of the storage domains. As mentioned, we are using the 2 x 10Gb iSCSI HBAs in an LACP group in Ovirt and on the Cisco switches. Hence we see a login from the same source address (but two different interfaces) to the same (single) persistent address, which is the controller's virtual group address. The Current Portal addresses are the eth0 and eth1 addresses of the Equallogic's active controller.

Target: 
iqn.2001-05.com.equallogic:4-42a846-654479033-f9888b77feb584ec-lnd-ion-db-tprm-dstore01
 (non-flash)
Current Portal: 10.100.214.76:3260,1
Persistent Portal: 10.100.214.77:3260,1
**
Interface:
**
Iface Name: bond1.10
Iface Transport: tcp
Iface Initiatorname: iqn.1994-05.com.redhat:a53470a0ae32
Iface IPaddress: 10.100.214.59
Iface HWaddress: 
Iface Netdev: uk1iscsivlan10
SID: 95
iSCSI Connection State: LOGGED IN
iSCSI Session State: LOGGED_IN
Internal iscsid Session State: NO CHANGE
Current Portal: 10.100.214.75:3260,1
Persistent Portal: 10.100.214.77:3260,1
**
Interface:
**
Iface Name: default
Iface Transport: tcp
Iface Initiatorname: iqn.1994-05.com.redhat:a53470a0ae32
Iface IPaddress: 10.100.214.59
Iface HWaddress: 
Iface Netdev: 
SID: 96
iSCSI Connection State: LOGGED IN
iSCSI Session State: LOGGED_IN
Internal iscsid Session State: NO CHANGE
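
For reference, per-session detail like the above comes from iscsiadm, e.g.:

iscsiadm -m session -P 2

(-P 3 would also list the attached SCSI devices for each session.)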

Thanks,
Mark


Re: [ovirt-users] Ovirt host activation and lvm looping with high CPU load trying to mount iSCSI storage

2017-01-13 Thread Mark Greenall
Hi Nir,

Thanks very much for your feedback. It's really useful information, and I'm keeping my fingers crossed that it leads to a solution for us.

All the settings we currently have were put in place to try and optimise the Equallogic for Linux and Ovirt.

The multipath config settings came from this Dell Forum thread re: getting 
EqualLogic to work with Ovirt 
http://en.community.dell.com/support-forums/storage/f/3775/t/19529606

The udev settings were from the Dell Optimizing SAN Environment for Linux Guide 
here: 
https://www.google.co.uk/url?sa=t=j==s=web=1=0ahUKEwiXvJes4L7RAhXLAsAKHVWLDyQQFggiMAA=http%3A%2F%2Fen.community.dell.com%2Fdell-groups%2Fdtcmedia%2Fm%2Fmediagallery%2F20371245%2Fdownload=AFQjCNG0J8uWEb90m-BwCH_nZJ8lEB3lFA=bv.144224172,d.d24=rja

Perhaps some of the settings are now conflicting with Ovirt best practice as 
you optimise the releases.

As requested, here is the output of multipath -ll

[root@uk1-ion-ovm-08 rules.d]# multipath -ll
364842a3403798409cf7d555c6b8bb82e dm-237 EQLOGIC ,100E-00
size=1.5T features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 48:0:0:0  sdan 66:112 active ready running
  `- 49:0:0:0  sdao 66:128 active ready running
364842a34037924a7bf7d25416b8be891 dm-212 EQLOGIC ,100E-00
size=345G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 42:0:0:0  sdah 66:16  active ready running
  `- 43:0:0:0  sdai 66:32  active ready running
364842a340379c497f47ee5fe6c8b9846 dm-459 EQLOGIC ,100E-00
size=175G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 86:0:0:0  sdbz 68:208 active ready running
  `- 87:0:0:0  sdca 68:224 active ready running
364842a34037944f2807fe5d76d8b1842 dm-526 EQLOGIC ,100E-00
size=200G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 96:0:0:0  sdcj 69:112 active ready running
  `- 97:0:0:0  sdcl 69:144 active ready running
364842a3403798426d37e05bc6c8b6843 dm-420 EQLOGIC ,100E-00
size=250G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 82:0:0:0  sdbu 68:128 active ready running
  `- 83:0:0:0  sdbw 68:160 active ready running
364842a340379449fbf7dc5406b8b2818 dm-199 EQLOGIC ,100E-00
size=200G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 38:0:0:0  sdad 65:208 active ready running
  `- 39:0:0:0  sdae 65:224 active ready running
364842a34037984543c7d35a86a8bc8ee dm-172 EQLOGIC ,100E-00
size=670G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 36:0:0:0  sdaa 65:160 active ready running
  `- 37:0:0:0  sdac 65:192 active ready running
364842a340379e4303c7dd5a76a8bd8b4 dm-140 EQLOGIC ,100E-00
size=1.5T features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 33:0:0:0  sdx  65:112 active ready running
  `- 32:0:0:0  sdy  65:128 active ready running
364842a340379b44c7c7ed53b6c8ba8c0 dm-359 EQLOGIC ,100E-00
size=300G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 69:0:0:0  sdbi 67:192 active ready running
  `- 68:0:0:0  sdbh 67:176 active ready running
364842a3403790415d37ed5bb6c8b68db dm-409 EQLOGIC ,100E-00
size=200G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 80:0:0:0  sdbt 68:112 active ready running
  `- 81:0:0:0  sdbv 68:144 active ready running
364842a34037964f7807f15d86d8b8860 dm-527 EQLOGIC ,100E-00
size=200G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 98:0:0:0  sdck 69:128 active ready running
  `- 99:0:0:0  sdcm 69:160 active ready running
364842a34037944aebf7d85416b8ba895 dm-226 EQLOGIC ,100E-00
size=200G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 46:0:0:0  sdal 66:80  active ready running
  `- 47:0:0:0  sdam 66:96  active ready running
364842a340379f44f7c7e053c6c8b98d2 dm-360 EQLOGIC ,100E-00
size=450G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 70:0:0:0  sdbj 67:208 active ready running
  `- 71:0:0:0  sdbk 67:224 active ready running
364842a34037924276e7e051e6c8b084f dm-308 EQLOGIC ,100E-00
size=120G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 61:0:0:0  sdba 67:64  active ready running
  `- 60:0:0:0  sdaz 67:48  active ready running
364842a34037994b93b7d85a66a8b789a dm-37 EQLOGIC ,100E-00
size=270G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 20:0:0:0  sdl  8:176  active ready running
  `- 21:0:0:0  sdm  8:192  active ready running
364842a340379348d6e7e351e6c8b4865 dm-319 EQLOGIC ,100E-00
size=310G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 62:0:0:0  sdbb 67:80  active ready running
  `- 63:0:0:0  sdbc 67:96  active ready running
364842a34037994cd3b7db5a66a8bc8ff dm-70 EQLOGIC ,100E-00
size=270G features='0' hwhandler='0' wp=rw
`-+- 

Re: [ovirt-users] Ovirt host activation and lvm looping with high CPU load trying to mount iSCSI storage

2017-01-12 Thread Mark Greenall
Hi Yaniv,

>> 1. There is no point in so many connections.
>> 2. Certainly not the same portal - you really should have more.
>> 3. Note that some go via bond1 - and some via 'default' interface. Is that 
>> intended?
>> 4. Your multipath.conf is using rr_min_io - where it should use rr_min_io_rq 
>> most likely.

We have a single 68TB Equallogic unit with 24 disks. Each Ovirt host has 2 HBAs on the iSCSI network. We use Ovirt and the Cisco switches to create an LACP group with those 2 HBAs. I have always assumed that the two connections are one from each HBA (i.e. I should have two paths and two connections to each target).

If we reduce the number of storage domains, we reduce the number of devices and therefore the number of LVM physical volumes that appear in Linux, correct? At the moment each connection results in a Linux device, which has its own queue. We have some guests with high IO load on their device whilst others are low. All the storage domain / datastore sizing guides we found seem to imply it's a trade-off between ease of management (i.e. not having millions of domains to manage), IO contention between guests on a single large storage domain / datastore, and possible wasted space on storage domains. If you have further information or recommendations, I am more than willing to change things, as this problem is making our environment somewhat unusable at the moment. I have hosts that I can't bring online and therefore reduced resiliency in clusters. They used to work just fine, but the environment has grown over the last year and we also upgraded the Ovirt version from 3.6 to 4.x. We certainly had other problems, but host activation wasn't one of them, and it's a problem that's driving me mad.
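
For what it's worth, each of those Linux devices does expose its own queue settings under /sys, both for the dm multipath device and for the underlying paths, e.g. (dm-212 / sdah are just a pair taken from the multipath -ll output elsewhere in this thread):

cat /sys/block/dm-212/queue/nr_requests
cat /sys/block/sdah/queue/nr_requests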

Thanks for the pointer on rr_min_io – I see that was for an older kernel; we had it set from a Dell guide. I've now removed that setting, as it seems the default value has changed now anyway.
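
If we ever need to tune it again, my understanding is that the newer rr_min_io_rq option would sit in the same device section of multipath.conf, along these lines (vendor/product strings as in our existing config; the value 1 is just an illustration):

device {
    vendor          "EQLOGIC"
    product         "100E-00"
    rr_min_io_rq    1
}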

>> Unrelated, your engine.log is quite flooded with:
>> 2017-01-11 15:07:46,085 WARN  
>> [org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerObjectsBuilder] 
>> (DefaultQuartzScheduler9) [31a71bf5] Invalid or unknown guest architecture 
>> type '' received from guest agent
>>
>> Any idea what kind of guest you are running?

Do you have any idea which guest that's coming from? We pretty much exclusively have Linux (CentOS, various versions) and Windows (various versions) as the guest OS.

Thanks again,
Mark


Re: [ovirt-users] Ovirt host activation and lvm looping with high CPU load trying to mount iSCSI storage

2017-01-12 Thread Mark Greenall
>> I would say that each path has its own device (and therefore its own queue). 
>> So I'd argue that you may want to have (for example) 4 paths to each LUN or 
>> perhaps more (8?). For example, with 2 NICs, each connecting to two 
>> controllers, each controller having 2 NICs (so no SPOF and nice number of 
>> paths).

Totally get where you are coming from with paths to LUNs and using multipath. We do use that with the Dell Compellent storage we have: it has multiple active controllers, each with an IP address in a different subnet. Unfortunately, the Equallogic does NOT have two active controllers. It has a single active controller and a single IP that migrates to whichever controller is active. If I don't use LACP I can't use both HBAs on the host with Ovirt, as it doesn't support Dell's Host Integration Tools (HIT) software (or you could argue Dell don't support Ovirt). So, instead of being able to have a large number of paths to devices, I can either have one active path or use LACP and get two. As two is the most I can have to a LUN with the infrastructure we have, we spread the IO by increasing the number of targets (storage domains).

>> Depending on your storage, you may want to use rr_min_io_rq = 1 for latency 
>> purposes.

Looking at the man page for multipath.conf, it looks like the default is now 1, whereas it was 1000 for rr_min_io. For now I've just removed it from our config file and we'll take the default.

I’m still seeing the same problem with the couple of changes made (lvmetad and 
multipath). I’m really not very good at understanding exactly what is going on 
in the Ovirt logs. Does it provide any clues as to why it brings the host up 
and then takes it offline again? What are the barrage of lvm processes trying 
to achieve and why do they apparently fail (as it keeps on trying to run them)? 
As mentioned, throughout all this I see no multipath errors (all paths 
available), I see no iSCSI connection errors to the Equallogic. It just seems 
to be Ovirt that thinks the storage is unavailable for some reason?
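
In case it helps anyone suggest what to look for, a simple way to watch the lvm invocations pile up while the host cycles is a plain process listing, e.g.:

ps -eo pid,etime,args | grep -E 'lvm|vgs|pvs|lvs' | grep -v grep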

Thanks,
Mark