Re: [ovirt-users] Ovirt host activation and lvm looping with high CPU load trying to mount iSCSI storage

2017-01-17 Thread Nir Soffer
On Tue, Jan 17, 2017 at 1:17 PM, Mark Greenall
 wrote:
> Hi Nir,
>
> Thanks for your continuing efforts with this. It's really appreciated.
>
> re: number of storage domains, we'll plan a review of the underlying storage 
> configuration to try and optimise things better. Do you have a recommended 
> Storage Domain Size?
>
> Still running with the modified settings mentioned in my last mail. I'd like
> to work with you to find a set of configuration variables that we are all
> happy with and that works for both Ovirt and the Equallogic.
>
>>> 1. Please enable debug logging in sanlock log:
>>>
>>> edit /etc/sanlock/sanlock.conf and set:
>>> logfile_priority = 7
>
> Enabled, rebooted and tried another session this morning:
>
> 09:19 - Host Activated
> 09:21 - Non Operational (Cannot access the storage Domain Unknown)
> 09:25 - Connecting
> 09:31 - Manually rebooted host as getting nowhere.
> Throughout the above I see lots of LVM processes.
>
> I've attached all the logs for the above session. It may be that the stripped 
> back settings are now causing another problem as this doesn't look like the 
> same cycle I was previously seeing. I also see iSCSI connection errors in the 
> messages file too.
>
> Multipath -ll shows a single path to the devices now rather than the two I 
> previously had.
> Iscsiadm shows a single session for each domain as connected but spread 
> between the Equallogic eth0 and eth1 interfaces. Previously I had two 
> sessions for each domain, one connected to eth0 and one connected to eth1.
> Pvdisplay, vgdisplay and lvdisplay show a lot of LVM's and all seem to be 
> 'available'

I guess your old configuration (other than the new multipath.conf) is better
than the defaults for now.
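The sanlock step quoted above (setting `logfile_priority = 7` in /etc/sanlock/sanlock.conf) can be scripted. A minimal sketch, where enable_sanlock_debug is a hypothetical helper name and the service restart is left as a comment:

```shell
# Enable sanlock debug logging by setting logfile_priority = 7 in the given
# config file, replacing an existing setting or appending a new one.
enable_sanlock_debug() {
    conf="$1"
    if grep -q '^logfile_priority' "$conf" 2>/dev/null; then
        # Rewrite the existing setting in place.
        sed -i 's/^logfile_priority.*/logfile_priority = 7/' "$conf"
    else
        # Append the setting if it is not present yet.
        echo 'logfile_priority = 7' >> "$conf"
    fi
}

# Usage (as root):
#   enable_sanlock_debug /etc/sanlock/sanlock.conf
#   systemctl restart sanlock
```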

>
>>> 2. Try vdsm patch eliminating the delays in the monitoring thread
>
> When I check the gerrit link I see two monitor.py files (monitor_new.py.zip 
> and storage_monitor_test_new.py.zip) which one should I be testing?

The best way is to use git: checkout the patch and build new rpms.

git fetch https://gerrit.ovirt.org/vdsm refs/changes/50/70450/1 && git checkout FETCH_HEAD

Please keep the old file and check that the diff between the old and new file
matches what we see in gerrit:

diff -u /usr/share/vdsm/storage/monitor.py.old /usr/share/vdsm/storage/monitor.py

https://gerrit.ovirt.org/#/c/70450/1/vdsm/storage/monitor.py

Nir
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] Ovirt host activation and lvm looping with high CPU load trying to mount iSCSI storage

2017-01-17 Thread Nir Soffer
On Mon, Jan 16, 2017 at 7:19 PM, Mark Greenall
 wrote:
> Hi,
>
> To try and get a baseline here I've reverted most of the changes we've made 
> and am running the host with just the following iSCSI related configuration 
> settings. The tweaks had been made over time to try and alleviate several 
> storage related problems, but it's possible that fixes in Ovirt (we've 
> gradually gone from early 3.x to 4.0.6) make them redundant now and they 
> simply compound the problem. I'll start with these configuration settings and 
> then move onto trying the vdsm patch.
>
> /etc/multipath.conf (note: polling_interval and max_fds would not get 
> accepted in the devices section. I think they are for default only):

Right, my error.

>
> # VDSM REVISION 1.3
> # VDSM PRIVATE
>
> blacklist {
>devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
>devnode "^hd[a-z]"
>devnode "^sda$"
> }
>
> defaults {
> deferred_remove yes
> dev_loss_tmo        30
> fast_io_fail_tmo    5
> flush_on_last_del   yes
> max_fds 4096

You can bump the number of fds here if this is needed.

> no_path_retry   fail
> polling_interval    5
> user_friendly_names no
> }
>
> devices {
> device {
> vendor  "EQLOGIC"
> product "100E-00"
>
> # Ovirt defaults
> deferred_remove yes
> dev_loss_tmo        30
> fast_io_fail_tmo    5
> flush_on_last_del   yes
> #polling_interval   5
> user_friendly_names no
>
> # Local settings
> #max_fds 8192
> path_checker            tur
> path_grouping_policy    multibus
> path_selector   "round-robin 0"
>
> # Use 4 retries will provide additional 20 seconds gracetime when no
> # path is available before the device is disabled. (assuming 5 seconds
> # polling interval). This may prevent vms from pausing when there is
> # short outage on the storage server or network.
> no_path_retry   4
>}
>
> device {
> # These settings overrides built-in devices settings. It does not 
> apply
> # to devices without built-in settings (these use the settings in the
> # "defaults" section), or to devices defined in the "devices" section.
> all_devs                yes
> no_path_retry   fail
> }
> }
>
>
> /etc/iscsi/iscsid.conf default apart from:
>
> node.session.initial_login_retry_max = 12
> node.session.cmds_max = 1024
> node.session.queue_depth = 128
> node.startup = manual
> node.session.iscsi.FastAbort = No

I don't know about these options, I would try the defaults first, unless
you can explain why they are needed.

>
>
>
>
> The following settings have been commented out / removed:
>
> /etc/sysctl.conf:
>
> # For more information, see sysctl.conf(5) and sysctl.d(5).
> # Prevent ARP Flux for multiple NICs on the same subnet:
> #net.ipv4.conf.all.arp_ignore = 1
> #net.ipv4.conf.all.arp_announce = 2
> # Loosen RP Filter to alow multiple iSCSI connections
> #net.ipv4.conf.all.rp_filter = 2

You need these if you are connecting to two addresses on the same subnet.

Vdsm will do this automatically if needed, if this is configured properly on
the engine side. Unfortunately I don't know how to configure this on the
engine side, but maybe other users with the same configuration know.
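To check what a host currently has for these, the values can be read back from procfs. A sketch assuming standard Linux paths (the helper name is mine):

```shell
# Print the reverse-path-filter and ARP settings relevant to multihomed iSCSI.
# Reading /proc/sys is equivalent to `sysctl -n` for these keys.
show_iscsi_sysctls() {
    for key in rp_filter arp_ignore arp_announce; do
        printf '%s=%s\n' "net.ipv4.conf.all.$key" \
            "$(cat /proc/sys/net/ipv4/conf/all/$key 2>/dev/null)"
    done
}

show_iscsi_sysctls
```

rp_filter=2 (loose) is what the commented-out tuning above aimed for; the strict default of 1 can drop replies that arrive on the "wrong" NIC when both NICs sit on the same subnet.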

>
>
> /lib/udev/rules.d:
>
> # Various Settings for Dell Equallogic disks based on Dell Optimizing SAN 
> Environment for Linux Guide
> #
> # Modify disk scheduler mode to noop
> #ACTION=="add|change", SUBSYSTEM=="block", ATTRS{vendor}=="EQLOGIC", 
> RUN+="/bin/sh -c 'echo noop > /sys/${DEVPATH}/queue/scheduler'"
> # Modify disk timeout value to 60 seconds
> #ACTION!="remove", SUBSYSTEM=="block", ATTRS{vendor}=="EQLOGIC", 
> RUN+="/bin/sh -c 'echo 60 > /sys/%p/device/timeout'"
> # Modify read ahead value to 1024
> #ACTION!="remove", SUBSYSTEM=="block", ATTRS{vendor}=="EQLOGIC", 
> RUN+="/bin/sh -c 'echo 1024 > /sys/${DEVPATH}/bdi/read_ahead_kb'"
>
> I've also removed our defined iSCSI interfaces and have simply left the Ovirt 
> 'default'

The default will probably use only a single path for each device, unless you
configure the engine to use both nics.

>
> Rebooted and 'Activated' host:
>
> 16:09 - Host Activated
> 16:10 - Non Operational saying it can't access storage domain 'Unknown'

This means a pv is not accessible, smells like a connectivity issue with storage.

> 16:12 - Host Activated again
> 16:12 - Host not responding goes 'Connecting'
> 16:15 - Can't access ALL the storage Domains. Host goes Non Operational again

Do you mean it cannot access any storage domain, or it can access only some?

> 16:17 - Host Activated again
> 16:18 - Can't access ALL the storage Domains. Host goes Non Operational again

Re: [ovirt-users] Ovirt host activation and lvm looping with high CPU load trying to mount iSCSI storage

2017-01-16 Thread Mark Greenall
Hi,

To try and get a baseline here I've reverted most of the changes we've made and 
am running the host with just the following iSCSI related configuration 
settings. The tweaks had been made over time to try and alleviate several 
storage related problems, but it's possible that fixes in Ovirt (we've 
gradually gone from early 3.x to 4.0.6) make them redundant now and they simply 
compound the problem. I'll start with these configuration settings and then 
move onto trying the vdsm patch.

/etc/multipath.conf (note: polling_interval and max_fds would not get accepted 
in the devices section. I think they are for default only):

# VDSM REVISION 1.3
# VDSM PRIVATE

blacklist {
   devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
   devnode "^hd[a-z]"
   devnode "^sda$"
}

defaults {
deferred_remove yes
dev_loss_tmo        30
fast_io_fail_tmo    5
flush_on_last_del   yes
max_fds 4096
no_path_retry   fail
polling_interval    5
user_friendly_names no
}

devices {
device {
vendor  "EQLOGIC"
product "100E-00"

# Ovirt defaults
deferred_remove yes
dev_loss_tmo        30
fast_io_fail_tmo    5
flush_on_last_del   yes
#polling_interval   5
user_friendly_names no

# Local settings
#max_fds 8192
path_checker            tur
path_grouping_policy    multibus
path_selector   "round-robin 0"

# Use 4 retries will provide additional 20 seconds gracetime when no
# path is available before the device is disabled. (assuming 5 seconds
# polling interval). This may prevent vms from pausing when there is
# short outage on the storage server or network.
no_path_retry   4
   }

device {
# These settings overrides built-in devices settings. It does not apply
# to devices without built-in settings (these use the settings in the
# "defaults" section), or to devices defined in the "devices" section.
all_devs                yes
no_path_retry   fail
}
}


/etc/iscsi/iscsid.conf default apart from:

node.session.initial_login_retry_max = 12
node.session.cmds_max = 1024
node.session.queue_depth = 128
node.startup = manual
node.session.iscsi.FastAbort = No




The following settings have been commented out / removed:

/etc/sysctl.conf:

# For more information, see sysctl.conf(5) and sysctl.d(5).
# Prevent ARP Flux for multiple NICs on the same subnet:
#net.ipv4.conf.all.arp_ignore = 1
#net.ipv4.conf.all.arp_announce = 2
# Loosen RP Filter to alow multiple iSCSI connections
#net.ipv4.conf.all.rp_filter = 2


/lib/udev/rules.d:

# Various Settings for Dell Equallogic disks based on Dell Optimizing SAN 
Environment for Linux Guide
#
# Modify disk scheduler mode to noop
#ACTION=="add|change", SUBSYSTEM=="block", ATTRS{vendor}=="EQLOGIC", 
RUN+="/bin/sh -c 'echo noop > /sys/${DEVPATH}/queue/scheduler'"
# Modify disk timeout value to 60 seconds
#ACTION!="remove", SUBSYSTEM=="block", ATTRS{vendor}=="EQLOGIC", RUN+="/bin/sh 
-c 'echo 60 > /sys/%p/device/timeout'"
# Modify read ahead value to 1024
#ACTION!="remove", SUBSYSTEM=="block", ATTRS{vendor}=="EQLOGIC", RUN+="/bin/sh 
-c 'echo 1024 > /sys/${DEVPATH}/bdi/read_ahead_kb'"
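Whether those (now commented-out) udev tunings are still in effect can be checked straight from sysfs. A sketch assuming standard Linux sysfs paths (list_block_tunings is a hypothetical name; on a host like this the sdX names would be the EQL multipath members shown by multipath -ll):

```shell
# Report the I/O scheduler, device timeout and read-ahead for each SCSI disk,
# so you can confirm which tunings (if any) are currently applied.
list_block_tunings() {
    for dev in /sys/block/sd*; do
        [ -e "$dev" ] || continue    # no sd* disks present on this machine
        printf '%s scheduler=%s timeout=%s read_ahead_kb=%s\n' \
            "$(basename "$dev")" \
            "$(cat "$dev/queue/scheduler" 2>/dev/null)" \
            "$(cat "$dev/device/timeout" 2>/dev/null)" \
            "$(cat "$dev/bdi/read_ahead_kb" 2>/dev/null)"
    done
}

list_block_tunings
```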

I've also removed our defined iSCSI interfaces and have simply left the Ovirt 
'default'

Rebooted and 'Activated' host:

16:09 - Host Activated
16:10 - Non Operational saying it can't access storage domain 'Unknown'
16:12 - Host Activated again
16:12 - Host not responding goes 'Connecting'
16:15 - Can't access ALL the storage Domains. Host goes Non Operational again
16:17 - Host Activated again
16:18 - Can't access ALL the storage Domains. Host goes Non Operational again
16:20 - Host Autorecovers and goes Activating again
That cycle repeated until I started getting VDSM timeout messages and the 
constant LVM processes and high CPU load. @16:30 I rebooted the host and set 
the status to maintenance.

Second host Activation attempt just resulted in the same cycle as above. Host 
now doesn't come online at all.

Next step will be to try the vdsm patch.

Mark


Re: [ovirt-users] Ovirt host activation and lvm looping with high CPU load trying to mount iSCSI storage

2017-01-15 Thread Nir Soffer
On Thu, Jan 12, 2017 at 12:02 PM, Mark Greenall
 wrote:
> Firstly, thanks @Yaniv and thanks @Nir for your responses.
>
> @Yaniv, in answer to this:
>
>>> Why do you have 1 SD per VM?
>
> It's a combination of performance and ease of management. We ran some IO 
> tests with various configurations and settled on this one for a balance of 
> reduced IO contention and ease of management. If there is a better 
> recommended way of handling these then I'm all ears. If you believe having a 
> large amount of storage domains adds to the problem then we can also review 
> the setup.

Yes, having one storage domain per vm is an extremely fragile way to use
storage domains - any problem in monitoring one of the 45 storage domains can
make the entire host non-operational.

You should use storage domains for grouping volumes that need to be separated
from other volumes, for example production, staging, different users,
different types
of storage, etc.

If some vms need high IO, and you want to have one or more devices per vm,
you should use direct luns.

If you need snapshots, live storage migration, etc, use volumes on storage
domain.

I looked at the logs, and I can explain why your system becomes non-operational.

Grepping the domain monitor logs, we see that many storage domains have
very slow reads (up to 749 seconds read delay):

(filtered the log with awk, I don't have the command now)
Thread-12::DEBUG::2017-01-11
15:09:18,785::check::327::storage.check::(_check_completed)
'/dev/7dfeac70-eaa1-4ba6-ad2a-e3c11564ee3b/metadata' elapsed=0.05
Thread-12::DEBUG::2017-01-11
15:09:28,780::check::327::storage.check::(_check_completed)
'/dev/7dfeac70-eaa1-4ba6-ad2a-e3c11564ee3b/metadata' elapsed=0.03
Thread-12::DEBUG::2017-01-11
15:09:38,778::check::327::storage.check::(_check_completed)
'/dev/7dfeac70-eaa1-4ba6-ad2a-e3c11564ee3b/metadata' elapsed=0.03
Thread-12::DEBUG::2017-01-11
15:09:48,777::check::327::storage.check::(_check_completed)
'/dev/7dfeac70-eaa1-4ba6-ad2a-e3c11564ee3b/metadata' elapsed=0.03
Thread-12::DEBUG::2017-01-11
15:09:55,863::check::327::storage.check::(_check_completed)
'/dev/e70839af-77dd-40c0-a541-d364d30e859a/metadata' elapsed=0.02
Thread-12::DEBUG::2017-01-11
15:09:55,957::check::327::storage.check::(_check_completed)
'/dev/1198e513-bdc8-4d5f-8ee5-8e8dc30d309d/metadata' elapsed=0.03
Thread-12::DEBUG::2017-01-11
15:09:56,070::check::327::storage.check::(_check_completed)
'/dev/640ac4d3-1e14-465a-9a72-cc2f2c4cfe26/metadata' elapsed=0.03
Thread-12::DEBUG::2017-01-11
15:09:56,741::check::327::storage.check::(_check_completed)
'/dev/6e98b678-a955-49b8-aad7-e1e52e26db1f/metadata' elapsed=0.03
Thread-12::DEBUG::2017-01-11
15:09:56,798::check::327::storage.check::(_check_completed)
'/dev/02d31cfc-f095-42e6-8396-d4dbebbb4fed/metadata' elapsed=0.03
Thread-12::DEBUG::2017-01-11
15:09:57,080::check::327::storage.check::(_check_completed)
'/dev/4b23a421-5c1f-4541-a007-c93b7af4986b/metadata' elapsed=0.03
Thread-12::DEBUG::2017-01-11
15:09:57,248::check::327::storage.check::(_check_completed)
'/dev/5d8d49e2-ce0e-402e-9348-94f9576e2e28/metadata' elapsed=0.03
Thread-12::DEBUG::2017-01-11
15:09:57,425::check::327::storage.check::(_check_completed)
'/dev/9fcbb7b1-a13a-499a-a534-119360d57f00/metadata' elapsed=0.08
Thread-12::DEBUG::2017-01-11
15:09:57,715::check::327::storage.check::(_check_completed)
'/dev/a25ded63-2c31-4f1d-a65a-5390e47fda99/metadata' elapsed=0.04
Thread-12::DEBUG::2017-01-11
15:09:57,750::check::327::storage.check::(_check_completed)
'/dev/f6a91d2f-ccae-4440-b1a7-f62ee750a58c/metadata' elapsed=0.05
Thread-12::DEBUG::2017-01-11
15:09:58,007::check::327::storage.check::(_check_completed)
'/dev/bfb1d6b2-b610-4565-b818-ab6ee856e023/metadata' elapsed=0.07
Thread-12::DEBUG::2017-01-11
15:09:58,170::check::327::storage.check::(_check_completed)
'/dev/84cfcb68-190f-4836-8294-d5752c07b762/metadata' elapsed=0.03
Thread-12::DEBUG::2017-01-11
15:09:58,556::check::327::storage.check::(_check_completed)
'/dev/2204a85e-c8c7-4e1e-b8e6-e392645077c6/metadata' elapsed=0.07
Thread-12::DEBUG::2017-01-11
15:09:58,805::check::327::storage.check::(_check_completed)
'/dev/7dfeac70-eaa1-4ba6-ad2a-e3c11564ee3b/metadata' elapsed=0.07
Thread-12::DEBUG::2017-01-11
15:09:59,093::check::327::storage.check::(_check_completed)
'/dev/78e59ee0-13ac-4176-8950-837498ba6038/metadata' elapsed=0.06
Thread-12::DEBUG::2017-01-11
15:09:59,159::check::327::storage.check::(_check_completed)
'/dev/b66a2944-a056-4a48-a3f9-83f509df5d1b/metadata' elapsed=0.06
Thread-12::DEBUG::2017-01-11
15:09:59,218::check::327::storage.check::(_check_completed)
'/dev/da05d769-27c2-4270-9bba-5277bf3636e6/metadata' elapsed=0.06
Thread-12::DEBUG::2017-01-11
15:09:59,247::check::327::storage.check::(_check_completed)
'/dev/819b51c0-96d7-43c2-b120-7adade60a2e2/metadata' elapsed=0.03
Thread-12::DEBUG::2017-01-11
15:09:59,363::check::327::storage.check::(_check_completed)
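Nir mentions filtering the log with awk but not the exact command. A sketch that would pull the slow checks out of one-line vdsm log entries like those above (the line wrapping here is an artifact of the archive; slow_checks is a hypothetical name):

```shell
# Print vdsm check-completed log lines whose elapsed read time exceeds a
# threshold in seconds. Usage: slow_checks <seconds> <logfile>
slow_checks() {
    awk -v limit="$1" '/_check_completed/ {
        for (i = 1; i <= NF; i++)
            if ($i ~ /^elapsed=/) {
                split($i, kv, "=")
                if (kv[2] + 0 > limit)    # force numeric comparison
                    print $0
            }
    }' "$2"
}

# Usage sketch: slow_checks 10 /var/log/vdsm/vdsm.log
```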

Re: [ovirt-users] Ovirt host activation and lvm looping with high CPU load trying to mount iSCSI storage

2017-01-15 Thread Nir Soffer
On Fri, Jan 13, 2017 at 11:29 AM, Mark Greenall
 wrote:
> Hi Nir,
>
> Thanks very much for your feedback. It's really useful information. I keep my 
> fingers crossed it leads to a solution for us.
>
> All the settings we currently have were to try and optimise the Equallogic 
> for Linux and Ovirt.
>
> The multipath config settings came from this Dell Forum thread re: getting 
> EqualLogic to work with Ovirt 
> http://en.community.dell.com/support-forums/storage/f/3775/t/19529606

I don't think it is a good idea to copy undocumented changes to
multipath.conf like this.

You must understand every change you have in your multipath.conf. If you cannot
explain a change, you should use the defaults.

> The udev settings were from the Dell Optimizing SAN Environment for Linux 
> Guide here: 
> https://www.google.co.uk/url?sa=t=j==s=web=1=0ahUKEwiXvJes4L7RAhXLAsAKHVWLDyQQFggiMAA=http%3A%2F%2Fen.community.dell.com%2Fdell-groups%2Fdtcmedia%2Fm%2Fmediagallery%2F20371245%2Fdownload=AFQjCNG0J8uWEb90m-BwCH_nZJ8lEB3lFA=bv.144224172,d.d24=rja

Not sure these changes were tested by anyone with ovirt.

I think the general approach is to first make the system work using
the defaults, applying only the required changes.

Tuning a system should be done after the system works, and you
can show that you have performance issues that need tuning.

> Perhaps some of the settings are now conflicting with Ovirt best practice as 
> you optimise the releases.
>
> As requested, here is the output of multipath -ll
>
> [root@uk1-ion-ovm-08 rules.d]# multipath -ll
> 364842a3403798409cf7d555c6b8bb82e dm-237 EQLOGIC ,100E-00
> size=1.5T features='0' hwhandler='0' wp=rw
> `-+- policy='round-robin 0' prio=1 status=active
>   |- 48:0:0:0  sdan 66:112 active ready running
>   `- 49:0:0:0  sdao 66:128 active ready running
> 364842a34037924a7bf7d25416b8be891 dm-212 EQLOGIC ,100E-00
> size=345G features='0' hwhandler='0' wp=rw
> `-+- policy='round-robin 0' prio=1 status=active
>   |- 42:0:0:0  sdah 66:16  active ready running
>   `- 43:0:0:0  sdai 66:32  active ready running
> 364842a340379c497f47ee5fe6c8b9846 dm-459 EQLOGIC ,100E-00
> size=175G features='0' hwhandler='0' wp=rw
> `-+- policy='round-robin 0' prio=1 status=active
>   |- 86:0:0:0  sdbz 68:208 active ready running
>   `- 87:0:0:0  sdca 68:224 active ready running
> 364842a34037944f2807fe5d76d8b1842 dm-526 EQLOGIC ,100E-00
> size=200G features='0' hwhandler='0' wp=rw
> `-+- policy='round-robin 0' prio=1 status=active
>   |- 96:0:0:0  sdcj 69:112 active ready running
>   `- 97:0:0:0  sdcl 69:144 active ready running
> 364842a3403798426d37e05bc6c8b6843 dm-420 EQLOGIC ,100E-00
> size=250G features='0' hwhandler='0' wp=rw
> `-+- policy='round-robin 0' prio=1 status=active
>   |- 82:0:0:0  sdbu 68:128 active ready running
>   `- 83:0:0:0  sdbw 68:160 active ready running
> 364842a340379449fbf7dc5406b8b2818 dm-199 EQLOGIC ,100E-00
> size=200G features='0' hwhandler='0' wp=rw
> `-+- policy='round-robin 0' prio=1 status=active
>   |- 38:0:0:0  sdad 65:208 active ready running
>   `- 39:0:0:0  sdae 65:224 active ready running
> 364842a34037984543c7d35a86a8bc8ee dm-172 EQLOGIC ,100E-00
> size=670G features='0' hwhandler='0' wp=rw
> `-+- policy='round-robin 0' prio=1 status=active
>   |- 36:0:0:0  sdaa 65:160 active ready running
>   `- 37:0:0:0  sdac 65:192 active ready running
> 364842a340379e4303c7dd5a76a8bd8b4 dm-140 EQLOGIC ,100E-00
> size=1.5T features='0' hwhandler='0' wp=rw
> `-+- policy='round-robin 0' prio=1 status=active
>   |- 33:0:0:0  sdx  65:112 active ready running
>   `- 32:0:0:0  sdy  65:128 active ready running
> 364842a340379b44c7c7ed53b6c8ba8c0 dm-359 EQLOGIC ,100E-00
> size=300G features='0' hwhandler='0' wp=rw
> `-+- policy='round-robin 0' prio=1 status=active
>   |- 69:0:0:0  sdbi 67:192 active ready running
>   `- 68:0:0:0  sdbh 67:176 active ready running
> 364842a3403790415d37ed5bb6c8b68db dm-409 EQLOGIC ,100E-00
> size=200G features='0' hwhandler='0' wp=rw
> `-+- policy='round-robin 0' prio=1 status=active
>   |- 80:0:0:0  sdbt 68:112 active ready running
>   `- 81:0:0:0  sdbv 68:144 active ready running
> 364842a34037964f7807f15d86d8b8860 dm-527 EQLOGIC ,100E-00
> size=200G features='0' hwhandler='0' wp=rw
> `-+- policy='round-robin 0' prio=1 status=active
>   |- 98:0:0:0  sdck 69:128 active ready running
>   `- 99:0:0:0  sdcm 69:160 active ready running
> 364842a34037944aebf7d85416b8ba895 dm-226 EQLOGIC ,100E-00
> size=200G features='0' hwhandler='0' wp=rw
> `-+- policy='round-robin 0' prio=1 status=active
>   |- 46:0:0:0  sdal 66:80  active ready running
>   `- 47:0:0:0  sdam 66:96  active ready running
> 364842a340379f44f7c7e053c6c8b98d2 dm-360 EQLOGIC ,100E-00
> size=450G features='0' hwhandler='0' wp=rw
> `-+- policy='round-robin 0' prio=1 status=active
>   |- 70:0:0:0  sdbj 67:208 active ready running
>   `- 71:0:0:0  sdbk 67:224 active ready running
> 364842a34037924276e7e051e6c8b084f dm-308 EQLOGIC ,100E-00
> size=120G 

Re: [ovirt-users] Ovirt host activation and lvm looping with high CPU load trying to mount iSCSI storage

2017-01-13 Thread Mark Greenall
Just been catching up with all the threads and I saw mention of some 
iscsid.conf settings which reminded me we also changed some of those from 
default as per the Dell Optimizing SAN Environment for Linux Guide previously 
mentioned.

Changed from default in /etc/iscsi/iscsid.conf
node.session.initial_login_retry_max = 12
node.session.cmds_max = 1024
node.session.queue_depth = 128
node.startup = manual
node.session.iscsi.FastAbort = No

As mentioned by a couple of people, I do just hope this is a case of an 
optimization conflict between Ovirt and Equallogic. I just don't understand why 
every now and again a host will come up and stay up. In the Ovirt Equallogic 
cluster I have currently battled to get three of the hosts up (and running 
guests); I am left with the fourth host, which I'm using for this call, and it 
just refuses to stay up. It may not be specifically related to Ovirt 4.x, but I 
do know we never used to have this type of battle getting nodes online. I'm 
quite happy to change settings on this one host but can't make cluster-wide 
changes as that would likely bring all the guests down.

As some added information, here are the iSCSI connection details for one of the 
storage domains. As mentioned, we are using the 2 x 10Gb iSCSI HBAs in an LACP 
group in Ovirt and Cisco. Hence we see a login from the same source address 
(but two different interfaces) to the same (single) persistent address, which is 
the controller's virtual group address. The Current Portal addresses are the 
Equallogic active controller's eth0 and eth1 addresses.

Target: 
iqn.2001-05.com.equallogic:4-42a846-654479033-f9888b77feb584ec-lnd-ion-db-tprm-dstore01
 (non-flash)
Current Portal: 10.100.214.76:3260,1
Persistent Portal: 10.100.214.77:3260,1
**
Interface:
**
Iface Name: bond1.10
Iface Transport: tcp
Iface Initiatorname: iqn.1994-05.com.redhat:a53470a0ae32
Iface IPaddress: 10.100.214.59
Iface HWaddress: 
Iface Netdev: uk1iscsivlan10
SID: 95
iSCSI Connection State: LOGGED IN
iSCSI Session State: LOGGED_IN
Internal iscsid Session State: NO CHANGE
Current Portal: 10.100.214.75:3260,1
Persistent Portal: 10.100.214.77:3260,1
**
Interface:
**
Iface Name: default
Iface Transport: tcp
Iface Initiatorname: iqn.1994-05.com.redhat:a53470a0ae32
Iface IPaddress: 10.100.214.59
Iface HWaddress: 
Iface Netdev: 
SID: 96
iSCSI Connection State: LOGGED IN
iSCSI Session State: LOGGED_IN
Internal iscsid Session State: NO CHANGE

Thanks,
Mark


Re: [ovirt-users] Ovirt host activation and lvm looping with high CPU load trying to mount iSCSI storage

2017-01-13 Thread Mark Greenall
Hi Nir,

Thanks very much for your feedback. It's really useful information. I keep my 
fingers crossed it leads to a solution for us.

All the settings we currently have were to try and optimise the Equallogic for 
Linux and Ovirt.

The multipath config settings came from this Dell Forum thread re: getting 
EqualLogic to work with Ovirt 
http://en.community.dell.com/support-forums/storage/f/3775/t/19529606

The udev settings were from the Dell Optimizing SAN Environment for Linux Guide 
here: 
https://www.google.co.uk/url?sa=t=j==s=web=1=0ahUKEwiXvJes4L7RAhXLAsAKHVWLDyQQFggiMAA=http%3A%2F%2Fen.community.dell.com%2Fdell-groups%2Fdtcmedia%2Fm%2Fmediagallery%2F20371245%2Fdownload=AFQjCNG0J8uWEb90m-BwCH_nZJ8lEB3lFA=bv.144224172,d.d24=rja

Perhaps some of the settings are now conflicting with Ovirt best practice as 
you optimise the releases.

As requested, here is the output of multipath -ll

[root@uk1-ion-ovm-08 rules.d]# multipath -ll
364842a3403798409cf7d555c6b8bb82e dm-237 EQLOGIC ,100E-00
size=1.5T features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 48:0:0:0  sdan 66:112 active ready running
  `- 49:0:0:0  sdao 66:128 active ready running
364842a34037924a7bf7d25416b8be891 dm-212 EQLOGIC ,100E-00
size=345G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 42:0:0:0  sdah 66:16  active ready running
  `- 43:0:0:0  sdai 66:32  active ready running
364842a340379c497f47ee5fe6c8b9846 dm-459 EQLOGIC ,100E-00
size=175G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 86:0:0:0  sdbz 68:208 active ready running
  `- 87:0:0:0  sdca 68:224 active ready running
364842a34037944f2807fe5d76d8b1842 dm-526 EQLOGIC ,100E-00
size=200G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 96:0:0:0  sdcj 69:112 active ready running
  `- 97:0:0:0  sdcl 69:144 active ready running
364842a3403798426d37e05bc6c8b6843 dm-420 EQLOGIC ,100E-00
size=250G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 82:0:0:0  sdbu 68:128 active ready running
  `- 83:0:0:0  sdbw 68:160 active ready running
364842a340379449fbf7dc5406b8b2818 dm-199 EQLOGIC ,100E-00
size=200G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 38:0:0:0  sdad 65:208 active ready running
  `- 39:0:0:0  sdae 65:224 active ready running
364842a34037984543c7d35a86a8bc8ee dm-172 EQLOGIC ,100E-00
size=670G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 36:0:0:0  sdaa 65:160 active ready running
  `- 37:0:0:0  sdac 65:192 active ready running
364842a340379e4303c7dd5a76a8bd8b4 dm-140 EQLOGIC ,100E-00
size=1.5T features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 33:0:0:0  sdx  65:112 active ready running
  `- 32:0:0:0  sdy  65:128 active ready running
364842a340379b44c7c7ed53b6c8ba8c0 dm-359 EQLOGIC ,100E-00
size=300G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 69:0:0:0  sdbi 67:192 active ready running
  `- 68:0:0:0  sdbh 67:176 active ready running
364842a3403790415d37ed5bb6c8b68db dm-409 EQLOGIC ,100E-00
size=200G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 80:0:0:0  sdbt 68:112 active ready running
  `- 81:0:0:0  sdbv 68:144 active ready running
364842a34037964f7807f15d86d8b8860 dm-527 EQLOGIC ,100E-00
size=200G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 98:0:0:0  sdck 69:128 active ready running
  `- 99:0:0:0  sdcm 69:160 active ready running
364842a34037944aebf7d85416b8ba895 dm-226 EQLOGIC ,100E-00
size=200G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 46:0:0:0  sdal 66:80  active ready running
  `- 47:0:0:0  sdam 66:96  active ready running
364842a340379f44f7c7e053c6c8b98d2 dm-360 EQLOGIC ,100E-00
size=450G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 70:0:0:0  sdbj 67:208 active ready running
  `- 71:0:0:0  sdbk 67:224 active ready running
364842a34037924276e7e051e6c8b084f dm-308 EQLOGIC ,100E-00
size=120G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 61:0:0:0  sdba 67:64  active ready running
  `- 60:0:0:0  sdaz 67:48  active ready running
364842a34037994b93b7d85a66a8b789a dm-37 EQLOGIC ,100E-00
size=270G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 20:0:0:0  sdl  8:176  active ready running
  `- 21:0:0:0  sdm  8:192  active ready running
364842a340379348d6e7e351e6c8b4865 dm-319 EQLOGIC ,100E-00
size=310G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 62:0:0:0  sdbb 67:80  active ready running
  `- 63:0:0:0  sdbc 67:96  active ready running
364842a34037994cd3b7db5a66a8bc8ff dm-70 EQLOGIC ,100E-00
size=270G features='0' hwhandler='0' wp=rw
`-+- 

Re: [ovirt-users] Ovirt host activation and lvm looping with high CPU load trying to mount iSCSI storage

2017-01-12 Thread Gianluca Cecchi
On Fri, Jan 13, 2017 at 12:10 AM, Nir Soffer  wrote:

> On Thu, Jan 12, 2017 at 6:01 PM, Nicolas Ecarnot 
> wrote:
> > Hi,
> >
> > As we are using a very similar hardware and usage as Mark (Dell poweredge
> > hosts, Dell Equallogic SAN, iSCSI, and tons of LUNs for all those VMs),
> I'm
> > jumping into this thread.
>
> Can you share your multipath.conf that works with Dell Equallogic SAN?
>
>
I'm jumping in to share my current config with an EQL SAN and RH EL /
CentOS (but not oVirt).
The examples below are for a system connected to a PS6510ES.
Please note that they should be considered an element of discussion, to
be mixed and integrated with oVirt-specific requirements (e.g. no
friendly names).
Also, this is what I'm using on RH EL 6.8 clusters configured with RHCS. I have
not yet tested any RH EL / CentOS 7.x system with EQL iSCSI.

 - /etc/multipath.conf

defaults {
user_friendly_names yes
}

blacklist {
   wwid my_internal_disk_wwid

   device {
   vendor  "iDRAC"
   product "*"
   }
}

devices {
device {
vendor  "EQLOGIC"
product "100E-00"
path_grouping_policy    multibus
features "1 queue_if_no_path"
path_checker directio
failback immediate
path_selector "round-robin 0"
rr_min_io 512
rr_weight priorities
}
}


multipaths {
multipath {
wwid one_of_my_luns_wwid
alias mympfriendlyname
}

... other multipath sections for other luns

}


other important configurations:

- /etc/iscsi/iscsid.conf
other than chap config parameters

diff iscsid.conf iscsid.conf.orig
< #node.session.timeo.replacement_timeout = 120
< node.session.timeo.replacement_timeout = 15
---
> node.session.timeo.replacement_timeout = 120
130,131c125
< #node.session.err_timeo.lu_reset_timeout = 30
< node.session.err_timeo.lu_reset_timeout = 20
---
> node.session.err_timeo.lu_reset_timeout = 30
168,169c162
< # node.session.initial_login_retry_max = 8
< node.session.initial_login_retry_max = 12
---
> node.session.initial_login_retry_max = 8
178,179c171
< #node.session.cmds_max = 128
< node.session.cmds_max = 1024
---
> node.session.cmds_max = 128
183,184c175
< #node.session.queue_depth = 32
< node.session.queue_depth = 128
---
> node.session.queue_depth = 32
310,311c301
< #node.session.iscsi.FastAbort = Yes
< node.session.iscsi.FastAbort = No
---
> node.session.iscsi.FastAbort = Yes
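A quick way to reproduce a comparison like the diff above on another host is to list only the active settings of each file (active_settings is a hypothetical helper):

```shell
# Print the effective (uncommented, non-blank) settings of an iscsid.conf,
# sorted, so two hosts' files can be compared with a plain diff.
active_settings() {
    grep -Ev '^[[:space:]]*(#|$)' "$1" | sort
}

# Usage sketch, against the backup file used in the diff above:
#   diff <(active_settings /etc/iscsi/iscsid.conf) \
#        <(active_settings /etc/iscsi/iscsid.conf.orig)
```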


- network adapters dedicated to iSCSI config files
they are 10Gb/s interfaces
(
lspci gives
05:00.0 Ethernet controller: Intel Corporation 82599 10 Gigabit Dual Port
Backplane Connection (rev 01)
)
/etc/sysconfig/network-scripts/ifcfg-eth4
DEVICE=eth4
BOOTPROTO=static
HWADDR=XX:XX:XX:XX:XX:XX
ONBOOT=yes
IPADDR=10.10.100.227
NETMASK=255.255.255.0
TYPE=Ethernet
MTU=9000

similar for eth5 (ip is 10.10.100.228)

ifup eth4
ifup eth5

- /etc/sysctl.conf
net.ipv4.conf.eth4.arp_announce=2
net.ipv4.conf.eth4.arp_ignore=1
net.ipv4.conf.eth4.arp_filter=2
#
net.ipv4.conf.eth5.arp_announce=2
net.ipv4.conf.eth5.arp_ignore=1
net.ipv4.conf.eth5.arp_filter=2

to apply the changes:
sysctl -p

Verify ping to the portal (10.10.100.7) from both interfaces
ping -I eth4 10.10.100.7
ping -I eth5 10.10.100.7

to verify jumbo frame connections (if configured, as in my case):
ping 10.10.100.7 -M do -s 8972 -I eth4
ping 10.10.100.7 -M do -s 8972 -I eth5
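
As a sanity check on the payload size for these don't-fragment pings: the
largest ICMP payload that fits a given MTU is MTU minus 28 bytes (20-byte IP
header plus 8-byte ICMP header), so for a 9000-byte jumbo MTU:

```shell
# Largest ICMP payload that fits a 9000-byte MTU without fragmentation:
# MTU - IP header (20 bytes) - ICMP header (8 bytes)
mtu=9000
echo "max ping payload: $((mtu - 20 - 8)) bytes"
# prints: max ping payload: 8972 bytes
```

A smaller payload still succeeds but only proves the path MTU is at least
payload + 28 bytes, not the full 9000.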


- configuration of the iscsi interfaces
iscsiadm -m iface -I ieth4 --op=new
iscsiadm -m iface -I ieth5 --op=new
iscsiadm -m iface -I ieth4 --op=update -n iface.hwaddress -v
XX:XX:XX:XX:XX:XX
iscsiadm -m iface -I ieth5 --op=update -n iface.hwaddress -v
YY:YY:YY:YY:YY:YY
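
The per-interface binding steps above can be scripted. The sketch below only
prints the iscsiadm commands it would run for each "netdev mac" pair (a dry
run, so nothing needs to exist on the machine; the MACs and portal are the
same placeholders used above):

```shell
# Generate (but do not execute) the iscsiadm iface-binding commands for a
# list of "netdev mac" pairs. Echoing instead of running keeps this a dry run.
portal="10.10.100.7:3260"   # placeholder portal address
while read -r dev mac; do
    echo "iscsiadm -m iface -I i${dev} --op=new"
    echo "iscsiadm -m iface -I i${dev} --op=update -n iface.hwaddress -v ${mac}"
done <<'EOF'
eth4 XX:XX:XX:XX:XX:XX
eth5 YY:YY:YY:YY:YY:YY
EOF

# Discovery can then be bound to the new ifaces, e.g. (also just printed here):
echo "iscsiadm -m discovery -t sendtargets -p ${portal} -I ieth4 -I ieth5"
```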


output of some commands with this config

# iscsiadm -m session | grep mylun
tcp: [3] 10.10.100.7:3260,1 iqn.2001-05.com.equallogic:0--mylun
(non-flash)
tcp: [4] 10.10.100.7:3260,1 iqn.2001-05.com.equallogic:0--mylun
(non-flash)

with "-P 1" option

Target: iqn.2001-05.com.equallogic:0--mylun (non-flash)
Current Portal: 10.10.100.38:3260,1
Persistent Portal: 10.10.100.7:3260,1
**
Interface:
**
Iface Name: ieth5
Iface Transport: tcp
Iface Initiatorname: iqn.1994-05.com.redhat:aea9b71a9aaf
Iface IPaddress: 10.10.100.228
Iface HWaddress: 
Iface Netdev: eth5
SID: 3
iSCSI Connection State: LOGGED IN
iSCSI Session State: LOGGED_IN
Internal iscsid Session State: NO CHANGE
Current Portal: 10.10.100.37:3260,1
Persistent Portal: 10.10.100.7:3260,1
**
Interface:
**
Iface Name: ieth4
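
If each target is expected to carry two sessions (one per bound interface),
a quick tally of `iscsiadm -m session` output per IQN makes missing sessions
obvious. The sketch below runs against sample output shaped like the lines
above (IQNs anonymised; the second target is invented for contrast):

```shell
# Count iSCSI sessions per target IQN; two per target is expected when both
# ieth4 and ieth5 are logged in. The sample stands in for real
# `iscsiadm -m session` output.
sample='tcp: [3] 10.10.100.7:3260,1 iqn.2001-05.com.equallogic:0--mylun (non-flash)
tcp: [4] 10.10.100.7:3260,1 iqn.2001-05.com.equallogic:0--mylun (non-flash)
tcp: [5] 10.10.100.7:3260,1 iqn.2001-05.com.equallogic:0--otherlun (non-flash)'

printf '%s\n' "$sample" | awk '{ count[$4]++ }
END { for (t in count) printf "%s: %d session(s)\n", t, count[t] }'
```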

Re: [ovirt-users] Ovirt host activation and lvm looping with high CPU load trying to mount iSCSI storage

2017-01-12 Thread Nir Soffer
On Thu, Jan 12, 2017 at 6:01 PM, Nicolas Ecarnot  wrote:
> Hi,
>
> As we are using a very similar hardware and usage as Mark (Dell poweredge
> hosts, Dell Equallogic SAN, iSCSI, and tons of LUNs for all those VMs), I'm
> jumping into this thread.

Can you share your multipath.conf that works with Dell Equallogic SAN?

>
> Le 12/01/2017 à 16:29, Yaniv Kaul a écrit :
>
>
> While it's a bit of a religious war on what is preferred with iSCSI -
> network level bonding (LACP) or multipathing on the iSCSI level, I'm on the
> multipathing side. The main reason is that you may end up easily using just
> one of the paths in a bond - if your policy is not set correct on how to
> distribute connections between the physical links (remember that each
> connection sticks to a single physical link. So it really depends on the
> hash policy and even then - not so sure). With iSCSI multipathing you have
> more control - and it can also be determined by queue depth, etc.
> (In your example, if you have SRC A -> DST 1 and SRC B -> DST 1 (as you seem
> to have), both connections may end up on the same physical NIC.)
>
>>
>>
>> If we reduce the number of storage domains, we reduce the number of
>> devices and therefore the number of LVM Physical volumes that appear in
>> Linux correct? At the moment each connection results in a Linux device which
>> has its own queue. We have some guests with high IO loads on their device
>> whilst others are low. All the storage domain / datastore sizing guides we
>> found seem to imply it’s a trade-off between ease of management (i.e not
>> having millions of domains to manage), IO contention between guests on a
>> single large storage domain / datastore and possible wasted space on storage
>> domains. If you have further information on recommendations, I am more than
>> willing to change things as this problem is making our environment somewhat
>> unusable at the moment. I have hosts that I can’t bring online and therefore
>> reduced resiliency in clusters. They used to work just fine but the
>> environment has grown over the last year and we also upgraded the Ovirt
>> version from 3.6 to 4.x. We certainly had other problems, but host
>> activation wasn’t one of them and it’s a problem that’s driving me mad.
>
>
> I would say that each path has its own device (and therefore its own queue).
> So I'd argue that you may want to have (for example) 4 paths to each LUN or
> perhaps more (8?). For example, with 2 NICs, each connecting to two
> controllers, each controller having 2 NICs (so no SPOF and nice number of
> paths).
>
> Here, one key point I'm trying (to no avail) to discuss for years with
> Redhat people, and either I did not understood, either I wasn't clear
> enough, or Redhat people answered me they owned no Equallogic SAN to test
> it, is :
> My (and maybe many others) Equallogic SAN has two controllers, but is
> publishing only *ONE* virtual ip address.
> On one of our other EMC SAN, publishing *TWO* ip addresses, which can be
> published in two different subnets, I fully understand the benefits and
> working of multipathing (and even in the same subnet, our oVirt setup is
> happily using multipath).
>
> But on one of our oVirt setup using the Equallogic SAN, we have no choice
> but point our hosts iSCSI interfaces to one single SAN ip, so no multipath
> here.
>
> At this point, we saw no other mean than using bonding mode 1 to reach our
> SAN, which is terrible for storage experts.
>
>
> To come back to Mark's story, we are still using 3.6.5 DCs and planning to
> upgrade.
> Reading all this is making me delay this step.
>
> --
> Nicolas ECARNOT
>
> ___
> Users mailing list
> Users@ovirt.org
> http://lists.ovirt.org/mailman/listinfo/users
>


Re: [ovirt-users] Ovirt host activation and lvm looping with high CPU load trying to mount iSCSI storage

2017-01-12 Thread Nir Soffer
On Thu, Jan 12, 2017 at 12:02 PM, Mark Greenall
 wrote:
> Firstly, thanks @Yaniv and thanks @Nir for your responses.
>
> @Yaniv, in answer to this:
>
>>> Why do you have 1 SD per VM?
>
> It's a combination of performance and ease of management. We ran some IO 
> tests with various configurations and settled on this one for a balance of 
> reduced IO contention and ease of management. If there is a better 
> recommended way of handling these then I'm all ears. If you believe having a 
> large amount of storage domains adds to the problem then we can also review 
> the setup.
>
>>> Can you try and disable (mask) the lvmetad service on the hosts and see if 
>>> it improves matters?
>
> Disabled and masked the lvmetad service and tried again this morning. It 
> seemed to be less of a load / quicker getting the initial activation of the 
> host working but the end result was still the same. Just under 10 minutes 
> later the node went non-operational and the cycle began again. By 09:27 we 
> had the high CPU load and repeating lvm cycle.
>
> Host Activation: 09:06
> Host Up: 09:08
> Non-Operational: 09:16
> LVM Load: 09:27
> Host Reboot: 09:30
>
> From yesterday and today I've attached messages, sanlock.log and 
> multipath.conf files too. Although I'm not sure the messages file will be of 
> much use as it looks like log rate limiting kicked in and supressed messages 
> for the duration of the process. I'm booted off the kernel with debugging but 
> maybe that's generating too much info? Let me know if you want me to change 
> anything here to get additional information.
>
> As added configuration information we also have the following settings from 
> the Equallogic and Linux install guide:
>
> /etc/sysctl.conf:
>
> # Prevent ARP Flux for multiple NICs on the same subnet:
> net.ipv4.conf.all.arp_ignore = 1
> net.ipv4.conf.all.arp_announce = 2
> # Loosen RP Filter to alow multiple iSCSI connections
> net.ipv4.conf.all.rp_filter = 2
>
>
> And the following /lib/udev/rules.d/99-eqlsd.rules:
>
> #-
> #  Copyright (c) 2010-2012 by Dell, Inc.
> #
> # All rights reserved.  This software may not be copied, disclosed,
> # transferred, or used except in accordance with a license granted
> # by Dell, Inc.  This software embodies proprietary information
> # and trade secrets of Dell, Inc.
> #
> #-
> #
> # Various Settings for Dell Equallogic disks based on Dell Optimizing SAN 
> Environment for Linux Guide
> #
> # Modify disk scheduler mode to noop
> ACTION=="add|change", SUBSYSTEM=="block", ATTRS{vendor}=="EQLOGIC", 
> RUN+="/bin/sh -c 'echo noop > /sys/${DEVPATH}/queue/scheduler'"
> # Modify disk timeout value to 60 seconds
> ACTION!="remove", SUBSYSTEM=="block", ATTRS{vendor}=="EQLOGIC", RUN+="/bin/sh 
> -c 'echo 60 > /sys/%p/device/timeout'"

This timeout may cause long delays in vdsm commands accessing storage,
timeouts in various flows, and may cause your storage domain to become
inactive - and since you set this for all domains, it may cause the entire
host to become non-operational.

I recommend removing this rule.

> # Modify read ahead value to 1024
> ACTION!="remove", SUBSYSTEM=="block", ATTRS{vendor}=="EQLOGIC", RUN+="/bin/sh 
> -c 'echo 1024 > /sys/${DEVPATH}/bdi/read_ahead_kb'"

In your multipath.conf, I see that you changed a lot of the defaults
recommended by ovirt:

defaults {
    deferred_remove     yes
    dev_loss_tmo        30
    fast_io_fail_tmo    5
    flush_on_last_del   yes
    max_fds             4096
    no_path_retry       fail
    polling_interval    5
    user_friendly_names no
}

You are using:

defaults {

You are not using "deferred_remove", so you get the default value ("no").
Do you have any reason to change this?

You are not using "dev_loss_tmo", so you get the default value
Do you have any reason to change this?

You are not using "fast_io_fail_tmo", so you will get the default
value  (hopefully 5).
Do you have any reason to change this?

You are not using "flush_on_last_del" - any reason to change this?

    failback            immediate
    max_fds             8192
    no_path_retry       fail

I guess these are the settings recommended for your storage?

    path_checker        tur
    path_grouping_policy multibus
    path_selector       "round-robin 0"

    polling_interval    10

This means multipathd will check paths every 10-40 seconds.
You should use the default of 5, which causes multipathd to check every
5-20 seconds.

    rr_min_io           10
    rr_weight           priorities
    user_friendly_names no
}

Also, you are mixing defaults with settings that you need for your specific
devices.

You should leave the defaults unchanged, and create a device 
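
The separation Nir describes might look like the sketch below: the defaults
block is the oVirt-recommended one quoted earlier, and the EQLOGIC device
values are illustrative assumptions drawn from this thread, not a vendor
recommendation:

```
defaults {
    deferred_remove     yes
    dev_loss_tmo        30
    fast_io_fail_tmo    5
    flush_on_last_del   yes
    max_fds             4096
    no_path_retry       fail
    polling_interval    5
    user_friendly_names no
}

devices {
    device {
        vendor               "EQLOGIC"
        product              "100E-00"
        path_grouping_policy multibus
        path_selector        "round-robin 0"
        path_checker         tur
        failback             immediate
        rr_weight            priorities
    }
}
```

This way the device section only overrides what the storage actually needs,
and everything else tracks the oVirt defaults.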

Re: [ovirt-users] Ovirt host activation and lvm looping with high CPU load trying to mount iSCSI storage

2017-01-12 Thread Yaniv Kaul
On Thu, Jan 12, 2017 at 6:01 PM, Nicolas Ecarnot 
wrote:

> Hi,
>
> As we are using a very similar hardware and usage as Mark (Dell poweredge
> hosts, Dell Equallogic SAN, iSCSI, and tons of LUNs for all those VMs), I'm
> jumping into this thread.
>
> Le 12/01/2017 à 16:29, Yaniv Kaul a écrit :
>
>
> While it's a bit of a religious war on what is preferred with iSCSI -
> network level bonding (LACP) or multipathing on the iSCSI level, I'm on the
> multipathing side. The main reason is that you may end up easily using just
> one of the paths in a bond - if your policy is not set correct on how to
> distribute connections between the physical links (remember that each
> connection sticks to a single physical link. So it really depends on the
> hash policy and even then - not so sure). With iSCSI multipathing you have
> more control - and it can also be determined by queue depth, etc.
> (In your example, if you have SRC A -> DST 1 and SRC B -> DST 1 (as you
> seem to have), both connections may end up on the same physical NIC.)
>
>
>>
>> If we reduce the number of storage domains, we reduce the number of
>> devices and therefore the number of LVM Physical volumes that appear in
>> Linux correct? At the moment each connection results in a Linux device
>> which has its own queue. We have some guests with high IO loads on their
>> device whilst others are low. All the storage domain / datastore sizing
>> guides we found seem to imply it’s a trade-off between ease of management
>> (i.e not having millions of domains to manage), IO contention between
>> guests on a single large storage domain / datastore and possible wasted
>> space on storage domains. If you have further information on
>> recommendations, I am more than willing to change things as this problem is
>> making our environment somewhat unusable at the moment. I have hosts that I
>> can’t bring online and therefore reduced resiliency in clusters. They used
>> to work just fine but the environment has grown over the last year and we
>> also upgraded the Ovirt version from 3.6 to 4.x. We certainly had other
>> problems, but host activation wasn’t one of them and it’s a problem that’s
>> driving me mad.
>>
>
> I would say that each path has its own device (and therefore its own
> queue). So I'd argue that you may want to have (for example) 4 paths to
> each LUN or perhaps more (8?). For example, with 2 NICs, each connecting to
> two controllers, each controller having 2 NICs (so no SPOF and nice number
> of paths).
>
> Here, one key point I'm trying (to no avail) to discuss for years with
> Redhat people, and either I did not understood, either I wasn't clear
> enough, or Redhat people answered me they owned no Equallogic SAN to test
> it, is :
> My (and maybe many others) Equallogic SAN has two controllers, but is
> publishing only *ONE* virtual ip address.
>

You are completely right - you keep saying that and I keep forgetting that.
I apologize.


> On one of our other EMC SAN, publishing *TWO* ip addresses, which can be
> published in two different subnets, I fully understand the benefits and
> working of multipathing (and even in the same subnet, our oVirt setup is
> happily using multipath).
>
> But on one of our oVirt setup using the Equallogic SAN, we have no choice
> but point our hosts iSCSI interfaces to one single SAN ip, so no multipath
> here.
>
> At this point, we saw no other mean than using bonding mode 1 to reach our
> SAN, which is terrible for storage experts.
>

You could, if you do it properly, have an active-active mode, no? And if
the hash policy is correct (for example, layer3+4) you might get both
slaves used. Also, multiple sessions can be achieved with iscsid.conf's
node.session.nr_sessions (though I'm not sure we don't have a bug where we
fail to disconnect all sessions).


>
>
> To come back to Mark's story, we are still using 3.6.5 DCs and planning to
> upgrade.
> Reading all this is making me delay this step.
>

Well, it'd be nice to get to the bottom of it, but I'm quite sure it has
relatively nothing to do with 4.0.
Y.


>
> --
> Nicolas ECARNOT
>


Re: [ovirt-users] Ovirt host activation and lvm looping with high CPU load trying to mount iSCSI storage

2017-01-12 Thread Mark Greenall
Hi Yaniv,

>> 1. There is no point in so many connections.
>> 2. Certainly not the same portal - you really should have more.
>> 3. Note that some go via bond1 - and some via 'default' interface. Is that 
>> intended?
>> 4. Your multipath.conf is using rr_min_io - where it should use rr_min_io_rq 
>> most likely.

We have a single 68TB Equallogic unit with 24 disks. Each Ovirt host has 2 
HBA’s on the iSCSI network. We use Ovirt and the Cisco switches to create an 
LACP group with those 2 HBA’s. I have always assumed that the two connections 
are one each from the HBA’s (i.e I should have two paths and two connections to 
each target).

If we reduce the number of storage domains, we reduce the number of devices and 
therefore the number of LVM Physical volumes that appear in Linux correct? At 
the moment each connection results in a Linux device which has its own queue. 
We have some guests with high IO loads on their device whilst others are low. 
All the storage domain / datastore sizing guides we found seem to imply it’s a 
trade-off between ease of management (i.e not having millions of domains to 
manage), IO contention between guests on a single large storage domain / 
datastore and possible wasted space on storage domains. If you have further 
information on recommendations, I am more than willing to change things as this 
problem is making our environment somewhat unusable at the moment. I have hosts 
that I can’t bring online and therefore reduced resiliency in clusters. They 
used to work just fine but the environment has grown over the last year and we 
also upgraded the Ovirt version from 3.6 to 4.x. We certainly had other 
problems, but host activation wasn’t one of them and it’s a problem that’s 
driving me mad.

Thanks for the pointer on rr_min_io – I see that was for an older kernel. We 
had that set from a Dell guide. I’ve now removed that setting as it seems the 
default value has changed now anyway.

>> Unrelated, your engine.log is quite flooded with:
>> 2017-01-11 15:07:46,085 WARN  
>> [org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerObjectsBuilder] 
>> (DefaultQuartzScheduler9) [31a71bf5] Invalid or unknown guest architecture 
>> type '' received from guest agent
>>
>> Any idea what kind of guest you are running?

Do you have any idea which guest that’s coming from? We pretty much
exclusively have Linux (CentOS, various versions) and Windows (various
versions) as the guest OS.

Thanks again,
Mark


Re: [ovirt-users] Ovirt host activation and lvm looping with high CPU load trying to mount iSCSI storage

2017-01-12 Thread Mark Greenall
>> I would say that each path has its own device (and therefore its own queue). 
>> So I'd argue that you may want to have (for example) 4 paths to each LUN or 
>> perhaps more (8?). For example, with 2 NICs, each connecting to two 
>> controllers, each controller having 2 NICs (so no SPOF and nice number of 
>> paths).

Totally get where you are coming from with paths to LUN’s and using multipath. 
We do use that with the Dell Compellent storage we have. It has multiple active 
controllers each with an IP address in a different subnet. Unfortunately, the 
Equallogic does NOT have two active controllers. It has a single active 
controller and a single IP that migrates between the controllers when either 
one is active. If I don’t use LACP I can’t use both HBA’s on the host with 
Ovirt as it doesn’t support Dell's host integration tool (HIT) software (or you 
could argue Dell don’t support Ovirt). So, instead of being able to have a 
large number of paths to devices I can either have one active path or LACP and 
get two. As two is the most I can have to a LUN with the infrastructure we 
have, we spread the IO by increasing the number of targets (storage domains).

>> Depending on your storage, you may want to use rr_min_io_rq = 1 for latency 
>> purposes.

Looking at the man page for multipath.conf it looks like the default is now 1, 
where it was 1000 for rr_min_io. For now I’ve just removed it from our config 
file and we’ll take the default.

I’m still seeing the same problem with the couple of changes made (lvmetad and 
multipath). I’m really not very good at understanding exactly what is going on 
in the Ovirt logs. Does it provide any clues as to why it brings the host up 
and then takes it offline again? What are the barrage of lvm processes trying 
to achieve and why do they apparently fail (as it keeps on trying to run them)? 
As mentioned, throughout all this I see no multipath errors (all paths 
available), I see no iSCSI connection errors to the Equallogic. It just seems 
to be Ovirt that thinks the storage is unavailable for some reason?

Thanks,
Mark


Re: [ovirt-users] Ovirt host activation and lvm looping with high CPU load trying to mount iSCSI storage

2017-01-12 Thread Nicolas Ecarnot

Hi,

As we are using a very similar hardware and usage as Mark (Dell 
poweredge hosts, Dell Equallogic SAN, iSCSI, and tons of LUNs for all 
those VMs), I'm jumping into this thread.


Le 12/01/2017 à 16:29, Yaniv Kaul a écrit :


While it's a bit of a religious war on what is preferred with iSCSI - 
network level bonding (LACP) or multipathing on the iSCSI level, I'm 
on the multipathing side. The main reason is that you may end up 
easily using just one of the paths in a bond - if your policy is not 
set correct on how to distribute connections between the physical 
links (remember that each connection sticks to a single physical link. 
So it really depends on the hash policy and even then - not so sure). 
With iSCSI multipathing you have more control - and it can also be 
determined by queue depth, etc.
(In your example, if you have SRC A -> DST 1 and SRC B -> DST 1 (as 
you seem to have), both connections may end up on the same physical NIC.)


If we reduce the number of storage domains, we reduce the number
of devices and therefore the number of LVM Physical volumes that
appear in Linux correct? At the moment each connection results in
a Linux device which has its own queue. We have some guests with
high IO loads on their device whilst others are low. All the
storage domain / datastore sizing guides we found seem to imply
it’s a trade-off between ease of management (i.e not having
millions of domains to manage), IO contention between guests on a
single large storage domain / datastore and possible wasted space
on storage domains. If you have further information on
recommendations, I am more than willing to change things as this
problem is making our environment somewhat unusable at the moment.
I have hosts that I can’t bring online and therefore reduced
resiliency in clusters. They used to work just fine but the
environment has grown over the last year and we also upgraded the
Ovirt version from 3.6 to 4.x. We certainly had other problems,
but host activation wasn’t one of them and it’s a problem that’s
driving me mad.


I would say that each path has its own device (and therefore its own 
queue). So I'd argue that you may want to have (for example) 4 paths 
to each LUN or perhaps more (8?). For example, with 2 NICs, each 
connecting to two controllers, each controller having 2 NICs (so no 
SPOF and nice number of paths).
Here, one key point I'm trying (to no avail) to discuss for years with 
Redhat people, and either I did not understood, either I wasn't clear 
enough, or Redhat people answered me they owned no Equallogic SAN to 
test it, is :
My (and maybe many others) Equallogic SAN has two controllers, but is 
publishing only *ONE* virtual ip address.
On one of our other EMC SAN, publishing *TWO* ip addresses, which can be 
published in two different subnets, I fully understand the benefits and 
working of multipathing (and even in the same subnet, our oVirt setup is 
happily using multipath).


But on one of our oVirt setup using the Equallogic SAN, we have no 
choice but point our hosts iSCSI interfaces to one single SAN ip, so no 
multipath here.


At this point, we saw no other mean than using bonding mode 1 to reach 
our SAN, which is terrible for storage experts.



To come back to Mark's story, we are still using 3.6.5 DCs and planning 
to upgrade.

Reading all this is making me delay this step.

--
Nicolas ECARNOT


Re: [ovirt-users] Ovirt host activation and lvm looping with high CPU load trying to mount iSCSI storage

2017-01-12 Thread Yaniv Kaul
On Thu, Jan 12, 2017 at 5:01 PM, Mark Greenall 
wrote:

> Hi Yaniv,
>
>
>
> >> 1. There is no point in so many connections.
>
> >> 2. Certainly not the same portal - you really should have more.
>
> >> 3. Note that some go via bond1 - and some via 'default' interface. Is
> that intended?
>
> >> 4. Your multipath.conf is using rr_min_io - where it should
> use rr_min_io_rq most likely.
>
>
>
> We have a single 68TB Equallogic unit with 24 disks. Each Ovirt host has 2
> HBA’s on the iSCSI network. We use Ovirt and the Cisco switches to create
> an LACP group with those 2 HBA’s. I have always assumed that the two
> connections are one each from the HBA’s (i.e I should have two paths and
> two connections to each target).
>

While it's a bit of a religious war on what is preferred with iSCSI -
network level bonding (LACP) or multipathing on the iSCSI level, I'm on the
multipathing side. The main reason is that you may easily end up using just
one of the paths in a bond - if your policy on how to distribute connections
between the physical links is not set correctly (remember that each
connection sticks to a single physical link, so it really depends on the
hash policy - and even then, I'm not so sure). With iSCSI multipathing you
have more control - and it can also be determined by queue depth, etc.
(In your example, if you have SRC A -> DST 1 and SRC B -> DST 1, as you
seem to have, both connections may end up on the same physical NIC.)


>
> If we reduce the number of storage domains, we reduce the number of
> devices and therefore the number of LVM Physical volumes that appear in
> Linux correct? At the moment each connection results in a Linux device
> which has its own queue. We have some guests with high IO loads on their
> device whilst others are low. All the storage domain / datastore sizing
> guides we found seem to imply it’s a trade-off between ease of management
> (i.e not having millions of domains to manage), IO contention between
> guests on a single large storage domain / datastore and possible wasted
> space on storage domains. If you have further information on
> recommendations, I am more than willing to change things as this problem is
> making our environment somewhat unusable at the moment. I have hosts that I
> can’t bring online and therefore reduced resiliency in clusters. They used
> to work just fine but the environment has grown over the last year and we
> also upgraded the Ovirt version from 3.6 to 4.x. We certainly had other
> problems, but host activation wasn’t one of them and it’s a problem that’s
> driving me mad.
>

I would say that each path has its own device (and therefore its own
queue). So I'd argue that you may want to have (for example) 4 paths to
each LUN or perhaps more (8?). For example, with 2 NICs, each connecting to
two controllers, each controller having 2 NICs (so no SPOF and nice number
of paths).

BTW, perhaps some guests need direct LUN?


>
>
> Thanks for the pointer on rr_min_io – I see that was for an older kernel.
> We had that set from a Dell guide. I’ve now removed that setting as it
> seems the default value has changed now anyway.
>

Depending on your storage, you may want to use rr_min_io_rq = 1 for latency
purposes.


>
>
> >> Unrelated, your engine.log is quite flooded with:
>
> >> 2017-01-11 15:07:46,085 WARN  [org.ovirt.engine.core.
> vdsbroker.vdsbroker.VdsBrokerObjectsBuilder] (DefaultQuartzScheduler9)
> [31a71bf5] Invalid or unknown guest architecture type '' received from
> guest agent
>
> >>
>
> >> Any idea what kind of guest you are running?
>
>
>
> Do you have any idea what the guest name is that’s coming from? We pretty
> much exclusively have Linux (CentOS various versions) and Windows (various
> versions) as the guest OS.
>

Vinzenz - any idea?
Y.


>
>
> Thanks again,
>
> Mark
>


Re: [ovirt-users] Ovirt host activation and lvm looping with high CPU load trying to mount iSCSI storage

2017-01-12 Thread Yaniv Kaul
On Thu, Jan 12, 2017 at 12:02 PM, Mark Greenall 
wrote:

> Firstly, thanks @Yaniv and thanks @Nir for your responses.
>
> @Yaniv, in answer to this:
>
> >> Why do you have 1 SD per VM?
>
> It's a combination of performance and ease of management. We ran some IO
> tests with various configurations and settled on this one for a balance of
> reduced IO contention and ease of management. If there is a better
> recommended way of handling these then I'm all ears. If you believe having
> a large amount of storage domains adds to the problem then we can also
> review the setup.
>

I don't see how it can improve performance. Having several iSCSI
connections to a (single!) target may help, but certainly not by much.
Just from looking at your /var/log/messages:
Jan 11 15:07:11 uk1-ion-ovm-08 iscsid: Connection1:0 to [target:
iqn.2001-05.com.equallogic:4-42a846-37a238a33-4e21185c70857594-uk1-amd-cluster2-template-dstore01,
portal: 10.100.214.77,3260] through [iface: bond1.10] is operational now
Jan 11 15:07:11 uk1-ion-ovm-08 iscsid: Connection2:0 to [target:
iqn.2001-05.com.equallogic:4-42a846-37a238a33-4e21185c70857594-uk1-amd-cluster2-template-dstore01,
portal: 10.100.214.77,3260] through [iface: default] is operational now
Jan 11 15:07:11 uk1-ion-ovm-08 iscsid: Connection3:0 to [target:
iqn.2001-05.com.equallogic:4-42a846-192238a33-1f71185c70b57598-cuuk1ionhurap02-dstore01,
portal: 10.100.214.77,3260] through [iface: bond1.10] is operational now
Jan 11 15:07:11 uk1-ion-ovm-08 iscsid: Connection4:0 to [target:
iqn.2001-05.com.equallogic:4-42a846-192238a33-1f71185c70b57598-cuuk1ionhurap02-dstore01,
portal: 10.100.214.77,3260] through [iface: default] is operational now
Jan 11 15:07:11 uk1-ion-ovm-08 iscsid: Connection5:0 to [target:
iqn.2001-05.com.equallogic:4-42a846-223238a33-7301185c70e57598-cuuk1ionhurdb02-dstore01,
portal: 10.100.214.77,3260] through [iface: bond1.10] is operational now
Jan 11 15:07:11 uk1-ion-ovm-08 iscsid: Connection6:0 to [target:
iqn.2001-05.com.equallogic:4-42a846-223238a33-7301185c70e57598-cuuk1ionhurdb02-dstore01,
portal: 10.100.214.77,3260] through [iface: default] is operational now
Jan 11 15:07:11 uk1-ion-ovm-08 iscsid: Connection7:0 to [target:
iqn.2001-05.com.equallogic:4-42a846-212238a33-2a61185c719576bd-lnd-ion-anv-test-lin-64-dstore01,
portal: 10.100.214.77,3260] through [iface: bond1.10] is operational now
Jan 11 15:07:11 uk1-ion-ovm-08 iscsid: Connection8:0 to [target:
iqn.2001-05.com.equallogic:4-42a846-212238a33-2a61185c719576bd-lnd-ion-anv-test-lin-64-dstore01,
portal: 10.100.214.77,3260] through [iface: default] is operational now
Jan 11 15:07:11 uk1-ion-ovm-08 iscsid: Connection9:0 to [target:
iqn.2001-05.com.equallogic:4-42a846-ad4238a33-1b31185c75157c7e-lnd-ion-lindev-14-dstore01,
portal: 10.100.214.77,3260] through [iface: bond1.10] is operational now
Jan 11 15:07:11 uk1-ion-ovm-08 iscsid: Connection10:0 to [target:
iqn.2001-05.com.equallogic:4-42a846-ad4238a33-1b31185c75157c7e-lnd-ion-lindev-14-dstore01,
portal: 10.100.214.77,3260] through [iface: default] is operational now
Jan 11 15:07:11 uk1-ion-ovm-08 iscsid: Connection11:0 to [target:
iqn.2001-05.com.equallogic:4-42a846-b99479033-9a788b6aa6857d3b-lnd-anv-sup-03-dstore01,
portal: 10.100.214.77,3260] through [iface: bond1.10] is operational now
Jan 11 15:07:11 uk1-ion-ovm-08 iscsid: Connection12:0 to [target:
iqn.2001-05.com.equallogic:4-42a846-b99479033-9a788b6aa6857d3b-lnd-anv-sup-03-dstore01,
portal: 10.100.214.77,3260] through [iface: default] is operational now
Jan 11 15:07:11 uk1-ion-ovm-08 iscsid: Connection13:0 to [target:
iqn.2001-05.com.equallogic:4-42a846-cd9479033-ffc88b6aa6b57d3b-lnd-linsup-02-dstore01,
portal: 10.100.214.77,3260] through [iface: bond1.10] is operational now
Jan 11 15:07:11 uk1-ion-ovm-08 iscsid: Connection14:0 to [target:
iqn.2001-05.com.equallogic:4-42a846-cd9479033-ffc88b6aa6b57d3b-lnd-linsup-02-dstore01,
portal: 10.100.214.77,3260] through [iface: default] is operational now
Jan 11 15:07:11 uk1-ion-ovm-08 iscsid: Connection15:0 to [target:
iqn.2001-05.com.equallogic:4-42a846-db8479033-96f88b6aa6e57d3b-lnd-linsup-03-dstore01,
portal: 10.100.214.77,3260] through [iface: bond1.10] is operational now
Jan 11 15:07:11 uk1-ion-ovm-08 iscsid: Connection16:0 to [target:
iqn.2001-05.com.equallogic:4-42a846-db8479033-96f88b6aa6e57d3b-lnd-linsup-03-dstore01,
portal: 10.100.214.77,3260] through [iface: default] is operational now
Jan 11 15:07:11 uk1-ion-ovm-08 iscsid: Connection17:0 to [target:
iqn.2001-05.com.equallogic:4-42a846-eae479033-f6588b6aa7157d3b-lnd-linsup-04-dstore01,
portal: 10.100.214.77,3260] through [iface: bond1.10] is operational now
Jan 11 15:07:11 uk1-ion-ovm-08 iscsid: Connection18:0 to [target:
iqn.2001-05.com.equallogic:4-42a846-eae479033-f6588b6aa7157d3b-lnd-linsup-04-dstore01,
portal: 10.100.214.77,3260] through [iface: default] is operational now
Jan 11 15:07:11 uk1-ion-ovm-08 iscsid: Connection19:0 to [target:

Re: [ovirt-users] Ovirt host activation and lvm looping with high CPU load trying to mount iSCSI storage

2017-01-11 Thread Nir Soffer
On Wed, Jan 11, 2017 at 9:23 PM, Nir Soffer  wrote:
> On Wed, Jan 11, 2017 at 7:35 PM, Mark Greenall
>  wrote:
>> Hi Ovirt Champions,
>>
>>
>>
>> I am pulling my hair out and in need of advice / help.
>>
>>
>>
>> Host server: Dell PowerEdge R815 (40 cores and 768GB memory)
>>
>> Storage: Dell Equallogic (Firmware V8.1.4)
>>
>> OS: Centos 7.3 (although the same thing happens on 7.2)
>>
>> Ovirt: 4.0.6.3-1 (although also happens on 4.0.5)
>>
>>
>>
>> I can’t exactly pinpoint when this started happening but it’s certainly been
>> happening with Ovirt 4.0.5 and CentOS 7.2. Today I updated Hosted Engine and
>> one host to 4.0.6 and CentOS 7.3 but we still see the same problem. Our
>> hosts are connected to Dell iSCSI Equallogic storage. We have one storage
>> domain defined per VM guest, so do have quite a few LUNs presented to the
>> cluster (around 45 in total).
>>
>>
>>
>> Problem Description:
>>
>> 1)  Reboot a host.
>>
>> 2)  Activate a host in Ovirt Admin GUI.
>>
>> 3)  A few minutes later host is shown as activated.
>>
>> 4)  Approx 10-15 mins later host goes offline complaining that it can’t
>> connect to storage.
>>
>> 5)  The host then loops constantly (activating, non-operational,
>> connecting, initialising) and ends up with a high CPU load and a
>> large number of lvm commands in the process tree.
>>
>> 6)  Multipath and iscsi show all storage is available and logged in.
>>
>> 7)  Equallogic shows host connected and no errors.
>>
>> 8)  Admin GUI ends up saying the host can’t connect to storage
>> ‘UNKNOWN’.
>>
>>
>>
>> The strange thing is that every now and again step 5 doesn’t happen and the
>> host will actually activate again and then stay up.  However, it still
>> takes step 4 to take the host offline first.
>>
>>
>>
>> Expected Behaviour:
>>
>> 1)  Reboot a host.
>>
>> 2)  Activate a host in Ovirt Admin GUI.
>>
>> 3)  A few minutes later host is shown as activated.
>>
>> 4)  Begin using host with confidence.
>>
>>
>>
>> I’ve attached the engine.log from Hosted Engine and vdsm.log from the host.
>> The following is a timeline of the latest event.
>>
>>
>>
>> Host Activation : 15:07
>>
>> Host Up: 15:10
>>
>> Non-Operational: 15:17
>>
>>
>>
>> Seriously hoping someone can spot something obvious as this is making the
>> clusters somewhat unstable and unreliable.
>
> Can you share /var/log/messages and /var/log/sanlock.log?

And /etc/multipath.conf
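
For readers hitting similar EqualLogic path problems, the device stanza below is
the kind of thing Nir would be looking for in /etc/multipath.conf. It is only an
illustrative sketch based on commonly circulated Dell EqualLogic guidance for
EL7, not Mark's actual file; verify every value against Dell's documentation for
your array firmware before using it:

```
devices {
    device {
        # Assumed vendor/product strings for EqualLogic PS-series LUNs;
        # confirm with `multipath -ll` output on your own host.
        vendor                  "EQLOGIC"
        product                 "100E-00"
        # Spread I/O across all paths in one group.
        path_grouping_policy    multibus
        path_checker            tur
        path_selector           "round-robin 0"
        failback                immediate
        # Queue retries briefly instead of failing I/O instantly on path loss;
        # the exact retry count is a tuning assumption, not a Dell mandate.
        no_path_retry           16
        rr_min_io_rq            10
    }
}
```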

>
> Nir
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] Ovirt host activation and lvm looping with high CPU load trying to mount iSCSI storage

2017-01-11 Thread Nir Soffer
On Wed, Jan 11, 2017 at 7:35 PM, Mark Greenall
 wrote:
> Hi Ovirt Champions,
>
>
>
> I am pulling my hair out and in need of advice / help.
>
>
>
> Host server: Dell PowerEdge R815 (40 cores and 768GB memory)
>
> Storage: Dell Equallogic (Firmware V8.1.4)
>
> OS: Centos 7.3 (although the same thing happens on 7.2)
>
> Ovirt: 4.0.6.3-1 (although also happens on 4.0.5)
>
>
>
> I can’t exactly pinpoint when this started happening but it’s certainly been
> happening with Ovirt 4.0.5 and CentOS 7.2. Today I updated Hosted Engine and
> one host to 4.0.6 and CentOS 7.3 but we still see the same problem. Our
> hosts are connected to Dell iSCSI Equallogic storage. We have one storage
> domain defined per VM guest, so do have quite a few LUNs presented to the
> cluster (around 45 in total).
>
>
>
> Problem Description:
>
> 1)  Reboot a host.
>
> 2)  Activate a host in Ovirt Admin GUI.
>
> 3)  A few minutes later host is shown as activated.
>
> 4)  Approx 10-15 mins later host goes offline complaining that it can’t
> connect to storage.
>
> 5)  The host then loops constantly (activating, non-operational,
> connecting, initialising) and ends up with a high CPU load and a
> large number of lvm commands in the process tree.
>
> 6)  Multipath and iscsi show all storage is available and logged in.
>
> 7)  Equallogic shows host connected and no errors.
>
> 8)  Admin GUI ends up saying the host can’t connect to storage
> ‘UNKNOWN’.
>
>
>
> The strange thing is that every now and again step 5 doesn’t happen and the
> host will actually activate again and then stay up.  However, it still
> takes step 4 to take the host offline first.
>
>
>
> Expected Behaviour:
>
> 1)  Reboot a host.
>
> 2)  Activate a host in Ovirt Admin GUI.
>
> 3)  A few minutes later host is shown as activated.
>
> 4)  Begin using host with confidence.
>
>
>
> I’ve attached the engine.log from Hosted Engine and vdsm.log from the host.
> The following is a timeline of the latest event.
>
>
>
> Host Activation : 15:07
>
> Host Up: 15:10
>
> Non-Operational: 15:17
>
>
>
> Seriously hoping someone can spot something obvious as this is making the
> clusters somewhat unstable and unreliable.

Can you share /var/log/messages and /var/log/sanlock.log?

Nir


Re: [ovirt-users] Ovirt host activation and lvm looping with high CPU load trying to mount iSCSI storage

2017-01-11 Thread Yaniv Kaul
On Wed, Jan 11, 2017 at 7:35 PM, Mark Greenall 
wrote:

> Hi Ovirt Champions,
>
>
>
> I am pulling my hair out and in need of advice / help.
>
>
>
> Host server: Dell PowerEdge R815 (40 cores and 768GB memory)
>
> Storage: Dell Equallogic (Firmware V8.1.4)
>
> OS: Centos 7.3 (although the same thing happens on 7.2)
>
> Ovirt: 4.0.6.3-1 (although also happens on 4.0.5)
>
>
>
> I can’t exactly pinpoint when this started happening but it’s certainly
> been happening with Ovirt 4.0.5 and CentOS 7.2. Today I updated Hosted
> Engine and one host to 4.0.6 and CentOS 7.3 but we still see the same
> problem. Our hosts are connected to Dell iSCSI Equallogic storage. We have
> one storage domain defined per VM guest, so do have quite a few LUNs
> presented to the cluster (around 45 in total).
>

Why do you have 1 SD per VM?

Can you try and disable (mask) the lvmetad service on the hosts and see if
it improves matters?
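
Yaniv's lvmetad suggestion can be sketched as follows, assuming an EL7 host
with the standard lvm2-lvmetad service and socket units (the unit names and
the scratch-file demonstration are assumptions, not taken from this thread):

```shell
# Stop and mask the metadata-caching daemon so nothing re-activates it
# (shown as comments because these act on the live host):
#
#   systemctl stop lvm2-lvmetad.service lvm2-lvmetad.socket
#   systemctl mask lvm2-lvmetad.service lvm2-lvmetad.socket
#
# lvm.conf must agree with the masked daemon, i.e. use_lvmetad = 0.
# Demonstrated here on a scratch copy rather than the live /etc/lvm/lvm.conf:
conf=$(mktemp)
printf 'global {\n    use_lvmetad = 1\n}\n' > "$conf"
sed -i 's/use_lvmetad = 1/use_lvmetad = 0/' "$conf"
grep 'use_lvmetad' "$conf"
```

After applying the same edit to the real /etc/lvm/lvm.conf, restart the host
(or at least vdsmd) so subsequent lvm commands stop trying to reach lvmetad.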
Also /var/log/messages from the host may give us some clues.
TIA,
Y.


>
>
> Problem Description:
>
> 1)  Reboot a host.
>
> 2)  Activate a host in Ovirt Admin GUI.
>
> 3)  A few minutes later host is shown as activated.
>
> 4)  Approx 10-15 mins later host goes offline complaining that it
> can’t connect to storage.
>
> 5)  The host then loops constantly (activating, non-operational,
> connecting, initialising) and ends up with a high CPU load and a
> large number of lvm commands in the process tree.
>
> 6)  Multipath and iscsi show all storage is available and logged in.
>
> 7)  Equallogic shows host connected and no errors.
>
> 8)  Admin GUI ends up saying the host can’t connect to storage
> ‘UNKNOWN’.
>
>
>
> The strange thing is that every now and again step 5 doesn’t happen and
> the host will actually activate again and then stay up.  However, it still
> takes step 4 to take the host offline first.
>
>
>
> Expected Behaviour:
>
> 1)  Reboot a host.
>
> 2)  Activate a host in Ovirt Admin GUI.
>
> 3)  A few minutes later host is shown as activated.
>
> 4)  Begin using host with confidence.
>
>
>
> I’ve attached the engine.log from Hosted Engine and vdsm.log from the
> host. The following is a timeline of the latest event.
>
>
>
> Host Activation : 15:07
>
> Host Up: 15:10
>
> Non-Operational: 15:17
>
>
>
> Seriously hoping someone can spot something obvious as this is making the
> clusters somewhat unstable and unreliable.
>
>
>
> Many Thanks,
>
> Mark
>
>
>