----- Original Message -----
> From: "Chris Jones - BookIt.com Systems Administrator"
> <chris.jo...@bookit.com>
> To: users@ovirt.org
> Sent: Thursday, May 21, 2015 12:49:50 AM
> Subject: Re: [ovirt-users] oVirt Instability with Dell Compellent via
> iSCSI/Multipath
>
> >> vdsm.log in the node side, will help here too.
>
> https://www.dropbox.com/s/zvnttmylmrd0hyx/vdsm.log.gz?dl=0. This log
> contains only the messages at and after the point when a host became
> unresponsive due to storage issues.
According to the log, you have a real issue accessing storage from the host:

[nsoffer@thin untitled (master)]$ repostat vdsm.log
domain: 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2
  delay      avg: 0.000856 min: 0.000000 max: 0.001168
  last check avg: 11.510000 min: 0.300000 max: 64.100000
domain: 64101f40-0f10-471d-9f5f-44591f9e087d
  delay      avg: 0.008358 min: 0.000000 max: 0.040269
  last check avg: 11.863333 min: 0.300000 max: 63.400000
domain: 31e97cc8-6a10-4a45-8f25-95eba88b4dc0
  delay      avg: 0.007793 min: 0.000819 max: 0.041316
  last check avg: 11.466667 min: 0.000000 max: 70.200000
domain: 842edf83-22c6-46cd-acaa-a1f76d61e545
  delay      avg: 0.000493 min: 0.000374 max: 0.000698
  last check avg: 4.860000 min: 0.200000 max: 9.900000
domain: b050c455-5ab1-4107-b055-bfcc811195fc
  delay      avg: 0.002080 min: 0.000000 max: 0.040142
  last check avg: 11.830000 min: 0.000000 max: 63.700000
domain: c46adffc-614a-4fa2-9d2d-954f174f6a39
  delay      avg: 0.004798 min: 0.000000 max: 0.041006
  last check avg: 18.423333 min: 1.400000 max: 102.900000
domain: 0b1d36e4-7992-43c7-8ac0-740f7c2cadb7
  delay      avg: 0.001002 min: 0.000000 max: 0.001199
  last check avg: 11.560000 min: 0.300000 max: 61.700000
domain: 20153412-f77a-4944-b252-ff06a78a1d64
  delay      avg: 0.003748 min: 0.000000 max: 0.040903
  last check avg: 12.180000 min: 0.000000 max: 67.200000
domain: 26929b89-d1ca-4718-90d6-b3a6da585451
  delay      avg: 0.000963 min: 0.000000 max: 0.001209
  last check avg: 10.993333 min: 0.000000 max: 64.300000
domain: 0137183b-ea40-49b1-b617-256f47367280
  delay      avg: 0.000881 min: 0.000000 max: 0.001227
  last check avg: 11.086667 min: 0.100000 max: 63.200000

Note the high last check maximum values (e.g. 102 seconds).

Vdsm has a monitor thread for each domain, doing a read from one of the
storage domain's special disks every 10 seconds. A high last check value
means that the monitor thread is stuck reading from the disk.
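If you want to spot these stalls yourself, a filter like the one below
pulls the lastCheck values out of the repoStats lines in vdsm.log. This
is only a sketch - the sample line is made up, and the exact log format
can differ between vdsm versions, so adjust the grep pattern to match
what your log actually contains.

```shell
# Made-up repoStats fragment; the real lines in /var/log/vdsm/vdsm.log
# may be formatted differently.
sample="{'31e97cc8-6a10-4a45-8f25-95eba88b4dc0': {'delay': '0.000819', 'lastCheck': '70.2', 'valid': True}}"

# Extract every lastCheck value and report monitors that have not
# completed a read for more than 30 seconds.
echo "$sample" \
    | grep -oE "'lastCheck': '[0-9.]+'" \
    | awk -F"'" '$4 > 30 {print "stuck monitor, lastCheck =", $4 "s"}'
```

On a real host, replace the echo with "cat /var/log/vdsm/vdsm.log" to
scan the whole log.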
This is an indication that VMs may have trouble accessing these storage
domains; engine handles this by making the host non-operational, or, if
no host can access the domain, by making the domain inactive.

One of the known issues that can be related is bad multipath
configuration. Some storage servers have a bad builtin configuration
embedded in multipath - in particular, using "no_path_retry queue" or
"no_path_retry 60". This setting means that when the SCSI layer fails
and multipath does not have any active path, it will queue io forever
("queue"), or retry many times (e.g. 60) before failing the io request.
This can lead to a stuck process, doing a read or write that never
fails or takes many minutes to fail. Vdsm is not designed to handle
such delays - a stuck thread may block other unrelated threads.

Vdsm includes special configuration for your storage vendor (COMPELNT),
but maybe it does not match the product (Compellent Vol). See
https://github.com/oVirt/vdsm/blob/master/lib/vdsm/tool/configurators/multipath.py#L57

    device {
        vendor "COMPELNT"
        product "Compellent Vol"
        no_path_retry fail
    }

Another issue may be that the settings we ship for COMPELNT/Compellent
Vol are wrong; they are missing a lot of settings that exist in the
builtin configuration, and this may have a bad effect. If your devices
match this, I would try this multipath configuration instead of the one
vdsm configures:

    device {
        vendor "COMPELNT"
        product "Compellent Vol"
        path_grouping_policy "multibus"
        path_checker "tur"
        features "0"
        hardware_handler "0"
        prio "const"
        failback "immediate"
        rr_weight "uniform"
        no_path_retry fail
    }

To verify that your devices match this, check the device vendor and
product strings in the output of "multipath -ll". I would like to see
the output of this command.

Another platform issue is the bad default SCSI
node.session.timeo.replacement_timeout value, which is set to 120
seconds.
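About checking the vendor/product strings: the quick filter below pulls
them out of the first line of "multipath -ll" output. The sample line is
made up to show the expected shape; run the same pipe against the real
output on your hosts.

```shell
# Made-up first line of "multipath -ll" output for one Compellent LUN;
# the filter expects the "VENDOR,PRODUCT" pair at the end of the line.
line='36000d310000abc0000000000000000f1 dm-3 COMPELNT,Compellent Vol'

# Print vendor and product: the part after the comma is the product,
# and the last space-separated token before it is the vendor.
echo "$line" | awk -F, '{n = split($1, a, " "); printf "vendor=%s product=%s\n", a[n], $2}'
```

You can also read the strings for a single path directly from sysfs,
e.g. /sys/block/sda/device/vendor and /sys/block/sda/device/model.

Getting back to that replacement_timeout default: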
This setting means that the SCSI layer will wait 120 seconds for io to
complete on one path before failing the io request. So one bad path may
cause a 120 second delay, while the request could have been completed
using another path.

Multipath tries to set this value to 5 seconds, but the value reverts
to the default 120 seconds after a device has trouble. There is an open
bug about this which we hope to get fixed in rhel/centos 7.2:
https://bugzilla.redhat.com/1139038

This issue together with "no_path_retry queue" is a very bad mix for
ovirt.

You can fix this timeout by setting:

    # /etc/iscsi/iscsid.conf
    node.session.timeo.replacement_timeout = 5

and restarting the iscsid service.

With these tweaks, the issue may be resolved. I hope it helps.

Nir

> >> # rpm -qa | grep -i vdsm
> >> might help too.
>
> vdsm-cli-4.16.14-0.el7.noarch
> vdsm-reg-4.16.14-0.el7.noarch
> ovirt-node-plugin-vdsm-0.2.2-5.el7.noarch
> vdsm-python-zombiereaper-4.16.14-0.el7.noarch
> vdsm-xmlrpc-4.16.14-0.el7.noarch
> vdsm-yajsonrpc-4.16.14-0.el7.noarch
> vdsm-4.16.14-0.el7.x86_64
> vdsm-gluster-4.16.14-0.el7.noarch
> vdsm-hook-ethtool-options-4.16.14-0.el7.noarch
> vdsm-python-4.16.14-0.el7.noarch
> vdsm-jsonrpc-4.16.14-0.el7.noarch
>
> > Hey Chris,
> >
> > please open a bug [1] for this, then we can track it and we can help to
> > identify the issue.
>
> I will do so.
>
> _______________________________________________
> Users mailing list
> Users@ovirt.org
> http://lists.ovirt.org/mailman/listinfo/users