Re: [ovirt-users] High latency on storage domains and sanlock renewal error

2017-05-13 Thread Stefano Bovina
It's FC/FCoE.

This is with the configuration suggested by EMC/Red Hat:

360060160a62134002818778f949de411 dm-5 DGC,VRAID
size=11T features='2 queue_if_no_path retain_attached_hw_handler'
hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=50 status=active
| |- 1:0:1:2 sdr  65:16  active ready running
| `- 2:0:1:2 sdy  65:128 active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  |- 1:0:0:2 sdd  8:48   active ready running
  `- 2:0:0:2 sdk  8:160  active ready running
360060160a6213400e622de69949de411 dm-2 DGC,VRAID
size=6.0T features='2 queue_if_no_path retain_attached_hw_handler'
hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=50 status=active
| |- 1:0:1:0 sdp  8:240  active ready running
| `- 2:0:1:0 sdw  65:96  active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  |- 1:0:0:0 sdb  8:16   active ready running
  `- 2:0:0:0 sdi  8:128  active ready running
360060160a6213400cce46e40949de411 dm-4 DGC,VRAID
size=560G features='2 queue_if_no_path retain_attached_hw_handler'
hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=50 status=active
| |- 1:0:1:3 sds  65:32  active ready running
| `- 2:0:1:3 sdz  65:144 active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  |- 1:0:0:3 sde  8:64   active ready running
  `- 2:0:0:3 sdl  8:176  active ready running
360060160a6213400c4b39e80949de411 dm-3 DGC,VRAID
size=500G features='2 queue_if_no_path retain_attached_hw_handler'
hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=50 status=active
| |- 1:0:1:1 sdq  65:0   active ready running
| `- 2:0:1:1 sdx  65:112 active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  |- 1:0:0:1 sdc  8:32   active ready running
  `- 2:0:0:1 sdj  8:144  active ready running
360060160a6213400fa2d31acbbfce511 dm-8 DGC,RAID 5
size=5.4T features='2 queue_if_no_path retain_attached_hw_handler'
hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=50 status=active
| |- 1:0:0:6 sdh  8:112  active ready running
| `- 2:0:0:6 sdo  8:224  active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  |- 1:0:1:6 sdv  65:80  active ready running
  `- 2:0:1:6 sdac 65:192 active ready running
360060160a621340040652b7582f5e511 dm-7 DGC,RAID 5
size=3.6T features='2 queue_if_no_path retain_attached_hw_handler'
hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=50 status=active
| |- 1:0:0:4 sdf  8:80   active ready running
| `- 2:0:0:4 sdm  8:192  active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  |- 1:0:1:4 sdt  65:48  active ready running
  `- 2:0:1:4 sdaa 65:160 active ready running
360060160a621340064b1034cbbfce511 dm-6 DGC,RAID 5
size=1.0T features='2 queue_if_no_path retain_attached_hw_handler'
hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=50 status=active
| |- 1:0:1:5 sdu  65:64  active ready running
| `- 2:0:1:5 sdab 65:176 active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  |- 1:0:0:5 sdg  8:96   active ready running
  `- 2:0:0:5 sdn  8:208  active ready running


This is with the oVirt default conf:

360060160a6213400848e60af82f5e511 dm-3 DGC ,RAID 5
size=3.6T features='1 retain_attached_hw_handler' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 12:0:0:4 sdj 8:144 active ready  running
| `- 13:0:1:4 sdd 8:48  active ready  running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 12:0:1:4 sdf 8:80  active ready  running
  `- 13:0:0:4 sdh 8:112 active ready  running
360060160a621345e425b6b10e611 dm-2 DGC ,RAID 10
size=4.2T features='1 retain_attached_hw_handler' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 12:0:1:0 sde 8:64  active ready  running
| `- 13:0:0:0 sdg 8:96  active ready  running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 13:0:1:0 sdc 8:32  active ready  running
  `- 12:0:0:0 sdi 8:128 active ready  running
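
For cross-checking, one way to confirm which path selector and features each map is actually running with is to dump the device-mapper tables directly (a sketch; the WWIDs are taken from the two listings above, adjust as needed):

dmsetup table 360060160a62134002818778f949de411
dmsetup table 360060160a6213400848e60af82f5e511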


2017-05-13 18:50 GMT+02:00 Juan Pablo :

> can you please give the output of:
> multipath -ll
> and
> iscsiadm -m session -P3
>
> JP
>
> 2017-05-13 6:48 GMT-03:00 Stefano Bovina :
>
>> Hi,
>>
>> 2.6.32-696.1.1.el6.x86_64
>> 3.10.0-514.10.2.el7.x86_64
>>
>> I tried the ioping test from different groups of servers using multipath,
>> members of different storage groups (different LUNs, different RAID, etc.), and
>> every one of them reports latency.
>> I tried the same test (ioping) on a server with PowerPath instead of
>> multipath, with a dedicated RAID group, and there ioping does not report latency.
>>
>>
>> 2017-05-13 2:00 GMT+02:00 Juan Pablo :
>>
>>> Sorry to jump in, but what kernel version are you using? I had a similar
>>> issue with kernels 4.10/4.11.
>>>
>>>
>>> 2017-05-12 16:36 GMT-03:00 Stefano Bovina :
>>>
 Hi,
 a little update:

 The multipath -ll command hangs when executed on the host while the
 problem is occurring (nothing is logged in /var/log/messages or dmesg).

 I 

Re: [ovirt-users] High latency on storage domains and sanlock renewal error

2017-05-13 Thread Juan Pablo
can you please give the output of:
multipath -ll
and
iscsiadm -m session -P3

JP

2017-05-13 6:48 GMT-03:00 Stefano Bovina :

> Hi,
>
> 2.6.32-696.1.1.el6.x86_64
> 3.10.0-514.10.2.el7.x86_64
>
> I tried the ioping test from different groups of servers using multipath,
> members of different storage groups (different LUNs, different RAID, etc.), and
> every one of them reports latency.
> I tried the same test (ioping) on a server with PowerPath instead of
> multipath, with a dedicated RAID group, and there ioping does not report latency.
>
>
> 2017-05-13 2:00 GMT+02:00 Juan Pablo :
>
>> Sorry to jump in, but what kernel version are you using? I had a similar
>> issue with kernels 4.10/4.11.
>>
>>
>> 2017-05-12 16:36 GMT-03:00 Stefano Bovina :
>>
>>> Hi,
>>> a little update:
>>>
>>> The multipath -ll command hangs when executed on the host while the
>>> problem is occurring (nothing is logged in /var/log/messages or dmesg).
>>>
>>> I tested latency with ioping:
>>> ioping /dev/6a386652-629d-4045-835b-21d2f5c104aa/metadata
>>>
>>> Usually it returns "time=15.6 ms"; sometimes it returns "time=19 s" (yes,
>>> seconds).
>>>
>>> The systems are up to date, and I tried both path_checker settings (emc_clariion
>>> and directio) without results.
>>> (https://access.redhat.com/solutions/139193 refers to Rev A31
>>> of the EMC document; the latest is A42 and suggests emc_clariion.)
>>>
>>> Any idea or suggestion?
>>>
>>> Thanks,
>>>
>>> Stefano
>>>
>>> 2017-05-08 11:56 GMT+02:00 Yaniv Kaul :
>>>


 On Mon, May 8, 2017 at 11:50 AM, Stefano Bovina 
 wrote:

> Yes,
> this configuration is the one suggested by EMC for EL7.
>

 https://access.redhat.com/solutions/139193 suggests that for alua, the
 path checker needs to be different.

 Anyway, it is very likely that you have storage issues - they need to
 be resolved first and I believe they have little to do with oVirt at the
 moment.
 Y.


>
> By the way,
> "The parameters rr_min_io vs. rr_min_io_rq mean the same thing but are
> used for device-mapper-multipath on differing kernel versions." and
> rr_min_io_rq default value is 1, rr_min_io default value is 1000, so it
> should be fine.
>
>
> 2017-05-08 9:39 GMT+02:00 Yaniv Kaul :
>
>>
>> On Sun, May 7, 2017 at 1:27 PM, Stefano Bovina 
>> wrote:
>>
>>> Sense data are 0x0/0x0/0x0
>>
>>
>> Interesting - first time I'm seeing 0/0/0. The 1st is usually 0x2
>> (see [1]), and then the rest [2], [3] make sense.
>>
>> A Google search found another user with a CLARiiON with the exact same
>> error [4], so I'm leaning toward a misconfiguration of multipathing/CLARiiON
>> here.
>>
>> Is your multipathing configuration working well for you?
>> Are you sure it's an EL7 configuration? For example, I believe you
>> should have rr_min_io_rq and not rr_min_io.
>> Y.
>>
>> [1] http://www.t10.org/lists/2status.htm
>> [2] http://www.t10.org/lists/2sensekey.htm
>> [3] http://www.t10.org/lists/asc-num.htm
>> [4] http://www.linuxquestions.org/questions/centos-111/multipath-problems-4175544908/
>>
>
>

>>>
>>>
>>>
>>
>


Re: [ovirt-users] High latency on storage domains and sanlock renewal error

2017-05-13 Thread Stefano Bovina
Hi,

2.6.32-696.1.1.el6.x86_64
3.10.0-514.10.2.el7.x86_64

I tried the ioping test from different groups of servers using multipath,
members of different storage groups (different LUNs, different RAID, etc.), and
every one of them reports latency.
I tried the same test (ioping) on a server with PowerPath instead of
multipath, with a dedicated RAID group, and there ioping does not report latency.
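
If it helps to reproduce the comparison from the shell, a minimal sketch (the map name is one of the WWIDs listed earlier; -D forces direct I/O so the page cache does not mask the latency):

ioping -D -c 20 /dev/mapper/360060160a62134002818778f949de411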


2017-05-13 2:00 GMT+02:00 Juan Pablo :

> Sorry to jump in, but what kernel version are you using? I had a similar issue
> with kernels 4.10/4.11.
>
>
> 2017-05-12 16:36 GMT-03:00 Stefano Bovina :
>
>> Hi,
>> a little update:
>>
>> The multipath -ll command hangs when executed on the host while the
>> problem is occurring (nothing is logged in /var/log/messages or dmesg).
>>
>> I tested latency with ioping:
>> ioping /dev/6a386652-629d-4045-835b-21d2f5c104aa/metadata
>>
>> Usually it returns "time=15.6 ms"; sometimes it returns "time=19 s" (yes,
>> seconds).
>>
>> The systems are up to date, and I tried both path_checker settings (emc_clariion
>> and directio) without results.
>> (https://access.redhat.com/solutions/139193 refers to Rev A31 of
>> the EMC document; the latest is A42 and suggests emc_clariion.)
>>
>> Any idea or suggestion?
>>
>> Thanks,
>>
>> Stefano
>>
>> 2017-05-08 11:56 GMT+02:00 Yaniv Kaul :
>>
>>>
>>>
>>> On Mon, May 8, 2017 at 11:50 AM, Stefano Bovina 
>>> wrote:
>>>
 Yes,
 this configuration is the one suggested by EMC for EL7.

>>>
>>> https://access.redhat.com/solutions/139193 suggests that for alua, the
>>> path checker needs to be different.
>>>
>>> Anyway, it is very likely that you have storage issues - they need to be
>>> resolved first and I believe they have little to do with oVirt at the
>>> moment.
>>> Y.
>>>
>>>

 By the way,
 "The parameters rr_min_io vs. rr_min_io_rq mean the same thing but are
 used for device-mapper-multipath on differing kernel versions." and
 rr_min_io_rq default value is 1, rr_min_io default value is 1000, so it
 should be fine.


 2017-05-08 9:39 GMT+02:00 Yaniv Kaul :

>
> On Sun, May 7, 2017 at 1:27 PM, Stefano Bovina 
> wrote:
>
>> Sense data are 0x0/0x0/0x0
>
>
> Interesting - first time I'm seeing 0/0/0. The 1st is usually 0x2 (see
> [1]), and then the rest [2], [3] make sense.
>
> A Google search found another user with a CLARiiON with the exact same
> error [4], so I'm leaning toward a misconfiguration of multipathing/CLARiiON
> here.
>
> Is your multipathing configuration working well for you?
> Are you sure it's an EL7 configuration? For example, I believe you
> should have rr_min_io_rq and not rr_min_io.
> Y.
>
> [1] http://www.t10.org/lists/2status.htm
> [2] http://www.t10.org/lists/2sensekey.htm
> [3] http://www.t10.org/lists/asc-num.htm
> [4] http://www.linuxquestions.org/questions/centos-111/multipath-problems-4175544908/
>


>>>
>>
>>
>>
>


Re: [ovirt-users] High latency on storage domains and sanlock renewal error

2017-05-12 Thread Juan Pablo
Sorry to jump in, but what kernel version are you using? I had a similar issue
with kernels 4.10/4.11.


2017-05-12 16:36 GMT-03:00 Stefano Bovina :

> Hi,
> a little update:
>
> The multipath -ll command hangs when executed on the host while the problem
> is occurring (nothing is logged in /var/log/messages or dmesg).
>
> I tested latency with ioping:
> ioping /dev/6a386652-629d-4045-835b-21d2f5c104aa/metadata
>
> Usually it returns "time=15.6 ms"; sometimes it returns "time=19 s" (yes,
> seconds).
>
> The systems are up to date, and I tried both path_checker settings (emc_clariion
> and directio) without results.
> (https://access.redhat.com/solutions/139193 refers to Rev A31 of
> the EMC document; the latest is A42 and suggests emc_clariion.)
>
> Any idea or suggestion?
>
> Thanks,
>
> Stefano
>
> 2017-05-08 11:56 GMT+02:00 Yaniv Kaul :
>
>>
>>
>> On Mon, May 8, 2017 at 11:50 AM, Stefano Bovina  wrote:
>>
>>> Yes,
>>> this configuration is the one suggested by EMC for EL7.
>>>
>>
>> https://access.redhat.com/solutions/139193 suggests that for alua, the
>> path checker needs to be different.
>>
>> Anyway, it is very likely that you have storage issues - they need to be
>> resolved first and I believe they have little to do with oVirt at the
>> moment.
>> Y.
>>
>>
>>>
>>> By the way,
>>> "The parameters rr_min_io vs. rr_min_io_rq mean the same thing but are
>>> used for device-mapper-multipath on differing kernel versions." and
>>> rr_min_io_rq default value is 1, rr_min_io default value is 1000, so it
>>> should be fine.
>>>
>>>
>>> 2017-05-08 9:39 GMT+02:00 Yaniv Kaul :
>>>

 On Sun, May 7, 2017 at 1:27 PM, Stefano Bovina 
 wrote:

> Sense data are 0x0/0x0/0x0


 Interesting - first time I'm seeing 0/0/0. The 1st is usually 0x2 (see
 [1]), and then the rest [2], [3] make sense.

 A Google search found another user with a CLARiiON with the exact same
 error [4], so I'm leaning toward a misconfiguration of multipathing/CLARiiON
 here.

 Is your multipathing configuration working well for you?
 Are you sure it's an EL7 configuration? For example, I believe you
 should have rr_min_io_rq and not rr_min_io.
 Y.

 [1] http://www.t10.org/lists/2status.htm
 [2] http://www.t10.org/lists/2sensekey.htm
 [3] http://www.t10.org/lists/asc-num.htm
 [4] http://www.linuxquestions.org/questions/centos-111/multipath-problems-4175544908/

>>>
>>>
>>
>
>
>


Re: [ovirt-users] High latency on storage domains and sanlock renewal error

2017-05-12 Thread Stefano Bovina
Hi,
a little update:

The multipath -ll command hangs when executed on the host while the problem
is occurring (nothing is logged in /var/log/messages or dmesg).
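
If multipath -ll hangs again, a rough way to see where it blocks (a sketch, assuming strace is installed; the trace file path is just an example):

timeout 120 strace -f -tt -o /tmp/multipath-ll.trace multipath -ll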

I tested latency with ioping:
ioping /dev/6a386652-629d-4045-835b-21d2f5c104aa/metadata

Usually it returns "time=15.6 ms"; sometimes it returns "time=19 s" (yes,
seconds).
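
To catch the intermittent spikes over a longer window, a simple sketch (-i is the interval, -c the count, so this samples once per second for an hour):

ioping -i 1 -c 3600 /dev/6a386652-629d-4045-835b-21d2f5c104aa/metadata | tee /tmp/ioping-metadata.log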

The systems are up to date, and I tried both path_checker settings (emc_clariion
and directio) without results.
(https://access.redhat.com/solutions/139193 refers to Rev A31 of
the EMC document; the latest is A42 and suggests emc_clariion.)

Any idea or suggestion?

Thanks,

Stefano

2017-05-08 11:56 GMT+02:00 Yaniv Kaul :

>
>
> On Mon, May 8, 2017 at 11:50 AM, Stefano Bovina  wrote:
>
>> Yes,
>> this configuration is the one suggested by EMC for EL7.
>>
>
> https://access.redhat.com/solutions/139193 suggests that for alua, the
> path checker needs to be different.
>
> Anyway, it is very likely that you have storage issues - they need to be
> resolved first and I believe they have little to do with oVirt at the
> moment.
> Y.
>
>
>>
>> By the way,
>> "The parameters rr_min_io vs. rr_min_io_rq mean the same thing but are
>> used for device-mapper-multipath on differing kernel versions." and
>> rr_min_io_rq default value is 1, rr_min_io default value is 1000, so it
>> should be fine.
>>
>>
>> 2017-05-08 9:39 GMT+02:00 Yaniv Kaul :
>>
>>>
>>> On Sun, May 7, 2017 at 1:27 PM, Stefano Bovina  wrote:
>>>
 Sense data are 0x0/0x0/0x0
>>>
>>>
>>> Interesting - first time I'm seeing 0/0/0. The 1st is usually 0x2 (see
>>> [1]), and then the rest [2], [3] make sense.
>>>
>>> A Google search found another user with a CLARiiON with the exact same
>>> error [4], so I'm leaning toward a misconfiguration of multipathing/CLARiiON
>>> here.
>>>
>>> Is your multipathing configuration working well for you?
>>> Are you sure it's an EL7 configuration? For example, I believe you should
>>> have rr_min_io_rq and not rr_min_io.
>>> Y.
>>>
>>> [1] http://www.t10.org/lists/2status.htm
>>> [2] http://www.t10.org/lists/2sensekey.htm
>>> [3] http://www.t10.org/lists/asc-num.htm
>>> [4] http://www.linuxquestions.org/questions/centos-111/multipath-problems-4175544908/
>>>
>>
>>
>


Re: [ovirt-users] High latency on storage domains and sanlock renewal error

2017-05-08 Thread Stefano Bovina
Yes,
this configuration is the one suggested by EMC for EL7.

By the way,
"The parameters rr_min_io vs. rr_min_io_rq mean the same thing but are used
for device-mapper-multipath on differing kernel versions." and rr_min_io_rq
default value is 1, rr_min_io default value is 1000, so it should be fine.
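
If it is unclear which of the two options the running multipathd actually merged, one way to check (a sketch; the output format varies between versions):

multipathd -k'show config' | grep -E 'path_checker|path_selector|rr_min_io'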


2017-05-08 9:39 GMT+02:00 Yaniv Kaul :

>
> On Sun, May 7, 2017 at 1:27 PM, Stefano Bovina  wrote:
>
>> Sense data are 0x0/0x0/0x0
>
>
> Interesting - first time I'm seeing 0/0/0. The 1st is usually 0x2 (see
> [1]), and then the rest [2], [3] make sense.
>
> A Google search found another user with a CLARiiON with the exact same
> error [4], so I'm leaning toward a misconfiguration of multipathing/CLARiiON
> here.
>
> Is your multipathing configuration working well for you?
> Are you sure it's an EL7 configuration? For example, I believe you should
> have rr_min_io_rq and not rr_min_io.
> Y.
>
> [1] http://www.t10.org/lists/2status.htm
> [2] http://www.t10.org/lists/2sensekey.htm
> [3] http://www.t10.org/lists/asc-num.htm
> [4] http://www.linuxquestions.org/questions/centos-111/multipath-problems-4175544908/
>


Re: [ovirt-users] High latency on storage domains and sanlock renewal error

2017-05-08 Thread Stefano Bovina
Hi,

thanks for the advice. The upgrade is already scheduled, but I would like
to fix this issue before proceeding with a big upgrade (unless the upgrade
itself will fix the problem).

The problem is on all hypervisors.

We have 2 clusters (both connected to the same storage system):
 - the old one with FC
 - the new one with FCoE


Using dmesg -T and looking at /var/log/messages, we found several problems
like these:

1)
[Wed May  3 10:40:11 2017] sd 12:0:0:3: Parameters changed
[Wed May  3 10:40:11 2017] sd 12:0:1:3: Parameters changed
[Wed May  3 10:40:11 2017] sd 12:0:1:1: Parameters changed
[Wed May  3 10:40:12 2017] sd 13:0:0:1: Parameters changed
[Wed May  3 10:40:12 2017] sd 13:0:0:3: Parameters changed
[Wed May  3 10:40:12 2017] sd 13:0:1:3: Parameters changed
[Wed May  3 12:39:32 2017] device-mapper: multipath: Failing path 65:144.
[Wed May  3 12:39:37 2017] sd 13:0:1:2: alua: port group 01 state A
preferred supports tolUsNA

2)
[Wed May  3 17:08:17 2017] perf interrupt took too long (2590 > 2500),
lowering kernel.perf_event_max_sample_rate to 5

3)
[Wed May  3 19:16:21 2017] bnx2fc: els 0x5: tgt not ready
[Wed May  3 19:16:21 2017] bnx2fc: Relogin to the tgt

4)
sd 13:0:1:0: [sdx] FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
sd 13:0:1:0: [sdx] CDB: Read(16) 88 00 00 00 00 00 00 58 08 00 00 00 04 00
00 00
blk_update_request: I/O error, dev sdx, sector 5769216
device-mapper: multipath: Failing path 65:112.
sd 13:0:1:0: alua: port group 01 state A preferred supports tolUsNA

5)
multipathd: 360060160a6213400cce46e40949de411: sdaa - emc_clariion_checker:
Read error for WWN 60060160a6213400cce46e40949de411.  Sense data are
0x0/0x0/0x0.
multipathd: checker failed path 65:160 in map
360060160a6213400cce46e40949de411
multipathd: 360060160a6213400cce46e40949de411: remaining active paths: 3
kernel: device-mapper: multipath: Failing path 65:160.
multipathd: 360060160a6213400cce46e40949de411: sdaa - emc_clariion_checker:
Active path is healthy.
multipathd: 65:160: reinstated
multipathd: 360060160a6213400cce46e40949de411: remaining active paths: 4

6)
[Sat May  6 11:37:07 2017] megaraid_sas :02:00.0: Firmware crash dump
is not available
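
To correlate these events with the latency spikes as they happen, a simple watch (a sketch, assuming the host's dmesg supports -T/-w, as the EL7 ones here appear to):

dmesg -Tw | grep -Ei 'multipath|alua|blk_update_request|bnx2fc'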


Multipath configuration is the following (recommended by EMC):

# RHEV REVISION 1.1
# RHEV PRIVATE

devices {
    device {
        vendor                  "DGC"
        product                 ".*"
        product_blacklist       "LUNZ"
        path_grouping_policy    group_by_prio
        path_selector           "round-robin 0"
        path_checker            emc_clariion
        features                "1 queue_if_no_path"
        hardware_handler        "1 alua"
        prio                    alua
        failback                immediate
        rr_weight               uniform
        no_path_retry           60
        rr_min_io               1
    }
}
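
For comparison only, a sketch of the same stanza using the request-based option name Yaniv mentioned for EL7 (illustrative, not a vendor-validated configuration):

devices {
    device {
        vendor                  "DGC"
        product                 ".*"
        product_blacklist       "LUNZ"
        path_grouping_policy    group_by_prio
        path_selector           "round-robin 0"
        path_checker            emc_clariion
        features                "1 queue_if_no_path"
        hardware_handler        "1 alua"
        prio                    alua
        failback                immediate
        rr_weight               uniform
        no_path_retry           60
        rr_min_io_rq            1
    }
}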

Regards,

Stefano


2017-05-07 8:36 GMT+02:00 Yaniv Kaul :

>
>
> On Tue, May 2, 2017 at 11:09 PM, Stefano Bovina  wrote:
>
>> Hi, the engine logs show high latency on storage domains: "Storage domain
>>  experienced a high latency of 19.2814 seconds from .. This may
>> cause performance and functional issues."
>>
>> Looking at the host logs, I also found these locking errors:
>>
>> 2017-05-02 20:52:13+0200 33883 [10098]: s1 renewal error -202
>> delta_length 10 last_success 33853
>> 2017-05-02 20:52:19+0200 33889 [10098]: 6a386652 aio collect 0
>> 0x7f1fb80008c0:0x7f1fb80008d0:0x7f1fbe9fb000 result 1048576:0 other free
>> 2017-05-02 21:08:51+0200 34880 [10098]: 6a386652 aio timeout 0
>> 0x7f1fb80008c0:0x7f1fb80008d0:0x7f1fbe4f2000 ioto 10 to_count 24
>> 2017-05-02 21:08:51+0200 34880 [10098]: s1 delta_renew read rv -202
>> offset 0 /dev/6a386652-629d-4045-835b-21d2f5c104aa/ids
>> 2017-05-02 21:08:51+0200 34880 [10098]: s1 renewal error -202
>> delta_length 10 last_success 34850
>> 2017-05-02 21:08:53+0200 34883 [10098]: 6a386652 aio collect 0
>> 0x7f1fb80008c0:0x7f1fb80008d0:0x7f1fbe4f2000 result 1048576:0 other free
>> 2017-05-02 21:30:40+0200 36189 [10098]: 6a386652 aio timeout 0
>> 0x7f1fb80008c0:0x7f1fb80008d0:0x7f1fbe9fb000 ioto 10 to_count 25
>> 2017-05-02 21:30:40+0200 36189 [10098]: s1 delta_renew read rv -202
>> offset 0 /dev/6a386652-629d-4045-835b-21d2f5c104aa/ids
>> 2017-05-02 21:30:40+0200 36189 [10098]: s1 renewal error -202
>> delta_length 10 last_success 36159
>> 2017-05-02 21:30:45+0200 36195 [10098]: 6a386652 aio collect 0
>> 0x7f1fb80008c0:0x7f1fb80008d0:0x7f1fbe9fb000 result 1048576:0 other free
>>
>> and these vdsm errors too:
>>
>> Thread-22::ERROR::2017-05-02 21:53:48,147::sdc::137::Storage.StorageDomainCache::(_findDomain) looking for unfetched domain
>> f8f21d6c-2425-45c4-aded-4cb9b53ebd96
>> Thread-22::ERROR::2017-05-02 21:53:48,148::sdc::154::Storage.StorageDomainCache::(_findUnfetchedDomain) looking for domain
>> f8f21d6c-2425-45c4-aded-4cb9b53ebd96
>>
>> The engine instead is showing these errors:
>>
>> 2017-05-02 21:40:38,089 ERROR [org.ovirt.engine.core.vdsbrok
>> er.vdsbroker.SpmStatusVDSCommand] 

Re: [ovirt-users] High latency on storage domains and sanlock renewal error

2017-05-08 Thread Yaniv Kaul
On Sun, May 7, 2017 at 1:27 PM, Stefano Bovina  wrote:

> Sense data are 0x0/0x0/0x0


Interesting - first time I'm seeing 0/0/0. The 1st is usually 0x2 (see
[1]), and then the rest [2], [3] make sense.

A Google search found another user with a CLARiiON with the exact same
error [4], so I'm leaning toward a misconfiguration of multipathing/CLARiiON
here.

Is your multipathing configuration working well for you?
Are you sure it's an EL7 configuration? For example, I believe you should
have rr_min_io_rq and not rr_min_io.
Y.

[1] http://www.t10.org/lists/2status.htm
[2] http://www.t10.org/lists/2sensekey.htm
[3] http://www.t10.org/lists/asc-num.htm
[4] http://www.linuxquestions.org/questions/centos-111/multipath-problems-4175544908/


Re: [ovirt-users] High latency on storage domains and sanlock renewal error

2017-05-07 Thread Yaniv Kaul
On Tue, May 2, 2017 at 11:09 PM, Stefano Bovina  wrote:

> Hi, the engine logs show high latency on storage domains: "Storage domain
>  experienced a high latency of 19.2814 seconds from .. This may
> cause performance and functional issues."
>
> Looking at the host logs, I also found these locking errors:
>
> 2017-05-02 20:52:13+0200 33883 [10098]: s1 renewal error -202 delta_length
> 10 last_success 33853
> 2017-05-02 20:52:19+0200 33889 [10098]: 6a386652 aio collect 0
> 0x7f1fb80008c0:0x7f1fb80008d0:0x7f1fbe9fb000 result 1048576:0 other free
> 2017-05-02 21:08:51+0200 34880 [10098]: 6a386652 aio timeout 0
> 0x7f1fb80008c0:0x7f1fb80008d0:0x7f1fbe4f2000 ioto 10 to_count 24
> 2017-05-02 21:08:51+0200 34880 [10098]: s1 delta_renew read rv -202 offset
> 0 /dev/6a386652-629d-4045-835b-21d2f5c104aa/ids
> 2017-05-02 21:08:51+0200 34880 [10098]: s1 renewal error -202 delta_length
> 10 last_success 34850
> 2017-05-02 21:08:53+0200 34883 [10098]: 6a386652 aio collect 0
> 0x7f1fb80008c0:0x7f1fb80008d0:0x7f1fbe4f2000 result 1048576:0 other free
> 2017-05-02 21:30:40+0200 36189 [10098]: 6a386652 aio timeout 0
> 0x7f1fb80008c0:0x7f1fb80008d0:0x7f1fbe9fb000 ioto 10 to_count 25
> 2017-05-02 21:30:40+0200 36189 [10098]: s1 delta_renew read rv -202 offset
> 0 /dev/6a386652-629d-4045-835b-21d2f5c104aa/ids
> 2017-05-02 21:30:40+0200 36189 [10098]: s1 renewal error -202 delta_length
> 10 last_success 36159
> 2017-05-02 21:30:45+0200 36195 [10098]: 6a386652 aio collect 0
> 0x7f1fb80008c0:0x7f1fb80008d0:0x7f1fbe9fb000 result 1048576:0 other free
>
> and these vdsm errors too:
>
> Thread-22::ERROR::2017-05-02 21:53:48,147::sdc::137::Storage.StorageDomainCache::(_findDomain) looking for unfetched domain
> f8f21d6c-2425-45c4-aded-4cb9b53ebd96
> Thread-22::ERROR::2017-05-02 21:53:48,148::sdc::154::Storage.StorageDomainCache::(_findUnfetchedDomain) looking for domain
> f8f21d6c-2425-45c4-aded-4cb9b53ebd96
>
> The engine instead is showing these errors:
>
> 2017-05-02 21:40:38,089 ERROR [org.ovirt.engine.core.vdsbrok
> er.vdsbroker.SpmStatusVDSCommand] (DefaultQuartzScheduler_Worker-96)
> Command SpmStatusVDSCommand(HostName = , HostId =
> dcc0275a-b011-4e33-bb95-366ffb0697b3, storagePoolId =
> 715d1ba2-eabe-48db-9aea-c28c30359808) execution failed. Exception:
> VDSErrorException: VDSGenericException: VDSErrorException: Failed to
> SpmStatusVDS, error = (-202, 'Sanlock resource read failure', 'Sanlock
> exception'), code = 100
> 2017-05-02 21:41:08,431 ERROR [org.ovirt.engine.core.vdsbrok
> er.vdsbroker.SpmStatusVDSCommand] (DefaultQuartzScheduler_Worker-53)
> [6e0d5ebf] Failed in SpmStatusVDS method
> 2017-05-02 21:41:08,443 ERROR [org.ovirt.engine.core.vdsbrok
> er.vdsbroker.SpmStatusVDSCommand] (DefaultQuartzScheduler_Worker-53)
> [6e0d5ebf] Command SpmStatusVDSCommand(HostName = ,
> HostId = 7991933e-5f30-48cd-88bf-b0b525613384, storagePoolId =
> 4bd73239-22d0-4c44-ab8c-17adcd580309) execution failed. Exception:
> VDSErrorException: VDSGenericException: VDSErrorException: Failed to
> SpmStatusVDS, error = (-202, 'Sanlock resource read failure', 'Sanlock
> exception'), code = 100
> 2017-05-02 21:41:31,975 ERROR [org.ovirt.engine.core.vdsbrok
> er.vdsbroker.SpmStatusVDSCommand] (DefaultQuartzScheduler_Worker-61)
> [2a54a1b2] Failed in SpmStatusVDS method
> 2017-05-02 21:41:31,987 ERROR [org.ovirt.engine.core.vdsbrok
> er.vdsbroker.SpmStatusVDSCommand] (DefaultQuartzScheduler_Worker-61)
> [2a54a1b2] Command SpmStatusVDSCommand(HostName = ,
> HostId = dcc0275a-b011-4e33-bb95-366ffb0697b3, storagePoolId =
> 715d1ba2-eabe-48db-9aea-c28c30359808) execution failed. Exception:
> VDSErrorException: VDSGenericException: VDSErrorException: Failed to
> SpmStatusVDS, error = (-202, 'Sanlock resource read failure', 'Sanlock
> exception'), code = 100
>
>
> I'm using Fibre Channel or FCoE connectivity; the storage array's technical
> support has analyzed it (along with the switch and OS configurations), but nothing
> has been found.
>

Is this on a specific host, or on multiple hosts?
Is that FC or FCoE? Anything on the host's /var/log/messages?


>
> Any advice?
>
> Thanks
>
>
> Installation info:
>
> ovirt-release35-006-1.noarch
>

This is a very old release; regardless of this issue, I suggest upgrading.
Y.


> libgovirt-0.3.3-1.el7_2.1.x86_64
> vdsm-4.16.30-0.el7.centos.x86_64
> vdsm-xmlrpc-4.16.30-0.el7.centos.noarch
> vdsm-yajsonrpc-4.16.30-0.el7.centos.noarch
> vdsm-jsonrpc-4.16.30-0.el7.centos.noarch
> vdsm-python-zombiereaper-4.16.30-0.el7.centos.noarch
> vdsm-python-4.16.30-0.el7.centos.noarch
> vdsm-cli-4.16.30-0.el7.centos.noarch
> qemu-kvm-ev-2.3.0-29.1.el7.x86_64
> qemu-kvm-common-ev-2.3.0-29.1.el7.x86_64
> qemu-kvm-tools-ev-2.3.0-29.1.el7.x86_64
> libvirt-client-1.2.17-13.el7_2.3.x86_64
> libvirt-daemon-driver-storage-1.2.17-13.el7_2.3.x86_64
> libvirt-python-1.2.17-2.el7.x86_64
> libvirt-daemon-driver-nwfilter-1.2.17-13.el7_2.3.x86_64
> libvirt-daemon-driver-nodedev-1.2.17-13.el7_2.3.x86_64
> 

[ovirt-users] High latency on storage domains and sanlock renewal error

2017-05-07 Thread Stefano Bovina
Hi, the engine logs show high latency on storage domains: "Storage domain
 experienced a high latency of 19.2814 seconds from .. This may
cause performance and functional issues."

Looking at the host logs, I also found these locking errors:

2017-05-02 20:52:13+0200 33883 [10098]: s1 renewal error -202 delta_length
10 last_success 33853
2017-05-02 20:52:19+0200 33889 [10098]: 6a386652 aio collect 0
0x7f1fb80008c0:0x7f1fb80008d0:0x7f1fbe9fb000 result 1048576:0 other free
2017-05-02 21:08:51+0200 34880 [10098]: 6a386652 aio timeout 0
0x7f1fb80008c0:0x7f1fb80008d0:0x7f1fbe4f2000 ioto 10 to_count 24
2017-05-02 21:08:51+0200 34880 [10098]: s1 delta_renew read rv -202 offset
0 /dev/6a386652-629d-4045-835b-21d2f5c104aa/ids
2017-05-02 21:08:51+0200 34880 [10098]: s1 renewal error -202 delta_length
10 last_success 34850
2017-05-02 21:08:53+0200 34883 [10098]: 6a386652 aio collect 0
0x7f1fb80008c0:0x7f1fb80008d0:0x7f1fbe4f2000 result 1048576:0 other free
2017-05-02 21:30:40+0200 36189 [10098]: 6a386652 aio timeout 0
0x7f1fb80008c0:0x7f1fb80008d0:0x7f1fbe9fb000 ioto 10 to_count 25
2017-05-02 21:30:40+0200 36189 [10098]: s1 delta_renew read rv -202 offset
0 /dev/6a386652-629d-4045-835b-21d2f5c104aa/ids
2017-05-02 21:30:40+0200 36189 [10098]: s1 renewal error -202 delta_length
10 last_success 36159
2017-05-02 21:30:45+0200 36195 [10098]: 6a386652 aio collect 0
0x7f1fb80008c0:0x7f1fb80008d0:0x7f1fbe9fb000 result 1048576:0 other free
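
For reference, two read-only sanlock commands that can help confirm whether the renewal failures line up with the I/O timeouts (run on the host holding the lockspace; just a sketch):

sanlock client status
sanlock client log_dump | tail -n 100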

and these vdsm errors too:

Thread-22::ERROR::2017-05-02 21:53:48,147::sdc::137::
Storage.StorageDomainCache::(_findDomain) looking for unfetched domain
f8f21d6c-2425-45c4-aded-4cb9b53ebd96
Thread-22::ERROR::2017-05-02 21:53:48,148::sdc::154::
Storage.StorageDomainCache::(_findUnfetchedDomain) looking for domain
f8f21d6c-2425-45c4-aded-4cb9b53ebd96

The engine instead is showing these errors:

2017-05-02 21:40:38,089 ERROR
[org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand]
(DefaultQuartzScheduler_Worker-96) Command SpmStatusVDSCommand(HostName = <
myhost.example.com>, HostId = dcc0275a-b011-4e33-bb95-366ffb0697b3,
storagePoolId = 715d1ba2-eabe-48db-9aea-c28c30359808) execution failed.
Exception: VDSErrorException: VDSGenericException: VDSErrorException:
Failed to SpmStatusVDS, error = (-202, 'Sanlock resource read failure',
'Sanlock exception'), code = 100
2017-05-02 21:41:08,431 ERROR
[org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand]
(DefaultQuartzScheduler_Worker-53) [6e0d5ebf] Failed in SpmStatusVDS method
2017-05-02 21:41:08,443 ERROR
[org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand]
(DefaultQuartzScheduler_Worker-53) [6e0d5ebf] Command
SpmStatusVDSCommand(HostName = , HostId =
7991933e-5f30-48cd-88bf-b0b525613384, storagePoolId =
4bd73239-22d0-4c44-ab8c-17adcd580309) execution failed. Exception:
VDSErrorException: VDSGenericException: VDSErrorException: Failed to
SpmStatusVDS, error = (-202, 'Sanlock resource read failure', 'Sanlock
exception'), code = 100
2017-05-02 21:41:31,975 ERROR
[org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand]
(DefaultQuartzScheduler_Worker-61) [2a54a1b2] Failed in SpmStatusVDS method
2017-05-02 21:41:31,987 ERROR
[org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand]
(DefaultQuartzScheduler_Worker-61) [2a54a1b2] Command
SpmStatusVDSCommand(HostName = , HostId =
dcc0275a-b011-4e33-bb95-366ffb0697b3, storagePoolId =
715d1ba2-eabe-48db-9aea-c28c30359808) execution failed. Exception:
VDSErrorException: VDSGenericException: VDSErrorException: Failed to
SpmStatusVDS, error = (-202, 'Sanlock resource read failure', 'Sanlock
exception'), code = 100


I'm using Fibre Channel or FCoE connectivity; the storage array's technical
support has analyzed it (along with the switch and OS configurations), but nothing
has been found.

Any advice?

Thanks


Installation info:

ovirt-release35-006-1.noarch
libgovirt-0.3.3-1.el7_2.1.x86_64
vdsm-4.16.30-0.el7.centos.x86_64
vdsm-xmlrpc-4.16.30-0.el7.centos.noarch
vdsm-yajsonrpc-4.16.30-0.el7.centos.noarch
vdsm-jsonrpc-4.16.30-0.el7.centos.noarch
vdsm-python-zombiereaper-4.16.30-0.el7.centos.noarch
vdsm-python-4.16.30-0.el7.centos.noarch
vdsm-cli-4.16.30-0.el7.centos.noarch
qemu-kvm-ev-2.3.0-29.1.el7.x86_64
qemu-kvm-common-ev-2.3.0-29.1.el7.x86_64
qemu-kvm-tools-ev-2.3.0-29.1.el7.x86_64
libvirt-client-1.2.17-13.el7_2.3.x86_64
libvirt-daemon-driver-storage-1.2.17-13.el7_2.3.x86_64
libvirt-python-1.2.17-2.el7.x86_64
libvirt-daemon-driver-nwfilter-1.2.17-13.el7_2.3.x86_64
libvirt-daemon-driver-nodedev-1.2.17-13.el7_2.3.x86_64
libvirt-lock-sanlock-1.2.17-13.el7_2.3.x86_64
libvirt-glib-0.1.9-1.el7.x86_64
libvirt-daemon-driver-network-1.2.17-13.el7_2.3.x86_64
libvirt-daemon-driver-lxc-1.2.17-13.el7_2.3.x86_64
libvirt-daemon-driver-interface-1.2.17-13.el7_2.3.x86_64
libvirt-1.2.17-13.el7_2.3.x86_64
libvirt-daemon-1.2.17-13.el7_2.3.x86_64
libvirt-daemon-config-network-1.2.17-13.el7_2.3.x86_64
libvirt-daemon-driver-secret-1.2.17-13.el7_2.3.x86_64
libvirt-daemon-config-nwfilter-1.2.17-13.el7_2.3.x86_64