Re: [ovirt-users] Found some bugs with NFS.

2018-01-23 Thread Sergey Kulikov



I'll post the second part there.
Unfortunately I can't use Fedora as an oVirt node (unsupported), and the share only hangs after some time.
I'm trying to find out what type of I/O leads to this hang; I'll try other OSes if I can find out what to try.
But the first part is directly related to oVirt, I think.
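
(One way to narrow this down, as a rough sketch only, is to keep repeating the same kind of O_DIRECT read that vdsm's liveness check performs and watch when it stops returning; the path below is a placeholder for a file on the affected export:)

# Repeat the O_DIRECT read vdsm's monitor does for its liveness check.
# If the NFS client wedges, the dd ends up in D state and the loop simply
# stops printing, which is the signal that the hang reproduced.
FILE=/rhev/data-center/mnt/<export>/<sd_uuid>/dom_md/metadata   # placeholder
while true; do
    dd if="$FILE" of=/dev/null bs=4096 count=1 iflag=direct && echo "$(date): ok"
    sleep 10
done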
 

-- 



 Tuesday, January 23, 2018, 21:59:12:







On Tue, Jan 23, 2018 at 6:47 PM, Sergey Kulikov <ser...@msm.ru> wrote:

Or maybe somebody can point me to the right place for submitting this?
Thanks. :)

CentOS has a bug tracker [1], but I think it's worthwhile understanding if it is reproducible with another OS, Fedora for example.
Y.

[1] https://bugs.centos.org/main_page.php 

Re: [ovirt-users] Found some bugs with NFS.

2018-01-23 Thread Yaniv Kaul
On Tue, Jan 23, 2018 at 6:47 PM, Sergey Kulikov  wrote:

>
> Or maybe somebody can point me to the right place for submitting this?
> Thanks. :)
>
CentOS has a bug tracker [1], but I think it's worthwhile understanding if it is reproducible with another OS, Fedora for example.
Y.

[1] https://bugs.centos.org/main_page.php


Re: [ovirt-users] Found some bugs with NFS.

2018-01-23 Thread Sergey Kulikov

Or maybe somebody can point me to the right place for submitting this?
Thanks. :)


[ovirt-users] Found some bugs with NFS.

2018-01-22 Thread Sergey Kulikov
This is a test environment running CentOS 7.4, oVirt 4.2.0, kernel 3.10.0-693.11.6.el7.x86_64 (3.10.0-693.11.1 and 3.10.0-693 show the same bugs).


1. Can't force NFS to 4.0.
Some time ago I set the NFS version for all storage domains to V4, because there was a bug with NetApp Data ONTAP 8.x
and RHEL when using NFS 4.1 (NFS mounts started to hang after a while, STATEID problems). On CentOS 7.2 and 7.3 the V4
option mounted NFS as 4.0, so there were no NFS-related problems. After CentOS 7.4 was released I noticed that mount
points started to hang again: NFS was now mounted with vers=4.1, and it's not possible to change back to 4.0, because
both the "V4" and "V4.1" options mount as 4.1. It looks like the V4 option uses the system default minor version for
4.x, which as far as I know changed in CentOS 7.4 from 4.0 to 4.1. Maybe a "4.0" option should be added to force
version 4.0, because adding vers=/nfsvers= in "Additional mount options" is denied by oVirt.
I know I can disable 4.1 on the NetApp side, but there may be situations where the storage is out of my control, and
version 4.0 can't be set on the oVirt side.
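
(Just to illustrate, this is roughly how the negotiated minor version can be checked on a host, and how a 4.0 mount can still be forced by hand outside of oVirt; the export path and mount point below are placeholders:)

# Show the options the storage domain mounts actually negotiated
# (look for vers=4.0 vs vers=4.1):
nfsstat -m
grep ' nfs4 ' /proc/mounts

# Outside of oVirt the minor version can still be forced manually, which is
# exactly what "Additional mount options" refuses to accept (placeholders):
mount -t nfs -o vers=4.0 10.xx.xx.xx:/<export> /mnt/test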

2. This bug isn't directly related to oVirt, but it affects it.
I'm not really sure this is the right place to report it.
As I said before, there was a bug with NFS 4.1, NetApp Data ONTAP 8 and RHEL 7.x, but it was fixed in ONTAP 9.x.
Now we have ONTAP 9.x on the NetApp, and it brought new bugs with RHEL 7.4 :D
After updating to CentOS 7.4, NFS domains in oVirt started to hang/lock again. This happens randomly, on random hosts:
after a few days of uptime the entire data center goes offline, hosts down, storage domains down, some VMs in Up and
some in unknown state, but the VMs are actually still working and HostedEngine is also working; I just can't control
the environment.
There are many hanging ioprocess (>1300) and vdsmd threads (>1300) on some hosts, and there are also some dd commands,
used to check the storage, hanging:
├─vdsmd─┬─2*[dd]
│       ├─1304*[ioprocess───{ioprocess}]
│       ├─12*[ioprocess───4*[{ioprocess}]]
│       └─1365*[{vdsmd}]
vdsm     19470  0.0  0.0   4360   348 ?        D<   Jan21   0:00 /usr/bin/dd if=/rhev/data-center/mnt/10.xx.xx.xx:_test__nfs__sas_iso/6cd147b4-8039-4f8a-8aa7-5fd54d81/dom_md/metadata of=/dev/null bs=4096 count=1 iflag=direct
vdsm     40707  0.0  0.0   4360   348 ?        D<   00:44   0:00 /usr/bin/dd if=/rhev/data-center/mnt/10.xx.xx.xx:_test__nfs__sas_export/58d9e2c2-8fef-4abc-be13-a273d6af320f/dom_md/metadata of=/dev/null bs=4096 count=1 iflag=direct

vdsm is hanging at 100% CPU load.
If I try to ls these files, ls hangs too.
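
(In case it helps anyone hitting the same thing, the stuck processes can be confirmed to be in uninterruptible sleep inside the NFS client like this; the PID is just the one from the ps output above:)

# Processes stuck in uninterruptible sleep (state D) and their kernel wait channel:
ps axo pid,stat,wchan:32,comm | awk '$2 ~ /D/'
# Kernel stack of one of the stuck dd processes (needs root):
cat /proc/19470/stack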

I've made some dumps of the traffic, and it looks like a problem with STATEID. I've found 2 issues on the Red Hat web
site, but they aren't publicly available, so I can't read the solution:
https://access.redhat.com/solutions/3214331   (in my case I have a TEST_STATEID)
https://access.redhat.com/solutions/3164451   (in my case there is no manager thread)
But it looks like I have yet another issue with stateid.
According to the dumps my hosts are sending: TEST_STATEID
The NetApp reply is: Status: NFS4ERR_BAD_STATEID (10025)
After this the host sends: Network File System, Ops(5): SEQUENCE, PUTFH, OPEN, ACCESS, GETATTR
Reply:   V4 Reply (Call In 17) OPEN StateID: 0xa205
Request: V4 Call (Reply In 22) READ StateID: 0xca5f Offset: 0 Len: 4096
Reply:   V4 Reply (Call In 19) READ Status: NFS4ERR_BAD_STATEID


Entire conversation looks like:
No.  Time       Source        Destination   Protocol  Length  Info
  1  0.00       10._host_     10._netapp_   NFS       238     V4 Call (Reply In 2) TEST_STATEID
  2  0.000251   10._netapp_   10._host_     NFS       170     V4 Reply (Call In 1) TEST_STATEID (here is Status: NFS4ERR_BAD_STATEID (10025))
  3  0.000352   10._host_     10._netapp_   NFS       338     V4 Call (Reply In 4) OPEN DH: 0xa2c3ad28/
  4  0.000857   10._netapp_   10._host_     NFS       394     V4 Reply (Call In 3) OPEN StateID: 0xa205
  5  0.000934   10._host_     10._netapp_   NFS       302     V4 Call (Reply In 8) READ StateID: 0xca5f Offset: 0 Len: 4096
  6  0.000964   10._host_     10._netapp_   NFS       302     V4 Call (Reply In 9) READ StateID: 0xca5f Offset: 0 Len: 4096
  7  0.001133   10._netapp_   10._host_     TCP       70      2049 → 683 [ACK] Seq=425 Ack=901 Win=10240 Len=0 TSval=225608100 TSecr=302215289
  8  0.001258   10._netapp_   10._host_     NFS       170     V4 Reply (Call In 5) READ Status: NFS4ERR_BAD_STATEID
  9  0.001320   10._netapp_   10._host_     NFS       170     V4 Reply (Call In 6) READ Status: NFS4ERR_BAD_STATEID
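
(For anyone who wants to take an equivalent dump, something along these lines should do; interface name and filer address are placeholders, and the pcap can then be inspected in Wireshark or tshark for the TEST_STATEID / NFS4ERR_BAD_STATEID exchange:)

# Capture NFSv4 traffic between this host and the filer (placeholders):
tcpdump -i em1 -s 0 -w /tmp/nfs-hang.pcap host 10.xx.xx.xx and port 2049
# Quick look at the NFS conversation from the capture:
tshark -r /tmp/nfs-hang.pcap -Y nfs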

Sometimes clearing the locks on the NetApp (vserver locks break) and killing the dd/ioprocess processes helps for a while.
Right now I have my test setup in this state. It looks like the lock problem is always with the metadata/disk check,
not with the domain itself: I can read and write other files in this mountpoint from the same host.
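
(For the record, the filer-side cleanup I mean is roughly the following; the parameter names are from memory and may differ between ONTAP releases, and the SVM/volume/path values are placeholders, so treat this only as a sketch. "vserver locks break" needs advanced privilege on the cluster shell:)

# Hedged sketch, run against the cluster management LIF (placeholders,
# parameter names from memory - check them on your ONTAP release):
ssh admin@<netapp_cluster_mgmt> "vserver locks show -vserver <svm> -client-address 10._host_"
# then, from an advanced-privilege cluster shell session:
#   set -privilege advanced
#   vserver locks break -vserver <svm> -volume <volume> -path <path_to_locked_file>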