Re: [ceph-users] CephFS | flapping OSD locked up NFS

2017-06-20 Thread John Spray
On Tue, Jun 20, 2017 at 11:13 AM, David  wrote:
> Hi John
>
> I've had nfs-ganesha testing on the to-do list for a while; I think I might
> move it closer to the top! I'll certainly report back with the results.
>
> I'd still be interested to hear any kernel NFS experiences/tips. My
> understanding is that NFS is included in the Ceph testing suite, so there
> is an expectation that people will want to use it.

It is indeed part of the automated tests, although the coverage (in the
"knfs" suite) is fairly light and does not do any thrashing to simulate
failures the way we do in the main CephFS tests.

John

>
> Thanks,
> David
>
>
> On 19 Jun 2017 3:56 p.m., "John Petrini"  wrote:
>>
>> Hi David,
>>
>> While I have no personal experience with this, from what I've been told,
>> if you're going to export CephFS over NFS it's recommended that you use a
>> userspace implementation of NFS (like nfs-ganesha) rather than
>> nfs-kernel-server. This may be the source of your issues and might be
>> worth testing. I'd be interested to hear the results if you do.
>>
>> ___
>>
>> John Petrini


Re: [ceph-users] CephFS | flapping OSD locked up NFS

2017-06-20 Thread David
Hi John

I've had nfs-ganesha testing on the to-do list for a while; I think I might
move it closer to the top! I'll certainly report back with the results.

I'd still be interested to hear any kernel NFS experiences/tips. My
understanding is that NFS is included in the Ceph testing suite, so there
is an expectation that people will want to use it.

Thanks,
David


On 19 Jun 2017 3:56 p.m., "John Petrini"  wrote:

> Hi David,
>
> While I have no personal experience with this, from what I've been told,
> if you're going to export CephFS over NFS it's recommended that you use a
> userspace implementation of NFS (like nfs-ganesha) rather than
> nfs-kernel-server. This may be the source of your issues and might be
> worth testing. I'd be interested to hear the results if you do.
>
> ___
>
> John Petrini


Re: [ceph-users] CephFS | flapping OSD locked up NFS

2017-06-19 Thread John Petrini
Hi David,

While I have no personal experience with this, from what I've been told, if
you're going to export CephFS over NFS it's recommended that you use a
userspace implementation of NFS (like nfs-ganesha) rather than
nfs-kernel-server. This may be the source of your issues and might be worth
testing. I'd be interested to hear the results if you do.
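
For reference, a minimal ganesha export for CephFS would look something
like the following. This is an untested sketch on my part; the Export_ID,
Path and Pseudo values are arbitrary placeholders:

    EXPORT {
        Export_ID = 1;              # any unique ID
        Path = /;                   # CephFS path to export
        Pseudo = /cephfs;           # NFSv4 pseudo-filesystem path
        Access_Type = RW;
        Squash = No_Root_Squash;
        FSAL {
            Name = CEPH;            # use libcephfs rather than the kernel client
        }
    }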

___

John Petrini


[ceph-users] CephFS | flapping OSD locked up NFS

2017-06-19 Thread David
Hi All

We had a faulty OSD that was going up and down for a few hours until Ceph
marked it out. During this time CephFS was accessible; however, for about
10 minutes all NFS processes (kernel NFSv3) on a server exporting CephFS
were hung, locking up all the NFS clients. The cluster was healthy before
the OSD became faulty. I'm trying to understand whether this is expected
behaviour, a bug, or something else. Any insights would be appreciated.
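
In case it's useful to anyone else, I suspect the impact window could have
been shortened by taking the OSD out of the cluster by hand rather than
waiting for it to be marked out automatically. A sketch (the OSD id is a
placeholder, and I haven't verified this clears the NFS hang):

    # Mark the flapping OSD out immediately instead of waiting for
    # mon_osd_down_out_interval to expire:
    ceph osd out 12

    # Or stop the daemon altogether so it can't keep rejoining:
    systemctl stop ceph-osd@12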

MDS: active/passive
Jewel 10.2.2
Ceph client (kernel): 3.10.0-514.6.1.el7.x86_64
CephFS mount options: (rw,relatime,name=admin,secret=,acl)
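
For context, that corresponds to a kernel mount along these lines (the
monitor address and paths are placeholders):

    mount -t ceph 10.0.0.1:6789:/ /mnt/cephfs \
        -o name=admin,secretfile=/etc/ceph/admin.secret,acl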

I can see some slow requests in the MDS log during the time the NFS
processes were hung, some for setattr calls:

2017-06-15 04:29:37.081175 7f889401f700  0 log_channel(cluster) log [WRN] :
slow request 60.974528 seconds old, received at 2017-06-15 04:28:36.106598:
client_request(client.2622511:116375892 setattr size=0 #100025b3554
2017-06-15 04:28:36.104928) currently acquired locks

and some for getattr:

2017-06-15 04:29:42.081224 7f889401f700  0 log_channel(cluster) log [WRN] :
slow request 32.225883 seconds old, received at 2017-06-15 04:29:09.855302:
client_request(client.2622511:116380541 getattr pAsLsXsFs #100025b4d37
2017-06-15 04:29:09.853772) currently failed to rdlock, waiting

And a "client not responding to mclientcaps revoke" warning:

2017-06-15 04:31:12.084561 7f889401f700  0 log_channel(cluster) log [WRN] :
client.2344872 isn't responding to mclientcaps(revoke), ino 100025b4d37
pending pAsxLsXsxFcb issued pAsxLsXsxFsxcrwb, sent 122.229172 seconds ago

These issues seemed to clear once the faulty OSD was marked out.
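
For anyone wanting to dig into requests like these while they are still
stuck, the MDS admin socket should show them, something like the following
(the MDS name is a placeholder):

    # List requests currently in flight in the MDS, with their flag points:
    ceph daemon mds.mds1 dump_ops_in_flight

    # Map the client.xxxxxxx IDs from the log to actual sessions/hosts:
    ceph daemon mds.mds1 session ls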

In general I have noticed that the NFS processes exporting CephFS seem to
spend a lot of time in 'D' state, with WCHAN as 'lock_page', compared with
an NFS server exporting a local file system. Also, NFS performance hasn't
been great with small reads/writes, particularly writes with the default
sync export option, so I've had to export with async for the time being. I
haven't had a chance to troubleshoot this in any depth yet; just mentioning
it in case it's relevant.
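
For what it's worth, I've been checking for the stuck threads with
something like the below; the exports line is illustrative (paths and the
other options are placeholders, only sync/async is the point):

    # Show processes in uninterruptible sleep ('D') and what they block on:
    ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /D/'

    # /etc/exports -- async masks the write latency, at the cost of
    # potentially losing acknowledged writes if the server crashes:
    /mnt/cephfs  *(rw,async,no_root_squash)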

Thanks,
David