Re: [ceph-users] CephFS | flapping OSD locked up NFS
On Tue, Jun 20, 2017 at 11:13 AM, David wrote:
> Hi John
>
> I've had nfs-ganesha testing on the to-do list for a while, I think I might
> move it closer to the top! I'll certainly report back with the results.
>
> I'd still be interested to hear any kernel NFS experiences/tips. My
> understanding is NFS is included in the Ceph testing suite, so there is an
> expectation people will want to use it.

It is indeed part of the automated tests, although the coverage (in the
"knfs" suite) is fairly light, and does not do any thrashing to simulate
failures the way we do in the main cephfs tests.

John

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] CephFS | flapping OSD locked up NFS
Hi John

I've had nfs-ganesha testing on the to-do list for a while, I think I might
move it closer to the top! I'll certainly report back with the results.

I'd still be interested to hear any kernel NFS experiences/tips. My
understanding is NFS is included in the Ceph testing suite, so there is an
expectation people will want to use it.

Thanks,
David

On 19 Jun 2017 3:56 p.m., "John Petrini" wrote:
> Hi David,
>
> While I have no personal experience with this, from what I've been told,
> if you're going to export CephFS over NFS it's recommended that you use a
> userspace implementation of NFS (like nfs-ganesha) rather than
> nfs-kernel-server. This may be the source of your issues and might be
> worth testing. I'd be interested to hear the results if you do.
>
> John Petrini
Re: [ceph-users] CephFS | flapping OSD locked up NFS
Hi David,

While I have no personal experience with this, from what I've been told, if
you're going to export CephFS over NFS it's recommended that you use a
userspace implementation of NFS (like nfs-ganesha) rather than
nfs-kernel-server. This may be the source of your issues and might be worth
testing. I'd be interested to hear the results if you do.

___

John Petrini
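For anyone trying the userspace route, a minimal nfs-ganesha export of CephFS
via the Ceph FSAL looks roughly like the sketch below. This is illustrative
only: the export path, pseudo path, and cephx user are placeholders, and
option names can vary between ganesha versions.

```
# /etc/ganesha/ganesha.conf -- illustrative sketch, not a tested config
EXPORT
{
    Export_ID = 1;
    Path = "/";              # CephFS path to export (placeholder)
    Pseudo = "/cephfs";      # NFSv4 pseudo path seen by clients (placeholder)
    Access_Type = RW;
    Squash = No_Root_Squash;

    FSAL {
        Name = CEPH;         # userspace Ceph FSAL (libcephfs), not knfsd
        User_Id = "admin";   # cephx user, placeholder
    }
}
```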
[ceph-users] CephFS | flapping OSD locked up NFS
Hi All

We had a faulty OSD that was going up and down for a few hours until Ceph
marked it out. During this time CephFS was accessible; however, for about 10
minutes all NFS processes (kernel NFSv3) on a server exporting CephFS were
hung, locking up all the NFS clients. The cluster was healthy before the
faulty OSD. I'm trying to understand if this is expected behaviour, a bug or
something else. Any insights would be appreciated.

MDS: active/passive
Ceph: Jewel 10.2.2
Ceph client: 3.10.0-514.6.1.el7.x86_64
Cephfs mount: (rw,relatime,name=admin,secret=,acl)

I can see some slow requests in the MDS log during the time the NFS processes
were hung, some for setattr calls:

2017-06-15 04:29:37.081175 7f889401f700 0 log_channel(cluster) log [WRN] : slow request 60.974528 seconds old, received at 2017-06-15 04:28:36.106598: client_request(client.2622511:116375892 setattr size=0 #100025b3554 2017-06-15 04:28:36.104928) currently acquired locks

and some for getattr:

2017-06-15 04:29:42.081224 7f889401f700 0 log_channel(cluster) log [WRN] : slow request 32.225883 seconds old, received at 2017-06-15 04:29:09.855302: client_request(client.2622511:116380541 getattr pAsLsXsFs #100025b4d37 2017-06-15 04:29:09.853772) currently failed to rdlock, waiting

And a "client not responding to mclientcaps(revoke)" warning:

2017-06-15 04:31:12.084561 7f889401f700 0 log_channel(cluster) log [WRN] : client.2344872 isn't responding to mclientcaps(revoke), ino 100025b4d37 pending pAsxLsXsxFcb issued pAsxLsXsxFsxcrwb, sent 122.229172 seconds ago

These issues seemed to have cleared once the faulty OSD was marked out.

In general I have noticed the NFS processes exporting CephFS do seem to spend
a lot of time in 'D' state, with WCHAN as 'lock_page', compared with an NFS
server exporting a local file system. Also, NFS performance hasn't been great
with small reads/writes, particularly writes with the default sync export
option; I've had to export with async for the time being.
I haven't had a chance to troubleshoot this in any depth yet, just mentioning
it in case it's relevant.

Thanks,
David
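For what it's worth, a quick way to spot the behaviour described above is to
list NFS server threads stuck in uninterruptible sleep together with the
kernel function they are blocked in. This is a hypothetical diagnostic
sketch; the `nfsd` name match and column width may need adjusting for your
distro.

```shell
# List nfsd threads in uninterruptible sleep ('D') and the kernel
# function (WCHAN) they are blocked in, e.g. lock_page.
ps -eo pid,stat,wchan:32,comm --no-headers \
  | awk '$2 ~ /^D/ && $4 ~ /nfsd/ {print $1, $2, $3}'
```

Running this in a loop during a hang would show whether the threads stay
parked in lock_page for the whole incident or cycle through other waits.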