Re: [ceph-users] Faulting MDS clients, HEALTH_OK

2016-09-23 Thread Gregory Farnum
On Wed, Sep 21, 2016 at 6:24 PM, Heller, Chris wrote: > What is the interesting value in ‘session ls’? Is it ‘num_leases’ or > ‘num_caps’? leases appears to be, on average, 1. But caps seems to be 16385 > for many many clients! Yeah, it's the num_caps. Interestingly, the "client cache size" defa
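
For reference, a minimal way to pull those per-client numbers yourself, assuming the admin socket is at its default location and that the active MDS daemon id is a154 (substitute your own):

    # run on the host carrying the active MDS; the daemon name is an assumption
    ceph daemon mds.a154 session ls
    # each session entry carries num_caps and num_leases, so a quick scan
    # for heavy cap holders is just:
    ceph daemon mds.a154 session ls | grep num_caps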

Re: [ceph-users] Faulting MDS clients, HEALTH_OK

2016-09-21 Thread Heller, Chris
So just to put more info out there, here is what I’m seeing with a Spark/HDFS client: 2016-09-21 20:09:25.076595 7fd61c16f700 0 -- 192.168.1.157:0/634334964 >> 192.168.1.190:6802/32183 pipe(0x7fd5fcef8ca0 sd=66 :53864 s=2 pgs=50445 cs=1 l=0 c=0x7fd5fdd371d0).fault, initiating reconnect 2016-09

Re: [ceph-users] Faulting MDS clients, HEALTH_OK

2016-09-21 Thread Heller, Chris
I also went and bumped mds_cache_size up to 1 million… still seeing cache pressure, but I might just need to evict those clients… On 9/21/16, 9:24 PM, "Heller, Chris" wrote: What is the interesting value in ‘session ls’? Is it ‘num_leases’ or ‘num_caps’? leases appears to be, on average, 1.
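
A sketch of how the cache limit can be raised and a pressured client evicted; the daemon name mds.a154 and the session id 4305 are placeholders, and the 'session evict' admin socket command may not be present on every release, so check yours first:

    # raise the cache limit at runtime via the admin socket on the MDS host
    ceph daemon mds.a154 config set mds_cache_size 1000000
    # make it persistent in ceph.conf under [mds]:
    #   mds cache size = 1000000
    # evict a client that will not release caps, using the id from 'session ls'
    ceph daemon mds.a154 session evict 4305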

Re: [ceph-users] Faulting MDS clients, HEALTH_OK

2016-09-21 Thread Heller, Chris
What is the interesting value in ‘session ls’? Is it ‘num_leases’ or ‘num_caps’? leases appears to be, on average, 1. But caps seems to be 16385 for many many clients! -Chris On 9/21/16, 9:22 PM, "Gregory Farnum" wrote: On Wed, Sep 21, 2016 at 6:13 PM, Heller, Chris wrote: > I’m suspe

Re: [ceph-users] Faulting MDS clients, HEALTH_OK

2016-09-21 Thread Gregory Farnum
On Wed, Sep 21, 2016 at 6:13 PM, Heller, Chris wrote: > I’m suspecting something similar, we have millions of files and can read a > huge subset of them at a time, presently the client is Spark 1.5.2 which I > suspect is leaving the closing of file descriptors up to the garbage > collector. Tha

Re: [ceph-users] Faulting MDS clients, HEALTH_OK

2016-09-21 Thread Heller, Chris
I’m suspecting something similar, we have millions of files and can read a huge subset of them at a time, presently the client is Spark 1.5.2 which I suspect is leaving the closing of file descriptors up to the garbage collector. That said, I’d like to know if I could verify this theory using th
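
One way to sanity-check that theory from the client side, assuming you can locate the Spark executor's PID (12345 below is a placeholder, as is the MDS daemon name):

    # file descriptors the JVM is currently holding open
    ls /proc/12345/fd | wc -l
    # compare against the caps the MDS believes that client holds
    ceph daemon mds.a154 session ls | grep num_caps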

Re: [ceph-users] Faulting MDS clients, HEALTH_OK

2016-09-21 Thread Gregory Farnum
On Wed, Sep 21, 2016 at 1:16 PM, Heller, Chris wrote: > Ok. I just ran into this issue again. The mds rolled after many clients were > failing to relieve cache pressure. That definitely could have had something to do with it, if say they overloaded the MDS so much it got stuck in a directory rea

Re: [ceph-users] Faulting MDS clients, HEALTH_OK

2016-09-21 Thread Heller, Chris
Ok. I just ran into this issue again. The mds rolled after many clients were failing to relieve cache pressure. Now here is the result of `ceph -s` # ceph -s cluster b126570e-9e7c-0bb2-991f-ecf9abe3afa0 health HEALTH_OK monmap e1: 5 mons at {a154=192.168.1.154:6789/0,a155=192.168.

Re: [ceph-users] Faulting MDS clients, HEALTH_OK

2016-09-21 Thread Heller, Chris
Perhaps related, I was watching the active mds with debug_mds set to 5/5, when I saw this in the log: 2016-09-21 15:13:26.067698 7fbaec248700 0 -- 192.168.1.196:6802/13581 >> 192.168.1.238:0/3488321578 pipe(0x55db000 sd=49 :6802 s=2 pgs=2 cs=1 l=0 c=0x5631ce0).fault with nothing to send, going
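
For completeness, the debug level can be changed on the fly rather than by editing ceph.conf and restarting; mds.a154 is again a placeholder for the active daemon:

    # via the admin socket on the MDS host
    ceph daemon mds.a154 config set debug_mds 5/5
    # drop it back to the default once the log capture is done
    ceph daemon mds.a154 config set debug_mds 1/5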

Re: [ceph-users] Faulting MDS clients, HEALTH_OK

2016-09-21 Thread Heller, Chris
I’ll see if I can capture the output the next time this issue arises, but in general the output looks as if nothing is wrong. No OSDs are down, ‘ceph health detail’ reports HEALTH_OK, the mds server is in the up:active state; in general it’s as if nothing is wrong server side (at least from
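
A short checklist worth running the moment the clients start faulting, so the server-side picture at that instant gets preserved (daemon name assumed):

    ceph -s
    ceph health detail
    ceph mds stat                     # confirm the MDS is still up:active
    ceph daemon mds.a154 perf dump    # MDS counters, including cache stats
    ceph daemon mds.a154 session ls   # per-client caps at the time of the fault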

Re: [ceph-users] Faulting MDS clients, HEALTH_OK

2016-09-21 Thread Gregory Farnum
On Wed, Sep 21, 2016 at 6:30 AM, Heller, Chris wrote: > I’m running a production 0.94.7 Ceph cluster, and have been seeing a > periodic issue arise wherein all my MDS clients will become stuck, and the > fix so far has been to restart the active MDS (sometimes I need to restart > the subsequent a
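
On a 0.94.x (Hammer) deployment, seeing which daemon is active and bouncing it looks roughly like the following; the init-system invocation depends on how the daemons were deployed, so treat it as a sketch with the id a154 assumed:

    ceph mds stat                        # shows the active MDS and any standbys
    # sysvinit-style restart on the MDS host
    sudo /etc/init.d/ceph restart mds.a154
    # or, on upstart-based installs
    sudo restart ceph-mds id=a154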