On Wed, Sep 21, 2016 at 6:24 PM, Heller, Chris wrote:
> What is the interesting value in ‘session ls’? Is it ‘num_leases’ or
> ‘num_caps’? Leases appear to be, on average, 1. But caps seems to be 16385
> for many, many clients!
Yeah, it's the num_caps.
Interestingly, the "client cache size" defaults to 16384, which is right where
those num_caps values are sitting.
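For anyone digging through the archives later: a rough way to pull those numbers
off the active MDS, assuming its admin socket is reachable; the daemon name
"mds.a" below is just a placeholder:

ceph daemon mds.a session ls   # per-session num_caps / num_leases
ceph daemon mds.a perf dump    # MDS-wide inode and cap counters

Both are read-only, so they should be safe to run while the problem is live.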
So just to put more info out there, here is what I’m seeing with a Spark/HDFS
client:
2016-09-21 20:09:25.076595 7fd61c16f700 0 -- 192.168.1.157:0/634334964 >>
192.168.1.190:6802/32183 pipe(0x7fd5fcef8ca0 sd=66 :53864 s=2 pgs=50445 cs=1
l=0 c=0x7fd5fdd371d0).fault, initiating reconnect
2016-09…
I also went and bumped mds_cache_size up to 1 million… still seeing cache
pressure, but I might just need to evict those clients…
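For reference, a sketch of those two steps with placeholder names (the session
id would come from ‘session ls’; I'm not certain the evict subcommand exists on
the admin socket as far back as 0.94, so treat that line as an assumption):

ceph tell mds.a injectargs '--mds_cache_size 1000000'   # runtime bump; mirror it in ceph.conf under [mds]
ceph daemon mds.a session evict 4305                     # drop a client session that refuses to release caps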
On 9/21/16, 9:24 PM, "Heller, Chris" wrote:
What is the interesting value in ‘session ls’? Is it ‘num_leases’ or ‘num_caps’?
Leases appear to be, on average, 1. But caps seems to be 16385 for many, many
clients!
-Chris
On 9/21/16, 9:22 PM, "Gregory Farnum" wrote:
On Wed, Sep 21, 2016 at 6:13 PM, Heller, Chris wrote:
I’m suspecting something similar. We have millions of files and can read a huge
subset of them at a time; presently the client is Spark 1.5.2, which I suspect
is leaving the closing of file descriptors up to the garbage collector. That
said, I’d like to know if I could verify this theory using the …
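One rough way to check that theory from the client side, assuming you can see
the Spark executor processes on the host (the pgrep pattern below is only
illustrative):

for pid in $(pgrep -f 'spark.*executor'); do
  echo "$pid: $(ls /proc/$pid/fd | wc -l) open fds"
done

If those counts keep climbing across jobs and roughly track the num_caps the MDS
reports for that client, the descriptors-left-to-the-GC theory looks likely.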
On Wed, Sep 21, 2016 at 1:16 PM, Heller, Chris wrote:
> Ok. I just ran into this issue again. The mds rolled after many clients were
> failing to relieve cache pressure.
That definitely could have had something to do with it, if, say, they
overloaded the MDS so much it got stuck in a directory read…
Ok. I just ran into this issue again. The mds rolled after many clients were
failing to relieve cache pressure.
Now here is the result of `ceph -s`:
# ceph -s
cluster b126570e-9e7c-0bb2-991f-ecf9abe3afa0
health HEALTH_OK
monmap e1: 5 mons at
{a154=192.168.1.154:6789/0,a155=192.168.…
Perhaps related, I was watching the active mds with debug_mds set to 5/5, when
I saw this in the log:
2016-09-21 15:13:26.067698 7fbaec248700 0 -- 192.168.1.196:6802/13581 >>
192.168.1.238:0/3488321578 pipe(0x55db000 sd=49 :6802 s=2 pgs=2 cs=1 l=0
c=0x5631ce0).fault with nothing to send, going to standby
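In case anyone wants to reproduce that: bumping the MDS log level on a running
daemon can be done roughly like this, with "mds.a" again standing in for the
real daemon name:

ceph tell mds.a injectargs '--debug_mds 5/5'
ceph daemon mds.a config set debug_mds 5/5   # equivalent, via the admin socket on the MDS host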
I’ll see if I can capture the output the next time this issue arises, but in
general the output looks as if nothing is wrong. No OSDs are down, ‘ceph
health detail’ reports HEALTH_OK, the MDS server is in the up:active state;
in general it’s as if nothing is wrong server side (at least from …
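For reference, the server-side checks described there boil down to roughly the
following, all read-only ("mds.a" is again a placeholder):

ceph health detail   # overall cluster health with per-item detail
ceph osd stat        # quick confirmation that no OSDs are down or out
ceph mds stat        # the active MDS should show up:active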
On Wed, Sep 21, 2016 at 6:30 AM, Heller, Chris wrote:
> I’m running a production 0.94.7 Ceph cluster, and have been seeing a
> periodic issue arise wherein all my MDS clients will become stuck, and the
> fix so far has been to restart the active MDS (sometimes I need to restart
> the subsequent active …