Re: [ceph-users] Faulting MDS clients, HEALTH_OK

2016-09-23 Thread Gregory Farnum
On Wed, Sep 21, 2016 at 6:24 PM, Heller, Chris  wrote:
> What is the interesting value in ‘session ls’? Is it ‘num_leases’ or
> ‘num_caps’? Leases appear to be, on average, 1, but caps seem to be 16385
> for many, many clients!

Yeah, it's the num_caps.
Interestingly, the "client cache size" defaults to 16384, which makes
me think maybe it's just not releasing caps when requested -- I think
we don't really test that piece in library form much. Can you set one
of the Hadoop Ceph clients to "debug client = 20", capture a log, and
upload it with ceph-post-file?
-Greg
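
A minimal sketch of what that client-side setup might look like, assuming the
Hadoop/Spark workers read /etc/ceph/ceph.conf and that the log path below is
writable by the worker process (the path and entity name are illustrative, not
taken from this thread):

[client]
    # verbose libcephfs client logging; write it somewhere the worker can reach
    debug client = 20
    log file = /var/log/ceph/client.$name.$pid.log

Then, after reproducing a fault:

# ceph-post-file -d "cephfs-hadoop client debug log" /var/log/ceph/client.hdfs.user.12345.log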
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Faulting MDS clients, HEALTH_OK

2016-09-21 Thread Heller, Chris
So just to put more info out there, here is what I’m seeing with a Spark/HDFS 
client:

2016-09-21 20:09:25.076595 7fd61c16f700  0 -- 192.168.1.157:0/634334964 >> 
192.168.1.190:6802/32183 pipe(0x7fd5fcef8ca0 sd=66 :53864 s=2 pgs=50445 cs=1 
l=0 c=0x7fd5fdd371d0).fault, initiating reconnect
2016-09-21 20:09:25.077328 7fd60c579700  0 -- 192.168.1.157:0/634334964 >> 
192.168.1.190:6802/32183 pipe(0x7fd5fcef8ca0 sd=66 :53994 s=1 pgs=50445 cs=2 
l=0 c=0x7fd5fdd371d0).connect got RESETSESSION
2016-09-21 20:09:25.077429 7fd60fd80700  0 client.585194220 
ms_handle_remote_reset on 192.168.1.190:6802/32183
2016-09-21 20:20:55.990686 7fd61c16f700  0 -- 192.168.1.157:0/634334964 >> 
192.168.1.190:6802/32183 pipe(0x7fd5fcef8ca0 sd=66 :53994 s=2 pgs=50630 cs=1 
l=0 c=0x7fd5fdd371d0).fault, initiating reconnect
2016-09-21 20:20:55.990890 7fd60c579700  0 -- 192.168.1.157:0/634334964 >> 
192.168.1.190:6802/32183 pipe(0x7fd5fcef8ca0 sd=66 :53994 s=1 pgs=50630 cs=2 
l=0 c=0x7fd5fdd371d0).fault
2016-09-21 20:21:09.385228 7fd60c579700  0 -- 192.168.1.157:0/634334964 >> 
192.168.1.154:6800/17142 pipe(0x7fd6401e8160 sd=184 :39160 s=1 pgs=0 cs=0 l=0 
c=0x7fd6400433c0).fault

And here is its session info from ‘session ls’:

{
    "id": 585194220,
    "num_leases": 0,
    "num_caps": 16385,
    "state": "open",
    "replay_requests": 0,
    "reconnecting": false,
    "inst": "client.585194220 192.168.1.157:0\/634334964",
    "client_metadata": {
        "ceph_sha1": "d56bdf93ced6b80b07397d57e3fa68fe68304432",
        "ceph_version": "ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)",
        "entity_id": "hdfs.user",
        "hostname": "a192-168-1-157.d.a.com"
    }
},

-Chris

On 9/21/16, 9:27 PM, "Heller, Chris"  wrote:

I also went and bumped mds_cache_size up to 1 million… still seeing cache 
pressure, but I might just need to evict those clients…

On 9/21/16, 9:24 PM, "Heller, Chris"  wrote:

What is the interesting value in ‘session ls’? Is it ‘num_leases’ or
‘num_caps’? Leases appear to be, on average, 1, but caps seem to be 16385 for
many, many clients!

-Chris

On 9/21/16, 9:22 PM, "Gregory Farnum"  wrote:

On Wed, Sep 21, 2016 at 6:13 PM, Heller, Chris  
wrote:
> I’m suspecting something similar. We have millions of files and
can read a huge subset of them at a time; presently the client is Spark 1.5.2,
which I suspect is leaving the closing of file descriptors up to the garbage
collector. That said, I’d like to know if I could verify this theory using the
Ceph tools. I’ll try upping “mds cache size”; are there any other configuration
settings I might adjust to perhaps ease the problem while I track it down in
the HDFS tools layer?

That's the big one. You can also go through the admin socket 
commands
for things like "session ls" that will tell you how many files the
client is holding on to and compare.

>
> -Chris
>
> On 9/21/16, 4:34 PM, "Gregory Farnum"  wrote:
>
> On Wed, Sep 21, 2016 at 1:16 PM, Heller, Chris 
 wrote:
> > Ok. I just ran into this issue again. The mds rolled after 
many clients were failing to relieve cache pressure.
>
> That definitely could have had something to do with it, if 
say they
> overloaded the MDS so much it got stuck in a directory read 
loop.
> ...actually now I come to think of it, I think there was some 
problem
> with Hadoop not being nice about closing files and so forcing 
clients
> to keep them pinned, which will make the MDS pretty unhappy 
if they're
> holding more than it's configured for.
>
> >
> > Now here is the result of `ceph -s`
> >
> > # ceph -s
> > cluster b126570e-9e7c-0bb2-991f-ecf9abe3afa0
> >  health HEALTH_OK
> >  monmap e1: 5 mons at 
{a154=192.168.1.154:6789/0,a155=192.168.1.155:6789/0,a189=192.168.1.189:6789/0,a190=192.168.1.190:6789/0,a191=192.168.1.191:6789/0}
> > election epoch 130, quorum 0,1,2,3,4 
a154,a155,a189,a190,a191
> >  mdsmap e18676: 1/1/1 up {0=a190=up:active}, 1 
up:standby-replay, 3 up:standby
> >  osdmap e118886: 192 osds: 192 up, 192 in
> >   pgmap v13706298: 11328 pgs, 5 pools, 22704 GB data, 
63571 kobjects
> > 69601 GB used, 37656 GB / 104 TB avail
> >11309 active+clean
> >   13 

Re: [ceph-users] Faulting MDS clients, HEALTH_OK

2016-09-21 Thread Heller, Chris
I also went and bumped mds_cache_size up to 1 million… still seeing cache 
pressure, but I might just need to evict those clients…
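
A hedged sketch of how that eviction might be done, assuming the hammer MDS
admin socket exposes a "session evict" command alongside "session ls" (if it
does not in this release, blacklisting the client address with "ceph osd
blacklist add" is the usual fallback); the session id and address are the ones
from the ‘session ls’ snippet elsewhere in this thread:

# ceph daemon mds.a190 session ls
# ceph daemon mds.a190 session evict 585194220
# ceph osd blacklist add 192.168.1.157:0/634334964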

On 9/21/16, 9:24 PM, "Heller, Chris"  wrote:

What is the interesting value in ‘session ls’? Is it ‘num_leases’ or
‘num_caps’? Leases appear to be, on average, 1, but caps seem to be 16385 for
many, many clients!

-Chris

On 9/21/16, 9:22 PM, "Gregory Farnum"  wrote:

On Wed, Sep 21, 2016 at 6:13 PM, Heller, Chris  
wrote:
> I’m suspecting something similar. We have millions of files and can
read a huge subset of them at a time; presently the client is Spark 1.5.2, which
I suspect is leaving the closing of file descriptors up to the garbage
collector. That said, I’d like to know if I could verify this theory using the
Ceph tools. I’ll try upping “mds cache size”; are there any other configuration
settings I might adjust to perhaps ease the problem while I track it down in
the HDFS tools layer?

That's the big one. You can also go through the admin socket commands
for things like "session ls" that will tell you how many files the
client is holding on to and compare.

>
> -Chris
>
> On 9/21/16, 4:34 PM, "Gregory Farnum"  wrote:
>
> On Wed, Sep 21, 2016 at 1:16 PM, Heller, Chris 
 wrote:
> > Ok. I just ran into this issue again. The mds rolled after many 
clients were failing to relieve cache pressure.
>
> That definitely could have had something to do with it, if say 
they
> overloaded the MDS so much it got stuck in a directory read loop.
> ...actually now I come to think of it, I think there was some 
problem
> with Hadoop not being nice about closing files and so forcing 
clients
> to keep them pinned, which will make the MDS pretty unhappy if 
they're
> holding more than it's configured for.
>
> >
> > Now here is the result of `ceph -s`
> >
> > # ceph -s
> > cluster b126570e-9e7c-0bb2-991f-ecf9abe3afa0
> >  health HEALTH_OK
> >  monmap e1: 5 mons at 
{a154=192.168.1.154:6789/0,a155=192.168.1.155:6789/0,a189=192.168.1.189:6789/0,a190=192.168.1.190:6789/0,a191=192.168.1.191:6789/0}
> > election epoch 130, quorum 0,1,2,3,4 
a154,a155,a189,a190,a191
> >  mdsmap e18676: 1/1/1 up {0=a190=up:active}, 1 
up:standby-replay, 3 up:standby
> >  osdmap e118886: 192 osds: 192 up, 192 in
> >   pgmap v13706298: 11328 pgs, 5 pools, 22704 GB data, 63571 
kobjects
> > 69601 GB used, 37656 GB / 104 TB avail
> >11309 active+clean
> >   13 active+clean+scrubbing
> >6 active+clean+scrubbing+deep
> >
> > And here are the ops in flight:
> >
> > # ceph daemon mds.a190 dump_ops_in_flight
> > {
> > "ops": [],
> > "num_ops": 0
> > }
> >
> > And a tail of the active mds log at debug_mds 5/5
> >
> > 2016-09-21 20:15:53.354226 7fce3b626700  4 mds.0.server 
handle_client_request client_request(client.585124080:17863 lookup 
#1/stream2store 2016-09-21 20:15:53.352390) v2
> > 2016-09-21 20:15:53.354234 7fce3b626700  5 mds.0.server session 
closed|closing|killing, dropping
>
> This is also pretty solid evidence that the MDS is zapping clients
> when they misbehave.
>
> You can increase "mds cache size" past its default 100,000
dentries and
> see if that alleviates (or just draws out) the problem.
> -Greg
>
> > 2016-09-21 20:15:54.867108 7fce3b626700  3 mds.0.server 
handle_client_session client_session(request_renewcaps seq 235) v1 from 
client.507429717
> > 2016-09-21 20:15:54.980907 7fce3851f700  2 mds.0.cache 
check_memory_usage total 1475784, rss 666432, heap 79712, malloc 584052 mmap 0, 
baseline 79712, buffers 0, max 1048576, 0 / 93392 inodes have caps, 0 caps, 0 
caps per inode
> > 2016-09-21 20:15:54.980960 7fce3851f700  5 mds.0.bal mds.0 
epoch 38 load mdsload<[0,0 0]/[0,0 0], req 1987, hr 0, qlen 0, cpu 0.34>
> > 2016-09-21 20:15:55.247885 7fce3b626700  3 mds.0.server 
handle_client_session client_session(request_renewcaps seq 233) v1 from 
client.538555196
> > 2016-09-21 20:15:55.455566 7fce3b626700  3 mds.0.server 
handle_client_session client_session(request_renewcaps seq 365) v1 from 
client.507390467
> > 2016-09-21 20:15:55.807704 7fce3b626700  3 mds.0.server 

Re: [ceph-users] Faulting MDS clients, HEALTH_OK

2016-09-21 Thread Heller, Chris
What is the interesting value in ‘session ls’? Is it ‘num_leases’ or ‘num_caps’?
Leases appear to be, on average, 1, but caps seem to be 16385 for many, many
clients!

-Chris

On 9/21/16, 9:22 PM, "Gregory Farnum"  wrote:

On Wed, Sep 21, 2016 at 6:13 PM, Heller, Chris  wrote:
> I’m suspecting something similar. We have millions of files and can read
a huge subset of them at a time; presently the client is Spark 1.5.2, which I
suspect is leaving the closing of file descriptors up to the garbage collector.
That said, I’d like to know if I could verify this theory using the Ceph tools.
I’ll try upping “mds cache size”; are there any other configuration settings I
might adjust to perhaps ease the problem while I track it down in the HDFS
tools layer?

That's the big one. You can also go through the admin socket commands
for things like "session ls" that will tell you how many files the
client is holding on to and compare.

>
> -Chris
>
> On 9/21/16, 4:34 PM, "Gregory Farnum"  wrote:
>
> On Wed, Sep 21, 2016 at 1:16 PM, Heller, Chris  
wrote:
> > Ok. I just ran into this issue again. The mds rolled after many 
clients were failing to relieve cache pressure.
>
> That definitely could have had something to do with it, if say they
> overloaded the MDS so much it got stuck in a directory read loop.
> ...actually now I come to think of it, I think there was some problem
> with Hadoop not being nice about closing files and so forcing clients
> to keep them pinned, which will make the MDS pretty unhappy if they're
> holding more than it's configured for.
>
> >
> > Now here is the result of `ceph -s`
> >
> > # ceph -s
> > cluster b126570e-9e7c-0bb2-991f-ecf9abe3afa0
> >  health HEALTH_OK
> >  monmap e1: 5 mons at 
{a154=192.168.1.154:6789/0,a155=192.168.1.155:6789/0,a189=192.168.1.189:6789/0,a190=192.168.1.190:6789/0,a191=192.168.1.191:6789/0}
> > election epoch 130, quorum 0,1,2,3,4 
a154,a155,a189,a190,a191
> >  mdsmap e18676: 1/1/1 up {0=a190=up:active}, 1 
up:standby-replay, 3 up:standby
> >  osdmap e118886: 192 osds: 192 up, 192 in
> >   pgmap v13706298: 11328 pgs, 5 pools, 22704 GB data, 63571 
kobjects
> > 69601 GB used, 37656 GB / 104 TB avail
> >11309 active+clean
> >   13 active+clean+scrubbing
> >6 active+clean+scrubbing+deep
> >
> > And here are the ops in flight:
> >
> > # ceph daemon mds.a190 dump_ops_in_flight
> > {
> > "ops": [],
> > "num_ops": 0
> > }
> >
> > And a tail of the active mds log at debug_mds 5/5
> >
> > 2016-09-21 20:15:53.354226 7fce3b626700  4 mds.0.server 
handle_client_request client_request(client.585124080:17863 lookup 
#1/stream2store 2016-09-21 20:15:53.352390) v2
> > 2016-09-21 20:15:53.354234 7fce3b626700  5 mds.0.server session 
closed|closing|killing, dropping
>
> This is also pretty solid evidence that the MDS is zapping clients
> when they misbehave.
>
> You can increase "mds cache size" past its default 100,000 dentries and
> see if that alleviates (or just draws out) the problem.
> -Greg
>
> > 2016-09-21 20:15:54.867108 7fce3b626700  3 mds.0.server 
handle_client_session client_session(request_renewcaps seq 235) v1 from 
client.507429717
> > 2016-09-21 20:15:54.980907 7fce3851f700  2 mds.0.cache 
check_memory_usage total 1475784, rss 666432, heap 79712, malloc 584052 mmap 0, 
baseline 79712, buffers 0, max 1048576, 0 / 93392 inodes have caps, 0 caps, 0 
caps per inode
> > 2016-09-21 20:15:54.980960 7fce3851f700  5 mds.0.bal mds.0 epoch 38 
load mdsload<[0,0 0]/[0,0 0], req 1987, hr 0, qlen 0, cpu 0.34>
> > 2016-09-21 20:15:55.247885 7fce3b626700  3 mds.0.server 
handle_client_session client_session(request_renewcaps seq 233) v1 from 
client.538555196
> > 2016-09-21 20:15:55.455566 7fce3b626700  3 mds.0.server 
handle_client_session client_session(request_renewcaps seq 365) v1 from 
client.507390467
> > 2016-09-21 20:15:55.807704 7fce3b626700  3 mds.0.server 
handle_client_session client_session(request_renewcaps seq 367) v1 from 
client.538485341
> > 2016-09-21 20:15:56.243462 7fce3b626700  3 mds.0.server 
handle_client_session client_session(request_renewcaps seq 189) v1 from 
client.538577596
> > 2016-09-21 20:15:56.986901 7fce3b626700  3 mds.0.server 
handle_client_session client_session(request_renewcaps seq 232) v1 from 
client.507430372
> > 2016-09-21 20:15:57.026206 7fce3b626700  3 mds.0.server 

Re: [ceph-users] Faulting MDS clients, HEALTH_OK

2016-09-21 Thread Gregory Farnum
On Wed, Sep 21, 2016 at 6:13 PM, Heller, Chris  wrote:
> I’m suspecting something similar. We have millions of files and can read a
> huge subset of them at a time; presently the client is Spark 1.5.2, which I
> suspect is leaving the closing of file descriptors up to the garbage
> collector. That said, I’d like to know if I could verify this theory using
> the Ceph tools. I’ll try upping “mds cache size”; are there any other
> configuration settings I might adjust to perhaps ease the problem while I
> track it down in the HDFS tools layer?

That's the big one. You can also go through the admin socket commands
for things like "session ls" that will tell you how many files the
client is holding on to and compare.
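
A quick sketch of that comparison, assuming jq is available on the admin host
and using the active MDS name that appears elsewhere in this thread; it simply
sorts the sessions by how many caps each client instance is holding:

# ceph daemon mds.a190 session ls | jq -r '.[] | "\(.num_caps) \(.num_leases) \(.inst)"' | sort -rn | head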

>
> -Chris
>
> On 9/21/16, 4:34 PM, "Gregory Farnum"  wrote:
>
> On Wed, Sep 21, 2016 at 1:16 PM, Heller, Chris  wrote:
> > Ok. I just ran into this issue again. The mds rolled after many clients 
> were failing to relieve cache pressure.
>
> That definitely could have had something to do with it, if say they
> overloaded the MDS so much it got stuck in a directory read loop.
> ...actually now I come to think of it, I think there was some problem
> with Hadoop not being nice about closing files and so forcing clients
> to keep them pinned, which will make the MDS pretty unhappy if they're
> holding more than it's configured for.
>
> >
> > Now here is the result of `ceph -s`
> >
> > # ceph -s
> > cluster b126570e-9e7c-0bb2-991f-ecf9abe3afa0
> >  health HEALTH_OK
> >  monmap e1: 5 mons at 
> {a154=192.168.1.154:6789/0,a155=192.168.1.155:6789/0,a189=192.168.1.189:6789/0,a190=192.168.1.190:6789/0,a191=192.168.1.191:6789/0}
> > election epoch 130, quorum 0,1,2,3,4 
> a154,a155,a189,a190,a191
> >  mdsmap e18676: 1/1/1 up {0=a190=up:active}, 1 up:standby-replay, 3 
> up:standby
> >  osdmap e118886: 192 osds: 192 up, 192 in
> >   pgmap v13706298: 11328 pgs, 5 pools, 22704 GB data, 63571 kobjects
> > 69601 GB used, 37656 GB / 104 TB avail
> >11309 active+clean
> >   13 active+clean+scrubbing
> >6 active+clean+scrubbing+deep
> >
> > And here are the ops in flight:
> >
> > # ceph daemon mds.a190 dump_ops_in_flight
> > {
> > "ops": [],
> > "num_ops": 0
> > }
> >
> > And a tail of the active mds log at debug_mds 5/5
> >
> > 2016-09-21 20:15:53.354226 7fce3b626700  4 mds.0.server 
> handle_client_request client_request(client.585124080:17863 lookup 
> #1/stream2store 2016-09-21 20:15:53.352390) v2
> > 2016-09-21 20:15:53.354234 7fce3b626700  5 mds.0.server session 
> closed|closing|killing, dropping
>
> This is also pretty solid evidence that the MDS is zapping clients
> when they misbehave.
>
> You can increase "mds cache size" past its default 100,000 dentries and
> see if that alleviates (or just draws out) the problem.
> -Greg
>
> > 2016-09-21 20:15:54.867108 7fce3b626700  3 mds.0.server 
> handle_client_session client_session(request_renewcaps seq 235) v1 from 
> client.507429717
> > 2016-09-21 20:15:54.980907 7fce3851f700  2 mds.0.cache 
> check_memory_usage total 1475784, rss 666432, heap 79712, malloc 584052 mmap 
> 0, baseline 79712, buffers 0, max 1048576, 0 / 93392 inodes have caps, 0 
> caps, 0 caps per inode
> > 2016-09-21 20:15:54.980960 7fce3851f700  5 mds.0.bal mds.0 epoch 38 
> load mdsload<[0,0 0]/[0,0 0], req 1987, hr 0, qlen 0, cpu 0.34>
> > 2016-09-21 20:15:55.247885 7fce3b626700  3 mds.0.server 
> handle_client_session client_session(request_renewcaps seq 233) v1 from 
> client.538555196
> > 2016-09-21 20:15:55.455566 7fce3b626700  3 mds.0.server 
> handle_client_session client_session(request_renewcaps seq 365) v1 from 
> client.507390467
> > 2016-09-21 20:15:55.807704 7fce3b626700  3 mds.0.server 
> handle_client_session client_session(request_renewcaps seq 367) v1 from 
> client.538485341
> > 2016-09-21 20:15:56.243462 7fce3b626700  3 mds.0.server 
> handle_client_session client_session(request_renewcaps seq 189) v1 from 
> client.538577596
> > 2016-09-21 20:15:56.986901 7fce3b626700  3 mds.0.server 
> handle_client_session client_session(request_renewcaps seq 232) v1 from 
> client.507430372
> > 2016-09-21 20:15:57.026206 7fce3b626700  3 mds.0.server 
> handle_client_session client_session(request_renewcaps seq 364) v1 from 
> client.491885158
> > 2016-09-21 20:15:57.369281 7fce3b626700  3 mds.0.server 
> handle_client_session client_session(request_renewcaps seq 364) v1 from 
> client.507390682
> > 2016-09-21 20:15:57.445687 7fce3b626700  3 mds.0.server 
> handle_client_session client_session(request_renewcaps seq 364) v1 from 
> client.538485996
> > 2016-09-21 20:15:57.579268 7fce3b626700  3 mds.0.server 

Re: [ceph-users] Faulting MDS clients, HEALTH_OK

2016-09-21 Thread Heller, Chris
I’m suspecting something similar. We have millions of files and can read a huge
subset of them at a time; presently the client is Spark 1.5.2, which I suspect
is leaving the closing of file descriptors up to the garbage collector. That
said, I’d like to know if I could verify this theory using the Ceph tools. I’ll
try upping “mds cache size”; are there any other configuration settings I might
adjust to perhaps ease the problem while I track it down in the HDFS tools
layer?

-Chris

On 9/21/16, 4:34 PM, "Gregory Farnum"  wrote:

On Wed, Sep 21, 2016 at 1:16 PM, Heller, Chris  wrote:
> Ok. I just ran into this issue again. The mds rolled after many clients 
were failing to relieve cache pressure.

That definitely could have had something to do with it, if say they
overloaded the MDS so much it got stuck in a directory read loop.
...actually now I come to think of it, I think there was some problem
with Hadoop not being nice about closing files and so forcing clients
to keep them pinned, which will make the MDS pretty unhappy if they're
holding more than it's configured for.

>
> Now here is the result of `ceph -s`
>
> # ceph -s
> cluster b126570e-9e7c-0bb2-991f-ecf9abe3afa0
>  health HEALTH_OK
>  monmap e1: 5 mons at 
{a154=192.168.1.154:6789/0,a155=192.168.1.155:6789/0,a189=192.168.1.189:6789/0,a190=192.168.1.190:6789/0,a191=192.168.1.191:6789/0}
> election epoch 130, quorum 0,1,2,3,4 a154,a155,a189,a190,a191
>  mdsmap e18676: 1/1/1 up {0=a190=up:active}, 1 up:standby-replay, 3 
up:standby
>  osdmap e118886: 192 osds: 192 up, 192 in
>   pgmap v13706298: 11328 pgs, 5 pools, 22704 GB data, 63571 kobjects
> 69601 GB used, 37656 GB / 104 TB avail
>11309 active+clean
>   13 active+clean+scrubbing
>6 active+clean+scrubbing+deep
>
> And here are the ops in flight:
>
> # ceph daemon mds.a190 dump_ops_in_flight
> {
> "ops": [],
> "num_ops": 0
> }
>
> And a tail of the active mds log at debug_mds 5/5
>
> 2016-09-21 20:15:53.354226 7fce3b626700  4 mds.0.server 
handle_client_request client_request(client.585124080:17863 lookup 
#1/stream2store 2016-09-21 20:15:53.352390) v2
> 2016-09-21 20:15:53.354234 7fce3b626700  5 mds.0.server session 
closed|closing|killing, dropping

This is also pretty solid evidence that the MDS is zapping clients
when they misbehave.

You can increase "mds cache size" past its default 100,000 dentries and
see if that alleviates (or just draws out) the problem.
-Greg

> 2016-09-21 20:15:54.867108 7fce3b626700  3 mds.0.server 
handle_client_session client_session(request_renewcaps seq 235) v1 from 
client.507429717
> 2016-09-21 20:15:54.980907 7fce3851f700  2 mds.0.cache check_memory_usage 
total 1475784, rss 666432, heap 79712, malloc 584052 mmap 0, baseline 79712, 
buffers 0, max 1048576, 0 / 93392 inodes have caps, 0 caps, 0 caps per inode
> 2016-09-21 20:15:54.980960 7fce3851f700  5 mds.0.bal mds.0 epoch 38 load 
mdsload<[0,0 0]/[0,0 0], req 1987, hr 0, qlen 0, cpu 0.34>
> 2016-09-21 20:15:55.247885 7fce3b626700  3 mds.0.server 
handle_client_session client_session(request_renewcaps seq 233) v1 from 
client.538555196
> 2016-09-21 20:15:55.455566 7fce3b626700  3 mds.0.server 
handle_client_session client_session(request_renewcaps seq 365) v1 from 
client.507390467
> 2016-09-21 20:15:55.807704 7fce3b626700  3 mds.0.server 
handle_client_session client_session(request_renewcaps seq 367) v1 from 
client.538485341
> 2016-09-21 20:15:56.243462 7fce3b626700  3 mds.0.server 
handle_client_session client_session(request_renewcaps seq 189) v1 from 
client.538577596
> 2016-09-21 20:15:56.986901 7fce3b626700  3 mds.0.server 
handle_client_session client_session(request_renewcaps seq 232) v1 from 
client.507430372
> 2016-09-21 20:15:57.026206 7fce3b626700  3 mds.0.server 
handle_client_session client_session(request_renewcaps seq 364) v1 from 
client.491885158
> 2016-09-21 20:15:57.369281 7fce3b626700  3 mds.0.server 
handle_client_session client_session(request_renewcaps seq 364) v1 from 
client.507390682
> 2016-09-21 20:15:57.445687 7fce3b626700  3 mds.0.server 
handle_client_session client_session(request_renewcaps seq 364) v1 from 
client.538485996
> 2016-09-21 20:15:57.579268 7fce3b626700  3 mds.0.server 
handle_client_session client_session(request_renewcaps seq 364) v1 from 
client.538486021
> 2016-09-21 20:15:57.595568 7fce3b626700  3 mds.0.server 
handle_client_session client_session(request_renewcaps seq 364) v1 from 
client.507390702
> 2016-09-21 20:15:57.604356 7fce3b626700  3 mds.0.server 
handle_client_session client_session(request_renewcaps seq 364) v1 from 
client.507390712
> 2016-09-21 

Re: [ceph-users] Faulting MDS clients, HEALTH_OK

2016-09-21 Thread Gregory Farnum
On Wed, Sep 21, 2016 at 1:16 PM, Heller, Chris  wrote:
> Ok. I just ran into this issue again. The mds rolled after many clients were 
> failing to relieve cache pressure.

That definitely could have had something to do with it, if say they
overloaded the MDS so much it got stuck in a directory read loop.
...actually now I come to think of it, I think there was some problem
with Hadoop not being nice about closing files and so forcing clients
to keep them pinned, which will make the MDS pretty unhappy if they're
holding more than it's configured for.

>
> Now here is the result of `ceph -s`
>
> # ceph -s
> cluster b126570e-9e7c-0bb2-991f-ecf9abe3afa0
>  health HEALTH_OK
>  monmap e1: 5 mons at 
> {a154=192.168.1.154:6789/0,a155=192.168.1.155:6789/0,a189=192.168.1.189:6789/0,a190=192.168.1.190:6789/0,a191=192.168.1.191:6789/0}
> election epoch 130, quorum 0,1,2,3,4 a154,a155,a189,a190,a191
>  mdsmap e18676: 1/1/1 up {0=a190=up:active}, 1 up:standby-replay, 3 
> up:standby
>  osdmap e118886: 192 osds: 192 up, 192 in
>   pgmap v13706298: 11328 pgs, 5 pools, 22704 GB data, 63571 kobjects
> 69601 GB used, 37656 GB / 104 TB avail
>11309 active+clean
>   13 active+clean+scrubbing
>6 active+clean+scrubbing+deep
>
> And here are the ops in flight:
>
> # ceph daemon mds.a190 dump_ops_in_flight
> {
> "ops": [],
> "num_ops": 0
> }
>
> And a tail of the active mds log at debug_mds 5/5
>
> 2016-09-21 20:15:53.354226 7fce3b626700  4 mds.0.server handle_client_request 
> client_request(client.585124080:17863 lookup #1/stream2store 2016-09-21 
> 20:15:53.352390) v2
> 2016-09-21 20:15:53.354234 7fce3b626700  5 mds.0.server session 
> closed|closing|killing, dropping

This is also pretty solid evidence that the MDS is zapping clients
when they misbehave.

You can increase "mds cache size" past its default 100,000 dentries and
see if that alleviates (or just draws out) the problem.
-Greg
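
A hedged sketch of the two usual ways to raise that limit, persistently in
ceph.conf on the MDS hosts or injected at runtime (the mds name a190 is the
one used elsewhere in this thread); note that this only buys headroom, it
does not fix clients that never release their caps:

[mds]
    # default is 100000; raise only as far as MDS memory allows
    mds cache size = 1000000

# ceph tell mds.a190 injectargs '--mds_cache_size 1000000'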

> 2016-09-21 20:15:54.867108 7fce3b626700  3 mds.0.server handle_client_session 
> client_session(request_renewcaps seq 235) v1 from client.507429717
> 2016-09-21 20:15:54.980907 7fce3851f700  2 mds.0.cache check_memory_usage 
> total 1475784, rss 666432, heap 79712, malloc 584052 mmap 0, baseline 79712, 
> buffers 0, max 1048576, 0 / 93392 inodes have caps, 0 caps, 0 caps per inode
> 2016-09-21 20:15:54.980960 7fce3851f700  5 mds.0.bal mds.0 epoch 38 load 
> mdsload<[0,0 0]/[0,0 0], req 1987, hr 0, qlen 0, cpu 0.34>
> 2016-09-21 20:15:55.247885 7fce3b626700  3 mds.0.server handle_client_session 
> client_session(request_renewcaps seq 233) v1 from client.538555196
> 2016-09-21 20:15:55.455566 7fce3b626700  3 mds.0.server handle_client_session 
> client_session(request_renewcaps seq 365) v1 from client.507390467
> 2016-09-21 20:15:55.807704 7fce3b626700  3 mds.0.server handle_client_session 
> client_session(request_renewcaps seq 367) v1 from client.538485341
> 2016-09-21 20:15:56.243462 7fce3b626700  3 mds.0.server handle_client_session 
> client_session(request_renewcaps seq 189) v1 from client.538577596
> 2016-09-21 20:15:56.986901 7fce3b626700  3 mds.0.server handle_client_session 
> client_session(request_renewcaps seq 232) v1 from client.507430372
> 2016-09-21 20:15:57.026206 7fce3b626700  3 mds.0.server handle_client_session 
> client_session(request_renewcaps seq 364) v1 from client.491885158
> 2016-09-21 20:15:57.369281 7fce3b626700  3 mds.0.server handle_client_session 
> client_session(request_renewcaps seq 364) v1 from client.507390682
> 2016-09-21 20:15:57.445687 7fce3b626700  3 mds.0.server handle_client_session 
> client_session(request_renewcaps seq 364) v1 from client.538485996
> 2016-09-21 20:15:57.579268 7fce3b626700  3 mds.0.server handle_client_session 
> client_session(request_renewcaps seq 364) v1 from client.538486021
> 2016-09-21 20:15:57.595568 7fce3b626700  3 mds.0.server handle_client_session 
> client_session(request_renewcaps seq 364) v1 from client.507390702
> 2016-09-21 20:15:57.604356 7fce3b626700  3 mds.0.server handle_client_session 
> client_session(request_renewcaps seq 364) v1 from client.507390712
> 2016-09-21 20:15:57.693546 7fce3b626700  3 mds.0.server handle_client_session 
> client_session(request_renewcaps seq 364) v1 from client.507390717
> 2016-09-21 20:15:57.819536 7fce3b626700  3 mds.0.server handle_client_session 
> client_session(request_renewcaps seq 364) v1 from client.491885168
> 2016-09-21 20:15:57.894058 7fce3b626700  3 mds.0.server handle_client_session 
> client_session(request_renewcaps seq 364) v1 from client.507390732
> 2016-09-21 20:15:57.983329 7fce3b626700  3 mds.0.server handle_client_session 
> client_session(request_renewcaps seq 364) v1 from client.507390742
> 2016-09-21 20:15:58.077915 7fce3b626700  3 mds.0.server handle_client_session 
> client_session(request_renewcaps seq 364) v1 from client.538486031
> 2016-09-21 20:15:58.141710 7fce3b626700  3 

Re: [ceph-users] Faulting MDS clients, HEALTH_OK

2016-09-21 Thread Heller, Chris
Ok. I just ran into this issue again. The mds rolled after many clients were 
failing to relieve cache pressure.

Now here is the result of `ceph -s`

# ceph -s
cluster b126570e-9e7c-0bb2-991f-ecf9abe3afa0
 health HEALTH_OK
 monmap e1: 5 mons at 
{a154=192.168.1.154:6789/0,a155=192.168.1.155:6789/0,a189=192.168.1.189:6789/0,a190=192.168.1.190:6789/0,a191=192.168.1.191:6789/0}
election epoch 130, quorum 0,1,2,3,4 a154,a155,a189,a190,a191
 mdsmap e18676: 1/1/1 up {0=a190=up:active}, 1 up:standby-replay, 3 
up:standby
 osdmap e118886: 192 osds: 192 up, 192 in
  pgmap v13706298: 11328 pgs, 5 pools, 22704 GB data, 63571 kobjects
        69601 GB used, 37656 GB / 104 TB avail
           11309 active+clean
              13 active+clean+scrubbing
               6 active+clean+scrubbing+deep

And here are the ops in flight:

# ceph daemon mds.a190 dump_ops_in_flight
{
"ops": [],
"num_ops": 0
}

And a tail of the active mds log at debug_mds 5/5

2016-09-21 20:15:53.354226 7fce3b626700  4 mds.0.server handle_client_request 
client_request(client.585124080:17863 lookup #1/stream2store 2016-09-21 
20:15:53.352390) v2
2016-09-21 20:15:53.354234 7fce3b626700  5 mds.0.server session 
closed|closing|killing, dropping
2016-09-21 20:15:54.867108 7fce3b626700  3 mds.0.server handle_client_session 
client_session(request_renewcaps seq 235) v1 from client.507429717
2016-09-21 20:15:54.980907 7fce3851f700  2 mds.0.cache check_memory_usage total 
1475784, rss 666432, heap 79712, malloc 584052 mmap 0, baseline 79712, buffers 
0, max 1048576, 0 / 93392 inodes have caps, 0 caps, 0 caps per inode
2016-09-21 20:15:54.980960 7fce3851f700  5 mds.0.bal mds.0 epoch 38 load 
mdsload<[0,0 0]/[0,0 0], req 1987, hr 0, qlen 0, cpu 0.34>
2016-09-21 20:15:55.247885 7fce3b626700  3 mds.0.server handle_client_session 
client_session(request_renewcaps seq 233) v1 from client.538555196
2016-09-21 20:15:55.455566 7fce3b626700  3 mds.0.server handle_client_session 
client_session(request_renewcaps seq 365) v1 from client.507390467
2016-09-21 20:15:55.807704 7fce3b626700  3 mds.0.server handle_client_session 
client_session(request_renewcaps seq 367) v1 from client.538485341
2016-09-21 20:15:56.243462 7fce3b626700  3 mds.0.server handle_client_session 
client_session(request_renewcaps seq 189) v1 from client.538577596
2016-09-21 20:15:56.986901 7fce3b626700  3 mds.0.server handle_client_session 
client_session(request_renewcaps seq 232) v1 from client.507430372
2016-09-21 20:15:57.026206 7fce3b626700  3 mds.0.server handle_client_session 
client_session(request_renewcaps seq 364) v1 from client.491885158
2016-09-21 20:15:57.369281 7fce3b626700  3 mds.0.server handle_client_session 
client_session(request_renewcaps seq 364) v1 from client.507390682
2016-09-21 20:15:57.445687 7fce3b626700  3 mds.0.server handle_client_session 
client_session(request_renewcaps seq 364) v1 from client.538485996
2016-09-21 20:15:57.579268 7fce3b626700  3 mds.0.server handle_client_session 
client_session(request_renewcaps seq 364) v1 from client.538486021
2016-09-21 20:15:57.595568 7fce3b626700  3 mds.0.server handle_client_session 
client_session(request_renewcaps seq 364) v1 from client.507390702
2016-09-21 20:15:57.604356 7fce3b626700  3 mds.0.server handle_client_session 
client_session(request_renewcaps seq 364) v1 from client.507390712
2016-09-21 20:15:57.693546 7fce3b626700  3 mds.0.server handle_client_session 
client_session(request_renewcaps seq 364) v1 from client.507390717
2016-09-21 20:15:57.819536 7fce3b626700  3 mds.0.server handle_client_session 
client_session(request_renewcaps seq 364) v1 from client.491885168
2016-09-21 20:15:57.894058 7fce3b626700  3 mds.0.server handle_client_session 
client_session(request_renewcaps seq 364) v1 from client.507390732
2016-09-21 20:15:57.983329 7fce3b626700  3 mds.0.server handle_client_session 
client_session(request_renewcaps seq 364) v1 from client.507390742
2016-09-21 20:15:58.077915 7fce3b626700  3 mds.0.server handle_client_session 
client_session(request_renewcaps seq 364) v1 from client.538486031
2016-09-21 20:15:58.141710 7fce3b626700  3 mds.0.server handle_client_session 
client_session(request_renewcaps seq 364) v1 from client.491885178
2016-09-21 20:15:58.159134 7fce3b626700  3 mds.0.server handle_client_session 
client_session(request_renewcaps seq 364) v1 from client.491885188

-Chris

On 9/21/16, 11:23 AM, "Heller, Chris"  wrote:

Perhaps related, I was watching the active mds with debug_mds set to 5/5, 
when I saw this in the log:

2016-09-21 15:13:26.067698 7fbaec248700  0 -- 192.168.1.196:6802/13581 >> 
192.168.1.238:0/3488321578 pipe(0x55db000 sd=49 :6802 s=2 pgs=2 cs=1 l=0 
c=0x5631ce0).fault with nothing to send, going to standby
2016-09-21 15:13:26.067717 7fbaf64ea700  0 -- 192.168.1.196:6802/13581 >> 
192.168.1.214:0/3252234463 pipe(0x54d1000 sd=76 :6802 s=2 pgs=2 cs=1 l=0 
c=0x237e8420).fault with 

Re: [ceph-users] Faulting MDS clients, HEALTH_OK

2016-09-21 Thread Heller, Chris
Perhaps related, I was watching the active mds with debug_mds set to 5/5, when 
I saw this in the log:

2016-09-21 15:13:26.067698 7fbaec248700  0 -- 192.168.1.196:6802/13581 >> 
192.168.1.238:0/3488321578 pipe(0x55db000 sd=49 :6802 s=2 pgs=2 cs=1 l=0 
c=0x5631ce0).fault with nothing to send, going to standby
2016-09-21 15:13:26.067717 7fbaf64ea700  0 -- 192.168.1.196:6802/13581 >> 
192.168.1.214:0/3252234463 pipe(0x54d1000 sd=76 :6802 s=2 pgs=2 cs=1 l=0 
c=0x237e8420).fault with nothing to send, going to standby
2016-09-21 15:13:26.067725 7fbb0098e700  0 -- 192.168.1.196:6802/13581 >>
192.168.1.204:0/2963585795 pipe(0x3bf1000 sd=55 :6802 s=2 pgs=2 cs=1 l=0
c=0x15c29020).fault with nothing to send, going to standby
2016-09-21 15:13:26.067743 7fbb026ab700  0 -- 192.168.1.196:6802/13581 >>
192.168.1.192:0/4235516229 pipe(0x562b000 sd=83 :6802 s=2 pgs=2 cs=1 l=0
c=0x237e91e0).fault, server, going to standby
2016-09-21 15:13:26.067749 7fbae840a700  0 -- 192.168.1.196:6802/13581 >>
192.168.1.214:0/329045 pipe(0x2a38a000 sd=74 :6802 s=2 pgs=2 cs=1 l=0
c=0x13b6c160).fault with nothing to send, going to standby
2016-09-21 15:13:26.067783 7fbadb239700  0 -- 192.168.1.196:6802/13581 >>
192.168.1.141:0/229472938 pipe(0x268d2000 sd=87 :6802 s=2 pgs=2 cs=1 l=0
c=0x28e24f20).fault with nothing to send, going to standby
2016-09-21 15:13:26.067803 7fbafe66b700  0 -- 192.168.1.196:6802/13581 >>
192.168.1.193:0/2637929639 pipe(0x29582000 sd=80 :6802 s=2 pgs=2 cs=1 l=0
c=0x237e9760).fault with nothing to send, going to standby
2016-09-21 15:13:26.067876 7fbb01a9f700  0 -- 192.168.1.196:6802/13581 >>
192.168.1.228:0/581679898 pipe(0x2384f000 sd=103 :6802 s=2 pgs=2 cs=1 l=0
c=0x2f92f5a0).fault with nothing to send, going to standby
2016-09-21 15:13:26.067886 7fbb01ca1700  0 -- 192.168.1.196:6802/13581 >>
192.168.1.145:0/586636299 pipe(0x25806000 sd=101 :6802 s=2 pgs=2 cs=1 l=0
c=0x2f92cc60).fault with nothing to send, going to standby
2016-09-21 15:13:26.067865 7fbaf43c9700  0 -- 192.168.1.196:6802/13581 >>
192.168.1.234:0/3131612847 pipe(0x2fbe5000 sd=120 :6802 s=2 pgs=2 cs=1 l=0
c=0x37c902c0).fault with nothing to send, going to standby
2016-09-21 15:13:26.067910 7fbaf4ed4700  0 -- 192.168.1.196:6802/13581 >>
192.168.1.236:0/650394434 pipe(0x2fbe sd=116 :6802 s=2 pgs=2 cs=1 l=0
c=0x56a5440).fault with nothing to send, going to standby
2016-09-21 15:13:26.067911 7fbb01196700  0 -- 192.168.1.196:6802/13581 >>
192.168.1.149:0/821983967 pipe(0x1420b000 sd=104 :6802 s=2 pgs=2 cs=1 l=0
c=0x2f92cf20).fault with nothing to send, going to standby
2016-09-21 15:13:26.068076 7fbafc64b700  0 -- 192.168.1.196:6802/13581 >>
192.168.1.190:0/1817596579 pipe(0x36829000 sd=124 :6802 s=2 pgs=2 cs=1 l=0
c=0x31f7a100).fault with nothing to send, going to standby

Re: [ceph-users] Faulting MDS clients, HEALTH_OK

2016-09-21 Thread Heller, Chris
I’ll see if I can capture the output the next time this issue arises, but in
general the output looks as if nothing is wrong. No OSDs are down, a ‘ceph
health detail’ reports HEALTH_OK, and the MDS is in the up:active state; in
general it’s as if nothing is wrong server side (at least from the summary).

-Chris

On 9/21/16, 10:46 AM, "Gregory Farnum"  wrote:

On Wed, Sep 21, 2016 at 6:30 AM, Heller, Chris  wrote:
> I’m running a production 0.94.7 Ceph cluster, and have been seeing a
> periodic issue arise wherein all my MDS clients will become stuck, and the
> fix so far has been to restart the active MDS (sometimes I need to restart
> the subsequent active MDS as well).
>
>
>
> These clients are using the cephfs-hadoop API, so there is no kernel client
> or FUSE API involved. When I see clients get stuck, there are messages
> printed to stderr like the following:
>
>
>
> 2016-09-21 10:31:12.285030 7fea4c7fb700  0 -- 192.168.1.241:0/1606648601 >>
> 192.168.1.195:6801/1674 pipe(0x7feaa0a1e0f0 sd=206 :0 s=1 pgs=0 cs=0 l=0
> c=0x7feaa0a0c500).fault
>
>
>
> I’m at somewhat of a loss on where to begin debugging this issue, and 
wanted
> to ping the list for ideas.

What's the full output of "ceph -s" when this happens? Have you looked
at the MDS' admin socket's ops-in-flight, and that of the clients?

http://docs.ceph.com/docs/master/cephfs/troubleshooting/ may help some as 
well.

>
>
>
> I managed to dump the mds cache during one of the stalled moments, which
> hopefully is a useful starting point:
>
>
>
> e51bed37327a676e9974d740a13e173f11d1a11fdba5fbcf963b62023b06d7e8
> mdscachedump.txt.gz (https://filetea.me/t1sz3XPHxEVThOk8tvVTK5Bsg)
>
>
>
>
>
> -Chris
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Faulting MDS clients, HEALTH_OK

2016-09-21 Thread Gregory Farnum
On Wed, Sep 21, 2016 at 6:30 AM, Heller, Chris  wrote:
> I’m running a production 0.94.7 Ceph cluster, and have been seeing a
> periodic issue arise wherein all my MDS clients will become stuck, and the
> fix so far has been to restart the active MDS (sometimes I need to restart
> the subsequent active MDS as well).
>
>
>
> These clients are using the cephfs-hadoop API, so there is no kernel client
> or FUSE API involved. When I see clients get stuck, there are messages
> printed to stderr like the following:
>
>
>
> 2016-09-21 10:31:12.285030 7fea4c7fb700  0 -- 192.168.1.241:0/1606648601 >>
> 192.168.1.195:6801/1674 pipe(0x7feaa0a1e0f0 sd=206 :0 s=1 pgs=0 cs=0 l=0
> c=0x7feaa0a0c500).fault
>
>
>
> I’m at somewhat of a loss on where to begin debugging this issue, and wanted
> to ping the list for ideas.

What's the full output of "ceph -s" when this happens? Have you looked
at the MDS' admin socket's ops-in-flight, and that of the clients?

http://docs.ceph.com/docs/master/cephfs/troubleshooting/ may help some as well.
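
Concretely, that check might look like the following, assuming the default
/var/run/ceph admin socket paths, the MDS name used elsewhere in this thread,
and that the libcephfs-based Hadoop client was started with an admin socket
configured and exposes these commands (the socket name is illustrative):

# ceph daemon mds.a190 dump_ops_in_flight
# ceph daemon /var/run/ceph/ceph-client.hdfs.user.12345.asok mds_sessions
# ceph daemon /var/run/ceph/ceph-client.hdfs.user.12345.asok mds_requests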

>
>
>
> I managed to dump the mds cache during one of the stalled moments, which
> hopefully is a useful starting point:
>
>
>
> e51bed37327a676e9974d740a13e173f11d1a11fdba5fbcf963b62023b06d7e8
> mdscachedump.txt.gz (https://filetea.me/t1sz3XPHxEVThOk8tvVTK5Bsg)
>
>
>
>
>
> -Chris
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Faulting MDS clients, HEALTH_OK

2016-09-21 Thread Heller, Chris
I’m running a production 0.94.7 Ceph cluster, and have been seeing a periodic
issue arise wherein all my MDS clients will become stuck, and the fix so far
has been to restart the active MDS (sometimes I need to restart the subsequent
active MDS as well).

These clients are using the cephfs-hadoop API, so there is no kernel client or
FUSE API involved. When I see clients get stuck, there are messages printed to
stderr like the following:

2016-09-21 10:31:12.285030 7fea4c7fb700  0 -- 192.168.1.241:0/1606648601 >>
192.168.1.195:6801/1674 pipe(0x7feaa0a1e0f0 sd=206 :0 s=1 pgs=0 cs=0 l=0 
c=0x7feaa0a0c500).fault

I’m at somewhat of a loss on where to begin debugging this issue, and wanted to 
ping the list for ideas.

I managed to dump the mds cache during one of the stalled moments, which 
hopefully is a useful starting point:

e51bed37327a676e9974d740a13e173f11d1a11fdba5fbcf963b62023b06d7e8  
mdscachedump.txt.gz (https://filetea.me/t1sz3XPHxEVThOk8tvVTK5Bsg)
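
For reference, a hedged sketch of how such a cache dump can be produced,
assuming the older "dumpcache" MDS command is the form available in this
(0.94.x) release and that rank 0 is the active MDS; the output path is
arbitrary:

# ceph mds tell 0 dumpcache /tmp/mdscachedump.txt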


-Chris

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com