Re: [ceph-users] Ceph MDS randomly hangs with no useful error message

2020-01-22 Thread Janek Bevendorff




I don't find any clue in the backtrace. Please run 'ceph daemon
mds. dump_historic_ops' and 'ceph daemon mds.xxx perf reset; ceph
daemon mds.xxx perf dump' and send the outputs to us.


Hi, I assume you mean ceph daemon mds.xxx perf reset _all_?

Here's the output of dump_historic_ops: https://pastebin.com/yxvjJHY9

and perf dump: https://pastebin.com/BfpAiYT7
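For anyone reading along: a quick sketch of how the mean reply latency can be pulled out of such a perf dump. The counter path (an "mds" section containing a "reply_latency" counter with "avgcount" and "sum" fields, in seconds) is an assumption based on the usual shape of MDS perf counters; verify it against your own dump before relying on it.

```python
import json

# Mean reply latency from the output of `ceph daemon mds.<name> perf dump`.
# Assumption: the dump has an "mds" section with a "reply_latency" counter
# carrying "avgcount" and "sum" fields (seconds); check your own dump.
def mean_reply_latency(perf_dump: dict) -> float:
    lat = perf_dump["mds"]["reply_latency"]
    return lat["sum"] / lat["avgcount"] if lat["avgcount"] else 0.0

# Synthetic dump in the same shape (a real one comes from the admin socket):
raw = '{"mds": {"reply_latency": {"avgcount": 4, "sum": 0.2}}}'
print(mean_reply_latency(json.loads(raw)))  # 0.05 seconds per client reply
```

A sustained rise in this value on one rank, while the others stay flat, is exactly the symptom described in this thread.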


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph MDS randomly hangs with no useful error message

2020-01-20 Thread Yan, Zheng
On Tue, Jan 21, 2020 at 12:09 AM Janek Bevendorff
 wrote:
>
> Hi, I did as you asked and created a thread dump with GDB on the
> blocking MDS. Here's the result: https://pastebin.com/pPbNvfdb
>

I don't find any clue in the backtrace. Please run 'ceph daemon
mds. dump_historic_ops' and 'ceph daemon mds.xxx perf reset; ceph
daemon mds.xxx perf dump' and send the outputs to us.



Re: [ceph-users] Ceph MDS randomly hangs with no useful error message

2020-01-20 Thread Janek Bevendorff
Hi, I did as you asked and created a thread dump with GDB on the 
blocking MDS. Here's the result: https://pastebin.com/pPbNvfdb





Re: [ceph-users] Ceph MDS randomly hangs with no useful error message

2020-01-17 Thread Janek Bevendorff
Thanks, I will do that. Right now we are seeing quite a bit of lag when
listing folders, probably because another client is using the system
heavily. Unfortunately, it's rather hard to debug at the moment, since
the suspected client has to go through our Ganesha bridge instead of
connecting to Ceph directly. The FS remains operable and generally
usable, but the increased latency is still quite annoying. I wonder
whether it could be mitigated with optimised cluster settings.


I will send you a GDB trace once we encounter another potential MDS loop.




Re: [ceph-users] Ceph MDS randomly hangs with no useful error message

2020-01-17 Thread Yan, Zheng
On Fri, Jan 17, 2020 at 4:47 PM Janek Bevendorff
 wrote:
>
> Hi,
>
> We have a CephFS in our cluster with 3 MDS to which > 300 clients
> connect at any given time. The FS contains about 80 TB of data and many
> millions of files, so it is important that metadata operations work
> smoothly even when listing large directories.
>
> Previously, we had massive stability problems causing the MDS nodes to
> crash or time out regularly as a result of failing to recall caps fast
> enough and weren't able to rejoin afterwards without resetting the
> mds*_openfiles objects (see
> https://lists.ceph.io/hyperkitty/list/ceph-us...@ceph.io/thread/AOYWQSONTFROPB4DXVYADWW7V25C3G6Z/
> for details).
>
> We have managed to adjust our configuration to avoid this problem. This
> comes down mostly to adjusting the recall decay rate (which still isn't
> documented), massively reducing any scrubbing activities, allowing for
> no more than 10G for mds_cache_memory_limit (the default of 1G is way
> too low, but more than 10G seems to cause trouble during replay),
> increasing osd_map_message_max to 100, and osd_map_cache_size to 150. We
> haven't seen crashes since. But what we do see is that one of the MDS
> nodes will randomly lock up and the ceph_mds_reply_latency metric goes
> up and then stays at a higher level than any other MDS. The result is
> not that the FS is completely down, but everything lags massively to the
> point where it's not usable.
>
> Unfortunately, all the hung MDS is reporting is:
>
> -77> 2020-01-17 09:29:17.891 7f34c967b700  0 mds.beacon.XXX Skipping
> beacon heartbeat to monitors (last acked 320.587s ago); MDS internal
> heartbeat is not healthy!
> -76> 2020-01-17 09:29:18.391 7f34c967b700  1 heartbeat_map
> is_healthy 'MDSRank' had timed out after 15
>
> and ceph fs status reports only single-digit ops/s for all three MDSs
> (mostly flat 0). I ran ceph mds fail 1 to fail the MDS and force a
> standby to take over, which went without problems. Almost immediately
> after, all three now-active MDSs started reporting > 900 ops/s and the
> FS started working properly again. For some strange reason, the failed
> MDS didn't restart, though. It kept reporting the log message above
> until I manually restarted the daemon process.
>

Looks like the MDS entered some long (possibly infinite) loop. If this
happens again, could you attach gdb to it and run the command 'thread
apply all bt' inside gdb?

> Is anybody else experiencing such issues or are there any configuration
> parameters that I can tweak to avoid this behaviour?
>
> Thanks
> Janek
>
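For reference, the tuning described earlier in the thread might look roughly like the following ceph.conf fragment. The values are the ones quoted in the thread; the section placement is an assumption, and each option name should be verified against your Ceph release before applying anything:

```ini
[mds]
# 10 GiB cache: the 1 GiB default was far too low for this workload,
# while more than 10 GiB reportedly caused trouble during replay.
mds_cache_memory_limit = 10737418240

[global]
# Larger OSD map message batches and map cache, as described in the thread.
osd_map_message_max = 100
osd_map_cache_size = 150
```

The recall decay rate mentioned in the thread is tuned separately and, as noted, was not documented at the time.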