Re: [ceph-users] Ceph MDS randomly hangs with no useful error message
> I couldn't find any clue in the backtrace. Please run 'ceph daemon mds.xxx dump_historic_ops' and 'ceph daemon mds.xxx perf reset; ceph daemon mds.xxx perf dump' and send us the outputs.

Hi,

I assume you mean ceph daemon mds.xxx perf reset _all_?

Here's the output of dump_historic_ops: https://pastebin.com/yxvjJHY9 and of perf dump: https://pastebin.com/BfpAiYT7
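For reference, the admin socket commands being discussed are run on the host of the (hung) active MDS. A minimal sketch, assuming the daemon is called mds.xxx as above; the exact argument accepted by 'perf reset' may differ between releases:

    # dump the slowest recently completed metadata operations
    ceph daemon mds.xxx dump_historic_ops > historic_ops.json

    # reset the perf counters, let the workload run for a while, then dump them
    ceph daemon mds.xxx perf reset _all_
    ceph daemon mds.xxx perf dump > perf_dump.json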
Re: [ceph-users] Ceph MDS randomly hangs with no useful error message
On Tue, Jan 21, 2020 at 12:09 AM Janek Bevendorff wrote:
>
> Hi, I did as you asked and created a thread dump with GDB on the blocking MDS. Here's the result: https://pastebin.com/pPbNvfdb
>

I couldn't find any clue in the backtrace. Please run 'ceph daemon mds.xxx dump_historic_ops' and 'ceph daemon mds.xxx perf reset; ceph daemon mds.xxx perf dump' and send us the outputs.
Re: [ceph-users] Ceph MDS randomly hangs with no useful error message
Hi,

I did as you asked and created a thread dump with GDB on the blocking MDS. Here's the result: https://pastebin.com/pPbNvfdb
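For anyone who needs to capture a similar dump: a rough sketch of how a thread backtrace can be collected from a running ceph-mds process, assuming gdb and the ceph debug symbols are installed (<pid> stands for the MDS process id; note that the daemon is paused while gdb is attached):

    # non-interactive: dump backtraces of all threads, then detach
    gdb --batch --pid <pid> -ex 'thread apply all bt' > mds-threads.txt

    # or interactively
    gdb -p <pid>
    (gdb) thread apply all bt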
Re: [ceph-users] Ceph MDS randomly hangs with no useful error message
Thanks, I will do that.

Right now we are seeing quite a bit of lag when listing folders, which is probably caused by another client using the system heavily. Unfortunately, it's rather hard to debug at the moment, since the suspected client has to go through our Ganesha bridge instead of connecting to CephFS directly. The FS remains operable and generally usable, but the increased latency is still quite annoying. I wonder whether that can be mitigated with optimised cluster settings.

I will send you a GDB trace once we encounter another potential MDS loop.
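One way to narrow down which client session is generating the metadata load is the MDS admin socket; a rough sketch, again with mds.xxx as a placeholder for the daemon name:

    # list client sessions, including how many caps each one holds
    ceph daemon mds.xxx session ls

    # show the metadata operations currently being processed
    ceph daemon mds.xxx dump_ops_in_flight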
Re: [ceph-users] Ceph MDS randomly hangs with no useful error message
On Fri, Jan 17, 2020 at 4:47 PM Janek Bevendorff wrote:
>
> Hi,
>
> We have a CephFS in our cluster with 3 MDSs to which more than 300 clients connect at any given time. The FS contains about 80 TB of data and many millions of files, so it is important that metadata operations work smoothly even when listing large directories.
>
> Previously, we had massive stability problems: the MDS nodes crashed or timed out regularly because they failed to recall caps fast enough, and they weren't able to rejoin afterwards without us resetting the mds*_openfiles objects (see https://lists.ceph.io/hyperkitty/list/ceph-us...@ceph.io/thread/AOYWQSONTFROPB4DXVYADWW7V25C3G6Z/ for details).
>
> We have managed to adjust our configuration to avoid this problem. This mostly comes down to adjusting the recall decay rate (which still isn't documented), massively reducing any scrubbing activity, allowing no more than 10G for mds_cache_memory_limit (the default of 1G is way too low, but more than 10G seems to cause trouble during replay), increasing osd_map_message_max to 100, and osd_map_cache_size to 150. We haven't seen crashes since. But what we do see is that one of the MDS nodes will randomly lock up, and the ceph_mds_reply_latency metric goes up and then stays at a higher level than on any other MDS. The FS is not completely down as a result, but everything lags so massively that it is not usable.
>
> Unfortunately, all the hung MDS is reporting is:
>
> -77> 2020-01-17 09:29:17.891 7f34c967b700 0 mds.beacon.XXX Skipping beacon heartbeat to monitors (last acked 320.587s ago); MDS internal heartbeat is not healthy!
> -76> 2020-01-17 09:29:18.391 7f34c967b700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
>
> and ceph fs status reports only single-digit ops/s for all three MDSs (mostly a flat 0). I ran ceph mds fail 1 to fail the MDS and force a standby to take over, which went without problems. Almost immediately afterwards, all three now-active MDSs started reporting more than 900 ops/s and the FS started working properly again. For some strange reason, the failed MDS didn't restart, though; it kept reporting the log message above until I manually restarted the daemon process.
>

Looks like the MDS entered some long (or infinite) loop. If this happens again, could you attach gdb to the process and run 'thread apply all bt' inside gdb?

> Is anybody else experiencing such issues, or are there any configuration parameters I can tweak to avoid this behaviour?
>
> Thanks
> Janek
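As a rough illustration of the tuning described in the original post, the settings might look like this in ceph.conf. The cache limit and the two osdmap values are taken from the post; the recall option name and its value are assumptions (the post only says "recall decay rate" and does not name the exact option or value used), so treat this as a sketch rather than a recommendation:

    [mds]
    # 10G metadata cache; the 1G default was far too low for this workload
    mds_cache_memory_limit = 10737418240
    # placeholder for the "recall decay rate" tuning mentioned in the post;
    # the exact option and value used there are not stated
    mds_recall_max_decay_rate = 2.5

    [global]
    # osdmap-related values reported in the post
    osd_map_message_max = 100
    osd_map_cache_size = 150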