[ceph-users] Re: Urgent help needed please - MDS offline

2020-10-23 Thread Patrick Donnelly
On Fri, Oct 23, 2020 at 9:02 AM David C wrote: > Success! I remembered I had a server I'd taken out of the cluster to investigate some issues, that had some good quality 800GB Intel DC SSDs, dedicated an entire drive to swap, tuned up min_free_kbytes, added an MDS to that server and

[ceph-users] Re: Urgent help needed please - MDS offline

2020-10-23 Thread David C
Success! I remembered I had a server I'd taken out of the cluster to investigate some issues, one that had some good quality 800GB Intel DC SSDs. I dedicated an entire drive to swap, tuned up min_free_kbytes, added an MDS to that server and let it run. It took 3-4 hours but the MDS eventually came back online.
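
A rough sketch of that swap-plus-min_free_kbytes setup, assuming a spare SSD at /dev/sdX; neither the device name nor the tuning value comes from the thread, both are placeholders:

    # Dedicate an entire spare SSD to swap (this destroys any data on the device).
    mkswap /dev/sdX
    swapon /dev/sdX

    # Keep a larger reserve of free pages so the kernel can reclaim under
    # pressure instead of invoking the OOM killer straight away.
    # 4 GB (expressed in kB) is only an illustrative value.
    sysctl -w vm.min_free_kbytes=4194304

    # Verify the swap device and memory headroom.
    swapon --show
    free -h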

[ceph-users] Re: Urgent help needed please - MDS offline

2020-10-22 Thread Frank Schilder
The post was titled "mds behind on trimming - replay until memory exhausted". > Load up with swap and try the up:replay route. > Set the beacon to 10 until it finishes. Good point! The MDS will not send beacons for a long time. Same was necessary in the other case. Good luck!

[ceph-users] Re: Urgent help needed please - MDS offline

2020-10-22 Thread Frank Schilder
> Sent: 22 October 2020 18:11:57 > To: David C > Cc: ceph-devel; ceph-users > Subject: [ceph-users] Re: Urgent help needed please - MDS offline > I assume you aren't able to quickly double the RAM on this MDS? or failover to a new MDS with more ram? Failing that, you shouldn't reset the journal without

[ceph-users] Re: Urgent help needed please - MDS offline

2020-10-22 Thread Dan van der Ster
> He could quickly add sufficient swap and the MDS managed to come up. Took a long time though, but might be faster than getting more RAM and will not lose data. > Your clients will not be able to do much, if anything, during recovery though.

[ceph-users] Re: Urgent help needed please - MDS offline

2020-10-22 Thread David C
>> be faster than getting more RAM and will not lose data. >> Your clients will not be able to do much, if anything, during recovery though. >> Best regards, Frank Schilder, AIT Risø Campus

[ceph-users] Re: Urgent help needed please - MDS offline

2020-10-22 Thread Dan van der Ster
> Your clients will not be able to do much, if anything, during recovery though. > Best regards, Frank Schilder, AIT Risø Campus, Bygning 109, rum S14

[ceph-users] Re: Urgent help needed please - MDS offline

2020-10-22 Thread David C
Dan van der Ster > Sent: 22 October 2020 18:11:57 > To: David C > Cc: ceph-devel; ceph-users > Subject: [ceph-users] Re: Urgent help needed please - MDS offline > I assume you aren't able to quickly double the RAM on this MDS? or failover to a new MDS with more ram?

[ceph-users] Re: Urgent help needed please - MDS offline

2020-10-22 Thread Dan van der Ster
I assume you aren't able to quickly double the RAM on this MDS? or failover to a new MDS with more ram? Failing that, you shouldn't reset the journal without recovering dentries, otherwise the cephfs_data objects won't be consistent with the metadata. The full procedure to be used is here:
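
The truncated link presumably points at the upstream CephFS disaster-recovery procedure. As a rough sketch only, and not a substitute for the full documented steps, the dentry-recovery-before-reset workflow it refers to looks like this (run with the MDS daemons stopped, and back up the journal first):

    # Back up the journal before touching anything.
    cephfs-journal-tool journal export /root/mds-journal-backup.bin

    # Replay journal events into the metadata pool so the on-disk metadata
    # catches up with what the journal contains.
    cephfs-journal-tool event recover_dentries summary

    # Only after dentries have been recovered is a journal reset reasonable.
    cephfs-journal-tool journal reset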

[ceph-users] Re: Urgent help needed please - MDS offline

2020-10-22 Thread David C
I'm pretty sure it's replaying the same ops every time; the last "EMetaBlob.replay updated dir" before it dies is always referring to the same directory. Although interestingly, that particular dir shows up in the log thousands of times - the dir appears to be where a desktop app is doing some

[ceph-users] Re: Urgent help needed please - MDS offline

2020-10-22 Thread Dan van der Ster
I wouldn't adjust it. Do you have the impression that the MDS is replaying the exact same ops every time it restarts? Or is it progressing and trimming the journal over time? The only other advice I have is that 12.2.10 is quite old, and might miss some important replay/mem fixes. I'm

[ceph-users] Re: Urgent help needed please - MDS offline

2020-10-22 Thread David C
I've not touched the journal segments; the current value of mds_log_max_segments is 128. Would you recommend I increase (or decrease) that value? And do you think I should change mds_log_max_expiring to match that value? On Thu, Oct 22, 2020 at 3:06 PM Dan van der Ster wrote: > You could decrease

[ceph-users] Re: Urgent help needed please - MDS offline

2020-10-22 Thread Dan van der Ster
You could decrease the mds_cache_memory_limit but I don't think this will help here during replay. You can see a related tracker here: https://tracker.ceph.com/issues/47582. This is possibly caused by replaying a very large journal. Did you increase the journal segments? -- dan
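
On a 12.2.x cluster these values can be inspected and changed at runtime through the admin socket or injectargs; a sketch, where <name> is the MDS daemon name and the 8 GiB figure is only illustrative:

    # Check the current values on the MDS host via the admin socket.
    ceph daemon mds.<name> config get mds_log_max_segments
    ceph daemon mds.<name> config get mds_cache_memory_limit

    # Lower the cache limit at runtime (value in bytes; 8 GiB shown here).
    # Persisting the change on Luminous still requires editing ceph.conf.
    ceph tell mds.<name> injectargs '--mds_cache_memory_limit=8589934592'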

[ceph-users] Re: Urgent help needed please - MDS offline

2020-10-22 Thread David C
Dan, many thanks for the response. I was going down the route of looking at mds_beacon_grace but I now realise that when I start my MDS, it's swallowing up memory rapidly and it looks like the oom-killer is eventually killing the MDS. With debug upped to 10, I can see it's doing EMetaBlob.replays on
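
Raising the MDS debug level and confirming the OOM kills can be done roughly as follows; the daemon name and log path are placeholders:

    # Bump MDS logging to 10 via the admin socket on the MDS host.
    ceph daemon mds.<name> config set debug_mds 10

    # Follow the replay in the MDS log.
    tail -f /var/log/ceph/ceph-mds.<name>.log | grep EMetaBlob.replay

    # Check the kernel log for OOM killer activity.
    dmesg -T | grep -i 'out of memory'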

[ceph-users] Re: Urgent help needed please - MDS offline

2020-10-22 Thread Dan van der Ster
You can disable that beacon by increasing mds_beacon_grace to 300 or 600. This will stop the mon from failing that mds over to a standby. I don't know if that is set on the mon or mgr, so I usually set it on both. (You might as well disable the standby too -- no sense in something failing back and
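
On a Luminous cluster that usually means either a ceph.conf entry or injectargs; a sketch, using the 600-second grace mentioned above:

    # ceph.conf on the mon (and mgr) hosts, applied on restart:
    [global]
        mds_beacon_grace = 600

    # Or inject at runtime on the mons without a restart:
    ceph tell mon.\* injectargs '--mds_beacon_grace=600'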