[ceph-users] Re: OSD reboot loop after running out of memory

2021-01-02 Thread Stefan Wild
Of course, I forgot to mention that. Thank you for bringing it up! We made sure balancer and PG autoscaler were turned off for the (only) pool that uses those PGs shortly after we noticed the cycle of remapping/backfilling:

    # ceph balancer status
    { "active": false,
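For reference, turning both features off can be done as follows; this is a minimal sketch, and the pool name is a placeholder for whichever pool holds the affected PGs:

    # Disable the balancer cluster-wide and confirm it is off
    ceph balancer off
    ceph balancer status

    # Disable the PG autoscaler for one pool (pool name is a placeholder)
    ceph osd pool set <pool-name> pg_autoscale_mode off
    ceph osd pool autoscale-status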

[ceph-users] Re: OSD reboot loop after running out of memory

2021-01-01 Thread Stefan Wild
Our setup is not using SSDs as the Bluestore DB devices. We only have 2 SSDs vs 12 HDDs, which is normally fine for the low workload of the cluster. The SSDs are serving a pool that is just used by RGW for index and meta. Since the compaction two weeks ago the OSDs have all been stable.
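To confirm how each OSD's DB is laid out (whether it has a dedicated DB device at all), the OSD metadata can be inspected; osd.0 is an example ID:

    # Fields such as bluefs_dedicated_db and bluefs_db_type show the DB layout
    ceph osd metadata 0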

[ceph-users] Re: OSD reboot loop after running out of memory

2020-12-16 Thread Frédéric Nass
Regarding RocksDB compaction, if you were in a situation where RocksDB had overspilled to HDDs (if your cluster is using a hybrid setup), the compaction should have moved the bits back to the fast devices. So it might have helped in this situation too. Regards, Frédéric. On 16/12/2020 at 09:57,
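For anyone checking whether overspill applies to their cluster, and how to run the compaction: a sketch assuming a running OSD with an admin socket (osd.0 and the data path are examples; the systemd unit name shown is for a non-containerized deployment):

    # Any DB that has spilled onto the slow device shows up as a health warning
    ceph health detail | grep -i BLUEFS_SPILLOVER

    # Online compaction via the admin socket (OSD stays up)
    ceph daemon osd.0 compact

    # Offline compaction with the OSD stopped
    systemctl stop ceph-osd@0
    ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-0 compact
    systemctl start ceph-osd@0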

[ceph-users] Re: OSD reboot loop after running out of memory

2020-12-16 Thread Frédéric Nass
Hi Stefan, This has me thinking that the issue your cluster may be facing is probably with bluefs_buffered_io set to true, as this has been reported to induce excessive swap usage (with OSDs flapping or OOMing as a consequence) in some versions starting from Nautilus, I believe. Can you check
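To check and change the setting in question (osd.0 is an example ID; depending on the release, a change may only take effect after an OSD restart):

    # Current value, cluster-wide and on a running daemon
    ceph config get osd bluefs_buffered_io
    ceph daemon osd.0 config get bluefs_buffered_io

    # Turn buffered I/O for BlueFS off
    ceph config set osd bluefs_buffered_io false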

[ceph-users] Re: OSD reboot loop after running out of memory

2020-12-14 Thread Frédéric Nass
ick. Will keep you posted, here. Thanks, Stefan From: Igor Fedotov Sent: Monday, December 14, 2020 6:39:28 AM To: Stefan Wild ; ceph-users@ceph.io Subject: Re: [ceph-users] Re: OSD reboot loop after running out of memory Hi Stefan, given the crash backtrace i

[ceph-users] Re: OSD reboot loop after running out of memory

2020-12-14 Thread Stefan Wild
Hi Frédéric, Thanks for the additional input. We are currently only running RGW on the cluster, so no snapshot removal, but there have been plenty of remappings with the OSDs failing (all of them at first, during and after the OOM incident, then one by one). I haven't had a chance to look into

[ceph-users] Re: OSD reboot loop after running out of memory

2020-12-14 Thread Frédéric Nass
be amazing if that does the trick. Will keep you posted, here. Thanks, Stefan From: Igor Fedotov Sent: Monday, December 14, 2020 6:39:28 AM To: Stefan Wild ; ceph-users@ceph.io Subject: Re: [ceph-users] Re: OSD reboot loop after running out of memory Hi Stef

[ceph-users] Re: OSD reboot loop after running out of memory

2020-12-14 Thread Stefan Wild
Subject: Re: [ceph-users] Re: OSD reboot loop after running out of memory Hi Stefan, given the crash backtrace in your log I presume some data removal is in progress:

    Dec 12 21:58:38 ceph-tpa-server1 bash[784256]: 3: (KernelDevice::direct_read_unaligned(unsigned long, unsigned long, char*)+0xd8

[ceph-users] Re: OSD reboot loop after running out of memory

2020-12-14 Thread Igor Fedotov
Just a note - all the below is almost completely unrelated to high RAM usage. The latter is a different issue which presumably just triggered PG removal one... On 12/14/2020 2:39 PM, Igor Fedotov wrote: Hi Stefan, given the crash backtrace in your log I presume some data removal is in

[ceph-users] Re: OSD reboot loop after running out of memory

2020-12-14 Thread Stefan Wild
ld you please comment on how to safely deal with these bugs, or how to avoid them, if indeed they occur? Thanks a lot, samuel huxia...@horebdata.cn From: Kalle Happonen Date: 2020-12-14 08:28 To: Stefan Wild CC: ceph-users Subjec

[ceph-users] Re: OSD reboot loop after running out of memory

2020-12-14 Thread Igor Fedotov
Hi Stefan, given the crash backtrace in your log I presume some data removal is in progress:

    Dec 12 21:58:38 ceph-tpa-server1 bash[784256]: 3: (KernelDevice::direct_read_unaligned(unsigned long, unsigned long, char*)+0xd8) [0x5587b9364a48]
    Dec 12 21:58:38 ceph-tpa-server1 bash[784256]: 4:
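If PG removal load is indeed the trigger, one general-purpose knob (not specific advice from this thread) is the per-device-class delete sleep, which inserts a pause between removal transactions:

    # Slow down PG/object deletion on HDD-backed OSDs (value in seconds)
    ceph config set osd osd_delete_sleep_hdd 1
    ceph config get osd osd_delete_sleep_hdd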

[ceph-users] Re: OSD reboot loop after running out of memory

2020-12-14 Thread Kalle Happonen
/ceph/ceph/pull/35584 Cheers, Kalle - Original Message - > From: huxia...@horebdata.cn > To: "Kalle Happonen" , "Stefan Wild" > > Cc: "ceph-users" > Sent: Monday, 14 December, 2020 10:27:57 > Subject: Re: [ceph-users] Re: OSD reboot

[ceph-users] Re: OSD reboot loop after running out of memory

2020-12-14 Thread huxia...@horebdata.cn
a lot, samuel huxia...@horebdata.cn From: Kalle Happonen Date: 2020-12-14 08:28 To: Stefan Wild CC: ceph-users Subject: [ceph-users] Re: OSD reboot loop after running out of memory Hi Stefan, we had been seeing OSDs OOMing on 14.2.13, but on a larger scale. In our case we hit some bugs
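For OOMing OSDs in general, the usual first check is the per-daemon memory target; note it bounds the BlueStore caches, not the total RSS, so recovery or PG-removal spikes can still exceed it. A minimal sketch:

    # Inspect and lower the per-OSD memory target (value in bytes; ~4 GiB here)
    ceph config get osd osd_memory_target
    ceph config set osd osd_memory_target 4294967296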

[ceph-users] Re: OSD reboot loop after running out of memory

2020-12-13 Thread Kalle Happonen
- Original Message - > From: "Stefan Wild" > To: "Igor Fedotov" , "ceph-users" > Sent: Sunday, 13 December, 2020 14:46:44 > Subject: [ceph-users] Re: OSD reboot loop after running out of memory > Hi Igor, > > Full osd logs from startu

[ceph-users] Re: OSD reboot loop after running out of memory

2020-12-13 Thread Stefan Wild
Hi Igor, Full osd logs from startup to failed exit: https://tiltworks.com/osd.1.log In other news, can I expect osd.10 to go down next?

    Dec 13 07:40:14 ceph-tpa-server1 bash[1825010]: debug 2020-12-13T12:40:14.823+ 7ff37c2e1700 -1 osd.7 13375 heartbeat_check: no reply from
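To see which OSDs are currently down and what heartbeat or failure reports the monitors have collected (there is no way to predict the next failure from this alone):

    # OSDs currently marked down, plus any heartbeat/failure detail
    ceph osd tree down
    ceph health detail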

[ceph-users] Re: OSD reboot loop after running out of memory

2020-12-12 Thread Igor Fedotov
Hi Stefan, could you please share OSD startup log from /var/log/ceph? Thanks, Igor On 12/13/2020 5:44 AM, Stefan Wild wrote: Just had another look at the logs and this is what I did notice after the affected OSD starts up. Loads of entries of this sort: Dec 12 21:38:40 ceph-tpa-server1
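The bash[...] prefix in the pasted lines suggests a containerized (cephadm) deployment, where OSD logs land in the journal rather than /var/log/ceph; a sketch, with osd.1 and the fsid as placeholders:

    # Via the systemd unit (fsid is a placeholder)
    journalctl -u ceph-<fsid>@osd.1 --no-pager | tail -n 500

    # Or let cephadm resolve the unit name
    cephadm logs --name osd.1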

[ceph-users] Re: OSD reboot loop after running out of memory

2020-12-12 Thread Stefan Wild
Got a trace of the osd process, shortly after ceph status -w announced boot for the osd:

    strace: Process 784735 attached
    futex(0x5587c3e22fc8, FUTEX_WAIT_PRIVATE, 0, NULL) = ?
    +++ exited with 1 +++

It was stuck at that one call for several minutes before exiting. From: Stefan Wild Date:
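A single futex() line followed by exit usually means strace attached to just one waiting thread; following all threads with timestamps gives a more useful picture. A sketch, with the PID as a placeholder:

    # Follow all threads (-f), timestamp each syscall (-tt), write to a file
    strace -f -tt -p <osd-pid> -o /tmp/osd.strace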

[ceph-users] Re: OSD reboot loop after running out of memory

2020-12-12 Thread Stefan Wild
Just had another look at the logs and this is what I did notice after the affected OSD starts up. Loads of entries of this sort:

    Dec 12 21:38:40 ceph-tpa-server1 bash[780507]: debug 2020-12-13T02:38:40.851+ 7fafd32c7700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fafb721f700'
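The heartbeat_map messages mean an op worker thread exceeded its health-check timeout. The relevant thresholds and a view of what the stuck threads are doing can be inspected like this (osd.7 is an example ID):

    # What the op worker threads are currently working on
    ceph daemon osd.7 dump_ops_in_flight

    # Warning and suicide thresholds for op worker threads (seconds)
    ceph config get osd osd_op_thread_timeout
    ceph config get osd osd_op_thread_suicide_timeout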