[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-12 Thread Igor Fedotov
Hey Konstantin, forgot to mention - indeed, clusters having a 4K bluestore min alloc size are more likely to be exposed to the issue. The key point is the difference between bluestore and bluefs allocation sizes. The issue is likely to pop up when user and DB data are collocated but different
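
For context, one way to compare the two allocation sizes on a live cluster is sketched below (a rough check, not taken from the thread; note that bluestore_min_alloc_size is baked in when an OSD is created, so the config values only describe newly built OSDs):

    # BlueStore data allocation unit (applies to OSDs created after the value was set)
    ceph config get osd bluestore_min_alloc_size_hdd
    ceph config get osd bluestore_min_alloc_size_ssd
    # BlueFS/RocksDB allocation unit on a shared (collocated) device
    ceph config get osd bluefs_shared_alloc_size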

[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-12 Thread Konstantin Shalygin
Hi Igor, > On 12 Sep 2023, at 15:28, Igor Fedotov wrote: > > The default hybrid allocator (as well as the AVL one it's based on) could take > a dramatically long time to allocate pretty large (hundreds of MBs) 64K-aligned > chunks for BlueFS. On the original cluster it was exposed as 20-30 sec OSD >

[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-12 Thread Igor Fedotov
Hi All, as promised, here is a postmortem analysis of what happened. The following ticket (https://tracker.ceph.com/issues/62815) with accompanying materials provides a low-level overview of the issue. In a few words, it is as follows: the default hybrid allocator (as well as the AVL one it's based
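
For readers following along, a quick way to see which allocator the OSDs are using is sketched below (bitmap is shown only as an illustrative alternative, not as a recommendation made in this thread; changing the allocator requires an OSD restart):

    # show the allocator configured for OSDs (default: hybrid)
    ceph config get osd bluestore_allocator
    # example of overriding it cluster-wide, then restarting OSDs:
    # ceph config set osd bluestore_allocator bitmap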

[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-11 Thread J-P Methot
The bluestore configuration was 100% default when we did the upgrade and the issue happened. We provided Igor with an OSD dump and a DB dump last Friday, so hopefully you can figure out something from it. On 9/8/23 02:48, Konstantin Shalygin wrote: Does this cluster use the default settings, or

[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-08 Thread Konstantin Shalygin
Does this cluster use the default settings, or was something changed for Bluestore? You can check this via `ceph config diff`. As Mark said, it would be nice to have a tracker, if this is really a release problem. Thanks, k Sent from my iPhone > On 7 Sep 2023, at 20:22, J-P Methot wrote: > > We went from
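
As a sketch of the check Konstantin suggests (osd.0 is just a placeholder; the per-daemon form runs against the admin socket on that OSD's host):

    # settings that differ from the built-in defaults for one daemon
    ceph daemon osd.0 config diff
    # cluster-wide overrides stored in the mon config database
    ceph config dump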

[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-08 Thread Stefan Kooman
On 07-09-2023 19:20, J-P Methot wrote: We went from 16.2.13 to 16.2.14. Also, the timeout is 15 seconds because it's the default in Ceph. Basically, 15 seconds before Ceph shows a warning that an OSD is timing out. We may have found the solution, but it would be, in fact, related to

[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-07 Thread xiaowenhao111
I also see the dreaded timeout. I find this is a bcache problem; you can use the blktrace tools to capture I/O data for analysis. Sent from my Xiaomi. On 7 Sep 2023 at 22:52, Stefan Kooman wrote: On 07-09-2023 09:05, J-P Methot wrote: > Hi, > > We're running the latest Pacific on our production cluster and we've been > seeing the dreaded
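
A minimal blktrace capture along those lines might look like this (the device name and duration are placeholders; run it on the OSD host while the stalls are occurring):

    # trace block-layer I/O on the suspect bcache device for 30 seconds
    blktrace -d /dev/bcache0 -w 30 -o osd_trace
    # summarize the captured events
    blkparse -i osd_trace | less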

[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-07 Thread Mark Nelson
Oh, that's very good to know. I'm sure Igor will respond here, but do you know which PR this was related to? (possibly https://github.com/ceph/ceph/pull/50321) If we think there's a regression here, we should get it into the tracker ASAP. Mark On 9/7/23 13:45, J-P Methot wrote: To be

[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-07 Thread J-P Methot
To be quite honest, I will not pretend I have a low-level understanding of what was going on. There is very little documentation as to what the bluestore allocator actually does, and we had to rely on Igor's help to find the solution, so my understanding of the situation is limited. What I

[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-07 Thread Mark Nelson
Ok, good to know. Please feel free to update us here with what you are seeing in the allocator. It might be worth opening a tracker ticket as well. I did some work in the AVL allocator a while back where we were repeating the linear search from the same offset on every allocation, getting

[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-07 Thread J-P Methot
Hi, By this point, we're 95% sure that, contrary to our previous beliefs, it's an issue with changes to the bluestore_allocator and not the compaction process. That said, I will keep this email in mind, as we will want to test optimizations to compaction in our test environment. On 9/7/23

[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-07 Thread J-P Methot
We went from 16.2.13 to 16.2.14. Also, the timeout is 15 seconds because it's the default in Ceph. Basically, 15 seconds before Ceph shows a warning that an OSD is timing out. We may have found the solution, but it would be, in fact, related to the bluestore_allocator and not the compaction process.
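
For reference, the 15-second value corresponds to the OSD op thread timeout; a hedged way to inspect it (option names as in Pacific):

    # warning threshold for a stuck op worker thread (default: 15 seconds)
    ceph config get osd osd_op_thread_timeout
    # the much larger suicide timeout, after which the OSD aborts itself
    ceph config get osd osd_op_thread_suicide_timeout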

[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-07 Thread Mark Nelson
Hello, There are two things that might help you here. One is to try the new "rocksdb_cf_compaction_on_deletion" feature that I added in Reef and that we backported to Pacific in 16.2.13. So far this appears to be a huge win for avoiding tombstone accumulation during iteration, which is often the
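
For anyone who wants to experiment with that feature, a sketch of enabling it is below; the exact option name and its tuning knobs should be confirmed against your release first:

    # find the option(s) related to compaction-on-deletion in this release
    ceph config ls | grep -i deletion
    # then enable it for all OSDs, using the exact name reported above, e.g.:
    # ceph config set osd rocksdb_cf_compact_on_deletion true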

[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-07 Thread Konstantin Shalygin
Hi, > On 7 Sep 2023, at 18:21, J-P Methot wrote: > > Since my post, we've been speaking with a member of the Ceph dev team. He > did, at first, believe it was an issue linked to the common performance > degradation after huge delete operations. So we did do offline compactions on > all our

[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-07 Thread J-P Methot
Hi, Since my post, we've been speaking with a member of the Ceph dev team. He did, at first, believe it was an issue linked to the common performance degradation after huge delete operations. So we did do offline compactions on all our OSDs. It fixed nothing, and we are going through the logs
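
For completeness, an offline compaction is typically run with the OSD stopped, along these lines (the OSD id and path are placeholders):

    systemctl stop ceph-osd@12
    # compact the embedded RocksDB of the stopped BlueStore OSD
    ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-12 compact
    systemctl start ceph-osd@12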

[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-07 Thread Stefan Kooman
On 07-09-2023 09:05, J-P Methot wrote: Hi, We're running the latest Pacific on our production cluster and we've been seeing the dreaded 'OSD::osd_op_tp thread 0x7f346aa64700' had timed out after 15.00954s' error. We have reasons to believe this happens each time the RocksDB compaction
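
One low-risk way to test that suspicion (the log path and OSD id are placeholders) is to correlate the timeout warnings with RocksDB compaction events in the OSD log:

    # compaction and timeout lines logged by the OSD
    grep -iE "compaction|timed out" /var/log/ceph/ceph-osd.12.log | less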

[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-07 Thread Alexander E. Patrakov
On an HDD-based Quincy 17.2.5 cluster (with DB/WAL on datacenter-class NVMe with enhanced power loss protection), I sometimes (once or twice per week) see log entries similar to what I reproduced below (a bit trimmed): Wed 2023-09-06 22:41:54 UTC ceph-osd09 ceph-osd@39.service[5574]:

[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-07 Thread J-P Methot
We're talking about automatic online compaction here, not running the command. On 9/7/23 04:04, Konstantin Shalygin wrote: Hi, On 7 Sep 2023, at 10:05, J-P Methot wrote: We're running the latest Pacific on our production cluster and we've been seeing the dreaded 'OSD::osd_op_tp thread
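
(The manual counterpart J-P is distinguishing this from would be the per-OSD compact command, sketched below with a placeholder OSD id.)

    # trigger an on-demand RocksDB compaction on a running OSD
    ceph tell osd.12 compact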

[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-07 Thread Konstantin Shalygin
Hi, > On 7 Sep 2023, at 10:05, J-P Methot wrote: > > We're running latest Pacific on our production cluster and we've been seeing > the dreaded 'OSD::osd_op_tp thread 0x7f346aa64700' had timed out after > 15.00954s' error. We have reasons to believe this happens each time the > RocksDB