1024 PGs on NVMe.
From: Anthony D'Atri
Sent: Friday, February 2, 2024 2:37 PM
To: Cory Snyder
Subject: Re: [ceph-users] OSD read latency grows over time
Thanks. What type of media are your index OSDs? How many PGs?
Yes, we changed osd_memory_target to 10 GB on just our index OSDs. These OSDs
have over 300 GB of lz4 compressed bucket index omap data. Here is a graph
showing the latencies before/after that single change:
https://pasteboard.co/IMCUWa1t3Uau.png
Cory Snyder
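For reference, a minimal sketch of how such a targeted change can be applied through the centralized config (the device-class mask and osd.12 are only examples of how index OSDs might be addressed; the ~10 GB figure is the one from the message above):

    # raise the memory target only for OSDs in a given CRUSH device class
    ceph config set osd/class:nvme osd_memory_target 10737418240   # ~10 GiB

    # or only for specific daemons
    ceph config set osd.12 osd_memory_target 10737418240

    # check what a given daemon resolves the option to
    ceph config get osd.12 osd_memory_target

osd_memory_target is generally runtime-adjustable, but verify the behaviour on your own release before rolling it out widely.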
From: Anthony D'Atri
You adjusted osd_memory_target? Higher than the default 4GB?
> Another thing that we've found is that rocksdb can become quite slow if it
> doesn't have enough memory for internal caches. As our cluster usage has
> grown, we've needed to increase OSD memory in accordance with bucket index
> growth. We found that increasing OSD memory improved rocksdb latencies
> by over 10x.
>
> Hope this helps!
>
> Cory Snyder
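A rough way to check whether that extra memory actually reaches the internal caches is the mempool accounting on the OSD's admin socket (osd.12 is an example id, and this only shows mempool-accounted memory, so treat it as an indicator rather than an exact measure):

    # per-pool memory accounting for a running OSD, run on the OSD host
    ceph daemon osd.12 dump_mempools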
From: Tobias Urdin
Sent: Friday, February 2, 2024 5:41 AM
To: ceph-users
Subject: [ceph-users] Re: OSD read latency grows over time
I found the internal note I made about it, see below.

When we trim thousands of OMAP keys in RocksDB, this calls
SingleDelete() in the RocksDBStore in Ceph, which causes tombstones in
the RocksDB database. Each of these thousands of tombstones then has to
be iterated over during reads, which is what drives the latency up until
they are compacted away.
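One mitigation is to compact the affected OSDs explicitly rather than waiting for RocksDB to get around to it. A sketch, assuming osd.12 and the default non-containerized data path:

    # online compaction of the OSD's RocksDB, no restart required
    ceph tell osd.12 compact

    # offline alternative with the OSD stopped
    ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-12 compact

Compaction itself generates I/O, so whether to run it during production hours is workload dependent.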
Chiming in here, just so that it's indexed in the archives.

We've had a lot of issues with tombstones when running RGW usage logging:
when we trim those, the tombstones are so numerous that they basically kill
the performance of the Ceph OSD hosting that usage.X object; restarting the OSD
On 1/26/24 11:26, Roman Pashin wrote:

>> Unfortunately they cannot. You'll want to set them in centralized conf
>> and then restart OSDs for them to take effect.
>
> Got it. Thank you Josh! Will put it in the config of the affected OSDs and
> restart them.
>
> Just curious, can decreasing rocksdb_cf_compact_on_deletion_trigger from
> 16384 to 4096 hurt performance of HDD OSDs in any way? I have no growing
> latency on the HDD OSDs, where data is stored, but it would be easier to set
> it in the [osd] section for all OSDs at once rather than cherry-picking only
> the SSD/NVMe OSDs.
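For what it's worth, the centralized config can also scope these options by CRUSH device class, so they don't have to be applied to the HDD OSDs at all; a minimal sketch (the class name and the numeric values are illustrative, not recommendations):

    # only for OSDs whose CRUSH device class is ssd
    ceph config set osd/class:ssd rocksdb_cf_compact_on_deletion_trigger 4096
    ceph config set osd/class:ssd rocksdb_cf_compact_on_deletion_sliding_window 32768   # illustrative value

    # or for every OSD, the equivalent of an [osd] section entry
    ceph config set osd rocksdb_cf_compact_on_deletion_trigger 4096

    # per Josh below, the new values only take effect after the OSDs are restarted

Either way the restart requirement still applies.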
> Do you know if rocksdb_cf_compact_on_deletion_trigger and
> rocksdb_cf_compact_on_deletion_sliding_window can be changed at runtime
> without an OSD restart?
Unfortunately they cannot. You'll want to set them in centralized conf
and then restart OSDs for them to take effect.
Josh
Hi Mark, thank you for the prompt answer.

> In v17.2.7 we enabled a feature that automatically performs a compaction
> if too many tombstones are present during iteration in RocksDB. It
> might be worth upgrading to see if it helps (you might have to try
> tweaking the settings if the defaults aren't helping enough). The PR
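If you're not sure whether the release you're running has that change, or what the defaults ended up being, the option metadata can be queried directly (osd.12 is just an example id):

    # description, type and default of the options on the installed release
    ceph config help rocksdb_cf_compact_on_deletion_trigger
    ceph config help rocksdb_cf_compact_on_deletion_sliding_window

    # value a specific running OSD is actually using
    ceph config show osd.12 | grep compact_on_deletion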
Hi Roman,

The fact that changing the pg_num for the index pool drops the latency
back down might be a clue. Do you have a lot of deletes happening on
this cluster? If you have a lot of deletes and long pauses between
writes, you could be accumulating tombstones that you have to keep
iterating over during reads.
Hi Stefan,

> Do you make use of a separate db partition as well? And if so, where is
> it stored?

No, only the WAL partition is on a separate NVMe partition. Not sure if
ceph-ansible could install Ceph with a db partition on a separate device on
v17.6.2.
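In case it helps anyone checking their own layout: whether an OSD has dedicated DB/WAL devices shows up in its metadata (osd.12 is an example id; exact field names can vary a bit between releases):

    ceph osd metadata 12 | grep -iE 'bluefs_dedicated|devices'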
> Do you only see latency increase in reads? And not in writes?
Hi Eugen,

> How is the data growth in your cluster? Is the pool size rather stable or
> is it constantly growing?

The pool size is fairly constant with a tiny upward trend. Its growth doesn't
correlate with the increase in OSD read latency. I've combined pool usage with
OSD read latency on one graph to
On 16-01-2024 11:22, Roman Pashin wrote:

> Hello Ceph users,
>
> we see a strange issue on a recent Ceph installation, v17.6.2. We store
> data on an HDD pool; the index pool is on SSD. Each OSD stores its WAL on an
> NVMe partition.

Do you make use of a separate db partition as well? And if so, where is
it stored?
Hi,
I checked two production clusters which don't use RGW too heavily,
both on Pacific though. There's no latency increase visible there. How
is the data growth in your cluster? Is the pool size rather stable or
is it constantly growing?
Thanks,
Eugen
Quoting Roman Pashin:

> Hello