Hey ceph-users,

may I ask (nag) again about this issue? I am wondering if anybody can confirm my observations. I raised a bug at https://tracker.ceph.com/issues/54136, but apart from the assignment to a
dev a while ago there has been no response yet.

Maybe I am just holding it wrong, please someone enlighten me.


Thank you and with kind regards

Christian




On 02/02/2022 20:10, Christian Rohmann wrote:

Hey ceph-users,


I am debugging a mgr pg_autoscaler WARN which states that the target_size_bytes on a pool would overcommit the available storage. There is only one pool with a value for target_size_bytes (=5T) defined, and that apparently would consume more than the available storage:

--- cut ---
# ceph health detail
HEALTH_WARN 1 subtrees have overcommitted pool target_size_bytes
[WRN] POOL_TARGET_SIZE_BYTES_OVERCOMMITTED: 1 subtrees have overcommitted pool target_size_bytes     Pools ['backups', 'images', 'device_health_metrics', '.rgw.root', 'redacted.rgw.control', 'redacted.rgw.meta', 'redacted.rgw.log', 'redacted.rgw.otp', 'redacted.rgw.buckets.index', 'redacted.rgw.buckets.data', 'redacted.rgw.buckets.non-ec'] overcommit available storage by 1.011x due to target_size_bytes 15.0T on pools ['redacted.rgw.buckets.data'].
--- cut ---


But looking at the actual usage, it seems strange that 15T (5T * 3 replicas) should not fit into the remaining 122 TiB AVAIL:


--- cut ---
# ceph df detail
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
hdd    293 TiB  122 TiB  171 TiB   171 TiB      58.44
TOTAL  293 TiB  122 TiB  171 TiB   171 TiB      58.44

--- POOLS ---
POOL                          ID  PGS   STORED   (DATA)   (OMAP)   OBJECTS  USED     (DATA)   (OMAP)   %USED  MAX AVAIL  QUOTA OBJECTS  QUOTA BYTES  DIRTY  USED COMPR  UNDER COMPR
backups                        1  1024   92 TiB   92 TiB  3.8 MiB   28.11M  156 TiB  156 TiB   11 MiB  64.77     28 TiB  N/A            N/A          N/A       39 TiB      123 TiB
images                         2    64  1.7 TiB  1.7 TiB  249 KiB  471.72k  5.2 TiB  5.2 TiB  748 KiB   5.81     28 TiB  N/A            N/A          N/A          0 B          0 B
device_health_metrics         19     1   82 MiB      0 B   82 MiB       43  245 MiB      0 B  245 MiB      0     28 TiB  N/A            N/A          N/A          0 B          0 B
.rgw.root                     21    32   23 KiB   23 KiB      0 B       25  4.1 MiB  4.1 MiB      0 B      0     28 TiB  N/A            N/A          N/A          0 B          0 B
redacted.rgw.control          22    32      0 B      0 B      0 B        8      0 B      0 B      0 B      0     28 TiB  N/A            N/A          N/A          0 B          0 B
redacted.rgw.meta             23    32  1.7 MiB  394 KiB  1.3 MiB    1.38k  237 MiB  233 MiB  3.9 MiB      0     28 TiB  N/A            N/A          N/A          0 B          0 B
redacted.rgw.log              24    32   53 MiB  500 KiB   53 MiB    7.60k  204 MiB   47 MiB  158 MiB      0     28 TiB  N/A            N/A          N/A          0 B          0 B
redacted.rgw.otp              25    32  5.2 KiB      0 B  5.2 KiB        0   16 KiB      0 B   16 KiB      0     28 TiB  N/A            N/A          N/A          0 B          0 B
redacted.rgw.buckets.index    26    32  1.2 GiB      0 B  1.2 GiB    7.46k  3.5 GiB      0 B  3.5 GiB      0     28 TiB  N/A            N/A          N/A          0 B          0 B
redacted.rgw.buckets.data     27   128  3.1 TiB  3.1 TiB      0 B    3.53M  9.5 TiB  9.5 TiB      0 B  10.11     28 TiB  N/A            N/A          N/A          0 B          0 B
redacted.rgw.buckets.non-ec   28    32      0 B      0 B      0 B        0      0 B      0 B      0 B      0     28 TiB  N/A            N/A          N/A          0 B          0 B
--- cut ---


I then looked at how those values are determined, at https://github.com/ceph/ceph/blob/9f723519257eca039126a20aa6a2a7d2dbfb5dba/src/pybind/mgr/pg_autoscaler/module.py#L509. Apparently "total_bytes" is compared with the capacity of the root_map. I added a debug line and found that the total in my cluster was already at:

  total=325511007759696

so roughly 296 TiB, which already exceeds the 293 TiB of raw capacity shown above, hence the warning. But looking at "ceph df" again, such a total seems strange given that only 171 TiB are actually in use.
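For reference, the back-of-the-envelope arithmetic (using the total from my debug line and the rounded raw capacity from "ceph df" above):

--- cut ---
# rough cross-check, not part of the autoscaler code
total = 325511007759696          # total_bytes from my debug line
capacity = 293 * 2**40           # ~293 TiB raw capacity per "ceph df"
print(total / 2**40)             # ~296 TiB
print(total / capacity)          # ~1.01, roughly the 1.011x overcommit from the warning
--- cut ---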



Looking at how this total is calculated at https://github.com/ceph/ceph/blob/9f723519257eca039126a20aa6a2a7d2dbfb5dba/src/pybind/mgr/pg_autoscaler/module.py#L441, you can see that for each pool the larger value (max) of "actual_raw_used" and "target_bytes * raw_used_rate" is taken, and those values are then summed up.
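In simplified form the logic boils down to something like this (my paraphrase of the linked code with illustrative dict keys, not the verbatim upstream implementation):

--- cut ---
# paraphrased per-root total from pg_autoscaler/module.py (simplified, not verbatim)
def estimate_total_bytes(pools):
    total = 0
    for p in pools:
        actual_raw_used = p['logical_used'] * p['raw_used_rate']
        target_raw_used = p['target_bytes'] * p['raw_used_rate']
        # per pool, the larger of the two wins ...
        total += max(actual_raw_used, target_raw_used)
    # ... and the sum is later compared against the capacity of the CRUSH root
    return total
--- cut ---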


I dumped the values for all pools in my cluster with yet another line of debug code:

---cut ---
pool_id 1 - actual_raw_used=303160109187420.0, target_bytes=0 raw_used_rate=3.0
pool_id 2 - actual_raw_used=5714098884702.0, target_bytes=0 raw_used_rate=3.0
pool_id 19 - actual_raw_used=256550760.0, target_bytes=0 raw_used_rate=3.0
pool_id 21 - actual_raw_used=71433.0, target_bytes=0 raw_used_rate=3.0
pool_id 22 - actual_raw_used=0.0, target_bytes=0 raw_used_rate=3.0
pool_id 23 - actual_raw_used=5262798.0, target_bytes=0 raw_used_rate=3.0
pool_id 24 - actual_raw_used=162299940.0, target_bytes=0 raw_used_rate=3.0
pool_id 25 - actual_raw_used=16083.0, target_bytes=0 raw_used_rate=3.0
pool_id 26 - actual_raw_used=3728679936.0, target_bytes=0 raw_used_rate=3.0
pool_id 27 - actual_raw_used=10035209699328.0, target_bytes=5497558138880 raw_used_rate=3.0
pool_id 28 - actual_raw_used=0.0, target_bytes=0 raw_used_rate=3.0
--- cut ---


All values but those of pool_id 1 (backups) make sense. For backups, a MUCH larger actual_raw_used value is reported than what is shown via ceph df. The only difference between that pool and the others is that compression is enabled:


--- cut ---
# ceph osd pool get backups compression_mode
compression_mode: aggressive
--- cut ---
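A quick cross-check of the pool 1 numbers (my own arithmetic, based on the debug output and the "ceph df" output above) suggests that actual_raw_used is derived from the logical STORED bytes rather than from the post-compression USED bytes:

--- cut ---
# backups (pool_id 1): debug value vs. "ceph df"
actual_raw_used = 303160109187420
print(actual_raw_used / 2**40)   # ~276 TiB, i.e. roughly 92 TiB STORED * 3 replicas
# "ceph df" however reports only 156 TiB USED for that pool,
# so the savings from compression are apparently not reflected here
--- cut ---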


Apparently there already was a similar issue (https://tracker.ceph.com/issues/41567) with a resulting commit (https://github.com/ceph/ceph/commit/dd6e752826bc762095be4d276e3c1b8d31293eb0) changing the field used for "pool_logical_used" from "bytes_used" to "stored".

But how does that take compressed (away) data into account? Does the "stored" field count all the logical bytes, i.e. sum up the uncompressed bytes for pools with compression, unlike "bytes_used"? This surely must be a bug then, as those bytes are not really "actual_raw_used".
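To illustrate what I mean, here is a small (untested) helper to compare the two fields per pool; it assumes that the JSON output of "ceph df detail" carries both "stored" and "bytes_used" in the per-pool stats:

--- cut ---
#!/usr/bin/env python3
# compare logical "stored" vs. post-compression "bytes_used" per pool
import json
import subprocess

df = json.loads(subprocess.check_output(
    ['ceph', 'df', 'detail', '--format', 'json']))
for pool in df['pools']:
    stats = pool['stats']
    print(f"{pool['name']}: stored={stats['stored']} "
          f"bytes_used={stats['bytes_used']}")
--- cut ---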



I was about to raise a bug, but I wanted to ask here on the ML first in case I have misunderstood the mechanisms at play.
Thanks and with kind regards,


Christian

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
