Dear List,

Until we upgraded our cluster three weeks ago, we had a small but nicely
performing production Ceph cluster running Nautilus 14.2.22 on Proxmox 6.4
(kernel 5.4-143 at that time). Then we started the upgrade to Octopus 15.2.15.
Since we did an online upgrade, we disabled the automatic OMAP conversion with

ceph config set osd bluestore_fsck_quick_fix_on_mount false

and then performed the OMAP conversion after the upgrade was complete, step by
step, by restarting one OSD after the other.
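
In practice that looked roughly like this per OSD (a sketch from memory; whether
the flag is re-enabled globally or per OSD, and how long we waited between
restarts, are not meant to be exact):

# ceph config set osd bluestore_fsck_quick_fix_on_mount true
# systemctl restart ceph-osd@<id>
# ceph -s   (wait for HEALTH_OK before restarting the next OSD)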

Our setup:
5 x storage node, each: 16 x 2.3 GHz, 64 GB RAM, 1 x SSD OSD 1.6 TB, 1 x SSD OSD
7.68 TB (both WD Enterprise, SAS-12), 3 x HDD OSD (10 TB, SAS-12, with Optane
cache)
4 x Compute Nodes
40 GE Storage network (Mellanox Switch + Mellanox CX354 40GE Dual Port Cards, 
Linux OSS drivers)
10 GE Cluster/Mgmt Network

Our performance before the upgrade, Ceph 14.2.22 (about 36k IOPS on the SSD 
Pool)

### SSD Pool on 40GE Switches
# rados bench -p SSD 30 -t 256 -b 1024 write
hints = 1
Maintaining 256 concurrent writes of 1024 bytes to objects of size 1024 for up 
to 30 seconds or 0 objects
...
Total time run:         30.004
Total writes made:      1094320
Write size:             1024
Object size:            1024
Bandwidth (MB/sec):     35.6177
Stddev Bandwidth:       4.71909
Max bandwidth (MB/sec): 40.7314
Min bandwidth (MB/sec): 21.3037
Average IOPS:           36472
Stddev IOPS:            4832.35
Max IOPS:               41709
Min IOPS:               21815
Average Latency(s):     0.00701759
Stddev Latency(s):      0.00854068
Max latency(s):         0.445397
Min latency(s):         0.000909089
Cleaning up (deleting benchmark objects)

Our performance after the upgrade, Ceph 15.2.15 (drops to at most 17k IOPS on
the SSD pool)
# rados bench -p SSD 30 -t 256 -b 1024 write
hints = 1
Maintaining 256 concurrent writes of 1024 bytes to objects of size 1024 for up 
to 30 seconds or 0 objects
...
Total time run:         30.0146
Total writes made:      468513
Write size:             1024
Object size:            1024
Bandwidth (MB/sec):     15.2437
Stddev Bandwidth:       0.78677
Max bandwidth (MB/sec): 16.835
Min bandwidth (MB/sec): 13.3184
Average IOPS:           15609
Stddev IOPS:            805.652
Max IOPS:               17239
Min IOPS:               13638
Average Latency(s):     0.016396
Stddev Latency(s):      0.00777054
Max latency(s):         0.140793
Min latency(s):         0.00106735
Cleaning up (deleting benchmark objects)
Note: osd.17 is out on purpose
# ceph osd tree
ID   CLASS  WEIGHT     TYPE NAME            STATUS  REWEIGHT  PRI-AFF
 -1         208.94525  root default
 -3          41.43977      host xx-ceph01
  0    hdd    9.17380          osd.0            up   1.00000  1.00000
  5    hdd    9.17380          osd.5            up   1.00000  1.00000
 23    hdd   14.65039          osd.23           up   1.00000  1.00000
  7    ssd    1.45549          osd.7            up   1.00000  1.00000
 15    ssd    6.98630          osd.15           up   1.00000  1.00000
 -5          41.43977      host xx-ceph02
  1    hdd    9.17380          osd.1            up   1.00000  1.00000
  4    hdd    9.17380          osd.4            up   1.00000  1.00000
 24    hdd   14.65039          osd.24           up   1.00000  1.00000
  9    ssd    1.45549          osd.9            up   1.00000  1.00000
 20    ssd    6.98630          osd.20           up   1.00000  1.00000
 -7          41.43977      host xx-ceph03
  2    hdd    9.17380          osd.2            up   1.00000  1.00000
  3    hdd    9.17380          osd.3            up   1.00000  1.00000
 25    hdd   14.65039          osd.25           up   1.00000  1.00000
  8    ssd    1.45549          osd.8            up   1.00000  1.00000
 21    ssd    6.98630          osd.21           up   1.00000  1.00000
-17          41.43977      host xx-ceph04
 10    hdd    9.17380          osd.10           up   1.00000  1.00000
 11    hdd    9.17380          osd.11           up   1.00000  1.00000
 26    hdd   14.65039          osd.26           up   1.00000  1.00000
  6    ssd    1.45549          osd.6            up   1.00000  1.00000
 22    ssd    6.98630          osd.22           up   1.00000  1.00000
-21          43.18616      host xx-ceph05
 13    hdd    9.17380          osd.13           up   1.00000  1.00000
 14    hdd    9.17380          osd.14           up   1.00000  1.00000
 27    hdd   14.65039          osd.27           up   1.00000  1.00000
 12    ssd    1.45540          osd.12           up   1.00000  1.00000
 16    ssd    1.74660          osd.16           up   1.00000  1.00000
 17    ssd    3.49309          osd.17           up         0  1.00000
 18    ssd    1.74660          osd.18           up   1.00000  1.00000
 19    ssd    1.74649          osd.19           up   1.00000  1.00000

# ceph osd df
ID  CLASS  WEIGHT    REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
 0    hdd   9.17380   1.00000  9.2 TiB  2.5 TiB  2.4 TiB   28 MiB  5.0 GiB  6.6 TiB  27.56  0.96   88      up
 5    hdd   9.17380   1.00000  9.2 TiB  2.6 TiB  2.5 TiB   57 MiB  5.1 GiB  6.6 TiB  27.89  0.98   89      up
23    hdd  14.65039   1.00000   15 TiB  3.9 TiB  3.8 TiB   40 MiB  7.2 GiB   11 TiB  26.69  0.93  137      up
 7    ssd   1.45549   1.00000  1.5 TiB  634 GiB  633 GiB   33 MiB  1.8 GiB  856 GiB  42.57  1.49   64      up
15    ssd   6.98630   1.00000  7.0 TiB  2.6 TiB  2.6 TiB  118 MiB  5.9 GiB  4.4 TiB  37.70  1.32  272      up
 1    hdd   9.17380   1.00000  9.2 TiB  2.4 TiB  2.3 TiB   31 MiB  4.7 GiB  6.8 TiB  26.04  0.91   83      up
 4    hdd   9.17380   1.00000  9.2 TiB  2.6 TiB  2.5 TiB   28 MiB  5.2 GiB  6.6 TiB  28.51  1.00   91      up
24    hdd  14.65039   1.00000   15 TiB  4.0 TiB  3.9 TiB   38 MiB  7.2 GiB   11 TiB  27.06  0.95  139      up
 9    ssd   1.45549   1.00000  1.5 TiB  583 GiB  582 GiB   30 MiB  1.6 GiB  907 GiB  39.13  1.37   59      up
20    ssd   6.98630   1.00000  7.0 TiB  2.5 TiB  2.5 TiB   81 MiB  7.4 GiB  4.5 TiB  35.45  1.24  260      up
 2    hdd   9.17380   1.00000  9.2 TiB  2.4 TiB  2.3 TiB   26 MiB  4.8 GiB  6.8 TiB  26.01  0.91   83      up
 3    hdd   9.17380   1.00000  9.2 TiB  2.7 TiB  2.6 TiB   29 MiB  5.4 GiB  6.5 TiB  29.38  1.03   94      up
25    hdd  14.65039   1.00000   15 TiB  4.2 TiB  4.1 TiB   41 MiB  7.7 GiB   10 TiB  28.79  1.01  149      up
 8    ssd   1.45549   1.00000  1.5 TiB  637 GiB  635 GiB   34 MiB  1.7 GiB  854 GiB  42.71  1.49   65      up
21    ssd   6.98630   1.00000  7.0 TiB  2.5 TiB  2.5 TiB   96 MiB  7.5 GiB  4.5 TiB  35.49  1.24  260      up
10    hdd   9.17380   1.00000  9.2 TiB  2.2 TiB  2.1 TiB   26 MiB  4.5 GiB  7.0 TiB  24.21  0.85   77      up
11    hdd   9.17380   1.00000  9.2 TiB  2.5 TiB  2.4 TiB   30 MiB  5.0 GiB  6.7 TiB  27.24  0.95   87      up
26    hdd  14.65039   1.00000   15 TiB  3.6 TiB  3.5 TiB   37 MiB  6.6 GiB   11 TiB  24.64  0.86  127      up
 6    ssd   1.45549   1.00000  1.5 TiB  572 GiB  570 GiB   29 MiB  1.5 GiB  918 GiB  38.38  1.34   57      up
22    ssd   6.98630   1.00000  7.0 TiB  2.3 TiB  2.3 TiB   77 MiB  7.0 GiB  4.7 TiB  33.23  1.16  243      up
13    hdd   9.17380   1.00000  9.2 TiB  2.4 TiB  2.3 TiB   25 MiB  4.8 GiB  6.8 TiB  26.07  0.91   84      up
14    hdd   9.17380   1.00000  9.2 TiB  2.3 TiB  2.2 TiB   54 MiB  4.6 GiB  6.9 TiB  25.13  0.88   80      up
27    hdd  14.65039   1.00000   15 TiB  3.7 TiB  3.6 TiB   54 MiB  6.9 GiB   11 TiB  25.55  0.89  131      up
12    ssd   1.45540   1.00000  1.5 TiB  619 GiB  617 GiB  163 MiB  2.3 GiB  871 GiB  41.53  1.45   63      up
16    ssd   1.74660   1.00000  1.7 TiB  671 GiB  669 GiB   23 MiB  2.2 GiB  1.1 TiB  37.51  1.31   69      up
17    ssd   3.49309         0      0 B      0 B      0 B      0 B      0 B      0 B      0     0    0      up
18    ssd   1.74660   1.00000  1.7 TiB  512 GiB  509 GiB   18 MiB  2.3 GiB  1.2 TiB  28.62  1.00   52      up
19    ssd   1.74649   1.00000  1.7 TiB  709 GiB  707 GiB   64 MiB  2.0 GiB  1.1 TiB  39.64  1.39   72      up
                        TOTAL  205 TiB   59 TiB   57 TiB  1.3 GiB  128 GiB  147 TiB  28.60
MIN/MAX VAR: 0.85/1.49  STDDEV: 6.81


What we have done so far (no success; the corresponding commands are sketched
after this list)

- reformatted two of the SSD OSDs (one was still from Luminous, non-LVM)
- set bluestore_allocator from hybrid back to bitmap
- set osd_memory_target to 6442450944 for some of the SSD OSDs
- cpupower idle-set -D 11
- set bluefs_buffered_io to true
- disabled the default firewalls between the Ceph nodes (for testing only)
- disabled AppArmor
- added memory (each node now runs with 128 GB)
- upgraded the OS, now running kernel 5.13.19-1
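
For completeness, the commands behind the list above were roughly the following
(a sketch; osd.15 stands in here for the SSD OSDs we actually touched, and the
allocator change needs an OSD restart to take effect):

# ceph config set osd bluestore_allocator bitmap
# ceph config set osd.15 osd_memory_target 6442450944
# ceph config set osd bluefs_buffered_io true
# cpupower idle-set -D 11
# systemctl restart ceph-osd@15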

What we observe
- the HDD pool shows similar behaviour
- load is higher since the upgrade, apparently more CPU consumption (see
graph1); the migration was on 10 Nov, around 10 pm
- latency on the "big" 7 TB SSDs (e.g. osd.15) is significantly higher than on
the small 1.6 TB SSDs (osd.12), see graph2, though presumably just due to the
higher weight
- load on osd.15 is about 4 times higher than on osd.12, presumably also due to
the higher weight
- startup of osd.15 (one of the 7 TB SSDs) is significantly slower (~10 s) than
that of the 1.6 TB SSDs
- increasing the block size in the benchmark to 4k, 8k or even 16k increases
the throughput but keeps the IOPS more or less stable; even at 32k the drop is
minimal, to ~14k IOPS on average (the invocations are sketched below)
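
The block-size runs were simply the same rados bench invocation with a larger
-b (block size in bytes), e.g.:

# rados bench -p SSD 30 -t 256 -b 4096 write
# rados bench -p SSD 30 -t 256 -b 8192 write
# rados bench -p SSD 30 -t 256 -b 16384 write
# rados bench -p SSD 30 -t 256 -b 32768 write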

We have already checked the Proxmox list without finding a remedy so far, and
we are a bit at a loss. Any suggestions, and/or has anyone else had similar
experiences?

We are a bit hesitant to upgrade to Pacific, given the current situation.

Thanks,

Kai








