Re: [ceph-users] rados block on SSD - performance - how to tune and get insight?
Ceph is a massive overhead, so it seems it maxes out at ~10000 (at most
15000) write iops per one ssd with queue depth of 128, and ~1000 iops with
queue depth of 1 (1 ms latency). Or maybe 2000-2500 write iops (0.4-0.5 ms)
with the best possible hardware. Micron has only squeezed ~8750 iops from
each of their NVMes in their reference setup... the same NVMes reached
~290000 iops in their setup when connected directly.

> Hi Maged
>
> Thanks for your reply.
>
>> 6k is low as a max write iops value.. even for a single client. For a
>> cluster of 3 nodes, we see from 10k to 60k write iops depending on
>> hardware.
>>
>> can you increase your threads to 64 or 128 via the -t parameter
>
> I can absolutely get it higher by increasing the parallelism. But I may
> have missed explaining my purpose - I'm interested in how close I can
> get with RBD to putting a local SSD/NVMe in the servers. Thus putting
> parallel scenarios that I would never see in production into the tests
> does not really help my understanding. I think a concurrency level of 16
> is at the top of what I would expect our PostgreSQL databases to do in
> real life.

--
With best regards,
  Vitaliy Filippov
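For context, the relation behind these queue-depth figures is Little's
law: sustained IOPS is roughly the queue depth divided by the per-op
latency. A rough sketch with illustrative numbers (not measured values):

    IOPS ~= queue_depth / latency
    QD 1   at 1.0 ms/op      -> ~1,000 IOPS
    QD 1   at 0.4-0.5 ms/op  -> ~2,000-2,500 IOPS
    QD 128 at ~10,000 IOPS   -> implied ~12.8 ms per request at saturation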
Re: [ceph-users] rados block on SSD - performance - how to tune and get insight?
rados bench is garbage for this: it creates and benches a very small
number of objects. If you want RBD, better test it with fio and
ioengine=rbd.

On February 7, 2019 15:16:11 GMT+03:00, Ryan wrote:
> I just ran your test on a cluster with 5 hosts, 2x Intel 6130, 12x 860
> Evo 2TB SSD per host (6 per SAS3008), 2x bonded 10GB NIC, 2x Arista
> switches.
>
> Pool with 3x replication
>
> rados bench -p scbench -b 4096 10 write --no-cleanup
> [...]
> Average IOPS:           5189
> Average Latency(s):     0.00308195
>
> rados bench -p scbench 10 rand
> [...]
> Average IOPS:         44633
> Average Latency(s):   0.000351679
>
> [...]
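For reference, a minimal fio job for the rbd ioengine might look like the
following sketch. The pool, image and client names are placeholders; the
image must exist first (e.g. rbd create -p scbench --size 10G fiotest),
and fio must be built with rbd support:

    ; rbd-test.fio - 4k random writes against an RBD image, QD 16
    [global]
    ioengine=rbd
    clientname=admin
    pool=scbench
    rbdname=fiotest
    invalidate=0
    bs=4k
    runtime=60
    time_based=1

    [randwrite-qd16]
    rw=randwrite
    iodepth=16

Run with "fio rbd-test.fio"; the matching read test is the same job with
rw=randread.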
Re: [ceph-users] rados block on SSD - performance - how to tune and get insight?
> That's a useful conclusion to take back.

Last question - we have our SSD pool set to 3x replication, and Micron
states that NVMe is good at 2x. Is this "taste and safety", or are there
any general thoughts about SSD robustness in a Ceph setup?

Jesper
Re: [ceph-users] rados block on SSD - performance - how to tune and get insight?
> On 07/02/2019 17:07, jes...@krogh.cc wrote:
>
> Thanks for your explanation. In your case, you have low concurrency
> requirements, so focusing on latency rather than total iops is your
> goal. Your current setup gives 1.9 ms latency for writes and 0.6 ms for
> reads. These are considered good; it is difficult to go below 1 ms for
> writes. As Wido pointed out, to get latency down you need to ensure you
> have C-states in your cpu settings (or just the C1 state), that you have
> no low frequencies in your P-states, and to get a cpu with a high GHz
> frequency rather than more cores (Nick Fisk has a good presentation on
> this); also avoid dual socket and NUMA. Also, if money is no issue, you
> will get a bit better latency with a 40G or 100G network.

Thanks a lot. I'm heading towards the conclusion that if I went all in
and got new HW + NVMe drives, then I'd "only" be about 3x better off than
where I am today (compared to the Micron paper).

That's a useful conclusion to take back.

--
Jesper
Re: [ceph-users] rados block on SSD - performance - how to tune and get insight?
On 07/02/2019 17:07, jes...@krogh.cc wrote:
> Hi Maged
>
> Thanks for your reply.
>
>> 6k is low as a max write iops value.. even for a single client. For a
>> cluster of 3 nodes, we see from 10k to 60k write iops depending on
>> hardware.
>>
>> can you increase your threads to 64 or 128 via the -t parameter
>
> I can absolutely get it higher by increasing the parallelism. But I may
> have missed explaining my purpose - I'm interested in how close I can
> get with RBD to putting a local SSD/NVMe in the servers. Thus putting
> parallel scenarios that I would never see in production into the tests
> does not really help my understanding. I think a concurrency level of 16
> is at the top of what I would expect our PostgreSQL databases to do in
> real life.
>
>> can you run fio with sync=1 on your disks.
>>
>> can you try with the noop scheduler
>>
>> what is the %utilization on the disks and cpu ?
>>
>> can you have more than 1 disk per node
>
> I'll have a look at that. Thanks for the suggestion.
>
> Jesper

Thanks for your explanation. In your case, you have low concurrency
requirements, so focusing on latency rather than total iops is your goal.
Your current setup gives 1.9 ms latency for writes and 0.6 ms for reads.
These are considered good; it is difficult to go below 1 ms for writes. As
Wido pointed out, to get latency down you need to ensure you have C-states
in your cpu settings (or just the C1 state), that you have no low
frequencies in your P-states, and to get a cpu with a high GHz frequency
rather than more cores (Nick Fisk has a good presentation on this); also
avoid dual socket and NUMA. Also, if money is no issue, you will get a bit
better latency with a 40G or 100G network.

/Maged
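A sketch of what this C-state/P-state tuning can look like on Linux - the
latency threshold and tools below are examples, so check what your distro
ships:

    # show which idle states the CPUs can currently enter
    cat /sys/devices/system/cpu/cpu0/cpuidle/state*/name

    # disable idle states with a wakeup latency above 2 us (until reboot)
    cpupower idle-set -D 2

    # pin the frequency governor so cores stay at high clocks
    cpupower frequency-set -g performance

    # or make it permanent on Intel via the kernel command line, e.g.:
    #   intel_idle.max_cstate=1 processor.max_cstate=1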
Re: [ceph-users] rados block on SSD - performance - how to tune and get insight?
Hi Maged

Thanks for your reply.

> 6k is low as a max write iops value.. even for a single client. For a
> cluster of 3 nodes, we see from 10k to 60k write iops depending on
> hardware.
>
> can you increase your threads to 64 or 128 via the -t parameter

I can absolutely get it higher by increasing the parallelism. But I may
have missed explaining my purpose - I'm interested in how close I can get
with RBD to putting a local SSD/NVMe in the servers. Thus putting parallel
scenarios that I would never see in production into the tests does not
really help my understanding. I think a concurrency level of 16 is at the
top of what I would expect our PostgreSQL databases to do in real life.

> can you run fio with sync=1 on your disks.
>
> can you try with the noop scheduler
>
> what is the %utilization on the disks and cpu ?
>
> can you have more than 1 disk per node

I'll have a look at that. Thanks for the suggestion.

Jesper
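For what it's worth, the sync=1 test suggested above could be run roughly
like this - note it writes directly to the device, so it is destructive
and must only be pointed at an unused disk (/dev/sdX is a placeholder):

    # single-threaded O_SYNC 4k random writes: the classic per-device
    # write-latency test for judging an SSD for Ceph
    fio --name=synctest --filename=/dev/sdX --direct=1 --sync=1 \
        --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 \
        --runtime=60 --time_based --group_reporting

    # disk and cpu utilization while the test runs
    iostat -x 1
    top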
Re: [ceph-users] rados block on SSD - performance - how to tune and get insight?
On 07/02/2019 09:17, jes...@krogh.cc wrote:
> Hi List
>
> We are in the process of moving to the next use case for our ceph
> cluster (bulk, cheap, slow, erasure-coded cephfs storage was the first -
> and that works fine).
>
> We have 6 OSD hosts, each with one 1TB S4510 SSD, connected through a
> H700 MegaRaid Perc BBWC, each disk as RAID0 - and scheduler set to
> deadline, nomerges = 1, rotational = 0.
>
> [...]
>
> Total writes made:      56745
> Average IOPS:           5672
> Max latency(s):         0.0834767
>
> [...]
>
> Total reads made:     282097
> Average IOPS:         28207
>
> [...]
>
> Are there any suggestions on how to get more insights into that?
> Has anyone replicated close to the numbers Micron are reporting on NVMe?

6k is low as a max write iops value.. even for a single client. For a
cluster of 3 nodes, we see from 10k to 60k write iops depending on
hardware.

can you increase your threads to 64 or 128 via the -t parameter

can you run fio with sync=1 on your disks.

can you try with the noop scheduler

what is the %utilization on the disks and cpu ?

can you have more than 1 disk per node

/Maged
Re: [ceph-users] rados block on SSD - performance - how to tune and get insight?
I just ran your test on a cluster with 5 hosts, 2x Intel 6130, 12x 860 Evo
2TB SSD per host (6 per SAS3008), 2x bonded 10GB NIC, 2x Arista switches.

Pool with 3x replication

rados bench -p scbench -b 4096 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4096 bytes to objects of size 4096 for
up to 10 seconds or 0 objects
Object prefix: benchmark_data_dc1-kube-01_3458991
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16      5090      5074   19.7774   19.8203  0.00312568  0.00315352
    2      16     10441     10425   20.3276   20.9023  0.00332591  0.00307105
    3      16     15548     15532    20.201   19.9492  0.00337573  0.00309134
    4      16     20906     20890   20.3826   20.9297  0.00282902  0.00306437
    5      16     26107     26091   20.3686   20.3164  0.00269844  0.00306698
    6      16     31246     31230   20.3187   20.0742  0.00339814  0.00307462
    7      16     36372     36356   20.2753   20.0234  0.00286653   0.0030813
    8      16     41470     41454   20.2293   19.9141  0.00272051  0.00308839
    9      16     46815     46799   20.3011   20.8789  0.00284063  0.00307738
Total time run:         10.0035
Total writes made:      51918
Write size:             4096
Object size:            4096
Bandwidth (MB/sec):     20.2734
Stddev Bandwidth:       0.464082
Max bandwidth (MB/sec): 20.9297
Min bandwidth (MB/sec): 19.8203
Average IOPS:           5189
Stddev IOPS:            118
Max IOPS:               5358
Min IOPS:               5074
Average Latency(s):     0.00308195
Stddev Latency(s):      0.00142825
Max latency(s):         0.0267947
Min latency(s):         0.00217364

rados bench -p scbench 10 rand
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      15     39691     39676    154.95   154.984  0.00027022 0.000395993
    2      16     83701     83685   163.416    171.91 0.000318949 0.000375363
    3      15    129218    129203   168.199   177.805 0.000300898 0.000364647
    4      15    173733    173718   169.617   173.887 0.000311723  0.00036156
    5      15    216073    216058   168.769   165.391 0.000407594 0.000363371
    6      16    260381    260365   169.483   173.074 0.000323371 0.000361829
    7      15    306838    306823   171.193   181.477 0.000284247 0.000358199
    8      15    353675    353660   172.661   182.957 0.000338128 0.000355139
    9      15    399221    399206   173.243   177.914 0.000422527  0.00035393
Total time run:       10.0003
Total reads made:     446353
Read size:            4096
Object size:          4096
Bandwidth (MB/sec):   174.351
Average IOPS:         44633
Stddev IOPS:          2220
Max IOPS:             46837
Min IOPS:             39676
Average Latency(s):   0.000351679
Max latency(s):       0.00530195
Min latency(s):       0.000135292

On Thu, Feb 7, 2019 at 2:17 AM wrote:
> Hi List
>
> We are in the process of moving to the next use case for our ceph
> cluster (bulk, cheap, slow, erasure-coded cephfs storage was the first -
> and that works fine).
>
> [...]
Re: [ceph-users] rados block on SSD - performance - how to tune and get insight?
4x nodes, around 100GB, 2x 2660, 10Gbit, 2x LSI Logic SAS2308

> Thanks for the confirmation Marc
>
> Can you put in a bit more hardware/network details?
>
> Jesper
Re: [ceph-users] rados block on SSD - performance - how to tune and get insight?
Thanks for the confirmation Marc

Can you put in a bit more hardware/network details?

Jesper
Re: [ceph-users] rados block on SSD - performance - how to tune and get insight?
I did your rados bench test on our sm863a pool, 3x rep, and got similar
results.

[@]# rados bench -p fs_data.ssd -b 4096 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4096 bytes to objects of size 4096 for
up to 10 seconds or 0 objects
Object prefix: benchmark_data_c04_1337712
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16      6302      6286   24.5533   24.5547  0.00304773    0.002541
    2      15     12545     12530   24.4705   24.3906  0.00228294   0.0025506
    3      16     18675     18659   24.2933   23.9414  0.00332918  0.00257042
    4      16     25194     25178   24.5854   25.4648   0.0034176  0.00254016
    5      16     31657     31641   24.7169   25.2461  0.00156494  0.00252686
    6      16     37713     37697   24.5398   23.6562  0.00228134  0.00254527
    7      16     43848     43832   24.4572   23.9648  0.00238393  0.00255401
    8      16     49516     49500   24.1673   22.1406  0.00244473  0.00258466
    9      16     55562     55546   24.1059   23.6172  0.00249619  0.00259139
   10      16     61675     61659   24.0829   23.8789   0.0020192  0.00259362
Total time run:         10.002179
Total writes made:      61675
Write size:             4096
Object size:            4096
Bandwidth (MB/sec):     24.0865
Stddev Bandwidth:       0.932554
Max bandwidth (MB/sec): 25.4648
Min bandwidth (MB/sec): 22.1406
Average IOPS:           6166
Stddev IOPS:            238
Max IOPS:               6519
Min IOPS:               5668
Average Latency(s):     0.00259383
Stddev Latency(s):      0.00173856
Max latency(s):         0.0778051
Min latency(s):         0.00110931

[@ ]# rados bench -p fs_data.ssd 10 rand
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      15     27697     27682   108.115   108.133 0.000755936 0.000568212
    2      15     57975     57960   113.186   118.273 0.000547682 0.000542773
    3      15     88500     88485   115.199   119.238  0.00036749 0.000533185
    4      15    117199    117184   114.422   112.105 0.000354388 0.000536647
    5      15    147734    147719    115.39   119.277 0.000419781  0.00053221
    6      16    176393    176377   114.814   111.945 0.000427109 0.000534771
    7      15    203693    203678   113.645   106.645 0.000379089 0.000540113
    8      15    231917    231902   113.219    110.25 0.000465232 0.000542156
    9      16    261054    261038   113.284   113.812 0.000358025 0.000541972
Total time run:       10.000669
Total reads made:     290371
Read size:            4096
Object size:          4096
Bandwidth (MB/sec):   113.419
Average IOPS:         29035
Stddev IOPS:          1212
Max IOPS:             30535
Min IOPS:             27301
Average Latency(s):   0.000541371
Max latency(s):       0.00380609
Min latency(s):       0.000155521

-----Original Message-----
From: jes...@krogh.cc [mailto:jes...@krogh.cc]
Sent: 07 February 2019 08:17
To: ceph-users@lists.ceph.com
Subject: [ceph-users] rados block on SSD - performance - how to tune and
get insight?

Hi List

We are in the process of moving to the next use case for our ceph cluster
(bulk, cheap, slow, erasure-coded cephfs storage was the first - and that
works fine).

[...]
Re: [ceph-users] rados block on SSD - performance - how to tune and get insight?
> On 2/7/19 8:41 AM, Brett Chancellor wrote:
>> This seems right. You are doing a single benchmark from a single
>> client. Your limiting factor will be the network latency. For most
>> networks this is between 0.2 and 0.3 ms. If you're trying to test the
>> potential of your cluster, you'll need multiple workers and clients.
>
> Indeed. To add to this, you will need fast (high clockspeed!) CPUs in
> order to get the latency down. The CPUs will need tuning as well, like
> their power profiles and C-States.

Thanks for the insight. I'm aware, and my current CPUs are pretty old -
but I'm also in the process of learning how to make the right decisions
when expanding. If all my time ends up being spent in the client end, then
buying NVMe drives does not help me at all, nor do better CPUs in the
OSDs.

> You won't get the 1:1 performance from the SSDs on your RBD block
> devices.

I'm fully aware of that - Ceph / RBD / etc. comes with an awesome feature
package, and that flexibility delivers overhead and eats into it. But it
helps to establish "upper bounds" and work my way to good from there.

Thanks.

Jesper
Re: [ceph-users] rados block on SSD - performance - how to tune and get insight?
> On Thu, 7 Feb 2019 08:17:20 +0100 jes...@krogh.cc wrote:
>> We have 6 OSD hosts, each with one 1TB S4510 SSD, connected through a
>> H700 MegaRaid Perc BBWC, each disk as RAID0 - and scheduler set to
>> deadline, nomerges = 1, rotational = 0.
>
> I'd make sure that the endurance of these SSDs is in line with your
> expected usage.

They are - at the moment :-) and Ceph allows me to change my mind without
interfering with the applications running on top - nice!

>> Each disk "should" give approximately 36K IOPS random write and double
>> that for random read.
>
> Only locally, latency is your enemy.
>
> Tell us more about your network.

It is a Dell N4032, N4064 switch stack on 10GBase-T. All hosts are on the
same subnet, NICs are Intel X540s. No jumbo framing and not much tuning -
all kernels are on 4.15 (Ubuntu).

Pings from the client to two of the OSDs:

--- flodhest.nzcorp.net ping statistics ---
50 packets transmitted, 50 received, 0% packet loss, time 50157ms
rtt min/avg/max/mdev = 0.075/0.105/0.158/0.021 ms

--- bison.nzcorp.net ping statistics ---
50 packets transmitted, 50 received, 0% packet loss, time 50139ms
rtt min/avg/max/mdev = 0.078/0.137/0.275/0.032 ms

> rados bench is not the sharpest tool in the shed for this.
> As it needs to allocate stuff to begin with, amongst other things.

Suggest longer test runs?

>> This is also quite far from expected. I have 12GB of memory on the OSD
>> daemon for caching on each host - close to idle cluster - thus 50GB+
>> for caching with a working set of < 6GB .. this should - in this case -
>> not really be bound by the underlying SSD.
>
> Did you adjust the bluestore parameters (whatever they are this week or
> for your version) to actually use that memory?

According to top - it is picking up the caching memory. We have this
block:

bluestore_cache_kv_max = 214748364800
bluestore_cache_kv_ratio = 0.4
bluestore_cache_meta_ratio = 0.1
bluestore_cache_size_hdd = 13958643712
bluestore_cache_size_ssd = 13958643712
bluestore_rocksdb_options = compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152,compact_on_mount=false

I actually think most of the above has been applied with the 10TB
harddrives in mind, not the SSDs .. but I have no idea if they do "bad
things" for us.

> Don't use iostat, use atop.
> Small IOPS are extremely CPU intensive, so atop will give you an insight
> as to what might be busy besides the actual storage device.

Thanks, will do so. More suggestions are welcome.

Doing some math: say network latency was the only cost driver, and assume
one round-trip per IO per thread. 16 threads at ~0.15 ms per round-trip
gives 1000 ms/s/thread / 0.15 ms/IO => ~6,666 IOPS per thread, * 16
threads => ~106,666 IOPS.

Ok, that's at least an upper bound on expectations in this scenario; I am
at 28,207 - thus ~4x from it - and have still not accounted any OSD or rbd
userspace time into the equation.

Can I directly get service time out of the osd-daemon? That would be nice,
to see how many ms are spent at that end from an OSD perspective.

--
Jesper
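On the service-time question: the OSD admin socket exposes op latency
counters, so something along these lines (osd.0 is a placeholder; run it
on the host carrying that OSD) should show how much time is spent inside
the OSD itself:

    # cumulative op latencies; look at osd.op_w_latency and
    # osd.op_r_latency (sum / avgcount gives the average in seconds)
    ceph daemon osd.0 perf dump

    # the slowest recent ops, with a per-stage timestamp breakdown
    ceph daemon osd.0 dump_historic_ops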
Re: [ceph-users] rados block on SSD - performance - how to tune and get insight?
On 2/7/19 8:41 AM, Brett Chancellor wrote:
> This seems right. You are doing a single benchmark from a single client.
> Your limiting factor will be the network latency. For most networks this
> is between 0.2 and 0.3 ms. If you're trying to test the potential of
> your cluster, you'll need multiple workers and clients.

Indeed. To add to this, you will need fast (high clockspeed!) CPUs in
order to get the latency down. The CPUs will need tuning as well, like
their power profiles and C-States.

You won't get the 1:1 performance from the SSDs on your RBD block
devices.

Wido

> On Thu, Feb 7, 2019, 2:17 AM
>> Hi List
>>
>> We are in the process of moving to the next use case for our ceph
>> cluster (bulk, cheap, slow, erasure-coded cephfs storage was the first
>> - and that works fine).
>>
>> [...]
Re: [ceph-users] rados block on SSD - performance - how to tune and get insight?
Hello,

On Thu, 7 Feb 2019 08:17:20 +0100 jes...@krogh.cc wrote:

> We have 6 OSD hosts, each with one 1TB S4510 SSD, connected through a
> H700 MegaRaid Perc BBWC, each disk as RAID0 - and scheduler set to
> deadline, nomerges = 1, rotational = 0.

I'd make sure that the endurance of these SSDs is in line with your
expected usage.

> Each disk "should" give approximately 36K IOPS random write and double
> that for random read.

Only locally, latency is your enemy.

Tell us more about your network.

> The pool is set up with 3x replication. We would like a "scaleout" setup
> of well-performing SSD block devices - potentially to host databases and
> things like that. [...]
>
> Since it is IOPS I care about, I have lowered the block size to 4096 --
> a 4M blocksize nicely saturates the NICs in both directions.

rados bench is not the sharpest tool in the shed for this.
As it needs to allocate stuff to begin with, amongst other things.

And before you go "fio with RBD engine", that had major issues in my
experience, too.

Your best and most realistic results will come from doing the testing
inside a VM (I presume, from your use case) or a mounted RBD block
device. And then using fio, of course.

> $ sudo rados bench -p scbench -b 4096 10 write --no-cleanup
> [...]
> Average IOPS:           5672
> Max latency(s):         0.0834767
> Min latency(s):         0.00120945
>
> Min latency is fine -- but Max latency of 83ms ?

Outliers during setup are to be expected and ignored.

> Average IOPS @ 5672 ?

Plenty of good reasons to come up with that number, yes.

> $ sudo rados bench -p scbench 10 rand
> [...]
> Average IOPS:         28207
>
> This is also quite far from expected. I have 12GB of memory on the
> [...]
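The "mounted RBD block device" test suggested above could look roughly
like this (pool and image names are made up; krbd may require disabling
image features it does not support, e.g. via --image-feature layering at
create time):

    rbd create scbench/fiotest --size 10G
    sudo rbd map scbench/fiotest       # prints the device, e.g. /dev/rbd0

    # 4k random writes at QD 16 against the mapped device
    sudo fio --name=rbdtest --filename=/dev/rbd0 --direct=1 \
        --rw=randwrite --bs=4k --iodepth=16 \
        --runtime=60 --time_based --group_reporting

    sudo rbd unmap /dev/rbd0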
Re: [ceph-users] rados block on SSD - performance - how to tune and get insight?
This seems right. You are doing a single benchmark from a single client.
Your limiting factor will be the network latency. For most networks this
is between 0.2 and 0.3 ms. If you're trying to test the potential of your
cluster, you'll need multiple workers and clients.

On Thu, Feb 7, 2019, 2:17 AM
> Hi List
>
> We are in the process of moving to the next use case for our ceph
> cluster (bulk, cheap, slow, erasure-coded cephfs storage was the first -
> and that works fine).
>
> [...]
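To test the cluster rather than a single client, the same bench can be
started simultaneously on several machines - roughly like this, with
--run-name keeping the per-client object prefixes from colliding - and
the per-client IOPS summed afterwards:

    # run on each client at the same time
    rados bench -p scbench -b 4096 60 write -t 32 \
        --run-name "$(hostname)" --no-cleanup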
[ceph-users] rados block on SSD - performance - how to tune and get insight?
Hi List

We are in the process of moving to the next use case for our ceph cluster
(bulk, cheap, slow, erasure-coded cephfs storage was the first - and that
works fine).

We're currently on luminous / bluestore; if upgrading is deemed to change
what we're seeing then please let us know.

We have 6 OSD hosts, each with one 1TB S4510 SSD, connected through a H700
MegaRaid Perc BBWC, each disk as RAID0 - and scheduler set to deadline,
nomerges = 1, rotational = 0.

Each disk "should" give approximately 36K IOPS random write and double
that for random read.

The pool is set up with 3x replication. We would like a "scaleout" setup
of well-performing SSD block devices - potentially to host databases and
things like that. I read through this nice document [0]; I know the HW is
radically different from mine, but I still think I'm in the very low end
of what 6 x S4510 should be capable of doing.

Since it is IOPS I care about, I have lowered the block size to 4096 -- a
4M blocksize nicely saturates the NICs in both directions.

$ sudo rados bench -p scbench -b 4096 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4096 bytes to objects of size 4096 for
up to 10 seconds or 0 objects
Object prefix: benchmark_data_torsk2_11207
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16      5857      5841   22.8155   22.8164  0.00238437  0.00273434
    2      15     11768     11753   22.9533   23.0938   0.0028559  0.00271944
    3      16     17264     17248   22.4564   21.4648      0.0024  0.00278101
    4      16     22857     22841   22.3037   21.8477    0.002716  0.00280023
    5      16     28462     28446   22.2213   21.8945  0.00220186    0.002811
    6      16     34216     34200   22.2635   22.4766  0.00234315  0.00280552
    7      16     39616     39600   22.0962   21.0938  0.00290661  0.00282718
    8      16     45510     45494   22.2118   23.0234   0.0033541  0.00281253
    9      16     50995     50979   22.1243   21.4258  0.00267282  0.00282371
   10      16     56745     56729   22.1577   22.4609  0.00252583   0.0028193
Total time run:         10.002668
Total writes made:      56745
Write size:             4096
Object size:            4096
Bandwidth (MB/sec):     22.1601
Stddev Bandwidth:       0.712297
Max bandwidth (MB/sec): 23.0938
Min bandwidth (MB/sec): 21.0938
Average IOPS:           5672
Stddev IOPS:            182
Max IOPS:               5912
Min IOPS:               5400
Average Latency(s):     0.00281953
Stddev Latency(s):      0.00190771
Max latency(s):         0.0834767
Min latency(s):         0.00120945

Min latency is fine -- but Max latency of 83ms ?
Average IOPS @ 5672 ?

$ sudo rados bench -p scbench 10 rand
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      15     23329     23314   91.0537   91.0703 0.000349856 0.000679074
    2      16     48555     48539   94.7884   98.5352 0.000499159 0.000652067
    3      16     76193     76177   99.1747   107.961 0.000443877 0.000622775
    4      15    103923    103908   101.459   108.324 0.000678589 0.000609182
    5      15    132720    132705   103.663   112.488 0.000741734 0.000595998
    6      15    161811    161796   105.323   113.637 0.000333166 0.000586323
    7      15    190196    190181   106.115   110.879 0.000612227 0.000582014
    8      15    221155    221140   107.966   120.934 0.000471219 0.000571944
    9      16    251143    251127   108.984   117.137 0.000267528 0.000566659
Total time run:       10.000640
Total reads made:     282097
Read size:            4096
Object size:          4096
Bandwidth (MB/sec):   110.187
Average IOPS:         28207
Stddev IOPS:          2357
Max IOPS:             30959
Min IOPS:             23314
Average Latency(s):   0.000560402
Max latency(s):       0.109804
Min latency(s):       0.000212671

This is also quite far from expected. I have 12GB of memory on the OSD
daemon for caching on each host - close to idle cluster - thus 50GB+ for
caching with a working set of < 6GB .. this should - in this case - not
really be bound by the underlying SSD.

But if it were:

IOPS/disk * num disks / replication => 95K * 6 / 3 => 190K, or 6x off?

No measurable service time in iostat when running tests, thus I have come
to the conclusion that it has to be either the client side, the network
path, or the OSD daemon that delivers the increasing latency / decreased
IOPS.

Are there any suggestions on how to get more insights into that?
Has anyone replicated close to the numbers Micron are reporting on NVMe?

Thanks a lot.

[0] https://www.micron.com/-/media/client/global/documents/products/other-documents/micron_9200_max_ceph_12,-d-,2,-d-,8_luminous_bluestore_reference_architecture.pdf?la=en