Re: [ceph-users] rados block on SSD - performance - how to tune and get insight?

2019-02-07 Thread Vitaliy Filippov
Ceph is a massive overhead, so it seems it maxes out at ~1 (at most  
15000) write iops per one ssd with queue depth of 128 and ~1000 iops with  
queue depth of 1 (1ms latency). Or maybe 2000-2500 write iops (0.4-0.5ms)  
with best possible hardware. Micron has only squeezed ~8750 iops from each  
of their NVMes in their reference setup... the same NVMes reached 29  
iops in their setup when connected directly.



> Hi Maged
>
> Thanks for your reply.
>
>> 6k is low as a max write iops value..even for single client. for cluster
>> of 3 nodes, we see from 10k to 60k write iops depending on hardware.
>>
>> can you increase your threads to 64 or 128 via -t parameter
>
> I can absolutely get it higher by increasing the parallelism. But I
> may have failed to explain my purpose - I'm interested in how close to
> putting local SSD/NVMe in servers I can get with RBD. Thus putting
> parallel scenarios that I would never see in production in the
> tests does not really help my understanding. I think a concurrency level
> of 16 is at the top of what I would expect our PostgreSQL databases to do
> in real life.


--
With best regards,
  Vitaliy Filippov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rados block on SSD - performance - how to tune and get insight?

2019-02-07 Thread Виталий Филиппов
rados bench is garbage: it creates and benchmarks only a very small number of objects.
If you want to test RBD, it is better to use fio with ioengine=rbd.
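
A minimal sketch of what such a fio run could look like, assembled in Python so every parameter is explicit. The pool name `scbench` and the queue depth of 16 are taken from this thread; the image name `fio_test`, the client name `admin` and the 60 second runtime are assumptions for illustration, and the RBD image has to exist before fio is started.

```python
import shlex


def fio_rbd_command(pool, image, iodepth=16, client="admin", runtime_s=60):
    """Build a fio argv for 4 KiB random writes against an RBD image via librbd."""
    return [
        "fio",
        "--name=rbd-4k-randwrite",
        "--ioengine=rbd",           # talk to the cluster through librbd
        f"--clientname={client}",   # cephx user without the "client." prefix (assumed)
        f"--pool={pool}",
        f"--rbdname={image}",       # hypothetical, pre-created image
        "--rw=randwrite",
        "--bs=4k",
        f"--iodepth={iodepth}",
        "--direct=1",
        "--time_based",
        f"--runtime={runtime_s}",
    ]


cmd = fio_rbd_command(pool="scbench", image="fio_test")
print(shlex.join(cmd))  # copy the printed line into a shell to run it
```

Running the same job with `iodepth=1` gives the single-request latency, which is the number that matters most for a database-style workload.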

On 7 February 2019 at 15:16:11 GMT+03:00, Ryan wrote:
>I just ran your test on a cluster with 5 hosts 2x Intel 6130, 12x 860
>Evo
>2TB SSD per host (6 per SAS3008), 2x bonded 10GB NIC, 2x Arista
>switches.
>
>Pool with 3x replication
>
>rados bench -p scbench -b 4096 10 write --no-cleanup
>hints = 1
>Maintaining 16 concurrent writes of 4096 bytes to objects of size 4096
>for
>up to 10 seconds or 0 objects
>Object prefix: benchmark_data_dc1-kube-01_3458991
>  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg
>lat(s)
>0   0 0 0 0 0   -
> 0
>1  16  5090  5074   19.7774   19.8203  0.00312568
>0.00315352
>2  16 10441 10425   20.3276   20.9023  0.00332591
>0.00307105
>3  16 15548 1553220.201   19.9492  0.00337573
>0.00309134
>4  16 20906 20890   20.3826   20.9297  0.00282902
>0.00306437
>5  16 26107 26091   20.3686   20.3164  0.00269844
>0.00306698
>6  16 31246 31230   20.3187   20.0742  0.00339814
>0.00307462
>7  16 36372 36356   20.2753   20.0234  0.00286653
> 0.0030813
>8  16 41470 41454   20.2293   19.9141  0.00272051
>0.00308839
>9  16 46815 46799   20.3011   20.8789  0.00284063
>0.00307738
>Total time run: 10.0035
>Total writes made:  51918
>Write size: 4096
>Object size:4096
>Bandwidth (MB/sec): 20.2734
>Stddev Bandwidth:   0.464082
>Max bandwidth (MB/sec): 20.9297
>Min bandwidth (MB/sec): 19.8203
>Average IOPS:   5189
>Stddev IOPS:118
>Max IOPS:   5358
>Min IOPS:   5074
>Average Latency(s): 0.00308195
>Stddev Latency(s):  0.00142825
>Max latency(s): 0.0267947
>Min latency(s): 0.00217364
>
>rados bench -p scbench 10 rand
>hints = 1
>  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg
>lat(s)
>0   0 0 0 0 0   -
> 0
>1  15 39691 39676154.95   154.984  0.00027022
>0.000395993
>2  16 83701 83685   163.416171.91 0.000318949
>0.000375363
>3  15129218129203   168.199   177.805 0.000300898
>0.000364647
>4  15173733173718   169.617   173.887 0.000311723
>0.00036156
>5  15216073216058   168.769   165.391 0.000407594
>0.000363371
>6  16260381260365   169.483   173.074 0.000323371
>0.000361829
>7  15306838306823   171.193   181.477 0.000284247
>0.000358199
>8  15353675353660   172.661   182.957 0.000338128
>0.000355139
>9  15399221399206   173.243   177.914 0.000422527
>0.00035393
>Total time run:   10.0003
>Total reads made: 446353
>Read size:4096
>Object size:  4096
>Bandwidth (MB/sec):   174.351
>Average IOPS: 44633
>Stddev IOPS:  2220
>Max IOPS: 46837
>Min IOPS: 39676
>Average Latency(s):   0.000351679
>Max latency(s):   0.00530195
>Min latency(s):   0.000135292
>
>On Thu, Feb 7, 2019 at 2:17 AM  wrote:
>
>> Hi List
>>
>> We are in the process of moving to the next usecase for our ceph
>cluster
>> (Bulk, cheap, slow, erasurecoded, cephfs) storage was the first - and
>> that works fine.
>>
>> We're currently on luminous / bluestore, if upgrading is deemed to
>> change what we're seeing then please let us know.
>>
>> We have 6 OSD hosts, each with a S4510 of 1TB with 1 SSD in each.
>Connected
>> through a H700 MegaRaid Perc BBWC, EachDiskRaid0 - and scheduler set
>to
>> deadline, nomerges = 1, rotational = 0.
>>
>> Each disk "should" give approximately 36K IOPS random write and the
>double
>> random read.
>>
>> Pool is setup with a 3x replicaiton. We would like a "scaleout" setup
>of
>> well performing SSD block devices - potentially to host databases and
>> things like that. I ready through this nice document [0], I know the
>> HW are radically different from mine, but I still think I'm in the
>> very low end of what 6 x S4510 should be capable of doing.
>>
>> Since it is IOPS i care about I have lowered block size to 4096 -- 4M
>> blocksize nicely saturates the NIC's in both directions.
>>
>>
>> $ sudo rados bench -p scbench -b 4096 10 write --no-cleanup
>> hints = 1
>> Maintaining 16 concurrent writes of 4096 bytes to objects of size
>4096 for
>> up to 10 seconds or 0 objects
>> Object prefix: benchmark_data_torsk2_11207
>>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s) 
>avg
>> lat(s)
>> 0   0 0 0 0 0   -
>>  0
>> 1  16  5857  5841   22.8155   22.8164  0.00238437
>> 0.00273434
>> 2  15 11768 11753   22.9533   23.0938   0.0028559
>> 0.00271944
>> 3  16 17264 

Re: [ceph-users] rados block on SSD - performance - how to tune and get insight?

2019-02-07 Thread jesper
> That's a useful conclusion to take back.

Last question - we have our SSD pool set to 3x replication, while Micron
states that NVMe is good at 2x - is this "taste and safety", or are there
any general thoughts about SSD robustness in a Ceph setup?


Jesper



Re: [ceph-users] rados block on SSD - performance - how to tune and get insight?

2019-02-07 Thread jesper
> On 07/02/2019 17:07, jes...@krogh.cc wrote:
> Thanks for your explanation. In your case, you have low concurrency
> requirements, so focusing on latency rather than total iops is your
> goal. Your current setup gives 1.9 ms latency for writes and 0.6 ms for
> reads. These are considered good; it is difficult to go below 1 ms for
> writes. As Wido pointed out, to get latency down you need to ensure
> C-states are disabled in your CPU settings (or limited to the C1 state),
> that you have no low frequencies in your P-states, and to get CPUs with a
> high GHz frequency rather than more cores (Nick Fisk has a good
> presentation on this); also avoid dual socket and NUMA. Also, if money is
> no issue, you will get a bit better latency with a 40G or 100G network.

Thanks a lot. I'm heading towards the conclusion that if I went all in
and got new HW+NVMe drives, then I'd "only" be about 3x better off than
where I am today.  (compared to the Micron paper)

That's a useful conclusion to take back.

-- 
Jesper



Re: [ceph-users] rados block on SSD - performance - how to tune and get insight?

2019-02-07 Thread Maged Mokhtar



On 07/02/2019 17:07, jes...@krogh.cc wrote:

> Hi Maged
>
> Thanks for your reply.
>
>> 6k is low as a max write iops value..even for single client. for cluster
>> of 3 nodes, we see from 10k to 60k write iops depending on hardware.
>>
>> can you increase your threads to 64 or 128 via -t parameter
>
> I can absolutely get it higher by increasing the parallelism. But I
> may have failed to explain my purpose - I'm interested in how close to
> putting local SSD/NVMe in servers I can get with RBD. Thus putting
> parallel scenarios that I would never see in production in the
> tests does not really help my understanding. I think a concurrency level
> of 16 is at the top of what I would expect our PostgreSQL databases to do
> in real life.
>
>> can you run fio with sync=1 on your disks.
>>
>> can you try with noop scheduler
>>
>> what is the %utilization on the disks and cpu ?
>>
>> can you have more than 1 disk per node
>
> I'll have a look at that. Thanks for the suggestion.
>
> Jesper


Thanks for your explanation. In your case, you have low concurrency
requirements, so focusing on latency rather than total iops is your
goal. Your current setup gives 1.9 ms latency for writes and 0.6 ms for
reads. These are considered good; it is difficult to go below 1 ms for
writes. As Wido pointed out, to get latency down you need to ensure
C-states are disabled in your CPU settings (or limited to the C1 state),
that you have no low frequencies in your P-states, and to get CPUs with a
high GHz frequency rather than more cores (Nick Fisk has a good
presentation on this); also avoid dual socket and NUMA. Also, if money is
no issue, you will get a bit better latency with a 40G or 100G network.


/Maged




Re: [ceph-users] rados block on SSD - performance - how to tune and get insight?

2019-02-07 Thread jesper
Hi Maged

Thanks for your reply.

> 6k is low as a max write iops value..even for single client. for cluster
> of 3 nodes, we see from 10k to 60k write iops depending on hardware.
>
> can you increase your threads to 64 or 128 via -t parameter

I can absolutely get it higher by increasing the parallelism. But I
may have failed to explain my purpose - I'm interested in how close to
putting local SSD/NVMe in servers I can get with RBD. Thus putting
parallel scenarios that I would never see in production in the
tests does not really help my understanding. I think a concurrency level
of 16 is at the top of what I would expect our PostgreSQL databases to do
in real life.

> can you run fio with sync=1 on your disks.
>
> can you try with noop scheduler
>
> what is the %utilization on the disks and cpu ?
>
> can you have more than 1 disk per node

I'll have a look at that. Thanks for the suggestion.

Jesper




Re: [ceph-users] rados block on SSD - performance - how to tune and get insight?

2019-02-07 Thread Maged Mokhtar



On 07/02/2019 09:17, jes...@krogh.cc wrote:

Hi List

We are in the process of moving to the next usecase for our ceph cluster
(Bulk, cheap, slow, erasurecoded, cephfs) storage was the first - and
that works fine.

We're currently on luminous / bluestore, if upgrading is deemed to
change what we're seeing then please let us know.

We have 6 OSD hosts, each with one 1TB S4510 SSD. Connected
through a H700 MegaRaid Perc BBWC, each disk as its own RAID0 - and scheduler
set to deadline, nomerges = 1, rotational = 0.

Each disk "should" give approximately 36K IOPS random write and the double
random read.

Pool is set up with 3x replication. We would like a "scaleout" setup of
well performing SSD block devices - potentially to host databases and
things like that. I read through this nice document [0]; I know the
HW is radically different from mine, but I still think I'm in the
very low end of what 6 x S4510 should be capable of doing.

Since it is IOPS i care about I have lowered block size to 4096 -- 4M
blocksize nicely saturates the NIC's in both directions.


$ sudo rados bench -p scbench -b 4096 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4096 bytes to objects of size 4096 for
up to 10 seconds or 0 objects
Object prefix: benchmark_data_torsk2_11207
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16      5857      5841   22.8155   22.8164  0.00238437  0.00273434
    2      15     11768     11753   22.9533   23.0938   0.0028559  0.00271944
    3      16     17264     17248   22.4564   21.4648      0.0024  0.00278101
    4      16     22857     22841   22.3037   21.8477    0.002716  0.00280023
    5      16     28462     28446   22.2213   21.8945  0.00220186    0.002811
    6      16     34216     34200   22.2635   22.4766  0.00234315  0.00280552
    7      16     39616     39600   22.0962   21.0938  0.00290661  0.00282718
    8      16     45510     45494   22.2118   23.0234   0.0033541  0.00281253
    9      16     50995     50979   22.1243   21.4258  0.00267282  0.00282371
   10      16     56745     56729   22.1577   22.4609  0.00252583   0.0028193
Total time run:         10.002668
Total writes made:      56745
Write size:             4096
Object size:            4096
Bandwidth (MB/sec):     22.1601
Stddev Bandwidth:       0.712297
Max bandwidth (MB/sec): 23.0938
Min bandwidth (MB/sec): 21.0938
Average IOPS:           5672
Stddev IOPS:            182
Max IOPS:               5912
Min IOPS:               5400
Average Latency(s):     0.00281953
Stddev Latency(s):      0.00190771
Max latency(s):         0.0834767
Min latency(s):         0.00120945

Min latency is fine -- but Max latency of 83ms ?
Average IOPS @ 5672 ?
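
As a rough cross-check of the write figures above: with a fixed number of requests in flight, throughput and latency are tied together (Little's law), so the reported IOPS should be close to concurrency divided by average latency. The sketch below only re-derives numbers from the rados bench summary; the function names are made up for illustration.

```python
def iops_from_latency(queue_depth, avg_latency_s):
    """Little's law: throughput = requests in flight / time per request."""
    return queue_depth / avg_latency_s


def bandwidth_mib_s(total_ops, op_bytes, seconds):
    """Bandwidth in MiB/s, which is what rados bench labels MB/sec."""
    return total_ops * op_bytes / seconds / 2**20


# Values copied from the write summary above.
predicted = iops_from_latency(16, 0.00281953)   # from concurrency and avg latency
measured = 56745 / 10.002668                    # total writes / total run time
bw = bandwidth_mib_s(56745, 4096, 10.002668)

print(f"predicted {predicted:.0f} IOPS, measured {measured:.0f} IOPS, {bw:.4f} MiB/s")
```

The two IOPS figures agree to within a fraction of a percent, which suggests that at 16 concurrent writes the result is determined by per-request latency, so only lower latency (or more concurrency) can raise it.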

$ sudo rados bench -p scbench  10 rand
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      15     23329     23314   91.0537   91.0703 0.000349856 0.000679074
    2      16     48555     48539   94.7884   98.5352 0.000499159 0.000652067
    3      16     76193     76177   99.1747   107.961 0.000443877 0.000622775
    4      15    103923    103908   101.459   108.324 0.000678589 0.000609182
    5      15    132720    132705   103.663   112.488 0.000741734 0.000595998
    6      15    161811    161796   105.323   113.637 0.000333166 0.000586323
    7      15    190196    190181   106.115   110.879 0.000612227 0.000582014
    8      15    221155    221140   107.966   120.934 0.000471219 0.000571944
    9      16    251143    251127   108.984   117.137 0.000267528 0.000566659
Total time run:       10.000640
Total reads made:     282097
Read size:            4096
Object size:          4096
Bandwidth (MB/sec):   110.187
Average IOPS:         28207
Stddev IOPS:          2357
Max IOPS:             30959
Min IOPS:             23314
Average Latency(s):   0.000560402
Max latency(s):       0.109804
Min latency(s):       0.000212671

This is also quite far from expected. I have 12GB of memory on the OSD
daemon for caching on each host - close to idle cluster - thus 50GB+ for
caching with a working set of < 6GB .. this should - in this case
not really be bound by the underlying SSD. But if it were:

IOPS/disk * num disks / replication => 95K * 6 / 3 => 190K or 6x off?
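
The back-of-the-envelope estimate above can be written out as a small sketch. It uses exactly the figures from the line above (95K IOPS per disk, 6 disks, 3x replication) and the measured 28207 read IOPS; dividing by the replication factor is the assumption made in this post, not a statement about how Ceph serves reads.

```python
def raw_ceiling_iops(iops_per_disk, num_disks, replication):
    """Aggregate device IOPS divided by the replication factor."""
    return iops_per_disk * num_disks / replication


ceiling = raw_ceiling_iops(95_000, 6, 3)   # 190000.0
measured = 28207                           # Average IOPS from the rand run above
ratio = ceiling / measured

print(f"ceiling {ceiling:.0f} IOPS, measured {measured}, {ratio:.1f}x off")
```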

No measurable service time in iostat when running tests, thus I have
come to the conclusion that it has to be either the client side, the
network path, or the OSD daemon that delivers the increasing latency /
decreased IOPS.

Are there any suggestions on how to get more insight into that?

Has anyone replicated close to the numbers Micron is reporting on NVMe?

Thanks a lot.

[0]
https://www.micron.com/-/media/client/global/documents/products/other-documents/micron_9200_max_ceph_12,-d-,2,-d-,8_luminous_bluestore_reference_architecture.pdf?la=en


Re: [ceph-users] rados block on SSD - performance - how to tune and get insight?

2019-02-07 Thread Ryan
I just ran your test on a cluster with 5 hosts 2x Intel 6130, 12x 860 Evo
2TB SSD per host (6 per SAS3008), 2x bonded 10GB NIC, 2x Arista switches.

Pool with 3x replication

rados bench -p scbench -b 4096 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4096 bytes to objects of size 4096 for
up to 10 seconds or 0 objects
Object prefix: benchmark_data_dc1-kube-01_3458991
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16      5090      5074   19.7774   19.8203  0.00312568  0.00315352
    2      16     10441     10425   20.3276   20.9023  0.00332591  0.00307105
    3      16     15548     15532    20.201   19.9492  0.00337573  0.00309134
    4      16     20906     20890   20.3826   20.9297  0.00282902  0.00306437
    5      16     26107     26091   20.3686   20.3164  0.00269844  0.00306698
    6      16     31246     31230   20.3187   20.0742  0.00339814  0.00307462
    7      16     36372     36356   20.2753   20.0234  0.00286653   0.0030813
    8      16     41470     41454   20.2293   19.9141  0.00272051  0.00308839
    9      16     46815     46799   20.3011   20.8789  0.00284063  0.00307738
Total time run:         10.0035
Total writes made:      51918
Write size:             4096
Object size:            4096
Bandwidth (MB/sec):     20.2734
Stddev Bandwidth:       0.464082
Max bandwidth (MB/sec): 20.9297
Min bandwidth (MB/sec): 19.8203
Average IOPS:           5189
Stddev IOPS:            118
Max IOPS:               5358
Min IOPS:               5074
Average Latency(s):     0.00308195
Stddev Latency(s):      0.00142825
Max latency(s):         0.0267947
Min latency(s):         0.00217364

rados bench -p scbench 10 rand
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      15     39691     39676    154.95   154.984  0.00027022 0.000395993
    2      16     83701     83685   163.416    171.91 0.000318949 0.000375363
    3      15    129218    129203   168.199   177.805 0.000300898 0.000364647
    4      15    173733    173718   169.617   173.887 0.000311723  0.00036156
    5      15    216073    216058   168.769   165.391 0.000407594 0.000363371
    6      16    260381    260365   169.483   173.074 0.000323371 0.000361829
    7      15    306838    306823   171.193   181.477 0.000284247 0.000358199
    8      15    353675    353660   172.661   182.957 0.000338128 0.000355139
    9      15    399221    399206   173.243   177.914 0.000422527  0.00035393
Total time run:       10.0003
Total reads made:     446353
Read size:            4096
Object size:          4096
Bandwidth (MB/sec):   174.351
Average IOPS:         44633
Stddev IOPS:          2220
Max IOPS:             46837
Min IOPS:             39676
Average Latency(s):   0.000351679
Max latency(s):       0.00530195
Min latency(s):       0.000135292
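
For comparing runs like these across clusters, a small parser for the summary section can help. This is only a sketch that assumes the "Name: number" layout shown in this thread; the function name and regular expression are illustrative and not part of any Ceph tooling.

```python
import re

# "Key: numeric value" lines, optionally prefixed by mail quote markers.
SUMMARY_LINE = re.compile(
    r"^[>\s]*([A-Za-z][^:]*):\s*([0-9]+(?:\.[0-9]+)?(?:[eE][-+]?[0-9]+)?)\s*$"
)


def parse_summary(text):
    """Return {metric name: float} for every numeric summary line in text."""
    stats = {}
    for line in text.splitlines():
        match = SUMMARY_LINE.match(line)
        if match:
            stats[match.group(1).strip()] = float(match.group(2))
    return stats


sample = """\
hints = 1
Total time run:       10.0003
Total reads made:     446353
Average IOPS:         44633
Average Latency(s):   0.000351679
"""

stats = parse_summary(sample)
# IOPS * latency approximates the number of requests in flight (about 16 here).
inflight = stats["Average IOPS"] * stats["Average Latency(s)"]
print(stats["Average IOPS"], round(inflight, 1))
```

Lines without a numeric value after the colon, such as the object prefix or the per-second table rows, are skipped.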

On Thu, Feb 7, 2019 at 2:17 AM  wrote:

> Hi List
>
> We are in the process of moving to the next usecase for our ceph cluster
> (Bulk, cheap, slow, erasurecoded, cephfs) storage was the first - and
> that works fine.
>
> We're currently on luminous / bluestore, if upgrading is deemed to
> change what we're seeing then please let us know.
>
> We have 6 OSD hosts, each with a S4510 of 1TB with 1 SSD in each. Connected
> through a H700 MegaRaid Perc BBWC, EachDiskRaid0 - and scheduler set to
> deadline, nomerges = 1, rotational = 0.
>
> Each disk "should" give approximately 36K IOPS random write and the double
> random read.
>
> Pool is setup with a 3x replicaiton. We would like a "scaleout" setup of
> well performing SSD block devices - potentially to host databases and
> things like that. I ready through this nice document [0], I know the
> HW are radically different from mine, but I still think I'm in the
> very low end of what 6 x S4510 should be capable of doing.
>
> Since it is IOPS i care about I have lowered block size to 4096 -- 4M
> blocksize nicely saturates the NIC's in both directions.
>
>
> $ sudo rados bench -p scbench -b 4096 10 write --no-cleanup
> hints = 1
> Maintaining 16 concurrent writes of 4096 bytes to objects of size 4096 for
> up to 10 seconds or 0 objects
> Object prefix: benchmark_data_torsk2_11207
>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg
> lat(s)
> 0   0 0 0 0 0   -
>  0
> 1  16  5857  5841   22.8155   22.8164  0.00238437
> 0.00273434
> 2  15 11768 11753   22.9533   23.0938   0.0028559
> 0.00271944
> 3  16 17264 17248   22.4564   21.4648  0.0024
> 0.00278101
> 4  16 22857 22841   22.3037   21.84770.002716
> 0.00280023
> 5  16 28462 28446   22.2213   21.8945  0.00220186
> 0.002811
> 6  16 34216 34200   22.2635   22.4766  0.00234315
> 0.00280552
> 7  16 39616 

Re: [ceph-users] rados block on SSD - performance - how to tune and get insight?

2019-02-07 Thread Marc Roos
 
4x nodes, around 100GB, 2x2660, 10Gbit, 2x LSI Logic SAS2308

> Thanks for the confirmation Marc
>
> Can you put in a bit more hardware/network details?
>
> Jesper






Re: [ceph-users] rados block on SSD - performance - how to tune and get insight?

2019-02-07 Thread jesper

Thanks for the confirmation Marc

Can you put in a bit more hardware/network details?

Jesper




Re: [ceph-users] rados block on SSD - performance - how to tune and get insight?

2019-02-07 Thread Marc Roos
 
I did your rados bench test on our sm863a pool 3x rep, got similar 
results.

[@]# rados bench -p fs_data.ssd -b 4096 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4096 bytes to objects of size 4096 
for up to 10 seconds or 0 objects
Object prefix: benchmark_data_c04_1337712
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16      6302      6286   24.5533   24.5547  0.00304773    0.002541
    2      15     12545     12530   24.4705   24.3906  0.00228294   0.0025506
    3      16     18675     18659   24.2933   23.9414  0.00332918  0.00257042
    4      16     25194     25178   24.5854   25.4648   0.0034176  0.00254016
    5      16     31657     31641   24.7169   25.2461  0.00156494  0.00252686
    6      16     37713     37697   24.5398   23.6562  0.00228134  0.00254527
    7      16     43848     43832   24.4572   23.9648  0.00238393  0.00255401
    8      16     49516     49500   24.1673   22.1406  0.00244473  0.00258466
    9      16     55562     55546   24.1059   23.6172  0.00249619  0.00259139
   10      16     61675     61659   24.0829   23.8789   0.0020192  0.00259362
Total time run:         10.002179
Total writes made:      61675
Write size:             4096
Object size:            4096
Bandwidth (MB/sec):     24.0865
Stddev Bandwidth:       0.932554
Max bandwidth (MB/sec): 25.4648
Min bandwidth (MB/sec): 22.1406
Average IOPS:           6166
Stddev IOPS:            238
Max IOPS:               6519
Min IOPS:               5668
Average Latency(s):     0.00259383
Stddev Latency(s):      0.00173856
Max latency(s):         0.0778051
Min latency(s):         0.00110931


[@ ]# rados bench -p fs_data.ssd  10 rand
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      15     27697     27682   108.115   108.133 0.000755936 0.000568212
    2      15     57975     57960   113.186   118.273 0.000547682 0.000542773
    3      15     88500     88485   115.199   119.238  0.00036749 0.000533185
    4      15    117199    117184   114.422   112.105 0.000354388 0.000536647
    5      15    147734    147719    115.39   119.277 0.000419781  0.00053221
    6      16    176393    176377   114.814   111.945 0.000427109 0.000534771
    7      15    203693    203678   113.645   106.645 0.000379089 0.000540113
    8      15    231917    231902   113.219    110.25 0.000465232 0.000542156
    9      16    261054    261038   113.284   113.812 0.000358025 0.000541972
Total time run:       10.000669
Total reads made:     290371
Read size:            4096
Object size:          4096
Bandwidth (MB/sec):   113.419
Average IOPS:         29035
Stddev IOPS:          1212
Max IOPS:             30535
Min IOPS:             27301
Average Latency(s):   0.000541371
Max latency(s):       0.00380609
Min latency(s):       0.000155521




-Original Message-
From: jes...@krogh.cc [mailto:jes...@krogh.cc] 
Sent: 07 February 2019 08:17
To: ceph-users@lists.ceph.com
Subject: [ceph-users] rados block on SSD - performance - how to tune and 
get insight?

Hi List

We are in the process of moving to the next usecase for our ceph cluster 
(Bulk, cheap, slow, erasurecoded, cephfs) storage was the first - and 
that works fine.

We're currently on luminous / bluestore, if upgrading is deemed to 
change what we're seeing then please let us know.

We have 6 OSD hosts, each with a S4510 of 1TB with 1 SSD in each. 
Connected through a H700 MegaRaid Perc BBWC, EachDiskRaid0 - and 
scheduler set to deadline, nomerges = 1, rotational = 0.

Each disk "should" give approximately 36K IOPS random write and the 
double random read.

Pool is setup with a 3x replicaiton. We would like a "scaleout" setup of 
well performing SSD block devices - potentially to host databases and 
things like that. I ready through this nice document [0], I know the HW 
are radically different from mine, but I still think I'm in the very low 
end of what 6 x S4510 should be capable of doing.

Since it is IOPS i care about I have lowered block size to 4096 -- 4M 
blocksize nicely saturates the NIC's in both directions.


$ sudo rados bench -p scbench -b 4096 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4096 bytes to objects of size 4096
for up to 10 seconds or 0 objects
Object prefix: benchmark_data_torsk2_11207
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16      5857      5841   22.8155   22.8164  0.00238437  0.00273434
    2      15     11768     11753   22.9533   23.0938   0.0028559  0.00271944
    3      16     17264     17248   22.4564   21.4648      0.0024  0.00278101
    4      16     22857     22841   22.3037   21.8477    0.002716  0.00280023

Re: [ceph-users] rados block on SSD - performance - how to tune and get insight?

2019-02-07 Thread jesper
> On 2/7/19 8:41 AM, Brett Chancellor wrote:
>> This seems right. You are doing a single benchmark from a single client.
>> Your limiting factor will be the network latency. For most networks this
>> is between 0.2 and 0.3ms.  if you're trying to test the potential of
>> your cluster, you'll need multiple workers and clients.
>>
>
> Indeed. To add to this, you will need fast (High clockspeed!) CPUs in
> order to get the latency down. The CPUs will need tuning as well like
> their power profiles and C-States.

Thanks for the insight. I'm aware, and my current CPUs are pretty old
- but I'm also in the process of learning how to make the right
decisions when expanding. If all my time ends up being spent at the
client end, then buying NVMe drives does not help me at all, nor do
better CPUs in the OSDs.

> You won't get the 1:1 performance from the SSDs on your RBD block devices.

I'm fully aware of that - Ceph / RBD / etc. come with an awesome feature
package, and that flexibility brings overhead that eats into performance.
But it helps to establish "upper bounds" and work my way to good from there.

Thanks.

Jesper




Re: [ceph-users] rados block on SSD - performance - how to tune and get insight?

2019-02-07 Thread jesper
> On Thu, 7 Feb 2019 08:17:20 +0100 jes...@krogh.cc wrote:
>> Hi List
>>
>> We are in the process of moving to the next usecase for our ceph cluster
>> (Bulk, cheap, slow, erasurecoded, cephfs) storage was the first - and
>> that works fine.
>>
>> We're currently on luminous / bluestore, if upgrading is deemed to
>> change what we're seeing then please let us know.
>>
>> We have 6 OSD hosts, each with a S4510 of 1TB with 1 SSD in each.
>> Connected
>> through a H700 MegaRaid Perc BBWC, EachDiskRaid0 - and scheduler set to
>> deadline, nomerges = 1, rotational = 0.
>>
> I'd make sure that the endurance of these SSDs is in line with your
> expected usage.

They are - at the moment :-) and Ceph allows me to change my mind without
interfering with the applications running on top - Nice!

>> Each disk "should" give approximately 36K IOPS random write and the
>> double
>> random read.
>>
> Only locally, latency is your enemy.
>
> Tell us more about your network.

It is a Dell N4032, N4064 switch stack on 10Gbase-T.
All hosts are on same subnet, NIC's are Intel X540's
No-jumbo-framing and not much tuning - all kernels are on 4.15 (Ubuntu)

Pings from client to two of the osd's
--- flodhest.nzcorp.net ping statistics ---
50 packets transmitted, 50 received, 0% packet loss, time 50157ms
rtt min/avg/max/mdev = 0.075/0.105/0.158/0.021 ms
--- bison.nzcorp.net ping statistics ---
50 packets transmitted, 50 received, 0% packet loss, time 50139ms
rtt min/avg/max/mdev = 0.078/0.137/0.275/0.032 ms


> rados bench is not the sharpest tool in the shed for this.
> As it needs to allocate stuff to begin with, amongst other things.

Suggest longer test-runs?

>> This is also quite far from expected. I have 12GB of memory on the OSD
>> daemon for caching on each host - close to idle cluster - thus 50GB+ for
>> caching with a working set of < 6GB .. this should - in this case
>> not really be bound by the underlying SSD.
> Did you adjust the bluestore parameters (whatever they are this week or
> for your version) to actually use that memory?

According to top - it is picking up the caching memory.
We have this block.

bluestore_cache_kv_max = 214748364800
bluestore_cache_kv_ratio = 0.4
bluestore_cache_meta_ratio = 0.1
bluestore_cache_size_hdd = 13958643712
bluestore_cache_size_ssd = 13958643712
bluestore_rocksdb_options =
compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152,compact_on_mount=false

I actually think most of the above was applied with the 10TB hard drives
in mind, not the SSDs .. but I have no idea if the settings do "bad things"
for us.

> Don't use iostat, use atop.
> Small IOPS are extremely CPU intensive, so atop will give you an insight
> as to what might be busy besides the actual storage device.

Thanks, will do so.

More suggestions are welcome.

Doing some math:
Say network latency was the only cost driver - assume one round trip per
IO per thread.

16 threads - 0.15ms per round trip - gives 1000 ms/s/thread / 0.15 ms/IO
=> ~6666 IOPS per thread * 16 threads => ~106666 IOPS
OK, that's at least an upper bound on expectations in this scenario, and I
am at 28207, thus about 4x off, and have
still not accounted for any OSD or rbd userspace time in the equation.
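
The same estimate as a short sketch, assuming (as above) that each IO costs exactly one network round trip and nothing else. The 0.15 ms round trip, the 16 threads and the measured 28207 IOPS come from this mail; the function name is made up for illustration.

```python
def latency_bound_iops(threads, round_trip_ms):
    """Upper bound on IOPS if one network round trip per IO were the only cost."""
    per_thread = 1000.0 / round_trip_ms   # IOs per second for a single thread
    return per_thread * threads


bound = latency_bound_iops(threads=16, round_trip_ms=0.15)
measured = 28207                           # Average IOPS from the rand run
ratio = bound / measured

print(f"bound {bound:.0f} IOPS, measured {measured}, {ratio:.1f}x off")
```

Any time spent in the OSD or in the client library adds to the per-IO time and lowers the bound accordingly.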

Can I get service time directly out of the OSD daemon? It would be nice
to see how many ms are spent at that end from an OSD perspective.

Jesper

-- 
Jesper



Re: [ceph-users] rados block on SSD - performance - how to tune and get insight?

2019-02-07 Thread Wido den Hollander


On 2/7/19 8:41 AM, Brett Chancellor wrote:
> This seems right. You are doing a single benchmark from a single client.
> Your limiting factor will be the network latency. For most networks this
> is between 0.2 and 0.3ms.  if you're trying to test the potential of
> your cluster, you'll need multiple workers and clients.
> 

Indeed. To add to this, you will need fast (High clockspeed!) CPUs in
order to get the latency down. The CPUs will need tuning as well like
their power profiles and C-States.

You won't get the 1:1 performance from the SSDs on your RBD block devices.

Wido

> On Thu, Feb 7, 2019, 2:17 AM  
> Hi List
> 
> We are in the process of moving to the next usecase for our ceph cluster
> (Bulk, cheap, slow, erasurecoded, cephfs) storage was the first - and
> that works fine.
> 
> We're currently on luminous / bluestore, if upgrading is deemed to
> change what we're seeing then please let us know.
> 
> We have 6 OSD hosts, each with a S4510 of 1TB with 1 SSD in each.
> Connected
> through a H700 MegaRaid Perc BBWC, EachDiskRaid0 - and scheduler set to
> deadline, nomerges = 1, rotational = 0.
> 
> Each disk "should" give approximately 36K IOPS random write and the
> double
> random read.
> 
> Pool is setup with a 3x replicaiton. We would like a "scaleout" setup of
> well performing SSD block devices - potentially to host databases and
> things like that. I ready through this nice document [0], I know the
> HW are radically different from mine, but I still think I'm in the
> very low end of what 6 x S4510 should be capable of doing.
> 
> Since it is IOPS i care about I have lowered block size to 4096 -- 4M
> blocksize nicely saturates the NIC's in both directions.
> 
> 
> $ sudo rados bench -p scbench -b 4096 10 write --no-cleanup
> hints = 1
> Maintaining 16 concurrent writes of 4096 bytes to objects of size 4096 for
> up to 10 seconds or 0 objects
> Object prefix: benchmark_data_torsk2_11207
>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
>     0       0         0         0         0         0            -           0
>     1      16      5857      5841   22.8155   22.8164   0.00238437  0.00273434
>     2      15     11768     11753   22.9533   23.0938    0.0028559  0.00271944
>     3      16     17264     17248   22.4564   21.4648       0.0024  0.00278101
>     4      16     22857     22841   22.3037   21.8477     0.002716  0.00280023
>     5      16     28462     28446   22.2213   21.8945   0.00220186    0.002811
>     6      16     34216     34200   22.2635   22.4766   0.00234315  0.00280552
>     7      16     39616     39600   22.0962   21.0938   0.00290661  0.00282718
>     8      16     45510     45494   22.2118   23.0234    0.0033541  0.00281253
>     9      16     50995     50979   22.1243   21.4258   0.00267282  0.00282371
>    10      16     56745     56729   22.1577   22.4609   0.00252583   0.0028193
> Total time run:         10.002668
> Total writes made:      56745
> Write size:             4096
> Object size:            4096
> Bandwidth (MB/sec):     22.1601
> Stddev Bandwidth:       0.712297
> Max bandwidth (MB/sec): 23.0938
> Min bandwidth (MB/sec): 21.0938
> Average IOPS:           5672
> Stddev IOPS:            182
> Max IOPS:               5912
> Min IOPS:               5400
> Average Latency(s):     0.00281953
> Stddev Latency(s):      0.00190771
> Max latency(s):         0.0834767
> Min latency(s):         0.00120945
> 
> Min latency is fine -- but Max latency of 83ms ?
> Average IOPS @ 5672 ?
> 
> $ sudo rados bench -p scbench  10 rand
> hints = 1
>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)   avg lat(s)
>     0       0         0         0         0         0            -            0
>     1      15     23329     23314   91.0537   91.0703  0.000349856  0.000679074
>     2      16     48555     48539   94.7884   98.5352  0.000499159  0.000652067
>     3      16     76193     76177   99.1747   107.961  0.000443877  0.000622775
>     4      15    103923    103908   101.459   108.324  0.000678589  0.000609182
>     5      15    132720    132705   103.663   112.488  0.000741734  0.000595998
>     6      15    161811    161796   105.323   113.637  0.000333166  0.000586323
>     7      15    190196    190181   106.115   110.879  0.000612227  0.000582014
>     8      15    221155    221140   107.966   120.934  0.000471219  0.000571944
>     9      16    251143    251127   108.984   117.137  0.000267528  0.000566659
> Total time run:       10.000640
> Total reads made:     282097
> Read size:            4096
> Object size:          

Re: [ceph-users] rados block on SSD - performance - how to tune and get insight?

2019-02-06 Thread Christian Balzer
Hello,

On Thu, 7 Feb 2019 08:17:20 +0100 jes...@krogh.cc wrote:

> Hi List
> 
> We are in the process of moving to the next use case for our Ceph cluster.
> Bulk, cheap, slow, erasure-coded CephFS storage was the first, and
> that works fine.
> 
> We're currently on Luminous / BlueStore. If upgrading is likely to
> change what we're seeing, please let us know.
> 
> We have 6 OSD hosts, each with a single 1TB S4510 SSD, connected
> through an H700 MegaRAID PERC with BBWC, each disk as its own RAID0, with
> the scheduler set to deadline, nomerges = 1, rotational = 0.
> 
I'd make sure that the endurance of these SSDs is in line with your
expected usage.

> Each disk "should" give approximately 36K IOPS random write and double
> that for random read.
>
Only locally, latency is your enemy.

Tell us more about your network.

> The pool is set up with 3x replication. We would like a "scale-out" setup of
> well-performing SSD block devices, potentially to host databases and
> things like that. I read through this nice document [0]. I know the
> hardware is radically different from mine, but I still think I'm at the
> very low end of what 6 x S4510 should be capable of doing.
> 
> Since it is IOPS I care about, I have lowered the block size to 4096; a 4M
> block size nicely saturates the NICs in both directions.
> 
> 
rados bench is not the sharpest tool in the shed for this,
as it needs to allocate stuff to begin with, amongst other things.

And before you go "fio with RBD engine", that had major issues in my
experience, too.
Your best and most realistic results will come from doing the testing
inside a VM (I presume from your use case) or a mounted RBD block device.

And then using fio, of course.
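
As a rough illustration of that advice, here is a minimal sketch that assembles a 4K random-write fio command line for a mapped RBD block device. The device path /dev/rbd0, the job name and the helper function are assumptions made for the example, not anything taken from this thread. The queue depth of 16 mirrors the 16 concurrent writes rados bench used above. The script only prints the command; running it against a device with data on it would destroy that data.

```python
import shlex
from typing import List

# Hypothetical path of a kernel-mapped RBD image; adjust to your setup.
# WARNING: random writes to a raw block device overwrite whatever is on it.
DEVICE = "/dev/rbd0"


def fio_randwrite_cmd(device: str, iodepth: int = 16, runtime_s: int = 60) -> List[str]:
    """Build (but do not run) a fio 4K random-write job against `device`."""
    return [
        "fio",
        "--name=rbd-4k-randwrite",   # arbitrary job name
        f"--filename={device}",
        "--ioengine=libaio",         # Linux native async I/O
        "--direct=1",                # bypass the page cache
        "--rw=randwrite",
        "--bs=4k",
        f"--iodepth={iodepth}",      # in-flight requests, like rados bench -t
        "--numjobs=1",
        "--time_based",
        f"--runtime={runtime_s}",
        "--group_reporting",
    ]


cmd = fio_randwrite_cmd(DEVICE)
print(shlex.join(cmd))  # copy/paste into a shell once the target is confirmed
```

Inside a VM the same job would point at the virtual disk instead of /dev/rbd0.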

> $ sudo rados bench -p scbench -b 4096 10 write --no-cleanup
> hints = 1
> Maintaining 16 concurrent writes of 4096 bytes to objects of size 4096 for
> up to 10 seconds or 0 objects
> Object prefix: benchmark_data_torsk2_11207
>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
>     0       0         0         0         0         0            -           0
>     1      16      5857      5841   22.8155   22.8164   0.00238437  0.00273434
>     2      15     11768     11753   22.9533   23.0938    0.0028559  0.00271944
>     3      16     17264     17248   22.4564   21.4648       0.0024  0.00278101
>     4      16     22857     22841   22.3037   21.8477     0.002716  0.00280023
>     5      16     28462     28446   22.2213   21.8945   0.00220186    0.002811
>     6      16     34216     34200   22.2635   22.4766   0.00234315  0.00280552
>     7      16     39616     39600   22.0962   21.0938   0.00290661  0.00282718
>     8      16     45510     45494   22.2118   23.0234    0.0033541  0.00281253
>     9      16     50995     50979   22.1243   21.4258   0.00267282  0.00282371
>    10      16     56745     56729   22.1577   22.4609   0.00252583   0.0028193
> Total time run:         10.002668
> Total writes made:      56745
> Write size:             4096
> Object size:            4096
> Bandwidth (MB/sec):     22.1601
> Stddev Bandwidth:       0.712297
> Max bandwidth (MB/sec): 23.0938
> Min bandwidth (MB/sec): 21.0938
> Average IOPS:           5672
> Stddev IOPS:            182
> Max IOPS:               5912
> Min IOPS:               5400
> Average Latency(s):     0.00281953
> Stddev Latency(s):      0.00190771
> Max latency(s):         0.0834767
> Min latency(s):         0.00120945
> 
> Min latency is fine -- but Max latency of 83ms ?
>
Outliers during setup are to be expected and ignored.

> Average IOPS @ 5672 ?
> 
Plenty of good reasons to come up with that number, yes.

> $ sudo rados bench -p scbench  10 rand
> hints = 1
>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)   avg lat(s)
>     0       0         0         0         0         0            -            0
>     1      15     23329     23314   91.0537   91.0703  0.000349856  0.000679074
>     2      16     48555     48539   94.7884   98.5352  0.000499159  0.000652067
>     3      16     76193     76177   99.1747   107.961  0.000443877  0.000622775
>     4      15    103923    103908   101.459   108.324  0.000678589  0.000609182
>     5      15    132720    132705   103.663   112.488  0.000741734  0.000595998
>     6      15    161811    161796   105.323   113.637  0.000333166  0.000586323
>     7      15    190196    190181   106.115   110.879  0.000612227  0.000582014
>     8      15    221155    221140   107.966   120.934  0.000471219  0.000571944
>     9      16    251143    251127   108.984   117.137  0.000267528  0.000566659
> Total time run:       10.000640
> Total reads made:     282097
> Read size:            4096
> Object size:          4096
> Bandwidth (MB/sec):   110.187
> Average IOPS:         28207
> Stddev IOPS:          2357
> Max IOPS:             30959
> Min IOPS:             23314
> Average Latency(s):   0.000560402
> Max latency(s):       0.109804
> Min latency(s):       0.000212671
> 
> This is also quite far from expected. I have 12GB of memory on the 

Re: [ceph-users] rados block on SSD - performance - how to tune and get insight?

2019-02-06 Thread Brett Chancellor
This seems right. You are doing a single benchmark from a single client.
Your limiting factor will be the network latency. For most networks this is
between 0.2 and 0.3ms. If you're trying to test the potential of your
cluster, you'll need multiple workers and clients.
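
To make the latency argument concrete, here is a small sketch using Little's law (throughput = requests in flight / average latency). The numbers are copied from the rados bench output quoted in this thread; the function name is made up for the example, and the model assumes all 16 slots are kept busy for the whole run.

```python
def predicted_iops(concurrency: int, avg_latency_s: float) -> float:
    """IOPS sustained by `concurrency` in-flight ops at a given average latency."""
    return concurrency / avg_latency_s


# rados bench kept 16 operations in flight in both runs.
write_iops = predicted_iops(16, 0.00281953)   # reported Average IOPS: 5672
read_iops = predicted_iops(16, 0.000560402)   # reported Average IOPS: 28207

print(f"4K write: predicted {write_iops:.0f} IOPS, reported 5672")
print(f"4K read:  predicted {read_iops:.0f} IOPS, reported 28207")

# Same relation solved for latency: what average latency would 16 in-flight
# ops need in order to reach the 190K IOPS estimate from the original post?
needed_latency_ms = 16 / 190_000 * 1000
print(f"190K IOPS at 16 in flight needs about {needed_latency_ms:.3f} ms per op")
```

Both predictions land within a couple of percent of what rados bench reported, so the measured IOPS are simply what 16 in-flight requests give at those latencies. Reaching the 190K estimate at the same concurrency would need a per-op latency below the 0.2 to 0.3ms network figure mentioned above, which is why more workers and clients are needed.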

On Thu, Feb 7, 2019, 2:17 AM
> Hi List
>
> We are in the process of moving to the next use case for our Ceph cluster.
> Bulk, cheap, slow, erasure-coded CephFS storage was the first, and
> that works fine.
>
> We're currently on Luminous / BlueStore. If upgrading is likely to
> change what we're seeing, please let us know.
>
> We have 6 OSD hosts, each with a single 1TB S4510 SSD, connected
> through an H700 MegaRAID PERC with BBWC, each disk as its own RAID0, with
> the scheduler set to deadline, nomerges = 1, rotational = 0.
>
> Each disk "should" give approximately 36K IOPS random write and double
> that for random read.
>
> The pool is set up with 3x replication. We would like a "scale-out" setup of
> well-performing SSD block devices, potentially to host databases and
> things like that. I read through this nice document [0]. I know the
> hardware is radically different from mine, but I still think I'm at the
> very low end of what 6 x S4510 should be capable of doing.
>
> Since it is IOPS I care about, I have lowered the block size to 4096; a 4M
> block size nicely saturates the NICs in both directions.
>
>
> $ sudo rados bench -p scbench -b 4096 10 write --no-cleanup
> hints = 1
> Maintaining 16 concurrent writes of 4096 bytes to objects of size 4096 for
> up to 10 seconds or 0 objects
> Object prefix: benchmark_data_torsk2_11207
>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
>     0       0         0         0         0         0            -           0
>     1      16      5857      5841   22.8155   22.8164   0.00238437  0.00273434
>     2      15     11768     11753   22.9533   23.0938    0.0028559  0.00271944
>     3      16     17264     17248   22.4564   21.4648       0.0024  0.00278101
>     4      16     22857     22841   22.3037   21.8477     0.002716  0.00280023
>     5      16     28462     28446   22.2213   21.8945   0.00220186    0.002811
>     6      16     34216     34200   22.2635   22.4766   0.00234315  0.00280552
>     7      16     39616     39600   22.0962   21.0938   0.00290661  0.00282718
>     8      16     45510     45494   22.2118   23.0234    0.0033541  0.00281253
>     9      16     50995     50979   22.1243   21.4258   0.00267282  0.00282371
>    10      16     56745     56729   22.1577   22.4609   0.00252583   0.0028193
> Total time run:         10.002668
> Total writes made:      56745
> Write size:             4096
> Object size:            4096
> Bandwidth (MB/sec):     22.1601
> Stddev Bandwidth:       0.712297
> Max bandwidth (MB/sec): 23.0938
> Min bandwidth (MB/sec): 21.0938
> Average IOPS:           5672
> Stddev IOPS:            182
> Max IOPS:               5912
> Min IOPS:               5400
> Average Latency(s):     0.00281953
> Stddev Latency(s):      0.00190771
> Max latency(s):         0.0834767
> Min latency(s):         0.00120945
>
> Min latency is fine -- but Max latency of 83ms ?
> Average IOPS @ 5672 ?
>
> $ sudo rados bench -p scbench  10 rand
> hints = 1
>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)   avg lat(s)
>     0       0         0         0         0         0            -            0
>     1      15     23329     23314   91.0537   91.0703  0.000349856  0.000679074
>     2      16     48555     48539   94.7884   98.5352  0.000499159  0.000652067
>     3      16     76193     76177   99.1747   107.961  0.000443877  0.000622775
>     4      15    103923    103908   101.459   108.324  0.000678589  0.000609182
>     5      15    132720    132705   103.663   112.488  0.000741734  0.000595998
>     6      15    161811    161796   105.323   113.637  0.000333166  0.000586323
>     7      15    190196    190181   106.115   110.879  0.000612227  0.000582014
>     8      15    221155    221140   107.966   120.934  0.000471219  0.000571944
>     9      16    251143    251127   108.984   117.137  0.000267528  0.000566659
> Total time run:       10.000640
> Total reads made:     282097
> Read size:            4096
> Object size:          4096
> Bandwidth (MB/sec):   110.187
> Average IOPS:         28207
> Stddev IOPS:          2357
> Max IOPS:             30959
> Min IOPS:             23314
> Average Latency(s):   0.000560402
> Max latency(s):       0.109804
> Min latency(s):       0.000212671
>
> This is also quite far from expected. I have 12GB of memory for the OSD
> daemon's cache on each host, on a close-to-idle cluster, thus 50GB+ for
> caching with a working set of < 6GB, so in this case it should
> not really be bound by the underlying SSD. But if it were:
>
> IOPS/disk * num disks / replication => 95K * 6 / 3 => 190K or 6x off?
>
> No measurable service time in iostat when running tests, thus I have
> come to the conclusion that it has to be either client