Re: [ceph-users] What's the best practice for Erasure Coding

2019-07-11 Thread Frank Schilder
Oh dear. Every occurrence of stripe_* is wrong :)

It should be stripe_count (option --stripe-count in rbd create) everywhere in 
my text.

What choices are legal depends on the restrictions on stripe_count*stripe_unit 
(=stripe_size=stripe_width?) imposed by ceph. I believe all of this ends up 
being powers of 2.

Yes, the 6+2 is a bit surprising. I have no explanation for the observation. It 
just seems a good argument for "do not trust what you believe, gather facts". 
And to try things that seem non-obvious - just to be sure.

Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: ceph-users  on behalf of Lars 
Marowsky-Bree 
Sent: 11 July 2019 12:17:37
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] What's the best practice for Erasure Coding

On 2019-07-11T09:46:47, Frank Schilder  wrote:

> Striping with stripe units other than 1 is something I also tested. I found 
> that with EC pools non-trivial striping should be avoided. Firstly, EC is 
> already a striped format and, secondly, striping on top of that with 
> stripe_unit>1 will make every write an ec_overwrite, because now shards are 
> rarely if ever written as a whole.

That's why I said that rbd's stripe_unit should match the EC pool's
stripe_width, or be a 2^n multiple of it. (Not sure what stripe_count
should be set to, probably also a small power of two.)

> The native striping in EC pools comes from k, data is striped over k disks. 
> The higher k the more throughput at the expense of cpu and network.

Increasing k also increases stripe_width though; this leads to more IO
suffering from the ec_overwrite penalty.

> In my long list, this should actually be point
>
> 6) Use stripe_unit=1 (default).

You mean stripe-count?

> To get back to your question, this is another argument for k=power-of-two. 
> Object sizes in ceph are always powers of 2 and stripe sizes contain k as a 
> factor. Hence, any prime factor other than 2 in k will imply a mismatch. How 
> badly a mismatch affects performance should be tested.

Yes, of course. Depending on the IO pattern, this means more IO will be
misaligned or have non-stripe_width portions. (Most IO patterns, if they
strive for alignment, aim for a power of two alignment, obviously.)

> Results with non-trivial striping (stripe_size>1) were so poor, I did not 
> even include them in my report.

stripe_size?

> We use the 8+2 pool for ceph fs, where throughput is important. The 6+2 pool 
> is used for VMs (RBD images), where IOP/s are more important. It also offers 
> a higher redundancy level. It's an acceptable compromise for us.

Especially with RBDs, I'm surprised that k=6 works well for you. Block
device IO is most commonly aligned on power-of-two boundaries.


Regards,
Lars

--
SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah, HRB 21284 (AG 
Nürnberg)
"Architects should open possibilities and not determine everything." (Ueli 
Zbinden)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What's the best practice for Erasure Coding

2019-07-11 Thread Lars Marowsky-Bree
On 2019-07-11T09:46:47, Frank Schilder  wrote:

> Striping with stripe units other than 1 is something I also tested. I found 
> that with EC pools non-trivial striping should be avoided. Firstly, EC is 
> already a striped format and, secondly, striping on top of that with 
> stripe_unit>1 will make every write an ec_overwrite, because now shards are 
> rarely if ever written as a whole.

That's why I said that rbd's stripe_unit should match the EC pool's
stripe_width, or be a 2^n multiple of it. (Not sure what stripe_count
should be set to, probably also a small power of two.)
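
(For reference, the pool's stripe_width can be looked up and the image options chosen
accordingly; pool and image names below are placeholders:)

    # stripe_width is shown in bytes, e.g. 32768 for an 8+2 profile with 4K chunks
    ceph osd pool ls detail | grep ec_data

    # pick the rbd stripe_unit equal to, or a 2^n multiple of, that value
    rbd create --size 100G --data-pool ec_data \
        --stripe-unit 32768 --stripe-count 4 rbd_meta/vm01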

> The native striping in EC pools comes from k, data is striped over k disks. 
> The higher k the more throughput at the expense of cpu and network.

Increasing k also increases stripe_width though; this leads to more IO
suffering from the ec_overwrite penalty.

> In my long list, this should actually be point
> 
> 6) Use stripe_unit=1 (default).

You mean stripe-count?

> To get back to your question, this is another argument for k=power-of-two. 
> Object sizes in ceph are always powers of 2 and stripe sizes contain k as a 
> factor. Hence, any prime factor other than 2 in k will imply a mismatch. How 
> badly a mismatch affects performance should be tested.

Yes, of course. Depending on the IO pattern, this means more IO will be
misaligned or have non-stripe_width portions. (Most IO patterns, if they
strive for alignment, aim for a power of two alignment, obviously.)

> Results with non-trivial striping (stripe_size>1) were so poor, I did not 
> even include them in my report.

stripe_size?

> We use the 8+2 pool for ceph fs, where throughput is important. The 6+2 pool 
> is used for VMs (RBD images), where IOP/s are more important. It also offers 
> a higher redundancy level. It's an acceptable compromise for us.

Especially with RBDs, I'm surprised that k=6 works well for you. Block
device IO is most commonly aligned on power-of-two boundaries.


Regards,
Lars

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah, HRB 21284 (AG 
Nürnberg)
"Architects should open possibilities and not determine everything." (Ueli 
Zbinden)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What's the best practice for Erasure Coding

2019-07-11 Thread Frank Schilder
Striping with stripe units other than 1 is something I also tested. I found 
that with EC pools non-trivial striping should be avoided. Firstly, EC is 
already a striped format and, secondly, striping on top of that with 
stripe_unit>1 will make every write an ec_overwrite, because now shards are 
rarely if ever written as a whole.

The native striping in EC pools comes from k, data is striped over k disks. The 
higher k the more throughput at the expense of cpu and network.

In my long list, this should actually be point

6) Use stripe_unit=1 (default).

To get back to your question, this is another argument for k=power-of-two. 
Object sizes in ceph are always powers of 2 and stripe sizes contain k as a 
factor. Hence, any prime factor other than 2 in k will imply a mismatch. How 
badly a mismatch affects performance should be tested.

Example: on our 6+2 EC pool I have stripe_width  24576, which has 3 as a 
factor. The 3 comes from k=6=3*2 and will always be there. This implies a 
misalignment and some writes will have to be split/padded in the middle. This 
does not happen too often per object, so 6+2 performance is good, but not as 
good as 8+2 performance.
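
The arithmetic behind that number, assuming the default 4K EC chunk size:

    stripe_width = k * chunk_size = 6 * 4096 = 24576 = 2^13 * 3   (factor 3)
    stripe_width = 8 * 4096 = 32768 = 2^15                        (pure power of 2)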

Some numbers:

1) rbd object size 8MB, 4 servers writing with 1 process each (=4 workers):
EC profile   4K random write      sequential write (8M write size)
             IOP/s aggregated     MB/s aggregated
 5+2              802.30              1156.05
 6+2             1188.26              1873.67
 8+2             1210.27              2510.78
10+4              421.80               681.22

2) rbd object size 8MB, 4 servers writing with 4 processes each (=16 workers):
EC profile   4K random write      sequential write (8M write size)
             IOP/s aggregated     MB/s aggregated
 6+2             1384.43              3139.14
 8+2             1343.34              4069.27

The EC-profiles with factor 5 are so bad that I didn't repeat the multi-process 
tests (2) with these. I had limited time and went for the discard-early 
strategy to find suitable parameters.

The 25% smaller throughput (6+2 vs 8+2) in test (2) is probably due to the fact 
that data is striped over 6 instead of 8 disks. There might be some impact of 
the factor 3 somewhere as well, but it seems negligible in the scenario I 
tested.
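
(Quick check: 3139.14/4069.27 ≈ 0.77, which is indeed close to 6/8 = 0.75.)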

Results with non-trivial striping (stripe_size>1) were so poor, I did not even 
include them in my report.

We use the 8+2 pool for ceph fs, where throughput is important. The 6+2 pool is 
used for VMs (RBD images), where IOP/s are more important. It also offers a 
higher redundancy level. It's an acceptable compromise for us.

Note that numbers will vary depending on hardware, OSD config, kernel parameters,
etc. One needs to test with what one has.
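
For completeness, the tests were rbd bench runs of roughly this form (image and pool
names are placeholders; adjust sizes to your setup):

    # 4K random writes
    rbd bench --io-type write --io-size 4K --io-pattern rand \
        --io-threads 16 --io-total 10G rbd_meta/testimg

    # sequential writes with 8M write size
    rbd bench --io-type write --io-size 8M --io-pattern seq \
        --io-threads 16 --io-total 100G rbd_meta/testimg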

Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: ceph-users  on behalf of Lars 
Marowsky-Bree 
Sent: 11 July 2019 10:14:04
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] What's the best practice for Erasure Coding

On 2019-07-09T07:27:28, Frank Schilder  wrote:

> Small addition:
>
> This result holds for rbd bench. It seems to imply good performance for 
> large-file IO on cephfs, since cephfs will split large files into many 
> objects of size object_size. Small-file IO is a different story.
>
> The formula should be N*alloc_size=object_size/k, where N is some integer, 
> i.e. object_size/k should be an integer multiple of alloc_size.

If using rbd striping, I'd also assume that making rbd's stripe_unit be
equal to, or at least a multiple of, the stripe_width of the EC pool is
sensible.

(Similar for CephFS's layout.)

Does this hold in your environment?


--
SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah, HRB 21284 (AG 
Nürnberg)
"Architects should open possibilities and not determine everything." (Ueli 
Zbinden)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What's the best practice for Erasure Coding

2019-07-11 Thread Lars Marowsky-Bree
On 2019-07-09T07:27:28, Frank Schilder  wrote:

> Small addition:
> 
> This result holds for rbd bench. It seems to imply good performance for 
> large-file IO on cephfs, since cephfs will split large files into many 
> objects of size object_size. Small-file IO is a different story.
> 
> The formula should be N*alloc_size=object_size/k, where N is some integer, 
> i.e. object_size/k should be an integer multiple of alloc_size.

If using rbd striping, I'd also assume that making rbd's stripe_unit be
equal to, or at least a multiple of, the stripe_width of the EC pool is
sensible.

(Similar for CephFS's layout.)

Does this hold in your environment?
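
(For CephFS the equivalent knobs are the file layout attributes; a sketch with made-up
paths and pool names, assuming an 8+2 pool with stripe_width 32768:)

    # the EC pool must already be a data pool of the filesystem:
    #   ceph fs add_data_pool <fsname> ec_data
    setfattr -n ceph.dir.layout.pool -v ec_data /mnt/cephfs/bulk
    setfattr -n ceph.dir.layout.stripe_unit -v 32768 /mnt/cephfs/bulk
    setfattr -n ceph.dir.layout.stripe_count -v 1 /mnt/cephfs/bulk
    setfattr -n ceph.dir.layout.object_size -v 4194304 /mnt/cephfs/bulk
    getfattr -n ceph.dir.layout /mnt/cephfs/bulk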


-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah, HRB 21284 (AG 
Nürnberg)
"Architects should open possibilities and not determine everything." (Ueli 
Zbinden)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What's the best practice for Erasure Coding

2019-07-09 Thread Frank Schilder
Small addition:

This result holds for rbd bench. It seems to imply good performance for 
large-file IO on cephfs, since cephfs will split large files into many objects 
of size object_size. Small-file IO is a different story.

The formula should be N*alloc_size=object_size/k, where N is some integer, i.e. 
object_size/k should be an integer multiple of alloc_size.
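
A quick sanity check with 4M objects and the bluestore default of 64K min_alloc_size
for HDDs (the default at the time of writing):

    k=8:  object_size/k = 4M/8 = 512K = 8 * 64K    -> N=8, fits
    k=6:  object_size/k = 4M/6 ≈ 683K              -> not a multiple of 64K, mismatch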

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder
Sent: 09 July 2019 09:22
To: Nathan Fish; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] What's the best practice for Erasure Coding

Hi Nathan,

It's just a hypothesis. I did not check what the algorithm does.

The reasoning is this. Bluestore and modern disks have preferred read/write 
sizes that are quite large for large drives. These are usually powers of 2. If 
you use a k+m EC profile, any read/write is split into k fragments. What I 
observe is that throughput seems best if these fragments are multiples of the 
preferred read/write sizes.

Any prime factor other than 2 will imply split-ups that don't fit perfectly. 
The mismatch tends to be worse the larger a prime factor and the smaller the 
object size. At least this is a correlation I observed in benchmarks. Since 
correlation does not mean causation, I will not claim that my hypothesis is an 
explanation of the observation.

Nevertheless, bluestore has default alloc sizes and just for storage efficiency 
I would aim for alloc_size=object_size/k. Coincidentally, for 
spinning disks this also seems to imply best performance.

If this is wrong, maybe a disk IO expert can provide a better explanation as a 
guide for EC profile choices?

Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: ceph-users  on behalf of Nathan Fish 

Sent: 08 July 2019 18:07:25
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] What's the best practice for Erasure Coding

This is very interesting, thank you. I'm curious, what is the reason
for avoiding k's with large prime factors? If I set k=5, what happens?

On Mon, Jul 8, 2019 at 8:56 AM Lei Liu  wrote:
>
> Hi Frank,
>
> Thanks for sharing valuable experience.
>
> Frank Schilder  wrote on Monday, 8 July 2019 at 4:36 PM:
>>
>> Hi David,
>>
>> I'm running a cluster with bluestore on raw devices (no lvm) and all 
>> journals collocated on the same disk with the data. Disks are spinning 
>> NL-SAS. Our goal was to build storage at lowest cost, therefore all data on 
>> HDD only. I got a few SSDs that I'm using for FS and RBD meta data. All 
>> large pools are EC on spinning disk.
>>
>> I spent at least one month to run detailed benchmarks (rbd bench) depending 
>> on EC profile, object size, write size, etc. Results were varying a lot. My 
>> advice would be to run benchmarks with your hardware. If there was a single 
>> perfect choice, there wouldn't be so many options. For example, my tests 
>> will not be valid when using separate fast disks for WAL and DB.
>>
>> There are some results though that might be valid in general:
>>
>> 1) EC pools have high throughput but low IOP/s compared with replicated pools
>>
>> I see single-thread write speeds of up to 1.2GB (gigabyte) per second, which 
>> is probably the network limit and not the disk limit. IOP/s get better with 
>> more disks, but are way lower than what replicated pools can provide. On a 
>> cephfs with EC data pool, small-file IO will be comparably slow and eat a 
>> lot of resources.
>>
>> 2) I observe massive network traffic amplification on small IO sizes, which 
>> is due to the way EC overwrites are handled. This is one bottleneck for 
>> IOP/s. We have 10G infrastructure and use 2x10G client and 4x10G OSD 
>> network. OSD bandwidth at least 2x client network, better 4x or more.
>>
>> 3) k should only have small prime factors, power of 2 if possible
>>
>> I tested k=5,6,8,10,12. Best results in decreasing order: k=8, k=6. All 
>> other choices were poor. The value of m seems not relevant for performance. 
>> Larger k will require more failure domains (more hardware).
>>
>> 4) object size matters
>>
>> The best throughput (1M write size) I see with object sizes of 4MB or 8MB, 
>> with IOP/s getting somewhat better with smaller object sizes but throughput 
>> dropping fast. I use the default of 4MB in production. Works well for us.
>>
>> 5) jerasure is quite good and seems most flexible
>>
>> jerasure is quite CPU efficient and can handle smaller chunk sizes than 
>> other plugins, which is preferrable for IOP/s. However, CPU usage can become 
>> a problem and a plugin optimized for spe

Re: [ceph-users] What's the best practice for Erasure Coding

2019-07-09 Thread Frank Schilder
Hi Nathan,

It's just a hypothesis. I did not check what the algorithm does.

The reasoning is this. Bluestore and modern disks have preferred read/write 
sizes that are quite large for large drives. These are usually powers of 2. If 
you use a k+m EC profile, any read/write is split into k fragments. What I 
observe is that throughput seems best if these fragments are multiples of the 
preferred read/write sizes.

Any prime factor other than 2 will imply split-ups that don't fit perfectly. 
The mismatch tends to be worse the larger a prime factor and the smaller the 
object size. At least this is a correlation I observed in benchmarks. Since 
correlation does not mean causation, I will not claim that my hypothesis is an 
explanation of the observation.

Nevertheless, bluestore has default alloc sizes and just for storage efficiency 
I would aim for alloc_size=object_size/k. Coincidentally, for 
spinning disks this also seems to imply best performance.
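
(The relevant setting can be checked on a running OSD, e.g.:)

    # 64K is the default for HDDs in current releases; note the value is baked in
    # when the OSD is created, changing it later does not affect existing OSDs
    ceph daemon osd.0 config get bluestore_min_alloc_size_hdd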

If this is wrong, maybe a disk IO expert can provide a better explanation as a 
guide for EC profile choices?

Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: ceph-users  on behalf of Nathan Fish 

Sent: 08 July 2019 18:07:25
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] What's the best practice for Erasure Coding

This is very interesting, thank you. I'm curious, what is the reason
for avoiding k's with large prime factors? If I set k=5, what happens?

On Mon, Jul 8, 2019 at 8:56 AM Lei Liu  wrote:
>
> Hi Frank,
>
> Thanks for sharing valuable experience.
>
> Frank Schilder  wrote on Monday, 8 July 2019 at 4:36 PM:
>>
>> Hi David,
>>
>> I'm running a cluster with bluestore on raw devices (no lvm) and all 
>> journals collocated on the same disk with the data. Disks are spinning 
>> NL-SAS. Our goal was to build storage at lowest cost, therefore all data on 
>> HDD only. I got a few SSDs that I'm using for FS and RBD meta data. All 
>> large pools are EC on spinning disk.
>>
>> I spent at least one month to run detailed benchmarks (rbd bench) depending 
>> on EC profile, object size, write size, etc. Results were varying a lot. My 
>> advice would be to run benchmarks with your hardware. If there was a single 
>> perfect choice, there wouldn't be so many options. For example, my tests 
>> will not be valid when using separate fast disks for WAL and DB.
>>
>> There are some results though that might be valid in general:
>>
>> 1) EC pools have high throughput but low IOP/s compared with replicated pools
>>
>> I see single-thread write speeds of up to 1.2GB (gigabyte) per second, which 
>> is probably the network limit and not the disk limit. IOP/s get better with 
>> more disks, but are way lower than what replicated pools can provide. On a 
>> cephfs with EC data pool, small-file IO will be comparably slow and eat a 
>> lot of resources.
>>
>> 2) I observe massive network traffic amplification on small IO sizes, which 
>> is due to the way EC overwrites are handled. This is one bottleneck for 
>> IOP/s. We have 10G infrastructure and use 2x10G client and 4x10G OSD 
>> network. OSD bandwidth at least 2x client network, better 4x or more.
>>
>> 3) k should only have small prime factors, power of 2 if possible
>>
>> I tested k=5,6,8,10,12. Best results in decreasing order: k=8, k=6. All 
>> other choices were poor. The value of m seems not relevant for performance. 
>> Larger k will require more failure domains (more hardware).
>>
>> 4) object size matters
>>
>> The best throughput (1M write size) I see with object sizes of 4MB or 8MB, 
>> with IOP/s getting somewhat better with smaller object sizes but throughput 
>> dropping fast. I use the default of 4MB in production. Works well for us.
>>
>> 5) jerasure is quite good and seems most flexible
>>
>> jerasure is quite CPU efficient and can handle smaller chunk sizes than 
>> other plugins, which is preferable for IOP/s. However, CPU usage can become 
>> a problem and a plugin optimized for specific values of k and m might help 
>> here. Under usual circumstances I see very low load on all OSD hosts, even 
>> under rebalancing. However, I remember that once I needed to rebuild 
>> something on all OSDs (I don't remember what it was, sorry). In this 
>> situation, CPU load went up to 30-50% (meaning up to half the cores were at 
>> 100%), which is really high considering that each server has only 16 disks 
>> at the moment and is sized to handle up to 100. CPU power could become a 
>> bottleneck for us in the future.
>>
>> These are some general observations and do not rep

Re: [ceph-users] What's the best practice for Erasure Coding

2019-07-08 Thread Nathan Fish
This is very interesting, thank you. I'm curious, what is the reason
for avoiding k's with large prime factors? If I set k=5, what happens?

On Mon, Jul 8, 2019 at 8:56 AM Lei Liu  wrote:
>
> Hi Frank,
>
> Thanks for sharing valuable experience.
>
> Frank Schilder  wrote on Monday, 8 July 2019 at 4:36 PM:
>>
>> Hi David,
>>
>> I'm running a cluster with bluestore on raw devices (no lvm) and all 
>> journals collocated on the same disk with the data. Disks are spinning 
>> NL-SAS. Our goal was to build storage at lowest cost, therefore all data on 
>> HDD only. I got a few SSDs that I'm using for FS and RBD meta data. All 
>> large pools are EC on spinning disk.
>>
>> I spent at least one month to run detailed benchmarks (rbd bench) depending 
>> on EC profile, object size, write size, etc. Results were varying a lot. My 
>> advice would be to run benchmarks with your hardware. If there was a single 
>> perfect choice, there wouldn't be so many options. For example, my tests 
>> will not be valid when using separate fast disks for WAL and DB.
>>
>> There are some results though that might be valid in general:
>>
>> 1) EC pools have high throughput but low IOP/s compared with replicated pools
>>
>> I see single-thread write speeds of up to 1.2GB (gigabyte) per second, which 
>> is probably the network limit and not the disk limit. IOP/s get better with 
>> more disks, but are way lower than what replicated pools can provide. On a 
>> cephfs with EC data pool, small-file IO will be comparably slow and eat a 
>> lot of resources.
>>
>> 2) I observe massive network traffic amplification on small IO sizes, which 
>> is due to the way EC overwrites are handled. This is one bottleneck for 
>> IOP/s. We have 10G infrastructure and use 2x10G client and 4x10G OSD 
>> network. OSD bandwidth at least 2x client network, better 4x or more.
>>
>> 3) k should only have small prime factors, power of 2 if possible
>>
>> I tested k=5,6,8,10,12. Best results in decreasing order: k=8, k=6. All 
>> other choices were poor. The value of m seems not relevant for performance. 
>> Larger k will require more failure domains (more hardware).
>>
>> 4) object size matters
>>
>> The best throughput (1M write size) I see with object sizes of 4MB or 8MB, 
>> with IOP/s getting somewhat better with smaller object sizes but throughput 
>> dropping fast. I use the default of 4MB in production. Works well for us.
>>
>> 5) jerasure is quite good and seems most flexible
>>
>> jerasure is quite CPU efficient and can handle smaller chunk sizes than 
>> other plugins, which is preferable for IOP/s. However, CPU usage can become 
>> a problem and a plugin optimized for specific values of k and m might help 
>> here. Under usual circumstances I see very low load on all OSD hosts, even 
>> under rebalancing. However, I remember that once I needed to rebuild 
>> something on all OSDs (I don't remember what it was, sorry). In this 
>> situation, CPU load went up to 30-50% (meaning up to half the cores were at 
>> 100%), which is really high considering that each server has only 16 disks 
>> at the moment and is sized to handle up to 100. CPU power could become a 
>> bottleneck for us in the future.
>>
>> These are some general observations and do not replace benchmarks for 
>> specific use cases. I was hunting for a specific performance pattern, which 
>> might not be what you want to optimize for. I would recommend to run 
>> extensive benchmarks if you have to live with a configuration for a long 
>> time - EC profiles cannot be changed.
>>
>> We settled on 8+2 and 6+2 pools with jerasure and object size 4M. We also 
>> use bluestore compression. All meta data pools are on SSD, only very little 
>> SSD space is required. This choice works well for the majority of our use 
>> cases. We can still build small expensive pools to accommodate special 
>> performance requests.
>>
>> Best regards,
>>
>> =
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> 
>> From: ceph-users  on behalf of David 
>> 
>> Sent: 07 July 2019 20:01:18
>> To: ceph-users@lists.ceph.com
>> Subject: [ceph-users]  What's the best practice for Erasure Coding
>>
>> Hi Ceph-Users,
>>
>> I'm working with a  Ceph cluster (about 50TB, 28 OSDs, all Bluestore on lvm).
>> Recently, I'm trying to use the Erasure Code pool.
>> My question is "what's the best practice for using EC pools ?".
>> More 

Re: [ceph-users] What's the best practice for Erasure Coding

2019-07-08 Thread Lei Liu
Hi Frank,

Thanks for sharing valuable experience.

Frank Schilder  wrote on Monday, 8 July 2019 at 4:36 PM:

> Hi David,
>
> I'm running a cluster with bluestore on raw devices (no lvm) and all
> journals collocated on the same disk with the data. Disks are spinning
> NL-SAS. Our goal was to build storage at lowest cost, therefore all data on
> HDD only. I got a few SSDs that I'm using for FS and RBD meta data. All
> large pools are EC on spinning disk.
>
> I spent at least one month to run detailed benchmarks (rbd bench)
> depending on EC profile, object size, write size, etc. Results were varying
> a lot. My advice would be to run benchmarks with your hardware. If there
> was a single perfect choice, there wouldn't be so many options. For
> example, my tests will not be valid when using separate fast disks for WAL
> and DB.
>
> There are some results though that might be valid in general:
>
> 1) EC pools have high throughput but low IOP/s compared with replicated
> pools
>
> I see single-thread write speeds of up to 1.2GB (gigabyte) per second,
> which is probably the network limit and not the disk limit. IOP/s get
> better with more disks, but are way lower than what replicated pools can
> provide. On a cephfs with EC data pool, small-file IO will be comparably
> slow and eat a lot of resources.
>
> 2) I observe massive network traffic amplification on small IO sizes,
> which is due to the way EC overwrites are handled. This is one bottleneck
> for IOP/s. We have 10G infrastructure and use 2x10G client and 4x10G OSD
> network. OSD bandwidth at least 2x client network, better 4x or more.
>
> 3) k should only have small prime factors, power of 2 if possible
>
> I tested k=5,6,8,10,12. Best results in decreasing order: k=8, k=6. All
> other choices were poor. The value of m seems not relevant for performance.
> Larger k will require more failure domains (more hardware).
>
> 4) object size matters
>
> The best throughput (1M write size) I see with object sizes of 4MB or 8MB,
> with IOP/s getting somewhat better with smaller object sizes but throughput
> dropping fast. I use the default of 4MB in production. Works well for us.
>
> 5) jerasure is quite good and seems most flexible
>
> jerasure is quite CPU efficient and can handle smaller chunk sizes than
> other plugins, which is preferable for IOP/s. However, CPU usage can
> become a problem and a plugin optimized for specific values of k and m
> might help here. Under usual circumstances I see very low load on all OSD
> hosts, even under rebalancing. However, I remember that once I needed to
> rebuild something on all OSDs (I don't remember what it was, sorry). In
> this situation, CPU load went up to 30-50% (meaning up to half the cores
> were at 100%), which is really high considering that each server has only
> 16 disks at the moment and is sized to handle up to 100. CPU power could
> become a bottleneck for us in the future.
>
> These are some general observations and do not replace benchmarks for
> specific use cases. I was hunting for a specific performance pattern, which
> might not be what you want to optimize for. I would recommend to run
> extensive benchmarks if you have to live with a configuration for a long
> time - EC profiles cannot be changed.
>
> We settled on 8+2 and 6+2 pools with jerasure and object size 4M. We also
> use bluestore compression. All meta data pools are on SSD, only very little
> SSD space is required. This choice works well for the majority of our use
> cases. We can still build small expensive pools to accommodate special
> performance requests.
>
> Best regards,
>
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________
> From: ceph-users  on behalf of David <
> xiaomajia...@gmail.com>
> Sent: 07 July 2019 20:01:18
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users]  What's the best practice for Erasure Coding
>
> Hi Ceph-Users,
>
> I'm working with a  Ceph cluster (about 50TB, 28 OSDs, all Bluestore on
> lvm).
> Recently, I'm trying to use the Erasure Code pool.
> My question is "what's the best practice for using EC pools ?".
> More specifically, which plugin (jerasure, isa, lrc, shec or  clay) should
> I adopt, and how to choose the combinations of (k,m) (e.g. (k=3,m=2),
> (k=6,m=3) ).
>
> Does anyone share some experience?
>
> Thanks for any help.
>
> Regards,
> David
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What's the best practice for Erasure Coding

2019-07-08 Thread Frank Schilder
Hi David,

I'm running a cluster with bluestore on raw devices (no lvm) and all journals 
collocated on the same disk with the data. Disks are spinning NL-SAS. Our goal 
was to build storage at lowest cost, therefore all data on HDD only. I got a 
few SSDs that I'm using for FS and RBD meta data. All large pools are EC on 
spinning disk.

I spent at least one month to run detailed benchmarks (rbd bench) depending on 
EC profile, object size, write size, etc. Results were varying a lot. My advice 
would be to run benchmarks with your hardware. If there was a single perfect 
choice, there wouldn't be so many options. For example, my tests will not be 
valid when using separate fast disks for WAL and DB.

There are some results though that might be valid in general:

1) EC pools have high throughput but low IOP/s compared with replicated pools

I see single-thread write speeds of up to 1.2GB (gigabyte) per second, which is 
probably the network limit and not the disk limit. IOP/s get better with more 
disks, but are way lower than what replicated pools can provide. On a cephfs 
with EC data pool, small-file IO will be comparably slow and eat a lot of 
resources.

2) I observe massive network traffic amplification on small IO sizes, which is 
due to the way EC overwrites are handled. This is one bottleneck for IOP/s. We 
have 10G infrastructure and use 2x10G client and 4x10G OSD network. OSD 
bandwidth at least 2x client network, better 4x or more.

3) k should only have small prime factors, power of 2 if possible

I tested k=5,6,8,10,12. Best results in decreasing order: k=8, k=6. All other 
choices were poor. The value of m seems not relevant for performance. Larger k 
will require more failure domains (more hardware).

4) object size matters

The best throughput (1M write size) I see with object sizes of 4MB or 8MB, with 
IOP/s getting somewhat better with smaller object sizes but throughput dropping 
fast. I use the default of 4MB in production. Works well for us.

5) jerasure is quite good and seems most flexible

jerasure is quite CPU efficient and can handle smaller chunk sizes than other 
plugins, which is preferable for IOP/s. However, CPU usage can become a 
problem and a plugin optimized for specific values of k and m might help here. 
Under usual circumstances I see very low load on all OSD hosts, even under 
rebalancing. However, I remember that once I needed to rebuild something on all 
OSDs (I don't remember what it was, sorry). In this situation, CPU load went up 
to 30-50% (meaning up to half the cores were at 100%), which is really high 
considering that each server has only 16 disks at the moment and is sized to 
handle up to 100. CPU power could become a bottleneck for us in the future.

These are some general observations and do not replace benchmarks for specific 
use cases. I was hunting for a specific performance pattern, which might not be 
what you want to optimize for. I would recommend to run extensive benchmarks if 
you have to live with a configuration for a long time - EC profiles cannot be 
changed.

We settled on 8+2 and 6+2 pools with jerasure and object size 4M. We also use 
bluestore compression. All meta data pools are on SSD, only very little SSD 
space is required. This choice works well for the majority of our use cases. We 
can still build small expensive pools to accommodate special performance 
requests.
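
For reference, a setup along these lines looks roughly as follows (profile and pool
names and PG counts are placeholders, adjust to your cluster):

    ceph osd erasure-code-profile set ec82 \
        plugin=jerasure technique=reed_sol_van k=8 m=2 crush-failure-domain=host

    ceph osd pool create ec_data 512 512 erasure ec82
    ceph osd pool set ec_data allow_ec_overwrites true    # required for RBD/CephFS data
    ceph osd pool set ec_data compression_mode aggressive
    ceph osd pool set ec_data compression_algorithm snappy

    # a small replicated pool (on SSD) holds the RBD metadata, the EC pool holds the data
    rbd create --size 100G --data-pool ec_data rbd_meta/vm01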

Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: ceph-users  on behalf of David 

Sent: 07 July 2019 20:01:18
To: ceph-users@lists.ceph.com
Subject: [ceph-users]  What's the best practice for Erasure Coding

Hi Ceph-Users,

I'm working with a  Ceph cluster (about 50TB, 28 OSDs, all Bluestore on lvm).
Recently, I'm trying to use the Erasure Code pool.
My question is "what's the best practice for using EC pools ?".
More specifically, which plugin (jerasure, isa, lrc, shec or  clay) should I 
adopt, and how to choose the combinations of (k,m) (e.g. (k=3,m=2), (k=6,m=3) ).

Does anyone share some experience?

Thanks for any help.

Regards,
David

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What's the best practice for Erasure Coding

2019-07-08 Thread Jake Grimmett
Hi David,

How many nodes in your cluster? k+m must not exceed your node count (failure 
domains), and preferably should be smaller by at least two. 

How important is your data? I.e., do you have a remote mirror or backup? If not, 
you may want m=3. 

We use 8+2 on one cluster, and 6+2 on another.
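
(A quick way to sanity-check this against your CRUSH map, assuming the default host
failure domain:)

    # count the host buckets; with crush-failure-domain=host, an 8+2 profile
    # needs at least 10 of them
    ceph osd tree | grep -c host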

Best,

Jake


On 7 July 2019 19:01:18 BST, David  wrote:
>Hi Ceph-Users,
>
> 
>
>I'm working with a  Ceph cluster (about 50TB, 28 OSDs, all Bluestore on
>lvm).
>
>Recently, I'm trying to use the Erasure Code pool.
>
>My question is "what's the best practice for using EC pools ?".
>
>More specifically, which plugin (jerasure, isa, lrc, shec or  clay)
>should I adopt, and how to choose the combinations of (k,m) (e.g.
>(k=3,m=2), (k=6,m=3) ).
>
> 
>
>Does anyone share some experience?
>
> 
>
>Thanks for any help.
>
> 
>
>Regards,
>
>David
>
> 

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] What's the best practice for Erasure Coding

2019-07-07 Thread David
Hi Ceph-Users,

 

I'm working with a  Ceph cluster (about 50TB, 28 OSDs, all Bluestore on lvm).

Recently, I'm trying to use the Erasure Code pool.

My question is "what's the best practice for using EC pools ?".

More specifically, which plugin (jerasure, isa, lrc, shec or  clay) should I 
adopt, and how to choose the combinations of (k,m) (e.g. (k=3,m=2), (k=6,m=3) ).

 

Does anyone share some experience?

 

Thanks for any help.

 

Regards,

David

 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com