Re: [ceph-users] osds with different disk sizes may killing performance (Yao Zongyou)
>> There is no way to fill up all disks evenly with the same number of
>> Bytes and then stop filling the small disks when they're full and
>> only continue filling the larger disks.
>
> This is possible with adjusting crush weights. Initially the smaller
> drives are weighted more highly than larger drives. As data gets added
> the weights are changed so that larger drives continue to fill while
> no drives become overfull.

So IMHO, when you diverge from the default crush weight (e.g. for performance) it is something you should do permanently, or you should have a very clear path to the next steps (e.g. "we need to do this temporarily until the new hardware comes in").

You gain some short-term performance while you have the space. However, as the cluster gets fuller you will need to change the weights, which results in a lot of data movement. That is a pain in itself, especially when the cluster is near its IO limits. In the end you still get the exact same (bad) performance, but it might now arrive as a performance cliff instead of a gradual worsening over time.

A possible way to solve this would be to implement some mechanism that reshuffles data based on IO patterns, so that each OSD gets the same IO pressure (or, even better, based on a new "IOPS" weight you can set). You could then give each OSD the proper "size weight" and Ceph would make sure you get optimal IO performance by keeping the proper amount of "hot" data, where the IOs happen, on each disk. But I guess that's a very hard thing to build properly.

Final note: if you only have SSDs in the cluster, the problem might not appear, because bigger SSDs are usually also faster :)

Cheers,
Robert van Leeuwen
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
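The IO-pressure point above can be illustrated with a back-of-the-envelope model (this is an illustration only, not Ceph code; the OSD names and sizes are invented for the example). CRUSH places PGs, and therefore IO, roughly in proportion to crush weight, so capacity-based weights send a mixed cluster's 8TB spinners twice the IO of the 4TB ones even though they are not twice as fast:

```python
# Rough model: IO lands on OSDs in proportion to crush weight.
# With capacity-based weights, an 8TB OSD receives twice the PGs
# (and roughly twice the IO) of a 4TB OSD.
def io_share(weights):
    """Fraction of cluster IO landing on each OSD, by crush weight."""
    total = sum(weights.values())
    return {osd: w / total for osd, w in weights.items()}

weights = {"osd.0": 4.0, "osd.1": 4.0, "osd.2": 8.0}  # TB-based weights
shares = io_share(weights)
print(shares)
# The 8TB OSD carries half of all IO: double the pressure per spindle.
assert shares["osd.2"] == 2 * shares["osd.0"]
```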
[ceph-users] osds with different disk sizes may killing performance (Yao Zongyou)
> You'll find it said time and time again on the ML... avoid disks of
> different sizes in the same cluster. It's a headache that sucks. It's
> not impossible, it's not even overly hard to pull off... but it's very
> easy to cause a mess and a lot of headaches. It will also make it
> harder to diagnose performance issues in the cluster.

Not very practical for clusters which aren't new.

> There is no way to fill up all disks evenly with the same number of
> Bytes and then stop filling the small disks when they're full and only
> continue filling the larger disks.

This is possible with adjusting crush weights. Initially the smaller drives are weighted more highly than larger drives. As data gets added, the weights are changed so that larger drives continue to fill while no drives become overfull.

> What will happen if you are filling all disks evenly with Bytes
> instead of % is that the small disks will get filled completely and
> all writes to the cluster will block until you do something to reduce
> the amount used on the full disks.

That means the crush weights were not adjusted correctly as the cluster filled.

> but in this case you would have a steep drop off of performance. when
> you reach the fill level where small drives do not accept more data,
> suddenly you would have a performance cliff where only your larger
> disks are doing new writes. and only larger disks doing reads on new
> data.

Good point! Although if this is implemented by changing crush weights, adjusting the weights as the cluster fills will cause the data to churn, so the new data will not only be assigned to larger drives. :)

Chad.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
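The weight-adjustment idea above can be sketched numerically. This is an illustrative calculation only (OSD names, capacities, and the weighting rule are hypothetical), not an actual Ceph procedure:

```python
# Sketch: weight each OSD in proportion to the absolute number of
# bytes it should hold, capped at its capacity. While the target fits
# on every disk, all OSDs fill evenly in bytes; once the target passes
# a small disk's capacity, only the larger OSDs keep gaining weight.
def equal_byte_weights(capacities_tb, target_tb_per_osd):
    """Crush-style weights aiming every OSD at the same absolute fill."""
    return {osd: min(cap, target_tb_per_osd)
            for osd, cap in capacities_tb.items()}

caps = {"osd.0": 4.0, "osd.1": 8.0}   # hypothetical 4TB and 8TB OSDs
print(equal_byte_weights(caps, 3.0))  # both 3.0 -> equal byte fill
print(equal_byte_weights(caps, 6.0))  # osd.0 capped at 4.0; osd.1 takes 6.0
```

Note the second call is exactly the churn Chad mentions: moving from the first set of weights to the second reshuffles existing data, not just new writes.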
Re: [ceph-users] osds with different disk sizes may killing performance (Yao Zongyou)
You'll find it said time and time again on the ML... avoid disks of different sizes in the same cluster. It's a headache that sucks. It's not impossible, it's not even overly hard to pull off... but it's very easy to cause a mess and a lot of headaches. It will also make it harder to diagnose performance issues in the cluster.

There is no way to fill up all disks evenly with the same number of Bytes and then stop filling the small disks when they're full and only continue filling the larger disks. What will happen if you are filling all disks evenly with Bytes instead of % is that the small disks will get filled completely and all writes to the cluster will block until you do something to reduce the amount used on the full disks.

On Fri, Apr 13, 2018 at 1:28 AM Ronny Aasen wrote:
> On 13. april 2018 05:32, Chad William Seys wrote:
> > Hello,
> > I think your observations suggest that, to a first approximation,
> > filling drives with bytes to the same absolute level is better for
> > performance than filling drives to the same percentage full. Assuming
> > random distribution of PGs, this would cause the smallest drives to be
> > as active as the largest drives.
> > E.g. if every drive had 1TB of data, each would be equally likely to
> > contain the PG of interest.
> > Of course, as more data was added the smallest drives could not hold
> > more and the larger drives would become more active, but at least the
> > smaller drives would be as active as possible.
>
> but in this case you would have a steep drop-off of performance. when
> you reach the fill level where small drives do not accept more data,
> suddenly you would have a performance cliff where only your larger disks
> are doing new writes. and only larger disks doing reads on new data.
>
> it is also easier to make the logical connection while you are
> installing new nodes/disks than a year later when your cluster just
> happens to reach that fill level.
>
> it would also be an easier job balancing disks between nodes when you
> are adding osd's anyway and the new ones are mostly empty, rather than
> when your small osd's are full and your large disks have significant
> data on them.
>
> kind regards
> Ronny Aasen
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] osds with different disk sizes may killing performance (Yao Zongyou)
On 13. april 2018 05:32, Chad William Seys wrote:
> Hello,
> I think your observations suggest that, to a first approximation,
> filling drives with bytes to the same absolute level is better for
> performance than filling drives to the same percentage full. Assuming
> random distribution of PGs, this would cause the smallest drives to be
> as active as the largest drives.
> E.g. if every drive had 1TB of data, each would be equally likely to
> contain the PG of interest.
> Of course, as more data was added the smallest drives could not hold
> more and the larger drives would become more active, but at least the
> smaller drives would be as active as possible.

but in this case you would have a steep drop-off of performance. When you reach the fill level where small drives do not accept more data, you would suddenly hit a performance cliff where only your larger disks are doing new writes, and only the larger disks are doing reads on new data.

it is also easier to make the logical connection while you are installing new nodes/disks than a year later when your cluster just happens to reach that fill level.

it would also be an easier job balancing disks between nodes when you are adding osd's anyway and the new ones are mostly empty, rather than when your small osd's are full and your large disks have significant data on them.

kind regards
Ronny Aasen
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] osds with different disk sizes may killing performance (Yao Zongyou)
Hello,

I think your observations suggest that, to a first approximation, filling drives with bytes to the same absolute level is better for performance than filling drives to the same percentage full. Assuming random distribution of PGs, this would cause the smallest drives to be as active as the largest drives. E.g. if every drive had 1TB of data, each would be equally likely to contain the PG of interest.

Of course, as more data was added the smallest drives could not hold more and the larger drives would become more active, but at least the smaller drives would be as active as possible.

Thanks!
Chad.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
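Chad's "equally likely" claim can be made concrete with a quick hypothetical calculation (the OSD labels and byte counts are invented for the example; this assumes PGs are spread uniformly over stored bytes):

```python
# If PGs are spread uniformly over stored bytes, the chance that a
# given PG lives on an OSD is proportional to the bytes that OSD holds.
def pg_hit_probability(bytes_stored_tb):
    total = sum(bytes_stored_tb.values())
    return {osd: b / total for osd, b in bytes_stored_tb.items()}

# Equal absolute fill: 1TB on each disk, so every OSD is equally
# likely to serve the next request.
equal_fill = {"osd.4tb": 1.0, "osd.8tb": 1.0}
# Equal percentage fill (25% each): the 8TB OSD holds twice the bytes
# and serves two thirds of the requests.
pct_fill = {"osd.4tb": 1.0, "osd.8tb": 2.0}

print(pg_hit_probability(equal_fill))  # 0.5 each
print(pg_hit_probability(pct_fill))    # 1/3 vs 2/3
```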
Re: [ceph-users] osds with different disk sizes may killing performance
I can't comment directly on the relation XFS fragmentation has to Bluestore, but I had a similar issue probably 2-3 years ago where XFS fragmentation was causing a significant degradation in cluster performance. The use case was RBDs with lots of snapshots created and deleted at regular intervals. XFS got pretty severely fragmented and the cluster slowed down quickly.

The solution I found was to set the XFS allocsize to match the RBD object size via osd_mount_options_xfs. Of course I also had to defragment XFS to clear up the existing fragmentation, but that was fairly painless. XFS fragmentation hasn't been an issue since.

That solution isn't as applicable in an object store use case where the object size is more variable, but increasing the XFS allocsize could still help. As far as Bluestore goes, I haven't deployed it in production yet, but I would expect that manipulating bluestore_min_alloc_size in a similar fashion would yield similar benefits. Of course you are then wasting some disk space for every object that ends up being smaller than that allocation size. In both cases, that's the trade-off.

Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation <https://storagecraft.com>
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2799 |

If you are not the intended recipient of this message or received it erroneously, please notify the sender and delete it, together with any attachments, and be advised that any dissemination or copying of this message is prohibited.

On Thu, 2018-04-12 at 04:13 +0200, Marc Roos wrote:
> Is that not obvious? The 8TB is handling twice as much as the 4TB.
> Afaik there is not a linear relationship between the iops of a disk
> and its size. But interesting about this xfs defragmentation, how does
> this relate/compare to bluestore?

-----Original Message-----
From: Yao Zongyou [mailto:yaozong...@outlook.com]
Sent: donderdag 12 april 2018 4:36
To: ceph-users@lists.ceph.com
Subject: *SPAM* [ceph-users] osds with different disk sizes may killing performance
Importance: High

Hi,

For anybody who may be interested, here I share the process of locating the reason for a ceph cluster performance slowdown in our environment.

Internally, we have a cluster with a capacity of 1.1PB, 800TB used, and raw user data of about 500TB. Each day, 3TB of data is uploaded and the 3TB of oldest data is lifecycled (we are using the s3 object store, and bucket lifecycle is enabled). As time went by, the cluster became slower, and we suspected xfs fragmentation was the fiend.

After some testing, we did find that xfs fragmentation slows down filestore's performance: for example, at 15% fragmentation, the performance is 85% of the original, and at 25%, the performance is 74.73% of the original. But the main reason for our cluster's deterioration of performance is not the xfs fragmentation.

Initially, our ceph cluster contained only osds with 4TB disks. As time went by, we scaled out our cluster by adding some new osds with 8TB disks. As the new disks' capacity is double that of the old disks, each new osd's weight is double that of an old osd. Each new osd has double the pgs of an old osd and uses double the disk space. Everything looked good and fine.

But even though a new osd has double the capacity of an old osd, its performance is not double. After digging into our internal system stats, we found the newly added disks' io util is about two times that of the old, and from time to time the new disks' io util rises up to 100%. The newly added osds are the performance killer. They slow down the whole cluster's performance.

As the reason was found, the solution was very simple. After lowering the newly added osds' weight, the annoying slow request warnings have died away.

So the conclusion is: in a cluster with different osd disk sizes, an osd's weight should not be determined only by its capacity; we should also have a look at its performance.

Best wishes,
Yao Zongyou
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
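Steve's allocsize fix would look roughly like the snippet below in ceph.conf. This is a sketch only: the 4M value assumes the default 4MB RBD object size, and the exact mount-option list should be adapted to your deployment.

```ini
# ceph.conf (illustrative; 4M assumes default-sized RBD objects)
[osd]
# Preallocate in RBD-object-sized extents so XFS doesn't fragment
# objects that are rewritten in place. Applied at OSD mount time.
osd_mount_options_xfs = rw,noatime,inode64,allocsize=4M
```

Existing fragmentation still has to be cleaned up separately (Steve mentions defragmenting the filesystems, which on XFS is done with the xfs_fsr tool).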
Re: [ceph-users] osds with different disk sizes may killing performance
Hi,

you can also set the primary_affinity to 0.5 on the 8TB disks to lower the read load (in this case you don't waste 50% of the space).

Udo

On 2018-04-12 04:36, Yao Zongyou wrote:
> Hi,
>
> For anybody who may be interested, here I share the process of
> locating the reason for a ceph cluster performance slowdown in our
> environment.
>
> Internally, we have a cluster with a capacity of 1.1PB, 800TB used,
> and raw user data of about 500TB. Each day, 3TB of data is uploaded
> and the 3TB of oldest data is lifecycled (we are using the s3 object
> store, and bucket lifecycle is enabled). As time went by, the cluster
> became slower, and we suspected xfs fragmentation was the fiend.
>
> After some testing, we did find that xfs fragmentation slows down
> filestore's performance: for example, at 15% fragmentation, the
> performance is 85% of the original, and at 25%, the performance is
> 74.73% of the original. But the main reason for our cluster's
> deterioration of performance is not the xfs fragmentation.
>
> Initially, our ceph cluster contained only osds with 4TB disks. As
> time went by, we scaled out our cluster by adding some new osds with
> 8TB disks. As the new disks' capacity is double that of the old disks,
> each new osd's weight is double that of an old osd. Each new osd has
> double the pgs of an old osd and uses double the disk space.
> Everything looked good and fine.
>
> But even though a new osd has double the capacity of an old osd, its
> performance is not double. After digging into our internal system
> stats, we found the newly added disks' io util is about two times that
> of the old, and from time to time the new disks' io util rises up to
> 100%. The newly added osds are the performance killer. They slow down
> the whole cluster's performance.
>
> As the reason was found, the solution was very simple. After lowering
> the newly added osds' weight, the annoying slow request warnings have
> died away.
>
> So the conclusion is: in a cluster with different osd disk sizes, an
> osd's weight should not be determined only by its capacity; we should
> also have a look at its performance.
>
> Best wishes,
> Yao Zongyou
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
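The effect of Udo's primary_affinity suggestion can be approximated with a simplified model (this ignores CRUSH placement details and replica counts; the OSD names and weights are invented for the example). Writes hit every replica, so write load still tracks crush weight, but reads are served by the primary, and primary_affinity scales how often an OSD becomes primary:

```python
# Simplified model: read load on an OSD ~ (crush weight) * (primary
# affinity), normalized. Halving affinity on the 8TB OSD equalizes
# read pressure while it still stores -- and takes writes for --
# twice the data.
def read_share(weights, affinity):
    raw = {osd: weights[osd] * affinity[osd] for osd in weights}
    total = sum(raw.values())
    return {osd: v / total for osd, v in raw.items()}

weights = {"osd.4tb": 1.0, "osd.8tb": 2.0}           # capacity weights
shares = read_share(weights, {"osd.4tb": 1.0, "osd.8tb": 0.5})
print(shares)  # both OSDs now carry an equal share of reads
assert shares["osd.4tb"] == shares["osd.8tb"]
```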
Re: [ceph-users] osds with different disk sizes may killing performance
Is that not obvious? The 8TB is handling twice as much as the 4TB. Afaik there is not a linear relationship between the iops of a disk and its size. But interesting about this xfs defragmentation, how does this relate/compare to bluestore?

-----Original Message-----
From: Yao Zongyou [mailto:yaozong...@outlook.com]
Sent: donderdag 12 april 2018 4:36
To: ceph-users@lists.ceph.com
Subject: *SPAM* [ceph-users] osds with different disk sizes may killing performance
Importance: High

Hi,

For anybody who may be interested, here I share the process of locating the reason for a ceph cluster performance slowdown in our environment.

Internally, we have a cluster with a capacity of 1.1PB, 800TB used, and raw user data of about 500TB. Each day, 3TB of data is uploaded and the 3TB of oldest data is lifecycled (we are using the s3 object store, and bucket lifecycle is enabled). As time went by, the cluster became slower, and we suspected xfs fragmentation was the fiend.

After some testing, we did find that xfs fragmentation slows down filestore's performance: for example, at 15% fragmentation, the performance is 85% of the original, and at 25%, the performance is 74.73% of the original. But the main reason for our cluster's deterioration of performance is not the xfs fragmentation.

Initially, our ceph cluster contained only osds with 4TB disks. As time went by, we scaled out our cluster by adding some new osds with 8TB disks. As the new disks' capacity is double that of the old disks, each new osd's weight is double that of an old osd. Each new osd has double the pgs of an old osd and uses double the disk space. Everything looked good and fine.

But even though a new osd has double the capacity of an old osd, its performance is not double. After digging into our internal system stats, we found the newly added disks' io util is about two times that of the old, and from time to time the new disks' io util rises up to 100%. The newly added osds are the performance killer. They slow down the whole cluster's performance.

As the reason was found, the solution was very simple. After lowering the newly added osds' weight, the annoying slow request warnings have died away.

So the conclusion is: in a cluster with different osd disk sizes, an osd's weight should not be determined only by its capacity; we should also have a look at its performance.

Best wishes,
Yao Zongyou
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] osds with different disk sizes may killing performance
On 04/12/2018 11:21 AM, 宗友 姚 wrote:
> Currently, this can only be done by hand. Maybe we need some scripts
> to handle this automatically.

Mixed hosts, i.e. half old disks + half new disks, are better than "old hosts" and "new hosts" in your case.

k
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] osds with different disk sizes may killing performance
On 04/12/2018 04:36 AM, ? ?? wrote: > Hi, > > For anybody who may be interested, here I share a process of locating the > reason for ceph cluster performance slow down in our environment. > > Internally, we have a cluster with capacity 1.1PB, used 800TB, and raw user > data is about 500TB. Each day, 3TB' data is uploaded and 3TB oldest data is > lifecycled (we are using s3 object store, and bucket lifecycle is enabled). > As time goes by, the cluster becomes some slower, we doubt the xfs > fragmentation is the fiend. > > After some testing, we do find xfs fragmentation slow down filestore's > performance, for example, at 15% fragmentation, the performance is 85% of the > original, and at 25%, the performance is 74.73% of the original. > > But the main reason for our cluster's deterioration of performance is not the > xfs fragmentation. > > Initially, our ceph cluster contains only osds with 4TB's disk, as time goes > by, we scale out our cluster by adding some new osds with 8TB's disk. And as > the new disk's capacity is double times of the old disks, so each new osd's > weight is double of the old osd. And new osd has double pgs than old osd, and > new osd used double disk space than the old osd. Everything looks good and > fine. > > But even though the new osd has double capacity than the old osd, the new > osd's performance is not double than the old osd. After digging into our > internal system stats, we find the new added's disk io util is about two > times than the old. And from time to time, the new disks' io util rises up to > 100%. The new added osds are the performance killer. They slow down the whole > cluster's performance. > > As the reason is found, the solution is very simple. After lower new added > osds's weight, the annoying slow request warnings have died away. > This is to be expected. However, lowering the weight of new disks means that you can't fully use their storage capacity. This is the nature of having a heterogeneous cluster with Ceph. 
Disks of different sizes mean that performance will fluctuate.

Wido

> So the conclusion is: in a cluster with different osd disk sizes, an
> osd's weight should not be determined only by its capacity; we should
> also have a look at its performance.
>
> Best wishes,
> Yao Zongyou
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
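Wido's capacity trade-off can be quantified with a toy calculation (the numbers are hypothetical; real HDD random IOPS vary by model, but they are roughly capacity-independent for same-rpm spinners):

```python
# A 7200rpm HDD delivers roughly the same random IOPS regardless of
# capacity, so equalizing IO pressure means equalizing crush weight --
# which strands the extra space on the bigger disk.
def stranded_capacity_tb(cap_small, cap_large):
    """Space left unused on the large disk when it is weighted like
    the small one so both receive the same IO."""
    return cap_large - cap_small

# Weighting an 8TB OSD like a 4TB OSD strands half the disk.
waste = stranded_capacity_tb(4.0, 8.0)
print(f"{waste}TB stranded ({waste / 8.0:.0%} of the 8TB disk)")
```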
Re: [ceph-users] osds with different disk sizes may killing performance
Currently, this can only be done by hand. Maybe we need some scripts to handle this automatically. I don't know if https://github.com/ceph/ceph/tree/master/src/pybind/mgr/balancer can handle this.

From: Konstantin Shalygin <k0...@k0ste.ru>
Sent: Thursday, April 12, 2018 12:00
To: ceph-users@lists.ceph.com
Cc: Yao Zongyou
Subject: Re: [ceph-users] osds with different disk sizes may killing performance

On 04/12/2018 10:58 AM, Yao Zongyou wrote:
> Yes, according to crush algorithms, large drives are given high
> weight; this is expected. By default, crush gives no consideration to
> each drive's performance, which may leave the performance distribution
> unbalanced, and the osd with the highest io util may slow down the
> whole cluster.

You can control 'how much use' a drive gets via adjusting crush weights.

k
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] osds with different disk sizes may killing performance
On 04/12/2018 10:58 AM, Yao Zongyou wrote:
> Yes, according to crush algorithms, large drives are given high
> weight; this is expected. By default, crush gives no consideration to
> each drive's performance, which may leave the performance distribution
> unbalanced, and the osd with the highest io util may slow down the
> whole cluster.

You can control 'how much use' a drive gets via adjusting crush weights.

k
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] osds with different disk sizes may killing performance
Yes, according to crush algorithms, large drives are given high weight; this is expected. By default, crush gives no consideration to each drive's performance, which may leave the performance distribution unbalanced, and the osd with the highest io util may slow down the whole cluster.

From: Konstantin Shalygin <k0...@k0ste.ru>
Sent: Thursday, April 12, 2018 11:29
To: ceph-users@lists.ceph.com
Cc: Yao Zongyou
Subject: Re: [ceph-users] osds with different disk sizes may killing performance

> After digging into our internal system stats, we find the newly added
> disks' io util is about two times that of the old.

This is obvious and expected. Your 8TB drives are weighted double against the 4TB ones and do *double* the crush work in comparison.

k
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] osds with different disk sizes may killing performance
> After digging into our internal system stats, we find the newly added
> disks' io util is about two times that of the old.

This is obvious and expected. Your 8TB drives are weighted double against the 4TB ones and do *double* the crush work in comparison.

k
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] osds with different disk sizes may killing performance
Hi,

For anybody who may be interested, here I share the process of locating the reason for a ceph cluster performance slowdown in our environment.

Internally, we have a cluster with a capacity of 1.1PB, 800TB used, and raw user data of about 500TB. Each day, 3TB of data is uploaded and the 3TB of oldest data is lifecycled (we are using the s3 object store, and bucket lifecycle is enabled). As time went by, the cluster became slower, and we suspected xfs fragmentation was the fiend.

After some testing, we did find that xfs fragmentation slows down filestore's performance: for example, at 15% fragmentation, the performance is 85% of the original, and at 25%, the performance is 74.73% of the original. But the main reason for our cluster's deterioration of performance is not the xfs fragmentation.

Initially, our ceph cluster contained only osds with 4TB disks. As time went by, we scaled out our cluster by adding some new osds with 8TB disks. As the new disks' capacity is double that of the old disks, each new osd's weight is double that of an old osd. Each new osd has double the pgs of an old osd and uses double the disk space. Everything looked good and fine.

But even though a new osd has double the capacity of an old osd, its performance is not double. After digging into our internal system stats, we found the newly added disks' io util is about two times that of the old, and from time to time the new disks' io util rises up to 100%. The newly added osds are the performance killer. They slow down the whole cluster's performance.

As the reason was found, the solution was very simple. After lowering the newly added osds' weight, the annoying slow request warnings have died away.

So the conclusion is: in a cluster with different osd disk sizes, an osd's weight should not be determined only by its capacity; we should also have a look at its performance.

Best wishes,
Yao Zongyou
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
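The conclusion above amounts to capping a drive's weight by its performance as well as its capacity. One hypothetical weighting rule (the function, baseline values, and IOPS figures are invented for illustration; this is not how Ceph computes weights):

```python
# Weight = min(capacity-proportional, performance-proportional): a
# disk never gets more PGs than its IOPS can comfortably serve, even
# if it has the space for more.
def perf_capped_weight(cap_tb, iops, base_cap_tb=4.0, base_iops=150.0):
    cap_weight = cap_tb / base_cap_tb
    perf_weight = iops / base_iops
    return min(cap_weight, perf_weight)

# An 8TB HDD with roughly the same random IOPS as the 4TB baseline
# gets the same weight as the 4TB disk, not double.
print(perf_capped_weight(8.0, 150.0))  # 1.0 -- capped by performance
print(perf_capped_weight(4.0, 150.0))  # 1.0
```

As Wido notes elsewhere in the thread, the cost of such a cap is stranded capacity on the larger disks.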