Re: [PATCH RFC 6/6] btrfs: Add roundrobin raid1 read policy
Hi Anand, re: inflight calculation

On Thu, Feb 11, 2021 at 03:55:33PM +, Michal Rostecki wrote:
> > It is better to have random workloads in the above three categories
> > of configs.
> >
> > Apart from the above three configs, there is also
> > all-non-rotational with hetero
> > For example, ssd and nvme together both are non-rotational.
> > And,
> > all-rotational with hetero
> > For example, rotational disks with different speeds.
> >
> > The inflight calculation is local to btrfs. If the device is busy due to
> > external factors, it would not switch to the better performing device.
> >
> Good point. Maybe I should try to use the part stats instead of storing
> inflight locally in btrfs.

I tried today to reuse the inflight calculation which is done for iostat,
and I came to the conclusion that it's impossible without exporting some
functions from the block/ subsystem.

The thing is that there are two methods of calculating inflight, and which
one of them is used depends on queue_is_mq():

https://github.com/kdave/btrfs-devel/blob/9d294a685fbcb256ce8c5f7fd88a7596d0f52a8a/block/genhd.c#L1163

And if that condition is true, I noticed that the part stats return 0, even
though there are requests being processed (I checked with fio inside a VM).

In general, those two methods of checking inflight are:

1) part stats - which would be trivial to reuse, just a matter of one
   simple for_each_possible_cpu() loop with part_stat_local_read_cpu()

   https://github.com/kdave/btrfs-devel/blob/9d294a685fbcb256ce8c5f7fd88a7596d0f52a8a/block/genhd.c#L133-L146

2) blk_mq_in_flight() - which relies on a lot of code and unexported structs
   inside the block/ directory, plus a double function callback; definitely
   not easy to reimplement in btrfs without copying dozens of lines

   https://github.com/kdave/btrfs-devel/blob/9d294a685fbcb256ce8c5f7fd88a7596d0f52a8a/block/blk-mq.c#L115-L123

I tried copying the whole blk_mq_in_flight() function with all of its
dependencies anyway, but that is hard to do without causing modpost errors.

So, to sum it up, I think that making 2) reusable in btrfs would require at
least exporting the blk_mq_in_flight() function, therefore the change would
have to go through the linux-block tree. Maybe that would be a good thing to
do in the long term, but I'm not sure it should block my patchset entirely.

The question is whether we are fine with inflight stats inside btrfs.
Personally, I think we shouldn't be too concerned about it. The inflight
counter in my patches is a percpu counter, used in places which already do
some atomic operations.

Thanks,
Michal
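For reference, a rough sketch of what reusing the part stats (method 1) could
look like - this is only an illustration based on part_in_flight() in
block/genhd.c; the exact struct and function signatures differ between kernel
versions, and the btrfs_part_in_flight() wrapper name is made up:

        /* Sketch: sum the per-CPU in-flight counters kept for iostat,
         * essentially what part_in_flight() does in block/genhd.c.
         * Note: on queue_is_mq() devices the accounting lives in blk-mq,
         * so this returns 0 there and blk_mq_in_flight() would have to be
         * exported and used instead.
         */
        static unsigned int btrfs_part_in_flight(struct block_device *part)
        {
                unsigned int inflight = 0;
                int cpu;

                for_each_possible_cpu(cpu) {
                        inflight += part_stat_local_read_cpu(part, in_flight[0], cpu) +
                                    part_stat_local_read_cpu(part, in_flight[1], cpu);
                }
                if ((int)inflight < 0)
                        inflight = 0;

                return inflight;
        }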
Re: [PATCH RFC 6/6] btrfs: Add roundrobin raid1 read policy
On Wed, Feb 10, 2021 at 04:20:20PM +0800, Anand Jain wrote:
> On 10/02/2021 04:30, Michal Rostecki wrote:
> > The penalty value is an additional value added to the number of inflight
> > requests when a scheduled request is non-local (which means it would
> > start from a different physical location than the physical location of
> > the last request processed by the given device). By default, it's
> > applied only in filesystems which have mixed types of devices
> > (non-rotational and rotational), but it can be configured to be applied
> > without that condition.
> >
> > The configuration is done through sysfs:
> >
> > - /sys/fs/btrfs/[fsid]/read_policies/roundrobin_nonlocal_inc_mixed_only
> >
> > where the value 1 (the default) means applying the penalty only in mixed
> > arrays, and 0 means applying it unconditionally.
> >
> > The exact penalty value is defined separately for non-rotational and
> > rotational devices. By default, it's 0 for non-rotational devices and 1
> > for rotational devices. Both values are configurable through sysfs:
> >
> > - /sys/fs/btrfs/[fsid]/read_policies/roundrobin_nonrot_nonlocal_inc
> > - /sys/fs/btrfs/[fsid]/read_policies/roundrobin_rot_nonlocal_inc
> >
> > To sum it up - the default case is applying the penalty under the
> > following conditions:
> >
> > - the raid1 array consists of mixed types of devices
> > - the scheduled request is going to be non-local for the given disk
> > - the device is rotational
> >
> > That default case is based on a slight preference towards non-rotational
> > disks in mixed arrays and has proven to give the best performance in
> > tested arrays.
> >
> > For the array with 3 HDDs, not adding any penalty resulted in 409MiB/s
> > (429MB/s) performance. Adding the penalty value 1 resulted in a
> > performance drop to 404MiB/s (424MB/s). Increasing the value towards 10
> > was making the performance even worse.
> >
> > For the array with 2 HDDs and 1 SSD, adding penalty value 1 to
> > rotational disks resulted in the best performance - 541MiB/s (567MB/s).
> > Not adding any value and increasing the value was making the performance
> > worse.
> >
> > Adding penalty value to non-rotational disks was always decreasing the
> > performance, which motivated setting it as 0 by default. For the purpose
> > of testing, it's still configurable.
> >
> > To measure the performance of each policy and find optimal penalty
> > values, I created scripts which are available here:
> >
> So in summary
> rotational + non-rotational: penalty = 1
> all-rotational and homo: penalty = 0
> all-non-rotational and homo: penalty = 0
>
> I can't find anything non-deterministic in your findings above.
> It is not very clear to me if we need the configurable
> parameters here.
>

Honestly, the main reason why I made it configurable is to check the
performance of different values without editing and recompiling the kernel.
I was trying to find the best set of values with my simple Python script
which tries all values from 0 to 9 and runs fio. I left those parameters
configurable in my patches just in case someone else would like to tune
them on their environments. The script is here:

https://github.com/mrostecki/btrfs-perf/blob/main/roundrobin-tune.py

But on the other hand, as I mentioned in the other mail, I'm getting
skeptical about the whole penalty mechanism in general. As I wrote and as
you pointed out, it improves the performance only for mixed arrays. And
since the roundrobin policy doesn't perform as well on mixed arrays as the
policies you proposed, but it performs well on homogeneous arrays, maybe
it's better if I just focus on the homogeneous case and save some CPU
cycles by not storing physical locations.

> It is better to have random workloads in the above three categories
> of configs.
>
> Apart from the above three configs, there is also
> all-non-rotational with hetero
> For example, ssd and nvme together both are non-rotational.
> And,
> all-rotational with hetero
> For example, rotational disks with different speeds.
>
> The inflight calculation is local to btrfs. If the device is busy due to
> external factors, it would not switch to the better performing device.
>

Good point. Maybe I should try to use the part stats instead of storing
inflight locally in btrfs.
Re: [PATCH RFC 6/6] btrfs: Add roundrobin raid1 read policy
On Thu, Feb 11, 2021 at 03:27:38AM +0100, Michał Mirosław wrote:
> On Wed, Feb 10, 2021 at 07:23:04PM +, Michal Rostecki wrote:
> > On Wed, Feb 10, 2021 at 01:58:15PM +0100, Michał Mirosław wrote:
> > > On Wed, Feb 10, 2021 at 12:29:25PM +, Michal Rostecki wrote:
> > > > On Wed, Feb 10, 2021 at 05:24:28AM +0100, Michał Mirosław wrote:
> > > > > This looks like it effectively decreases queue depth for non-last
> > > > > device. After all devices are filled to queue_depth-penalty, only
> > > > > a single mirror will be selected for next reads (until a read on
> > > > > some other one completes).
> > > > >
> > > > Good point. And if all devices are going to be filled for a longer
> > > > time, this function will keep selecting the last one. Maybe I should
> > > > select last+1 in that case. Would that address your concern or did
> > > > you have any other solution in mind?
> > >
> > > The best would be to postpone the selection until one device becomes
> > > free again. But if that's not doable, then yes, you could make sure it
> > > stays round-robin after filling the queues (the scheduling will lose
> > > the "penalty"-driven adjustment though).
> >
> > Or another idea - when all the queues are filled, return the mirror
> > which has the lowest load (inflight + penalty), even though it's greater
> > than the queue depth. In that case the scheduling will not lose the
> > penalty adjustment and the load is going to be spread more fairly.
> >
> > I'm not sure if postponing the selection is that good an idea. I think
> > it's better if the request is added to the iosched queue anyway, even if
> > the disks' queues are filled, and let the I/O scheduler handle that. The
> > length of the iosched queue (nr_requests, an attribute of the iosched)
> > is usually greater than the queue depth (an attribute of the device),
> > which means that it's fine to schedule more requests for the iosched to
> > handle.
> >
> > IMO btrfs should use the information given by the iosched only for
> > heuristic mirror selection, rather than implement its own throttling
> > logic.
> >
> > Does it make sense to you?
> >
> > Another idea could be an additional iteration with regard to
> > nr_requests, if all load values are greater than the queue depths,
> > though it might be overkill. I would prefer to stick to my first idea
> > if everyone agrees.
>
> What if the iosched could provide an estimate of a request's latency? Then
> btrfs could always select the lowest. For reads from NVMe/SSD I would
> normally expect something simple: speed_factor * (pending_bytes + req_bytes).
> For HDDs this could do more computation, like looking into what is already
> in the queue.
>
> This would deviate from the simple round-robin scheme, though.

I think that idea is close to what Anand Jain did in his latency policy,
though instead of trying to predict the latency of the currently scheduled
request, it uses stats about previous read operations:

https://patchwork.kernel.org/project/linux-btrfs/patch/63f6f00e2ecc741efd2200c3c87b5db52c6be2fd.164341.git.anand.j...@oracle.com/

Indeed your idea would deviate from the simple round-robin, so maybe it
would be good to try it, but as a separate policy in a separate patch(set).

Regards,
Michal
Re: [PATCH RFC 6/6] btrfs: Add roundrobin raid1 read policy
On Wed, Feb 10, 2021 at 07:23:04PM +, Michal Rostecki wrote:
> On Wed, Feb 10, 2021 at 01:58:15PM +0100, Michał Mirosław wrote:
> > On Wed, Feb 10, 2021 at 12:29:25PM +, Michal Rostecki wrote:
> > > On Wed, Feb 10, 2021 at 05:24:28AM +0100, Michał Mirosław wrote:
> > > > This looks like it effectively decreases queue depth for non-last
> > > > device. After all devices are filled to queue_depth-penalty, only
> > > > a single mirror will be selected for next reads (until a read on
> > > > some other one completes).
> > > >
> > > Good point. And if all devices are going to be filled for a longer
> > > time, this function will keep selecting the last one. Maybe I should
> > > select last+1 in that case. Would that address your concern or did
> > > you have any other solution in mind?
> >
> > The best would be to postpone the selection until one device becomes
> > free again. But if that's not doable, then yes, you could make sure it
> > stays round-robin after filling the queues (the scheduling will lose
> > the "penalty"-driven adjustment though).
>
> Or another idea - when all the queues are filled, return the mirror
> which has the lowest load (inflight + penalty), even though it's greater
> than the queue depth. In that case the scheduling will not lose the
> penalty adjustment and the load is going to be spread more fairly.
>
> I'm not sure if postponing the selection is that good an idea. I think
> it's better if the request is added to the iosched queue anyway, even if
> the disks' queues are filled, and let the I/O scheduler handle that. The
> length of the iosched queue (nr_requests, an attribute of the iosched)
> is usually greater than the queue depth (an attribute of the device),
> which means that it's fine to schedule more requests for the iosched to
> handle.
>
> IMO btrfs should use the information given by the iosched only for
> heuristic mirror selection, rather than implement its own throttling
> logic.
>
> Does it make sense to you?
>
> Another idea could be an additional iteration with regard to
> nr_requests, if all load values are greater than the queue depths,
> though it might be overkill. I would prefer to stick to my first idea
> if everyone agrees.

What if the iosched could provide an estimate of a request's latency? Then
btrfs could always select the lowest. For reads from NVMe/SSD I would
normally expect something simple: speed_factor * (pending_bytes + req_bytes).
For HDDs this could do more computation, like looking into what is already
in the queue.

This would deviate from the simple round-robin scheme, though.

Best Regards
Michał Mirosław
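To illustrate the NVMe/SSD case of this idea, a minimal sketch of how such an
estimate could drive mirror selection - every name here (mirror_info,
speed_factor, pending_bytes, the function itself) is hypothetical, not an
existing iosched or btrfs interface:

        /* Hypothetical per-mirror numbers: a device speed factor and the
         * amount of data already queued on it.
         */
        struct mirror_info {
                u64 speed_factor;       /* e.g. ns per byte for the device */
                u64 pending_bytes;      /* bytes already queued on the device */
        };

        /* Pick the mirror with the lowest estimated completion time for a
         * read of req_bytes: estimate ~ speed_factor * (pending + request).
         * An HDD-aware variant would also have to model seek distance based
         * on what already sits in the queue.
         */
        static int select_lowest_latency(const struct mirror_info *mirrors,
                                         int num_mirrors, u64 req_bytes)
        {
                u64 best = U64_MAX;
                int i, best_mirror = 0;

                for (i = 0; i < num_mirrors; i++) {
                        u64 est = mirrors[i].speed_factor *
                                  (mirrors[i].pending_bytes + req_bytes);

                        if (est < best) {
                                best = est;
                                best_mirror = i;
                        }
                }

                return best_mirror;
        }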
Re: [PATCH RFC 6/6] btrfs: Add roundrobin raid1 read policy
On Wed, Feb 10, 2021 at 01:58:15PM +0100, Michał Mirosław wrote:
> On Wed, Feb 10, 2021 at 12:29:25PM +, Michal Rostecki wrote:
> > On Wed, Feb 10, 2021 at 05:24:28AM +0100, Michał Mirosław wrote:
> > > This looks like it effectively decreases queue depth for non-last
> > > device. After all devices are filled to queue_depth-penalty, only
> > > a single mirror will be selected for next reads (until a read on
> > > some other one completes).
> > >
> > Good point. And if all devices are going to be filled for a longer
> > time, this function will keep selecting the last one. Maybe I should
> > select last+1 in that case. Would that address your concern or did
> > you have any other solution in mind?
>
> The best would be to postpone the selection until one device becomes
> free again. But if that's not doable, then yes, you could make sure it
> stays round-robin after filling the queues (the scheduling will lose
> the "penalty"-driven adjustment though).

Or another idea - when all the queues are filled, return the mirror
which has the lowest load (inflight + penalty), even though it's greater
than the queue depth. In that case the scheduling will not lose the
penalty adjustment and the load is going to be spread more fairly.

I'm not sure if postponing the selection is that good an idea. I think
it's better if the request is added to the iosched queue anyway, even if
the disks' queues are filled, and let the I/O scheduler handle that. The
length of the iosched queue (nr_requests, an attribute of the iosched)
is usually greater than the queue depth (an attribute of the device),
which means that it's fine to schedule more requests for the iosched to
handle.

IMO btrfs should use the information given by the iosched only for
heuristic mirror selection, rather than implement its own throttling
logic.

Does it make sense to you?

Another idea could be an additional iteration with regard to
nr_requests, if all load values are greater than the queue depths,
though it might be overkill. I would prefer to stick to my first idea
if everyone agrees.

Regards,
Michal
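To make the first idea above concrete, here is a rough sketch of a selection
loop that falls back to the least loaded mirror once every queue is full (the
load and queue depth arrays stand in for the mirror_load() and
blk_queue_depth() values used in the patchset; this is not the actual patch
code):

        /* Sketch: starting from the last used mirror, pick the first mirror
         * whose load (inflight + penalty) is below its queue depth; if all
         * queues are full, pick the least loaded mirror instead of always
         * falling back to the last used one.
         */
        static int select_mirror(const u32 *load, const u32 *queue_depth,
                                 int num_mirrors, int last_mirror)
        {
                u32 best_load = U32_MAX;
                int i, best_mirror = last_mirror;

                for (i = 0; i < num_mirrors; i++) {
                        int mirror = (last_mirror + i) % num_mirrors;

                        if (load[mirror] < queue_depth[mirror])
                                return mirror;          /* queue not full, take it */

                        if (load[mirror] < best_load) { /* remember the least loaded */
                                best_load = load[mirror];
                                best_mirror = mirror;
                        }
                }

                return best_mirror;                     /* all queues full */
        }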
Re: [PATCH RFC 6/6] btrfs: Add roundrobin raid1 read policy
On Wed, Feb 10, 2021 at 12:29:25PM +, Michal Rostecki wrote:
> On Wed, Feb 10, 2021 at 05:24:28AM +0100, Michał Mirosław wrote:
> > On Tue, Feb 09, 2021 at 09:30:40PM +0100, Michal Rostecki wrote:
> > [...]
> > > For the array with 3 HDDs, not adding any penalty resulted in 409MiB/s
> > > (429MB/s) performance. Adding the penalty value 1 resulted in a
> > > performance drop to 404MiB/s (424MB/s). Increasing the value towards 10
> > > was making the performance even worse.
> > >
> > > For the array with 2 HDDs and 1 SSD, adding penalty value 1 to
> > > rotational disks resulted in the best performance - 541MiB/s (567MB/s).
> > > Not adding any value and increasing the value was making the performance
> > > worse.
> > >
> > > Adding penalty value to non-rotational disks was always decreasing the
> > > performance, which motivated setting it as 0 by default. For the purpose
> > > of testing, it's still configurable.
> > [...]
> > > +	bdev = map->stripes[mirror_index].dev->bdev;
> > > +	inflight = mirror_load(fs_info, map, mirror_index, stripe_offset,
> > > +			       stripe_nr);
> > > +	queue_depth = blk_queue_depth(bdev->bd_disk->queue);
> > > +
> > > +	return inflight < queue_depth;
> > [...]
> > > +	last_mirror = this_cpu_read(*fs_info->last_mirror);
> > [...]
> > > +	for (i = last_mirror; i < first + num_stripes; i++) {
> > > +		if (mirror_queue_not_filled(fs_info, map, i, stripe_offset,
> > > +					    stripe_nr)) {
> > > +			preferred_mirror = i;
> > > +			goto out;
> > > +		}
> > > +	}
> > > +
> > > +	for (i = first; i < last_mirror; i++) {
> > > +		if (mirror_queue_not_filled(fs_info, map, i, stripe_offset,
> > > +					    stripe_nr)) {
> > > +			preferred_mirror = i;
> > > +			goto out;
> > > +		}
> > > +	}
> > > +
> > > +	preferred_mirror = last_mirror;
> > > +
> > > +out:
> > > +	this_cpu_write(*fs_info->last_mirror, preferred_mirror);
> >
> > This looks like it effectively decreases queue depth for non-last
> > device. After all devices are filled to queue_depth-penalty, only
> > a single mirror will be selected for next reads (until a read on
> > some other one completes).
> >
> Good point. And if all devices are going to be filled for a longer
> time, this function will keep selecting the last one. Maybe I should
> select last+1 in that case. Would that address your concern or did
> you have any other solution in mind?

The best would be to postpone the selection until one device becomes
free again. But if that's not doable, then yes, you could make sure it
stays round-robin after filling the queues (the scheduling will lose
the "penalty"-driven adjustment though).

Best Regards,
Michał Mirosław
Re: [PATCH RFC 6/6] btrfs: Add roundrobin raid1 read policy
On Wed, Feb 10, 2021 at 05:24:28AM +0100, Michał Mirosław wrote:
> On Tue, Feb 09, 2021 at 09:30:40PM +0100, Michal Rostecki wrote:
> [...]
> > For the array with 3 HDDs, not adding any penalty resulted in 409MiB/s
> > (429MB/s) performance. Adding the penalty value 1 resulted in a
> > performance drop to 404MiB/s (424MB/s). Increasing the value towards 10
> > was making the performance even worse.
> >
> > For the array with 2 HDDs and 1 SSD, adding penalty value 1 to
> > rotational disks resulted in the best performance - 541MiB/s (567MB/s).
> > Not adding any value and increasing the value was making the performance
> > worse.
> >
> > Adding penalty value to non-rotational disks was always decreasing the
> > performance, which motivated setting it as 0 by default. For the purpose
> > of testing, it's still configurable.
> [...]
> > +	bdev = map->stripes[mirror_index].dev->bdev;
> > +	inflight = mirror_load(fs_info, map, mirror_index, stripe_offset,
> > +			       stripe_nr);
> > +	queue_depth = blk_queue_depth(bdev->bd_disk->queue);
> > +
> > +	return inflight < queue_depth;
> [...]
> > +	last_mirror = this_cpu_read(*fs_info->last_mirror);
> [...]
> > +	for (i = last_mirror; i < first + num_stripes; i++) {
> > +		if (mirror_queue_not_filled(fs_info, map, i, stripe_offset,
> > +					    stripe_nr)) {
> > +			preferred_mirror = i;
> > +			goto out;
> > +		}
> > +	}
> > +
> > +	for (i = first; i < last_mirror; i++) {
> > +		if (mirror_queue_not_filled(fs_info, map, i, stripe_offset,
> > +					    stripe_nr)) {
> > +			preferred_mirror = i;
> > +			goto out;
> > +		}
> > +	}
> > +
> > +	preferred_mirror = last_mirror;
> > +
> > +out:
> > +	this_cpu_write(*fs_info->last_mirror, preferred_mirror);
>
> This looks like it effectively decreases queue depth for non-last
> device. After all devices are filled to queue_depth-penalty, only
> a single mirror will be selected for next reads (until a read on
> some other one completes).
>

Good point. And if all devices are going to be filled for a longer time,
this function will keep selecting the last one. Maybe I should select
last+1 in that case. Would that address your concern or did you have any
other solution in mind?

Thanks for pointing that out.

> Have you tried testing with many more jobs / non-sequential accesses?
>

I didn't try with non-sequential accesses. Will do that before respinning
v2.

> Best Regards,
> Michał Mirosław

Regards,
Michal
Re: [PATCH RFC 6/6] btrfs: Add roundrobin raid1 read policy
On 10/02/2021 04:30, Michal Rostecki wrote:

From: Michal Rostecki

Add a new raid1 read policy `roundrobin`. For each read request, it selects
the mirror which has a lower load than the queue depth, and it starts
iterating from the last used mirror (by the current CPU). Load is defined
as the number of inflight requests + a potential penalty value.

The policy can be enabled through sysfs:

  # echo roundrobin > /sys/fs/btrfs/[fsid]/read_policies/policy

This policy was tested with fio and compared with the default `pid` policy.

The singlethreaded test has the following parameters:

  [global]
  name=btrfs-raid1-seqread
  filename=btrfs-raid1-seqread
  rw=read
  bs=64k
  direct=0
  numjobs=1
  time_based=0

  [file1]
  size=10G
  ioengine=libaio

and shows the following results:

- raid1c3 with 3 HDDs:
  3 x Seagate Barracuda ST2000DM008 (2TB)

  * pid policy
    READ: bw=217MiB/s (228MB/s), 217MiB/s-217MiB/s (228MB/s-228MB/s), io=10.0GiB (10.7GB), run=47082-47082msec
  * roundrobin policy
    READ: bw=409MiB/s (429MB/s), 409MiB/s-409MiB/s (429MB/s-429MB/s), io=10.0GiB (10.7GB), run=25028-25028msec

- raid1c3 with 2 HDDs and 1 SSD:
  2 x Seagate Barracuda ST2000DM008 (2TB)
  1 x Crucial CT256M550SSD1 (256GB)

  * pid policy (the worst case when only HDDs were chosen)
    READ: bw=220MiB/s (231MB/s), 220MiB/s-220MiB/s (231MB/s-231MB/s), io=10.0GiB (10.7GB), run=46577-46577msec
  * pid policy (the best case when the SSD was used as well)
    READ: bw=513MiB/s (538MB/s), 513MiB/s-513MiB/s (538MB/s-538MB/s), io=10.0GiB (10.7GB), run=19954-19954msec
  * roundrobin (there are no noticeable differences when testing multiple times)
    READ: bw=541MiB/s (567MB/s), 541MiB/s-541MiB/s (567MB/s-567MB/s), io=10.0GiB (10.7GB), run=18933-18933msec

The multithreaded test has the following parameters:

  [global]
  name=btrfs-raid1-seqread
  filename=btrfs-raid1-seqread
  rw=read
  bs=64k
  direct=0
  numjobs=8
  time_based=0

  [file1]
  size=10G
  ioengine=libaio

and shows the following results:

- raid1c3 with 3 HDDs:
  3 x Seagate Barracuda ST2000DM008 (2TB)

  * pid policy
    READ: bw=1569MiB/s (1645MB/s), 196MiB/s-196MiB/s (206MB/s-206MB/s), io=80.0GiB (85.9GB), run=52210-52211msec
  * roundrobin
    READ: bw=1733MiB/s (1817MB/s), 217MiB/s-217MiB/s (227MB/s-227MB/s), io=80.0GiB (85.9GB), run=47269-47271msec

- raid1c3 with 2 HDDs and 1 SSD:
  2 x Seagate Barracuda ST2000DM008 (2TB)
  1 x Crucial CT256M550SSD1 (256GB)

  * pid policy
    READ: bw=1843MiB/s (1932MB/s), 230MiB/s-230MiB/s (242MB/s-242MB/s), io=80.0GiB (85.9GB), run=9-44450msec
  * roundrobin
    READ: bw=2485MiB/s (2605MB/s), 311MiB/s-311MiB/s (326MB/s-326MB/s), io=80.0GiB (85.9GB), run=32969-32970msec

The penalty value is an additional value added to the number of inflight
requests when a scheduled request is non-local (which means it would start
from a different physical location than the physical location of the last
request processed by the given device). By default, it's applied only in
filesystems which have mixed types of devices (non-rotational and
rotational), but it can be configured to be applied without that condition.

The configuration is done through sysfs:

- /sys/fs/btrfs/[fsid]/read_policies/roundrobin_nonlocal_inc_mixed_only

where the value 1 (the default) means applying the penalty only in mixed
arrays, and 0 means applying it unconditionally.

The exact penalty value is defined separately for non-rotational and
rotational devices. By default, it's 0 for non-rotational devices and 1 for
rotational devices. Both values are configurable through sysfs:

- /sys/fs/btrfs/[fsid]/read_policies/roundrobin_nonrot_nonlocal_inc
- /sys/fs/btrfs/[fsid]/read_policies/roundrobin_rot_nonlocal_inc

To sum it up - the default case is applying the penalty under the following
conditions:

- the raid1 array consists of mixed types of devices
- the scheduled request is going to be non-local for the given disk
- the device is rotational

That default case is based on a slight preference towards non-rotational
disks in mixed arrays and has proven to give the best performance in tested
arrays.

For the array with 3 HDDs, not adding any penalty resulted in 409MiB/s
(429MB/s) performance. Adding the penalty value 1 resulted in a performance
drop to 404MiB/s (424MB/s). Increasing the value towards 10 was making the
performance even worse.

For the array with 2 HDDs and 1 SSD, adding penalty value 1 to rotational
disks resulted in the best performance - 541MiB/s (567MB/s). Not adding any
value and increasing the value was making the performance worse.

Adding penalty value to non-rotational disks was always decreasing the
performance, which motivated setting it as 0 by default. For the purpose of
testing, it's still configurable.

To measure the performance of each policy and find optimal penalty values,
I created scripts which are available here:

https://gitlab.com/vadorovsky/btrfs-perf
https://github.com/mrostecki/btrfs-perf
Re: [PATCH RFC 6/6] btrfs: Add roundrobin raid1 read policy
On Tue, Feb 09, 2021 at 09:30:40PM +0100, Michal Rostecki wrote:
[...]
> For the array with 3 HDDs, not adding any penalty resulted in 409MiB/s
> (429MB/s) performance. Adding the penalty value 1 resulted in a
> performance drop to 404MiB/s (424MB/s). Increasing the value towards 10
> was making the performance even worse.
>
> For the array with 2 HDDs and 1 SSD, adding penalty value 1 to
> rotational disks resulted in the best performance - 541MiB/s (567MB/s).
> Not adding any value and increasing the value was making the performance
> worse.
>
> Adding penalty value to non-rotational disks was always decreasing the
> performance, which motivated setting it as 0 by default. For the purpose
> of testing, it's still configurable.
[...]
> +	bdev = map->stripes[mirror_index].dev->bdev;
> +	inflight = mirror_load(fs_info, map, mirror_index, stripe_offset,
> +			       stripe_nr);
> +	queue_depth = blk_queue_depth(bdev->bd_disk->queue);
> +
> +	return inflight < queue_depth;
[...]
> +	last_mirror = this_cpu_read(*fs_info->last_mirror);
[...]
> +	for (i = last_mirror; i < first + num_stripes; i++) {
> +		if (mirror_queue_not_filled(fs_info, map, i, stripe_offset,
> +					    stripe_nr)) {
> +			preferred_mirror = i;
> +			goto out;
> +		}
> +	}
> +
> +	for (i = first; i < last_mirror; i++) {
> +		if (mirror_queue_not_filled(fs_info, map, i, stripe_offset,
> +					    stripe_nr)) {
> +			preferred_mirror = i;
> +			goto out;
> +		}
> +	}
> +
> +	preferred_mirror = last_mirror;
> +
> +out:
> +	this_cpu_write(*fs_info->last_mirror, preferred_mirror);

This looks like it effectively decreases queue depth for non-last
device. After all devices are filled to queue_depth-penalty, only
a single mirror will be selected for next reads (until a read on
some other one completes).

Have you tried testing with many more jobs / non-sequential accesses?

Best Regards,
Michał Mirosław
[PATCH RFC 6/6] btrfs: Add roundrobin raid1 read policy
From: Michal Rostecki

Add a new raid1 read policy `roundrobin`. For each read request, it selects
the mirror which has a lower load than the queue depth, and it starts
iterating from the last used mirror (by the current CPU). Load is defined
as the number of inflight requests + a potential penalty value.

The policy can be enabled through sysfs:

  # echo roundrobin > /sys/fs/btrfs/[fsid]/read_policies/policy

This policy was tested with fio and compared with the default `pid` policy.

The singlethreaded test has the following parameters:

  [global]
  name=btrfs-raid1-seqread
  filename=btrfs-raid1-seqread
  rw=read
  bs=64k
  direct=0
  numjobs=1
  time_based=0

  [file1]
  size=10G
  ioengine=libaio

and shows the following results:

- raid1c3 with 3 HDDs:
  3 x Seagate Barracuda ST2000DM008 (2TB)

  * pid policy
    READ: bw=217MiB/s (228MB/s), 217MiB/s-217MiB/s (228MB/s-228MB/s), io=10.0GiB (10.7GB), run=47082-47082msec
  * roundrobin policy
    READ: bw=409MiB/s (429MB/s), 409MiB/s-409MiB/s (429MB/s-429MB/s), io=10.0GiB (10.7GB), run=25028-25028msec

- raid1c3 with 2 HDDs and 1 SSD:
  2 x Seagate Barracuda ST2000DM008 (2TB)
  1 x Crucial CT256M550SSD1 (256GB)

  * pid policy (the worst case when only HDDs were chosen)
    READ: bw=220MiB/s (231MB/s), 220MiB/s-220MiB/s (231MB/s-231MB/s), io=10.0GiB (10.7GB), run=46577-46577msec
  * pid policy (the best case when the SSD was used as well)
    READ: bw=513MiB/s (538MB/s), 513MiB/s-513MiB/s (538MB/s-538MB/s), io=10.0GiB (10.7GB), run=19954-19954msec
  * roundrobin (there are no noticeable differences when testing multiple times)
    READ: bw=541MiB/s (567MB/s), 541MiB/s-541MiB/s (567MB/s-567MB/s), io=10.0GiB (10.7GB), run=18933-18933msec

The multithreaded test has the following parameters:

  [global]
  name=btrfs-raid1-seqread
  filename=btrfs-raid1-seqread
  rw=read
  bs=64k
  direct=0
  numjobs=8
  time_based=0

  [file1]
  size=10G
  ioengine=libaio

and shows the following results:

- raid1c3 with 3 HDDs:
  3 x Seagate Barracuda ST2000DM008 (2TB)

  * pid policy
    READ: bw=1569MiB/s (1645MB/s), 196MiB/s-196MiB/s (206MB/s-206MB/s), io=80.0GiB (85.9GB), run=52210-52211msec
  * roundrobin
    READ: bw=1733MiB/s (1817MB/s), 217MiB/s-217MiB/s (227MB/s-227MB/s), io=80.0GiB (85.9GB), run=47269-47271msec

- raid1c3 with 2 HDDs and 1 SSD:
  2 x Seagate Barracuda ST2000DM008 (2TB)
  1 x Crucial CT256M550SSD1 (256GB)

  * pid policy
    READ: bw=1843MiB/s (1932MB/s), 230MiB/s-230MiB/s (242MB/s-242MB/s), io=80.0GiB (85.9GB), run=9-44450msec
  * roundrobin
    READ: bw=2485MiB/s (2605MB/s), 311MiB/s-311MiB/s (326MB/s-326MB/s), io=80.0GiB (85.9GB), run=32969-32970msec

The penalty value is an additional value added to the number of inflight
requests when a scheduled request is non-local (which means it would start
from a different physical location than the physical location of the last
request processed by the given device). By default, it's applied only in
filesystems which have mixed types of devices (non-rotational and
rotational), but it can be configured to be applied without that condition.

The configuration is done through sysfs:

- /sys/fs/btrfs/[fsid]/read_policies/roundrobin_nonlocal_inc_mixed_only

where the value 1 (the default) means applying the penalty only in mixed
arrays, and 0 means applying it unconditionally.

The exact penalty value is defined separately for non-rotational and
rotational devices. By default, it's 0 for non-rotational devices and 1 for
rotational devices. Both values are configurable through sysfs:

- /sys/fs/btrfs/[fsid]/read_policies/roundrobin_nonrot_nonlocal_inc
- /sys/fs/btrfs/[fsid]/read_policies/roundrobin_rot_nonlocal_inc

To sum it up - the default case is applying the penalty under the following
conditions:

- the raid1 array consists of mixed types of devices
- the scheduled request is going to be non-local for the given disk
- the device is rotational

That default case is based on a slight preference towards non-rotational
disks in mixed arrays and has proven to give the best performance in tested
arrays.

For the array with 3 HDDs, not adding any penalty resulted in 409MiB/s
(429MB/s) performance. Adding the penalty value 1 resulted in a performance
drop to 404MiB/s (424MB/s). Increasing the value towards 10 was making the
performance even worse.

For the array with 2 HDDs and 1 SSD, adding penalty value 1 to rotational
disks resulted in the best performance - 541MiB/s (567MB/s). Not adding any
value and increasing the value was making the performance worse.

Adding penalty value to non-rotational disks was always decreasing the
performance, which motivated setting it as 0 by default. For the purpose of
testing, it's still configurable.

To measure the performance of each policy and find optimal penalty values,
I created scripts which are available here:

https://gitlab.com/vadorovsky/btrfs-perf
https://github.com/mrostecki/btrfs-perf

Signed-off-by: Michal
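To make the default penalty behavior described above easier to follow, here
is a simplified sketch of how the per-mirror load could be derived from it
(an illustration of the description only, not the code from the patch; the
helper name and parameters are made up):

        /* Sketch: load = inflight + penalty. The penalty applies only when
         * the request would be non-local (it does not start where the
         * device's previous request ended) and, with the default
         * mixed_only=true, only in arrays mixing rotational and
         * non-rotational devices. Defaults: rot_penalty = 1,
         * nonrot_penalty = 0.
         */
        static u32 mirror_load_sketch(u32 inflight, bool nonlocal,
                                      bool mixed_array, bool mixed_only,
                                      bool dev_rotational,
                                      u32 rot_penalty, u32 nonrot_penalty)
        {
                u32 load = inflight;

                if (nonlocal && (mixed_array || !mixed_only))
                        load += dev_rotational ? rot_penalty : nonrot_penalty;

                return load;
        }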