Re: [PATCH RFC 6/6] btrfs: Add roundrobin raid1 read policy

2021-02-12 Thread Michal Rostecki
Hi Anand,

re: inflight calculation

On Thu, Feb 11, 2021 at 03:55:33PM +0000, Michal Rostecki wrote:
> > It is better to have random workloads in the above three categories
> > of configs.
> > 
> > Apart from the above three configs, there is also
> >  all-non-rotational with hetero
> > For example, ssd and nvme together both are non-rotational.
> > And,
> >  all-rotational with hetero
> > For example, rotational disks with different speeds.
> > 
> > 
> > The inflight calculation is local to btrfs. If the device is busy due to
> > external factors, it would not switch to the better performing device.
> > 
> 
> Good point. Maybe I should try to use the part stats instead of storing
> inflight locally in btrfs.

I tried today to reuse the inflight calculation which is done for
iostat, and I came to the conclusion that it's impossible without
exporting some methods from the block/ subsystem.

The thing is that there are two methods of calculating inflight, and
which one is used depends on queue_is_mq():

https://github.com/kdave/btrfs-devel/blob/9d294a685fbcb256ce8c5f7fd88a7596d0f52a8a/block/genhd.c#L1163
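
That dispatch is essentially (condensed from the linked code):

    if (queue_is_mq(q))
            inflight = blk_mq_in_flight(q, part);
    else
            inflight = part_in_flight(part);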

And when that condition is true, I noticed that the part_stats method
returns 0, even though there are requests being processed (I checked
with fio inside a VM).

In general, those two methods of checking inflight are:

1) part_stats - which would be trivial to reuse, just a matter of one
   simple for_each_possible_cpu() loop with part_stat_local_read_cpu()
   (see the sketch after this list)

https://github.com/kdave/btrfs-devel/blob/9d294a685fbcb256ce8c5f7fd88a7596d0f52a8a/block/genhd.c#L133-L146

2) blk_mq_in_flight() - which involves a lot of code and unexported
   structs inside the block/ directory, plus a double function
   callback; definitely not easy to reimplement in btrfs without
   copying dozens of lines

https://github.com/kdave/btrfs-devel/blob/9d294a685fbcb256ce8c5f7fd88a7596d0f52a8a/block/blk-mq.c#L115-L123
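
For reference, the loop from 1) looks roughly like this (adapted from
the linked block/genhd.c; part_in_flight() is static there, so btrfs
would need an export or its own copy):

    static unsigned int part_in_flight(struct block_device *part)
    {
            unsigned int inflight = 0;
            int cpu;

            /* Sum the per-cpu inflight counters for reads and writes. */
            for_each_possible_cpu(cpu) {
                    inflight += part_stat_local_read_cpu(part,
                                    in_flight[STAT_READ], cpu) +
                                part_stat_local_read_cpu(part,
                                    in_flight[STAT_WRITE], cpu);
            }
            /* The per-cpu deltas can transiently sum to a negative value. */
            if ((int)inflight < 0)
                    inflight = 0;

            return inflight;
    }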

Well, I tried copying the whole blk_mq_in_flight() function with all
its dependencies anyway; it's hard to do without causing modpost
errors.

So, to sum it up, I think that making 2) reusable in btrfs would
require at least exporting the blk_mq_in_flight() function, so the
change would have to go through the linux-block tree. That maybe would
be a good thing to do in the long term, but I'm not sure it should
block my patchset entirely.

The question is whether we are fine with inflight stats kept inside
btrfs. Personally, I think we shouldn't be too concerned about it. The
inflight counter in my patches is a percpu counter, used in places
which already do some atomic operations.
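
To illustrate (a simplified sketch with made-up names, not the exact
patch code - the real thing hooks into the bio submission and
completion paths):

    #include <linux/percpu_counter.h>

    struct mirror_stats {
            struct percpu_counter inflight;
    };

    /* One-time setup, e.g. at mount time. */
    static int mirror_stats_init(struct mirror_stats *ms)
    {
            return percpu_counter_init(&ms->inflight, 0, GFP_KERNEL);
    }

    /* Called when a read bio is submitted to the mirror's device. */
    static void mirror_read_start(struct mirror_stats *ms)
    {
            percpu_counter_inc(&ms->inflight);
    }

    /* Called from the read bio's end_io handler. */
    static void mirror_read_end(struct mirror_stats *ms)
    {
            percpu_counter_dec(&ms->inflight);
    }

    /* Approximate inflight count used when selecting a mirror. */
    static s64 mirror_inflight(struct mirror_stats *ms)
    {
            return percpu_counter_sum_positive(&ms->inflight);
    }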

Thanks,
Michal


Re: [PATCH RFC 6/6] btrfs: Add roundrobin raid1 read policy

2021-02-11 Thread Michal Rostecki
On Wed, Feb 10, 2021 at 04:20:20PM +0800, Anand Jain wrote:
> On 10/02/2021 04:30, Michal Rostecki wrote:
> > The penalty value is an additional value added to the number of inflight
> > requests when a scheduled request is non-local (which means it would
> > start from a different physical location than that of the last request
> > processed by the given device). By default, it's
> > applied only in filesystems which have mixed types of devices
> > (non-rotational and rotational), but it can be configured to be applied
> > without that condition.
> > 
> > The configuration is done through sysfs:
> > 
> > - /sys/fs/btrfs/[fsid]/read_policies/roundrobin_nonlocal_inc_mixed_only
> > 
> > where the value 1 (the default) means applying the penalty only in mixed
> > arrays, and 0 means applying it unconditionally.
> >
> > The exact penalty value is defined separately for non-rotational and
> > rotational devices. By default, it's 0 for non-rotational devices and 1
> > for rotational devices. Both values are configurable through sysfs:
> > 
> > - /sys/fs/btrfs/[fsid]/read_policies/roundrobin_nonrot_nonlocal_inc
> > - /sys/fs/btrfs/[fsid]/read_policies/roundrobin_rot_nonlocal_inc
> > 
> > To sum it up - the default case is applying the penalty under the
> > following conditions:
> > 
> > - the raid1 array consists of mixed types of devices
> > - the scheduled request is going to be non-local for the given disk
> > - the device is rotational
> >
> > That default case is based on a slight preference towards non-rotational
> > disks in mixed arrays and has proven to give the best performance in
> > tested arrays.
> > 
> > For the array with 3 HDDs, not adding any penalty resulted in 409MiB/s
> > (429MB/s) performance. Adding the penalty value 1 resulted in a
> > performance drop to 404MiB/s (424MB/s). Increasing the value towards 10
> > was making the performance even worse.
> > 
> > For the array with 2 HDDs and 1 SSD, adding penalty value 1 to
> > rotational disks resulted in the best performance - 541MiB/s (567MB/s).
> > Not adding any value and increasing the value was making the performance
> > worse.
> > 
> > Adding penalty value to non-rotational disks was always decreasing the
> > performance, which motivated setting it as 0 by default. For the purpose
> > of testing, it's still configurable.
> >
> > To measure the performance of each policy and find optimal penalty
> > values, I created scripts which are available here:
> > 
> 
> So in summary
>  rotational + non-rotational: penalty = 1
>  all-rotational and homo: penalty = 0
>  all-non-rotational and homo: penalty = 0
> 
> I can't find any non-determinism in your findings above.
> It is not very clear to me if we need the configurable
> parameters here.
> 

Honestly, the main reason why I made it configurable is to check the
performance of different values without editing and recompiling the
kernel. I was trying to find the best set of values with my simple
Python script, which tries all values from 0 to 9 and runs fio. I left
those parameters configurable in my patches just in case someone else
would like to tune them in their environments.

The script is here:

https://github.com/mrostecki/btrfs-perf/blob/main/roundrobin-tune.py

But on the other hand, as I mentioned in the other mail, I'm getting
skeptical about having the whole penalty mechanism in general. As I
wrote and as you pointed out, it improves the performance only for
mixed arrays. And since the roundrobin policy doesn't perform as well
on mixed arrays as the policies you proposed, but it performs well on
homogeneous arrays, maybe it's better if I just focus on the
homogeneous case and save some CPU cycles by not storing physical
locations.

> It is better to have random workloads in the above three categories
> of configs.
> 
> Apart from the above three configs, there is also
>  all-non-rotational with hetero
> For example, ssd and nvme together both are non-rotational.
> And,
>  all-rotational with hetero
> For example, rotational disks with different speeds.
> 
> 
> The inflight calculation is local to btrfs. If the device is busy due to
> external factors, it would not switch to the better performing device.
> 

Good point. Maybe I should try to use the part stats instead of storing
inflight locally in btrfs.


Re: [PATCH RFC 6/6] btrfs: Add roundrobin raid1 read policy

2021-02-11 Thread Michal Rostecki
On Thu, Feb 11, 2021 at 03:27:38AM +0100, Michał Mirosław wrote:
> On Wed, Feb 10, 2021 at 07:23:04PM +0000, Michal Rostecki wrote:
> > On Wed, Feb 10, 2021 at 01:58:15PM +0100, Michał Mirosław wrote:
> > > On Wed, Feb 10, 2021 at 12:29:25PM +0000, Michal Rostecki wrote:
> > > > On Wed, Feb 10, 2021 at 05:24:28AM +0100, Michał Mirosław wrote:
> > > > > This looks like it effectively decreases queue depth for non-last
> > > > > device. After all devices are filled to queue_depth-penalty, only
> > > > > a single mirror will be selected for next reads (until a read on
> > > > > some other one completes).
> > > > > 
> > > > 
> > > > Good point. And if all devices are going to be filled for a longer time,
> > > > this function will keep selecting the last one. Maybe I should select
> > > > last+1 in that case. Would that address your concern or did you have any
> > > > other solution in mind?
> > > 
> > > The best would be to postpone the selection until one device becomes free
> > > again. But if that's not doable, then yes, you could make sure it stays
> > > round-robin after filling the queues (the scheduling will lose the
> > > "penalty"-driven adjustment though).
> > 
> > Or another idea - when all the queues are filled, return the mirror
> > which has the lowest load (inflight + penalty), even though it's greater
> > than queue depth. In that case the scheduling will not lose the penalty
> > adjustment and the load is going to be spread more fairly.
> > 
> > I'm not sure if postponing the selection is such a good idea. I think
> > it's better to add the request to the iosched queue anyway, even if
> > the disks' queues are filled, and to let the I/O scheduler handle it.
> > The length of the iosched queue (nr_requests, an attribute of the
> > iosched) is usually greater than the queue depth (an attribute of the
> > device), which means it's fine to schedule more requests for the
> > iosched to handle.
> > 
> > IMO btrfs should use the information given by iosched only for heuristic
> > mirror selection, rather than implement its own throttling logic.
> > 
> > Does it make sense to you?
> > 
> > Another idea could be an additional iteration bounded by nr_requests,
> > if all load values are greater than the queue depths, though that
> > might be overkill. I would prefer to stick to my first idea if
> > everyone agrees.
> 
> What if iosched could provide an estimate of request's latency? Then
> btrfs could always select the lowest. For reads from NVME/SSD I would
> normally expect something simple: speed_factor * (pending_bytes + req_bytes).
> For HDDs this could do more computation like looking into what is there
> in the queue already.
> 
> This would deviate from simple round-robin scheme, though.

I think that idea is close to what Anand Jain did in his latency
policy, though instead of trying to predict the latency of the currently
scheduled request, it uses stats about previous read operations:

https://patchwork.kernel.org/project/linux-btrfs/patch/63f6f00e2ecc741efd2200c3c87b5db52c6be2fd.164341.git.anand.j...@oracle.com/

Indeed your idea would deviate from the simple round-robin, so maybe it
would be good to try it, but as a separate policy in a separate
patch(set).
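
If someone wanted to prototype that, I imagine the estimator as
something like this (purely illustrative names and types; no such
iosched interface exists today):

    struct mirror_est {
            u64 speed_factor;   /* estimated ns per byte for the device */
            u64 pending_bytes;  /* bytes already queued on the device */
    };

    /* Michał's formula: speed_factor * (pending_bytes + req_bytes). */
    static u64 estimate_read_cost(const struct mirror_est *m, u64 req_bytes)
    {
            return m->speed_factor * (m->pending_bytes + req_bytes);
    }

    /* Select the mirror with the lowest estimated latency. */
    static int select_lowest_cost(const struct mirror_est *mirrors,
                                  int num, u64 req_bytes)
    {
            u64 best = U64_MAX;
            int i, sel = 0;

            for (i = 0; i < num; i++) {
                    u64 cost = estimate_read_cost(&mirrors[i], req_bytes);

                    if (cost < best) {
                            best = cost;
                            sel = i;
                    }
            }
            return sel;
    }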

Regards,
Michal


Re: [PATCH RFC 6/6] btrfs: Add roundrobin raid1 read policy

2021-02-10 Thread Michał Mirosław
On Wed, Feb 10, 2021 at 07:23:04PM +0000, Michal Rostecki wrote:
> On Wed, Feb 10, 2021 at 01:58:15PM +0100, Michał Mirosław wrote:
> > On Wed, Feb 10, 2021 at 12:29:25PM +0000, Michal Rostecki wrote:
> > > On Wed, Feb 10, 2021 at 05:24:28AM +0100, Michał Mirosław wrote:
> > > > This looks like it effectively decreases queue depth for non-last
> > > > device. After all devices are filled to queue_depth-penalty, only
> > > > a single mirror will be selected for next reads (until a read on
> > > > some other one completes).
> > > > 
> > > 
> > > Good point. And if all devices are going to be filled for a longer time,
> > > this function will keep selecting the last one. Maybe I should select
> > > last+1 in that case. Would that address your concern or did you have any
> > > other solution in mind?
> > 
> > The best would be to postpone the selection until one device becomes free
> > again. But if that's not doable, then yes, you could make sure it stays
> > round-robin after filling the queues (the scheduling will lose the
> > "penalty"-driven adjustment though).
> 
> Or another idea - when all the queues are filled, return the mirror
> which has the lowest load (inflight + penalty), even though it's greater
> than queue depth. In that case the scheduling will not lose the penalty
> adjustment and the load is going to be spread more fairly.
> 
> I'm not sure if postponing the selection is such a good idea. I think
> it's better to add the request to the iosched queue anyway, even if
> the disks' queues are filled, and to let the I/O scheduler handle it.
> The length of the iosched queue (nr_requests, an attribute of the
> iosched) is usually greater than the queue depth (an attribute of the
> device), which means it's fine to schedule more requests for the
> iosched to handle.
> 
> IMO btrfs should use the information given by iosched only for heuristic
> mirror selection, rather than implement its own throttling logic.
> 
> Does it make sense to you?
> 
> Another idea could be an additional iteration bounded by nr_requests,
> if all load values are greater than the queue depths, though that
> might be overkill. I would prefer to stick to my first idea if
> everyone agrees.

What if iosched could provide an estimate of request's latency? Then
btrfs could always select the lowest. For reads from NVME/SSD I would
normally expect something simple: speed_factor * (pending_bytes + req_bytes).
For HDDs this could do more computation like looking into what is there
in the queue already.

This would deviate from simple round-robin scheme, though.

Best Regards
Michał Mirosław


Re: [PATCH RFC 6/6] btrfs: Add roundrobin raid1 read policy

2021-02-10 Thread Michal Rostecki
On Wed, Feb 10, 2021 at 01:58:15PM +0100, Michał Mirosław wrote:
> On Wed, Feb 10, 2021 at 12:29:25PM +0000, Michal Rostecki wrote:
> > On Wed, Feb 10, 2021 at 05:24:28AM +0100, Michał Mirosław wrote:
> > > This looks like it effectively decreases queue depth for non-last
> > > device. After all devices are filled to queue_depth-penalty, only
> > > a single mirror will be selected for next reads (until a read on
> > > some other one completes).
> > > 
> > 
> > Good point. And if all devices are going to be filled for a longer time,
> > this function will keep selecting the last one. Maybe I should select
> > last+1 in that case. Would that address your concern or did you have any
> > other solution in mind?
> 
> The best would be to postpone the selection until one device becomes free
> again. But if that's not doable, then yes, you could make sure it stays
> round-robin after filling the queues (the scheduling will lose the
> "penalty"-driven adjustment though).

Or another idea - when all the queues are filled, return the mirror
which has the lowest load (inflight + penalty), even though it's greater
than queue depth. In that case the scheduling will not lose the penalty
adjustment and the load is going to be spread more fairly.

I'm not sure if postponing the selection is such a good idea. I think
it's better to add the request to the iosched queue anyway, even if
the disks' queues are filled, and to let the I/O scheduler handle it.
The length of the iosched queue (nr_requests, an attribute of the
iosched) is usually greater than the queue depth (an attribute of the
device), which means it's fine to schedule more requests for the
iosched to handle.

IMO btrfs should use the information given by iosched only for heuristic
mirror selection, rather than implement its own throttling logic.

Does it make sense to you?

Another idea could be an additional iteration bounded by nr_requests,
if all load values are greater than the queue depths, though that
might be overkill. I would prefer to stick to my first idea if
everyone agrees.
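
As a sketch of that first idea (illustrative, not the patch code):
once every mirror is at or over its queue depth, fall back to the
minimum load instead of blindly keeping the last mirror:

    /*
     * Fallback when all queues are filled: pick the mirror with the
     * lowest load (inflight + penalty), even though the load exceeds
     * the queue depth.
     */
    static int pick_min_load(const unsigned int *load, int num_mirrors)
    {
            int i, best = 0;

            for (i = 1; i < num_mirrors; i++) {
                    if (load[i] < load[best])
                            best = i;
            }
            return best;
    }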

Regards,
Michal


Re: [PATCH RFC 6/6] btrfs: Add roundrobin raid1 read policy

2021-02-10 Thread Michał Mirosław
On Wed, Feb 10, 2021 at 12:29:25PM +0000, Michal Rostecki wrote:
> On Wed, Feb 10, 2021 at 05:24:28AM +0100, Michał Mirosław wrote:
> > On Tue, Feb 09, 2021 at 09:30:40PM +0100, Michal Rostecki wrote:
> > [...]
> > > For the array with 3 HDDs, not adding any penalty resulted in 409MiB/s
> > > (429MB/s) performance. Adding the penalty value 1 resulted in a
> > > performance drop to 404MiB/s (424MB/s). Increasing the value towards 10
> > > was making the performance even worse.
> > > 
> > > For the array with 2 HDDs and 1 SSD, adding penalty value 1 to
> > > rotational disks resulted in the best performance - 541MiB/s (567MB/s).
> > > Not adding any value and increasing the value was making the performance
> > > worse.
> > > 
> > > Adding penalty value to non-rotational disks was always decreasing the
> > > performance, which motivated setting it as 0 by default. For the purpose
> > > of testing, it's still configurable.
> > [...]
> > > + bdev = map->stripes[mirror_index].dev->bdev;
> > > + inflight = mirror_load(fs_info, map, mirror_index, stripe_offset,
> > > +stripe_nr);
> > > + queue_depth = blk_queue_depth(bdev->bd_disk->queue);
> > > +
> > > + return inflight < queue_depth;
> > [...]
> > > + last_mirror = this_cpu_read(*fs_info->last_mirror);
> > [...]
> > > + for (i = last_mirror; i < first + num_stripes; i++) {
> > > + if (mirror_queue_not_filled(fs_info, map, i, stripe_offset,
> > > + stripe_nr)) {
> > > + preferred_mirror = i;
> > > + goto out;
> > > + }
> > > + }
> > > +
> > > + for (i = first; i < last_mirror; i++) {
> > > + if (mirror_queue_not_filled(fs_info, map, i, stripe_offset,
> > > + stripe_nr)) {
> > > + preferred_mirror = i;
> > > + goto out;
> > > + }
> > > + }
> > > +
> > > + preferred_mirror = last_mirror;
> > > +
> > > +out:
> > > + this_cpu_write(*fs_info->last_mirror, preferred_mirror);
> > 
> > This looks like it effectively decreases queue depth for non-last
> > device. After all devices are filled to queue_depth-penalty, only
> > a single mirror will be selected for next reads (until a read on
> > some other one completes).
> > 
> 
> Good point. And if all devices are going to be filled for a longer time,
> this function will keep selecting the last one. Maybe I should select
> last+1 in that case. Would that address your concern or did you have any
> other solution in mind?

The best would be to postpone the selection until one device becomes free
again. But if that's not doable, then yes, you could make sure it stays
round-robin after filling the queues (the scheduling will lose the
"penalty"-driven adjustment though).

Best Regards,
Michał Mirosław


Re: [PATCH RFC 6/6] btrfs: Add roundrobin raid1 read policy

2021-02-10 Thread Michal Rostecki
On Wed, Feb 10, 2021 at 05:24:28AM +0100, Michał Mirosław wrote:
> On Tue, Feb 09, 2021 at 09:30:40PM +0100, Michal Rostecki wrote:
> [...]
> > For the array with 3 HDDs, not adding any penalty resulted in 409MiB/s
> > (429MB/s) performance. Adding the penalty value 1 resulted in a
> > performance drop to 404MiB/s (424MB/s). Increasing the value towards 10
> > was making the performance even worse.
> > 
> > For the array with 2 HDDs and 1 SSD, adding penalty value 1 to
> > rotational disks resulted in the best performance - 541MiB/s (567MB/s).
> > Not adding any value and increasing the value was making the performance
> > worse.
> > 
> > Adding penalty value to non-rotational disks was always decreasing the
> > performance, which motivated setting it as 0 by default. For the purpose
> > of testing, it's still configurable.
> [...]
> > +   bdev = map->stripes[mirror_index].dev->bdev;
> > +   inflight = mirror_load(fs_info, map, mirror_index, stripe_offset,
> > +  stripe_nr);
> > +   queue_depth = blk_queue_depth(bdev->bd_disk->queue);
> > +
> > +   return inflight < queue_depth;
> [...]
> > +   last_mirror = this_cpu_read(*fs_info->last_mirror);
> [...]
> > +   for (i = last_mirror; i < first + num_stripes; i++) {
> > +   if (mirror_queue_not_filled(fs_info, map, i, stripe_offset,
> > +   stripe_nr)) {
> > +   preferred_mirror = i;
> > +   goto out;
> > +   }
> > +   }
> > +
> > +   for (i = first; i < last_mirror; i++) {
> > +   if (mirror_queue_not_filled(fs_info, map, i, stripe_offset,
> > +   stripe_nr)) {
> > +   preferred_mirror = i;
> > +   goto out;
> > +   }
> > +   }
> > +
> > +   preferred_mirror = last_mirror;
> > +
> > +out:
> > +   this_cpu_write(*fs_info->last_mirror, preferred_mirror);
> 
> This looks like it effectively decreases queue depth for non-last
> device. After all devices are filled to queue_depth-penalty, only
> a single mirror will be selected for next reads (until a read on
> some other one completes).
> 

Good point. And if all devices are going to be filled for a longer time,
this function will keep selecting the last one. Maybe I should select
last+1 in that case. Would that address your concern or did you have any
other solution in mind?

Thanks for pointing that out.

> Have you tried testing with many more jobs / non-sequential accesses?
> 

I didn't try with non-sequential accesses. Will do that before
respinning v2.

> Best Regards,
> Michał Mirosław

Regards,
Michal


Re: [PATCH RFC 6/6] btrfs: Add roundrobin raid1 read policy

2021-02-10 Thread Anand Jain

On 10/02/2021 04:30, Michal Rostecki wrote:

From: Michal Rostecki 

Add a new raid1 read policy `roundrobin`. For each read request, it
selects the mirror whose load is lower than its queue depth, starting
the iteration from the mirror last used by the current CPU. Load is
defined as the number of inflight requests + a potential penalty value.

The policy can be enabled through sysfs:

   # echo roundrobin > /sys/fs/btrfs/[fsid]/read_policies/policy

This policy was tested with fio and compared with the default `pid`
policy.

The singlethreaded test has the following parameters:

   [global]
   name=btrfs-raid1-seqread
   filename=btrfs-raid1-seqread
   rw=read
   bs=64k
   direct=0
   numjobs=1
   time_based=0

   [file1]
   size=10G
   ioengine=libaio

and shows the following results:

- raid1c3 with 3 HDDs:
   3 x Seagate Barracuda ST2000DM008 (2TB)
   * pid policy
 READ: bw=217MiB/s (228MB/s), 217MiB/s-217MiB/s (228MB/s-228MB/s),
 io=10.0GiB (10.7GB), run=47082-47082msec
   * roundrobin policy
 READ: bw=409MiB/s (429MB/s), 409MiB/s-409MiB/s (429MB/s-429MB/s),
 io=10.0GiB (10.7GB), run=25028-25028msec
- raid1c3 with 2 HDDs and 1 SSD:
   2 x Seagate Barracuda ST2000DM008 (2TB)
   1 x Crucial CT256M550SSD1 (256GB)
   * pid policy (the worst case when only HDDs were chosen)
 READ: bw=220MiB/s (231MB/s), 220MiB/s-220MiB/s (231MB/s-231MB/s),
 io=10.0GiB (10.7GB), run=46577-46577msec
   * pid policy (the best case when SSD was used as well)
 READ: bw=513MiB/s (538MB/s), 513MiB/s-513MiB/s (538MB/s-538MB/s),
 io=10.0GiB (10.7GB), run=19954-19954msec
   * roundrobin (there are no noticeable differences when testing multiple
 times)
 READ: bw=541MiB/s (567MB/s), 541MiB/s-541MiB/s (567MB/s-567MB/s),
 io=10.0GiB (10.7GB), run=18933-18933msec

The multithreaded test has the following parameters:

   [global]
   name=btrfs-raid1-seqread
   filename=btrfs-raid1-seqread
   rw=read
   bs=64k
   direct=0
   numjobs=8
   time_based=0

   [file1]
   size=10G
   ioengine=libaio

and shows the following results:

- raid1c3 with 3 HDDs:
   3 x Seagate Barracuda ST2000DM008 (2TB)
   * pid policy
 READ: bw=1569MiB/s (1645MB/s), 196MiB/s-196MiB/s (206MB/s-206MB/s),
 io=80.0GiB (85.9GB), run=52210-52211msec
   * roundrobin
 READ: bw=1733MiB/s (1817MB/s), 217MiB/s-217MiB/s (227MB/s-227MB/s),
 io=80.0GiB (85.9GB), run=47269-47271msec
- raid1c3 with 2 HDDs and 1 SSD:
   2 x Seagate Barracuda ST2000DM008 (2TB)
   1 x Crucial CT256M550SSD1 (256GB)
   * pid policy
 READ: bw=1843MiB/s (1932MB/s), 230MiB/s-230MiB/s (242MB/s-242MB/s),
 io=80.0GiB (85.9GB), run=44449-44450msec
   * roundrobin
 READ: bw=2485MiB/s (2605MB/s), 311MiB/s-311MiB/s (326MB/s-326MB/s),
 io=80.0GiB (85.9GB), run=32969-32970msec






The penalty value is an additional value added to the number of inflight
requests when a scheduled request is non-local (which means it would
start from a different physical location than that of the last request
processed by the given device). By default, it's
applied only in filesystems which have mixed types of devices
(non-rotational and rotational), but it can be configured to be applied
without that condition.

The configuration is done through sysfs:

- /sys/fs/btrfs/[fsid]/read_policies/roundrobin_nonlocal_inc_mixed_only

where the value 1 (the default) means applying the penalty only in mixed
arrays, and 0 means applying it unconditionally.

The exact penalty value is defined separately for non-rotational and
rotational devices. By default, it's 0 for non-rotational devices and 1
for rotational devices. Both values are configurable through sysfs:

- /sys/fs/btrfs/[fsid]/read_policies/roundrobin_nonrot_nonlocal_inc
- /sys/fs/btrfs/[fsid]/read_policies/roundrobin_rot_nonlocal_inc

To sum it up - the default case is applying the penalty under the
following conditions:

- the raid1 array consists of mixed types of devices
- the scheduled request is going to be non-local for the given disk
- the device is rotational

That default case is based on a slight preference towards non-rotational
disks in mixed arrays and has proven to give the best performance in
tested arrays.

For the array with 3 HDDs, not adding any penalty resulted in 409MiB/s
(429MB/s) performance. Adding the penalty value 1 resulted in a
performance drop to 404MiB/s (424MB/s). Increasing the value towards 10
was making the performance even worse.

For the array with 2 HDDs and 1 SSD, adding penalty value 1 to
rotational disks resulted in the best performance - 541MiB/s (567MB/s).
Not adding any value and increasing the value was making the performance
worse.

Adding penalty value to non-rotational disks was always decreasing the
performance, which motivated setting it as 0 by default. For the purpose
of testing, it's still configurable.

To measure the performance of each policy and find optimal penalty
values, I created scripts which are available here:

https://gitlab.com/vadorovsky/btrfs-perf
https://github.com/mrostecki/btrfs-perf

Re: [PATCH RFC 6/6] btrfs: Add roundrobin raid1 read policy

2021-02-09 Thread Michał Mirosław
On Tue, Feb 09, 2021 at 09:30:40PM +0100, Michal Rostecki wrote:
[...]
> For the array with 3 HDDs, not adding any penalty resulted in 409MiB/s
> (429MB/s) performance. Adding the penalty value 1 resulted in a
> performance drop to 404MiB/s (424MB/s). Increasing the value towards 10
> was making the performance even worse.
> 
> For the array with 2 HDDs and 1 SSD, adding penalty value 1 to
> rotational disks resulted in the best performance - 541MiB/s (567MB/s).
> Not adding any value and increasing the value was making the performance
> worse.
> 
> Adding penalty value to non-rotational disks was always decreasing the
> performance, which motivated setting it as 0 by default. For the purpose
> of testing, it's still configurable.
[...]
> + bdev = map->stripes[mirror_index].dev->bdev;
> + inflight = mirror_load(fs_info, map, mirror_index, stripe_offset,
> +stripe_nr);
> + queue_depth = blk_queue_depth(bdev->bd_disk->queue);
> +
> + return inflight < queue_depth;
[...]
> + last_mirror = this_cpu_read(*fs_info->last_mirror);
[...]
> + for (i = last_mirror; i < first + num_stripes; i++) {
> + if (mirror_queue_not_filled(fs_info, map, i, stripe_offset,
> + stripe_nr)) {
> + preferred_mirror = i;
> + goto out;
> + }
> + }
> +
> + for (i = first; i < last_mirror; i++) {
> + if (mirror_queue_not_filled(fs_info, map, i, stripe_offset,
> + stripe_nr)) {
> + preferred_mirror = i;
> + goto out;
> + }
> + }
> +
> + preferred_mirror = last_mirror;
> +
> +out:
> + this_cpu_write(*fs_info->last_mirror, preferred_mirror);

This looks like it effectively decreases queue depth for non-last
device. After all devices are filled to queue_depth-penalty, only
a single mirror will be selected for next reads (until a read on
some other one completes).

Have you tried testing with many more jobs / non-sequential accesses?

Best Regards,
Michał Mirosław


[PATCH RFC 6/6] btrfs: Add roundrobin raid1 read policy

2021-02-09 Thread Michal Rostecki
From: Michal Rostecki 

Add a new raid1 read policy `roundrobin`. For each read request, it
selects the mirror whose load is lower than its queue depth, starting
the iteration from the mirror last used by the current CPU. Load is
defined as the number of inflight requests + a potential penalty value.

The policy can be enabled through sysfs:

  # echo roundrobin > /sys/fs/btrfs/[fsid]/read_policies/policy

This policy was tested with fio and compared with the default `pid`
policy.

The singlethreaded test has the following parameters:

  [global]
  name=btrfs-raid1-seqread
  filename=btrfs-raid1-seqread
  rw=read
  bs=64k
  direct=0
  numjobs=1
  time_based=0

  [file1]
  size=10G
  ioengine=libaio

and shows the following results:

- raid1c3 with 3 HDDs:
  3 x Seagate Barracuda ST2000DM008 (2TB)
  * pid policy
READ: bw=217MiB/s (228MB/s), 217MiB/s-217MiB/s (228MB/s-228MB/s),
io=10.0GiB (10.7GB), run=47082-47082msec
  * roundrobin policy
READ: bw=409MiB/s (429MB/s), 409MiB/s-409MiB/s (429MB/s-429MB/s),
io=10.0GiB (10.7GB), run=25028-25028msec
- raid1c3 with 2 HDDs and 1 SSD:
  2 x Seagate Barracuda ST2000DM008 (2TB)
  1 x Crucial CT256M550SSD1 (256GB)
  * pid policy (the worst case when only HDDs were chosen)
READ: bw=220MiB/s (231MB/s), 220MiB/s-220MiB/s (231MB/s-231MB/s),
io=10.0GiB (10.7GB), run=46577-46577msec
  * pid policy (the best case when SSD was used as well)
READ: bw=513MiB/s (538MB/s), 513MiB/s-513MiB/s (538MB/s-538MB/s),
io=10.0GiB (10.7GB), run=19954-19954msec
  * roundrobin (there are no noticeable differences when testing multiple
times)
READ: bw=541MiB/s (567MB/s), 541MiB/s-541MiB/s (567MB/s-567MB/s),
io=10.0GiB (10.7GB), run=18933-18933msec

The multithreaded test has the following parameters:

  [global]
  name=btrfs-raid1-seqread
  filename=btrfs-raid1-seqread
  rw=read
  bs=64k
  direct=0
  numjobs=8
  time_based=0

  [file1]
  size=10G
  ioengine=libaio

and shows the following results:

- raid1c3 with 3 HDDs:
  3 x Seagate Barracuda ST2000DM008 (2TB)
  * pid policy
READ: bw=1569MiB/s (1645MB/s), 196MiB/s-196MiB/s (206MB/s-206MB/s),
io=80.0GiB (85.9GB), run=52210-52211msec
  * roundrobin
READ: bw=1733MiB/s (1817MB/s), 217MiB/s-217MiB/s (227MB/s-227MB/s),
io=80.0GiB (85.9GB), run=47269-47271msec
- raid1c3 with 2 HDDs and 1 SSD:
  2 x Seagate Barracuda ST2000DM008 (2TB)
  1 x Crucial CT256M550SSD1 (256GB)
  * pid policy
READ: bw=1843MiB/s (1932MB/s), 230MiB/s-230MiB/s (242MB/s-242MB/s),
io=80.0GiB (85.9GB), run=44449-44450msec
  * roundrobin
READ: bw=2485MiB/s (2605MB/s), 311MiB/s-311MiB/s (326MB/s-326MB/s),
io=80.0GiB (85.9GB), run=32969-32970msec

The penalty value is an additional value added to the number of inflight
requests when a scheduled request is non-local (which means it would
start from a different physical location than that of the last request
processed by the given device). By default, it's
applied only in filesystems which have mixed types of devices
(non-rotational and rotational), but it can be configured to be applied
without that condition.

The configuration is done through sysfs:

- /sys/fs/btrfs/[fsid]/read_policies/roundrobin_nonlocal_inc_mixed_only

where the value 1 (the default) means applying the penalty only in mixed
arrays, and 0 means applying it unconditionally.

The exact penalty value is defined separately for non-rotational and
rotational devices. By default, it's 0 for non-rotational devices and 1
for rotational devices. Both values are configurable through sysfs:

- /sys/fs/btrfs/[fsid]/read_policies/roundrobin_nonrot_nonlocal_inc
- /sys/fs/btrfs/[fsid]/read_policies/roundrobin_rot_nonlocal_inc

To sum it up - the default case is applying the penalty under the
following conditions:

- the raid1 array consists of mixed types of devices
- the scheduled request is going to be non-local for the given disk
- the device is rotational

That default case is based on a slight preference towards non-rotational
disks in mixed arrays and has proven to give the best performance in
tested arrays.
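
Expressed as code, the load metric described above boils down to
something like this (a simplified sketch, not the patch's exact code;
the parameters mirror the sysfs knobs):

    static unsigned int mirror_load(unsigned int inflight, bool nonlocal,
                                    bool rotational, bool mixed_array,
                                    bool mixed_only, unsigned int rot_inc,
                                    unsigned int nonrot_inc)
    {
            unsigned int load = inflight;

            /*
             * The penalty applies only to non-local requests and, with
             * the mixed_only default, only when the array mixes
             * rotational and non-rotational devices.
             */
            if (nonlocal && (mixed_array || !mixed_only))
                    load += rotational ? rot_inc : nonrot_inc;

            return load;
    }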

For the array with 3 HDDs, not adding any penalty resulted in 409MiB/s
(429MB/s) performance. Adding the penalty value 1 resulted in a
performance drop to 404MiB/s (424MB/s). Increasing the value towards 10
was making the performance even worse.

For the array with 2 HDDs and 1 SSD, adding penalty value 1 to
rotational disks resulted in the best performance - 541MiB/s (567MB/s).
Not adding any value and increasing the value was making the performance
worse.

Adding penalty value to non-rotational disks was always decreasing the
performance, which motivated setting it as 0 by default. For the purpose
of testing, it's still configurable.

To measure the performance of each policy and find optimal penalty
values, I created scripts which are available here:

https://gitlab.com/vadorovsky/btrfs-perf
https://github.com/mrostecki/btrfs-perf

Signed-off-by: Michal