Re: [RFC PATCH] blk-throttle: add burst allowance.

2017-12-18 Thread Khazhismel Kumykov
On Mon, Dec 18, 2017 at 1:01 PM, Vivek Goyal  wrote:
> On Mon, Dec 18, 2017 at 12:39:50PM -0800, Khazhismel Kumykov wrote:
>> On Mon, Dec 18, 2017 at 10:29 AM, Vivek Goyal  wrote:
>> > On Mon, Dec 18, 2017 at 10:16:02AM -0800, Khazhismel Kumykov wrote:
>> >> On Mon, Nov 20, 2017 at 8:36 PM, Khazhismel Kumykov  
>> >> wrote:
>> >> > On Fri, Nov 17, 2017 at 11:26 AM, Shaohua Li  wrote:
>> >> >> On Thu, Nov 16, 2017 at 08:25:58PM -0800, Khazhismel Kumykov wrote:
>> >> >>> On Thu, Nov 16, 2017 at 8:50 AM, Shaohua Li  wrote:
>> >> >>> > On Tue, Nov 14, 2017 at 03:10:22PM -0800, Khazhismel Kumykov wrote:
>> >> >>> >> Allows configuration additional bytes or ios before a throttle is
>> >> >>> >> triggered.
>> >> >>> >>
>> >> >>> >> This allows implementation of a bucket style rate-limit/throttle 
>> >> >>> >> on a
>> >> >>> >> block device. Previously, bursting to a device was limited to 
>> >> >>> >> allowance
>> >> >>> >> granted in a single throtl_slice (similar to a bucket with limit N 
>> >> >>> >> and
>> >> >>> >> refill rate N/slice).
>> >> >>> >>
>> >> >>> >> Additional parameters bytes/io_burst_conf defined for tg, which 
>> >> >>> >> define a
>> >> >>> >> number of bytes/ios that must be depleted before throttling 
>> >> >>> >> happens. A
>> >> >>> >> tg that does not deplete this allowance functions as though it has 
>> >> >>> >> no
>> >> >>> >> configured limits. tgs earn additional allowance at rate defined by
>> >> >>> >> bps/iops for the tg. Once a tg has *_disp > *_burst_conf, 
>> >> >>> >> throttling
>> >> >>> >> kicks in. If a tg is idle for a while, it will again have some 
>> >> >>> >> burst
>> >> >>> >> allowance before it gets throttled again.
>> >> >>> >>
>> >> >>> >> slice_end for a tg is extended until io_disp/byte_disp would fall 
>> >> >>> >> to 0,
>> >> >>> >> when all "used" burst allowance would be earned back. trim_slice 
>> >> >>> >> still
>> >> >>> >> does progress slice_start as before and decrements *_disp as 
>> >> >>> >> before, and
>> >> >>> >> tgs continue to get bytes/ios in throtl_slice intervals.
>> >> >>> >
>> >> >>> > Can you describe why we need this? It would be great if you can 
>> >> >>> > describe the
>> >> >>> > usage model and an example. Does this work for io.low/io.max or 
>> >> >>> > both?
>> >> >>> >
>> >> >>> > Thanks,
>> >> >>> > Shaohua
>> >> >>> >
>> >> >>>
>> >> >>> Use case that brought this up was configuring limits for a remote
>> >> >>> shared device. Bursting beyond io.max is desired but only for so much
>> >> >>> before the limit kicks in, afterwards with sustained usage throughput
>> >> >>> is capped. (This proactively avoids remote-side limits). In that case
>> >> >>> one would configure in a root container io.max + io.burst, and
>> >> >>> configure low/other limits on descendants sharing the resource on the
>> >> >>> same node.
>> >> >>>
>> >> >>> With this patch, so long as tg has not dispatched more than the burst,
>> >> >>> no limit is applied at all by that tg, including limit imposed by
>> >> >>> io.low in tg_iops_limit, etc.
>> >> >>
>> >> >> I'd appreciate if you can give more details about the 'why'. 
>> >> >> 'configuring
>> >> >> limits for a remote shared device' doesn't justify the change.
>> >> >
>> >> > This is to configure a bursty workload (and associated device) with
>> >> > known/allowed expected burst size, but to not allow full utilization
>> >> > of the device for extended periods of time for QoS. During idle or low
>> >> > use periods the burst allowance accrues, and then tasks can burst well
>> >> > beyond the configured throttle up to the limit, afterwards is
>> >> > throttled. A constant throttle speed isn't sufficient for this as you
>> >> > can only burst 1 slice worth, but a limit of sorts is desirable for
>> >> > preventing over utilization of the shared device. This type of limit
>> >> > is also slightly different than what i understand io.low does in local
>> >> > cases in that tg is only high priority/unthrottled if it is bursty,
>> >> > and is limited with constant usage
>> >> >
>> >> > Khazhy
>> >>
>> >> Hi Shaohua,
>> >>
>> >> Does this clarify the reason for this patch? Is this (or something
>> >> similar) a good fit for inclusion in blk-throttle?
>> >>
>> >
>> > So does this brust have to be per cgroup. I mean if thortl_slice was
>> > configurable, that will allow to control the size of burst. (Just that
>> > it will be for all cgroups). If that works, that might be a simpler
>> > solution.
>> >
>> > Vivek
>>
>> The purpose for this configuration vs. increasing throtl_slice is the
>> behavior when the burst runs out. io/bytes allowance is given in
>> intervals of throtl_slice, so for long throtl_slice for those devices
>> that exceed the limit will see extended periods with no IO, rather
>> than at throttled speed. With this once burst is run out, since the
>> burst allowance is on top of the throttle, the device can 

Re: Regression with a0747a859ef6 ("bdi: add error handle for bdi_debug_register")

2017-12-18 Thread Bruno Wolff III

On Sun, Dec 17, 2017 at 21:43:50 +0800,
 weiping zhang  wrote:

Hi, thanks for testing. I think you first reproduced this issue (got a WARNING
at device_add_disk) with your own build, then added my debug patch.


I'm going to try testing warnings with a kernel I've built, to try to 
determine if warnings are working at all for the ones I'm building. However 
it might be that the WARN_ONs are not being reached for the kernels I've 
built. If that turns out to be the case, I may not be able to get you both 
the output from the WARN_ONs and the output from your debugging patch at 
the same time.

My next kernel build isn't going to finish in time to test today.


Re: [RFC PATCH] blk-throttle: add burst allowance.

2017-12-18 Thread Vivek Goyal
On Mon, Dec 18, 2017 at 12:39:50PM -0800, Khazhismel Kumykov wrote:
> On Mon, Dec 18, 2017 at 10:29 AM, Vivek Goyal  wrote:
> > On Mon, Dec 18, 2017 at 10:16:02AM -0800, Khazhismel Kumykov wrote:
> >> On Mon, Nov 20, 2017 at 8:36 PM, Khazhismel Kumykov  
> >> wrote:
> >> > On Fri, Nov 17, 2017 at 11:26 AM, Shaohua Li  wrote:
> >> >> On Thu, Nov 16, 2017 at 08:25:58PM -0800, Khazhismel Kumykov wrote:
> >> >>> On Thu, Nov 16, 2017 at 8:50 AM, Shaohua Li  wrote:
> >> >>> > On Tue, Nov 14, 2017 at 03:10:22PM -0800, Khazhismel Kumykov wrote:
> >> >>> >> Allows configuration additional bytes or ios before a throttle is
> >> >>> >> triggered.
> >> >>> >>
> >> >>> >> This allows implementation of a bucket style rate-limit/throttle on 
> >> >>> >> a
> >> >>> >> block device. Previously, bursting to a device was limited to 
> >> >>> >> allowance
> >> >>> >> granted in a single throtl_slice (similar to a bucket with limit N 
> >> >>> >> and
> >> >>> >> refill rate N/slice).
> >> >>> >>
> >> >>> >> Additional parameters bytes/io_burst_conf defined for tg, which 
> >> >>> >> define a
> >> >>> >> number of bytes/ios that must be depleted before throttling 
> >> >>> >> happens. A
> >> >>> >> tg that does not deplete this allowance functions as though it has 
> >> >>> >> no
> >> >>> >> configured limits. tgs earn additional allowance at rate defined by
> >> >>> >> bps/iops for the tg. Once a tg has *_disp > *_burst_conf, throttling
> >> >>> >> kicks in. If a tg is idle for a while, it will again have some burst
> >> >>> >> allowance before it gets throttled again.
> >> >>> >>
> >> >>> >> slice_end for a tg is extended until io_disp/byte_disp would fall 
> >> >>> >> to 0,
> >> >>> >> when all "used" burst allowance would be earned back. trim_slice 
> >> >>> >> still
> >> >>> >> does progress slice_start as before and decrements *_disp as 
> >> >>> >> before, and
> >> >>> >> tgs continue to get bytes/ios in throtl_slice intervals.
> >> >>> >
> >> >>> > Can you describe why we need this? It would be great if you can 
> >> >>> > describe the
> >> >>> > usage model and an example. Does this work for io.low/io.max or both?
> >> >>> >
> >> >>> > Thanks,
> >> >>> > Shaohua
> >> >>> >
> >> >>>
> >> >>> Use case that brought this up was configuring limits for a remote
> >> >>> shared device. Bursting beyond io.max is desired but only for so much
> >> >>> before the limit kicks in, afterwards with sustained usage throughput
> >> >>> is capped. (This proactively avoids remote-side limits). In that case
> >> >>> one would configure in a root container io.max + io.burst, and
> >> >>> configure low/other limits on descendants sharing the resource on the
> >> >>> same node.
> >> >>>
> >> >>> With this patch, so long as tg has not dispatched more than the burst,
> >> >>> no limit is applied at all by that tg, including limit imposed by
> >> >>> io.low in tg_iops_limit, etc.
> >> >>
> >> >> I'd appreciate if you can give more details about the 'why'. 
> >> >> 'configuring
> >> >> limits for a remote shared device' doesn't justify the change.
> >> >
> >> > This is to configure a bursty workload (and associated device) with
> >> > known/allowed expected burst size, but to not allow full utilization
> >> > of the device for extended periods of time for QoS. During idle or low
> >> > use periods the burst allowance accrues, and then tasks can burst well
> >> > beyond the configured throttle up to the limit, afterwards is
> >> > throttled. A constant throttle speed isn't sufficient for this as you
> >> > can only burst 1 slice worth, but a limit of sorts is desirable for
> >> > preventing over utilization of the shared device. This type of limit
> >> > is also slightly different than what i understand io.low does in local
> >> > cases in that tg is only high priority/unthrottled if it is bursty,
> >> > and is limited with constant usage
> >> >
> >> > Khazhy
> >>
> >> Hi Shaohua,
> >>
> >> Does this clarify the reason for this patch? Is this (or something
> >> similar) a good fit for inclusion in blk-throttle?
> >>
> >
> > So does this brust have to be per cgroup. I mean if thortl_slice was
> > configurable, that will allow to control the size of burst. (Just that
> > it will be for all cgroups). If that works, that might be a simpler
> > solution.
> >
> > Vivek
> 
> The purpose for this configuration vs. increasing throtl_slice is the
> behavior when the burst runs out. io/bytes allowance is given in
> intervals of throtl_slice, so for long throtl_slice for those devices
> that exceed the limit will see extended periods with no IO, rather
> than at throttled speed. With this once burst is run out, since the
> burst allowance is on top of the throttle, the device can continue to
> be used more smoothly at the configured throttled speed.

I thought the whole idea of burst is that there is some bursty IO which
will quickly finish. If the workload expects a steady-state IO rate, 

Re: [RFC PATCH] blk-throttle: add burst allowance.

2017-12-18 Thread Khazhismel Kumykov
On Mon, Dec 18, 2017 at 10:29 AM, Vivek Goyal  wrote:
> On Mon, Dec 18, 2017 at 10:16:02AM -0800, Khazhismel Kumykov wrote:
>> On Mon, Nov 20, 2017 at 8:36 PM, Khazhismel Kumykov  
>> wrote:
>> > On Fri, Nov 17, 2017 at 11:26 AM, Shaohua Li  wrote:
>> >> On Thu, Nov 16, 2017 at 08:25:58PM -0800, Khazhismel Kumykov wrote:
>> >>> On Thu, Nov 16, 2017 at 8:50 AM, Shaohua Li  wrote:
>> >>> > On Tue, Nov 14, 2017 at 03:10:22PM -0800, Khazhismel Kumykov wrote:
>> >>> >> Allows configuration additional bytes or ios before a throttle is
>> >>> >> triggered.
>> >>> >>
>> >>> >> This allows implementation of a bucket style rate-limit/throttle on a
>> >>> >> block device. Previously, bursting to a device was limited to 
>> >>> >> allowance
>> >>> >> granted in a single throtl_slice (similar to a bucket with limit N and
>> >>> >> refill rate N/slice).
>> >>> >>
>> >>> >> Additional parameters bytes/io_burst_conf defined for tg, which 
>> >>> >> define a
>> >>> >> number of bytes/ios that must be depleted before throttling happens. A
>> >>> >> tg that does not deplete this allowance functions as though it has no
>> >>> >> configured limits. tgs earn additional allowance at rate defined by
>> >>> >> bps/iops for the tg. Once a tg has *_disp > *_burst_conf, throttling
>> >>> >> kicks in. If a tg is idle for a while, it will again have some burst
>> >>> >> allowance before it gets throttled again.
>> >>> >>
>> >>> >> slice_end for a tg is extended until io_disp/byte_disp would fall to 
>> >>> >> 0,
>> >>> >> when all "used" burst allowance would be earned back. trim_slice still
>> >>> >> does progress slice_start as before and decrements *_disp as before, 
>> >>> >> and
>> >>> >> tgs continue to get bytes/ios in throtl_slice intervals.
>> >>> >
>> >>> > Can you describe why we need this? It would be great if you can 
>> >>> > describe the
>> >>> > usage model and an example. Does this work for io.low/io.max or both?
>> >>> >
>> >>> > Thanks,
>> >>> > Shaohua
>> >>> >
>> >>>
>> >>> Use case that brought this up was configuring limits for a remote
>> >>> shared device. Bursting beyond io.max is desired but only for so much
>> >>> before the limit kicks in, afterwards with sustained usage throughput
>> >>> is capped. (This proactively avoids remote-side limits). In that case
>> >>> one would configure in a root container io.max + io.burst, and
>> >>> configure low/other limits on descendants sharing the resource on the
>> >>> same node.
>> >>>
>> >>> With this patch, so long as tg has not dispatched more than the burst,
>> >>> no limit is applied at all by that tg, including limit imposed by
>> >>> io.low in tg_iops_limit, etc.
>> >>
>> >> I'd appreciate if you can give more details about the 'why'. 'configuring
>> >> limits for a remote shared device' doesn't justify the change.
>> >
>> > This is to configure a bursty workload (and associated device) with
>> > known/allowed expected burst size, but to not allow full utilization
>> > of the device for extended periods of time for QoS. During idle or low
>> > use periods the burst allowance accrues, and then tasks can burst well
>> > beyond the configured throttle up to the limit, afterwards is
>> > throttled. A constant throttle speed isn't sufficient for this as you
>> > can only burst 1 slice worth, but a limit of sorts is desirable for
>> > preventing over utilization of the shared device. This type of limit
>> > is also slightly different than what i understand io.low does in local
>> > cases in that tg is only high priority/unthrottled if it is bursty,
>> > and is limited with constant usage
>> >
>> > Khazhy
>>
>> Hi Shaohua,
>>
>> Does this clarify the reason for this patch? Is this (or something
>> similar) a good fit for inclusion in blk-throttle?
>>
>
> So does this brust have to be per cgroup. I mean if thortl_slice was
> configurable, that will allow to control the size of burst. (Just that
> it will be for all cgroups). If that works, that might be a simpler
> solution.
>
> Vivek

The purpose of this configuration vs. increasing throtl_slice is the
behavior when the burst runs out. The io/bytes allowance is given in
intervals of throtl_slice, so with a long throtl_slice, devices that
exceed the limit will see extended periods with no IO rather than IO at
the throttled speed. With this, once the burst is used up, since the
burst allowance is on top of the throttle, the device can continue to
be used more smoothly at the configured throttled speed. For this we
do want a throttle group with both the "steady state" rate and the burst
amount, and we get cgroup support with that.
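
(For illustration only, not part of the patch: a minimal sketch of the
bucket-style check described above, with made-up names; the real patch
tracks this via the tg's *_disp/*_burst_conf fields and the throtl_slice
bookkeeping.)

	/* hypothetical helper, for illustration only */
	struct burst_bucket {
		u64 disp;	/* bytes dispatched, not yet earned back */
		u64 burst_conf;	/* configured burst allowance in bytes */
		u64 bps;	/* steady-state rate in bytes per second */
		u64 last_ns;	/* last time allowance was credited */
	};

	static bool over_burst(struct burst_bucket *b, u64 bytes, u64 now_ns)
	{
		u64 earned = (now_ns - b->last_ns) * b->bps / NSEC_PER_SEC;

		/* idle or underused periods earn the allowance back */
		b->disp -= min(b->disp, earned);
		b->last_ns = now_ns;
		b->disp += bytes;

		/* no throttling at all until the burst allowance is depleted */
		return b->disp > b->burst_conf;
	}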

I notice that with cgroupv2 io it no longer seems possible to configure a
device-wide throttle group, e.g. on the root cgroup (and putting
restrictions on the root cgroup isn't an option). For something like this
it does make sense to want to configure just for the device, vs. per
cgroup; perhaps there is somewhere better it would fit than as 

Re: [PATCH 0/2] block: fix two regressiones on bounce

2017-12-18 Thread Jens Axboe
On 12/18/2017 12:40 AM, Ming Lei wrote:
> Hi Jens,
> 
> The 1st patch makes sure that passthrough IO won't enter .make_request_fn.
> 
> The 2nd one fixes blk_rq_append_bio(), which is from your post with
> small change on failure handling.

Thanks Ming, queued up.

-- 
Jens Axboe



Re: block: oopses on 4.13.*, 4.14.* and 4.15-rc2 (bisected)

2017-12-18 Thread Michele Ballabio
On Mon, 18 Dec 2017 15:46:36 +0800
Ming Lei  wrote:

> On Sat, Dec 9, 2017 at 7:27 AM, Michele Ballabio
>  wrote:
> > On Fri, 8 Dec 2017 13:08:37 -0700
> > Jens Axboe  wrote:
> >  
> >> On 12/08/2017 08:38 AM, Michele Ballabio wrote:  
> >> > Hi,
> >> > kernels 4.13.*, 4.14.* 4.15-rc2 crash on occasion,
> >> > especially on x86-32 systems. To trigger the problem, run as
> >> > root:
> >> >
> >> > while true
> >> > do
> >> > /sbin/udevadm trigger --type=subsystems --action=change
> >> > /sbin/udevadm trigger --type=devices --action=change
> >> > /sbin/udevadm settle --timeout=120
> >> > done
> >> >
> >> > (Thanks to Patrick Volkerding for the reproducer).
> >> >
> >> > Sometimes the kernel oopses immediately, sometimes a bit later
> >> > (less than five minutes).
> >> >
> >> > The bisection pointed to commit
> >> > caa4b02476e31fc7933d2138062f7f355d3cd8f7 (blk-map: call
> >> > blk_queue_bounce from blk_rq_append_bio). A revert fixes the
> >> > problem (tested on 4.13 and master).  
> >>
> >> Thanks for your report - can you try the below patch? Totally
> >> untested...  
> >
> > I applied the patch on master
> > (968edbd93c0cbb40ab48aca972392d377713a0c3), I tried two times to
> > boot the system but couldn't get to the shell. I found this in the
> > log:  
> 
> Hi Michele,
> 
> Please test the patches I sent out and see if it fixes your issue. In
> my environment
> the two just works fine.
> 
> https://marc.info/?l=linux-block&m=151358285916762&w=2
> 

I can confirm these fix the issue on my system (tested on top of
4.15-rc3), thanks!

Tested-by: Michele Ballabio 



Re: [RFC PATCH] blk-throttle: add burst allowance.

2017-12-18 Thread Vivek Goyal
On Mon, Dec 18, 2017 at 10:16:02AM -0800, Khazhismel Kumykov wrote:
> On Mon, Nov 20, 2017 at 8:36 PM, Khazhismel Kumykov  wrote:
> > On Fri, Nov 17, 2017 at 11:26 AM, Shaohua Li  wrote:
> >> On Thu, Nov 16, 2017 at 08:25:58PM -0800, Khazhismel Kumykov wrote:
> >>> On Thu, Nov 16, 2017 at 8:50 AM, Shaohua Li  wrote:
> >>> > On Tue, Nov 14, 2017 at 03:10:22PM -0800, Khazhismel Kumykov wrote:
> >>> >> Allows configuration additional bytes or ios before a throttle is
> >>> >> triggered.
> >>> >>
> >>> >> This allows implementation of a bucket style rate-limit/throttle on a
> >>> >> block device. Previously, bursting to a device was limited to allowance
> >>> >> granted in a single throtl_slice (similar to a bucket with limit N and
> >>> >> refill rate N/slice).
> >>> >>
> >>> >> Additional parameters bytes/io_burst_conf defined for tg, which define 
> >>> >> a
> >>> >> number of bytes/ios that must be depleted before throttling happens. A
> >>> >> tg that does not deplete this allowance functions as though it has no
> >>> >> configured limits. tgs earn additional allowance at rate defined by
> >>> >> bps/iops for the tg. Once a tg has *_disp > *_burst_conf, throttling
> >>> >> kicks in. If a tg is idle for a while, it will again have some burst
> >>> >> allowance before it gets throttled again.
> >>> >>
> >>> >> slice_end for a tg is extended until io_disp/byte_disp would fall to 0,
> >>> >> when all "used" burst allowance would be earned back. trim_slice still
> >>> >> does progress slice_start as before and decrements *_disp as before, 
> >>> >> and
> >>> >> tgs continue to get bytes/ios in throtl_slice intervals.
> >>> >
> >>> > Can you describe why we need this? It would be great if you can 
> >>> > describe the
> >>> > usage model and an example. Does this work for io.low/io.max or both?
> >>> >
> >>> > Thanks,
> >>> > Shaohua
> >>> >
> >>>
> >>> Use case that brought this up was configuring limits for a remote
> >>> shared device. Bursting beyond io.max is desired but only for so much
> >>> before the limit kicks in, afterwards with sustained usage throughput
> >>> is capped. (This proactively avoids remote-side limits). In that case
> >>> one would configure in a root container io.max + io.burst, and
> >>> configure low/other limits on descendants sharing the resource on the
> >>> same node.
> >>>
> >>> With this patch, so long as tg has not dispatched more than the burst,
> >>> no limit is applied at all by that tg, including limit imposed by
> >>> io.low in tg_iops_limit, etc.
> >>
> >> I'd appreciate if you can give more details about the 'why'. 'configuring
> >> limits for a remote shared device' doesn't justify the change.
> >
> > This is to configure a bursty workload (and associated device) with
> > known/allowed expected burst size, but to not allow full utilization
> > of the device for extended periods of time for QoS. During idle or low
> > use periods the burst allowance accrues, and then tasks can burst well
> > beyond the configured throttle up to the limit, afterwards is
> > throttled. A constant throttle speed isn't sufficient for this as you
> > can only burst 1 slice worth, but a limit of sorts is desirable for
> > preventing over utilization of the shared device. This type of limit
> > is also slightly different than what i understand io.low does in local
> > cases in that tg is only high priority/unthrottled if it is bursty,
> > and is limited with constant usage
> >
> > Khazhy
> 
> Hi Shaohua,
> 
> Does this clarify the reason for this patch? Is this (or something
> similar) a good fit for inclusion in blk-throttle?
> 

So does this burst have to be per cgroup? I mean, if throtl_slice was
configurable, that would allow controlling the size of the burst (just that
it would be for all cgroups). If that works, that might be a simpler
solution.

Vivek


Re: [RFC PATCH] blk-throttle: add burst allowance.

2017-12-18 Thread Khazhismel Kumykov
On Mon, Nov 20, 2017 at 8:36 PM, Khazhismel Kumykov  wrote:
> On Fri, Nov 17, 2017 at 11:26 AM, Shaohua Li  wrote:
>> On Thu, Nov 16, 2017 at 08:25:58PM -0800, Khazhismel Kumykov wrote:
>>> On Thu, Nov 16, 2017 at 8:50 AM, Shaohua Li  wrote:
>>> > On Tue, Nov 14, 2017 at 03:10:22PM -0800, Khazhismel Kumykov wrote:
>>> >> Allows configuration additional bytes or ios before a throttle is
>>> >> triggered.
>>> >>
>>> >> This allows implementation of a bucket style rate-limit/throttle on a
>>> >> block device. Previously, bursting to a device was limited to allowance
>>> >> granted in a single throtl_slice (similar to a bucket with limit N and
>>> >> refill rate N/slice).
>>> >>
>>> >> Additional parameters bytes/io_burst_conf defined for tg, which define a
>>> >> number of bytes/ios that must be depleted before throttling happens. A
>>> >> tg that does not deplete this allowance functions as though it has no
>>> >> configured limits. tgs earn additional allowance at rate defined by
>>> >> bps/iops for the tg. Once a tg has *_disp > *_burst_conf, throttling
>>> >> kicks in. If a tg is idle for a while, it will again have some burst
>>> >> allowance before it gets throttled again.
>>> >>
>>> >> slice_end for a tg is extended until io_disp/byte_disp would fall to 0,
>>> >> when all "used" burst allowance would be earned back. trim_slice still
>>> >> does progress slice_start as before and decrements *_disp as before, and
>>> >> tgs continue to get bytes/ios in throtl_slice intervals.
>>> >
>>> > Can you describe why we need this? It would be great if you can describe 
>>> > the
>>> > usage model and an example. Does this work for io.low/io.max or both?
>>> >
>>> > Thanks,
>>> > Shaohua
>>> >
>>>
>>> Use case that brought this up was configuring limits for a remote
>>> shared device. Bursting beyond io.max is desired but only for so much
>>> before the limit kicks in, afterwards with sustained usage throughput
>>> is capped. (This proactively avoids remote-side limits). In that case
>>> one would configure in a root container io.max + io.burst, and
>>> configure low/other limits on descendants sharing the resource on the
>>> same node.
>>>
>>> With this patch, so long as tg has not dispatched more than the burst,
>>> no limit is applied at all by that tg, including limit imposed by
>>> io.low in tg_iops_limit, etc.
>>
>> I'd appreciate if you can give more details about the 'why'. 'configuring
>> limits for a remote shared device' doesn't justify the change.
>
> This is to configure a bursty workload (and associated device) with
> known/allowed expected burst size, but to not allow full utilization
> of the device for extended periods of time for QoS. During idle or low
> use periods the burst allowance accrues, and then tasks can burst well
> beyond the configured throttle up to the limit, afterwards is
> throttled. A constant throttle speed isn't sufficient for this as you
> can only burst 1 slice worth, but a limit of sorts is desirable for
> preventing over utilization of the shared device. This type of limit
> is also slightly different than what i understand io.low does in local
> cases in that tg is only high priority/unthrottled if it is bursty,
> and is limited with constant usage
>
> Khazhy

Hi Shaohua,

Does this clarify the reason for this patch? Is this (or something
similar) a good fit for inclusion in blk-throttle?

Thanks,
Khazhy




[PATCH v2] delayacct: Account blkio completion on the correct task

2017-12-18 Thread Josh Snyder
Before commit e33a9bba85a8 ("sched/core: move IO scheduling accounting from
io_schedule_timeout() into scheduler"), delayacct_blkio_end was called after
context-switching into the task which completed I/O. This resulted in double
counting: the task would account a delay both waiting for I/O and for time
spent in the runqueue.

With e33a9bba85a8, delayacct_blkio_end is called by try_to_wake_up. In
ttwu, we have not yet context-switched. This is more correct, in that the
delay accounting ends when the I/O is complete. But delayacct_blkio_end
relies upon `get_current()`, and we have not yet context-switched into the
task whose I/O completed. This results in the wrong task having its delay
accounting statistics updated.

Instead of doing that, pass the task_struct being woken to
delayacct_blkio_end, so that it can update the statistics of the correct
task.

Fixes: e33a9bba85a8 ("sched/core: move IO scheduling accounting from 
io_schedule_timeout() into scheduler")
Signed-off-by: Josh Snyder 
---
 include/linux/delayacct.h |  8 
 kernel/delayacct.c| 42 ++
 kernel/sched/core.c   |  6 +++---
 3 files changed, 33 insertions(+), 23 deletions(-)

diff --git a/include/linux/delayacct.h b/include/linux/delayacct.h
index 4178d24..f2ad868 100644
--- a/include/linux/delayacct.h
+++ b/include/linux/delayacct.h
@@ -71,7 +71,7 @@ extern void delayacct_init(void);
 extern void __delayacct_tsk_init(struct task_struct *);
 extern void __delayacct_tsk_exit(struct task_struct *);
 extern void __delayacct_blkio_start(void);
-extern void __delayacct_blkio_end(void);
+extern void __delayacct_blkio_end(struct task_struct *);
 extern int __delayacct_add_tsk(struct taskstats *, struct task_struct *);
 extern __u64 __delayacct_blkio_ticks(struct task_struct *);
 extern void __delayacct_freepages_start(void);
@@ -122,10 +122,10 @@ static inline void delayacct_blkio_start(void)
__delayacct_blkio_start();
 }
 
-static inline void delayacct_blkio_end(void)
+static inline void delayacct_blkio_end(struct task_struct *p)
 {
if (current->delays)
-   __delayacct_blkio_end();
+   __delayacct_blkio_end(p);
delayacct_clear_flag(DELAYACCT_PF_BLKIO);
 }
 
@@ -169,7 +169,7 @@ static inline void delayacct_tsk_free(struct task_struct 
*tsk)
 {}
 static inline void delayacct_blkio_start(void)
 {}
-static inline void delayacct_blkio_end(void)
+static inline void delayacct_blkio_end(struct task_struct *p)
 {}
 static inline int delayacct_add_tsk(struct taskstats *d,
struct task_struct *tsk)
diff --git a/kernel/delayacct.c b/kernel/delayacct.c
index 4a1c334..e2ec808 100644
--- a/kernel/delayacct.c
+++ b/kernel/delayacct.c
@@ -51,16 +51,16 @@ void __delayacct_tsk_init(struct task_struct *tsk)
  * Finish delay accounting for a statistic using its timestamps (@start),
  * accumalator (@total) and @count
  */
-static void delayacct_end(u64 *start, u64 *total, u32 *count)
+static void delayacct_end(spinlock_t *lock, u64 *start, u64 *total, u32 *count)
 {
s64 ns = ktime_get_ns() - *start;
unsigned long flags;
 
if (ns > 0) {
-   spin_lock_irqsave(&current->delays->lock, flags);
+   spin_lock_irqsave(lock, flags);
*total += ns;
(*count)++;
-   spin_unlock_irqrestore(&current->delays->lock, flags);
+   spin_unlock_irqrestore(lock, flags);
}
 }
 
@@ -69,17 +69,25 @@ void __delayacct_blkio_start(void)
current->delays->blkio_start = ktime_get_ns();
 }
 
-void __delayacct_blkio_end(void)
+/*
+ * We cannot rely on the `current` macro, as we haven't yet switched back to
+ * the process being woken.
+ */
+void __delayacct_blkio_end(struct task_struct *p)
 {
-   if (current->delays->flags & DELAYACCT_PF_SWAPIN)
-   /* Swapin block I/O */
-   delayacct_end(&current->delays->blkio_start,
-   &current->delays->swapin_delay,
-   &current->delays->swapin_count);
-   else /* Other block I/O */
-   delayacct_end(&current->delays->blkio_start,
-   &current->delays->blkio_delay,
-   &current->delays->blkio_count);
+   struct task_delay_info *delays = p->delays;
+   u64 *total;
+   u32 *count;
+
+   if (p->delays->flags & DELAYACCT_PF_SWAPIN) {
+   total = &delays->swapin_delay;
+   count = &delays->swapin_count;
+   } else {
+   total = &delays->blkio_delay;
+   count = &delays->blkio_count;
+   }
+
+   delayacct_end(&delays->lock, &delays->blkio_start, total, count);
 }
 
 int __delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk)
@@ -153,8 +161,10 @@ void __delayacct_freepages_start(void)
 
 void __delayacct_freepages_end(void)
 {
-   delayacct_end(&current->delays->freepages_start,
-   &current->delays->freepages_delay,
-   &current->delays->freepages_count);
+   
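
(The kernel/sched/core.c hunk is cut off above; for context, a sketch of
the intended call-site change in the wake-up path, assuming the usual
p->in_iowait bookkeeping in try_to_wake_up(), would look roughly like:)

 	if (p->in_iowait) {
-		delayacct_blkio_end();
+		delayacct_blkio_end(p);
 		atomic_dec(&task_rq(p)->nr_iowait);
 	}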

[PATCH] block, bfq: fix occurrences of request finish method's old name

2017-12-18 Thread Chiara Bruschi
Commit '7b9e93616399' ("blk-mq-sched: unify request finished methods")
changed the old name of current bfq_finish_request method, but left it
unchanged elsewhere in the code (related comments, part of function
name bfq_put_rq_priv_body).

This commit fixes all occurrences of the old name of this method by
changing them into the current name.

Fixes: 7b9e93616399 ("blk-mq-sched: unify request finished methods")
Reviewed-by: Paolo Valente 
Signed-off-by: Federico Motta 
Signed-off-by: Chiara Bruschi 
---
 block/bfq-iosched.c | 26 +-
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index bcb6d21..6da7f71 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -3630,8 +3630,8 @@ static struct request *__bfq_dispatch_request(struct 
blk_mq_hw_ctx *hctx)
}
 
/*
-* We exploit the put_rq_private hook to decrement
-* rq_in_driver, but put_rq_private will not be
+* We exploit the bfq_finish_request hook to decrement
+* rq_in_driver, but bfq_finish_request will not be
 * invoked on this request. So, to avoid unbalance,
 * just start this request, without incrementing
 * rq_in_driver. As a negative consequence,
@@ -3640,14 +3640,14 @@ static struct request *__bfq_dispatch_request(struct 
blk_mq_hw_ctx *hctx)
 * bfq_schedule_dispatch to be invoked uselessly.
 *
 * As for implementing an exact solution, the
-* put_request hook, if defined, is probably invoked
-* also on this request. So, by exploiting this hook,
-* we could 1) increment rq_in_driver here, and 2)
-* decrement it in put_request. Such a solution would
-* let the value of the counter be always accurate,
-* but it would entail using an extra interface
-* function. This cost seems higher than the benefit,
-* being the frequency of non-elevator-private
+* bfq_finish_request hook, if defined, is probably
+* invoked also on this request. So, by exploiting
+* this hook, we could 1) increment rq_in_driver here,
+* and 2) decrement it in bfq_finish_request. Such a
+* solution would let the value of the counter be
+* always accurate, but it would entail using an extra
+* interface function. This cost seems higher than the
+* benefit, being the frequency of non-elevator-private
 * requests very low.
 */
goto start_rq;
@@ -4482,7 +4482,7 @@ static void bfq_completed_request(struct bfq_queue *bfqq, 
struct bfq_data *bfqd)
bfq_schedule_dispatch(bfqd);
 }
 
-static void bfq_put_rq_priv_body(struct bfq_queue *bfqq)
+static void bfq_finish_request_body(struct bfq_queue *bfqq)
 {
bfqq->allocated--;
 
@@ -4512,7 +4512,7 @@ static void bfq_finish_request(struct request *rq)
spin_lock_irqsave(&bfqd->lock, flags);
 
bfq_completed_request(bfqq, bfqd);
-   bfq_put_rq_priv_body(bfqq);
+   bfq_finish_request_body(bfqq);
 
spin_unlock_irqrestore(&bfqd->lock, flags);
} else {
@@ -4533,7 +4533,7 @@ static void bfq_finish_request(struct request *rq)
bfqg_stats_update_io_remove(bfqq_group(bfqq),
rq->cmd_flags);
}
-   bfq_put_rq_priv_body(bfqq);
+   bfq_finish_request_body(bfqq);
}
 
rq->elv.priv[0] = NULL;
-- 
2.1.4



Re: 4.14: WARNING: CPU: 4 PID: 2895 at block/blk-mq.c:1144 with virtio-blk (also 4.12 stable)

2017-12-18 Thread Stefan Haberland

On 07.12.2017 00:29, Christoph Hellwig wrote:

On Wed, Dec 06, 2017 at 01:25:11PM +0100, Christian Borntraeger wrote:
> commit 11b2025c3326f7096ceb588c3117c7883850c068 -> bad

 blk-mq: create a blk_mq_ctx for each possible CPU
does not boot on DASD and
commit 9c6ae239e01ae9a9f8657f05c55c4372e9fc8bcc -> good
genirq/affinity: assign vectors to all possible CPUs
does boot with DASD disks.

Also adding Stefan Haberland if he has an idea why this fails on DASD and 
adding Martin (for the
s390 irq handling code).

That is interesting as it really isn't related to interrupts at all,
it just ensures that possible CPUs are set in ->cpumask.

I guess we'd really want:

e005655c389e3d25bf3e43f71611ec12f3012de0
"blk-mq: only select online CPUs in blk_mq_hctx_next_cpu"

before this commit, but it seems like the whole stack didn't work for
your either.

I wonder if there is some weird thing about nr_cpu_ids in s390?



I tried this on my system and the blk-mq-hotplug-fix branch does not
boot for me either.
The disks come up and I/O works fine. At least the partition
detection and the EXT4-fs mount work.


But at some point the disks stop getting any requests.

I currently have no clue why.
I took a dump and had a look at the disk states and they are fine. No
errors in the logs or in our debug entries. Just empty DASD devices
waiting to be called for I/O requests.


Do you have anything I could have a look at?



[PATCH V4 01/45] block: introduce bio helpers for converting to multipage bvec

2017-12-18 Thread Ming Lei
The following helpers are introduced for converting current users of
direct access to the bvec table, and to prepare for supporting multipage bvec:

bio_pages_all()
bio_first_bvec_all()
bio_first_page_all()
bio_last_bvec_all()

All are named bio_*_all(), following bio_for_each_segment_all(); they can
only be used on a bio with !bio_flagged(bio, BIO_CLONED), which means the
whole bvec table is covered.
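
(Illustrative usage, not part of the patch: a caller that currently pokes
the bvec table directly can use the new helpers instead, e.g.)

	/* before: struct page *page = bio->bi_io_vec[0].bv_page; */
	struct page *page = bio_first_page_all(bio);
	struct bio_vec *last = bio_last_bvec_all(bio);

	pr_debug("%u pages, first page %p, last bvec len %u\n",
		 bio_pages_all(bio), page, last->bv_len);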

Signed-off-by: Ming Lei 
---
 include/linux/bio.h | 23 +++
 1 file changed, 23 insertions(+)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 82f0c8fd7be8..435ddf04e889 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -300,6 +300,29 @@ static inline void bio_get_last_bvec(struct bio *bio, 
struct bio_vec *bv)
bv->bv_len = iter.bi_bvec_done;
 }
 
+static inline unsigned bio_pages_all(struct bio *bio)
+{
+   WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED));
+   return bio->bi_vcnt;
+}
+
+static inline struct bio_vec *bio_first_bvec_all(struct bio *bio)
+{
+   WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED));
+   return bio->bi_io_vec;
+}
+
+static inline struct page *bio_first_page_all(struct bio *bio)
+{
+   return bio_first_bvec_all(bio)->bv_page;
+}
+
+static inline struct bio_vec *bio_last_bvec_all(struct bio *bio)
+{
+   WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED));
+   return &bio->bi_io_vec[bio->bi_vcnt - 1];
+}
+
 enum bip_flags {
BIP_BLOCK_INTEGRITY = 1 << 0, /* block layer owns integrity data */
BIP_MAPPED_INTEGRITY= 1 << 1, /* ref tag has been remapped */
-- 
2.9.5



[PATCH V4 02/45] block: convert to bio_first_bvec_all & bio_first_page_all

2017-12-18 Thread Ming Lei
This patch converts to bio_first_bvec_all() & bio_first_page_all() for
retrieving the 1st bvec/page, and prepares for supporting multipage bvec.

Signed-off-by: Ming Lei 
---
 drivers/block/drbd/drbd_bitmap.c | 2 +-
 drivers/block/zram/zram_drv.c| 2 +-
 drivers/md/bcache/super.c| 8 
 fs/btrfs/compression.c   | 2 +-
 fs/btrfs/inode.c | 4 ++--
 fs/f2fs/data.c   | 2 +-
 kernel/power/swap.c  | 2 +-
 mm/page_io.c | 4 ++--
 8 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/drivers/block/drbd/drbd_bitmap.c b/drivers/block/drbd/drbd_bitmap.c
index bd97908c766f..9f4e6f502b84 100644
--- a/drivers/block/drbd/drbd_bitmap.c
+++ b/drivers/block/drbd/drbd_bitmap.c
@@ -953,7 +953,7 @@ static void drbd_bm_endio(struct bio *bio)
struct drbd_bm_aio_ctx *ctx = bio->bi_private;
struct drbd_device *device = ctx->device;
struct drbd_bitmap *b = device->bitmap;
-   unsigned int idx = bm_page_to_idx(bio->bi_io_vec[0].bv_page);
+   unsigned int idx = bm_page_to_idx(bio_first_page_all(bio));
 
if ((ctx->flags & BM_AIO_COPY_PAGES) == 0 &&
!bm_test_page_unchanged(b->bm_pages[idx]))
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index d70eba30003a..0afa6c8c3857 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -430,7 +430,7 @@ static void put_entry_bdev(struct zram *zram, unsigned long 
entry)
 
 static void zram_page_end_io(struct bio *bio)
 {
-   struct page *page = bio->bi_io_vec[0].bv_page;
+   struct page *page = bio_first_page_all(bio);
 
page_endio(page, op_is_write(bio_op(bio)),
blk_status_to_errno(bio->bi_status));
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index b4d28928dec5..8399fe0651f2 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -211,7 +211,7 @@ static void write_bdev_super_endio(struct bio *bio)
 
 static void __write_super(struct cache_sb *sb, struct bio *bio)
 {
-   struct cache_sb *out = page_address(bio->bi_io_vec[0].bv_page);
+   struct cache_sb *out = page_address(bio_first_page_all(bio));
unsigned i;
 
bio->bi_iter.bi_sector  = SB_SECTOR;
@@ -1166,7 +1166,7 @@ static void register_bdev(struct cache_sb *sb, struct 
page *sb_page,
dc->bdev->bd_holder = dc;
 
bio_init(&dc->sb_bio, dc->sb_bio.bi_inline_vecs, 1);
-   dc->sb_bio.bi_io_vec[0].bv_page = sb_page;
+   bio_first_bvec_all(&dc->sb_bio)->bv_page = sb_page;
get_page(sb_page);
 
if (cached_dev_init(dc, sb->block_size << 9))
@@ -1810,7 +1810,7 @@ void bch_cache_release(struct kobject *kobj)
free_fifo(&ca->free[i]);
 
if (ca->sb_bio.bi_inline_vecs[0].bv_page)
-   put_page(ca->sb_bio.bi_io_vec[0].bv_page);
+   put_page(bio_first_page_all(&ca->sb_bio));
 
if (!IS_ERR_OR_NULL(ca->bdev))
blkdev_put(ca->bdev, FMODE_READ|FMODE_WRITE|FMODE_EXCL);
@@ -1864,7 +1864,7 @@ static int register_cache(struct cache_sb *sb, struct 
page *sb_page,
ca->bdev->bd_holder = ca;
 
bio_init(&ca->sb_bio, ca->sb_bio.bi_inline_vecs, 1);
-   ca->sb_bio.bi_io_vec[0].bv_page = sb_page;
+   bio_first_bvec_all(&ca->sb_bio)->bv_page = sb_page;
get_page(sb_page);
 
if (blk_queue_discard(bdev_get_queue(ca->bdev)))
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 5982c8a71f02..38a6b091bc25 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -563,7 +563,7 @@ blk_status_t btrfs_submit_compressed_read(struct inode 
*inode, struct bio *bio,
/* we need the actual starting offset of this extent in the file */
read_lock(&em_tree->lock);
em = lookup_extent_mapping(em_tree,
-  page_offset(bio->bi_io_vec->bv_page),
+  page_offset(bio_first_page_all(bio)),
   PAGE_SIZE);
read_unlock(&em_tree->lock);
if (!em)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e1a7f3cb5be9..4d5cb6e93c80 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8074,7 +8074,7 @@ static void btrfs_retry_endio_nocsum(struct bio *bio)
ASSERT(bio->bi_vcnt == 1);
io_tree = &BTRFS_I(inode)->io_tree;
failure_tree = &BTRFS_I(inode)->io_failure_tree;
-   ASSERT(bio->bi_io_vec->bv_len == btrfs_inode_sectorsize(inode));
+   ASSERT(bio_first_bvec_all(bio)->bv_len == 
btrfs_inode_sectorsize(inode));
 
done->uptodate = 1;
ASSERT(!bio_flagged(bio, BIO_CLONED));
@@ -8164,7 +8164,7 @@ static void btrfs_retry_endio(struct bio *bio)
uptodate = 1;
 
ASSERT(bio->bi_vcnt == 1);
-   ASSERT(bio->bi_io_vec->bv_len == btrfs_inode_sectorsize(done->inode));
+   ASSERT(bio_first_bvec_all(bio)->bv_len == 
btrfs_inode_sectorsize(done->inode));
 
io_tree = 

[PATCH V4 08/45] block: move bio_alloc_pages() to bcache

2017-12-18 Thread Ming Lei
bcache is the only user of bio_alloc_pages(), so move this function into
bcache, and avoid it being misused in the future.

Also rename it to bch_bio_alloc_pages() since it is bcache-only.

Signed-off-by: Ming Lei 
---
 block/bio.c   | 28 
 drivers/md/bcache/btree.c |  2 +-
 drivers/md/bcache/debug.c |  2 +-
 drivers/md/bcache/movinggc.c  |  2 +-
 drivers/md/bcache/request.c   |  2 +-
 drivers/md/bcache/util.c  | 27 +++
 drivers/md/bcache/util.h  |  1 +
 drivers/md/bcache/writeback.c |  2 +-
 include/linux/bio.h   |  1 -
 9 files changed, 33 insertions(+), 34 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 8bfdea58159b..fe1efbeaf4aa 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -969,34 +969,6 @@ void bio_advance(struct bio *bio, unsigned bytes)
 EXPORT_SYMBOL(bio_advance);
 
 /**
- * bio_alloc_pages - allocates a single page for each bvec in a bio
- * @bio: bio to allocate pages for
- * @gfp_mask: flags for allocation
- *
- * Allocates pages up to @bio->bi_vcnt.
- *
- * Returns 0 on success, -ENOMEM on failure. On failure, any allocated pages 
are
- * freed.
- */
-int bio_alloc_pages(struct bio *bio, gfp_t gfp_mask)
-{
-   int i;
-   struct bio_vec *bv;
-
-   bio_for_each_segment_all(bv, bio, i) {
-   bv->bv_page = alloc_page(gfp_mask);
-   if (!bv->bv_page) {
-   while (--bv >= bio->bi_io_vec)
-   __free_page(bv->bv_page);
-   return -ENOMEM;
-   }
-   }
-
-   return 0;
-}
-EXPORT_SYMBOL(bio_alloc_pages);
-
-/**
  * bio_copy_data - copy contents of data buffers from one chain of bios to
  * another
  * @src: source bio list
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 02a4cf646fdc..ebb1874218e7 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -419,7 +419,7 @@ static void do_btree_node_write(struct btree *b)
SET_PTR_OFFSET(&k.key, 0, PTR_OFFSET(&k.key, 0) +
   bset_sector_offset(&b->keys, i));
 
-   if (!bio_alloc_pages(b->bio, __GFP_NOWARN|GFP_NOWAIT)) {
+   if (!bch_bio_alloc_pages(b->bio, __GFP_NOWARN|GFP_NOWAIT)) {
int j;
struct bio_vec *bv;
void *base = (void *) ((unsigned long) i & ~(PAGE_SIZE - 1));
diff --git a/drivers/md/bcache/debug.c b/drivers/md/bcache/debug.c
index c7a02c4900da..879ab21074c6 100644
--- a/drivers/md/bcache/debug.c
+++ b/drivers/md/bcache/debug.c
@@ -116,7 +116,7 @@ void bch_data_verify(struct cached_dev *dc, struct bio *bio)
return;
check->bi_opf = REQ_OP_READ;
 
-   if (bio_alloc_pages(check, GFP_NOIO))
+   if (bch_bio_alloc_pages(check, GFP_NOIO))
goto out_put;
 
submit_bio_wait(check);
diff --git a/drivers/md/bcache/movinggc.c b/drivers/md/bcache/movinggc.c
index d50c1c97da68..a24c3a95b2c0 100644
--- a/drivers/md/bcache/movinggc.c
+++ b/drivers/md/bcache/movinggc.c
@@ -162,7 +162,7 @@ static void read_moving(struct cache_set *c)
bio_set_op_attrs(bio, REQ_OP_READ, 0);
bio->bi_end_io  = read_moving_endio;
 
-   if (bio_alloc_pages(bio, GFP_KERNEL))
+   if (bch_bio_alloc_pages(bio, GFP_KERNEL))
goto err;
 
trace_bcache_gc_copy(&w->key);
diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 643c3021624f..c493fb947dc9 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -841,7 +841,7 @@ static int cached_dev_cache_miss(struct btree *b, struct 
search *s,
cache_bio->bi_private   = &s->cl;
 
bch_bio_map(cache_bio, NULL);
-   if (bio_alloc_pages(cache_bio, __GFP_NOWARN|GFP_NOIO))
+   if (bch_bio_alloc_pages(cache_bio, __GFP_NOWARN|GFP_NOIO))
goto out_put;
 
if (reada)
diff --git a/drivers/md/bcache/util.c b/drivers/md/bcache/util.c
index 61813d230015..a23cd6a14b74 100644
--- a/drivers/md/bcache/util.c
+++ b/drivers/md/bcache/util.c
@@ -283,6 +283,33 @@ start: bv->bv_len  = min_t(size_t, 
PAGE_SIZE - bv->bv_offset,
}
 }
 
+/**
+ * bch_bio_alloc_pages - allocates a single page for each bvec in a bio
+ * @bio: bio to allocate pages for
+ * @gfp_mask: flags for allocation
+ *
+ * Allocates pages up to @bio->bi_vcnt.
+ *
+ * Returns 0 on success, -ENOMEM on failure. On failure, any allocated pages 
are
+ * freed.
+ */
+int bch_bio_alloc_pages(struct bio *bio, gfp_t gfp_mask)
+{
+   int i;
+   struct bio_vec *bv;
+
+   bio_for_each_segment_all(bv, bio, i) {
+   bv->bv_page = alloc_page(gfp_mask);
+   if (!bv->bv_page) {
+   while (--bv >= bio->bi_io_vec)
+   __free_page(bv->bv_page);
+   return -ENOMEM;
+   }
+   }
+
+   return 0;
+}
+
 /*
  * Portions 

[PATCH V4 03/45] fs: convert to bio_last_bvec_all()

2017-12-18 Thread Ming Lei
This patch converts 3 users to bio_last_bvec_all(), so that we can go
ahead and convert to multipage bvec.

Signed-off-by: Ming Lei 
---
 fs/btrfs/compression.c | 2 +-
 fs/btrfs/extent_io.c   | 2 +-
 fs/buffer.c| 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 38a6b091bc25..75610d23d197 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -411,7 +411,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode 
*inode, u64 start,
 
 static u64 bio_end_offset(struct bio *bio)
 {
-   struct bio_vec *last = &bio->bi_io_vec[bio->bi_vcnt - 1];
+   struct bio_vec *last = bio_last_bvec_all(bio);
 
return page_offset(last->bv_page) + last->bv_len + last->bv_offset;
 }
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 012d63870b99..69cd63d4503d 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2724,7 +2724,7 @@ static int __must_check submit_one_bio(struct bio *bio, 
int mirror_num,
   unsigned long bio_flags)
 {
blk_status_t ret = 0;
-   struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
+   struct bio_vec *bvec = bio_last_bvec_all(bio);
struct page *page = bvec->bv_page;
struct extent_io_tree *tree = bio->bi_private;
u64 start;
diff --git a/fs/buffer.c b/fs/buffer.c
index 0736a6a2e2f0..8b26295a56fe 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -3014,7 +3014,7 @@ static void end_bio_bh_io_sync(struct bio *bio)
 void guard_bio_eod(int op, struct bio *bio)
 {
sector_t maxsector;
-   struct bio_vec *bvec = &bio->bi_io_vec[bio->bi_vcnt - 1];
+   struct bio_vec *bvec = bio_last_bvec_all(bio);
unsigned truncated_bytes;
struct hd_struct *part;
 
-- 
2.9.5



[PATCH V4 06/45] dm: limit the max bio size as BIO_MAX_PAGES * PAGE_SIZE

2017-12-18 Thread Ming Lei
For BIO-based DM, some targets, such as the crypt target, aren't ready
to deal with incoming bios bigger than 1 Mbyte.
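
(For reference, assuming 4 KiB pages and BIO_MAX_PAGES == 256, the clamp
in the hunk below works out to 256 * 4 KiB = 1 MiB, matching the 1 Mbyte
limit mentioned above.)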

Cc: Mike Snitzer 
Cc:dm-de...@redhat.com
Reviewed-by: Christoph Hellwig 
Signed-off-by: Ming Lei 
---
 drivers/md/dm.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index de17b7193299..7475739fee49 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -920,7 +920,15 @@ int dm_set_target_max_io_len(struct dm_target *ti, 
sector_t len)
return -EINVAL;
}
 
-   ti->max_io_len = (uint32_t) len;
+   /*
+* BIO based queue uses its own splitting. When multipage bvecs
+* is switched on, size of the incoming bio may be too big to
+* be handled in some targets, such as crypt.
+*
+* When these targets are ready for the big bio, we can remove
+* the limit.
+*/
+   ti->max_io_len = min_t(uint32_t, len, BIO_MAX_PAGES * PAGE_SIZE);
 
return 0;
 }
-- 
2.9.5



[PATCH V4 10/45] btrfs: avoid to access bvec table directly for a cloned bio

2017-12-18 Thread Ming Lei
Commit 17347cec15f919901c90 ("Btrfs: change how we iterate bios in endio")
mentioned that for dio the submitted bio may be fast cloned; we
can't access the bvec table directly for a cloned bio, so use
bio_get_first_bvec() to retrieve the 1st bvec.

Cc: Chris Mason 
Cc: Josef Bacik 
Cc: David Sterba 
Cc: linux-bt...@vger.kernel.org
Cc: Liu Bo 
Reviewed-by: Liu Bo 
Acked: David Sterba 
Signed-off-by: Ming Lei 
---
 fs/btrfs/inode.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 4d5cb6e93c80..cb1e2d201434 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8015,6 +8015,7 @@ static blk_status_t dio_read_error(struct inode *inode, 
struct bio *failed_bio,
int segs;
int ret;
blk_status_t status;
+   struct bio_vec bvec;
 
BUG_ON(bio_op(failed_bio) == REQ_OP_WRITE);
 
@@ -8030,8 +8031,9 @@ static blk_status_t dio_read_error(struct inode *inode, 
struct bio *failed_bio,
}
 
segs = bio_segments(failed_bio);
+   bio_get_first_bvec(failed_bio, &bvec);
if (segs > 1 ||
-   (failed_bio->bi_io_vec->bv_len > btrfs_inode_sectorsize(inode)))
+   (bvec.bv_len > btrfs_inode_sectorsize(inode)))
read_mode |= REQ_FAILFAST_DEV;
 
isector = start - btrfs_io_bio(failed_bio)->logical;
-- 
2.9.5



[PATCH V4 19/45] block: introduce bio_for_each_segment()

2017-12-18 Thread Ming Lei
This helper is used to iterate over multipage bvecs for bio splitting/merge,
and it is required in bio_clone_bioset() too, so introduce it.
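
(Illustrative usage, not part of the patch: iterating a bio segment by
segment with the new helper; each bvl returned may now cover more than
one page.)

	struct bio_vec bvl;
	struct bvec_iter iter;

	bio_for_each_segment(bvl, bio, iter)
		pr_debug("segment: page %p offset %u len %u\n",
			 bvl.bv_page, bvl.bv_offset, bvl.bv_len);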

Signed-off-by: Ming Lei 
---
 include/linux/bio.h  | 34 +++---
 include/linux/bvec.h | 36 
 2 files changed, 63 insertions(+), 7 deletions(-)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 95ca5ddc72ef..0cb29c73ff27 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -76,6 +76,9 @@
 #define bio_data_dir(bio) \
(op_is_write(bio_op(bio)) ? WRITE : READ)
 
+#define bio_iter_seg_iovec(bio, iter)  \
+   bvec_iter_segment_bvec((bio)->bi_io_vec, (iter))
+
 /*
  * Check whether this bio carries any data or not. A NULL bio is allowed.
  */
@@ -156,8 +159,8 @@ static inline void *bio_data(struct bio *bio)
 #define bio_for_each_page_all(bvl, bio, i) \
for (i = 0, bvl = (bio)->bi_io_vec; i < (bio)->bi_vcnt; i++, bvl++)
 
-static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
-   unsigned bytes)
+static inline void __bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
+ unsigned bytes, bool seg)
 {
iter->bi_sector += bytes >> 9;
 
@@ -165,11 +168,26 @@ static inline void bio_advance_iter(struct bio *bio, 
struct bvec_iter *iter,
iter->bi_size -= bytes;
iter->bi_done += bytes;
} else {
-   bvec_iter_advance(bio->bi_io_vec, iter, bytes);
+   if (!seg)
+   bvec_iter_advance(bio->bi_io_vec, iter, bytes);
+   else
+   bvec_iter_seg_advance(bio->bi_io_vec, iter, bytes);
/* TODO: It is reasonable to complete bio with error here. */
}
 }
 
+static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
+   unsigned bytes)
+{
+   __bio_advance_iter(bio, iter, bytes, false);
+}
+
+static inline void bio_advance_seg_iter(struct bio *bio, struct bvec_iter 
*iter,
+  unsigned bytes)
+{
+   __bio_advance_iter(bio, iter, bytes, true);
+}
+
 static inline bool bio_rewind_iter(struct bio *bio, struct bvec_iter *iter,
unsigned int bytes)
 {
@@ -193,6 +211,16 @@ static inline bool bio_rewind_iter(struct bio *bio, struct 
bvec_iter *iter,
 #define bio_for_each_page(bvl, bio, iter)  \
__bio_for_each_page(bvl, bio, iter, (bio)->bi_iter)
 
+#define __bio_for_each_segment(bvl, bio, iter, start)  \
+   for (iter = (start);\
+(iter).bi_size &&  \
+   ((bvl = bio_iter_seg_iovec((bio), (iter))), 1); \
+bio_advance_seg_iter((bio), &(iter), (bvl).bv_len))
+
+/* returns one real segment(multipage bvec) each time */
+#define bio_for_each_segment(bvl, bio, iter)   \
+   __bio_for_each_segment(bvl, bio, iter, (bio)->bi_iter)
+
 #define bio_iter_last(bvec, iter) ((iter).bi_size == (bvec).bv_len)
 
 static inline unsigned bio_pages(struct bio *bio)
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 2433c73fa5ea..84c395feed49 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -126,8 +126,16 @@ struct bvec_iter {
.bv_offset  = bvec_iter_offset((bvec), (iter)), \
 })
 
-static inline bool bvec_iter_advance(const struct bio_vec *bv,
-   struct bvec_iter *iter, unsigned bytes)
+#define bvec_iter_segment_bvec(bvec, iter) \
+((struct bio_vec) {\
+   .bv_page= bvec_iter_segment_page((bvec), (iter)),   \
+   .bv_len = bvec_iter_segment_len((bvec), (iter)),\
+   .bv_offset  = bvec_iter_segment_offset((bvec), (iter)), \
+})
+
+static inline bool __bvec_iter_advance(const struct bio_vec *bv,
+  struct bvec_iter *iter,
+  unsigned bytes, bool segment)
 {
if (WARN_ONCE(bytes > iter->bi_size,
 "Attempted to advance past end of bvec iter\n")) {
@@ -136,8 +144,14 @@ static inline bool bvec_iter_advance(const struct bio_vec 
*bv,
}
 
while (bytes) {
-   unsigned iter_len = bvec_iter_len(bv, *iter);
-   unsigned len = min(bytes, iter_len);
+   unsigned len;
+
+   if (segment)
+   len = bvec_iter_segment_len(bv, *iter);
+   else
+   len = bvec_iter_len(bv, *iter);
+
+   len = min(bytes, len);
 
bytes -= len;
iter->bi_size -= len;
@@ -176,6 +190,20 @@ static inline bool bvec_iter_rewind(const 

[PATCH V4 21/45] block: use bio_for_each_segment() to map sg

2017-12-18 Thread Ming Lei
It is more efficient to use bio_for_each_segment() to map sg; meanwhile
we have to consider splitting the multipage bvec as done in blk_bio_segment_split().

Signed-off-by: Ming Lei 
---
 block/blk-merge.c | 72 +++
 1 file changed, 52 insertions(+), 20 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index f65546b46fff..d4f2186e4787 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -440,6 +440,56 @@ static int blk_phys_contig_segment(struct request_queue 
*q, struct bio *bio,
return 0;
 }
 
+static struct scatterlist *blk_next_sg(struct scatterlist **sg,
+   struct scatterlist *sglist)
+{
+   if (!*sg)
+   return sglist;
+   else {
+   /*
+* If the driver previously mapped a shorter
+* list, we could see a termination bit
+* prematurely unless it fully inits the sg
+* table on each mapping. We KNOW that there
+* must be more entries here or the driver
+* would be buggy, so force clear the
+* termination bit to avoid doing a full
+* sg_init_table() in drivers for each command.
+*/
+   sg_unmark_end(*sg);
+   return sg_next(*sg);
+   }
+}
+
+static unsigned blk_bvec_map_sg(struct request_queue *q,
+   struct bio_vec *bvec, struct scatterlist *sglist,
+   struct scatterlist **sg)
+{
+   unsigned nbytes = bvec->bv_len;
+   unsigned nsegs = 0, total = 0;
+
+   while (nbytes > 0) {
+   unsigned seg_size;
+   struct page *pg;
+   unsigned offset, idx;
+
+   *sg = blk_next_sg(sg, sglist);
+
+   seg_size = min(nbytes, queue_max_segment_size(q));
+   offset = (total + bvec->bv_offset) % PAGE_SIZE;
+   idx = (total + bvec->bv_offset) / PAGE_SIZE;
+   pg = nth_page(bvec->bv_page, idx);
+
+   sg_set_page(*sg, pg, seg_size, offset);
+
+   total += seg_size;
+   nbytes -= seg_size;
+   nsegs++;
+   }
+
+   return nsegs;
+}
+
 static inline void
 __blk_segment_map_sg(struct request_queue *q, struct bio_vec *bvec,
 struct scatterlist *sglist, struct bio_vec *bvprv,
@@ -473,25 +523,7 @@ __blk_segment_map_sg(struct request_queue *q, struct 
bio_vec *bvec,
(*sg)->length += nbytes;
} else {
 new_segment:
-   if (!*sg)
-   *sg = sglist;
-   else {
-   /*
-* If the driver previously mapped a shorter
-* list, we could see a termination bit
-* prematurely unless it fully inits the sg
-* table on each mapping. We KNOW that there
-* must be more entries here or the driver
-* would be buggy, so force clear the
-* termination bit to avoid doing a full
-* sg_init_table() in drivers for each command.
-*/
-   sg_unmark_end(*sg);
-   *sg = sg_next(*sg);
-   }
-
-   sg_set_page(*sg, bvec->bv_page, nbytes, bvec->bv_offset);
-   (*nsegs)++;
+   (*nsegs) += blk_bvec_map_sg(q, bvec, sglist, sg);
 
/* for making iterator happy */
bvec->bv_offset -= advance;
@@ -517,7 +549,7 @@ static int __blk_bios_map_sg(struct request_queue *q, 
struct bio *bio,
int cluster = blk_queue_cluster(q), nsegs = 0;
 
for_each_bio(bio)
-   bio_for_each_page(bvec, bio, iter)
+   bio_for_each_segment(bvec, bio, iter)
__blk_segment_map_sg(q, &bvec, sglist, &bvprv, sg,
 &nsegs, &cluster);
 
-- 
2.9.5



[PATCH V4 20/45] block: use bio_for_each_segment() to compute segments count

2017-12-18 Thread Ming Lei
Firstly it is more efficient to use bio_for_each_segment() in both
blk_bio_segment_split() and __blk_recalc_rq_segments() to compute how many
segments there are in the bio.

Secondly, once bio_for_each_segment() is used, a bvec may need to be
split because its length can be much longer than the max segment size,
so we have to split the big bvec into several segments.

Thirdly, while splitting a multipage bvec into segments, the max segment
number may be reached; the bio then needs to be split when this happens.
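
As a worked example (not part of the patch, and assuming a 4KB
queue_max_segment_size() with no virt boundary set), the splitting
arithmetic behaves as follows:

    /*
     * A single 12KB multipage bvec is split by bvec_split_segs() into
     *
     *      new_nsegs = 3, total_len = 12288, *sectors += 24
     *
     * and, when it is the first bvec of the bio (new_nsegs > 1),
     * front_seg_size is set to the full 4KB segment size.
     */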

Signed-off-by: Ming Lei 
---
 block/blk-merge.c | 97 ---
 1 file changed, 79 insertions(+), 18 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 25ffb84be058..f65546b46fff 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -97,6 +97,62 @@ static inline unsigned get_max_io_size(struct request_queue 
*q,
return sectors;
 }
 
+/*
+ * Split the bvec @bv into segments, and update all kinds of
+ * variables.
+ */
+static bool bvec_split_segs(struct request_queue *q, struct bio_vec *bv,
+   unsigned *nsegs, unsigned *last_seg_size,
+   unsigned *front_seg_size, unsigned *sectors)
+{
+   bool need_split = false;
+   unsigned len = bv->bv_len;
+   unsigned total_len = 0;
+   unsigned new_nsegs = 0, seg_size = 0;
+
+   if ((*nsegs >= queue_max_segments(q)) || !len)
+   return need_split;
+
+   /*
+* Multipage bvec may be too big to hold in one segment,
+* so the current bvec has to be split into multiple
+* segments.
+*/
+   while (new_nsegs + *nsegs < queue_max_segments(q)) {
+   seg_size = min(queue_max_segment_size(q), len);
+
+   new_nsegs++;
+   total_len += seg_size;
+   len -= seg_size;
+
+   if ((queue_virt_boundary(q) && ((bv->bv_offset +
+   total_len) & queue_virt_boundary(q))) || !len)
+   break;
+   }
+
+   /* split in the middle of the bvec */
+   if (len)
+   need_split = true;
+
+   /* update front segment size */
+   if (!*nsegs) {
+   unsigned first_seg_size = seg_size;
+
+   if (new_nsegs > 1)
+   first_seg_size = queue_max_segment_size(q);
+   if (*front_seg_size < first_seg_size)
+   *front_seg_size = first_seg_size;
+   }
+
+   /* update other variables */
+   *last_seg_size = seg_size;
+   *nsegs += new_nsegs;
+   if (sectors)
+   *sectors += total_len >> 9;
+
+   return need_split;
+}
+
 static struct bio *blk_bio_segment_split(struct request_queue *q,
 struct bio *bio,
 struct bio_set *bs,
@@ -111,7 +167,7 @@ static struct bio *blk_bio_segment_split(struct 
request_queue *q,
const unsigned max_sectors = get_max_io_size(q, bio);
unsigned advance = 0;
 
-   bio_for_each_page(bv, bio, iter) {
+   bio_for_each_segment(bv, bio, iter) {
/*
 * If the queue doesn't support SG gaps and adding this
 * offset would create a gap, disallow it.
@@ -126,8 +182,12 @@ static struct bio *blk_bio_segment_split(struct 
request_queue *q,
 */
if (nsegs < queue_max_segments(q) &&
sectors < max_sectors) {
-   nsegs++;
-   sectors = max_sectors;
+   /* split in the middle of bvec */
+   bv.bv_len = (max_sectors - sectors) << 9;
+   bvec_split_segs(q, &bv, &nsegs,
+   &seg_size,
+   &front_seg_size,
+   &sectors);
}
goto split;
}
@@ -139,10 +199,9 @@ static struct bio *blk_bio_segment_split(struct 
request_queue *q,
goto new_segment;
if (seg_size + bv.bv_len > queue_max_segment_size(q)) {
/*
-* On assumption is that initial value of
-* @seg_size(equals to bv.bv_len) won't be
-* bigger than max segment size, but will
-* becomes false after multipage bvec comes.
+* The initial value of @seg_size won't be
+* bigger than max segment size, because we
+* split the bvec via bvec_split_segs().
 */
advance = queue_max_segment_size(q) - seg_size;
 
@@ -174,11 +233,12 @@ static struct bio 

[PATCH V4 23/45] fs/buffer.c: use bvec iterator to truncate the bio

2017-12-18 Thread Ming Lei
Once multipage bvec is enabled, the last bvec may include more than one
page, so this patch uses segment_last_page() to truncate the bio.

Signed-off-by: Ming Lei 
---
 fs/buffer.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 8b26295a56fe..83fa7fda000b 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -3050,7 +3050,10 @@ void guard_bio_eod(int op, struct bio *bio)
 
/* ..and clear the end of the buffer for reads */
if (op == REQ_OP_READ) {
-   zero_user(bvec->bv_page, bvec->bv_offset + bvec->bv_len,
+   struct bio_vec bv;
+
+   segment_last_page(bvec, );
+   zero_user(bv.bv_page, bv.bv_offset + bv.bv_len,
truncated_bytes);
}
 }
-- 
2.9.5



[PATCH V4 24/45] btrfs: use segment_last_page to get bio's last page

2017-12-18 Thread Ming Lei
Preparing for supporting multipage bvec.

Cc: Chris Mason 
Cc: Josef Bacik 
Cc: David Sterba 
Cc: linux-bt...@vger.kernel.org
Signed-off-by: Ming Lei 
---
 fs/btrfs/compression.c | 5 -
 fs/btrfs/extent_io.c   | 5 +++--
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 15430c9fc6d7..1a62725a5f08 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -412,8 +412,11 @@ blk_status_t btrfs_submit_compressed_write(struct inode 
*inode, u64 start,
 static u64 bio_end_offset(struct bio *bio)
 {
struct bio_vec *last = bio_last_bvec_all(bio);
+   struct bio_vec bv;
 
-   return page_offset(last->bv_page) + last->bv_len + last->bv_offset;
+   segment_last_page(last, );
+
+   return page_offset(bv.bv_page) + bv.bv_len + bv.bv_offset;
 }
 
 static noinline int add_ra_bio_pages(struct inode *inode,
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 6984e0d5b00d..f466289b66a3 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2726,11 +2726,12 @@ static int __must_check submit_one_bio(struct bio *bio, 
int mirror_num,
 {
blk_status_t ret = 0;
struct bio_vec *bvec = bio_last_bvec_all(bio);
-   struct page *page = bvec->bv_page;
+   struct bio_vec bv;
struct extent_io_tree *tree = bio->bi_private;
u64 start;
 
-   start = page_offset(page) + bvec->bv_offset;
+   segment_last_page(bvec, );
+   start = page_offset(bv.bv_page) + bv.bv_offset;
 
bio->bi_private = NULL;
bio_get(bio);
-- 
2.9.5



[PATCH V4 30/45] block: deal with dirtying pages for multipage bvec

2017-12-18 Thread Ming Lei
In bio_check_pages_dirty(), bvec->bv_page is used as a flag marking
whether the page has been dirtied & released; if not, it will be dirtied
in the deferred workqueue.

With multipage bvec we can't do that any more, so change the logic to
check all pages in one mp bvec, and only release all these pages if all
of them are dirtied; otherwise dirty them all in the deferred workqueue.

This patch introduces segment_for_each_page_all() to make this case a
bit easier to handle.
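
For illustration only (not in the patch), this is how a caller can walk
every page of one stored multipage bvec with the new helper;
flush_dcache_page() is just an arbitrary per-page action chosen for the
example:

    #include <linux/bvec.h>
    #include <linux/highmem.h>

    /* 'seg' is assumed to point at one entry of bio->bi_io_vec[] */
    static void touch_each_page(struct bio_vec *seg)
    {
        struct bio_vec bv;
        struct bvec_iter iter;

        segment_for_each_page_all(bv, seg, iter)
            flush_dcache_page(bv.bv_page);
    }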

Signed-off-by: Ming Lei 
---
 block/bio.c  | 45 +
 include/linux/bvec.h |  7 +++
 2 files changed, 44 insertions(+), 8 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 1649dc465af7..1c90b8473196 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1574,8 +1574,9 @@ void bio_set_pages_dirty(struct bio *bio)
 {
struct bio_vec *bvec;
int i;
+   struct bvec_iter_all bia;
 
-   bio_for_each_page_all(bvec, bio, i) {
+   bio_for_each_page_all2(bvec, bio, i, bia) {
struct page *page = bvec->bv_page;
 
if (page && !PageCompound(page))
@@ -1583,16 +1584,26 @@ void bio_set_pages_dirty(struct bio *bio)
}
 }
 
+static inline void release_mp_bvec_pages(struct bio_vec *bvec)
+{
+   struct bio_vec bv;
+   struct bvec_iter iter;
+
+   segment_for_each_page_all(bv, bvec, iter)
+   put_page(bv.bv_page);
+}
+
 static void bio_release_pages(struct bio *bio)
 {
struct bio_vec *bvec;
int i;
 
-   bio_for_each_page_all(bvec, bio, i) {
+   /* iterate each mp bvec */
+   bio_for_each_segment_all(bvec, bio, i) {
struct page *page = bvec->bv_page;
 
if (page)
-   put_page(page);
+   release_mp_bvec_pages(bvec);
}
 }
 
@@ -1636,20 +1647,38 @@ static void bio_dirty_fn(struct work_struct *work)
}
 }
 
+static inline void check_mp_bvec_pages(struct bio_vec *bvec,
+   int *nr_dirty, int *nr_pages)
+{
+   struct bio_vec bv;
+   struct bvec_iter iter;
+
+   segment_for_each_page_all(bv, bvec, iter) {
+   struct page *page = bv.bv_page;
+
+   if (PageDirty(page) || PageCompound(page))
+   (*nr_dirty)++;
+   (*nr_pages)++;
+   }
+}
+
 void bio_check_pages_dirty(struct bio *bio)
 {
struct bio_vec *bvec;
int nr_clean_pages = 0;
int i;
 
-   bio_for_each_page_all(bvec, bio, i) {
-   struct page *page = bvec->bv_page;
+   bio_for_each_segment_all(bvec, bio, i) {
+   int nr_dirty = 0, nr_pages = 0;
+
check_mp_bvec_pages(bvec, &nr_dirty, &nr_pages);
 
-   if (PageDirty(page) || PageCompound(page)) {
-   put_page(page);
+   /* release all pages in the mp bvec if all are dirtied */
+   if (nr_dirty == nr_pages) {
+   release_mp_bvec_pages(bvec);
bvec->bv_page = NULL;
} else {
-   nr_clean_pages++;
+   nr_clean_pages += nr_pages;
}
}
 
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 2deee87b823e..893e8fef0dd0 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -225,6 +225,13 @@ static inline bool bvec_iter_seg_advance(const struct 
bio_vec *bv,
.bi_bvec_done   = 0,\
 }
 
+#define segment_for_each_page_all(pg_bvl, seg_bvec, iter)  \
+   for (iter = BVEC_ITER_ALL_INIT, \
+(iter).bi_size = (seg_bvec)->bv_len  - (iter).bi_bvec_done;\
+(iter).bi_size &&  \
+   ((pg_bvl = bvec_iter_bvec((seg_bvec), (iter))), 1); \
+bvec_iter_advance((seg_bvec), &(iter), (pg_bvl).bv_len))
+
 /* get the last page from the multipage bvec and store it in @pg */
 static inline void segment_last_page(const struct bio_vec *seg,
struct bio_vec *pg)
-- 
2.9.5



[PATCH V4 29/45] block: bio: introduce bio_for_each_page_all2 and bio_for_each_segment_all

2017-12-18 Thread Ming Lei
This patch introduces bio_for_each_page_all2(), which is meant to replace
bio_for_each_page_all() in cases where the returned bvec has to be a
singlepage bvec.

Given that the interface has to be changed to pass one local iterator
variable of 'bvec_iter_all', and doing all the changes in one single patch
isn't realistic, use the name bio_for_each_page_all2() temporarily for the
conversion; once every bio_for_each_page_all() user is converted, the
original name of bio_for_each_page_all() will be restored.

This patch introduces bio_for_each_segment_all too, which is used for
updating the bvec table directly, and users should be careful with this
helper since it now returns a real multipage segment.
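
The conversion pattern for callers looks like the sketch below (not part
of this patch); the only change for most users is the extra on-stack
'struct bvec_iter_all', and put_page() here is just an example action:

    #include <linux/bio.h>

    static void put_bio_pages(struct bio *bio)
    {
        struct bio_vec *bvec;
        struct bvec_iter_all bia;
        int i;

        /* the loop body still sees singlepage bvecs */
        bio_for_each_page_all2(bvec, bio, i, bia)
            put_page(bvec->bv_page);
    }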

Signed-off-by: Ming Lei 
---
 include/linux/bio.h  | 18 ++
 include/linux/bvec.h |  6 ++
 2 files changed, 24 insertions(+)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 205a914ee3c0..f96c9f662f92 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -221,6 +221,24 @@ static inline bool bio_rewind_iter(struct bio *bio, struct 
bvec_iter *iter,
 #define bio_for_each_segment(bvl, bio, iter)   \
__bio_for_each_segment(bvl, bio, iter, (bio)->bi_iter)
 
+#define bio_for_each_segment_all(bvl, bio, i) \
+   bio_for_each_page_all((bvl), (bio), (i))
+
+/*
+ * This helper returns singlepage bvec to caller, and the sp bvec is
+ * generated in-flight from multipage bvec stored in bvec table. So we
+ * can _not_ change the bvec stored in bio->bi_io_vec[] via this helper.
+ *
+ * If bvec need to be updated in the table, please use
+ * bio_for_each_segment_all() and make sure it is correctly used since
+ * bvec may points to one multipage bvec.
+ */
+#define bio_for_each_page_all2(bvl, bio, i, bi)\
+   for ((bi).iter = BVEC_ITER_ALL_INIT, i = 0, bvl = &(bi).bv; \
+(bi).iter.bi_idx < (bio)->bi_vcnt &&   \
+   (((bi).bv = bio_iter_iovec((bio), (bi).iter)), 1);  \
+bio_advance_iter((bio), &(bi).iter, (bi).bv.bv_len), i++)
+
 #define bio_iter_last(bvec, iter) ((iter).bi_size == (bvec).bv_len)
 
 static inline unsigned __bio_elements(struct bio *bio, bool seg)
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 217afcd83a15..2deee87b823e 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -84,6 +84,12 @@ struct bvec_iter {
   current bvec */
 };
 
+/* this iter is only for implementing bio_for_each_page_all2() */
+struct bvec_iter_all {
+   struct bvec_iteriter;
+   struct bio_vec  bv;  /* in-flight singlepage bvec */
+};
+
 /*
  * various member access, note that bio_data should of course not be used
  * on highmem page vectors
-- 
2.9.5



[PATCH V4 31/45] block: convert to bio_for_each_page_all2()

2017-12-18 Thread Ming Lei
We have to convert to bio_for_each_page_all2() for iterating page by
page.

bio_for_each_page_all() can't be used any more after multipage bvec is
enabled.

Signed-off-by: Ming Lei 
---
 block/bio.c | 18 --
 block/blk-zoned.c   |  5 +++--
 block/bounce.c  |  6 --
 include/linux/bio.h |  3 ++-
 4 files changed, 21 insertions(+), 11 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 1c90b8473196..21d621e07ac9 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1063,8 +1063,9 @@ static int bio_copy_from_iter(struct bio *bio, struct 
iov_iter *iter)
 {
int i;
struct bio_vec *bvec;
+   struct bvec_iter_all bia;
 
-   bio_for_each_page_all(bvec, bio, i) {
+   bio_for_each_page_all2(bvec, bio, i, bia) {
ssize_t ret;
 
ret = copy_page_from_iter(bvec->bv_page,
@@ -1094,8 +1095,9 @@ static int bio_copy_to_iter(struct bio *bio, struct 
iov_iter iter)
 {
int i;
struct bio_vec *bvec;
+   struct bvec_iter_all bia;
 
-   bio_for_each_page_all(bvec, bio, i) {
+   bio_for_each_page_all2(bvec, bio, i, bia) {
ssize_t ret;
 
ret = copy_page_to_iter(bvec->bv_page,
@@ -1117,8 +1119,9 @@ void bio_free_pages(struct bio *bio)
 {
struct bio_vec *bvec;
int i;
+   struct bvec_iter_all bia;
 
-   bio_for_each_page_all(bvec, bio, i)
+   bio_for_each_page_all2(bvec, bio, i, bia)
__free_page(bvec->bv_page);
 }
 EXPORT_SYMBOL(bio_free_pages);
@@ -1284,6 +1287,7 @@ struct bio *bio_map_user_iov(struct request_queue *q,
struct bio *bio;
int ret;
struct bio_vec *bvec;
+   struct bvec_iter_all bia;
 
if (!iov_iter_count(iter))
return ERR_PTR(-EINVAL);
@@ -1357,7 +1361,7 @@ struct bio *bio_map_user_iov(struct request_queue *q,
return bio;
 
  out_unmap:
-   bio_for_each_page_all(bvec, bio, j) {
+   bio_for_each_page_all2(bvec, bio, j, bia) {
put_page(bvec->bv_page);
}
bio_put(bio);
@@ -1368,11 +1372,12 @@ static void __bio_unmap_user(struct bio *bio)
 {
struct bio_vec *bvec;
int i;
+   struct bvec_iter_all bia;
 
/*
 * make sure we dirty pages we wrote to
 */
-   bio_for_each_page_all(bvec, bio, i) {
+   bio_for_each_page_all2(bvec, bio, i, bia) {
if (bio_data_dir(bio) == READ)
set_page_dirty_lock(bvec->bv_page);
 
@@ -1464,8 +1469,9 @@ static void bio_copy_kern_endio_read(struct bio *bio)
char *p = bio->bi_private;
struct bio_vec *bvec;
int i;
+   struct bvec_iter_all bia;
 
-   bio_for_each_page_all(bvec, bio, i) {
+   bio_for_each_page_all2(bvec, bio, i, bia) {
memcpy(p, page_address(bvec->bv_page), bvec->bv_len);
p += bvec->bv_len;
}
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 99f6e2cb6fd5..2899adfa23f4 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -81,6 +81,7 @@ int blkdev_report_zones(struct block_device *bdev,
unsigned int ofst;
void *addr;
int ret;
+   struct bvec_iter_all bia;
 
if (!q)
return -ENXIO;
@@ -148,7 +149,7 @@ int blkdev_report_zones(struct block_device *bdev,
n = 0;
nz = 0;
nr_rep = 0;
-   bio_for_each_page_all(bv, bio, i) {
+   bio_for_each_page_all2(bv, bio, i, bia) {
 
if (!bv->bv_page)
break;
@@ -181,7 +182,7 @@ int blkdev_report_zones(struct block_device *bdev,
 
*nr_zones = nz;
 out:
-   bio_for_each_page_all(bv, bio, i)
+   bio_for_each_page_all2(bv, bio, i, bia)
__free_page(bv->bv_page);
bio_put(bio);
 
diff --git a/block/bounce.c b/block/bounce.c
index 67aa6cff16d6..6436c07179f0 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -146,11 +146,12 @@ static void bounce_end_io(struct bio *bio, mempool_t 
*pool)
struct bio_vec *bvec, orig_vec;
int i;
struct bvec_iter orig_iter = bio_orig->bi_iter;
+   struct bvec_iter_all bia;
 
/*
 * free up bounce indirect pages used
 */
-   bio_for_each_page_all(bvec, bio, i) {
+   bio_for_each_page_all2(bvec, bio, i, bia) {
orig_vec = bio_iter_iovec(bio_orig, orig_iter);
if (bvec->bv_page != orig_vec.bv_page) {
dec_zone_page_state(bvec->bv_page, NR_BOUNCE);
@@ -205,6 +206,7 @@ static void __blk_queue_bounce(struct request_queue *q, 
struct bio **bio_orig,
unsigned i = 0;
bool bounce = false;
int sectors = 0;
+   struct bvec_iter_all bia;
 
bio_for_each_page(from, *bio_orig, iter) {
if (i++ < BIO_MAX_PAGES)
@@ -223,7 +225,7 @@ static void __blk_queue_bounce(struct request_queue *q, 
struct bio **bio_orig,
}
bio = 

[PATCH V4 32/45] md/dm/bcache: convert to bio_for_each_page_all2 and bio_for_each_segment

2017-12-18 Thread Ming Lei
In bch_bio_alloc_pages(), bio_for_each_segment_all() is fine because this
helper can only be used on a freshly allocated bio.

For the other cases, we convert to bio_for_each_page_all2() since they
don't need to update the bvec table.

bio_for_each_page_all() can't be used any more after multipage bvec is
enabled, so we have to convert to bio_for_each_page_all2().

Signed-off-by: Ming Lei 
---
 drivers/md/bcache/btree.c | 3 ++-
 drivers/md/bcache/util.c  | 2 +-
 drivers/md/dm-crypt.c | 3 ++-
 drivers/md/raid1.c| 3 ++-
 4 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index a82100527495..ac7bac6e6a29 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -423,8 +423,9 @@ static void do_btree_node_write(struct btree *b)
int j;
struct bio_vec *bv;
void *base = (void *) ((unsigned long) i & ~(PAGE_SIZE - 1));
+   struct bvec_iter_all bia;
 
-   bio_for_each_page_all(bv, b->bio, j)
+   bio_for_each_page_all2(bv, b->bio, j, bia)
memcpy(page_address(bv->bv_page),
   base + j * PAGE_SIZE, PAGE_SIZE);
 
diff --git a/drivers/md/bcache/util.c b/drivers/md/bcache/util.c
index 8f2d522822b1..a23cd6a14b74 100644
--- a/drivers/md/bcache/util.c
+++ b/drivers/md/bcache/util.c
@@ -298,7 +298,7 @@ int bch_bio_alloc_pages(struct bio *bio, gfp_t gfp_mask)
int i;
struct bio_vec *bv;
 
-   bio_for_each_page_all(bv, bio, i) {
+   bio_for_each_segment_all(bv, bio, i) {
bv->bv_page = alloc_page(gfp_mask);
if (!bv->bv_page) {
while (--bv >= bio->bi_io_vec)
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 970a761de621..19dc1f6b523a 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -1442,8 +1442,9 @@ static void crypt_free_buffer_pages(struct crypt_config 
*cc, struct bio *clone)
 {
unsigned int i;
struct bio_vec *bv;
+   struct bvec_iter_all bia;
 
-   bio_for_each_page_all(bv, clone, i) {
+   bio_for_each_page_all2(bv, clone, i, bia) {
BUG_ON(!bv->bv_page);
mempool_free(bv->bv_page, cc->page_pool);
}
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index e64b49929b8d..da5d7ea5504b 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2083,13 +2083,14 @@ static void process_checks(struct r1bio *r1_bio)
struct page **spages = get_resync_pages(sbio)->pages;
struct bio_vec *bi;
int page_len[RESYNC_PAGES] = { 0 };
+   struct bvec_iter_all bia;
 
if (sbio->bi_end_io != end_sync_read)
continue;
/* Now we can 'fixup' the error value */
sbio->bi_status = 0;
 
-   bio_for_each_page_all(bi, sbio, j)
+   bio_for_each_page_all2(bi, sbio, j, bia)
page_len[j] = bi->bv_len;
 
if (!status) {
-- 
2.9.5



[PATCH V4 34/45] btrfs: convert to bio_for_each_page_all2

2017-12-18 Thread Ming Lei
bio_for_each_page_all() can't be used any more after multipage bvec is
enabled, so we have to convert to bio_for_each_page_all2().

Signed-off-by: Ming Lei 
---
 fs/btrfs/compression.c | 3 ++-
 fs/btrfs/disk-io.c | 3 ++-
 fs/btrfs/extent_io.c   | 9 ++---
 fs/btrfs/inode.c   | 6 --
 4 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 1a62725a5f08..f399f298b446 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -165,13 +165,14 @@ static void end_compressed_bio_read(struct bio *bio)
} else {
int i;
struct bio_vec *bvec;
+   struct bvec_iter_all bia;
 
/*
 * we have verified the checksum already, set page
 * checked so the end_io handlers know about it
 */
ASSERT(!bio_flagged(bio, BIO_CLONED));
-   bio_for_each_page_all(bvec, cb->orig_bio, i)
+   bio_for_each_page_all2(bvec, cb->orig_bio, i, bia)
SetPageChecked(bvec->bv_page);
 
bio_endio(cb->orig_bio);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 4f80361fbea9..8f2afdbd0a27 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -803,9 +803,10 @@ static blk_status_t btree_csum_one_bio(struct bio *bio)
struct bio_vec *bvec;
struct btrfs_root *root;
int i, ret = 0;
+   struct bvec_iter_all bia;
 
ASSERT(!bio_flagged(bio, BIO_CLONED));
-   bio_for_each_page_all(bvec, bio, i) {
+   bio_for_each_page_all2(bvec, bio, i, bia) {
root = BTRFS_I(bvec->bv_page->mapping->host)->root;
ret = csum_dirty_buffer(root->fs_info, bvec->bv_page);
if (ret)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index f466289b66a3..9df1b70cfa9b 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2451,9 +2451,10 @@ static void end_bio_extent_writepage(struct bio *bio)
u64 start;
u64 end;
int i;
+   struct bvec_iter_all bia;
 
ASSERT(!bio_flagged(bio, BIO_CLONED));
-   bio_for_each_page_all(bvec, bio, i) {
+   bio_for_each_page_all2(bvec, bio, i, bia) {
struct page *page = bvec->bv_page;
struct inode *inode = page->mapping->host;
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
@@ -2522,9 +2523,10 @@ static void end_bio_extent_readpage(struct bio *bio)
int mirror;
int ret;
int i;
+   struct bvec_iter_all bia;
 
ASSERT(!bio_flagged(bio, BIO_CLONED));
-   bio_for_each_page_all(bvec, bio, i) {
+   bio_for_each_page_all2(bvec, bio, i, bia) {
struct page *page = bvec->bv_page;
struct inode *inode = page->mapping->host;
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
@@ -3682,9 +3684,10 @@ static void end_bio_extent_buffer_writepage(struct bio 
*bio)
struct bio_vec *bvec;
struct extent_buffer *eb;
int i, done;
+   struct bvec_iter_all bia;
 
ASSERT(!bio_flagged(bio, BIO_CLONED));
-   bio_for_each_page_all(bvec, bio, i) {
+   bio_for_each_page_all2(bvec, bio, i, bia) {
struct page *page = bvec->bv_page;
 
eb = (struct extent_buffer *)page->private;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index fda9f2a92f7a..1da401c60b9c 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8069,6 +8069,7 @@ static void btrfs_retry_endio_nocsum(struct bio *bio)
struct bio_vec *bvec;
struct extent_io_tree *io_tree, *failure_tree;
int i;
+   struct bvec_iter_all bia;
 
if (bio->bi_status)
goto end;
@@ -8080,7 +8081,7 @@ static void btrfs_retry_endio_nocsum(struct bio *bio)
 
done->uptodate = 1;
ASSERT(!bio_flagged(bio, BIO_CLONED));
-   bio_for_each_page_all(bvec, bio, i)
+   bio_for_each_page_all2(bvec, bio, i, bia)
clean_io_failure(BTRFS_I(inode)->root->fs_info, failure_tree,
 io_tree, done->start, bvec->bv_page,
 btrfs_ino(BTRFS_I(inode)), 0);
@@ -8159,6 +8160,7 @@ static void btrfs_retry_endio(struct bio *bio)
int uptodate;
int ret;
int i;
+   struct bvec_iter_all bia;
 
if (bio->bi_status)
goto end;
@@ -8172,7 +8174,7 @@ static void btrfs_retry_endio(struct bio *bio)
failure_tree = &BTRFS_I(inode)->io_failure_tree;
 
ASSERT(!bio_flagged(bio, BIO_CLONED));
-   bio_for_each_page_all(bvec, bio, i) {
+   bio_for_each_page_all2(bvec, bio, i, bia) {
ret = __readpage_endio_check(inode, io_bio, i, bvec->bv_page,
 bvec->bv_offset, done->start,
 bvec->bv_len);
-- 
2.9.5



[PATCH V4 33/45] fs: convert to bio_for_each_page_all2

2017-12-18 Thread Ming Lei
bio_for_each_page_all() can't be used any more after multipage bvec is
enabled, so we have to convert to bio_for_each_page_all2().

Signed-off-by: Ming Lei 
---
 fs/block_dev.c  | 6 --
 fs/crypto/bio.c | 3 ++-
 fs/direct-io.c  | 4 +++-
 fs/iomap.c  | 3 ++-
 fs/mpage.c  | 3 ++-
 5 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index e73f635502e7..41e1fc90f048 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -197,6 +197,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct 
iov_iter *iter,
ssize_t ret;
blk_qc_t qc;
int i;
+   struct bvec_iter_all bia;
 
if ((pos | iov_iter_alignment(iter)) &
(bdev_logical_block_size(bdev) - 1))
@@ -242,7 +243,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct 
iov_iter *iter,
}
__set_current_state(TASK_RUNNING);
 
-   bio_for_each_page_all(bvec, , i) {
+   bio_for_each_page_all2(bvec, , i, bia) {
if (should_dirty && !PageCompound(bvec->bv_page))
set_page_dirty_lock(bvec->bv_page);
put_page(bvec->bv_page);
@@ -309,8 +310,9 @@ static void blkdev_bio_end_io(struct bio *bio)
} else {
struct bio_vec *bvec;
int i;
+   struct bvec_iter_all bia;
 
-   bio_for_each_page_all(bvec, bio, i)
+   bio_for_each_page_all2(bvec, bio, i, bia)
put_page(bvec->bv_page);
bio_put(bio);
}
diff --git a/fs/crypto/bio.c b/fs/crypto/bio.c
index 2dda77c3a89a..743c3ecb7f97 100644
--- a/fs/crypto/bio.c
+++ b/fs/crypto/bio.c
@@ -37,8 +37,9 @@ static void completion_pages(struct work_struct *work)
struct bio *bio = ctx->r.bio;
struct bio_vec *bv;
int i;
+   struct bvec_iter_all bia;
 
-   bio_for_each_page_all(bv, bio, i) {
+   bio_for_each_page_all2(bv, bio, i, bia) {
struct page *page = bv->bv_page;
int ret = fscrypt_decrypt_page(page->mapping->host, page,
PAGE_SIZE, 0, page->index);
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 5e6f4a5772d2..578a3a854115 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -530,7 +530,9 @@ static blk_status_t dio_bio_complete(struct dio *dio, 
struct bio *bio)
if (dio->is_async && dio->op == REQ_OP_READ && dio->should_dirty) {
bio_check_pages_dirty(bio); /* transfers ownership */
} else {
-   bio_for_each_page_all(bvec, bio, i) {
+   struct bvec_iter_all bia;
+
+   bio_for_each_page_all2(bvec, bio, i, bia) {
struct page *page = bvec->bv_page;
 
if (dio->op == REQ_OP_READ && !PageCompound(page) &&
diff --git a/fs/iomap.c b/fs/iomap.c
index 286daad10ba2..fa4b6e15d29c 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -814,8 +814,9 @@ static void iomap_dio_bio_end_io(struct bio *bio)
} else {
struct bio_vec *bvec;
int i;
+   struct bvec_iter_all bia;
 
-   bio_for_each_page_all(bvec, bio, i)
+   bio_for_each_page_all2(bvec, bio, i, bia)
put_page(bvec->bv_page);
bio_put(bio);
}
diff --git a/fs/mpage.c b/fs/mpage.c
index 1cf322c4d6f8..f2da0f9ec0f2 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -48,8 +48,9 @@ static void mpage_end_io(struct bio *bio)
 {
struct bio_vec *bv;
int i;
+   struct bvec_iter_all bia;
 
-   bio_for_each_page_all(bv, bio, i) {
+   bio_for_each_page_all2(bv, bio, i, bia) {
struct page *page = bv->bv_page;
page_endio(page, op_is_write(bio_op(bio)),
blk_status_to_errno(bio->bi_status));
-- 
2.9.5



[PATCH V4 35/45] ext4: convert to bio_for_each_page_all2

2017-12-18 Thread Ming Lei
bio_for_each_page_all() can't be used any more after multipage bvec is
enabled, so we have to convert to bio_for_each_page_all2().

Signed-off-by: Ming Lei 
---
 fs/ext4/page-io.c  | 3 ++-
 fs/ext4/readpage.c | 3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 52f2937f5603..b56a733f33c0 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -63,8 +63,9 @@ static void ext4_finish_bio(struct bio *bio)
 {
int i;
struct bio_vec *bvec;
+   struct bvec_iter_all bia;
 
-   bio_for_each_page_all(bvec, bio, i) {
+   bio_for_each_page_all2(bvec, bio, i, bia) {
struct page *page = bvec->bv_page;
 #ifdef CONFIG_EXT4_FS_ENCRYPTION
struct page *data_page = NULL;
diff --git a/fs/ext4/readpage.c b/fs/ext4/readpage.c
index 572b6296f709..c46b5ff68fa8 100644
--- a/fs/ext4/readpage.c
+++ b/fs/ext4/readpage.c
@@ -72,6 +72,7 @@ static void mpage_end_io(struct bio *bio)
 {
struct bio_vec *bv;
int i;
+   struct bvec_iter_all bia;
 
if (ext4_bio_encrypted(bio)) {
if (bio->bi_status) {
@@ -81,7 +82,7 @@ static void mpage_end_io(struct bio *bio)
return;
}
}
-   bio_for_each_page_all(bv, bio, i) {
+   bio_for_each_page_all2(bv, bio, i, bia) {
struct page *page = bv->bv_page;
 
if (!bio->bi_status) {
-- 
2.9.5



[PATCH V4 45/45] block: document usage of bio iterator helpers

2017-12-18 Thread Ming Lei
Now that multipage bvec is supported, some helpers return data page by
page while others return it segment by segment; this patch documents
their usage to help us use them correctly.

Signed-off-by: Ming Lei 
---
 Documentation/block/biovecs.txt | 32 
 1 file changed, 32 insertions(+)

diff --git a/Documentation/block/biovecs.txt b/Documentation/block/biovecs.txt
index b4d238b8d9fc..32a6643caeca 100644
--- a/Documentation/block/biovecs.txt
+++ b/Documentation/block/biovecs.txt
@@ -117,3 +117,35 @@ Other implications:
size limitations and the limitations of the underlying devices. Thus
there's no need to define ->merge_bvec_fn() callbacks for individual block
drivers.
+
+Usage of helpers:
+=
+
+* The following helpers, whose names have the suffix "_all", can only be used
+on a non-BIO_CLONED bio; they are usually used by filesystem code, and drivers
+shouldn't use them because the bio may have been split before it got to the
+driver:
+
+   bio_for_each_segment_all()
+   bio_for_each_page_all()
+   bio_pages_all()
+   bio_first_bvec_all()
+   bio_first_page_all()
+   bio_last_bvec_all()
+   segment_for_each_page_all()
+
+* The following helpers iterate the bio page by page, and the local variable
+of 'struct bio_vec' or the reference records a singlepage io vector during the
+iteration:
+
+   bio_for_each_page()
+   bio_for_each_page_all()
+   segment_for_each_page_all()
+
+* The following helpers iterate the bio segment by segment, and each segment
+may include multiple physically contiguous pages; the local variable of
+'struct bio_vec' or the reference records a multipage io vector during the
+iteration:
+
+   bio_for_each_segment()
+   bio_for_each_segment_all()
-- 
2.9.5



[PATCH V4 41/45] block: rename bio_for_each_page_all2 as bio_for_each_page_all

2017-12-18 Thread Ming Lei
Now that bio_for_each_page_all() is gone, we can reuse the name for
iterating the bio page by page, which is currently done via
bio_for_each_page_all2().

Signed-off-by: Ming Lei 
---
 block/bio.c   | 14 +++---
 block/blk-zoned.c |  4 ++--
 block/bounce.c|  4 ++--
 drivers/md/bcache/btree.c |  2 +-
 drivers/md/dm-crypt.c |  2 +-
 drivers/md/raid1.c|  2 +-
 fs/block_dev.c|  4 ++--
 fs/btrfs/compression.c|  2 +-
 fs/btrfs/disk-io.c|  2 +-
 fs/btrfs/extent_io.c  |  6 +++---
 fs/btrfs/inode.c  |  4 ++--
 fs/crypto/bio.c   |  2 +-
 fs/direct-io.c|  2 +-
 fs/exofs/ore.c|  2 +-
 fs/exofs/ore_raid.c   |  2 +-
 fs/ext4/page-io.c |  2 +-
 fs/ext4/readpage.c|  2 +-
 fs/f2fs/data.c|  6 +++---
 fs/gfs2/lops.c|  2 +-
 fs/gfs2/meta_io.c |  2 +-
 fs/iomap.c|  2 +-
 fs/mpage.c|  2 +-
 fs/xfs/xfs_aops.c |  2 +-
 include/linux/bio.h   |  4 ++--
 include/linux/bvec.h  |  2 +-
 25 files changed, 40 insertions(+), 40 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 21d621e07ac9..e82e4c815dbb 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1065,7 +1065,7 @@ static int bio_copy_from_iter(struct bio *bio, struct 
iov_iter *iter)
struct bio_vec *bvec;
struct bvec_iter_all bia;
 
-   bio_for_each_page_all2(bvec, bio, i, bia) {
+   bio_for_each_page_all(bvec, bio, i, bia) {
ssize_t ret;
 
ret = copy_page_from_iter(bvec->bv_page,
@@ -1097,7 +1097,7 @@ static int bio_copy_to_iter(struct bio *bio, struct 
iov_iter iter)
struct bio_vec *bvec;
struct bvec_iter_all bia;
 
-   bio_for_each_page_all2(bvec, bio, i, bia) {
+   bio_for_each_page_all(bvec, bio, i, bia) {
ssize_t ret;
 
ret = copy_page_to_iter(bvec->bv_page,
@@ -1121,7 +1121,7 @@ void bio_free_pages(struct bio *bio)
int i;
struct bvec_iter_all bia;
 
-   bio_for_each_page_all2(bvec, bio, i, bia)
+   bio_for_each_page_all(bvec, bio, i, bia)
__free_page(bvec->bv_page);
 }
 EXPORT_SYMBOL(bio_free_pages);
@@ -1361,7 +1361,7 @@ struct bio *bio_map_user_iov(struct request_queue *q,
return bio;
 
  out_unmap:
-   bio_for_each_page_all2(bvec, bio, j, bia) {
+   bio_for_each_page_all(bvec, bio, j, bia) {
put_page(bvec->bv_page);
}
bio_put(bio);
@@ -1377,7 +1377,7 @@ static void __bio_unmap_user(struct bio *bio)
/*
 * make sure we dirty pages we wrote to
 */
-   bio_for_each_page_all2(bvec, bio, i, bia) {
+   bio_for_each_page_all(bvec, bio, i, bia) {
if (bio_data_dir(bio) == READ)
set_page_dirty_lock(bvec->bv_page);
 
@@ -1471,7 +1471,7 @@ static void bio_copy_kern_endio_read(struct bio *bio)
int i;
struct bvec_iter_all bia;
 
-   bio_for_each_page_all2(bvec, bio, i, bia) {
+   bio_for_each_page_all(bvec, bio, i, bia) {
memcpy(p, page_address(bvec->bv_page), bvec->bv_len);
p += bvec->bv_len;
}
@@ -1582,7 +1582,7 @@ void bio_set_pages_dirty(struct bio *bio)
int i;
struct bvec_iter_all bia;
 
-   bio_for_each_page_all2(bvec, bio, i, bia) {
+   bio_for_each_page_all(bvec, bio, i, bia) {
struct page *page = bvec->bv_page;
 
if (page && !PageCompound(page))
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 2899adfa23f4..360f99317fa2 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -149,7 +149,7 @@ int blkdev_report_zones(struct block_device *bdev,
n = 0;
nz = 0;
nr_rep = 0;
-   bio_for_each_page_all2(bv, bio, i, bia) {
+   bio_for_each_page_all(bv, bio, i, bia) {
 
if (!bv->bv_page)
break;
@@ -182,7 +182,7 @@ int blkdev_report_zones(struct block_device *bdev,
 
*nr_zones = nz;
 out:
-   bio_for_each_page_all2(bv, bio, i, bia)
+   bio_for_each_page_all(bv, bio, i, bia)
__free_page(bv->bv_page);
bio_put(bio);
 
diff --git a/block/bounce.c b/block/bounce.c
index 6436c07179f0..fdabaed443fb 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -151,7 +151,7 @@ static void bounce_end_io(struct bio *bio, mempool_t *pool)
/*
 * free up bounce indirect pages used
 */
-   bio_for_each_page_all2(bvec, bio, i, bia) {
+   bio_for_each_page_all(bvec, bio, i, bia) {
orig_vec = bio_iter_iovec(bio_orig, orig_iter);
if (bvec->bv_page != orig_vec.bv_page) {
dec_zone_page_state(bvec->bv_page, NR_BOUNCE);
@@ -225,7 +225,7 @@ static void __blk_queue_bounce(struct request_queue *q, 
struct bio **bio_orig,
}
bio = bio_clone_bioset(*bio_orig, GFP_NOIO, bounce_bio_set);
 
-   

[PATCH V4 44/45] block: always define BIO_MAX_PAGES as 256

2017-12-18 Thread Ming Lei
Now multipage bvec can cover CONFIG_THP_SWAP, so we don't need to
increase BIO_MAX_PAGES for it.

Signed-off-by: Ming Lei 
---
 include/linux/bio.h | 8 
 1 file changed, 8 deletions(-)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 5c5cd34c9fa3..2bf1e96c5157 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -38,15 +38,7 @@
 #define BIO_BUG_ON
 #endif
 
-#ifdef CONFIG_THP_SWAP
-#if HPAGE_PMD_NR > 256
-#define BIO_MAX_PAGES  HPAGE_PMD_NR
-#else
 #define BIO_MAX_PAGES  256
-#endif
-#else
-#define BIO_MAX_PAGES  256
-#endif
 
 #define bio_prio(bio)  (bio)->bi_ioprio
 #define bio_set_prio(bio, prio)((bio)->bi_ioprio = prio)
-- 
2.9.5



[PATCH V4 39/45] gfs2: convert to bio_for_each_page_all2

2017-12-18 Thread Ming Lei
bio_for_each_page_all() can't be used any more after multipage bvec is
enabled, so we have to convert to bio_for_each_page_all2().

Given bvec can't be changed inside bio_for_each_page_all2(), this patch
marks the bvec parameter as 'const' for gfs2_end_log_write_bh().

Signed-off-by: Ming Lei 
---
 fs/gfs2/lops.c| 6 --
 fs/gfs2/meta_io.c | 3 ++-
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c
index 4579b8433955..29c8751f9672 100644
--- a/fs/gfs2/lops.c
+++ b/fs/gfs2/lops.c
@@ -167,7 +167,8 @@ static u64 gfs2_log_bmap(struct gfs2_sbd *sdp)
  * that is pinned in the pagecache.
  */
 
-static void gfs2_end_log_write_bh(struct gfs2_sbd *sdp, struct bio_vec *bvec,
+static void gfs2_end_log_write_bh(struct gfs2_sbd *sdp,
+ const struct bio_vec *bvec,
  blk_status_t error)
 {
struct buffer_head *bh, *next;
@@ -206,6 +207,7 @@ static void gfs2_end_log_write(struct bio *bio)
struct bio_vec *bvec;
struct page *page;
int i;
+   struct bvec_iter_all bia;
 
if (bio->bi_status) {
fs_err(sdp, "Error %d writing to journal, jid=%u\n",
@@ -213,7 +215,7 @@ static void gfs2_end_log_write(struct bio *bio)
wake_up(&sdp->sd_logd_waitq);
}
 
-   bio_for_each_page_all(bvec, bio, i) {
+   bio_for_each_page_all2(bvec, bio, i, bia) {
page = bvec->bv_page;
if (page_has_buffers(page))
gfs2_end_log_write_bh(sdp, bvec, bio->bi_status);
diff --git a/fs/gfs2/meta_io.c b/fs/gfs2/meta_io.c
index 1d720352310a..a945c9fa1dc6 100644
--- a/fs/gfs2/meta_io.c
+++ b/fs/gfs2/meta_io.c
@@ -190,8 +190,9 @@ static void gfs2_meta_read_endio(struct bio *bio)
 {
struct bio_vec *bvec;
int i;
+   struct bvec_iter_all bia;
 
-   bio_for_each_page_all(bvec, bio, i) {
+   bio_for_each_page_all2(bvec, bio, i, bia) {
struct page *page = bvec->bv_page;
struct buffer_head *bh = page_buffers(page);
unsigned int len = bvec->bv_len;
-- 
2.9.5



[PATCH V4 43/45] block: bio: pass segments to bio if bio_add_page() is bypassed

2017-12-18 Thread Ming Lei
In some situations, such as block direct I/O, we can't use
bio_add_page() to merge pages into a multipage bvec, so a new function
is implemented to convert a page array into a segment array; these
cases can then benefit from multipage bvec too.
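
For illustration only (assuming clustering is enabled and PAGE_SIZE
pages), the conversion done by the new function behaves like this worked
example, where P, P+1 and Q are hypothetical pages and only P and P+1 are
physically contiguous:

    /*
     * pages[] = { P, P+1, Q }  ->  convert_to_segs() returns 2 segments:
     *
     *      pages[0] = P, page_cnt[0] = 1   ->  bv_len = 2 * PAGE_SIZE
     *      pages[1] = Q, page_cnt[1] = 0   ->  bv_len = 1 * PAGE_SIZE
     *
     * With clustering disabled every page stays its own segment.
     */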

Signed-off-by: Ming Lei 
---
 block/bio.c | 54 --
 1 file changed, 48 insertions(+), 6 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 34af328681a8..e808d8352067 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -882,6 +882,41 @@ int bio_add_page(struct bio *bio, struct page *page,
 }
 EXPORT_SYMBOL(bio_add_page);
 
+static unsigned convert_to_segs(struct bio* bio, struct page **pages,
+   unsigned char *page_cnt,
+   unsigned nr_pages)
+{
+
+   unsigned idx;
+   unsigned nr_seg = 0;
+   struct request_queue *q = NULL;
+
+   if (bio->bi_disk)
+   q = bio->bi_disk->queue;
+
+   if (!q || !blk_queue_cluster(q)) {
+   memset(page_cnt, 0, nr_pages);
+   return nr_pages;
+   }
+
+   page_cnt[nr_seg] = 0;
+   for (idx = 1; idx < nr_pages; idx++) {
+   struct page *pg_s = pages[nr_seg];
+   struct page *pg = pages[idx];
+
+   if (page_to_pfn(pg_s) + page_cnt[nr_seg] + 1 ==
+   page_to_pfn(pg)) {
+   page_cnt[nr_seg]++;
+   } else {
+   page_cnt[++nr_seg] = 0;
+   if (nr_seg < idx)
+   pages[nr_seg] = pg;
+   }
+   }
+
+   return nr_seg + 1;
+}
+
 /**
  * bio_iov_iter_get_pages - pin user or kernel pages and add them to a bio
  * @bio: bio to add pages to
@@ -897,6 +932,8 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter 
*iter)
struct page **pages = (struct page **)bv;
size_t offset, diff;
ssize_t size;
+   unsigned short nr_segs;
+   unsigned char page_cnt[nr_pages];   /* at most 256 pages */
 
size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, );
if (unlikely(size <= 0))
@@ -912,13 +949,18 @@ int bio_iov_iter_get_pages(struct bio *bio, struct 
iov_iter *iter)
 * need to be reflected here as well.
 */
bio->bi_iter.bi_size += size;
-   bio->bi_vcnt += nr_pages;
-
diff = (nr_pages * PAGE_SIZE - offset) - size;
-   while (nr_pages--) {
-   bv[nr_pages].bv_page = pages[nr_pages];
-   bv[nr_pages].bv_len = PAGE_SIZE;
-   bv[nr_pages].bv_offset = 0;
+
+   /* convert into segments */
+   nr_segs = convert_to_segs(bio, pages, page_cnt, nr_pages);
+   bio->bi_vcnt += nr_segs;
+
+   while (nr_segs--) {
+   unsigned cnt = (unsigned)page_cnt[nr_segs] + 1;
+
+   bv[nr_segs].bv_page = pages[nr_segs];
+   bv[nr_segs].bv_len = PAGE_SIZE * cnt;
+   bv[nr_segs].bv_offset = 0;
}
 
bv[0].bv_offset += offset;
-- 
2.9.5



[PATCH V4 42/45] block: enable multipage bvecs

2017-12-18 Thread Ming Lei
This patch pulls the trigger for multipage bvecs.

Now any request queue which supports queue cluster will see multipage
bvecs.
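
As a sketch of the effect (not part of the patch), adding physically
contiguous pages one by one now grows a single stored bvec instead of
allocating one bvec per page, provided the queue has clustering enabled;
'pages' and 'nr_pages' are made up for the example:

    #include <linux/bio.h>

    static void add_contig_pages(struct bio *bio, struct page **pages,
                                 int nr_pages)
    {
        int i;

        for (i = 0; i < nr_pages; i++)
            if (bio_add_page(bio, pages[i], PAGE_SIZE, 0) != PAGE_SIZE)
                break;  /* bio full */
    }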

Signed-off-by: Ming Lei 
---
 block/bio.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/block/bio.c b/block/bio.c
index e82e4c815dbb..34af328681a8 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -845,6 +845,11 @@ int bio_add_page(struct bio *bio, struct page *page,
 * a consecutive offset.  Optimize this special case.
 */
if (bio->bi_vcnt > 0) {
+   struct request_queue *q = NULL;
+
+   if (bio->bi_disk)
+   q = bio->bi_disk->queue;
+
bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
 
if (page == bv->bv_page &&
@@ -852,6 +857,14 @@ int bio_add_page(struct bio *bio, struct page *page,
bv->bv_len += len;
goto done;
}
+
+   /* disable multipage bvec too if cluster isn't enabled */
+   if (q && blk_queue_cluster(q) &&
+   (bvec_to_phys(bv) + bv->bv_len ==
+page_to_phys(page) + offset)) {
+   bv->bv_len += len;
+   goto done;
+   }
}
 
if (bio->bi_vcnt >= bio->bi_max_vecs)
-- 
2.9.5



[PATCH V4 40/45] block: kill bio_for_each_page_all()

2017-12-18 Thread Ming Lei
No one uses it any more, so kill it and we can reuse this helper
name.

Signed-off-by: Ming Lei 
---
 include/linux/bio.h | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 899db6701f0d..05027f0df83f 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -155,8 +155,10 @@ static inline void *bio_data(struct bio *bio)
 /*
  * drivers should _never_ use the all version - the bio may have been split
  * before it got to the driver and the driver won't own all of it
+ *
+ * This helper iterates bio segment by segment.
  */
-#define bio_for_each_page_all(bvl, bio, i) \
+#define bio_for_each_segment_all(bvl, bio, i)  \
for (i = 0, bvl = (bio)->bi_io_vec; i < (bio)->bi_vcnt; i++, bvl++)
 
 static inline void __bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
@@ -221,9 +223,6 @@ static inline bool bio_rewind_iter(struct bio *bio, struct 
bvec_iter *iter,
 #define bio_for_each_segment(bvl, bio, iter)   \
__bio_for_each_segment(bvl, bio, iter, (bio)->bi_iter)
 
-#define bio_for_each_segment_all(bvl, bio, i) \
-   bio_for_each_page_all((bvl), (bio), (i))
-
 /*
  * This helper returns singlepage bvec to caller, and the sp bvec is
  * generated in-flight from multipage bvec stored in bvec table. So we
-- 
2.9.5



[PATCH V4 37/45] xfs: convert to bio_for_each_page_all2

2017-12-18 Thread Ming Lei
bio_for_each_page_all() can't be used any more after multipage bvec is
enabled, so we have to convert to bio_for_each_page_all2().

Given bvec can't be changed under bio_for_each_page_all2(), this patch
marks the bvec parameter as 'const' for xfs_finish_page_writeback().

Signed-off-by: Ming Lei 
---
 fs/xfs/xfs_aops.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 8c19f7e0fd32..c0d970817cdc 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -107,7 +107,7 @@ xfs_find_daxdev_for_inode(
 static void
 xfs_finish_page_writeback(
struct inode*inode,
-   struct bio_vec  *bvec,
+   const struct bio_vec*bvec,
int error)
 {
struct buffer_head  *head = page_buffers(bvec->bv_page), *bh = head;
@@ -169,6 +169,7 @@ xfs_destroy_ioend(
for (bio = >io_inline_bio; bio; bio = next) {
struct bio_vec  *bvec;
int i;
+   struct bvec_iter_all bia;
 
/*
 * For the last bio, bi_private points to the ioend, so we
@@ -180,7 +181,7 @@ xfs_destroy_ioend(
next = bio->bi_private;
 
/* walk each page on bio, ending page IO on them */
-   bio_for_each_page_all(bvec, bio, i)
+   bio_for_each_page_all2(bvec, bio, i, bia)
xfs_finish_page_writeback(inode, bvec, error);
 
bio_put(bio);
-- 
2.9.5



[PATCH V4 38/45] exofs: convert to bio_for_each_page_all2

2017-12-18 Thread Ming Lei
bio_for_each_page_all() can't be used any more after multipage bvec is
enabled, so we have to convert to bio_for_each_page_all2().

Signed-off-by: Ming Lei 
---
 fs/exofs/ore.c  | 3 ++-
 fs/exofs/ore_raid.c | 3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/exofs/ore.c b/fs/exofs/ore.c
index 80517934d7c4..5a1d3cd14b44 100644
--- a/fs/exofs/ore.c
+++ b/fs/exofs/ore.c
@@ -406,8 +406,9 @@ static void _clear_bio(struct bio *bio)
 {
struct bio_vec *bv;
unsigned i;
+   struct bvec_iter_all bia;
 
-   bio_for_each_page_all(bv, bio, i) {
+   bio_for_each_page_all2(bv, bio, i, bia) {
unsigned this_count = bv->bv_len;
 
if (likely(PAGE_SIZE == this_count))
diff --git a/fs/exofs/ore_raid.c b/fs/exofs/ore_raid.c
index 2c3346cd1b29..bb0cc314a987 100644
--- a/fs/exofs/ore_raid.c
+++ b/fs/exofs/ore_raid.c
@@ -433,11 +433,12 @@ static void _mark_read4write_pages_uptodate(struct 
ore_io_state *ios, int ret)
/* loop on all devices all pages */
for (d = 0; d < ios->numdevs; d++) {
struct bio *bio = ios->per_dev[d].bio;
+   struct bvec_iter_all bia;
 
if (!bio)
continue;
 
-   bio_for_each_page_all(bv, bio, i) {
+   bio_for_each_page_all2(bv, bio, i, bia) {
struct page *page = bv->bv_page;
 
SetPageUptodate(page);
-- 
2.9.5



[PATCH V4 28/45] block: loop: pass segments to iov_iter

2017-12-18 Thread Ming Lei
iov_iter is implemented with bvec itererator, so it is safe to pass
segment to it, and this way is much more efficient than passing one
page in each bvec.

Signed-off-by: Ming Lei 
---
 drivers/block/loop.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 8e30d081ad2a..90e3f402af62 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -499,7 +499,7 @@ static int lo_rw_aio(struct loop_device *lo, struct 
loop_cmd *cmd,
struct bio_vec tmp;
 
__rq_for_each_bio(bio, rq)
-   segments += bio_pages(bio);
+   segments += bio_segments(bio);
bvec = kmalloc(sizeof(struct bio_vec) * segments, GFP_NOIO);
if (!bvec)
return -EIO;
@@ -511,7 +511,7 @@ static int lo_rw_aio(struct loop_device *lo, struct 
loop_cmd *cmd,
 * copy bio->bi_iov_vec to new bvec. The rq_for_each_page
 * API will take care of all details for us.
 */
-   rq_for_each_page(tmp, rq, iter) {
+   rq_for_each_segment(tmp, rq, iter) {
*bvec = tmp;
bvec++;
}
@@ -525,7 +525,7 @@ static int lo_rw_aio(struct loop_device *lo, struct 
loop_cmd *cmd,
 */
offset = bio->bi_iter.bi_bvec_done;
bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
-   segments = bio_pages(bio);
+   segments = bio_segments(bio);
}
atomic_set(&cmd->ref, 2);
 
-- 
2.9.5



[PATCH V4 26/45] block: introduce bio_segments()

2017-12-18 Thread Ming Lei
There are still cases in which we need to use bio_segments() to get the
number of segments, so introduce it.
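
The difference from bio_pages() can be illustrated as follows (not part
of the patch, assuming a normal READ/WRITE bio whose single stored bvec
covers two physically contiguous pages):

    /*
     * bio_pages(bio)    == 2   -- counts singlepage bvecs
     * bio_segments(bio) == 1   -- counts multipage segments
     */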

Signed-off-by: Ming Lei 
---
 include/linux/bio.h | 25 -
 1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 2dd1ca0285e1..205a914ee3c0 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -223,9 +223,9 @@ static inline bool bio_rewind_iter(struct bio *bio, struct 
bvec_iter *iter,
 
 #define bio_iter_last(bvec, iter) ((iter).bi_size == (bvec).bv_len)
 
-static inline unsigned bio_pages(struct bio *bio)
+static inline unsigned __bio_elements(struct bio *bio, bool seg)
 {
-   unsigned segs = 0;
+   unsigned elems = 0;
struct bio_vec bv;
struct bvec_iter iter;
 
@@ -245,10 +245,25 @@ static inline unsigned bio_pages(struct bio *bio)
break;
}
 
-   bio_for_each_page(bv, bio, iter)
-   segs++;
+   if (!seg) {
+   bio_for_each_page(bv, bio, iter)
+   elems++;
+   } else {
+   bio_for_each_segment(bv, bio, iter)
+   elems++;
+   }
+
+   return elems;
+}
+
+static inline unsigned bio_pages(struct bio *bio)
+{
+   return __bio_elements(bio, false);
+}
 
-   return segs;
+static inline unsigned bio_segments(struct bio *bio)
+{
+   return __bio_elements(bio, true);
 }
 
 /*
-- 
2.9.5



[PATCH V4 15/45] block: rename bio_for_each_segment* with bio_for_each_page*

2017-12-18 Thread Ming Lei
It is a tree-wide mechanical replacement, since both bio_for_each_segment()
and bio_for_each_segment_all() never return a real segment at all; they
just return one page per bvec and have deceived us for a long time, so fix
their names.

This is a pre-patch for supporting multipage bvec. Once multipage bvec
is in, each bvec will store a real multipage segment, so people won't be
confused by these misleading names.

Signed-off-by: Ming Lei 
---
 Documentation/block/biovecs.txt |  4 ++--
 arch/m68k/emu/nfblock.c |  2 +-
 arch/powerpc/sysdev/axonram.c   |  2 +-
 arch/xtensa/platforms/iss/simdisk.c |  2 +-
 block/bio-integrity.c   |  2 +-
 block/bio.c | 24 
 block/blk-merge.c   |  6 +++---
 block/blk-zoned.c   |  4 ++--
 block/bounce.c  |  8 
 drivers/block/aoe/aoecmd.c  |  4 ++--
 drivers/block/brd.c |  2 +-
 drivers/block/drbd/drbd_main.c  |  4 ++--
 drivers/block/drbd/drbd_receiver.c  |  2 +-
 drivers/block/drbd/drbd_worker.c|  2 +-
 drivers/block/nbd.c |  2 +-
 drivers/block/null_blk.c|  2 +-
 drivers/block/ps3vram.c |  2 +-
 drivers/block/rbd.c |  2 +-
 drivers/block/rsxx/dma.c|  2 +-
 drivers/block/zram/zram_drv.c   |  2 +-
 drivers/md/bcache/btree.c   |  2 +-
 drivers/md/bcache/debug.c   |  2 +-
 drivers/md/bcache/request.c |  2 +-
 drivers/md/bcache/util.c|  2 +-
 drivers/md/dm-crypt.c   |  2 +-
 drivers/md/dm-integrity.c   |  4 ++--
 drivers/md/dm-log-writes.c  |  2 +-
 drivers/md/dm.c |  2 +-
 drivers/md/raid1.c  |  2 +-
 drivers/md/raid5.c  |  2 +-
 drivers/nvdimm/blk.c|  2 +-
 drivers/nvdimm/btt.c|  2 +-
 drivers/nvdimm/pmem.c   |  2 +-
 drivers/s390/block/dcssblk.c|  2 +-
 drivers/s390/block/xpram.c  |  2 +-
 fs/block_dev.c  |  4 ++--
 fs/btrfs/check-integrity.c  |  4 ++--
 fs/btrfs/compression.c  |  2 +-
 fs/btrfs/disk-io.c  |  2 +-
 fs/btrfs/extent_io.c|  6 +++---
 fs/btrfs/file-item.c|  4 ++--
 fs/btrfs/inode.c|  8 
 fs/btrfs/raid56.c   |  4 ++--
 fs/crypto/bio.c |  2 +-
 fs/direct-io.c  |  2 +-
 fs/exofs/ore.c  |  2 +-
 fs/exofs/ore_raid.c |  2 +-
 fs/ext4/page-io.c   |  2 +-
 fs/ext4/readpage.c  |  2 +-
 fs/f2fs/data.c  |  6 +++---
 fs/gfs2/lops.c  |  2 +-
 fs/gfs2/meta_io.c   |  2 +-
 fs/iomap.c  |  2 +-
 fs/mpage.c  |  2 +-
 fs/xfs/xfs_aops.c   |  2 +-
 include/linux/bio.h | 10 +-
 include/linux/blkdev.h  |  2 +-
 57 files changed, 93 insertions(+), 93 deletions(-)

diff --git a/Documentation/block/biovecs.txt b/Documentation/block/biovecs.txt
index 25689584e6e0..b4d238b8d9fc 100644
--- a/Documentation/block/biovecs.txt
+++ b/Documentation/block/biovecs.txt
@@ -28,10 +28,10 @@ normal code doesn't have to deal with bi_bvec_done.
constructed from the raw biovecs but taking into account bi_bvec_done and
bi_size.
 
-   bio_for_each_segment() has been updated to take a bvec_iter argument
+   bio_for_each_page() has been updated to take a bvec_iter argument
instead of an integer (that corresponded to bi_idx); for a lot of code the
conversion just required changing the types of the arguments to
-   bio_for_each_segment().
+   bio_for_each_page().
 
  * Advancing a bvec_iter is done with bio_advance_iter(); bio_advance() is a
wrapper around bio_advance_iter() that operates on bio->bi_iter, and also
diff --git a/arch/m68k/emu/nfblock.c b/arch/m68k/emu/nfblock.c
index e9110b9b8bcd..8b226eac9289 100644
--- a/arch/m68k/emu/nfblock.c
+++ b/arch/m68k/emu/nfblock.c
@@ -69,7 +69,7 @@ static blk_qc_t nfhd_make_request(struct request_queue 
*queue, struct bio *bio)
 
dir = bio_data_dir(bio);
shift = dev->bshift;
-   bio_for_each_segment(bvec, bio, iter) {
+   bio_for_each_page(bvec, bio, iter) {
len = bvec.bv_len;
len >>= 9;
nfhd_read_write(dev->id, 0, dir, sec >> shift, len >> shift,
diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index 1b307c80b401..c88e041a6a8e 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -121,7 +121,7 @@ axon_ram_make_request(struct request_queue *queue, struct 
bio *bio)
AXON_RAM_SECTOR_SHIFT);
phys_end = bank->io_addr + bank->size;
transfered = 0;
-   

[PATCH V4 18/45] block: introduce multipage page bvec helpers

2017-12-18 Thread Ming Lei
This patch introduces 'bvec_iter_segment_*' helpers for multipage
bvec (segment) support.

The introduced interfaces treat one bvec as a real multipage segment;
for example, .bv_len is the total length of the multipage segment.

The existing bvec_iter_* helpers are interfaces supporting the current
bvec iterator, which drivers, filesystems, dm and so on still think of
as singlepage only. These helpers build singlepage bvecs in flight, so
users of the current bio/bvec iterator keep working and need not change
even though we store a real multipage segment in the bvec table.
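
A worked example of the in-flight singlepage view (not part of the
patch, assuming PAGE_SIZE == 4096):

    /*
     * One stored multipage bvec
     *      { .bv_page = P, .bv_offset = 512, .bv_len = 8192 }
     * with iter.bi_bvec_done = 4096 and iter.bi_size >= 4096 yields
     *
     *      bvec_iter_segment_offset() = 512 + 4096 = 4608
     *      bvec_iter_page()           = nth_page(P, 4608 / 4096) = P + 1
     *      bvec_iter_offset()         = 4608 % 4096 = 512
     *      bvec_iter_len()            = min(8192 - 4096, 4096 - 512) = 3584
     *
     * i.e. callers keep seeing singlepage bvecs built on the fly.
     */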

Signed-off-by: Ming Lei 
---
 include/linux/bvec.h | 63 +---
 1 file changed, 60 insertions(+), 3 deletions(-)

diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index fe7a22dd133b..2433c73fa5ea 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -23,6 +23,44 @@
 #include 
 #include 
 #include 
+#include 
+
+/*
+ * What is multipage bvecs(segment)?
+ *
+ * - bvec stored in bio->bi_io_vec is always multipage(mp) style
+ *
+ * - bvec(struct bio_vec) represents one physically contiguous I/O
+ *   buffer, now the buffer may include more than one pages since
+ *   multipage(mp) bvec is supported, and all these pages represented
+ *   by one bvec is physically contiguous. Before mp support, at most
+ *   one page can be included in one bvec, we call it singlepage(sp)
+ *   bvec.
+ *
+ * - .bv_page of the bvec represents the 1st page in the mp segment
+ *
+ * - .bv_offset of the bvec represents offset of the buffer in the bvec
+ *
+ * The effect on the current drivers/filesystem/dm/bcache/...:
+ *
+ * - almost everyone supposes that one bvec only includes one single
+ *   page, so we keep the sp interface not changed, for example,
+ *   bio_for_each_page() still returns bvec with single page
+ *
+ * - bio_for_each_page_all() will be changed to return singlepage
+ *   bvec too
+ *
+ * - during iterating, iterator variable(struct bvec_iter) is always
+ *   updated in multipage bvec style and that means bvec_iter_advance()
+ *   is kept not changed
+ *
+ * - returned(copied) singlepage bvec is generated in flight by bvec
+ *   helpers from the stored multipage bvec(segment)
+ *
+ * - In case that some components(such as iov_iter) need to support
+ *   multipage segment, we introduce new helpers(bvec_iter_segment_*) for
+ *   them.
+ */
 
 /*
  * was unsigned short, but we might as well be ready for > 64kB I/O pages
@@ -52,16 +90,35 @@ struct bvec_iter {
  */
 #define __bvec_iter_bvec(bvec, iter)   (&(bvec)[(iter).bi_idx])
 
-#define bvec_iter_page(bvec, iter) \
+#define bvec_iter_segment_page(bvec, iter) \
(__bvec_iter_bvec((bvec), (iter))->bv_page)
 
-#define bvec_iter_len(bvec, iter)  \
+#define bvec_iter_segment_len(bvec, iter)  \
min((iter).bi_size, \
__bvec_iter_bvec((bvec), (iter))->bv_len - (iter).bi_bvec_done)
 
-#define bvec_iter_offset(bvec, iter)   \
+#define bvec_iter_segment_offset(bvec, iter)   \
(__bvec_iter_bvec((bvec), (iter))->bv_offset + (iter).bi_bvec_done)
 
+#define bvec_iter_page_idx_in_seg(bvec, iter)  \
+   (bvec_iter_segment_offset((bvec), (iter)) / PAGE_SIZE)
+
+/*
+ * <page, offset, len> of singlepage(sp) segment.
+ *
+ * These helpers will be implemented for building sp bvecs in flight.
+ */
+#define bvec_iter_offset(bvec, iter)   \
+   (bvec_iter_segment_offset((bvec), (iter)) % PAGE_SIZE)
+
+#define bvec_iter_len(bvec, iter)  \
+   min_t(unsigned, bvec_iter_segment_len((bvec), (iter)),  \
+   (PAGE_SIZE - (bvec_iter_offset((bvec), (iter)
+
+#define bvec_iter_page(bvec, iter) \
+   nth_page(bvec_iter_segment_page((bvec), (iter)),\
+bvec_iter_page_idx_in_seg((bvec), (iter)))
+
 #define bvec_iter_bvec(bvec, iter) \
 ((struct bio_vec) {\
.bv_page= bvec_iter_page((bvec), (iter)),   \
-- 
2.9.5



[PATCH V4 13/45] block: blk-merge: try to make front segments in full size

2017-12-18 Thread Ming Lei
When merging one bvec into a segment, if the bvec is too big to merge,
the current policy is to move the whole bvec into another new segment.

This patch changes the policy to try to maximize the size of the front
segments: in the situation above, part of the bvec is merged into the
current segment, and the remainder is put into the next segment.

This prepares for multipage bvec support because this case can be quite
common then, and we should try to make the front segments full size.
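
A worked example of the new policy (not part of the patch, assuming a
4096-byte max segment size):

    /*
     * current seg_size = 3072, incoming bvec of 2048 bytes:
     *
     *      advance = 4096 - 3072 = 1024
     *
     * so 1024 bytes are merged to fill the front segment to 4096, and
     * the remaining 1024 bytes start the next segment.  The old policy
     * would have left a 3072-byte segment and started a new 2048-byte
     * one.
     */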

Signed-off-by: Ming Lei 
---
 block/blk-merge.c | 54 +-
 1 file changed, 49 insertions(+), 5 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index a476337a8ff4..42ceb89bc566 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -109,6 +109,7 @@ static struct bio *blk_bio_segment_split(struct 
request_queue *q,
bool do_split = true;
struct bio *new = NULL;
const unsigned max_sectors = get_max_io_size(q, bio);
+   unsigned advance = 0;
 
bio_for_each_segment(bv, bio, iter) {
/*
@@ -134,12 +135,32 @@ static struct bio *blk_bio_segment_split(struct 
request_queue *q,
}
 
if (bvprvp && blk_queue_cluster(q)) {
-   if (seg_size + bv.bv_len > queue_max_segment_size(q))
-   goto new_segment;
if (!BIOVEC_PHYS_MERGEABLE(bvprvp, ))
goto new_segment;
if (!BIOVEC_SEG_BOUNDARY(q, bvprvp, ))
goto new_segment;
+   if (seg_size + bv.bv_len > queue_max_segment_size(q)) {
+   /*
+* On assumption is that initial value of
+* @seg_size(equals to bv.bv_len) won't be
+* bigger than max segment size, but will
+* becomes false after multipage bvec comes.
+*/
+   advance = queue_max_segment_size(q) - seg_size;
+
+   if (advance > 0) {
+   seg_size += advance;
+   sectors += advance >> 9;
+   bv.bv_len -= advance;
+   bv.bv_offset += advance;
+   }
+
+   /*
+* Still need to put remainder of current
+* bvec into a new segment.
+*/
+   goto new_segment;
+   }
 
seg_size += bv.bv_len;
bvprv = bv;
@@ -161,6 +182,12 @@ static struct bio *blk_bio_segment_split(struct 
request_queue *q,
seg_size = bv.bv_len;
sectors += bv.bv_len >> 9;
 
+   /* restore the bvec for iterator */
+   if (advance) {
+   bv.bv_len += advance;
+   bv.bv_offset -= advance;
+   advance = 0;
+   }
}
 
do_split = false;
@@ -361,16 +388,29 @@ __blk_segment_map_sg(struct request_queue *q, struct 
bio_vec *bvec,
 {
 
int nbytes = bvec->bv_len;
+   unsigned advance = 0;
 
if (*sg && *cluster) {
-   if ((*sg)->length + nbytes > queue_max_segment_size(q))
-   goto new_segment;
-
if (!BIOVEC_PHYS_MERGEABLE(bvprv, bvec))
goto new_segment;
if (!BIOVEC_SEG_BOUNDARY(q, bvprv, bvec))
goto new_segment;
 
+   /*
+* Try our best to merge part of the bvec into the
+* previous segment, following the same policy as
+* blk_bio_segment_split().
+*/
+   if ((*sg)->length + nbytes > queue_max_segment_size(q)) {
+   advance = queue_max_segment_size(q) - (*sg)->length;
+   if (advance) {
+   (*sg)->length += advance;
+   bvec->bv_offset += advance;
+   bvec->bv_len -= advance;
+   }
+   goto new_segment;
+   }
+
(*sg)->length += nbytes;
} else {
 new_segment:
@@ -393,6 +433,10 @@ __blk_segment_map_sg(struct request_queue *q, struct 
bio_vec *bvec,
 
sg_set_page(*sg, bvec->bv_page, nbytes, bvec->bv_offset);
(*nsegs)++;
+
+   /* for making iterator happy */
+   bvec->bv_offset -= advance;
+   bvec->bv_len += advance;
}
*bvprv = *bvec;
 }
-- 
2.9.5



[PATCH V4 14/45] block: blk-merge: remove unnecessary check

2017-12-18 Thread Ming Lei
In this case, 'sectors' can't be zero at all, so remove the check
and let the bio be split.

Reviewed-by: Christoph Hellwig 
Signed-off-by: Ming Lei 
---
 block/blk-merge.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 42ceb89bc566..39f2c1113423 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -129,9 +129,7 @@ static struct bio *blk_bio_segment_split(struct 
request_queue *q,
nsegs++;
sectors = max_sectors;
}
-   if (sectors)
-   goto split;
-   /* Make this single bvec as the 1st segment */
+   goto split;
}
 
if (bvprvp && blk_queue_cluster(q)) {
-- 
2.9.5



[PATCH V4 12/45] blk-merge: compute bio->bi_seg_front_size efficiently

2017-12-18 Thread Ming Lei
It is enough to check and compute bio->bi_seg_front_size just
after the 1st segment is found, but the current code checks it
for each bvec, which is inefficient.

This patch follows the approach of __blk_recalc_rq_segments()
for computing bio->bi_seg_front_size; it is more efficient and
the code becomes more readable too.

Signed-off-by: Ming Lei 
---
 block/blk-merge.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index f5dedd57dff6..a476337a8ff4 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -146,22 +146,21 @@ static struct bio *blk_bio_segment_split(struct 
request_queue *q,
	bvprvp = &bvprv;
sectors += bv.bv_len >> 9;
 
-   if (nsegs == 1 && seg_size > front_seg_size)
-   front_seg_size = seg_size;
continue;
}
 new_segment:
if (nsegs == queue_max_segments(q))
goto split;
 
+   if (nsegs == 1 && seg_size > front_seg_size)
+   front_seg_size = seg_size;
+
nsegs++;
bvprv = bv;
	bvprvp = &bvprv;
seg_size = bv.bv_len;
sectors += bv.bv_len >> 9;
 
-   if (nsegs == 1 && seg_size > front_seg_size)
-   front_seg_size = seg_size;
}
 
do_split = false;
@@ -174,6 +173,8 @@ static struct bio *blk_bio_segment_split(struct 
request_queue *q,
bio = new;
}
 
+   if (nsegs == 1 && seg_size > front_seg_size)
+   front_seg_size = seg_size;
bio->bi_seg_front_size = front_seg_size;
if (seg_size > bio->bi_seg_back_size)
bio->bi_seg_back_size = seg_size;
-- 
2.9.5



[PATCH V4 17/45] block: rename bio_segments() with bio_pages()

2017-12-18 Thread Ming Lei
bio_segments() never returns the count of actual segments, just like
the original bio_for_each_segment(), so rename it to bio_pages().
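
For illustration (a user-space sketch with a made-up physical page layout,
not kernel code): what the helper really counts is pages, while the number of
hardware segments can be smaller once physically contiguous pages are merged:

#include <stdio.h>

int main(void)
{
	/* physical page frame numbers backing an 8-page bio (made-up layout) */
	unsigned pfns[] = { 10, 11, 12, 40, 41, 90, 91, 92 };
	unsigned npages = sizeof(pfns) / sizeof(pfns[0]);
	unsigned nsegs = 1;

	for (unsigned i = 1; i < npages; i++)
		if (pfns[i] != pfns[i - 1] + 1)	/* physical contiguity broken */
			nsegs++;

	printf("pages: %u, merged segments: %u\n", npages, nsegs);
	return 0;
}

A helper that walks one page per iteration reports 8 here, not 3, so calling
it bio_pages() matches what it actually returns.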

Signed-off-by: Ming Lei 
---
 block/bio.c| 2 +-
 block/blk-merge.c  | 2 +-
 drivers/block/loop.c   | 4 ++--
 drivers/md/dm-log-writes.c | 2 +-
 drivers/target/target_core_pscsi.c | 2 +-
 fs/btrfs/check-integrity.c | 2 +-
 fs/btrfs/inode.c   | 2 +-
 include/linux/bio.h| 2 +-
 8 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index b93677f8f682..1649dc465af7 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -679,7 +679,7 @@ struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t 
gfp_mask,
 *__bio_clone_fast() anyways.
 */
 
-   bio = bio_alloc_bioset(gfp_mask, bio_segments(bio_src), bs);
+   bio = bio_alloc_bioset(gfp_mask, bio_pages(bio_src), bs);
if (!bio)
return NULL;
bio->bi_disk= bio_src->bi_disk;
diff --git a/block/blk-merge.c b/block/blk-merge.c
index b571e91b67f6..25ffb84be058 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -329,7 +329,7 @@ void blk_recount_segments(struct request_queue *q, struct 
bio *bio)
 
/* estimate segment number by bi_vcnt for non-cloned bio */
if (bio_flagged(bio, BIO_CLONED))
-   seg_cnt = bio_segments(bio);
+   seg_cnt = bio_pages(bio);
else
seg_cnt = bio->bi_vcnt;
 
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 7f56422d0066..8e30d081ad2a 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -499,7 +499,7 @@ static int lo_rw_aio(struct loop_device *lo, struct 
loop_cmd *cmd,
struct bio_vec tmp;
 
__rq_for_each_bio(bio, rq)
-   segments += bio_segments(bio);
+   segments += bio_pages(bio);
bvec = kmalloc(sizeof(struct bio_vec) * segments, GFP_NOIO);
if (!bvec)
return -EIO;
@@ -525,7 +525,7 @@ static int lo_rw_aio(struct loop_device *lo, struct 
loop_cmd *cmd,
 */
offset = bio->bi_iter.bi_bvec_done;
bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
-   segments = bio_segments(bio);
+   segments = bio_pages(bio);
}
	atomic_set(&cmd->ref, 2);
 
diff --git a/drivers/md/dm-log-writes.c b/drivers/md/dm-log-writes.c
index cd023ff6a33b..1a7436a8aa2a 100644
--- a/drivers/md/dm-log-writes.c
+++ b/drivers/md/dm-log-writes.c
@@ -723,7 +723,7 @@ static int log_writes_map(struct dm_target *ti, struct bio 
*bio)
if (discard_bio)
alloc_size = sizeof(struct pending_block);
else
-   alloc_size = sizeof(struct pending_block) + sizeof(struct 
bio_vec) * bio_segments(bio);
+   alloc_size = sizeof(struct pending_block) + sizeof(struct 
bio_vec) * bio_pages(bio);
 
block = kzalloc(alloc_size, GFP_NOIO);
if (!block) {
diff --git a/drivers/target/target_core_pscsi.c 
b/drivers/target/target_core_pscsi.c
index 7c69b4a9694d..88b0502fffbc 100644
--- a/drivers/target/target_core_pscsi.c
+++ b/drivers/target/target_core_pscsi.c
@@ -914,7 +914,7 @@ pscsi_map_sg(struct se_cmd *cmd, struct scatterlist *sgl, 
u32 sgl_nents,
rc = bio_add_pc_page(pdv->pdv_sd->request_queue,
bio, page, bytes, off);
pr_debug("PSCSI: bio->bi_vcnt: %d nr_vecs: %d\n",
-   bio_segments(bio), nr_vecs);
+   bio_pages(bio), nr_vecs);
if (rc != bytes) {
pr_debug("PSCSI: Reached bio->bi_vcnt max:"
" %d i: %d bio: %p, allocating another"
diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
index d200389099db..aac952f47636 100644
--- a/fs/btrfs/check-integrity.c
+++ b/fs/btrfs/check-integrity.c
@@ -2813,7 +2813,7 @@ static void __btrfsic_submit_bio(struct bio *bio)
struct bvec_iter iter;
int bio_is_patched;
char **mapped_datav;
-   unsigned int segs = bio_segments(bio);
+   unsigned int segs = bio_pages(bio);
 
dev_bytenr = 512 * bio->bi_iter.bi_sector;
bio_is_patched = 0;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c279030ee5ed..fda9f2a92f7a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8030,7 +8030,7 @@ static blk_status_t dio_read_error(struct inode *inode, 
struct bio *failed_bio,
return BLK_STS_IOERR;
}
 
-   segs = bio_segments(failed_bio);
+   segs = bio_pages(failed_bio);
	bio_get_first_bvec(failed_bio, &bvec);
if (segs > 1 ||
(bvec.bv_len > btrfs_inode_sectorsize(inode)))

[PATCH V4 16/45] block: rename rq_for_each_segment as rq_for_each_page

2017-12-18 Thread Ming Lei
rq_for_each_segment() is still misleading, since this helper only returns
one page in each bvec it yields, so fix its name.

Signed-off-by: Ming Lei 
---
 Documentation/block/biodoc.txt |  6 +++---
 block/blk-core.c   |  2 +-
 drivers/block/floppy.c |  4 ++--
 drivers/block/loop.c   | 12 ++--
 drivers/block/nbd.c|  2 +-
 drivers/block/null_blk.c   |  2 +-
 drivers/block/ps3disk.c|  4 ++--
 drivers/s390/block/dasd_diag.c |  4 ++--
 drivers/s390/block/dasd_eckd.c | 16 
 drivers/s390/block/dasd_fba.c  |  6 +++---
 drivers/s390/block/scm_blk.c   |  2 +-
 include/linux/blkdev.h |  4 ++--
 12 files changed, 32 insertions(+), 32 deletions(-)

diff --git a/Documentation/block/biodoc.txt b/Documentation/block/biodoc.txt
index 86927029a52d..3aeca60e526a 100644
--- a/Documentation/block/biodoc.txt
+++ b/Documentation/block/biodoc.txt
@@ -458,7 +458,7 @@ With this multipage bio design:
 - A linked list of bios is used as before for unrelated merges (*) - this
   avoids reallocs and makes independent completions easier to handle.
 - Code that traverses the req list can find all the segments of a bio
-  by using rq_for_each_segment.  This handles the fact that a request
+  by using rq_for_each_page.  This handles the fact that a request
   has multiple bios, each of which can have multiple segments.
 - Drivers which can't process a large bio in one shot can use the bi_iter
   field to keep track of the next bio_vec entry to process.
@@ -640,13 +640,13 @@ in lvm or md.
 
 3.2.1 Traversing segments and completion units in a request
 
-The macro rq_for_each_segment() should be used for traversing the bios
+The macro rq_for_each_page() should be used for traversing the bios
 in the request list (drivers should avoid directly trying to do it
 themselves). Using these helpers should also make it easier to cope
 with block changes in the future.
 
struct req_iterator iter;
-   rq_for_each_segment(bio_vec, rq, iter)
+   rq_for_each_page(bio_vec, rq, iter)
/* bio_vec is now current segment */
 
 I/O completion callbacks are per-bio rather than per-segment, so drivers
diff --git a/block/blk-core.c b/block/blk-core.c
index b8881750a3ac..bc9d3c754a9a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -3267,7 +3267,7 @@ void rq_flush_dcache_pages(struct request *rq)
struct req_iterator iter;
struct bio_vec bvec;
 
-   rq_for_each_segment(bvec, rq, iter)
+   rq_for_each_page(bvec, rq, iter)
flush_dcache_page(bvec.bv_page);
 }
 EXPORT_SYMBOL_GPL(rq_flush_dcache_pages);
diff --git a/drivers/block/floppy.c b/drivers/block/floppy.c
index eae484acfbbc..556c29dc94e1 100644
--- a/drivers/block/floppy.c
+++ b/drivers/block/floppy.c
@@ -2382,7 +2382,7 @@ static int buffer_chain_size(void)
base = bio_data(current_req->bio);
size = 0;
 
-   rq_for_each_segment(bv, current_req, iter) {
+   rq_for_each_page(bv, current_req, iter) {
if (page_address(bv.bv_page) + bv.bv_offset != base + size)
break;
 
@@ -2446,7 +2446,7 @@ static void copy_buffer(int ssize, int max_sector, int 
max_sector_2)
 
size = blk_rq_cur_bytes(current_req);
 
-   rq_for_each_segment(bv, current_req, iter) {
+   rq_for_each_page(bv, current_req, iter) {
if (!remaining)
break;
 
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index bc8e61506968..7f56422d0066 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -290,7 +290,7 @@ static int lo_write_simple(struct loop_device *lo, struct 
request *rq,
struct req_iterator iter;
int ret = 0;
 
-   rq_for_each_segment(bvec, rq, iter) {
+   rq_for_each_page(bvec, rq, iter) {
		ret = lo_write_bvec(lo->lo_backing_file, &bvec, &pos);
if (ret < 0)
break;
@@ -317,7 +317,7 @@ static int lo_write_transfer(struct loop_device *lo, struct 
request *rq,
if (unlikely(!page))
return -ENOMEM;
 
-   rq_for_each_segment(bvec, rq, iter) {
+   rq_for_each_page(bvec, rq, iter) {
ret = lo_do_transfer(lo, WRITE, page, 0, bvec.bv_page,
bvec.bv_offset, bvec.bv_len, pos >> 9);
if (unlikely(ret))
@@ -343,7 +343,7 @@ static int lo_read_simple(struct loop_device *lo, struct 
request *rq,
struct iov_iter i;
ssize_t len;
 
-   rq_for_each_segment(bvec, rq, iter) {
+   rq_for_each_page(bvec, rq, iter) {
		iov_iter_bvec(&i, ITER_BVEC, &bvec, 1, bvec.bv_len);
		len = vfs_iter_read(lo->lo_backing_file, &i, &pos, 0);
if (len < 0)
@@ -378,7 +378,7 @@ static int lo_read_transfer(struct loop_device *lo, struct 
request *rq,
if (unlikely(!page))
return -ENOMEM;
 
-   rq_for_each_segment(bvec, rq, iter) {
+   

[PATCH V4 11/45] dm-crypt: don't clear bvec->bv_page in crypt_free_buffer_pages()

2017-12-18 Thread Ming Lei
The bio is always freed after running crypt_free_buffer_pages(), so it
isn't necessary to clear the bv->bv_page.

Cc: Mike Snitzer 
Cc:dm-de...@redhat.com
Signed-off-by: Ming Lei 
---
 drivers/md/dm-crypt.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 9fc12f556534..48332666fc38 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -1446,7 +1446,6 @@ static void crypt_free_buffer_pages(struct crypt_config 
*cc, struct bio *clone)
bio_for_each_segment_all(bv, clone, i) {
BUG_ON(!bv->bv_page);
mempool_free(bv->bv_page, cc->page_pool);
-   bv->bv_page = NULL;
}
 }
 
-- 
2.9.5



[PATCH V4 07/45] bcache: comment on direct access to bvec table

2017-12-18 Thread Ming Lei
All direct accesses to the bvec table are safe even after multipage bvec is supported.

Cc: linux-bca...@vger.kernel.org
Acked-by: Coly Li 
Signed-off-by: Ming Lei 
---
 drivers/md/bcache/btree.c | 1 +
 drivers/md/bcache/util.c  | 7 +++
 2 files changed, 8 insertions(+)

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 81e8dc3dbe5e..02a4cf646fdc 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -432,6 +432,7 @@ static void do_btree_node_write(struct btree *b)
 
continue_at(cl, btree_node_write_done, NULL);
} else {
+   /* No problem for multipage bvec since the bio is just 
allocated */
b->bio->bi_vcnt = 0;
bch_bio_map(b->bio, i);
 
diff --git a/drivers/md/bcache/util.c b/drivers/md/bcache/util.c
index e548b8b51322..61813d230015 100644
--- a/drivers/md/bcache/util.c
+++ b/drivers/md/bcache/util.c
@@ -249,6 +249,13 @@ uint64_t bch_next_delay(struct bch_ratelimit *d, uint64_t 
done)
: 0;
 }
 
+/*
+ * Generally it isn't good to access .bi_io_vec and .bi_vcnt directly;
+ * the preferred way is bio_add_page(). But in this case bch_bio_map()
+ * assumes that the bvec table is empty, so it is safe to access
+ * .bi_vcnt & .bi_io_vec in this way even after multipage bvec is
+ * supported.
+ */
 void bch_bio_map(struct bio *bio, void *base)
 {
size_t size = bio->bi_iter.bi_size;
-- 
2.9.5



[PATCH V4 04/45] block: bounce: avoid direct access to bvec table

2017-12-18 Thread Ming Lei
We will support multipage bvecs in the future, so switch to the
iterator way of getting the bv_page of a bvec from the original bio.

Cc: Matthew Wilcox 
Signed-off-by: Ming Lei 
---
 block/bounce.c | 17 -
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/block/bounce.c b/block/bounce.c
index fceb1a96480b..0274c31d6c05 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -137,21 +137,20 @@ static void copy_to_high_bio_irq(struct bio *to, struct 
bio *from)
 static void bounce_end_io(struct bio *bio, mempool_t *pool)
 {
struct bio *bio_orig = bio->bi_private;
-   struct bio_vec *bvec, *org_vec;
+   struct bio_vec *bvec, orig_vec;
int i;
-   int start = bio_orig->bi_iter.bi_idx;
+   struct bvec_iter orig_iter = bio_orig->bi_iter;
 
/*
 * free up bounce indirect pages used
 */
bio_for_each_segment_all(bvec, bio, i) {
-   org_vec = bio_orig->bi_io_vec + i + start;
-
-   if (bvec->bv_page == org_vec->bv_page)
-   continue;
-
-   dec_zone_page_state(bvec->bv_page, NR_BOUNCE);
-   mempool_free(bvec->bv_page, pool);
+   orig_vec = bio_iter_iovec(bio_orig, orig_iter);
+   if (bvec->bv_page != orig_vec.bv_page) {
+   dec_zone_page_state(bvec->bv_page, NR_BOUNCE);
+   mempool_free(bvec->bv_page, pool);
+   }
+   bio_advance_iter(bio_orig, &orig_iter, orig_vec.bv_len);
}
 
bio_orig->bi_status = bio->bi_status;
-- 
2.9.5



[PATCH V4 05/45] block: bounce: don't access bio->bi_io_vec in copy_to_high_bio_irq

2017-12-18 Thread Ming Lei
First, this patch introduces BVEC_ITER_ALL_INIT for iterating one bio
from start to end.

Since we need to support multipage bvecs, don't access bio->bi_io_vec
in copy_to_high_bio_irq(); just use the standard iterator to do that.
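
A sketch of the pattern this enables (walk_whole_bio() is a made-up helper,
not part of the kernel; it simply mirrors the bio_iter_iovec()/
bio_advance_iter() usage in the diff below): keep a private iterator
initialized from BVEC_ITER_ALL_INIT and advance it yourself instead of
trusting bio->bi_iter:

#include <linux/bio.h>

/*
 * Illustration only: visit every bvec of @bio from the very beginning,
 * no matter how far bio->bi_iter has already advanced (e.g. by splitting).
 */
static void walk_whole_bio(struct bio *bio)
{
	struct bvec_iter iter = BVEC_ITER_ALL_INIT;
	struct bio_vec bv;

	/* bi_size is UINT_MAX here, so bound the walk by bi_vcnt instead */
	while (iter.bi_idx < bio->bi_vcnt) {
		bv = bio_iter_iovec(bio, iter);
		/* ... use bv.bv_page / bv.bv_offset / bv.bv_len ... */
		bio_advance_iter(bio, &iter, bv.bv_len);
	}
}

copy_to_high_bio_irq() below uses the same idea, except that it advances the
private iterator in lockstep with the iteration over the bounce bio.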

Signed-off-by: Ming Lei 
---
 block/bounce.c   | 16 +++-
 include/linux/bvec.h |  9 +
 2 files changed, 20 insertions(+), 5 deletions(-)

diff --git a/block/bounce.c b/block/bounce.c
index 0274c31d6c05..c35a3d7f0528 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -113,24 +113,30 @@ int init_emergency_isa_pool(void)
 static void copy_to_high_bio_irq(struct bio *to, struct bio *from)
 {
unsigned char *vfrom;
-   struct bio_vec tovec, *fromvec = from->bi_io_vec;
+   struct bio_vec tovec, fromvec;
struct bvec_iter iter;
+   /*
+* The bio of @from is created by bounce, so we can iterate
+* its bvecs from start to end, but @from->bi_iter can't be
+* trusted because it might have been changed by splitting.
+*/
+   struct bvec_iter from_iter = BVEC_ITER_ALL_INIT;
 
bio_for_each_segment(tovec, to, iter) {
-   if (tovec.bv_page != fromvec->bv_page) {
+   fromvec = bio_iter_iovec(from, from_iter);
+   if (tovec.bv_page != fromvec.bv_page) {
/*
 * fromvec->bv_offset and fromvec->bv_len might have
 * been modified by the block layer, so use the original
 * copy, bounce_copy_vec already uses tovec->bv_len
 */
-   vfrom = page_address(fromvec->bv_page) +
+   vfrom = page_address(fromvec.bv_page) +
tovec.bv_offset;
 
			bounce_copy_vec(&tovec, vfrom);
flush_dcache_page(tovec.bv_page);
}
-
-   fromvec++;
+   bio_advance_iter(from, &from_iter, tovec.bv_len);
}
 }
 
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index ec8a4d7af6bd..fe7a22dd133b 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -125,4 +125,13 @@ static inline bool bvec_iter_rewind(const struct bio_vec 
*bv,
((bvl = bvec_iter_bvec((bio_vec), (iter))), 1); \
 bvec_iter_advance((bio_vec), &(iter), (bvl).bv_len))
 
+/* for iterating one bio from start to end */
+#define BVEC_ITER_ALL_INIT (struct bvec_iter)  \
+{  \
+   .bi_sector  = 0,\
+   .bi_size= UINT_MAX, \
+   .bi_idx = 0,\
+   .bi_bvec_done   = 0,\
+}
+
 #endif /* __LINUX_BVEC_ITER_H */
-- 
2.9.5



Re: [PATCH] Pass the task_struct explicitly to delayacct_blkio_end

2017-12-18 Thread Peter Zijlstra
On Fri, Dec 15, 2017 at 06:11:43PM +, Josh Snyder wrote:

> Subject: [PATCH] Pass the task_struct explicitly to delayacct_blkio_end

That subject is crap; it needs a subsystem prefix and it needs to
describe what is done, not how it's done.

Something like:

  delayacct: Fix double accounting for blk-io

> Before e33a9bba85a8, delayacct_blkio_end was called after
> context-switching into the task which completed I/O. This resulted in
> double counting: the task would account a delay both waiting for I/O and
> for time spent in the runqueue.
> 
> With e33a9bba85a8, delayacct_blkio_end is called by try_to_wake_up. In
> ttwu, we have not yet context-switched. This is more correct, in that the
> delay accounting ends when the I/O is complete. But delayacct_blkio_end
> relies upon `get_current()`, and we have not yet context-switched into the
> task whose I/O completed. This results in the wrong task having its delay
> accounting statistics updated.

The correct way to quote a commit is like:

  e33a9bba85a8 ("sched/core: move IO scheduling accounting from 
io_schedule_timeout() into scheduler")

> Instead of doing that, pass the task_struct being woken to
> delayacct_blkio_end, so that it can update the statistics of the correct
> task_struct.