Re: [PATCH V2 1/3] blk-mq: allocate blk_mq_tags and requests in correct node

2017-02-24 Thread Jens Axboe
On 02/01/2017 10:53 AM, Shaohua Li wrote:
> blk_mq_tags/requests of a specific hardware queue are mostly used by
> specific cpus, which might not be in the same numa node as the disk. For
> example, if an nvme card is in node 0, half of the hardware queues will be
> used by node 0 and the other half by node 1.

Applied 1-3 for this series, thanks Shaohua!

-- 
Jens Axboe



Re: [PATCH V2 2/3] PCI: add an API to get node from vector

2017-02-24 Thread Jens Axboe
On 02/24/2017 03:29 PM, Bjorn Helgaas wrote:
> On Wed, Feb 01, 2017 at 09:53:15AM -0800, Shaohua Li wrote:
>> The next patch will use the API to get the node from a vector for the nvme device
>>
>> Signed-off-by: Shaohua Li 
> 
> Acked-by: Bjorn Helgaas 
> 
> Sorry I missed this; I normally work from the linux-pci patchwork, and
> this didn't show up there because it wasn't cc'd to linux-pci.  But I
> should have noticed anyway.

Thanks Bjorn!

-- 
Jens Axboe



Re: [PATCH V2 2/3] PCI: add an API to get node from vector

2017-02-24 Thread Bjorn Helgaas
On Wed, Feb 01, 2017 at 09:53:15AM -0800, Shaohua Li wrote:
> The next patch will use the API to get the node from a vector for the nvme device
> 
> Signed-off-by: Shaohua Li 

Acked-by: Bjorn Helgaas 

Sorry I missed this; I normally work from the linux-pci patchwork, and
this didn't show up there because it wasn't cc'd to linux-pci.  But I
should have noticed anyway.

> ---
>  drivers/pci/msi.c   | 16 
>  include/linux/pci.h |  6 ++
>  2 files changed, 22 insertions(+)
> 
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index 50c5003..ab7aee7 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -1313,6 +1313,22 @@ const struct cpumask *pci_irq_get_affinity(struct 
> pci_dev *dev, int nr)
>  }
>  EXPORT_SYMBOL(pci_irq_get_affinity);
>  
> +/**
> + * pci_irq_get_node - return the numa node of a particular msi vector
> + * @pdev: PCI device to operate on
> + * @vec: device-relative interrupt vector index (0-based).
> + */
> +int pci_irq_get_node(struct pci_dev *pdev, int vec)
> +{
> + const struct cpumask *mask;
> +
> + mask = pci_irq_get_affinity(pdev, vec);
> + if (mask)
> + return local_memory_node(cpu_to_node(cpumask_first(mask)));
> + return dev_to_node(&pdev->dev);
> +}
> +EXPORT_SYMBOL(pci_irq_get_node);
> +
>  struct pci_dev *msi_desc_to_pci_dev(struct msi_desc *desc)
>  {
>   return to_pci_dev(desc->dev);
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index e2d1a12..df2c649 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -1334,6 +1334,7 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, 
> unsigned int min_vecs,
>  void pci_free_irq_vectors(struct pci_dev *dev);
>  int pci_irq_vector(struct pci_dev *dev, unsigned int nr);
>  const struct cpumask *pci_irq_get_affinity(struct pci_dev *pdev, int vec);
> +int pci_irq_get_node(struct pci_dev *pdev, int vec);
>  
>  #else
>  static inline int pci_msi_vec_count(struct pci_dev *dev) { return -ENOSYS; }
> @@ -1384,6 +1385,11 @@ static inline const struct cpumask 
> *pci_irq_get_affinity(struct pci_dev *pdev,
>  {
>   return cpu_possible_mask;
>  }
> +
> +static inline int pci_irq_get_node(struct pci_dev *pdev, int vec)
> +{
> + return first_online_node;
> +}
>  #endif
>  
>  static inline int
> -- 
> 2.9.3
> 
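For context, a minimal sketch of how a driver could use the new API when
allocating per-queue data on the right node (a hypothetical caller with
made-up names; the real nvme usage lands in patch 3 of this series):

	/* Hypothetical helper: place a queue's data on the node that
	 * services the given MSI(-X) vector. pci_irq_get_node() falls
	 * back to the device's node when no affinity mask exists.
	 */
	static void *alloc_queue_data(struct pci_dev *pdev, int vec, size_t size)
	{
		int node = pci_irq_get_node(pdev, vec);

		return kzalloc_node(size, GFP_KERNEL, node);
	}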


Re: [PATCH V2 1/3] blk-mq: allocate blk_mq_tags and requests in correct node

2017-02-24 Thread Jens Axboe
On 02/01/2017 12:09 PM, Jens Axboe wrote:
> On 02/01/2017 09:53 AM, Shaohua Li wrote:
>> blk_mq_tags/requests of a specific hardware queue are mostly used by
>> specific cpus, which might not be in the same numa node as the disk. For
>> example, if an nvme card is in node 0, half of the hardware queues will be
>> used by node 0 and the other half by node 1.
> 
> All three patches look good to me. Bjorn, to avoid complications, if
> you can review/ack patch #2, then I will queue it up through the block
> tree for 4.11.

Bjorn, ping. You were CC'ed on the original patch three weeks ago.

-- 
Jens Axboe



Re: [GIT PULL] Block pull request for 4.11-rc1

2017-02-24 Thread Bart Van Assche
On Fri, 2017-02-24 at 13:22 -0700, Jens Axboe wrote:
> Bart, I pushed a fix here:
> 
> http://git.kernel.dk/cgit/linux-block/commit/?h=for-linus&id=61febef40bfe8ab68259d8545257686e8a0d91d1

Hello Jens,

The same test passes against the kernel I obtained by merging your for-linus
branch with the same version of Linus' master branch I mentioned in a previous
e-mail. Feel free to add my Tested-by to the patch "dm-rq: don't dereference
request payload after ending request".

Thanks,

Bart.

Re: [GIT PULL] Block pull request for 4.11-rc1

2017-02-24 Thread Jens Axboe
On 02/24/2017 01:00 PM, Jens Axboe wrote:
> On 02/24/2017 12:43 PM, Linus Torvalds wrote:
>> On Fri, Feb 24, 2017 at 9:39 AM, Bart Van Assche
>>  wrote:
>>>
>>> So the crash is caused by an attempt to dereference address 
>>> 0x6b6b6b6b6b6b6b6b
>>> at offset 0x270. I think this means the crash is caused by a use-after-free.
>>
>> Yeah, that's POISON_FREE, and that might explain why you see crashes
>> that others don't - you obviously have SLAB poisoning enabled. Jens
>> may not have.
>>
>> %rdi is "struct mapped_device *md", which came from dm_softirq_done() doing
>>
>> struct dm_rq_target_io *tio = tio_from_request(rq);
>> struct request *clone = tio->clone;
>> int rw;
>>
>> if (!clone) {
>> rq_end_stats(tio->md, rq);
>> rw = rq_data_dir(rq);
>> if (!rq->q->mq_ops)
>> blk_end_request_all(rq, tio->error);
>> else
>> blk_mq_end_request(rq, tio->error);
>> rq_completed(tio->md, rw, false);
>> return;
>> }
>>
>> so it's the 'tio' pointer that has been free'd. But it's worth noting
>> that we did apparently successfully dereference "tio" earlier in that
>> dm_softirq_done() *without* getting the poison value, so what I think
>> might be going on is that the 'tio' thing gets free'd when the code
>> does the blk_end_request_all()/blk_mq_end_request() call.
>>
>> Which makes sense - that ends the lifetime of the request, which in
>> turn also ends the lifetime of the "tio_from_request()", no?
>>
>> So the fix may be as simple as just doing
>>
>> if (!clone) {
>> struct mapped_device *md = tio->md;
>>
>> rq_end_stats(md, rq);
>> ...
>> rq_completed(md, rw, false);
>> return;
>> }
>>
>> because the 'mapped_device' pointer hopefully is still valid, it's
>> just 'tio' that has been freed.
>>
>> Jens? Bart? Christoph? Somebody who knows this code should
>> double-check my thinking above. I don't actually know the tio
>> lifetimes, I'm just going by looking at how earlier accesses seemed to
>> be fine (eg that "tio->clone" got us NULL, not a POISON_FREE pointer,
>> for example).
> 
> I think that is spot on. With the request changes for CDBs, for non
> blk-mq, we now also carry the payload after the request. But since
> blk-mq never frees the request, the above use-after-free with poison
> will only happen for !mq. Caching 'md' and avoiding a dereference of
> 'tio' after calling blk_end_request_all() will likely fix it.
> 
> Bart, can you test that?

Bart, I pushed a fix here:

http://git.kernel.dk/cgit/linux-block/commit/?h=for-linus&id=61febef40bfe8ab68259d8545257686e8a0d91d1

-- 
Jens Axboe



Re: [GIT PULL] Block pull request for 4.11-rc1

2017-02-24 Thread Jens Axboe
On 02/24/2017 12:43 PM, Linus Torvalds wrote:
> On Fri, Feb 24, 2017 at 9:39 AM, Bart Van Assche
>  wrote:
>>
>> So the crash is caused by an attempt to dereference address 
>> 0x6b6b6b6b6b6b6b6b
>> at offset 0x270. I think this means the crash is caused by a use-after-free.
> 
> Yeah, that's POISON_FREE, and that might explain why you see crashes
> that others don't - you obviously have SLAB poisoning enabled. Jens
> may not have.
> 
> %rdi is "struct mapped_device *md", which came from dm_softirq_done() doing
> 
> struct dm_rq_target_io *tio = tio_from_request(rq);
> struct request *clone = tio->clone;
> int rw;
> 
> if (!clone) {
> rq_end_stats(tio->md, rq);
> rw = rq_data_dir(rq);
> if (!rq->q->mq_ops)
> blk_end_request_all(rq, tio->error);
> else
> blk_mq_end_request(rq, tio->error);
> rq_completed(tio->md, rw, false);
> return;
> }
> 
> so it's the 'tio' pointer that has been free'd. But it's worth noting
> that we did apparently successfully dereference "tio" earlier in that
> dm_softirq_done() *without* getting the poison value, so what I think
> might be going on is that the 'tio' thing gets free'd when the code
> does the blk_end_request_all()/blk_mq_end_request() call.
> 
> Which makes sense - that ends the lifetime of the request, which in
> turn also ends the lifetime of the "tio_from_request()", no?
> 
> So the fix may be as simple as just doing
> 
> if (!clone) {
> struct mapped_device *md = tio->md;
> 
> rq_end_stats(md, rq);
> ...
> rq_completed(md, rw, false);
> return;
> }
> 
> because the 'mapped_device' pointer hopefully is still valid, it's
> just 'tio' that has been freed.
> 
> Jens? Bart? Christoph? Somebody who knows this code should
> double-check my thinking above. I don't actually know the tio
> lifetimes, I'm just going by looking at how earlier accesses seemed to
> be fine (eg that "tio->clone" got us NULL, not a POISON_FREE pointer,
> for example).

I think that is spot on. With the request changes for CDBs, for non
blk-mq, we now also carry the payload after the request. But since
blk-mq never frees the request, the above use-after-free with poison
will only happen for !mq. Caching 'md' and avoiding a dereference of
'tio' after calling blk_end_request_all() will likely fix it.

Bart, can you test that?

-- 
Jens Axboe



Re: [GIT PULL] Block pull request for 4.11-rc1

2017-02-24 Thread Linus Torvalds
On Fri, Feb 24, 2017 at 9:39 AM, Bart Van Assche
 wrote:
>
> So the crash is caused by an attempt to dereference address 0x6b6b6b6b6b6b6b6b
> at offset 0x270. I think this means the crash is caused by a use-after-free.

Yeah, that's POISON_FREE, and that might explain why you see crashes
that others don't - you obviously have SLAB poisoning enabled. Jens
may not have.

%rdi is "struct mapped_device *md", which came from dm_softirq_done() doing

struct dm_rq_target_io *tio = tio_from_request(rq);
struct request *clone = tio->clone;
int rw;

if (!clone) {
rq_end_stats(tio->md, rq);
rw = rq_data_dir(rq);
if (!rq->q->mq_ops)
blk_end_request_all(rq, tio->error);
else
blk_mq_end_request(rq, tio->error);
rq_completed(tio->md, rw, false);
return;
}

so it's the 'tio' pointer that has been free'd. But it's worth noting
that we did apparently successfully dereference "tio" earlier in that
dm_softirq_done() *without* getting the poison value, so what I think
might be going on is that the 'tio' thing gets free'd when the code
does the blk_end_request_all()/blk_mq_end_request() call.

Which makes sense - that ends the lifetime of the request, which in
turn also ends the lifetime of the "tio_from_request()", no?

So the fix may be as simple as just doing

if (!clone) {
struct mapped_device *md = tio->md;

rq_end_stats(md, rq);
...
rq_completed(md, rw, false);
return;
}

because the 'mapped_device' pointer hopefully is still valid, it's
just 'tio' that has been freed.

Jens? Bart? Christoph? Somebody who knows this code should
double-check my thinking above. I don't actually know the tio
lifetimes, I'm just going by looking at how earlier accesses seemed to
be fine (eg that "tio->clone" got us NULL, not a POISON_FREE pointer,
for example).

   Linus
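
For reference, the complete !clone branch with the suggested caching
applied would look roughly like this (a sketch of the approach only; the
final version is the "dm-rq: don't dereference request payload after
ending request" commit Jens pushed):

	if (!clone) {
		struct mapped_device *md = tio->md;

		rq_end_stats(md, rq);
		rw = rq_data_dir(rq);
		if (!rq->q->mq_ops)
			blk_end_request_all(rq, tio->error);
		else
			blk_mq_end_request(rq, tio->error);
		/* ending the request may free 'tio'; touch only the cached md */
		rq_completed(md, rw, false);
		return;
	}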


Re: [WIP PATCHSET 0/4] WIP branch for bfq-mq

2017-02-24 Thread Bart Van Assche
On Wed, 2017-02-22 at 22:29 +0100, Paolo Valente wrote:
> thanks for this second attempt of yours.  Although it unfortunately does
> not provide a clear indication of the exact cause of your hang (apart
> from a possible deadlock), your log helped me notice another bug.
> 
> At any rate, as I have just written to Jens, I have pushed a new
> version of the branch [1] (not just adding new commits, but also
> integrating some old commits with new changes, to move more quickly).
> The branch now contains both a fix for the above bug, and, more
> importantly, a fix for the circular dependencies that were still
> lurking around.  Could you please test it?

Hello Paolo,

I have good news: the same test system boots normally with the same
kernel config I used during my previous tests and with the latest
bfq-mq code (commit a965d19585c0) merged with kernel v4.10.

Thanks,

Bart.

Re: [GIT PULL] Block pull request for 4.11-rc1

2017-02-24 Thread Jens Axboe
On 02/24/2017 10:39 AM, Bart Van Assche wrote:
> On Mon, 2017-02-20 at 09:32 -0700, Jens Axboe wrote:
>> On 02/20/2017 09:16 AM, Bart Van Assche wrote:
>>> On 02/19/2017 11:35 PM, Christoph Hellwig wrote:
 On Sun, Feb 19, 2017 at 06:15:41PM -0700, Jens Axboe wrote:
> That said, we will look into this again, of course. Christoph, any idea?

 No idea really - this seems so far away from the code touched, and there
 are no obvious signs of a memory scramble from another touched object,
 so if it really bisects down to that issue I think it must be a timing
 issue.

 But reading Bart's message again:  Did you actually bisect it down
 to this commit?  Or just test the whole tree?  Between the 4.10-rc5
 merge and all the block tree there might be a few more likely suspects,
 like the scsi bdi lifetime fixes that James mentioned.
>>>
>>> Hello Christoph,
>>>
>>> As far as I know Jens does not rebase his trees so we can use the commit
>>> date to check which patch went in when. From the first of Jan's bdi patches:
>>>
>>> CommitDate: Thu Feb 2 08:18:41 2017 -0700
>>>
>>> So the bdi patches went in several days after I reported the general 
>>> protection
>>> fault issue.
>>>
>>> In an e-mail of January 30th I wrote the following: "Running the srp-test
>>> software against kernel 4.9.6 and kernel 4.10-rc5 went fine.  With your
>>> for-4.11/block branch (commit 400f73b23f457a) however I just ran into
>>> the following warning: [ ... ]" That means that I did not hit the crash with
>>> Jens' for-4.11/block branch but only with the for-next branch. The patches
>>> on Jens' for-next branch after that commit that were applied before I ran
>>> my test are:
>>>
>>> $ PAGER= git log --format=oneline 400f73b23f457a..fb045ca25cc7 block 
>>> drivers/md/dm{,-mpath,-table}.[ch]
>>> fb045ca25cc7b6d46368ab8221774489c2a81648 block: don't assign cmd_flags in 
>>> __blk_rq_prep_clone
>>> 82ed4db499b8598f16f8871261bff088d6b0597f block: split scsi_request out of 
>>> struct request
>>> 8ae94eb65be9425af4d57a4f4cfebfdf03081e93 block/bsg: move queue creation 
>>> into bsg_setup_queue
>>> eb8db831be80692bf4bda3dfc55001daf64ec299 dm: always defer request 
>>> allocation to the owner of the request_queue
>>> 6d247d7f71d1fa4b66a5f4da7b1daa21510d529b block: allow specifying size for 
>>> extra command data
>>> 5ea708d15a928f7a479987704203616d3274c03b block: simplify 
>>> blk_init_allocated_queue
>>> e6f7f93d58de74700f83dd0547dd4306248a093d block: fix elevator init check
>>> f924ba70c1b12706c6679d793202e8f4c125f7ae Merge branch 'for-4.11/block' into 
>>> for-4.11/rq-refactor
>>> 88a7503376f4f3bf303c809d1a389739e1205614 blk-mq: Remove unused variable
>>> bef13315e990fd3d3fb4c39013aefd53f06c3657 block: don't try to discard from 
>>> __blkdev_issue_zeroout
>>> f99e86485cc32cd16e5cc97f9bb0474f28608d84 block: Rename blk_queue_zone_size 
>>> and bdev_zone_size
>>>
>>> Do you see any patch in the above list that does not belong to the "split
>>> scsi passthrough fields out of struct request" series and that could have
>>> caused the reported behavior change?
>>
>> Bart, since you are the only one that can reproduce this, can you just bisect
>> your way through that series?
> 
> Hello Jens,
> 
> Since Christoph also has access to IB hardware I will leave it to Christoph
> to do the bisect. Anyway, I just reproduced this crash with Linus' current
> tree (commit f1ef09fde17f) by running srp-test/run_tests -r 10 -t 02-sq-on-mq
> (see also https://github.com/bvanassche/srp-test):
> 
> [ 1629.920553] general protection fault:  [#1] SMP
> [ 1629.921193] CPU: 6 PID: 46 Comm: ksoftirqd/6 Tainted: G  I 
> 4.10.0-dbg+ #1
> [ 1629.921289] RIP: 0010:rq_completed+0x12/0x90 [dm_mod]
> [ 1629.921316] RSP: 0018:c90001bdbda8 EFLAGS: 00010246
> [ 1629.921344] RAX:  RBX: 6b6b6b6b6b6b6b6b RCX: 
> 
> [ 1629.921372] RDX:  RSI:  RDI: 
> 6b6b6b6b6b6b6b6b
> [ 1629.921401] RBP: c90001bdbdc0 R08: 8803a3858d48 R09: 
> 
> [ 1629.921429] R10:  R11:  R12: 
> 
> [ 1629.921458] R13:  R14: 81c05120 R15: 
> 0004
> [ 1629.921489] FS:  () GS:88046ef8() 
> knlGS:
> [ 1629.921520] CS:  0010 DS:  ES:  CR0: 80050033
> [ 1629.921547] CR2: 7fb6324486b8 CR3: 01c0f000 CR4: 
> 001406e0
> [ 1629.921576] Call Trace:
> [ 1629.921605]  dm_softirq_done+0xe6/0x1e0 [dm_mod]
> [ 1629.921637]  blk_done_softirq+0x88/0xa0
> [ 1629.921663]  __do_softirq+0xba/0x4c0
> [ 1629.921744]  run_ksoftirqd+0x1a/0x50
> [ 1629.921769]  smpboot_thread_fn+0x123/0x1e0
> [ 1629.921797]  kthread+0x107/0x140
> [ 1629.921944]  ret_from_fork+0x2e/0x40
> [ 1629.921972] Code: ff ff 31 f6 48 89 c7 e8 ed 96 2f e1 5d c3 90 66 2e 0f 1f 
> 84 00 00 00 00 00 55 48 63 f6 48 89 e5 41 55 41 89 d5 41 54 53 48 89 fb <4c> 
> 

Re: [GIT PULL] Block pull request for 4.11-rc1

2017-02-24 Thread Bart Van Assche
On Mon, 2017-02-20 at 09:32 -0700, Jens Axboe wrote:
> On 02/20/2017 09:16 AM, Bart Van Assche wrote:
> > On 02/19/2017 11:35 PM, Christoph Hellwig wrote:
> > > On Sun, Feb 19, 2017 at 06:15:41PM -0700, Jens Axboe wrote:
> > > > That said, we will look into this again, of course. Christoph, any idea?
> > > 
> > > No idea really - this seems so far away from the code touched, and there
> > > are no obvious signs of a memory scramble from another touched object,
> > > so if it really bisects down to that issue I think it must be a timing
> > > issue.
> > > 
> > > But reading Bart's message again:  Did you actually bisect it down
> > > to this commit?  Or just test the whole tree?  Between the 4.10-rc5
> > > merge and all the block tree there might be a few more likely suspects,
> > > like the scsi bdi lifetime fixes that James mentioned.
> > 
> > Hello Christoph,
> > 
> > As far as I know Jens does not rebase his trees so we can use the commit
> > date to check which patch went in when. From the first of Jan's bdi patches:
> > 
> > CommitDate: Thu Feb 2 08:18:41 2017 -0700
> > 
> > So the bdi patches went in several days after I reported the general 
> > protection
> > fault issue.
> > 
> > In an e-mail of January 30th I wrote the following: "Running the srp-test
> > software against kernel 4.9.6 and kernel 4.10-rc5 went fine.  With your
> > for-4.11/block branch (commit 400f73b23f457a) however I just ran into
> > the following warning: [ ... ]" That means that I did not hit the crash with
> > Jens' for-4.11/block branch but only with the for-next branch. The patches
> > on Jens' for-next branch after that commit that were applied before I ran
> > my test are:
> > 
> > $ PAGER= git log --format=oneline 400f73b23f457a..fb045ca25cc7 block 
> > drivers/md/dm{,-mpath,-table}.[ch]
> > fb045ca25cc7b6d46368ab8221774489c2a81648 block: don't assign cmd_flags in 
> > __blk_rq_prep_clone
> > 82ed4db499b8598f16f8871261bff088d6b0597f block: split scsi_request out of 
> > struct request
> > 8ae94eb65be9425af4d57a4f4cfebfdf03081e93 block/bsg: move queue creation 
> > into bsg_setup_queue
> > eb8db831be80692bf4bda3dfc55001daf64ec299 dm: always defer request 
> > allocation to the owner of the request_queue
> > 6d247d7f71d1fa4b66a5f4da7b1daa21510d529b block: allow specifying size for 
> > extra command data
> > 5ea708d15a928f7a479987704203616d3274c03b block: simplify 
> > blk_init_allocated_queue
> > e6f7f93d58de74700f83dd0547dd4306248a093d block: fix elevator init check
> > f924ba70c1b12706c6679d793202e8f4c125f7ae Merge branch 'for-4.11/block' into 
> > for-4.11/rq-refactor
> > 88a7503376f4f3bf303c809d1a389739e1205614 blk-mq: Remove unused variable
> > bef13315e990fd3d3fb4c39013aefd53f06c3657 block: don't try to discard from 
> > __blkdev_issue_zeroout
> > f99e86485cc32cd16e5cc97f9bb0474f28608d84 block: Rename blk_queue_zone_size 
> > and bdev_zone_size
> > 
> > Do you see any patch in the above list that does not belong to the "split
> > scsi passthrough fields out of struct request" series and that could have
> > caused the reported behavior change?
> 
> Bart, since you are the only one that can reproduce this, can you just bisect
> your way through that series?

Hello Jens,

Since Christoph also has access to IB hardware I will leave it to Christoph
to do the bisect. Anyway, I just reproduced this crash with Linus' current
tree (commit f1ef09fde17f) by running srp-test/run_tests -r 10 -t 02-sq-on-mq
(see also https://github.com/bvanassche/srp-test):

[ 1629.920553] general protection fault:  [#1] SMP
[ 1629.921193] CPU: 6 PID: 46 Comm: ksoftirqd/6 Tainted: G  I 
4.10.0-dbg+ #1
[ 1629.921289] RIP: 0010:rq_completed+0x12/0x90 [dm_mod]
[ 1629.921316] RSP: 0018:c90001bdbda8 EFLAGS: 00010246
[ 1629.921344] RAX:  RBX: 6b6b6b6b6b6b6b6b RCX: 
[ 1629.921372] RDX:  RSI:  RDI: 6b6b6b6b6b6b6b6b
[ 1629.921401] RBP: c90001bdbdc0 R08: 8803a3858d48 R09: 
[ 1629.921429] R10:  R11:  R12: 
[ 1629.921458] R13:  R14: 81c05120 R15: 0004
[ 1629.921489] FS:  () GS:88046ef8() 
knlGS:
[ 1629.921520] CS:  0010 DS:  ES:  CR0: 80050033
[ 1629.921547] CR2: 7fb6324486b8 CR3: 01c0f000 CR4: 001406e0
[ 1629.921576] Call Trace:
[ 1629.921605]  dm_softirq_done+0xe6/0x1e0 [dm_mod]
[ 1629.921637]  blk_done_softirq+0x88/0xa0
[ 1629.921663]  __do_softirq+0xba/0x4c0
[ 1629.921744]  run_ksoftirqd+0x1a/0x50
[ 1629.921769]  smpboot_thread_fn+0x123/0x1e0
[ 1629.921797]  kthread+0x107/0x140
[ 1629.921944]  ret_from_fork+0x2e/0x40
[ 1629.921972] Code: ff ff 31 f6 48 89 c7 e8 ed 96 2f e1 5d c3 90 66 2e 0f 1f 
84 00 00 00 00 00 55 48 63 f6 48 89 e5 41 55 41 89 d5 41 54 53 48 89 fb <4c> 8b 
a7 70 02 00 00 f0 ff 8c b7 38 03 00 00 e8 3a 43 ff ff 85 
[ 1629.922093] RIP: rq_completed+0x12/0x90 [dm_mod] 

[PATCH 2/2] lightnvm: fix assert fixes and enable checks

2017-02-24 Thread Matias Bjørling
The asserts in _nvme_nvm_check_size are not compiled because the function
is never called. Make sure that it is called, and also fix the wrong
sizes of the asserts for nvme_nvm_addr_format and nvme_nvm_bb_tbl, which
checked for the number of bits instead of bytes.

Reported-by: Scott Bauer 
Signed-off-by: Matias Bjørling 
---
 drivers/nvme/host/lightnvm.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index e37b432..b6a67ad 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -241,9 +241,9 @@ static inline void _nvme_nvm_check_size(void)
BUILD_BUG_ON(sizeof(struct nvme_nvm_l2ptbl) != 64);
BUILD_BUG_ON(sizeof(struct nvme_nvm_erase_blk) != 64);
BUILD_BUG_ON(sizeof(struct nvme_nvm_id_group) != 960);
-   BUILD_BUG_ON(sizeof(struct nvme_nvm_addr_format) != 128);
+   BUILD_BUG_ON(sizeof(struct nvme_nvm_addr_format) != 16);
BUILD_BUG_ON(sizeof(struct nvme_nvm_id) != 4096);
-   BUILD_BUG_ON(sizeof(struct nvme_nvm_bb_tbl) != 512);
+   BUILD_BUG_ON(sizeof(struct nvme_nvm_bb_tbl) != 64);
 }
 
 static int init_grps(struct nvm_id *nvm_id, struct nvme_nvm_id *nvme_nvm_id)
@@ -797,6 +797,8 @@ int nvme_nvm_register(struct nvme_ns *ns, char *disk_name, 
int node)
struct request_queue *q = ns->queue;
struct nvm_dev *dev;
 
+   _nvme_nvm_check_size();
+
dev = nvm_alloc_dev(node);
if (!dev)
return -ENOMEM;
-- 
2.9.3
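
The underlying issue generalizes: BUILD_BUG_ON() only fires when the
compiler actually instantiates the function containing it, so size checks
tucked into a never-called static inline are silently skipped. A minimal
illustration (hypothetical struct, not from this patch):

	struct foo {
		u32 a;
		u32 b;
	};				/* really 8 bytes */

	static inline void check_foo_size(void)
	{
		/* Deliberately wrong, yet the build succeeds as long as
		 * check_foo_size() is never referenced: the uncalled
		 * inline is discarded before the assert is evaluated.
		 */
		BUILD_BUG_ON(sizeof(struct foo) != 4);
	}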



[PATCH 1/2] lightnvm: add generic ocssd detection

2017-02-24 Thread Matias Bjørling
More implementations of OCSSDs are becoming available. Adding each using
pci ids are becoming a hassle. Instead, use a 16 byte string in the
vendor-specific area of the identification command to identify an
Open-Channel SSD.

The long string should make the probability of a collision with other
vendor-specific strings near nil.

Signed-off-by: Matias Bjørling 
---
 drivers/nvme/host/lightnvm.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index 4ea9c93..e37b432 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -986,6 +986,9 @@ int nvme_nvm_ns_supported(struct nvme_ns *ns, struct 
nvme_id_ns *id)
/* XXX: this is poking into PCI structures from generic code! */
struct pci_dev *pdev = to_pci_dev(ctrl->dev);
 
+   if (!strncmp((char *)id->vs, "open-channel ssd", 16))
+   return 1;
+
/* QEMU NVMe simulator - PCI ID + Vendor specific bit */
if (pdev->vendor == PCI_VENDOR_ID_CNEX &&
pdev->device == PCI_DEVICE_ID_CNEX_QEMU &&
-- 
2.9.3



Re: [PATCH 3/3] lightnvm: free reverse device map

2017-02-24 Thread Matias Bjørling

On 02/24/2017 05:14 PM, Javier González wrote:

Free the reverse mapping table correctly on target tear down: free the
per-channel lun_offs arrays and the chnls array as well, not just the
top-level rmap.

Signed-off-by: Javier González 
---
 drivers/lightnvm/core.c | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index ca48792..f1cb485 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -413,6 +413,18 @@ static int nvm_register_map(struct nvm_dev *dev)
return -ENOMEM;
 }

+static void nvm_unregister_map(struct nvm_dev *dev)
+{
+   struct nvm_dev_map *rmap = dev->rmap;
+   int i;
+
+   for (i = 0; i < dev->geo.nr_chnls; i++)
+   kfree(rmap->chnls[i].lun_offs);
+
+   kfree(rmap->chnls);
+   kfree(rmap);
+}
+
 static void nvm_map_to_dev(struct nvm_tgt_dev *tgt_dev, struct ppa_addr *p)
 {
struct nvm_dev_map *dev_map = tgt_dev->map;
@@ -994,7 +1006,7 @@ void nvm_free(struct nvm_dev *dev)
if (dev->dma_pool)
dev->ops->destroy_dma_pool(dev->dma_pool);

-   kfree(dev->rmap);
+   nvm_unregister_map(dev);
kfree(dev->lptbl);
kfree(dev->lun_map);
kfree(dev);


Thanks, applied for 4.12.


Re: [PATCH 2/3] lightnvm: rename scrambler controller hint

2017-02-24 Thread Matias Bjørling

On 02/24/2017 05:14 PM, Javier González wrote:

According to the OCSSD 1.2 specification, the 0x200 hint enables the
media scrambler for the read/write opcode, provided that the controller
has been correctly configured by the firmware. Rename the macro to
represent this meaning.

Signed-off-by: Javier González 
Signed-off-by: Matias Bjørling 
---
 include/linux/lightnvm.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/lightnvm.h b/include/linux/lightnvm.h
index 6a3534b..bebea80 100644
--- a/include/linux/lightnvm.h
+++ b/include/linux/lightnvm.h
@@ -123,7 +123,7 @@ enum {
/* NAND Access Modes */
NVM_IO_SUSPEND  = 0x80,
NVM_IO_SLC_MODE = 0x100,
-   NVM_IO_SCRAMBLE_DISABLE = 0x200,
+   NVM_IO_SCRAMBLE_ENABLE  = 0x200,

/* Block Types */
NVM_BLK_T_FREE  = 0x0,



Thanks, applied for 4.12.


[PATCH 3/3] lightnvm: free reverse device map

2017-02-24 Thread Javier González
Free the reverse mapping table correctly on target tear down: free the
per-channel lun_offs arrays and the chnls array as well, not just the
top-level rmap.

Signed-off-by: Javier González 
---
 drivers/lightnvm/core.c | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index ca48792..f1cb485 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -413,6 +413,18 @@ static int nvm_register_map(struct nvm_dev *dev)
return -ENOMEM;
 }
 
+static void nvm_unregister_map(struct nvm_dev *dev)
+{
+   struct nvm_dev_map *rmap = dev->rmap;
+   int i;
+
+   for (i = 0; i < dev->geo.nr_chnls; i++)
+   kfree(rmap->chnls[i].lun_offs);
+
+   kfree(rmap->chnls);
+   kfree(rmap);
+}
+
 static void nvm_map_to_dev(struct nvm_tgt_dev *tgt_dev, struct ppa_addr *p)
 {
struct nvm_dev_map *dev_map = tgt_dev->map;
@@ -994,7 +1006,7 @@ void nvm_free(struct nvm_dev *dev)
if (dev->dma_pool)
dev->ops->destroy_dma_pool(dev->dma_pool);
 
-   kfree(dev->rmap);
+   nvm_unregister_map(dev);
kfree(dev->lptbl);
kfree(dev->lun_map);
kfree(dev);
-- 
2.7.4



[PATCH 1/3] lightnvm: submit erases using the I/O path

2017-02-24 Thread Javier González
Until now erases have been submitted as synchronous commands through a
dedicated erase function. In order to enable targets implementing
asynchronous erases, refactor the erase path so that it uses the normal
async I/O submission functions. If a target requires sync I/O, it can
implement it internally. Also, adapt rrpc to use the new erase path.

Signed-off-by: Javier González 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/core.c  | 54 +++-
 drivers/lightnvm/rrpc.c  |  3 +--
 drivers/nvme/host/lightnvm.c | 32 --
 include/linux/lightnvm.h |  8 +++
 4 files changed, 47 insertions(+), 50 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index fcbd82f..ca48792 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -592,11 +592,11 @@ int nvm_set_tgt_bb_tbl(struct nvm_tgt_dev *tgt_dev, 
struct ppa_addr *ppas,
 
memset(&rqd, 0, sizeof(struct nvm_rq));
 
-   nvm_set_rqd_ppalist(dev, &rqd, ppas, nr_ppas, 1);
+   nvm_set_rqd_ppalist(tgt_dev, &rqd, ppas, nr_ppas, 1);
nvm_rq_tgt_to_dev(tgt_dev, &rqd);
 
ret = dev->ops->set_bb_tbl(dev, &rqd.ppa_addr, rqd.nr_ppas, type);
-   nvm_free_rqd_ppalist(dev, &rqd);
+   nvm_free_rqd_ppalist(tgt_dev, &rqd);
if (ret) {
pr_err("nvm: failed bb mark\n");
return -EINVAL;
@@ -628,34 +628,45 @@ int nvm_submit_io(struct nvm_tgt_dev *tgt_dev, struct 
nvm_rq *rqd)
 }
 EXPORT_SYMBOL(nvm_submit_io);
 
-int nvm_erase_blk(struct nvm_tgt_dev *tgt_dev, struct ppa_addr *ppas, int flags)
+static void nvm_end_io_sync(struct nvm_rq *rqd)
 {
-   struct nvm_dev *dev = tgt_dev->parent;
+   struct completion *waiting = rqd->private;
+
+   complete(waiting);
+}
+
+int nvm_erase_sync(struct nvm_tgt_dev *tgt_dev, struct ppa_addr *ppas,
+   int nr_ppas)
+{
+   struct nvm_geo *geo = &tgt_dev->geo;
struct nvm_rq rqd;
int ret;
-
-   if (!dev->ops->erase_block)
-   return 0;
-
-   nvm_map_to_dev(tgt_dev, ppas);
+   DECLARE_COMPLETION_ONSTACK(wait);
 
memset(&rqd, 0, sizeof(struct nvm_rq));
 
-   ret = nvm_set_rqd_ppalist(dev, &rqd, ppas, 1, 1);
+   rqd.opcode = NVM_OP_ERASE;
+   rqd.end_io = nvm_end_io_sync;
+   rqd.private = &wait;
+   rqd.flags = geo->plane_mode >> 1;
+
+   ret = nvm_set_rqd_ppalist(tgt_dev, &rqd, ppas, nr_ppas, 1);
if (ret)
return ret;
 
-   nvm_rq_tgt_to_dev(tgt_dev, &rqd);
+   ret = nvm_submit_io(tgt_dev, &rqd);
+   if (ret) {
+   pr_err("rrpr: erase I/O submission failed: %d\n", ret);
+   goto free_ppa_list;
+   }
+   wait_for_completion_io(&wait);
 
-   rqd.flags = flags;
-
-   ret = dev->ops->erase_block(dev, &rqd);
-
-   nvm_free_rqd_ppalist(dev, &rqd);
+free_ppa_list:
+   nvm_free_rqd_ppalist(tgt_dev, &rqd);
 
return ret;
 }
-EXPORT_SYMBOL(nvm_erase_blk);
+EXPORT_SYMBOL(nvm_erase_sync);
 
 int nvm_get_l2p_tbl(struct nvm_tgt_dev *tgt_dev, u64 slba, u32 nlb,
nvm_l2p_update_fn *update_l2p, void *priv)
@@ -734,10 +745,11 @@ void nvm_put_area(struct nvm_tgt_dev *tgt_dev, sector_t 
begin)
 }
 EXPORT_SYMBOL(nvm_put_area);
 
-int nvm_set_rqd_ppalist(struct nvm_dev *dev, struct nvm_rq *rqd,
+int nvm_set_rqd_ppalist(struct nvm_tgt_dev *tgt_dev, struct nvm_rq *rqd,
const struct ppa_addr *ppas, int nr_ppas, int vblk)
 {
-   struct nvm_geo *geo = &dev->geo;
+   struct nvm_dev *dev = tgt_dev->parent;
+   struct nvm_geo *geo = &tgt_dev->geo;
int i, plane_cnt, pl_idx;
struct ppa_addr ppa;
 
@@ -775,12 +787,12 @@ int nvm_set_rqd_ppalist(struct nvm_dev *dev, struct 
nvm_rq *rqd,
 }
 EXPORT_SYMBOL(nvm_set_rqd_ppalist);
 
-void nvm_free_rqd_ppalist(struct nvm_dev *dev, struct nvm_rq *rqd)
+void nvm_free_rqd_ppalist(struct nvm_tgt_dev *tgt_dev, struct nvm_rq *rqd)
 {
if (!rqd->ppa_list)
return;
 
-   nvm_dev_dma_free(dev, rqd->ppa_list, rqd->dma_ppa_list);
+   nvm_dev_dma_free(tgt_dev->parent, rqd->ppa_list, rqd->dma_ppa_list);
 }
 EXPORT_SYMBOL(nvm_free_rqd_ppalist);
 
diff --git a/drivers/lightnvm/rrpc.c b/drivers/lightnvm/rrpc.c
index e68efbc..4e4c299 100644
--- a/drivers/lightnvm/rrpc.c
+++ b/drivers/lightnvm/rrpc.c
@@ -414,7 +414,6 @@ static void rrpc_block_gc(struct work_struct *work)
struct rrpc *rrpc = gcb->rrpc;
struct rrpc_block *rblk = gcb->rblk;
struct rrpc_lun *rlun = rblk->rlun;
-   struct nvm_tgt_dev *dev = rrpc->dev;
struct ppa_addr ppa;
 
mempool_free(gcb, rrpc->gcb_pool);
@@ -430,7 +429,7 @@ static void rrpc_block_gc(struct work_struct *work)
ppa.g.lun = rlun->bppa.g.lun;
ppa.g.blk = rblk->id;
 
-   if (nvm_erase_blk(dev, &ppa, 0))
+   if (nvm_erase_sync(rrpc->dev, &ppa, 1))
goto put_back;
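
The conversion follows the standard kernel pattern for layering a
synchronous call on top of an asynchronous submission path: put a
completion on the stack, point the request's end_io at a trivial handler
that completes it, submit, then wait. A generic sketch of the pattern
(my_req/my_submit_async are made-up names standing in for nvm_rq and
nvm_submit_io):

	static void my_end_io_sync(struct my_req *req)
	{
		complete(req->private);
	}

	int my_submit_sync(struct my_req *req)
	{
		DECLARE_COMPLETION_ONSTACK(wait);
		int ret;

		req->end_io = my_end_io_sync;	/* async completion callback */
		req->private = &wait;

		ret = my_submit_async(req);	/* returns once queued */
		if (ret)
			return ret;

		wait_for_completion_io(&wait);	/* block until end_io runs */
		return 0;
	}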
 

[PATCH 2/3] lightnvm: rename scrambler controller hint

2017-02-24 Thread Javier González
According to the OCSSD 1.2 specification, the 0x200 hint enables the
media scrambler for the read/write opcode, provided that the controller
has been correctly configured by the firmware. Rename the macro to
represent this meaning.

Signed-off-by: Javier González 
Signed-off-by: Matias Bjørling 
---
 include/linux/lightnvm.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/lightnvm.h b/include/linux/lightnvm.h
index 6a3534b..bebea80 100644
--- a/include/linux/lightnvm.h
+++ b/include/linux/lightnvm.h
@@ -123,7 +123,7 @@ enum {
/* NAND Access Modes */
NVM_IO_SUSPEND  = 0x80,
NVM_IO_SLC_MODE = 0x100,
-   NVM_IO_SCRAMBLE_DISABLE = 0x200,
+   NVM_IO_SCRAMBLE_ENABLE  = 0x200,
 
/* Block Types */
NVM_BLK_T_FREE  = 0x0,
-- 
2.7.4



[PATCH v1 14/14] md: raid10: avoid direct access to bvec table in handle_reshape_read_error

2017-02-24 Thread Ming Lei
The cost is 128 bytes (16 pointers * 8 bytes) of stack space in kernel
thread context; in exchange we can just use the bio helper to retrieve
pages from the bio.

Signed-off-by: Ming Lei 
---
 drivers/md/raid10.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index ae162d542bf4..705cb9af03ef 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -4689,7 +4689,15 @@ static int handle_reshape_read_error(struct mddev *mddev,
struct r10bio *r10b = &on_stack.r10_bio;
int slot = 0;
int idx = 0;
-   struct bio_vec *bvec = r10_bio->master_bio->bi_io_vec;
+   struct bio_vec *bvl;
+   struct page *pages[RESYNC_PAGES];
+
+   /*
+* This bio is allocated in reshape_request(), and size
+* is still RESYNC_PAGES
+*/
+   bio_for_each_segment_all(bvl, r10_bio->master_bio, idx)
+   pages[idx] = bvl->bv_page;
 
r10b->sector = r10_bio->sector;
__raid10_find_phys(&conf->prev, r10b);
@@ -4718,7 +4726,7 @@ static int handle_reshape_read_error(struct mddev *mddev,
success = sync_page_io(rdev,
   addr,
   s << 9,
-  bvec[idx].bv_page,
+  pages[idx],
   REQ_OP_READ, 0, false);
rdev_dec_pending(rdev, mddev);
rcu_read_lock();
-- 
2.7.4



[PATCH v1 12/14] md: raid10: don't use bio's vec table to manage resync pages

2017-02-24 Thread Ming Lei
Now we allocate one page array for managing resync pages, instead
of using the bio's vec table to do that; the old way is very hacky
and won't work any more if multipage bvec is enabled.

The introduced cost is that we need to allocate (128 + 16) * copies
bytes per r10_bio, and that is fine because the number of inflight
r10_bios for resync shouldn't be large, as pointed out by Shaohua.

Also the bio_reset() calls in raid10_sync_request() and reshape_request()
are removed because all bios are freshly new now in these functions
and it is not necessary to reset them any more.

This patch can be thought of as a cleanup too.

Suggested-by: Shaohua Li 
Signed-off-by: Ming Lei 
---
 drivers/md/raid10.c | 127 ++--
 1 file changed, 74 insertions(+), 53 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index c76e08ea4b92..931f5d80608b 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -110,6 +110,16 @@ static void end_reshape(struct r10conf *conf);
 #define raid10_log(md, fmt, args...)   \
do { if ((md)->queue) blk_add_trace_msg((md)->queue, "raid10 " fmt, 
##args); } while (0)
 
+static inline struct resync_pages *get_resync_pages(struct bio *bio)
+{
+   return bio->bi_private;
+}
+
+static inline struct r10bio *get_resync_r10bio(struct bio *bio)
+{
+   return get_resync_pages(bio)->raid_bio;
+}
+
 static void * r10bio_pool_alloc(gfp_t gfp_flags, void *data)
 {
struct r10conf *conf = data;
@@ -140,11 +150,11 @@ static void r10bio_pool_free(void *r10_bio, void *data)
 static void * r10buf_pool_alloc(gfp_t gfp_flags, void *data)
 {
struct r10conf *conf = data;
-   struct page *page;
struct r10bio *r10_bio;
struct bio *bio;
-   int i, j;
-   int nalloc;
+   int j;
+   int nalloc, nalloc_rp;
+   struct resync_pages *rps;
 
r10_bio = r10bio_pool_alloc(gfp_flags, conf);
if (!r10_bio)
return NULL;
@@ -156,6 +166,15 @@ static void * r10buf_pool_alloc(gfp_t gfp_flags, void *data)
else
nalloc = 2; /* recovery */
 
+   /* allocate once for all bios */
+   if (!conf->have_replacement)
+   nalloc_rp = nalloc;
+   else
+   nalloc_rp = nalloc * 2;
+   rps = kmalloc(sizeof(struct resync_pages) * nalloc_rp, gfp_flags);
+   if (!rps)
+   goto out_free_r10bio;
+
/*
 * Allocate bios.
 */
@@ -175,36 +194,40 @@ static void * r10buf_pool_alloc(gfp_t gfp_flags, void 
*data)
 * Allocate RESYNC_PAGES data pages and attach them
 * where needed.
 */
-   for (j = 0 ; j < nalloc; j++) {
+   for (j = 0; j < nalloc; j++) {
struct bio *rbio = r10_bio->devs[j].repl_bio;
+   struct resync_pages *rp, *rp_repl;
+
+   rp = &rps[j];
+   if (rbio)
+   rp_repl = &rps[nalloc + j];
+
bio = r10_bio->devs[j].bio;
-   for (i = 0; i < RESYNC_PAGES; i++) {
-   if (j > 0 && !test_bit(MD_RECOVERY_SYNC,
-  &conf->mddev->recovery)) {
-   /* we can share bv_page's during recovery
-* and reshape */
-   struct bio *rbio = r10_bio->devs[0].bio;
-   page = rbio->bi_io_vec[i].bv_page;
-   get_page(page);
-   } else
-   page = alloc_page(gfp_flags);
-   if (unlikely(!page))
+
+   if (!j || test_bit(MD_RECOVERY_SYNC,
+  &conf->mddev->recovery)) {
+   if (resync_alloc_pages(rp, gfp_flags))
goto out_free_pages;
+   } else {
+   memcpy(rp, &rps[0], sizeof(*rp));
+   resync_get_all_pages(rp);
+   }
 
-   bio->bi_io_vec[i].bv_page = page;
-   if (rbio)
-   rbio->bi_io_vec[i].bv_page = page;
+   rp->idx = 0;
+   rp->raid_bio = r10_bio;
+   bio->bi_private = rp;
+   if (rbio) {
+   memcpy(rp_repl, rp, sizeof(*rp));
+   rbio->bi_private = rp_repl;
}
}
 
return r10_bio;
 
 out_free_pages:
-   for ( ; i > 0 ; i--)
-   safe_put_page(bio->bi_io_vec[i-1].bv_page);
-   while (j--)
-   for (i = 0; i < RESYNC_PAGES ; i++)
-   safe_put_page(r10_bio->devs[j].bio->bi_io_vec[i].bv_page);
+   while (--j >= 0)
+   resync_free_pages(&rps[j * 2]);
+
j = 0;
 out_free_bio:
for ( ; j < nalloc; j++) {
@@ -213,30 +236,34 @@ static void * r10buf_pool_alloc(gfp_t gfp_flags, void 
*data)
if (r10_bio->devs[j].repl_bio)

[PATCH v1 13/14] md: raid10: retrieve page from preallocated resync page array

2017-02-24 Thread Ming Lei
Now one page array is allocated for each resync bio, and we can
retrieve pages from this table directly.

Signed-off-by: Ming Lei 
---
 drivers/md/raid10.c | 13 +
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 931f5d80608b..ae162d542bf4 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -2065,6 +2065,7 @@ static void sync_request_write(struct mddev *mddev, 
struct r10bio *r10_bio)
int i, first;
struct bio *tbio, *fbio;
int vcnt;
+   struct page **tpages, **fpages;
 
atomic_set(&r10_bio->remaining, 1);
 
@@ -2080,6 +2081,7 @@ static void sync_request_write(struct mddev *mddev, 
struct r10bio *r10_bio)
fbio = r10_bio->devs[i].bio;
fbio->bi_iter.bi_size = r10_bio->sectors << 9;
fbio->bi_iter.bi_idx = 0;
+   fpages = get_resync_pages(fbio)->pages;
 
vcnt = (r10_bio->sectors + (PAGE_SIZE >> 9) - 1) >> (PAGE_SHIFT - 9);
/* now find blocks with errors */
@@ -2094,6 +2096,8 @@ static void sync_request_write(struct mddev *mddev, 
struct r10bio *r10_bio)
continue;
if (i == first)
continue;
+
+   tpages = get_resync_pages(tbio)->pages;
d = r10_bio->devs[i].devnum;
rdev = conf->mirrors[d].rdev;
if (!r10_bio->devs[i].bio->bi_error) {
@@ -2106,8 +2110,8 @@ static void sync_request_write(struct mddev *mddev, 
struct r10bio *r10_bio)
int len = PAGE_SIZE;
if (sectors < (len / 512))
len = sectors * 512;
-   if (memcmp(page_address(fbio->bi_io_vec[j].bv_page),
-  page_address(tbio->bi_io_vec[j].bv_page),
+   if (memcmp(page_address(fpages[j]),
+  page_address(tpages[j]),
   len))
break;
sectors -= len/512;
@@ -2205,6 +2209,7 @@ static void fix_recovery_read_error(struct r10bio 
*r10_bio)
int idx = 0;
int dr = r10_bio->devs[0].devnum;
int dw = r10_bio->devs[1].devnum;
+   struct page **pages = get_resync_pages(bio)->pages;
 
while (sectors) {
int s = sectors;
@@ -2220,7 +2225,7 @@ static void fix_recovery_read_error(struct r10bio 
*r10_bio)
ok = sync_page_io(rdev,
  addr,
  s << 9,
- bio->bi_io_vec[idx].bv_page,
+ pages[idx],
  REQ_OP_READ, 0, false);
if (ok) {
rdev = conf->mirrors[dw].rdev;
@@ -2228,7 +2233,7 @@ static void fix_recovery_read_error(struct r10bio 
*r10_bio)
ok = sync_page_io(rdev,
  addr,
  s << 9,
- bio->bi_io_vec[idx].bv_page,
+ pages[idx],
  REQ_OP_WRITE, 0, false);
if (!ok) {
set_bit(WriteErrorSeen, >flags);
-- 
2.7.4



[PATCH v1 08/14] md: raid1: retrieve page from pre-allocated resync page array

2017-02-24 Thread Ming Lei
Now one page array is allocated for each resync bio, and we can
retrieve pages from this table directly.

Signed-off-by: Ming Lei 
---
 drivers/md/raid1.c | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 4a208220ff0f..9371caace379 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1970,6 +1970,7 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
struct mddev *mddev = r1_bio->mddev;
struct r1conf *conf = mddev->private;
struct bio *bio = r1_bio->bios[r1_bio->read_disk];
+   struct page **pages = get_resync_pages(bio)->pages;
sector_t sect = r1_bio->sector;
int sectors = r1_bio->sectors;
int idx = 0;
@@ -2003,7 +2004,7 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
 */
rdev = conf->mirrors[d].rdev;
if (sync_page_io(rdev, sect, s<<9,
-bio->bi_io_vec[idx].bv_page,
+pages[idx],
 REQ_OP_READ, 0, false)) {
success = 1;
break;
@@ -2058,7 +2059,7 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
continue;
rdev = conf->mirrors[d].rdev;
if (r1_sync_page_io(rdev, sect, s,
-   bio->bi_io_vec[idx].bv_page,
+   pages[idx],
WRITE) == 0) {
r1_bio->bios[d]->bi_end_io = NULL;
rdev_dec_pending(rdev, mddev);
@@ -2073,7 +2074,7 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
continue;
rdev = conf->mirrors[d].rdev;
if (r1_sync_page_io(rdev, sect, s,
-   bio->bi_io_vec[idx].bv_page,
+   pages[idx],
READ) != 0)
atomic_add(s, &rdev->corrected_errors);
}
@@ -2149,6 +2150,8 @@ static void process_checks(struct r1bio *r1_bio)
struct bio *pbio = r1_bio->bios[primary];
struct bio *sbio = r1_bio->bios[i];
int error = sbio->bi_error;
+   struct page **ppages = get_resync_pages(pbio)->pages;
+   struct page **spages = get_resync_pages(sbio)->pages;
 
if (sbio->bi_end_io != end_sync_read)
continue;
@@ -2157,11 +2160,8 @@ static void process_checks(struct r1bio *r1_bio)
 
if (!error) {
for (j = vcnt; j-- ; ) {
-   struct page *p, *s;
-   p = pbio->bi_io_vec[j].bv_page;
-   s = sbio->bi_io_vec[j].bv_page;
-   if (memcmp(page_address(p),
-  page_address(s),
+   if (memcmp(page_address(ppages[j]),
+  page_address(spages[j]),
   sbio->bi_io_vec[j].bv_len))
break;
}
-- 
2.7.4



[PATCH v1 06/14] md: raid1: simplify r1buf_pool_free()

2017-02-24 Thread Ming Lei
This patch takes a reference on each page of each resync bio, so
r1buf_pool_free() gets simplified a lot.

The same policy has been taken in raid10's buf pool allocation/free
too.

Signed-off-by: Ming Lei 
---
 drivers/md/raid1.c | 15 +++
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 2013e5870761..2de0bd69d8da 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -139,9 +139,12 @@ static void * r1buf_pool_alloc(gfp_t gfp_flags, void *data)
/* If not user-requests, copy the page pointers to all bios */
if (!test_bit(MD_RECOVERY_REQUESTED, &pi->mddev->recovery)) {
for (i=0; i<RESYNC_PAGES; i++)
-   for (j=1; j<pi->raid_disks; j++)
-   r1_bio->bios[j]->bi_io_vec[i].bv_page =
+   for (j=1; j<pi->raid_disks; j++) {
+   struct page *page =
r1_bio->bios[0]->bi_io_vec[i].bv_page;
+   get_page(page);
+   r1_bio->bios[j]->bi_io_vec[i].bv_page = page;
+   }
}
 
r1_bio->master_bio = NULL;
@@ -166,12 +169,8 @@ static void r1buf_pool_free(void *__r1_bio, void *data)
struct r1bio *r1bio = __r1_bio;
 
for (i = 0; i < RESYNC_PAGES; i++)
-   for (j = pi->raid_disks; j-- ;) {
-   if (j == 0 ||
-   r1bio->bios[j]->bi_io_vec[i].bv_page !=
-   r1bio->bios[0]->bi_io_vec[i].bv_page)
-   safe_put_page(r1bio->bios[j]->bi_io_vec[i].bv_page);
-   }
+   for (j = pi->raid_disks; j-- ;)
+   safe_put_page(r1bio->bios[j]->bi_io_vec[i].bv_page);
for (i=0 ; i < pi->raid_disks; i++)
bio_put(r1bio->bios[i]);
 
-- 
2.7.4



[PATCH v1 11/14] md: raid10: refactor code of read reshape's .bi_end_io

2017-02-24 Thread Ming Lei
A reshape read request is a bit special and requires one extra
bio which isn't allocated from r10buf_pool.

Refactor the .bi_end_io for reshape reads, so that we can use
raid10's resync page management approach easily in the following
patches.

Signed-off-by: Ming Lei 
---
 drivers/md/raid10.c | 28 ++--
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 227dd6ad7716..c76e08ea4b92 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1907,17 +1907,9 @@ static int raid10_remove_disk(struct mddev *mddev, 
struct md_rdev *rdev)
return err;
 }
 
-static void end_sync_read(struct bio *bio)
+static void __end_sync_read(struct r10bio *r10_bio, struct bio *bio, int d)
 {
-   struct r10bio *r10_bio = bio->bi_private;
struct r10conf *conf = r10_bio->mddev->private;
-   int d;
-
-   if (bio == r10_bio->master_bio) {
-   /* this is a reshape read */
-   d = r10_bio->read_slot; /* really the read dev */
-   } else
-   d = find_bio_disk(conf, r10_bio, bio, NULL, NULL);
 
if (!bio->bi_error)
set_bit(R10BIO_Uptodate, &r10_bio->state);
@@ -1941,6 +1933,22 @@ static void end_sync_read(struct bio *bio)
}
 }
 
+static void end_sync_read(struct bio *bio)
+{
+   struct r10bio *r10_bio = bio->bi_private;
+   struct r10conf *conf = r10_bio->mddev->private;
+   int d = find_bio_disk(conf, r10_bio, bio, NULL, NULL);
+
+   __end_sync_read(r10_bio, bio, d);
+}
+
+static void end_reshape_read(struct bio *bio)
+{
+   struct r10bio *r10_bio = bio->bi_private;
+
+   __end_sync_read(r10_bio, bio, r10_bio->read_slot);
+}
+
 static void end_sync_request(struct r10bio *r10_bio)
 {
struct mddev *mddev = r10_bio->mddev;
@@ -4474,7 +4482,7 @@ static sector_t reshape_request(struct mddev *mddev, 
sector_t sector_nr,
read_bio->bi_iter.bi_sector = (r10_bio->devs[r10_bio->read_slot].addr
   + rdev->data_offset);
read_bio->bi_private = r10_bio;
-   read_bio->bi_end_io = end_sync_read;
+   read_bio->bi_end_io = end_reshape_read;
bio_set_op_attrs(read_bio, REQ_OP_READ, 0);
read_bio->bi_flags &= (~0UL << BIO_RESET_BITS);
read_bio->bi_error = 0;
-- 
2.7.4



[PATCH v1 02/14] block: introduce bio_remove_last_page()

2017-02-24 Thread Ming Lei
MD needs this helper to remove the last added page, so introduce
it.

Signed-off-by: Ming Lei 
---
 block/bio.c | 23 +++
 include/linux/bio.h |  1 +
 2 files changed, 24 insertions(+)

diff --git a/block/bio.c b/block/bio.c
index 5eec5e08417f..0ce7ffcd7939 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -837,6 +837,29 @@ int bio_add_pc_page(struct request_queue *q, struct bio 
*bio, struct page
 EXPORT_SYMBOL(bio_add_pc_page);
 
 /**
+ * bio_remove_last_page - remove the last added page
+ * @bio: destination bio
+ *
+ * Attempt to remove the last added page from the bio_vec maplist.
+ */
+void bio_remove_last_page(struct bio *bio)
+{
+   /*
+* cloned bio must not modify vec list
+*/
+   if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)))
+   return;
+
+   if (bio->bi_vcnt > 0) {
+   struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
+
+   bio->bi_iter.bi_size -= bv->bv_len;
+   bio->bi_vcnt--;
+   }
+}
+EXPORT_SYMBOL(bio_remove_last_page);
+
+/**
  * bio_add_page-   attempt to add page to bio
  * @bio: destination bio
  * @page: page to add
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 3364b3ed90e7..32aeb493d1fe 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -443,6 +443,7 @@ extern void bio_init(struct bio *bio, struct bio_vec *table,
 extern void bio_reset(struct bio *);
 void bio_chain(struct bio *, struct bio *);
 
+extern void bio_remove_last_page(struct bio *bio);
 extern int bio_add_page(struct bio *, struct page *, unsigned int,unsigned 
int);
 extern int bio_add_pc_page(struct request_queue *, struct bio *, struct page *,
   unsigned int, unsigned int);
-- 
2.7.4
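
A short illustration of the intended pairing with bio_add_page(): when
the same page is added to several bios and one of them refuses it, the
callers in the next patch back the page out of the bios that already
accepted it (a sketch, not the exact raid1/raid10 code):

	for (i = 0; i < nbios; i++) {
		if (!bio_add_page(bios[i], page, len, 0)) {
			/* back the page out of the bios that took it */
			while (i-- > 0) {
				bio_remove_last_page(bios[i]);
				bio_clear_flag(bios[i], BIO_SEG_VALID);
			}
			break;
		}
	}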



[PATCH v1 09/14] md: raid1: use bio helper in process_checks()

2017-02-24 Thread Ming Lei
Avoid direct access to the bvec table.

Signed-off-by: Ming Lei 
---
 drivers/md/raid1.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 9371caace379..7363bf56f3b4 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2108,6 +2108,7 @@ static void process_checks(struct r1bio *r1_bio)
int j;
int size;
int error;
+   struct bio_vec *bi;
struct bio *b = r1_bio->bios[i];
struct resync_pages *rp = get_resync_pages(b);
if (b->bi_end_io != end_sync_read)
@@ -2126,9 +2127,7 @@ static void process_checks(struct r1bio *r1_bio)
b->bi_private = rp;
 
size = b->bi_iter.bi_size;
-   for (j = 0; j < vcnt ; j++) {
-   struct bio_vec *bi;
-   bi = &b->bi_io_vec[j];
+   bio_for_each_segment_all(bi, b, j) {
bi->bv_offset = 0;
if (size > PAGE_SIZE)
bi->bv_len = PAGE_SIZE;
@@ -2152,17 +2151,22 @@ static void process_checks(struct r1bio *r1_bio)
int error = sbio->bi_error;
struct page **ppages = get_resync_pages(pbio)->pages;
struct page **spages = get_resync_pages(sbio)->pages;
+   struct bio_vec *bi;
+   int page_len[RESYNC_PAGES];
 
if (sbio->bi_end_io != end_sync_read)
continue;
/* Now we can 'fixup' the error value */
sbio->bi_error = 0;
 
+   bio_for_each_segment_all(bi, sbio, j)
+   page_len[j] = bi->bv_len;
+
if (!error) {
for (j = vcnt; j-- ; ) {
if (memcmp(page_address(ppages[j]),
   page_address(spages[j]),
-  sbio->bi_io_vec[j].bv_len))
+  page_len[j]))
break;
}
} else
-- 
2.7.4



[PATCH v1 07/14] md: raid1: don't use bio's vec table to manage resync pages

2017-02-24 Thread Ming Lei
Now we allocate one page array for managing resync pages, instead
of using the bio's vec table to do that; the old way is very hacky
and won't work any more if multipage bvec is enabled.

The introduced cost is that we need to allocate (128 + 16) * raid_disks
bytes per r1_bio, and that is fine because the number of inflight r1_bios
for resync shouldn't be large, as pointed out by Shaohua.

Also the bio_reset() in raid1_sync_request() is removed because
all bios are freshly new now and it is not necessary to reset them
any more.

This patch can be thought of as a cleanup too.

Suggested-by: Shaohua Li 
Signed-off-by: Ming Lei 
---
 drivers/md/raid1.c | 86 +++---
 1 file changed, 56 insertions(+), 30 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 2de0bd69d8da..4a208220ff0f 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -77,6 +77,16 @@ static void lower_barrier(struct r1conf *conf, sector_t 
sector_nr);
 #define raid1_log(md, fmt, args...)\
do { if ((md)->queue) blk_add_trace_msg((md)->queue, "raid1 " fmt, 
##args); } while (0)
 
+static inline struct resync_pages *get_resync_pages(struct bio *bio)
+{
+   return bio->bi_private;
+}
+
+static inline struct r1bio *get_resync_r1bio(struct bio *bio)
+{
+   return get_resync_pages(bio)->raid_bio;
+}
+
 static void * r1bio_pool_alloc(gfp_t gfp_flags, void *data)
 {
struct pool_info *pi = data;
@@ -104,12 +114,18 @@ static void * r1buf_pool_alloc(gfp_t gfp_flags, void 
*data)
struct r1bio *r1_bio;
struct bio *bio;
int need_pages;
-   int i, j;
+   int j;
+   struct resync_pages *rps;
 
r1_bio = r1bio_pool_alloc(gfp_flags, pi);
if (!r1_bio)
return NULL;
 
+   rps = kmalloc(sizeof(struct resync_pages) * pi->raid_disks,
+ gfp_flags);
+   if (!rps)
+   goto out_free_r1bio;
+
/*
 * Allocate bios : 1 for reading, n-1 for writing
 */
@@ -129,22 +145,22 @@ static void * r1buf_pool_alloc(gfp_t gfp_flags, void 
*data)
need_pages = pi->raid_disks;
else
need_pages = 1;
-   for (j = 0; j < need_pages; j++) {
+   for (j = 0; j < pi->raid_disks; j++) {
+   struct resync_pages *rp = [j];
+
bio = r1_bio->bios[j];
-   bio->bi_vcnt = RESYNC_PAGES;
-
-   if (bio_alloc_pages(bio, gfp_flags))
-   goto out_free_pages;
-   }
-   /* If not user-requests, copy the page pointers to all bios */
-   if (!test_bit(MD_RECOVERY_REQUESTED, &pi->mddev->recovery)) {
-   for (i=0; i<RESYNC_PAGES; i++)
-   for (j=1; j<pi->raid_disks; j++) {
-   struct page *page =
-   r1_bio->bios[0]->bi_io_vec[i].bv_page;
-   get_page(page);
-   r1_bio->bios[j]->bi_io_vec[i].bv_page = page;
-   }
+
+   if (j < need_pages) {
+   if (resync_alloc_pages(rp, gfp_flags))
+   goto out_free_pages;
+   } else {
+   memcpy(rp, [0], sizeof(*rp));
+   resync_get_all_pages(rp);
+   }
+
+   rp->idx = 0;
+   rp->raid_bio = r1_bio;
+   bio->bi_private = rp;
}
 
r1_bio->master_bio = NULL;
@@ -153,11 +169,14 @@ static void * r1buf_pool_alloc(gfp_t gfp_flags, void 
*data)
 
 out_free_pages:
while (--j >= 0)
-   bio_free_pages(r1_bio->bios[j]);
+   resync_free_pages(&rps[j]);
 
 out_free_bio:
while (++j < pi->raid_disks)
bio_put(r1_bio->bios[j]);
+   kfree(rps);
+
+out_free_r1bio:
r1bio_pool_free(r1_bio, data);
return NULL;
 }
@@ -165,14 +184,18 @@ static void * r1buf_pool_alloc(gfp_t gfp_flags, void 
*data)
 static void r1buf_pool_free(void *__r1_bio, void *data)
 {
struct pool_info *pi = data;
-   int i,j;
+   int i;
struct r1bio *r1bio = __r1_bio;
+   struct resync_pages *rp = NULL;
 
-   for (i = 0; i < RESYNC_PAGES; i++)
-   for (j = pi->raid_disks; j-- ;)
-   safe_put_page(r1bio->bios[j]->bi_io_vec[i].bv_page);
-   for (i=0 ; i < pi->raid_disks; i++)
+   for (i = pi->raid_disks; i--; ) {
+   rp = get_resync_pages(r1bio->bios[i]);
+   resync_free_pages(rp);
bio_put(r1bio->bios[i]);
+   }
+
+   /* resync pages array stored in the 1st bio's .bi_private */
+   kfree(rp);
 
r1bio_pool_free(r1bio, data);
 }
@@ -1849,7 +1872,7 @@ static int raid1_remove_disk(struct mddev *mddev, struct 
md_rdev *rdev)
 
 static void end_sync_read(struct bio *bio)
 {
-   struct r1bio *r1_bio = bio->bi_private;
+   struct r1bio *r1_bio = get_resync_r1bio(bio);
 
 

[PATCH v1 04/14] md: move two macros into md.h

2017-02-24 Thread Ming Lei
Both raid1 and raid10 share a common resync
block size and page count, so move the macros into md.h.

Signed-off-by: Ming Lei 
---
 drivers/md/md.h | 5 +
 drivers/md/raid1.c  | 2 --
 drivers/md/raid10.c | 3 ---
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/md/md.h b/drivers/md/md.h
index b8859cbf84b6..1d63239a1be4 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -715,4 +715,9 @@ static inline void mddev_check_writesame(struct mddev 
*mddev, struct bio *bio)
!bdev_get_queue(bio->bi_bdev)->limits.max_write_same_sectors)
mddev->queue->limits.max_write_same_sectors = 0;
 }
+
+/* Maximum size of each resync request */
+#define RESYNC_BLOCK_SIZE (64*1024)
+#define RESYNC_PAGES ((RESYNC_BLOCK_SIZE + PAGE_SIZE-1) / PAGE_SIZE)
+
 #endif /* _MD_MD_H */
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 2a0bf5b430c9..2013e5870761 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -91,10 +91,8 @@ static void r1bio_pool_free(void *r1_bio, void *data)
kfree(r1_bio);
 }
 
-#define RESYNC_BLOCK_SIZE (64*1024)
 #define RESYNC_DEPTH 32
 #define RESYNC_SECTORS (RESYNC_BLOCK_SIZE >> 9)
-#define RESYNC_PAGES ((RESYNC_BLOCK_SIZE + PAGE_SIZE-1) / PAGE_SIZE)
 #define RESYNC_WINDOW (RESYNC_BLOCK_SIZE * RESYNC_DEPTH)
 #define RESYNC_WINDOW_SECTORS (RESYNC_WINDOW >> 9)
 #define CLUSTER_RESYNC_WINDOW (16 * RESYNC_WINDOW)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 125d74dba27e..227dd6ad7716 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -125,9 +125,6 @@ static void r10bio_pool_free(void *r10_bio, void *data)
kfree(r10_bio);
 }
 
-/* Maximum size of each resync request */
-#define RESYNC_BLOCK_SIZE (64*1024)
-#define RESYNC_PAGES ((RESYNC_BLOCK_SIZE + PAGE_SIZE-1) / PAGE_SIZE)
 /* amount of memory to reserve for resync requests */
 #define RESYNC_WINDOW (1024*1024)
 /* maximum number of concurrent requests, memory permitting */
-- 
2.7.4
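
For the common 4 KiB page size these macros work out to
RESYNC_PAGES = (64*1024 + 4096 - 1) / 4096 = 16, which is where the
16-page arrays and the "128 + 16" byte figures quoted in the later
patches of this series come from (16 page pointers of 8 bytes each,
plus the bookkeeping fields).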



[PATCH v1 03/14] md: raid1/raid10: use bio_remove_last_page()

2017-02-24 Thread Ming Lei
Signed-off-by: Ming Lei 
---
 drivers/md/raid1.c  | 3 +--
 drivers/md/raid10.c | 6 ++
 2 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 0628c07dd16d..2a0bf5b430c9 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2912,8 +2912,7 @@ static sector_t raid1_sync_request(struct mddev *mddev, 
sector_t sector_nr,
if (bio->bi_end_io==NULL)
continue;
/* remove last page from this 
bio */
-   bio->bi_vcnt--;
-   bio->bi_iter.bi_size -= len;
+   bio_remove_last_page(bio);
bio_clear_flag(bio, 
BIO_SEG_VALID);
}
goto bio_full;
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 33f6a535dc1f..125d74dba27e 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -3446,8 +3446,7 @@ static sector_t raid10_sync_request(struct mddev *mddev, 
sector_t sector_nr,
 bio2 && bio2 != bio;
 bio2 = bio2->bi_next) {
/* remove last page from this bio */
-   bio2->bi_vcnt--;
-   bio2->bi_iter.bi_size -= len;
+   bio_remove_last_page(bio2);
bio_clear_flag(bio2, BIO_SEG_VALID);
}
goto bio_full;
@@ -4537,8 +4536,7 @@ static sector_t reshape_request(struct mddev *mddev, 
sector_t sector_nr,
 bio2 && bio2 != bio;
 bio2 = bio2->bi_next) {
/* Remove last page from this bio */
-   bio2->bi_vcnt--;
-   bio2->bi_iter.bi_size -= len;
+   bio_remove_last_page(bio2);
bio_clear_flag(bio2, BIO_SEG_VALID);
}
goto bio_full;
-- 
2.7.4



[PATCH v1 05/14] md: prepare for managing resync I/O pages in clean way

2017-02-24 Thread Ming Lei
Now resync I/O uses the bio's bvec table to manage pages;
this is very hacky and may not work any more
once multipage bvec is introduced.

So introduce helpers and a new data structure for
managing resync I/O pages more cleanly.

Signed-off-by: Ming Lei 
---
 drivers/md/md.h | 61 +
 1 file changed, 61 insertions(+)

diff --git a/drivers/md/md.h b/drivers/md/md.h
index 1d63239a1be4..df18ae05838d 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -720,4 +720,65 @@ static inline void mddev_check_writesame(struct mddev 
*mddev, struct bio *bio)
 #define RESYNC_BLOCK_SIZE (64*1024)
 #define RESYNC_PAGES ((RESYNC_BLOCK_SIZE + PAGE_SIZE-1) / PAGE_SIZE)
 
+/* for managing resync I/O pages */
+struct resync_pages {
+   unsigned idx;   /* for get/put page from the pool */
+   void *raid_bio;
+   struct page *pages[RESYNC_PAGES];
+};
+
+static inline int resync_alloc_pages(struct resync_pages *rp,
+gfp_t gfp_flags)
+{
+   int i;
+
+   for (i = 0; i < RESYNC_PAGES; i++) {
+   rp->pages[i] = alloc_page(gfp_flags);
+   if (!rp->pages[i])
+   goto out_free;
+   }
+
+   return 0;
+
+ out_free:
+   while (--i >= 0)
+   __free_page(rp->pages[i]);
+   return -ENOMEM;
+}
+
+static inline void resync_free_pages(struct resync_pages *rp)
+{
+   int i;
+
+   for (i = 0; i < RESYNC_PAGES; i++)
+   __free_page(rp->pages[i]);
+}
+
+static inline void resync_get_all_pages(struct resync_pages *rp)
+{
+   int i;
+
+   for (i = 0; i < RESYNC_PAGES; i++)
+   get_page(rp->pages[i]);
+}
+
+static inline void resync_store_page(struct resync_pages *rp, struct page 
*page)
+{
+   if (WARN_ON(!rp->idx))
+   return;
+   rp->pages[--rp->idx] = page;
+}
+
+static inline struct page *resync_fetch_page(struct resync_pages *rp)
+{
+   if (WARN_ON_ONCE(rp->idx >= RESYNC_PAGES))
+   return NULL;
+   return rp->pages[rp->idx++];
+}
+
+static inline bool resync_page_available(struct resync_pages *rp)
+{
+   return rp->idx < RESYNC_PAGES;
+}
+
 #endif /* _MD_MD_H */
-- 
2.7.4
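
Taken together, the intended life cycle of a resync_pages instance looks
roughly like this (a sketch based on the helpers above and their use in
the later raid1/raid10 patches of this series):

	struct resync_pages *rp = kmalloc(sizeof(*rp), gfp_flags);

	if (!rp || resync_alloc_pages(rp, gfp_flags))
		goto fail;

	rp->idx = 0;
	rp->raid_bio = r1_bio;	/* the owning r1bio/r10bio */
	bio->bi_private = rp;	/* found again via get_resync_pages(bio) */

	/* fill the bio one page at a time */
	while (resync_page_available(rp)) {
		struct page *page = resync_fetch_page(rp);

		bio_add_page(bio, page, PAGE_SIZE, 0);
	}

	/* and on pool tear down */
	resync_free_pages(rp);
	kfree(rp);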



[PATCH v1 00/14] md: cleanup on direct access to bvec table

2017-02-24 Thread Ming Lei
In MD's resync I/O path, there are lots of direct accesses to the bio's
bvec table. This patchset kills almost all of them, and the conversion
is quite straightforward. One root cause of the direct access to the
bvec table is that resync I/O uses the bio's bvec to manage pages.
In V1, as suggested by Shaohua, a new approach is used to manage
these pages for resync I/O; it turns out the code becomes cleaner
and more readable.

Once the direct access to the bvec table in MD is cleaned up, we can
move forward with multipage bvec.

V1:
- allocate page array to manage resync pages

Thanks,
Ming

Ming Lei (14):
  block: introduce bio_segments_all()
  block: introduce bio_remove_last_page()
  md: raid1/raid10: use bio_remove_last_page()
  md: move two macros into md.h
  md: prepare for managing resync I/O pages in clean way
  md: raid1: simplify r1buf_pool_free()
  md: raid1: don't use bio's vec table to manage resync pages
  md: raid1: retrieve page from pre-allocated resync page array
  md: raid1: use bio helper in process_checks()
  md: raid1: use bio_segments_all()
  md: raid10: refactor code of read reshape's .bi_end_io
  md: raid10: don't use bio's vec table to manage resync pages
  md: raid10: retrieve page from preallocated resync page array
  md: raid10: avoid direct access to bvec table in
handle_reshape_read_error

 block/bio.c |  23 +++
 drivers/md/md.h |  66 +++
 drivers/md/raid1.c  | 125 +--
 drivers/md/raid10.c | 187 +++-
 include/linux/bio.h |   8 +++
 5 files changed, 285 insertions(+), 124 deletions(-)

-- 
2.7.4



[PATCH v1 01/14] block: introduce bio_segments_all()

2017-02-24 Thread Ming Lei
So that we can replace the direct access to .bi_vcnt.

Signed-off-by: Ming Lei 
---
 include/linux/bio.h | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 8e521194f6fc..3364b3ed90e7 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -293,6 +293,13 @@ static inline void bio_get_last_bvec(struct bio *bio, 
struct bio_vec *bv)
bv->bv_len = iter.bi_bvec_done;
 }
 
+static inline unsigned bio_segments_all(struct bio *bio)
+{
+   WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED));
+
+   return bio->bi_vcnt;
+}
+
 enum bip_flags {
BIP_BLOCK_INTEGRITY = 1 << 0, /* block layer owns integrity data */
BIP_MAPPED_INTEGRITY= 1 << 1, /* ref tag has been remapped */
-- 
2.7.4