[PATCH] block/loop: fix race between I/O and set_status

2017-02-10 Thread Ming Lei
Inside set_status, the transfer needs to be set up again, so
we have to drain I/O before the transition; otherwise an oops
like the following may be triggered:

divide error:  [#1] SMP KASAN
CPU: 0 PID: 2935 Comm: loop7 Not tainted 4.10.0-rc7+ #213
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs
01/01/2011
task: 88006ba1e840 task.stack: 880067338000
RIP: 0010:transfer_xor+0x1d1/0x440 drivers/block/loop.c:110
RSP: 0018:88006733f108 EFLAGS: 00010246
RAX:  RBX: 8800688d7000 RCX: 0059
RDX:  RSI: 11000d743f43 RDI: 880068891c08
RBP: 88006733f160 R08: 8800688d7001 R09: 
R10:  R11:  R12: 8800688d7000
R13: 880067b7d000 R14: dc00 R15: 
FS:  () GS:88006d00()
knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 006c17e0 CR3: 66e3b000 CR4: 001406f0
Call Trace:
 lo_do_transfer drivers/block/loop.c:251 [inline]
 lo_read_transfer drivers/block/loop.c:392 [inline]
 do_req_filebacked drivers/block/loop.c:541 [inline]
 loop_handle_cmd drivers/block/loop.c:1677 [inline]
 loop_queue_work+0xda0/0x49b0 drivers/block/loop.c:1689
 kthread_worker_fn+0x4c3/0xa30 kernel/kthread.c:630
 kthread+0x326/0x3f0 kernel/kthread.c:227
 ret_from_fork+0x31/0x40 arch/x86/entry/entry_64.S:430
Code: 03 83 e2 07 41 29 df 42 0f b6 04 30 4d 8d 44 24 01 38 d0 7f 08
84 c0 0f 85 62 02 00 00 44 89 f8 41 0f b6 48 ff 25 ff 01 00 00 99 
7d c8 48 63 d2 48 03 55 d0 48 89 d0 48 89 d7 48 c1 e8 03 83
RIP: transfer_xor+0x1d1/0x440 drivers/block/loop.c:110 RSP:
88006733f108
---[ end trace 0166f7bd3b0c0933 ]---

Reported-by: Dmitry Vyukov 
Cc: sta...@vger.kernel.org
Signed-off-by: Ming Lei 
---
 drivers/block/loop.c | 17 -
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index ed5259510857..4b52a1690329 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1097,9 +1097,12 @@ loop_set_status(struct loop_device *lo, const struct 
loop_info64 *info)
if ((unsigned int) info->lo_encrypt_key_size > LO_KEY_SIZE)
return -EINVAL;
 
+   /* I/O need to be drained during transfer transition */
+   blk_mq_freeze_queue(lo->lo_queue);
+
err = loop_release_xfer(lo);
if (err)
-   return err;
+   goto exit;
 
if (info->lo_encrypt_type) {
unsigned int type = info->lo_encrypt_type;
@@ -1114,12 +1117,14 @@ loop_set_status(struct loop_device *lo, const struct 
loop_info64 *info)
 
err = loop_init_xfer(lo, xfer, info);
if (err)
-   return err;
+   goto exit;
 
if (lo->lo_offset != info->lo_offset ||
lo->lo_sizelimit != info->lo_sizelimit)
-   if (figure_loop_size(lo, info->lo_offset, info->lo_sizelimit))
-   return -EFBIG;
+   if (figure_loop_size(lo, info->lo_offset, info->lo_sizelimit)) {
+   err = -EFBIG;
+   goto exit;
+   }
 
loop_config_discard(lo);
 
@@ -1156,7 +1161,9 @@ loop_set_status(struct loop_device *lo, const struct 
loop_info64 *info)
/* update dio if lo_offset or transfer is changed */
__loop_update_dio(lo, lo->use_dio);
 
-   return 0;
+ exit:
+   blk_mq_unfreeze_queue(lo->lo_queue);
+   return err;
 }
 
 static int
-- 
2.7.4



Re: [PATCHv6 15/37] thp: do not threat slab pages as huge in hpage_{nr_pages,size,mask}

2017-02-10 Thread Matthew Wilcox
On Thu, Jan 26, 2017 at 02:57:57PM +0300, Kirill A. Shutemov wrote:
> Slab pages can be compound, but we shouldn't treat them as THP for the
> purpose of the hpage_* helpers, otherwise it would lead to confusing results.
> 
> For instance, ext4 uses slab pages for journal pages and we shouldn't
> confuse them with THPs. The easiest way is to exclude them in hpage_*
> helpers.

Well ... I think we should just deal with compound pages instead of just
huge or regular.  So I'm deferring comment on this patch.

> Signed-off-by: Kirill A. Shutemov 
> ---
>  include/linux/huge_mm.h | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index e5c9c26d2439..5e6c408f5b47 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -137,21 +137,21 @@ static inline spinlock_t *pmd_trans_huge_lock(pmd_t 
> *pmd,
>  }
>  static inline int hpage_nr_pages(struct page *page)
>  {
> - if (unlikely(PageTransHuge(page)))
> + if (unlikely(!PageSlab(page) && PageTransHuge(page)))
>   return HPAGE_PMD_NR;
>   return 1;
>  }
>  
>  static inline int hpage_size(struct page *page)
>  {
> - if (unlikely(PageTransHuge(page)))
> + if (unlikely(!PageSlab(page) && PageTransHuge(page)))
>   return HPAGE_PMD_SIZE;
>   return PAGE_SIZE;
>  }
>  
>  static inline unsigned long hpage_mask(struct page *page)
>  {
> - if (unlikely(PageTransHuge(page)))
> + if (unlikely(!PageSlab(page) && PageTransHuge(page)))
>   return HPAGE_PMD_MASK;
>   return PAGE_MASK;
>  }
> -- 
> 2.11.0
> 


Re: [Nbd] [PATCH] nbd: set the logical and physical blocksize properly

2017-02-10 Thread Josef Bacik
On Fri, 2017-02-10 at 21:07 +0100, Alex Bligh wrote:
> > 
> > On 10 Feb 2017, at 19:06, Josef Bacik  wrote:
> > 
> > We noticed when trying to do O_DIRECT to an export on the server
> > side
> > that we were getting requests smaller than the 4k sectorsize of the
> > device.  This is because the client isn't setting the logical and
> > physical blocksizes properly for the underlying device.  Fix this
> > up by
> > setting the queue blocksizes and then calling bd_set_size.
> Interesting. Some input into the info extension (re blocksizes) would
> definitely be appreciated.
> 

What do you mean?  Right now the client is just calling NBD_SET_BLKSIZE
with 4k blocksize since all of our devices are 4k drives.  Thanks,

Josef
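
For reference, a minimal user-space sketch of that client-side call (the
device path, helper name and error handling are illustrative, not taken from
this thread):

	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <linux/nbd.h>

	/* tell the kernel the export uses 4k sectors */
	int set_4k_blocksize(const char *dev)	/* e.g. "/dev/nbd0" */
	{
		int fd = open(dev, O_RDWR);
		int ret;

		if (fd < 0)
			return -1;
		/* third argument is the block size in bytes */
		ret = ioctl(fd, NBD_SET_BLKSIZE, 4096UL);
		close(fd);
		return ret;
	}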


Re: [Nbd] [PATCH] nbd: set the logical and physical blocksize properly

2017-02-10 Thread Alex Bligh

> On 10 Feb 2017, at 19:06, Josef Bacik  wrote:
> 
> We noticed when trying to do O_DIRECT to an export on the server side
> that we were getting requests smaller than the 4k sectorsize of the
> device.  This is because the client isn't setting the logical and
> physical blocksizes properly for the underlying device.  Fix this up by
> setting the queue blocksizes and then calling bd_set_size.

Interesting. Some input into the info extension (re blocksizes) would
definitely be appreciated.

-- 
Alex Bligh






Re: [WIP PATCHSET 0/4] WIP branch for bfq-mq

2017-02-10 Thread Paolo Valente

> Il giorno 10 feb 2017, alle ore 19:13, Bart Van Assche 
>  ha scritto:
> 
> On 02/10/2017 08:49 AM, Paolo Valente wrote:
>>> $ grep '^C.*_MQ_' .config
>>> CONFIG_BLK_MQ_PCI=y
>>> CONFIG_MQ_IOSCHED_BFQ=y
>>> CONFIG_MQ_IOSCHED_DEADLINE=y
>>> CONFIG_MQ_IOSCHED_NONE=y
>>> CONFIG_DEFAULT_MQ_BFQ_MQ=y
>>> CONFIG_DEFAULT_MQ_IOSCHED="bfq-mq"
>>> CONFIG_SCSI_MQ_DEFAULT=y
>>> CONFIG_DM_MQ_DEFAULT=y
>>> 
>> 
>> Could you reconfigure with none or mq-deadline as default, check
>> whether the system boots, and, if it does, switch manually to bfq-mq,
>> check what happens, and, in the likely case of a failure, try to get
>> the oops?
> 
> Hello Paolo,
> 
> I just finished performing that test with the following kernel config:
> $ grep '^C.*_MQ_' .config
> CONFIG_BLK_MQ_PCI=y
> CONFIG_MQ_IOSCHED_BFQ=y
> CONFIG_MQ_IOSCHED_DEADLINE=y
> CONFIG_MQ_IOSCHED_NONE=y
> CONFIG_DEFAULT_MQ_DEADLINE=y
> CONFIG_DEFAULT_MQ_IOSCHED="mq-deadline"
> CONFIG_SCSI_MQ_DEFAULT=y
> CONFIG_DM_MQ_DEFAULT=y
> 
> After the system came up I logged in, switched to the bfq-mq scheduler
> and ran several I/O tests against the boot disk.

Without any failure, right?

Unfortunately, as you can imagine, no boot failure occurred on
any of my test systems so far :(

This version of bfq-mq can be configured to print all its activity in
the kernel log, just by defining a macro.  This will of course slow
down the system so much as to make it probably unusable if bfq-mq is
active from boot.  Yet, the failure may still occur early enough to make
this approach useful for discovering where bfq-mq gets stuck.  As of now I
have no better ideas.  Any suggestion is welcome.

Thanks,
Paolo

> Sorry but nothing
> interesting appeared in the kernel log.
> 
> Bart.



Re: [WIP PATCHSET 0/4] WIP branch for bfq-mq

2017-02-10 Thread Paolo Valente

> Il giorno 10 feb 2017, alle ore 19:34, Bart Van Assche 
>  ha scritto:
> 
> On Tue, 2017-02-07 at 18:24 +0100, Paolo Valente wrote:
>> (lock assertions, BUG_ONs, ...).
> 
> Hello Paolo,
> 
> If you are using BUG_ON(), does that mean that you are not aware of Linus'
> opinion about BUG_ON()? Please read https://lkml.org/lkml/2016/10/4/1.
> 

I am, thanks.  But this is a testing version, overflowing with assertions
as a form of hysterical defensive programming.  I will of course remove
all halting assertions in the submission for merging.

Thanks,
Paolo

> Thanks,
> 
> Bart.



Re: [PATCH] blk-mq-sched: don't hold queue_lock when calling exit_icq

2017-02-10 Thread Jens Axboe
On 02/10/2017 11:32 AM, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> None of the other blk-mq elevator hooks are called with this lock held.
> Additionally, it can lead to circular locking dependencies between
> queue_lock and the private scheduler lock.

Applied, thanks Omar.

-- 
Jens Axboe



[PATCH] block: set make_request_fn manually in blk_mq_update_nr_hw_queues

2017-02-10 Thread Josef Bacik
Calling blk_queue_make_request resets a bunch of settings on the
request_queue, but all we really want is to update the make_request_fn,
so do this directly so we don't lose things like the logical and
physical block sizes.

Signed-off-by: Josef Bacik 
---
 block/blk-mq.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index be183e61..eccbb13 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2627,10 +2627,14 @@ void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set 
*set, int nr_hw_queues)
list_for_each_entry(q, &set->tag_list, tag_set_list) {
blk_mq_realloc_hw_ctxs(set, q);
 
+   /*
+* Manually set the make_request_fn as blk_queue_make_request
+* resets a lot of the queue settings.
+*/
if (q->nr_hw_queues > 1)
-   blk_queue_make_request(q, blk_mq_make_request);
+   q->make_request_fn = blk_mq_make_request;
else
-   blk_queue_make_request(q, blk_sq_make_request);
+   q->make_request_fn = blk_sq_make_request;
 
blk_mq_queue_reinit(q, cpu_online_mask);
}
-- 
2.7.4



Re: [PATCH] block: set make_request_fn manually in blk_mq_update_nr_hw_queues

2017-02-10 Thread Jens Axboe
On 02/10/2017 11:03 AM, Josef Bacik wrote:
> Calling blk_queue_make_request resets a bunch of settings on the
> request_queue, but all we really want is to update the make_request_fn,
> so do this directly so we don't lose things like the logical and
> physical block sizes.

Applied, thanks Josef.

-- 
Jens Axboe



[PATCH] blk-mq-sched: don't hold queue_lock when calling exit_icq

2017-02-10 Thread Omar Sandoval
From: Omar Sandoval 

None of the other blk-mq elevator hooks are called with this lock held.
Additionally, it can lead to circular locking dependencies between
queue_lock and the private scheduler lock.

Reported-by: Paolo Valente 
Signed-off-by: Omar Sandoval 
---
 block/blk-ioc.c | 22 --
 1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index fe186a9eade9..b12f9c87b4c3 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -35,7 +35,10 @@ static void icq_free_icq_rcu(struct rcu_head *head)
kmem_cache_free(icq->__rcu_icq_cache, icq);
 }
 
-/* Exit an icq. Called with both ioc and q locked. */
+/*
+ * Exit an icq. Called with both ioc and q locked for sq, only ioc locked for
+ * mq.
+ */
 static void ioc_exit_icq(struct io_cq *icq)
 {
struct elevator_type *et = icq->q->elevator->type;
@@ -166,6 +169,7 @@ EXPORT_SYMBOL(put_io_context);
  */
 void put_io_context_active(struct io_context *ioc)
 {
+   struct elevator_type *et;
unsigned long flags;
struct io_cq *icq;
 
@@ -184,13 +188,19 @@ void put_io_context_active(struct io_context *ioc)
hlist_for_each_entry(icq, &ioc->icq_list, ioc_node) {
if (icq->flags & ICQ_EXITED)
continue;
-   if (spin_trylock(icq->q->queue_lock)) {
+
+   et = icq->q->elevator->type;
+   if (et->uses_mq) {
ioc_exit_icq(icq);
-   spin_unlock(icq->q->queue_lock);
} else {
-   spin_unlock_irqrestore(&ioc->lock, flags);
-   cpu_relax();
-   goto retry;
+   if (spin_trylock(icq->q->queue_lock)) {
+   ioc_exit_icq(icq);
+   spin_unlock(icq->q->queue_lock);
+   } else {
+   spin_unlock_irqrestore(&ioc->lock, flags);
+   cpu_relax();
+   goto retry;
+   }
}
}
spin_unlock_irqrestore(&ioc->lock, flags);
-- 
2.11.1



Re: [PATCHv6 16/37] thp: make thp_get_unmapped_area() respect S_HUGE_MODE

2017-02-10 Thread Matthew Wilcox
On Thu, Jan 26, 2017 at 02:57:58PM +0300, Kirill A. Shutemov wrote:
> We want mmap(NULL) to return PMD-aligned address if the inode can have
> huge pages in page cache.
> 
> Signed-off-by: Kirill A. Shutemov 

Reviewed-by: Matthew Wilcox 


Re: [WIP PATCHSET 0/4] WIP branch for bfq-mq

2017-02-10 Thread Bart Van Assche
On 02/10/2017 08:49 AM, Paolo Valente wrote:
>> $ grep '^C.*_MQ_' .config
>> CONFIG_BLK_MQ_PCI=y
>> CONFIG_MQ_IOSCHED_BFQ=y
>> CONFIG_MQ_IOSCHED_DEADLINE=y
>> CONFIG_MQ_IOSCHED_NONE=y
>> CONFIG_DEFAULT_MQ_BFQ_MQ=y
>> CONFIG_DEFAULT_MQ_IOSCHED="bfq-mq"
>> CONFIG_SCSI_MQ_DEFAULT=y
>> CONFIG_DM_MQ_DEFAULT=y
>>
> 
> Could you reconfigure with none or mq-deadline as default, check
> whether the system boots, and, if it does, switch manually to bfq-mq,
> check what happens, and, in the likely case of a failure, try to get
> the oops?

Hello Paolo,

I just finished performing that test with the following kernel config:
$ grep '^C.*_MQ_' .config
CONFIG_BLK_MQ_PCI=y
CONFIG_MQ_IOSCHED_BFQ=y
CONFIG_MQ_IOSCHED_DEADLINE=y
CONFIG_MQ_IOSCHED_NONE=y
CONFIG_DEFAULT_MQ_DEADLINE=y
CONFIG_DEFAULT_MQ_IOSCHED="mq-deadline"
CONFIG_SCSI_MQ_DEFAULT=y
CONFIG_DM_MQ_DEFAULT=y

After the system came up I logged in, switched to the bfq-mq scheduler
and ran several I/O tests against the boot disk. Sorry but nothing
interesting appeared in the kernel log.

Bart.


Re: [PATCHv6 12/37] brd: make it handle huge pages

2017-02-10 Thread Matthew Wilcox
On Thu, Jan 26, 2017 at 02:57:54PM +0300, Kirill A. Shutemov wrote:
> Do not assume the length of a bio segment is never larger than PAGE_SIZE.
> With huge pages it's HPAGE_PMD_SIZE (2M on x86-64).

I don't think we even need hugepages for BRD to be buggy.  I think there are
already places which allocate compound pages (not in highmem, obviously ...)
and put them in biovecs.  So this is purely and simply a bugfix.

That said, I find the current code in brd a bit inelegant, and I don't
think this patch helps... indeed, I think it's buggy:

> @@ -202,12 +202,15 @@ static int copy_to_brd_setup(struct brd_device *brd, 
> sector_t sector, size_t n)
>   size_t copy;
>  
>   copy = min_t(size_t, n, PAGE_SIZE - offset);
> + n -= copy;
>   if (!brd_insert_page(brd, sector))
>   return -ENOSPC;
> - if (copy < n) {
> + while (n) {
>   sector += copy >> SECTOR_SHIFT;
>   if (!brd_insert_page(brd, sector))
>   return -ENOSPC;
> + copy = min_t(size_t, n, PAGE_SIZE);
> + n -= copy;
>   }

We're decrementing 'n' to 0, then testing it, so we never fill in the
last page ... right?

Anyway, here's my effort.  Untested.

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 3adc32a3153b..0802a6abcd81 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -202,12 +202,14 @@ static int copy_to_brd_setup(struct brd_device *brd, 
sector_t sector, size_t n)
size_t copy;
 
copy = min_t(size_t, n, PAGE_SIZE - offset);
-   if (!brd_insert_page(brd, sector))
-   return -ENOSPC;
-   if (copy < n) {
-   sector += copy >> SECTOR_SHIFT;
+   for (;;) {
if (!brd_insert_page(brd, sector))
return -ENOSPC;
+   n -= copy;
+   if (!n)
+   break;
+   sector += copy >> SECTOR_SHIFT;
+   copy = min_t(size_t, n, PAGE_SIZE);
}
return 0;
 }
@@ -239,26 +241,23 @@ static void copy_to_brd(struct brd_device *brd, const 
void *src,
struct page *page;
void *dst;
unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT;
-   size_t copy;
+   size_t copy = min_t(size_t, n, PAGE_SIZE - offset);
 
-   copy = min_t(size_t, n, PAGE_SIZE - offset);
-   page = brd_lookup_page(brd, sector);
-   BUG_ON(!page);
-
-   dst = kmap_atomic(page);
-   memcpy(dst + offset, src, copy);
-   kunmap_atomic(dst);
-
-   if (copy < n) {
-   src += copy;
-   sector += copy >> SECTOR_SHIFT;
-   copy = n - copy;
+   for (;;) {
page = brd_lookup_page(brd, sector);
BUG_ON(!page);
 
dst = kmap_atomic(page);
-   memcpy(dst, src, copy);
+   memcpy(dst + offset, src, copy);
kunmap_atomic(dst);
+
+   n -= copy;
+   if (!n)
+   break;
+   src += copy;
+   sector += copy >> SECTOR_SHIFT;
+   offset = 0;
+   copy = min_t(size_t, n, PAGE_SIZE);
}
 }
 
@@ -271,28 +270,24 @@ static void copy_from_brd(void *dst, struct brd_device 
*brd,
struct page *page;
void *src;
unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT;
-   size_t copy;
+   size_t copy = min_t(size_t, n, PAGE_SIZE - offset);
 
-   copy = min_t(size_t, n, PAGE_SIZE - offset);
-   page = brd_lookup_page(brd, sector);
-   if (page) {
-   src = kmap_atomic(page);
-   memcpy(dst, src + offset, copy);
-   kunmap_atomic(src);
-   } else
-   memset(dst, 0, copy);
-
-   if (copy < n) {
-   dst += copy;
-   sector += copy >> SECTOR_SHIFT;
-   copy = n - copy;
+   for (;;) {
page = brd_lookup_page(brd, sector);
if (page) {
src = kmap_atomic(page);
-   memcpy(dst, src, copy);
+   memcpy(dst, src + offset, copy);
kunmap_atomic(src);
} else
memset(dst, 0, copy);
+
+   n -= copy;
+   if (!n)
+   break;
+   dst += copy;
+   sector += copy >> SECTOR_SHIFT;
+   offset = 0;
+   copy = min_t(size_t, n, PAGE_SIZE);
}
 }
 


[PATCH] nbd: set the logical and physical blocksize properly

2017-02-10 Thread Josef Bacik
We noticed when trying to do O_DIRECT to an export on the server side
that we were getting requests smaller than the 4k sectorsize of the
device.  This is because the client isn't setting the logical and
physical blocksizes properly for the underlying device.  Fix this up by
setting the queue blocksizes and then calling bd_set_size.

Signed-off-by: Josef Bacik 
---
 drivers/block/nbd.c | 39 ---
 1 file changed, 16 insertions(+), 23 deletions(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 34a280a..e0d770c 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -183,7 +183,7 @@ static int nbd_clear_reconnect(struct nbd_device *nbd)
 
 static int nbd_size_clear(struct nbd_device *nbd, struct block_device *bdev)
 {
-   bdev->bd_inode->i_size = 0;
+   bd_set_size(bdev, 0);
set_capacity(nbd->disk, 0);
kobject_uevent(&nbd_to_dev(nbd)->kobj, KOBJ_CHANGE);
 
@@ -192,29 +192,20 @@ static int nbd_size_clear(struct nbd_device *nbd, struct 
block_device *bdev)
 
 static void nbd_size_update(struct nbd_device *nbd, struct block_device *bdev)
 {
-   if (!nbd_is_connected(nbd))
-   return;
-
-   bdev->bd_inode->i_size = nbd->bytesize;
+   blk_queue_logical_block_size(nbd->disk->queue, nbd->blksize);
+   blk_queue_physical_block_size(nbd->disk->queue, nbd->blksize);
+   bd_set_size(bdev, nbd->bytesize);
set_capacity(nbd->disk, nbd->bytesize >> 9);
kobject_uevent(&nbd_to_dev(nbd)->kobj, KOBJ_CHANGE);
 }
 
-static int nbd_size_set(struct nbd_device *nbd, struct block_device *bdev,
+static void nbd_size_set(struct nbd_device *nbd, struct block_device *bdev,
loff_t blocksize, loff_t nr_blocks)
 {
-   int ret;
-
-   ret = set_blocksize(bdev, blocksize);
-   if (ret)
-   return ret;
-
nbd->blksize = blocksize;
nbd->bytesize = blocksize * nr_blocks;
-
-   nbd_size_update(nbd, bdev);
-
-   return 0;
+   if (nbd_is_connected(nbd))
+   nbd_size_update(nbd, bdev);
 }
 
 static void nbd_end_request(struct nbd_cmd *cmd)
@@ -875,6 +866,7 @@ static void send_disconnects(struct nbd_device *nbd)
 
 static int nbd_disconnect(struct nbd_device *nbd, struct block_device *bdev)
 {
+   printk(KERN_ERR "%u is the blocksize at disconnect\n", 
bdev_logical_block_size(bdev));
dev_info(disk_to_dev(nbd->disk), "NBD_DISCONNECT\n");
if (!nbd_socks_get_unless_zero(nbd))
return -EINVAL;
@@ -997,15 +989,16 @@ static int __nbd_ioctl(struct block_device *bdev, struct 
nbd_device *nbd,
case NBD_SET_SOCK:
return nbd_add_socket(nbd, bdev, arg);
case NBD_SET_BLKSIZE:
-   return nbd_size_set(nbd, bdev, arg,
-   div_s64(nbd->bytesize, arg));
+   nbd_size_set(nbd, bdev, arg,
+div_s64(nbd->bytesize, arg));
+   return 0;
case NBD_SET_SIZE:
-   return nbd_size_set(nbd, bdev, nbd->blksize,
-   div_s64(arg, nbd->blksize));
-
+   nbd_size_set(nbd, bdev, nbd->blksize,
+div_s64(arg, nbd->blksize));
+   return 0;
case NBD_SET_SIZE_BLOCKS:
-   return nbd_size_set(nbd, bdev, nbd->blksize, arg);
-
+   nbd_size_set(nbd, bdev, nbd->blksize, arg);
+   return 0;
case NBD_SET_TIMEOUT:
nbd->tag_set.timeout = arg * HZ;
return 0;
-- 
2.7.4



Re: [PATCHv6 17/37] fs: make block_read_full_page() be able to read huge page

2017-02-10 Thread Matthew Wilcox
On Thu, Jan 26, 2017 at 02:57:59PM +0300, Kirill A. Shutemov wrote:
> The approach is straightforward: for compound pages we read out the whole
> huge page.

Ouch.  bufferheads must die ;-)

> For a huge page we cannot have the array of buffer head pointers on the stack --
> it's 4096 pointers on x86-64 -- so 'arr' is allocated with kmalloc() for
> huge pages.
> 
> Signed-off-by: Kirill A. Shutemov 

Reviewed-by: Matthew Wilcox 


Re: [PATCHv6 13/37] mm: make write_cache_pages() work on huge pages

2017-02-10 Thread Matthew Wilcox
On Thu, Jan 26, 2017 at 02:57:55PM +0300, Kirill A. Shutemov wrote:
> We write back a whole huge page at a time. Let's adjust the iteration accordingly.
> 
> Signed-off-by: Kirill A. Shutemov 

I think a lot of the complexity in this patch is from pagevec_lookup_tag
giving you subpages rather than head pages...

> @@ -2268,7 +2273,8 @@ int write_cache_pages(struct address_space *mapping,
>* not be suitable for data integrity
>* writeout).
>*/
> - done_index = page->index + 1;
> + done_index = compound_head(page)->index
> + + hpage_nr_pages(page);
>   done = 1;
>   break;
>   }

you'd still need this line, but it'd only be:

done_index = page->index +
(1 << compound_order(page));

I think we want:

#define nr_pages(page)  (1 << compound_order(page))

because we seem to be repeating that idiom quite a lot in these patches.

done_index = page->index +
nr_pages(page);

Still doesn't quite fit on one line, but it's closer, and it's the
ridiculous indentation in that function that's the real problem.


RE: [PATCH v2 4/4] Maintainers: Add Information for SED Opal library

2017-02-10 Thread Elliott, Robert (Persistent Memory)


> -Original Message-
> From: linux-block-ow...@vger.kernel.org [mailto:linux-block-
> ow...@vger.kernel.org] On Behalf Of Scott Bauer
> Sent: Tuesday, November 29, 2016 3:52 PM
> To: linux-n...@lists.infradead.org
> Cc: rafael.antogno...@intel.com; ax...@fb.com; keith.bu...@intel.com;
> jonathan.derr...@intel.com; j.naum...@fu-berlin.de; h...@infradead.org;
> linux-block@vger.kernel.org; s...@grimberg.me; Scott Bauer
> 
> Subject: [PATCH v2 4/4] Maintainers: Add Information for SED Opal library
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 8d414840..929eba3 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -10846,6 +10846,16 @@ L:   linux-...@vger.kernel.org
>  S:   Maintained
>  F:   drivers/mmc/host/sdhci-spear.c
> 
> +SECURE ENCRYPTING DEVICE (SED) OPAL DRIVER
> +M:   Scott Bauer 
> +M:   Jonathan Derrick 
> +M:   Rafael Antognolli 
> +L:   linux-n...@lists.infradead.org
> +S:   Supported
> +F:   block/sed*
> +F:   include/linux/sed*
> +F:   include/uapi/linux/sed*

Since this is in the block tree and is not nvme specific, could you use 
linux-block as the main list?

Apparently the series has progressed to v6, but only on the 
linux-nvme list.

---
Robert Elliott, HPE Persistent Memory






Re: [PATCH v2 4/4] Maintainers: Add Information for SED Opal library

2017-02-10 Thread Scott Bauer
On Fri, Feb 10, 2017 at 08:46:09AM -0800, Elliott, Robert (Persistent Memory) 
wrote:
> 
> 
> > -Original Message-
> > From: linux-block-ow...@vger.kernel.org [mailto:linux-block-
> > ow...@vger.kernel.org] On Behalf Of Scott Bauer
> > Sent: Tuesday, November 29, 2016 3:52 PM
> > To: linux-n...@lists.infradead.org
> > Cc: rafael.antogno...@intel.com; ax...@fb.com; keith.bu...@intel.com;
> > jonathan.derr...@intel.com; j.naum...@fu-berlin.de; h...@infradead.org;
> > linux-block@vger.kernel.org; s...@grimberg.me; Scott Bauer
> > 
> > Subject: [PATCH v2 4/4] Maintainers: Add Information for SED Opal library
> >
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index 8d414840..929eba3 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -10846,6 +10846,16 @@ L:linux-...@vger.kernel.org
> >  S:Maintained
> >  F:drivers/mmc/host/sdhci-spear.c
> >
> > +SECURE ENCRYPTING DEVICE (SED) OPAL DRIVER
> > +M:Scott Bauer 
> > +M:Jonathan Derrick 
> > +M:Rafael Antognolli 
> > +L:linux-n...@lists.infradead.org
> > +S:Supported
> > +F:block/sed*
> > +F:include/linux/sed*
> > +F:include/uapi/linux/sed*
> 
> Since this is in the block tree and is not nvme specific, could you use
> linux-block as the main list?

Sure, that's fine. Is there a way to denote "main list" in maintainers,
or just list it first?
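
(As far as I know there is no explicit "main list" field in MAINTAINERS; the
usual practice is simply to put the preferred list first.  A hypothetical
ordering, for illustration only:)

SECURE ENCRYPTING DEVICE (SED) OPAL DRIVER
M:	Scott Bauer
M:	Jonathan Derrick
M:	Rafael Antognolli
L:	linux-block@vger.kernel.org
L:	linux-n...@lists.infradead.org
S:	Supported
F:	block/sed*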


Re: [WIP PATCHSET 0/4] WIP branch for bfq-mq

2017-02-10 Thread Paolo Valente

> Il giorno 10 feb 2017, alle ore 17:45, Bart Van Assche 
>  ha scritto:
> 
> On Tue, 2017-02-07 at 18:24 +0100, Paolo Valente wrote:
>> 2) Enable people to test this first version bfq-mq.
> 
> Hello Paolo,
> 
> I installed this version of bfq-mq on a server that boots from a SATA
> disk. That server boots fine with kernel v4.10-rc7 but not with this
> tree. The first 30 seconds of the boot process seem to proceed normally
> but after that time the messages on the console stop scrolling and
> about another 30 seconds later the server reboots. I haven't found
> anything useful in the system log. I configured the block layer as
> follows:
> 
> $ grep '^C.*_MQ_' .config
> CONFIG_BLK_MQ_PCI=y
> CONFIG_MQ_IOSCHED_BFQ=y
> CONFIG_MQ_IOSCHED_DEADLINE=y
> CONFIG_MQ_IOSCHED_NONE=y
> CONFIG_DEFAULT_MQ_BFQ_MQ=y
> CONFIG_DEFAULT_MQ_IOSCHED="bfq-mq"
> CONFIG_SCSI_MQ_DEFAULT=y
> CONFIG_DM_MQ_DEFAULT=y
> 

Could you reconfigure with none or mq-deadline as default, check
whether the system boots, and, if it does, switch manually to bfq-mq,
check what happens, and, in the likely case of a failure, try to get
the oops?

Thank you very much,
Paolo

> Bart.



Re: [WIP PATCHSET 0/4] WIP branch for bfq-mq

2017-02-10 Thread Paolo Valente

> Il giorno 10 feb 2017, alle ore 17:08, Bart Van Assche 
>  ha scritto:
> 
> On Tue, 2017-02-07 at 18:24 +0100, Paolo Valente wrote:
>> [1] https://github.com/Algodev-github/bfq-mq
> 
> Hello Paolo,
> 
> That branch includes two changes of the version suffix (EXTRAVERSION in 
> Makefile).
> Please don't do that but set CONFIG_LOCALVERSION in .config to add a suffix to
> the kernel version string.
> 

I know, thanks. Unfortunately, you will probably find many other irregular things
in that sort of private branch (as for that suffix, for some reason it was
handy for me to have it tracked by git).

Thanks,
Paolo

> Thanks,
> 
> Bart.



Re: [WIP PATCHSET 0/4] WIP branch for bfq-mq

2017-02-10 Thread Bart Van Assche
On Tue, 2017-02-07 at 18:24 +0100, Paolo Valente wrote:
> 2) Enable people to test this first version bfq-mq.

Hello Paolo,

I installed this version of bfq-mq on a server that boots from a SATA
disk. That server boots fine with kernel v4.10-rc7 but not with this
tree. The first 30 seconds of the boot process seem to proceed normally
but after that time the messages on the console stop scrolling and
about another 30 seconds later the server reboots. I haven't found
anything useful in the system log. I configured the block layer as
follows:

$ grep '^C.*_MQ_' .config
CONFIG_BLK_MQ_PCI=y
CONFIG_MQ_IOSCHED_BFQ=y
CONFIG_MQ_IOSCHED_DEADLINE=y
CONFIG_MQ_IOSCHED_NONE=y
CONFIG_DEFAULT_MQ_BFQ_MQ=y
CONFIG_DEFAULT_MQ_IOSCHED="bfq-mq"
CONFIG_SCSI_MQ_DEFAULT=y
CONFIG_DM_MQ_DEFAULT=y

Bart.


Re: [PATCH] bfq-mq: cause deadlock by executing exit_icq body immediately

2017-02-10 Thread Jens Axboe
On 02/10/2017 06:00 AM, Paolo Valente wrote:
> 
>> Il giorno 08 feb 2017, alle ore 18:17, Omar Sandoval  
>> ha scritto:
>>
>> On Wed, Feb 08, 2017 at 11:39:24AM +0100, Paolo Valente wrote:
>>>
 Il giorno 08 feb 2017, alle ore 11:33, Omar Sandoval  
 ha scritto:

 On Wed, Feb 08, 2017 at 11:03:01AM +0100, Paolo Valente wrote:
>
>> Il giorno 07 feb 2017, alle ore 22:45, Omar Sandoval 
>>  ha scritto:
>>
>> On Tue, Feb 07, 2017 at 06:33:46PM +0100, Paolo Valente wrote:
>>> Hi,
>>> this patch is meant to show that, if the  body of the hook exit_icq is 
>>> executed
>>> from inside that hook, and not as deferred work, then a circular 
>>> deadlock
>>> occurs.
>>>
>>> It happens if, on a CPU
>>> - the body of icq_exit takes the scheduler lock,
>>> - it does so from inside the exit_icq hook, which is invoked with the 
>>> queue
>>> lock held
>>>
>>> while, on another CPU
>>> - bfq_bio_merge, after taking the scheduler lock, invokes 
>>> bfq_bic_lookup,
>>> which, in its turn, takes the queue lock. bfq_bic_lookup needs to take 
>>> such a
>>> lock, because it invokes ioc_lookup_icq.
>>>
>>> For more details, here is a lockdep report, right before the deadlock 
>>> did occur.
>>>
>>> [   44.059877] ==
>>> [   44.124922] [ INFO: possible circular locking dependency detected ]
>>> [   44.125795] 4.10.0-rc5-bfq-mq+ #38 Not tainted
>>> [   44.126414] ---
>>> [   44.127291] sync/2043 is trying to acquire lock:
>>> [   44.128918]  (&(>lock)->rlock){-.-...}, at: 
>>> [] bfq_exit_icq_bfqq+0x55/0x140
>>> [   44.134052]
>>> [   44.134052] but task is already holding lock:
>>> [   44.134868]  (&(>__queue_lock)->rlock){-.}, at: 
>>> [] put_io_context_active+0x6e/0xc0
>>
>> Hey, Paolo,
>>
>> I only briefly skimmed the code, but what are you using the queue_lock
>> for? You should just use your scheduler lock everywhere. blk-mq doesn't
>> use the queue lock, so the scheduler is the only thing you need mutual
>> exclusion against.
>
> Hi Omar,
> the cause of the problem is that the hook functions bfq_request_merge
> and bfq_allow_bio_merge invoke, directly or through other functions,
> the function bfq_bic_lookup, which, in its turn, invokes
> ioc_lookup_icq.  The latter must be invoked with the queue lock held.
> In particular the offending lines in bfq_bic_lookup are:
>
>   spin_lock_irqsave(q->queue_lock, flags);
>   icq = icq_to_bic(ioc_lookup_icq(ioc, q));
>   spin_unlock_irqrestore(q->queue_lock, flags);
>
> Maybe I'm missing something and we can avoid taking this lock?

 Ah, I didn't realize we still used the q->queue_lock for the icq stuff.
 You're right, you still need that lock for ioc_lookup_icq(). Unless
 there's something else I'm forgetting, that should be the only thing you
 need it for in the core code, and you should use your scheduler lock for
 everything else. What else are you using q->queue_lock for? 
>>>
>>> Nothing.  The deadlock follows from the fact that bfq_request_merge gets called
>>> with the scheduler lock already held.  Problematic paths start from:
>>> bfq_bio_merge and bfq_insert_request.
>>>
>>> I'm trying to understand whether I/we can reorder operations in some
>>> way that avoids the nested locking, but to no avail so far.
>>>
>>> Thanks,
>>> Paolo
>>
>> Okay, I understand what you're saying now. It was all in the first email
>> but I didn't see it right away, sorry about that.
>>
>> I don't think it makes sense for ->exit_icq() to be invoked while
>> holding q->queue_lock for blk-mq -- we don't hold that lock for any of
>> the other hooks. Could you try the below? I haven't convinced myself
>> that there isn't a circular locking dependency between bfqd->lock and
>> ioc->lock now, but it's probably easiest for you to just try it.
>>
> 
> Just passed the last of a heavy batch of tests!

Omar, care to turn this into a real patch and submit it?

-- 
Jens Axboe



Re: [WIP PATCHSET 0/4] WIP branch for bfq-mq

2017-02-10 Thread Bart Van Assche
On Tue, 2017-02-07 at 18:24 +0100, Paolo Valente wrote:
> [1] https://github.com/Algodev-github/bfq-mq

Hello Paolo,

That branch includes two changes of the version suffix (EXTRAVERSION in 
Makefile).
Please don't do that but set CONFIG_LOCALVERSION in .config to add a suffix to
the kernel version string.

Thanks,

Bart.


[PATCH V4 1/2] uapi: sed-opal fix IOW for activate lsp to use correct struct

2017-02-10 Thread Scott Bauer
The IOW for IOC_OPAL_ACTIVATE_LSP took the wrong structure, which
would give us the wrong size when using _IOC_SIZE; switch it to the
right structure.

Fixes: 058f8a2 ("Include: Uapi: Add user ABI for Sed/Opal")

Signed-off-by: Scott Bauer 
---
 include/uapi/linux/sed-opal.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/uapi/linux/sed-opal.h b/include/uapi/linux/sed-opal.h
index fc06e3a..c72e073 100644
--- a/include/uapi/linux/sed-opal.h
+++ b/include/uapi/linux/sed-opal.h
@@ -106,7 +106,7 @@ struct opal_mbr_data {
 #define IOC_OPAL_SAVE  _IOW('p', 220, struct opal_lock_unlock)
 #define IOC_OPAL_LOCK_UNLOCK   _IOW('p', 221, struct opal_lock_unlock)
 #define IOC_OPAL_TAKE_OWNERSHIP_IOW('p', 222, struct opal_key)
-#define IOC_OPAL_ACTIVATE_LSP   _IOW('p', 223, struct opal_key)
+#define IOC_OPAL_ACTIVATE_LSP   _IOW('p', 223, struct opal_lr_act)
 #define IOC_OPAL_SET_PW _IOW('p', 224, struct opal_new_pw)
 #define IOC_OPAL_ACTIVATE_USR   _IOW('p', 225, struct opal_session_info)
 #define IOC_OPAL_REVERT_TPR _IOW('p', 226, struct opal_key)
-- 
2.7.4
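
(Background note, not part of the patch: _IOW() encodes the sizeof() of the
named struct into the ioctl number, and _IOC_SIZE() simply reads it back out,
so naming the wrong struct silently changes how many bytes later get copied in.
For illustration:)

	#define IOC_OPAL_ACTIVATE_LSP _IOW('p', 223, struct opal_lr_act)
	/* _IOC_SIZE(IOC_OPAL_ACTIVATE_LSP) == sizeof(struct opal_lr_act) */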



[PATCH V4 2/2] Move stack parameters for sed_ioctl to prevent oversized stack with CONFIG_KASAN

2017-02-10 Thread Scott Bauer
When CONFIG_KASAN is enabled, compilation fails:

block/sed-opal.c: In function 'sed_ioctl':
block/sed-opal.c:2447:1: error: the frame size of 2256 bytes is larger than 
2048 bytes [-Werror=frame-larger-than=]

Move all the ioctl structures off the stack and allocate them dynamically,
using _IOC_SIZE() for the copy size.

Fixes: 455a7b238cd6 ("block: Add Sed-opal library")

Reported-by: Arnd Bergmann 
Signed-off-by: Scott Bauer 
---
 block/sed-opal.c | 132 ---
 1 file changed, 48 insertions(+), 84 deletions(-)

diff --git a/block/sed-opal.c b/block/sed-opal.c
index bf1406e..d09a089 100644
--- a/block/sed-opal.c
+++ b/block/sed-opal.c
@@ -2346,7 +2346,10 @@ EXPORT_SYMBOL(opal_unlock_from_suspend);
 
 int sed_ioctl(struct opal_dev *dev, unsigned int cmd, unsigned long ptr)
 {
+   void *ioctl_ptr;
+   int ret = -ENOTTY;
void __user *arg = (void __user *)ptr;
+   unsigned int cmd_size = _IOC_SIZE(cmd);
 
if (!capable(CAP_SYS_ADMIN))
return -EACCES;
@@ -2355,94 +2358,55 @@ int sed_ioctl(struct opal_dev *dev, unsigned int cmd, 
unsigned long ptr)
return -ENOTSUPP;
}
 
-   switch (cmd) {
-   case IOC_OPAL_SAVE: {
-   struct opal_lock_unlock lk_unlk;
-
-   if (copy_from_user(&lk_unlk, arg, sizeof(lk_unlk)))
-   return -EFAULT;
-   return opal_save(dev, &lk_unlk);
-   }
-   case IOC_OPAL_LOCK_UNLOCK: {
-   struct opal_lock_unlock lk_unlk;
-
-   if (copy_from_user(&lk_unlk, arg, sizeof(lk_unlk)))
-   return -EFAULT;
-   return opal_lock_unlock(dev, &lk_unlk);
-   }
-   case IOC_OPAL_TAKE_OWNERSHIP: {
-   struct opal_key opal_key;
-
-   if (copy_from_user(&opal_key, arg, sizeof(opal_key)))
-   return -EFAULT;
-   return opal_take_ownership(dev, &opal_key);
-   }
-   case IOC_OPAL_ACTIVATE_LSP: {
-   struct opal_lr_act opal_lr_act;
-
-   if (copy_from_user(&opal_lr_act, arg, sizeof(opal_lr_act)))
-   return -EFAULT;
-   return opal_activate_lsp(dev, &opal_lr_act);
-   }
-   case IOC_OPAL_SET_PW: {
-   struct opal_new_pw opal_pw;
-
-   if (copy_from_user(&opal_pw, arg, sizeof(opal_pw)))
-   return -EFAULT;
-   return opal_set_new_pw(dev, &opal_pw);
-   }
-   case IOC_OPAL_ACTIVATE_USR: {
-   struct opal_session_info session;
-
-   if (copy_from_user(&session, arg, sizeof(session)))
-   return -EFAULT;
-   return opal_activate_user(dev, &session);
-   }
-   case IOC_OPAL_REVERT_TPR: {
-   struct opal_key opal_key;
-
-   if (copy_from_user(&opal_key, arg, sizeof(opal_key)))
-   return -EFAULT;
-   return opal_reverttper(dev, &opal_key);
-   }
-   case IOC_OPAL_LR_SETUP: {
-   struct opal_user_lr_setup lrs;
-
-   if (copy_from_user(&lrs, arg, sizeof(lrs)))
-   return -EFAULT;
-   return opal_setup_locking_range(dev, &lrs);
-   }
-   case IOC_OPAL_ADD_USR_TO_LR: {
-   struct opal_lock_unlock lk_unlk;
-
-   if (copy_from_user(&lk_unlk, arg, sizeof(lk_unlk)))
-   return -EFAULT;
-   return opal_add_user_to_lr(dev, &lk_unlk);
-   }
-   case IOC_OPAL_ENABLE_DISABLE_MBR: {
-   struct opal_mbr_data mbr;
-
-   if (copy_from_user(&mbr, arg, sizeof(mbr)))
-   return -EFAULT;
-   return opal_enable_disable_shadow_mbr(dev, &mbr);
-   }
-   case IOC_OPAL_ERASE_LR: {
-   struct opal_session_info session;
-
-   if (copy_from_user(&session, arg, sizeof(session)))
-   return -EFAULT;
-   return opal_erase_locking_range(dev, &session);
+   ioctl_ptr = memdup_user(arg, cmd_size);
+   if (IS_ERR_OR_NULL(ioctl_ptr)) {
+   ret = PTR_ERR(ioctl_ptr);
+   goto out;
}
-   case IOC_OPAL_SECURE_ERASE_LR: {
-   struct opal_session_info session;
 
-   if (copy_from_user(&session, arg, sizeof(session)))
-   return -EFAULT;
-   return opal_secure_erase_locking_range(dev, &session);
-   }
+   switch (cmd) {
+   case IOC_OPAL_SAVE:
+   ret = opal_save(dev, ioctl_ptr);
+   break;
+   case IOC_OPAL_LOCK_UNLOCK:
+   ret = opal_lock_unlock(dev, ioctl_ptr);
+   break;
+   case IOC_OPAL_TAKE_OWNERSHIP:
+   ret = opal_take_ownership(dev, ioctl_ptr);
+   break;
+   case IOC_OPAL_ACTIVATE_LSP:
+   ret = opal_activate_lsp(dev, ioctl_ptr);
+   break;
+   case IOC_OPAL_SET_PW:
+   ret = opal_set_new_pw(dev, ioctl_ptr);
+ 

Sed-opal Fixes

2017-02-10 Thread Scott Bauer
changes from v3->v4:

Changed manual kzalloc + copy from user to memdup_user.

This small series fixes a small ABI issue when using _IOC_SIZE()
with the Activate Locking SP ioctl. I had put the wrong structure
in the IOW macro in the uapi header, which caused issues when trying
to copy in the contents for the command.





Re: [WIP PATCHSET 1/4] blk-mq: pass bio to blk_mq_sched_get_rq_priv

2017-02-10 Thread Jens Axboe
On 02/07/2017 10:24 AM, Paolo Valente wrote:
> bio is used in bfq-mq's get_rq_priv, to get the request group. We could
> pass directly the group here, but I thought that passing the bio was
> more general, giving the possibility to get other pieces of information
> if needed.

I applied this one, thanks Paolo.

-- 
Jens Axboe



Re: [PATCH V3 2/2] Move stack parameters for sed_ioctl to prevent oversized stack with CONFIG_KASAN

2017-02-10 Thread Scott Bauer
On Fri, Feb 10, 2017 at 09:01:23AM +0100, Arnd Bergmann wrote:
> On Thursday, February 9, 2017 10:20:01 AM CET Scott Bauer wrote:
> > When CONFIG_KASAN is enabled, compilation fails:
> > 
> > block/sed-opal.c: In function 'sed_ioctl':
> > block/sed-opal.c:2447:1: error: the frame size of 2256 bytes is larger than 
> > 2048 bytes [-Werror=frame-larger-than=]
> > 
> > Move all the ioctl structures off the stack and allocate them dynamically,
> > using _IOC_SIZE() for the copy size.
> > 
> > Fixes: 455a7b238cd6 ("block: Add Sed-opal library")
> > 
> > Reported-by: Arnd Bergmann 
> > Signed-off-by: Scott Bauer 
> > ---
> >  block/sed-opal.c | 134 
> > +--
> >  1 file changed, 50 insertions(+), 84 deletions(-)
> > 
> > diff --git a/block/sed-opal.c b/block/sed-opal.c
> > index bf1406e..4985d95 100644
> > --- a/block/sed-opal.c
> > +++ b/block/sed-opal.c
> > @@ -2346,7 +2346,10 @@ EXPORT_SYMBOL(opal_unlock_from_suspend);
> >  
> >  int sed_ioctl(struct opal_dev *dev, unsigned int cmd, unsigned long ptr)
> >  {
> > +   void *ioctl_ptr;
> > +   int ret = -ENOTTY;
> > void __user *arg = (void __user *)ptr;
> > +   unsigned int cmd_size = _IOC_SIZE(cmd);
> >  
> > if (!capable(CAP_SYS_ADMIN))
> > return -EACCES;
> 
> We usually have a size check in there to avoid allocating large amounts
> of memory. _IOC_SIZEBITS is 14, so you can have up to 16kb here, which
> is probably ok, but I'd recommend either adding a comment to say that
> it is, or just checking against the largest realistic size.

Right now it's up to the storage driver to call into the sed-opal ioctl.
We have a function is_sed_ioctl() which is called before jumping into sed_ioctl,
so we will be weeding out any non-opal ioctls before getting in there, and I don't
see any overly large 16kb allocations happening.
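
(For illustration only -- a guard along the lines Arnd suggests could sit just
before the copy; the specific bound and the surrounding names are assumptions,
not part of the posted patch:)

	/* cap the ioctl payload before allocating/copying it */
	if (!cmd_size || cmd_size > PAGE_SIZE)
		return -EINVAL;
	ioctl_ptr = memdup_user(arg, cmd_size);
	if (IS_ERR(ioctl_ptr))
		return PTR_ERR(ioctl_ptr);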



Re: [PATCH 07/10] writeback: Implement reliable switching to default writeback structure

2017-02-10 Thread Jan Kara
On Fri 10-02-17 13:19:44, NeilBrown wrote:
> On Thu, Feb 09 2017, Jan Kara wrote:
> 
> > Currently switching of inode between different writeback structures is
> > asynchronous and not guaranteed to succeed. Add a variant of switching
> > that is synchronous and reliable so that it can reliably move inode to
> > the default writeback structure (bdi->wb) when writeback on bdi is going
> > to be shutdown.
> >
> > Signed-off-by: Jan Kara 
> > ---
> >  fs/fs-writeback.c | 60 
> > ---
> >  include/linux/fs.h|  3 ++-
> >  include/linux/writeback.h |  6 +
> >  3 files changed, 60 insertions(+), 9 deletions(-)
> >
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index 23dc97cf2a50..52992a1036b1 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -332,14 +332,11 @@ struct inode_switch_wbs_context {
> > struct work_struct  work;
> >  };
> >  
> > -static void inode_switch_wbs_work_fn(struct work_struct *work)
> > +static void do_inode_switch_wbs(struct inode *inode,
> > +   struct bdi_writeback *new_wb)
> >  {
> > -   struct inode_switch_wbs_context *isw =
> > -   container_of(work, struct inode_switch_wbs_context, work);
> > -   struct inode *inode = isw->inode;
> > struct address_space *mapping = inode->i_mapping;
> > struct bdi_writeback *old_wb = inode->i_wb;
> > -   struct bdi_writeback *new_wb = isw->new_wb;
> > struct radix_tree_iter iter;
> > bool switched = false;
> > void **slot;
> > @@ -436,15 +433,29 @@ static void inode_switch_wbs_work_fn(struct 
> > work_struct *work)
> > spin_unlock(_wb->list_lock);
> > spin_unlock(_wb->list_lock);
> >  
> > +   /*
> > +* Make sure waitqueue_active() check in wake_up_bit() cannot happen
> > +* before I_WB_SWITCH is cleared. Pairs with the barrier in
> > +* set_task_state() after wait_on_bit() added waiter to the wait queue.
> 
> I think you mean "set_current_state()" ??

Yes, I'll fix that.

> It's rather a trap for the unwary, this need for a smp_mb().
> Greping for wake_up_bit(), I find quite a few places with barriers -
> sometimes clear_bit_unlock() or spin_unlock() - but
> 
> fs/block_dev.c- whole->bd_claiming = NULL;
> fs/block_dev.c: wake_up_bit(&whole->bd_claiming, 0);
> 
> fs/cifs/connect.c-  clear_bit(TCON_LINK_PENDING, &tlink->tl_flags);
> fs/cifs/connect.c:  wake_up_bit(&tlink->tl_flags, TCON_LINK_PENDING);
> 
> fs/cifs/misc.c- clear_bit(CIFS_INODE_PENDING_WRITERS, &cinode->flags);
> fs/cifs/misc.c: wake_up_bit(&cinode->flags, CIFS_INODE_PENDING_WRITERS);
> 
> (several more in cifs)
> 
> net/sunrpc/xprt.c-  clear_bit(XPRT_CLOSE_WAIT, &xprt->state);
> net/sunrpc/xprt.c-  xprt->ops->close(xprt);
> net/sunrpc/xprt.c-  xprt_release_write(xprt, NULL);
> net/sunrpc/xprt.c:  wake_up_bit(&xprt->state, XPRT_LOCKED);
> (there might be a barrier in ->close or xprt_release_write() I guess)
> 
> security/keys/gc.c- clear_bit(KEY_GC_REAPING_KEYTYPE, &key_gc_flags);
> security/keys/gc.c: wake_up_bit(&key_gc_flags, KEY_GC_REAPING_KEYTYPE);

Yup, the above look like bugs.
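
(For reference, the ordering those call sites generally need looks like the
sketch below; the flag and word names are placeholders, not code from any of
the files listed above:)

	clear_bit(SOME_FLAG, &word);
	/* make the cleared bit visible before wake_up_bit()'s
	 * waitqueue_active() check */
	smp_mb__after_atomic();
	wake_up_bit(&word, SOME_FLAG);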

> I wonder if there is a good way to make this less error-prone.
> I would suggest that wake_up_bit() should always have a barrier, and
> __wake_up_bit() is needed to avoid it, but there is already a
> __wake_up_bit() with a slightly different interface.

Yeah, it is error-prone as all waitqueue_active() optimizations...
 
> In this case, you have a spin_unlock() just before the wake_up_bit().
> It is my understanding that it would provide enough of a barrier (all
> writes before are globally visible after), so do you really need
> the barrier here?

I believe I do. spin_unlock() is a semi-permeable barrier - i.e., any read
or write from "outside" can be moved inside. So the CPU is free to prefetch
values for the waitqueue_active() checks before the spinlock is unlocked, or even
before the I_WB_SWITCH bit is cleared.

> > +*/
> > +   smp_mb();
> > +   wake_up_bit(&inode->i_state, __I_WB_SWITCH);
> > +
> > if (switched) {
> > wb_wakeup(new_wb);
> > wb_put(old_wb);
> > }
> > -   wb_put(new_wb);
> > +}
> >  
> > -   iput(inode);
> > -   kfree(isw);
> > +static void inode_switch_wbs_work_fn(struct work_struct *work)
> > +{
> > +   struct inode_switch_wbs_context *isw =
> > +   container_of(work, struct inode_switch_wbs_context, work);
> >  
> > +   do_inode_switch_wbs(isw->inode, isw->new_wb);
> > +   wb_put(isw->new_wb);
> > +   iput(isw->inode);
> > +   kfree(isw);
> > atomic_dec(_nr_in_flight);
> >  }
> >  
> > @@ -521,6 +532,39 @@ static void inode_switch_wbs(struct inode *inode, int 
> > new_wb_id)
> >  }
> >  
> >  /**
> > + * inode_switch_to_default_wb_sync - change the wb association of an inode 
> > to
> > + * the default writeback structure synchronously
> > + * @inode: target inode
> > + *
> > + * Switch @inode's wb association to the default 

Re: [PATCH] bfq-mq: cause deadlock by executing exit_icq body immediately

2017-02-10 Thread Paolo Valente

> Il giorno 08 feb 2017, alle ore 18:17, Omar Sandoval  ha 
> scritto:
> 
> On Wed, Feb 08, 2017 at 11:39:24AM +0100, Paolo Valente wrote:
>> 
>>> Il giorno 08 feb 2017, alle ore 11:33, Omar Sandoval  
>>> ha scritto:
>>> 
>>> On Wed, Feb 08, 2017 at 11:03:01AM +0100, Paolo Valente wrote:
 
> Il giorno 07 feb 2017, alle ore 22:45, Omar Sandoval 
>  ha scritto:
> 
> On Tue, Feb 07, 2017 at 06:33:46PM +0100, Paolo Valente wrote:
>> Hi,
>> this patch is meant to show that, if the  body of the hook exit_icq is 
>> executed
>> from inside that hook, and not as deferred work, then a circular deadlock
>> occurs.
>> 
>> It happens if, on a CPU
>> - the body of icq_exit takes the scheduler lock,
>> - it does so from inside the exit_icq hook, which is invoked with the 
>> queue
>> lock held
>> 
>> while, on another CPU
>> - bfq_bio_merge, after taking the scheduler lock, invokes bfq_bic_lookup,
>> which, in its turn, takes the queue lock. bfq_bic_lookup needs to take 
>> such a
>> lock, because it invokes ioc_lookup_icq.
>> 
>> For more details, here is a lockdep report, right before the deadlock 
>> did occur.
>> 
>> [   44.059877] ==
>> [   44.124922] [ INFO: possible circular locking dependency detected ]
>> [   44.125795] 4.10.0-rc5-bfq-mq+ #38 Not tainted
>> [   44.126414] ---
>> [   44.127291] sync/2043 is trying to acquire lock:
>> [   44.128918]  (&(>lock)->rlock){-.-...}, at: 
>> [] bfq_exit_icq_bfqq+0x55/0x140
>> [   44.134052]
>> [   44.134052] but task is already holding lock:
>> [   44.134868]  (&(>__queue_lock)->rlock){-.}, at: 
>> [] put_io_context_active+0x6e/0xc0
> 
> Hey, Paolo,
> 
> I only briefly skimmed the code, but what are you using the queue_lock
> for? You should just use your scheduler lock everywhere. blk-mq doesn't
> use the queue lock, so the scheduler is the only thing you need mutual
> exclusion against.
 
 Hi Omar,
 the cause of the problem is that the hook functions bfq_request_merge
 and bfq_allow_bio_merge invoke, directly or through other functions,
 the function bfq_bic_lookup, which, in its turn, invokes
 ioc_lookup_icq.  The latter must be invoked with the queue lock held.
 In particular the offending lines in bfq_bic_lookup are:
 
spin_lock_irqsave(q->queue_lock, flags);
icq = icq_to_bic(ioc_lookup_icq(ioc, q));
spin_unlock_irqrestore(q->queue_lock, flags);
 
 Maybe I'm missing something and we can avoid taking this lock?
>>> 
>>> Ah, I didn't realize we still used the q->queue_lock for the icq stuff.
>>> You're right, you still need that lock for ioc_lookup_icq(). Unless
>>> there's something else I'm forgetting, that should be the only thing you
>>> need it for in the core code, and you should use your scheduler lock for
>>> everything else. What else are you using q->queue_lock for? 
>> 
>> Nothing.  The deadlock follows from the fact that bfq_request_merge gets called
>> with the scheduler lock already held.  Problematic paths start from:
>> bfq_bio_merge and bfq_insert_request.
>> 
>> I'm trying to understand whether I/we can reorder operations in some
>> way that avoids the nested locking, but to no avail so far.
>> 
>> Thanks,
>> Paolo
> 
> Okay, I understand what you're saying now. It was all in the first email
> but I didn't see it right away, sorry about that.
> 
> I don't think it makes sense for ->exit_icq() to be invoked while
> holding q->queue_lock for blk-mq -- we don't hold that lock for any of
> the other hooks. Could you try the below? I haven't convinced myself
> that there isn't a circular locking dependency between bfqd->lock and
> ioc->lock now, but it's probably easiest for you to just try it.
> 

Just passed the last of a heavy batch of tests!

I have updated the bfq-mq branch [1], adding a temporary commit that
contains your diffs (while waiting for you final patch or the like).

Looking forward to some feedback on the other issue I have raised on
locking, and to some review of the current version of bfq-mq, before
proceeding with cgroups support.

Thanks,
Paolo

[1] https://github.com/Algodev-github/bfq-mq

> diff --git a/block/blk-ioc.c b/block/blk-ioc.c
> index fe186a9eade9..b12f9c87b4c3 100644
> --- a/block/blk-ioc.c
> +++ b/block/blk-ioc.c
> @@ -35,7 +35,10 @@ static void icq_free_icq_rcu(struct rcu_head *head)
>   kmem_cache_free(icq->__rcu_icq_cache, icq);
> }
> 
> -/* Exit an icq. Called with both ioc and q locked. */
> +/*
> + * Exit an icq. Called with both ioc and q locked for sq, only ioc locked for
> + * mq.
> + */
> static void ioc_exit_icq(struct io_cq *icq)
> {
>   struct elevator_type *et = 

Re: [PATCH 3/6] genirq/affinity: update CPU affinity for CPU hotplug events

2017-02-10 Thread Thomas Gleixner
On Fri, 3 Feb 2017, Christoph Hellwig wrote:
> @@ -127,6 +127,7 @@ enum cpuhp_state {
>   CPUHP_AP_ONLINE_IDLE,
>   CPUHP_AP_SMPBOOT_THREADS,
>   CPUHP_AP_X86_VDSO_VMA_ONLINE,
> + CPUHP_AP_IRQ_AFFINIY_ONLINE,

s/AFFINIY/AFFINITY/ perhaps?

> +static void __irq_affinity_set(unsigned int irq, struct irq_desc *desc,
> + cpumask_t *mask)

static void __irq_affinity_set(unsigned int irq, struct irq_desc *desc,
   cpumask_t *mask)

Please

> +{
> + struct irq_data *data = irq_desc_get_irq_data(desc);
> + struct irq_chip *chip = irq_data_get_irq_chip(data);
> + int ret;
> +
> + if (!irqd_can_move_in_process_context(data) && chip->irq_mask)
> + chip->irq_mask(data);
> + ret = chip->irq_set_affinity(data, mask, true);
> + WARN_ON_ONCE(ret);
> +
> + /*
> +  * We unmask if the irq was not marked masked by the core code.
> +  * That respects the lazy irq disable behaviour.
> +  */
> + if (!irqd_can_move_in_process_context(data) &&
> + !irqd_irq_masked(data) && chip->irq_unmask)
> + chip->irq_unmask(data);
> +}

This looks very familiar. arch/x86/kernel/irq.c comes to mind

> +
> +static void irq_affinity_online_irq(unsigned int irq, struct irq_desc *desc,
> + unsigned int cpu)
> +{
> + const struct cpumask *affinity;
> + struct irq_data *data;
> + struct irq_chip *chip;
> + unsigned long flags;
> + cpumask_var_t mask;
> +
> + if (!desc)
> + return;
> +
> + raw_spin_lock_irqsave(&desc->lock, flags);
> +
> + data = irq_desc_get_irq_data(desc);
> + affinity = irq_data_get_affinity_mask(data);
> + if (!irqd_affinity_is_managed(data) ||
> + !irq_has_action(irq) ||
> + !cpumask_test_cpu(cpu, affinity))
> + goto out_unlock;
> +
> + /*
> +  * The interrupt descriptor might have been cleaned up
> +  * already, but it is not yet removed from the radix tree
> +  */
> + chip = irq_data_get_irq_chip(data);
> + if (!chip)
> + goto out_unlock;
> +
> + if (WARN_ON_ONCE(!chip->irq_set_affinity))
> + goto out_unlock;
> +
> + if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) {

You really want to allocate that _BEFORE_ locking desc->lock. GFP_KERNEL
inside the lock held region is wrong and shows that this was never tested :)

And no, we don't want GFP_ATOMIC here. You can allocate it once at the call
site and hand it in, so you avoid the alloc/free dance when iterating over
a large number of descriptors.

> + pr_err("failed to allocate memory for cpumask\n");
> + goto out_unlock;
> + }
> +
> + cpumask_and(mask, affinity, cpu_online_mask);
> + cpumask_set_cpu(cpu, mask);
> + if (irqd_has_set(data, IRQD_AFFINITY_SUSPENDED)) {
> + irq_startup(desc, false);
> + irqd_clear(data, IRQD_AFFINITY_SUSPENDED);
> + } else {
> + __irq_affinity_set(irq, desc, mask);
> + }
> +
> + free_cpumask_var(mask);
> +out_unlock:
> + raw_spin_unlock_irqrestore(&desc->lock, flags);
> +}



> +int irq_affinity_online_cpu(unsigned int cpu)
> +{
> + struct irq_desc *desc;
> + unsigned int irq;
> +
> + for_each_irq_desc(irq, desc)
> + irq_affinity_online_irq(irq, desc, cpu);

That lacks protection against concurrent irq setup/teardown. Wants to be
protected with irq_lock_sparse()

> + return 0;
> +}
> +
> +static void irq_affinity_offline_irq(unsigned int irq, struct irq_desc *desc,
> + unsigned int cpu)
> +{
> + const struct cpumask *affinity;
> + struct irq_data *data;
> + struct irq_chip *chip;
> + unsigned long flags;
> + cpumask_var_t mask;
> +
> + if (!desc)
> + return;
> +
> + raw_spin_lock_irqsave(&desc->lock, flags);
> +
> + data = irq_desc_get_irq_data(desc);
> + affinity = irq_data_get_affinity_mask(data);
> + if (!irqd_affinity_is_managed(data) ||
> + !irq_has_action(irq) ||
> + irqd_has_set(data, IRQD_AFFINITY_SUSPENDED) ||
> + !cpumask_test_cpu(cpu, affinity))
> + goto out_unlock;
> +
> + /*
> +  * Complete the irq move. This cpu is going down and for
> +  * non intr-remapping case, we can't wait till this interrupt
> +  * arrives at this cpu before completing the irq move.
> +  */
> + irq_force_complete_move(desc);

Hmm. That's what we do in x86 when the cpu is really dying, i.e. before it
really goes away. It's the last resort we have.

So if a move is pending, then you force it here and then you call
__irq_affinity_set() further down, which queues another pending move, which
then gets cleaned up in the cpu dying code.

If a move is pending, then you should first verify whether the pending
affinity mask is different from the one you are going to set. If it's the
same, you can just let the final cleanup code do its job. If not, then you
need to check whether it has 

Re: [PATCH V3 2/2] Move stack parameters for sed_ioctl to prevent oversized stack with CONFIG_KASAN

2017-02-10 Thread Arnd Bergmann
On Fri, Feb 10, 2017 at 11:28 AM, David Laight  wrote:

>>
>> > -   if (copy_from_user(&session, arg, sizeof(session)))
>> > -   return -EFAULT;
>> > -   return opal_erase_locking_range(dev, &session);
>> > +   ioctl_ptr = kzalloc(cmd_size, GFP_KERNEL);
>> > +   if (!ioctl_ptr)
>> > +   return -ENOMEM;
>> > +   if (copy_from_user(ioctl_ptr, arg, cmd_size)) {
>> > +   ret = -EFAULT;
>> > +   goto out;
>> > }
>>
>> Can't we use memdup_user() instead of kzalloc() + copy_from_user()?
>
You either want the copy_from_user() or the memzero(), not both.
>
> ISTM there could be two 'library' functions, maybe:
> void *get_ioctl_buf(unsigned int cmd, long arg)
> to malloc the buffer, memzero/copy_from_user, returns -EFAULT if copy fails.
> int put_ioctl_buf(int rval, unsigned int cmd, const void *buf)
> does copy_to_user() if rval >= 0 and IOR_READ, then frees buf.
> return value is rval unless the copyout fails.

All the ioctl commands in this driver are IOW, and no data is passed back
to user space, so there is no need for the memzero(): we can either copy
the data from user space or we fail.

Arnd
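
For illustration, the memdup_user() shape discussed above might look roughly
like this (a sketch, not the posted patch; the per-command dispatch is
elided except for one case):

int sed_ioctl(struct opal_dev *dev, unsigned int cmd, unsigned long ptr)
{
	void __user *arg = (void __user *)ptr;
	unsigned int cmd_size = _IOC_SIZE(cmd);
	void *ioctl_ptr;
	int ret = -ENOTTY;

	if (!capable(CAP_SYS_ADMIN))
		return -EACCES;

	/* memdup_user() folds kmalloc() + copy_from_user() into one call;
	 * since every command is IOW there is nothing to zero or copy back. */
	ioctl_ptr = memdup_user(arg, cmd_size);
	if (IS_ERR(ioctl_ptr))
		return PTR_ERR(ioctl_ptr);

	switch (cmd) {
	case IOC_OPAL_ERASE_LR:
		ret = opal_erase_locking_range(dev, ioctl_ptr);
		break;
	/* ... remaining IOC_OPAL_* commands ... */
	}

	kfree(ioctl_ptr);
	return ret;
}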


[PATCH v1 4/5] md: remove unnecessary check on mddev

2017-02-10 Thread Ming Lei
mddev is never NULL and neither is ->bio_set, so
remove the check.

Suggested-by: Christoph Hellwig 
Signed-off-by: Ming Lei 
---
 drivers/md/md.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 3425c2b779a6..2835f09b9e71 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -193,9 +193,6 @@ EXPORT_SYMBOL_GPL(bio_alloc_mddev);
 struct bio *bio_clone_mddev(struct bio *bio, gfp_t gfp_mask,
struct mddev *mddev)
 {
-   if (!mddev || !mddev->bio_set)
-   return bio_clone(bio, gfp_mask);
-
return bio_clone_bioset(bio, gfp_mask, mddev->bio_set);
 }
 EXPORT_SYMBOL_GPL(bio_clone_mddev);
-- 
2.7.4



[PATCH v1 3/5] md: fail if mddev->bio_set can't be created

2017-02-10 Thread Ming Lei
The current behaviour is to fall back to allocating the
bio from 'fs_bio_set', which isn't correct because it
might cause a deadlock.

So this patch simply returns failure if mddev->bio_set
can't be created.

Signed-off-by: Ming Lei 
---
 drivers/md/md.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 4c1b82defa78..3425c2b779a6 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5270,8 +5270,11 @@ int md_run(struct mddev *mddev)
sysfs_notify_dirent_safe(rdev->sysfs_state);
}
 
-   if (mddev->bio_set == NULL)
+   if (mddev->bio_set == NULL) {
mddev->bio_set = bioset_create(BIO_POOL_SIZE, 0);
+   if (!mddev->bio_set)
+   return -ENOMEM;
+   }
 
spin_lock(&pers_lock);
pers = find_pers(mddev->level, mddev->clevel);
-- 
2.7.4



[PATCH v1 5/5] md: fast clone bio in bio_clone_mddev()

2017-02-10 Thread Ming Lei
Firstly, bio_clone_mddev() is used in the raid normal I/O path
and not in the resync I/O path.

Secondly, all direct access to the bvec table in raid happens
during resync I/O, except for write behind in raid1, which still
uses bio_clone() to allocate a new bvec table.

So this patch replaces bio_clone() with bio_clone_fast()
in bio_clone_mddev().

bio_clone_mddev() is also renamed to bio_clone_fast_mddev(), as
suggested by Christoph Hellwig.

Signed-off-by: Ming Lei 
---
 drivers/md/faulty.c |  2 +-
 drivers/md/md.c |  6 +++---
 drivers/md/md.h |  4 ++--
 drivers/md/raid1.c  |  8 
 drivers/md/raid10.c | 11 +--
 drivers/md/raid5.c  |  4 ++--
 6 files changed, 17 insertions(+), 18 deletions(-)

diff --git a/drivers/md/faulty.c b/drivers/md/faulty.c
index 685aa2d77e25..f80e7b8f8c40 100644
--- a/drivers/md/faulty.c
+++ b/drivers/md/faulty.c
@@ -214,7 +214,7 @@ static void faulty_make_request(struct mddev *mddev, struct 
bio *bio)
}
}
if (failit) {
-   struct bio *b = bio_clone_mddev(bio, GFP_NOIO, mddev);
+   struct bio *b = bio_clone_fast_mddev(bio, GFP_NOIO, mddev);
 
b->bi_bdev = conf->rdev->bdev;
b->bi_private = bio;
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 2835f09b9e71..d45e8d1382ad 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -190,12 +190,12 @@ struct bio *bio_alloc_mddev(gfp_t gfp_mask, int nr_iovecs,
 }
 EXPORT_SYMBOL_GPL(bio_alloc_mddev);
 
-struct bio *bio_clone_mddev(struct bio *bio, gfp_t gfp_mask,
+struct bio *bio_clone_fast_mddev(struct bio *bio, gfp_t gfp_mask,
struct mddev *mddev)
 {
-   return bio_clone_bioset(bio, gfp_mask, mddev->bio_set);
+   return bio_clone_fast(bio, gfp_mask, mddev->bio_set);
 }
-EXPORT_SYMBOL_GPL(bio_clone_mddev);
+EXPORT_SYMBOL_GPL(bio_clone_fast_mddev);
 
 /*
  * We have a system wide 'event count' that is incremented
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 968bbe72b237..88d0a101fb4c 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -673,8 +673,8 @@ extern void md_rdev_clear(struct md_rdev *rdev);
 
 extern void mddev_suspend(struct mddev *mddev);
 extern void mddev_resume(struct mddev *mddev);
-extern struct bio *bio_clone_mddev(struct bio *bio, gfp_t gfp_mask,
-  struct mddev *mddev);
+extern struct bio *bio_clone_fast_mddev(struct bio *bio, gfp_t gfp_mask,
+   struct mddev *mddev);
 extern struct bio *bio_alloc_mddev(gfp_t gfp_mask, int nr_iovecs,
   struct mddev *mddev);
 
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 4d7852c6ae97..9e0b5a5ec0bc 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1108,7 +1108,7 @@ static void raid1_read_request(struct mddev *mddev, 
struct bio *bio,
r1_bio->read_disk = rdisk;
r1_bio->start_next_window = 0;
 
-   read_bio = bio_clone_mddev(bio, GFP_NOIO, mddev);
+   read_bio = bio_clone_fast_mddev(bio, GFP_NOIO, mddev);
bio_trim(read_bio, r1_bio->sector - bio->bi_iter.bi_sector,
 max_sectors);
 
@@ -1372,7 +1372,7 @@ static void raid1_write_request(struct mddev *mddev, 
struct bio *bio,
}
 
if (!mbio) {
-   mbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
+   mbio = bio_clone_fast_mddev(bio, GFP_NOIO, mddev);
bio_trim(mbio, offset, max_sectors);
}
 
@@ -2283,7 +2283,7 @@ static int narrow_write_error(struct r1bio *r1_bio, int i)
 
wbio->bi_vcnt = vcnt;
} else {
-   wbio = bio_clone_mddev(r1_bio->master_bio, GFP_NOIO, 
mddev);
+   wbio = bio_clone_fast_mddev(r1_bio->master_bio, 
GFP_NOIO, mddev);
}
 
bio_set_op_attrs(wbio, REQ_OP_WRITE, 0);
@@ -2421,7 +2421,7 @@ static void handle_read_error(struct r1conf *conf, struct 
r1bio *r1_bio)
const unsigned long do_sync
= r1_bio->master_bio->bi_opf & REQ_SYNC;
r1_bio->read_disk = disk;
-   bio = bio_clone_mddev(r1_bio->master_bio, GFP_NOIO, mddev);
+   bio = bio_clone_fast_mddev(r1_bio->master_bio, GFP_NOIO, mddev);
bio_trim(bio, r1_bio->sector - bio->bi_iter.bi_sector,
 max_sectors);
r1_bio->bios[r1_bio->read_disk] = bio;
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 6bc5c2a85160..406d6651fd4c 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1132,7 +1132,7 @@ static void raid10_read_request(struct mddev *mddev, 
struct bio *bio,
}
slot = r10_bio->read_slot;
 
-   read_bio = bio_clone_mddev(bio, GFP_NOIO, mddev);
+   read_bio = bio_clone_fast_mddev(bio, GFP_NOIO, mddev);
bio_trim(read_bio, r10_bio->sector - 

[PATCH v1 1/5] block: introduce bio_clone_bioset_partial()

2017-02-10 Thread Ming Lei
md still needs a full bio clone (not the fast version) for write
behind, and it is more efficient to use bio_clone_bioset_partial().

The idea is simple: just copy the bvec range specified by the
offset/size parameters.
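
For illustration, a caller that wants only part of the source bio might use
it like this (offset and size are in bytes relative to the current bi_iter;
the real write-behind conversion is in patch 2/5):

	/* Clone only [offset, offset + size) of 'bio' from mddev's bio_set;
	 * the offset/size values here are purely illustrative. */
	struct bio *behind_bio = bio_clone_bioset_partial(bio, GFP_NOIO,
							   mddev->bio_set,
							   offset, size);
	if (!behind_bio)
		return -ENOMEM;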

Signed-off-by: Ming Lei 
---
 block/bio.c | 61 +
 include/linux/bio.h | 11 --
 2 files changed, 57 insertions(+), 15 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 4b564d0c3e29..5eec5e08417f 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -625,21 +625,20 @@ struct bio *bio_clone_fast(struct bio *bio, gfp_t 
gfp_mask, struct bio_set *bs)
 }
 EXPORT_SYMBOL(bio_clone_fast);
 
-/**
- * bio_clone_bioset - clone a bio
- * @bio_src: bio to clone
- * @gfp_mask: allocation priority
- * @bs: bio_set to allocate from
- *
- * Clone bio. Caller will own the returned bio, but not the actual data it
- * points to. Reference count of returned bio will be one.
- */
-struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
-struct bio_set *bs)
+static struct bio *__bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
+ struct bio_set *bs, int offset,
+ int size)
 {
struct bvec_iter iter;
struct bio_vec bv;
struct bio *bio;
+   struct bvec_iter iter_src = bio_src->bi_iter;
+
+   /* for supporting partial clone */
+   if (offset || size != bio_src->bi_iter.bi_size) {
+   bio_advance_iter(bio_src, &iter_src, offset);
+   iter_src.bi_size = size;
+   }
 
/*
 * Pre immutable biovecs, __bio_clone() used to just do a memcpy from
@@ -663,7 +662,8 @@ struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t 
gfp_mask,
 *__bio_clone_fast() anyways.
 */
 
-   bio = bio_alloc_bioset(gfp_mask, bio_segments(bio_src), bs);
+   bio = bio_alloc_bioset(gfp_mask, __bio_segments(bio_src,
+  &iter_src), bs);
if (!bio)
return NULL;
bio->bi_bdev= bio_src->bi_bdev;
@@ -680,7 +680,7 @@ struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t 
gfp_mask,
bio->bi_io_vec[bio->bi_vcnt++] = bio_src->bi_io_vec[0];
break;
default:
-   bio_for_each_segment(bv, bio_src, iter)
+   __bio_for_each_segment(bv, bio_src, iter, iter_src)
bio->bi_io_vec[bio->bi_vcnt++] = bv;
break;
}
@@ -699,9 +699,44 @@ struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t 
gfp_mask,
 
return bio;
 }
+
+/**
+ * bio_clone_bioset - clone a bio
+ * @bio_src: bio to clone
+ * @gfp_mask: allocation priority
+ * @bs: bio_set to allocate from
+ *
+ * Clone bio. Caller will own the returned bio, but not the actual data it
+ * points to. Reference count of returned bio will be one.
+ */
+struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
+struct bio_set *bs)
+{
+   return __bio_clone_bioset(bio_src, gfp_mask, bs, 0,
+ bio_src->bi_iter.bi_size);
+}
 EXPORT_SYMBOL(bio_clone_bioset);
 
 /**
+ * bio_clone_bioset_partial - clone a partial bio
+ * @bio_src: bio to clone
+ * @gfp_mask: allocation priority
+ * @bs: bio_set to allocate from
+ * @offset: cloned starting from the offset
+ * @size: size for the cloned bio
+ *
+ * Clone bio. Caller will own the returned bio, but not the actual data it
+ * points to. Reference count of returned bio will be one.
+ */
+struct bio *bio_clone_bioset_partial(struct bio *bio_src, gfp_t gfp_mask,
+struct bio_set *bs, int offset,
+int size)
+{
+   return __bio_clone_bioset(bio_src, gfp_mask, bs, offset, size);
+}
+EXPORT_SYMBOL(bio_clone_bioset_partial);
+
+/**
  * bio_add_pc_page -   attempt to add page to bio
  * @q: the target queue
  * @bio: destination bio
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 7cf8a6c70a3f..8e521194f6fc 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -183,7 +183,7 @@ static inline void bio_advance_iter(struct bio *bio, struct 
bvec_iter *iter,
 
 #define bio_iter_last(bvec, iter) ((iter).bi_size == (bvec).bv_len)
 
-static inline unsigned bio_segments(struct bio *bio)
+static inline unsigned __bio_segments(struct bio *bio, struct bvec_iter *bvec)
 {
unsigned segs = 0;
struct bio_vec bv;
@@ -205,12 +205,17 @@ static inline unsigned bio_segments(struct bio *bio)
break;
}
 
-   bio_for_each_segment(bv, bio, iter)
+   __bio_for_each_segment(bv, bio, iter, *bvec)
segs++;
 
return segs;
 }
 
+static inline unsigned bio_segments(struct bio *bio)
+{
+   return __bio_segments(bio, &bio->bi_iter);
+}
+
 /*
  

[PATCH v1 0/5] md: use bio_clone_fast()

2017-02-10 Thread Ming Lei
Hi,

This patchset replaces bio_clone() with bio_clone_fast() in
bio_clone_mddev() because:

1) bio_clone_mddev() is used in the raid normal I/O path and not
in the resync I/O path, and all direct access to the bvec table in
raid happens during resync I/O only, except for write behind of
raid1. Write behind is treated specially, so the replacement is
safe.

2) for write behind, bio_clone() is kept, but this patchset
introduces bio_clone_bioset_partial() to clone just one specific
bvec range instead of the whole table. Then write behind is
improved too.

V1:
1) don't introduce bio_clone_slow_mddev_partial()
2) return failure if mddev->bio_set can't be created
3) remove check in bio_clone_mddev() as suggested by
Christoph Hellwig.
4) rename bio_clone_mddev() as bio_clone_fast_mddev()


Ming Lei (5):
  block: introduce bio_clone_bioset_partial()
  md/raid1: use bio_clone_bioset_partial() in case of write behind
  md: fail if mddev->bio_set can't be created
  md: remove unnecessary check on mddev
  md: fast clone bio in bio_clone_mddev()

 block/bio.c | 61 +
 drivers/md/faulty.c |  2 +-
 drivers/md/md.c | 14 ++--
 drivers/md/md.h |  4 ++--
 drivers/md/raid1.c  | 26 ---
 drivers/md/raid10.c | 11 +-
 drivers/md/raid5.c  |  4 ++--
 include/linux/bio.h | 11 --
 8 files changed, 92 insertions(+), 41 deletions(-)

-- 
2.7.4

Thanks,
Ming


RE: [PATCH V3 2/2] Move stack parameters for sed_ioctl to prevent oversized stack with CONFIG_KASAN

2017-02-10 Thread David Laight
From: Johannes Thumshirn
> Sent: 10 February 2017 07:46
> On 02/09/2017 06:20 PM, Scott Bauer wrote:
> > When CONFIG_KASAN is enabled, compilation fails:

Does CONFIG_KASAN allocate guard stack space around everything that
is passed by address?
That sounds completely brain-dead.
There are a lot of functions that have an 'int *' argument to return
a single value - and are never going to do anything else.

...
> > Moved all the ioctl structures off the stack and dynamically allocate
> > using _IOC_SIZE()
...
> 
> > -   if (copy_from_user(&session, arg, sizeof(session)))
> > -   return -EFAULT;
> > -   return opal_erase_locking_range(dev, &session);
> > +   ioctl_ptr = kzalloc(cmd_size, GFP_KERNEL);
> > +   if (!ioctl_ptr)
> > +   return -ENOMEM;
> > +   if (copy_from_user(ioctl_ptr, arg, cmd_size)) {
> > +   ret = -EFAULT;
> > +   goto out;
> > }
> 
> Can't we use memdup_user() instead of kzalloc() + copy_from_user()?

You either want the copy_from_user() or the memzero(), not both.

ISTM there could be two 'library' functions, maybe:
void *get_ioctl_buf(unsigned int cmd, long arg)
to malloc the buffer, memzero/copy_from_user, returns -EFAULT if copy fails.
int put_ioctl_buf(int rval, unsigned int cmd, const void *buf)
does copy_to_user() if rval >= 0 and IOR_READ, then frees buf.
return value is rval unless the copyout fails.

David



[PATCHv5 0/2] loop: enable different logical blocksizes

2017-02-10 Thread Hannes Reinecke
Currently the loop driver just simulates 512-byte blocks. When
creating bootable images for virtual machines it might be required
to use a different physical blocksize (eg 4k for S/390 DASD), as
some bootloaders (like lilo or zipl for S/390) need to know
the physical block addresses of the kernel and initrd.

With this patchset the loop driver will export the logical and
physical blocksize and the current LOOP_SET_STATUS64 ioctl is
extended to set the logical blocksize by re-using the existing
'init' fields, which are currently unused.

As usual, comments and reviews are welcome.
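
For reference, a userspace caller might request a 4k logical blocksize on an
already-bound loop device roughly as follows (a sketch against this series;
LO_FLAGS_BLOCKSIZE and the lo_init[0] usage come from patch 2/2 and are not
in mainline):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/loop.h>

static int set_loop_blocksize(const char *loopdev, unsigned int blocksize)
{
	struct loop_info64 info;
	int ret, fd = open(loopdev, O_RDWR);

	if (fd < 0)
		return -1;

	memset(&info, 0, sizeof(info));
	ret = ioctl(fd, LOOP_GET_STATUS64, &info);
	if (!ret) {
		info.lo_flags |= LO_FLAGS_BLOCKSIZE;
		info.lo_init[0] = blocksize;	/* 512, 1024, 2048 or 4096 */
		ret = ioctl(fd, LOOP_SET_STATUS64, &info);
	}
	close(fd);
	return ret;
}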

Changes to v1:
- Move LO_FLAGS_BLOCKSIZE definition
- Reshuffle patches
Changes to v2:
- Include reviews from Ming Lei
Changes to v3:
- Include reviews from Christoph
- Merge patches
Changes to v4:
- Add LO_INFO_BLOCKSIZE definition

Hannes Reinecke (2):
  loop: Remove unused 'bdev' argument from loop_set_capacity
  loop: support 4k physical blocksize

 drivers/block/loop.c  | 40 +---
 drivers/block/loop.h  |  1 +
 include/uapi/linux/loop.h |  3 +++
 3 files changed, 37 insertions(+), 7 deletions(-)

-- 
1.8.5.6



[PATCHv5 2/2] loop: support 4k physical blocksize

2017-02-10 Thread Hannes Reinecke
When generating bootable VM images, certain systems (most notably
s390x) require devices with a 4k blocksize. This patch implements
a new flag 'LO_FLAGS_BLOCKSIZE' which sets the physical blocksize
to that of the underlying device, and allows changing the logical
blocksize to anything up to the physical blocksize.

Signed-off-by: Hannes Reinecke 
---
 drivers/block/loop.c  | 36 +++-
 drivers/block/loop.h  |  1 +
 include/uapi/linux/loop.h |  3 +++
 3 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 8dae865..776b3c6 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -221,7 +221,8 @@ static void __loop_update_dio(struct loop_device *lo, bool 
dio)
 }
 
 static int
-figure_loop_size(struct loop_device *lo, loff_t offset, loff_t sizelimit)
+figure_loop_size(struct loop_device *lo, loff_t offset, loff_t sizelimit,
+loff_t logical_blocksize)
 {
loff_t size = get_size(offset, sizelimit, lo->lo_backing_file);
sector_t x = (sector_t)size;
@@ -233,6 +234,12 @@ static void __loop_update_dio(struct loop_device *lo, bool 
dio)
lo->lo_offset = offset;
if (lo->lo_sizelimit != sizelimit)
lo->lo_sizelimit = sizelimit;
+   if (lo->lo_flags & LO_FLAGS_BLOCKSIZE) {
+   lo->lo_logical_blocksize = logical_blocksize;
+   blk_queue_physical_block_size(lo->lo_queue, lo->lo_blocksize);
+   blk_queue_logical_block_size(lo->lo_queue,
+lo->lo_logical_blocksize);
+   }
set_capacity(lo->lo_disk, x);
bd_set_size(bdev, (loff_t)get_capacity(bdev->bd_disk) << 9);
/* let user-space know about the new size */
@@ -814,6 +821,7 @@ static void loop_config_discard(struct loop_device *lo)
struct file *file = lo->lo_backing_file;
struct inode *inode = file->f_mapping->host;
struct request_queue *q = lo->lo_queue;
+   int lo_bits = blksize_bits(lo->lo_logical_blocksize);
 
/*
 * We use punch hole to reclaim the free space used by the
@@ -833,7 +841,7 @@ static void loop_config_discard(struct loop_device *lo)
 
q->limits.discard_granularity = inode->i_sb->s_blocksize;
q->limits.discard_alignment = 0;
-   blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
+   blk_queue_max_discard_sectors(q, UINT_MAX >> lo_bits);
q->limits.discard_zeroes_data = 1;
queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
 }
@@ -922,6 +930,7 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
 
lo->use_dio = false;
lo->lo_blocksize = lo_blocksize;
+   lo->lo_logical_blocksize = 512;
lo->lo_device = bdev;
lo->lo_flags = lo_flags;
lo->lo_backing_file = file;
@@ -1087,6 +1096,7 @@ static int loop_clr_fd(struct loop_device *lo)
int err;
struct loop_func_table *xfer;
kuid_t uid = current_uid();
+   int lo_flags = lo->lo_flags;
 
if (lo->lo_encrypt_key_size &&
!uid_eq(lo->lo_key_owner, uid) &&
@@ -1116,9 +1126,24 @@ static int loop_clr_fd(struct loop_device *lo)
if (err)
return err;
 
+   if (info->lo_flags & LO_FLAGS_BLOCKSIZE) {
+   lo->lo_flags |= LO_FLAGS_BLOCKSIZE;
+   if (LO_INFO_BLOCKSIZE(info) != 512 &&
+   LO_INFO_BLOCKSIZE(info) != 1024 &&
+   LO_INFO_BLOCKSIZE(info) != 2048 &&
+   LO_INFO_BLOCKSIZE(info) != 4096)
+   return -EINVAL;
+   if (LO_INFO_BLOCKSIZE(info) > lo->lo_blocksize)
+   return -EINVAL;
+   }
+
if (lo->lo_offset != info->lo_offset ||
-   lo->lo_sizelimit != info->lo_sizelimit)
-   if (figure_loop_size(lo, info->lo_offset, info->lo_sizelimit))
+   lo->lo_sizelimit != info->lo_sizelimit ||
+   lo->lo_flags != lo_flags ||
+   ((lo->lo_flags & LO_FLAGS_BLOCKSIZE) &&
+lo->lo_logical_blocksize != LO_INFO_BLOCKSIZE(info)))
+   if (figure_loop_size(lo, info->lo_offset, info->lo_sizelimit,
+info->lo_init[0]))
return -EFBIG;
 
loop_config_discard(lo);
@@ -1303,7 +1328,8 @@ static int loop_set_capacity(struct loop_device *lo)
if (unlikely(lo->lo_state != Lo_bound))
return -ENXIO;
 
-   return figure_loop_size(lo, lo->lo_offset, lo->lo_sizelimit);
+   return figure_loop_size(lo, lo->lo_offset, lo->lo_sizelimit,
+   lo->lo_logical_blocksize);
 }
 
 static int loop_set_dio(struct loop_device *lo, unsigned long arg)
diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index fb2237c..579f2f7 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -49,6 +49,7 @@ struct loop_device {
struct file *   lo_backing_file;
 

Re: [PATCH V3 2/2] Move stack parameters for sed_ioctl to prevent oversized stack with CONFIG_KASAN

2017-02-10 Thread Arnd Bergmann
On Thursday, February 9, 2017 10:20:01 AM CET Scott Bauer wrote:
> When CONFIG_KASAN is enabled, compilation fails:
> 
> block/sed-opal.c: In function 'sed_ioctl':
> block/sed-opal.c:2447:1: error: the frame size of 2256 bytes is larger than 
> 2048 bytes [-Werror=frame-larger-than=]
> 
> Moved all the ioctl structures off the stack and dynamically allocate
> using _IOC_SIZE()
> 
> Fixes: 455a7b238cd6 ("block: Add Sed-opal library")
> 
> Reported-by: Arnd Bergmann 
> Signed-off-by: Scott Bauer 
> ---
>  block/sed-opal.c | 134 
> +--
>  1 file changed, 50 insertions(+), 84 deletions(-)
> 
> diff --git a/block/sed-opal.c b/block/sed-opal.c
> index bf1406e..4985d95 100644
> --- a/block/sed-opal.c
> +++ b/block/sed-opal.c
> @@ -2346,7 +2346,10 @@ EXPORT_SYMBOL(opal_unlock_from_suspend);
>  
>  int sed_ioctl(struct opal_dev *dev, unsigned int cmd, unsigned long ptr)
>  {
> + void *ioctl_ptr;
> + int ret = -ENOTTY;
>   void __user *arg = (void __user *)ptr;
> + unsigned int cmd_size = _IOC_SIZE(cmd);
>  
>   if (!capable(CAP_SYS_ADMIN))
>   return -EACCES;

We usually have a size check in there to avoid allocating large amounts
of memory. _IOC_SIZEBITS is 14, so you can have up to 16kb here, which
is probably ok, but I'd recommend either adding a comment to say that
it is, or just checking against the largest realistic size.
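
Something like the following bound check before the allocation would cover
that (sketch only; the PAGE_SIZE cap is an arbitrary choice well above any
real opal_* structure):

	unsigned int cmd_size = _IOC_SIZE(cmd);

	/* Reject bogus sizes up front; _IOC_SIZE() can encode up to 16KiB,
	 * far more than any opal_* ioctl structure needs. */
	if (cmd_size == 0 || cmd_size > PAGE_SIZE)
		return -EINVAL;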

Having something like v4l with their tables of ioctl commands and
function pointers would also solve it, as you'd then be checking for
valid command numbers before doing the copy.

Otherwise looks good to me.

Arnd