Re: [PATCH 2/4] blk-mq: merge mq and sq make_request instances

2017-03-13 Thread h...@lst.de
On Mon, Mar 13, 2017 at 09:01:08PM +, Bart Van Assche wrote:
> > +   /*
> > +* @request_count may become stale because of schedule
> > +* out, so check the list again.
> > +*/
> 
> The above comment was relevant as long as there was a request_count assignment
> above blk_mq_sched_get_request(). This patch moves that assignment inside if
> (plug && q->nr_hw_queues == 1). Does that mean that the above comment should be
> removed entirely?

I don't think so - for the !blk_queue_nomerges cases we still rely
on blk_attempt_plug_merge calculating request_count, so the comment
still applies.


Re: [PATCH 1/4] blk-mq: remove BLK_MQ_F_DEFER_ISSUE

2017-03-13 Thread h...@lst.de
On Mon, Mar 13, 2017 at 08:52:54PM +, Bart Van Assche wrote:
> > -   if (((plug && !blk_queue_nomerges(q)) || is_sync) &&
> > -   !(data.hctx->flags & BLK_MQ_F_DEFER_ISSUE)) {
> > +   if (((plug && !blk_queue_nomerges(q)) || is_sync)) {
> 
> A minor comment: due to this change the outer pair of parentheses
> became superfluous. Please consider removing these.

The last patch in the series removes the statement in this form. But
if I have to respin the series for some reason I'll make sure it
gets removed here already.


Re: [PATCH] block/loop: fix race between I/O and set_status

2017-03-13 Thread Jiri Slaby
On 02/11/2017, 04:40 AM, Ming Lei wrote:
> Inside set_status, the transfer needs to be set up again, so
> we have to drain I/O before the transition; otherwise an
> oops may be triggered like the following:
> 
>   divide error:  [#1] SMP KASAN
>   CPU: 0 PID: 2935 Comm: loop7 Not tainted 4.10.0-rc7+ #213
>   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs
>   01/01/2011
>   task: 88006ba1e840 task.stack: 880067338000
>   RIP: 0010:transfer_xor+0x1d1/0x440 drivers/block/loop.c:110
>   RSP: 0018:88006733f108 EFLAGS: 00010246
>   RAX:  RBX: 8800688d7000 RCX: 0059
>   RDX:  RSI: 11000d743f43 RDI: 880068891c08
>   RBP: 88006733f160 R08: 8800688d7001 R09: 
>   R10:  R11:  R12: 8800688d7000
>   R13: 880067b7d000 R14: dc00 R15: 
>   FS:  () GS:88006d00()
>   knlGS:
>   CS:  0010 DS:  ES:  CR0: 80050033
>   CR2: 006c17e0 CR3: 66e3b000 CR4: 001406f0
>   Call Trace:
>lo_do_transfer drivers/block/loop.c:251 [inline]
>lo_read_transfer drivers/block/loop.c:392 [inline]
>do_req_filebacked drivers/block/loop.c:541 [inline]
>loop_handle_cmd drivers/block/loop.c:1677 [inline]
>loop_queue_work+0xda0/0x49b0 drivers/block/loop.c:1689
>kthread_worker_fn+0x4c3/0xa30 kernel/kthread.c:630
>kthread+0x326/0x3f0 kernel/kthread.c:227
>ret_from_fork+0x31/0x40 arch/x86/entry/entry_64.S:430
>   Code: 03 83 e2 07 41 29 df 42 0f b6 04 30 4d 8d 44 24 01 38 d0 7f 08
>   84 c0 0f 85 62 02 00 00 44 89 f8 41 0f b6 48 ff 25 ff 01 00 00 99 
>   7d c8 48 63 d2 48 03 55 d0 48 89 d0 48 89 d7 48 c1 e8 03 83
>   RIP: transfer_xor+0x1d1/0x440 drivers/block/loop.c:110 RSP:
>   88006733f108
>   ---[ end trace 0166f7bd3b0c0933 ]---
> 
> Reported-by: Dmitry Vyukov 
> Cc: sta...@vger.kernel.org
> Signed-off-by: Ming Lei 
> ---
>  drivers/block/loop.c | 17 -
>  1 file changed, 12 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index ed5259510857..4b52a1690329 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -1097,9 +1097,12 @@ loop_set_status(struct loop_device *lo, const struct loop_info64 *info)
>   if ((unsigned int) info->lo_encrypt_key_size > LO_KEY_SIZE)
>   return -EINVAL;
>  
> + /* I/O need to be drained during transfer transition */
> + blk_mq_freeze_queue(lo->lo_queue);
> +
>   err = loop_release_xfer(lo);
>   if (err)
> - return err;
> + goto exit;
>  
>   if (info->lo_encrypt_type) {
>   unsigned int type = info->lo_encrypt_type;
> @@ -1114,12 +1117,14 @@ loop_set_status(struct loop_device *lo, const struct loop_info64 *info)

I might be missing context, but since this is not explained in the log,
I must ask: why aren't these returns also converted?

if (info->lo_encrypt_type) {
unsigned int type = info->lo_encrypt_type;

if (type >= MAX_LO_CRYPT)
return -EINVAL;
xfer = xfer_funcs[type];
if (xfer == NULL)
return -EINVAL;
} else
xfer = NULL;

>  
>   err = loop_init_xfer(lo, xfer, info);
>   if (err)
> - return err;
> + goto exit;
>  
>   if (lo->lo_offset != info->lo_offset ||
>   lo->lo_sizelimit != info->lo_sizelimit)
> - if (figure_loop_size(lo, info->lo_offset, info->lo_sizelimit))
> - return -EFBIG;
> + if (figure_loop_size(lo, info->lo_offset, info->lo_sizelimit)) {
> + err = -EFBIG;
> + goto exit;
> + }
>  
>   loop_config_discard(lo);
>  
> @@ -1156,7 +1161,9 @@ loop_set_status(struct loop_device *lo, const struct loop_info64 *info)
>   /* update dio if lo_offset or transfer is changed */
>   __loop_update_dio(lo, lo->use_dio);
>  
> - return 0;
> + exit:
> + blk_mq_unfreeze_queue(lo->lo_queue);
> + return err;
>  }
>  
>  static int
> 

thanks,
-- 
js
suse labs


Re: [PATCH 3/4] blk-mq: improve blk_mq_try_issue_directly

2017-03-13 Thread Bart Van Assche
On Mon, 2017-03-13 at 09:48 -0600, Christoph Hellwig wrote:
> Rename blk_mq_try_issue_directly to __blk_mq_try_issue_directly and add a
> new wrapper that takes care of RCU / SRCU locking to avoid having
boilerplate code in the caller which would get duplicated with new callers.

Reviewed-by: Bart Van Assche 

Re: [PATCH 2/4] blk-mq: merge mq and sq make_request instances

2017-03-13 Thread Bart Van Assche
On Mon, 2017-03-13 at 09:48 -0600, Christoph Hellwig wrote:
> @@ -1534,7 +1529,36 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
>   }
>  
>   plug = current->plug;
> - if (((plug && !blk_queue_nomerges(q)) || is_sync)) {
> + if (plug && q->nr_hw_queues == 1) {
> + struct request *last = NULL;
> +
> + blk_mq_bio_to_request(rq, bio);
> +
> + /*
> +  * @request_count may become stale because of schedule
> +  * out, so check the list again.
> +  */

The above comment was relevant as long as there was a request_count assignment
above blk_mq_sched_get_request(). This patch moves that assignment inside if
(plug && q->nr_hw_queues == 1). Does that mean that the above comment should be
removed entirely?

> + if (list_empty(&plug->mq_list))
> + request_count = 0;
> + else if (blk_queue_nomerges(q))
> + request_count = blk_plug_queued_count(q);
> +
> + if (!request_count)
> + trace_block_plug(q);
> + else
> + last = list_entry_rq(plug->mq_list.prev);
> +
> + blk_mq_put_ctx(data.ctx);
> +
> + if (request_count >= BLK_MAX_REQUEST_COUNT || (last &&
> + blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {
> + blk_flush_plug_list(plug, false);
> + trace_block_plug(q);
> + }
> +
> + list_add_tail(&rq->queuelist, &plug->mq_list);
> + goto done;
> + } else if (((plug && !blk_queue_nomerges(q)) || is_sync)) {
>   struct request *old_rq = NULL;
>  
>   blk_mq_bio_to_request(rq, bio);

Bart.

Re: [PATCH 1/4] blk-mq: remove BLK_MQ_F_DEFER_ISSUE

2017-03-13 Thread Bart Van Assche
On Mon, 2017-03-13 at 09:48 -0600, Christoph Hellwig wrote:
> This flag was never used since it was introduced.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  block/blk-mq.c | 8 +---
>  include/linux/blk-mq.h | 1 -
>  2 files changed, 1 insertion(+), 8 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 159187a28d66..acf0ddf4af52 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1534,13 +1534,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
>   }
>  
>   plug = current->plug;
> - /*
> -  * If the driver supports defer issued based on 'last', then
> -  * queue it up like normal since we can potentially save some
> -  * CPU this way.
> -  */
> - if (((plug && !blk_queue_nomerges(q)) || is_sync) &&
> - !(data.hctx->flags & BLK_MQ_F_DEFER_ISSUE)) {
> + if (((plug && !blk_queue_nomerges(q)) || is_sync)) {

A minor comment: due to this change the outer pair of parentheses
became superfluous. Please consider removing these.

Thanks,

Bart.

Re: [PATCH 0/11 v4] block: Fix block device shutdown related races

2017-03-13 Thread Dan Williams
On Mon, Mar 13, 2017 at 8:13 AM, Jan Kara  wrote:
> Hello,
>
> this is a series with the remaining patches (on top of 4.11-rc2) to fix 
> several
> different races and issues I've found when testing device shutdown and reuse.
> The first two patches fix possible (theoretical) problems when opening of a
> block device races with shutdown of a gendisk structure. Patches 3-9 fix oops
> that is triggered by __blkdev_put() calling inode_detach_wb() too early (the
> problem reported by Thiago). Patches 10 and 11 fix oops due to a bug in 
> gendisk
> code where get_gendisk() can return already freed gendisk structure (again
> triggered by Omar's stress test).
>
> People, please have a look at patches. They are mostly simple however the
> interactions are rather complex so I may have missed something. Also I'm
> happy for any additional testing these patches can get - I've stressed them
> with Omar's script, tested memcg writeback, tested static (not udev managed)
> device inodes.

Passes testing with the libnvdimm unit tests that have been tripped up
by block-unplug bugs in the past.


Re: 4.11.0-rc1 boot resulted in WARNING: CPU: 14 PID: 1722 at fs/sysfs/dir.c:31 .sysfs_warn_dup+0x78/0xb0

2017-03-13 Thread Abdul Haleem
On Sat, 2017-03-11 at 15:46 -0700, Jens Axboe wrote:
> On 03/09/2017 05:59 AM, Brian Foster wrote:
> > cc linux-block
> > 
> > On Thu, Mar 09, 2017 at 04:20:06PM +0530, Abdul Haleem wrote:
> >> On Wed, 2017-03-08 at 08:17 -0500, Brian Foster wrote:
> >>> On Tue, Mar 07, 2017 at 10:01:04PM +0530, Abdul Haleem wrote:
> 
>  Hi,
> 
>  Today's mainline (4.11.0-rc1) booted with warnings on Power7 LPAR.
> 
>  Issue is not reproducible all the time.
> 
> Is that still the case with -git as of yesterday? Check that you
> have this merge:
> 
> 34bbce9e344b47e8871273409632f525973afad4
> 
> in your tree.
> 

Thanks for pointing that out; with the below merge commit the warnings disappear.

commit 34bbce9e344b47e8871273409632f525973afad4
Merge: bb61ce5 672a2c8
Author: Linus Torvalds 
Date:   Thu Mar 9 15:53:25 2017 -0800

Merge branch 'for-linus' of git://git.kernel.dk/linux-block


Thanks for the fix !

Reported-by & Tested-by : Abdul Haleem 

-- 
Regard's

Abdul Haleem
IBM Linux Technology Centre





[PATCH 1/4] blk-mq: remove BLK_MQ_F_DEFER_ISSUE

2017-03-13 Thread Christoph Hellwig
This flag was never used since it was introduced.

Signed-off-by: Christoph Hellwig 
---
 block/blk-mq.c | 8 +---
 include/linux/blk-mq.h | 1 -
 2 files changed, 1 insertion(+), 8 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 159187a28d66..acf0ddf4af52 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1534,13 +1534,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
}
 
plug = current->plug;
-   /*
-* If the driver supports defer issued based on 'last', then
-* queue it up like normal since we can potentially save some
-* CPU this way.
-*/
-   if (((plug && !blk_queue_nomerges(q)) || is_sync) &&
-   !(data.hctx->flags & BLK_MQ_F_DEFER_ISSUE)) {
+   if (((plug && !blk_queue_nomerges(q)) || is_sync)) {
struct request *old_rq = NULL;
 
blk_mq_bio_to_request(rq, bio);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index b296a9006117..5b3e201c8d4f 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -152,7 +152,6 @@ enum {
BLK_MQ_F_SHOULD_MERGE   = 1 << 0,
BLK_MQ_F_TAG_SHARED = 1 << 1,
BLK_MQ_F_SG_MERGE   = 1 << 2,
-   BLK_MQ_F_DEFER_ISSUE= 1 << 4,
BLK_MQ_F_BLOCKING   = 1 << 5,
BLK_MQ_F_NO_SCHED   = 1 << 6,
BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
-- 
2.11.0



[PATCH 4/4] blk-mq: streamline blk_mq_make_request

2017-03-13 Thread Christoph Hellwig
Turn the different ways of merging or issuing I/O into a series of if/else
statements instead of the current maze of gotos.  Note that this means we
pin the CPU a little longer for some cases as the CTX put is moved to
common code at the end of the function.

Signed-off-by: Christoph Hellwig 
---
 block/blk-mq.c | 67 +++---
 1 file changed, 27 insertions(+), 40 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 48748cb799ed..18e449cc832f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1534,16 +1534,17 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 
cookie = request_to_qc_t(data.hctx, rq);
 
+   plug = current->plug;
if (unlikely(is_flush_fua)) {
-   if (q->elevator)
-   goto elv_insert;
blk_mq_bio_to_request(rq, bio);
-   blk_insert_flush(rq);
-   goto run_queue;
-   }
-
-   plug = current->plug;
-   if (plug && q->nr_hw_queues == 1) {
+   if (q->elevator) {
+   blk_mq_sched_insert_request(rq, false, true,
+   !is_sync || is_flush_fua, true);
+   } else {
+   blk_insert_flush(rq);
+   blk_mq_run_hw_queue(data.hctx, !is_sync || is_flush_fua);
+   }
+   } else if (plug && q->nr_hw_queues == 1) {
struct request *last = NULL;
 
blk_mq_bio_to_request(rq, bio);
@@ -1562,8 +1563,6 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
else
last = list_entry_rq(plug->mq_list.prev);
 
-   blk_mq_put_ctx(data.ctx);
-
if (request_count >= BLK_MAX_REQUEST_COUNT || (last &&
blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {
blk_flush_plug_list(plug, false);
@@ -1571,56 +1570,44 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
}
 
list_add_tail(&rq->queuelist, &plug->mq_list);
-   goto done;
-   } else if (((plug && !blk_queue_nomerges(q)) || is_sync)) {
-   struct request *old_rq = NULL;
-
+   } else if (plug && !blk_queue_nomerges(q)) {
blk_mq_bio_to_request(rq, bio);
 
/*
 * We do limited plugging. If the bio can be merged, do that.
 * Otherwise the existing request in the plug list will be
 * issued. So the plug list will have one request at most
+*
+* The plug list might get flushed before this. If that happens,
+* the plug list is empty and same_queue_rq is invalid.
 */
-   if (plug) {
-   /*
-* The plug list might get flushed before this. If that
-* happens, same_queue_rq is invalid and plug list is
-* empty
-*/
-   if (same_queue_rq && !list_empty(&plug->mq_list)) {
-   old_rq = same_queue_rq;
-   list_del_init(&old_rq->queuelist);
-   }
-   list_add_tail(&rq->queuelist, &plug->mq_list);
-   } else /* is_sync */
-   old_rq = rq;
-   blk_mq_put_ctx(data.ctx);
-   if (old_rq)
-   blk_mq_try_issue_directly(data.hctx, old_rq, &cookie);
-   goto done;
-   }
+   if (!list_empty(&plug->mq_list))
+   list_del_init(&same_queue_rq->queuelist);
+   else
+   same_queue_rq = NULL;
 
-   if (q->elevator) {
-elv_insert:
-   blk_mq_put_ctx(data.ctx);
+   list_add_tail(&rq->queuelist, &plug->mq_list);
+   if (same_queue_rq)
+   blk_mq_try_issue_directly(data.hctx, same_queue_rq,
+   &cookie);
+   } else if (is_sync) {
+   blk_mq_bio_to_request(rq, bio);
+   blk_mq_try_issue_directly(data.hctx, rq, &cookie);
+   } else if (q->elevator) {
blk_mq_bio_to_request(rq, bio);
blk_mq_sched_insert_request(rq, false, true,
!is_sync || is_flush_fua, true);
-   goto done;
-   }
-   if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
+   } else if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
/*
 * For a SYNC request, send it to the hardware immediately. For
 * an ASYNC request, just ensure that we run it later on. The
 * latter allows for merging opportunities and more efficient
 * dispatching.
 */
-run_queue:
   

NULL deref in cpu hot unplug on jens for-linus branch

2017-03-13 Thread Sagi Grimberg

Hey Jens,

After some fixes to nvme-rdma in the area of cpu hot unplug and
rebase to jens for-linus branch I get the following NULL deref [1]

This crash did not happen before I rebased to for-linus (unless I
screwed up something).

I'm on my way out so I just send it out in hope that someone can
figure it out before I do...

After I offlined a cpu, I got the nvmf target to disconnect
from the host, the host then schedules a reconnect. after the
host reconnects it issues a namespace scanning which removes
an old namespace. Then we get to blk_cleanup_queue which
then triggers the NULL deref.

The strange thing is that we pass the (blk_mq_hw_queue_mapped(hctx)) condition but still hit a NULL...

[1]
--
[   55.865818] BUG: unable to handle kernel NULL pointer dereference at 
0008

[   55.867094] IP: __blk_mq_tag_idle+0x19/0x30
[   55.867825] PGD 0

[   55.868477] Oops: 0002 [#1] SMP
[   55.869010] Modules linked in: nvme_rdma nvme_fabrics nvme_core 
mlx5_ib ppdev kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul 
ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd glue_helper 
cryptd joydev input_leds serio_raw i2c_piix4 parport_pc parport mac_hid 
ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp 
libiscsi sunrpc scsi_transport_iscsi autofs4 cirrus ttm drm_kms_helper 
syscopyarea sysfillrect sysimgblt mlx5_core fb_sys_fops ptp psmouse drm 
floppy pps_core pata_acpi

[   55.876358] CPU: 0 PID: 21 Comm: kworker/0:1 Not tainted 4.11.0-rc1+ #136
[   55.877492] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014

[   55.879055] Workqueue: events nvme_scan_work [nvme_core]
[   55.879940] task: a0b13e1d9080 task.stack: ad2000244000
[   55.880921] RIP: 0010:__blk_mq_tag_idle+0x19/0x30
[   55.881713] RSP: 0018:ad2000247c70 EFLAGS: 00010203
[   55.882582] RAX:  RBX: a0b13376f400 RCX: 
a0b13fc11d00
[   55.883808] RDX: 0001 RSI: a0b13376f400 RDI: 
a0b13376f400
[   55.884983] RBP: ad2000247c70 R08:  R09: 
bee42e20
[   55.886168] R10: ad2000247b88 R11: 0008 R12: 
a0b1384c6018
[   55.887343] R13: 0001 R14: 0080 R15: 

[   55.888517] FS:  () GS:a0b13fc0() 
knlGS:

[   55.889816] CS:  0010 DS:  ES:  CR0: 80050033
[   55.890738] CR2: 0008 CR3: 3ba2f000 CR4: 
003406f0

[   55.891878] Call Trace:
[   55.892285]  blk_mq_exit_hctx.isra.41+0xc4/0xd0
[   55.893020]  blk_mq_free_queue+0x110/0x130
[   55.893693]  blk_cleanup_queue+0xe0/0x150
[   55.894346]  nvme_ns_remove+0x78/0xd0 [nvme_core]
[   55.895109]  nvme_validate_ns+0x8c/0x290 [nvme_core]
[   55.895911]  ? nvme_scan_work+0x28a/0x370 [nvme_core]
[   55.896726]  nvme_scan_work+0x2ad/0x370 [nvme_core]
[   55.897523]  process_one_work+0x16b/0x480
[   55.898174]  worker_thread+0x4b/0x500
[   55.898771]  kthread+0x101/0x140
[   55.899299]  ? process_one_work+0x480/0x480
[   55.899977]  ? kthread_create_on_node+0x40/0x40
[   55.900711]  ? start_kernel+0x3bc/0x461
[   55.901336]  ? acpi_early_init+0x83/0xf9
[   55.901980]  ? acpi_load_tables+0x31/0x85
[   55.902632]  ret_from_fork+0x2c/0x40
[   55.903215] Code: 74 09 48 8d 7b 48 e8 67 4b 06 00 5b 41 5c 5d c3 66 
90 0f 1f 44 00 00 48 8b 87 08 01 00 00 f0 0f ba 77 18 01 72 01 c3 55 48 
89 e5  ff 48 08 48 8d 78 10 e8 3a 4b 06 00 5d c3 0f 1f 84 00 00 00

[   55.906220] RIP: __blk_mq_tag_idle+0x19/0x30 RSP: ad2000247c70
[   55.907209] CR2: 0008
[   55.907750] ---[ end trace f016dee1082237cf ]---
--


[PATCH 09/11] block: Fix oops in locked_inode_to_wb_and_lock_list()

2017-03-13 Thread Jan Kara
When block device is closed, we call inode_detach_wb() in __blkdev_put()
which sets inode->i_wb to NULL. That is contrary to expectations that
inode->i_wb stays valid once set during the whole inode's lifetime and
leads to oops in wb_get() in locked_inode_to_wb_and_lock_list() because
inode_to_wb() returned NULL.

The reason why we called inode_detach_wb() is not valid anymore though.
BDI is guaranteed to stay along until we call bdi_put() from
bdev_evict_inode() so we can postpone calling inode_detach_wb() to that
moment.

Also add a warning to catch if someone uses inode_detach_wb() in a
dangerous way.

Reported-by: Thiago Jung Bauermann 
Acked-by: Tejun Heo 
Signed-off-by: Jan Kara 
---
 fs/block_dev.c| 8 ++--
 include/linux/writeback.h | 1 +
 2 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 5ec8750f5332..c66f5dd4a02a 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -885,6 +885,8 @@ static void bdev_evict_inode(struct inode *inode)
spin_lock(&bdev_lock);
list_del_init(&bdev->bd_list);
spin_unlock(&bdev_lock);
+   /* Detach inode from wb early as bdi_put() may free bdi->wb */
+   inode_detach_wb(inode);
if (bdev->bd_bdi != &noop_backing_dev_info) {
bdi_put(bdev->bd_bdi);
bdev->bd_bdi = &noop_backing_dev_info;
@@ -1880,12 +1882,6 @@ static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
kill_bdev(bdev);
 
bdev_write_inode(bdev);
-   /*
-* Detaching bdev inode from its wb in __destroy_inode()
-* is too late: the queue which embeds its bdi (along with
-* root wb) can be gone as soon as we put_disk() below.
-*/
-   inode_detach_wb(bdev->bd_inode);
}
if (bdev->bd_contains == bdev) {
if (disk->fops->release)
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index a3c0cbd7c888..d5815794416c 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -237,6 +237,7 @@ static inline void inode_attach_wb(struct inode *inode, struct page *page)
 static inline void inode_detach_wb(struct inode *inode)
 {
if (inode->i_wb) {
+   WARN_ON_ONCE(!(inode->i_state & I_CLEAR));
wb_put(inode->i_wb);
inode->i_wb = NULL;
}
-- 
2.10.2



[PATCH 04/11] bdi: Make wb->bdi a proper reference

2017-03-13 Thread Jan Kara
Make wb->bdi a proper refcounted reference to bdi for all bdi_writeback
structures except for the one embedded inside struct backing_dev_info.
That will allow us to simplify bdi unregistration.

Acked-by: Tejun Heo 
Signed-off-by: Jan Kara 
---
 mm/backing-dev.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 12408f86783c..03d4ba27c133 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -294,6 +294,8 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
 
memset(wb, 0, sizeof(*wb));
 
+   if (wb != &bdi->wb)
+   bdi_get(bdi);
wb->bdi = bdi;
wb->last_old_flush = jiffies;
INIT_LIST_HEAD(&wb->b_dirty);
@@ -314,8 +316,10 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
wb->dirty_sleep = jiffies;
 
wb->congested = wb_congested_get_create(bdi, blkcg_id, gfp);
-   if (!wb->congested)
-   return -ENOMEM;
+   if (!wb->congested) {
+   err = -ENOMEM;
+   goto out_put_bdi;
+   }
 
err = fprop_local_init_percpu(&wb->completions, gfp);
if (err)
@@ -335,6 +339,9 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
fprop_local_destroy_percpu(&wb->completions);
 out_put_cong:
wb_congested_put(wb->congested);
+out_put_bdi:
+   if (wb != &bdi->wb)
+   bdi_put(bdi);
return err;
 }
 
@@ -372,6 +379,8 @@ static void wb_exit(struct bdi_writeback *wb)
 
fprop_local_destroy_percpu(&wb->completions);
wb_congested_put(wb->congested);
+   if (wb != &wb->bdi->wb)
+   bdi_put(wb->bdi);
 }
 
 #ifdef CONFIG_CGROUP_WRITEBACK
-- 
2.10.2



[PATCH 08/11] bdi: Rename cgwb_bdi_destroy() to cgwb_bdi_unregister()

2017-03-13 Thread Jan Kara
Rename cgwb_bdi_destroy() to cgwb_bdi_unregister() as it gets called
from bdi_unregister() which is not necessarily called from bdi_destroy()
and thus the name is somewhat misleading.

Acked-by: Tejun Heo 
Signed-off-by: Jan Kara 
---
 mm/backing-dev.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 8c30b1a7aae5..3ea3bbd921d6 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -700,7 +700,7 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi)
return ret;
 }
 
-static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
+static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
 {
struct radix_tree_iter iter;
void **slot;
@@ -801,7 +801,7 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi)
return 0;
 }
 
-static void cgwb_bdi_destroy(struct backing_dev_info *bdi) { }
+static void cgwb_bdi_unregister(struct backing_dev_info *bdi) { }
 
 static void cgwb_bdi_exit(struct backing_dev_info *bdi)
 {
@@ -925,7 +925,7 @@ void bdi_unregister(struct backing_dev_info *bdi)
/* make sure nobody finds us on the bdi_list anymore */
bdi_remove_from_list(bdi);
wb_shutdown(&bdi->wb);
-   cgwb_bdi_destroy(bdi);
+   cgwb_bdi_unregister(bdi);
 
if (bdi->dev) {
bdi_debug_unregister(bdi);
-- 
2.10.2



[PATCH 06/11] bdi: Shutdown writeback on all cgwbs in cgwb_bdi_destroy()

2017-03-13 Thread Jan Kara
Currently we waited for all cgwbs to get freed in cgwb_bdi_destroy()
which also means that writeback has been shutdown on them. Since this
wait is going away, directly shutdown writeback on cgwbs from
cgwb_bdi_destroy() to avoid live writeback structures after
bdi_unregister() has finished. To make that safe with concurrent
shutdown from cgwb_release_workfn(), we also have to make sure
wb_shutdown() returns only after the bdi_writeback structure is really
shutdown.

Acked-by: Tejun Heo 
Signed-off-by: Jan Kara 
---
 include/linux/backing-dev-defs.h |  1 +
 mm/backing-dev.c | 22 ++
 2 files changed, 23 insertions(+)

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 8fb3dcdebc80..8af720f22a2d 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -21,6 +21,7 @@ struct dentry;
  */
 enum wb_state {
WB_registered,  /* bdi_register() was done */
+   WB_shutting_down,   /* wb_shutdown() in progress */
WB_writeback_running,   /* Writeback is in progress */
WB_has_dirty_io,/* Dirty inodes on ->b_{dirty|io|more_io} */
 };
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index e3d56dba4da8..b67be4fc12c4 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -356,8 +356,15 @@ static void wb_shutdown(struct bdi_writeback *wb)
spin_lock_bh(&wb->work_lock);
if (!test_and_clear_bit(WB_registered, &wb->state)) {
spin_unlock_bh(&wb->work_lock);
+   /*
+* Wait for wb shutdown to finish if someone else is just
+* running wb_shutdown(). Otherwise we could proceed to wb /
+* bdi destruction before wb_shutdown() is finished.
+*/
+   wait_on_bit(&wb->state, WB_shutting_down, TASK_UNINTERRUPTIBLE);
return;
}
+   set_bit(WB_shutting_down, &wb->state);
spin_unlock_bh(&wb->work_lock);
 
cgwb_remove_from_bdi_list(wb);
@@ -369,6 +376,12 @@ static void wb_shutdown(struct bdi_writeback *wb)
mod_delayed_work(bdi_wq, &wb->dwork, 0);
flush_delayed_work(&wb->dwork);
WARN_ON(!list_empty(&wb->work_list));
+   /*
+* Make sure bit gets cleared after shutdown is finished. Matches with
+* the barrier provided by test_and_clear_bit() above.
+*/
+   smp_wmb();
+   clear_bit(WB_shutting_down, &wb->state);
 }
 
 static void wb_exit(struct bdi_writeback *wb)
@@ -699,12 +712,21 @@ static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
 {
struct radix_tree_iter iter;
void **slot;
+   struct bdi_writeback *wb;
 
WARN_ON(test_bit(WB_registered, &bdi->wb.state));
 
spin_lock_irq(&cgwb_lock);
radix_tree_for_each_slot(slot, &bdi->cgwb_tree, &iter, 0)
cgwb_kill(*slot);
+
+   while (!list_empty(&bdi->wb_list)) {
+   wb = list_first_entry(&bdi->wb_list, struct bdi_writeback,
+ bdi_node);
+   spin_unlock_irq(&cgwb_lock);
+   wb_shutdown(wb);
+   spin_lock_irq(&cgwb_lock);
+   }
spin_unlock_irq(&cgwb_lock);
 
/*
-- 
2.10.2



[PATCH 11/11] block: Fix oops scsi_disk_get()

2017-03-13 Thread Jan Kara
When device open races with device shutdown, we can get the following
oops in scsi_disk_get():

[11863.044351] general protection fault:  [#1] SMP
[11863.045561] Modules linked in: scsi_debug xfs libcrc32c netconsole btrfs 
raid6_pq zlib_deflate lzo_compress xor [last unloaded: loop]
[11863.047853] CPU: 3 PID: 13042 Comm: hald-probe-stor Tainted: G W  
4.10.0-rc2-xen+ #35
[11863.048030] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[11863.048030] task: 88007f438200 task.stack: c9fd
[11863.048030] RIP: 0010:scsi_disk_get+0x43/0x70
[11863.048030] RSP: 0018:c9fd3a08 EFLAGS: 00010202
[11863.048030] RAX: 6b6b6b6b6b6b6b6b RBX: 88007f56d000 RCX: 
[11863.048030] RDX: 0001 RSI: 0004 RDI: 81a8d880
[11863.048030] RBP: c9fd3a18 R08:  R09: 0001
[11863.059217] R10:  R11:  R12: fffa
[11863.059217] R13: 880078872800 R14: 880070915540 R15: 001d
[11863.059217] FS:  7f2611f71800() GS:88007f0c() 
knlGS:
[11863.059217] CS:  0010 DS:  ES:  CR0: 80050033
[11863.059217] CR2: 0060e048 CR3: 778d4000 CR4: 06e0
[11863.059217] Call Trace:
[11863.059217]  ? disk_get_part+0x22/0x1f0
[11863.059217]  sd_open+0x39/0x130
[11863.059217]  __blkdev_get+0x69/0x430
[11863.059217]  ? bd_acquire+0x7f/0xc0
[11863.059217]  ? bd_acquire+0x96/0xc0
[11863.059217]  ? blkdev_get+0x350/0x350
[11863.059217]  blkdev_get+0x126/0x350
[11863.059217]  ? _raw_spin_unlock+0x2b/0x40
[11863.059217]  ? bd_acquire+0x7f/0xc0
[11863.059217]  ? blkdev_get+0x350/0x350
[11863.059217]  blkdev_open+0x65/0x80
...

As you can see RAX value is already poisoned showing that gendisk we got
is already freed. The problem is that get_gendisk() looks up device
number in ext_devt_idr and then does get_disk() which does kobject_get()
on the disks kobject. However the disk gets removed from ext_devt_idr
only in disk_release() (through blk_free_devt()) at which moment it has
already 0 refcount and is already on its way to be freed. Indeed we've
got a warning from kobject_get() about 0 refcount shortly before the
oops.

We fix the problem by using kobject_get_unless_zero() in get_disk() so
that get_disk() cannot get reference on a disk that is already being
freed.

Tested-by: Lekshmi Pillai 
Reviewed-by: Bart Van Assche 
Acked-by: Tejun Heo 
Signed-off-by: Jan Kara 
---
 block/genhd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/genhd.c b/block/genhd.c
index a9c516a8b37d..510aac1486cb 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1352,7 +1352,7 @@ struct kobject *get_disk(struct gendisk *disk)
owner = disk->fops->owner;
if (owner && !try_module_get(owner))
return NULL;
-   kobj = kobject_get(&disk_to_dev(disk)->kobj);
+   kobj = kobject_get_unless_zero(&disk_to_dev(disk)->kobj);
if (kobj == NULL) {
module_put(owner);
return NULL;
-- 
2.10.2



[PATCH 02/11] block: Fix race of bdev open with gendisk shutdown

2017-03-13 Thread Jan Kara
blkdev_open() may race with gendisk shutdown in two different ways.
Either del_gendisk() has already unhashed block device inode (and thus
bd_acquire() will end up creating new block device inode) however
get_gendisk() will still return the gendisk that is being destroyed.
Or bdev returned by bd_acquire() will get unhashed and gendisk destroyed
before we get to get_gendisk() and get_gendisk() will return new gendisk
that got allocated for device that reused the device number.

In both cases this will result in possible inconsistencies between
bdev->bd_disk and bdev->bd_bdi (in the first case after gendisk gets
destroyed and device number reused, in the second case immediately).

Fix the problem by checking whether the gendisk is still alive and inode
hashed when associating bdev inode with it and its bdi. That way we are
sure that we will not associate bdev inode with disk that got past
blk_unregister_region() in del_gendisk() (and thus device number can get
reused). Similarly, we will not associate bdev that was once associated
with gendisk that is going away (and thus the corresponding bdev inode
will get unhashed in del_gendisk()) with a new gendisk that is just
reusing the device number.

Also add a warning that will tell us about unexpected inconsistencies
between bdi associated with the bdev inode and bdi associated with the
disk.

Signed-off-by: Jan Kara 
---
 fs/block_dev.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 53e2389ae4d4..5ec8750f5332 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1560,7 +1561,8 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
if (!partno) {
ret = -ENXIO;
bdev->bd_part = disk_get_part(disk, partno);
-   if (!bdev->bd_part)
+   if (!(disk->flags & GENHD_FL_UP) || !bdev->bd_part ||
+   inode_unhashed(bdev->bd_inode))
goto out_clear;
 
ret = 0;
@@ -1614,7 +1615,8 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
bdev->bd_contains = whole;
bdev->bd_part = disk_get_part(disk, partno);
if (!(disk->flags & GENHD_FL_UP) ||
-   !bdev->bd_part || !bdev->bd_part->nr_sects) {
+   !bdev->bd_part || !bdev->bd_part->nr_sects ||
+   inode_unhashed(bdev->bd_inode)) {
ret = -ENXIO;
goto out_clear;
}
@@ -1623,6 +1625,9 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 
	if (bdev->bd_bdi == &noop_backing_dev_info)
bdev->bd_bdi = bdi_get(disk->queue->backing_dev_info);
+   else
+   WARN_ON_ONCE(bdev->bd_bdi !=
+			     disk->queue->backing_dev_info);
} else {
if (bdev->bd_contains == bdev) {
ret = 0;
-- 
2.10.2



[PATCH 0/11 v4] block: Fix block device shutdown related races

2017-03-13 Thread Jan Kara
Hello,

this is a series with the remaining patches (on top of 4.11-rc2) to fix several
different races and issues I've found when testing device shutdown and reuse.
The first two patches fix possible (theoretical) problems when opening of a
block device races with shutdown of a gendisk structure. Patches 3-9 fix oops
that is triggered by __blkdev_put() calling inode_detach_wb() too early (the
problem reported by Thiago). Patches 10 and 11 fix oops due to a bug in gendisk
code where get_gendisk() can return already freed gendisk structure (again
triggered by Omar's stress test).

People, please have a look at patches. They are mostly simple however the
interactions are rather complex so I may have missed something. Also I'm
happy for any additional testing these patches can get - I've stressed them
with Omar's script, tested memcg writeback, tested static (not udev managed)
device inodes.

Changes since v3:
* Rebased on top of 4.11-rc2
* Reworked patch 2 (block: Fix race of bdev open with gendisk shutdown) based
  on Tejun's feedback
* Significantly updated patch 5 (and dropped previous Tejun's ack) to
  accommodate for fixes to SCSI re-registration of BDI that went to 4.11-rc2

Changes since v2:
* Added Tejun's acks
* Rebased on top of 4.11-rc1
* Fixed two possible races between blkdev_open() and del_gendisk()
* Fixed possible race between concurrent shutdown of cgwb spotted by Tejun

Changes since v1:
* Added Acks and Tested-by tags for patches in areas that did not change
* Reworked inode_detach_wb() related fixes based on Tejun's feedback

Honza


[PATCH 01/11] block: Fix bdi assignment to bdev inode when racing with disk delete

2017-03-13 Thread Jan Kara
When disk->fops->open() in __blkdev_get() returns -ERESTARTSYS, we
restart the process of opening the block device. However we forget to
switch bdev->bd_bdi back to noop_backing_dev_info and as a result bdev
inode will be pointing to a stale bdi. Fix the problem by setting
bdev->bd_bdi later when __blkdev_get() is already guaranteed to succeed.

Acked-by: Tejun Heo 
Reviewed-by: Hannes Reinecke 
Signed-off-by: Jan Kara 
---
 fs/block_dev.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 2eca00ec4370..53e2389ae4d4 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1556,8 +1556,6 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
bdev->bd_disk = disk;
bdev->bd_queue = disk->queue;
bdev->bd_contains = bdev;
-   if (bdev->bd_bdi == &noop_backing_dev_info)
-   bdev->bd_bdi = bdi_get(disk->queue->backing_dev_info);
 
if (!partno) {
ret = -ENXIO;
@@ -1622,6 +1620,9 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
}
bd_set_size(bdev, (loff_t)bdev->bd_part->nr_sects << 9);
}
+
+   if (bdev->bd_bdi == &noop_backing_dev_info)
+   bdev->bd_bdi = bdi_get(disk->queue->backing_dev_info);
} else {
if (bdev->bd_contains == bdev) {
ret = 0;
@@ -1653,8 +1654,6 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
bdev->bd_disk = NULL;
bdev->bd_part = NULL;
bdev->bd_queue = NULL;
-   bdi_put(bdev->bd_bdi);
-   bdev->bd_bdi = &noop_backing_dev_info;
if (bdev != bdev->bd_contains)
__blkdev_put(bdev->bd_contains, mode, 1);
bdev->bd_contains = NULL;
-- 
2.10.2



[PATCH 10/11] kobject: Export kobject_get_unless_zero()

2017-03-13 Thread Jan Kara
Make the function available for outside use and fortify it against NULL
kobject.

CC: Greg Kroah-Hartman 
Reviewed-by: Bart Van Assche 
Acked-by: Tejun Heo 
Signed-off-by: Jan Kara 
---
 include/linux/kobject.h | 2 ++
 lib/kobject.c   | 5 -
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/kobject.h b/include/linux/kobject.h
index e6284591599e..ca85cb80e99a 100644
--- a/include/linux/kobject.h
+++ b/include/linux/kobject.h
@@ -108,6 +108,8 @@ extern int __must_check kobject_rename(struct kobject *, const char *new_name);
 extern int __must_check kobject_move(struct kobject *, struct kobject *);
 
 extern struct kobject *kobject_get(struct kobject *kobj);
+extern struct kobject * __must_check kobject_get_unless_zero(
+   struct kobject *kobj);
 extern void kobject_put(struct kobject *kobj);
 
 extern const void *kobject_namespace(struct kobject *kobj);
diff --git a/lib/kobject.c b/lib/kobject.c
index 445dcaeb0f56..763d70a18941 100644
--- a/lib/kobject.c
+++ b/lib/kobject.c
@@ -601,12 +601,15 @@ struct kobject *kobject_get(struct kobject *kobj)
 }
 EXPORT_SYMBOL(kobject_get);
 
-static struct kobject * __must_check kobject_get_unless_zero(struct kobject *kobj)
+struct kobject * __must_check kobject_get_unless_zero(struct kobject *kobj)
 {
+   if (!kobj)
+   return NULL;
	if (!kref_get_unless_zero(&kobj->kref))
kobj = NULL;
return kobj;
 }
+EXPORT_SYMBOL(kobject_get_unless_zero);
 
 /*
  * kobject_cleanup - free kobject resources.
-- 
2.10.2



[PATCH 03/11] bdi: Mark congested->bdi as internal

2017-03-13 Thread Jan Kara
congested->bdi pointer is used only to be able to remove congested
structure from bdi->cgwb_congested_tree on structure release. Moreover
the pointer can become NULL when we unregister the bdi. Rename the field
to __bdi and add a comment to make it more explicit this is internal
stuff of memcg writeback code and people should not use the field as
such use will be likely race prone.

We do not bother with converting congested->bdi to a proper refcounted
reference. It will be slightly ugly to special-case bdi->wb.congested to
avoid effectively a cyclic reference of bdi to itself and the reference
gets cleared from bdi_unregister() making it impossible to reference
a freed bdi.

Acked-by: Tejun Heo 
Signed-off-by: Jan Kara 
---
 include/linux/backing-dev-defs.h |  4 +++-
 mm/backing-dev.c | 10 +-
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index ad955817916d..8fb3dcdebc80 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -54,7 +54,9 @@ struct bdi_writeback_congested {
	atomic_t refcnt;	/* nr of attached wb's and blkg */
 
 #ifdef CONFIG_CGROUP_WRITEBACK
-   struct backing_dev_info *bdi;   /* the associated bdi */
+   struct backing_dev_info *__bdi; /* the associated bdi, set to NULL
+* on bdi unregistration. For memcg-wb
+* internal use only! */
int blkcg_id;   /* ID of the associated blkcg */
struct rb_node rb_node; /* on bdi->cgwb_congestion_tree */
 #endif
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index c6f2a37028c2..12408f86783c 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -438,7 +438,7 @@ wb_congested_get_create(struct backing_dev_info *bdi, int blkcg_id, gfp_t gfp)
return NULL;
 
	atomic_set(&new_congested->refcnt, 0);
-   new_congested->bdi = bdi;
+   new_congested->__bdi = bdi;
new_congested->blkcg_id = blkcg_id;
goto retry;
 
@@ -466,10 +466,10 @@ void wb_congested_put(struct bdi_writeback_congested *congested)
}
 
/* bdi might already have been destroyed leaving @congested unlinked */
-   if (congested->bdi) {
+   if (congested->__bdi) {
		rb_erase(&congested->rb_node,
-			 &congested->bdi->cgwb_congested_tree);
-		congested->bdi = NULL;
+			 &congested->__bdi->cgwb_congested_tree);
+   congested->__bdi = NULL;
}
 
	spin_unlock_irqrestore(&cgwb_lock, flags);
@@ -752,7 +752,7 @@ static void cgwb_bdi_exit(struct backing_dev_info *bdi)
		struct bdi_writeback_congested *congested =
			rb_entry(rbn, struct bdi_writeback_congested, rb_node);
 
		rb_erase(rbn, &bdi->cgwb_congested_tree);
-   congested->bdi = NULL;  /* mark @congested unlinked */
+   congested->__bdi = NULL;/* mark @congested unlinked */
}
	spin_unlock_irq(&cgwb_lock);
 }
-- 
2.10.2



Re: [PATCH v5] blkcg: allocate struct blkcg_gq outside request queue spinlock

2017-03-13 Thread Jens Axboe
On 03/11/2017 09:35 PM, Tahsin Erdogan wrote:
> On Sat, Mar 11, 2017 at 2:52 PM, Jens Axboe  wrote:
> 
>>
>> Talked to Tejun about this as well, and we both agree that the splitting
>> this into separate init/alloc paths would be much cleaner. I can't
>> apply the current patch, sorry, it's just too ugly to live.
> 
> Do you mean, you prefer the approach that was taken in v1 patch or
> something else?

I can no longer find v1 of the patch, just v2 and on. Can you send a
link to it?

-- 
Jens Axboe



[PATCH] blk-mq: Fix tagset reinit in the presence of cpu hot-unplug

2017-03-13 Thread Sagi Grimberg
In case cpu was unplugged, we need to make sure not to assume
that the tags for that cpu are still allocated. so check
for null tags when reinitializing a tagset.

Reported-by: Yi Zhang 
Signed-off-by: Sagi Grimberg 
---
 block/blk-mq-tag.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index e48bc2c72615..9d97bfc4d465 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -295,6 +295,9 @@ int blk_mq_reinit_tagset(struct blk_mq_tag_set *set)
for (i = 0; i < set->nr_hw_queues; i++) {
struct blk_mq_tags *tags = set->tags[i];
 
+   if (!tags)
+   continue;
+
for (j = 0; j < tags->nr_tags; j++) {
if (!tags->static_rqs[j])
continue;
-- 
2.7.4



Re: blk_init_queue vs blk_alloc_queue

2017-03-13 Thread Christoph Hellwig
On Mon, Mar 13, 2017 at 09:14:12AM +, Umesh Patel wrote:
> Hello,
> 
> I Would like to know difference between blk_init_queue and blk_alloc_queue 
> API.
> Can anyone explain when to use what ?

The short answer is that you should use neither.  The long answer is that
you should send a link to your driver source if you want anyone to help
you.


Re: [PATCH 06/16] mmc: core: replace waitqueue with worker

2017-03-13 Thread Adrian Hunter
On 11/03/17 00:05, Jens Axboe wrote:
> On 03/10/2017 07:21 AM, Adrian Hunter wrote:
>>> Essentially I take out that thread and replace it with this one worker
>>> introduced in this very patch. I agree the driver can block in many ways
>>> and that is why I need to have it running in process context, and this
>>> is what the worker introduced here provides.
>>
>> The last time I looked at the blk-mq I/O scheduler code, it pulled up to
>> qdepth requests from the I/O scheduler and left them on a local list while
>> running ->queue_rq().  That means blocking in ->queue_rq() leaves some
>> number of requests in limbo (not issued but also not in the I/O scheduler)
>> for that time.
> 
> Look again, if we're not handling the requeued dispatches, we pull one
> at a time from the scheduler.
> 

That's good :-)

Now the next thing ;-)

It looks like we either set BLK_MQ_F_BLOCKING and miss the possibility of
issuing synchronous requests immediately, or we don't set BLK_MQ_F_BLOCKING
in which case we are never allowed to sleep in ->queue_rq().  Is that true?



Re: [PATCH rfc 04/10] block: Add a non-selective polling interface

2017-03-13 Thread Sagi Grimberg



Quite some additional newlines and I'm not really fond of the
->poll_batch() name. It's a bit confusing with ->poll() and we also have
irq_poll(). But the only thing that would come to my mind is
complete_batch() which "races" with ->complete().


What about ->check_completions()? After all, that is what both ->poll()
and ->poll_batch() do, just with a different stop condition, no?
So it would also be easy to merge the two: both tag and batch are
unsigned int, which could be called "cookie", and add a flag to tell how
to interpret it (as a tag or a batch limit).
e.g. something like:

+typedef int (check_completions_fn)(struct blk_mq_hw_ctx *,
enum blk_mq_check_flags, /* flag (TAG or BATCH) */
unsigned int); /* Target tag or batch limit */



I'd rather not unite poll/poll_batch, but if this is something
that people want I can definitely do it.


Re: [PATCH rfc 06/10] IB/cq: Don't force IB_POLL_DIRECT poll context for ib_process_cq_direct

2017-03-13 Thread Sagi Grimberg



polling the completion queue directly does not interfere
with the existing polling logic, hence drop the requirement.

This can be used for polling mode ULPs.

Signed-off-by: Sagi Grimberg 
---
 drivers/infiniband/core/cq.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/infiniband/core/cq.c b/drivers/infiniband/core/cq.c
index 21d1a38af489..7f6ae0ecb0c5 100644
--- a/drivers/infiniband/core/cq.c
+++ b/drivers/infiniband/core/cq.c
@@ -64,8 +64,6 @@ static int __ib_process_cq(struct ib_cq *cq, int budget)
  */
 int ib_process_cq_direct(struct ib_cq *cq, int budget)
 {
-   WARN_ON_ONCE(cq->poll_ctx != IB_POLL_DIRECT);
-
return __ib_process_cq(cq, budget);
 }
 EXPORT_SYMBOL(ib_process_cq_direct);


Using ib_process_cq_direct() for queues that have another type than
IB_POLL_DIRECT may trigger concurrent CQ processing.


Correct.


Before this patch
the completions from each CQ were processed sequentially. That's a big
change so I think this should be mentioned clearly in the header above
ib_process_cq_direct().


Note that I now see that the cq lock is not sufficient for mutual
exclusion here because we're referencing cq->wc array outside of it.

There are three options I see here:
1. we'll need to allocate a different wc array for polling mode,
perhaps a smaller one?
2. Export __ib_process_cq (in some form) with an option to pass in a wc
array.
3. Simply not support non-selective polling but it seems like a shame...

Any thoughts?


Re: [PATCH rfc 04/10] block: Add a non-selective polling interface

2017-03-13 Thread Sagi Grimberg



+int blk_mq_poll_batch(struct request_queue *q, unsigned int batch)
+{
+   struct blk_mq_hw_ctx *hctx;
+
+   if (!q->mq_ops || !q->mq_ops->poll_batch)
+   return 0;
+
+   hctx = blk_mq_map_queue(q, smp_processor_id());
+   return q->mq_ops->poll_batch(hctx, batch);
+}
+EXPORT_SYMBOL_GPL(blk_mq_poll_batch);


A new exported function without any documentation? Wow.


I just copied blk_mq_poll export...


Please add a header
above this function that documents at least which other completion processing
code can execute concurrently with this function and from which contexts the
other completion processing code can be called (e.g. blk_mq_poll() and
.complete()).


I can do that, I'll document blk_mq_poll too..


Why return if (!q->mq_ops || !q->mq_ops->poll_batch)? Shouldn't that be a
WARN_ON_ONCE() instead? I think it is an error to call blk_mq_poll_batch()
against a queue that does not define .poll_batch().


Not really, we don't know whether the block driver actually supports
poll_batch (or poll, for that matter). Instead of conditioning at the
call site, we condition within the call.

It's not really a bug, it's harmless.


Additionally, I think making the hardware context an argument of this function
instead of using blk_mq_map_queue(q, smp_processor_id()) would make this
function much more versatile.


What do you mean? Remember that the caller's interface to the device is
a request queue; it doesn't even know if it's a blk-mq device. Can you
explain in more detail what you would like to see?