Re: [PATCH 2/2] blk-mq: simplify queue mapping & schedule with each possible CPU

2018-01-16 Thread Ming Lei
Hi Jianchao,

On Wed, Jan 17, 2018 at 01:24:23PM +0800, jianchao.wang wrote:
> Hi Ming
> 
> Thanks for your kind response.
> 
> On 01/17/2018 11:52 AM, Ming Lei wrote:
> >> It is here.
> >> __blk_mq_run_hw_queue()
> >> 
> >> WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
> >> cpu_online(hctx->next_cpu));
> > I think this warning is triggered after the CPU of hctx->next_cpu becomes
> > online again, and it should have been dealt with by the following trick,
> > and this patch is against the previous one, please test it and see if
> > the warning can be fixed.
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 23f0f3ddffcf..0620ccb65e4e 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -1452,6 +1452,9 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx 
> > *hctx)
> > tried = true;
> > goto select_cpu;
> > }
> > +
> > +   /* handle after this CPU of hctx->next_cpu becomes online again 
> > */
> > +   hctx->next_cpu_batch = 1;
> > return WORK_CPU_UNBOUND;
> > }
> > return hctx->next_cpu;
> > 
> 
> The WARNING still could be triggered.
> 
> [  282.194532] WARNING: CPU: 0 PID: 222 at 
> /home/will/u04/source_code/linux-block/block/blk-mq.c:1315 
> __blk_mq_run_hw_queue+0x92/0xa0
> [  282.194534] Modules linked in: 
> [  282.194561] CPU: 0 PID: 222 Comm: kworker/2:1H Tainted: GW
> 4.15.0-rc7+ #17
> 

This warning can't be removed completely; for example, the CPU figured
out by blk_mq_hctx_next_cpu(hctx) can be brought online again just after
the following call returns and before __blk_mq_run_hw_queue() is scheduled
to run.

kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
	&hctx->run_work, msecs_to_jiffies(msecs))

Just curious, how do you trigger this issue? Is it triggered in a CPU
hotplug stress test, or in a normal use case?
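
For illustration, here is a commented timeline of the window being
described; this is a hypothetical interleaving, not code from this thread:

/*
 * CPU A (queueing the work)                  CPU hotplug
 * ------------------------------------       --------------------------
 * cpu = blk_mq_hctx_next_cpu(hctx);
 *   -> hctx->next_cpu is offline, so
 *      WORK_CPU_UNBOUND is returned
 * kblockd_mod_delayed_work_on(cpu, ...);
 *                                             hctx->next_cpu comes online
 * __blk_mq_run_hw_queue() runs on some
 * CPU outside hctx->cpumask
 *   -> WARN_ON(!cpumask_test_cpu(...) &&
 *              cpu_online(hctx->next_cpu))
 *      fires even though nothing went wrong
 */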

> >> 
> >>
>> To eliminate this risk totally, we could have blk_mq_hctx_next_cpu() return
>> the cpu even if the cpu is offlined, and modify the cpu_online above to
>> cpu_active.
>> The kworkers of the per-cpu pool must have been migrated back when the cpu
>> is set active.
>> But there seem to be issues in DASD, as your previous comment noted.
> > Yes, we can't break DASD.
> > 
>> That is the original version of this patch, and both Christian and Stefan
>> reported that the system can't boot from DASD in this way[2], so I changed
>> it to AND with cpu_online_mask, and then their systems could boot well.
> >> On the other hand, there is also risk in 
> >>
> >> @@ -440,7 +440,7 @@ struct request *blk_mq_alloc_request_hctx(struct 
> >> request_queue *q,
> >>blk_queue_exit(q);
> >>return ERR_PTR(-EXDEV);
> >>}
> >> -  cpu = cpumask_first(alloc_data.hctx->cpumask);
> >> +  cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
> >>alloc_data.ctx = __blk_mq_get_ctx(q, cpu);
> >>
> >> what if the cpus in alloc_data.hctx->cpumask are all offlined ?
> > This one is crazy, and is used by NVMe only, it should be fine if
> > the passed 'hctx_idx' is retrieved by the current running CPU, such
> > as the way of blk_mq_map_queue(). But if not, bad thing may happen.
> Even if retrieved by current running cpu, the task still could be migrated 
> away when the cpu is offlined.
> 

I just checked the NVMe code; it looks like only nvmf_connect_io_queue()
passes one specific 'qid' and calls blk_mq_alloc_request_hctx(). For the
others, NVME_QID_ANY is passed, and blk_mq_alloc_request() is called in
nvme_alloc_request().

CC NVMe list and guys.

Hello James, Keith, Christoph and Sagi,

It looks like nvmf_connect_io_queue() is run in the following fragile way:

for (i = 1; i < ctrl->ctrl.queue_count; i++) {
...
nvmf_connect_io_queue(&ctrl->ctrl, i);
...
}

The hardware queue idx is passed to nvmf_connect_io_queue() directly,
and finally blk_mq_alloc_request_hctx() is called to allocate a request
for that queue, but all CPUs mapped to this hw queue may have become
offline by the time blk_mq_alloc_request_hctx() runs, so what is the
supposed way to handle this situation?

Is it fine to simply return failure in blk_mq_alloc_request_hctx()
in this situation? But the CPU may come online again later...
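
Purely as a hedged sketch of the "return failure" option (not a patch from
this thread), blk_mq_alloc_request_hctx() could bail out when no online CPU
is left in the hctx's cpumask, reusing the existing -EXDEV error path:

	/* illustration only: fail the allocation instead of picking a ctx
	 * on an offline CPU; the caller would have to retry later */
	cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
	if (cpu >= nr_cpu_ids) {
		blk_queue_exit(q);
		return ERR_PTR(-EXDEV);
	}
	alloc_data.ctx = __blk_mq_get_ctx(q, cpu);

Whether that is acceptable for nvmf_connect_io_queue() is exactly the open
question above, since the CPUs may come back online later.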

Thanks,
Ming


Re: [PATCH] blktrace: support enabling cgroup info per-device

2018-01-16 Thread Hou Tao
Hi Jens,

Any comments on this patch and the related patch set for blktrace [1] ?

Regards,
Tao

[1]: https://www.spinics.net/lists/linux-btrace/msg00790.html

On 2018/1/11 12:09, Hou Tao wrote:
> Now blktrace supports outputting cgroup info for trace actions and
> trace messages; however, it can only be enabled globally by writing
> "blk_cgroup" to the trace_options file, and there is no per-device API
> for the new functionality.
> 
> Add a new field (enable_cg_info), using the pad after act_mask in
> struct blk_user_trace_setup, and a new attr file (cgroup_info) under
> the /sys/block/$dev/trace dir, so the BLKTRACESETUP ioctl and the sysfs
> file can be used to enable cgroup info for selected block devices.
> 
> Signed-off-by: Hou Tao 
> ---
>  include/linux/blktrace_api.h  |  2 ++
>  include/uapi/linux/blktrace_api.h |  1 +
>  kernel/trace/blktrace.c   | 14 --
>  3 files changed, 15 insertions(+), 2 deletions(-)
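
As a hedged illustration, assuming this patch is applied, a minimal
userspace sketch of the extended BLKTRACESETUP usage might look as follows;
the device path and buffer sizes are placeholders and error handling is
trimmed:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>			/* BLKTRACESETUP/START/STOP/TEARDOWN */
#include <linux/blktrace_api.h>		/* struct blk_user_trace_setup */

int main(void)
{
	struct blk_user_trace_setup buts;
	int fd = open("/dev/sdd", O_RDONLY | O_NONBLOCK);	/* placeholder device */

	if (fd < 0)
		return 1;

	memset(&buts, 0, sizeof(buts));
	buts.act_mask = 0;		/* 0 means "all actions", see do_blk_trace_setup() */
	buts.buf_size = 512 * 1024;
	buts.buf_nr = 4;
	buts.enable_cg_info = 1;	/* new field added by this patch */

	if (ioctl(fd, BLKTRACESETUP, &buts) < 0 || ioctl(fd, BLKTRACESTART) < 0) {
		perror("blktrace setup");
		return 1;
	}
	printf("tracing %s with cgroup info enabled\n", buts.name);
	/* ... consume the relay files under debugfs, then tear down ... */
	ioctl(fd, BLKTRACESTOP);
	ioctl(fd, BLKTRACETEARDOWN);
	close(fd);
	return 0;
}

The same bit can also be flipped through the new cgroup_info attribute
under /sys/block/$dev/trace/.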
> 
> diff --git a/include/linux/blktrace_api.h b/include/linux/blktrace_api.h
> index 8804753..f120c6a 100644
> --- a/include/linux/blktrace_api.h
> +++ b/include/linux/blktrace_api.h
> @@ -18,6 +18,7 @@ struct blk_trace {
>   unsigned long __percpu *sequence;
>   unsigned char __percpu *msg_data;
>   u16 act_mask;
> + bool enable_cg_info;
>   u64 start_lba;
>   u64 end_lba;
>   u32 pid;
> @@ -102,6 +103,7 @@ static inline int blk_trace_init_sysfs(struct device *dev)
>  struct compat_blk_user_trace_setup {
>   char name[BLKTRACE_BDEV_SIZE];
>   u16 act_mask;
> + u8  enable_cg_info;
>   u32 buf_size;
>   u32 buf_nr;
>   compat_u64 start_lba;
> diff --git a/include/uapi/linux/blktrace_api.h 
> b/include/uapi/linux/blktrace_api.h
> index 20d1490d..d9d9fca 100644
> --- a/include/uapi/linux/blktrace_api.h
> +++ b/include/uapi/linux/blktrace_api.h
> @@ -136,6 +136,7 @@ enum {
>  struct blk_user_trace_setup {
>   char name[BLKTRACE_BDEV_SIZE];  /* output */
>   __u16 act_mask; /* input */
> + __u8  enable_cg_info;   /* input */
>   __u32 buf_size; /* input */
>   __u32 buf_nr;   /* input */
>   __u64 start_lba;
> diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
> index 987d9a9a..f420400 100644
> --- a/kernel/trace/blktrace.c
> +++ b/kernel/trace/blktrace.c
> @@ -180,7 +180,8 @@ void __trace_note_message(struct blk_trace *bt, struct 
> blkcg *blkcg,
>   n = vscnprintf(buf, BLK_TN_MAX_MSG, fmt, args);
>   va_end(args);
>  
> - if (!(blk_tracer_flags.val & TRACE_BLK_OPT_CGROUP))
> + if (!((blk_tracer_flags.val & TRACE_BLK_OPT_CGROUP) ||
> +   bt->enable_cg_info))
>   blkcg = NULL;
>  #ifdef CONFIG_BLK_CGROUP
>   trace_note(bt, 0, BLK_TN_MESSAGE, buf, n,
> @@ -549,6 +550,7 @@ static int do_blk_trace_setup(struct request_queue *q, 
> char *name, dev_t dev,
>   bt->act_mask = buts->act_mask;
>   if (!bt->act_mask)
>   bt->act_mask = (u16) -1;
> + bt->enable_cg_info = buts->enable_cg_info;
>  
>   blk_trace_setup_lba(bt, bdev);
>  
> @@ -625,6 +627,7 @@ static int compat_blk_trace_setup(struct request_queue 
> *q, char *name,
>  
>   buts = (struct blk_user_trace_setup) {
>   .act_mask = cbuts.act_mask,
> + .enable_cg_info = cbuts.enable_cg_info,
>   .buf_size = cbuts.buf_size,
>   .buf_nr = cbuts.buf_nr,
>   .start_lba = cbuts.start_lba,
> @@ -773,7 +776,8 @@ blk_trace_bio_get_cgid(struct request_queue *q, struct 
> bio *bio)
>  {
>   struct blk_trace *bt = q->blk_trace;
>  
> - if (!bt || !(blk_tracer_flags.val & TRACE_BLK_OPT_CGROUP))
> + if (!(bt && (bt->enable_cg_info ||
> +  (blk_tracer_flags.val & TRACE_BLK_OPT_CGROUP
>   return NULL;
>  
>   if (!bio->bi_css)
> @@ -1664,6 +1668,7 @@ static BLK_TRACE_DEVICE_ATTR(act_mask);
>  static BLK_TRACE_DEVICE_ATTR(pid);
>  static BLK_TRACE_DEVICE_ATTR(start_lba);
>  static BLK_TRACE_DEVICE_ATTR(end_lba);
> +static BLK_TRACE_DEVICE_ATTR(cgroup_info);
>  
>  static struct attribute *blk_trace_attrs[] = {
>   &dev_attr_enable.attr,
> @@ -1671,6 +1676,7 @@ static struct attribute *blk_trace_attrs[] = {
>   &dev_attr_pid.attr,
>   &dev_attr_start_lba.attr,
>   &dev_attr_end_lba.attr,
> + &dev_attr_cgroup_info.attr,
>   NULL
>  };
>  
> @@ -1794,6 +1800,8 @@ static ssize_t sysfs_blk_trace_attr_show(struct device 
> *dev,
>   ret = sprintf(buf, "%llu\n", q->blk_trace->start_lba);
>   else if (attr == &dev_attr_end_lba)
>   ret = sprintf(buf, "%llu\n", q->blk_trace->end_lba);
> + else if (attr == &dev_attr_cgroup_info)
> + ret = sprintf(buf, "%u\n", q->blk_trace->enable_cg_info);
>  
>  out_unlock_bdev:
>  mutex_unlock(&q->blk_trace_mutex);
> @@ -1861,6 +1869,8 @@ static ssize_t sysfs_blk_trace_attr_store(struct device 
> *dev,
>   q->blk_trace->start_lba = 

Re: how to enlarge value of max_sectors_kb

2018-01-16 Thread Martin K. Petersen

Tang,

> There is a machine with a very small max_sectors_kb:
> [root@ceph151 queue]# pwd
> /sys/block/sdd/queue
> [root@ceph151 queue]# cat max_hw_sectors_kb 
> 256
> [root@ceph151 queue]# cat max_sectors_kb 
> 256
>
> The performance is very low when I run big I/Os.
> I cannot modify it directly, but I need it to be a little bigger,
> so how can I enlarge it?

max_hw_sectors_kb is set based on the DMA transfer capabilities of the
SCSI controller.

max_sectors_kb is a soft limit for regular filesystem I/O. It cannot
exceed max_hw_sectors_kb since that would violate the hardware
constraints.

You need to figure out why your controller is limiting the transfer
size.
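
As a concrete illustration of the relationship (a hedged sketch; the device
path is a placeholder), any tool that raises max_sectors_kb has to clamp the
value to max_hw_sectors_kb, because the kernel rejects anything larger:

#include <stdio.h>

int main(void)
{
	const char *dir = "/sys/block/sdd/queue";	/* placeholder device */
	char path[256];
	unsigned int hw_kb = 0, want_kb = 1024;		/* desired soft limit */
	FILE *f;

	snprintf(path, sizeof(path), "%s/max_hw_sectors_kb", dir);
	f = fopen(path, "r");
	if (!f || fscanf(f, "%u", &hw_kb) != 1)
		return 1;
	fclose(f);

	if (want_kb > hw_kb)
		want_kb = hw_kb;	/* cannot exceed the controller's DMA limit */

	snprintf(path, sizeof(path), "%s/max_sectors_kb", dir);
	f = fopen(path, "w");
	if (!f || fprintf(f, "%u\n", want_kb) < 0)
		return 1;
	fclose(f);
	return 0;
}

In the case above both values are already 256, so the real fix has to come
from the controller side, as Martin points out.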

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH v2 06/12] bcache: set error_limit correctly

2018-01-16 Thread tang . junhui
From: Tang Junhui 

Hello Coly:

Then in bch_count_io_errors(), why do we still keep this code:
> 92 unsigned errors = atomic_add_return(1 << IO_ERROR_SHIFT,
> 93 &ca->io_errors);
> 94 errors >>= IO_ERROR_SHIFT;

Why not just modify the code as below:
> 92 unsigned errors = atomic_add_return(1,
> 93 &ca->io_errors);


>Struct cache uses io_errors for two purposes,
>- Error decay: when the cache set error_decay is set, io_errors is used to
>  generate a small delay when an I/O error happens.
>- I/O error counter: in order to generate a big enough value for error
>  decay, the I/O error counter value is stored left-shifted by 20 bits (a.k.a.
>  IO_ERROR_SHIFT).
>
>In function bch_count_io_errors(), if the I/O error counter reaches the cache
>set error limit, bch_cache_set_error() will be called to retire the whole cache
>set. But the current code is problematic when checking the error limit; see the
>following code piece from bch_count_io_errors(),
>
> 90 if (error) {
> 91 char buf[BDEVNAME_SIZE];
> 92 unsigned errors = atomic_add_return(1 << IO_ERROR_SHIFT,
> 93 &ca->io_errors);
> 94 errors >>= IO_ERROR_SHIFT;
> 95
> 96 if (errors < ca->set->error_limit)
> 97 pr_err("%s: IO error on %s, recovering",
> 98bdevname(ca->bdev, buf), m);
> 99 else
>100 bch_cache_set_error(ca->set,
>101 "%s: too many IO errors %s",
>102 bdevname(ca->bdev, buf), m);
>103 }
>
>At line 94, errors is right-shifted by IO_ERROR_SHIFT bits, so it is the real
>error counter that is compared at line 96. But ca->set->error_limit is
>initialized with an amplified value in bch_cache_set_alloc(),
>1545 c->error_limit  = 8 << IO_ERROR_SHIFT;
>
>It means that by default, in bch_count_io_errors(), bch_cache_set_error()
>won't be called to retire the problematic cache device until 8<<20 errors
>have happened. If the average request size is 64KB, bcache won't handle a
>failed device until 512GB of data has been requested. This is too large to be
>an I/O threshold, so I believe the correct error limit should be much less.
>
>This patch sets the default cache set error limit to 8; then in
>bch_count_io_errors(), when the error counter reaches 8 (if it is the default
>value), bch_cache_set_error() will be called to retire the whole
>cache set. This patch also removes the bit shifting when storing or showing
>the io_error_limit value via the sysfs interface.
>
>Nowadays most SSDs handle internal flash failures automatically by LBA
>address re-indirect mapping. If an I/O error can be observed by upper-layer
>code, it is a notable error, because that SSD can no longer re-indirect-map
>the problematic LBA address to an available flash block. This situation
>indicates the whole SSD will fail very soon. Therefore setting 8 as
>the default io error limit value makes sense; it is enough for most
>cache devices.
>
>Changelog:
>v2: add reviewed-by from Hannes.
>v1: initial version for review.
>
>Signed-off-by: Coly Li 
>Reviewed-by: Hannes Reinecke 
>Cc: Michael Lyle 
>Cc: Junhui Tang 
>---
> drivers/md/bcache/bcache.h | 1 +
> drivers/md/bcache/super.c  | 2 +-
> drivers/md/bcache/sysfs.c  | 4 ++--
> 3 files changed, 4 insertions(+), 3 deletions(-)
>
>diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
>index 88d938c8d027..7d7512fa4f09 100644
>--- a/drivers/md/bcache/bcache.h
>+++ b/drivers/md/bcache/bcache.h
>@@ -663,6 +663,7 @@ struct cache_set {
> ON_ERROR_UNREGISTER,
> ON_ERROR_PANIC,
> }on_error;
>+#define DEFAULT_IO_ERROR_LIMIT 8
> unsignederror_limit;
> unsignederror_decay;
> 
>diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
>index 6d888e8fea8c..a373648b5d4b 100644
>--- a/drivers/md/bcache/super.c
>+++ b/drivers/md/bcache/super.c
>@@ -1583,7 +1583,7 @@ struct cache_set *bch_cache_set_alloc(struct cache_sb 
>*sb)
> 
> c->congested_read_threshold_us= 2000;
> c->congested_write_threshold_us= 2;
>-c->error_limit= 8 << IO_ERROR_SHIFT;
>+c->error_limit= DEFAULT_IO_ERROR_LIMIT;
> 
> return c;
> err:
>diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
>index b7166c504cdb..ba62e987b503 100644
>--- a/drivers/md/bcache/sysfs.c
>+++ b/drivers/md/bcache/sysfs.c
>@@ -560,7 +560,7 @@ SHOW(__bch_cache_set)
> 
> /* See count_io_errors for why 88 */
> sysfs_print(io_error_halflife,c->error_decay * 88);
>-sysfs_print(io_error_limit,c->error_limit >> IO_ERROR_SHIFT);
>+sysfs_print(io_error_limit,c->error_limit);
> 
> sysfs_hprint(congested,
>  

Re: [PATCH 2/2] blk-mq: simplify queue mapping & schedule with each possible CPU

2018-01-16 Thread jianchao.wang
Hi Ming

Thanks for your kind response.

On 01/17/2018 11:52 AM, Ming Lei wrote:
>> It is here.
>> __blk_mq_run_hw_queue()
>> 
>> WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
>> cpu_online(hctx->next_cpu));
> I think this warning is triggered after the CPU of hctx->next_cpu becomes
> online again, and it should have been dealt with by the following trick,
> and this patch is against the previous one, please test it and see if
> the warning can be fixed.
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 23f0f3ddffcf..0620ccb65e4e 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1452,6 +1452,9 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx 
> *hctx)
>   tried = true;
>   goto select_cpu;
>   }
> +
> + /* handle after this CPU of hctx->next_cpu becomes online again 
> */
> + hctx->next_cpu_batch = 1;
>   return WORK_CPU_UNBOUND;
>   }
>   return hctx->next_cpu;
> 

The WARNING still could be triggered.

[  282.194532] WARNING: CPU: 0 PID: 222 at 
/home/will/u04/source_code/linux-block/block/blk-mq.c:1315 
__blk_mq_run_hw_queue+0x92/0xa0
[  282.194534] Modules linked in: 
[  282.194561] CPU: 0 PID: 222 Comm: kworker/2:1H Tainted: GW
4.15.0-rc7+ #17
[  282.194562] Hardware name: LENOVO 10MLS0E339/3106, BIOS M1AKT22A 06/27/2017
[  282.194563] Workqueue: kblockd blk_mq_run_work_fn
[  282.194565] RIP: 0010:__blk_mq_run_hw_queue+0x92/0xa0
[  282.194565] RSP: 0018:96c50223be60 EFLAGS: 00010202
[  282.194566] RAX: 0001 RBX: 8a7110233800 RCX: 
[  282.194567] RDX: 8a711255f2d0 RSI:  RDI: 8a7110233800
[  282.194567] RBP: 8a7122ca2140 R08:  R09: 
[  282.194568] R10: 0263 R11:  R12: 8a7122ca8200
[  282.194568] R13:  R14: 08a7122ca820 R15: 8a70bcef0780
[  282.194569] FS:  () GS:8a7122c0() 
knlGS:
[  282.194569] CS:  0010 DS:  ES:  CR0: 80050033
[  282.194570] CR2: 555e14d89d60 CR3: 3c809002 CR4: 003606f0
[  282.194571] DR0:  DR1:  DR2: 
[  282.194571] DR3:  DR6: fffe0ff0 DR7: 0400
[  282.194571] Call Trace:
[  282.194576]  process_one_work+0x14b/0x420
[  282.194577]  worker_thread+0x47/0x420
[  282.194579]  kthread+0xf5/0x130
[  282.194581]  ? process_one_work+0x420/0x420
[  282.194581]  ? kthread_associate_blkcg+0xe0/0xe0
[  282.194583]  ret_from_fork+0x24/0x30
[  282.194585] Code: ff 48 89 df e8 e0 6b 00 00 8b 74 24 04 48 89 df e8 a4 fc 
ff ff 48 8b 44 24 08 65 48 33 04 25 28 00 00 00 75 0e 48 83 c4 10 5b c3 <0f> ff 
eb b7 0f ff eb c1 e8 e1 02 c6 ff 90 0f 1f 44 00 00 48 8b 
[  282.194601] ---[ end trace c104f0a63d3df31b ]---

>> 
>>
>> To eliminate this risk totally, we could have blk_mq_hctx_next_cpu() return the
>> cpu even if the cpu is offlined, and modify the cpu_online above to cpu_active.
>> The kworkers of the per-cpu pool must have been migrated back when the cpu is
>> set active.
>> But there seem to be issues in DASD, as your previous comment noted.
> Yes, we can't break DASD.
> 
>> That is the original version of this patch, and both Christian and Stefan
>> reported that the system can't boot from DASD in this way[2], so I changed
>> it to AND with cpu_online_mask, and then their systems could boot well.
>> On the other hand, there is also risk in 
>>
>> @@ -440,7 +440,7 @@ struct request *blk_mq_alloc_request_hctx(struct 
>> request_queue *q,
>>  blk_queue_exit(q);
>>  return ERR_PTR(-EXDEV);
>>  }
>> -cpu = cpumask_first(alloc_data.hctx->cpumask);
>> +cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
>>  alloc_data.ctx = __blk_mq_get_ctx(q, cpu);
>>
>> what if the cpus in alloc_data.hctx->cpumask are all offlined ?
> This one is crazy, and is used by NVMe only, it should be fine if
> the passed 'hctx_idx' is retrieved by the current running CPU, such
> as the way of blk_mq_map_queue(). But if not, bad thing may happen.
Even if retrieved by current running cpu, the task still could be migrated away 
when the cpu is offlined.

Thanks
Jianchao




Re: [PATCH] block: Fix __bio_integrity_endio() documentation

2018-01-16 Thread Martin K. Petersen

Bart,

> Fixes: 4246a0b63bd8 ("block: add a bi_error field to struct bio")
> Signed-off-by: Bart Van Assche 

Looks fine.

Reviewed-by: Martin K. Petersen 

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [for-4.16 PATCH v5 2/3] blk-mq: improve DM's blk-mq IO merging performance

2018-01-16 Thread Mike Snitzer
This one is redundant and should be dropped.  I ran git-format-patch
twice after a quick rebase to tweak the subject and header.

Sorry for the confusion.

On Tue, Jan 16 2018 at 11:33pm -0500,
Mike Snitzer  wrote:

> From: Ming Lei 
> 
> blk_insert_cloned_request() is called in the fast path of a dm-rq driver
> (e.g. blk-mq request-based DM mpath).  blk_insert_cloned_request() uses
> blk_mq_request_bypass_insert() to directly append the request to the
> blk-mq hctx->dispatch_list of the underlying queue.
> 
> 1) This way isn't efficient enough because the hctx spinlock is always
> used.
> 
> 2) With blk_insert_cloned_request(), we completely bypass underlying
> queue's elevator and depend on the upper-level dm-rq driver's elevator
> to schedule IO.  But dm-rq currently can't get the underlying queue's
> dispatch feedback at all.  Without knowing whether a request was issued
> or not (e.g. due to underlying queue being busy) the dm-rq elevator will
> not be able to provide effective IO merging (as a side-effect of dm-rq
> currently blindly destaging a request from its elevator only to requeue
> it after a delay, which kills any opportunity for merging).  This
> obviously causes very bad sequential IO performance.
> 
> Fix this by updating blk_insert_cloned_request() to use
> blk_mq_request_direct_issue().  blk_mq_request_direct_issue() allows a
> request to be issued directly to the underlying queue and returns the
> dispatch feedback (blk_status_t).  If blk_mq_request_direct_issue()
> returns BLK_STS_RESOURCE, the dm-rq driver will now use DM_MAPIO_REQUEUE
> to _not_ destage the request, thereby preserving the opportunity to
> merge IO.
> 
> With this, request-based DM's blk-mq sequential IO performance is vastly
> improved (as much as 3X in mpath/virtio-scsi testing).
> 
> Signed-off-by: Ming Lei 
> [based _heavily_ on Ming Lei's initial solution, but blk-mq.c changes
> were refactored to make them less fragile and easier to read/review]
> Signed-off-by: Mike Snitzer 
> ---
>  block/blk-core.c   |  3 +--
>  block/blk-mq.c | 42 +-
>  block/blk-mq.h |  3 +++
>  drivers/md/dm-rq.c | 19 ---
>  4 files changed, 53 insertions(+), 14 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 7ba607527487..55f338020254 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2500,8 +2500,7 @@ blk_status_t blk_insert_cloned_request(struct 
> request_queue *q, struct request *
>* bypass a potential scheduler on the bottom device for
>* insert.
>*/
> - blk_mq_request_bypass_insert(rq, true);
> - return BLK_STS_OK;
> + return blk_mq_request_direct_issue(rq);
>   }
>  
>   spin_lock_irqsave(q->queue_lock, flags);
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index f30e34a22a6c..81ee3f9124dc 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1706,7 +1706,8 @@ static blk_status_t __blk_mq_issue_directly(struct 
> blk_mq_hw_ctx *hctx,
>   blk_qc_t new_cookie;
>   blk_status_t ret;
>  
> - new_cookie = request_to_qc_t(hctx, rq);
> + if (cookie)
> + new_cookie = request_to_qc_t(hctx, rq);
>  
>   /*
>* For OK queue, we are done. For error, caller may kill it.
> @@ -1716,13 +1717,15 @@ static blk_status_t __blk_mq_issue_directly(struct 
> blk_mq_hw_ctx *hctx,
>   ret = q->mq_ops->queue_rq(hctx, );
>   switch (ret) {
>   case BLK_STS_OK:
> - *cookie = new_cookie;
> + if (cookie)
> + *cookie = new_cookie;
>   break;
>   case BLK_STS_RESOURCE:
>   __blk_mq_requeue_request(rq);
>   break;
>   default:
> - *cookie = BLK_QC_T_NONE;
> + if (cookie)
> + *cookie = BLK_QC_T_NONE;
>   break;
>   }
>  
> @@ -1731,15 +1734,20 @@ static blk_status_t __blk_mq_issue_directly(struct 
> blk_mq_hw_ctx *hctx,
>  
>  static void __blk_mq_fallback_to_insert(struct blk_mq_hw_ctx *hctx,
>   struct request *rq,
> - bool run_queue)
> + bool run_queue, bool bypass_insert)
>  {
> + if (bypass_insert) {
> + blk_mq_request_bypass_insert(rq, run_queue);
> + return;
> + }
>   blk_mq_sched_insert_request(rq, false, run_queue, false,
>   hctx->flags & BLK_MQ_F_BLOCKING);
>  }
>  
>  static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
>   struct request *rq,
> - blk_qc_t *cookie)
> + blk_qc_t *cookie,
> + bool 

[for-4.16 PATCH v5 2/3] blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback

2018-01-16 Thread Mike Snitzer
From: Ming Lei 

blk_insert_cloned_request() is called in the fast path of a dm-rq driver
(e.g. blk-mq request-based DM mpath).  blk_insert_cloned_request() uses
blk_mq_request_bypass_insert() to directly append the request to the
blk-mq hctx->dispatch_list of the underlying queue.

1) This way isn't efficient enough because the hctx spinlock is always
used.

2) With blk_insert_cloned_request(), we completely bypass underlying
queue's elevator and depend on the upper-level dm-rq driver's elevator
to schedule IO.  But dm-rq currently can't get the underlying queue's
dispatch feedback at all.  Without knowing whether a request was issued
or not (e.g. due to underlying queue being busy) the dm-rq elevator will
not be able to provide effective IO merging (as a side-effect of dm-rq
currently blindly destaging a request from its elevator only to requeue
it after a delay, which kills any opportunity for merging).  This
obviously causes very bad sequential IO performance.

Fix this by updating blk_insert_cloned_request() to use
blk_mq_request_direct_issue().  blk_mq_request_direct_issue() allows a
request to be issued directly to the underlying queue and returns the
dispatch feedback (blk_status_t).  If blk_mq_request_direct_issue()
returns BLK_STS_RESOURCE, the dm-rq driver will now use DM_MAPIO_REQUEUE
to _not_ destage the request, thereby preserving the opportunity to
merge IO.

With this, request-based DM's blk-mq sequential IO performance is vastly
improved (as much as 3X in mpath/virtio-scsi testing).

Signed-off-by: Ming Lei 
[blk-mq.c changes heavily influenced by Ming Lei's initial solution, but
they were refactored to make them less fragile and easier to read/review]
Signed-off-by: Mike Snitzer 
---
 block/blk-core.c   |  3 +--
 block/blk-mq.c | 42 +-
 block/blk-mq.h |  3 +++
 drivers/md/dm-rq.c | 19 ---
 4 files changed, 53 insertions(+), 14 deletions(-)
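
Since the dm-rq.c hunk is truncated below, here is a hedged sketch of the
consumer side described in the commit message; the helper name
dm_issue_clone() is a placeholder, while blk_insert_cloned_request(),
dm_complete_request() and the DM_MAPIO_* values are the interfaces the
commit message refers to:

/* Sketch only -- not the actual dm-rq.c change from this patch. */
static int dm_issue_clone(struct request *clone, struct request *rq)
{
	blk_status_t ret;

	/* with this patch, the return value carries the dispatch feedback */
	ret = blk_insert_cloned_request(clone->q, clone);
	switch (ret) {
	case BLK_STS_OK:
		return DM_MAPIO_REMAPPED;
	case BLK_STS_RESOURCE:
		/* underlying queue is busy: keep the original request in
		 * dm-rq's elevator so it can still be merged with later IO */
		return DM_MAPIO_REQUEUE;
	default:
		/* hard error: complete the original request now */
		dm_complete_request(rq, ret);
		return DM_MAPIO_KILL;
	}
}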

diff --git a/block/blk-core.c b/block/blk-core.c
index 7ba607527487..55f338020254 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2500,8 +2500,7 @@ blk_status_t blk_insert_cloned_request(struct 
request_queue *q, struct request *
 * bypass a potential scheduler on the bottom device for
 * insert.
 */
-   blk_mq_request_bypass_insert(rq, true);
-   return BLK_STS_OK;
+   return blk_mq_request_direct_issue(rq);
}
 
spin_lock_irqsave(q->queue_lock, flags);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c117c2baf2c9..0b64f7210a89 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1706,7 +1706,8 @@ static blk_status_t __blk_mq_issue_directly(struct 
blk_mq_hw_ctx *hctx,
blk_qc_t new_cookie;
blk_status_t ret;
 
-   new_cookie = request_to_qc_t(hctx, rq);
+   if (cookie)
+   new_cookie = request_to_qc_t(hctx, rq);
 
/*
 * For OK queue, we are done. For error, caller may kill it.
@@ -1716,13 +1717,15 @@ static blk_status_t __blk_mq_issue_directly(struct 
blk_mq_hw_ctx *hctx,
ret = q->mq_ops->queue_rq(hctx, );
switch (ret) {
case BLK_STS_OK:
-   *cookie = new_cookie;
+   if (cookie)
+   *cookie = new_cookie;
break;
case BLK_STS_RESOURCE:
__blk_mq_requeue_request(rq);
break;
default:
-   *cookie = BLK_QC_T_NONE;
+   if (cookie)
+   *cookie = BLK_QC_T_NONE;
break;
}
 
@@ -1731,15 +1734,20 @@ static blk_status_t __blk_mq_issue_directly(struct 
blk_mq_hw_ctx *hctx,
 
 static void __blk_mq_fallback_to_insert(struct blk_mq_hw_ctx *hctx,
struct request *rq,
-   bool run_queue)
+   bool run_queue, bool bypass_insert)
 {
+   if (bypass_insert) {
+   blk_mq_request_bypass_insert(rq, run_queue);
+   return;
+   }
blk_mq_sched_insert_request(rq, false, run_queue, false,
hctx->flags & BLK_MQ_F_BLOCKING);
 }
 
 static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
struct request *rq,
-   blk_qc_t *cookie)
+   blk_qc_t *cookie,
+   bool bypass_insert)
 {
struct request_queue *q = rq->q;
bool run_queue = true;
@@ -1750,7 +1758,7 @@ static blk_status_t __blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
goto insert;
}
 
-   if (q->elevator)
+   if (q->elevator && !bypass_insert)
goto insert;
 
if (!blk_mq_get_driver_tag(rq, NULL, 

[for-4.16 PATCH v5 3/3] blk-mq-sched: remove unused 'can_block' arg from blk_mq_sched_insert_request

2018-01-16 Thread Mike Snitzer
Signed-off-by: Mike Snitzer 
---
 block/blk-exec.c |  2 +-
 block/blk-mq-sched.c |  2 +-
 block/blk-mq-sched.h |  2 +-
 block/blk-mq.c   | 16 +++-
 4 files changed, 10 insertions(+), 12 deletions(-)

diff --git a/block/blk-exec.c b/block/blk-exec.c
index 5c0f3dc446dc..f7b292f12449 100644
--- a/block/blk-exec.c
+++ b/block/blk-exec.c
@@ -61,7 +61,7 @@ void blk_execute_rq_nowait(struct request_queue *q, struct 
gendisk *bd_disk,
 * be reused after dying flag is set
 */
if (q->mq_ops) {
-   blk_mq_sched_insert_request(rq, at_head, true, false, false);
+   blk_mq_sched_insert_request(rq, at_head, true, false);
return;
}
 
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 2ff7cf0cbf73..55c0a745b427 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -427,7 +427,7 @@ void blk_mq_sched_restart(struct blk_mq_hw_ctx *const hctx)
 }
 
 void blk_mq_sched_insert_request(struct request *rq, bool at_head,
-bool run_queue, bool async, bool can_block)
+bool run_queue, bool async)
 {
struct request_queue *q = rq->q;
struct elevator_queue *e = q->elevator;
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index ba1d1418a96d..1e9c9018ace1 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -18,7 +18,7 @@ bool blk_mq_sched_try_insert_merge(struct request_queue *q, 
struct request *rq);
 void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx);
 
 void blk_mq_sched_insert_request(struct request *rq, bool at_head,
-bool run_queue, bool async, bool can_block);
+bool run_queue, bool async);
 void blk_mq_sched_insert_requests(struct request_queue *q,
  struct blk_mq_ctx *ctx,
  struct list_head *list, bool run_queue_async);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 0b64f7210a89..06ef6a7cc29c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -745,13 +745,13 @@ static void blk_mq_requeue_work(struct work_struct *work)
 
rq->rq_flags &= ~RQF_SOFTBARRIER;
list_del_init(>queuelist);
-   blk_mq_sched_insert_request(rq, true, false, false, true);
+   blk_mq_sched_insert_request(rq, true, false, false);
}
 
while (!list_empty(_list)) {
rq = list_entry(rq_list.next, struct request, queuelist);
list_del_init(>queuelist);
-   blk_mq_sched_insert_request(rq, false, false, false, true);
+   blk_mq_sched_insert_request(rq, false, false, false);
}
 
blk_mq_run_hw_queues(q, false);
@@ -1732,16 +1732,14 @@ static blk_status_t __blk_mq_issue_directly(struct 
blk_mq_hw_ctx *hctx,
return ret;
 }
 
-static void __blk_mq_fallback_to_insert(struct blk_mq_hw_ctx *hctx,
-   struct request *rq,
+static void __blk_mq_fallback_to_insert(struct request *rq,
bool run_queue, bool bypass_insert)
 {
if (bypass_insert) {
blk_mq_request_bypass_insert(rq, run_queue);
return;
}
-   blk_mq_sched_insert_request(rq, false, run_queue, false,
-   hctx->flags & BLK_MQ_F_BLOCKING);
+   blk_mq_sched_insert_request(rq, false, run_queue, false);
 }
 
 static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
@@ -1771,7 +1769,7 @@ static blk_status_t __blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
 
return __blk_mq_issue_directly(hctx, rq, cookie);
 insert:
-   __blk_mq_fallback_to_insert(hctx, rq, run_queue, bypass_insert);
+   __blk_mq_fallback_to_insert(rq, run_queue, bypass_insert);
if (bypass_insert)
return BLK_STS_RESOURCE;
 
@@ -1790,7 +1788,7 @@ static void blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
 
ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false);
if (ret == BLK_STS_RESOURCE)
-   __blk_mq_fallback_to_insert(hctx, rq, true, false);
+   __blk_mq_fallback_to_insert(rq, true, false);
else if (ret != BLK_STS_OK)
blk_mq_end_request(rq, ret);
 
@@ -1919,7 +1917,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue 
*q, struct bio *bio)
} else if (q->elevator) {
blk_mq_put_ctx(data.ctx);
blk_mq_bio_to_request(rq, bio);
-   blk_mq_sched_insert_request(rq, false, true, true, true);
+   blk_mq_sched_insert_request(rq, false, true, true);
} else {
blk_mq_put_ctx(data.ctx);
blk_mq_bio_to_request(rq, bio);
-- 
2.15.0



[for-4.16 PATCH v5 2/3] blk-mq: improve DM's blk-mq IO merging performance

2018-01-16 Thread Mike Snitzer
From: Ming Lei 

blk_insert_cloned_request() is called in the fast path of a dm-rq driver
(e.g. blk-mq request-based DM mpath).  blk_insert_cloned_request() uses
blk_mq_request_bypass_insert() to directly append the request to the
blk-mq hctx->dispatch_list of the underlying queue.

1) This way isn't efficient enough because the hctx spinlock is always
used.

2) With blk_insert_cloned_request(), we completely bypass underlying
queue's elevator and depend on the upper-level dm-rq driver's elevator
to schedule IO.  But dm-rq currently can't get the underlying queue's
dispatch feedback at all.  Without knowing whether a request was issued
or not (e.g. due to underlying queue being busy) the dm-rq elevator will
not be able to provide effective IO merging (as a side-effect of dm-rq
currently blindly destaging a request from its elevator only to requeue
it after a delay, which kills any opportunity for merging).  This
obviously causes very bad sequential IO performance.

Fix this by updating blk_insert_cloned_request() to use
blk_mq_request_direct_issue().  blk_mq_request_direct_issue() allows a
request to be issued directly to the underlying queue and returns the
dispatch feedback (blk_status_t).  If blk_mq_request_direct_issue()
returns BLK_STS_RESOURCE, the dm-rq driver will now use DM_MAPIO_REQUEUE
to _not_ destage the request, thereby preserving the opportunity to
merge IO.

With this, request-based DM's blk-mq sequential IO performance is vastly
improved (as much as 3X in mpath/virtio-scsi testing).

Signed-off-by: Ming Lei 
[based _heavily_ on Ming Lei's initial solution, but blk-mq.c changes
were refactored to make them less fragile and easier to read/review]
Signed-off-by: Mike Snitzer 
---
 block/blk-core.c   |  3 +--
 block/blk-mq.c | 42 +-
 block/blk-mq.h |  3 +++
 drivers/md/dm-rq.c | 19 ---
 4 files changed, 53 insertions(+), 14 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 7ba607527487..55f338020254 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2500,8 +2500,7 @@ blk_status_t blk_insert_cloned_request(struct 
request_queue *q, struct request *
 * bypass a potential scheduler on the bottom device for
 * insert.
 */
-   blk_mq_request_bypass_insert(rq, true);
-   return BLK_STS_OK;
+   return blk_mq_request_direct_issue(rq);
}
 
spin_lock_irqsave(q->queue_lock, flags);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index f30e34a22a6c..81ee3f9124dc 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1706,7 +1706,8 @@ static blk_status_t __blk_mq_issue_directly(struct 
blk_mq_hw_ctx *hctx,
blk_qc_t new_cookie;
blk_status_t ret;
 
-   new_cookie = request_to_qc_t(hctx, rq);
+   if (cookie)
+   new_cookie = request_to_qc_t(hctx, rq);
 
/*
 * For OK queue, we are done. For error, caller may kill it.
@@ -1716,13 +1717,15 @@ static blk_status_t __blk_mq_issue_directly(struct 
blk_mq_hw_ctx *hctx,
ret = q->mq_ops->queue_rq(hctx, );
switch (ret) {
case BLK_STS_OK:
-   *cookie = new_cookie;
+   if (cookie)
+   *cookie = new_cookie;
break;
case BLK_STS_RESOURCE:
__blk_mq_requeue_request(rq);
break;
default:
-   *cookie = BLK_QC_T_NONE;
+   if (cookie)
+   *cookie = BLK_QC_T_NONE;
break;
}
 
@@ -1731,15 +1734,20 @@ static blk_status_t __blk_mq_issue_directly(struct 
blk_mq_hw_ctx *hctx,
 
 static void __blk_mq_fallback_to_insert(struct blk_mq_hw_ctx *hctx,
struct request *rq,
-   bool run_queue)
+   bool run_queue, bool bypass_insert)
 {
+   if (bypass_insert) {
+   blk_mq_request_bypass_insert(rq, run_queue);
+   return;
+   }
blk_mq_sched_insert_request(rq, false, run_queue, false,
hctx->flags & BLK_MQ_F_BLOCKING);
 }
 
 static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
struct request *rq,
-   blk_qc_t *cookie)
+   blk_qc_t *cookie,
+   bool bypass_insert)
 {
struct request_queue *q = rq->q;
bool run_queue = true;
@@ -1750,7 +1758,7 @@ static blk_status_t __blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
goto insert;
}
 
-   if (q->elevator)
+   if (q->elevator && !bypass_insert)
goto insert;
 
if (!blk_mq_get_driver_tag(rq, NULL, false))

[for-4.16 PATCH v5 1/3] blk-mq: factor out a few helpers from __blk_mq_try_issue_directly

2018-01-16 Thread Mike Snitzer
No functional change.  Just makes code flow more logically.

In following commit, __blk_mq_try_issue_directly() will be used to
return the dispatch result (blk_status_t) to DM.  DM needs this
information to improve IO merging.

Signed-off-by: Mike Snitzer 
---
 block/blk-mq.c | 79 ++
 1 file changed, 52 insertions(+), 27 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index c8f62e6be6b6..c117c2baf2c9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1694,9 +1694,9 @@ static blk_qc_t request_to_qc_t(struct blk_mq_hw_ctx 
*hctx, struct request *rq)
return blk_tag_to_qc_t(rq->internal_tag, hctx->queue_num, true);
 }
 
-static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
-   struct request *rq,
-   blk_qc_t *cookie)
+static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
+   struct request *rq,
+   blk_qc_t *cookie)
 {
struct request_queue *q = rq->q;
struct blk_mq_queue_data bd = {
@@ -1705,6 +1705,43 @@ static void __blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
};
blk_qc_t new_cookie;
blk_status_t ret;
+
+   new_cookie = request_to_qc_t(hctx, rq);
+
+   /*
+* For OK queue, we are done. For error, caller may kill it.
+* Any other error (busy), just add it to our list as we
+* previously would have done.
+*/
+   ret = q->mq_ops->queue_rq(hctx, );
+   switch (ret) {
+   case BLK_STS_OK:
+   *cookie = new_cookie;
+   break;
+   case BLK_STS_RESOURCE:
+   __blk_mq_requeue_request(rq);
+   break;
+   default:
+   *cookie = BLK_QC_T_NONE;
+   break;
+   }
+
+   return ret;
+}
+
+static void __blk_mq_fallback_to_insert(struct blk_mq_hw_ctx *hctx,
+   struct request *rq,
+   bool run_queue)
+{
+   blk_mq_sched_insert_request(rq, false, run_queue, false,
+   hctx->flags & BLK_MQ_F_BLOCKING);
+}
+
+static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
+   struct request *rq,
+   blk_qc_t *cookie)
+{
+   struct request_queue *q = rq->q;
bool run_queue = true;
 
/* RCU or SRCU read lock is needed before checking quiesced flag */
@@ -1724,41 +1761,29 @@ static void __blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
goto insert;
}
 
-   new_cookie = request_to_qc_t(hctx, rq);
-
-   /*
-* For OK queue, we are done. For error, kill it. Any other
-* error (busy), just add it to our list as we previously
-* would have done
-*/
-   ret = q->mq_ops->queue_rq(hctx, );
-   switch (ret) {
-   case BLK_STS_OK:
-   *cookie = new_cookie;
-   return;
-   case BLK_STS_RESOURCE:
-   __blk_mq_requeue_request(rq);
-   goto insert;
-   default:
-   *cookie = BLK_QC_T_NONE;
-   blk_mq_end_request(rq, ret);
-   return;
-   }
-
+   return __blk_mq_issue_directly(hctx, rq, cookie);
 insert:
-   blk_mq_sched_insert_request(rq, false, run_queue, false,
-   hctx->flags & BLK_MQ_F_BLOCKING);
+   __blk_mq_fallback_to_insert(hctx, rq, run_queue);
+
+   return BLK_STS_OK;
 }
 
 static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
struct request *rq, blk_qc_t *cookie)
 {
+   blk_status_t ret;
int srcu_idx;
 
might_sleep_if(hctx->flags & BLK_MQ_F_BLOCKING);
 
hctx_lock(hctx, _idx);
-   __blk_mq_try_issue_directly(hctx, rq, cookie);
+
+   ret = __blk_mq_try_issue_directly(hctx, rq, cookie);
+   if (ret == BLK_STS_RESOURCE)
+   __blk_mq_fallback_to_insert(hctx, rq, true);
+   else if (ret != BLK_STS_OK)
+   blk_mq_end_request(rq, ret);
+
hctx_unlock(hctx, srcu_idx);
 }
 
-- 
2.15.0



[for-4.16 PATCH v5 0/3] blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback

2018-01-16 Thread Mike Snitzer
Hi Jens,

I spent a decent amount of time going over this and am happy with it.
Hopefully you'll be too.

Thanks,
Mike

Mike Snitzer (2):
  blk-mq: factor out a few helpers from __blk_mq_try_issue_directly
  blk-mq-sched: remove unused 'can_block' arg from blk_mq_sched_insert_request

Ming Lei (1):
  blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback

 block/blk-core.c |   3 +-
 block/blk-exec.c |   2 +-
 block/blk-mq-sched.c |   2 +-
 block/blk-mq-sched.h |   2 +-
 block/blk-mq.c   | 109 ---
 block/blk-mq.h   |   3 ++
 drivers/md/dm-rq.c   |  19 +++--
 7 files changed, 101 insertions(+), 39 deletions(-)

-- 
2.15.0



Re: [PATCH 2/2] blk-mq: simplify queue mapping & schedule with each possible CPU

2018-01-16 Thread Ming Lei
Hi Jianchao,

On Wed, Jan 17, 2018 at 10:56:13AM +0800, jianchao.wang wrote:
> Hi Ming
> 
> Thanks for your patch and kind response.

You are welcome!

> 
> On 01/16/2018 11:32 PM, Ming Lei wrote:
> > OK, I got it, and it should have been the only corner case in which
> > all CPUs mapped to this hctx become offline, and I believe the following
> > patch should address this case; could you give it a test?
> > 
> > ---
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index c376d1b6309a..23f0f3ddffcf 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -1416,21 +1416,44 @@ static void __blk_mq_run_hw_queue(struct 
> > blk_mq_hw_ctx *hctx)
> >   */
> >  static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
> >  {
> > +   bool tried = false;
> > +
> > if (hctx->queue->nr_hw_queues == 1)
> > return WORK_CPU_UNBOUND;
> >  
> > if (--hctx->next_cpu_batch <= 0) {
> > int next_cpu;
> > +select_cpu:
> >  
> > next_cpu = cpumask_next_and(hctx->next_cpu, hctx->cpumask,
> > cpu_online_mask);
> > if (next_cpu >= nr_cpu_ids)
> > next_cpu = 
> > cpumask_first_and(hctx->cpumask,cpu_online_mask);
> >  
> > -   hctx->next_cpu = next_cpu;
> > +   /*
> > +* No online CPU can be found here when running from
> > +* blk_mq_hctx_notify_dead(), so make sure hctx->next_cpu
> > +* is set correctly.
> > +*/
> > +   if (next_cpu >= nr_cpu_ids)
> > +   hctx->next_cpu = cpumask_first_and(hctx->cpumask,
> > +   cpu_possible_mask);
> > +   else
> > +   hctx->next_cpu = next_cpu;
> > hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
> > }
> >  
> > +   /*
> > +* Do unbound schedule if we can't find a online CPU for this hctx,
> > +* and it should happen only if hctx->next_cpu is becoming DEAD.
> > +*/
> > +   if (!cpu_online(hctx->next_cpu)) {
> > +   if (!tried) {
> > +   tried = true;
> > +   goto select_cpu;
> > +   }
> > +   return WORK_CPU_UNBOUND;
> > +   }
> > return hctx->next_cpu;
> >  }
> 
> I have tested this patch. The panic was gone, but I got the following:
> 
> [  231.674464] WARNING: CPU: 0 PID: 263 at 
> /home/will/u04/source_code/linux-block/block/blk-mq.c:1315 
> __blk_mq_run_hw_queue+0x92/0xa0
>

..

> It is here.
> __blk_mq_run_hw_queue()
> 
> WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
> cpu_online(hctx->next_cpu));

I think this warning is triggered after the CPU of hctx->next_cpu becomes
online again, and it should have been dealt with by the following trick,
and this patch is against the previous one, please test it and see if
the warning can be fixed.

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 23f0f3ddffcf..0620ccb65e4e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1452,6 +1452,9 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx 
*hctx)
tried = true;
goto select_cpu;
}
+
+   /* handle after this CPU of hctx->next_cpu becomes online again 
*/
+   hctx->next_cpu_batch = 1;
return WORK_CPU_UNBOUND;
}
return hctx->next_cpu;

> 
> 
> To eliminate this risk totally, we could have blk_mq_hctx_next_cpu() return the
> cpu even if the cpu is offlined, and modify the cpu_online above to cpu_active.
> The kworkers of the per-cpu pool must have been migrated back when the cpu is
> set active.
> But there seem to be issues in DASD, as your previous comment noted.

Yes, we can't break DASD.

> 
> That is the original version of this patch, and both Christian and Stefan
> reported that the system can't boot from DASD in this way[2], so I changed
> it to AND with cpu_online_mask, and then their systems could boot well.
> 
> 
> On the other hand, there is also risk in 
> 
> @@ -440,7 +440,7 @@ struct request *blk_mq_alloc_request_hctx(struct 
> request_queue *q,
>   blk_queue_exit(q);
>   return ERR_PTR(-EXDEV);
>   }
> - cpu = cpumask_first(alloc_data.hctx->cpumask);
> + cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
>   alloc_data.ctx = __blk_mq_get_ctx(q, cpu);
> 
> what if the cpus in alloc_data.hctx->cpumask are all offlined ?

This one is crazy, and is used by NVMe only, it should be fine if
the passed 'hctx_idx' is retrieved by the current running CPU, such
as the way of blk_mq_map_queue(). But if not, bad thing may happen.

Thanks,
Ming


Re: [PATCH 2/2] blk-mq: simplify queue mapping & schedule with each possible CPU

2018-01-16 Thread jianchao.wang
Hi Ming

Thanks for your patch and kind response.

On 01/16/2018 11:32 PM, Ming Lei wrote:
> OK, I got it, and it should have been the only corner case in which
> all CPUs mapped to this hctx become offline, and I believe the following
> patch should address this case; could you give it a test?
> 
> ---
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index c376d1b6309a..23f0f3ddffcf 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1416,21 +1416,44 @@ static void __blk_mq_run_hw_queue(struct 
> blk_mq_hw_ctx *hctx)
>   */
>  static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
>  {
> + bool tried = false;
> +
>   if (hctx->queue->nr_hw_queues == 1)
>   return WORK_CPU_UNBOUND;
>  
>   if (--hctx->next_cpu_batch <= 0) {
>   int next_cpu;
> +select_cpu:
>  
>   next_cpu = cpumask_next_and(hctx->next_cpu, hctx->cpumask,
>   cpu_online_mask);
>   if (next_cpu >= nr_cpu_ids)
>   next_cpu = 
> cpumask_first_and(hctx->cpumask,cpu_online_mask);
>  
> - hctx->next_cpu = next_cpu;
> + /*
> +  * No online CPU can be found here when running from
> +  * blk_mq_hctx_notify_dead(), so make sure hctx->next_cpu
> +  * is set correctly.
> +  */
> + if (next_cpu >= nr_cpu_ids)
> + hctx->next_cpu = cpumask_first_and(hctx->cpumask,
> + cpu_possible_mask);
> + else
> + hctx->next_cpu = next_cpu;
>   hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
>   }
>  
> + /*
> +  * Do unbound schedule if we can't find a online CPU for this hctx,
> +  * and it should happen only if hctx->next_cpu is becoming DEAD.
> +  */
> + if (!cpu_online(hctx->next_cpu)) {
> + if (!tried) {
> + tried = true;
> + goto select_cpu;
> + }
> + return WORK_CPU_UNBOUND;
> + }
>   return hctx->next_cpu;
>  }

I have tested this patch. The panic was gone, but I got the following:

[  231.674464] WARNING: CPU: 0 PID: 263 at 
/home/will/u04/source_code/linux-block/block/blk-mq.c:1315 
__blk_mq_run_hw_queue+0x92/0xa0
[  231.674466] Modules linked in: .
[  231.674494] CPU: 0 PID: 263 Comm: kworker/2:1H Not tainted 4.15.0-rc7+ #12
[  231.674495] Hardware name: LENOVO 10MLS0E339/3106, BIOS M1AKT22A 06/27/2017
[  231.674496] Workqueue: kblockd blk_mq_run_work_fn
[  231.674498] RIP: 0010:__blk_mq_run_hw_queue+0x92/0xa0
[  231.674499] RSP: 0018:a9c801fcfe60 EFLAGS: 00010202
[  231.674500] RAX: 0001 RBX: 9c7c90231400 RCX: 
[  231.674500] RDX: 9c7c9255b0f8 RSI:  RDI: 9c7c90231400
[  231.674500] RBP: 9c7ca2ca2140 R08:  R09: 
[  231.674501] R10: 03cb R11:  R12: 9c7ca2ca8200
[  231.674501] R13:  R14: 09c7ca2ca820 R15: 9c7c3df25240
[  231.674502] FS:  () GS:9c7ca2c0() 
knlGS:
[  231.674502] CS:  0010 DS:  ES:  CR0: 80050033
[  231.674503] CR2: 01727008 CR3: 000336409003 CR4: 003606f0
[  231.674504] DR0:  DR1:  DR2: 
[  231.674504] DR3:  DR6: fffe0ff0 DR7: 0400
[  231.674504] Call Trace:
[  231.674509]  process_one_work+0x14b/0x420
[  231.674510]  worker_thread+0x47/0x420
[  231.674512]  kthread+0xf5/0x130
[  231.674514]  ? process_one_work+0x420/0x420
[  231.674515]  ? kthread_associate_blkcg+0xe0/0xe0
[  231.674517]  ? do_group_exit+0x3a/0xa0
[  231.674518]  ret_from_fork+0x24/0x30
[  231.674520] Code: ff 48 89 df e8 e0 6b 00 00 8b 74 24 04 48 89 df e8 a4 fc 
ff ff 48 8b 44 24 08 65 48 33 04 25 28 00 00 00 75 0e 48 83 c4 10 5b c3 <0f> ff 
eb b7 0f ff eb c1 e8 e1 02 c6 ff 90 0f 1f 44 00 00 48 8b 
[  231.674537] ---[ end trace cc2de957e0e0fed4 ]---

It is here.
__blk_mq_run_hw_queue()

WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
cpu_online(hctx->next_cpu));


To eliminate this risk totally, we could have blk_mq_hctx_next_cpu() return the
cpu even if the cpu is offlined, and modify the cpu_online above to cpu_active.
The kworkers of the per-cpu pool must have been migrated back when the cpu is
set active.
But there seem to be issues in DASD, as your previous comment noted.

That is the original version of this patch, and both Christian and Stefan
reported that the system can't boot from DASD in this way[2], so I changed
it to AND with cpu_online_mask, and then their systems could boot well.


On the other hand, there is also risk in 

@@ -440,7 +440,7 @@ struct request *blk_mq_alloc_request_hctx(struct 
request_queue *q,
blk_queue_exit(q);
return ERR_PTR(-EXDEV);
}
-   cpu = 

Re: [LSF/MM TOPIC] A high-performance userspace block driver

2018-01-16 Thread Ming Lei
On Tue, Jan 16, 2018 at 10:52 PM, Matthew Wilcox  wrote:
>
> I see the improvements that Facebook have been making to the nbd driver,
> and I think that's a wonderful thing.  Maybe the outcome of this topic
> is simply: "Shut up, Matthew, this is good enough".
>
> It's clear that there's an appetite for userspace block devices; not for
> swap devices or the root device, but for accessing data that's stored
> in that silo over there, and I really don't want to bring that entire
> mess of CORBA / Go / Rust / whatever into the kernel to get to it,
> but it would be really handy to present it as a block device.

I like the idea; one line of Python/... code may need thousands of
lines of C code to be done in the kernel.

>
> I've looked at a few block-driver-in-userspace projects that exist, and
> they all seem pretty bad.  For example, one API maps a few gigabytes of
> address space and plays games with vm_insert_page() to put page cache
> pages into the address space of the client process.  Of course, the TLB
> flush overhead of that solution is criminal.
>
> I've looked at pipes, and they're not an awful solution.  We've almost
> got enough syscalls to treat other objects as pipes.  The problem is
> that they're not seekable.  So essentially you're looking at having one
> pipe per outstanding command.  If you want to make good use of a modern
> NAND device, you want a few hundred outstanding commands, and that's a
> bit of a shoddy interface.
>
> Right now, I'm leaning towards combining these two approaches; adding
> a VM_NOTLB flag so the mmaped bits of the page cache never make it into
> the process's address space, so the TLB shootdown can be safely skipped.
> Then check it in follow_page_mask() and return the appropriate struct
> page.  As long as the userspace process does everything using O_DIRECT,
> I think this will work.
>
> It's either that or make pipes seekable ...

Userfaultfd might be another choice:

1) map the block LBA space into a range of process vm space

2) when a READ/WRITE req comes, convert it to a page fault on the
mapped range, and let userland take control of it; meanwhile the
kernel req context sleeps

3) the IO req context on the kernel side is woken up after userspace has
completed the IO request via userfaultfd

4) the kernel side continues to complete the IO, such as copying pages from
the storage range to the req (bio) pages.

READ should be fine since it is very similar to the use case of
QEMU postcopy live migration; WRITE can be a bit different, and
may need some changes in userfaultfd.
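
As a rough, hedged sketch of steps 1)-3) above (userspace side only; the LBA
window size, the fake block data and the missing error handling are all
placeholders):

#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	size_t len = 16 * page;				/* pretend LBA window */
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
	struct uffdio_api api = { .api = UFFD_API };
	void *lba;
	char *data;

	ioctl(uffd, UFFDIO_API, &api);

	/* 1) map the "LBA space" and register it with userfaultfd */
	lba = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)lba, .len = len },
		.mode  = UFFDIO_REGISTER_MODE_MISSING,
	};
	ioctl(uffd, UFFDIO_REGISTER, &reg);

	/* 2)/3) fault-serving loop: a real driver would fetch the data from
	 * its storage backend before resolving the fault */
	data = mmap(NULL, page, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	memset(data, 'R', page);			/* fake block data */
	for (;;) {
		struct uffd_msg msg;

		if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
			continue;
		if (msg.event != UFFD_EVENT_PAGEFAULT)
			continue;
		struct uffdio_copy cp = {
			.dst = msg.arg.pagefault.address & ~((unsigned long)page - 1),
			.src = (unsigned long)data,
			.len = page,
		};
		/* resolving the fault wakes the sleeping kernel req context */
		ioctl(uffd, UFFDIO_COPY, &cp);
	}
	return 0;
}

Presumably the WRITE direction is the part that needs userfaultfd changes,
since MISSING-mode faults only fire for pages that are not yet populated.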

-- 
Ming Lei


Re: [PATCH 3/3] block: Protect less code with sysfs_lock in blk_{un,}register_queue()

2018-01-16 Thread Bart Van Assche
On Wed, 2018-01-17 at 09:23 +0800, Ming Lei wrote:
> On Wed, Jan 17, 2018 at 8:03 AM, Bart Van Assche  
> wrote:
> > On Tue, 2018-01-16 at 17:32 -0500, Mike Snitzer wrote:
> > > Therefore it seems to me that all queue_attr_{show,store} are racy vs
> > > blk_unregister_queue() removing the 'queue' kobject.
> > > 
> > > And it was just that __elevator_change() was myopically fixed to address
> > > the race whereas a more generic solution was/is needed.  But short of
> > > that more generic fix your change will reintroduce the potential for
> > > hitting the issue that commit e9a823fb34a8b fixed.
> > > 
> > > In that light, think it best to leave blk_unregister_queue()'s
> > > mutex_lock() above the QUEUE_FLAG_REGISTERED clearing _and_ update
> > > queue_attr_{show,store} to test for QUEUE_FLAG_REGISTERED while holding
> > > sysfs_lock.
> > > 
> > > Then remove the unicorn test_bit for QUEUE_FLAG_REGISTERED from
> > > __elevator_change().
> > 
> > Thanks Mike for the feedback. However, I think a simpler approach exists 
> > than
> > what has been described above, namely by unregistering the sysfs attributes
> > earlier. How about the patch below?
> > 
> > [PATCH] block: Protect less code with sysfs_lock in 
> > blk_{un,}register_queue()
> > ---
> >  block/blk-sysfs.c | 39 ++-
> >  block/elevator.c  |  4 
> >  2 files changed, 26 insertions(+), 17 deletions(-)
> > 
> > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> > index 4a6a40ffd78e..ce32366c6db7 100644
> > --- a/block/blk-sysfs.c
> > +++ b/block/blk-sysfs.c
> > @@ -853,6 +853,10 @@ struct kobj_type blk_queue_ktype = {
> > .release= blk_release_queue,
> >  };
> > 
> > +/**
> > + * blk_register_queue - register a block layer queue with sysfs
> > + * @disk: Disk of which the request queue should be registered with sysfs.
> > + */
> >  int blk_register_queue(struct gendisk *disk)
> >  {
> > int ret;
> > @@ -909,11 +913,12 @@ int blk_register_queue(struct gendisk *disk)
> > if (q->request_fn || (q->mq_ops && q->elevator)) {
> > ret = elv_register_queue(q);
> > if (ret) {
> > +   mutex_unlock(>sysfs_lock);
> > kobject_uevent(>kobj, KOBJ_REMOVE);
> > kobject_del(>kobj);
> > blk_trace_remove_sysfs(dev);
> > kobject_put(>kobj);
> > -   goto unlock;
> > +   return ret;
> > }
> > }
> > ret = 0;
> > @@ -923,6 +928,13 @@ int blk_register_queue(struct gendisk *disk)
> >  }
> >  EXPORT_SYMBOL_GPL(blk_register_queue);
> > 
> > +/**
> > + * blk_unregister_queue - counterpart of blk_register_queue()
> > + * @disk: Disk of which the request queue should be unregistered from 
> > sysfs.
> > + *
> > + * Note: the caller is responsible for guaranteeing that this function is 
> > called
> > + * after blk_register_queue() has finished.
> > + */
> >  void blk_unregister_queue(struct gendisk *disk)
> >  {
> > struct request_queue *q = disk->queue;
> > @@ -934,28 +946,29 @@ void blk_unregister_queue(struct gendisk *disk)
> > if (!test_bit(QUEUE_FLAG_REGISTERED, >queue_flags))
> > return;
> > 
> > -   /*
> > -* Protect against the 'queue' kobj being accessed
> > -* while/after it is removed.
> > -*/
> > -   mutex_lock(>sysfs_lock);
> > -
> > spin_lock_irq(q->queue_lock);
> > queue_flag_clear(QUEUE_FLAG_REGISTERED, q);
> > spin_unlock_irq(q->queue_lock);
> > 
> > -   wbt_exit(q);
> > -
> > +   /*
> > +* Remove the sysfs attributes before unregistering the queue data
> > +* structures that can be modified through sysfs.
> > +*/
> > +   mutex_lock(>sysfs_lock);
> > if (q->mq_ops)
> > blk_mq_unregister_dev(disk_to_dev(disk), q);
> > -
> > -   if (q->request_fn || (q->mq_ops && q->elevator))
> > -   elv_unregister_queue(q);
> > +   mutex_unlock(>sysfs_lock);
> > 
> > kobject_uevent(>kobj, KOBJ_REMOVE);
> > kobject_del(>kobj);
> 
> An elevator switch can still come just after the above line of code completes,
> so the previous warning addressed in e9a823fb34a8b can be triggered
> again.
> 
> > blk_trace_remove_sysfs(disk_to_dev(disk));
> > -   kobject_put(_to_dev(disk)->kobj);
> > 
> > +   wbt_exit(q);
> > +
> > +   mutex_lock(>sysfs_lock);
> > +   if (q->request_fn || (q->mq_ops && q->elevator))
> > +   elv_unregister_queue(q);
> > mutex_unlock(>sysfs_lock);
> > +
> > +   kobject_put(_to_dev(disk)->kobj);
> >  }
> > diff --git a/block/elevator.c b/block/elevator.c
> > index e87e9b43aba0..4b7957b28a99 100644
> > --- a/block/elevator.c
> > +++ b/block/elevator.c
> > @@ -1080,10 +1080,6 @@ static int __elevator_change(struct request_queue 
> > *q, 

Re: [PATCH 3/3] block: Protect less code with sysfs_lock in blk_{un,}register_queue()

2018-01-16 Thread Ming Lei
On Wed, Jan 17, 2018 at 8:03 AM, Bart Van Assche  wrote:
> On Tue, 2018-01-16 at 17:32 -0500, Mike Snitzer wrote:
>> Therefore it seems to me that all queue_attr_{show,store} are racy vs
>> blk_unregister_queue() removing the 'queue' kobject.
>>
>> And it was just that __elevator_change() was myopically fixed to address
>> the race whereas a more generic solution was/is needed.  But short of
>> that more generic fix your change will reintroduce the potential for
>> hitting the issue that commit e9a823fb34a8b fixed.
>>
>> In that light, think it best to leave blk_unregister_queue()'s
>> mutex_lock() above the QUEUE_FLAG_REGISTERED clearing _and_ update
>> queue_attr_{show,store} to test for QUEUE_FLAG_REGISTERED while holding
>> sysfs_lock.
>>
>> Then remove the unicorn test_bit for QUEUE_FLAG_REGISTERED from
>> __elevator_change().
>
> Thanks Mike for the feedback. However, I think a simpler approach exists than
> what has been described above, namely by unregistering the sysfs attributes
> earlier. How about the patch below?
>
> [PATCH] block: Protect less code with sysfs_lock in blk_{un,}register_queue()
> ---
>  block/blk-sysfs.c | 39 ++-
>  block/elevator.c  |  4 
>  2 files changed, 26 insertions(+), 17 deletions(-)
>
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 4a6a40ffd78e..ce32366c6db7 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -853,6 +853,10 @@ struct kobj_type blk_queue_ktype = {
> .release= blk_release_queue,
>  };
>
> +/**
> + * blk_register_queue - register a block layer queue with sysfs
> + * @disk: Disk of which the request queue should be registered with sysfs.
> + */
>  int blk_register_queue(struct gendisk *disk)
>  {
> int ret;
> @@ -909,11 +913,12 @@ int blk_register_queue(struct gendisk *disk)
> if (q->request_fn || (q->mq_ops && q->elevator)) {
> ret = elv_register_queue(q);
> if (ret) {
> +   mutex_unlock(>sysfs_lock);
> kobject_uevent(>kobj, KOBJ_REMOVE);
> kobject_del(>kobj);
> blk_trace_remove_sysfs(dev);
> kobject_put(>kobj);
> -   goto unlock;
> +   return ret;
> }
> }
> ret = 0;
> @@ -923,6 +928,13 @@ int blk_register_queue(struct gendisk *disk)
>  }
>  EXPORT_SYMBOL_GPL(blk_register_queue);
>
> +/**
> + * blk_unregister_queue - counterpart of blk_register_queue()
> + * @disk: Disk of which the request queue should be unregistered from sysfs.
> + *
> + * Note: the caller is responsible for guaranteeing that this function is called
> + * after blk_register_queue() has finished.
> + */
>  void blk_unregister_queue(struct gendisk *disk)
>  {
> struct request_queue *q = disk->queue;
> @@ -934,28 +946,29 @@ void blk_unregister_queue(struct gendisk *disk)
> if (!test_bit(QUEUE_FLAG_REGISTERED, >queue_flags))
> return;
>
> -   /*
> -* Protect against the 'queue' kobj being accessed
> -* while/after it is removed.
> -*/
> -   mutex_lock(>sysfs_lock);
> -
> spin_lock_irq(q->queue_lock);
> queue_flag_clear(QUEUE_FLAG_REGISTERED, q);
> spin_unlock_irq(q->queue_lock);
>
> -   wbt_exit(q);
> -
> +   /*
> +* Remove the sysfs attributes before unregistering the queue data
> +* structures that can be modified through sysfs.
> +*/
> +   mutex_lock(>sysfs_lock);
> if (q->mq_ops)
> blk_mq_unregister_dev(disk_to_dev(disk), q);
> -
> -   if (q->request_fn || (q->mq_ops && q->elevator))
> -   elv_unregister_queue(q);
> +   mutex_unlock(>sysfs_lock);
>
> kobject_uevent(>kobj, KOBJ_REMOVE);
> kobject_del(>kobj);

elevator switch still can come just after the above line code is completed,
so the previous warning addressed in e9a823fb34a8b can be triggered
again.

> blk_trace_remove_sysfs(disk_to_dev(disk));
> -   kobject_put(_to_dev(disk)->kobj);
>
> +   wbt_exit(q);
> +
> +   mutex_lock(>sysfs_lock);
> +   if (q->request_fn || (q->mq_ops && q->elevator))
> +   elv_unregister_queue(q);
> mutex_unlock(>sysfs_lock);
> +
> +   kobject_put(_to_dev(disk)->kobj);
>  }
> diff --git a/block/elevator.c b/block/elevator.c
> index e87e9b43aba0..4b7957b28a99 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -1080,10 +1080,6 @@ static int __elevator_change(struct request_queue *q, const char *name)
> char elevator_name[ELV_NAME_MAX];
> struct elevator_type *e;
>
> -   /* Make sure queue is not in the middle of being removed */
> -   if (!test_bit(QUEUE_FLAG_REGISTERED, >queue_flags))
> -   return -ENOENT;
> -

The above check shouldn't be removed, as I explained above.



-- 
Ming Lei


Re: [LSF/MM TOPIC] A high-performance userspace block driver

2018-01-16 Thread Bart Van Assche
On Tue, 2018-01-16 at 06:52 -0800, Matthew Wilcox wrote:
> I see the improvements that Facebook have been making to the nbd driver,
> and I think that's a wonderful thing.  Maybe the outcome of this topic
> is simply: "Shut up, Matthew, this is good enough".
> 
> It's clear that there's an appetite for userspace block devices; not for
> swap devices or the root device, but for accessing data that's stored
> in that silo over there, and I really don't want to bring that entire
> mess of CORBA / Go / Rust / whatever into the kernel to get to it,
> but it would be really handy to present it as a block device.
> 
> I've looked at a few block-driver-in-userspace projects that exist, and
> they all seem pretty bad.  For example, one API maps a few gigabytes of
> address space and plays games with vm_insert_page() to put page cache
> pages into the address space of the client process.  Of course, the TLB
> flush overhead of that solution is criminal.
> 
> I've looked at pipes, and they're not an awful solution.  We've almost
> got enough syscalls to treat other objects as pipes.  The problem is
> that they're not seekable.  So essentially you're looking at having one
> pipe per outstanding command.  If you want to make good use of a modern
> NAND device, you want a few hundred outstanding commands, and that's a
> bit of a shoddy interface.
> 
> Right now, I'm leaning towards combining these two approaches; adding
> a VM_NOTLB flag so the mmaped bits of the page cache never make it into
> the process's address space, so the TLB shootdown can be safely skipped.
> Then check it in follow_page_mask() and return the appropriate struct
> page.  As long as the userspace process does everything using O_DIRECT,
> I think this will work.
> 
> It's either that or make pipes seekable ...

How about using the RDMA API and the rdma_rxe driver over loopback? The RDMA
API supports zero-copy communication which is something the BSD socket API
does not support. The RDMA API also supports byte-level granularity and the
hot path (ib_post_send(), ib_post_recv(), ib_poll_cq()) does not require any
system calls for PCIe RDMA adapters. The rdma_rxe driver however uses a system
call to trigger the send doorbell.
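
The userspace side of that hot path would use the libibverbs equivalents
(ibv_post_send()/ibv_post_recv()/ibv_poll_cq()). Roughly, completion reaping
in the userspace block driver then becomes a poll loop like the sketch below
(untested, queue setup, posting and error recovery omitted):

#include <infiniband/verbs.h>

/* Reap completions without any system call on a PCIe RDMA adapter. */
static void reap_completions(struct ibv_cq *cq)
{
	struct ibv_wc wc[16];
	int i, n;

	while ((n = ibv_poll_cq(cq, 16, wc)) > 0) {
		for (i = 0; i < n; i++) {
			if (wc[i].status != IBV_WC_SUCCESS) {
				/* fail the I/O identified by wc[i].wr_id */
				continue;
			}
			/* complete the I/O identified by wc[i].wr_id */
		}
	}
}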

Bart.


Re: [PATCH 3/3] block: Protect less code with sysfs_lock in blk_{un,}register_queue()

2018-01-16 Thread Bart Van Assche
On Tue, 2018-01-16 at 17:32 -0500, Mike Snitzer wrote:
> Therefore it seems to me that all queue_attr_{show,store} are racey vs
> blk_unregister_queue() removing the 'queue' kobject.
> 
> And it was just that __elevator_change() was myopically fixed to address
> the race whereas a more generic solution was/is needed.  But short of
> that more generic fix your change will reintroduce the potential for
> hitting the issue that commit e9a823fb34a8b fixed.
> 
> In that light, think it best to leave blk_unregister_queue()'s
> mutex_lock() above the QUEUE_FLAG_REGISTERED clearing _and_ update
> queue_attr_{show,store} to test for QUEUE_FLAG_REGISTERED while holding
> sysfs_lock.
> 
> Then remove the unicorn test_bit for QUEUE_FLAG_REGISTERED from
> __elevator_change().

Thanks Mike for the feedback. However, I think a simpler approach exists than
what has been described above, namely by unregistering the sysfs attributes
earlier. How about the patch below?

[PATCH] block: Protect less code with sysfs_lock in blk_{un,}register_queue()
---
 block/blk-sysfs.c | 39 ++-
 block/elevator.c  |  4 
 2 files changed, 26 insertions(+), 17 deletions(-)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 4a6a40ffd78e..ce32366c6db7 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -853,6 +853,10 @@ struct kobj_type blk_queue_ktype = {
.release= blk_release_queue,
 };
 
+/**
+ * blk_register_queue - register a block layer queue with sysfs
+ * @disk: Disk of which the request queue should be registered with sysfs.
+ */
 int blk_register_queue(struct gendisk *disk)
 {
int ret;
@@ -909,11 +913,12 @@ int blk_register_queue(struct gendisk *disk)
if (q->request_fn || (q->mq_ops && q->elevator)) {
ret = elv_register_queue(q);
if (ret) {
+   mutex_unlock(>sysfs_lock);
kobject_uevent(>kobj, KOBJ_REMOVE);
kobject_del(>kobj);
blk_trace_remove_sysfs(dev);
kobject_put(>kobj);
-   goto unlock;
+   return ret;
}
}
ret = 0;
@@ -923,6 +928,13 @@ int blk_register_queue(struct gendisk *disk)
 }
 EXPORT_SYMBOL_GPL(blk_register_queue);
 
+/**
+ * blk_unregister_queue - counterpart of blk_register_queue()
+ * @disk: Disk of which the request queue should be unregistered from sysfs.
+ *
+ * Note: the caller is responsible for guaranteeing that this function is called
+ * after blk_register_queue() has finished.
+ */
 void blk_unregister_queue(struct gendisk *disk)
 {
struct request_queue *q = disk->queue;
@@ -934,28 +946,29 @@ void blk_unregister_queue(struct gendisk *disk)
if (!test_bit(QUEUE_FLAG_REGISTERED, >queue_flags))
return;
 
-   /*
-* Protect against the 'queue' kobj being accessed
-* while/after it is removed.
-*/
-   mutex_lock(>sysfs_lock);
-
spin_lock_irq(q->queue_lock);
queue_flag_clear(QUEUE_FLAG_REGISTERED, q);
spin_unlock_irq(q->queue_lock);
 
-   wbt_exit(q);
-
+   /*
+* Remove the sysfs attributes before unregistering the queue data
+* structures that can be modified through sysfs.
+*/
+   mutex_lock(>sysfs_lock);
if (q->mq_ops)
blk_mq_unregister_dev(disk_to_dev(disk), q);
-
-   if (q->request_fn || (q->mq_ops && q->elevator))
-   elv_unregister_queue(q);
+   mutex_unlock(>sysfs_lock);
 
kobject_uevent(>kobj, KOBJ_REMOVE);
kobject_del(>kobj);
blk_trace_remove_sysfs(disk_to_dev(disk));
-   kobject_put(_to_dev(disk)->kobj);
 
+   wbt_exit(q);
+
+   mutex_lock(>sysfs_lock);
+   if (q->request_fn || (q->mq_ops && q->elevator))
+   elv_unregister_queue(q);
mutex_unlock(>sysfs_lock);
+
+   kobject_put(_to_dev(disk)->kobj);
 }
diff --git a/block/elevator.c b/block/elevator.c
index e87e9b43aba0..4b7957b28a99 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -1080,10 +1080,6 @@ static int __elevator_change(struct request_queue *q, const char *name)
char elevator_name[ELV_NAME_MAX];
struct elevator_type *e;
 
-   /* Make sure queue is not in the middle of being removed */
-   if (!test_bit(QUEUE_FLAG_REGISTERED, >queue_flags))
-   return -ENOENT;
-
/*
 * Special case for mq, turn off scheduling
 */
-- 
2.15.1


Re: [Lsf-pc] [LSF/MM TOPIC] A high-performance userspace block driver

2018-01-16 Thread Bart Van Assche
On Tue, 2018-01-16 at 15:28 -0800, James Bottomley wrote:
> On Tue, 2018-01-16 at 18:23 -0500, Theodore Ts'o wrote:
> > On Tue, Jan 16, 2018 at 06:52:40AM -0800, Matthew Wilcox wrote:
> > > 
> > > 
> > > I see the improvements that Facebook have been making to the nbd
> > > driver, and I think that's a wonderful thing.  Maybe the outcome of
> > > this topic is simply: "Shut up, Matthew, this is good enough".
> > > 
> > > It's clear that there's an appetite for userspace block devices;
> > > not for swap devices or the root device, but for accessing data
> > > that's stored in that silo over there, and I really don't want to
> > > bring that entire mess of CORBA / Go / Rust / whatever into the
> > > kernel to get to it, but it would be really handy to present it as
> > > a block device.
> > 
> > ... and using iSCSI was too painful and heavyweight.
> 
> From what I've seen a reasonable number of storage over IP cloud
> implementations are actually using AoE.  The argument goes that the
> protocol is about ideal (at least as compared to iSCSI or FCoE) and the
> company behind it doesn't seem to want to add any more features that
> would bloat it.

Has anyone already looked into iSER, SRP or NVMeOF over rdma_rxe over the
loopback network driver? I think all three driver stacks support zero-copy
receiving, something that is not possible with iSCSI/TCP nor with AoE.

Bart.

Re: [Lsf-pc] [LSF/MM TOPIC] A high-performance userspace block driver

2018-01-16 Thread James Bottomley
On Tue, 2018-01-16 at 18:23 -0500, Theodore Ts'o wrote:
> On Tue, Jan 16, 2018 at 06:52:40AM -0800, Matthew Wilcox wrote:
> > 
> > 
> > I see the improvements that Facebook have been making to the nbd
> > driver, and I think that's a wonderful thing.  Maybe the outcome of
> > this topic is simply: "Shut up, Matthew, this is good enough".
> > 
> > It's clear that there's an appetite for userspace block devices;
> > not for swap devices or the root device, but for accessing data
> > that's stored in that silo over there, and I really don't want to
> > bring that entire mess of CORBA / Go / Rust / whatever into the
> > kernel to get to it, but it would be really handy to present it as
> > a block device.
> 
> ... and using iSCSI was too painful and heavyweight.

From what I've seen a reasonable number of storage over IP cloud
implementations are actually using AoE.  The argument goes that the
protocol is about ideal (at least as compared to iSCSI or FCoE) and the
company behind it doesn't seem to want to add any more features that
would bloat it.

James



Re: [LSF/MM TOPIC] A high-performance userspace block driver

2018-01-16 Thread Theodore Ts'o
On Tue, Jan 16, 2018 at 06:52:40AM -0800, Matthew Wilcox wrote:
> 
> I see the improvements that Facebook have been making to the nbd driver,
> and I think that's a wonderful thing.  Maybe the outcome of this topic
> is simply: "Shut up, Matthew, this is good enough".
> 
> It's clear that there's an appetite for userspace block devices; not for
> swap devices or the root device, but for accessing data that's stored
> in that silo over there, and I really don't want to bring that entire
> mess of CORBA / Go / Rust / whatever into the kernel to get to it,
> but it would be really handy to present it as a block device.

... and using iSCSI was too painful and heavyweight.

Google has an iblock device implementation, so you can use that as
confirmation that there certainly has been a desire for such a thing.
In fact, we're happily using it in production even as we speak.

We have been (tentatively) planning on presenting it at OSS North
America later in the year, since the Vault conference is no longer
with us, but we could probably put together a quick presentation for
LSF/MM if there is interest.

There were plans to do something using page cache tricks (what we were
calling the "zero copy" option), but we decided to start with
something simpler, more reliable, so long as it was less overhead and
pain than iSCSI (which was simply an over-engineered solution for our
use case), it was all upside.

- Ted


Re: [LSF/MM TOPIC] A high-performance userspace block driver

2018-01-16 Thread Viacheslav Dubeyko
On Tue, 2018-01-16 at 06:52 -0800, Matthew Wilcox wrote:
> I see the improvements that Facebook have been making to the nbd driver,
> and I think that's a wonderful thing.  Maybe the outcome of this topic
> is simply: "Shut up, Matthew, this is good enough".
> 
> It's clear that there's an appetite for userspace block devices; not for
> swap devices or the root device, but for accessing data that's stored
> in that silo over there, and I really don't want to bring that entire
> mess of CORBA / Go / Rust / whatever into the kernel to get to it,
> but it would be really handy to present it as a block device.
> 
> I've looked at a few block-driver-in-userspace projects that exist, and
> they all seem pretty bad.  For example, one API maps a few gigabytes of
> address space and plays games with vm_insert_page() to put page cache
> pages into the address space of the client process.  Of course, the TLB
> flush overhead of that solution is criminal.
> 
> I've looked at pipes, and they're not an awful solution.  We've almost
> got enough syscalls to treat other objects as pipes.  The problem is
> that they're not seekable.  So essentially you're looking at having one
> pipe per outstanding command.  If you want to make good use of a modern
> NAND device, you want a few hundred outstanding commands, and that's a
> bit of a shoddy interface.
> 
> Right now, I'm leaning towards combining these two approaches; adding
> a VM_NOTLB flag so the mmaped bits of the page cache never make it into
> the process's address space, so the TLB shootdown can be safely skipped.
> Then check it in follow_page_mask() and return the appropriate struct
> page.  As long as the userspace process does everything using O_DIRECT,
> I think this will work.
> 
> It's either that or make pipes seekable ...

I like the whole idea. But why pipes? What about shared memory? Making
the pipes seekable sounds like it kills the initial concept: we usually
treat a pipe as a FIFO communication channel, so a seekable pipe sounds
really strange from my point of view. Maybe we need some new
abstraction?

By the way, what use case(s) do you have in mind for the suggested
approach?

Thanks,
Vyacheslav Dubeyko.




Re: [PATCH 3/3] block: Protect less code with sysfs_lock in blk_{un,}register_queue()

2018-01-16 Thread Mike Snitzer
On Tue, Jan 16 2018 at  1:17pm -0500,
Bart Van Assche  wrote:

> The __blk_mq_register_dev(), blk_mq_unregister_dev(),
> elv_register_queue() and elv_unregister_queue() calls need to be
> protected with sysfs_lock but other code in these functions not.
> Hence protect only this code with sysfs_lock. This patch fixes a
> locking inversion issue in blk_unregister_queue() and also in an
> error path of blk_register_queue(): it is not allowed to hold
> sysfs_lock around the kobject_del(>kobj) call.
> 
> Signed-off-by: Bart Van Assche 
> ---
>  block/blk-sysfs.c | 13 -
>  1 file changed, 4 insertions(+), 9 deletions(-)
> 
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 4a6a40ffd78e..e9ce45ff0ef2 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -909,11 +909,12 @@ int blk_register_queue(struct gendisk *disk)
>   if (q->request_fn || (q->mq_ops && q->elevator)) {
>   ret = elv_register_queue(q);
>   if (ret) {
> + mutex_unlock(>sysfs_lock);
>   kobject_uevent(>kobj, KOBJ_REMOVE);
>   kobject_del(>kobj);
>   blk_trace_remove_sysfs(dev);
>   kobject_put(>kobj);
> - goto unlock;
> + return ret;
>   }
>   }
>   ret = 0;
> @@ -934,28 +935,22 @@ void blk_unregister_queue(struct gendisk *disk)
>   if (!test_bit(QUEUE_FLAG_REGISTERED, >queue_flags))
>   return;
>  
> - /*
> -  * Protect against the 'queue' kobj being accessed
> -  * while/after it is removed.
> -  */
> - mutex_lock(>sysfs_lock);
> -
>   spin_lock_irq(q->queue_lock);
>   queue_flag_clear(QUEUE_FLAG_REGISTERED, q);
>   spin_unlock_irq(q->queue_lock);
>  
>   wbt_exit(q);
>  
> + mutex_lock(>sysfs_lock);
>   if (q->mq_ops)
>   blk_mq_unregister_dev(disk_to_dev(disk), q);
>  
>   if (q->request_fn || (q->mq_ops && q->elevator))
>   elv_unregister_queue(q);

My concern with this change is detailed in the following portion of
the header for commit 667257e8b2988c0183ba23e2bcd6900e87961606:

2) Conversely, __elevator_change() is testing for QUEUE_FLAG_REGISTERED
in case elv_iosched_store() loses the race with blk_unregister_queue(),
it needs a way to know the 'queue' kobj isn't there.

I don't think moving mutex_lock(>sysfs_lock); after the clearing of
QUEUE_FLAG_REGISTERED is a step in the right direction.

Current code shows:

blk_cleanup_queue() calls blk_set_queue_dying() while holding
the sysfs_lock.

queue_attr_{show,store} both test if blk_queue_dying(q) while holding
the sysfs_lock.

BUT drivers can/do call del_gendisk() _before_ blk_cleanup_queue().
(if your proposed change above were to go in all of the block drivers
would first need to be audited for the need to call blk_cleanup_queue()
before del_gendisk() -- seems awful).

Therefore it seems to me that all queue_attr_{show,store} are racey vs
blk_unregister_queue() removing the 'queue' kobject.

And it was just that __elevator_change() was myopically fixed to address
the race whereas a more generic solution was/is needed.  But short of
that more generic fix your change will reintroduce the potential for
hitting the issue that commit e9a823fb34a8b fixed.

In that light, think it best to leave blk_unregister_queue()'s
mutex_lock() above the QUEUE_FLAG_REGISTERED clearing _and_ update
queue_attr_{show,store} to test for QUEUE_FLAG_REGISTERED while holding
sysfs_lock.

Then remove the unicorn test_bit for QUEUE_FLAG_REGISTERED from
__elevator_change().
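
Concretely, that would amount to something like the following in
queue_attr_show() (untested sketch only; queue_attr_store() would need
the same check):

static ssize_t
queue_attr_show(struct kobject *kobj, struct attribute *attr, char *page)
{
	struct queue_sysfs_entry *entry = to_queue(attr);
	struct request_queue *q =
		container_of(kobj, struct request_queue, kobj);
	ssize_t res;

	if (!entry->show)
		return -EIO;
	mutex_lock(&q->sysfs_lock);
	/* bail out if the queue is dying or no longer registered */
	if (blk_queue_dying(q) ||
	    !test_bit(QUEUE_FLAG_REGISTERED, &q->queue_flags)) {
		mutex_unlock(&q->sysfs_lock);
		return -ENOENT;
	}
	res = entry->show(q, page);
	mutex_unlock(&q->sysfs_lock);
	return res;
}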

But it could be I'm wrong for some reason.. as you know that happens ;)

Mike


Re: [PATCH] lightnvm/pblk-gc: Delete an error message for a failed memory allocation in pblk_gc_line_prepare_ws()

2018-01-16 Thread Matias Bjørling

On 01/16/2018 10:10 PM, SF Markus Elfring wrote:

From: Markus Elfring 
Date: Tue, 16 Jan 2018 22:00:15 +0100

Omit an extra message for a memory allocation failure in this function.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring 
---
  drivers/lightnvm/pblk-gc.c | 4 +---
  1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/lightnvm/pblk-gc.c b/drivers/lightnvm/pblk-gc.c
index 9c8e114c8a54..54cdb4360366 100644
--- a/drivers/lightnvm/pblk-gc.c
+++ b/drivers/lightnvm/pblk-gc.c
@@ -147,10 +147,8 @@ static void pblk_gc_line_prepare_ws(struct work_struct *work)
int ret;
  
  	invalid_bitmap = kmalloc(lm->sec_bitmap_len, GFP_KERNEL);

-   if (!invalid_bitmap) {
-   pr_err("pblk: could not allocate GC invalid bitmap\n");
+   if (!invalid_bitmap)
goto fail_free_ws;
-   }
  
  	emeta_buf = pblk_malloc(lm->emeta_len[0], l_mg->emeta_alloc_type,

GFP_KERNEL);



Thanks Markus. I'll queue it up for 4.17.


[PATCH] lightnvm/pblk-gc: Delete an error message for a failed memory allocation in pblk_gc_line_prepare_ws()

2018-01-16 Thread SF Markus Elfring
From: Markus Elfring 
Date: Tue, 16 Jan 2018 22:00:15 +0100

Omit an extra message for a memory allocation failure in this function.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring 
---
 drivers/lightnvm/pblk-gc.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/lightnvm/pblk-gc.c b/drivers/lightnvm/pblk-gc.c
index 9c8e114c8a54..54cdb4360366 100644
--- a/drivers/lightnvm/pblk-gc.c
+++ b/drivers/lightnvm/pblk-gc.c
@@ -147,10 +147,8 @@ static void pblk_gc_line_prepare_ws(struct work_struct *work)
int ret;
 
invalid_bitmap = kmalloc(lm->sec_bitmap_len, GFP_KERNEL);
-   if (!invalid_bitmap) {
-   pr_err("pblk: could not allocate GC invalid bitmap\n");
+   if (!invalid_bitmap)
goto fail_free_ws;
-   }
 
emeta_buf = pblk_malloc(lm->emeta_len[0], l_mg->emeta_alloc_type,
GFP_KERNEL);
-- 
2.15.1



[PATCH] block: Fix __bio_integrity_endio() documentation

2018-01-16 Thread Bart Van Assche
Fixes: 4246a0b63bd8 ("block: add a bi_error field to struct bio")
Signed-off-by: Bart Van Assche 
---
 block/bio-integrity.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 23b42e8aa03e..9cfdd6c83b5b 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -374,7 +374,6 @@ static void bio_integrity_verify_fn(struct work_struct *work)
 /**
  * __bio_integrity_endio - Integrity I/O completion function
  * @bio:   Protected bio
- * @error: Pointer to errno
  *
  * Description: Completion for integrity I/O
  *
-- 
2.15.1



[PATCH] blk-mq: Fix a race condition in blk_mq_mark_tag_wait()

2018-01-16 Thread Bart Van Assche
Because the hctx lock is not held around the only
blk_mq_tag_wakeup_all() call in the block layer, the wait queue
entry removal in blk_mq_dispatch_wake() is protected by the wait
queue lock only. Since the hctx->dispatch_wait entry can occur on
any of the SBQ_WAIT_QUEUES, the wait queue presence check, adding
.dispatch_wait to a wait queue and removing the wait queue entry
must all be protected by both the hctx lock and the wait queue
lock.

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Omar Sandoval 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
---
 block/blk-mq.c | 31 ++-
 1 file changed, 18 insertions(+), 13 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index edb1291a42c5..6ca85052e63b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1109,7 +1109,7 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx **hctx,
struct blk_mq_hw_ctx *this_hctx = *hctx;
struct sbq_wait_state *ws;
wait_queue_entry_t *wait;
-   bool ret;
+   bool ret = false;
 
if (!(this_hctx->flags & BLK_MQ_F_TAG_SHARED)) {
if (!test_bit(BLK_MQ_S_SCHED_RESTART, _hctx->state))
@@ -1130,14 +1130,20 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx **hctx,
if (!list_empty_careful(>entry))
return false;
 
+   ws = bt_wait_ptr(_hctx->tags->bitmap_tags, this_hctx);
+
+   /*
+* Since hctx->dispatch_wait can already be on any of the
+* SBQ_WAIT_QUEUES number of wait queues, serialize the check and
+* add_wait_queue() calls below with this_hctx->lock.
+*/
spin_lock(_hctx->lock);
-   if (!list_empty(>entry)) {
-   spin_unlock(_hctx->lock);
-   return false;
-   }
+   spin_lock_irq(>wait.lock);
+   if (!list_empty(>entry))
+   goto unlock;
 
-   ws = bt_wait_ptr(_hctx->tags->bitmap_tags, this_hctx);
-   add_wait_queue(>wait, wait);
+   wait->flags &= ~WQ_FLAG_EXCLUSIVE;
+   __add_wait_queue(>wait, wait);
 
/*
 * It's possible that a tag was freed in the window between the
@@ -1145,21 +1151,20 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx **hctx,
 * queue.
 */
ret = blk_mq_get_driver_tag(rq, hctx, false);
-   if (!ret) {
-   spin_unlock(_hctx->lock);
-   return false;
-   }
+   if (!ret)
+   goto unlock;
 
/*
 * We got a tag, remove ourselves from the wait queue to ensure
 * someone else gets the wakeup.
 */
-   spin_lock_irq(>wait.lock);
list_del_init(>entry);
+
+unlock:
spin_unlock_irq(>wait.lock);
spin_unlock(_hctx->lock);
 
-   return true;
+   return ret;
 }
 
 bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
-- 
2.15.1



Re: [for-4.16 PATCH v5 0/4] block/dm: allow DM to defer blk_register_queue() until ready

2018-01-16 Thread Bart Van Assche
On Mon, 2018-01-15 at 18:13 -0500, Mike Snitzer wrote:
> On Mon, Jan 15 2018 at  6:10P -0500,
> Mike Snitzer  wrote:
> 
> > On Mon, Jan 15 2018 at  5:51pm -0500,
> > Bart Van Assche  wrote:
> > 
> > > On Mon, 2018-01-15 at 17:15 -0500, Mike Snitzer wrote:
> > > > sysfs write op calls kernfs_fop_write which takes:
> > > > of->mutex then kn->count#213 (no idea what that is)
> > > > then q->sysfs_lock (via queue_attr_store)
> > > > 
> > > > vs 
> > > > 
> > > > blk_unregister_queue takes:
> > > > q->sysfs_lock then
> > > > kernfs_mutex (via kernfs_remove)
> > > > seems lockdep thinks "kernfs_mutex" is "kn->count#213"?
> > > > 
> > > > Feels like lockdep code in fs/kernfs/file.c and fs/kernfs/dir.c is
> > > > triggering false positives.. because these seem like different kernfs
> > > > locks yet they are reported as "kn->count#213".
> > > > 
> > > > Certainly feeling out of my depth with kernfs's locking though.
> > > 
> > > Hello Mike,
> > > 
> > > I don't think that this is a false positive but rather the following 
> > > traditional
> > > pattern of a potential deadlock involving sysfs attributes:
> > > * One context obtains a mutex from inside a sysfs attribute method:
> > >   queue_attr_store() obtains q->sysfs_lock.
> > > * Another context removes a sysfs attribute while holding a mutex:
> > >   blk_unregister_queue() removes the queue sysfs attributes while holding
> > >   q->sysfs_lock.
> > > 
> > > This can result in a real deadlock because the code that removes sysfs 
> > > objects
> > > waits until all ongoing attribute callbacks have finished.
> > > 
> > > Since commit 667257e8b298 ("block: properly protect the 'queue' kobj in
> > > blk_unregister_queue") modified blk_unregister_queue() such that 
> > > q->sysfs_lock
> > > is held around the kobject_del(>kobj) call I think this is a regression
> > > introduced by that commit.
> > 
> > Sure, of course it is a regression.
> > 
> > Aside from moving the mutex_unlock(>sysfs_lock) above the
> > kobject_del(>kobj) I don't know how to fix it.
> > 
> > Though, realistically that'd be an adequate fix (given the way the code
> > was before).
> 
> Any chance you apply this and re-run your srp_test that triggered the
> lockdep splat?
> 
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 4a6a40ffd78e..c50e08e9bf17 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -952,10 +952,10 @@ void blk_unregister_queue(struct gendisk *disk)
>   if (q->request_fn || (q->mq_ops && q->elevator))
>   elv_unregister_queue(q);
>  
> + mutex_unlock(>sysfs_lock);
> +
>   kobject_uevent(>kobj, KOBJ_REMOVE);
>   kobject_del(>kobj);
>   blk_trace_remove_sysfs(disk_to_dev(disk));
>   kobject_put(_to_dev(disk)->kobj);
> -
> - mutex_unlock(>sysfs_lock);
>  }

Hello Mike,

Today sysfs_lock protects a whole bunch of state information. I think for the
longer term we should split it into multiple mutexes. For the short term, please
have a look at the patch series I just posted on the linux-block mailing list.

Thanks,

Bart.

[PATCH 2/3] block: Document scheduler change code locking requirements

2018-01-16 Thread Bart Van Assche
This patch does not change any functionality.

Signed-off-by: Bart Van Assche 
---
 block/elevator.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/block/elevator.c b/block/elevator.c
index 4f00b53cd5fd..e87e9b43aba0 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -869,6 +869,8 @@ int elv_register_queue(struct request_queue *q)
struct elevator_queue *e = q->elevator;
int error;
 
+   lockdep_assert_held(>sysfs_lock);
+
error = kobject_add(>kobj, >kobj, "%s", "iosched");
if (!error) {
struct elv_fs_entry *attr = e->type->elevator_attrs;
@@ -889,6 +891,8 @@ int elv_register_queue(struct request_queue *q)
 
 void elv_unregister_queue(struct request_queue *q)
 {
+   lockdep_assert_held(>sysfs_lock);
+
if (q) {
struct elevator_queue *e = q->elevator;
 
@@ -965,6 +969,8 @@ static int elevator_switch_mq(struct request_queue *q,
 {
int ret;
 
+   lockdep_assert_held(>sysfs_lock);
+
blk_mq_freeze_queue(q);
blk_mq_quiesce_queue(q);
 
@@ -1010,6 +1016,8 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
bool old_registered = false;
int err;
 
+   lockdep_assert_held(>sysfs_lock);
+
if (q->mq_ops)
return elevator_switch_mq(q, new_e);
 
-- 
2.15.1



[PATCH 3/3] block: Protect less code with sysfs_lock in blk_{un,}register_queue()

2018-01-16 Thread Bart Van Assche
The __blk_mq_register_dev(), blk_mq_unregister_dev(),
elv_register_queue() and elv_unregister_queue() calls need to be
protected with sysfs_lock but other code in these functions not.
Hence protect only this code with sysfs_lock. This patch fixes a
locking inversion issue in blk_unregister_queue() and also in an
error path of blk_register_queue(): it is not allowed to hold
sysfs_lock around the kobject_del(>kobj) call.

Signed-off-by: Bart Van Assche 
---
 block/blk-sysfs.c | 13 -
 1 file changed, 4 insertions(+), 9 deletions(-)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 4a6a40ffd78e..e9ce45ff0ef2 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -909,11 +909,12 @@ int blk_register_queue(struct gendisk *disk)
if (q->request_fn || (q->mq_ops && q->elevator)) {
ret = elv_register_queue(q);
if (ret) {
+   mutex_unlock(>sysfs_lock);
kobject_uevent(>kobj, KOBJ_REMOVE);
kobject_del(>kobj);
blk_trace_remove_sysfs(dev);
kobject_put(>kobj);
-   goto unlock;
+   return ret;
}
}
ret = 0;
@@ -934,28 +935,22 @@ void blk_unregister_queue(struct gendisk *disk)
if (!test_bit(QUEUE_FLAG_REGISTERED, >queue_flags))
return;
 
-   /*
-* Protect against the 'queue' kobj being accessed
-* while/after it is removed.
-*/
-   mutex_lock(>sysfs_lock);
-
spin_lock_irq(q->queue_lock);
queue_flag_clear(QUEUE_FLAG_REGISTERED, q);
spin_unlock_irq(q->queue_lock);
 
wbt_exit(q);
 
+   mutex_lock(>sysfs_lock);
if (q->mq_ops)
blk_mq_unregister_dev(disk_to_dev(disk), q);
 
if (q->request_fn || (q->mq_ops && q->elevator))
elv_unregister_queue(q);
+   mutex_unlock(>sysfs_lock);
 
kobject_uevent(>kobj, KOBJ_REMOVE);
kobject_del(>kobj);
blk_trace_remove_sysfs(disk_to_dev(disk));
kobject_put(_to_dev(disk)->kobj);
-
-   mutex_unlock(>sysfs_lock);
 }
-- 
2.15.1



[PATCH 0/3] Avoid that blk_{un,}register_queue() trigger lock inversion

2018-01-16 Thread Bart Van Assche
Hello Jens,

The three patches in this series are what I came up with after having
analyzed a lockdep complaint triggered by blk_unregister_queue. Please
consider these patches for kernel v4.16.

Thanks,

Bart.

Bart Van Assche (3):
  block: Unexport elv_register_queue() and elv_unregister_queue()
  block: Document scheduler change code locking requirements
  block: Protect less code with sysfs_lock in blk_{un,}register_queue()

 block/blk-sysfs.c| 13 -
 block/blk.h  |  3 +++
 block/elevator.c | 10 --
 include/linux/elevator.h |  2 --
 4 files changed, 15 insertions(+), 13 deletions(-)

-- 
2.15.1



[PATCH 1/3] block: Unexport elv_register_queue() and elv_unregister_queue()

2018-01-16 Thread Bart Van Assche
These two functions are only called from inside the block layer so
unexport them.

Signed-off-by: Bart Van Assche 
---
 block/blk.h  | 3 +++
 block/elevator.c | 2 --
 include/linux/elevator.h | 2 --
 3 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/block/blk.h b/block/blk.h
index 662005e46560..46db5dc83dcb 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -162,6 +162,9 @@ static inline void elv_deactivate_rq(struct request_queue *q, struct request *rq
e->type->ops.sq.elevator_deactivate_req_fn(q, rq);
 }
 
+int elv_register_queue(struct request_queue *q);
+void elv_unregister_queue(struct request_queue *q);
+
 struct hd_struct *__disk_get_part(struct gendisk *disk, int partno);
 
 #ifdef CONFIG_FAIL_IO_TIMEOUT
diff --git a/block/elevator.c b/block/elevator.c
index 138faeb08a7c..4f00b53cd5fd 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -886,7 +886,6 @@ int elv_register_queue(struct request_queue *q)
}
return error;
 }
-EXPORT_SYMBOL(elv_register_queue);
 
 void elv_unregister_queue(struct request_queue *q)
 {
@@ -900,7 +899,6 @@ void elv_unregister_queue(struct request_queue *q)
wbt_enable_default(q);
}
 }
-EXPORT_SYMBOL(elv_unregister_queue);
 
 int elv_register(struct elevator_type *e)
 {
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 3d794b3dc532..6d9e230dffd2 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -198,8 +198,6 @@ extern bool elv_attempt_insert_merge(struct request_queue *, struct request *);
 extern void elv_requeue_request(struct request_queue *, struct request *);
 extern struct request *elv_former_request(struct request_queue *, struct request *);
 extern struct request *elv_latter_request(struct request_queue *, struct request *);
-extern int elv_register_queue(struct request_queue *q);
-extern void elv_unregister_queue(struct request_queue *q);
 extern int elv_may_queue(struct request_queue *, unsigned int);
 extern void elv_completed_request(struct request_queue *, struct request *);
 extern int elv_set_request(struct request_queue *q, struct request *rq,
-- 
2.15.1



Re: [for-4.16 PATCH v4-mike 2/2] blk-mq: issue request directly for blk_insert_cloned_request

2018-01-16 Thread Mike Snitzer
On Tue, Jan 16 2018 at 12:41pm -0500,
Jens Axboe  wrote:

> On 1/16/18 10:38 AM, Mike Snitzer wrote:
> > On Tue, Jan 16 2018 at 12:20pm -0500,
> > Jens Axboe  wrote:
> > 
> >> On 1/16/18 8:01 AM, Mike Snitzer wrote:
> >>> From: Ming Lei 
> >>>
> >>> blk_insert_cloned_request() is called in fast path of dm-rq driver, and
> >>> in this function we append request to hctx->dispatch_list of the 
> >>> underlying
> >>> queue directly.
> >>>
> >>> 1) This way isn't efficient enough because hctx lock is always required
> >>>
> >>> 2) With blk_insert_cloned_request(), we bypass underlying queue's IO
> >>> scheduler totally, and depend on DM rq driver to do IO schedule
> >>> completely.  But DM rq driver can't get underlying queue's dispatch
> >>> feedback at all, and this information is extreamly useful for IO merge.
> >>> Without that IO merge can't be done basically by blk-mq, which causes
> >>> very bad sequential IO performance.
> >>>
> >>> Fix this by having blk_insert_cloned_request() make use of
> >>> blk_mq_try_issue_directly() via blk_mq_request_direct_issue().
> >>> blk_mq_request_direct_issue() allows a request to be dispatched to be
> >>> issue directly to the underlying queue and provides dispatch result to
> >>> dm-rq and blk-mq.
> >>>
> >>> With this, the DM's blk-mq sequential IO performance is vastly
> >>> improved (as much as 3X in mpath/virtio-scsi testing).
> >>
> >> This still feels pretty hacky...
> >>
> >>> diff --git a/block/blk-mq.c b/block/blk-mq.c
> >>> index 55f3a27fb2e6..3168a13cb012 100644
> >>> --- a/block/blk-mq.c
> >>> +++ b/block/blk-mq.c
> >>> @@ -1706,6 +1706,12 @@ static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
> >>>   blk_qc_t new_cookie;
> >>>   blk_status_t ret = BLK_STS_OK;
> >>>   bool run_queue = true;
> >>> + /*
> >>> +  * If @cookie is NULL do not insert the request, this mode is used
> >>> +  * by blk_insert_cloned_request() via blk_mq_request_direct_issue()
> >>> +  */
> >>> + bool dispatch_only = !cookie;
> >>> + bool need_insert = false;
> >>
> >> Overloading 'cookie' to also mean this isn't very future proof or solid.
> > 
> > It enables the existing interface to be used without needing to prop up
> > something else that extends out to the edge (blk_insert_cloned_request).
> 
> Doesn't really matter if the end result is too ugly/fragile to live.

Agreed.
 
> >> And now __blk_mq_try_issue_directly() is pretty much a mess. Feels like
> >> it should be split in two, where the other half would do the actual
> >> insert. Then let the caller do it, if we could not issue directly. That
> >> would be a lot more solid and easier to read.
> > 
> > That is effectively what Ming's variant did (by splitting out the issue
> > to a helper).
> > 
> > BUT I'll see what I can come up with...
> 
> Maybe I missed that version, there were many rapid fire versions posted.
> Please just take your time and get it right, that's much more important.

Not trying to rush, going over it carefully now.

Think I have a cleaner way forward.

Thanks,
Mike


Re: [for-4.16 PATCH v4-mike 2/2] blk-mq: issue request directly for blk_insert_cloned_request

2018-01-16 Thread Jens Axboe
On 1/16/18 10:38 AM, Mike Snitzer wrote:
> On Tue, Jan 16 2018 at 12:20pm -0500,
> Jens Axboe  wrote:
> 
>> On 1/16/18 8:01 AM, Mike Snitzer wrote:
>>> From: Ming Lei 
>>>
>>> blk_insert_cloned_request() is called in fast path of dm-rq driver, and
>>> in this function we append request to hctx->dispatch_list of the underlying
>>> queue directly.
>>>
>>> 1) This way isn't efficient enough because hctx lock is always required
>>>
>>> 2) With blk_insert_cloned_request(), we bypass underlying queue's IO
>>> scheduler totally, and depend on DM rq driver to do IO schedule
>>> completely.  But DM rq driver can't get underlying queue's dispatch
>>> feedback at all, and this information is extreamly useful for IO merge.
>>> Without that IO merge can't be done basically by blk-mq, which causes
>>> very bad sequential IO performance.
>>>
>>> Fix this by having blk_insert_cloned_request() make use of
>>> blk_mq_try_issue_directly() via blk_mq_request_direct_issue().
>>> blk_mq_request_direct_issue() allows a request to be dispatched to be
>>> issue directly to the underlying queue and provides dispatch result to
>>> dm-rq and blk-mq.
>>>
>>> With this, the DM's blk-mq sequential IO performance is vastly
>>> improved (as much as 3X in mpath/virtio-scsi testing).
>>
>> This still feels pretty hacky...
>>
>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>> index 55f3a27fb2e6..3168a13cb012 100644
>>> --- a/block/blk-mq.c
>>> +++ b/block/blk-mq.c
>>> @@ -1706,6 +1706,12 @@ static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
>>> blk_qc_t new_cookie;
>>> blk_status_t ret = BLK_STS_OK;
>>> bool run_queue = true;
>>> +   /*
>>> +* If @cookie is NULL do not insert the request, this mode is used
>>> +* by blk_insert_cloned_request() via blk_mq_request_direct_issue()
>>> +*/
>>> +   bool dispatch_only = !cookie;
>>> +   bool need_insert = false;
>>
>> Overloading 'cookie' to also mean this isn't very future proof or solid.
> 
> It enables the existing interface to be used without needing to prop up
> something else that extends out to the edge (blk_insert_cloned_request).

Doesn't really matter if the end result is too ugly/fragile to live.

>> And now __blk_mq_try_issue_directly() is pretty much a mess. Feels like
>> it should be split in two, where the other half would do the actual
>> insert. Then let the caller do it, if we could not issue directly. That
>> would be a lot more solid and easier to read.
> 
> That is effectively what Ming's variant did (by splitting out the issue
> to a helper).
> 
> BUT I'll see what I can come up with...

Maybe I missed that version, there were many rapid fire versions posted.
Please just take your time and get it right, that's much more important.

-- 
Jens Axboe



Re: [for-4.16 PATCH v4-mike 2/2] blk-mq: issue request directly for blk_insert_cloned_request

2018-01-16 Thread Mike Snitzer
On Tue, Jan 16 2018 at 12:20pm -0500,
Jens Axboe  wrote:

> On 1/16/18 8:01 AM, Mike Snitzer wrote:
> > From: Ming Lei 
> > 
> > blk_insert_cloned_request() is called in fast path of dm-rq driver, and
> > in this function we append request to hctx->dispatch_list of the underlying
> > queue directly.
> > 
> > 1) This way isn't efficient enough because hctx lock is always required
> > 
> > 2) With blk_insert_cloned_request(), we bypass underlying queue's IO
> > scheduler totally, and depend on DM rq driver to do IO schedule
> > completely.  But DM rq driver can't get underlying queue's dispatch
> > feedback at all, and this information is extreamly useful for IO merge.
> > Without that IO merge can't be done basically by blk-mq, which causes
> > very bad sequential IO performance.
> > 
> > Fix this by having blk_insert_cloned_request() make use of
> > blk_mq_try_issue_directly() via blk_mq_request_direct_issue().
> > blk_mq_request_direct_issue() allows a request to be dispatched to be
> > issue directly to the underlying queue and provides dispatch result to
> > dm-rq and blk-mq.
> > 
> > With this, the DM's blk-mq sequential IO performance is vastly
> > improved (as much as 3X in mpath/virtio-scsi testing).
> 
> This still feels pretty hacky...
> 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 55f3a27fb2e6..3168a13cb012 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -1706,6 +1706,12 @@ static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
> > blk_qc_t new_cookie;
> > blk_status_t ret = BLK_STS_OK;
> > bool run_queue = true;
> > +   /*
> > +* If @cookie is NULL do not insert the request, this mode is used
> > +* by blk_insert_cloned_request() via blk_mq_request_direct_issue()
> > +*/
> > +   bool dispatch_only = !cookie;
> > +   bool need_insert = false;
> 
> Overloading 'cookie' to also mean this isn't very future proof or solid.

It enables the existing interface to be used without needing to prop up
something else that extends out to the edge (blk_insert_cloned_request).
 
> And now __blk_mq_try_issue_directly() is pretty much a mess. Feels like
> it should be split in two, where the other half would do the actual
> insert. Then let the caller do it, if we could not issue directly. That
> would be a lot more solid and easier to read.

That is effectively what Ming's variant did (by splitting out the issue
to a helper).

BUT I'll see what I can come up with...

(Ming please stand down until you hear back from me ;)

Thanks,
Mike


Re: [for-4.16 PATCH v4-mike 2/2] blk-mq: issue request directly for blk_insert_cloned_request

2018-01-16 Thread Jens Axboe
On 1/16/18 8:01 AM, Mike Snitzer wrote:
> From: Ming Lei 
> 
> blk_insert_cloned_request() is called in fast path of dm-rq driver, and
> in this function we append request to hctx->dispatch_list of the underlying
> queue directly.
> 
> 1) This way isn't efficient enough because hctx lock is always required
> 
> 2) With blk_insert_cloned_request(), we bypass underlying queue's IO
> scheduler totally, and depend on DM rq driver to do IO schedule
> completely.  But DM rq driver can't get underlying queue's dispatch
> feedback at all, and this information is extreamly useful for IO merge.
> Without that IO merge can't be done basically by blk-mq, which causes
> very bad sequential IO performance.
> 
> Fix this by having blk_insert_cloned_request() make use of
> blk_mq_try_issue_directly() via blk_mq_request_direct_issue().
> blk_mq_request_direct_issue() allows a request to be dispatched to be
> issue directly to the underlying queue and provides dispatch result to
> dm-rq and blk-mq.
> 
> With this, the DM's blk-mq sequential IO performance is vastly
> improved (as much as 3X in mpath/virtio-scsi testing).

This still feels pretty hacky...

> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 55f3a27fb2e6..3168a13cb012 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1706,6 +1706,12 @@ static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
>   blk_qc_t new_cookie;
>   blk_status_t ret = BLK_STS_OK;
>   bool run_queue = true;
> + /*
> +  * If @cookie is NULL do not insert the request, this mode is used
> +  * by blk_insert_cloned_request() via blk_mq_request_direct_issue()
> +  */
> + bool dispatch_only = !cookie;
> + bool need_insert = false;

Overloading 'cookie' to also mean this isn't very future proof or solid.

And now __blk_mq_try_issue_directly() is pretty much a mess. Feels like
it should be split in two, where the other half would do the actual
insert. Then let the caller do it, if we could not issue directly. That
would be a lot more solid and easier to read.
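
Something with roughly this shape, perhaps (sketch only, not compile
tested):

static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
					    struct request *rq,
					    blk_qc_t *cookie)
{
	struct request_queue *q = rq->q;
	struct blk_mq_queue_data bd = {
		.rq = rq,
		.last = true,
	};

	/* the caller falls back to an insert if we can't get tag or budget */
	if (!blk_mq_get_driver_tag(rq, NULL, false))
		return BLK_STS_RESOURCE;

	if (!blk_mq_get_dispatch_budget(hctx)) {
		blk_mq_put_driver_tag(rq);
		return BLK_STS_RESOURCE;
	}

	*cookie = request_to_qc_t(hctx, rq);
	return q->mq_ops->queue_rq(hctx, &bd);
}

Then __blk_mq_try_issue_directly() keeps the quiesced/elevator checks
and does the insert itself when the helper returns BLK_STS_RESOURCE,
while blk_insert_cloned_request() can call the helper and let dm-rq
deal with BLK_STS_RESOURCE.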

-- 
Jens Axboe



Re: [for-4.16 PATCH v4-mike 2/2] blk-mq: issue request directly for blk_insert_cloned_request

2018-01-16 Thread Mike Snitzer
On Tue, Jan 16 2018 at 10:01P -0500,
Mike Snitzer  wrote:

> From: Ming Lei 
> 
> blk_insert_cloned_request() is called in fast path of dm-rq driver, and
> in this function we append request to hctx->dispatch_list of the underlying
> queue directly.
> 
> 1) This way isn't efficient enough because hctx lock is always required
> 
> 2) With blk_insert_cloned_request(), we bypass underlying queue's IO
> scheduler totally, and depend on DM rq driver to do IO schedule
> completely.  But DM rq driver can't get underlying queue's dispatch
> feedback at all, and this information is extreamly useful for IO merge.
> Without that IO merge can't be done basically by blk-mq, which causes
> very bad sequential IO performance.
> 
> Fix this by having blk_insert_cloned_request() make use of
> blk_mq_try_issue_directly() via blk_mq_request_direct_issue().
> blk_mq_request_direct_issue() allows a request to be dispatched to be
> issue directly to the underlying queue and provides dispatch result to
> dm-rq and blk-mq.
> 
> With this, the DM's blk-mq sequential IO performance is vastly
> improved (as much as 3X in mpath/virtio-scsi testing).
> 
> Signed-off-by: Ming Lei 
> Signed-off-by: Mike Snitzer 
> ---
>  block/blk-core.c   |  3 +--
>  block/blk-mq.c | 42 --
>  block/blk-mq.h |  3 +++
>  drivers/md/dm-rq.c | 19 ---
>  4 files changed, 56 insertions(+), 11 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 7ba607527487..55f338020254 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2500,8 +2500,7 @@ blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request *
>* bypass a potential scheduler on the bottom device for
>* insert.
>*/
> - blk_mq_request_bypass_insert(rq, true);
> - return BLK_STS_OK;
> + return blk_mq_request_direct_issue(rq);
>   }
>  
>   spin_lock_irqsave(q->queue_lock, flags);
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 55f3a27fb2e6..3168a13cb012 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1706,6 +1706,12 @@ static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
>   blk_qc_t new_cookie;
>   blk_status_t ret = BLK_STS_OK;
>   bool run_queue = true;
> + /*
> +  * If @cookie is NULL do not insert the request, this mode is used
> +  * by blk_insert_cloned_request() via blk_mq_request_direct_issue()
> +  */
> + bool dispatch_only = !cookie;
> + bool need_insert = false;
>  
>   /* RCU or SRCU read lock is needed before checking quiesced flag */
>   if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q)) {
> @@ -1713,25 +1719,38 @@ static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
>   goto insert;
>   }
>  
> - if (q->elevator)
> + if (q->elevator && !dispatch_only)
>   goto insert;
>  
>   if (!blk_mq_get_driver_tag(rq, NULL, false))
> - goto insert;
> + need_insert = true;
>  
> - if (!blk_mq_get_dispatch_budget(hctx)) {
> + if (!need_insert && !blk_mq_get_dispatch_budget(hctx)) {
>   blk_mq_put_driver_tag(rq);
> + need_insert = true;
> + }
> +
> + if (need_insert) {
> + if (dispatch_only)
> + return BLK_STS_RESOURCE;
>   goto insert;
>   }
>  
>   new_cookie = request_to_qc_t(hctx, rq);
>  
> + ret = q->mq_ops->queue_rq(hctx, );
> +
> + if (dispatch_only) {
> + if (ret == BLK_STS_RESOURCE)
> + __blk_mq_requeue_request(rq);
> + return ret;
> + }
> +
>   /*
>* For OK queue, we are done. For error, kill it. Any other
>* error (busy), just add it to our list as we previously
>* would have done
>*/
> - ret = q->mq_ops->queue_rq(hctx, );
>   switch (ret) {
>   case BLK_STS_OK:
>   *cookie = new_cookie;

FYI, because blk_mq_put_driver_tag() is idempotent, the above can be
simplified by eliminating 'need_insert' like so (Jens feel free to fold
this in?):

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3168a13cb012..c2f66a5c5242 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1711,7 +1711,6 @@ static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 * by blk_insert_cloned_request() via blk_mq_request_direct_issue()
 */
bool dispatch_only = !cookie;
-   bool need_insert = false;
 
/* RCU or SRCU read lock is needed before checking quiesced flag */
if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q)) {
@@ -1722,15 +1721,9 @@ static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
if (q->elevator && !dispatch_only)
goto 

Re: [PATCH 0/2] genirq/affinity: try to make sure online CPU is assgined to irq vector

2018-01-16 Thread Ming Lei
On Tue, Jan 16, 2018 at 03:22:18PM +, Don Brace wrote:
> > -Original Message-
> > From: Laurence Oberman [mailto:lober...@redhat.com]
> > Sent: Tuesday, January 16, 2018 7:29 AM
> > To: Thomas Gleixner ; Ming Lei 
> > Cc: Christoph Hellwig ; Jens Axboe ;
> > linux-block@vger.kernel.org; linux-ker...@vger.kernel.org; Mike Snitzer
> > ; Don Brace 
> > Subject: Re: [PATCH 0/2] genirq/affinity: try to make sure online CPU is 
> > assgined
> > to irq vector
> > 
> > > > It is because of irq_create_affinity_masks().
> > >
> > > That still does not answer the question. If the interrupt for a queue
> > > is
> > > assigned to an offline CPU, then the queue should not be used and
> > > never
> > > raise an interrupt. That's how managed interrupts have been designed.
> > >
> > > Thanks,
> > >
> > >   tglx
> > >
> > >
> > >
> > >
> > 
> > I captured a full boot log for this issue for Microsemi, I will send it
> > to Don Brace.
> > I enabled all the HPSA debug and here is snippet
> > 
> > 
> > ..
> > ..
> > ..
> >   246.751135] INFO: task systemd-udevd:413 blocked for more than 120
> > seconds.
> > [  246.788008]   Tainted: G  I  4.15.0-rc4.noming+ #1
> > [  246.822380] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > [  246.865594] systemd-udevd   D0   413411 0x8004
> > [  246.895519] Call Trace:
> > [  246.909713]  ? __schedule+0x340/0xc20
> > [  246.930236]  schedule+0x32/0x80
> > [  246.947905]  schedule_timeout+0x23d/0x450
> > [  246.970047]  ? find_held_lock+0x2d/0x90
> > [  246.991774]  ? wait_for_completion_io+0x108/0x170
> > [  247.018172]  io_schedule_timeout+0x19/0x40
> > [  247.041208]  wait_for_completion_io+0x110/0x170
> > [  247.067326]  ? wake_up_q+0x70/0x70
> > [  247.086801]  hpsa_scsi_do_simple_cmd+0xc6/0x100 [hpsa]
> > [  247.114315]  hpsa_scsi_do_simple_cmd_with_retry+0xb7/0x1c0 [hpsa]
> > [  247.146629]  hpsa_scsi_do_inquiry+0x73/0xd0 [hpsa]
> > [  247.174118]  hpsa_init_one+0x12cb/0x1a59 [hpsa]
> 
> This trace comes from internally generated discovery commands. No SCSI 
> devices have
> been presented to the SML yet.
> 
> At this point we should be running on only one CPU. These commands are meant 
> to use
> reply queue 0 which are tied to CPU 0. It's interesting that the patch helps.

In hpsa_interrupt_mode(), you pass PCI_IRQ_AFFINITY to pci_alloc_irq_vectors(),
which may spread one irq vector across all offline CPUs. That is the cause of
this hang reported by Laurence from my observation.

BTW, if the interrupt handler for the reply queue isn't performance sensitive,
maybe PCI_IRQ_AFFINITY can be removed for avoiding this issue.
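
I.e. something along these lines in hpsa_interrupt_mode() (illustration
only, the exact min/max vector arguments used by hpsa may differ):

	/* today: managed affinity, vectors spread over all possible CPUs */
	ret = pci_alloc_irq_vectors(h->pdev, 1, MAX_REPLY_QUEUES,
				    PCI_IRQ_MSIX | PCI_IRQ_AFFINITY);

	/* without managed affinity: plain MSI-X, so a vector is never
	 * left with an all-offline CPU mask */
	ret = pci_alloc_irq_vectors(h->pdev, 1, MAX_REPLY_QUEUES,
				    PCI_IRQ_MSIX);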

But anyway, as I replied in this thread, this patch still improves irq
vectors spread.

Thanks,
Ming


Re: [PATCH 0/2] genirq/affinity: try to make sure online CPU is assgined to irq vector

2018-01-16 Thread Laurence Oberman
On Tue, 2018-01-16 at 15:22 +, Don Brace wrote:
> > -Original Message-
> > From: Laurence Oberman [mailto:lober...@redhat.com]
> > Sent: Tuesday, January 16, 2018 7:29 AM
> > To: Thomas Gleixner ; Ming Lei 
> > Cc: Christoph Hellwig ; Jens Axboe ;
> > linux-block@vger.kernel.org; linux-ker...@vger.kernel.org; Mike
> > Snitzer
> > ; Don Brace 
> > Subject: Re: [PATCH 0/2] genirq/affinity: try to make sure online
> > CPU is assgined
> > to irq vector
> > 
> > > > It is because of irq_create_affinity_masks().
> > > 
> > > That still does not answer the question. If the interrupt for a
> > > queue
> > > is
> > > assigned to an offline CPU, then the queue should not be used and
> > > never
> > > raise an interrupt. That's how managed interrupts have been
> > > designed.
> > > 
> > > Thanks,
> > > 
> > >   tglx
> > > 
> > > 
> > > 
> > > 
> > 
> > I captured a full boot log for this issue for Microsemi, I will
> > send it
> > to Don Brace.
> > I enabled all the HPSA debug and here is snippet
> > 
> > 
> > ..
> > ..
> > ..
> >   246.751135] INFO: task systemd-udevd:413 blocked for more than
> > 120
> > seconds.
> > [  246.788008]   Tainted: G  I  4.15.0-rc4.noming+
> > #1
> > [  246.822380] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > [  246.865594] systemd-udevd   D0   413411 0x8004
> > [  246.895519] Call Trace:
> > [  246.909713]  ? __schedule+0x340/0xc20
> > [  246.930236]  schedule+0x32/0x80
> > [  246.947905]  schedule_timeout+0x23d/0x450
> > [  246.970047]  ? find_held_lock+0x2d/0x90
> > [  246.991774]  ? wait_for_completion_io+0x108/0x170
> > [  247.018172]  io_schedule_timeout+0x19/0x40
> > [  247.041208]  wait_for_completion_io+0x110/0x170
> > [  247.067326]  ? wake_up_q+0x70/0x70
> > [  247.086801]  hpsa_scsi_do_simple_cmd+0xc6/0x100 [hpsa]
> > [  247.114315]  hpsa_scsi_do_simple_cmd_with_retry+0xb7/0x1c0
> > [hpsa]
> > [  247.146629]  hpsa_scsi_do_inquiry+0x73/0xd0 [hpsa]
> > [  247.174118]  hpsa_init_one+0x12cb/0x1a59 [hpsa]
> 
> This trace comes from internally generated discovery commands. No
> SCSI devices have
> been presented to the SML yet.
> 
> At this point we should be running on only one CPU. These commands
> are meant to use
> reply queue 0 which are tied to CPU 0. It's interesting that the
> patch helps.
> 
> However, I was wondering if you could inspect the iLo IML logs and
> send the
> AHS logs for inspection.
> 
> Thanks,
> Don Brace
> ESC - Smart Storage
> Microsemi Corporation


Hello Don

I took two other dl380 g7's and ran the same kernel and it hangs in the
identical place. It's absolutely consistent here.

I doubt all three have hardware issues.

Nothing is logged of interest in the IML.

Ming will have more to share on specifically why it helps.
I think he sent that along to you already.

Regards
Laurence



Re: [PATCH 2/2] blk-mq: simplify queue mapping & schedule with each possisble CPU

2018-01-16 Thread Ming Lei
On Tue, Jan 16, 2018 at 10:31:42PM +0800, jianchao.wang wrote:
> Hi minglei
> 
> On 01/16/2018 08:10 PM, Ming Lei wrote:
> >>> - next_cpu = cpumask_next(hctx->next_cpu, hctx->cpumask);
> >>> + next_cpu = cpumask_next_and(hctx->next_cpu, hctx->cpumask,
> >>> + cpu_online_mask);
> >>>   if (next_cpu >= nr_cpu_ids)
> >>> - next_cpu = cpumask_first(hctx->cpumask);
> >>> + next_cpu = 
> >>> cpumask_first_and(hctx->cpumask,cpu_online_mask);
> >> the next_cpu here could be >= nr_cpu_ids when none of the CPUs in
> >> hctx->cpumask is online.
> > That supposes not happen because storage device(blk-mq hw queue) is
> > generally C/S model, that means the queue becomes only active when
> > there is online CPU mapped to it.
> > 
> > But it won't be true for non-block-IO queue, such as HPSA's queues[1], and
> > network controller RX queues.
> > 
> > [1] 
> > https://urldefense.proofpoint.com/v2/url?u=https-3A__marc.info_-3Fl-3Dlinux-2Dkernel-26m-3D151601867018444-26w-3D2=DwIBAg=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE=7WdAxUBeiTUTCy8v-7zXyr4qk7sx26ATvfo6QSTvZyQ=tCZdQH6JUW1dkNCN92ycoUoKfDU_qWj-7EsUoYpOeJ0=vgHC9sbjYQb7mtY9MUJzbVXyVEyjoNJPWEx4_rfrHxU=
> > 
> > One thing I am still not sure(but generic irq affinity supposes to deal with
> > well) is that the CPU may become offline after the IO is just submitted,
> > then where the IRQ controller delivers the interrupt of this hw queue
> > to?
> > 
> >> This could be reproduced on NVMe with a patch that could hold some rqs on 
> >> ctx->rq_list,
> >> meanwhile a script online and offline the cpus. Then a panic occurred in 
> >> __queue_work().
> > That shouldn't happen, when CPU offline happens the rqs in ctx->rq_list
> > are dispatched directly, please see blk_mq_hctx_notify_dead().
> 
> Yes, I know. blk_mq_hctx_notify_dead() will be invoked after the CPU has
> been set offline.
> Please refer to the following diagram.
> 
> CPU A  CPU T
>  kick  
>   _cpu_down() ->   cpuhp_thread_fun (cpuhpT kthread)
>AP_ACTIVE   (clear cpu_active_mask)
>  |
>  v
>AP_WORKQUEUE_ONLINE (unbind workers)
>  |
>  v
>TEARDOWN_CPU(stop_machine)
> ,   | execute
>  \_ _ _ _ _ _   v
> preempt  V  take_cpu_down ( migration 
> kthread)
> 
> set_cpu_online(smp_processor_id(), false) (__cpu_disable)  --> Here !!!
> TEARDOWN_CPU
> |
>  cpuhpT kthread is|  v
>  migrated away   ,AP_SCHED_STARTING 
> (migrate_tasks)
>  _ _ _ _ _ _ _ _/   |
> V   v
>   CPU X   AP_OFFLINE
> 
> |
> ,
>  _ _ _ _ _ /
> V
>   do_idle (idle task)
>  <_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ cpuhp_report_idle_dead
>  complete st->done_down
>__cpu_die (cpuhpT kthread, teardown_cpu) 
> 
>  AP_OFFLINE
>|
>v
>  BRINGUP_CPU
>|
>v
>  BLK_MQ_DEAD---> Here !!!
>|
>v
>  OFFLINE
> 
> The cpu has already been cleared from cpu_online_mask when blk_mq_hctx_notify_dead is
> invoked.
> If the device is NVMe, which has only one CPU mapped to the hctx,
> cpumask_first_and(hctx->cpumask, cpu_online_mask) will return a bad value.

Hi Jianchao,

OK, I got it. This should be the only corner case, in which
all CPUs mapped to this hctx become offline, and I believe the following
patch should address it. Could you give it a test?

---
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c376d1b6309a..23f0f3ddffcf 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1416,21 +1416,44 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx 
*hctx)
  */
 static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
 {
+   bool tried = false;
+
if (hctx->queue->nr_hw_queues == 1)
return WORK_CPU_UNBOUND;
 
if (--hctx->next_cpu_batch <= 0) {
int next_cpu;
+select_cpu:
 
next_cpu = cpumask_next_and(hctx->next_cpu, hctx->cpumask,
cpu_online_mask);
if 
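(The rest of the diff is cut off in this digest. For readers following along, a rough sketch of the direction being discussed here, based on the hunk above and on Jianchao's report: prefer an online CPU from hctx->cpumask and fall back to an unbound worker rather than ever handing back a CPU index >= nr_cpu_ids. This is an illustration only, not the patch that was eventually applied.)

/*
 * Illustrative sketch only -- not the actual patch, whose diff is
 * truncated above. Pick a CPU for running hctx work, preferring online
 * CPUs mapped to this hctx, and punt to an unbound worker when none of
 * them is online so callers never see a value >= nr_cpu_ids.
 */
static int sketch_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
{
	bool tried = false;
	int next_cpu;

	if (hctx->queue->nr_hw_queues == 1)
		return WORK_CPU_UNBOUND;

	if (--hctx->next_cpu_batch <= 0) {
select_cpu:
		next_cpu = cpumask_next_and(hctx->next_cpu, hctx->cpumask,
					    cpu_online_mask);
		if (next_cpu >= nr_cpu_ids)
			next_cpu = cpumask_first_and(hctx->cpumask,
						     cpu_online_mask);

		if (next_cpu >= nr_cpu_ids) {
			/*
			 * No online CPU is mapped to this hctx: retry once
			 * in case the online mask just changed, otherwise
			 * let an unbound worker run the queue.
			 */
			if (!tried) {
				tried = true;
				goto select_cpu;
			}
			return WORK_CPU_UNBOUND;
		}

		hctx->next_cpu = next_cpu;
		hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
	}
	return hctx->next_cpu;
}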

RE: [PATCH 0/2] genirq/affinity: try to make sure online CPU is assgined to irq vector

2018-01-16 Thread Don Brace
> -Original Message-
> From: Laurence Oberman [mailto:lober...@redhat.com]
> Sent: Tuesday, January 16, 2018 7:29 AM
> To: Thomas Gleixner ; Ming Lei 
> Cc: Christoph Hellwig ; Jens Axboe ;
> linux-block@vger.kernel.org; linux-ker...@vger.kernel.org; Mike Snitzer
> ; Don Brace 
> Subject: Re: [PATCH 0/2] genirq/affinity: try to make sure online CPU is 
> assgined
> to irq vector
> 
> > > It is because of irq_create_affinity_masks().
> >
> > That still does not answer the question. If the interrupt for a queue
> > is
> > assigned to an offline CPU, then the queue should not be used and
> > never
> > raise an interrupt. That's how managed interrupts have been designed.
> >
> > Thanks,
> >
> >   tglx
> >
> >
> >
> >
> 
> I captured a full boot log for this issue for Microsemi, I will send it
> to Don Brace.
> I enabled all the HPSA debug and here is snippet
> 
> 
> ..
> ..
> ..
>   246.751135] INFO: task systemd-udevd:413 blocked for more than 120
> seconds.
> [  246.788008]   Tainted: G  I  4.15.0-rc4.noming+ #1
> [  246.822380] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [  246.865594] systemd-udevd   D0   413411 0x8004
> [  246.895519] Call Trace:
> [  246.909713]  ? __schedule+0x340/0xc20
> [  246.930236]  schedule+0x32/0x80
> [  246.947905]  schedule_timeout+0x23d/0x450
> [  246.970047]  ? find_held_lock+0x2d/0x90
> [  246.991774]  ? wait_for_completion_io+0x108/0x170
> [  247.018172]  io_schedule_timeout+0x19/0x40
> [  247.041208]  wait_for_completion_io+0x110/0x170
> [  247.067326]  ? wake_up_q+0x70/0x70
> [  247.086801]  hpsa_scsi_do_simple_cmd+0xc6/0x100 [hpsa]
> [  247.114315]  hpsa_scsi_do_simple_cmd_with_retry+0xb7/0x1c0 [hpsa]
> [  247.146629]  hpsa_scsi_do_inquiry+0x73/0xd0 [hpsa]
> [  247.174118]  hpsa_init_one+0x12cb/0x1a59 [hpsa]

This trace comes from internally generated discovery commands. No SCSI devices
have been presented to the SML yet.

At this point we should be running on only one CPU. These commands are meant to
use reply queue 0, which is tied to CPU 0. It's interesting that the patch helps.

However, I was wondering if you could inspect the iLo IML logs and send the
AHS logs for inspection.

Thanks,
Don Brace
ESC - Smart Storage
Microsemi Corporation

> [  247.199851]  ? __pm_runtime_resume+0x55/0x70
> [  247.224527]  local_pci_probe+0x3f/0xa0
> [  247.246034]  pci_device_probe+0x146/0x1b0
> [  247.268413]  driver_probe_device+0x2b3/0x4a0
> [  247.291868]  __driver_attach+0xda/0xe0
> [  247.313370]  ? driver_probe_device+0x4a0/0x4a0
> [  247.338399]  bus_for_each_dev+0x6a/0xb0
> [  247.359912]  bus_add_driver+0x41/0x260
> [  247.380244]  driver_register+0x5b/0xd0
> [  247.400811]  ? 0xc016b000
> [  247.418819]  hpsa_init+0x38/0x1000 [hpsa]
> [  247.440763]  ? 0xc016b000
> [  247.459451]  do_one_initcall+0x4d/0x19c
> [  247.480539]  ? do_init_module+0x22/0x220
> [  247.502575]  ? rcu_read_lock_sched_held+0x64/0x70
> [  247.529549]  ? kmem_cache_alloc_trace+0x1f7/0x260
> [  247.556204]  ? do_init_module+0x22/0x220
> [  247.578633]  do_init_module+0x5a/0x220
> [  247.600322]  load_module+0x21e8/0x2a50
> [  247.621648]  ? __symbol_put+0x60/0x60
> [  247.642796]  SYSC_finit_module+0x94/0xe0
> [  247.665336]  entry_SYSCALL_64_fastpath+0x1f/0x96
> [  247.691751] RIP: 0033:0x7fc63d6527f9
> [  247.712308] RSP: 002b:7ffdf1659ba8 EFLAGS: 0246 ORIG_RAX:
> 0139
> [  247.755272] RAX: ffda RBX: 556b524c5f70 RCX:
> 7fc63d6527f9
> [  247.795779] RDX:  RSI: 7fc63df6f099 RDI:
> 0008
> [  247.836413] RBP: 7fc63df6f099 R08:  R09:
> 556b524be760
> [  247.876395] R10: 0008 R11: 0246 R12:
> 
> [  247.917597] R13: 556b524c5f10 R14: 0002 R15:
> 
> [  247.957272]
> [  247.957272] Showing all locks held in the system:
> [  247.992019] 1 lock held by khungtaskd/118:
> [  248.015019]  #0:  (tasklist_lock){.+.+}, at: [<4ef3538d>]
> debug_show_all_locks+0x39/0x1b0
> [  248.064600] 2 locks held by systemd-udevd/413:
> [  248.090031]  #0:  (>mutex){}, at: [<2a395ec8>]
> __driver_attach+0x4a/0xe0
> [  248.136620]  #1:  (>mutex){}, at: []
> __driver_attach+0x58/0xe0
> [  248.183245]
> [  248.191675] =
> [  248.191675]
> [  314.825134] dracut-initqueue[437]: Warning: dracut-initqueue timeout
> - starting timeout scripts
> [  315.368421] dracut-initqueue[437]: Warning: dracut-initqueue timeout
> - starting timeout scripts
> [  315.894373] dracut-initqueue[437]: Warning: dracut-initqueue timeout
> - starting timeout scripts
> [  316.418385] dracut-initqueue[437]: Warning: dracut-initqueue timeout
> - starting timeout scripts
> [  316.944461] 

[for-4.16 PATCH v4-mike 2/2] blk-mq: issue request directly for blk_insert_cloned_request

2018-01-16 Thread Mike Snitzer
From: Ming Lei 

blk_insert_cloned_request() is called in the fast path of the dm-rq driver, and
in this function we append the request directly to the hctx->dispatch_list of
the underlying queue.

1) This way isn't efficient enough because the hctx lock is always required.

2) With blk_insert_cloned_request(), we bypass the underlying queue's IO
scheduler entirely and depend on the DM rq driver to do all the IO
scheduling.  But the DM rq driver can't get the underlying queue's dispatch
feedback at all, and this information is extremely useful for IO merging.
Without it, blk-mq basically can't merge IO, which causes very bad
sequential IO performance.

Fix this by having blk_insert_cloned_request() make use of
blk_mq_try_issue_directly() via blk_mq_request_direct_issue().
blk_mq_request_direct_issue() allows a request to be issued directly
to the underlying queue and provides the dispatch result to
dm-rq and blk-mq.

With this, the DM's blk-mq sequential IO performance is vastly
improved (as much as 3X in mpath/virtio-scsi testing).

Signed-off-by: Ming Lei 
Signed-off-by: Mike Snitzer 
---
 block/blk-core.c   |  3 +--
 block/blk-mq.c | 42 --
 block/blk-mq.h |  3 +++
 drivers/md/dm-rq.c | 19 ---
 4 files changed, 56 insertions(+), 11 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 7ba607527487..55f338020254 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2500,8 +2500,7 @@ blk_status_t blk_insert_cloned_request(struct 
request_queue *q, struct request *
 * bypass a potential scheduler on the bottom device for
 * insert.
 */
-   blk_mq_request_bypass_insert(rq, true);
-   return BLK_STS_OK;
+   return blk_mq_request_direct_issue(rq);
}
 
spin_lock_irqsave(q->queue_lock, flags);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 55f3a27fb2e6..3168a13cb012 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1706,6 +1706,12 @@ static blk_status_t __blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
blk_qc_t new_cookie;
blk_status_t ret = BLK_STS_OK;
bool run_queue = true;
+   /*
+* If @cookie is NULL do not insert the request, this mode is used
+* by blk_insert_cloned_request() via blk_mq_request_direct_issue()
+*/
+   bool dispatch_only = !cookie;
+   bool need_insert = false;
 
/* RCU or SRCU read lock is needed before checking quiesced flag */
if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q)) {
@@ -1713,25 +1719,38 @@ static blk_status_t __blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
goto insert;
}
 
-   if (q->elevator)
+   if (q->elevator && !dispatch_only)
goto insert;
 
if (!blk_mq_get_driver_tag(rq, NULL, false))
-   goto insert;
+   need_insert = true;
 
-   if (!blk_mq_get_dispatch_budget(hctx)) {
+   if (!need_insert && !blk_mq_get_dispatch_budget(hctx)) {
blk_mq_put_driver_tag(rq);
+   need_insert = true;
+   }
+
+   if (need_insert) {
+   if (dispatch_only)
+   return BLK_STS_RESOURCE;
goto insert;
}
 
new_cookie = request_to_qc_t(hctx, rq);
 
+   ret = q->mq_ops->queue_rq(hctx, &bd);
+
+   if (dispatch_only) {
+   if (ret == BLK_STS_RESOURCE)
+   __blk_mq_requeue_request(rq);
+   return ret;
+   }
+
/*
 * For OK queue, we are done. For error, kill it. Any other
 * error (busy), just add it to our list as we previously
 * would have done
 */
-   ret = q->mq_ops->queue_rq(hctx, &bd);
switch (ret) {
case BLK_STS_OK:
*cookie = new_cookie;
@@ -1746,8 +1765,11 @@ static blk_status_t __blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
}
 
 insert:
-   blk_mq_sched_insert_request(rq, false, run_queue, false,
-   hctx->flags & BLK_MQ_F_BLOCKING);
+   if (!dispatch_only)
+   blk_mq_sched_insert_request(rq, false, run_queue, false,
+   hctx->flags & BLK_MQ_F_BLOCKING);
+   else
+   blk_mq_request_bypass_insert(rq, run_queue);
return ret;
 }
 
@@ -1767,6 +1789,14 @@ static blk_status_t blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
return ret;
 }
 
+blk_status_t blk_mq_request_direct_issue(struct request *rq)
+{
+   struct blk_mq_ctx *ctx = rq->mq_ctx;
+   struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(rq->q, ctx->cpu);
+
+   return blk_mq_try_issue_directly(hctx, rq, NULL);
+}
+
 static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 {
const int is_sync = op_is_sync(bio->bi_opf);
diff --git 
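(The remaining hunks, including the drivers/md/dm-rq.c change, are cut off in this digest. Purely as a hedged illustration of how a request-based stacking driver such as dm-rq is expected to consume the dispatch result that blk_insert_cloned_request() now returns, here is a small sketch; the my_*() names are made up and do not exist in the kernel.)

/*
 * Hypothetical caller-side sketch, not the dm-rq hunk from this patch:
 * act on the dispatch feedback instead of assuming BLK_STS_OK.
 */
static void my_requeue_clone_later(struct request *clone)
{
	/* stand-in for dm-rq's delayed requeue of the clone */
}

static int my_issue_clone(struct request_queue *q, struct request *clone)
{
	blk_status_t ret = blk_insert_cloned_request(q, clone);

	switch (ret) {
	case BLK_STS_OK:
		return 0;	/* dispatched to the low-level driver */
	case BLK_STS_RESOURCE:
		/* underlying queue is busy: back off so blk-mq can merge more IO */
		my_requeue_clone_later(clone);
		return -EAGAIN;
	default:
		return -EIO;	/* any other status is a hard error */
	}
}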

[for-4.16 PATCH v4-mike 1/2] blk-mq: return dispatch result from blk_mq_try_issue_directly

2018-01-16 Thread Mike Snitzer
From: Ming Lei 

In the following patch, we will use blk_mq_try_issue_directly() for DM
to return the dispatch result.  DM needs this information to improve
IO merging.

Signed-off-by: Ming Lei 
Signed-off-by: Mike Snitzer 
---
 block/blk-mq.c | 23 ++-
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index c8f62e6be6b6..55f3a27fb2e6 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1694,9 +1694,9 @@ static blk_qc_t request_to_qc_t(struct blk_mq_hw_ctx 
*hctx, struct request *rq)
return blk_tag_to_qc_t(rq->internal_tag, hctx->queue_num, true);
 }
 
-static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
-   struct request *rq,
-   blk_qc_t *cookie)
+static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
+   struct request *rq,
+   blk_qc_t *cookie)
 {
struct request_queue *q = rq->q;
struct blk_mq_queue_data bd = {
@@ -1704,7 +1704,7 @@ static void __blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
.last = true,
};
blk_qc_t new_cookie;
-   blk_status_t ret;
+   blk_status_t ret = BLK_STS_OK;
bool run_queue = true;
 
/* RCU or SRCU read lock is needed before checking quiesced flag */
@@ -1735,31 +1735,36 @@ static void __blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
switch (ret) {
case BLK_STS_OK:
*cookie = new_cookie;
-   return;
+   return ret;
case BLK_STS_RESOURCE:
__blk_mq_requeue_request(rq);
goto insert;
default:
*cookie = BLK_QC_T_NONE;
blk_mq_end_request(rq, ret);
-   return;
+   return ret;
}
 
 insert:
blk_mq_sched_insert_request(rq, false, run_queue, false,
hctx->flags & BLK_MQ_F_BLOCKING);
+   return ret;
 }
 
-static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
-   struct request *rq, blk_qc_t *cookie)
+static blk_status_t blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
+ struct request *rq,
+ blk_qc_t *cookie)
 {
int srcu_idx;
+   blk_status_t ret;
 
might_sleep_if(hctx->flags & BLK_MQ_F_BLOCKING);
 
hctx_lock(hctx, &srcu_idx);
-   __blk_mq_try_issue_directly(hctx, rq, cookie);
+   ret = __blk_mq_try_issue_directly(hctx, rq, cookie);
hctx_unlock(hctx, srcu_idx);
+
+   return ret;
 }
 
 static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
-- 
2.15.0



[LSF/MM TOPIC] A high-performance userspace block driver

2018-01-16 Thread Matthew Wilcox

I see the improvements that Facebook have been making to the nbd driver,
and I think that's a wonderful thing.  Maybe the outcome of this topic
is simply: "Shut up, Matthew, this is good enough".

It's clear that there's an appetite for userspace block devices; not for
swap devices or the root device, but for accessing data that's stored
in that silo over there, and I really don't want to bring that entire
mess of CORBA / Go / Rust / whatever into the kernel to get to it,
but it would be really handy to present it as a block device.

I've looked at a few block-driver-in-userspace projects that exist, and
they all seem pretty bad.  For example, one API maps a few gigabytes of
address space and plays games with vm_insert_page() to put page cache
pages into the address space of the client process.  Of course, the TLB
flush overhead of that solution is criminal.

I've looked at pipes, and they're not an awful solution.  We've almost
got enough syscalls to treat other objects as pipes.  The problem is
that they're not seekable.  So essentially you're looking at having one
pipe per outstanding command.  If you want to make good use of a modern
NAND device, you want a few hundred outstanding commands, and that's a
bit of a shoddy interface.

Right now, I'm leaning towards combining these two approaches; adding
a VM_NOTLB flag so the mmaped bits of the page cache never make it into
the process's address space, so the TLB shootdown can be safely skipped.
Then check it in follow_page_mask() and return the appropriate struct
page.  As long as the userspace process does everything using O_DIRECT,
I think this will work.

It's either that or make pipes seekable ...
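(To make the cost of that shape concrete, here is a tiny userspace sketch, an illustration only and not tied to any existing driver, of what "one pipe per outstanding command" implies at a queue depth of a few hundred.)

/*
 * Userspace-only sketch: QUEUE_DEPTH of 256 stands in for "a few hundred
 * outstanding commands". Nothing here talks to a real block device; it
 * only shows the file-descriptor cost of the pipe-per-command idea.
 */
#include <stdio.h>
#include <unistd.h>

#define QUEUE_DEPTH 256

int main(void)
{
	int cmd_pipe[QUEUE_DEPTH][2];

	for (int i = 0; i < QUEUE_DEPTH; i++) {
		if (pipe(cmd_pipe[i]) < 0) {
			perror("pipe");
			return 1;
		}
	}

	/* two fds per in-flight command, before any data has moved */
	printf("%d in-flight commands need %d file descriptors\n",
	       QUEUE_DEPTH, QUEUE_DEPTH * 2);
	return 0;
}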


Re: [PATCH 2/2] blk-mq: simplify queue mapping & schedule with each possisble CPU

2018-01-16 Thread jianchao.wang
Hi Ming Lei

On 01/16/2018 08:10 PM, Ming Lei wrote:
>>> -   next_cpu = cpumask_next(hctx->next_cpu, hctx->cpumask);
>>> +   next_cpu = cpumask_next_and(hctx->next_cpu, hctx->cpumask,
>>> +   cpu_online_mask);
>>> if (next_cpu >= nr_cpu_ids)
>>> -   next_cpu = cpumask_first(hctx->cpumask);
>>> +   next_cpu = 
>>> cpumask_first_and(hctx->cpumask,cpu_online_mask);
>> the next_cpu here could be >= nr_cpu_ids when none of the CPUs on hctx->cpumask
>> is online.
> That supposes not happen because storage device(blk-mq hw queue) is
> generally C/S model, that means the queue becomes only active when
> there is online CPU mapped to it.
> 
> But it won't be true for non-block-IO queue, such as HPSA's queues[1], and
> network controller RX queues.
> 
> [1] 
> https://marc.info/?l=linux-kernel&m=151601867018444&w=2
> 
> One thing I am still not sure(but generic irq affinity supposes to deal with
> well) is that the CPU may become offline after the IO is just submitted,
> then where the IRQ controller delivers the interrupt of this hw queue
> to?
> 
>> This could be reproduced on NVMe with a patch that could hold some rqs on 
>> ctx->rq_list,
>> meanwhile a script online and offline the cpus. Then a panic occurred in 
>> __queue_work().
> That shouldn't happen, when CPU offline happens the rqs in ctx->rq_list
> are dispatched directly, please see blk_mq_hctx_notify_dead().

Yes, I know. blk_mq_hctx_notify_dead() will be invoked after the CPU has
been set offline.
Please refer to the following diagram.

CPU A  CPU T
 kick  
  _cpu_down() ->   cpuhp_thread_fun (cpuhpT kthread)
   AP_ACTIVE   (clear cpu_active_mask)
 |
 v
   AP_WORKQUEUE_ONLINE (unbind workers)
 |
 v
   TEARDOWN_CPU(stop_machine)
,   | execute
 \_ _ _ _ _ _   v
preempt  V  take_cpu_down ( migration 
kthread)

set_cpu_online(smp_processor_id(), false) (__cpu_disable)  --> Here !!!
TEARDOWN_CPU
|
 cpuhpT kthread is|  v
 migrated away   ,AP_SCHED_STARTING 
(migrate_tasks)
 _ _ _ _ _ _ _ _/   |
V   v
  CPU X   AP_OFFLINE

|
,
 _ _ _ _ _ /
V
  do_idle (idle task)
 <_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ cpuhp_report_idle_dead
 complete st->done_down
   __cpu_die (cpuhpT kthread, teardown_cpu) 

 AP_OFFLINE
   |
   v
 BRINGUP_CPU
   |
   v
 BLK_MQ_DEAD---> Here !!!
   |
   v
 OFFLINE

The CPU has already been cleared from cpu_online_mask when blk_mq_hctx_notify_dead is
invoked.
If the device is NVMe, which has only one CPU mapped to the hctx,
cpumask_first_and(hctx->cpumask, cpu_online_mask) will return a bad value.

I even got a backtrace showing that the panic occurred in
blk_mq_hctx_notify_dead -> kblockd_schedule_delayed_work_on -> __queue_work.
Kdump doesn't work well on my machine, so I cannot share the backtrace here,
which is really sad.
I even added a BUG_ON as follows, and it could be triggered.

if (next_cpu >= nr_cpu_ids)
next_cpu = cpumask_first_and(hctx->cpumask,cpu_online_mask);
BUG_ON(next_cpu >= nr_cpu_ids);


Thanks
Jianchao
> 
>> maybe cpu_possible_mask here, the workers in the pool of the offlined cpu 
>> has been unbound.
>> It should be ok to queue on them.


Re: [PATCH 0/2] genirq/affinity: try to make sure online CPU is assgined to irq vector

2018-01-16 Thread Laurence Oberman
On Tue, 2018-01-16 at 12:25 +0100, Thomas Gleixner wrote:
> On Tue, 16 Jan 2018, Ming Lei wrote:
> 
> > On Mon, Jan 15, 2018 at 09:40:36AM -0800, Christoph Hellwig wrote:
> > > On Tue, Jan 16, 2018 at 12:03:43AM +0800, Ming Lei wrote:
> > > > Hi,
> > > > 
> > > > These two patches fixes IO hang issue reported by Laurence.
> > > > 
> > > > 84676c1f21 ("genirq/affinity: assign vectors to all possible
> > > > CPUs")
> > > > may cause one irq vector assigned to all offline CPUs, then
> > > > this vector
> > > > can't handle irq any more.
> > > 
> > > Well, that very much was the intention of managed
> > > interrupts.  Why
> > > does the device raise an interrupt for a queue that has no online
> > > cpu assigned to it?
> > 
> > It is because of irq_create_affinity_masks().
> 
> That still does not answer the question. If the interrupt for a queue
> is
> assigned to an offline CPU, then the queue should not be used and
> never
> raise an interrupt. That's how managed interrupts have been designed.
> 
> Thanks,
> 
>   tglx
> 
> 
> 
> 

I captured a full boot log for this issue for Microsemi and will send it
to Don Brace.
I enabled all the HPSA debug; here is a snippet:

[0.00] Kernel command line: BOOT_IMAGE=/vmlinuz-4.15.0-
rc4.noming+ root=/dev/mapper/rhel_ibclient-root ro crashkernel=512M@64M
 rd.lvm.lv=rhel_ibclient/root rd.lvm.lv=rhel_ibclient/swap
log_buf_len=54M console=ttyS1,115200n8 scsi_mod.use_blk_mq=y
dm_mod.use_blk_mq=y
[0.00] Memory: 7834908K/1002852K available (8397K kernel code,
3012K rwdata, 3660K rodata, 2184K init, 15344K bss, 2356808K reserved,
0K cma-reserved)
[0.00] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=32,
Nodes=2
[0.00] ftrace: allocating 33084 entries in 130 pages
[0.00] Running RCU self tests
[0.00] Hierarchical RCU implementation.
[0.00]  RCU lockdep checking is enabled.
[0.00]  RCU restricting CPUs from NR_CPUS=8192 to
nr_cpu_ids=32.
[0.00] RCU: Adjusting geometry for rcu_fanout_leaf=16,
nr_cpu_ids=32
[0.00] NR_IRQS: 524544, nr_irqs: 1088, preallocated irqs: 16
..
..
0.190147] smp: Brought up 2 nodes, 16 CPUs
[0.192006] smpboot: Max logical packages: 4
[0.193007] smpboot: Total of 16 processors activated (76776.33
BogoMIPS)
[0.940640] node 0 initialised, 10803218 pages in 743ms
[1.005449] node 1 initialised, 11812066 pages in 807ms
..
..
[7.440896] hpsa :05:00.0: can't disable ASPM; OS doesn't have
ASPM control
[7.442071] hpsa :05:00.0: Logical aborts not supported
[7.442075] hpsa :05:00.0: HP SSD Smart Path aborts not
supported
[7.442164] hpsa :05:00.0: Controller Configuration information
[7.442167] hpsa :05:00.0: 
[7.442173] hpsa :05:00.0:Signature = CISS
[7.442177] hpsa :05:00.0:Spec Number = 3
[7.442182] hpsa :05:00.0:Transport methods supported =
0x7a07
[7.442186] hpsa :05:00.0:Transport methods active = 0x3
[7.442190] hpsa :05:00.0:Requested transport Method = 0x2
[7.442194] hpsa :05:00.0:Coalesce Interrupt Delay = 0x0
[7.442198] hpsa :05:00.0:Coalesce Interrupt Count = 0x1
[7.442202] hpsa :05:00.0:Max outstanding commands = 1024
[7.442206] hpsa :05:00.0:Bus Types = 0x20
[7.442220] hpsa :05:00.0:Server Name = 2M21220149
[7.442224] hpsa :05:00.0:Heartbeat Counter = 0xd23
[7.442224] 
[7.442224] 
..
..
  246.751135] INFO: task systemd-udevd:413 blocked for more than 120
seconds.
[  246.788008]   Tainted: G  I  4.15.0-rc4.noming+ #1
[  246.822380] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  246.865594] systemd-udevd   D0   413411 0x8004
[  246.895519] Call Trace:
[  246.909713]  ? __schedule+0x340/0xc20
[  246.930236]  schedule+0x32/0x80
[  246.947905]  schedule_timeout+0x23d/0x450
[  246.970047]  ? find_held_lock+0x2d/0x90
[  246.991774]  ? wait_for_completion_io+0x108/0x170
[  247.018172]  io_schedule_timeout+0x19/0x40
[  247.041208]  wait_for_completion_io+0x110/0x170
[  247.067326]  ? wake_up_q+0x70/0x70
[  247.086801]  hpsa_scsi_do_simple_cmd+0xc6/0x100 [hpsa]
[  247.114315]  hpsa_scsi_do_simple_cmd_with_retry+0xb7/0x1c0 [hpsa]
[  247.146629]  hpsa_scsi_do_inquiry+0x73/0xd0 [hpsa]
[  247.174118]  hpsa_init_one+0x12cb/0x1a59 [hpsa]
[  247.199851]  ? __pm_runtime_resume+0x55/0x70
[  247.224527]  local_pci_probe+0x3f/0xa0
[  247.246034]  pci_device_probe+0x146/0x1b0
[  247.268413]  driver_probe_device+0x2b3/0x4a0
[  247.291868]  __driver_attach+0xda/0xe0
[  247.313370]  ? driver_probe_device+0x4a0/0x4a0
[  247.338399]  bus_for_each_dev+0x6a/0xb0
[  247.359912]  bus_add_driver+0x41/0x260
[  247.380244]  driver_register+0x5b/0xd0
[  247.400811]  ? 0xc016b000
[  247.418819]  hpsa_init+0x38/0x1000 [hpsa]
[  247.440763]  ? 0xc016b000
[  247.459451]  do_one_initcall+0x4d/0x19c
[  

Re: [PATCH 0/2] genirq/affinity: try to make sure online CPU is assgined to irq vector

2018-01-16 Thread Ming Lei
On Tue, Jan 16, 2018 at 12:25:19PM +0100, Thomas Gleixner wrote:
> On Tue, 16 Jan 2018, Ming Lei wrote:
> 
> > On Mon, Jan 15, 2018 at 09:40:36AM -0800, Christoph Hellwig wrote:
> > > On Tue, Jan 16, 2018 at 12:03:43AM +0800, Ming Lei wrote:
> > > > Hi,
> > > > 
> > > > These two patches fixes IO hang issue reported by Laurence.
> > > > 
> > > > 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
> > > > may cause one irq vector assigned to all offline CPUs, then this vector
> > > > can't handle irq any more.
> > > 
> > > Well, that very much was the intention of managed interrupts.  Why
> > > does the device raise an interrupt for a queue that has no online
> > > cpu assigned to it?
> > 
> > It is because of irq_create_affinity_masks().
> 
> That still does not answer the question. If the interrupt for a queue is
> assigned to an offline CPU, then the queue should not be used and never
> raise an interrupt. That's how managed interrupts have been designed.

Sorry for not answering it in the 1st place, but later I realized that:

https://marc.info/?l=linux-block&m=151606896601195&w=2

Also, wrt. HPSA's queues, it looks like they are not usual IO queues (such as
NVMe's hw queues), which are supposed to follow a C/S model. HPSA's queue is
more like a management queue, I guess, since HPSA is still a single-queue HBA,
from the blk-mq view.

Cc HPSA and SCSI guys.

Thanks,
Ming


Re: [PATCH 2/2] blk-mq: simplify queue mapping & schedule with each possisble CPU

2018-01-16 Thread Ming Lei
Hi Jianchao,

On Tue, Jan 16, 2018 at 06:12:09PM +0800, jianchao.wang wrote:
> Hi Ming
> 
> On 01/12/2018 10:53 AM, Ming Lei wrote:
> > From: Christoph Hellwig 
> > 
> > The previous patch assigns interrupt vectors to all possible CPUs, so
> > now hctx can be mapped to possible CPUs, this patch applies this fact
> > to simplify queue mapping & schedule so that we don't need to handle
> > CPU hotplug for dealing with physical CPU plug & unplug. With this
> > simplication, we can work well on physical CPU plug & unplug, which
> > is a normal use case for VM at least.
> > 
> > Make sure we allocate blk_mq_ctx structures for all possible CPUs, and
> > set hctx->numa_node for possible CPUs which are mapped to this hctx. And
> > only choose the online CPUs for schedule.
> > 
> > Reported-by: Christian Borntraeger 
> > Tested-by: Christian Borntraeger 
> > Tested-by: Stefan Haberland 
> > Cc: Thomas Gleixner 
> > Signed-off-by: Christoph Hellwig 
> > (merged the three into one because any single one may not work, and fix
> > selecting online CPUs for scheduler)
> > Signed-off-by: Ming Lei 
> > ---
> >  block/blk-mq.c | 19 ---
> >  1 file changed, 8 insertions(+), 11 deletions(-)
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 8000ba6db07d..ef9beca2d117 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -440,7 +440,7 @@ struct request *blk_mq_alloc_request_hctx(struct 
> > request_queue *q,
> > blk_queue_exit(q);
> > return ERR_PTR(-EXDEV);
> > }
> > -   cpu = cpumask_first(alloc_data.hctx->cpumask);
> > +   cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
> > alloc_data.ctx = __blk_mq_get_ctx(q, cpu);
> >  
> > rq = blk_mq_get_request(q, NULL, op, _data);
> > @@ -1323,9 +1323,10 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx 
> > *hctx)
> > if (--hctx->next_cpu_batch <= 0) {
> > int next_cpu;
> >  
> > -   next_cpu = cpumask_next(hctx->next_cpu, hctx->cpumask);
> > +   next_cpu = cpumask_next_and(hctx->next_cpu, hctx->cpumask,
> > +   cpu_online_mask);
> > if (next_cpu >= nr_cpu_ids)
> > -   next_cpu = cpumask_first(hctx->cpumask);
> > +   next_cpu = 
> > cpumask_first_and(hctx->cpumask,cpu_online_mask);
> 
> the next_cpu here could be >= nr_cpu_ids when none of the CPUs on hctx->cpumask is
> online.

That is supposed not to happen, because a storage device (blk-mq hw queue)
generally follows a C/S model, which means the queue only becomes active when
there is an online CPU mapped to it.

But that won't be true for non-block-IO queues, such as HPSA's queues[1] and
network controller RX queues.

[1] https://marc.info/?l=linux-kernel&m=151601867018444&w=2

One thing I am still not sure about (but which generic irq affinity is supposed
to deal with) is that the CPU may go offline just after the IO is submitted;
where does the IRQ controller deliver the interrupt of this hw queue
to then?

> This could be reproduced on NVMe with a patch that could hold some rqs on 
> ctx->rq_list,
> meanwhile a script online and offline the cpus. Then a panic occurred in 
> __queue_work().

That shouldn't happen, when CPU offline happens the rqs in ctx->rq_list
are dispatched directly, please see blk_mq_hctx_notify_dead().

> 
> maybe cpu_possible_mask here, the workers in the pool of the offlined cpu has 
> been unbound.
> It should be ok to queue on them.

That is the original version of this patch, but both Christian and Stefan
reported that their systems couldn't boot from DASD this way[2], so I changed
it to AND with cpu_online_mask, and then their systems could boot well.

[2] https://marc.info/?l=linux-kernel&m=151256312722285&w=2

-- 
Ming


Re: [PATCH 0/2] genirq/affinity: try to make sure online CPU is assgined to irq vector

2018-01-16 Thread Thomas Gleixner
On Tue, 16 Jan 2018, Ming Lei wrote:

> On Mon, Jan 15, 2018 at 09:40:36AM -0800, Christoph Hellwig wrote:
> > On Tue, Jan 16, 2018 at 12:03:43AM +0800, Ming Lei wrote:
> > > Hi,
> > > 
> > > These two patches fixes IO hang issue reported by Laurence.
> > > 
> > > 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
> > > may cause one irq vector assigned to all offline CPUs, then this vector
> > > can't handle irq any more.
> > 
> > Well, that very much was the intention of managed interrupts.  Why
> > does the device raise an interrupt for a queue that has no online
> > cpu assigned to it?
> 
> It is because of irq_create_affinity_masks().

That still does not answer the question. If the interrupt for a queue is
assigned to an offline CPU, then the queue should not be used and never
raise an interrupt. That's how managed interrupts have been designed.

Thanks,

tglx






Re: [PATCH 2/2] blk-mq: simplify queue mapping & schedule with each possisble CPU

2018-01-16 Thread jianchao.wang
Hi Ming

On 01/12/2018 10:53 AM, Ming Lei wrote:
> From: Christoph Hellwig 
> 
> The previous patch assigns interrupt vectors to all possible CPUs, so
> now hctx can be mapped to possible CPUs, this patch applies this fact
> to simplify queue mapping & schedule so that we don't need to handle
> CPU hotplug for dealing with physical CPU plug & unplug. With this
> simplication, we can work well on physical CPU plug & unplug, which
> is a normal use case for VM at least.
> 
> Make sure we allocate blk_mq_ctx structures for all possible CPUs, and
> set hctx->numa_node for possible CPUs which are mapped to this hctx. And
> only choose the online CPUs for schedule.
> 
> Reported-by: Christian Borntraeger 
> Tested-by: Christian Borntraeger 
> Tested-by: Stefan Haberland 
> Cc: Thomas Gleixner 
> Signed-off-by: Christoph Hellwig 
> (merged the three into one because any single one may not work, and fix
> selecting online CPUs for scheduler)
> Signed-off-by: Ming Lei 
> ---
>  block/blk-mq.c | 19 ---
>  1 file changed, 8 insertions(+), 11 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 8000ba6db07d..ef9beca2d117 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -440,7 +440,7 @@ struct request *blk_mq_alloc_request_hctx(struct 
> request_queue *q,
>   blk_queue_exit(q);
>   return ERR_PTR(-EXDEV);
>   }
> - cpu = cpumask_first(alloc_data.hctx->cpumask);
> + cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
>   alloc_data.ctx = __blk_mq_get_ctx(q, cpu);
>  
>   rq = blk_mq_get_request(q, NULL, op, _data);
> @@ -1323,9 +1323,10 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx 
> *hctx)
>   if (--hctx->next_cpu_batch <= 0) {
>   int next_cpu;
>  
> - next_cpu = cpumask_next(hctx->next_cpu, hctx->cpumask);
> + next_cpu = cpumask_next_and(hctx->next_cpu, hctx->cpumask,
> + cpu_online_mask);
>   if (next_cpu >= nr_cpu_ids)
> - next_cpu = cpumask_first(hctx->cpumask);
> + next_cpu = 
> cpumask_first_and(hctx->cpumask,cpu_online_mask);

the next_cpu here could be >= nr_cpu_ids when none of the CPUs on hctx->cpumask
is online.
This could be reproduced on NVMe with a patch that holds some rqs on
ctx->rq_list while a script onlines and offlines the CPUs. Then a panic
occurred in __queue_work().

Maybe use cpu_possible_mask here; the workers in the pool of the offlined CPU
have been unbound.
It should be OK to queue work on them.

Thanks
Jianchao 


>   hctx->next_cpu = next_cpu;
>   hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
> @@ -2219,16 +2220,11 @@ static void blk_mq_init_cpu_queues(struct 
> request_queue *q,
>   INIT_LIST_HEAD(&__ctx->rq_list);
>   __ctx->queue = q;
>  
> - /* If the cpu isn't present, the cpu is mapped to first hctx */
> - if (!cpu_present(i))
> - continue;
> -
> - hctx = blk_mq_map_queue(q, i);
> -
>   /*
>* Set local node, IFF we have more than one hw queue. If
>* not, we remain on the home node of the device
>*/
> + hctx = blk_mq_map_queue(q, i);
>   if (nr_hw_queues > 1 && hctx->numa_node == NUMA_NO_NODE)
>   hctx->numa_node = local_memory_node(cpu_to_node(i));
>   }
> @@ -2285,7 +2281,7 @@ static void blk_mq_map_swqueue(struct request_queue *q)
>*
>* If the cpu isn't present, the cpu is mapped to first hctx.
>*/
> - for_each_present_cpu(i) {
> + for_each_possible_cpu(i) {
>   hctx_idx = q->mq_map[i];
>   /* unmapped hw queue can be remapped after CPU topo changed */
>   if (!set->tags[hctx_idx] &&
> @@ -2339,7 +2335,8 @@ static void blk_mq_map_swqueue(struct request_queue *q)
>   /*
>* Initialize batch roundrobin counts
>*/
> - hctx->next_cpu = cpumask_first(hctx->cpumask);
> + hctx->next_cpu = cpumask_first_and(hctx->cpumask,
> + cpu_online_mask);
>   hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
>   }
>  }
> 


Re: [PATCH 2/2] blk-mq: simplify queue mapping & schedule with each possisble CPU

2018-01-16 Thread Stefan Haberland

Ming or Christoph,

would you mind sending this patch to stable #4.12? Or is the Fixes tag
enough to get this fixed in all related releases?


Regards,
Stefan



Re: [PATCH v3 13/13] bcache: stop bcache device when backing device is offline

2018-01-16 Thread Hannes Reinecke
On 01/14/2018 03:42 PM, Coly Li wrote:
> Currently bcache does not handle backing device failure: if the backing
> device is offline and disconnected from the system, its bcache device can still
> be accessible. If the bcache device is in writeback mode, I/O requests can even
> succeed if they hit the cache device. That is to say, when and
> how bcache handles an offline backing device is undefined.
> 
> This patch tries to handle backing device offline in a rather simple way,
> - Add a cached_dev->status_update_thread kernel thread to update the backing
>   device status every second.
> - Add cached_dev->offline_seconds to record how many seconds the backing
>   device has been observed to be offline. If the backing device is offline for
>   BACKING_DEV_OFFLINE_TIMEOUT (30) seconds, set dc->io_disable to 1 and
>   call bcache_device_stop() to stop the bcache device which is linked to the
>   offline backing device.
> 
> Now if a backing device is offline for BACKING_DEV_OFFLINE_TIMEOUT seconds,
> its bcache device will be removed; then user space applications writing to
> it will get errors immediately and can handle the device failure in time.
> 
> This patch is quite simple and does not handle more complicated situations.
> Once the bcache device is stopped, users need to recover the backing
> device, then register and attach it manually.
> 
> Signed-off-by: Coly Li 
> Cc: Michael Lyle 
> Cc: Hannes Reinecke 
> Cc: Junhui Tang 
> ---
>  drivers/md/bcache/bcache.h |  2 ++
>  drivers/md/bcache/super.c  | 55 
> ++
>  2 files changed, 57 insertions(+)
> 
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index 5a811959392d..9eedb35d01bc 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -338,6 +338,7 @@ struct cached_dev {
>  
>   struct keybuf   writeback_keys;
>  
> + struct task_struct  *status_update_thread;
>   /*
>* Order the write-half of writeback operations strongly in dispatch
>* order.  (Maintain LBA order; don't allow reads completing out of
> @@ -384,6 +385,7 @@ struct cached_dev {
>  #define DEFAULT_CACHED_DEV_ERROR_LIMIT 64
>   atomic_tio_errors;
>   unsignederror_limit;
> + unsignedoffline_seconds;
>  };
>  
>  enum alloc_reserve {
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index 14fce3623770..85adf1e29d11 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -646,6 +646,11 @@ static int ioctl_dev(struct block_device *b, fmode_t 
> mode,
>unsigned int cmd, unsigned long arg)
>  {
>   struct bcache_device *d = b->bd_disk->private_data;
> + struct cached_dev *dc = container_of(d, struct cached_dev, disk);
> +
> + if (dc->io_disable)
> + return -EIO;
> +
>   return d->ioctl(d, mode, cmd, arg);
>  }
>  
> @@ -856,6 +861,45 @@ static void calc_cached_dev_sectors(struct cache_set *c)
>   c->cached_dev_sectors = sectors;
>  }
>  
> +#define BACKING_DEV_OFFLINE_TIMEOUT 5
> +static int cached_dev_status_update(void *arg)
> +{
> + struct cached_dev *dc = arg;
> + struct request_queue *q;
> + char buf[BDEVNAME_SIZE];
> +
> + /*
> +  * If this delayed worker is stopping outside, directly quit here.
> +  * dc->io_disable might be set via sysfs interface, so check it
> +  * here too.
> +  */
> + while (!kthread_should_stop() && !dc->io_disable) {
> + q = bdev_get_queue(dc->bdev);
> + if (blk_queue_dying(q))
> + dc->offline_seconds++;
> + else
> + dc->offline_seconds = 0;
> +
> + if (dc->offline_seconds >= BACKING_DEV_OFFLINE_TIMEOUT) {
> + pr_err("%s: device offline for %d seconds",
> + bdevname(dc->bdev, buf),
> + BACKING_DEV_OFFLINE_TIMEOUT);
> + pr_err("%s: disable I/O request due to backing "
> + "device offline", dc->disk.name);
> + dc->io_disable = true;
> + /* let others know earlier that io_disable is true */
> + smp_mb();
> + bcache_device_stop(>disk);
> + break;
> + }
> +
> + schedule_timeout_interruptible(HZ);
> + }
> +
> + dc->status_update_thread = NULL;
> + return 0;
> +}
> +
>  void bch_cached_dev_run(struct cached_dev *dc)
>  {
>   struct bcache_device *d = >disk;
> @@ -898,6 +942,15 @@ void bch_cached_dev_run(struct cached_dev *dc)
>   if (sysfs_create_link(>kobj, _to_dev(d->disk)->kobj, "dev") ||
>   sysfs_create_link(_to_dev(d->disk)->kobj, >kobj, "bcache"))
>   pr_debug("error creating sysfs link");
> +
> + dc->status_update_thread = 
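(The hunk is truncated here in the digest. Presumably bch_cached_dev_run() starts the status update thread with kthread_run(); the lines below are a guess at the missing tail of the hunk, not the literal patch text, and the thread name string is an assumption.)

	/*
	 * Sketch of the truncated assignment above -- an assumption, not the
	 * literal patch: start the status update kthread and warn on failure.
	 */
	dc->status_update_thread = kthread_run(cached_dev_status_update,
					       dc, "bcache_status_update");
	if (IS_ERR(dc->status_update_thread))
		pr_warn("failed to create bcache status update kthread");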

Re: [PATCH v3 12/13] bcache: add io_disable to struct cached_dev

2018-01-16 Thread Hannes Reinecke
On 01/14/2018 03:42 PM, Coly Li wrote:
> If a bcache device is configured in writeback mode, the current code does not
> handle write I/O errors on backing devices properly.
> 
> In writeback mode, a write request is written to the cache device and
> later flushed to the backing device. If an I/O fails when writing from the
> cache device to the backing device, the bcache code just ignores the error and
> the upper layer code is NOT notified that the backing device is broken.
> 
> This patch tries to handle backing device failure like how the cache device
> failure is handled,
> - Add a error counter 'io_errors' and error limit 'error_limit' in struct
>   cached_dev. Add another io_disable to struct cached_dev to disable I/Os
>   on the problematic backing device.
> - When I/O error happens on backing device, increase io_errors counter. And
>   if io_errors reaches error_limit, set cache_dev->io_disable to true, and
>   stop the bcache device.
> 
> The result is, if the backing device is broken or disconnected and I/O errors
> reach its error limit, the backing device will be disabled and the associated
> bcache device will be removed from the system.
> 
> Signed-off-by: Coly Li 
> Cc: Michael Lyle 
> Cc: Hannes Reinecke 
> Cc: Junhui Tang 
> ---
>  drivers/md/bcache/bcache.h  |  7 +++
>  drivers/md/bcache/io.c  | 14 ++
>  drivers/md/bcache/request.c | 14 --
>  drivers/md/bcache/super.c   | 22 ++
>  drivers/md/bcache/sysfs.c   | 15 ++-
>  5 files changed, 69 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index c41736960045..5a811959392d 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -360,6 +360,7 @@ struct cached_dev {
>   unsignedsequential_cutoff;
>   unsignedreadahead;
>  
> + unsignedio_disable:1;
>   unsignedverify:1;
>   unsignedbypass_torture_test:1;
>  
> @@ -379,6 +380,10 @@ struct cached_dev {
>   unsignedwriteback_rate_i_term_inverse;
>   unsignedwriteback_rate_p_term_inverse;
>   unsignedwriteback_rate_minimum;
> +
> +#define DEFAULT_CACHED_DEV_ERROR_LIMIT 64
> + atomic_tio_errors;
> + unsignederror_limit;
>  };
>  
>  enum alloc_reserve {
> @@ -882,6 +887,7 @@ static inline void closure_bio_submit(struct cache_set *c,
>  
>  /* Forward declarations */
>  
> +void bch_count_backing_io_errors(struct cached_dev *dc, struct bio *bio);
>  void bch_count_io_errors(struct cache *, blk_status_t, int, const char *);
>  void bch_bbio_count_io_errors(struct cache_set *, struct bio *,
> blk_status_t, const char *);
> @@ -909,6 +915,7 @@ int bch_bucket_alloc_set(struct cache_set *, unsigned,
>struct bkey *, int, bool);
>  bool bch_alloc_sectors(struct cache_set *, struct bkey *, unsigned,
>  unsigned, unsigned, bool);
> +bool bch_cached_dev_error(struct cached_dev *dc);
>  
>  __printf(2, 3)
>  bool bch_cache_set_error(struct cache_set *, const char *, ...);
> diff --git a/drivers/md/bcache/io.c b/drivers/md/bcache/io.c
> index 8013ecbcdbda..7fac97ae036e 100644
> --- a/drivers/md/bcache/io.c
> +++ b/drivers/md/bcache/io.c
> @@ -50,6 +50,20 @@ void bch_submit_bbio(struct bio *bio, struct cache_set *c,
>  }
>  
>  /* IO errors */
> +void bch_count_backing_io_errors(struct cached_dev *dc, struct bio *bio)
> +{
> + char buf[BDEVNAME_SIZE];
> + unsigned errors;
> +
> + WARN_ONCE(!dc, "NULL pointer of struct cached_dev");
> +
> + errors = atomic_add_return(1, >io_errors);
> + if (errors < dc->error_limit)
> + pr_err("%s: IO error on backing device, unrecoverable",
> + bio_devname(bio, buf));
> + else
> + bch_cached_dev_error(dc);
> +}
>  
>  void bch_count_io_errors(struct cache *ca,
>blk_status_t error,
> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
> index ad4cf71f7eab..386b388ce296 100644
> --- a/drivers/md/bcache/request.c
> +++ b/drivers/md/bcache/request.c
> @@ -637,6 +637,8 @@ static void backing_request_endio(struct bio *bio)
>  
>   if (bio->bi_status) {
>   struct search *s = container_of(cl, struct search, cl);
> + struct cached_dev *dc = container_of(s->d,
> +  struct cached_dev, disk);
>   /*
>* If a bio has REQ_PREFLUSH for writeback mode, it is
>* speically assembled in cached_dev_write() for a non-zero
> @@ -657,6 +659,7 @@ static void backing_request_endio(struct bio *bio)
>   }
>   s->recoverable = false;
>   /* should count I/O error for backing device here */
> + 

Re: [PATCH v3 03/13] bcache: set task properly in allocator_wait()

2018-01-16 Thread Coly Li
On 16/01/2018 5:05 PM, Hannes Reinecke wrote:
> On 01/14/2018 03:42 PM, Coly Li wrote:
>> Kernel thread routine bch_allocator_thread() references macro
>> allocator_wait() to wait for a condition or quit to do_exit()
>> when kthread_should_stop() is true. Here is the code block,
>>
>> 284 while (1) {   \
>> 285 set_current_state(TASK_INTERRUPTIBLE);\
>> 286 if (cond) \
>> 287 break;\
>> 288   \
>> 289 mutex_unlock(&(ca)->set->bucket_lock);\
>> 290 if (kthread_should_stop())\
>> 291 return 0; \
>> 292   \
>> 293 schedule();   \
>> 294 mutex_lock(&(ca)->set->bucket_lock);  \
>> 295 } \
>> 296 __set_current_state(TASK_RUNNING);\
>>
>> At line 285, task state is set to TASK_INTERRUPTIBLE, if at line 290
>> kthread_should_stop() is true, the kernel thread will terminate and return
>> to kernel/kthread.s:kthread(), then calls do_exit() with TASK_INTERRUPTIBLE
>> state. This is not a suggested behavior and a warning message will be
>> reported by might_sleep() in do_exit() code path: "WARNING: do not call
>> blocking ops when !TASK_RUNNING; state=1 set at []".
>>
>> This patch fixes this problem by setting task state to TASK_RUNNING if
>> kthread_should_stop() is true and before kernel thread returns back to
>> kernel/kthread.s:kthread().
>>
>> Changelog:
>> v2: fix the race issue in v1 patch.
>> v1: initial buggy fix.
>>
>> Signed-off-by: Coly Li 
>> Cc: Michael Lyle 
>> Cc: Hannes Reinecke 
>> Cc: Junhui Tang 
>> ---
>>  drivers/md/bcache/alloc.c | 4 +++-
>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c
>> index 6cc6c0f9c3a9..458e1d38577d 100644
>> --- a/drivers/md/bcache/alloc.c
>> +++ b/drivers/md/bcache/alloc.c
>> @@ -287,8 +287,10 @@ do {
>> \
>>  break;  \
>>  \
>>  mutex_unlock(&(ca)->set->bucket_lock);  \
>> -if (kthread_should_stop())  \
>> +if (kthread_should_stop()) {\
>> +set_current_state(TASK_RUNNING);\
>>  return 0;   \
>> +}   \
>>  \
>>  schedule(); \
>>  mutex_lock(&(ca)->set->bucket_lock);\
>>
> Might be an idea to merge it with the previous patch.
> 
> Other than that:
> 
> Reviewed-by: Hannes Reinecke 

Hi Hannes,

Sure, I will do that in the v4 patch set. Thanks for the review.

Coly Li


Re: [PATCH v3 11/13] bcache: add backing_request_endio() for bi_end_io of attached backing device I/O

2018-01-16 Thread Hannes Reinecke
On 01/14/2018 03:42 PM, Coly Li wrote:
> In order to catch I/O error of backing device, a separate bi_end_io
> call back is required. Then a per backing device counter can record I/O
> errors number and retire the backing device if the counter reaches a
> per backing device I/O error limit.
> 
> This patch adds backing_request_endio() to bcache backing device I/O code
> path, this is a preparation for further complicated backing device failure
> handling. So far there is no real code logic change, I make this change a
> separate patch to make sure it is stable and reliable for further work.
> 
> Signed-off-by: Coly Li 
> Cc: Junhui Tang 
> Cc: Michael Lyle 
> ---
>  drivers/md/bcache/request.c   | 95 
> +++
>  drivers/md/bcache/super.c |  1 +
>  drivers/md/bcache/writeback.c |  1 +
>  3 files changed, 81 insertions(+), 16 deletions(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH v3 10/13] bcache: fix inaccurate io state for detached bcache devices

2018-01-16 Thread Hannes Reinecke
On 01/14/2018 03:42 PM, Coly Li wrote:
> From: Tang Junhui 
> 
> When we run IO on a detached device and run iostat to show the IO status,
> it will normally show something like below (some fields omitted):
> Device: ... avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sdd... 15.89 0.531.820.202.23   1.81  52.30
> bcache0... 15.89   115.420.000.000.00   2.40  69.60
> but after IO has stopped, there are still very big avgqu-sz and %util
> values as below:
> Device: ... avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> bcache0   ...  0   5326.320.000.000.00   0.00 100.10
> 
> The reason for this issue is that only generic_start_io_acct() is called,
> and no generic_end_io_acct() is called, for a detached device in
> cached_dev_make_request(). See the code:
> //start generic_start_io_acct()
> generic_start_io_acct(q, rw, bio_sectors(bio), >disk->part0);
> if (cached_dev_get(dc)) {
>   //will callback generic_end_io_acct()
> }
> else {
>   //will not call generic_end_io_acct()
> }
> 
> This patch calls generic_end_io_acct() in the end of IO for detached
> devices, so we can show IO state correctly.
> 
> (Modified to use GFP_NOIO in kzalloc() by Coly Li)
> 
> Signed-off-by: Tang Junhui 
> Reviewed-by: Coly Li 
> ---
>  drivers/md/bcache/request.c | 58 
> +++--
>  1 file changed, 51 insertions(+), 7 deletions(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH v3 05/13] bcache: quit dc->writeback_thread when BCACHE_DEV_DETACHING is set

2018-01-16 Thread Hannes Reinecke
On 01/14/2018 03:42 PM, Coly Li wrote:
> In patch "bcache: fix cached_dev->count usage for bch_cache_set_error()",
> cached_dev_get() is called when creating dc->writeback_thread, and
> cached_dev_put() is called when exiting dc->writeback_thread. This
> modification works well unless people detach the bcache device manually by
> 'echo 1 > /sys/block/bcache/bcache/detach'
> Because this sysfs interface only calls bch_cached_dev_detach() which wakes
> up dc->writeback_thread but does not stop it. The reason is, before patch
> "bcache: fix cached_dev->count usage for bch_cache_set_error()", inside
> bch_writeback_thread(), if cache is not dirty after writeback,
> cached_dev_put() will be called here. And in cached_dev_make_request() when
> a new write request makes cache from clean to dirty, cached_dev_get() will
> be called there. Since we don't operate dc->count in these locations,
> refcount d->count cannot be dropped after cache becomes clean, and
> cached_dev_detach_finish() won't be called to detach bcache device.
> 
> This patch fixes the issue by checking whether BCACHE_DEV_DETACHING is
> set inside bch_writeback_thread(). If this bit is set and cache is clean
> (no existing writeback_keys), break the while-loop, call cached_dev_put()
> and quit the writeback thread.
> 
> Please note if cache is still dirty, even BCACHE_DEV_DETACHING is set the
> writeback thread should continue to perform writeback, this is the original
> design of manually detach.
> 
> I composed a separate patch because the patch "bcache: fix cached_dev->count
> usage for bch_cache_set_error()" already got a "Reviewed-by:" from Hannes
> Reinecke. Also, this fix is not trivial and is better as a separate patch.
> 
> Signed-off-by: Coly Li 
> Cc: Michael Lyle 
> Cc: Hannes Reinecke 
> Cc: Huijun Tang 
> ---
>  drivers/md/bcache/writeback.c | 20 +---
>  1 file changed, 17 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
> index b280c134dd4d..4dbeaaa575bf 100644
> --- a/drivers/md/bcache/writeback.c
> +++ b/drivers/md/bcache/writeback.c
> @@ -565,9 +565,15 @@ static int bch_writeback_thread(void *arg)
>   while (!kthread_should_stop()) {
>   down_write(>writeback_lock);
>   set_current_state(TASK_INTERRUPTIBLE);
> - if (!atomic_read(>has_dirty) ||
> - (!test_bit(BCACHE_DEV_DETACHING, >disk.flags) &&
> -  !dc->writeback_running)) {
> + /*
> +  * If the bache device is detaching, skip here and continue
> +  * to perform writeback. Otherwise, if no dirty data on cache,
> +  * or there is dirty data on cache but writeback is disabled,
> +  * the writeback thread should sleep here and wait for others
> +  * to wake up it.
> +  */
> + if (!test_bit(BCACHE_DEV_DETACHING, >disk.flags) &&
> + (!atomic_read(>has_dirty) || !dc->writeback_running)) {
>   up_write(>writeback_lock);
>  
>   if (kthread_should_stop()) {
> @@ -587,6 +593,14 @@ static int bch_writeback_thread(void *arg)
>   atomic_set(>has_dirty, 0);
>   SET_BDEV_STATE(>sb, BDEV_STATE_CLEAN);
>   bch_write_bdev_super(dc, NULL);
> + /*
> +  * If bcache device is detaching via sysfs interface,
> +  * writeback thread should stop after there is no dirty
> +  * data on cache. BCACHE_DEV_DETACHING flag is set in
> +  * bch_cached_dev_detach().
> +  */
> + if (test_bit(BCACHE_DEV_DETACHING, >disk.flags))
> + break;
>   }
>  
>   up_write(>writeback_lock);
> 
Checking several atomic flags in one statement renders the atomic pretty
much pointless; you need to protect the 'if' clause with some lock or
just check _one_ atomic statement.

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH v3 03/13] bcache: set task properly in allocator_wait()

2018-01-16 Thread Hannes Reinecke
On 01/14/2018 03:42 PM, Coly Li wrote:
> Kernel thread routine bch_allocator_thread() references macro
> allocator_wait() to wait for a condition or quit to do_exit()
> when kthread_should_stop() is true. Here is the code block,
> 
> 284 while (1) {   \
> 285 set_current_state(TASK_INTERRUPTIBLE);\
> 286 if (cond) \
> 287 break;\
> 288   \
> 289 mutex_unlock(&(ca)->set->bucket_lock);\
> 290 if (kthread_should_stop())\
> 291 return 0; \
> 292   \
> 293 schedule();   \
> 294 mutex_lock(&(ca)->set->bucket_lock);  \
> 295 } \
> 296 __set_current_state(TASK_RUNNING);\
> 
> At line 285, task state is set to TASK_INTERRUPTIBLE, if at line 290
> kthread_should_stop() is true, the kernel thread will terminate and return
> to kernel/kthread.s:kthread(), then calls do_exit() with TASK_INTERRUPTIBLE
> state. This is not a suggested behavior and a warning message will be
> reported by might_sleep() in do_exit() code path: "WARNING: do not call
> blocking ops when !TASK_RUNNING; state=1 set at []".
> 
> This patch fixes this problem by setting task state to TASK_RUNNING if
> kthread_should_stop() is true and before kernel thread returns back to
> kernel/kthread.s:kthread().
> 
> Changelog:
> v2: fix the race issue in v1 patch.
> v1: initial buggy fix.
> 
> Signed-off-by: Coly Li 
> Cc: Michael Lyle 
> Cc: Hannes Reinecke 
> Cc: Junhui Tang 
> ---
>  drivers/md/bcache/alloc.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c
> index 6cc6c0f9c3a9..458e1d38577d 100644
> --- a/drivers/md/bcache/alloc.c
> +++ b/drivers/md/bcache/alloc.c
> @@ -287,8 +287,10 @@ do { 
> \
>   break;  \
>   \
>   mutex_unlock(&(ca)->set->bucket_lock);  \
> - if (kthread_should_stop())  \
> + if (kthread_should_stop()) {\
> + set_current_state(TASK_RUNNING);\
>   return 0;   \
> + }   \
>   \
>   schedule(); \
>   mutex_lock(&(ca)->set->bucket_lock);\
> 
Might be an idea to merge it with the previous patch.

Other than that:

Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH v3 01/13] bcache: set writeback_rate_update_seconds in range [1, 60] seconds

2018-01-16 Thread Hannes Reinecke
On 01/14/2018 03:42 PM, Coly Li wrote:
> dc->writeback_rate_update_seconds can be set via sysfs and its value can
> be set to [1, ULONG_MAX].  It does not make sense to set such a large
> value; 60 seconds is long enough, considering the default 5 seconds
> has worked well for a long time.
> 
> Because dc->writeback_rate_update is a special delayed work, it re-arms
> itself inside the delayed work routine update_writeback_rate(). When
> stopping it by cancel_delayed_work_sync(), there should be a timeout to
> wait and make sure the re-armed delayed work is stopped too. A small max
> value of dc->writeback_rate_update_seconds is also helpful to decide a
> reasonable small timeout.
> 
> This patch limits sysfs interface to set dc->writeback_rate_update_seconds
> in range of [1, 60] seconds, and replaces the hand-coded number by macros.
> 
> Signed-off-by: Coly Li 
> ---
>  drivers/md/bcache/sysfs.c | 3 +++
>  drivers/md/bcache/writeback.c | 2 +-
>  drivers/md/bcache/writeback.h | 3 +++
>  3 files changed, 7 insertions(+), 1 deletion(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH v3 02/13] bcache: properly set task state in bch_writeback_thread()

2018-01-16 Thread Hannes Reinecke
On 01/14/2018 03:42 PM, Coly Li wrote:
> Kernel thread routine bch_writeback_thread() has the following code block,
> 
> 447 down_write(>writeback_lock);
> 448~450 if (check conditions) {
> 451 up_write(>writeback_lock);
> 452 set_current_state(TASK_INTERRUPTIBLE);
> 453
> 454 if (kthread_should_stop())
> 455 return 0;
> 456
> 457 schedule();
> 458 continue;
> 459 }
> 
> If condition check is true, its task state is set to TASK_INTERRUPTIBLE
> and call schedule() to wait for others to wake up it.
> 
> There are 2 issues in current code,
> 1, Task state is set to TASK_INTERRUPTIBLE after the condition checks. If
>another process changes the condition and calls wake_up_process(dc->
>writeback_thread), then at line 452 the task state is set back to
>TASK_INTERRUPTIBLE and the writeback kernel thread will lose a chance to be
>woken up.
> 2, At line 454 if kthread_should_stop() is true, writeback kernel thread
>will return to kernel/kthread.c:kthread() with TASK_INTERRUPTIBLE and
>call do_exit(). It is not good to enter do_exit() with task state
>TASK_INTERRUPTIBLE, in following code path might_sleep() is called and a
>warning message is reported by __might_sleep(): "WARNING: do not call
>blocking ops when !TASK_RUNNING; state=1 set at []".
> 
> For the first issue, the task state should be set before the condition
> checks. Indeed, because dc->writeback_lock is required when modifying all
> the conditions, calling set_current_state() inside the code block where
> dc->writeback_lock is held is safe. But this is quite implicit, so I still
> move set_current_state() before all the condition checks.
> 
> For the second issue, frankly speaking it does not hurt when a kernel
> thread exits with TASK_INTERRUPTIBLE state, but this warning message
> scares users, making them feel there might be something risky with bcache
> that could hurt their data.  Setting the task state to TASK_RUNNING
> before returning fixes this problem.
> 
> Changelog:
> v2: fix the race issue in v1 patch.
> v1: initial buggy fix.
> 
> Signed-off-by: Coly Li 
> Cc: Michael Lyle 
> Cc: Hannes Reinecke 
> Cc: Junhui Tang 
> ---
>  drivers/md/bcache/writeback.c | 7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
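For illustration only, a sketch of the corrected wait pattern (not the exact
bch_writeback_thread() code; the condition helpers below are hypothetical).
The key points are setting the task state before checking the sleep
condition, and restoring TASK_RUNNING before returning so the thread never
reaches do_exit() in TASK_INTERRUPTIBLE:

  #include <linux/kthread.h>
  #include <linux/sched.h>
  #include <linux/types.h>

  /* hypothetical helpers standing in for the real dirty/detaching checks */
  extern bool should_write_back(void *arg);
  extern void write_back_some(void *arg);

  static int writeback_thread_fn(void *arg)
  {
  	while (true) {
  		/* set the state first, so a wakeup arriving between the
  		 * condition check and schedule() is not lost */
  		set_current_state(TASK_INTERRUPTIBLE);

  		if (should_write_back(arg)) {
  			__set_current_state(TASK_RUNNING);
  			write_back_some(arg);
  			continue;
  		}

  		if (kthread_should_stop()) {
  			/* never enter do_exit() in TASK_INTERRUPTIBLE */
  			__set_current_state(TASK_RUNNING);
  			return 0;
  		}

  		schedule();
  	}
  }
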
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [Drbd-dev] [PATCH 23/27] drbd: make intelligent use of blkdev_issue_zeroout

2018-01-16 Thread Lars Ellenberg
On Mon, Jan 15, 2018 at 10:07:38AM -0500, Mike Snitzer wrote:
> > See also:
> > https://www.redhat.com/archives/dm-devel/2017-March/msg00213.html
> > https://www.redhat.com/archives/dm-devel/2017-March/msg00226.html
> 
> Right, now that you mention it, it is starting to ring a bell (especially
> after I read your 2nd dm-devel archive URL above).

> > In tree, either dm-thin learns to do REQ_OP_WRITE_ZEROES "properly",
> > so the result in this scenario is what we expect:
> > 
> >   _: unprovisioned, not allocated, returns zero on read anyways
> >   *: provisioned, some arbitrary data
> >   0: explicitly zeroed:
> > 
> >   |gran|ular|ity_|____|      (blocks at provisioning granularity)
> >      to|-be-|zero|ed         (requested zero-out range, unaligned ends)
> >   |**00|____|____|00**|      (result: head/tail zeroed, middle de-allocated)
> > 
> > (leave unallocated blocks alone,
> >  de-allocate full blocks just like with discard,
> >  explicitly zero unaligned head and tail)
> 
> "de-allocate full blocks just like with discard" is an interesting take
> what it means for dm-thin to handle REQ_OP_WRITE_ZEROES "properly".
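
For illustration only, a minimal sketch (not the dm-thin or DRBD
implementation) of splitting such a request at a power-of-two provisioning
granularity: zero the unaligned head and tail explicitly and discard the
fully covered middle. Whether the discarded middle then reliably reads back
as zeros is exactly what is in question below:

  #include <linux/blkdev.h>
  #include <linux/kernel.h>

  /*
   * Sketch only: split [start, start + nr) at 'granularity', assumed to
   * be a power of two, as round_up()/round_down() require.
   */
  static int zeroout_split(struct block_device *bdev, sector_t start,
  			 sector_t nr, sector_t granularity, gfp_t gfp)
  {
  	sector_t head_end = round_up(start, granularity);
  	sector_t tail_start = round_down(start + nr, granularity);
  	int ret;

  	/* range does not span a full block: just zero it */
  	if (tail_start <= head_end)
  		return blkdev_issue_zeroout(bdev, start, nr, gfp, 0);

  	/* explicitly zero the unaligned head */
  	if (head_end > start) {
  		ret = blkdev_issue_zeroout(bdev, start,
  					   head_end - start, gfp, 0);
  		if (ret)
  			return ret;
  	}

  	/*
  	 * De-allocate the fully covered middle.  This only gives
  	 * "reads as zero" semantics if the thin device guarantees it,
  	 * which is the point in dispute here.
  	 */
  	ret = blkdev_issue_discard(bdev, head_end,
  				   tail_start - head_end, gfp, 0);
  	if (ret)
  		return ret;

  	/* explicitly zero the unaligned tail */
  	if (start + nr > tail_start)
  		ret = blkdev_issue_zeroout(bdev, tail_start,
  					   start + nr - tail_start, gfp, 0);
  	return ret;
  }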
> 
> > Or DRBD will have to resurrect that reinvented zeroout again,
> > with exactly those semantics. I did reinvent it for a reason ;)
> 
> Yeah, I now recall dropping that line of development because it
> became "hard" (or at least harder than originally thought).
> 
> Don't people use REQ_OP_WRITE_ZEROES to initialize a portion of the
> disk?  E.g. zeroing superblocks, metadata areas, or whatever?
> 
> If we just discarded the logical extent and then a user did a partial
> write to the block, areas that a user might expect to be zeroed wouldn't
> be (at least in the case of dm-thinp if "skip_block_zeroing" is
> enabled).


Oh-kay.
So "immediately after" such an operation
("zero-out head and tail and de-alloc full blocks")
a read to that area would return all zeros, as expected.

But once you do a partial write of something to one of those
de-allocated blocks (and skip_block_zeroing is enabled,
which it likely is due to "performance"),
"magically" arbitrary old garbage data springs into existence
on the LBAs that just before read as zeros.

Would that not break a lot of other things
(any read-modify-write of "upper layers")?
Would that not even be a serious "information leak"
(old garbage of other completely unrelated LVs leaking into this one)?

But thank you for that, I start to see the problem ;-)

> No, dm-thinp doesn't have an easy way to mark an allocated block as
> containing zeroes (without actually zeroing).  I toyed with adding that
> but then realized that even if we had it, it'd still require block
> zeroing be enabled.  But block zeroing is done at allocation time.  So
> we'd need to interpret the "this block is zeroes" flag to mean "on first
> write or read to this block it needs to first zero it".  Fugly to say
> the least...


Maybe have a "known zeroed block" pool, allocate only from there,
and "lazy zero" unallocated blocks, adding them to the known-zero pool?
Fall back to zero-on-alloc if that known-zero pool is depleted.

Easier said than done, I know.
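
Purely to illustrate the idea above (nothing like this exists in dm-thin;
every name below is made up): keep a list of blocks that a background worker
has already zeroed, allocate from it when possible, and fall back to
zero-on-allocate when it runs dry:

  #include <linux/list.h>
  #include <linux/spinlock.h>
  #include <linux/types.h>

  struct zeroed_pool {
  	spinlock_t lock;
  	struct list_head free;	/* blocks known to contain zeros */
  };

  struct pool_block {
  	struct list_head list;
  	sector_t start;
  };

  /* hypothetical synchronous zero-on-alloc fallback */
  struct pool_block *zero_block_now(void);

  static struct pool_block *alloc_zeroed_block(struct zeroed_pool *p)
  {
  	struct pool_block *b = NULL;

  	spin_lock(&p->lock);
  	if (!list_empty(&p->free)) {
  		b = list_first_entry(&p->free, struct pool_block, list);
  		list_del(&b->list);
  	}
  	spin_unlock(&p->lock);

  	/* pool depleted: fall back to zeroing at allocation time */
  	return b ? b : zero_block_now();
  }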

> But sadly, in general, this is a low priority for me, so you might do
> well to reintroduce your drbd workaround.. sorry about that :(

No problem.
I'll put that back in, and document that we strongly recommend to
NOT skip_block_zeroing in those setups.

Thanks,

Lars