[PATCH] blk-mq: Fix recently introduced races in the timeout handling code

2018-04-08 Thread Bart Van Assche
If a completion occurs after blk_mq_rq_timed_out() has reset
rq->aborted_gstate and the request is again in flight when the timeout
expires, then the request will be completed twice: a first time by the
timeout handler and a second time when the regular completion occurs.

Additionally, the blk-mq timeout handling code ignores completions that
occur after blk_mq_check_expired() has been called and before
blk_mq_rq_timed_out() has reset rq->aborted_gstate. If a block driver
timeout handler always returns BLK_EH_RESET_TIMER then the result will
be that the request never terminates.

Since the request state can be updated from two different contexts,
namely regular completion and request timeout, this race cannot be
fixed with RCU synchronization only. Fix this race as follows:
- Introduce a spinlock to protect the request state and deadline changes.
- Use the deadline instead of the request generation to detect whether
  or not a request timer fired after reinitialization of a request.
- Store the request state in the lowest two bits of the deadline instead
  of the lowest two bits of 'gstate' (see the sketch after this list).
- Remove all request member variables that became superfluous due to
  this change: gstate, aborted_gstate, gstate_seq and aborted_gstate_sync.
- Remove the request state information that became superfluous due to this
  patch, namely RQF_MQ_TIMEOUT_EXPIRED.
- Remove the code that became superfluous due to this change, namely
  the RCU lock and unlock statements in blk_mq_complete_request() and
  also the synchronize_rcu() call in the timeout handler.
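
For reference, a minimal sketch of this encoding (the helper names below
are illustrative, not necessarily the patch's exact identifiers). Since
deadlines are expressed in jiffies, sacrificing the two least significant
bits costs at most three jiffies of deadline resolution:

#define RQ_STATE_BITS	2
#define RQ_STATE_MASK	((1UL << RQ_STATE_BITS) - 1)

/* Request states, stored in the two lowest bits of rq->__deadline. */
enum mq_rq_state {
	MQ_RQ_IDLE	= 0,
	MQ_RQ_IN_FLIGHT	= 1,
	MQ_RQ_COMPLETE	= 2,
};

static inline enum mq_rq_state blk_mq_rq_state(struct request *rq)
{
	return rq->__deadline & RQ_STATE_MASK;
}

/* Update the state bits; callers serialize on rq->state_lock. */
static inline void blk_mq_rq_set_state(struct request *rq,
				       enum mq_rq_state state)
{
	rq->__deadline = (rq->__deadline & ~RQ_STATE_MASK) | state;
}

/* The deadline in jiffies, with the state bits masked off. */
static inline unsigned long blk_rq_deadline(struct request *rq)
{
	return rq->__deadline & ~RQ_STATE_MASK;
}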

This patch fixes the following kernel crash:

BUG: unable to handle kernel NULL pointer dereference at   (null)
Oops:  [#1] PREEMPT SMP
CPU: 2 PID: 151 Comm: kworker/2:1H Tainted: G W  4.15.0-dbg+ #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.0.0-prebuilt.qemu-project.org 04/01/2014
Workqueue: kblockd blk_mq_timeout_work
RIP: 0010:scsi_times_out+0x17/0x2c0 [scsi_mod]
Call Trace:
blk_mq_terminate_expired+0x42/0x80
bt_iter+0x3d/0x50
blk_mq_queue_tag_busy_iter+0xe9/0x200
blk_mq_timeout_work+0x181/0x2e0
process_one_work+0x21c/0x6d0
worker_thread+0x35/0x380
kthread+0x117/0x130
ret_from_fork+0x24/0x30

Fixes: 1d9bd5161ba3 ("blk-mq: replace timeout synchronization with a RCU and 
generation based scheme")
Signed-off-by: Bart Van Assche 
Cc: Tejun Heo 
Cc: Christoph Hellwig 
Cc: Sagi Grimberg 
Cc: Israel Rukshin 
Cc: Max Gurtovoy 
Cc:  # v4.16
---
 block/blk-core.c   |   3 +-
 block/blk-mq-debugfs.c |   1 -
 block/blk-mq.c | 178 +++--
 block/blk-mq.h |  25 ++-
 block/blk-timeout.c|   1 -
 block/blk.h|   4 +-
 include/linux/blkdev.h |  28 ++--
 7 files changed, 53 insertions(+), 187 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 2623e609db4a..83c7a58e4fb3 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -200,8 +200,7 @@ void blk_rq_init(struct request_queue *q, struct request 
*rq)
rq->start_time = jiffies;
set_start_time_ns(rq);
rq->part = NULL;
-   seqcount_init(&rq->gstate_seq);
-   u64_stats_init(&rq->aborted_gstate_sync);
+   spin_lock_init(&rq->state_lock);
 }
 EXPORT_SYMBOL(blk_rq_init);
 
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 6f72413b6cab..80c7c585769f 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -345,7 +345,6 @@ static const char *const rqf_name[] = {
RQF_NAME(STATS),
RQF_NAME(SPECIAL_PAYLOAD),
RQF_NAME(ZONE_WRITE_LOCKED),
-   RQF_NAME(MQ_TIMEOUT_EXPIRED),
RQF_NAME(MQ_POLL_SLEPT),
 };
 #undef RQF_NAME
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 7816d28b7219..1da16d5e5cf1 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -305,7 +305,6 @@ static struct request *blk_mq_rq_ctx_init(struct 
blk_mq_alloc_data *data,
rq->special = NULL;
/* tag was already set */
rq->extra_len = 0;
-   rq->__deadline = 0;
 
INIT_LIST_HEAD(&rq->timeout_list);
rq->timeout = 0;
@@ -527,8 +526,7 @@ static void __blk_mq_complete_request(struct request *rq)
bool shared = false;
int cpu;
 
-   WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT);
-   blk_mq_rq_update_state(rq, MQ_RQ_COMPLETE);
+   WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE);
 
if (rq->internal_tag != -1)
blk_mq_sched_completed_request(rq);
@@ -577,34 +575,26 @@ static void hctx_lock(struct blk_mq_hw_ctx *hctx, int 
*srcu_idx)
*srcu_idx = srcu_read_lock(hctx->srcu);
 }
 
-static void blk_mq_rq_update_aborted_gstate(struct request *rq, u64 gstate)
+/**
+ * blk_mq_change_rq_state - atomically test and set request state
+ * @rq: Request pointer.
+ * @old: Old request state.
+ * @new: New request state.
+ */

Re: [block regression] kernel oops triggered by removing scsi device during IO

2018-04-08 Thread Bart Van Assche
On Sun, 2018-04-08 at 12:21 +0800, Ming Lei wrote:
> The following kernel oops is triggered by 'removing scsi device' during
> heavy IO.

Is the below patch sufficient to fix this?

Thanks,

Bart.


Subject: blk-mq: Avoid that submitting a bio concurrently with device removal 
triggers a crash

Because blkcg_exit_queue() is now called from inside blk_cleanup_queue()
it is no longer safe to access cgroup information during or after the
blk_cleanup_queue() call. Hence check earlier in generic_make_request()
whether the queue has been marked as "dying".
---
 block/blk-core.c | 72 +---
 1 file changed, 37 insertions(+), 35 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index aa8c99fae527..3ac9dd25e04e 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2385,10 +2385,21 @@ blk_qc_t generic_make_request(struct bio *bio)
 * yet.
 */
struct bio_list bio_list_on_stack[2];
+   blk_mq_req_flags_t flags = bio->bi_opf & REQ_NOWAIT ?
+   BLK_MQ_REQ_NOWAIT : 0;
+   struct request_queue *q = bio->bi_disk->queue;
blk_qc_t ret = BLK_QC_T_NONE;
 
if (!generic_make_request_checks(bio))
-   goto out;
+   return ret;
+
+   if (blk_queue_enter(q, flags) < 0) {
+   if (unlikely(!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT)))
+   bio_wouldblock_error(bio);
+   else
+   bio_io_error(bio);
+   return ret;
+   }
 
/*
 * We only want one ->make_request_fn to be active at a time, else
@@ -2423,46 +2434,37 @@ blk_qc_t generic_make_request(struct bio *bio)
bio_list_init(&bio_list_on_stack[0]);
current->bio_list = bio_list_on_stack;
do {
-   struct request_queue *q = bio->bi_disk->queue;
-   blk_mq_req_flags_t flags = bio->bi_opf & REQ_NOWAIT ?
-   BLK_MQ_REQ_NOWAIT : 0;
-
-   if (likely(blk_queue_enter(q, flags) == 0)) {
-   struct bio_list lower, same;
-
-   /* Create a fresh bio_list for all subordinate requests */
-   bio_list_on_stack[1] = bio_list_on_stack[0];
-   bio_list_init(&bio_list_on_stack[0]);
-   ret = q->make_request_fn(q, bio);
-
-   blk_queue_exit(q);
-
-   /* sort new bios into those for a lower level
-* and those for the same level
-*/
-   bio_list_init(&lower);
-   bio_list_init(&same);
-   while ((bio = bio_list_pop(&bio_list_on_stack[0])) != NULL)
-   if (q == bio->bi_disk->queue)
-   bio_list_add(&same, bio);
-   else
-   bio_list_add(&lower, bio);
-   /* now assemble so we handle the lowest level first */
-   bio_list_merge(&bio_list_on_stack[0], &lower);
-   bio_list_merge(&bio_list_on_stack[0], &same);
-   bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]);
-   } else {
-   if (unlikely(!blk_queue_dying(q) &&
-   (bio->bi_opf & REQ_NOWAIT)))
-   bio_wouldblock_error(bio);
+   struct bio_list lower, same;
+
+   WARN_ON_ONCE(!(flags & BLK_MQ_REQ_NOWAIT) &&
+(bio->bi_opf & REQ_NOWAIT));
+   WARN_ON_ONCE(q != bio->bi_disk->queue);
+   q = bio->bi_disk->queue;
+   /* Create a fresh bio_list for all subordinate requests */
+   bio_list_on_stack[1] = bio_list_on_stack[0];
+   bio_list_init(&bio_list_on_stack[0]);
+   ret = q->make_request_fn(q, bio);
+
+   /* sort new bios into those for a lower level
+* and those for the same level
+*/
+   bio_list_init(&lower);
+   bio_list_init(&same);
+   while ((bio = bio_list_pop(&bio_list_on_stack[0])) != NULL)
+   if (q == bio->bi_disk->queue)
+   bio_list_add(&same, bio);
else
-   bio_io_error(bio);
-   }
+   bio_list_add(&lower, bio);
+   /* now assemble so we handle the lowest level first */
+   bio_list_merge(&bio_list_on_stack[0], &lower);
+   bio_list_merge(&bio_list_on_stack[0], &same);
+   bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]);
bio = bio_list_pop(&bio_list_on_stack[0]);
} while (bio);
current->bio_list = NULL; /* deactivate */
 
 out:
+   blk_queue_exit(q);
return ret;
 }
 EXPORT_SYMBOL(generic_make_request);
-- 
2.16.2


Re: Block layer use of __GFP flags

2018-04-08 Thread Bart Van Assche
On Sun, 2018-04-08 at 12:08 -0700, Matthew Wilcox wrote:
> On Sun, Apr 08, 2018 at 04:40:59PM +, Bart Van Assche wrote:
> > Do you perhaps want me to prepare a patch that makes blk_get_request() again
> > respect the full gfp mask passed as third argument to blk_get_request()?
> 
> I think that would be a good idea.  If it's onerous to have extra arguments,
> there are some bits in gfp_flags which could be used for your purposes.

That's indeed something we can consider.

It would be appreciated if you could have a look at the patch below.

Thanks,

Bart.


Subject: block: Make blk_get_request() again respect the entire gfp_t argument

Commit 6a15674d1e90 ("block: Introduce blk_get_request_flags()")
translates the third blk_get_request() argument (GFP_KERNEL, GFP_NOIO
or __GFP_RECLAIM) into __GFP_DIRECT_RECLAIM. Make blk_get_request()
again pass the full gfp_t argument to blk_old_get_request() such that
kswapd is again woken up under low memory conditions if the caller
requested this.

Notes:
- This change only affects the behavior of the legacy block layer. See
  also the mempool_alloc() call in __get_request().
- The __GFP_RECLAIM flag in the blk_get_request_flags() calls in
  the IDE and SCSI drivers was removed by commit 039c635f4e66 ("ide,
  scsi: Tell the block layer at request allocation time about preempt
  requests").
---
 block/blk-core.c| 28 +++-
 drivers/ide/ide-pm.c|  2 +-
 drivers/scsi/scsi_lib.c |  3 ++-
 include/linux/blkdev.h  |  3 ++-
 4 files changed, 20 insertions(+), 16 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 3ac9dd25e04e..83c7a58e4fb3 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1333,6 +1333,7 @@ int blk_update_nr_requests(struct request_queue *q, 
unsigned int nr)
  * @op: operation and flags
  * @bio: bio to allocate request for (can be %NULL)
  * @flags: BLQ_MQ_REQ_* flags
+ * @gfp_mask: allocation mask
  *
  * Get a free request from @q.  This function may fail under memory
  * pressure or if @q is dead.
@@ -1342,7 +1343,8 @@ int blk_update_nr_requests(struct request_queue *q, 
unsigned int nr)
  * Returns request pointer on success, with @q->queue_lock *not held*.
  */
 static struct request *__get_request(struct request_list *rl, unsigned int op,
-struct bio *bio, blk_mq_req_flags_t flags)
+struct bio *bio, blk_mq_req_flags_t flags,
+gfp_t gfp_mask)
 {
struct request_queue *q = rl->q;
struct request *rq;
@@ -1351,8 +1353,6 @@ static struct request *__get_request(struct request_list 
*rl, unsigned int op,
struct io_cq *icq = NULL;
const bool is_sync = op_is_sync(op);
int may_queue;
-   gfp_t gfp_mask = flags & BLK_MQ_REQ_NOWAIT ? GFP_ATOMIC :
-__GFP_DIRECT_RECLAIM;
req_flags_t rq_flags = RQF_ALLOCED;
 
lockdep_assert_held(q->queue_lock);
@@ -1516,6 +1516,7 @@ static struct request *__get_request(struct request_list 
*rl, unsigned int op,
  * @op: operation and flags
  * @bio: bio to allocate request for (can be %NULL)
  * @flags: BLK_MQ_REQ_* flags.
+ * @gfp_mask: allocation mask
  *
  * Get a free request from @q.  If %__GFP_DIRECT_RECLAIM is set in @gfp_mask,
  * this function keeps retrying under memory pressure and fails iff @q is dead.
@@ -1525,7 +1526,8 @@ static struct request *__get_request(struct request_list 
*rl, unsigned int op,
  * Returns request pointer on success, with @q->queue_lock *not held*.
  */
 static struct request *get_request(struct request_queue *q, unsigned int op,
-  struct bio *bio, blk_mq_req_flags_t flags)
+  struct bio *bio, blk_mq_req_flags_t flags,
+  gfp_t gfp_mask)
 {
const bool is_sync = op_is_sync(op);
DEFINE_WAIT(wait);
@@ -1537,7 +1539,7 @@ static struct request *get_request(struct request_queue 
*q, unsigned int op,
 
rl = blk_get_rl(q, bio);	/* transferred to @rq on success */
 retry:
-   rq = __get_request(rl, op, bio, flags);
+   rq = __get_request(rl, op, bio, flags, gfp_mask);
if (!IS_ERR(rq))
return rq;
 
@@ -1575,11 +1577,10 @@ static struct request *get_request(struct request_queue 
*q, unsigned int op,
 
 /* flags: BLK_MQ_REQ_PREEMPT and/or BLK_MQ_REQ_NOWAIT. */
 static struct request *blk_old_get_request(struct request_queue *q,
-   unsigned int op, blk_mq_req_flags_t flags)
+   unsigned int op, blk_mq_req_flags_t flags,
+   gfp_t gfp_mask)
 {
struct request *rq;
-   gfp_t gfp_mask = flags & BLK_MQ_REQ_NOWAIT ? GFP_ATOMIC :
-__GFP_DIRECT_RECLAIM;
int ret = 0;
 
WARN_ON_ONCE(q->mq_ops);
@@ -1591,7 +1592,7 @@ static struct request *blk_old_get_request(struct 
request_queue 

Re: [block regression] kernel oops triggered by removing scsi device during IO

2018-04-08 Thread Ming Lei
On Mon, Apr 09, 2018 at 09:33:08AM +0800, Joseph Qi wrote:
> Hi Bart,
> 
> On 18/4/8 22:50, Bart Van Assche wrote:
> > On Sun, 2018-04-08 at 12:21 +0800, Ming Lei wrote:
> >> The following kernel oops is triggered by 'removing scsi device' during
> >> heavy IO.
> > 
> > How did you trigger this oops?
> > 
> 
> I can reproduce this oops by the following steps:
> 1) start a fio job with buffered write;
> 2) remove the scsi device fio write to:
> echo "scsi remove-single-device ${dev}" > /proc/scsi/scsi

Yeah, it can be reproduced easily, and I usually remove scsi
device via 'echo 1 > /sys/block/sda/device/delete'

Thanks,
Ming


Re: BUG at IP: blk_mq_get_request+0x23e/0x390 on 4.16.0-rc7

2018-04-08 Thread Ming Lei
On Sun, Apr 08, 2018 at 04:35:59PM +0300, Sagi Grimberg wrote:
> 
> 
> On 04/08/2018 03:57 PM, Ming Lei wrote:
> > On Sun, Apr 08, 2018 at 02:53:03PM +0300, Sagi Grimberg wrote:
> > > 
> > > > > > > > > Hi Sagi
> > > > > > > > > 
> > > > > > > > > Still can reproduce this issue with the change:
> > > > > > > > 
> > > > > > > > Thanks for validating Yi,
> > > > > > > > 
> > > > > > > > Would it be possible to test the following:
> > > > > > > > --
> > > > > > > > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > > > > > > > index 75336848f7a7..81ced3096433 100644
> > > > > > > > --- a/block/blk-mq.c
> > > > > > > > +++ b/block/blk-mq.c
> > > > > > > > @@ -444,6 +444,10 @@ struct request 
> > > > > > > > *blk_mq_alloc_request_hctx(struct
> > > > > > > > request_queue *q,
> > > > > > > >return ERR_PTR(-EXDEV);
> > > > > > > >}
> > > > > > > >cpu = cpumask_first_and(alloc_data.hctx->cpumask, 
> > > > > > > > cpu_online_mask);
> > > > > > > > +   if (cpu >= nr_cpu_ids) {
> > > > > > > > +   pr_warn("no online cpu for hctx %d\n", 
> > > > > > > > hctx_idx);
> > > > > > > > +   cpu = cpumask_first(alloc_data.hctx->cpumask);
> > > > > > > > +   }
> > > > > > > >alloc_data.ctx = __blk_mq_get_ctx(q, cpu);
> > > > > > > > 
> > > > > > > > > >rq = blk_mq_get_request(q, NULL, op, &alloc_data);
> > > > > > > > --
> > > > > > > > ...
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > [  153.384977] BUG: unable to handle kernel paging request at
> > > > > > > > > 3a9ed053bd48
> > > > > > > > > [  153.393197] IP: blk_mq_get_request+0x23e/0x390
> > > > > > > > 
> > > > > > > > Also would it be possible to provide gdb output of:
> > > > > > > > 
> > > > > > > > l *(blk_mq_get_request+0x23e)
> > > > > > > 
> > > > > > > nvmf_connect_io_queue() is used in this way by asking blk-mq to 
> > > > > > > allocate
> > > > > > > request from one specific hw queue, but there may not be all 
> > > > > > > online CPUs
> > > > > > > mapped to this hw queue.
> > > > > 
> > > > > Yes, this is what I suspect..
> > > > > 
> > > > > > And the following patchset may fail this kind of allocation and 
> > > > > > avoid
> > > > > > the kernel oops.
> > > > > > 
> > > > > > https://marc.info/?l=linux-block&m=152318091025252&w=2
> > > > > 
> > > > > Thanks Ming,
> > > > > 
> > > > > But I don't want to fail the allocation, nvmf_connect_io_queue simply
> > > > > needs a tag to issue the connect request, I much rather to take this
> > > > > tag from an online cpu than failing it... We use this because we 
> > > > > reserve
> > > > 
> > > > The failure is only triggered when there isn't any online CPU mapped to
> > > > this hctx, so do you want to wait for CPUs for this hctx becoming 
> > > > online?
> > > 
> > > I was thinking of allocating a tag from that hctx even if it had no
> > > online cpu, the execution is done on an online cpu (hence the call
> > > to blk_mq_alloc_request_hctx).
> > 
> > That can be done, but not following the current blk-mq's rule, because
> > blk-mq requires to dispatch the request on CPUs mapping to this hctx.
> > 
> > Could you explain a bit why you want to do in this way?
> 
> My device exposes nr_hw_queues which is not higher than num_online_cpus
> so I want to connect all hctxs with hope that they will be used.

The issue is that CPU online & offline can happen at any time, and after
blk-mq removed its CPU hotplug handler, there is no way to remap queues
when the CPU topology changes.

For example:

1) after nr_hw_queues is set to num_online_cpus() and the hw queues
are initialized, some CPUs become offline, and the issue reported by
Zhang Yi is triggered; in this case we should fail the allocation,
since the 1:1 mapping doesn't need to use this inactive hw queue.

2) when nr_hw_queues is set to num_online_cpus(), there may be far
fewer CPUs online at that point, so the number of hw queues is
initialized much smaller, and performance stays degraded even if more
CPUs become online later.

So the current policy is to map all possible CPUs in order to handle CPU
hotplug, and if you want a 1:1 mapping between hw queues and online
CPUs, nr_hw_queues can be set to num_possible_cpus().

Please see commit 16ccfff28976130 ("nvme: pci: pass max vectors as
num_possible_cpus() to pci_alloc_irq_vectors").

It will waste some memory, just like a percpu variable, but it
simplifies the queue mapping logic a lot, and it can support both hard
and soft CPU online/offline without a CPU hotplug handler, which could
otherwise cause very complicated queue dependency issues.
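
A rough sketch of that policy (simplified from the kernel's default queue
mapping; the real code also keeps hyperthread siblings on the same hw
queue, which this toy version ignores):

static void map_all_possible_cpus(unsigned int *mq_map,
				  unsigned int nr_queues,
				  unsigned int nr_possible_cpus)
{
	unsigned int cpu;

	/*
	 * Every possible CPU gets a mapping, even one that is offline,
	 * so no remapping is needed on hotplug; a hw queue whose CPUs
	 * are all offline simply stays inactive.
	 */
	for (cpu = 0; cpu < nr_possible_cpus; cpu++)
		mq_map[cpu] = cpu % nr_queues;
}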

> 
> I agree we don't want to connect hctx which doesn't have an online
> cpu, that's redundant, but this is not the case here.

OK, I will explain below, and it can be fixed by the following patch too:

https://marc.info/?l=linux-block&m=152318093725257&w=2

> 
> > > > Or I may understand you wrong, :-)
> > > 
> > > In the report we connected 40 hctxs (which was exactly the number of
> > > online cpus), after Yi 

Re: Multi-Actuator SAS HDD First Look

2018-04-08 Thread Tim Walker
On Fri, Apr 6, 2018 at 11:09 AM, Douglas Gilbert  wrote:
>
> On 2018-04-06 02:42 AM, Christoph Hellwig wrote:
>>
>> On Fri, Apr 06, 2018 at 08:24:18AM +0200, Hannes Reinecke wrote:
>>>
>>> Ah. Far better.
>>> What about delegating FORMAT UNIT to the control LUN, and not
>>> implementing it for the individual disk LUNs?
>>> That would make an even stronger case for having a control LUN;
>>> with that there wouldn't be any problem with having to synchronize
>>> across LUNs etc.
>>
>>
>> It sounds to me like NVMe might be a much better model for this drive
>> than SCSI, btw :)
>
>
> So you found a document that outlines NVMe's architecture! Could you
> share the url (no marketing BS, please)?
>
>
> And a serious question ... How would you map NVMe's (in Linux)
> subsystem number, controller device minor number, CNTLID field
> (Identify ctl response) and namespace id onto the SCSI subsystem's
> h:c:t:l ?
>
> Doug Gilbert
>

Hannes - yes, a system-altering drive operation like FORMAT UNIT calls
for a dedicated management port, as the NVMe folks apparently felt. But
what is the least painful endpoint type for LUN0?


-- 
Tim Walker
Product Design Systems Engineering, Seagate Technology
(303) 775-3770


Re: [block regression] kernel oops triggered by removing scsi device during IO

2018-04-08 Thread Joseph Qi
Hi Bart,

On 18/4/8 22:50, Bart Van Assche wrote:
> On Sun, 2018-04-08 at 12:21 +0800, Ming Lei wrote:
>> The following kernel oops is triggered by 'removing scsi device' during
>> heavy IO.
> 
> How did you trigger this oops?
> 

I can reproduce this oops by the following steps:
1) start a fio job with buffered write;
2) remove the scsi device fio write to:
echo "scsi remove-single-device ${dev}" > /proc/scsi/scsi


Re: usercopy whitelist woe in scsi_sense_cache

2018-04-08 Thread Oleksandr Natalenko
Hi.

Cc'ing linux-block people (mainly, Christoph) too because of 17cb960f29c2. 
Also, duplicating the initial statement for them.

With v4.16 (and now with v4.16.1) it is possible to trigger a usercopy
whitelist warning and/or BUG while running smartctl on a SATA disk with
blk-mq and BFQ enabled. The warning looks like this:

===
[  574.997022] Bad or missing usercopy whitelist? Kernel memory exposure 
attempt detected from SLUB object 'scsi_sense_cache' (offset 76, size 22)!
[  575.017332] WARNING: CPU: 0 PID: 32436 at mm/usercopy.c:81 usercopy_warn
+0x7d/0xa0
[  575.025262] Modules linked in: nls_iso8859_1 nls_cp437 vfat fat kvm_intel 
kvm bochs_drm iTCO_wdt ttm irqbypass iTCO_vendor_support ppdev drm_kms_helper 
psmouse parport_pc i2c_i801 joydev pcspkr drm parport rtc_cmos mousedev 
input_leds led_class intel_agp evdev syscopyarea qemu_fw_cfg intel_gtt 
sysfillrect mac_hid lpc_ich sysimgblt agpgart fb_sys_fops ip_tables x_tables 
xfs dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c 
crc32c_generic dm_crypt algif_skcipher af_alg hid_generic usbhid hid dm_mod 
raid10 md_mod sr_mod sd_mod cdrom uhci_hcd crct10dif_pclmul crc32_pclmul 
crc32c_intel ghash_clmulni_intel pcbc serio_raw xhci_pci ahci atkbd libps2 
ehci_pci xhci_hcd aesni_intel libahci aes_x86_64 ehci_hcd crypto_simd 
glue_helper cryptd libata usbcore usb_common i8042 serio virtio_scsi scsi_mod
[  575.068775]  virtio_blk virtio_net virtio_pci virtio_ring virtio
[  575.073935] CPU: 0 PID: 32436 Comm: smartctl Not tainted 4.16.0-pf2 #1
[  575.078078] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 
02/06/2015
[  575.082451] RIP: 0010:usercopy_warn+0x7d/0xa0
[  575.086223] RSP: 0018:9ca84aee7c40 EFLAGS: 00010286
[  575.097637] RAX:  RBX: 95199d68304c RCX: 
0001
[  575.101471] RDX: 0001 RSI: aeeb050a RDI: 

[  575.105939] RBP: 0016 R08:  R09: 
028b
[  575.110370] R10: aee854e9 R11: 0001 R12: 
0001
[  575.113269] R13: 95199d683062 R14: 95199d68304c R15: 
0016
[  575.116132] FS:  7f993d405040() GS:95199f60() knlGS:

[  575.119285] CS:  0010 DS:  ES:  CR0: 80050033
[  575.129619] CR2: 7ffe2390f0a8 CR3: 1d774004 CR4: 
00160ef0
[  575.133976] Call Trace:
[  575.136311]  __check_object_size+0x12f/0x1a0
[  575.139576]  sg_io+0x269/0x3f0
[  575.142000]  ? path_lookupat+0xaa/0x1f0
[  575.144521]  ? current_time+0x18/0x70
[  575.147006]  scsi_cmd_ioctl+0x257/0x410
[  575.149782]  ? xfs_bmapi_read+0x1c3/0x340 [xfs]
[  575.161441]  sd_ioctl+0xbf/0x1a0 [sd_mod]
[  575.165036]  blkdev_ioctl+0x8ca/0x990
[  575.168291]  ? read_null+0x10/0x10
[  575.171638]  block_ioctl+0x39/0x40
[  575.174998]  do_vfs_ioctl+0xa4/0x630
[  575.178261]  ? vfs_write+0x164/0x1a0
[  575.181410]  SyS_ioctl+0x74/0x80
[  575.190904]  do_syscall_64+0x74/0x190
[  575.195200]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  575.199267] RIP: 0033:0x7f993c984d87
[  575.201350] RSP: 002b:7ffe238aeed8 EFLAGS: 0246 ORIG_RAX: 
0010
[  575.204386] RAX: ffda RBX: 7ffe238af180 RCX: 
7f993c984d87
[  575.208349] RDX: 7ffe238aeef0 RSI: 2285 RDI: 
0003
[  575.211254] RBP: 7ffe238af1d0 R08: 0010 R09: 

[  575.220511] R10:  R11: 0246 R12: 
5637ec8e9ce0
[  575.225238] R13:  R14: 5637ec8e3550 R15: 
00da
[  575.230056] Code: 6c e4 ae 41 51 48 c7 c0 19 6e e5 ae 49 89 f1 48 0f 44 c2 
48 89 f9 4d 89 d8 4c 89 d2 48 c7 c7 70 6e e5 ae 48 89 c6 e8 c3 5c e5 ff <0f> 
0b 48 83 c4 18 c3 48 c7 c6 04 cb e4 ae 49 89 f1 49 89 f3 eb 
[  575.239027] ---[ end trace 6e3293933bdd4761 ]---
===

Usually, the warning is triggered first, and all the subsequent printouts
are BUGs because the offset gets too big to fit within the real SLAB
object size:

[ 1687.609889] usercopy: Kernel memory exposure attempt detected from SLUB 
object 'scsi_sense_cache' (offset 107, size 22)!
[ 1687.614197] [ cut here ]
[ 1687.615993] kernel BUG at mm/usercopy.c:100!

To give you an idea regarding variety of offsets, I've summarised the kernel 
log from my server:

$ sudo journalctl -kb | grep "Kernel memory exposure attempt detected" | 
grep -oE 'offset [0-9]+, size [0-9]+' | sort | uniq -c
   9 offset 107, size 22
   6 offset 108, size 22
   8 offset 109, size 22
   7 offset 110, size 22
   5 offset 111, size 22
   5 offset 112, size 22
   2 offset 113, size 22
   2 offset 114, size 22
   1 offset 115, size 22
   1 offset 116, size 22
   1 offset 119, size 22
   1 offset 85, size 22
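
For context, the hardened usercopy code that prints these messages only
allows copies that fall inside a per-cache whitelisted window declared at
cache creation time; a copy whose (offset, size) lies outside that window
triggers exactly this warning or BUG. A hedged sketch of such a
declaration (parameters illustrative):

	struct kmem_cache *sense_cache;

	/* Whitelist the first SCSI_SENSE_BUFFERSIZE bytes of each
	 * object for user copies.
	 */
	sense_cache = kmem_cache_create_usercopy("scsi_sense_cache",
			SCSI_SENSE_BUFFERSIZE, 0, SLAB_HWCACHE_ALIGN,
			0, SCSI_SENSE_BUFFERSIZE, NULL);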

So far, I wasn't able to trigger this with mq-deadline (or without
blk-mq). Maybe this has something to do with blk-mq+BFQ re-queuing, or
it's just me not being 

Re: Block layer use of __GFP flags

2018-04-08 Thread Matthew Wilcox
On Sun, Apr 08, 2018 at 04:40:59PM +, Bart Van Assche wrote:
> __GFP_KSWAPD_RECLAIM wasn't stripped off on purpose for non-atomic
> allocations. That was an oversight. 

OK, good.

> Do you perhaps want me to prepare a patch that makes blk_get_request() again
> respect the full gfp mask passed as third argument to blk_get_request()?

I think that would be a good idea.  If it's onerous to have extra arguments,
there are some bits in gfp_flags which could be used for your purposes.


Re: KASAN: use-after-free Read in disk_unblock_events

2018-04-08 Thread Eric Biggers
On Mon, 30 Oct 2017 12:32:58 -0700, syzbot wrote:
> Hello,
> 
> syzkaller hit the following crash on  
> 36ef71cae353f88fd6e095e2aaa3e5953af1685d
> git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/master
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached
> Raw console output is attached.
> C reproducer is attached
> syzkaller reproducer is attached. See https://goo.gl/kgGztJ
> for information about syzkaller reproducers
> 
> 
> BUG: KASAN: use-after-free in disk_unblock_events+0x51/0x60  
> block/genhd.c:1632
> Read of size 8 at addr 8801d9049a80 by task blkid/14807
> 
> CPU: 1 PID: 14807 Comm: blkid Not tainted 4.14.0-rc5-next-20171018+ #36
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
> Google 01/01/2011
> Call Trace:
>   __dump_stack lib/dump_stack.c:16 [inline]
>   dump_stack+0x194/0x257 lib/dump_stack.c:52
>   print_address_description+0x73/0x250 mm/kasan/report.c:252
>   kasan_report_error mm/kasan/report.c:351 [inline]
>   kasan_report+0x25b/0x340 mm/kasan/report.c:409
>   __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:430
>   disk_unblock_events+0x51/0x60 block/genhd.c:1632
>   __blkdev_get+0x78d/0xf90 fs/block_dev.c:1535
>   blkdev_get+0x3a1/0xad0 fs/block_dev.c:1591
>   blkdev_open+0x1c9/0x250 fs/block_dev.c:1749
>   do_dentry_open+0x664/0xd40 fs/open.c:752
>   vfs_open+0x107/0x220 fs/open.c:866
>   do_last fs/namei.c:3387 [inline]
>   path_openat+0x1151/0x3520 fs/namei.c:3527
>   do_filp_open+0x25b/0x3b0 fs/namei.c:3562
>   do_sys_open+0x502/0x6d0 fs/open.c:1059
>   SYSC_open fs/open.c:1077 [inline]
>   SyS_open+0x2d/0x40 fs/open.c:1072
>   entry_SYSCALL_64_fastpath+0x1f/0xbe
> RIP: 0033:0x7ff2d528f120
> RSP: 002b:7ffd153fa6d8 EFLAGS: 0246 ORIG_RAX: 0002
> RAX: ffda RBX:  RCX: 7ff2d528f120
> RDX: 7ffd153fbf35 RSI:  RDI: 7ffd153fbf35
> RBP: 0082 R08: 0078 R09: 
> R10:  R11: 0246 R12: 01fca030
> R13:  R14:  R15: 0005
> 
> Allocated by task 14811:
>   save_stack+0x43/0xd0 mm/kasan/kasan.c:447
>   set_track mm/kasan/kasan.c:459 [inline]
>   kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
>   kmem_cache_alloc_node_trace+0x150/0x750 mm/slab.c:3657
>   kmalloc_node include/linux/slab.h:540 [inline]
>   kzalloc_node include/linux/slab.h:702 [inline]
>   alloc_disk_node+0xb4/0x4e0 block/genhd.c:1375
>   alloc_disk+0x18/0x20 block/genhd.c:1359
>   loop_add+0x45c/0xa50 drivers/block/loop.c:1808
>   loop_probe+0x16d/0x1a0 drivers/block/loop.c:1915
>   kobj_lookup+0x2ac/0x410 drivers/base/map.c:124
>   get_gendisk+0x37/0x230 block/genhd.c:773
>   bd_start_claiming fs/block_dev.c:1097 [inline]
>   blkdev_get+0x12d/0xad0 fs/block_dev.c:1584
>   blkdev_open+0x1c9/0x250 fs/block_dev.c:1749
>   do_dentry_open+0x664/0xd40 fs/open.c:752
>   vfs_open+0x107/0x220 fs/open.c:866
>   do_last fs/namei.c:3387 [inline]
>   path_openat+0x1151/0x3520 fs/namei.c:3527
>   do_filp_open+0x25b/0x3b0 fs/namei.c:3562
>   do_sys_open+0x502/0x6d0 fs/open.c:1059
>   SYSC_open fs/open.c:1077 [inline]
>   SyS_open+0x2d/0x40 fs/open.c:1072
>   entry_SYSCALL_64_fastpath+0x1f/0xbe
> 
> Freed by task 14807:
>   save_stack+0x43/0xd0 mm/kasan/kasan.c:447
>   set_track mm/kasan/kasan.c:459 [inline]
>   kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524
>   __cache_free mm/slab.c:3492 [inline]
>   kfree+0xca/0x250 mm/slab.c:3807
>   disk_release+0x327/0x410 block/genhd.c:1218
>   device_release+0x7c/0x200 drivers/base/core.c:814
>   kobject_cleanup lib/kobject.c:648 [inline]
>   kobject_release lib/kobject.c:677 [inline]
>   kref_put include/linux/kref.h:70 [inline]
>   kobject_put+0x14c/0x240 lib/kobject.c:694
>   put_disk+0x23/0x30 block/genhd.c:1440
>   __blkdev_get+0x6ed/0xf90 fs/block_dev.c:1528
>   blkdev_get+0x3a1/0xad0 fs/block_dev.c:1591
>   blkdev_open+0x1c9/0x250 fs/block_dev.c:1749
>   do_dentry_open+0x664/0xd40 fs/open.c:752
>   vfs_open+0x107/0x220 fs/open.c:866
>   do_last fs/namei.c:3387 [inline]
>   path_openat+0x1151/0x3520 fs/namei.c:3527
>   do_filp_open+0x25b/0x3b0 fs/namei.c:3562
>   do_sys_open+0x502/0x6d0 fs/open.c:1059
>   SYSC_open fs/open.c:1077 [inline]
>   SyS_open+0x2d/0x40 fs/open.c:1072
>   entry_SYSCALL_64_fastpath+0x1f/0xbe
>
> The buggy address belongs to the object at 8801d9049500
>   which belongs to the cache kmalloc-2048 of size 2048
> The buggy address is located 1408 bytes inside of
>   2048-byte region [8801d9049500, 8801d9049d00)
> The buggy address belongs to the page:
> page:ea0007641200 count:1 mapcount:0 mapping:8801d9048400 index:0x0  
> compound_mapcount: 0
> flags: 0x2008100(slab|head)
> raw: 02008100 8801d9048400  00010003
> raw: ea000748e5a0 ea0007063920 8801dac00c40 
> page dumped because: kasan: bad access detected
> 
> Memory state around the 

Re: Block layer use of __GFP flags

2018-04-08 Thread Bart Van Assche
On Sat, 2018-04-07 at 23:54 -0700, Matthew Wilcox wrote:
> Please explain:
> 
> commit 6a15674d1e90917f1723a814e2e8c949000440f7
> Author: Bart Van Assche 
> Date:   Thu Nov 9 10:49:54 2017 -0800
> 
> block: Introduce blk_get_request_flags()
> 
> A side effect of this patch is that the GFP mask that is passed to
> several allocation functions in the legacy block layer is changed
> from GFP_KERNEL into __GFP_DIRECT_RECLAIM.
> 
> Why was this thought to be a good idea?  I think gfp.h is pretty clear:
> 
>  * Useful GFP flag combinations that are commonly used. It is recommended
>  * that subsystems start with one of these combinations and then set/clear
>  * __GFP_FOO flags as necessary.
> 
> Instead, the block layer now throws away all but one bit of the
> information being passed in by the callers, and all it tells the allocator
> is whether or not it can start doing direct reclaim. I can see that
> you may well be in a situation where you don't want to start more I/O,
> but your caller knows that!  Why make the allocator work harder than
> it has to?  In particular, why isn't the page allocator allowed to wake
> up kswapd to do reclaim in non-atomic context, but is when the caller
> is in atomic context?

__GFP_KSWAPD_RECLAIM wasn't stripped off on purpose for non-atomic
allocations. That was an oversight. 

Do you perhaps want me to prepare a patch that makes blk_get_request() again
respect the full gfp mask passed as third argument to blk_get_request()?

Bart.





Re: 4.15.14 crash with iscsi target and dvd

2018-04-08 Thread Wakko Warner
Wakko Warner wrote:
> Bart Van Assche wrote:
> > Have you tried to modify the kernel Makefile as indicated in the following
> > e-mail? This should make the kernel build:
> > 
> > https://lists.ubuntu.com/archives/kernel-team/2016-May/077178.html
> 
> Thanks.  That helped.
> 
> I finished with git bisect.  Here's the output:
> 84c8590646d5b35804bac60eb58b145839b5893e is the first bad commit
> commit 84c8590646d5b35804bac60eb58b145839b5893e
> Author: Ming Lei 
> Date:   Fri Nov 11 20:05:32 2016 +0800
> 
> target: avoid accessing .bi_vcnt directly
> 
> When the bio is full, bio_add_pc_page() will return zero,
> so use this information to tell when the bio is full.
> 
> Also replace access to .bi_vcnt for pr_debug() with bio_segments().
> 
> Reviewed-by: Christoph Hellwig 
> Signed-off-by: Ming Lei 
> Reviewed-by: Sagi Grimberg 
> Signed-off-by: Jens Axboe 
> 
> :040000 040000 a3ebbb71c52ee4eb8c3be4d033b81179211bf704 
> de39a328dbd1b18519946b3ad46d9302886e0dd0 M  drivers
> 
> I did a diff between HEAD^ and HEAD and manually patched the file from
> 4.15.14.  It's not an exact revert.  I'm running it now and it's working.
> I'll do a better test later on.  Here's the patch:
> 
> --- a/drivers/target/target_core_pscsi.c  2018-02-04 14:31:31.077316617 
> -0500
> +++ b/drivers/target/target_core_pscsi.c  2018-04-08 11:43:49.588641374 
> -0400
> @@ -915,7 +915,9 @@
>   bio, page, bytes, off);
>   pr_debug("PSCSI: bio->bi_vcnt: %d nr_vecs: %d\n",
>   bio_segments(bio), nr_vecs);
> - if (rc != bytes) {
> + if (rc != bytes)
> + goto fail;
> + if (bio->bi_vcnt > nr_vecs) {
>   pr_debug("PSCSI: Reached bio->bi_vcnt max:"
>   " %d i: %d bio: %p, allocating another"
>   " bio\n", bio->bi_vcnt, i, bio);
> 
> I really appreciate your time and assistance with this.

One thing I noticed after doing this is errors in the kernel log on the
initiator:
[9072625.181744] sr 26:0:0:0: [sr1] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 
driverbyte=0x08
[9072625.181802] sr 26:0:0:0: [sr1] tag#0 Sense Key : 0x2 [current] 
[9072625.181835] sr 26:0:0:0: [sr1] tag#0 ASC=0x8 ASCQ=0x0 
[9072625.181866] sr 26:0:0:0: [sr1] tag#0 CDB: opcode=0x28 28 00 00 0a 81 22 00 
00 80 00
[9072625.181919] blk_update_request: I/O error, dev sr1, sector 2753672

When doing the exact same thing on the target, no mention.  My patch may not
be right, but it doesn't cause an oops.

I'm going to try 4.16.1 and see what happens.

-- 
 Microsoft has beaten Volkswagen's world record.  Volkswagen only created 22
 million bugs.


Re: 4.15.14 crash with iscsi target and dvd

2018-04-08 Thread Wakko Warner
Bart Van Assche wrote:
> On Sat, 2018-04-07 at 12:53 -0400, Wakko Warner wrote:
> > Bart Van Assche wrote:
> > > On Thu, 2018-04-05 at 22:06 -0400, Wakko Warner wrote:
> > > > I know now why scsi_print_command isn't doing anything.  cmd->cmnd is 
> > > > null.
> > > > I added a dev_printk in scsi_print_command where the 2 if statements 
> > > > return.
> > > > Logs:
> > > > [  29.866415] sr 3:0:0:0: cmd->cmnd is NULL
> > > 
> > > That's something that should never happen. As one can see in
> > > scsi_setup_scsi_cmnd() and scsi_setup_fs_cmnd() both functions initialize
> > > that pointer. Since I have not yet been able to reproduce myself what you
> > > reported, would it be possible for you to bisect this issue? You will need
> > > to follow something like the following procedure (see also
> > > https://git-scm.com/docs/git-bisect):
> > 
> > After doing 3 successful compiles with good/bad, I got this error and was
> > not able to compile any more kernels:
> >   CC  scripts/mod/devicetable-offsets.s
> > scripts/mod/empty.c:1:0: error: code model kernel does not support PIC mode
> >  /* empty file to figure out endianness / word size */
> >  
> > scripts/mod/devicetable-offsets.c:1:0: error: code model kernel does not 
> > support PIC mode
> >  #include 
> >  
> > scripts/Makefile.build:153: recipe for target 
> > 'scripts/mod/devicetable-offsets.s' failed
> > 
> > I don't think it found the bad commit.
> 
> Have you tried to modify the kernel Makefile as indicated in the following
> e-mail? This should make the kernel build:
> 
> https://lists.ubuntu.com/archives/kernel-team/2016-May/077178.html

Thanks.  That helped.

I finished with git bisect.  Here's the output:
84c8590646d5b35804bac60eb58b145839b5893e is the first bad commit
commit 84c8590646d5b35804bac60eb58b145839b5893e
Author: Ming Lei 
Date:   Fri Nov 11 20:05:32 2016 +0800

target: avoid accessing .bi_vcnt directly

When the bio is full, bio_add_pc_page() will return zero,
so use this information to tell when the bio is full.

Also replace access to .bi_vcnt for pr_debug() with bio_segments().

Reviewed-by: Christoph Hellwig 
Signed-off-by: Ming Lei 
Reviewed-by: Sagi Grimberg 
Signed-off-by: Jens Axboe 

:040000 040000 a3ebbb71c52ee4eb8c3be4d033b81179211bf704 
de39a328dbd1b18519946b3ad46d9302886e0dd0 M  drivers

I did a diff between HEAD^ and HEAD and manually patched the file from
4.15.14.  It's not an exact revert.  I'm running it now and it's working.
I'll do a better test later on.  Here's the patch:

--- a/drivers/target/target_core_pscsi.c	2018-02-04 14:31:31.077316617 
-0500
+++ b/drivers/target/target_core_pscsi.c	2018-04-08 11:43:49.588641374 
-0400
@@ -915,7 +915,9 @@
bio, page, bytes, off);
pr_debug("PSCSI: bio->bi_vcnt: %d nr_vecs: %d\n",
bio_segments(bio), nr_vecs);
-   if (rc != bytes) {
+   if (rc != bytes)
+   goto fail;
+   if (bio->bi_vcnt > nr_vecs) {
pr_debug("PSCSI: Reached bio->bi_vcnt max:"
" %d i: %d bio: %p, allocating another"
" bio\n", bio->bi_vcnt, i, bio);

I really appreciate your time and assistance with this.

-- 
 Microsoft has beaten Volkswagen's world record.  Volkswagen only created 22
 million bugs.


Re: [block regression] kernel oops triggered by removing scsi device during IO

2018-04-08 Thread Bart Van Assche
On Sun, 2018-04-08 at 16:11 +0800, Joseph Qi wrote:
> This is because scsi_remove_device() will call blk_cleanup_queue(), and
> then all blkgs have been destroyed and root_blkg is NULL.
Thus tg is NULL and we trigger a NULL pointer dereference when getting
td from tg (tg->td).
> It seems that we cannot simply move blkcg_exit_queue() up to
> blk_cleanup_queue().

Have you considered adding a blk_queue_enter() / blk_queue_exit() pair in
generic_make_request()? blk_queue_enter() checks the DYING flag.
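
A minimal sketch of that idea (hand-written here, not the final patch),
with the enter/exit pair bracketing the whole submission:

	if (blk_queue_enter(q, flags) < 0) {
		/* The queue is dying (or entering would block). */
		bio_io_error(bio);
		return BLK_QC_T_NONE;
	}
	ret = q->make_request_fn(q, bio);
	blk_queue_exit(q);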

Thanks,

Bart.




Re: [block regression] kernel oops triggered by removing scsi device during IO

2018-04-08 Thread Bart Van Assche
On Sun, 2018-04-08 at 12:21 +0800, Ming Lei wrote:
> The following kernel oops is triggered by 'removing scsi device' during
> heavy IO.

How did you trigger this oops?

Bart.






Re: BUG at IP: blk_mq_get_request+0x23e/0x390 on 4.16.0-rc7

2018-04-08 Thread Sagi Grimberg



On 04/08/2018 03:57 PM, Ming Lei wrote:

On Sun, Apr 08, 2018 at 02:53:03PM +0300, Sagi Grimberg wrote:



Hi Sagi

Still can reproduce this issue with the change:


Thanks for validating Yi,

Would it be possible to test the following:
--
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 75336848f7a7..81ced3096433 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -444,6 +444,10 @@ struct request *blk_mq_alloc_request_hctx(struct
request_queue *q,
   return ERR_PTR(-EXDEV);
   }
   cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
+   if (cpu >= nr_cpu_ids) {
+   pr_warn("no online cpu for hctx %d\n", hctx_idx);
+   cpu = cpumask_first(alloc_data.hctx->cpumask);
+   }
   alloc_data.ctx = __blk_mq_get_ctx(q, cpu);

   rq = blk_mq_get_request(q, NULL, op, &alloc_data);
--
...



[  153.384977] BUG: unable to handle kernel paging request at
3a9ed053bd48
[  153.393197] IP: blk_mq_get_request+0x23e/0x390


Also would it be possible to provide gdb output of:

l *(blk_mq_get_request+0x23e)


nvmf_connect_io_queue() is used in this way by asking blk-mq to allocate
request from one specific hw queue, but there may not be all online CPUs
mapped to this hw queue.


Yes, this is what I suspect..


And the following patchset may fail this kind of allocation and avoid
the kernel oops.

https://marc.info/?l=linux-block&m=152318091025252&w=2


Thanks Ming,

But I don't want to fail the allocation, nvmf_connect_io_queue simply
needs a tag to issue the connect request, I much rather to take this
tag from an online cpu than failing it... We use this because we reserve


The failure is only triggered when there isn't any online CPU mapped to
this hctx, so do you want to wait for CPUs for this hctx becoming online?


I was thinking of allocating a tag from that hctx even if it had no
online cpu, the execution is done on an online cpu (hence the call
to blk_mq_alloc_request_hctx).


That can be done, but it doesn't follow the current blk-mq rule, because
blk-mq requires the request to be dispatched on a CPU mapped to this hctx.

Could you explain a bit why you want to do it this way?


My device exposes nr_hw_queues which is not higher than num_online_cpus,
so I want to connect all hctxs in the hope that they will be used.

I agree we don't want to connect hctx which doesn't have an online
cpu, that's redundant, but this is not the case here.


Or I may understand you wrong, :-)


In the report we connected 40 hctxs (which was exactly the number of
online cpus), after Yi removed 3 cpus, we tried to connect 37 hctxs.
I'm not sure why some hctxs are left without any online cpus.


That is possible after the following two commits:

4b855ad37194 ("blk-mq: Create hctx for each present CPU")
20e4d8139319 ("blk-mq: simplify queue mapping & schedule with each possisble CPU")

And this can be triggered even without putting down any CPUs.

The blk-mq CPU hotplug handler was removed in 4b855ad37194, and we can't
remap queues any more when the CPU topology changes, so the static, fixed
mapping has to be set up from the beginning.

Then if there are fewer online CPUs than hw queues, some hctxs may be
mapped only to offline CPUs. For example, if one device has 4 hw queues
but there are only 2 online CPUs and 6 offline CPUs, at most 2 hw queues
are assigned to online CPUs, and the other two are left with only
offline CPUs.


That is fine, but the problem that I gave in the example below which has 
nr_hw_queues == num_online_cpus but because of the mapping, we still

have unmapped hctxs.


Lets say I have 4-cpu system and my device always allocates
num_online_cpus() hctxs.

at first I get:
cpu0 -> hctx0
cpu1 -> hctx1
cpu2 -> hctx2
cpu3 -> hctx3

When cpu1 goes offline I think the new mapping will be:
cpu0 -> hctx0
cpu1 -> hctx0 (from cpu_to_queue_index) // offline
cpu2 -> hctx2
cpu3 -> hctx0 (from cpu_to_queue_index)

This means that now hctx1 is unmapped. I guess we can fix nvmf code
to not connect it. But we end up with less queues than cpus without
any good reason.

I would have optimally want a different mapping that will use all
the queues:
cpu0 -> hctx0
cpu2 -> hctx1
cpu3 -> hctx2
* cpu1 -> hctx1 (doesn't matter, offline)

Something looks broken...


No, it isn't broken.


maybe broken is the wrong phrase, but its suboptimal...


Storage is a client/server model; the hw queue should only be active if
there are requests coming from a client (CPU),


Correct.


and the hw queue becomes inactive if no online CPU is mapped to it.


But when we reset the controller, we call blk_mq_update_nr_hw_queues()
with the current nr_hw_queues, which never exceeds num_online_cpus.
This, in turn, remaps mq_map, which results in unmapped queues because
of the mapping function, not because we have more hctxs than online
CPUs...

An easy fix is to allocate num_present_cpus queues, and only connect
the online ones, but as you said, we have 

Re: BUG at IP: blk_mq_get_request+0x23e/0x390 on 4.16.0-rc7

2018-04-08 Thread Ming Lei
On Sun, Apr 08, 2018 at 02:53:03PM +0300, Sagi Grimberg wrote:
> 
> > > > > > > Hi Sagi
> > > > > > > 
> > > > > > > Still can reproduce this issue with the change:
> > > > > > 
> > > > > > Thanks for validating Yi,
> > > > > > 
> > > > > > Would it be possible to test the following:
> > > > > > --
> > > > > > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > > > > > index 75336848f7a7..81ced3096433 100644
> > > > > > --- a/block/blk-mq.c
> > > > > > +++ b/block/blk-mq.c
> > > > > > @@ -444,6 +444,10 @@ struct request 
> > > > > > *blk_mq_alloc_request_hctx(struct
> > > > > > request_queue *q,
> > > > > >   return ERR_PTR(-EXDEV);
> > > > > >   }
> > > > > >   cpu = cpumask_first_and(alloc_data.hctx->cpumask, 
> > > > > > cpu_online_mask);
> > > > > > +   if (cpu >= nr_cpu_ids) {
> > > > > > +   pr_warn("no online cpu for hctx %d\n", hctx_idx);
> > > > > > +   cpu = cpumask_first(alloc_data.hctx->cpumask);
> > > > > > +   }
> > > > > >   alloc_data.ctx = __blk_mq_get_ctx(q, cpu);
> > > > > > 
> > > > > >   rq = blk_mq_get_request(q, NULL, op, &alloc_data);
> > > > > > --
> > > > > > ...
> > > > > > 
> > > > > > 
> > > > > > > [  153.384977] BUG: unable to handle kernel paging request at
> > > > > > > 3a9ed053bd48
> > > > > > > [  153.393197] IP: blk_mq_get_request+0x23e/0x390
> > > > > > 
> > > > > > Also would it be possible to provide gdb output of:
> > > > > > 
> > > > > > l *(blk_mq_get_request+0x23e)
> > > > > 
> > > > > nvmf_connect_io_queue() is used in this way by asking blk-mq to 
> > > > > allocate
> > > > > request from one specific hw queue, but there may not be all online 
> > > > > CPUs
> > > > > mapped to this hw queue.
> > > 
> > > Yes, this is what I suspect..
> > > 
> > > > And the following patchset may fail this kind of allocation and avoid
> > > > the kernel oops.
> > > > 
> > > > https://marc.info/?l=linux-block&m=152318091025252&w=2
> > > 
> > > Thanks Ming,
> > > 
> > > But I don't want to fail the allocation, nvmf_connect_io_queue simply
> > > needs a tag to issue the connect request, I much rather to take this
> > > tag from an online cpu than failing it... We use this because we reserve
> > 
> > The failure is only triggered when there isn't any online CPU mapped to
> > this hctx, so do you want to wait for CPUs for this hctx becoming online?
> 
> I was thinking of allocating a tag from that hctx even if it had no
> online cpu, the execution is done on an online cpu (hence the call
> to blk_mq_alloc_request_hctx).

That can be done, but it doesn't follow the current blk-mq rule, because
blk-mq requires the request to be dispatched on a CPU mapped to this hctx.

Could you explain a bit why you want to do it this way?

> 
> > Or I may understand you wrong, :-)
> 
> In the report we connected 40 hctxs (which was exactly the number of
> online cpus), after Yi removed 3 cpus, we tried to connect 37 hctxs.
> I'm not sure why some hctxs are left without any online cpus.

That is possible after the following two commits:

4b855ad37194 ("blk-mq: Create hctx for each present CPU")
20e4d8139319 ("blk-mq: simplify queue mapping & schedule with each possisble CPU")

And this can be triggered even without putting down any CPUs.

The blk-mq CPU hotplug handler was removed in 4b855ad37194, and we can't
remap queues any more when the CPU topology changes, so the static, fixed
mapping has to be set up from the beginning.

Then if there are fewer online CPUs than hw queues, some hctxs may be
mapped only to offline CPUs. For example, if one device has 4 hw queues
but there are only 2 online CPUs and 6 offline CPUs, at most 2 hw queues
are assigned to online CPUs, and the other two are left with only
offline CPUs.

> 
> This seems to be related to the queue mapping.

Yes.

> 
> Lets say I have 4-cpu system and my device always allocates
> num_online_cpus() hctxs.
> 
> at first I get:
> cpu0 -> hctx0
> cpu1 -> hctx1
> cpu2 -> hctx2
> cpu3 -> hctx3
> 
> When cpu1 goes offline I think the new mapping will be:
> cpu0 -> hctx0
> cpu1 -> hctx0 (from cpu_to_queue_index) // offline
> cpu2 -> hctx2
> cpu3 -> hctx0 (from cpu_to_queue_index)
> 
> This means that now hctx1 is unmapped. I guess we can fix nvmf code
> to not connect it. But we end up with less queues than cpus without
> any good reason.
> 
> I would have optimally want a different mapping that will use all
> the queues:
> cpu0 -> hctx0
> cpu2 -> hctx1
> cpu3 -> hctx2
> * cpu1 -> hctx1 (doesn't matter, offline)
> 
> Something looks broken...

No, it isn't broken.

Storage is a client/server model; the hw queue should only be active if
there are requests coming from a client (CPU), and the hw queue becomes
inactive if no online CPU is mapped to it.

That is why the normal rule is that request allocation needs CPU context
info; the hctx is then obtained via the queue mapping.
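
A sketch of that rule as it looks on the normal allocation path
(simplified, using internal helpers of that era):

	/* Pick the software context of the submitting CPU... */
	struct blk_mq_ctx *ctx = blk_mq_get_ctx(q);

	/* ...then map it to a hardware context, so a request normally
	 * always lands on an hctx that has an online CPU.
	 */
	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);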

Thanks
Ming


Re: BUG at IP: blk_mq_get_request+0x23e/0x390 on 4.16.0-rc7

2018-04-08 Thread Sagi Grimberg



Hi Sagi

Still can reproduce this issue with the change:


Thanks for validating Yi,

Would it be possible to test the following:
--
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 75336848f7a7..81ced3096433 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -444,6 +444,10 @@ struct request *blk_mq_alloc_request_hctx(struct
request_queue *q,
  return ERR_PTR(-EXDEV);
  }
  cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
+   if (cpu >= nr_cpu_ids) {
+   pr_warn("no online cpu for hctx %d\n", hctx_idx);
+   cpu = cpumask_first(alloc_data.hctx->cpumask);
+   }
  alloc_data.ctx = __blk_mq_get_ctx(q, cpu);

  rq = blk_mq_get_request(q, NULL, op, &alloc_data);
--
...



[  153.384977] BUG: unable to handle kernel paging request at
3a9ed053bd48
[  153.393197] IP: blk_mq_get_request+0x23e/0x390


Also would it be possible to provide gdb output of:

l *(blk_mq_get_request+0x23e)


nvmf_connect_io_queue() is used in this way by asking blk-mq to allocate
request from one specific hw queue, but there may not be all online CPUs
mapped to this hw queue.


Yes, this is what I suspect..


And the following patchset may fail this kind of allocation and avoid
the kernel oops.

https://marc.info/?l=linux-block&m=152318091025252&w=2


Thanks Ming,

But I don't want to fail the allocation, nvmf_connect_io_queue simply
needs a tag to issue the connect request, I much rather to take this
tag from an online cpu than failing it... We use this because we reserve


The failure is only triggered when there isn't any online CPU mapped to
this hctx, so do you want to wait for CPUs for this hctx becoming online?


I was thinking of allocating a tag from that hctx even if it had no
online cpu, the execution is done on an online cpu (hence the call
to blk_mq_alloc_request_hctx).


Or I may understand you wrong, :-)


In the report we connected 40 hctxs (which was exactly the number of
online cpus), after Yi removed 3 cpus, we tried to connect 37 hctxs.
I'm not sure why some hctxs are left without any online cpus.

This seems to be related to the queue mapping.

Lets say I have 4-cpu system and my device always allocates
num_online_cpus() hctxs.

at first I get:
cpu0 -> hctx0
cpu1 -> hctx1
cpu2 -> hctx2
cpu3 -> hctx3

When cpu1 goes offline I think the new mapping will be:
cpu0 -> hctx0
cpu1 -> hctx0 (from cpu_to_queue_index) // offline
cpu2 -> hctx2
cpu3 -> hctx0 (from cpu_to_queue_index)

This means that now hctx1 is unmapped. I guess we can fix nvmf code
to not connect it. But we end up with less queues than cpus without
any good reason.

I would have optimally want a different mapping that will use all
the queues:
cpu0 -> hctx0
cpu2 -> hctx1
cpu3 -> hctx2
* cpu1 -> hctx1 (doesn't matter, offline)

Something looks broken...
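
The example can be reproduced with a toy model of the mapping (a
standalone sketch; the real mapping also considers thread siblings, so
the exact per-CPU indices may differ):

#include <stdio.h>

int main(void)
{
	/* 4 possible CPUs; nr_hw_queues recomputed as num_online_cpus()
	 * == 3 after cpu1 went offline.
	 */
	const int nr_possible_cpus = 4, nr_queues = 3;
	const int online[] = { 1, 0, 1, 1 };	/* cpu1 offline */
	int queue_has_online_cpu[3] = { 0, 0, 0 };

	for (int cpu = 0; cpu < nr_possible_cpus; cpu++) {
		int hctx = cpu % nr_queues;	/* simplified mapping */

		printf("cpu%d -> hctx%d%s\n", cpu, hctx,
		       online[cpu] ? "" : " (offline)");
		if (online[cpu])
			queue_has_online_cpu[hctx] = 1;
	}
	for (int q = 0; q < nr_queues; q++)
		if (!queue_has_online_cpu[q])
			printf("hctx%d has no online CPU mapped\n", q);
	return 0;
}

With cpu1 offline, hctx1 ends up with no online CPU, matching the
unmapped-hctx situation described above.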


Re: BUG at IP: blk_mq_get_request+0x23e/0x390 on 4.16.0-rc7

2018-04-08 Thread Ming Lei
On Sun, Apr 08, 2018 at 01:58:49PM +0300, Sagi Grimberg wrote:
> 
> > > > > Hi Sagi
> > > > > 
> > > > > Still can reproduce this issue with the change:
> > > > 
> > > > Thanks for validating Yi,
> > > > 
> > > > Would it be possible to test the following:
> > > > --
> > > > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > > > index 75336848f7a7..81ced3096433 100644
> > > > --- a/block/blk-mq.c
> > > > +++ b/block/blk-mq.c
> > > > @@ -444,6 +444,10 @@ struct request *blk_mq_alloc_request_hctx(struct
> > > > request_queue *q,
> > > >  return ERR_PTR(-EXDEV);
> > > >  }
> > > >  cpu = cpumask_first_and(alloc_data.hctx->cpumask, 
> > > > cpu_online_mask);
> > > > +   if (cpu >= nr_cpu_ids) {
> > > > +   pr_warn("no online cpu for hctx %d\n", hctx_idx);
> > > > +   cpu = cpumask_first(alloc_data.hctx->cpumask);
> > > > +   }
> > > >  alloc_data.ctx = __blk_mq_get_ctx(q, cpu);
> > > > 
> > > >  rq = blk_mq_get_request(q, NULL, op, &alloc_data);
> > > > --
> > > > ...
> > > > 
> > > > 
> > > > > [  153.384977] BUG: unable to handle kernel paging request at
> > > > > 3a9ed053bd48
> > > > > [  153.393197] IP: blk_mq_get_request+0x23e/0x390
> > > > 
> > > > Also would it be possible to provide gdb output of:
> > > > 
> > > > l *(blk_mq_get_request+0x23e)
> > > 
> > > nvmf_connect_io_queue() is used in this way by asking blk-mq to allocate
> > > request from one specific hw queue, but there may not be all online CPUs
> > > mapped to this hw queue.
> 
> Yes, this is what I suspect..
> 
> > And the following patchset may fail this kind of allocation and avoid
> > the kernel oops.
> > 
> > https://marc.info/?l=linux-block&m=152318091025252&w=2
> 
> Thanks Ming,
> 
> But I don't want to fail the allocation, nvmf_connect_io_queue simply
> needs a tag to issue the connect request, I much rather to take this
> tag from an online cpu than failing it... We use this because we reserve

The failure is only triggered when there isn't any online CPU mapped to
this hctx, so do you want to wait for this hctx's CPUs to become online?

Or I may have understood you wrong. :-)

> a tag per-queue for this, but in this case, I'd rather block until the
> inflight tag complete than failing the connect.

No, there can't be any inflight request for this hctx.


Thanks,
Ming


Re: BUG at IP: blk_mq_get_request+0x23e/0x390 on 4.16.0-rc7

2018-04-08 Thread Sagi Grimberg



Hi Sagi

Still can reproduce this issue with the change:


Thanks for validating Yi,

Would it be possible to test the following:
--
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 75336848f7a7..81ced3096433 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -444,6 +444,10 @@ struct request *blk_mq_alloc_request_hctx(struct
request_queue *q,
 return ERR_PTR(-EXDEV);
 }
 cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
+   if (cpu >= nr_cpu_ids) {
+   pr_warn("no online cpu for hctx %d\n", hctx_idx);
+   cpu = cpumask_first(alloc_data.hctx->cpumask);
+   }
 alloc_data.ctx = __blk_mq_get_ctx(q, cpu);

 rq = blk_mq_get_request(q, NULL, op, &alloc_data);
--
...



[  153.384977] BUG: unable to handle kernel paging request at
3a9ed053bd48
[  153.393197] IP: blk_mq_get_request+0x23e/0x390


Also would it be possible to provide gdb output of:

l *(blk_mq_get_request+0x23e)


nvmf_connect_io_queue() is used in this way by asking blk-mq to allocate
request from one specific hw queue, but there may not be all online CPUs
mapped to this hw queue.


Yes, this is what I suspect..


And the following patchset may fail this kind of allocation and avoid
the kernel oops.

https://marc.info/?l=linux-block&m=152318091025252&w=2


Thanks Ming,

But I don't want to fail the allocation, nvmf_connect_io_queue simply
needs a tag to issue the connect request, I much rather to take this
tag from an online cpu than failing it... We use this because we reserve
a tag per-queue for this, but in this case, I'd rather block until the
inflight tag complete than failing the connect.


Re: BUG at IP: blk_mq_get_request+0x23e/0x390 on 4.16.0-rc7

2018-04-08 Thread Ming Lei
On Sun, Apr 08, 2018 at 06:44:33PM +0800, Ming Lei wrote:
> On Sun, Apr 08, 2018 at 01:36:27PM +0300, Sagi Grimberg wrote:
> > 
> > > Hi Sagi
> > > 
> > > Still can reproduce this issue with the change:
> > 
> > Thanks for validating Yi,
> > 
> > Would it be possible to test the following:
> > --
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 75336848f7a7..81ced3096433 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -444,6 +444,10 @@ struct request *blk_mq_alloc_request_hctx(struct
> > request_queue *q,
> > return ERR_PTR(-EXDEV);
> > }
> > cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
> > +   if (cpu >= nr_cpu_ids) {
> > +   pr_warn("no online cpu for hctx %d\n", hctx_idx);
> > +   cpu = cpumask_first(alloc_data.hctx->cpumask);
> > +   }
> > alloc_data.ctx = __blk_mq_get_ctx(q, cpu);
> > 
> > rq = blk_mq_get_request(q, NULL, op, &alloc_data);
> > --
> > ...
> > 
> > 
> > > [  153.384977] BUG: unable to handle kernel paging request at
> > > 3a9ed053bd48
> > > [  153.393197] IP: blk_mq_get_request+0x23e/0x390
> > 
> > Also would it be possible to provide gdb output of:
> > 
> > l *(blk_mq_get_request+0x23e)
> 
> nvmf_connect_io_queue() is used in this way, asking blk-mq to allocate
> a request from one specific hw queue, but the CPUs mapped to that hw
> queue may all be offline.

And the following patchset makes this kind of allocation fail cleanly,
avoiding the kernel oops:

https://marc.info/?l=linux-block=152318091025252=2

Thanks,
Ming


Re: BUG at IP: blk_mq_get_request+0x23e/0x390 on 4.16.0-rc7

2018-04-08 Thread Ming Lei
On Sun, Apr 08, 2018 at 01:36:27PM +0300, Sagi Grimberg wrote:
> 
> > Hi Sagi
> > 
> > I can still reproduce this issue with the change:
> 
> Thanks for validating Yi,
> 
> Would it be possible to test the following:
> --
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 75336848f7a7..81ced3096433 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -444,6 +444,10 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
> return ERR_PTR(-EXDEV);
> }
> cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
> +   if (cpu >= nr_cpu_ids) {
> +   pr_warn("no online cpu for hctx %d\n", hctx_idx);
> +   cpu = cpumask_first(alloc_data.hctx->cpumask);
> +   }
> alloc_data.ctx = __blk_mq_get_ctx(q, cpu);
> 
> rq = blk_mq_get_request(q, NULL, op, &alloc_data);
> --
> ...
> 
> 
> > [  153.384977] BUG: unable to handle kernel paging request at 3a9ed053bd48
> > [  153.393197] IP: blk_mq_get_request+0x23e/0x390
> 
> Also would it be possible to provide gdb output of:
> 
> l *(blk_mq_get_request+0x23e)

nvmf_connect_io_queue() is used in this way, asking blk-mq to allocate
a request from one specific hw queue, but the CPUs mapped to that hw
queue may all be offline.

Thanks,
Ming
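
To make the pattern Ming describes concrete, here is a hedged sketch of
a caller allocating a request on one specific hw queue; the opcode and
flags are placeholders, not what nvme-fabrics actually passes:

#include <linux/blk-mq.h>

/*
 * Sketch only: ask blk-mq for a request on one specific hw queue
 * (qid), the way the fabrics connect path does. If every CPU mapped
 * to that hctx is offline, the allocator has no valid sw-queue
 * context to attach the request to -- the case this thread debates
 * (fail with -EXDEV vs. fall back to any mapped CPU).
 */
static struct request *example_alloc_on_hctx(struct request_queue *q,
					     unsigned int qid)
{
	return blk_mq_alloc_request_hctx(q, REQ_OP_DRV_OUT,
					 BLK_MQ_REQ_RESERVED, qid);
}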


Re: BUG at IP: blk_mq_get_request+0x23e/0x390 on 4.16.0-rc7

2018-04-08 Thread Sagi Grimberg



Hi Sagi

I can still reproduce this issue with the change:


Thanks for validating Yi,

Would it be possible to test the following:
--
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 75336848f7a7..81ced3096433 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -444,6 +444,10 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
return ERR_PTR(-EXDEV);
}
cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
+   if (cpu >= nr_cpu_ids) {
+   pr_warn("no online cpu for hctx %d\n", hctx_idx);
+   cpu = cpumask_first(alloc_data.hctx->cpumask);
+   }
alloc_data.ctx = __blk_mq_get_ctx(q, cpu);

rq = blk_mq_get_request(q, NULL, op, &alloc_data);
--
...


[  153.384977] BUG: unable to handle kernel paging request at 3a9ed053bd48

[  153.393197] IP: blk_mq_get_request+0x23e/0x390


Also would it be possible to provide gdb output of:

l *(blk_mq_get_request+0x23e)

Thanks,


Re: [block regression] kernel oops triggered by removing scsi device during IO

2018-04-08 Thread Ming Lei
On Sun, Apr 08, 2018 at 05:25:42PM +0800, Ming Lei wrote:
> On Sun, Apr 08, 2018 at 04:11:51PM +0800, Joseph Qi wrote:
> > This is because scsi_remove_device() will call blk_cleanup_queue(), and
> > then all blkgs have been destroyed and root_blkg is NULL.
> > Thus tg is NULL and we trigger a NULL pointer dereference when getting
> > td from tg (tg->td).
> > It seems that we cannot simply move blkcg_exit_queue() up to
> > blk_cleanup_queue().
> 
> Maybe one per-queue blkcg should be introduced, which seems reasonable
> too.

Sorry, I mean one per-queue blkcg lock.

-- 
Ming


[PATCH 8/8] blk-mq: remove code for dealing with remapping queue

2018-04-08 Thread Ming Lei
Firstly, since commit 4b855ad37194 ("blk-mq: Create hctx for each present
CPU"), blk-mq doesn't remap queues any more after the CPU topology changes.

Secondly, set->nr_hw_queues can't be bigger than nr_cpu_ids, and now we map
all possible CPUs to hw queues, so at least one CPU is mapped to each hctx.

So the queue mapping has become static and fixed, just like a percpu
variable, and we don't need to handle queue remapping any more.

Cc: Christian Borntraeger 
Cc: Christoph Hellwig 
Cc: Stefan Haberland 
Signed-off-by: Ming Lei 
---
 block/blk-mq.c | 34 +++---
 1 file changed, 3 insertions(+), 31 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3b4ce83a0ba2..c3964a79638e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2330,7 +2330,7 @@ static void blk_mq_free_map_and_requests(struct blk_mq_tag_set *set,
 
 static void blk_mq_map_swqueue(struct request_queue *q)
 {
-   unsigned int i, hctx_idx;
+   unsigned int i;
struct blk_mq_hw_ctx *hctx;
struct blk_mq_ctx *ctx;
struct blk_mq_tag_set *set = q->tag_set;
@@ -2347,23 +2347,8 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 
/*
 * Map software to hardware queues.
-*
-* If the cpu isn't present, the cpu is mapped to first hctx.
 */
for_each_possible_cpu(i) {
-   hctx_idx = q->mq_map[i];
-   /* unmapped hw queue can be remapped after CPU topo changed */
-   if (!set->tags[hctx_idx] &&
-   !__blk_mq_alloc_rq_map(set, hctx_idx)) {
-   /*
-* If tags initialization fail for some hctx,
-* that hctx won't be brought online.  In this
-* case, remap the current ctx to hctx[0] which
-* is guaranteed to always have tags allocated
-*/
-   q->mq_map[i] = 0;
-   }
-
ctx = per_cpu_ptr(q->queue_ctx, i);
hctx = blk_mq_map_queue(q, i);
 
@@ -2375,21 +2360,8 @@ static void blk_mq_map_swqueue(struct request_queue *q)
mutex_unlock(&q->sysfs_lock);
 
queue_for_each_hw_ctx(q, hctx, i) {
-   /*
-* If no software queues are mapped to this hardware queue,
-* disable it and free the request entries.
-*/
-   if (!hctx->nr_ctx) {
-   /* Never unmap queue 0.  We need it as a
-* fallback in case of a new remap fails
-* allocation
-*/
-   if (i && set->tags[i])
-   blk_mq_free_map_and_requests(set, i);
-
-   hctx->tags = NULL;
-   continue;
-   }
+   /* every hctx should get mapped by at least one CPU */
+   WARN_ON(!hctx->nr_ctx);
 
hctx->tags = set->tags[i];
WARN_ON(!hctx->tags);
-- 
2.9.5



[PATCH 6/8] blk-mq: don't check queue mapped in __blk_mq_delay_run_hw_queue()

2018-04-08 Thread Ming Lei
There are several reasons for removing the check:

1) blk_mq_hw_queue_mapped() now always returns true, since each hctx
is mapped by at least one CPU

2) when there isn't any online CPU mapped to this hctx, there won't
be any IO queued to it, and blk_mq_run_hw_queue() only runs the queue
if there is IO queued to the hctx (see the sketch after this list)

3) if __blk_mq_delay_run_hw_queue() is called from blk_mq_delay_run_hw_queue(),
which is run from blk_mq_dispatch_rq_list() or scsi_mq_get_budget(),
the hctx to be handled has to be mapped already.
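
A hedged paraphrase of the run-queue guard behind reason 2 -- a
simplification of the 4.16 logic (blk_mq_hctx_has_pending() is the
internal helper), not the literal code:

/*
 * Reason 2, sketched: the run-queue path bails out when the hctx has
 * nothing queued, so a hctx whose CPUs are all offline (and which
 * therefore receives no IO) is never actually run.
 */
static void sketch_run_hw_queue(struct blk_mq_hw_ctx *hctx)
{
	if (!blk_mq_hctx_has_pending(hctx))	/* no IO queued: nothing to do */
		return;

	blk_mq_delay_run_hw_queue(hctx, 0);	/* dispatch what is queued */
}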

Cc: Christian Borntraeger 
Cc: Christoph Hellwig 
Cc: Stefan Haberland 
Signed-off-by: Ming Lei 
---
 block/blk-mq.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 8743e9105612..3b4ce83a0ba2 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1394,9 +1394,6 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
 static void __blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async,
unsigned long msecs)
 {
-   if (WARN_ON_ONCE(!blk_mq_hw_queue_mapped(hctx)))
-   return;
-
if (unlikely(blk_mq_hctx_stopped(hctx)))
return;
 
-- 
2.9.5



[PATCH 7/8] blk-mq: reimplement blk_mq_hw_queue_mapped

2018-04-08 Thread Ming Lei
Now 'queue mapped' actually means that at least one online CPU is
mapped to the hctx, so implement blk_mq_hw_queue_mapped() accordingly.

Cc: Christian Borntraeger 
Cc: Christoph Hellwig 
Cc: Stefan Haberland 
Signed-off-by: Ming Lei 
---
 block/blk-mq.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-mq.h b/block/blk-mq.h
index 88c558f71819..502af371b83b 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -181,7 +181,7 @@ static inline bool blk_mq_hctx_stopped(struct blk_mq_hw_ctx *hctx)
 
 static inline bool blk_mq_hw_queue_mapped(struct blk_mq_hw_ctx *hctx)
 {
-   return hctx->nr_ctx && hctx->tags;
+   return cpumask_first_and(hctx->cpumask, cpu_online_mask) < nr_cpu_ids;
 }
 
 void blk_mq_in_flight(struct request_queue *q, struct hd_struct *part,
-- 
2.9.5



[PATCH 4/8] blk-mq: introduce blk_mq_hw_queue_first_cpu() to figure out first cpu

2018-04-08 Thread Ming Lei
This patch introduces the helper blk_mq_first_mapped_cpu() for
figuring out the hctx's first mapped cpu, so that code duplication
can be avoided.

Cc: Christian Borntraeger 
Cc: Christoph Hellwig 
Cc: Stefan Haberland 
Signed-off-by: Ming Lei 
---
 block/blk-mq.c | 23 +++
 1 file changed, 11 insertions(+), 12 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index a16efa6f2e7f..e3d02af79010 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1336,6 +1336,15 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
hctx_unlock(hctx, srcu_idx);
 }
 
+static inline int blk_mq_first_mapped_cpu(struct blk_mq_hw_ctx *hctx)
+{
+   int cpu = cpumask_first_and(hctx->cpumask, cpu_online_mask);
+
+   if (cpu >= nr_cpu_ids)
+   cpu = cpumask_first(hctx->cpumask);
+   return cpu;
+}
+
 /*
  * It'd be great if the workqueue API had a way to pass
  * in a mask and had some smarts for more clever placement.
@@ -1355,14 +1364,7 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
next_cpu = cpumask_next_and(next_cpu, hctx->cpumask,
cpu_online_mask);
if (next_cpu >= nr_cpu_ids)
-   next_cpu = cpumask_first_and(hctx->cpumask, cpu_online_mask);
-
-   /*
-* No online CPU is found, so have to make sure hctx->next_cpu
-* is set correctly for not breaking workqueue.
-*/
-   if (next_cpu >= nr_cpu_ids)
-   next_cpu = cpumask_first(hctx->cpumask);
+   next_cpu = blk_mq_first_mapped_cpu(hctx);
hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
}
 
@@ -2431,10 +2433,7 @@ static void blk_mq_map_swqueue(struct request_queue *q)
/*
 * Initialize batch roundrobin counts
 */
-   hctx->next_cpu = cpumask_first_and(hctx->cpumask,
-   cpu_online_mask);
-   if (hctx->next_cpu >= nr_cpu_ids)
-   hctx->next_cpu = cpumask_first(hctx->cpumask);
+   hctx->next_cpu = blk_mq_first_mapped_cpu(hctx);
hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
}
 }
-- 
2.9.5



[PATCH 3/8] blk-mq: avoid to write intermediate result to hctx->next_cpu

2018-04-08 Thread Ming Lei
This patch figures out the final selected CPU first and writes it to
hctx->next_cpu exactly once, so that other dispatch paths cannot
observe an intermediate next_cpu value.
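
The idiom is general: keep every intermediate candidate in a local
variable and store to the shared field exactly once. A self-contained
sketch with hypothetical names, not the patch itself:

struct picker {
	int next;		/* shared: read by other dispatch paths */
	int candidates[4];
	int nr;
};

static int pick_next(struct picker *p)
{
	int next = p->next;	/* all scratch work happens on a local copy */
	int i;

	for (i = 0; i < p->nr; i++) {
		if (p->candidates[i] > next) {
			next = p->candidates[i];	/* intermediate, never visible */
			break;
		}
	}
	if (i == p->nr)
		next = p->candidates[0];	/* fallback, still local */

	p->next = next;		/* publish the final choice exactly once */
	return next;
}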

Cc: Christian Borntraeger 
Cc: Christoph Hellwig 
Cc: Stefan Haberland 
Signed-off-by: Ming Lei 
---
 block/blk-mq.c | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9b220dc415ac..a16efa6f2e7f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1345,26 +1345,24 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
 {
bool tried = false;
+   int next_cpu = hctx->next_cpu;
 
if (hctx->queue->nr_hw_queues == 1)
return WORK_CPU_UNBOUND;
 
if (--hctx->next_cpu_batch <= 0) {
-   int next_cpu;
 select_cpu:
-   next_cpu = cpumask_next_and(hctx->next_cpu, hctx->cpumask,
+   next_cpu = cpumask_next_and(next_cpu, hctx->cpumask,
cpu_online_mask);
if (next_cpu >= nr_cpu_ids)
-   next_cpu = cpumask_first_and(hctx->cpumask,cpu_online_mask);
+   next_cpu = cpumask_first_and(hctx->cpumask, cpu_online_mask);
 
/*
 * No online CPU is found, so have to make sure hctx->next_cpu
 * is set correctly for not breaking workqueue.
 */
if (next_cpu >= nr_cpu_ids)
-   hctx->next_cpu = cpumask_first(hctx->cpumask);
-   else
-   hctx->next_cpu = next_cpu;
+   next_cpu = cpumask_first(hctx->cpumask);
hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
}
 
@@ -1372,7 +1370,7 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
 * Do unbound schedule if we can't find a online CPU for this hctx,
 * and it should only happen in the path of handling CPU DEAD.
 */
-   if (!cpu_online(hctx->next_cpu)) {
+   if (!cpu_online(next_cpu)) {
if (!tried) {
tried = true;
goto select_cpu;
@@ -1382,10 +1380,13 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
 * Make sure to re-select CPU next time once after CPUs
 * in hctx->cpumask become online again.
 */
+   hctx->next_cpu = next_cpu;
hctx->next_cpu_batch = 1;
return WORK_CPU_UNBOUND;
}
-   return hctx->next_cpu;
+
+   hctx->next_cpu = next_cpu;
+   return next_cpu;
 }
 
 static void __blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async,
-- 
2.9.5



[PATCH 2/8] blk-mq: don't keep offline CPUs mapped to hctx 0

2018-04-08 Thread Ming Lei
Since commit 4b855ad37194 ("blk-mq: Create hctx for each present CPU"),
blk-mq doesn't remap queues after the CPU topology changes. That means
that when some of these offline CPUs become online, they are still
mapped to hctx 0, and hctx 0 may then become the bottleneck of IO
dispatch and completion.

This patch sets up the mapping for all possible CPUs from the beginning,
aligning with the queue mapping used for PCI devices
(blk_mq_pci_map_queues()), as illustrated in the sketch below.
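
With the cpu_present() special case gone, every possible CPU gets a
plain round-robin queue assignment. A small standalone illustration
(the CPU and queue counts are made up):

#include <stdio.h>

/* Mirrors cpu_to_queue_index() after this patch: plain modulo,
 * applied to every possible CPU, online or not. */
int main(void)
{
	const unsigned int nr_queues = 2, nr_cpus = 8;	/* example sizes */
	unsigned int cpu;

	for (cpu = 0; cpu < nr_cpus; cpu++)
		printf("cpu%u -> hctx%u\n", cpu, cpu % nr_queues);
	/* cpu 0,2,4,6 -> hctx0; cpu 1,3,5,7 -> hctx1 */
	return 0;
}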

Fixes: 4b855ad37194 ("blk-mq: Create hctx for each present CPU")
Cc: Christian Borntraeger 
Cc: Christoph Hellwig 
Cc: Stefan Haberland 
Cc: Keith Busch 
Cc: sta...@vger.kernel.org
Signed-off-by: Ming Lei 
---
 block/blk-mq-cpumap.c | 5 -
 1 file changed, 5 deletions(-)

diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 9f8cffc8a701..3eb169f15842 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -16,11 +16,6 @@
 
 static int cpu_to_queue_index(unsigned int nr_queues, const int cpu)
 {
-   /*
-* Non present CPU will be mapped to queue index 0.
-*/
-   if (!cpu_present(cpu))
-   return 0;
return cpu % nr_queues;
 }
 
-- 
2.9.5



[PATCH 1/8] blk-mq: make sure that correct hctx->next_cpu is set

2018-04-08 Thread Ming Lei
Since commit 20e4d81393196 ("blk-mq: simplify queue mapping & schedule
with each possisble CPU"), one hctx may be mapped only from offline
CPUs, and then hctx->next_cpu can be set incorrectly.

This patch fixes this issue by making hctx->next_cpu point to the
first CPU in hctx->cpumask if all CPUs in hctx->cpumask are offline.

Cc: Christian Borntraeger 
Cc: Christoph Hellwig 
Cc: Stefan Haberland 
Fixes: 20e4d81393196 ("blk-mq: simplify queue mapping & schedule with each possisble CPU")
Cc: sta...@vger.kernel.org
Signed-off-by: Ming Lei 
---
 block/blk-mq.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index f5c7dbcb954f..9b220dc415ac 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2432,6 +2432,8 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 */
hctx->next_cpu = cpumask_first_and(hctx->cpumask,
cpu_online_mask);
+   if (hctx->next_cpu >= nr_cpu_ids)
+   hctx->next_cpu = cpumask_first(hctx->cpumask);
hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
}
 }
-- 
2.9.5



[PATCH 0/8] blk-mq: fix and improve queue mapping

2018-04-08 Thread Ming Lei
Hi Jens,

The first two patches fix issues in queue mapping.

The other 6 patches improve queue mapping for blk-mq.

Christian, these patches should fix your issue, so please give them
a test; the patches can be found in the following tree:

https://github.com/ming1/linux/commits/v4.17-rc-blk-fix_mapping_v1

Thanks,
Ming

Ming Lei (8):
  blk-mq: make sure that correct hctx->next_cpu is set
  blk-mq: don't keep offline CPUs mapped to hctx 0
  blk-mq: avoid to write intermediate result to hctx->next_cpu
  blk-mq: introduce blk_mq_hw_queue_first_cpu() to figure out first cpu
  blk-mq: remove blk_mq_delay_queue()
  blk-mq: don't check queue mapped in __blk_mq_delay_run_hw_queue()
  blk-mq: reimplement blk_mq_hw_queue_mapped
  blk-mq: remove code for dealing with remapping queue

 block/blk-mq-cpumap.c  |   5 ---
 block/blk-mq-debugfs.c |   1 -
 block/blk-mq.c | 101 +++--
 block/blk-mq.h |   2 +-
 include/linux/blk-mq.h |   2 -
 5 files changed, 24 insertions(+), 87 deletions(-)

-- 
2.9.5



Re: [block regression] kernel oops triggered by removing scsi device during IO

2018-04-08 Thread Ming Lei
On Sun, Apr 08, 2018 at 04:11:51PM +0800, Joseph Qi wrote:
> This is because scsi_remove_device() will call blk_cleanup_queue(), and
> then all blkgs have been destroyed and root_blkg is NULL.
> Thus tg is NULL and we trigger a NULL pointer dereference when getting
> td from tg (tg->td).
> It seems that we cannot simply move blkcg_exit_queue() up to
> blk_cleanup_queue().

Maybe one per-queue blkcg should be introduced, which seems reasonable
too.

Thanks,
Ming


[PATCH] blk-throttle: fix typo in blkg_print_stat_ios() comments

2018-04-08 Thread Joseph Qi
Signed-off-by: Joseph Qi 
---
 block/blk-cgroup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 1c16694..87367d4 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -640,7 +640,7 @@ int blkg_print_stat_bytes(struct seq_file *sf, void *v)
 EXPORT_SYMBOL_GPL(blkg_print_stat_bytes);
 
 /**
- * blkg_print_stat_bytes - seq_show callback for blkg->stat_ios
+ * blkg_print_stat_ios - seq_show callback for blkg->stat_ios
  * @sf: seq_file to print to
  * @v: unused
  *
-- 
1.8.3.1



Re: [block regression] kernel oops triggered by removing scsi device during IO

2018-04-08 Thread Joseph Qi
This is because scsi_remove_device() will call blk_cleanup_queue(), and
then all blkgs have been destroyed and root_blkg is NULL.
Thus tg is NULL and we trigger a NULL pointer dereference when getting
td from tg (tg->td).
It seems that we cannot simply move blkcg_exit_queue() up to
blk_cleanup_queue().
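
A simplified paraphrase of the failing chain -- the names follow the
4.16 sources, but this is a sketch, not the literal code:

/*
 * By the time this runs, blk_cleanup_queue() -> blkcg_exit_queue()
 * has destroyed every blkg and cleared q->root_blkg while IO is
 * still in flight.
 */
static bool sketch_blk_throtl_bio(struct request_queue *q,
				  struct blkcg_gq *blkg, struct bio *bio)
{
	/* blkg and q->root_blkg are both NULL after teardown, and
	 * blkg_to_tg(NULL) yields NULL rather than a valid group... */
	struct throtl_grp *tg = blkg_to_tg(blkg ?: q->root_blkg);

	/* ...so this load faults; tg->td sits at a small offset into
	 * struct throtl_grp, matching CR2 = 0x28 in the oops quoted
	 * below. */
	struct throtl_data *td = tg->td;

	return td != NULL;	/* never reached in the failing case */
}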

Thanks,
Joseph

On 18/4/8 12:21, Ming Lei wrote:
> Hi,
> 
> The following kernel oops is triggered by 'removing scsi device' during
> heavy IO.
> 
> 'git bisect' shows that commit a063057d7c731cffa7d10740 ("block: Fix a race
> between request queue removal and the block cgroup controller")
> introduced this regression:
> 
> [   42.268257] BUG: unable to handle kernel NULL pointer dereference at 
> 0028
> [   42.269339] PGD 26bd9f067 P4D 26bd9f067 PUD 26bfec067 PMD 0 
> [   42.270077] Oops:  [#1] PREEMPT SMP NOPTI
> [   42.270681] Dumping ftrace buffer:
> [   42.271141](ftrace buffer empty)
> [   42.271641] Modules linked in: scsi_debug iTCO_wdt iTCO_vendor_support 
> crc32c_intel i2c_i801 i2c_core lpc_ich mfd_core usb_storage nvme shpchp 
> nvme_core virtio_scsi qemu_fw_cfg ip_tables
> [   42.273770] CPU: 5 PID: 1076 Comm: fio Not tainted 4.16.0+ #49
> [   42.274530] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
> 1.10.2-2.fc27 04/01/2014
> [   42.275634] RIP: 0010:blk_throtl_bio+0x41/0x904
> [   42.276225] RSP: 0018:c900033cfaa0 EFLAGS: 00010246
> [   42.276907] RAX: 8000 RBX: 8801bdcc5118 RCX: 
> 0001
> [   42.277818] RDX: 8801bdcc5118 RSI:  RDI: 
> 8802641f8870
> [   42.278733] RBP:  R08: 0001 R09: 
> c900033cfb94
> [   42.279651] R10: c900033cfc00 R11: 06ea R12: 
> 8802641f8870
> [   42.280567] R13: 88026f34f000 R14:  R15: 
> 8801bdcc5118
> [   42.281489] FS:  7fc123922d40() GS:880272f4() 
> knlGS:
> [   42.282525] CS:  0010 DS:  ES:  CR0: 80050033
> [   42.283270] CR2: 0028 CR3: 00026d7ac004 CR4: 
> 007606e0
> [   42.284194] DR0:  DR1:  DR2: 
> 
> [   42.285116] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [   42.286036] PKRU: 5554
> [   42.286393] Call Trace:
> [   42.286725]  ? try_to_wake_up+0x3a3/0x3c9
> [   42.287255]  ? blk_mq_hctx_notify_dead+0x135/0x135
> [   42.287880]  ? gup_pud_range+0xb5/0x7e1
> [   42.288381]  generic_make_request_checks+0x3cf/0x539
> [   42.289027]  ? gup_pgd_range+0x8e/0xaa
> [   42.289515]  generic_make_request+0x38/0x25b
> [   42.290078]  ? submit_bio+0x103/0x11f
> [   42.290555]  submit_bio+0x103/0x11f
> [   42.291018]  ? bio_iov_iter_get_pages+0xe4/0x104
> [   42.291620]  blkdev_direct_IO+0x2a3/0x3af
> [   42.292151]  ? kiocb_free+0x34/0x34
> [   42.292607]  ? ___preempt_schedule+0x16/0x18
> [   42.293168]  ? preempt_schedule_common+0x4c/0x65
> [   42.293771]  ? generic_file_read_iter+0x96/0x110
> [   42.294377]  generic_file_read_iter+0x96/0x110
> [   42.294962]  aio_read+0xca/0x13b
> [   42.295388]  ? preempt_count_add+0x6d/0x8c
> [   42.295926]  ? aio_read_events+0x287/0x2d6
> [   42.296460]  ? do_io_submit+0x4d2/0x62c
> [   42.296964]  do_io_submit+0x4d2/0x62c
> [   42.297446]  ? do_syscall_64+0x9d/0x15e
> [   42.297950]  do_syscall_64+0x9d/0x15e
> [   42.298431]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> [   42.299090] RIP: 0033:0x7fc12244e687
> [   42.299556] RSP: 002b:7ffe18388a68 EFLAGS: 0202 ORIG_RAX: 
> 00d1
> [   42.300528] RAX: ffda RBX: 7fc0fde08670 RCX: 
> 7fc12244e687
> [   42.301442] RDX: 01d1b388 RSI: 0001 RDI: 
> 7fc123782000
> [   42.302359] RBP: 22d8 R08: 0001 R09: 
> 01c461e0
> [   42.303275] R10:  R11: 0202 R12: 
> 7fc0fde08670
> [   42.304195] R13:  R14: 01d1d0c0 R15: 
> 01b872f0
> [   42.305117] Code: 48 85 f6 48 89 7c 24 10 75 0e 48 8b b7 b8 05 00 00 31 ed 
> 48 85 f6 74 0f 48 63 05 75 a4 e4 00 48 8b ac c6 28 02 00 00 f6 43 15 02 <48> 
> 8b 45 28 48 89 04 24 0f 85 28 08 00 00 8b 43 10 45 31 e4 83 
> [   42.307553] RIP: blk_throtl_bio+0x41/0x904 RSP: c900033cfaa0
> [   42.308328] CR2: 0028
> [   42.308920] ---[ end trace f53a144979f63b29 ]---
> [   42.309520] Kernel panic - not syncing: Fatal exception
> [   42.310635] Dumping ftrace buffer:
> [   42.311087](ftrace buffer empty)
> [   42.311583] Kernel Offset: disabled
> [   42.312163] ---[ end Kernel panic - not syncing: Fatal exception ]---
> 


Block layer use of __GFP flags

2018-04-08 Thread Matthew Wilcox

Please explain:

commit 6a15674d1e90917f1723a814e2e8c949000440f7
Author: Bart Van Assche 
Date:   Thu Nov 9 10:49:54 2017 -0800

block: Introduce blk_get_request_flags()

A side effect of this patch is that the GFP mask that is passed to
several allocation functions in the legacy block layer is changed
from GFP_KERNEL into __GFP_DIRECT_RECLAIM.

Why was this thought to be a good idea?  I think gfp.h is pretty clear:

 * Useful GFP flag combinations that are commonly used. It is recommended
 * that subsystems start with one of these combinations and then set/clear
 * __GFP_FOO flags as necessary.

Instead, the block layer now throws away all but one bit of the
information being passed in by the callers, and all it tells the allocator
is whether or not it can start doing direct reclaim.  I can see that
you may well be in a situation where you don't want to start more I/O,
but your caller knows that!  Why make the allocator work harder than
it has to?  In particular, why isn't the page allocator allowed to wake
up kswapd to do reclaim in non-atomic context, but is when the caller
is in atomic context?
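
For reference, the relevant combinations, simplified from
include/linux/gfp.h (the internal __force casts are omitted):

/* Simplified from include/linux/gfp.h: */
#define __GFP_RECLAIM	(__GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM)
#define GFP_KERNEL	(__GFP_RECLAIM | __GFP_IO | __GFP_FS)
#define GFP_NOIO	(__GFP_RECLAIM)

/*
 * Reducing a caller's GFP_KERNEL to bare __GFP_DIRECT_RECLAIM thus
 * drops __GFP_KSWAPD_RECLAIM (no kswapd wakeup), __GFP_IO and
 * __GFP_FS -- exactly the information this mail is asking about.
 */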

This changelog is woefully short on detail.  It says what you're doing,
but not why you're doing it.  And now I have no idea and I have to ask you
what you were thinking at the time.  Please be more considerate in future.