Re: Oops when completing request on the wrong queue

2016-08-19 Thread Gabriel Krisman Bertazi
Gabriel Krisman Bertazi <kris...@linux.vnet.ibm.com> writes:

> We, IBM, have been experiencing eventual Oops when stressing IO at the
> same time we add/remove processors.  The Oops happens in the IRQ path,
> when we try to complete a request that was apparently meant for another
> queue.
>
> In __nvme_process_cq, the driver will use the cqe.command_id and the
> nvmeq->tags to find out, via blk_mq_tag_to_rq, the request that
> initiated the IO.  Eventually, it happens that the request returned by
> that function is not initialized, and we crash inside
> __blk_mq_complete_request, as shown below.
>
> [ 2679.050701] Faulting instruction address: 0xc0549500
> [ 2679.050711] Oops: Kernel access of bad area, sig: 11 [#1]
> [ 2679.050716] SMP NR_CPUS=2048 NUMA pSeries
> [ 2679.050727] Modules linked in: minix nls_iso8859_1 xfs libcrc32c
> rpcsec_gss_krb5 auth_rpcgss
> nfsv4 nfs lockd grace fscache binfmt_misc pseries_rng rtc_generic sunrpc
> autofs4 btrfs xor
> raid6_pq ibmvscsi ibmveth nvme
> [ 2679.050771] CPU: 21 PID: 45045 Comm: ppc64_cpu Not tainted 
> 4.4.0-22-generic #40-Ubuntu
> [ 2679.050780] task: c00497b07e60 ti: c004fff2c000 task.ti: 
> c00451bc
> [ 2679.050786] NIP: c0549500 LR: d46b5e84 CTR: 
> c0549680
> [ 2679.050803] REGS: c004fff2fa80 TRAP: 0300   Not tainted  
> (4.4.0-22-generic)
> [ 2679.050807] MSR: 80019033 <SF,EE,ME,IR,DR,RI,LE>  CR: 28022428  
> XER: 
> [ 2679.050824] CFAR: c0008468 DAR: 00b0 DSISR: 4000 
> SOFTE: 0 
> GPR00: d46b5e84 c004fff2fd00 c15b4300 c002799a0a00 
> GPR04:  0105 0004 0001 
> GPR08: c002799a0a50   d46bdc68 
> GPR12: c0549680 c7b1c780 0008 0001 
> GPR16:    10005b50 
> GPR20: 10005dc8 c15eaa60 0001 0002 
> GPR24: 01e3    
> GPR28:   c004f8cd5180 c002799a0a00 
> [ 2679.050908] NIP [c0549500] __blk_mq_complete_request+0x30/0x1b0
> [ 2679.050924] LR [d46b5e84] __nvme_process_cq+0xf4/0x2c0 [nvme]
> [ 2679.050930] Call Trace:
> [ 2679.050941] [c004fff2fd00] [c004fff2fd40] 0xc004fff2fd40 
> (unreliable)
> [ 2679.050952] [c004fff2fd40] [d46b5e84] 
> __nvme_process_cq+0xf4/0x2c0 [nvme]
> [ 2679.050961] [c004fff2fde0] [d46b613c] nvme_irq+0x3c/0xb0 [nvme]
> [ 2679.050972] [c004fff2fe10] [c0130660] 
> handle_irq_event_percpu+0xa0/0x320
> [ 2679.050981] [c004fff2fed0] [c0130948] 
> handle_irq_event+0x68/0xc0
> [ 2679.050989] [c004fff2ff00] [c0135c2c] 
> handle_fasteoi_irq+0xec/0x2b0
> [ 2679.050997] [c004fff2ff30] [c012f844] 
> generic_handle_irq+0x54/0x80
> [ 2679.051007] [c004fff2ff60] [c0011320] __do_irq+0x80/0x1d0
> [ 2679.051020] [c004fff2ff90] [c0024800] call_do_irq+0x14/0x24
> [ 2679.051037] [c00451bc3350] [c0011508] do_IRQ+0x98/0x140
> [ 2679.051054] [c00451bc33a0] [c0002594]
> hardware_interrupt_common+0x114/0x180
>
> I added some debugging which skipped the Oops and made me think that we
> are fetching the wrong request because we are actually looking at the
> wrong set of tags.
>
> The log below exemplifies what I am saying.  Instead of crashing in the
> Oops, I made the execution skip the request completion and simply
> consider that odd CQE as handled.  The first line of the log does that.
>
>>  nvme nvme0: Skip completion of I/O 309 on queue 35 due to missing q
>>  nvme nvme0: I/O 309 QID 63 timeout, aborting
>>  nvme nvme0: Abort status: 0x0
>>  nvme nvme0: I/O 309 QID 63 timeout, reset controller
>>  nvme nvme0: completing aborted command with status: fffc
>>  nvme 0003:03:00.0: Using 64-bit DMA iommu bypass
>>  blk_update_request: I/O error, dev nvme0n1, sector 2105537536
>
> In the first line, we found the request 309 for queue 35, which would
> have crashed the execution.  My debug patch just ignores it.
>
> Then, we can see that eventually, the same IO, 309, will timeout in
> another QID, which was actually expecting for it.  The Abort request
> gets sent and completed, but we never receive the completion of the
> aborted request, thus we timeout again and restart the device.
>
> This only happens when we are changing the SMT settings very
> frequently, which points back to the way we correlate the hctx->tags to
> the nvmeq structure, but I failed to find the exact code

Re: Oops when completing request on the wrong queue

2016-08-19 Thread Gabriel Krisman Bertazi
tempt to
reproduce right now, but I'll look for one and see how it goes.


#!/bin/bash

MAXCPUs=159
STATE=1

while true; do
for i in $(seq 1 ${MAXCPUS}); do
if (($i%8)); then
echo $STATE > /sys/devices/system/cpu/cpu$i/online
fi
done
if [[ $STATE -eq 1 ]]; then
    STATE=0
else
STATE=1
fi
done

-- 
Gabriel Krisman Bertazi

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Oops when completing request on the wrong queue

2016-08-29 Thread Gabriel Krisman Bertazi
Jens Axboe <ax...@kernel.dk> writes:
>> Can you try this patch? It's not perfect, but I'll be interested if it
>> makes a difference for you.
>

Hi Jens,

Sorry for the delay.  I just got back to this and have been running your
patch on top of 4.8 without a crash for over 1 hour.  I wanna give it
more time to make sure it's running properly, though.

Let me get back to you after a few more rounds of test.

> This one should handle the WARN_ON() for running the hw queue on the
> wrong CPU as well.

On the workaround you added to prevent WARN_ON, we surely need to
prevent blk_mq_hctx_next_cpu from scheduling dead cpus in the first
place, right..  How do you feel about the following RFC?  I know it's
not a complete fix, but it feels like a good improvement to me.

http://www.spinics.net/lists/linux-scsi/msg98608.html

-- 
Gabriel Krisman Bertazi

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/6] genwqe: use pci_irq_allocate_vectors

2016-09-29 Thread Gabriel Krisman Bertazi
Christoph Hellwig <h...@lst.de> writes:

> On Thu, Sep 29, 2016 at 03:45:29PM -0300, Gabriel Krisman Bertazi wrote:
>> I'm stepping up to assist with the genwqe_card driver just now, since we
>> (ibm) missed some of the last patches that went in.  I'll add myself to
>> maintainers file.
>
> Can your forward it to Greg together with whatever other changes are
> pending for the driver?

sure, will do.

-- 
Gabriel Krisman Bertazi

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/6] genwqe: use pci_irq_allocate_vectors

2016-09-29 Thread Gabriel Krisman Bertazi
Christoph Hellwig <h...@lst.de> writes:

> On Thu, Sep 29, 2016 at 03:28:02PM -0300, Gabriel Krisman Bertazi wrote:
>> Christoph Hellwig <h...@lst.de> writes:
>> 
>> > Simply the interrupt setup by using the new PCI layer helpers.
>> 
>> Good clean up.  Tested and:
>> 
>> Acked-by: Gabriel Krisman Bertazi <kris...@linux.vnet.ibm.com>
>
> Which tree should this go in through?

I'd say Greg's char-misc tree.

I'm stepping up to assist with the genwqe_card driver just now, since we
(ibm) missed some of the last patches that went in.  I'll add myself to
maintainers file.


-- 
Gabriel Krisman Bertazi

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] blk-mq: Always schedule hctx->next_cpu

2016-09-27 Thread Gabriel Krisman Bertazi
Commit 0e87e58bf60e ("blk-mq: improve warning for running a queue on the
wrong CPU") attempts to avoid triggering the WARN_ON in
__blk_mq_run_hw_queue when the expected CPU is dead.  Problem is, in the
last batch execution before round robin, blk_mq_hctx_next_cpu can
schedule a dead CPU and also update next_cpu to the next alive CPU in
the mask, which will trigger the WARN_ON despite the previous
workaround.

The following patch fixes this scenario by always scheduling the value
in hctx->next_cpu.  This changes the moment when we round-robin the CPU
running the hctx, but it really doesn't matter, since it still executes
BLK_MQ_CPU_WORK_BATCH times in a row before switching to another CPU.

Fixes: 0e87e58bf60e ("blk-mq: improve warning for running a queue on the wrong 
CPU")
Signed-off-by: Gabriel Krisman Bertazi <kris...@linux.vnet.ibm.com>
---
 block/blk-mq.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 0be5577b0d56..367d21215345 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -883,7 +883,7 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
return WORK_CPU_UNBOUND;
 
if (--hctx->next_cpu_batch <= 0) {
-   int cpu = hctx->next_cpu, next_cpu;
+   int next_cpu;
 
next_cpu = cpumask_next(hctx->next_cpu, hctx->cpumask);
if (next_cpu >= nr_cpu_ids)
@@ -891,8 +891,6 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
 
hctx->next_cpu = next_cpu;
hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
-
-   return cpu;
}
 
return hctx->next_cpu;
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 1/2] blk-mq: Fix failed allocation path when mapping queues

2016-11-22 Thread Gabriel Krisman Bertazi
In blk_mq_map_swqueue, there is a memory optimization that frees the
tags of a queue that has gone unmapped.  Later, if that hctx is remapped
after another topology change, the tags need to be reallocated.

If this allocation fails, a simple WARN_ON triggers, but the block layer
ends up with an active hctx without any corresponding set of tags.
Then, any income IO to that hctx can trigger an Oops.

I can reproduce it consistently by running IO, flipping CPUs on and off
and eventually injecting a memory allocation failure in that path.

In the fix below, if the system experiences a failed allocation of any
hctx's tags, we remap all the ctxs of that queue to the hctx_0, which
should always keep it's tags.  There is a minor performance hit, since
our mapping just got worse after the error path, but this is
the simplest solution to handle this error path.  The performance hit
will disappear after another successful remap.

I considered dropping the memory optimization all together, but it
seemed a bad trade-off to handle this very specific error case.

My one concern about this patch is if remapping an arbitrary queue to
hctx_0 could result in outstanding requests getting submitted to the
wrong hctx.  I couldn't observe this happening during tests, but I'm not
entirely sure it'll never happen.  I believe the queue will be empty if
we are trying to allocate tags for it, unless it was using another alive
hctx queue and for some reason got reassigned to this new hctx.  is this
possible at all?

This should apply cleanly on top of Jen's for-next branch.

The Oops is the one below:

SP (3fff935ce4d0) is in userspace
1:mon> e
cpu 0x1: Vector: 300 (Data Access) at [c00fe99eb110]
pc: c05e868c: __sbitmap_queue_get+0x2c/0x180
lr: c0575328: __bt_get+0x48/0xd0
sp: c00fe99eb390
   msr: 90010280b033
   dar: 28
 dsisr: 4000
  current = 0xc00fe9966800
  paca= 0xc7e80300   softe: 0irq_happened: 0x01
pid   = 11035, comm = aio-stress
Linux version 4.8.0-rc6+ (root@bean) (gcc version 5.4.0 20160609
(Ubuntu/IBM 5.4.0-6ubuntu1~16.04.2) ) #3 SMP Mon Oct 10 20:16:53 CDT 2016
1:mon> s
[c00fe99eb3d0] c0575328 __bt_get+0x48/0xd0
[c00fe99eb400] c0575838 bt_get.isra.1+0x78/0x2d0
[c00fe99eb480] c0575cb4 blk_mq_get_tag+0x44/0x100
[c00fe99eb4b0] c056f6f4 __blk_mq_alloc_request+0x44/0x220
[c00fe99eb500] c0570050 blk_mq_map_request+0x100/0x1f0
[c00fe99eb580] c0574650 blk_mq_make_request+0xf0/0x540
[c00fe99eb640] c0561c44 generic_make_request+0x144/0x230
[c00fe99eb690] c0561e00 submit_bio+0xd0/0x200
[c00fe99eb740] c03ef740 ext4_io_submit+0x90/0xb0
[c00fe99eb770] c03e95d8 ext4_writepages+0x588/0xdd0
[c00fe99eb910] c025a9f0 do_writepages+0x60/0xc0
[c00fe99eb940] c0246c88 __filemap_fdatawrite_range+0xf8/0x180
[c00fe99eb9e0] c0246f90 filemap_write_and_wait_range+0x70/0xf0
[c00fe99eba20] c03dd844 ext4_sync_file+0x214/0x540
[c00fe99eba80] c0364718 vfs_fsync_range+0x78/0x130
[c00fe99ebad0] c03dd46c ext4_file_write_iter+0x35c/0x430
[c00fe99ebb90] c038c280 aio_run_iocb+0x3b0/0x450
[c00fe99ebce0] c038dc28 do_io_submit+0x368/0x730
[c00fe99ebe30] c0009404 system_call+0x38/0xec

Signed-off-by: Gabriel Krisman Bertazi <kris...@linux.vnet.ibm.com>
Cc: Brian King <brk...@linux.vnet.ibm.com>
Cc: Douglas Miller <dougm...@linux.vnet.ibm.com>
Cc: linux-block@vger.kernel.org
Cc: linux-s...@vger.kernel.org
---
 block/blk-mq.c | 21 +++--
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index a6af66ae01e5..8cc58f23b2c2 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1870,7 +1870,7 @@ static void blk_mq_init_cpu_queues(struct request_queue 
*q,
 static void blk_mq_map_swqueue(struct request_queue *q,
   const struct cpumask *online_mask)
 {
-   unsigned int i;
+   unsigned int i, hctx_idx;
struct blk_mq_hw_ctx *hctx;
struct blk_mq_ctx *ctx;
struct blk_mq_tag_set *set = q->tag_set;
@@ -1893,6 +1893,15 @@ static void blk_mq_map_swqueue(struct request_queue *q,
if (!cpumask_test_cpu(i, online_mask))
continue;
 
+   hctx_idx = q->mq_map[i];
+   /* unmapped hw queue can be remapped after CPU topo changed */
+   if (!set->tags[hctx_idx]) {
+   set->tags[hctx_idx] = blk_mq_init_rq_map(set, hctx_idx);
+
+   if (!set->tags[hctx_idx])
+   q->mq_map[i] = 0;
+   }
+
ctx = per_cpu_ptr(q->queue_ctx, i);
hctx = blk_mq_map_queue(q, i);
 
@@ -1909,7 +1918,10 @@ static void blk_mq_map_swqueue(struct request_queue *q,
 

[PATCH v2 2/2] blk-mq: Avoid memory reclaim when remapping queues

2016-11-22 Thread Gabriel Krisman Bertazi
While stressing memory and IO at the same time we changed SMT settings,
we were able to consistently trigger deadlocks in the mm system, which
froze the entire machine.

I think that under memory stress conditions, the large allocations
performed by blk_mq_init_rq_map may trigger a reclaim, which stalls
waiting on the block layer remmaping completion, thus deadlocking the
system.  The trace below was collected after the machine stalled,
waiting for the hotplug event completion.

The simplest fix for this is to make allocations in this path
non-reclaimable, with GFP_NOIO.  With this patch, We couldn't hit the
issue anymore.

This should apply on top of Jen's for-next branch cleanly.

Changes since v1:
  - Use GFP_NOIO instead of GFP_NOWAIT.

 Call Trace:
[c00f0160aaf0] [c00f0160ab50] 0xc00f0160ab50 (unreliable)
[c00f0160acc0] [c0016624] __switch_to+0x2e4/0x430
[c00f0160ad20] [c0b1a880] __schedule+0x310/0x9b0
[c00f0160ae00] [c0b1af68] schedule+0x48/0xc0
[c00f0160ae30] [c0b1b4b0] schedule_preempt_disabled+0x20/0x30
[c00f0160ae50] [c0b1d4fc] __mutex_lock_slowpath+0xec/0x1f0
[c00f0160aed0] [c0b1d678] mutex_lock+0x78/0xa0
[c00f0160af00] [d00019413cac] xfs_reclaim_inodes_ag+0x33c/0x380 [xfs]
[c00f0160b0b0] [d00019415164] xfs_reclaim_inodes_nr+0x54/0x70 [xfs]
[c00f0160b0f0] [d000194297f8] xfs_fs_free_cached_objects+0x38/0x60 [xfs]
[c00f0160b120] [c03172c8] super_cache_scan+0x1f8/0x210
[c00f0160b190] [c026301c] shrink_slab.part.13+0x21c/0x4c0
[c00f0160b2d0] [c0268088] shrink_zone+0x2d8/0x3c0
[c00f0160b380] [c026834c] do_try_to_free_pages+0x1dc/0x520
[c00f0160b450] [c026876c] try_to_free_pages+0xdc/0x250
[c00f0160b4e0] [c0251978] __alloc_pages_nodemask+0x868/0x10d0
[c00f0160b6f0] [c0567030] blk_mq_init_rq_map+0x160/0x380
[c00f0160b7a0] [c056758c] blk_mq_map_swqueue+0x33c/0x360
[c00f0160b820] [c0567904] blk_mq_queue_reinit+0x64/0xb0
[c00f0160b850] [c056a16c] blk_mq_queue_reinit_notify+0x19c/0x250
[c00f0160b8a0] [c00f5d38] notifier_call_chain+0x98/0x100
[c00f0160b8f0] [c00c5fb0] __cpu_notify+0x70/0xe0
[c00f0160b930] [c00c63c4] notify_prepare+0x44/0xb0
[c00f0160b9b0] [c00c52f4] cpuhp_invoke_callback+0x84/0x250
[c00f0160ba10] [c00c570c] cpuhp_up_callbacks+0x5c/0x120
[c00f0160ba60] [c00c7cb8] _cpu_up+0xf8/0x1d0
[c00f0160bac0] [c00c7eb0] do_cpu_up+0x120/0x150
[c00f0160bb40] [c06fe024] cpu_subsys_online+0x64/0xe0
[c00f0160bb90] [c06f5124] device_online+0xb4/0x120
[c00f0160bbd0] [c06f5244] online_store+0xb4/0xc0
[c00f0160bc20] [c06f0a68] dev_attr_store+0x68/0xa0
[c00f0160bc60] [c03ccc30] sysfs_kf_write+0x80/0xb0
[c00f0160bca0] [c03cbabc] kernfs_fop_write+0x17c/0x250
[c00f0160bcf0] [c030fe6c] __vfs_write+0x6c/0x1e0
[c00f0160bd90] [c0311490] vfs_write+0xd0/0x270
[c00f0160bde0] [c03131fc] SyS_write+0x6c/0x110
[c00f0160be30] [c0009204] system_call+0x38/0xec

Signed-off-by: Gabriel Krisman Bertazi <kris...@linux.vnet.ibm.com>
Cc: Brian King <brk...@linux.vnet.ibm.com>
Cc: Douglas Miller <dougm...@linux.vnet.ibm.com>
Cc: linux-block@vger.kernel.org
Cc: linux-s...@vger.kernel.org
---
 block/blk-mq.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 8cc58f23b2c2..db0eac3b9538 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1605,7 +1605,7 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct 
blk_mq_tag_set *set,
INIT_LIST_HEAD(>page_list);
 
tags->rqs = kzalloc_node(set->queue_depth * sizeof(struct request *),
-GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY,
+GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
 set->numa_node);
if (!tags->rqs) {
blk_mq_free_tags(tags);
@@ -1631,7 +1631,7 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct 
blk_mq_tag_set *set,
 
do {
page = alloc_pages_node(set->numa_node,
-   GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY | 
__GFP_ZERO,
+   GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY | 
__GFP_ZERO,
this_order);
if (page)
break;
@@ -1652,7 +1652,7 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct 
blk_mq_tag_set *set,
 * Allow kmemleak to scan these pages as they contain pointers
 * to additional allocations like via ops->init_request().
 */
-   kmemleak_alloc(p, order_to_size(this_order), 1, GFP_KE

[PATCH] blk-mq: Drop explicit timeout sync in hotplug

2016-11-28 Thread Gabriel Krisman Bertazi
Sorry for the dup.  Missed linux-block address.

>8

After commit 287922eb0b18 ("block: defer timeouts to a workqueue"),
deleting the timeout work after freezing the queue shouldn't be
necessary, since the synchronization is already enforced by the
acquisition of a q_usage_counter reference in blk_mq_timeout_work.

Signed-off-by: Gabriel Krisman Bertazi <kris...@linux.vnet.ibm.com>
Cc: Ming Lei <ming@canonical.com>
---
 block/blk-mq.c | 9 +
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9d4a1d630d0b..bac12caece06 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2220,16 +2220,9 @@ static void blk_mq_queue_reinit_work(void)
 */
list_for_each_entry(q, _q_list, all_q_node)
blk_mq_freeze_queue_start(q);
-   list_for_each_entry(q, _q_list, all_q_node) {
+   list_for_each_entry(q, _q_list, all_q_node)
blk_mq_freeze_queue_wait(q);
 
-   /*
-* timeout handler can't touch hw queue during the
-* reinitialization
-*/
-   del_timer_sync(>timeout);
-   }
-
list_for_each_entry(q, _q_list, all_q_node)
blk_mq_queue_reinit(q, _online_new);
 
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/2] blk-mq: Avoid memory reclaim when remapping queues

2016-11-14 Thread Gabriel Krisman Bertazi
While stressing memory and IO at the same time we changed SMT settings,
we were able to consistently trigger deadlocks in the mm system, which
froze the entire machine.

I think that under memory stress conditions, the large allocations
performed by blk_mq_init_rq_map may trigger a reclaim, which stalls
waiting on the block layer remmaping completion, thus deadlocking the
system.  The trace below was collected after the machine stalled,
waiting for the hotplug event completion.

The simplest fix for this is to make allocations in this path
non-reclaimable, with GFP_NOWAIT.  With this patch, We couldn't hit the
issue anymore.

This should apply on top of Jen's for-next branch cleanly.

 Call Trace:
[c00f0160aaf0] [c00f0160ab50] 0xc00f0160ab50 (unreliable)
[c00f0160acc0] [c0016624] __switch_to+0x2e4/0x430
[c00f0160ad20] [c0b1a880] __schedule+0x310/0x9b0
[c00f0160ae00] [c0b1af68] schedule+0x48/0xc0
[c00f0160ae30] [c0b1b4b0] schedule_preempt_disabled+0x20/0x30
[c00f0160ae50] [c0b1d4fc] __mutex_lock_slowpath+0xec/0x1f0
[c00f0160aed0] [c0b1d678] mutex_lock+0x78/0xa0
[c00f0160af00] [d00019413cac] xfs_reclaim_inodes_ag+0x33c/0x380 [xfs]
[c00f0160b0b0] [d00019415164] xfs_reclaim_inodes_nr+0x54/0x70 [xfs]
[c00f0160b0f0] [d000194297f8] xfs_fs_free_cached_objects+0x38/0x60 [xfs]
[c00f0160b120] [c03172c8] super_cache_scan+0x1f8/0x210
[c00f0160b190] [c026301c] shrink_slab.part.13+0x21c/0x4c0
[c00f0160b2d0] [c0268088] shrink_zone+0x2d8/0x3c0
[c00f0160b380] [c026834c] do_try_to_free_pages+0x1dc/0x520
[c00f0160b450] [c026876c] try_to_free_pages+0xdc/0x250
[c00f0160b4e0] [c0251978] __alloc_pages_nodemask+0x868/0x10d0
[c00f0160b6f0] [c0567030] blk_mq_init_rq_map+0x160/0x380
[c00f0160b7a0] [c056758c] blk_mq_map_swqueue+0x33c/0x360
[c00f0160b820] [c0567904] blk_mq_queue_reinit+0x64/0xb0
[c00f0160b850] [c056a16c] blk_mq_queue_reinit_notify+0x19c/0x250
[c00f0160b8a0] [c00f5d38] notifier_call_chain+0x98/0x100
[c00f0160b8f0] [c00c5fb0] __cpu_notify+0x70/0xe0
[c00f0160b930] [c00c63c4] notify_prepare+0x44/0xb0
[c00f0160b9b0] [c00c52f4] cpuhp_invoke_callback+0x84/0x250
[c00f0160ba10] [c00c570c] cpuhp_up_callbacks+0x5c/0x120
[c00f0160ba60] [c00c7cb8] _cpu_up+0xf8/0x1d0
[c00f0160bac0] [c00c7eb0] do_cpu_up+0x120/0x150
[c00f0160bb40] [c06fe024] cpu_subsys_online+0x64/0xe0
[c00f0160bb90] [c06f5124] device_online+0xb4/0x120
[c00f0160bbd0] [c06f5244] online_store+0xb4/0xc0
[c00f0160bc20] [c06f0a68] dev_attr_store+0x68/0xa0
[c00f0160bc60] [c03ccc30] sysfs_kf_write+0x80/0xb0
[c00f0160bca0] [c03cbabc] kernfs_fop_write+0x17c/0x250
[c00f0160bcf0] [c030fe6c] __vfs_write+0x6c/0x1e0
[c00f0160bd90] [c0311490] vfs_write+0xd0/0x270
[c00f0160bde0] [c03131fc] SyS_write+0x6c/0x110
[c00f0160be30] [c0009204] system_call+0x38/0xec

Signed-off-by: Gabriel Krisman Bertazi <kris...@linux.vnet.ibm.com>
Cc: Brian King <brk...@linux.vnet.ibm.com>
Cc: Douglas Miller <dougm...@linux.vnet.ibm.com>
Cc: linux-block@vger.kernel.org
Cc: linux-s...@vger.kernel.org
---
 block/blk-mq.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 7f7c4ba91adf..3e44303646cb 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1597,7 +1597,7 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct 
blk_mq_tag_set *set,
INIT_LIST_HEAD(>page_list);
 
tags->rqs = kzalloc_node(set->queue_depth * sizeof(struct request *),
-GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY,
+GFP_NOWAIT | __GFP_NOWARN | __GFP_NORETRY,
 set->numa_node);
if (!tags->rqs) {
blk_mq_free_tags(tags);
@@ -1623,7 +1623,7 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct 
blk_mq_tag_set *set,
 
do {
page = alloc_pages_node(set->numa_node,
-   GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY | 
__GFP_ZERO,
+   GFP_NOWAIT | __GFP_NOWARN | __GFP_NORETRY | 
__GFP_ZERO,
this_order);
if (page)
break;
@@ -1644,7 +1644,7 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct 
blk_mq_tag_set *set,
 * Allow kmemleak to scan these pages as they contain pointers
 * to additional allocations like via ops->init_request().
 */
-   kmemleak_alloc(p, order_to_size(this_order), 1, GFP_KERNEL);
+   kmemleak_alloc(p, o

[PATCH 1/2] blk-mq: Fix failed allocation path when mapping queues

2016-11-14 Thread Gabriel Krisman Bertazi
In blk_mq_map_swqueue, there is a memory optimization that frees the
tags of a queue that has gone unmapped.  Later, if that hctx is remapped
after another topology change, the tags need to be reallocated.

If this allocation fails, a simple WARN_ON triggers, but the block layer
ends up with an active hctx without any corresponding set of tags.
Then, any income IO to that hctx can trigger an Oops.

I can reproduce it consistently by running IO, flipping CPUs on and off
and eventually injecting a memory allocation failure in that path.

In the fix below, if the system experiences a failed allocation of any
hctx's tags, we remap all the ctxs of that queue to the hctx_0, which
should always keep it's tags.  There is a minor performance hit, since
our mapping just got worse after the error path, but this is
the simplest solution to handle this error path.  The performance hit
will disappear after another successful remap.

I considered dropping the memory optimization all together, but it
seemed a bad trade-off to handle this very specific error case.

My one concern about this patch is if remapping an arbitrary queue to
hctx_0 could result in outstanding requests getting submitted to the
wrong hctx.  I couldn't observe this happening during tests, but I'm not
entirely sure it'll never happen.  I believe the queue will be empty if
we are trying to allocate tags for it, unless it was using another alive
hctx queue and for some reason got reassigned to this new hctx.  is this
possible at all?

This should apply cleanly on top of Jen's for-next branch.

The Oops is the one below:

SP (3fff935ce4d0) is in userspace
1:mon> e
cpu 0x1: Vector: 300 (Data Access) at [c00fe99eb110]
pc: c05e868c: __sbitmap_queue_get+0x2c/0x180
lr: c0575328: __bt_get+0x48/0xd0
sp: c00fe99eb390
   msr: 90010280b033
   dar: 28
 dsisr: 4000
  current = 0xc00fe9966800
  paca= 0xc7e80300   softe: 0irq_happened: 0x01
pid   = 11035, comm = aio-stress
Linux version 4.8.0-rc6+ (root@bean) (gcc version 5.4.0 20160609
(Ubuntu/IBM 5.4.0-6ubuntu1~16.04.2) ) #3 SMP Mon Oct 10 20:16:53 CDT 2016
1:mon> s
[c00fe99eb3d0] c0575328 __bt_get+0x48/0xd0
[c00fe99eb400] c0575838 bt_get.isra.1+0x78/0x2d0
[c00fe99eb480] c0575cb4 blk_mq_get_tag+0x44/0x100
[c00fe99eb4b0] c056f6f4 __blk_mq_alloc_request+0x44/0x220
[c00fe99eb500] c0570050 blk_mq_map_request+0x100/0x1f0
[c00fe99eb580] c0574650 blk_mq_make_request+0xf0/0x540
[c00fe99eb640] c0561c44 generic_make_request+0x144/0x230
[c00fe99eb690] c0561e00 submit_bio+0xd0/0x200
[c00fe99eb740] c03ef740 ext4_io_submit+0x90/0xb0
[c00fe99eb770] c03e95d8 ext4_writepages+0x588/0xdd0
[c00fe99eb910] c025a9f0 do_writepages+0x60/0xc0
[c00fe99eb940] c0246c88 __filemap_fdatawrite_range+0xf8/0x180
[c00fe99eb9e0] c0246f90 filemap_write_and_wait_range+0x70/0xf0
[c00fe99eba20] c03dd844 ext4_sync_file+0x214/0x540
[c00fe99eba80] c0364718 vfs_fsync_range+0x78/0x130
[c00fe99ebad0] c03dd46c ext4_file_write_iter+0x35c/0x430
[c00fe99ebb90] c038c280 aio_run_iocb+0x3b0/0x450
[c00fe99ebce0] c038dc28 do_io_submit+0x368/0x730
[c00fe99ebe30] c0009404 system_call+0x38/0xec

Signed-off-by: Gabriel Krisman Bertazi <kris...@linux.vnet.ibm.com>
Cc: Brian King <brk...@linux.vnet.ibm.com>
Cc: Douglas Miller <dougm...@linux.vnet.ibm.com>
Cc: linux-block@vger.kernel.org
Cc: linux-s...@vger.kernel.org
---
 block/blk-mq.c | 21 +++--
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3c5aaead71bb..7f7c4ba91adf 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1862,7 +1862,7 @@ static void blk_mq_init_cpu_queues(struct request_queue 
*q,
 static void blk_mq_map_swqueue(struct request_queue *q,
   const struct cpumask *online_mask)
 {
-   unsigned int i;
+   unsigned int i, hctx_idx;
struct blk_mq_hw_ctx *hctx;
struct blk_mq_ctx *ctx;
struct blk_mq_tag_set *set = q->tag_set;
@@ -1885,6 +1885,15 @@ static void blk_mq_map_swqueue(struct request_queue *q,
if (!cpumask_test_cpu(i, online_mask))
continue;
 
+   hctx_idx = q->mq_map[i];
+   /* unmapped hw queue can be remapped after CPU topo changed */
+   if (!set->tags[hctx_idx]) {
+   set->tags[hctx_idx] = blk_mq_init_rq_map(set, hctx_idx);
+
+   if (!set->tags[hctx_idx])
+   q->mq_map[i] = 0;
+   }
+
ctx = per_cpu_ptr(q->queue_ctx, i);
hctx = blk_mq_map_queue(q, i);
 
@@ -1901,7 +1910,10 @@ static void blk_mq_map_swqueue(struct request_queue *q,
 

[PATCH RFC 1/2] blk-mq: Fix failed allocation path when mapping queues

2016-10-13 Thread Gabriel Krisman Bertazi
In blk_mq_map_swqueue, there is a memory optimization that frees the
tags of a queue that has gone unmapped.  Later, if that hctx is remapped
after another topology change, the tags need to be reallocated.

If this allocation fails, a simple WARN_ON triggers, but the block layer
ends up with an active hctx without any corresponding set of tags.
Then, any income IO to that hctx can trigger an Oops.

I can reproduce it consistently by running IO, flipping CPUs on and off
and eventually injecting a memory allocation failure in that path.

In the fix below, if the system experiences a failed allocation of any
hctx's tags, we remap all the ctxs of that queue to the hctx_0, which
should always keep it's tags.  There is a minor performance hit, since
our mapping just got worse after the error path, but this is
the simplest solution to handle this error path.  The performance hit
will disappear after another successful remap.

I considered dropping the memory optimization all together, but it
seemed a bad trade-off to handle this very specific error case.

This should apply cleanly on top of Jen's for-next branch.

The Oops is the one below:

SP (3fff935ce4d0) is in userspace
1:mon> e
cpu 0x1: Vector: 300 (Data Access) at [c00fe99eb110]
pc: c05e868c: __sbitmap_queue_get+0x2c/0x180
lr: c0575328: __bt_get+0x48/0xd0
sp: c00fe99eb390
   msr: 90010280b033
   dar: 28
 dsisr: 4000
  current = 0xc00fe9966800
  paca= 0xc7e80300   softe: 0irq_happened: 0x01
pid   = 11035, comm = aio-stress
Linux version 4.8.0-rc6+ (root@bean) (gcc version 5.4.0 20160609
(Ubuntu/IBM 5.4.0-6ubuntu1~16.04.2) ) #3 SMP Mon Oct 10 20:16:53 CDT 2016
1:mon> s
[c00fe99eb3d0] c0575328 __bt_get+0x48/0xd0
[c00fe99eb400] c0575838 bt_get.isra.1+0x78/0x2d0
[c00fe99eb480] c0575cb4 blk_mq_get_tag+0x44/0x100
[c00fe99eb4b0] c056f6f4 __blk_mq_alloc_request+0x44/0x220
[c00fe99eb500] c0570050 blk_mq_map_request+0x100/0x1f0
[c00fe99eb580] c0574650 blk_mq_make_request+0xf0/0x540
[c00fe99eb640] c0561c44 generic_make_request+0x144/0x230
[c00fe99eb690] c0561e00 submit_bio+0xd0/0x200
[c00fe99eb740] c03ef740 ext4_io_submit+0x90/0xb0
[c00fe99eb770] c03e95d8 ext4_writepages+0x588/0xdd0
[c00fe99eb910] c025a9f0 do_writepages+0x60/0xc0
[c00fe99eb940] c0246c88 __filemap_fdatawrite_range+0xf8/0x180
[c00fe99eb9e0] c0246f90 filemap_write_and_wait_range+0x70/0xf0
[c00fe99eba20] c03dd844 ext4_sync_file+0x214/0x540
[c00fe99eba80] c0364718 vfs_fsync_range+0x78/0x130
[c00fe99ebad0] c03dd46c ext4_file_write_iter+0x35c/0x430
[c00fe99ebb90] c038c280 aio_run_iocb+0x3b0/0x450
[c00fe99ebce0] c038dc28 do_io_submit+0x368/0x730
[c00fe99ebe30] c0009404 system_call+0x38/0xec
--- Exception: c01 (System Call) at 3fff95810818

Signed-off-by: Gabriel Krisman Bertazi <kris...@linux.vnet.ibm.com>
Cc: Brian King <brk...@linux.vnet.ibm.com>
Cc: Jens Axboe <ax...@fb.com>
---
 block/blk-mq.c | 21 +++--
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index f247a18351f0..f19e3b8c9e15 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1744,7 +1744,7 @@ static void blk_mq_init_cpu_queues(struct request_queue 
*q,
 static void blk_mq_map_swqueue(struct request_queue *q,
   const struct cpumask *online_mask)
 {
-   unsigned int i;
+   unsigned int i, hctx_idx;
struct blk_mq_hw_ctx *hctx;
struct blk_mq_ctx *ctx;
struct blk_mq_tag_set *set = q->tag_set;
@@ -1767,6 +1767,15 @@ static void blk_mq_map_swqueue(struct request_queue *q,
if (!cpumask_test_cpu(i, online_mask))
continue;
 
+   hctx_idx = q->mq_map[i];
+   /* unmapped hw queue can be remapped after CPU topo changed */
+   if (!set->tags[hctx_idx]) {
+   set->tags[hctx_idx] = blk_mq_init_rq_map(set, hctx_idx);
+
+   if (!set->tags[hctx_idx])
+   q->mq_map[i] = 0;
+   }
+
ctx = per_cpu_ptr(q->queue_ctx, i);
hctx = blk_mq_map_queue(q, i);
 
@@ -1783,7 +1792,10 @@ static void blk_mq_map_swqueue(struct request_queue *q,
 * disable it and free the request entries.
 */
if (!hctx->nr_ctx) {
-   if (set->tags[i]) {
+   /* Never unmap queue 0.  We need it as a
+* fallback in case of a new remap fails
+* allocation. */
+   if (i && set->tags[i]) {
blk_mq_free_rq_map(set, set->

[PATCH RESEND v2 2/2] blk-mq: Avoid memory reclaim when remapping queues

2016-12-06 Thread Gabriel Krisman Bertazi
While stressing memory and IO at the same time we changed SMT settings,
we were able to consistently trigger deadlocks in the mm system, which
froze the entire machine.

I think that under memory stress conditions, the large allocations
performed by blk_mq_init_rq_map may trigger a reclaim, which stalls
waiting on the block layer remmaping completion, thus deadlocking the
system.  The trace below was collected after the machine stalled,
waiting for the hotplug event completion.

The simplest fix for this is to make allocations in this path
non-reclaimable, with GFP_NOIO.  With this patch, We couldn't hit the
issue anymore.

This should apply on top of Jen's for-next branch cleanly.

Changes since v1:
  - Use GFP_NOIO instead of GFP_NOWAIT.

 Call Trace:
[c00f0160aaf0] [c00f0160ab50] 0xc00f0160ab50 (unreliable)
[c00f0160acc0] [c0016624] __switch_to+0x2e4/0x430
[c00f0160ad20] [c0b1a880] __schedule+0x310/0x9b0
[c00f0160ae00] [c0b1af68] schedule+0x48/0xc0
[c00f0160ae30] [c0b1b4b0] schedule_preempt_disabled+0x20/0x30
[c00f0160ae50] [c0b1d4fc] __mutex_lock_slowpath+0xec/0x1f0
[c00f0160aed0] [c0b1d678] mutex_lock+0x78/0xa0
[c00f0160af00] [d00019413cac] xfs_reclaim_inodes_ag+0x33c/0x380 [xfs]
[c00f0160b0b0] [d00019415164] xfs_reclaim_inodes_nr+0x54/0x70 [xfs]
[c00f0160b0f0] [d000194297f8] xfs_fs_free_cached_objects+0x38/0x60 [xfs]
[c00f0160b120] [c03172c8] super_cache_scan+0x1f8/0x210
[c00f0160b190] [c026301c] shrink_slab.part.13+0x21c/0x4c0
[c00f0160b2d0] [c0268088] shrink_zone+0x2d8/0x3c0
[c00f0160b380] [c026834c] do_try_to_free_pages+0x1dc/0x520
[c00f0160b450] [c026876c] try_to_free_pages+0xdc/0x250
[c00f0160b4e0] [c0251978] __alloc_pages_nodemask+0x868/0x10d0
[c00f0160b6f0] [c0567030] blk_mq_init_rq_map+0x160/0x380
[c00f0160b7a0] [c056758c] blk_mq_map_swqueue+0x33c/0x360
[c00f0160b820] [c0567904] blk_mq_queue_reinit+0x64/0xb0
[c00f0160b850] [c056a16c] blk_mq_queue_reinit_notify+0x19c/0x250
[c00f0160b8a0] [c00f5d38] notifier_call_chain+0x98/0x100
[c00f0160b8f0] [c00c5fb0] __cpu_notify+0x70/0xe0
[c00f0160b930] [c00c63c4] notify_prepare+0x44/0xb0
[c00f0160b9b0] [c00c52f4] cpuhp_invoke_callback+0x84/0x250
[c00f0160ba10] [c00c570c] cpuhp_up_callbacks+0x5c/0x120
[c00f0160ba60] [c00c7cb8] _cpu_up+0xf8/0x1d0
[c00f0160bac0] [c00c7eb0] do_cpu_up+0x120/0x150
[c00f0160bb40] [c06fe024] cpu_subsys_online+0x64/0xe0
[c00f0160bb90] [c06f5124] device_online+0xb4/0x120
[c00f0160bbd0] [c06f5244] online_store+0xb4/0xc0
[c00f0160bc20] [c06f0a68] dev_attr_store+0x68/0xa0
[c00f0160bc60] [c03ccc30] sysfs_kf_write+0x80/0xb0
[c00f0160bca0] [c03cbabc] kernfs_fop_write+0x17c/0x250
[c00f0160bcf0] [c030fe6c] __vfs_write+0x6c/0x1e0
[c00f0160bd90] [c0311490] vfs_write+0xd0/0x270
[c00f0160bde0] [c03131fc] SyS_write+0x6c/0x110
[c00f0160be30] [c0009204] system_call+0x38/0xec

Signed-off-by: Gabriel Krisman Bertazi <kris...@linux.vnet.ibm.com>
Cc: Brian King <brk...@linux.vnet.ibm.com>
Cc: Douglas Miller <dougm...@linux.vnet.ibm.com>
Cc: linux-block@vger.kernel.org
Cc: linux-s...@vger.kernel.org
---
 block/blk-mq.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6718f894fbe1..5f4e452eef72 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1605,7 +1605,7 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct 
blk_mq_tag_set *set,
INIT_LIST_HEAD(>page_list);
 
tags->rqs = kzalloc_node(set->queue_depth * sizeof(struct request *),
-GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY,
+GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
 set->numa_node);
if (!tags->rqs) {
blk_mq_free_tags(tags);
@@ -1631,7 +1631,7 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct 
blk_mq_tag_set *set,
 
do {
page = alloc_pages_node(set->numa_node,
-   GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY | 
__GFP_ZERO,
+   GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY | 
__GFP_ZERO,
this_order);
if (page)
break;
@@ -1652,7 +1652,7 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct 
blk_mq_tag_set *set,
 * Allow kmemleak to scan these pages as they contain pointers
 * to additional allocations like via ops->init_request().
 */
-   kmemleak_alloc(p, order_to_size(this_order), 1, GFP_KE

Re: [PATCH v2 1/2] blk-mq: Fix failed allocation path when mapping queues

2016-12-01 Thread Gabriel Krisman Bertazi
Gabriel Krisman Bertazi <kris...@linux.vnet.ibm.com> writes:

> My one concern about this patch is if remapping an arbitrary queue to
> hctx_0 could result in outstanding requests getting submitted to the
> wrong hctx.  I couldn't observe this happening during tests, but I'm not
> entirely sure it'll never happen.  I believe the queue will be empty if
> we are trying to allocate tags for it, unless it was using another alive
> hctx queue and for some reason got reassigned to this new hctx.  is this
> possible at all?

I no longer believe this case to be an issue for accepting this patch,
since if a queue enters this path and get's remapped to hctx_0, it means
the ctx would already be getting remaped to a new queue anyway, and it
should use the requests that were just mapped for that queue.  So, for a
correctness standpoint it doesn't matter which queue we are remaping,
the failed hctx or hctx_0, no request would get redirected to the wrong
queue with this change.

Jens, do you think this set is good for merging in 4.10?  It fixes some
issues for us on low memory conditions.

-- 
Gabriel Krisman Bertazi

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] blk-mq: Take tagset lock when updaing hw queues

2017-05-30 Thread Gabriel Krisman Bertazi
Keith Busch <keith.bu...@intel.com> writes:

> The tagset lock needs to be held when iterating the tag_list, so a
> lockdep assert was added when updating number of hardware queues. The
> drivers calling this API, however, were unaware of the new requirement,
> so are failing the assertion.
>
> This patch takes the lock within the blk-mq function so the drivers do
> not have to be modified in order to be safe.
>
> Fixes: 705cda97e ("blk-mq: Make it safe to use RCU to iterate over 
> blk_mq_tag_set.tag_list")
>
> Reported-by: Gabriel Krisman Bertazi <kris...@collabora.co.uk>
> Reviewed-by: Bart Van Assche <bart.vanass...@sandisk.com>
> Signed-off-by: Keith Busch <keith.bu...@intel.com>

I know it is already applied but, FWIW, I can confirm it fixed the issue
in our CI.  Feel free to add:

Tested-by: Gabriel Krisman Bertazi <kris...@collabora.co.uk>

Thanks again, Keith.

-- 
Gabriel Krisman Bertazi


Re: WARNING triggers at blk_mq_update_nr_hw_queues during nvme_reset_work

2017-05-30 Thread Gabriel Krisman Bertazi
Keith Busch <keith.bu...@intel.com> writes:

> On Tue, May 30, 2017 at 02:00:44PM -0300, Gabriel Krisman Bertazi wrote:
>> Since the merge window for 4.12, one of the machines in Intel's CI
>> started to hit the WARN_ON below at blk_mq_update_nr_hw_queues during an
>> nvme_reset_work.  The issue persists with the latest 4.12-rc3, and full
>> dmesg from boot, up to the moment where the WARN_ON triggers is
>> available at the following link:
>> 
>> https://intel-gfx-ci.01.org/CI/CI_DRM_2672/fi-kbl-7500u/igt@kms_pipe_crc_ba...@suspend-read-crc-pipe-a.html
>> 
>> Please notice that the test we do in the CI involves putting the
>> machine to sleep (PM), and the issue triggers when resuming execution.
>> 
>> I have not been able to get my hands on the machine yet to do an actual
>> bisect, but I'm wondering if you guys might have an idea of what is
>> wrong.
>> 
>> Any help is appreciated :)
>
> Hi Gabriel,
>
> This appears to be new behavior in blk-mq's tag set update with commit
> 705cda97e. This is asserting a lock is held, but none of the drivers
> that call the export are take that lock.
>
> I think the below should fix it (CC'ing block list and developers).
>

Thanks for the quick fix, Keith.  I'm running it against the CI to
confirm it fixes the issue and will send you my tested-by once the job
is completed.

-- 
Gabriel Krisman Bertazi