Re: [GIT PULL 00/19] LightNVM patches for 4.12.

2017-04-16 Thread Jens Axboe
On 04/15/2017 12:55 PM, Matias Bjørling wrote:
> Hi Jens,
> 
> With this merge window, we would like to push pblk upstream. It is a new
> host-side translation layer that implements support for exposing
> Open-Channel SSDs as block devices.
> 
> We have described pblk in the LightNVM paper "LightNVM: The Linux
> Open-Channel SSD Subsystem" that was accepted at FAST 2017. The paper
> defines open-channel SSDs, describes the subsystem and pblk, and
> includes an evaluation. Over the past couple of kernel versions we have shipped the
> support patches for pblk, and we are now comfortable pushing the core of
> pblk upstream.
> 
> The core contains the logic to control data placement and I/O scheduling
> on open-channel SSDs, including implementations of translation table
> management, GC, recovery, rate-limiting, and similar components. It
> assumes that the SSD is media-agnostic, and runs on both 1.2 and 2.0 of
> the Open-Channel SSD specification without modifications.
> 
> I want to point out two neat features of pblk. First, pblk can be
> instantiated multiple times on the same SSD, enabling I/O isolation
> between tenants, making it possible to fulfill strict QoS requirements.
> We presented results from this at the NVMW '17 workshop this year in
> the "Multi-Tenant I/O Isolation with Open-Channel SSDs" talk.
> Second, now that a full host-side translation layer is implemented, one
> can begin to optimize its data placement and I/O scheduling algorithms
> to match user workloads. We have shown a couple of the benefits in the
> LightNVM paper, and we know of a couple of companies and universities
> that have begun developing new algorithms.
> 
> In detail, this pull request contains:
> 
>  - The new host-side FTL pblk from Javier, and other contributors.
> 
>  - Add support to the "create" ioctl to force a target to be
>re-initialized using the "factory" flag, from Javier.
> 
>  - Fix various errors in LightNVM core from Javier and me.
> 
>  - An optimization from Neil Brown to skip error checking on mempool
>allocations that can sleep.
> 
>  - A buffer overflow fix in nvme_nvm_identify from Scott Bauer.
> 
>  - Fix for bad block discovery error handling from Christophe
>Jaillet.
> 
>  - Fixes from Dan Carpenter to pblk after it went into linux-next.
> 
> Please pull from the for-jens branch or apply the patches posted with
> this mail:
> 
>https://github.com/OpenChannelSSD/linux.git for-jens

Applied for 4.12, thanks Matias.

-- 
Jens Axboe



Kernel Oops: BUG: unable to handle kernel NULL pointer dereference at 0000000000000050; IP is at blk_mq_poll+0xa0/0x2e0

2017-04-16 Thread Stephen Bates
Hi All

As part of my testing of IO polling [1] I am seeing a NULL pointer dereference 
oops that seems to have been introduced in the preparation for 4.11. The kernel 
oops output is below and this seems to be due to blk_mq_tag_to_rq returning 
NULL in blk_mq_poll in blk-mq.c. I have not had a chance to bisect this down to 
a single commit yet; the same test works fine in 4.10 but not in 4.11-rc6. I 
will try to get a proper bisect done and send more information when I have it. 
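
The oops is consistent with blk_mq_tag_to_rq() returning NULL and the poll
path dereferencing the result without a guard. Below is a small userspace
toy analog of that pattern (the struct layout, table size, and function
names here are illustrative assumptions, not the actual blk-mq code), showing
the kind of NULL check that prevents this class of crash:

```c
#include <stddef.h>

/* Toy analog of a tag -> request lookup that can legitimately return
 * NULL, e.g. for a tag that is out of range or not currently mapped. */
struct request { unsigned int tag; int done; };

struct tag_table {
    struct request *rqs[8];   /* sparse: unused slots stay NULL */
};

static struct request *tag_to_rq(struct tag_table *t, unsigned int tag)
{
    if (tag >= 8)
        return NULL;
    return t->rqs[tag];       /* may be NULL for an unmapped tag */
}

/* Poll one tag; returns 1 if the request exists and has completed,
 * 0 otherwise.  Without the NULL guard this would dereference a null
 * pointer for unmapped tags -- the same class of bug as the oops. */
static int poll_tag(struct tag_table *t, unsigned int tag)
{
    struct request *rq = tag_to_rq(t, tag);

    if (!rq)                  /* guard against a missing request */
        return 0;
    return rq->done;
}
```

The fix in the real code would be the equivalent guard on the value
returned by blk_mq_tag_to_rq() before it is dereferenced.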

I am running the following simple shell script which always triggers the Oops. 
Note this script needs some tweaking to work on 4.11 and earlier since some 
things were moved from sysfs to debugfs in the 4.11 and 4.12 trees.

#!/bin/bash
BLOCK=nvme1n1
DEV=/dev/${BLOCK}

echo "Testing polling on ${DEV}"

# Display the initial polling settings for this device...

cat /sys/block/${BLOCK}/queue/io_poll
cat /sys/block/${BLOCK}/queue/io_poll_delay

# Display the polling results and stats

cat /sys/kernel/debug/block/${BLOCK}/mq/poll_stat
cat /sys/kernel/debug/block/${BLOCK}/mq/0/io_poll

# Now do some polling IO against the block device in question.

fio --filename=${DEV} --size=100% --numjobs=1 --iodepth=1 \
--bs=4k --number_ios=1k --runtime=60 --ioengine=pvsync2 --hipri \
--rw=randrw --random_generator=lfsr --direct=1 --group_reporting=1 \
--rwmixread=100 --loops=1 --name poll.fio

# Display the polling results and stats

cat /sys/kernel/debug/block/${BLOCK}/mq/poll_stat
cat /sys/kernel/debug/block/${BLOCK}/mq/0/io_poll

[   26.024529] BUG: unable to handle kernel NULL pointer dereference at 
0000000000000050
[   26.027167] IP: blk_mq_poll+0xa0/0x2e0
[   26.027326] PGD 7ad51067 
[   26.027584] PUD 7adbe067 
[   26.028006] PMD 0 
[   26.028234] 
[   26.029319] Oops: 0000 [#1] SMP
[   26.030405] CPU: 0 PID: 1474 Comm: fio Not tainted 4.11.0-rc6 #42
[   26.031678] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[   26.033749] task: 88007cf75a00 task.stack: c969c000
[   26.034575] RIP: 0010:blk_mq_poll+0xa0/0x2e0
[   26.035234] RSP: 0018:c969fa38 EFLAGS: 0216
[   26.036269] RAX: 88007c5d5000 RBX:  RCX: 
[   26.037330] RDX: 88007c56d1e8 RSI: 00a7 RDI: 
[   26.039285] RBP: c969fad0 R08:  R09: 80a7
[   26.040962] R10: c969faa0 R11: 1000 R12: 88007c561fe0
[   26.042005] R13: c969fd01 R14: 88007d3f8a98 R15: 88007c772800
[   26.043734] FS:  7f8808d7c580() GS:88007fc0() 
knlGS:
[   26.045366] CS:  0010 DS:  ES:  CR0: 80050033
[   26.047239] CR2: 0050 CR3: 7ad4d000 CR4: 06f0
[   26.048801] DR0:  DR1:  DR2: 
[   26.049898] DR3:  DR6:  DR7: 
[   26.051300] Call Trace:
[   26.052580]  ? generic_make_request+0xfb/0x2a0
[   26.053869]  ? submit_bio+0x64/0x120
[   26.054123]  ? submit_bio+0x64/0x120
[   26.054989]  __blkdev_direct_IO_simple+0x1bc/0x2f0
[   26.055978]  ? __d_lookup_done+0x79/0xe0
[   26.057009]  ? blkdev_fsync+0x50/0x50
[   26.058382]  blkdev_direct_IO+0x37d/0x390
[   26.059320]  ? blkdev_direct_IO+0x37d/0x390
[   26.059939]  generic_file_read_iter+0x2c2/0x8c0
[   26.060488]  ? generic_file_read_iter+0x2c2/0x8c0
[   26.060656]  ? path_openat+0x6e4/0x1320
[   26.060910]  blkdev_read_iter+0x35/0x40
[   26.061430]  __do_readv_writev+0x1ef/0x3b0
[   26.061609]  do_readv_writev+0x7d/0xb0
[   26.062629]  ? handle_mm_fault+0x88/0x150
[   26.062895]  vfs_readv+0x39/0x50
[   26.063291]  ? vfs_readv+0x39/0x50
[   26.063666]  do_preadv+0xb1/0xd0
[   26.063975]  SyS_preadv2+0x17/0x30
[   26.064175]  entry_SYSCALL_64_fastpath+0x1a/0xa9
[   26.064682] RIP: 0033:0x7f88064e30f9
[   26.064875] RSP: 002b:7fff74a1b4f8 EFLAGS: 0202 ORIG_RAX: 
0147
[   26.065538] RAX: ffda RBX: 02413500 RCX: 7f88064e30f9
[   26.066274] RDX: 0001 RSI: 0240d770 RDI: 0003
[   26.066701] RBP: 7f87eebb8000 R08:  R09: 0001
[   26.066868] R10: 266fb000 R11: 0202 R12: 
[   26.067384] R13: 1000 R14: 02413528 R15: 7f87eebb8000
[   26.067877] Code: 01 00 00 0f b7 f3 48 c1 e0 21 48 c1 e8 31 85 db 4c 8b 3c 
c2 0f 88 b4 01 00 00 49 8b 87 f0 00 00 00 31 db 39 30 0f 87 b4 01 00 00 <48> 8b 
43 50 4d 8b b7 80 00 00 00 a8 04 0f 85 e8 00 00 00 41 8b 
[   26.073821] RIP: blk_mq_poll+0xa0/0x2e0 RSP: c969fa38
[   26.074078] CR2: 0050
[   26.076101] ---[ end trace 9f9566455cd27c22 ]---

Cheers
 
Stephen

[1] http://marc.info/?l=linux-block&m=149156785215919&w=2




Re: [PATCH 0/4] blk-mq-sched: allow to use hw tag for sched

2017-04-16 Thread Ming Lei
On Sat, Apr 15, 2017 at 08:38:21PM +0800, Ming Lei wrote:
> The 1st patch enhances BLK_MQ_F_NO_SCHED so that we can't change or
> show available io schedulers on devices which don't support an io
> scheduler.
> 
> The 2nd patch passes BLK_MQ_F_NO_SCHED to avoid a regression on
> mtip32xx introduced by the blk-mq io scheduler.
> 
> The last two patches introduce BLK_MQ_F_SCHED_USE_HW_TAG so that
> we can allow the scheduler to use hardware tags; mq-deadline can
> then work well on mtip32xx. Other devices with enough hardware
> tag space can benefit from this feature too.
> 
> The 1st two patches aim at v4.11, and the last two are for
> v4.12.

Please ignore this patchset; I will post another series for the
mtip32xx fix.

thanks,
Ming


[GIT PULL] A few small fixes for 4.11-rc

2017-04-16 Thread Jens Axboe
Hi Linus,

Four small fixes. Three of them fix the same error in NVMe, in loop, fc,
and rdma respectively. The last one, from Ming, fixes a regression in
this series, where our bvec gap logic was wrong and caused an oops on
NVMe under certain conditions.

Please pull!


  git://git.kernel.dk/linux-block.git for-linus



Ming Lei (1):
  block: fix bio_will_gap() for first bvec with offset

Sagi Grimberg (3):
  nvme-loop: Fix sqsize wrong assignment based on ctrl MQES capability
  nvme-rdma: Fix sqsize wrong assignment based on ctrl MQES capability
  nvme-fc: Fix sqsize wrong assignment based on ctrl MQES capability

 drivers/nvme/host/fc.c |  2 +-
 drivers/nvme/host/rdma.c   |  2 +-
 drivers/nvme/target/loop.c |  2 +-
 include/linux/blkdev.h | 32 
 4 files changed, 31 insertions(+), 7 deletions(-)

-- 
Jens Axboe



Re: [PATCH v4 6/6] dm rq: Avoid that request processing stalls sporadically

2017-04-16 Thread Ming Lei
On Fri, Apr 14, 2017 at 05:12:50PM +, Bart Van Assche wrote:
> On Fri, 2017-04-14 at 09:13 +0800, Ming Lei wrote:
> > On Thu, Apr 13, 2017 at 09:59:57AM -0700, Bart Van Assche wrote:
> > > On 04/12/17 19:20, Ming Lei wrote:
> > > > On Wed, Apr 12, 2017 at 06:38:07PM +, Bart Van Assche wrote:
> > > > > If the blk-mq core would always rerun a hardware queue if a block 
> > > > > driver
> > > > > returns BLK_MQ_RQ_QUEUE_BUSY then that would cause 100% of a single 
> > > > > CPU core
> > > > 
> > > > It won't cause 100% CPU utilization since we restart the queue in completion
> > > > path and at that time at least one tag is available, then progress can 
> > > > be
> > > > made.
> > > 
> > > Hello Ming,
> > > 
> > > Sorry but you are wrong. If .queue_rq() returns BLK_MQ_RQ_QUEUE_BUSY
> > > then it's likely that calling .queue_rq() again after only a few
> > > microseconds will cause it to return BLK_MQ_RQ_QUEUE_BUSY again. If you
> > > don't believe me, change "if (!blk_mq_sched_needs_restart(hctx) &&
> > > !test_bit(BLK_MQ_S_TAG_WAITING, &hctx->state)) blk_mq_run_hw_queue(hctx,
> > > true);" into "blk_mq_run_hw_queue(hctx, true);", trigger a busy
> > 
> > Yes, that can be true, but I mean it is still OK to run the queue again
> > with
> > 
> > if (!blk_mq_sched_needs_restart(hctx) &&
> > !test_bit(BLK_MQ_S_TAG_WAITING, &hctx->state))
> > blk_mq_run_hw_queue(hctx, true);
> > 
> > and restarting queue in __blk_mq_finish_request() when
> > BLK_MQ_RQ_QUEUE_BUSY is returned from .queue_rq(). And both are in current
> > blk-mq implementation.
> > 
> > Then why do we need blk_mq_delay_run_hw_queue(hctx, 100/*ms*/) in dm?
> 
> Because if dm_mq_queue_rq() returns BLK_MQ_RQ_QUEUE_BUSY then there is no
> guarantee that __blk_mq_finish_request() will be called later on for the
> same queue. dm_mq_queue_rq() can e.g. return BLK_MQ_RQ_QUEUE_BUSY while no
> dm requests are in progress because the SCSI error handler is active for
> all underlying paths. See also scsi_lld_busy() and scsi_host_in_recovery().

OK, thanks Bart for the explanation.

This looks like a very interesting BLK_MQ_RQ_QUEUE_BUSY case which isn't
caused by too many pending I/Os; I will study this case more.


Thanks,
Ming
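
The rerun condition debated in this thread can be modeled as a decision over
two hctx state bits. The sketch below is an illustrative userspace model,
not the kernel code: the flag names mirror blk-mq's, but the values and the
helper are assumptions made for the example.

```c
#include <stdbool.h>

/* Illustrative model of the two hctx state bits discussed above.
 * In blk-mq these live in hctx->state and are tested with test_bit(). */
enum {
    BLK_MQ_S_SCHED_RESTART = 1u << 0,  /* completion path will restart us */
    BLK_MQ_S_TAG_WAITING   = 1u << 1,  /* a tag waiter will rerun the queue */
};

/* After .queue_rq() returns BUSY, rerun the queue immediately only if
 * neither a completion-driven restart nor a tag waiter will do it for
 * us.  This mirrors the "if (!blk_mq_sched_needs_restart(hctx) &&
 * !test_bit(BLK_MQ_S_TAG_WAITING, &hctx->state))" condition quoted in
 * the thread. */
static bool should_rerun_now(unsigned int state)
{
    return !(state & BLK_MQ_S_SCHED_RESTART) &&
           !(state & BLK_MQ_S_TAG_WAITING);
}
```

Bart's point is that this condition alone is not enough when BUSY is
returned while no request is in flight, since then no completion ever
arrives to perform the restart.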


[PATCH 4.9 26/31] blk-mq: Avoid memory reclaim when remapping queues

2017-04-16 Thread Greg Kroah-Hartman
4.9-stable review patch.  If anyone has any objections, please let me know.

--

From: Gabriel Krisman Bertazi 

commit 36e1f3d107867b25c616c2fd294f5a1c9d4e5d09 upstream.

While stressing memory and IO at the same time we changed SMT settings,
we were able to consistently trigger deadlocks in the mm system, which
froze the entire machine.

I think that under memory stress conditions, the large allocations
performed by blk_mq_init_rq_map may trigger a reclaim, which stalls
waiting on the block layer remapping completion, thus deadlocking the
system.  The trace below was collected after the machine stalled,
waiting for the hotplug event completion.

The simplest fix for this is to make allocations in this path
non-reclaimable, with GFP_NOIO.  With this patch, we couldn't hit the
issue anymore.
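
The relationship between the flags can be sketched as follows. This is a
simplified userspace model: the bit values are illustrative, and the helper
is invented for the example, but the structure matches how GFP_NOIO differs
from GFP_KERNEL (reclaim is still allowed, issuing IO from reclaim is not):

```c
/* Simplified model of the GFP flags involved (bit values are
 * illustrative; the relationships follow include/linux/gfp.h). */
#define __GFP_DIRECT_RECLAIM  (1u << 0)  /* may enter direct reclaim */
#define __GFP_IO              (1u << 1)  /* reclaim may start disk IO */
#define __GFP_FS              (1u << 2)  /* reclaim may call into the FS */

#define GFP_KERNEL  (__GFP_DIRECT_RECLAIM | __GFP_IO | __GFP_FS)
#define GFP_NOIO    (__GFP_DIRECT_RECLAIM)  /* reclaim, but no IO/FS */

/* The deadlock precondition in this bug: an allocation made while the
 * block layer is quiesced for remapping must not recurse into reclaim
 * paths that themselves issue IO. */
static int may_deadlock_on_blocked_io(unsigned int gfp)
{
    return (gfp & __GFP_DIRECT_RECLAIM) && (gfp & __GFP_IO);
}
```

With GFP_KERNEL the allocation may recurse into IO-issuing reclaim while
the queues are frozen; with GFP_NOIO it cannot, which is why the patch
below switches the flag.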

This should apply on top of Jens's for-next branch cleanly.

Changes since v1:
  - Use GFP_NOIO instead of GFP_NOWAIT.

 Call Trace:
[c00f0160aaf0] [c00f0160ab50] 0xc00f0160ab50 (unreliable)
[c00f0160acc0] [c0016624] __switch_to+0x2e4/0x430
[c00f0160ad20] [c0b1a880] __schedule+0x310/0x9b0
[c00f0160ae00] [c0b1af68] schedule+0x48/0xc0
[c00f0160ae30] [c0b1b4b0] schedule_preempt_disabled+0x20/0x30
[c00f0160ae50] [c0b1d4fc] __mutex_lock_slowpath+0xec/0x1f0
[c00f0160aed0] [c0b1d678] mutex_lock+0x78/0xa0
[c00f0160af00] [d00019413cac] xfs_reclaim_inodes_ag+0x33c/0x380 [xfs]
[c00f0160b0b0] [d00019415164] xfs_reclaim_inodes_nr+0x54/0x70 [xfs]
[c00f0160b0f0] [d000194297f8] xfs_fs_free_cached_objects+0x38/0x60 [xfs]
[c00f0160b120] [c03172c8] super_cache_scan+0x1f8/0x210
[c00f0160b190] [c026301c] shrink_slab.part.13+0x21c/0x4c0
[c00f0160b2d0] [c0268088] shrink_zone+0x2d8/0x3c0
[c00f0160b380] [c026834c] do_try_to_free_pages+0x1dc/0x520
[c00f0160b450] [c026876c] try_to_free_pages+0xdc/0x250
[c00f0160b4e0] [c0251978] __alloc_pages_nodemask+0x868/0x10d0
[c00f0160b6f0] [c0567030] blk_mq_init_rq_map+0x160/0x380
[c00f0160b7a0] [c056758c] blk_mq_map_swqueue+0x33c/0x360
[c00f0160b820] [c0567904] blk_mq_queue_reinit+0x64/0xb0
[c00f0160b850] [c056a16c] blk_mq_queue_reinit_notify+0x19c/0x250
[c00f0160b8a0] [c00f5d38] notifier_call_chain+0x98/0x100
[c00f0160b8f0] [c00c5fb0] __cpu_notify+0x70/0xe0
[c00f0160b930] [c00c63c4] notify_prepare+0x44/0xb0
[c00f0160b9b0] [c00c52f4] cpuhp_invoke_callback+0x84/0x250
[c00f0160ba10] [c00c570c] cpuhp_up_callbacks+0x5c/0x120
[c00f0160ba60] [c00c7cb8] _cpu_up+0xf8/0x1d0
[c00f0160bac0] [c00c7eb0] do_cpu_up+0x120/0x150
[c00f0160bb40] [c06fe024] cpu_subsys_online+0x64/0xe0
[c00f0160bb90] [c06f5124] device_online+0xb4/0x120
[c00f0160bbd0] [c06f5244] online_store+0xb4/0xc0
[c00f0160bc20] [c06f0a68] dev_attr_store+0x68/0xa0
[c00f0160bc60] [c03ccc30] sysfs_kf_write+0x80/0xb0
[c00f0160bca0] [c03cbabc] kernfs_fop_write+0x17c/0x250
[c00f0160bcf0] [c030fe6c] __vfs_write+0x6c/0x1e0
[c00f0160bd90] [c0311490] vfs_write+0xd0/0x270
[c00f0160bde0] [c03131fc] SyS_write+0x6c/0x110
[c00f0160be30] [c0009204] system_call+0x38/0xec

Signed-off-by: Gabriel Krisman Bertazi 
Cc: Brian King 
Cc: Douglas Miller 
Cc: linux-block@vger.kernel.org
Cc: linux-s...@vger.kernel.org
Signed-off-by: Jens Axboe 
Signed-off-by: Sumit Semwal 
Signed-off-by: Greg Kroah-Hartman 

---
 block/blk-mq.c |6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1474,7 +1474,7 @@ static struct blk_mq_tags *blk_mq_init_r
INIT_LIST_HEAD(&tags->page_list);
 
tags->rqs = kzalloc_node(set->queue_depth * sizeof(struct request *),
-GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY,
+GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
 set->numa_node);
if (!tags->rqs) {
blk_mq_free_tags(tags);
@@ -1500,7 +1500,7 @@ static struct blk_mq_tags *blk_mq_init_r
 
do {
page = alloc_pages_node(set->numa_node,
-   GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO,
+   GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO,
this_order);
if (page)
break;
@@ -1521,7 +1521,7 @@ static struct blk_mq_tags *blk_mq_init_r
 * Allow kmemleak to scan these pages as they contain pointers
 * to 

Re: [PATCH V3 00/16] Introduce the BFQ I/O scheduler

2017-04-16 Thread Heinz Diehl
On 11.04.2017, Paolo Valente wrote: 

> new patch series, addressing (both) issues raised by Bart [1].

I'm doing a lot of automatic video transcoding in order to get my
collection of homemade videos down to an acceptable size (mainly
landscapes and boats all over the Norwegian west coast, taken with an old
cam that only produces uncompressed files). This process
involves heavy permanent writing to disk, often over a period of 10
min and more. When this happens, the whole system is kind of
unresponsive. I'm running Fedora 25, but with a self-customised kernel
that is fully low-latency, and the machine is a quadcore Intel Xeon
which should have enough power (Intel(R) Xeon(R) CPU E3-1241 v3 @
3.50GHz).

Using plain blk-mq, the system is very sluggish when there is heavy
disk writing, and it can take up to several minutes (up to the point
where the disk writing actually finishes) to start programs like gimp
or Libreoffice. In fact, when I click on the "applications" button
within XFCE, it can take a long time before the window even opens.
I played with deadline-mq too, and the situation remains the same
unless I do some heavy tuning like this:

echo "mq-deadline" > /sys/block/nvme0n1/queue/scheduler
echo "1" > /sys/block/nvme0n1/queue/iosched/fifo_batch
echo "4" > /sys/block/nvme0n1/queue/iosched/writes_starved
echo "100" > /sys/block/nvme0n1/queue/iosched/read_expire
echo "2000" > /sys/block/nvme0n1/queue/iosched/write_expire

With deadline-mq tuned like this, overall responsiveness is a little bit
better, but not nearly as good as when using bfq. With plain bfq, no
tuning is needed. The system is no longer sluggish. Any program starts
within seconds, and all is very much responsive. Max throughput isn't
important to me; the nvme "harddisk" is fast enough that a few MB/s
more or less do not really matter.

[root@chiara ~]# lspci -v | grep -i nvme
01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe
SSD Controller SM951/PM951 (rev 01) (prog-if 02 [NVM Express])
Kernel driver in use: nvme
Kernel modules: nvme

As an end-user without the programming skills to contribute code, I
wish that developers would join forces and help Paolo get bfq into
the kernel and make bfq even better.

Thanks,
 Heinz
 


[PATCH 4.4 13/18] blk-mq: Avoid memory reclaim when remapping queues

2017-04-16 Thread Greg Kroah-Hartman
4.4-stable review patch.  If anyone has any objections, please let me know.

--

From: Gabriel Krisman Bertazi 

commit 36e1f3d107867b25c616c2fd294f5a1c9d4e5d09 upstream.

While stressing memory and IO at the same time we changed SMT settings,
we were able to consistently trigger deadlocks in the mm system, which
froze the entire machine.

I think that under memory stress conditions, the large allocations
performed by blk_mq_init_rq_map may trigger a reclaim, which stalls
waiting on the block layer remapping completion, thus deadlocking the
system.  The trace below was collected after the machine stalled,
waiting for the hotplug event completion.

The simplest fix for this is to make allocations in this path
non-reclaimable, with GFP_NOIO.  With this patch, we couldn't hit the
issue anymore.

This should apply on top of Jens's for-next branch cleanly.

Changes since v1:
  - Use GFP_NOIO instead of GFP_NOWAIT.

 Call Trace:
[c00f0160aaf0] [c00f0160ab50] 0xc00f0160ab50 (unreliable)
[c00f0160acc0] [c0016624] __switch_to+0x2e4/0x430
[c00f0160ad20] [c0b1a880] __schedule+0x310/0x9b0
[c00f0160ae00] [c0b1af68] schedule+0x48/0xc0
[c00f0160ae30] [c0b1b4b0] schedule_preempt_disabled+0x20/0x30
[c00f0160ae50] [c0b1d4fc] __mutex_lock_slowpath+0xec/0x1f0
[c00f0160aed0] [c0b1d678] mutex_lock+0x78/0xa0
[c00f0160af00] [d00019413cac] xfs_reclaim_inodes_ag+0x33c/0x380 [xfs]
[c00f0160b0b0] [d00019415164] xfs_reclaim_inodes_nr+0x54/0x70 [xfs]
[c00f0160b0f0] [d000194297f8] xfs_fs_free_cached_objects+0x38/0x60 [xfs]
[c00f0160b120] [c03172c8] super_cache_scan+0x1f8/0x210
[c00f0160b190] [c026301c] shrink_slab.part.13+0x21c/0x4c0
[c00f0160b2d0] [c0268088] shrink_zone+0x2d8/0x3c0
[c00f0160b380] [c026834c] do_try_to_free_pages+0x1dc/0x520
[c00f0160b450] [c026876c] try_to_free_pages+0xdc/0x250
[c00f0160b4e0] [c0251978] __alloc_pages_nodemask+0x868/0x10d0
[c00f0160b6f0] [c0567030] blk_mq_init_rq_map+0x160/0x380
[c00f0160b7a0] [c056758c] blk_mq_map_swqueue+0x33c/0x360
[c00f0160b820] [c0567904] blk_mq_queue_reinit+0x64/0xb0
[c00f0160b850] [c056a16c] blk_mq_queue_reinit_notify+0x19c/0x250
[c00f0160b8a0] [c00f5d38] notifier_call_chain+0x98/0x100
[c00f0160b8f0] [c00c5fb0] __cpu_notify+0x70/0xe0
[c00f0160b930] [c00c63c4] notify_prepare+0x44/0xb0
[c00f0160b9b0] [c00c52f4] cpuhp_invoke_callback+0x84/0x250
[c00f0160ba10] [c00c570c] cpuhp_up_callbacks+0x5c/0x120
[c00f0160ba60] [c00c7cb8] _cpu_up+0xf8/0x1d0
[c00f0160bac0] [c00c7eb0] do_cpu_up+0x120/0x150
[c00f0160bb40] [c06fe024] cpu_subsys_online+0x64/0xe0
[c00f0160bb90] [c06f5124] device_online+0xb4/0x120
[c00f0160bbd0] [c06f5244] online_store+0xb4/0xc0
[c00f0160bc20] [c06f0a68] dev_attr_store+0x68/0xa0
[c00f0160bc60] [c03ccc30] sysfs_kf_write+0x80/0xb0
[c00f0160bca0] [c03cbabc] kernfs_fop_write+0x17c/0x250
[c00f0160bcf0] [c030fe6c] __vfs_write+0x6c/0x1e0
[c00f0160bd90] [c0311490] vfs_write+0xd0/0x270
[c00f0160bde0] [c03131fc] SyS_write+0x6c/0x110
[c00f0160be30] [c0009204] system_call+0x38/0xec

Signed-off-by: Gabriel Krisman Bertazi 
Cc: Brian King 
Cc: Douglas Miller 
Cc: linux-block@vger.kernel.org
Cc: linux-s...@vger.kernel.org
Signed-off-by: Jens Axboe 
Signed-off-by: Sumit Semwal 
Signed-off-by: Greg Kroah-Hartman 

---
 block/blk-mq.c |6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1470,7 +1470,7 @@ static struct blk_mq_tags *blk_mq_init_r
INIT_LIST_HEAD(&tags->page_list);
 
tags->rqs = kzalloc_node(set->queue_depth * sizeof(struct request *),
-GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY,
+GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
 set->numa_node);
if (!tags->rqs) {
blk_mq_free_tags(tags);
@@ -1496,7 +1496,7 @@ static struct blk_mq_tags *blk_mq_init_r
 
do {
page = alloc_pages_node(set->numa_node,
-   GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO,
+   GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO,
this_order);
if (page)
break;
@@ -1517,7 +1517,7 @@ static struct blk_mq_tags *blk_mq_init_r
 * Allow kmemleak to scan these pages as they contain pointers
 * to