Re: [PATCH] [v2] nvme-pci: check req to prevent crash in nvme_handle_cqe()

2020-09-01 Thread Keith Busch
On Mon, Aug 31, 2020 at 06:55:53PM +0800, Xianting Tian wrote:
> As blk_mq_tag_to_rq() may return null, so it should be check whether it is
> null before using it to prevent a crash.

It may return NULL if the command id exceeds the number of tags. We
already have a check for a valid command id value, so something is not
adding up here if we're still getting NULL.

>   req = blk_mq_tag_to_rq(nvme_queue_tagset(nvmeq), cqe->command_id);
> + if (unlikely(!req)) {
> + dev_warn(nvmeq->dev->ctrl.device,
> + "req is null(tag:%d) on queue %d\n",
> + cqe->command_id, le16_to_cpu(cqe->sq_id));
> + return;
> + }


RE: [PATCH] [v2] nvme-pci: check req to prevent crash in nvme_handle_cqe()

2020-09-01 Thread Tianxianting
Hi,
Could I get the feedback for the patch, whether it can be applied or need some 
improvement?
It really can prevent a crash we met.
Thanks:)

-Original Message-
From: tianxianting (RD) 
Sent: Monday, August 31, 2020 6:56 PM
To: kbu...@kernel.org; ax...@fb.com; h...@lst.de; s...@grimberg.me
Cc: linux-n...@lists.infradead.org; linux-kernel@vger.kernel.org; tianxianting 
(RD) 
Subject: [PATCH] [v2] nvme-pci: check req to prevent crash in nvme_handle_cqe()

We met a crash issue when hot-insert a nvme device, blk_mq_tag_to_rq() returned 
null(req=null), then crash happened in nvme_end_request():
struct nvme_request *rq = nvme_req(req);
rq->result = result;  <==crash here

The test env is, a server is configured with 2 backplanes, each backplane 
support 8 nvme devices, this crash happened when hot-insert a nvme device to 
the second backplane. We measured the signal, which is send out of cpu to ack 
nvme interrupt, the signal is very weak when it reached the second backplane, 
the device can't distinguish it as a ack signal. So it caused the device can't 
clear the interrupt flag.
After updating related driver, the signal sending out of cpu to the second 
backplane is good, the crash issue disappeared.

As blk_mq_tag_to_rq() may return null, so it should be check whether it is null 
before using it to prevent a crash.

[ 1124.256246] nvme nvme5: pci function :e1:00.0
[ 1124.256323] nvme :e1:00.0: enabling device ( -> 0002)
[ 1125.720859] nvme nvme5: 96/0/0 default/read/poll queues
[ 1125.732483]  nvme5n1: p1 p2 p3
[ 1125.788049] BUG: unable to handle kernel NULL pointer dereference at 
0130
[ 1125.788054] PGD 0 P4D 0
[ 1125.788057] Oops: 0002 [#1] SMP NOPTI
[ 1125.788059] CPU: 50 PID: 0 Comm: swapper/50 Kdump: loaded Tainted: G 
  --- -t - 4.18.0-147.el8.x86_64 #1
[ 1125.788065] RIP: 0010:nvme_irq+0xe8/0x240 [nvme]
[ 1125.788068] RSP: 0018:916b8ec83ed0 EFLAGS: 00010813
[ 1125.788069] RAX:  RBX: 918ae9211b00 RCX: 

[ 1125.788070] RDX: 400b RSI:  RDI: 

[ 1125.788071] RBP: 918ae887 R08: 0004 R09: 
918ae887
[ 1125.788072] R10:  R11:  R12: 

[ 1125.788073] R13: 0001 R14:  R15: 
0001
[ 1125.788075] FS:  () GS:916b8ec8() 
knlGS:
[ 1125.788075] CS:  0010 DS:  ES:  CR0: 80050033
[ 1125.788076] CR2: 0130 CR3: 001768f0 CR4: 
00340ee0
[ 1125.788077] Call Trace:
[ 1125.788080]  
[ 1125.788085]  __handle_irq_event_percpu+0x40/0x180
[ 1125.788087]  handle_irq_event_percpu+0x30/0x80
[ 1125.788089]  handle_irq_event+0x36/0x53
[ 1125.788090]  handle_edge_irq+0x82/0x190
[ 1125.788094]  handle_irq+0xbf/0x100
[ 1125.788098]  do_IRQ+0x49/0xd0
[ 1125.788100]  common_interrupt+0xf/0xf

Signed-off-by: Xianting Tian 
---
 drivers/nvme/host/pci.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index 
ba725ae47..5f1c51a43 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -960,6 +960,13 @@ static inline void nvme_handle_cqe(struct nvme_queue 
*nvmeq, u16 idx)
}
 
req = blk_mq_tag_to_rq(nvme_queue_tagset(nvmeq), cqe->command_id);
+   if (unlikely(!req)) {
+   dev_warn(nvmeq->dev->ctrl.device,
+   "req is null(tag:%d) on queue %d\n",
+   cqe->command_id, le16_to_cpu(cqe->sq_id));
+   return;
+   }
+
trace_nvme_sq(req, cqe->sq_head, nvmeq->sq_tail);
if (!nvme_end_request(req, cqe->status, cqe->result))
nvme_pci_complete_rq(req);
--
2.17.1



[PATCH] [v2] nvme-pci: check req to prevent crash in nvme_handle_cqe()

2020-08-31 Thread Xianting Tian
We met a crash issue when hot-insert a nvme device, blk_mq_tag_to_rq()
returned null(req=null), then crash happened in nvme_end_request():
struct nvme_request *rq = nvme_req(req);
rq->result = result;  <==crash here

The test env is, a server is configured with 2 backplanes, each backplane
support 8 nvme devices, this crash happened when hot-insert a nvme device
to the second backplane. We measured the signal, which is send out of cpu
to ack nvme interrupt, the signal is very weak when it reached the second
backplane, the device can't distinguish it as a ack signal. So it caused
the device can't clear the interrupt flag.
After updating related driver, the signal sending out of cpu to the second
backplane is good, the crash issue disappeared.

As blk_mq_tag_to_rq() may return null, so it should be check whether it is
null before using it to prevent a crash.

[ 1124.256246] nvme nvme5: pci function :e1:00.0
[ 1124.256323] nvme :e1:00.0: enabling device ( -> 0002)
[ 1125.720859] nvme nvme5: 96/0/0 default/read/poll queues
[ 1125.732483]  nvme5n1: p1 p2 p3
[ 1125.788049] BUG: unable to handle kernel NULL pointer dereference at 
0130
[ 1125.788054] PGD 0 P4D 0
[ 1125.788057] Oops: 0002 [#1] SMP NOPTI
[ 1125.788059] CPU: 50 PID: 0 Comm: swapper/50 Kdump: loaded Tainted: G 
  --- -t - 4.18.0-147.el8.x86_64 #1
[ 1125.788065] RIP: 0010:nvme_irq+0xe8/0x240 [nvme]
[ 1125.788068] RSP: 0018:916b8ec83ed0 EFLAGS: 00010813
[ 1125.788069] RAX:  RBX: 918ae9211b00 RCX: 

[ 1125.788070] RDX: 400b RSI:  RDI: 

[ 1125.788071] RBP: 918ae887 R08: 0004 R09: 
918ae887
[ 1125.788072] R10:  R11:  R12: 

[ 1125.788073] R13: 0001 R14:  R15: 
0001
[ 1125.788075] FS:  () GS:916b8ec8() 
knlGS:
[ 1125.788075] CS:  0010 DS:  ES:  CR0: 80050033
[ 1125.788076] CR2: 0130 CR3: 001768f0 CR4: 
00340ee0
[ 1125.788077] Call Trace:
[ 1125.788080]  
[ 1125.788085]  __handle_irq_event_percpu+0x40/0x180
[ 1125.788087]  handle_irq_event_percpu+0x30/0x80
[ 1125.788089]  handle_irq_event+0x36/0x53
[ 1125.788090]  handle_edge_irq+0x82/0x190
[ 1125.788094]  handle_irq+0xbf/0x100
[ 1125.788098]  do_IRQ+0x49/0xd0
[ 1125.788100]  common_interrupt+0xf/0xf

Signed-off-by: Xianting Tian 
---
 drivers/nvme/host/pci.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index ba725ae47..5f1c51a43 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -960,6 +960,13 @@ static inline void nvme_handle_cqe(struct nvme_queue 
*nvmeq, u16 idx)
}
 
req = blk_mq_tag_to_rq(nvme_queue_tagset(nvmeq), cqe->command_id);
+   if (unlikely(!req)) {
+   dev_warn(nvmeq->dev->ctrl.device,
+   "req is null(tag:%d) on queue %d\n",
+   cqe->command_id, le16_to_cpu(cqe->sq_id));
+   return;
+   }
+
trace_nvme_sq(req, cqe->sq_head, nvmeq->sq_tail);
if (!nvme_end_request(req, cqe->status, cqe->result))
nvme_pci_complete_rq(req);
-- 
2.17.1