Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs
On Mon, Apr 05, 2021 at 11:42:31PM +, Chuck Lever III wrote: > > On Apr 5, 2021, at 4:07 PM, Jason Gunthorpe wrote: > > On Mon, Apr 05, 2021 at 03:41:15PM +0200, Christoph Hellwig wrote: > >> On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote: > >>> From: Leon Romanovsky > >>> > From Avihai, > >>> > >>> Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering > >>> imposed on PCI transactions, and thus, can improve performance. > >>> > >>> Until now, relaxed ordering could be set only by user space applications > >>> for user MRs. The following patch series enables relaxed ordering for the > >>> kernel ULPs as well. Relaxed ordering is an optional capability, and as > >>> such, it is ignored by vendors that don't support it. > >>> > >>> The following test results show the performance improvement achieved > >>> with relaxed ordering. The test was performed on a NVIDIA A100 in order > >>> to check performance of storage infrastructure over xprtrdma: > >> > >> Isn't the Nvidia A100 a GPU not actually supported by Linux at all? > >> What does that have to do with storage protocols? > > > > I think it is a typo (or at least it makes no sense to be talking > > about NFS with a GPU chip) Probably it should be a DGX A100 which is a > > dual socket AMD server with a lot of PCIe, and xprtrdma is a NFS-RDMA > > workload. > > We need to get a better idea what correctness testing has been done, > and whether positive correctness testing results can be replicated > on a variety of platforms. > > I have an old Haswell dual-socket system in my lab, but otherwise > I'm not sure I have a platform that would be interesting for such a > test. Not sure if Haswell will be useful for such testing. It looks like many of those subscribe to 'quirk_relaxedordering_disable'.
Re: [bug report] node: Add memory-side caching attributes
On Thu, Apr 01, 2021 at 12:00:39PM +0300, Dan Carpenter wrote: > Hi Keith, > > I've been trying to figure out ways Smatch can check for device managed > resources. Like adding rules that if we call dev_set_name(&foo->bar) > then it's device managed and if there is a kfree(foo) without calling > device_put(&foo->bar) then that's a resource leak. > > Of course one of the rules is that if you call device_register(dev) then > you can't kfree(dev), it has to be released with device_put(dev) and that's > true even if the register fails. But this code here feels very > intentional so maybe there is an exception to the rule? Thanks for the notice. There was no intentional use of this, so I think it was a mistake. The proposal in the follow on replies looks much better. > The patch acc02a109b04: "node: Add memory-side caching attributes" > from Mar 11, 2019, leads to the following static checker warning: > > drivers/base/node.c:285 node_init_cache_dev() > error: kfree after device_register(): 'dev' > > drivers/base/node.c >263 static void node_init_cache_dev(struct node *node) >264 { >265 struct device *dev; >266 >267 dev = kzalloc(sizeof(*dev), GFP_KERNEL); >268 if (!dev) >269 return; >270 >271 dev->parent = &node->dev; >272 dev->release = node_cache_release; >273 if (dev_set_name(dev, "memory_side_cache")) >274 goto free_dev; >275 >276 if (device_register(dev)) > ^^^ >277 goto free_name; >278 >279 pm_runtime_no_callbacks(dev); >280 node->cache_dev = dev; >281 return; >282 free_name: >283 kfree_const(dev->kobj.name); >284 free_dev: >285 kfree(dev); > ^^ >286 } > > regards, > dan carpenter
[PATCH] overflow: improve check_shl_overflow comment
A 'false' return means the value was safely set, so the comment should say 'true' for when it is not considered safe. Cc: Jason Gunthorpe Cc: Kees Cook Signed-off-by: Keith Busch --- include/linux/overflow.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/overflow.h b/include/linux/overflow.h index ef74051d5cfed..0f12345c21fb5 100644 --- a/include/linux/overflow.h +++ b/include/linux/overflow.h @@ -235,7 +235,7 @@ static inline bool __must_check __must_check_overflow(bool overflow) * - 'a << s' sets the sign bit, if any, in '*d'. * * '*d' will hold the results of the attempted shift, but is not - * considered "safe for use" if false is returned. + * considered "safe for use" if true is returned. */ #define check_shl_overflow(a, s, d) __must_check_overflow(({ \ typeof(a) _a = a; \ -- 2.25.4
Re: [PATCH v2 2/3] nvme: Remove superflues else in nvme_ctrl_loss_tmo_store
For the subject, s/superflues/superfluous
Re: [PATCH v2] nvme-tcp: Check if request has started before processing it
On Tue, Mar 30, 2021 at 10:34:25AM -0700, Sagi Grimberg wrote: > > > It is, but in this situation, the controller is sending a second > > > completion that results in a use-after-free, which makes the > > > transport irrelevant. Unless there is some other flow (which is > > > unclear > > > to me) that causes this which is a bug that needs to be fixed rather > > > than hidden with a safeguard. > > > > > > > The kernel should not crash regardless of any network traffic that is > > sent to the system. It should not be possible to either intentionally > > or mistakenly construct packets that will deny service in this way. > > This is not specific to nvme-tcp. I can build an rdma or pci controller > that can trigger the same crash... I saw a similar patch from Hannes > implemented in the scsi level, and not the individual scsi transports.. If scsi wants this too, this could be made generic at the blk-mq level. We just need to make something like blk_mq_tag_to_rq(), but return NULL if the request isn't started. > I would also mention, that a crash is not even the scariest issue that > we can see here, because if the request happened to be reused we are > in the silent data corruption realm... If this does happen, I think we have to come up with some way to mitigate it. We're not utilizing the full 16 bits of the command_id, so maybe we can append something like a generation sequence number that can be checked for validity.
Re: [PATCH] nvme: disallow passthru cmd from targeting a nsid != nsid of the block dev
On Thu, Mar 25, 2021 at 09:48:37AM +, Niklas Cassel wrote: > From: Niklas Cassel > > When a passthru command targets a specific namespace, the ns parameter to > nvme_user_cmd()/nvme_user_cmd64() is set. However, there is currently no > validation that the nsid specified in the passthru command targets the > namespace/nsid represented by the block device that the ioctl was > performed on. > > Add a check that validates that the nsid in the passthru command matches > that of the supplied namespace. > > Signed-off-by: Niklas Cassel > --- > Currently, if doing NVME_IOCTL_IO_CMD on the controller char device, > if and only if there is a single namespace in the ctrl->namespaces list, > nvme_dev_user_cmd() will call nvme_user_cmd() with ns parameter set. > While it might be good that we validate the nsid even in this case, > perhaps we want to allow a nsid value in the passthru command to be > 0x0 and/or the NSID broadcast value? (Only when NVME_IOCTL_IO_CMD was > done on the controller char device though.) There are no IO commands accepting a 0 NSID, so rejecting those from the driver should be okay. FLUSH is the only IO command that takes a broadcast NSID. I suspect at least some entities were unfortunately sending broadcast flush through this interface, so it's possible we'll hear of breakage, but I'd agree your patch is still the right thing to do. 
> drivers/nvme/host/core.c | 4 > 1 file changed, 4 insertions(+) > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c > index 40215a0246e4..e4591a4c68a8 100644 > --- a/drivers/nvme/host/core.c > +++ b/drivers/nvme/host/core.c > @@ -1632,6 +1632,8 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct > nvme_ns *ns, > return -EFAULT; > if (cmd.flags) > return -EINVAL; > + if (ns && cmd.nsid != ns->head->ns_id) > + return -EINVAL; > > memset(&c, 0, sizeof(c)); > c.common.opcode = cmd.opcode; > @@ -1676,6 +1678,8 @@ static int nvme_user_cmd64(struct nvme_ctrl *ctrl, > struct nvme_ns *ns, > return -EFAULT; > if (cmd.flags) > return -EINVAL; > + if (ns && cmd.nsid != ns->head->ns_id) > + return -EINVAL; > > memset(&c, 0, sizeof(c)); > c.common.opcode = cmd.opcode; > -- > 2.30.2
Re: [PATCH 1/1] nvme-pci: add the DISABLE_WRITE_ZEROES quirk for a Samsung PM1725a
On Wed, Mar 10, 2021 at 02:41:10PM +0100, Christoph Hellwig wrote: > On Wed, Mar 10, 2021 at 02:21:56PM +0100, Christoph Hellwig wrote: > > Can you try this patch instead? > > > > http://lists.infradead.org/pipermail/linux-nvme/2021-February/023183.html > > Actually, please try the patch below instead, it looks like our existing > logic messes up the units: This seems like a good opportunity to incorporate TP4040 for non-MDTS command limits. I sent a proposal here http://lists.infradead.org/pipermail/linux-nvme/2021-March/023522.html And it defaults to your suggestion if the controller doesn't implement the capability.
Re: [PATCH] nvme-tcp: Check if request has started before processing it
On Tue, Mar 02, 2021 at 08:18:40AM +0100, Hannes Reinecke wrote: > On 3/1/21 9:59 PM, Keith Busch wrote: > > On Mon, Mar 01, 2021 at 05:53:25PM +0100, Hannes Reinecke wrote: > >> On 3/1/21 5:05 PM, Keith Busch wrote: > >>> On Mon, Mar 01, 2021 at 02:55:30PM +0100, Hannes Reinecke wrote: > >>>> On 3/1/21 2:26 PM, Daniel Wagner wrote: > >>>>> On Sat, Feb 27, 2021 at 02:19:01AM +0900, Keith Busch wrote: > >>>>>> Crashing is bad, silent data corruption is worse. Is there truly no > >>>>>> defense against that? If not, why should anyone rely on this? > >>>>> > >>>>> If we receive an response for which we don't have a started request, we > >>>>> know that something is wrong. Couldn't we in just reset the connection > >>>>> in this case? We don't have to pretend nothing has happened and > >>>>> continuing normally. This would avoid a host crash and would not create > >>>>> (more) data corruption. Or I am just too naive? > >>>>> > >>>> This is actually a sensible solution. > >>>> Please send a patch for that. > >>> > >>> Is a bad frame a problem that can be resolved with a reset? > >>> > >>> Even if so, the reset doesn't indicate to the user if previous commands > >>> completed with bad data, so it still seems unreliable. > >>> > >> We need to distinguish two cases here. > >> The one is use receiving a frame with an invalid tag, leading to a crash. > >> This can be easily resolved by issuing a reset, as clearly the command was > >> garbage and we need to invoke error handling (which is reset). > >> > >> The other case is us receiving a frame with a _duplicate_ tag, ie a tag > >> which is _currently_ valid. This is a case which will fail _even now_, as > >> we > >> have simply no way of detecting this. > >> > >> So what again do we miss by fixing the first case? > >> Apart from a system which does _not_ crash? > > > > I'm just saying each case is a symptom of the same problem. The only > > difference from observing one vs the other is a race with the host's > > dispatch. 
And since you're proposing this patch, it sounds like this > > condition does happen on tcp compared to other transports where we don't > > observe it. I just thought the implication that data corruption happens > > is a alarming. > > > Oh yes, it is. > But sadly TCP inherently suffers from this, as literally anyone can > spoof frames on the network. > Other transports like RDMA or FC do not suffer to that extend as > spoofing frames there is far more elaborate, and not really possible > without dedicated hardware equipment. > > That's why there is header and data digest; that will protect you > against accidental frame corruption (as this case clearly is; the > remainder of the frame is filled with zeroes). > It would not protect you against deliberate frame corruption; that's why > there is TPAR 8010 (TLS encryption for NVMe-TCP). > > Be it as it may, none of these methods are in use here, and none of > these methods can be made mandatory. So we need to deal with the case at > hand. > > And in my opinion crashing is the _worst_ options of all. > Tear the connection down, reset the thing, whatever. > > But do not crash. > Customers tend to have a very dim view on crashing machines, and have a > very limited capacity for being susceptible to our reasoning in these cases. I was pushing the data corruption angle because fixing that should address both cases. If there's really nothing you can do about corruption, your approach here makes sense, and I defer to Sagi and Christoph for inclusion. I still wouldn't trust my data to storage behaving this way, though. :)
Re: [PATCH] nvme-tcp: Check if request has started before processing it
On Mon, Mar 01, 2021 at 05:53:25PM +0100, Hannes Reinecke wrote: > On 3/1/21 5:05 PM, Keith Busch wrote: > > On Mon, Mar 01, 2021 at 02:55:30PM +0100, Hannes Reinecke wrote: > > > On 3/1/21 2:26 PM, Daniel Wagner wrote: > > > > On Sat, Feb 27, 2021 at 02:19:01AM +0900, Keith Busch wrote: > > > > > Crashing is bad, silent data corruption is worse. Is there truly no > > > > > defense against that? If not, why should anyone rely on this? > > > > > > > > If we receive an response for which we don't have a started request, we > > > > know that something is wrong. Couldn't we in just reset the connection > > > > in this case? We don't have to pretend nothing has happened and > > > > continuing normally. This would avoid a host crash and would not create > > > > (more) data corruption. Or I am just too naive? > > > > > > > This is actually a sensible solution. > > > Please send a patch for that. > > > > Is a bad frame a problem that can be resolved with a reset? > > > > Even if so, the reset doesn't indicate to the user if previous commands > > completed with bad data, so it still seems unreliable. > > > We need to distinguish two cases here. > The one is use receiving a frame with an invalid tag, leading to a crash. > This can be easily resolved by issuing a reset, as clearly the command was > garbage and we need to invoke error handling (which is reset). > > The other case is us receiving a frame with a _duplicate_ tag, ie a tag > which is _currently_ valid. This is a case which will fail _even now_, as we > have simply no way of detecting this. > > So what again do we miss by fixing the first case? > Apart from a system which does _not_ crash? I'm just saying each case is a symptom of the same problem. The only difference from observing one vs the other is a race with the host's dispatch. And since you're proposing this patch, it sounds like this condition does happen on tcp compared to other transports where we don't observe it. 
I just thought the implication that data corruption happens is alarming.
Re: [PATCH] nvme-tcp: Check if request has started before processing it
On Mon, Mar 01, 2021 at 02:55:30PM +0100, Hannes Reinecke wrote: > On 3/1/21 2:26 PM, Daniel Wagner wrote: > > On Sat, Feb 27, 2021 at 02:19:01AM +0900, Keith Busch wrote: > >> Crashing is bad, silent data corruption is worse. Is there truly no > >> defense against that? If not, why should anyone rely on this? > > > > If we receive an response for which we don't have a started request, we > > know that something is wrong. Couldn't we in just reset the connection > > in this case? We don't have to pretend nothing has happened and > > continuing normally. This would avoid a host crash and would not create > > (more) data corruption. Or I am just too naive? > > > This is actually a sensible solution. > Please send a patch for that. Is a bad frame a problem that can be resolved with a reset? Even if so, the reset doesn't indicate to the user if previous commands completed with bad data, so it still seems unreliable.
Re: [PATCH] nvme-tcp: Check if request has started before processing it
On Fri, Feb 26, 2021 at 05:42:46PM +0100, Hannes Reinecke wrote: > On 2/26/21 5:13 PM, Keith Busch wrote: > > > > That's just addressing a symptom. You can't fully verify the request is > > valid this way because the host could have started the same command ID > > the very moment before the code checks it, incorrectly completing an > > in-flight command and getting data corruption. > > > Oh, I am fully aware. > > Bad frames are just that, bad frames. > We can only fully validate that when digests are enabled, but I gather that > controllers sending out bad frames wouldn't want to enable digests, either. > So relying on that is possibly not an option. > > So really what I'm trying to avoid is the host crashing on a bad frame. > That kind of thing always resonates bad with customers. > And tripping over an uninitialized command is just too stupid IMO. Crashing is bad, silent data corruption is worse. Is there truly no defense against that? If not, why should anyone rely on this?
Re: [PATCH] nvme-tcp: Check if request has started before processing it
On Fri, Feb 26, 2021 at 01:54:00PM +0100, Hannes Reinecke wrote: > On 2/26/21 1:35 PM, Daniel Wagner wrote: > > On Mon, Feb 15, 2021 at 01:29:45PM -0800, Sagi Grimberg wrote: > > > Well, I think we should probably figure out why that is happening first. > > > > I got my hands on a tcpdump trace. I've trimmed it to this: > > > > > > No. Time SourceDestination Protocol > > Length Info > >1 0.00 10.228.194.30 10.228.38.214 NVMe > > 138NVMe Write > >2 0.000285 10.228.38.214 10.228.194.30 > > NVMe/TCP 90 Ready To Transfer > >3 0.000591 10.228.194.30 10.228.38.214 NVMe > > 4186 NVMe Write: Data > >4 0.000673 10.228.38.214 10.228.194.30 TCP > > 66 4420 → 58535 [ACK] Seq=25 Ack=4193 Win=241 Len=0 TSval=2655324576 > > TSecr=1497295579 > >5 0.002140 10.228.38.214 10.228.194.30 NVMe > > 90 NVMe Write: Response > >6 0.002511 10.228.194.30 10.228.38.175 NVMe > > 138NVMe Write > >7 0.002812 10.228.38.175 10.228.194.30 > > NVMe/TCP 90 Ready To Transfer > >8 0.003006 10.228.194.30 10.228.38.175 NVMe > > 4186 NVMe Write: Data > >9 0.003098 10.228.38.175 10.228.194.30 TCP > > 66 4420 → 51241 [ACK] Seq=25 Ack=4193 Win=241 Len=0 TSval=2183410196 > > TSecr=3601034207 > > 10 0.004420 10.228.38.175 10.228.194.30 NVMe > > 90 NVMe Write: Response > > 11 0.004890 10.228.38.214 10.228.194.30 > > NVMe/TCP 90 > > 12 0.004969 10.228.38.175 10.228.194.30 > > NVMe/TCP 90 > > > > > > The last few seconds contain only normal writes and suddenly the host > > receives two invalid packets. From what I see the host doesn't misbehave > > at all. I wonder if it would be possible to detect the invalid packet by > > locking at the PDU header only. If this would be possible we could > > discard it early and do not try to use the invalid command id. > > > > Here the last two packets with details: > > > > > > No. 
Time SourceDestination Protocol > > Length Info > > 11 0.004890 10.228.38.214 10.228.194.30 > > NVMe/TCP 90 > > > > Frame 11: 90 bytes on wire (720 bits), 90 bytes captured (720 bits) > > Encapsulation type: Ethernet (1) > > Arrival Time: Feb 23, 2021 18:16:08.780574000 CET > > [Time shift for this packet: 0.0 seconds] > > Epoch Time: 1614100568.780574000 seconds > > [Time delta from previous captured frame: 0.00047 seconds] > > [Time delta from previous displayed frame: 0.00047 seconds] > > [Time since reference or first frame: 0.00489 seconds] > > Frame Number: 11 > > Frame Length: 90 bytes (720 bits) > > Capture Length: 90 bytes (720 bits) > > [Frame is marked: False] > > [Frame is ignored: False] > > [Protocols in frame: eth:ethertype:ip:tcp:nvme-tcp] > > [Coloring Rule Name: TCP] > > [Coloring Rule String: tcp] > > Ethernet II, Src: IntelCor_41:16:c0 (b4:96:91:41:16:c0), Dst: > > Cisco_9f:f5:a8 (00:00:0c:9f:f5:a8) > > Destination: Cisco_9f:f5:a8 (00:00:0c:9f:f5:a8) > > Address: Cisco_9f:f5:a8 (00:00:0c:9f:f5:a8) > > ..0. = LG bit: Globally unique address > > (factory default) > > ...0 = IG bit: Individual address > > (unicast) > > Source: IntelCor_41:16:c0 (b4:96:91:41:16:c0) > > Address: IntelCor_41:16:c0 (b4:96:91:41:16:c0) > > ..0. = LG bit: Globally unique address > > (factory default) > > ...0 = IG bit: Individual address > > (unicast) > > Type: IPv4 (0x0800) > > Internet Protocol Version 4, Src: 10.228.38.214, Dst: 10.228.194.30 > > 0100 = Version: 4 > > 0101 = Header Length: 20 bytes (5) > > Differentiated Services Field: 0x00 (DSCP: CS0, ECN: Not-ECT) > > 00.. = Differentiated Services Codepoint: Default (0) > > ..00 = Explicit Congestion Notification: Not ECN-Capable > > Transport (0) > > Total Length: 76 > > Identification: 0x (0) > > Flags: 0x40, Don't fragment > > 0... = Reserved bit: Not set > > .1.. = Don't fragment: Set > > ..0. 
= More fragments: Not set > > Fragment Offset: 0 > > Time to Live: 64 > > Protocol: TCP (6) > > Header Checksum: 0x [validation disabled] > > [Header checksum status: Unverified] > > Source Address: 10.228.38.214 > > Destination Address: 10.228.194.30 > > Transmission Control Protocol, Src Port: 4420, Dst Port: 46909, Seq: 1, > >
Re: [PATCH] nvme-core: Switch to using the new API kobj_to_dev()
On Sat, Feb 20, 2021 at 05:10:18PM +0800, Yang Li wrote: > fixed the following coccicheck: > ./drivers/nvme/host/core.c:3440:60-61: WARNING opportunity for > kobj_to_dev() > ./drivers/nvme/host/core.c:3679:60-61: WARNING opportunity for > kobj_to_dev() > > Reported-by: Abaci Robot > Signed-off-by: Yang Li This was rejected the last time it posted 6 months ago: http://lists.infradead.org/pipermail/linux-nvme/2020-September/019463.html
Re: [RFC PATCH v5 0/4] add simple copy support
On Sat, Feb 20, 2021 at 06:01:56PM +, David Laight wrote: > From: SelvaKumar S > > Sent: 19 February 2021 12:45 > > > > This patchset tries to add support for TP4065a ("Simple Copy Command"), > > v2020.05.04 ("Ratified") > > > > The Specification can be found in following link. > > https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs-1.zip > > > > Simple copy command is a copy offloading operation and is used to copy > > multiple contiguous ranges (source_ranges) of LBA's to a single destination > > LBA within the device reducing traffic between host and device. > > Sounds to me like the real reason is that the copy just ends up changing > some indirect block pointers rather than having to actually copy the data. I guess an implementation could do that, but I think that's missing the point of the command. The intention is to copy the data to a new location on the media for host managed garbage collection.
Re: [PATCH] nvme-tcp: Check if request has started before processing it
On Fri, Feb 12, 2021 at 12:58:27PM -0800, Sagi Grimberg wrote: > > blk_mq_tag_to_rq() will always return a request if the command_id is > > in the valid range. Check if the request has been started. If we > > blindly process the request we might double complete a request which > > can be fatal. > > How did you get to this one? did the controller send a completion for > a completed/bogus request? If that is the case, then that must mean it's possible the driver could have started the command id just before the bogus completion check. Data corruption, right?
Re: [PATCH] nvme: Add 48-bit DMA address quirk
On Wed, Feb 03, 2021 at 12:22:31PM +0100, Filippo Sironi wrote: > > On 2/3/21 12:15 PM, Christoph Hellwig wrote: > > > > On Wed, Feb 03, 2021 at 12:12:31PM +0100, Filippo Sironi wrote: > > > I don't disagree on the first part of your sentence, this is a big > > > oversight. > > > > But it is not what your commit log suggests. > > I can definitely rephrase the commit. > > > > On the other hand, those controllers are out there and are in use by a lot > > > of customers. We can keep relying on luck, hoping that customers don't > > > run > > > into troubles or we can merge a few lines of code :) > > > > Your patch does not just quirk a few controllers out there, but all > > current and future controllers with an Amazon vendor ID. We could > > probably talk about quirking an existing vendor ID or two as long as > > this doesn't happen for future hardware. > > I know that the hardware team is working on this but I don't know the > timelines and there are a few upcoming controllers - of which I don't know > the device ids yet - that have the same issue. > > To avoid issues, it is easier to apply the quirk to all Amazon NVMe > controllers for now till the new lines of controllers with the fix comes > out. At that point, we'll be able to restrict the application to the known > bad controllers. Just set the quirk for the known device id's and append more as needed (which should hopefully never happen). Sure, your future broken devices may not work with the first kernel that introduced the quirk, but this is how the quirks should be documented in the code.
Re: [PATCH V2 3/3] Adding device_dma_parameters->offset_preserve_mask to NVMe driver.
On Mon, Feb 01, 2021 at 10:30:17AM -0800, Jianxiong Gao wrote: > @@ -868,12 +871,24 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, > struct request *req, > if (!iod->nents) > goto out_free_sg; > > + offset_ret = dma_set_min_align_mask(dev->dev, NVME_CTRL_PAGE_SIZE - 1); > + if (offset_ret) { > + dev_warn(dev->dev, "dma_set_min_align_mask failed to set > offset\n"); > + goto out_free_sg; > + } > + > if (is_pci_p2pdma_page(sg_page(iod->sg))) > nr_mapped = pci_p2pdma_map_sg_attrs(dev->dev, iod->sg, > iod->nents, rq_dma_dir(req), DMA_ATTR_NO_WARN); > else > nr_mapped = dma_map_sg_attrs(dev->dev, iod->sg, iod->nents, >rq_dma_dir(req), DMA_ATTR_NO_WARN); > + > + offset_ret = dma_set_min_align_mask(dev->dev, 0); > + if (offset_ret) { > + dev_warn(dev->dev, "dma_set_min_align_mask failed to reset > offset\n"); > + goto out_free_sg; > + } > if (!nr_mapped) > goto out_free_sg; Why is this setting being done and undone on each IO? Wouldn't it be more efficient to set it once during device initialization? And more importantly, this isn't thread safe: one CPU may be setting the device's dma alignment mask to 0 while another CPU is expecting it to be NVME_CTRL_PAGE_SIZE - 1.
Re: [PATCH 2/3] Add swiotlb offset preserving mapping when dma_dma_parameters->page_offset_mask is non zero.
On Thu, Jan 28, 2021 at 12:15:28PM -0500, Konrad Rzeszutek Wilk wrote: > On Wed, Jan 27, 2021 at 04:38:28PM -0800, Jianxiong Gao wrote: > > For devices that need to preserve address offset on mapping through > > swiotlb, this patch adds offset preserving based on page_offset_mask > > and keeps the offset if the mask is non zero. This is needed for > > device drivers like NVMe. > > > > Didn't you send this patch like a month ago and someone pointed > out that the right fix would be in the NVMe driver? > > Is there an issue with fixing the NVMe driver? You got it backwards. The initial "fix" used a flag specific to the nvme driver, and it was pointed out that it should just be the generic behaviour.
Re: [RFC PATCH v3 1/2] block: add simple copy support
On Fri, Dec 11, 2020 at 07:21:38PM +0530, SelvaKumar S wrote: > +int blk_copy_emulate(struct block_device *bdev, struct blk_copy_payload > *payload, > + gfp_t gfp_mask) > +{ > + struct request_queue *q = bdev_get_queue(bdev); > + struct bio *bio; > + void *buf = NULL; > + int i, nr_srcs, max_range_len, ret, cur_dest, cur_size; > + > + nr_srcs = payload->copy_range; > + max_range_len = q->limits.max_copy_range_sectors << SECTOR_SHIFT; The default value for this limit is 0, and this is the function for when the device doesn't support copy. Are we expecting drivers to set this value to something else for that case? > + cur_dest = payload->dest; > + buf = kvmalloc(max_range_len, GFP_ATOMIC); > + if (!buf) > + return -ENOMEM; > + > + for (i = 0; i < nr_srcs; i++) { > + bio = bio_alloc(gfp_mask, 1); > + bio->bi_iter.bi_sector = payload->range[i].src; > + bio->bi_opf = REQ_OP_READ; > + bio_set_dev(bio, bdev); > + > + cur_size = payload->range[i].len << SECTOR_SHIFT; > + ret = bio_add_page(bio, virt_to_page(buf), cur_size, > +offset_in_page(payload)); 'buf' is vmalloc'ed, so we don't necessarily have contiguous pages. I think you need to allocate the bio with bio_map_kern() or something like that instead with that kind of memory. > + if (ret != cur_size) { > + ret = -ENOMEM; > + goto out; > + } > + > + ret = submit_bio_wait(bio); > + bio_put(bio); > + if (ret) > + goto out; > + > + bio = bio_alloc(gfp_mask, 1); > + bio_set_dev(bio, bdev); > + bio->bi_opf = REQ_OP_WRITE; > + bio->bi_iter.bi_sector = cur_dest; > + ret = bio_add_page(bio, virt_to_page(buf), cur_size, > +offset_in_page(payload)); > + if (ret != cur_size) { > + ret = -ENOMEM; > + goto out; > + } > + > + ret = submit_bio_wait(bio); > + bio_put(bio); > + if (ret) > + goto out; > + > + cur_dest += payload->range[i].len; > + } I think this would be a faster implementation if the reads were asynchronous with a payload buffer allocated specific to that read, and the callback can enqueue the write part. 
This would allow you to accumulate all the read data and write it in a single call.
Re: [PATCH] nvme: hwmon: fix crash on device teardown
On Wed, Dec 09, 2020 at 06:32:27PM -0300, Enzo Matsumiya wrote: > +void nvme_hwmon_exit(struct nvme_ctrl *ctrl) > +{ > + hwmon_device_unregister(ctrl->dev); > +} The hwmon registration uses the devm_ version, so don't we need to use the devm_hwmon_device_unregister() here?
Re: [RFC PATCH v2 0/2] add simple copy support
On Fri, Dec 04, 2020 at 11:25:12AM +, Damien Le Moal wrote: > On 2020/12/04 20:02, SelvaKumar S wrote: > > This patchset tries to add support for TP4065a ("Simple Copy Command"), > > v2020.05.04 ("Ratified") > > > > The Specification can be found in following link. > > https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs-1.zip > > > > This is an RFC. Looking forward for any feedbacks or other alternate > > designs for plumbing simple copy to IO stack. > > > > Simple copy command is a copy offloading operation and is used to copy > > multiple contiguous ranges (source_ranges) of LBA's to a single destination > > LBA within the device reducing traffic between host and device. > > > > This implementation accepts destination, no of sources and arrays of > > source ranges from application and attach it as payload to the bio and > > submits to the device. > > > > Following limits are added to queue limits and are exposed in sysfs > > to userspace > > - *max_copy_sectors* limits the sum of all source_range length > > - *max_copy_nr_ranges* limits the number of source ranges > > - *max_copy_range_sectors* limit the maximum number of sectors > > that can constitute a single source range. > > Same comment as before. I think this is a good start, but for this to be > really > useful to users and kernel components alike, this really needs copy emulation > for drives that do not have a native copy feature, similarly to what write > zeros > handling for instance: if the drive does not have a copy command (simple copy > for NVMe or XCOPY for scsi), then the block layer should issue read/write > commands to seamlessly execute the copy. Otherwise, this will only serve a > small > niche for users and will not be optimal for FS and DM drivers that could be > simplified with a generic block layer copy functionality. > > This is my 10 cents though, others may differ about this. 
Yes, I agree that copy emulation support should be included with the hardware enabled solution.
Re: [RFC PATCH 2/2] nvme: add simple copy support
On Tue, Dec 01, 2020 at 11:09:49AM +0530, SelvaKumar S wrote: > +static void nvme_config_copy(struct gendisk *disk, struct nvme_ns *ns, > +struct nvme_id_ns *id) > +{ > + struct nvme_ctrl *ctrl = ns->ctrl; > + struct request_queue *queue = disk->queue; > + > + if (!(ctrl->oncs & NVME_CTRL_ONCS_COPY)) { > + queue->limits.max_copy_sectors = 0; > + blk_queue_flag_clear(QUEUE_FLAG_COPY, queue); > + return; > + } > + > + /* setting copy limits */ > + ns->mcl = le64_to_cpu(id->mcl); > + ns->mssrl = le32_to_cpu(id->mssrl); > + ns->msrc = id->msrc; These are not used anywhere outside this function, so there's no need to add members to the struct. > + if (blk_queue_flag_test_and_set(QUEUE_FLAG_COPY, queue)) > + return; The queue limits are not necessarily the same each time we're called to update the disk info, so this return shouldn't be here. > + > + queue->limits.max_copy_sectors = ns->mcl * (1 << (ns->lba_shift - 9)); > + queue->limits.max_copy_range_sectors = ns->mssrl * > + (1 << (ns->lba_shift - 9)); > + queue->limits.max_copy_nr_ranges = ns->msrc + 1; > +} <> > @@ -2045,6 +2133,7 @@ static void nvme_update_disk_info(struct gendisk *disk, > set_capacity_and_notify(disk, capacity); > > nvme_config_discard(disk, ns); > + nvme_config_copy(disk, ns, id); > nvme_config_write_zeroes(disk, ns); > > if (id->nsattr & NVME_NS_ATTR_RO) > @@ -3014,6 +3103,7 @@ int nvme_init_identify(struct nvme_ctrl *ctrl) > ctrl->oaes = le32_to_cpu(id->oaes); > ctrl->wctemp = le16_to_cpu(id->wctemp); > ctrl->cctemp = le16_to_cpu(id->cctemp); > + ctrl->ocfs = le32_to_cpu(id->ocfs); ocfs is not used anywhere.
Re: [PATCH v2] nvme: Cache DMA descriptors to prevent corruption.
On Fri, Nov 20, 2020 at 09:02:43AM +0100, Christoph Hellwig wrote: > On Thu, Nov 19, 2020 at 05:27:37PM -0800, Tom Roeder wrote: > > This patch changes the NVMe PCI implementation to cache host_mem_descs > > in non-DMA memory instead of depending on descriptors stored in DMA > > memory. This change is needed under the malicious-hypervisor threat > > model assumed by the AMD SEV and Intel TDX architectures, which encrypt > > guest memory to make it unreadable. Some versions of these architectures > > also make it cryptographically hard to modify guest memory without > > detection. > > I don't think this is a useful threat model, and I've not seen a > discussion on lkml where we had any discussion on this kind of threat > model either. > > Before you start sending patches that regress optimizations in various > drivers (and there will be lots with this model) we need to have a > broader discussion first. > > And HMB support, which is for low-end consumer devices that are usually > not directly assigned to VMs aren't a good starting point for this. Yeah, while doing this for HMB isn't really a performance concern, this method for chaining SGL/PRP lists would be. And perhaps more importantly, the proposed mitigation only lets the guest silently carry on from such an attack while the device is surely corrupting something. I think we'd rather free the wrong address since that may at least eventually raise an error.
Re: [PATCH] nvme: Cache DMA descriptors to prevent corruption.
On Thu, Nov 19, 2020 at 10:59:19AM -0800, Tom Roeder wrote: > This patch changes the NVMe PCI implementation to cache host_mem_descs > in non-DMA memory instead of depending on descriptors stored in DMA > memory. This change is needed under the malicious-hypervisor threat > model assumed by the AMD SEV and Intel TDX architectures, which encrypt > guest memory to make it unreadable. Some versions of these architectures > also make it cryptographically hard to modify guest memory without > detection. > > On these architectures, Linux generally leaves DMA memory unencrypted so > that devices can still communicate directly with the kernel: DMA memory > remains readable to and modifiable by devices. This means that this > memory is also accessible to a hypervisor. > > However, this means that a malicious hypervisor could modify the addr or > size fields of descriptors and cause the NVMe driver to call > dma_free_attrs on arbitrary addresses or on the right addresses but with > the wrong size. To prevent this attack, this commit changes the code to > cache those descriptors in non-DMA memory and to use the cached values > when freeing the memory they describe. If the hypervisor does that, then the device may use the wrong addresses, too. I guess you can't do anything about that from the driver, though. > + /* Cache the host_mem_descs in non-DMA memory so a malicious hypervisor > + * can't change them. > + */ > + struct nvme_host_mem_buf_desc *host_mem_descs_cache; > void **host_mem_desc_bufs; This is never seen by an nvme device, so no need for an nvme specific type here. You can use arch native types.
Re: [PATCH 0/2] nvme-pic: improve max I/O queue handling
On Thu, Nov 12, 2020 at 04:45:35PM +0100, Niklas Schnelle wrote: > You got to get something wrong, I hope in this case it's just the subject > of the cover letter :D I suppose the change logs could be worded a little better :) > Thanks for the review, I appreciate it. Might be getting ahead of > myself but I'm curious who would take this change through their > tree if accepted? The linux-nvme tree is located here: http://git.infradead.org/nvme.git Christoph is currently handling patch commits there.
Re: [PATCH 0/2] nvme-pic: improve max I/O queue handling
On Thu, Nov 12, 2020 at 09:23:00AM +0100, Niklas Schnelle wrote: > while searching for a bug around zPCI + NVMe IRQ handling on a distro > kernel, I got confused around handling of the maximum number > of I/O queues in the NVMe driver. > I think I groked it in the end but would like to propose the following > improvements, that said I'm quite new to this code. > I tested both patches on s390x (with a debug config) and x86_64 so > with both data center and consumer NVMes. > For the second patch, since I don't own a device with the quirk, I tried > always returning 1 from nvme_max_io_queues() and confirmed that on my > Evo 970 Pro this resulted in about half the performance in a fio test > but did not otherwise break things. I couldn't find a reason why > allocating only the I/O queues we actually use would be problematic in > the code either but I might have missed something of course. I don't think you missed anything, and the series looks like a reasonable cleanup. I suspect the code was left over from a time when we didn't allocate the possible queues up-front. Reviewed-by: Keith Busch
Re: [RFC PATCH 15/15] nvme-pci: Allow mmaping the CMB in userspace
On Fri, Nov 06, 2020 at 10:00:36AM -0700, Logan Gunthorpe wrote: > Allow userspace to obtain CMB memory by mmaping the controller's > char device. The mmap call allocates and returns a hunk of CMB memory, > (the offset is ignored) so userspace does not have control over the > address within the CMB. > > A VMA allocated in this way will only be usable by drivers that set > FOLL_PCI_P2PDMA when calling GUP. And inter-device support will be > checked the first time the pages are mapped for DMA. > > Currently this is only supported by O_DIRECT to an PCI NVMe device > or through the NVMe passthrough IOCTL. Rather than make this be specific to nvme, could pci p2pdma create an mmap'able file for any resource registered with it?
Re: [PATCH V3 1/1] nvme: Add quirk for LiteON CL1 devices running FW 220TQ,22001
On Thu, Oct 29, 2020 at 11:33:06AM +0900, Keith Busch wrote: > On Thu, Oct 29, 2020 at 02:20:27AM +, Gloria Tsai wrote: > > Corrected the description of this bug that SSD will not do GC after > > receiving shutdown cmd. > > Do GC before shutdown -> delete IO Q -> shutdown from host -> breakup GC -> > > D3hot -> enter PS4 -> have a chance swap block -> use wrong pointer on > > device SRAM -> over program > > What do you mean by "wrong pointer"? At the place in the sequence you're > referring to, the PCI BME is disabled: you can't access *any* host RAM, > so there's no "correct" pointer either. Re-reading your message, I do see you said "device" rather than "host", so my response may not be relevant.
Re: [PATCH V3 1/1] nvme: Add quirk for LiteON CL1 devices running FW 220TQ,22001
On Thu, Oct 29, 2020 at 02:20:27AM +, Gloria Tsai wrote: > Corrected the description of this bug that SSD will not do GC after receiving > shutdown cmd. > Do GC before shutdown -> delete IO Q -> shutdown from host -> breakup GC -> > D3hot -> enter PS4 -> have a chance swap block -> use wrong pointer on device > SRAM -> over program What do you mean by "wrong pointer"? At the place in the sequence you're referring to, the PCI BME is disabled: you can't access *any* host RAM, so there's no "correct" pointer either.
Re: [PATCH 1/1] nvme: Add quirk for LiteON CL1 devices running FW 220TQ,22001
On Wed, Oct 28, 2020 at 12:54:38AM +0900, Jongpil Jung wrote: > suspend. > > When NVMe device receive D3hot from host, NVMe firmware will do > garbage collection. While NVMe device do Garbage collection, > firmware has chance to going incorrect address. > In that case, NVMe storage device goes to no device available state. > Finally, host can't access the device any more. > > Quirk devices will not use simple suspend even if HMB is enabled. > In case of poweroff scenario, NVMe receive "PME turn off". > So garbage collection will not be happening. > > Liteon(SSSTC) will fix the issue, that's why quirk apply on specific > vendor id and firmware version. This is a concerning quirk. We use the simple suspend when HMB is enabled because at least some platforms disable device DMA access in the runtime suspend state. Many devices continue to access their HMB while in low power, so we can't let both conditions occur concurrently. Unless you know for sure this device doesn't access host memory in low-power, there will be at least some platform combinations where this quirk will fail.
Re: [PATCH v2] nvmet: fix uninitialized work for zero kato
On Wed, Oct 14, 2020 at 11:36:50AM +0800, zhenwei pi wrote: > Fixes: > Don't run keep alive work with zero kato. "Fixes" tags need to have a git commit id followed by the commit subject. I can't find any commit with that subject, though.
Re: [PATCH] [v2] nvme: replace meaningless judgement by checking whether req is null
The commit subject is too long. We should really try to keep these to 50 characters or less. nvme-pci: fix NULL req in completion handler Otherwise, looks fine. Reviewed-by: Keith Busch
Re: [PATCH] nvme: replace meaningless judgement by checking whether req is null
On Mon, Sep 21, 2020 at 03:49:09PM +, Tianxianting wrote: > HI Keith, > Thanks for your comments, > I will submit a new patch of version 2 for the further reviewing, v2 patch > will contains: > 1, retain existing judgement and dev_warn; No no, go ahead and remove the existing check just as you've done. That check becomes redundant with the safer one you're adding, and we don't want redundant checks in the fast path. My only suggestion is to use the same dev_warn() in your new check. > 2, add the check whether req is null(already did in this patch) > 3, simplify and make the changelog succinct according to you said " This is > what I'm thinking:". > Is it right? > Should I retain the nvme_irq crash log in changelog, mention the difference > between nvmeq->q_depth and tagset queue_depth? The tagset's queue_depth is a valid point to mention as well. The driver's current indirect check is not necessarily in sync with the actual tagset. > Thanks > > -Original Message- > From: Keith Busch [mailto:kbu...@kernel.org] > Sent: Monday, September 21, 2020 11:08 PM > To: tianxianting (RD) > Cc: ax...@fb.com; h...@lst.de; s...@grimberg.me; > linux-n...@lists.infradead.org; linux-kernel@vger.kernel.org > Subject: Re: [PATCH] nvme: replace meaningless judgement by checking whether > req is null > > On Mon, Sep 21, 2020 at 10:10:52AM +0800, Xianting Tian wrote: > > @@ -940,13 +940,6 @@ static inline void nvme_handle_cqe(struct nvme_queue > > *nvmeq, u16 idx) > > struct nvme_completion *cqe = &nvmeq->cqes[idx]; > > struct request *req; > > > > - if (unlikely(cqe->command_id >= nvmeq->q_depth)) { > > - dev_warn(nvmeq->dev->ctrl.device, > > - "invalid id %d completed on queue %d\n", > > - cqe->command_id, le16_to_cpu(cqe->sq_id)); > > - return; > > - } > > - > > /* > > * AEN requests are special as they don't time out and can > > * survive any kind of queue freeze and often don't respond to @@ > > -960,6 +953,13 @@ static inline void nvme_handle_cqe(struct nvme_queue > > *nvmeq, u16 idx) > > 
} > > > > req = blk_mq_tag_to_rq(nvme_queue_tagset(nvmeq), cqe->command_id); > > + if (unlikely(!req)) { > > + dev_warn(nvmeq->dev->ctrl.device, > > + "req is null for tag %d completed on queue %d\n", > > + cqe->command_id, le16_to_cpu(cqe->sq_id)); > > + return; > > + } > > This is making sense now, though I think we should retain the existing > dev_warn() since it's still accurate and provides continuity for people who > are used to looking for these sorts of messages. > > Your changelog is a bit much though. I think we can say it a bit more > succinctly. This is what I'm thinking: > > The driver registers interrupts for queues before initializing the > tagset because it uses the number of successful request_irq() calls > to configure the tagset parameters. This allows a race condition with > the current tag validity check if the controller happens to produce > an interrupt with a corrupted CQE before the tagset is initialized. > > Replace the driver's indirect tag check with the one already provided > by the block layer.
Re: [PATCH] nvme: replace meaningless judgement by checking whether req is null
On Mon, Sep 21, 2020 at 10:10:52AM +0800, Xianting Tian wrote: > @@ -940,13 +940,6 @@ static inline void nvme_handle_cqe(struct nvme_queue > *nvmeq, u16 idx) > struct nvme_completion *cqe = &nvmeq->cqes[idx]; > struct request *req; > > - if (unlikely(cqe->command_id >= nvmeq->q_depth)) { > - dev_warn(nvmeq->dev->ctrl.device, > - "invalid id %d completed on queue %d\n", > - cqe->command_id, le16_to_cpu(cqe->sq_id)); > - return; > - } > - > /* >* AEN requests are special as they don't time out and can >* survive any kind of queue freeze and often don't respond to > @@ -960,6 +953,13 @@ static inline void nvme_handle_cqe(struct nvme_queue > *nvmeq, u16 idx) > } > > req = blk_mq_tag_to_rq(nvme_queue_tagset(nvmeq), cqe->command_id); > + if (unlikely(!req)) { > + dev_warn(nvmeq->dev->ctrl.device, > + "req is null for tag %d completed on queue %d\n", > + cqe->command_id, le16_to_cpu(cqe->sq_id)); > + return; > + } This is making sense now, though I think we should retain the existing dev_warn() since it's still accurate and provides continuity for people who are used to looking for these sorts of messages. Your changelog is a bit much though. I think we can say it a bit more succinctly. This is what I'm thinking: The driver registers interrupts for queues before initializing the tagset because it uses the number of successful request_irq() calls to configure the tagset parameters. This allows a race condition with the current tag validity check if the controller happens to produce an interrupt with a corrupted CQE before the tagset is initialized. Replace the driver's indirect tag check with the one already provided by the block layer.
Re: [PATCH] [v2] nvme: use correct upper limit for tag in nvme_handle_cqe()
On Fri, Sep 18, 2020 at 06:44:20PM +0800, Xianting Tian wrote: > @@ -940,7 +940,9 @@ static inline void nvme_handle_cqe(struct nvme_queue > *nvmeq, u16 idx) > struct nvme_completion *cqe = &nvmeq->cqes[idx]; > struct request *req; > > - if (unlikely(cqe->command_id >= nvmeq->q_depth)) { > + if (unlikely(cqe->command_id >= > + nvmeq->qid ? nvmeq->dev->tagset.queue_depth : > + nvmeq->dev->admin_tagset.queue_depth)) { Both of these values are set before blk_mq_alloc_tag_set(), so you still have a race. The interrupt handler probably just shouldn't be registered with the queue before the tagset is initialized since there can't be any work for the handler to do before that happens anyway. The controller is definitely broken, though, and will lead to unavoidable corruption if it's really behaving this way.
Re: [PATCH] nvme: fix NULL pointer dereference
On Thu, Sep 17, 2020 at 11:32:12PM -0400, Tong Zhang wrote: > Please correct me if I am wrong. > After a bit more digging I found out that it is indeed command_id got > corrupted is causing this problem. Although the tag and command_id > range is checked like you said, the elements in rqs cannot be > guaranteed to be not NULL. thus although the range check is passed, > blk_mq_tag_to_rq() can still return NULL. I think you're describing a sequence problem in initialization. We shouldn't have interrupts wired up to uninitialized tagsets. A more appropriate sequence would setup request_irq() after the tagset is ready. It makes handling a failed irq setup a bit weird for io queues, though. > It is clear that the current sanitization is not enough and there's > more implication about this -- when all rqs got populated, a corrupted > command_id may silently corrupt other data not belonging to the > current command. The block layer doesn't do anything with requests that haven't been started, so if your controller completes non-existent commands, then nothing particular will happen with the rqs. If the request had been started and the controller provides a corrupted completion, then the fault lies entirely with the controller and you should raise that issue with your vendor. There's no way the driver can distinguish a genuine completion from a corrupted one.
Re: [PATCH] nvme: fix NULL pointer dereference
On Thu, Sep 17, 2020 at 12:56:59PM -0400, Tong Zhang wrote: > The command_id in CQE is writable by NVMe controller, driver should > check its sanity before using it. We already do that.
Re: [PATCH] nvme: fix doulbe irq free
On Thu, Sep 17, 2020 at 11:22:54AM -0400, Tong Zhang wrote: > On Thu, Sep 17, 2020 at 4:30 AM Christoph Hellwig wrote: > > > > On Wed, Sep 16, 2020 at 11:37:00AM -0400, Tong Zhang wrote: > > > the irq might already been released before reset work can run > > > > If it is we have a problem with the state machine, as we shouldn't > > have two resets running at the same time. Can you share your reproducer? > > > > Hi Christoph, > Thanks for replying. > > It is possible that pci_free_irq() can be called twice for the same IRQ, > the first call can be > > [ 124.413113] nvme nvme0: I/O 6 QID 0 timeout, disable controller > [ 124.414317] Workqueue: kblockd blk_mq_timeout_work > [ 124.414317] Call Trace: > [ 124.414317] dump_stack+0x7d/0xb0 > [ 124.414317] nvme_suspend_queue.cold+0x11/0x58 > [ 124.414317] nvme_dev_disable+0x116/0x7b0 > [ 124.414317] nvme_timeout+0x309/0x320 > [ 124.414317] ? nvme_shutdown+0x40/0x40 > [ 124.414317] ? load_balance+0xe76/0x1450 > [ 124.414317] ? deref_stack_reg+0x40/0x40 > [ 124.414317] ? sched_clock_local+0x99/0xc0 > [ 124.414317] blk_mq_check_expired+0x313/0x340 > [ 124.414317] ? blk_mq_freeze_queue_wait+0x180/0x180 > [ 124.414317] ? _find_next_bit.constprop.0+0x3e/0xf0 > [ 124.414317] blk_mq_queue_tag_busy_iter+0x2e9/0x4a0 > [ 124.414317] ? blk_mq_freeze_queue_wait+0x180/0x180 > [ 124.414317] ? blk_mq_all_tag_iter+0x10/0x10 > [ 124.414317] ? _raw_write_lock_irqsave+0xd0/0xd0 > [ 124.414317] ? run_rebalance_domains+0x80/0x80 > [ 124.414317] ? blk_mq_freeze_queue_wait+0x180/0x180 > [ 124.414317] ? debug_object_deactivate.part.0+0x1c4/0x210 > [ 124.414317] blk_mq_timeout_work+0xaa/0x1f0 > [ 124.414317] ? __blk_mq_end_request+0x1f0/0x1f0 > [ 124.414317] ? __schedule+0x581/0xc40 > [ 124.414317] ? read_word_at_a_time+0xe/0x20 > [ 124.414317] ? strscpy+0xbf/0x190 > [ 124.414317] process_one_work+0x4ad/0x7e0 > [ 124.414317] worker_thread+0x73/0x690 > [ 124.414317] ? process_one_work+0x7e0/0x7e0 > [ 124.414317] kthread+0x199/0x1f0 > [ 124.414317] ? 
kthread_create_on_node+0xd0/0xd0 > [ 124.414317] ret_from_fork+0x22/0x30 > > And the second call can be > [ 149.763974] nvme nvme0: Failed to read smart log (error -4) > [ 149.774307] Trying to free already-free IRQ 10 > [ 149.774385] Call Trace: > [ 149.774385] pci_free_irq+0x13/0x20 > [ 149.774385] nvme_reset_work.cold+0x1be/0xa78 > [ 149.774385] ? nvme_probe+0x8c0/0x8c0 > [ 149.774385] ? check_preempt_curr+0x9d/0xf0 > [ 149.774385] ? ttwu_do_wakeup.isra.0+0x176/0x190 > [ 149.774385] ? try_to_wake_up+0x37c/0x900 > [ 149.774385] ? migration_cpu_stop+0x1e0/0x1e0 > [ 149.774385] ? __schedule+0x581/0xc40 > [ 149.774385] ? read_word_at_a_time+0xe/0x20 > [ 149.774385] ? strscpy+0xbf/0x190 > [ 149.774385] process_one_work+0x4ad/0x7e0 > [ 149.774385] worker_thread+0x73/0x690 > [ 149.774385] ? process_one_work+0x7e0/0x7e0 > [ 149.774385] kthread+0x199/0x1f0 > [ 149.774385] ? kthread_create_on_node+0xd0/0xd0 > [ 149.774385] ret_from_fork+0x22/0x30 > > and clearly something is not working as expected. And the problem is that nvme_hwmon_init() doesn't return an error, so it returns to continue initialization on a disabled controller. That's the bug that needs to be fixed.
Re: [PATCH] nvme: fix NULL pointer dereference
On Wed, Sep 16, 2020 at 11:36:49AM -0400, Tong Zhang wrote: > @@ -960,6 +960,8 @@ static inline void nvme_handle_cqe(struct nvme_queue > *nvmeq, u16 idx) > } > > req = blk_mq_tag_to_rq(nvme_queue_tagset(nvmeq), cqe->command_id); > + if (!req) > + return; As I mentioned before, blk_mq_tag_to_rq() returns NULL if the tag exceeds the depth. We already verify the tag prior to calling this function, so what's the real root cause for how we're winding up with NULL here? I'm only asking this because it sounds like there's a bug somewhere else and this change is masking over it. > trace_nvme_sq(req, cqe->sq_head, nvmeq->sq_tail); > if (!nvme_try_complete_req(req, cqe->status, cqe->result)) > nvme_pci_complete_rq(req);
Re: [trivial PATCH] treewide: Convert switch/case fallthrough; to break;
On Wed, Sep 09, 2020 at 01:06:39PM -0700, Joe Perches wrote: > diff --git a/crypto/tcrypt.c b/crypto/tcrypt.c > index eea0f453cfb6..8aac5bc60f4c 100644 > --- a/crypto/tcrypt.c > +++ b/crypto/tcrypt.c > @@ -2464,7 +2464,7 @@ static int do_test(const char *alg, u32 type, u32 mask, > int m, u32 num_mb) > test_hash_speed("streebog512", sec, > generic_hash_speed_template); > if (mode > 300 && mode < 400) break; > - fallthrough; > + break; > case 399: > break; Just imho, this change makes the preceding 'if' look even more pointless. Maybe the fallthrough was a deliberate choice? Not that my opinion matters here as I don't know this module, but it looked a bit odd to me.
Re: [PATCH] PCI/ASPM: Enable ASPM for links under VMD domain
On Wed, Sep 02, 2020 at 01:48:19PM -0600, David Fugate wrote: > Over the years, I've been forwarded numerous emails from VMD customers > praising it's ability to prevent Linux kernel panics upon hot-removals > and inserts of U.2 NVMe drives. The same nvme and pcie hotplug drivers are used with or without VMD enabled. Where are the bug reports for linux kernel panics? We have a very real interest in fixing such issues.
Re: [PATCH] [v2] nvme-pci: check req to prevent crash in nvme_handle_cqe()
On Mon, Aug 31, 2020 at 06:55:53PM +0800, Xianting Tian wrote: > As blk_mq_tag_to_rq() may return null, so it should be check whether it is > null before using it to prevent a crash. It may return NULL if the command id exceeds the number of tags. We already have a check for a valid command id value, so something is not adding up here if we're still getting NULL. > req = blk_mq_tag_to_rq(nvme_queue_tagset(nvmeq), cqe->command_id); > + if (unlikely(!req)) { > + dev_warn(nvmeq->dev->ctrl.device, > + "req is null(tag:%d) on queue %d\n", > + cqe->command_id, le16_to_cpu(cqe->sq_id)); > + return; > + }
Re: [PATCH AUTOSEL 5.8 11/42] nvme: skip noiob for zoned devices
On Mon, Aug 31, 2020 at 11:29:03AM -0400, Sasha Levin wrote: > From: Keith Busch > > [ Upstream commit c41ad98bebb8f4f0335b3c50dbb7583a6149dce4 ] > > Zoned block devices reuse the chunk_sectors queue limit to define zone > boundaries. If a such a device happens to also report an optimal > boundary, do not use that to define the chunk_sectors as that may > intermittently interfere with io splitting and zone size queries. > > Signed-off-by: Keith Busch > Signed-off-by: Sagi Grimberg > Signed-off-by: Jens Axboe > Signed-off-by: Sasha Levin You can safely drop this from stable: nvme zoned devices were only introduced to linux in 5.9.
Re: [PATCH v2] nvme-pci: cancel nvme device request before disabling
On Fri, Aug 28, 2020 at 10:17:08AM -0400, Tong Zhang wrote: > This patch addresses an irq free warning and null pointer dereference > error problem when nvme devices got timeout error during initialization. > This problem happens when nvme_timeout() function is called while > nvme_reset_work() is still in execution. This patch fixed the problem by > setting flag of the problematic request to NVME_REQ_CANCELLED before > calling nvme_dev_disable() to make sure __nvme_submit_sync_cmd() returns > an error code and let nvme_submit_sync_cmd() fail gracefully. > The following is console output. Thanks, this looks good to me. Reviewed-by: Keith Busch
Re: [PATCH] nvme-pci: cancel nvme device request before disabling
On Fri, Aug 14, 2020 at 12:11:56PM -0400, Tong Zhang wrote: > On Fri, Aug 14, 2020 at 11:42 AM Keith Busch wrote: > > > > On Fri, Aug 14, 2020 at 03:14:31AM -0400, Tong Zhang wrote: > > > > > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c > > > > > index ba725ae47305..c4f1ce0ee1e3 100644 > > > > > --- a/drivers/nvme/host/pci.c > > > > > +++ b/drivers/nvme/host/pci.c > > > > > @@ -1249,8 +1249,8 @@ static enum blk_eh_timer_return > > > > > nvme_timeout(struct request *req, bool reserved) > > > > > dev_warn_ratelimited(dev->ctrl.device, > > > > >"I/O %d QID %d timeout, disable controller\n", > > > > >req->tag, nvmeq->qid); > > > > > - nvme_dev_disable(dev, true); > > > > > nvme_req(req)->flags |= NVME_REQ_CANCELLED; > > > > > + nvme_dev_disable(dev, true); > > > > > return BLK_EH_DONE; > > > anymore. The driver is not reporting non-response back for all > > cancelled requests, and that is probably not what we should be doing. > > OK, thanks for the explanation. I think the bottom line here is to let the > probe function know and stop proceeding when there's an error. > I also don't see an obvious reason to set NVME_REQ_CANCELLED > after nvme_dev_disable(dev, true). The flag was set after disabling when it didn't happen to matter: the block layer had a complicated timeout scheme that didn't actually complete the request until the timeout handler returned, so the flag set where it is was 'ok'. That's clearly not the case anymore, so yes, I think we do need your patch. There is one case you are missing, though: --- @@ -1267,10 +1267,10 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved) dev_warn(dev->ctrl.device, "I/O %d QID %d timeout, reset controller\n", req->tag, nvmeq->qid); + nvme_req(req)->flags |= NVME_REQ_CANCELLED; nvme_dev_disable(dev, false); nvme_reset_ctrl(&dev->ctrl); - nvme_req(req)->flags |= NVME_REQ_CANCELLED; return BLK_EH_DONE; } --
Re: [RFC][PATCH 0/9] [v3] Migrate Pages in lieu of discard
On Tue, Aug 18, 2020 at 11:41:22AM -0700, Dave Hansen wrote: > Dave Hansen (5): > mm/numa: node demotion data structure and lookup > mm/vmscan: Attempt to migrate page in lieu of discard > mm/numa: automatically generate node migration order > mm/vmscan: never demote for memcg reclaim > mm/numa: new reclaim mode to enable reclaim-based migration > > Keith Busch (2): > mm/migrate: Defer allocating new page until needed > mm/vmscan: Consider anonymous pages without swap > > Yang Shi (1): > mm/vmscan: add page demotion counter > > Documentation/admin-guide/sysctl/vm.rst |9 > include/linux/migrate.h |6 > include/linux/node.h|9 > include/linux/vm_event_item.h |2 > include/trace/events/migrate.h |3 > mm/debug.c |1 > mm/internal.h |1 > mm/migrate.c| 400 > ++-- > mm/page_alloc.c |2 > mm/vmscan.c | 88 ++- > mm/vmstat.c |2 > 11 files changed, 439 insertions(+), 84 deletions(-) The commit summary and diff stat appear to be from an earlier revision of the series. No big deal, just pointing that out. BTW, thank you for continuing to enhance this capability! :) I'd missed out on the earlier discussion from some bad filters that should now be fixed (hopefully), so I've some catching up to do.
Re: [PATCH] nvmet-fc: Fix a missed _irqsave version of spin_lock in 'nvmet_fc_fod_op_done()'
On Mon, Aug 24, 2020 at 01:00:11PM -0700, Sagi Grimberg wrote: > > The way 'spin_lock()' and 'spin_lock_irqsave()' are used is not consistent > > in this function. > > > > Use 'spin_lock_irqsave()' also here, as there is no guarantee that > > interruptions are disabled at that point, according to surrounding code. > > > > Fixes: a97ec51b37ef ("nvmet_fc: Rework target side abort handling") > > Signed-off-by: Christophe JAILLET > > --- > > Not tested, only based on what looks logical to me according to > > surrounding code > > --- > > drivers/nvme/target/fc.c | 4 ++-- > > 1 file changed, 2 insertions(+), 2 deletions(-) > > > > diff --git a/drivers/nvme/target/fc.c b/drivers/nvme/target/fc.c > > index 55bafd56166a..e6861cc10e7d 100644 > > --- a/drivers/nvme/target/fc.c > > +++ b/drivers/nvme/target/fc.c > > @@ -2342,9 +2342,9 @@ nvmet_fc_fod_op_done(struct nvmet_fc_fcp_iod *fod) > > return; > > if (fcpreq->fcp_error || > > fcpreq->transferred_length != fcpreq->transfer_length) { > > - spin_lock(&fod->flock); > > + spin_lock_irqsave(&fod->flock, flags); > > fod->abort = true; > > - spin_unlock(&fod->flock); > > + spin_unlock_irqrestore(&fod->flock, flags); > > nvmet_req_complete(&fod->req, NVME_SC_INTERNAL); > > return; > > > > James, can I get a reviewed-by from you on this? afaics, the lock just serializes single writes, in which case WRITE/READ_ONCE() can handle that without a lock, right?
Re: [PATCH 2/2] nvme: add emulation for zone-append
On Wed, Aug 19, 2020 at 05:43:29PM -0600, David Fugate wrote: > There were queries? My key takeaways were a maintainer NAK followed by > instructions to make the Intel drive align with the driver by > implementing NOIOB. While I disagree with the rejection as it appeared > to be based entirely on politics, I can accept it as the quirk wasn't > in the spec. For the record, the suggestion provided, which you agreed to look into, most broadly enables your hardware on Linux and was entirely to your benefit. Not quite as dramatic as a political conspiracy. You later responded with a technical argument against that suggestion; however, your reason didn't add up, and that's where you left the thread. > It's not fair to make this same "your drive should align with the > driver" demand of Samsung because we *are* talking about a spec'ed > feature here. Technical critques of their patches and real performance > degrades observed are fair game and objective; "your company did > the nastiest possible move violating the normal NVMe procedures to make > it optional" is not. Sure, but you're cherry picking comments from the discussion. The performance impact exists, and it's generally not acceptable from a maintenance point to duplicate significant code without at least trying to provide a common solution.
Re: [PATCH 2/2] nvme: add emulation for zone-append
On Wed, Aug 19, 2020 at 03:54:20PM -0600, David Fugate wrote: > On Wed, 2020-08-19 at 13:25 -0600, Jens Axboe wrote: > > It's not required, the driver will function quite fine without it. If > > you > > want to use ZNS it's required. > > The NVMe spec does not require Zone Append for ZNS; a *vendor-neutral* > Linux driver should not either. The spec was developed over the course of years with your employer's involvement, and the software enabling efforts occurred in parallel. The "optional" part was made that way at the final hour, so please align your expectations accordingly. > Agreed, but this standard needs to be applied equally to everyone. > E.g., harmless contributions such as > https://lore.kernel.org/linux-nvme/20200611054156.gb3...@lst.de/ get > rejected yet clear spec violations from maintainers are accepted? > type of behavior encourages forking, vendor-specific drivers, etc. > which is somewhere I hope none of us want to go. You're the one who left that thread dangling. You offered to have your firmware accommodate the Intel sponsored feature that makes your patch unnecessary in the first place. Your follow up made no sense and you have not responded to the queries about it.
Re: [PATCH 2/2] nvme: add emulation for zone-append
On Wed, Aug 19, 2020 at 01:11:58PM -0600, David Fugate wrote: > Intel does not support making *optional* NVMe spec features *required* > by the NVMe driver. This is inaccurate. As another example, the spec optionally allows a zone size to be a power of 2, but linux requires a power of 2 if you want to use it with this driver. > Provided there's no glaring technical issues There are many. Some may be improved through a serious review process, but the mess it makes out of the fast path is measurably harmful to devices that don't subscribe to this. That issue is not so easily remedied. Since this patch is a copy of the scsi implementation, the reasonable thing is take this fight to the generic block layer for a common solution. We're not taking this in the nvme driver.
Re: [PATCH 2/2] nvme: add emulation for zone-append
On Tue, Aug 18, 2020 at 07:29:12PM +0200, Javier Gonzalez wrote: > On 18.08.2020 09:58, Keith Busch wrote: > > On Tue, Aug 18, 2020 at 11:50:33AM +0200, Javier Gonzalez wrote: > > > a number of customers are requiring the use of normal writes, which we > > > want to support. > > > > A device that supports append is completely usable for those customers, > > too. There's no need to create divergence in this driver. > > Not really. You know as well as I do that some features are disabled for > a particular SSD model on customer requirements. Generic models > implementing append can submit both I/Os, but those that remove append > are left out. You are only supporting my point: if your device supports append, you get to work in every ZNS use case, otherwise you only get to work in a subset.
Re: [PATCH 2/2] nvme: add emulation for zone-append
On Tue, Aug 18, 2020 at 11:50:33AM +0200, Javier Gonzalez wrote: > a number of customers are requiring the use of normal writes, which we > want to support. A device that supports append is completely usable for those customers, too. There's no need to create divergence in this driver.
Re: v5.9-rc1 commit reliably breaks pci nvme detection
On Mon, Aug 17, 2020 at 03:50:11PM +0200, Ahmed S. Darwish wrote: > Hello, > > Below v5.9-rc1 commit reliably breaks my boot on a Thinkpad e480 > laptop. PCI nvme detection fails, and the kernel becomes not able > anymore to find the rootfs / parse "root=". > > Bisecting v5.8=>v5.9-rc1 blames that commit. Reverting it *reliably* > fixes the problem and makes me able to boot v5.9-rc1. The fix is staged in the nvme tree here: http://git.infradead.org/nvme.git/commit/286155561ecd13b6c85a78eaf2880d3baea03b9e
Re: [PATCH] nvme-pci: Use u32 for nvme_dev.q_depth and nvme_queue.q_depth
On Fri, Aug 14, 2020 at 11:34:25PM +0800, John Garry wrote: > Fix by making onto a u32. > > Also use u32 for nvme_dev.q_depth, as we assign this value from > nvme_dev.q_depth, and nvme_dev.q_depth will possibly hold 65536 - this > avoids the same crash as above. > > Fixes: 61f3b8963097 ("nvme-pci: use unsigned for io queue depth") > Signed-off-by: John Garry Looks good to me. Reviewed-by: Keith Busch
Re: [PATCH] nvme-pci: cancel nvme device request before disabling
On Fri, Aug 14, 2020 at 11:37:20AM -0400, Tong Zhang wrote: > On Fri, Aug 14, 2020 at 11:04 AM Keith Busch wrote: > > > > On Fri, Aug 14, 2020 at 03:14:31AM -0400, Tong Zhang wrote: > > > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c > > > index ba725ae47305..c4f1ce0ee1e3 100644 > > > --- a/drivers/nvme/host/pci.c > > > +++ b/drivers/nvme/host/pci.c > > > @@ -1249,8 +1249,8 @@ static enum blk_eh_timer_return nvme_timeout(struct > > > request *req, bool reserved) > > > dev_warn_ratelimited(dev->ctrl.device, > > >"I/O %d QID %d timeout, disable controller\n", > > >req->tag, nvmeq->qid); > > > - nvme_dev_disable(dev, true); > > > nvme_req(req)->flags |= NVME_REQ_CANCELLED; > > > + nvme_dev_disable(dev, true); > > > return BLK_EH_DONE; > > > > Shouldn't this flag have been set in nvme_cancel_request()? > > nvme_cancel_request() is not setting this flag to cancelled and this is > causing Right, I see that it doesn't, but I'm saying that it should. We used to do something like that, and I'm struggling to recall why we're not anymore. The driver is not reporting non-response back for all cancelled requests, and that is probably not what we should be doing.
Re: [PATCH] nvme-pci: cancel nvme device request before disabling
On Fri, Aug 14, 2020 at 03:14:31AM -0400, Tong Zhang wrote: > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c > index ba725ae47305..c4f1ce0ee1e3 100644 > --- a/drivers/nvme/host/pci.c > +++ b/drivers/nvme/host/pci.c > @@ -1249,8 +1249,8 @@ static enum blk_eh_timer_return nvme_timeout(struct > request *req, bool reserved) > dev_warn_ratelimited(dev->ctrl.device, >"I/O %d QID %d timeout, disable controller\n", >req->tag, nvmeq->qid); > - nvme_dev_disable(dev, true); > nvme_req(req)->flags |= NVME_REQ_CANCELLED; > + nvme_dev_disable(dev, true); > return BLK_EH_DONE; Shouldn't this flag have been set in nvme_cancel_request()? It's not like the timeout out command is the only command to have been cancelled by this action, nor is it guaranteed that getting here will mean the request was in fact cancelled. The controller could still provide a real completion.
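Keith's point is that cancellation should mark every outstanding request, not only the one that happened to trip the timeout. A minimal userspace model of that idea is sketched below; `struct req`, `REQ_CANCELLED`, and `cancel_all()` are hypothetical stand-ins for the driver's request, the `NVME_REQ_CANCELLED` flag, and an `nvme_cancel_request()`-style iteration over the tagset, not the actual kernel code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define REQ_CANCELLED (1u << 0)

struct req {
	unsigned int flags;
	bool in_flight;
};

/*
 * Model of the suggestion: the cancel path marks every request still
 * in flight, so the timeout handler need not special-case its own tag,
 * and no cancelled request is left looking like a real completion.
 */
static void cancel_all(struct req *reqs, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (reqs[i].in_flight)
			reqs[i].flags |= REQ_CANCELLED;
}
```

With this shape, reordering the flag assignment and the disable call in the timeout handler (the subject of the patch) stops mattering for the other in-flight commands.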
Re: [PATCH v2] nvme: Use spin_lock_irq() when taking the ctrl->lock
There's an unrelated whitespace change in nvme_init_identify(). Otherwise, looks fine. Reviewed-by: Keith Busch
Re: [RESEND PATCH] nvme: Use spin_lock_irqsave() when taking the ctrl->lock
On Wed, Aug 12, 2020 at 03:01:19PM -0600, Logan Gunthorpe wrote: > @@ -2971,15 +2971,16 @@ int nvme_get_log(struct nvme_ctrl *ctrl, u32 nsid, u8 > log_page, u8 lsp, u8 csi, > static struct nvme_cel *nvme_find_cel(struct nvme_ctrl *ctrl, u8 csi) > { > struct nvme_cel *cel, *ret = NULL; > + unsigned long flags; > > - spin_lock(&ctrl->lock); > + spin_lock_irqsave(&ctrl->lock, flags); > list_for_each_entry(cel, &ctrl->cels, entry) { > if (cel->csi == csi) { > ret = cel; > break; > } > } > - spin_unlock(&ctrl->lock); > + spin_unlock_irqrestore(&ctrl->lock, flags); > > return ret; > } > @@ -2988,6 +2989,7 @@ static int nvme_get_effects_log(struct nvme_ctrl *ctrl, > u8 csi, > struct nvme_effects_log **log) > { > struct nvme_cel *cel = nvme_find_cel(ctrl, csi); > + unsigned long flags; > int ret; > > if (cel) > @@ -3006,9 +3008,9 @@ static int nvme_get_effects_log(struct nvme_ctrl *ctrl, > u8 csi, > > cel->csi = csi; > > - spin_lock(&ctrl->lock); > + spin_lock_irqsave(&ctrl->lock, flags); > list_add_tail(&cel->entry, &ctrl->cels); > - spin_unlock(&ctrl->lock); > + spin_unlock_irqrestore(&ctrl->lock, flags); > out: > *log = &cel->log; > return 0; > Neither of these are ever called from an interrupt disabled context, correct? If so, you can just use spin_lock_irq() without saving the current irq state.
Re: Regression in nvme driver
On Wed, Jul 29, 2020 at 07:29:08PM +, Lach wrote: > Hello > > I caught a regression in the nvme driver, which shows itself on some > controllers (In my case, at 126h:2263) Fix is staged for the next 5.8 pull; https://git.kernel.dk/cgit/linux-block/commit/?h=block-5.8&id=5bedd3afee8eb01ccd256f0cd2cc0fa6f841417a
Re: [PATCH v16 0/9] nvmet: add target passthru commands support
On Fri, Jul 24, 2020 at 11:25:11AM -0600, Logan Gunthorpe wrote: > This is v16 of the passthru patchset which make a bunch of cleanup as > suggested by Christoph. Thanks, looks great. Just the comment on 6/9, which probably isn't super important anyway. Reviewed-by: Keith Busch
Re: [PATCH v16 6/9] nvmet-passthru: Add passthru code to process commands
On Fri, Jul 24, 2020 at 11:25:17AM -0600, Logan Gunthorpe wrote: > + /* > + * The passthru NVMe driver may have a limit on the number of segments > + * which depends on the host's memory fragementation. To solve this, > + * ensure mdts is limitted to the pages equal to the number of limited > + /* don't support fuse commands */ > + id->fuses = 0; If a host were to set a fuse, the target should return an Invalid Field error. Just to future-proof, rejecting commands with any flags set (other than SGL, which you handled in patch 1/9) is probably what should happen, like: > +u16 nvmet_parse_passthru_io_cmd(struct nvmet_req *req) > +{ if (req->cmd->common.flags & ~NVME_CMD_SGL_ALL) return NVME_SC_INVALID_FIELD; Or maybe we could obviate the need for 1/9 with something like: req->cmd->common.flags &= ~NVME_CMD_SGL_ALL; if (req->cmd->common.flags) return NVME_SC_INVALID_FIELD;
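The bitmask check Keith suggests can be exercised outside the kernel. The sketch below is a hedged userspace model: the flag values mirror `include/linux/nvme.h` of that era (an assumption worth re-checking against the tree), and the status codes are simplified stand-ins rather than the full NVMe status layout:

```c
#include <assert.h>
#include <stdint.h>

/* Command Dword 0 flag bits, as assumed from include/linux/nvme.h. */
#define NVME_CMD_FUSE_FIRST   0x01
#define NVME_CMD_FUSE_SECOND  0x02
#define NVME_CMD_SGL_METABUF  0x40
#define NVME_CMD_SGL_METASEG  0x80
#define NVME_CMD_SGL_ALL      (NVME_CMD_SGL_METABUF | NVME_CMD_SGL_METASEG)

/* Simplified status stand-ins, not the spec's full status-field encoding. */
#define NVME_SC_SUCCESS        0x0
#define NVME_SC_INVALID_FIELD  0x2

/*
 * Reject a passthru command if any flag other than the SGL bits is set:
 * fused commands (and any future flag) come back as Invalid Field.
 */
static uint16_t check_passthru_flags(uint8_t flags)
{
	if (flags & ~NVME_CMD_SGL_ALL)
		return NVME_SC_INVALID_FIELD;
	return NVME_SC_SUCCESS;
}
```

The alternative in the message, clearing the SGL bits first and then rejecting anything left over, has the same accept/reject behavior but also obviates patch 1/9 by normalizing the SGL bits in one place.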
Re: [PATCH v15 7/9] nvmet-passthru: Add passthru code to process commands
On Mon, Jul 20, 2020 at 04:28:26PM -0700, Sagi Grimberg wrote: > On 7/20/20 4:17 PM, Keith Busch wrote: > > On Mon, Jul 20, 2020 at 05:01:19PM -0600, Logan Gunthorpe wrote: > > > On 2020-07-20 4:35 p.m., Sagi Grimberg wrote: > > > > > > > passthru commands are in essence REQ_OP_DRV_IN/REQ_OP_DRV_OUT, which > > > > means that the driver shouldn't need the ns at all. So if you have a > > > > dedicated request queue (mapped to the I/O tagset), you don't need the > > > > ns->queue and we can lose the ns lookup altogether. > > > > We still need a request_queue to dispatch the command. I guess you could > > make a generic one for the controller that isn't tied to a namespace, > > but we lose the fair shared tag allocation. > > What do you mean fair shared tag allocation? See hctx_may_queue().
Re: [PATCH v15 7/9] nvmet-passthru: Add passthru code to process commands
On Mon, Jul 20, 2020 at 05:01:19PM -0600, Logan Gunthorpe wrote: > On 2020-07-20 4:35 p.m., Sagi Grimberg wrote: > > > passthru commands are in essence REQ_OP_DRV_IN/REQ_OP_DRV_OUT, which > > means that the driver shouldn't need the ns at all. So if you have a > > dedicated request queue (mapped to the I/O tagset), you don't need the > > ns->queue and we can lose the ns lookup altogether. We still need a request_queue to dispatch the command. I guess you could make a generic one for the controller that isn't tied to a namespace, but we lose the fair shared tag allocation. > Thanks, that helps clarify things a bit, but which xarray were you > talking about That was something that was replacing the list lookup: http://lists.infradead.org/pipermail/linux-nvme/2020-July/018242.html
Re: [PATCH 5/5] nvme-pci: Use standard block status macro
On Fri, Jul 03, 2020 at 10:49:24AM +0800, Baolin Wang wrote: > static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req, > @@ -844,7 +844,7 @@ static blk_status_t nvme_map_metadata(struct nvme_dev > *dev, struct request *req, > if (dma_mapping_error(dev->dev, iod->meta_dma)) > return BLK_STS_IOERR; > cmnd->rw.metadata = cpu_to_le64(iod->meta_dma); > - return 0; > + return BLK_STS_OK; > } This is fine, though it takes knowing that this value is 0 for the subsequent 'if (!ret)' check to make sense. Maybe those should change to 'if (ret != BLK_STS_OK)' so the check uses the same symbol as the return, and will always work in the unlikely event that the defines are reordered.
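The suggestion is purely about symmetry: if a function returns the symbol `BLK_STS_OK`, its callers should compare against that symbol instead of relying on its numeric value being zero. A hedged userspace sketch (the `blk_status_t` values here are placeholders, not the kernel's definitions):

```c
#include <assert.h>
#include <stdint.h>

typedef uint8_t blk_status_t;

/*
 * BLK_STS_OK happens to be 0 in the kernel; the point of the review
 * comment is that callers should not depend on that. BLK_STS_IOERR's
 * numeric value here is an arbitrary placeholder.
 */
#define BLK_STS_OK    ((blk_status_t)0)
#define BLK_STS_IOERR ((blk_status_t)10)

static blk_status_t map_metadata(int dma_mapping_failed)
{
	if (dma_mapping_failed)
		return BLK_STS_IOERR;
	return BLK_STS_OK;
}

/* Caller written with the symbolic comparison the review suggests:
 * it stays correct even if the status defines are ever renumbered. */
static int submit(int dma_mapping_failed)
{
	blk_status_t ret = map_metadata(dma_mapping_failed);

	if (ret != BLK_STS_OK)
		return -1;	/* unwind and fail the request */
	return 0;		/* dispatch */
}
```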
Re: [PATCH] nvme: validate cntlid's only for nvme >= 1.1.0
On Tue, Jun 30, 2020 at 04:01:45PM +0200, Maximilian Heyne wrote: > On 6/30/20 3:36 PM, Christoph Hellwig wrote: > > And actually - 1.0 did not have the concept of a subsystem. So having > > a duplicate serial number for a 1.0 controller actually is a pretty > > nasty bug. Can you point me to this broken controller? Do you think > > the OEM could fix it up to report a proper version number and controller > > ID? > > > > I meant that the VF NVMe controllers will all land in the same subsystem from > the kernel's point of view, because, as you said, there was no idea of > different > subsystems in the 1.0 spec. Each controller should have landed in its own subsystem in this case rather than the same subsystem. > It's an older in-house controller. Seems to set the same serial number for all > VF's. Should the firmware set unique serials for the VF's instead? Yes, the driver shouldn't be finding duplicate serial numbers.
Re: [RFC PATCH] nvme-pci: Move the sg table allocation/free into init/exit_request
On Sun, Jun 28, 2020 at 06:34:46PM +0800, Baolin Wang wrote: > Move the sg table allocation and free into the init_request() and > exit_request(), instead of allocating sg table when queuing requests, > which can benefit the IO performance. If you want to pre-allocate something per-request, you can add the size to the tagset's cmd_size. But this is adding almost 4k per request. Considering how many requests we try to allocate, that's a bit too large to count on being available or to sequester for this driver.
Re: [PATCH 2/3] nvme-pci: Add controller memory buffer supported macro
On Tue, Jun 23, 2020 at 06:27:51PM +0200, Christoph Hellwig wrote: > On Tue, Jun 23, 2020 at 09:24:33PM +0800, Baolin Wang wrote: > > Introduce a new capability macro to indicate if the controller > > supports the memory buffer or not, instead of reading the > > NVME_REG_CMBSZ register. > > This is a complex issue. The CMBS bit was only added in NVMe 1.4 as > a backwards incompatible change, as the CMB addressing scheme can lead > to data corruption. The CMBS was added as part of the horribe hack > that also involves the CBA field, which we'll need to see before > using it to work around the addressing issue. At the same time we > should also continue supporting the legacy pre-1.4 CMB with a warning > (and may reject it if we know we run in a VM). Well, a CMB from an emulated controller (like qemu's) can be used within a VM. It's only if you direct assign a PCI function that CMB usage breaks.
Re: [PATCH 1/3] nvme: Add Arbitration Burst support
On Tue, Jun 23, 2020 at 09:24:32PM +0800, Baolin Wang wrote: > +void nvme_set_arbitration_burst(struct nvme_ctrl *ctrl) > +{ > + u32 result; > + int status; > + > + if (!ctrl->rab) > + return; > + > + /* > + * The Arbitration Burst setting indicates the maximum number of > + * commands that the controller may launch at one time from a > + * particular Submission Queue. It is recommended that host software > + * configure the Arbitration Burst setting as close to the recommended > + * value by the controller as possible. > + */ > + status = nvme_set_features(ctrl, NVME_FEAT_ARBITRATION, ctrl->rab, Since 'rab' is an 8-bit field, but the feature's AB value is only 3 bits, we should add a validity check. > +} > +EXPORT_SYMBOL_GPL(nvme_set_arbitration_burst); I don't see any particular reason to export this function just for the pci transport to use. Just make it a local static function an call it from nvme_init_identify().
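The validity check Keith asks for follows from the field widths: Identify's RAB is a full byte, but the Arbitration feature's AB field occupies bits 2:0 and can only encode 0 through 7 (7 meaning no burst limit). A minimal sketch of the check, as plain C outside the kernel and not the driver's actual code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Largest value the 3-bit Arbitration Burst field can encode. */
#define NVME_AB_MAX 0x7

/*
 * A controller's 8-bit RAB value is only usable as-is for the feature
 * if it fits in the 3-bit AB field; larger values must be rejected or
 * clamped before issuing Set Features.
 */
static bool nvme_rab_valid(uint8_t rab)
{
	return rab <= NVME_AB_MAX;
}
```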
Re: [PATCH 1/3] nvme: Add Arbitration Burst support
On Wed, Jun 24, 2020 at 09:34:08AM +0800, Baolin Wang wrote: > OK, I understaood your concern. Now we will select the RR arbitration as > default > in nvme_enable_ctrl(), but for some cases, we will not set the arbitration > burst > values from userspace, and we still want to use the defaut arbitration burst > that > recommended by the controller, taking into consideration any latency > requirements. > > So I'd like to add a parameter to decide if we can use the default arbitration > burst like below, how do yo think? Thanks. I wouldn't call this the 'default' arbitration since it usually is not the default. This is the 'recommended' arbitration value. > static bool use_default_arb; > module_param(use_default_arb, bool, 0444); > MODULE_PARM_DESC(use_default_arb, "use default arbitration burst that > recommended by the controller"); "use controller's recommended arbitration burst"
Re: [PATCH 1/3] nvme: Add Arbitration Burst support
On Tue, Jun 23, 2020 at 10:39:01AM -0700, Sagi Grimberg wrote: > > > > From the NVMe spec, "In order to make efficient use of the non-volatile > > > memory, it is often advantageous to execute multiple commands from a > > > Submission Queue in parallel. For Submission Queues that are using > > > weighted round robin with urgent priority class or round robin > > > arbitration, host software may configure an Arbitration Burst setting". > > > Thus add Arbitration Burst setting support. > > > > But if the user changed it to something else that better matches how > > they want to use queues, the driver is just going to undo that setting > > on the next reset. > > Where do we do priority class arbitration? no one sets it Not the priority class, we don't use WRR in this driver. The RR arbitration can be set from user space and saved across controller resets, like: # nvme set-feature -f 1 -v 3 --save This patch would undo the saved feature value.
Re: [PATCH 1/3] nvme: Add Arbitration Burst support
On Tue, Jun 23, 2020 at 09:24:32PM +0800, Baolin Wang wrote: > From the NVMe spec, "In order to make efficient use of the non-volatile > memory, it is often advantageous to execute multiple commands from a > Submission Queue in parallel. For Submission Queues that are using > weighted round robin with urgent priority class or round robin > arbitration, host software may configure an Arbitration Burst setting". > Thus add Arbitration Burst setting support. But if the user changed it to something else that better matches how they want to use queues, the driver is just going to undo that setting on the next reset.
Re: [PATCH 1/1] nvme-pci: avoid race between nvme_reap_pending_cqes() and nvme_poll()
Fixes: fa46c6fb5d61 ("nvme/pci: move cqe check after device shutdown")
Re: [PATCH 1/1] nvme-pci: avoid race between nvme_reap_pending_cqes() and nvme_poll()
Looks good to me. Reviewed-by: Keith Busch
Re: [PATCH 0/2] Add support for StorageD3Enable _DSD property
On Wed, Apr 29, 2020 at 05:20:09AM +, Williams, Dan J wrote: > On Tue, 2020-04-28 at 08:27 -0700, David E. Box wrote: > > On Tue, 2020-04-28 at 16:22 +0200, Christoph Hellwig wrote: > > > On Tue, Apr 28, 2020 at 07:09:59AM -0700, David E. Box wrote: > > > > > I'm not sure who came up with the idea to put this into ACPI, > > > > > but > > > > > it > > > > > belongs into NVMe. Please talk to the NVMe technical working > > > > > group > > > > > instead of trying to overrules them in an unrelated group that > > > > > doesn't > > > > > apply to all of PCIe. > > > > > > > > Agreed that this is not ideal since it does not apply to all of > > > > PCIe. > > > > But as the property already exists on shipping systems, we need > > > > to > > > > be > > > > able to read it in the NVMe driver and the patch is consitent > > > > with > > > > the > > > > way properties under PCI ports are read. > > > > > > The point is that it is not the BIOSes job do decide how Linux does > > > power management. For example D3 has really horrible entry and > > > exit > > > latencies in many cases, and will lead to higher power usage. > > > > The platform can know which pm policies will save the most power. But > > since the solution doesn't apply to all PCIe devices (despite BIOS > > specifying it that way) I'll withdraw this patch. Thanks. > > Wait, why withdraw? In this case the platform is unfortunately > preventing the standard driver from making a proper determination. So > while I agree that it's not the BIOSes job, when the platform actively > prevents proper operation due to some ill conceived non-standard > platform property what is Linux left to do on these systems? > > The *patch* is not trying to overrule NVME, and the best I can say is > that the Intel Linux team was not in the loop when this was being > decided between the platform BIOS implemenation and whomever thought > they could just publish random ACPI properties that impacted NVME > operation [1]. 
> > So now David is trying to get these platform unbroken because they are > already shipping with this b0rkage. Rather than quirking all these cases, which I get the feeling there are many more than we've currently got in our quirk list, perhaps it'd be simpler to default to the simple suspend. AFAIK, the simple suspend works for all platforms, though it may not realize the best power savings and/or exit latency.
Re: [PATCH v2] nvme-pci: Shutdown when removing dead controller
On Mon, Oct 07, 2019 at 01:50:11PM -0400, Tyler Ramer wrote: > Shutdown the controller when nvme_remove_dead_controller is > reached. > > If nvme_remove_dead_controller() is called, the controller won't > be comming back online, so we should shut it down rather than just > disabling. > > Remove nvme_kill_queues() as nvme_dev_remove() will take care of > unquiescing queues. We do still need to kill the queues, though. The shutdown == true just flushes all pending requests. Killing queues does that too, but it also sets the request_queue to dying, which will terminate syncing any dirty pages. > --- > > Changes since v1: > * Clean up commit message > * Remove nvme_kill_queues() > --- > drivers/nvme/host/pci.c | 3 +-- > 1 file changed, 1 insertion(+), 2 deletions(-) > > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c > index c0808f9eb8ab..68d5fb880d80 100644 > --- a/drivers/nvme/host/pci.c > +++ b/drivers/nvme/host/pci.c > @@ -2509,8 +2509,7 @@ static void nvme_pci_free_ctrl(struct nvme_ctrl *ctrl) > static void nvme_remove_dead_ctrl(struct nvme_dev *dev) > { > nvme_get_ctrl(&dev->ctrl); > - nvme_dev_disable(dev, false); > - nvme_kill_queues(&dev->ctrl); > + nvme_dev_disable(dev, true); > if (!queue_work(nvme_wq, &dev->remove_work)) > nvme_put_ctrl(&dev->ctrl); > } > -- > 2.23.0 >
Re: [PATCH] nvme-pci: Shutdown when removing dead controller
On Mon, Oct 07, 2019 at 11:13:12AM -0400, Tyler Ramer wrote: > > Setting the shutdown to true is > > usually just to get the queues flushed, but the nvme_kill_queues() that > > we call accomplishes the same thing. > > The intention of this patch was to clean up another location where > nvme_dev_disable() > is called with shutdown == false, but the device is being removed due > to a failure > condition, so it should be shutdown. > > Perhaps though, given nvme_kill_queues() provides a subset of the > functionality of > nvme_dev_disable() with shutdown == true, we can just use > nvme_dev_disable() and > remove nvme_kill_queues()? > > This will make nvme_remove_dead_ctrl() more in line with nvme_remove(), > nvme_shutdown(), etc. It's fine to use the shutdown == true in this path as well, but I just wanted to understand what problem it was fixing. It doesn't sound like your scenario is going to end up setting CC.SHN, so the only thing the shutdown should accomplish is flushing pending IO, but we already call kill_queues() right after the nvme_dev_disable(), so that part should be okay.
Re: [PATCH] nvme-pci: Shutdown when removing dead controller
On Fri, Oct 04, 2019 at 11:36:42AM -0400, Tyler Ramer wrote: > Here's a failure we had which represents the issue the patch is > intended to solve: > > Aug 26 15:00:56 testhost kernel: nvme nvme4: async event result 00010300 > Aug 26 15:01:27 testhost kernel: nvme nvme4: controller is down; will > reset: CSTS=0x3, PCI_STATUS=0x10 > Aug 26 15:02:10 testhost kernel: nvme nvme4: Device not ready; aborting reset > Aug 26 15:02:10 testhost kernel: nvme nvme4: Removing after probe > failure status: -19 > > The CSTS warnings comes from nvme_timeout, and is printed by > nvme_warn_reset. A reset then occurs > Controller state should be NVME_CTRL_RESETTING > > Now, in nvme_reset_work, controller is never marked "CONNECTING" at: > > if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_CONNECTING)) > > because several lines above, we can determine that > nvme_pci_configure_admin_queues returns > a bad result, which triggers a goto out_unlock and prints "removing > after probe failure status: -19" > > Because state is never changed to NVME_CTRL_CONNECTING or > NVME_CTRL_DELETING, the > logic added in > https://github.com/torvalds/linux/commit/2036f7263d70e67d70a67899a468588cb7356bc9 > should not apply. Nor should it, because there are no IO in flight at this point, there can't be any timeout work to check the state. > We can further validate that dev->ctrl.state == > NVME_CTRL_RESETTING thanks to > the WARN_ON in nvme_reset_work. I'm not sure I see what this is fixing. Setting the shutdown to true is usually just to get the queues flushed, but the nvme_kill_queues() that we call accomplishes the same thing. > On Thu, Oct 3, 2019 at 3:13 PM Tyler Ramer wrote: > > > > Always shutdown the controller when nvme_remove_dead_controller is > > reached. > > > > It's possible for nvme_remove_dead_controller to be called as part of a > > failed reset, when there is a bad NVME_CSTS. The controller won't > > be comming back online, so we should shut it down rather than just > > disabling. 
> > > > Signed-off-by: Tyler Ramer > > --- > > drivers/nvme/host/pci.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c > > index c0808f9eb8ab..c3f5ba22c625 100644 > > --- a/drivers/nvme/host/pci.c > > +++ b/drivers/nvme/host/pci.c > > @@ -2509,7 +2509,7 @@ static void nvme_pci_free_ctrl(struct nvme_ctrl *ctrl) > > static void nvme_remove_dead_ctrl(struct nvme_dev *dev) > > { > > nvme_get_ctrl(&dev->ctrl); > > - nvme_dev_disable(dev, false); > > + nvme_dev_disable(dev, true); > > nvme_kill_queues(&dev->ctrl); > > if (!queue_work(nvme_wq, &dev->remove_work)) > > nvme_put_ctrl(&dev->ctrl); > > -- > > 2.23.0 > >
Re: [PATCH v3] nvme: allow 64-bit results in passthru commands
On Tue, Sep 24, 2019 at 11:05:36AM -0700, Sagi Grimberg wrote: > Looks fine to me, > > Reviewed-by: Sagi Grimberg > > Keith, Christoph? Looks good to me, too. Reviewed-by: Keith Busch
Re: [PATCH] nvme: fix an error code in nvme_init_subsystem()
On Mon, Sep 23, 2019 at 05:18:36PM +0300, Dan Carpenter wrote: > "ret" should be a negative error code here, but it's either success or > possibly uninitialized. > > Fixes: 32fd90c40768 ("nvme: change locking for the per-subsystem controller > list") > Signed-off-by: Dan Carpenter Thanks, patch looks good. Reviewed-by: Keith Busch
Re: [PATCH v2] nvme-pci: Save PCI state before putting drive into deepest state
On Wed, Sep 18, 2019 at 01:15:55PM -0500, Mario Limonciello wrote: > The action of saving the PCI state will cause numerous PCI configuration > space reads which depending upon the vendor implementation may cause > the drive to exit the deepest NVMe state. > > In these cases ASPM will typically resolve the PCIe link state and APST > may resolve the NVMe power state. However it has also been observed > that this register access after quiesced will cause PC10 failure > on some device combinations. > > To resolve this, move the PCI state saving to before SetFeatures has been > called. This has been proven to resolve the issue across a 5000 sample > test on previously failing disk/system combinations. > > Signed-off-by: Mario Limonciello This looks good. It clashes with something I posted yesterday, but I'll rebase after this one. Reviewed-by: Keith Busch
Re: NVMe Poll CQ on timeout
On Thu, Sep 19, 2019 at 01:47:50PM +, Bharat Kumar Gogada wrote: > Hi All, > > We are testing NVMe cards on ARM64 platform, the card uses MSI-X interrupts. > We are hitting following case in drivers/nvme/host/pci.c > /* > * Did we miss an interrupt? > */ > if (__nvme_poll(nvmeq, req->tag)) { > dev_warn(dev->ctrl.device, > "I/O %d QID %d timeout, completion polled\n", > req->tag, nvmeq->qid); > return BLK_EH_DONE; > } > > Can anyone tell when does nvme_timeout gets invoked ? Timeout is invoked when the driver didn't see a completion to a submitted command. > In what cases we see this interrupt miss ? That usually happens for one of two reasons: 1. The device didn't send any MSIx message for a CQE 2. The device sent the MSIx message before posting the CQE I've also seen h/w errata where the MSIx and CQE are re-ordered, which can also lead to this. A hardware trace would provide the most detailed view of what's happening. You might be able to infer if you carefully account for commands sent, interrupts received, and spurious interrupts detected. > We are seeing this issue only for reads with following fio command > fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randread --bs=128k > --direct=0 \ > --size=128M --numjobs=3 --group_reporting --filename=/dev/nvme0n1 > > We are not seeing issue with --rw=randwrite for same size. > > Please let us know what can cause this issue.
Re: [PATCH] nvme-pci: Save PCI state before putting drive into deepest state
On Wed, Sep 11, 2019 at 06:42:33PM -0500, Mario Limonciello wrote: > --- > drivers/nvme/host/pci.c | 13 +++-- > 1 file changed, 7 insertions(+), 6 deletions(-) > > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c > index 732d5b6..9b3fed4 100644 > --- a/drivers/nvme/host/pci.c > +++ b/drivers/nvme/host/pci.c > @@ -2894,6 +2894,13 @@ static int nvme_suspend(struct device *dev) > if (ret < 0) > goto unfreeze; > > + /* > + * A saved state prevents pci pm from generically controlling the > + * device's power. If we're using protocol specific settings, we don't > + * want pci interfering. > + */ > + pci_save_state(pdev); > + > ret = nvme_set_power_state(ctrl, ctrl->npss); > if (ret < 0) > goto unfreeze; > @@ -2908,12 +2915,6 @@ static int nvme_suspend(struct device *dev) > ret = 0; > goto unfreeze; We would need to clear the saved state here, though. You can also infact remove the unfreeze label and goto. > } > - /* > - * A saved state prevents pci pm from generically controlling the > - * device's power. If we're using protocol specific settings, we don't > - * want pci interfering. > - */ > - pci_save_state(pdev); > unfreeze: > nvme_unfreeze(ctrl); > return ret; > --
Re: [PATCH 0/2] nvme: Add kernel API for admin command
On Wed, Sep 18, 2019 at 03:26:11PM +0200, Christoph Hellwig wrote: > Even if we had a use case for that the bounce buffer is just too ugly > to live. And I'm really sick and tired of Intel wasting our time for > their out of tree monster given that they haven't even tried helping > to improve the in-kernel write caching layers. Right, I don't have any stake in that out-of-tree caching either. I am more just interested in trying to get Linux to generically reach as many NVMe spec features as possible, and extended formats have been in-spec since the beginning of nvme. And yes, that bouncing is really nasty, but it's really only needed for PRP, so maybe let's just not ignore that transfer mode and support extended metadata iff the controller supports SGLs. We just need a special SGL setup routine to weave the data and metadata.
Re: [PATCH] nvme-pci: Save PCI state before putting drive into deepest state
On Wed, Sep 11, 2019 at 06:42:33PM -0500, Mario Limonciello wrote: > The action of saving the PCI state will cause numerous PCI configuration > space reads which depending upon the vendor implementation may cause > the drive to exit the deepest NVMe state. > > In these cases ASPM will typically resolve the PCIe link state and APST > may resolve the NVMe power state. However it has also been observed > that this register access after quiesced will cause PC10 failure > on some device combinations. > > To resolve this, move the PCI state saving to before SetFeatures has been > called. This has been proven to resolve the issue across a 5000 sample > test on previously failing disk/system combinations. > > Signed-off-by: Mario Limonciello > --- > drivers/nvme/host/pci.c | 13 +++-- > 1 file changed, 7 insertions(+), 6 deletions(-) > > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c > index 732d5b6..9b3fed4 100644 > --- a/drivers/nvme/host/pci.c > +++ b/drivers/nvme/host/pci.c > @@ -2894,6 +2894,13 @@ static int nvme_suspend(struct device *dev) > if (ret < 0) > goto unfreeze; > > + /* > + * A saved state prevents pci pm from generically controlling the > + * device's power. If we're using protocol specific settings, we don't > + * want pci interfering. > + */ > + pci_save_state(pdev); > + > ret = nvme_set_power_state(ctrl, ctrl->npss); > if (ret < 0) > goto unfreeze; > @@ -2908,12 +2915,6 @@ static int nvme_suspend(struct device *dev) > ret = 0; > goto unfreeze; > } > - /* > - * A saved state prevents pci pm from generically controlling the > - * device's power. If we're using protocol specific settings, we don't > - * want pci interfering. > - */ > - pci_save_state(pdev); > unfreeze: > nvme_unfreeze(ctrl); > return ret; In the event that something else fails after the point you've saved the state, we need to fallback to the behavior for when the driver doesn't save the state, right?
Re: [PATCH 0/2] nvme: Add kernel API for admin command
On Mon, Sep 16, 2019 at 12:13:24PM +, Baldyga, Robert wrote: > Ok, fair enough. We want to keep things hidden behind certain layers, > and that's definitely a good thing. But there is a problem with these > layers - they do not expose all the features. For example AFAIK there > is no clear way to use 512+8 format via block layer (nor even a way > to get know that bdev is formatted to particular lbaf). It's impossible > to use it without layering violations, which nobody sees as a perfect > solution, but it is the only one that works. I think your quickest path to supporting such a format is to consider each part separately, then bounce and interleave/unmix the data + metadata at another layer that understands how the data needs to be laid out in host memory. Something like this RFC here: http://lists.infradead.org/pipermail/linux-nvme/2018-February/015844.html It appears connecting to infradead lists is a bit flaky at the moment, so not sure if you'll be able to read the above link right now. Anyway, the RFC would have needed a bit of work to be considered, like using a mempool for the purpose, but it does generically make such formats usable through the block stack so maybe try fleshing out that idea.
Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
On Sat, Sep 07, 2019 at 06:19:21AM +0800, Ming Lei wrote: > On Fri, Sep 06, 2019 at 05:50:49PM +, Long Li wrote: > > >Subject: Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism > > > > > >Why are all 8 nvmes sharing the same CPU for interrupt handling? > > >Shouldn't matrix_find_best_cpu_managed() handle selecting the least used > > >CPU from the cpumask for the effective interrupt handling? > > > > The tests run on 10 NVMe disks on a system of 80 CPUs. Each NVMe disk has > > 32 hardware queues. > > Then there are total 320 NVMe MSI/X vectors, and 80 CPUs, so irq matrix > can't avoid effective CPUs overlapping at all. Sure, but it's at most half, meanwhile the CPU that's dispatching requests would naturally be throttled by the other half whose completions are interrupting that CPU, no?
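Ming's arithmetic is a straight pigeonhole bound: with more managed vectors than CPUs, some CPU must service multiple vectors no matter how `matrix_find_best_cpu_managed()` spreads them. A one-line sketch of that bound (plain C, not kernel code):

```c
#include <assert.h>

/*
 * Pigeonhole lower bound: distributing `vecs` interrupt vectors over
 * `cpus` CPUs, the busiest CPU must handle at least ceil(vecs / cpus)
 * of them. With 10 drives x 32 queues = 320 vectors on 80 CPUs, that
 * is 4 vectors per CPU even with a perfectly even spread.
 */
static unsigned int min_vecs_on_busiest_cpu(unsigned int vecs,
					    unsigned int cpus)
{
	return (vecs + cpus - 1) / cpus;	/* ceiling division */
}
```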
Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
On Fri, Sep 06, 2019 at 11:30:57AM -0700, Sagi Grimberg wrote: > > > > > Ok, so the real problem is per-cpu bounded tasks. > > > > I share Thomas opinion about a NAPI like approach. > > We already have that, its irq_poll, but it seems that for this > use-case, we get lower performance for some reason. I'm not > entirely sure why that is, maybe its because we need to mask interrupts > because we don't have an "arm" register in nvme like network devices > have? For MSI, that's the INTMS/INTMC NVMe registers. MSI-x, though, has to disarm it in its table entry, and the Linux implementation will do a posted read in that path, which is a bit too expensive.
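For readers without the spec at hand, the INTMS/INTMC behavior referred to above can be modeled in a few lines. This is a userspace toy, not driver code; the semantics (write 1s to INTMS at offset 0x0C to set aggregate mask bits, write 1s to INTMC at 0x10 to clear them, no read-modify-write needed) are from the NVMe base specification, while the struct and helper names are invented for this sketch:

```c
#include <stdint.h>

/*
 * Model of the NVMe INTMS/INTMC aggregate interrupt-mask registers.
 * Because masking is a single posted write of set/clear bits, the
 * pin-based/MSI path can avoid the posted read that the per-vector
 * MSI-X mask path takes through the vector table.
 */
struct nvme_int_mask {
    uint32_t mask;  /* bit n set = vector n masked */
};

/* INTMS: 1 bits written here SET the corresponding mask bits. */
static void intms_write(struct nvme_int_mask *r, uint32_t v)
{
    r->mask |= v;
}

/* INTMC: 1 bits written here CLEAR the corresponding mask bits. */
static void intmc_write(struct nvme_int_mask *r, uint32_t v)
{
    r->mask &= ~v;
}

static int vector_masked(const struct nvme_int_mask *r, unsigned v)
{
    return (r->mask >> v) & 1;
}
```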
Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
On Fri, Sep 06, 2019 at 09:48:21AM +0800, Ming Lei wrote: > When one IRQ flood happens on one CPU: > > 1) softirq handling on this CPU can't make progress > > 2) kernel thread bound to this CPU can't make progress > > For example, network may require softirq to xmit packets, or another irq > thread for handling keyboards/mice or whatever, or rcu_sched may depend > on that CPU for making progress, then the irq flood stalls the whole > system. > > > > > AFAIU, there are fast medium where the responses to requests are faster > > than the time to process them, right? > > Usually medium may not be faster than CPU, now we are talking about > interrupts, which can be originated from lots of devices concurrently, > for example, in Long Li'test, there are 8 NVMe drives involved. Why are all 8 nvmes sharing the same CPU for interrupt handling? Shouldn't matrix_find_best_cpu_managed() handle selecting the least used CPU from the cpumask for the effective interrupt handling?
[PATCHv2] nvme: Assign subsys instance from first ctrl
The namespace disk names must be unique for the lifetime of the
subsystem. This was accomplished by using their parent subsystems'
instances which were allocated independently from the controllers
connected to that subsystem. This allowed name prefixes assigned to
namespaces to match a controller from an unrelated subsystem, and has
created confusion among users examining device nodes.

Ensure a namespace's subsystem instance never clashes with a controller
instance of another subsystem by transferring the instance ownership to
the parent subsystem from the first controller discovered in that
subsystem.

Reviewed-by: Logan Gunthorpe
Signed-off-by: Keith Busch
---
v1 -> v2:
  Changelog: reduce sensationalism, fix spelling

 drivers/nvme/host/core.c | 21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 14c0bfb55615..8a8279ece5ee 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -81,7 +81,6 @@ EXPORT_SYMBOL_GPL(nvme_reset_wq);
 struct workqueue_struct *nvme_delete_wq;
 EXPORT_SYMBOL_GPL(nvme_delete_wq);
 
-static DEFINE_IDA(nvme_subsystems_ida);
 static LIST_HEAD(nvme_subsystems);
 static DEFINE_MUTEX(nvme_subsystems_lock);
 
@@ -2344,7 +2343,8 @@ static void nvme_release_subsystem(struct device *dev)
 	struct nvme_subsystem *subsys =
 		container_of(dev, struct nvme_subsystem, dev);
 
-	ida_simple_remove(&nvme_subsystems_ida, subsys->instance);
+	if (subsys->instance >= 0)
+		ida_simple_remove(&nvme_instance_ida, subsys->instance);
 	kfree(subsys);
 }
 
@@ -2473,12 +2473,8 @@ static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
 	subsys = kzalloc(sizeof(*subsys), GFP_KERNEL);
 	if (!subsys)
 		return -ENOMEM;
-	ret = ida_simple_get(&nvme_subsystems_ida, 0, 0, GFP_KERNEL);
-	if (ret < 0) {
-		kfree(subsys);
-		return ret;
-	}
-	subsys->instance = ret;
+
+	subsys->instance = -1;
 	mutex_init(&subsys->lock);
 	kref_init(&subsys->ref);
 	INIT_LIST_HEAD(&subsys->ctrls);
@@ -2497,7 +2493,7 @@ static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
 	subsys->dev.class = nvme_subsys_class;
 	subsys->dev.release = nvme_release_subsystem;
 	subsys->dev.groups = nvme_subsys_attrs_groups;
-	dev_set_name(&subsys->dev, "nvme-subsys%d", subsys->instance);
+	dev_set_name(&subsys->dev, "nvme-subsys%d", ctrl->instance);
 	device_initialize(&subsys->dev);
 
 	mutex_lock(&nvme_subsystems_lock);
@@ -2528,6 +2524,8 @@ static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
 			goto out_put_subsystem;
 	}
 
+	if (!found)
+		subsys->instance = ctrl->instance;
 	ctrl->subsys = subsys;
 	list_add_tail(&ctrl->subsys_entry, &subsys->ctrls);
 	mutex_unlock(&nvme_subsystems_lock);
@@ -3803,7 +3801,9 @@ static void nvme_free_ctrl(struct device *dev)
 		container_of(dev, struct nvme_ctrl, ctrl_device);
 	struct nvme_subsystem *subsys = ctrl->subsys;
 
-	ida_simple_remove(&nvme_instance_ida, ctrl->instance);
+	if (subsys && ctrl->instance != subsys->instance)
+		ida_simple_remove(&nvme_instance_ida, ctrl->instance);
+
 	kfree(ctrl->effects);
 	nvme_mpath_uninit(ctrl);
 	__free_page(ctrl->discard_page);
@@ -4085,7 +4085,6 @@ static int __init nvme_core_init(void)
 
 static void __exit nvme_core_exit(void)
 {
-	ida_destroy(&nvme_subsystems_ida);
 	class_destroy(nvme_subsys_class);
 	class_destroy(nvme_class);
 	unregister_chrdev_region(nvme_chr_devt, NVME_MINORS);
-- 
2.14.5
[PATCH] nvme: Restore device naming sanity
The namespace names must be unique for the lifetime of the subsystem.
This was accomplished by using their parent subsystems' instances which
was independent of the controllers connected to that subsystem. The
consequence of that naming scheme meant that name prefixes given to
namespaces may match a controller from an unrelated subsystem. This has
understandably invited confusion when examining device nodes.

Ensure the namespace's subsystem instance never clashes with a
controller instance of another subsystem by transferring the instance
ownership to the parent subsystem from the first controller discovered
in that subsystem.

Reviewed-by: Logan Gunthorpe
Signed-off-by: Keith Busch
---
 drivers/nvme/host/core.c | 21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 14c0bfb55615..8a8279ece5ee 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -81,7 +81,6 @@ EXPORT_SYMBOL_GPL(nvme_reset_wq);
 struct workqueue_struct *nvme_delete_wq;
 EXPORT_SYMBOL_GPL(nvme_delete_wq);
 
-static DEFINE_IDA(nvme_subsystems_ida);
 static LIST_HEAD(nvme_subsystems);
 static DEFINE_MUTEX(nvme_subsystems_lock);
 
@@ -2344,7 +2343,8 @@ static void nvme_release_subsystem(struct device *dev)
 	struct nvme_subsystem *subsys =
 		container_of(dev, struct nvme_subsystem, dev);
 
-	ida_simple_remove(&nvme_subsystems_ida, subsys->instance);
+	if (subsys->instance >= 0)
+		ida_simple_remove(&nvme_instance_ida, subsys->instance);
 	kfree(subsys);
 }
 
@@ -2473,12 +2473,8 @@ static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
 	subsys = kzalloc(sizeof(*subsys), GFP_KERNEL);
 	if (!subsys)
 		return -ENOMEM;
-	ret = ida_simple_get(&nvme_subsystems_ida, 0, 0, GFP_KERNEL);
-	if (ret < 0) {
-		kfree(subsys);
-		return ret;
-	}
-	subsys->instance = ret;
+
+	subsys->instance = -1;
 	mutex_init(&subsys->lock);
 	kref_init(&subsys->ref);
 	INIT_LIST_HEAD(&subsys->ctrls);
@@ -2497,7 +2493,7 @@ static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
 	subsys->dev.class = nvme_subsys_class;
 	subsys->dev.release = nvme_release_subsystem;
 	subsys->dev.groups = nvme_subsys_attrs_groups;
-	dev_set_name(&subsys->dev, "nvme-subsys%d", subsys->instance);
+	dev_set_name(&subsys->dev, "nvme-subsys%d", ctrl->instance);
 	device_initialize(&subsys->dev);
 
 	mutex_lock(&nvme_subsystems_lock);
@@ -2528,6 +2524,8 @@ static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
 			goto out_put_subsystem;
 	}
 
+	if (!found)
+		subsys->instance = ctrl->instance;
 	ctrl->subsys = subsys;
 	list_add_tail(&ctrl->subsys_entry, &subsys->ctrls);
 	mutex_unlock(&nvme_subsystems_lock);
@@ -3803,7 +3801,9 @@ static void nvme_free_ctrl(struct device *dev)
 		container_of(dev, struct nvme_ctrl, ctrl_device);
 	struct nvme_subsystem *subsys = ctrl->subsys;
 
-	ida_simple_remove(&nvme_instance_ida, ctrl->instance);
+	if (subsys && ctrl->instance != subsys->instance)
+		ida_simple_remove(&nvme_instance_ida, ctrl->instance);
+
 	kfree(ctrl->effects);
 	nvme_mpath_uninit(ctrl);
 	__free_page(ctrl->discard_page);
@@ -4085,7 +4085,6 @@ static int __init nvme_core_init(void)
 
 static void __exit nvme_core_exit(void)
 {
-	ida_destroy(&nvme_subsystems_ida);
 	class_destroy(nvme_subsys_class);
 	class_destroy(nvme_class);
 	unregister_chrdev_region(nvme_chr_devt, NVME_MINORS);
-- 
2.14.5
Re: [PATCH] nvme-core: Fix subsystem instance mismatches
On Wed, Sep 04, 2019 at 11:01:22AM -0600, Logan Gunthorpe wrote:
> Oh, yes that's simpler than the struct/kref method and looks like it
> will accomplish the same thing. I did some brief testing with it and it
> seems to work for me (though I don't have any subsystems with multiple
> controllers). If you want to make a patch out of it you can add my
>
> Reviewed-by: Logan Gunthorpe

Thanks! I'll make it a proper patch and send shortly.

For testing multi-controller subsystems, I haven't got proper hardware
either, so I really like the nvme loop target. Here's a very simple json
defining a two namespace subsystem backed by two real nvme devices:

loop.json:
---
{
  "ports": [
    {
      "addr": {
        "adrfam": "",
        "traddr": "",
        "treq": "not specified",
        "trsvcid": "",
        "trtype": "loop"
      },
      "portid": 1,
      "referrals": [],
      "subsystems": [
        "testnqn"
      ]
    }
  ],
  "subsystems": [
    {
      "attr": {
        "allow_any_host": "1"
      },
      "namespaces": [
        {
          "device": {
            "nguid": "ef90689c-6c46-d44c-89c1-4067801309a8",
            "path": "/dev/nvme0n1"
          },
          "enable": 1,
          "nsid": 1
        },
        {
          "device": {
            "nguid": "ef90689c-6c46-d44c-89c1-4067801309a9",
            "path": "/dev/nvme1n1"
          },
          "enable": 1,
          "nsid": 2
        }
      ],
      "nqn": "testnqn"
    }
  ]
}
--

Configure the target:

  # nvmetcli restore loop.json

Connect to it twice:

  # nvme connect -n testnqn -t loop
  # nvme connect -n testnqn -t loop

List the result:

  # nvme list -v
  NVM Express Subsystems

  Subsystem     Subsystem-NQN                                                                     Controllers
  ------------- --------------------------------------------------------------------------------- ------------
  nvme-subsys0  nqn.2014.08.org.nvmexpress:8086108ePHLE7200015N6P4BGN-17335943:ICDPC5ED2ORA6.4T    nvme0
  nvme-subsys1  nqn.2014.08.org.nvmexpress:8086108ePHLE7200015N6P4BGN-27335943:ICDPC5ED2ORA6.4T    nvme1
  nvme-subsys2  testnqn                                                                            nvme2, nvme3

  NVM Express Controllers

  Device  SN                    MN                        FR        TxPort  Address    Subsystem     Namespaces
  ------- --------------------- ------------------------- --------- ------- ---------- ------------- ----------------
  nvme0   PHLE7200015N6P4BGN-1  7335943:ICDPC5ED2ORA6.4T  QDV1RD07  pcie    :88:00.0   nvme-subsys0  nvme0n1
  nvme1   PHLE7200015N6P4BGN-2  7335943:ICDPC5ED2ORA6.4T  QDV1RD03  pcie    :89:00.0   nvme-subsys1  nvme1n1
  nvme2   9eb72cbeecc6fdb0      Linux                     5.3.0-rc  loop               nvme-subsys2  nvme2n1, nvme2n2
  nvme3   9eb72cbeecc6fdb0      Linux                     5.3.0-rc  loop               nvme-subsys2  nvme2n1, nvme2n2

  NVM Express Namespaces

  Device    NSID  Usage              Format        Controllers
  --------- ----- ------------------ ------------- ----------------
  nvme0n1   1     3.20 TB / 3.20 TB  512 B + 0 B   nvme0
  nvme1n1   1     3.20 TB / 3.20 TB  512 B + 0 B   nvme1
  nvme2n1   1     3.20 TB / 3.20 TB  512 B + 0 B   nvme2, nvme3
  nvme2n2   2     3.20 TB / 3.20 TB  512 B + 0 B   nvme2, nvme3
Re: [PATCH] nvme-core: Fix subsystem instance mismatches
On Wed, Sep 04, 2019 at 10:07:12AM -0600, Logan Gunthorpe wrote: > Yes, I agree, we can't solve the mismatch problem in the general case: > with sequences of hot plug events there will always be a case that > mismatches. I just think we can do better in the simple common default case. This may be something where udev can help us. I might be able to find some time to look at that, but not today. > > Can we just ensure there is never a matching controller then? This > > patch will accomplish that and simpler than wrapping the instance in a > > refcount'ed object: > > > > http://lists.infradead.org/pipermail/linux-nvme/2019-May/024142.html > > I don't really like that idea. It reduces the confusion caused by > mismatching numbers, but causes the controller to never match the > namespace, which is also confusing but in a different way. > > I like the nvme_instance idea. It's not going to be perfect but it has > some nice properties: the subsystem will try to match the controller's > instance whenever possible, but in cases where it doesn't, the instance > number of the subsystem will never be the same as an existing controller. > > I'll see if I can work up a quick patch set and see what people think. How about this: we have the subsys copy the controller's instance, and the nvme_free_ctrl() doesn't release it if its subsys matches? 
---
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 14c0bfb55615..8a8279ece5ee 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -81,7 +81,6 @@ EXPORT_SYMBOL_GPL(nvme_reset_wq);
 struct workqueue_struct *nvme_delete_wq;
 EXPORT_SYMBOL_GPL(nvme_delete_wq);
 
-static DEFINE_IDA(nvme_subsystems_ida);
 static LIST_HEAD(nvme_subsystems);
 static DEFINE_MUTEX(nvme_subsystems_lock);
 
@@ -2344,7 +2343,8 @@ static void nvme_release_subsystem(struct device *dev)
 	struct nvme_subsystem *subsys =
 		container_of(dev, struct nvme_subsystem, dev);
 
-	ida_simple_remove(&nvme_subsystems_ida, subsys->instance);
+	if (subsys->instance >= 0)
+		ida_simple_remove(&nvme_instance_ida, subsys->instance);
 	kfree(subsys);
 }
 
@@ -2473,12 +2473,8 @@ static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
 	subsys = kzalloc(sizeof(*subsys), GFP_KERNEL);
 	if (!subsys)
 		return -ENOMEM;
-	ret = ida_simple_get(&nvme_subsystems_ida, 0, 0, GFP_KERNEL);
-	if (ret < 0) {
-		kfree(subsys);
-		return ret;
-	}
-	subsys->instance = ret;
+
+	subsys->instance = -1;
 	mutex_init(&subsys->lock);
 	kref_init(&subsys->ref);
 	INIT_LIST_HEAD(&subsys->ctrls);
@@ -2497,7 +2493,7 @@ static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
 	subsys->dev.class = nvme_subsys_class;
 	subsys->dev.release = nvme_release_subsystem;
 	subsys->dev.groups = nvme_subsys_attrs_groups;
-	dev_set_name(&subsys->dev, "nvme-subsys%d", subsys->instance);
+	dev_set_name(&subsys->dev, "nvme-subsys%d", ctrl->instance);
 	device_initialize(&subsys->dev);
 
 	mutex_lock(&nvme_subsystems_lock);
@@ -2528,6 +2524,8 @@ static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
 			goto out_put_subsystem;
 	}
 
+	if (!found)
+		subsys->instance = ctrl->instance;
 	ctrl->subsys = subsys;
 	list_add_tail(&ctrl->subsys_entry, &subsys->ctrls);
 	mutex_unlock(&nvme_subsystems_lock);
@@ -3803,7 +3801,9 @@ static void nvme_free_ctrl(struct device *dev)
 		container_of(dev, struct nvme_ctrl, ctrl_device);
 	struct nvme_subsystem *subsys = ctrl->subsys;
 
-	ida_simple_remove(&nvme_instance_ida, ctrl->instance);
+	if (subsys && ctrl->instance != subsys->instance)
+		ida_simple_remove(&nvme_instance_ida, ctrl->instance);
+
 	kfree(ctrl->effects);
 	nvme_mpath_uninit(ctrl);
 	__free_page(ctrl->discard_page);
@@ -4085,7 +4085,6 @@ static int __init nvme_core_init(void)
 
 static void __exit nvme_core_exit(void)
 {
-	ida_destroy(&nvme_subsystems_ida);
 	class_destroy(nvme_subsys_class);
 	class_destroy(nvme_class);
 	unregister_chrdev_region(nvme_chr_devt, NVME_MINORS);
--
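The ownership-transfer rule in the diff can be modeled in userspace to see why a released controller's number is never handed to a stranger while the subsystem lives. This is only a toy sketch: a small bitmap stands in for nvme_instance_ida, and all of the helper names here are invented, not the kernel API:

```c
#include <stdbool.h>

#define MAX_INST 8
static bool in_use[MAX_INST];   /* stand-in for the instance IDA */

static int alloc_instance(void)
{
    for (int i = 0; i < MAX_INST; i++)
        if (!in_use[i]) { in_use[i] = true; return i; }
    return -1;
}

static void release_instance(int i) { in_use[i] = false; }

struct subsys { int instance; };
struct ctrl   { int instance; struct subsys *subsys; };

/* First controller in a subsystem hands its instance to the subsystem. */
static void init_subsystem(struct ctrl *c, struct subsys *s, bool found)
{
    if (!found)
        s->instance = c->instance;
    c->subsys = s;
}

/* Mirrors nvme_free_ctrl(): only release the instance if the
 * subsystem does not still own it. */
static void free_ctrl(struct ctrl *c)
{
    if (!c->subsys || c->instance != c->subsys->instance)
        release_instance(c->instance);
}

/* Mirrors nvme_release_subsystem(): the owner frees it last. */
static void free_subsys(struct subsys *s)
{
    if (s->instance >= 0)
        release_instance(s->instance);
}
```

Walking the model through a disconnect/reconnect shows the instance held by the subsystem is pinned until the subsystem itself goes away, so no unrelated controller can ever be named nvme0 while nvme-subsys0 exists.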
Re: [PATCH] nvme-core: Fix subsystem instance mismatches
On Wed, Sep 04, 2019 at 05:42:15PM +0200, Christoph Hellwig wrote: > On Wed, Sep 04, 2019 at 08:44:27AM -0600, Keith Busch wrote: > > Let me step through an example: > > > > Ctrl A gets instance 0. > > > > Its subsystem gets the same instance, and takes ref count on it: > > all namespaces in this subsystem will use '0'. > > > > Ctrl B gets instance 1, and it's in the same subsystem as Ctrl A so > > no new subsystem is allocated. > > > > Ctrl A is disconnected, dropping its ref on instance 0, but the > > subsystem still has its refcount, making it unavailable. > > > > Ctrl A is reconnected, and allocates instance 2 because 0 is still in > > use. > > > > Now all the namespaces in this subsystem are prefixed with nvme0, but no > > controller exists with the same prefix. We still have inevitable naming > > mismatch, right? > > I think the major confusion was that we can use the same handle for > an unrelated subsystem vs controller, and that would avoid it. > > I don't see how we can avoid the controller is entirely different > from namespace problem ever. Can we just ensure there is never a matching controller then? This patch will accomplish that and is simpler than wrapping the instance in a refcount'ed object: http://lists.infradead.org/pipermail/linux-nvme/2019-May/024142.html
Re: [PATCH] nvme-core: Fix subsystem instance mismatches
On Wed, Sep 04, 2019 at 08:05:58AM +0200, Christoph Hellwig wrote: > On Tue, Sep 03, 2019 at 10:46:20AM -0600, Keith Busch wrote: > > Could we possibly make /dev/nvmeX be a subsystem handle without causing > > trouble for anyone? This would essentially be the same thing as today > > for non-CMIC controllers with a device-per-controller and only affects > > the CMIC ones. > > A per-subsystem character device doesn't make sense, as a lot of admin > commands require a specific controller. Yeah, I was hoping to provide something special for CMIC controllers so you can do path specific admin, but that looks sure to break user space. > If this really is an issue for people we'll just need to refcount the > handle allocation. That is: > > - nvme_init_ctrl allocates a new nvme_instance or so object, which does the ida_simple_get. > - we allocate a new subsystem that reuses the handle and grabs a reference in nvme_init_subsystem, then if we find an existing subsystem we drop that reference again. > - last free of a ctrl or subsystem also drops a reference, with the final free releasing the ida Let me step through an example: Ctrl A gets instance 0. Its subsystem gets the same instance, and takes ref count on it: all namespaces in this subsystem will use '0'. Ctrl B gets instance 1, and it's in the same subsystem as Ctrl A so no new subsystem is allocated. Ctrl A is disconnected, dropping its ref on instance 0, but the subsystem still has its refcount, making it unavailable. Ctrl A is reconnected, and allocates instance 2 because 0 is still in use. Now all the namespaces in this subsystem are prefixed with nvme0, but no controller exists with the same prefix. We still have an inevitable naming mismatch, right?
Re: [PATCH] nvme-core: Fix subsystem instance mismatches
On Tue, Sep 03, 2019 at 10:08:01AM -0600, Logan Gunthorpe wrote: > On 2019-08-31 9:29 a.m., Keith Busch wrote: > > On Fri, Aug 30, 2019 at 06:01:39PM -0600, Logan Gunthorpe wrote: > >> To fix this, assign the subsystem's instance based on the instance > >> number of the controller's instance that first created it. There should > >> always be fewer subsystems than controllers so there should not be a need > >> to create extra subsystems that overlap existing controllers. > > > > The subsystem's lifetime is not tied to the controller's. When the > > controller is removed and releases its instance, the next controller > > to take that available instance will create naming collisions with the > > subsystem still using it. > > > > Hmm, yes, ok. > > So perhaps we can just make the subsystem prefer the ctrl's instance > when allocating the ID? Then at least, in the common case, the > controller numbers will match the subsystem numbers. Only when there's > random hot-plugs would the numbers get out of sync. I really don't know about a patch that works only on static configurations. Connects and disconnects do happen on live systems, so the numerals will inevitably get out of sync. Could we possibly make /dev/nvmeX be a subsystem handle without causing trouble for anyone? This would essentially be the same thing as today for non-CMIC controllers with a device-per-controller and only affects the CMIC ones.
Re: [PATCH] nvme-core: Fix subsystem instance mismatches
On Fri, Aug 30, 2019 at 06:01:39PM -0600, Logan Gunthorpe wrote: > To fix this, assign the subsystem's instance based on the instance > number of the controller's instance that first created it. There should > always be fewer subsystems than controllers so there should not be a need > to create extra subsystems that overlap existing controllers. The subsystem's lifetime is not tied to the controller's. When the controller is removed and releases its instance, the next controller to take that available instance will create naming collisions with the subsystem still using it.
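The lifetime problem can be made concrete with a toy model. The bitmap allocator and variable names below are illustrative, not kernel code; it simulates the scheme being objected to, where the subsystem merely copies its first controller's instance while the controller still releases that instance when it goes away:

```c
#include <stdbool.h>

#define MAX_INST 8
static bool in_use[MAX_INST];   /* stand-in for the instance IDA */

static int alloc_instance(void)
{
    for (int i = 0; i < MAX_INST; i++)
        if (!in_use[i]) { in_use[i] = true; return i; }
    return -1;
}

/* The controller frees its instance on removal, even though the
 * still-live subsystem copied that number for its namespace names. */
static void release_instance(int i) { in_use[i] = false; }
```

Removing the first controller and then hot-plugging an unrelated one reallocates the same number, so a foreign nvme0 controller coexists with nvme0nX namespaces from a different subsystem, exactly the collision described above.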
Re: [PATCH 3/4] nvme: pci: pass IRQF_RESCURE_THREAD to request_threaded_irq
On Tue, Aug 27, 2019 at 08:34:21AM -0600, Keith Busch wrote: > I think you should probably just have pci_irq_get_affinity() take a flags > argument, or make a new function like __pci_irq_get_affinity() that > pci_irq_get_affinity() can call with a default flag. Sorry, copied the wrong function name; I meant pci_request_irq(), not pci_irq_get_affinity().
Re: [PATCH 4/4] genirq: use irq's affinity for threaded irq with IRQF_RESCUE_THREAD
On Tue, Aug 27, 2019 at 04:53:44PM +0800, Ming Lei wrote:
> In case of IRQF_RESCUE_THREAD, the threaded handler is only used to
> handle interrupt when IRQ flood comes, use irq's affinity for this thread
> so that scheduler may select other not too busy CPUs for handling the
> interrupt.
>
> Cc: Long Li
> Cc: Ingo Molnar
> Cc: Peter Zijlstra
> Cc: Keith Busch
> Cc: Jens Axboe
> Cc: Christoph Hellwig
> Cc: Sagi Grimberg
> Cc: John Garry
> Cc: Thomas Gleixner
> Cc: Hannes Reinecke
> Cc: linux-n...@lists.infradead.org
> Cc: linux-s...@vger.kernel.org
> Signed-off-by: Ming Lei
> ---
>  kernel/irq/manage.c | 13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> index 1566abbf50e8..03bc041348b7 100644
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -968,7 +968,18 @@ irq_thread_check_affinity(struct irq_desc *desc, struct irqaction *action)
>  	if (cpumask_available(desc->irq_common_data.affinity)) {
>  		const struct cpumask *m;
>  
> -		m = irq_data_get_effective_affinity_mask(&desc->irq_data);
> +		/*
> +		 * Managed IRQ's affinity is setup gracefull on MUNA locality,

s/MUNA/NUMA

> +		 * also if IRQF_RESCUE_THREAD is set, interrupt flood has been
> +		 * triggered, so ask scheduler to run the thread on CPUs
> +		 * specified by this interrupt's affinity.
> +		 */
> +		if ((action->flags & IRQF_RESCUE_THREAD) &&
> +				irqd_affinity_is_managed(&desc->irq_data))
> +			m = desc->irq_common_data.affinity;
> +		else
> +			m = irq_data_get_effective_affinity_mask(
> +					&desc->irq_data);
>  		cpumask_copy(mask, m);
>  	} else {
>  		valid = false;
> --
Re: [PATCH 3/4] nvme: pci: pass IRQF_RESCURE_THREAD to request_threaded_irq
On Tue, Aug 27, 2019 at 05:09:27PM +0800, Ming Lei wrote:
> On Tue, Aug 27, 2019 at 11:06:20AM +0200, Johannes Thumshirn wrote:
> > On 27/08/2019 10:53, Ming Lei wrote:
> > [...]
> > > +	char *devname;
> > > +	const struct cpumask *mask;
> > > +	unsigned long irqflags = IRQF_SHARED;
> > > +	int vector = pci_irq_vector(pdev, nvmeq->cq_vector);
> > > +
> > > +	devname = kasprintf(GFP_KERNEL, "nvme%dq%d", nr, nvmeq->qid);
> > > +	if (!devname)
> > > +		return -ENOMEM;
> > > +
> > > +	mask = pci_irq_get_affinity(pdev, nvmeq->cq_vector);
> > > +	if (mask && cpumask_weight(mask) > 1)
> > > +		irqflags |= IRQF_RESCUE_THREAD;
> > > +
> > > +	return request_threaded_irq(vector, nvme_irq, NULL, irqflags,
> > > +			devname, nvmeq);
> >
> > This will leak 'devname' which gets allocated by kasprintf() a few lines
> > above.
>
> It won't, please see pci_free_irq() in which free_irq() returns the
> 'devname' passed in.

Only if request_threaded_irq() doesn't return an error.

I think you should probably just have pci_irq_get_affinity() take a flags
argument, or make a new function like __pci_irq_get_affinity() that
pci_irq_get_affinity() can call with a default flag.
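The leak being pointed out exists only on the request_threaded_irq() failure path: on success the irq core keeps the name and free_irq() later hands it back. A userspace sketch of that ownership pattern, with request_irq_stub() and setup_irq() standing in for the real kernel APIs:

```c
#include <stdio.h>
#include <stdlib.h>

static char *owned_name;   /* what free_irq() would later hand back */

/* Stand-in for request_threaded_irq(): the irq core takes ownership
 * of 'devname' only on success. */
static int request_irq_stub(char *devname, int fail)
{
    if (fail)
        return -22;        /* -EINVAL: the core never stored the name */
    owned_name = devname;  /* success: ownership transferred */
    return 0;
}

static int setup_irq(int nr, int qid, int fail)
{
    char *devname = malloc(32);
    int ret;

    if (!devname)
        return -12;        /* -ENOMEM */
    snprintf(devname, 32, "nvme%dq%d", nr, qid);

    ret = request_irq_stub(devname, fail);
    if (ret)
        free(devname);     /* the cleanup the error path needs */
    return ret;
}
```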
Re: [PATCH v2 0/3] PCI: Add PCI_ERROR_RESPONSE, check for errors
On Thu, Aug 22, 2019 at 03:05:48PM -0500, Bjorn Helgaas wrote: > From: Bjorn Helgaas > > Reads from a PCI device may fail if the device has been turned off (put > into D3cold), removed, or if some other error occurs. The PCI host bridge > typically fabricates ~0 data to complete the CPU's read. > > We check for that in a few places, but not in a consistent way. This > series adds a PCI_ERROR_RESPONSE definition to make the checks more > consistent and easier to find. Note that ~0 may indicate a PCI error, but > it may also be valid read data, so you need more information (such as > knowing that a register can never contain ~0) before concluding that it's > an error. > > This series also adds a new check for PCI_ERROR_RESPONSE in the power > management code because that code frequently encounters devices in D3cold, > where we previously misinterpreted ~0 data. It also uses pci_power_name() > to print D-state names more consistently. > > Rafael, I didn't add your Reviewed-by to "PCI / PM: Return error when > changing power state from D3cold" because I made small changes to try to > make the messages more consistent, and I didn't want to presume they'd be > OK with you. > > Changes since v1: > - Add Rafael's Reviewed-By to the first two patches > - Drop "PCI / PM: Check for error when reading PME status" because Rafael > pointed out that some devices can signal PME even when in D3cold, so > this would require additional rework > - Drop "PCI / PM: Check for error when reading Power State" because > Rafael thinks it's mostly redundant > > Bjorn Helgaas (3): > PCI: Add PCI_ERROR_RESPONSE definition > PCI / PM: Decode D3cold power state correctly > PCI / PM: Return error when changing power state from D3cold Series looks good to me. Reviewed-by: Keith Busch
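The caveat about ~0 being either an error response or legitimate data is the crux of the series. A minimal sketch of the intended check, using the real PCI_PM_CTRL state-field mask but an invented helper name, and assuming PCI_ERROR_RESPONSE is defined as all 1s as in this series:

```c
#include <stdint.h>

#define PCI_ERROR_RESPONSE     ((uint32_t)~0)
#define PCI_PM_CTRL_STATE_MASK 0x0003   /* D0..D3hot live in these bits */

/*
 * decode_pm_state() is an illustrative helper, not a kernel function.
 * A PMCSR read of all 1s from a device in D3cold (or removed) must be
 * treated as "no response", not decoded as D3hot - note the D3hot
 * state field value (0x3) is also all 1s within its mask, so the
 * caller needs the full-register comparison to tell them apart.
 */
static int decode_pm_state(uint32_t pmcsr)
{
    if (pmcsr == PCI_ERROR_RESPONSE)
        return -1;                          /* device did not respond */
    return pmcsr & PCI_PM_CTRL_STATE_MASK;  /* 0 = D0 .. 3 = D3hot */
}
```

The same "is ~0 legal here?" reasoning has to be applied per register, which is why the series only adds the check where the value can never legitimately be all 1s.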