Re: [PATCH RFC 2/2] NVMe: rfc blk-mq support
On Tue, 8 Oct 2013, Matias Bjørling wrote:
> Convert the driver to blk mq.
>
> The patch consists of:
>
>  * Initialization of mq data structures.
>  * Convert function calls from bio to request data structures.
>  * IO queues are split into an admin queue and io queues.
>  * bio splits are removed as it should be handled by block layer.
>
> Signed-off-by: Matias Bjørling

I have no opinion right now if this is a good idea or not. I'll just
comment on a couple issues on this implementation. Otherwise I think
it's pretty neat and gave me a reason to explore multiqueues!

First a couple minor suggestions:

You might want to use "REQ_END" from the rq->cmd_flags to know if you
should write the queue doorbell or not. Aggregating these would help
most devices, but we didn't have a good way of knowing before if this
was the last request or not.

Maybe change nvme_submit_bio_queue to return a BLK_MQ_RQ_QUEUE_ status
so you don't need that switch statement after calling it.

Must do something about requests that don't align to PRPs. I think you
mentioned this in the outstanding work in [0/2].

>  struct nvme_queue *get_nvmeq(struct nvme_dev *dev)
>  {
> -	return dev->queues[get_cpu() + 1];
> +	get_cpu();
> +	return dev->admin_queue;
>  }

get_nvmeq now returns only the admin queue when it used to return only
IO queues. This breaks NVME_IO and SG_IO ioctl handling.

> +static int nvme_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
> +			  unsigned int i)
> +{
> +	struct nvme_ns *ns = data;
> +	struct nvme_dev *dev = ns->dev;
> +	struct nvme_queue *nq;
> +
> +	nq = nvme_create_queue(dev, i + 1, hctx->queue_depth, i);
> +	if (IS_ERR(nq))
> +		return PTR_ERR(nq);
> +
> +	hctx->driver_data = nq;
> +
> +	return 0;
> +}

This right here is the biggest problem with the implementation. It is
going to fail for every namespace but the first one, since each
namespace registers a multiqueue and each multiqueue requires a hw
context to work. The number of queues is for the device, not the
namespace, so only the first namespace is going to successfully return
from nvme_init_hctx; the rest will be unable to create an NVMe IO
queue for trying to create one with an already allocated QID.

You should instead create the IO queues on the device like how it was
done before, then just set the hctx->driver_data to dev->queues[i + 1]
or something like that.

> +static enum blk_eh_timer_return nvme_timeout(struct request *rq)
> +{
> +	/* Currently the driver handle timeouts by itself */
> +	return BLK_EH_NOT_HANDLED;
> +}

Need to do something with the command timeouts here or somewhere.
You've changed the driver to poll only on the admin queue for timed
out commands, and left the multiqueue timeout a no-op.
Re: [PATCH RFC 0/2] Convert from bio-based to blk-mq
On Tue, 8 Oct 2013, Jens Axboe wrote:
> On Tue, Oct 08 2013, Matthew Wilcox wrote:
> > On Tue, Oct 08, 2013 at 11:34:20AM +0200, Matias Bjørling wrote:
> > > The nvme driver implements itself as a bio-based driver. This
> > > primarily because of high lock congestion for high-performance nvm
> > > devices. To remove the congestion, a multi-queue block layer is
> > > being implemented.
> >
> > Um, no. You'll crater performance by adding another memory
> > allocation (of the struct request). multi-queue is not the solution.
>
> That's a rather "jump to conclusions" statement to make. As Matias
> mentioned, there are no extra fast path allocations. Once the tagging
> is converted as well, I'd be surprised if it performs worse than
> before. And that on top of a net reduction in code.
>
> blk-mq might not be perfect as it stands, but it's a helluva lot
> better than a bunch of flash based drivers with lots of duplicated
> code and mechanisms. We need to move away from that.
>
> --
> Jens Axboe

But this wastes copious amounts of memory on an NVMe device with more
than 1 namespace. The hardware's queues are shared among all
namespaces, so you can't possibly have all the struct requests in use.

What would be better is if I can create one blk-mq for each device/host
and attach multiple gendisks to that.
Re: [PATCH RFC 00/21] blk-mq: Introduce combined hardware queues
On Mon, Sep 19, 2016 at 12:38:05PM +0200, Alexander Gordeev wrote:
> On Fri, Sep 16, 2016 at 05:04:48PM -0400, Keith Busch wrote:
> > Having a 1:1 already seemed like the ideal solution since you can't
> > simultaneously utilize more than that from the host, so there's no
> > more h/w parallelism we can exploit. On the controller side,
> > fetching commands is serialized memory reads, so I don't think
> > spreading IO among more h/w queues helps the target over posting
> > more commands to a single queue.
>
> I take a notion of un-ordered commands completion you described below.
> But I fail to realize why a CPU would not simultaneously utilize more
> than one queue by posting to multiple. Is it due to nvme specifics or
> you assume the host would not issue that many commands?

What I mean is that if you have N CPUs, you can't possibly
simultaneously write more than N submission queue entries. The benefit
of having a 1:1 queue <-> CPU mapping is that each CPU can post a
command to its queue without lock contention at the same time as
another thread. Having more queues to choose from doesn't let the host
post commands any faster than we can today.

When we're out of tags, the request currently just waits for one to
become available, increasing submission latency. You can fix that by
increasing the available tags with deeper or more h/w queues, but that
just increases completion latency since the device can't process them
any faster. It's six of one, half dozen of the other.

The depth per queue defaults to 1k. If your process really is able to
use all those resources, the hardware is completely saturated and
you're not going to benefit from introducing more tags [1]. It could
conceivably be worse by reducing cache hits, or hit inappropriate
timeout handling with the increased completion latency.

[1] http://lists.infradead.org/pipermail/linux-nvme/2014-July/001064.html
Re: [PATCH 0/3] nvme: Don't add namespaces for locked drives
On Sun, Jun 19, 2016 at 04:06:31PM -0700, Jethro Beekman wrote:
> If an NVMe drive is locked with ATA Security, most commands sent to
> the drive will fail. This includes commands sent by the kernel upon
> discovery to probe for partitions. The failing happens in such a way
> that trying to do anything with the drive (e.g. sending an unlock
> command; unloading the nvme module) is basically impossible with the
> high default command timeout.

Why is the timeout a factor here? Is it because the error your drive
returns does not have DNR set and goes through 30 seconds of retries?
If so, I think we should probably have a limited retry count instead
of unlimited retries within the command timeout.

> This patch adds a check to see if the drive is locked, and if it is,
> its namespaces are not initialized. It is expected that userspace
> will send the proper "security send/unlock" command and then reset
> the controller. Userspace tools are available at [1].

Aren't these security settings per-namespace rather than the entire
device?

> I intend to also submit a future patch that tracks ATA Security
> commands sent from userspace and remembers the password so it can be
> submitted to a locked drive upon pm_resume. (still WIP)

This subjects the system to various attacks like cold boot or hotswap,
but that's what users want!

This is ATA security, though, so wouldn't ATA also benefit from this?
The payload setup/decoding should then go in a generic library for
everyone. Similar was said about the patch adding OPAL security to the
NVMe driver:

  http://lists.infradead.org/pipermail/linux-nvme/2016-April/004428.html
Re: [PATCH 0/3] nvme: Don't add namespaces for locked drives
On Mon, Jun 20, 2016 at 11:21:09AM -0700, Jethro Beekman wrote:
> On 20-06-16 08:26, Keith Busch wrote:
> Would this just be a matter of setting req->retries and checking for
> it in nvme_req_needs_retry? How does one keep track of the number of
> tries so far?

I just sent a patch out earlier today to use req->retries to track the
retry count, and an nvme module parameter to set the max retries. I
think that would fix the long delays you're seeing, assuming the patch
is okay.

> You're right, I assumed that admin commands can't have namespace ids,
> but looking at the spec, that's not the case. Turns out there's a
> problem with the driver then: nvme_ioctl never includes the ns for
> NVME_IOCTL_ADMIN_CMD.

The NVME_IOCTL_ADMIN_CMD already takes any namespace identifier the
user put in that field.
Re: [PATCH v4 0/3] nvme power saving
On Thu, Sep 22, 2016 at 02:33:36PM -0700, J Freyensee wrote:
> ...and some SSDs don't even support this feature yet, so the number of
> different NVMe devices available to test initially will most likely be
> small (like the Fultondales I have, all I could check is to see if the
> code broke anything if the device did not have this power-save
> feature).
>
> I agree with Jens, makes a lot of sense to start with this feature
> 'off'.
>
> To 'advertise' the feature, maybe make the feature a new selection in
> Kconfig? Example, initially make it "EXPERIMENTAL", and later when
> more devices implement this feature it can be integrated more tightly
> into the NVMe solution and default to on.

Should we just leave the kernel out of this then? I bet we could
script this feature in user space.
Re: [PATCHv3 2/2] x86/vmd: Add PCI domain specific LED option
On Fri, Sep 23, 2016 at 09:34:41AM -0500, Bjorn Helgaas wrote:
> I made the necessary changes to match the renaming I did in the first
> patch, and I also used plain old "#ifdef" instead of "#if IS_ENABLED"
> since the rest of the file uses the former style. If there's a reason
> to switch, we should change the whole file in a separate patch so we
> can explain the rationale.

The check was "IS_ENABLED" because VMD can be a loadable module, which
fails the ifdef check. I see Fengguang 0'dayed it using the module
configuration option. I can send you a fix based on your pci/hotplug
branch, or revert and apply the original patch if you prefer.

BTW, you had asked me not to split a series when incremental fixes
touched a single patch. I didn't resend the whole series here, and
while you got the right patches, I apologize for making it more
difficult to find.
Re: [PATCHv3 2/2] x86/vmd: Add PCI domain specific LED option
On Fri, Sep 23, 2016 at 02:12:23PM -0500, Bjorn Helgaas wrote:
> BTW, the "Volume Management Device Driver" config item appears by
> itself in the top-level menuconfig menu. That seems a little ...
> presumptuous; is it what you intended?

Not really intended, but I didn't really know any better at the
time. :)

I think drivers/pci/host/ is a more appropriate home for this driver.
I'll send a patch to relocate it, along with the necessary Kconfig
updates.
Re: [PATCH -next] PCI/PCIe: make DPC explicitly non-modular
On Thu, Jul 28, 2016 at 03:33:03PM -0400, Paul Gortmaker wrote:
> The Kconfig currently controlling compilation of this code is:
>
> drivers/pci/pcie/Kconfig:config PCIE_DPC
> drivers/pci/pcie/Kconfig:	bool "PCIe Downstream Port Containment support"
>
> ...meaning that it currently is not being built as a module by anyone.
>
> Lets remove the modular code that is essentially orphaned, so that
> when reading the driver there is no doubt it is builtin-only.
>
> Since module_init translates to device_initcall in the non-modular
> case, the init ordering remains unchanged with this commit.
>
> We also delete the MODULE_LICENSE tag etc. since all that information
> was (or is now) contained at the top of the file in the comments.

Thanks for cleaning this up. FWIW, all of the other pcie service
drivers look like they could use the same cleanup.

Reviewed-by: Keith Busch
Re: [PATCH] nvme/quirk: Add a delay before checking device ready for memblaze device
On Thu, Sep 08, 2016 at 11:36:34AM +0800, Wang Weber wrote:
> Hi Jens,
>
> The following email was sent before, but it seems to be blocked since
> I received some messages about sending failure. So resend it with
> Google email account.

You're still sending non-plain-text email, so it's going to be
rejected by the mailing list filter. It also mangles your patch such
that it won't cleanly apply. You have to use plain text for this to
work well.
[PATCHv3 1/2] pciehp: Let user control LED status
This patch adds a new flag to the pci_dev structure instructing pciehp
to not interpret PCIe slot LED indicators. The pciehp driver will
instead allow all LED control from the user by setting the slot control
indicators as the user requested through sysfs.

This is preparing for domain devices that repurpose these control bits
for non-standard use.

Signed-off-by: Keith Busch
---
v2 -> v3: Moved the slot op's attention status callback to pciehp_hpc.c

 drivers/pci/hotplug/pciehp.h      |  5 +
 drivers/pci/hotplug/pciehp_core.c |  3 +++
 drivers/pci/hotplug/pciehp_hpc.c  | 27 +++
 include/linux/pci.h               |  1 +
 4 files changed, 36 insertions(+)

diff --git a/drivers/pci/hotplug/pciehp.h b/drivers/pci/hotplug/pciehp.h
index e764918..7f71ac1 100644
--- a/drivers/pci/hotplug/pciehp.h
+++ b/drivers/pci/hotplug/pciehp.h
@@ -152,6 +152,11 @@ bool pciehp_check_link_active(struct controller *ctrl);
 void pciehp_release_ctrl(struct controller *ctrl);
 int pciehp_reset_slot(struct slot *slot, int probe);
 
+int pciehp_set_raw_attention_status(struct hotplug_slot *hotplug_slot,
+				    u8 status);
+int pciehp_get_raw_attention_status(struct hotplug_slot *hotplug_slot,
+				    u8 *value);
+
 static inline const char *slot_name(struct slot *slot)
 {
 	return hotplug_slot_name(slot->hotplug_slot);
diff --git a/drivers/pci/hotplug/pciehp_core.c b/drivers/pci/hotplug/pciehp_core.c
index fb0f863..68b7aeb 100644
--- a/drivers/pci/hotplug/pciehp_core.c
+++ b/drivers/pci/hotplug/pciehp_core.c
@@ -113,6 +113,9 @@ static int init_slot(struct controller *ctrl)
 	if (ATTN_LED(ctrl)) {
 		ops->get_attention_status = get_attention_status;
 		ops->set_attention_status = set_attention_status;
+	} else if (ctrl->pcie->port->user_leds) {
+		ops->get_attention_status = pciehp_get_raw_attention_status;
+		ops->set_attention_status = pciehp_set_raw_attention_status;
 	}
 
 	/* register this slot with the hotplug pci core */
diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index 08e84d6..fccb63d 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -355,6 +355,18 @@ static int pciehp_link_enable(struct controller *ctrl)
 	return __pciehp_link_set(ctrl, true);
 }
 
+int pciehp_get_raw_attention_status(struct hotplug_slot *hotplug_slot,
+				    u8 *value)
+{
+	struct slot *slot = hotplug_slot->private;
+	struct pci_dev *pdev = ctrl_dev(slot->ctrl);
+	u16 slot_ctrl;
+
+	pcie_capability_read_word(pdev, PCI_EXP_SLTCTL, &slot_ctrl);
+	*status = (slot_ctrl & (PCI_EXP_SLTCTL_AIC | PCI_EXP_SLTCTL_PIC)) >> 6;
+	return 0;
+}
+
 void pciehp_get_attention_status(struct slot *slot, u8 *status)
 {
 	struct controller *ctrl = slot->ctrl;
@@ -431,6 +443,17 @@ int pciehp_query_power_fault(struct slot *slot)
 	return !!(slot_status & PCI_EXP_SLTSTA_PFD);
 }
 
+int pciehp_set_raw_attention_status(struct hotplug_slot *hotplug_slot,
+				    u8 status)
+{
+	struct slot *slot = hotplug_slot->private;
+	struct controller *ctrl = slot->ctrl;
+
+	pcie_write_cmd_nowait(ctrl, value << 6,
+			      PCI_EXP_SLTCTL_AIC | PCI_EXP_SLTCTL_PIC);
+	return 0;
+}
+
 void pciehp_set_attention_status(struct slot *slot, u8 value)
 {
 	struct controller *ctrl = slot->ctrl;
@@ -804,6 +827,10 @@ struct controller *pcie_init(struct pcie_device *dev)
 	}
 	ctrl->pcie = dev;
 	pcie_capability_read_dword(pdev, PCI_EXP_SLTCAP, &slot_cap);
+
+	if (pdev->user_leds)
+		slot_cap &= ~(PCI_EXP_SLTCAP_AIP | PCI_EXP_SLTCAP_PIP);
+
 	ctrl->slot_cap = slot_cap;
 	mutex_init(&ctrl->ctrl_lock);
 	init_waitqueue_head(&ctrl->queue);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 7256f33..f41bbca 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -308,6 +308,7 @@ struct pci_dev {
 					   powered on/off by the corresponding bridge */
 	unsigned int	ignore_hotplug:1;	/* Ignore hotplug events */
+	unsigned int	user_leds:1;		/* User excluse LED SlotCtl */
 	unsigned int	d3_delay;	/* D3->D0 transition time in ms */
 	unsigned int	d3cold_delay;	/* D3cold->D0 transition time in ms */
-- 
2.7.2
[PATCHv3 2/2] x86/vmd: Add PCI domain specific LED option
This patch adds a new function to set PCI domain specific options as
devices are added. The usage included in this patch is for LED
indicator control in VMD domains, but may be extended in the future as
new domain specific options are required.

PCIe LED Slot Control in a VMD domain is repurposed to a non-standard
implementation. As such, all devices in a VMD domain will be flagged
so pciehp does not attempt to use LED indicators. This user_led flag
has pciehp provide a different sysfs entry for user exclusive control
over the domain's slot indicators.

In order to determine if a bus is within a PCI domain, the patch
appends a bool to the pci_sysdata structure that the VMD driver sets
during initialization.

Requested-by: Kapil Karkra
Tested-by: Artur Paszkiewicz
Signed-off-by: Keith Busch
---
No change from previous version of this patch; just part of the series.

 arch/x86/include/asm/pci.h | 14 ++
 arch/x86/pci/common.c      |  7 +++
 arch/x86/pci/vmd.c         |  1 +
 3 files changed, 22 insertions(+)

diff --git a/arch/x86/include/asm/pci.h b/arch/x86/include/asm/pci.h
index 9ab7507..1411dbe 100644
--- a/arch/x86/include/asm/pci.h
+++ b/arch/x86/include/asm/pci.h
@@ -23,6 +23,9 @@ struct pci_sysdata {
 #ifdef CONFIG_PCI_MSI_IRQ_DOMAIN
 	void		*fwnode;	/* IRQ domain for MSI assignment */
 #endif
+#if IS_ENABLED(CONFIG_VMD)
+	bool vmd_domain;		/* True if in Intel VMD domain */
+#endif
 };
 
 extern int pci_routeirq;
@@ -56,6 +59,17 @@ static inline void *_pci_root_bus_fwnode(struct pci_bus *bus)
 #define pci_root_bus_fwnode	_pci_root_bus_fwnode
 #endif
 
+static inline bool is_vmd(struct pci_bus *bus)
+{
+#if IS_ENABLED(CONFIG_VMD)
+	struct pci_sysdata *sd = bus->sysdata;
+
+	return sd->vmd_domain;
+#else
+	return false;
+#endif
+}
+
 /* Can be used to override the logic in pci_scan_bus for skipping
    already-configured bus numbers - to be used for buggy BIOSes
    or architectures with incomplete PCI setup by the loader */
diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
index 7b6a9d1..ccf696c 100644
--- a/arch/x86/pci/common.c
+++ b/arch/x86/pci/common.c
@@ -677,6 +677,12 @@ static void set_dma_domain_ops(struct pci_dev *pdev)
 static void set_dma_domain_ops(struct pci_dev *pdev) {}
 #endif
 
+static void set_dev_domain_options(struct pci_dev *pdev)
+{
+	if (is_vmd(pdev->bus))
+		pdev->user_leds = 1;
+}
+
 int pcibios_add_device(struct pci_dev *dev)
 {
 	struct setup_data *data;
@@ -707,6 +713,7 @@ int pcibios_add_device(struct pci_dev *dev)
 		iounmap(data);
 	}
 	set_dma_domain_ops(dev);
+	set_dev_domain_options(dev);
 	return 0;
 }
diff --git a/arch/x86/pci/vmd.c b/arch/x86/pci/vmd.c
index b814ca6..a021b7b 100644
--- a/arch/x86/pci/vmd.c
+++ b/arch/x86/pci/vmd.c
@@ -596,6 +596,7 @@ static int vmd_enable_domain(struct vmd_dev *vmd)
 		.parent = res,
 	};
 
+	sd->vmd_domain = true;
 	sd->domain = vmd_find_free_domain();
 	if (sd->domain < 0)
 		return sd->domain;
-- 
2.7.2
Re: [PATCHv3 1/2] pciehp: Let user control LED status
On Tue, Sep 13, 2016 at 09:05:39AM -0600, Keith Busch wrote:
> +int pciehp_get_raw_attention_status(struct hotplug_slot *hotplug_slot,
> +				    u8 *value)
> +{
> +	struct slot *slot = hotplug_slot->private;
> +	struct pci_dev *pdev = ctrl_dev(slot->ctrl);
> +	u16 slot_ctrl;
> +
> +	pcie_capability_read_word(pdev, PCI_EXP_SLTCTL, &slot_ctrl);
> +	*status = (slot_ctrl & (PCI_EXP_SLTCTL_AIC | PCI_EXP_SLTCTL_PIC)) >> 6;

Sorry, I generated this patch from a wrong branch. Won't compile; will
send the correct series. Apologies for the churn.
Re: blk-mq: allow passing in an external queue mapping V3
On Wed, Sep 14, 2016 at 04:18:46PM +0200, Christoph Hellwig wrote:
> This series is the remainder of the earlier "automatic interrupt
> affinity for MSI/MSI-X capable devices" series, and makes use of the
> new irq-level interrupt / queue mapping code in blk-mq, as well as
> allowing the driver to pass in such a mask obtained from the (PCI)
> interrupt code. To fully support this feature in drivers the final
> third in the PCI layer will be needed as well.

Thanks, this looks good and tests successfully on my hardware.

For the series:

Reviewed-by: Keith Busch
Re: NVME regression in all kernels after 4.4.x for NVME in M2 slot for laptop?
On Fri, Aug 05, 2016 at 12:03:23PM -0700, Marc MERLIN wrote:
> Would this patch make sense as being the reason why I can't S3 sleep
> anymore and would you have a test patch against 4.5, 4.6, or 4.7 I
> can try to see if it fixes the problem?

Hi Marc,

It might be that blk-mq's hot cpu notifier is invoked during suspend
and waits for nvme's queues to freeze, which may not happen if a
request is waiting on a stopped queue. The patch you bisected doesn't
necessarily fix that, but the window for when a request could get
queued like that was much shorter. Assuming that is the problem, S3
suspends PCI hardware before IO tasks.

I'll see if I can reproduce on one of my machines and look into a fix.

Thanks,
Keith
Re: [PATCH] nvme: Fix nvme_get/set_features() with a NULL result pointer
On Fri, Aug 26, 2016 at 04:35:57PM +0200, Christoph Hellwig wrote:
> On Fri, Aug 26, 2016 at 07:31:33AM -0700, Andy Lutomirski wrote:
> > - Consider *deleting* the SCSI translation layer's power saving
> >   code. It looks almost entirely bogus to me. It has an off-by-one
> >   in its NPSS handling, it hardcodes power state indices which is
> >   total BS, it ignores the distinction between operational and
> >   non-operational states (which I think matters for non-APST usage).
> >   It also seems likely that it's never been used, since it's one of
> >   the formerly crashy-looking set_features users.
>
> Please go ahead and send a patch to delete it. Adding the whole SCSI
> layer was a mistake to start with, and it's always been horribly
> buggy. Until I started running the libiscsi testsuite even fairly
> normal I/O commands were a sure way to crash it, and crazy things
> like PM are almost guaranteed to a) not actually be used by real
> applications and b) be horribly buggy (as you've already noticed)

Ack. If no distros or tools rely on the SCSI crutch anymore, then by
all means, let's delete it. It's been disabled by default for a while
now, and I think/hope everyone we care about has since migrated to
nvme awareness.
Re: [patch 15/55] PCI: vmd: Create named irq domain
On Tue, Jun 20, 2017 at 01:37:15AM +0200, Thomas Gleixner wrote:
>  static int vmd_enable_domain(struct vmd_dev *vmd)
>  {
>  	struct pci_sysdata *sd = &vmd->sysdata;
> +	struct fwnode_handle *fn;
>  	struct resource *res;
>  	u32 upper_bits;
>  	unsigned long flags;
> @@ -617,8 +618,13 @@ static int vmd_enable_domain(struct vmd_
>  
>  	sd->node = pcibus_to_node(vmd->dev->bus);
>  
> -	vmd->irq_domain = pci_msi_create_irq_domain(NULL, &vmd_msi_domain_info,
> +	fn = irq_domain_alloc_named_id_fwnode("VMD-MSI", vmd->sysdata.domain);
> +	if (!fn)
> +		return -ENODEV;
> +
> +	vmd->irq_domain = pci_msi_create_irq_domain(fn, &vmd_msi_domain_info,
>  						    x86_vector_domain);
> +	kfree(fn);

If I'm following all this correctly, it looks like we need to use
irq_domain_free_fwnode with irq_domain_alloc_named_id_fwnode instead
of freeing 'fn' directly, otherwise we leak 'fwid->name'.
Re: [patch 32/55] x86/irq: Restructure fixup_irqs()
On Tue, Jun 20, 2017 at 01:37:32AM +0200, Thomas Gleixner wrote:
> @@ -441,18 +440,27 @@ void fixup_irqs(void)
>  
>  	for_each_irq_desc(irq, desc) {
>  		const struct cpumask *affinity;
> -		int break_affinity = 0;
> -		int set_affinity = 1;
> +		bool break_affinity = false;
>  
>  		if (!desc)
>  			continue;
> -		if (irq == 2)
> -			continue;
>  
>  		/* interrupt's are disabled at this point */
>  		raw_spin_lock(&desc->lock);
>  
>  		data = irq_desc_get_irq_data(desc);
> +		chip = irq_data_get_irq_chip(data);
> +		/*
> +		 * The interrupt descriptor might have been cleaned up
> +		 * already, but it is not yet removed from the radix
> +		 * tree. If the chip does not have an affinity setter,
> +		 * nothing to do here.
> +		 */
> +		if (!chip !chip->irq_set_affinity) {
> +			raw_spin_unlock(&desc->lock);
> +			continue;
> +		}

A bit of a moot point since the very next patch deletes all of this,
but I found this broken 'if' condition when compiling one patch at a
time; it's missing the '&&'.
Re: [PATCH V2 1/1] nvme: fix multiple ctrl removal scheduling
On Wed, May 24, 2017 at 05:26:25PM +0300, Rakesh Pandit wrote:
> Commit c5f6ce97c1210 tries to address multiple resets but fails as
> work_busy doesn't involve any synchronization and can fail. This is
> reproducible easily as can be seen by WARNING below which is triggered
> with line:
>
> WARN_ON(dev->ctrl.state == NVME_CTRL_RESETTING)
>
> Allowing multiple resets can result in multiple controller removal as
> well if different conditions inside nvme_reset_work fail and which
> might deadlock on device_release_driver.
>
> This patch makes sure that work queue item (reset_work) is added only
> if controller state != NVME_CTRL_RESETTING and that is achieved by
> moving state change outside nvme_reset_work into nvme_reset and
> removing old work_busy call. State change is always synchronized
> using controller spinlock.

So, the reason the state is changed when the work is running rather
than at queueing is for the window when the state may be set to
NVME_CTRL_DELETING, and we don't want the reset work to proceed in
that case.

What do you think about adding a new state, like NVME_CTRL_SCHED_RESET,
then leaving the NVME_CTRL_RESETTING state change as-is?
Re: [PATCH 2/2] nvme: Quirk APST on Intel 600P/P3100 devices
On Wed, May 24, 2017 at 03:06:31PM -0700, Andy Lutomirski wrote:
> They have known firmware bugs. A fix is apparently in the works --
> once fixed firmware is available, someone from Intel (Hi, Keith!)
> can adjust the quirk accordingly.

Here's the latest firmware with all the known fixes:

  https://downloadcenter.intel.com/download/26491?v=t

I am not 100% sure if this fixes the issue you're quirking, but I
think it falls under the link initialization and ASPM issues mentioned
in the release notes.
Re: [PATCH 1/3] nvme: do not check for ns on rw path
On Fri, Nov 03, 2017 at 01:53:40PM +0100, Christoph Hellwig wrote:
> > -	if (ns && ns->ms &&
> > +	if (ns->ms &&
> >  	    (!ns->pi_type || ns->ms != sizeof(struct t10_pi_tuple)) &&
> >  	    !blk_integrity_rq(req) && !blk_rq_is_passthrough(req))
> >  		return BLK_STS_NOTSUPP;
>
> blk_rq_is_passthrough also can't be true here.
>
> How about:
>
>	if (ns->ms && !blk_integrity_rq(req) &&
>	    (!ns->pi_type || ns->ms != sizeof(struct t10_pi_tuple)))
>		return BLK_STS_NOTSUPP;
>
> Although I have to admit I don't really understand what this check
> is even trying to do. It basically checks for a namespace that has
> a format with metadata that is not T10 protection information and
> then rejects all I/O to it. Why are we even creating a block device
> node for such a thing?

If the namespace has metadata, but the request doesn't have a metadata
payload attached to it for whatever reason, we can't construct the
command for that format. We also can't have the controller
strip/generate the payload with the PRACT bit set if it's not a T10
format, so we just fail the command.
Re: [PATCH 3/3] nvme: fix eui_show() print format
On Fri, Nov 03, 2017 at 01:55:16PM +0100, Christoph Hellwig wrote:
> On Fri, Nov 03, 2017 at 11:02:50AM +0100, Javier González wrote:
> > Signed-off-by: Javier González
> > ---
> >  drivers/nvme/host/core.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > index ae8ab0a1ef0d..f05c81774abf 100644
> > --- a/drivers/nvme/host/core.c
> > +++ b/drivers/nvme/host/core.c
> > @@ -2108,7 +2108,7 @@ static ssize_t eui_show(struct device *dev, struct device_attribute *attr,
> >  								char *buf)
> >  {
> >  	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
> > -	return sprintf(buf, "%8phd\n", ns->eui);
> > +	return sprintf(buf, "%8phD\n", ns->eui);
> >  }
> >  static DEVICE_ATTR(eui, S_IRUGO, eui_show, NULL);
>
> This looks correct. I wonder what the old code printed - does someone
> have a device with an EUI-64 at hand to quickly cross check what we
> did before?

It just prints the same as the 'ph' format, which would look like this:

  01 02 03 04 05 06 07 08

The change will make it look like this:

  01-02-03-04-05-06-07-08

I think that was the original intention.

Reviewed-by: Keith Busch
Re: [PATCH 1/3] nvme: do not check for ns on rw path
On Sat, Nov 04, 2017 at 09:18:25AM +0100, Christoph Hellwig wrote:
> On Fri, Nov 03, 2017 at 09:02:04AM -0600, Keith Busch wrote:
> > If the namespace has metadata, but the request doesn't have a
> > metadata payload attached to it for whatever reason, we can't
> > construct the command for that format. We also can't have the
> > controller strip/generate the payload with PRACT bit set if it's
> > not a T10 format, so we just fail the command.
>
> The only metadata payload a READ or WRITE request can have is a
> protection information one. For a namespace formatted with protection
> information bio_integrity_prep as called from blk_mq_make_request
> will ensure we always have metadata attached.
>
> If a namespace is formatted with non-PI metadata we will never have
> metadata attached to the bio/request and should not even present the
> namespace to the kernel.

That's not quite right. For non-PI metadata formats, we use the
'nop_profile', which gets the metadata buffer allocated so we can
safely use a metadata formatted namespace. There's no in-kernel user
of the allocated payload, but we still need the metadata buffer in
order to use the namespace at all.
Re: [PATCH 1/3] nvme: do not check for ns on rw path
On Mon, Nov 06, 2017 at 10:13:24AM +0100, Christoph Hellwig wrote:
> On Sat, Nov 04, 2017 at 09:38:45AM -0600, Keith Busch wrote:
> > That's not quite right. For non-PI metadata formats, we use the
> > 'nop_profile', which gets the metadata buffer allocated so we can
> > safely use a metadata formatted namespace. There's no in-kernel
> > user of the allocated payload, but we still need the metadata
> > buffer in order to use the namespace at all.
>
> You're right. But that means we will indeed always have a matching
> integrity payload here and the check should not be needed.
>
> Are you fine with turning it into something like:
>
>	if (WARN_ON_ONCE(ns->ms && !blk_integrity_rq(req)))
>		return BLK_STS_IOERR;
>
> ?

Yes, that looks fine. You're right that we are not supposed to be able
to get into this path for read/write since the driver sets the
capacity to 0 if a metadata format doesn't have integrity support, so
hitting this warning would indicate something has gone wrong.
Re: [RFC PATCHv2] x86/pci: Initial commit for new VMD device driver
On Sat, 3 Oct 2015, Ingo Molnar wrote:
> * Keith Busch wrote:
> > +config VMDDEV
> > +	depends on PCI && PCI_DOMAINS && PCI_MSI && GENERIC_MSI_IRQ_DOMAIN && IRQ_DOMAIN_HIERARCHY
> > +	tristate "Volume Management Device Driver"
> > +	default N
> > +	select HAVE_VMDDEV
> > +	---help---
> > +	  Adds support for the Intel Volume Manage Device (VMD). VMD is
> > +	  a secondary PCI host bridge that allows PCI Express root ports,
> > +	  and devices attached to them, to be removed from the default PCI
> > +	  domain and placed within the VMD domain. If your system provides
> > +	  one of these and has devices attached to it, say "Y".
>
> So what this text does not explain is why does the user care? What
> tangible benefits does this feature offer to users?

Hi Ingo,

The immediate benefit is that devices on VMD domains do not use
resources on the default PCI domain, so we have more than the 256
buses available.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCHv2] x86/pci: Initial commit for new VMD device driver
On Mon, 5 Oct 2015, Ingo Molnar wrote:
> * Keith Busch wrote:
> > The immediate benefit is that devices on VMD domains do not use
> > resources on the default PCI domain, so we have more than the 256
> > buses available.
>
> Would be nice to incorporate that information in the help text and in
> the changelog.

Fair enough. Perhaps we thought the "secondary domain" description
implied that information, but maybe we've been doing this too long. :)

Thanks for the feedback.
Re: [RFC PATCHv2] x86/pci: Initial commit for new VMD device driver
Hi Bjorn, Thanks for the feedback. Most of the issues you mentioned look pretty straightforward to resolve, and we will fix them up for the next revision. I have some immediate follow up comments to two issues you brought up: On Tue, 6 Oct 2015, Bjorn Helgaas wrote: +static int vmd_find_free_domain(void) +{ + int domain = -1; + struct pci_bus *bus = NULL; + + while ((bus = pci_find_next_bus(bus)) != NULL) + domain = max_t(int, domain, pci_domain_nr(bus)); + if (domain < 0) + return -ENODEV; + return domain + 1; +} The PCI core should own domain number allocation. We have a little bit of generic domain support, e.g., pci_get_new_domain_nr(). On x86, I think that is compiled-in, but currently not used -- currently x86 only uses _SEG from ACPI. How would you handle collisions between a domain number assigned here (or via pci_get_new_domain_nr()) and a hot-added host bridge where _SEG uses the same domain? Thank you for bringing this up as I hadn't thought much about it and may have misunderstood the meaning of _SEG. AIUI, it is used to identify a logical grouping. The OS does not need to use the same identifier for the domain, so there's no need to collide if we can assign the domain of a new _SEG to the next available domain_nr. On pci_get_new_domain_nr(), can we make this use something like an ida instead of an atomic? I think we'd like to put domains back into the available pool if someone unbinds it. I imagine someone will test unbinding/rebinding these devices in a loop for a while, and will file a bug after the atomic increments to 64k. :) + resource_list_for_each_entry(entry, &resources) { + struct resource *source, *resource = entry->res; + + if (!i) { + resource->start = 0; + resource->end = (resource_size( + &vmd->dev->resource[0]) >> 20) - 1; + resource->flags = IORESOURCE_BUS | IORESOURCE_PCI_FIXED; I thought BAR0 was CFGBAR. I missed the connection to a bus number aperture.
Right, BAR0 is the CFGBAR and is the device's aperture to access its domain's config space. Based on your comment, I assume that's not a bus resource, though it seems to work with the posted code. :)
Re: [RFC PATCHv2] x86/pci: Initial commit for new VMD device driver
On Tue, 6 Oct 2015, Keith Busch wrote: On Tue, 6 Oct 2015, Bjorn Helgaas wrote: + resource_list_for_each_entry(entry, &resources) { + struct resource *source, *resource = entry->res; + + if (!i) { + resource->start = 0; + resource->end = (resource_size( + &vmd->dev->resource[0]) >> 20) - 1; + resource->flags = IORESOURCE_BUS | IORESOURCE_PCI_FIXED; I thought BAR0 was CFGBAR. I missed the connection to a bus number aperture. Right, BAR0 is the CFGBAR and is the device's aperture to access its domain's config space. It's a new day, I'll try a new explanation on what this is about. The size of the CFGBAR determines how many bus numbers can be reached through the device's config space aperture. We are not setting the bus resource to BAR0; just determining the bus resource ending based on BAR0's size. We expect the BAR to be 256M to access config space for 256 buses: 8 functions * 32 devices * 256 buses * 4k config space per function = 256M If the BAR wasn't provided 256M for any reason, we reduce the number of bus resources this domain can provide.
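The arithmetic above can be checked in a few lines: each bus needs 32 devices * 8 functions * 4 KiB of config space = 1 MiB of CFGBAR, so the reachable bus count is the BAR size shifted right by 20 bits, matching the `resource_size(...) >> 20` expression in the quoted hunk. A minimal sketch, with a helper name of my own choosing rather than the driver's:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Each PCI bus exposes 32 devices x 8 functions, each with 4 KiB of
 * (extended) config space: 32 * 8 * 4096 = 1 MiB of CFGBAR per bus.
 * So the number of buses reachable through the aperture is simply the
 * BAR size in MiB.
 */
static inline unsigned int vmd_cfgbar_buses(uint64_t cfgbar_size)
{
        return (unsigned int)(cfgbar_size >> 20);
}
```

With a 256M BAR this yields 256 buses; a smaller BAR proportionally reduces the bus resource, as described above.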
[RFC PATCH 1/2] x86: PCI bus specific MSI operations
This patch adds struct x86_msi_ops to x86's PCI sysdata. This gives a host bridge driver the option to provide alternate MSI Data Register and MSI-X Table Entry programming for devices in PCI domains that do not subscribe to usual "IOAPIC" format. Signed-off-by: Keith Busch CC: Bryan Veal CC: Dan Williams CC: x...@kernel.org CC: linux-kernel@vger.kernel.org CC: linux-...@vger.kernel.org --- arch/x86/include/asm/pci.h |3 +++ arch/x86/kernel/x86_init.c | 19 +++ 2 files changed, 22 insertions(+) diff --git a/arch/x86/include/asm/pci.h b/arch/x86/include/asm/pci.h index 4625943..98f3802 100644 --- a/arch/x86/include/asm/pci.h +++ b/arch/x86/include/asm/pci.h @@ -20,6 +20,9 @@ struct pci_sysdata { #ifdef CONFIG_X86_64 void*iommu; /* IOMMU private data */ #endif +#ifdef CONFIG_PCI_MSI + struct x86_msi_ops *msi_ops; +#endif }; extern int pci_routeirq; diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c index 3839628..416de84 100644 --- a/arch/x86/kernel/x86_init.c +++ b/arch/x86/kernel/x86_init.c @@ -118,21 +118,40 @@ struct x86_msi_ops x86_msi = { /* MSI arch specific hooks */ int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type) { + struct pci_sysdata *sysdata = dev->bus->sysdata; + + if (sysdata && sysdata->msi_ops && sysdata->msi_ops->setup_msi_irqs) + return sysdata->msi_ops->setup_msi_irqs(dev, nvec, type); return x86_msi.setup_msi_irqs(dev, nvec, type); } void arch_teardown_msi_irqs(struct pci_dev *dev) { + struct pci_sysdata *sysdata = dev->bus->sysdata; + + if (sysdata && sysdata->msi_ops && sysdata->msi_ops->teardown_msi_irqs) + return sysdata->msi_ops->teardown_msi_irqs(dev); x86_msi.teardown_msi_irqs(dev); } void arch_teardown_msi_irq(unsigned int irq) { + struct msi_desc *desc = irq_get_msi_desc(irq); + struct pci_sysdata *sysdata = NULL; + + if (desc) + sysdata = desc->dev->bus->sysdata; + if (sysdata && sysdata->msi_ops && sysdata->msi_ops->teardown_msi_irq) + return sysdata->msi_ops->teardown_msi_irq(irq); 
x86_msi.teardown_msi_irq(irq); } void arch_restore_msi_irqs(struct pci_dev *dev) { + struct pci_sysdata *sysdata = dev->bus->sysdata; + + if (sysdata && sysdata->msi_ops && sysdata->msi_ops->restore_msi_irqs) + return sysdata->msi_ops->restore_msi_irqs(dev); x86_msi.restore_msi_irqs(dev); } #endif -- 1.7.10.4
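The dispatch shape the patch adds to each arch hook reduces to a simple pattern: try the bus-specific ops hanging off sysdata, otherwise fall back to the global x86 ops. A toy model of that pattern, with all types and names simplified and not the kernel's:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reduction of the patch's dispatch pattern. */
struct msi_ops {
        int (*setup)(int nvec);
};

struct sysdata {
        struct msi_ops *msi_ops;        /* NULL for ordinary buses */
};

static int global_setup(int nvec)
{
        (void)nvec;
        return 0;                       /* stand-in for x86_msi.setup_msi_irqs */
}

static struct msi_ops global_ops = { .setup = global_setup };

/* A bus that installed its own ops, e.g. a VMD-owned bus. */
static int vmd_setup(int nvec)
{
        return 1000 + nvec;             /* distinguishable result for the demo */
}

static struct msi_ops vmd_ops = { .setup = vmd_setup };
static struct sysdata vmd_sysdata = { .msi_ops = &vmd_ops };

/* The arch hook: prefer per-bus ops, fall back to the global ones. */
static int setup_msi_irqs(struct sysdata *sysdata, int nvec)
{
        if (sysdata && sysdata->msi_ops && sysdata->msi_ops->setup)
                return sysdata->msi_ops->setup(nvec);
        return global_ops.setup(nvec);
}
```

This is the "ad hoc indirection" Thomas objects to in the follow-up review; the v2 of the driver replaces it with a hierarchical irq domain.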
[RFC PATCH 0/2] Driver for new PCI-e device
The Intel Volume Management Device (VMD) is an integrated endpoint on the platform's PCIe root complex that acts as a host bridge to a secondary PCIe domain. BIOS can reassign one or more root ports to appear within VMD domains instead of the primary domain. This driver enables and enumerates the VMD domain using the root bus configuration interface provided by the PCI subsystem. CC: Bryan Veal CC: Dan Williams CC: x...@kernel.org CC: linux-kernel@vger.kernel.org CC: linux-...@vger.kernel.org Keith Busch (2): x86: PCI bus specific MSI operations x86/pci: Initial commit for new VMD device driver arch/x86/Kconfig | 11 ++ arch/x86/include/asm/pci.h |3 + arch/x86/kernel/x86_init.c | 19 ++ arch/x86/pci/Makefile |2 + arch/x86/pci/vmd.c | 412 5 files changed, 447 insertions(+) create mode 100644 arch/x86/pci/vmd.c -- 1.7.10.4
[RFC PATCH 2/2] x86/pci: Initial commit for new VMD device driver
The Intel Volume Management Device (VMD) is an integrated endpoint on the platform's PCIe root complex that acts as a host bridge to a secondary PCIe domain. BIOS can reassign one or more root ports to appear within the VMD domain instead of the primary domain. This driver enumerates and enables the domain using the root bus configuration interface provided by the PCI subsystem. The driver provides configuration space accessor functions (pci_ops), bus and memory resources, MSI configuration functions, and irq_chip implementation necessary to support the domain through the VMD endpoint's interface. VMD routes I/O as follows: 1) Configuration Space: BAR 0 ("CFGBAR") of VMD provides the base address and size for configuration space register access to VMD-owned root ports. It works similarly to MMCONFIG for extended configuration space. Bus numbering is independent and does not conflict with the primary domain. 2) MMIO Space: BARs 2 and 4 ("MEMBAR1" and "MEMBAR2") of VMD provide the base address, size, and type for MMIO register access. These addresses are not translated by VMD hardware; they are simply reservations to be distributed to root ports' memory base/limit registers and subdivided among devices downstream. 3) DMA: To interact appropriately with IOMMU, the source ID DMA read and write requests are translated to the bus-device-function of the VMD endpoint. Otherwise, DMA operates normally without VMD-specific address translation. 4) Interrupts: Part of VMD's BAR 4 is reserved for VMD's MSI-X Table and PBA. MSIs from VMD domain devices and ports are remapped to appear as if they were issued using one of VMD's MSI-X table entries. Each MSI and MSI-X address of VMD-owned devices and ports has a special format where the address refers to specific entries in VMD's MSI-X table. As with DMA, the interrupt source id is translated to VMD's bus-device-function.
The driver provides its own MSI and MSI-X configuration functions specific to how MSI messages are used within the VMD domain, and it provides an irq_chip for independent IRQ allocation and to relay interrupts from VMD's interrupt handler to the appropriate device driver's handler. 5) Errors: PCIe error messages are intercepted by the root ports normally (e.g. AER), except with VMD, system errors (i.e. firmware first) are disabled by default. AER and hotplug interrupts are translated in the same way as endpoint interrupts. 6) VMD does not support INTx interrupts or IO ports. Devices or drivers requiring these features should either not be placed below VMD-owned root ports, or VMD should be disabled by BIOS for such endpoints. Contributors to this patch include: Artur Paszkiewicz Bryan Veal Jon Derrick Signed-off-by: Keith Busch CC: Bryan Veal CC: Dan Williams CC: x...@kernel.org CC: linux-kernel@vger.kernel.org CC: linux-...@vger.kernel.org --- arch/x86/Kconfig | 11 ++ arch/x86/pci/Makefile |2 + arch/x86/pci/vmd.c| 412 + 3 files changed, 425 insertions(+) create mode 100644 arch/x86/pci/vmd.c diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index b3a1a5d..a8cad59 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -2584,6 +2584,17 @@ config PMC_ATOM def_bool y depends on PCI +config VMDDEV + depends on PCI && PCI_DOMAINS && PCI_MSI + tristate "Volume Management Device Driver" + default N + ---help--- + Adds support for the Intel Volume Management Device (VMD). VMD is + a secondary PCI host bridge that allows PCI Express root ports, + and devices attached to them, to be removed from the default PCI + domain and placed within the VMD domain. If your system provides + one of these, and you have devices attached to it, say "Y".
+ source "net/Kconfig" source "drivers/Kconfig" diff --git a/arch/x86/pci/Makefile b/arch/x86/pci/Makefile index 5c6fc35..c204079 100644 --- a/arch/x86/pci/Makefile +++ b/arch/x86/pci/Makefile @@ -23,6 +23,8 @@ obj-y += bus_numa.o obj-$(CONFIG_AMD_NB) += amd_bus.o obj-$(CONFIG_PCI_CNB20LE_QUIRK)+= broadcom_bus.o +obj-$(CONFIG_VMDDEV) += vmd.o + ifeq ($(CONFIG_PCI_DEBUG),y) EXTRA_CFLAGS += -DDEBUG endif diff --git a/arch/x86/pci/vmd.c b/arch/x86/pci/vmd.c new file mode 100644 index 000..ea8bb4c --- /dev/null +++ b/arch/x86/pci/vmd.c @@ -0,0 +1,412 @@ +/* + * Volume Management Device driver + * Copyright (c) 2015, Intel Corporation. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms and conditions of the GNU General Public License, + * version 2, as published by the Free Software Foundation. + * + * This program is distributed in the hope it will be useful, but WITHOUT + *
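The CFGBAR description in point 1 of the patch ("works similarly to MMCONFIG") implies the standard ECAM layout: 4 KiB of configuration space per function, packed by function, then device, then bus. A sketch of the offset computation under that assumption; the helper is illustrative, since the driver itself goes through pci_ops accessors:

```c
#include <assert.h>
#include <stdint.h>

/*
 * ECAM/MMCONFIG-style offset within a config space aperture such as
 * VMD's CFGBAR: bus selects a 1 MiB slice, device a 32 KiB slice,
 * function a 4 KiB slice, and reg indexes within the function's
 * config space.
 */
static inline uint64_t cfgbar_offset(unsigned int bus, unsigned int dev,
                                     unsigned int fn, unsigned int reg)
{
        return ((uint64_t)bus << 20) | (dev << 15) | (fn << 12) | reg;
}
```

Reading a register for a device in the VMD domain would then be an MMIO access at CFGBAR base plus this offset.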
Re: [RFC PATCH 1/2] x86: PCI bus specific MSI operations
On Fri, 28 Aug 2015, Thomas Gleixner wrote: On Thu, 27 Aug 2015, Keith Busch wrote: This patch adds struct x86_msi_ops to x86's PCI sysdata. This gives a host bridge driver the option to provide alternate MSI Data Register and MSI-X Table Entry programming for devices in PCI domains that do not subscribe to usual "IOAPIC" format. I'm not too fond about more ad hoc indirection and special casing. We should be able to handle this with hierarchical irq domains. Jiang might have an idea how to do that for your case. Thank you for the suggestion, I will take a closer look at this again. All the better if we don't require an arch specific dependency. I asked Jiang about domain hierarchies a few weeks ago and understood these are suited for irq controllers, but the VMD device is an aggregator. Here's a little more h/w info in case it helps steer me in the right direction: VMD muxes all the device interrupts in its PCI domain into one of several VMD h/w irqs. The VMD driver then has to de-mux that into CPU side irqs for each device.
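The mux/demux arrangement described above, with several child interrupts funneled into one VMD hardware IRQ, is essentially a chained handler: the parent reads pending status and invokes each child whose bit is set. A toy model with a simulated pending bitmap; all names, the bitmap register, and the fan-out width are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

#define VMD_MAX_CHILDREN 32

/* Per-child handler table and a counter to observe dispatches. */
static void (*child_handler[VMD_MAX_CHILDREN])(int hwirq);
static int handled[VMD_MAX_CHILDREN];

static void child_irq(int hwirq)
{
        handled[hwirq]++;
}

static void vmd_register_child(int hwirq, void (*fn)(int))
{
        child_handler[hwirq] = fn;
}

/*
 * The "parent" handler for one VMD h/w irq: demux the pending bitmap
 * into the child handlers.  In the real driver this runs in the VMD
 * interrupt handler and relays into each device driver's handler.
 */
static void vmd_parent_irq(uint32_t pending)
{
        for (int i = 0; i < VMD_MAX_CHILDREN; i++)
                if ((pending & (1u << i)) && child_handler[i])
                        child_handler[i](i);
}
```

The hierarchical irq domain approach Thomas suggests ends up expressing exactly this parent/child relationship, which is how the v2 driver is structured.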
[PATCH] Regulator: Suppress compiler warnings
Some compilers complain of possible uninitialized variable usage, like the following: drivers/regulator/helpers.c: In function ‘regulator_get_bypass_regmap’: drivers/regulator/helpers.c:463:16: warning: ‘val’ may be used uninitialized in this function [-Wuninitialized] The code is safe though, and only uses the variables if they were successfully set, so suppressing the warning with uninitialized_var. Signed-off-by: Keith Busch --- drivers/regulator/helpers.c |6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/regulator/helpers.c b/drivers/regulator/helpers.c index 3bbb326..bb5b22a 100644 --- a/drivers/regulator/helpers.c +++ b/drivers/regulator/helpers.c @@ -30,7 +30,7 @@ */ int regulator_is_enabled_regmap(struct regulator_dev *rdev) { - unsigned int val; + unsigned int uninitialized_var(val); int ret; ret = regmap_read(rdev->regmap, rdev->desc->enable_reg, &val); @@ -114,7 +114,7 @@ EXPORT_SYMBOL_GPL(regulator_disable_regmap); */ int regulator_get_voltage_sel_regmap(struct regulator_dev *rdev) { - unsigned int val; + unsigned int uninitialized_var(val); int ret; ret = regmap_read(rdev->regmap, rdev->desc->vsel_reg, &val); @@ -453,7 +453,7 @@ EXPORT_SYMBOL_GPL(regulator_set_bypass_regmap); */ int regulator_get_bypass_regmap(struct regulator_dev *rdev, bool *enable) { - unsigned int val; + unsigned int uninitialized_var(val); int ret; ret = regmap_read(rdev->regmap, rdev->desc->bypass_reg, &val); -- 1.7.10.4
Re: [PATCH] Regulator: Suppress compiler warnings
On Tue, 1 Sep 2015, Mark Brown wrote: On Tue, Sep 01, 2015 at 09:52:13AM +0900, Krzysztof Kozlowski wrote: 2015-09-01 1:41 GMT+09:00 Keith Busch : int regulator_is_enabled_regmap(struct regulator_dev *rdev) { - unsigned int val; + unsigned int uninitialized_var(val); int ret; ret = regmap_read(rdev->regmap, rdev->desc->enable_reg, &val); This is quite common pattern so such work-around should be added to many other functions leading to code obfuscation. Which compiler do you have in mind? Right, plus this will shut up valid compiler warnings which is poor practice anyway. I'd say this is a bug in the compiler. Using gcc 4.7.2 with '-Os'. The warning does not happen when that option is not used, i.e. disable CONFIG_CC_OPTIMIZE_FOR_SIZE. I will certainly try other gcc versions with the same config and see what happens.
Re: [RFC PATCHv2] x86/pci: Initial commit for new VMD device driver
On Tue, 6 Oct 2015, Bjorn Helgaas wrote: +static int __init vmd_init(void) +{ + return pci_register_driver(&vmd_drv); +} +module_init(vmd_init); module_pci_driver(vmd_drv)? We actually only have a module_init in this driver, and purposely left out module_exit. We don't want to be able to unload this because we can't reference count this module for modules depending on it. 'rmmod ' returns busy if you're actively using a device that module is driving, but the end devices VMD provides are not driven by the VMD driver; it'd be kind of like being able to unload the pci core driver.
Re: [PATCH v6 05/11] block: remove split code in blkdev_issue_{discard,write_same}
On Wed, 14 Oct 2015, Christoph Hellwig wrote: Analysis and tentative fix below: blktrace for before the commit: 259,012 0.02543 2394 G D 0 + 8388607 [mkfs.xfs] 259,013 0.08230 2394 I D 0 + 8388607 [mkfs.xfs] 259,014 0.31090 207 D D 0 + 8388607 [kworker/1:1H] 259,015 0.44869 2394 Q D 8388607 + 8388607 [mkfs.xfs] 259,016 0.45992 2394 G D 8388607 + 8388607 [mkfs.xfs] 259,017 0.49559 2394 I D 8388607 + 8388607 [mkfs.xfs] 259,018 0.61551 207 D D 8388607 + 8388607 [kworker/1:1H] .. and so on. blktrace with the commit: 259,021 0.0 1228 Q D 0 + 4194304 [mkfs.xfs] 259,022 0.02543 1228 G D 0 + 4194304 [mkfs.xfs] 259,023 0.10080 1228 I D 0 + 4194304 [mkfs.xfs] 259,024 0.82187 267 D D 0 + 4194304 [kworker/2:1H] 259,025 0.000224869 1228 Q D 4194304 + 4194304 [mkfs.xfs] 259,026 0.000225835 1228 G D 4194304 + 4194304 [mkfs.xfs] 259,027 0.000229457 1228 I D 4194304 + 4194304 [mkfs.xfs] 259,028 0.000238507 267 D D 4194304 + 4194304 [kworker/2:1H] So discards are smaller, but better aligned. Now if I tweak a single line in blk-lib.c to be able to use all of bi_size I get the old I/O pattern back and everything works fine again: I see why the proposal is an improvement, but I don't understand why the current situation results in a hang. Are we missing some kind of error recovery in the driver?
Re: [PATCH] blk-mq: fix use-after-free in blk_mq_free_tag_set()
On Tue, 13 Oct 2015, Junichi Nomura wrote: tags is freed in blk_mq_free_rq_map() and should not be used after that. The problem doesn't manifest if CONFIG_CPUMASK_OFFSTACK is false because free_cpumask_var() is nop. tags->cpumask is allocated in blk_mq_init_tags() so it's natural to free cpumask in its counter part, blk_mq_free_tags(). Thanks for the fix. Reviewed-by: Keith Busch Fixes: f26cdc8536ad ("blk-mq: Shared tag enhancements") Signed-off-by: Jun'ichi Nomura Cc: Keith Busch Cc: Jens Axboe
[PATCH] blk-integrity: empty imlpementation when disabled
This patch moves the blk_integrity_payload definition outside the CONFIG_BLK_DEV_INTEGRITY dependency and provides empty function implementations when the kernel configuration disables integrity extensions. This simplifies drivers that make use of these to map user data so they don't need to repeat the same configuration checks. Signed-off-by: Keith Busch --- include/linux/bio.h | 32 ++-- 1 file changed, 22 insertions(+), 10 deletions(-) diff --git a/include/linux/bio.h b/include/linux/bio.h index b9b6e04..f0c46d0 100644 --- a/include/linux/bio.h +++ b/include/linux/bio.h @@ -318,16 +318,6 @@ enum bip_flags { BIP_IP_CHECKSUM = 1 << 4, /* IP checksum */ }; -#if defined(CONFIG_BLK_DEV_INTEGRITY) - -static inline struct bio_integrity_payload *bio_integrity(struct bio *bio) -{ - if (bio->bi_rw & REQ_INTEGRITY) - return bio->bi_integrity; - - return NULL; -} - /* * bio integrity payload */ @@ -349,6 +339,16 @@ struct bio_integrity_payload { struct bio_vec bip_inline_vecs[0];/* embedded bvec array */ }; +#if defined(CONFIG_BLK_DEV_INTEGRITY) + +static inline struct bio_integrity_payload *bio_integrity(struct bio *bio) +{ + if (bio->bi_rw & REQ_INTEGRITY) + return bio->bi_integrity; + + return NULL; +} + static inline bool bio_integrity_flagged(struct bio *bio, enum bip_flags flag) { struct bio_integrity_payload *bip = bio_integrity(bio); @@ -795,6 +795,18 @@ static inline bool bio_integrity_flagged(struct bio *bio, enum bip_flags flag) return false; } +static inline void *bio_integrity_alloc(struct bio * bio, gfp_t gfp, + unsigned int nr) +{ + return NULL; +} + +static inline int bio_integrity_add_page(struct bio *bio, struct page *page, + unsigned int len, unsigned int offset) +{ + return 0; +} + #endif /* CONFIG_BLK_DEV_INTEGRITY */ #endif /* CONFIG_BLOCK */ -- 1.7.10.4
[PATCHv2] blk-integrity: empty implementation when disabled
This patch moves the blk_integrity_payload definition outside the CONFIG_BLK_DEV_INTEGRITY dependency and provides empty function implementations when the kernel configuration disables integrity extensions. This simplifies drivers that make use of these to map user data so they don't need to repeat the same configuration checks. Signed-off-by: Keith Busch --- v1 -> v2: Fixed corrupted patch and subject line spelling error include/linux/bio.h | 32 ++-- 1 file changed, 22 insertions(+), 10 deletions(-) diff --git a/include/linux/bio.h b/include/linux/bio.h index b9b6e04..f0c46d0 100644 --- a/include/linux/bio.h +++ b/include/linux/bio.h @@ -318,16 +318,6 @@ enum bip_flags { BIP_IP_CHECKSUM = 1 << 4, /* IP checksum */ }; -#if defined(CONFIG_BLK_DEV_INTEGRITY) - -static inline struct bio_integrity_payload *bio_integrity(struct bio *bio) -{ - if (bio->bi_rw & REQ_INTEGRITY) - return bio->bi_integrity; - - return NULL; -} - /* * bio integrity payload */ @@ -349,6 +339,16 @@ struct bio_integrity_payload { struct bio_vec bip_inline_vecs[0];/* embedded bvec array */ }; +#if defined(CONFIG_BLK_DEV_INTEGRITY) + +static inline struct bio_integrity_payload *bio_integrity(struct bio *bio) +{ + if (bio->bi_rw & REQ_INTEGRITY) + return bio->bi_integrity; + + return NULL; +} + static inline bool bio_integrity_flagged(struct bio *bio, enum bip_flags flag) { struct bio_integrity_payload *bip = bio_integrity(bio); @@ -795,6 +795,18 @@ static inline bool bio_integrity_flagged(struct bio *bio, enum bip_flags flag) return false; } +static inline void *bio_integrity_alloc(struct bio * bio, gfp_t gfp, + unsigned int nr) +{ + return NULL; +} + +static inline int bio_integrity_add_page(struct bio *bio, struct page *page, + unsigned int len, unsigned int offset) +{ + return 0; +} + #endif /* CONFIG_BLK_DEV_INTEGRITY */ #endif /* CONFIG_BLOCK */ -- 1.7.10.4
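The structure of the patch above is the usual config-stub idiom: call sites compile unconditionally, and the disabled configuration provides harmless inline stubs. A reduced, compilable toy of that shape; `CONFIG_TOY_INTEGRITY` stands in for `CONFIG_BLK_DEV_INTEGRITY`, and the enabled-side helpers live elsewhere in the real tree:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct bio. */
struct bio {
        void *integrity;
};

#if defined(CONFIG_TOY_INTEGRITY)
/* Real implementation would live here (or be declared extern). */
static inline void *bio_integrity(struct bio *bio)
{
        return bio->integrity;
}
#else
/* Feature compiled out: callers see "no payload" and "nothing added",
 * so they need no #ifdefs of their own. */
static inline void *bio_integrity(struct bio *bio)
{
        (void)bio;
        return NULL;
}

static inline int bio_integrity_add_page(struct bio *bio, void *page)
{
        (void)bio;
        (void)page;
        return 0;
}
#endif
```

A driver mapping user data can then call `bio_integrity()` and branch on NULL in one code path, which is exactly the simplification the commit message describes.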
[RFC PATCHv2] x86/pci: Initial commit for new VMD device driver
The Intel Volume Management Device (VMD) is an integrated endpoint on the platform's PCIe root complex that acts as a host bridge to a secondary PCIe domain. BIOS can reassign one or more root ports to appear within a VMD domain instead of the primary domain. This driver enumerates and enables the domain using the root bus configuration interface provided by the PCI subsystem. The driver provides configuration space accessor functions (pci_ops), bus and memory resources, a chained MSI irq domain, irq_chip implementation, and dma operations necessary to support the domain through the VMD endpoint's interface. VMD routes I/O as follows: 1) Configuration Space: BAR 0 ("CFGBAR") of VMD provides the base address and size for configuration space register access to VMD-owned root ports. It works similarly to MMCONFIG for extended configuration space. Bus numbering is independent and does not conflict with the primary domain. 2) MMIO Space: BARs 2 and 4 ("MEMBAR1" and "MEMBAR2") of VMD provide the base address, size, and type for MMIO register access. These addresses are not translated by VMD hardware; they are simply reservations to be distributed to root ports' memory base/limit registers and subdivided among devices downstream. 3) DMA: To interact appropriately with IOMMU, the source ID DMA read and write requests are translated to the bus-device-function of the VMD endpoint. Otherwise, DMA operates normally without VMD-specific address translation. 4) Interrupts: Part of VMD's BAR 4 is reserved for VMD's MSI-X Table and PBA. MSIs from VMD domain devices and ports are remapped to appear as if they were issued using one of VMD's MSI-X table entries. Each MSI and MSI-X address of VMD-owned devices and ports has a special format where the address refers to specific entries in VMD's MSI-X table. As with DMA, the interrupt source id is translated to VMD's bus-device-function.
The driver provides its own MSI and MSI-X configuration functions specific to how MSI messages are used within the VMD domain, and it provides an irq_chip for independent IRQ allocation and to relay interrupts from VMD's interrupt handler to the appropriate device driver's handler. 5) Errors: PCIe error messages are intercepted by the root ports normally (e.g. AER), except with VMD, system errors (i.e. firmware first) are disabled by default. AER and hotplug interrupts are translated in the same way as endpoint interrupts. 6) VMD does not support INTx interrupts or IO ports. Devices or drivers requiring these features should either not be placed below VMD-owned root ports, or VMD should be disabled by BIOS for such endpoints. Contributors to this patch include: Artur Paszkiewicz Bryan Veal Jon Derrick Signed-off-by: Keith Busch --- v1 -> v2: The original RFC used custom x86_msi_ops to provide the VMD device specific interrupt setup. This was rejected in favor of a chained irq domain hierarchy, so this version provides that. While it tests out successfully in the limited capacity that I can test this, I honestly don't understand completely how this works, so thank you to Jiang Liu for the guidance! Perhaps I'm missing a callback, but I don't see how the driver can limit the number of irqs requested with the irq domain way. The allocation is done one at a time instead of at once, so the driver doesn't know at this level how many were originally requested. This isn't terrible as I can circle the irqs back to the beginning if they exceed VMD's MSI-X count. This version includes the DMA operations required if an IOMMU is used. That feature was omitted from the original RFC. The dma operations are set via a PCI "fixup" if the device is in a VMD provided domain. All this created a larger in-kernel dependency than before, and it is submitted as a single patch instead of a short series since it is all specific to this driver.
arch/x86/Kconfig | 15 ++ arch/x86/include/asm/vmd.h | 39 +++ arch/x86/kernel/apic/msi.c | 74 ++ arch/x86/pci/Makefile |2 + arch/x86/pci/vmd.c | 594 kernel/irq/chip.c |1 + kernel/irq/irqdomain.c |3 + 7 files changed, 728 insertions(+) create mode 100644 arch/x86/include/asm/vmd.h create mode 100644 arch/x86/pci/vmd.c diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 328c835..73df607 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -2631,6 +2631,21 @@ config PMC_ATOM def_bool y depends on PCI +config HAVE_VMDDEV + bool + +config VMDDEV + depends on PCI && PCI_DOMAINS && PCI_MSI && GENERIC_MSI_IRQ_DOMAIN && IRQ_DOMAIN_HIERARCHY + tristate "Volume Management Device Driver"
Re: [PATCH V2] nvme-pci: move nvme_kill_queues to nvme_remove_dead_ctrl
On Wed, Jun 20, 2018 at 01:42:22PM +0800, Jianchao Wang wrote: > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c > index fc33804..73a97fc 100644 > --- a/drivers/nvme/host/pci.c > +++ b/drivers/nvme/host/pci.c > @@ -2289,6 +2289,7 @@ static void nvme_remove_dead_ctrl(struct nvme_dev *dev, > int status) > > nvme_get_ctrl(&dev->ctrl); > nvme_dev_disable(dev, false); > + nvme_kill_queues(&dev->ctrl); > if (!queue_work(nvme_wq, &dev->remove_work)) > nvme_put_ctrl(&dev->ctrl); > } > @@ -2405,7 +2406,6 @@ static void nvme_remove_dead_ctrl_work(struct > work_struct *work) > struct nvme_dev *dev = container_of(work, struct nvme_dev, remove_work); > struct pci_dev *pdev = to_pci_dev(dev->dev); > > - nvme_kill_queues(&dev->ctrl); > if (pci_get_drvdata(pdev)) > device_release_driver(&pdev->dev); > nvme_put_ctrl(&dev->ctrl); Looks good to me. Reviewed-by: Keith Busch
Re: [PATCH 2/2] PCI: NVMe device specific reset quirk
On Mon, Jul 23, 2018 at 04:24:31PM -0600, Alex Williamson wrote: > Take advantage of NVMe devices using a standard interface to quiesce > the controller prior to reset, including device specific delays before > and after that reset. This resolves several NVMe device assignment > scenarios with two different vendors. The Intel DC P3700 controller > has been shown to only work as a VM boot device on the initial VM > startup, failing after reset or reboot, and also fails to initialize > after hot-plug into a VM. Adding a delay after FLR resolves these > cases. The Samsung SM961/PM961 (960 EVO) sometimes fails to return > from FLR with the PCI config space reading back as -1. A reproducible > instance of this behavior is resolved by clearing the enable bit in > the configuration register and waiting for the ready status to clear > (disabling the NVMe controller) prior to FLR. > > As all NVMe devices make use of this standard interface and the NVMe > specification also requires PCIe FLR support, we can apply this quirk > to all devices with matching class code. Shouldn't this go in the nvme driver's reset_prepare/reset_done callbacks?
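The disable-before-FLR sequence the quirk performs can be sketched against a simulated register file rather than real MMIO: clear CC.EN, then poll until CSTS.RDY drops. The register offsets and bits follow the NVMe specification (CC at 0x14, CSTS at 0x1C, EN and RDY both bit 0); the in-loop device model below is a stand-in for hardware, not driver code:

```c
#include <assert.h>
#include <stdint.h>

#define NVME_REG_CC    (0x14 / 4)       /* Controller Configuration */
#define NVME_REG_CSTS  (0x1c / 4)       /* Controller Status */
#define NVME_CC_EN     (1u << 0)
#define NVME_CSTS_RDY  (1u << 0)

/*
 * Disable the controller prior to FLR: clear CC.EN, then wait for
 * CSTS.RDY to clear.  Returns 0 on success, -1 on timeout.
 */
static int nvme_disable(volatile uint32_t *regs, int max_polls)
{
        regs[NVME_REG_CC] &= ~NVME_CC_EN;
        for (int i = 0; i < max_polls; i++) {
                /* Toy device model: RDY tracks EN immediately.  Real
                 * hardware may take up to CAP.TO to transition. */
                regs[NVME_REG_CSTS] = regs[NVME_REG_CC] & NVME_CC_EN;
                if (!(regs[NVME_REG_CSTS] & NVME_CSTS_RDY))
                        return 0;
        }
        return -1;
}
```

Only after this sequence completes does the quirk issue the FLR, which is what avoids the SM961/PM961 failure mode where config space reads back as -1.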
Re: Report long suspend times of NVMe devices (mostly firmware/device issues)
On Wed, Jan 24, 2018 at 11:29:12PM +0100, Paul Menzel wrote: > Am 22.01.2018 um 22:30 schrieb Keith Busch: > > The nvme spec guides toward longer times than that. I don't see the > > point of warning users about things operating within spec. > > I quickly glanced over NVM Express revision 1.3 specification [1] but > searching for *second *, I could not find something about this. Could you > please point me to the section? Section 7.6.2: It is recommended that the host wait a minimum of the RTD3 Entry Latency reported in the Identify Controller data structure for the shutdown operations to complete; if the value reported in RTD3 Entry Latency is 0h, then the host should wait for a minimum of one second. So if the controller is new enough, it will report its RTD3 Entry Latency. The maximum allowed by spec is something like 4000 seconds. For controllers that pre-date this field, we're supposed to wait a "minimum" of one second. The spec does not recommend a maximum time in either case. > In my opinion, it’s a good thing to point users to devices holding up > suspend. If a device shutdown exceeds its reported constraints, then absolutely, and we do log a warning for that already. Picking an arbitrary time below spec recommendations is just guaranteed to create alarmed people demanding answers to a warning about behavior that is entirely normal.
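The quoted rule from section 7.6.2 reduces to a simple timeout calculation: use the RTD3 Entry Latency when the controller reports one (the Identify Controller RTD3E field, reported in microseconds), otherwise fall back to one second. A sketch with an illustrative helper name:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Suggested shutdown wait per NVMe 1.3 section 7.6.2, returned in
 * milliseconds (rounded up).  rtd3e_us is the Identify Controller
 * RTD3E field, in microseconds; 0 means the field is unreported.
 */
static inline uint64_t nvme_shutdown_timeout_ms(uint32_t rtd3e_us)
{
        if (!rtd3e_us)
                return 1000;                        /* spec minimum: one second */
        return ((uint64_t)rtd3e_us + 999) / 1000;   /* round up to whole ms */
}
```

Note the 32-bit field's maximum, 0xFFFFFFFF microseconds, works out to roughly 4295 seconds, which is the "something like 4000 seconds" ceiling mentioned above.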
Re: [PATCH RESENT] nvme-pci: introduce RECONNECTING state to mark initializing procedure
On Mon, Jan 22, 2018 at 10:03:16PM +0800, Jianchao Wang wrote: > After Sagi's commit (nvme-rdma: fix concurrent reset and reconnect), > both nvme-fc/rdma have following pattern: > RESETTING- quiesce blk-mq queues, teardown and delete queues/ >connections, clear out outstanding IO requests... > RECONNECTING - establish new queues/connections and some other >initializing things. > Introduce RECONNECTING to nvme-pci transport to do the same mark. > Then we get a coherent state definition among nvme pci/rdma/fc > transports. > > Suggested-by: James Smart > Reviewed-by: James Smart > Signed-off-by: Jianchao Wang This looks fine. Thank you for your patience. Reviewed-by: Keith Busch
Re: [PATCH] nvme-pci: calculate iod and avg_seg_size just before use them
On Thu, Jan 11, 2018 at 01:09:39PM +0800, Jianchao Wang wrote: > The calculation of iod and avg_seg_size maybe meaningless if > nvme_pci_use_sgls returns before uses them. So calculate > just before use them. The compiler will do the right thing here, but I see what you mean. I think Christoph has some SGL improvements coming, though, that will obviate this.
Re: ASPM powersupersave change NVMe SSD Samsung 960 PRO capacity to 0 and read-only
On Thu, Jan 11, 2018 at 06:50:40PM +0100, Maik Broemme wrote:
> I've re-run the test with 4.15rc7.r111.g5f615b97cdea and the following
> patches from Keith:
>
> [PATCH 1/4] PCI/AER: Return approrpiate value when AER is not supported
> [PATCH 2/4] PCI/AER: Provide API for getting AER information
> [PATCH 3/4] PCI/DPC: Enable DPC in conjuction with AER
> [PATCH 4/4] PCI/DPC: Print AER status in DPC event handling
>
> The issue is still the same. Additionally to the output before I see now:
>
> Jan 11 18:34:45 server.theraso.int kernel: dpc :00:10.0:pcie010: DPC containment event, status:0x1f09 source:0x
> Jan 11 18:34:45 server.theraso.int kernel: dpc :00:10.0:pcie010: DPC unmasked uncorrectable error detected, remove downstream devices
> Jan 11 18:34:45 server.theraso.int kernel: pcieport :00:10.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0080(Receiver ID)
> Jan 11 18:34:45 server.theraso.int kernel: pcieport :00:10.0: device [8086:19aa] error status/mask=0020/
> Jan 11 18:34:45 server.theraso.int kernel: pcieport :00:10.0:[ 5] Surprise Down Error(First)
> Jan 11 18:34:46 server.theraso.int kernel: nvme0n1: detected capacity change from 1024209543168 to 0

Okay, so that series wasn't going to fix anything, but at least it gets some visibility into what's happened. The DPC was triggered due to a Surprise Down uncorrectable error, so the power setting is causing the link to fail. The NVMe driver has quirks specifically for this vendor's devices to fence off NVMe-specific automated power settings. Your observations appear to align with the same issues.
Re: [PATCH] nvme-pci: use NOWAIT flag for nvme_set_host_mem
On Tue, Jan 30, 2018 at 11:41:07AM +0800, jianchao.wang wrote: > Another point that confuses me is that whether nvme_set_host_mem is necessary > in nvme_dev_disable ? > As the comment: > > /* >* If the controller is still alive tell it to stop using the >* host memory buffer. In theory the shutdown / reset should >* make sure that it doesn't access the host memoery anymore, >* but I'd rather be safe than sorry.. >*/ > if (dev->host_mem_descs) > nvme_set_host_mem(dev, 0); > > It is to avoid accessing to host memory from controller. But the host memory > is just > freed when nvme_remove. It looks like we just need to do this in nvme_remove. > For example: > - > @@ -2553,6 +2545,14 @@ static void nvme_remove(struct pci_dev *pdev) > flush_work(&dev->ctrl.reset_work); > nvme_stop_ctrl(&dev->ctrl); > nvme_remove_namespaces(&dev->ctrl); > + /* > +* If the controller is still alive tell it to stop using the > +* host memory buffer. In theory the shutdown / reset should > +* make sure that it doesn't access the host memoery anymore, > +* but I'd rather be safe than sorry.. > +*/ > + if (dev->host_mem_descs) > + nvme_set_host_mem(dev, 0); > nvme_dev_disable(dev, true); > nvme_free_host_mem(dev); > > > If anything missed, please point out. > That's really appreciated. I don't make any such controller, so I can't speak to the necessity of having this here. I just think having it in the controller disabling path is for safety. We want the controller to acknowledge that it has completed any use of the HMB before we make it so it can't access it.
Re: [PATCH] PCI/DPC: Fix INT legacy interrupt in dpc_irq
On Wed, Jan 31, 2018 at 09:48:55PM +0530, Oza Pawandeep wrote: > Current dpc driver acknowledge the interrupt in deferred work, which works > okay since LPI are edge triggered. > But when RP does not have MSI support, port service driver falls back to > legacy GIC SPI interrupts, and with current code we do not acknowledge the > interrupt and we get dpc interrupt storm. > This patch acknowledges the interrupt in interrupt handler. > > Signed-off-by: Oza Pawandeep Thanks, looks good to me. Reviewed-by: Keith Busch
Re: [PATCH v5 4/4] PCI/DPC: Enumerate the devices after DPC trigger event
On Thu, Jan 18, 2018 at 11:35:59AM -0500, Sinan Kaya wrote:
> On 1/18/2018 12:32 AM, p...@codeaurora.org wrote:
> > On 2018-01-18 08:26, Keith Busch wrote:
> >> On Wed, Jan 17, 2018 at 08:27:39AM -0800, Sinan Kaya wrote:
> >>> On 1/17/2018 5:37 AM, Oza Pawandeep wrote:
> >>> > +static bool dpc_wait_link_active(struct pci_dev *pdev)
> >>> > +{
> >>>
> >>> I think you can also make this function common instead of making another
> >>> copy here. Of course, this would be another patch.
> >>
> >> It is actually very similar to __pcie_wait_link_active in pciehp_hpc.c,
> >> so there's some opportunity to make even more common code.
> >
> > in that case there has to be a generic function in drivers/pci.c
> > which addresses the following functions:
> >
> > pcie-dpc.c:
> >   dpc_wait_link_inactive
> >   dpc_wait_link_active
> >
> > drivers/pci/hotplug/pciehp_hpc.c:
> >   pcie_wait_link_active
> >
> > all the above making one generic function to be moved to drivers/pci.c
> >
> > please let me know if this is okay.
>
> Works for me. Keith/Bjorn?

Yep, I believe common solutions that reduce code are always encouraged in the Linux kernel.
Re: [PATCH V5 2/2] nvme-pci: fixup the timeout case when reset is ongoing
On Thu, Jan 18, 2018 at 06:10:02PM +0800, Jianchao Wang wrote:
> +	 * - When the ctrl.state is NVME_CTRL_RESETTING, the expired
> +	 *   request should come from the previous work and we handle
> +	 *   it as nvme_cancel_request.
> +	 * - When the ctrl.state is NVME_CTRL_RECONNECTING, the expired
> +	 *   request should come from the initializing procedure such as
> +	 *   setup io queues, because all the previous outstanding
> +	 *   requests should have been cancelled.
> 	 */
> -	if (dev->ctrl.state == NVME_CTRL_RESETTING) {
> -		dev_warn(dev->ctrl.device,
> -			 "I/O %d QID %d timeout, disable controller\n",
> -			 req->tag, nvmeq->qid);
> -		nvme_dev_disable(dev, false);
> +	switch (dev->ctrl.state) {
> +	case NVME_CTRL_RESETTING:
> +		nvme_req(req)->status = NVME_SC_ABORT_REQ;
> +		return BLK_EH_HANDLED;
> +	case NVME_CTRL_RECONNECTING:
> +		WARN_ON_ONCE(nvmeq->qid);
> 		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
> 		return BLK_EH_HANDLED;
> +	default:
> +		break;
> 	}

The driver may be giving up on the command here, but that doesn't mean the controller has. We can't just end the request like this because that will release the memory the controller still owns. We must wait until after nvme_dev_disable clears bus master because we can't say for sure the controller isn't going to write to that address right after we end the request.
Re: [PATCH V5 2/2] nvme-pci: fixup the timeout case when reset is ongoing
On Fri, Jan 19, 2018 at 01:55:29PM +0800, jianchao.wang wrote: > On 01/19/2018 12:59 PM, Keith Busch wrote: > > On Thu, Jan 18, 2018 at 06:10:02PM +0800, Jianchao Wang wrote: > >> + * - When the ctrl.state is NVME_CTRL_RESETTING, the expired > >> + * request should come from the previous work and we handle > >> + * it as nvme_cancel_request. > >> + * - When the ctrl.state is NVME_CTRL_RECONNECTING, the expired > >> + * request should come from the initializing procedure such as > >> + * setup io queues, because all the previous outstanding > >> + * requests should have been cancelled. > >> */ > >> - if (dev->ctrl.state == NVME_CTRL_RESETTING) { > >> - dev_warn(dev->ctrl.device, > >> - "I/O %d QID %d timeout, disable controller\n", > >> - req->tag, nvmeq->qid); > >> - nvme_dev_disable(dev, false); > >> + switch (dev->ctrl.state) { > >> + case NVME_CTRL_RESETTING: > >> + nvme_req(req)->status = NVME_SC_ABORT_REQ; > >> + return BLK_EH_HANDLED; > >> + case NVME_CTRL_RECONNECTING: > >> + WARN_ON_ONCE(nvmeq->qid); > >>nvme_req(req)->flags |= NVME_REQ_CANCELLED; > >>return BLK_EH_HANDLED; > >> + default: > >> + break; > >>} > > > > The driver may be giving up on the command here, but that doesn't mean > > the controller has. We can't just end the request like this because that > > will release the memory the controller still owns. We must wait until > > after nvme_dev_disable clears bus master because we can't say for sure > > the controller isn't going to write to that address right after we end > > the request. > > > Yes, but the controller is going to be reseted or shutdown at the moment, > even if the controller accesses a bad address and goes wrong, everything will > be ok after reset or shutdown. :) Hm, I don't follow. DMA access after free is never okay.
Re: [PATCH V5 0/2] nvme-pci: fix the timeout case when reset is ongoing
On Thu, Jan 18, 2018 at 06:10:00PM +0800, Jianchao Wang wrote:
> Hello
>
> Please consider the following scenario.
>
> nvme_reset_ctrl
>     -> set state to RESETTING
>     -> queue reset_work
>     (scheduling)
> nvme_reset_work
>     -> nvme_dev_disable
>         -> quiesce queues
>         -> nvme_cancel_request
>            on outstanding requests
> ---_boundary_
>     -> nvme initializing (issue request on adminq)
>
> Before the _boundary_, we not only quiesce the queues, but also cancel
> all the outstanding requests.
>
> A request could expire when the ctrl state is RESETTING.
> - If the timeout occurs before the _boundary_, the expired requests
>   are from the previous work.
> - Otherwise, the expired requests are from the controller initializing
>   procedure, such as sending cq/sq create commands to adminq to setup
>   io queues.
>
> In the current implementation, nvme_timeout cannot identify the _boundary_
> so it only handles the second case above.

Bear with me a moment, as I'm only just now getting a real chance to look at this, and I'm not quite sure I follow what problem this is solving. The nvme_dev_disable routine makes forward progress without depending on timeout handling to complete expired commands. Once controller disabling completes, there can't possibly be any started requests that can expire. So we don't need nvme_timeout to do anything for requests above the boundary.
Re: [PATCH V5 0/2] nvme-pci: fix the timeout case when reset is ongoing
On Fri, Jan 19, 2018 at 04:14:02PM +0800, jianchao.wang wrote:
> On 01/19/2018 04:01 PM, Keith Busch wrote:
> > The nvme_dev_disable routine makes forward progress without depending on
> > timeout handling to complete expired commands. Once controller disabling
> > completes, there can't possibly be any started requests that can expire.
> > So we don't need nvme_timeout to do anything for requests above the
> > boundary.
>
> Yes, once controller disabling completes, any started requests will be
> handled and cannot expire.
> But before the _boundary_, there could be a nvme_timeout context that runs
> with nvme_dev_disable in parallel.
> If a timeout path grabs a request, then nvme_dev_disable cannot get and
> cancel it.
> So even though the nvme_dev_disable completes, there still could be a request
> in nvme_timeout context.
>
> The worst case is:
>
>   nvme_timeout                         nvme_reset_work
>   if (ctrl->state == RESETTING )       nvme_dev_disable
>       nvme_dev_disable                 initializing procedure
>
> the nvme_dev_disable runs with the reinit procedure in nvme_reset_work in
> parallel.

Okay, I see what you're saying. That's a pretty obscure case, as normally we enter nvme_reset_work with the queues already disabled, but there are a few cases where we need nvme_reset_work to handle that. But if we are in that case, I think we really just want to sync the queues. What do you think of this?
---
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index fde6fd2e7eef..516383193416 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3520,6 +3520,17 @@ void nvme_stop_queues(struct nvme_ctrl *ctrl)
 }
 EXPORT_SYMBOL_GPL(nvme_stop_queues);
 
+void nvme_sync_queues(struct nvme_ctrl *ctrl)
+{
+	struct nvme_ns *ns;
+
+	mutex_lock(&ctrl->namespaces_mutex);
+	list_for_each_entry(ns, &ctrl->namespaces, list)
+		blk_sync_queue(ns->queue);
+	mutex_unlock(&ctrl->namespaces_mutex);
+}
+EXPORT_SYMBOL_GPL(nvme_sync_queues);
+
 void nvme_start_queues(struct nvme_ctrl *ctrl)
 {
 	struct nvme_ns *ns;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 8e7fc1b041b7..45b1b8ceddb6 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -374,6 +374,7 @@ void nvme_complete_async_event(struct nvme_ctrl *ctrl, __le16 status,
 void nvme_stop_queues(struct nvme_ctrl *ctrl);
 void nvme_start_queues(struct nvme_ctrl *ctrl);
+void nvme_sync_queues(struct nvme_ctrl *ctrl);
 void nvme_kill_queues(struct nvme_ctrl *ctrl);
 void nvme_unfreeze(struct nvme_ctrl *ctrl);
 void nvme_wait_freeze(struct nvme_ctrl *ctrl);
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index a2ffb557b616..1fe00be22ad1 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2289,8 +2289,10 @@ static void nvme_reset_work(struct work_struct *work)
 	 * If we're called to reset a live controller first shut it down before
 	 * moving on.
 	 */
-	if (dev->ctrl.ctrl_config & NVME_CC_ENABLE)
+	if (dev->ctrl.ctrl_config & NVME_CC_ENABLE) {
 		nvme_dev_disable(dev, false);
+		nvme_sync_queues(&dev->ctrl);
+	}
 
 	result = nvme_pci_enable(dev);
 	if (result)
--
Re: [PATCH V5 0/2] nvme-pci: fix the timeout case when reset is ongoing
On Fri, Jan 19, 2018 at 05:02:06PM +0800, jianchao.wang wrote: > We should not use blk_sync_queue here, the requeue_work and run_work will be > canceled. > Just flush_work(&q->timeout_work) should be ok. I agree flushing timeout_work is sufficient. All the other work had already better not be running either, so it doesn't hurt to call the sync API. > In addition, we could check NVME_CC_ENABLE in nvme_dev_disable to avoid > redundant invoking. > :) That should already be inferred through reading back the CSTS register.
Re: [PATCH V5 0/2] nvme-pci: fix the timeout case when reset is ongoing
On Fri, Jan 19, 2018 at 09:56:48PM +0800, jianchao.wang wrote: > In nvme_dev_disable, the outstanding requests will be requeued finally. > I'm afraid the requests requeued on the q->requeue_list will be blocked until > another requeue > occurs, if we cancel the requeue work before it get scheduled. We should kick the request list in nvme_start_queues.
Re: Report long suspend times of NVMe devices (mostly firmware/device issues)
On Mon, Jan 22, 2018 at 10:02:12PM +0100, Paul Menzel wrote: > Dear Linux folks, > > > Benchmarking the ACPI S3 suspend and resume times with `sleepgraph.py > -config config/suspend-callgraph.cfg` [1], shows that the NVMe disk SAMSUNG > MZVKW512HMJP-0 in the TUXEDO Book BU1406 takes between 0.3 and 1.4 > seconds, holding up the suspend cycle. > > The time is spent in `nvme_shutdown_ctrl()`. > > ### Linux 4.14.1-041401-generic > > > nvme @ :04:00.0 {nvme} async_device (Total Suspend: 1439.299 ms Total > > Resume: 19.865 ms) > > ### Linux 4.15-rc9 > > > nvme @ :04:00.0 {nvme} async_device (Total Suspend: 362.239 ms Total > > Resume: 19.897 m > It’d be useful, if the Linux kernel logged such issues visibly to the user, > so that the hardware manufacturer can be contacted to fix the device > (probably the firmware). > > In my opinion anything longer than 200 ms should be reported similar to [2], > and maybe worded like below. > > > NVMe took more than 200 ms to do suspend routine > > What do you think? The nvme spec guides toward longer times than that. I don't see the point of warning users about things operating within spec.
Re: [PATCH] nvme-pci: ensure nvme_timeout complete before initializing procedure
On Mon, Jan 22, 2018 at 09:14:23PM +0100, Christoph Hellwig wrote:
> > Link: https://lkml.org/lkml/2018/1/19/68
> > Suggested-by: Keith Busch
> > Signed-off-by: Keith Busch
> > Signed-off-by: Jianchao Wang
>
> Why does this have a signoff from Keith?

Right, I hadn't signed off on that. I was just trying to get feedback on whether something like that closes the theoretical gap, which it does. I actually have something similar in my patch queue that I was about to send around this area, though. I don't like having the IO path take on the error handling, and I think ending unstarted requests directly will be better long term.
Re: [PATCH 09/12] nvme-pci: Use PCI p2pmem subsystem to manage the CMB
On Thu, Jan 04, 2018 at 12:01:34PM -0700, Logan Gunthorpe wrote: > Register the CMB buffer as p2pmem and use the appropriate allocation > functions to create and destroy the IO SQ. > > If the CMB supports WDS and RDS, publish it for use as p2p memory > by other devices. <> > + if (qid && dev->cmb_use_sqes) { > + nvmeq->sq_cmds = pci_alloc_p2pmem(pdev, SQ_SIZE(depth)); > + nvmeq->sq_dma_addr = pci_p2pmem_virt_to_bus(pdev, > + nvmeq->sq_cmds); > + nvmeq->sq_cmds_is_io = true; > + } This gets into some spec type trouble for creating an IO submission queue. We use the sq_dma_addr as the PRP entry to the Create I/O SQ command, which has some requirements: "the PRP Entry shall have an offset of 0h." So this has to be 4k aligned, but pci_alloc_p2pmem doesn't guarantee that. I doubt the spec's intention was to require such alignment for CMB SQs, but there may be controllers enforcing the rule as written.
Re: [PATCH 09/12] nvme-pci: Use PCI p2pmem subsystem to manage the CMB
On Fri, Jan 05, 2018 at 11:19:28AM -0700, Logan Gunthorpe wrote: > Although it is not explicitly stated anywhere, pci_alloc_p2pmem() should > always be at least 4k aligned. This is because the gen_pool that implements > it is created with PAGE_SHIFT for its min_alloc_order. Ah, I see that now. Thanks for the explanation. Does it need to be created with page sized minimum alloc order? That granularity makes it difficult to fit SQs in CMB on archs with larger pages when we only needed 4k alignment. I was also hoping to extend this for PRP/SGL in CMB where even 4k is too high a granularity to make it really useful. It looks like creating the gen pool with a smaller minimum and gen_pool_first_fit_order_align algo would satisfy my use cases, but I'm not sure if there's another reason you've set it to page alignment.
Re: [PATCH V2 0/2] nvme-pci: fix the timeout case when reset is ongoing
On Tue, Jan 09, 2018 at 10:03:11AM +0800, Jianchao Wang wrote: > Hello Sorry for the distraction, but could you possibly fix the date on your machine? For some reason, lists.infradead.org sorts threads by the time you claim to have sent your message rather than the time it was received, and you're a full day in the future. :)
Re: [PATCH] nvme-pci: use NOWAIT flag for nvme_set_host_mem
On Mon, Jan 29, 2018 at 11:07:35AM +0800, Jianchao Wang wrote:
> nvme_set_host_mem will invoke nvme_alloc_request without NOWAIT
> flag, it is unsafe for nvme_dev_disable. The adminq driver tags
> may have been used up when the previous outstanding adminq requests
> cannot be completed due to some hardware error. We have to depend
> on the timeout path to complete the previous outstanding adminq
> requests and free the tags.
> However, nvme_timeout will invoke nvme_dev_disable and try to
> get the shutdown_lock which is held by another context who is
> sleeping to wait for the tags to be freed by timeout path. A
> deadlock comes up.
>
> To fix it, let nvme_set_host_mem use NOWAIT flag.
>
> Signed-off-by: Jianchao Wang

Thanks for the fix. It looks like we still have a problem, though. Commands submitted with the "shutdown_lock" held need to be able to make forward progress without relying on a completion, but this one could block indefinitely.

> drivers/nvme/host/pci.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 6fe7af0..9532529 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1736,7 +1736,8 @@ static int nvme_set_host_mem(struct nvme_dev *dev, u32 bits)
> 	c.features.dword14 = cpu_to_le32(upper_32_bits(dma_addr));
> 	c.features.dword15 = cpu_to_le32(dev->nr_host_mem_descs);
>
> -	ret = nvme_submit_sync_cmd(dev->ctrl.admin_q, &c, NULL, 0);
> +	ret = __nvme_submit_sync_cmd(dev->ctrl.admin_q, &c, NULL, NULL, 0, 0,
> +			NVME_QID_ANY, 0, BLK_MQ_REQ_NOWAIT);
> 	if (ret) {
> 		dev_warn(dev->ctrl.device,
> 			 "failed to set host mem (err %d, flags %#x).\n",
> --
Re: [PATCH] nvme-pci: use NOWAIT flag for nvme_set_host_mem
On Mon, Jan 29, 2018 at 09:55:41PM +0200, Sagi Grimberg wrote: > > Thanks for the fix. It looks like we still have a problem, though. > > Commands submitted with the "shutdown_lock" held need to be able to make > > forward progress without relying on a completion, but this one could > > block indefinitely. > > Can you explain to me why is the shutdown_lock needed to synchronize > nvme_dev_disable? More concretely, how is nvme_dev_disable different > from other places where we rely on the ctrl state to serialize stuff? > > The only reason I see would be to protect against completion-after-abort > scenario but I think the block layer should protect against it (checks > if the request timeout timer fired). We can probably find a way to use the state machine for this. Disabling the controller pre-dates the state machine, and the mutex is there to protect against two actors shutting the controller down at the same time, like a hot removal at the same time as a timeout handling reset.
Re: [BUG 4.15-rc7] IRQ matrix management errors
On Tue, Jan 16, 2018 at 12:20:18PM +0100, Thomas Gleixner wrote:
> 8<--
> diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> index f8b03bb8e725..3cc471beb50b 100644
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -542,14 +542,17 @@ static int x86_vector_alloc_irqs(struct irq_domain *domain, unsigned int virq,
>
> 		err = assign_irq_vector_policy(irqd, info);
> 		trace_vector_setup(virq + i, false, err);
> -		if (err)
> +		if (err) {
> +			irqd->chip_data = NULL;
> +			free_apic_chip_data(apicd);
> 			goto error;
> +		}
> 	}
>
> 	return 0;
>
> error:
> -	x86_vector_free_irqs(domain, virq, i + 1);
> +	x86_vector_free_irqs(domain, virq, i);
> 	return err;
> }

The patch does indeed fix all the warnings and allows device binding to succeed, albeit in a degraded performance mode. Despite that, this is a good fix, and looks applicable to 4.4-stable, so:

Tested-by: Keith Busch

I'm still concerned assign_irq_vector_policy is failing. That has interrupt allocation abandon MSI-x and fall back to legacy IRQ. Your patch does address my main concern, though. Are you comfortable enough to queue this up for 4.15?

Thanks,
Keith
Re: [BUG 4.15-rc7] IRQ matrix management errors
On Wed, Jan 17, 2018 at 08:34:22AM +0100, Thomas Gleixner wrote: > Can you trace the matrix allocations from the very beginning or tell me how > to reproduce. I'd like to figure out why this is happening. Sure, I'll get the irq_matrix events. I reproduce this on a machine with 112 CPUs and 3 NVMe controllers. The first two NVMe want 112 MSI-x vectors, and the last only 31 vectors. The test runs 'modprobe nvme' and 'modprobe -r nvme' in a loop with 10 second delay between each step. Repro occurs within a few iterations, sometimes already broken after the initial boot.
Re: [BUG 4.15-rc7] IRQ matrix management errors
On Wed, Jan 17, 2018 at 04:01:47PM +0100, Thomas Gleixner wrote: > Which device is allocating gazillions of non-managed interrupts? I believe that would be the i40e. :) > The patch below should cure that by spreading them out on allocation. Yep, this is successfully testing already over 200 iterations that used to fail within only a few. I'd say the problem is cured. Thanks! Tested-by: Keith Busch
Re: [PATCH v5 4/4] PCI/DPC: Enumerate the devices after DPC trigger event
On Wed, Jan 17, 2018 at 08:27:39AM -0800, Sinan Kaya wrote:
> On 1/17/2018 5:37 AM, Oza Pawandeep wrote:
> > +static bool dpc_wait_link_active(struct pci_dev *pdev)
> > +{
>
> I think you can also make this function common instead of making another
> copy here. Of course, this would be another patch.

It is actually very similar to __pcie_wait_link_active in pciehp_hpc.c, so there's some opportunity to make even more common code.
Re: [PATCH v3 2/2] nvme: add tracepoint for nvme_complete_rq
Looks good. Reviewed-by: Keith Busch
Re: [PATCH v3 1/2] nvme: add tracepoint for nvme_setup_cmd
Looks good. Reviewed-by: Keith Busch
Re: [BUG 4.15-rc7] IRQ matrix management errors
On Thu, Jan 18, 2018 at 09:10:43AM +0100, Thomas Gleixner wrote:
> Can you please provide the output of
>
> # cat /sys/kernel/debug/irq/irqs/$ONE_I40_IRQ

# cat /sys/kernel/debug/irq/irqs/48
handler:  handle_edge_irq
device:   :1a:00.0
status:   0x
istate:   0x
ddepth:   0
wdepth:   0
dstate:   0x05401200
            IRQD_ACTIVATED
            IRQD_IRQ_STARTED
            IRQD_SINGLE_TARGET
            IRQD_AFFINITY_SET
            IRQD_CAN_RESERVE
node:     0
affinity: 0-27,56-83
effectiv: 0
pending:
domain: PCI-MSI-2
 hwirq:  0xd2
 chip:   PCI-MSI
  flags: 0x10
           IRQCHIP_SKIP_SET_WAKE
 parent:
  domain: VECTOR
   hwirq: 0x30
   chip:  APIC
    flags: 0x0
    Vector: 69
    Target: 0
Re: [PATCH] nvme: initialize hostid uuid in nvmf_host_default to not leak kernel memory
On Tue, Jan 09, 2018 at 04:20:43PM +0100, Johannes Thumshirn wrote: > Alexander reports: > according to KMSAN (and common sense as well) the following code in > drivers/nvme/host/fabrics.c > > (http://elixir.free-electrons.com/linux/latest/source/drivers/nvme/host/fabrics.c#L68): > > 72 host = kmalloc(sizeof(*host), GFP_KERNEL); > 73 if (!host) > 74 return NULL; > 75 > 76 kref_init(&host->ref); > 77 snprintf(host->nqn, NVMF_NQN_SIZE, > 78 "nqn.2014-08.org.nvmexpress:uuid:%pUb", &host->id); > > uses uninitialized heap memory to generate the unique id for the NVMF host. > If I'm understanding correctly, it can be then passed to the > userspace, so the contents of the uninitialized chunk may potentially > leak. > If the specification doesn't rely on this UID to be random or unique, > I suggest using kzalloc() here, otherwise it might be a good idea to > use a real RNG. > > this assumption is correct so initialize the host->id using uuid_gen() as > it was done before commit 6bfe04255d5e ("nvme: add hostid token to fabric > options"). > > Fixes: 6bfe04255d5e ("nvme: add hostid token to fabric options") > Reported-by: Alexander Potapenko > Signed-off-by: Johannes Thumshirn Thanks for the report and the fix. It'd still be good to use the kzalloc variant in addition to this. Reviewed-by: Keith Busch
Re: [PATCH V3 1/2] nvme: split resetting state into reset_prepate and resetting
On Mon, Jan 15, 2018 at 10:02:04AM +0800, jianchao.wang wrote: > Hi keith > > Thanks for your kindly review and response. I agree with Sagi's feedback, but I can't take credit for it. :)
[BUG 4.15-rc7] IRQ matrix management errors
I hoped to have a better report before the weekend, but I've run out of time and I'm without my machine till next week, so I'm sending what I have and praying someone more in the know will have a better clue.

I've a few NVMe drives and occasionally the IRQ teardown and bring-up is failing. Resetting the controllers to re-run interrupt allocation results in device interrupts not occurring at all. It appears I need at least 2 NVMe drives to hit the problem. The NVMe driver initializes controllers in parallel, and multiple threads calling pci_alloc_irq_vectors and/or pci_free_irq_vectors at about the same time seem required to trigger the issue.

Here are the relevant warnings. I also have trace events from irq_matrix and irq_vectors, but from a different run that won't match up to the timestamps below; I can send them if that's helpful.

[ 288.519216] WARNING: CPU: 28 PID: 1420 at kernel/irq/matrix.c:222 irq_matrix_remove_managed+0x10f/0x120
[ 288.519218] Modules linked in: nvme ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw iptable_security ebtable_filter ebtables ip6table_filter ip6_tables sunrpc vfat fat intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_uncore intel_rapl_perf iTCO_wdt joydev iTCO_vendor_support ipmi_ssif ipmi_si mei_me mei shpchp i2c_i801 wmi lpc_ich ipmi_devintf ioatdma ipmi_msghandler dca acpi_pad acpi_power_meter xfs libcrc32c ast i2c_algo_bit drm_kms_helper ttm drm i40e crc32c_intel
[ 288.519286] ptp nvme_core pps_core [last unloaded: nvme]
[ 288.519294] CPU: 28 PID: 1420 Comm: kworker/u674:2 Not tainted 4.15.0-rc7+ #4
[ 288.519296] Hardware name: Intel Corporation S2600STB/S2600STB, BIOS SE5C620.86B.00.01.2001.062220170731 06/22/2017
[ 288.519305] Workqueue: nvme-wq nvme_reset_work [nvme]
[ 288.519310] RIP: 0010:irq_matrix_remove_managed+0x10f/0x120
[ 288.519312] RSP: 0018:b01f0aec7a88 EFLAGS: 00010046
[ 288.519315] RAX: 00ee RBX: 9e387d824900 RCX:
[ 288.519317] RDX: 0100 RSI: 00ee RDI: 9e387d410c50
[ 288.519319] RBP: R08: 0100 R09:
[ 288.519320] R10: 0018 R11: 0003 R12: 9e387d410c00
[ 288.519322] R13: 9e387d410c30 R14: 00ee R15: 00ee
[ 288.519324] FS: () GS:9e487cc0() knlGS:
[ 288.519326] CS: 0010 DS: ES: CR0: 80050033
[ 288.519328] CR2: 7fbc1c2321d4 CR3: 0010f5209003 CR4: 007606e0
[ 288.519330] DR0: DR1: DR2:
[ 288.519332] DR3: DR6: fffe0ff0 DR7: 0400
[ 288.519333] PKRU: 5554
[ 288.519334] Call Trace:
[ 288.519345]  x86_vector_free_irqs+0xa1/0x180
[ 288.519349]  x86_vector_alloc_irqs+0x1e4/0x3a0
[ 288.519354]  msi_domain_alloc+0x62/0x130
[ 288.519363]  ? kmem_cache_alloc_node_trace+0x1ac/0x1d0
[ 288.519366]  __irq_domain_alloc_irqs+0x121/0x300
[ 288.519370]  msi_domain_alloc_irqs+0x99/0x2e0
[ 288.519376]  native_setup_msi_irqs+0x54/0x90
[ 288.519383]  __pci_enable_msix+0xfb/0x4e0
[ 288.519389]  pci_alloc_irq_vectors_affinity+0x8e/0x130
[ 288.519394]  nvme_reset_work+0x919/0x153b [nvme]
[ 288.519404]  ? sched_clock+0x5/0x10
[ 288.519407]  ? sched_clock+0x5/0x10
[ 288.519414]  ? sched_clock_cpu+0xc/0xb0
[ 288.519420]  ? pick_next_task_fair+0x4d5/0x5f0
[ 288.519427]  ? __switch_to+0xa2/0x430
[ 288.519431]  ? put_prev_entity+0x1e/0xe0
[ 288.519438]  process_one_work+0x182/0x370
[ 288.519441]  worker_thread+0x2e/0x380
[ 288.519444]  ? process_one_work+0x370/0x370
[ 288.519449]  kthread+0x111/0x130
[ 288.519453]  ? kthread_create_worker_on_cpu+0x70/0x70
[ 288.519460]  ret_from_fork+0x1f/0x30
[ 288.519464] Code: 89 ea 44 89 f6 41 ff d1 4d 8b 0f 4d 85 c9 75 e2 e9 2a ff ff ff 48 83 c4 08 5b 5d 41 5c 41 5d 41 5e 41 5f c3 0f ff e9 14 ff ff ff <0f> ff e9 0d ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00
[ 288.519517] ---[ end trace 8fcc70570a780634 ]---
[ 310.959112] nvme nvme0: pci function :5e:00.0
[ 310.959333] nvme nvme1: pci function :60:00.0
[ 310.959675] nvme nvme2: pci function :86:00.0
[ 311.167369] WARNING: CPU: 1 PID: 898 at kernel/irq/matrix.c:215 irq_matrix_remove_managed+0x108/0x120
[ 311.167371] Modules linked in: nvme ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw
Re: [BUG 4.15-rc7] IRQ matrix management errors
This is all way over my head, but the part that obviously shows something's gone wrong:

kworker/u674:3-1421 [028] d... 335.307051: irq_matrix_reserve_managed: bit=56 cpu=0 online=1 avl=86 alloc=116 managed=3 online_maps=112 global_avl=22084, global_rsvd=157, total_alloc=570
kworker/u674:3-1421 [028] d... 335.307053: irq_matrix_remove_managed: bit=56 cpu=0 online=1 avl=87 alloc=116 managed=2 online_maps=112 global_avl=22085, global_rsvd=157, total_alloc=570
kworker/u674:3-1421 [028] 335.307054: vector_reserve_managed: irq=45 ret=-28
kworker/u674:3-1421 [028] 335.307054: vector_setup: irq=45 is_legacy=0 ret=-28
kworker/u674:3-1421 [028] d... 335.307055: vector_teardown: irq=45 is_managed=1 has_reserved=0

Which leads me to the error path in x86_vector_alloc_irqs():

error:
	x86_vector_free_irqs(domain, virq, i + 1);

The last parameter looks weird. It's the nr_irqs, and since we failed and bailed, I would think we'd need to subtract 1 rather than add 1. Adding 1 would doubly remove the failed one, and remove the next one that was never set up, right? Or maybe irq_matrix_reserve_managed wasn't expected to fail in the first place?
Re: [BUG 4.15-rc7] IRQ matrix management errors
On Tue, Jan 16, 2018 at 12:20:18PM +0100, Thomas Gleixner wrote:
> What we want is s/i + 1/i/
>
> That's correct because x86_vector_free_irqs() does:
>
>	for (i = 0; i < nr; i++)
>
> So if we fail at the first irq, then the loop will do nothing. Failing on
> the second will free the first.
>
> Fix below.
>
> Thanks,
>
> 	tglx

Thanks! This looks much better. I'll try to verify by tomorrow, though the hardware I was using to recreate is not available to me at the moment. I may be able to synth the conditions on something else.

> 8<--
> diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> index f8b03bb8e725..3cc471beb50b 100644
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -542,14 +542,17 @@ static int x86_vector_alloc_irqs(struct irq_domain *domain, unsigned int virq,
>
> 		err = assign_irq_vector_policy(irqd, info);
> 		trace_vector_setup(virq + i, false, err);
> -		if (err)
> +		if (err) {
> +			irqd->chip_data = NULL;
> +			free_apic_chip_data(apicd);
> 			goto error;
> +		}
> 	}
>
> 	return 0;
>
> error:
> -	x86_vector_free_irqs(domain, virq, i + 1);
> +	x86_vector_free_irqs(domain, virq, i);
> 	return err;
> }
Re: [PATCH v2 0/2] add tracepoints for nvme command submission and completion
On Tue, Jan 16, 2018 at 03:28:19PM +0100, Johannes Thumshirn wrote: > Add tracepoints for nvme command submission and completion. The tracepoints > are modeled after SCSI's trace_scsi_dispatch_cmd_start() and > trace_scsi_dispatch_cmd_done() tracepoints and fulfil a similar purpose, > namely a fast way to check which command is going to be queued into the HW or > Fabric driver and which command is completed again. I like this very much, thanks for doing this. I think you could make the submission trace point tighter for PCI as Hannes was suggesting since an MMIO write can't fail, but doesn't look as doable for FC and RDMA.
Re: [PATCH v12 0/6] Address error and recovery for AER and DPC
On Sun, Mar 11, 2018 at 11:03:58PM -0400, Sinan Kaya wrote: > On 3/11/2018 6:03 PM, Bjorn Helgaas wrote: > > On Wed, Feb 28, 2018 at 10:34:11PM +0530, Oza Pawandeep wrote: > > > That difference has been there since the beginning of DPC, so it has > > nothing to do with *this* series EXCEPT for the fact that it really > > complicates the logic you're adding to reset_link() and > > broadcast_error_message(). > > > > We ought to be able to simplify that somehow because the only real > > difference between AER and DPC should be that DPC automatically > > disables the link and AER does it in software. > > I agree this should be possible. Code execution path should be almost > identical to fatal error case. > > Is there any reason why you went to stop driver path, Keith? The fact is the link is truly down during a DPC event. When the link is enabled again, you don't know at that point if the device(s) on the other side have changed. Calling a driver's error handler for the wrong device in an unknown state may have undefined results. Enumerating the slot from scratch should be safe, and will assign resources, tune bus settings, and bind to the matching driver. Per spec, DPC is the recommended way for handling surprise removal events and even recommends DPC capable slots *not* set 'Surprise' in Slot Capabilities so that removals are always handled by DPC. This service driver was developed with that use in mind.
Re: [PATCH v12 0/6] Address error and recovery for AER and DPC
On Mon, Mar 12, 2018 at 08:16:38PM +0530, p...@codeaurora.org wrote: > On 2018-03-12 19:55, Keith Busch wrote: > > On Sun, Mar 11, 2018 at 11:03:58PM -0400, Sinan Kaya wrote: > > > On 3/11/2018 6:03 PM, Bjorn Helgaas wrote: > > > > On Wed, Feb 28, 2018 at 10:34:11PM +0530, Oza Pawandeep wrote: > > > > > > > That difference has been there since the beginning of DPC, so it has > > > > nothing to do with *this* series EXCEPT for the fact that it really > > > > complicates the logic you're adding to reset_link() and > > > > broadcast_error_message(). > > > > > > > > We ought to be able to simplify that somehow because the only real > > > > difference between AER and DPC should be that DPC automatically > > > > disables the link and AER does it in software. > > > > > > I agree this should be possible. Code execution path should be almost > > > identical to fatal error case. > > > > > > Is there any reason why you went to stop driver path, Keith? > > > > The fact is the link is truly down during a DPC event. When the link > > is enabled again, you don't know at that point if the device(s) on the > > other side have changed. Calling a driver's error handler for the wrong > > device in an unknown state may have undefined results. Enumerating the > > slot from scratch should be safe, and will assign resources, tune bus > > settings, and bind to the matching driver. > > > > Per spec, DPC is the recommended way for handling surprise removal > > events and even recommends DPC capable slots *not* set 'Surprise' > > in Slot Capabilities so that removals are always handled by DPC. This > > service driver was developed with that use in mind. > > Now it begs the question, that > > after DPC trigger > > should we enumerate the devices, ? > or > error handling callbacks, followed by stop devices followed by enumeration ? > or > error handling callbacks, followed by enumeration ? (no stop devices) I'm not sure I understand. 
The link is disabled while DPC is triggered, so if anything, you'd want to un-enumerate everything below the contained port (that's what it does today). After releasing a slot from DPC, the link is allowed to retrain. If there is a working device on the other side, a link up event occurs. That event is handled by the pciehp driver, and that schedules enumeration no matter what you do to the DPC driver.
Re: [PATCH v12 0/6] Address error and recovery for AER and DPC
On Mon, Mar 12, 2018 at 09:04:47PM +0530, p...@codeaurora.org wrote: > On 2018-03-12 20:28, Keith Busch wrote: > > I'm not sure I understand. The link is disabled while DPC is triggered, > > so if anything, you'd want to un-enumerate everything below the > > contained > > port (that's what it does today). > > > > After releasing a slot from DPC, the link is allowed to retrain. If > > there > > is a working device on the other side, a link up event occurs. That > > event is handled by the pciehp driver, and that schedules enumeration > > no matter what you do to the DPC driver. > > yes, that is what i current, but this patch-set makes DPC aware of error > handling driver callbacks. I've been questioning the utility of doing that since the very first version of this patch set. > besides, in absence of pciehp there is nobody to do enumeration. If you configure your kernel to not have a feature, you don't get to expect the feature to work. You can still manually rescan through sysfs, /sys/bus/pci/rescan. > And, I was talking about pci_stop_and_remove_bus_device() in dpc. > if DPC calls driver's error callbacks, is it required to stop the devices ? DPC is PCIe's preferred surprise removal mechanism. If you don't want the DPC driver to handle removing devices downstream of the containment, how do you want those devices to be removed? Just calling a driver's error callbacks leaves the kernel's view of the topology in the wrong state.
Re: [pci PATCH v5 1/4] pci: Add pci_sriov_configure_simple for PFs that don't manage VF resources
On Mon, Mar 12, 2018 at 10:21:29AM -0700, Alexander Duyck wrote: > diff --git a/include/linux/pci.h b/include/linux/pci.h > index 024a1beda008..9cab9d0d51dc 100644 > --- a/include/linux/pci.h > +++ b/include/linux/pci.h > @@ -1953,6 +1953,7 @@ static inline void pci_mmcfg_late_init(void) { } > int pci_vfs_assigned(struct pci_dev *dev); > int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs); > int pci_sriov_get_totalvfs(struct pci_dev *dev); > +int pci_sriov_configure_simple(struct pci_dev *dev, int nr_virtfn); > resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno); > void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool probe); > #else I recommend stubbing 'pci_sriov_configure_simple' or defining it to NULL in the '#else' section here so you don't need to repeat the "#ifdef CONFIG_PCI_IOV" in each driver wishing to use this function. Otherwise looks fine to me.
Re: [PATCH v12 0/6] Address error and recovery for AER and DPC
On Mon, Mar 12, 2018 at 01:41:07PM -0400, Sinan Kaya wrote: > I was just writing a reply to you. You acted first :) > > On 3/12/2018 1:33 PM, Keith Busch wrote: > >>> After releasing a slot from DPC, the link is allowed to retrain. If > >>> there > >>> is a working device on the other side, a link up event occurs. That > >>> event is handled by the pciehp driver, and that schedules enumeration > >>> no matter what you do to the DPC driver. > >> yes, that is what i current, but this patch-set makes DPC aware of error > >> handling driver callbacks. > > I've been questioning the utility of doing that since the very first > > version of this patch set. > > > > I think we should all agree that shutting down the device drivers with active > work is not safe. There could be outstanding work that the endpoint driver > needs to take care of. > > That was the motivation for this change so that we give endpoint drivers an > error callback when something goes wrong. > > The rest is implementation detail that we can all figure out. I'm not sure if I agree here. All Linux device drivers are supposed to cope with sudden/unexpected loss of communication at any time. This includes cleaning up appropriately when requested to unbind from an inaccessible device with active outstanding work.
Re: [pci PATCH v5 1/4] pci: Add pci_sriov_configure_simple for PFs that don't manage VF resources
On Mon, Mar 12, 2018 at 11:09:34AM -0700, Alexander Duyck wrote:
> On Mon, Mar 12, 2018 at 10:40 AM, Keith Busch wrote:
> > On Mon, Mar 12, 2018 at 10:21:29AM -0700, Alexander Duyck wrote:
> >> diff --git a/include/linux/pci.h b/include/linux/pci.h
> >> index 024a1beda008..9cab9d0d51dc 100644
> >> --- a/include/linux/pci.h
> >> +++ b/include/linux/pci.h
> >> @@ -1953,6 +1953,7 @@ static inline void pci_mmcfg_late_init(void) { }
> >> int pci_vfs_assigned(struct pci_dev *dev);
> >> int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs);
> >> int pci_sriov_get_totalvfs(struct pci_dev *dev);
> >> +int pci_sriov_configure_simple(struct pci_dev *dev, int nr_virtfn);
> >> resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno);
> >> void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool probe);
> >> #else
> >
> > I recommend stubbing 'pci_sriov_configure_simple' or defining it to
> > NULL in the '#else' section here so you don't need to repeat the "#ifdef
> > CONFIG_PCI_IOV" in each driver wishing to use this function. Otherwise
> > looks fine to me.
>
> My concern with defining it as NULL is that somebody may end up
> calling it in the future directly and that may end up causing issues.
> One thought I have been debating is moving it to a different file. I
> am just not sure where the best place to put something like this would
> be. I could move this function to drivers/pci/pci.c if everyone is
> okay with it and then I could just strip the contents out by wrapping
> them in a #ifdef instead.

Okay, instead of NULL, a stub implementation in the header file may suffice when CONFIG_PCI_IOV is not defined:

static inline int pci_sriov_configure_simple(struct pci_dev *dev,
					     int nr_virtfn)
{
	return -ENOSYS;
}

See pci_iov_virtfn_bus, pci_iov_virtfn_devfn, pci_iov_add_virtfn, or pci_enable_sriov for other examples.
Re: [PATCH V2] nvme-pci: assign separate irq vectors for adminq and ioq0
Hi Jianchao, The patch tests fine on all hardware I had. I'd like to queue this up for the next 4.16-rc. Could you send a v3 with the cleanup changes Andy suggested and a changelog aligned with Ming's insights? Thanks, Keith
Re: [PATCH v12 0/6] Address error and recovery for AER and DPC
On Mon, Mar 12, 2018 at 02:47:30PM -0500, Bjorn Helgaas wrote: > [+cc Alex] > > On Mon, Mar 12, 2018 at 08:25:51AM -0600, Keith Busch wrote: > > On Sun, Mar 11, 2018 at 11:03:58PM -0400, Sinan Kaya wrote: > > > On 3/11/2018 6:03 PM, Bjorn Helgaas wrote: > > > > On Wed, Feb 28, 2018 at 10:34:11PM +0530, Oza Pawandeep wrote: > > > > > > > That difference has been there since the beginning of DPC, so it has > > > > nothing to do with *this* series EXCEPT for the fact that it really > > > > complicates the logic you're adding to reset_link() and > > > > broadcast_error_message(). > > > > > > > > We ought to be able to simplify that somehow because the only real > > > > difference between AER and DPC should be that DPC automatically > > > > disables the link and AER does it in software. > > > > > > I agree this should be possible. Code execution path should be almost > > > identical to fatal error case. > > > > > > Is there any reason why you went to stop driver path, Keith? > > > > The fact is the link is truly down during a DPC event. When the link > > is enabled again, you don't know at that point if the device(s) on the > > other side have changed. > > When DPC is triggered, the port takes the link down. When we handle > an uncorrectable (nonfatal or fatal) AER event, software takes the > link down. > > In both cases, devices on the other side are at least reset. Whenever > the link goes down, it's conceivable the device could be replaced with > a different one before the link comes back up. Is this why you remove > and re-enumerate? (See tangent [1] below.) Yes. Truthfully, DPC events due to changing topologies was the motivating use case when we initially developed this. We were also going for simplicity (at least initially), and remove + re-enumerate seemed safe without concerning this driver with other capability registers, or coordinating with/depending on other drivers.
For example, a successful reset may depend on any particular driver calling pci_restore_state from a good saved state. > The point is that from the device's hardware perspective, these > scenarios are the same (it sent a ERR_NONFATAL or ERR_FATAL message > and it sees the link go down). I think we should make them the same > on the software side, too: the driver should see the same callbacks, > in the same order, whether we're doing AER or DPC. > > If removing and re-enumerating is the right thing for DPC, I think > that means it's also the right thing for AER. > > Along this line, we had a question a while back about logging AER > information after a DPC trigger. Obviously we can't collect any > logged information from the downstream devices while link is down, but > I noticed the AER status bits are RW1CS, which means they're sticky > and are not modified by hot reset or FLR. > > So we may be able to read and log the AER information after the DPC > driver brings the link back up. We may want to do the same with AER, > i.e., reset the downstream devices first, then collect the AER status > bits after the link comes back up. I totally agree. We could consult Slot and AER status to guide a smarter approach. > > Calling a driver's error handler for the wrong device in an unknown > > state may have undefined results. Enumerating the slot from scratch > > should be safe, and will assign resources, tune bus settings, and > > bind to the matching driver. > > I agree with this; I think this is heading toward doing > remove/re-enumerate on AER errors as well because it has also reset > the device. > > > Per spec, DPC is the recommended way for handling surprise removal > > events and even recommends DPC capable slots *not* set 'Surprise' > > in Slot Capabilities so that removals are always handled by DPC. This > > service driver was developed with that use in mind. > > Thanks for this tip. 
The only thing I've found so far is the mention > of Surprise Down triggering DPC in the last paragraph of sec 6.7.5. > Are there other references I should look at? I haven't found the > recommendation about not setting 'Surprise' in Slot Capabilities yet. No problem, it's in the "IMPLEMENTATION NOTE" at the end of 6.2.10.4, "Avoid Disabled Link and Hot-Plug Surprise Use with DPC". Outside the spec, Microsemi as one of the PCI-SIG contributors and early adopters of the capability published a white paper "Hot and Surprise Plug Recommendations for Enterprise PCIe" providing guidance for OS drivers using DPC. We originally developed to that guidance. The pa
Re: [PATCH v2 07/10] nvme-pci: Use PCI p2pmem subsystem to manage the CMB
On Mon, Mar 05, 2018 at 12:33:29PM +1100, Oliver wrote: > On Thu, Mar 1, 2018 at 10:40 AM, Logan Gunthorpe wrote: > > @@ -429,10 +429,7 @@ static void __nvme_submit_cmd(struct nvme_queue *nvmeq, > > { > > u16 tail = nvmeq->sq_tail; > > > - if (nvmeq->sq_cmds_io) > > - memcpy_toio(&nvmeq->sq_cmds_io[tail], cmd, sizeof(*cmd)); > > - else > > - memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd)); > > + memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd)); > > Hmm, how safe is replacing memcpy_toio() with regular memcpy()? On PPC > the _toio() variant enforces alignment, does the copy with 4 byte > stores, and has a full barrier after the copy. In comparison our > regular memcpy() does none of those things and may use unaligned and > vector load/stores. For normal (cacheable) memory that is perfectly > fine, but they can cause alignment faults when targeted at MMIO > (cache-inhibited) memory. > > I think in this particular case it might be ok since we know SEQs are > aligned to 64 byte boundaries and the copy is too small to use our > vectorised memcpy(). I'll assume we don't need explicit ordering > between writes of SEQs since the existing code doesn't seem to care > unless the doorbell is being rung, so you're probably fine there too. > That said, I still think this is a little bit sketchy and at the very > least you should add a comment explaining what's going on when the CMB > is being used. If someone more familiar with the NVMe driver could > chime in I would appreciate it. I may not be understanding the concern, but I'll give it a shot. You're right, the start of any SQE is always 64-byte aligned, so that should satisfy alignment requirements. The order when writing multiple/successive SQEs in a submission queue does matter, and this is currently serialized through the q_lock. The order in which the bytes of a single SQE is written doesn't really matter as long as the entire SQE is written into the CMB prior to writing that SQ's doorbell register. 
The doorbell register is written immediately after copying a command entry into the submission queue (ignoring "shadow buffer" features), so the ratio of doorbell writes to submitted commands is 1:1. If SQE-to-doorbell ordering is not enforced by the memcpy when the SQ is in the CMB, then we do need a barrier after the SQE's memcpy and before the doorbell's writel.
Re: [PATCH v2 07/10] nvme-pci: Use PCI p2pmem subsystem to manage the CMB
On Mon, Mar 05, 2018 at 01:10:53PM -0700, Jason Gunthorpe wrote: > So when reading the above mlx code, we see the first wmb() being used > to ensure that CPU stores to cachable memory are visible to the DMA > triggered by the doorbell ring. IIUC, we don't need a similar barrier for NVMe to ensure memory is visibile to DMA since the SQE memory is allocated DMA coherent when the SQ is not within a CMB. > The mmiowb() is used to ensure that DB writes are not combined and not > issued in any order other than implied by the lock that encloses the > whole thing. This is needed because uar_map is WC memory. > > We don't have ordering with respect to two writel's here, so if ARM > performance was a concern the writel could be switched to > writel_relaxed(). > > Presumably nvme has similar requirments, although I guess the DB > register is mapped UC not WC? Yep, the NVMe DB register is required by the spec to be mapped UC.
Re: [PATCH v13 6/6] PCI/DPC: Do not do recovery for hotplug enabled system
On Thu, Apr 12, 2018 at 10:34:37AM -0400, Sinan Kaya wrote: > On 4/12/2018 10:06 AM, Bjorn Helgaas wrote: > > > > I think the scenario you are describing is two systems that are > > identical except that in the first, the endpoint is below a hotplug > > bridge, while in the second, it's below a non-hotplug bridge. There's > > no physical hotplug (no drive removed or inserted), and DPC is > > triggered in both systems. > > > > I suggest that DPC should be handled identically in both systems: > > > > - The PCI core should have the same view of the endpoint: it should > > be removed and re-added in both cases (or in neither case). > > > > - The endpoint itself should not be able to tell the difference: it > > should see a link down event, followed by a link retrain, followed > > by the same sequence of config accesses, etc. > > > > - The endpoint driver should not be able to tell the difference, > > i.e., we should be calling the same pci_error_handlers callbacks > > in both cases. > > > > It's true that in the non-hotplug system, pciehp probably won't start > > re-enumeration, so we might need an alternate path to trigger that. > > > > But that's not what we're doing in this patch. In this patch we're > > adding a much bigger difference: for hotplug bridges, we stop and > > remove the hierarchy below the bridge; for non-hotplug bridges, we do > > the AER-style flow of calling pci_error_handlers callbacks. > > Our approach on V12 was to go to AER style recovery for all DPC events > regardless of hotplug support or not. > > Keith was not comfortable with this approach. That's why, we special cased > hotplug. > > If we drop 6/6 on this patch on v13, we achieve this. We still have to > take care of Keith's inputs on individual patches. > > we have been struggling with the direction for a while. > > Keith, what do you think? My only concern was for existing production environments that use DPC for handling surprise removal, and I don't wish to break the existing uses.
Re: [PATCH v13 6/6] PCI/DPC: Do not do recovery for hotplug enabled system
On Thu, Apr 12, 2018 at 08:39:54AM -0600, Keith Busch wrote: > On Thu, Apr 12, 2018 at 10:34:37AM -0400, Sinan Kaya wrote: > > On 4/12/2018 10:06 AM, Bjorn Helgaas wrote: > > > > > > I think the scenario you are describing is two systems that are > > > identical except that in the first, the endpoint is below a hotplug > > > bridge, while in the second, it's below a non-hotplug bridge. There's > > > no physical hotplug (no drive removed or inserted), and DPC is > > > triggered in both systems. > > > > > > I suggest that DPC should be handled identically in both systems: > > > > > > - The PCI core should have the same view of the endpoint: it should > > > be removed and re-added in both cases (or in neither case). > > > > > > - The endpoint itself should not be able to tell the difference: it > > > should see a link down event, followed by a link retrain, followed > > > by the same sequence of config accesses, etc. > > > > > > - The endpoint driver should not be able to tell the difference, > > > i.e., we should be calling the same pci_error_handlers callbacks > > > in both cases. > > > > > > It's true that in the non-hotplug system, pciehp probably won't start > > > re-enumeration, so we might need an alternate path to trigger that. > > > > > > But that's not what we're doing in this patch. In this patch we're > > > adding a much bigger difference: for hotplug bridges, we stop and > > > remove the hierarchy below the bridge; for non-hotplug bridges, we do > > > the AER-style flow of calling pci_error_handlers callbacks. > > > > Our approach on V12 was to go to AER style recovery for all DPC events > > regardless of hotplug support or not. > > > > Keith was not comfortable with this approach. That's why, we special cased > > hotplug. > > > > If we drop 6/6 on this patch on v13, we achieve this. We still have to > > take care of Keith's inputs on individual patches. > > > > we have been struggling with the direction for a while. > > > > Keith, what do you think? 
> > My only concern was for existing production environments that use DPC > for handling surprise removal, and I don't wish to break the existing > uses. Also, I thought the plan was to keep hotplug and non-hotplug the same, except for the very end: if not a hotplug bridge, initiate the rescan automatically after releasing from containment, otherwise let pciehp handle it when the link reactivates.
Re: [PATCH v13 6/6] PCI/DPC: Do not do recovery for hotplug enabled system
On Thu, Apr 12, 2018 at 12:27:20PM -0400, Sinan Kaya wrote: > On 4/12/2018 11:02 AM, Keith Busch wrote: > > > > Also, I thought the plan was to keep hotplug and non-hotplug the same, > > except for the very end: if not a hotplug bridge, initiate the rescan > > automatically after releasing from containment, otherwise let pciehp > > handle it when the link reactivates. > > > > Hmm... > > AER driver doesn't do stop and rescan approach for fatal errors. AER driver > makes an error callback followed by secondary bus reset and finally driver > the resume callback on the endpoint only if link recovery is successful. > Otherwise, AER driver bails out with recovery unsuccessful message. I'm not sure if that's necessarily true. People have reported AER handling triggers PCIe hotplug events, and creates some interesting race conditions: https://marc.info/?l=linux-pci&m=152336615707640&w=2 https://www.spinics.net/lists/linux-pci/msg70614.html > Why do we need an additional rescan in the DPC driver if the link is up > and driver resumes operation? I thought the plan was to have DPC always go through the removal path to ensure all devices are properly configured when containment is released. In order to reconfigure those, you'll need to initiate the rescan from somewhere.
Re: [PATCH] NVMe: Add Quirk Delay before CHK RDY for Seagate Nytro Flash Storage
Thanks, applied for 4.17-rc1. I was a little surprised git was able to apply this since the patch format is off, but it worked!
Re: [PATCH 08/12] lightnvm: implement get log report chunk helpers
On Wed, Mar 21, 2018 at 03:06:05AM -0700, Matias Bjørling wrote: > > outside of nvme core so that we can use it form lightnvm. > > > > Signed-off-by: Javier González > > --- > > drivers/lightnvm/core.c | 11 +++ > > drivers/nvme/host/core.c | 6 ++-- > > drivers/nvme/host/lightnvm.c | 74 > > > > drivers/nvme/host/nvme.h | 3 ++ > > include/linux/lightnvm.h | 24 ++ > > 5 files changed, 115 insertions(+), 3 deletions(-) > > > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c > > index 2e9e9f973a75..af642ce6ba69 100644 > > --- a/drivers/nvme/host/core.c > > +++ b/drivers/nvme/host/core.c > > @@ -2127,9 +2127,9 @@ static int nvme_init_subsystem(struct nvme_ctrl > > *ctrl, struct nvme_id_ctrl *id) > > return ret; > > } > > > > -static int nvme_get_log_ext(struct nvme_ctrl *ctrl, struct nvme_ns *ns, > > - u8 log_page, void *log, > > - size_t size, size_t offset) > > +int nvme_get_log_ext(struct nvme_ctrl *ctrl, struct nvme_ns *ns, > > +u8 log_page, void *log, > > +size_t size, size_t offset) > > { > > struct nvme_command c = { }; > > unsigned long dwlen = size / 4 - 1; > > diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c > > index 08f0f6b5bc06..ffd64a83c8c3 100644 > > --- a/drivers/nvme/host/lightnvm.c > > +++ b/drivers/nvme/host/lightnvm.c > > @@ -35,6 +35,10 @@ enum nvme_nvm_admin_opcode { > > nvme_nvm_admin_set_bb_tbl = 0xf1, > > }; > > > > > > > > > diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h > > index 1ca08f4993ba..505f797f8c6c 100644 > > --- a/drivers/nvme/host/nvme.h > > +++ b/drivers/nvme/host/nvme.h > > @@ -396,6 +396,9 @@ int nvme_reset_ctrl(struct nvme_ctrl *ctrl); > > int nvme_delete_ctrl(struct nvme_ctrl *ctrl); > > int nvme_delete_ctrl_sync(struct nvme_ctrl *ctrl); > > > > +int nvme_get_log_ext(struct nvme_ctrl *ctrl, struct nvme_ns *ns, > > +u8 log_page, void *log, size_t size, size_t offset); > > + > > extern const struct attribute_group nvme_ns_id_attr_group; > > extern const struct 
block_device_operations nvme_ns_head_ops; > > > > > Keith, Christoph, Sagi, Is it okay that these two changes that exposes > the nvme_get_log_ext fn are carried through Jens' tree after the nvme > tree for 4.17 has been pulled? That's okay with me. Alternatively, if you want to split the generic nvme part out, I can apply that immediately and the API will be in the first nvme-4.17 pull request.
Re: [RFC PATCH] nvme: avoid race-conditions when enabling devices
On Wed, Mar 21, 2018 at 11:48:09PM +0800, Ming Lei wrote: > On Wed, Mar 21, 2018 at 01:10:31PM +0100, Marta Rybczynska wrote: > > > On Wed, Mar 21, 2018 at 12:00:49PM +0100, Marta Rybczynska wrote: > > >> NVMe driver uses threads for the work at device reset, including enabling > > >> the PCIe device. When multiple NVMe devices are initialized, their reset > > >> works may be scheduled in parallel. Then pci_enable_device_mem can be > > >> called in parallel on multiple cores. > > >> > > >> This causes a loop of enabling of all upstream bridges in > > >> pci_enable_bridge(). pci_enable_bridge() causes multiple operations > > >> including __pci_set_master and architecture-specific functions that > > >> call ones like and pci_enable_resources(). Both __pci_set_master() > > >> and pci_enable_resources() read PCI_COMMAND field in the PCIe space > > >> and change it. This is done as read/modify/write. > > >> > > >> Imagine that the PCIe tree looks like: > > >> A - B - switch - C - D > > >>\- E - F > > >> > > >> D and F are two NVMe disks and all devices from B are not enabled and bus > > >> mastering is not set. If their reset work are scheduled in parallel the > > >> two > > >> modifications of PCI_COMMAND may happen in parallel without locking and > > >> the > > >> system may end up with the part of PCIe tree not enabled. > > > > > > Then looks serialized reset should be used, and I did see the commit > > > 79c48ccf2fe ("nvme-pci: serialize pci resets") fixes issue of 'failed > > > to mark controller state' in reset stress test. > > > > > > But that commit only covers case of PCI reset from sysfs attribute, and > > > maybe other cases need to be dealt with in similar way too. > > > > > > > It seems to me that the serialized reset works for multiple resets of the > > same device, doesn't it? Our problem is linked to resets of different > > devices > > that share the same PCIe tree. 
>
> Given reset shouldn't be a frequent action, it might be fine to serialize all
> reset from different devices.

The driver was much simpler when we had serialized resets in line with probe, but that had bigger problems with certain init systems when you put enough nvme devices in your server, making them unbootable.

Would it be okay to serialize just the pci_enable_device across all other tasks messing with the PCI topology?

---
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index cef5ce851a92..e0a2f6c0f1cf 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2094,8 +2094,11 @@ static int nvme_pci_enable(struct nvme_dev *dev)
 	int result = -ENOMEM;
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
 
-	if (pci_enable_device_mem(pdev))
-		return result;
+	pci_lock_rescan_remove();
+	result = pci_enable_device_mem(pdev);
+	pci_unlock_rescan_remove();
+	if (result)
+		return -ENODEV;
 
 	pci_set_master(pdev);
--
Re: [PATCH] nvme: make nvme_get_log_ext non-static
On Wed, Mar 21, 2018 at 08:27:07PM +0100, Matias Bjørling wrote:
> Enable the lightnvm integration to use the nvme_get_log_ext()
> function.
>
> Signed-off-by: Matias Bjørling

Thanks, applied to nvme-4.17.
Re: [PATCH v13 1/6] PCI/AER: Rename error recovery to generic PCI naming
On Mon, Apr 09, 2018 at 10:41:49AM -0400, Oza Pawandeep wrote:
> This patch renames error recovery to generic name with pcie prefix
>
> Signed-off-by: Oza Pawandeep

Looks fine.

Reviewed-by: Keith Busch
Re: [PATCH v13 2/6] PCI/AER: Factor out error reporting from AER
On Mon, Apr 09, 2018 at 10:41:50AM -0400, Oza Pawandeep wrote:
> This patch factors out error reporting callbacks, which are currently
> tightly coupled with AER.
>
> DPC should be able to register callbacks and attempt recovery when a DPC
> trigger event occurs.
>
> Signed-off-by: Oza Pawandeep

Looks fine.

Reviewed-by: Keith Busch
Re: [PATCH v13 3/6] PCI/PORTDRV: Implement generic find service
On Mon, Apr 09, 2018 at 10:41:51AM -0400, Oza Pawandeep wrote:
> This patch implements a generic pcie_port_find_service() routine.
>
> Signed-off-by: Oza Pawandeep

Looks good.

Reviewed-by: Keith Busch
Re: [PATCH v13 5/6] PCI: Unify wait for link active into generic PCI
On Mon, Apr 09, 2018 at 10:41:53AM -0400, Oza Pawandeep wrote:
> +/**
> + * pcie_wait_for_link - Wait for link till it's active/inactive
> + * @pdev: Bridge device
> + * @active: waiting for active or inactive ?
> + *
> + * Use this to wait till link becomes active or inactive.
> + */
> +bool pcie_wait_for_link(struct pci_dev *pdev, bool active)
> +{
> +	int timeout = 1000;
> +	bool ret;
> +	u16 lnk_status;
> +
> +	for (;;) {
> +		pcie_capability_read_word(pdev, PCI_EXP_LNKSTA, &lnk_status);
> +		ret = !!(lnk_status & PCI_EXP_LNKSTA_DLLLA);
> +		if (ret == active)
> +			return true;
> +		if (timeout <= 0)
> +			break;
> +		timeout -= 10;
> +	}

This is missing an msleep(10) at each iteration.

> +
> +	pci_info(pdev, "Data Link Layer Link Active not %s in 1000 msec\n",
> +		 active ? "set" : "cleared");
> +
> +	return false;
> +}
Re: [PATCH v13 4/6] PCI/DPC: Unify and plumb error handling into DPC
On Mon, Apr 09, 2018 at 10:41:52AM -0400, Oza Pawandeep wrote:
> +static int find_dpc_dev_iter(struct device *device, void *data)
> +{
> +	struct pcie_port_service_driver *service_driver;
> +	struct device **dev;
> +
> +	dev = (struct device **) data;
> +
> +	if (device->bus == &pcie_port_bus_type && device->driver) {
> +		service_driver = to_service_driver(device->driver);
> +		if (service_driver->service == PCIE_PORT_SERVICE_DPC) {
> +			*dev = device;
> +			return 1;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +static struct device *pci_find_dpc_dev(struct pci_dev *pdev)
> +{
> +	struct device *dev = NULL;
> +
> +	device_for_each_child(&pdev->dev, &dev, find_dpc_dev_iter);
> +
> +	return dev;
> +}

The only caller of this doesn't seem to care to use struct device. This should probably just extract struct dpc_dev directly from here.
Re: [PATCH V2] nvme-pci: assign separate irq vectors for adminq and ioq0
On Thu, Mar 08, 2018 at 08:42:20AM +0100, Christoph Hellwig wrote:
> So I suspect we'll need to go with a patch like this, just with a way
> better changelog.

I have to agree this is required for that use case. I'll run some quick tests and propose an alternate changelog.

Longer term, the current way we're including present-but-offline CPUs either (a) has the driver allocate resources it can't use or (b) spreads the ones it can use thinner than they need to be. Why don't we rerun the irq spread under a hot cpu notifier for only online CPUs?
Re: [PATCH V3] nvme-pci: assign separate irq vectors for adminq and ioq1
On Tue, Mar 13, 2018 at 06:45:00PM +0800, Ming Lei wrote:
> On Tue, Mar 13, 2018 at 05:58:08PM +0800, Jianchao Wang wrote:
> > Currently, adminq and ioq1 share the same irq vector, which is set
> > affinity to cpu0. If a system allows cpu0 to be offlined, the adminq
> > will not be able to work any more.
> >
> > To fix this, assign separate irq vectors for adminq and ioq1. Set
> > .pre_vectors == 1 when allocating irq vectors, then assign the first
> > one to adminq, which will have an affinity cpumask with all possible
> > cpus. On the other hand, if the controller has only legacy or
> > single-message MSI, we will set up adminq and 1 ioq and let them share
> > the only irq vector.
> >
> > Signed-off-by: Jianchao Wang
>
> Reviewed-by: Ming Lei

Thanks, applied with an updated changelog. Not being able to use the admin queue is a pretty big deal, so it's pushed to the next nvme 4.16-rc branch. This may even be good stable material.
Re: [PATCH] nvme-pci: disable APST for Samsung NVMe SSD 960 EVO + ASUS PRIME Z370-A
Thanks, applied for 4.17.