Re: [PATCH RFC 2/2] NVMe: rfc blk-mq support

2013-10-08 Thread Keith Busch

On Tue, 8 Oct 2013, Matias Bjørling wrote:

Convert the driver to blk mq.

The patch consists of:

* Initialization of mq data structures.
* Convert function calls from bio to request data structures.
* IO queues are split into an admin queue and io queues.
* bio splits are removed as splitting should be handled by the block layer.

Signed-off-by: Matias Bjørling 


I have no opinion right now on whether this is a good idea or not. I'll just
comment on a couple of issues with this implementation. Otherwise I think
it's pretty neat and gave me a reason to explore multiqueues!

First a couple minor suggestions:

You might want to use "REQ_END" from rq->cmd_flags to know whether you
should write the queue doorbell or not. Aggregating doorbell writes would
help most devices, but we didn't have a good way of knowing before whether
a given request was the last one or not.

Maybe change nvme_submit_bio_queue to return a BLK_MQ_RQ_QUEUE_ status
so you don't need that switch statement after calling it.
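Roughly what both suggestions could look like against the early blk-mq
queue_rq hook; this is an untested sketch, and nvme_submit_req_queue is a
made-up stand-in for a converted nvme_submit_bio_queue:

static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *rq)
{
	struct nvme_queue *nvmeq = hctx->driver_data;
	int ret;

	spin_lock_irq(&nvmeq->q_lock);
	/* hypothetical helper returning a BLK_MQ_RQ_QUEUE_* status directly */
	ret = nvme_submit_req_queue(nvmeq, rq);
	if (ret == BLK_MQ_RQ_QUEUE_OK && (rq->cmd_flags & REQ_END))
		writel(nvmeq->sq_tail, nvmeq->q_db);	/* ring the doorbell only for the last request */
	spin_unlock_irq(&nvmeq->q_lock);

	return ret;
}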

Must do something about requests that don't align to PRPs. I think you
mentioned this in the outstanding work in [0/2].


struct nvme_queue *get_nvmeq(struct nvme_dev *dev)
{
-   return dev->queues[get_cpu() + 1];
+   get_cpu();
+   return dev->admin_queue;
}


get_nvmeq now returns only the admin queue when it used to return only
IO queues. This breaks NVME_IO and SG_IO ioctl handling.


+static int nvme_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
+ unsigned int i)
+{
+   struct nvme_ns *ns = data;
+   struct nvme_dev *dev = ns->dev;
+   struct nvme_queue *nq;
+
+   nq = nvme_create_queue(dev, i + 1, hctx->queue_depth, i);
+   if (IS_ERR(nq))
+   return PTR_ERR(nq);
+
+   hctx->driver_data = nq;
+
+   return 0;
+}


This right here is the biggest problem with the implementation. It is
going to fail for every namespace but the first one, since each namespace
registers a multiqueue and each multiqueue requires a hw context to
work. The number of queues is per device, not per namespace, so only
the first namespace is going to successfully return from nvme_init_hctx;
the rest will be unable to create an NVMe IO queue because they try to
create one with an already allocated QID.

You should instead create the IO queues on the device like it was
done before, then just set hctx->driver_data to dev->queues[i + 1]
or something like that.
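Something like this untested sketch, reusing the queues the driver already
allocates per device:

static int nvme_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
			  unsigned int i)
{
	struct nvme_ns *ns = data;
	struct nvme_dev *dev = ns->dev;

	/* queue 0 is the admin queue; IO queues were created at probe time */
	hctx->driver_data = dev->queues[i + 1];
	return 0;
}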


+static enum blk_eh_timer_return nvme_timeout(struct request *rq)
+{
+   /* Currently the driver handles timeouts by itself */
+   return BLK_EH_NOT_HANDLED;
+}


Need to do something with the command timeouts here or somewhere. You've
changed the driver to poll only on the admin queue for timed out commands,
and left the multiqueue timeout a no-op.
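One possible direction, purely as an untested sketch with made-up helper
names, would be to drive the existing abort/reset escalation from the
blk-mq timeout instead of returning BLK_EH_NOT_HANDLED:

static enum blk_eh_timer_return nvme_timeout(struct request *rq)
{
	struct nvme_ns *ns = rq->rq_disk->private_data;

	/* Try an Abort command first; give it time to complete. */
	if (nvme_try_abort_request(ns->dev, rq))	/* hypothetical helper */
		return BLK_EH_RESET_TIMER;

	/* Abort failed or exhausted: escalate to a controller reset. */
	nvme_schedule_controller_reset(ns->dev);	/* hypothetical helper */
	return BLK_EH_HANDLED;
}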

Re: [PATCH RFC 0/2] Convert from bio-based to blk-mq

2013-10-09 Thread Keith Busch

On Tue, 8 Oct 2013, Jens Axboe wrote:

On Tue, Oct 08 2013, Matthew Wilcox wrote:

On Tue, Oct 08, 2013 at 11:34:20AM +0200, Matias Bjørling wrote:

The nvme driver implements itself as a bio-based driver. This is primarily
because of high lock congestion for high-performance nvm devices. To remove the
congestion, a multi-queue block layer is being implemented.


Um, no.  You'll crater performance by adding another memory allocation
(of the struct request).  multi-queue is not the solution.


That's a rather "jump to conclusions" statement to make. As Matias
mentioned, there are no extra fast path allocations. Once the tagging is
converted as well, I'd be surprised if it performs worse than before.
And that on top of a net reduction in code.

blk-mq might not be perfect as it stands, but it's a helluva lot better
than a bunch of flash based drivers with lots of duplicated code and
mechanisms. We need to move away from that.

--
Jens Axboe


But this wastes copious amounts of memory on an NVMe device with more
than 1 namespace. The hardware's queues are shared among all namespaces,
so you can't possibly have all the struct requests in use. What would
be better is if I can create one blk-mq for each device/host and attach
multiple gendisks to that.
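For reference, a rough sketch of that shape using the later shared tag-set
API; field and function names here assume the current nvme_dev layout and
are not part of the RFC under discussion:

/* One tag set per controller, shared by every namespace's request queue,
 * so struct requests are sized to the hardware queues rather than being
 * duplicated per namespace. */
static int nvme_dev_alloc_tagset(struct nvme_dev *dev)
{
	dev->tagset.ops = &nvme_mq_ops;		/* assumed driver ops table */
	dev->tagset.nr_hw_queues = dev->queue_count - 1;
	dev->tagset.queue_depth = dev->q_depth;
	dev->tagset.numa_node = dev_to_node(dev->dev);
	dev->tagset.driver_data = dev;

	return blk_mq_alloc_tag_set(&dev->tagset);
}

static int nvme_alloc_ns_queue(struct nvme_dev *dev, struct nvme_ns *ns)
{
	/* Each namespace (gendisk) gets its own request_queue, but all of
	 * them draw tags/requests from the controller's shared tag set. */
	ns->queue = blk_mq_init_queue(&dev->tagset);
	if (IS_ERR(ns->queue))
		return PTR_ERR(ns->queue);
	ns->queue->queuedata = ns;
	return 0;
}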

Re: [PATCH RFC 00/21] blk-mq: Introduce combined hardware queues

2016-09-20 Thread Keith Busch
On Mon, Sep 19, 2016 at 12:38:05PM +0200, Alexander Gordeev wrote:
> On Fri, Sep 16, 2016 at 05:04:48PM -0400, Keith Busch wrote:
> 
> > Having a 1:1 already seemed like the ideal solution since you can't
> > simultaneously utilize more than that from the host, so there's no more
> > h/w parallelism we can exploit. On the controller side, fetching
> > commands is serialized memory reads, so I don't think spreading IO
> > among more h/w queues helps the target over posting more commands to a
> > single queue.
> 
> I take note of the un-ordered command completion you described below.
> But I fail to see why a CPU would not simultaneously utilize more
> than one queue by posting to multiple. Is it due to NVMe specifics, or
> do you assume the host would not issue that many commands?

What I mean is that if you have N CPUs, you can't possibly simultaneously
write more than N submission queue entries. The benefit of having 1:1
for the queue <-> CPU mapping is that each CPU can post a command to
its queue without lock contention at the same time as another thread.
Having more to choose from doesn't let the host post commands any faster
than we can today.
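A minimal sketch of why the 1:1 mapping avoids contention, using the
driver's existing per-queue fields (illustrative only, not real driver
code):

static void nvme_submit_on_this_cpu(struct nvme_dev *dev, struct nvme_command *cmd)
{
	struct nvme_queue *nvmeq = dev->queues[get_cpu() + 1];	/* queue 0 is admin */

	spin_lock(&nvmeq->q_lock);	/* only this CPU ever submits on this queue */
	memcpy(&nvmeq->sq_cmds[nvmeq->sq_tail], cmd, sizeof(*cmd));
	if (++nvmeq->sq_tail == nvmeq->q_depth)
		nvmeq->sq_tail = 0;
	writel(nvmeq->sq_tail, nvmeq->q_db);
	spin_unlock(&nvmeq->q_lock);
	put_cpu();
}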

When we're out of tags, the request currently just waits for one to
become available, increasing submission latency. You can fix that by
increasing the available tags with deeper or more h/w queues, but that
just increases completion latency since the device can't process them
any faster. It's six of one, half dozen of the other.

The depth per queue defaults to 1k. If your process really is able to use
all those resources, the hardware is completely saturated and you're not
going to benefit from introducing more tags [1]. It could conceivably
be worse by reducing cache-hits, or hit inappropriate timeout handling
with the increased completion latency.

 [1] http://lists.infradead.org/pipermail/linux-nvme/2014-July/001064.html


Re: [PATCH 0/3] nvme: Don't add namespaces for locked drives

2016-06-20 Thread Keith Busch
On Sun, Jun 19, 2016 at 04:06:31PM -0700, Jethro Beekman wrote:
> If an NVMe drive is locked with ATA Security, most commands sent to the drive 
> will fail. This includes commands sent by the kernel upon discovery to probe 
> for partitions. The failing happens in such a way that trying to do anything 
> with the drive (e.g. sending an unlock command; unloading the nvme module) is 
> basically impossible with the high default command timeout.

Why is the timeout a factor here? Is it because the error your drive
returns does not have DNR set and goes through 30 seconds of retries?

If so, I think we should probably have a limited retry count instead of
unlimited retries within the command timeout.
 
> This patch adds a check to see if the drive is locked, and if it is, its 
> namespaces are not initialized. It is expected that userspace will send the 
> proper "security send/unlock" command and then reset the controller. 
> Userspace 
> tools are available at [1].

Aren't these security settings per-namespace rather than the entire device?
 
> I intend to also submit a future patch that tracks ATA Security commands sent 
> from userspace and remembers the password so it can be submitted to a locked 
> drive upon pm_resume. (still WIP)

This subjects the system to various attacks like cold boot or hotswap,
but that's what users want!

This is ATA security, though, so wouldn't ATA also benefit from this? The
payload setup/decoding should then go in a generic library for everyone.

Similar was said about the patch adding OPAL security to the NVMe driver:

  http://lists.infradead.org/pipermail/linux-nvme/2016-April/004428.html


Re: [PATCH 0/3] nvme: Don't add namespaces for locked drives

2016-06-20 Thread Keith Busch
On Mon, Jun 20, 2016 at 11:21:09AM -0700, Jethro Beekman wrote:
> On 20-06-16 08:26, Keith Busch wrote:
> 
> Would this just be a matter of setting req->retries and checking for it in
> nvme_req_needs_retry? How does one keep track of the number of tries so far?

I just sent a patch out earlier today that uses req->retries to track the
retry count and an nvme module parameter to set the max retries. I think
that would fix the long delays you're seeing, assuming the patch is okay.
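Roughly the shape of that approach, as an untested sketch; the parameter
name and the DNR check are illustrative rather than copied from the patch:

/* Cap how many times a failed command is requeued, instead of retrying
 * until the command timeout expires. */
static unsigned char nvme_max_retries = 5;
module_param(nvme_max_retries, byte, 0644);
MODULE_PARM_DESC(nvme_max_retries, "max number of retries a command may have");

static inline bool nvme_req_needs_retry(struct request *req, u16 status)
{
	if (blk_noretry_request(req))
		return false;
	if (status & NVME_SC_DNR)		/* Do Not Retry set by the device */
		return false;
	if (req->retries >= nvme_max_retries)
		return false;
	return true;
}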
 
> You're right, I assumed that admin commands can't have namespace ids, but
> looking at the spec, that's not the case. Turns out there's a problem with the
> driver then: nvme_ioctl never includes the ns for NVME_IOCTL_ADMIN_CMD.

The NVME_IOCTL_ADMIN_CMD already takes any namespace identifier the user
put in that field.


Re: [PATCH v4 0/3] nvme power saving

2016-09-22 Thread Keith Busch
On Thu, Sep 22, 2016 at 02:33:36PM -0700, J Freyensee wrote:
> ...and some SSDs don't even support this feature yet, so the number of
> different NVMe devices available to test initially will most likely be
> small (like the Fultondales I have, all I could check is to see if the
> code broke anything if the device did not have this power-save
> feature).
> 
> I agree with Jens, makes a lot of sense to start with this feature
> 'off'.
> 
> To 'advertise' the feature, maybe make the feature a new selection in
> Kconfig?  Example, initially make it "EXPERIMENTAL", and later when
> more devices implement this feature it can be integrated more tightly
> into the NVMe solution and default to on.

Should we just leave the kernel out of this then? I bet we could script
this feature in user space.


Re: [PATCHv3 2/2] x86/vmd: Add PCI domain specific LED option

2016-09-23 Thread Keith Busch
On Fri, Sep 23, 2016 at 09:34:41AM -0500, Bjorn Helgaas wrote:
> I made the necessary changes to match the renaming I did in the first
> patch, and I also used plain old "#ifdef" instead of "#if IS_ENABLED"
> since the rest of the file uses the former style.  If there's a reason
> to switch, we should change the whole file in a separate patch so we
> can explain the rationale.

The check was "IS_ENABLED" because VMD can be a loadable module, which
fails the ifdef check. I see Fengguang 0'dayed it using the module
configuration option. I can send you a fix based on your pci/hotplug
branch, or revert and apply the original patch if you prefer.

BTW, you had asked me not to split a series when incremental fixes
touched a single patch. I didn't resend the whole series here, and while
you got the right patches, I apologize for making them more difficult to find.


Re: [PATCHv3 2/2] x86/vmd: Add PCI domain specific LED option

2016-09-23 Thread Keith Busch
On Fri, Sep 23, 2016 at 02:12:23PM -0500, Bjorn Helgaas wrote:
> BTW, the "Volume Management Device Driver" config item appears by
> itself in the top-level menuconfig menu.  That seems a little ...
> presumptuous; is it what you intended?

Not really intended, but I didn't really know any better at the time. :).

I think drivers/pci/host/ is a more appropriate home for this driver. I'll
send a patch to relocate it, along with necessary Kconfig updates.


Re: [PATCH -next] PCI/PCIe: make DPC explicitly non-modular

2016-07-28 Thread Keith Busch
On Thu, Jul 28, 2016 at 03:33:03PM -0400, Paul Gortmaker wrote:
> The Kconfig currently controlling compilation of this code is:
> 
> drivers/pci/pcie/Kconfig:config PCIE_DPC
> drivers/pci/pcie/Kconfig:   bool "PCIe Downstream Port Containment support"
> 
> ...meaning that it currently is not being built as a module by anyone.
> 
> Let's remove the modular code that is essentially orphaned, so that
> when reading the driver there is no doubt it is builtin-only.
> 
> Since module_init translates to device_initcall in the non-modular
> case, the init ordering remains unchanged with this commit.
> 
> We also delete the MODULE_LICENSE tag etc. since all that information
> was (or is now) contained at the top of the file in the comments.

Thanks for cleaning this up.

FWIW, all of the other pcie service drivers look like they could use
the same cleanup.

Reviewed-by: Keith Busch 


Re: [PATCH] nvme/quirk: Add a delay before checking device ready for memblaze device

2016-09-08 Thread Keith Busch
On Thu, Sep 08, 2016 at 11:36:34AM +0800, Wang Weber wrote:
> Hi Jens,
> 
> The following email was sent before, but it seems to be blocked since I
> received some messages about sending failure. So resend it with Google
> email account.

You're still sending non plain text email, so it's going to be rejected
by the mailing list filter. It also mangles your patch such that it
won't cleanly apply. You have to use plain text for this to work well.


[PATCHv3 1/2] pciehp: Let user control LED status

2016-09-13 Thread Keith Busch
This patch adds a new flag to the pci_dev structure instructing pciehp
to not interpret PCIe slot LED indicators. The pciehp driver will instead
allow all LED control from the user by setting the slot control indicators
as the user requested through sysfs. This is preparing for domain devices
that repurpose these control bits for non-standard use.

Signed-off-by: Keith Busch 
---
v2 -> v3:

  Moved the slot op's attention status callback to pciehp_hpc.c

 drivers/pci/hotplug/pciehp.h  |  5 +
 drivers/pci/hotplug/pciehp_core.c |  3 +++
 drivers/pci/hotplug/pciehp_hpc.c  | 27 +++
 include/linux/pci.h   |  1 +
 4 files changed, 36 insertions(+)

diff --git a/drivers/pci/hotplug/pciehp.h b/drivers/pci/hotplug/pciehp.h
index e764918..7f71ac1 100644
--- a/drivers/pci/hotplug/pciehp.h
+++ b/drivers/pci/hotplug/pciehp.h
@@ -152,6 +152,11 @@ bool pciehp_check_link_active(struct controller *ctrl);
 void pciehp_release_ctrl(struct controller *ctrl);
 int pciehp_reset_slot(struct slot *slot, int probe);
 
+int pciehp_set_raw_attention_status(struct hotplug_slot *hotplug_slot,
+   u8 status);
+int pciehp_get_raw_attention_status(struct hotplug_slot *hotplug_slot,
+   u8 *value);
+
 static inline const char *slot_name(struct slot *slot)
 {
return hotplug_slot_name(slot->hotplug_slot);
diff --git a/drivers/pci/hotplug/pciehp_core.c b/drivers/pci/hotplug/pciehp_core.c
index fb0f863..68b7aeb 100644
--- a/drivers/pci/hotplug/pciehp_core.c
+++ b/drivers/pci/hotplug/pciehp_core.c
@@ -113,6 +113,9 @@ static int init_slot(struct controller *ctrl)
if (ATTN_LED(ctrl)) {
ops->get_attention_status = get_attention_status;
ops->set_attention_status = set_attention_status;
+   } else if (ctrl->pcie->port->user_leds) {
+   ops->get_attention_status = pciehp_get_raw_attention_status;
+   ops->set_attention_status = pciehp_set_raw_attention_status;
}
 
/* register this slot with the hotplug pci core */
diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index 08e84d6..fccb63d 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -355,6 +355,18 @@ static int pciehp_link_enable(struct controller *ctrl)
return __pciehp_link_set(ctrl, true);
 }
 
+int pciehp_get_raw_attention_status(struct hotplug_slot *hotplug_slot,
+   u8 *value)
+{
+   struct slot *slot = hotplug_slot->private;
+   struct pci_dev *pdev = ctrl_dev(slot->ctrl);
+   u16 slot_ctrl;
+
+   pcie_capability_read_word(pdev, PCI_EXP_SLTCTL, &slot_ctrl);
+   *status = (slot_ctrl & (PCI_EXP_SLTCTL_AIC | PCI_EXP_SLTCTL_PIC)) >> 6;
+   return 0;
+}
+
 void pciehp_get_attention_status(struct slot *slot, u8 *status)
 {
struct controller *ctrl = slot->ctrl;
@@ -431,6 +443,17 @@ int pciehp_query_power_fault(struct slot *slot)
return !!(slot_status & PCI_EXP_SLTSTA_PFD);
 }
 
+int pciehp_set_raw_attention_status(struct hotplug_slot *hotplug_slot,
+   u8 status)
+{
+   struct slot *slot = hotplug_slot->private;
+   struct controller *ctrl = slot->ctrl;
+
+   pcie_write_cmd_nowait(ctrl, value << 6,
+ PCI_EXP_SLTCTL_AIC | PCI_EXP_SLTCTL_PIC);
+   return 0;
+}
+
 void pciehp_set_attention_status(struct slot *slot, u8 value)
 {
struct controller *ctrl = slot->ctrl;
@@ -804,6 +827,10 @@ struct controller *pcie_init(struct pcie_device *dev)
}
ctrl->pcie = dev;
pcie_capability_read_dword(pdev, PCI_EXP_SLTCAP, &slot_cap);
+
+   if (pdev->user_leds)
+   slot_cap &= ~(PCI_EXP_SLTCAP_AIP | PCI_EXP_SLTCAP_PIP);
+
ctrl->slot_cap = slot_cap;
mutex_init(&ctrl->ctrl_lock);
init_waitqueue_head(&ctrl->queue);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 7256f33..f41bbca 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -308,6 +308,7 @@ struct pci_dev {
   powered on/off by the
   corresponding bridge */
unsigned intignore_hotplug:1;   /* Ignore hotplug events */
+   unsigned intuser_leds:1;/* User exclusive LED SlotCtl */
unsigned intd3_delay;   /* D3->D0 transition time in ms */
unsigned intd3cold_delay;   /* D3cold->D0 transition time in ms */
 
-- 
2.7.2



[PATCHv3 2/2] x86/vmd: Add PCI domain specific LED option

2016-09-13 Thread Keith Busch
This patch adds a new function to set PCI domain specific options as
devices are added. The usage included in this patch is for LED indicator
control in VMD domains, but may be extended in the future as new domain
specific options are required.

PCIe LED Slot Control in a VMD domain is repurposed to a non-standard
implementation. As such, all devices in a VMD domain will be flagged so
pciehp does not attempt to use LED indicators. This user_leds flag
has pciehp provide a different sysfs entry for user exclusive control
over the domain's slot indicators.

In order to determine if a bus is within a PCI domain, the patch appends
a bool to the pci_sysdata structure that the VMD driver sets during
initialization.

Requested-by: Kapil Karkra 
Tested-by: Artur Paszkiewicz 
Signed-off-by: Keith Busch 
---

No change from previous version of this patch; just part of the series.

 arch/x86/include/asm/pci.h | 14 ++
 arch/x86/pci/common.c  |  7 +++
 arch/x86/pci/vmd.c |  1 +
 3 files changed, 22 insertions(+)

diff --git a/arch/x86/include/asm/pci.h b/arch/x86/include/asm/pci.h
index 9ab7507..1411dbe 100644
--- a/arch/x86/include/asm/pci.h
+++ b/arch/x86/include/asm/pci.h
@@ -23,6 +23,9 @@ struct pci_sysdata {
 #ifdef CONFIG_PCI_MSI_IRQ_DOMAIN
void*fwnode;/* IRQ domain for MSI assignment */
 #endif
+#if IS_ENABLED(CONFIG_VMD)
+   bool vmd_domain;/* True if in Intel VMD domain */
+#endif
 };
 
 extern int pci_routeirq;
@@ -56,6 +59,17 @@ static inline void *_pci_root_bus_fwnode(struct pci_bus *bus)
#define pci_root_bus_fwnode	_pci_root_bus_fwnode
 #endif
 
+static inline bool is_vmd(struct pci_bus *bus)
+{
+#if IS_ENABLED(CONFIG_VMD)
+   struct pci_sysdata *sd = bus->sysdata;
+
+   return sd->vmd_domain;
+#else
+   return false;
+#endif
+}
+
 /* Can be used to override the logic in pci_scan_bus for skipping
already-configured bus numbers - to be used for buggy BIOSes
or architectures with incomplete PCI setup by the loader */
diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
index 7b6a9d1..ccf696c 100644
--- a/arch/x86/pci/common.c
+++ b/arch/x86/pci/common.c
@@ -677,6 +677,12 @@ static void set_dma_domain_ops(struct pci_dev *pdev)
 static void set_dma_domain_ops(struct pci_dev *pdev) {}
 #endif
 
+static void set_dev_domain_options(struct pci_dev *pdev)
+{
+   if (is_vmd(pdev->bus))
+   pdev->user_leds = 1;
+}
+
 int pcibios_add_device(struct pci_dev *dev)
 {
struct setup_data *data;
@@ -707,6 +713,7 @@ int pcibios_add_device(struct pci_dev *dev)
iounmap(data);
}
set_dma_domain_ops(dev);
+   set_dev_domain_options(dev);
return 0;
 }
 
diff --git a/arch/x86/pci/vmd.c b/arch/x86/pci/vmd.c
index b814ca6..a021b7b 100644
--- a/arch/x86/pci/vmd.c
+++ b/arch/x86/pci/vmd.c
@@ -596,6 +596,7 @@ static int vmd_enable_domain(struct vmd_dev *vmd)
.parent = res,
};
 
+   sd->vmd_domain = true;
sd->domain = vmd_find_free_domain();
if (sd->domain < 0)
return sd->domain;
-- 
2.7.2



Re: [PATCHv3 1/2] pciehp: Let user control LED status

2016-09-13 Thread Keith Busch
On Tue, Sep 13, 2016 at 09:05:39AM -0600, Keith Busch wrote:
> +int pciehp_get_raw_attention_status(struct hotplug_slot *hotplug_slot,
> + u8 *value)
> +{
> + struct slot *slot = hotplug_slot->private;
> + struct pci_dev *pdev = ctrl_dev(slot->ctrl);
> + u16 slot_ctrl;
> +
> + pcie_capability_read_word(pdev, PCI_EXP_SLTCTL, &slot_ctrl);
> + *status = (slot_ctrl & (PCI_EXP_SLTCTL_AIC | PCI_EXP_SLTCTL_PIC)) >> 6;

Sorry, I generated this patch from a wrong branch. Won't compile, will
send the correct series. Apologies for the churn.


Re: blk-mq: allow passing in an external queue mapping V3

2016-09-15 Thread Keith Busch
On Wed, Sep 14, 2016 at 04:18:46PM +0200, Christoph Hellwig wrote:
> This series is the remainder of the earlier "automatic interrupt affinity for
> MSI/MSI-X capable devices" series, and makes use of the new irq-level
> interrupt / queue mapping code in blk-mq, as well as allowing the driver
> to pass in such a mask obtained from the (PCI) interrupt code.  To fully
> support this feature in drivers the final third in the PCI layer will
> be needed as well.

Thanks, this looks good and tests successfully on my hardware.

For the series:

Reviewed-by: Keith Busch 


Re: NVME regression in all kernels after 4.4.x for NVME in M2 slot for laptop?

2016-08-05 Thread Keith Busch
On Fri, Aug 05, 2016 at 12:03:23PM -0700, Marc MERLIN wrote:
> Would this patch make sense as being the reason why I can't S3 sleep
> anymore and would you have a test patch against 4.5, 4.6, or 4.7 I can
> try to see if it fixes the problem?

Hi Marc,

It might be that blk-mq's hot cpu notifier is invoked during suspend and
waits for nvme's queues to freeze, which may not happen if a request
is waiting on a stopped queue. The patch you bisected doesn't
necessarily fix that, but it made the window for a request to get queued
like that much shorter.

Assuming that is the problem, S3 suspends PCI hardware before IO
tasks. I'll see if I can reproduce on one of my machines and look into
a fix.

Thanks,
Keith


Re: [PATCH] nvme: Fix nvme_get/set_features() with a NULL result pointer

2016-08-26 Thread Keith Busch
On Fri, Aug 26, 2016 at 04:35:57PM +0200, Christoph Hellwig wrote:
> On Fri, Aug 26, 2016 at 07:31:33AM -0700, Andy Lutomirski wrote:
> >  - Consider *deleting* the SCSI translation layer's power saving code.
> > It looks almost entirely bogus to me.  It has an off-by-one in its
> > NPSS handling, it hardcodes power state indices which is total BS, it
> > ignores the distinction between operational and non-operational states
> > (which I think matters for non-APST usage).  It also seems likely to
> > be that it's never been used, since it's one of the formerly
> > crashy-looking set_features users.
> 
> Please go ahead and send a patch to delete it.  Adding the whole SCSI
> layer was a mistake to start with, and it's always been horribly buggy.
> Until I started running the libiscsi testsuite even fairly normal I/O
> commands were a sure way to crash it, and crazy things like PM are
> almost guaranteed to a) not actually be used by real applications and
> b) be horribly buggy (as you've already noticed)

Ack. If no distros or tools rely on the SCSI crutch anymore, then
by all means, let's delete it. It's been disabled by default for a while
now, and I think/hope everyone we care about has since migrated to
nvme awareness.


Re: [patch 15/55] PCI: vmd: Create named irq domain

2017-06-20 Thread Keith Busch
On Tue, Jun 20, 2017 at 01:37:15AM +0200, Thomas Gleixner wrote:
>  static int vmd_enable_domain(struct vmd_dev *vmd)
>  {
>   struct pci_sysdata *sd = &vmd->sysdata;
> + struct fwnode_handle *fn;
>   struct resource *res;
>   u32 upper_bits;
>   unsigned long flags;
> @@ -617,8 +618,13 @@ static int vmd_enable_domain(struct vmd_
>  
>   sd->node = pcibus_to_node(vmd->dev->bus);
>  
> - vmd->irq_domain = pci_msi_create_irq_domain(NULL, &vmd_msi_domain_info,
> + fn = irq_domain_alloc_named_id_fwnode("VMD-MSI", vmd->sysdata.domain);
> + if (!fn)
> + return -ENODEV;
> +
> + vmd->irq_domain = pci_msi_create_irq_domain(fn, &vmd_msi_domain_info,
>   x86_vector_domain);
> + kfree(fn);

If I'm following all this correctly, it looks like we need to use
irq_domain_free_fwnode with irq_domain_alloc_named_id_fwnode instead of
freeing 'fn' directly, otherwise we leak 'fwid->name'.
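i.e. something like the below sketch for the error path (and presumably the
fwnode has to stay alive, or be released the same way, for as long as the
domain exists):

	fn = irq_domain_alloc_named_id_fwnode("VMD-MSI", vmd->sysdata.domain);
	if (!fn)
		return -ENODEV;

	vmd->irq_domain = pci_msi_create_irq_domain(fn, &vmd_msi_domain_info,
						    x86_vector_domain);
	if (!vmd->irq_domain) {
		irq_domain_free_fwnode(fn);	/* frees fwid->name too, unlike kfree(fn) */
		return -ENODEV;
	}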


Re: [patch 32/55] x86/irq: Restructure fixup_irqs()

2017-06-20 Thread Keith Busch
On Tue, Jun 20, 2017 at 01:37:32AM +0200, Thomas Gleixner wrote:
> @@ -441,18 +440,27 @@ void fixup_irqs(void)
>  
>   for_each_irq_desc(irq, desc) {
>   const struct cpumask *affinity;
> - int break_affinity = 0;
> - int set_affinity = 1;
> + bool break_affinity = false;
>  
>   if (!desc)
>   continue;
> - if (irq == 2)
> - continue;
>  
>   /* interrupt's are disabled at this point */
>   raw_spin_lock(&desc->lock);
>  
>   data = irq_desc_get_irq_data(desc);
> + chip = irq_data_get_irq_chip(data);
> + /*
> +  * The interrupt descriptor might have been cleaned up
> +  * already, but it is not yet removed from the radix
> +  * tree. If the chip does not have an affinity setter,
> +  * nothing to do here.
> +  */
> + if (!chip !chip->irq_set_affinity) {
> + raw_spin_unlock(&desc->lock);
> + continue;
> + }

A bit of a moot point since the very next patch deletes all of this,
but found this broken 'if' condition when compiling one at a time,
missing the '&&'.


Re: [PATCH V2 1/1] nvme: fix multiple ctrl removal scheduling

2017-05-26 Thread Keith Busch
On Wed, May 24, 2017 at 05:26:25PM +0300, Rakesh Pandit wrote:
> Commit c5f6ce97c1210 tries to address multiple resets but fails as
> work_busy doesn't involve any synchronization and can fail.  This is
> reproducible easily as can be seen by WARNING below which is triggered
> with line:
> 
> WARN_ON(dev->ctrl.state == NVME_CTRL_RESETTING)
> 
> Allowing multiple resets can result in multiple controller removal as
> well if different conditions inside nvme_reset_work fail and which
> might deadlock on device_release_driver.
> 
> This patch makes sure that work queue item (reset_work) is added only
> if controller state != NVME_CTRL_RESETTING and that is achieved by
> moving state change outside nvme_reset_work into nvme_reset and
> removing old work_busy call.  State change is always synchronizated
> using controller spinlock.

So, the reason the state is changed when the work runs rather than when
it is queued is the window where the state may be set to NVME_CTRL_DELETING,
and we don't want the reset work to proceed in that case.

What do you think about adding a new state, like NVME_CTRL_SCHED_RESET,
then leaving the NVME_CTRL_RESETTING state change as-is?
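Something like this untested sketch, where NVME_CTRL_SCHED_RESET is the
hypothetical new state and the work item would transition it to
NVME_CTRL_RESETTING when it actually runs:

static int nvme_reset(struct nvme_dev *dev)
{
	/* claim the "scheduled" state before queueing the work */
	if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_SCHED_RESET))
		return -EBUSY;		/* already scheduled, resetting, or deleting */
	if (!queue_work(nvme_workq, &dev->reset_work))
		return -EBUSY;
	return 0;
}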


Re: [PATCH 2/2] nvme: Quirk APST on Intel 600P/P3100 devices

2017-05-30 Thread Keith Busch
On Wed, May 24, 2017 at 03:06:31PM -0700, Andy Lutomirski wrote:
> They have known firmware bugs.  A fix is apparently in the works --
> once fixed firmware is available, someone from Intel (Hi, Keith!)
> can adjust the quirk accordingly.

Here's the latest firmware with all the known fixes:

  https://downloadcenter.intel.com/download/26491?v=t

I am not 100% sure if this fixes the issue you're quirking, but I think
it falls under the link initialization and ASPM issues mentioned in the
release notes.


Re: [PATCH 1/3] nvme: do not check for ns on rw path

2017-11-03 Thread Keith Busch
On Fri, Nov 03, 2017 at 01:53:40PM +0100, Christoph Hellwig wrote:
> > -   if (ns && ns->ms &&
> > +   if (ns->ms &&
> > (!ns->pi_type || ns->ms != sizeof(struct t10_pi_tuple)) &&
> > !blk_integrity_rq(req) && !blk_rq_is_passthrough(req))
> > return BLK_STS_NOTSUPP;
> 
> blk_rq_is_passthrough also can't be true here.
> 
> How about:
> 
>   if (ns->ms && !blk_integrity_rq(req) &&
>   (!ns->pi_type || ns->ms != sizeof(struct t10_pi_tuple)))
>   return BLK_STS_NOTSUPP;
> 
> Although I have to admit I don't really understand what this check
> is even trying to do.  It basically checks for a namespace that has
> a format with metadata that is not T10 protection information and
> then rejects all I/O to it.  Why are we even creating a block device
> node for such a thing?

If the namespace has metadata, but the request doesn't have a metadata
payload attached to it for whatever reason, we can't construct the command
for that format. We also can't have the controller strip/generate the
payload with PRACT bit set if it's not a T10 format, so we just fail
the command.


Re: [PATCH 3/3] nvme: fix eui_show() print format

2017-11-03 Thread Keith Busch
On Fri, Nov 03, 2017 at 01:55:16PM +0100, Christoph Hellwig wrote:
> On Fri, Nov 03, 2017 at 11:02:50AM +0100, Javier González wrote:
> > Signed-off-by: Javier González 
> > ---
> >  drivers/nvme/host/core.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > index ae8ab0a1ef0d..f05c81774abf 100644
> > --- a/drivers/nvme/host/core.c
> > +++ b/drivers/nvme/host/core.c
> > @@ -2108,7 +2108,7 @@ static ssize_t eui_show(struct device *dev, struct device_attribute *attr,
> > char *buf)
> >  {
> > struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
> > -   return sprintf(buf, "%8phd\n", ns->eui);
> > +   return sprintf(buf, "%8phD\n", ns->eui);
> >  }
> >  static DEVICE_ATTR(eui, S_IRUGO, eui_show, NULL);
> 
> This looks correct.  I wonder what the old code printed - does someone
> have a device with an EUI-64 at hand to quickly cross check what we
> did before?

It just prints the same as the 'ph' format, which would look like this:

  01 02 03 04 05 06 07 08

The change will make it look like this:

  01-02-03-04-05-06-07-08

I think that was the original intention.

Reviewed-by: Keith Busch 


Re: [PATCH 1/3] nvme: do not check for ns on rw path

2017-11-04 Thread Keith Busch
On Sat, Nov 04, 2017 at 09:18:25AM +0100, Christoph Hellwig wrote:
> On Fri, Nov 03, 2017 at 09:02:04AM -0600, Keith Busch wrote:
> > If the namespace has metadata, but the request doesn't have a metadata
> > payload attached to it for whatever reason, we can't construct the command
> > for that format. We also can't have the controller strip/generate the
> > payload with PRACT bit set if it's not a T10 format, so we just fail
> > the command.
> 
> The only metadata payload a READ or WRITE request can have is a protection
> information one.  For a namespace formatted with protection information
> bio_integrity_prep as called from blk_mq_make_request will ensure we
> always have metadata attached.
> 
> If a namespace is formatted with non-PI metadata we will never have
> metadata attached to the bio/request and should not even present the
> namespace to the kernel.

That's not quite right. For non-PI metadata formats, we use the
'nop_profile', which gets the metadata buffer allocated so we can safely
use a metadata formatted namespace. There's no in-kernel user of the
allocated payload, but we still need the metadata buffer in order to
use the namespace at all.
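Roughly what that registration looks like, as a sketch from memory rather
than the exact driver code (the helper name here is made up):

static void nvme_init_nop_integrity(struct nvme_ns *ns)
{
	struct blk_integrity integrity = {
		.profile	= NULL,		/* no T10 PI generate/verify hooks */
		.tuple_size	= ns->ms,	/* metadata bytes per block */
	};

	/* The block layer falls back to its internal nop profile and
	 * attaches a metadata buffer to each bio via bio_integrity_prep,
	 * even though nothing in the kernel consumes the contents. */
	blk_integrity_register(ns->disk, &integrity);
	blk_queue_max_integrity_segments(ns->queue, 1);
}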


Re: [PATCH 1/3] nvme: do not check for ns on rw path

2017-11-06 Thread Keith Busch
On Mon, Nov 06, 2017 at 10:13:24AM +0100, Christoph Hellwig wrote:
> On Sat, Nov 04, 2017 at 09:38:45AM -0600, Keith Busch wrote:
> > That's not quite right. For non-PI metadata formats, we use the
> > 'nop_profile', which gets the metadata buffer allocated so we can safely
> > use a metadata formatted namespace. There's no in-kernel user of the
> > allocated payload, but we still need the metadata buffer in order to
> > use the namespace at all.
> 
> You're right.  But that means we will indeed always have a matching
> integrity payload here and the check should not be needed.
> 
> Are you fine with turning it into something like:
> 
>   if (WARN_ON_ONCE(ns->ms && !blk_integrity_rq(req)))
>   return BLK_STS_IOERR;
> 
> ?

Yes, that looks fine. You're right that we are not supposed to be able
to get into this path for read/write since the driver sets the capacity
to 0 if a metadata format doesn't have integrity support, so hitting
this warning would indicate something has gone wrong.


Re: [RFC PATCHv2] x86/pci: Initial commit for new VMD device driver

2015-10-05 Thread Keith Busch

On Sat, 3 Oct 2015, Ingo Molnar wrote:

* Keith Busch  wrote:

+config VMDDEV
+   depends on PCI && PCI_DOMAINS && PCI_MSI && GENERIC_MSI_IRQ_DOMAIN && IRQ_DOMAIN_HIERARCHY
+   tristate "Volume Management Device Driver"
+   default N
+   select HAVE_VMDDEV
+   ---help---
+   Adds support for the Intel Volume Management Device (VMD). VMD is
+   a secondary PCI host bridge that allows PCI Express root ports,
+   and devices attached to them, to be removed from the default PCI
+   domain and placed within the VMD domain. If your system provides
+   one of these and has devices attached to it, say "Y".


So what this text does not explain is why does the user care? What tangible
benefits does this feature offer to users?


Hi Ingo,

The immediate benefit is that devices on VMD domains do not use resources
on the default PCI domain, so we have more than the 256 buses available.


Re: [RFC PATCHv2] x86/pci: Initial commit for new VMD device driver

2015-10-05 Thread Keith Busch

On Mon, 5 Oct 2015, Ingo Molnar wrote:

* Keith Busch  wrote:

The immediate benefit is that devices on VMD domains do not use resources
on the default PCI domain, so we have more than the 256 buses available.


Would be nice to incorporate that information in the help text and in the
changelog.


Fair enough. Perhaps we thought the "secondary domain" description
implied that information, but maybe we've been doing this too long. :)

Thanks for the feedback.


Re: [RFC PATCHv2] x86/pci: Initial commit for new VMD device driver

2015-10-06 Thread Keith Busch

Hi Bjorn,

Thanks for the feedback. Most of the issues you mentioned look pretty
straightforward to resolve, and I will fix them up for the next revision.

I have some immediate follow up comments to two issues you brought up:

On Tue, 6 Oct 2015, Bjorn Helgaas wrote:

+static int vmd_find_free_domain(void)
+{
+   int domain = -1;
+   struct pci_bus *bus = NULL;
+
+   while ((bus = pci_find_next_bus(bus)) != NULL)
+   domain = max_t(int, domain, pci_domain_nr(bus));
+   if (domain < 0)
+   return -ENODEV;
+   return domain + 1;
+}


The PCI core should own domain number allocation.  We have a little
bit of generic domain support, e.g., pci_get_new_domain_nr().  On x86,
I think that is compiled-in, but currently not used -- currently x86
only uses _SEG from ACPI.  How would you handle collisions between a
domain number assigned here (or via pci_get_new_domain_nr()) and a
hot-added host bridge where _SEG uses the same domain?


Thank you for bringing this up, as I hadn't thought much about it and may
have misunderstood the meaning of _SEG. AIUI, it is used to identify a
logical grouping. The OS does not need to use the same identifier for
the domain, so there's no need to collide if we can assign the domain
of a new _SEG to the next available domain_nr.

On pci_get_new_domain_nr(), can we make this use something like an
ida instead of an atomic? I think we'd like to put domains back into
the available pool if someone unbinds it. I imagine someone will test
unbinding/rebinding these devices in a loop for a while, and will file
a bug after the atomic increments to 64k. :)
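Something like this hypothetical sketch (pci_put_domain_nr is a made-up
name for the release side):

/* Hand out PCI domain numbers from an ida so they return to the pool
 * when a host bridge goes away, instead of a forever-incrementing atomic. */
static DEFINE_IDA(pci_domain_ida);

int pci_get_new_domain_nr(void)
{
	return ida_simple_get(&pci_domain_ida, 0, 0, GFP_KERNEL);
}

void pci_put_domain_nr(int domain)
{
	ida_simple_remove(&pci_domain_ida, domain);
}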


+   resource_list_for_each_entry(entry, &resources) {
+   struct resource *source, *resource = entry->res;
+
+   if (!i) {
+   resource->start = 0;
+   resource->end = (resource_size(
+   &vmd->dev->resource[0]) >> 20) - 1;
+   resource->flags = IORESOURCE_BUS | IORESOURCE_PCI_FIXED;


I thought BAR0 was CFGBAR.  I missed the connection to a bus number
aperture.


Right, BAR0 is the CFGBAR and is the device's aperture to access its
domain's config space. Based on your comment, I assume that's not
a bus resource, though it seems to work with the posted code. :)


Re: [RFC PATCHv2] x86/pci: Initial commit for new VMD device driver

2015-10-07 Thread Keith Busch

On Tue, 6 Oct 2015, Keith Busch wrote:

On Tue, 6 Oct 2015, Bjorn Helgaas wrote:

+   resource_list_for_each_entry(entry, &resources) {
+   struct resource *source, *resource = entry->res;
+
+   if (!i) {
+   resource->start = 0;
+   resource->end = (resource_size(
+   &vmd->dev->resource[0]) >> 20) - 1;
+			resource->flags = IORESOURCE_BUS | IORESOURCE_PCI_FIXED;


I thought BAR0 was CFGBAR.  I missed the connection to a bus number
aperture.


Right, BAR0 is the CFGBAR and is the device's aperture to access its
domain's config space.


It's a new day, I'll try a new explanation on what this is about. The size
of the CFGBAR determines how many bus numbers can be reached through the
device's config space aperture. We are not setting the bus resource to
BAR0; just determining the bus resource ending based on BAR0's size. We
expect the bar to be 256M to access config space for 256 buses:

  8 functions * 32 devices * 256 buses * 4k config space per function = 256M

If the BAR wasn't provided 256M for any reason, we reduce the number of
bus resources this domain can provide.
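In other words, something like this illustrative helper:

static unsigned int vmd_cfgbar_buses(struct vmd_dev *vmd)
{
	/*
	 * 4k of config space per function * 8 functions * 32 devices
	 * = 1M of CFGBAR per bus, so shifting the BAR size right by 20
	 * gives the number of reachable buses (256 for a full 256M BAR).
	 */
	return resource_size(&vmd->dev->resource[0]) >> 20;
}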


[RFC PATCH 1/2] x86: PCI bus specific MSI operations

2015-08-27 Thread Keith Busch
This patch adds struct x86_msi_ops to x86's PCI sysdata. This gives a
host bridge driver the option to provide alternate MSI Data Register
and MSI-X Table Entry programming for devices in PCI domains that do
not subscribe to the usual "IOAPIC" format.

Signed-off-by: Keith Busch 
CC: Bryan Veal 
CC: Dan Williams 
CC: x...@kernel.org
CC: linux-kernel@vger.kernel.org
CC: linux-...@vger.kernel.org
---
 arch/x86/include/asm/pci.h |3 +++
 arch/x86/kernel/x86_init.c |   19 +++
 2 files changed, 22 insertions(+)

diff --git a/arch/x86/include/asm/pci.h b/arch/x86/include/asm/pci.h
index 4625943..98f3802 100644
--- a/arch/x86/include/asm/pci.h
+++ b/arch/x86/include/asm/pci.h
@@ -20,6 +20,9 @@ struct pci_sysdata {
 #ifdef CONFIG_X86_64
void*iommu; /* IOMMU private data */
 #endif
+#ifdef CONFIG_PCI_MSI
+   struct x86_msi_ops *msi_ops;
+#endif
 };
 
 extern int pci_routeirq;
diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
index 3839628..416de84 100644
--- a/arch/x86/kernel/x86_init.c
+++ b/arch/x86/kernel/x86_init.c
@@ -118,21 +118,40 @@ struct x86_msi_ops x86_msi = {
 /* MSI arch specific hooks */
 int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
 {
+   struct pci_sysdata *sysdata = dev->bus->sysdata;
+
+   if (sysdata && sysdata->msi_ops && sysdata->msi_ops->setup_msi_irqs)
+   return sysdata->msi_ops->setup_msi_irqs(dev, nvec, type);
return x86_msi.setup_msi_irqs(dev, nvec, type);
 }
 
 void arch_teardown_msi_irqs(struct pci_dev *dev)
 {
+   struct pci_sysdata *sysdata = dev->bus->sysdata;
+
+   if (sysdata && sysdata->msi_ops && sysdata->msi_ops->teardown_msi_irqs)
+   return sysdata->msi_ops->teardown_msi_irqs(dev);
x86_msi.teardown_msi_irqs(dev);
 }
 
 void arch_teardown_msi_irq(unsigned int irq)
 {
+   struct msi_desc *desc =  irq_get_msi_desc(irq);
+   struct pci_sysdata *sysdata = NULL;
+
+   if (desc)
+   sysdata = desc->dev->bus->sysdata;
+   if (sysdata && sysdata->msi_ops && sysdata->msi_ops->teardown_msi_irq)
+   return sysdata->msi_ops->teardown_msi_irq(irq);
x86_msi.teardown_msi_irq(irq);
 }
 
 void arch_restore_msi_irqs(struct pci_dev *dev)
 {
+   struct pci_sysdata *sysdata = dev->bus->sysdata;
+
+   if (sysdata && sysdata->msi_ops && sysdata->msi_ops->restore_msi_irqs)
+   return sysdata->msi_ops->restore_msi_irqs(dev);
x86_msi.restore_msi_irqs(dev);
 }
 #endif
-- 
1.7.10.4



[RFC PATCH 0/2] Driver for new PCI-e device

2015-08-27 Thread Keith Busch
The Intel Volume Management Device (VMD) is an integrated endpoint on the
platform's PCIe root complex that acts as a host bridge to a secondary
PCIe domain. BIOS can reassign one or more root ports to appear within
VMD domains instead of the primary domain. This driver enables and
enumerates the VMD domain using the root bus configuration interface
provided by the PCI subsystem.

CC: Bryan Veal 
CC: Dan Williams 
CC: x...@kernel.org
CC: linux-kernel@vger.kernel.org
CC: linux-...@vger.kernel.org

Keith Busch (2):
  x86: PCI bus specific MSI operations
  x86/pci: Initial commit for new VMD device driver

 arch/x86/Kconfig   |   11 ++
 arch/x86/include/asm/pci.h |3 +
 arch/x86/kernel/x86_init.c |   19 ++
 arch/x86/pci/Makefile  |2 +
 arch/x86/pci/vmd.c |  412 
 5 files changed, 447 insertions(+)
 create mode 100644 arch/x86/pci/vmd.c

-- 
1.7.10.4



[RFC PATCH 2/2] x86/pci: Initial commit for new VMD device driver

2015-08-27 Thread Keith Busch
The Intel Volume Management Device (VMD) is an integrated endpoint on the
platform's PCIe root complex that acts as a host bridge to a secondary
PCIe domain. BIOS can reassign one or more root ports to appear within
the VMD domain instead of the primary domain.

This driver enumerates and enables the domain using the root bus
configuration interface provided by the PCI subsystem. The driver
provides configuration space accessor functions (pci_ops), bus and memory
resources, MSI configuration functions, and irq_chip implementation
necessary to support the domain through the VMD endpoint's interface.

VMD routes I/O as follows:

  1) Configuration Space: BAR 0 ("CFGBAR") of VMD provides the base
  address and size for configuration space register access to VMD-owned
  root ports. It works similarly to MMCONFIG for extended configuration
  space. Bus numbering is independent and does not conflict with the
  primary domain.

  2) MMIO Space: BARs 2 and 4 ("MEMBAR1" and "MEMBAR2") of VMD provide the
  base address, size, and type for MMIO register access. These addresses
  are not translated by VMD hardware; they are simply reservations to be
  distributed to root ports' memory base/limit registers and subdivided
  among devices downstream.

  3) DMA: To interact appropriately with IOMMU, the source ID DMA read
  and write requests are translated to the bus-device-function of the
  VMD endpoint. Otherwise, DMA operates normally without VMD-specific
  address translation.

  4) Interrupts: Part of VMD's BAR 4 is reserved for VMD's MSI-X Table and
  PBA. MSIs from VMD domain devices and ports are remapped to appear as if
  they were issued using one of VMD's MSI-X table entries. Each MSI and
  MSI-X address of a VMD-owned device or port has a special format
  where the address refers to specific entries in VMD's MSI-X table. As with
  DMA, the interrupt source id is translated to VMD's bus-device-function.

  The driver provides its own MSI and MSI-X configuration functions
  specific to how MSI messages are used within the VMD domain, and
  it provides an irq_chip for independent IRQ allocation and to relay
  interrupts from VMD's interrupt handler to the appropriate device
  driver's handler.

  5) Errors: PCIe error messages are intercepted by the root ports normally
  (e.g. AER), except with VMD, system errors (i.e. firmware first) are
  disabled by default. AER and hotplug interrupts are translated in the
  same way as endpoint interrupts.

  6) VMD does not support INTx interrupts or IO ports. Devices or drivers
  requiring these features should either not be placed below VMD-owned
  root ports, or VMD should be disabled by BIOS for such endpoints.

Contributers to this patch include:

  Artur Paszkiewicz 
  Bryan Veal 
  Jon Derrick 

Signed-off-by: Keith Busch 
CC: Bryan Veal 
CC: Dan Williams 
CC: x...@kernel.org
CC: linux-kernel@vger.kernel.org
CC: linux-...@vger.kernel.org
---
 arch/x86/Kconfig  |   11 ++
 arch/x86/pci/Makefile |2 +
 arch/x86/pci/vmd.c|  412 +
 3 files changed, 425 insertions(+)
 create mode 100644 arch/x86/pci/vmd.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b3a1a5d..a8cad59 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2584,6 +2584,17 @@ config PMC_ATOM
def_bool y
 depends on PCI
 
+config VMDDEV
+   depends on PCI && PCI_DOMAINS && PCI_MSI
+   tristate "Volume Management Device Driver"
+   default N
+   ---help---
+   Adds support for the Intel Volume Management Device (VMD). VMD is
+   a secondary PCI host bridge that allows PCI Express root ports,
+   and devices attached to them, to be removed from the default PCI
+   domain and placed within the VMD domain. If your system provides
+   one of these, and you have devices attached to it, say "Y".
+
 source "net/Kconfig"
 
 source "drivers/Kconfig"
diff --git a/arch/x86/pci/Makefile b/arch/x86/pci/Makefile
index 5c6fc35..c204079 100644
--- a/arch/x86/pci/Makefile
+++ b/arch/x86/pci/Makefile
@@ -23,6 +23,8 @@ obj-y += bus_numa.o
 obj-$(CONFIG_AMD_NB)   += amd_bus.o
 obj-$(CONFIG_PCI_CNB20LE_QUIRK)+= broadcom_bus.o
 
+obj-$(CONFIG_VMDDEV) += vmd.o
+
 ifeq ($(CONFIG_PCI_DEBUG),y)
 EXTRA_CFLAGS += -DDEBUG
 endif
diff --git a/arch/x86/pci/vmd.c b/arch/x86/pci/vmd.c
new file mode 100644
index 000..ea8bb4c
--- /dev/null
+++ b/arch/x86/pci/vmd.c
@@ -0,0 +1,412 @@
+/*
+ * Volume Management Device driver
+ * Copyright (c) 2015, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ *

Re: [RFC PATCH 1/2] x86: PCI bus specific MSI operations

2015-08-28 Thread Keith Busch

On Fri, 28 Aug 2015, Thomas Gleixner wrote:

On Thu, 27 Aug 2015, Keith Busch wrote:


This patch adds struct x86_msi_ops to x86's PCI sysdata. This gives a
host bridge driver the option to provide alternate MSI Data Register
and MSI-X Table Entry programming for devices in PCI domains that do
not subscribe to usual "IOAPIC" format.


I'm not too fond about more ad hoc indirection and special casing. We
should be able to handle this with hierarchical irq domains. Jiang
might have an idea how to do that for your case.


Thank you for the suggestion, I will take a closer look at this again. All
the better if we don't require an arch specific dependency.

I asked Jiang about domain hierarchies a few weeks ago and understood
these are suited for irq controllers, but the VMD device is an aggregator.

Here's a little more h/w info in case it helps steer me in the right
direction: VMD muxes all the device interrupts in its PCI domain into
one of several VMD h/w irqs. The VMD driver then has to de-mux that into
CPU side irqs for each device.
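For context, the de-mux ends up looking roughly like this untested sketch
(structure names are illustrative):

static irqreturn_t vmd_irq(int irq, void *data)
{
	struct vmd_irq_list *irqs = data;	/* interrupts mapped onto this VMD vector */
	struct vmd_irq *vmdirq;

	rcu_read_lock();
	list_for_each_entry_rcu(vmdirq, &irqs->irq_list, node)
		generic_handle_irq(vmdirq->virq);	/* relay to the device driver's handler */
	rcu_read_unlock();

	return IRQ_HANDLED;
}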


[PATCH] Regulator: Suppress compiler warnings

2015-08-31 Thread Keith Busch
Some compilers complain of possible uninitialized variable usage, like
the following:

  drivers/regulator/helpers.c: In function ‘regulator_get_bypass_regmap’:
  drivers/regulator/helpers.c:463:16: warning: ‘val’ may be used uninitialized in this function [-Wuninitialized]

The code is safe though, and only uses the variables if they were
successfully set, so suppressing the warning with uninitialized_var().

Signed-off-by: Keith Busch 
---
 drivers/regulator/helpers.c |6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/regulator/helpers.c b/drivers/regulator/helpers.c
index 3bbb326..bb5b22a 100644
--- a/drivers/regulator/helpers.c
+++ b/drivers/regulator/helpers.c
@@ -30,7 +30,7 @@
  */
 int regulator_is_enabled_regmap(struct regulator_dev *rdev)
 {
-   unsigned int val;
+   unsigned int uninitialized_var(val);
int ret;
 
ret = regmap_read(rdev->regmap, rdev->desc->enable_reg, &val);
@@ -114,7 +114,7 @@ EXPORT_SYMBOL_GPL(regulator_disable_regmap);
  */
 int regulator_get_voltage_sel_regmap(struct regulator_dev *rdev)
 {
-   unsigned int val;
+   unsigned int uninitialized_var(val);
int ret;
 
ret = regmap_read(rdev->regmap, rdev->desc->vsel_reg, &val);
@@ -453,7 +453,7 @@ EXPORT_SYMBOL_GPL(regulator_set_bypass_regmap);
  */
 int regulator_get_bypass_regmap(struct regulator_dev *rdev, bool *enable)
 {
-   unsigned int val;
+   unsigned int uninitialized_var(val);
int ret;
 
ret = regmap_read(rdev->regmap, rdev->desc->bypass_reg, &val);
-- 
1.7.10.4



Re: [PATCH] Regulator: Suppress compiler warnings

2015-09-01 Thread Keith Busch

On Tue, 1 Sep 2015, Mark Brown wrote:

On Tue, Sep 01, 2015 at 09:52:13AM +0900, Krzysztof Kozlowski wrote:

2015-09-01 1:41 GMT+09:00 Keith Busch :

 int regulator_is_enabled_regmap(struct regulator_dev *rdev)
 {
-   unsigned int val;
+   unsigned int uninitialized_var(val);
int ret;

ret = regmap_read(rdev->regmap, rdev->desc->enable_reg, &val);



This is quite common pattern so such work-around should be added to
many other functions leading to code obfuscation. Which compiler do
you have in mind?


Right, plus this will shut up valid compiler warnings which is poor
practice anyway.  I'd say this is a bug in the compiler.


Using gcc 4.7.2 with '-Os'. The warning does not happen when that option
is not used, i.e. disable CONFIG_CC_OPTIMIZE_FOR_SIZE.

I will certainly try other gcc versions with the same config and see
what happens.


Re: [RFC PATCHv2] x86/pci: Initial commit for new VMD device driver

2015-10-12 Thread Keith Busch

On Tue, 6 Oct 2015, Bjorn Helgaas wrote:

+static int __init vmd_init(void)
+{
+   return pci_register_driver(&vmd_drv);
+}
+module_init(vmd_init);


module_pci_driver(vmd_drv)?


We actually only have a module_init in this driver, and purposely left
out module_exit. We don't want to be able to unload this because we
can't reference count this module for modules depending on it.

'rmmod ' returns busy if you're actively using a device that
module is driving, but the end devices VMD provides are not driven by the
VMD driver; it'd be kind of like being able to unload the pci core driver.


Re: [PATCH v6 05/11] block: remove split code in blkdev_issue_{discard,write_same}B

2015-10-14 Thread Keith Busch

On Wed, 14 Oct 2015, Christoph Hellwig wrote:

Analysis and tentative fix below:

blktrace for before the commit:

259,012 0.02543  2394  G   D 0 + 8388607 [mkfs.xfs]
259,013 0.08230  2394  I   D 0 + 8388607 [mkfs.xfs]
259,014 0.31090   207  D   D 0 + 8388607 [kworker/1:1H]
259,015 0.44869  2394  Q   D 8388607 + 8388607 [mkfs.xfs]
259,016 0.45992  2394  G   D 8388607 + 8388607 [mkfs.xfs]
259,017 0.49559  2394  I   D 8388607 + 8388607 [mkfs.xfs]
259,018 0.61551   207  D   D 8388607 + 8388607 [kworker/1:1H]

.. and so on.

blktrace with the commit:

259,021 0.0  1228  Q   D 0 + 4194304 [mkfs.xfs]
259,022 0.02543  1228  G   D 0 + 4194304 [mkfs.xfs]
259,023 0.10080  1228  I   D 0 + 4194304 [mkfs.xfs]
259,024 0.82187   267  D   D 0 + 4194304 [kworker/2:1H]
259,025 0.000224869  1228  Q   D 4194304 + 4194304 [mkfs.xfs]
259,026 0.000225835  1228  G   D 4194304 + 4194304 [mkfs.xfs]
259,027 0.000229457  1228  I   D 4194304 + 4194304 [mkfs.xfs]
259,028 0.000238507   267  D   D 4194304 + 4194304 [kworker/2:1H]

So discards are smaller, but better aligned.  Now if I tweak a single
line in blk-lib.c to be able to use all of bi_size I get the old I/O
pattern back and everything works fine again:


I see why the proposal is an improvement, but I don't understand why the
current situation results in a hang. Are we missing some kind of error
recovery in the driver?


Re: [PATCH] blk-mq: fix use-after-free in blk_mq_free_tag_set()

2015-10-15 Thread Keith Busch

On Tue, 13 Oct 2015, Junichi Nomura wrote:

tags is freed in blk_mq_free_rq_map() and should not be used after that.
The problem doesn't manifest if CONFIG_CPUMASK_OFFSTACK is false because
free_cpumask_var() is nop.

tags->cpumask is allocated in blk_mq_init_tags() so it's natural to
free cpumask in its counter part, blk_mq_free_tags().


Thanks for the fix.

Reviewed-by: Keith Busch 


Fixes: f26cdc8536ad ("blk-mq: Shared tag enhancements")
Signed-off-by: Jun'ichi Nomura 
Cc: Keith Busch 
Cc: Jens Axboe 



[PATCH] blk-integrity: empty imlpementation when disabled

2015-10-26 Thread Keith Busch
This patch moves the blk_integrity_payload definition outside the
CONFIG_BLK_DEV_INTEGRITY dependency and provides empty function
implementations when the kernel configuration disables integrity
extensions. This simplifies drivers that make use of these to map user
data so they don't need to repeat the same configuration checks.

Signed-off-by: Keith Busch 
---
 include/linux/bio.h |   32 ++--
 1 file changed, 22 insertions(+), 10 deletions(-)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index b9b6e04..f0c46d0 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -318,16 +318,6 @@ enum bip_flags {
BIP_IP_CHECKSUM = 1 << 4, /* IP checksum */
 };
 
-#if defined(CONFIG_BLK_DEV_INTEGRITY)
-
-static inline struct bio_integrity_payload *bio_integrity(struct bio *bio)
-{
-   if (bio->bi_rw & REQ_INTEGRITY)
-   return bio->bi_integrity;
-
-   return NULL;
-}
-
 /*
  * bio integrity payload
  */
@@ -349,6 +339,16 @@ struct bio_integrity_payload {
struct bio_vec  bip_inline_vecs[0];/* embedded bvec array */
 };
 
+#if defined(CONFIG_BLK_DEV_INTEGRITY)
+
+static inline struct bio_integrity_payload *bio_integrity(struct bio *bio)
+{
+   if (bio->bi_rw & REQ_INTEGRITY)
+   return bio->bi_integrity;
+
+   return NULL;
+}
+
 static inline bool bio_integrity_flagged(struct bio *bio, enum bip_flags flag)
 {
struct bio_integrity_payload *bip = bio_integrity(bio);
@@ -795,6 +795,18 @@ static inline bool bio_integrity_flagged(struct bio *bio, enum bip_flags flag)
return false;
 }
 
+static inline void *bio_integrity_alloc(struct bio * bio, gfp_t gfp,
+   unsigned int nr)
+{
+   return NULL;
+}
+
+static inline int bio_integrity_add_page(struct bio *bio, struct page *page,
+   `unsigned int len, unsigned int offset)
+{
+   return 0;
+}
+
 #endif /* CONFIG_BLK_DEV_INTEGRITY */
 
 #endif /* CONFIG_BLOCK */
-- 
1.7.10.4



[PATCHv2] blk-integrity: empty implementation when disabled

2015-10-26 Thread Keith Busch
This patch moves the blk_integrity_payload definition outside the
CONFIG_BLK_DEV_INTEGRITY dependency and provides empty function
implementations when the kernel configuration disables integrity
extensions. This simplifies drivers that make use of these to map user
data so they don't need to repeat the same configuration checks.

Signed-off-by: Keith Busch 
---
v1 -> v2: Fixed corrupted patch and subject line spelling error

 include/linux/bio.h |   32 ++--
 1 file changed, 22 insertions(+), 10 deletions(-)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index b9b6e04..f0c46d0 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -318,16 +318,6 @@ enum bip_flags {
BIP_IP_CHECKSUM = 1 << 4, /* IP checksum */
 };
 
-#if defined(CONFIG_BLK_DEV_INTEGRITY)
-
-static inline struct bio_integrity_payload *bio_integrity(struct bio *bio)
-{
-   if (bio->bi_rw & REQ_INTEGRITY)
-   return bio->bi_integrity;
-
-   return NULL;
-}
-
 /*
  * bio integrity payload
  */
@@ -349,6 +339,16 @@ struct bio_integrity_payload {
struct bio_vec  bip_inline_vecs[0];/* embedded bvec array */
 };
 
+#if defined(CONFIG_BLK_DEV_INTEGRITY)
+
+static inline struct bio_integrity_payload *bio_integrity(struct bio *bio)
+{
+   if (bio->bi_rw & REQ_INTEGRITY)
+   return bio->bi_integrity;
+
+   return NULL;
+}
+
 static inline bool bio_integrity_flagged(struct bio *bio, enum bip_flags flag)
 {
struct bio_integrity_payload *bip = bio_integrity(bio);
@@ -795,6 +795,18 @@ static inline bool bio_integrity_flagged(struct bio *bio, enum bip_flags flag)
return false;
 }
 
+static inline void *bio_integrity_alloc(struct bio * bio, gfp_t gfp,
+   unsigned int nr)
+{
+   return NULL;
+}
+
+static inline int bio_integrity_add_page(struct bio *bio, struct page *page,
+   unsigned int len, unsigned int offset)
+{
+   return 0;
+}
+
 #endif /* CONFIG_BLK_DEV_INTEGRITY */
 
 #endif /* CONFIG_BLOCK */
-- 
1.7.10.4
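
For illustration only, here is a minimal sketch of the kind of driver code this
enables (the helper name and error handling are invented for the example, not
taken from the patch). With the stubs in place it compiles and behaves sanely
whether or not CONFIG_BLK_DEV_INTEGRITY is set, with no #ifdef in the driver:

#include <linux/bio.h>

/* hypothetical driver helper, for illustration only */
static int example_map_integrity(struct bio *bio, struct page *page,
                                 unsigned int len, unsigned int offset)
{
        struct bio_integrity_payload *bip;

        bip = bio_integrity_alloc(bio, GFP_KERNEL, 1);
        if (!bip)               /* NULL also covers "integrity compiled out" */
                return -ENOMEM;

        if (bio_integrity_add_page(bio, page, len, offset) != len)
                return -ENOMEM;

        return 0;
}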



[RFC PATCHv2] x86/pci: Initial commit for new VMD device driver

2015-10-01 Thread Keith Busch
The Intel Volume Management Device (VMD) is an integrated endpoint on the
platform's PCIe root complex that acts as a host bridge to a secondary
PCIe domain. BIOS can reassign one or more root ports to appear within
a VMD domain instead of the primary domain.

This driver enumerates and enables the domain using the root bus
configuration interface provided by the PCI subsystem. The driver
provides configuration space accessor functions (pci_ops), bus and
memory resources, a chained MSI irq domain, irq_chip implementation,
and dma operations necessary to support the domain through the VMD
endpoint's interface.

VMD routes I/O as follows:

   1) Configuration Space: BAR 0 ("CFGBAR") of VMD provides the base
   address and size for configuration space register access to VMD-owned
   root ports. It works similarly to MMCONFIG for extended configuration
   space. Bus numbering is independent and does not conflict with the
   primary domain. (An illustrative sketch of this layout follows the list.)

   2) MMIO Space: BARs 2 and 4 ("MEMBAR1" and "MEMBAR2") of VMD provide the
   base address, size, and type for MMIO register access. These addresses
   are not translated by VMD hardware; they are simply reservations to be
   distributed to root ports' memory base/limit registers and subdivided
   among devices downstream.

   3) DMA: To interact appropriately with IOMMU, the source ID DMA read
   and write requests are translated to the bus-device-function of the
   VMD endpoint. Otherwise, DMA operates normally without VMD-specific
   address translation.

   4) Interrupts: Part of VMD's BAR 4 is reserved for VMD's MSI-X Table and
   PBA. MSIs from VMD domain devices and ports are remapped to appear as if
   they were issued using one of VMD's MSI-X table entries. Each MSI and
   MSI-X address of a VMD-owned device or port has a special format in
   which the address refers to a specific entry in VMD's MSI-X table.  As with
   DMA, the interrupt source id is translated to VMD's bus-device-function.

   The driver provides its own MSI and MSI-X configuration functions
   specific to how MSI messages are used within the VMD domain, and
   it provides an irq_chip for independent IRQ allocation and to relay
   interrupts from VMD's interrupt handler to the appropriate device
   driver's handler.

   5) Errors: PCIe error messages are intercepted by the root ports normally
   (e.g. AER), except with VMD, system errors (i.e. firmware first) are
   disabled by default. AER and hotplug interrupts are translated in the
   same way as endpoint interrupts.

   6) VMD does not support INTx interrupts or IO ports. Devices or drivers
   requiring these features should either not be placed below VMD-owned
   root ports, or VMD should be disabled by BIOS for such endpoints.
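
As a purely illustrative sketch of the CFGBAR layout described in item 1 (the
helper name and the 'cfgbar' mapping are assumptions, not code from this
patch), the config accessor can compute addresses the same way MMCONFIG/ECAM
does:

static void __iomem *vmd_cfg_addr_sketch(void __iomem *cfgbar,
                                         unsigned int busnr,
                                         unsigned int devfn, int reg)
{
        /* ECAM-style layout: 1 MiB per bus, 4 KiB per function */
        return cfgbar + (busnr << 20) + (devfn << 12) + reg;
}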

Contributors to this patch include:

   Artur Paszkiewicz 
   Bryan Veal 
   Jon Derrick 

Signed-off-by: Keith Busch 
---
v1 -> v2:

The original RFC used custom x86_msi_ops to provide the VMD device
specific interrupt setup. This was rejected in favor of a chained irq
domain hierarchy, so this version provides that. While it tests out
successfully in the limited capacity that I can test this, I honestly
don't understand completely how this works, so thank you to Jiang Liu
for the guidance!

Perhaps I'm missing a callback, but I don't see how the driver can limit
the number of IRQs requested with the irq domain approach. The allocation is
done one at a time instead of all at once, so the driver doesn't know at this
level how many were originally requested. This isn't terrible, as I can
wrap the IRQs back to the beginning if they exceed VMD's MSI-X count.

This version includes the DMA operations required if an IOMMU is
used. That feature was omitted from the original RFC. The dma operations
are set via a PCI "fixup" if the device is in a VMD provided domain.

All this created a larger in-kernel dependency than before, and it is
submitted as a single patch instead of a short series since it is all
specific to this driver.

 arch/x86/Kconfig   |   15 ++
 arch/x86/include/asm/vmd.h |   39 +++
 arch/x86/kernel/apic/msi.c |   74 ++
 arch/x86/pci/Makefile  |2 +
 arch/x86/pci/vmd.c |  594 
 kernel/irq/chip.c  |1 +
 kernel/irq/irqdomain.c |3 +
 7 files changed, 728 insertions(+)
 create mode 100644 arch/x86/include/asm/vmd.h
 create mode 100644 arch/x86/pci/vmd.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 328c835..73df607 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2631,6 +2631,21 @@ config PMC_ATOM
def_bool y
 depends on PCI
 
+config HAVE_VMDDEV
+   bool
+
+config VMDDEV
+   depends on PCI && PCI_DOMAINS && PCI_MSI && GENERIC_MSI_IRQ_DOMAIN && IRQ_DOMAIN_HIERARCHY
+   tristate "Volume Management Device Driver"

Re: [PATCH V2] nvme-pci: move nvme_kill_queues to nvme_remove_dead_ctrl

2018-06-21 Thread Keith Busch
On Wed, Jun 20, 2018 at 01:42:22PM +0800, Jianchao Wang wrote:
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index fc33804..73a97fc 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2289,6 +2289,7 @@ static void nvme_remove_dead_ctrl(struct nvme_dev *dev, 
> int status)
>  
>   nvme_get_ctrl(&dev->ctrl);
>   nvme_dev_disable(dev, false);
> + nvme_kill_queues(&dev->ctrl);
>   if (!queue_work(nvme_wq, &dev->remove_work))
>   nvme_put_ctrl(&dev->ctrl);
>  }
> @@ -2405,7 +2406,6 @@ static void nvme_remove_dead_ctrl_work(struct 
> work_struct *work)
>   struct nvme_dev *dev = container_of(work, struct nvme_dev, remove_work);
>   struct pci_dev *pdev = to_pci_dev(dev->dev);
>  
> - nvme_kill_queues(&dev->ctrl);
>   if (pci_get_drvdata(pdev))
>   device_release_driver(&pdev->dev);
>   nvme_put_ctrl(&dev->ctrl);

Looks good to me.

Reviewed-by: Keith Busch 


Re: [PATCH 2/2] PCI: NVMe device specific reset quirk

2018-07-23 Thread Keith Busch
On Mon, Jul 23, 2018 at 04:24:31PM -0600, Alex Williamson wrote:
> Take advantage of NVMe devices using a standard interface to quiesce
> the controller prior to reset, including device specific delays before
> and after that reset.  This resolves several NVMe device assignment
> scenarios with two different vendors.  The Intel DC P3700 controller
> has been shown to only work as a VM boot device on the initial VM
> startup, failing after reset or reboot, and also fails to initialize
> after hot-plug into a VM.  Adding a delay after FLR resolves these
> cases.  The Samsung SM961/PM961 (960 EVO) sometimes fails to return
> from FLR with the PCI config space reading back as -1.  A reproducible
> instance of this behavior is resolved by clearing the enable bit in
> the configuration register and waiting for the ready status to clear
> (disabling the NVMe controller) prior to FLR.
> 
> As all NVMe devices make use of this standard interface and the NVMe
> specification also requires PCIe FLR support, we can apply this quirk
> to all devices with matching class code.

Shouldn't this go in the nvme driver's reset_prepare/reset_done callbacks?


Re: Report long suspend times of NVMe devices (mostly firmware/device issues)

2018-01-24 Thread Keith Busch
On Wed, Jan 24, 2018 at 11:29:12PM +0100, Paul Menzel wrote:
> Am 22.01.2018 um 22:30 schrieb Keith Busch:
> > The nvme spec guides toward longer times than that. I don't see the
> > point of warning users about things operating within spec.
> 
> I quickly glanced over NVM Express revision 1.3 specification [1] but
> searching for *second *, I could not find something about this. Could you
> please point me to the section?

Section 7.6.2:

  It is recommended that the host wait a minimum of the RTD3 Entry
  Latency reported in the Identify Controller data structure for the
  shutdown operations to complete; if the value reported in RTD3 Entry
  Latency is 0h, then the host should wait for a minimum of one second.

So if the controller is new enough, it will report its RTD3 Entry Latency.
The maximum allowed by spec is something like 4000 seconds.

For controllers that pre-date this field, we're supposed to wait a
"minimum" of one second.

The spec does not recommend a maximum time in either case.
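
To make the arithmetic concrete, here is a rough sketch (the helper name is
made up and this is not the driver's actual code) of deriving a shutdown wait
from the Identify Controller RTD3 Entry Latency field, which the spec reports
in microseconds:

static unsigned int rtd3e_to_shutdown_secs(__le32 rtd3e)
{
        u32 us = le32_to_cpu(rtd3e);

        /* 0h means the controller didn't report it: wait at least 1 second */
        return us ? DIV_ROUND_UP(us, USEC_PER_SEC) : 1;
}

A full 32-bit RTD3E value works out to roughly 4295 seconds, which is where
the "something like 4000 seconds" figure above comes from.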

> In my opinion, it’s a good thing to point users to devices holding up
> suspend.

If a device shutdown exceeds its reported constraints, then absolutely,
and we do log a warning for that already.

Picking an arbitrary time below spec recommendations is just guaranteed
to create alarmed people demanding answers to a warning about behavior
that is entirely normal.


Re: [PATCH RESENT] nvme-pci: introduce RECONNECTING state to mark initializing procedure

2018-01-25 Thread Keith Busch
On Mon, Jan 22, 2018 at 10:03:16PM +0800, Jianchao Wang wrote:
> After Sagi's commit (nvme-rdma: fix concurrent reset and reconnect),
> both nvme-fc/rdma have following pattern:
> RESETTING- quiesce blk-mq queues, teardown and delete queues/
>connections, clear out outstanding IO requests...
> RECONNECTING - establish new queues/connections and some other
>initializing things.
> Introduce RECONNECTING to nvme-pci transport to do the same mark.
> Then we get a coherent state definition among nvme pci/rdma/fc
> transports.
> 
> Suggested-by: James Smart 
> Reviewed-by: James Smart 
> Signed-off-by: Jianchao Wang 

This looks fine. Thank you for your patience.

Reviewed-by: Keith Busch 


Re: [PATCH] nvme-pci: calculate iod and avg_seg_size just before use them

2018-01-11 Thread Keith Busch
On Thu, Jan 11, 2018 at 01:09:39PM +0800, Jianchao Wang wrote:
> The calculation of iod and avg_seg_size may be meaningless if
> nvme_pci_use_sgls returns before using them. So calculate them
> just before they are used.

The compiler will do the right thing here, but I see what you mean. I
think Christoph has some SGL improvements coming, though, that will
obviate this.


Re: ASPM powersupersave change NVMe SSD Samsung 960 PRO capacity to 0 and read-only

2018-01-11 Thread Keith Busch
On Thu, Jan 11, 2018 at 06:50:40PM +0100, Maik Broemme wrote:
> I've re-run the test with 4.15rc7.r111.g5f615b97cdea and the following
> patches from Keith:
> 
> [PATCH 1/4] PCI/AER: Return approrpiate value when AER is not supported
> [PATCH 2/4] PCI/AER: Provide API for getting AER information
> [PATCH 3/4] PCI/DPC: Enable DPC in conjuction with AER
> [PATCH 4/4] PCI/DPC: Print AER status in DPC event handling
> 
> The issue is still the same. Additionally to the output before I see now:
> 
> Jan 11 18:34:45 server.theraso.int kernel: dpc :00:10.0:pcie010: DPC 
> containment event, status:0x1f09 source:0x
> Jan 11 18:34:45 server.theraso.int kernel: dpc :00:10.0:pcie010: DPC 
> unmasked uncorrectable error detected, remove downstream devices
> Jan 11 18:34:45 server.theraso.int kernel: pcieport :00:10.0: PCIe Bus 
> Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, 
> id=0080(Receiver ID)
> Jan 11 18:34:45 server.theraso.int kernel: pcieport :00:10.0:   device 
> [8086:19aa] error status/mask=0020/
> Jan 11 18:34:45 server.theraso.int kernel: pcieport :00:10.0:[ 5] 
> Surprise Down Error(First)
> Jan 11 18:34:46 server.theraso.int kernel: nvme0n1: detected capacity change 
> from 1024209543168 to 0

Okay, so that series wasn't going to fix anything, but at least it gets
some visibility into what's happened. The DPC was triggered due to a
Surprise Down uncorrectable error, so the power settting is causing the
link to fail.

The NVMe driver has quirks specifically for this vendor's devices to
fence off NVMe specific automated power settings. Your observations
appear to align with the same issues.


Re: [PATCH] nvme-pci: use NOWAIT flag for nvme_set_host_mem

2018-01-30 Thread Keith Busch
On Tue, Jan 30, 2018 at 11:41:07AM +0800, jianchao.wang wrote:
> Another point that confuses me is whether nvme_set_host_mem is necessary
> in nvme_dev_disable?
> As the comment:
> 
>   /*
>* If the controller is still alive tell it to stop using the
>* host memory buffer.  In theory the shutdown / reset should
>* make sure that it doesn't access the host memoery anymore,
>* but I'd rather be safe than sorry..
>*/
>   if (dev->host_mem_descs)
>   nvme_set_host_mem(dev, 0);
> 
> It is to avoid the controller accessing host memory. But the host memory
> is only freed in nvme_remove. It looks like we just need to do this in
> nvme_remove.
> For example:
> -
> @@ -2553,6 +2545,14 @@ static void nvme_remove(struct pci_dev *pdev)
> flush_work(&dev->ctrl.reset_work);
> nvme_stop_ctrl(&dev->ctrl);
> nvme_remove_namespaces(&dev->ctrl);
> +   /*
> +* If the controller is still alive tell it to stop using the
> +* host memory buffer.  In theory the shutdown / reset should
> +* make sure that it doesn't access the host memoery anymore,
> +* but I'd rather be safe than sorry..
> +*/
> +   if (dev->host_mem_descs)
> +   nvme_set_host_mem(dev, 0);
> nvme_dev_disable(dev, true);
> nvme_free_host_mem(dev);
> 
> 
> If anything is missed, please point it out.
> That's really appreciated.

I don't make any such controller, so I can't speak to the necessity of
having this here. I just think having it in the controller disabling
path is for safety. We want the controller to acknowledge that it has
completed any use of the HMB before we make it so it can't access it.



Re: [PATCH] PCI/DPC: Fix INT legacy interrupt in dpc_irq

2018-01-31 Thread Keith Busch
On Wed, Jan 31, 2018 at 09:48:55PM +0530, Oza Pawandeep wrote:
> The current dpc driver acknowledges the interrupt in deferred work, which works
> okay since LPIs are edge triggered.
> But when the RP does not have MSI support, the port service driver falls back to
> legacy GIC SPI interrupts, and with the current code we do not acknowledge the
> interrupt and we get a dpc interrupt storm.
> This patch acknowledges the interrupt in the interrupt handler.
> 
> Signed-off-by: Oza Pawandeep 

Thanks, looks good to me.

Reviewed-by: Keith Busch 


Re: [PATCH v5 4/4] PCI/DPC: Enumerate the devices after DPC trigger event

2018-01-18 Thread Keith Busch
On Thu, Jan 18, 2018 at 11:35:59AM -0500, Sinan Kaya wrote:
> On 1/18/2018 12:32 AM, p...@codeaurora.org wrote:
> > On 2018-01-18 08:26, Keith Busch wrote:
> >> On Wed, Jan 17, 2018 at 08:27:39AM -0800, Sinan Kaya wrote:
> >>> On 1/17/2018 5:37 AM, Oza Pawandeep wrote:
> >>> > +static bool dpc_wait_link_active(struct pci_dev *pdev)
> >>> > +{
> >>>
> >>> I think you can also make this function common instead of making another 
> >>> copy here.
> >>> Of course, this would be another patch.
> >>
> >> It is actually very similar to __pcie_wait_link_active in pciehp_hpc.c,
> >> so there's some opportunity to make even more common code.
> > 
> > in that case there has to be a generic function in
> > drivers/pci.c
> > 
> > which addresses the following functions from
> > 
> > pcie-dpc.c:
> > dpc_wait_link_inactive
> > dpc_wait_link_active
> > 
> > drivers/pci/hotplug/pciehp_hpc.c
> > pcie_wait_link_active
> > 
> > 
> > all above making one generic function to be moved to drivers/pci.c
> > 
> > please let me know if this is okay.
> 
> Works for me. Keith/Bjorn?

Yep, I believe common solutions that reduce code are always encouraged
in the Linux kernel.


Re: [PATCH V5 2/2] nvme-pci: fixup the timeout case when reset is ongoing

2018-01-18 Thread Keith Busch
On Thu, Jan 18, 2018 at 06:10:02PM +0800, Jianchao Wang wrote:
> +  * - When the ctrl.state is NVME_CTRL_RESETTING, the expired
> +  *   request should come from the previous work and we handle
> +  *   it as nvme_cancel_request.
> +  * - When the ctrl.state is NVME_CTRL_RECONNECTING, the expired
> +  *   request should come from the initializing procedure such as
> +  *   setup io queues, because all the previous outstanding
> +  *   requests should have been cancelled.
>*/
> - if (dev->ctrl.state == NVME_CTRL_RESETTING) {
> - dev_warn(dev->ctrl.device,
> -  "I/O %d QID %d timeout, disable controller\n",
> -  req->tag, nvmeq->qid);
> - nvme_dev_disable(dev, false);
> + switch (dev->ctrl.state) {
> + case NVME_CTRL_RESETTING:
> + nvme_req(req)->status = NVME_SC_ABORT_REQ;
> + return BLK_EH_HANDLED;
> + case NVME_CTRL_RECONNECTING:
> + WARN_ON_ONCE(nvmeq->qid);
>   nvme_req(req)->flags |= NVME_REQ_CANCELLED;
>   return BLK_EH_HANDLED;
> + default:
> + break;
>   }

The driver may be giving up on the command here, but that doesn't mean
the controller has. We can't just end the request like this because that
will release the memory the controller still owns. We must wait until
after nvme_dev_disable clears bus master because we can't say for sure
the controller isn't going to write to that address right after we end
the request.
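
Not a proposal, just a sketch of the ordering being described; the helper
below is hypothetical and glosses over how the timeout handler actually
completes requests:

static void example_reap_expired_request(struct pci_dev *pdev,
                                         struct request *req)
{
        /* stop the controller from mastering the bus first... */
        pci_clear_master(pdev);

        /* ...so it can no longer DMA into buffers we are about to free */
        blk_mq_complete_request(req);
}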


Re: [PATCH V5 2/2] nvme-pci: fixup the timeout case when reset is ongoing

2018-01-18 Thread Keith Busch
On Fri, Jan 19, 2018 at 01:55:29PM +0800, jianchao.wang wrote:
> On 01/19/2018 12:59 PM, Keith Busch wrote:
> > On Thu, Jan 18, 2018 at 06:10:02PM +0800, Jianchao Wang wrote:
> >> +   * - When the ctrl.state is NVME_CTRL_RESETTING, the expired
> >> +   *   request should come from the previous work and we handle
> >> +   *   it as nvme_cancel_request.
> >> +   * - When the ctrl.state is NVME_CTRL_RECONNECTING, the expired
> >> +   *   request should come from the initializing procedure such as
> >> +   *   setup io queues, because all the previous outstanding
> >> +   *   requests should have been cancelled.
> >> */
> >> -  if (dev->ctrl.state == NVME_CTRL_RESETTING) {
> >> -  dev_warn(dev->ctrl.device,
> >> -   "I/O %d QID %d timeout, disable controller\n",
> >> -   req->tag, nvmeq->qid);
> >> -  nvme_dev_disable(dev, false);
> >> +  switch (dev->ctrl.state) {
> >> +  case NVME_CTRL_RESETTING:
> >> +  nvme_req(req)->status = NVME_SC_ABORT_REQ;
> >> +  return BLK_EH_HANDLED;
> >> +  case NVME_CTRL_RECONNECTING:
> >> +  WARN_ON_ONCE(nvmeq->qid);
> >>nvme_req(req)->flags |= NVME_REQ_CANCELLED;
> >>return BLK_EH_HANDLED;
> >> +  default:
> >> +  break;
> >>}
> > 
> > The driver may be giving up on the command here, but that doesn't mean
> > the controller has. We can't just end the request like this because that
> > will release the memory the controller still owns. We must wait until
> > after nvme_dev_disable clears bus master because we can't say for sure
> > the controller isn't going to write to that address right after we end
> > the request.
> > 
> Yes, but the controller is going to be reset or shut down at that moment;
> even if the controller accesses a bad address and goes wrong, everything will
> be ok after the reset or shutdown. :)

Hm, I don't follow. DMA access after free is never okay.


Re: [PATCH V5 0/2] nvme-pci: fix the timeout case when reset is ongoing

2018-01-18 Thread Keith Busch
On Thu, Jan 18, 2018 at 06:10:00PM +0800, Jianchao Wang wrote:
> Hello
> 
> Please consider the following scenario.
> nvme_reset_ctrl
>   -> set state to RESETTING
>   -> queue reset_work   
> (scheduling)
> nvme_reset_work
>   -> nvme_dev_disable
> -> quiesce queues
> -> nvme_cancel_request 
>on outstanding requests
> ---_boundary_
>   -> nvme initializing (issue request on adminq)
> 
> Before the _boundary_, not only quiesce the queues, but only cancel
> all the outstanding requests.
> 
> A request could expire when the ctrl state is RESETTING.
>  - If the timeout occur before the _boundary_, the expired requests
>are from the previous work.
>  - Otherwise, the expired requests are from the controller initializing
>procedure, such as sending cq/sq create commands to adminq to setup
>io queues.
> In current implementation, nvme_timeout cannot identify the _boundary_ 
> so only handles second case above.

Bear with me a moment, as I'm only just now getting a real chance to look
at this, and I'm not quite sure I follow what problem this is solving.

The nvme_dev_disable routine makes forward progress without depending on
timeout handling to complete expired commands. Once controller disabling
completes, there can't possibly be any started requests that can expire.
So we don't need nvme_timeout to do anything for requests above the
boundary.


Re: [PATCH V5 0/2] nvme-pci: fix the timeout case when reset is ongoing

2018-01-19 Thread Keith Busch
On Fri, Jan 19, 2018 at 04:14:02PM +0800, jianchao.wang wrote:
> On 01/19/2018 04:01 PM, Keith Busch wrote:
> > The nvme_dev_disable routine makes forward progress without depending on
> > timeout handling to complete expired commands. Once controller disabling
> > completes, there can't possibly be any started requests that can expire.
> > So we don't need nvme_timeout to do anything for requests above the
> > boundary.
> > 
> Yes, once controller disabling completes, any started requests will be 
> handled and cannot expire.
> But before the _boundary_, there could be an nvme_timeout context running in
> parallel with nvme_dev_disable.
> If a timeout path grabs a request, then nvme_dev_disable cannot get and
> cancel it.
> So even though nvme_dev_disable completes, there could still be a request
> in the nvme_timeout context.
> 
> The worst case is :
> nvme_timeout                    nvme_reset_work
> if (ctrl->state == RESETTING )  nvme_dev_disable
> nvme_dev_disable                initializing procedure
> 
> nvme_dev_disable runs in parallel with the reinit procedure in nvme_reset_work.

Okay, I see what you're saying. That's a pretty obscure case, as normally
we enter nvme_reset_work with the queues already disabled, but there
are a few cases where we need nvme_reset_work to handle that.

But if we are in that case, I think we really just want to sync the
queues. What do you think of this?

---
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index fde6fd2e7eef..516383193416 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3520,6 +3520,17 @@ void nvme_stop_queues(struct nvme_ctrl *ctrl)
 }
 EXPORT_SYMBOL_GPL(nvme_stop_queues);
 
+void nvme_sync_queues(struct nvme_ctrl *ctrl)
+{
+   struct nvme_ns *ns;
+
+   mutex_lock(&ctrl->namespaces_mutex);
+   list_for_each_entry(ns, &ctrl->namespaces, list)
+   blk_sync_queue(ns->queue);
+   mutex_unlock(&ctrl->namespaces_mutex);
+}
+EXPORT_SYMBOL_GPL(nvme_sync_queues);
+
 void nvme_start_queues(struct nvme_ctrl *ctrl)
 {
struct nvme_ns *ns;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 8e7fc1b041b7..45b1b8ceddb6 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -374,6 +374,7 @@ void nvme_complete_async_event(struct nvme_ctrl *ctrl, 
__le16 status,
 
 void nvme_stop_queues(struct nvme_ctrl *ctrl);
 void nvme_start_queues(struct nvme_ctrl *ctrl);
+void nvme_sync_queues(struct nvme_ctrl *ctrl);
 void nvme_kill_queues(struct nvme_ctrl *ctrl);
 void nvme_unfreeze(struct nvme_ctrl *ctrl);
 void nvme_wait_freeze(struct nvme_ctrl *ctrl);
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index a2ffb557b616..1fe00be22ad1 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2289,8 +2289,10 @@ static void nvme_reset_work(struct work_struct *work)
 * If we're called to reset a live controller first shut it down before
 * moving on.
 */
-   if (dev->ctrl.ctrl_config & NVME_CC_ENABLE)
+   if (dev->ctrl.ctrl_config & NVME_CC_ENABLE) {
nvme_dev_disable(dev, false);
+   nvme_sync_queues(&dev->ctrl);
+   }
 
result = nvme_pci_enable(dev);
if (result)
--


Re: [PATCH V5 0/2] nvme-pci: fix the timeout case when reset is ongoing

2018-01-19 Thread Keith Busch
On Fri, Jan 19, 2018 at 05:02:06PM +0800, jianchao.wang wrote:
> We should not use blk_sync_queue here, the requeue_work and run_work will be 
> canceled.
> Just flush_work(&q->timeout_work) should be ok.

I agree flushing timeout_work is sufficient. All the other work had
already better not be running either, so it doesn't hurt to call the
sync API.
 
> In addition, we could check NVME_CC_ENABLE in nvme_dev_disable to avoid 
> redundant invoking.
> :)

That should already be inferred through reading back the CSTS register.
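
For illustration, a minimal sketch of that inference, using the register
definitions from linux/nvme.h ('bar' is assumed to be the ioremap'd controller
BAR and the helper name is made up):

static bool nvme_ctrl_ready_sketch(void __iomem *bar)
{
        u32 csts = readl(bar + NVME_REG_CSTS);

        /* all ones means the device fell off the bus */
        if (csts == ~0U)
                return false;

        return csts & NVME_CSTS_RDY;
}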


Re: [PATCH V5 0/2] nvme-pci: fix the timeout case when reset is ongoing

2018-01-19 Thread Keith Busch
On Fri, Jan 19, 2018 at 09:56:48PM +0800, jianchao.wang wrote:
> In nvme_dev_disable, the outstanding requests will be requeued finally.
> I'm afraid the requests requeued on the q->requeue_list will be blocked until 
> another requeue
> occurs, if we cancel the requeue work before it gets scheduled.

We should kick the request list in nvme_start_queues.
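
A rough sketch of what that could look like, mirroring the existing
nvme_start_queues() loop with a kick added (illustrative, not a formal patch):

void nvme_start_queues(struct nvme_ctrl *ctrl)
{
        struct nvme_ns *ns;

        mutex_lock(&ctrl->namespaces_mutex);
        list_for_each_entry(ns, &ctrl->namespaces, list) {
                blk_mq_unquiesce_queue(ns->queue);
                /* flush anything parked on the requeue list */
                blk_mq_kick_requeue_list(ns->queue);
        }
        mutex_unlock(&ctrl->namespaces_mutex);
}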


Re: Report long suspend times of NVMe devices (mostly firmware/device issues)

2018-01-22 Thread Keith Busch
On Mon, Jan 22, 2018 at 10:02:12PM +0100, Paul Menzel wrote:
> Dear Linux folks,
> 
> 
> Benchmarking the ACPI S3 suspend and resume times with `sleepgraph.py
> -config config/suspend-callgraph.cfg` [1], shows that the NVMe disk SAMSUNG
> MZVKW512HMJP-0 in the TUXEDO Book BU1406 takes between 0.3 and 1.4
> seconds, holding up the suspend cycle.
> 
> The time is spent in `nvme_shutdown_ctrl()`.
> 
> ### Linux 4.14.1-041401-generic
> 
> > nvme @ :04:00.0 {nvme} async_device (Total Suspend: 1439.299 ms Total 
> > Resume: 19.865 ms)
> 
> ### Linux 4.15-rc9
> 
> > nvme @ :04:00.0 {nvme} async_device (Total Suspend: 362.239 ms Total 
> > Resume: 19.897 m
> It’d be useful, if the Linux kernel logged such issues visibly to the user,
> so that the hardware manufacturer can be contacted to fix the device
> (probably the firmware).
> 
> In my opinion anything longer than 200 ms should be reported similar to [2],
> and maybe worded like below.
> 
> > NVMe took more than 200 ms to do suspend routine
> 
> What do you think?

The nvme spec guides toward longer times than that. I don't see the
point of warning users about things operating within spec.


Re: [PATCH] nvme-pci: ensure nvme_timeout complete before initializing procedure

2018-01-22 Thread Keith Busch
On Mon, Jan 22, 2018 at 09:14:23PM +0100, Christoph Hellwig wrote:
> > Link: https://lkml.org/lkml/2018/1/19/68
> > Suggested-by: Keith Busch 
> > Signed-off-by: Keith Busch 
> > Signed-off-by: Jianchao Wang 
> 
> Why does this have a signoff from Keith?

Right, I hadn't signed off on that. I was just trying to get feedback on
whether something like that closed the theoretical gap, which it does.

I actually have something similar in my patch queue I was about to send
around this area, though. I don't like having the IO path take on the
error handling, and I think ending unstarted requests directly will be
better long term.


Re: [PATCH 09/12] nvme-pci: Use PCI p2pmem subsystem to manage the CMB

2018-01-05 Thread Keith Busch
On Thu, Jan 04, 2018 at 12:01:34PM -0700, Logan Gunthorpe wrote:
> Register the CMB buffer as p2pmem and use the appropriate allocation
> functions to create and destroy the IO SQ.
> 
> If the CMB supports WDS and RDS, publish it for use as p2p memory
> by other devices.

<>

> + if (qid && dev->cmb_use_sqes) {
> + nvmeq->sq_cmds = pci_alloc_p2pmem(pdev, SQ_SIZE(depth));

> + nvmeq->sq_dma_addr = pci_p2pmem_virt_to_bus(pdev,
> + nvmeq->sq_cmds);
> + nvmeq->sq_cmds_is_io = true;
> + }

This gets into some spec type trouble for creating an IO submission
queue. We use the sq_dma_addr as the PRP entry to the Create I/O
SQ command, which has some requirements:

  "the PRP Entry shall have an offset of 0h."

So this has to be 4k aligned, but pci_alloc_p2pmem doesn't guarantee
that. I doubt the spec's intention was to require such alignment for
CMB SQs, but there may be controllers enforcing the rule as written.
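
A trivial sketch of the check being implied (the helper name is invented, and
4k stands in for the controller page size assumed throughout this thread):

static bool cmb_sq_prp_ok(dma_addr_t sq_dma_addr)
{
        /* Create I/O SQ requires the PRP entry to have an offset of 0h */
        return IS_ALIGNED(sq_dma_addr, SZ_4K);
}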


Re: [PATCH 09/12] nvme-pci: Use PCI p2pmem subsystem to manage the CMB

2018-01-05 Thread Keith Busch
On Fri, Jan 05, 2018 at 11:19:28AM -0700, Logan Gunthorpe wrote:
> Although it is not explicitly stated anywhere, pci_alloc_p2pmem() should
> always be at least 4k aligned. This is because the gen_pool that implements
> it is created with PAGE_SHIFT for its min_alloc_order.

Ah, I see that now. Thanks for the explanation.

Does it need to be created with a page-sized minimum alloc order? That
granularity makes it difficult to fit SQs in the CMB on archs with larger
pages when we only need 4k alignment.

I was also hoping to extend this for PRP/SGL in CMB where even 4k is
too high a granularity to make it really useful.

It looks like creating the gen pool with a smaller minimum and
gen_pool_first_fit_order_align algo would satisfy my use cases, but I'm
not sure if there's another reason you've set it to page alignment.
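
For illustration, creating the pool with a smaller minimum and a size-aligned
fit might look like the untested sketch below; the 64-byte granularity is just
an example value:

static struct gen_pool *p2pmem_pool_sketch(void)
{
        struct gen_pool *pool;

        /* 64-byte granularity instead of PAGE_SHIFT */
        pool = gen_pool_create(ilog2(64), NUMA_NO_NODE);
        if (!pool)
                return NULL;

        /* align each allocation to its own size, so 4k SQs stay 4k aligned */
        gen_pool_set_algo(pool, gen_pool_first_fit_order_align, NULL);
        return pool;
}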


Re: [PATCH V2 0/2] nvme-pci: fix the timeout case when reset is ongoing

2018-01-08 Thread Keith Busch
On Tue, Jan 09, 2018 at 10:03:11AM +0800, Jianchao Wang wrote:
> Hello

Sorry for the distraction, but could you possibly fix the date on your
machine? For some reason, lists.infradead.org sorts threads by the time
you claim to have sent your message rather than the time it was received,
and you're a full day in the future. :)


Re: [PATCH] nvme-pci: use NOWAIT flag for nvme_set_host_mem

2018-01-29 Thread Keith Busch
On Mon, Jan 29, 2018 at 11:07:35AM +0800, Jianchao Wang wrote:
> nvme_set_host_mem will invoke nvme_alloc_request without NOWAIT
> flag, it is unsafe for nvme_dev_disable. The adminq driver tags
> may have been used up when the previous outstanding adminq requests
> cannot be completed due to some hardware error. We have to depend
> on the timeout path to complete the previous outstanding adminq
> requests and free the tags.
> However, nvme_timeout will invoke nvme_dev_disable and try to
> get the shutdown_lock which is held by another context who is
> sleeping to wait for the tags to be freed by timeout path. A
> deadlock comes up.
> 
> To fix it, let nvme_set_host_mem use NOWAIT flag.
> 
> Signed-off-by: Jianchao Wang 

Thanks for the fix. It looks like we still have a problem, though.
Commands submitted with the "shutdown_lock" held need to be able to make
forward progress without relying on a completion, but this one could
block indefinitely.


>  drivers/nvme/host/pci.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 6fe7af0..9532529 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1736,7 +1736,8 @@ static int nvme_set_host_mem(struct nvme_dev *dev, u32 
> bits)
>   c.features.dword14  = cpu_to_le32(upper_32_bits(dma_addr));
>   c.features.dword15  = cpu_to_le32(dev->nr_host_mem_descs);
>  
> - ret = nvme_submit_sync_cmd(dev->ctrl.admin_q, &c, NULL, 0);
> + ret = __nvme_submit_sync_cmd(dev->ctrl.admin_q, &c, NULL, NULL, 0, 0,
> + NVME_QID_ANY, 0, BLK_MQ_REQ_NOWAIT);
>   if (ret) {
>   dev_warn(dev->ctrl.device,
>"failed to set host mem (err %d, flags %#x).\n",
> -- 


Re: [PATCH] nvme-pci: use NOWAIT flag for nvme_set_host_mem

2018-01-29 Thread Keith Busch
On Mon, Jan 29, 2018 at 09:55:41PM +0200, Sagi Grimberg wrote:
> > Thanks for the fix. It looks like we still have a problem, though.
> > Commands submitted with the "shutdown_lock" held need to be able to make
> > forward progress without relying on a completion, but this one could
> > block indefinitely.
> 
> Can you explain to me why is the shutdown_lock needed to synchronize
> nvme_dev_disable? More concretely, how is nvme_dev_disable different
> from other places where we rely on the ctrl state to serialize stuff?
> 
> The only reason I see would be to protect against completion-after-abort
> scenario but I think the block layer should protect against it (checks
> if the request timeout timer fired).

We can probably find a way to use the state machine for this. Disabling
the controller pre-dates the state machine, and the mutex is there to
protect against two actors shutting the controller down at the same
time, like a hot removal at the same time as a timeout handling reset.


Re: [BUG 4.15-rc7] IRQ matrix management errors

2018-01-16 Thread Keith Busch
On Tue, Jan 16, 2018 at 12:20:18PM +0100, Thomas Gleixner wrote:
> 8<--
> diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> index f8b03bb8e725..3cc471beb50b 100644
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -542,14 +542,17 @@ static int x86_vector_alloc_irqs(struct irq_domain 
> *domain, unsigned int virq,
>  
>   err = assign_irq_vector_policy(irqd, info);
>   trace_vector_setup(virq + i, false, err);
> - if (err)
> + if (err) {
> + irqd->chip_data = NULL;
> + free_apic_chip_data(apicd);
>   goto error;
> + }
>   }
>  
>   return 0;
>  
>  error:
> - x86_vector_free_irqs(domain, virq, i + 1);
> + x86_vector_free_irqs(domain, virq, i);
>   return err;
>  }
>  

The patch does indeed fix all the warnings and allows device binding to
succeed, albeit in a degraded performance mode. Despite that, this is
a good fix, and looks applicable to 4.4-stable, so:

Tested-by: Keith Busch 

I'm still concerned assign_irq_vector_policy is failing. That has
interrupt allocation abandon MSI-x and fall back to legacy IRQ. 

Your patch does address my main concern, though. Are you comfortable
enough to queue this up for 4.15?

Thanks,
Keith


Re: [BUG 4.15-rc7] IRQ matrix management errors

2018-01-16 Thread Keith Busch
On Wed, Jan 17, 2018 at 08:34:22AM +0100, Thomas Gleixner wrote:
> Can you trace the matrix allocations from the very beginning or tell me how
> to reproduce. I'd like to figure out why this is happening.

Sure, I'll get the irq_matrix events.

I reproduce this on a machine with 112 CPUs and 3 NVMe controllers. The
first two NVMe controllers want 112 MSI-X vectors, and the last only 31. The
test runs 'modprobe nvme' and 'modprobe -r nvme' in a loop with a 10
second delay between each step. Repro occurs within a few iterations,
sometimes already broken after the initial boot.


Re: [BUG 4.15-rc7] IRQ matrix management errors

2018-01-17 Thread Keith Busch
On Wed, Jan 17, 2018 at 04:01:47PM +0100, Thomas Gleixner wrote:
> Which device is allocating gazillions of non-managed interrupts?

I believe that would be the i40e. :)
 
> The patch below should cure that by spreading them out on allocation.

Yep, this has now successfully run over 200 iterations of a test that used to
fail within only a few. I'd say the problem is cured. Thanks!

Tested-by: Keith Busch 


Re: [PATCH v5 4/4] PCI/DPC: Enumerate the devices after DPC trigger event

2018-01-17 Thread Keith Busch
On Wed, Jan 17, 2018 at 08:27:39AM -0800, Sinan Kaya wrote:
> On 1/17/2018 5:37 AM, Oza Pawandeep wrote:
> > +static bool dpc_wait_link_active(struct pci_dev *pdev)
> > +{
> 
> I think you can also make this function common instead of making another copy 
> here.
> Of course, this would be another patch.

It is actually very similar to __pcie_wait_link_active in pciehp_hpc.c,
so there's some opportunity to make even more common code.
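
For reference, a shared helper along those lines might look roughly like this
(the name and timeout are placeholders; the existing helpers poll the Data
Link Layer Link Active bit in a similar way):

static bool pcie_wait_link_state(struct pci_dev *pdev, bool active)
{
        int timeout_ms = 1000;
        u16 lnksta;

        for (;;) {
                pcie_capability_read_word(pdev, PCI_EXP_LNKSTA, &lnksta);
                if (!!(lnksta & PCI_EXP_LNKSTA_DLLLA) == active)
                        return true;
                if (timeout_ms <= 0)
                        return false;
                msleep(10);
                timeout_ms -= 10;
        }
}

Such a helper could then back dpc_wait_link_active/inactive and the pciehp
copy alike.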


Re: [PATCH v3 2/2] nvme: add tracepoint for nvme_complete_rq

2018-01-17 Thread Keith Busch
Looks good.

Reviewed-by: Keith Busch 


Re: [PATCH v3 1/2] nvme: add tracepoint for nvme_setup_cmd

2018-01-17 Thread Keith Busch
Looks good.

Reviewed-by: Keith Busch 


Re: [BUG 4.15-rc7] IRQ matrix management errors

2018-01-18 Thread Keith Busch
On Thu, Jan 18, 2018 at 09:10:43AM +0100, Thomas Gleixner wrote:
> Can you please provide the output of
> 
> # cat /sys/kernel/debug/irq/irqs/$ONE_I40_IRQ

# cat /sys/kernel/debug/irq/irqs/48
handler:  handle_edge_irq
device:   :1a:00.0
status:   0x
istate:   0x
ddepth:   0
wdepth:   0
dstate:   0x05401200
IRQD_ACTIVATED
IRQD_IRQ_STARTED
IRQD_SINGLE_TARGET
IRQD_AFFINITY_SET
IRQD_CAN_RESERVE
node: 0
affinity: 0-27,56-83
effectiv: 0
pending:
domain:  PCI-MSI-2
 hwirq:   0xd2
 chip:PCI-MSI
  flags:   0x10
 IRQCHIP_SKIP_SET_WAKE
 parent:
domain:  VECTOR
 hwirq:   0x30
 chip:APIC
  flags:   0x0
 Vector:69
 Target: 0


Re: [PATCH] nvme: initialize hostid uuid in nvmf_host_default to not leak kernel memory

2018-01-09 Thread Keith Busch
On Tue, Jan 09, 2018 at 04:20:43PM +0100, Johannes Thumshirn wrote:
> Alexander reports:
>   according to KMSAN (and common sense as well) the following code in
>   drivers/nvme/host/fabrics.c
>   
> (http://elixir.free-electrons.com/linux/latest/source/drivers/nvme/host/fabrics.c#L68):
> 
> 72 host = kmalloc(sizeof(*host), GFP_KERNEL);
> 73 if (!host)
> 74 return NULL;
> 75
> 76 kref_init(&host->ref);
> 77 snprintf(host->nqn, NVMF_NQN_SIZE,
> 78 "nqn.2014-08.org.nvmexpress:uuid:%pUb", &host->id);
> 
>   uses uninitialized heap memory to generate the unique id for the NVMF host.
>   If I'm understanding correctly, it can be then passed to the
>   userspace, so the contents of the uninitialized chunk may potentially
>   leak.
>   If the specification doesn't rely on this UID to be random or unique,
>   I suggest using kzalloc() here, otherwise it might be a good idea to
>   use a real RNG.
> 
> this assumption is correct so initialize the host->id using uuid_gen() as
> it was done before commit 6bfe04255d5e ("nvme: add hostid token to fabric
> options").
> 
> Fixes: 6bfe04255d5e ("nvme: add hostid token to fabric options")
> Reported-by: Alexander Potapenko 
> Signed-off-by: Johannes Thumshirn 

Thanks for the report and the fix. It'd still be good to use the kzalloc
variant in addition to this.
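
Combining the two suggestions, the allocation would look roughly like the
sketch below (it assumes the struct nvmf_host definition from
drivers/nvme/host/fabrics.h and is illustrative rather than the final patch):

static struct nvmf_host *nvmf_host_default(void)
{
        struct nvmf_host *host;

        host = kzalloc(sizeof(*host), GFP_KERNEL);      /* zeroed: nothing leaks */
        if (!host)
                return NULL;

        kref_init(&host->ref);
        uuid_gen(&host->id);                            /* properly random id */
        snprintf(host->nqn, NVMF_NQN_SIZE,
                "nqn.2014-08.org.nvmexpress:uuid:%pUb", &host->id);

        return host;
}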

Reviewed-by: Keith Busch 


Re: [PATCH V3 1/2] nvme: split resetting state into reset_prepate and resetting

2018-01-14 Thread Keith Busch
On Mon, Jan 15, 2018 at 10:02:04AM +0800, jianchao.wang wrote:
> Hi keith
> 
> Thanks for your kindly review and response.

I agree with Sagi's feedback, but I can't take credit for it. :)


[BUG 4.15-rc7] IRQ matrix management errors

2018-01-14 Thread Keith Busch
I hoped to have a better report before the weekend, but I've run out of
time and I'm without my machine till next week, so I'm sending what I have
and praying someone more in the know will have a better clue.

I have a few NVMe drives, and occasionally the IRQ teardown and bring-up
fails. Resetting the controllers to re-run interrupt allocation
results in device interrupts not occurring at all.

It appears I need at least 2 NVMe drives for the problem. The NVMe
driver initializes controllers in parallel, and multiple threads calling
pci_alloc_irq_vectors and/or pci_free_vectors at about the same time
seems required to trigger the issue.

Here are the relevant warnings. I also have trace events from irq_matrix
and irq_vectors, though from a different run that won't match up to the
timestamps below; I can send them if that's helpful.

[  288.519216] WARNING: CPU: 28 PID: 1420 at kernel/irq/matrix.c:222 
irq_matrix_remove_managed+0x10f/0x120
[  288.519218] Modules linked in: nvme ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 
xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc 
ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle 
ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 
nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw iptable_security 
ebtable_filter ebtables ip6table_filter ip6_tables sunrpc vfat fat intel_rapl 
x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul 
ghash_clmulni_intel intel_uncore intel_rapl_perf iTCO_wdt joydev 
iTCO_vendor_support ipmi_ssif ipmi_si mei_me mei shpchp i2c_i801 wmi lpc_ich 
ipmi_devintf ioatdma ipmi_msghandler dca acpi_pad acpi_power_meter xfs 
libcrc32c ast i2c_algo_bit drm_kms_helper ttm drm i40e crc32c_intel
[  288.519286]  ptp nvme_core pps_core [last unloaded: nvme]
[  288.519294] CPU: 28 PID: 1420 Comm: kworker/u674:2 Not tainted 4.15.0-rc7+ #4
[  288.519296] Hardware name: Intel Corporation S2600STB/S2600STB, BIOS 
SE5C620.86B.00.01.2001.062220170731 06/22/2017
[  288.519305] Workqueue: nvme-wq nvme_reset_work [nvme]
[  288.519310] RIP: 0010:irq_matrix_remove_managed+0x10f/0x120
[  288.519312] RSP: 0018:b01f0aec7a88 EFLAGS: 00010046
[  288.519315] RAX: 00ee RBX: 9e387d824900 RCX: 
[  288.519317] RDX: 0100 RSI: 00ee RDI: 9e387d410c50
[  288.519319] RBP:  R08: 0100 R09: 
[  288.519320] R10: 0018 R11: 0003 R12: 9e387d410c00
[  288.519322] R13: 9e387d410c30 R14: 00ee R15: 00ee
[  288.519324] FS:  () GS:9e487cc0() 
knlGS:
[  288.519326] CS:  0010 DS:  ES:  CR0: 80050033
[  288.519328] CR2: 7fbc1c2321d4 CR3: 0010f5209003 CR4: 007606e0
[  288.519330] DR0:  DR1:  DR2: 
[  288.519332] DR3:  DR6: fffe0ff0 DR7: 0400
[  288.519333] PKRU: 5554
[  288.519334] Call Trace:
[  288.519345]  x86_vector_free_irqs+0xa1/0x180
[  288.519349]  x86_vector_alloc_irqs+0x1e4/0x3a0
[  288.519354]  msi_domain_alloc+0x62/0x130
[  288.519363]  ? kmem_cache_alloc_node_trace+0x1ac/0x1d0
[  288.519366]  __irq_domain_alloc_irqs+0x121/0x300
[  288.519370]  msi_domain_alloc_irqs+0x99/0x2e0
[  288.519376]  native_setup_msi_irqs+0x54/0x90
[  288.519383]  __pci_enable_msix+0xfb/0x4e0
[  288.519389]  pci_alloc_irq_vectors_affinity+0x8e/0x130
[  288.519394]  nvme_reset_work+0x919/0x153b [nvme]
[  288.519404]  ? sched_clock+0x5/0x10
[  288.519407]  ? sched_clock+0x5/0x10
[  288.519414]  ? sched_clock_cpu+0xc/0xb0
[  288.519420]  ? pick_next_task_fair+0x4d5/0x5f0
[  288.519427]  ? __switch_to+0xa2/0x430
[  288.519431]  ? put_prev_entity+0x1e/0xe0
[  288.519438]  process_one_work+0x182/0x370
[  288.519441]  worker_thread+0x2e/0x380
[  288.519444]  ? process_one_work+0x370/0x370
[  288.519449]  kthread+0x111/0x130
[  288.519453]  ? kthread_create_worker_on_cpu+0x70/0x70
[  288.519460]  ret_from_fork+0x1f/0x30
[  288.519464] Code: 89 ea 44 89 f6 41 ff d1 4d 8b 0f 4d 85 c9 75 e2 e9 2a ff 
ff ff 48 83 c4 08 5b 5d 41 5c 41 5d 41 5e 41 5f c3 0f ff e9 14 ff ff ff <0f> ff 
e9 0d ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 
[  288.519517] ---[ end trace 8fcc70570a780634 ]---
[  310.959112] nvme nvme0: pci function :5e:00.0
[  310.959333] nvme nvme1: pci function :60:00.0
[  310.959675] nvme nvme2: pci function :86:00.0
[  311.167369] WARNING: CPU: 1 PID: 898 at kernel/irq/matrix.c:215 
irq_matrix_remove_managed+0x108/0x120
[  311.167371] Modules linked in: nvme ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 
xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc 
ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle 
ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 
nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw 

Re: [BUG 4.15-rc7] IRQ matrix management errors

2018-01-15 Thread Keith Busch
This is all way over my head, but the part that obviously shows
something's gone wrong:

  kworker/u674:3-1421  [028] d...   335.307051: irq_matrix_reserve_managed: 
bit=56 cpu=0 online=1 avl=86 alloc=116 managed=3 online_maps=112 
global_avl=22084, global_rsvd=157, total_alloc=570
  kworker/u674:3-1421  [028] d...   335.307053: irq_matrix_remove_managed: 
bit=56 cpu=0 online=1 avl=87 alloc=116 managed=2 online_maps=112 
global_avl=22085, global_rsvd=157, total_alloc=570
  kworker/u674:3-1421  [028]    335.307054: vector_reserve_managed: irq=45 
ret=-28
  kworker/u674:3-1421  [028]    335.307054: vector_setup: irq=45 
is_legacy=0 ret=-28
  kworker/u674:3-1421  [028] d...   335.307055: vector_teardown: irq=45 
is_managed=1 has_reserved=0

Which leads me to x86_vector_alloc_irqs goto error:

error:
x86_vector_free_irqs(domain, virq, i + 1);

The last parameter looks weird. It's the nr_irqs, and since we failed and
bailed, I would think we'd need to subtract 1 rather than add 1. Adding
1 would doublely remove the failed one, and remove the next one that
was never setup, right?

Or maybe irq_matrix_reserve_managed wasn't expected to fail in the
first place?


Re: [BUG 4.15-rc7] IRQ matrix management errors

2018-01-16 Thread Keith Busch
On Tue, Jan 16, 2018 at 12:20:18PM +0100, Thomas Gleixner wrote:
> What we want is s/i + 1/i/
> 
> That's correct because x86_vector_free_irqs() does:
> 
>for (i = 0; i < nr; i++)
>  
> 
> So if we fail at the first irq, then the loop will do nothing. Failing on
> the second will free the first 
> 
> Fix below.
> 
> Thanks,
> 
>   tglx

Thanks! This looks much better. I'll try to verify by tomorrow, though
the hardware I was using to recreate is not available to me at the
moment. I may be able to synthesize the conditions on something else.

 
> 8<--
> diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> index f8b03bb8e725..3cc471beb50b 100644
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -542,14 +542,17 @@ static int x86_vector_alloc_irqs(struct irq_domain 
> *domain, unsigned int virq,
>  
>   err = assign_irq_vector_policy(irqd, info);
>   trace_vector_setup(virq + i, false, err);
> - if (err)
> + if (err) {
> + irqd->chip_data = NULL;
> + free_apic_chip_data(apicd);
>   goto error;
> + }
>   }
>  
>   return 0;
>  
>  error:
> - x86_vector_free_irqs(domain, virq, i + 1);
> + x86_vector_free_irqs(domain, virq, i);
>   return err;
>  }
>  


Re: [PATCH v2 0/2] add tracepoints for nvme command submission and completion

2018-01-16 Thread Keith Busch
On Tue, Jan 16, 2018 at 03:28:19PM +0100, Johannes Thumshirn wrote:
> Add tracepoints for nvme command submission and completion. The tracepoints
> are modeled after SCSI's trace_scsi_dispatch_cmd_start() and
> trace_scsi_dispatch_cmd_done() tracepoints and fulfil a similar purpose,
> namely a fast way to check which command is going to be queued into the HW or
> Fabric driver and which command is completed again.

I like this very much, thanks for doing this. I think you could make the
submission trace point tighter for PCI as Hannes was suggesting since
an MMIO write can't fail, but that doesn't look as doable for FC and RDMA.


Re: [PATCH v12 0/6] Address error and recovery for AER and DPC

2018-03-12 Thread Keith Busch
On Sun, Mar 11, 2018 at 11:03:58PM -0400, Sinan Kaya wrote:
> On 3/11/2018 6:03 PM, Bjorn Helgaas wrote:
> > On Wed, Feb 28, 2018 at 10:34:11PM +0530, Oza Pawandeep wrote:
> 
> > That difference has been there since the beginning of DPC, so it has
> > nothing to do with *this* series EXCEPT for the fact that it really
> > complicates the logic you're adding to reset_link() and
> > broadcast_error_message().
> > 
> > We ought to be able to simplify that somehow because the only real
> > difference between AER and DPC should be that DPC automatically
> > disables the link and AER does it in software.
> 
> I agree this should be possible. Code execution path should be almost
> identical to fatal error case.
> 
> Is there any reason why you went to stop driver path, Keith?

The fact is the link is truly down during a DPC event. When the link
is enabled again, you don't know at that point if the device(s) on the
other side have changed. Calling a driver's error handler for the wrong
device in an unknown state may have undefined results. Enumerating the
slot from scratch should be safe, and will assign resources, tune bus
settings, and bind to the matching driver.

Per spec, DPC is the recommended way for handling surprise removal
events and even recommends DPC capable slots *not* set 'Surprise'
in Slot Capabilities so that removals are always handled by DPC. This
service driver was developed with that use in mind.


Re: [PATCH v12 0/6] Address error and recovery for AER and DPC

2018-03-12 Thread Keith Busch
On Mon, Mar 12, 2018 at 08:16:38PM +0530, p...@codeaurora.org wrote:
> On 2018-03-12 19:55, Keith Busch wrote:
> > On Sun, Mar 11, 2018 at 11:03:58PM -0400, Sinan Kaya wrote:
> > > On 3/11/2018 6:03 PM, Bjorn Helgaas wrote:
> > > > On Wed, Feb 28, 2018 at 10:34:11PM +0530, Oza Pawandeep wrote:
> > > 
> > > > That difference has been there since the beginning of DPC, so it has
> > > > nothing to do with *this* series EXCEPT for the fact that it really
> > > > complicates the logic you're adding to reset_link() and
> > > > broadcast_error_message().
> > > >
> > > > We ought to be able to simplify that somehow because the only real
> > > > difference between AER and DPC should be that DPC automatically
> > > > disables the link and AER does it in software.
> > > 
> > > I agree this should be possible. Code execution path should be almost
> > > identical to fatal error case.
> > > 
> > > Is there any reason why you went to stop driver path, Keith?
> > 
> > The fact is the link is truly down during a DPC event. When the link
> > is enabled again, you don't know at that point if the device(s) on the
> > other side have changed. Calling a driver's error handler for the wrong
> > device in an unknown state may have undefined results. Enumerating the
> > slot from scratch should be safe, and will assign resources, tune bus
> > settings, and bind to the matching driver.
> > 
> > Per spec, DPC is the recommended way for handling surprise removal
> > events and even recommends DPC capable slots *not* set 'Surprise'
> > in Slot Capabilities so that removals are always handled by DPC. This
> > service driver was developed with that use in mind.
> 
> Now it begs the question, that
> 
> after DPC trigger
> 
> should we enumerate the devices?
> or
> error handling callbacks, followed by stop devices followed by enumeration ?
> or
> error handling callbacks, followed by enumeration ? (no stop devices)

I'm not sure I understand. The link is disabled while DPC is triggered,
so if anything, you'd want to un-enumerate everything below the contained
port (that's what it does today).

After releasing a slot from DPC, the link is allowed to retrain. If there
is a working device on the other side, a link up event occurs. That
event is handled by the pciehp driver, and that schedules enumeration
no matter what you do to the DPC driver.


Re: [PATCH v12 0/6] Address error and recovery for AER and DPC

2018-03-12 Thread Keith Busch
On Mon, Mar 12, 2018 at 09:04:47PM +0530, p...@codeaurora.org wrote:
> On 2018-03-12 20:28, Keith Busch wrote:
> > I'm not sure I understand. The link is disabled while DPC is triggered,
> > so if anything, you'd want to un-enumerate everything below the
> > contained
> > port (that's what it does today).
> > 
> > After releasing a slot from DPC, the link is allowed to retrain. If
> > there
> > is a working device on the other side, a link up event occurs. That
> > event is handled by the pciehp driver, and that schedules enumeration
> > no matter what you do to the DPC driver.
> 
> yes, that is what i current, but this patch-set makes DPC aware of error
> handling driver callbacks.

I've been questioning the utility of doing that since the very first
version of this patch set.

> besides, in absence of pciehp there is nobody to do enumeration.

If you configure your kernel to not have a feature, you don't get to
expect the feature works.

You can still manually rescan through sysfs, /sys/bus/pci/rescan.
 
> And, I was talking about pci_stop_and_remove_bus_device() in dpc.
> if DPC calls driver's error callbacks, is it required to stop the devices  ?

DPC is PCIe's preferred surprise removal mechanism. If you don't want
the DPC driver to handle removing devices downstream the containment,
how do you want those devices get removed? Just calling driver's error
callbacks leaves the kernel's view of the topology in the wrong state.


Re: [pci PATCH v5 1/4] pci: Add pci_sriov_configure_simple for PFs that don't manage VF resources

2018-03-12 Thread Keith Busch
On Mon, Mar 12, 2018 at 10:21:29AM -0700, Alexander Duyck wrote:
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 024a1beda008..9cab9d0d51dc 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -1953,6 +1953,7 @@ static inline void pci_mmcfg_late_init(void) { }
>  int pci_vfs_assigned(struct pci_dev *dev);
>  int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs);
>  int pci_sriov_get_totalvfs(struct pci_dev *dev);
> +int pci_sriov_configure_simple(struct pci_dev *dev, int nr_virtfn);
>  resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno);
>  void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool probe);
>  #else

I recommend stubbing 'pci_sriov_configure_simple' or defining it to
NULL in the '#else' section here so you don't need to repeat the "#ifdef
CONFIG_PCI_IOV" in each driver wishing to use this function. Otherwise
looks fine to me.


Re: [PATCH v12 0/6] Address error and recovery for AER and DPC

2018-03-12 Thread Keith Busch
On Mon, Mar 12, 2018 at 01:41:07PM -0400, Sinan Kaya wrote:
> I was just writing a reply to you. You acted first :)
> 
> On 3/12/2018 1:33 PM, Keith Busch wrote:
> >>> After releasing a slot from DPC, the link is allowed to retrain. If
> >>> there
> >>> is a working device on the other side, a link up event occurs. That
> >>> event is handled by the pciehp driver, and that schedules enumeration
> >>> no matter what you do to the DPC driver.
> >> yes, that is what i current, but this patch-set makes DPC aware of error
> >> handling driver callbacks.
> > I've been questioning the utility of doing that since the very first
> > version of this patch set.
> > 
> 
> I think we should all agree that shutting down the device drivers with active
> work is not safe. There could be outstanding work that the endpoint driver
> needs to take care of. 
> 
> That was the motivation for this change so that we give endpoint drivers an 
> error callback when something goes wrong. 
> 
> The rest is implementation detail that we can all figure out.

I'm not sure if I agree here. All Linux device drivers are supposed to
cope with sudden/unexpected loss of communication at any time. This
includes cleaning up appropriately when requested to unbind from an
inaccessible device with active outstanding work.


Re: [pci PATCH v5 1/4] pci: Add pci_sriov_configure_simple for PFs that don't manage VF resources

2018-03-12 Thread Keith Busch
On Mon, Mar 12, 2018 at 11:09:34AM -0700, Alexander Duyck wrote:
> On Mon, Mar 12, 2018 at 10:40 AM, Keith Busch  wrote:
> > On Mon, Mar 12, 2018 at 10:21:29AM -0700, Alexander Duyck wrote:
> >> diff --git a/include/linux/pci.h b/include/linux/pci.h
> >> index 024a1beda008..9cab9d0d51dc 100644
> >> --- a/include/linux/pci.h
> >> +++ b/include/linux/pci.h
> >> @@ -1953,6 +1953,7 @@ static inline void pci_mmcfg_late_init(void) { }
> >>  int pci_vfs_assigned(struct pci_dev *dev);
> >>  int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs);
> >>  int pci_sriov_get_totalvfs(struct pci_dev *dev);
> >> +int pci_sriov_configure_simple(struct pci_dev *dev, int nr_virtfn);
> >>  resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno);
> >>  void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool probe);
> >>  #else
> >
> > I recommend stubbing 'pci_sriov_configure_simple' or defining it to
> > NULL in the '#else' section here so you don't need to repeat the "#ifdef
> > CONFIG_PCI_IOV" in each driver wishing to use this function. Otherwise
> > looks fine to me.
> 
> My concern with defining it as NULL is that somebody may end up
> calling it in the future directly and that may end up causing issues.
> One thought I have been debating is moving it to a different file. I
> am just not sure where the best place to put something like this would
> be. I could move this function to drivers/pci/pci.c if everyone is
> okay with it and then I could just strip the contents out by wrapping
> them in a #ifdef instead.

Okay, instead of NULL, a stub implementation in the header file may
suffice when CONFIG_PCI_IOV is not defined:

static inline int pci_sriov_configure_simple(struct pci_dev *dev, int nr_virtfn)
{
	return -ENOSYS;
}

See pci_iov_virtfn_bus, pci_iov_virtfn_devfn, pci_iov_add_virtfn, or
pci_enable_sriov for other examples.
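
With a stub like that in place, a PF driver can reference the symbol
unconditionally. Purely as an illustration (the foo_* driver names are
made up):

static struct pci_driver foo_pf_driver = {
	.name			= "foo_pf",
	.id_table		= foo_pf_id_table,
	.probe			= foo_pf_probe,
	.remove			= foo_pf_remove,
	/* no #ifdef CONFIG_PCI_IOV needed around this assignment */
	.sriov_configure	= pci_sriov_configure_simple,
};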


Re: [PATCH V2] nvme-pci: assign separate irq vectors for adminq and ioq0

2018-03-12 Thread Keith Busch
Hi Jianchao,

The patch tests fine on all hardware I had. I'd like to queue this up
for the next 4.16-rc. Could you send a v3 with the cleanup changes Andy
suggested and a changelog aligned with Ming's insights?

Thanks,
Keith


Re: [PATCH v12 0/6] Address error and recovery for AER and DPC

2018-03-12 Thread Keith Busch
On Mon, Mar 12, 2018 at 02:47:30PM -0500, Bjorn Helgaas wrote:
> [+cc Alex]
> 
> On Mon, Mar 12, 2018 at 08:25:51AM -0600, Keith Busch wrote:
> > On Sun, Mar 11, 2018 at 11:03:58PM -0400, Sinan Kaya wrote:
> > > On 3/11/2018 6:03 PM, Bjorn Helgaas wrote:
> > > > On Wed, Feb 28, 2018 at 10:34:11PM +0530, Oza Pawandeep wrote:
> > > 
> > > > That difference has been there since the beginning of DPC, so it has
> > > > nothing to do with *this* series EXCEPT for the fact that it really
> > > > complicates the logic you're adding to reset_link() and
> > > > broadcast_error_message().
> > > > 
> > > > We ought to be able to simplify that somehow because the only real
> > > > difference between AER and DPC should be that DPC automatically
> > > > disables the link and AER does it in software.
> > > 
> > > I agree this should be possible. Code execution path should be almost
> > > identical to fatal error case.
> > > 
> > > Is there any reason why you went to stop driver path, Keith?
> > 
> > The fact is the link is truly down during a DPC event. When the link
> > is enabled again, you don't know at that point if the device(s) on the
> > other side have changed.
> 
> When DPC is triggered, the port takes the link down.  When we handle
> an uncorrectable (nonfatal or fatal) AER event, software takes the
> link down.
> 
> In both cases, devices on the other side are at least reset.  Whenever
> the link goes down, it's conceivable the device could be replaced with
> a different one before the link comes back up.  Is this why you remove
> and re-enumerate?  (See tangent [1] below.)

Yes. Truthfully, DPC events due to changing topologies were the motivating
use case when we initially developed this. We were also going for
simplicity (at least initially), and remove + re-enumerate seemed
safe without concerning this driver with other capability registers, or
coordinating with/depending on other drivers. For example, a successful
reset may depend on any particular driver calling pci_restore_state from
a good saved state.

> The point is that from the device's hardware perspective, these
> scenarios are the same (it sent a ERR_NONFATAL or ERR_FATAL message
> and it sees the link go down).  I think we should make them the same
> on the software side, too: the driver should see the same callbacks,
> in the same order, whether we're doing AER or DPC.
> 
> If removing and re-enumerating is the right thing for DPC, I think
> that means it's also the right thing for AER.
> 
> Along this line, we had a question a while back about logging AER
> information after a DPC trigger.  Obviously we can't collect any
> logged information from the downstream devices while link is down, but
> I noticed the AER status bits are RW1CS, which means they're sticky
> and are not modified by hot reset or FLR.
> 
> So we may be able to read and log the AER information after the DPC
> driver brings the link back up.  We may want to do the same with AER,
> i.e., reset the downstream devices first, then collect the AER status
> bits after the link comes back up.

I totally agree. We could consult Slot and AER status to guide a
smarter approach.
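
For example, once the link is back up, reading the sticky status on the
downstream device could look roughly like this (sketch only, not part of
this series; 'child' here is the device below the containing port):

	int aer = pci_find_ext_capability(child, PCI_EXT_CAP_ID_ERR);

	if (aer) {
		u32 uncor_status, cor_status;

		pci_read_config_dword(child, aer + PCI_ERR_UNCOR_STATUS, &uncor_status);
		pci_read_config_dword(child, aer + PCI_ERR_COR_STATUS, &cor_status);
		/* RW1CS bits survive the reset, so this reflects what the
		 * device logged before containment */
	}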

> > Calling a driver's error handler for the wrong device in an unknown
> > state may have undefined results. Enumerating the slot from scratch
> > should be safe, and will assign resources, tune bus settings, and
> > bind to the matching driver.
> 
> I agree with this; I think this is heading toward doing
> remove/re-enumerate on AER errors as well because it has also reset
> the device.
> 
> > Per spec, DPC is the recommended way for handling surprise removal
> > events and even recommends DPC capable slots *not* set 'Surprise'
> > in Slot Capabilities so that removals are always handled by DPC. This
> > service driver was developed with that use in mind.
> 
> Thanks for this tip.  The only thing I've found so far is the mention
> of Surprise Down triggering DPC in the last paragraph of sec 6.7.5.
> Are there other references I should look at?  I haven't found the
> recommendation about not setting 'Surprise' in Slot Capabilities yet.

No problem, it's in the "IMPLEMENTATION NOTE" at the end of 6.2.10.4,
"Avoid Disabled Link and Hot-Plug Surprise Use with DPC".

Outside the spec, Microsemi as one of the PCI-SIG contributors and early
adopters of the capability published a white paper "Hot and Surprise
Plug Recommendations for Enterprise PCIe" providing guidance for OS
drivers using DPC. We originally developed to that guidance. The pa

Re: [PATCH v2 07/10] nvme-pci: Use PCI p2pmem subsystem to manage the CMB

2018-03-05 Thread Keith Busch
On Mon, Mar 05, 2018 at 12:33:29PM +1100, Oliver wrote:
> On Thu, Mar 1, 2018 at 10:40 AM, Logan Gunthorpe  wrote:
> > @@ -429,10 +429,7 @@ static void __nvme_submit_cmd(struct nvme_queue *nvmeq,
> >  {
> > u16 tail = nvmeq->sq_tail;
> 
> > -   if (nvmeq->sq_cmds_io)
> > -   memcpy_toio(&nvmeq->sq_cmds_io[tail], cmd, sizeof(*cmd));
> > -   else
> > -   memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
> > +   memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
> 
> Hmm, how safe is replacing memcpy_toio() with regular memcpy()? On PPC
> the _toio() variant enforces alignment, does the copy with 4 byte
> stores, and has a full barrier after the copy. In comparison our
> regular memcpy() does none of those things and may use unaligned and
> vector load/stores. For normal (cacheable) memory that is perfectly
> fine, but they can cause alignment faults when targeted at MMIO
> (cache-inhibited) memory.
> 
> I think in this particular case it might be ok since we know SEQs are
> aligned to 64 byte boundaries and the copy is too small to use our
> vectorised memcpy(). I'll assume we don't need explicit ordering
> between writes of SEQs since the existing code doesn't seem to care
> unless the doorbell is being rung, so you're probably fine there too.
> That said, I still think this is a little bit sketchy and at the very
> least you should add a comment explaining what's going on when the CMB
> is being used. If someone more familiar with the NVMe driver could
> chime in I would appreciate it.

I may not be understanding the concern, but I'll give it a shot.

You're right, the start of any SQE is always 64-byte aligned, so that
should satisfy alignment requirements.

The order when writing multiple/successive SQEs in a submission queue
does matter, and this is currently serialized through the q_lock.

The order in which the bytes of a single SQE is written doesn't really
matter as long as the entire SQE is written into the CMB prior to writing
that SQ's doorbell register.

The doorbell register is written immediately after copying a command
entry into the submission queue (ignore "shadow buffer" features),
so doorbell writes map 1:1 to submitted commands.

If a CMB SQE and DB order is not enforced with the memcpy, then we do
need a barrier after the SQE's memcpy and before the doorbell's writel.
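
If we go that way, the submit path would look roughly like this (sketch
only, ignoring the shadow doorbell path; field names follow the current
driver):

	u16 tail = nvmeq->sq_tail;

	memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));

	/* make sure the whole SQE reaches the CMB before ringing the doorbell */
	wmb();

	if (++tail == nvmeq->q_depth)
		tail = 0;
	writel(tail, nvmeq->q_db);
	nvmeq->sq_tail = tail;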


Re: [PATCH v2 07/10] nvme-pci: Use PCI p2pmem subsystem to manage the CMB

2018-03-05 Thread Keith Busch
On Mon, Mar 05, 2018 at 01:10:53PM -0700, Jason Gunthorpe wrote:
> So when reading the above mlx code, we see the first wmb() being used
> to ensure that CPU stores to cachable memory are visible to the DMA
> triggered by the doorbell ring.

IIUC, we don't need a similar barrier for NVMe to ensure memory is
visible to DMA since the SQE memory is allocated DMA coherent when the
SQ is not within a CMB.
 
> The mmiowb() is used to ensure that DB writes are not combined and not
> issued in any order other than implied by the lock that encloses the
> whole thing. This is needed because uar_map is WC memory.
> 
> We don't have ordering with respect to two writel's here, so if ARM
> performance was a concern the writel could be switched to
> writel_relaxed().
> 
> Presumably nvme has similar requirments, although I guess the DB
> register is mapped UC not WC?

Yep, the NVMe DB register is required by the spec to be mapped UC.
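
The driver maps the register BAR with a plain ioremap(), so the doorbell
region ends up UC. For illustration only (mirroring how the driver indexes
the doorbell array, not an exact quote of it):

	/* dev->dbs points at BAR offset 1000h, mapped UC via ioremap() */
	writel(tail, &dev->dbs[sqid * 2 * dev->db_stride]);			/* SQ tail */
	writel(head, &dev->dbs[cqid * 2 * dev->db_stride + dev->db_stride]);	/* CQ head */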


Re: [PATCH v13 6/6] PCI/DPC: Do not do recovery for hotplug enabled system

2018-04-12 Thread Keith Busch
On Thu, Apr 12, 2018 at 10:34:37AM -0400, Sinan Kaya wrote:
> On 4/12/2018 10:06 AM, Bjorn Helgaas wrote:
> > 
> > I think the scenario you are describing is two systems that are
> > identical except that in the first, the endpoint is below a hotplug
> > bridge, while in the second, it's below a non-hotplug bridge.  There's
> > no physical hotplug (no drive removed or inserted), and DPC is
> > triggered in both systems.
> > 
> > I suggest that DPC should be handled identically in both systems:
> > 
> >   - The PCI core should have the same view of the endpoint: it should
> > be removed and re-added in both cases (or in neither case).
> > 
> >   - The endpoint itself should not be able to tell the difference: it
> > should see a link down event, followed by a link retrain, followed
> > by the same sequence of config accesses, etc.
> > 
> >   - The endpoint driver should not be able to tell the difference,
> > i.e., we should be calling the same pci_error_handlers callbacks
> > in both cases.
> > 
> > It's true that in the non-hotplug system, pciehp probably won't start
> > re-enumeration, so we might need an alternate path to trigger that.
> > 
> > But that's not what we're doing in this patch.  In this patch we're
> > adding a much bigger difference: for hotplug bridges, we stop and
> > remove the hierarchy below the bridge; for non-hotplug bridges, we do
> > the AER-style flow of calling pci_error_handlers callbacks.
> 
> Our approach on V12 was to go to AER style recovery for all DPC events
> regardless of hotplug support or not. 
> 
> Keith was not comfortable with this approach. That's why, we special cased
> hotplug.
> 
> If we drop 6/6 on this patch on v13, we achieve this. We still have to
> take care of Keith's inputs on individual patches.
> 
> we have been struggling with the direction for a while.
> 
> Keith, what do you think?

My only concern was for existing production environments that use DPC
for handling surprise removal, and I don't wish to break the existing
uses.


Re: [PATCH v13 6/6] PCI/DPC: Do not do recovery for hotplug enabled system

2018-04-12 Thread Keith Busch
On Thu, Apr 12, 2018 at 08:39:54AM -0600, Keith Busch wrote:
> On Thu, Apr 12, 2018 at 10:34:37AM -0400, Sinan Kaya wrote:
> > On 4/12/2018 10:06 AM, Bjorn Helgaas wrote:
> > > 
> > > I think the scenario you are describing is two systems that are
> > > identical except that in the first, the endpoint is below a hotplug
> > > bridge, while in the second, it's below a non-hotplug bridge.  There's
> > > no physical hotplug (no drive removed or inserted), and DPC is
> > > triggered in both systems.
> > > 
> > > I suggest that DPC should be handled identically in both systems:
> > > 
> > >   - The PCI core should have the same view of the endpoint: it should
> > > be removed and re-added in both cases (or in neither case).
> > > 
> > >   - The endpoint itself should not be able to tell the difference: it
> > > should see a link down event, followed by a link retrain, followed
> > > by the same sequence of config accesses, etc.
> > > 
> > >   - The endpoint driver should not be able to tell the difference,
> > > i.e., we should be calling the same pci_error_handlers callbacks
> > > in both cases.
> > > 
> > > It's true that in the non-hotplug system, pciehp probably won't start
> > > re-enumeration, so we might need an alternate path to trigger that.
> > > 
> > > But that's not what we're doing in this patch.  In this patch we're
> > > adding a much bigger difference: for hotplug bridges, we stop and
> > > remove the hierarchy below the bridge; for non-hotplug bridges, we do
> > > the AER-style flow of calling pci_error_handlers callbacks.
> > 
> > Our approach on V12 was to go to AER style recovery for all DPC events
> > regardless of hotplug support or not. 
> > 
> > Keith was not comfortable with this approach. That's why, we special cased
> > hotplug.
> > 
> > If we drop 6/6 on this patch on v13, we achieve this. We still have to
> > take care of Keith's inputs on individual patches.
> > 
> > we have been struggling with the direction for a while.
> > 
> > Keith, what do you think?
> 
> My only concern was for existing production environments that use DPC
> for handling surprise removal, and I don't wish to break the existing
> uses.

Also, I thought the plan was to keep hotplug and non-hotplug the same,
except for the very end: if not a hotplug bridge, initiate the rescan
automatically after releasing from containment, otherwise let pciehp
handle it when the link reactivates.


Re: [PATCH v13 6/6] PCI/DPC: Do not do recovery for hotplug enabled system

2018-04-12 Thread Keith Busch
On Thu, Apr 12, 2018 at 12:27:20PM -0400, Sinan Kaya wrote:
> On 4/12/2018 11:02 AM, Keith Busch wrote:
> > 
> > Also, I thought the plan was to keep hotplug and non-hotplug the same,
> > except for the very end: if not a hotplug bridge, initiate the rescan
> > automatically after releasing from containment, otherwise let pciehp
> > handle it when the link reactivates.
> > 
> 
> Hmm...
> 
> AER driver doesn't do stop and rescan approach for fatal errors. AER driver
> makes an error callback followed by secondary bus reset and finally driver
> the resume callback on the endpoint only if link recovery is successful.
> Otherwise, AER driver bails out with recovery unsuccessful message.

I'm not sure if that's necessarily true. People have reported AER
handling triggers PCIe hotplug events, and creates some interesting race
conditions:

https://marc.info/?l=linux-pci&m=152336615707640&w=2

https://www.spinics.net/lists/linux-pci/msg70614.html

> Why do we need an additional rescan in the DPC driver if the link is up
> and driver resumes operation?

I thought the plan was to have DPC always go through the removal path
to ensure all devices are properly configured when containment is
released. In order to reconfigure those, you'll need to initiate the
rescan from somewhere.
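
The rescan itself is small; something along these lines after containment
is released would do it (sketch; 'pdev' here is the DPC-capable downstream
port):

	pci_lock_rescan_remove();
	pci_rescan_bus(pdev->subordinate);
	pci_unlock_rescan_remove();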


Re: [PATCH] NVMe: Add Quirk Delay before CHK RDY for Seagate Nytro Flash Storage

2018-04-12 Thread Keith Busch
Thanks, applied for 4.17-rc1.

I was a little surprised git was able to apply this since the patch
format is off, but it worked!


Re: [PATCH 08/12] lightnvm: implement get log report chunk helpers

2018-03-21 Thread Keith Busch
On Wed, Mar 21, 2018 at 03:06:05AM -0700, Matias Bjørling wrote:
> > outside of nvme core so that we can use it form lightnvm.
> > 
> > Signed-off-by: Javier González 
> > ---
> >   drivers/lightnvm/core.c  | 11 +++
> >   drivers/nvme/host/core.c |  6 ++--
> >   drivers/nvme/host/lightnvm.c | 74 
> > 
> >   drivers/nvme/host/nvme.h |  3 ++
> >   include/linux/lightnvm.h | 24 ++
> >   5 files changed, 115 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > index 2e9e9f973a75..af642ce6ba69 100644
> > --- a/drivers/nvme/host/core.c
> > +++ b/drivers/nvme/host/core.c
> > @@ -2127,9 +2127,9 @@ static int nvme_init_subsystem(struct nvme_ctrl 
> > *ctrl, struct nvme_id_ctrl *id)
> > return ret;
> >   }
> >   
> > -static int nvme_get_log_ext(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
> > -   u8 log_page, void *log,
> > -   size_t size, size_t offset)
> > +int nvme_get_log_ext(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
> > +u8 log_page, void *log,
> > +size_t size, size_t offset)
> >   {
> > struct nvme_command c = { };
> > unsigned long dwlen = size / 4 - 1;
> > diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
> > index 08f0f6b5bc06..ffd64a83c8c3 100644
> > --- a/drivers/nvme/host/lightnvm.c
> > +++ b/drivers/nvme/host/lightnvm.c
> > @@ -35,6 +35,10 @@ enum nvme_nvm_admin_opcode {
> > nvme_nvm_admin_set_bb_tbl   = 0xf1,
> >   };
> >   
> 
> 
> 
> >   
> > diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> > index 1ca08f4993ba..505f797f8c6c 100644
> > --- a/drivers/nvme/host/nvme.h
> > +++ b/drivers/nvme/host/nvme.h
> > @@ -396,6 +396,9 @@ int nvme_reset_ctrl(struct nvme_ctrl *ctrl);
> >   int nvme_delete_ctrl(struct nvme_ctrl *ctrl);
> >   int nvme_delete_ctrl_sync(struct nvme_ctrl *ctrl);
> >   
> > +int nvme_get_log_ext(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
> > +u8 log_page, void *log, size_t size, size_t offset);
> > +
> >   extern const struct attribute_group nvme_ns_id_attr_group;
> >   extern const struct block_device_operations nvme_ns_head_ops;
> >   
> 
> 
> Keith, Christoph, Sagi, Is it okay that these two changes that exposes 
> the nvme_get_log_ext fn are carried through Jens' tree after the nvme 
> tree for 4.17 has been pulled?

That's okay with me. Alternatively, if you want to split the generic nvme
part out, I can apply that immediately and the API will be in the first
nvme-4.17 pull request.


Re: [RFC PATCH] nvme: avoid race-conditions when enabling devices

2018-03-21 Thread Keith Busch
On Wed, Mar 21, 2018 at 11:48:09PM +0800, Ming Lei wrote:
> On Wed, Mar 21, 2018 at 01:10:31PM +0100, Marta Rybczynska wrote:
> > > On Wed, Mar 21, 2018 at 12:00:49PM +0100, Marta Rybczynska wrote:
> > >> NVMe driver uses threads for the work at device reset, including enabling
> > >> the PCIe device. When multiple NVMe devices are initialized, their reset
> > >> works may be scheduled in parallel. Then pci_enable_device_mem can be
> > >> called in parallel on multiple cores.
> > >> 
> > >> This causes a loop of enabling of all upstream bridges in
> > >> pci_enable_bridge(). pci_enable_bridge() causes multiple operations
> > >> including __pci_set_master and architecture-specific functions that
> > >> call ones like and pci_enable_resources(). Both __pci_set_master()
> > >> and pci_enable_resources() read PCI_COMMAND field in the PCIe space
> > >> and change it. This is done as read/modify/write.
> > >> 
> > >> Imagine that the PCIe tree looks like:
> > >> A - B - switch -  C - D
> > >>\- E - F
> > >> 
> > >> D and F are two NVMe disks and all devices from B are not enabled and bus
> > >> mastering is not set. If their reset work are scheduled in parallel the 
> > >> two
> > >> modifications of PCI_COMMAND may happen in parallel without locking and 
> > >> the
> > >> system may end up with the part of PCIe tree not enabled.
> > > 
> > > Then looks serialized reset should be used, and I did see the commit
> > > 79c48ccf2fe ("nvme-pci: serialize pci resets") fixes issue of 'failed
> > > to mark controller state' in reset stress test.
> > > 
> > > But that commit only covers case of PCI reset from sysfs attribute, and
> > > maybe other cases need to be dealt with in similar way too.
> > > 
> > 
> > It seems to me that the serialized reset works for multiple resets of the
> > same device, doesn't it? Our problem is linked to resets of different 
> > devices
> > that share the same PCIe tree.
> 
> Given reset shouldn't be a frequent action, it might be fine to serialize all
> reset from different devices.

The driver was much simpler when we had serialized resets in line with
probe, but that had bigger problems with certain init systems when
you put enough nvme devices in your server, making them unbootable.

Would it be okay to serialize just the pci_enable_device across all
other tasks messing with the PCI topology?

---
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index cef5ce851a92..e0a2f6c0f1cf 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2094,8 +2094,11 @@ static int nvme_pci_enable(struct nvme_dev *dev)
 	int result = -ENOMEM;
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
 
-	if (pci_enable_device_mem(pdev))
-		return result;
+	pci_lock_rescan_remove();
+	result = pci_enable_device_mem(pdev);
+	pci_unlock_rescan_remove();
+	if (result)
+		return -ENODEV;
 
 	pci_set_master(pdev);

--


Re: [PATCH] nvme: make nvme_get_log_ext non-static

2018-03-21 Thread Keith Busch
On Wed, Mar 21, 2018 at 08:27:07PM +0100, Matias Bjørling wrote:
> Enable the lightnvm integration to use the nvme_get_log_ext()
> function.
> 
> Signed-off-by: Matias Bjørling 

Thanks, applied to nvme-4.17.


Re: [PATCH v13 1/6] PCI/AER: Rename error recovery to generic PCI naming

2018-04-09 Thread Keith Busch
On Mon, Apr 09, 2018 at 10:41:49AM -0400, Oza Pawandeep wrote:
> This patch renames error recovery to generic name with pcie prefix
> 
> Signed-off-by: Oza Pawandeep 

Looks fine.

Reviewed-by: Keith Busch 


Re: [PATCH v13 2/6] PCI/AER: Factor out error reporting from AER

2018-04-09 Thread Keith Busch
On Mon, Apr 09, 2018 at 10:41:50AM -0400, Oza Pawandeep wrote:
> This patch factors out error reporting callbacks, which are currently
> tightly coupled with AER.
> 
> DPC should be able to register callbacks and attempt recovery when DPC
> trigger event occurs.
> 
> Signed-off-by: Oza Pawandeep 

Looks fine.

Reviewed-by: Keith Busch 


Re: [PATCH v13 3/6] PCI/PORTDRV: Implement generic find service

2018-04-09 Thread Keith Busch
On Mon, Apr 09, 2018 at 10:41:51AM -0400, Oza Pawandeep wrote:
> This patch implements generic pcie_port_find_service() routine.
> 
> Signed-off-by: Oza Pawandeep 

Looks good.

Reviewed-by: Keith Busch 


Re: [PATCH v13 5/6] PCI: Unify wait for link active into generic PCI

2018-04-09 Thread Keith Busch
On Mon, Apr 09, 2018 at 10:41:53AM -0400, Oza Pawandeep wrote:
> +/**
> + * pcie_wait_for_link - Wait for link till it's active/inactive
> + * @pdev: Bridge device
> + * @active: waiting for active or inactive ?
> + *
> + * Use this to wait till link becomes active or inactive.
> + */
> +bool pcie_wait_for_link(struct pci_dev *pdev, bool active)
> +{
> +	int timeout = 1000;
> +	bool ret;
> +	u16 lnk_status;
> +
> +	for (;;) {
> +		pcie_capability_read_word(pdev, PCI_EXP_LNKSTA, &lnk_status);
> +		ret = !!(lnk_status & PCI_EXP_LNKSTA_DLLLA);
> +		if (ret == active)
> +			return true;
> +		if (timeout <= 0)
> +			break;
> +		timeout -= 10;
> +	}

This is missing an msleep(10) at each iteration.

> +
> +	pci_info(pdev, "Data Link Layer Link Active not %s in 1000 msec\n",
> +		 active ? "set" : "cleared");
> +
> +	return false;
> +}
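
To spell it out, the loop I had in mind looks like this (same code with
the sleep added, untested):

	for (;;) {
		pcie_capability_read_word(pdev, PCI_EXP_LNKSTA, &lnk_status);
		ret = !!(lnk_status & PCI_EXP_LNKSTA_DLLLA);
		if (ret == active)
			return true;
		if (timeout <= 0)
			break;
		msleep(10);
		timeout -= 10;
	}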


Re: [PATCH v13 4/6] PCI/DPC: Unify and plumb error handling into DPC

2018-04-09 Thread Keith Busch
On Mon, Apr 09, 2018 at 10:41:52AM -0400, Oza Pawandeep wrote:
> +static int find_dpc_dev_iter(struct device *device, void *data)
> +{
> +	struct pcie_port_service_driver *service_driver;
> +	struct device **dev;
> +
> +	dev = (struct device **) data;
> +
> +	if (device->bus == &pcie_port_bus_type && device->driver) {
> +		service_driver = to_service_driver(device->driver);
> +		if (service_driver->service == PCIE_PORT_SERVICE_DPC) {
> +			*dev = device;
> +			return 1;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +static struct device *pci_find_dpc_dev(struct pci_dev *pdev)
> +{
> +	struct device *dev = NULL;
> +
> +	device_for_each_child(&pdev->dev, &dev, find_dpc_dev_iter);
> +
> +	return dev;
> +}

The only caller of this doesn't seem to care to use struct device. This
should probably just extract struct dpc_dev directly from in here.
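
Something like this, for instance (sketch; assumes dpc_probe() keeps
storing the dpc_dev with set_service_data(), as it does today):

static int find_dpc_dev_iter(struct device *device, void *data)
{
	struct dpc_dev **dpc = data;
	struct pcie_port_service_driver *drv;

	if (device->bus != &pcie_port_bus_type || !device->driver)
		return 0;

	drv = to_service_driver(device->driver);
	if (drv->service != PCIE_PORT_SERVICE_DPC)
		return 0;

	*dpc = get_service_data(to_pcie_device(device));
	return 1;
}

Then the caller just passes a struct dpc_dev ** and skips the struct
device round trip.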


Re: [PATCH V2] nvme-pci: assign separate irq vectors for adminq and ioq0

2018-03-09 Thread Keith Busch
On Thu, Mar 08, 2018 at 08:42:20AM +0100, Christoph Hellwig wrote:
> 
> So I suspect we'll need to go with a patch like this, just with a way
> better changelog.

I have to agree this is required for that use case. I'll run some
quick tests and propose an alternate changelog.

Longer term, the current way we're including offline present cpus either
(a) has the driver allocate resources it can't use or (b) spreads the
ones it can use thinner than they need to be. Why don't we rerun the
irq spread under a hot cpu notifier for only online CPUs?


Re: [PATCH V3] nvme-pci: assign separate irq vectors for adminq and ioq1

2018-03-13 Thread Keith Busch
On Tue, Mar 13, 2018 at 06:45:00PM +0800, Ming Lei wrote:
> On Tue, Mar 13, 2018 at 05:58:08PM +0800, Jianchao Wang wrote:
> > Currently, adminq and ioq1 share the same irq vector which is set
> > affinity to cpu0. If a system allows cpu0 to be offlined, the adminq
> > will not be able work any more.
> > 
> > To fix this, assign separate irq vectors for adminq and ioq1. Set
> > .pre_vectors == 1 when allocate irq vectors, then assign the first
> > one to adminq which will have affinity cpumask with all possible
> > cpus. On the other hand, if controller has only legacy or single
> > -message MSI, we will setup adminq and 1 ioq and let them share
> > the only one irq vector.
> > 
> > Signed-off-by: Jianchao Wang 
> 
> Reviewed-by: Ming Lei 

Thanks, applied with an updated changelog.

Not being able to use the admin queue is a pretty big deal, so it's pushed
to the next nvme 4.16-rc branch. This may even be good stable material.


Re: [PATCH] nvme-pci: disable APST for Samsung NVMe SSD 960 EVO + ASUS PRIME Z370-A

2018-03-14 Thread Keith Busch
Thanks, applied for 4.17.

