Re: [PATCH v3] PCI: Reprogram bridge prefetch registers on resume

2018-09-27 Thread Bjorn Helgaas
[+cc LKML]

On Tue, Sep 18, 2018 at 04:32:44PM -0500, Bjorn Helgaas wrote:
> On Thu, Sep 13, 2018 at 11:37:45AM +0800, Daniel Drake wrote:
> > On 38+ Intel-based Asus products, the nvidia GPU becomes unusable
> > after S3 suspend/resume. The affected products include multiple
> > generations of nvidia GPUs and Intel SoCs. After resume, nouveau logs
> > many errors such as:
> > 
> > fifo: fault 00 [READ] at 00555000 engine 00 [GR] client 04
> >   [HUB/FE] reason 4a [] on channel -1 [007fa91000 unknown]
> > DRM: failed to idle channel 0 [DRM]
> > 
> > Similarly, the nvidia proprietary driver also fails after resume
> > (black screen, 100% CPU usage in Xorg process). We shipped a sample
> > to Nvidia for diagnosis, and their response indicated that it's a
> > problem with the parent PCI bridge (on the Intel SoC), not the GPU.
> > 
> > Runtime suspend/resume works fine, only S3 suspend is affected.
> > 
> > We found a workaround: on resume, rewrite the Intel PCI bridge
> > 'Prefetchable Base Upper 32 Bits' register (PCI_PREF_BASE_UPPER32). In
> > the cases that I checked, this register has value 0 and we just have to
> > rewrite that value.
> > 
> > Linux already saves and restores PCI config space during suspend/resume,
> > but this register was being skipped because upon resume, it already
> > has value 0 (the correct, pre-suspend value).
> > 
> > Intel appear to have previously acknowledged this behaviour and the
> > requirement to rewrite this register.
> > https://bugzilla.kernel.org/show_bug.cgi?id=116851#c23
> > 
> > Based on that, rewrite the prefetch register values even when that
> > appears unnecessary.
> > 
> > We have confirmed this solution on all the affected models we have
> > in-hands (X542UQ, UX533FD, X530UN, V272UN).
> > 
> > Additionally, this solves an issue where r8169 MSI-X interrupts were
> > broken after S3 suspend/resume on Asus X441UAR. This issue was recently
> > worked around in commit 7bb05b85bc2d ("r8169: don't use MSI-X on
> > RTL8106e"). It also fixes the same issue on RTL6186evl/8111evl on an
> > Aimfor-tech laptop that we had not yet patched. I suspect it will also
> > fix the issue that was worked around in commit 7c53a722459c ("r8169:
> > don't use MSI-X on RTL8168g").
> > 
> > Thomas Martitz reports that this change also solves an issue where
> > the AMD Radeon Polaris 10 GPU on the HP Zbook 14u G5 is unresponsive
> > after S3 suspend/resume.
> > 
> > Link: https://bugzilla.kernel.org/show_bug.cgi?id=201069
> > Signed-off-by: Daniel Drake 
> 
> Applied with Rafael's and Peter's reviewed-by to pci/enumeration for v4.20.
> Thanks for the huge investigative effort!

Since this looks low-risk and fixes several painful issues, I think
this merits a stable tag and being included in v4.19 (instead of
waiting for v4.20).  

I moved it to for-linus for v4.19.  Let me know if you object.

> > ---
> >  drivers/pci/pci.c | 25 +
> >  1 file changed, 17 insertions(+), 8 deletions(-)
> > 
> > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > index 29ff9619b5fa..5d58220b6997 100644
> > --- a/drivers/pci/pci.c
> > +++ b/drivers/pci/pci.c
> > @@ -1289,12 +1289,12 @@ int pci_save_state(struct pci_dev *dev)
> >  EXPORT_SYMBOL(pci_save_state);
> >  
> >  static void pci_restore_config_dword(struct pci_dev *pdev, int offset,
> > -u32 saved_val, int retry)
> > +u32 saved_val, int retry, bool force)
> >  {
> > u32 val;
> >  
> > pci_read_config_dword(pdev, offset, &val);
> > -   if (val == saved_val)
> > +   if (!force && val == saved_val)
> > return;
> >  
> > for (;;) {
> > @@ -1313,25 +1313,34 @@ static void pci_restore_config_dword(struct pci_dev *pdev, int offset,
> >  }
> >  
> >  static void pci_restore_config_space_range(struct pci_dev *pdev,
> > -  int start, int end, int retry)
> > +  int start, int end, int retry,
> > +  bool force)
> >  {
> > int index;
> >  
> > for (index = end; index >= start; index--)
> > pci_restore_config_dword(pdev, 4 * index,
> >  pdev->saved_config_space[index],
> > -retry);
> > +retry, force);
> >  }
> >  
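(The archive cuts the patch off here.  For context, the remainder applies
the new "force" flag when restoring bridge registers; a sketch of the
final hunk's effect, reconstructed from the description above rather than
quoted verbatim from the applied commit:)

static void pci_restore_config_space(struct pci_dev *pdev)
{
        if (pdev->hdr_type == PCI_HEADER_TYPE_NORMAL) {
                pci_restore_config_space_range(pdev, 10, 15, 0, false);

                /* Restore BARs before the command register. */
                pci_restore_config_space_range(pdev, 4, 9, 10, false);
                pci_restore_config_space_range(pdev, 0, 3, 0, false);
        } else if (pdev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
                pci_restore_config_space_range(pdev, 12, 15, 0, false);

                /*
                 * Force rewriting of prefetch registers to avoid S3
                 * resume issues on Intel PCI bridges that occur when
                 * these registers are not explicitly written.
                 */
                pci_restore_config_space_range(pdev, 9, 11, 0, true);
                pci_restore_config_space_range(pdev, 0, 8, 0, false);
        } else {
                pci_restore_config_space_range(pdev, 0, 15, 0, false);
        }
}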

Re: [net-next 10/16] net/mlx5: Support PCIe buffer congestion handling via Devlink

2018-07-31 Thread Bjorn Helgaas
On Mon, Jul 30, 2018 at 08:19:50PM -0700, Alexander Duyck wrote:
> On Mon, Jul 30, 2018 at 7:33 PM, Bjorn Helgaas  wrote:
> > On Mon, Jul 30, 2018 at 08:02:48AM -0700, Alexander Duyck wrote:
> >> On Mon, Jul 30, 2018 at 7:07 AM, Bjorn Helgaas  wrote:
> >> > On Sun, Jul 29, 2018 at 03:00:28PM -0700, Alexander Duyck wrote:
> >> >> On Sun, Jul 29, 2018 at 2:23 AM, Moshe Shemesh  
> >> >> wrote:
> >> >> > On Sat, Jul 28, 2018 at 7:06 PM, Bjorn Helgaas  
> >> >> > wrote:
> >> >> >> On Thu, Jul 26, 2018 at 07:00:20AM -0700, Alexander Duyck wrote:
> >> >> >> > On Thu, Jul 26, 2018 at 12:14 AM, Jiri Pirko  
> >> >> >> > wrote:
> >> >> >> > > Thu, Jul 26, 2018 at 02:43:59AM CEST, 
> >> >> >> > > jakub.kicin...@netronome.com
> >> >> >> > > wrote:
> >> >> >> > >>On Wed, 25 Jul 2018 08:23:26 -0700, Alexander Duyck wrote:
> >> >> >> > >>> On Wed, Jul 25, 2018 at 5:31 AM, Eran Ben Elisha wrote:
> >> >> >> > >>> > On 7/24/2018 10:51 PM, Jakub Kicinski wrote:
> >> >> >> > >>> >>>> The devlink params haven't been upstream even for a full 
> >> >> >> > >>> >>>> cycle
> >> >> >> > >>> >>>> and
> >> >> >> > >>> >>>> already you guys are starting to use them to configure 
> >> >> >> > >>> >>>> standard
> >> >> >> > >>> >>>> features like queuing.
> >> >> >> > >>> >>>
> >> >> >> > >>> >>> We developed the devlink params in order to support 
> >> >> >> > >>> >>> non-standard
> >> >> >> > >>> >>> configuration only. And for non-standard, there are 
> >> >> >> > >>> >>> generic and
> >> >> >> > >>> >>> vendor
> >> >> >> > >>> >>> specific options.
> >> >> >> > >>> >>
> >> >> >> > >>> >> I thought it was developed for performing non-standard and
> >> >> >> > >>> >> possibly
> >> >> >> > >>> >> vendor specific configuration.  Look at 
> >> >> >> > >>> >> DEVLINK_PARAM_GENERIC_*
> >> >> >> > >>> >> for
> >> >> >> > >>> >> examples of well justified generic options for which we 
> >> >> >> > >>> >> have no
> >> >> >> > >>> >> other API.  The vendor mlx4 options look fairly vendor 
> >> >> >> > >>> >> specific
> >> >> >> > >>> >> if you
> >> >> >> > >>> >> ask me, too.
> >> >> >> > >>> >>
> >> >> >> > >>> >> Configuring queuing has an API.  The question is whether it's 
> >> >> >> > >>> >> acceptable to
> >> >> >> > >>> >> enter
> >> >> >> > >>> >> into the risky territory of controlling offloads via devlink
> >> >> >> > >>> >> parameters
> >> >> >> > >>> >> or would we rather make vendors take the time and effort to 
> >> >> >> > >>> >> model
> >> >> >> > >>> >> things to (a subset) of existing APIs.  The HW never fits 
> >> >> >> > >>> >> the
> >> >> >> > >>> >> APIs
> >> >> >> > >>> >> perfectly.
> >> >> >> > >>> >
> >> >> >> > >>> > I understand what you meant here, I would like to highlight 
> >> >> >> > >>> > that
> >> >> >> > >>> > this
> >> >> >> > >>> > mechanism was not meant to handle SRIOV, Representors, etc.
> >> >> >> > >>> > The vendor specific configuration suggested here is to 

Re: [net-next 10/16] net/mlx5: Support PCIe buffer congestion handling via Devlink

2018-07-30 Thread Bjorn Helgaas
On Mon, Jul 30, 2018 at 08:02:48AM -0700, Alexander Duyck wrote:
> On Mon, Jul 30, 2018 at 7:07 AM, Bjorn Helgaas  wrote:
> > On Sun, Jul 29, 2018 at 03:00:28PM -0700, Alexander Duyck wrote:
> >> On Sun, Jul 29, 2018 at 2:23 AM, Moshe Shemesh  
> >> wrote:
> >> > On Sat, Jul 28, 2018 at 7:06 PM, Bjorn Helgaas  
> >> > wrote:
> >> >> On Thu, Jul 26, 2018 at 07:00:20AM -0700, Alexander Duyck wrote:
> >> >> > On Thu, Jul 26, 2018 at 12:14 AM, Jiri Pirko  wrote:
> >> >> > > Thu, Jul 26, 2018 at 02:43:59AM CEST, jakub.kicin...@netronome.com
> >> >> > > wrote:
> >> >> > >>On Wed, 25 Jul 2018 08:23:26 -0700, Alexander Duyck wrote:
> >> >> > >>> On Wed, Jul 25, 2018 at 5:31 AM, Eran Ben Elisha wrote:
> >> >> > >>> > On 7/24/2018 10:51 PM, Jakub Kicinski wrote:
> >> >> > >>> >>>> The devlink params haven't been upstream even for a full 
> >> >> > >>> >>>> cycle
> >> >> > >>> >>>> and
> >> >> > >>> >>>> already you guys are starting to use them to configure 
> >> >> > >>> >>>> standard
> >> >> > >>> >>>> features like queuing.
> >> >> > >>> >>>
> >> >> > >>> >>> We developed the devlink params in order to support 
> >> >> > >>> >>> non-standard
> >> >> > >>> >>> configuration only. And for non-standard, there are generic 
> >> >> > >>> >>> and
> >> >> > >>> >>> vendor
> >> >> > >>> >>> specific options.
> >> >> > >>> >>
> >> >> > >>> >> I thought it was developed for performing non-standard and
> >> >> > >>> >> possibly
> >> >> > >>> >> vendor specific configuration.  Look at DEVLINK_PARAM_GENERIC_*
> >> >> > >>> >> for
> >> >> > >>> >> examples of well justified generic options for which we have no
> >> >> > >>> >> other API.  The vendor mlx4 options look fairly vendor specific
> >> >> > >>> >> if you
> >> >> > >>> >> ask me, too.
> >> >> > >>> >>
> >> >> > >>> >> Configuring queuing has an API.  The question is whether it's acceptable 
> >> >> > >>> >> to
> >> >> > >>> >> enter
> >> >> > >>> >> into the risky territory of controlling offloads via devlink
> >> >> > >>> >> parameters
> >> >> > >>> >> or would we rather make vendors take the time and effort to 
> >> >> > >>> >> model
> >> >> > >>> >> things to (a subset) of existing APIs.  The HW never fits the
> >> >> > >>> >> APIs
> >> >> > >>> >> perfectly.
> >> >> > >>> >
> >> >> > >>> > I understand what you meant here, I would like to highlight that
> >> >> > >>> > this
> >> >> > >>> > mechanism was not meant to handle SRIOV, Representors, etc.
> >> >> > >>> > The vendor specific configuration suggested here is to handle a
> >> >> > >>> > congestion
> >> >> > >>> > state in Multi Host environment (which includes PF and multiple
> >> >> > >>> > VFs per
> >> >> > >>> > host), where one host is not aware of the other hosts, and each 
> >> >> > >>> > is
> >> >> > >>> > running
> >> >> > >>> > on its own pci/driver. It is a device working mode 
> >> >> > >>> > configuration.
> >> >> > >>> >
> >> >> > >>> > This  couldn't fit into any existing API, thus creating this
> >> >> > >>> > vendor specific
> >> >> > >>> > unique API is needed.
> >> >> > >>>
> >> >> > >>> If we are just goin

Re: [net-next 10/16] net/mlx5: Support PCIe buffer congestion handling via Devlink

2018-07-30 Thread Bjorn Helgaas
On Sun, Jul 29, 2018 at 03:00:28PM -0700, Alexander Duyck wrote:
> On Sun, Jul 29, 2018 at 2:23 AM, Moshe Shemesh  wrote:
> > On Sat, Jul 28, 2018 at 7:06 PM, Bjorn Helgaas  wrote:
> >> On Thu, Jul 26, 2018 at 07:00:20AM -0700, Alexander Duyck wrote:
> >> > On Thu, Jul 26, 2018 at 12:14 AM, Jiri Pirko  wrote:
> >> > > Thu, Jul 26, 2018 at 02:43:59AM CEST, jakub.kicin...@netronome.com
> >> > > wrote:
> >> > >>On Wed, 25 Jul 2018 08:23:26 -0700, Alexander Duyck wrote:
> >> > >>> On Wed, Jul 25, 2018 at 5:31 AM, Eran Ben Elisha wrote:
> >> > >>> > On 7/24/2018 10:51 PM, Jakub Kicinski wrote:
> >> > >>> >>>> The devlink params haven't been upstream even for a full cycle
> >> > >>> >>>> and
> >> > >>> >>>> already you guys are starting to use them to configure standard
> >> > >>> >>>> features like queuing.
> >> > >>> >>>
> >> > >>> >>> We developed the devlink params in order to support non-standard
> >> > >>> >>> configuration only. And for non-standard, there are generic and
> >> > >>> >>> vendor
> >> > >>> >>> specific options.
> >> > >>> >>
> >> > >>> >> I thought it was developed for performing non-standard and
> >> > >>> >> possibly
> >> > >>> >> vendor specific configuration.  Look at DEVLINK_PARAM_GENERIC_*
> >> > >>> >> for
> >> > >>> >> examples of well justified generic options for which we have no
> >> > >>> >> other API.  The vendor mlx4 options look fairly vendor specific
> >> > >>> >> if you
> >> > >>> >> ask me, too.
> >> > >>> >>
> >> > >>> >> Configuring queuing has an API.  The question is whether it's acceptable to
> >> > >>> >> enter
> >> > >>> >> into the risky territory of controlling offloads via devlink
> >> > >>> >> parameters
> >> > >>> >> or would we rather make vendors take the time and effort to model
> >> > >>> >> things to (a subset) of existing APIs.  The HW never fits the
> >> > >>> >> APIs
> >> > >>> >> perfectly.
> >> > >>> >
> >> > >>> > I understand what you meant here, I would like to highlight that
> >> > >>> > this
> >> > >>> > mechanism was not meant to handle SRIOV, Representors, etc.
> >> > >>> > The vendor specific configuration suggested here is to handle a
> >> > >>> > congestion
> >> > >>> > state in Multi Host environment (which includes PF and multiple
> >> > >>> > VFs per
> >> > >>> > host), where one host is not aware of the other hosts, and each is
> >> > >>> > running
> >> > >>> > on its own pci/driver. It is a device working mode configuration.
> >> > >>> >
> >> > >>> > This  couldn't fit into any existing API, thus creating this
> >> > >>> > vendor specific
> >> > >>> > unique API is needed.
> >> > >>>
> >> > >>> If we are just going to start creating devlink interfaces for
> >> > >>> every
> >> > >>> one-off option a device wants to add why did we even bother with
> >> > >>> trying to prevent drivers from using sysfs? This just feels like we
> >> > >>> are back to the same arguments we had back in the day with it.
> >> > >>>
> >> > >>> I feel like the bigger question here is if devlink is how we are
> >> > >>> going
> >> > >>> to deal with all PCIe related features going forward, or should we
> >> > >>> start looking at creating a new interface/tool for PCI/PCIe related
> >> > >>> features? My concern is that we have already had features such as
> >> > >>> DMA
> >> > >>> Coalescing that didn't really fit into anything and now we are
> >> > >>> starting to see other things related to DMA and PCIe bus credits.

Re: [net-next 10/16] net/mlx5: Support PCIe buffer congestion handling via Devlink

2018-07-28 Thread Bjorn Helgaas
On Thu, Jul 26, 2018 at 07:00:20AM -0700, Alexander Duyck wrote:
> On Thu, Jul 26, 2018 at 12:14 AM, Jiri Pirko  wrote:
> > Thu, Jul 26, 2018 at 02:43:59AM CEST, jakub.kicin...@netronome.com wrote:
> >>On Wed, 25 Jul 2018 08:23:26 -0700, Alexander Duyck wrote:
> >>> On Wed, Jul 25, 2018 at 5:31 AM, Eran Ben Elisha wrote:
> >>> > On 7/24/2018 10:51 PM, Jakub Kicinski wrote:
> >>> >>>> The devlink params haven't been upstream even for a full cycle and
> >>> >>>> already you guys are starting to use them to configure standard
> >>> >>>> features like queuing.
> >>> >>>
> >>> >>> We developed the devlink params in order to support non-standard
> >>> >>> configuration only. And for non-standard, there are generic and vendor
> >>> >>> specific options.
> >>> >>
> >>> >> I thought it was developed for performing non-standard and possibly
> >>> >> vendor specific configuration.  Look at DEVLINK_PARAM_GENERIC_* for
> >>> >> examples of well justified generic options for which we have no
> >>> >> other API.  The vendor mlx4 options look fairly vendor specific if you
> >>> >> ask me, too.
> >>> >>
> >>> >> Configuring queuing has an API.  The question is whether it's acceptable to enter
> >>> >> into the risky territory of controlling offloads via devlink parameters
> >>> >> or would we rather make vendors take the time and effort to model
> >>> >> things to (a subset) of existing APIs.  The HW never fits the APIs
> >>> >> perfectly.
> >>> >
> >>> > I understand what you meant here, I would like to highlight that this
> >>> > mechanism was not meant to handle SRIOV, Representors, etc.
> >>> > The vendor specific configuration suggested here is to handle a 
> >>> > congestion
> >>> > state in Multi Host environment (which includes PF and multiple VFs per
> >>> > host), where one host is not aware of the other hosts, and each is 
> >>> > running
> >>> > on its own pci/driver. It is a device working mode configuration.
> >>> >
> >>> > This  couldn't fit into any existing API, thus creating this vendor 
> >>> > specific
> >>> > unique API is needed.
> >>>
> >>> If we are just going to start creating devlink interfaces for every
> >>> one-off option a device wants to add why did we even bother with
> >>> trying to prevent drivers from using sysfs? This just feels like we
> >>> are back to the same arguments we had back in the day with it.
> >>>
> >>> I feel like the bigger question here is if devlink is how we are going
> >>> to deal with all PCIe related features going forward, or should we
> >>> start looking at creating a new interface/tool for PCI/PCIe related
> >>> features? My concern is that we have already had features such as DMA
> >>> Coalescing that didn't really fit into anything and now we are
> >>> starting to see other things related to DMA and PCIe bus credits. I'm
> >>> wondering if we shouldn't start looking at a tool/interface to
> >>> configure all the PCIe related features such as interrupts, error
> >>> reporting, DMA configuration, power management, etc. Maybe we could
> >>> even look at sharing it across subsystems and include things like
> >>> storage, graphics, and other subsystems in the conversation.
> >>
> >>Agreed, for actual PCIe configuration (i.e. not ECN marking) we do need
> >>to build up an API.  Sharing it across subsystems would be very cool!

I read the thread (starting at [1], for anybody else coming in late)
and I see this has something to do with "configuring outbound PCIe
buffers", but I haven't seen the connection to PCIe protocol or
features, i.e., I can't connect this to anything in the PCIe spec.

Can somebody help me understand how the PCI core is relevant?  If
there's some connection with a feature defined by PCIe, or if it
affects the PCIe transaction protocol somehow, I'm definitely
interested in this.  But if this only affects the data transferred
over PCIe, i.e., the data payloads of PCIe TLP packets, then I'm not
sure why the PCI core should care.

> > I wonder how come there isn't such an API in place already. Or is there?
> > If not, do you have any idea what it should look like? Should it be
> > an extension of the existing PCI uapi or something completely new?
> > It would probably be good to loop some PCI people in...
> 
> The closest thing I can think of in terms of answering your questions
> as to why we haven't seen anything like that would be setpci.
> Basically with that tool you can go through the PCI configuration
> space and update any piece you want. The problem is it can have
> effects on the driver and I don't recall there ever being any sort of
> notification mechanism added to make a driver aware of configuration
> updates.

setpci is a development and debugging tool, not something we should
use as the standard way of configuring things.  Use of setpci should
probably taint the kernel because the PCI core configures features
like MPS, ASPM, AER, etc., based on the assumption that nobody else is
changing things in PCI config space.
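A minimal sketch of that distinction (example_tune_mps() is a made-up
name; pcie_set_mps() is the real core interface):

/* Supported path: the PCI core validates the value against the device
 * and upstream bridge capabilities before it touches PCI_EXP_DEVCTL,
 * so the core's view of the device stays coherent.
 */
static int example_tune_mps(struct pci_dev *dev)
{
        return pcie_set_mps(dev, 256);
}

/* By contrast, setpci amounts to a raw config-space write to the same
 * register from user space; the write happens, but the core's MPS/MRRS
 * policy never learns about it.
 */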

> As far as the interface I 

Re: [PATCH v1 0/4] PCI: Remove unnecessary includes of <linux/pci-aspm.h>

2018-07-25 Thread Bjorn Helgaas
On Wed, Jul 25, 2018 at 01:33:23PM -0700, Sinan Kaya wrote:
> On 7/25/2018 12:52 PM, Bjorn Helgaas wrote:
> > Remove includes of <linux/pci-aspm.h> from files that don't need
> > it.  I'll apply all these via the PCI tree unless there's objection.
> > 
> > ---
> > 
> > Bjorn Helgaas (4):
> >    igb: Remove unnecessary include of <linux/pci-aspm.h>
> >    ath9k: Remove unnecessary include of <linux/pci-aspm.h>
> >    iwlwifi: Remove unnecessary include of <linux/pci-aspm.h>
> >    PCI: Remove unnecessary include of <linux/pci-aspm.h>
> 
> Thanks.
> 
> Reviewed-by: Sinan Kaya 
> 
> Is it possible to kill that file altogether? I haven't looked who is
> using outside of pci directory.

Thanks for taking a look!

It's possible we could remove it altogether; there's very little in
it, and in most cases the only reason drivers include it is to disable
certain ASPM link states to work around hardware defects.  It might
make sense to just move that interface into <linux/pci.h>.
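For reference, that ASPM workaround pattern is the main thing pci-aspm.h
declares; a sketch of typical driver use (example_aspm_quirk() is a
made-up name):

#include <linux/pci.h>

static void example_aspm_quirk(struct pci_dev *pdev)
{
        /* Keep the link out of L0s and L1 to work around a device
         * erratum; pci_disable_link_state() is the interface that
         * could move into <linux/pci.h>.
         */
        pci_disable_link_state(pdev, PCIE_LINK_STATE_L0S |
                                     PCIE_LINK_STATE_L1);
}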


[PATCH v1 1/4] igb: Remove unnecessary include of <linux/pci-aspm.h>

2018-07-25 Thread Bjorn Helgaas
From: Bjorn Helgaas 

The igb driver doesn't need anything provided by pci-aspm.h, so remove
the unnecessary include of it.

Signed-off-by: Bjorn Helgaas 
---
 drivers/net/ethernet/intel/igb/igb_main.c |1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index f707709969ac..c77fda05f683 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -22,7 +22,6 @@
 #include 
 #include 
 #include 
-#include <linux/pci-aspm.h>
 #include 
 #include 
 #include 



[PATCH v1 4/4] PCI: Remove unnecessary include of <linux/pci-aspm.h>

2018-07-25 Thread Bjorn Helgaas
From: Bjorn Helgaas 

Several PCI core files include pci-aspm.h even though they don't need
anything provided by that file.  Remove the unnecessary includes of it.

Signed-off-by: Bjorn Helgaas 
---
 drivers/pci/pci-sysfs.c |1 -
 drivers/pci/pci.c   |1 -
 drivers/pci/probe.c |1 -
 drivers/pci/remove.c|1 -
 4 files changed, 4 deletions(-)

diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index 0c4653c1d2ce..91337faae60d 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -23,7 +23,6 @@
 #include 
 #include 
 #include 
-#include <linux/pci-aspm.h>
 #include 
 #include 
 #include 
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index f5c6ab14fb31..7c2f0e682fc0 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -23,7 +23,6 @@
 #include 
 #include 
 #include 
-#include <linux/pci-aspm.h>
 #include 
 #include 
 #include 
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index ac876e32de4b..1ed2852dee21 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -13,7 +13,6 @@
 #include 
 #include 
 #include 
-#include <linux/pci-aspm.h>
 #include 
 #include 
 #include 
diff --git a/drivers/pci/remove.c b/drivers/pci/remove.c
index 6f072eae4f7a..01ec7fcb5634 100644
--- a/drivers/pci/remove.c
+++ b/drivers/pci/remove.c
@@ -1,7 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0
 #include 
 #include 
-#include <linux/pci-aspm.h>
 #include "pci.h"
 
 static void pci_free_resources(struct pci_dev *dev)



[PATCH v1 3/4] iwlwifi: Remove unnecessary include of <linux/pci-aspm.h>

2018-07-25 Thread Bjorn Helgaas
From: Bjorn Helgaas 

This part of the iwlwifi driver doesn't need anything provided by
pci-aspm.h, so remove the unnecessary include of it.

Signed-off-by: Bjorn Helgaas 
---
 drivers/net/wireless/intel/iwlwifi/pcie/drv.c |1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/wireless/intel/iwlwifi/pcie/drv.c b/drivers/net/wireless/intel/iwlwifi/pcie/drv.c
index 38234bda9017..d6c55e111fda 100644
--- a/drivers/net/wireless/intel/iwlwifi/pcie/drv.c
+++ b/drivers/net/wireless/intel/iwlwifi/pcie/drv.c
@@ -72,7 +72,6 @@
 #include 
 #include 
 #include 
-#include <linux/pci-aspm.h>
 #include 
 
 #include "fw/acpi.h"



[PATCH v1 2/4] ath9k: Remove unnecessary include of <linux/pci-aspm.h>

2018-07-25 Thread Bjorn Helgaas
From: Bjorn Helgaas 

The ath9k driver doesn't need anything provided by pci-aspm.h, so remove
the unnecessary include of it.

Signed-off-by: Bjorn Helgaas 
---
 drivers/net/wireless/ath/ath9k/pci.c |1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/wireless/ath/ath9k/pci.c b/drivers/net/wireless/ath/ath9k/pci.c
index 645f0fbd9179..92b2dd396436 100644
--- a/drivers/net/wireless/ath/ath9k/pci.c
+++ b/drivers/net/wireless/ath/ath9k/pci.c
@@ -18,7 +18,6 @@
 
 #include 
 #include 
-#include <linux/pci-aspm.h>
 #include 
 #include "ath9k.h"
 



[PATCH v1 0/4] PCI: Remove unnecessary includes of <linux/pci-aspm.h>

2018-07-25 Thread Bjorn Helgaas
Remove includes of <linux/pci-aspm.h> from files that don't need
it.  I'll apply all these via the PCI tree unless there's objection.

---

Bjorn Helgaas (4):
  igb: Remove unnecessary include of <linux/pci-aspm.h>
  ath9k: Remove unnecessary include of <linux/pci-aspm.h>
  iwlwifi: Remove unnecessary include of <linux/pci-aspm.h>
  PCI: Remove unnecessary include of <linux/pci-aspm.h>


 drivers/net/ethernet/intel/igb/igb_main.c |1 -
 drivers/net/wireless/ath/ath9k/pci.c  |1 -
 drivers/net/wireless/intel/iwlwifi/pcie/drv.c |1 -
 drivers/pci/pci-sysfs.c   |1 -
 drivers/pci/pci.c |1 -
 drivers/pci/probe.c   |1 -
 drivers/pci/remove.c  |1 -
 7 files changed, 7 deletions(-)


Re: [PATCH] PCI: allow drivers to limit the number of VFs to 0

2018-06-19 Thread Bjorn Helgaas
On Fri, May 25, 2018 at 09:02:23AM -0500, Bjorn Helgaas wrote:
> On Thu, May 24, 2018 at 06:20:15PM -0700, Jakub Kicinski wrote:
> > Hi Bjorn!
> > 
> > On Thu, 24 May 2018 18:57:48 -0500, Bjorn Helgaas wrote:
> > > On Mon, Apr 02, 2018 at 03:46:52PM -0700, Jakub Kicinski wrote:
> > > > Some user space depends on enabling sriov_totalvfs number of VFs
> > > > to not fail, e.g.:
> > > > 
> > > > $ cat .../sriov_totalvfs > .../sriov_numvfs
> > > > 
> > > > For devices which VF support depends on loaded FW we have the
> > > > pci_sriov_{g,s}et_totalvfs() API.  However, this API uses 0 as
> > > > a special "unset" value, meaning drivers can't limit sriov_totalvfs
> > > > to 0.  Remove the special values completely and simply initialize
> > > > driver_max_VFs to total_VFs.  Then always use driver_max_VFs.
> > > > Add a helper for drivers to reset the VF limit back to total.  
> > > 
> > > I still can't really make sense out of the changelog.
> > >
> > > I think part of the reason it's confusing is because there are two
> > > things going on:
> > > 
> > >   1) You want this:
> > >   
> > >pci_sriov_set_totalvfs(dev, 0);
> > >x = pci_sriov_get_totalvfs(dev) 
> > > 
> > >  to return 0 instead of total_VFs.  That seems to connect with
> > >  your subject line.  It means "sriov_totalvfs" in sysfs could be
> > >  0, but I don't know how that is useful (I'm sure it is; just
> > >  educate me :))
> > 
> > Let me just quote the bug report that got filed on our internal bug
> > tracker :)
> > 
> >   When testing Juju Openstack with Ubuntu 18.04, enabling SR-IOV causes
> >   errors because Juju gets the sriov_totalvfs for SR-IOV-capable device
> >   then tries to set that as the sriov_numvfs parameter.
> > 
> >   For SR-IOV incapable FW, the sriov_totalvfs parameter should be 0, 
> >   but it's set to max.  When FW is switched to flower*, the correct 
> >   sriov_totalvfs value is presented.
> > 
> > * flower is a project name
> 
> From the point of view of the PCI core (which knows nothing about
> device firmware and relies on the architected config space described
> by the PCIe spec), this sounds like an erratum: with some firmware
> installed, the device is not capable of SR-IOV, but still advertises
> an SR-IOV capability with "TotalVFs > 0".
> 
> Regardless of whether that's an erratum, we do allow PF drivers to use
> pci_sriov_set_totalvfs() to limit the number of VFs that may be
> enabled by writing to the PF's "sriov_numvfs" sysfs file.
> 
> But the current implementation does not allow a PF driver to limit VFs
> to 0, and that does seem nonsensical.
> 
> > My understanding is OpenStack uses sriov_totalvfs to determine how many
> > VFs can be enabled, looks like this is the code:
> > 
> > http://git.openstack.org/cgit/openstack/charm-neutron-openvswitch/tree/hooks/neutron_ovs_utils.py#n464
> > 
> > >   2) You're adding the pci_sriov_reset_totalvfs() interface.  I'm not
> > >  sure what you intend for this.  Is *every* driver supposed to
> > >  call it in .remove()?  Could/should this be done in the core
> > >  somehow instead of depending on every driver?
> > 
> > Good question, I was just thinking yesterday we may want to call it
> > from the core, but I don't think it's strictly necessary nor always
> > sufficient (we may reload FW without re-probing).
> > 
> > We have a device which supports a different number of VFs based on the FW
> > loaded.  Some legacy FWs do not inform the driver how many VFs they can
> > support, because they support the max.  So the flow in our driver is this:
> > 
> > load_fw(dev);
> > ...
> > max_vfs = ask_fw_for_max_vfs(dev);
> > if (max_vfs >= 0)
> > return pci_sriov_set_totalvfs(dev, max_vfs);
> > else /* FW didn't tell us, assume max */
> > return pci_sriov_reset_totalvfs(dev); 
> > 
> > We also reset the max on device remove, but that's not strictly
> > necessary.
> > 
> > Other users of pci_sriov_set_totalvfs() always know the value to set
> > the total to (either always get it from FW or it's a constant).
> > 
> > If you prefer we can work out the correct max for those legacy cases in
> > the driver as well, although it seemed cleaner to just ask the core,
> > since it already has total_VFs value handy :)
> > 
> > > I'm also having a h

Re: [PATCH] PCI: allow drivers to limit the number of VFs to 0

2018-05-25 Thread Bjorn Helgaas
On Fri, May 25, 2018 at 02:05:21PM -0700, Jakub Kicinski wrote:
> On Fri, 25 May 2018 09:02:23 -0500, Bjorn Helgaas wrote:
> > On Thu, May 24, 2018 at 06:20:15PM -0700, Jakub Kicinski wrote:
> > > On Thu, 24 May 2018 18:57:48 -0500, Bjorn Helgaas wrote:  
> > > > On Mon, Apr 02, 2018 at 03:46:52PM -0700, Jakub Kicinski wrote:  
> > > > > Some user space depends on enabling sriov_totalvfs number of VFs
> > > > > to not fail, e.g.:
> > > > > 
> > > > > $ cat .../sriov_totalvfs > .../sriov_numvfs
> > > > > 
> > > > > For devices which VF support depends on loaded FW we have the
> > > > > pci_sriov_{g,s}et_totalvfs() API.  However, this API uses 0 as
> > > > > a special "unset" value, meaning drivers can't limit sriov_totalvfs
> > > > > to 0.  Remove the special values completely and simply initialize
> > > > > driver_max_VFs to total_VFs.  Then always use driver_max_VFs.
> > > > > Add a helper for drivers to reset the VF limit back to total.
> > > > 
> > > > I still can't really make sense out of the changelog.
> > > >
> > > > I think part of the reason it's confusing is because there are two
> > > > things going on:
> > > > 
> > > >   1) You want this:
> > > >   
> > > >pci_sriov_set_totalvfs(dev, 0);
> > > >x = pci_sriov_get_totalvfs(dev) 
> > > > 
> > > >  to return 0 instead of total_VFs.  That seems to connect with
> > > >  your subject line.  It means "sriov_totalvfs" in sysfs could be
> > > >  0, but I don't know how that is useful (I'm sure it is; just
> > > >  educate me :))  
> > > 
> > > Let me just quote the bug report that got filed on our internal bug
> > > tracker :)
> > > 
> > >   When testing Juju Openstack with Ubuntu 18.04, enabling SR-IOV causes
> > >   errors because Juju gets the sriov_totalvfs for SR-IOV-capable device
> > >   then tries to set that as the sriov_numvfs parameter.
> > > 
> > >   For SR-IOV incapable FW, the sriov_totalvfs parameter should be 0, 
> > >   but it's set to max.  When FW is switched to flower*, the correct 
> > >   sriov_totalvfs value is presented.
> > > 
> > > * flower is a project name  
> > 
> > From the point of view of the PCI core (which knows nothing about
> > device firmware and relies on the architected config space described
> > by the PCIe spec), this sounds like an erratum: with some firmware
> > installed, the device is not capable of SR-IOV, but still advertises
> > an SR-IOV capability with "TotalVFs > 0".
> > 
> > Regardless of whether that's an erratum, we do allow PF drivers to use
> > pci_sriov_set_totalvfs() to limit the number of VFs that may be
> > enabled by writing to the PF's "sriov_numvfs" sysfs file.
> 
> Think more of an FPGA which can be reprogrammed at runtime to have
> different capabilities than an erratum.  Some FWs simply have no use
> for VFs and save resources (and validation time) by not supporting it.

This is a bit of a gray area.  Reloading firmware or reprogramming an
FPGA has the potential to create a new and different device than we
had before, but the PCI core doesn't know that.  The typical sequence
is:

  - PCI core enumerates device
  - driver binds to device (we call .probe())
  - driver loads new firmware to device
  - driver resets device with pci_reset_function() or similar
  - pci_reset_function() saves config space
  - pci_reset_function() resets device
  - device uses new firmware when it comes out of reset
  - pci_reset_function() restores config space

Loading the new firmware might change what the device looks like in
config space -- it could change the number or size of BARs, the
capabilities advertised, etc.  We currently sweep that under the rug
and blindly restore the old config space.

It looks like your driver does the reset differently, so maybe it
keeps the original config space setup.

But all that said, I agree that we should allow a PF driver to prevent
VF enablement, whether because the firmware doesn't support it or the
PF driver just wants to prevent use of VFs for whatever reason (maybe
we don't have enough MMIO resources, we don't need the VFs, etc.)

> Okay, perfect.  That makes sense.  The patch below certainly fixes the
> first issue for us.  Thank you!
> 
> As far as the second issue goes - agreed, having the core reset the
> number of VFs to total_VFs definitely makes sense.  It doesn't cater to
> the case where FW is reloaded without reprobing, but we don't do this
> today anyway.
> 
> Should I try to come up with a patch to reset total_VFs after detach?

Yes, please.

Bjorn
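Putting the pieces of this thread together, the PF driver flow would look
roughly like this (a sketch: ask_fw_for_max_vfs() is the hypothetical
firmware query from the quoted pseudocode, and pci_sriov_reset_totalvfs()
is the helper proposed in this patch, not an existing upstream API):

static int example_limit_vfs_after_fw_load(struct pci_dev *pdev)
{
        int max_vfs = ask_fw_for_max_vfs(pdev);         /* hypothetical */

        /* FW reported a limit; with this patch, 0 legitimately means
         * "no VFs can be enabled" and sysfs sriov_totalvfs shows 0.
         */
        if (max_vfs >= 0)
                return pci_sriov_set_totalvfs(pdev, max_vfs);

        /* Legacy FW didn't say: fall back to the TotalVFs value from
         * the SR-IOV capability.
         */
        pci_sriov_reset_totalvfs(pdev);
        return 0;
}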


Re: [PATCH] PCI: allow drivers to limit the number of VFs to 0

2018-05-25 Thread Bjorn Helgaas
On Fri, May 25, 2018 at 03:27:52PM -0400, Don Dutile wrote:
> On 05/25/2018 10:02 AM, Bjorn Helgaas wrote:
> > On Thu, May 24, 2018 at 06:20:15PM -0700, Jakub Kicinski wrote:
> > > Hi Bjorn!
> > > 
> > > On Thu, 24 May 2018 18:57:48 -0500, Bjorn Helgaas wrote:
> > > > On Mon, Apr 02, 2018 at 03:46:52PM -0700, Jakub Kicinski wrote:
> > > > > Some user space depends on enabling sriov_totalvfs number of VFs
> > > > > to not fail, e.g.:
> > > > > 
> > > > > $ cat .../sriov_totalvfs > .../sriov_numvfs
> > > > > 
> > > > > For devices which VF support depends on loaded FW we have the
> > > > > pci_sriov_{g,s}et_totalvfs() API.  However, this API uses 0 as
> > > > > a special "unset" value, meaning drivers can't limit sriov_totalvfs
> > > > > to 0.  Remove the special values completely and simply initialize
> > > > > driver_max_VFs to total_VFs.  Then always use driver_max_VFs.
> > > > > Add a helper for drivers to reset the VF limit back to total.
> > > > 
> > > > I still can't really make sense out of the changelog.
> > > > 
> > > > I think part of the reason it's confusing is because there are two
> > > > things going on:
> > > > 
> > > >1) You want this:
> > > > pci_sriov_set_totalvfs(dev, 0);
> > > > x = pci_sriov_get_totalvfs(dev)
> > > > 
> > > >   to return 0 instead of total_VFs.  That seems to connect with
> > > >   your subject line.  It means "sriov_totalvfs" in sysfs could be
> > > >   0, but I don't know how that is useful (I'm sure it is; just
> > > >   educate me :))
> > > 
> > > Let me just quote the bug report that got filed on our internal bug
> > > tracker :)
> > > 
> > >When testing Juju Openstack with Ubuntu 18.04, enabling SR-IOV causes
> > >errors because Juju gets the sriov_totalvfs for SR-IOV-capable device
> > >then tries to set that as the sriov_numvfs parameter.
> > > 
> > >For SR-IOV incapable FW, the sriov_totalvfs parameter should be 0,
> > >but it's set to max.  When FW is switched to flower*, the correct
> > >sriov_totalvfs value is presented.
> > > 
> > > * flower is a project name
> > 
> > From the point of view of the PCI core (which knows nothing about
> > device firmware and relies on the architected config space described
> > by the PCIe spec), this sounds like an erratum: with some firmware
> > installed, the device is not capable of SR-IOV, but still advertises
> > an SR-IOV capability with "TotalVFs > 0".
> > 
> > Regardless of whether that's an erratum, we do allow PF drivers to use
> > pci_sriov_set_totalvfs() to limit the number of VFs that may be
> > enabled by writing to the PF's "sriov_numvfs" sysfs file.
> > 
> +1.
> 
> > But the current implementation does not allow a PF driver to limit VFs
> > to 0, and that does seem nonsensical.
> > 
> Well, not really -- claiming to support VFs, and then wanting it to be 0...
> I could certainly argue that is nonsensical.
> From a sw perspective, sure, see if we can set VFs to 0 (and reset to another 
> value later).
> 
> /me wishes that implementers would follow the architecture vs torquing it 
> into strange shapes.
> 
> > > My understanding is OpenStack uses sriov_totalvfs to determine how many
> > > VFs can be enabled, looks like this is the code:
> > > 
> > > http://git.openstack.org/cgit/openstack/charm-neutron-openvswitch/tree/hooks/neutron_ovs_utils.py#n464
> > > 
> > > >2) You're adding the pci_sriov_reset_totalvfs() interface.  I'm not
> > > >   sure what you intend for this.  Is *every* driver supposed to
> > > >   call it in .remove()?  Could/should this be done in the core
> > > >   somehow instead of depending on every driver?
> > > 
> > > Good question, I was just thinking yesterday we may want to call it
> > > from the core, but I don't think it's strictly necessary nor always
> > > sufficient (we may reload FW without re-probing).
> > > 
> > > We have a device which supports a different number of VFs based on the FW
> > > loaded.  Some legacy FWs do not inform the driver how many VFs they can
> > > support, because they support the max.  So the flow in our driver is this:
> > > 
> > > load_fw(dev);
> > > ...

Re: [PATCH] PCI: allow drivers to limit the number of VFs to 0

2018-05-25 Thread Bjorn Helgaas
[+cc liquidio, benet, fm10k maintainers:

  The patch below will affect you if your driver calls
pci_sriov_set_totalvfs(dev, 0);

  Previously that caused a subsequent pci_sriov_get_totalvfs() to return
  the totalVFs value from the SR-IOV capability.  After this patch, it will
  return 0, which has implications for VF enablement via the sysfs
  "sriov_numvfs" file.]

On Fri, May 25, 2018 at 09:02:23AM -0500, Bjorn Helgaas wrote:
> On Thu, May 24, 2018 at 06:20:15PM -0700, Jakub Kicinski wrote:
> > Hi Bjorn!
> > 
> > On Thu, 24 May 2018 18:57:48 -0500, Bjorn Helgaas wrote:
> > > On Mon, Apr 02, 2018 at 03:46:52PM -0700, Jakub Kicinski wrote:
> > > > Some user space depends on enabling sriov_totalvfs number of VFs
> > > > to not fail, e.g.:
> > > > 
> > > > $ cat .../sriov_totalvfs > .../sriov_numvfs
> > > > 
> > > > For devices which VF support depends on loaded FW we have the
> > > > pci_sriov_{g,s}et_totalvfs() API.  However, this API uses 0 as
> > > > a special "unset" value, meaning drivers can't limit sriov_totalvfs
> > > > to 0.  Remove the special values completely and simply initialize
> > > > driver_max_VFs to total_VFs.  Then always use driver_max_VFs.
> > > > Add a helper for drivers to reset the VF limit back to total.  
> > > 
> > > I still can't really make sense out of the changelog.
> > >
> > > I think part of the reason it's confusing is because there are two
> > > things going on:
> > > 
> > >   1) You want this:
> > >   
> > >pci_sriov_set_totalvfs(dev, 0);
> > >x = pci_sriov_get_totalvfs(dev) 
> > > 
> > >  to return 0 instead of total_VFs.  That seems to connect with
> > >  your subject line.  It means "sriov_totalvfs" in sysfs could be
> > >  0, but I don't know how that is useful (I'm sure it is; just
> > >  educate me :))
> > 
> > Let me just quote the bug report that got filed on our internal bug
> > tracker :)
> > 
> >   When testing Juju Openstack with Ubuntu 18.04, enabling SR-IOV causes
> >   errors because Juju gets the sriov_totalvfs for SR-IOV-capable device
> >   then tries to set that as the sriov_numvfs parameter.
> > 
> >   For SR-IOV incapable FW, the sriov_totalvfs parameter should be 0, 
> >   but it's set to max.  When FW is switched to flower*, the correct 
> >   sriov_totalvfs value is presented.
> > 
> > * flower is a project name
> 
> From the point of view of the PCI core (which knows nothing about
> device firmware and relies on the architected config space described
> by the PCIe spec), this sounds like an erratum: with some firmware
> installed, the device is not capable of SR-IOV, but still advertises
> an SR-IOV capability with "TotalVFs > 0".
> 
> Regardless of whether that's an erratum, we do allow PF drivers to use
> pci_sriov_set_totalvfs() to limit the number of VFs that may be
> enabled by writing to the PF's "sriov_numvfs" sysfs file.
> 
> But the current implementation does not allow a PF driver to limit VFs
> to 0, and that does seem nonsensical.
> 
> > My understanding is OpenStack uses sriov_totalvfs to determine how many
> > VFs can be enabled, looks like this is the code:
> > 
> > http://git.openstack.org/cgit/openstack/charm-neutron-openvswitch/tree/hooks/neutron_ovs_utils.py#n464
> > 
> > >   2) You're adding the pci_sriov_reset_totalvfs() interface.  I'm not
> > >  sure what you intend for this.  Is *every* driver supposed to
> > >  call it in .remove()?  Could/should this be done in the core
> > >  somehow instead of depending on every driver?
> > 
> > Good question, I was just thinking yesterday we may want to call it
> > from the core, but I don't think it's strictly necessary nor always
> > sufficient (we may reload FW without re-probing).
> > 
> > We have a device which supports a different number of VFs based on the FW
> > loaded.  Some legacy FWs do not inform the driver how many VFs they can
> > support, because they support the max.  So the flow in our driver is this:
> > 
> > load_fw(dev);
> > ...
> > max_vfs = ask_fw_for_max_vfs(dev);
> > if (max_vfs >= 0)
> > return pci_sriov_set_totalvfs(dev, max_vfs);
> > else /* FW didn't tell us, assume max */
> > return pci_sriov_reset_totalvfs(dev); 
> > 
> > We also reset the max on device remove, but that's not strictly
> > necessary.
> > 
> > Other users of pci_sriov_set_totalvfs

Re: [PATCH] PCI: allow drivers to limit the number of VFs to 0

2018-05-25 Thread Bjorn Helgaas
On Thu, May 24, 2018 at 06:20:15PM -0700, Jakub Kicinski wrote:
> Hi Bjorn!
> 
> On Thu, 24 May 2018 18:57:48 -0500, Bjorn Helgaas wrote:
> > On Mon, Apr 02, 2018 at 03:46:52PM -0700, Jakub Kicinski wrote:
> > > Some user space depends on enabling sriov_totalvfs number of VFs
> > > to not fail, e.g.:
> > > 
> > > $ cat .../sriov_totalvfs > .../sriov_numvfs
> > > 
> > > For devices which VF support depends on loaded FW we have the
> > > pci_sriov_{g,s}et_totalvfs() API.  However, this API uses 0 as
> > > a special "unset" value, meaning drivers can't limit sriov_totalvfs
> > > to 0.  Remove the special values completely and simply initialize
> > > driver_max_VFs to total_VFs.  Then always use driver_max_VFs.
> > > Add a helper for drivers to reset the VF limit back to total.  
> > 
> > I still can't really make sense out of the changelog.
> >
> > I think part of the reason it's confusing is because there are two
> > things going on:
> > 
> >   1) You want this:
> >   
> >pci_sriov_set_totalvfs(dev, 0);
> >x = pci_sriov_get_totalvfs(dev) 
> > 
> >  to return 0 instead of total_VFs.  That seems to connect with
> >  your subject line.  It means "sriov_totalvfs" in sysfs could be
> >  0, but I don't know how that is useful (I'm sure it is; just
> >  educate me :))
> 
> Let me just quote the bug report that got filed on our internal bug
> tracker :)
> 
>   When testing Juju Openstack with Ubuntu 18.04, enabling SR-IOV causes
>   errors because Juju gets the sriov_totalvfs for SR-IOV-capable device
>   then tries to set that as the sriov_numvfs parameter.
> 
>   For SR-IOV incapable FW, the sriov_totalvfs parameter should be 0, 
>   but it's set to max.  When FW is switched to flower*, the correct 
>   sriov_totalvfs value is presented.
> 
> * flower is a project name

From the point of view of the PCI core (which knows nothing about
device firmware and relies on the architected config space described
by the PCIe spec), this sounds like an erratum: with some firmware
installed, the device is not capable of SR-IOV, but still advertises
an SR-IOV capability with "TotalVFs > 0".

Regardless of whether that's an erratum, we do allow PF drivers to use
pci_sriov_set_totalvfs() to limit the number of VFs that may be
enabled by writing to the PF's "sriov_numvfs" sysfs file.

But the current implementation does not allow a PF driver to limit VFs
to 0, and that does seem nonsensical.

> My understanding is OpenStack uses sriov_totalvfs to determine how many
> VFs can be enabled, looks like this is the code:
> 
> http://git.openstack.org/cgit/openstack/charm-neutron-openvswitch/tree/hooks/neutron_ovs_utils.py#n464
> 
> >   2) You're adding the pci_sriov_reset_totalvfs() interface.  I'm not
> >  sure what you intend for this.  Is *every* driver supposed to
> >  call it in .remove()?  Could/should this be done in the core
> >  somehow instead of depending on every driver?
> 
> Good question, I was just thinking yesterday we may want to call it
> from the core, but I don't think it's strictly necessary nor always
> sufficient (we may reload FW without re-probing).
> 
> We have a device which supports a different number of VFs based on the FW
> loaded.  Some legacy FWs do not inform the driver how many VFs they can
> support, because they support the max.  So the flow in our driver is this:
> 
> load_fw(dev);
> ...
> max_vfs = ask_fw_for_max_vfs(dev);
> if (max_vfs >= 0)
>   return pci_sriov_set_totalvfs(dev, max_vfs);
> else /* FW didn't tell us, assume max */
>   return pci_sriov_reset_totalvfs(dev); 
> 
> We also reset the max on device remove, but that's not strictly
> necessary.
> 
> Other users of pci_sriov_set_totalvfs() always know the value to set
> the total to (either always get it from FW or it's a constant).
> 
> If you prefer we can work out the correct max for those legacy cases in
> the driver as well, although it seemed cleaner to just ask the core,
> since it already has total_VFs value handy :)
> 
> > I'm also having a hard time connecting your user-space command example
> > with the rest of this.  Maybe it will make more sense to me tomorrow
> > after some coffee.
> 
> OpenStack assumes it will always be able to set sriov_numvfs to
> sriov_totalvfs, see this 'if':
> 
> http://git.openstack.org/cgit/openstack/charm-neutron-openvswitch/tree/hooks/neutron_ovs_utils.py#n512

Thanks for educating me.  I think there are two issues here that we
can separate.  I extracted the patch below for the first.

The second is the questio

Re: [PATCH] PCI: allow drivers to limit the number of VFs to 0

2018-05-24 Thread Bjorn Helgaas
Hi Jakub,

On Mon, Apr 02, 2018 at 03:46:52PM -0700, Jakub Kicinski wrote:
> Some user space depends on enabling sriov_totalvfs number of VFs
> to not fail, e.g.:
> 
> $ cat .../sriov_totalvfs > .../sriov_numvfs
> 
> For devices which VF support depends on loaded FW we have the
> pci_sriov_{g,s}et_totalvfs() API.  However, this API uses 0 as
> a special "unset" value, meaning drivers can't limit sriov_totalvfs
> to 0.  Remove the special values completely and simply initialize
> driver_max_VFs to total_VFs.  Then always use driver_max_VFs.
> Add a helper for drivers to reset the VF limit back to total.

I still can't really make sense out of the changelog.

I think part of the reason it's confusing is because there are two
things going on:

  1) You want this:
  
   pci_sriov_set_totalvfs(dev, 0);
   x = pci_sriov_get_totalvfs(dev) 

 to return 0 instead of total_VFs.  That seems to connect with
 your subject line.  It means "sriov_totalvfs" in sysfs could be
 0, but I don't know how that is useful (I'm sure it is; just
 educate me :))

  2) You're adding the pci_sriov_reset_totalvfs() interface.  I'm not
 sure what you intend for this.  Is *every* driver supposed to
 call it in .remove()?  Could/should this be done in the core
 somehow instead of depending on every driver?

I'm also having a hard time connecting your user-space command example
with the rest of this.  Maybe it will make more sense to me tomorrow
after some coffee.

> Signed-off-by: Jakub Kicinski 
> ---
>  drivers/net/ethernet/netronome/nfp/nfp_main.c |  6 +++---
>  drivers/pci/iov.c | 27 +--
>  include/linux/pci.h   |  2 ++
>  3 files changed, 26 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/net/ethernet/netronome/nfp/nfp_main.c b/drivers/net/ethernet/netronome/nfp/nfp_main.c
> index c4b1f344b4da..a76d177e40dd 100644
> --- a/drivers/net/ethernet/netronome/nfp/nfp_main.c
> +++ b/drivers/net/ethernet/netronome/nfp/nfp_main.c
> @@ -123,7 +123,7 @@ static int nfp_pcie_sriov_read_nfd_limit(struct nfp_pf *pf)
>   return pci_sriov_set_totalvfs(pf->pdev, pf->limit_vfs);
>  
>   pf->limit_vfs = ~0;
> - pci_sriov_set_totalvfs(pf->pdev, 0); /* 0 is unset */
> + pci_sriov_reset_totalvfs(pf->pdev);
>   /* Allow any setting for backwards compatibility if symbol not found */
>   if (err == -ENOENT)
>   return 0;
> @@ -537,7 +537,7 @@ static int nfp_pci_probe(struct pci_dev *pdev,
>  err_net_remove:
>   nfp_net_pci_remove(pf);
>  err_sriov_unlimit:
> - pci_sriov_set_totalvfs(pf->pdev, 0);
> + pci_sriov_reset_totalvfs(pf->pdev);
>  err_fw_unload:
>   kfree(pf->rtbl);
>   nfp_mip_close(pf->mip);
> @@ -570,7 +570,7 @@ static void nfp_pci_remove(struct pci_dev *pdev)
>   nfp_hwmon_unregister(pf);
>  
>   nfp_pcie_sriov_disable(pdev);
> - pci_sriov_set_totalvfs(pf->pdev, 0);
> + pci_sriov_reset_totalvfs(pf->pdev);
>  
>   nfp_net_pci_remove(pf);
>  
> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> index 677924ae0350..c63ea870d8be 100644
> --- a/drivers/pci/iov.c
> +++ b/drivers/pci/iov.c
> @@ -443,6 +443,7 @@ static int sriov_init(struct pci_dev *dev, int pos)
>   iov->nres = nres;
>   iov->ctrl = ctrl;
>   iov->total_VFs = total;
> + iov->driver_max_VFs = total;
>  pci_read_config_word(dev, pos + PCI_SRIOV_VF_DID, &iov->vf_device);
>   iov->pgsz = pgsz;
>   iov->self = dev;
> @@ -788,12 +789,29 @@ int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs)
>  }
>  EXPORT_SYMBOL_GPL(pci_sriov_set_totalvfs);
>  
> +/**
> + * pci_sriov_reset_totalvfs -- return the TotalVFs value to the default
> + * @dev: the PCI PF device
> + *
> + * Should be called from PF driver's remove routine with
> + * device's mutex held.
> + */
> +void pci_sriov_reset_totalvfs(struct pci_dev *dev)
> +{
> + /* Shouldn't change if VFs already enabled */
> + if (!dev->is_physfn || dev->sriov->ctrl & PCI_SRIOV_CTRL_VFE)
> + return;
> +
> + dev->sriov->driver_max_VFs = dev->sriov->total_VFs;
> +}
> +EXPORT_SYMBOL_GPL(pci_sriov_reset_totalvfs);
> +
>  /**
>   * pci_sriov_get_totalvfs -- get total VFs supported on this device
>   * @dev: the PCI PF device
>   *
> - * For a PCIe device with SRIOV support, return the PCIe
> - * SRIOV capability value of TotalVFs or the value of driver_max_VFs
> + * For a PCIe device with SRIOV support, return the value of driver_max_VFs
> + * which can be equal to the PCIe SRIOV capability value of TotalVFs or lower
>   * if the driver reduced it.  Otherwise 0.
>   */
>  int pci_sriov_get_totalvfs(struct pci_dev *dev)
> @@ -801,9 +819,6 @@ int pci_sriov_get_totalvfs(struct pci_dev *dev)
>   if (!dev->is_physfn)
>   return 0;
>  
> - if (dev->sriov->driver_max_VFs)
> - return dev->sriov->driver_max_VFs;
> -
> - return 

Re: [PATCH v6 0/5] PCI: Improve PCIe link status reporting

2018-05-23 Thread Bjorn Helgaas
[+to Davem]

On Thu, May 03, 2018 at 03:00:07PM -0500, Bjorn Helgaas wrote:
> This is based on Tal's recent work to unify the approach for reporting PCIe
> link speed/width and whether the device is being limited by a slower
> upstream link.
> 
> The new pcie_print_link_status() interface appeared in v4.17-rc1; see
> 9e506a7b5147 ("PCI: Add pcie_print_link_status() to log link speed and
> whether it's limited").
> 
> That's a good way to replace use of pcie_get_minimum_link(), which gives
> misleading results when a path contains both a fast, narrow link and a
> slow, wide link: it reports the equivalent of a slow, narrow link.
> 
> This series removes the remaining uses of pcie_get_minimum_link() and then
> removes the interface itself.  I'd like to merge them all through the PCI
> tree to make the removal easy.
> 
> This does change the dmesg reporting of link speeds, and in the ixgbe case,
> it changes the reporting from KERN_WARN level to KERN_INFO.  If that's an
> issue, let's talk about it.  I'm hoping the reduced code size, improved
> functionality, and consistency across drivers are enough to make this
> worthwhile.
> 
> ---
> 
> Bjorn Helgaas (5):
>   bnx2x: Report PCIe link properties with pcie_print_link_status()
>   bnxt_en: Report PCIe link properties with pcie_print_link_status()
>   cxgb4: Report PCIe link properties with pcie_print_link_status()
>   ixgbe: Report PCIe link properties with pcie_print_link_status()
>   PCI: Remove unused pcie_get_minimum_link()
> 
> 
>  drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c |   23 ++-
>  drivers/net/ethernet/broadcom/bnxt/bnxt.c|   19 --
>  drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c  |   75 --
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c|   47 --
>  drivers/pci/pci.c|   43 -
>  include/linux/pci.h  |2 -
>  6 files changed, 9 insertions(+), 200 deletions(-)

I applied all of these on pci/enumeration for v4.18.  If you'd rather take
them, Dave, let me know and I'll drop them.

I solicited more acks, but only heard from Jeff.


Re: [PATCH v6 0/5] PCI: Improve PCIe link status reporting

2018-05-10 Thread Bjorn Helgaas
On Thu, May 03, 2018 at 03:00:07PM -0500, Bjorn Helgaas wrote:
> This is based on Tal's recent work to unify the approach for reporting PCIe
> link speed/width and whether the device is being limited by a slower
> upstream link.
> 
> The new pcie_print_link_status() interface appeared in v4.17-rc1; see
> 9e506a7b5147 ("PCI: Add pcie_print_link_status() to log link speed and
> whether it's limited").
> 
> That's a good way to replace use of pcie_get_minimum_link(), which gives
> misleading results when a path contains both a fast, narrow link and a
> slow, wide link: it reports the equivalent of a slow, narrow link.
> 
> This series removes the remaining uses of pcie_get_minimum_link() and then
> removes the interface itself.  I'd like to merge them all through the PCI
> tree to make the removal easy.
> 
> This does change the dmesg reporting of link speeds, and in the ixgbe case,
> it changes the reporting from KERN_WARN level to KERN_INFO.  If that's an
> issue, let's talk about it.  I'm hoping the reduced code size, improved
> functionality, and consistency across drivers are enough to make this
> worthwhile.
> 
> ---
> 
> Bjorn Helgaas (5):
>   bnx2x: Report PCIe link properties with pcie_print_link_status()
>   bnxt_en: Report PCIe link properties with pcie_print_link_status()
>   cxgb4: Report PCIe link properties with pcie_print_link_status()
>   ixgbe: Report PCIe link properties with pcie_print_link_status()
>   PCI: Remove unused pcie_get_minimum_link()

Jeff has acked the ixgbe patch.

Any comments on the bnx2x, bnxt_en, or cxgb4 patches?

>  drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c |   23 ++-
>  drivers/net/ethernet/broadcom/bnxt/bnxt.c|   19 --
>  drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c  |   75 --
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c|   47 --
>  drivers/pci/pci.c|   43 -
>  include/linux/pci.h  |2 -
>  6 files changed, 9 insertions(+), 200 deletions(-)


Re: [PATCH v6 5/5] PCI: Remove unused pcie_get_minimum_link()

2018-05-10 Thread Bjorn Helgaas
On Thu, May 03, 2018 at 03:00:43PM -0500, Bjorn Helgaas wrote:
> From: Bjorn Helgaas <bhelg...@google.com>
> 
> In some cases pcie_get_minimum_link() returned misleading information
> because it found the slowest link and the narrowest link without
> considering the total bandwidth of the link.
> 
> For example, consider a path with these two links:
> 
>   - 16.0 GT/s  x1 link  (16.0 * 10^9 * 128 / 130) *  1 / 8 = 1969 MB/s
>   -  2.5 GT/s x16 link  ( 2.5 * 10^9 *   8 /  10) * 16 / 8 = 4000 MB/s
> 
> The available bandwidth of the path is limited by the 16 GT/s link to about
> 1969 MB/s, but pcie_get_minimum_link() returned 2.5 GT/s x1, which
> corresponds to only 250 MB/s.
> 
> Callers should use pcie_print_link_status() instead, or
> pcie_bandwidth_available() if they need more detailed information.
> 
> Remove pcie_get_minimum_link() since there are no callers left.
> 
> Signed-off-by: Bjorn Helgaas <bhelg...@google.com>

Hi Jeff,

I got your note that you applied this to dev-queue.  I assume that
means you also applied the preceding patches that removed all the
users.  I got a note about ixgbe, but not the others, so I'm just
double-checking.

> ---
>  drivers/pci/pci.c   |   43 ---
>  include/linux/pci.h |2 --
>  2 files changed, 45 deletions(-)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index e597655a5643..4bafa817c40a 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -5069,49 +5069,6 @@ int pcie_set_mps(struct pci_dev *dev, int mps)
>  }
>  EXPORT_SYMBOL(pcie_set_mps);
>  
> -/**
> - * pcie_get_minimum_link - determine minimum link settings of a PCI device
> - * @dev: PCI device to query
> - * @speed: storage for minimum speed
> - * @width: storage for minimum width
> - *
> - * This function will walk up the PCI device chain and determine the minimum
> - * link width and speed of the device.
> - */
> -int pcie_get_minimum_link(struct pci_dev *dev, enum pci_bus_speed *speed,
> -   enum pcie_link_width *width)
> -{
> - int ret;
> -
> - *speed = PCI_SPEED_UNKNOWN;
> - *width = PCIE_LNK_WIDTH_UNKNOWN;
> -
> - while (dev) {
> - u16 lnksta;
> - enum pci_bus_speed next_speed;
> - enum pcie_link_width next_width;
> -
> - ret = pcie_capability_read_word(dev, PCI_EXP_LNKSTA, &lnksta);
> - if (ret)
> - return ret;
> -
> - next_speed = pcie_link_speed[lnksta & PCI_EXP_LNKSTA_CLS];
> - next_width = (lnksta & PCI_EXP_LNKSTA_NLW) >>
> - PCI_EXP_LNKSTA_NLW_SHIFT;
> -
> - if (next_speed < *speed)
> - *speed = next_speed;
> -
> - if (next_width < *width)
> - *width = next_width;
> -
> - dev = dev->bus->self;
> - }
> -
> - return 0;
> -}
> -EXPORT_SYMBOL(pcie_get_minimum_link);
> -
>  /**
>   * pcie_bandwidth_available - determine minimum link settings of a PCIe
>   * device and its bandwidth limitation
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 73178a2fcee0..230615620a4a 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -1079,8 +1079,6 @@ int pcie_get_readrq(struct pci_dev *dev);
>  int pcie_set_readrq(struct pci_dev *dev, int rq);
>  int pcie_get_mps(struct pci_dev *dev);
>  int pcie_set_mps(struct pci_dev *dev, int mps);
> -int pcie_get_minimum_link(struct pci_dev *dev, enum pci_bus_speed *speed,
> -   enum pcie_link_width *width);
>  u32 pcie_bandwidth_available(struct pci_dev *dev, struct pci_dev **limiting_dev,
>enum pci_bus_speed *speed,
>enum pcie_link_width *width);
> 


[PATCH v6 5/5] PCI: Remove unused pcie_get_minimum_link()

2018-05-03 Thread Bjorn Helgaas
From: Bjorn Helgaas <bhelg...@google.com>

In some cases pcie_get_minimum_link() returned misleading information
because it found the slowest link and the narrowest link without
considering the total bandwidth of the link.

For example, consider a path with these two links:

  - 16.0 GT/s  x1 link  (16.0 * 10^9 * 128 / 130) *  1 / 8 = 1969 MB/s
  -  2.5 GT/s x16 link  ( 2.5 * 10^9 *   8 /  10) * 16 / 8 = 4000 MB/s

The available bandwidth of the path is limited by the 16 GT/s link to about
1969 MB/s, but pcie_get_minimum_link() returned 2.5 GT/s x1, which
corresponds to only 250 MB/s.

Callers should use pcie_print_link_status() instead, or
pcie_bandwidth_available() if they need more detailed information.
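
As an illustration of the replacement interfaces (not from this patch; the
"foo" name and the 8000 Mb/s threshold are made up for the sketch), a caller
might do:

  static void foo_report_pcie(struct pci_dev *pdev)
  {
      struct pci_dev *limiting_dev = NULL;
      enum pci_bus_speed speed;
      enum pcie_link_width width;
      u32 bw;

      /* One-line dmesg summary, including any upstream limitation */
      pcie_print_link_status(pdev);

      /* Or query the raw numbers (in Mb/s) for driver-specific logic */
      bw = pcie_bandwidth_available(pdev, &limiting_dev, &speed, &width);
      if (bw < 8000)
          pci_info(pdev, "less than 8 Gb/s of PCIe bandwidth available\n");
  }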

Remove pcie_get_minimum_link() since there are no callers left.

Signed-off-by: Bjorn Helgaas <bhelg...@google.com>
---
 drivers/pci/pci.c   |   43 ---
 include/linux/pci.h |    2 --
 2 files changed, 45 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index e597655a5643..4bafa817c40a 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5069,49 +5069,6 @@ int pcie_set_mps(struct pci_dev *dev, int mps)
 }
 EXPORT_SYMBOL(pcie_set_mps);
 
-/**
- * pcie_get_minimum_link - determine minimum link settings of a PCI device
- * @dev: PCI device to query
- * @speed: storage for minimum speed
- * @width: storage for minimum width
- *
- * This function will walk up the PCI device chain and determine the minimum
- * link width and speed of the device.
- */
-int pcie_get_minimum_link(struct pci_dev *dev, enum pci_bus_speed *speed,
- enum pcie_link_width *width)
-{
-   int ret;
-
-   *speed = PCI_SPEED_UNKNOWN;
-   *width = PCIE_LNK_WIDTH_UNKNOWN;
-
-   while (dev) {
-   u16 lnksta;
-   enum pci_bus_speed next_speed;
-   enum pcie_link_width next_width;
-
-   ret = pcie_capability_read_word(dev, PCI_EXP_LNKSTA, &lnksta);
-   if (ret)
-   return ret;
-
-   next_speed = pcie_link_speed[lnksta & PCI_EXP_LNKSTA_CLS];
-   next_width = (lnksta & PCI_EXP_LNKSTA_NLW) >>
-   PCI_EXP_LNKSTA_NLW_SHIFT;
-
-   if (next_speed < *speed)
-   *speed = next_speed;
-
-   if (next_width < *width)
-   *width = next_width;
-
-   dev = dev->bus->self;
-   }
-
-   return 0;
-}
-EXPORT_SYMBOL(pcie_get_minimum_link);
-
 /**
  * pcie_bandwidth_available - determine minimum link settings of a PCIe
  *   device and its bandwidth limitation
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 73178a2fcee0..230615620a4a 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1079,8 +1079,6 @@ int pcie_get_readrq(struct pci_dev *dev);
 int pcie_set_readrq(struct pci_dev *dev, int rq);
 int pcie_get_mps(struct pci_dev *dev);
 int pcie_set_mps(struct pci_dev *dev, int mps);
-int pcie_get_minimum_link(struct pci_dev *dev, enum pci_bus_speed *speed,
- enum pcie_link_width *width);
 u32 pcie_bandwidth_available(struct pci_dev *dev, struct pci_dev **limiting_dev,
 enum pci_bus_speed *speed,
 enum pcie_link_width *width);



[PATCH v6 2/5] bnxt_en: Report PCIe link properties with pcie_print_link_status()

2018-05-03 Thread Bjorn Helgaas
From: Bjorn Helgaas <bhelg...@google.com>

Previously the driver used pcie_get_minimum_link() to warn when the NIC
is in a slot that can't supply as much bandwidth as the NIC could use.

pcie_get_minimum_link() can be misleading because it finds the slowest link
and the narrowest link (which may be different links) without considering
the total bandwidth of each link.  For a path with a 16 GT/s x1 link and a
2.5 GT/s x16 link, it returns 2.5 GT/s x1, which corresponds to 250 MB/s of
bandwidth, not the true available bandwidth of about 1969 MB/s for a
16 GT/s x1 link.

Use pcie_print_link_status() to report PCIe link speed and possible
limitations instead of implementing this in the driver itself.  This finds
the slowest link in the path to the device by computing the total bandwidth
of each link and compares that with the capabilities of the device.
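
Conceptually, the path walk behind this (a simplified sketch of the
pcie_bandwidth_available() shown in full elsewhere in this archive; error
handling omitted) takes the minimum of the per-link bandwidths rather than
the minimum speed and minimum width separately:

  u32 bw = ~0;
  struct pci_dev *p = pdev;

  while (p) {
      u16 lnksta;
      enum pci_bus_speed speed;
      enum pcie_link_width width;

      pcie_capability_read_word(p, PCI_EXP_LNKSTA, &lnksta);
      speed = pcie_link_speed[lnksta & PCI_EXP_LNKSTA_CLS];
      width = (lnksta & PCI_EXP_LNKSTA_NLW) >> PCI_EXP_LNKSTA_NLW_SHIFT;

      /* compare whole-link bandwidth, not speed and width separately */
      bw = min_t(u32, bw, width * PCIE_SPEED2MBS_ENC(speed));

      p = pci_upstream_bridge(p);
  }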

The dmesg change is:

  - PCIe: Speed %s Width x%d
  + %u.%03u Gb/s available PCIe bandwidth (%s x%d link)

Signed-off-by: Bjorn Helgaas <bhelg...@google.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c |   19 +--
 1 file changed, 1 insertion(+), 18 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index f83769d8047b..34fddb48fecc 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -8621,22 +8621,6 @@ static int bnxt_init_mac_addr(struct bnxt *bp)
return rc;
 }
 
-static void bnxt_parse_log_pcie_link(struct bnxt *bp)
-{
-   enum pcie_link_width width = PCIE_LNK_WIDTH_UNKNOWN;
-   enum pci_bus_speed speed = PCI_SPEED_UNKNOWN;
-
-   if (pcie_get_minimum_link(pci_physfn(bp->pdev), &speed, &width) ||
-   speed == PCI_SPEED_UNKNOWN || width == PCIE_LNK_WIDTH_UNKNOWN)
-   netdev_info(bp->dev, "Failed to determine PCIe Link Info\n");
-   else
-   netdev_info(bp->dev, "PCIe: Speed %s Width x%d\n",
-   speed == PCIE_SPEED_2_5GT ? "2.5GT/s" :
-   speed == PCIE_SPEED_5_0GT ? "5.0GT/s" :
-   speed == PCIE_SPEED_8_0GT ? "8.0GT/s" :
-   "Unknown", width);
-}
-
 static int bnxt_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 {
static int version_printed;
@@ -8851,8 +8835,7 @@ static int bnxt_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
netdev_info(dev, "%s found at mem %lx, node addr %pM\n",
board_info[ent->driver_data].name,
(long)pci_resource_start(pdev, 0), dev->dev_addr);
-
-   bnxt_parse_log_pcie_link(bp);
+   pcie_print_link_status(pdev);
 
return 0;
 



[PATCH v6 0/5] PCI: Improve PCIe link status reporting

2018-05-03 Thread Bjorn Helgaas
This is based on Tal's recent work to unify the approach for reporting PCIe
link speed/width and whether the device is being limited by a slower
upstream link.

The new pcie_print_link_status() interface appeared in v4.17-rc1; see
9e506a7b5147 ("PCI: Add pcie_print_link_status() to log link speed and
whether it's limited").

That's a good way to replace use of pcie_get_minimum_link(), which gives
misleading results when a path contains both a fast, narrow link and a
slow, wide link: it reports the equivalent of a slow, narrow link.

This series removes the remaining uses of pcie_get_minimum_link() and then
removes the interface itself.  I'd like to merge them all through the PCI
tree to make the removal easy.

This does change the dmesg reporting of link speeds, and in the ixgbe case,
it changes the reporting from KERN_WARN level to KERN_INFO.  If that's an
issue, let's talk about it.  I'm hoping the reduced code size, improved
functionality, and consistency across drivers are enough to make this
worthwhile.

---

Bjorn Helgaas (5):
  bnx2x: Report PCIe link properties with pcie_print_link_status()
  bnxt_en: Report PCIe link properties with pcie_print_link_status()
  cxgb4: Report PCIe link properties with pcie_print_link_status()
  ixgbe: Report PCIe link properties with pcie_print_link_status()
  PCI: Remove unused pcie_get_minimum_link()


 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c |   23 ++-
 drivers/net/ethernet/broadcom/bnxt/bnxt.c        |   19 --
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c  |   75 --
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c    |   47 --
 drivers/pci/pci.c                                |   43 -
 include/linux/pci.h                              |    2 -
 6 files changed, 9 insertions(+), 200 deletions(-)


[PATCH v6 3/5] cxgb4: Report PCIe link properties with pcie_print_link_status()

2018-05-03 Thread Bjorn Helgaas
From: Bjorn Helgaas <bhelg...@google.com>

Previously the driver used pcie_get_minimum_link() to warn when the NIC
is in a slot that can't supply as much bandwidth as the NIC could use.

pcie_get_minimum_link() can be misleading because it finds the slowest link
and the narrowest link (which may be different links) without considering
the total bandwidth of each link.  For a path with a 16 GT/s x1 link and a
2.5 GT/s x16 link, it returns 2.5 GT/s x1, which corresponds to 250 MB/s of
bandwidth, not the true available bandwidth of about 1969 MB/s for a
16 GT/s x1 link.

Use pcie_print_link_status() to report PCIe link speed and possible
limitations instead of implementing this in the driver itself.  This finds
the slowest link in the path to the device by computing the total bandwidth
of each link and compares that with the capabilities of the device.

The dmesg change is:

  - PCIe link speed is %s, device supports %s
  - PCIe link width is x%d, device supports x%d
  + %u.%03u Gb/s available PCIe bandwidth (%s x%d link)

or, if the device is capable of better performance than is available in the
current slot:

  - A slot with more lanes and/or higher speed is suggested for optimal 
performance.
  + %u.%03u Gb/s available PCIe bandwidth, limited by %s x%d link at %s 
(capable of %u.%03u Gb/s with %s x%d link)

Signed-off-by: Bjorn Helgaas <bhelg...@google.com>
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c |   75 ---
 1 file changed, 1 insertion(+), 74 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index 24d2865b8806..7328f24ba1dd 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -5042,79 +5042,6 @@ static int init_rss(struct adapter *adap)
return 0;
 }
 
-static int cxgb4_get_pcie_dev_link_caps(struct adapter *adap,
-   enum pci_bus_speed *speed,
-   enum pcie_link_width *width)
-{
-   u32 lnkcap1, lnkcap2;
-   int err1, err2;
-
-#define  PCIE_MLW_CAP_SHIFT 4   /* start of MLW mask in link capabilities */
-
-   *speed = PCI_SPEED_UNKNOWN;
-   *width = PCIE_LNK_WIDTH_UNKNOWN;
-
-   err1 = pcie_capability_read_dword(adap->pdev, PCI_EXP_LNKCAP,
- &lnkcap1);
-   err2 = pcie_capability_read_dword(adap->pdev, PCI_EXP_LNKCAP2,
- &lnkcap2);
-   if (!err2 && lnkcap2) { /* PCIe r3.0-compliant */
-   if (lnkcap2 & PCI_EXP_LNKCAP2_SLS_8_0GB)
-   *speed = PCIE_SPEED_8_0GT;
-   else if (lnkcap2 & PCI_EXP_LNKCAP2_SLS_5_0GB)
-   *speed = PCIE_SPEED_5_0GT;
-   else if (lnkcap2 & PCI_EXP_LNKCAP2_SLS_2_5GB)
-   *speed = PCIE_SPEED_2_5GT;
-   }
-   if (!err1) {
-   *width = (lnkcap1 & PCI_EXP_LNKCAP_MLW) >> PCIE_MLW_CAP_SHIFT;
-   if (!lnkcap2) { /* pre-r3.0 */
-   if (lnkcap1 & PCI_EXP_LNKCAP_SLS_5_0GB)
-   *speed = PCIE_SPEED_5_0GT;
-   else if (lnkcap1 & PCI_EXP_LNKCAP_SLS_2_5GB)
-   *speed = PCIE_SPEED_2_5GT;
-   }
-   }
-
-   if (*speed == PCI_SPEED_UNKNOWN || *width == PCIE_LNK_WIDTH_UNKNOWN)
-   return err1 ? err1 : err2 ? err2 : -EINVAL;
-   return 0;
-}
-
-static void cxgb4_check_pcie_caps(struct adapter *adap)
-{
-   enum pcie_link_width width, width_cap;
-   enum pci_bus_speed speed, speed_cap;
-
-#define PCIE_SPEED_STR(speed) \
-   (speed == PCIE_SPEED_8_0GT ? "8.0GT/s" : \
-speed == PCIE_SPEED_5_0GT ? "5.0GT/s" : \
-speed == PCIE_SPEED_2_5GT ? "2.5GT/s" : \
-"Unknown")
-
-   if (cxgb4_get_pcie_dev_link_caps(adap, &speed_cap, &width_cap)) {
-   dev_warn(adap->pdev_dev,
-"Unable to determine PCIe device BW capabilities\n");
-   return;
-   }
-
-   if (pcie_get_minimum_link(adap->pdev, &speed, &width) ||
-   speed == PCI_SPEED_UNKNOWN || width == PCIE_LNK_WIDTH_UNKNOWN) {
-   dev_warn(adap->pdev_dev,
-"Unable to determine PCI Express bandwidth.\n");
-   return;
-   }
-
-   dev_info(adap->pdev_dev, "PCIe link speed is %s, device supports %s\n",
-PCIE_SPEED_STR(speed), PCIE_SPEED_STR(speed_cap));
-   dev_info(adap->pdev_dev, "PCIe link width is x%d, device supports 
x%d\n",
-width, width_cap);
-   if (speed < speed_cap || width < width_cap)
-   dev_info(adap->pdev_dev,
-"A slot with more lanes and/or higher speed is "
-&q

[PATCH v6 1/5] bnx2x: Report PCIe link properties with pcie_print_link_status()

2018-05-03 Thread Bjorn Helgaas
From: Bjorn Helgaas <bhelg...@google.com>

Previously the driver used pcie_get_minimum_link() to warn when the NIC
is in a slot that can't supply as much bandwidth as the NIC could use.

pcie_get_minimum_link() can be misleading because it finds the slowest link
and the narrowest link (which may be different links) without considering
the total bandwidth of each link.  For a path with a 16 GT/s x1 link and a
2.5 GT/s x16 link, it returns 2.5 GT/s x1, which corresponds to 250 MB/s of
bandwidth, not the true available bandwidth of about 1969 MB/s for a
16 GT/s x1 link.

Use pcie_print_link_status() to report PCIe link speed and possible
limitations instead of implementing this in the driver itself.  This finds
the slowest link in the path to the device by computing the total bandwidth
of each link and compares that with the capabilities of the device.

The dmesg change is:

  - %s (%c%d) PCI-E x%d %s found at mem %lx, IRQ %d, node addr %pM
  + %s (%c%d) PCI-E found at mem %lx, IRQ %d, node addr %pM
  + %u.%03u Gb/s available PCIe bandwidth (%s x%d link)

Signed-off-by: Bjorn Helgaas <bhelg...@google.com>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c |   23 ++
 1 file changed, 6 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index c766ae23bc74..5b1ed240bf18 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -13922,8 +13922,6 @@ static int bnx2x_init_one(struct pci_dev *pdev,
 {
struct net_device *dev = NULL;
struct bnx2x *bp;
-   enum pcie_link_width pcie_width;
-   enum pci_bus_speed pcie_speed;
int rc, max_non_def_sbs;
int rx_count, tx_count, rss_count, doorbell_size;
int max_cos_est;
@@ -14091,21 +14089,12 @@ static int bnx2x_init_one(struct pci_dev *pdev,
dev_addr_add(bp->dev, bp->fip_mac, NETDEV_HW_ADDR_T_SAN);
rtnl_unlock();
}
-   if (pcie_get_minimum_link(bp->pdev, &pcie_speed, &pcie_width) ||
-   pcie_speed == PCI_SPEED_UNKNOWN ||
-   pcie_width == PCIE_LNK_WIDTH_UNKNOWN)
-   BNX2X_DEV_INFO("Failed to determine PCI Express Bandwidth\n");
-   else
-   BNX2X_DEV_INFO(
-  "%s (%c%d) PCI-E x%d %s found at mem %lx, IRQ %d, node 
addr %pM\n",
-  board_info[ent->driver_data].name,
-  (CHIP_REV(bp) >> 12) + 'A', (CHIP_METAL(bp) >> 4),
-  pcie_width,
-  pcie_speed == PCIE_SPEED_2_5GT ? "2.5GHz" :
-  pcie_speed == PCIE_SPEED_5_0GT ? "5.0GHz" :
-  pcie_speed == PCIE_SPEED_8_0GT ? "8.0GHz" :
-  "Unknown",
-  dev->base_addr, bp->pdev->irq, dev->dev_addr);
+   BNX2X_DEV_INFO(
+  "%s (%c%d) PCI-E found at mem %lx, IRQ %d, node addr %pM\n",
+  board_info[ent->driver_data].name,
+  (CHIP_REV(bp) >> 12) + 'A', (CHIP_METAL(bp) >> 4),
+  dev->base_addr, bp->pdev->irq, dev->dev_addr);
+   pcie_print_link_status(bp->pdev);
 
bnx2x_register_phc(bp);
 



[PATCH v6 4/5] ixgbe: Report PCIe link properties with pcie_print_link_status()

2018-05-03 Thread Bjorn Helgaas
From: Bjorn Helgaas <bhelg...@google.com>

Previously the driver used pcie_get_minimum_link() to warn when the NIC
is in a slot that can't supply as much bandwidth as the NIC could use.

pcie_get_minimum_link() can be misleading because it finds the slowest link
and the narrowest link (which may be different links) without considering
the total bandwidth of each link.  For a path with a 16 GT/s x1 link and a
2.5 GT/s x16 link, it returns 2.5 GT/s x1, which corresponds to 250 MB/s of
bandwidth, not the true available bandwidth of about 1969 MB/s for a
16 GT/s x1 link.

Use pcie_print_link_status() to report PCIe link speed and possible
limitations instead of implementing this in the driver itself.  This finds
the slowest link in the path to the device by computing the total bandwidth
of each link and compares that with the capabilities of the device.

The dmesg change is:

  - PCI Express bandwidth of %dGT/s available
  - (Speed:%s, Width: x%d, Encoding Loss:%s)
  + %u.%03u Gb/s available PCIe bandwidth (%s x%d link)

or, if the device is capable of better performance than is available in the
current slot:

  - This is not sufficient for optimal performance of this card.
  - For optimal performance, at least %dGT/s of bandwidth is required.
  - A slot with more lanes and/or higher speed is suggested.
  + %u.%03u Gb/s available PCIe bandwidth, limited by %s x%d link at %s 
(capable of %u.%03u Gb/s with %s x%d link)

Note that the driver previously used dev_warn() to suggest using a
different slot, but pcie_print_link_status() uses dev_info() because if the
platform has no faster slot available, the user can't do anything about the
warning and may not want to be bothered with it.

Signed-off-by: Bjorn Helgaas <bhelg...@google.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   47 +
 1 file changed, 1 insertion(+), 46 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index afadba99f7b8..8990285f6e12 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -270,9 +270,6 @@ static void ixgbe_check_minimum_link(struct ixgbe_adapter *adapter,
 int expected_gts)
 {
	struct ixgbe_hw *hw = &adapter->hw;
-   int max_gts = 0;
-   enum pci_bus_speed speed = PCI_SPEED_UNKNOWN;
-   enum pcie_link_width width = PCIE_LNK_WIDTH_UNKNOWN;
struct pci_dev *pdev;
 
/* Some devices are not connected over PCIe and thus do not negotiate
@@ -288,49 +285,7 @@ static void ixgbe_check_minimum_link(struct ixgbe_adapter *adapter,
else
pdev = adapter->pdev;
 
-   if (pcie_get_minimum_link(pdev, &speed, &width) ||
-   speed == PCI_SPEED_UNKNOWN || width == PCIE_LNK_WIDTH_UNKNOWN) {
-   e_dev_warn("Unable to determine PCI Express bandwidth.\n");
-   return;
-   }
-
-   switch (speed) {
-   case PCIE_SPEED_2_5GT:
-   /* 8b/10b encoding reduces max throughput by 20% */
-   max_gts = 2 * width;
-   break;
-   case PCIE_SPEED_5_0GT:
-   /* 8b/10b encoding reduces max throughput by 20% */
-   max_gts = 4 * width;
-   break;
-   case PCIE_SPEED_8_0GT:
-   /* 128b/130b encoding reduces throughput by less than 2% */
-   max_gts = 8 * width;
-   break;
-   default:
-   e_dev_warn("Unable to determine PCI Express bandwidth.\n");
-   return;
-   }
-
-   e_dev_info("PCI Express bandwidth of %dGT/s available\n",
-  max_gts);
-   e_dev_info("(Speed:%s, Width: x%d, Encoding Loss:%s)\n",
-  (speed == PCIE_SPEED_8_0GT ? "8.0GT/s" :
-   speed == PCIE_SPEED_5_0GT ? "5.0GT/s" :
-   speed == PCIE_SPEED_2_5GT ? "2.5GT/s" :
-   "Unknown"),
-  width,
-  (speed == PCIE_SPEED_2_5GT ? "20%" :
-   speed == PCIE_SPEED_5_0GT ? "20%" :
-   speed == PCIE_SPEED_8_0GT ? "<2%" :
-   "Unknown"));
-
-   if (max_gts < expected_gts) {
-   e_dev_warn("This is not sufficient for optimal performance of 
this card.\n");
-   e_dev_warn("For optimal performance, at least %dGT/s of 
bandwidth is required.\n",
-   expected_gts);
-   e_dev_warn("A slot with more lanes and/or higher speed is 
suggested.\n");
-   }
+   pcie_print_link_status(pdev);
 }
 
 static void ixgbe_service_event_schedule(struct ixgbe_adapter *adapter)



Re: [pci PATCH v8 0/4] Add support for unmanaged SR-IOV

2018-04-24 Thread Bjorn Helgaas
On Sat, Apr 21, 2018 at 05:22:27PM -0700, Alexander Duyck wrote:
> On Sat, Apr 21, 2018 at 1:34 PM, Bjorn Helgaas <helg...@kernel.org> wrote:

> > For example, I'm not sure what you mean by "devices where the PF is
> > not capable of managing VF resources."
> >
> > It *sounds* like you're saying the hardware works differently on some
> > devices, but I don't think that's what you mean.  I think you're
> > saying something about which drivers are used for the PF and the VF.
> 
> That is sort of what I am saying.
> 
> So for example with ixgbe there is functionality which is controlled
> in the MMIO space of the PF that affects the functionality of the VFs
> that are generated on the device. The PF has to rearrange the
> resources such as queues and interrupts on the device before it can
> enable SR-IOV, and it could alter those later to limit what the VF is
> capable of doing.
> 
> The model I am dealing with via this patch set has a PF that is not
> much different than the VFs other than the fact that it has some
> extended configuration space bits in place for SR-IOV, ARI, ACS, and
> whatever other bits are needed in order to support spawning isolated
> VFs.

OK, thanks for the explanation, I think I understand what's going on
now, correct me if I'm mistaken.  I added a hint about "PF" for Randy,
too.

These are on pci/virtualization for v4.18.


commit 8effc395c209
Author: Alexander Duyck <alexander.h.du...@intel.com>
Date:   Sat Apr 21 15:23:09 2018 -0500

PCI/IOV: Add pci_sriov_configure_simple()

SR-IOV (Single Root I/O Virtualization) is an optional PCIe capability (see
PCIe r4.0, sec 9).  A PCIe Function with the SR-IOV capability is referred
to as a PF (Physical Function).  If SR-IOV is enabled on the PF, several
VFs (Virtual Functions) may be created.  The VFs can be individually
assigned to virtual machines, which allows them to share a single hardware
device while being isolated from each other.

Some SR-IOV devices have resources such as queues and interrupts that must
be set up in the PF before enabling the VFs, so they require a PF driver to
do that.

Other SR-IOV devices don't require any PF setup before enabling VFs.  Add a
pci_sriov_configure_simple() interface so PF drivers for such devices can
use it without repeating the VF-enabling code.
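
The semantics can be sketched roughly as follows (a simplified sketch, not
the exact source; the real helper also refuses to change the VF count while
VFs are assigned to guests):

  int pci_sriov_configure_simple(struct pci_dev *dev, int nr_virtfn)
  {
      int rc;

      if (!dev->is_physfn)
          return -ENODEV;

      if (nr_virtfn == 0) {
          pci_disable_sriov(dev);
          return 0;
      }

      rc = pci_enable_sriov(dev, nr_virtfn);
      return rc < 0 ? rc : nr_virtfn;
  }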

Tested-by: Mark Rustad <mark.d.rus...@intel.com>
    Signed-off-by: Alexander Duyck <alexander.h.du...@intel.com>
[bhelgaas: changelog, comment]
Signed-off-by: Bjorn Helgaas <bhelg...@google.com>
Reviewed-by: Greg Rose <gvrose8...@gmail.com>
Reviewed-by: Christoph Hellwig <h...@lst.de>

commit a8ccf8a3
Author: Alexander Duyck <alexander.h.du...@intel.com>
Date:   Tue Apr 24 16:47:16 2018 -0500

PCI/IOV: Add pci-pf-stub driver for PFs that only enable VFs

Some SR-IOV PF devices provide no functionality other than acting as a
means of enabling VFs.  For these devices, we want to enable the VFs and
assign them to guest virtual machines, but there's no need to have a driver
for the PF itself.

Add a new pci-pf-stub driver to claim those PF devices and provide the
generic VF enable functionality.  An administrator can use the sysfs
"sriov_numvfs" file to enable VFs, then assign them to guests.

For now I only have one example ID provided by Amazon in terms of devices
that require this functionality.  The general idea is that in the future we
will see other devices added as vendors come up with devices where the PF
is more or less just a lightweight shim used to allocate VFs.
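
A PF driver built on the new helper then reduces to little more than an ID
table; this sketch uses illustrative names and a hypothetical device ID
rather than the actual pci-pf-stub source:

  static const struct pci_device_id foo_pf_ids[] = {
      { PCI_VDEVICE(AMAZON, 0x0053) },    /* hypothetical example ID */
      { /* sentinel */ }
  };
  MODULE_DEVICE_TABLE(pci, foo_pf_ids);

  static int foo_pf_probe(struct pci_dev *dev, const struct pci_device_id *id)
  {
      return 0;    /* nothing to set up; VFs are enabled via sysfs */
  }

  static struct pci_driver foo_pf_driver = {
      .name            = "foo-pf",
      .id_table        = foo_pf_ids,
      .probe           = foo_pf_probe,
      .sriov_configure = pci_sriov_configure_simple,
  };
  module_pci_driver(foo_pf_driver);

An administrator then enables VFs by writing a count to the device's
sriov_numvfs sysfs attribute and assigns the resulting VFs to guests.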

Signed-off-by: Alexander Duyck <alexander.h.du...@intel.com>
[bhelgaas: changelog]
Signed-off-by: Bjorn Helgaas <bhelg...@google.com>
Reviewed-by: Greg Rose <gvrose8...@gmail.com>
Reviewed-by: Christoph Hellwig <h...@lst.de>

commit 115ddc491922
Author: Alexander Duyck <alexander.h.du...@intel.com>
Date:   Tue Apr 24 16:47:22 2018 -0500

net: ena: Use pci_sriov_configure_simple() to enable VFs

Instead of implementing our own version of a SR-IOV configuration stub in
the ena driver, use the existing pci_sriov_configure_simple() function.

Signed-off-by: Alexander Duyck <alexander.h.du...@intel.com>
Signed-off-by: Bjorn Helgaas <bhelg...@google.com>
Reviewed-by: Greg Rose <gvrose8...@gmail.com>
Reviewed-by: Christoph Hellwig <h...@lst.de>

commit 74d986abc20b
Author: Alexander Duyck <alexander.h.du...@intel.com>
Date:   Tue Apr 24 16:47:27 2018 -0500

nvme-pci: Use pci_sriov_configure_simple() to enable VFs

Instead of implementing our own version of a SR-IOV configuration stub in
the nvme driver, use the existing pci_sriov_configure_simple() function.

Signed-off-by: Alexander Duyck <alexander.h.du...@intel.com>
Signed-off-by: Bjorn Helgaas <bhelg...@google.com>
Reviewed-by: Christoph Hellwig <h...@lst.de>


Re: [pci PATCH v8 0/4] Add support for unmanaged SR-IOV

2018-04-21 Thread Bjorn Helgaas
On Fri, Apr 20, 2018 at 12:28:08PM -0400, Alexander Duyck wrote:
> This series is meant to add support for SR-IOV on devices when the VFs are
> not managed by the kernel. Examples of recent patches attempting to do this
> include:
> virtio - https://patchwork.kernel.org/patch/10241225/
> pci-stub - https://patchwork.kernel.org/patch/10109935/
> vfio - https://patchwork.kernel.org/patch/10103353/
> uio - https://patchwork.kernel.org/patch/9974031/
> 
> Since this is quickly blowing up into a multi-driver problem it is probably
> best to implement this solution as generically as possible.
> 
> This series is an attempt to do that. What we do with this patch set is
> provide a generic framework to enable SR-IOV in the case that the PF driver
> doesn't support managing the VFs itself.
> 
> I based my patch set originally on the patch by Mark Rustad but there isn't
> much left after going through and cleaning out the bits that were no longer
> needed, and after incorporating the feedback from David Miller. At this point
> the only item to be fully reused was his patch description, which is now
> present in patch 3 of the set.
> 
> This solution is limited in scope to just adding support for devices that
> provide no functionality for SR-IOV other than allocating the VFs by
> calling pci_enable_sriov. Previous sets had included patches for VFIO, but
> for now I am dropping that as the scope of that work is larger than I
> think I can take on at this time.
> 
> v2: Reduced scope back to just virtio_pci and vfio-pci
> Broke into 3 patch set from single patch
> Changed autoprobe behavior to always set when num_vfs is set non-zero
> v3: Updated Documentation to clarify when sriov_unmanaged_autoprobe is used
> Wrapped vfio_pci_sriov_configure to fix build errors w/o SR-IOV in kernel
> v4: Dropped vfio-pci patch
> Added ena and nvme to drivers now using pci_sriov_configure_unmanaged
> Dropped pci_disable_sriov call in virtio_pci to be consistent with ena
> v5: Dropped sriov_unmanaged_autoprobe and pci_sriov_conifgure_unmanaged
> Added new patch that enables pci_sriov_configure_simple
> Updated drivers to use pci_sriov_configure_simple
> v6: Defined pci_sriov_configure_simple as NULL when SR-IOV is not enabled
> Updated drivers to drop "#ifdef" checks for IOV
> Added pci-pf-stub as place for PF-only drivers to add support
> v7: Dropped pci_id table explanation from pci-pf-stub driver
> Updated pci_sriov_configure_simple to drop need for err value
> Fixed comment explaining why pci_sriov_configure_simple is NULL
> v8: Dropped virtio from the set, support to be added later after TC approval
> 
> Cc: Mark Rustad 
> Cc: Maximilian Heyne 
> Cc: Liang-Min Wang 
> Cc: David Woodhouse 
> 
> ---
> 
> Alexander Duyck (4):
>   pci: Add pci_sriov_configure_simple for PFs that don't manage VF resources
>   ena: Migrate over to unmanaged SR-IOV support
>   nvme: Migrate over to unmanaged SR-IOV support
>   pci-pf-stub: Add PF driver stub for PFs that function only to enable VFs
> 
> 
>  drivers/net/ethernet/amazon/ena/ena_netdev.c |   28 -
>  drivers/nvme/host/pci.c                      |   20 --
>  drivers/pci/Kconfig                          |   12 ++
>  drivers/pci/Makefile                         |    2 +
>  drivers/pci/iov.c                            |   31 +++
>  drivers/pci/pci-pf-stub.c                    |   54 ++
>  include/linux/pci.h                          |    3 +
>  include/linux/pci_ids.h                      |    2 +
>  8 files changed, 106 insertions(+), 46 deletions(-)
>  create mode 100644 drivers/pci/pci-pf-stub.c

I tentatively applied these to pci/virtualization-review.

The code changes look fine, but I want to flesh out the changelogs a
little bit before merging them.

For example, I'm not sure what you mean by "devices where the PF is
not capable of managing VF resources."

It *sounds* like you're saying the hardware works differently on some
devices, but I don't think that's what you mean.  I think you're
saying something about which drivers are used for the PF and the VF.

I think a trivial example of how this will be used might help.  I
assume this involves a virtualization scenario where the host uses the
PF to enable several VFs, but the host doesn't use the PF for much
else.  Then you assign the VFs to guests, and drivers in the guest
OSes use the VFs.

Since .sriov_configure() is only used by sriov_numvfs_store(), I
assume the usage model involves writing to the sysfs sriov_numvfs
attribute to enable the VFs, then assigning them to guests?

Bjorn


Re: [PATCH] PCI: Add PCIe to pcie_print_link_status() messages

2018-04-20 Thread Bjorn Helgaas
On Fri, Apr 13, 2018 at 11:16:38AM -0700, Jakub Kicinski wrote:
> Currently the pcie_print_link_status() will print PCIe bandwidth
> and link width information but does not mention it is pertaining
> to the PCIe.  Since this and related functions are used exclusively
> by networking drivers today users may get confused into thinking
> that it's the NIC bandwidth that is being talked about.  Insert a
> "PCIe" into the messages.
> 
> Signed-off-by: Jakub Kicinski 

Applied to for-linus for v4.17, thanks!

> ---
>  drivers/pci/pci.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index aa86e904f93c..73a0a4993f6a 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -5273,11 +5273,11 @@ void pcie_print_link_status(struct pci_dev *dev)
>   bw_avail = pcie_bandwidth_available(dev, &limiting_dev, &speed, &width);
>  
>   if (bw_avail >= bw_cap)
> - pci_info(dev, "%u.%03u Gb/s available bandwidth (%s x%d 
> link)\n",
> + pci_info(dev, "%u.%03u Gb/s available PCIe bandwidth (%s x%d 
> link)\n",
>bw_cap / 1000, bw_cap % 1000,
>PCIE_SPEED2STR(speed_cap), width_cap);
>   else
> - pci_info(dev, "%u.%03u Gb/s available bandwidth, limited by %s 
> x%d link at %s (capable of %u.%03u Gb/s with %s x%d link)\n",
> + pci_info(dev, "%u.%03u Gb/s available PCIe bandwidth, limited 
> by %s x%d link at %s (capable of %u.%03u Gb/s with %s x%d link)\n",
>bw_avail / 1000, bw_avail % 1000,
>PCIE_SPEED2STR(speed), width,
>limiting_dev ? pci_name(limiting_dev) : "",
> -- 
> 2.16.2
> 


Re: [PATCH net-next 1/2] PCI: Add two more values for PCIe Max_Read_Request_Size

2018-04-16 Thread Bjorn Helgaas
On Mon, Apr 16, 2018 at 09:37:13PM +0200, Heiner Kallweit wrote:
> This patch adds missing values for the max read request size.
> E.g. network driver r8169 uses a value of 4K.
> 
> Signed-off-by: Heiner Kallweit <hkallwe...@gmail.com>

I'd prefer a subject line with more details, e.g.,

  PCI: Add #defines for 2K and 4K Max Read Request Size

Acked-by: Bjorn Helgaas <bhelg...@google.com>

I suspect conflicts are more likely in r8169.c so it might make more
sense to route these through the netdev tree.  I'd also be happy to
take them, so let me know if you want me to take them, David.
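
For context, drivers apply these values through pcie_set_readrq(), which
programs the Max_Read_Request_Size field in the Device Control register.
A minimal sketch (hypothetical function name):

  static void foo_tune_mrrs(struct pci_dev *pdev)
  {
      /* 4096 bytes corresponds to the new PCI_EXP_DEVCTL_READRQ_4096B */
      if (pcie_set_readrq(pdev, 4096))
          pci_warn(pdev, "failed to set 4K max read request size\n");
  }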

> ---
>  include/uapi/linux/pci_regs.h | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
> index 0c79eac5..699257fb 100644
> --- a/include/uapi/linux/pci_regs.h
> +++ b/include/uapi/linux/pci_regs.h
> @@ -506,6 +506,8 @@
>  #define  PCI_EXP_DEVCTL_READRQ_256B  0x1000 /* 256 Bytes */
>  #define  PCI_EXP_DEVCTL_READRQ_512B  0x2000 /* 512 Bytes */
>  #define  PCI_EXP_DEVCTL_READRQ_1024B 0x3000 /* 1024 Bytes */
> +#define  PCI_EXP_DEVCTL_READRQ_2048B 0x4000 /* 2048 Bytes */
> +#define  PCI_EXP_DEVCTL_READRQ_4096B 0x5000 /* 4096 Bytes */
>  #define  PCI_EXP_DEVCTL_BCR_FLR 0x8000  /* Bridge Configuration Retry / FLR */
>  #define PCI_EXP_DEVSTA   10  /* Device Status */
>  #define  PCI_EXP_DEVSTA_CED  0x0001  /* Correctable Error Detected */
> -- 
> 2.17.0
> 
> 


Re: [PATCH v5 05/14] PCI: Add pcie_print_link_status() to log link speed and whether it's limited

2018-04-13 Thread Bjorn Helgaas
On Thu, Apr 12, 2018 at 09:32:49PM -0700, Jakub Kicinski wrote:
> On Fri, 30 Mar 2018 16:05:18 -0500, Bjorn Helgaas wrote:
> > +   if (bw_avail >= bw_cap)
> > +   pci_info(dev, "%d Mb/s available bandwidth (%s x%d link)\n",
> > +bw_cap, PCIE_SPEED2STR(speed_cap), width_cap);
> > +   else
> > +   pci_info(dev, "%d Mb/s available bandwidth, limited by %s x%d 
> > link at %s (capable of %d Mb/s with %s x%d link)\n",
> > +bw_avail, PCIE_SPEED2STR(speed), width,
> > +limiting_dev ? pci_name(limiting_dev) : "",
> > +bw_cap, PCIE_SPEED2STR(speed_cap), width_cap);
> 
> I was just looking at using this new function to print PCIe BW for a
> NIC, but I'm slightly worried that there is nothing in the message that
> says PCIe...  For a NIC some people may interpret the bandwidth as NIC
> bandwidth:
> 
> [   39.839989] nfp 0000:04:00.0: Netronome Flow Processor NFP4000/NFP6000 PCIe Card Probe
> [   39.848943] nfp 0000:04:00.0: 63.008 Gb/s available bandwidth (8 GT/s x8 link)
> [   39.857146] nfp 0000:04:00.0: RESERVED BARs: 0.0: General/MSI-X SRAM, 0.1: PCIe XPB/MSI-X PBA, 0.4: Explicit0, 0.5: Explicit1, fre4
> 
> It's not a 63Gbps NIC...  I'm sorry if this was discussed before and I
> didn't find it.  Would it make sense to add the "PCIe: " prefix to the
> message like bnx2x used to do?  Like:
> 
> nfp 0000:04:00.0: PCIe: 63.008 Gb/s available bandwidth (8 GT/s x8 link)

I agree, that does look potentially confusing.  How about this:

  nfp 0000:04:00.0: 63.008 Gb/s available PCIe bandwidth (8 GT/s x8 link)

I did have to look twice at this before I remembered that we're
printing Gb/s (not GB/s).  Most of the references I found on the web
use GB/s when talking about total PCIe bandwidth.

But either way I think it's definitely worth mentioning PCIe
explicitly.


Re: [PATCH v5 03/14] PCI: Add pcie_bandwidth_capable() to compute max supported link bandwidth

2018-04-03 Thread Bjorn Helgaas
On Mon, Apr 02, 2018 at 05:30:54PM -0700, Jacob Keller wrote:
> On Mon, Apr 2, 2018 at 7:05 AM, Bjorn Helgaas <helg...@kernel.org> wrote:
> > +/* PCIe speed to Mb/s reduced by encoding overhead */
> > +#define PCIE_SPEED2MBS_ENC(speed) \
> > +   ((speed) == PCIE_SPEED_16_0GT ? (16000*(128/130)) : \
> > +(speed) == PCIE_SPEED_8_0GT  ?  (8000*(128/130)) : \
> > +(speed) == PCIE_SPEED_5_0GT  ?  (5000*(8/10)) : \
> > +(speed) == PCIE_SPEED_2_5GT  ?  (2500*(8/10)) : \
> > +0)
> > +
> 
> Should this be "(speed * x ) / y" instead? wouldn't they calculate
> 128/130 and truncate that to zero before multiplying by the speed? Or
> are compilers smart enough to do this the other way to avoid the
> losses?

Yep, thanks for saving me yet more embarrassment.
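
The fix under discussion is to reorder the arithmetic so the multiplication
happens before the division; something along these lines avoids the
truncation to zero (a sketch of the corrected macro, constants in Mb/s):

  /* PCIe speed to Mb/s reduced by encoding overhead */
  #define PCIE_SPEED2MBS_ENC(speed) \
      ((speed) == PCIE_SPEED_16_0GT ? 16000*128/130 : \
       (speed) == PCIE_SPEED_8_0GT  ?  8000*128/130 : \
       (speed) == PCIE_SPEED_5_0GT  ?  5000*8/10 : \
       (speed) == PCIE_SPEED_2_5GT  ?  2500*8/10 : \
       0)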


Re: [PATCH v5 12/14] fm10k: Report PCIe link properties with pcie_print_link_status()

2018-04-02 Thread Bjorn Helgaas
On Mon, Apr 02, 2018 at 03:56:06PM +, Keller, Jacob E wrote:
> > -Original Message-
> > From: Bjorn Helgaas [mailto:helg...@kernel.org]
> > Sent: Friday, March 30, 2018 2:06 PM
> > To: Tal Gilboa <ta...@mellanox.com>
> > Cc: Tariq Toukan <tar...@mellanox.com>; Keller, Jacob E
> > <jacob.e.kel...@intel.com>; Ariel Elior <ariel.el...@cavium.com>; Ganesh
> > Goudar <ganes...@chelsio.com>; Kirsher, Jeffrey T
> > <jeffrey.t.kirs...@intel.com>; everest-linux...@cavium.com; intel-wired-
> > l...@lists.osuosl.org; netdev@vger.kernel.org; linux-ker...@vger.kernel.org;
> > linux-...@vger.kernel.org
> > Subject: [PATCH v5 12/14] fm10k: Report PCIe link properties with
> > pcie_print_link_status()
> > 
> > From: Bjorn Helgaas <bhelg...@google.com>
> > 
> > Use pcie_print_link_status() to report PCIe link speed and possible
> > limitations instead of implementing this in the driver itself.
> > 
> > Note that pcie_get_minimum_link() can return misleading information because
> > it finds the slowest link and the narrowest link without considering the
> > total bandwidth of the link.  If the path contains a 16 GT/s x1 link and a
> > 2.5 GT/s x16 link, pcie_get_minimum_link() returns 2.5 GT/s x1, which
> > corresponds to 250 MB/s of bandwidth, not the actual available bandwidth of
> > about 2000 MB/s for a 16 GT/s x1 link.
> 
> This comment is about what's being fixed, so it would have been easier to
> parse if it were written to more clearly indicate that we're removing
> (and not adding) this behavior.

Good point.  Is this any better?

  fm10k: Report PCIe link properties with pcie_print_link_status()
  
  Previously the driver used pcie_get_minimum_link() to warn when the NIC
  is in a slot that can't supply as much bandwidth as the NIC could use.
  
  pcie_get_minimum_link() can be misleading because it finds the slowest link
  and the narrowest link (which may be different links) without considering
  the total bandwidth of each link.  For a path with a 16 GT/s x1 link and a
  2.5 GT/s x16 link, it returns 2.5 GT/s x1, which corresponds to 250 MB/s of
  bandwidth, not the true available bandwidth of about 1969 MB/s for a
  16 GT/s x1 link.
  
  Use pcie_print_link_status() to report PCIe link speed and possible
  limitations instead of implementing this in the driver itself.  This finds
  the slowest link in the path to the device by computing the total bandwidth
  of each link and compares that with the capabilities of the device.
  
  Note that the driver previously used dev_warn() to suggest using a
  different slot, but pcie_print_link_status() uses dev_info() because if the
  platform has no faster slot available, the user can't do anything about the
  warning and may not want to be bothered with it.


Re: [PATCH v5 05/14] PCI: Add pcie_print_link_status() to log link speed and whether it's limited

2018-04-02 Thread Bjorn Helgaas
On Mon, Apr 02, 2018 at 04:25:17PM +, Keller, Jacob E wrote:
> > -Original Message-
> > From: Bjorn Helgaas [mailto:helg...@kernel.org]
> > Sent: Friday, March 30, 2018 2:05 PM
> > To: Tal Gilboa <ta...@mellanox.com>
> > Cc: Tariq Toukan <tar...@mellanox.com>; Keller, Jacob E
> > <jacob.e.kel...@intel.com>; Ariel Elior <ariel.el...@cavium.com>; Ganesh
> > Goudar <ganes...@chelsio.com>; Kirsher, Jeffrey T
> > <jeffrey.t.kirs...@intel.com>; everest-linux...@cavium.com; intel-wired-
> > l...@lists.osuosl.org; netdev@vger.kernel.org; linux-ker...@vger.kernel.org;
> > linux-...@vger.kernel.org
> > Subject: [PATCH v5 05/14] PCI: Add pcie_print_link_status() to log link 
> > speed and
> > whether it's limited
> > 
> > From: Tal Gilboa <ta...@mellanox.com>
> > 
> > Add pcie_print_link_status().  This logs the current settings of the link
> > (speed, width, and total available bandwidth).
> > 
> > If the device is capable of more bandwidth but is limited by a slower
> > upstream link, we include information about the link that limits the
> > device's performance.
> > 
> > The user may be able to move the device to a different slot for better
> > performance.
> > 
> > This provides a unified method for all PCI devices to report status and
> > issues, instead of each device reporting in a different way, using
> > different code.
> > 
> > Signed-off-by: Tal Gilboa <ta...@mellanox.com>
> > [bhelgaas: changelog, reword log messages, print device capabilities when
> > not limited]
> > Signed-off-by: Bjorn Helgaas <bhelg...@google.com>
> > ---
> >  drivers/pci/pci.c   |   29 +
> >  include/linux/pci.h |1 +
> >  2 files changed, 30 insertions(+)
> > 
> > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > index e00d56b12747..cec7aed09f6b 100644
> > --- a/drivers/pci/pci.c
> > +++ b/drivers/pci/pci.c
> > @@ -5283,6 +5283,35 @@ u32 pcie_bandwidth_capable(struct pci_dev *dev,
> > enum pci_bus_speed *speed,
> > return *width * PCIE_SPEED2MBS_ENC(*speed);
> >  }
> > 
> > +/**
> > + * pcie_print_link_status - Report the PCI device's link speed and width
> > + * @dev: PCI device to query
> > + *
> > + * Report the available bandwidth at the device.  If this is less than the
> > + * device is capable of, report the device's maximum possible bandwidth and
> > + * the upstream link that limits its performance to less than that.
> > + */
> > +void pcie_print_link_status(struct pci_dev *dev)
> > +{
> > +   enum pcie_link_width width, width_cap;
> > +   enum pci_bus_speed speed, speed_cap;
> > +   struct pci_dev *limiting_dev = NULL;
> > +   u32 bw_avail, bw_cap;
> > +
> > +   bw_cap = pcie_bandwidth_capable(dev, &speed_cap, &width_cap);
> > +   bw_avail = pcie_bandwidth_available(dev, &limiting_dev, &speed, &width);
> > +
> > +   if (bw_avail >= bw_cap)
> > +   pci_info(dev, "%d Mb/s available bandwidth (%s x%d link)\n",
> > +bw_cap, PCIE_SPEED2STR(speed_cap), width_cap);
> > +   else
> > +   pci_info(dev, "%d Mb/s available bandwidth, limited by %s x%d
> > link at %s (capable of %d Mb/s with %s x%d link)\n",
> > +bw_avail, PCIE_SPEED2STR(speed), width,
> > +limiting_dev ? pci_name(limiting_dev) : "",
> > +bw_cap, PCIE_SPEED2STR(speed_cap), width_cap);
> > +}
> 
> Personally, I would make this last one a pci_warn() to indicate it at a
> higher log level. I'm ok with the wording, and if consensus is that
> this should be at info, I'm ok with that.

Tal's original patch did have a pci_warn() here, and we went back and
forth a bit.  They get bug reports when a device doesn't perform as
expected, which argues for pci_warn().  But they also got feedback
saying warnings are a bit too much, which argues for pci_info() [1]

I don't have a really strong opinion either way.  I have a slight
preference for info because the user may not be able to do anything
about it (there may not be a faster slot available), and I think
distros are usually configured so a warning interrupts the smooth
graphical boot.

It looks like mlx4, fm10k, and ixgbe currently use warnings, while
bnx2x, bnxt_en, and cxgb4 use info.  It's a tie so far :)

[1] https://lkml.kernel.org/r/e47f3628-b56c-4d0a-f18b-5ffaf261c...@mellanox.com

Here's a proposal for printing the bandwidth as "x.xxx Gb/s":

commit ad370f38c1b5e9b8bb941eaed84ebb676c4bdaa4
Author: Tal Gilboa <ta...@mellanox.com>

Re: [PATCH v5 03/14] PCI: Add pcie_bandwidth_capable() to compute max supported link bandwidth

2018-04-02 Thread Bjorn Helgaas
On Mon, Apr 02, 2018 at 04:00:16PM +, Keller, Jacob E wrote:
> > -Original Message-
> > From: Tal Gilboa [mailto:ta...@mellanox.com]
> > Sent: Monday, April 02, 2018 7:34 AM
> > To: Bjorn Helgaas <helg...@kernel.org>
> > Cc: Tariq Toukan <tar...@mellanox.com>; Keller, Jacob E
> > <jacob.e.kel...@intel.com>; Ariel Elior <ariel.el...@cavium.com>; Ganesh
> > Goudar <ganes...@chelsio.com>; Kirsher, Jeffrey T
> > <jeffrey.t.kirs...@intel.com>; everest-linux...@cavium.com; intel-wired-
> > l...@lists.osuosl.org; netdev@vger.kernel.org; linux-ker...@vger.kernel.org;
> > linux-...@vger.kernel.org
> > Subject: Re: [PATCH v5 03/14] PCI: Add pcie_bandwidth_capable() to compute
> > max supported link bandwidth
> > 
> > On 4/2/2018 5:05 PM, Bjorn Helgaas wrote:
> > > On Mon, Apr 02, 2018 at 10:34:58AM +0300, Tal Gilboa wrote:
> > >> On 4/2/2018 3:40 AM, Bjorn Helgaas wrote:
> > >>> On Sun, Apr 01, 2018 at 11:38:53PM +0300, Tal Gilboa wrote:
> > >>>> On 3/31/2018 12:05 AM, Bjorn Helgaas wrote:
> > >>>>> From: Tal Gilboa <ta...@mellanox.com>
> > >>>>>
> > >>>>> Add pcie_bandwidth_capable() to compute the max link bandwidth
> > supported by
> > >>>>> a device, based on the max link speed and width, adjusted by the
> > encoding
> > >>>>> overhead.
> > >>>>>
> > >>>>> The maximum bandwidth of the link is computed as:
> > >>>>>
> > >>>>>  max_link_speed * max_link_width * (1 - encoding_overhead)
> > >>>>>
> > >>>>> The encoding overhead is about 20% for 2.5 and 5.0 GT/s links using
> > 8b/10b
> > >>>>> encoding, and about 1.5% for 8 GT/s or higher speed links using 
> > >>>>> 128b/130b
> > >>>>> encoding.
> > >>>>>
> > >>>>> Signed-off-by: Tal Gilboa <ta...@mellanox.com>
> > >>>>> [bhelgaas: adjust for pcie_get_speed_cap() and pcie_get_width_cap()
> > >>>>> signatures, don't export outside drivers/pci]
> > >>>>> Signed-off-by: Bjorn Helgaas <bhelg...@google.com>
> > >>>>> Reviewed-by: Tariq Toukan <tar...@mellanox.com>
> > >>>>> ---
> > >>>>> drivers/pci/pci.c |   21 +
> > >>>>> drivers/pci/pci.h |9 +
> > >>>>> 2 files changed, 30 insertions(+)
> > >>>>>
> > >>>>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > >>>>> index 43075be79388..9ce89e254197 100644
> > >>>>> --- a/drivers/pci/pci.c
> > >>>>> +++ b/drivers/pci/pci.c
> > >>>>> @@ -5208,6 +5208,27 @@ enum pcie_link_width
> > pcie_get_width_cap(struct pci_dev *dev)
> > >>>>>   return PCIE_LNK_WIDTH_UNKNOWN;
> > >>>>> }
> > >>>>> +/**
> > >>>>> + * pcie_bandwidth_capable - calculates a PCI device's link bandwidth
> > capability
> > >>>>> + * @dev: PCI device
> > >>>>> + * @speed: storage for link speed
> > >>>>> + * @width: storage for link width
> > >>>>> + *
> > >>>>> + * Calculate a PCI device's link bandwidth by querying for its link 
> > >>>>> speed
> > >>>>> + * and width, multiplying them, and applying encoding overhead.
> > >>>>> + */
> > >>>>> +u32 pcie_bandwidth_capable(struct pci_dev *dev, enum pci_bus_speed
> > *speed,
> > >>>>> +enum pcie_link_width *width)
> > >>>>> +{
> > >>>>> + *speed = pcie_get_speed_cap(dev);
> > >>>>> + *width = pcie_get_width_cap(dev);
> > >>>>> +
> > >>>>> + if (*speed == PCI_SPEED_UNKNOWN || *width ==
> > PCIE_LNK_WIDTH_UNKNOWN)
> > >>>>> + return 0;
> > >>>>> +
> > >>>>> + return *width * PCIE_SPEED2MBS_ENC(*speed);
> > >>>>> +}
> > >>>>> +
> > >>>>> /**
> > >>>>>  * pci_select_bars - Make BAR mask from the type of resource
> > >>>>>  * @dev: the PCI device for which BAR mask is 

Re: [PATCH v5 03/14] PCI: Add pcie_bandwidth_capable() to compute max supported link bandwidth

2018-04-02 Thread Bjorn Helgaas
On Mon, Apr 02, 2018 at 10:34:58AM +0300, Tal Gilboa wrote:
> On 4/2/2018 3:40 AM, Bjorn Helgaas wrote:
> > On Sun, Apr 01, 2018 at 11:38:53PM +0300, Tal Gilboa wrote:
> > > On 3/31/2018 12:05 AM, Bjorn Helgaas wrote:
> > > > From: Tal Gilboa <ta...@mellanox.com>
> > > > 
> > > > Add pcie_bandwidth_capable() to compute the max link bandwidth 
> > > > supported by
> > > > a device, based on the max link speed and width, adjusted by the 
> > > > encoding
> > > > overhead.
> > > > 
> > > > The maximum bandwidth of the link is computed as:
> > > > 
> > > > max_link_speed * max_link_width * (1 - encoding_overhead)
> > > > 
> > > > The encoding overhead is about 20% for 2.5 and 5.0 GT/s links using 
> > > > 8b/10b
> > > > encoding, and about 1.5% for 8 GT/s or higher speed links using 
> > > > 128b/130b
> > > > encoding.
> > > > 
> > > > Signed-off-by: Tal Gilboa <ta...@mellanox.com>
> > > > [bhelgaas: adjust for pcie_get_speed_cap() and pcie_get_width_cap()
> > > > signatures, don't export outside drivers/pci]
> > > > Signed-off-by: Bjorn Helgaas <bhelg...@google.com>
> > > > Reviewed-by: Tariq Toukan <tar...@mellanox.com>
> > > > ---
> > > >drivers/pci/pci.c |   21 +
> > > >drivers/pci/pci.h |9 +
> > > >2 files changed, 30 insertions(+)
> > > > 
> > > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > > > index 43075be79388..9ce89e254197 100644
> > > > --- a/drivers/pci/pci.c
> > > > +++ b/drivers/pci/pci.c
> > > > @@ -5208,6 +5208,27 @@ enum pcie_link_width pcie_get_width_cap(struct 
> > > > pci_dev *dev)
> > > > return PCIE_LNK_WIDTH_UNKNOWN;
> > > >}
> > > > +/**
> > > > + * pcie_bandwidth_capable - calculates a PCI device's link bandwidth 
> > > > capability
> > > > + * @dev: PCI device
> > > > + * @speed: storage for link speed
> > > > + * @width: storage for link width
> > > > + *
> > > > + * Calculate a PCI device's link bandwidth by querying for its link 
> > > > speed
> > > > + * and width, multiplying them, and applying encoding overhead.
> > > > + */
> > > > +u32 pcie_bandwidth_capable(struct pci_dev *dev, enum pci_bus_speed 
> > > > *speed,
> > > > +  enum pcie_link_width *width)
> > > > +{
> > > > +   *speed = pcie_get_speed_cap(dev);
> > > > +   *width = pcie_get_width_cap(dev);
> > > > +
> > > > +   if (*speed == PCI_SPEED_UNKNOWN || *width == 
> > > > PCIE_LNK_WIDTH_UNKNOWN)
> > > > +   return 0;
> > > > +
> > > > +   return *width * PCIE_SPEED2MBS_ENC(*speed);
> > > > +}
> > > > +
> > > >/**
> > > > * pci_select_bars - Make BAR mask from the type of resource
> > > > * @dev: the PCI device for which BAR mask is made
> > > > diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> > > > index 66738f1050c0..2a50172b9803 100644
> > > > --- a/drivers/pci/pci.h
> > > > +++ b/drivers/pci/pci.h
> > > > @@ -261,8 +261,17 @@ void pci_disable_bridge_window(struct pci_dev 
> > > > *dev);
> > > >  (speed) == PCIE_SPEED_2_5GT ? "2.5 GT/s" : \
> > > >  "Unknown speed")
> > > > +/* PCIe speed to Mb/s with encoding overhead: 20% for gen2, ~1.5% for 
> > > > gen3 */
> > > > +#define PCIE_SPEED2MBS_ENC(speed) \
> > > 
> > > Missing gen4.
> > 
> > I made it "gen3+".  I think that's accurate, isn't it?  The spec
> > doesn't seem to actually use "gen3" as a specific term, but sec 4.2.2
> > says rates of 8 GT/s or higher (which I think includes gen3 and gen4)
> > use 128b/130b encoding.
> > 
> 
> I meant that PCIE_SPEED_16_0GT will return 0 from this macro since it wasn't
> added. Need to return 15754.

Oh, duh, of course!  Sorry for being dense.  What about the following?
I included the calculation as opposed to just the magic numbers to try
to make it clear how they're derived.  This has the disadvantage of
truncating the result instead of rounding, but I doubt that's
significant in this context.  If it is, we could use the magic numbers
and put 

Re: [PATCH v5 04/14] PCI: Add pcie_bandwidth_available() to compute bandwidth available to device

2018-04-01 Thread Bjorn Helgaas
On Sun, Apr 01, 2018 at 11:41:42PM +0300, Tal Gilboa wrote:
> On 3/31/2018 12:05 AM, Bjorn Helgaas wrote:
> > From: Tal Gilboa <ta...@mellanox.com>
> > 
> > Add pcie_bandwidth_available() to compute the bandwidth available to a
> > device.  This may be limited by the device itself or by a slower upstream
> > link leading to the device.
> > 
> > The available bandwidth at each link along the path is computed as:
> > 
> >link_speed * link_width * (1 - encoding_overhead)
> > 
> > The encoding overhead is about 20% for 2.5 and 5.0 GT/s links using 8b/10b
> > encoding, and about 1.5% for 8 GT/s or higher speed links using 128b/130b
> > encoding.
> > 
> > Also return the device with the slowest link and the speed and width of
> > that link.
> > 
> > Signed-off-by: Tal Gilboa <ta...@mellanox.com>
> > [bhelgaas: changelog, leave pcie_get_minimum_link() alone for now, return
> > bw directly, use pci_upstream_bridge(), check "next_bw <= bw" to find
> > uppermost limiting device, return speed/width of the limiting device]
> > Signed-off-by: Bjorn Helgaas <bhelg...@google.com>
> > ---
> >   drivers/pci/pci.c   |   54 
> > +++
> >   include/linux/pci.h |3 +++
> >   2 files changed, 57 insertions(+)
> > 
> > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > index 9ce89e254197..e00d56b12747 100644
> > --- a/drivers/pci/pci.c
> > +++ b/drivers/pci/pci.c
> > @@ -5146,6 +5146,60 @@ int pcie_get_minimum_link(struct pci_dev *dev, enum 
> > pci_bus_speed *speed,
> >   }
> >   EXPORT_SYMBOL(pcie_get_minimum_link);
> > +/**
> > + * pcie_bandwidth_available - determine minimum link settings of a PCIe
> > + *   device and its bandwidth limitation
> > + * @dev: PCI device to query
> > + * @limiting_dev: storage for device causing the bandwidth limitation
> > + * @speed: storage for speed of limiting device
> > + * @width: storage for width of limiting device
> > + *
> > + * Walk up the PCI device chain and find the point where the minimum
> > + * bandwidth is available.  Return the bandwidth available there and (if
> > + * limiting_dev, speed, and width pointers are supplied) information about
> > + * that point.
> > + */
> > +u32 pcie_bandwidth_available(struct pci_dev *dev, struct pci_dev 
> > **limiting_dev,
> > +enum pci_bus_speed *speed,
> > +enum pcie_link_width *width)
> > +{
> > +   u16 lnksta;
> > +   enum pci_bus_speed next_speed;
> > +   enum pcie_link_width next_width;
> > +   u32 bw, next_bw;
> > +
> > +   *speed = PCI_SPEED_UNKNOWN;
> > +   *width = PCIE_LNK_WIDTH_UNKNOWN;
> 
> This is not safe anymore, now that we allow speed/width=NULL.

Good catch, thanks!
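
i.e. the unconditional stores at the top of the function need guarding once
NULL is permitted, along the lines of:

  if (speed)
      *speed = PCI_SPEED_UNKNOWN;
  if (width)
      *width = PCIE_LNK_WIDTH_UNKNOWN;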


Re: [PATCH v5 03/14] PCI: Add pcie_bandwidth_capable() to compute max supported link bandwidth

2018-04-01 Thread Bjorn Helgaas
On Sun, Apr 01, 2018 at 11:38:53PM +0300, Tal Gilboa wrote:
> On 3/31/2018 12:05 AM, Bjorn Helgaas wrote:
> > From: Tal Gilboa <ta...@mellanox.com>
> > 
> > Add pcie_bandwidth_capable() to compute the max link bandwidth supported by
> > a device, based on the max link speed and width, adjusted by the encoding
> > overhead.
> > 
> > The maximum bandwidth of the link is computed as:
> > 
> >max_link_speed * max_link_width * (1 - encoding_overhead)
> > 
> > The encoding overhead is about 20% for 2.5 and 5.0 GT/s links using 8b/10b
> > encoding, and about 1.5% for 8 GT/s or higher speed links using 128b/130b
> > encoding.
> > 
> > Signed-off-by: Tal Gilboa <ta...@mellanox.com>
> > [bhelgaas: adjust for pcie_get_speed_cap() and pcie_get_width_cap()
> > signatures, don't export outside drivers/pci]
> > Signed-off-by: Bjorn Helgaas <bhelg...@google.com>
> > Reviewed-by: Tariq Toukan <tar...@mellanox.com>
> > ---
> >   drivers/pci/pci.c |   21 +
> >   drivers/pci/pci.h |9 +
> >   2 files changed, 30 insertions(+)
> > 
> > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > index 43075be79388..9ce89e254197 100644
> > --- a/drivers/pci/pci.c
> > +++ b/drivers/pci/pci.c
> > @@ -5208,6 +5208,27 @@ enum pcie_link_width pcie_get_width_cap(struct 
> > pci_dev *dev)
> > return PCIE_LNK_WIDTH_UNKNOWN;
> >   }
> > +/**
> > + * pcie_bandwidth_capable - calculates a PCI device's link bandwidth 
> > capability
> > + * @dev: PCI device
> > + * @speed: storage for link speed
> > + * @width: storage for link width
> > + *
> > + * Calculate a PCI device's link bandwidth by querying for its link speed
> > + * and width, multiplying them, and applying encoding overhead.
> > + */
> > +u32 pcie_bandwidth_capable(struct pci_dev *dev, enum pci_bus_speed *speed,
> > +  enum pcie_link_width *width)
> > +{
> > +   *speed = pcie_get_speed_cap(dev);
> > +   *width = pcie_get_width_cap(dev);
> > +
> > +   if (*speed == PCI_SPEED_UNKNOWN || *width == PCIE_LNK_WIDTH_UNKNOWN)
> > +   return 0;
> > +
> > +   return *width * PCIE_SPEED2MBS_ENC(*speed);
> > +}
> > +
> >   /**
> >* pci_select_bars - Make BAR mask from the type of resource
> >* @dev: the PCI device for which BAR mask is made
> > diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> > index 66738f1050c0..2a50172b9803 100644
> > --- a/drivers/pci/pci.h
> > +++ b/drivers/pci/pci.h
> > @@ -261,8 +261,17 @@ void pci_disable_bridge_window(struct pci_dev *dev);
> >  (speed) == PCIE_SPEED_2_5GT ? "2.5 GT/s" : \
> >  "Unknown speed")
> > +/* PCIe speed to Mb/s with encoding overhead: 20% for gen2, ~1.5% for gen3 
> > */
> > +#define PCIE_SPEED2MBS_ENC(speed) \
> 
> Missing gen4.

I made it "gen3+".  I think that's accurate, isn't it?  The spec
doesn't seem to actually use "gen3" as a specific term, but sec 4.2.2
says rates of 8 GT/s or higher (which I think includes gen3 and gen4)
use 128b/130b encoding.
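
As a worked check of those factors: an 8 GT/s x8 link comes out to
8000 Mb/s * 128/130 = 7876 Mb/s per lane (integer-truncated), times 8 lanes
= 63008 Mb/s, which is exactly the "63.008 Gb/s" in the nfp log quoted
earlier in this thread.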


[PATCH v5 00/14] Report PCI device link status

2018-03-30 Thread Bjorn Helgaas
This is mostly Tal's work to reduce code duplication in drivers and unify
the approach for reporting PCIe link speed/width and whether the device is
being limited by a slower upstream link.

This v5 series is based on Tal's v4 [1].

Changes since v4:
  - Added patches to replace uses of pcie_get_minimum_link() in bnx2x,
bnxt_en, cxgb4, fm10k, and ixgbe.  Note that this is a user-visible
change to the log messages, and in some cases changes dev_warn() to
dev_info().  I hope we can converge on something that works for
everybody, and it's OK if we need to tweak the text and/or level used
in pcie_print_link_status() to get there.

  - Rebased on top of Jay Fang's patch that adds 16 GT/s decoding support.

  - Changed pcie_get_speed_cap() and pcie_get_width_cap() to return the
values directly instead of returning both an error code and the value
via a reference parameter.  I don't think the callers can really use
both the error and the value.

  - Moved some declarations from linux/pci.h to drivers/pci/pci.h so
they're not visible outside the PCI subsystem.  Also removed
corresponding EXPORT_SYMBOL()s.  If we need these outside the PCI core,
we can export them again, but that's not needed yet.

  - Reworked pcie_bandwidth_available() so it finds the uppermost limiting
device and returns width/speed info for that device (previously it
could return width from one device and speed from a different one).

The incremental diff between the v4 series (based on v4.17-rc1) and this v5
series (based on v4.17-rc1 + Jay Fang's patch) is attached.  This diff
doesn't include the new patches to bnx2x, bnxt_en, cxgb4, fm10k, and ixgbe.

I don't have any of this hardware, so this is only compile-tested.

Bjorn


[1] 
https://lkml.kernel.org/r/1522394086-3555-1-git-send-email-ta...@mellanox.com

---

Bjorn Helgaas (6):
  bnx2x: Report PCIe link properties with pcie_print_link_status()
  bnxt_en: Report PCIe link properties with pcie_print_link_status()
  cxgb4: Report PCIe link properties with pcie_print_link_status()
  fm10k: Report PCIe link properties with pcie_print_link_status()
  ixgbe: Report PCIe link properties with pcie_print_link_status()
  PCI: Remove unused pcie_get_minimum_link()

Tal Gilboa (8):
  PCI: Add pcie_get_speed_cap() to find max supported link speed
  PCI: Add pcie_get_width_cap() to find max supported link width
  PCI: Add pcie_bandwidth_capable() to compute max supported link bandwidth
  PCI: Add pcie_bandwidth_available() to compute bandwidth available to 
device
  PCI: Add pcie_print_link_status() to log link speed and whether it's 
limited
  net/mlx4_core: Report PCIe link properties with pcie_print_link_status()
  net/mlx5: Report PCIe link properties with pcie_print_link_status()
  net/mlx5e: Use pcie_bandwidth_available() to compute bandwidth


 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c  |   23 +--
 drivers/net/ethernet/broadcom/bnxt/bnxt.c |   19 --
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c   |   75 -
 drivers/net/ethernet/intel/fm10k/fm10k_pci.c  |   87 ---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   47 --
 drivers/net/ethernet/mellanox/mlx4/main.c |   81 --
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |   32 
 drivers/net/ethernet/mellanox/mlx5/core/main.c|4 +
 drivers/pci/pci-sysfs.c   |   38 +
 drivers/pci/pci.c |  167 ++---
 drivers/pci/pci.h |   20 +++
 include/linux/pci.h   |6 +
 12 files changed, 189 insertions(+), 410 deletions(-)



diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 1bbd6cd20213..93291ec4a3d1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -3864,25 +3864,6 @@ void mlx5e_build_default_indir_rqt(u32 *indirection_rqt, 
int len,
indirection_rqt[i] = i % num_channels;
 }
 
-static int mlx5e_get_pci_bw(struct mlx5_core_dev *mdev, u32 *pci_bw)
-{
-   enum pcie_link_width width;
-   enum pci_bus_speed speed;
-   int err = 0;
-   int bw;
-
-   err = pcie_bandwidth_available(mdev->pdev, &speed, &width, &bw, NULL);
-   if (err)
-   return err;
-
-   if (speed == PCI_SPEED_UNKNOWN || width == PCIE_LNK_WIDTH_UNKNOWN)
-   return -EINVAL;
-
-   *pci_bw = bw;
-
-   return 0;
-}
-
 static bool cqe_compress_heuristic(u32 link_speed, u32 pci_bw)
 {
return (link_speed && pci_bw &&
@@ -3968,7 +3949,7 @@ void mlx5e_build_nic_params(struct mlx5_core_dev *mdev,
params->num_tc   = 1;
 
mlx5e_get_max_linkspeed(mdev, _speed);
-   mlx5e_get_pci_bw(mdev, &pci_bw);
+   pci_bw = pcie_bandwidth_available(mdev->pdev, NULL, NULL, NULL);

[PATCH v5 01/14] PCI: Add pcie_get_speed_cap() to find max supported link speed

2018-03-30 Thread Bjorn Helgaas
From: Tal Gilboa <ta...@mellanox.com>

Add pcie_get_speed_cap() to find the max link speed supported by a device.
Change max_link_speed_show() to use pcie_get_speed_cap().

Signed-off-by: Tal Gilboa <ta...@mellanox.com>
[bhelgaas: return speed directly instead of error and *speed, don't export
outside drivers/pci]
Signed-off-by: Bjorn Helgaas <bhelg...@google.com>
Reviewed-by: Tariq Toukan <tar...@mellanox.com>
---
 drivers/pci/pci-sysfs.c |   28 ++--
 drivers/pci/pci.c   |   44 
 drivers/pci/pci.h   |   10 ++
 3 files changed, 56 insertions(+), 26 deletions(-)

diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index 7dc5be545d18..c2ea05fbbf1d 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -158,33 +158,9 @@ static DEVICE_ATTR_RO(resource);
 static ssize_t max_link_speed_show(struct device *dev,
   struct device_attribute *attr, char *buf)
 {
-   struct pci_dev *pci_dev = to_pci_dev(dev);
-   u32 linkcap;
-   int err;
-   const char *speed;
-
-   err = pcie_capability_read_dword(pci_dev, PCI_EXP_LNKCAP, &linkcap);
-   if (err)
-   return -EINVAL;
-
-   switch (linkcap & PCI_EXP_LNKCAP_SLS) {
-   case PCI_EXP_LNKCAP_SLS_16_0GB:
-   speed = "16 GT/s";
-   break;
-   case PCI_EXP_LNKCAP_SLS_8_0GB:
-   speed = "8 GT/s";
-   break;
-   case PCI_EXP_LNKCAP_SLS_5_0GB:
-   speed = "5 GT/s";
-   break;
-   case PCI_EXP_LNKCAP_SLS_2_5GB:
-   speed = "2.5 GT/s";
-   break;
-   default:
-   speed = "Unknown speed";
-   }
+   struct pci_dev *pdev = to_pci_dev(dev);
 
-   return sprintf(buf, "%s\n", speed);
+   return sprintf(buf, "%s\n", PCIE_SPEED2STR(pcie_get_speed_cap(pdev)));
 }
 static DEVICE_ATTR_RO(max_link_speed);
 
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index f6a4dd10d9b0..b29d3436ee9f 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5146,6 +5146,50 @@ int pcie_get_minimum_link(struct pci_dev *dev, enum 
pci_bus_speed *speed,
 }
 EXPORT_SYMBOL(pcie_get_minimum_link);
 
+/**
+ * pcie_get_speed_cap - query for the PCI device's link speed capability
+ * @dev: PCI device to query
+ *
+ * Query the PCI device speed capability.  Return the maximum link speed
+ * supported by the device.
+ */
+enum pci_bus_speed pcie_get_speed_cap(struct pci_dev *dev)
+{
+   u32 lnkcap2, lnkcap;
+
+   /*
+* PCIe r4.0 sec 7.5.3.18 recommends using the Supported Link
+* Speeds Vector in Link Capabilities 2 when supported, falling
+* back to Max Link Speed in Link Capabilities otherwise.
+*/
+   pcie_capability_read_dword(dev, PCI_EXP_LNKCAP2, &lnkcap2);
+   if (lnkcap2) { /* PCIe r3.0-compliant */
+   if (lnkcap2 & PCI_EXP_LNKCAP2_SLS_16_0GB)
+   return PCIE_SPEED_16_0GT;
+   else if (lnkcap2 & PCI_EXP_LNKCAP2_SLS_8_0GB)
+   return PCIE_SPEED_8_0GT;
+   else if (lnkcap2 & PCI_EXP_LNKCAP2_SLS_5_0GB)
+   return PCIE_SPEED_5_0GT;
+   else if (lnkcap2 & PCI_EXP_LNKCAP2_SLS_2_5GB)
+   return PCIE_SPEED_2_5GT;
+   return PCI_SPEED_UNKNOWN;
+   }
+
+   pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnkcap);
+   if (lnkcap) {
+   if (lnkcap & PCI_EXP_LNKCAP_SLS_16_0GB)
+   return PCIE_SPEED_16_0GT;
+   else if (lnkcap & PCI_EXP_LNKCAP_SLS_8_0GB)
+   return PCIE_SPEED_8_0GT;
+   else if (lnkcap & PCI_EXP_LNKCAP_SLS_5_0GB)
+   return PCIE_SPEED_5_0GT;
+   else if (lnkcap & PCI_EXP_LNKCAP_SLS_2_5GB)
+   return PCIE_SPEED_2_5GT;
+   }
+
+   return PCI_SPEED_UNKNOWN;
+}
+
 /**
  * pci_select_bars - Make BAR mask from the type of resource
  * @dev: the PCI device for which BAR mask is made
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index fcd81911b127..1186d8be6055 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -253,6 +253,16 @@ bool pci_bus_clip_resource(struct pci_dev *dev, int idx);
 void pci_reassigndev_resource_alignment(struct pci_dev *dev);
 void pci_disable_bridge_window(struct pci_dev *dev);
 
+/* PCIe link information */
+#define PCIE_SPEED2STR(speed) \
+   ((speed) == PCIE_SPEED_16_0GT ? "16 GT/s" : \
+(speed) == PCIE_SPEED_8_0GT ? "8 GT/s" : \
+(speed) == PCIE_SPEED_5_0GT ? "5 GT/s" : \
+(speed) == PCIE_SPEED_2_5GT ? "2.5 GT/s" : \
+"Unknown speed")
+
+enum pci_bus_speed pcie_get_speed_cap(struct pci_dev *dev);
+
 /* Single Root I/O Virtualization */
 struct pci_sriov {
int pos;/* Capability position */



[PATCH v5 03/14] PCI: Add pcie_bandwidth_capable() to compute max supported link bandwidth

2018-03-30 Thread Bjorn Helgaas
From: Tal Gilboa <ta...@mellanox.com>

Add pcie_bandwidth_capable() to compute the max link bandwidth supported by
a device, based on the max link speed and width, adjusted by the encoding
overhead.

The maximum bandwidth of the link is computed as:

  max_link_speed * max_link_width * (1 - encoding_overhead)

The encoding overhead is about 20% for 2.5 and 5.0 GT/s links using 8b/10b
encoding, and about 1.5% for 8 GT/s or higher speed links using 128b/130b
encoding.

Signed-off-by: Tal Gilboa <ta...@mellanox.com>
[bhelgaas: adjust for pcie_get_speed_cap() and pcie_get_width_cap()
signatures, don't export outside drivers/pci]
Signed-off-by: Bjorn Helgaas <bhelg...@google.com>
Reviewed-by: Tariq Toukan <tar...@mellanox.com>
---
 drivers/pci/pci.c |   21 +
 drivers/pci/pci.h |9 +
 2 files changed, 30 insertions(+)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 43075be79388..9ce89e254197 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5208,6 +5208,27 @@ enum pcie_link_width pcie_get_width_cap(struct pci_dev 
*dev)
return PCIE_LNK_WIDTH_UNKNOWN;
 }
 
+/**
+ * pcie_bandwidth_capable - calculates a PCI device's link bandwidth capability
+ * @dev: PCI device
+ * @speed: storage for link speed
+ * @width: storage for link width
+ *
+ * Calculate a PCI device's link bandwidth by querying for its link speed
+ * and width, multiplying them, and applying encoding overhead.
+ */
+u32 pcie_bandwidth_capable(struct pci_dev *dev, enum pci_bus_speed *speed,
+  enum pcie_link_width *width)
+{
+   *speed = pcie_get_speed_cap(dev);
+   *width = pcie_get_width_cap(dev);
+
+   if (*speed == PCI_SPEED_UNKNOWN || *width == PCIE_LNK_WIDTH_UNKNOWN)
+   return 0;
+
+   return *width * PCIE_SPEED2MBS_ENC(*speed);
+}
+
 /**
  * pci_select_bars - Make BAR mask from the type of resource
  * @dev: the PCI device for which BAR mask is made
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 66738f1050c0..2a50172b9803 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -261,8 +261,17 @@ void pci_disable_bridge_window(struct pci_dev *dev);
 (speed) == PCIE_SPEED_2_5GT ? "2.5 GT/s" : \
 "Unknown speed")
 
+/* PCIe speed to Mb/s with encoding overhead: 20% for gen2, ~1.5% for gen3 */
+#define PCIE_SPEED2MBS_ENC(speed) \
+   ((speed) == PCIE_SPEED_8_0GT ? 7877 : \
+(speed) == PCIE_SPEED_5_0GT ? 4000 : \
+(speed) == PCIE_SPEED_2_5GT ? 2000 : \
+0)
+
 enum pci_bus_speed pcie_get_speed_cap(struct pci_dev *dev);
 enum pcie_link_width pcie_get_width_cap(struct pci_dev *dev);
+u32 pcie_bandwidth_capable(struct pci_dev *dev, enum pci_bus_speed *speed,
+  enum pcie_link_width *width);
 
 /* Single Root I/O Virtualization */
 struct pci_sriov {



[PATCH v5 02/14] PCI: Add pcie_get_width_cap() to find max supported link width

2018-03-30 Thread Bjorn Helgaas
From: Tal Gilboa <ta...@mellanox.com>

Add pcie_get_width_cap() to find the max link width supported by a device.
Change max_link_width_show() to use pcie_get_width_cap().

Signed-off-by: Tal Gilboa <ta...@mellanox.com>
[bhelgaas: return width directly instead of error and *width, don't export
outside drivers/pci]
Signed-off-by: Bjorn Helgaas <bhelg...@google.com>
Reviewed-by: Tariq Toukan <tar...@mellanox.com>
---
 drivers/pci/pci-sysfs.c |   10 ++
 drivers/pci/pci.c   |   18 ++
 drivers/pci/pci.h   |1 +
 3 files changed, 21 insertions(+), 8 deletions(-)

diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index c2ea05fbbf1d..63d0952684fb 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -167,15 +167,9 @@ static DEVICE_ATTR_RO(max_link_speed);
 static ssize_t max_link_width_show(struct device *dev,
   struct device_attribute *attr, char *buf)
 {
-   struct pci_dev *pci_dev = to_pci_dev(dev);
-   u32 linkcap;
-   int err;
-
-   err = pcie_capability_read_dword(pci_dev, PCI_EXP_LNKCAP, &linkcap);
-   if (err)
-   return -EINVAL;
+   struct pci_dev *pdev = to_pci_dev(dev);
 
-   return sprintf(buf, "%u\n", (linkcap & PCI_EXP_LNKCAP_MLW) >> 4);
+   return sprintf(buf, "%u\n", pcie_get_width_cap(pdev));
 }
 static DEVICE_ATTR_RO(max_link_width);
 
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index b29d3436ee9f..43075be79388 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5190,6 +5190,24 @@ enum pci_bus_speed pcie_get_speed_cap(struct pci_dev 
*dev)
return PCI_SPEED_UNKNOWN;
 }
 
+/**
+ * pcie_get_width_cap - query for the PCI device's link width capability
+ * @dev: PCI device to query
+ *
+ * Query the PCI device width capability.  Return the maximum link width
+ * supported by the device.
+ */
+enum pcie_link_width pcie_get_width_cap(struct pci_dev *dev)
+{
+   u32 lnkcap;
+
+   pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnkcap);
+   if (lnkcap)
+   return (lnkcap & PCI_EXP_LNKCAP_MLW) >> 4;
+
+   return PCIE_LNK_WIDTH_UNKNOWN;
+}
+
 /**
  * pci_select_bars - Make BAR mask from the type of resource
  * @dev: the PCI device for which BAR mask is made
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 1186d8be6055..66738f1050c0 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -262,6 +262,7 @@ void pci_disable_bridge_window(struct pci_dev *dev);
 "Unknown speed")
 
 enum pci_bus_speed pcie_get_speed_cap(struct pci_dev *dev);
+enum pcie_link_width pcie_get_width_cap(struct pci_dev *dev);
 
 /* Single Root I/O Virtualization */
 struct pci_sriov {



[PATCH v5 04/14] PCI: Add pcie_bandwidth_available() to compute bandwidth available to device

2018-03-30 Thread Bjorn Helgaas
From: Tal Gilboa <ta...@mellanox.com>

Add pcie_bandwidth_available() to compute the bandwidth available to a
device.  This may be limited by the device itself or by a slower upstream
link leading to the device.

The available bandwidth at each link along the path is computed as:

  link_speed * link_width * (1 - encoding_overhead)

The encoding overhead is about 20% for 2.5 and 5.0 GT/s links using 8b/10b
encoding, and about 1.5% for 8 GT/s or higher speed links using 128b/130b
encoding.

Also return the device with the slowest link and the speed and width of
that link.

Signed-off-by: Tal Gilboa <ta...@mellanox.com>
[bhelgaas: changelog, leave pcie_get_minimum_link() alone for now, return
bw directly, use pci_upstream_bridge(), check "next_bw <= bw" to find
uppermost limiting device, return speed/width of the limiting device]
Signed-off-by: Bjorn Helgaas <bhelg...@google.com>
---
 drivers/pci/pci.c   |   54 +++
 include/linux/pci.h |3 +++
 2 files changed, 57 insertions(+)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 9ce89e254197..e00d56b12747 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5146,6 +5146,60 @@ int pcie_get_minimum_link(struct pci_dev *dev, enum 
pci_bus_speed *speed,
 }
 EXPORT_SYMBOL(pcie_get_minimum_link);
 
+/**
+ * pcie_bandwidth_available - determine minimum link settings of a PCIe
+ *   device and its bandwidth limitation
+ * @dev: PCI device to query
+ * @limiting_dev: storage for device causing the bandwidth limitation
+ * @speed: storage for speed of limiting device
+ * @width: storage for width of limiting device
+ *
+ * Walk up the PCI device chain and find the point where the minimum
+ * bandwidth is available.  Return the bandwidth available there and (if
+ * limiting_dev, speed, and width pointers are supplied) information about
+ * that point.
+ */
+u32 pcie_bandwidth_available(struct pci_dev *dev, struct pci_dev 
**limiting_dev,
+enum pci_bus_speed *speed,
+enum pcie_link_width *width)
+{
+   u16 lnksta;
+   enum pci_bus_speed next_speed;
+   enum pcie_link_width next_width;
+   u32 bw, next_bw;
+
+   *speed = PCI_SPEED_UNKNOWN;
+   *width = PCIE_LNK_WIDTH_UNKNOWN;
+   bw = 0;
+
+   while (dev) {
+   pcie_capability_read_word(dev, PCI_EXP_LNKSTA, &lnksta);
+
+   next_speed = pcie_link_speed[lnksta & PCI_EXP_LNKSTA_CLS];
+   next_width = (lnksta & PCI_EXP_LNKSTA_NLW) >>
+   PCI_EXP_LNKSTA_NLW_SHIFT;
+
+   next_bw = next_width * PCIE_SPEED2MBS_ENC(next_speed);
+
+   /* Check if current device limits the total bandwidth */
+   if (!bw || next_bw <= bw) {
+   bw = next_bw;
+
+   if (limiting_dev)
+   *limiting_dev = dev;
+   if (speed)
+   *speed = next_speed;
+   if (width)
+   *width = next_width;
+   }
+
+   dev = pci_upstream_bridge(dev);
+   }
+
+   return bw;
+}
+EXPORT_SYMBOL(pcie_bandwidth_available);
+
 /**
  * pcie_get_speed_cap - query for the PCI device's link speed capability
  * @dev: PCI device to query
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 8043a5937ad0..f2bf2b7a66c7 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1083,6 +1083,9 @@ int pcie_get_mps(struct pci_dev *dev);
 int pcie_set_mps(struct pci_dev *dev, int mps);
 int pcie_get_minimum_link(struct pci_dev *dev, enum pci_bus_speed *speed,
  enum pcie_link_width *width);
+u32 pcie_bandwidth_available(struct pci_dev *dev, struct pci_dev 
**limiting_dev,
+enum pci_bus_speed *speed,
+enum pcie_link_width *width);
 void pcie_flr(struct pci_dev *dev);
 int __pci_reset_function_locked(struct pci_dev *dev);
 int pci_reset_function(struct pci_dev *dev);
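
For reference, a minimal sketch of a caller (variable names are mine;
"pdev" stands for whatever struct pci_dev the driver holds):

	struct pci_dev *limiting;
	enum pci_bus_speed speed;
	enum pcie_link_width width;
	u32 bw;

	bw = pcie_bandwidth_available(pdev, &limiting, &speed, &width);
	if (bw)
		pci_info(pdev, "%u Mb/s available, limited at %s\n",
			 bw, pci_name(limiting));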



[PATCH v5 07/14] net/mlx5: Report PCIe link properties with pcie_print_link_status()

2018-03-30 Thread Bjorn Helgaas
From: Tal Gilboa <ta...@mellanox.com>

Use pcie_print_link_status() to report PCIe link speed and possible
limitations.

Signed-off-by: Tal Gilboa <ta...@mellanox.com>
[bhelgaas: changelog]
Signed-off-by: Bjorn Helgaas <bhelg...@google.com>
Reviewed-by: Tariq Toukan <tar...@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/main.c |4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 2ef641c91c26..622f02d34aae 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -1043,6 +1043,10 @@ static int mlx5_load_one(struct mlx5_core_dev *dev, 
struct mlx5_priv *priv,
dev_info(>dev, "firmware version: %d.%d.%d\n", fw_rev_maj(dev),
 fw_rev_min(dev), fw_rev_sub(dev));
 
+   /* Only PFs hold the relevant PCIe information for this query */
+   if (mlx5_core_is_pf(dev))
+   pcie_print_link_status(dev->pdev);
+
/* on load removing any previous indication of internal error, device is
 * up
 */



[PATCH v5 08/14] net/mlx5e: Use pcie_bandwidth_available() to compute bandwidth

2018-03-30 Thread Bjorn Helgaas
From: Tal Gilboa <ta...@mellanox.com>

Use the new pcie_bandwidth_available() function to calculate the maximum
available bandwidth through the PCI chain instead of computing it ourselves
with mlx5e_get_pci_bw().

This is used to detect when the device is capable of more bandwidth than is
available in the current slot.  The driver may adjust compression settings
accordingly.

Note that pcie_bandwidth_available() accounts for PCIe encoding overhead, so
it is more accurate than mlx5e_get_pci_bw() was.
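
As a worked example (my arithmetic): on an 8 GT/s x8 link the old
helper reported 8000 * 8 = 64000 Mb/s, while pcie_bandwidth_available()
reports 7877 * 8 = 63016 Mb/s once 128b/130b encoding is accounted for.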

Signed-off-by: Tal Gilboa <ta...@mellanox.com>
[bhelgaas: remove mlx5e_get_pci_bw() wrapper altogether]
Signed-off-by: Bjorn Helgaas <bhelg...@google.com>
Reviewed-by: Tariq Toukan <tar...@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |   32 +
 1 file changed, 1 insertion(+), 31 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 47bab842c5ee..93291ec4a3d1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -3864,36 +3864,6 @@ void mlx5e_build_default_indir_rqt(u32 *indirection_rqt, 
int len,
indirection_rqt[i] = i % num_channels;
 }
 
-static int mlx5e_get_pci_bw(struct mlx5_core_dev *mdev, u32 *pci_bw)
-{
-   enum pcie_link_width width;
-   enum pci_bus_speed speed;
-   int err = 0;
-
-   err = pcie_get_minimum_link(mdev->pdev, , );
-   if (err)
-   return err;
-
-   if (speed == PCI_SPEED_UNKNOWN || width == PCIE_LNK_WIDTH_UNKNOWN)
-   return -EINVAL;
-
-   switch (speed) {
-   case PCIE_SPEED_2_5GT:
-   *pci_bw = 2500 * width;
-   break;
-   case PCIE_SPEED_5_0GT:
-   *pci_bw = 5000 * width;
-   break;
-   case PCIE_SPEED_8_0GT:
-   *pci_bw = 8000 * width;
-   break;
-   default:
-   return -EINVAL;
-   }
-
-   return 0;
-}
-
 static bool cqe_compress_heuristic(u32 link_speed, u32 pci_bw)
 {
return (link_speed && pci_bw &&
@@ -3979,7 +3949,7 @@ void mlx5e_build_nic_params(struct mlx5_core_dev *mdev,
params->num_tc   = 1;
 
mlx5e_get_max_linkspeed(mdev, _speed);
-   mlx5e_get_pci_bw(mdev, &pci_bw);
+   pci_bw = pcie_bandwidth_available(mdev->pdev, NULL, NULL, NULL);
mlx5_core_dbg(mdev, "Max link speed = %d, PCI BW = %d\n",
  link_speed, pci_bw);
 



[PATCH v5 06/14] net/mlx4_core: Report PCIe link properties with pcie_print_link_status()

2018-03-30 Thread Bjorn Helgaas
From: Tal Gilboa <ta...@mellanox.com>

Use pcie_print_link_status() to report PCIe link speed and possible
limitations instead of implementing this in the driver itself.

Signed-off-by: Tal Gilboa <ta...@mellanox.com>
Signed-off-by: Tariq Toukan <tar...@mellanox.com>
[bhelgaas: changelog]
Signed-off-by: Bjorn Helgaas <bhelg...@google.com>
---
 drivers/net/ethernet/mellanox/mlx4/main.c |   81 -
 1 file changed, 1 insertion(+), 80 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c 
b/drivers/net/ethernet/mellanox/mlx4/main.c
index 4d84cab77105..30cacac54e69 100644
--- a/drivers/net/ethernet/mellanox/mlx4/main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/main.c
@@ -623,85 +623,6 @@ static int mlx4_dev_cap(struct mlx4_dev *dev, struct 
mlx4_dev_cap *dev_cap)
return 0;
 }
 
-static int mlx4_get_pcie_dev_link_caps(struct mlx4_dev *dev,
-  enum pci_bus_speed *speed,
-  enum pcie_link_width *width)
-{
-   u32 lnkcap1, lnkcap2;
-   int err1, err2;
-
-#define  PCIE_MLW_CAP_SHIFT 4  /* start of MLW mask in link capabilities */
-
-   *speed = PCI_SPEED_UNKNOWN;
-   *width = PCIE_LNK_WIDTH_UNKNOWN;
-
-   err1 = pcie_capability_read_dword(dev->persist->pdev, PCI_EXP_LNKCAP,
- &lnkcap1);
-   err2 = pcie_capability_read_dword(dev->persist->pdev, PCI_EXP_LNKCAP2,
- &lnkcap2);
-   if (!err2 && lnkcap2) { /* PCIe r3.0-compliant */
-   if (lnkcap2 & PCI_EXP_LNKCAP2_SLS_8_0GB)
-   *speed = PCIE_SPEED_8_0GT;
-   else if (lnkcap2 & PCI_EXP_LNKCAP2_SLS_5_0GB)
-   *speed = PCIE_SPEED_5_0GT;
-   else if (lnkcap2 & PCI_EXP_LNKCAP2_SLS_2_5GB)
-   *speed = PCIE_SPEED_2_5GT;
-   }
-   if (!err1) {
-   *width = (lnkcap1 & PCI_EXP_LNKCAP_MLW) >> PCIE_MLW_CAP_SHIFT;
-   if (!lnkcap2) { /* pre-r3.0 */
-   if (lnkcap1 & PCI_EXP_LNKCAP_SLS_5_0GB)
-   *speed = PCIE_SPEED_5_0GT;
-   else if (lnkcap1 & PCI_EXP_LNKCAP_SLS_2_5GB)
-   *speed = PCIE_SPEED_2_5GT;
-   }
-   }
-
-   if (*speed == PCI_SPEED_UNKNOWN || *width == PCIE_LNK_WIDTH_UNKNOWN) {
-   return err1 ? err1 :
-   err2 ? err2 : -EINVAL;
-   }
-   return 0;
-}
-
-static void mlx4_check_pcie_caps(struct mlx4_dev *dev)
-{
-   enum pcie_link_width width, width_cap;
-   enum pci_bus_speed speed, speed_cap;
-   int err;
-
-#define PCIE_SPEED_STR(speed) \
-   (speed == PCIE_SPEED_8_0GT ? "8.0GT/s" : \
-speed == PCIE_SPEED_5_0GT ? "5.0GT/s" : \
-speed == PCIE_SPEED_2_5GT ? "2.5GT/s" : \
-"Unknown")
-
-   err = mlx4_get_pcie_dev_link_caps(dev, &speed_cap, &width_cap);
-   if (err) {
-   mlx4_warn(dev,
- "Unable to determine PCIe device BW capabilities\n");
-   return;
-   }
-
-   err = pcie_get_minimum_link(dev->persist->pdev, &speed, &width);
-   if (err || speed == PCI_SPEED_UNKNOWN ||
-   width == PCIE_LNK_WIDTH_UNKNOWN) {
-   mlx4_warn(dev,
- "Unable to determine PCI device chain minimum BW\n");
-   return;
-   }
-
-   if (width != width_cap || speed != speed_cap)
-   mlx4_warn(dev,
- "PCIe BW is different than device's capability\n");
-
-   mlx4_info(dev, "PCIe link speed is %s, device supports %s\n",
- PCIE_SPEED_STR(speed), PCIE_SPEED_STR(speed_cap));
-   mlx4_info(dev, "PCIe link width is x%d, device supports x%d\n",
- width, width_cap);
-   return;
-}
-
 /*The function checks if there are live vf, return the num of them*/
 static int mlx4_how_many_lives_vf(struct mlx4_dev *dev)
 {
@@ -3475,7 +3396,7 @@ static int mlx4_load_one(struct pci_dev *pdev, int 
pci_dev_data,
 * express device capabilities are under-satisfied by the bus.
 */
if (!mlx4_is_slave(dev))
-   mlx4_check_pcie_caps(dev);
+   pcie_print_link_status(dev->persist->pdev);
 
/* In master functions, the communication channel must be initialized
 * after obtaining its address from fw */



[PATCH v5 09/14] bnx2x: Report PCIe link properties with pcie_print_link_status()

2018-03-30 Thread Bjorn Helgaas
From: Bjorn Helgaas <bhelg...@google.com>

Use pcie_print_link_status() to report PCIe link speed and possible
limitations instead of implementing this in the driver itself.

Note that pcie_get_minimum_link() can return misleading information because
it finds the slowest link and the narrowest link without considering the
total bandwidth of the link.  If the path contains a 16 GT/s x1 link and a
2.5 GT/s x16 link, pcie_get_minimum_link() returns 2.5 GT/s x1, which
corresponds to 250 MB/s of bandwidth, not the actual available bandwidth of
about 2000 MB/s for a 16 GT/s x1 link.
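
To spell out that arithmetic (my numbers, using the per-lane rates from
PCIE_SPEED2MBS_ENC):

  16 GT/s x1:   1 lane   * ~15753 Mb/s = ~15753 Mb/s (~2000 MB/s)
  2.5 GT/s x16: 16 lanes *   2000 Mb/s =  32000 Mb/s ( 4000 MB/s)

Minimizing each dimension separately gives 2.5 GT/s x1, i.e. 2000 Mb/s
(250 MB/s), while the true bottleneck is min(15753, 32000) Mb/s.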

Signed-off-by: Bjorn Helgaas <bhelg...@google.com>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c |   23 ++
 1 file changed, 6 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index 74fc9af4aadb..c92601f1b0f3 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -13922,8 +13922,6 @@ static int bnx2x_init_one(struct pci_dev *pdev,
 {
struct net_device *dev = NULL;
struct bnx2x *bp;
-   enum pcie_link_width pcie_width;
-   enum pci_bus_speed pcie_speed;
int rc, max_non_def_sbs;
int rx_count, tx_count, rss_count, doorbell_size;
int max_cos_est;
@@ -14091,21 +14089,12 @@ static int bnx2x_init_one(struct pci_dev *pdev,
dev_addr_add(bp->dev, bp->fip_mac, NETDEV_HW_ADDR_T_SAN);
rtnl_unlock();
}
-   if (pcie_get_minimum_link(bp->pdev, &pcie_speed, &pcie_width) ||
-   pcie_speed == PCI_SPEED_UNKNOWN ||
-   pcie_width == PCIE_LNK_WIDTH_UNKNOWN)
-   BNX2X_DEV_INFO("Failed to determine PCI Express Bandwidth\n");
-   else
-   BNX2X_DEV_INFO(
-  "%s (%c%d) PCI-E x%d %s found at mem %lx, IRQ %d, node 
addr %pM\n",
-  board_info[ent->driver_data].name,
-  (CHIP_REV(bp) >> 12) + 'A', (CHIP_METAL(bp) >> 4),
-  pcie_width,
-  pcie_speed == PCIE_SPEED_2_5GT ? "2.5GHz" :
-  pcie_speed == PCIE_SPEED_5_0GT ? "5.0GHz" :
-  pcie_speed == PCIE_SPEED_8_0GT ? "8.0GHz" :
-  "Unknown",
-  dev->base_addr, bp->pdev->irq, dev->dev_addr);
+   BNX2X_DEV_INFO(
+  "%s (%c%d) PCI-E found at mem %lx, IRQ %d, node addr %pM\n",
+  board_info[ent->driver_data].name,
+  (CHIP_REV(bp) >> 12) + 'A', (CHIP_METAL(bp) >> 4),
+  dev->base_addr, bp->pdev->irq, dev->dev_addr);
+   pcie_print_link_status(bp->pdev);
 
bnx2x_register_phc(bp);
 



[PATCH v5 11/14] cxgb4: Report PCIe link properties with pcie_print_link_status()

2018-03-30 Thread Bjorn Helgaas
From: Bjorn Helgaas <bhelg...@google.com>

Use pcie_print_link_status() to report PCIe link speed and possible
limitations instead of implementing this in the driver itself.

Note that pcie_get_minimum_link() can return misleading information because
it finds the slowest link and the narrowest link without considering the
total bandwidth of the link.  If the path contains a 16 GT/s x1 link and a
2.5 GT/s x16 link, pcie_get_minimum_link() returns 2.5 GT/s x1, which
corresponds to 250 MB/s of bandwidth, not the actual available bandwidth of
about 2000 MB/s for a 16 GT/s x1 link.

Signed-off-by: Bjorn Helgaas <bhelg...@google.com>
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c |   75 ---
 1 file changed, 1 insertion(+), 74 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index 56bc626ef006..2d6864c8199e 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -4762,79 +4762,6 @@ static int init_rss(struct adapter *adap)
return 0;
 }
 
-static int cxgb4_get_pcie_dev_link_caps(struct adapter *adap,
-   enum pci_bus_speed *speed,
-   enum pcie_link_width *width)
-{
-   u32 lnkcap1, lnkcap2;
-   int err1, err2;
-
-#define  PCIE_MLW_CAP_SHIFT 4   /* start of MLW mask in link capabilities */
-
-   *speed = PCI_SPEED_UNKNOWN;
-   *width = PCIE_LNK_WIDTH_UNKNOWN;
-
-   err1 = pcie_capability_read_dword(adap->pdev, PCI_EXP_LNKCAP,
- &lnkcap1);
-   err2 = pcie_capability_read_dword(adap->pdev, PCI_EXP_LNKCAP2,
- &lnkcap2);
-   if (!err2 && lnkcap2) { /* PCIe r3.0-compliant */
-   if (lnkcap2 & PCI_EXP_LNKCAP2_SLS_8_0GB)
-   *speed = PCIE_SPEED_8_0GT;
-   else if (lnkcap2 & PCI_EXP_LNKCAP2_SLS_5_0GB)
-   *speed = PCIE_SPEED_5_0GT;
-   else if (lnkcap2 & PCI_EXP_LNKCAP2_SLS_2_5GB)
-   *speed = PCIE_SPEED_2_5GT;
-   }
-   if (!err1) {
-   *width = (lnkcap1 & PCI_EXP_LNKCAP_MLW) >> PCIE_MLW_CAP_SHIFT;
-   if (!lnkcap2) { /* pre-r3.0 */
-   if (lnkcap1 & PCI_EXP_LNKCAP_SLS_5_0GB)
-   *speed = PCIE_SPEED_5_0GT;
-   else if (lnkcap1 & PCI_EXP_LNKCAP_SLS_2_5GB)
-   *speed = PCIE_SPEED_2_5GT;
-   }
-   }
-
-   if (*speed == PCI_SPEED_UNKNOWN || *width == PCIE_LNK_WIDTH_UNKNOWN)
-   return err1 ? err1 : err2 ? err2 : -EINVAL;
-   return 0;
-}
-
-static void cxgb4_check_pcie_caps(struct adapter *adap)
-{
-   enum pcie_link_width width, width_cap;
-   enum pci_bus_speed speed, speed_cap;
-
-#define PCIE_SPEED_STR(speed) \
-   (speed == PCIE_SPEED_8_0GT ? "8.0GT/s" : \
-speed == PCIE_SPEED_5_0GT ? "5.0GT/s" : \
-speed == PCIE_SPEED_2_5GT ? "2.5GT/s" : \
-"Unknown")
-
-   if (cxgb4_get_pcie_dev_link_caps(adap, &speed_cap, &width_cap)) {
-   dev_warn(adap->pdev_dev,
-"Unable to determine PCIe device BW capabilities\n");
-   return;
-   }
-
-   if (pcie_get_minimum_link(adap->pdev, &speed, &width) ||
-   speed == PCI_SPEED_UNKNOWN || width == PCIE_LNK_WIDTH_UNKNOWN) {
-   dev_warn(adap->pdev_dev,
-"Unable to determine PCI Express bandwidth.\n");
-   return;
-   }
-
-   dev_info(adap->pdev_dev, "PCIe link speed is %s, device supports %s\n",
-PCIE_SPEED_STR(speed), PCIE_SPEED_STR(speed_cap));
-   dev_info(adap->pdev_dev, "PCIe link width is x%d, device supports 
x%d\n",
-width, width_cap);
-   if (speed < speed_cap || width < width_cap)
-   dev_info(adap->pdev_dev,
-"A slot with more lanes and/or higher speed is "
-"suggested for optimal performance.\n");
-}
-
 /* Dump basic information about the adapter */
 static void print_adapter_info(struct adapter *adapter)
 {
@@ -5466,7 +5393,7 @@ static int init_one(struct pci_dev *pdev, const struct 
pci_device_id *ent)
}
 
/* check for PCI Express bandwidth capabiltites */
-   cxgb4_check_pcie_caps(adapter);
+   pcie_print_link_status(pdev);
 
err = init_rss(adapter);
if (err)



[PATCH v5 13/14] ixgbe: Report PCIe link properties with pcie_print_link_status()

2018-03-30 Thread Bjorn Helgaas
From: Bjorn Helgaas <bhelg...@google.com>

Use pcie_print_link_status() to report PCIe link speed and possible
limitations instead of implementing this in the driver itself.

Note that pcie_get_minimum_link() can return misleading information because
it finds the slowest link and the narrowest link without considering the
total bandwidth of the link.  If the path contains a 16 GT/s x1 link and a
2.5 GT/s x16 link, pcie_get_minimum_link() returns 2.5 GT/s x1, which
corresponds to 250 MB/s of bandwidth, not the actual available bandwidth of
about 2000 MB/s for a 16 GT/s x1 link.

Signed-off-by: Bjorn Helgaas <bhelg...@google.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   47 +
 1 file changed, 1 insertion(+), 46 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 0da5aa2c8aba..38bb9c17d333 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -270,9 +270,6 @@ static void ixgbe_check_minimum_link(struct ixgbe_adapter 
*adapter,
 int expected_gts)
 {
	struct ixgbe_hw *hw = &adapter->hw;
-   int max_gts = 0;
-   enum pci_bus_speed speed = PCI_SPEED_UNKNOWN;
-   enum pcie_link_width width = PCIE_LNK_WIDTH_UNKNOWN;
struct pci_dev *pdev;
 
/* Some devices are not connected over PCIe and thus do not negotiate
@@ -288,49 +285,7 @@ static void ixgbe_check_minimum_link(struct ixgbe_adapter 
*adapter,
else
pdev = adapter->pdev;
 
-   if (pcie_get_minimum_link(pdev, &speed, &width) ||
-   speed == PCI_SPEED_UNKNOWN || width == PCIE_LNK_WIDTH_UNKNOWN) {
-   e_dev_warn("Unable to determine PCI Express bandwidth.\n");
-   return;
-   }
-
-   switch (speed) {
-   case PCIE_SPEED_2_5GT:
-   /* 8b/10b encoding reduces max throughput by 20% */
-   max_gts = 2 * width;
-   break;
-   case PCIE_SPEED_5_0GT:
-   /* 8b/10b encoding reduces max throughput by 20% */
-   max_gts = 4 * width;
-   break;
-   case PCIE_SPEED_8_0GT:
-   /* 128b/130b encoding reduces throughput by less than 2% */
-   max_gts = 8 * width;
-   break;
-   default:
-   e_dev_warn("Unable to determine PCI Express bandwidth.\n");
-   return;
-   }
-
-   e_dev_info("PCI Express bandwidth of %dGT/s available\n",
-  max_gts);
-   e_dev_info("(Speed:%s, Width: x%d, Encoding Loss:%s)\n",
-  (speed == PCIE_SPEED_8_0GT ? "8.0GT/s" :
-   speed == PCIE_SPEED_5_0GT ? "5.0GT/s" :
-   speed == PCIE_SPEED_2_5GT ? "2.5GT/s" :
-   "Unknown"),
-  width,
-  (speed == PCIE_SPEED_2_5GT ? "20%" :
-   speed == PCIE_SPEED_5_0GT ? "20%" :
-   speed == PCIE_SPEED_8_0GT ? "<2%" :
-   "Unknown"));
-
-   if (max_gts < expected_gts) {
-   e_dev_warn("This is not sufficient for optimal performance of 
this card.\n");
-   e_dev_warn("For optimal performance, at least %dGT/s of 
bandwidth is required.\n",
-   expected_gts);
-   e_dev_warn("A slot with more lanes and/or higher speed is 
suggested.\n");
-   }
+   pcie_print_link_status(pdev);
 }
 
 static void ixgbe_service_event_schedule(struct ixgbe_adapter *adapter)



[PATCH v5 10/14] bnxt_en: Report PCIe link properties with pcie_print_link_status()

2018-03-30 Thread Bjorn Helgaas
From: Bjorn Helgaas <bhelg...@google.com>

Use pcie_print_link_status() to report PCIe link speed and possible
limitations instead of implementing this in the driver itself.

Note that pcie_get_minimum_link() can return misleading information because
it finds the slowest link and the narrowest link without considering the
total bandwidth of the link.  If the path contains a 16 GT/s x1 link and a
2.5 GT/s x16 link, pcie_get_minimum_link() returns 2.5 GT/s x1, which
corresponds to 250 MB/s of bandwidth, not the actual available bandwidth of
about 2000 MB/s for a 16 GT/s x1 link.

Signed-off-by: Bjorn Helgaas <bhelg...@google.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c |   19 +--
 1 file changed, 1 insertion(+), 18 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 1500243b9886..3be42431e029 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -8469,22 +8469,6 @@ static int bnxt_init_mac_addr(struct bnxt *bp)
return rc;
 }
 
-static void bnxt_parse_log_pcie_link(struct bnxt *bp)
-{
-   enum pcie_link_width width = PCIE_LNK_WIDTH_UNKNOWN;
-   enum pci_bus_speed speed = PCI_SPEED_UNKNOWN;
-
-   if (pcie_get_minimum_link(pci_physfn(bp->pdev), &speed, &width) ||
-   speed == PCI_SPEED_UNKNOWN || width == PCIE_LNK_WIDTH_UNKNOWN)
-   netdev_info(bp->dev, "Failed to determine PCIe Link Info\n");
-   else
-   netdev_info(bp->dev, "PCIe: Speed %s Width x%d\n",
-   speed == PCIE_SPEED_2_5GT ? "2.5GT/s" :
-   speed == PCIE_SPEED_5_0GT ? "5.0GT/s" :
-   speed == PCIE_SPEED_8_0GT ? "8.0GT/s" :
-   "Unknown", width);
-}
-
 static int bnxt_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 {
static int version_printed;
@@ -8694,8 +8678,7 @@ static int bnxt_init_one(struct pci_dev *pdev, const 
struct pci_device_id *ent)
netdev_info(dev, "%s found at mem %lx, node addr %pM\n",
board_info[ent->driver_data].name,
(long)pci_resource_start(pdev, 0), dev->dev_addr);
-
-   bnxt_parse_log_pcie_link(bp);
+   pcie_print_link_status(pdev);
 
return 0;
 



[PATCH v5 12/14] fm10k: Report PCIe link properties with pcie_print_link_status()

2018-03-30 Thread Bjorn Helgaas
From: Bjorn Helgaas <bhelg...@google.com>

Use pcie_print_link_status() to report PCIe link speed and possible
limitations instead of implementing this in the driver itself.

Note that pcie_get_minimum_link() can return misleading information because
it finds the slowest link and the narrowest link without considering the
total bandwidth of the link.  If the path contains a 16 GT/s x1 link and a
2.5 GT/s x16 link, pcie_get_minimum_link() returns 2.5 GT/s x1, which
corresponds to 250 MB/s of bandwidth, not the actual available bandwidth of
about 2000 MB/s for a 16 GT/s x1 link.

Signed-off-by: Bjorn Helgaas <bhelg...@google.com>
---
 drivers/net/ethernet/intel/fm10k/fm10k_pci.c |   87 --
 1 file changed, 1 insertion(+), 86 deletions(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c 
b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
index a434fecfdfeb..aa05fb534942 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
@@ -2120,91 +2120,6 @@ static int fm10k_sw_init(struct fm10k_intfc *interface,
return 0;
 }
 
-static void fm10k_slot_warn(struct fm10k_intfc *interface)
-{
-   enum pcie_link_width width = PCIE_LNK_WIDTH_UNKNOWN;
-   enum pci_bus_speed speed = PCI_SPEED_UNKNOWN;
-   struct fm10k_hw *hw = &interface->hw;
-   int max_gts = 0, expected_gts = 0;
-
-   if (pcie_get_minimum_link(interface->pdev, &speed, &width) ||
-   speed == PCI_SPEED_UNKNOWN || width == PCIE_LNK_WIDTH_UNKNOWN) {
-   dev_warn(&interface->pdev->dev,
-"Unable to determine PCI Express bandwidth.\n");
-   return;
-   }
-
-   switch (speed) {
-   case PCIE_SPEED_2_5GT:
-   /* 8b/10b encoding reduces max throughput by 20% */
-   max_gts = 2 * width;
-   break;
-   case PCIE_SPEED_5_0GT:
-   /* 8b/10b encoding reduces max throughput by 20% */
-   max_gts = 4 * width;
-   break;
-   case PCIE_SPEED_8_0GT:
-   /* 128b/130b encoding has less than 2% impact on throughput */
-   max_gts = 8 * width;
-   break;
-   default:
-   dev_warn(&interface->pdev->dev,
-"Unable to determine PCI Express bandwidth.\n");
-   return;
-   }
-
-   dev_info(&interface->pdev->dev,
-"PCI Express bandwidth of %dGT/s available\n",
-max_gts);
-   dev_info(&interface->pdev->dev,
-"(Speed:%s, Width: x%d, Encoding Loss:%s, Payload:%s)\n",
-(speed == PCIE_SPEED_8_0GT ? "8.0GT/s" :
- speed == PCIE_SPEED_5_0GT ? "5.0GT/s" :
- speed == PCIE_SPEED_2_5GT ? "2.5GT/s" :
- "Unknown"),
-hw->bus.width,
-(speed == PCIE_SPEED_2_5GT ? "20%" :
- speed == PCIE_SPEED_5_0GT ? "20%" :
- speed == PCIE_SPEED_8_0GT ? "<2%" :
- "Unknown"),
-(hw->bus.payload == fm10k_bus_payload_128 ? "128B" :
- hw->bus.payload == fm10k_bus_payload_256 ? "256B" :
- hw->bus.payload == fm10k_bus_payload_512 ? "512B" :
- "Unknown"));
-
-   switch (hw->bus_caps.speed) {
-   case fm10k_bus_speed_2500:
-   /* 8b/10b encoding reduces max throughput by 20% */
-   expected_gts = 2 * hw->bus_caps.width;
-   break;
-   case fm10k_bus_speed_5000:
-   /* 8b/10b encoding reduces max throughput by 20% */
-   expected_gts = 4 * hw->bus_caps.width;
-   break;
-   case fm10k_bus_speed_8000:
-   /* 128b/130b encoding has less than 2% impact on throughput */
-   expected_gts = 8 * hw->bus_caps.width;
-   break;
-   default:
-   dev_warn(&interface->pdev->dev,
-"Unable to determine expected PCI Express 
bandwidth.\n");
-   return;
-   }
-
-   if (max_gts >= expected_gts)
-   return;
-
-   dev_warn(&interface->pdev->dev,
-"This device requires %dGT/s of bandwidth for optimal 
performance.\n",
-expected_gts);
-   dev_warn(&interface->pdev->dev,
-"A %sslot with x%d lanes is suggested.\n",
-(hw->bus_caps.speed == fm10k_bus_speed_2500 ? "2.5GT/s " :
- hw->bus_caps.speed == fm10k_bus_speed_5000 ? "5.0GT/s " :
- hw->bus_caps.speed == fm10k_bus_speed_8000 ? "8.0GT/s " : ""),
-hw->bus_caps.width);
-}
-
 /**
  * fm10k_probe - Device Initialization Routine
  * @pdev: PCI device information struct

[PATCH v5 14/14] PCI: Remove unused pcie_get_minimum_link()

2018-03-30 Thread Bjorn Helgaas
From: Bjorn Helgaas <bhelg...@google.com>

In some cases pcie_get_minimum_link() returned misleading information
because it found the slowest link and the narrowest link without
considering the total bandwidth of the link.  For example, if the path
contained a 16 GT/s x1 link and a 2.5 GT/s x16 link,
pcie_get_minimum_link() returned 2.5 GT/s x1, which corresponds to 250 MB/s
of bandwidth, not the actual available bandwidth of about 2000 MB/s for a
16 GT/s x1 link.

Callers should use pcie_print_link_status() instead, or
pcie_bandwidth_available() if they need more detailed information.

Signed-off-by: Bjorn Helgaas <bhelg...@google.com>
---
 drivers/pci/pci.c   |   43 ---
 include/linux/pci.h |2 --
 2 files changed, 45 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index cec7aed09f6b..b6951c44ae6c 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5103,49 +5103,6 @@ int pcie_set_mps(struct pci_dev *dev, int mps)
 }
 EXPORT_SYMBOL(pcie_set_mps);
 
-/**
- * pcie_get_minimum_link - determine minimum link settings of a PCI device
- * @dev: PCI device to query
- * @speed: storage for minimum speed
- * @width: storage for minimum width
- *
- * This function will walk up the PCI device chain and determine the minimum
- * link width and speed of the device.
- */
-int pcie_get_minimum_link(struct pci_dev *dev, enum pci_bus_speed *speed,
- enum pcie_link_width *width)
-{
-   int ret;
-
-   *speed = PCI_SPEED_UNKNOWN;
-   *width = PCIE_LNK_WIDTH_UNKNOWN;
-
-   while (dev) {
-   u16 lnksta;
-   enum pci_bus_speed next_speed;
-   enum pcie_link_width next_width;
-
-   ret = pcie_capability_read_word(dev, PCI_EXP_LNKSTA, &lnksta);
-   if (ret)
-   return ret;
-
-   next_speed = pcie_link_speed[lnksta & PCI_EXP_LNKSTA_CLS];
-   next_width = (lnksta & PCI_EXP_LNKSTA_NLW) >>
-   PCI_EXP_LNKSTA_NLW_SHIFT;
-
-   if (next_speed < *speed)
-   *speed = next_speed;
-
-   if (next_width < *width)
-   *width = next_width;
-
-   dev = dev->bus->self;
-   }
-
-   return 0;
-}
-EXPORT_SYMBOL(pcie_get_minimum_link);
-
 /**
  * pcie_bandwidth_available - determine minimum link settings of a PCIe
  *   device and its bandwidth limitation
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 38f7957121ef..5ccee29fe1b1 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1081,8 +1081,6 @@ int pcie_get_readrq(struct pci_dev *dev);
 int pcie_set_readrq(struct pci_dev *dev, int rq);
 int pcie_get_mps(struct pci_dev *dev);
 int pcie_set_mps(struct pci_dev *dev, int mps);
-int pcie_get_minimum_link(struct pci_dev *dev, enum pci_bus_speed *speed,
- enum pcie_link_width *width);
 u32 pcie_bandwidth_available(struct pci_dev *dev, struct pci_dev 
**limiting_dev,
 enum pci_bus_speed *speed,
 enum pcie_link_width *width);



[PATCH v5 05/14] PCI: Add pcie_print_link_status() to log link speed and whether it's limited

2018-03-30 Thread Bjorn Helgaas
From: Tal Gilboa <ta...@mellanox.com>

Add pcie_print_link_status().  This logs the current settings of the link
(speed, width, and total available bandwidth).

If the device is capable of more bandwidth but is limited by a slower
upstream link, we include information about the link that limits the
device's performance.

The user may be able to move the device to a different slot for better
performance.

This provides a unified method for all PCI devices to report status and
issues, instead of each device reporting in a different way, using
different code.
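
For illustration, with the format strings added below the output looks
like this (device names and numbers here are hypothetical):

  mlx5_core 0000:01:00.0: 63016 Mb/s available bandwidth (8 GT/s x8 link)
  mlx5_core 0000:02:00.0: 32000 Mb/s available bandwidth, limited by
      5 GT/s x8 link at 0000:00:01.0 (capable of 63016 Mb/s with
      8 GT/s x8 link)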

Signed-off-by: Tal Gilboa <ta...@mellanox.com>
[bhelgaas: changelog, reword log messages, print device capabilities when
not limited]
Signed-off-by: Bjorn Helgaas <bhelg...@google.com>
---
 drivers/pci/pci.c   |   29 +
 include/linux/pci.h |1 +
 2 files changed, 30 insertions(+)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index e00d56b12747..cec7aed09f6b 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5283,6 +5283,35 @@ u32 pcie_bandwidth_capable(struct pci_dev *dev, enum 
pci_bus_speed *speed,
return *width * PCIE_SPEED2MBS_ENC(*speed);
 }
 
+/**
+ * pcie_print_link_status - Report the PCI device's link speed and width
+ * @dev: PCI device to query
+ *
+ * Report the available bandwidth at the device.  If this is less than the
+ * device is capable of, report the device's maximum possible bandwidth and
+ * the upstream link that limits its performance to less than that.
+ */
+void pcie_print_link_status(struct pci_dev *dev)
+{
+   enum pcie_link_width width, width_cap;
+   enum pci_bus_speed speed, speed_cap;
+   struct pci_dev *limiting_dev = NULL;
+   u32 bw_avail, bw_cap;
+
+   bw_cap = pcie_bandwidth_capable(dev, &speed_cap, &width_cap);
+   bw_avail = pcie_bandwidth_available(dev, &limiting_dev, &speed, &width);
+
+   if (bw_avail >= bw_cap)
+   pci_info(dev, "%d Mb/s available bandwidth (%s x%d link)\n",
+bw_cap, PCIE_SPEED2STR(speed_cap), width_cap);
+   else
+   pci_info(dev, "%d Mb/s available bandwidth, limited by %s x%d 
link at %s (capable of %d Mb/s with %s x%d link)\n",
+bw_avail, PCIE_SPEED2STR(speed), width,
+limiting_dev ? pci_name(limiting_dev) : "<unknown>",
+bw_cap, PCIE_SPEED2STR(speed_cap), width_cap);
+}
+EXPORT_SYMBOL(pcie_print_link_status);
+
 /**
  * pci_select_bars - Make BAR mask from the type of resource
  * @dev: the PCI device for which BAR mask is made
diff --git a/include/linux/pci.h b/include/linux/pci.h
index f2bf2b7a66c7..38f7957121ef 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1086,6 +1086,7 @@ int pcie_get_minimum_link(struct pci_dev *dev, enum 
pci_bus_speed *speed,
 u32 pcie_bandwidth_available(struct pci_dev *dev, struct pci_dev 
**limiting_dev,
 enum pci_bus_speed *speed,
 enum pcie_link_width *width);
+void pcie_print_link_status(struct pci_dev *dev);
 void pcie_flr(struct pci_dev *dev);
 int __pci_reset_function_locked(struct pci_dev *dev);
 int pci_reset_function(struct pci_dev *dev);



Re: [REGRESSION, bisect] pci: cxgb4 probe fails after commit 104daa71b3961434 ("PCI: Determine actual VPD size on first access")

2018-02-12 Thread Bjorn Helgaas
On Tue, Jan 23, 2018 at 05:59:09PM +0530, Arjun Vynipadath wrote:
> Sending on behalf of "Casey Leedom "
> 
> Way back on April 11, 2016 we reported a regression in Linux kernel 4.6-rc2 
> brought on by kernel.org commit 104daa71b396. This commit calculates the 
> size of a PCI Device's VPD area by parsing the VPD Structure at offset 0x000, 
> and restricts accesses to the VPD to that computed size.
> 
> Our devices have a second VPD structure which is located starting at offset 
> 0x400 which is the "real" VPD[1].  The 104daa71b396 commit (plus a follow on 
> commit 408641e93aa5) caused efforts to read past the end of that computed 
> length of the VPD to return silently without error leaving stack junk in the 
> VPD read buffers.
> 
> We introduced kernel.org commit cb92148b to allow a driver to tell the 
> kernel how large the VPD area really is, introducing a new API 
> pci_set_vpd_size() for this purpose.
> 
> Now we've discovered a new subtlety to the problem.
> 
> We have a KVM Hypervisor running a 4.9.70 kernel.  So it has all of the 
> above commits.  When we attach our Physical Function 4 to a Virtual Machine 
> and attempt to run cxgb4 in that VM, we see the problem again.  The issue is 
> that all of the VM Guest OS's efforts to access the PCIe VPD Capability are 
> trapped into the KVM 4.9.70 kernel and executed there, with the results 
> routed back to the VM Guest OS.  The cxgb4 driver in the VM Guest OS uses 
> the new pci_set_vpd_size() to notify the OS of the true size of the VPD, but 
> that information of course is never sent to the KVM 4.9.70 Hypervisor. 
> (And, truth be told, if the Guest OS were older than 4.6, it wouldn't even 
> know that it needed to do this.)  The result is that again we get silent VPD 
> read failures with random stack garbage in the VPD read buffers. (sigh) 

Let me pull out one tiny piece of this problem: If the VPD read
returns failure, the caller should not look at the read buffer.  But
we should *never* copy random stack garbage into the read buffer, no
matter what the VPD read returns.

I guess it's the 4.9.70 kernel that's putting garbage into the VPD
read buffer?  Is this something that needs to be fixed in the current
upstream kernel?
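
For context, the guest-side workaround looks roughly like this (a
sketch; the 0x800 length covering both VPD structures is illustrative,
not necessarily the driver's actual value):

	/* Tell the PCI core how large the VPD really is, so reads
	 * beyond the computed size are not rejected. */
	err = pci_set_vpd_size(pdev, 0x800);
	if (err)
		dev_warn(&pdev->dev, "unable to set VPD size\n");

As the thread notes, this only helps when the kernel that actually
performs the config accesses sees the call, which is not the case when
a guest's VPD accesses are trapped and executed by the hypervisor.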


Re: remove pci_dma_* abuses and workarounds V2

2018-01-17 Thread Bjorn Helgaas
[+cc David]

On Wed, Jan 10, 2018 at 07:03:18PM +0100, Christoph Hellwig wrote:
> Back before the dawn of time pci_dma_* with a NULL pci_dev argument
> was used for all kinds of things, e.g. dma mapping for non-PCI
> devices.  All this has been long removed, but it turns out we
> still care for a NULL pci_dev in the wrappers, and we still have
> two odd USB drivers that use pci_dma_alloc_consistent for allocating
> memory while ignoring the dma_addr_t entirely, and a network driver
> mixing the already wrong usage of dma_* with a NULL device with a
> single call to pci_free_consistent.
> 
> This series switches the two usb drivers to use plain kzalloc, the
> net driver to properly use the dma API and then removes the handling
> of the NULL pci_dev in the pci_dma_* wrappers.
> 
> Changes since V1:
>  - remove allocation failure printks
>  - use kcalloc
>  - fix tsi108_eth
>  - improve changelogs

Applied to pci/dma for v4.16, thanks!


Re: [PATCH 3/4] tsi108_eth: use dma API properly

2018-01-17 Thread Bjorn Helgaas
[+cc David, FYI, I plan to merge this via PCI along with the rest of
Christoph's series]

On Wed, Jan 10, 2018 at 07:03:21PM +0100, Christoph Hellwig wrote:
> We need to pass a struct device to the dma API, even if some
> architectures still support that for legacy reasons, and should not mix
> it with the old PCI dma API.
> 
> Note that the driver also seems to never actually unmap its dma mappings,
> but to fix that we'll need someone more familar with the driver.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/net/ethernet/tundra/tsi108_eth.c | 36 
> ++--
>  1 file changed, 20 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/net/ethernet/tundra/tsi108_eth.c 
> b/drivers/net/ethernet/tundra/tsi108_eth.c
> index 0624b71ab5d4..edcd1e60b30d 100644
> --- a/drivers/net/ethernet/tundra/tsi108_eth.c
> +++ b/drivers/net/ethernet/tundra/tsi108_eth.c
> @@ -152,6 +152,8 @@ struct tsi108_prv_data {
>   u32 msg_enable; /* debug message level */
>   struct mii_if_info mii_if;
>   unsigned int init_media;
> +
> + struct platform_device *pdev;
>  };
>  
>  /* Structure for a device driver */
> @@ -703,17 +705,18 @@ static int tsi108_send_packet(struct sk_buff * skb, 
> struct net_device *dev)
>   data->txskbs[tx] = skb;
>  
>   if (i == 0) {
> - data->txring[tx].buf0 = dma_map_single(NULL, skb->data,
> - skb_headlen(skb), DMA_TO_DEVICE);
> + data->txring[tx].buf0 = dma_map_single(&data->pdev->dev,
> + skb->data, skb_headlen(skb),
> + DMA_TO_DEVICE);
>   data->txring[tx].len = skb_headlen(skb);
>   misc |= TSI108_TX_SOF;
>   } else {
> - const skb_frag_t *frag = &skb_shinfo(skb)->frags[i - 1];
>  
> - data->txring[tx].buf0 = skb_frag_dma_map(NULL, frag,
> -  0,
> -  
> skb_frag_size(frag),
> -  DMA_TO_DEVICE);
> + data->txring[tx].buf0 =
> + skb_frag_dma_map(&data->pdev->dev, frag,
> + 0, skb_frag_size(frag),
> + DMA_TO_DEVICE);
>   data->txring[tx].len = skb_frag_size(frag);
>   }
>  
> @@ -808,9 +811,9 @@ static int tsi108_refill_rx(struct net_device *dev, int 
> budget)
>   if (!skb)
>   break;
>  
> - data->rxring[rx].buf0 = dma_map_single(NULL, skb->data,
> - TSI108_RX_SKB_SIZE,
> - DMA_FROM_DEVICE);
> + data->rxring[rx].buf0 = dma_map_single(&data->pdev->dev,
> + skb->data, TSI108_RX_SKB_SIZE,
> + DMA_FROM_DEVICE);
>  
>   /* Sometimes the hardware sets blen to zero after packet
>* reception, even though the manual says that it's only ever
> @@ -1308,15 +1311,15 @@ static int tsi108_open(struct net_device *dev)
>  data->id, dev->irq, dev->name);
>   }
>  
> - data->rxring = dma_zalloc_coherent(NULL, rxring_size, &data->rxdma,
> -GFP_KERNEL);
> + data->rxring = dma_zalloc_coherent(&data->pdev->dev, rxring_size,
> + &data->rxdma, GFP_KERNEL);
>   if (!data->rxring)
>   return -ENOMEM;
>  
> - data->txring = dma_zalloc_coherent(NULL, txring_size, &data->txdma,
> -GFP_KERNEL);
> + data->txring = dma_zalloc_coherent(&data->pdev->dev, txring_size,
> + &data->txdma, GFP_KERNEL);
>   if (!data->txring) {
> - pci_free_consistent(NULL, rxring_size, data->rxring,
> + dma_free_coherent(&data->pdev->dev, rxring_size, data->rxring,
>   data->rxdma);
>   return -ENOMEM;
>   }
> @@ -1428,10 +1431,10 @@ static int tsi108_close(struct net_device *dev)
>   dev_kfree_skb(skb);
>   }
>  
> - dma_free_coherent(0,
> + dma_free_coherent(&data->pdev->dev,
>   TSI108_RXRING_LEN * sizeof(rx_desc),
>   data->rxring, data->rxdma);
> - dma_free_coherent(0,
> + dma_free_coherent(&data->pdev->dev,
>   TSI108_TXRING_LEN * sizeof(tx_desc),
>   data->txring, data->txdma);
>  
> @@ -1576,6 +1579,7 @@ tsi108_init_one(struct platform_device *pdev)
>   printk("tsi108_eth%d: probe...\n", pdev->id);
>   data = netdev_priv(dev);
>   data->dev = dev;
> + data->pdev = pdev;
>  
>   
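
On the unmapping gap noted above: the teardown paths would need
something like the following for each still-mapped buffer (a sketch
only; the fields and sizes follow the driver's existing ring usage):

	dma_unmap_single(&data->pdev->dev, data->rxring[rx].buf0,
			 TSI108_RX_SKB_SIZE, DMA_FROM_DEVICE);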

Re: [PATCH v17 0/4] Replace PCI pool by DMA pool API

2018-01-03 Thread Bjorn Helgaas
On Tue, Jan 02, 2018 at 04:17:24PM -0600, Bjorn Helgaas wrote:
> On Tue, Jan 02, 2018 at 06:53:52PM +0100, Romain Perier wrote:
> > The current PCI pool API are simple macro functions direct expanded to
> > the appropriate dma pool functions. The prototypes are almost the same
> > and semantically, they are very similar. I propose to use the DMA pool
> > API directly and get rid of the old API.
> > 
> > This set of patches, replaces the old API by the dma pool API
> > and remove the defines.
> > ...
> 
> > Romain Perier (4):
> >   block: DAC960: Replace PCI pool old API
> >   net: e100: Replace PCI pool old API
> >   hinic: Replace PCI pool old API
> >   PCI: Remove PCI pool macro functions
> > 
> >  drivers/block/DAC960.c| 38 
> > +++
> >  drivers/block/DAC960.h|  4 +--
> >  drivers/net/ethernet/huawei/hinic/hinic_hw_cmdq.c | 10 +++---
> >  drivers/net/ethernet/huawei/hinic/hinic_hw_cmdq.h |  2 +-
> >  drivers/net/ethernet/intel/e100.c | 12 +++
> >  include/linux/pci.h   |  9 --
> >  6 files changed, 32 insertions(+), 43 deletions(-)
> 
> Applied to pci/misc for v4.16, thanks!

Oops, my mistake.  I can't remove the macros themselves yet because
some of the uses were removed by patches that were applied via other
trees, and those patches are not in my tree.

To avoid ordering dependencies during the merge window, I dropped the
"PCI: Remove PCI pool macro functions" patch.  Please repost that
after all the removals have made it into Linus' tree.

Bjorn
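
The mechanical shape of the conversion, for reference (a sketch with
illustrative names; the old calls were thin macros over the new ones):

	/* old PCI pool API */
	pool = pci_pool_create("foo", pdev, size, align, 0);
	buf = pci_pool_alloc(pool, GFP_KERNEL, &dma);

	/* DMA pool API used directly */
	pool = dma_pool_create("foo", &pdev->dev, size, align, 0);
	buf = dma_pool_alloc(pool, GFP_KERNEL, &dma);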



[PATCH] qed: Remove unused QED_RDMA_DEV_CAP_* symbols and dev->dev_caps

2017-12-15 Thread Bjorn Helgaas
From: Bjorn Helgaas <bhelg...@google.com>

The QED_RDMA_DEV_CAP_* symbols are only used to set bits in dev->dev_caps.
Nobody ever looks at those bits.  Remove the symbols and dev_caps itself.

Note that if these are ever used and added back, it looks incorrect to set
QED_RDMA_DEV_CAP_ATOMIC_OP based on PCI_EXP_DEVCTL2_LTR_EN.  LTR is the
Latency Tolerance Reporting mechanism, which has nothing to do with Atomic
Ops.
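
If the capability is ever reinstated, the AtomicOp completer bits in
Device Capabilities 2 would be the thing to test, roughly like this (a
sketch; it assumes the standard PCI_EXP_DEVCAP2_ATOMIC_* defines):

	u32 devcap2;

	pcie_capability_read_dword(cdev->pdev, PCI_EXP_DEVCAP2, &devcap2);
	if (devcap2 & PCI_EXP_DEVCAP2_ATOMIC_COMP64)
		SET_FIELD(dev->dev_caps, QED_RDMA_DEV_CAP_ATOMIC_OP, 1);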

No functional change intended.

Signed-off-by: Bjorn Helgaas <bhelg...@google.com>
---
 drivers/net/ethernet/qlogic/qed/qed_rdma.c |   20 --
 include/linux/qed/qed_rdma_if.h|   55 +---
 2 files changed, 1 insertion(+), 74 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_rdma.c 
b/drivers/net/ethernet/qlogic/qed/qed_rdma.c
index c8c4b3940564..1091b6aae0c6 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_rdma.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_rdma.c
@@ -394,7 +394,6 @@ static void qed_rdma_init_devinfo(struct qed_hwfn *p_hwfn,
 {
struct qed_rdma_device *dev = p_hwfn->p_rdma_info->dev;
struct qed_dev *cdev = p_hwfn->cdev;
-   u32 pci_status_control;
u32 num_qps;
 
/* Vendor specific information */
@@ -468,25 +467,6 @@ static void qed_rdma_init_devinfo(struct qed_hwfn *p_hwfn,
dev->max_ah = p_hwfn->p_rdma_info->num_qps;
dev->max_stats_queues = (u8)RESC_NUM(p_hwfn, QED_RDMA_STATS_QUEUE);
 
-   /* Set capablities */
-   dev->dev_caps = 0;
-   SET_FIELD(dev->dev_caps, QED_RDMA_DEV_CAP_RNR_NAK, 1);
-   SET_FIELD(dev->dev_caps, QED_RDMA_DEV_CAP_PORT_ACTIVE_EVENT, 1);
-   SET_FIELD(dev->dev_caps, QED_RDMA_DEV_CAP_PORT_CHANGE_EVENT, 1);
-   SET_FIELD(dev->dev_caps, QED_RDMA_DEV_CAP_RESIZE_CQ, 1);
-   SET_FIELD(dev->dev_caps, QED_RDMA_DEV_CAP_BASE_MEMORY_EXT, 1);
-   SET_FIELD(dev->dev_caps, QED_RDMA_DEV_CAP_BASE_QUEUE_EXT, 1);
-   SET_FIELD(dev->dev_caps, QED_RDMA_DEV_CAP_ZBVA, 1);
-   SET_FIELD(dev->dev_caps, QED_RDMA_DEV_CAP_LOCAL_INV_FENCE, 1);
-
-   /* Check atomic operations support in PCI configuration space. */
-   pci_read_config_dword(cdev->pdev,
- cdev->pdev->pcie_cap + PCI_EXP_DEVCTL2,
> - &pci_status_control);
-
-   if (pci_status_control & PCI_EXP_DEVCTL2_LTR_EN)
-   SET_FIELD(dev->dev_caps, QED_RDMA_DEV_CAP_ATOMIC_OP, 1);
-
if (QED_IS_IWARP_PERSONALITY(p_hwfn))
qed_iwarp_init_devinfo(p_hwfn);
 }
diff --git a/include/linux/qed/qed_rdma_if.h b/include/linux/qed/qed_rdma_if.h
index 4dd72ba210f5..a8db5572d3c2 100644
--- a/include/linux/qed/qed_rdma_if.h
+++ b/include/linux/qed/qed_rdma_if.h
@@ -109,60 +109,7 @@ struct qed_rdma_device {
u8 max_pkey;
u16 max_srq_wr;
u8 max_stats_queues;
-   u32 dev_caps;
-
-   /* Abilty to support RNR-NAK generation */
-
-#define QED_RDMA_DEV_CAP_RNR_NAK_MASK   0x1
-#define QED_RDMA_DEV_CAP_RNR_NAK_SHIFT  0
-   /* Abilty to support shutdown port */
-#define QED_RDMA_DEV_CAP_SHUTDOWN_PORT_MASK 0x1
-#define QED_RDMA_DEV_CAP_SHUTDOWN_PORT_SHIFT1
-   /* Abilty to support port active event */
-#define QED_RDMA_DEV_CAP_PORT_ACTIVE_EVENT_MASK 0x1
-#define QED_RDMA_DEV_CAP_PORT_ACTIVE_EVENT_SHIFT2
-   /* Abilty to support port change event */
-#define QED_RDMA_DEV_CAP_PORT_CHANGE_EVENT_MASK 0x1
-#define QED_RDMA_DEV_CAP_PORT_CHANGE_EVENT_SHIFT3
-   /* Abilty to support system image GUID */
-#define QED_RDMA_DEV_CAP_SYS_IMAGE_MASK 0x1
-#define QED_RDMA_DEV_CAP_SYS_IMAGE_SHIFT4
-   /* Abilty to support bad P_Key counter support */
-#define QED_RDMA_DEV_CAP_BAD_PKEY_CNT_MASK  0x1
-#define QED_RDMA_DEV_CAP_BAD_PKEY_CNT_SHIFT 5
-   /* Abilty to support atomic operations */
-#define QED_RDMA_DEV_CAP_ATOMIC_OP_MASK 0x1
-#define QED_RDMA_DEV_CAP_ATOMIC_OP_SHIFT6
-#define QED_RDMA_DEV_CAP_RESIZE_CQ_MASK 0x1
-#define QED_RDMA_DEV_CAP_RESIZE_CQ_SHIFT7
-   /* Abilty to support modifying the maximum number of
-* outstanding work requests per QP
-*/
-#define QED_RDMA_DEV_CAP_RESIZE_MAX_WR_MASK 0x1
-#define QED_RDMA_DEV_CAP_RESIZE_MAX_WR_SHIFT8
-   /* Abilty to support automatic path migration */
-#define QED_RDMA_DEV_CAP_AUTO_PATH_MIG_MASK 0x1
-#define QED_RDMA_DEV_CAP_AUTO_PATH_MIG_SHIFT9
-   /* Abilty to support the base memory management extensions */
-#define QED_RDMA_DEV_CAP_BASE_MEMORY_EXT_MASK   0x1
-#define QED_RDMA_DEV_CAP_BASE_MEMORY_EXT_SHIFT  1

[PATCH] cxgb4: Simplify PCIe Completion Timeout setting

2017-12-15 Thread Bjorn Helgaas
From: Bjorn Helgaas <bhelg...@google.com>

Simplify PCIe Completion Timeout setting by using the
pcie_capability_clear_and_set_word() interface.  No functional change
intended.

Signed-off-by: Bjorn Helgaas <bhelg...@google.com>
---
 drivers/net/ethernet/chelsio/cxgb4/t4_hw.c |   21 +++--
 1 file changed, 3 insertions(+), 18 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
index f63210f15579..4c99fdb2e13b 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
@@ -8492,22 +8492,6 @@ static int t4_get_flash_params(struct adapter *adap)
return 0;
 }
 
-static void set_pcie_completion_timeout(struct adapter *adapter, u8 range)
-{
-   u16 val;
-   u32 pcie_cap;
-
-   pcie_cap = pci_find_capability(adapter->pdev, PCI_CAP_ID_EXP);
-   if (pcie_cap) {
-   pci_read_config_word(adapter->pdev,
> -pcie_cap + PCI_EXP_DEVCTL2, &val);
-   val &= ~PCI_EXP_DEVCTL2_COMP_TIMEOUT;
-   val |= range;
-   pci_write_config_word(adapter->pdev,
- pcie_cap + PCI_EXP_DEVCTL2, val);
-   }
-}
-
 /**
  * t4_prep_adapter - prepare SW and HW for operation
  * @adapter: the adapter
@@ -8593,8 +8577,9 @@ int t4_prep_adapter(struct adapter *adapter)
adapter->params.portvec = 1;
adapter->params.vpd.cclk = 5;
 
-   /* Set pci completion timeout value to 4 seconds. */
-   set_pcie_completion_timeout(adapter, 0xd);
+   /* Set PCIe completion timeout to 4 seconds. */
+   pcie_capability_clear_and_set_word(adapter->pdev, PCI_EXP_DEVCTL2,
+  PCI_EXP_DEVCTL2_COMP_TIMEOUT, 0xd);
return 0;
 }
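
For readers unfamiliar with the helper: pcie_capability_clear_and_set_word()
performs a read-modify-write of a PCIe capability register.  As a rough
sketch (simplified -- the real helper also copes with devices that lack the
capability and returns an error code), the call in the patch is roughly
equivalent to:

	u16 val;

	pcie_capability_read_word(adapter->pdev, PCI_EXP_DEVCTL2, &val);
	val &= ~PCI_EXP_DEVCTL2_COMP_TIMEOUT;	/* bits to clear */
	val |= 0xd;				/* bits to set */
	pcie_capability_write_word(adapter->pdev, PCI_EXP_DEVCTL2, val);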
 



Re: [PATCH 1/3] PCI: introduce a device-managed version of pci_set_mwi

2017-12-11 Thread Bjorn Helgaas
On Sun, Dec 10, 2017 at 12:43:48AM +0100, Heiner Kallweit wrote:
> Introduce a device-managed version of pci_set_mwi. First user is the
> Realtek r8169 driver.
> 
> Signed-off-by: Heiner Kallweit <hkallwe...@gmail.com>

With the subject and changelog as follows and the code reordering below,

  PCI: Add pcim_set_mwi(), a device-managed pci_set_mwi()

  Add pcim_set_mwi(), a device-managed version of pci_set_mwi(). First user
  is the Realtek r8169 driver.

Acked-by: Bjorn Helgaas <bhelg...@google.com>

With these changes, feel free to merge with the series via the netdev
tree.

> ---
>  drivers/pci/pci.c   | 29 +
>  include/linux/pci.h |  1 +
>  2 files changed, 30 insertions(+)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 4a7c6864f..fc57c378d 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -1458,6 +1458,7 @@ struct pci_devres {
>   unsigned int pinned:1;
>   unsigned int orig_intx:1;
>   unsigned int restore_intx:1;
> + unsigned int mwi:1;
>   u32 region_mask;
>  };
>  
> @@ -1476,6 +1477,9 @@ static void pcim_release(struct device *gendev, void *res)
>   if (this->region_mask & (1 << i))
>   pci_release_region(dev, i);
>  
> + if (this->mwi)
> + pci_clear_mwi(dev);
> +
>   if (this->restore_intx)
>   pci_intx(dev, this->orig_intx);
>  
> @@ -3760,6 +3764,31 @@ int pci_set_mwi(struct pci_dev *dev)
>  }
>  EXPORT_SYMBOL(pci_set_mwi);
>  
> +/**
> + * pcim_set_mwi - Managed pci_set_mwi()
> + * @dev: the PCI device for which MWI is enabled
> + *
> + * Managed pci_set_mwi().
> + *
> + * RETURNS: An appropriate -ERRNO error value on error, or zero for success.

> + */
> +int pcim_set_mwi(struct pci_dev *dev)
> +{
> + struct pci_devres *dr;
> + int ret;
> +
> + ret = pci_set_mwi(dev);
> + if (ret)
> + return ret;
> +
> + dr = find_pci_dr(dev);
> + if (dr)
> + dr->mwi = 1;
> +
> + return 0;

I would rather look up the pci_devres first, e.g.,

  dr = find_pci_dr(dev);
  if (!dr)
return -ENOMEM;

  dr->mwi = 1;
  return pci_set_mwi(dev);

That way we won't enable MWI and be unable to disable it at release-time.
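
Putting that together, a minimal sketch of the reordered helper (reusing
find_pci_dr() and the pci_devres.mwi bit added by the patch above) would
look like:

	int pcim_set_mwi(struct pci_dev *dev)
	{
		struct pci_devres *dr;

		dr = find_pci_dr(dev);
		if (!dr)
			return -ENOMEM;

		dr->mwi = 1;
		return pci_set_mwi(dev);
	}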

> +}
> +EXPORT_SYMBOL(pcim_set_mwi);
> +
>  /**
>   * pci_try_set_mwi - enables memory-write-invalidate PCI transaction
>   * @dev: the PCI device for which MWI is enabled
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 978aad784..0a7ac863a 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -1064,6 +1064,7 @@ int pci_set_pcie_reset_state(struct pci_dev *dev, enum pcie_reset_state state);
>  int pci_set_cacheline_size(struct pci_dev *dev);
>  #define HAVE_PCI_SET_MWI
>  int __must_check pci_set_mwi(struct pci_dev *dev);
> +int __must_check pcim_set_mwi(struct pci_dev *dev);
>  int pci_try_set_mwi(struct pci_dev *dev);
>  void pci_clear_mwi(struct pci_dev *dev);
>  void pci_intx(struct pci_dev *dev, int enable);
> -- 
> 2.15.1
> 
> 


Re: [PATCH v15 5/5] PCI: Remove PCI pool macro functions

2017-11-20 Thread Bjorn Helgaas
On Mon, Nov 20, 2017 at 08:32:47PM +0100, Romain Perier wrote:
> From: Romain Perier <romain.per...@collabora.com>
> 
> Now that all the drivers use dma pool API, we can remove the macro
> functions for PCI pool.
> 
> Signed-off-by: Romain Perier <romain.per...@collabora.com>
> Reviewed-by: Peter Senna Tschudin <peter.se...@collabora.com>

I already acked this once on Oct 24.  Please keep that ack and include
it in any future postings so I don't have to deal with this again.

Acked-by: Bjorn Helgaas <bhelg...@google.com>

> ---
>  include/linux/pci.h | 9 -
>  1 file changed, 9 deletions(-)
> 
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 96c94980d1ff..d03b4a20033d 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -1324,15 +1324,6 @@ int pci_set_vga_state(struct pci_dev *pdev, bool decode,
>  #include 
>  #include 
>  
> -#define  pci_pool dma_pool
> -#define pci_pool_create(name, pdev, size, align, allocation) \
> - dma_pool_create(name, >dev, size, align, allocation)
> -#define  pci_pool_destroy(pool) dma_pool_destroy(pool)
> -#define  pci_pool_alloc(pool, flags, handle) dma_pool_alloc(pool, flags, handle)
> -#define  pci_pool_zalloc(pool, flags, handle) \
> - dma_pool_zalloc(pool, flags, handle)
> -#define  pci_pool_free(pool, vaddr, addr) dma_pool_free(pool, vaddr, addr)
> -
>  struct msix_entry {
>   u32 vector; /* kernel uses to write allocated vector */
>   u16 entry;  /* driver uses to specify entry, OS writes */
> -- 
> 2.14.1
> 
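
As a concrete illustration of the driver-side conversion this series
performs, a hypothetical user of the old macros such as:

	pool = pci_pool_create("buffers", pdev, size, align, 0);
	buf = pci_pool_alloc(pool, GFP_KERNEL, &dma_handle);
	pci_pool_free(pool, buf, dma_handle);
	pci_pool_destroy(pool);

becomes, following the macro definitions removed above:

	pool = dma_pool_create("buffers", &pdev->dev, size, align, 0);
	buf = dma_pool_alloc(pool, GFP_KERNEL, &dma_handle);
	dma_pool_free(pool, buf, dma_handle);
	dma_pool_destroy(pool);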


Re: [PATCH v14 5/5] PCI: Remove PCI pool macro functions

2017-10-24 Thread Bjorn Helgaas
On Mon, Oct 23, 2017 at 07:59:58PM +0200, Romain Perier wrote:
> From: Romain Perier <romain.per...@collabora.com>
> 
> Now that all the drivers use dma pool API, we can remove the macro
> functions for PCI pool.
> 
> Signed-off-by: Romain Perier <romain.per...@collabora.com>
> Reviewed-by: Peter Senna Tschudin <peter.se...@collabora.com>

Acked-by: Bjorn Helgaas <bhelg...@google.com>

Since this barely touches drivers/pci and linux-pci wasn't copied
until v14, I assume you're planning to merge this via some other tree.
Let me know if you need anything else from me.

> ---
>  include/linux/pci.h | 9 -
>  1 file changed, 9 deletions(-)
> 
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 80eaa2dbe3e9..a827f6eb54db 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -1325,15 +1325,6 @@ int pci_set_vga_state(struct pci_dev *pdev, bool decode,
>  #include 
>  #include 
>  
> -#define  pci_pool dma_pool
> -#define pci_pool_create(name, pdev, size, align, allocation) \
> - dma_pool_create(name, >dev, size, align, allocation)
> -#define  pci_pool_destroy(pool) dma_pool_destroy(pool)
> -#define  pci_pool_alloc(pool, flags, handle) dma_pool_alloc(pool, flags, handle)
> -#define  pci_pool_zalloc(pool, flags, handle) \
> - dma_pool_zalloc(pool, flags, handle)
> -#define  pci_pool_free(pool, vaddr, addr) dma_pool_free(pool, vaddr, addr)
> -
>  struct msix_entry {
>   u32 vector; /* kernel uses to write allocated vector */
>   u16 entry;  /* driver uses to specify entry, OS writes */
> -- 
> 2.14.1
> 


Re: [PATCH v2 06/15] PCI: endpoint: make config_item_type const

2017-10-16 Thread Bjorn Helgaas
On Mon, Oct 16, 2017 at 05:18:45PM +0200, Bhumika Goyal wrote:
> Make config_item_type structures const as they are either passed to a
> function having the argument as const or stored in the const "ci_type"
> field of a config_item structure.
> 
> Done using Coccinelle.
> 
> Signed-off-by: Bhumika Goyal <bhumi...@gmail.com>

Acked-by: Bjorn Helgaas <bhelg...@google.com>

Please apply this along with the rest of the series, since it depends
on an earlier patch in the series.

> ---
> * Changes in v2- Combine all the followup patches and the constification
> patches into a series.
> 
>  drivers/pci/endpoint/pci-ep-cfs.c | 12 ++--
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/pci/endpoint/pci-ep-cfs.c b/drivers/pci/endpoint/pci-ep-cfs.c
> index 424fdd6..4f74386 100644
> --- a/drivers/pci/endpoint/pci-ep-cfs.c
> +++ b/drivers/pci/endpoint/pci-ep-cfs.c
> @@ -150,7 +150,7 @@ static void pci_epc_epf_unlink(struct config_item 
> *epc_item,
>   .drop_link  = pci_epc_epf_unlink,
>  };
>  
> -static struct config_item_type pci_epc_type = {
> +static const struct config_item_type pci_epc_type = {
>   .ct_item_ops= &pci_epc_item_ops,
>   .ct_attrs   = pci_epc_attrs,
>   .ct_owner   = THIS_MODULE,
> @@ -361,7 +361,7 @@ static void pci_epf_release(struct config_item *item)
>   .release= pci_epf_release,
>  };
>  
> -static struct config_item_type pci_epf_type = {
> +static const struct config_item_type pci_epf_type = {
>   .ct_item_ops= &pci_epf_ops,
>   .ct_attrs   = pci_epf_attrs,
>   .ct_owner   = THIS_MODULE,
> @@ -400,7 +400,7 @@ static void pci_epf_drop(struct config_group *group, struct config_item *item)
>   .drop_item  = &pci_epf_drop,
>  };
>  
> -static struct config_item_type pci_epf_group_type = {
> +static const struct config_item_type pci_epf_group_type = {
>   .ct_group_ops   = &pci_epf_group_ops,
>   .ct_owner   = THIS_MODULE,
>  };
> @@ -428,15 +428,15 @@ void pci_ep_cfs_remove_epf_group(struct config_group *group)
>  }
>  EXPORT_SYMBOL(pci_ep_cfs_remove_epf_group);
>  
> -static struct config_item_type pci_functions_type = {
> +static const struct config_item_type pci_functions_type = {
>   .ct_owner   = THIS_MODULE,
>  };
>  
> -static struct config_item_type pci_controllers_type = {
> +static const struct config_item_type pci_controllers_type = {
>   .ct_owner   = THIS_MODULE,
>  };
>  
> -static struct config_item_type pci_ep_type = {
> +static const struct config_item_type pci_ep_type = {
>   .ct_owner   = THIS_MODULE,
>  };
>  
> -- 
> 1.9.1
> 


Re: [Intel-wired-lan] [PATCH] PCI: Check/Set ARI capability before setting numVFs

2017-10-06 Thread Bjorn Helgaas
On Thu, Oct 05, 2017 at 04:07:41PM -0500, Bjorn Helgaas wrote:
> On Wed, Oct 04, 2017 at 04:29:14PM -0700, Alexander Duyck wrote:
> > On Wed, Oct 4, 2017 at 4:01 PM, Bjorn Helgaas <helg...@kernel.org> wrote:
> > > On Wed, Oct 04, 2017 at 08:52:58AM -0700, Tony Nguyen wrote:
> > >> This fixes a bug that can occur if an AER error is encountered while 
> > >> SRIOV
> > >> devices are present.

I applied the patch below with Alex's ack to pci/virtualization for v4.15.

> commit 95594dedd443e42ab0c16b9fba0109e955e7be13
> Author: Tony Nguyen <anthony.l.ngu...@intel.com>
> Date:   Wed Oct 4 08:52:58 2017 -0700
> 
> PCI: Restore "ARI Capable Hierarchy" before setting numVFs
> 
> In the restore path, we previously read PCI_SRIOV_VF_OFFSET and
> PCI_SRIOV_VF_STRIDE before restoring PCI_SRIOV_CTRL_ARI, which
> affects the offset and stride:
> 
>   pci_restore_state
> pci_restore_iov_state
>   sriov_restore_state
> pci_iov_set_numvfs
>   pci_read_config_word(... PCI_SRIOV_VF_OFFSET, &iov->offset)
> pci_write_config_word(... PCI_SRIOV_CTRL, iov->ctrl)
> 
> The effect is that suspend/resume and AER recovery, which use
> pci_restore_state(), may corrupt iov->offset and iov->stride.  The iov
> state is associated with the device, not the driver, so if we reload the
> driver, it will use the corrupted data, which may cause crashes like
> this:
> 
>   kernel BUG at drivers/pci/iov.c:157!
>   RIP: 0010:pci_iov_add_virtfn+0x2eb/0x350
>   Call Trace:
>pci_enable_sriov+0x353/0x440
>ixgbe_pci_sriov_configure+0xd5/0x1f0 [ixgbe]
>sriov_numvfs_store+0xf7/0x170
>dev_attr_store+0x18/0x30
>sysfs_kf_write+0x37/0x40
>kernfs_fop_write+0x120/0x1b0
>vfs_write+0xb5/0x1a0
>SyS_write+0x55/0xc0
> 
> This occurs because during AER recovery the ARI Capable Hierarchy bit, which
> can affect the values for First VF Offset and VF Stride, is not set until
> after pci_iov_set_numvfs() is called.  This can cause the iov structure to
> be populated with values that are incorrect if the bit is later set.
> Check and set this bit, if needed, before calling pci_iov_set_numvfs() so
> that the values being populated properly take the ARI bit into account.
> 
> Signed-off-by: Tony Nguyen <anthony.l.ngu...@intel.com>
> [bhelgaas: changelog, add comment, also clear ARI if necessary]
> Signed-off-by: Bjorn Helgaas <bhelg...@google.com>
> CC: Alexander Duyck <alexander.h.du...@intel.com>
> CC: Emil Tantilov <emil.s.tanti...@intel.com>
> 
> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> index ce24cf235f01..6bacb8995e96 100644
> --- a/drivers/pci/iov.c
> +++ b/drivers/pci/iov.c
> @@ -498,6 +498,14 @@ static void sriov_restore_state(struct pci_dev *dev)
>   if (ctrl & PCI_SRIOV_CTRL_VFE)
>   return;
>  
> + /*
> +  * Restore PCI_SRIOV_CTRL_ARI before pci_iov_set_numvfs() because
> +  * it reads offset & stride, which depend on PCI_SRIOV_CTRL_ARI.
> +  */
> + ctrl &= ~PCI_SRIOV_CTRL_ARI;
> + ctrl |= iov->ctrl & PCI_SRIOV_CTRL_ARI;
> + pci_write_config_word(dev, iov->pos + PCI_SRIOV_CTRL, ctrl);
> +
>   for (i = PCI_IOV_RESOURCES; i <= PCI_IOV_RESOURCE_END; i++)
>   pci_update_resource(dev, i);
>  


[PATCH] bnx2x: Use pci_ari_enabled() instead of local copy

2017-10-06 Thread Bjorn Helgaas
From: Bjorn Helgaas <bhelg...@google.com>

Use pci_ari_enabled() from the PCI core instead of the identical local copy
bnx2x_ari_enabled().  No functional change intended.

Signed-off-by: Bjorn Helgaas <bhelg...@google.com>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c |7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
index 9ca994d0bab6..3591077a5f6b 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
@@ -1074,11 +1074,6 @@ static void bnx2x_vf_set_bars(struct bnx2x *bp, struct bnx2x_virtf *vf)
}
 }
 
-static int bnx2x_ari_enabled(struct pci_dev *dev)
-{
-   return dev->bus->self && dev->bus->self->ari_enabled;
-}
-
 static int
 bnx2x_get_vf_igu_cam_info(struct bnx2x *bp)
 {
@@ -1212,7 +1207,7 @@ int bnx2x_iov_init_one(struct bnx2x *bp, int int_mode_param,
 
err = -EIO;
/* verify ari is enabled */
-   if (!bnx2x_ari_enabled(bp->pdev)) {
+   if (!pci_ari_enabled(bp->pdev->bus)) {
BNX2X_ERR("ARI not supported (check pci bridge ARI forwarding), 
SRIOV can not be enabled\n");
return 0;
}
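
For reference, the core helper used here is (in kernels of this era)
essentially the same test the local copy performed, just keyed off the
bus rather than the device -- paraphrased from drivers/pci/pci.h:

	static inline bool pci_ari_enabled(struct pci_bus *bus)
	{
		return bus->self && bus->self->ari_enabled;
	}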



Re: [Intel-wired-lan] [PATCH] PCI: Check/Set ARI capability before setting numVFs

2017-10-05 Thread Bjorn Helgaas
On Wed, Oct 04, 2017 at 04:29:14PM -0700, Alexander Duyck wrote:
> On Wed, Oct 4, 2017 at 4:01 PM, Bjorn Helgaas <helg...@kernel.org> wrote:
> > On Wed, Oct 04, 2017 at 08:52:58AM -0700, Tony Nguyen wrote:
> >> This fixes a bug that can occur if an AER error is encountered while SRIOV
> >> devices are present.
> >>
> >> This issue was seen by doing the following. Inject an AER error to a device
> >> that has SRIOV devices.  After the device has recovered, remove the driver.
> >> Reload the driver and enable SRIOV which causes the following crash to
> >> occur:
> >>
> >> kernel BUG at drivers/pci/iov.c:157!
> >> invalid opcode:  [#1] SMP
> >> CPU: 36 PID: 2295 Comm: bash Not tainted 4.14.0-rc1+ #74
> >> Hardware name: Supermicro X9DAi/X9DAi, BIOS 3.0a 04/29/2014
> >> task: 9fa41cd45a00 task.stack: b4b2036e8000
> >> RIP: 0010:pci_iov_add_virtfn+0x2eb/0x350
> >> RSP: 0018:b4b2036ebcb8 EFLAGS: 00010286
> >> RAX: fff0 RBX: 9fa42c1c8800 RCX: 9fa421ce2388
> >> RDX: df90 RSI: 9fa8214fb388 RDI: df903fff
> >> RBP: b4b2036ebd18 R08: 9fa421ce23b8 R09: b4b2036ebc2c
> >> R10: 9fa42c1a5548 R11: 058e R12: 9fa8214fb000
> >> R13: 9fa42c1a5000 R14: 9fa8214fb388 R15: 
> >> FS:  7f60724b6700() GS:9fa82f30()
> >> knlGS:
> >> CS:  0010 DS:  ES:  CR0: 80050033
> >> CR2: 559eca8b0f40 CR3: 000864146000 CR4: 001606e0
> >> Call Trace:
> >>  pci_enable_sriov+0x353/0x440
> >>  ixgbe_pci_sriov_configure+0xd5/0x1f0 [ixgbe]
> >>  sriov_numvfs_store+0xf7/0x170
> >>  dev_attr_store+0x18/0x30
> >>  sysfs_kf_write+0x37/0x40
> >>  kernfs_fop_write+0x120/0x1b0
> >>  __vfs_write+0x37/0x170
> >>  ? __alloc_fd+0x3f/0x170
> >>  ? set_close_on_exec+0x30/0x70
> >>  vfs_write+0xb5/0x1a0
> >>  SyS_write+0x55/0xc0
> >>  entry_SYSCALL_64_fastpath+0x1a/0xa5
> >> RIP: 0033:0x7f6071bafc20
> >> RSP: 002b:7ffe7d42ba48 EFLAGS: 0246 ORIG_RAX: 0001
> >> RAX: ffda RBX: 559eca8b0f30 RCX: 7f6071bafc20
> >> RDX: 0002 RSI: 559eca961f60 RDI: 0001
> >> RBP: 7f6071e78ae0 R08: 7f6071e7a740 R09: 7f60724b6700
> >> R10: 0073 R11: 0246 R12: 
> >> R13:  R14:  R15: 559eca892170
> >> RIP: pci_iov_add_virtfn+0x2eb/0x350 RSP: b4b2036ebcb8
> >>
> >> This occurs because during AER recovery the ARI Capable Hierarchy bit,
> >> which can affect the values for First VF Offset and VF Stride, is not set
> >> until after pci_iov_set_numvfs() is called.
> >
> > Can you elaborate on where exactly this happens?  The only place we
> > explicitly set PCI_SRIOV_CTRL_ARI is in sriov_init(), which is only
> > called at enumeration-time.  So I'm guessing you're talking about this
> > path:
> >
> >   ixgbe_io_slot_reset
> > pci_restore_state
> >   pci_restore_iov_state
> > sriov_restore_state
> >   pci_iov_set_numvfs
> >
> > where we don't set PCI_SRIOV_CTRL_ARI at all.  The fact that you say
> > PCI_SRIOV_CTRL_ARI isn't set until *after* pci_iov_set_numvfs() is
> > called suggests that it is being set *somewhere*, but I don't know
> > where.
> 
> The ARI bit is initialized in sriov_init, stored in iov->ctrl, and
> restored in sriov_restore_state, but it occurs in the line after the
> call to pci_iov_set_numvfs.
> 
> The problem is you don't want to write the full iov->ctrl value until
> after you have reset the the number of VFs since it will set VFE so
> pulling out and configuring the ARI value separately is needed.

Doh, that should have been obvious to me ;)

> >> This can cause the iov
> >> structure to be populated with values that are incorrect if the bit is
> >> later set.   Check and set this bit, if needed, before calling
> >> pci_iov_set_numvfs() so that the values being populated properly take
> >> the ARI bit into account.
> >>
> >> CC: Alexander Duyck <alexander.h.du...@intel.com>
> >> CC: Emil Tantilov <emil.s.tanti...@intel.com>
> >> Signed-off-by: Tony Nguyen <anthony.l.ngu...@intel.com>
> >> ---
> >>  drivers/pci/iov.c | 4 
> >>  1 file changed, 4 insertions(+)
> >>
> >> diff --git a/drivers/pci/iov.c b/d

Re: [PATCH] PCI: Check/Set ARI capability before setting numVFs

2017-10-04 Thread Bjorn Helgaas
On Wed, Oct 04, 2017 at 08:52:58AM -0700, Tony Nguyen wrote:
> This fixes a bug that can occur if an AER error is encountered while SRIOV
> devices are present.
> 
> This issue was seen by doing the following. Inject an AER error to a device
> that has SRIOV devices.  After the device has recovered, remove the driver.
> Reload the driver and enable SRIOV which causes the following crash to
> occur:
> 
> kernel BUG at drivers/pci/iov.c:157!
> invalid opcode:  [#1] SMP
> CPU: 36 PID: 2295 Comm: bash Not tainted 4.14.0-rc1+ #74
> Hardware name: Supermicro X9DAi/X9DAi, BIOS 3.0a 04/29/2014
> task: 9fa41cd45a00 task.stack: b4b2036e8000
> RIP: 0010:pci_iov_add_virtfn+0x2eb/0x350
> RSP: 0018:b4b2036ebcb8 EFLAGS: 00010286
> RAX: fff0 RBX: 9fa42c1c8800 RCX: 9fa421ce2388
> RDX: df90 RSI: 9fa8214fb388 RDI: df903fff
> RBP: b4b2036ebd18 R08: 9fa421ce23b8 R09: b4b2036ebc2c
> R10: 9fa42c1a5548 R11: 058e R12: 9fa8214fb000
> R13: 9fa42c1a5000 R14: 9fa8214fb388 R15: 
> FS:  7f60724b6700() GS:9fa82f30()
> knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 559eca8b0f40 CR3: 000864146000 CR4: 001606e0
> Call Trace:
>  pci_enable_sriov+0x353/0x440
>  ixgbe_pci_sriov_configure+0xd5/0x1f0 [ixgbe]
>  sriov_numvfs_store+0xf7/0x170
>  dev_attr_store+0x18/0x30
>  sysfs_kf_write+0x37/0x40
>  kernfs_fop_write+0x120/0x1b0
>  __vfs_write+0x37/0x170
>  ? __alloc_fd+0x3f/0x170
>  ? set_close_on_exec+0x30/0x70
>  vfs_write+0xb5/0x1a0
>  SyS_write+0x55/0xc0
>  entry_SYSCALL_64_fastpath+0x1a/0xa5
> RIP: 0033:0x7f6071bafc20
> RSP: 002b:7ffe7d42ba48 EFLAGS: 0246 ORIG_RAX: 0001
> RAX: ffda RBX: 559eca8b0f30 RCX: 7f6071bafc20
> RDX: 0002 RSI: 559eca961f60 RDI: 0001
> RBP: 7f6071e78ae0 R08: 7f6071e7a740 R09: 7f60724b6700
> R10: 0073 R11: 0246 R12: 
> R13:  R14:  R15: 559eca892170
> RIP: pci_iov_add_virtfn+0x2eb/0x350 RSP: b4b2036ebcb8
> 
> This occurs because during AER recovery the ARI Capable Hierarchy bit,
> which can affect the values for First VF Offset and VF Stride, is not set
> until after pci_iov_set_numvfs() is called.  

Can you elaborate on where exactly this happens?  The only place we
explicitly set PCI_SRIOV_CTRL_ARI is in sriov_init(), which is only
called at enumeration-time.  So I'm guessing you're talking about this
path:

  ixgbe_io_slot_reset
pci_restore_state
  pci_restore_iov_state
sriov_restore_state
  pci_iov_set_numvfs

where we don't set PCI_SRIOV_CTRL_ARI at all.  The fact that you say
PCI_SRIOV_CTRL_ARI isn't set until *after* pci_iov_set_numvfs() is
called suggests that it is being set *somewhere*, but I don't know
where.

> This can cause the iov
> structure to be populated with values that are incorrect if the bit is
> later set.   Check and set this bit, if needed, before calling
> pci_iov_set_numvfs() so that the values being populated properly take
> the ARI bit into account.
> 
> CC: Alexander Duyck 
> CC: Emil Tantilov 
> Signed-off-by: Tony Nguyen 
> ---
>  drivers/pci/iov.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> index 7492a65..a8896c7 100644
> --- a/drivers/pci/iov.c
> +++ b/drivers/pci/iov.c
> @@ -497,6 +497,10 @@ static void sriov_restore_state(struct pci_dev *dev)
>   if (ctrl & PCI_SRIOV_CTRL_VFE)
>   return;
>  
> + if ((iov->ctrl & PCI_SRIOV_CTRL_ARI) && !(ctrl & PCI_SRIOV_CTRL_ARI))
> + pci_write_config_word(dev, iov->pos + PCI_SRIOV_CTRL,
> +   ctrl | PCI_SRIOV_CTRL_ARI);
> +
>   for (i = PCI_IOV_RESOURCES; i <= PCI_IOV_RESOURCE_END; i++)
>   pci_update_resource(dev, i);
>  
> -- 
> 2.9.5
> 


Re: [PATCH] PCI: Allow PCI express root ports to find themselves

2017-08-17 Thread Bjorn Helgaas
On Thu, Aug 17, 2017 at 01:06:14PM +0200, Thierry Reding wrote:
> From: Thierry Reding <tred...@nvidia.com>
> 
> If the pci_find_pcie_root_port() function is called on a root port
> itself, return the root port rather than NULL.
> 
> This effectively reverts commit 0e405232871d6 ("PCI: fix oops when
> try to find Root Port for a PCI device") which added an extra check
> that would now be redundant.
> 
> Fixes: a99b646afa8a ("PCI: Disable PCIe Relaxed Ordering if unsupported")
> Fixes: c56d4450eb68 ("PCI: Turn off Request Attributes to avoid Chelsio T5 
> Completion erratum")
> Signed-off-by: Thierry Reding <tred...@nvidia.com>

Acked-by: Bjorn Helgaas <bhelg...@google.com>

I *think* this should work for everybody, but I can't test it personally.

> ---
> This applies on top of and was tested on next-20170817.
> 
> Michael, it'd be great if you could test this one again to clarify
> whether or not the fix that's already in Linus' tree is still needed, or
> whether it's indeed obsoleted by this patch.
> 
>  drivers/pci/pci.c | 9 -
>  1 file changed, 4 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index b05c587e335a..dd56c1c05614 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -514,7 +514,7 @@ EXPORT_SYMBOL(pci_find_resource);
>   */
>  struct pci_dev *pci_find_pcie_root_port(struct pci_dev *dev)
>  {
> - struct pci_dev *bridge, *highest_pcie_bridge = NULL;
> + struct pci_dev *bridge, *highest_pcie_bridge = dev;
>  
>   bridge = pci_upstream_bridge(dev);
>   while (bridge && pci_is_pcie(bridge)) {
> @@ -522,11 +522,10 @@ struct pci_dev *pci_find_pcie_root_port(struct pci_dev *dev)
>   bridge = pci_upstream_bridge(bridge);
>   }
>  
> - if (highest_pcie_bridge &&
> - pci_pcie_type(highest_pcie_bridge) == PCI_EXP_TYPE_ROOT_PORT)
> - return highest_pcie_bridge;
> + if (pci_pcie_type(highest_pcie_bridge) != PCI_EXP_TYPE_ROOT_PORT)
> + return NULL;
>  
> - return NULL;
> + return highest_pcie_bridge;
>  }
>  EXPORT_SYMBOL(pci_find_pcie_root_port);
>  
> -- 
> 2.13.3
> 
> 
> ___
> linux-arm-kernel mailing list
> linux-arm-ker...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel


Re: [PATCH net RESEND] PCI: fix oops when try to find Root Port for a PCI device

2017-08-16 Thread Bjorn Helgaas
On Wed, Aug 16, 2017 at 09:33:03PM +0200, Thierry Reding wrote:
> On Tue, Aug 15, 2017 at 12:03:31PM -0500, Bjorn Helgaas wrote:
> > On Tue, Aug 15, 2017 at 11:24:48PM +0800, Ding Tianhong wrote:
> > > Eric reports an oops when booting the system after applying
> > > the commit a99b646afa8a ("PCI: Disable PCIe Relaxed..."):
> > > ...
> > 
> > > It looks like the pci_find_pcie_root_port() was trying to
> > > find the Root Port for the PCI device which is the Root
> > > Port already, it will return NULL and trigger the problem,
> > > so check the highest_pcie_bridge to fix this problem.
> > 
> > The problem was actually with a Root Complex Integrated Endpoint that
> > has no upstream PCIe device:
> > 
> >   00:05.2 System peripheral: Intel Corporation Device 0e2a (rev 04)
> > Subsystem: Intel Corporation Device 0e2a
> > Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
> > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> > Capabilities: [40] Express (v2) Root Complex Integrated Endpoint, MSI 00
> > DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
> > ExtTag- RBE- FLReset-
> > DevCtl: Report errors: Correctable- Non-Fatal- Fatal+ Unsupported+
> > RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
> > MaxPayload 128 bytes, MaxReadReq 128 bytes
> 
> I've started seeing this crash on Tegra K1 as well. Here's the device
> for which it oopses:
> 
> 00:02.0 PCI bridge: NVIDIA Corporation TegraK1 PCIe x1 Bridge (rev a1) (prog-if 00 [Normal decode])
> Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> Latency: 0, Cache Line Size: 64 bytes
> Interrupt: pin A routed to IRQ 391
> Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
> I/O behind bridge: 1000-1fff [size=4K]
> Memory behind bridge: 1300-130f [size=1M]
> Prefetchable memory behind bridge: 2000-200f [size=1M]
> Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
> BridgeCtl: Parity+ SERR- NoISA- VGA- MAbort- >Reset- FastB2B-
> PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
> Capabilities: [40] Subsystem: NVIDIA Corporation TegraK1 PCIe x1 Bridge
> Capabilities: [48] Power Management version 3
> Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
> Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
> Capabilities: [50] MSI: Enable+ Count=1/2 Maskable- 64bit+
> Address: 00fcf000  Data: 
> Capabilities: [60] HyperTransport: MSI Mapping Enable- Fixed-
> Mapping Address Base: fee0
> Capabilities: [80] Express (v2) Root Port (Slot+), MSI 00
> DevCap: MaxPayload 128 bytes, PhantFunc 0
> ExtTag+ RBE+
> DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
> RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
> MaxPayload 128 bytes, MaxReadReq 512 bytes
> DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
> LnkCap: Port #1, Speed 5GT/s, Width x1, ASPM L0s, Exit Latency L0s <512ns
> ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp-
> LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
> ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
> SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
> Slot #0, PowerLimit 0.000W; Interlock- NoCompl-
> SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
> Control: AttnInd Off, PwrInd On, Power- Interlock-
> SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
> Changed: MRL- PresDet+ LinkState+
> RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible-
> RootCap: CRSVisible-
> 

Re: [PATCH net RESEND] PCI: fix oops when try to find Root Port for a PCI device

2017-08-15 Thread Bjorn Helgaas
On Tue, Aug 15, 2017 at 11:24:48PM +0800, Ding Tianhong wrote:
> Eric reports an oops when booting the system after applying
> the commit a99b646afa8a ("PCI: Disable PCIe Relaxed..."):
> ...

> It looks like the pci_find_pcie_root_port() was trying to
> find the Root Port for the PCI device which is the Root
> Port already, it will return NULL and trigger the problem,
> so check the highest_pcie_bridge to fix this problem.

The problem was actually with a Root Complex Integrated Endpoint that
has no upstream PCIe device:

  00:05.2 System peripheral: Intel Corporation Device 0e2a (rev 04)
Subsystem: Intel Corporation Device 0e2a
Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-

Fixes: a99b646afa8a ("PCI: Disable PCIe Relaxed Ordering if unsupported")

This also

Fixes: c56d4450eb68 ("PCI: Turn off Request Attributes to avoid Chelsio T5 Completion erratum")

which added pci_find_pcie_root_port().  Prior to this Relaxed Ordering
series, we only used pci_find_pcie_root_port() in a Chelsio quirk that
only applied to non-integrated endpoints, so we didn't trip over the
bug.

> Reported-by: Eric Dumazet 
> Signed-off-by: Eric Dumazet 
> Signed-off-by: Ding Tianhong 
> ---
>  drivers/pci/pci.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index af0cc34..7e2022f 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -522,7 +522,8 @@ struct pci_dev *pci_find_pcie_root_port(struct pci_dev *dev)
>   bridge = pci_upstream_bridge(bridge);
>   }
>  
> - if (pci_pcie_type(highest_pcie_bridge) != PCI_EXP_TYPE_ROOT_PORT)
> + if (highest_pcie_bridge &&
> + pci_pcie_type(highest_pcie_bridge) != PCI_EXP_TYPE_ROOT_PORT)
>   return NULL;
>  
>   return highest_pcie_bridge;
> -- 

I think structuring the fix as follows is a little more readable:

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index af0cc3456dc1..587cd7623ed8 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -522,10 +522,11 @@ struct pci_dev *pci_find_pcie_root_port(struct pci_dev *dev)
bridge = pci_upstream_bridge(bridge);
}
 
-   if (pci_pcie_type(highest_pcie_bridge) != PCI_EXP_TYPE_ROOT_PORT)
-   return NULL;
+   if (highest_pcie_bridge &&
+   pci_pcie_type(highest_pcie_bridge) == PCI_EXP_TYPE_ROOT_PORT)
+   return highest_pcie_bridge;
 
-   return highest_pcie_bridge;
+   return NULL;
 }
 EXPORT_SYMBOL(pci_find_pcie_root_port);
 


Re: [PATCH v9 2/4] PCI: Disable PCIe Relaxed Ordering if unsupported

2017-08-08 Thread Bjorn Helgaas
On Tue, Aug 08, 2017 at 09:22:39PM -0500, Bjorn Helgaas wrote:
> On Sat, Aug 05, 2017 at 03:15:11PM +0800, Ding Tianhong wrote:
> > When bit 4 is set in the PCIe Device Control register, it indicates
> > whether the device is permitted to use relaxed ordering.
> > On some platforms, using relaxed ordering can cause performance issues
> > or, due to errata, data corruption. In such cases devices must avoid
> > using relaxed ordering.
> > 
> > This patch checks if there is any node in the hierarchy that indicates that
> > using relaxed ordering is not safe. 
...

> > +EXPORT_SYMBOL(pcie_relaxed_ordering_supported);
> 
> This is misnamed.  This doesn't tell us anything about whether the
> device *supports* relaxed ordering.  It only tells us whether the
> device is *permitted* to use it.
> 
> When a device initiates a transaction, the hardware should set the RO
> bit in the TLP with logic something like this:
> 
>   RO = <device supports relaxed ordering> &&
>        <device is permitted to use RO (PCI_EXP_DEVCTL_RELAX_EN)> &&
>        <RO is safe for this transaction>
> 
> The issue you're fixing is that some Completers don't handle RO
> correctly.  The determining factor is not the Requester, but the
> Completer (for this series, a Root Port).  So I think this should be
> something like:
> 
>   int pcie_relaxed_ordering_broken(struct pci_dev *completer)
>   {
> if (!completer)
>   return 0;
> 
> return completer->dev_flags & PCI_DEV_FLAGS_NO_RELAXED_ORDERING;
>   }
> 
> and the caller should do something like this:
> 
>  if (pcie_relaxed_ordering_broken(pci_find_pcie_root_port(pdev)))
>adapter->flags |= ROOT_NO_RELAXED_ORDERING;
> 
> That way it's obvious where the issue is, and it's obvious that the
> answer might be different for peer-to-peer transactions than it is for
> transactions to the root port, i.e., to coherent memory.

After looking at the driver, I wonder if it would be simpler like
this:

  int pcie_relaxed_ordering_enabled(struct pci_dev *dev)
  {
u16 ctl;

pcie_capability_read_word(dev, PCI_EXP_DEVCTL, &ctl);
return ctl & PCI_EXP_DEVCTL_RELAX_EN;
  }
  EXPORT_SYMBOL(pcie_relaxed_ordering_enabled);

  static void pci_configure_relaxed_ordering(struct pci_dev *dev)
  {
struct pci_dev *root;

if (dev->is_virtfn)
  return;  /* PCI_EXP_DEVCTL_RELAX_EN is RsvdP in VFs */

if (!pcie_relaxed_ordering_enabled(dev))
  return;

/*
 * For now, we only deal with Relaxed Ordering issues with Root
 * Ports.  Peer-to-peer DMA is another can of worms.
 */
root = pci_find_pcie_root_port(dev);
if (!root)
  return;

if (root->relaxed_ordering_broken)
  pcie_capability_clear_word(dev, PCI_EXP_DEVCTL,
 PCI_EXP_DEVCTL_RELAX_EN);
  }

This doesn't check every intervening switch, but I don't think we know
about any issues except with root ports.

And the driver could do:

  if (!pcie_relaxed_ordering_enabled(pdev))
adapter->flags |= ROOT_NO_RELAXED_ORDERING;

The driver code wouldn't show anything about coherent memory vs.
peer-to-peer, but we really don't have a clue about how to handle that
yet anyway.

I guess this is back to exactly what you proposed, except that I
changed the name of pcie_relaxed_ordering_supported() to
pcie_relaxed_ordering_enabled(), which I think is slightly more
specific from the device's point of view.

Bjorn


Re: [PATCH v9 1/4] PCI: Add new PCIe Fabric End Node flag, PCI_DEV_FLAGS_NO_RELAXED_ORDERING

2017-08-08 Thread Bjorn Helgaas
On Wed, Aug 09, 2017 at 01:40:01AM +, Casey Leedom wrote:
> | From: Bjorn Helgaas <helg...@kernel.org>
> | Sent: Tuesday, August 8, 2017 4:22 PM
> | 
> | This needs to include a link to the Intel spec
> | 
> (https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf,
> | sec 3.9.1).
> 
>   In the commit message or as a comment?  Regardless, I agree.  It's always
> nice to be able to go back and see what the official documentation says.
> However, that said, links on the internet are ... fragile as time goes by,
> so we might want to simply quote section 3.9.1 in the commit message since
> it's relatively short:
> 
> 3.9.1 Optimizing PCIe Performance for Accesses Toward Coherent Memory
>   and Toward MMIO Regions (P2P)
> 
> In order to maximize performance for PCIe devices in the processors
> listed in Table 3-6 below, the software should determine whether the
> accesses are toward coherent memory (system memory) or toward MMIO
> regions (P2P access to other devices). If the access is toward MMIO
> region, then software can command HW to set the RO bit in the TLP
> header, as this would allow hardware to achieve maximum throughput for
> these types of accesses. For accesses toward coherent memory, software
> can command HW to clear the RO bit in the TLP header (no RO), as this
> would allow hardware to achieve maximum throughput for these types of
> accesses.
> 
> Table 3-6. Intel Processor CPU RP Device IDs for Processors Optimizing
>PCIe Performance
> 
> Processor                            CPU RP Device IDs
> 
> Intel Xeon processors based on   6F01H-6F0EH
> Broadwell microarchitecture
> 
> Intel Xeon processors based on   2F01H-2F0EH
> Haswell microarchitecture

Agreed, links are prone to being broken.  I would include in the
changelog the complete title and order number, along with the link as
a footnote.  Wouldn't hurt to quote the section too, since it's short.

> | It should also include a pointer to the AMD erratum, if available, or
> | at least some reference to how we know it doesn't obey the rules.
> 
>   Getting an ACK from AMD seems like a forlorn cause at this point.  My
> contact was Bob Shaw <bob.s...@amd.com> and he stopped responding to my
> messages almost a year ago saying that all of AMD's energies were being
> redirected towards upcoming x86 products (likely Ryzen as we now know).  As
> far as I can tell AMD has walked away from their A1100 (AKA "Seattle") ARM
> SoC.
> 
>   On the specific issue, I can certainly write up something even more
> extensive than I wrote up for the comment in drivers/pci/quirks.c.  Please
> review the comment I wrote up and tell me if you'd like something even more
> detailed -- I'm usually acused of writing comments which are too long, so
> this would be a new one on me ... :-)

If you have any bug reports with info about how you debugged it and
concluded that Seattle is broken, you could include a link (probably
in the changelog).  But if there isn't anything, there isn't anything.

I might reorganize those patches as:

  1) Add a PCI_DEV_FLAGS_RELAXED_ORDERING_BROKEN flag, the quirk that
  sets it, and the current patch [2/4] that uses it.

  2) Add the Intel DECLARE_PCI_FIXUP_CLASS_EARLY()s with the Intel
  details.

  3) Add the AMD DECLARE_PCI_FIXUP_CLASS_EARLY()s with the AMD
  details.


Re: [PATCH v9 2/4] PCI: Disable PCIe Relaxed Ordering if unsupported

2017-08-08 Thread Bjorn Helgaas
On Sat, Aug 05, 2017 at 03:15:11PM +0800, Ding Tianhong wrote:
> When bit 4 is set in the PCIe Device Control register, it indicates
> whether the device is permitted to use relaxed ordering.
> On some platforms, using relaxed ordering can cause performance issues
> or, due to errata, data corruption. In such cases devices must avoid
> using relaxed ordering.
> 
> This patch checks if there is any node in the hierarchy that indicates that
> using relaxed ordering is not safe. 

I think you only check the devices between the root port and the
target device.  For example, you don't check siblings or cousins of
the target device.

> In such cases the patch turns off the
> relaxed ordering by clearing the eapability for this device.

s/eapability/capability/

> And if the
> device is probably running in a guest machine, we should do nothing.

I don't know what this sentence means.  "Probably running in a guest
machine" doesn't really make sense, and there's nothing in your patch
that explicitly checks for being in a guest machine.

> Signed-off-by: Ding Tianhong 
> Acked-by: Alexander Duyck 
> Acked-by: Ashok Raj 
> ---
>  drivers/pci/pci.c   | 29 +
>  drivers/pci/probe.c | 37 +
>  include/linux/pci.h |  2 ++
>  3 files changed, 68 insertions(+)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index af0cc34..4f9d7c1 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -4854,6 +4854,35 @@ int pcie_set_mps(struct pci_dev *dev, int mps)
>  EXPORT_SYMBOL(pcie_set_mps);
>  
>  /**
> + * pcie_clear_relaxed_ordering - clear PCI Express relaxed ordering bit
> + * @dev: PCI device to query
> + *
> + * If possible clear relaxed ordering

Why "If possible"?  The bit is required to be RW or hardwired to zero,
so PCI_EXP_DEVCTL_RELAX_EN should *always* be zero when this returns.

> + */
> +int pcie_clear_relaxed_ordering(struct pci_dev *dev)
> +{
> + return pcie_capability_clear_word(dev, PCI_EXP_DEVCTL,
> +   PCI_EXP_DEVCTL_RELAX_EN);
> +}
> +EXPORT_SYMBOL(pcie_clear_relaxed_ordering);

The current series doesn't add any callers of this except
pci_configure_relaxed_ordering(), so it doesn't need to be exported to
modules.

I think I would put both of these functions in drivers/pci/probe.c.
Then this one could be static and you'd only have to add
pcie_relaxed_ordering_supported() to include/linux/pci.h.

> +
> +/**
> + * pcie_relaxed_ordering_supported - Probe for PCIe relexed ordering support

s/relexed/relaxed/

> + * @dev: PCI device to query
> + *
> + * Returns true if the device support relaxed ordering attribute.
> + */
> +bool pcie_relaxed_ordering_supported(struct pci_dev *dev)
> +{
> + u16 v;
> +
> + pcie_capability_read_word(dev, PCI_EXP_DEVCTL, &v);
> +
> + return !!(v & PCI_EXP_DEVCTL_RELAX_EN);
> +}
> +EXPORT_SYMBOL(pcie_relaxed_ordering_supported);

This is misnamed.  This doesn't tell us anything about whether the
device *supports* relaxed ordering.  It only tells us whether the
device is *permitted* to use it.

When a device initiates a transaction, the hardware should set the RO
bit in the TLP with logic something like this:

  RO = <device supports relaxed ordering> &&
       <device is permitted to use RO (PCI_EXP_DEVCTL_RELAX_EN)> &&
       <RO is safe for this transaction>

The issue you're fixing is that some Completers don't handle RO
correctly.  The determining factor is not the Requester, but the
Completer (for this series, a Root Port).  So I think this should be
something like:

  int pcie_relaxed_ordering_broken(struct pci_dev *completer)
  {
if (!completer)
  return 0;

return completer->dev_flags & PCI_DEV_FLAGS_NO_RELAXED_ORDERING;
  }

and the caller should do something like this:

 if (pcie_relaxed_ordering_broken(pci_find_pcie_root_port(pdev)))
   adapter->flags |= ROOT_NO_RELAXED_ORDERING;

That way it's obvious where the issue is, and it's obvious that the
answer might be different for peer-to-peer transactions than it is for
transactions to the root port, i.e., to coherent memory.

> +
> +/**
>   * pcie_get_minimum_link - determine minimum link settings of a PCI device
>   * @dev: PCI device to query
>   * @speed: storage for minimum speed
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index c31310d..48df012 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -1762,6 +1762,42 @@ static void pci_configure_extended_tags(struct pci_dev *dev)
>PCI_EXP_DEVCTL_EXT_TAG);
>  }
>  
> +/**
> + * pci_dev_should_disable_relaxed_ordering - check if the PCI device
> + * should disable the relaxed ordering attribute.
> + * @dev: PCI device
> + *
> + * Return true if any of the PCI devices above us do not support
> + * relaxed ordering.
> + */
> +static bool pci_dev_should_disable_relaxed_ordering(struct pci_dev *dev)
> +{
> + while (dev) {
> + if (dev->dev_flags & PCI_DEV_FLAGS_NO_RELAXED_ORDERING)
> + 

Re: [PATCH v9 1/4] PCI: Add new PCIe Fabric End Node flag, PCI_DEV_FLAGS_NO_RELAXED_ORDERING

2017-08-08 Thread Bjorn Helgaas
On Sat, Aug 05, 2017 at 03:15:10PM +0800, Ding Tianhong wrote:
> From: Casey Leedom 
> 
> The patch adds a new flag PCI_DEV_FLAGS_NO_RELAXED_ORDERING to indicate that
> Relaxed Ordering (RO) attribute should not be used for Transaction Layer
> Packets (TLP) targetted towards these affected root complexes. Current list
> of affected parts include some Intel Xeon processors root complex which 
> suffers from
> flow control credits that result in performance issues. On these affected
> parts RO can still be used for peer-2-peer traffic. AMD A1100 ARM ("SEATTLE")
> Root complexes don't obey PCIe 3.0 ordering rules, hence could lead to
> data-corruption.

This needs to include a link to the Intel spec
(https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf,
sec 3.9.1).

It should also include a pointer to the AMD erratum, if available, or
at least some reference to how we know it doesn't obey the rules.

Ashok, thanks for chiming in.  Now that you have, I have a few more
questions for you:

  - Is the above doc the one you mentioned as being now public?
  
  - Is this considered a hardware erratum?
  
  - If so, is there a pointer to that as well?
  
  - If this is not considered an erratum, can you provide any guidance
about how an OS should determine when it should use RO?

Relying on a list of device IDs in an optimization manual is OK for an
erratum, but if it's *not* an erratum, it seems like a hole in the
specs because as far as I know there's no generic way for the OS to
discover whether to use RO.

Bjorn


Re: [PATCH v9 0/4] Add new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag

2017-08-07 Thread Bjorn Helgaas
On Mon, Aug 07, 2017 at 02:14:48PM -0700, David Miller wrote:
> From: Ding Tianhong 
> Date: Mon, 7 Aug 2017 12:13:17 +0800
> 
> > Hi David:
> > 
> > I think networking tree merge it is a better choice, as it mainly used to 
> > tell the NIC
> > drivers how to use the Relaxed Ordering Attribute, and later we need send 
> > patch to enable
> > RO for ixgbe driver base on this patch. But I am not sure whether Bjorn has 
> > some of his own
> > view. :)
> > 
> > Hi Bjorn:
> > 
> > Could you help review this patch or give some feedback ?
> 
> I'm still waiting on this...
> 
> Bjorn?

I was on vacation Friday-today, but I'll look at this series this week.


Re: [PATCH] PCI: Update ACS quirk for more Intel 10G NICs

2017-08-03 Thread Bjorn Helgaas
On Thu, Jul 20, 2017 at 02:41:01PM -0700, Roland Dreier wrote:
> From: Roland Dreier 
> 
> Add one more variant of the 82599 plus the device IDs for X540 and X550
> variants.  Intel has confirmed that none of these devices does peer-to-peer
> between functions.  The X540 and X550 have added ACS capabilities in their
> PCI config space, but the ACS control register is hard-wired to 0 for both
> devices, so we still need the quirk for IOMMU grouping to allow assignment
> of individual SR-IOV functions.
> 
> Signed-off-by: Roland Dreier 

I haven't seen a real conclusion to the discussion yet, so I'm waiting on
that and hopefully an ack from Alex.  Can you please repost with that,
since I'm dropping it from patchwork for now?

> ---
>  drivers/pci/quirks.c | 21 +
>  1 file changed, 21 insertions(+)
> 
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 6967c6b4cf6b..b939db671326 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -4335,12 +4335,33 @@ static const struct pci_dev_acs_enabled {
>   { PCI_VENDOR_ID_INTEL, 0x1507, pci_quirk_mf_endpoint_acs },
>   { PCI_VENDOR_ID_INTEL, 0x1514, pci_quirk_mf_endpoint_acs },
>   { PCI_VENDOR_ID_INTEL, 0x151C, pci_quirk_mf_endpoint_acs },
> + { PCI_VENDOR_ID_INTEL, 0x1528, pci_quirk_mf_endpoint_acs },
>   { PCI_VENDOR_ID_INTEL, 0x1529, pci_quirk_mf_endpoint_acs },
> + { PCI_VENDOR_ID_INTEL, 0x154A, pci_quirk_mf_endpoint_acs },
>   { PCI_VENDOR_ID_INTEL, 0x152A, pci_quirk_mf_endpoint_acs },
>   { PCI_VENDOR_ID_INTEL, 0x154D, pci_quirk_mf_endpoint_acs },
>   { PCI_VENDOR_ID_INTEL, 0x154F, pci_quirk_mf_endpoint_acs },
>   { PCI_VENDOR_ID_INTEL, 0x1551, pci_quirk_mf_endpoint_acs },
>   { PCI_VENDOR_ID_INTEL, 0x1558, pci_quirk_mf_endpoint_acs },
> + { PCI_VENDOR_ID_INTEL, 0x1560, pci_quirk_mf_endpoint_acs },
> + { PCI_VENDOR_ID_INTEL, 0x1563, pci_quirk_mf_endpoint_acs },
> + { PCI_VENDOR_ID_INTEL, 0x15AA, pci_quirk_mf_endpoint_acs },
> + { PCI_VENDOR_ID_INTEL, 0x15AB, pci_quirk_mf_endpoint_acs },
> + { PCI_VENDOR_ID_INTEL, 0x15AC, pci_quirk_mf_endpoint_acs },
> + { PCI_VENDOR_ID_INTEL, 0x15AD, pci_quirk_mf_endpoint_acs },
> + { PCI_VENDOR_ID_INTEL, 0x15AE, pci_quirk_mf_endpoint_acs },
> + { PCI_VENDOR_ID_INTEL, 0x15B0, pci_quirk_mf_endpoint_acs },
> + { PCI_VENDOR_ID_INTEL, 0x15AB, pci_quirk_mf_endpoint_acs },
> + { PCI_VENDOR_ID_INTEL, 0x15C2, pci_quirk_mf_endpoint_acs },
> + { PCI_VENDOR_ID_INTEL, 0x15C3, pci_quirk_mf_endpoint_acs },
> + { PCI_VENDOR_ID_INTEL, 0x15C4, pci_quirk_mf_endpoint_acs },
> + { PCI_VENDOR_ID_INTEL, 0x15C6, pci_quirk_mf_endpoint_acs },
> + { PCI_VENDOR_ID_INTEL, 0x15C7, pci_quirk_mf_endpoint_acs },
> + { PCI_VENDOR_ID_INTEL, 0x15C8, pci_quirk_mf_endpoint_acs },
> + { PCI_VENDOR_ID_INTEL, 0x15CE, pci_quirk_mf_endpoint_acs },
> + { PCI_VENDOR_ID_INTEL, 0x15E4, pci_quirk_mf_endpoint_acs },
> + { PCI_VENDOR_ID_INTEL, 0x15E5, pci_quirk_mf_endpoint_acs },
> + { PCI_VENDOR_ID_INTEL, 0x15D1, pci_quirk_mf_endpoint_acs },
>   /* 82580 */
>   { PCI_VENDOR_ID_INTEL, 0x1509, pci_quirk_mf_endpoint_acs },
>   { PCI_VENDOR_ID_INTEL, 0x150E, pci_quirk_mf_endpoint_acs },
> -- 
> 2.11.0
> 
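
For context, the pci_quirk_mf_endpoint_acs() handler these table entries
point at masks out the ACS controls that are not relevant to multifunction
endpoints whose vendor has verified they do no peer-to-peer between
functions.  A sketch of the handler (paraphrased from drivers/pci/quirks.c
of this era, so treat the details as approximate):

	static int pci_quirk_mf_endpoint_acs(struct pci_dev *dev, u16 acs_flags)
	{
		/*
		 * The vendor has verified these functions do not do
		 * peer-to-peer, so treat SV, TB, RR, CR, UF, and DT as
		 * if they were unimplemented in the ACS capability.
		 */
		acs_flags &= ~(PCI_ACS_SV | PCI_ACS_TB | PCI_ACS_RR |
			       PCI_ACS_CR | PCI_ACS_UF | PCI_ACS_DT);

		return acs_flags ? 0 : 1;
	}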


Re: [PATCH v6 0/3] Add new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag

2017-07-06 Thread Bjorn Helgaas
On Thu, Jul 06, 2017 at 08:58:51PM +0800, Ding Tianhong wrote:
> Hi Bjorn:
> 
> Could you please give some feedback about this patchset, it looks like no 
> more comments for more than a week,
> thanks. :)

I was on vacation when you posted it, but don't worry, it's still in
the queue:

  https://patchwork.ozlabs.org/project/linux-pci/list

v4.12 was just released, so it's obviously too late for that.  The
v4.13 merge window is open, so it's too late for v4.13 as well (we
need stuff in -next before the merge window).

There's still plenty of time to work on this for v4.14.

Bjorn


Re: [PATCH 0/2] Replace driver's usage of hard-coded device IDs to #defines

2017-06-16 Thread Bjorn Helgaas
On Thu, May 25, 2017 at 09:56:55AM -0600, Myron Stowe wrote:
> On Wed, 24 May 2017 20:02:49 -0400 (EDT)
> David Miller <da...@davemloft.net> wrote:
> 
> > From: Myron Stowe <myron.st...@redhat.com>
> > Date: Wed, 24 May 2017 16:47:34 -0600
> > 
> > > Noa Osherovich introduced a series of new Mellanox device ID
> > > definitions to help differentiate specific controllers that needed
> > > INTx masking quirks [1].
> > > 
> > > Bjorn Helgaas followed on, using the device ID definitions Noa
> > > provided to replace hard-coded values within the mxl4 ID table [2].
> > > 
> > > This patch continues along similar lines, adding a few additional
> > > Mellanox device ID definitions and converting the net/mlx5e
> > > driver's mlx5 ID table to use the defines so tools like 'grep' and
> > > 'cscope' can be used to help identify relationships with other
> > > aspects (such as INTx masking).  
> > 
> > If you're adding pci_ids.h defines, it's only valid to do so if you
> > actually use the defines in more than one location.
> > 
> > This patch series is not doing that.
> 
> Hi David,
> 
> Yes, now that you mention that again I do vaguely remember past
> conversations stating similar constraints which is a little odd as
> Noa's series did exactly that.  It was Bjorn, in a separate patch, that
> made the connection to the driver with commit c19e4b9037f
> ("net/mlx4_core: Use device ID defines") [1] and even after such, some
> of the introduced #defines are still currently singular in usage.
> 
> Anyway, the part I'm interested in is creating a more transparent
> association between the Mellanox controllers that need the INTx masking
> quirk and their drivers, something that remains very opaque currently
> for a few of the remaining instances (PCI_DEVICE_ID_MELLANOX_CONNECTIB,
> PCI_DEVICE_ID_MELLANOX_CONNECTX4, and
> PCI_DEVICE_ID_MELLANOX_CONNECTX4_LX).

I think what you want is the patch below (your patch 2, after removing
CONNECTX5, CONNECTX5_EX, and CONNECTX6 since they're only used in one
place).

We added definitions for CONNECTIB, CONNECTX4, and CONNECTX4_LX and uses of
them in a quirk via:

  7254383341bc ("PCI: Add Mellanox device IDs")
  d76d2fe05fd9 ("PCI: Convert Mellanox broken INTx quirks to be for listed
  devices only")

But somehow we missed using those in mlx5/core/main.c.

The patch below doesn't touch PCI, so it would be just for netdev.

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 0c123d571b4c..8a4e292f26b8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -1508,11 +1508,11 @@ static void shutdown(struct pci_dev *pdev)
 }
 
 static const struct pci_device_id mlx5_core_pci_table[] = {
-   { PCI_VDEVICE(MELLANOX, 0x1011) },  /* Connect-IB */
+   { PCI_VDEVICE(MELLANOX, PCI_DEVICE_ID_MELLANOX_CONNECTIB) },
    { PCI_VDEVICE(MELLANOX, 0x1012), MLX5_PCI_DEV_IS_VF},   /* Connect-IB VF */
-   { PCI_VDEVICE(MELLANOX, 0x1013) },  /* ConnectX-4 */
+   { PCI_VDEVICE(MELLANOX, PCI_DEVICE_ID_MELLANOX_CONNECTX4) },
    { PCI_VDEVICE(MELLANOX, 0x1014), MLX5_PCI_DEV_IS_VF},   /* ConnectX-4 VF */
-   { PCI_VDEVICE(MELLANOX, 0x1015) },  /* ConnectX-4LX */
+   { PCI_VDEVICE(MELLANOX, PCI_DEVICE_ID_MELLANOX_CONNECTX4_LX) },
    { PCI_VDEVICE(MELLANOX, 0x1016), MLX5_PCI_DEV_IS_VF},   /* ConnectX-4LX VF */
    { PCI_VDEVICE(MELLANOX, 0x1017) },  /* ConnectX-5, PCIe 3.0 */
    { PCI_VDEVICE(MELLANOX, 0x1018), MLX5_PCI_DEV_IS_VF},   /* ConnectX-5 VF */


Re: [PATCH v1] ACPI: Switch to use generic UUID API

2017-05-04 Thread Bjorn Helgaas
On Thu, May 4, 2017 at 4:21 AM, Andy Shevchenko
<andriy.shevche...@linux.intel.com> wrote:
> acpi_evaluate_dsm() and friends take a pointer to a raw buffer of 16
> bytes. Instead we convert them to use uuid_le type. At the same time we
> convert current users.
>
> acpi_str_to_uuid() becomes useless after the conversion and it's safe to
> get rid of it.
>
> The conversion fixes a potential bug in int340x_thermal as well since
> we have to use memcmp() on binary data.
>
> Cc: Rafael J. Wysocki <r...@rjwysocki.net>
> Cc: Mika Westerberg <mika.westerb...@linux.intel.com>
> Cc: Borislav Petkov <b...@suse.de>
> Cc: Dan Williams <dan.j.willi...@intel.com>
> Cc: Amir Goldstein <amir7...@gmail.com>
> Cc: Jarkko Sakkinen <jarkko.sakki...@linux.intel.com>
> Cc: Jani Nikula <jani.nik...@linux.intel.com>
> Cc: Ben Skeggs <bske...@redhat.com>
> Cc: Benjamin Tissoires <benjamin.tissoi...@redhat.com>
> Cc: Joerg Roedel <j...@8bytes.org>
> Cc: Adrian Hunter <adrian.hun...@intel.com>
> Cc: Yisen Zhuang <yisen.zhu...@huawei.com>
> Cc: Bjorn Helgaas <bhelg...@google.com>
> Cc: Zhang Rui <rui.zh...@intel.com>
> Cc: Felipe Balbi <ba...@kernel.org>
> Cc: Mathias Nyman <mathias.ny...@intel.com>
> Cc: Heikki Krogerus <heikki.kroge...@linux.intel.com>
> Cc: Liam Girdwood <lgirdw...@gmail.com>
> Cc: Mark Brown <broo...@kernel.org>
> Signed-off-by: Andy Shevchenko <andriy.shevche...@linux.intel.com>

For the drivers/pci parts:

Acked-by: Bjorn Helgaas <bhelg...@google.com>
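
(A note on the int340x_thermal fix mentioned in the changelog: once the
UUID is a typed uuid_le rather than a string, equality checks go through
uuid_le_cmp(), which is memcmp() underneath.  An illustrative sketch,
variable names hypothetical:)

  uuid_le expected, candidate;

  if (!uuid_le_cmp(candidate, expected))
          pr_info("16-byte binary UUIDs match\n");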

> ---
>  drivers/acpi/acpi_extlog.c | 10 +++---
>  drivers/acpi/bus.c | 29 ++--
>  drivers/acpi/nfit/core.c   | 40 
> +++---
>  drivers/acpi/nfit/nfit.h   |  3 +-
>  drivers/acpi/utils.c   |  4 +--
>  drivers/char/tpm/tpm_crb.c |  9 +++--
>  drivers/char/tpm/tpm_ppi.c | 20 +--
>  drivers/gpu/drm/i915/intel_acpi.c  | 14 +++-
>  drivers/gpu/drm/nouveau/nouveau_acpi.c | 20 +--
>  drivers/gpu/drm/nouveau/nvkm/subdev/mxm/base.c |  9 +++--
>  drivers/hid/i2c-hid/i2c-hid.c  |  9 +++--
>  drivers/iommu/dmar.c   | 11 +++---
>  drivers/mmc/host/sdhci-pci-core.c  |  9 +++--
>  drivers/net/ethernet/hisilicon/hns/hns_dsaf_misc.c | 15 
>  drivers/pci/pci-acpi.c | 11 +++---
>  drivers/pci/pci-label.c|  4 +--
>  drivers/thermal/int340x_thermal/int3400_thermal.c  |  8 ++---
>  drivers/usb/dwc3/dwc3-pci.c|  6 ++--
>  drivers/usb/host/xhci-pci.c|  9 +++--
>  drivers/usb/misc/ucsi.c|  2 +-
>  drivers/usb/typec/typec_wcove.c|  4 +--
>  include/acpi/acpi_bus.h|  9 ++---
>  include/linux/acpi.h   |  4 +--
>  include/linux/pci-acpi.h   |  2 +-
>  sound/soc/intel/skylake/skl-nhlt.c |  7 ++--
>  tools/testing/nvdimm/test/iomap.c  |  2 +-
>  tools/testing/nvdimm/test/nfit.c   |  2 +-
>  27 files changed, 116 insertions(+), 156 deletions(-)
>
> diff --git a/drivers/acpi/acpi_extlog.c b/drivers/acpi/acpi_extlog.c
> index 502ea4dc2080..69d6140b6afa 100644
> --- a/drivers/acpi/acpi_extlog.c
> +++ b/drivers/acpi/acpi_extlog.c
> @@ -182,17 +182,17 @@ static int extlog_print(struct notifier_block *nb, unsigned long val,
> 
>  static bool __init extlog_get_l1addr(void)
>  {
> -	u8 uuid[16];
> +	uuid_le uuid;
> 	acpi_handle handle;
> 	union acpi_object *obj;
> 
> -	acpi_str_to_uuid(extlog_dsm_uuid, uuid);
> -
> +	if (uuid_le_to_bin(extlog_dsm_uuid, &uuid))
> +		return false;
> 	if (ACPI_FAILURE(acpi_get_handle(NULL, "\\_SB", &handle)))
> 		return false;
> -	if (!acpi_check_dsm(handle, uuid, EXTLOG_DSM_REV, 1 << EXTLOG_FN_ADDR))
> +	if (!acpi_check_dsm(handle, &uuid, EXTLOG_DSM_REV, 1 << EXTLOG_FN_ADDR))
> 		return false;
> -	obj = acpi_evaluate_dsm_typed(handle, uuid, EXTLOG_DSM_REV,
> +	obj = acpi_evaluate_dsm_typed(handle, &uuid, EXTLOG_DSM_REV,
> 				      EXTLOG_FN_ADDR, NULL, ACPI_TYPE_INTEGER);
> 	if (!obj) {
> 		return false;
> diff --git a/drivers/acpi/bus.c b/drivers/acpi/bus.c
> index 784bda663d16..e8130a

Re: [PATCH net-next 1/4] ixgbe: sparc: rename the ARCH_WANT_RELAX_ORDER to IXGBE_ALLOW_RELAXED_ORDER

2017-04-27 Thread Bjorn Helgaas
[+cc Casey]

On Wed, Apr 26, 2017 at 09:18:33AM -0700, Alexander Duyck wrote:
> On Wed, Apr 26, 2017 at 2:26 AM, Ding Tianhong wrote:
> > Hi Amir:
> >
> > It is really good to hear that mlx5 will support RO mode this year. If
> > so, do you agree with enabling it dynamically via ethtool -s xxx? We
> > tried this several months ago, but at that time only one driver would
> > use it, so the maintainer was against it. If mlx5 will support RO, we
> > could try to restart this solution. What do you think? :)
> >
> > Thanks
> > Ding
> 
> Hi Ding,
> 
> Enabling relaxed ordering really doesn't have any place in ethtool. It
> is a PCIe attribute that you are essentially wanting to enable.
> 
> It might be worth while to take a look at updating the PCIe code path
> to handle this. Really what we should probably do is guarantee that
> the architectures that need relaxed ordering are setting it in the
> PCIe Device Control register and that the ones that don't are clearing
> the bit. It's possible that this is already occurring, but I don't
> know the state of handling those bits is in the kernel. Once we can
> guarantee that we could use that to have the drivers determine their
> behavior in regards to relaxed ordering. For example in the case of
> igb/ixgbe we could probably change the behavior so that it will by
> default try to use relaxed ordering, but if it is not enabled in the PCIe
> Device Control register the hardware should not request to use it. It
> would simplify things in the drivers and allow for each architecture
> to control things as needed in their PCIe code.

I thought Relaxed Ordering was an optimization.  Are there cases where
it is actually required for correct behavior?

The PCI core doesn't currently do anything with Relaxed Ordering.
Some drivers enable/disable it directly.  I think it would probably be
better if the core provided an interface for this.  One reason is
because I think Casey has identified some systems where Relaxed
Ordering doesn't work correctly, and I'd rather deal with them once in
the core than in every driver.
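
To make that concrete, a core query helper could look roughly like the
sketch below (assuming the existing pcie_capability accessors and the
PCI_EXP_DEVCTL_RELAX_EN bit; illustrative only, not code from the tree):

  static bool pcie_relaxed_ordering_enabled(struct pci_dev *dev)
  {
          u16 ctl;

          pcie_capability_read_word(dev, PCI_EXP_DEVCTL, &ctl);
          return !!(ctl & PCI_EXP_DEVCTL_RELAX_EN);
  }

A driver could then decline to request Relaxed Ordering transactions
when the bit is clear, which is the igb/ixgbe behavior change suggested
above.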

Are you hinting that the PCI core or arch code could actually *enable*
Relaxed Ordering without the driver doing anything?  Is it safe to do
that?  Is there such a thing as a device that is capable of using RO,
but where the driver must be aware of it being enabled, so it programs
the device appropriately?

Bjorn


Re: [PATCH 5/7] IB/hfi1: use pcie_flr instead of duplicating it

2017-04-27 Thread Bjorn Helgaas
On Thu, Apr 27, 2017 at 08:47:58AM +0200, Christoph Hellwig wrote:
> On Tue, Apr 25, 2017 at 02:39:55PM -0500, Bjorn Helgaas wrote:
> > This still leaves these:
> > 
> >   [PATCH 4/7] ixgbe: use pcie_flr instead of duplicating it
> >   [PATCH 6/7] crypto: qat: use pcie_flr instead of duplicating it
> >   [PATCH 7/7] liquidio: use pcie_flr instead of duplicating it
> > 
> > I haven't seen any response to 4 and 6.  Felix reported an unused
> > variable in 7.  Let me know if you'd like me to do anything with
> > these.
> 
> Now that Jeff ACKed 4 it might be worth adding it to the pci tree at the
> last minute.  I'll resend liquidio and qat to the respective maintainers for
> the next merge window.

I applied 4 with Jeff's ack to pci/virtualization for v4.12, thanks!


Re: [PATCH 5/7] IB/hfi1: use pcie_flr instead of duplicating it

2017-04-25 Thread Bjorn Helgaas
On Mon, Apr 24, 2017 at 04:35:07PM +0200, Christoph Hellwig wrote:
> On Mon, Apr 24, 2017 at 02:16:31PM +, Byczkowski, Jakub wrote:
> > Tested-by: Jakub Byczkowski 
> 
> Are you (and Doug) ok with queueing this up in the PCI tree?

Applied this with Jakub's tested-by and Doug's ack to pci/virtualization
for v4.12.

This still leaves these:

  [PATCH 4/7] ixgbe: use pcie_flr instead of duplicating it
  [PATCH 6/7] crypto: qat: use pcie_flr instead of duplicating it
  [PATCH 7/7] liquidio: use pcie_flr instead of duplicating it

I haven't seen any response to 4 and 6.  Felix reported an unused
variable in 7.  Let me know if you'd like me to do anything with
these.

Bjorn


Re: export pcie_flr and remove copies of it in drivers V2

2017-04-18 Thread Bjorn Helgaas
On Fri, Apr 14, 2017 at 09:11:24PM +0200, Christoph Hellwig wrote:
> Hi all,
> 
> this exports the PCI layer pcie_flr helper, and removes various opencoded
> copies of it.
> 
> Changes since V1:
>  - rebase on top of the pci/virtualization branch
>  - fixed the probe case in __pci_dev_reset
>  - added ACKs from Bjorn

Applied the first three patches:
 
  bc13871ef35a PCI: Export pcie_flr()
  e641c375d414 PCI: Call pcie_flr() from reset_intel_82599_sfp_virtfn()
  40e0901ea4bf PCI: Call pcie_flr() from reset_chelsio_generic_dev()

to pci/virtualization for v4.12, thanks!


Re: export pcie_flr and remove copies of it in drivers

2017-04-14 Thread Bjorn Helgaas
On Fri, Apr 14, 2017 at 9:41 AM, Bjorn Helgaas <helg...@kernel.org> wrote:
> On Thu, Apr 13, 2017 at 04:53:32PM +0200, Christoph Hellwig wrote:
>> Hi all,
>>
>> this exports the PCI layer pcie_flr helper, and removes various opencoded
>> copies of it.
>
> Looks good to me (except the comment on probe).  If you want to apply
> the whole series via netdev or some non-PCI tree, here's my ack for
> the drivers/pci parts, assuming the probe thing is resolved:
>
> Acked-by: Bjorn Helgaas <bhelg...@google.com>
>
> Otherwise, I'd be glad to take the series given acks for the non-PCI
> parts.  Just let me know.

I do already have a patch (c5e4f0192ad2 ("PCI: Avoid FLR for Intel
82579 NICs")) on my pci/virtualization branch that touches pcie_flr()
and will conflict with this one.


Re: export pcie_flr and remove copies of it in drivers

2017-04-14 Thread Bjorn Helgaas
On Thu, Apr 13, 2017 at 04:53:32PM +0200, Christoph Hellwig wrote:
> Hi all,
> 
> this exports the PCI layer pcie_flr helper, and removes various opencoded
> copies of it.

Looks good to me (except the comment on probe).  If you want to apply
the whole series via netdev or some non-PCI tree, here's my ack for
the drivers/pci parts, assuming the probe thing is resolved:

Acked-by: Bjorn Helgaas <bhelg...@google.com>

Otherwise, I'd be glad to take the series given acks for the non-PCI
parts.  Just let me know.

Bjorn


Re: [PATCH 1/7] PCI: export pcie_flr

2017-04-14 Thread Bjorn Helgaas
[+cc Alex]

On Thu, Apr 13, 2017 at 04:53:33PM +0200, Christoph Hellwig wrote:
> Currently we opencode the FLR sequence in lots of place, export a core
> helper instead.  We split out the probing for FLR support as all the
> non-core callers already know their hardware.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/pci/pci.c   | 34 +-
>  include/linux/pci.h |  1 +
>  2 files changed, 26 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 7904d02ffdb9..3256a63c5d08 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -3773,24 +3773,38 @@ static void pci_flr_wait(struct pci_dev *dev)
>   (i - 1) * 100);
>  }
>  
> -static int pcie_flr(struct pci_dev *dev, int probe)
> +/**
> + * pcie_has_flr - check if a device supports function level resets
> + * @dev: device to check
> + *
> + * Returns true if the device advertises support for PCIe function level
> + * resets.
> + */
> +static bool pcie_has_flr(struct pci_dev *dev)
>  {
>   u32 cap;
>  
>   pcie_capability_read_dword(dev, PCI_EXP_DEVCAP, &cap);
> - if (!(cap & PCI_EXP_DEVCAP_FLR))
> - return -ENOTTY;
> -
> - if (probe)
> - return 0;
> + return cap & PCI_EXP_DEVCAP_FLR;
> +}
>  
> +/**
> + * pcie_flr - initiate a PCIe function level reset
> + * @dev: device to reset
> + *
> + * Initiate a function level reset on @dev.  The caller should ensure the
> + * device supports FLR before calling this function, e.g. by using the
> + * pcie_has_flr helper.

s/pcie_has_flr/pcie_has_flr()/

> + */
> +void pcie_flr(struct pci_dev *dev)
> +{
>   if (!pci_wait_for_pending_transaction(dev))
>   dev_err(&dev->dev, "timed out waiting for pending transaction; performing function level reset anyway\n");
>  
>   pcie_capability_set_word(dev, PCI_EXP_DEVCTL, PCI_EXP_DEVCTL_BCR_FLR);
>   pci_flr_wait(dev);
> - return 0;
>  }
> +EXPORT_SYMBOL_GPL(pcie_flr);
>  
>  static int pci_af_flr(struct pci_dev *dev, int probe)
>  {
> @@ -3971,9 +3985,11 @@ static int __pci_dev_reset(struct pci_dev *dev, int probe)
>   if (rc != -ENOTTY)
>   goto done;
>  
> - rc = pcie_flr(dev, probe);
> - if (rc != -ENOTTY)
> + if (pcie_has_flr(dev)) {
> + pcie_flr(dev);
> + rc = 0;
>   goto done;
> + }

This performs an FLR (if supported) always, regardless of "probe".
I think it should look something like this instead:

  if (pcie_has_flr(dev)) {
          if (!probe)
                  pcie_flr(dev);
          rc = 0;
          goto done;
  }
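
(With that fixed, non-core callers -- which per the changelog already
know their hardware supports FLR -- reduce to a single unconditional
call, e.g. hypothetically in a driver reset path:

  pcie_flr(pdev);   /* replaces the open-coded DEVCTL_BCR_FLR sequence */

while the core keeps the pcie_has_flr() check for the generic path.)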

>   rc = pci_af_flr(dev, probe);
>   if (rc != -ENOTTY)
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index eb3da1a04e6c..f35e51eddad0 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -1052,6 +1052,7 @@ int pcie_get_mps(struct pci_dev *dev);
>  int pcie_set_mps(struct pci_dev *dev, int mps);
>  int pcie_get_minimum_link(struct pci_dev *dev, enum pci_bus_speed *speed,
> enum pcie_link_width *width);
> +void pcie_flr(struct pci_dev *dev);
>  int __pci_reset_function(struct pci_dev *dev);
>  int __pci_reset_function_locked(struct pci_dev *dev);
>  int pci_reset_function(struct pci_dev *dev);
> -- 
> 2.11.0
> 


Re: [PATCH 4/4] PCI: remove pci_enable_msix

2017-04-06 Thread Bjorn Helgaas
On Thu, Apr 06, 2017 at 02:24:48PM +0200, Christoph Hellwig wrote:
> Unused now that all callers switched to pci_alloc_irq_vectors.
> 
> Signed-off-by: Christoph Hellwig <h...@lst.de>

I already acked this, but I can do it again :)
(https://lkml.kernel.org/r/20170330230913.ga3...@bhelgaas-glaptop.roam.corp.google.com)

Acked-by: Bjorn Helgaas <bhelg...@google.com>

> ---
>  drivers/pci/msi.c   | 21 -
>  include/linux/pci.h |  4 
>  2 files changed, 25 deletions(-)
> 
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index d571bc330686..0042c365b29b 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -973,27 +973,6 @@ static int __pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries,
>   return msix_capability_init(dev, entries, nvec, affd);
>  }
>  
> -/**
> - * pci_enable_msix - configure device's MSI-X capability structure
> - * @dev: pointer to the pci_dev data structure of MSI-X device function
> - * @entries: pointer to an array of MSI-X entries (optional)
> - * @nvec: number of MSI-X irqs requested for allocation by device driver
> - *
> - * Setup the MSI-X capability structure of device function with the number
> - * of requested irqs upon its software driver call to request for
> - * MSI-X mode enabled on its hardware device function. A return of zero
> - * indicates the successful configuration of MSI-X capability structure
> - * with new allocated MSI-X irqs. A return of < 0 indicates a failure.
> - * Or a return of > 0 indicates that driver request is exceeding the number
> - * of irqs or MSI-X vectors available. Driver should use the returned value 
> to
> - * re-send its request.
> - **/
> -int pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries, int 
> nvec)
> -{
> - return __pci_enable_msix(dev, entries, nvec, NULL);
> -}
> -EXPORT_SYMBOL(pci_enable_msix);
> -
>  void pci_msix_shutdown(struct pci_dev *dev)
>  {
>   struct msi_desc *entry;
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index eb3da1a04e6c..82dec36845e6 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -1300,7 +1300,6 @@ int pci_msi_vec_count(struct pci_dev *dev);
>  void pci_msi_shutdown(struct pci_dev *dev);
>  void pci_disable_msi(struct pci_dev *dev);
>  int pci_msix_vec_count(struct pci_dev *dev);
> -int pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries, int nvec);
>  void pci_msix_shutdown(struct pci_dev *dev);
>  void pci_disable_msix(struct pci_dev *dev);
>  void pci_restore_msi_state(struct pci_dev *dev);
> @@ -1330,9 +1329,6 @@ static inline int pci_msi_vec_count(struct pci_dev *dev) { return -ENOSYS; }
>  static inline void pci_msi_shutdown(struct pci_dev *dev) { }
>  static inline void pci_disable_msi(struct pci_dev *dev) { }
>  static inline int pci_msix_vec_count(struct pci_dev *dev) { return -ENOSYS; }
> -static inline int pci_enable_msix(struct pci_dev *dev,
> -   struct msix_entry *entries, int nvec)
> -{ return -ENOSYS; }
>  static inline void pci_msix_shutdown(struct pci_dev *dev) { }
>  static inline void pci_disable_msix(struct pci_dev *dev) { }
>  static inline void pci_restore_msi_state(struct pci_dev *dev) { }
> -- 
> 2.11.0
> 


Re: [PATCH 5/5] PCI: remove pci_enable_msix

2017-03-30 Thread Bjorn Helgaas
On Mon, Mar 27, 2017 at 10:29:36AM +0200, Christoph Hellwig wrote:
> Unused now that all callers switched to pci_alloc_irq_vectors.
> 
> Signed-off-by: Christoph Hellwig <h...@lst.de>

Acked-by: Bjorn Helgaas <bhelg...@google.com>

I assume this will be merged with the rest via the netdev tree.
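
(For anyone still converting callers, the replacement pattern is roughly
the following -- an illustrative sketch, not taken from any particular
driver:)

  int i, irq, nvec;

  nvec = pci_alloc_irq_vectors(pdev, 1, maxvec,
                               PCI_IRQ_MSIX | PCI_IRQ_MSI | PCI_IRQ_LEGACY);
  if (nvec < 0)
          return nvec;

  for (i = 0; i < nvec; i++)
          irq = pci_irq_vector(pdev, i);  /* Linux IRQ number of vector i */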

> ---
>  drivers/pci/msi.c   | 21 -
>  include/linux/pci.h |  4 
>  2 files changed, 25 deletions(-)
> 
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index d571bc330686..0042c365b29b 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -973,27 +973,6 @@ static int __pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries,
>   return msix_capability_init(dev, entries, nvec, affd);
>  }
>  
> -/**
> - * pci_enable_msix - configure device's MSI-X capability structure
> - * @dev: pointer to the pci_dev data structure of MSI-X device function
> - * @entries: pointer to an array of MSI-X entries (optional)
> - * @nvec: number of MSI-X irqs requested for allocation by device driver
> - *
> - * Setup the MSI-X capability structure of device function with the number
> - * of requested irqs upon its software driver call to request for
> - * MSI-X mode enabled on its hardware device function. A return of zero
> - * indicates the successful configuration of MSI-X capability structure
> - * with new allocated MSI-X irqs. A return of < 0 indicates a failure.
> - * Or a return of > 0 indicates that driver request is exceeding the number
> - * of irqs or MSI-X vectors available. Driver should use the returned value to
> - * re-send its request.
> - **/
> -int pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries, int 
> nvec)
> -{
> - return __pci_enable_msix(dev, entries, nvec, NULL);
> -}
> -EXPORT_SYMBOL(pci_enable_msix);
> -
>  void pci_msix_shutdown(struct pci_dev *dev)
>  {
>   struct msi_desc *entry;
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index eb3da1a04e6c..82dec36845e6 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -1300,7 +1300,6 @@ int pci_msi_vec_count(struct pci_dev *dev);
>  void pci_msi_shutdown(struct pci_dev *dev);
>  void pci_disable_msi(struct pci_dev *dev);
>  int pci_msix_vec_count(struct pci_dev *dev);
> -int pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries, int nvec);
>  void pci_msix_shutdown(struct pci_dev *dev);
>  void pci_disable_msix(struct pci_dev *dev);
>  void pci_restore_msi_state(struct pci_dev *dev);
> @@ -1330,9 +1329,6 @@ static inline int pci_msi_vec_count(struct pci_dev *dev) { return -ENOSYS; }
>  static inline void pci_msi_shutdown(struct pci_dev *dev) { }
>  static inline void pci_disable_msi(struct pci_dev *dev) { }
>  static inline int pci_msix_vec_count(struct pci_dev *dev) { return -ENOSYS; }
> -static inline int pci_enable_msix(struct pci_dev *dev,
> -   struct msix_entry *entries, int nvec)
> -{ return -ENOSYS; }
>  static inline void pci_msix_shutdown(struct pci_dev *dev) { }
>  static inline void pci_disable_msix(struct pci_dev *dev) { }
>  static inline void pci_restore_msi_state(struct pci_dev *dev) { }
> -- 
> 2.11.0
> 


Re: [PATCH 5/5] PCI: remove pci_enable_msix

2017-03-30 Thread Bjorn Helgaas
On Tue, Mar 28, 2017 at 09:24:15AM -0700, David Daney wrote:
> On 03/27/2017 11:41 PM, Christoph Hellwig wrote:
> >On Mon, Mar 27, 2017 at 10:30:46AM -0700, David Daney wrote:
> >>>Use pci_enable_msix_{exact,range} for now, as I told you before.
> >>>
> >>
> >>That still results in twice as many MSI-X being provisioned than are needed.
> >
> >How so?  Except for the return value, a pci_enable_msix_exact call with the
> >same arguments as your previous pci_enable_msix will work exactly the
> >same.
> >
> 
> Sorry, I think it was my misunderstanding.  I didn't realize that we
> had essentially renamed the function, but left the functionality
> mostly unchanged.

Does this mean you're OK with this patch?  I know it may require some
work on out-of-tree drivers and so on, but if that work is possible
and you don't actually lose functionality, I'm OK with this patch.

Bjorn


Re: Legacy features in PCI Express devices

2017-03-13 Thread Bjorn Helgaas
On Mon, Mar 13, 2017 at 05:10:57PM +0100, Mason wrote:
> Hello,
> 
> There are two revisions of our PCI Express controller.
> 
> Rev 1 did not support the following features:
> 
>   1) legacy PCI interrupt delivery (INTx signals)
>   2) I/O address space
> 
> Internally, someone stated that such missing support would prevent
> some PCIe cards from working with our controller.
> 
> Are there really modern PCIe cards that require 1) and/or 2)
> to function?

From a spec point of view, all endpoints, including Legacy, PCI
Express, and Root Complex Integrated Endpoints, are "required to
support MSI or MSI-X or both if an interrupt resource is requested"
(PCIe r3.0, sec 1.3.2).

The same section says Legacy Endpoints are permitted to use I/O
Requests, but PCI Express and Root Complex Integrated Endpoints are
not.  There's a little wiggle room in the I/O BAR description; I would
interpret it to mean the latter two varieties are permitted to have
I/O BARs, but they must make the resources described by those BARs
available via a memory BAR as well, so they can operate with I/O
address space.
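
In driver terms, honoring that usually means preferring the memory BAR
and treating any I/O BAR as a legacy fallback, along these lines (a
sketch only; the BAR numbers are hypothetical):

  void __iomem *regs;

  /* Prefer the memory BAR; fall back to the I/O BAR only if needed. */
  if (pci_resource_flags(pdev, 1) & IORESOURCE_MEM)
          regs = pci_iomap(pdev, 1, 0);
  else
          regs = pci_iomap(pdev, 0, 0);

pci_iomap() hides the I/O-vs-memory distinction from the rest of the
driver, which is what lets such a device work behind a host bridge that
has no I/O space at all.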

But that's only in theory; David has already given examples of devices
that don't support MSI, and Greg hinted at new devices that might
require I/O space.  I'm particularly curious about that last one,
because there are several host bridges that don't support I/O space at
all.

Bjorn


Re: linux <=4.9.5, 4.10-rc7 ok, 4.9.6 - 4.9.8 nok with realtek wlan, atom

2017-02-09 Thread Bjorn Helgaas
[+cc rtl8192ce folks in case they've seen this]

On Thu, Feb 09, 2017 at 03:45:01PM +0100, rupert THURNER wrote:
> Hi,
> 
> I am not expert enough technically, but I wanted to give some short user
> feedback. For Realtek WLAN on Atom, kernels up to 4.9.5 are OK, and
> kernel 4.10.0-rc7-g926af6273fc6 (Arch linux-git version numbering) is
> fine as well. Kernels 4.9.6, 4.9.7, and 4.9.8 fail: connecting to a WLAN
> hotspot either succeeds and then drops, or connecting to the WLAN fails
> altogether.

Thanks very much for your report, and sorry for the inconvenience.

v4.10-rc7 works, so I guess we don't need to worry about fixing v4.10.

But the stable kernels v4.9.6, v4.9.7, and v4.9.8 are broken, so we
need to figure out why and make sure we fix the v4.9 stable series.

I can't tell yet whether this is PCI-related or not.  If it is,
4922a6a5cfa7 ("PCI: Enumerate switches below PCI-to-PCIe bridges")
appeared in v4.9.6, and there is a known issue with that.  The issue
should be fixed by 610c2b7ff8f6 ("PCI/ASPM: Handle PCI-to-PCIe bridges
as roots of PCIe hierarchies"), which appeared in v4.9.9, so I guess
the first thing to do would be to test v4.9.9.

If it's not fixed in v4.9.9, can you share the complete dmesg log
(output of "dmesg" command) and "lspci -vv" output for v4.9.5 (last
known working version) and v4.9.6 (first known broken version)?  On
v4.9.6, collect the dmesg output after the failure occurs.

> 24: PCI 300.0: 0282 WLAN controller
>   [Created at pci.366]
>   Model: "Realtek RTL8188CE 802.11b/g/n WiFi Adapter"
>   Device: pci 0x8176 "RTL8188CE 802.11b/g/n WiFi Adapter"
>   Revision: 0x01
>   Driver: "rtl8192ce"
>   Driver Modules: "rtl8192ce"
>   Device File: wlp3s0
>   Features: WLAN


Re: [PATCH v2] PCI: lock each enable/disable num_vfs operation in sysfs

2017-02-03 Thread Bjorn Helgaas
On Fri, Jan 06, 2017 at 01:59:08PM -0800, Emil Tantilov wrote:
> Enabling/disabling SRIOV via sysfs by echo-ing multiple values
> simultaneously:
> 
> echo 63 > /sys/class/net/ethX/device/sriov_numvfs&
> echo 63 > /sys/class/net/ethX/device/sriov_numvfs
> 
> sleep 5
> 
> echo 0 > /sys/class/net/ethX/device/sriov_numvfs&
> echo 0 > /sys/class/net/ethX/device/sriov_numvfs
> 
> Results in the following bug:
> 
> kernel BUG at drivers/pci/iov.c:495!
> invalid opcode:  [#1] SMP
> CPU: 1 PID: 8050 Comm: bash Tainted: G   W   4.9.0-rc7-net-next #2092
> RIP: 0010:[]
> [] pci_iov_release+0x57/0x60
> 
> Call Trace:
>  [] pci_release_dev+0x26/0x70
>  [] device_release+0x3e/0xb0
>  [] kobject_cleanup+0x67/0x180
>  [] kobject_put+0x2d/0x60
>  [] put_device+0x17/0x20
>  [] pci_dev_put+0x1a/0x20
>  [] pci_get_dev_by_id+0x5b/0x90
>  [] pci_get_subsys+0x35/0x40
>  [] pci_get_device+0x18/0x20
>  [] pci_get_domain_bus_and_slot+0x2b/0x60
>  [] pci_iov_remove_virtfn+0x57/0x180
>  [] pci_disable_sriov+0x65/0x140
>  [] ixgbe_disable_sriov+0xc7/0x1d0 [ixgbe]
>  [] ixgbe_pci_sriov_configure+0x3d/0x170 [ixgbe]
>  [] sriov_numvfs_store+0xdc/0x130
> ...
> RIP  [] pci_iov_release+0x57/0x60
> 
> Use the existing mutex lock to protect each enable/disable operation.
> 
> -v2: move the existing lock from protecting the config of the IOV bus
> to protecting the writes to sriov_numvfs in sysfs without maintaining
> a "locked" version of pci_iov_add/remove_virtfn().
> As suggested by Gavin Shan 
> 
> CC: Alexander Duyck 
> Signed-off-by: Emil Tantilov 

Applied with Gavin's reviewed-by to pci/virtualization for v4.11, thanks!

> ---
>  drivers/pci/iov.c   |7 ---
>  drivers/pci/pci-sysfs.c |   23 ---
>  drivers/pci/pci.h   |2 +-
>  3 files changed, 17 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> index 4722782..2479ae8 100644
> --- a/drivers/pci/iov.c
> +++ b/drivers/pci/iov.c
> @@ -124,7 +124,6 @@ int pci_iov_add_virtfn(struct pci_dev *dev, int id, int reset)
>   struct pci_sriov *iov = dev->sriov;
>   struct pci_bus *bus;
>  
> - mutex_lock(&iov->dev->sriov->lock);
>   bus = virtfn_add_bus(dev->bus, pci_iov_virtfn_bus(dev, id));
>   if (!bus)
>   goto failed;
> @@ -162,7 +161,6 @@ int pci_iov_add_virtfn(struct pci_dev *dev, int id, int reset)
>   __pci_reset_function(virtfn);
>  
>   pci_device_add(virtfn, virtfn->bus);
> - mutex_unlock(&iov->dev->sriov->lock);
>  
>   pci_bus_add_device(virtfn);
>   sprintf(buf, "virtfn%u", id);
> @@ -181,12 +179,10 @@ int pci_iov_add_virtfn(struct pci_dev *dev, int id, int reset)
>   sysfs_remove_link(&dev->dev.kobj, buf);
>  failed1:
>   pci_dev_put(dev);
> - mutex_lock(&iov->dev->sriov->lock);
>   pci_stop_and_remove_bus_device(virtfn);
>  failed0:
>   virtfn_remove_bus(dev->bus, bus);
>  failed:
> - mutex_unlock(&iov->dev->sriov->lock);
>  
>   return rc;
>  }
> @@ -195,7 +191,6 @@ void pci_iov_remove_virtfn(struct pci_dev *dev, int id, int reset)
>  {
>   char buf[VIRTFN_ID_LEN];
>   struct pci_dev *virtfn;
> - struct pci_sriov *iov = dev->sriov;
>  
>   virtfn = pci_get_domain_bus_and_slot(pci_domain_nr(dev->bus),
>pci_iov_virtfn_bus(dev, id),
> @@ -218,10 +213,8 @@ void pci_iov_remove_virtfn(struct pci_dev *dev, int id, int reset)
>   if (virtfn->dev.kobj.sd)
>   sysfs_remove_link(&virtfn->dev.kobj, "physfn");
>  
> - mutex_lock(&iov->dev->sriov->lock);
>   pci_stop_and_remove_bus_device(virtfn);
>   virtfn_remove_bus(dev->bus, virtfn->bus);
> - mutex_unlock(&iov->dev->sriov->lock);
>  
>   /* balance pci_get_domain_bus_and_slot() */
>   pci_dev_put(virtfn);
> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
> index 0666287..25d010d 100644
> --- a/drivers/pci/pci-sysfs.c
> +++ b/drivers/pci/pci-sysfs.c
> @@ -472,6 +472,7 @@ static ssize_t sriov_numvfs_store(struct device *dev,
> const char *buf, size_t count)
>  {
>   struct pci_dev *pdev = to_pci_dev(dev);
> + struct pci_sriov *iov = pdev->sriov;
>   int ret;
>   u16 num_vfs;
>  
> @@ -482,38 +483,46 @@ static ssize_t sriov_numvfs_store(struct device *dev,
>   if (num_vfs > pci_sriov_get_totalvfs(pdev))
>   return -ERANGE;
>  
> + mutex_lock(&iov->dev->sriov->lock);
> +
>   if (num_vfs == pdev->sriov->num_VFs)
> - return count;   /* no change */
> + goto exit;
>  
>   /* is PF driver loaded w/callback */
>   if (!pdev->driver || !pdev->driver->sriov_configure) {
>   dev_info(>dev, "Driver doesn't support SRIOV 
> configuration via sysfs\n");
> - return -ENOSYS;
> + ret = -ENOENT;
> + goto exit;
>   }
>  
>   if (num_vfs == 0) {
>  

Re: kill off pci_enable_msi_{exact,range}

2017-01-13 Thread Bjorn Helgaas
On Fri, Jan 13, 2017 at 09:05:53AM +0100, Christoph Hellwig wrote:
> On Fri, Jan 13, 2017 at 08:55:03AM +0100, Christoph Hellwig wrote:
> > On Thu, Jan 12, 2017 at 03:29:00PM -0600, Bjorn Helgaas wrote:
> > > Applied all three (with Tom's ack on the amd-xgbe patch) to pci/msi for
> > > v4.11, thanks!
> > 
> > Tom had just sent me an even better version of the xgbe patch.  Tom,
> > maybe you can resend that relative to the PCI tree [1], so that we don't
> > lose it for next merge window?
> 
> Actually - Bjorn, your msi branch contains an empty commit from this
> thread:
> 
>   
> https://git.kernel.org/cgit/linux/kernel/git/helgaas/pci.git/commit/?h=pci/msi&id=7a8191de43faa9869b421a1b06075d8126ce7c0b

Yep, I botched that.  Thought I'd fixed it, but guess I got distracted.

> Maybe we should rebase it after all to avoid that?  In that case please
> pick up the xgbe patch from Tom below:

I dropped the empty commit and replaced the xgbe patch with the one below.
Can you take a look at [1] and make sure it's what you expected?

[1] https://git.kernel.org/cgit/linux/kernel/git/helgaas/pci.git/log/?h=pci/msi

Thanks!

> ---
> From: Tom Lendacky <thomas.lenda...@amd.com>
> Subject: [PATCH] amd-xgbe: Update PCI support to use new IRQ functions
> 
> Some of the PCI MSI/MSI-X functions have been deprecated and it is
> recommended to use the new pci_alloc_irq_vectors() function. Convert
> the code over to use the new function. Also, modify the way in which
> the IRQs are requested - try for multiple MSI-X/MSI first, then a
> single MSI/legacy interrupt.
> 
> Signed-off-by: Tom Lendacky <thomas.lenda...@amd.com>
> Signed-off-by: Christoph Hellwig <h...@lst.de>
> ---
>  drivers/net/ethernet/amd/xgbe/xgbe-pci.c |  128 
> +-
>  drivers/net/ethernet/amd/xgbe/xgbe.h |8 +-
>  2 files changed, 41 insertions(+), 95 deletions(-)
> 
> diff --git a/drivers/net/ethernet/amd/xgbe/xgbe-pci.c b/drivers/net/ethernet/amd/xgbe/xgbe-pci.c
> index e76b7f6..e436902 100644
> --- a/drivers/net/ethernet/amd/xgbe/xgbe-pci.c
> +++ b/drivers/net/ethernet/amd/xgbe/xgbe-pci.c
> @@ -122,104 +122,40 @@
>  #include "xgbe.h"
>  #include "xgbe-common.h"
>  
> -static int xgbe_config_msi(struct xgbe_prv_data *pdata)
> +static int xgbe_config_multi_msi(struct xgbe_prv_data *pdata)
>  {
> - unsigned int msi_count;
> + unsigned int vector_count;
>   unsigned int i, j;
>   int ret;
>  
> - msi_count = XGBE_MSIX_BASE_COUNT;
> - msi_count += max(pdata->rx_ring_count,
> -  pdata->tx_ring_count);
> - msi_count = roundup_pow_of_two(msi_count);
> + vector_count = XGBE_MSI_BASE_COUNT;
> + vector_count += max(pdata->rx_ring_count,
> + pdata->tx_ring_count);
>  
> - ret = pci_enable_msi_exact(pdata->pcidev, msi_count);
> + ret = pci_alloc_irq_vectors(pdata->pcidev, XGBE_MSI_MIN_COUNT,
> + vector_count, PCI_IRQ_MSI | PCI_IRQ_MSIX);
>   if (ret < 0) {
> - dev_info(pdata->dev, "MSI request for %u interrupts failed\n",
> -  msi_count);
> -
> - ret = pci_enable_msi(pdata->pcidev);
> - if (ret < 0) {
> - dev_info(pdata->dev, "MSI enablement failed\n");
> - return ret;
> - }
> -
> - msi_count = 1;
> - }
> -
> - pdata->irq_count = msi_count;
> -
> - pdata->dev_irq = pdata->pcidev->irq;
> -
> - if (msi_count > 1) {
> - pdata->ecc_irq = pdata->pcidev->irq + 1;
> - pdata->i2c_irq = pdata->pcidev->irq + 2;
> - pdata->an_irq = pdata->pcidev->irq + 3;
> -
> - for (i = XGBE_MSIX_BASE_COUNT, j = 0;
> -  (i < msi_count) && (j < XGBE_MAX_DMA_CHANNELS);
> -  i++, j++)
> - pdata->channel_irq[j] = pdata->pcidev->irq + i;
> - pdata->channel_irq_count = j;
> -
> - pdata->per_channel_irq = 1;
> - pdata->channel_irq_mode = XGBE_IRQ_MODE_LEVEL;
> - } else {
> - pdata->ecc_irq = pdata->pcidev->irq;
> - pdata->i2c_irq = pdata->pcidev->irq;
> - pdata->an_irq = pdata->pcidev->irq;
> - }
> -
> - if (netif_msg_probe(pdata))
> - dev_dbg(pdata->dev, "MSI interrupts enabled\n");
> -
> - return 0;
> -}
> -
> -static int xgbe_config_msix(struct xgbe_

Re: kill off pci_enable_msi_{exact,range}

2017-01-12 Thread Bjorn Helgaas
On Mon, Jan 09, 2017 at 09:37:37PM +0100, Christoph Hellwig wrote:
> I had hoped that we could kill these old interfaces for 4.10-rc,
> but as of today Linus tree still has two users:
> 
>  (1) the cobalt media driver, for which I sent a patch long time ago,
>  it got missed in the merge window.
>  (2) the new xgbe driver was merged in 4.10-rc but used the old interfaces
>  anyway
> 
> This series resend the patch for (1) and adds a new one for (2), as well
> as having the final removal patch behind it.  Maybe we should just queue
> up all three together in the PCI tree for 4.11?

Applied all three (with Tom's ack on the amd-xgbe patch) to pci/msi for
v4.11, thanks!


Re: [PATCH kernel v3] PCI: Enable access to custom VPD for Chelsio devices (cxgb3)

2016-11-23 Thread Bjorn Helgaas
On Mon, Oct 24, 2016 at 06:04:17PM +1100, Alexey Kardashevskiy wrote:
> There is at least one Chelsio 10Gb card which uses VPD area to store
> some custom blocks (example below). However pci_vpd_size() returns
> the length of the first block only assuming that there can be only
> one VPD "End Tag" and VFIO blocks access beyond that offset
> (since 4e1a63555) which leads to the situation when the guest "cxgb3"
> driver fails to probe the device. The host system does not have this
> problem as the driver accesses the config space directly without
> pci_read_vpd()/...
> 
> This adds a quirk to override the VPD size to a bigger value.
> The maximum size is taken from EEPROMSIZE in
> drivers/net/ethernet/chelsio/cxgb3/common.h. We do not read the tag
> as the cxgb3 driver does as the driver supports writing to EEPROM/VPD
> and when it writes, it only checks for the 8192-byte boundary. The quirk
> is registered for all devices supported by the cxgb3 driver.
> 
> This adds a quirk to the PCI layer (not to the cxgb3 driver) as
> the cxgb3 driver itself accesses VPD directly and the problem only exists
> with the vfio-pci driver (when cxgb3 is not running on the host and
> may not be even loaded) which blocks accesses beyond the first block
> of VPD data. However vfio-pci itself does not have quirks mechanism so
> we add it to PCI.
> 
> This is the controller:
> Ethernet controller [0200]: Chelsio Communications Inc T310 10GbE Single Port 
> Adapter [1425:0030]
> 
> This is what I parsed from its vpd:
> ===
> b'\x82*\x0010 Gigabit Ethernet-SR PCI Express Adapter\x90J\x00EC\x07D76809 
> FN\x0746K'
>   Large item 42 bytes; name 0x2 Identifier String
>   b'10 Gigabit Ethernet-SR PCI Express Adapter'
>  002d Large item 74 bytes; name 0x10
>   #00 [EC] len=7: b'D76809 '
>   #0a [FN] len=7: b'46K7897'
>   #14 [PN] len=7: b'46K7897'
>   #1e [MN] len=4: b'1037'
>   #25 [FC] len=4: b'5769'
>   #2c [SN] len=12: b'YL102035603V'
>   #3b [NA] len=12: b'00145E992ED1'
>  007a Small item 1 bytes; name 0xf End Tag
> 
>  0c00 Large item 16 bytes; name 0x2 Identifier String
>   b'S310E-SR-X  '
>  0c13 Large item 234 bytes; name 0x10
>   #00 [PN] len=16: b'TBD '
>   #13 [EC] len=16: b'110107730D2 '
>   #26 [SN] len=16: b'97YL102035603V  '
>   #39 [NA] len=12: b'00145E992ED1'
>   #48 [V0] len=6: b'175000'
>   #51 [V1] len=6: b'26'
>   #5a [V2] len=6: b'26'
>   #63 [V3] len=6: b'2000  '
>   #6c [V4] len=2: b'1 '
>   #71 [V5] len=6: b'c2'
>   #7a [V6] len=6: b'0 '
>   #83 [V7] len=2: b'1 '
>   #88 [V8] len=2: b'0 '
>   #8d [V9] len=2: b'0 '
>   #92 [VA] len=2: b'0 '
>   #97 [RV] len=80: 
> b's\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'...
>  0d00 Large item 252 bytes; name 0x11
>   #00 [VC] len=16: b'122310_1222 dp  '
>   #13 [VD] len=16: b'610-0001-00 H1\x00\x00'
>   #26 [VE] len=16: b'122310_1353 fp  '
>   #39 [VF] len=16: b'610-0001-00 H1\x00\x00'
>   #4c [RW] len=173: 
> b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'...
>  0dff Small item 0 bytes; name 0xf End Tag
> 
> 10f3 Large item 13315 bytes; name 0x62
> !!! unknown item name 98: 
> b'\xd0\x03\x00@`\x0c\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00'
> ===
> 
> Signed-off-by: Alexey Kardashevskiy 

Applied to pci/misc for v4.10, thanks, Alexey!

> ---
> Changes:
> v3:
> * unconditionally set VPD size to 8192
> 
> v2:
> * used pci_set_vpd_size() helper
> * added explicit list of IDs from cxgb3 driver
> * added a note in the commit log why the quirk is not in cxgb3
> ---
>  drivers/pci/quirks.c | 19 +++
>  1 file changed, 19 insertions(+)
> 
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index c232729..bc7c541 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -3255,6 +3255,25 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_CACTUS_RIDGE_4C
>  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_PORT_RIDGE,
>   quirk_thunderbolt_hotplug_msi);
>  
> +static void quirk_chelsio_extend_vpd(struct pci_dev *dev)
> +{
> + pci_set_vpd_size(dev, 8192);
> +}
> +
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CHELSIO, 0x20, quirk_chelsio_extend_vpd);
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CHELSIO, 0x21, quirk_chelsio_extend_vpd);
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CHELSIO, 0x22, quirk_chelsio_extend_vpd);
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CHELSIO, 0x23, quirk_chelsio_extend_vpd);
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CHELSIO, 0x24, quirk_chelsio_extend_vpd);
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CHELSIO, 0x25, quirk_chelsio_extend_vpd);
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CHELSIO, 0x26, quirk_chelsio_extend_vpd);
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CHELSIO, 0x30, quirk_chelsio_extend_vpd);
> 

mlx4 BUG_ON in probe path

2016-11-16 Thread Bjorn Helgaas
Hi Yishai,

Johannes has been working on an mlx4 initialization problem on an
IBM x3850 X6.  The underlying problem is a PCI core issue -- we're
setting RCB in the Mellanox device, which means it thinks it can
generate 128-byte Completions, even though the Root Port above it
can't handle them.  That issue is
https://bugzilla.kernel.org/show_bug.cgi?id=187781

The machine crashed when this happened, apparently not because of any
error reported via AER, but because mlx4 contains a BUG_ON, probably
the one in mlx4_enter_error_state().

That one happens if pci_channel_offline() returns false.  Is this
telling us about a problem in PCI error handling, or is it just a case
where mlx4 isn't as smart as it could be?

Ideally, if mlx4 can't initialize the device, it should just return an
error from the probe function instead of crashing the whole machine.
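
Something along these lines in the failure path, instead of the
BUG_ON() -- an untested sketch, where the function and variable names
are placeholders for wherever mlx4 detects the failed reset:

  if (!pci_channel_offline(pdev)) {
          dev_err(&pdev->dev, "HCA reset failed, aborting probe\n");
          return -EIO;    /* propagate the failure up the probe path */
  }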

Here's the crash (the entire dmesg log is in the bugzilla above):

  mlx4_core :41:00.0: command 0xfff timed out (go bit not cleared)
  mlx4_core :41:00.0: device is going to be reset
  mlx4_core :41:00.0: Failed to obtain HW semaphore, aborting
  mlx4_core :41:00.0: Fail to reset HCA
  [ cut here ]
  kernel BUG at drivers/net/ethernet/mellanox/mlx4/catas.c:193!
  invalid opcode:  [#1] SMP 
  Modules linked in: sr_mod(E) cdrom(E) uas(E) usb_storage(E) mlx4_core(E+) 
cdc_ether(E) usbnet(E) mii(E) joydev(E) x86_pkg_temp_thermal(E) 
intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) irqbypass(E) 
crct10dif_pclmul(E) crc32_pclmul(E) crc32c_intel(E) drbg(E) ansi_cprng(E) 
aesni_intel(E) iTCO_wdt(E) aes_x86_64(E) igb(E) ipmi_devintf(E) 
iTCO_vendor_support(E) lrw(E) gf128mul(E) glue_helper(E) ablk_helper(E) ptp(E) 
cryptd(E) pps_core(E) sb_edac(E) pcspkr(E) lpc_ich(E) ipmi_ssif(E) ioatdma(E) 
edac_core(E) shpchp(E) mfd_core(E) dca(E) wmi(E) ipmi_si(E) ipmi_msghandler(E) 
fjes(E) button(E) processor(E) acpi_pad(E) hid_generic(E) usbhid(E) ext4(E) 
crc16(E) jbd2(E) mbcache(E) sd_mod(E) mgag200(E) i2c_algo_bit(E) 
drm_kms_helper(E) syscopyarea(E) xhci_pci(E) sysfillrect(E) ehci_pci(E) 
sysimgblt(E)
   fb_sys_fops(E) xhci_hcd(E) ehci_hcd(E) ttm(E) usbcore(E) drm(E) 
usb_common(E) megaraid_sas(E) dm_mirror(E) dm_region_hash(E) dm_log(E) sg(E) 
dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) 
scsi_mod(E) autofs4(E)
  Supported: Yes
  CPU: 27 PID: 2867 Comm: modprobe Tainted: GE  4.4.21-default 
#6
  Hardware name: IBM x3850 X6 -[3837Z7P]-/00FN772, BIOS -[A8E120CUS-1.30]- 
08/22/2016
  task: 881fb2ff9280 ti: 881fbd3c4000 task.ti: 881fbd3c4000
  RIP: 0010:[]  [] 
mlx4_enter_error_state+0x240/0x320 [mlx4_core]
  RSP: 0018:881fbd3c79a0  EFLAGS: 00010246
  RAX: 8820b2486e00 RBX: 883fbe24 RCX: 
  RDX: 0001 RSI: 0246 RDI: 881fbf63b000
  RBP: 8820b2486e60 R08: 0029 R09: 88803feda50f
  R10: 000d1b50 R11:  R12: 
  R13:  R14: 883fbe240460 R15: fffb
  FS:  7f7c55203700() GS:883fbf90() knlGS:
  CS:  0010 DS:  ES:  CR0: 80050033
  CR2: 7f1813c88000 CR3: 003fbe637000 CR4: 001406e0
  Stack:
   15b3c100 883fbe24 0fff 
   a0447d54   ea60
    ea60 c90031dba680 883fbe24
  Call Trace:
   [] __mlx4_cmd+0x594/0x8a0 [mlx4_core]
   [] mlx4_map_cmd+0x2ab/0x3c0 [mlx4_core]
   [] mlx4_load_one+0x515/0x1220 [mlx4_core]
   [] mlx4_init_one+0x4e9/0x6a0 [mlx4_core]
   [] local_pci_probe+0x3f/0xa0
   [] pci_device_probe+0xd4/0x120
   [] driver_probe_device+0x1f7/0x420
   [] __driver_attach+0x7b/0x80
   [] bus_for_each_dev+0x58/0x90
   [] bus_add_driver+0x1c9/0x280
   [] driver_register+0x5b/0xd0
   [] mlx4_init+0x11a/0x1000 [mlx4_core]
   [] do_one_initcall+0xc8/0x1f0
   [] do_init_module+0x5a/0x1d7
   [] load_module+0x1366/0x1c50
   [] SYSC_finit_module+0x70/0xa0
   [] entry_SYSCALL_64_fastpath+0x12/0x71



Re: [net-next 5/5] PCI: disable FLR for 82579 device

2016-09-28 Thread Bjorn Helgaas
On Wed, Sep 28, 2016 at 03:33:52PM +, Neftin, Sasha wrote:
> 
> Since I worked with Sasha on this I will provide a bit of information from 
> what I understand of this bug as well.
> 
> On Tue, Sep 27, 2016 at 12:13 PM, Alex Williamson 
> <alex.william...@redhat.com> wrote:
> > On Tue, 27 Sep 2016 13:17:02 -0500
> > Bjorn Helgaas <helg...@kernel.org> wrote:
> >
> >> On Sun, Sep 25, 2016 at 10:02:43AM +0300, Neftin, Sasha wrote:
> >> > On 9/24/2016 12:05 AM, Jeff Kirsher wrote:
> >> > >On Fri, 2016-09-23 at 09:01 -0500, Bjorn Helgaas wrote:
> >> > >>On Thu, Sep 22, 2016 at 11:39:01PM -0700, Jeff Kirsher wrote:
> >> > >>>From: Sasha Neftin <sasha.nef...@intel.com>
> >> > >>>
> >> > >>>82579 has a problem reattaching itself after the device is detached.
> >> > >>>The bug was reported by Redhat. The suggested fix is to disable 
> >> > >>>FLR capability in PCIe configuration space.
> >> > >>>
> >> > >>>Reproduction:
> >> > >>>Attach the device to a VM, then detach and try to attach again.
> >> > >>>
> >> > >>>Fix:
> >> > >>>Disable FLR capability to prevent the 82579 from hanging.
> >> > >>Is there a bugzilla or other reference URL to include here?  
> >> > >>Should this be marked for stable?
> >> > >So the author is in Israel, meaning it is their weekend now.  I do 
> >> > >not believe Sasha monitors email over the weekend, so a response 
> >> > >to your questions won't happen for a few days.
> >> > >
> >> > >I tried searching my archives for more information, but had no 
> >> > >luck finding any additional information.
> >> > >
> 
> I agree that we do probably need to update the patch description since it 
> isn't exactly clear what this is fixing or what was actually broken.
> 
> >> > >>>Signed-off-by: Sasha Neftin <sasha.nef...@intel.com>
> >> > >>>Tested-by: Aaron Brown <aaron.f.br...@intel.com>
> >> > >>>Signed-off-by: Jeff Kirsher <jeffrey.t.kirs...@intel.com>
> >> > >>>---
> >> > >>>  drivers/pci/quirks.c | 21 +
> >> > >>>  1 file changed, 21 insertions(+)
> >> > >>>
> >> > >>>diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c index 
> >> > >>>44e0ff3..59fba6e 100644
> >> > >>>--- a/drivers/pci/quirks.c
> >> > >>>+++ b/drivers/pci/quirks.c
> >> > >>>@@ -4431,3 +4431,24 @@ static void quirk_intel_qat_vf_cap(struct 
> >> > >>>pci_dev *pdev)
> >> > >>> }
> >> > >>>  }
> >> > >>>  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x443, 
> >> > >>>quirk_intel_qat_vf_cap);
> >> > >>>+/*
> >> > >>>+ * Workaround FLR issues for 82579
> >> > >>>+ * This code disables the FLR (Function Level Reset) via PCIe, 
> >> > >>>+in
> >> > >>>order
> >> > >>>+ * to workaround a bug found while using device passthrough, 
> >> > >>>+ where the
> >> > >>>+ * interface would become non-responsive.
> >> > >>>+ * NOTE: the FLR bit is Read/Write Once (RWO) in config space, 
> >> > >>>+ so if
> >> > >>>+ * the BIOS or kernel writes this register * then this 
> >> > >>>+ workaround will
> >> > >>>+ * not work.
> >> > >>This doesn't sound like a root cause.  Is the issue a hardware 
> >> > >>erratum?  Linux PCI core bug?  VFIO bug?  Device firmware bug?
> >> > >>
> >> > >>The changelog suggests that the problem only affects passthrough, 
> >> > >>which suggests some sort of kernel bug related to how passthrough 
> >> > >>is implemented.
> >>
> >> If this bug affects all scenarios, not just passthrough, the 
> >> changelog should not mention passthrough.
> >>
> >> > >>>+ */
> >> > >>>+static void quirk_intel_flr_cap_dis(struct pci_dev *dev) {
> >> > >>>+int pos = pci_find_capability(dev, PCI_CAP_ID_AF);
> >> > >>>+if (pos) {
> >>

Re: [net-next 5/5] PCI: disable FLR for 82579 device

2016-09-27 Thread Bjorn Helgaas
On Sun, Sep 25, 2016 at 10:02:43AM +0300, Neftin, Sasha wrote:
> On 9/24/2016 12:05 AM, Jeff Kirsher wrote:
> >On Fri, 2016-09-23 at 09:01 -0500, Bjorn Helgaas wrote:
> >>On Thu, Sep 22, 2016 at 11:39:01PM -0700, Jeff Kirsher wrote:
> >>>From: Sasha Neftin <sasha.nef...@intel.com>
> >>>
> >>>82579 has a problem reattaching itself after the device is detached.
> >>>The bug was reported by Redhat. The suggested fix is to disable
> >>>FLR capability in PCIe configuration space.
> >>>
> >>>Reproduction:
> >>>Attach the device to a VM, then detach and try to attach again.
> >>>
> >>>Fix:
> >>>Disable FLR capability to prevent the 82579 from hanging.
> >>Is there a bugzilla or other reference URL to include here?  Should
> >>this be marked for stable?
> >So the author is in Israel, meaning it is their weekend now.  I do not
> >believe Sasha monitors email over the weekend, so a response to your
> >questions won't happen for a few days.
> >
> >I tried searching my archives for more information, but had no luck finding
> >any additional information.
> >
> >>>Signed-off-by: Sasha Neftin <sasha.nef...@intel.com>
> >>>Tested-by: Aaron Brown <aaron.f.br...@intel.com>
> >>>Signed-off-by: Jeff Kirsher <jeffrey.t.kirs...@intel.com>
> >>>---
> >>>  drivers/pci/quirks.c | 21 +
> >>>  1 file changed, 21 insertions(+)
> >>>
> >>>diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> >>>index 44e0ff3..59fba6e 100644
> >>>--- a/drivers/pci/quirks.c
> >>>+++ b/drivers/pci/quirks.c
> >>>@@ -4431,3 +4431,24 @@ static void quirk_intel_qat_vf_cap(struct
> >>>pci_dev *pdev)
> >>>   }
> >>>  }
> >>>  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x443,
> >>>quirk_intel_qat_vf_cap);
> >>>+/*
> >>>+ * Workaround FLR issues for 82579
> >>>+ * This code disables the FLR (Function Level Reset) via PCIe, in
> >>>order
> >>>+ * to workaround a bug found while using device passthrough, where the
> >>>+ * interface would become non-responsive.
> >>>+ * NOTE: the FLR bit is Read/Write Once (RWO) in config space, so if
> >>>+ * the BIOS or kernel writes this register * then this workaround will
> >>>+ * not work.
> >>This doesn't sound like a root cause.  Is the issue a hardware
> >>erratum?  Linux PCI core bug?  VFIO bug?  Device firmware bug?
> >>
> >>The changelog suggests that the problem only affects passthrough,
> >>which suggests some sort of kernel bug related to how passthrough is
> >>implemented.

If this bug affects all scenarios, not just passthrough, the changelog
should not mention passthrough.

> >>>+ */
> >>>+static void quirk_intel_flr_cap_dis(struct pci_dev *dev)
> >>>+{
> >>>+  int pos = pci_find_capability(dev, PCI_CAP_ID_AF);
> >>>+  if (pos) {
> >>>+  u8 cap;
> >>>+  pci_read_config_byte(dev, pos + PCI_AF_CAP, &cap);
> >>>+  cap = cap & (~PCI_AF_CAP_FLR);
> >>>+  pci_write_config_byte(dev, pos + PCI_AF_CAP, cap);
> >>>+  }
> >>>+}
> >>>+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1502,
> >>>quirk_intel_flr_cap_dis);
> >>>+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1503,
> >>>quirk_intel_flr_cap_dis);
> >>>-- 
> >>>2.7.4
> >>>
> >>>--
> >>>To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> >>>the body of a message to majord...@vger.kernel.org
> >>>More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> Hello,
> 
> Original bugzilla thread could be found here:
> https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=966840

That bugzilla is private and I can't read it.

> This is our HW bug; it exists only in 82579 devices. Newer devices
> have no such problem. We have found the root cause and suggested
> this solution.

Is there an erratum you can reference?

> This solution should work for 95% of cases, so I do not think that
> this is fragile. For the other cases, a possible solution is to bring
> up a working system and manually disable FLR before the VM starts
> using our adapter.

I don't think a 95% solution is sufficient.  Can you use the
pci_dev_specific_reset() framework to make a 100% solution?
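
(Schematically, that means a device-specific reset function plus entries
in the existing pci_dev_reset_methods[] table in drivers/pci/quirks.c --
a sketch only; the actual reset sequence is the part Intel would need to
supply:

  static int reset_intel_82579(struct pci_dev *dev, int probe)
  {
          if (probe)
                  return 0;
          /* device-specific reset sequence, instead of FLR, goes here */
          return 0;
  }

  /* entries added to pci_dev_reset_methods[]: */
  { PCI_VENDOR_ID_INTEL, 0x1502, reset_intel_82579 },
  { PCI_VENDOR_ID_INTEL, 0x1503, reset_intel_82579 },

so pci_dev_specific_reset() would use it in place of FLR for these
devices.)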

Bjorn


Re: [PATCH v5 00/16] Add Paravirtual RDMA Driver

2016-09-26 Thread Bjorn Helgaas
On Sat, Sep 24, 2016 at 04:21:24PM -0700, Adit Ranadive wrote:
>  MAINTAINERS|9 +
>  drivers/infiniband/Kconfig |1 +
>  drivers/infiniband/hw/Makefile |1 +
>  drivers/infiniband/hw/pvrdma/Kconfig   |7 +
>  drivers/infiniband/hw/pvrdma/Makefile  |3 +
>  drivers/infiniband/hw/pvrdma/pvrdma.h  |  473 +
>  drivers/infiniband/hw/pvrdma/pvrdma_cmd.c  |  117 +++
>  drivers/infiniband/hw/pvrdma/pvrdma_cq.c   |  426 +
>  drivers/infiniband/hw/pvrdma/pvrdma_defs.h |  301 ++
>  drivers/infiniband/hw/pvrdma/pvrdma_dev_api.h  |  342 +++
>  drivers/infiniband/hw/pvrdma/pvrdma_doorbell.c |  127 +++
>  drivers/infiniband/hw/pvrdma/pvrdma_ib_verbs.h |  444 +
>  drivers/infiniband/hw/pvrdma/pvrdma_main.c | 1220 
> 
>  drivers/infiniband/hw/pvrdma/pvrdma_misc.c |  304 ++
>  drivers/infiniband/hw/pvrdma/pvrdma_mr.c   |  334 +++
>  drivers/infiniband/hw/pvrdma/pvrdma_qp.c   |  973 +++
>  drivers/infiniband/hw/pvrdma/pvrdma_verbs.c|  577 +++
>  drivers/infiniband/hw/pvrdma/pvrdma_verbs.h|  108 +++
>  drivers/net/vmxnet3/vmxnet3_int.h  |3 +-
>  include/linux/pci_ids.h|1 +
>  include/uapi/rdma/Kbuild   |2 +
>  include/uapi/rdma/pvrdma-abi.h |   99 ++
>  include/uapi/rdma/pvrdma-uapi.h|  255 +

Hi Adit,

You don't need to cc linux-pci just because of the one-line change to
pci_ids.h.  I've already acked that, and the rest is just noise to the
main linux-pci audience.

Bjorn

