Re: [PATCH] cxgb4vf: fix t4vf_eth_xmit()'s return type

2018-05-04 Thread Casey Leedom
| From: Luc Van Oostenryck 
| Sent: Tuesday, April 24, 2018 6:19:02 AM
| 
| The method ndo_start_xmit() is defined as returning a 'netdev_tx_t',
| which is a typedef for an enum type, but the implementation in this
| driver returns an 'int'.
| 
| Fix this by returning 'netdev_tx_t' in this driver too.

Looks good to me.

Casey

Re: [PATCH v4 12/17] net: cxgb4/cxgb4vf: Eliminate duplicate barriers on weakly-ordered archs

2018-03-21 Thread Casey Leedom
[[ Apologies for the DUPLICATE email.  I forgot to tell my Mail Agent to
   use Plain Text. -- Casey ]]

  I feel very uncomfortable with these proposed changes.  Our team is right
in the middle of trying to tease our way through the various platform
implementations of writel(), writel_relaxed(), __raw_writel(), etc. in order
to support x86, PowerPC, ARM, etc. with a single code base.  This is
complicated by the somewhat ... "fuzzily defined" semantics and varying
platform implementations of all of these APIs.  (And note that I'm just
picking writel() as an example.)

  Additionally, many of the changes aren't even in fast paths and are thus
unneeded for performance.

  Please don't make these changes.  We're trying to get this all sussed out.

Casey


  
From: Sinan Kaya <ok...@codeaurora.org>
Sent: Monday, March 19, 2018 7:42:27 PM
To: netdev@vger.kernel.org; ti...@codeaurora.org; sulr...@codeaurora.org
Cc: linux-arm-...@vger.kernel.org; linux-arm-ker...@lists.infradead.org; Sinan 
Kaya; Ganesh GR; Casey Leedom; linux-ker...@vger.kernel.org
Subject: [PATCH v4 12/17] net: cxgb4/cxgb4vf: Eliminate duplicate barriers on 
weakly-ordered archs
  

Code includes wmb() followed by writel(). writel() already has a barrier on
some architectures like arm64.

This ends up with the CPU observing two barriers back to back before
executing the register write.

Create a new wrapper function with a relaxed write operator. Use the new
wrapper when a write follows a wmb().

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h  |  6 ++
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 13 +++--
 drivers/net/ethernet/chelsio/cxgb4/sge.c    | 12 ++--
 drivers/net/ethernet/chelsio/cxgb4/t4_hw.c  |  2 +-
 drivers/net/ethernet/chelsio/cxgb4vf/adapter.h  | 14 ++
 drivers/net/ethernet/chelsio/cxgb4vf/sge.c  | 18 ++
 6 files changed, 44 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
index 9040e13..6bde0b9 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
@@ -1202,6 +1202,12 @@ static inline void t4_write_reg(struct adapter *adap, 
u32 reg_addr, u32 val)
 writel(val, adap->regs + reg_addr);
 }
 
+static inline void t4_write_reg_relaxed(struct adapter *adap, u32 reg_addr,
+   u32 val)
+{
+   writel_relaxed(val, adap->regs + reg_addr);
+}
+
 #ifndef readq
 static inline u64 readq(const volatile void __iomem *addr)
 {
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index 7b452e8..276472d 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -1723,8 +1723,8 @@ int cxgb4_sync_txq_pidx(struct net_device *dev, u16 qid, 
u16 pidx,
 else
 val = PIDX_T5_V(delta);
 wmb();
-   t4_write_reg(adap, MYPF_REG(SGE_PF_KDOORBELL_A),
-    QID_V(qid) | val);
+   t4_write_reg_relaxed(adap, MYPF_REG(SGE_PF_KDOORBELL_A),
+    QID_V(qid) | val);
 }
 out:
 return ret;
@@ -1902,8 +1902,9 @@ static void enable_txq_db(struct adapter *adap, struct 
sge_txq *q)
  * are committed before we tell HW about them.
  */
 wmb();
-   t4_write_reg(adap, MYPF_REG(SGE_PF_KDOORBELL_A),
-    QID_V(q->cntxt_id) | PIDX_V(q->db_pidx_inc));
+   t4_write_reg_relaxed(adap, MYPF_REG(SGE_PF_KDOORBELL_A),
+    QID_V(q->cntxt_id) |
+   PIDX_V(q->db_pidx_inc));
 q->db_pidx_inc = 0;
 }
 q->db_disabled = 0;
@@ -2003,8 +2004,8 @@ static void sync_txq_pidx(struct adapter *adap, struct 
sge_txq *q)
 else
 val = PIDX_T5_V(delta);
 wmb();
-   t4_write_reg(adap, MYPF_REG(SGE_PF_KDOORBELL_A),
-    QID_V(q->cntxt_id) | val);
+   t4_write_reg_relaxed(adap, MYPF_REG(SGE_PF_KDOORBELL_A),
+    QID_V(q->cntxt_id) | val);
 }
 out:
 q->db_disabled = 0;
diff --git a/drivers/net/ethernet/chelsio/cxgb4/sge.c 
b/drivers/net/ethernet/chelsio/cxgb4/sge.c
index 6e310a0..7388aac 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/sge.c
@@ -530,11 +530,11 @@ static inline void ring_fl_db(struct adapter *adap, 
struct sge_fl *q)
  * mechanism.
  */
 if (unlikely(q->bar2_addr == NULL)) {
-   t4_write_reg(adap, MYPF_REG(SGE_PF_KDOORB

Re: [PATCH net] cxgb4: Fix stack out-of-bounds read due to wrong size to t4_record_mbox()

2017-08-25 Thread Casey Leedom
| From: Stefano Brivio 
| Sent: Friday, August 25, 2017 1:48 PM
|     
| Passing commands for logging to t4_record_mbox() with size
| MBOX_LEN, when the actual command size is actually smaller,
| causes out-of-bounds stack accesses in t4_record_mbox() while
| copying command words here:
| ...

Thanks for catching this.  Definitely a bug.  Odd because
that's not what I checked into our out-of-kernel repository.
And the corresponding code in the cxgb4vf driver is correct.

So yes, ACK!

Casey

Re: [PATCH v10 3/5] PCI: Disable Relaxed Ordering Attributes for AMD A1100

2017-08-14 Thread Casey Leedom
| From: Raj, Ashok <ashok@intel.com>
| Sent: Monday, August 14, 2017 10:19 AM
|     
| On Mon, Aug 14, 2017 at 11:44:57PM +0800, Ding Tianhong wrote:
| > Casey reported that the AMD ARM A1100 SoC has a bug in its PCIe
| > Root Port where Upstream Transaction Layer Packets with the Relaxed
| > Ordering Attribute clear are allowed to bypass earlier TLPs with
| > Relaxed Ordering set, which can cause Data Corruption, so we need
| > to disable the Relaxed Ordering Attribute on Upstream TLPs to the
| > Root Port.
| > 
| > Signed-off-by: Casey Leedom <lee...@chelsio.com>
| > Signed-off-by: Ding Tianhong <dingtianh...@huawei.com>
| > Acked-by: Alexander Duyck <alexander.h.du...@intel.com>
| > Acked-by: Ashok Raj <ashok@intel.com>
| 
| I can't ack this patch :-).. must be someone from AMD. Please remove my
| signature from this.

  You can go ahead and leave my name on since I'm the person who found and 
diagnosed the problem (with help from others inside Chelsio).

  If anyone on the Linux PCI List knows anyone at AMD who'd be willing to 
respond it would be great, but as I noted earlier, I think that AMD has 
effectively abandoned the A1100 ("Seattle") ARM SoC, so I doubt if we'll get 
anyone from AMD to even comment now.

Casey

Re: [PATCH v9 1/4] PCI: Add new PCIe Fabric End Node flag, PCI_DEV_FLAGS_NO_RELAXED_ORDERING

2017-08-09 Thread Casey Leedom
| From: Raj, Ashok <ashok@intel.com>
| Sent: Wednesday, August 9, 2017 11:00 AM
|
| On Wed, Aug 09, 2017 at 04:46:07PM +, Casey Leedom wrote:
| > | From: Raj, Ashok <ashok@intel.com>
| > | Sent: Wednesday, August 9, 2017 8:58 AM
| > | ...
| > | As Casey pointed out in an earlier thread, we choose the heavy hammer
| > | approach because there are some that can lead to data-corruption as
| > | opposed to perf degradation.
| >
| > Careful.  As far as I'm aware, there is no Data Corruption problem
| > whatsoever with Intel Root Ports and processing of Transaction Layer
| > Packets with and without the Relaxed Ordering Attribute set.
|
| That's right.. no data-corruption on Intel parts :-).. It was with another
| vendor. Only a performance issue with Intel root ports in the parts
| identified by the optimization guide.

Yes, I didn't want you to get into any trouble over that possible reading of
what you wrote.

Any progress on the "Chicken Bit" investigation?  Being able to disable the
non-optimal Relaxed Ordering "optimization" would be the best PCI Quirk of
all ...

Casey


Re: [PATCH v9 1/4] PCI: Add new PCIe Fabric End Node flag, PCI_DEV_FLAGS_NO_RELAXED_ORDERING

2017-08-09 Thread Casey Leedom
| From: Raj, Ashok 
| Sent: Wednesday, August 9, 2017 8:58 AM
| ...
| As Casey pointed out in an earlier thread, we choose the heavy hammer
| approach because there are some that can lead to data-corruption as opposed
| to perf degradation.

Careful.  As far as I'm aware, there is no Data Corruption problem
whatsoever with Intel Root Ports and processing of Transaction Layer Packets
with and without the Relaxed Ordering Attribute set.

The only issue which we've discovered with relatively recent Intel Root Port
implementations and the use of the Relaxed Ordering Attribute is a
performance issue.  To the best of our ability to analyze the PCIe traces,
it appeared that the Intel Root Complex delayed returning Link Flow Control
Credits resulting in lowered performance (total bandwidth).  When we used
Relaxed Ordering for Ingress Packet Data delivery on a 100Gb/s Ethernet
link with 1500-byte MTU, we were pegged at ~75Gb/s.  Once we disabled
Relaxed Ordering, we were able to deliver Ingress Packet Data to Host Memory
at the full link rate.

Casey


Re: [PATCH v9 1/4] PCI: Add new PCIe Fabric End Node flag, PCI_DEV_FLAGS_NO_RELAXED_ORDERING

2017-08-09 Thread Casey Leedom
| From: Ding Tianhong <dingtianh...@huawei.com>
| Sent: Wednesday, August 9, 2017 5:17 AM
|
| On 2017/8/9 11:02, Bjorn Helgaas wrote:
| >
| > On Wed, Aug 09, 2017 at 01:40:01AM +, Casey Leedom wrote:
| > >
| >> | From: Bjorn Helgaas <helg...@kernel.org>
| >> | Sent: Tuesday, August 8, 2017 4:22 PM
| >> | ...
| >> | It should also include a pointer to the AMD erratum, if available, or
| >> | at least some reference to how we know it doesn't obey the rules.
| >>
| >>   Getting an ACK from AMD seems like a forlorn cause at this point.  My
| >> contact was Bob Shaw <bob.s...@amd.com> and he stopped responding to me
| >> messages almost a year ago saying that all of AMD's energies were being
| >> redirected towards upcoming x86 products (likely Ryzen as we now know).
| >> As far as I can tell AMD has walked away from their A1100 (AKA
| >> "Seattle") ARM SoC.
| >>
| >>   On the specific issue, I can certainly write up somthing even more
| >> extensive than I wrote up for the comment in drivers/pci/quirks.c.
| >> Please review the comment I wrote up and tell me if you'd like
| >> something even more detailed -- I'm usually acused of writing comments
| >> which are too long, so this would be a new one on me ... :-)
| >
| > If you have any bug reports with info about how you debugged it and
| > concluded that Seattle is broken, you could include a link (probably
| > in the changelog).  But if there isn't anything, there isn't anything.
| ...
| OK, I could reorganize it, but I still need Casey to give me the link
| for the Seattle issue; otherwise I could remove the AMD part and wait
| until someone shows it. Thanks

There are no links and I was never given an internal bug number at AMD.  As
I said, they stopped responding to my notes about a year ago, saying that
they were refocusing all of their people and no longer had resources to
pursue the issue.  Hopefully for them, Ryzen doesn't have the same Data
Corruption problem ...

As for how we diagnosed it: with our Ingress Packet delivery, the Ingress
Packet Data is delivered (DMA Write) into Free List Buffers, and then a
small message (DMA Write) is sent to a "Response Queue" indicating delivery
of the Ingress Packet Data into the Free List Buffers.  The Transaction
Layer Packets which convey the Ingress Packet Data all have the Relaxed
Ordering Attribute set, while the following TLP carrying the Ingress Data
delivery notification into the Response Queue does not have the Relaxed
Ordering Attribute set.

The rules for processing TLPs with and without the Relaxed Ordering
Attribute set are covered in Section 2.4.1 of the PCIe 3.0 specification
(Revision 3.0, November 10, 2010).  Table 2-34 "Ordering Rules Summary"
covers the cases where one TLP may "pass" (be processed earlier than) a
preceding TLP.  In the case we're talking about, we have a sequence of one
or more Posted DMA Write TLPs with the Relaxed Ordering Attribute set and a
following Posted DMA Write TLP without the Relaxed Ordering Attribute set.
Thus we need to look at the Row A, Column 2 cell of Table 2-34 governing
when a Posted Request may "pass" a preceding Posted Request.  In that cell
we have:

a) No
b) Y/N

with the explanatory text:

A2a  A Posted Request must not pass another Posted Request
     unless A2b applies.

A2b  A Posted Request with RO[23] Set is permitted to pass
     another Posted Request[24].  A Posted Request with IDO
     Set is permitted to pass another Posted Request if the
     two Requester IDs are different.

[23] In this section, "RO" is an abbreviation for the Relaxed
     Ordering Attribute field.

[24] Some usages are enabled by not implementing this passing
     (see the No RO-enabled PR-PR Passing bit in Section
     7.8.15).

In our case, we were getting notifications of Ingress Packet Delivery in our
Response Queues, but not all of the Ingress Packet Data Posted DMA Write
TLPs had been processed yet by the Root Complex.  As a result, we were
picking up old stale memory data before those lagging Ingress Packet Data
TLPs could be processed.  This is a clear violation of the PCIe 3.0 TLP
processing rules outlined above.

Does that help?

Casey


Re: [PATCH v9 2/4] PCI: Disable PCIe Relaxed Ordering if unsupported

2017-08-09 Thread Casey Leedom
| From: Bjorn Helgaas 
| Sent: Tuesday, August 8, 2017 7:22 PM
| ...
| and the caller should do something like this:
| 
| if (pcie_relaxed_ordering_broken(pci_find_pcie_root_port(pdev)))
| adapter->flags |= ROOT_NO_RELAXED_ORDERING;
| 
| That way it's obvious where the issue is, and it's obvious that the
| answer might be different for peer-to-peer transactions than it is for
| transactions to the root port, i.e., to coherent memory.
| ...

Which is back to something very close to what I initially suggested in my
first patch submission.  Because you're right, this isn't about broken
Source Devices, it's about broken Completing Devices.

Unfortunately, as Alexander Duyck noted, in a Virtual Machine we won't be
able to see our Root Port in order to make this determination.  (And in some
Hypervisor implementations I've seen, there's not even a synthetic Root Port
available to the VM at all, let alone read-only access to the real one.)

So the current scheme was developed of having the Hypervisor kernel traverse
down the PCIe Fabric when it finds a broken Root Port implementation (the
issue we've primarily been focused upon) and turn off the PCIe Capability
Device Control[Relaxed Ordering Enable].  This was to serve two purposes:

 1. Turn that off in order to prevent sending any Transaction Layer
Packets with the Relaxed Ordering Attribute to any Completer.
Which unfortunately would also prevent Peer-to-Peer use of the
Relaxed Ordering Attribute.

 2. Act as a message to Device Drivers for those downstream devices
that they were dealing with a broken Root Port implementation.
And this would work even for a driver in a VM with an attached
device since it would be able to see the PCIe Configuration Space
for the attached device.

I haven't been excited about any of this because:

 A. While so far all of the examples we've talked about are broken
Root Port Completers, it's perfectly possible that other devices
could be broken -- say an NVMe Device which is not "Coherent
Memory".  How would this fit into the framework APIs being
described?

 B. I have yet to see an example of how the currently proposed
API framework would be used in a hybrid environment where
TLPs to the Root Port would not use Relaxed Ordering, but TLPs
to a Peer would use Relaxed Ordering.  So far it's all been about
using a "big hammer" to completely disable the use of Relaxed
Ordering.

But the VM problem keeps cropping up over and over.  A driver in a VM
doesn't have access to the Root Port to determine if it's on a "Black List"
and our only way of communicating with the driver in the VM is to leave the
device in a particular state (i.e. initialize the PCIe Capability Device
Control[Relaxed Ordering Enable] to "off").

Oh, and also, on the current patch submission's focus on broken Root Port
implementations: one could suggest that even if we're stuck with the "Device
attached to a VM Conundrum", what we should really be thinking about is
whether ANY device within a PCIe Fabric has broken Relaxed Ordering
completion problems and, if so, "poisoning" the entire containing PCIe
Fabric by turning off Relaxed Ordering Enable for every device -- up, down,
and sideways -- including the Root Port itself.

| ...
| This associates the message with the Requester that may potentially
| use relaxed ordering. But there's nothing wrong or unusual about the
| Requester; the issue is with the *Completer*, so I think the message
| should be in the quirk where we set PCI_DEV_FLAGS_NO_RELAXED_ORDERING.
| Maybe it should be both places; I dunno.
| 
| This implementation assumes the device only initiates transactions to
| coherent memory, i.e., it assumes the device never does peer-to-peer
| DMA.  I guess we'll have to wait and see if we trip over any
| peer-to-peer issues, then figure out how to handle them.
| ...

Yes, as soon as we want to implement the hybrid use of Relaxed Ordering I
mentioned above.  And that the Intel document mentions itself.

Casey


Re: [PATCH v9 1/4] PCI: Add new PCIe Fabric End Node flag, PCI_DEV_FLAGS_NO_RELAXED_ORDERING

2017-08-08 Thread Casey Leedom
| From: Bjorn Helgaas 
| Sent: Tuesday, August 8, 2017 4:22 PM
| 
| This needs to include a link to the Intel spec
| 
(https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf,
| sec 3.9.1).

  In the commit message or as a comment?  Regardless, I agree.  It's always
nice to be able to go back and see what the official documentation says.
However, that said, links on the internet are ... fragile as time goes by,
so we might want to simply quote section 3.9.1 in the commit message since
it's relatively short:

3.9.1 Optimizing PCIe Performance for Accesses Toward Coherent Memory
  and Toward MMIO Regions (P2P)

In order to maximize performance for PCIe devices in the processors
listed in Table 3-6 below, the software should determine whether the
accesses are toward coherent memory (system memory) or toward MMIO
regions (P2P access to other devices). If the access is toward MMIO
region, then software can command HW to set the RO bit in the TLP
header, as this would allow hardware to achieve maximum throughput for
these types of accesses. For accesses toward coherent memory, software
can command HW to clear the RO bit in the TLP header (no RO), as this
would allow hardware to achieve maximum throughput for these types of
accesses.

Table 3-6. Intel Processor CPU RP Device IDs for Processors Optimizing
   PCIe Performance

ProcessorCPU RP Device IDs

Intel Xeon processors based on   6F01H-6F0EH
Broadwell microarchitecture

Intel Xeon processors based on   2F01H-2F0EH
Haswell microarchitecture

| It should also include a pointer to the AMD erratum, if available, or
| at least some reference to how we know it doesn't obey the rules.

  Getting an ACK from AMD seems like a forlorn cause at this point.  My
contact was Bob Shaw  and he stopped responding to my
messages almost a year ago, saying that all of AMD's energies were being
redirected towards upcoming x86 products (likely Ryzen, as we now know).  As
far as I can tell AMD has walked away from their A1100 (AKA "Seattle") ARM
SoC.

  On the specific issue, I can certainly write up something even more
extensive than I wrote up for the comment in drivers/pci/quirks.c.  Please
review the comment I wrote up and tell me if you'd like something even more
detailed -- I'm usually accused of writing comments which are too long, so
this would be a new one on me ... :-)

| Ashok, thanks for chiming in.  Now that you have, I have a few more
| questions for you:

  I can answer a few of these:

|  - Is the above doc the one you mentioned as being now public?

  Yes.  Ashok worked with me to the extent he was allowed prior to the
publishing of the public technical note, but he couldn't say much.  (Believe
it or not, it is possible to say less than the quoted section above.)  When
the note was published, Patrick Cramer sent me the note about it and pointed
me at section 3.9.1.

|  - Is this considered a hardware erratum?

  I certainly consider it a Hardware Bug.  And I'm really hoping that Ashok
will be able to find a "Chicken Bit" which allows the broken feature to be
turned off.  Remember, the Relaxed Ordering Attribute on a Transaction Layer
Packet is simply a HINT.  It is perfectly reasonable for a compliant
implementation to simply ignore the Relaxed Ordering Attribute on an
incoming TLP Request.  The sole responsibility of a compliant implementation
is to return the exact same Relaxed Ordering and No Snoop Attributes in any
TLP Response.  (The rules for the ID-Based Ordering Attribute are more complex.)
 
  Earlier Intel Root Complexes did exactly this: they ignored the Relaxed
Ordering Attribute and there was no performance difference for
using/not-using it.  It's pretty obvious that an attempt was made to
implement optimizations surrounding the use of Relaxed Ordering and they
didn't work.

|  - If so, is there a pointer to that as well?

  Intel is historically tight-lipped about admitting any bugs/errata in
their products.  I'm guessing that the above-quoted Section 3.9.1 is likely
to be all we'll ever get.  The language above regarding TLPs targeting
Coherent Shared Memory is basically as much of an admission that they got it
wrong as we're going to get.  But heck, maybe we'll get lucky ...
especially with regard to the hoped-for "Chicken Bit" ...

|  - If this is not considered an erratum, can you provide any guidance
|about how an OS should determine when it should use RO?

  Software?  We don't need no stinking software!

  Sorry, I couldn't resist.

| Relying on a list of device IDs in an optimization manual is OK for an
| erratum, but if it's *not* an erratum, it seems like a hole in the specs
| because as far as I know there's no generic way for the OS to discover
| whether to use RO.

  Well, here's to hoping that Ashok and/or Patrick are able to offer 

Re: [PATCH v8 1/4] PCI: Add new PCIe Fabric End Node flag, PCI_DEV_FLAGS_NO_RELAXED_ORDERING

2017-08-04 Thread Casey Leedom
| From: Ding Tianhong 
| Sent: Thursday, August 3, 2017 6:44 AM
|
| diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
| index 6967c6b..1e1cdbe 100644
| --- a/drivers/pci/quirks.c
| +++ b/drivers/pci/quirks.c
| @@ -4016,6 +4016,44 @@ static void quirk_tw686x_class(struct pci_dev *pdev)
|quirk_tw686x_class);
|
|  /*
| + * Some devices have problems with Transaction Layer Packets with the Relaxed
| + * Ordering Attribute set.  Such devices should mark themselves and other
| + * Device Drivers should check before sending TLPs with RO set.
| + */
| +static void quirk_relaxedordering_disable(struct pci_dev *dev)
| +{
| +   dev->dev_flags |= PCI_DEV_FLAGS_NO_RELAXED_ORDERING;
| +}
| +
| +/*
| + * Intel E5-26xx Root Complex has a Flow Control Credit issue which can
| + * cause performance problems with Upstream Transaction Layer Packets with
| + * Relaxed Ordering set.
| + */
| +DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x6f02, 
PCI_CLASS_NOT_DEFINED, 8,
| + quirk_relaxedordering_disable);
| +DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x6f04, 
PCI_CLASS_NOT_DEFINED, 8,
| + quirk_relaxedordering_disable);
| +DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x6f08, 
PCI_CLASS_NOT_DEFINED, 8,
| + quirk_relaxedordering_disable);
| + ...

It looks like this is missing the set of Root Complex IDs that were noted in
the document to which Patrick Cramer sent us a reference:

https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf

In section 3.9.1 we have:

3.9.1 Optimizing PCIe Performance for Accesses Toward Coherent Memory
  and Toward MMIO Regions (P2P)

In order to maximize performance for PCIe devices in the processors
listed in Table 3-6 below, the software should determine whether the
accesses are toward coherent memory (system memory) or toward MMIO
regions (P2P access to other devices). If the access is toward MMIO
region, then software can command HW to set the RO bit in the TLP
header, as this would allow hardware to achieve maximum throughput for
these types of accesses. For accesses toward coherent memory, software
can command HW to clear the RO bit in the TLP header (no RO), as this
would allow hardware to achieve maximum throughput for these types of
accesses.

Table 3-6. Intel Processor CPU RP Device IDs for Processors Optimizing
   PCIe Performance

ProcessorCPU RP Device IDs

Intel Xeon processors based on   6F01H-6F0EH
Broadwell microarchitecture

Intel Xeon processors based on   2F01H-2F0EH
Haswell microarchitecture

The PCI Device IDs you have there are the first ones that I guessed at
having the performance problem with Relaxed Ordering.  We now apparently
have a complete list from Intel.

I don't want to phrase this as a "NAK" because you've gone around the
mulberry bush a bunch of times already.  So maybe just go with what you've
got in version 8 of your patch and then do a follow-on patch to complete the
table?

Casey

Re: [PATCH v7 2/3] PCI: Enable PCIe Relaxed Ordering if supported

2017-08-04 Thread Casey Leedom
| From: Raj, Ashok <ashok@intel.com>
| Sent: Friday, August 4, 2017 1:21 PM
|
| On Fri, Aug 04, 2017 at 08:20:37PM +, Casey Leedom wrote:
| > ...
| > As I've noted a number of times, it would be great if the Intel Hardware
| > Engineers who attempted to implement the Relaxed Ordering semantics in the
| > current generation of Root Complexes had left the ability to turn off the
| > logic which is obviously not working.  If there was a way to disable the
| > logic via an undocumented register, then we could have the Linux PCI Quirk
| > do that.  Since Relaxed Ordering is just a hint, it's completely legitimate
| > to completely ignore it.
|
| Suppose you are looking for the existence of a chicken bit to instruct the
| port to ignore RO traffic. So all we would do is turn the chicken bit on
| but would permit p2p traffic to be allowed since we won't turn off the
| capability as currently proposed.
|
| Let me look into that keep you posted.

Huh, I'd never heard it called a "chicken bit" before, but yes, that's what
I'm talking about.

Whenever our Hardware Designers implement new functionality in our hardware,
they almost always put in (a) several "knobs" which can control fundamental
parameters of the new Hardware Feature, and (b) a mechanism for completely
disabling it if necessary.  This stems from the incredibly long Design ->
Deployment cycle for Hardware (as opposed to the edit -> compile -> run
cycle for software).

It's obvious that handling Relaxed Ordering is a new Hardware Feature for
Intel's Root Complexes since previous versions simply ignored it (because
that's legal[1]).  If I was a Hardware Engineer tasked with implementing
Relaxed Ordering semantics for a device, I would certainly have also
implemented a switch to turn it off in case there were unintended
consequences (performance in this case).

And if there is such a mechanism to simply disable processing of Relaxed
Ordering semantics in the Root Complex, that would be a far easier "fix" for
this problem ... and leave the code in place to continue sending Relaxed
Ordering TLPs for a future Root Complex implementation which got it right ...

Casey

[1] One can't ~quite~ just ignore the Relaxed Ordering Attribute on an
incoming Transaction Layer Packet Request: PCIe completion rules (see
section 2.2.9 of the PCIe 3.0 specification) require that the Relaxed
Ordering and No Snoop Attributes in a Request TLP be reflected back
verbatim in any corresponding Response TLP.  (The rules for ID-Based
Ordering are more complex.)


Re: [PATCH v7 2/3] PCI: Enable PCIe Relaxed Ordering if supported

2017-08-04 Thread Casey Leedom
| From: Raj, Ashok 
| Sent: Thursday, August 3, 2017 1:31 AM
|
| I don't understand this completely.. So your driver would know not to send
| RO TLP's to root complex. But you want to send RO to the NVMe device? This
| is the peer-2-peer case correct?

Yes, this is the "heavy hammer" issue which you alluded to later.  There are
applications where a device will want to send TLPs to a Root Complex without
Relaxed Ordering set, but will want to use it when sending TLPs to a Peer
device (say, an NVMe storage device).  The current approach doesn't make
that easy ... and in fact, I still don't know how to code a solution for this
with the proposed APIs.  This means that we may be trading off one
performance problem for another and that Relaxed Ordering may be doomed for
use under Linux for the foreseeable future.

As I've noted a number of times, it would be great if the Intel Hardware
Engineers who attempted to implement the Relaxed Ordering semantics in the
current generation of Root Complexes had left the ability to turn off the
logic which is obviously not working.  If there was a way to disable the
logic via an undocumented register, then we could have the Linux PCI Quirk
do that.  Since Relaxed Ordering is just a hint, it's completely legitimate
to completely ignore it.

Casey


Re: [PATCH v7 2/3] PCI: Enable PCIe Relaxed Ordering if supported

2017-08-02 Thread Casey Leedom
  Okay, here you go.  As you can tell, it's almost a trivial copy of the
cxgb4 patch.
 
  By the way, I realized that we have yet another hole which is likely not
to be fixable.  If we're dealing with a problematic Root Complex, and we
instantiate Virtual Functions and attach them to a Virtual Machine along
with an NVMe device which can deal with Relaxed Ordering TLPs, the VF driver
in the VM will be able to tell that it shouldn't attempt to send RO TLPs to
the RC because it will see the state of its own PCIe Capability Device
Control[Relaxed Ordering Enable] (a copy of the setting in the VF's
corresponding PF), but it won't be able to change that and send non-RO TLPs
to the RC, and RO TLPs to the NVMe device.  Oh well.

  I sure wish that the Intel guys would pop up with a hidden register change
for these problematic Intel RCs that perform poorly with RO TLPs.  Their
silence has been frustrating.

Casey

--

The cxgb4vf Ethernet driver now queries PCIe Configuration Space to determine
if it can send TLPs to the Root Complex with the Relaxed Ordering Attribute set.

diff --git a/drivers/net/ethernet/chelsio/cxgb4vf/adapter.h 
b/drivers/net/ethernet/chelsio/cxgb4vf/adapter.h
index 109bc63..08c6ddb 100644
--- a/drivers/net/ethernet/chelsio/cxgb4vf/adapter.h
+++ b/drivers/net/ethernet/chelsio/cxgb4vf/adapter.h
@@ -408,6 +408,7 @@ enum { /* adapter flags */
USING_MSI  = (1UL << 1),
USING_MSIX = (1UL << 2),
QUEUES_BOUND   = (1UL << 3),
+   ROOT_NO_RELAXED_ORDERING = (1UL << 4),
 };
 
 /*
diff --git a/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c 
b/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c
index ac7a150..59e7639 100644
--- a/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c
@@ -2888,6 +2888,24 @@ static int cxgb4vf_pci_probe(struct pci_dev *pdev,
 */
adapter->name = pci_name(pdev);
adapter->msg_enable = DFLT_MSG_ENABLE;
+
+   /* If possible, we use PCIe Relaxed Ordering Attribute to deliver
+* Ingress Packet Data to Free List Buffers in order to allow for
+* chipset performance optimizations between the Root Complex and
+* Memory Controllers.  (Messages to the associated Ingress Queue
+* notifying new Packet Placement in the Free Lists Buffers will be
+* sent without the Relaxed Ordering Attribute, thus guaranteeing that
+* all preceding PCIe Transaction Layer Packets will be processed
+* first.)  But some Root Complexes have various issues with Upstream
+* Transaction Layer Packets with the Relaxed Ordering Attribute set.
+* The PCIe devices which under the Root Complexes will be cleared the
+* Relaxed Ordering bit in the configuration space, So we check our
+* PCIe configuration space to see if it's flagged with advice against
+* using Relaxed Ordering.
+*/
+   if (!pcie_relaxed_ordering_supported(pdev))
+   adapter->flags |= ROOT_NO_RELAXED_ORDERING;
+
err = adap_init0(adapter);
if (err)
goto err_unmap_bar;
diff --git a/drivers/net/ethernet/chelsio/cxgb4vf/sge.c b/drivers/net/ethernet/chelsio/cxgb4vf/sge.c
index e37dde2..05498e7 100644
--- a/drivers/net/ethernet/chelsio/cxgb4vf/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb4vf/sge.c
@@ -2205,6 +2205,7 @@ int t4vf_sge_alloc_rxq(struct adapter *adapter, struct sge_rspq *rspq,
struct port_info *pi = netdev_priv(dev);
struct fw_iq_cmd cmd, rpl;
int ret, iqandst, flsz = 0;
+   int relaxed = !(adapter->flags & ROOT_NO_RELAXED_ORDERING);
 
/*
 * If we're using MSI interrupts and we're not initializing the
@@ -2300,6 +2301,8 @@ int t4vf_sge_alloc_rxq(struct adapter *adapter, struct sge_rspq *rspq,
cpu_to_be32(
FW_IQ_CMD_FL0HOSTFCMODE_V(SGE_HOSTFCMODE_NONE) |
FW_IQ_CMD_FL0PACKEN_F |
+   FW_IQ_CMD_FL0FETCHRO_V(relaxed) |
+   FW_IQ_CMD_FL0DATARO_V(relaxed) |
FW_IQ_CMD_FL0PADEN_F);
 
/* In T6, for egress queue type FL there is internal overhead


Re: [PATCH v7 2/3] PCI: Enable PCIe Relaxed Ordering if supported

2017-07-27 Thread Casey Leedom
| From: Ding Tianhong <dingtianh...@huawei.com>
| Sent: Wednesday, July 26, 2017 6:01 PM
|
| On 2017/7/27 3:05, Casey Leedom wrote:
| >
| > Ding, send me a note if you'd like me to work that [cxgb4vf patch] up
| > for you.
|
| Ok, you could send the change log and I could put it in the v8 version
| together, will you base on the patch 3/3 or build a independence patch?

Which ever you'd prefer.  It would basically mirror the same exact code that
you've got for cxgb4.  I.e. testing the setting of the VF's PCIe Capability
Device Control[Relaxed Ordering Enable], setting a new flag in
adpater->flags, testing that flag in cxgb4vf/sge.c:t4vf_sge_alloc_rxq().
But since the VF's PF will already have disabled the PF's Relaxed Ordering
Enable, the VF will also have its Relaxed Ordering Enable disabled and any
effort by the internal chip to send TLPs with the Relaxed Ordering Attribute
will be gated by the PCIe logic.  So it's not critical that this be in the
first patch.  Your call.  Let me know if you'd like me to send that to you.


| From: Ding Tianhong <dingtianh...@huawei.com>
| Sent: Wednesday, July 26, 2017 6:08 PM
|
| On 2017/7/27 2:26, Casey Leedom wrote:
| >
| >  1. Did we ever get any acknowledgement from either Intel or AMD
| > on this patch?  I know that we can't ensure that, but it sure would
| > be nice since the PCI Quirks that we're putting in affect their
| > products.
|
| Still no Intel and AMD guys has ack this, this is what I am worried about,
| should I ping some man again ?

By amusing coincidence, Patrik Cramer (now Cc'ed) from Intel sent me a note
yesterday with a link to the official Intel performance tuning documentation
which covers this issue:

https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf

In section 3.9.1 we have:

3.9.1 Optimizing PCIe Performance for Accesses Toward Coherent Memory
  and Toward MMIO Regions (P2P)

In order to maximize performance for PCIe devices in the processors
listed in Table 3-6 below, the software should determine whether the
accesses are toward coherent memory (system memory) or toward MMIO
regions (P2P access to other devices). If the access is toward MMIO
region, then software can command HW to set the RO bit in the TLP
header, as this would allow hardware to achieve maximum throughput for
these types of accesses. For accesses toward coherent memory, software
can command HW to clear the RO bit in the TLP header (no RO), as this
would allow hardware to achieve maximum throughput for these types of
accesses.

Table 3-6. Intel Processor CPU RP Device IDs for Processors Optimizing
   PCIe Performance

ProcessorCPU RP Device IDs

Intel Xeon processors based on   6F01H-6F0EH
Broadwell microarchitecture

Intel Xeon processors based on   2F01H-2F0EH
Haswell microarchitecture

Unfortunately that's a pretty thin section.  But it does expand the set of
Intel Root Complexes for which our Linux PCI Quirk will need to cover.  So
you should add those to the next (and hopefully final) spin of your patch.
And, it also verifies the need to handle the use of Relaxed Ordering more
subtlely than simply turning it off since the NVMe peer-to-peer example I
keep bringing up would fall into the "need to use Relaxed Ordering" case ...

It would have been nice to know why this is happening and if any future
processor would fix this.  After all, Relaxed Ordering, is just supposed to
be a hint.  At worst, a receiving device could just ignore the attribute
entirely.  Obviously someone made an effort to implement it but ... it
didn't go the way they wanted.

And, it also would have been nice to know if there was any hidden register
in these Intel Root Complexes which can completely turn off the effort to
pay attention to the Relaxed Ordering Attribute.  We've spend an enormous
amount of effort on this issue here on the Linux PCI email list struggling
mightily to come up with a way to determine when it's
safe/recommended/not-recommended/unsafe to use Relaxed Ordering when
directing TLPs towards the Root Complex.  And some architectures require RO
for decent performance so we can't just "turn it off" unilaterally.

Casey


Re: [PATCH v7 2/3] PCI: Enable PCIe Relaxed Ordering if supported

2017-07-26 Thread Casey Leedom
| From: Alexander Duyck <alexander.du...@gmail.com>
| Sent: Wednesday, July 26, 2017 11:44 AM
| 
| On Jul 26, 2017 11:26 AM, "Casey Leedom" <lee...@chelsio.com> wrote:
| |
| |     I think that the patch will need to be extended to modify
| |     drivers/pci.c/iov.c:sriov_enable() to explicitly turn off
| |     Relaxed Ordering Enable if the Root Complex is marked
|     for no RO TLPs.
| 
| I'm not sure that would be an issue. Wouldn't most VFs inherit the PF's settings?

Ah yes, you're right.  This is covered in section 3.5.4 of the Single Root I/O
Virtualization and Sharing Specification, Revision 1.0 (September 11, 2007),
governing the PCIe Capability Device Control register.  It states that the VF
version of that register shall follow the setting of the corresponding PF.

So we should enhance the cxgb4vf/sge.c:t4vf_sge_alloc_rxq() in the same
way we did for the cxgb4 driver, but that's not critical since the Relaxed
Ordering Enable supersedes the internal chip's desire to use the Relaxed
Ordering Attribute.

Ding, send me a note if you'd like me to work that up for you.

| Also I thought most of the VF configuration space is read only.

Yes, but not all of it.  And when a VF is exported to a Virtual Machine,
then the Hypervisor captures and interprets all accesses to the VF's
PCIe Configuration Space from the VM.

Thanks again for reminding me of the subtle aspect of the SR_IOV
specification that I forgot.

Casey

Re: [PATCH v7 2/3] PCI: Enable PCIe Relaxed Ordering if supported

2017-07-26 Thread Casey Leedom
  By the way Ding, two issues:

 1. Did we ever get any acknowledgement from either Intel or AMD
on this patch?  I know that we can't ensure that, but it sure would
be nice since the PCI Quirks that we're putting in affect their
products.

 2. I just realized that there's still a small hole in the patch with
respect to PCIe SR-IOV Virtual Functions.  When the PCI Quirk
notices a problematic PCIe Device and marks it to note that
it's not "happy" receiving Transaction Layer Packets with the
Relaxed Ordering Attribute set, it's my understanding that your
current patch iterates down the PCIe Fabric and turns off the
PCIe Capability Device Control[Relaxed Ordering Enable].
But this scan may miss any SR-IOV VFs because they
probably won't be instantiated at that time.  And, at a later
time, when they are instantiated, they could well have their
Relaxed Ordering Enable set.

I think that the patch will need to be extended to modify
drivers/pci.c/iov.c:sriov_enable() to explicitly turn off
Relaxed Ordering Enable if the Root Complex is marked
for no RO TLPs.

Casey

Re: [PATCH v7 3/3] net/cxgb4: Use new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag

2017-07-14 Thread Casey Leedom
Reviewed-by: Casey Leedom <lee...@chelsio.com>


Re: [PATCH v7 3/3] net/cxgb4: Use new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag

2017-07-13 Thread Casey Leedom
[[ Sorry for the Double Send: I forgot to switch to Plain Text.  Have I 
mentioned how much I hate modern Web-based email agents? :-) -- Casey ]]

  Yeah, I think this works for now.  We'll stumble over what to do when we
want to mix upstream TLPs without Relaxed Ordering Attributes directed at
problematic Root Complexes, and Peer-to-Peer TLPs with Relaxed Ordering
Attributes ... or vice versa depending on which target PCIe Device has issues
with Relaxed Ordering.

  Thanks for all the work!

Casey


Re: [PATCH v6 0/3] Add new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag

2017-07-12 Thread Casey Leedom
| From: Ding Tianhong 
| Sent: Wednesday, July 12, 2017 6:18 PM
| 
| If no other more suggestion, I will send a new version and remove the
| enable_pcie_relaxed_ordering(), thanks.  :)

   Sounds good to me.  (And sorry for forgetting to justify that last message.
I hate working with web-based email agents.)

Casey

Re: [PATCH v6 0/3] Add new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag

2017-07-12 Thread Casey Leedom
  Sorry again for the delay.  This time at least partially caused by a 
Chelsio-internal Customer Support request to simply disable Relaxed Ordering 
entirely due to the performance issues with our 100Gb/s product and relatively 
recent Intel Root Complexes.  Our Customer Support people are tired of asking 
customers to try turning off Relaxed Ordering. (sigh)

  So, first off, I've mentioned a couple of times that the current cxgb4 driver 
hardwired the PCIe Capability Device Control[Relaxed Ordering Enable] on.  
Here's the code which does it:

drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c:4657:

static void enable_pcie_relaxed_ordering(struct pci_dev *dev)
{
pcie_capability_set_word(dev, PCI_EXP_DEVCTL, PCI_EXP_DEVCTL_RELAX_EN);
}

This is called from the PCI Probe() routine init_one() later in that file.  I 
just wanted to make sure people knew about this.  Obviously given our current 
very difficult thread, this would either need to be diked out or changed or a 
completely different mechanism put in place.

  Second, just to make sure everyone's on the same page, the above simply 
allows the device to send TLPs with the Relaxed Ordering Attribute.  It doesn't 
cause TLPs to suddenly all be sent with RO set.  The use of Relaxed Ordering is 
selective.  For instance, in our hardware we can configure the RX Path to use 
RO on Ingress Packet Data delivery to Free List Buffers, but not use RO for 
delivery of messages noting newly delivered Ingress Packet Data.  Doing this 
allows the destination PCIe target to [potentially] optimize the DMA Writes to 
it based on local conditions (memory controller channel availability, etc.), 
but ensure that the message noting newly delivered Ingress Packet Data isn't 
processed till all of the preceding TLPs with RO set containing Ingress Packet 
Data have been processed.  (This by the way is the essence of the AMD A1100 ARM 
SoC bug: its Root Complex isn't obeying that PCIe ordering rule.)

  Third, as noted above, I'm getting a lot of pressure to get this addressed 
sooner than later, so I think that we should go with something fairly simple 
along the lines that you guys are proposing and I'll stop whining about the 
problem of needing to handle Peer-to-Peer with Relaxed Ordering while not using 
it for deliveries to the Root Complex.  We can just wait for that kettle of 
fish to explode on us and deal with the mess then.  (Hhmmm, the mixed metaphor 
landed in an entirely different place than I originally intended ... :-))

  If we try to stick as closely to Ding's latest patch set as possible, then we 
can probably just add the diff to remove the enable_pcie_relaxed_ordering() 
code in cxgb4_main.c.

Casey

Re: [PATCH v6 0/3] Add new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag

2017-07-10 Thread Casey Leedom

Hey Alexander,

  Okay, I understand your point regarding the "most likely scenario" being
TLPs directed upstream to the Root Complex.  But I'd still like to make sure
that we have an agreed upon API/methodology for doing Peer-to-Peer with
Relaxed Ordering and no Relaxed Ordering to the Root Complex.  I don't see
how the proposed APIs can be used in that fashion.
 
  Right now the proposed change for cxgb4 is for it to test its own PCIe
Capability Device Control[Relaxed Ordering Enable] in order to use that
information to program the Chelsio Hardware to emit/not emit upstream TLPs
with the Relaxed Ordering Attribute set.  But if we're going to have the
mixed mode situation I describe, the PCIe Capability Device Control[Relaxed
Ordering Enable] will have to be set which means that we'll be programming
the Chelsio Hardware to send upstream TLPs with Relaxed Ordering Enable to
the Root Complex which is what we were trying to avoid in the first place ...

  [[ And, as I noted on Friday evening, the current cxgb4 Driver hardwires
 the Relaxed Ordering Enable on early during device probe, so that
 would minimally need to be addressed even if we decide that we don't
 ever want to support mixed mode Relaxed Ordering. ]]

  We need some method of telling the Chelsio Driver that it should/shouldn't
use Relaxed Ordering with TLPs directed at the Root Complex.  And the same
is true for a Peer PCIe Device.

  It may be that we should approach this from the completely opposite
direction and instead of having quirks which identify problematic devices,
have quirks which identify devices which would benefit from the use of
Relaxed Ordering (if the sending device supports that).  That is, assume the
using Relaxed Ordering shouldn't be done unless the target device says "I
love Relaxed Ordering TLPs" ...  In such a world, an NVMe or a Graphics
device might declare love of Relaxed Ordering and the same for a SPARC Root
Complex (I think that was your example).

  By the way, the sole example of Data Corruption with Relaxed Ordering is
the AMD A1100 ARM SoC and AMD appears to have given up on that almost as
soon as it was released.  So what we're left with currently is a performance
problem on modern Intel CPUs ...  (And hopefully we'll get a Technical
Publication on that issue fairly soon.)

Casey


Re: [PATCH v6 0/3] Add new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag

2017-07-07 Thread Casey Leedom
  Okay, thanks for the note Alexander.  I'll have to look more closely at
the patch on Monday and try it out on one of the targeted systems to verify
the semantics you describe.

  However, that said, there is no way to tell a priori where a device will
send TLPs.  To simply assume that all TLPs will be directed towards the Root
Complex is a big assumption.  Only the device and the code controlling it
know where the TLPs will be directed.  That's why there are changes required
in the cxgb4 driver.  For instance, the code in
drivers/net/ethernet/chelsio/cxgb4/sge.c: t4_sge_alloc_rxq() knows that
it's allocating Free List Buffers in Host Memory and that the RX Queues that
it's allocating in the Hardware will eventually send Ingress Data to those
Free List Buffers.  (And similarly for the Free List Buffer Pointer Queue
with respect to DMA Reads from the host.)  In that routine we explicitly
configure the Hardware to use/not-use the Relaxed Ordering Attribute via the
FW_IQ_CMD_FL0FETCHRO and FW_IQ_CMD_FL0DATARO flags.  Basically we're
conditionally setting them based on the desirability of sending Relaxed
Ordering TLPs to the Root Complex.  (And we would perform the same kind of
check for an nVME application ... which brings us to ...)

  And what would be the code using these patch APIs to set up a Peer-to-Peer
nVME-style application?  In that case we'd need the Chelsio adapter's PCIe
Capability Device Control[Relaxed Ordering Enable] set for the nVME
application ... and we would avoid programming the Chelsio Hardware to use
Relaxed Ordering for TLPs directed at the Root Complex.  Thus we would be in
a position where some TLPs being emitted by the device to Peer devices would
have Relaxed Ordering set and some directed at the Root Complex would not.
And the only way for that to work is if the source device's Device
Control[Relaxed Ordering Enable] is set ...

  Finally, setting aside my disagreements with the patch, we still have the
code in the cxgb4 driver which explicitly turns on its own Device
Control[Relaxed Ordering Enable] in cxgb4_main.c:
enable_pcie_relaxed_ordering().  So the patch is something of a loop if all
we're doing is testing our own Relaxed Ordering Enable state ...
 
Casey


Re: [PATCH v6 0/3] Add new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag

2017-07-07 Thread Casey Leedom
  By the way, it ~seems~ like the patch set confuses the idea of the PCIe 
Capability Device Control[Relaxed Ordering Enable] with the device's ability to 
handle incoming TLPs with the Relaxed Ordering Attribute set.  These are 
completely different things.  The PCIe Capability Device Control[Relaxed 
Ordering Enable] solely governs the ability of the device to _send_ TLPs with 
the Relaxed Ordering Attribute set.  It has nothing whatsoever to do with the 
handling of incoming TLPs with the Relaxed Ordering Attribute set.  In fact, 
there isn't any standard way to disable receipt processing of such TLPs.  That's
kind of the whole point of the majority of the patch.  If there were a separate
"Ignore Incoming Relaxed Ordering Attributes" control, then we'd mostly just have
a quirk which would set that for problematic devices.

  Worse yet, if I'm reading the patch right, it _is_ turning off the PCIe 
Capability Device Control[Relaxed Ordering Enable] for the devices that we've 
identified with problematic receive handling.  Not only does this not solve the 
problem that we've been talking about, it actually almost certainly introduces 
a huge Graphics Performance Bug.  The Relaxed Ordering Attribute was originally 
developed for Graphics Device support in order to download textures, etc. to a 
graphics devices as fast as possible and only ensure ordering at points where 
needed.  For instance, as far as I know, the Intel Root Complexes that we've 
been talking about have no issues whatsoever in generating downstream TLPs with 
the Relaxed Ordering Attribute set.  By turning off the PCIe Capability Device 
Control[Relaxed Ordering Enable] on these Root Complexes, this patch prevents 
the generation of downstream TLPs with the Relaxed Ordering Attribute set 
targeted at devices like graphics adapters.

Casey

Re: [PATCH v6 0/3] Add new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag

2017-07-07 Thread Casey Leedom
Hey Ding, Bjorn, Alexander, et.al.,

  Sorry for the insane delay in getting to this and thanks especially to
Ding for picking up the ball on this feature.  I got side-tracked into a
multi-week rewrite of our Firmware/Host Driver Port Capabilities code, then
to the recent Ethernet Plug-Fest at the University of New Hampshire for a
week, and finally the 4th of July weekend.  Digging out from under email has
been non-amusing.  In any case, enough excuses.
 
  I'm looking at the "v6" version of the patches.  If this isn't the correct
version, please yell at me and I'll look at the correct one immediately.

  In the change to drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c (patch
3/3) it looks like the code is checking the Chelsio Adapter PCI Device to
see if Relaxed Ordering is supported.  But what we really want to test is
the Root Complex port.  I.e. something like:

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index 38a5c67..546538d 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -4620,6 +4620,7 @@ static int init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
struct port_info *pi;
bool highdma = false;
struct adapter *adapter = NULL;
+   struct pci_dev *root;
struct net_device *netdev;
void __iomem *regs;
u32 whoami, pl_rev;
@@ -4726,6 +4727,24 @@ static int init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
adapter->msg_enable = DFLT_MSG_ENABLE;
memset(adapter->chan_map, 0xff, sizeof(adapter->chan_map));

+   /* If possible, we use PCIe Relaxed Ordering Attribute to deliver
+* Ingress Packet Data to Free List Buffers in order to allow for
+* chipset performance optimizations between the Root Complex and
+* Memory Controllers.  (Messages to the associated Ingress Queue
+* notifying new Packet Placement in the Free List Buffers will be
+* sent without the Relaxed Ordering Attribute, thus guaranteeing that
+* all preceding PCIe Transaction Layer Packets will be processed
+* first.)  But some Root Complexes have various issues with Upstream
+* Transaction Layer Packets with the Relaxed Ordering Attribute set.
+* PCIe devices under such Root Complexes will have the Relaxed
+* Ordering bit cleared in their configuration space, so we check our
+* PCIe configuration space to see if it's flagged with advice against
+* using Relaxed Ordering.
+*/
+   root = pci_find_pcie_root_port(pdev);
+   if (!pcie_relaxed_ordering_supported(root))
+   adapter->flags |= ROOT_NO_RELAXED_ORDERING;
+
spin_lock_init(&adapter->stats_lock);
spin_lock_init(&adapter->tid_release_lock);
spin_lock_init(&adapter->win0_lock);

The basic idea is that we want to find out if the Root Complex is okay with
TLPs directed its way with the Relaxed Ordering Attribute set.

  I also have to say that I'm a bit confused by the
pcie_relaxed_ordering_supported() in use here.  It seems that it's testing
whether the PCI Device's own Relaxed Ordering Enable is set which governs
whether it will send TLPs with the Relaxed Ordering Attribute set.  It has
nothing to do with whether or not the device responds well to incoming TLPs
with the Relaxed Ordering Attribute set.  I think that we really want to use
the pci_dev_should_disable_relaxed_ordering() API on the Root Complex Port
because that tests whether the quirk fired on it?  I.e.:

+   root = pci_find_pcie_root_port(pdev);
+   if (!pci_dev_should_disable_relaxed_ordering(root))
+   adapter->flags |= ROOT_NO_RELAXED_ORDERING;


Casey


Re: [PATCH net-next 1/3] net: ethtool: add support for forward error correction modes

2017-07-06 Thread Casey Leedom
| From: Andrew Lunn 
| Sent: Thursday, July 6, 2017 4:15 PM
|     
| > Even this feels too extreme for me.  I think users would hate it.  They
| > did an ifup and swapped cables.  They expect the OS/Driver/Firmware
| > to continue trying to honor their ifup request.
| 
| Lets take a look around at other subsystems
| ...
| Do you know of any subsystem that tried to keep its configuration
| across a hot unplug/plug event? I suspect they all require user space
| to take some action to get the newly plugged hardware into operation.

I agree ... -ish ... :-)

If you choose to think of a cable unplug/plug event as "hot plug", then
the "reset" is the model that feels right.  But I'll note that this is also
presupposing what the right model is for users.  This is akin
to trying to decide what to make for dinner and deciding that a
hammer is the right tool.  If we end up deciding on Cracked
Crab (or Tree Nuts for the vegans amongst us), then the
hammer is quite possibly a good answer.

All I can continue to say is: keep on thinking about users.  How they're
using networking devices now, how they think about them, how we're
going to explain to them whatever changes they'll need to make to
their usage of network ports.  And yes, how this will fit into management
models for other OSes, etc.

The customers aren't always right, but their opinion matters a lot.

Casey

Re: [PATCH net-next 1/3] net: ethtool: add support for forward error correction modes

2017-07-06 Thread Casey Leedom
| From: Jakub Kicinski <jakub.kicin...@netronome.com>
| Sent: Thursday, July 6, 2017 3:43 PM
|     
| On Thu, 6 Jul 2017 21:53:46 +, Casey Leedom wrote:
| > 
| > But, and far more importantly, ideally _*ANY*_ such decision is made at an
| > architectural level to apply to all Link Parameters and Vendor Products.
| > The last thing a user wants to deal with is a hodge-podge of different
| > policies for different adapters from different vendors.
| 
| Agreed.  Once we decided we should make the expected behaviour very
| clear to everyone.
| 
| Sorry if I'm misunderstanding - are you suggesting we should keep the
| speed settings if hand-selected?  My feeling is those should be reset
| if they are incompatible with the module.

Again, just to be perfectly clear, I don't think that there's a perfect
solution to this.  My personal feeling is that we need to think through all
of the usage scenarios and see what happens in the various models,
and more importantly, how easily we can explain the inevitable
repercussions to users.  Again, the "simplest model wins" philosophy.

But to your specific question: yes, I am saying that if the user selected
25Gb/s and accidentally swapped in a 10Gb/s cable, and then
swapped a different {100,25}Gb/s cable back in, the advertised
Speed(s) should be 25Gb/s with the newest cable, respecting
the original Link Parameter setting with the first cable.

And again, I don't believe that we'll come up with a perfect solution.
But hopefully we can come up with an abstraction model that's
easy to understand and use.  (And requires the fewest changes
to current accepted practices.)

| Hm.  I'm beginning to come around on this.  If my understanding of PHY
| sub-layers is correct the FEC layer shouldn't be reset on module
| unplug.  OTOH when someone replaces a DAC with an optical module,
| keeping FEC around is not going to do any good to users...

When I first stumbled across this issue several months ago I though
I must be dense.  Nothing this simple should be this complicated.
It's been extremely gratifying to find others similarly flummoxed ... :-)

Casey

Re: [PATCH net-next 1/3] net: ethtool: add support for forward error correction modes

2017-07-06 Thread Casey Leedom
| From: Andrew Lunn <and...@lunn.ch>
| Sent: Thursday, July 6, 2017 3:33 PM
|     
| On Thu, Jul 06, 2017 at 09:53:46PM +, Casey Leedom wrote:
| > 
| > But, and far more importantly, ideally _*ANY*_ such decision is made at an
| > architectural level to apply to all Link Parameters and Vendor Products.
| > The last thing a user wants to deal with is a hodge-podge of different
| > policies for different adapters from different vendors.
| 
| Yes.
| 
| SFP needs to becomes a Linux device, similar to Copper PHYs are Linux
| devices. With some core code which all drivers can use, implement
| ethtool --dump-module-eeprom, report speeds to the MAC using
| adjust_link, etc..

Okay.  This "feels" like an implementation approach rather than a model
abstraction commentary, but it sounds like it would help ensure consistency
across vendors, etc.

| > how do users conceive of a "Port"?
| 
| For a user, it is something they configure via /etc/network/interfaces
| and then use ifup/ifdown on.
| 
| ...
| 
| And this is where it gets interesting, as you say. We are into a
| hotplug model.
| 
| I think you also need to define 'cable' here. I assume you are not
| talking about a piece of CAT 5 or glass fibre. You mean something
| which is active. You are putting a different module into the SFP cage.

Yes, I'm talking about active Transceiver Modules whether welded
onto the ends of copper-based cables or Optical Transceiver Modules.

| The extreme model would be, if you pull the module out, the whole
| netdev is hot-unplugged. Plug a different modules in, the netdev is
| hot-plugged. The user has to ifup it again, and would get an error
| message if the configuration is invalid.
| 
| But i think this is too extreme.

I agree.  This would force a completely new set of behavior on all users.
And it wouldn't match the paradigms used by any other OS.  (Yes,
I know that Linux tends to ignore such issues, but users do live in
mixed shops and it would be cruel to them to force radically different
operating paradigms between the systems they need to use.)

| I think the sfp device needs to give a hotplug event on unplug/plug.
| A hot-unplug would result in an ifdown. And within the kernel, the
| netdev is set down. If there is an "allow-hotplug" statement in
| /etc/network/interfaces, on hot-plug, udev would try to ifup and get
| an error and it will stay down. Without the "allow-hotplug" the
| interface remains configured down until the user does an ifup and
| would see an error message if the configuration is invalid.

Even this feels too extreme for me.  I think users would hate it.  They
did an ifup and swapped cables.  They expect the OS/Driver/Firmware
to continue trying to honor their ifup request.

Casey

Re: [PATCH net-next 1/3] net: ethtool: add support for forward error correction modes

2017-07-06 Thread Casey Leedom
| From: Wyborny, Carolyn 
| Sent: Thursday, July 6, 2017 3:16 PM
| 
| Agree with your points generally, but other networking hw vendors have
| dealt with this since SFP variant and other modules arrived on the scene. 

The only case of this of which I'm aware is the SFP+ RJ45 1Gb/s
Transceiver Module.  But yes, that presaged what we're looking at
now as a much simpler example.

Casey

Re: [PATCH net-next 1/3] net: ethtool: add support for forward error correction modes

2017-07-06 Thread Casey Leedom
| From: Jakub Kicinski 
| Sent: Thursday, July 6, 2017 12:02 PM
|
| IMHO if something gets replugged all the settings should be reset.
| I feel that it's not entirely unlike replugging a USB adapter.  Perhaps
| we should introduce some (devlink) notifications for SFP module events
| so userspace can apply whatever static user config it has?

Absolutely a valid approach.  As are all of the ones I outlined.

But, and far more importantly, ideally _*ANY*_ such decision is made at an
architectural level to apply to all Link Parameters and Vendor Products.
The last thing a user wants to deal with is a hodge-podge of different
policies for different adapters from different vendors.
 
As I noted in my previous letter: this is something new that we've never
faced before with any prior networking technology.  All previous networking
technologies had a static set of Physical Port Capabilities fixed from the
moment a Host Diver/Firmware first see a Port.  We're now facing a situation
where these can change dynamically from moment to moment based on what
Transceiver Module is inserted.

With regard to this "architectural" issue, one way of trying to tease out
what model will be the simplest for users to work with is to ask: how do
users conceive of a "Port"?  I.e. when a user requests that a particular
Link Parameter be applied to a Port, are they thinking that it only applies
to the current instantaneous combination of Adapter Transceiver Module Cage
+ Transceiver Module?  Or do they conceptualize a "Port" as being a higher
level entity?

Or, let's make it Very Concrete with a specific example:

 1. User applies some set of Link Parameters.

 2. User attempts to bring Link up but it doesn't come up.

 3. User decides to try a different cable on the grounds that the first
cable may be bad.

 4. New cable is accidentally of a completely different type with completely
different subsequent Physical Port Capabilities, not capable of supporting
the user's selected Link Parameters.

 5. User realizes this accidental mis-selection, and swaps in a third cable,
similar to the first cable.

 6. User attempts to bring the Link up with the third cable.

If we reset all of the user-specified Link Parameters at point #3 and/or #5,
then the user's requested Link Parameter settings from point #1 will no
longer be in effect at point #6.  Is this expected/desirable?

In our discussions (Dustin, I, and others), we felt that the answer to this
above scenario question was "no".  I.e. that users would have a persistent
memory of having applied a set of Link Parameter settings to the "Port" and
would be annoyed/confused if those settings were "arbitrarily" reset at some
indefinite time.

But, this is a discussion and a decision that needs to be made.  Regardless
of what people finally decide is the "Right Answer", it needs to be
extremely well documented, it needs to be executed uniformly, and we may
want to explicitly notify the user at the points where something
unexpected/confusing is occurring.  In the "Reset Unsupportable Link
Parameters When Transceiver Module Is Changed" model that you advocate, for
example, that would be the moment when the user's Link Parameter settings are
reset.

Casey


Re: [PATCH net-next 1/3] net: ethtool: add support for forward error correction modes

2017-07-06 Thread Casey Leedom
| From: Jakub Kicinski 
| Sent: Wednesday, June 28, 2017 6:00 PM
|     
| On Wed, 28 Jun 2017 14:47:51 -0700, Dustin Byford wrote:
| > 
| > You're not the first, or the second to ask that question.  I agree it
| > could use clarification.
| > 
| > I always read auto in this context as automatic rather than autoneg.
| > The best I can come up with is to perhaps fully spell out "automatic" in
| > the documentation and the associated uapi enums.  It's accurate, and
| > hopefully different enough from "autoneg" to hint people away from the
| > IEEE autoneg concept.
| 
| So perhaps just "default"?  Even saying something like ieee-selected
| doesn't really help, because apparently there are two autonegs defined
| - IEEE one and a "consortium" one...

First, sorry for the extreme delay in responding.  I was at the {25,100}Gb/s 
Ethernet "Plug Fest" all last week and then the holiday weekend added to the 
delay.

As Dustin notes, you're not the first to be confused by the use of the term 
"auto".  It essentially means: "Use standard FEC settings as specified by 
IEEE 802.3 interpretations of Cable Transceiver Module parameters."  But this 
is quite a mouthful.  In this sense, "auto" means "automatic".

Other keywords which we bandied about included: standard, default, ieee, 
ieee802.3, ieee802.3-default, transceiver-default, etc.  As you can see, the 
quest to make the option more obvious led to increasing verbosity.

I think that in the end, we decided to go with a moderately short keyword with 
tons of documentation to make the meaning clear.

By the way, this isn't the end of this problem.  The new QSFP28 and SFP28 Port 
Types have introduced a problem which no one has ever seen with any preceding 
network technology.  With all previous networking technologies, a driver could 
look at an adapter Port Type and know what its Port Capabilities were in terms 
of Speeds, etc. and those Port Capabilities would never change.  With the new 
QSFP28 and SFP28 Port Types, the Physical Port Capabilities can change based on 
what Transceiver Modules you plug in.  For instance, with one QSFP28 
Transceiver Module you might see {100,50,25}Gb/s and with a different one 
{40,10}Gb/s.  One Transceiver Module might support Reed-Solomon Forward Error 
Correction and another might also support BaseR/Reed-Solomon.

For the proposed FEC controls, we've tried to cope with this by having this 
"Automatic IEEE802.3 Transceiver Module-dictated FEC" via the "auto" selection 
(where most users will leave it we hope).  This allows the Firmware/Host Driver 
to automatically track the FEC needs of the currently plugged in Transceiver 
Module.

However, the first question which pops up is: what happens if a user explicitly 
selects a particular FEC from the set offered by the current Transceiver 
Module, and then swaps out Transceiver Modules to one which doesn't support that 
FEC?  Does the OS/Driver/Firmware "forget" the user's FEC setting?  Does it 
respect it and potentially not get a link established?  Do we "temporarily" put 
the user's FEC setting aside?  Or do we have FEC settings per-Transceiver 
Module Type?  Each choice has downsides in terms of "expected behavior" and the 
complexity of the abstraction model offered to the user.

In our discussions of the above, we decided that we should err in the direction 
of the absolutely simplest abstraction model, even when that might result in a 
failure to establish a link.  Our feeling was that doing anything else would 
result in endless user confusion.  Basically, if it takes longer than a simple 
paragraph to explain, you're probably doing the wrong thing.  So, we decided 
that if a user sets up a particular FEC setting, we keep that regardless of 
conflict with different Transceiver Modules.  (But flag such issues in the 
System Log in order to try to give the user a chance to understand why the new 
cable they plugged in isn't working.)

As noted above, the "auto" FEC setting solves the problem of automatically 
tracking the FEC needs of whatever Transceiver Module happens to be plugged in.

And now with QSFP28 and SFP28, we have the same issue with possible Speeds (and 
other Link Parameters).  It may well be that we're going to need to extend this 
idea into Link Speeds.

Casey

Re: [PATCH 1/2] PCI: Add new PCIe Fabric End Node flag, PCI_DEV_FLAGS_NO_RELAXED_ORDERING

2017-05-16 Thread Casey Leedom
| From: Ding Tianhong 
| Sent: Wednesday, May 10, 2017 6:15 PM
|
| Hi Casey:
|
| Will you continue to work on this solution or send a new version patches?

I won't be able to work on this any time soon given several other urgent
issues.  If you've got a desire to pick this up, I'd be happy to help code
review your efforts.

Casey


Re: [PATCH 1/2] PCI: Add new PCIe Fabric End Node flag, PCI_DEV_FLAGS_NO_RELAXED_ORDERING

2017-05-08 Thread Casey Leedom

| From: Alexander Duyck 
| Date: Saturday, May 6, 2017 11:07 AM
| 
| | From: Ding Tianhong 
| | Date: Fri, May 5, 2017 at 8:08 PM
| | 
| | According the suggestion, I could only think of this code:
| | ..
| 
| This is a bit simplistic but it is a start.

  Yes, something tells me that this is going to be more complicated than any
of us like ...

| The other bit I was getting at is that we need to update the core PCIe
| code so that when we configure devices and the root complex reports no
| support for relaxed ordering it should be clearing the relaxed
| ordering bits in the PCIe configuration registers on the upstream
| facing devices.

  Of course, this can be written to by the Driver at any time ... and is in
the case of the cxgb4 Driver ...

  After a lot of rummaging around, it does look like the Xen PCI backend
prohibits writes to the PCIe Capability Device Control register in
drivers/xen/xen-pciback/conf_space_capability.c and conf_space.c, simply
because writes aren't allowed unless "permissive" is set.  So it ~looks~ like
a driver running in a Virtual Machine can't turn Enable Relaxed Ordering back
on ...
| The last bit we need in all this is a way to allow for setups where
| peer-to-peer wants to perform relaxed ordering but for writes to the
| host we have to not use relaxed ordering. For that we need to enable a
| special case and that isn't handled right now in any of the solutions
| we have coded up so far.

  Yes, we do need this.


| From: Alexander Duyck 
| Date: Saturday, May 8, 2017 08:22 AM
|
| The problem is we need to have something that can be communicated
| through a VM. Your change doesn't work in that regard. That was why I
| suggested just updating the code so that we when we initialized PCIe
| devices what we do is either set or clear the relaxed ordering bit in
| the PCIe device control register. That way when we direct assign an
| interface it could know just based on the bits int the PCIe
| configuration if it could use relaxed ordering or not.
| 
| At that point the driver code itself becomes very simple since you
| could just enable the relaxed ordering by default in the igb/ixgbe
| driver and if the bit is set or cleared in the PCIe configuration then
| we are either sending with relaxed ordering requests or not and don't
| have to try and locate the root complex.
| 
| So from the sound of it Casey has a special use case where he doesn't
| want to send relaxed ordering frames to the root complex, but instead
| would like to send them to another PCIe device. To do that he needs to
| have a way to enable the relaxed ordering bit in the PCIe
| configuration but then not send any to the root complex. Odds are that
| is something he might be able to just implement in the driver, but is
| something that may become a more general case in the future. I don't
| see our change here impacting it as long as we keep the solution
| generic and mostly confined to when we instantiate the devices as the
| driver could likely make the decision to change the behavior later.

  It's not just me.  Intel has said that while RO directed at the Root
Complex Host Coherent Memory has a performance bug (not Data Corruption),
it's a performance win for Peer-to-Peer writes to MMIO Space.  (I'll be very
interested in hearing what the bug is if we get that much detail.  The very
same TLPs directed to the Root Complex Port without Relaxed Ordering set get
good performance.  So this is essentially a bug in the hardware that was
~trying~ to implement a performance win.)

  Meanwhile, I currently only know of a single PCIe End Point which causes
catastrophic results: the AMD A1100 ARM SoC ("SEATTLE").  And it's not even
clear that product is even alive anymore since I haven't been able to get
any responses from them for several months.

  What I'm saying is: let's try to architect a solution which doesn't throw
the baby out with the bath water ...

  I think that if a Device's Root Complex Port has problems with Relaxed
Ordering, it ~probably~ makes sense to turn off the PCIe Capability Device
Control[Enable Relaxed Ordering] when we assign a device to a Virtual
Machine since the Device Driver can no longer query the Relaxed Ordering
Support of the Root Complex Port.  The only down side of this would be if we
assigned two Peers to a VM in an application which wanted to do Peer-to-Peer
transfers.  But that seems like a hard application to support in any case
since the PCI Bus:Slot.Function IDs for assigned Devices within a VM don't
match the actual values.

  For Devices running in the base OS/Hypervisor, their Drivers can query the
Relaxed Ordering Support for the Root Complex Port or a Peer Device.  So a
simple flag within the (struct pci_dev *)->dev_flags would serve for that
along with a per-Architecture/Platform mechanism for setting it ...

Casey


Re: [PATCH 1/2] PCI: Add new PCIe Fabric End Node flag, PCI_DEV_FLAGS_NO_RELAXED_ORDERING

2017-05-04 Thread Casey Leedom
| From: Alexander Duyck 
| Sent: Wednesday, May 3, 2017 9:02 AM
| ...
| It sounds like we are more or less in agreement. My only concern is
| really what we default this to. On x86 I would say we could probably
| default this to disabled for existing platforms since my understanding
| is that relaxed ordering doesn't provide much benefit on what is out
| there right now when performing DMA through the root complex. As far
| as peer-to-peer I would say we should probably look at enabling the
| ability to have Relaxed Ordering enabled for some channels but not
| others. In those cases the hardware needs to be smart enough to allow
| for you to indicate you want it disabled by default for most of your
| DMA channels, and then enabled for the select channels that are
| handling the peer-to-peer traffic.

  Yes, I think that we are mostly in agreement.  I had just wanted to make
sure that whatever scheme was developed would allow for simultaneously
supporting non-Relaxed Ordering for some PCIe End Points and Relaxed
Ordering for others within the same system.  I.e. not simply
enabling/disabling/etc.  based solely on System Platform Architecture.

  By the way, I've started our QA folks off looking at what things look like
in Linux Virtual Machines under different Hypervisors to see what
information they may provide to the VM in the way of what Root Complex Port
is being used, etc.  So far they've got Windows Hyper-V done, and there's no
PCIe Fabric exposed in any way: just the attached device.  I'll
have to see what pci_find_pcie_root_port() returns in that environment.
Maybe NULL?

  With your reservations (which I also share), I think that it probably
makes sense to have a per-architecture definition of the "Can I Use Relaxed
Ordering With TLPs Directed At This End Point" predicate, with the default
being "No" for any architecture which doesn't implement the predicate.  And
if the specified (struct pci_dev *) End Node is NULL, it ought to return
False for that as well.  I can't see any reason to pass in the Source End
Node but I may be missing something.

  At this point, this is pretty far outside my level of expertise.  I'm
happy to give it a go, but I'd be even happier if someone with a lot more
experience in the PCIe Infrastructure were to want to carry the ball
forward.  I'm not super familiar with the Linux Kernel "Rules Of
Engagement", so let me know what my next step should be.  Thanks.

Casey


Re: [PATCH 1/2] PCI: Add new PCIe Fabric End Node flag, PCI_DEV_FLAGS_NO_RELAXED_ORDERING

2017-05-02 Thread Casey Leedom
| From: Alexander Duyck 
| Date: Tuesday, May 2, 2017 11:10 AM
| ...
| So for example, in the case of x86 it seems like there are multiple
| root complexes that have issues, and the gains for enabling it with
| standard DMA to host memory are small. As such we may want to default
| it to off via the architecture specific PCIe code and then look at
| having "white-list" cases where we enable it for things like
| peer-to-peer accesses. In the case of SPARC we could look at
| defaulting it to on, and only "black-list" any cases where there might
| be issues since SPARC relies on this in a significant way for
| performance. In the case of ARM and other architectures it is a bit of
| a toss-up. I would say we could just default it to on for now and
| "black-list" anything that doesn't work, or we could go the other way
| like I suggested for x86. It all depends on what the ARM community
| would want to agree on for this. I would say unless it makes a
| significant difference like it does in the case of SPARC we are
| probably better off just defaulting it to off.

  Sorry, I forgot to respond to this earlier when someone was rushing me out
to a customer dinner.

  I'm unaware of any other x86 Root Complex Port that has a problem with
Relaxed Ordering other than the performance issue with the current Intel
implementation.  Ashok tells me that Intel is in the final stages of putting
together a technical notice on this issue but I don't know when that will
come out.  Hopefully that will shed much more light on the cause and actual
use of Relaxed Ordering when directed to Coherent Memory on current and past
Intel platforms.  (Note that the performance bug seems to limit us to
~75-85Gb/s DMA Write speed to Coherent Host Memory.)

  The only other Device that I currently know of which has issues with
Relaxed Ordering TLPs directed at it, is also a Root Complex Port on the new
AMD A1100 ARM SoC ("SEATTLE").  There we have an actual bug which could lead
to Data Corruption.

  But I don't know anything about other x86 Root Complex Ports having issues
with this -- we've been using it ever since commit ef306b50b from December
2010.

  Also, I'm not aware of any performance data which has been gathered on the
use of Relaxed Ordering when directed at Host Memory.  From your note, it
sounds like it's important on SPARC architectures.  But it could conceivably
be important on any architecture or Root Complex/Memory Controller
implementation.  We use it to direct Ingress Packet Data to Free List
Buffers, followed by a TLP without Relaxed Ordering directed at a Host
Memory Message Queue.  The idea here is to give the Root Complex options on
which DMA Memory Write TLPs to process in order to optimize data placement
in memory.  And with the next Ethernet speed bump to 200Gb/s this may be
even more important.

  Basically, I'd hate to come up with a solution where we write off the use
of Relaxed Ordering for DMA Writes to Host Memory.  I don't think you're
suggesting that, but there are a number of x86 Root Complex implementations
out there -- and some like the new AMD Ryzen have yet to be tested -- as
well as other architectures.

  And, as Ashok and I have both noted, any solution we come up with needs
to cope with a heterogeneous situation where, on the same PCIe Fabric, it
may be necessary/desirable to support Relaxed Ordering TLPs directed at
some End Nodes but not others.

Casey


[PATCH 0/2] Add new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag

2017-05-01 Thread Casey Leedom
Some devices have problems with Transaction Layer Packets with the Relaxed
Ordering Attribute set.  This patch set adds a new PCIe Device Flag,
PCI_DEV_FLAGS_NO_RELAXED_ORDERING, a set of PCI Quirks to catch some known
devices with Relaxed Ordering issues, and a use of this new flag by the
cxgb4 driver to avoid using Relaxed Ordering with problematic Root Complex
Ports.

It's been years since I've submitted kernel.org patches; I apologise for the
almost certain submission errors.

Casey Leedom (2):
  PCI: Add new PCIe Fabric End Node flag,
PCI_DEV_FLAGS_NO_RELAXED_ORDERING
  net/cxgb4: Use new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag

 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h  |  1 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 17 +++
 drivers/net/ethernet/chelsio/cxgb4/sge.c|  5 ++--
 drivers/pci/quirks.c| 38 +
 include/linux/pci.h |  2 ++
 5 files changed, 61 insertions(+), 2 deletions(-)

-- 
1.9.1



[PATCH 1/2] PCI: Add new PCIe Fabric End Node flag, PCI_DEV_FLAGS_NO_RELAXED_ORDERING

2017-05-01 Thread Casey Leedom
The new flag PCI_DEV_FLAGS_NO_RELAXED_ORDERING indicates that the Relaxed
Ordering Attribute should not be used on Transaction Layer Packets destined
for the PCIe End Node so flagged.  Initially flagged this way are Intel
E5-26xx Root Complex Ports which suffer from a Flow Control Credit
Performance Problem and AMD A1100 ARM ("SEATTLE") Root Complex Ports which
don't obey PCIe 3.0 ordering rules which can lead to Data Corruption.
---
 drivers/pci/quirks.c | 38 ++
 include/linux/pci.h  |  2 ++
 2 files changed, 40 insertions(+)

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index f754453..4ae78b3 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -3979,6 +3979,44 @@ static void quirk_tw686x_class(struct pci_dev *pdev)
  quirk_tw686x_class);
 
 /*
+ * Some devices have problems with Transaction Layer Packets with the Relaxed
+ * Ordering Attribute set.  Such devices should mark themselves and other
+ * Device Drivers should check before sending TLPs with RO set.
+ */
+static void quirk_relaxedordering_disable(struct pci_dev *dev)
+{
+   dev->dev_flags |= PCI_DEV_FLAGS_NO_RELAXED_ORDERING;
+}
+
+/*
+ * Intel E5-26xx Root Complex has a Flow Control Credit issue which can
+ * cause performance problems with Upstream Transaction Layer Packets with
+ * Relaxed Ordering set.
+ */
+DECLARE_PCI_FIXUP_CLASS_EARLY(0x8086, 0x6f02, PCI_CLASS_NOT_DEFINED, 8,
+ quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_CLASS_EARLY(0x8086, 0x6f04, PCI_CLASS_NOT_DEFINED, 8,
+ quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_CLASS_EARLY(0x8086, 0x6f08, PCI_CLASS_NOT_DEFINED, 8,
+ quirk_relaxedordering_disable);
+
+/*
+ * The AMD ARM A1100 (AKA "SEATTLE") SoC has a bug in its PCIe Root Complex
+ * where Upstream Transaction Layer Packets with the Relaxed Ordering
+ * Attribute clear are allowed to bypass earlier TLPs with Relaxed Ordering
+ * set.  This is a violation of the PCIe 3.0 Transaction Ordering Rules
+ * outlined in Section 2.4.1 (PCI Express(r) Base Specification Revision 3.0
+ * November 10, 2010).  As a result, on this platform we can't use Relaxed
+ * Ordering for Upstream TLPs.
+ */
+DECLARE_PCI_FIXUP_CLASS_EARLY(0x1022, 0x1a00, PCI_CLASS_NOT_DEFINED, 8,
+ quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_CLASS_EARLY(0x1022, 0x1a01, PCI_CLASS_NOT_DEFINED, 8,
+ quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_CLASS_EARLY(0x1022, 0x1a02, PCI_CLASS_NOT_DEFINED, 8,
+ quirk_relaxedordering_disable);
+
+/*
  * Per PCIe r3.0, sec 2.2.9, "Completion headers must supply the same
  * values for the Attribute as were supplied in the header of the
  * corresponding Request, except as explicitly allowed when IDO is used."
diff --git a/include/linux/pci.h b/include/linux/pci.h
index eb3da1a..6764f66 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -178,6 +178,8 @@ enum pci_dev_flags {
PCI_DEV_FLAGS_NO_PM_RESET = (__force pci_dev_flags_t) (1 << 7),
/* Get VPD from function 0 VPD */
PCI_DEV_FLAGS_VPD_REF_F0 = (__force pci_dev_flags_t) (1 << 8),
+   /* Don't use Relaxed Ordering for TLPs directed at this device */
+   PCI_DEV_FLAGS_NO_RELAXED_ORDERING = (__force pci_dev_flags_t) (1 << 9),
 };
 
 enum pci_irq_reroute_variant {
-- 
1.9.1



[PATCH 2/2] net/cxgb4: Use new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag

2017-05-01 Thread Casey Leedom
cxgb4 Ethernet driver now queries Root Complex Port to determine if it can
send TLPs to it with the Relaxed Ordering Attribute set.
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h  |  1 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 17 +
 drivers/net/ethernet/chelsio/cxgb4/sge.c|  5 +++--
 3 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
index 163543b..46d61b1 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
@@ -512,6 +512,7 @@ enum { /* adapter flags */
USING_SOFT_PARAMS  = (1 << 6),
MASTER_PF  = (1 << 7),
FW_OFLD_CONN   = (1 << 9),
+   ROOT_NO_RELAXED_ORDERING = (1 << 10),
 };
 
 enum {
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index afb0967..510c020 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -4636,6 +4636,7 @@ static int init_one(struct pci_dev *pdev, const struct 
pci_device_id *ent)
 #ifdef CONFIG_PCI_IOV
u32 v, port_vec;
 #endif
+   struct pci_dev *root;
 
printk_once(KERN_INFO "%s - version %s\n", DRV_DESC, DRV_VERSION);
 
@@ -4734,6 +4735,22 @@ static int init_one(struct pci_dev *pdev, const struct 
pci_device_id *ent)
adapter->msg_enable = DFLT_MSG_ENABLE;
memset(adapter->chan_map, 0xff, sizeof(adapter->chan_map));
 
+   /* If possible, we use PCIe Relaxed Ordering Attribute to deliver
+* Ingress Packet Data to Free List Buffers in order to allow for
+* chipset performance optimizations between the Root Complex and
+* Memory Controllers.  (Messages to the associated Ingress Queue
+* notifying new Packet Placement in the Free Lists Buffers will be
+* sent without the Relaxed Ordering Attribute, thus guaranteeing that
+* all preceding PCIe Transaction Layer Packets will be processed
+* first.)  But some Root Complexes have various issues with Upstream
+* Transaction Layer Packets with the Relaxed Ordering Attribute set.
+* So we check our Root Complex to see if it's flagged with advice
+* against using Relaxed Ordering.
+*/
+   root = pci_find_pcie_root_port(adapter->pdev);
+   if (root && (root->dev_flags & PCI_DEV_FLAGS_NO_RELAXED_ORDERING))
+   adapter->flags |= ROOT_NO_RELAXED_ORDERING;
+
	spin_lock_init(&adapter->stats_lock);
	spin_lock_init(&adapter->tid_release_lock);
	spin_lock_init(&adapter->win0_lock);
diff --git a/drivers/net/ethernet/chelsio/cxgb4/sge.c 
b/drivers/net/ethernet/chelsio/cxgb4/sge.c
index f05f0d4..ac229a3 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/sge.c
@@ -2571,6 +2571,7 @@ int t4_sge_alloc_rxq(struct adapter *adap, struct 
sge_rspq *iq, bool fwevtq,
struct fw_iq_cmd c;
	struct sge *s = &adap->sge;
struct port_info *pi = netdev_priv(dev);
+   int relaxed = !(adap->flags & ROOT_NO_RELAXED_ORDERING);
 
/* Size needs to be multiple of 16, including status entry. */
iq->size = roundup(iq->size, 16);
@@ -2624,8 +2625,8 @@ int t4_sge_alloc_rxq(struct adapter *adap, struct 
sge_rspq *iq, bool fwevtq,
 
flsz = fl->size / 8 + s->stat_len / sizeof(struct tx_desc);
c.iqns_to_fl0congen |= htonl(FW_IQ_CMD_FL0PACKEN_F |
-FW_IQ_CMD_FL0FETCHRO_F |
-FW_IQ_CMD_FL0DATARO_F |
+FW_IQ_CMD_FL0FETCHRO_V(relaxed) |
+FW_IQ_CMD_FL0DATARO_V(relaxed) |
 FW_IQ_CMD_FL0PADEN_F);
if (cong >= 0)
c.iqns_to_fl0congen |=
-- 
1.9.1



Re: [PATCH net-next 1/4] ixgbe: sparc: rename the ARCH_WANT_RELAX_ORDER to IXGBE_ALLOW_RELAXED_ORDER

2017-04-28 Thread Casey Leedom
| From: Lucas Stach 
| Sent: Friday, April 28, 2017 1:51 AM
| 
| Am Donnerstag, den 27.04.2017, 12:19 -0500 schrieb Bjorn Helgaas:
| > 
| > 
| > I thought Relaxed Ordering was an optimization.  Are there cases where
| > it is actually required for correct behavior?
| 
| Yes, at least the Tegra 2 TRM claims that RO needs to be enabled on the
| device side for correct operation with the following language:
| 
| "Tegra 2 requires relaxed ordering for responses to downstream requests
| (responses can pass writes). It is possible in some circumstances for PCIe
| transfers from an external bus masters (i.e. upstream transfers) to become
| blocked by a downstream read or non-posted write. The responses to these
| downstream requests are blocked by upstream posted writes only when PCIe
| strict ordering is imposed. It is therefore necessary to never impose strict
| ordering that would block a response to a downstream NPW/read request and
| always set the relaxed ordering bit to 1. Only devices that are capable of
| relaxed ordering may be used with Tegra 2 devices."

  (woof) Reading through the above paragraph is difficult because the author
seems to shift language and terminology mid sentence and isn't following
standard PCI terminology conventions.  The Root Complex is "Upstream", a
non-Root Complex Node in the PCIe Fabric is "Downstream", Requests that a
Downstream Device (End Point) send to the Root Complex are called "Upstream
Requests", responses that the Root Complex send to a Device are called
"Downstream Responses" (or, even more pedantically, "Responses sent
Downstream for an earlier Upstream Request").

  Because a Root Complex is Upstream but the Requests it sends go Downstream,
and Downstream Devices send their Requests Upstream, it's very important
that we use exceedingly precise language.

  So, it ~sounds like~ the nVidia Tegra 2 document is talking about the need
for Downstream Devices to echo the Relaxed Ordering Attribute in their
Responses directed Upstream to Requests sent Downstream from the Root
Complex.  Moreover, there's code in drivers/pci/host/pci-tegra.c:
tegra_pcie_relax_enable() which appears to set the PCIe Capability Device
Control[Enable Relaxed Ordering] bit on all PCIe Fabric Nodes.

  If my reading of the intent of the nVidia document is correct -- and
that's a Big If because of the extremely imprecise language used -- that
means that the tegra_pcie_relax_enable() is completely bogus.  The PCIe 3.0
Specification states that Responses MUST reflect the Relaxed Ordering and No
Snoop Attributes of the Requests for which they are responding.  Section
2.2.9 of PCI Express(r) Base Specification Revision 3.0 November 10, 2010:
"Completion headers must supply the same values for the Attribute as were
supplied in the header of the corresponding Request, except as explicitly
allowed when IDO is used."

  And, specifically, the PCIe Capability Device Control[Enable Relaxed
Ordering] bit _only_ affects the ability of that Device to originate
Transaction Layer Packet Requests with the Relaxed Ordering Attribute set.
Thus, tegra_pcie_relax_enable() setting those bits on all the Downstream
Devices (and intervening Bridges) does not _cause_ those Devices to generate
Requests with Relaxed Ordering set.  And, if the Devices are PCIe 3.0
compliant, it also doesn't affect the Responses that they send back Upstream
to the Root Complex.

  I apologize for the incredibly detailed nature of these responses, but
it's very easy for people new to PCIe to get these things wrong and/or
misinterpret the PCIe Specifications.

Casey


Re: [PATCH net-next 1/4] ixgbe: sparc: rename the ARCH_WANT_RELAX_ORDER to IXGBE_ALLOW_RELAXED_ORDER

2017-04-27 Thread Casey Leedom
| From: Bjorn Helgaas 
| Sent: Thursday, April 27, 2017 10:19 AM
|
| Are you hinting that the PCI core or arch code could actually *enable*
| Relaxed Ordering without the driver doing anything?  Is it safe to do that?
| Is there such a thing as a device that is capable of using RO, but where the
| driver must be aware of it being enabled, so it programs the device
| appropriately?

  I forgot to reply to this portion of Bjorn's email.

  The PCI Configuration Space PCI Capability Device Control[Enable Relaxed
Ordering] bit governs enabling the _ability_ for the PCIe Device to send
TLPs with the Relaxed Ordering Attribute set.  It does not _cause_ RO to be
set on TLPs.  Doing that would almost certainly cause Data Corruption Bugs
since you only want a subset of TLPs to have RO set.

  For instance, we typically use RO for Ingress Packet Data delivery but
non-RO for messages notifying the Host that an Ingress Packet has been
delivered.  This ensures that the "Ingress Packet Delivered" non-RO TLP is
processed _after_ any preceding RO TLPs delivering the actual Ingress Packet
Data.

  In the above scenario, if one were to turn off Enable Relaxed Ordering via
the PCIe Capability, then the on-chip PCIe engine would simply never send a
TLP with the Relaxed Ordering Attribute set, regardless of any other chip
programming.

  And finally, just to be absolutely clear, using Relaxed Ordering isn't an
"Architecture Thing".  It's a PCIe Fabric End Point Thing.  Many End Points
simply ignore the Relaxed Ordering Attribute (except to reflect it back in
Response TLPs).  In this sense, Relaxed Ordering simply provides
potentially useful optimization information to the PCIe End Point.

Casey


Re: [PATCH net-next 1/4] ixgbe: sparc: rename the ARCH_WANT_RELAX_ORDER to IXGBE_ALLOW_RELAXED_ORDER

2017-04-27 Thread Casey Leedom
  Thanks for adding me to the Cc list Bjorn.  Hopefully my message will make
it out to the netdev and linux-pci lists -- I'm not currently subscribed to
them.  I've explicitly marked this message to be sent in plain text but
modern email agents suck with respect to this. (sigh) I have officially
become a curmudgeon. 

  So, officially, Relaxed Ordering should be a Semantic Noop as far as PCIe
transfers are concerned, as long as you don't care what order the PCIe
Transaction Layer Packets are processed in by the target PCIe Fabric End
Point.

  Basically, if you have some number of back-to-back PCIe TLPs between two
Fabric End Points {A} -> {B} which have the Relaxed Ordering Attribute set,
the End Point {B} receiving these RO TLPs may process them in any order it
likes.  When a TLP without Relaxed Ordering is sent {A} -> {B}, all
preceding TLPs with Relaxed Ordering set must be processed by {B} prior to
processing the TLP without Relaxed Ordering set.  In this sense, a TLP
without Relaxed Ordering set is something akin to a "memory barrier".

  All of this is covered in Section 2.4.1 of the PCIe 3.0 Specification (PCI
Express(r) Base Specification Revision 3.0 November 10, 2010).

  The advantage of using Relaxed Ordering (which, as I understand it, is
heavily used when sending data to Graphics Cards) is that the PCIe
Endpoint can potentially optimize the processing order of RO TLPs with
things like a local multi-channel Memory Controller in order to achieve the
highest transfer bandwidth possible.

  However, we have discovered at least two PCIe 3.0 Root Complex
implementations which have problems with TLPs directed at them with the
Relaxed Ordering Attribute set and I'm in the process of working up a Linux
Kernel PCI "Quirk" to allow those PCIe End Points to be marked as "not being
advisable to send RO TLPs to".  These problems range from "mere" Performance
Problems to outright Data Corruption.  I'm working with the vendors of these
...  "problematic" Root Complex implementations and hope to have this patch
submitted to the linux-pci list by tomorrow.

  By the way, it's important to note that just because, say, a Root Complex
has problems with RO TLPs directed at it, that doesn't mean that you want to
avoid all use of Relaxed Ordering within the PCIe Fabric.  For instance,
with the vendor whose Root Complex has a Performance Problem with RO TLPs
directed at it, it's perfectly reasonable -- and desired -- to use Relaxed
Ordering in Peer-to-Peer traffic.  Say for instance, with an NVMe <->
Ethernet application.

Casey


Re: [RFC PATCH v2] net: ethtool: add support for forward error correction modes

2017-02-15 Thread Casey Leedom
  Vidya and I have been in communication on how to support Forward Error 
Correction Management on Links.  My own thoughts are that we should consider 
using the new Get/Set Link Ksettings API because FEC is a low-level Link 
parameter which is either negotiated at the same time as Link Speed if 
Auto-Negotiation is in use, or must be set identically on both Link Peers if AN 
is not in use.  I'm not convinced that FEC Management should use a separate 
ethtool/Kernel Driver API ala Pause Frame settings.

  I'm hoping to meet with Vidya and his collaborators at Cumulus Networks in 
the very near term to hash these issues out.

Casey


Re: [RFC PATCH net-next] net: ethtool: add support for forward error correction modes

2016-11-22 Thread Casey Leedom
  I'm attempting to start the work necessary to implement Vidya's proposed new 
ethtool interface to manage Forward Error Correction settings on a link.  I'm 
confused by the ethtool FEC API and the degree/type of control it offers.  At 
the top of the patch we have:

Encoding: Types of encoding
Off:  Turning off any encoding
RS :  enforcing RS-FEC encoding on supported speeds
BaseR  :  enforcing Base R encoding on supported speeds
Auto   :  Default FEC settings for drivers, and would represent
  asking the hardware to essentially go into a best effort mode.

but then later on we have:

+struct ethtool_fecparam {
+ __u32   cmd;
+ __u32   autoneg;
+ /* bitmask of FEC modes */
+ __u32   fec;
+ __u32   reserved;
+};

...

+enum ethtool_fec_config_bits {
+   ETHTOOL_FEC_NONE_BIT,
+   ETHTOOL_FEC_AUTO_BIT,
+   ETHTOOL_FEC_OFF_BIT,
+   ETHTOOL_FEC_RS_BIT,
+   ETHTOOL_FEC_BASER_BIT,
+};

...

+   ETHTOOL_LINK_MODE_FEC_NONE_BIT  = 47,
+   ETHTOOL_LINK_MODE_FEC_RS_BIT= 48,
+   ETHTOOL_LINK_MODE_FEC_BASER_BIT   = 49,

The last ethtool Link Mode bits seem to imply a separable FEC on/off with 
individual control for RS and BASER.  How would the "Auto" from the top be 
encoded within these Link Mode bits?  And I don't see any reference to the 
ethtool_fec_config_bits in the kernel or ethtool patches so I'm not sure what 
they're supposed to reference.  Can you clarify the above?  I.e. can you offer 
a small template example of what a driver implementation might look like 
interpreting the incoming Link Mode Bits?

  And do we expect that there will be new FECs in the future?

Casey

Re: [RFC PATCH net-next] net: ethtool: add support for forward error correction modes

2016-11-22 Thread Casey Leedom
  And by the way, we currently have two ethtool APIs which pump in an 
Auto-Negotiation indication -- set_link_ksettings() and set_pauseparam().  Now 
we're talking about adding a third, set_fecparam().  Are all of the calls to 
these three APIs supposed to agree on the concept of Auto-Negotiations?  I.e. 
what's it mean if set_link_ksettings() gets called with 
link_ksettings->base.autoneg == AUTONEG_ENABLE but set_pauseparam() gets called 
with epause->autoneg == AUTONEG_DISABLE?  And now adding set_fecparam() into 
the system with a similar ability to specify the state of Auto-Negotiation is 
even more confusing.

Casey

[RFC PATCH net-next] net: ethtool: add support for forward error correction modes

2016-11-11 Thread Casey Leedom
N.B.  Sorry I'm not able to respond to the original message since I
wasn't subscribed to netdev when it was sent a couple of weeks ago.

This feature is something that Chelsio's cxgb4 driver needs.
As we've tested our adapters against a number of switches,
we've discovered a few which use varying defaults for FEC.
And when Auto-Negotiation isn't used (or even possible with
Optical Links), we need to be able to control turning FEC on/off.

For our part, we default FEC Off for Optical Transceivers.
For Copper, we read the Cable's EEPROM to determine
how to default FEC.  For some switches this works, but
for at least one where that switch enables FEC for Optical
Transceivers, it doesn't.  For that switch we had to hard
wire FEC on.  Obviously that's not a good solution and we
need an administrative interface so the system
administrator can configure our adapter to use the
appropriate FEC setting to match that of the switch.

So this is basically a long-winded ACK of Vidya's patch
and we would immediately implement this new ethtool API
as soon as it's available.

Casey


Re: [PATCH net-next] cxgb4: Reduce resource allocation in kdump kernel

2016-06-03 Thread Casey Leedom
  Looks good to me.  Of course I came up with those changes so maybe we should 
get another reviewer? :-)  Also, don't forget to mention "Bug #29998" in the 
commit message ...

Casey


From: Hariprasad Shenai <haripra...@chelsio.com>
Sent: Friday, June 3, 2016 10:35:45 AM
To: da...@davemloft.net
Cc: netdev@vger.kernel.org; Casey Leedom; Nirranjan Kirubaharan; Hariprasad S
Subject: [PATCH net-next] cxgb4: Reduce resource allocation in kdump kernel

When is_kdump_kernel() is true, reduce our memory footprint by only using
a single "Queue Set" and Forcing Master so we can reinitialize the
Firmware/Chip.

Signed-off-by: Hariprasad Shenai <haripra...@chelsio.com>
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index 477db477b133..5317187d0073 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -64,6 +64,7 @@
 #include 
 #include 
 #include 
+#include 

 #include "cxgb4.h"
 #include "t4_regs.h"
@@ -3735,7 +3736,8 @@ static int adap_init0(struct adapter *adap)
return ret;

/* Contact FW, advertising Master capability */
-   ret = t4_fw_hello(adap, adap->mbox, adap->mbox, MASTER_MAY, );
+   ret = t4_fw_hello(adap, adap->mbox, adap->mbox,
+ is_kdump_kernel() ? MASTER_MUST : MASTER_MAY, );
if (ret < 0) {
dev_err(adap->pdev_dev, "could not connect to FW, error %d\n",
ret);
@@ -4366,6 +4368,14 @@ static void cfg_queues(struct adapter *adap)
if (q10g > netif_get_num_default_rss_queues())
q10g = netif_get_num_default_rss_queues();

+   /* Reduce memory usage in kdump environment by using only one queue
+* and disable all offload.
+*/
+   if (is_kdump_kernel()) {
+   q10g = 1;
+   adap->params.offload = 0;
+   }
+
for_each_port(adap, i) {
struct port_info *pi = adap2pinfo(adap, i);

--
2.3.4



Re: [PATCH pci] pci: Add helper function to set VPD size

2016-04-15 Thread Casey Leedom
| From: Steve Wise <sw...@opengridcomputing.com>
| Sent: Friday, April 15, 2016 9:12 AM
|
| On 4/14/2016 1:35 PM, Steve Wise wrote:
| >> The fix is to add a PCI helper function to set the VPD size, so the
| >> driver can expicitly set the exact size of the VPD.
| >>
| >> Fixes commit 104daa71b396 ("PCI: Determine actual VPD size on first
| >> access")
| >>
| >> Signed-off-by: Casey Leedom <lee...@chelsio.com>
| >> Signed-off-by: Hariprasad Shenai <haripra...@chelsio.com>
| > Looks good!
|
| Hey Bjorn,
|
| Will this make the next 4.6-rc?

  Without a fix of some sort, cxgb4 is completely dead in the water as of the 
4.6 series.  I'm also worried about the bnx2x driver which seems to be doing 
something similar to our cxgb4 driver.  (I.e. there isn't just a single VPD 
Data Structure hosted at the beginning of the VPD Space.)

Casey


Re: [PATCH net-next 1/4] cxgb4: Use mask & shift while returning vlan_prio

2015-12-16 Thread Casey Leedom

> On Dec 16, 2015, at 4:07 PM, David Miller  wrote:
> 
> From: Hariprasad Shenai 
> Date: Wed, 16 Dec 2015 13:16:25 +0530
> 
>> @@ -66,7 +66,7 @@ struct l2t_data {
>> 
>> static inline unsigned int vlan_prio(const struct l2t_entry *e)
>> {
>> -return e->vlan >> 13;
>> +return (e->vlan & VLAN_PRIO_MASK) >> VLAN_PRIO_SHIFT;
> 
> e->vlan is a u16, the vlan priotity is the top 3 bits of the 16-bit
> value, and finally the right shift will be non-signed.
> 
> Therefore this change is absolutely not necessary.
> 
> Please remove this patch from the series and resend.

I assume that you only meant that the masking portion is unnecessary.  Doing 
the shift with the symbolic constant VLAN_PRIO_SHIFT instead of the literal 
constant “13” is still a reasonable change.  The masking was almost certainly 
from me because once one uses the symbolic constants, we're not supposed to 
“know” about the internal structure of the operation.  Modern compilers are of 
course free to optimize away the mask, etc.

Casey
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCHv2 net-next 2/4] cxgb4: For T4, don't read the Firmware Mailbox Control register

2015-09-30 Thread Casey Leedom
Hari,

  I think you missed the corresponding change that's needed for the const char 
*owner[] array.  You need to add an "" entry so the index of "4" makes 
sense.

Casey


From: Hariprasad Shenai [haripra...@chelsio.com]
Sent: Wednesday, September 30, 2015 8:03 AM
To: netdev@vger.kernel.org
Cc: da...@davemloft.net; Casey Leedom; Nirranjan Kirubaharan; Hariprasad S
Subject: [PATCHv2 net-next 2/4] cxgb4: For T4, don't read the Firmware Mailbox 
Control register

T4 doesn't have the Shadow copy of the register which we can read without
side effect. So don't read mbox control register for T4 adapter

Signed-off-by: Hariprasad Shenai <haripra...@chelsio.com>
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c | 18 +-
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
index 0a87a32..8001619 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
@@ -1134,12 +1134,20 @@ static int mbox_show(struct seq_file *seq, void *v)
unsigned int mbox = (uintptr_t)seq->private & 7;
struct adapter *adap = seq->private - mbox;
void __iomem *addr = adap->regs + PF_REG(mbox, CIM_PF_MAILBOX_DATA_A);
-   unsigned int ctrl_reg = (is_t4(adap->params.chip)
-? CIM_PF_MAILBOX_CTRL_A
-: CIM_PF_MAILBOX_CTRL_SHADOW_COPY_A);
-   void __iomem *ctrl = adap->regs + PF_REG(mbox, ctrl_reg);

-   i = MBOWNER_G(readl(ctrl));
+   /* For T4 we don't have a shadow copy of the Mailbox Control register.
+* And since reading that real register causes a side effect of
+* granting ownership, we're best off simply not reading it at all.
+*/
+   if (is_t4(adap->params.chip)) {
+   i = 4; /* index of "" */
+   } else {
+   unsigned int ctrl_reg = CIM_PF_MAILBOX_CTRL_SHADOW_COPY_A;
+   void __iomem *ctrl = adap->regs + PF_REG(mbox, ctrl_reg);
+
+   i = MBOWNER_G(readl(ctrl));
+   }
+
seq_printf(seq, "mailbox owned by %s\n\n", owner[i]);

for (i = 0; i < MBOX_LEN; i += 8)
--
2.3.4



RE: Request for advice on where to put Root Complex fix up code for downstream device

2015-05-29 Thread Casey Leedom
  Thanks Bjorn and no issues at all about the delay -- I definitely understand 
how
busy we all are.

  I'll go ahead and submit a PCI Quirk.  As part of this, would you like me to
also commit a new PCI-E routine to find the Root Complex Port for a given
PCI Device?  It seem like it might prove useful in the future.  Otherwise I'll
just incorporate that loop in my PCI Quirk.

Casey


From: Bjorn Helgaas [bhelg...@google.com]
Sent: Friday, May 29, 2015 9:20 AM
To: Casey Leedom
Cc: netdev@vger.kernel.org; linux-...@vger.kernel.org
Subject: Re: Request for advice on where to put Root Complex fix up code for 
downstream device

Hi Casey,

Sorry, this one slipped through and I forgot to respond earlier.

On Thu, May 07, 2015 at 11:31:58PM +, Casey Leedom wrote:
 | From: Bjorn Helgaas [bhelg...@google.com]
 | Sent: Thursday, May 07, 2015 4:04 PM
 |
 | There are a lot of fixups in drivers/pci/quirks.c.  For things that have to
 | be worked around either before a driver claims the device or if there is no
 | driver at all, the fixup *has* to go in drivers/pci/quirks.c
 |
 | But for things like this, where the problem can only occur after a driver
 | claims the device, I think it makes more sense to put the fixup in the
 | driver itself.  The only wrinkle here is that the fixup has to be done on a
 | separate device, not the device claimed by the driver.  But I think it
 | probably still makes sense to put this fixup in the driver.

   Okay, the example code that I provided (still quoted below) was indeed
 done as a fix within the cxgb4 Network Driver.  I've also worked up a
 version as a PCI Quirk but if you and David Miller agree that the fixup
 code should go into cxgb4, I'm comfortable with that.  I can also provide
 the example PCI Quirk code I worked up if you like.

   One complication to doing this in cxgb4 is that it attaches to Physical
 Function 4 of our T5 chip.  Meanwhile, a completely separate storage
 driver, csiostor, connects to PF5 and PF6 and there's no
 requirement at all that cxgb4 be loaded.  So if we go down the road of
 putting the fixup code in the cxgb4 driver, we'll also need to duplicate
 that code in the csiostor driver.

Sounds simpler to just put the quirk in drivers/pci/quirks.c.

 |  +static void clear_root_complex_tlp_attributes(struct pci_dev *pdev)
 |  +{
 |  +	struct pci_bus *bus = pdev->bus;
 |  +	struct pci_dev *highest_pcie_bridge = NULL;
 |  +
 |  +	while (bus) {
 |  +		struct pci_dev *bridge = bus->self;
 |  +
 |  +		if (!bridge || !bridge->pcie_cap)
 |  +			break;
 |  +		highest_pcie_bridge = bridge;
 |  +		bus = bus->parent;
 |  +	}
 |
 | Can you use pci_upstream_bridge() here?  There are a couple places where we
 | want to find the Root Port, so we might factor that out someday.  It'll be
 | easier to find all those places if they use with pci_upstream_bridge().

 It looks like pci_upstream_bridge() just traverses one link upstream toward
 the Root Complex?  Or am I misunderstanding that function?

No, you're right.  I was just trying to suggest using pci_upstream_bridge()
instead of bus->parent->self in your loop.  It wouldn't replace the loop
completely.

Bjorn


RE: Request for advice on where to put Root Complex fix up code for downstream device

2015-05-28 Thread Casey Leedom
| From: Casey Leedom [lee...@chelsio.com]
| Sent: Thursday, May 07, 2015 4:31 PM
| 
| | From: Bjorn Helgaas [bhelg...@google.com]
| | Sent: Thursday, May 07, 2015 4:04 PM
| |
| | There are a lot of fixups in drivers/pci/quirks.c.  For things that have to
| | be worked around either before a driver claims the device or if there is no
| | driver at all, the fixup *has* to go in drivers/pci/quirks.c
| |
| | But for things like this, where the problem can only occur after a driver
| | claims the device, I think it makes more sense to put the fixup in the
| | driver itself.  The only wrinkle here is that the fixup has to be done on a
| | separate device, not the device claimed by the driver.  But I think it
| | probably still makes sense to put this fixup in the driver.
| ...
|   One complication to doing this in cxgb4 is that it attaches to Physical
| Function 4 of our T5 chip.  Meanwhile, a completely separate storage
| driver, csiostor, connects to PF5 and PF6 and there's no
| requirement at all that cxgb4 be loaded.  So if we go down the road of
| putting the fixup code in the cxgb4 driver, we'll also need to duplicate
| that code in the csiostor driver.

  I never heard back on this issue of needing to put the Root Complex fixup 
code in two different drivers -- cxgb4 and csiostor -- if we don't go down the 
path of using a PCI Quirk.  I'm happy doing either and have verified both 
solutions locally.  I'd just like to get a judgement call on this.

  It comes down to adding ~30 lines to

drivers/net/eththernet/chelsio/cxgb4/cxgb4_main.c
drivers/scsi/csiostor/csio_init.c

or ~30 lines to

drivers/pci/quirks.c

| | Can you include a pointer to the relevant part of the spec?
| 
|   Sure:
| 
| 2.2.9. Completion Rules
| ...
| Completion headers must supply the same values for
| the Attribute as were supplied in the 20 header of
| the corresponding Request, except as explicitly
| allowed when IDO is used (see Section 2.2.6.4).
| ...
| 2.3.2. Completion Handling Rules
| ...
| If a received Completion matches the Transaction ID
| of an outstanding Request, but in some other way
| does not match the corresponding Request (e.g., a
| problem with Attributes, Traffic Class, Byte Count,
| Lower Address, etc), it is strongly recommended for
| the Receiver to handle the Completion as a Malformed
| TLP. However, if the Completion is otherwise properly
| formed, it is permitted[22] for the Receiver to
| handle the Completion as an Unexpected Completion.

| | Can you use pci_upstream_bridge() here?  There are a couple places where we
| | want to find the Root Port, so we might factor that out someday.  It'll be
| | easier to find all those places if they use with pci_upstream_bridge().
| 
| It looks like pci_upstream_bridge() just traverses one link upstream toward
| the Root Complex?  Or am I misunderstanding that function?


Re: [PATCH net-next 1/3] cxgb4: Add support for loopback between VI of same port

2015-05-22 Thread Casey Leedom
  Okay, really: I’m not arguing for module parameters.  I’m agreeing with you 
100%.
I’m not trying to be snarky or back you into admitting that there are some times
when a module parameter is needed.  I’m not being sneaky, etc.  I’m really just
asking a mechanism question.  It is, on the other hand, quite likely that I’m
being dumb.  I’ll absolutely grant you that.

  So let me turn this around and ask:

What command would you envision that I use in order to tell a driver
to use a different TX routine for an interface?

ethtool --tx-routine eth{n} loopback

Or?

 Sorry for being so dense.  We really are trying to live within the rules but
we’re struggling to figure out what patch we should submit.

Casey

 On May 22, 2015, at 11:01 AM, David Miller da...@davemloft.net wrote:
 
 From: Casey Leedom lee...@chelsio.com
 Date: Fri, 22 May 2015 16:49:03 +
 
  Oh I definitely understand that and agree.  Unfortunately I've
 inherited a driver architecture that makes that ... difficult
 for many operations ...  And I have an internal bug filed
 against me to fix those particular issues.
 
  However, that doesn't answer at least one of my questions
 which was how do I pass information into the driver _before_
 it does the device probe?
 
 I did answer the question, I said that if you fix the real actual
 core problem then you won't have this need to begin with.
 
 I thought I made that perfectly clear.
 
 I really am not going to entertain arguments of the form it's
 too hard to implement this correctly so I'm going to try
 and slam a module parameter into the driver to fix things.
 



RE: [PATCH net-next 1/3] cxgb4: Add support for loopback between VI of same port

2015-05-22 Thread Casey Leedom
| From: David Miller [da...@davemloft.net]
| Sent: Thursday, May 21, 2015 12:30 PM
| 
| The prevailing assumption is that it's OK to have configuration
| settings that can't be undone.
| 
| And that's bogus from the beginning.

  Oh I definitely understand that and agree.  Unfortunately I've
inherited a driver architecture that makes that ... difficult
for many operations ...  And I have an internal bug filed
against me to fix those particular issues.

  However, that doesn't answer at least one of my questions
which was how do I pass information into the driver _before_
it does the device probe?  In this case, telling it to _not_
attempt to attached to the chip firmware in order to debug,
load firmware, etc.?

  And, I still need to know what mechanism we need to use
to tell the driver to use one kind of transmit functionality
or another.  [[And in this case, we _can_ switch back and
forth at will.]]

Casey


RE: [PATCH net-next 1/3] cxgb4: Add support for loopback between VI of same port

2015-05-21 Thread Casey Leedom
| From: David Miller [da...@davemloft.net]
| Sent: Sunday, May 17, 2015 8:46 PM
| 
|  From: Hariprasad Shenai haripra...@chelsio.com
|  Date: Sun, 17 May 2015 16:15:21 +0530
|  
|  We now have a new cxgb4 module parameter tx_vf that when set
|  to a non-zero value, causes cxgb4 to use the Virtual Machine version
|  of the firmware Ethernet TX Packet Work Request (FW_ETH_TX_PKT_VM_WR)
|  instead of the normal default non-VM Work Request (FW_ETH_TX_PKT_WR).
| 
| Sorry, module parameters are verboten.
| 
| Especially for settings like this which are guaranteed to be
| interesting for other NIC drivers, not just your's.
| 
| I'm really tired of explaining this to driver authors.  Just
| don't even try to push a module parameter past me.

I definitely understand the issue of wanting to avoid randomly different module 
parameters in various drivers which do similar things.  What we're looking for 
is a list of the acceptable ways for doing things — especially when they don't 
fit current ethtool/ioctl() mechanisms.

  A couple of specific examples:

 1. We need to load the driver and tell it _not_ to attempt to contact firmware 
on
the adapter.  This is typically used to load firmware on a brand new 
adapter or
debug a problem with the adapter without changing its existing state.  This 
need
presents an awkward problem because we need to have the driver know from
the very start that it shouldn't try to communicate with the firmware, 
while our
normal PCI probe() does in fact contact the firmware as part of the probe 
...

 2. This patch: We have the ability to use two fundamentally different TX Work
Requests — one which causes the adapter firmware to check for local
loopback delivery targets and one which doesn't.  Unlike the first example,
this can be specified long after the adapter probe operation but it's 
unclear
if there's any current ethtool/ioctl() which can be used for this.  Should 
we
suggest a new ethtool operation like TX Method?

More generally, is there a document somewhere which already covers the 
suggested mechanisms for passing parameter information into network drivers for 
different cases so we don't send in patch requests which waste people's time?

Casey