Re: [PATCH v8 7/7] PCI: Work around PCIe link training failures

2023-06-11 Thread Maciej W. Rozycki
On Thu, 4 May 2023, Bjorn Helgaas wrote:

> We talked about reusing pcie_retrain_link() earlier.  IIRC that didn't
> work: ASPM needs to use PCI_EXP_LNKSTA_LT because not all devices
> support PCI_EXP_LNKSTA_DLLLA, and you need PCI_EXP_LNKSTA_DLLLA
> because the erratum makes PCI_EXP_LNKSTA_LT flap.
> 
> What if we made pcie_retrain_link() reusable by making it:
> 
>   bool pcie_retrain_link(struct pci_dev *pdev, u16 link_status_bit)
> 
> so ASPM could use pcie_retrain_link(link->pdev, PCI_EXP_LNKSTA_LT) and
> you could use pcie_retrain_link(dev, PCI_EXP_LNKSTA_DLLLA)?

 This is somewhat more complicated, because of the inverted logic between 
the two status bits.  Therefore I think a boolean flag is more adequate 
with preparatory logic within the function itself.  This will tighten the 
call interface as well.

> Maybe do it two steps?
> 
>   1) Move pcie_retrain_link() just after pcie_wait_for_link() and make
>   it take link->pdev instead of link.
> 
>   2) Add the bit parameter.

 Having compared the two pieces side by side now I think it makes sense.  
While there are minor differences, most prominently the original code is 
more aggressive than mine in polling the status bit, I think these details 
are not significant enough to argue over them here.  And we can consider 
switching to more modern `usleep_range' interface separately.

 A minor pessimisation resulting is that LNKSTA has to be reread in the 
caller after return from `pcie_retrain_link'; previously the last value 
read in the poll loop could have been reused.

> I'm OK with having pcie_retrain_link() in pci.c, but the surrounding
> logic about restricting to 2.5GT/s, retraining, removing the
> restriction, retraining again is stuff I'd rather have in quirks.c so
> it doesn't clutter pci.c.

 Well, it was there in quirks.c originally and I only moved this piece 
following your earlier suggestion:

> If we think the first part (attempting the retrain) is safe to do
> whenever the link is down, maybe we should do that more directly as
> part of the PCI core instead of as a quirk?

as in here: , 
though if you did change your mind after all, I can move it back, sure.  
It's not always that the first thought is the best, or sometimes good at 
all.

> I think it'd be good if the pci_device_add() path made clear that this
> is a workaround for a problem, e.g.,
> 
>   void pci_device_add(struct pci_dev *dev, struct pci_bus *bus)
>   {
> ...
> if (pcie_link_failed(dev))
>   pcie_fix_link_train(dev);
> 
> where pcie_fix_link_train() could live in quirks.c (with a stub when
> CONFIG_PCI_QUIRKS isn't enabled).  It *might* even be worth adding it
> and the stub first because that's a trivial patch and wouldn't clutter
> the probe.c git history with all the grotty details about ASM2824 and
> this topology.

 I have added a stub now, both as an intermediate step and ultimately for 
!PCI_QUIRKS, but I disagree about having the check in pci.c and the fix in 
quirks.c, because from the code structure's point of view it makes no 
sense IMHO to have the check enabled and the fix disabled both at a time 
for !PCI_QUIRKS, even if the compiler would actually optimise the check 
away in that case.

 Please let me know if you maintain your suggestion and if so, then why 
you find it so important.  I think with the use of `pcie_retrain_link' 
this code has become straightforward enough not to need to be split or 
factored out any further (and while factoring out the conditionals only 
would make some sense to me, it would require duplicating configuration 
register accesses even further).

> > +int pcie_downstream_link_retrain(struct pci_dev *dev)
> > +{
> > +   static const struct pci_device_id ids[] = {
> > +   { PCI_VDEVICE(ASMEDIA, 0x2824) }, /* ASMedia ASM2824 */
> > +   {}
> > +   };
> > +   u16 lnksta, lnkctl2;
> > +
> > +   if (!pci_is_pcie(dev) || !pcie_downstream_port(dev) ||
> > +   !pcie_cap_has_lnkctl2(dev) || !dev->link_active_reporting)
> > +   return -1;
> > +
> > +   pcie_capability_read_word(dev, PCI_EXP_LNKCTL2, );
> > +   pcie_capability_read_word(dev, PCI_EXP_LNKSTA, );
> > +   if ((lnksta & (PCI_EXP_LNKSTA_LBMS | PCI_EXP_LNKSTA_DLLLA)) ==
> > +   PCI_EXP_LNKSTA_LBMS) {
> 
> You go to some trouble to make sure PCI_EXP_LNKSTA_LBMS is set, and I
> can't remember what the reason is.  If you make a preparatory patch
> like this, it would give a place for that background, e.g.,

 It has been already documented along with the code in question:

 * With the ASM2824 we can rely on the otherwise optional Data Link Layer
 * Link Active status bit and in the failed link training scenario it will
 * be off along with the Link Bandwidth Management Status indicating that
 * hardware has changed the link speed or width in an attempt to correct
 * unreliable link operation.  For a port that has been left unconnected
 * both bits will be clear.  [...]

>  

Re: [PATCH v8 7/7] PCI: Work around PCIe link training failures

2023-05-14 Thread Maciej W. Rozycki
On Sun, 7 May 2023, Maciej W. Rozycki wrote:

> > We're going to land this series this cycle, come hell or high water.
> 
>  Thank you for coming back to me and for your promise.  I'll strive to 
> address your concerns next weekend.
> 
>  Unfortunately a PDU in my remote lab has botched up and I've lost control
> over it (thankfully not one for the RISC-V machine affected by the patch 
> series, so I can still manage it for reboots, etc., but the botched PDU is 
> actually upstream), so depending on how situation develops I may have to 
> book air travel instead and spend the whole weekend getting things back to 
> normal operation at my lab.  That unit was not supposed to fail, not in 
> such a silly way anyway, sigh...

 Last Thu the situation with the PDU became critical, so I spent a better 
part of yesterday and today travelling and then all night long getting 
things sorted.  So it'll have to be next weekend when I get back to these 
patches.  I hope we can still make it regardless.

  Maciej


Re: [PATCH v8 7/7] PCI: Work around PCIe link training failures

2023-05-07 Thread Maciej W. Rozycki
On Thu, 4 May 2023, Bjorn Helgaas wrote:

> On Thu, Apr 06, 2023 at 01:21:31AM +0100, Maciej W. Rozycki wrote:
> > Attempt to handle cases such as with a downstream port of the ASMedia 
> > ASM2824 PCIe switch where link training never completes and the link 
> > continues switching between speeds indefinitely with the data link layer 
> > never reaching the active state.
> 
> We're going to land this series this cycle, come hell or high water.

 Thank you for coming back to me and for your promise.  I'll strive to 
address your concerns next weekend.

 Unfortunately a PDU in my remote lab has botched up and I've lost control
over it (thankfully not one for the RISC-V machine affected by the patch 
series, so I can still manage it for reboots, etc., but the botched PDU is 
actually upstream), so depending on how situation develops I may have to 
book air travel instead and spend the whole weekend getting things back to 
normal operation at my lab.  That unit was not supposed to fail, not in 
such a silly way anyway, sigh...

  Maciej


Re: [PATCH v8 7/7] PCI: Work around PCIe link training failures

2023-05-04 Thread Bjorn Helgaas
On Thu, Apr 06, 2023 at 01:21:31AM +0100, Maciej W. Rozycki wrote:
> Attempt to handle cases such as with a downstream port of the ASMedia 
> ASM2824 PCIe switch where link training never completes and the link 
> continues switching between speeds indefinitely with the data link layer 
> never reaching the active state.

We're going to land this series this cycle, come hell or high water.

We talked about reusing pcie_retrain_link() earlier.  IIRC that didn't
work: ASPM needs to use PCI_EXP_LNKSTA_LT because not all devices
support PCI_EXP_LNKSTA_DLLLA, and you need PCI_EXP_LNKSTA_DLLLA
because the erratum makes PCI_EXP_LNKSTA_LT flap.

What if we made pcie_retrain_link() reusable by making it:

  bool pcie_retrain_link(struct pci_dev *pdev, u16 link_status_bit)

so ASPM could use pcie_retrain_link(link->pdev, PCI_EXP_LNKSTA_LT) and
you could use pcie_retrain_link(dev, PCI_EXP_LNKSTA_DLLLA)?

Maybe do it two steps?

  1) Move pcie_retrain_link() just after pcie_wait_for_link() and make
  it take link->pdev instead of link.

  2) Add the bit parameter.

I'm OK with having pcie_retrain_link() in pci.c, but the surrounding
logic about restricting to 2.5GT/s, retraining, removing the
restriction, retraining again is stuff I'd rather have in quirks.c so
it doesn't clutter pci.c.

I think it'd be good if the pci_device_add() path made clear that this
is a workaround for a problem, e.g.,

  void pci_device_add(struct pci_dev *dev, struct pci_bus *bus)
  {
...
if (pcie_link_failed(dev))
  pcie_fix_link_train(dev);

where pcie_fix_link_train() could live in quirks.c (with a stub when
CONFIG_PCI_QUIRKS isn't enabled).  It *might* even be worth adding it
and the stub first because that's a trivial patch and wouldn't clutter
the probe.c git history with all the grotty details about ASM2824 and
this topology.

> +int pcie_downstream_link_retrain(struct pci_dev *dev)
> +{
> + static const struct pci_device_id ids[] = {
> + { PCI_VDEVICE(ASMEDIA, 0x2824) }, /* ASMedia ASM2824 */
> + {}
> + };
> + u16 lnksta, lnkctl2;
> +
> + if (!pci_is_pcie(dev) || !pcie_downstream_port(dev) ||
> + !pcie_cap_has_lnkctl2(dev) || !dev->link_active_reporting)
> + return -1;
> +
> + pcie_capability_read_word(dev, PCI_EXP_LNKCTL2, );
> + pcie_capability_read_word(dev, PCI_EXP_LNKSTA, );
> + if ((lnksta & (PCI_EXP_LNKSTA_LBMS | PCI_EXP_LNKSTA_DLLLA)) ==
> + PCI_EXP_LNKSTA_LBMS) {

You go to some trouble to make sure PCI_EXP_LNKSTA_LBMS is set, and I
can't remember what the reason is.  If you make a preparatory patch
like this, it would give a place for that background, e.g.,

  +bool pcie_link_failed(struct pci_dev *dev)
  +{
  +   u16 lnksta;
  +
  +   if (!pci_is_pcie(dev) || !pcie_downstream_port(dev) ||
  +   !pcie_cap_has_lnkctl2(dev) || !dev->link_active_reporting)
  +   return false;
  +
  +   pcie_capability_read_word(dev, PCI_EXP_LNKSTA, );
  +   if ((lnksta & (PCI_EXP_LNKSTA_LBMS | PCI_EXP_LNKSTA_DLLLA)) ==
  +   PCI_EXP_LNKSTA_LBMS)
  +   return true;
  +
  +   return false;
  +}

If this is a generic thing and checking PCI_EXP_LNKSTA_LBMS makes
sense for everybody, it could go in pci.c; otherwise it could go in
quirks.c as well.  I guess it's not *truly* generic anyway because it
only detects link training failures for devices that have LNKCTL2 and
link_active_reporting.

> + unsigned long timeout;
> + u16 lnkctl;
> +
> + pci_info(dev, "broken device, retraining non-functional 
> downstream link at 2.5GT/s\n");
> +
> + pcie_capability_read_word(dev, PCI_EXP_LNKCTL, );
> + lnkctl |= PCI_EXP_LNKCTL_RL;
> + lnkctl2 &= ~PCI_EXP_LNKCTL2_TLS;
> + lnkctl2 |= PCI_EXP_LNKCTL2_TLS_2_5GT;
> + pcie_capability_write_word(dev, PCI_EXP_LNKCTL2, lnkctl2);
> + pcie_capability_write_word(dev, PCI_EXP_LNKCTL, lnkctl);
> + /*
> +  * Due to an erratum in some devices the Retrain Link bit
> +  * needs to be cleared again manually to allow the link
> +  * training to succeed.
> +  */
> + lnkctl &= ~PCI_EXP_LNKCTL_RL;
> + if (dev->clear_retrain_link)
> + pcie_capability_write_word(dev, PCI_EXP_LNKCTL,
> +lnkctl);
> +
> + timeout = jiffies + PCIE_LINK_RETRAIN_TIMEOUT;
> + do {
> + pcie_capability_read_word(dev, PCI_EXP_LNKSTA,
> +  );
> + if (lnksta & PCI_EXP_LNKSTA_DLLLA)
> + break;
> + usleep_range(1, 2);
> + } while (time_before(jiffies, timeout));
> +
> + if (!(lnksta & PCI_EXP_LNKSTA_DLLLA)) {
> + pci_info(dev, "retraining 

[PATCH v8 7/7] PCI: Work around PCIe link training failures

2023-04-05 Thread Maciej W. Rozycki
Attempt to handle cases such as with a downstream port of the ASMedia 
ASM2824 PCIe switch where link training never completes and the link 
continues switching between speeds indefinitely with the data link layer 
never reaching the active state.

It has been observed with a downstream port of the ASMedia ASM2824 Gen 3 
switch wired to the upstream port of the Pericom PI7C9X2G304 Gen 2 
switch, using a Delock Riser Card PCI Express x1 > 2 x PCIe x1 device, 
P/N 41433, wired to a SiFive HiFive Unmatched board.  In this setup the 
switches are supposed to negotiate the link speed of preferably 5.0GT/s, 
falling back to 2.5GT/s.

Instead the link continues oscillating between the two speeds, at the 
rate of 34-35 times per second, with link training reported repeatedly 
active ~84% of the time.  Forcibly limiting the target link speed to 
2.5GT/s with the upstream ASM2824 device however makes the two switches 
communicate correctly.  Removing the speed restriction afterwards makes 
the two devices switch to 5.0GT/s then.

Make use of these observations then and detect the inability to train 
the link, by checking for the Data Link Layer Link Active status bit 
being off while the Link Bandwidth Management Status indicating that 
hardware has changed the link speed or width in an attempt to correct 
unreliable link operation.

Restrict the speed to 2.5GT/s then with the Target Link Speed field, 
request a retrain and wait 200ms for the data link to go up.  If this 
turns out successful, then lift the restriction, letting the devices 
negotiate a higher speed.

Also check for a 2.5GT/s speed restriction the firmware may have already 
arranged and lift it too with ports of devices known to continue working 
afterwards, currently the ASM2824 only, that already report their data 
link being up.

Signed-off-by: Maciej W. Rozycki 
Link: 
https://lore.kernel.org/r/alpine.deb.2.21.2203022037020.56...@angie.orcam.me.uk/
Link: https://source.denx.de/u-boot/u-boot/-/commit/a398a51ccc68
---
No changes from v7.

Changes from v6:

- Regenerate against 6.3-rc5.

- Shorten the lore.kernel.org archive link in the change description.

Changes from v5:

- Move from a quirk into PCI core and call at device probing, hot-plug,
  reset and resume.  Keep the ASMedia part under CONFIG_PCI_QUIRKS.

- Rely on `dev->link_active_reporting' rather than re-retrieving the 
  capability.

Changes from v4:

- Remove  inclusion no longer needed.

- Make the quirk generic based on probing device features rather than 
  specific to the ASM2824 part only; take the Retrain Link bit erratum 
  into account.

- Still lift the 2.5GT/s speed restriction with the ASM2824 only.

- Increase retrain timeout from 200ms to 1s (PCIE_LINK_RETRAIN_TIMEOUT).

- Remove retrain success notification.

- Use PCIe helpers rather than generic PCI functions throughout.

- Trim down and update the wording of the change description for the 
  switch from an ASM2824-specific to a generic fixup.

Changes from v3:

- Remove the  entry for the ASM2824.

Changes from v2:

- Regenerate for 5.17-rc2 for a merge conflict.

- Replace BUG_ON for a missing PCI Express capability with WARN_ON and an
  early return.

Changes from v1:

- Regenerate for a merge conflict.
---
 drivers/pci/pci.c   |  154 ++--
 drivers/pci/pci.h   |1 
 drivers/pci/probe.c |2 
 3 files changed, 152 insertions(+), 5 deletions(-)

linux-pcie-asm2824-manual-retrain.diff
Index: linux-macro/drivers/pci/pci.c
===
--- linux-macro.orig/drivers/pci/pci.c
+++ linux-macro/drivers/pci/pci.c
@@ -859,6 +859,132 @@ int pci_wait_for_pending(struct pci_dev
return 0;
 }
 
+/*
+ * Retrain the link of a downstream PCIe port by hand if necessary.
+ *
+ * This is needed at least where a downstream port of the ASMedia ASM2824
+ * Gen 3 switch is wired to the upstream port of the Pericom PI7C9X2G304
+ * Gen 2 switch, and observed with the Delock Riser Card PCI Express x1 >
+ * 2 x PCIe x1 device, P/N 41433, plugged into the SiFive HiFive Unmatched
+ * board.
+ *
+ * In such a configuration the switches are supposed to negotiate the link
+ * speed of preferably 5.0GT/s, falling back to 2.5GT/s.  However the link
+ * continues switching between the two speeds indefinitely and the data
+ * link layer never reaches the active state, with link training reported
+ * repeatedly active ~84% of the time.  Forcing the target link speed to
+ * 2.5GT/s with the upstream ASM2824 device makes the two switches talk to
+ * each other correctly however.  And more interestingly retraining with a
+ * higher target link speed afterwards lets the two successfully negotiate
+ * 5.0GT/s.
+ *
+ * With the ASM2824 we can rely on the otherwise optional Data Link Layer
+ * Link Active status bit and in the failed link training scenario it will
+ * be off along with the Link Bandwidth Management Status indicating that
+ *