Public bug reported:

SRU Justification

[Impact]

NVIDIA hit an issue while building DGX Cloud GB300 clusters:
dst_fetch_ha() checks nud_state without holding the neighbor lock, so an
incorrect or zero MAC address can be read and programmed into the device
QP before the neighbor entry has been updated. This causes silent packet
loss and, eventually, RETRY_EXC_ERR.

Upstream commit 5e6de34d82b4 ("IB/core: Fix zero dmac race in neighbor
resolution") fixes this problem.
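
For illustration only, the locking pattern described above can be modelled
in user space. This is a minimal sketch, not the upstream patch and not
kernel code: struct neigh_model, neigh_model_update() and
neigh_model_fetch_ha() are hypothetical names, a pthread mutex stands in
for the neighbor lock, and a boolean stands in for the nud_state validity
check. The point it shows is that the reader tests the state and copies the
address inside the same critical section the writer uses, so it cannot see
a "valid" entry whose address has not been written yet.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <string.h>

#define ETH_ALEN 6

/* Hypothetical user-space stand-in for a neighbor entry (not the kernel struct). */
struct neigh_model {
    pthread_mutex_t lock;       /* models the neighbor lock */
    bool valid;                 /* models the nud_state validity check */
    unsigned char ha[ETH_ALEN]; /* models the hardware (MAC) address */
};

/* Writer: publish the address and the valid state in one critical section. */
static void neigh_model_update(struct neigh_model *n, const unsigned char *mac)
{
    assert(n != NULL && mac != NULL);
    pthread_mutex_lock(&n->lock);
    memcpy(n->ha, mac, ETH_ALEN);
    n->valid = true;
    pthread_mutex_unlock(&n->lock);
}

/*
 * Reader: check the state and copy the address under the same lock.
 * Returns 0 and fills 'out' if the entry is valid, -1 otherwise
 * (in which case 'out' is left untouched).
 */
static int neigh_model_fetch_ha(struct neigh_model *n, unsigned char *out)
{
    int ret = -1;

    assert(n != NULL && out != NULL);
    pthread_mutex_lock(&n->lock);
    if (n->valid) {
        memcpy(out, n->ha, ETH_ALEN);
        ret = 0;
    }
    pthread_mutex_unlock(&n->lock);
    return ret;
}
```

In this model, a caller that gets -1 back would retry later instead of
programming a zero address into the device.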

[Fix]

Questing-aws, Resolute-aws and Noble-aws-6.17:

* cherry-pick 5e6de34d82b4 ("IB/core: Fix zero dmac race in neighbor
resolution")

[Test Plan]

* Compile and boot test.

* NVIDIA/AWS test.

[Regression potential]

* The commit is a one-line change that adds a lock. The likelihood of
regression is low.

[Other info]

* SF#00437493

** Affects: linux-aws (Ubuntu)
     Importance: Undecided
         Status: New

https://bugs.launchpad.net/bugs/2154023

Title:
  DGX Cloud GB300 clusters fail to build due to zero dmac race in neighbor
  resolution


