Public bug reported:
SRU Justification
[Impact]
NVIDIA runs into an issue building DGX Cloud GB300 clusters. The
dst_fetch_ha() checks nud_state without holding the neighbor lock may
result on the incorrect/zero mac being read and programmed in the device
QP while it was not yet updated. This causes silent packet loss and
eventual RETRY_EXC_ERR.
The upstream commit 5e6de34d82b4 (“IB/core: Fix zero dmac race in
neighbor resolution”) solved this problem.
[Fix]
Questing-aws, Resolute-aws and Noble-aws-6.17:
* cherry-pick 5e6de34d82b4 ("IB/core: Fix zero dmac race in neighbor
resolution")
[Test Plan]
* Compile and boot test.
* NVIDIA/AWS test.
[Regression potential]
* The commit add a lock in one line. Regression likliehood is slim.
[Other info]
* SF#00437493
** Affects: linux-aws (Ubuntu)
Importance: Undecided
Status: New
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2154023
Title:
DGX Cloud GB300 clusters fail to build due zero dmac race in neighbor
resolution
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux-aws/+bug/2154023/+subscriptions
--
ubuntu-bugs mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs