Public bug reported:

SRU Justification:

[Impact]

This is reproducible on systems that already carry heavy background
traffic, when the user additionally runs one of the two docker pulls below:
docker pull nvcr.io/ea-doca-hbn/hbn/hbn:latest
OR
docker pull gitlab-master.nvidia.com:5005/dl/dgx/tritonserver:22.02-py3-qa

The second image is a very large container (17 GB).

While docker pull is running, the OOB interface stops responding to ping,
and the pull either stalls for a very long time (more than 3 minutes) or
times out.

[Fix]

* Update the RX_CQE_CI before updating the RX_PI to avoid a race condition
where we wrongly inform the HW that there is space for a WQE.
* Disable the RX DMA while we are handling incoming packets to avoid overflow.

[Test Case]

* Run a script that loops 200 times and does a docker pull in each iteration:
docker pull nvcr.io/ea-doca-hbn/hbn/hbn:latest
OR
docker pull gitlab-master.nvidia.com:5005/dl/dgx/tritonserver:22.02-py3-qa
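
The loop described above could be sketched as follows. This is only an
illustration, not the script actually used for verification: the function
name pull_loop, the DOCKER override variable, and the "docker rmi" step
(added so that each iteration downloads the image again instead of hitting
the local cache) are all assumptions.

```shell
#!/bin/sh
# Hypothetical sketch of the test loop: pull IMAGE repeatedly, count failures.
# Usage: pull_loop IMAGE [ITERATIONS]   (ITERATIONS defaults to 200)
# DOCKER may be set to substitute another binary (assumption, for dry runs).
pull_loop() {
    image=$1
    iterations=${2:-200}
    failures=0
    i=1
    while [ "$i" -le "$iterations" ]; do
        # Assumption: drop any cached copy so the pull really uses the network.
        "${DOCKER:-docker}" rmi -f "$image" >/dev/null 2>&1 || true
        if ! "${DOCKER:-docker}" pull "$image" >/dev/null 2>&1; then
            failures=$((failures + 1))
            echo "iteration $i: pull failed" >&2
        fi
        i=$((i + 1))
    done
    # Summary on stdout; the return status is zero only if every pull succeeded.
    echo "failures=$failures/$iterations"
    [ "$failures" -eq 0 ]
}
```

For example, "pull_loop nvcr.io/ea-doca-hbn/hbn/hbn:latest 200" would run
the first variant. Pinging the OOB interface from another host while the
loop runs makes the loss of connectivity described under [Impact] visible.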

[Regression Potential]

* This could result in slower packet handling, since we disable and
re-enable the RX DMA periodically.
* Although this fix has been tested by the people who opened the bug, QA needs
to test it thoroughly to confirm the issue is no longer reproducible.

** Affects: linux-bluefield (Ubuntu)
     Importance: Undecided
         Status: New

https://bugs.launchpad.net/bugs/1964984

Title:
  Fix OOB handling RX packets in heavy traffic
