Hi Klement,

The docker containers are orchestrated on a Nomad cluster that is located in Vexxhost along with the CSIT testbeds.  The containers are not pinned to CPUs because the Jenkins Nomad plugin did not allow it (so this is historical). Since the CI was recently migrated to GitHub Actions with remote runners running on the same Nomad cluster, we might now be able to do CPU pinning.  Please reach out to Peter Mikus for his input on how this might be added, as he is the author of the Nomad GHA dispatcher and the primary maintainer of the Nomad cluster.

Note that most of the issues arise on the debian12 container, because that is the only instance where make test is run with multi-worker enabled on VPP.

I noticed that there were still failures with the stack of gerrit changes that you submitted. For now, to unblock the CI, I submitted a patch (46009) with tag_fixme_debian12 for the failing testcases.  To verify a fix for this issue, the tag should be removed from all testcases that are skipped by it.

Thanks for your efforts to fix this issue.
-daw-

On 6/5/26 1:02 AM, Klement Sekera via lists.fd.io wrote:
Hi,

Ok that sheds some light. Are you saying that in CI there is a box which runs N 
dockers at once and inside each of these there is a make test job? Are these 
dockers CPU pinned? If not, then that would be my first suspect - more than 
one docker on the same physical CPU with the test framework doing more pinning 
than before could make it more prone to timing issues.
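
One way to answer the pinning question from inside a running container is to compare the CPUs the process may run on with the CPUs it can see. The following is only a sketch: the function name `cpu_report` is hypothetical, `os.sched_getaffinity` is Linux-only, and a restricted affinity set reflects cpuset-style pinning only (a CFS quota would not show up here).

```python
import os


def cpu_report():
    """Summarize CPU visibility and affinity for the current process.

    Hypothetical helper, not part of the VPP test framework.
    """
    visible = os.cpu_count()
    if hasattr(os, "sched_getaffinity"):
        # Set of CPU ids this process is allowed to be scheduled on.
        allowed = sorted(os.sched_getaffinity(0))
    else:
        # Platform without affinity support: nothing to report.
        allowed = None
    restricted = (
        allowed is not None and visible is not None and len(allowed) < visible
    )
    return {
        "visible_cpus": visible,
        "allowed_cpus": allowed,
        "restricted": restricted,
    }


if __name__ == "__main__":
    print(cpu_report())
```

If `restricted` is False while many containers share the host, they can all be scheduled onto the same physical CPUs, which is the scenario suspected above.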

When tuning the patches I was sometimes hitting reproducible failures in CI and 
I think I saw one of the logs mentioning a CPU frequency of 0.3GHz which I found 
dubious, while at the same time it showed 128 available CPUs. I couldn’t 
understand why it uses TEST_JOBS=4. But if the CI runs N of these at once, then 
that would make some sense. Maybe we could try sprinting in serial instead of 
walking in parallel to see if patterns emerge?

Thanks,
Klement

On 5 Jun 2026, at 03:59, Dave Wallace via 
lists.fd.io<[email protected]> wrote:

Klement,

The failures are not random, they are intermittent.  A number of tests are 
being skipped (e.g. @tag_fixme_debian12), because they have repeatedly failed 
in an intermittent and un-reproducible pattern in the CI on non-related 
patchsets.

Previous investigations failed to reproduce the issue when run over 1000s of 
iterations on individual servers (both bare-metal and inside the docker 
executor containers used in the CI).  I have long suspected that there are 
'noisy neighbor' cpu pinning issues when a large number of docker containers 
running verify jobs are packed onto a single nomad client.

For the past several months, the number of non-related intermittent job 
failures has been very low since the 'usual suspects' were elided from running 
on debian 12 where the majority of said failures had been occurring.  For 
whatever reason, the latest 'make test' changes have exacerbated the issue.

All of the '@tag_fixme_*' testcases which are elided from per-patch testing 
represent technical debt which has been neglected for a very long time.  Any 
help you can provide to address this technical debt is most appreciated.

Thanks,
-daw-

On 6/4/26 13:43, Klement Sekera via lists.fd.io wrote:
Hey,

Could also be this one: https://gerrit.fd.io/r/c/vpp/+/45918/5

These patches don’t really change test behavior, only scheduling. Before 45918, 
with TEST_JOBS > 1, the pipeline would get underutilized “randomly”, due to 
scheduling at most one test class per finished test class. So if a 1-cpu class 
followed a 4-cpu class, then 3 cpus would sit idle. With this patch, the pipeline is 
refilled properly.
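
The underutilization described above can be illustrated with a toy simulation. This is a minimal sketch and not the actual test framework code: `makespan`, its parameters, the FIFO ordering, and the greedy initial fill are all assumptions made for illustration. Each test class is modelled as a `(cpus, duration)` pair.

```python
import heapq
from collections import deque


def makespan(classes, total_cpus, refill_fully):
    """Return the time at which the last test class finishes.

    classes: list of (cpus, duration) pairs, scheduled in FIFO order.
    refill_fully=False starts at most one class per finished class;
    refill_fully=True starts as many queued classes as fit.
    """
    if any(cpus > total_cpus for cpus, _ in classes):
        raise ValueError("a test class needs more cpus than available")

    pending = deque(classes)
    running = []  # min-heap of (finish_time, cpus)
    free = total_cpus
    now = 0

    def start_fitting(limit):
        nonlocal free
        started = 0
        # Start queued classes in order while the head of the queue fits.
        while pending and started < limit and pending[0][0] <= free:
            cpus, duration = pending.popleft()
            free -= cpus
            heapq.heappush(running, (now + duration, cpus))
            started += 1

    start_fitting(len(pending))  # assumed: initial fill is greedy
    while running:
        now, cpus = heapq.heappop(running)
        free += cpus
        start_fitting(len(pending) if refill_fully else 1)
    return now


# One 4-cpu class followed by four 1-cpu classes, each taking 1 time unit.
classes = [(4, 1)] + [(1, 1)] * 4
print(makespan(classes, total_cpus=4, refill_fully=False))  # 5
print(makespan(classes, total_cpus=4, refill_fully=True))   # 2
```

In this toy model the one-per-finish policy runs the four 1-cpu classes back to back, effectively serializing them, while the refill policy runs them concurrently.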

I also noticed that any extra cpus (like for vcl tests) were unaccounted for - 
that’s what the later patch fixes.

If the failures are “random”, then it means the tests are flaky and either need 
to be fixed or marked for solo run as a temporary(?!) measure. I’d bet $.25 that 
these were fake-solo-run before due to pipeline underutilization.

Regards,
Klement

On 4 Jun 2026, at 19:30, Dave Wallace via 
lists.fd.io<[email protected]> wrote:
Ole/Klement,

Can you please help triage these new intermittent / non-patch related test 
failures?

The frequency of intermittent / non-patch related test failures has spiked in 
the CI ever since Ole merged the batch of Klement's test updates in gerrit.

Here's some more that I encountered on my CI monitoring gerrit change [0]:
https://github.com/FDio/vpp/actions/runs/26828951273/job/79105790972
https://github.com/FDio/vpp/actions/runs/26916331530/job/79436593001

Thanks,
-daw-
[0] https://gerrit.fd.io/r/c/vpp/+/45941

On 6/4/26 11:58, Matus Fabian -X (matfabia - PANTHEON TECHNOLOGIES@Cisco) via 
lists.fd.io wrote:
Hi,

Today I noticed an excess of random make test failures in CI, not related to 
the patch under test, across different jobs on a couple of patches.
Some examples:
https://github.com/FDio/vpp/actions/runs/26952741597/job/79522258003
https://github.com/FDio/vpp/actions/runs/26935962088/job/79468040688
https://github.com/FDio/vpp/actions/runs/26937415597/job/79470773670
https://github.com/FDio/vpp/actions/runs/26937415597/job/79470773732

Regards,
Matus