Hi Dave,

I'd say this explains why improving the test framework's CPU utilisation makes things worse. If the runtime environment is unpredictable, bugs (and I believe we are seeing bugs) will get exposed more often.

I'm not sure how I can help with anything concerning nomad or vexxhost - I don't know what these names mean. I could guess, but I'm not going to. My current email environment doesn't suggest a contact for Peter Mikus, so I'm leaving that up to you.

What could help me move forward with exploring all this is the ability to retrieve test artifacts from the CI test runs - how does one do that? I see the logs, and they do mention gzipping artifacts, but I don't see a way of getting hold of them.

It might also be useful to merge https://gerrit.fd.io/r/c/vpp/+/46002 so that we get coredump visibility under systemd.

Thanks,
Klement

On Sat, Jun 6, 2026 at 3:51 AM Dave Wallace via lists.fd.io <dwallacelf= [email protected]> wrote:

> Hi Klement,
>
> The docker containers are orchestrated on a Nomad cluster that is located in
> Vexxhost along with the CSIT testbeds. The containers are not pinned to
> cpus due to the fact that the Jenkins Nomad plugin did not allow it (thus
> this is historical). Since the CI was recently migrated to github actions
> with remote runners running on the same Nomad cluster, we might be able to
> do cpu pinning. Please reach out to Peter Mikus for his input on how this
> might be added as he is the author of the Nomad GHA dispatcher and primary
> maintainer of the Nomad cluster.
>
> Note that most of the issues arise on the debian12 container, because that
> is the only instance where make test is run with multi-worker enabled on
> VPP.
>
> I noticed that there were still failures with the stack of gerrit changes
> that you submitted. Thus for now, to unblock the CI, I submitted a patch
> with tag_fixme_debian12 for the failing testcases (46009). In order to
> verify the fix for this issue, the tag should be removed from all testcases
> that are skipped using it.
>
> Thanks for your efforts to fix this issue.
> -daw-
>
> On 6/5/26 1:02 AM, Klement Sekera via lists.fd.io wrote:
>
> Hi,
>
> Ok that sheds some light.
> Are you saying that in CI there is a box which runs
> N dockers at once and inside each of these there is a make test job? Are
> these dockers CPU pinned? If not, then that would be my first suspect -
> more than one docker on the same physical CPU with the test framework doing
> more pinning than before could make it more prone to timing issues.
>
> When tuning the patches I was sometimes hitting reproducible failures in CI
> and I think I saw one of the logs mentioning CPU frequency of 0.3GHz which I
> found dubious, while at the same time it showed 128 available CPUs. I
> couldn’t understand why it uses TEST_JOBS=4. But if the CI runs N of these at
> once, then that would make some sense. Maybe we could try sprinting in serial
> instead of walking in parallel to see if patterns emerge?
>
> Thanks,
> Klement
>
>
> On 5 Jun 2026, at 03:59, Dave Wallace via lists.fd.io
> <[email protected]> <[email protected]> wrote:
>
> Klement,
>
> The failures are not random, they are intermittent. A number of tests are
> being skipped (e.g. @tag_fixme_debian12), because they have repeatedly failed
> in an intermittent and un-reproducible pattern in the CI on non-related
> patchsets.
>
> Previous investigations failed to reproduce the issue when run over 1000s of
> iterations on individual servers (both bare-metal and inside the docker
> executor containers used in the CI). I have long suspected that there are
> 'noisy neighbor' cpu pinning issues when a large number of docker containers
> running verify jobs are packed onto a single nomad client.
>
> For the past several months, the number of non-related intermittent job
> failures has been very low since the 'usual suspects' were elided from
> running on debian 12 where the majority of said failures had been occurring.
> For whatever reason, the latest 'make test' changes have exacerbated the
> issue.
>
> All of the '@tag_fixme_*' testcases which are elided from per-patch testing
> represent technical debt which has been neglected for a very long time. Any
> help you can provide to address this technical debt is most appreciated.
>
> Thanks,
> -daw-
>
>
> On 6/4/26 13:43, Klement Sekera via lists.fd.io wrote:
> Hey,
>
> Could also be this one:
> https://gerrit.fd.io/r/c/vpp/+/45918/5
>
> These patches don’t really change test behavior, only scheduling. Before
> 45918, with TEST_JOBS > 1, the pipeline would get underutilized “randomly”,
> due to scheduling at most one test class per finished test class. So if a
> 1-cpu class followed a 4-cpu class, then 3 cpus would sit idle. With this
> patch, the pipeline is refilled properly.
>
> I also noticed that any extra cpus (like for vcl tests) were unaccounted for
> - that’s what the later patch fixes.
>
> If the failures are “random”, then it means the tests are flaky and either
> need to be fixed or marked for solo run as a temporary(?!) measure. I’d bet $.25
> that these were fake-solo-run before due to pipeline underutilization.
>
> Regards,
> Klement
>
>
> On 4 Jun 2026, at 19:30, Dave Wallace via lists.fd.io
> <[email protected]> <[email protected]> wrote:
>
> Ole/Klement,
>
> Can you please help triage these new intermittent / non-patch related test
> failures?
>
> The frequency of intermittent / non-patch related test failures has spiked in
> the CI ever since Ole merged the batch of Klement's test updates in gerrit.
>
> Here are some more that I encountered on my CI monitoring gerrit change [0]:
> https://github.com/FDio/vpp/actions/runs/26828951273/job/79105790972
> https://github.com/FDio/vpp/actions/runs/26916331530/job/79436593001
>
> Thanks,
> -daw-
> [0] https://gerrit.fd.io/r/c/vpp/+/45941
>
> On 6/4/26 11:58, Matus Fabian -X (matfabia - PANTHEON TECHNOLOGIES@Cisco) via
> lists.fd.io wrote:
> Hi,
>
> Today I noticed excess random failures, not related to patch, of make test in
> CI across different jobs on a couple of patches.
>
> Some examples:
> https://github.com/FDio/vpp/actions/runs/26952741597/job/79522258003
> https://github.com/FDio/vpp/actions/runs/26935962088/job/79468040688
> https://github.com/FDio/vpp/actions/runs/26937415597/job/79470773670
> https://github.com/FDio/vpp/actions/runs/26937415597/job/79470773732
>
> Regards,
> Matus
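
The scheduling difference described in the thread for change 45918 (starting at most one test class per finished test class, versus refilling the pipeline) can be illustrated with a toy simulation. This is only a sketch and not code from the VPP test framework: the function name simulate, the refill_fully flag, the strict FIFO order and the job durations are all assumptions made for illustration.

```python
import heapq
from collections import deque


def simulate(jobs, total_cpus, refill_fully):
    """Run (cpus, duration) jobs in FIFO order; return (makespan, idle_cpu_time).

    Hypothetical model: refill_fully=False starts at most one queued job per
    finished job, refill_fully=True keeps starting jobs while they fit.
    """
    if any(cpus > total_cpus for cpus, _ in jobs):
        raise ValueError("a job needs more CPUs than are available")

    queue = deque(jobs)
    running = []  # min-heap of (finish_time, cpus)
    free = total_cpus
    now = 0
    idle = 0

    def start_jobs(limit):
        # Start queued jobs from the head of the queue while they fit,
        # stopping after `limit` jobs (None means no limit).
        nonlocal free
        started = 0
        while queue and queue[0][0] <= free and (limit is None or started < limit):
            cpus, duration = queue.popleft()
            free -= cpus
            heapq.heappush(running, (now + duration, cpus))
            started += 1

    start_jobs(None)  # both policies fill the pipeline at the start
    while running:
        finish, cpus = heapq.heappop(running)
        idle += free * (finish - now)  # CPUs that sat unused until this event
        now = finish
        free += cpus
        start_jobs(None if refill_fully else 1)
    return now, idle


# One 4-cpu class followed by four 1-cpu classes, on 4 CPUs.
jobs = [(4, 10), (1, 10), (1, 10), (1, 10), (1, 10)]
print(simulate(jobs, 4, refill_fully=False))  # (50, 120)
print(simulate(jobs, 4, refill_fully=True))   # (20, 0)
```

In this toy model the one-per-finish policy runs the four 1-cpu classes one at a time with 3 cpus idle, which matches the "fake-solo-run" effect suspected in the thread, while the refill policy runs them concurrently.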
View/Reply Online (#27049): https://lists.fd.io/g/vpp-dev/message/27049
