Hi Klement,
On 6/8/26 10:56 AM, Klement Sekera via lists.fd.io wrote:
Hi Dave,
I'd say this explains why improving CPU utilisation by the test framework makes
things worse. If the runtime environment is unpredictable, bugs (I believe we
are seeing bugs) will get exposed more.
I'm not sure how I can help with anything concerning nomad or vexxhost - I
don't know what these names mean. I could guess, but I'm not going to. My
current email environment doesn't suggest a contact for Peter Mikus, so I'm
leaving that up to you.
Peter's email address is available via gerrit:
https://gerrit.fd.io/r/q/owner:peter.mikus%2540icloud.com
What could help me move forward with exploring all this is the ability to
retrieve test artifacts from the test CI runs - how does one do that? I see
the logs, and they do mention gzipping artifacts, but I don't see a way of
getting a hold of them.
Test artifacts are uploaded to AWS S3 -- for any given workflow, you
will find the link to the artifacts in the "AWS Publish Logs" or "AWS
Publish Artifacts" steps (same link is in both steps). For example, in
the test failure for 46002, this is the URL for the failing debian12
workflow:
https://github.com/FDio/vpp/actions/runs/27020468660/job/79747148480
Expanding the 'AWS Publish Logs' step (scroll down below the 'VPP Make
Test' output) you will find this URL to the artifacts in AWS S3:
https://logs.fd.io/vex-yul-rot-jenkins-1/gha-vpp-gerrit-patchset/verify-master-2026_06_05_142213_UTC-gerrit-46002-3/verify-maketest-release-builder-debian12-prod-x86_64/
It might also be useful to merge https://gerrit.fd.io/r/c/vpp/+/46002 so
that we get coredump visibility under systemd.
I'm in the process of deploying the fix for the RETRIES env var for
multi-worker OS tests (i.e. debian12) and will rebase this gerrit change
once the fix is available in the production GHA CI.
Thanks,
-daw-
Thanks,
Klement
On Sat, Jun 6, 2026 at 3:51 AM Dave Wallace via lists.fd.io <dwallacelf=[email protected]> wrote:
Hi Klement,
The docker containers are orchestrated on a Nomad cluster that is located in
Vexxhost along with the CSIT testbeds. The containers are not pinned to
cpus because the Jenkins Nomad plugin did not allow it (thus
this is historical). Since the CI was recently migrated to github actions
with remote runners running on the same Nomad cluster, we might be able to
do cpu pinning. Please reach out to Peter Mikus for his input on how this
might be added as he is the author of the Nomad GHA dispatcher and primary
maintainer of the Nomad cluster.
Note that most of the issues arise on the debian12 container, because that
is the only instance where make test is run with multi-worker enabled on
VPP.
I noticed that there were still failures with the stack of gerrit changes
that you submitted. Thus for now, to unblock the CI, I submitted a patch
with tag_fixme_debian12 for the failing testcases (46009). In order to
verify the fix for this issue, the tag should be removed from all testcases
that are skipped using it.
Thanks for your efforts to fix this issue.
-daw-
On 6/5/26 1:02 AM, Klement Sekera via lists.fd.io wrote:
Hi,
Ok that sheds some light. Are you saying that in CI there is a box which runs N
dockers at once and inside each of these there is a make test job? Are these
dockers CPU pinned? If not, then that would be my first suspect - more than
one docker on the same physical CPU with the test framework doing more pinning
than before could make it more prone to timing issues.
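One way to answer the pinning question from inside a running container is to print the CPU set the process is allowed to run on. The sketch below is only an illustration and assumes a Linux host, where Python's os.sched_getaffinity is available; the helper name format_cpu_ranges is made up for this example and is not part of the VPP test framework.

```python
import os


def format_cpu_ranges(cpus):
    """Render a set of CPU ids compactly, e.g. {0, 1, 2, 5} -> '0-2,5'."""
    parts = []
    start = prev = None
    for cpu in sorted(cpus):
        if start is None:
            # First CPU id opens the first range.
            start = prev = cpu
        elif cpu == prev + 1:
            # Contiguous with the current range, extend it.
            prev = cpu
        else:
            # Gap found: close the current range and open a new one.
            parts.append(str(start) if start == prev else f"{start}-{prev}")
            start = prev = cpu
    if start is not None:
        parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)


# os.sched_getaffinity only exists on some platforms (Linux among them),
# so guard the call rather than assume it is present.
if hasattr(os, "sched_getaffinity"):
    allowed = os.sched_getaffinity(0)  # 0 means "the calling process"
    print(f"allowed CPUs ({len(allowed)}): {format_cpu_ranges(allowed)}")
```

If every container on a box reports the same full range, they are sharing all CPUs rather than being pinned to disjoint subsets.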
When tuning the patches I was sometimes hitting reproducible failures in CI and
I think I saw one of the logs mentioning CPU frequency of 0.3GHz which I found
dubious, while at the same time it showed 128 available CPUs. I couldn’t
understand why it uses TEST_JOBS=4. But if the CI runs N of these at once, then
that would make some sense. Maybe we could try sprinting in serial instead of
walking in parallel to see if patterns emerge?
Thanks,
Klement
On 5 Jun 2026, at 03:59, Dave Wallace via lists.fd.io <[email protected]> wrote:
Klement,
The failures are not random, they are intermittent. A number of tests are
being skipped (e.g. @tag_fixme_debian12), because they have repeatedly failed
in an intermittent and un-reproducible pattern in the CI on non-related
patchsets.
Previous investigations failed to reproduce the issue when run over 1000s of
iterations on individual servers (both bare-metal and inside the docker
executor containers used in the CI). I have long suspected that there are
'noisy neighbor' cpu pinning issues when a large number of docker containers
running verify jobs are packed onto a single nomad client.
For the past several months, the number of non-related intermittent job
failures has been very low since the 'usual suspects' were elided from running
on debian 12 where the majority of said failures had been occurring. For
whatever reason, the latest 'make test' changes have exacerbated the issue.
All of the '@tag_fixme_*' testcases which are elided from per-patch testing
represent technical debt which has been neglected for a very long time. Any
help you can provide to address this technical debt is most appreciated.
Thanks,
-daw-
On 6/4/26 13:43, Klement Sekera via lists.fd.io wrote:
Hey,
Could also be this one: https://gerrit.fd.io/r/c/vpp/+/45918/5
These patches don’t really change test behavior, only scheduling. Before 45918,
with TEST_JOBS > 1, the pipeline would get underutilized “randomly”, due to
scheduling at most one test class per finished test class. So if a 1-cpu class
followed a 4-cpu class, then 3 cpus would sit idle. With this patch, the pipeline is
refilled properly.
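The difference can be illustrated with a toy simulation. This is a sketch only, not the test framework's code: the function simulate, the (cpus, duration) tuples and the assumption that both policies fill the pipeline as far as possible at start are all invented for this example.

```python
import heapq


def simulate(classes, total_cpus, refill):
    """Return the wall time needed to run all test classes.

    classes: list of (cpus, duration) in queue order; each class is
             assumed to need no more than total_cpus.
    refill=False: start at most one queued class per finished class
                  (the behaviour described for before 45918).
    refill=True:  keep starting queued classes while they still fit.
    """
    queue = list(classes)
    running = []  # min-heap of (finish_time, cpus)
    free = total_cpus
    now = 0

    def start_one():
        """Start the class at the head of the queue if it fits."""
        nonlocal free
        if queue and queue[0][0] <= free:
            cpus, duration = queue.pop(0)
            free -= cpus
            heapq.heappush(running, (now + duration, cpus))
            return True
        return False

    # Assumption: the initial fill is greedy under both policies.
    while start_one():
        pass

    while running:
        # Advance time to the next class that finishes.
        now, cpus = heapq.heappop(running)
        free += cpus
        if refill:
            while start_one():
                pass
        else:
            start_one()
    return now


# One 4-cpu class followed by four 1-cpu classes, 10 time units each.
classes = [(4, 10), (1, 10), (1, 10), (1, 10), (1, 10)]
print("one-per-finish:", simulate(classes, total_cpus=4, refill=False))
print("refill:        ", simulate(classes, total_cpus=4, refill=True))
```

With one class started per finished class, the four 1-cpu classes run one after another with 3 cpus idle (50 time units in total); with refill they run side by side (20 time units), which also means more tests now execute concurrently.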
I also noticed that any extra cpus (like for vcl tests) were unaccounted for -
that’s what the later patch fixes.
If the failures are “random”, then it means the tests are flaky and either need
to be fixed or marked for solo run as a temporary(?!) measure. I’d bet $.25 that
these were fake-solo-run before due to pipeline underutilization.
Regards,
Klement
On 4 Jun 2026, at 19:30, Dave Wallace via lists.fd.io <[email protected]> wrote:
Ole/Klement,
Can you please help triage these new intermittent / non-patch related test
failures?
The frequency of intermittent / non-patch related test failures has spiked in
the CI ever since Ole merged the batch of Klement's test updates in gerrit.
Here are some more that I encountered on my CI monitoring gerrit change [0]:
https://github.com/FDio/vpp/actions/runs/26828951273/job/79105790972
https://github.com/FDio/vpp/actions/runs/26916331530/job/79436593001
Thanks,
-daw-
[0] https://gerrit.fd.io/r/c/vpp/+/45941
On 6/4/26 11:58, Matus Fabian -X (matfabia - PANTHEON TECHNOLOGIES@Cisco) via
lists.fd.io wrote:
Hi,
Today I noticed excess random failures, not related to patch, of make test in
CI across different jobs on a couple of patches.
Some examples:
https://github.com/FDio/vpp/actions/runs/26952741597/job/79522258003
https://github.com/FDio/vpp/actions/runs/26935962088/job/79468040688
https://github.com/FDio/vpp/actions/runs/26937415597/job/79470773670
https://github.com/FDio/vpp/actions/runs/26937415597/job/79470773732
Regards,
Matus
View/Reply Online (#27050): https://lists.fd.io/g/vpp-dev/message/27050