I've done some tinkering locally and saw mainly quic/vcl failures.

With https://gerrit.fd.io/r/c/vpp/+/46003 these seem to go away, but a
latent l2bd issue surfaces, which https://gerrit.fd.io/r/c/vpp/+/46004
irons over, though it's not a real fix.

Anyhow, I was able to pass the whole suite 4 times in a row before my ssh
session dropped and I didn't bother with more reruns. But it looks like an
improvement nevertheless.

Cheers,
Klement

On Fri, Jun 5, 2026 at 7:02 AM Klement Sekera via lists.fd.io <ksekera=
[email protected]> wrote:

> Hi,
>
> Ok that sheds some light. Are you saying that in CI there is a box which
> runs N dockers at once and inside each of these there is a make test job?
> Are these dockers CPU pinned? If not, then that would be my first suspect
> - more than one docker on the same physical CPU with the test framework
> doing more pinning than before could make it more prone to timing issues.
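
One way to answer the pinning question from inside a container is to compare the CPUs the process is allowed to run on with the number of CPUs the host reports. The following is a minimal sketch, not part of the VPP test framework; the helper names `allowed_cpus` and `is_pinned` are hypothetical. It assumes Linux, where `os.sched_getaffinity` reflects a cpuset restriction; a CFS quota (for example Docker's `--cpus`) would not show up here.

```python
import os


def allowed_cpus():
    """Return the sorted list of CPU ids this process may be scheduled on."""
    if hasattr(os, "sched_getaffinity"):
        # Linux: honours cpuset/taskset restrictions applied to the process.
        return sorted(os.sched_getaffinity(0))
    # Fallback for platforms without sched_getaffinity: assume no restriction.
    return list(range(os.cpu_count() or 1))


def is_pinned(allowed, online_count):
    """True if the process is restricted to a strict subset of the CPUs."""
    return len(allowed) < online_count


if __name__ == "__main__":
    cpus = allowed_cpus()
    online = os.cpu_count() or len(cpus)
    print(f"allowed CPUs: {cpus}")
    print(f"CPUs reported by the system: {online}")
    print(f"restricted to a subset: {is_pinned(cpus, online)}")
```

If every container on the box reports the full set of CPUs, they are not pinned apart from each other and can compete for the same cores.
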
>
> When tuning the patches I was sometimes hitting reproducible failures in
> CI and I think I saw one of the logs mentioning a CPU frequency of 0.3GHz which
> I found dubious, while at the same time it showed 128 available CPUs. I
> couldn’t understand why it uses TEST_JOBS=4. But if the CI runs N of these
> at once, then that would make some sense. Maybe we could try sprinting in
> serial instead of walking in parallel to see if patterns emerge?
>
> Thanks,
> Klement
>
> > On 5 Jun 2026, at 03:59, Dave Wallace via lists.fd.io <dwallacelf=
> [email protected]> wrote:
> >
> > Klement,
> >
> > The failures are not random, they are intermittent.  A number of tests
> are being skipped (e.g. @tag_fixme_debian12), because they have repeatedly
> failed in an intermittent and un-reproducible pattern in the CI on
> non-related patchsets.
> >
> > Previous investigations failed to reproduce the issue when run over
> 1000s of iterations on individual servers (both bare-metal and inside the
> docker executor containers used in the CI).  I have long suspected that
> there are 'noisy neighbor' cpu pinning issues when a large number of docker
> containers running verify jobs are packed onto a single nomad client.
> >
> > For the past several months, the number of non-related intermittent job
> failures has been very low since the 'usual suspects' were elided from
> running on debian 12 where the majority of said failures had been
> occurring.  For whatever reason, the latest 'make test' changes have
> exacerbated the issue.
> >
> > All of the '@tag_fixme_*' testcases which are elided from per-patch
> testing represent technical debt which has been neglected for a very long
> time.  Any help you can provide to address this technical debt is most
> appreciated.
> >
> > Thanks,
> > -daw-
> >
> >> On 6/4/26 13:43, Klement Sekera via lists.fd.io wrote:
> >> Hey,
> >>
> >> Could also be this one: https://gerrit.fd.io/r/c/vpp/+/45918/5
> >>
> >> These patches don’t really change test behavior, only scheduling.
> Before 45918, with TEST_JOBS > 1, the pipeline would get underutilized
> “randomly”, due to scheduling at most one test class per finished test
> class. So if a 1-cpu class followed 4-cpu class, then 3 cpus would sit
> idle. With this patch, the pipeline is refilled properly.
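
The scheduling difference described above can be illustrated with a small simulation. This is a sketch under stated assumptions, not the actual test framework code: the function `simulate`, its `refill` flag, and the `(cpus, duration)` tuples are hypothetical, the initial fill is assumed to be greedy in both modes, and every class is assumed to fit within the total CPU budget.

```python
import heapq
from collections import deque


def simulate(classes, total_cpus, refill):
    """Return the time at which the last test class finishes.

    classes: list of (cpus, duration) tuples in queue order.
    refill:  True  -> after each finish, start every queued class that fits;
             False -> after each finish, start at most one queued class.
    """
    queue = deque(classes)
    running = []  # min-heap of (finish_time, sequence_number, cpus)
    free = total_cpus
    now = 0
    seq = 0

    def start_fitting(limit):
        nonlocal free, seq
        started = 0
        # Only the head of the queue is considered, so order is preserved.
        while queue and queue[0][0] <= free and started < limit:
            cpus, duration = queue.popleft()
            free -= cpus
            heapq.heappush(running, (now + duration, seq, cpus))
            seq += 1
            started += 1

    start_fitting(len(queue))  # assumed: the initial fill is always greedy
    while running:
        now, _, cpus = heapq.heappop(running)
        free += cpus
        start_fitting(len(queue) if refill else 1)
    return now


if __name__ == "__main__":
    # One 4-cpu class followed by four 1-cpu classes, 10 time units each.
    classes = [(4, 10), (1, 10), (1, 10), (1, 10), (1, 10)]
    print("one class per finish:", simulate(classes, 4, refill=False))
    print("refill until full:   ", simulate(classes, 4, refill=True))
```

In this example, starting one class per finished class leaves three CPUs idle after the 4-cpu class completes and the 1-cpu classes run one after another, while refilling runs all four of them concurrently.
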
> >>
> >> I also noticed that any extra cpus (like for vcl tests) were
> unaccounted for - that’s what the later patch fixes.
> >>
> >> If the failures are “random”, then it means the tests are flaky and
> either need to be fixed or marked for solo run as a temporary(?!) measure. I’d
> bet $.25 that these were fake-solo-run before due to pipeline
> underutilization.
> >>
> >> Regards,
> >> Klement
> >>
> >>>> On 4 Jun 2026, at 19:30, Dave Wallace via lists.fd.io <dwallacelf=
> [email protected]> wrote:
> >>>
> >>> Ole/Klement,
> >>>
> >>> Can you please help triage these new intermittent / non-patch related
> test failures?
> >>>
> >>> The frequency of intermittent / non-patch related test failures has
> spiked in the CI ever since Ole merged the batch of Klement's test updates
> in gerrit.
> >>>
> >>> Here's some more that I encountered on my CI monitoring gerrit change
> [0]:
> >>>
> >>> https://github.com/FDio/vpp/actions/runs/26828951273/job/79105790972
> >>>
> >>> https://github.com/FDio/vpp/actions/runs/26916331530/job/79436593001
> >>>
> >>> Thanks,
> >>> -daw-
> >>> [0] https://gerrit.fd.io/r/c/vpp/+/45941
> >>>
> >>>> On 6/4/26 11:58, Matus Fabian -X (matfabia - PANTHEON
> TECHNOLOGIES@Cisco) via lists.fd.io wrote:
> >>>> Hi,
> >>>>
> >>>> Today I noticed excess random failures, not related to patch, of make
> test in CI across different jobs on a couple of patches.
> >>>> Some examples:
> >>>>
> >>>> https://github.com/FDio/vpp/actions/runs/26952741597/job/79522258003
> >>>>
> >>>> https://github.com/FDio/vpp/actions/runs/26935962088/job/79468040688
> >>>>
> >>>> https://github.com/FDio/vpp/actions/runs/26937415597/job/79470773670
> >>>>
> >>>> https://github.com/FDio/vpp/actions/runs/26937415597/job/79470773732
> >>>>
> >>>> Regards,
> >>>> Matus