Hi, 

OK, that sheds some light. Are you saying that in CI there is a box which runs N 
Docker containers at once, and inside each of them there is a make test job? Are 
these containers CPU pinned? If not, then that would be my first suspect: more 
than one container on the same physical CPU, with the test framework doing more 
pinning than before, could make it more prone to timing issues.
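
One quick way to check the pinning question from inside an executor container 
might be something like the sketch below. It is only an illustration: 
affinity_report is a made-up helper name, not part of the test framework, and it 
relies on os.sched_getaffinity, which is Linux-only. As far as I know, a 
container started with a cpuset shows a restricted CPU list here, while a 
container limited only by a CPU quota still reports every host CPU.

```python
import os


def affinity_report():
    """Compare the CPUs this process may run on with the CPUs the host reports.

    Hypothetical helper for illustration only; not part of the VPP test framework.
    """
    total = os.cpu_count() or 1
    # sched_getaffinity exists only on Linux; elsewhere assume no restriction.
    if hasattr(os, "sched_getaffinity"):
        allowed = sorted(os.sched_getaffinity(0))
    else:
        allowed = list(range(total))
    return {
        "visible_cpus": total,          # CPUs the OS reports
        "allowed_cpus": allowed,        # CPUs this process may be scheduled on
        "restricted": len(allowed) < total,
    }


if __name__ == "__main__":
    print(affinity_report())
```

If "restricted" comes back False in every container on the same nomad client, 
they are all free to land on the same cores.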

When tuning the patches I was sometimes hitting reproducible failures in CI, and 
I think I saw one of the logs mentioning a CPU frequency of 0.3GHz, which I found 
dubious, while at the same time it showed 128 available CPUs. I couldn’t 
understand why it uses TEST_JOBS=4. But if the CI runs N of these at once, then 
that would make some sense. Maybe we could try sprinting in serial instead of 
walking in parallel to see if patterns emerge? 
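
To make the scheduling difference from my earlier mail (quoted below) concrete, 
here is a toy model. It is not the framework code: simulate, the (cpus, duration) 
tuples and the CPU budget of 4 are all assumptions made up for illustration. 
With refill=False, at most one queued test class is started per finished class, 
which is how I described the old behaviour; with refill=True, queued classes are 
started until the next one no longer fits.

```python
import heapq


def simulate(classes, capacity, refill):
    """Return the makespan of running `classes` on `capacity` CPUs.

    classes:  list of (cpus, duration) tuples in queue order (toy data).
    refill:   False -> start at most one class per finished class,
              True  -> keep starting classes while the next one fits.
    """
    if any(cpus > capacity for cpus, _ in classes):
        raise ValueError("a class needs more CPUs than the budget allows")

    queue = list(classes)
    running = []          # min-heap of (finish_time, cpus)
    free = capacity
    now = 0

    def start_some(limit):
        nonlocal free
        started = 0
        # Strict queue order: stop as soon as the head of the queue does not fit.
        while queue and started < limit and queue[0][0] <= free:
            cpus, duration = queue.pop(0)
            free -= cpus
            heapq.heappush(running, (now + duration, cpus))
            started += 1

    start_some(len(queue))            # initial fill is the same in both modes
    while running:
        now, cpus = heapq.heappop(running)
        free += cpus
        start_some(len(queue) if refill else 1)
    return now


if __name__ == "__main__":
    # One 4-cpu class followed by four 1-cpu classes, 10 time units each.
    work = [(4, 10), (1, 10), (1, 10), (1, 10), (1, 10)]
    print(simulate(work, capacity=4, refill=False))  # 50
    print(simulate(work, capacity=4, refill=True))   # 20
```

In the first case the 1-cpu classes effectively run one at a time with 3 CPUs 
idle, which is the accidental solo-run effect I suspect was hiding the flaky 
tests.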

Thanks,
Klement

> On 5 Jun 2026, at 03:59, Dave Wallace via lists.fd.io 
> <[email protected]> wrote:
> 
> Klement,
> 
> The failures are not random, they are intermittent.  A number of tests are 
> being skipped (e.g. @tag_fixme_debian12), because they have repeatedly failed 
> in an intermittent and un-reproducible pattern in the CI on non-related 
> patchsets.
> 
> Previous investigations failed to reproduce the issue when run over 1000s of 
> iterations on individual servers (both bare-metal and inside the docker 
> executor containers used in the CI).  I have long suspected that there are 
> 'noisy neighbor' cpu pinning issues when a large number of docker containers 
> running verify jobs are packed onto a single nomad client.
> 
> For the past several months, the number of non-related intermittent job 
> failures has been very low since the 'usual suspects' were elided from 
> running on debian 12 where the majority of said failures had been occurring.  
> For whatever reason, the latest 'make test' changes have exacerbated the 
> issue.
> 
> All of the '@tag_fixme_*' testcases which are elided from per-patch testing 
> represent technical debt which has been neglected for a very long time.  Any 
> help you can provide to address this technical debt is most appreciated.
> 
> Thanks,
> -daw-
> 
>> On 6/4/26 13:43, Klement Sekera via lists.fd.io wrote:
>> Hey,
>> 
>> Could also be this one: 
>> https://gerrit.fd.io/r/c/vpp/+/45918/5
>> 
>> These patches don’t really change test behavior, only scheduling. Before 
>> 45918, with TEST_JOBS > 1, the pipeline would get underutilized “randomly”, 
>> due to scheduling at most one test class per finished test class. So if a 
>> 1-cpu class followed 4-cpu class, then 3 cpus would sit idle. With this 
>> patch, the pipeline is refilled properly.
>> 
>> I also noticed that any extra cpus (like for vcl tests) were unaccounted for 
>> - that’s what the later patch fixes.
>> 
>> If the failures are “random”, then it means the tests are flaky and either 
>> need to be fixed or marked for solo run as a temporary(?!) measure. I’d bet 
>> $.25 that these were fake-solo-run before due to pipeline underutilization.
>> 
>> Regards,
>> Klement
>> 
>>>> On 4 Jun 2026, at 19:30, Dave Wallace via lists.fd.io 
>>>> <[email protected]> wrote:
>>> 
>>> Ole/Klement,
>>> 
>>> Can you please help triage these new intermittent / non-patch related test 
>>> failures?
>>> 
>>> The frequency of intermittent / non-patch related test failures has spiked 
>>> in the CI ever since Ole merged the batch of Klement's test updates in 
>>> gerrit.
>>> 
>>> Here's some more that I encountered on my CI monitoring gerrit change [0]:
>>> https://github.com/FDio/vpp/actions/runs/26828951273/job/79105790972
>>> https://github.com/FDio/vpp/actions/runs/26916331530/job/79436593001
>>> 
>>> Thanks,
>>> -daw-
>>> [0] https://gerrit.fd.io/r/c/vpp/+/45941
>>> 
>>>> On 6/4/26 11:58, Matus Fabian -X (matfabia - PANTHEON TECHNOLOGIES@Cisco) 
>>>> via lists.fd.io wrote:
>>>> Hi,
>>>> 
>>>> Today I noticed excess random failures, not related to patch, of make test 
>>>> in CI across different jobs on a couple of patches.
>>>> Some examples:
>>>> https://github.com/FDio/vpp/actions/runs/26952741597/job/79522258003
>>>> https://github.com/FDio/vpp/actions/runs/26935962088/job/79468040688
>>>> https://github.com/FDio/vpp/actions/runs/26937415597/job/79470773670
>>>> https://github.com/FDio/vpp/actions/runs/26937415597/job/79470773732
>>>> 
>>>> Regards,
>>>> Matus
View/Reply Online (#27041): https://lists.fd.io/g/vpp-dev/message/27041
