Hello,

Status update:
To improve the stability of the vpp-csit-device verify job, I decided to silence one of the tests that was failing randomly (ipv6-vm) until the full root cause is identified and fixed.

Peter Mikus
Engineer – Software
Cisco Systems Limited

________________________________________
From: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> on behalf of Peter Mikus via lists.fd.io <pmikus=cisco....@lists.fd.io>
Sent: Thursday, January 21, 2021 18:08
To: csit-...@lists.fd.io; vpp-dev
Subject: Re: [vpp-dev] [csit-dev] csit verify failures

Hello,

Status update:
Today I found that one of the resource bottlenecks was the csit-shim container, which caused the majority of OOM failures on the vpp-device job. I fixed this by increasing the memory allocated to this container, which handles the reservation API. I restarted the machines to ensure everything starts from scratch.

I will keep monitoring the execution of jobs, and in case of any major failures or culprits I will contact Dave Wallace to coordinate disabling of voting unless the issue is fixed permanently.

Thank you for understanding

Peter Mikus
Engineer – Software
Cisco Systems Limited

________________________________________
From: csit-...@lists.fd.io <csit-...@lists.fd.io> on behalf of Peter Mikus via lists.fd.io <pmikus=cisco....@lists.fd.io>
Sent: Wednesday, January 20, 2021 17:27
To: csit-...@lists.fd.io; Benoit Ganne (bganne)
Cc: vpp-dev
Subject: Re: [csit-dev] csit verify failures

Hello,

I am now fully allocated to monitoring the issue and finding the most appropriate long-term fix. A few issues were identified.

1) Race condition on Intel x700 series cards, tracked in a separate mail ([csit-dev] [vpp-dev] VPP Device jobs randomly failing) and being solved in cooperation with the vendor (Intel).
Error symptoms: One specific test of the CSIT Robot execution fails, but the tests finish properly and the framework quits.

2) OOM kill.
Due to increased demands on resources (mainly memory) from the device under test and the accompanying test stack, the Docker daemon itself refused to start more containers. This issue has been fixed by adjusting the pre-allocated memory layout on both devices, and the fix is now in place. I am monitoring the jobs. We also put in place an optimization in the gerrit-jenkins trigger mechanics for vpp-device to prevent unwanted verify jobs from starting (to decrease the load and run verify jobs in a more effective pipeline).
Error symptoms (message): Docker container failed to start

3) The last but most complicated issue involves garbage collection of the virtual functions (SR-IOV VFs) used (+ containers). This issue is complex, and while I reset vpp_device yesterday, I am still looking for a permanent fix. It is related to a state where a previous simulation did not properly clean up its resources.
Error symptoms (message):
Cannot find device "enpXXXXXX"
+ die 'Moving interface to YYYYYY namespace failed!'

Peter Mikus
Engineer – Software
Cisco Systems Limited

________________________________________
From: csit-...@lists.fd.io <csit-...@lists.fd.io> on behalf of Benoit Ganne (bganne) via lists.fd.io <bganne=cisco....@lists.fd.io>
Sent: Tuesday, January 19, 2021 11:55
To: csit-...@lists.fd.io
Cc: vpp-dev
Subject: [csit-dev] csit verify failures

Hi all,

I noticed 100% failures with the verify job vpp-csit-verify-device-master-1n-skx recently, e.g.
https://logs.fd.io/production/vex-yul-rot-jenkins-1/vpp-csit-verify-device-master-1n-skx/10902/console.log.gz

It seems to always fail with 'Failed to start TG docker container!' and 'Topology reservation via shim-dcr failed!'.
Here is an excerpt:

+ DCR_UUIDS+=([tg]=$(docker run "${params[@]}"))
++ docker run --detach=true --privileged --publish-all --rm --shm-size 2G --mount type=tmpfs,destination=/sys/bus/pci/devices --volume /dev/vfio:/dev/vfio --volume /var/run/docker.sock:/var/run/docker.sock --volume /opt/boot/:/opt/boot/ --volume /dev/hugepages/:/dev/hugepages/ --sysctl net.ipv6.conf.all.disable_ipv6=1 --sysctl net.ipv6.conf.default.disable_ipv6=1 --name csit-tg-fc4f2532-ea5a-47f2-b0bf-7ccdff00dc32 csit_sut-ubuntu1804:local
+ die 'Failed to start TG docker container!'
+ set -x
+ set +eu
+ warn 'Failed to start TG docker container!'
+ set -exuo pipefail
+ echo 'Failed to start TG docker container!'
Failed to start TG docker container!
+ exit 1
+++ set +eu
+++ warn 'Topology reservation via shim-dcr failed!'
+++ set -exuo pipefail
+++ echo 'Topology reservation via shim-dcr failed!'
Topology reservation via shim-dcr failed!
+++ exit 1
Build step 'Execute shell' marked build as failure

Best
ben
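
For triage, the error messages quoted in this thread can be matched against lines of a console log. The following is only a minimal sketch: the function name `classify_failure` and the class labels are hypothetical and are not part of CSIT, and the patterns are just the message strings quoted above. The race condition on the x700 cards has no fixed message in this thread, so it falls under "unknown" here.

```shell
#!/bin/sh
# Hypothetical helper (not part of CSIT): map one console-log line to a
# failure class, using only the message strings quoted in this thread.
classify_failure() {
    case "$1" in
        *"Failed to start TG docker container!"*|*"Docker container failed to start"*)
            # Container could not be started (symptom quoted for the OOM issue).
            echo "container-start" ;;
        *"Topology reservation via shim-dcr failed!"*)
            # Reservation through the shim failed.
            echo "reservation" ;;
        *"Cannot find device"*|*"Moving interface to "*" namespace failed!"*)
            # Leftover or missing virtual functions from a previous run.
            echo "vf-cleanup" ;;
        *)
            # Anything else, including the x700 race condition.
            echo "unknown" ;;
    esac
}

# Example call; prints: container-start
classify_failure "+ die 'Failed to start TG docker container!'"
```

To summarize a whole log, each line could be fed through the function in a `while IFS= read -r line` loop and the results counted with `sort | uniq -c`.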