Hello,

Status update:

To improve the stability of the vpp-csit-device verify job, I decided to 
silence one of the tests that was failing randomly (ipv6-vm) until the root 
cause is fully identified and fixed.

Peter Mikus
Engineer – Software
Cisco Systems Limited

________________________________________
From: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> on behalf of Peter Mikus via 
lists.fd.io <pmikus=cisco....@lists.fd.io>
Sent: Thursday, January 21, 2021 18:08
To: csit-...@lists.fd.io; vpp-dev
Subject: Re: [vpp-dev] [csit-dev] csit verify failures

Hello,

Status update:

Today I found that one of the resource bottlenecks was the csit-shim 
container, which caused the majority of OOM failures on the vpp-device job.
I fixed this by increasing the memory allocated to this container, which 
handles the reservation API.
I restarted the machines to ensure everything starts from scratch.

I will keep monitoring the execution of jobs, and in case of any major 
failures or culprits I will contact Dave Wallace to coordinate disabling of 
voting until the issue is fixed permanently.

Thank you for your understanding.

Peter Mikus
Engineer – Software
Cisco Systems Limited

________________________________________
From: csit-...@lists.fd.io <csit-...@lists.fd.io> on behalf of Peter Mikus via 
lists.fd.io <pmikus=cisco....@lists.fd.io>
Sent: Wednesday, January 20, 2021 17:27
To: csit-...@lists.fd.io; Benoit Ganne (bganne)
Cc: vpp-dev
Subject: Re: [csit-dev] csit verify failures

Hello,

I am now fully allocated to monitoring the issue and finding the most 
appropriate long-term fix.

A few issues have been identified.

1) A race condition on Intel x700 series cards, tracked in a separate mail 
thread ([csit-dev] [vpp-dev] VPP Device jobs randomly failing) and being 
solved in cooperation with the vendor (Intel).

Error symptoms: One specific test of the CSIT robot execution fails, but the 
tests finish properly and the framework quits.

2) OOM kill. Due to increased demands on resources (mainly memory) from the 
device under test and the accompanying test stack, the Docker daemon itself 
refused to start more containers.

This issue has been fixed by adjusting the pre-allocated memory layout on both 
devices, and the fix is now in place. I am monitoring the jobs. We also put in 
place an optimization of the gerrit-jenkins trigger mechanics for vpp-device to 
prevent unwanted verify jobs from starting (to decrease the load and run verify 
jobs in a more effective pipeline).

Error symptoms (message): Docker container failed to start

3) The last but most complicated issue involves garbage collection of the 
virtual functions (SR-IOV VFs) used (+ containers).
This issue is complex, and while I reset vpp_device yesterday, I am still 
looking for a permanent fix to apply. It is related to a state where a 
previous simulation did not properly clean up its resources.

Error symptoms (message):
Cannot find device "enpXXXXXX"
+ die 'Moving interface to YYYYYY namespace failed!'
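
The "Cannot find device" symptom suggests that a pre-flight check, run before 
an interface is moved into a namespace, could make a leftover or missing VF 
easier to spot. What follows is only a minimal sketch, assuming the VF appears 
under /sys/class/net when it is available to the host. The function name 
iface_exists and its optional second argument are hypothetical and are not 
part of CSIT.

```shell
#!/usr/bin/env bash

# Hypothetical helper, not part of CSIT: succeed only if a network interface
# with the given name is visible in sysfs.
# $1 - interface name (for example the VF name from the error message)
# $2 - sysfs net directory; defaults to /sys/class/net (overridable for tests)
iface_exists () {
    local name="${1:?interface name required}"
    local sysfs_net="${2:-/sys/class/net}"
    [ -e "${sysfs_net}/${name}" ]
}
```

A caller could run the check first and report a missing interface with a more 
specific message than the generic namespace failure.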


Peter Mikus
Engineer – Software
Cisco Systems Limited

________________________________________
From: csit-...@lists.fd.io <csit-...@lists.fd.io> on behalf of Benoit Ganne 
(bganne) via lists.fd.io <bganne=cisco....@lists.fd.io>
Sent: Tuesday, January 19, 2021 11:55
To: csit-...@lists.fd.io
Cc: vpp-dev
Subject: [csit-dev] csit verify failures

Hi all,

I noticed 100% failures with the verify job 
vpp-csit-verify-device-master-1n-skx recently, e.g. 
https://logs.fd.io/production/vex-yul-rot-jenkins-1/vpp-csit-verify-device-master-1n-skx/10902/console.log.gz
It seems to always fail with 'Failed to start TG docker container!' and 
'Topology reservation via shim-dcr failed!'.

Here is an excerpt:

+ DCR_UUIDS+=([tg]=$(docker run "${params[@]}"))
++ docker run --detach=true --privileged --publish-all --rm --shm-size 2G 
--mount type=tmpfs,destination=/sys/bus/pci/devices --volume 
/dev/vfio:/dev/vfio --volume /var/run/docker.sock:/var/run/docker.sock --volume 
/opt/boot/:/opt/boot/ --volume /dev/hugepages/:/dev/hugepages/ --sysctl 
net.ipv6.conf.all.disable_ipv6=1 --sysctl net.ipv6.conf.default.disable_ipv6=1 
--name csit-tg-fc4f2532-ea5a-47f2-b0bf-7ccdff00dc32 csit_sut-ubuntu1804:local
+ die 'Failed to start TG docker container!'
+ set -x
+ set +eu
+ warn 'Failed to start TG docker container!'
+ set -exuo pipefail
+ echo 'Failed to start TG docker container!'
Failed to start TG docker container!
+ exit 1
+++ set +eu
+++ warn 'Topology reservation via shim-dcr failed!'
+++ set -exuo pipefail
+++ echo 'Topology reservation via shim-dcr failed!'
Topology reservation via shim-dcr failed!
+++ exit 1
Build step 'Execute shell' marked build as failure
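
For readers unfamiliar with the trace format: the sequence of "set", "warn", 
"echo" and "exit" lines above is consistent with two small helper functions 
along the following lines. This is a reconstruction from the xtrace output 
only, offered as an assumption; the actual CSIT helpers may differ.

```shell
#!/usr/bin/env bash

# Reconstructed from the xtrace output in this thread; an assumption,
# not a copy of the CSIT source.

# Print a message. The trace shows the shell options being re-applied first.
warn () {
    set -exuo pipefail
    echo "$@"
}

# Print a message via warn, then terminate the current shell with status 1.
# $1 - message to print
die () {
    set -x
    set +eu
    warn "${1:-}"
    exit 1
}
```

Read this way, the excerpt shows "docker run" returning a non-zero status, the 
first die call ending the inner shell, and the caller then reporting the 
topology reservation failure and exiting as well.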

Best
ben