Re: [vpp-dev] Rx stuck to 0 after a while
Dear Rubina,

Excellent, thank you very much! The change is in master now. Note that to keep
the default memory footprint the same, I have temporarily halved the default
upper limit on sessions (since we now create two bihash entries per session
instead of one). FYI, I plan to do some more work on session management/reuse
before the 18.07 release.

--a

> On 2 Jun 2018, at 07:48, Rubina Bianchi wrote:
> [...]
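For anyone who needs the previous session count back after this change, the
limit (and the bihash memory behind it) is a startup.conf matter. A sketch
with illustrative values, assuming the acl-plugin section's option names -
double-check them against your tree:

    acl-plugin {
      # assumption: option names as in the acl-plugin startup section;
      # the values below are illustrative only
      connection count max 1000000
      connection hash buckets 1048576
      connection hash memory 1073741824
    }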
Re: [vpp-dev] Rx stuck to 0 after a while
Dear Andrew,

Sorry for the delayed response. I checked your second patch, and here is my
test result:

The best case is still the best, and vpp throughput is at its maximum
(18.5 Gbps) in my scenario. The worst case is better than before: I never see
the deadlock again, and throughput increases from 50 Mbps to 5.5 Gbps. I have
also added my T-Rex result.

-Per port stats table
      ports |             0 |             1
 ------------------------------------------
   opackets |    1119818503 |    1065627562
     obytes |  490687253990 |  471065675962
   ipackets |     274437415 |     391504529
     ibytes |  120020261974 |  170214837563
    ierrors |             0 |             0
    oerrors |             0 |             0
      Tx Bw |     9.48 Gbps |     9.08 Gbps

-Global stats enabled
 Cpu Utilization : 88.4 %  7.0 Gb/core
 Platform_factor : 1.0
 Total-Tx  :  18.56 Gbps
 Total-Rx  :   5.78 Gbps
 Total-PPS :   5.27 Mpps
 Total-CPS :  79.51 Kcps

 Expected-PPS :   9.02 Mpps
 Expected-CPS : 135.31 Kcps
 Expected-BPS :  31.77 Gbps

 Active-flows :    88840  Clients :   252  Socket-util : 0.5598 %
 Open-flows   : 33973880  Servers : 65532  Socket      : 88840
 Socket/Clients : 352.5
 drop-rate     : 12.79 Gbps
 current time  : 423.4 sec
 test duration : 99576.6 sec

One point that I missed and that may be helpful: I run T-Rex with the '-p'
parameter:

./t-rex-64 -c 6 -d 10 -f cap2/sfr.yaml --cfg cfg/trex_cfg.yaml -m 30 -p

Thanks,
Sincerely

____
From: Andrew Yourtchenko
Sent: Wednesday, May 30, 2018 12:08 PM
To: Rubina Bianchi
Cc: vpp-dev@lists.fd.io
Subject: Re: [vpp-dev] Rx stuck to 0 after a while

[...]
Re: [vpp-dev] Rx stuck to 0 after a while
Dear Rubina,

Okay, I think I am reasonably happy with the change in
https://gerrit.fd.io/r/#/c/12770/ - I have also rebased it onto the latest
master so that it is ready to commit if it works for you. Please give it a
shot and let me know.

Note that you might need to adjust the bihash memory, as I am now storing the
forward and reverse entries explicitly (rather than calculating them
per-packet). Please let me know how it works in your test setup.

thanks,
andrew

On 5/30/18, Andrew Yourtchenko wrote:
> [...]
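As an illustration of the bihash part of this change: the acl plugin keys its
sessions in a clib_bihash_40_8, and storing both directions means two add/del
calls per session. A minimal sketch, assuming hypothetical pre-built fwd/rev
key arguments (the real key layout is fa_5tuple_t in the plugin, and the
actual change is in the gerrit above):

    #include <vppinfra/bihash_40_8.h>

    /* Sketch: install both the forward and the reverse 5-tuple keys,
     * pointing at the same session index, so return traffic no longer
     * needs a per-packet key recalculation. */
    static int
    session_add_both_directions (clib_bihash_40_8_t * h,
                                 clib_bihash_kv_40_8_t * fwd_kv,
                                 clib_bihash_kv_40_8_t * rev_kv,
                                 u64 session_index)
    {
      fwd_kv->value = rev_kv->value = session_index;
      if (clib_bihash_add_del_40_8 (h, fwd_kv, 1 /* is_add */))
        return -1;
      if (clib_bihash_add_del_40_8 (h, rev_kv, 1 /* is_add */))
        {
          /* keep the table consistent: roll back the forward entry */
          clib_bihash_add_del_40_8 (h, fwd_kv, 0 /* is_add=0 -> delete */);
          return -1;
        }
      return 0;
    }

Two keys per session is also why the bihash memory may need adjusting: the
table now holds twice as many entries for the same number of sessions.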
Re: [vpp-dev] Rx stuck to 0 after a while
Dear Rubina,

Thanks for checking it!

Yeah, actually that patch was leaking the sessions in the session reuse path.
I got the setup in the lab locally yesterday and am working on a better way
to do it...

Will get back to you when I am happy with the way the code works.

--a

On 5/29/18, Rubina Bianchi wrote:
> [...]
Re: [vpp-dev] Rx stuck to 0 after a while
Dear Andrew,

I cleaned everything and built new deb packages from your patch once again.
With your patch I never see the deadlock again, but I still have a throughput
problem in my scenario.

-Per port stats table
      ports |             0 |             1
 ------------------------------------------
   opackets |     474826597 |     452028770
     obytes |  207843848531 |  199591809555
   ipackets |      71010677 |      72028456
     ibytes |   31441646551 |   31687562468
    ierrors |             0 |             0
    oerrors |             0 |             0
      Tx Bw |     9.56 Gbps |     9.16 Gbps

-Global stats enabled
 Cpu Utilization : 88.4 %  7.1 Gb/core
 Platform_factor : 1.0
 Total-Tx  :  18.72 Gbps
 Total-Rx  :  59.30 Mbps
 Total-PPS :   5.31 Mpps
 Total-CPS :  79.79 Kcps

 Expected-PPS :   9.02 Mpps
 Expected-CPS : 135.31 Kcps
 Expected-BPS :  31.77 Gbps

 Active-flows :    88837  Clients :   252  Socket-util : 0.5598 %
 Open-flows   : 14708455  Servers : 65532  Socket      : 88837
 Socket/Clients : 352.5
 Total_queue_full : 328355248
 drop-rate     : 18.66 Gbps
 current time  : 180.9 sec
 test duration : 99819.1 sec

In the best case (4 interfaces on one NUMA node, with ACLs on only 2 of them)
my device (HP DL380 G9) reaches maximum throughput (18.72 Gbps), but in the
worst case (4 interfaces on one NUMA node, all of them with ACLs) throughput
drops from the maximum to around 60 Mbps. Actually, the patch just prevents
the deadlock in my case, but throughput is the same as before.

____
From: Andrew Yourtchenko
Sent: Tuesday, May 29, 2018 10:11 AM
To: Rubina Bianchi
Cc: vpp-dev@lists.fd.io
Subject: Re: [vpp-dev] Rx stuck to 0 after a while

[...]
Re: [vpp-dev] Rx stuck to 0 after a while
Dear Rubina,

Thank you for quickly checking it!

Judging by the logs, VPP quits, so I would say there should be a core file -
could you check?

If you find it (double-check by the timestamps that it is indeed the fresh
one), you can load it in gdb (using gdb 'path-to-vpp-binary' 'path-to-core')
and then get the backtrace using 'bt'; this will give more of an idea of what
is going on.

--a

On 5/29/18, Rubina Bianchi wrote:
> [...]
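If no core file turns up at all, it is worth confirming that core dumps are
enabled end to end. A sketch, assuming a stock install: the full-coredump
option is VPP's own (unix section of startup.conf), while the shell commands
are generic Linux and only show where to look:

    # /etc/vpp/startup.conf - ask for a more complete core image
    unix {
      full-coredump
    }

    # shell: confirm the kernel will actually write a core, and where
    ulimit -c unlimited
    sysctl kernel.core_pattern

Under systemd, the vpp unit may also need LimitCORE=infinity for the limit to
apply to the daemon.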
Re: [vpp-dev] Rx stuck to 0 after a while
Dear Andrew,

I tested your patch and my problem still exists, but my service status has
changed and now there isn't any information about the deadlock problem. Do
you have any idea how I can provide you more information?

root@MYRB:~# service vpp status
* vpp.service - vector packet processing engine
   Loaded: loaded (/lib/systemd/system/vpp.service; disabled; vendor preset: enabled)
   Active: inactive (dead)

May 29 09:27:06 MYRB /usr/bin/vpp[30805]: load_one_vat_plugin:67: Loaded plugin: udp_ping_test_plugin.so
May 29 09:27:06 MYRB /usr/bin/vpp[30805]: load_one_vat_plugin:67: Loaded plugin: stn_test_plugin.so
May 29 09:27:06 MYRB vpp[30805]: /usr/bin/vpp[30805]: dpdk: EAL init args: -c 1ff -n 4 --huge-dir /run/vpp/hugepages --file-prefix vpp -w :08:00.0 -w :08:00.1 -w :08
May 29 09:27:06 MYRB /usr/bin/vpp[30805]: dpdk: EAL init args: -c 1ff -n 4 --huge-dir /run/vpp/hugepages --file-prefix vpp -w :08:00.0 -w :08:00.1 -w :08:00.2 -w 000
May 29 09:27:07 MYRB vnet[30805]: dpdk_ipsec_process:1012: not enough DPDK crypto resources, default to OpenSSL
May 29 09:27:13 MYRB vnet[30805]: unix_signal_handler:124: received signal SIGCONT, PC 0x7fa535dfbac0
May 29 09:27:13 MYRB vnet[30805]: received SIGTERM, exiting...
May 29 09:27:13 MYRB systemd[1]: Stopping vector packet processing engine...
May 29 09:27:13 MYRB vnet[30805]: unix_signal_handler:124: received signal SIGTERM, PC 0x7fa534121867
May 29 09:27:13 MYRB systemd[1]: Stopped vector packet processing engine.

____
From: Andrew Yourtchenko
Sent: Monday, May 28, 2018 5:58 PM
To: Rubina Bianchi
Cc: vpp-dev@lists.fd.io
Subject: Re: [vpp-dev] Rx stuck to 0 after a while

[...]
Re: [vpp-dev] Rx stuck to 0 after a while
Dear Rubina,

Thanks for catching and reporting this!

I suspect what might be happening is that my recent change of using two
unidirectional sessions in the bihash vs. the single one triggered a race,
whereby as the owning worker is deleting the session, the non-owning worker
is trying to update it. That would logically explain the "BUG: ..." line
(since you don't change the interfaces nor move the traffic around, the
5-tuples should not collide), as well as the later stop.

To take care of this issue, I think I will split the deletion of the session
into two stages:

1) deactivation of the bihash entries that steer the traffic
2) freeing up the per-worker session structure

and have a little pause time in between these two so that the
workers-in-progress can finish updating the structures.

The below gerrit is the first cut:

https://gerrit.fd.io/r/#/c/12770/

It passes the make test right now, but I did not kick its tires too much yet;
will do tomorrow.

You can try this change out in your test setup as well and tell me how it
feels.

--a

On 5/28/18, Rubina Bianchi wrote:
> Hi
>
> I run vpp v18.07-rc0~237-g525c9d0f with only 2 interfaces in a stateful acl
> (permit+reflect) and generated sfr traffic using trex v2.27. My rx becomes
> 0 after a short while, about 300 sec on my machine. Here is the vpp status:
>
> root@MYRB:~# service vpp status
> * vpp.service - vector packet processing engine
>    Loaded: loaded (/lib/systemd/system/vpp.service; disabled; vendor preset: enabled)
>    Active: failed (Result: signal) since Mon 2018-05-28 11:35:03 +0130; 37s ago
>  Process: 32838 ExecStopPost=/bin/rm -f /dev/shm/db /dev/shm/global_vm /dev/shm/vpe-api (code=exited, status=0/SUCCESS)
>  Process: 31754 ExecStart=/usr/bin/vpp -c /etc/vpp/startup.conf (code=killed, signal=ABRT)
>  Process: 31750 ExecStartPre=/sbin/modprobe uio_pci_generic (code=exited, status=0/SUCCESS)
>  Process: 31747 ExecStartPre=/bin/rm -f /dev/shm/db /dev/shm/global_vm /dev/shm/vpe-api (code=exited, status=0/SUCCESS)
> Main PID: 31754 (code=killed, signal=ABRT)
>
> May 28 16:32:47 MYRB vnet[31754]: acl_fa_node_fn:210: BUG: session LSB16(sw_if_index) and 5-tuple collision!
> May 28 16:35:02 MYRB vnet[31754]: unix_signal_handler:124: received signal SIGCONT, PC 0x7f1fb591cac0
> May 28 16:35:02 MYRB vnet[31754]: received SIGTERM, exiting...
> May 28 16:35:02 MYRB systemd[1]: Stopping vector packet processing engine...
> May 28 16:35:02 MYRB vnet[31754]: unix_signal_handler:124: received signal SIGTERM, PC 0x7f1fb3c40867
> May 28 16:35:03 MYRB vpp[31754]: vlib_worker_thread_barrier_sync_int: worker thread deadlock
> May 28 16:35:03 MYRB systemd[1]: vpp.service: Main process exited, code=killed, status=6/ABRT
> May 28 16:35:03 MYRB systemd[1]: Stopped vector packet processing engine.
> May 28 16:35:03 MYRB systemd[1]: vpp.service: Unit entered failed state.
> May 28 16:35:03 MYRB systemd[1]: vpp.service: Failed with result 'signal'.
>
> I attach my vpp configs to this email. I also ran this test with the same
> config and 4 interfaces instead of two, but in that case nothing happened
> to vpp and it was functional for a long time.
>
> Thanks,
> RB
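To make the two-stage teardown described above concrete, here is a minimal
sketch under stated assumptions: the session struct, the linger FIFO, and all
helper names are illustrative (the real per-worker session is fa_session_t in
the acl plugin), and only clib_bihash_add_del_40_8 is actual vppinfra API;
the authoritative version is the gerrit above.

    #include <vppinfra/bihash_40_8.h>

    typedef struct sketch_session_
    {
      clib_bihash_kv_40_8_t fwd_kv, rev_kv; /* the two steering entries */
      f64 deactivated_at;                   /* when stage 1 happened */
      struct sketch_session_ *next;         /* linger FIFO linkage */
    } sketch_session_t;

    static sketch_session_t *linger_head, *linger_tail;

    /* Stage 1: pull the steering entries out of the bihash so no new
     * packets can reach this session; other workers may still be
     * finishing updates for packets already in flight. */
    static void
    session_deactivate (clib_bihash_40_8_t * h, sketch_session_t * s, f64 now)
    {
      clib_bihash_add_del_40_8 (h, &s->fwd_kv, 0 /* is_add=0 -> delete */);
      clib_bihash_add_del_40_8 (h, &s->rev_kv, 0 /* is_add=0 -> delete */);
      s->deactivated_at = now;
      s->next = 0;
      if (linger_tail)            /* append to the linger FIFO */
        linger_tail->next = s;
      else
        linger_head = s;
      linger_tail = s;
    }

    /* Stage 2: free a session only after the pause has elapsed, so its
     * memory is not reused while another worker may still be writing
     * to it. */
    static void
    session_reap_lingering (f64 now, f64 linger_time,
                            void (*free_session) (sketch_session_t *))
    {
      while (linger_head && now - linger_head->deactivated_at > linger_time)
        {
          sketch_session_t *s = linger_head;
          linger_head = s->next;
          if (!linger_head)
            linger_tail = 0;
          free_session (s);       /* return to the per-worker pool */
        }
    }

The pause trades a little memory (deactivated sessions linger briefly) for
safety: in-flight updates from other workers can complete before the session
structure is reused.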