Dear Andrew

Sorry for the delayed response. I checked your second patch and here are my 
test results:

The best case is still the best: VPP throughput is at its maximum (18.5 Gbps) 
in my scenario.
The worst case has improved compared to before. I no longer see the deadlock, 
and throughput increased from 50 Mbps to 5.5 Gbps. I have also included my 
T-Rex results below.

-Per port stats table
      ports |               0 |               1
 -----------------------------------------------------------------------------------------
   opackets |      1119818503 |      1065627562
     obytes |    490687253990 |    471065675962
   ipackets |       274437415 |       391504529
     ibytes |    120020261974 |    170214837563
    ierrors |               0 |               0
    oerrors |               0 |               0
      Tx Bw |       9.48 Gbps |       9.08 Gbps

-Global stats enabled
 Cpu Utilization : 88.4  %  7.0 Gb/core
 Platform_factor : 1.0
 Total-Tx        :      18.56 Gbps
 Total-Rx        :       5.78 Gbps
 Total-PPS       :       5.27 Mpps
 Total-CPS       :      79.51 Kcps

 Expected-PPS    :       9.02 Mpps
 Expected-CPS    :     135.31 Kcps
 Expected-BPS    :      31.77 Gbps

 Active-flows    :    88840  Clients :      252   Socket-util : 0.5598 %
 Open-flows      : 33973880  Servers :    65532   Socket :    88840   Socket/Clients :  352.5
 drop-rate       :      12.79 Gbps
 current time    : 423.4 sec
 test duration   : 99576.6 sec

One point that I missed, and which may be helpful, is that I run T-Rex with 
the '-p' parameter:
./t-rex-64 -c 6 -d 100000 -f cap2/sfr.yaml --cfg cfg/trex_cfg.yaml -m 30 -p

Thanks,
Sincerely

________________________________
From: Andrew 👽 Yourtchenko <ayour...@gmail.com>
Sent: Wednesday, May 30, 2018 12:08 PM
To: Rubina Bianchi
Cc: vpp-dev@lists.fd.io
Subject: Re: [vpp-dev] Rx stuck to 0 after a while

Dear Rubina,

Thanks for checking it!

Yeah, actually that patch was leaking the sessions in the session reuse
path. I got the setup in the lab locally yesterday and am working
on a better way to do it...

Will get back to you when I am happy with the way the code works...

--a



On 5/29/18, Rubina Bianchi <r_bian...@outlook.com> wrote:
> Dear Andrew
>
> I cleaned everything and built new deb packages with your patch once
> again. With your patch I no longer see the deadlock, but I still have a
> throughput problem in my scenario.
>
> -Per port stats table
>       ports |               0 |               1
> -----------------------------------------------------------------------------------------
>    opackets |       474826597 |       452028770
>      obytes |    207843848531 |    199591809555
>    ipackets |        71010677 |        72028456
>      ibytes |     31441646551 |     31687562468
>     ierrors |               0 |               0
>     oerrors |               0 |               0
>       Tx Bw |       9.56 Gbps |       9.16 Gbps
>
> -Global stats enabled
>  Cpu Utilization : 88.4  %  7.1 Gb/core
>  Platform_factor : 1.0
>  Total-Tx        :      18.72 Gbps
>  Total-Rx        :      59.30 Mbps
>  Total-PPS       :       5.31 Mpps
>  Total-CPS       :      79.79 Kcps
>
>  Expected-PPS    :       9.02 Mpps
>  Expected-CPS    :     135.31 Kcps
>  Expected-BPS    :      31.77 Gbps
>
>  Active-flows    :    88837  Clients :      252   Socket-util : 0.5598 %
>  Open-flows      : 14708455  Servers :    65532   Socket :    88837   Socket/Clients :  352.5
>  Total_queue_full : 328355248
>  drop-rate       :      18.66 Gbps
>  current time    : 180.9 sec
>  test duration   : 99819.1 sec
>
> In the best case (4 interfaces on one NUMA node, with ACLs on only 2 of
> them) my device (HP DL380 G9) reaches maximum throughput (18.72 Gbps), but
> in the worst case (4 interfaces on one NUMA node, all of them with ACLs)
> the throughput drops from the maximum to around 60 Mbps. So the patch just
> prevents the deadlock in my case, but the throughput is the same as before.
>
> ________________________________
> From: Andrew 👽 Yourtchenko <ayour...@gmail.com>
> Sent: Tuesday, May 29, 2018 10:11 AM
> To: Rubina Bianchi
> Cc: vpp-dev@lists.fd.io
> Subject: Re: [vpp-dev] Rx stuck to 0 after a while
>
> Dear Rubina,
>
> Thank you for quickly checking it!
>
> Judging by the logs, VPP quits, so I would say there should be a
> core file - could you check?
>
> If you find it (double-check by the timestamps that it is indeed the
> fresh one), you can load it in gdb (using gdb 'path-to-vpp-binary'
> 'path-to-core') and then get the backtrace using 'bt'; this will give
> a better idea of what is going on.
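>
> For example, assuming the default install path (the path to the core file
> below is just a placeholder):
>
>     gdb /usr/bin/vpp /path/to/core
>     (gdb) bt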
>
> --a
>
> On 5/29/18, Rubina Bianchi <r_bian...@outlook.com> wrote:
>> Dear Andrew
>>
>> I tested your patch and my problem still exists, but the service status
>> output has changed and now there is no information about the deadlock
>> problem. Do you have any idea how I can provide you with more
>> information?
>>
>> root@MYRB:~# service vpp status
>> * vpp.service - vector packet processing engine
>>    Loaded: loaded (/lib/systemd/system/vpp.service; disabled; vendor
>> preset:
>> enabled)
>>    Active: inactive (dead)
>>
>> May 29 09:27:06 MYRB /usr/bin/vpp[30805]: load_one_vat_plugin:67: Loaded
>> plugin: udp_ping_test_plugin.so
>> May 29 09:27:06 MYRB /usr/bin/vpp[30805]: load_one_vat_plugin:67: Loaded
>> plugin: stn_test_plugin.so
>> May 29 09:27:06 MYRB vpp[30805]: /usr/bin/vpp[30805]: dpdk: EAL init
>> args:
>> -c 1ff -n 4 --huge-dir /run/vpp/hugepages --file-prefix vpp -w
>> 0000:08:00.0
>> -w 0000:08:00.1 -w 0000:08
>> May 29 09:27:06 MYRB /usr/bin/vpp[30805]: dpdk: EAL init args: -c 1ff -n
>> 4
>> --huge-dir /run/vpp/hugepages --file-prefix vpp -w 0000:08:00.0 -w
>> 0000:08:00.1 -w 0000:08:00.2 -w 000
>> May 29 09:27:07 MYRB vnet[30805]: dpdk_ipsec_process:1012: not enough
>> DPDK
>> crypto resources, default to OpenSSL
>> May 29 09:27:13 MYRB vnet[30805]: unix_signal_handler:124: received
>> signal
>> SIGCONT, PC 0x7fa535dfbac0
>> May 29 09:27:13 MYRB vnet[30805]: received SIGTERM, exiting...
>> May 29 09:27:13 MYRB systemd[1]: Stopping vector packet processing
>> engine...
>> May 29 09:27:13 MYRB vnet[30805]: unix_signal_handler:124: received
>> signal
>> SIGTERM, PC 0x7fa534121867
>> May 29 09:27:13 MYRB systemd[1]: Stopped vector packet processing engine.
>>
>>
>> ________________________________
>> From: Andrew 👽 Yourtchenko <ayour...@gmail.com>
>> Sent: Monday, May 28, 2018 5:58 PM
>> To: Rubina Bianchi
>> Cc: vpp-dev@lists.fd.io
>> Subject: Re: [vpp-dev] Rx stuck to 0 after a while
>>
>> Dear Rubina,
>>
>> Thanks for catching and reporting this!
>>
>> I suspect what might be happening is that my recent change of using two
>> unidirectional sessions in the bihash vs. the single one triggered a race,
>> whereby as the owning worker is deleting the session,
>> the non-owning worker is trying to update it. That would logically
>> explain the "BUG: .." line (since you don't change the interfaces or
>> move the traffic around, the 5-tuples should not collide), as well as
>> the later stop.
>>
>> To take care of this issue, I think I will split the deletion of the
>> session into two stages:
>> 1) deactivation of the bihash entries that steer the traffic
>> 2) freeing up the per-worker session structure
>>
>> and have a little pause in between these two so that the workers still
>> in progress can finish updating the structures.
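>>
>> A rough sketch of the idea in C (the types and names below are only
>> illustrative placeholders, not the actual acl-plugin code):
>>
>> #include <unistd.h>
>>
>> typedef struct
>> {
>>   int bihash_entries_active;    /* 5-tuple entries steering traffic */
>>   int allocated;                /* per-worker session structure in use */
>> } fa_session_sketch_t;
>>
>> static void
>> delete_session_two_stage (fa_session_sketch_t * sess)
>> {
>>   /* Stage 1: deactivate the bihash entries so that no new lookups can
>>    * steer traffic to this session. */
>>   sess->bihash_entries_active = 0;
>>
>>   /* Small pause so that workers already touching the session can
>>    * finish their updates. */
>>   usleep (10000);
>>
>>   /* Stage 2: only now free up the per-worker session structure. */
>>   sess->allocated = 0;
>> }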
>>
>> The below gerrit is the first cut:
>>
>> https://gerrit.fd.io/r/#/c/12770/
>>
>> It passes the make test right now, but I have not kicked its tires too
>> much yet; I will do that tomorrow.
>>
>> You can try this change out in your test setup as well and tell me how it
>> feels.
>>
>> --a
>>
>> On 5/28/18, Rubina Bianchi <r_bian...@outlook.com> wrote:
>>> Hi
>>>
>>> I am running vpp v18.07-rc0~237-g525c9d0f with only 2 interfaces in a
>>> stateful ACL (permit+reflect) setup and generating SFR traffic using
>>> trex v2.27. My Rx drops to 0 after a short while, about 300 seconds on
>>> my machine. Here is the vpp status:
>>>
>>> root@MYRB:~# service vpp status
>>> * vpp.service - vector packet processing engine
>>>    Loaded: loaded (/lib/systemd/system/vpp.service; disabled; vendor
>>> preset:
>>> enabled)
>>>    Active: failed (Result: signal) since Mon 2018-05-28 11:35:03 +0130;
>>> 37s
>>> ago
>>>   Process: 32838 ExecStopPost=/bin/rm -f /dev/shm/db /dev/shm/global_vm
>>> /dev/shm/vpe-api (code=exited, status=0/SUCCESS)
>>>   Process: 31754 ExecStart=/usr/bin/vpp -c /etc/vpp/startup.conf
>>> (code=killed, signal=ABRT)
>>>   Process: 31750 ExecStartPre=/sbin/modprobe uio_pci_generic
>>> (code=exited,
>>> status=0/SUCCESS)
>>>   Process: 31747 ExecStartPre=/bin/rm -f /dev/shm/db /dev/shm/global_vm
>>> /dev/shm/vpe-api (code=exited, status=0/SUCCESS)
>>>  Main PID: 31754 (code=killed, signal=ABRT)
>>>
>>> May 28 16:32:47 MYRB vnet[31754]: acl_fa_node_fn:210: BUG: session
>>> LSB16(sw_if_index) and 5-tuple collision!
>>> May 28 16:35:02 MYRB vnet[31754]: unix_signal_handler:124: received
>>> signal
>>> SIGCONT, PC 0x7f1fb591cac0
>>> May 28 16:35:02 MYRB vnet[31754]: received SIGTERM, exiting...
>>> May 28 16:35:02 MYRB systemd[1]: Stopping vector packet processing
>>> engine...
>>> May 28 16:35:02 MYRB vnet[31754]: unix_signal_handler:124: received
>>> signal
>>> SIGTERM, PC 0x7f1fb3c40867
>>> May 28 16:35:03 MYRB vpp[31754]: vlib_worker_thread_barrier_sync_int:
>>> worker
>>> thread deadlock
>>> May 28 16:35:03 MYRB systemd[1]: vpp.service: Main process exited,
>>> code=killed, status=6/ABRT
>>> May 28 16:35:03 MYRB systemd[1]: Stopped vector packet processing
>>> engine.
>>> May 28 16:35:03 MYRB systemd[1]: vpp.service: Unit entered failed state.
>>> May 28 16:35:03 MYRB systemd[1]: vpp.service: Failed with result
>>> 'signal'.
>>>
>>> I have attached my vpp configs to this email. I also ran this test with
>>> the same config but with 4 interfaces instead of two. In that case
>>> nothing happened to vpp and it remained functional for a long time.
>>>
>>> Thanks,
>>> RB
>>>
>>
>
