Re: [vpp-dev] Rx stuck to 0 after a while

2018-06-02 Thread Andrew Yourtchenko
Dear Rubina,

Excellent, thank you very much! The change is in the master now.

Note that, to keep the default memory footprint the same, I have temporarily 
halved the default upper limit on sessions (since we now create two bihash 
entries instead of one).
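
(To see why halving keeps the footprint level, assume a default session limit 
S: before the change there was one bihash entry per session, i.e. S entries in 
total; after it there are two per session, so 2 x (S/2) = S entries, and the 
default bihash footprint stays roughly the same.)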

FYI, I plan to do some more work on session management/reuse before the 1807 
release.

--a

> On 2 Jun 2018, at 07:48, Rubina Bianchi wrote:
> 
> Dear Andrew,
> 
> Sorry for the delayed response. I checked your second patch and here is my 
> test result:
> 
> The best case is still the best: VPP throughput is at its maximum (18.5 Gbps) 
> in my scenario.
> The worst case is better than before: I never see the deadlock again, and 
> throughput increases from 50 Mbps to 5.5 Gbps. I have also added my T-Rex 
> result.
> 
> -Per port stats table
>       ports |               0 |               1
>  -----------------------------------------------------
>    opackets |      1119818503 |      1065627562
>      obytes |    490687253990 |    471065675962
>    ipackets |       274437415 |       391504529
>      ibytes |    120020261974 |    170214837563
>     ierrors |               0 |               0
>     oerrors |               0 |               0
>       Tx Bw |       9.48 Gbps |       9.08 Gbps
> 
> -Global stats enabled 
>  Cpu Utilization : 88.4  %  7.0 Gb/core 
>  Platform_factor : 1.0  
>  Total-Tx:  18.56 Gbps  
>  Total-Rx:   5.78 Gbps  
>  Total-PPS   :   5.27 Mpps  
>  Total-CPS   :  79.51 Kcps  
> 
>  Expected-PPS:   9.02 Mpps  
>  Expected-CPS: 135.31 Kcps  
>  Expected-BPS:  31.77 Gbps  
> 
>  Active-flows : 88840     Clients :   252   Socket-util : 0.5598 %
>  Open-flows   : 33973880  Servers : 65532   Socket : 88840   Socket/Clients : 352.5
>  drop-rate   :  12.79 Gbps   
>  current time: 423.4 sec  
>  test duration   : 99576.6 sec
> 
> One point that I missed, and which may be helpful: I run T-Rex with the '-p' 
> parameter:
> ./t-rex-64 -c 6 -d 10 -f cap2/sfr.yaml --cfg cfg/trex_cfg.yaml -m 30 -p
> 
> Thanks,
> Sincerely
> 
> From: Andrew Yourtchenko
> Sent: Wednesday, May 30, 2018 12:08 PM
> To: Rubina Bianchi
> Cc: vpp-dev@lists.fd.io
> Subject: Re: [vpp-dev] Rx stuck to 0 after a while
>  
> Dear Rubina,
> 
> Thanks for checking it!
> 
> Yeah, actually that patch was leaking the sessions in the session-reuse
> path. I got the setup in the lab locally yesterday and am working
> on a better way to do it...
> 
> I will get back to you when I am happy with the way the code works.
> 
> --a
> 
> 
> 
> On 5/29/18, Rubina Bianchi wrote:
> > Dear Andrew,
> >
> > I cleaned everything and created new deb packages with your patch once
> > again. With your patch I never see the deadlock again, but I still have
> > a throughput problem in my scenario.
> >
> > -Per port stats table
> >       ports |               0 |               1
> >  -----------------------------------------------------
> >    opackets |       474826597 |       452028770
> >      obytes |    207843848531 |    199591809555
> >    ipackets |        71010677 |        72028456
> >      ibytes |     31441646551 |     31687562468
> >     ierrors |               0 |               0
> >     oerrors |               0 |               0
> >       Tx Bw |       9.56 Gbps |       9.16 Gbps
> >
> > -Global stats enabled
> >  Cpu Utilization : 88.4  %  7.1 Gb/core
> >  Platform_factor : 1.0
> >  Total-Tx:  18.72 Gbps
> >  Total-Rx:  59.30 Mbps
> >  Total-PPS   :   5.31 Mpps
> >  Total-CPS   :  79.79 Kcps
> >
> >  Expected-PPS:   9.02 Mpps
> >  Expected-CPS: 135.31 Kcps
> >  Expected-BPS:  31.77 Gbps
> >
> >  Active-flows : 88837     Clients :   252   Socket-util : 0.5598 %
> >  Open-flows   : 14708455  Servers : 65532   Socket : 88837   Socket/Clients : 352.5
> >  Total_queue_full : 328355248
> >  drop-rate   :  18.66 Gbps
> >  current time: 180.9 sec
> >  test duration   : 99819.1 sec
> >
> > In the best case (4 interfaces on one NUMA node, with ACLs on only 2 of
> > them) my device (HP DL380 G9) reaches maximum throughput (18.72 Gbps), but
> > in the worst case (4 interfaces on one NUMA node, with ACLs on all of them)
> > throughput decreases from the maximum to around 60 Mbps. So the patch just
> > prevents the deadlock in my case, but throughput is the same as before.
> >
> > _

Re: [vpp-dev] Rx stuck to 0 after a while

2018-06-01 Thread Rubina Bianchi
Dear Andrew,

Sorry for the delayed response. I checked your second patch and here is my test 
result:

The best case is still the best: VPP throughput is at its maximum (18.5 Gbps) 
in my scenario.
The worst case is better than before: I never see the deadlock again, and 
throughput increases from 50 Mbps to 5.5 Gbps. I have also added my T-Rex result.

-Per port stats table
      ports |               0 |               1
 -----------------------------------------------------
   opackets |      1119818503 |      1065627562
     obytes |    490687253990 |    471065675962
   ipackets |       274437415 |       391504529
     ibytes |    120020261974 |    170214837563
    ierrors |               0 |               0
    oerrors |               0 |               0
      Tx Bw |       9.48 Gbps |       9.08 Gbps

-Global stats enabled
 Cpu Utilization : 88.4  %  7.0 Gb/core
 Platform_factor : 1.0
 Total-Tx:  18.56 Gbps
 Total-Rx:   5.78 Gbps
 Total-PPS   :   5.27 Mpps
 Total-CPS   :  79.51 Kcps

 Expected-PPS:   9.02 Mpps
 Expected-CPS: 135.31 Kcps
 Expected-BPS:  31.77 Gbps

 Active-flows : 88840     Clients :   252   Socket-util : 0.5598 %
 Open-flows   : 33973880  Servers : 65532   Socket : 88840   Socket/Clients : 352.5
 drop-rate   :  12.79 Gbps
 current time: 423.4 sec
 test duration   : 99576.6 sec

One point that I missed, and which may be helpful: I run T-Rex with the '-p' 
parameter:
./t-rex-64 -c 6 -d 10 -f cap2/sfr.yaml --cfg cfg/trex_cfg.yaml -m 30 -p

Thanks,
Sincerely


From: Andrew Yourtchenko
Sent: Wednesday, May 30, 2018 12:08 PM
To: Rubina Bianchi
Cc: vpp-dev@lists.fd.io
Subject: Re: [vpp-dev] Rx stuck to 0 after a while

Dear Rubina,

Thanks for checking it!

Yeah, actually that patch was leaking the sessions in the session-reuse
path. I got the setup in the lab locally yesterday and am working
on a better way to do it...

I will get back to you when I am happy with the way the code works.

--a



On 5/29/18, Rubina Bianchi wrote:
> Dear Andrew,
>
> I cleaned everything and created new deb packages with your patch once
> again. With your patch I never see the deadlock again, but I still have
> a throughput problem in my scenario.
>
> -Per port stats table
>       ports |               0 |               1
>  -----------------------------------------------------
>    opackets |       474826597 |       452028770
>      obytes |    207843848531 |    199591809555
>    ipackets |        71010677 |        72028456
>      ibytes |     31441646551 |     31687562468
>     ierrors |               0 |               0
>     oerrors |               0 |               0
>       Tx Bw |       9.56 Gbps |       9.16 Gbps
>
> -Global stats enabled
>  Cpu Utilization : 88.4  %  7.1 Gb/core
>  Platform_factor : 1.0
>  Total-Tx:  18.72 Gbps
>  Total-Rx:  59.30 Mbps
>  Total-PPS   :   5.31 Mpps
>  Total-CPS   :  79.79 Kcps
>
>  Expected-PPS:   9.02 Mpps
>  Expected-CPS: 135.31 Kcps
>  Expected-BPS:  31.77 Gbps
>
>  Active-flows : 88837     Clients :   252   Socket-util : 0.5598 %
>  Open-flows   : 14708455  Servers : 65532   Socket : 88837   Socket/Clients : 352.5
>  Total_queue_full : 328355248
>  drop-rate   :  18.66 Gbps
>  current time: 180.9 sec
>  test duration   : 99819.1 sec
>
> In the best case (4 interfaces on one NUMA node, with ACLs on only 2 of
> them) my device (HP DL380 G9) reaches maximum throughput (18.72 Gbps), but
> in the worst case (4 interfaces on one NUMA node, with ACLs on all of them)
> throughput decreases from the maximum to around 60 Mbps. So the patch just
> prevents the deadlock in my case, but throughput is the same as before.
>
> ____
> From: Andrew Yourtchenko
> Sent: Tuesday, May 29, 2018 10:11 AM
> To: Rubina Bianchi
> Cc: vpp-dev@lists.fd.io
> Subject: Re: [vpp-dev] Rx stuck to 0 after a while
>
> Dear Rubina,
>
> thank you for quickly checking it!
>
> Judging by the logs, VPP quits, so I would say there should be a
> core file; could you check?
>
> If you find it (double-check by the timestamps that it is indeed the
> fresh one), you can load it in gdb (using gdb 'path-to-vpp-binary'
> 'path-to-core') and then get the backtrace using 'bt'; this will give
> a better idea of what is going on.
>
> --a
>
> On 5/29/18, Rubina Bianchi wrote:
>> Dear Andrew,
>>
>> I tested your patch and my problem still exists, but my service status
>> changed and now there isn't any information about the deadlock problem.
>> Do you have any idea how I can provide you more information?
>>
>> 

Re: [vpp-dev] Rx stuck to 0 after a while

2018-05-30 Thread Andrew Yourtchenko
Dear Rubina,

okay, I think I am reasonably happy with the change in
https://gerrit.fd.io/r/#/c/12770/ -
I have also rebased it onto the latest master so that it is ready to
commit if it works for you.

Please give it a shot and let me know. Note that you might need to
adjust the bihash memory, as I am now storing the forward and reverse
entries explicitly (rather than calculating them per-packet).
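
A minimal compilable sketch of that idea, with hypothetical type and function
names (the actual VPP acl-plugin code differs): the mirrored reverse 5-tuple
is built once at session-creation time and stored as a second bihash entry,
which is why the per-session bihash memory roughly doubles.

/* Hypothetical sketch only -- names do not match the real acl-plugin. */
#include <stdint.h>
#include <stdio.h>

typedef struct
{
  uint32_t src_ip, dst_ip;   /* IPv4, host byte order */
  uint16_t src_port, dst_port;
  uint8_t proto;
} tuple5_t;

/* Build the reverse-direction key once, at session creation, instead of
 * recomputing the mirrored tuple for every packet on the return path. */
static tuple5_t
mirror_tuple (tuple5_t f)
{
  tuple5_t r = { f.dst_ip, f.src_ip, f.dst_port, f.src_port, f.proto };
  return r;
}

int
main (void)
{
  tuple5_t fwd = { 0x0a000001, 0x0a000002, 12345, 80, 6 /* TCP */ };
  tuple5_t rev = mirror_tuple (fwd);

  /* Both keys go into the bihash: two entries per session, hence roughly
   * double the bihash memory per session. */
  printf ("fwd %08x:%u -> %08x:%u\n", fwd.src_ip, (unsigned) fwd.src_port,
          fwd.dst_ip, (unsigned) fwd.dst_port);
  printf ("rev %08x:%u -> %08x:%u\n", rev.src_ip, (unsigned) rev.src_port,
          rev.dst_ip, (unsigned) rev.dst_port);
  return 0;
}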

Please let me know how it works in your test setup.

thanks,
andrew

On 5/30/18, Andrew Yourtchenko wrote:
> Dear Rubina,
>
> Thanks for checking it!
>
> Yeah, actually that patch was leaking the sessions in the session-reuse
> path. I got the setup in the lab locally yesterday and am working
> on a better way to do it...
>
> I will get back to you when I am happy with the way the code works.
>
> --a
>
>
>
> On 5/29/18, Rubina Bianchi wrote:
>> Dear Andrew,
>>
>> I cleaned everything and created new deb packages with your patch once
>> again. With your patch I never see the deadlock again, but I still have
>> a throughput problem in my scenario.
>>
>> -Per port stats table
>>       ports |               0 |               1
>>  -----------------------------------------------------
>>    opackets |       474826597 |       452028770
>>      obytes |    207843848531 |    199591809555
>>    ipackets |        71010677 |        72028456
>>      ibytes |     31441646551 |     31687562468
>>     ierrors |               0 |               0
>>     oerrors |               0 |               0
>>       Tx Bw |       9.56 Gbps |       9.16 Gbps
>>
>> -Global stats enabled
>>  Cpu Utilization : 88.4  %  7.1 Gb/core
>>  Platform_factor : 1.0
>>  Total-Tx:  18.72 Gbps
>>  Total-Rx:  59.30 Mbps
>>  Total-PPS   :   5.31 Mpps
>>  Total-CPS   :  79.79 Kcps
>>
>>  Expected-PPS:   9.02 Mpps
>>  Expected-CPS: 135.31 Kcps
>>  Expected-BPS:  31.77 Gbps
>>
>>  Active-flows : 88837     Clients :   252   Socket-util : 0.5598 %
>>  Open-flows   : 14708455  Servers : 65532   Socket : 88837   Socket/Clients : 352.5
>>  Total_queue_full : 328355248
>>  drop-rate   :  18.66 Gbps
>>  current time: 180.9 sec
>>  test duration   : 99819.1 sec
>>
>> In the best case (4 interfaces on one NUMA node, with ACLs on only 2 of
>> them) my device (HP DL380 G9) reaches maximum throughput (18.72 Gbps), but
>> in the worst case (4 interfaces on one NUMA node, with ACLs on all of them)
>> throughput decreases from the maximum to around 60 Mbps. So the patch just
>> prevents the deadlock in my case, but throughput is the same as before.
>>
>> 
>> From: Andrew Yourtchenko
>> Sent: Tuesday, May 29, 2018 10:11 AM
>> To: Rubina Bianchi
>> Cc: vpp-dev@lists.fd.io
>> Subject: Re: [vpp-dev] Rx stuck to 0 after a while
>>
>> Dear Rubina,
>>
>> thank you for quickly checking it!
>>
>> Judging by the logs, VPP quits, so I would say there should be a
>> core file; could you check?
>>
>> If you find it (double-check by the timestamps that it is indeed the
>> fresh one), you can load it in gdb (using gdb 'path-to-vpp-binary'
>> 'path-to-core') and then get the backtrace using 'bt'; this will give
>> a better idea of what is going on.
>>
>> --a
>>
>> On 5/29/18, Rubina Bianchi wrote:
>>> Dear Andrew,
>>>
>>> I tested your patch and my problem still exists, but my service status
>>> changed and now there isn't any information about the deadlock problem.
>>> Do you have any idea how I can provide you more information?
>>>
>>> root@MYRB:~# service vpp status
>>> * vpp.service - vector packet processing engine
>>>    Loaded: loaded (/lib/systemd/system/vpp.service; disabled; vendor
>>> preset: enabled)
>>>Active: inactive (dead)
>>>
>>> May 29 09:27:06 MYRB /usr/bin/vpp[30805]: load_one_vat_plugin:67: Loaded
>>> plugin: udp_ping_test_plugin.so
>>> May 29 09:27:06 MYRB /usr/bin/vpp[30805]: load_one_vat_plugin:67: Loaded
>>> plugin: stn_test_plugin.so
>>> May 29 09:27:06 MYRB vpp[30805]: /usr/bin/vpp[30805]: dpdk: EAL init args:
>>> -c 1ff -n 4 --huge-dir /run/vpp/hugepages --file-prefix vpp
>>> -w 0000:08:00.0 -w 0000:08:00.1 -w 0000:08
>>> May 29 09:27:06 MYRB /usr/bin/vpp[30805]: dpdk: EAL init args: -c 1ff -n 4
>>> --huge-dir

Re: [vpp-dev] Rx stuck to 0 after a while

2018-05-30 Thread Andrew Yourtchenko
Dear Rubina,

Thanks for checking it!

Yeah, actually that patch was leaking the sessions in the session-reuse
path. I got the setup in the lab locally yesterday and am working
on a better way to do it...

I will get back to you when I am happy with the way the code works.

--a



On 5/29/18, Rubina Bianchi wrote:
> Dear Andrew,
>
> I cleaned everything and created new deb packages with your patch once
> again. With your patch I never see the deadlock again, but I still have
> a throughput problem in my scenario.
>
> -Per port stats table
>       ports |               0 |               1
>  -----------------------------------------------------
>    opackets |       474826597 |       452028770
>      obytes |    207843848531 |    199591809555
>    ipackets |        71010677 |        72028456
>      ibytes |     31441646551 |     31687562468
>     ierrors |               0 |               0
>     oerrors |               0 |               0
>       Tx Bw |       9.56 Gbps |       9.16 Gbps
>
> -Global stats enabled
>  Cpu Utilization : 88.4  %  7.1 Gb/core
>  Platform_factor : 1.0
>  Total-Tx:  18.72 Gbps
>  Total-Rx:  59.30 Mbps
>  Total-PPS   :   5.31 Mpps
>  Total-CPS   :  79.79 Kcps
>
>  Expected-PPS:   9.02 Mpps
>  Expected-CPS: 135.31 Kcps
>  Expected-BPS:  31.77 Gbps
>
>  Active-flows : 88837     Clients :   252   Socket-util : 0.5598 %
>  Open-flows   : 14708455  Servers : 65532   Socket : 88837   Socket/Clients : 352.5
>  Total_queue_full : 328355248
>  drop-rate   :  18.66 Gbps
>  current time: 180.9 sec
>  test duration   : 99819.1 sec
>
> In the best case (4 interfaces on one NUMA node, with ACLs on only 2 of
> them) my device (HP DL380 G9) reaches maximum throughput (18.72 Gbps), but
> in the worst case (4 interfaces on one NUMA node, with ACLs on all of them)
> throughput decreases from the maximum to around 60 Mbps. So the patch just
> prevents the deadlock in my case, but throughput is the same as before.
>
> ____
> From: Andrew Yourtchenko
> Sent: Tuesday, May 29, 2018 10:11 AM
> To: Rubina Bianchi
> Cc: vpp-dev@lists.fd.io
> Subject: Re: [vpp-dev] Rx stuck to 0 after a while
>
> Dear Rubina,
>
> thank you for quickly checking it!
>
> Judging by the logs, VPP quits, so I would say there should be a
> core file; could you check?
>
> If you find it (double-check by the timestamps that it is indeed the
> fresh one), you can load it in gdb (using gdb 'path-to-vpp-binary'
> 'path-to-core') and then get the backtrace using 'bt'; this will give
> a better idea of what is going on.
>
> --a
>
> On 5/29/18, Rubina Bianchi wrote:
>> Dear Andrew,
>>
>> I tested your patch and my problem still exists, but my service status
>> changed and now there isn't any information about the deadlock problem.
>> Do you have any idea how I can provide you more information?
>>
>> root@MYRB:~# service vpp status
>> * vpp.service - vector packet processing engine
>>    Loaded: loaded (/lib/systemd/system/vpp.service; disabled; vendor
>> preset: enabled)
>>Active: inactive (dead)
>>
>> May 29 09:27:06 MYRB /usr/bin/vpp[30805]: load_one_vat_plugin:67: Loaded
>> plugin: udp_ping_test_plugin.so
>> May 29 09:27:06 MYRB /usr/bin/vpp[30805]: load_one_vat_plugin:67: Loaded
>> plugin: stn_test_plugin.so
>> May 29 09:27:06 MYRB vpp[30805]: /usr/bin/vpp[30805]: dpdk: EAL init args:
>> -c 1ff -n 4 --huge-dir /run/vpp/hugepages --file-prefix vpp
>> -w 0000:08:00.0 -w 0000:08:00.1 -w 0000:08
>> May 29 09:27:06 MYRB /usr/bin/vpp[30805]: dpdk: EAL init args: -c 1ff -n 4
>> --huge-dir /run/vpp/hugepages --file-prefix vpp -w 0000:08:00.0 -w
>> 0000:08:00.1 -w 0000:08:00.2 -w 000
>> May 29 09:27:07 MYRB vnet[30805]: dpdk_ipsec_process:1012: not enough DPDK
>> crypto resources, default to OpenSSL
>> May 29 09:27:13 MYRB vnet[30805]: unix_signal_handler:124: received signal
>> SIGCONT, PC 0x7fa535dfbac0
>> May 29 09:27:13 MYRB vnet[30805]: received SIGTERM, exiting...
>> May 29 09:27:13 MYRB systemd[1]: Stopping vector packet processing
>> engine...
>> May 29 09:27:13 MYRB vnet[30805]: unix_signal_handler:124: received signal
>> SIGTERM, PC 0x7fa534121867
>> May 29 09:27:13 MYRB systemd[1]: Stopped vector packet processing engine.
>>
>>
>> 
>> From: Andrew Yourtchenko
>> Sent: Monday, May 28, 2018 5:58 PM
>> To: Rubina Bianchi
>> Cc: vpp-dev@lists.fd.io
>> Subject: Re: [vpp-dev]

Re: [vpp-dev] Rx stuck to 0 after a while

2018-05-29 Thread Rubina Bianchi
Dear Andrew,

I cleaned everything and created new deb packages with your patch once again. 
With your patch I never see the deadlock again, but I still have a throughput 
problem in my scenario.

-Per port stats table
      ports |               0 |               1
 -----------------------------------------------------
   opackets |       474826597 |       452028770
     obytes |    207843848531 |    199591809555
   ipackets |        71010677 |        72028456
     ibytes |     31441646551 |     31687562468
    ierrors |               0 |               0
    oerrors |               0 |               0
      Tx Bw |       9.56 Gbps |       9.16 Gbps

-Global stats enabled
 Cpu Utilization : 88.4  %  7.1 Gb/core
 Platform_factor : 1.0
 Total-Tx:  18.72 Gbps
 Total-Rx:  59.30 Mbps
 Total-PPS   :   5.31 Mpps
 Total-CPS   :  79.79 Kcps

 Expected-PPS:   9.02 Mpps
 Expected-CPS: 135.31 Kcps
 Expected-BPS:  31.77 Gbps

 Active-flows : 88837     Clients :   252   Socket-util : 0.5598 %
 Open-flows   : 14708455  Servers : 65532   Socket : 88837   Socket/Clients : 352.5
 Total_queue_full : 328355248
 drop-rate   :  18.66 Gbps
 current time: 180.9 sec
 test duration   : 99819.1 sec

In the best case (4 interfaces on one NUMA node, with ACLs on only 2 of them) 
my device (HP DL380 G9) reaches maximum throughput (18.72 Gbps), but in the 
worst case (4 interfaces on one NUMA node, with ACLs on all of them) throughput 
decreases from the maximum to around 60 Mbps. So the patch just prevents the 
deadlock in my case, but throughput is the same as before.


From: Andrew Yourtchenko
Sent: Tuesday, May 29, 2018 10:11 AM
To: Rubina Bianchi
Cc: vpp-dev@lists.fd.io
Subject: Re: [vpp-dev] Rx stuck to 0 after a while

Dear Rubina,

Thank you for quickly checking it!

Judging by the logs, VPP quits, so I would say there should be a
core file; could you check?

If you find it (double-check by the timestamps that it is indeed the
fresh one), you can load it in gdb (using gdb 'path-to-vpp-binary'
'path-to-core') and then get the backtrace using 'bt'; this will give
a better idea of what is going on.

--a

On 5/29/18, Rubina Bianchi wrote:
> Dear Andrew,
>
> I tested your patch and my problem still exists, but my service status
> changed and now there isn't any information about the deadlock problem. Do
> you have any idea how I can provide you more information?
>
> root@MYRB:~# service vpp status
> * vpp.service - vector packet processing engine
>Loaded: loaded (/lib/systemd/system/vpp.service; disabled; vendor preset:
> enabled)
>Active: inactive (dead)
>
> May 29 09:27:06 MYRB /usr/bin/vpp[30805]: load_one_vat_plugin:67: Loaded
> plugin: udp_ping_test_plugin.so
> May 29 09:27:06 MYRB /usr/bin/vpp[30805]: load_one_vat_plugin:67: Loaded
> plugin: stn_test_plugin.so
> May 29 09:27:06 MYRB vpp[30805]: /usr/bin/vpp[30805]: dpdk: EAL init args:
> -c 1ff -n 4 --huge-dir /run/vpp/hugepages --file-prefix vpp -w 0000:08:00.0
> -w 0000:08:00.1 -w 0000:08
> May 29 09:27:06 MYRB /usr/bin/vpp[30805]: dpdk: EAL init args: -c 1ff -n 4
> --huge-dir /run/vpp/hugepages --file-prefix vpp -w 0000:08:00.0 -w
> 0000:08:00.1 -w 0000:08:00.2 -w 000
> May 29 09:27:07 MYRB vnet[30805]: dpdk_ipsec_process:1012: not enough DPDK
> crypto resources, default to OpenSSL
> May 29 09:27:13 MYRB vnet[30805]: unix_signal_handler:124: received signal
> SIGCONT, PC 0x7fa535dfbac0
> May 29 09:27:13 MYRB vnet[30805]: received SIGTERM, exiting...
> May 29 09:27:13 MYRB systemd[1]: Stopping vector packet processing
> engine...
> May 29 09:27:13 MYRB vnet[30805]: unix_signal_handler:124: received signal
> SIGTERM, PC 0x7fa534121867
> May 29 09:27:13 MYRB systemd[1]: Stopped vector packet processing engine.
>
>
> 
> From: Andrew Yourtchenko
> Sent: Monday, May 28, 2018 5:58 PM
> To: Rubina Bianchi
> Cc: vpp-dev@lists.fd.io
> Subject: Re: [vpp-dev] Rx stuck to 0 after a while
>
> Dear Rubina,
>
> Thanks for catching and reporting this!
>
> I suspect what might be happening is that my recent change to using two
> unidirectional sessions in the bihash instead of a single one triggered a
> race, whereby as the owning worker is deleting the session,
> the non-owning worker is trying to update it. That would logically
> explain the "BUG: ..." line (since you don't change the interfaces or
> move the traffic around, the 5-tuples should not collide), as well as
> the later stop.
>
> To take care of this issue, I think I will split the deletion of the
> session into two stages:
> 1) deactivation of the bihash entries that steer the traffic
> 2) freeing up the per-worker session structure
>
> and have a little pause time in between these two so that the
> workers-in-progress could finish updating the structures.

Re: [vpp-dev] Rx stuck to 0 after a while

2018-05-29 Thread Andrew Yourtchenko
Dear Rubina,

Thank you for quickly checking it!

Judging by the logs, VPP quits, so I would say there should be a
core file; could you check?

If you find it (double-check by the timestamps that it is indeed the
fresh one), you can load it in gdb (using gdb 'path-to-vpp-binary'
'path-to-core') and then get the backtrace using 'bt'; this will give
a better idea of what is going on.

--a

On 5/29/18, Rubina Bianchi wrote:
> Dear Andrew,
>
> I tested your patch and my problem still exists, but my service status
> changed and now there isn't any information about the deadlock problem. Do
> you have any idea how I can provide you more information?
>
> root@MYRB:~# service vpp status
> * vpp.service - vector packet processing engine
>Loaded: loaded (/lib/systemd/system/vpp.service; disabled; vendor preset:
> enabled)
>Active: inactive (dead)
>
> May 29 09:27:06 MYRB /usr/bin/vpp[30805]: load_one_vat_plugin:67: Loaded
> plugin: udp_ping_test_plugin.so
> May 29 09:27:06 MYRB /usr/bin/vpp[30805]: load_one_vat_plugin:67: Loaded
> plugin: stn_test_plugin.so
> May 29 09:27:06 MYRB vpp[30805]: /usr/bin/vpp[30805]: dpdk: EAL init args:
> -c 1ff -n 4 --huge-dir /run/vpp/hugepages --file-prefix vpp -w 0000:08:00.0
> -w 0000:08:00.1 -w 0000:08
> May 29 09:27:06 MYRB /usr/bin/vpp[30805]: dpdk: EAL init args: -c 1ff -n 4
> --huge-dir /run/vpp/hugepages --file-prefix vpp -w 0000:08:00.0 -w
> 0000:08:00.1 -w 0000:08:00.2 -w 000
> May 29 09:27:07 MYRB vnet[30805]: dpdk_ipsec_process:1012: not enough DPDK
> crypto resources, default to OpenSSL
> May 29 09:27:13 MYRB vnet[30805]: unix_signal_handler:124: received signal
> SIGCONT, PC 0x7fa535dfbac0
> May 29 09:27:13 MYRB vnet[30805]: received SIGTERM, exiting...
> May 29 09:27:13 MYRB systemd[1]: Stopping vector packet processing
> engine...
> May 29 09:27:13 MYRB vnet[30805]: unix_signal_handler:124: received signal
> SIGTERM, PC 0x7fa534121867
> May 29 09:27:13 MYRB systemd[1]: Stopped vector packet processing engine.
>
>
> 
> From: Andrew Yourtchenko
> Sent: Monday, May 28, 2018 5:58 PM
> To: Rubina Bianchi
> Cc: vpp-dev@lists.fd.io
> Subject: Re: [vpp-dev] Rx stuck to 0 after a while
>
> Dear Rubina,
>
> Thanks for catching and reporting this!
>
> I suspect what might be happening is that my recent change to using two
> unidirectional sessions in the bihash instead of a single one triggered a
> race, whereby as the owning worker is deleting the session,
> the non-owning worker is trying to update it. That would logically
> explain the "BUG: ..." line (since you don't change the interfaces or
> move the traffic around, the 5-tuples should not collide), as well as
> the later stop.
>
> To take care of this issue, I think I will split the deletion of the
> session into two stages:
> 1) deactivation of the bihash entries that steer the traffic
> 2) freeing up the per-worker session structure
>
> and have a little pause time in between these two so that the
> workers-in-progress could finish updating the structures.
>
> The below gerrit is the first cut:
>
> https://gerrit.fd.io/r/#/c/12770/
>
> It passes "make test" right now, but I did not kick its tires too
> much yet; I will do that tomorrow.
>
> You can try this change out in your test setup as well and tell me how it
> feels.
>
> --a
>
On 5/28/18, Rubina Bianchi wrote:
>> Hi
>>
>> I run vpp v18.07-rc0~237-g525c9d0f with only 2 interfaces in a stateful acl
>> (permit+reflect) setup and generated sfr traffic using trex v2.27. My rx
>> becomes 0 after a short while, about 300 sec on my machine. Here is the vpp
>> status:
>>
>> root@MYRB:~# service vpp status
>> * vpp.service - vector packet processing engine
>>    Loaded: loaded (/lib/systemd/system/vpp.service; disabled; vendor
>> preset: enabled)
>>    Active: failed (Result: signal) since Mon 2018-05-28 11:35:03 +0130;
>> 37s ago
>>   Process: 32838 ExecStopPost=/bin/rm -f /dev/shm/db /dev/shm/global_vm
>> /dev/shm/vpe-api (code=exited, status=0/SUCCESS)
>>   Process: 31754 ExecStart=/usr/bin/vpp -c /etc/vpp/startup.conf
>> (code=killed, signal=ABRT)
>>   Process: 31750 ExecStartPre=/sbin/modprobe uio_pci_generic
>> (code=exited, status=0/SUCCESS)
>>   Process: 31747 ExecStartPre=/bin/rm -f /dev/shm/db /dev/shm/global_vm
>> /dev/shm/vpe-api (code=exited, status=0/SUCCESS)
>>  Main PID: 31754 (code=killed, signal=ABRT)
>>
>> May 28 16:32:47 MYRB vnet[31754]: acl_fa_node_fn:210: BUG: session
>> LSB16(sw_if_index) and 5-tuple collision!
>> May 28 16:35:02 MYRB vnet[31754]: unix_signal_handler:124: received
>> signal SIGCONT, PC 0x7

Re: [vpp-dev] Rx stuck to 0 after a while

2018-05-29 Thread Rubina Bianchi
Dear Andrew,

I tested your patch and my problem still exists, but my service status changed 
and now there isn't any information about the deadlock problem. Do you have any 
idea how I can provide you more information?

root@MYRB:~# service vpp status
* vpp.service - vector packet processing engine
   Loaded: loaded (/lib/systemd/system/vpp.service; disabled; vendor preset: 
enabled)
   Active: inactive (dead)

May 29 09:27:06 MYRB /usr/bin/vpp[30805]: load_one_vat_plugin:67: Loaded 
plugin: udp_ping_test_plugin.so
May 29 09:27:06 MYRB /usr/bin/vpp[30805]: load_one_vat_plugin:67: Loaded 
plugin: stn_test_plugin.so
May 29 09:27:06 MYRB vpp[30805]: /usr/bin/vpp[30805]: dpdk: EAL init args: -c 
1ff -n 4 --huge-dir /run/vpp/hugepages --file-prefix vpp -w 0000:08:00.0 -w 
0000:08:00.1 -w 0000:08
May 29 09:27:06 MYRB /usr/bin/vpp[30805]: dpdk: EAL init args: -c 1ff -n 4 
--huge-dir /run/vpp/hugepages --file-prefix vpp -w 0000:08:00.0 -w 0000:08:00.1 
-w 0000:08:00.2 -w 000
May 29 09:27:07 MYRB vnet[30805]: dpdk_ipsec_process:1012: not enough DPDK 
crypto resources, default to OpenSSL
May 29 09:27:13 MYRB vnet[30805]: unix_signal_handler:124: received signal 
SIGCONT, PC 0x7fa535dfbac0
May 29 09:27:13 MYRB vnet[30805]: received SIGTERM, exiting...
May 29 09:27:13 MYRB systemd[1]: Stopping vector packet processing engine...
May 29 09:27:13 MYRB vnet[30805]: unix_signal_handler:124: received signal 
SIGTERM, PC 0x7fa534121867
May 29 09:27:13 MYRB systemd[1]: Stopped vector packet processing engine.



From: Andrew Yourtchenko
Sent: Monday, May 28, 2018 5:58 PM
To: Rubina Bianchi
Cc: vpp-dev@lists.fd.io
Subject: Re: [vpp-dev] Rx stuck to 0 after a while

Dear Rubina,

Thanks for catching and reporting this!

I suspect what might be happening is that my recent change to using two
unidirectional sessions in the bihash instead of a single one triggered a
race, whereby as the owning worker is deleting the session,
the non-owning worker is trying to update it. That would logically
explain the "BUG: ..." line (since you don't change the interfaces or
move the traffic around, the 5-tuples should not collide), as well as
the later stop.

To take care of this issue, I think I will split the deletion of the
session into two stages:
1) deactivation of the bihash entries that steer the traffic
2) freeing up the per-worker session structure

and have a little pause time in between these two so that the
workers-in-progress could finish updating the structures.

The below gerrit is the first cut:

https://gerrit.fd.io/r/#/c/12770/

It passes "make test" right now, but I did not kick its tires too
much yet; I will do that tomorrow.

You can try this change out in your test setup as well and tell me how it feels.

--a

On 5/28/18, Rubina Bianchi wrote:
> Hi
>
> I run vpp v18.07-rc0~237-g525c9d0f with only 2 interfaces in a stateful acl
> (permit+reflect) setup and generated sfr traffic using trex v2.27. My rx
> becomes 0 after a short while, about 300 sec on my machine. Here is the vpp
> status:
>
> root@MYRB:~# service vpp status
> * vpp.service - vector packet processing engine
>Loaded: loaded (/lib/systemd/system/vpp.service; disabled; vendor preset:
> enabled)
>    Active: failed (Result: signal) since Mon 2018-05-28 11:35:03 +0130; 37s ago
>   Process: 32838 ExecStopPost=/bin/rm -f /dev/shm/db /dev/shm/global_vm
> /dev/shm/vpe-api (code=exited, status=0/SUCCESS)
>   Process: 31754 ExecStart=/usr/bin/vpp -c /etc/vpp/startup.conf
> (code=killed, signal=ABRT)
>   Process: 31750 ExecStartPre=/sbin/modprobe uio_pci_generic (code=exited,
> status=0/SUCCESS)
>   Process: 31747 ExecStartPre=/bin/rm -f /dev/shm/db /dev/shm/global_vm
> /dev/shm/vpe-api (code=exited, status=0/SUCCESS)
>  Main PID: 31754 (code=killed, signal=ABRT)
>
> May 28 16:32:47 MYRB vnet[31754]: acl_fa_node_fn:210: BUG: session
> LSB16(sw_if_index) and 5-tuple collision!
> May 28 16:35:02 MYRB vnet[31754]: unix_signal_handler:124: received signal
> SIGCONT, PC 0x7f1fb591cac0
> May 28 16:35:02 MYRB vnet[31754]: received SIGTERM, exiting...
> May 28 16:35:02 MYRB systemd[1]: Stopping vector packet processing
> engine...
> May 28 16:35:02 MYRB vnet[31754]: unix_signal_handler:124: received signal
> SIGTERM, PC 0x7f1fb3c40867
> May 28 16:35:03 MYRB vpp[31754]: vlib_worker_thread_barrier_sync_int: worker
> thread deadlock
> May 28 16:35:03 MYRB systemd[1]: vpp.service: Main process exited,
> code=killed, status=6/ABRT
> May 28 16:35:03 MYRB systemd[1]: Stopped vector packet processing engine.
> May 28 16:35:03 MYRB systemd[1]: vpp.service: Unit entered failed state.
> May 28 16:35:03 MYRB systemd[1]: vpp.service: Failed with result 'signal'.
>
> I attach my vpp configs to this email. I also ran this test with the same
> config but with 4 interfaces instead of two. In that case nothing
> happened to vpp and it was functional for a long time.
>
> Thanks,
> RB
>


Re: [vpp-dev] Rx stuck to 0 after a while

2018-05-28 Thread Andrew Yourtchenko
Dear Rubina,

Thanks for catching and reporting this!

I suspect what might be happening is that my recent change to using two
unidirectional sessions in the bihash instead of a single one triggered a
race, whereby as the owning worker is deleting the session,
the non-owning worker is trying to update it. That would logically
explain the "BUG: ..." line (since you don't change the interfaces or
move the traffic around, the 5-tuples should not collide), as well as
the later stop.

To take care of this issue, I think I will split the deletion of the
session into two stages:
1) deactivation of the bihash entries that steer the traffic
2) freeing up the per-worker session structure

and have a little pause time in between these two so that the
workers-in-progress could finish updating the structures.
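
A minimal compilable sketch of that two-stage scheme, with hypothetical names
(the real VPP acl-plugin code differs):

/* Hypothetical sketch only -- not the actual acl-plugin API. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

typedef struct
{
  uint64_t fwd_key;   /* stand-in for the forward bihash entry */
  uint64_t rev_key;   /* stand-in for the reverse bihash entry */
  bool active;
} fa_session_t;

/* Stage 1: remove the bihash entries that steer the traffic, so no new
 * packet can look the session up; a worker already holding a pointer to
 * the session can still finish its update safely. */
static void
session_deactivate (fa_session_t *s)
{
  s->fwd_key = 0;   /* stand-in for a bihash delete */
  s->rev_key = 0;
  s->active = false;
}

/* Stage 2: only after a grace period, return the per-worker session
 * structure to the free list. */
static void
session_free (fa_session_t *s)
{
  printf ("session %p reclaimed\n", (void *) s);
}

int
main (void)
{
  fa_session_t s = { .fwd_key = 0xfeed, .rev_key = 0xbeef, .active = true };

  session_deactivate (&s);   /* stage 1: unlink from the bihash */
  usleep (100 * 1000);       /* the "little pause" for in-flight workers */
  session_free (&s);         /* stage 2: reclaim the memory */
  return 0;
}

The grace period only needs to outlast the longest in-flight worker
iteration; deferring the free is what keeps a stale worker pointer valid
during stage 1.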

The below gerrit is the first cut:

https://gerrit.fd.io/r/#/c/12770/

It passes "make test" right now, but I did not kick its tires too
much yet; I will do that tomorrow.

You can try this change out in your test setup as well and tell me how it feels.

--a

On 5/28/18, Rubina Bianchi wrote:
> Hi
>
> I run vpp v18.07-rc0~237-g525c9d0f with only 2 interfaces in a stateful acl
> (permit+reflect) setup and generated sfr traffic using trex v2.27. My rx
> becomes 0 after a short while, about 300 sec on my machine. Here is the vpp
> status:
>
> root@MYRB:~# service vpp status
> * vpp.service - vector packet processing engine
>Loaded: loaded (/lib/systemd/system/vpp.service; disabled; vendor preset:
> enabled)
>    Active: failed (Result: signal) since Mon 2018-05-28 11:35:03 +0130; 37s ago
>   Process: 32838 ExecStopPost=/bin/rm -f /dev/shm/db /dev/shm/global_vm
> /dev/shm/vpe-api (code=exited, status=0/SUCCESS)
>   Process: 31754 ExecStart=/usr/bin/vpp -c /etc/vpp/startup.conf
> (code=killed, signal=ABRT)
>   Process: 31750 ExecStartPre=/sbin/modprobe uio_pci_generic (code=exited,
> status=0/SUCCESS)
>   Process: 31747 ExecStartPre=/bin/rm -f /dev/shm/db /dev/shm/global_vm
> /dev/shm/vpe-api (code=exited, status=0/SUCCESS)
>  Main PID: 31754 (code=killed, signal=ABRT)
>
> May 28 16:32:47 MYRB vnet[31754]: acl_fa_node_fn:210: BUG: session
> LSB16(sw_if_index) and 5-tuple collision!
> May 28 16:35:02 MYRB vnet[31754]: unix_signal_handler:124: received signal
> SIGCONT, PC 0x7f1fb591cac0
> May 28 16:35:02 MYRB vnet[31754]: received SIGTERM, exiting...
> May 28 16:35:02 MYRB systemd[1]: Stopping vector packet processing
> engine...
> May 28 16:35:02 MYRB vnet[31754]: unix_signal_handler:124: received signal
> SIGTERM, PC 0x7f1fb3c40867
> May 28 16:35:03 MYRB vpp[31754]: vlib_worker_thread_barrier_sync_int: worker
> thread deadlock
> May 28 16:35:03 MYRB systemd[1]: vpp.service: Main process exited,
> code=killed, status=6/ABRT
> May 28 16:35:03 MYRB systemd[1]: Stopped vector packet processing engine.
> May 28 16:35:03 MYRB systemd[1]: vpp.service: Unit entered failed state.
> May 28 16:35:03 MYRB systemd[1]: vpp.service: Failed with result 'signal'.
>
> I attach my vpp configs to this email. I also ran this test with the same
> config but with 4 interfaces instead of two. In that case nothing
> happened to vpp and it was functional for a long time.
>
> Thanks,
> RB
>
