Hi John,
                The internal mechanism is very clear to me now.

                And do you have any thoughts about the deadlock on the main thread?

BR/Lollita Liu

From: John Lo (loj) [mailto:l...@cisco.com]
Sent: Tuesday, January 23, 2018 11:18 AM
To: Lollita Liu <lollita....@ericsson.com>; vpp-dev@lists.fd.io
Cc: David Yu Z <david.z...@ericsson.com>; Kingwel Xie 
<kingwel....@ericsson.com>; Terry Zhang Z <terry.z.zh...@ericsson.com>; Jordy 
You <jordy....@ericsson.com>
Subject: RE: Question and bug found on GTP performance testing

Hi Lollita,

Thank you for providing information from your performance test with observed 
behavior and problems.

On interface creation, including tunnels, VPP always creates dedicated output 
and tx nodes for each interface. As you correctly observed, these dedicated tx 
and output nodes are not used for most tunnel interfaces such as GTPU and 
VXLAN. All tunnel interfaces of the same tunnel type instead use an existing, 
tunnel-type-specific encap node as their output node.
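
In sketch form (simplified, and grounded in the "hi->output_node_index = 
encap_index;" line you quoted below; the helper name here is only 
illustrative), the tunnel-create path points the interface output at the 
shared encap node, so the auto-created per-tunnel nodes are never scheduled:

  /* Sketch, simplified from the gtpu plugin's tunnel-create path; the helper
   * name is illustrative. Every GTPU tunnel interface gets the shared
   * gtpu4-encap / gtpu6-encap node as its output node, so the dedicated
   * gtpu_tunnelXX-tx / -output nodes created at registration go unused. */
  static void
  gtpu_tunnel_use_shared_encap (vnet_main_t * vnm, u32 hw_if_index,
                                u32 encap_index)
  {
    vnet_hw_interface_t *hi = vnet_get_hw_interface (vnm, hw_if_index);
    hi->output_node_index = encap_index;   /* e.g. gtpu4_encap_node.index */
  }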

I can see that for large-scale tunnel deployments, creation of a large number 
of these unused output and tx nodes can be an issue, especially when multiple 
worker threads are used. The worker threads are blocked from forwarding 
packets while the main thread is busy creating these nodes and doing the 
corresponding setup for each worker thread.
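
For reference, node graph changes on the main thread are serialized against 
the workers with the thread barrier; in sketch form (simplified, not the exact 
interface-creation code):

  /* Sketch of why workers stall during bulk tunnel creation (assumption:
   * simplified from vlib's barrier usage around graph changes). */
  vlib_worker_thread_barrier_sync (vm);    /* all workers park at the barrier */
  /* ... register per-interface output/tx nodes, update the node graph ... */
  vlib_worker_thread_barrier_release (vm); /* workers resume forwarding */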

I believe we should improve VPP interface creation so that interfaces such as 
tunnels can be created with an existing (encap) node specified as the 
interface output node, without creating dedicated tx and output nodes.

Your observation that the forwarding PPS impact only occurs during the initial 
tunnel creation and not during subsequent delete and create is as expected. On 
tunnel deletion, the associated interface is not deleted but kept in a reuse 
pool for subsequent creation of a tunnel of the same type. This may not be the 
best approach for interface usage flexibility, but it certainly helps the 
efficiency of the tunnel delete and create cases.
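
In sketch form (names follow the vxlan/gtpu plugin pattern and are 
illustrative):

  /* Sketch of the interface reuse on tunnel delete/create (assumption:
   * modeled on the gtpu/vxlan plugins; names are illustrative). */

  /* On tunnel delete: park the hw interface instead of destroying it. */
  vec_add1 (gtm->free_gtpu_tunnel_hw_if_indices, t->hw_if_index);

  /* On a later create of the same tunnel type: pop a parked interface,
   * so no new -output/-tx nodes (and no graph update) are needed. */
  if (vec_len (gtm->free_gtpu_tunnel_hw_if_indices) > 0)
    hw_if_index = vec_pop (gtm->free_gtpu_tunnel_hw_if_indices);
  else
    hw_if_index = vnet_register_interface
      (vnm, gtpu_device_class.index, t - gtm->tunnels,
       gtpu_hw_class.index, t - gtm->tunnels);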

I will work on the interface creation improvement described above when I get a 
chance.  I can let you know when a patch is available on vpp master for you to 
try.  As for the 18.01 release, it is probably too late to include this improvement.

Regards,
John

From: vpp-dev-boun...@lists.fd.io [mailto:vpp-dev-boun...@lists.fd.io] On 
Behalf Of Lollita Liu
Sent: Monday, January 22, 2018 5:04 AM
To: vpp-dev@lists.fd.io
Cc: David Yu Z <david.z...@ericsson.com>; Kingwel Xie 
<kingwel....@ericsson.com>; Terry Zhang Z <terry.z.zh...@ericsson.com>; Jordy 
You <jordy....@ericsson.com>
Subject: [vpp-dev] Question and bug found on GTP performance testing

Hi,

                We are doing performance testing on the GTPU code in VPP, 
measuring the impact of tunnel creation/removal on GTPU performance. We found 
some curious behavior and one bug.



                Testing GTPU encapsulation with one CPU core across different 
rx and tx ports on the same NUMA node, with 10K pre-created GTPU tunnels all 
carrying traffic: the result is 4.7Mpps@64B.

                Testing the same setup (one CPU core, different rx and tx ports 
on the same NUMA node, 10K pre-created GTPU tunnels all carrying traffic) while 
creating another 10K GTPU tunnels at the same time: the result is about 
400Kpps@64B.


                The tunnel-creation commands are "create gtpu tunnel src 
1.4.1.1 dst 1.4.1.2 teid 1 decap-next ip4" and "ip route add 10.4.0.1/32 via 
gtpu_tunnel0".

You can see the throughput impact is huge. It looks like many nodes named 
gtpu_tunnelxx-tx and gtpu_tunnelxx-output are created, and all worker threads 
have to wait for the node graph update. But in the output of show runtime, no 
such node is ever called. In the source code, GTP-U encapsulation is taken over 
by gtpu4-encap with the following code: "hi->output_node_index = encap_index;". 
What are those gtpu_tunnel nodes used for?
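
(For reference, those node names seem to come from the generic interface 
registration path; if we read vnet/interface.c correctly, registering any hw 
interface creates a per-interface output and tx node, roughly as in this 
simplified sketch:)

  /* Sketch, simplified from vnet interface registration (assumption: this is
   * roughly what vnet_register_interface does for every new interface,
   * including each gtpu_tunnelXX). */
  vlib_node_registration_t r = { 0 };

  r.name = (char *) format (0, "%v-output", hi->name);  /* gtpu_tunnel0-output */
  r.function = vnet_interface_output_node;
  hi->output_node_index = vlib_register_node (vm, &r);

  r.name = (char *) format (0, "%v-tx", hi->name);       /* gtpu_tunnel0-tx */
  r.function = dev_class->tx_function;
  hi->tx_node_index = vlib_register_node (vm, &r);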

                Since those nodes appear to be unused, we tried another case 
with the following procedure:
                (1) Create 10K GTP tunnels
                (2) Rx-Tx on the same NUMA node using 1G hugepages and the 10K 
GTPU tunnels, with traffic on all 10K tunnels
                (3) Create another 30K GTP tunnels
                (4) Remove the last 30K GTP tunnels
                The main thread falls into a deadlock: there is no response on 
the command line, but the worker threads are unaffected.
In the GDB output, mheap_maybe_lock has been called twice: the timer_interrupt 
signal handler (frame #7) appears to have fired while the CLI process was 
already inside the memory allocator (frames #9-#14), and the handler itself 
allocates again via unix_cli_add_pending_output (frames #0-#5), so the second 
mheap_maybe_lock waits on the same heap lock forever (a standalone sketch of 
this pattern follows the backtrace below).
Thread 1 (Thread 0x7f335bef5740 (LWP 27464)):
#0  0x00007f335ab518d9 in mheap_maybe_lock (v=0x7f33199dd000) at 
/home/vpp/vpp/build-data/../src/vppinfra/mheap.c:66
#1  mheap_get_aligned (v=0x7f33199dd000, n_user_data_bytes=8, 
n_user_data_bytes@entry=5, align=<optimized out>, align@entry=4,
    align_offset=0, align_offset@entry=4, 
offset_return=offset_return@entry=0x7f331a968618)
    at /home/vpp/vpp/build-data/../src/vppinfra/mheap.c:675
#2  0x00007f335ab7b0f7 in clib_mem_alloc_aligned_at_offset 
(os_out_of_memory_on_failure=1, align_offset=4, align=4, size=5)
    at /home/vpp/vpp/build-data/../src/vppinfra/mem.h:91
#3  vec_resize_allocate_memory (v=<optimized out>, 
length_increment=length_increment@entry=1, data_bytes=5,
    header_bytes=<optimized out>, header_bytes@entry=0, 
data_align=data_align@entry=4)
    at /home/vpp/vpp/build-data/../src/vppinfra/vec.c:59
#4  0x00007f335b8a10ba in _vec_resize (data_align=<optimized out>, 
header_bytes=<optimized out>, data_bytes=<optimized out>,
    length_increment=<optimized out>, v=<optimized out>) at 
/home/vpp/vpp/build-data/../src/vppinfra/vec.h:142
#5  unix_cli_add_pending_output (uf=0x7f331ba606b4, buffer=0x7f335b8b774f "\r", 
buffer_bytes=1, cf=<optimized out>)
    at /home/vpp/vpp/build-data/../src/vlib/unix/cli.c:528
#6  0x00007f335b8a3fcd in unix_cli_file_welcome (cf=0x7f331adaf204, 
cm=<optimized out>)
    at /home/vpp/vpp/build-data/../src/vlib/unix/cli.c:1137
#7  0x00007f335ab85fd1 in timer_interrupt (signum=<optimized out>) at 
/home/vpp/vpp/build-data/../src/vppinfra/timer.c:125
#8  <signal handler called>
#9  0x00007f335ab518d9 in mheap_maybe_lock (v=0x7f33199dd000) at 
/home/vpp/vpp/build-data/../src/vppinfra/mheap.c:66
#10 mheap_get_aligned (v=0x7f33199dd000, 
n_user_data_bytes=n_user_data_bytes@entry=12, align=<optimized out>, 
align@entry=4,
    align_offset=0, align_offset@entry=4, 
offset_return=offset_return@entry=0x7f331a968e68)
    at /home/vpp/vpp/build-data/../src/vppinfra/mheap.c:675
#11 0x00007f335ab7b0f7 in clib_mem_alloc_aligned_at_offset 
(os_out_of_memory_on_failure=1, align_offset=4, align=4, size=12)
    at /home/vpp/vpp/build-data/../src/vppinfra/mem.h:91
#12 vec_resize_allocate_memory (v=v@entry=0x0, length_increment=1, 
data_bytes=12, header_bytes=<optimized out>, header_bytes@entry=0,
    data_align=data_align@entry=4) at 
/home/vpp/vpp/build-data/../src/vppinfra/vec.c:59
#13 0x00007f335b8a5eca in _vec_resize (data_align=0, header_bytes=0, 
data_bytes=<optimized out>, length_increment=<optimized out>,
    v=<optimized out>) at /home/vpp/vpp/build-data/../src/vppinfra/vec.h:142
#14 vlib_process_get_events (data_vector=<synthetic pointer>, vm=0x7f335bac42c0 
<vlib_global_main>)
    at /home/vpp/vpp/build-data/../src/vlib/node_funcs.h:562
#15 unix_cli_process (vm=0x7f335bac42c0 <vlib_global_main>, rt=0x7f331a958000, 
f=<optimized out>)
    at /home/vpp/vpp/build-data/../src/vlib/unix/cli.c:2414
#16 0x00007f335b86fd96 in vlib_process_bootstrap (_a=<optimized out>) at 
/home/vpp/vpp/build-data/../src/vlib/main.c:1231
#17 0x00007f335ab463d8 in clib_calljmp () at 
/home/vpp/vpp/build-data/../src/vppinfra/longjmp.S:110
#18 0x00007f331b9dcc20 in ?? ()
#19 0x00007f335b870f49 in vlib_process_startup (f=0x0, p=0x7f331a958000, 
vm=0x7f335bac42c0 <vlib_global_main>)
    at /home/vpp/vpp/build-data/../src/vlib/main.c:1253
#20 dispatch_process (vm=0x7f335bac42c0 <vlib_global_main>, p=0x7f331a958000, 
last_time_stamp=0, f=0x0)
    at /home/vpp/vpp/build-data/../src/vlib/main.c:1296
---Type <return> to continue, or q <return> to quit---
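
To illustrate the pattern we believe we are hitting, here is a minimal 
standalone sketch (not VPP code): a timer signal handler that allocates from 
the same lock-protected heap as the code it interrupts will spin on the lock 
forever.

  /* Minimal standalone sketch (not VPP code) of the self-deadlock above:
   * the SIGALRM handler allocates from the same lock-protected "heap" that
   * the interrupted code is in the middle of using, so it spins forever. */
  #include <pthread.h>
  #include <signal.h>
  #include <stdlib.h>
  #include <sys/time.h>

  static pthread_spinlock_t heap_lock;

  static void *
  locked_alloc (size_t n)
  {
    pthread_spin_lock (&heap_lock);   /* the nested call spins here */
    void *p = malloc (n);
    pthread_spin_unlock (&heap_lock);
    return p;
  }

  static void
  on_timer (int sig)
  {
    (void) sig;
    /* Not async-signal-safe: re-enters the allocator from a signal handler,
     * just like timer_interrupt -> unix_cli_add_pending_output above. */
    free (locked_alloc (64));
  }

  int
  main (void)
  {
    pthread_spin_init (&heap_lock, PTHREAD_PROCESS_PRIVATE);
    signal (SIGALRM, on_timer);

    struct itimerval it = { { 0, 1000 }, { 0, 1000 } }; /* fire every 1 ms */
    setitimer (ITIMER_REAL, &it, 0);

    for (;;)
      free (locked_alloc (64));   /* eventually interrupted inside the lock */
  }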

                We then modified the procedure:
                (1) Create 10K GTP tunnels
                (2) Rx-Tx on the same NUMA node using 1G hugepages and the 10K 
GTPU tunnels, with traffic on all 10K tunnels
                (3) Create another 10K GTP tunnels
                (4) Remove and re-create the last 1K GTP tunnels repeatedly, at 
10-second intervals.
The result is 4.6Mpps@64B. It looks like only the first round of GTP tunnel 
creation impacts data plane throughput.

BR/Lollita Liu




