Re: [vpp-dev] SIGSEGV after calling vlib_get_frame_to_node

Dave Barach via Lists.Fd.Io Wed, 03 Jul 2019 08:22:13 -0700

Well, there you go... Allocate a frame below vm->heap_aligned_base, and all 
hell breaks loose, precisely the way you’re seeing.


Dlmalloc added a heap chunk *below* the point from which frame indices are 
calculated. The freelist should be empty: you’re creating a new frame vs. 
yanking one from the freelist. You may be able to avoid the issue entirely by 
configuring a large enough [initial] main heap from the command line. Aside 
from the chance that you’ll scare people by consuming large amounts of virtual 
space, it’s reasonably OK to do that.

I suspect that you’re seeing the results of creating too many single-element 
frames w/ vlib_get_frame_to_node(...). 7000 frames is a lot. If possible, 
create full(er) frames. Vpp works an order-of-magnitude more efficiently on 
full frames.

HTH... Dave

From: [email protected] <[email protected]> On Behalf Of Andreas Schultz
Sent: Wednesday, July 3, 2019 10:24 AM
To: Dave Barach (dbarach) <[email protected]>
Cc: [email protected]
Subject: Re: [vpp-dev] SIGSEGV after calling vlib_get_frame_to_node

Am Mi., 3. Juli 2019 um 15:55 Uhr schrieb Dave Barach via 
Lists.Fd.Io<http://Lists.Fd.Io> 
<[email protected]<mailto:[email protected]>>:
>>> vm->heap_aligned_base matches reality?
Check that clib_per_cpu_mheaps[0] (a void *, cast to mstate *) -> least_addr 
equals vm->heap_aligned_base.

That doesn't work out:

(gdb) print *((struct malloc_state *)clib_per_cpu_mheaps[0])
$72 = {smallmap = 1431655760, treemap = 3145721, dvsize = 1056, topsize = 
65456, least_addr = 0x7f7db5c2b000 "", dv = 0x7f7db7cbabe0, top = 
0x7f7db9fab000, trim_check = 2097152, release_checks = 272, magic = 139276536, 
smallbins = {0x0, 0x0,
    0x7f83e3d7f058, 0x7f83e3d7f058, 0x7f83e3d7f068, 0x7f83e3d7f068, 
0x7f83e3d7f078, 0x7f83e3d7f078, 0x7f83e3d7f088, 0x7f83e3d7f088, 0x7f7db934c030, 
0x7f8422348420, 0x7f83e3d7f0a8, 0x7f83e3d7f0a8, 0x7f84196dc1d0, 0x7f7db8fdad50, 
0x7f83e3d7f0c8,
    0x7f83e3d7f0c8, 0x7f8411450130, 0x7f84239dcd70, 0x7f83e3d7f0e8, 
0x7f83e3d7f0e8, 0x7f8412dee250, 0x7f84139e2d00, 0x7f83e3d7f108, 0x7f83e3d7f108, 
0x7f7db77aad10, 0x7f7db927ac60, 0x7f83e3d7f128, 0x7f83e3d7f128, 0x7f7db7c3af90, 
0x7f7db919af90,
    0x7f83e3d7f148, 0x7f83e3d7f148, 0x7f7db90daf30, 0x7f8410de11a0, 
0x7f83e3d7f168, 0x7f83e3d7f168, 0x7f7db830aec0, 0x7f7db7d4abe0, 0x7f83e3d7f188, 
0x7f83e3d7f188, 0x7f841c1ce7d0, 0x7f7db76fae20, 0x7f83e3d7f1a8, 0x7f83e3d7f1a8, 
0x7f7db808af50,
    0x7f7db7d1ad20, 0x7f83e3d7f1c8, 0x7f83e3d7f1c8, 0x7f7db7d7adb0, 
0x7f7db6efaf40, 0x7f83e3d7f1e8, 0x7f83e3d7f1e8, 0x7f7db825aee0, 0x7f7db844aee0, 
0x7f83e3d7f208, 0x7f83e3d7f208, 0x7f8417ebcfb0, 0x7f8417ebcfb0, 0x7f83e3d7f228, 
0x7f83e3d7f228,
    0x7f840fed98d0, 0x7f840fed98d0, 0x7f83e3d7f248, 0x7f83e3d7f248}, treebins = 
{0x7f8410e0e030, 0x0, 0x0, 0x7f840d2efb30, 0x7f7db832abe0, 0x7f8412a7a500, 
0x7f7db9f2a3c0, 0x7f7db9e3a3c0, 0x7f7db9df0140, 0x7f7db9ed21c0, 0x7f7db9740140,
    0x7f7db9da3a20, 0x7f7db95fcc60, 0x7f7db9b23200, 0x7f7dba040140, 
0x7f7db9f9b000, 0x7f7db9f8abe0, 0x7f7db882b000, 0x7f7db871afe0, 0x7f7db97fafe0, 
0x0, 0x7f7db979a3c0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, 
footprint = 1316421632,
  max_footprint = 1316421632, footprint_limit = 0, mflags = 7, mutex = 0, seg = 
{base = 0x7f7db5c2b000 "", size = 70844416, next = 0x7f7dba04afc0, sflags = 1}, 
extp = 0x0, exts = 0}

(gdb) print ((struct malloc_state *)clib_per_cpu_mheaps[0])->least_addr
$68 = 0x7f7db5c2b000 ""
(gdb) print vlib_global_main.heap_aligned_base
$73 = (void *) 0x7f83e3d7f000


>>>   vlib_buffer_alloc (vm, &bi0, 1);
   b0 = vlib_get_buffer (vm, bi0);

right before the crash. gdb tells me that bi0 is 94976. Isn't that a bit too 
large?
b0 is optimised out, so I can't tell its value.
Buffer indices vs frame indices. From what I can tell issue is in the 
frame-world, not the buffer world.
Probably not related to this issue, but: you can work out b0 easily enough... 
Multiply bi0 by 64 (CLIB_CACHE_LINE_BYTES), and add 
vlib_global_main.buffer_main.buffer_mem_start. Cast the result to a 
vlib_buffer_t *, and there you go.

Result is sensible...

You might look at vlib_global_main.node_main.frame_sizes[0].n_alloc_frames, to 
see how many frames have been allocated. I’d expect O(100), not 
O(<huge-number>).

Seems to be okish:

(gdb) print vlib_global_main.node_main.frame_sizes[0].n_alloc_frames
$74 = 7853

Is it normal that the 
vlib_global_main.node_main.frame_sizes[0].frame_free_indices vector is empty?

(gdb) print *((vec_header_t 
*)vlib_global_main.node_main.frame_sizes[0].free_frame_indices - 1)
$78 = {len = 0, dlmalloc_header_offset = 0, vector_data = 0x7f84201583bc 
"4\"\367"}

I've changed vlib/main.c:vlib_frame_alloc_to_node(...) according to you 
suggestion. Anything else I can look at while waiting for the next crash?

Many Thanks,
Andreas

HTH... Dave
From: [email protected]<mailto:[email protected]> 
<[email protected]<mailto:[email protected]>> On Behalf Of Andreas Schultz
Sent: Wednesday, July 3, 2019 9:20 AM
To: Dave Barach (dbarach) <[email protected]<mailto:[email protected]>>
Cc: Hugo Garza <[email protected]<mailto:[email protected]>>; 
[email protected]<mailto:[email protected]>
Subject: Re: [vpp-dev] SIGSEGV after calling vlib_get_frame_to_node

Hi Dave,

Am Mi., 3. Juli 2019 um 14:17 Uhr schrieb Dave Barach (dbarach) 
<[email protected]<mailto:[email protected]>>:
Dear Andreas,

Single thread vs. multiple workers?

We have intentionally limited this to one CPU, it therefore can't be a 
concurrent process doing something

Debug image?

So far the problem has been observed only in release images under load. I've 
been unable to replicate the problem with artificial tests or on a debug image.

vm->heap_aligned_base matches reality?

Not sure what that means. How do I check that?

(Virtual address of allocated frame - vm->heap_aligned_base) / 
CLIB_CACHE_LINE_BYTES fits in 32 bits?

I'm doing a:

   vlib_buffer_alloc (vm, &bi0, 1);
   b0 = vlib_get_buffer (vm, bi0);

right before the crash. gdb tells me that bi0 is 94976. Isn't that a bit too 
large?
b0 is optimised out, so I can't tell its value.

In vlib/main.c:vlib_frame_alloc_to_node(...) try replacing 
vlib_frame_index_no_check(vm, f) with vlib_frame_index(vm, f) in a debug image.

Will do.

Again, best I can do to help w/ next-to-no information.

The problem is, I don't know what information will be useful and how to extract 
it. I have a core file and can dig into some internal structures. But which 
ones are helpful?

Anyway, I'm grateful for any pointers.

Regards
Andreas


D.

From: [email protected]<mailto:[email protected]> 
<[email protected]<mailto:[email protected]>> On Behalf Of Andreas Schultz
Sent: Wednesday, July 3, 2019 4:47 AM
To: Dave Barach (dbarach) <[email protected]<mailto:[email protected]>>
Cc: Hugo Garza <[email protected]<mailto:[email protected]>>; 
[email protected]<mailto:[email protected]>
Subject: Re: [vpp-dev] SIGSEGV after calling vlib_get_frame_to_node

Hi,

I've run into the same issue with different, but also external code.

The calling sequence in my case looks very similar to the one from Hugo. I'm 
also getting a invalid point from vlib_get_frame_to_node.
It is crashing here: 
https://github.com/travelping/vpp/blob/feature/master/upf%2Btdf/src/plugins/upf/upf_pfcp_server.c#L121

@Hugo: have you found the root cause for your problem?

Regards
Andreas

Am Mi., 28. Nov. 2018 um 12:53 Uhr schrieb Dave Barach via 
Lists.Fd.Io<http://Lists.Fd.Io> 
<[email protected]<mailto:[email protected]>>:
None of the routine names in the backtrace exist in master/latest – it’s your 
code - so it will be challenging for the community to help you.

See if you can repro the problem with a TAG=vpp_debug images (aka “make build” 
not “make build-release”). If you’re lucky, one of the numerous ASSERTs will 
catch the problem early.

vlib_get_frame_to_node(...) is not new code, it’s used all over the place, and 
it needs “help” to fail as shown below.

D.

From: [email protected]<mailto:[email protected]> 
<[email protected]<mailto:[email protected]>> On Behalf Of Hugo Garza
Sent: Tuesday, November 27, 2018 7:39 PM
To: [email protected]<mailto:[email protected]>
Subject: [vpp-dev] SIGSEGV after calling vlib_get_frame_to_node

Hi vpp-dev,

I'm seeing a crash when I enable our application with multiple works.
Nov 26 14:29:32  vnet[64035]: received signal SIGSEGV, PC 0x7f6979a12ce8, 
faulting address 0x7fa6cd0bd444
Nov 26 14:29:32  vnet[64035]: #0  0x00007f6a812743d8 0x7f6a812743d8
Nov 26 14:29:32  vnet[64035]: #1  0x00007f6a80bc56d0 0x7f6a80bc56d0
Nov 26 14:29:32  vnet[64035]: #2  0x00007f6979a12ce8 vlib_frame_vector_args + 
0x10
Nov 26 14:29:32  vnet[64035]: #3  0x00007f6979a16a2c tcpo_enqueue_to_output_i + 
0xf4
Nov 26 14:29:32  vnet[64035]: #4  0x00007f6979a16b23 tcpo_enqueue_to_output + 
0x25
Nov 26 14:29:32  vnet[64035]: #5  0x00007f6979a33fba send_packets + 0x7f2
Nov 26 14:29:32  vnet[64035]: #6  0x00007f6979a346f8 connection_tx + 0x17e
Nov 26 14:29:32  vnet[64035]: #7  0x00007f6979a34f08 tcpo_dispatch_node_fn + 
0x7fa
Nov 26 14:29:32  vnet[64035]: #8  0x00007f6a81248cb6 vlib_worker_loop + 0x6a6
Nov 26 14:29:32  vnet[64035]: #9  0x00007f6a8094f694 0x7f6a8094f694

Running on CentOS 7.4  with kernel 3.10.0-693.el7.x86_64
VPP
Version:                  v18.10-13~g00adcce~b60
Compiled by:              root
Compile host:             b0f32e97e93a
Compile date:             Mon Nov 26 09:09:42 UTC 2018
Compile location:         /w/workspace/vpp-merge-1810-centos7
Compiler:                 GCC 7.3.1 20180303 (Red Hat 7.3.1-5)
Current PID:              9612

On a Cisco server with 2 socket Intel Xeon E5-2697Av4 @ 2.60GHz and 2 Intel 
X520 NICs. T-Rex traffic generator is hooked up on the other end to provided 
data at about 5Gbps per NIC.
./t-rex-64 --astf -f astf/nginx_wget.py -c 14 -m 40000 -d 3000

startup.conf
unix {
  nodaemon
  interactive
  log /opt/tcpo/logs/vpp.log
  full-coredump
  cli-no-banner
  #startup-config /opt/tcpo/conf/local.conf
  cli-listen /run/vpp/cli.sock
}
api-trace {
  on
}
heapsize 3G
cpu {
  main-core 1
  corelist-workers 2-5
}
tcpo {
runtime-config /opt/tcpo/conf/runtime.conf
session-pool-size 1024000
}
dpdk {
  dev 0000:86:00.0 {
    num-rx-queues 1
  }
  dev 0000:86:00.1 {
    num-rx-queues 1
  }
  dev 0000:84:00.0 {
    num-rx-queues 1
  }
  dev 0000:84:00.1 {
    num-rx-queues 1
  }
  num-mbufs 1024000
  socket-mem 4096,4096
}
plugin_path /usr/lib/vpp_plugins
api-segment {
  gid vpp
}

Here's the function where the SIGSEGV is happening:

static void enqueue_to_output_i(tcpo_worker_ctx_t * wrk, u32 bi, u8 flush) {

    u32 *to_next, next_index;

    vlib_frame_t *f;


    TRACE_FUNC_VAR(bi);


    next_index = tcpo_output_node.index;


    /* Get frame to output node */

    f = wrk->tx_frame;

    if (!f) {

        f = vlib_get_frame_to_node(wrk->vm, next_index);

        ASSERT (clib_mem_is_heap_object (f));

        wrk->tx_frame = f;

    }

    ASSERT (clib_mem_is_heap_object (f));


    to_next = vlib_frame_vector_args(f);

    to_next[f->n_vectors] = bi;

    f->n_vectors += 1;


    if (flush || f->n_vectors == VLIB_FRAME_SIZE) {

        TRACE_FUNC_VAR2(flush, f->n_vectors);

        vlib_put_frame_to_node(wrk->vm, next_index, f);

        wrk->tx_frame = 0;

    }

}


I've observed that after a few Gbps of traffic go through and we call 
vlib_get_frame_to_node the pointer f that gets returned points to a chunk of 
memory that is invalid as confirmed by the assert statement that I added 
afterwards right below.

Not sure how to progress further on tracking down this issue, any help or 
advice would be much appreciated.

Thanks,
Hugo
-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#11444): https://lists.fd.io/g/vpp-dev/message/11444
Mute This Topic: https://lists.fd.io/mt/28408842/675601
Group Owner: [email protected]<mailto:vpp-dev%[email protected]>
Unsubscribe: https://lists.fd.io/g/vpp-dev/unsub  
[[email protected]<mailto:[email protected]>]
-=-=-=-=-=-=-=-=-=-=-=-


--

Andreas Schultz

--

Principal Engineer

t: +49 391 819099-224

------------------------------- enabling your networks 
-----------------------------

Travelping GmbH

Roentgenstraße 13

39108 Magdeburg

Germany

t: +49 391 819099-0

f: +49 391 819099-299

e: [email protected]<mailto:[email protected]>

w: https://www.travelping.com/


Company registration: Amtsgericht Stendal
Reg. No.: HRB 10578
Geschaeftsfuehrer: Holger Winkelmann
VAT ID: DE236673780




--

Andreas Schultz

--

Principal Engineer

t: +49 391 819099-224

------------------------------- enabling your networks 
-----------------------------

Travelping GmbH

Roentgenstraße 13

39108 Magdeburg

Germany

t: +49 391 819099-0

f: +49 391 819099-299

e: [email protected]<mailto:[email protected]>

w: https://www.travelping.com/


Company registration: Amtsgericht Stendal
Reg. No.: HRB 10578
Geschaeftsfuehrer: Holger Winkelmann
VAT ID: DE236673780


-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#13434): https://lists.fd.io/g/vpp-dev/message/13434
Mute This Topic: https://lists.fd.io/mt/28408842/675601
Group Owner: [email protected]<mailto:vpp-dev%[email protected]>
Unsubscribe: https://lists.fd.io/g/vpp-dev/unsub  
[[email protected]<mailto:[email protected]>]
-=-=-=-=-=-=-=-=-=-=-=-


--

Andreas Schultz

--

Principal Engineer

t: +49 391 819099-224

------------------------------- enabling your networks 
-----------------------------

Travelping GmbH

Roentgenstraße 13

39108 Magdeburg

Germany

t: +49 391 819099-0

f: +49 391 819099-299

e: [email protected]<mailto:[email protected]>

w: https://www.travelping.com/


Company registration: Amtsgericht Stendal
Reg. No.: HRB 10578
Geschaeftsfuehrer: Holger Winkelmann
VAT ID: DE236673780

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#13437): https://lists.fd.io/g/vpp-dev/message/13437
Mute This Topic: https://lists.fd.io/mt/28408842/21656
Group Owner: [email protected]
Unsubscribe: https://lists.fd.io/g/vpp-dev/unsub  [[email protected]]
-=-=-=-=-=-=-=-=-=-=-=-

Re: [vpp-dev] SIGSEGV after calling vlib_get_frame_to_node

Reply via email to