Re: [vpp-dev] SIGSEGV after calling vlib_get_frame_to_node

Andreas Schultz Wed, 03 Jul 2019 07:24:48 -0700

On Wed, 3 July 2019 at 15:55, Dave Barach via Lists.Fd.Io
<[email protected]> wrote:


> >>> vm->heap_aligned_base matches reality?
>
> Check that clib_per_cpu_mheaps[0] (a void *, cast to mstate *) ->
> least_addr equals vm->heap_aligned_base.
>

That doesn't work out:

(gdb) print *((struct malloc_state *)clib_per_cpu_mheaps[0])
$72 = {smallmap = 1431655760, treemap = 3145721, dvsize = 1056, topsize =
65456, least_addr = 0x7f7db5c2b000 "", dv = 0x7f7db7cbabe0, top =
0x7f7db9fab000, trim_check = 2097152, release_checks = 272, magic =
139276536, smallbins = {0x0, 0x0,
    0x7f83e3d7f058, 0x7f83e3d7f058, 0x7f83e3d7f068, 0x7f83e3d7f068,
0x7f83e3d7f078, 0x7f83e3d7f078, 0x7f83e3d7f088, 0x7f83e3d7f088,
0x7f7db934c030, 0x7f8422348420, 0x7f83e3d7f0a8, 0x7f83e3d7f0a8,
0x7f84196dc1d0, 0x7f7db8fdad50, 0x7f83e3d7f0c8,
    0x7f83e3d7f0c8, 0x7f8411450130, 0x7f84239dcd70, 0x7f83e3d7f0e8,
0x7f83e3d7f0e8, 0x7f8412dee250, 0x7f84139e2d00, 0x7f83e3d7f108,
0x7f83e3d7f108, 0x7f7db77aad10, 0x7f7db927ac60, 0x7f83e3d7f128,
0x7f83e3d7f128, 0x7f7db7c3af90, 0x7f7db919af90,
    0x7f83e3d7f148, 0x7f83e3d7f148, 0x7f7db90daf30, 0x7f8410de11a0,
0x7f83e3d7f168, 0x7f83e3d7f168, 0x7f7db830aec0, 0x7f7db7d4abe0,
0x7f83e3d7f188, 0x7f83e3d7f188, 0x7f841c1ce7d0, 0x7f7db76fae20,
0x7f83e3d7f1a8, 0x7f83e3d7f1a8, 0x7f7db808af50,
    0x7f7db7d1ad20, 0x7f83e3d7f1c8, 0x7f83e3d7f1c8, 0x7f7db7d7adb0,
0x7f7db6efaf40, 0x7f83e3d7f1e8, 0x7f83e3d7f1e8, 0x7f7db825aee0,
0x7f7db844aee0, 0x7f83e3d7f208, 0x7f83e3d7f208, 0x7f8417ebcfb0,
0x7f8417ebcfb0, 0x7f83e3d7f228, 0x7f83e3d7f228,
    0x7f840fed98d0, 0x7f840fed98d0, 0x7f83e3d7f248, 0x7f83e3d7f248},
treebins = {0x7f8410e0e030, 0x0, 0x0, 0x7f840d2efb30, 0x7f7db832abe0,
0x7f8412a7a500, 0x7f7db9f2a3c0, 0x7f7db9e3a3c0, 0x7f7db9df0140,
0x7f7db9ed21c0, 0x7f7db9740140,
    0x7f7db9da3a20, 0x7f7db95fcc60, 0x7f7db9b23200, 0x7f7dba040140,
0x7f7db9f9b000, 0x7f7db9f8abe0, 0x7f7db882b000, 0x7f7db871afe0,
0x7f7db97fafe0, 0x0, 0x7f7db979a3c0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0}, footprint = 1316421632,
  max_footprint = 1316421632, footprint_limit = 0, mflags = 7, mutex = 0,
seg = {base = 0x7f7db5c2b000 "", size = 70844416, next = 0x7f7dba04afc0,
sflags = 1}, extp = 0x0, exts = 0}

(gdb) print ((struct malloc_state *)clib_per_cpu_mheaps[0])->least_addr
$68 = 0x7f7db5c2b000 ""
(gdb) print vlib_global_main.heap_aligned_base
$73 = (void *) 0x7f83e3d7f000
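
To see why the two bases matter, here is a minimal standalone sketch of the index arithmetic from the 32-bit question quoted further down in this thread. It is not VPP code: the demo_* names are hypothetical, the 64-byte cache line is the value mentioned in this thread, and it assumes a frame is stored as a 32-bit index relative to heap_aligned_base.

```c
#include <stdint.h>

#define DEMO_CACHE_LINE_BYTES 64u /* CLIB_CACHE_LINE_BYTES per this thread */

/* (frame address - base) / cache line size, truncated to 32 bits.
   Hypothetical model of the formula quoted in the thread. */
static uint32_t
demo_frame_index (uint64_t base, uint64_t frame_addr)
{
  uint64_t off = frame_addr - base; /* wraps around if frame_addr < base */
  return (uint32_t) (off / DEMO_CACHE_LINE_BYTES);
}

/* Inverse mapping: base + index * cache line size. */
static uint64_t
demo_frame_from_index (uint64_t base, uint32_t index)
{
  return base + (uint64_t) index * DEMO_CACHE_LINE_BYTES;
}

/* Returns 1 if the address survives the index round trip, 0 otherwise. */
static int
demo_frame_round_trips (uint64_t base, uint64_t frame_addr)
{
  return demo_frame_from_index (base, demo_frame_index (base, frame_addr))
	 == frame_addr;
}
```

In this model, with base = 0x7f83e3d7f000 (the heap_aligned_base printed above), an address such as 0x7f7db5c2b000 (the least_addr printed above) lies below the base, so the subtraction wraps and the truncated index does not map back to the original address. Whether that is what actually happens in the crashing image is not established here.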


>>>   vlib_buffer_alloc (vm, &bi0, 1);
>
>    b0 = vlib_get_buffer (vm, bi0);
>
>
>
> right before the crash. gdb tells me that bi0 is 94976. Isn't that a bit
> too large?
>
> b0 is optimised out, so I can't tell its value.
>
> Buffer indices vs frame indices. From what I can tell issue is in the
> frame-world, not the buffer world.
>
> Probably not related to this issue, but: you can work out b0 easily
> enough... Multiply bi0 by 64 (CLIB_CACHE_LINE_BYTES), and add
> vlib_global_main.buffer_main.buffer_mem_start. Cast the result to a
> vlib_buffer_t *, and there you go.
>

Result is sensible...
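
For reference, the arithmetic Dave describes can be written down as a small standalone sketch. The demo_* names are hypothetical and this is not the VPP implementation; it only assumes, as stated above, that a buffer index counts 64-byte cache lines from buffer_mem_start.

```c
#include <stdint.h>

#define DEMO_CACHE_LINE_BYTES 64u /* CLIB_CACHE_LINE_BYTES per this thread */

/* Byte offset of a buffer from the start of buffer memory. */
static uint64_t
demo_buffer_offset (uint32_t bi)
{
  return (uint64_t) bi * DEMO_CACHE_LINE_BYTES;
}

/* Address of the buffer: buffer_mem_start + bi * 64.
   In gdb the result would then be cast to a vlib_buffer_t *. */
static uint64_t
demo_buffer_addr (uint64_t buffer_mem_start, uint32_t bi)
{
  return buffer_mem_start + demo_buffer_offset (bi);
}
```

For bi0 = 94976 the offset is 94976 * 64 = 6078464 bytes (0x5cc000), roughly 5.8 MiB into buffer memory, so under this model the index is not unusually large.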


> You might look at
> vlib_global_main.node_main.frame_sizes[0].n_alloc_frames, to see how many
> frames have been allocated. I’d expect O(100), not O(<huge-number>).
>

Seems to be OK-ish:

(gdb) print vlib_global_main.node_main.frame_sizes[0].n_alloc_frames
$74 = 7853

Is it normal that the
vlib_global_main.node_main.frame_sizes[0].free_frame_indices vector is
empty?

(gdb) print *((vec_header_t
*)vlib_global_main.node_main.frame_sizes[0].free_frame_indices - 1)
$78 = {len = 0, dlmalloc_header_offset = 0, vector_data = 0x7f84201583bc
"4\"\367"}

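The gdb expression above steps back one header from the vector pointer to read its length. A minimal sketch of that layout, assuming a simplified header that sits directly in front of the element data; demo_vec_header_t and the demo_vec_* functions are hypothetical and only mirror the idea, not the real vec_header_t:

```c
#include <stdint.h>
#include <stdlib.h>

/* Simplified, hypothetical header stored immediately before the data. */
typedef struct
{
  uint32_t len;      /* number of elements in use */
  uint32_t reserved; /* stands in for any other header fields */
} demo_vec_header_t;

/* Allocate header plus room for n u32 elements; return the data pointer. */
static uint32_t *
demo_vec_new (uint32_t n)
{
  demo_vec_header_t *h = calloc (1, sizeof (*h) + n * sizeof (uint32_t));
  if (!h)
    return NULL;
  h->len = n;
  return (uint32_t *) (h + 1);
}

/* Length lives one header before the data; a NULL vector has length 0. */
static uint32_t
demo_vec_len (const uint32_t *v)
{
  return v ? ((const demo_vec_header_t *) v - 1)->len : 0;
}

/* Free the whole allocation by stepping back to the header. */
static void
demo_vec_free (uint32_t *v)
{
  if (v)
    free ((demo_vec_header_t *) v - 1);
}
```

In this model a length of 0 simply means the vector holds no elements at the moment.
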
I've changed vlib/main.c:vlib_frame_alloc_to_node(...) according to your
suggestion. Anything else I can look at while waiting for the next crash?

Many Thanks,
Andreas

> HTH... Dave
>
> *From:* [email protected] <[email protected]> *On Behalf Of *Andreas
> Schultz
> *Sent:* Wednesday, July 3, 2019 9:20 AM
> *To:* Dave Barach (dbarach) <[email protected]>
> *Cc:* Hugo Garza <[email protected]>; [email protected]
> *Subject:* Re: [vpp-dev] SIGSEGV after calling vlib_get_frame_to_node
>
>
>
> Hi Dave,
>
>
>
> On Wed, 3 July 2019 at 14:17, Dave Barach (dbarach) <
> [email protected]> wrote:
>
> Dear Andreas,
>
>
>
> Single thread vs. multiple workers?
>
>
>
> We have intentionally limited this to one CPU, so a concurrent process
> can't be interfering.
>
>
>
> Debug image?
>
>
>
> So far the problem has been observed only in release images under load.
> I've been unable to replicate the problem with artificial tests or on a
> debug image.
>
>
>
> vm->heap_aligned_base matches reality?
>
>
>
> Not sure what that means. How do I check that?
>
>
>
> (Virtual address of allocated frame - vm->heap_aligned_base) /
> CLIB_CACHE_LINE_BYTES fits in 32 bits?
>
>
>
> I'm doing a:
>
>
>
>    vlib_buffer_alloc (vm, &bi0, 1);
>
>    b0 = vlib_get_buffer (vm, bi0);
>
>
>
> right before the crash. gdb tells me that bi0 is 94976. Isn't that a bit
> too large?
>
> b0 is optimised out, so I can't tell its value.
>
>
>
> In vlib/main.c:vlib_frame_alloc_to_node(...) try replacing
> vlib_frame_index_no_check(vm, f) with vlib_frame_index(vm, f) in a debug
> image.
>
>
>
> Will do.
>
>
>
> Again, best I can do to help w/ next-to-no information.
>
>
>
> The problem is, I don't know what information will be useful and how to
> extract it. I have a core file and can dig into some internal structures.
> But which ones are helpful?
>
>
>
> Anyway, I'm grateful for any pointers.
>
>
>
> Regards
>
> Andreas
>
>
>
>
>
> D.
>
>
>
> *From:* [email protected] <[email protected]> *On Behalf Of *Andreas
> Schultz
> *Sent:* Wednesday, July 3, 2019 4:47 AM
> *To:* Dave Barach (dbarach) <[email protected]>
> *Cc:* Hugo Garza <[email protected]>; [email protected]
> *Subject:* Re: [vpp-dev] SIGSEGV after calling vlib_get_frame_to_node
>
>
>
> Hi,
>
>
>
> I've run into the same issue with different, but also external code.
>
>
>
> The calling sequence in my case looks very similar to the one from Hugo.
> I'm also getting an invalid pointer from vlib_get_frame_to_node.
>
> It is crashing here:
> https://github.com/travelping/vpp/blob/feature/master/upf%2Btdf/src/plugins/upf/upf_pfcp_server.c#L121
>
>
>
> @Hugo: have you found the root cause for your problem?
>
>
>
> Regards
>
> Andreas
>
>
>
> On Wed, 28 Nov 2018 at 12:53, Dave Barach via Lists.Fd.Io
> <[email protected]> wrote:
>
> None of the routine names in the backtrace exist in master/latest – it’s
> your code - so it will be challenging for the community to help you.
>
>
>
> See if you can repro the problem with a TAG=vpp_debug image (aka “make
> build” not “make build-release”). If you’re lucky, one of the numerous
> ASSERTs will catch the problem early.
>
>
>
> vlib_get_frame_to_node(...) is not new code, it’s used all over the place,
> and it needs “help” to fail as shown below.
>
>
>
> D.
>
>
>
> *From:* [email protected] <[email protected]> *On Behalf Of *Hugo
> Garza
> *Sent:* Tuesday, November 27, 2018 7:39 PM
> *To:* [email protected]
> *Subject:* [vpp-dev] SIGSEGV after calling vlib_get_frame_to_node
>
>
>
> Hi vpp-dev,
>
> I'm seeing a crash when I enable our application with multiple workers.
> Nov 26 14:29:32  vnet[64035]: received signal SIGSEGV, PC 0x7f6979a12ce8,
> faulting address 0x7fa6cd0bd444
> Nov 26 14:29:32  vnet[64035]: #0  0x00007f6a812743d8 0x7f6a812743d8
> Nov 26 14:29:32  vnet[64035]: #1  0x00007f6a80bc56d0 0x7f6a80bc56d0
> Nov 26 14:29:32  vnet[64035]: #2  0x00007f6979a12ce8
> vlib_frame_vector_args + 0x10
> Nov 26 14:29:32  vnet[64035]: #3  0x00007f6979a16a2c
> tcpo_enqueue_to_output_i + 0xf4
> Nov 26 14:29:32  vnet[64035]: #4  0x00007f6979a16b23
> tcpo_enqueue_to_output + 0x25
> Nov 26 14:29:32  vnet[64035]: #5  0x00007f6979a33fba send_packets + 0x7f2
> Nov 26 14:29:32  vnet[64035]: #6  0x00007f6979a346f8 connection_tx + 0x17e
> Nov 26 14:29:32  vnet[64035]: #7  0x00007f6979a34f08 tcpo_dispatch_node_fn
> + 0x7fa
> Nov 26 14:29:32  vnet[64035]: #8  0x00007f6a81248cb6 vlib_worker_loop +
> 0x6a6
> Nov 26 14:29:32  vnet[64035]: #9  0x00007f6a8094f694 0x7f6a8094f694
>
> Running on CentOS 7.4  with kernel 3.10.0-693.el7.x86_64
> VPP
> Version:                  v18.10-13~g00adcce~b60
> Compiled by:              root
> Compile host:             b0f32e97e93a
> Compile date:             Mon Nov 26 09:09:42 UTC 2018
> Compile location:         /w/workspace/vpp-merge-1810-centos7
> Compiler:                 GCC 7.3.1 20180303 (Red Hat 7.3.1-5)
> Current PID:              9612
>
> On a Cisco server with two sockets of Intel Xeon E5-2697Av4 @ 2.60GHz and 2
> Intel X520 NICs. A T-Rex traffic generator is hooked up on the other end to
> provide data at about 5Gbps per NIC.
> ./t-rex-64 --astf -f astf/nginx_wget.py -c 14 -m 40000 -d 3000
>
> startup.conf
> unix {
>   nodaemon
>   interactive
>   log /opt/tcpo/logs/vpp.log
>   full-coredump
>   cli-no-banner
>   #startup-config /opt/tcpo/conf/local.conf
>   cli-listen /run/vpp/cli.sock
> }
> api-trace {
>   on
> }
> heapsize 3G
> cpu {
>   main-core 1
>   corelist-workers 2-5
> }
> tcpo {
> runtime-config /opt/tcpo/conf/runtime.conf
> session-pool-size 1024000
> }
> dpdk {
>   dev 0000:86:00.0 {
>     num-rx-queues 1
>   }
>   dev 0000:86:00.1 {
>     num-rx-queues 1
>   }
>   dev 0000:84:00.0 {
>     num-rx-queues 1
>   }
>   dev 0000:84:00.1 {
>     num-rx-queues 1
>   }
>   num-mbufs 1024000
>   socket-mem 4096,4096
> }
> plugin_path /usr/lib/vpp_plugins
> api-segment {
>   gid vpp
> }
>
> Here's the function where the SIGSEGV is happening:
>
>
>
> static void enqueue_to_output_i(tcpo_worker_ctx_t * wrk, u32 bi, u8 flush) {
>     u32 *to_next, next_index;
>     vlib_frame_t *f;
>
>     TRACE_FUNC_VAR(bi);
>
>     next_index = tcpo_output_node.index;
>
>     /* Get frame to output node */
>     f = wrk->tx_frame;
>     if (!f) {
>         f = vlib_get_frame_to_node(wrk->vm, next_index);
>         ASSERT (clib_mem_is_heap_object (f));
>         wrk->tx_frame = f;
>     }
>     ASSERT (clib_mem_is_heap_object (f));
>
>     to_next = vlib_frame_vector_args(f);
>     to_next[f->n_vectors] = bi;
>     f->n_vectors += 1;
>
>     if (flush || f->n_vectors == VLIB_FRAME_SIZE) {
>         TRACE_FUNC_VAR2(flush, f->n_vectors);
>         vlib_put_frame_to_node(wrk->vm, next_index, f);
>         wrk->tx_frame = 0;
>     }
> }
>
>
>
>
> I've observed that after a few Gbps of traffic go through and we call
> *vlib_get_frame_to_node* the pointer *f* that gets returned points to a
> chunk of memory that is invalid as confirmed by the assert statement that I
> added afterwards right below.
>
> Not sure how to progress further on tracking down this issue, any help or
> advice would be much appreciated.
>
> Thanks,
> Hugo
>
> -=-=-=-=-=-=-=-=-=-=-=-
> Links: You receive all messages sent to this group.
>
> View/Reply Online (#11444): https://lists.fd.io/g/vpp-dev/message/11444
> Mute This Topic: https://lists.fd.io/mt/28408842/675601
> Group Owner: [email protected]
> Unsubscribe: https://lists.fd.io/g/vpp-dev/unsub  [
> [email protected]]
> -=-=-=-=-=-=-=-=-=-=-=-
>
>
>
>
> --
>
> Andreas Schultz
>
> --
>
> Principal Engineer
>
> t: +49 391 819099-224
>
> ------------------------------- enabling your networks
> -----------------------------
>
> Travelping GmbH
>
> Roentgenstraße 13
>
> 39108 Magdeburg
>
> Germany
>
> t: +49 391 819099-0
>
> f: +49 391 819099-299
>
> e: [email protected]
>
> w: https://www.travelping.com/
>
>
>
>
>
> Company registration: Amtsgericht Stendal
>
> Reg. No.: HRB 10578
>
> Geschaeftsfuehrer: Holger Winkelmann
>
> VAT ID: DE236673780
>
>
>
>
>
> -=-=-=-=-=-=-=-=-=-=-=-
> Links: You receive all messages sent to this group.
>
> View/Reply Online (#13434): https://lists.fd.io/g/vpp-dev/message/13434
> Mute This Topic: https://lists.fd.io/mt/28408842/675601
> Group Owner: [email protected]
> Unsubscribe: https://lists.fd.io/g/vpp-dev/unsub  [
> [email protected]]
> -=-=-=-=-=-=-=-=-=-=-=-
>


-- 

Andreas Schultz

-- 

Principal Engineer

t: +49 391 819099-224

------------------------------- enabling your networks
-----------------------------

Travelping GmbH

Roentgenstraße 13

39108 Magdeburg

Germany

t: +49 391 819099-0

f: +49 391 819099-299

e: [email protected]

w: https://www.travelping.com/

Company registration: Amtsgericht Stendal  Reg. No.: HRB 10578
Geschaeftsfuehrer: Holger Winkelmann VAT ID: DE236673780

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#13435): https://lists.fd.io/g/vpp-dev/message/13435
Mute This Topic: https://lists.fd.io/mt/28408842/21656
Group Owner: [email protected]
Unsubscribe: https://lists.fd.io/g/vpp-dev/unsub  [[email protected]]
-=-=-=-=-=-=-=-=-=-=-=-
