Well, there you go... Allocate a frame below vm->heap_aligned_base, and all hell breaks loose, precisely the way you’re seeing.
Dlmalloc added a heap chunk *below* the point from which frame indices are calculated. The freelist should be empty: you’re creating a new frame vs. yanking one from the freelist. You may be able to avoid the issue entirely by configuring a large enough [initial] main heap from the command line. Aside from the chance that you’ll scare people by consuming large amounts of virtual space, it’s reasonably OK to do that. I suspect that you’re seeing the results of creating too many single-element frames w/ vlib_get_frame_to_node(...). 7000 frames is a lot. If possible, create full(er) frames. Vpp works an order-of-magnitude more efficiently on full frames. HTH... Dave From: [email protected] <[email protected]> On Behalf Of Andreas Schultz Sent: Wednesday, July 3, 2019 10:24 AM To: Dave Barach (dbarach) <[email protected]> Cc: [email protected] Subject: Re: [vpp-dev] SIGSEGV after calling vlib_get_frame_to_node Am Mi., 3. Juli 2019 um 15:55 Uhr schrieb Dave Barach via Lists.Fd.Io<http://Lists.Fd.Io> <[email protected]<mailto:[email protected]>>: >>> vm->heap_aligned_base matches reality? Check that clib_per_cpu_mheaps[0] (a void *, cast to mstate *) -> least_addr equals vm->heap_aligned_base. That doesn't work out: (gdb) print *((struct malloc_state *)clib_per_cpu_mheaps[0]) $72 = {smallmap = 1431655760, treemap = 3145721, dvsize = 1056, topsize = 65456, least_addr = 0x7f7db5c2b000 "", dv = 0x7f7db7cbabe0, top = 0x7f7db9fab000, trim_check = 2097152, release_checks = 272, magic = 139276536, smallbins = {0x0, 0x0, 0x7f83e3d7f058, 0x7f83e3d7f058, 0x7f83e3d7f068, 0x7f83e3d7f068, 0x7f83e3d7f078, 0x7f83e3d7f078, 0x7f83e3d7f088, 0x7f83e3d7f088, 0x7f7db934c030, 0x7f8422348420, 0x7f83e3d7f0a8, 0x7f83e3d7f0a8, 0x7f84196dc1d0, 0x7f7db8fdad50, 0x7f83e3d7f0c8, 0x7f83e3d7f0c8, 0x7f8411450130, 0x7f84239dcd70, 0x7f83e3d7f0e8, 0x7f83e3d7f0e8, 0x7f8412dee250, 0x7f84139e2d00, 0x7f83e3d7f108, 0x7f83e3d7f108, 0x7f7db77aad10, 0x7f7db927ac60, 0x7f83e3d7f128, 0x7f83e3d7f128, 0x7f7db7c3af90, 0x7f7db919af90, 0x7f83e3d7f148, 0x7f83e3d7f148, 0x7f7db90daf30, 0x7f8410de11a0, 0x7f83e3d7f168, 0x7f83e3d7f168, 0x7f7db830aec0, 0x7f7db7d4abe0, 0x7f83e3d7f188, 0x7f83e3d7f188, 0x7f841c1ce7d0, 0x7f7db76fae20, 0x7f83e3d7f1a8, 0x7f83e3d7f1a8, 0x7f7db808af50, 0x7f7db7d1ad20, 0x7f83e3d7f1c8, 0x7f83e3d7f1c8, 0x7f7db7d7adb0, 0x7f7db6efaf40, 0x7f83e3d7f1e8, 0x7f83e3d7f1e8, 0x7f7db825aee0, 0x7f7db844aee0, 0x7f83e3d7f208, 0x7f83e3d7f208, 0x7f8417ebcfb0, 0x7f8417ebcfb0, 0x7f83e3d7f228, 0x7f83e3d7f228, 0x7f840fed98d0, 0x7f840fed98d0, 0x7f83e3d7f248, 0x7f83e3d7f248}, treebins = {0x7f8410e0e030, 0x0, 0x0, 0x7f840d2efb30, 0x7f7db832abe0, 0x7f8412a7a500, 0x7f7db9f2a3c0, 0x7f7db9e3a3c0, 0x7f7db9df0140, 0x7f7db9ed21c0, 0x7f7db9740140, 0x7f7db9da3a20, 0x7f7db95fcc60, 0x7f7db9b23200, 0x7f7dba040140, 0x7f7db9f9b000, 0x7f7db9f8abe0, 0x7f7db882b000, 0x7f7db871afe0, 0x7f7db97fafe0, 0x0, 0x7f7db979a3c0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, footprint = 1316421632, max_footprint = 1316421632, footprint_limit = 0, mflags = 7, mutex = 0, seg = {base = 0x7f7db5c2b000 "", size = 70844416, next = 0x7f7dba04afc0, sflags = 1}, extp = 0x0, exts = 0} (gdb) print ((struct malloc_state *)clib_per_cpu_mheaps[0])->least_addr $68 = 0x7f7db5c2b000 "" (gdb) print vlib_global_main.heap_aligned_base $73 = (void *) 0x7f83e3d7f000 >>> vlib_buffer_alloc (vm, &bi0, 1); b0 = vlib_get_buffer (vm, bi0); right before the crash. gdb tells me that bi0 is 94976. Isn't that a bit too large? b0 is optimised out, so I can't tell its value. Buffer indices vs frame indices. From what I can tell issue is in the frame-world, not the buffer world. Probably not related to this issue, but: you can work out b0 easily enough... Multiply bi0 by 64 (CLIB_CACHE_LINE_BYTES), and add vlib_global_main.buffer_main.buffer_mem_start. Cast the result to a vlib_buffer_t *, and there you go. Result is sensible... You might look at vlib_global_main.node_main.frame_sizes[0].n_alloc_frames, to see how many frames have been allocated. I’d expect O(100), not O(<huge-number>). Seems to be okish: (gdb) print vlib_global_main.node_main.frame_sizes[0].n_alloc_frames $74 = 7853 Is it normal that the vlib_global_main.node_main.frame_sizes[0].frame_free_indices vector is empty? (gdb) print *((vec_header_t *)vlib_global_main.node_main.frame_sizes[0].free_frame_indices - 1) $78 = {len = 0, dlmalloc_header_offset = 0, vector_data = 0x7f84201583bc "4\"\367"} I've changed vlib/main.c:vlib_frame_alloc_to_node(...) according to you suggestion. Anything else I can look at while waiting for the next crash? Many Thanks, Andreas HTH... Dave From: [email protected]<mailto:[email protected]> <[email protected]<mailto:[email protected]>> On Behalf Of Andreas Schultz Sent: Wednesday, July 3, 2019 9:20 AM To: Dave Barach (dbarach) <[email protected]<mailto:[email protected]>> Cc: Hugo Garza <[email protected]<mailto:[email protected]>>; [email protected]<mailto:[email protected]> Subject: Re: [vpp-dev] SIGSEGV after calling vlib_get_frame_to_node Hi Dave, Am Mi., 3. Juli 2019 um 14:17 Uhr schrieb Dave Barach (dbarach) <[email protected]<mailto:[email protected]>>: Dear Andreas, Single thread vs. multiple workers? We have intentionally limited this to one CPU, it therefore can't be a concurrent process doing something Debug image? So far the problem has been observed only in release images under load. I've been unable to replicate the problem with artificial tests or on a debug image. vm->heap_aligned_base matches reality? Not sure what that means. How do I check that? (Virtual address of allocated frame - vm->heap_aligned_base) / CLIB_CACHE_LINE_BYTES fits in 32 bits? I'm doing a: vlib_buffer_alloc (vm, &bi0, 1); b0 = vlib_get_buffer (vm, bi0); right before the crash. gdb tells me that bi0 is 94976. Isn't that a bit too large? b0 is optimised out, so I can't tell its value. In vlib/main.c:vlib_frame_alloc_to_node(...) try replacing vlib_frame_index_no_check(vm, f) with vlib_frame_index(vm, f) in a debug image. Will do. Again, best I can do to help w/ next-to-no information. The problem is, I don't know what information will be useful and how to extract it. I have a core file and can dig into some internal structures. But which ones are helpful? Anyway, I'm grateful for any pointers. Regards Andreas D. From: [email protected]<mailto:[email protected]> <[email protected]<mailto:[email protected]>> On Behalf Of Andreas Schultz Sent: Wednesday, July 3, 2019 4:47 AM To: Dave Barach (dbarach) <[email protected]<mailto:[email protected]>> Cc: Hugo Garza <[email protected]<mailto:[email protected]>>; [email protected]<mailto:[email protected]> Subject: Re: [vpp-dev] SIGSEGV after calling vlib_get_frame_to_node Hi, I've run into the same issue with different, but also external code. The calling sequence in my case looks very similar to the one from Hugo. I'm also getting a invalid point from vlib_get_frame_to_node. It is crashing here: https://github.com/travelping/vpp/blob/feature/master/upf%2Btdf/src/plugins/upf/upf_pfcp_server.c#L121 @Hugo: have you found the root cause for your problem? Regards Andreas Am Mi., 28. Nov. 2018 um 12:53 Uhr schrieb Dave Barach via Lists.Fd.Io<http://Lists.Fd.Io> <[email protected]<mailto:[email protected]>>: None of the routine names in the backtrace exist in master/latest – it’s your code - so it will be challenging for the community to help you. See if you can repro the problem with a TAG=vpp_debug images (aka “make build” not “make build-release”). If you’re lucky, one of the numerous ASSERTs will catch the problem early. vlib_get_frame_to_node(...) is not new code, it’s used all over the place, and it needs “help” to fail as shown below. D. From: [email protected]<mailto:[email protected]> <[email protected]<mailto:[email protected]>> On Behalf Of Hugo Garza Sent: Tuesday, November 27, 2018 7:39 PM To: [email protected]<mailto:[email protected]> Subject: [vpp-dev] SIGSEGV after calling vlib_get_frame_to_node Hi vpp-dev, I'm seeing a crash when I enable our application with multiple works. Nov 26 14:29:32 vnet[64035]: received signal SIGSEGV, PC 0x7f6979a12ce8, faulting address 0x7fa6cd0bd444 Nov 26 14:29:32 vnet[64035]: #0 0x00007f6a812743d8 0x7f6a812743d8 Nov 26 14:29:32 vnet[64035]: #1 0x00007f6a80bc56d0 0x7f6a80bc56d0 Nov 26 14:29:32 vnet[64035]: #2 0x00007f6979a12ce8 vlib_frame_vector_args + 0x10 Nov 26 14:29:32 vnet[64035]: #3 0x00007f6979a16a2c tcpo_enqueue_to_output_i + 0xf4 Nov 26 14:29:32 vnet[64035]: #4 0x00007f6979a16b23 tcpo_enqueue_to_output + 0x25 Nov 26 14:29:32 vnet[64035]: #5 0x00007f6979a33fba send_packets + 0x7f2 Nov 26 14:29:32 vnet[64035]: #6 0x00007f6979a346f8 connection_tx + 0x17e Nov 26 14:29:32 vnet[64035]: #7 0x00007f6979a34f08 tcpo_dispatch_node_fn + 0x7fa Nov 26 14:29:32 vnet[64035]: #8 0x00007f6a81248cb6 vlib_worker_loop + 0x6a6 Nov 26 14:29:32 vnet[64035]: #9 0x00007f6a8094f694 0x7f6a8094f694 Running on CentOS 7.4 with kernel 3.10.0-693.el7.x86_64 VPP Version: v18.10-13~g00adcce~b60 Compiled by: root Compile host: b0f32e97e93a Compile date: Mon Nov 26 09:09:42 UTC 2018 Compile location: /w/workspace/vpp-merge-1810-centos7 Compiler: GCC 7.3.1 20180303 (Red Hat 7.3.1-5) Current PID: 9612 On a Cisco server with 2 socket Intel Xeon E5-2697Av4 @ 2.60GHz and 2 Intel X520 NICs. T-Rex traffic generator is hooked up on the other end to provided data at about 5Gbps per NIC. ./t-rex-64 --astf -f astf/nginx_wget.py -c 14 -m 40000 -d 3000 startup.conf unix { nodaemon interactive log /opt/tcpo/logs/vpp.log full-coredump cli-no-banner #startup-config /opt/tcpo/conf/local.conf cli-listen /run/vpp/cli.sock } api-trace { on } heapsize 3G cpu { main-core 1 corelist-workers 2-5 } tcpo { runtime-config /opt/tcpo/conf/runtime.conf session-pool-size 1024000 } dpdk { dev 0000:86:00.0 { num-rx-queues 1 } dev 0000:86:00.1 { num-rx-queues 1 } dev 0000:84:00.0 { num-rx-queues 1 } dev 0000:84:00.1 { num-rx-queues 1 } num-mbufs 1024000 socket-mem 4096,4096 } plugin_path /usr/lib/vpp_plugins api-segment { gid vpp } Here's the function where the SIGSEGV is happening: static void enqueue_to_output_i(tcpo_worker_ctx_t * wrk, u32 bi, u8 flush) { u32 *to_next, next_index; vlib_frame_t *f; TRACE_FUNC_VAR(bi); next_index = tcpo_output_node.index; /* Get frame to output node */ f = wrk->tx_frame; if (!f) { f = vlib_get_frame_to_node(wrk->vm, next_index); ASSERT (clib_mem_is_heap_object (f)); wrk->tx_frame = f; } ASSERT (clib_mem_is_heap_object (f)); to_next = vlib_frame_vector_args(f); to_next[f->n_vectors] = bi; f->n_vectors += 1; if (flush || f->n_vectors == VLIB_FRAME_SIZE) { TRACE_FUNC_VAR2(flush, f->n_vectors); vlib_put_frame_to_node(wrk->vm, next_index, f); wrk->tx_frame = 0; } } I've observed that after a few Gbps of traffic go through and we call vlib_get_frame_to_node the pointer f that gets returned points to a chunk of memory that is invalid as confirmed by the assert statement that I added afterwards right below. Not sure how to progress further on tracking down this issue, any help or advice would be much appreciated. Thanks, Hugo -=-=-=-=-=-=-=-=-=-=-=- Links: You receive all messages sent to this group. View/Reply Online (#11444): https://lists.fd.io/g/vpp-dev/message/11444 Mute This Topic: https://lists.fd.io/mt/28408842/675601 Group Owner: [email protected]<mailto:vpp-dev%[email protected]> Unsubscribe: https://lists.fd.io/g/vpp-dev/unsub [[email protected]<mailto:[email protected]>] -=-=-=-=-=-=-=-=-=-=-=- -- Andreas Schultz -- Principal Engineer t: +49 391 819099-224 ------------------------------- enabling your networks ----------------------------- Travelping GmbH Roentgenstraße 13 39108 Magdeburg Germany t: +49 391 819099-0 f: +49 391 819099-299 e: [email protected]<mailto:[email protected]> w: https://www.travelping.com/ Company registration: Amtsgericht Stendal Reg. No.: HRB 10578 Geschaeftsfuehrer: Holger Winkelmann VAT ID: DE236673780 -- Andreas Schultz -- Principal Engineer t: +49 391 819099-224 ------------------------------- enabling your networks ----------------------------- Travelping GmbH Roentgenstraße 13 39108 Magdeburg Germany t: +49 391 819099-0 f: +49 391 819099-299 e: [email protected]<mailto:[email protected]> w: https://www.travelping.com/ Company registration: Amtsgericht Stendal Reg. No.: HRB 10578 Geschaeftsfuehrer: Holger Winkelmann VAT ID: DE236673780 -=-=-=-=-=-=-=-=-=-=-=- Links: You receive all messages sent to this group. View/Reply Online (#13434): https://lists.fd.io/g/vpp-dev/message/13434 Mute This Topic: https://lists.fd.io/mt/28408842/675601 Group Owner: [email protected]<mailto:vpp-dev%[email protected]> Unsubscribe: https://lists.fd.io/g/vpp-dev/unsub [[email protected]<mailto:[email protected]>] -=-=-=-=-=-=-=-=-=-=-=- -- Andreas Schultz -- Principal Engineer t: +49 391 819099-224 ------------------------------- enabling your networks ----------------------------- Travelping GmbH Roentgenstraße 13 39108 Magdeburg Germany t: +49 391 819099-0 f: +49 391 819099-299 e: [email protected]<mailto:[email protected]> w: https://www.travelping.com/ Company registration: Amtsgericht Stendal Reg. No.: HRB 10578 Geschaeftsfuehrer: Holger Winkelmann VAT ID: DE236673780
-=-=-=-=-=-=-=-=-=-=-=- Links: You receive all messages sent to this group. View/Reply Online (#13437): https://lists.fd.io/g/vpp-dev/message/13437 Mute This Topic: https://lists.fd.io/mt/28408842/21656 Group Owner: [email protected] Unsubscribe: https://lists.fd.io/g/vpp-dev/unsub [[email protected]] -=-=-=-=-=-=-=-=-=-=-=-
