Dear Andreas,

Single thread vs. multiple workers? Debug image? Does vm->heap_aligned_base match reality? Does (virtual address of allocated frame - vm->heap_aligned_base) / CLIB_CACHE_LINE_BYTES fit in 32 bits?

In vlib/main.c:vlib_frame_alloc_to_node(...), try replacing vlib_frame_index_no_check(vm, f) with vlib_frame_index(vm, f) in a debug image.

Again, that is the best I can do to help with next-to-no information.

D.

From: [email protected] <[email protected]> On Behalf Of Andreas Schultz
Sent: Wednesday, July 3, 2019 4:47 AM
To: Dave Barach (dbarach) <[email protected]>
Cc: Hugo Garza <[email protected]>; [email protected]
Subject: Re: [vpp-dev] SIGSEGV after calling vlib_get_frame_to_node

Hi,

I've run into the same issue with different, but also external, code. The calling sequence in my case looks very similar to Hugo's. I'm also getting an invalid pointer from vlib_get_frame_to_node. It is crashing here:
https://github.com/travelping/vpp/blob/feature/master/upf%2Btdf/src/plugins/upf/upf_pfcp_server.c#L121

@Hugo: have you found the root cause of your problem?

Regards,
Andreas

On Wed, 28 Nov 2018 at 12:53, Dave Barach via Lists.Fd.Io <[email protected]> wrote:

None of the routine names in the backtrace exist in master/latest. It's your code, so it will be challenging for the community to help you.

See if you can repro the problem with a TAG=vpp_debug image (aka "make build", not "make build-release"). If you're lucky, one of the numerous ASSERTs will catch the problem early.

vlib_get_frame_to_node(...) is not new code, it's used all over the place, and it needs "help" to fail as shown below.

D.

From: [email protected] <[email protected]> On Behalf Of Hugo Garza
Sent: Tuesday, November 27, 2018 7:39 PM
To: [email protected]
Subject: [vpp-dev] SIGSEGV after calling vlib_get_frame_to_node

Hi vpp-dev,

I'm seeing a crash when I enable our application with multiple workers.

Nov 26 14:29:32 vnet[64035]: received signal SIGSEGV, PC 0x7f6979a12ce8, faulting address 0x7fa6cd0bd444
Nov 26 14:29:32 vnet[64035]: #0 0x00007f6a812743d8 0x7f6a812743d8
Nov 26 14:29:32 vnet[64035]: #1 0x00007f6a80bc56d0 0x7f6a80bc56d0
Nov 26 14:29:32 vnet[64035]: #2 0x00007f6979a12ce8 vlib_frame_vector_args + 0x10
Nov 26 14:29:32 vnet[64035]: #3 0x00007f6979a16a2c tcpo_enqueue_to_output_i + 0xf4
Nov 26 14:29:32 vnet[64035]: #4 0x00007f6979a16b23 tcpo_enqueue_to_output + 0x25
Nov 26 14:29:32 vnet[64035]: #5 0x00007f6979a33fba send_packets + 0x7f2
Nov 26 14:29:32 vnet[64035]: #6 0x00007f6979a346f8 connection_tx + 0x17e
Nov 26 14:29:32 vnet[64035]: #7 0x00007f6979a34f08 tcpo_dispatch_node_fn + 0x7fa
Nov 26 14:29:32 vnet[64035]: #8 0x00007f6a81248cb6 vlib_worker_loop + 0x6a6
Nov 26 14:29:32 vnet[64035]: #9 0x00007f6a8094f694 0x7f6a8094f694

Running on CentOS 7.4 with kernel 3.10.0-693.el7.x86_64

VPP Version: v18.10-13~g00adcce~b60
Compiled by: root
Compile host: b0f32e97e93a
Compile date: Mon Nov 26 09:09:42 UTC 2018
Compile location: /w/workspace/vpp-merge-1810-centos7
Compiler: GCC 7.3.1 20180303 (Red Hat 7.3.1-5)
Current PID: 9612

On a Cisco server with 2-socket Intel Xeon E5-2697Av4 @ 2.60GHz and 2 Intel X520 NICs. A T-Rex traffic generator is hooked up on the other end to provide data at about 5 Gbps per NIC.

./t-rex-64 --astf -f astf/nginx_wget.py -c 14 -m 40000 -d 3000

startup.conf

    unix {
      nodaemon
      interactive
      log /opt/tcpo/logs/vpp.log
      full-coredump
      cli-no-banner
      #startup-config /opt/tcpo/conf/local.conf
      cli-listen /run/vpp/cli.sock
    }
    api-trace {
      on
    }
    heapsize 3G
    cpu {
      main-core 1
      corelist-workers 2-5
    }
    tcpo {
      runtime-config /opt/tcpo/conf/runtime.conf
      session-pool-size 1024000
    }
    dpdk {
      dev 0000:86:00.0 {
        num-rx-queues 1
      }
      dev 0000:86:00.1 {
        num-rx-queues 1
      }
      dev 0000:84:00.0 {
        num-rx-queues 1
      }
      dev 0000:84:00.1 {
        num-rx-queues 1
      }
      num-mbufs 1024000
      socket-mem 4096,4096
    }
    plugin_path /usr/lib/vpp_plugins
    api-segment {
      gid vpp
    }

Here's the function where the SIGSEGV is happening:

    static void
    enqueue_to_output_i(tcpo_worker_ctx_t * wrk, u32 bi, u8 flush)
    {
      u32 *to_next, next_index;
      vlib_frame_t *f;

      TRACE_FUNC_VAR(bi);

      next_index = tcpo_output_node.index;

      /* Get frame to output node */
      f = wrk->tx_frame;
      if (!f)
        {
          f = vlib_get_frame_to_node(wrk->vm, next_index);
          ASSERT (clib_mem_is_heap_object (f));
          wrk->tx_frame = f;
        }
      ASSERT (clib_mem_is_heap_object (f));

      to_next = vlib_frame_vector_args(f);
      to_next[f->n_vectors] = bi;
      f->n_vectors += 1;

      if (flush || f->n_vectors == VLIB_FRAME_SIZE)
        {
          TRACE_FUNC_VAR2(flush, f->n_vectors);
          vlib_put_frame_to_node(wrk->vm, next_index, f);
          wrk->tx_frame = 0;
        }
    }

I've observed that after a few Gbps of traffic go through and we call vlib_get_frame_to_node, the pointer f that gets returned points to a chunk of memory that is invalid, as confirmed by the assert statement that I added right below the call.

I'm not sure how to progress further in tracking down this issue; any help or advice would be much appreciated.

Thanks,
Hugo

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#11444): https://lists.fd.io/g/vpp-dev/message/11444
Mute This Topic: https://lists.fd.io/mt/28408842/675601
Group Owner: [email protected]
Unsubscribe: https://lists.fd.io/g/vpp-dev/unsub [[email protected]]
-=-=-=-=-=-=-=-=-=-=-=-

--
Andreas Schultz
--
Principal Engineer
t: +49 391 819099-224

------------------------------- enabling your networks -----------------------------

Travelping GmbH
Roentgenstraße 13
39108 Magdeburg
Germany

t: +49 391 819099-0
f: +49 391 819099-299
e: [email protected]
w: https://www.travelping.com/

Company registration: Amtsgericht Stendal
Reg. No.: HRB 10578
Geschaeftsfuehrer: Holger Winkelmann
VAT ID: DE236673780
-=-=-=-=-=-=-=-=-=-=-=-
View/Reply Online (#13432): https://lists.fd.io/g/vpp-dev/message/13432
-=-=-=-=-=-=-=-=-=-=-=-
