Hi, I saw VPP crash several times while running tests to evaluate IPsec performance. The last upstream commit on my build of VPP is 'fd77f8c00 quic: remove cmake --target'. The tests ran on a C3000 with an onboard QAT. I repeated the tests with the QAT removed from the device whitelist in startup.conf (using async crypto with sw_scheduler) and the same crash occurred.

The relevant part of the stack trace looks like this:

#8  0x00007fdbb4006459 in os_out_of_memory () at /usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vppinfra/unix-misc.c:221
#9  0x00007fdbb400d1fb in clib_mem_alloc_aligned_at_offset (size=2305843009213692256, align=8, align_offset=8, os_out_of_memory_on_failure=1) at /usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vppinfra/mem.h:243
#10 vec_resize_allocate_memory (v=0x7fdb36a9b7f0, length_increment=288230376151711515, data_bytes=2305843009213692256, header_bytes=8, data_align=8, numa_id=255) at /usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vppinfra/vec.c:111
#11 0x00007fdbb60efe01 in _vec_resize_inline (v=0x7fdb36a9b7f0, length_increment=288230376151711515, data_bytes=2305843009213692248, header_bytes=0, data_align=8, numa_id=255) at /usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vppinfra/vec.h:170
#12 clib_bitmap_ori_notrim (ai=0x7fdb36a9b7f0, i=18446744073709537927) at /usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vppinfra/bitmap.h:643
#13 vnet_crypto_async_free_frame (vm=0x7fdb356f7a80, frame=0x7fdb3461c280) at /usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vnet/crypto/crypto.h:585
#14 crypto_dequeue_frame (vm=0x7fdb356f7a80, node=0x7fdb36bbd280, ct=0x7fdb33537f80, hdl=0x7fdb2bc32810 <cryptodev_raw_dequeue>, n_cache=1, n_total=0x7fdb145053dc) at /usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vnet/crypto/node.c:135
#15 crypto_dispatch_node_fn (vm=0x7fdb356f7a80, node=0x7fdb36bbd280, frame=0x0) at /usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vnet/crypto/node.c:166
#16 0x00007fdbb4b789e5 in dispatch_node (vm=0x7fdb356f7a80, node=0x7fdb36bbd280, type=VLIB_NODE_TYPE_INPUT, dispatch_state=VLIB_NODE_STATE_POLLING, frame=0x0, last_time_stamp=207016971809128) at /usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vlib/main.c:1024
#17 vlib_main_or_worker_loop (vm=0x7fdb356f7a80, is_main=0) at /usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vlib/main.c:1618

In vnet_crypto_async_free_frame() it appears that a call to pool_put() is trying to return a pointer to a pool that it is not a member of:

(gdb) frame 13
#13 vnet_crypto_async_free_frame (vm=0x7fdb356f7a80, frame=0x7fdb3461c280) at /usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vnet/crypto/crypto.h:585
585 pool_put (ct->frame_pool, frame);
(gdb) p frame - ct->frame_pool
$1 = -13689

It seems like maybe a pointer to a vnet_crypto_async_frame_t was stored by the crypto engine and before it could be dequeued the pool filled and had to be reallocated. The per-thread frame_pool's are allocated with room for 1024 entries initially and ct->frame_pool had a vector length of 1025 when the crash occurred.

Can anyone with knowledge of the async crypto code confirm or refute that theory? Anyone have suggestions on the best way to fix this?

Thanks,
-Matt
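
To make the suspected failure mode concrete, here is a minimal standalone sketch in plain C. It is not VPP code: every demo_* name is hypothetical, and the pool is a simple contiguous array that copies itself into a new block when full, which is only an assumption about how a growing pool can move. It shows that a raw element pointer taken before growth no longer falls inside the pool afterwards, while an index that is re-resolved against the current base still does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for vnet_crypto_async_frame_t. */
typedef struct
{
  uint32_t state;
  uint8_t payload[60];
} demo_frame_t;

/* Hypothetical pool: one contiguous array that moves when it grows. */
typedef struct
{
  demo_frame_t *elts;
  demo_frame_t *retired; /* old block, kept alive so the demo never touches freed memory */
  size_t len, cap;
} demo_pool_t;

static void
demo_pool_init (demo_pool_t *p, size_t cap)
{
  p->elts = calloc (cap, sizeof (demo_frame_t));
  assert (p->elts != NULL);
  p->retired = NULL;
  p->len = 0;
  p->cap = cap;
}

/* Allocate one element and return its index; a full pool is copied to a new block. */
static size_t
demo_pool_get (demo_pool_t *p)
{
  if (p->len == p->cap)
    {
      demo_frame_t *bigger = calloc (p->cap * 2, sizeof (demo_frame_t));
      assert (bigger != NULL);
      assert (p->retired == NULL); /* this sketch supports a single growth only */
      memcpy (bigger, p->elts, p->len * sizeof (demo_frame_t));
      p->retired = p->elts;
      p->elts = bigger;
      p->cap *= 2;
    }
  return p->len++;
}

static void
demo_pool_free (demo_pool_t *p)
{
  free (p->retired);
  free (p->elts);
}

/* Membership test on integer addresses, avoiding pointer arithmetic across allocations. */
static int
demo_pool_owns (const demo_pool_t *p, const demo_frame_t *f)
{
  uintptr_t base = (uintptr_t) p->elts;
  uintptr_t addr = (uintptr_t) f;
  return addr >= base && addr < base + p->len * sizeof (demo_frame_t);
}

/* Returns 1 if a raw pointer taken before growth still belongs to the pool afterwards. */
static int
demo_stale_pointer_is_owned (size_t initial_cap)
{
  demo_pool_t pool;
  demo_pool_init (&pool, initial_cap);
  demo_frame_t *held = &pool.elts[demo_pool_get (&pool)]; /* what an engine might keep */
  for (size_t i = 1; i <= initial_cap; i++) /* the last get exceeds capacity and forces growth */
    demo_pool_get (&pool);
  int owned = demo_pool_owns (&pool, held);
  demo_pool_free (&pool);
  return owned;
}

/* Same scenario, but holding an index and resolving it after growth. */
static int
demo_index_is_owned (size_t initial_cap)
{
  demo_pool_t pool;
  demo_pool_init (&pool, initial_cap);
  size_t held = demo_pool_get (&pool);
  for (size_t i = 1; i <= initial_cap; i++)
    demo_pool_get (&pool);
  int owned = demo_pool_owns (&pool, &pool.elts[held]);
  demo_pool_free (&pool);
  return owned;
}
```

Calling demo_stale_pointer_is_owned (1024) mirrors the reported situation (1024 initial entries, length 1025 at the time of the check) and reports that the held pointer is outside the pool, whereas demo_index_is_owned (1024) reports that the re-resolved element is inside it. Whether storing indices, pre-sizing the pool, or something else is the right fix for the async crypto path is a question for people who know that code.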
