Hi Matthew and Florin,

We managed to reproduce the problem.
The most likely cause is that the pool was expanded while frames were still 
pending dequeue. When such a frame is dequeued later, returning it to the 
pool causes a seg-fault because the pool now lives at a new memory location.

We are working on a fix, which is currently in the validation stage. If 
everything goes well, we plan to upstream it by tomorrow evening.

Regards,
Fan

From: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> On Behalf Of Matthew Smith via 
lists.fd.io
Sent: Thursday, May 27, 2021 2:02 PM
To: Florin Coras <fcoras.li...@gmail.com>
Cc: vpp-dev <vpp-dev@lists.fd.io>
Subject: Re: [vpp-dev] IPsec crash with async crypto

Hi Florin!

It appears that the quic plugin is disabled in my build:

2021/05/27 07:44:49:044 notice     plugin/load    Plugin disabled (default): 
quic_plugin.so

I didn't mean to give the impression that I thought this issue was caused by 
quic. My mention of the quic commit was just intended to indicate how up to 
date my build is with the gerrit master branch in case there were 
recent/pending patches that people know of that might be relevant. That quic 
commit is from about 2 weeks ago, which is the last time I merged upstream 
changes.

Thanks,
-Matt


On Wed, May 26, 2021 at 5:58 PM Florin Coras <fcoras.li...@gmail.com> wrote:
Hi Matt,

Did you try checking whether the quic plugin is loaded, just to see if there’s 
a connection there?

Regards,
Florin

> On May 26, 2021, at 3:19 PM, Matthew Smith via lists.fd.io 
> <mgsmith=netgate....@lists.fd.io> wrote:
>
> Hi,
>
> I saw VPP crash several times during some tests that were running to evaluate 
> IPsec performance. The last upstream commit on my build of VPP is 'fd77f8c00 
> quic: remove cmake --target'. The tests ran on a C3000 with an onboard QAT. 
> The tests were repeated with the QAT removed from the device whitelist in 
> startup.conf (using async crypto with sw_scheduler) and the same thing 
> happened.
>
> The relevant part of the stack trace looks like this:
>
> #8  0x00007fdbb4006459 in os_out_of_memory () at 
> /usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vppinfra/unix-misc.c:221
> #9  0x00007fdbb400d1fb in clib_mem_alloc_aligned_at_offset 
> (size=2305843009213692256, align=8, align_offset=8, 
> os_out_of_memory_on_failure=1) at 
> /usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vppinfra/mem.h:243
> #10 vec_resize_allocate_memory (v=0x7fdb36a9b7f0, 
> length_increment=288230376151711515, data_bytes=2305843009213692256, 
> header_bytes=8, data_align=8, numa_id=255) at 
> /usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vppinfra/vec.c:111
> #11 0x00007fdbb60efe01 in _vec_resize_inline (v=0x7fdb36a9b7f0, 
> length_increment=288230376151711515, data_bytes=2305843009213692248, 
> header_bytes=0, data_align=8, numa_id=255) at 
> /usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vppinfra/vec.h:170
> #12 clib_bitmap_ori_notrim (ai=0x7fdb36a9b7f0, i=18446744073709537927) at 
> /usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vppinfra/bitmap.h:643
> #13 vnet_crypto_async_free_frame (vm=0x7fdb356f7a80, frame=0x7fdb3461c280) at 
> /usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vnet/crypto/crypto.h:585
> #14 crypto_dequeue_frame (vm=0x7fdb356f7a80, node=0x7fdb36bbd280, 
> ct=0x7fdb33537f80, hdl=0x7fdb2bc32810 <cryptodev_raw_dequeue>, n_cache=1, 
> n_total=0x7fdb145053dc) at 
> /usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vnet/crypto/node.c:135
> #15 crypto_dispatch_node_fn (vm=0x7fdb356f7a80, node=0x7fdb36bbd280, 
> frame=0x0) at 
> /usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vnet/crypto/node.c:166
> #16 0x00007fdbb4b789e5 in dispatch_node (vm=0x7fdb356f7a80, 
> node=0x7fdb36bbd280, type=VLIB_NODE_TYPE_INPUT, 
> dispatch_state=VLIB_NODE_STATE_POLLING, frame=0x0, 
> last_time_stamp=207016971809128) at 
> /usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vlib/main.c:1024
> #17 vlib_main_or_worker_loop (vm=0x7fdb356f7a80, is_main=0) at 
> /usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vlib/main.c:1618
>
> In vnet_crypto_async_free_frame() it appears that a call to pool_put() is 
> trying to return a pointer to a pool that it is not a member of:
>
> (gdb) frame 13
> #13 vnet_crypto_async_free_frame (vm=0x7fdb356f7a80, frame=0x7fdb3461c280) at 
> /usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vnet/crypto/crypto.h:585
> 585  pool_put (ct->frame_pool, frame);
> (gdb) p frame - ct->frame_pool
> $1 = -13689
>
> It seems like maybe a pointer to a vnet_crypto_async_frame_t was stored by 
> the crypto engine and, before it could be dequeued, the pool filled and had 
> to be reallocated. The per-thread frame_pools are allocated with room for 
> 1024 entries initially, and ct->frame_pool had a vector length of 1025 when 
> the crash occurred.
>
> Can anyone with knowledge of the async crypto code confirm or refute that 
> theory? Anyone have suggestions on the best way to fix this?
>
> Thanks,
> -Matt
>
>
>
>