Re: [vpp-dev] IPsec crash with async crypto

2021-06-08 Thread Matthew Smith via lists.fd.io
Hi Fan,

Thanks for working on it!

I found a separate, related issue that exacerbates the problem: async
crypto frames leak when more than one worker thread is in use. If no frame
has been allocated yet, the ESP encrypt/decrypt nodes allocate one for the
crypto operation that needs to be applied to a packet. After allocating the
frame, the node may decide that the packet should be handed off to another
thread or dropped for some other reason. When this happens and the frame
never had any packets/operations added to it, the frame is neither
submitted nor freed. This leak is what caused the pool to need to be
expanded in my test environment. I uploaded a patch to gerrit to try to
fix this: https://gerrit.fd.io/r/c/vpp/+/32596. Any feedback on it would
be appreciated.
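
The leak pattern described above can be modelled with a small plain-C
sketch. It does not use the VPP API and is not the gerrit patch: frame_t,
frame_pool_t, frame_get(), frame_put(), process_packet() and leaked_after()
are all hypothetical stand-ins, and the pool is a tiny fixed array. The
only point is that a frame allocated before the handoff/drop decision has
to be returned if nothing was ever added to it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration only; these are not VPP types. */
#define POOL_SIZE 4

typedef struct {
  int in_use;  /* 1 while the frame is allocated from the pool */
  int n_elts;  /* number of operations queued on the frame */
} frame_t;

typedef struct {
  frame_t frames[POOL_SIZE];
} frame_pool_t;

/* Take a free frame from the pool, or NULL if the pool is exhausted. */
static frame_t *frame_get(frame_pool_t *p) {
  for (size_t i = 0; i < POOL_SIZE; i++) {
    if (!p->frames[i].in_use) {
      p->frames[i].in_use = 1;
      p->frames[i].n_elts = 0;
      return &p->frames[i];
    }
  }
  return NULL;
}

/* Return a frame to the pool. */
static void frame_put(frame_t *f) {
  assert(f->in_use);
  f->in_use = 0;
}

/* Handle one packet. handoff != 0 models the packet being handed off to
 * another thread or dropped after the frame was already allocated.
 * fix_leak != 0 returns the still-empty frame instead of abandoning it. */
static void process_packet(frame_pool_t *p, int handoff, int fix_leak) {
  frame_t *f = frame_get(p);
  if (f == NULL)
    return;
  if (handoff) {
    if (fix_leak && f->n_elts == 0)
      frame_put(f);
    return;
  }
  f->n_elts++;
  /* A real engine would dequeue and free the frame later; this model
   * frees it immediately after the "submit". */
  frame_put(f);
}

/* Number of frames still marked in use after n_packets packets. */
static int leaked_after(int n_packets, int handoff, int fix_leak) {
  frame_pool_t p = {0};
  int n = 0;
  for (int i = 0; i < n_packets; i++)
    process_packet(&p, handoff, fix_leak);
  for (size_t i = 0; i < POOL_SIZE; i++)
    n += p.frames[i].in_use;
  return n;
}
```

In this model every handed-off packet strands one frame unless the empty
frame is returned, which is how a pool sized for normal traffic can end up
needing to grow.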

-Matt



Re: [vpp-dev] IPsec crash with async crypto

2021-06-08 Thread Fan Zhang
Hi Matthew and Florin,

We managed to recreate the problem.
The most likely cause is that the pool was expanded while there were still 
pending frames left to be dequeued. When such a frame is dequeued later, 
returning it to the pool causes a seg-fault because the pool is now at a new 
memory location.
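
As a rough illustration of this failure mode, here is a plain-C sketch. It 
is not VPP code and not the fix under validation: pool_t, pool_init(), 
pool_grow(), pool_contains() and demo_stale_after_grow() are hypothetical 
names. It grows a pool by moving it to a new block and then checks that an 
address saved before the growth no longer lies inside the pool, while an 
index saved at the same time still finds the element.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical element and pool types for illustration only. */
typedef struct { uint32_t id; } frame_t;

typedef struct {
  frame_t *base;  /* current storage */
  size_t cap;     /* number of elements in the storage */
} pool_t;

static int pool_init(pool_t *p, size_t cap) {
  p->base = calloc(cap, sizeof(frame_t));
  p->cap = p->base ? cap : 0;
  return p->base ? 0 : -1;
}

/* Grow by copying into a freshly allocated block, so the base address
 * always changes (realloc() might or might not move the data). */
static int pool_grow(pool_t *p, size_t new_cap) {
  frame_t *nb = calloc(new_cap, sizeof(frame_t));
  if (nb == NULL)
    return -1;
  memcpy(nb, p->base, p->cap * sizeof(frame_t));
  free(p->base);
  p->base = nb;
  p->cap = new_cap;
  return 0;
}

/* 1 if addr lies inside the pool's current storage, else 0. */
static int pool_contains(const pool_t *p, uintptr_t addr) {
  uintptr_t lo = (uintptr_t)p->base;
  uintptr_t hi = lo + p->cap * sizeof(frame_t);
  return addr >= lo && addr < hi;
}

/* Returns 1 when the address saved before growth is stale but the saved
 * index still resolves to the same element; -1 on allocation failure. */
static int demo_stale_after_grow(void) {
  pool_t p;
  if (pool_init(&p, 1024) != 0)
    return -1;
  p.base[7].id = 42;
  uintptr_t saved_addr = (uintptr_t)&p.base[7];  /* what an engine might hold */
  size_t saved_index = 7;                        /* position-independent handle */
  if (pool_grow(&p, 2048) != 0) {
    free(p.base);
    return -1;
  }
  int stale = !pool_contains(&p, saved_addr);
  int index_ok = (p.base[saved_index].id == 42);
  free(p.base);
  return stale && index_ok;
}
```

Holding an index instead of a pointer is one common way to survive a pool 
move; whether the actual fix takes that route or avoids the expansion 
altogether is not stated in this thread.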

We are working on a fix, which is currently in the validation stage. If 
everything is fine, we plan to upstream it by tomorrow evening.

Regards,
Fan


Re: [vpp-dev] IPsec crash with async crypto

2021-05-27 Thread Florin Coras
Hi Matt, 

No worries. I asked because, as luck would have it, quic does use the crypto 
infra :-)

Cheers, 
Florin



-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.
View/Reply Online (#19491): https://lists.fd.io/g/vpp-dev/message/19491
Mute This Topic: https://lists.fd.io/mt/83112898/21656
Group Owner: vpp-dev+ow...@lists.fd.io
Unsubscribe: 

Re: [vpp-dev] IPsec crash with async crypto

2021-05-27 Thread Matthew Smith via lists.fd.io
Hi Florin!

It appears that the quic plugin is disabled in my build:

2021/05/27 07:44:49:044 notice plugin/loadPlugin disabled (default): quic_plugin.so

I didn't mean to give the impression that I thought this issue was caused
by quic. My mention of the quic commit was just intended to indicate how up
to date my build is with the gerrit master branch in case there were
recent/pending patches that people know of that might be relevant. That
quic commit is from about 2 weeks ago, which is the last time I merged
upstream changes.

Thanks,
-Matt



View/Reply Online (#19490): https://lists.fd.io/g/vpp-dev/message/19490



Re: [vpp-dev] IPsec crash with async crypto

2021-05-26 Thread Florin Coras
Hi Matt, 

Did you try checking whether the quic plugin is loaded, just to see if 
there’s a connection there?

Regards,
Florin



View/Reply Online (#19480): https://lists.fd.io/g/vpp-dev/message/19480



[vpp-dev] IPsec crash with async crypto

2021-05-26 Thread Matthew Smith via lists.fd.io
Hi,

I saw VPP crash several times during some tests that were running to
evaluate IPsec performance. The last upstream commit on my build of VPP is
'fd77f8c00 quic: remove cmake --target'. The tests ran on a C3000 with an
onboard QAT. The tests were repeated with the QAT removed from the device
whitelist in startup.conf (using async crypto with sw_scheduler) and the
same thing happened.

The relevant part of the stack trace looks like this:

#8  0x7fdbb4006459 in os_out_of_memory ()
    at /usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vppinfra/unix-misc.c:221
#9  0x7fdbb400d1fb in clib_mem_alloc_aligned_at_offset (size=2305843009213692256, align=8, align_offset=8, os_out_of_memory_on_failure=1)
    at /usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vppinfra/mem.h:243
#10 vec_resize_allocate_memory (v=0x7fdb36a9b7f0, length_increment=288230376151711515, data_bytes=2305843009213692256, header_bytes=8, data_align=8, numa_id=255)
    at /usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vppinfra/vec.c:111
#11 0x7fdbb60efe01 in _vec_resize_inline (v=0x7fdb36a9b7f0, length_increment=288230376151711515, data_bytes=2305843009213692248, header_bytes=0, data_align=8, numa_id=255)
    at /usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vppinfra/vec.h:170
#12 clib_bitmap_ori_notrim (ai=0x7fdb36a9b7f0, i=18446744073709537927)
    at /usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vppinfra/bitmap.h:643
#13 vnet_crypto_async_free_frame (vm=0x7fdb356f7a80, frame=0x7fdb3461c280)
    at /usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vnet/crypto/crypto.h:585
#14 crypto_dequeue_frame (vm=0x7fdb356f7a80, node=0x7fdb36bbd280, ct=0x7fdb33537f80, hdl=0x7fdb2bc32810 , n_cache=1, n_total=0x7fdb145053dc)
    at /usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vnet/crypto/node.c:135
#15 crypto_dispatch_node_fn (vm=0x7fdb356f7a80, node=0x7fdb36bbd280, frame=0x0)
    at /usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vnet/crypto/node.c:166
#16 0x7fdbb4b789e5 in dispatch_node (vm=0x7fdb356f7a80, node=0x7fdb36bbd280, type=VLIB_NODE_TYPE_INPUT, dispatch_state=VLIB_NODE_STATE_POLLING, frame=0x0, last_time_stamp=207016971809128)
    at /usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vlib/main.c:1024
#17 vlib_main_or_worker_loop (vm=0x7fdb356f7a80, is_main=0)
    at /usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vlib/main.c:1618

In vnet_crypto_async_free_frame() it appears that a call to pool_put() is
trying to return a pointer to a pool that it is not a member of:

(gdb) frame 13
#13 vnet_crypto_async_free_frame (vm=0x7fdb356f7a80, frame=0x7fdb3461c280)
    at /usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vnet/crypto/crypto.h:585
585  pool_put (ct->frame_pool, frame);
(gdb) p frame - ct->frame_pool
$1 = -13689

It seems like a pointer to a vnet_crypto_async_frame_t may have been stored
by the crypto engine and, before it could be dequeued, the pool filled and
had to be reallocated. The per-thread frame_pools are initially allocated
with room for 1024 entries, and ct->frame_pool had a vector length of 1025
when the crash occurred.
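
The numbers in the trace line up with this theory, and the arithmetic can
be checked with a few lines of C. This is only a sketch: the helper names
offset_as_unsigned(), bitmap_words_for_bit() and bitmap_bytes_for_bit() are
made up for illustration, and the assumption that the free bitmap uses
64-bit words is inferred from data_align=8 and the sizes in frames #9 to
#11 rather than read from the vppinfra source.

```c
#include <assert.h>
#include <stdint.h>

/* The signed element offset printed by gdb, reinterpreted as the unsigned
 * 64-bit bit index that the bitmap code receives. */
static uint64_t offset_as_unsigned(int64_t off) {
  return (uint64_t)off;
}

/* Number of 64-bit words a bitmap needs so that bit i exists
 * (assumed word size, see the note above). */
static uint64_t bitmap_words_for_bit(uint64_t i) {
  return i / 64 + 1;
}

/* The same requirement expressed in bytes of bitmap data. */
static uint64_t bitmap_bytes_for_bit(uint64_t i) {
  return bitmap_words_for_bit(i) * 8;
}
```

With an offset of -13689, offset_as_unsigned() gives 18446744073709537927,
which is the i passed to clib_bitmap_ori_notrim() in frame #12. Setting
that bit needs 288230376151711531 words, or 2305843009213692248 bytes,
which is the data_bytes value in frame #11; frame #9 asks for 8 bytes more,
matching header_bytes=8 in frame #10. The length_increment of
288230376151711515 is 16 words short of the total, which would fit an
existing bitmap of 16 words (1024 bits). So the out-of-memory abort looks
like a side effect of the negative index rather than real memory
exhaustion.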

Can anyone with knowledge of the async crypto code confirm or refute that
theory? Does anyone have suggestions on the best way to fix this?

Thanks,
-Matt

View/Reply Online (#19479): https://lists.fd.io/g/vpp-dev/message/19479