I think we see two problems here:

1.  With non-parallel forwarding the IPsec traffic stalls after a while.

Compiled with ENCDEBUG I get this message for each received ESP packet:
esp_input_cb: authentication failed for packet in SA 10.3.45.35/83089fff

I can reproduce it after roughly 30 seconds of full traffic.  At
first I suspected some rekeying event, but I cannot find any in
the logs anymore.  I added some kernel locks to the pfkey path,
but that does not help.  Restarting iked fixes the situation by
deleting the SAs and adding new ones.  I am using cryptosoft with
auth gmac-aes-128 enc aes-gcm.

The error message shows that the authentication data of the packet
does not match the value that the crypto code expects on the
receiving side.  I don't know whether the packet is corrupted when
sending or some crypto parameters are wrong on the receiving side.

Although I found some MP bugs in crypto, they do not look relevant
in our case.  Everything there should be protected by kernel lock.

2.  With parallel forwarding the kernel crashes.

> That means simultaneous ipsp_spd_lookup() execution breaks not only
> `tdb_policy_head' but the 'ipo->ipo_tdb' pointer too.

I have seen double removes from tdb_policy_head.  It looked like
_Q_INVALID in ddb.  mvs@ has fixed some bugs, but the issue remains
and Hrvoje can reproduce variations of it.

> Also I like to remind, about the logic we have in sys/net/pfkeyv2.c:
> 
>         /*
>          * XXXSMP IPsec data structures are not ready to be
>          * accessed by multiple Network threads in parallel,
>          * so force all packets to be processed by the first
>          * one.
>          */
>         extern int nettaskqs;
>         nettaskqs = 1;
> 
> It seems to be not working with parallel forwarding diff.

In general our IPsec stack is not MP safe.  But this hack should
ensure that only one thread is running in IPsec.  I am also wondering
why the workaround is not working around properly.
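
To illustrate what the hack relies on, here is a minimal userland
sketch.  It assumes the network thread is chosen by taking some
per-flow or per-interface index modulo nettaskqs.  The function and
parameter names are made up for the example; this is not the kernel
code.

```c
#include <assert.h>

/*
 * Hypothetical stand-in for the queue selection.  flow_hint is any
 * index that identifies the packet source, nqueues plays the role
 * of nettaskqs.
 */
static unsigned int
pick_net_queue(unsigned int flow_hint, unsigned int nqueues)
{
	assert(nqueues > 0);
	/* With nqueues == 1 the remainder is always 0. */
	return flow_hint % nqueues;
}
```

With nqueues set to 1 every packet ends up on queue 0.  That only
holds if each path goes through such a selection, and only for
selections made after the value has been set.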

Currently I am concentrating on 1; maybe a fix there also helps with 2.

At genua we have a diff for tdb refcounting.  tobhe@ is investigating
whether we can integrate it into OpenBSD.  But it is huge.
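
For readers not familiar with the idea, this is roughly what
refcounting means here.  It is a generic sketch with C11 atomics in
userland, not the genua diff and not the kernel's struct tdb.  All
names (sa_obj, sa_ref, sa_unref) are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical object standing in for an SA. */
struct sa_obj {
	atomic_uint	refcnt;
};

/* Allocate an object; the caller owns the initial reference. */
static struct sa_obj *
sa_alloc(void)
{
	struct sa_obj *sa = malloc(sizeof(*sa));

	if (sa != NULL)
		atomic_init(&sa->refcnt, 1);
	return sa;
}

/* Take an additional reference before storing or using the pointer. */
static struct sa_obj *
sa_ref(struct sa_obj *sa)
{
	atomic_fetch_add(&sa->refcnt, 1);
	return sa;
}

/* Drop a reference.  Returns 1 if this was the last one and sa was freed. */
static int
sa_unref(struct sa_obj *sa)
{
	unsigned int old = atomic_fetch_sub(&sa->refcnt, 1);

	assert(old > 0);
	if (old == 1) {
		free(sa);
		return 1;
	}
	return 0;
}
```

The point is that a holder of a pointer keeps the object alive until
it drops its own reference, so deleting the SA elsewhere cannot free
it underneath that holder.  It does not protect list linkage by
itself; that still needs a lock.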

bluhm
