Hi,

It seems the issue occurs when disconnects are called, because our
network cannot guarantee that a TCP connection won't be reset even after
the 3-way handshake has completed (a firewall issue :( ).

When we detect an app-layer timeout, we first disconnect (we record the
session handle, and this session might be a half-open session). Does the
vnet session layer guarantee that if we reconnect from the main thread
while the half-open session has not yet been released (due to the
asynchronous logic), the reconnect will fail? If so, we can simply retry
the connect later.
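
To make the question concrete, the flow I have in mind in our ctrl process
is roughly the sketch below (names like my_app_index, my_session_handle and
the retry handling are our own bookkeeping, not something the session layer
provides; error handling omitted):

#include <vnet/session/session.h>
#include <vnet/session/application_interface.h>

static u32 my_app_index;                   /* from our app attach */
static session_handle_t my_session_handle; /* recorded when we connect */

static void
my_conn_timeout (vnet_connect_args_t *ca /* pre-filled fixed ip/port */)
{
  /* First tear down whatever we have; it may still be half-open. */
  vnet_disconnect_args_t da = { .handle = my_session_handle,
                                .app_index = my_app_index };
  vnet_disconnect_session (&da);

  /* Then reconnect. If the old half-open still owns the 5-tuple (cleanup
   * is asynchronous), the question is whether this connect is guaranteed
   * to fail, in which case we would simply retry later. */
  ca->app_index = my_app_index;
  if (vnet_connect (ca))
    {
      /* schedule a retry from the ctrl process instead of giving up */
    }
}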

I prefer not to register the half-open callback because I think it makes
the app more complicated from a TCP programming perspective.
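
Just so it is clear what I would be declining: my understanding is that it
would look roughly like the sketch below. The callback member name is my
assumption from reading the code and may differ across vpp versions.

#include <vnet/session/application.h>

/* Sketch only: registering the half-open cleanup notification in a builtin
 * app's callback table. half_open_cleanup_callback is an assumed member
 * name; the other callbacks stay as we have them today. */
static void
my_half_open_cleanup (session_t *s)
{
  /* Only at this point is the half-open really gone, so only now would it
   * be safe to reconnect reusing the same fixed source port. */
}

static session_cb_vft_t my_session_cbs = {
  /* ... our existing accept/connected/disconnect callbacks ... */
  .half_open_cleanup_callback = my_half_open_cleanup, /* assumed name */
};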

As for your patch, I think it should work: since a worker is configured, I
can't delete the half-open session immediately, so the half-open entry will
be removed from the bihash when the SYN retransmission times out. I have
merged the patch and will provide feedback later.

Florin Coras <fcoras.li...@gmail.com> 于2023年3月20日周一 13:09写道:

> Hi,
>
> Inline.
>
> On Mar 19, 2023, at 6:47 PM, Zhang Dongya <fortitude.zh...@gmail.com>
> wrote:
>
> Hi,
>
> It can be aborted both in established state or half open state because I
> will do timeout in our app layer.
>
>
> [fc] Okay! Is the issue present irrespective of the state of the session
> or does it happen only after a disconnect in half-open state? More below.
>
>
> Regarding your question,
>
> - Yes, we added a builtin app that relies on the C APIs, mainly using
> vnet_connect/disconnect to connect or disconnect sessions.
>
>
> [fc] Understood
>
> - We call these APIs in a vpp ctrl process, which should be running on the
> main thread; we never do session setup/teardown on a worker thread. (The
> environment where this issue was found is configured with 1 main + 1 worker
> setup.)
>
>
> [fc] With vpp latest it's possible to connect from the first worker. It's an
> optimization meant to avoid 1) the worker barrier on syns and 2) entering poll
> mode on main (consuming less cpu)
>
> - We started to develop the app on 22.06 and I keep merging upstream
> changes into the latest vpp by cherry-picking. The reason for the line
> mismatch is that I added some comments to the session layer code; it should
> be equivalent to the master branch now.
>
>
> [fc] Ack
>
>
> Reading the code, I understand that we mainly want to clean up the half
> open from the bihash in session_stream_connect_notify. However, if I close
> the session while it is in syn-sent state (my app closes it on a session
> setup timeout, on the order of seconds), the session will be marked as
> half_open_done and the half-open session will be freed shortly afterwards
> on the ctrl thread (the 1st worker?).
>
>
> [fc] Actually, this might be the issue. We did start to provide a
> half-open session handle to apps which, if closed, does clean up the session,
> but apparently it is missing the cleanup of the session lookup table. Could
> you try this patch [1]? It might need additional work.
>
> Having said that, forcing a close/cleanup will not free the port
> synchronously. So, if you’re using fixed ports, you’ll have to wait for the
> half-open cleanup notification.
>
>
> Should I also register the half-open callback, or is there some other
> reason that leads to this failure?
>
>
> [fc] Yes, see above.
>
> Regards,
> Florin
>
> [1] https://gerrit.fd.io/r/c/vpp/+/38526
>
>
> Florin Coras <fcoras.li...@gmail.com> 于2023年3月20日周一 06:22写道:
>
>> Hi,
>>
>> When you abort the connection, is it fully established or half-open?
>> Half-opens are cleaned up by the owner thread after a timeout, but the
>> 5-tuple should be assigned to the fully established session by that point.
>> tcp_half_open_connection_cleanup does not clean up the bihash; instead,
>> session_stream_connect_notify does once tcp connect returns either success
>> or failure.
>>
>> So a few questions:
>> - is it accurate to assume you have a builtin vpp app and rely only on C
>> apis to interact with host stack?
>> - on what thread (main or first worker) do you call vnet_connect?
>> - what api do you use to close the session?
>> - what version of vpp is this because lines don’t match vpp latest?
>>
>> Regards,
>> Florin
>>
>> > On Mar 19, 2023, at 2:08 AM, Zhang Dongya <fortitude.zh...@gmail.com>
>> wrote:
>> >
>> > Hi list,
>> >
>> > recently in our application, we have constantly been hitting the abort
>> issue below, which interrupts our connectivity for a while:
>> >
>> > Mar 19 16:11:26 ubuntu vnet[2565933]: received signal SIGABRT, PC
>> 0x7fefd3b2000b
>> > Mar 19 16:11:26 ubuntu vnet[2565933]:
>> /home/fortitude/glx/vpp/src/vnet/tcp/tcp_input.c:3004 (tcp46_input_inline)
>> assertion `tcp_lookup_is_valid (tc0, b[0], tcp_buffer_hdr (b[0]))' fails
>> >
>> > Our scenario is quite simple: we make 4 parallel tcp connections (using
>> 4 fixed source ports) to a remote vpp stack (fixed ip and port) and do some
>> keepalive in our application layer. Since we only use the vpp tcp stack to
>> keep the middle boxes happy with the connection, we do not actually use the
>> data transport of the tcp stack.
>> >
>> > However, since the network conditions are complex, we always end up
>> needing to abort the connection and reconnect.
>> >
>> > I keep merging upstream session and tcp fixes, but the issue is still
>> not fixed. What I have found so far is that in some cases
>> tcp_half_open_connection_cleanup may not delete the half-open session from
>> the lookup table (bihash) while the session index gets reallocated to
>> another connection.
>> >
>> > I hope the list can provide some hints on how to overcome this issue,
>> thanks a lot.
>> >
>> >
>> >
>>
>>
>>
>>
>>
>
>
> 
>
>