Re: tcp hang when socket fills up ?
On Wed, 18 Apr 2018, Dominique Martinet wrote:

> Jozsef Kadlecsik wrote on Wed, Apr 18, 2018:
> > Thanks for the testing! One more line is required, however: we have to
> > get the assured bit set for the connection, see the new patch below.
>
> I think it actually was better before. If I understand things correctly,
> at this point (when we get in the case TCP_CONNTRACK_SYN_RECV) we will
> have seen SYN(out) SYN(in) SYNACK(out), but not the final ACK(in) yet.
>
> Leaving old state as it was will not set the assured bit, but that will
> be set on the next packet, because old_state == new_state == established
> at that point and the connection will really be set up then.

Yes, you are right: the first patch is better than the second one.
Overthinking :-)

Best regards,
Jozsef
-
E-mail  : kad...@blackhole.kfki.hu, kadlecsik.joz...@wigner.mta.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : Wigner Research Centre for Physics, Hungarian Academy of Sciences
          H-1525 Budapest 114, POB. 49, Hungary
Re: tcp hang when socket fills up ?
Jozsef Kadlecsik wrote on Wed, Apr 18, 2018:
> Thanks for the testing! One more line is required, however: we have to
> get the assured bit set for the connection, see the new patch below.

I think it actually was better before. If I understand things correctly,
at this point (when we get in the case TCP_CONNTRACK_SYN_RECV) we will
have seen SYN(out) SYN(in) SYNACK(out), but not the final ACK(in) yet.

Leaving old state as it was will not set the assured bit, but that will
be set on the next packet, because old_state == new_state == established
at that point and the connection will really be set up then.

I don't think anything will blow up either way, but strictly speaking
I'm more comfortable with the former. I'll test the new patch
regardless; I left work so can't reproduce any more, but will yell
tomorrow if it does explode ;)

> The tcp_conntracks state table could be fixed with introducing a new
> state, but that part is exposed to userspace (ctnetlink) and ugly
> compatibility code would be required for backward compatibility.

I agree a new state is more work than it is worth, I'm happy to leave
it as is.
--
Dominique Martinet | Asmadeus
Re: tcp hang when socket fills up ?
On Wed, 18 Apr 2018, Dominique Martinet wrote:

> Dominique Martinet wrote on Wed, Apr 18, 2018:
> > Jozsef Kadlecsik wrote on Wed, Apr 18, 2018:
> > > Yes, the state transition is wrong for simultaneous open, because the
> > > tcp_conntracks table is not (cannot be) smart enough. Could you
> > > verify the next untested patch?
> >
> > Thanks for the patch; I'll give it a try (probably won't make it today
> > so will report tomorrow)
>
> Actually had time; I can confirm (added printks) we did get in the if
> that was pointed at, and we no longer get there now. The connection no
> longer gets into an invalid state, so that looks like it nailed it.
>
> I'm now confused what this has to do with tcp_timestamp though, since
> setting that off also seemed to work around the issue, but if we get
> something like that in I'll be happy anyway.

Thanks for the testing! One more line is required, however: we have to
get the assured bit set for the connection, see the new patch below.

The tcp_conntracks state table could be fixed by introducing a new
state, but that part is exposed to userspace (ctnetlink) and ugly
compatibility code would be required for backward compatibility.
diff --git a/include/uapi/linux/netfilter/nf_conntrack_tcp.h b/include/uapi/linux/netfilter/nf_conntrack_tcp.h
index 74b9115..bcba72d 100644
--- a/include/uapi/linux/netfilter/nf_conntrack_tcp.h
+++ b/include/uapi/linux/netfilter/nf_conntrack_tcp.h
@@ -46,6 +46,9 @@ enum tcp_conntrack {
 /* Marks possibility for expected RFC5961 challenge ACK */
 #define IP_CT_EXP_CHALLENGE_ACK		0x40
 
+/* Simultaneous open initialized */
+#define IP_CT_TCP_SIMULTANEOUS_OPEN	0x80
+
 struct nf_ct_tcp_flags {
 	__u8 flags;
 	__u8 mask;
diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c
index e97cdc1..2c1fc7e 100644
--- a/net/netfilter/nf_conntrack_proto_tcp.c
+++ b/net/netfilter/nf_conntrack_proto_tcp.c
@@ -981,6 +981,20 @@ static int tcp_packet(struct nf_conn *ct,
 			return NF_ACCEPT; /* Don't change state */
 		}
 		break;
+	case TCP_CONNTRACK_SYN_SENT2:
+		/* tcp_conntracks table is not smart enough to handle
+		 * simultaneous open.
+		 */
+		ct->proto.tcp.last_flags |= IP_CT_TCP_SIMULTANEOUS_OPEN;
+		break;
+	case TCP_CONNTRACK_SYN_RECV:
+		if (dir == IP_CT_DIR_REPLY && index == TCP_ACK_SET &&
+		    ct->proto.tcp.last_flags & IP_CT_TCP_SIMULTANEOUS_OPEN) {
+			/* We want to set the assured bit */
+			old_state = TCP_CONNTRACK_SYN_RECV;
+			new_state = TCP_CONNTRACK_ESTABLISHED;
+		}
+		break;
 	case TCP_CONNTRACK_CLOSE:
 		if (index == TCP_RST_SET
 		    && (ct->proto.tcp.seen[!dir].flags & IP_CT_TCP_FLAG_MAXACK_SET)

Best regards,
Jozsef
-
E-mail  : kad...@blackhole.kfki.hu, kadlecsik.joz...@wigner.mta.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : Wigner Research Centre for Physics, Hungarian Academy of Sciences
          H-1525 Budapest 114, POB. 49, Hungary
Re: tcp hang when socket fills up ?
Dominique Martinet wrote on Wed, Apr 18, 2018:
> Jozsef Kadlecsik wrote on Wed, Apr 18, 2018:
> > Yes, the state transition is wrong for simultaneous open, because the
> > tcp_conntracks table is not (cannot be) smart enough. Could you verify
> > the next untested patch?
>
> Thanks for the patch; I'll give it a try (probably won't make it today
> so will report tomorrow)

Actually had time; I can confirm (added printks) we did get in the if
that was pointed at, and we no longer get there now. The connection no
longer gets into an invalid state, so that looks like it nailed it.

I'm now confused what this has to do with tcp_timestamp though, since
setting that off also seemed to work around the issue, but if we get
something like that in I'll be happy anyway.

Big props to everyone involved, I would have taken much longer alone,
--
Dominique Martinet | Asmadeus
Re: tcp hang when socket fills up ?
Jozsef Kadlecsik wrote on Wed, Apr 18, 2018:
> Yes, the state transition is wrong for simultaneous open, because the
> tcp_conntracks table is not (cannot be) smart enough. Could you verify
> the next untested patch?

Thanks for the patch; I'll give it a try (probably won't make it today
so will report tomorrow)
--
Dominique Martinet | Asmadeus
Re: tcp hang when socket fills up ?
Hi,

On Tue, 17 Apr 2018, Florian Westphal wrote:

> Dominique Martinet wrote:
>
> [ CC Jozsef ]
>
> > Could it have something to do with the way I set up the connection?
> > I don't think the "both remotes call connect() with carefully selected
> > source/dest port" is a very common case..
> >
> > If you look at the tcpdump outputs I attached, the sequence usually is
> > something like
> > server > client SYN
> > client > server SYN
> > server > client SYNACK
> > client > server ACK
> >
> > Ultimately it IS a connection, but with an extra SYN packet in front of
> > it (that first SYN opens up the conntrack entry on the NAT so that the
> > client's SYN can come in; the client's conntrack will be that of a
> > normal connection since its first SYN goes out directly after the
> > server's (it didn't see the server's SYN)).
> >
> > Looking at my logs again, I'm seeing the same as you:
> >
> > This looks like the actual SYN/SYN/SYNACK/ACK:
> > - 14.364090 seq=505004283 likely SYN coming out of server
> > - 14.661731 seq=1913287797 on the next line it says receiver
> >   end=505004284 so likely the matching SYN from client
> > Which this time gets a proper SYNACK from server:
> > 14.662020 seq=505004283 ack=1913287798
> > And the following final dataless ACK:
> > 14.687570 seq=1913287798 ack=505004284
> >
> > Then as you point out some data ACK, where the scale poofs:
> > 14.688762 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 end=1913287819
> > 14.688793 tcp_in_window: sender end=1913287798 maxend=1913316998 maxwin=29312 scale=7 receiver end=505004284 maxend=505033596 maxwin=29200 scale=7
> > 14.688824 tcp_in_window:
> > 14.688852 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 end=1913287819
> > 14.62 tcp_in_window: sender end=1913287819 maxend=1913287819 maxwin=229 scale=0 receiver end=505004284 maxend=505033596 maxwin=29200 scale=7
> >
> > As you say, only tcp_options() will clear only one side of the scales.
> >
> > We don't have sender->td_maxwin == 0 (printed) so I see no other way
> > than we are in the last else if:
> > - we have after(end, sender->td_end) (end=1913287819 > sender end=1913287798)
> > - I assume the tcp state machine must be confused because of the
> >   SYN/SYN/SYNACK/ACK pattern and we probably enter the next check,
> >   but since this is a data packet it doesn't have the tcp option for
> >   scale, thus scale resets.
>
> Yes, this looks correct. Jozsef, can you please have a look?
>
> Problem seems to be that conntrack believes that ACK packet
> re-initializes the connection:
>
> 595                 /*
> 596                  * RFC 793: "if a TCP is reinitialized ... then it need
> 597                  * not wait at all; it must only be sure to use sequence
> 598                  * numbers larger than those recently used."
> 599                  */
> 600                 sender->td_end =
> 601                 sender->td_maxend = end;
> 602                 sender->td_maxwin = (win == 0 ? 1 : win);
> 603
> 604                 tcp_options(skb, dataoff, tcph, sender);
>
> and the last line clears the scale value (no wscale option in a data
> packet).
>
> Transitions are:
> server > client SYN     sNO -> sSS
> client > server SYN     sSS -> sS2
> server > client SYNACK  sS2 -> sSR  /* here */
> client > server ACK     sSR -> sES
>
> SYN/ACK was observed in the original direction, so we hit the
> state->state == TCP_CONNTRACK_SYN_RECV && dir == IP_CT_DIR_REPLY test
> when we see the ack packet and end up in the 'TCP is reinitialized'
> branch.
>
> AFAICS, without this, the connection would move to sES just fine,
> as the data ack is in window.

Yes, the state transition is wrong for simultaneous open, because the
tcp_conntracks table is not (cannot be) smart enough. Could you verify
the next untested patch?
diff --git a/include/uapi/linux/netfilter/nf_conntrack_tcp.h b/include/uapi/linux/netfilter/nf_conntrack_tcp.h
index 74b9115..bcba72d 100644
--- a/include/uapi/linux/netfilter/nf_conntrack_tcp.h
+++ b/include/uapi/linux/netfilter/nf_conntrack_tcp.h
@@ -46,6 +46,9 @@ enum tcp_conntrack {
 /* Marks possibility for expected RFC5961 challenge ACK */
 #define IP_CT_EXP_CHALLENGE_ACK		0x40
 
+/* Simultaneous open initialized */
+#define IP_CT_TCP_SIMULTANEOUS_OPEN	0x80
+
 struct nf_ct_tcp_flags {
 	__u8 flags;
 	__u8 mask;
diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c
index e97cdc1..8e67910 100644
--- a/net/netfilter/nf_conntrack_proto_tcp.c
+++ b/net/netfilter/nf_conntrack_proto_tcp.c
@@ -981,6 +981,17 @@ static int tcp_packet(struct nf_conn *ct,
 			return NF_ACCEPT; /* Don't change state */
 		}
 		break;
+	case TCP_CONNTRACK_SYN_SENT2:
+		/* tcp_conntracks table is not smart enough to handle
+		 * simultaneous open.
+		 */
+		ct->proto
Re: tcp hang when socket fills up ?
Dominique Martinet wrote:

[ CC Jozsef ]

> Could it have something to do with the way I set up the connection?
> I don't think the "both remotes call connect() with carefully selected
> source/dest port" is a very common case..
>
> If you look at the tcpdump outputs I attached, the sequence usually is
> something like
> server > client SYN
> client > server SYN
> server > client SYNACK
> client > server ACK
>
> Ultimately it IS a connection, but with an extra SYN packet in front of
> it (that first SYN opens up the conntrack entry on the NAT so that the
> client's SYN can come in; the client's conntrack will be that of a
> normal connection since its first SYN goes out directly after the
> server's (it didn't see the server's SYN)).
>
> Looking at my logs again, I'm seeing the same as you:
>
> This looks like the actual SYN/SYN/SYNACK/ACK:
> - 14.364090 seq=505004283 likely SYN coming out of server
> - 14.661731 seq=1913287797 on the next line it says receiver
>   end=505004284 so likely the matching SYN from client
> Which this time gets a proper SYNACK from server:
> 14.662020 seq=505004283 ack=1913287798
> And the following final dataless ACK:
> 14.687570 seq=1913287798 ack=505004284
>
> Then as you point out some data ACK, where the scale poofs:
> 14.688762 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 end=1913287819
> 14.688793 tcp_in_window: sender end=1913287798 maxend=1913316998 maxwin=29312 scale=7 receiver end=505004284 maxend=505033596 maxwin=29200 scale=7
> 14.688824 tcp_in_window:
> 14.688852 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 end=1913287819
> 14.62 tcp_in_window: sender end=1913287819 maxend=1913287819 maxwin=229 scale=0 receiver end=505004284 maxend=505033596 maxwin=29200 scale=7
>
> As you say, only tcp_options() will clear only one side of the scales.
>
> We don't have sender->td_maxwin == 0 (printed) so I see no other way
> than we are in the last else if:
> - we have after(end, sender->td_end) (end=1913287819 > sender end=1913287798)
> - I assume the tcp state machine must be confused because of the
>   SYN/SYN/SYNACK/ACK pattern and we probably enter the next check,
>   but since this is a data packet it doesn't have the tcp option for
>   scale, thus scale resets.

Yes, this looks correct. Jozsef, can you please have a look?

Problem seems to be that conntrack believes that ACK packet
re-initializes the connection:

595                 /*
596                  * RFC 793: "if a TCP is reinitialized ... then it need
597                  * not wait at all; it must only be sure to use sequence
598                  * numbers larger than those recently used."
599                  */
600                 sender->td_end =
601                 sender->td_maxend = end;
602                 sender->td_maxwin = (win == 0 ? 1 : win);
603
604                 tcp_options(skb, dataoff, tcph, sender);

and the last line clears the scale value (no wscale option in a data
packet).

Transitions are:
server > client SYN     sNO -> sSS
client > server SYN     sSS -> sS2
server > client SYNACK  sS2 -> sSR  /* here */
client > server ACK     sSR -> sES

SYN/ACK was observed in the original direction, so we hit the
state->state == TCP_CONNTRACK_SYN_RECV && dir == IP_CT_DIR_REPLY test
when we see the ack packet and end up in the 'TCP is reinitialized'
branch.

AFAICS, without this, the connection would move to sES just fine,
as the data ack is in window.
Re: tcp hang when socket fills up ?
On Tue, Apr 17, 2018 at 02:34:37PM +0200, Dominique Martinet wrote:
> Michal Kubecek wrote on Tue, Apr 17, 2018:
> > Data (21 bytes) packet in reply direction. And somewhere between the
> > first and second debugging print, we ended up with sender scale=0 and
> > that value is then preserved from now on.
> >
> > The only place between the two debug prints where we could change only
> > one of the td_sender values are the two calls to tcp_options() but
> > neither should be called now unless I missed something. I'll try to
> > think about it some more.
>
> Could it have something to do with the way I set up the connection?
> I don't think the "both remotes call connect() with carefully selected
> source/dest port" is a very common case..
>
> If you look at the tcpdump outputs I attached, the sequence usually is
> something like
> server > client SYN
> client > server SYN
> server > client SYNACK
> client > server ACK

This must be what nf_conntrack_proto_tcp.c calls "simultaneous open".

> Ultimately it IS a connection, but with an extra SYN packet in front of
> it (that first SYN opens up the conntrack entry on the NAT so that the
> client's SYN can come in; the client's conntrack will be that of a
> normal connection since its first SYN goes out directly after the
> server's (it didn't see the server's SYN)).
>
> Looking at my logs again, I'm seeing the same as you:
>
> This looks like the actual SYN/SYN/SYNACK/ACK:
> - 14.364090 seq=505004283 likely SYN coming out of server
> - 14.661731 seq=1913287797 on the next line it says receiver
>   end=505004284 so likely the matching SYN from client
> Which this time gets a proper SYNACK from server:
> 14.662020 seq=505004283 ack=1913287798
> And the following final dataless ACK:
> 14.687570 seq=1913287798 ack=505004284
>
> Then as you point out some data ACK, where the scale poofs:
> 14.688762 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 end=1913287819
> 14.688793 tcp_in_window: sender end=1913287798 maxend=1913316998 maxwin=29312 scale=7 receiver end=505004284 maxend=505033596 maxwin=29200 scale=7
> 14.688824 tcp_in_window:
> 14.688852 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 end=1913287819
> 14.62 tcp_in_window: sender end=1913287819 maxend=1913287819 maxwin=229 scale=0 receiver end=505004284 maxend=505033596 maxwin=29200 scale=7
>
> As you say, only tcp_options() will clear only one side of the scales.
> We don't have sender->td_maxwin == 0 (printed) so I see no other way
> than we are in the last else if:
> - we have after(end, sender->td_end) (end=1913287819 > sender end=1913287798)
> - I assume the tcp state machine must be confused because of the
>   SYN/SYN/SYNACK/ACK pattern and we probably enter the next check,
>   but since this is a data packet it doesn't have the tcp option for
>   scale, thus scale resets.

I agree that sender->td_maxwin is not zero, so the handshake above
probably left the conntrack in TCP_CONNTRACK_SYN_RECV state for some
reason. I'll try to go through the code with the pattern you mentioned
in mind.

Michal Kubecek
Re: tcp hang when socket fills up ?
Michal Kubecek wrote on Tue, Apr 17, 2018:
> Data (21 bytes) packet in reply direction. And somewhere between the
> first and second debugging print, we ended up with sender scale=0 and
> that value is then preserved from now on.
>
> The only place between the two debug prints where we could change only
> one of the td_sender values are the two calls to tcp_options() but
> neither should be called now unless I missed something. I'll try to
> think about it some more.

Could it have something to do with the way I set up the connection?
I don't think the "both remotes call connect() with carefully selected
source/dest port" is a very common case..

If you look at the tcpdump outputs I attached, the sequence usually is
something like
server > client SYN
client > server SYN
server > client SYNACK
client > server ACK

Ultimately it IS a connection, but with an extra SYN packet in front of
it (that first SYN opens up the conntrack entry on the NAT so that the
client's SYN can come in; the client's conntrack will be that of a
normal connection since its first SYN goes out directly after the
server's (it didn't see the server's SYN)).

Looking at my logs again, I'm seeing the same as you:

This looks like the actual SYN/SYN/SYNACK/ACK:
- 14.364090 seq=505004283 likely SYN coming out of server
- 14.661731 seq=1913287797 on the next line it says receiver
  end=505004284 so likely the matching SYN from client
Which this time gets a proper SYNACK from server:
14.662020 seq=505004283 ack=1913287798
And the following final dataless ACK:
14.687570 seq=1913287798 ack=505004284

Then as you point out some data ACK, where the scale poofs:
14.688762 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 end=1913287819
14.688793 tcp_in_window: sender end=1913287798 maxend=1913316998 maxwin=29312 scale=7 receiver end=505004284 maxend=505033596 maxwin=29200 scale=7
14.688824 tcp_in_window:
14.688852 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 end=1913287819
14.62 tcp_in_window: sender end=1913287819 maxend=1913287819 maxwin=229 scale=0 receiver end=505004284 maxend=505033596 maxwin=29200 scale=7

As you say, only tcp_options() will clear only one side of the scales.
We don't have sender->td_maxwin == 0 (printed) so I see no other way
than we are in the last else if:
- we have after(end, sender->td_end) (end=1913287819 > sender end=1913287798)
- I assume the tcp state machine must be confused because of the
  SYN/SYN/SYNACK/ACK pattern and we probably enter the next check,
  but since this is a data packet it doesn't have the tcp option for
  scale, thus scale resets.

At least peeling through the logs myself helped me follow the process;
I'll sprinkle some carefully crafted logs tomorrow to check whether this
is true, and will let you figure out what is best: trying to preserve
the scale if it was set before, setting a default of 14, or something
else.

Thanks!
--
Dominique Martinet | Asmadeus
Re: tcp hang when socket fills up ?
On Mon, Apr 16, 2018 at 10:28:11PM -0700, Eric Dumazet wrote:
> > I turned pr_debug on in tcp_in_window() for another try and it's a bit
> > mangled because the information is on multiple lines and the function
> > is called in parallel, but it looks like I do have some seq > maxend + 1
> >
> > Although it's weird, the maxend was set WAY earlier apparently?
> > Apr 17 11:13:14 res=1 sender end=1913287798 maxend=1913316998 maxwin=29312 receiver end=505004284 maxend=505033596 maxwin=29200
> > then the window decreased drastically, e.g. previous ack just before refusal:
> > Apr 17 11:13:53 seq=1913292311 ack=505007789+(0) sack=505007789+(0) win=284 end=1913292311
> > Apr 17 11:13:53 sender end=1913292311 maxend=1913331607 maxwin=284 scale=0 receiver end=505020155 maxend=505033596 maxwin=39296 scale=7
>
> scale=0 is suspect.
>
> Really if conntrack does not see SYN SYNACK packets, it should not
> make any window check, since windows can be scaled up to 14 :/

Hm... it doesn't seem to be the case here:

14.364038 tcp_in_window: START
14.364065 tcp_in_window:
14.364090 seq=505004283 ack=0+(0) sack=0+(0) win=29200 end=505004284
14.364129 tcp_in_window: sender end=505004284 maxend=505004284 maxwin=29200 scale=7 receiver end=0 maxend=0 maxwin=0 scale=0
14.364158 tcp_in_window:
14.364185 seq=505004283 ack=0+(0) sack=0+(0) win=29200 end=505004284
14.364210 tcp_in_window: sender end=505004284 maxend=505004284 maxwin=29200 scale=7 receiver end=0 maxend=0 maxwin=0 scale=0
14.364237 tcp_in_window: I=1 II=1 III=1 IV=1
14.364262 tcp_in_window: res=1 sender end=505004284 maxend=505004284 maxwin=29200 receiver end=0 maxend=29200 maxwin=0

looks like SYN packet

14.661682 tcp_in_window: START
14.661706 tcp_in_window:
14.661731 seq=1913287797 ack=0+(0) sack=0+(0) win=29200 end=1913287798
14.661828 tcp_in_window: sender end=0 maxend=29200 maxwin=0 scale=0 receiver end=505004284 maxend=505004284 maxwin=29200 scale=7
14.661867 tcp_in_window: START
14.661893 tcp_in_window:
14.661917 seq=1025597635 ack=1542862349+(0) sack=1542862349+(0) win=2414 end=1025597635
14.661942 tcp_in_window: START
14.661966 tcp_in_window:
14.661993 tcp_in_window: sender end=1025597635 maxend=1025635103 maxwin=354378 scale=7 receiver end=1542862349 maxend=1543168175 maxwin=37504 scale=7
14.662020 seq=505004283 ack=1913287798+(0) sack=1913287798+(0) win=29200 end=505004284
14.662045 tcp_in_window:
14.662072 seq=1025597635 ack=1542862349+(0) sack=1542862349+(0) win=2414 end=1025597635
14.662097 tcp_in_window: sender end=505004284 maxend=505004284 maxwin=29200 scale=7 receiver end=1913287798 maxend=1913287798 maxwin=29200 scale=7
14.662125 tcp_in_window:
14.662150 tcp_in_window: sender end=1025597635 maxend=1025635103 maxwin=354378 scale=7 receiver end=1542862349 maxend=1543168175 maxwin=37504 scale=7
14.662175 seq=505004283 ack=1913287798+(0) sack=1913287798+(0) win=29200 end=505004284
14.662202 tcp_in_window: sender end=505004284 maxend=505004284 maxwin=29200 scale=7 receiver end=1913287798 maxend=1913287798 maxwin=29200 scale=7
14.662226 tcp_in_window: I=1 II=1 III=1 IV=1
14.662251 tcp_in_window: I=1 II=1 III=1 IV=1
14.662277 tcp_in_window: res=1 sender end=505004284 maxend=505004284 maxwin=29200 receiver end=1913287798 maxend=1913316998 maxwin=29200
14.662302 tcp_in_window: res=1 sender end=1025597635 maxend=1025635103 maxwin=354378 receiver end=1542862349 maxend=1543171341 maxwin=37504

SYNACK response and (dataless) ACK in the original direction, mixed with
an unrelated packet.

14.687411 tcp_in_window: START
14.687522 tcp_in_window:
14.687570 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 end=1913287798
14.687619 tcp_in_window: sender end=1913287798 maxend=1913316998 maxwin=29200 scale=7 receiver end=505004284 maxend=505004284 maxwin=29200 scale=7
14.687659 tcp_in_window:
14.687699 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 end=1913287798
14.687739 tcp_in_window: sender end=1913287798 maxend=1913316998 maxwin=29200 scale=7 receiver end=505004284 maxend=505004284 maxwin=29200 scale=7
14.687774 tcp_in_window: I=1 II=1 III=1 IV=1
14.687806 tcp_in_window: res=1 sender end=1913287798 maxend=1913316998 maxwin=29312 receiver end=505004284 maxend=505033596 maxwin=29200

ACK in the reply direction (no data). We still have scale=7 in both
directions.

14.688706 tcp_in_window: START
14.688733 tcp_in_window:
14.688762 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 end=1913287819
14.688793 tcp_in_window: sender end=1913287798 maxend=1913316998 maxwin=29312 scale=7 receiver end=505004284 maxend=505033596 maxwin=29200 scale=7
14.688824 tcp_in_window:
14.688852 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 end=1913287819
14.62 tcp_in_window: sender end=1913287819 maxend=1913287819 maxwin=229 scale=0 receiver end=505004284 maxend=505033596 maxwin=29200 scale=7
14.688911 tcp_in_window: I=1 II=1 III=1 IV=1
14.688938 tcp_in_window: res=1 sender end=1913287819 maxend=1913
Re: tcp hang when socket fills up ?
On Mon, Apr 16, 2018 at 10:28:11PM -0700, Eric Dumazet wrote:
> > I turned pr_debug on in tcp_in_window() for another try and it's a bit
> > mangled because the information is on multiple lines and the function
> > is called in parallel, but it looks like I do have some seq > maxend + 1
> >
> > Although it's weird, the maxend was set WAY earlier apparently?
> > Apr 17 11:13:14 res=1 sender end=1913287798 maxend=1913316998 maxwin=29312 receiver end=505004284 maxend=505033596 maxwin=29200
> > then the window decreased drastically, e.g. previous ack just before refusal:
> > Apr 17 11:13:53 seq=1913292311 ack=505007789+(0) sack=505007789+(0) win=284 end=1913292311
> > Apr 17 11:13:53 sender end=1913292311 maxend=1913331607 maxwin=284 scale=0 receiver end=505020155 maxend=505033596 maxwin=39296 scale=7
>
> scale=0 is suspect.
>
> Really if conntrack does not see SYN SYNACK packets, it should not
> make any window check, since windows can be scaled up to 14 :/

Or maybe set the scaling to
- TCP_MAX_WSCALE (14) by default
- 0 when a SYN or SYNACK without the window scale option is seen
- the value of the window scale option when a SYN or SYNACK with it is seen

Michal Kubecek
Re: tcp hang when socket fills up ?
On 04/16/2018 08:52 PM, Dominique Martinet wrote:
> Thank you for the hints. Out of order reply.
>
> Florian Westphal wrote on Mon, Apr 16, 2018:
>> echo 1 > /proc/sys/net/netfilter/nf_conntrack_tcp_be_liberal
>>
>> which stops conntrack from marking packets with out-of-window
>> acks as invalid.
>
> That worked, the connection stays established with this set.
>
>> echo 6 > /proc/sys/net/netfilter/nf_conntrack_log_invalid
>
> Took me a while to figure out how to log that properly (why do distros
> not ship ulogd?!) but that is helpful.
>
> I have attached the full logs to this mail as text attachments in case
> I missed something in my excerpts.
>
> Specifically, there are many outbound invalid packets, but as Marcelo
> pointed out there is no conntrack netfilter for outbound packets...
> The first invalid incoming packet is this one (json output seemed to be
> the most convenient as it has tcp seq/ackseq and the oob.prefix
> message; feel free to ask for some other format):
>
> {"timestamp": "2018-04-17T10:29:14.956485", "dvc": "Netfilter",
> "raw.pktlen": 52, "raw.pktcount": 1, "oob.prefix": "nf_ct_tcp: ACK is
> over the upper bound (ACKed data not seen yet) ", "oob.time.sec":
> 1523928554, "oob.time.usec": 956485, "oob.mark": 0, "oob.hook": 0,
> "oob.family": 2, "oob.protocol": 2048, "raw.label": 0, "ip.protocol": 6,
> "ip.tos": 0, "ip.ttl": 52, "ip.totlen": 52, "ip.ihl": 5, "ip.csum":
> 50909, "ip.id": 30328, "ip.fragoff": 16384, "src_port": 30280,
> "dest_port": 29543, "tcp.seq": 4048786673, "tcp.ackseq": 3024392506,
> "tcp.window": 781, "tcp.offset": 0, "tcp.reserved": 0, "tcp.urg": 0,
> "tcp.ack": 1, "tcp.psh": 0, "tcp.rst": 0, "tcp.syn": 0, "tcp.fin": 0,
> "tcp.res1": 0, "tcp.res2": 0, "tcp.csum": 17752, "oob.in": "",
> "oob.out": "", "src_ip": "client", "dest_ip": "server"}
>
> which is the first ack that isn't seen alright (sent packet, blocked
> ack, replays):
>
> 10:29:14.926115 IP server.29543 > client.30280: Flags [.], seq 3024391132:3024392506, ack 4048786673, win 307, options [nop,nop,TS val 681965809 ecr 1634528339], length 1374
> ...
> 10:29:14.956485 IP client.30280 > server.29543: Flags [.], ack 3024392506, win 781, options [nop,nop,TS val 1634528369 ecr 681965809], length 0
> ...
> 10:29:15.255489 IP server.29543 > client.30280: Flags [.], seq 3024391132:3024392506, ack 4048786673, win 307, options [nop,nop,TS val 681966138 ecr 1634528369], length 1374
> 10:29:15.281581 IP client.30280 > server.29543: Flags [.], ack 3024436474, win 1444, options [nop,nop,TS val 1634528697 ecr 681965904,nop,nop,sack 1 {3024391132:3024392506}], length 0
> 10:29:15.555466 IP client.30280 > server.29543: Flags [P.], seq 4048786673:4048786709, ack 3024436474, win 1444, options [nop,nop,TS val 1634528869 ecr 681965904], length 36
> 10:29:15.719510 IP server.29543 > client.30280: Flags [.], seq 3024391132:3024392506, ack 4048786673, win 307, options [nop,nop,TS val 681966602 ecr 1634528369], length 1374
>
> So now I guess the question is why netfilter thinks this is over the
> upper bound.
>
> ...And surely enough the answer is within the invalid outbound packets
> that aren't considered in the conntrack state machine, the first of
> which is this one:
>
> {"timestamp": "2018-04-17T10:29:14.926143", "dvc": "Netfilter",
> "raw.pktlen": 2800, "raw.pktcount": 1, "oob.prefix": "nf_ct_tcp: SEQ is
> over the upper bound (over the window of the receiver)", "oob.time.sec":
> 1523928554, "oob.time.usec": 926143, "oob.mark": 0, "oob.hook": 0,
> "oob.family": 2, "oob.protocol": 2048, "oob.uid": 1000, "oob.gid": 100,
> "raw.label": 0, "ip.protocol": 6, "ip.tos": 8, "ip.ttl": 64,
> "ip.totlen": 2800, "ip.ihl": 5, "ip.csum": 47600, "ip.id": 27809,
> "ip.fragoff": 16384, "src_port": 29543, "dest_port": 30280, "tcp.seq":
> 3024391132, "tcp.ackseq": 4048786673, "tcp.window": 307, "tcp.offset":
> 0, "tcp.reserved": 0, "tcp.urg": 0, "tcp.ack": 1, "tcp.psh": 0,
> "tcp.rst": 0, "tcp.syn": 0, "tcp.fin": 0, "tcp.res1": 0, "tcp.res2": 0,
> "tcp.csum": 5201, "oob.in": "", "oob.out": "", "src_ip": "server",
> "dest_ip": "client"}
>
> which sequence matches the start seq of the first packet we don't ack,
> so that's why conntrack doesn't recognize that as a valid ack.
>
> Here's some context before that packet:
> 10:29:14.921305 IP server.29543 > client.30280: Flags [.], seq 3024378314:3024379688, ack 4048786673, win 307, options [nop,nop,TS val 681965804 ecr 1634528336], length 1374
> 10:29:14.921311 IP server.29543 > client.30280: Flags [.], seq 3024379688:3024381062, ack 4048786673, win 307, options [nop,nop,TS val 681965804 ecr 1634528336], length 1374
> 10:29:14.922166 IP client.30280 > server.29543: Flags [.], ack 3024365948, win 329, options [nop,nop,TS val 1634528336 ecr 681965778], length 0
> 10:29:14.922178 IP server.29543 > client.30280: Flags [.], seq 3024381062:3024382436, ack 4048786673, win 307, options [nop,nop,TS val 681965805 ecr 1634528336], length 1374
Re: tcp hang when socket fills up ?
On Sat, Apr 14, 2018 at 03:55:15AM +0200, Dominique Martinet wrote:
> 16:49:26.724457 IP .31872 > .13317: Flags [.], ack 25134, win 646, options [nop,nop,TS val 1617129467 ecr 1313937599], length 0
> 16:49:26.726081 IP .31872 > .13317: Flags [.], ack 26508, win 669, options [nop,nop,TS val 1617129471 ecr 1313937599], length 0
> 16:49:26.726391 IP .31872 > .13317: Flags [.], ack 27882, win 691, options [nop,nop,TS val 1617129471 ecr 1313937601], length 0
> 16:49:26.726962 IP .31872 > .13317: Flags [.], ack 29256, win 714, options [nop,nop,TS val 1617129472 ecr 1313937601], length 0
> 16:49:26.727614 IP .31872 > .13317: Flags [.], ack 30630, win 737, options [nop,nop,TS val 1617129473 ecr 1313937602], length 0
> 16:49:26.728084 IP .31872 > .13317: Flags [.], ack 32004, win 759, options [nop,nop,TS val 1617129473 ecr 1313937602], length 0

[A] (packet above)

> 16:49:26.728507 IP .13317 > .31872: Flags [.], seq 59484:60858, ack 4190, win 307, options [nop,nop,TS val 1313937635 ecr 1617129473], length 1374
> 16:49:26.728511 IP .13317 > .31872: Flags [.], seq 60858:62232, ack 4190, win 307, options [nop,nop,TS val 1313937635 ecr 1617129473], length 1374
> 16:49:26.729531 IP .31872 > .13317: Flags [.], ack 33378, win 782, options [nop,nop,TS val 1617129475 ecr 1313937607], length 0
> 16:49:26.730002 IP .31872 > .13317: Flags [.], ack 34752, win 805, options [nop,nop,TS val 1617129475 ecr 1313937607], length 0
> 16:49:26.730340 IP .13317 > .31872: Flags [.], seq 62232:63606, ack 4190, win 307, options [nop,nop,TS val 1313937636 ecr 1617129473], length 1374
> 16:49:26.730344 IP .13317 > .31872: Flags [.], seq 63606:64980, ack 4190, win 307, options [nop,nop,TS val 1313937636 ecr 1617129473], length 1374
> 16:49:26.731398 IP .13317 > .31872: Flags [.], seq 64980:66354, ack 4190, win 307, options [nop,nop,TS val 1313937638 ecr 1617129473], length 1374
> 16:49:26.731402 IP .13317 > .31872: Flags [.], seq 66354:67728, ack 4190, win 307, options [nop,nop,TS val 1313937638 ecr 1617129473], length 1374
> 16:49:26.731634 IP .31872 > .13317: Flags [.], ack 36126, win 827, options [nop,nop,TS val 1617129476 ecr 1313937607], length 0
> 16:49:26.732955 IP .13317 > .31872: Flags [.], seq 67728:69102, ack 4190, win 307, options [nop,nop,TS val 1313937639 ecr 1617129473], length 1374
> 16:49:26.732963 IP .13317 > .31872: Flags [.], seq 69102:70476, ack 4190, win 307, options [nop,nop,TS val 1313937639 ecr 1617129473], length 1374
> 16:49:26.733956 IP .31872 > .13317: Flags [.], ack 37500, win 850, options [nop,nop,TS val 1617129476 ecr 1313937607], length 0
> 16:49:26.734242 IP .31872 > .13317: Flags [.], ack 38874, win 872, options [nop,nop,TS val 1617129477 ecr 1313937608], length 0
> 16:49:26.734653 IP .31872 > .13317: Flags [.], ack 40248, win 895, options [nop,nop,TS val 1617129478 ecr 1313937608], length 0
> 16:49:26.735042 IP .13317 > .31872: Flags [.], seq 70476:71850, ack 4190, win 307, options [nop,nop,TS val 1313937641 ecr 1617129473], length 1374
> 16:49:26.735046 IP .13317 > .31872: Flags [.], seq 71850:73224, ack 4190, win 307, options [nop,nop,TS val 1313937641 ecr 1617129473], length 1374
> 16:49:26.735334 IP .31872 > .13317: Flags [.], ack 41622, win 918, options [nop,nop,TS val 1617129478 ecr 1313937609], length 0
> 16:49:26.736005 IP .31872 > .13317: Flags [.], ack 42996, win 940, options [nop,nop,TS val 1617129478 ecr 1313937609], length 0
> 16:49:26.736402 IP .13317 > .31872: Flags [.], seq 73224:74598, ack 4190, win 307, options [nop,nop,TS val 1313937643 ecr 1617129473], length 1374
> 16:49:26.736408 IP .13317 > .31872: Flags [.], seq 74598:75972, ack 4190, win 307, options [nop,nop,TS val 1313937643 ecr 1617129473], length 1374
> 16:49:26.738561 IP .31872 > .13317: Flags [.], ack 44370, win 963, options [nop,nop,TS val 1617129482 ecr 1313937616], length 0
> 16:49:26.739539 IP .31872 > .13317: Flags [.], ack 45744,
win 986, options [nop,nop,TS val 1617129482 ecr 1313937616], > length 0 > 16:49:26.739882 IP .31872 > .13317: Flags > [.], ack 47118, win 1008, options [nop,nop,TS val 1617129484 ecr 1313937617], > length 0 > 16:49:26.740255 IP .31872 > .13317: Flags > [.], ack 48492, win 1031, options [nop,nop,TS val 1617129484 ecr 1313937617], > length 0 > 16:49:26.746756 IP .31872 > .13317: Flags > [.], ack 49866, win 1053, options [nop,nop,TS val 1617129493 ecr 1313937627], > length 0 > 16:49:26.747923 IP .31872 > .13317: Flags > [.], ack 51240, win 1076, options [nop,nop,TS val 1617129494 ecr 1313937627], > length 0 > 16:49:26.749083 IP .31872 > .13317: Flags > [.], ack 52614, win 1099, options [nop,nop,TS val 1617129495 ecr 1313937629], > length 0 > 16:49:26.750171 IP .31872 > .13317: Flags > [.], ack 53988, win 1121, options [nop,nop,TS val 1617129496 ecr 1313937629], > length 0 > 16:49:26.750808 IP .31872 > .13317: Flags > [.],
Re: tcp hang when socket fills up ?
Dominique Martinet wrote:
> Eric Dumazet wrote on Sun, Apr 15, 2018:
> > Are you sure you do not have some iptables/netfilter stuff ?
>
> I have a basic firewall setup with default rules e.g. starts with
> -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
> in the INPUT chain...
> That said, I just dropped it on the server to check and that seems to
> work around the issue?!
> When logging everything dropped it appears to decide that the connection
> is no longer established at some point, but only if there is
> tcp_timestamp, just, err, how?
>
> And sure enough, if I restore the firewall while a connection is up
> that just hangs; conntrack doesn't consider it connected anymore at some
> point (but it worked for a while!)
>
> Here's the kind of logs I get from iptables:
> IN=wlp1s0 OUT= MAC=00:c2:c6:b4:7e:c7:a4:12:42:b5:5d:fc:08:00 SRC=client
> DST=server LEN=52 TOS=0x00 PREC=0x00 TTL=52 ID=17038 DF PROTO=TCP SPT=41558
> DPT=15609 WINDOW=1212 RES=0x00 ACK URGP=0

You could do

echo 6 > /proc/sys/net/netfilter/nf_conntrack_log_invalid

to have conntrack log when/why it thinks a packet is invalid.

You can also set

echo 1 > /proc/sys/net/netfilter/nf_conntrack_tcp_be_liberal

which stops conntrack from marking packets with out-of-window acks as
invalid.

(An earlier email implies this is related to timestamps, but unfortunately,
to the best of my knowledge, conntrack doesn't look at tcp timestamps.)
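As a side note on these two knobs, a brief sketch of how they fit together (an illustration, not from the thread; the value passed to nf_conntrack_log_invalid is an IP protocol number, so 6 selects TCP, and the exact kernel-log message prefix is an assumption that varies by kernel version):

```shell
# Run as root on the server. nf_conntrack_log_invalid takes an IP protocol
# number: 6 selects TCP, so only TCP packets judged invalid get logged.
echo 6 > /proc/sys/net/netfilter/nf_conntrack_log_invalid

# Make conntrack accept out-of-window ACKs instead of marking them INVALID:
echo 1 > /proc/sys/net/netfilter/nf_conntrack_tcp_be_liberal

# Then follow the kernel log for conntrack's stated reason when it rejects
# a packet (the "nf_ct" prefix is kernel-version dependent):
dmesg -w | grep -i nf_ct
```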
Re: tcp hang when socket fills up ?
Dominique Martinet wrote on Mon, Apr 16, 2018:
> . . . Oh, there is something interesting there, the connection doesn't
> come up with -G?

Hm, sorry, I take this last part back. I cannot reliably reproduce -G not
working. I'll dig around the conntrack table a bit more.
-- 
Dominique Martinet | Asmadeus
Re: tcp hang when socket fills up ?
Eric Dumazet wrote on Sun, Apr 15, 2018:
> Are you sure you do not have some iptables/netfilter stuff ?

I have a basic firewall setup with default rules, e.g. it starts with
-m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
in the INPUT chain...

That said, I just dropped it on the server to check, and that seems to
work around the issue?!
When logging everything dropped, it appears to decide that the connection
is no longer established at some point, but only if there is
tcp_timestamp, just, err, how?

And sure enough, if I restore the firewall while a connection is up, that
connection just hangs; conntrack doesn't consider it connected anymore at
some point (but it worked for a while!)

Here's the kind of logs I get from iptables:
IN=wlp1s0 OUT= MAC=00:c2:c6:b4:7e:c7:a4:12:42:b5:5d:fc:08:00 SRC=client
DST=server LEN=52 TOS=0x00 PREC=0x00 TTL=52 ID=17038 DF PROTO=TCP SPT=41558
DPT=15609 WINDOW=1212 RES=0x00 ACK URGP=0

> ss -temoi might give us more info

hang:
ESTAB 0 81406 server:15609 client:41558
users:(("socat",pid=17818,fd=5)) timer:(on,48sec,11) uid:1000 ino:137253 sk:6a <->
skmem:(r0,rb369280,t0,tb147456,f2050,w104446,o0,bl0,d1) ts sack reno
wscale:7,7 rto:15168 backoff:6 rtt:36.829/6.492 ato:40 mss:1374 pmtu:1500
rcvmss:1248 advmss:1448 cwnd:1 ssthresh:16 bytes_acked:32004
bytes_received:4189 segs_out:84 segs_in:55 data_segs_out:77 data_segs_in:18
send 298.5Kbps lastsnd:12483 lastrcv:27801 lastack:27726 pacing_rate
19.1Mbps delivery_rate 4.1Mbps busy:28492ms unacked:31 retrans:1/6 lost:31
rcv_rtt:29 rcv_space:29200 rcv_ssthresh:39184 notsent:38812 minrtt:25.152

working (tcp_timestamp=0):
ESTAB 0 36 server:15080 client:32979
users:(("socat",pid=17047,fd=5)) timer:(on,226ms,0) uid:1000 ino:90917 sk:23 <->
skmem:(r0,rb369280,t0,tb1170432,f1792,w2304,o0,bl0,d3) sack reno wscale:7,7
rto:230 rtt:29.413/5.345 ato:64 mss:1386 pmtu:1500 rcvmss:1248 advmss:1460
cwnd:4 ssthresh:3 bytes_acked:17391762 bytes_received:62397 segs_out:13964
segs_in:8642 data_segs_out:13895 data_segs_in:1494
send 1.5Mbps lastsnd:4 lastrcv:5 lastack:5 pacing_rate 1.8Mbps
delivery_rate 1.2Mbps busy:56718ms unacked:1 retrans:0/11 rcv_rtt:9112.95
rcv_space:29233 rcv_ssthresh:41680 minrtt:25.95

working (no iptables):
ESTAB 0 0 server:61460 client:20468
users:(("socat",pid=17880,fd=5)) uid:1000 ino:129982 sk:6f <->
skmem:(r0,rb369280,t0,tb1852416,f0,w0,o0,bl0,d1) ts sack reno wscale:7,7
rto:244 rtt:43.752/7.726 ato:40 mss:1374 pmtu:1500 rcvmss:1248 advmss:1448
cwnd:10 bytes_acked:2617302 bytes_received:5441 segs_out:1929 segs_in:976
data_segs_out:1919 data_segs_in:41 send 2.5Mbps lastsnd:2734 lastrcv:2734
lastack:2705 pacing_rate 5.0Mbps delivery_rate 12.7Mbps busy:1884ms
rcv_rtt:30 rcv_space:29200 rcv_ssthresh:39184 minrtt:26.156

> Really it looks like at some point, all incoming packets are shown by
> tcpdump but do not reach the TCP socket anymore.
>
> (segs_in: might be steady, look at the d0 counter shown by ss -temoi
> (dX : drop counters, sk->sk_drops)

segs_in does not increase with replays; the d1 seems stable.

> While running your experiment, try on the server.
>
> perf record -a -g -e skb:kfree_skb sleep 30
> perf report

While I understand what that should do, I am not sure why I do not get any
call graph, so that doesn't help tell what called kfree_skb and thus what
decided to drop the packet (although we no longer really need that now..)
perf script just shows kfree_skb, e.g.

swapper 0 [001] 237244.869321: skb:kfree_skb: skbaddr=0x8800360fda00 protocol=2048 location=0x817a1a77
        9458e3 kfree_skb (/usr/lib/debug/lib/modules/4.16.0-300.fc28.x86_64/vmlinux)

---

So I guess that ultimately the problem is why conntrack suddenly decides
that an established connection isn't established anymore, despite it being
listed as established by ss..

I'm discovering `conntrack(8)`, but what strikes me as interesting is that
even that points at the connection being established (looking at a new
connection after iptables started dropping packets):

# conntrack -L | grep 21308
tcp 6 267 ESTABLISHED src=server dst=client sport=21308 dport=37552
src=client dst=server sport=37552 dport=21308 [ASSURED] mark=0 use=1

compared to another that isn't dropped (the old connection without
tcp_timestamp):

tcp 6 299 ESTABLISHED src=server dst=client sport=15080 dport=32979
src=client dst=server sport=32979 dport=15080 [ASSURED] mark=0 use=1

The expect/dying/unconfirmed tables are all empty.

. . . Oh, there is something interesting there, the connection doesn't
come up with -G?

working:
conntrack -G --protonum tcp --src server --dst client --sport 15080 --dport 32979
tcp 6 299 ESTABLISHED src=server dst=cl
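A possible refinement of the polling above (a sketch, not something used in the thread): conntrack can report state transitions as they happen, which would show the exact moment an entry leaves ESTABLISHED or is destroyed, rather than catching it after the fact with -L.

```shell
# Watch conntrack events live for the hung connection (21308 is the hung
# connection's source port from the listing above; run as root):
conntrack -E -p tcp --sport 21308
```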
Re: tcp hang when socket fills up ?
On 04/15/2018 06:47 PM, Dominique Martinet wrote:
> Also, here are the per-socket stats I could find (ss -i after having
> reproduced hang):
> reno wscale:7,7 rto:7456 backoff:5 rtt:32.924/1.41 ato:40 mss:1374
> pmtu:1500 rcvmss:1248 advmss:1448 cwnd:1 ssthresh:16
> bytes_acked:32004 bytes_received:4189 segs_out:85 segs_in:54
> data_segs_out:78 data_segs_in:18 send 333.9Kbps lastsnd:3912
> lastrcv:11464 lastack:11387 pacing_rate 21.4Mbps delivery_rate
> 3.5Mbps busy:12188ms unacked:33 retrans:1/5 lost:33 rcv_rtt:37
> rcv_space:29200 rcv_ssthresh:39184 notsent:28796 minrtt:24.986

ss -temoi might give us more info.

Really it looks like at some point, all incoming packets are shown by
tcpdump but do not reach the TCP socket anymore.

(segs_in might be steady; look at the dX counter shown by ss -temoi —
dX: drop counters, sk->sk_drops)

Are you sure you do not have some iptables/netfilter stuff ?

While running your experiment, try on the server:

perf record -a -g -e skb:kfree_skb sleep 30
perf report
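If perf's call graph comes up empty (as it does elsewhere in the thread), an alternative sketch using ftrace event triggers can answer the same "who called kfree_skb" question; this assumes tracefs is mounted at the usual debugfs path:

```shell
# Print a kernel stack trace for every kfree_skb event via ftrace event
# triggers (run as root):
cd /sys/kernel/debug/tracing
echo 1 > events/skb/kfree_skb/enable
echo stacktrace > events/skb/kfree_skb/trigger
cat trace_pipe        # each event is followed by the dropping call chain

# Cleanup once done:
echo '!stacktrace' > events/skb/kfree_skb/trigger
echo 0 > events/skb/kfree_skb/enable
```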
Re: tcp hang when socket fills up ?
Eric Dumazet wrote on Fri, Apr 13, 2018:
> That might be caused by some TS val/ecr breakage :
>
> Many acks were received by the server tcpdump,
> but none of them was accepted by TCP stack, for some reason.
>
> Try to disable TCP timestamps, it will give some hint if bug does not
> reproduce.

This was spot on: after disabling tcp timestamps I cannot reproduce the
hang anymore.

I've had another look at the original sequence (as seen by the server) and
I don't see much wrong; tell me what I missed:

- the replayed packet has seq 32004:33378, so the first ignored ack would
  be the one with ack 33378, is that right? (meaning the server did accept
  the one for 32004 and none after that)

  Assuming it is, here is an excerpt from around then (first emission of
  that packet, then the client's replies):

  16:49:26.700531 IP .13317 > .31872: Flags [.], seq 32004:33378, ack 4190,
      win 307, options [nop,nop,TS val 1313937607 ecr 1617129440], length 1374
  ...
  16:49:26.728084 IP .31872 > .13317: Flags [.], ack 32004, win 759,
      options [nop,nop,TS val 1617129473 ecr 1313937602], length 0
  ...
  16:49:26.729531 IP .31872 > .13317: Flags [.], ack 33378, win 782,
      options [nop,nop,TS val 1617129475 ecr 1313937607], length 0
  16:49:26.730002 IP .31872 > .13317: Flags [.], ack 34752, win 805,
      options [nop,nop,TS val 1617129475 ecr 1313937607], length 0
  ...
  16:49:26.731634 IP .31872 > .13317: Flags [.], ack 36126, win 827,
      options [nop,nop,TS val 1617129476 ecr 1313937607], length 0

- the ecr value matches the val of the packet it acks
- the val is >= that of the previous packet (so it won't be considered
  reordered / should pass the PAWS check?)
- even if the packets are processed in parallel and some kind of race
  occurs, a "bigger" ack should ack all the previous packets, right?
- Just to make sure, I checked /proc/net/netstat for PAWSEstab but that is 0: TcpExt: SyncookiesSent SyncookiesRecv SyncookiesFailed EmbryonicRsts PruneCalled RcvPruned OfoPruned OutOfWindowIcmps LockDroppedIcmps ArpFilter TW TWRecycled TWKilled PAWSActive PAWSEstab DelayedACKs DelayedACKLocked DelayedACKLost ListenOverflows ListenDrops TCPHPHits TCPPureAcks TCPHPAcks TCPRenoRecovery TCPSackRecovery TCPSACKReneging TCPFACKReorder TCPSACKReorder TCPRenoReorder TCPTSReorder TCPFullUndo TCPPartialUndo TCPDSACKUndo TCPLossUndo TCPLostRetransmit TCPRenoFailures TCPSackFailures TCPLossFailures TCPFastRetrans TCPSlowStartRetrans TCPTimeouts TCPLossProbes TCPLossProbeRecovery TCPRenoRecoveryFail TCPSackRecoveryFail TCPRcvCollapsed TCPDSACKOldSent TCPDSACKOfoSent TCPDSACKRecv TCPDSACKOfoRecv TCPAbortOnData TCPAbortOnClose TCPAbortOnMemory TCPAbortOnTimeout TCPAbortOnLinger TCPAbortFailed TCPMemoryPressures TCPMemoryPressuresChrono TCPSACKDiscard TCPDSACKIgnoredOld TCPDSACKIgnoredNoUndo TCPSpuriousRTOs TCPMD5NotFound TCPMD5Unexpected TCPMD5Failure TCPSackShifted TCPSackMerged TCPSackShiftFallback TCPBacklogDrop PFMemallocDrop TCPMinTTLDrop TCPDeferAcceptDrop IPReversePathFilter TCPTimeWaitOverflow TCPReqQFullDoCookies TCPReqQFullDrop TCPRetransFail TCPRcvCoalesce TCPOFOQueue TCPOFODrop TCPOFOMerge TCPChallengeACK TCPSYNChallenge TCPFastOpenActive TCPFastOpenActiveFail TCPFastOpenPassive TCPFastOpenPassiveFail TCPFastOpenListenOverflow TCPFastOpenCookieReqd TCPFastOpenBlackhole TCPSpuriousRtxHostQueues BusyPollRxPackets TCPAutoCorking TCPFromZeroWindowAdv TCPToZeroWindowAdv TCPWantZeroWindowAdv TCPSynRetrans TCPOrigDataSent TCPHystartTrainDetect TCPHystartTrainCwnd TCPHystartDelayDetect TCPHystartDelayCwnd TCPACKSkippedSynRecv TCPACKSkippedPAWS TCPACKSkippedSeq TCPACKSkippedFinWait2 TCPACKSkippedTimeWait TCPACKSkippedChallenge TCPWinProbe TCPKeepAlive TCPMTUPFail TCPMTUPSuccess TcpExt: 0 0 0 0 58 0 0 26 0 0 50 0 0 0 0 75402 17 201 0 0 6876848 59804 2258387 0 33 0 0 3 0 0 0 0 
1 102 15 0 0 0 1306 60 386 292 10 0 0 108750 201 1 228 1 8 4 0 63 0 3 0 0 0 0 107 1 0 0 0 2834 1962 622 0 0 0 0 0 0 0 0 0 1065022 54160 0 1 3 3 0 0 0 0 0 0 0 475 0 9578 6 8 71 257 5116325 0 0 0 0 0 0 6 0 0 0 61 85 0 0
IpExt: InNoRoutes InTruncatedPkts InMcastPkts OutMcastPkts InBcastPkts OutBcastPkts InOctets OutOctets InMcastOctets OutMcastOctets InBcastOctets OutBcastOctets InCsumErrors InNoECTPkts InECT1Pkts InECT0Pkts InCEPkts
IpExt: 0 0 0 0 206 0 16405602866 8921427728 0 0 57928 0 0 16121410 0 5388 0

Also, here are the per-socket stats I could find (ss -i after having
reproduced the hang):

reno wscale:7,7 rto:7456 backoff:5 rtt:32.924/1.41 ato:40 mss:1374
pmtu:1500 rcvmss:1248 advmss:1448 cwnd:1 ssthresh:16 bytes_acked:32004
bytes_received:4189 segs_out:85 segs_in:54 data_segs_out:78
data_segs_in:18 send 333.9Kbps lastsnd:3912 lastrcv:11464 lastack:11387
pacing_rate 21.4Mbps delivery_rate 3.5Mbps busy:12188ms unacked:33
retrans:1/5 lost:33 rcv_rtt:37 rcv_space:29200 rcv_ssthresh:39184
notsent:28796 minrtt:24.986

Here are the same stats with tcp timestamps disabled (after running my
reproducer, e.g. outputting a big chunk of text
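Reading a single counter out of that two-row TcpExt header/value layout is error-prone by eye; a small sketch (the helper name and sample file are invented for illustration) that pairs the rows:

```shell
# Hypothetical helper: pair /proc/net/netstat's two-row "TcpExt:" format
# (one header row, one value row) to extract a single counter by name.
extract_counter() {
    # $1 = counter name, $2 = file in /proc/net/netstat format
    awk -v name="$1" '
        $1 == "TcpExt:" && !have_hdr { for (i = 2; i <= NF; i++) hdr[i] = $i; have_hdr = 1; next }
        $1 == "TcpExt:" { for (i = 2; i <= NF; i++) if (hdr[i] == name) print $i }
    ' "$2"
}

# Tiny sample in the same layout (values made up for the demo):
cat > /tmp/netstat.sample <<'EOF'
TcpExt: SyncookiesSent PAWSActive PAWSEstab DelayedACKs
TcpExt: 0 0 0 75402
EOF

extract_counter PAWSEstab /tmp/netstat.sample    # prints 0
```

On a live system you would point it at /proc/net/netstat instead of the sample file.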
Re: tcp hang when socket fills up ?
Eric Dumazet wrote on Fri, Apr 13, 2018: > Ah sorry, your trace was truncated, we need more packets _before_ the excerpt. Ah, sorry as well. I tried to go far back enough to include the replayed packets, but I see it didn't include the original ack for that packet. I'm resending both full traces from connect to fin this time. > That might be caused by some TS val/ecr breakage : > > Many acks were received by the server tcpdump, > but none of them was accepted by TCP stack, for some reason. > > Try to disable TCP timestamps, it will give some hint if bug does not > reproduce. I unfortunately cannot reproduce on a local network, but will give this a try on Monday and report. (sysctl net.ipv4.tcp_timestamps=0, correct?) Here's the traces again, server-capture first: 16:49:22.374452 IP .13317 > .31872: Flags [S], seq 2026966826, win 29200, options [mss 1460,sackOK,TS val 1313933281 ecr 0,nop,wscale 7], length 0 16:49:22.700553 IP .31872 > .13317: Flags [S], seq 882075258, win 29200, options [mss 1386,sackOK,TS val 1617125446 ecr 0,nop,wscale 7], length 0 16:49:22.700577 IP .13317 > .31872: Flags [S.], seq 2026966826, ack 882075259, win 29200, options [mss 1460,sackOK,TS val 1313933607 ecr 1617125446,nop,wscale 7], length 0 16:49:22.727131 IP .31872 > .13317: Flags [.], ack 1, win 229, options [nop,nop,TS val 1617125472 ecr 1313933607], length 0 16:49:22.729193 IP .31872 > .13317: Flags [P.], seq 1:22, ack 1, win 229, options [nop,nop,TS val 1617125472 ecr 1313933607], length 21 16:49:22.729230 IP .13317 > .31872: Flags [.], ack 22, win 229, options [nop,nop,TS val 1313933635 ecr 1617125472], length 0 16:49:22.732726 IP .13317 > .31872: Flags [P.], seq 1:22, ack 22, win 229, options [nop,nop,TS val 1313933639 ecr 1617125472], length 21 16:49:22.759809 IP .31872 > .13317: Flags [.], ack 22, win 229, options [nop,nop,TS val 1617125503 ecr 1313933639], length 0 16:49:22.759835 IP .13317 > .31872: Flags [P.], seq 22:734, ack 22, win 229, options [nop,nop,TS val 1313933666 
ecr 1617125503], length 712 16:49:22.761643 IP .31872 > .13317: Flags [P.], seq 22:1270, ack 22, win 229, options [nop,nop,TS val 1617125504 ecr 1313933639], length 1248 16:49:22.801860 IP .13317 > .31872: Flags [.], ack 1270, win 248, options [nop,nop,TS val 1313933708 ecr 1617125504], length 0 16:49:22.826902 IP .31872 > .13317: Flags [.], ack 734, win 240, options [nop,nop,TS val 1617125572 ecr 1313933666], length 0 16:49:22.827576 IP .31872 > .13317: Flags [P.], seq 1270:1318, ack 734, win 240, options [nop,nop,TS val 1617125573 ecr 1313933708], length 48 16:49:22.827600 IP .13317 > .31872: Flags [.], ack 1318, win 248, options [nop,nop,TS val 1313933734 ecr 1617125573], length 0 16:49:22.833028 IP .13317 > .31872: Flags [P.], seq 734:1122, ack 1318, win 248, options [nop,nop,TS val 1313933739 ecr 1617125573], length 388 16:49:22.858455 IP .31872 > .13317: Flags [.], ack 1122, win 251, options [nop,nop,TS val 1617125604 ecr 1313933739], length 0 16:49:22.865866 IP .31872 > .13317: Flags [P.], seq 1318:1334, ack 1122, win 251, options [nop,nop,TS val 1617125612 ecr 1313933739], length 16 16:49:22.906865 IP .13317 > .31872: Flags [.], ack 1334, win 248, options [nop,nop,TS val 1313933813 ecr 1617125612], length 0 16:49:22.944474 IP .31872 > .13317: Flags [P.], seq 1334:1386, ack 1122, win 251, options [nop,nop,TS val 1617125678 ecr 1313933813], length 52 16:49:22.944497 IP .13317 > .31872: Flags [.], ack 1386, win 248, options [nop,nop,TS val 1313933851 ecr 1617125678], length 0 16:49:22.944747 IP .13317 > .31872: Flags [P.], seq 1122:1174, ack 1386, win 248, options [nop,nop,TS val 1313933851 ecr 1617125678], length 52 16:49:22.971083 IP .31872 > .13317: Flags [P.], seq 1386:1454, ack 1174, win 251, options [nop,nop,TS val 1617125716 ecr 1313933851], length 68 16:49:22.971607 IP .13317 > .31872: Flags [P.], seq 1174:1258, ack 1454, win 248, options [nop,nop,TS val 1313933878 ecr 1617125716], length 84 16:49:22.998201 IP .31872 > .13317: Flags [P.], seq 
1454:2082, ack 1258, win 251, options [nop,nop,TS val 1617125742 ecr 1313933878], length 628 16:49:22.998987 IP .13317 > .31872: Flags [P.], seq 1258:1838, ack 2082, win 268, options [nop,nop,TS val 1313933905 ecr 1617125742], length 580 16:49:23.065102 IP .31872 > .13317: Flags [.], ack 1838, win 262, options [nop,nop,TS val 1617125810 ecr 1313933905], length 0 16:49:24.811297 IP .31872 > .13317: Flags [P.], seq 2082:3254, ack 1838, win 262, options [nop,nop,TS val 1617127527 ecr 1313933905], length 1172 16:49:24.812562 IP .13317 > .31872: Flags [P.], seq 1838:1874, ack 3254, win 287, options [nop,nop,TS val 1313935719 ecr 1617127527], length 36 16:49:24.838210 IP .31872 > .13317: Flags [.], ack 1874, win 262, options [nop,nop,TS val 1617127583 ecr 1313935719], length 0 16:49:24.838236 IP .13317 > .31872: Flags [P.], seq 1874:2534, ack 3254, win 287, options [nop,nop,TS val 1313935744 ecr 1617127583], length 660 16:49:24.839175 I
Re: tcp hang when socket fills up ?
On 04/13/2018 06:09 PM, Dominique Martinet wrote:
> Thank you for the replies,
>
> Eric Dumazet wrote on Fri, Apr 13, 2018:
>> There is no way a regular TCP stack (including linux) could send the
>> following frame.
>>
>>> 16:49:27.048760 IP .13317 > .31872:
>>> Flags [.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val
>>> 1313937955 ecr 1617129473], length 1374
>>
>> So something is mangling the packet, maybe NAT or something.
>
> The pcap was produced on the server which emitted the frame, so it
> should be exactly as the server intended it to be without any mangling
> involved?
> If you could point at what strikes you as odd I can check if the same
> happened on other hang traces I might still have. Is it just that it
> replays a very old seq?
> (if it's odd wrt client packets, the same trace as captured on client is
> at the end of this mail)

Ah sorry, your trace was truncated, we need more packets _before_ the
excerpt.

That might be caused by some TS val/ecr breakage :

Many acks were received by the server tcpdump,
but none of them was accepted by the TCP stack, for some reason.

Try to disable TCP timestamps; it will give some hint if the bug does not
reproduce.
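For completeness, a runtime sketch of the suggested experiment (server side, as root; this only affects connections established after the change and is not persisted across reboots):

```shell
# Disable TCP timestamps for newly established connections:
sysctl -w net.ipv4.tcp_timestamps=0

# ...try to reproduce the hang, then restore the default:
sysctl -w net.ipv4.tcp_timestamps=1
```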
Re: tcp hang when socket fills up ?
Thank you for the replies,

Eric Dumazet wrote on Fri, Apr 13, 2018:
> There is no way a regular TCP stack (including linux) could send the
> following frame.
>
>> 16:49:27.048760 IP .13317 > .31872:
>> Flags [.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val
>> 1313937955 ecr 1617129473], length 1374
>
> So something is mangling the packet, maybe NAT or something.

The pcap was produced on the server which emitted the frame, so it should
be exactly as the server intended it to be, without any mangling involved?
If you could point at what strikes you as odd, I can check whether the
same happened on other hang traces I might still have. Is it just that it
replays a very old seq?
(if it's odd wrt client packets, the same trace as captured on the client
is at the end of this mail)

Michal Kubecek wrote on Fri, Apr 13, 2018:
> The way I read this, server doesn't see anything sent by client since
> some point shortly before the dump shown here starts (about 5ms). It
> keeps sending data until 16:49:26.807754 (seq 77346) and then keeps
> resending first (from its point of view) unacknowledged segment
> (32004:33378) in exponentially growing intervals and ignores replies
> from the client. Client apparently receives these retransmits and
> replies with dupack (with D-SACK for 32004:33378) and retransmits of its
> own first unacknowledged segment (4190:4226).

I have the same understanding.

> As we can see the client packets in the dump (which was taken on
> server), it would mean they are dropped after the point where packet
> socket would pass them to libpcap. That might be e.g. netfilter
> (conntrack?) or the IP/TCP code detecting them to be invalid for some
> reason (which is not obvious to me from the dump above).

In my second mail, I got as far as `tcp_ack`, where the replays didn't get
considered for ack because they were seen as having the same ack number as
the latest seen.
I'm not sure why it didn't get considered the first time; I need to add
more instrumentation.

> There are two strange points:
>
> 1. While client apparently responds to all server retransmits, it does
> so with TSecr=1313937714 (matching server packet from 16:49:26.807754)
> rather than TSval of the packets it dupacks (1313937955 through
> 1313953050). This doesn't seem to follow the rules of RFC 7323
> Section 4.3.

I have a pcap of the same sequence on the client as well, and it does get
the packets with higher val= unchanged, so that might be something to look
at. I'll post a similarly formatted tcpdump output at the end of this
mail; the timestamps and various numbers in the traces match with my
original mail.

> 2. Window size values in acks from client grow with each acked packet by
> 22-23 (which might be ~1400 with scaling factor of 64). I would rather
> expect advertised receive window to go down by 1374 with each received
> segment and to grow by bigger steps with each read()/recv() call from
> application.

I have a very poor understanding of this part of the protocol/code, but I
think the window grows when a socket starts receiving a bulk of data.
Given this hangs very fast, there were only individual small packets up
till this trace, where data comes in quickly, and I think this might just
be window scaling? I might very well be wrong though..

> We might get more insight if we saw the same connection on both sides.
> From what was presented here, my guess is that
>
> (1) received packets are dropped somewhere on server side (after they
> are cloned for the packet socket)

I'd agree for received replays, but I don't get why the first time that
ack was received didn't work.
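For reference, the arithmetic behind Michal's window-scale reading above (a sanity check added for illustration, not from the thread): the SYN exchange in the traces negotiated wscale 7, i.e. a multiplier of 2^7 = 128, while a growth of 22–23 units per 1374-byte segment would fit a multiplier of 64 (wscale 6):

```shell
echo $((22 * 64))     # 1408: the observed +22/segment read at factor 64 ("~1400")
echo $((1374 / 128))  # 10: one 1374-byte segment in window units at wscale 7
echo $((918 << 7))    # 117504: "win 918" from the client acks, in bytes, at wscale 7
```

At the negotiated wscale 7, each +22 step would be ~2800 bytes, i.e. the advertised window growing faster than data arrives, which would be consistent with the receive-buffer-growth explanation rather than with per-segment accounting.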
> (2) there is something wrong either on client side or between the two
> hosts (there is at least a NAT, IIUC)

My minimal reproducer involves the internet, so there is no telling what
is in the middle, but there are at least two NATs, and the connection is
established through two connect() calls with carefully crafted
source/destination ports, if that makes a difference.

I've done a bit more fiddling with netem/tbf and a single NAT, but I
still can't reproduce it on a local network; I need to spend more time
trying, as this is very frustrating because I don't control either of the
two NATs in the real case.

I don't think the NATs are culprits, though - looking at the traces I
don't see any reordering within a direction, missing packets or fiddling
with values, so this seems fairly straightforward.

Here are the client-side traces. pcap files are still available on
request, if you prefer to look at it in wireshark or another tool.

16:49:26.726793 IP .13317 > .31872: Flags [.], seq 40248:41622, ack 4190,
    win 307, options [nop,nop,TS val 1313937609 ecr 1617129440], length 1374
16:49:26.726798 IP .31872 > .13317: Flags [.], ack 41622, win 918,
    options [nop,nop,TS val 1617129478 ecr 1313937609], length 0
16:49:26.726802 IP .13317 > .31872: Flags [.], seq 41622:42996, ack 4190,
    win 307, options [nop,nop,TS val 1313937609 ecr 161712
Re: tcp hang when socket fills up ?
On Fri, Apr 06, 2018 at 11:07:20AM +0200, Dominique Martinet wrote: > 16:49:26.735042 IP .13317 > .31872: Flags > [.], seq 70476:71850, ack 4190, win 307, options [nop,nop,TS val 1313937641 > ecr 1617129473], length 1374 > 16:49:26.735046 IP .13317 > .31872: Flags > [.], seq 71850:73224, ack 4190, win 307, options [nop,nop,TS val 1313937641 > ecr 1617129473], length 1374 > 16:49:26.735334 IP .31872 > .13317: Flags > [.], ack 41622, win 918, options [nop,nop,TS val 1617129478 ecr 1313937609], > length 0 > 16:49:26.736005 IP .31872 > .13317: Flags > [.], ack 42996, win 940, options [nop,nop,TS val 1617129478 ecr 1313937609], > length 0 > 16:49:26.736402 IP .13317 > .31872: Flags > [.], seq 73224:74598, ack 4190, win 307, options [nop,nop,TS val 1313937643 > ecr 1617129473], length 1374 > 16:49:26.736408 IP .13317 > .31872: Flags > [.], seq 74598:75972, ack 4190, win 307, options [nop,nop,TS val 1313937643 > ecr 1617129473], length 1374 > 16:49:26.738561 IP .31872 > .13317: Flags > [.], ack 44370, win 963, options [nop,nop,TS val 1617129482 ecr 1313937616], > length 0 > 16:49:26.739539 IP .31872 > .13317: Flags > [.], ack 45744, win 986, options [nop,nop,TS val 1617129482 ecr 1313937616], > length 0 > 16:49:26.739882 IP .31872 > .13317: Flags > [.], ack 47118, win 1008, options [nop,nop,TS val 1617129484 ecr 1313937617], > length 0 > 16:49:26.740255 IP .31872 > .13317: Flags > [.], ack 48492, win 1031, options [nop,nop,TS val 1617129484 ecr 1313937617], > length 0 > 16:49:26.746756 IP .31872 > .13317: Flags > [.], ack 49866, win 1053, options [nop,nop,TS val 1617129493 ecr 1313937627], > length 0 > 16:49:26.747923 IP .31872 > .13317: Flags > [.], ack 51240, win 1076, options [nop,nop,TS val 1617129494 ecr 1313937627], > length 0 > 16:49:26.749083 IP .31872 > .13317: Flags > [.], ack 52614, win 1099, options [nop,nop,TS val 1617129495 ecr 1313937629], > length 0 > 16:49:26.750171 IP .31872 > .13317: Flags > [.], ack 53988, win 1121, options [nop,nop,TS val 1617129496 
ecr 1313937629], > length 0 > 16:49:26.750808 IP .31872 > .13317: Flags > [.], ack 55362, win 1144, options [nop,nop,TS val 1617129497 ecr 1313937629], > length 0 > 16:49:26.754648 IP .31872 > .13317: Flags > [.], ack 56736, win 1167, options [nop,nop,TS val 1617129500 ecr 1313937629], > length 0 > 16:49:26.755985 IP .31872 > .13317: Flags > [.], ack 58110, win 1189, options [nop,nop,TS val 1617129501 ecr 1313937630], > length 0 > 16:49:26.758513 IP .31872 > .13317: Flags > [.], ack 59484, win 1212, options [nop,nop,TS val 1617129502 ecr 1313937630], > length 0 > 16:49:26.759096 IP .31872 > .13317: Flags > [.], ack 60858, win 1234, options [nop,nop,TS val 1617129503 ecr 1313937635], > length 0 > 16:49:26.759421 IP .31872 > .13317: Flags > [.], ack 62232, win 1257, options [nop,nop,TS val 1617129503 ecr 1313937635], > length 0 > 16:49:26.759755 IP .31872 > .13317: Flags > [.], ack 63606, win 1280, options [nop,nop,TS val 1617129504 ecr 1313937636], > length 0 > 16:49:26.760653 IP .31872 > .13317: Flags > [.], ack 64980, win 1302, options [nop,nop,TS val 1617129505 ecr 1313937636], > length 0 > 16:49:26.761453 IP .31872 > .13317: Flags > [.], ack 66354, win 1325, options [nop,nop,TS val 1617129506 ecr 1313937638], > length 0 > 16:49:26.762199 IP .31872 > .13317: Flags > [.], ack 67728, win 1348, options [nop,nop,TS val 1617129507 ecr 1313937638], > length 0 > 16:49:26.763547 IP .31872 > .13317: Flags > [P.], seq 4190:4226, ack 67728, win 1348, options [nop,nop,TS val 1617129507 > ecr 1313937638], length 36 > 16:49:26.763553 IP .31872 > .13317: Flags > [.], ack 70476, win 1393, options [nop,nop,TS val 1617129508 ecr 1313937639], > length 0 > 16:49:26.764298 IP .31872 > .13317: Flags > [.], ack 73224, win 1438, options [nop,nop,TS val 1617129509 ecr 1313937641], > length 0 > 16:49:26.764676 IP .31872 > .13317: Flags > [.], ack 75972, win 1444, options [nop,nop,TS val 1617129510 ecr 1313937643], > length 0 > 16:49:26.807754 IP .13317 > .31872: Flags > [.], seq 
75972:77346, ack 4190, win 307, options [nop,nop,TS val 1313937714 > ecr 1617129473], length 1374 > 16:49:26.876467 IP .31872 > .13317: Flags > [.], ack 77346, win 1444, options [nop,nop,TS val 1617129620 ecr 1313937714], > length 0 > 16:49:27.048760 IP .13317 > .31872: Flags > [.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313937955 > ecr 1617129473], length 1374 > 16:49:27.051791 IP .31872 > .13317: Flags > [P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617129762 > ecr 1313937714], length 36 > 16:49:27.076444 IP .31872 > .13317: Flags > [.], ack 77346, win 1444, options [nop,nop,TS val 1617129822 ecr > 1313937714,nop,nop,sack 1 {32004:33378}], length 0 > 16:49:27.371182 IP .31872 > .13317: Flags > [P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617130018 > ecr 1313937714], length 36 > 16:49:27.519862 IP .13317 > .31872: Flags > [.], seq 32004:33378, ack 4190, win 307, o
Re: tcp hang when socket fills up ?
On 04/06/2018 02:07 AM, Dominique Martinet wrote:
> (current kernel: vanilla 4.14.29)
>
> I've been running into troubles recently where a socket "fills up" (as
> in, select() will no longer return it in its outfd / attempting to write
> to it after setting it to NONBLOCK will return -EWOULDBLOCK) and it
> doesn't seem to ever "unfill" until the tcp connection timeout.
>
> The previous time I pushed it down to the application for not trying to
> read the socket either, as I assume the buffers could be shared and
> just waiting won't take data out, but this time I'm a bit more
> skeptical as socat waits for the fd in both read and write...
>
> Let me take a minute to describe my setup; I don't think that how the
> socket was created matters, but it might be interesting:
> - I have two computers behind NATs, no port forwarding on either side
> - One (call it C for client) runs ssh with a proxycommand ncat/socat to
>   control the source port, e.g.
>   $ ssh -o ProxyCommand="socat stdio tcp:<server ip>:<port>,sourceport=<port>" server
> - The server runs another socat to connect to that and forwards to ssh
>   locally, e.g.
>   $ socat tcp:<client ip>:<port>,sourceport=<port> tcp:127.0.0.1:22
>
> (yes, both are connect() calls, and that just works™ thanks to initial
> syn replay and conntrack on routers)
>
> When things stall, the first socat is in a select with both fds in
> reading, so it's waiting for data.
> The second socat has data coming from ssh and is in a select with the
> client-facing socket in both read and write - that select never returns,
> so the socket is not available for reading or writing and does not free
> up until the connection eventually times out a few minutes later.
>
> At this point, I only see tcp replays in tcpdump/wireshark. I've
> compared dumps from both sides and there are no lost packets, only
> reordering - there always is a batch of acks that were sent regularly
> coming in shortly before the hang.
> Here's the trace on the server:
>
> 16:49:26.735042 IP .13317 > .31872: Flags [.], seq 70476:71850, ack 4190, win 307, options [nop,nop,TS val 1313937641 ecr 1617129473], length 1374
> 16:49:26.735046 IP .13317 > .31872: Flags [.], seq 71850:73224, ack 4190, win 307, options [nop,nop,TS val 1313937641 ecr 1617129473], length 1374
> 16:49:26.735334 IP .31872 > .13317: Flags [.], ack 41622, win 918, options [nop,nop,TS val 1617129478 ecr 1313937609], length 0
> 16:49:26.736005 IP .31872 > .13317: Flags [.], ack 42996, win 940, options [nop,nop,TS val 1617129478 ecr 1313937609], length 0
> 16:49:26.736402 IP .13317 > .31872: Flags [.], seq 73224:74598, ack 4190, win 307, options [nop,nop,TS val 1313937643 ecr 1617129473], length 1374
> 16:49:26.736408 IP .13317 > .31872: Flags [.], seq 74598:75972, ack 4190, win 307, options [nop,nop,TS val 1313937643 ecr 1617129473], length 1374
> 16:49:26.738561 IP .31872 > .13317: Flags [.], ack 44370, win 963, options [nop,nop,TS val 1617129482 ecr 1313937616], length 0
> 16:49:26.739539 IP .31872 > .13317: Flags [.], ack 45744, win 986, options [nop,nop,TS val 1617129482 ecr 1313937616], length 0
> 16:49:26.739882 IP .31872 > .13317: Flags [.], ack 47118, win 1008, options [nop,nop,TS val 1617129484 ecr 1313937617], length 0
> 16:49:26.740255 IP .31872 > .13317: Flags [.], ack 48492, win 1031, options [nop,nop,TS val 1617129484 ecr 1313937617], length 0
> 16:49:26.746756 IP .31872 > .13317: Flags [.], ack 49866, win 1053, options [nop,nop,TS val 1617129493 ecr 1313937627], length 0
> 16:49:26.747923 IP .31872 > .13317: Flags [.], ack 51240, win 1076, options [nop,nop,TS val 1617129494 ecr 1313937627], length 0
> 16:49:26.749083 IP .31872 > .13317: Flags [.], ack 52614, win 1099, options [nop,nop,TS val 1617129495 ecr 1313937629], length 0
> 16:49:26.750171 IP .31872 > .13317: Flags [.], ack 53988, win 1121, options [nop,nop,TS val 1617129496 ecr 1313937629], length 0
> 16:49:26.750808 IP .31872 > .13317: Flags [.], ack 55362, win 1144, options [nop,nop,TS val 1617129497 ecr 1313937629], length 0
> 16:49:26.754648 IP .31872 > .13317: Flags [.], ack 56736, win 1167, options [nop,nop,TS val 1617129500 ecr 1313937629], length 0
> 16:49:26.755985 IP .31872 > .13317: Flags [.], ack 58110, win 1189, options [nop,nop,TS val 1617129501 ecr 1313937630], length 0
> 16:49:26.758513 IP .31872 > .13317: Flags [.], ack 59484, win 1212, options [nop,nop,TS val 1617129502 ecr 1313937630], length 0
> 16:49:26.759096 IP .31872 > .13317: Flags [.], ack 60858, win 1234, options [nop,nop,TS val 1617129503 ecr 1313937635], length 0
> 16:49:26.759421 IP .31872 > .13317: Flags [.], ack 62232, win 1257, options [nop,nop,TS val 1617129503 ecr 1313937635], length 0
> 16:49:26.759755 IP .31872 > .13317: Flags [.], ack 63606, win 1280, options [nop,nop,TS val 1617129504 ecr 1313937636], length 0
> 16:49:26.760653 IP .31872 > .13317: Flags [.], ack 64980, win 1302, options [nop,nop,TS val 1
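[The sourceport= option in the quoted setup boils down to bind()ing the client socket to a fixed local port before connect(), so both NATed peers can predict each other's 4-tuple. A minimal sketch of that one step; the helper name and port number are illustrative only, and error handling is reduced to returning -1:]

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Bind a TCP client socket to a chosen local port before connect(),
 * roughly what socat's sourceport= does. Returns the fd, or -1. */
static int bind_source_port(uint16_t port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    int one = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);        /* the fixed source port */

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0) {
        close(fd);
        return -1;
    }
    return fd;  /* caller would now connect() to the peer */
}
```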
Re: tcp hang when socket fills up ?
Note this is mostly me talking to myself, but in case anyone else hits
the same issue I might as well post my meagre progress.

Dominique Martinet wrote on Fri, Apr 06, 2018:
> (current kernel: vanilla 4.14.29)

Reproduced in a fedora VM on that host with a 4.16.1-300.fc28.x86_64
kernel, since this one has DEBUG_INFO=y and I was lazy (but I haven't
seen any patch about that kind of stuff recently, so it's probably
still valid).

The other main difference is the qdisc: the VM is using fq_codel, while
the host is directly on wireless, so mq with 4 pfifo_fast queues - it
is harder to reproduce on the VM (even on another VM with the same
kernel), so I'd put the difference down to the qdisc, but I was able to
reproduce with both ultimately.
(update: it actually was still fairly easy to reproduce until it got
later (coworkers left?), going from ~5-15s to reproduce to multiple
minutes, so this likely depends on net quality a lot. I couldn't
reproduce by fiddling with netem on a local network though...)

> [please find previous email for setup/tcpdump output]

So I have a crash dump with a socket / inet_sock that are blocked, and
since this is a VM I even get gdb as a bonus.

With the crash dump, I can confirm that the socket is not available for
writing (sk->sk_sndbuf - sk->sk_wmem_queued < sk->sk_wmem_queued >> 1),
but that doesn't help much if I can't tell why we're not taking acks
in.

With gdb, I set a breakpoint on tcp_ack (net/ipv4/tcp_input.c), as I
think that's the function that should handle my ack, and it does get
hit on the replayed packet.
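[For reference, the writeability condition quoted above mirrors the kernel's sk_stream_wspace()/sk_stream_min_wspace() pair: the socket counts as writeable only while free send-buffer space is at least half of what is already queued. A standalone sketch of that arithmetic, with plain ints standing in for the real struct sock fields:]

```c
#include <stdbool.h>

/* Simplified form of the check above: writeable iff
 * sk_sndbuf - sk_wmem_queued >= sk_wmem_queued >> 1. */
static bool sock_writeable_sketch(int sk_sndbuf, int sk_wmem_queued)
{
    int wspace = sk_sndbuf - sk_wmem_queued;  /* free send-buffer space */
    int min_wspace = sk_wmem_queued >> 1;     /* half of the queued bytes */
    return wspace >= min_wspace;
}
```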
First, abusing next, the flow seems to be something like (I folded
if/else not taken):

static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
{
	struct inet_connection_sock *icsk = inet_csk(sk);
	struct tcp_sock *tp = tcp_sk(sk);
	struct tcp_sacktag_state sack_state;
	struct rate_sample rs = { .prior_delivered = 0 };
	u32 prior_snd_una = tp->snd_una;
	bool is_sack_reneg = tp->is_sack_reneg;
	u32 ack_seq = TCP_SKB_CB(skb)->seq;
	u32 ack = TCP_SKB_CB(skb)->ack_seq;
	bool is_dupack = false;
	int prior_packets = tp->packets_out;
	u32 delivered = tp->delivered;
	u32 lost = tp->lost;
	int rexmit = REXMIT_NONE; /* Flag to (re)transmit to recover losses */
	u32 prior_fack;

	sack_state.first_sackt = 0;
	sack_state.rate = &rs;

	/* We very likely will need to access rtx queue. */
	prefetch(sk->tcp_rtx_queue.rb_node);

	/* If the ack is older than previous acks
	 * then we can probably ignore it.
	 */
	if (before(ack, prior_snd_una)) { }

	/* If the ack includes data we haven't sent yet, discard
	 * this segment (RFC793 Section 3.9).
	 */
	if (after(ack, tp->snd_nxt))

	if (after(ack, prior_snd_una)) { }

	prior_fack = tcp_is_sack(tp) ? tcp_highest_sack_seq(tp) : tp->snd_una;
	rs.prior_in_flight = tcp_packets_in_flight(tp);

	/* ts_recent update must be made after we are sure that the packet
	 * is in window.
	 */
	if (flag & FLAG_UPDATE_TS_RECENT)

	if (!(flag & FLAG_SLOWPATH) && after(ack, prior_snd_una)) {
	} else {
		u32 ack_ev_flags = CA_ACK_SLOWPATH;

		if (ack_seq != TCP_SKB_CB(skb)->end_seq)
		else
			NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPPUREACKS);

		flag |= tcp_ack_update_window(sk, skb, ack, ack_seq);

		if (TCP_SKB_CB(skb)->sacked)

		if (tcp_ecn_rcv_ecn_echo(tp, tcp_hdr(skb))) { }

		if (flag & FLAG_WIN_UPDATE)

		tcp_in_ack_event(sk, ack_ev_flags);
	}

	/* We passed data and got it acked, remove any soft error
	 * log. Something worked...
	 */
	sk->sk_err_soft = 0;
	icsk->icsk_probes_out = 0;
	tp->rcv_tstamp = tcp_jiffies32;
	if (!prior_packets)
		goto no_queue;

no_queue:
	/* If data was DSACKed, see if we can undo a cwnd reduction. */
	if (flag & FLAG_DSACKING_ACK)
		tcp_fastretrans_alert(sk, prior_snd_una, is_dupack, &flag,
				      &rexmit);
	/* If this ack opens up a zero window, clear backoff.  It was
	 * being used to time the probes, and is probably far higher than
	 * it needs to be for normal retransmission.
	 */
	tcp_ack_probe(sk);

	if (tp->tlp_high_seq)
		tcp_process_tlp_ack(sk, ack, flag);
	return 1;

And here is 'info local' towards the end of the function:

icsk = 0x88001740b800
tp = 0x88001740b800
sack_state = {reord = 84, first_sackt = 0, last_sackt = 0, rate = 0x88003fc83ae0, fla
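[In passing: icsk and tp printing the same address (0x88001740b800) is expected, not corruption. struct tcp_sock begins with an embedded struct inet_connection_sock, which itself begins with the generic socket structures, so tcp_sk(sk) and inet_csk(sk) are just differently-typed views of the same pointer. A toy mirror of that layout (the toy_* names are mine, not the kernel's):]

```c
#include <stddef.h>

/* Toy mirror of the kernel's nesting: struct tcp_sock starts with
 * struct inet_connection_sock, which starts with the base socket,
 * so all the container casts alias the same address. */
struct toy_sock           { int sk_state; };
struct toy_inet_conn_sock { struct toy_sock icsk_inet; int icsk_probes_out; };
struct toy_tcp_sock       { struct toy_inet_conn_sock inet_conn; int snd_una; };
```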