Re: tcp hang when socket fills up ?

2018-04-18 Thread Jozsef Kadlecsik
On Wed, 18 Apr 2018, Dominique Martinet wrote:

> Jozsef Kadlecsik wrote on Wed, Apr 18, 2018:
> > Thanks for the testing! One more line is required, however: we have to get 
> > the assured bit set for the connection, see the new patch below.
> 
> I think it actually was better before. If I understand things correctly,
> at this point (when we get into the TCP_CONNTRACK_SYN_RECV case) we will
> have seen SYN(out) SYN(in) SYNACK(out), but not the final ACK(in) yet.
> 
> Leaving the old state as it was will not set the assured bit, but that will
> be set on the next packet, because old_state == new_state == established
> at that point and the connection will really be set up then.

Yes, you are right: the first patch is better than the second one. 
Overthinking :-)

Best regards,
Jozsef
-
E-mail  : kad...@blackhole.kfki.hu, kadlecsik.joz...@wigner.mta.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : Wigner Research Centre for Physics, Hungarian Academy of Sciences
  H-1525 Budapest 114, POB. 49, Hungary


Re: tcp hang when socket fills up ?

2018-04-18 Thread Dominique Martinet
Jozsef Kadlecsik wrote on Wed, Apr 18, 2018:
> Thanks for the testing! One more line is required, however: we have to get 
> the assured bit set for the connection, see the new patch below.

I think it actually was better before. If I understand things correctly,
at this point (when we get into the TCP_CONNTRACK_SYN_RECV case) we will
have seen SYN(out) SYN(in) SYNACK(out), but not the final ACK(in) yet.

Leaving the old state as it was will not set the assured bit, but that will
be set on the next packet, because old_state == new_state == established
at that point and the connection will really be set up then.


I don't think anything will blow up either way, but strictly
speaking I'm more comfortable with the former.
I'll test the new patch regardless, I left work so can't reproduce
anymore but will yell tomorrow if it does explode ;)


> The tcp_conntracks state table could be fixed with introducing a new 
> state, but that part is exposed to userspace (ctnetlink) and ugly 
> compatibility code would be required for backward compatibility.

I agree a new state is more work than it is worth; I'm happy to leave it
as is.

-- 
Dominique Martinet | Asmadeus


Re: tcp hang when socket fills up ?

2018-04-18 Thread Jozsef Kadlecsik
On Wed, 18 Apr 2018, Dominique Martinet wrote:

> Dominique Martinet wrote on Wed, Apr 18, 2018:
> > Jozsef Kadlecsik wrote on Wed, Apr 18, 2018:
> > > Yes, the state transition is wrong for simultaneous open, because the 
> > > tcp_conntracks table is not (cannot be) smart enough. Could you verify 
> > > the 
> > > next untested patch?
> > 
> > Thanks for the patch; I'll give it a try (probably won't make it today
> > so will report tomorrow)
> 
> Actually I had time; I can confirm (with added printks) that we did get
> into the if branch that was pointed at, and we no longer get there now.
> The connection no longer gets into an invalid state, so that looks like it nailed it.
>
> I'm now confused about what this has to do with tcp_timestamps though, since
> turning that off also seemed to work around the issue, but if we get
> something like that in I'll be happy anyway.

Thanks for the testing! One more line is required, however: we have to get 
the assured bit set for the connection, see the new patch below.

The tcp_conntracks state table could be fixed with introducing a new 
state, but that part is exposed to userspace (ctnetlink) and ugly 
compatibility code would be required for backward compatibility.
 
diff --git a/include/uapi/linux/netfilter/nf_conntrack_tcp.h b/include/uapi/linux/netfilter/nf_conntrack_tcp.h
index 74b9115..bcba72d 100644
--- a/include/uapi/linux/netfilter/nf_conntrack_tcp.h
+++ b/include/uapi/linux/netfilter/nf_conntrack_tcp.h
@@ -46,6 +46,9 @@ enum tcp_conntrack {
 /* Marks possibility for expected RFC5961 challenge ACK */
 #define IP_CT_EXP_CHALLENGE_ACK 0x40
 
+/* Simultaneous open initialized */
+#define IP_CT_TCP_SIMULTANEOUS_OPEN 0x80
+
 struct nf_ct_tcp_flags {
__u8 flags;
__u8 mask;
diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c
index e97cdc1..2c1fc7e 100644
--- a/net/netfilter/nf_conntrack_proto_tcp.c
+++ b/net/netfilter/nf_conntrack_proto_tcp.c
@@ -981,6 +981,20 @@ static int tcp_packet(struct nf_conn *ct,
return NF_ACCEPT; /* Don't change state */
}
break;
+   case TCP_CONNTRACK_SYN_SENT2:
+   /* tcp_conntracks table is not smart enough to handle
+* simultaneous open.
+*/
+   ct->proto.tcp.last_flags |= IP_CT_TCP_SIMULTANEOUS_OPEN;
+   break;
+   case TCP_CONNTRACK_SYN_RECV:
+   if (dir == IP_CT_DIR_REPLY && index == TCP_ACK_SET &&
+   ct->proto.tcp.last_flags & IP_CT_TCP_SIMULTANEOUS_OPEN) {
+   /* We want to set the assured bit */
+   old_state = TCP_CONNTRACK_SYN_RECV;
+   new_state = TCP_CONNTRACK_ESTABLISHED;
+   }
+   break;
case TCP_CONNTRACK_CLOSE:
if (index == TCP_RST_SET
&& (ct->proto.tcp.seen[!dir].flags & IP_CT_TCP_FLAG_MAXACK_SET)

Best regards,
Jozsef
-
E-mail  : kad...@blackhole.kfki.hu, kadlecsik.joz...@wigner.mta.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : Wigner Research Centre for Physics, Hungarian Academy of Sciences
  H-1525 Budapest 114, POB. 49, Hungary


Re: tcp hang when socket fills up ?

2018-04-18 Thread Dominique Martinet
Dominique Martinet wrote on Wed, Apr 18, 2018:
> Jozsef Kadlecsik wrote on Wed, Apr 18, 2018:
> > Yes, the state transition is wrong for simultaneous open, because the 
> > tcp_conntracks table is not (cannot be) smart enough. Could you verify the 
> > next untested patch?
> 
> Thanks for the patch; I'll give it a try (probably won't make it today
> so will report tomorrow)

Actually I had time; I can confirm (with added printks) that we did get
into the if branch that was pointed at, and we no longer get there now.
The connection no longer gets into an invalid state, so that looks like it
nailed it.

I'm now confused about what this has to do with tcp_timestamps though, since
turning that off also seemed to work around the issue, but if we get
something like that in I'll be happy anyway.


Big props to everyone involved; I would have taken much longer alone.
-- 
Dominique Martinet | Asmadeus


Re: tcp hang when socket fills up ?

2018-04-18 Thread Dominique Martinet
Jozsef Kadlecsik wrote on Wed, Apr 18, 2018:
> Yes, the state transition is wrong for simultaneous open, because the 
> tcp_conntracks table is not (cannot be) smart enough. Could you verify the 
> next untested patch?

Thanks for the patch; I'll give it a try (probably won't make it today
so will report tomorrow)

-- 
Dominique Martinet | Asmadeus


Re: tcp hang when socket fills up ?

2018-04-18 Thread Jozsef Kadlecsik
Hi,

On Tue, 17 Apr 2018, Florian Westphal wrote:

> Dominique Martinet  wrote:
> 
> [ CC Jozsef ]
> 
> > Could it have something to do with the way I set up the connection?
> > I don't think the "both remotes call connect() with carefully selected
> > source/dest port" is a very common case..
> > 
> > If you look at the tcpdump outputs I attached the sequence usually is
> > something like
> >  server > client SYN
> >  client > server SYN
> >  server > client SYNACK
> >  client > server ACK
> > 
> > ultimately it IS a connection, but with an extra SYN packet in front of
> > it (that first SYN opens up the conntrack of the nat so that the
> > client's syn can come in, the client's conntrack will be that of a
> > normal connection since its first SYN goes in directly after the
> > server's (it didn't see the server's SYN))
> > 
> > Looking at my logs again, I'm seeing the same as you:
> > 
> > This looks like the actual SYN/SYN/SYNACK/ACK:
> >  - 14.364090 seq=505004283 likely SYN coming out of server
> >  - 14.661731 seq=1913287797 on next line it says receiver
> > end=505004284 so likely the matching SYN from client
> > Which this time gets a proper SYNACK from server:
> > 14.662020 seq=505004283 ack=1913287798
> > And following final dataless ACK:
> > 14.687570 seq=1913287798 ack=505004284
> > 
> > Then as you point out some data ACK, where the scale poofs:
> > 14.688762 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 
> > end=1913287819
> > 14.688793 tcp_in_window: sender end=1913287798 maxend=1913316998 
> > maxwin=29312 scale=7 receiver end=505004284 maxend=505033596 maxwin=29200 
> > scale=7
> > 14.688824 tcp_in_window: 
> > 14.688852 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 
> > end=1913287819
> > 14.62 tcp_in_window: sender end=1913287819 maxend=1913287819 maxwin=229 
> > scale=0 receiver end=505004284 maxend=505033596 maxwin=29200 scale=7
> >
> > As you say, only tcp_options() will clear only one side of the scales.
> > We don't have sender->td_maxwin == 0 (printed) so I see no other way
> > than we are in the last else if:
> >  - we have after(end, sender->td_end) (end=1913287819 > sender
> > end=1913287798)
> >  - I assume the tcp state machine must be confused because of the
> > SYN/SYN/SYNACK/ACK pattern and we probably enter the next check, 
> > but since this is a data packet it doesn't have the tcp option for scale
> > thus scale resets.
> 
> Yes, this looks correct. Jozsef, can you please have a look?
> 
> Problem seems to be that conntrack believes that ACK packet
> re-initializes the connection:
> 
>  595 /*
>  596  * RFC 793: "if a TCP is reinitialized ... then it need
>  597  * not wait at all; it must only be sure to use sequence
>  598  * numbers larger than those recently used."
>  599  */
>  600 sender->td_end =
>  601 sender->td_maxend = end;
>  602 sender->td_maxwin = (win == 0 ? 1 : win);
>  603 
>  604 tcp_options(skb, dataoff, tcph, sender);
> 
> and last line clears the scale value (no wscale option in data packet).
> 
> 
> Transitions are:
>  server > client SYN  sNO -> sSS
>  client > server SYN  sSS -> sS2
>  server > client SYNACK   sS2 -> sSR /* here */
>  client > server ACK  sSR -> sES
> 
> SYN/ACK was observed in original direction so we hit
> state->state == TCP_CONNTRACK_SYN_RECV && dir == IP_CT_DIR_REPLY test
> when we see the ack packet and end up in the 'TCP is reinitialized' branch.
> 
> AFAICS, without this, connection would move to sES just fine,
> as the data ack is in window.

Yes, the state transition is wrong for simultaneous open, because the 
tcp_conntracks table is not (cannot be) smart enough. Could you verify the 
next untested patch?

diff --git a/include/uapi/linux/netfilter/nf_conntrack_tcp.h b/include/uapi/linux/netfilter/nf_conntrack_tcp.h
index 74b9115..bcba72d 100644
--- a/include/uapi/linux/netfilter/nf_conntrack_tcp.h
+++ b/include/uapi/linux/netfilter/nf_conntrack_tcp.h
@@ -46,6 +46,9 @@ enum tcp_conntrack {
 /* Marks possibility for expected RFC5961 challenge ACK */
 #define IP_CT_EXP_CHALLENGE_ACK 0x40
 
+/* Simultaneous open initialized */
+#define IP_CT_TCP_SIMULTANEOUS_OPEN 0x80
+
 struct nf_ct_tcp_flags {
__u8 flags;
__u8 mask;
diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c
index e97cdc1..8e67910 100644
--- a/net/netfilter/nf_conntrack_proto_tcp.c
+++ b/net/netfilter/nf_conntrack_proto_tcp.c
@@ -981,6 +981,17 @@ static int tcp_packet(struct nf_conn *ct,
return NF_ACCEPT; /* Don't change state */
}
break;
+   case TCP_CONNTRACK_SYN_SENT2:
+   /* tcp_conntracks table is not smart enough to handle
+* simultaneous open.
+*/
+   ct->proto

Re: tcp hang when socket fills up ?

2018-04-17 Thread Florian Westphal
Dominique Martinet  wrote:

[ CC Jozsef ]

> Could it have something to do with the way I set up the connection?
> I don't think the "both remotes call connect() with carefully selected
> source/dest port" is a very common case..
> 
> If you look at the tcpdump outputs I attached the sequence usually is
> something like
>  server > client SYN
>  client > server SYN
>  server > client SYNACK
>  client > server ACK
> 
> ultimately it IS a connection, but with an extra SYN packet in front of
> it (that first SYN opens up the conntrack of the nat so that the
> client's syn can come in, the client's conntrack will be that of a
> normal connection since its first SYN goes in directly after the
> server's (it didn't see the server's SYN))
> 
> Looking at my logs again, I'm seeing the same as you:
> 
> This looks like the actual SYN/SYN/SYNACK/ACK:
>  - 14.364090 seq=505004283 likely SYN coming out of server
>  - 14.661731 seq=1913287797 on next line it says receiver
> end=505004284 so likely the matching SYN from client
> Which this time gets a proper SYNACK from server:
> 14.662020 seq=505004283 ack=1913287798
> And following final dataless ACK:
> 14.687570 seq=1913287798 ack=505004284
> 
> Then as you point out some data ACK, where the scale poofs:
> 14.688762 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 
> end=1913287819
> 14.688793 tcp_in_window: sender end=1913287798 maxend=1913316998 maxwin=29312 
> scale=7 receiver end=505004284 maxend=505033596 maxwin=29200 scale=7
> 14.688824 tcp_in_window: 
> 14.688852 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 
> end=1913287819
> 14.62 tcp_in_window: sender end=1913287819 maxend=1913287819 maxwin=229 
> scale=0 receiver end=505004284 maxend=505033596 maxwin=29200 scale=7
>
> As you say, only tcp_options() will clear only one side of the scales.
> We don't have sender->td_maxwin == 0 (printed) so I see no other way
> than we are in the last else if:
>  - we have after(end, sender->td_end) (end=1913287819 > sender
> end=1913287798)
>  - I assume the tcp state machine must be confused because of the
> SYN/SYN/SYNACK/ACK pattern and we probably enter the next check, 
> but since this is a data packet it doesn't have the tcp option for scale
> thus scale resets.

Yes, this looks correct. Jozsef, can you please have a look?

Problem seems to be that conntrack believes that ACK packet
re-initializes the connection:

 595 /*
 596  * RFC 793: "if a TCP is reinitialized ... then it need
 597  * not wait at all; it must only be sure to use sequence
 598  * numbers larger than those recently used."
 599  */
 600 sender->td_end =
 601 sender->td_maxend = end;
 602 sender->td_maxwin = (win == 0 ? 1 : win);
 603 
 604 tcp_options(skb, dataoff, tcph, sender);

and last line clears the scale value (no wscale option in data packet).


Transitions are:
 server > client SYN  sNO -> sSS
 client > server SYN  sSS -> sS2
 server > client SYNACK   sS2 -> sSR /* here */
 client > server ACK  sSR -> sES

SYN/ACK was observed in original direction so we hit
state->state == TCP_CONNTRACK_SYN_RECV && dir == IP_CT_DIR_REPLY test
when we see the ack packet and end up in the 'TCP is reinitialized' branch.

AFAICS, without this, connection would move to sES just fine,
as the data ack is in window.


Re: tcp hang when socket fills up ?

2018-04-17 Thread Michal Kubecek
On Tue, Apr 17, 2018 at 02:34:37PM +0200, Dominique Martinet wrote:
> Michal Kubecek wrote on Tue, Apr 17, 2018:
> > Data (21 bytes) packet in reply direction. And somewhere between the
> > first and second debugging print, we ended up with sender scale=0 and
> > that value is then preserved from now on.
> > 
> > The only place between the two debug prints where we could change only
> > one of the td_sender values are the two calls to tcp_options() but
> > neither should be called now unless I missed something. I'll try to
> > think about it some more.
> 
> Could it have something to do with the way I set up the connection?
> I don't think the "both remotes call connect() with carefully selected
> source/dest port" is a very common case..
> 
> If you look at the tcpdump outputs I attached the sequence usually is
> something like
>  server > client SYN
>  client > server SYN
>  server > client SYNACK
>  client > server ACK

This must be what nf_conntrack_proto_tcp.c calls "simultaneous open".

> ultimately it IS a connection, but with an extra SYN packet in front of
> it (that first SYN opens up the conntrack of the nat so that the
> client's syn can come in, the client's conntrack will be that of a
> normal connection since its first SYN goes in directly after the
> server's (it didn't see the server's SYN))
> 
> 
> Looking at my logs again, I'm seeing the same as you:
> 
> This looks like the actual SYN/SYN/SYNACK/ACK:
>  - 14.364090 seq=505004283 likely SYN coming out of server
>  - 14.661731 seq=1913287797 on next line it says receiver
> end=505004284 so likely the matching SYN from client
> Which this time gets a proper SYNACK from server:
> 14.662020 seq=505004283 ack=1913287798
> And following final dataless ACK:
> 14.687570 seq=1913287798 ack=505004284
> 
> Then as you point out some data ACK, where the scale poofs:
> 14.688762 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 
> end=1913287819
> 14.688793 tcp_in_window: sender end=1913287798 maxend=1913316998 maxwin=29312 
> scale=7 receiver end=505004284 maxend=505033596 maxwin=29200 scale=7
> 14.688824 tcp_in_window: 
> 14.688852 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 
> end=1913287819
> 14.62 tcp_in_window: sender end=1913287819 maxend=1913287819 maxwin=229 
> scale=0 receiver end=505004284 maxend=505033596 maxwin=29200 scale=7
> 
> As you say, only tcp_options() will clear only one side of the scales.
> We don't have sender->td_maxwin == 0 (printed) so I see no other way
> than we are in the last else if:
>  - we have after(end, sender->td_end) (end=1913287819 > sender
> end=1913287798)
>  - I assume the tcp state machine must be confused because of the
> SYN/SYN/SYNACK/ACK pattern and we probably enter the next check, 
> but since this is a data packet it doesn't have the tcp option for scale
> thus scale resets.

I agree that sender->td_maxwin is not zero so that the handshake above
probably left the conntrack in TCP_CONNTRACK_SYN_RECV state for some
reason. I'll try to go through the code with the pattern you mentioned
in mind.

Michal Kubecek


Re: tcp hang when socket fills up ?

2018-04-17 Thread Dominique Martinet
Michal Kubecek wrote on Tue, Apr 17, 2018:
> Data (21 bytes) packet in reply direction. And somewhere between the
> first and second debugging print, we ended up with sender scale=0 and
> that value is then preserved from now on.
> 
> The only place between the two debug prints where we could change only
> one of the td_sender values are the two calls to tcp_options() but
> neither should be called now unless I missed something. I'll try to
> think about it some more.

Could it have something to do with the way I set up the connection?
I don't think the "both remotes call connect() with carefully selected
source/dest port" is a very common case..

If you look at the tcpdump outputs I attached the sequence usually is
something like
 server > client SYN
 client > server SYN
 server > client SYNACK
 client > server ACK

ultimately it IS a connection, but with an extra SYN packet in front of
it (that first SYN opens up the conntrack of the nat so that the
client's syn can come in, the client's conntrack will be that of a
normal connection since its first SYN goes in directly after the
server's (it didn't see the server's SYN))


Looking at my logs again, I'm seeing the same as you:

This looks like the actual SYN/SYN/SYNACK/ACK:
 - 14.364090 seq=505004283 likely SYN coming out of server
 - 14.661731 seq=1913287797 on next line it says receiver
end=505004284 so likely the matching SYN from client
Which this time gets a proper SYNACK from server:
14.662020 seq=505004283 ack=1913287798
And following final dataless ACK:
14.687570 seq=1913287798 ack=505004284

Then as you point out some data ACK, where the scale poofs:
14.688762 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 
end=1913287819
14.688793 tcp_in_window: sender end=1913287798 maxend=1913316998 maxwin=29312 
scale=7 receiver end=505004284 maxend=505033596 maxwin=29200 scale=7
14.688824 tcp_in_window: 
14.688852 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 
end=1913287819
14.62 tcp_in_window: sender end=1913287819 maxend=1913287819 maxwin=229 
scale=0 receiver end=505004284 maxend=505033596 maxwin=29200 scale=7

As you say, only tcp_options() will clear only one side of the scales.
We don't have sender->td_maxwin == 0 (printed) so I see no other way
than we are in the last else if:
 - we have after(end, sender->td_end) (end=1913287819 > sender
end=1913287798)
 - I assume the tcp state machine must be confused because of the
SYN/SYN/SYNACK/ACK pattern and we probably enter the next check, 
but since this is a data packet it doesn't have the tcp option for scale
thus scale resets.


At least going through the logs myself helped me follow the process. I'll
sprinkle some carefully crafted logs tomorrow to check whether this is true,
and will let you figure out what is best: trying to preserve the scale if it
was set before, setting a default of 14, or something else.

Thanks!
-- 
Dominique Martinet | Asmadeus


Re: tcp hang when socket fills up ?

2018-04-17 Thread Michal Kubecek
On Mon, Apr 16, 2018 at 10:28:11PM -0700, Eric Dumazet wrote:
> > I turned pr_debug on in tcp_in_window() for another try and it's a bit
> > mangled, because the information is spread over multiple lines and the
> > function is called in parallel, but it looks like I do have some seq > maxend + 1
> > 
> > Although it's weird, the maxend was set WAY earlier apparently?
> > Apr 17 11:13:14 res=1 sender end=1913287798 maxend=1913316998 maxwin=29312 
> > receiver end=505004284 maxend=505033596 maxwin=29200
> > then window decreased drastically e.g. previous ack just before refusal:
> > Apr 17 11:13:53 seq=1913292311 ack=505007789+(0) sack=505007789+(0) win=284 
> > end=1913292311
> > Apr 17 11:13:53 sender end=1913292311 maxend=1913331607 maxwin=284 scale=0 
> > receiver end=505020155 maxend=505033596 maxwin=39296 scale=7
> 
> scale=0 is suspect.
> 
> Really if conntrack does not see SYN SYNACK packets, it should not
> make any window check, since windows can be scaled up to 14 :/

Hm... it doesn't seem to be the case here:

14.364038 tcp_in_window: START
14.364065 tcp_in_window: 
14.364090 seq=505004283 ack=0+(0) sack=0+(0) win=29200 end=505004284
14.364129 tcp_in_window: sender end=505004284 maxend=505004284 maxwin=29200 
scale=7 receiver end=0 maxend=0 maxwin=0 scale=0
14.364158 tcp_in_window: 
14.364185 seq=505004283 ack=0+(0) sack=0+(0) win=29200 end=505004284
14.364210 tcp_in_window: sender end=505004284 maxend=505004284 maxwin=29200 
scale=7 receiver end=0 maxend=0 maxwin=0 scale=0
14.364237 tcp_in_window: I=1 II=1 III=1 IV=1
14.364262 tcp_in_window: res=1 sender end=505004284 maxend=505004284 
maxwin=29200 receiver end=0 maxend=29200 maxwin=0

looks like SYN packet

14.661682 tcp_in_window: START
14.661706 tcp_in_window: 
14.661731 seq=1913287797 ack=0+(0) sack=0+(0) win=29200 end=1913287798
14.661828 tcp_in_window: sender end=0 maxend=29200 maxwin=0 scale=0 receiver 
end=505004284 maxend=505004284 maxwin=29200 scale=7
14.661867 tcp_in_window: START
14.661893 tcp_in_window: 
14.661917 seq=1025597635 ack=1542862349+(0) sack=1542862349+(0) win=2414 
end=1025597635
14.661942 tcp_in_window: START
14.661966 tcp_in_window: 
14.661993 tcp_in_window: sender end=1025597635 maxend=1025635103 maxwin=354378 
scale=7 receiver end=1542862349 maxend=1543168175 maxwin=37504 scale=7
14.662020 seq=505004283 ack=1913287798+(0) sack=1913287798+(0) win=29200 
end=505004284
14.662045 tcp_in_window: 
14.662072 seq=1025597635 ack=1542862349+(0) sack=1542862349+(0) win=2414 
end=1025597635
14.662097 tcp_in_window: sender end=505004284 maxend=505004284 maxwin=29200 
scale=7 receiver end=1913287798 maxend=1913287798 maxwin=29200 scale=7
14.662125 tcp_in_window: 
14.662150 tcp_in_window: sender end=1025597635 maxend=1025635103 maxwin=354378 
scale=7 receiver end=1542862349 maxend=1543168175 maxwin=37504 scale=7
14.662175 seq=505004283 ack=1913287798+(0) sack=1913287798+(0) win=29200 
end=505004284
14.662202 tcp_in_window: sender end=505004284 maxend=505004284 maxwin=29200 
scale=7 receiver end=1913287798 maxend=1913287798 maxwin=29200 scale=7
14.662226 tcp_in_window: I=1 II=1 III=1 IV=1
14.662251 tcp_in_window: I=1 II=1 III=1 IV=1
14.662277 tcp_in_window: res=1 sender end=505004284 maxend=505004284 
maxwin=29200 receiver end=1913287798 maxend=1913316998 maxwin=29200
14.662302 tcp_in_window: res=1 sender end=1025597635 maxend=1025635103 
maxwin=354378 receiver end=1542862349 maxend=1543171341 maxwin=37504

SYNACK response and (dataless) ACK in the original direction, mixed with
an unrelated packet.

14.687411 tcp_in_window: START
14.687522 tcp_in_window: 
14.687570 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 
end=1913287798
14.687619 tcp_in_window: sender end=1913287798 maxend=1913316998 maxwin=29200 
scale=7 receiver end=505004284 maxend=505004284 maxwin=29200 scale=7
14.687659 tcp_in_window: 
14.687699 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 
end=1913287798
14.687739 tcp_in_window: sender end=1913287798 maxend=1913316998 maxwin=29200 
scale=7 receiver end=505004284 maxend=505004284 maxwin=29200 scale=7
14.687774 tcp_in_window: I=1 II=1 III=1 IV=1
14.687806 tcp_in_window: res=1 sender end=1913287798 maxend=1913316998 
maxwin=29312 receiver end=505004284 maxend=505033596 maxwin=29200

ACK in the reply direction (no data). We still have scale=7 in both
directions.

14.688706 tcp_in_window: START
14.688733 tcp_in_window: 
14.688762 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 
end=1913287819
14.688793 tcp_in_window: sender end=1913287798 maxend=1913316998 maxwin=29312 
scale=7 receiver end=505004284 maxend=505033596 maxwin=29200 scale=7
14.688824 tcp_in_window: 
14.688852 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 
end=1913287819
14.62 tcp_in_window: sender end=1913287819 maxend=1913287819 maxwin=229 
scale=0 receiver end=505004284 maxend=505033596 maxwin=29200 scale=7
14.688911 tcp_in_window: I=1 II=1 III=1 IV=1
14.688938 tcp_in_window: res=1 sender end=1913287819 maxend=1913

Re: tcp hang when socket fills up ?

2018-04-16 Thread Michal Kubecek
On Mon, Apr 16, 2018 at 10:28:11PM -0700, Eric Dumazet wrote:
> > I turned pr_debug on in tcp_in_window() for another try and it's a bit
> > mangled, because the information is spread over multiple lines and the
> > function is called in parallel, but it looks like I do have some seq > maxend + 1
> > 
> > Although it's weird, the maxend was set WAY earlier apparently?
> > Apr 17 11:13:14 res=1 sender end=1913287798 maxend=1913316998 maxwin=29312 
> > receiver end=505004284 maxend=505033596 maxwin=29200
> > then window decreased drastically e.g. previous ack just before refusal:
> > Apr 17 11:13:53 seq=1913292311 ack=505007789+(0) sack=505007789+(0) win=284 
> > end=1913292311
> > Apr 17 11:13:53 sender end=1913292311 maxend=1913331607 maxwin=284 scale=0 
> > receiver end=505020155 maxend=505033596 maxwin=39296 scale=7
> 
> scale=0 is suspect.
> 
> Really if conntrack does not see SYN SYNACK packets, it should not
> make any window check, since windows can be scaled up to 14 :/

Or maybe set the scaling to

  - TCP_MAX_WSCALE (14) by default
  - 0 when SYN or SYNACK without window scale option is seen
  - value of window scale option when SYN or SYNACK with it is seen

Michal Kubecek


Re: tcp hang when socket fills up ?

2018-04-16 Thread Eric Dumazet


On 04/16/2018 08:52 PM, Dominique Martinet wrote:
> Thank you for the hints. Out of order reply.
> 
> Florian Westphal wrote on Mon, Apr 16, 2018:
>> echo 1 > /proc/sys/net/netfilter/nf_conntrack_tcp_be_liberal
>>
>> which stops conntrack from marking packets with out-of-window
>> acks as invalid.
> 
> That worked, the connection stays established with this set.
> 
> 
>> echo 6 > /proc/sys/net/netfilter/nf_conntrack_log_invalid
> 
> Took me a while to figure how to log that properly (why do distros not
> ship ulogd?!) but that is helpful.
> 
> I have attached the full logs to this mail as text attachments in case I
> missed something in my excerpts
> 
> 
> Specifically, there are many outbound invalid packets, but as Marcelo
> pointed out there is no conntrack netfilter for outbound packets...
> The first invalid incoming packet is that one (json output seemed to be
> the most convenient as it has tcp seq/ackseq and the oob.prefix message,
> feel free to ask for some other format)
> 
> {"timestamp": "2018-04-17T10:29:14.956485", "dvc": "Netfilter",
> "raw.pktlen": 52, "raw.pktcount": 1, "oob.prefix": "nf_ct_tcp: ACK is
> over the upper bound (ACKed data not seen yet) ", "oob.time.sec":
> 1523928554, "oob.time.usec": 956485, "oob.mark": 0, "oob.hook": 0,
> "oob.family": 2, "oob.protocol": 2048, "raw.label": 0, "ip.protocol": 6,
> "ip.tos": 0, "ip.ttl": 52, "ip.totlen": 52, "ip.ihl": 5, "ip.csum":
> 50909, "ip.id": 30328, "ip.fragoff": 16384, "src_port": 30280,
> "dest_port": 29543, "tcp.seq": 4048786673, "tcp.ackseq": 3024392506,
> "tcp.window": 781, "tcp.offset": 0, "tcp.reserved": 0, "tcp.urg": 0,
> "tcp.ack": 1, "tcp.psh": 0, "tcp.rst": 0, "tcp.syn": 0, "tcp.fin": 0,
> "tcp.res1": 0, "tcp.res2": 0, "tcp.csum": 17752, "oob.in": "",
> "oob.out": "", "src_ip": "client", "dest_ip": "server"}
> 
> which is the first ack that isn't seen alright (sent packet, blocked
> ack, replays):
> 10:29:14.926115 IP server.29543 > client.30280: Flags [.], seq 
> 3024391132:3024392506, ack 4048786673, win 307, options [nop,nop,TS val 
> 681965809 ecr 1634528339], length 1374
> ...
> 10:29:14.956485 IP client.30280 > server.29543: Flags [.], ack 3024392506, 
> win 781, options [nop,nop,TS val 1634528369 ecr 681965809], length 0
> ...
> 10:29:15.255489 IP server.29543 > client.30280: Flags [.], seq 
> 3024391132:3024392506, ack 4048786673, win 307, options [nop,nop,TS val 
> 681966138 ecr 1634528369], length 1374
> 10:29:15.281581 IP client.30280 > server.29543: Flags [.], ack 3024436474, 
> win 1444, options [nop,nop,TS val 1634528697 ecr 681965904,nop,nop,sack 1 
> {3024391132:3024392506}], length 0
> 10:29:15.555466 IP client.30280 > server.29543: Flags [P.], seq 
> 4048786673:4048786709, ack 3024436474, win 1444, options [nop,nop,TS val 
> 1634528869 ecr 681965904], length 36
> 10:29:15.719510 IP server.29543 > client.30280: Flags [.], seq 
> 3024391132:3024392506, ack 4048786673, win 307, options [nop,nop,TS val 
> 681966602 ecr 1634528369], length 1374
> 
> So now I guess the question is why netfilter thinks this is over the
> upper bound.
> ...And surely enough the answer is within the invalid outbound packets
> that aren't considered in the conntrack state machine, the first of which
> is this one:
> {"timestamp": "2018-04-17T10:29:14.926143", "dvc": "Netfilter",
> "raw.pktlen": 2800, "raw.pktcount": 1, "oob.prefix": "nf_ct_tcp: SEQ is
> over the upper bound (over the window of the receiver)", "oob.time.sec":
> 1523928554, "oob.time.usec": 926143, "oob.mark": 0, "oob.hook": 0,
> "oob.family": 2, "oob.protocol": 2048, "oob.uid": 1000, "oob.gid": 100,
> "raw.label": 0, "ip.protocol": 6, "ip.tos": 8, "ip.ttl": 64,
> "ip.totlen": 2800, "ip.ihl": 5, "ip.csum": 47600, "ip.id": 27809,
> "ip.fragoff": 16384, "src_port": 29543, "dest_port": 30280, "tcp.seq":
> 3024391132, "tcp.ackseq": 4048786673, "tcp.window": 307, "tcp.offset":
> 0, "tcp.reserved": 0, "tcp.urg": 0, "tcp.ack": 1, "tcp.psh": 0,
> "tcp.rst": 0, "tcp.syn": 0, "tcp.fin": 0, "tcp.res1": 0, "tcp.res2": 0,
> "tcp.csum": 5201, "oob.in": "", "oob.out": "", "src_ip": "server",
> "dest_ip": "client"}
> 
> whose sequence matches the start seq of the first packet we don't ack,
> so that's why conntrack doesn't recognize that as a valid ack.
> 
> 
> Here's some context before that packet:
> 10:29:14.921305 IP server.29543 > client.30280: Flags [.], seq 
> 3024378314:3024379688, ack 4048786673, win 307, options [nop,nop,TS val 
> 681965804 ecr 1634528336], length 1374
> 10:29:14.921311 IP server.29543 > client.30280: Flags [.], seq 
> 3024379688:3024381062, ack 4048786673, win 307, options [nop,nop,TS val 
> 681965804 ecr 1634528336], length 1374
> 10:29:14.922166 IP client.30280 > server.29543: Flags [.], ack 3024365948, 
> win 329, options [nop,nop,TS val 1634528336 ecr 681965778], length 0
> 10:29:14.922178 IP server.29543 > client.30280: Flags [.], seq 
> 3024381062:3024382436, ack 4048786673, win 307, options [nop,nop,TS val 
> 681965805 ecr 1634528336], length 1374
> 

Re: tcp hang when socket fills up ?

2018-04-16 Thread Marcelo Ricardo Leitner
On Sat, Apr 14, 2018 at 03:55:15AM +0200, Dominique Martinet wrote:
> 16:49:26.724457 IP .31872 > .13317: Flags 
> [.], ack 25134, win 646, options [nop,nop,TS val 1617129467 ecr 1313937599], 
> length 0
> 16:49:26.726081 IP .31872 > .13317: Flags 
> [.], ack 26508, win 669, options [nop,nop,TS val 1617129471 ecr 1313937599], 
> length 0
> 16:49:26.726391 IP .31872 > .13317: Flags 
> [.], ack 27882, win 691, options [nop,nop,TS val 1617129471 ecr 1313937601], 
> length 0
> 16:49:26.726962 IP .31872 > .13317: Flags 
> [.], ack 29256, win 714, options [nop,nop,TS val 1617129472 ecr 1313937601], 
> length 0
> 16:49:26.727614 IP .31872 > .13317: Flags 
> [.], ack 30630, win 737, options [nop,nop,TS val 1617129473 ecr 1313937602], 
> length 0
> 16:49:26.728084 IP .31872 > .13317: Flags 
> [.], ack 32004, win 759, options [nop,nop,TS val 1617129473 ecr 1313937602], 
> length 0

[A] (packet above)

> 16:49:26.728507 IP .13317 > .31872: Flags 
> [.], seq 59484:60858, ack 4190, win 307, options [nop,nop,TS val 1313937635 
> ecr 1617129473], length 1374
> 16:49:26.728511 IP .13317 > .31872: Flags 
> [.], seq 60858:62232, ack 4190, win 307, options [nop,nop,TS val 1313937635 
> ecr 1617129473], length 1374
> 16:49:26.729531 IP .31872 > .13317: Flags 
> [.], ack 33378, win 782, options [nop,nop,TS val 1617129475 ecr 1313937607], 
> length 0
> 16:49:26.730002 IP .31872 > .13317: Flags 
> [.], ack 34752, win 805, options [nop,nop,TS val 1617129475 ecr 1313937607], 
> length 0
> 16:49:26.730340 IP .13317 > .31872: Flags 
> [.], seq 62232:63606, ack 4190, win 307, options [nop,nop,TS val 1313937636 
> ecr 1617129473], length 1374
> 16:49:26.730344 IP .13317 > .31872: Flags 
> [.], seq 63606:64980, ack 4190, win 307, options [nop,nop,TS val 1313937636 
> ecr 1617129473], length 1374
> 16:49:26.731398 IP .13317 > .31872: Flags 
> [.], seq 64980:66354, ack 4190, win 307, options [nop,nop,TS val 1313937638 
> ecr 1617129473], length 1374
> 16:49:26.731402 IP .13317 > .31872: Flags 
> [.], seq 66354:67728, ack 4190, win 307, options [nop,nop,TS val 1313937638 
> ecr 1617129473], length 1374
> 16:49:26.731634 IP .31872 > .13317: Flags 
> [.], ack 36126, win 827, options [nop,nop,TS val 1617129476 ecr 1313937607], 
> length 0
> 16:49:26.732955 IP .13317 > .31872: Flags 
> [.], seq 67728:69102, ack 4190, win 307, options [nop,nop,TS val 1313937639 
> ecr 1617129473], length 1374
> 16:49:26.732963 IP .13317 > .31872: Flags 
> [.], seq 69102:70476, ack 4190, win 307, options [nop,nop,TS val 1313937639 
> ecr 1617129473], length 1374
> 16:49:26.733956 IP .31872 > .13317: Flags 
> [.], ack 37500, win 850, options [nop,nop,TS val 1617129476 ecr 1313937607], 
> length 0
> 16:49:26.734242 IP .31872 > .13317: Flags 
> [.], ack 38874, win 872, options [nop,nop,TS val 1617129477 ecr 1313937608], 
> length 0
> 16:49:26.734653 IP .31872 > .13317: Flags 
> [.], ack 40248, win 895, options [nop,nop,TS val 1617129478 ecr 1313937608], 
> length 0
> 16:49:26.735042 IP .13317 > .31872: Flags 
> [.], seq 70476:71850, ack 4190, win 307, options [nop,nop,TS val 1313937641 
> ecr 1617129473], length 1374
> 16:49:26.735046 IP .13317 > .31872: Flags 
> [.], seq 71850:73224, ack 4190, win 307, options [nop,nop,TS val 1313937641 
> ecr 1617129473], length 1374
> 16:49:26.735334 IP .31872 > .13317: Flags 
> [.], ack 41622, win 918, options [nop,nop,TS val 1617129478 ecr 1313937609], 
> length 0
> 16:49:26.736005 IP .31872 > .13317: Flags 
> [.], ack 42996, win 940, options [nop,nop,TS val 1617129478 ecr 1313937609], 
> length 0
> 16:49:26.736402 IP .13317 > .31872: Flags 
> [.], seq 73224:74598, ack 4190, win 307, options [nop,nop,TS val 1313937643 
> ecr 1617129473], length 1374
> 16:49:26.736408 IP .13317 > .31872: Flags 
> [.], seq 74598:75972, ack 4190, win 307, options [nop,nop,TS val 1313937643 
> ecr 1617129473], length 1374
> 16:49:26.738561 IP .31872 > .13317: Flags 
> [.], ack 44370, win 963, options [nop,nop,TS val 1617129482 ecr 1313937616], 
> length 0
> 16:49:26.739539 IP .31872 > .13317: Flags 
> [.], ack 45744, win 986, options [nop,nop,TS val 1617129482 ecr 1313937616], 
> length 0
> 16:49:26.739882 IP .31872 > .13317: Flags 
> [.], ack 47118, win 1008, options [nop,nop,TS val 1617129484 ecr 1313937617], 
> length 0
> 16:49:26.740255 IP .31872 > .13317: Flags 
> [.], ack 48492, win 1031, options [nop,nop,TS val 1617129484 ecr 1313937617], 
> length 0
> 16:49:26.746756 IP .31872 > .13317: Flags 
> [.], ack 49866, win 1053, options [nop,nop,TS val 1617129493 ecr 1313937627], 
> length 0
> 16:49:26.747923 IP .31872 > .13317: Flags 
> [.], ack 51240, win 1076, options [nop,nop,TS val 1617129494 ecr 1313937627], 
> length 0
> 16:49:26.749083 IP .31872 > .13317: Flags 
> [.], ack 52614, win 1099, options [nop,nop,TS val 1617129495 ecr 1313937629], 
> length 0
> 16:49:26.750171 IP .31872 > .13317: Flags 
> [.], ack 53988, win 1121, options [nop,nop,TS val 1617129496 ecr 1313937629], 
> length 0
> 16:49:26.750808 IP .31872 > .13317: Flags 
> [.], 

Re: tcp hang when socket fills up ?

2018-04-16 Thread Florian Westphal
Dominique Martinet  wrote:
> Eric Dumazet wrote on Sun, Apr 15, 2018:
> > Are you sure you do not have some iptables/netfilter stuff ?
> 
> I have a basic firewall setup with default rules e.g. starts with
> -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
> in the INPUT chain...
> That said, I just dropped it on the server to check and that seems to
> work around the issue?!
> When logging everything that gets dropped, it appears to decide that the
> connection is no longer established at some point, but only if
> tcp_timestamps is on; just, err, how?
> 
> And sure enough, if I restore the firewall while a connection is up,
> it just hangs; conntrack stops considering it connected at some point
> (but it worked for a while!)
> 
> Here's the kind of logs I get from iptables:
> IN=wlp1s0 OUT= MAC=00:c2:c6:b4:7e:c7:a4:12:42:b5:5d:fc:08:00 SRC=client 
> DST=server LEN=52 TOS=0x00 PREC=0x00 TTL=52 ID=17038 DF PROTO=TCP SPT=41558 
> DPT=15609 WINDOW=1212 RES=0x00 ACK URGP=0 

You could do
echo 6 > /proc/sys/net/netfilter/nf_conntrack_log_invalid

to have conntrack log when/why it thinks packet is invalid.

You can also set
echo 1 > /proc/sys/net/netfilter/nf_conntrack_tcp_be_liberal

which stops conntrack from marking packets with out-of-window
acks as invalid.

(Earlier email implies this is related to timestamps, but unfortunately,
 to the best of my knowledge, conntrack doesn't look at TCP timestamps).



Re: tcp hang when socket fills up ?

2018-04-15 Thread Dominique Martinet
Dominique Martinet wrote on Mon, Apr 16, 2018:
... Oh, there is something interesting there: the connection doesn't
come up with -G?

Hm, sorry, I take this last part back. I cannot reproduce -G not working
reliably.
I'll dig around the conntrack table a bit more.


-- 
Dominique Martinet | Asmadeus


Re: tcp hang when socket fills up ?

2018-04-15 Thread Dominique Martinet
Eric Dumazet wrote on Sun, Apr 15, 2018:
> Are you sure you do not have some iptables/netfilter stuff ?

I have a basic firewall setup with default rules e.g. starts with
-m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
in the INPUT chain...
That said, I just dropped it on the server to check and that seems to
work around the issue?!
When logging everything that gets dropped, it appears to decide that the
connection is no longer established at some point, but only if
tcp_timestamps is on; just, err, how?

And sure enough, if I restore the firewall while a connection is up,
it just hangs; conntrack stops considering it connected at some point
(but it worked for a while!)

Here's the kind of logs I get from iptables:
IN=wlp1s0 OUT= MAC=00:c2:c6:b4:7e:c7:a4:12:42:b5:5d:fc:08:00 SRC=client 
DST=server LEN=52 TOS=0x00 PREC=0x00 TTL=52 ID=17038 DF PROTO=TCP SPT=41558 
DPT=15609 WINDOW=1212 RES=0x00 ACK URGP=0 


> ss -temoi might give us more info

hang
ESTAB   081406server:15609 client:41558
users:(("socat",pid=17818,fd=5)) timer:(on,48sec,11) uid:1000 ino:137253 sk:6a 
<->
 skmem:(r0,rb369280,t0,tb147456,f2050,w104446,o0,bl0,d1) ts sack
 reno wscale:7,7 rto:15168 backoff:6 rtt:36.829/6.492 ato:40
 mss:1374 pmtu:1500 rcvmss:1248 advmss:1448 cwnd:1 ssthresh:16
 bytes_acked:32004 bytes_received:4189 segs_out:84 segs_in:55
 data_segs_out:77 data_segs_in:18 send 298.5Kbps lastsnd:12483
 lastrcv:27801 lastack:27726 pacing_rate 19.1Mbps delivery_rate
 4.1Mbps busy:28492ms unacked:31 retrans:1/6 lost:31 rcv_rtt:29
 rcv_space:29200 rcv_ssthresh:39184 notsent:38812 minrtt:25.152

working (tcp_timestamp=0)
ESTAB   036   server:15080 client:32979
users:(("socat",pid=17047,fd=5)) timer:(on,226ms,0) uid:1000 ino:90917 sk:23 <->
 skmem:(r0,rb369280,t0,tb1170432,f1792,w2304,o0,bl0,d3) sack reno
 wscale:7,7 rto:230 rtt:29.413/5.345 ato:64 mss:1386 pmtu:1500
 rcvmss:1248 advmss:1460 cwnd:4 ssthresh:3 bytes_acked:17391762
 bytes_received:62397 segs_out:13964 segs_in:8642
 data_segs_out:13895 data_segs_in:1494 send 1.5Mbps lastsnd:4
 lastrcv:5 lastack:5 pacing_rate 1.8Mbps delivery_rate 1.2Mbps
 busy:56718ms unacked:1 retrans:0/11 rcv_rtt:9112.95 rcv_space:29233
 rcv_ssthresh:41680 minrtt:25.95

working (no iptables)
ESTAB   00server:61460 client:20468
users:(("socat",pid=17880,fd=5)) uid:1000 ino:129982 sk:6f <->
 skmem:(r0,rb369280,t0,tb1852416,f0,w0,o0,bl0,d1) ts sack reno
 wscale:7,7 rto:244 rtt:43.752/7.726 ato:40 mss:1374 pmtu:1500
 rcvmss:1248 advmss:1448 cwnd:10 bytes_acked:2617302
 bytes_received:5441 segs_out:1929 segs_in:976 data_segs_out:1919
 data_segs_in:41 send 2.5Mbps lastsnd:2734 lastrcv:2734 lastack:2705
 pacing_rate 5.0Mbps delivery_rate 12.7Mbps busy:1884ms rcv_rtt:30
 rcv_space:29200 rcv_ssthresh:39184 minrtt:26.156

> Really it looks like at some point, all incoming packets are shown by
> tcpdump but do not reach the TCP socket anymore.
> 
> (segs_in might be steady; look at the dX counters shown by ss -temoi:
> dX are drop counters, sk->sk_drops.)

segs_in does not increase with replays; the d1 seems stable.


> While running your experiment, try on the server.
> 
> perf record -a -g -e skb:kfree_skb  sleep 30
> perf report

While I understand what that should do, I am not sure why I do not get
any call graph, so that doesn't help tell what called kfree_skb and thus
what decided to drop the packet (although we no longer really need that
now...)
perf script just shows kfree_skb, e.g.:
swapper 0 [001] 237244.869321: skb:kfree_skb: skbaddr=0x8800360fda00 
protocol=2048 location=0x817a1a77
  9458e3 kfree_skb
(/usr/lib/debug/lib/modules/4.16.0-300.fc28.x86_64/vmlinux)


---

So I guess that ultimately the problem is why conntrack decides that an
established connection suddenly isn't established anymore, despite it
being listed as established by ss...
I'm discovering `conntrack(8)`, but what strikes me as interesting is
that even that points at the connection being established (looking at a
new connection after iptables started dropping packets):
# conntrack -L | grep 21308
tcp  6 267 ESTABLISHED src=server dst=client sport=21308 dport=37552 
src=client dst=server sport=37552 dport=21308 [ASSURED] mark=0 use=1

compared to another that isn't dropped (the old connection without
tcp_timestamp)
tcp  6 299 ESTABLISHED src=server dst=client sport=15080 dport=32979 
src=client dst=server sport=32979 dport=15080 [ASSURED] mark=0 use=1

The expect/dying/unconfirmed tables all are empty.

... Oh, there is something interesting there: the connection doesn't
come up with -G?
working:
conntrack -G --protonum tcp --src server --dst client --sport 15080 --dport 
32979
tcp  6 299 ESTABLISHED src=server dst=cl

Re: tcp hang when socket fills up ?

2018-04-15 Thread Eric Dumazet


On 04/15/2018 06:47 PM, Dominique Martinet wrote:

> Also, here are the per-socket stats I could find (ss -i after having
> reproduced hang):
>reno wscale:7,7 rto:7456 backoff:5 rtt:32.924/1.41 ato:40 mss:1374
>pmtu:1500 rcvmss:1248 advmss:1448 cwnd:1 ssthresh:16
>bytes_acked:32004 bytes_received:4189 segs_out:85 segs_in:54
>data_segs_out:78 data_segs_in:18 send 333.9Kbps lastsnd:3912
>lastrcv:11464 lastack:11387 pacing_rate 21.4Mbps delivery_rate
>3.5Mbps busy:12188ms unacked:33 retrans:1/5 lost:33 rcv_rtt:37
>rcv_space:29200 rcv_ssthresh:39184 notsent:28796 minrtt:24.986
> 

ss -temoi might give us more info

Really it looks like at some point, all incoming packets are shown by tcpdump 
but do not reach the TCP socket anymore.

(segs_in might be steady; look at the dX counters shown by ss -temoi:
dX are drop counters, sk->sk_drops.)


Are you sure you do not have some iptables/netfilter stuff ?

While running your experiment, try on the server.

perf record -a -g -e skb:kfree_skb  sleep 30
perf report





Re: tcp hang when socket fills up ?

2018-04-15 Thread Dominique Martinet
Eric Dumazet wrote on Fri, Apr 13, 2018:
> That might be caused by some TS val/ecr breakage :
> 
> Many acks were received by the server tcpdump,
> but none of them was accepted by TCP stack, for some reason.
> 
> Try to disable TCP timestamps, it will give some hint if bug does not 
> reproduce.

This was spot on, after disabling tcp timestamps I cannot reproduce the
hang anymore.

I've had another look at the original sequence (as seen by the server)
and I don't see much wrong; tell me what I missed:
 - the replayed packet has seq 32004:33378, so the first ignored ack
would be the one with ack 33378, is that right? (meaning the server did
accept the one for 32004 and none after that)

Assuming it is, here is an excerpt from around then (first emission of
that packet, then the client's replies):
16:49:26.700531 IP .13317 > .31872: Flags [.], seq 32004:33378, 
ack 4190, win 307, options [nop,nop,TS val 1313937607 ecr 1617129440], length 
1374
...
16:49:26.728084 IP .31872 > .13317: Flags [.], ack 32004, win 
759, options [nop,nop,TS val 1617129473 ecr 1313937602], length 0
...
16:49:26.729531 IP .31872 > .13317: Flags [.], ack 33378, win 
782, options [nop,nop,TS val 1617129475 ecr 1313937607], length 0
16:49:26.730002 IP .31872 > .13317: Flags [.], ack 34752, win 
805, options [nop,nop,TS val 1617129475 ecr 1313937607], length 0
...
16:49:26.731634 IP .31872 > .13317: Flags [.], ack 36126, win 
827, options [nop,nop,TS val 1617129476 ecr 1313937607], length 0


 - the ecr value matches the val of the packet it acks
 - the val is >= that of the previous packet (so it shouldn't be
considered a reorder and should pass the PAWS check?)
 - even if the packets are processed in parallel and some kind of race
occurs, a "bigger" ack should ack all the previous packets, right?

 - Just to make sure, I checked /proc/net/netstat for PAWSEstab but that
is 0:
TcpExt: SyncookiesSent SyncookiesRecv SyncookiesFailed EmbryonicRsts 
PruneCalled RcvPruned OfoPruned OutOfWindowIcmps LockDroppedIcmps ArpFilter TW 
TWRecycled TWKilled PAWSActive PAWSEstab DelayedACKs DelayedACKLocked 
DelayedACKLost ListenOverflows ListenDrops TCPHPHits TCPPureAcks TCPHPAcks 
TCPRenoRecovery TCPSackRecovery TCPSACKReneging TCPFACKReorder TCPSACKReorder 
TCPRenoReorder TCPTSReorder TCPFullUndo TCPPartialUndo TCPDSACKUndo TCPLossUndo 
TCPLostRetransmit TCPRenoFailures TCPSackFailures TCPLossFailures 
TCPFastRetrans TCPSlowStartRetrans TCPTimeouts TCPLossProbes 
TCPLossProbeRecovery TCPRenoRecoveryFail TCPSackRecoveryFail TCPRcvCollapsed 
TCPDSACKOldSent TCPDSACKOfoSent TCPDSACKRecv TCPDSACKOfoRecv TCPAbortOnData 
TCPAbortOnClose TCPAbortOnMemory TCPAbortOnTimeout TCPAbortOnLinger 
TCPAbortFailed TCPMemoryPressures TCPMemoryPressuresChrono TCPSACKDiscard 
TCPDSACKIgnoredOld TCPDSACKIgnoredNoUndo TCPSpuriousRTOs TCPMD5NotFound 
TCPMD5Unexpected TCPMD5Failure TCPSackShifted TCPSackMerged 
TCPSackShiftFallback TCPBacklogDrop PFMemallocDrop TCPMinTTLDrop 
TCPDeferAcceptDrop IPReversePathFilter TCPTimeWaitOverflow TCPReqQFullDoCookies 
TCPReqQFullDrop TCPRetransFail TCPRcvCoalesce TCPOFOQueue TCPOFODrop 
TCPOFOMerge TCPChallengeACK TCPSYNChallenge TCPFastOpenActive 
TCPFastOpenActiveFail TCPFastOpenPassive TCPFastOpenPassiveFail 
TCPFastOpenListenOverflow TCPFastOpenCookieReqd TCPFastOpenBlackhole 
TCPSpuriousRtxHostQueues BusyPollRxPackets TCPAutoCorking TCPFromZeroWindowAdv 
TCPToZeroWindowAdv TCPWantZeroWindowAdv TCPSynRetrans TCPOrigDataSent 
TCPHystartTrainDetect TCPHystartTrainCwnd TCPHystartDelayDetect 
TCPHystartDelayCwnd TCPACKSkippedSynRecv TCPACKSkippedPAWS TCPACKSkippedSeq 
TCPACKSkippedFinWait2 TCPACKSkippedTimeWait TCPACKSkippedChallenge TCPWinProbe 
TCPKeepAlive TCPMTUPFail TCPMTUPSuccess
TcpExt: 0 0 0 0 58 0 0 26 0 0 50 0 0 0 0 75402 17 201 0 0 6876848 59804 2258387 
0 33 0 0 3 0 0 0 0 1 102 15 0 0 0 1306 60 386 292 10 0 0 108750 201 1 228 1 8 4 
0 63 0 3 0 0 0 0 107 1 0 0 0 2834 1962 622 0 0 0 0 0 0 0 0 0 1065022 54160 0 1 
3 3 0 0 0 0 0 0 0 475 0 9578 6 8 71 257 5116325 0 0 0 0 0 0 6 0 0 0 61 85 0 0
IpExt: InNoRoutes InTruncatedPkts InMcastPkts OutMcastPkts InBcastPkts 
OutBcastPkts InOctets OutOctets InMcastOctets OutMcastOctets InBcastOctets 
OutBcastOctets InCsumErrors InNoECTPkts InECT1Pkts InECT0Pkts InCEPkts
IpExt: 0 0 0 0 206 0 16405602866 8921427728 0 0 57928 0 0 16121410 0 5388 0
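For reference, the PAWS rule being tested here (RFC 7323, as Linux's tcp_paws_check() implements it, with TCP_PAWS_WINDOW = 1) can be sketched like this; the sample values are TS val pairs from the trace:

```python
def paws_ok(ts_val: int, ts_recent: int, paws_window: int = 1) -> bool:
    """Sketch of the PAWS test: accept a segment unless its TSval is
    older than the recorded TS.Recent by more than paws_window
    (Linux's TCP_PAWS_WINDOW), using modulo-2**32 comparison."""
    diff = (ts_recent - ts_val) & 0xFFFFFFFF
    return diff <= paws_window or diff >= 0x80000000  # newer, equal, or barely older

# Each client ack carries a TSval >= the previous one, so none of them
# should be rejected by PAWS -- consistent with PAWSEstab staying 0:
assert paws_ok(ts_val=1617129475, ts_recent=1617129473)
assert paws_ok(ts_val=1617129473, ts_recent=1617129473)
# an ack stamped well in the past would be rejected:
assert not paws_ok(ts_val=1617129000, ts_recent=1617129473)
```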

Also, here are the per-socket stats I could find (ss -i after having
reproduced hang):
 reno wscale:7,7 rto:7456 backoff:5 rtt:32.924/1.41 ato:40 mss:1374
 pmtu:1500 rcvmss:1248 advmss:1448 cwnd:1 ssthresh:16
 bytes_acked:32004 bytes_received:4189 segs_out:85 segs_in:54
 data_segs_out:78 data_segs_in:18 send 333.9Kbps lastsnd:3912
 lastrcv:11464 lastack:11387 pacing_rate 21.4Mbps delivery_rate
 3.5Mbps busy:12188ms unacked:33 retrans:1/5 lost:33 rcv_rtt:37
 rcv_space:29200 rcv_ssthresh:39184 notsent:28796 minrtt:24.986


Here are the same stats with tcp timestamp disabled (after running my
reproducer, e.g. outputing a big chunk of text

Re: tcp hang when socket fills up ?

2018-04-13 Thread Dominique Martinet
Eric Dumazet wrote on Fri, Apr 13, 2018:
> Ah sorry, your trace was truncated, we need more packets _before_ the excerpt.

Ah, sorry as well. I tried to go far back enough to include the replayed
packets, but I see it didn't include the original ack for that packet.
I'm resending both full traces from connect to fin this time.

> That might be caused by some TS val/ecr breakage :
> 
> Many acks were received by the server tcpdump,
> but none of them was accepted by TCP stack, for some reason.
> 
> Try to disable TCP timestamps, it will give some hint if bug does not 
> reproduce.

I unfortunately cannot reproduce on a local network, but will give this
a try on Monday and report. (sysctl net.ipv4.tcp_timestamps=0, correct?)


Here's the traces again, server-capture first:
16:49:22.374452 IP .13317 > .31872: Flags 
[S], seq 2026966826, win 29200, options [mss 1460,sackOK,TS val 1313933281 ecr 
0,nop,wscale 7], length 0
16:49:22.700553 IP .31872 > .13317: Flags 
[S], seq 882075258, win 29200, options [mss 1386,sackOK,TS val 1617125446 ecr 
0,nop,wscale 7], length 0
16:49:22.700577 IP .13317 > .31872: Flags 
[S.], seq 2026966826, ack 882075259, win 29200, options [mss 1460,sackOK,TS val 
1313933607 ecr 1617125446,nop,wscale 7], length 0
16:49:22.727131 IP .31872 > .13317: Flags 
[.], ack 1, win 229, options [nop,nop,TS val 1617125472 ecr 1313933607], length 0
16:49:22.729193 IP .31872 > .13317: Flags 
[P.], seq 1:22, ack 1, win 229, options [nop,nop,TS val 1617125472 ecr 
1313933607], length 21
16:49:22.729230 IP .13317 > .31872: Flags 
[.], ack 22, win 229, options [nop,nop,TS val 1313933635 ecr 1617125472], 
length 0
16:49:22.732726 IP .13317 > .31872: Flags 
[P.], seq 1:22, ack 22, win 229, options [nop,nop,TS val 1313933639 ecr 
1617125472], length 21
16:49:22.759809 IP .31872 > .13317: Flags 
[.], ack 22, win 229, options [nop,nop,TS val 1617125503 ecr 1313933639], 
length 0
16:49:22.759835 IP .13317 > .31872: Flags 
[P.], seq 22:734, ack 22, win 229, options [nop,nop,TS val 1313933666 ecr 
1617125503], length 712
16:49:22.761643 IP .31872 > .13317: Flags 
[P.], seq 22:1270, ack 22, win 229, options [nop,nop,TS val 1617125504 ecr 
1313933639], length 1248
16:49:22.801860 IP .13317 > .31872: Flags 
[.], ack 1270, win 248, options [nop,nop,TS val 1313933708 ecr 1617125504], 
length 0
16:49:22.826902 IP .31872 > .13317: Flags 
[.], ack 734, win 240, options [nop,nop,TS val 1617125572 ecr 1313933666], 
length 0
16:49:22.827576 IP .31872 > .13317: Flags 
[P.], seq 1270:1318, ack 734, win 240, options [nop,nop,TS val 1617125573 ecr 
1313933708], length 48
16:49:22.827600 IP .13317 > .31872: Flags 
[.], ack 1318, win 248, options [nop,nop,TS val 1313933734 ecr 1617125573], 
length 0
16:49:22.833028 IP .13317 > .31872: Flags 
[P.], seq 734:1122, ack 1318, win 248, options [nop,nop,TS val 1313933739 ecr 
1617125573], length 388
16:49:22.858455 IP .31872 > .13317: Flags 
[.], ack 1122, win 251, options [nop,nop,TS val 1617125604 ecr 1313933739], 
length 0
16:49:22.865866 IP .31872 > .13317: Flags 
[P.], seq 1318:1334, ack 1122, win 251, options [nop,nop,TS val 1617125612 ecr 
1313933739], length 16
16:49:22.906865 IP .13317 > .31872: Flags 
[.], ack 1334, win 248, options [nop,nop,TS val 1313933813 ecr 1617125612], 
length 0
16:49:22.944474 IP .31872 > .13317: Flags 
[P.], seq 1334:1386, ack 1122, win 251, options [nop,nop,TS val 1617125678 ecr 
1313933813], length 52
16:49:22.944497 IP .13317 > .31872: Flags 
[.], ack 1386, win 248, options [nop,nop,TS val 1313933851 ecr 1617125678], 
length 0
16:49:22.944747 IP .13317 > .31872: Flags 
[P.], seq 1122:1174, ack 1386, win 248, options [nop,nop,TS val 1313933851 ecr 
1617125678], length 52
16:49:22.971083 IP .31872 > .13317: Flags 
[P.], seq 1386:1454, ack 1174, win 251, options [nop,nop,TS val 1617125716 ecr 
1313933851], length 68
16:49:22.971607 IP .13317 > .31872: Flags 
[P.], seq 1174:1258, ack 1454, win 248, options [nop,nop,TS val 1313933878 ecr 
1617125716], length 84
16:49:22.998201 IP .31872 > .13317: Flags 
[P.], seq 1454:2082, ack 1258, win 251, options [nop,nop,TS val 1617125742 ecr 
1313933878], length 628
16:49:22.998987 IP .13317 > .31872: Flags 
[P.], seq 1258:1838, ack 2082, win 268, options [nop,nop,TS val 1313933905 ecr 
1617125742], length 580
16:49:23.065102 IP .31872 > .13317: Flags 
[.], ack 1838, win 262, options [nop,nop,TS val 1617125810 ecr 1313933905], 
length 0
16:49:24.811297 IP .31872 > .13317: Flags 
[P.], seq 2082:3254, ack 1838, win 262, options [nop,nop,TS val 1617127527 ecr 
1313933905], length 1172
16:49:24.812562 IP .13317 > .31872: Flags 
[P.], seq 1838:1874, ack 3254, win 287, options [nop,nop,TS val 1313935719 ecr 
1617127527], length 36
16:49:24.838210 IP .31872 > .13317: Flags 
[.], ack 1874, win 262, options [nop,nop,TS val 1617127583 ecr 1313935719], 
length 0
16:49:24.838236 IP .13317 > .31872: Flags 
[P.], seq 1874:2534, ack 3254, win 287, options [nop,nop,TS val 1313935744 ecr 
1617127583], length 660
16:49:24.839175 I

Re: tcp hang when socket fills up ?

2018-04-13 Thread Eric Dumazet


On 04/13/2018 06:09 PM, Dominique Martinet wrote:
> Thank you for the replies,
> 
> Eric Dumazet wrote on Fri, Apr 13, 2018:
>> There is no way a regular TCP stack (including linux) could send the 
>> following frame.
>>
>>> 16:49:27.048760 IP .13317 > .31872: 
>>> Flags [.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 
>>> 1313937955 ecr 1617129473], length 1374
>>
>> So something is mangling the packet, maybe NAT or something.
> 
> 
> The pcap was produced on the server which emitted the frame, so it
> should be exactly as the server intended it to be without any mangling
> involved?
> If you could point at what strikes you as odd I can check if the same
> happened on other hang traces I might still have. Is it just that it
> replays a very old seq?
> (if it's odd wrt client packets, the same trace as captured on client is
> at the end of this mail)
> 

Ah sorry, your trace was truncated, we need more packets _before_ the excerpt.

That might be caused by some TS val/ecr breakage :

Many acks were received by the server tcpdump,
but none of them was accepted by TCP stack, for some reason.

Try to disable TCP timestamps, it will give some hint if bug does not reproduce.







Re: tcp hang when socket fills up ?

2018-04-13 Thread Dominique Martinet
Thank you for the replies,

Eric Dumazet wrote on Fri, Apr 13, 2018:
> There is no way a regular TCP stack (including linux) could send the 
> following frame.
> 
> > 16:49:27.048760 IP .13317 > .31872: 
> > Flags [.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 
> > 1313937955 ecr 1617129473], length 1374
> 
> So something is mangling the packet, maybe NAT or something.


The pcap was produced on the server which emitted the frame, so it
should be exactly as the server intended it to be without any mangling
involved?
If you could point at what strikes you as odd I can check if the same
happened on other hang traces I might still have. Is it just that it
replays a very old seq?
(if it's odd wrt client packets, the same trace as captured on client is
at the end of this mail)



Michal Kubecek wrote on Fri, Apr 13, 2018:
> The way I read this, server doesn't see anything sent by client since
> some point shortly before the dump shown here starts (about 5ms). It
> keeps sending data until 16:49:26.807754 (seq 77346) and then keeps
> resending first (from its point of view) unacknowledged segment
> (32004:33378) in exponentially growing intervals and ignores replies
> from the client. Client apparently receives these retransmits and
> replies with dupack (with D-SACK for 32004:33378) and retransmits of its
> own first unacknowledged segment (4190:4226).

I have the same understanding.

> As we can see the client packets in the dump (which was taken on
> server), it would mean they are dropped after the point where packet
> socket would pass them to libpcap. That might be e.g. netfilter
> (conntrack?) or the IP/TCP code detecting them to be invalid for some
> reason (which is not obvious to me from the dump above).

In my second mail, I got as far as `tcp_ack`, where the replays didn't
get considered for ack because they were seen as having the same ack
number as the latest seen. I'm not sure why the ack didn't get considered
the first time; I need to add more instrumentation.
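My (hedged) reading of the relevant staleness test in tcp_ack(): an incoming ack only counts as new if it advances snd_una, so a replayed ack carrying an already-seen ack number is treated as a duplicate. Roughly:

```python
def ack_advances_snd_una(ack_seq: int, snd_una: int) -> bool:
    """Rough sketch, not kernel code: the ack is only processed as new
    if it acknowledges data beyond snd_una (32-bit serial compare)."""
    return 0 < ((ack_seq - snd_una) & 0xFFFFFFFF) < 0x80000000

# Numbers from the trace: with snd_una stuck at 32004, the first ack
# for 33378 would be new, but a replay of the same number is a no-op:
assert ack_advances_snd_una(33378, 32004)
assert not ack_advances_snd_una(33378, 33378)
assert not ack_advances_snd_una(32004, 33378)   # old ack
```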

> There are two strange points:
> 
> 1. While client apparently responds to all server retransmits, it does
> so with TSecr=1313937714 (matching server packet from 16:49:26.807754)
> rather than TSval of the packets it dupacks (1313937955 through
> 1313953050). This doesn't seem to follow the rules of RFC 7323
> Section 4.3.

I have a pcap of the same sequence on the client as well, and it does
get the packets with higher val= unchanged, so that might be something
to look at.
I'll post a similarly formatted tcpdump output at the end of this mail;
the timestamps and various numbers in the traces match those in my
original mail.

> 2. Window size values in acks from client grow with each acked packet by
> 22-23 (which might be ~1400 with scaling factor of 64). I would rather
> expect advertised receive window to go down by 1374 with each received
> segment and to grow by bigger steps with each read()/recv() call from
> application.

I have a very poor understanding of this part of the protocol/code, but
I think the window grows when a socket starts receiving a bulk of data.
Given this hangs very fast, there were only individual small packets up
until this trace where data comes in quickly, so I think this might just
be window scaling?
I might very well be wrong though.
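A quick sanity check of that reading, assuming the wscale:7,7 from the ss output applies to this connection: the "win" field in tcpdump is the raw 16-bit value before shifting, so the small per-ack increments of 22-23 that Michal noted correspond to a few kilobytes of real window:

```python
def advertised_window(win_field: int, wscale: int = 7) -> int:
    """Real receive window in bytes: the raw 16-bit header field
    shifted by the negotiated window scale (wscale 7 => factor 128)."""
    return win_field << wscale

assert advertised_window(1444) == 184832   # "win 1444" is ~180 KB
assert advertised_window(22) == 2816       # each +22 step ~ two 1374-byte segments
assert advertised_window(307) == 39296     # the server's steady "win 307"
```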


> We might get more insight if we saw the same connection on both sides.
> From what was presented here, my guess is that
> 
>   (1) received packets are dropped somewhere on server side (after they
>   are cloned for the packet socket)

I'd agree for the received replays, but I don't get why the ack didn't
work the first time it was received.

>   (2) there is something wrong either on client side or between the two
>   hosts (there is at least a NAT, IIUC)

My minimal reproducer involves the internet, so there is no telling what
is in the middle, but there are at least two NATs, and the connection is
established through two connect() calls with carefully crafted
source/destination ports, if that makes a difference.
I've done a bit more fiddling with netem/tbf and a single NAT but I
still can't reproduce it on a local network; I need to spend more time
trying, as this is very frustrating because I don't control either of
the two NATs in the real case.

I don't think the NATs are the culprits, though: looking at the traces
I don't see any reordering within a direction, missing packets, or
fiddling with values, so this seems fairly straightforward.

Here's the client-side traces. pcap files still available on request, if
you prefer to look at it in wireshark or another tool.

16:49:26.726793 IP .13317 > .31872: Flags 
[.], seq 40248:41622, ack 4190, win 307, options [nop,nop,TS val 1313937609 ecr 
1617129440], length 1374
16:49:26.726798 IP .31872 > .13317: Flags 
[.], ack 41622, win 918, options [nop,nop,TS val 1617129478 ecr 1313937609], 
length 0
16:49:26.726802 IP .13317 > .31872: Flags 
[.], seq 41622:42996, ack 4190, win 307, options [nop,nop,TS val 1313937609 ecr 
161712

Re: tcp hang when socket fills up ?

2018-04-13 Thread Michal Kubecek
On Fri, Apr 06, 2018 at 11:07:20AM +0200, Dominique Martinet wrote:
> 16:49:26.735042 IP .13317 > .31872: Flags 
> [.], seq 70476:71850, ack 4190, win 307, options [nop,nop,TS val 1313937641 
> ecr 1617129473], length 1374
> 16:49:26.735046 IP .13317 > .31872: Flags 
> [.], seq 71850:73224, ack 4190, win 307, options [nop,nop,TS val 1313937641 
> ecr 1617129473], length 1374
> 16:49:26.735334 IP .31872 > .13317: Flags 
> [.], ack 41622, win 918, options [nop,nop,TS val 1617129478 ecr 1313937609], 
> length 0
> 16:49:26.736005 IP .31872 > .13317: Flags 
> [.], ack 42996, win 940, options [nop,nop,TS val 1617129478 ecr 1313937609], 
> length 0
> 16:49:26.736402 IP .13317 > .31872: Flags 
> [.], seq 73224:74598, ack 4190, win 307, options [nop,nop,TS val 1313937643 
> ecr 1617129473], length 1374
> 16:49:26.736408 IP .13317 > .31872: Flags 
> [.], seq 74598:75972, ack 4190, win 307, options [nop,nop,TS val 1313937643 
> ecr 1617129473], length 1374
> 16:49:26.738561 IP .31872 > .13317: Flags 
> [.], ack 44370, win 963, options [nop,nop,TS val 1617129482 ecr 1313937616], 
> length 0
> 16:49:26.739539 IP .31872 > .13317: Flags 
> [.], ack 45744, win 986, options [nop,nop,TS val 1617129482 ecr 1313937616], 
> length 0
> 16:49:26.739882 IP .31872 > .13317: Flags 
> [.], ack 47118, win 1008, options [nop,nop,TS val 1617129484 ecr 1313937617], 
> length 0
> 16:49:26.740255 IP .31872 > .13317: Flags 
> [.], ack 48492, win 1031, options [nop,nop,TS val 1617129484 ecr 1313937617], 
> length 0
> 16:49:26.746756 IP .31872 > .13317: Flags 
> [.], ack 49866, win 1053, options [nop,nop,TS val 1617129493 ecr 1313937627], 
> length 0
> 16:49:26.747923 IP .31872 > .13317: Flags 
> [.], ack 51240, win 1076, options [nop,nop,TS val 1617129494 ecr 1313937627], 
> length 0
> 16:49:26.749083 IP .31872 > .13317: Flags 
> [.], ack 52614, win 1099, options [nop,nop,TS val 1617129495 ecr 1313937629], 
> length 0
> 16:49:26.750171 IP .31872 > .13317: Flags 
> [.], ack 53988, win 1121, options [nop,nop,TS val 1617129496 ecr 1313937629], 
> length 0
> 16:49:26.750808 IP .31872 > .13317: Flags 
> [.], ack 55362, win 1144, options [nop,nop,TS val 1617129497 ecr 1313937629], 
> length 0
> 16:49:26.754648 IP .31872 > .13317: Flags 
> [.], ack 56736, win 1167, options [nop,nop,TS val 1617129500 ecr 1313937629], 
> length 0
> 16:49:26.755985 IP .31872 > .13317: Flags 
> [.], ack 58110, win 1189, options [nop,nop,TS val 1617129501 ecr 1313937630], 
> length 0
> 16:49:26.758513 IP .31872 > .13317: Flags 
> [.], ack 59484, win 1212, options [nop,nop,TS val 1617129502 ecr 1313937630], 
> length 0
> 16:49:26.759096 IP .31872 > .13317: Flags 
> [.], ack 60858, win 1234, options [nop,nop,TS val 1617129503 ecr 1313937635], 
> length 0
> 16:49:26.759421 IP .31872 > .13317: Flags 
> [.], ack 62232, win 1257, options [nop,nop,TS val 1617129503 ecr 1313937635], 
> length 0
> 16:49:26.759755 IP .31872 > .13317: Flags 
> [.], ack 63606, win 1280, options [nop,nop,TS val 1617129504 ecr 1313937636], 
> length 0
> 16:49:26.760653 IP .31872 > .13317: Flags 
> [.], ack 64980, win 1302, options [nop,nop,TS val 1617129505 ecr 1313937636], 
> length 0
> 16:49:26.761453 IP .31872 > .13317: Flags 
> [.], ack 66354, win 1325, options [nop,nop,TS val 1617129506 ecr 1313937638], 
> length 0
> 16:49:26.762199 IP .31872 > .13317: Flags 
> [.], ack 67728, win 1348, options [nop,nop,TS val 1617129507 ecr 1313937638], 
> length 0
> 16:49:26.763547 IP .31872 > .13317: Flags 
> [P.], seq 4190:4226, ack 67728, win 1348, options [nop,nop,TS val 1617129507 
> ecr 1313937638], length 36
> 16:49:26.763553 IP .31872 > .13317: Flags 
> [.], ack 70476, win 1393, options [nop,nop,TS val 1617129508 ecr 1313937639], 
> length 0
> 16:49:26.764298 IP .31872 > .13317: Flags 
> [.], ack 73224, win 1438, options [nop,nop,TS val 1617129509 ecr 1313937641], 
> length 0
> 16:49:26.764676 IP .31872 > .13317: Flags 
> [.], ack 75972, win 1444, options [nop,nop,TS val 1617129510 ecr 1313937643], 
> length 0
> 16:49:26.807754 IP .13317 > .31872: Flags 
> [.], seq 75972:77346, ack 4190, win 307, options [nop,nop,TS val 1313937714 
> ecr 1617129473], length 1374
> 16:49:26.876467 IP .31872 > .13317: Flags 
> [.], ack 77346, win 1444, options [nop,nop,TS val 1617129620 ecr 1313937714], 
> length 0
> 16:49:27.048760 IP .13317 > .31872: Flags 
> [.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313937955 
> ecr 1617129473], length 1374
> 16:49:27.051791 IP .31872 > .13317: Flags 
> [P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617129762 
> ecr 1313937714], length 36
> 16:49:27.076444 IP .31872 > .13317: Flags 
> [.], ack 77346, win 1444, options [nop,nop,TS val 1617129822 ecr 
> 1313937714,nop,nop,sack 1 {32004:33378}], length 0
> 16:49:27.371182 IP .31872 > .13317: Flags 
> [P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617130018 
> ecr 1313937714], length 36
> 16:49:27.519862 IP .13317 > .31872: Flags 
> [.], seq 32004:33378, ack 4190, win 307, o

Re: tcp hang when socket fills up ?

2018-04-13 Thread Eric Dumazet


On 04/06/2018 02:07 AM, Dominique Martinet wrote:
> (current kernel: vanilla 4.14.29)
> 
> I've been running into trouble recently where a socket "fills up" (as
> in, select() will no longer return it in its writefds / attempting to
> write to it after setting it to O_NONBLOCK returns -EWOULDBLOCK) and it
> doesn't seem to ever "unfill" until the tcp connection times out.
> 
> The previous time this happened I blamed the application for not also
> trying to read from the socket, assuming the buffers could be shared and
> that just waiting would never drain data out, but this time I'm a bit
> more skeptical as socat waits for the fd in both read and write...
> 
> Let me take a minute to describe my setup, I don't think that how the
> socket was created matters but it might be interesting:
>  - I have two computers behind NATs, no port forwarding on either side
>  - One (call it C for client) runs ssh with a proxycommand ncat/socat to
> control the source port, e.g.
> $ ssh -o ProxyCommand="socat stdio tcp:<server ip>:<port>,sourceport=<port>" server
>  - The server runs another socat to connect back to that and forward to
> ssh locally, e.g.
> $ socat tcp:<client ip>:<port>,sourceport=<port> tcp:127.0.0.1:22
> 
> (yes, both are connect() calls, and that just works™ thanks to initial
> syn replay and conntrack on routers)
> 
> When things stall, the first socat is in a select with both fd in
> reading, so it's waiting data.
> The second socat has data coming from ssh and is in a select with the
> client-facing socket in both read and write - that select never returns
> so the socket is not available for reading or writing and does not free
> up until the connection eventually times out a few minutes later.
> 
> At this point, I only see tcp retransmits in tcpdump/wireshark. I've
> compared dumps from both sides and there are no lost packets, only
> reordering - there is always a batch of acks, sent regularly, that
> comes in shortly before the hang. Here's the trace on the server:
> 
> 16:49:26.735042 IP .13317 > .31872: Flags 
> [.], seq 70476:71850, ack 4190, win 307, options [nop,nop,TS val 1313937641 
> ecr 1617129473], length 1374
> 16:49:26.735046 IP .13317 > .31872: Flags 
> [.], seq 71850:73224, ack 4190, win 307, options [nop,nop,TS val 1313937641 
> ecr 1617129473], length 1374
> 16:49:26.735334 IP .31872 > .13317: Flags 
> [.], ack 41622, win 918, options [nop,nop,TS val 1617129478 ecr 1313937609], 
> length 0
> 16:49:26.736005 IP .31872 > .13317: Flags 
> [.], ack 42996, win 940, options [nop,nop,TS val 1617129478 ecr 1313937609], 
> length 0
> 16:49:26.736402 IP .13317 > .31872: Flags 
> [.], seq 73224:74598, ack 4190, win 307, options [nop,nop,TS val 1313937643 
> ecr 1617129473], length 1374
> 16:49:26.736408 IP .13317 > .31872: Flags 
> [.], seq 74598:75972, ack 4190, win 307, options [nop,nop,TS val 1313937643 
> ecr 1617129473], length 1374
> 16:49:26.738561 IP .31872 > .13317: Flags 
> [.], ack 44370, win 963, options [nop,nop,TS val 1617129482 ecr 1313937616], 
> length 0
> 16:49:26.739539 IP .31872 > .13317: Flags 
> [.], ack 45744, win 986, options [nop,nop,TS val 1617129482 ecr 1313937616], 
> length 0
> 16:49:26.739882 IP .31872 > .13317: Flags 
> [.], ack 47118, win 1008, options [nop,nop,TS val 1617129484 ecr 1313937617], 
> length 0
> 16:49:26.740255 IP .31872 > .13317: Flags 
> [.], ack 48492, win 1031, options [nop,nop,TS val 1617129484 ecr 1313937617], 
> length 0
> 16:49:26.746756 IP .31872 > .13317: Flags 
> [.], ack 49866, win 1053, options [nop,nop,TS val 1617129493 ecr 1313937627], 
> length 0
> 16:49:26.747923 IP .31872 > .13317: Flags 
> [.], ack 51240, win 1076, options [nop,nop,TS val 1617129494 ecr 1313937627], 
> length 0
> 16:49:26.749083 IP .31872 > .13317: Flags 
> [.], ack 52614, win 1099, options [nop,nop,TS val 1617129495 ecr 1313937629], 
> length 0
> 16:49:26.750171 IP .31872 > .13317: Flags 
> [.], ack 53988, win 1121, options [nop,nop,TS val 1617129496 ecr 1313937629], 
> length 0
> 16:49:26.750808 IP .31872 > .13317: Flags 
> [.], ack 55362, win 1144, options [nop,nop,TS val 1617129497 ecr 1313937629], 
> length 0
> 16:49:26.754648 IP .31872 > .13317: Flags 
> [.], ack 56736, win 1167, options [nop,nop,TS val 1617129500 ecr 1313937629], 
> length 0
> 16:49:26.755985 IP .31872 > .13317: Flags 
> [.], ack 58110, win 1189, options [nop,nop,TS val 1617129501 ecr 1313937630], 
> length 0
> 16:49:26.758513 IP .31872 > .13317: Flags 
> [.], ack 59484, win 1212, options [nop,nop,TS val 1617129502 ecr 1313937630], 
> length 0
> 16:49:26.759096 IP .31872 > .13317: Flags 
> [.], ack 60858, win 1234, options [nop,nop,TS val 1617129503 ecr 1313937635], 
> length 0
> 16:49:26.759421 IP .31872 > .13317: Flags 
> [.], ack 62232, win 1257, options [nop,nop,TS val 1617129503 ecr 1313937635], 
> length 0
> 16:49:26.759755 IP .31872 > .13317: Flags 
> [.], ack 63606, win 1280, options [nop,nop,TS val 1617129504 ecr 1313937636], 
> length 0
> 16:49:26.760653 IP .31872 > .13317: Flags 
> [.], ack 64980, win 1302, options [nop,nop,TS val 1

Re: tcp hang when socket fills up ?

2018-04-13 Thread Dominique Martinet

Note this is mostly me talking to myself, but in case anyone else hits
the same issues I might as well post my meagre progress.

Dominique Martinet wrote on Fri, Apr 06, 2018:
> (current kernel: vanilla 4.14.29)

Reproduced in a Fedora VM on that host with a 4.16.1-300.fc28.x86_64
kernel, since that one has DEBUG_INFO=y and I was lazy (but I haven't
seen any patch about this kind of thing recently, so it is probably
still valid).

The other main difference is the qdisc: the VM uses fq_codel, while the
host is directly on wireless, so mq with 4 pfifo_fast queues. It is
harder to reproduce on the VM (even on another VM with the same kernel),
so I'd put the difference down to the qdisc, but I was ultimately able
to reproduce with both.
(update: it actually was still fairly easy to reproduce until later in
the day (coworkers left?), going from ~5-15s to reproduce to multiple
minutes, so this likely depends a lot on network quality. I couldn't
reproduce by fiddling with netem on a local network though...)

> [please find previous email for setup/tcpdump output]

So I have a crash dump with a socket / inet_sock that is blocked, and
since this is a VM I even get gdb as a bonus.

With the crash dump, I can confirm that the socket is not available for
writing (sk->sk_sndbuf - sk->sk_wmem_queued < sk->sk_wmem_queued >> 1),
but that doesn't help much if I can't tell why we're not taking acks in.

With gdb, I set a breakpoint on tcp_ack (net/ipv4/tcp_input.c), as I
think that's the function that should handle my ack, and that breakpoint
does fire on the retransmitted ack.

First, by abusing gdb's 'next', the flow seems to be something like this
(I folded the if/else branches not taken):

static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
{
        struct inet_connection_sock *icsk = inet_csk(sk);
        struct tcp_sock *tp = tcp_sk(sk);
        struct tcp_sacktag_state sack_state;
        struct rate_sample rs = { .prior_delivered = 0 };
        u32 prior_snd_una = tp->snd_una;
        bool is_sack_reneg = tp->is_sack_reneg;
        u32 ack_seq = TCP_SKB_CB(skb)->seq;
        u32 ack = TCP_SKB_CB(skb)->ack_seq;
        bool is_dupack = false;
        int prior_packets = tp->packets_out;
        u32 delivered = tp->delivered;
        u32 lost = tp->lost;
        int rexmit = REXMIT_NONE; /* Flag to (re)transmit to recover losses */
        u32 prior_fack;

        sack_state.first_sackt = 0;
        sack_state.rate = &rs;

        /* We very likely will need to access rtx queue. */
        prefetch(sk->tcp_rtx_queue.rb_node);

        /* If the ack is older than previous acks
         * then we can probably ignore it.
         */
        if (before(ack, prior_snd_una)) {
        }

        /* If the ack includes data we haven't sent yet, discard
         * this segment (RFC793 Section 3.9).
         */
        if (after(ack, tp->snd_nxt))

        if (after(ack, prior_snd_una)) {
        }

        prior_fack = tcp_is_sack(tp) ? tcp_highest_sack_seq(tp) : tp->snd_una;
        rs.prior_in_flight = tcp_packets_in_flight(tp);

        /* ts_recent update must be made after we are sure that the packet
         * is in window.
         */
        if (flag & FLAG_UPDATE_TS_RECENT)

        if (!(flag & FLAG_SLOWPATH) && after(ack, prior_snd_una)) {
        } else {
                u32 ack_ev_flags = CA_ACK_SLOWPATH;

                if (ack_seq != TCP_SKB_CB(skb)->end_seq)
                else
                        NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPPUREACKS);

                flag |= tcp_ack_update_window(sk, skb, ack, ack_seq);

                if (TCP_SKB_CB(skb)->sacked)

                if (tcp_ecn_rcv_ecn_echo(tp, tcp_hdr(skb))) {
                }

                if (flag & FLAG_WIN_UPDATE)

                tcp_in_ack_event(sk, ack_ev_flags);
        }

        /* We passed data and got it acked, remove any soft error
         * log. Something worked...
         */
        sk->sk_err_soft = 0;
        icsk->icsk_probes_out = 0;
        tp->rcv_tstamp = tcp_jiffies32;
        if (!prior_packets)
                goto no_queue;

no_queue:
        /* If data was DSACKed, see if we can undo a cwnd reduction. */
        if (flag & FLAG_DSACKING_ACK)
                tcp_fastretrans_alert(sk, prior_snd_una, is_dupack, &flag,
                                      &rexmit);
        /* If this ack opens up a zero window, clear backoff.  It was
         * being used to time the probes, and is probably far higher than
         * it needs to be for normal retransmission.
         */
        tcp_ack_probe(sk);

        if (tp->tlp_high_seq)
                tcp_process_tlp_ack(sk, ack, flag);
        return 1;


And here is 'info locals' towards the end of the function:
icsk = 0x88001740b800
tp = 0x88001740b800
sack_state = {reord = 84, first_sackt = 0, last_sackt = 0, rate = 
0x88003fc83ae0, fla