Re: [PATCH v6] tcp: fix race condition when creating child sockets from syncookies

2020-11-17 Thread Eric Dumazet



On 11/17/20 8:29 AM, Ricardo Dias wrote:
> When the TCP stack is in SYN flood mode, the server child socket is
> created from the SYN cookie received in a TCP packet with the ACK flag
> set.
> 

...

>  
> @@ -1374,6 +1381,13 @@ static struct sock *tcp_v6_syn_recv_sock(const struct 
> sock *sk, struct sk_buff *
>   skb_set_owner_r(newnp->pktoptions, newsk);
>   }
>   }
> + } else {
> + if (!req_unhash && esk) {
> + /* This code path should only be executed in the
> +  * syncookie case only
> +  */
> + newsk = esk;

What happens to the socket we just created and was stored in @newsk ?

We are going to leak kernel memory quite fast, if we just forget about it.

> + }
>   }
>  
>   return newsk;
> 

Anyway, I have tested your patch (see below) and had many errors in 

[  669.578597] cleanup rbuf bug: copied 1A5AD52D seq 1A5AD52D rcvnxt 1A5AD52D
[  669.578604] WARNING: CPU: 30 PID: 13267 at net/ipv4/tcp.c:1553 
tcp_cleanup_rbuf+0x48/0x100
[  669.578604] Modules linked in: vfat fat w1_therm wire i2c_mux_pca954x 
i2c_mux cdc_acm ehci_pci ehci_hcd mlx4_en mlx4_ib ib_uverbs ib_core mlx4_core
[  669.578610] CPU: 30 PID: 13267 Comm: ld-2.15.so.orig Tainted: G S  W 
5.10.0-smp-DEV #394
[  669.578611] Hardware name: Intel RML,PCH/Iota_QC_19, BIOS 2.60.0 10/02/2019
[  669.578612] RIP: 0010:tcp_cleanup_rbuf+0x48/0x100
[  669.578612] Code: f3 48 39 d0 74 26 48 85 c0 74 21 8b 50 2c 8b b7 7c 05 00 
00 39 d6 78 14 8b 8f 78 05 00 00 48 c7 c7 08 8a 57 a5 e8 1d b6 10 00 <0f> 0b 41 
0f b6 84 24 c8 04 00 00 a8 01 74 2a 41 0f b7 8c 24 de 04
[  669.578613] RSP: 0018:893bd898fca8 EFLAGS: 00010286
[  669.578614] RAX:  RBX:  RCX: 7fff
[  669.578615] RDX: 7fff RSI: ffea RDI: 
[  669.578615] RBP: 893bd898fcb8 R08: a5bed108 R09: 00027ffb
[  669.578616] R10: 8000 R11: 3fff R12: 891cd3530140
[  669.578617] R13: 891cd3530140 R14:  R15: 
[  669.578618] FS:  7fb7908ea700() GS:893affc8() 
knlGS:
[  669.578619] CS:  0010 DS:  ES:  CR0: 80050033
[  669.578620] CR2: 34fedd58981c CR3: 0020d87b2001 CR4: 001706e0
[  669.578620] Call Trace:
[  669.578621]  tcp_recvmsg+0x21f/0xac0
[  669.578624]  inet6_recvmsg+0x5f/0x120
[  669.578625]  ? add_wait_queue_exclusive+0x80/0x80
[  669.578627]  sock_recvmsg+0x54/0x80
[  669.578628]  __sys_recvfrom+0x19e/0x1d0
[  669.578630]  ? __fput+0x11c/0x240
[  669.578632]  __x64_sys_recvfrom+0x29/0x30
[  669.578634]  do_syscall_64+0x38/0x50
[  669.578635]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

I have used following method on DUT host, and patch [1] in get_rps_cpus()

DUT has 48 hyper threads, I force RPS to spread packets on all of them.

for q in {0..15}; do echo , 
>/sys/class/net/eth0/queues/rx-$q/rps_cpus; done
echo 2 >/proc/sys/net/ipv4/tcp_syncookies
netserver

Then from another host : 

for i in {1..20}
do
 super_netperf 200 -H DUT -t TCP_CRR -l2 -- -p0
done

After the test, I had more than 100,000 TCPv6 sockets that were mem leaked.


[1]
diff --git a/net/core/dev.c b/net/core/dev.c
index 
60d325bda0d7b4a1ecb7bf7b3352d58bed8b96a2..ae99206bd05126eef8cf4490bb197a3c781012ed
 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4344,7 +4344,7 @@ static int get_rps_cpu(struct net_device *dev, struct 
sk_buff *skb,
goto done;
 
skb_reset_network_header(skb);
-   hash = skb_get_hash(skb);
+   hash = prandom_u32();
if (!hash)
goto done;
 



[PATCH v6] tcp: fix race condition when creating child sockets from syncookies

2020-11-16 Thread Ricardo Dias
When the TCP stack is in SYN flood mode, the server child socket is
created from the SYN cookie received in a TCP packet with the ACK flag
set.

The child socket is created when the server receives the first TCP
packet with a valid SYN cookie from the client. Usually, this packet
corresponds to the final step of the TCP 3-way handshake, the ACK
packet. But is also possible to receive a valid SYN cookie from the
first TCP data packet sent by the client, and thus create a child socket
from that SYN cookie.

Since a client socket is ready to send data as soon as it receives the
SYN+ACK packet from the server, the client can send the ACK packet (sent
by the TCP stack code), and the first data packet (sent by the userspace
program) almost at the same time, and thus the server will equally
receive the two TCP packets with valid SYN cookies almost at the same
instant.

When such event happens, the TCP stack code has a race condition that
occurs between the momement a lookup is done to the established
connections hashtable to check for the existence of a connection for the
same client, and the moment that the child socket is added to the
established connections hashtable. As a consequence, this race condition
can lead to a situation where we add two child sockets to the
established connections hashtable and deliver two sockets to the
userspace program to the same client.

This patch fixes the race condition by checking if an existing child
socket exists for the same client when we are adding the second child
socket to the established connections socket. If an existing child
socket exists, we return that socket and use it to process the TCP
packet received, and discard the second child socket to the same client.

Signed-off-by: Ricardo Dias 
---
v6 (2020-11-17):
  * Moved the ehash bucket list search for its own helper function.

v5 (2020-11-16):
 - Not considered for review -
 
v4 (2020-11-12):
  * Added `struct sock **esk) parameter to `inet_ehash_insert`.
  * Fixed ref count increment in `inet_ehash_insert`.
  * Fixed callers of inet_ehash_nolisten.

v3 (2020-11-11):
  * Fixed IPv6 handling in inet_ehash_insert
  * Removed unecessary comparison while traversing the ehash bucket
list.
 
v2 (2020-11-09):
  * Changed the author's email domain.
  * Removed the helper function inet_ehash_insert_chk_dup and moved the
logic to the existing inet_ehash_insert.
  * Updated the callers of iner_ehash_nolisten to deal with the new
logic.

 include/net/inet_hashtables.h   |  4 +-
 net/dccp/ipv4.c |  2 +-
 net/dccp/ipv6.c |  2 +-
 net/ipv4/inet_connection_sock.c |  2 +-
 net/ipv4/inet_hashtables.c  | 72 +
 net/ipv4/syncookies.c   |  5 ++-
 net/ipv4/tcp_ipv4.c |  9 -
 net/ipv6/tcp_ipv6.c | 16 +++-
 8 files changed, 96 insertions(+), 16 deletions(-)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 92560974ea67..b0abc4dd6d49 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -247,8 +247,8 @@ void inet_hashinfo2_init(struct inet_hashinfo *h, const 
char *name,
 unsigned long high_limit);
 int inet_hashinfo2_init_mod(struct inet_hashinfo *h);
 
-bool inet_ehash_insert(struct sock *sk, struct sock *osk);
-bool inet_ehash_nolisten(struct sock *sk, struct sock *osk);
+bool inet_ehash_insert(struct sock *sk, struct sock *osk, struct sock **esk);
+bool inet_ehash_nolisten(struct sock *sk, struct sock *osk, struct sock **esk);
 int __inet_hash(struct sock *sk, struct sock *osk);
 int inet_hash(struct sock *sk);
 void inet_unhash(struct sock *sk);
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index 9c28c8251125..098bae35ab76 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -427,7 +427,7 @@ struct sock *dccp_v4_request_recv_sock(const struct sock 
*sk,
 
if (__inet_inherit_port(sk, newsk) < 0)
goto put_and_exit;
-   *own_req = inet_ehash_nolisten(newsk, req_to_sk(req_unhash));
+   *own_req = inet_ehash_nolisten(newsk, req_to_sk(req_unhash), NULL);
if (*own_req)
ireq->ireq_opt = NULL;
else
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index ef4ab28cfde0..78ee1b5acf1f 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -533,7 +533,7 @@ static struct sock *dccp_v6_request_recv_sock(const struct 
sock *sk,
dccp_done(newsk);
goto out;
}
-   *own_req = inet_ehash_nolisten(newsk, req_to_sk(req_unhash));
+   *own_req = inet_ehash_nolisten(newsk, req_to_sk(req_unhash), NULL);
/* Clone pktoptions received with SYN, if we own the req */
if (*own_req && ireq->pktopts) {
newnp->pktoptions = skb_clone(ireq->pktopts, GFP_ATOMIC);
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index b457dd2d6c75..df26489e4f6c 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/i