[PATCH v8] tcp: fix race condition when creating child sockets from syncookies

2020-11-20 Thread Ricardo Dias
When the TCP stack is in SYN flood mode, the server child socket is
created from the SYN cookie received in a TCP packet with the ACK flag
set.

The child socket is created when the server receives the first TCP
packet with a valid SYN cookie from the client. Usually, this packet
corresponds to the final step of the TCP 3-way handshake, the ACK
packet. But is also possible to receive a valid SYN cookie from the
first TCP data packet sent by the client, and thus create a child socket
from that SYN cookie.

Since a client socket is ready to send data as soon as it receives the
SYN+ACK packet from the server, the client can send the ACK packet (sent
by the TCP stack code), and the first data packet (sent by the userspace
program) almost at the same time, and thus the server will equally
receive the two TCP packets with valid SYN cookies almost at the same
instant.

When such event happens, the TCP stack code has a race condition that
occurs between the momement a lookup is done to the established
connections hashtable to check for the existence of a connection for the
same client, and the moment that the child socket is added to the
established connections hashtable. As a consequence, this race condition
can lead to a situation where we add two child sockets to the
established connections hashtable and deliver two sockets to the
userspace program to the same client.

This patch fixes the race condition by checking if an existing child
socket exists for the same client when we are adding the second child
socket to the established connections socket. If an existing child
socket exists, we drop the packet and discard the second child socket
to the same client.

Signed-off-by: Ricardo Dias 
---
v8 (2020-11-20):
  * Implemented the approach suggested by Eric where we drop the packet
in case we found an already existing socket for the same connection.
  * Updated the commit message to match the new approach.

v7 (2020-11-19):
  * Changed the approach to re-use the first (existing) socket created
from thge syncookie. Instead of returning the existing socket in
tcp_(v4|v6)_syn_recv_sock and continue the protocol state machine
execution, tcp_(v4|v6)_syn_recv_sock signals that already exists a
socket, and tells tcp_(v4|v6)_rcv to lookup the socket again in the
established connections table.
This new approach fixes the errors reported by Eric for the previous
version of the patch.
  * Also fixes the memory leaks by making sure that the newly created
socket in syn_recv_sock is destroyed in case an already existing
socket exists.

v6 (2020-11-17):
  * Moved the ehash bucket list search for its own helper function.

v5 (2020-11-16):
 - Not considered for review -
 
v4 (2020-11-12):
  * Added `struct sock **esk) parameter to `inet_ehash_insert`.
  * Fixed ref count increment in `inet_ehash_insert`.
  * Fixed callers of inet_ehash_nolisten.

v3 (2020-11-11):
  * Fixed IPv6 handling in inet_ehash_insert
  * Removed unecessary comparison while traversing the ehash bucket
list.
 
v2 (2020-11-09):
  * Changed the author's email domain.
  * Removed the helper function inet_ehash_insert_chk_dup and moved the
logic to the existing inet_ehash_insert.
  * Updated the callers of iner_ehash_nolisten to deal with the new
logic.

 include/net/inet_hashtables.h   |  5 ++-
 net/dccp/ipv4.c |  2 +-
 net/dccp/ipv6.c |  2 +-
 net/ipv4/inet_connection_sock.c |  2 +-
 net/ipv4/inet_hashtables.c  | 68 +
 net/ipv4/tcp_ipv4.c | 15 +++-
 net/ipv6/tcp_ipv6.c | 13 ++-
 7 files changed, 91 insertions(+), 16 deletions(-)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 92560974ea67..ca6a3ea9057e 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -247,8 +247,9 @@ void inet_hashinfo2_init(struct inet_hashinfo *h, const 
char *name,
 unsigned long high_limit);
 int inet_hashinfo2_init_mod(struct inet_hashinfo *h);
 
-bool inet_ehash_insert(struct sock *sk, struct sock *osk);
-bool inet_ehash_nolisten(struct sock *sk, struct sock *osk);
+bool inet_ehash_insert(struct sock *sk, struct sock *osk, bool *found_dup_sk);
+bool inet_ehash_nolisten(struct sock *sk, struct sock *osk,
+bool *found_dup_sk);
 int __inet_hash(struct sock *sk, struct sock *osk);
 int inet_hash(struct sock *sk);
 void inet_unhash(struct sock *sk);
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index 9c28c8251125..098bae35ab76 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -427,7 +427,7 @@ struct sock *dccp_v4_request_recv_sock(const struct sock 
*sk,
 
if (__inet_inherit_port(sk, newsk) < 0)
goto put_and_exit;
-   *own_req = inet_ehash_nolisten(newsk, req_to_sk(req_unhash));
+   *own_req = inet_ehash_nolisten(newsk, req_to_sk(req_unhash), NULL);
if (*own_req)

[PATCH v7] tcp: fix race condition when creating child sockets from syncookies

2020-11-19 Thread Ricardo Dias
When the TCP stack is in SYN flood mode, the server child socket is
created from the SYN cookie received in a TCP packet with the ACK flag
set.

The child socket is created when the server receives the first TCP
packet with a valid SYN cookie from the client. Usually, this packet
corresponds to the final step of the TCP 3-way handshake, the ACK
packet. But is also possible to receive a valid SYN cookie from the
first TCP data packet sent by the client, and thus create a child socket
from that SYN cookie.

Since a client socket is ready to send data as soon as it receives the
SYN+ACK packet from the server, the client can send the ACK packet (sent
by the TCP stack code), and the first data packet (sent by the userspace
program) almost at the same time, and thus the server will equally
receive the two TCP packets with valid SYN cookies almost at the same
instant.

When such event happens, the TCP stack code has a race condition that
occurs between the momement a lookup is done to the established
connections hashtable to check for the existence of a connection for the
same client, and the moment that the child socket is added to the
established connections hashtable. As a consequence, this race condition
can lead to a situation where we add two child sockets to the
established connections hashtable and deliver two sockets to the
userspace program to the same client.

This patch fixes the race condition by checking if an existing child
socket exists for the same client when we are adding the second child
socket to the established connections socket. If an existing child
socket exists, we return that socket and use it to process the TCP
packet received, and discard the second child socket to the same client.

Signed-off-by: Ricardo Dias 
---
v7 (2020-11-19):
  * Changed the approach to re-use the first (existing) socket created
from thge syncookie. Instead of returning the existing socket in
tcp_(v4|v6)_syn_recv_sock and continue the protocol state machine
execution, tcp_(v4|v6)_syn_recv_sock signals that already exists a
socket, and tells tcp_(v4|v6)_rcv to lookup the socket again in the
established connections table.
This new approach fixes the errors reported by Eric for the previous
version of the patch.
  * Also fixes the memory leaks by making sure that the newly created
socket in syn_recv_sock is destroyed in case an already existing
socket exists.

v6 (2020-11-17):
  * Moved the ehash bucket list search for its own helper function.

v5 (2020-11-16):
 - Not considered for review -
 
v4 (2020-11-12):
  * Added `struct sock **esk) parameter to `inet_ehash_insert`.
  * Fixed ref count increment in `inet_ehash_insert`.
  * Fixed callers of inet_ehash_nolisten.

v3 (2020-11-11):
  * Fixed IPv6 handling in inet_ehash_insert
  * Removed unecessary comparison while traversing the ehash bucket
list.
 
v2 (2020-11-09):
  * Changed the author's email domain.
  * Removed the helper function inet_ehash_insert_chk_dup and moved the
logic to the existing inet_ehash_insert.
  * Updated the callers of iner_ehash_nolisten to deal with the new
logic.

 include/net/inet_hashtables.h   |  5 ++-
 include/net/tcp.h   |  9 +++--
 net/dccp/ipv4.c |  2 +-
 net/dccp/ipv6.c |  2 +-
 net/ipv4/inet_connection_sock.c |  2 +-
 net/ipv4/inet_hashtables.c  | 68 +
 net/ipv4/syncookies.c   | 10 +++--
 net/ipv4/tcp_ipv4.c | 41 +---
 net/ipv6/syncookies.c   |  5 ++-
 net/ipv6/tcp_ipv6.c | 39 +--
 10 files changed, 153 insertions(+), 30 deletions(-)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 92560974ea67..ca6a3ea9057e 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -247,8 +247,9 @@ void inet_hashinfo2_init(struct inet_hashinfo *h, const 
char *name,
 unsigned long high_limit);
 int inet_hashinfo2_init_mod(struct inet_hashinfo *h);
 
-bool inet_ehash_insert(struct sock *sk, struct sock *osk);
-bool inet_ehash_nolisten(struct sock *sk, struct sock *osk);
+bool inet_ehash_insert(struct sock *sk, struct sock *osk, bool *found_dup_sk);
+bool inet_ehash_nolisten(struct sock *sk, struct sock *osk,
+bool *found_dup_sk);
 int __inet_hash(struct sock *sk, struct sock *osk);
 int inet_hash(struct sock *sk);
 void inet_unhash(struct sock *sk);
diff --git a/include/net/tcp.h b/include/net/tcp.h
index eab6c7510b5b..ff144d3ead2f 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -465,10 +465,12 @@ void inet_sk_rx_dst_set(struct sock *sk, const struct 
sk_buff *skb);
 /* From syncookies.c */
 struct sock *tcp_get_cookie_sock(struct sock *sk, struct sk_buff *skb,
 struct request_sock *req,
-struct dst_entry *dst, u32 tsoff);
+struct dst_

[PATCH v6] tcp: fix race condition when creating child sockets from syncookies

2020-11-16 Thread Ricardo Dias
When the TCP stack is in SYN flood mode, the server child socket is
created from the SYN cookie received in a TCP packet with the ACK flag
set.

The child socket is created when the server receives the first TCP
packet with a valid SYN cookie from the client. Usually, this packet
corresponds to the final step of the TCP 3-way handshake, the ACK
packet. But is also possible to receive a valid SYN cookie from the
first TCP data packet sent by the client, and thus create a child socket
from that SYN cookie.

Since a client socket is ready to send data as soon as it receives the
SYN+ACK packet from the server, the client can send the ACK packet (sent
by the TCP stack code), and the first data packet (sent by the userspace
program) almost at the same time, and thus the server will equally
receive the two TCP packets with valid SYN cookies almost at the same
instant.

When such event happens, the TCP stack code has a race condition that
occurs between the momement a lookup is done to the established
connections hashtable to check for the existence of a connection for the
same client, and the moment that the child socket is added to the
established connections hashtable. As a consequence, this race condition
can lead to a situation where we add two child sockets to the
established connections hashtable and deliver two sockets to the
userspace program to the same client.

This patch fixes the race condition by checking if an existing child
socket exists for the same client when we are adding the second child
socket to the established connections socket. If an existing child
socket exists, we return that socket and use it to process the TCP
packet received, and discard the second child socket to the same client.

Signed-off-by: Ricardo Dias 
---
v6 (2020-11-17):
  * Moved the ehash bucket list search for its own helper function.

v5 (2020-11-16):
 - Not considered for review -
 
v4 (2020-11-12):
  * Added `struct sock **esk) parameter to `inet_ehash_insert`.
  * Fixed ref count increment in `inet_ehash_insert`.
  * Fixed callers of inet_ehash_nolisten.

v3 (2020-11-11):
  * Fixed IPv6 handling in inet_ehash_insert
  * Removed unecessary comparison while traversing the ehash bucket
list.
 
v2 (2020-11-09):
  * Changed the author's email domain.
  * Removed the helper function inet_ehash_insert_chk_dup and moved the
logic to the existing inet_ehash_insert.
  * Updated the callers of iner_ehash_nolisten to deal with the new
logic.

 include/net/inet_hashtables.h   |  4 +-
 net/dccp/ipv4.c |  2 +-
 net/dccp/ipv6.c |  2 +-
 net/ipv4/inet_connection_sock.c |  2 +-
 net/ipv4/inet_hashtables.c  | 72 +
 net/ipv4/syncookies.c   |  5 ++-
 net/ipv4/tcp_ipv4.c |  9 -
 net/ipv6/tcp_ipv6.c | 16 +++-
 8 files changed, 96 insertions(+), 16 deletions(-)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 92560974ea67..b0abc4dd6d49 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -247,8 +247,8 @@ void inet_hashinfo2_init(struct inet_hashinfo *h, const 
char *name,
 unsigned long high_limit);
 int inet_hashinfo2_init_mod(struct inet_hashinfo *h);
 
-bool inet_ehash_insert(struct sock *sk, struct sock *osk);
-bool inet_ehash_nolisten(struct sock *sk, struct sock *osk);
+bool inet_ehash_insert(struct sock *sk, struct sock *osk, struct sock **esk);
+bool inet_ehash_nolisten(struct sock *sk, struct sock *osk, struct sock **esk);
 int __inet_hash(struct sock *sk, struct sock *osk);
 int inet_hash(struct sock *sk);
 void inet_unhash(struct sock *sk);
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index 9c28c8251125..098bae35ab76 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -427,7 +427,7 @@ struct sock *dccp_v4_request_recv_sock(const struct sock 
*sk,
 
if (__inet_inherit_port(sk, newsk) < 0)
goto put_and_exit;
-   *own_req = inet_ehash_nolisten(newsk, req_to_sk(req_unhash));
+   *own_req = inet_ehash_nolisten(newsk, req_to_sk(req_unhash), NULL);
if (*own_req)
ireq->ireq_opt = NULL;
else
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index ef4ab28cfde0..78ee1b5acf1f 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -533,7 +533,7 @@ static struct sock *dccp_v6_request_recv_sock(const struct 
sock *sk,
dccp_done(newsk);
goto out;
}
-   *own_req = inet_ehash_nolisten(newsk, req_to_sk(req_unhash));
+   *own_req = inet_ehash_nolisten(newsk, req_to_sk(req_unhash), NULL);
/* Clone pktoptions received with SYN, if we own the req */
if (*own_req && ireq->pktopts) {
newnp->pktoptions = skb_clone(ireq->pktopts, GFP_ATOMIC);
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index b457dd2d6c75..df26489e4f6c 100644
--- a/net/ipv4/inet_conn

Re: [PATCH v5] tcp: fix race condition when creating child sockets from syncookies

2020-11-16 Thread Ricardo Dias
Please to not consider this version for review. I'll send a new patch
version soon.




[PATCH v5] tcp: fix race condition when creating child sockets from syncookies

2020-11-16 Thread Ricardo Dias
When the TCP stack is in SYN flood mode, the server child socket is
created from the SYN cookie received in a TCP packet with the ACK flag
set.

The child socket is created when the server receives the first TCP
packet with a valid SYN cookie from the client. Usually, this packet
corresponds to the final step of the TCP 3-way handshake, the ACK
packet. But is also possible to receive a valid SYN cookie from the
first TCP data packet sent by the client, and thus create a child socket
from that SYN cookie.

Since a client socket is ready to send data as soon as it receives the
SYN+ACK packet from the server, the client can send the ACK packet (sent
by the TCP stack code), and the first data packet (sent by the userspace
program) almost at the same time, and thus the server will equally
receive the two TCP packets with valid SYN cookies almost at the same
instant.

When such event happens, the TCP stack code has a race condition that
occurs between the momement a lookup is done to the established
connections hashtable to check for the existence of a connection for the
same client, and the moment that the child socket is added to the
established connections hashtable. As a consequence, this race condition
can lead to a situation where we add two child sockets to the
established connections hashtable and deliver two sockets to the
userspace program to the same client.

This patch fixes the race condition by checking if an existing child
socket exists for the same client when we are adding the second child
socket to the established connections socket. If an existing child
socket exists, we return that socket and use it to process the TCP
packet received, and discard the second child socket to the same client.

Signed-off-by: Ricardo Dias 
---
 include/net/inet_hashtables.h   |  4 +--
 net/dccp/ipv4.c |  2 +-
 net/dccp/ipv6.c |  2 +-
 net/ipv4/inet_connection_sock.c |  2 +-
 net/ipv4/inet_hashtables.c  | 64 -
 net/ipv4/syncookies.c   |  5 ++-
 net/ipv4/tcp_ipv4.c |  9 -
 net/ipv6/tcp_ipv6.c | 16 -
 8 files changed, 88 insertions(+), 16 deletions(-)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 92560974ea67..b0abc4dd6d49 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -247,8 +247,8 @@ void inet_hashinfo2_init(struct inet_hashinfo *h, const 
char *name,
 unsigned long high_limit);
 int inet_hashinfo2_init_mod(struct inet_hashinfo *h);
 
-bool inet_ehash_insert(struct sock *sk, struct sock *osk);
-bool inet_ehash_nolisten(struct sock *sk, struct sock *osk);
+bool inet_ehash_insert(struct sock *sk, struct sock *osk, struct sock **esk);
+bool inet_ehash_nolisten(struct sock *sk, struct sock *osk, struct sock **esk);
 int __inet_hash(struct sock *sk, struct sock *osk);
 int inet_hash(struct sock *sk);
 void inet_unhash(struct sock *sk);
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index 9c28c8251125..098bae35ab76 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -427,7 +427,7 @@ struct sock *dccp_v4_request_recv_sock(const struct sock 
*sk,
 
if (__inet_inherit_port(sk, newsk) < 0)
goto put_and_exit;
-   *own_req = inet_ehash_nolisten(newsk, req_to_sk(req_unhash));
+   *own_req = inet_ehash_nolisten(newsk, req_to_sk(req_unhash), NULL);
if (*own_req)
ireq->ireq_opt = NULL;
else
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index ef4ab28cfde0..78ee1b5acf1f 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -533,7 +533,7 @@ static struct sock *dccp_v6_request_recv_sock(const struct 
sock *sk,
dccp_done(newsk);
goto out;
}
-   *own_req = inet_ehash_nolisten(newsk, req_to_sk(req_unhash));
+   *own_req = inet_ehash_nolisten(newsk, req_to_sk(req_unhash), NULL);
/* Clone pktoptions received with SYN, if we own the req */
if (*own_req && ireq->pktopts) {
newnp->pktoptions = skb_clone(ireq->pktopts, GFP_ATOMIC);
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index b457dd2d6c75..df26489e4f6c 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -787,7 +787,7 @@ static void reqsk_queue_hash_req(struct request_sock *req,
timer_setup(&req->rsk_timer, reqsk_timer_handler, TIMER_PINNED);
mod_timer(&req->rsk_timer, jiffies + timeout);
 
-   inet_ehash_insert(req_to_sk(req), NULL);
+   inet_ehash_insert(req_to_sk(req), NULL, NULL);
/* before letting lookups find us, make sure all req fields
 * are committed to memory and refcnt initialized.
 */
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 239e54474b65..c1eeffd14b59 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -20,6 +20,9 @@

[PATCH v4] tcp: fix race condition when creating child sockets from syncookies

2020-11-13 Thread Ricardo Dias
When the TCP stack is in SYN flood mode, the server child socket is
created from the SYN cookie received in a TCP packet with the ACK flag
set.

The child socket is created when the server receives the first TCP
packet with a valid SYN cookie from the client. Usually, this packet
corresponds to the final step of the TCP 3-way handshake, the ACK
packet. But is also possible to receive a valid SYN cookie from the
first TCP data packet sent by the client, and thus create a child socket
from that SYN cookie.

Since a client socket is ready to send data as soon as it receives the
SYN+ACK packet from the server, the client can send the ACK packet (sent
by the TCP stack code), and the first data packet (sent by the userspace
program) almost at the same time, and thus the server will equally
receive the two TCP packets with valid SYN cookies almost at the same
instant.

When such event happens, the TCP stack code has a race condition that
occurs between the momement a lookup is done to the established
connections hashtable to check for the existence of a connection for the
same client, and the moment that the child socket is added to the
established connections hashtable. As a consequence, this race condition
can lead to a situation where we add two child sockets to the
established connections hashtable and deliver two sockets to the
userspace program to the same client.

This patch fixes the race condition by checking if an existing child
socket exists for the same client when we are adding the second child
socket to the established connections socket. If an existing child
socket exists, we return that socket and use it to process the TCP
packet received, and discard the second child socket to the same client.

Signed-off-by: Ricardo Dias 
---
v4 (2020-11-12):
  * Added `struct sock **esk) parameter to `inet_ehash_insert`.
  * Fixed ref count increment in `inet_ehash_insert`.
  * Fixed callers of inet_ehash_nolisten.

v3 (2020-11-11):
  * Fixed IPv6 handling in inet_ehash_insert
  * Removed unecessary comparison while traversing the ehash bucket
list.
 
v2 (2020-11-09):
  * Changed the author's email domain.
  * Removed the helper function inet_ehash_insert_chk_dup and moved the
logic to the existing inet_ehash_insert.
  * Updated the callers of iner_ehash_nolisten to deal with the new
logic.

 include/net/inet_hashtables.h   |  4 +--
 net/dccp/ipv4.c |  2 +-
 net/dccp/ipv6.c |  2 +-
 net/ipv4/inet_connection_sock.c |  2 +-
 net/ipv4/inet_hashtables.c  | 64 -
 net/ipv4/syncookies.c   |  5 ++-
 net/ipv4/tcp_ipv4.c |  9 -
 net/ipv6/tcp_ipv6.c | 16 -
 8 files changed, 87 insertions(+), 17 deletions(-)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 92560974ea67..b0abc4dd6d49 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -247,8 +247,8 @@ void inet_hashinfo2_init(struct inet_hashinfo *h, const 
char *name,
 unsigned long high_limit);
 int inet_hashinfo2_init_mod(struct inet_hashinfo *h);
 
-bool inet_ehash_insert(struct sock *sk, struct sock *osk);
-bool inet_ehash_nolisten(struct sock *sk, struct sock *osk);
+bool inet_ehash_insert(struct sock *sk, struct sock *osk, struct sock **esk);
+bool inet_ehash_nolisten(struct sock *sk, struct sock *osk, struct sock **esk);
 int __inet_hash(struct sock *sk, struct sock *osk);
 int inet_hash(struct sock *sk);
 void inet_unhash(struct sock *sk);
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index 9c28c8251125..098bae35ab76 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -427,7 +427,7 @@ struct sock *dccp_v4_request_recv_sock(const struct sock 
*sk,
 
if (__inet_inherit_port(sk, newsk) < 0)
goto put_and_exit;
-   *own_req = inet_ehash_nolisten(newsk, req_to_sk(req_unhash));
+   *own_req = inet_ehash_nolisten(newsk, req_to_sk(req_unhash), NULL);
if (*own_req)
ireq->ireq_opt = NULL;
else
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index ef4ab28cfde0..78ee1b5acf1f 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -533,7 +533,7 @@ static struct sock *dccp_v6_request_recv_sock(const struct 
sock *sk,
dccp_done(newsk);
goto out;
}
-   *own_req = inet_ehash_nolisten(newsk, req_to_sk(req_unhash));
+   *own_req = inet_ehash_nolisten(newsk, req_to_sk(req_unhash), NULL);
/* Clone pktoptions received with SYN, if we own the req */
if (*own_req && ireq->pktopts) {
newnp->pktoptions = skb_clone(ireq->pktopts, GFP_ATOMIC);
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index b457dd2d6c75..df26489e4f6c 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -787,7 +787,7 @@ static void reqsk_queue_hash_req(struct request_sock *r

Re: [PATCH v3] tcp: fix race condition when creating child sockets from syncookies

2020-11-11 Thread Ricardo Dias
On Wed, Nov 11, 2020 at 05:05:11PM +0100, Eric Dumazet wrote:
> On Wed, Nov 11, 2020 at 8:35 AM Ricardo Dias  wrote:
>
> ...
> 
> > + * If an existing socket already exists, it returns that socket
> > + * through the osk parameter.
> 
> 
> I think this deserves a third parameter, to avoid confusion.

Yes, I agree. It will make much clearer the semantics of this function.

> ...
>
> > +   if 
> > (unlikely(!refcount_inc_not_zero(&esk->sk_refcnt)))
> 
> Can you explain how this could happen ?
> Again, we own the lock here, finding a socket in ehash, with a zero
> refcount can not happen.
> 
> Only a true rcu lookup could see a zero refcount.

Ah! Now I understand what you are saying. Indeed we just need to
increment the ref counter. We don't need to check if it's zero.

> ...
>
> > -   inet_ehash_nolisten(sk, (struct sock *)tw);
> > +   inet_ehash_nolisten(sk, (struct sock **)&tw);
> 
> Why could not we 'find' a prior socket, and since you ignore the result,
> leave a refcount increment, and a socket leak ?
> 
> Same remark for all callers.
> 
> > }
> > if (tw)
> 
> See the problem here ?
> 
> If tw was initially NULL, then inet_ehash_nolisten() could have overwritten tw
> with another socket.
> 
> Surely calling inet_twsk_bind_unhash() could be disastrous.

Yes, you're right.
I adapted this call to match the new signature assuming `tw` would never
be NULL. Shame on me for not noticing the `if (tw)` statement right
below :-).

I'll check each caller code, to make sure I'm not leaking the socket.


Thanks for your reviews. I'm learning more and more with each review
iteration.



[PATCH v3] tcp: fix race condition when creating child sockets from syncookies

2020-11-10 Thread Ricardo Dias
When the TCP stack is in SYN flood mode, the server child socket is
created from the SYN cookie received in a TCP packet with the ACK flag
set.

The child socket is created when the server receives the first TCP
packet with a valid SYN cookie from the client. Usually, this packet
corresponds to the final step of the TCP 3-way handshake, the ACK
packet. But is also possible to receive a valid SYN cookie from the
first TCP data packet sent by the client, and thus create a child socket
from that SYN cookie.

Since a client socket is ready to send data as soon as it receives the
SYN+ACK packet from the server, the client can send the ACK packet (sent
by the TCP stack code), and the first data packet (sent by the userspace
program) almost at the same time, and thus the server will equally
receive the two TCP packets with valid SYN cookies almost at the same
instant.

When such event happens, the TCP stack code has a race condition that
occurs between the momement a lookup is done to the established
connections hashtable to check for the existence of a connection for the
same client, and the moment that the child socket is added to the
established connections hashtable. As a consequence, this race condition
can lead to a situation where we add two child sockets to the
established connections hashtable and deliver two sockets to the
userspace program to the same client.

This patch fixes the race condition by checking if an existing child
socket exists for the same client when we are adding the second child
socket to the established connections socket. If an existing child
socket exists, we return that socket and use it to process the TCP
packet received, and discard the second child socket to the same client.

Signed-off-by: Ricardo Dias 
---
v3 (2020-11-11):
  * Fixed IPv6 handling in inet_ehash_insert
  * Removed unecessary comparison while traversing the ehash bucket
list.
 
v2 (2020-11-09):
  * Changed the author's email domain.
  * Removed the helper function inet_ehash_insert_chk_dup and moved the
logic to the existing inet_ehash_insert.
  * Updated the callers of iner_ehash_nolisten to deal with the new
logic.

 include/net/inet_hashtables.h |  6 +--
 net/dccp/ipv4.c   |  4 +-
 net/dccp/ipv6.c   |  4 +-
 net/ipv4/inet_hashtables.c| 69 ++-
 net/ipv4/syncookies.c |  5 ++-
 net/ipv4/tcp_ipv4.c   | 10 -
 net/ipv6/tcp_ipv6.c   | 17 -
 7 files changed, 97 insertions(+), 18 deletions(-)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 92560974ea67..dffa345d52a7 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -247,9 +247,9 @@ void inet_hashinfo2_init(struct inet_hashinfo *h, const 
char *name,
 unsigned long high_limit);
 int inet_hashinfo2_init_mod(struct inet_hashinfo *h);
 
-bool inet_ehash_insert(struct sock *sk, struct sock *osk);
-bool inet_ehash_nolisten(struct sock *sk, struct sock *osk);
-int __inet_hash(struct sock *sk, struct sock *osk);
+bool inet_ehash_insert(struct sock *sk, struct sock **osk);
+bool inet_ehash_nolisten(struct sock *sk, struct sock **osk);
+int __inet_hash(struct sock *sk, struct sock **osk);
 int inet_hash(struct sock *sk);
 void inet_unhash(struct sock *sk);
 
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index 9c28c8251125..99bbba478991 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -400,6 +400,7 @@ struct sock *dccp_v4_request_recv_sock(const struct sock 
*sk,
struct inet_request_sock *ireq;
struct inet_sock *newinet;
struct sock *newsk;
+   struct sock *osk;
 
if (sk_acceptq_is_full(sk))
goto exit_overflow;
@@ -427,7 +428,8 @@ struct sock *dccp_v4_request_recv_sock(const struct sock 
*sk,
 
if (__inet_inherit_port(sk, newsk) < 0)
goto put_and_exit;
-   *own_req = inet_ehash_nolisten(newsk, req_to_sk(req_unhash));
+   osk = req_to_sk(req_unhash);
+   *own_req = inet_ehash_nolisten(newsk, &osk);
if (*own_req)
ireq->ireq_opt = NULL;
else
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index ef4ab28cfde0..91a825c00a97 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -407,6 +407,7 @@ static struct sock *dccp_v6_request_recv_sock(const struct 
sock *sk,
struct inet_sock *newinet;
struct dccp6_sock *newdp6;
struct sock *newsk;
+   struct sock *osk;
 
if (skb->protocol == htons(ETH_P_IP)) {
/*
@@ -533,7 +534,8 @@ static struct sock *dccp_v6_request_recv_sock(const struct 
sock *sk,
dccp_done(newsk);
goto out;
}
-   *own_req = inet_ehash_nolisten(newsk, req_to_sk(req_unhash));
+   osk = req_to_sk(req_unhash);
+   *own_req = inet_ehash_nolisten(newsk, &osk);
/* Clone pktoptions received with SYN, if we own the req */
  

[PATCH v2] tcp: fix race condition when creating child sockets from syncookies

2020-11-09 Thread Ricardo Dias
When the TCP stack is in SYN flood mode, the server child socket is
created from the SYN cookie received in a TCP packet with the ACK flag
set.

The child socket is created when the server receives the first TCP
packet with a valid SYN cookie from the client. Usually, this packet
corresponds to the final step of the TCP 3-way handshake, the ACK
packet. But is also possible to receive a valid SYN cookie from the
first TCP data packet sent by the client, and thus create a child socket
from that SYN cookie.

Since a client socket is ready to send data as soon as it receives the
SYN+ACK packet from the server, the client can send the ACK packet (sent
by the TCP stack code), and the first data packet (sent by the userspace
program) almost at the same time, and thus the server will equally
receive the two TCP packets with valid SYN cookies almost at the same
instant.

When such event happens, the TCP stack code has a race condition that
occurs between the momement a lookup is done to the established
connections hashtable to check for the existence of a connection for the
same client, and the moment that the child socket is added to the
established connections hashtable. As a consequence, this race condition
can lead to a situation where we add two child sockets to the
established connections hashtable and deliver two sockets to the
userspace program to the same client.

This patch fixes the race condition by checking if an existing child
socket exists for the same client when we are adding the second child
socket to the established connections socket. If an existing child
socket exists, we return that socket and use it to process the TCP
packet received, and discard the second child socket to the same client.

Signed-off-by: Ricardo Dias 
Reported-by: kernel test robot 
---
v2 (2020-11-09):
* Changed the author's email domain.
* Removed the helper function inet_ehash_insert_chk_dup and moved the
  logic to the existing inet_ehash_insert.
* Updated the callers of iner_ehash_nolisten to deal with the new
  logic.


 include/net/inet_hashtables.h |  6 ++--
 net/dccp/ipv4.c   |  4 ++-
 net/dccp/ipv6.c   |  4 ++-
 net/ipv4/inet_hashtables.c| 63 +--
 net/ipv4/syncookies.c |  5 ++-
 net/ipv4/tcp_ipv4.c   | 12 ++-
 net/ipv6/tcp_ipv6.c   | 19 ++-
 7 files changed, 94 insertions(+), 19 deletions(-)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 92560974ea67..dffa345d52a7 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -247,9 +247,9 @@ void inet_hashinfo2_init(struct inet_hashinfo *h, const 
char *name,
 unsigned long high_limit);
 int inet_hashinfo2_init_mod(struct inet_hashinfo *h);
 
-bool inet_ehash_insert(struct sock *sk, struct sock *osk);
-bool inet_ehash_nolisten(struct sock *sk, struct sock *osk);
-int __inet_hash(struct sock *sk, struct sock *osk);
+bool inet_ehash_insert(struct sock *sk, struct sock **osk);
+bool inet_ehash_nolisten(struct sock *sk, struct sock **osk);
+int __inet_hash(struct sock *sk, struct sock **osk);
 int inet_hash(struct sock *sk);
 void inet_unhash(struct sock *sk);
 
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index 9c28c8251125..99bbba478991 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -400,6 +400,7 @@ struct sock *dccp_v4_request_recv_sock(const struct sock 
*sk,
struct inet_request_sock *ireq;
struct inet_sock *newinet;
struct sock *newsk;
+   struct sock *osk;
 
if (sk_acceptq_is_full(sk))
goto exit_overflow;
@@ -427,7 +428,8 @@ struct sock *dccp_v4_request_recv_sock(const struct sock 
*sk,
 
if (__inet_inherit_port(sk, newsk) < 0)
goto put_and_exit;
-   *own_req = inet_ehash_nolisten(newsk, req_to_sk(req_unhash));
+   osk = req_to_sk(req_unhash);
+   *own_req = inet_ehash_nolisten(newsk, &osk);
if (*own_req)
ireq->ireq_opt = NULL;
else
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index ef4ab28cfde0..91a825c00a97 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -407,6 +407,7 @@ static struct sock *dccp_v6_request_recv_sock(const struct 
sock *sk,
struct inet_sock *newinet;
struct dccp6_sock *newdp6;
struct sock *newsk;
+   struct sock *osk;
 
if (skb->protocol == htons(ETH_P_IP)) {
/*
@@ -533,7 +534,8 @@ static struct sock *dccp_v6_request_recv_sock(const struct 
sock *sk,
dccp_done(newsk);
goto out;
}
-   *own_req = inet_ehash_nolisten(newsk, req_to_sk(req_unhash));
+   osk = req_to_sk(req_unhash);
+   *own_req = inet_ehash_nolisten(newsk, &osk);
/* Clone pktoptions received with SYN, if we own the req */
if (*own_req && ireq->pktopts) {
newnp->pktoptions = skb_clone(ireq->pkto

Re: [PATCH] tcp: fix race condition when creating child sockets from syncookies

2020-10-23 Thread Ricardo Dias
On Fri, Oct 23, 2020 at 06:51:05PM +0200, Eric Dumazet wrote:
> On Fri, Oct 23, 2020 at 6:48 PM Ricardo Dias  wrote:
> >
> 
> > In that case, I can change the patch to only iterate the ehash bucket
> > only when the listening socket is using the loopback interface, correct?
> 
> No, the fix should be generic.
> 
> We could still have bad NIC, or bad configuration, spreading packets
> on random cpus ;)

Ah, got it!



Re: [PATCH] tcp: fix race condition when creating child sockets from syncookies

2020-10-23 Thread Ricardo Dias
On Fri, Oct 23, 2020 at 06:36:29PM +0200, Eric Dumazet wrote:
> On Fri, Oct 23, 2020 at 6:06 PM Ricardo Dias  wrote:
> 
> > And what about the loopback interface? Why couldn't the loopback
> > interface also use a single RX queue?
> >
> 
> Loopback is using a per-cpu queue, with no crossing, for efficiency.
> 
> That means : whenever a packet is sent on lo interface from CPU X, it
> is put on CPU X backlog queue.
> 
> If the connect() and sendmsg() are run from different cpus, then the
> ACK (from last packet of 3WH) and the data packet might land on
> different queues.

In that case, I can change the patch to only iterate the ehash bucket
only when the listening socket is using the loopback interface, correct?




Re: [PATCH] tcp: fix race condition when creating child sockets from syncookies

2020-10-23 Thread Ricardo Dias
On Fri, Oct 23, 2020 at 05:56:07PM +0200, Eric Dumazet wrote:
> On Fri, Oct 23, 2020 at 5:51 PM Ricardo Dias  wrote:
> >
> > On Fri, Oct 23, 2020 at 04:03:27PM +0200, Eric Dumazet wrote:
> > > On Fri, Oct 23, 2020 at 1:14 PM Ricardo Dias  wrote:
> > >
> > > ...
> > >
> > > Note that normally, all packets for the same 4-tuple should be handled
> > > by the same cpu,
> > > so this race is quite unlikely to happen in standard setups.
> >
> > I was able to write a small client/server program that used the
> > loopback interface to create connections, which could hit the race
> > condition in 1/200 runs.
> >
> > The server when accepts a connection sends an 8 byte identifier to
> > the client, and then waits for the client to echo the same identifier.
> > The client creates hundreds of simultaneous connections to the server,
> > and in each connection it sends one byte as soon as the connection is
> > established, then reads the 8 byte identifier from the server and sends
> > it back to the server.
> >
> > When we hit the race condition, one of the server connections gets an 8
> > byte identifier different from its own identifier.
> 
> That is on loopback, right ?

Yes it's on loopback.

> 
> A server under syn flood is usually hit on a physical NIC, and a NIC
> will always put all packets of a TCP flow in a single RX queue.
> The cpu associated with this single RX queue won't process two packets
> in parallel.

And what about the loopback interface? Why couldn't the loopback
interface also use a single RX queue?

> 
> Note this issue is known, someone tried to fix it in the past but the
> attempt went nowhere.


Re: [PATCH] tcp: fix race condition when creating child sockets from syncookies

2020-10-23 Thread Ricardo Dias
On Fri, Oct 23, 2020 at 04:03:27PM +0200, Eric Dumazet wrote:
> On Fri, Oct 23, 2020 at 1:14 PM Ricardo Dias  wrote:
> >
> > When the TCP stack is in SYN flood mode, the server child socket is
> > created from the SYN cookie received in a TCP packet with the ACK flag
> > set.
> >
> ...
> 
> This patch only handles IPv4, unless I am missing something ?

Yes, currently the patch only handles IPv4. I'll improve it to also
handle the IPv6 case. 

> 
> It looks like the fix should be done in inet_ehash_insert(), not
> adding yet another helper in TCP.
> This would be family generic.

Ok, sounds good as long as there is not problem in changing the
signature and semantics of the inet_ehash_insert() function, as well as
changing the inet_ehash_nolisten() function.

> 
> Note that normally, all packets for the same 4-tuple should be handled
> by the same cpu,
> so this race is quite unlikely to happen in standard setups.

I was able to write a small client/server program that used the
loopback interface to create connections, which could hit the race
condition in 1/200 runs.

The server when accepts a connection sends an 8 byte identifier to
the client, and then waits for the client to echo the same identifier.
The client creates hundreds of simultaneous connections to the server,
and in each connection it sends one byte as soon as the connection is
established, then reads the 8 byte identifier from the server and sends
it back to the server.

When we hit the race condition, one of the server connections gets an 8
byte identifier different from its own identifier.



[PATCH] tcp: fix race condition when creating child sockets from syncookies

2020-10-23 Thread Ricardo Dias
When the TCP stack is in SYN flood mode, the server child socket is
created from the SYN cookie received in a TCP packet with the ACK flag
set.

The child socket is created when the server receives the first TCP
packet with a valid SYN cookie from the client. Usually, this packet
corresponds to the final step of the TCP 3-way handshake, the ACK
packet. But is also possible to receive a valid SYN cookie from the
first TCP data packet sent by the client, and thus create a child socket
from that SYN cookie.

Since a client socket is ready to send data as soon as it receives the
SYN+ACK packet from the server, the client can send the ACK packet (sent
by the TCP stack code), and the first data packet (sent by the userspace
program) almost at the same time, and thus the server will equally
receive the two TCP packets with valid SYN cookies almost at the same
instant.

When such event happens, the TCP stack code has a race condition that
occurs between the momement a lookup is done to the established
connections hashtable to check for the existence of a connection for the
same client, and the moment that the child socket is added to the
established connections hashtable. As a consequence, this race condition
can lead to a situation where we add two child sockets to the
established connections hashtable and deliver two sockets to the
userspace program to the same client.

This patch fixes the race condition by checking if an existing child
socket exists for the same client when we are adding the second child
socket to the established connections socket. If an existing child
socket exists, we return that socket and use it to process the TCP
packet received, and discard the second child socket to the same client.

Signed-off-by: Ricardo Dias 
---
 include/net/inet_hashtables.h |  1 +
 net/ipv4/inet_hashtables.c| 59 +++
 net/ipv4/syncookies.c |  5 ++-
 net/ipv4/tcp_ipv4.c   | 16 +-
 4 files changed, 79 insertions(+), 2 deletions(-)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 92560974ea67..e1d7fd20468a 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -248,6 +248,7 @@ void inet_hashinfo2_init(struct inet_hashinfo *h, const 
char *name,
 int inet_hashinfo2_init_mod(struct inet_hashinfo *h);
 
 bool inet_ehash_insert(struct sock *sk, struct sock *osk);
+struct sock *inet_ehash_insert_chk_dup(struct sock *sk);
 bool inet_ehash_nolisten(struct sock *sk, struct sock *osk);
 int __inet_hash(struct sock *sk, struct sock *osk);
 int inet_hash(struct sock *sk);
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 239e54474b65..5dbe3aa291e6 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -539,6 +539,65 @@ bool inet_ehash_insert(struct sock *sk, struct sock *osk)
return ret;
 }
 
+/* Inserts a socket into ehash if no existing socket exists for the same
+ * quadruple (saddr, sport, daddr, dport).
+ * If there is an existing socket, returns that socket, otherwise returns NULL.
+ */
+struct sock *inet_ehash_insert_chk_dup(struct sock *sk)
+{
+   struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo;
+   struct hlist_nulls_head *list;
+   struct inet_ehash_bucket *head;
+   const struct hlist_nulls_node *node;
+   struct sock *esk;
+   spinlock_t *lock; /* protects hashinfo socket entry */
+   struct net *net = sock_net(sk);
+   const int dif, sdif = sk->sk_bound_dev_if;
+
+   INET_ADDR_COOKIE(acookie, sk->sk_daddr, sk->sk_rcv_saddr);
+   const __portpair ports = INET_COMBINED_PORTS(sk->sk_dport, sk->sk_num);
+
+   WARN_ON_ONCE(!sk_unhashed(sk));
+
+   sk->sk_hash = sk_ehashfn(sk);
+   head = inet_ehash_bucket(hashinfo, sk->sk_hash);
+   list = &head->chain;
+   lock = inet_ehash_lockp(hashinfo, sk->sk_hash);
+
+   spin_lock(lock);
+begin:
+   sk_nulls_for_each_rcu(esk, node, list) {
+   if (esk->sk_hash != sk->sk_hash)
+   continue;
+   if (likely(INET_MATCH(esk, net, acookie,
+ sk->sk_daddr, sk->sk_rcv_saddr, ports,
+ dif, sdif))) {
+   if (unlikely(!refcount_inc_not_zero(&esk->sk_refcnt)))
+   goto out;
+   if (unlikely(!INET_MATCH(esk, net, acookie,
+sk->sk_daddr,
+sk->sk_rcv_saddr, ports,
+dif, sdif))) {
+   sock_gen_put(esk);
+   goto begin;
+   }
+   goto found;
+   }
+   }
+out:
+   esk = NULL;
+   __sk_nulls_add_node_rcu(sk, list);
+found:
+   spin_unlock(l