Re: [PATCH V2 net-next 1/2] tcp: send in-queue bytes in cmsg upon read

2018-04-30 Thread Soheil Hassas Yeganeh
On Mon, Apr 30, 2018 at 12:10 PM, David Miller  wrote:
> From: Eric Dumazet 
> Date: Mon, 30 Apr 2018 09:01:47 -0700
>
>> TCP sockets are really read by a single thread (or synchronized
>> threads); otherwise garbage is ensured, regardless of how the kernel
>> ensures locking while reporting "queue length".
>
> Whatever applications "typically do", we should never return
> garbage, and that is what this code is allowing to happen.
>
> Everything else in recvmsg() operates on state under the proper socket
> lock, to ensure consistency.
>
> The only reason we are releasing the socket lock first is to make sure
> the backlog is processed and we have the most up-to-date information
> available.
>
> It seems like one is striving for correctness and better accuracy, no?
> :-)
>
> Look, this can be fixed really simply.  And if you are worried about
> unbounded loops if two apps maliciously do recvmsg() in parallel,
> then don't even loop, just fallback to full socket locking and make
> the "non-typical" application pay the price:
>
> tmp1 = A;
> tmp2 = B;
> barrier();
> tmp3 = A;
> if (unlikely(tmp1 != tmp3)) {
>         lock_sock(sk);
>         tmp1 = A;
>         tmp2 = B;
>         release_sock(sk);
> }
>
> I'm seriously not applying the patch as-is, sorry.  This issue
> must be addressed somehow.

Thank you, David, for the suggestion. Sure, I'll send a V3 with what you
suggested above.

Thanks,
Soheil
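
For reference, a minimal sketch of how the suggested fallback could be
applied to the in-queue computation; tp->copied_seq and tp->rcv_nxt stand
in for the abstract A and B above, and tcp_inq_hint_sketch() is a
hypothetical helper, not the actual V3 code:

	/* Sketch only (it would live next to tcp_inq_hint()): sample
	 * copied_seq/rcv_nxt locklessly, and fall back to the full socket
	 * lock if copied_seq moved underneath us, e.g. because of a
	 * parallel recvmsg() on the same socket.
	 */
	static int tcp_inq_hint_sketch(struct sock *sk)
	{
		const struct tcp_sock *tp = tcp_sk(sk);
		u32 copied_seq = READ_ONCE(tp->copied_seq);
		u32 rcv_nxt = READ_ONCE(tp->rcv_nxt);
		int inq = rcv_nxt - copied_seq;

		barrier();
		if (unlikely(inq < 0 || copied_seq != READ_ONCE(tp->copied_seq))) {
			lock_sock(sk);
			inq = tp->rcv_nxt - tp->copied_seq;
			release_sock(sk);
		}
		return inq;
	}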


Re: [PATCH V2 net-next 1/2] tcp: send in-queue bytes in cmsg upon read

2018-04-30 Thread David Miller
From: Eric Dumazet 
Date: Mon, 30 Apr 2018 09:01:47 -0700

> TCP sockets are really read by a single thread (or synchronized
> threads); otherwise garbage is ensured, regardless of how the kernel
> ensures locking while reporting "queue length".

Whatever applications "typically do", we should never return
garbage, and that is what this code is allowing to happen.

Everything else in recvmsg() operates on state under the proper socket
lock, to ensure consistency.

The only reason we are releasing the socket lock first is to make sure
the backlog is processed and we have the most up-to-date information
available.

It seems like one is striving for correctness and better accuracy, no?
:-)

Look, this can be fixed really simply.  And if you are worried about
unbounded loops if two apps maliciously do recvmsg() in parallel,
then don't even loop, just fallback to full socket locking and make
the "non-typical" application pay the price:

tmp1 = A;
tmp2 = B;
barrier();
tmp3 = A;
if (unlikely(tmp1 != tmp3)) {
        lock_sock(sk);
        tmp1 = A;
        tmp2 = B;
        release_sock(sk);
}

I'm seriously not applying the patch as-is, sorry.  This issue
must be addressed somehow.

Thank you.


Re: [PATCH V2 net-next 1/2] tcp: send in-queue bytes in cmsg upon read

2018-04-30 Thread Eric Dumazet


On 04/30/2018 08:56 AM, David Miller wrote:
> From: Eric Dumazet 
> Date: Mon, 30 Apr 2018 08:43:50 -0700
> 
>> I say sort of, because by the time we have any number, TCP might
>> have received more packets anyway.
> 
> That's fine.
> 
> However, the number reported should have been true at least at some
> finite point in time.
> 
> If you allow overlapping changes to either of the two variables during
> the sampling, then you are reporting a number which was never true at
> any point in time.
> 
> It is essentially garbage.


Correct.

TCP sockets are really read by a single thread (or synchronized threads);
otherwise garbage is ensured, regardless of how the kernel ensures locking
while reporting "queue length".




Re: [PATCH V2 net-next 1/2] tcp: send in-queue bytes in cmsg upon read

2018-04-30 Thread Soheil Hassas Yeganeh
On Mon, Apr 30, 2018 at 11:43 AM, Eric Dumazet  wrote:
> On 04/30/2018 08:38 AM, David Miller wrote:
>> From: Soheil Hassas Yeganeh 
>> Date: Fri, 27 Apr 2018 14:57:32 -0400
>>
>>> Since the socket lock is not held when calculating the size of
>>> receive queue, TCP_INQ is a hint.  For example, it can overestimate
>>> the queue size by one byte, if FIN is received.
>>
>> I think it is even worse than that.
>>
>> If another application comes in and does a recvmsg() in parallel with
>> these calculations, you could even report a negative value.

Thank you, David. In addition to Eric's point, for TCP specifically,
it is quite uncommon to have multiple threads calling recvmsg() on
the same socket in parallel, because the application is interested in
the streamed, in-sequence bytes. Except when the application just
wants to discard the incoming stream or has predefined frame sizes,
this wouldn't be an issue. And for such cases, the proposed INQ hint
is not going to be useful anyway.

Could you please let me know whether you have any other example in mind?

Thanks!
Soheil


Re: [PATCH V2 net-next 1/2] tcp: send in-queue bytes in cmsg upon read

2018-04-30 Thread David Miller
From: Eric Dumazet 
Date: Mon, 30 Apr 2018 08:43:50 -0700

> I say sort of, because by the time we have any number, TCP might
> have received more packets anyway.

That's fine.

However, the number reported should have been true at least at some
finite point in time.

If you allow overlapping changes to either of the two variables during
the sampling, then you are reporting a number which was never true at
any point in time.

It is essentially garbage.


Re: [PATCH V2 net-next 1/2] tcp: send in-queue bytes in cmsg upon read

2018-04-30 Thread Eric Dumazet


On 04/30/2018 08:38 AM, David Miller wrote:
> From: Soheil Hassas Yeganeh 
> Date: Fri, 27 Apr 2018 14:57:32 -0400
> 
>> Since the socket lock is not held when calculating the size of
>> receive queue, TCP_INQ is a hint.  For example, it can overestimate
>> the queue size by one byte, if FIN is received.
> 
> I think it is even worse than that.
> 
> If another application comes in and does a recvmsg() in parallel with
> these calculations, you could even report a negative value.
> 
> These READ_ONCE() make it look like some of these issues are being
> addressed but they are not.
> 
> You could freeze the values just by taking sk->sk_lock.slock, but I
> don't know if that cost is considered acceptable or not.
> 
> Another idea is to sample both values in a loop, similar to a sequence
> lock sequence:
> 
> again:
>         tmp1 = A;
>         tmp2 = B;
>         barrier();
>         tmp3 = A;
>         if (tmp1 != tmp3)
>                 goto again;
> 
> But the current state of affairs is not going to work well.
> 

We want a hint, and max_t(int, 0, ...) does not return a negative value?

If the hint is wrong in 0.1% of the cases, we really do not care; it is
not meant to replace the existing precise (well, sort of) mechanism.

I say sort of, because by the time we have any number, TCP might have received 
more packets anyway.



Re: [PATCH V2 net-next 1/2] tcp: send in-queue bytes in cmsg upon read

2018-04-30 Thread David Miller
From: Soheil Hassas Yeganeh 
Date: Fri, 27 Apr 2018 14:57:32 -0400

> Since the socket lock is not held when calculating the size of
> receive queue, TCP_INQ is a hint.  For example, it can overestimate
> the queue size by one byte, if FIN is received.

I think it is even worse than that.

If another application comes in and does a recvmsg() in parallel with
these calculations, you could even report a negative value.

These READ_ONCE() make it look like some of these issues are being
addressed but they are not.

You could freeze the values just by taking sk->sk_lock.slock, but I
don't know if that cost is considered acceptable or not.

Another idea is to sample both values in a loop, similar to a sequence
lock sequence:

again:
        tmp1 = A;
        tmp2 = B;
        barrier();
        tmp3 = A;
        if (tmp1 != tmp3)
                goto again;

But the current state of affairs is not going to work well.
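
For illustration, a minimal C sketch of the sequence-lock-style sampling
suggested above, with tp->copied_seq and tp->rcv_nxt standing in for the
abstract A and B (an assumption; the email deliberately keeps them
abstract):

	/* Sketch only: retry the lockless sample until "A" (copied_seq here)
	 * is observed unchanged across the read of "B" (rcv_nxt), so the
	 * pair was true at some point in time.
	 */
	static u32 tcp_inq_sample_sketch(const struct tcp_sock *tp)
	{
		u32 copied_seq, rcv_nxt;

		do {
			copied_seq = READ_ONCE(tp->copied_seq);
			rcv_nxt = READ_ONCE(tp->rcv_nxt);
			barrier();
		} while (copied_seq != READ_ONCE(tp->copied_seq));

		return rcv_nxt - copied_seq;
	}

As noted later in the thread, this can loop indefinitely if another reader
keeps advancing copied_seq, which is why the follow-up suggestion falls
back to taking the socket lock instead of looping.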


[PATCH V2 net-next 1/2] tcp: send in-queue bytes in cmsg upon read

2018-04-27 Thread Soheil Hassas Yeganeh
From: Soheil Hassas Yeganeh 

Applications with many concurrent connections, high variance
in receive queue length, and tight memory bounds cannot
allocate worst-case buffer sizes to drain sockets. Knowing
the size of the receive queue, applications can optimize
how they allocate buffers to read from the socket.

The number of bytes pending on the socket is directly
available through ioctl(FIONREAD/SIOCINQ) and can be
approximated using getsockopt(MEMINFO) (rmem_alloc includes
skb overheads in addition to application data). But both of
these options add an extra syscall per recvmsg(). Moreover,
ioctl(FIONREAD/SIOCINQ) takes the socket lock.
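
For comparison, a hedged userspace sketch of the pre-existing per-call
alternative mentioned above (error handling kept minimal):

	/* One extra syscall per read: ask the kernel for the queue size. */
	#include <sys/ioctl.h>
	#include <linux/sockios.h>	/* SIOCINQ (same value as FIONREAD) */

	static int bytes_in_queue(int fd)
	{
		int inq = 0;

		if (ioctl(fd, SIOCINQ, &inq) < 0)
			return -1;
		return inq;
	}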

Add the TCP_INQ socket option to TCP. When this socket
option is set, recvmsg() relays the number of bytes available
on the socket for reading to the application via the
TCP_CM_INQ control message.

Calculate the number of bytes after releasing the socket lock
to include the processed backlog, if any. To avoid an extra
branch in the hot path of recvmsg() for this new control
message, move all cmsg processing inside an existing branch for
processing receive timestamps. Since the socket lock is not held
when calculating the size of the receive queue, TCP_INQ is a hint.
For example, it can overestimate the queue size by one byte
if a FIN is received.

With this method, applications can start reading from the socket
using a small buffer, and then use larger buffers based on the
remaining data when needed.
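
As an illustration of that flow, a hedged userspace sketch (not part of
this patch) that enables TCP_INQ and picks the hint out of the recvmsg()
control messages; it assumes the hint is delivered as an int-sized cmsg at
level IPPROTO_TCP (== SOL_TCP) with type TCP_CM_INQ:

	#include <netinet/in.h>
	#include <netinet/tcp.h>
	#include <string.h>
	#include <sys/socket.h>

	#ifndef TCP_INQ
	#define TCP_INQ 35
	#define TCP_CM_INQ TCP_INQ
	#endif

	static void enable_inq(int fd)
	{
		int one = 1;

		setsockopt(fd, IPPROTO_TCP, TCP_INQ, &one, sizeof(one));
	}

	/* Read into a small buffer first; *inq reports how many bytes are
	 * still queued (or -1 if no hint was attached to this call).
	 */
	static ssize_t recv_with_inq(int fd, void *buf, size_t len, int *inq)
	{
		char cbuf[CMSG_SPACE(sizeof(int))];
		struct iovec iov = { .iov_base = buf, .iov_len = len };
		struct msghdr msg = {
			.msg_iov = &iov, .msg_iovlen = 1,
			.msg_control = cbuf, .msg_controllen = sizeof(cbuf),
		};
		struct cmsghdr *cm;
		ssize_t ret = recvmsg(fd, &msg, 0);

		*inq = -1;
		if (ret < 0)
			return ret;
		for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm))
			if (cm->cmsg_level == IPPROTO_TCP &&
			    cm->cmsg_type == TCP_CM_INQ)
				memcpy(inq, CMSG_DATA(cm), sizeof(*inq));
		return ret;
	}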

Signed-off-by: Soheil Hassas Yeganeh 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Willem de Bruijn 
Reviewed-by: Eric Dumazet 
Reviewed-by: Neal Cardwell 
---
 include/linux/tcp.h      |  2 +-
 include/net/tcp.h        |  8 ++++++++
 include/uapi/linux/tcp.h |  3 +++
 net/ipv4/tcp.c           | 27 +++++++++++++++++++++++----
 4 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 20585d5c4e1c3..807776928cb86 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -228,7 +228,7 @@ struct tcp_sock {
unused:2;
u8  nonagle : 4,/* Disable Nagle algorithm? */
thin_lto: 1,/* Use linear timeouts for thin streams */
-   unused1 : 1,
+   recvmsg_inq : 1,/* Indicate # of bytes in queue upon recvmsg */
repair  : 1,
frto: 1;/* F-RTO (RFC5682) activated in CA_Loss */
u8  repair_queue;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 833154e3df173..0986836b5df5b 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1951,6 +1951,14 @@ static inline int tcp_inq(struct sock *sk)
return answ;
 }
 
+static inline int tcp_inq_hint(const struct sock *sk)
+{
+   const struct tcp_sock *tp = tcp_sk(sk);
+
+   return max_t(int, 0,
+                    READ_ONCE(tp->rcv_nxt) - READ_ONCE(tp->copied_seq));
+}
+
 int tcp_peek_len(struct socket *sock);
 
 static inline void tcp_segs_in(struct tcp_sock *tp, const struct sk_buff *skb)
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index 379b08700a542..d4cdd25a7bd48 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -122,6 +122,9 @@ enum {
 #define TCP_MD5SIG_EXT 32  /* TCP MD5 Signature with extensions */
 #define TCP_FASTOPEN_KEY   33  /* Set the key for Fast Open (cookie) */
 #define TCP_FASTOPEN_NO_COOKIE 34  /* Enable TFO without a TFO cookie */
+#define TCP_INQ                35  /* Notify bytes available to read as a cmsg on read */
+
+#define TCP_CM_INQ TCP_INQ
 
 struct tcp_repair_opt {
__u32   opt_code;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index dfd090ea54ad4..5a7056980f730 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1910,13 +1910,14 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
u32 peek_seq;
u32 *seq;
unsigned long used;
-   int err;
+   int err, inq;
int target; /* Read at least this many bytes */
long timeo;
struct sk_buff *skb, *last;
u32 urg_hole = 0;
struct scm_timestamping tss;
bool has_tss = false;
+   bool has_cmsg;
 
if (unlikely(flags & MSG_ERRQUEUE))
return inet_recv_error(sk, msg, len, addr_len);
@@ -1931,6 +1932,7 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
if (sk->sk_state == TCP_LISTEN)
goto out;
 
+   has_cmsg = tp->recvmsg_inq;
timeo = sock_rcvtimeo(sk, nonblock);
 
/* Urgent data needs to be handled specially. */
@@ -2117,6 +2119,7 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
if (TCP_SKB_CB(skb)->has_rxtstamp) {
tcp_update_recv_tstamps(skb, &tss);