Re: [PATCH net-next 3/3] tcp: implement head drops in backlog queue

2018-11-21 Thread Eric Dumazet
On Wed, Nov 21, 2018 at 4:55 PM Yuchung Cheng  wrote:
>
>
> To clarify, I do think this patch set is overall useful, so I only
> wanted to discuss the specifics of the head drop.
>
> It occurs to me that we check the limit differently (one w/ 64KB more), so
> we may overcommit and trim more often than necessary?

Keep in mind that the normal limit might be hit while the backlog is empty,
so the loop might have nothing to remove.

We only have an extra 64KB credit for this very particular case, so
that we can still queue this packet _if_ the additional credit was not
already consumed.
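
A toy userspace model may make the interplay of the two checks clearer. The names and the exact limit arithmetic here are simplifications of my own, not the kernel's; the point is only that the trim loop tests the plain limit while final backlog admission gets the extra 64KB of credit:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model (not kernel code): "usage" stands for the combined rmem +
 * backlog accounting, "limit" for the normal limit. */
#define EXTRA_CREDIT (64 * 1024)

/* The head-drop loop runs while this is true -- plain limit, no credit. */
static bool rcvqueues_full(size_t usage, size_t limit)
{
	return usage > limit;
}

/* Final admission test: the extra 64KB credit means the packet can still
 * be queued even when the loop above found nothing left to trim. */
static bool backlog_accepts(size_t usage, size_t truesize, size_t limit)
{
	return usage + truesize <= limit + EXTRA_CREDIT;
}
```

With limit = 1MB and usage just over it, the queues already count as full, yet a 16KB skb is still admitted via the credit; once the credit itself is consumed, the packet is dropped.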

We do not want to add back the terrible bug fixed by:

commit 8eae939f1400326b06d0c9afe53d2a484a326871
Author: Zhu Yi 
Date:   Thu Mar 4 18:01:40 2010 +

net: add limit for socket backlog

We got a system OOM while running some UDP netperf testing on the loopback
device. The case is multiple senders sending stream UDP packets to a single
receiver via loopback on the local host. Of course, the receiver is not able
to handle all the packets in time. But we surprisingly found that these
packets were not discarded due to the receiver's sk->sk_rcvbuf limit.
Instead, they kept queuing to sk->sk_backlog and finally ate up all
the memory. We believe this is a security hole that a non-privileged user
can use to crash the system.

The root cause of this problem is that when the receiver is doing
__release_sock() (i.e. after userspace recv; kernel udp_recvmsg ->
skb_free_datagram_locked -> release_sock), it moves skbs from the backlog to
sk_receive_queue with softirqs enabled. In the above case, multiple
busy senders will almost make it an endless loop, and the skbs in the
backlog end up eating all the system memory.

The issue is not limited to UDP; any protocol using the socket backlog is
potentially affected. The patch adds a limit for the socket backlog so that
the backlog size cannot be expanded endlessly.

Reported-by: Alex Shi 
Cc: David Miller 
Cc: Arnaldo Carvalho de Melo 
Cc: Alexey Kuznetsov 
Cc: Patrick McHardy 
Cc: Vlad Yasevich 
Cc: Sridhar Samudrala 
Cc: Jon Maloy 
Cc: Allan Stephens 
Cc: Andrew Hendry 
Signed-off-by: Zhu Yi 
Signed-off-by: Eric Dumazet 
Acked-by: Arnaldo Carvalho de Melo 
Signed-off-by: David S. Miller 


Re: [PATCH net-next 3/3] tcp: implement head drops in backlog queue

2018-11-21 Thread Yuchung Cheng
On Wed, Nov 21, 2018 at 4:18 PM, Eric Dumazet  wrote:
> On Wed, Nov 21, 2018 at 3:52 PM Eric Dumazet  wrote:
>> This is basically what the patch does: the while loop breaks when we have
>> freed just enough skbs.
>
> Also, this is the patch we tested with Jean-Louis on his host, bringing
> very nice results, even from an old-stack sender (the one that had
> problems with the SACK compression we just fixed).
>
> Keep in mind we are dealing here with the exception; I would not spend
> too much time testing another variant if this one simply works.
To clarify, I do think this patch set is overall useful, so I only
wanted to discuss the specifics of the head drop.

It occurs to me that we check the limit differently (one w/ 64KB more), so
we may overcommit and trim more often than necessary?


Re: [PATCH net-next 3/3] tcp: implement head drops in backlog queue

2018-11-21 Thread Eric Dumazet
On Wed, Nov 21, 2018 at 3:52 PM Eric Dumazet  wrote:
> This is basically what the patch does: the while loop breaks when we have
> freed just enough skbs.

Also, this is the patch we tested with Jean-Louis on his host, bringing
very nice results, even from an old-stack sender (the one that had
problems with the SACK compression we just fixed).

Keep in mind we are dealing here with the exception; I would not spend
too much time testing another variant if this one simply works.


Re: [PATCH net-next 3/3] tcp: implement head drops in backlog queue

2018-11-21 Thread Eric Dumazet
On Wed, Nov 21, 2018 at 3:47 PM Yuchung Cheng  wrote:
>
> On Wed, Nov 21, 2018 at 2:47 PM, Eric Dumazet  wrote:
> >
> >
> > On 11/21/2018 02:40 PM, Yuchung Cheng wrote:
> >> On Wed, Nov 21, 2018 at 9:52 AM, Eric Dumazet  wrote:
> >> Under high stress, and if GRO or coalescing does not help,
> >> we'd better make room in the backlog queue to be able to keep the
> >> latest packet coming.
> >>>
> >>> This generally helps fast recovery, given that we often receive
> >>> packets in order.
> >>
> >> I like the benefit of fast recovery but I am a bit leery about head
> >> drop causing HoLB on large read, while tail drops can be repaired by
> >> RACK and TLP already. Hmm -
> >
> > This is a very different pattern here.
> >
> > We have a train of packets coming; the last packet is not a TLP probe...
> >
> > Consider this train coming from an old stack without burst control or
> > pacing.
> >
> > This patch guarantees the last packet will be processed, and then either:
> >
> > 1) We are the receiver: we will send a SACK, and the sender will
> > typically start recovery.
> >
> > 2) We are the sender: we will process the most recent ACK sent by the
> > receiver.
> >
>
> Sure, on the sender side it's universally good.
>
> On the receiver side, my scenario was not the last packet being a TLP.
> AFAIU the new design will first try to coalesce the incoming skb into the
> tail one, then exit. Otherwise it's queued at the back with an
> additional 64KB space credit. This patch checks the space w/o the
> extra credit and drops the head skb. If the head skb has been
> coalesced, we might end up dropping the first big chunk, which may need a
> few rounds of fast recovery to repair. But I am likely to
> misunderstand the patch :-)

This situation happens only when we are largely overcommitting
memory, due to a bad skb->len/skb->truesize ratio.

Think about MTU=9000 and some drivers allocating 16 KB of memory per frame,
even if the packet only has 1500 bytes in it.
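
As a back-of-the-envelope check (the 16 KB-per-frame driver here is the hypothetical above, not a specific driver):

```c
#include <assert.h>

/* truesize/len ratio, in percent. A hypothetical MTU-9000 driver that
 * allocates 16 KB per frame overcommits roughly 11x when the frame only
 * holds 1500 bytes of payload, and still ~1.8x at a full 9000 bytes. */
static unsigned int truesize_ratio_pct(unsigned int truesize, unsigned int len)
{
	return truesize * 100 / len;
}
```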

>
> Would it make sense to check the space first before the coalesce
> operation, and drop just enough bytes of the head to make room?

This is basically what the patch does: the while loop breaks when we have
freed just enough skbs.
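
That loop can be modeled in userspace as follows. Field names mirror the patch quoted later in the thread, but the types are simplified sketches, not the kernel's structures:

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-ins for sk_buff and the socket backlog. */
struct skb {
	struct skb *next;
	unsigned int truesize;
};

struct backlog {
	struct skb *head, *tail;
	unsigned int len;	/* sum of queued truesize */
};

static void push_tail(struct backlog *bl, unsigned int truesize)
{
	struct skb *skb = calloc(1, sizeof(*skb));

	skb->truesize = truesize;
	if (bl->tail)
		bl->tail->next = skb;
	else
		bl->head = skb;
	bl->tail = skb;
	bl->len += truesize;
}

/* Pop skbs from the front until accounting drops back under "limit";
 * returns how many were dropped (the kernel bumps TCPBacklogDrop here). */
static unsigned int trim_head(struct backlog *bl, unsigned int limit)
{
	unsigned int dropped = 0;

	while (bl->len > limit) {
		struct skb *head = bl->head;

		if (!head)
			break;		/* backlog already empty */
		bl->head = head->next;
		if (!head->next)
			bl->tail = NULL;
		bl->len -= head->truesize;
		free(head);
		dropped++;
	}
	return dropped;
}
```

With three 4096-byte skbs queued and a limit of 8192, a single head drop is enough; the loop stops the moment the accounting is back under the limit.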


Re: [PATCH net-next 3/3] tcp: implement head drops in backlog queue

2018-11-21 Thread Yuchung Cheng
On Wed, Nov 21, 2018 at 2:47 PM, Eric Dumazet  wrote:
>
>
> On 11/21/2018 02:40 PM, Yuchung Cheng wrote:
>> On Wed, Nov 21, 2018 at 9:52 AM, Eric Dumazet  wrote:
>> Under high stress, and if GRO or coalescing does not help,
>> we'd better make room in the backlog queue to be able to keep the
>> latest packet coming.
>>>
>>> This generally helps fast recovery, given that we often receive
>>> packets in order.
>>
>> I like the benefit of fast recovery but I am a bit leery about head
>> drop causing HoLB on large read, while tail drops can be repaired by
>> RACK and TLP already. Hmm -
>
> This is a very different pattern here.
>
> We have a train of packets coming; the last packet is not a TLP probe...
>
> Consider this train coming from an old stack without burst control or pacing.
>
> This patch guarantees the last packet will be processed, and then either:
>
> 1) We are the receiver: we will send a SACK, and the sender will typically
> start recovery.
>
> 2) We are the sender: we will process the most recent ACK sent by the receiver.
>

Sure, on the sender side it's universally good.

On the receiver side, my scenario was not the last packet being a TLP.
AFAIU the new design will first try to coalesce the incoming skb into the
tail one, then exit. Otherwise it's queued at the back with an
additional 64KB space credit. This patch checks the space w/o the
extra credit and drops the head skb. If the head skb has been
coalesced, we might end up dropping the first big chunk, which may need a
few rounds of fast recovery to repair. But I am likely to
misunderstand the patch :-)

Would it make sense to check the space first before the coalesce
operation, and drop just enough bytes of the head to make room?


Re: [PATCH net-next 3/3] tcp: implement head drops in backlog queue

2018-11-21 Thread Eric Dumazet



On 11/21/2018 02:40 PM, Yuchung Cheng wrote:
> On Wed, Nov 21, 2018 at 9:52 AM, Eric Dumazet  wrote:
>> Under high stress, and if GRO or coalescing does not help,
>> we'd better make room in the backlog queue to be able to keep the
>> latest packet coming.
>>
>> This generally helps fast recovery, given that we often receive
>> packets in order.
> 
> I like the benefit of fast recovery but I am a bit leery about head
> drop causing HoLB on large read, while tail drops can be repaired by
> RACK and TLP already. Hmm -

This is a very different pattern here.

We have a train of packets coming; the last packet is not a TLP probe...

Consider this train coming from an old stack without burst control or pacing.

This patch guarantees the last packet will be processed, and then either:

1) We are the receiver: we will send a SACK, and the sender will typically start recovery.

2) We are the sender: we will process the most recent ACK sent by the receiver.



Re: [PATCH net-next 3/3] tcp: implement head drops in backlog queue

2018-11-21 Thread Yuchung Cheng
On Wed, Nov 21, 2018 at 9:52 AM, Eric Dumazet  wrote:
> Under high stress, and if GRO or coalescing does not help,
> we'd better make room in the backlog queue to be able to keep the
> latest packet coming.
>
> This generally helps fast recovery, given that we often receive
> packets in order.

I like the benefit of fast recovery but I am a bit leery about head
drop causing HoLB on large read, while tail drops can be repaired by
RACK and TLP already. Hmm -

>
> Signed-off-by: Eric Dumazet 
> Tested-by: Jean-Louis Dupond 
> Cc: Neal Cardwell 
> Cc: Yuchung Cheng 
> ---
>  net/ipv4/tcp_ipv4.c | 14 ++
>  1 file changed, 14 insertions(+)
>
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 401e1d1cb904a4c7963d8baa419cfbf178593344..36c9d715bf2aa7eb7bf58b045bfeb85a2ec1a696 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -1693,6 +1693,20 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
> __skb_push(skb, hdrlen);
> }
>
> +   while (sk_rcvqueues_full(sk, limit)) {
> +   struct sk_buff *head;
> +
> +   head = sk->sk_backlog.head;
> +   if (!head)
> +   break;
> +   sk->sk_backlog.head = head->next;
> +   if (!head->next)
> +   sk->sk_backlog.tail = NULL;
> +   skb_mark_not_on_list(head);
> +   sk->sk_backlog.len -= head->truesize;
> +   kfree_skb(head);
> +   __NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPBACKLOGDROP);
> +   }
> /* Only socket owner can try to collapse/prune rx queues
>  * to reduce memory overhead, so add a little headroom here.
>  * Few sockets backlog are possibly concurrently non empty.
> --
> 2.19.1.1215.g8438c0b245-goog
>