Re: [PATCH net-next] tcp: reduce cpu usage when SO_SNDBUF is set

2016-06-21 Thread Eric Dumazet
On Tue, 2016-06-21 at 11:24 -0400, Jason Baron wrote:

> in tcp_check_space() with something like:
> 
> sk->sk_flags &= ~((1UL << SOCK_QUEUE_SHRUNK) | (1UL << SOCK_SHORT_WRITE));
> 
> Since we are already writing to sk_flags there, this should have very
> minimal overhead. And then remove the clear in sk_stream_write_space().

Interesting. In 3c7151275c0c9 you added an smp_mb__after_atomic()
in tcp_check_space(), but there is no atomic operation to begin with ;)





Re: [PATCH net-next] tcp: reduce cpu usage when SO_SNDBUF is set

2016-06-21 Thread Jason Baron


On 06/20/2016 06:29 PM, Eric Dumazet wrote:
> On Mon, 2016-06-20 at 17:23 -0400, Jason Baron wrote:
>> From: Jason Baron 
>>
>> When SO_SNDBUF is set and we are under tcp memory pressure, the effective
>> write buffer space can be much lower than what was set using SO_SNDBUF. For
>> example, we may have set the buffer to 100kb, but we may only be able to
>> write 10kb. In this scenario, poll()/select()/epoll() are going to
>> continuously return POLLOUT, followed by -EAGAIN from write(), and thus
>> result in a tight loop. Note that epoll in edge-triggered mode does not have
>> this issue, since it only triggers once data has been ack'd. There is no
>> issue here when SO_SNDBUF is not set, since the tcp layer will auto-tune
>> sk->sndbuf.
>>
>> Introduce a new socket flag, SOCK_SHORT_WRITE, so that we can mark the
>> socket when we have a short write due to memory pressure. By then testing
>> for SOCK_SHORT_WRITE in tcp_poll(), we hold off POLLOUT until a
>> non-zero amount of data has been ack'd. In a previous approach,
>> http://marc.info/?l=linux-netdev&m=143930393211782&w=2, I had introduced a
>> new field in 'struct sock' to solve this issue, but it's undesirable to add
>> bloat to 'struct sock'. We could also address this issue by waiting for
>> the buffer to become completely empty, but that may reduce throughput, since
>> the write buffer would be empty while waiting for subsequent writes. This
>> change brings poll()/select() and level-triggered epoll in line with the
>> current epoll edge-trigger behavior.
>>
>> We guarantee that SOCK_SHORT_WRITE will eventually be cleared: when we set
>> SOCK_SHORT_WRITE, we make sure that sk_wmem_queued is non-zero and that
>> SOCK_NOSPACE is set as well (in sk_stream_wait_memory()).
>>
>> I tested this patch using 10,000 sockets, setting SO_SNDBUF on the
>> server side to induce tcp memory pressure. A single server thread reduced
>> its cpu usage from 100% to 19%, while maintaining the same level of
>> throughput.
>>
>> Signed-off-by: Jason Baron 
>> ---
> 
> 
> When was this bug added, and by which commit?
> 

I think we have always had this behavior; I've seen it at least as far
back as 3.10. It requires SO_SNDBUF, lots of sockets, and memory pressure...
That said, it's quite easy to reproduce.

> This looks serious, but your patch looks way too complex.
> 
> This SOCK_SHORT_WRITE bit, constantly cleared in sk_stream_write_space(),
> is expensive because it dirties sk_flags, which is not supposed to be
> written often. This field is located in a read-mostly cache line.
> 
> I believe my suggestion was not to add a per socket bit,
> but have a global state like tcp_memory_pressure. (or reuse it)
> 
> I.e., when under global memory pressure, tcp_poll() should apply a
> strategy to give POLLOUT only on sockets having less than 4K
> (tcp_wmem[0]) of memory consumed in their write queue.
> 
> 
> 
> 

ok, I think you're saying to just add:

if (sk_under_memory_pressure(sk) && sk->sk_wmem_queued > tcp_wmem[0])
return false;

to the beginning of sk_stream_is_writeable().

My concern is that it then allows the sndbuf to drain almost completely
before we are triggered to write again. The patch I presented attempts
to write again as soon as any new space is available for the flow.

I also didn't necessarily want to affect all flows. If SO_SNDBUF is
not set, we generally don't need to change the wakeup behavior, because
the sndbuf shrinks to the point where the normal write wakeup does not
cause these tight poll()/write() loops where we don't make any progress.
We also don't need to change the behavior if we are under memory
pressure but have written enough (filled more than 2/3 of the buffer)
that the normal wakeups again would work fine. So we could incorporate
some of this logic into the above check, but it becomes more complex.

In my patch, I could replace the:

sock_reset_flag(sk, SOCK_QUEUE_SHRUNK);

in tcp_check_space() with something like:

sk->sk_flags &= ~((1UL << SOCK_QUEUE_SHRUNK) | (1UL << SOCK_SHORT_WRITE));

Since we are already writing to sk_flags there, this should have very
minimal overhead. And then remove the clear in sk_stream_write_space().

Thanks,

-Jason


Re: [PATCH net-next] tcp: reduce cpu usage when SO_SNDBUF is set

2016-06-20 Thread Eric Dumazet
On Mon, 2016-06-20 at 17:23 -0400, Jason Baron wrote:
> From: Jason Baron 
> 
> When SO_SNDBUF is set and we are under tcp memory pressure, the effective
> write buffer space can be much lower than what was set using SO_SNDBUF. For
> example, we may have set the buffer to 100kb, but we may only be able to
> write 10kb. In this scenario, poll()/select()/epoll() are going to
> continuously return POLLOUT, followed by -EAGAIN from write(), and thus
> result in a tight loop. Note that epoll in edge-triggered mode does not have
> this issue, since it only triggers once data has been ack'd. There is no
> issue here when SO_SNDBUF is not set, since the tcp layer will auto-tune
> sk->sndbuf.
> 
> Introduce a new socket flag, SOCK_SHORT_WRITE, so that we can mark the
> socket when we have a short write due to memory pressure. By then testing
> for SOCK_SHORT_WRITE in tcp_poll(), we hold off POLLOUT until a
> non-zero amount of data has been ack'd. In a previous approach,
> http://marc.info/?l=linux-netdev&m=143930393211782&w=2, I had introduced a
> new field in 'struct sock' to solve this issue, but it's undesirable to add
> bloat to 'struct sock'. We could also address this issue by waiting for
> the buffer to become completely empty, but that may reduce throughput, since
> the write buffer would be empty while waiting for subsequent writes. This
> change brings poll()/select() and level-triggered epoll in line with the
> current epoll edge-trigger behavior.
> 
> We guarantee that SOCK_SHORT_WRITE will eventually be cleared: when we set
> SOCK_SHORT_WRITE, we make sure that sk_wmem_queued is non-zero and that
> SOCK_NOSPACE is set as well (in sk_stream_wait_memory()).
> 
> I tested this patch using 10,000 sockets, setting SO_SNDBUF on the
> server side to induce tcp memory pressure. A single server thread reduced
> its cpu usage from 100% to 19%, while maintaining the same level of
> throughput.
> 
> Signed-off-by: Jason Baron 
> ---


When was this bug added, and by which commit?

This looks serious, but your patch looks way too complex.

This SOCK_SHORT_WRITE bit, constantly cleared in sk_stream_write_space(),
is expensive because it dirties sk_flags, which is not supposed to be
written often. This field is located in a read-mostly cache line.

I believe my suggestion was not to add a per socket bit,
but have a global state like tcp_memory_pressure. (or reuse it)

I.e., when under global memory pressure, tcp_poll() should apply a
strategy to give POLLOUT only on sockets having less than 4K
(tcp_wmem[0]) of memory consumed in their write queue.






[PATCH net-next] tcp: reduce cpu usage when SO_SNDBUF is set

2016-06-20 Thread Jason Baron
From: Jason Baron 

When SO_SNDBUF is set and we are under tcp memory pressure, the effective
write buffer space can be much lower than what was set using SO_SNDBUF. For
example, we may have set the buffer to 100kb, but we may only be able to
write 10kb. In this scenario, poll()/select()/epoll() are going to
continuously return POLLOUT, followed by -EAGAIN from write(), and thus
result in a tight loop. Note that epoll in edge-triggered mode does not have
this issue, since it only triggers once data has been ack'd. There is no
issue here when SO_SNDBUF is not set, since the tcp layer will auto-tune
sk->sndbuf.

Introduce a new socket flag, SOCK_SHORT_WRITE, so that we can mark the
socket when we have a short write due to memory pressure. By then testing
for SOCK_SHORT_WRITE in tcp_poll(), we hold off POLLOUT until a
non-zero amount of data has been ack'd. In a previous approach,
http://marc.info/?l=linux-netdev&m=143930393211782&w=2, I had introduced a
new field in 'struct sock' to solve this issue, but it's undesirable to add
bloat to 'struct sock'. We could also address this issue by waiting for
the buffer to become completely empty, but that may reduce throughput, since
the write buffer would be empty while waiting for subsequent writes. This
change brings poll()/select() and level-triggered epoll in line with the
current epoll edge-trigger behavior.

We guarantee that SOCK_SHORT_WRITE will eventually be cleared: when we set
SOCK_SHORT_WRITE, we make sure that sk_wmem_queued is non-zero and that
SOCK_NOSPACE is set as well (in sk_stream_wait_memory()).

I tested this patch using 10,000 sockets, setting SO_SNDBUF on the
server side to induce tcp memory pressure. A single server thread reduced
its cpu usage from 100% to 19%, while maintaining the same level of
throughput.

Signed-off-by: Jason Baron 
---
 include/net/sock.h |  6 ++
 net/core/stream.c  |  1 +
 net/ipv4/tcp.c | 26 +++---
 3 files changed, 26 insertions(+), 7 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 649d2a8c17fc..616e8e1a5d5d 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -741,6 +741,7 @@ enum sock_flags {
SOCK_FILTER_LOCKED, /* Filter cannot be changed anymore */
SOCK_SELECT_ERR_QUEUE, /* Wake select on error queue */
SOCK_RCU_FREE, /* wait rcu grace period in sk_destruct() */
+   SOCK_SHORT_WRITE, /* Couldn't fill sndbuf due to memory pressure */
 };
 
 #define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | \
			    (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
@@ -1114,6 +1115,11 @@ static inline bool sk_stream_is_writeable(const struct sock *sk)
   sk_stream_memory_free(sk);
 }
 
+static inline void sk_set_short_write(struct sock *sk)
+{
+   if (sk->sk_wmem_queued > 0 && sk_stream_is_writeable(sk))
+   sock_set_flag(sk, SOCK_SHORT_WRITE);
+}
 
 static inline bool sk_has_memory_pressure(const struct sock *sk)
 {
diff --git a/net/core/stream.c b/net/core/stream.c
index 159516a11b7e..ead768b1e95d 100644
--- a/net/core/stream.c
+++ b/net/core/stream.c
@@ -32,6 +32,7 @@ void sk_stream_write_space(struct sock *sk)
 
if (sk_stream_is_writeable(sk) && sock) {
		clear_bit(SOCK_NOSPACE, &sock->flags);
+   sock_reset_flag(sk, SOCK_SHORT_WRITE);
 
rcu_read_lock();
wq = rcu_dereference(sk->sk_wq);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 5c7ed147449c..254c7bb6d3d5 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -517,7 +517,8 @@ unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
mask |= POLLIN | POLLRDNORM;
 
if (!(sk->sk_shutdown & SEND_SHUTDOWN)) {
-   if (sk_stream_is_writeable(sk)) {
+   if (!sock_flag(sk, SOCK_SHORT_WRITE) &&
+   sk_stream_is_writeable(sk)) {
mask |= POLLOUT | POLLWRNORM;
} else {  /* send SIGIO later */
sk_set_bit(SOCKWQ_ASYNC_NOSPACE, sk);
@@ -529,7 +530,8 @@ unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
 * pairs with the input side.
 */
smp_mb__after_atomic();
-   if (sk_stream_is_writeable(sk))
+   if (!sock_flag(sk, SOCK_SHORT_WRITE) &&
+   sk_stream_is_writeable(sk))
mask |= POLLOUT | POLLWRNORM;
}
} else
@@ -917,8 +919,10 @@ new_segment:
 
 			skb = sk_stream_alloc_skb(sk, 0, sk->sk_allocation,
 						  skb_queue_empty(&sk->sk_write_queue));
-   if (!skb)
+   if (!skb) {
+