Re: [PATCH] netdev: add netdev_pagefrag_enabled sysctl

2017-11-11 Thread David Miller
From: Hongbo Li 
Date: Thu, 9 Nov 2017 16:12:27 +0800

> From: Hongbo Li 
> 
> This patch solves a memory frag issue when allocating skbs.
> I found this issue in a udp scenario; here is my test model:
> 1. About five hundred udp threads listen on the server,
>    and five hundred client threads send udp pkts to them.
>    Some threads send pkts at a faster speed than others.
> 2. The user processes on the server can't receive these
>    pkts fast enough.
> 
> Then I got the following result:
> 1. Some udp sockets' recv-q reaches the queue limit; the
>    others don't, because of the global rmem limit.
> 2. The "free" command shows more than 62GB of "used" memory,
>    but cat /proc/net/sockstat shows udp using only 12GB.
> 
> This confuses the user: why does the system consume so much
> memory? It is caused by the memory frags in the netdev layer.
> __netdev_alloc_frag() allocates a page block of 8 pages.
> 
> In this scenario, most skbs are freed when the recv-q is
> full, but if any skb in the same page block is queued to
> another recv-q that is not full, the whole page block can't
> be freed.
> 
> So from the kernel's point of view these pages are in use,
> but from tcp/udp's point of view only the skbs in recv-q are.
> 
> To avoid exhausting memory in such a scenario, I add a sysctl
> that lets the user disable allocating skbs from page frags.
> 
> Signed-off-by: Hongbo Li 

When something like page fragments doesn't work properly, we fix
it rather than provide a way to disable it.

Thank you.


[PATCH] netdev: add netdev_pagefrag_enabled sysctl

2017-11-09 Thread Hongbo Li
From: Hongbo Li 

This patch solves a memory frag issue when allocating skbs.
I found this issue in a udp scenario; here is my test model:
1. About five hundred udp threads listen on the server,
   and five hundred client threads send udp pkts to them.
   Some threads send pkts at a faster speed than others.
2. The user processes on the server can't receive these
   pkts fast enough.

Then I got the following result:
1. Some udp sockets' recv-q reaches the queue limit; the
   others don't, because of the global rmem limit.
2. The "free" command shows more than 62GB of "used" memory,
   but cat /proc/net/sockstat shows udp using only 12GB.

This confuses the user: why does the system consume so much
memory? It is caused by the memory frags in the netdev layer.
__netdev_alloc_frag() allocates a page block of 8 pages.

In this scenario, most skbs are freed when the recv-q is
full, but if any skb in the same page block is queued to
another recv-q that is not full, the whole page block can't
be freed.

So from the kernel's point of view these pages are in use,
but from tcp/udp's point of view only the skbs in recv-q are.

To avoid exhausting memory in such a scenario, I add a sysctl
that lets the user disable allocating skbs from page frags.
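Assuming this patch were applied, the knob would be toggled like the other net.core sysctls (hypothetical usage; the patch was not merged upstream):

```shell
# Disable page-frag skb allocation (knob proposed by this patch)
sysctl -w net.core.netdev_pagefrag_enabled=0

# Equivalent via procfs
echo 0 > /proc/sys/net/core/netdev_pagefrag_enabled

# Restore the default behavior
sysctl -w net.core.netdev_pagefrag_enabled=1
```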

Signed-off-by: Hongbo Li 
---
 include/linux/netdevice.h  | 1 +
 net/core/dev.c | 1 +
 net/core/skbuff.c  | 6 ++++--
 net/core/sysctl_net_core.c | 7 +++++++
 4 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 2eaac7d..73540ee 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3319,6 +3319,7 @@ static __always_inline int dev_forward_skb(struct net_device *dev,
 
 extern int netdev_budget;
 extern unsigned int netdev_budget_usecs;
+extern int netdev_pagefrag_enabled;
 
 /* Called by rtnetlink.c:rtnl_unlock() */
 void netdev_run_todo(void);
diff --git a/net/core/dev.c b/net/core/dev.c
index 11596a3..2328ddb 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3527,6 +3527,7 @@ int dev_queue_xmit_accel(struct sk_buff *skb, void *accel_priv)
 int netdev_tstamp_prequeue __read_mostly = 1;
 int netdev_budget __read_mostly = 300;
 unsigned int __read_mostly netdev_budget_usecs = 2000;
+int netdev_pagefrag_enabled __read_mostly = 1;
 int weight_p __read_mostly = 64;   /* old backlog weight */
 int dev_weight_rx_bias __read_mostly = 1;  /* bias for backlog weight */
 int dev_weight_tx_bias __read_mostly = 1;  /* bias for output_queue quota */
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 2465607..62a43fe 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -399,7 +399,8 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev, unsigned int len,
len += NET_SKB_PAD;
 
if ((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||
-   (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
+   (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA)) ||
+   !netdev_pagefrag_enabled) {
skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
if (!skb)
goto skb_fail;
@@ -466,7 +467,8 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
len += NET_SKB_PAD + NET_IP_ALIGN;
 
if ((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||
-   (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
+   (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA)) ||
+   !netdev_pagefrag_enabled) {
skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
if (!skb)
goto skb_fail;
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index cbc3dde..c0078c5 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -461,6 +461,7 @@ static int proc_do_rss_key(struct ctl_table *table, int write,
.proc_handler   = proc_dointvec_minmax,
.extra1 = &zero,
},
+   {
+   .procname   = "netdev_pagefrag_enabled",
+   .data   = &netdev_pagefrag_enabled,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec
+   },
{ }
 };
 
-- 
1.8.3.1