[PATCH][net-next] tun: remove unnecessary check in tun_flow_update

2018-12-06 Thread Li RongQing
The caller has already guaranteed that rxhash is not zero.

Signed-off-by: Li RongQing 
---
 drivers/net/tun.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index d0745dc81976..6760b86547df 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -529,10 +529,7 @@ static void tun_flow_update(struct tun_struct *tun, u32 
rxhash,
unsigned long delay = tun->ageing_time;
u16 queue_index = tfile->queue_index;
 
-   if (!rxhash)
-   return;
-   else
-   head = &tun->flows[tun_hashfn(rxhash)];
+   head = &tun->flows[tun_hashfn(rxhash)];
 
rcu_read_lock();
 
-- 
2.16.2



[PATCH][net-next] tun: align write-heavy flow entry members to a cache line

2018-12-06 Thread Li RongQing
The tun flow entry 'updated' field is written for every received
packet. If a flow is receiving packets through a particular flow
entry, every write dirties the cache line and causes false sharing
with all the other CPUs that look the entry up, so move 'updated'
into its own cache line.

Also write the 'queue_index' and 'updated' fields only when they
actually change, to further reduce the false sharing.

Signed-off-by: Zhang Yu 
Signed-off-by: Wang Li 
Signed-off-by: Li RongQing 
---
 drivers/net/tun.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 835c73f42ae7..d0745dc81976 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -201,7 +201,7 @@ struct tun_flow_entry {
u32 rxhash;
u32 rps_rxhash;
int queue_index;
-   unsigned long updated;
+   unsigned long updated ____cacheline_aligned_in_smp;
 };
 
 #define TUN_NUM_FLOW_ENTRIES 1024
@@ -539,8 +539,10 @@ static void tun_flow_update(struct tun_struct *tun, u32 
rxhash,
e = tun_flow_find(head, rxhash);
if (likely(e)) {
/* TODO: keep queueing to old queue until it's empty? */
-   e->queue_index = queue_index;
-   e->updated = jiffies;
+   if (e->queue_index != queue_index)
+   e->queue_index = queue_index;
+   if (e->updated != jiffies)
+   e->updated = jiffies;
sock_rps_record_flow_hash(e->rps_rxhash);
} else {
spin_lock_bh(&tun->lock);
-- 
2.16.2
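
A minimal sketch of the two techniques used above, with illustrative names
rather than the exact tun_flow_entry layout:

	struct flow_entry_example {
		u32 rxhash;		/* read-mostly lookup fields stay together */
		int queue_index;
		/* the write-hot field gets its own cache line so the per-packet
		 * write does not invalidate the lookup fields on other CPUs */
		unsigned long updated ____cacheline_aligned_in_smp;
	};

	static inline void flow_entry_touch(struct flow_entry_example *e, u16 queue_index)
	{
		/* only dirty the line when the value actually changes */
		if (e->queue_index != queue_index)
			e->queue_index = queue_index;
		if (e->updated != jiffies)
			e->updated = jiffies;
	}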



RE: [PATCH][net-next] vhost:net: allocate 32KB memory instead of 32K pages when page frag refill

2018-11-22 Thread Li,Rongqing

On 2018/11/23 10:04 AM, Li RongQing wrote:
> >when page frag refills, 32K pages, 128MB memory is asked, it hardly 
> >successes when system has memory stress


> Looking at get_order(), it seems we get 3 after get_order(32768) since it 
> accepts the size of block.

You are right, I misunderstood.

Please drop this patch, sorry for the noise

-Q
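
For reference, a minimal illustration of the get_order() semantics behind
the mix-up (assuming 4 KiB pages):

	/* get_order() takes a size in bytes and returns the smallest page
	 * order that covers it, so the existing macro already requests one
	 * 32 KiB block, not 32768 pages: */
	#define SKB_FRAG_PAGE_ORDER	get_order(32768)	/* == 3, i.e. PAGE_SIZE << 3 */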


RE: [PATCH] net: fix the per task frag allocator size

2018-11-22 Thread Li,Rongqing

> get_order(8) returns zero here if I understood it correctly.


You are right, I misunderstood.

Please drop this patch, sorry for the noise

-Q






[PATCH][net-next] vhost:net: allocate 32KB memory instead of 32K pages when page frag refill

2018-11-22 Thread Li RongQing
when the page frag refills, 32K pages (128MB of memory) are requested,
which hardly ever succeeds when the system is under memory pressure

And such a large block also underflows the reference bias and corrupts
the page refcount, since the bias is decremented to a negative value
before the allocated memory is used up

so 32KB is the safe choice; meanwhile, remove an unnecessary check

Fixes: e4dab1e6ea64 ("vhost_net: mitigate page reference counting during page frag refill")
Signed-off-by: Zhang Yu 
Signed-off-by: Li RongQing 
---
 drivers/vhost/net.c | 22 +++---
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index d919284f103b..b933a4a8e4ba 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -641,7 +641,7 @@ static bool tx_can_batch(struct vhost_virtqueue *vq, size_t 
total_len)
   !vhost_vq_avail_empty(vq->dev, vq);
 }
 
-#define SKB_FRAG_PAGE_ORDER get_order(32768)
+#define SKB_FRAG_PAGE_ORDER	3
 
 static bool vhost_net_page_frag_refill(struct vhost_net *net, unsigned int sz,
   struct page_frag *pfrag, gfp_t gfp)
@@ -654,17 +654,17 @@ static bool vhost_net_page_frag_refill(struct vhost_net 
*net, unsigned int sz,
 
pfrag->offset = 0;
net->refcnt_bias = 0;
-   if (SKB_FRAG_PAGE_ORDER) {
-   /* Avoid direct reclaim but allow kswapd to wake */
-   pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
- __GFP_COMP | __GFP_NOWARN |
- __GFP_NORETRY,
- SKB_FRAG_PAGE_ORDER);
-   if (likely(pfrag->page)) {
-   pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
-   goto done;
-   }
+
+   /* Avoid direct reclaim but allow kswapd to wake */
+   pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
+ __GFP_COMP | __GFP_NOWARN |
+ __GFP_NORETRY,
+ SKB_FRAG_PAGE_ORDER);
+   if (likely(pfrag->page)) {
+   pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
+   goto done;
}
+
pfrag->page = alloc_page(gfp);
if (likely(pfrag->page)) {
pfrag->size = PAGE_SIZE;
-- 
2.16.2
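
A sketch of why a 128 MB block breaks the bias trick; the lines below
paraphrase the refill path added by commit e4dab1e6ea64 rather than quote
it exactly:

	/* the refill takes the page references up front and hands one back
	 * per buffer carved out of the block */
	page_ref_add(pfrag->page, USHRT_MAX - 1);
	net->refcnt_bias = USHRT_MAX;

	/* a 32 KB block is exhausted long before 65535 buffers are carved
	 * from it, while a 128 MB block can serve far more than USHRT_MAX
	 * buffers, so refcnt_bias would be decremented below zero and the
	 * page refcount would stop matching the outstanding users */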



[PATCH] net: fix the per task frag allocator size

2018-11-22 Thread Li RongQing
when filling the task frag, 32K pages (128MB of memory) are requested,
which hardly ever succeeds when the system is under memory pressure

and commit 5640f7685831 ("net: use a per task frag allocator")
said it wants 32768 bytes, not 32768 pages:

 "(up to 32768 bytes per frag, thats order-3 pages on x86)"

Fixes: 5640f7685831e ("net: use a per task frag allocator")
Signed-off-by: Zhang Yu 
Signed-off-by: Li RongQing 
---
 net/core/sock.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 6d7e189e3cd9..e3cbefeedf5c 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2186,8 +2186,8 @@ static void sk_leave_memory_pressure(struct sock *sk)
}
 }
 
-/* On 32bit arches, an skb frag is limited to 2^15 */
-#define SKB_FRAG_PAGE_ORDER	get_order(32768)
+/* On 32bit arches, an skb frag is limited to 2^15 bytes*/
+#define SKB_FRAG_PAGE_ORDER	get_order(8)
 
 /**
  * skb_page_frag_refill - check that a page_frag contains enough room
-- 
2.16.2



[PATCH][net-next] net: slightly optimize eth_type_trans

2018-11-12 Thread Li RongQing
A netperf UDP stream test shows that eth_type_trans() consumes a
noticeable share of CPU, so adjust the MAC address check order: first
check whether the destination is the device address, and only check
whether it is a multicast address when it is not the device address.

After this change:
To unicast whose dst mac is the device mac (most of the time), one comparison is saved
To unicast whose dst mac is not the device mac, nothing changes
To multicast, one comparison is added

Before:
1.03%  [kernel]  [k] eth_type_trans

After:
0.78%  [kernel]  [k] eth_type_trans

Signed-off-by: Zhang Yu 
Signed-off-by: Li RongQing 
---
 net/ethernet/eth.c | 18 ++
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c
index fd8faa0dfa61..1c88f5c5d5b1 100644
--- a/net/ethernet/eth.c
+++ b/net/ethernet/eth.c
@@ -165,15 +165,17 @@ __be16 eth_type_trans(struct sk_buff *skb, struct 
net_device *dev)
eth = (struct ethhdr *)skb->data;
skb_pull_inline(skb, ETH_HLEN);
 
-   if (unlikely(is_multicast_ether_addr_64bits(eth->h_dest))) {
-   if (ether_addr_equal_64bits(eth->h_dest, dev->broadcast))
-   skb->pkt_type = PACKET_BROADCAST;
-   else
-   skb->pkt_type = PACKET_MULTICAST;
+   if (unlikely(!ether_addr_equal_64bits(eth->h_dest,
+  dev->dev_addr))) {
+   if (unlikely(is_multicast_ether_addr_64bits(eth->h_dest))) {
+   if (ether_addr_equal_64bits(eth->h_dest, 
dev->broadcast))
+   skb->pkt_type = PACKET_BROADCAST;
+   else
+   skb->pkt_type = PACKET_MULTICAST;
+   } else {
+   skb->pkt_type = PACKET_OTHERHOST;
+   }
}
-   else if (unlikely(!ether_addr_equal_64bits(eth->h_dest,
-  dev->dev_addr)))
-   skb->pkt_type = PACKET_OTHERHOST;
 
/*
 * Some variants of DSA tagging don't have an ethertype field
-- 
2.16.2
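
A condensed, plain-C view of the reordered classification, equivalent to
the diff above:

	if (likely(ether_addr_equal_64bits(eth->h_dest, dev->dev_addr))) {
		/* common unicast-to-us case: pkt_type stays PACKET_HOST */
	} else if (is_multicast_ether_addr_64bits(eth->h_dest)) {
		skb->pkt_type = ether_addr_equal_64bits(eth->h_dest, dev->broadcast) ?
				PACKET_BROADCAST : PACKET_MULTICAST;
	} else {
		skb->pkt_type = PACKET_OTHERHOST;
	}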



[PATCH][net-next][v2] net: remove BUG_ON from __pskb_pull_tail

2018-11-12 Thread Li RongQing
If list is a NULL pointer, the following dereference of list will
trigger a panic anyway, which has the same effect as the BUG_ON.

Signed-off-by: Li RongQing 
---
 net/core/skbuff.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 396fcb3baad0..d69503d66021 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1925,8 +1925,6 @@ void *__pskb_pull_tail(struct sk_buff *skb, int delta)
struct sk_buff *insp = NULL;
 
do {
-   BUG_ON(!list);
-
if (list->len <= eat) {
/* Eaten as whole. */
eat -= list->len;
-- 
2.16.2



[PATCH][xfrm-next] xfrm6: remove BUG_ON from xfrm6_dst_ifdown

2018-11-12 Thread Li RongQing
If loopback_idev is a NULL pointer, the following dereference of
loopback_idev will trigger a panic anyway, which has the same effect
as the BUG_ON.

Signed-off-by: Li RongQing 
---
 net/ipv6/xfrm6_policy.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c
index d35bcf92969c..769f8f78d3b8 100644
--- a/net/ipv6/xfrm6_policy.c
+++ b/net/ipv6/xfrm6_policy.c
@@ -262,7 +262,6 @@ static void xfrm6_dst_ifdown(struct dst_entry *dst, struct 
net_device *dev,
if (xdst->u.rt6.rt6i_idev->dev == dev) {
struct inet6_dev *loopback_idev =
in6_dev_get(dev_net(dev)->loopback_dev);
-   BUG_ON(!loopback_idev);
 
do {
in6_dev_put(xdst->u.rt6.rt6i_idev);
-- 
2.16.2



[PATCH][net-next] net: remove BUG_ON from __pskb_pull_tail

2018-11-12 Thread Li RongQing
If list is a NULL pointer, the following dereference of list will
trigger a panic anyway, which has the same effect as the BUG_ON.

Signed-off-by: Li RongQing 
---
 net/core/skbuff.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 396fcb3baad0..cd668b52f96f 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1925,7 +1925,6 @@ void *__pskb_pull_tail(struct sk_buff *skb, int delta)
struct sk_buff *insp = NULL;
 
do {
-   BUG_ON(!list);
 
if (list->len <= eat) {
/* Eaten as whole. */
-- 
2.16.2



RE: [PATCH][RFC] udp: cache sock to avoid searching it twice

2018-11-11 Thread Li,Rongqing
> >   return pp;
> >   }
>
> What if 'pp' is NULL?
>
> Aside from that, this replace a lookup with 2 atomic ops, and only when
> such lookup is amortized on multiple aggregated packets: I'm unsure if
> it's worthy and I don't understand how that improves RR tests (where
> the socket can't see multiple, consecutive skbs, AFAIK).
>
> Cheers,
>
> Paolo
>

If we do not release the socket in udp_gro_complete, we can avoid another
UDP socket lookup when IP early demux runs again; that may make it more
worthwhile.

I tested UDP_STREAM and found no difference, both reach the NIC's 10G limit;
so I tested RR, and I will do more tests.

-RongQing


RE: [PATCH][RFC] udp: cache sock to avoid searching it twice

2018-11-11 Thread Li,Rongqing

On Sat, Nov 10, 2018 at 1:29 AM Eric Dumazet  wrote:
>
>
>
> On 11/08/2018 10:21 PM, Li RongQing wrote:
> > GRO for UDP needs to lookup socket twice, first is in gro receive,
> > second is gro complete, so if store sock to skb to avoid looking up
> > twice, this can give small performance boost
> >
> > netperf -t UDP_RR -l 10
> >
> > Before:
> >   Rate per sec: 28746.01
> > After:
> >   Rate per sec: 29401.67
> >
> > Signed-off-by: Li RongQing 
> > ---
> >  net/ipv4/udp_offload.c | 18 +-
> >  1 file changed, 17 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
> > index 0646d61f4fa8..429570112a33 100644
> > --- a/net/ipv4/udp_offload.c
> > +++ b/net/ipv4/udp_offload.c
> > @@ -408,6 +408,11 @@ struct sk_buff *udp_gro_receive(struct list_head 
> > *head, struct sk_buff *skb,
> >
> >   if (udp_sk(sk)->gro_enabled) {
> >   pp = call_gro_receive(udp_gro_receive_segment, head, skb);
> > +
> > + if (!IS_ERR(pp) && NAPI_GRO_CB(pp)->count > 1) {
> > + sock_hold(sk);
> > + pp->sk = sk;
>
>
> You also have to set pp->destructor to sock_edemux
>
> flush_gro_hash -> kfree_skb()
>
> If there is no destructor, the reference on pp->sk will never be released.
>
>

OK, thanks.

Does the sk need to be reset in udp_gro_complete? IP early demux will look up
the UDP socket again; if we can keep it, we can avoid that second socket lookup.


-RongQing
>
>
> > + }
> >   rcu_read_unlock();
> >   return pp;
> >   }
> > @@ -444,6 +449,10 @@ struct sk_buff *udp_gro_receive(struct list_head 
> > *head, struct sk_buff *skb,
> >   skb_gro_postpull_rcsum(skb, uh, sizeof(struct udphdr));
> >   pp = call_gro_receive_sk(udp_sk(sk)->gro_receive, sk, head, skb);
> >
> > + if (!IS_ERR(pp) && NAPI_GRO_CB(pp)->count > 1) {
> > + sock_hold(sk);
> > + pp->sk = sk;
> > + }
> >  out_unlock:
> >   rcu_read_unlock();
> >   skb_gro_flush_final(skb, pp, flush);
> > @@ -502,7 +511,9 @@ int udp_gro_complete(struct sk_buff *skb, int nhoff,
> >   uh->len = newlen;
> >
> >   rcu_read_lock();
> > - sk = (*lookup)(skb, uh->source, uh->dest);
> > + sk = skb->sk;
> > + if (!sk)
> > + sk = (*lookup)(skb, uh->source, uh->dest);
> >   if (sk && udp_sk(sk)->gro_enabled) {
> >   err = udp_gro_complete_segment(skb);
> >   } else if (sk && udp_sk(sk)->gro_complete) {
> > @@ -516,6 +527,11 @@ int udp_gro_complete(struct sk_buff *skb, int nhoff,
> >   err = udp_sk(sk)->gro_complete(sk, skb,
> >   nhoff + sizeof(struct udphdr));
> >   }
> > +
> > + if (skb->sk) {
> > + sock_put(skb->sk);
> > + skb->sk = NULL;
> > + }
> >   rcu_read_unlock();
> >
> >   if (skb->remcsum_offload)
> >
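
Following Eric's point, a hedged sketch of what pairing the hold with a
destructor could look like; the exact placement is an assumption, but
sock_edemux is the usual destructor for this kind of cached socket
reference:

	if (!IS_ERR(pp) && NAPI_GRO_CB(pp)->count > 1) {
		sock_hold(sk);
		pp->sk = sk;
		/* drop the reference even if the GRO engine frees pp itself,
		 * e.g. flush_gro_hash() -> kfree_skb() */
		pp->destructor = sock_edemux;
	}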


[PATCH][net-next] net: tcp: remove BUG_ON from tcp_v4_err

2018-11-09 Thread Li RongQing
If skb is a NULL pointer, the following access to skb's skb_mstamp_ns
will trigger a panic anyway, which has the same effect as the BUG_ON.

Signed-off-by: Li RongQing 
---
 net/ipv4/tcp_ipv4.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index a336787d75e5..5424a4077c27 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -542,7 +542,6 @@ int tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
icsk->icsk_rto = inet_csk_rto_backoff(icsk, TCP_RTO_MAX);
 
skb = tcp_rtx_queue_head(sk);
-   BUG_ON(!skb);
 
tcp_mstamp_refresh(tp);
delta_us = (u32)(tp->tcp_mstamp - tcp_skb_timestamp_us(skb));
-- 
2.16.2



[PATCH][RFC] udp: cache sock to avoid searching it twice

2018-11-08 Thread Li RongQing
GRO for UDP needs to look up the socket twice, first in gro_receive and
again in gro_complete. Storing the sock in the skb avoids the second
lookup and gives a small performance boost.

netperf -t UDP_RR -l 10

Before:
Rate per sec: 28746.01
After:
Rate per sec: 29401.67

Signed-off-by: Li RongQing 
---
 net/ipv4/udp_offload.c | 18 +-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 0646d61f4fa8..429570112a33 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -408,6 +408,11 @@ struct sk_buff *udp_gro_receive(struct list_head *head, 
struct sk_buff *skb,
 
if (udp_sk(sk)->gro_enabled) {
pp = call_gro_receive(udp_gro_receive_segment, head, skb);
+
+   if (!IS_ERR(pp) && NAPI_GRO_CB(pp)->count > 1) {
+   sock_hold(sk);
+   pp->sk = sk;
+   }
rcu_read_unlock();
return pp;
}
@@ -444,6 +449,10 @@ struct sk_buff *udp_gro_receive(struct list_head *head, 
struct sk_buff *skb,
skb_gro_postpull_rcsum(skb, uh, sizeof(struct udphdr));
pp = call_gro_receive_sk(udp_sk(sk)->gro_receive, sk, head, skb);
 
+   if (!IS_ERR(pp) && NAPI_GRO_CB(pp)->count > 1) {
+   sock_hold(sk);
+   pp->sk = sk;
+   }
 out_unlock:
rcu_read_unlock();
skb_gro_flush_final(skb, pp, flush);
@@ -502,7 +511,9 @@ int udp_gro_complete(struct sk_buff *skb, int nhoff,
uh->len = newlen;
 
rcu_read_lock();
-   sk = (*lookup)(skb, uh->source, uh->dest);
+   sk = skb->sk;
+   if (!sk)
+   sk = (*lookup)(skb, uh->source, uh->dest);
if (sk && udp_sk(sk)->gro_enabled) {
err = udp_gro_complete_segment(skb);
} else if (sk && udp_sk(sk)->gro_complete) {
@@ -516,6 +527,11 @@ int udp_gro_complete(struct sk_buff *skb, int nhoff,
err = udp_sk(sk)->gro_complete(sk, skb,
nhoff + sizeof(struct udphdr));
}
+
+   if (skb->sk) {
+   sock_put(skb->sk);
+   skb->sk = NULL;
+   }
rcu_read_unlock();
 
if (skb->remcsum_offload)
-- 
2.16.2



[PATCH][net-next] openvswitch: remove BUG_ON from get_dpdev

2018-11-08 Thread Li RongQing
If local is a NULL pointer, the following access to local's dev will
trigger a panic anyway, which has the same effect as the BUG_ON.

Signed-off-by: Li RongQing 
---
 net/openvswitch/vport-netdev.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/net/openvswitch/vport-netdev.c b/net/openvswitch/vport-netdev.c
index 2e5e7a41d8ef..9bec22e3e9e8 100644
--- a/net/openvswitch/vport-netdev.c
+++ b/net/openvswitch/vport-netdev.c
@@ -84,7 +84,6 @@ static struct net_device *get_dpdev(const struct datapath *dp)
struct vport *local;
 
local = ovs_vport_ovsl(dp, OVSP_LOCAL);
-   BUG_ON(!local);
return local->dev;
 }
 
-- 
2.16.2



[PATCH][net-next][v2] net/ipv6: compute anycast address hash only if dev is null

2018-11-07 Thread Li RongQing
Avoid computing the hash value when dev is not NULL, since the hash
value is not used in that case.

Signed-off-by: Li RongQing 
---
 net/ipv6/anycast.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/anycast.c b/net/ipv6/anycast.c
index 94999058e110..cca3b3603c42 100644
--- a/net/ipv6/anycast.c
+++ b/net/ipv6/anycast.c
@@ -433,7 +433,6 @@ static bool ipv6_chk_acast_dev(struct net_device *dev, 
const struct in6_addr *ad
 bool ipv6_chk_acast_addr(struct net *net, struct net_device *dev,
 const struct in6_addr *addr)
 {
-   unsigned int hash = inet6_acaddr_hash(net, addr);
struct net_device *nh_dev;
struct ifacaddr6 *aca;
bool found = false;
@@ -441,7 +440,9 @@ bool ipv6_chk_acast_addr(struct net *net, struct net_device 
*dev,
rcu_read_lock();
if (dev)
found = ipv6_chk_acast_dev(dev, addr);
-   else
+   else {
+   unsigned int hash = inet6_acaddr_hash(net, addr);
+
hlist_for_each_entry_rcu(aca, &inet6_acaddr_lst[hash],
 aca_addr_lst) {
nh_dev = fib6_info_nh_dev(aca->aca_rt);
@@ -452,6 +453,7 @@ bool ipv6_chk_acast_addr(struct net *net, struct net_device 
*dev,
break;
}
}
+   }
rcu_read_unlock();
return found;
 }
-- 
2.16.2



[PATCH][net-next] net/ipv6: compute anycast address hash only if dev is null

2018-11-07 Thread Li RongQing
Avoid computing the hash value when dev is not NULL, since the hash
value is not used in that case.

Signed-off-by: Li RongQing 
---
 net/ipv6/anycast.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/anycast.c b/net/ipv6/anycast.c
index 94999058e110..a20e344486cb 100644
--- a/net/ipv6/anycast.c
+++ b/net/ipv6/anycast.c
@@ -433,15 +433,16 @@ static bool ipv6_chk_acast_dev(struct net_device *dev, 
const struct in6_addr *ad
 bool ipv6_chk_acast_addr(struct net *net, struct net_device *dev,
 const struct in6_addr *addr)
 {
-   unsigned int hash = inet6_acaddr_hash(net, addr);
struct net_device *nh_dev;
struct ifacaddr6 *aca;
bool found = false;
+   unsigned int hash;
 
rcu_read_lock();
if (dev)
found = ipv6_chk_acast_dev(dev, addr);
-   else
+   else {
+   hash = inet6_acaddr_hash(net, addr);
hlist_for_each_entry_rcu(aca, &inet6_acaddr_lst[hash],
 aca_addr_lst) {
nh_dev = fib6_info_nh_dev(aca->aca_rt);
@@ -452,6 +453,7 @@ bool ipv6_chk_acast_addr(struct net *net, struct net_device 
*dev,
break;
}
}
+   }
rcu_read_unlock();
return found;
 }
-- 
2.16.2



[net-next][PATCH] net/ipv4: fix a net leak

2018-10-24 Thread Li RongQing
Put the net reference when an invalid ifindex is passed in; otherwise it is leaked.

Fixes: 5fcd266a9f64 ("net/ipv4: Add support for dumping addresses for a specific device")
Cc: David Ahern 
Signed-off-by: Zhang Yu 
Signed-off-by: Li RongQing 
---
 net/ipv4/devinet.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 63d5b58fbfdb..fd0c5a47e742 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -1775,8 +1775,10 @@ static int inet_dump_ifaddr(struct sk_buff *skb, struct 
netlink_callback *cb)
 
if (fillargs.ifindex) {
dev = __dev_get_by_index(tgt_net, fillargs.ifindex);
-   if (!dev)
+   if (!dev) {
+   put_net(tgt_net);
return -ENODEV;
+   }
 
in_dev = __in_dev_get_rtnl(dev);
if (in_dev) {
-- 
2.16.2



[PATCH][ipsec-next] xfrm: use correct size to initialise sp->ovec

2018-10-06 Thread Li RongQing
This code intends to initialize the whole array, not a single element,
so it should use sizeof(array) instead of sizeof(element).

The array currently has only one element, so there is no visible error
while XFRM_MAX_OFFLOAD_DEPTH is 1.

Signed-off-by: Li RongQing 
---
 net/xfrm/xfrm_input.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c
index be3520e429c9..684c0bc01e2c 100644
--- a/net/xfrm/xfrm_input.c
+++ b/net/xfrm/xfrm_input.c
@@ -131,7 +131,7 @@ struct sec_path *secpath_dup(struct sec_path *src)
sp->len = 0;
sp->olen = 0;
 
-   memset(sp->ovec, 0, sizeof(sp->ovec[XFRM_MAX_OFFLOAD_DEPTH]));
+   memset(sp->ovec, 0, sizeof(sp->ovec));
 
if (src) {
int i;
-- 
2.16.2
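
A minimal illustration of the sizeof pitfall being fixed; the struct name
is illustrative, the real definitions live in the xfrm headers:

	struct sec_path_like {
		struct xfrm_offload ovec[XFRM_MAX_OFFLOAD_DEPTH];
	};

	/* sizeof(sp->ovec)    == XFRM_MAX_OFFLOAD_DEPTH * sizeof(struct xfrm_offload)
	 * sizeof(sp->ovec[0]) == sizeof(struct xfrm_offload)
	 * the two only coincide while XFRM_MAX_OFFLOAD_DEPTH == 1, which is why
	 * the old memset() never visibly misbehaved */
	memset(sp->ovec, 0, sizeof(sp->ovec));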



[PATCH][ipsec-next] xfrm: remove unnecessary check in xfrmi_get_stats64

2018-10-06 Thread Li RongQing
If a device's tstats have not been allocated, the device was not
registered correctly and cannot be in use, so the check is unnecessary.

Signed-off-by: Li RongQing 
---
 net/xfrm/xfrm_interface.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/net/xfrm/xfrm_interface.c b/net/xfrm/xfrm_interface.c
index dc5b20bf29cf..abafd49cc65d 100644
--- a/net/xfrm/xfrm_interface.c
+++ b/net/xfrm/xfrm_interface.c
@@ -561,9 +561,6 @@ static void xfrmi_get_stats64(struct net_device *dev,
 {
int cpu;
 
-   if (!dev->tstats)
-   return;
-
for_each_possible_cpu(cpu) {
struct pcpu_sw_netstats *stats;
struct pcpu_sw_netstats tmp;
-- 
2.16.2



Re: [PATCH][net-next] ipv6: drop container_of when convert dst to rt6_info

2018-09-30 Thread Li RongQing
>
> I don't understand why you are doing this? It is not going to be
> faster (or safer) than container_of. container_of provides the
> same functionality and is safe against position of the member
> in the structure.
>

In fact, most places convert dst to rt6_info with a direct cast, and
only a few places use container_of:


net/ipv6/ip6_output.c:  struct rt6_info *rt = (struct rt6_info *)skb_dst(skb);
net/ipv6/route.c:   const struct rt6_info *rt = (struct rt6_info *)dst;

-Li


Re: [PATCH][net-next] ipv6: drop container_of when convert dst to rt6_info

2018-09-30 Thread Li RongQing
> + BUILD_BUG_ON(offsetof(struct rt6_info, dst) != 0);
> +

Please drop this patch, thanks;
a BUILD_BUG_ON has already been added in ip6_fib.h:

include/net/ip6_fib.h:  BUILD_BUG_ON(offsetof(struct rt6_info, dst) != 0);

-Li


[PATCH] xfrm: fix gro_cells leak when remove virtual xfrm interfaces

2018-09-30 Thread Li RongQing
The device's gro_cells have been initialized, so they should be
destroyed on free; otherwise they are leaked.

Fixes: f203b76d78092faf2 ("xfrm: Add virtual xfrm interfaces")
Signed-off-by: Zhang Yu 
Signed-off-by: Li RongQing 
---
 net/xfrm/xfrm_interface.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/xfrm/xfrm_interface.c b/net/xfrm/xfrm_interface.c
index 4b4ef4f662d9..9cc6e72bc802 100644
--- a/net/xfrm/xfrm_interface.c
+++ b/net/xfrm/xfrm_interface.c
@@ -116,6 +116,9 @@ static void xfrmi_unlink(struct xfrmi_net *xfrmn, struct 
xfrm_if *xi)
 
 static void xfrmi_dev_free(struct net_device *dev)
 {
+   struct xfrm_if *xi = netdev_priv(dev);
+
gro_cells_destroy(&xi->gro_cells);
free_percpu(dev->tstats);
 }
 
-- 
2.16.2



[PATCH][net-next] ipv6: drop container_of when convert dst to rt6_info

2018-09-29 Thread Li RongQing
We can skip the container_of computation and cast dst directly,
since dst is always the first member of struct rt6_info.

Add a BUILD_BUG_ON() to catch any change that would break this
assumption.

Signed-off-by: Li RongQing 
---
 include/net/ip6_route.h | 4 +++-
 net/ipv6/route.c| 6 +++---
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 7b9c82de11cc..1f09298634cb 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -194,8 +194,10 @@ static inline const struct rt6_info *skb_rt6_info(const 
struct sk_buff *skb)
const struct dst_entry *dst = skb_dst(skb);
const struct rt6_info *rt6 = NULL;
 
+   BUILD_BUG_ON(offsetof(struct rt6_info, dst) != 0);
+
if (dst)
-   rt6 = container_of(dst, struct rt6_info, dst);
+   rt6 = (struct rt6_info *)dst;
 
return rt6;
 }
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index d28f83e01593..3fb8034fc2d0 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -217,7 +217,7 @@ static struct neighbour *ip6_dst_neigh_lookup(const struct 
dst_entry *dst,
  struct sk_buff *skb,
  const void *daddr)
 {
-   const struct rt6_info *rt = container_of(dst, struct rt6_info, dst);
+   const struct rt6_info *rt = (struct rt6_info *)dst;
 
return ip6_neigh_lookup(&rt->rt6i_gateway, dst->dev, skb, daddr);
 }
@@ -2187,7 +2187,7 @@ static struct dst_entry *ip6_dst_check(struct dst_entry 
*dst, u32 cookie)
struct fib6_info *from;
struct rt6_info *rt;
 
-   rt = container_of(dst, struct rt6_info, dst);
+   rt = (struct rt6_info *)dst;
 
rcu_read_lock();
 
@@ -4911,7 +4911,7 @@ static int inet6_rtm_getroute(struct sk_buff *in_skb, 
struct nlmsghdr *nlh,
}
 
 
-   rt = container_of(dst, struct rt6_info, dst);
+   rt = (struct rt6_info *)dst;
if (rt->dst.error) {
err = rt->dst.error;
ip6_rt_put(rt);
-- 
2.16.2
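
The reasoning behind the direct cast, as a small self-contained sketch
(illustrative struct name, not the real rt6_info definition):

	struct rt6_like {
		struct dst_entry dst;	/* must remain the first member */
		/* ... route-specific fields ... */
	};

	static inline struct rt6_like *to_rt6_like(struct dst_entry *dst)
	{
		/* with offset 0 the cast and container_of() yield the same pointer */
		BUILD_BUG_ON(offsetof(struct rt6_like, dst) != 0);
		return (struct rt6_like *)dst;
	}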



[PATCH][net-next] net: drop container_of in dst_cache_get_ip4

2018-09-29 Thread Li RongQing
We can skip the container_of computation and cast dst directly, since
dst is always the first member of struct rtable; any change that breaks
this is caught by the BUILD_BUG_ON in route.h:

include/net/route.h:BUILD_BUG_ON(offsetof(struct rtable, dst) != 0);

Signed-off-by: Li RongQing 
---
 net/core/dst_cache.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/dst_cache.c b/net/core/dst_cache.c
index 64cef977484a..0753838480fd 100644
--- a/net/core/dst_cache.c
+++ b/net/core/dst_cache.c
@@ -87,7 +87,7 @@ struct rtable *dst_cache_get_ip4(struct dst_cache *dst_cache, 
__be32 *saddr)
return NULL;
 
*saddr = idst->in_saddr.s_addr;
-   return container_of(dst, struct rtable, dst);
+   return (struct rtable *)dst;
 }
 EXPORT_SYMBOL_GPL(dst_cache_get_ip4);
 
-- 
2.16.2



RE: [PATCH][next-next][v2] netlink: avoid to allocate full skb when sending to many devices

2018-09-20 Thread Li,Rongqing
> Subject: Re: [PATCH][next-next][v2] netlink: avoid to allocate full skb when
> sending to many devices
> 
> 
> 
> On 09/20/2018 06:43 AM, Eric Dumazet wrote:
> >
> 

Sorry, I should have CC'd you.

> > And lastly this patch looks way too complicated to me.
> > You probably can write something much simpler.
> 

But it should not make performance worse.

> Something like :
> 
> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c index
> 930d17fa906c9ebf1cf7b6031ce0a22f9f66c0e4..e0a81beb4f37751421dbbe794c
> cf3d5a46bdf900 100644
> --- a/net/netlink/af_netlink.c
> +++ b/net/netlink/af_netlink.c
> @@ -278,22 +278,26 @@ static bool netlink_filter_tap(const struct sk_buff
> *skb)
> return false;
>  }
> 
> -static int __netlink_deliver_tap_skb(struct sk_buff *skb,
> +static int __netlink_deliver_tap_skb(struct sk_buff **pskb,
>  struct net_device *dev)  {
> -   struct sk_buff *nskb;
> +   struct sk_buff *nskb, *skb = *pskb;
> struct sock *sk = skb->sk;
> int ret = -ENOMEM;
> 
> if (!net_eq(dev_net(dev), sock_net(sk)))
> return 0;
> 
> -   dev_hold(dev);
> -
> -   if (is_vmalloc_addr(skb->head))
> +   if (is_vmalloc_addr(skb->head)) {
> nskb = netlink_to_full_skb(skb, GFP_ATOMIC);
> -   else
> -   nskb = skb_clone(skb, GFP_ATOMIC);
> +   if (!nskb)
> +   return -ENOMEM;
> +   consume_skb(skb);

The original skb cannot be freed, since it is still used after being
delivered to the taps, in __netlink_sendskb.

> +   skb = nskb;
> +   *pskb = skb;
> +   }
> +   dev_hold(dev);
> +   nskb = skb_clone(skb, GFP_ATOMIC);

Since the original skb cannot be freed, skb_clone here will lead to a leak.

> if (nskb) {
> nskb->dev = dev;
> nskb->protocol = htons((u16) sk->sk_protocol); @@ -318,7 
> +322,7
> @@ static void __netlink_deliver_tap(struct sk_buff *skb, struct
> netlink_tap_net *n
> return;
> 
> -   list_for_each_entry_rcu(tmp, &nn->netlink_tap_all, list) {
> -   ret = __netlink_deliver_tap_skb(skb, tmp->dev);
> +   ret = __netlink_deliver_tap_skb(&skb, tmp->dev);
> if (unlikely(ret))
> break;
> }
> 


The change below seems simple, but it adds one extra skb allocation
and free:

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index e3a0538ec0be..b9631137f0fe 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -290,10 +290,8 @@ static int __netlink_deliver_tap_skb(struct sk_buff *skb,
 
dev_hold(dev);
 
-   if (is_vmalloc_addr(skb->head))
-   nskb = netlink_to_full_skb(skb, GFP_ATOMIC);
-   else
-   nskb = skb_clone(skb, GFP_ATOMIC);
+   nskb = skb_clone(skb, GFP_ATOMIC);
+
if (nskb) {
nskb->dev = dev;
nskb->protocol = htons((u16) sk->sk_protocol);
@@ -317,11 +315,20 @@ static void __netlink_deliver_tap(struct sk_buff *skb, 
struct netlink_tap_net *n
if (!netlink_filter_tap(skb))
return;
 
+   if (is_vmalloc_addr(skb->head)) {
+   skb = netlink_to_full_skb(skb, GFP_ATOMIC);
+   if (!skb)
+  return;
+   alloc = true;
+   }
+
list_for_each_entry_rcu(tmp, &nn->netlink_tap_all, list) {
+
ret = __netlink_deliver_tap_skb(skb, tmp->dev);
if (unlikely(ret))
break;
}
+
+   if (alloc)
+   consume_skb(skb);
 }

-Q


[PATCH][next-next][v2] netlink: avoid to allocate full skb when sending to many devices

2018-09-20 Thread Li RongQing
if skb->head is a vmalloc address, delivering the skb to a tap requires
a full re-allocation of the skb; with many tap devices the full
allocation is repeated for every device

now, if the head is vmalloc'ed, allocate one new skb whose data is not
a vmalloc address, and use that newly allocated skb for the
clone-and-send loop, to avoid the full allocation every time.

Signed-off-by: Zhang Yu 
Signed-off-by: Li RongQing 
---
 net/netlink/af_netlink.c | 37 ++---
 1 file changed, 30 insertions(+), 7 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index e3a0538ec0be..a5b1bf706526 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -279,21 +279,25 @@ static bool netlink_filter_tap(const struct sk_buff *skb)
 }
 
 static int __netlink_deliver_tap_skb(struct sk_buff *skb,
-struct net_device *dev)
+struct net_device *dev, bool alloc, bool last)
 {
struct sk_buff *nskb;
struct sock *sk = skb->sk;
int ret = -ENOMEM;
 
-   if (!net_eq(dev_net(dev), sock_net(sk)))
+   if (!net_eq(dev_net(dev), sock_net(sk))) {
+   if (last && alloc)
+   consume_skb(skb);
return 0;
+   }
 
dev_hold(dev);
 
-   if (is_vmalloc_addr(skb->head))
-   nskb = netlink_to_full_skb(skb, GFP_ATOMIC);
+   if (unlikely(last && alloc))
+   nskb = skb;
else
nskb = skb_clone(skb, GFP_ATOMIC);
+
if (nskb) {
nskb->dev = dev;
nskb->protocol = htons((u16) sk->sk_protocol);
@@ -303,6 +307,8 @@ static int __netlink_deliver_tap_skb(struct sk_buff *skb,
ret = dev_queue_xmit(nskb);
if (unlikely(ret > 0))
ret = net_xmit_errno(ret);
+   } else if (alloc) {
+   kfree_skb(skb);
}
 
dev_put(dev);
@@ -311,16 +317,33 @@ static int __netlink_deliver_tap_skb(struct sk_buff *skb,
 
 static void __netlink_deliver_tap(struct sk_buff *skb, struct netlink_tap_net 
*nn)
 {
+   struct netlink_tap *tmp, *next;
+   bool alloc = false;
int ret;
-   struct netlink_tap *tmp;
 
if (!netlink_filter_tap(skb))
return;
 
-   list_for_each_entry_rcu(tmp, &nn->netlink_tap_all, list) {
-   ret = __netlink_deliver_tap_skb(skb, tmp->dev);
+   tmp = list_first_or_null_rcu(&nn->netlink_tap_all,
+   struct netlink_tap, list);
+   if (!tmp)
+   return;
+
+   if (is_vmalloc_addr(skb->head)) {
+   skb = netlink_to_full_skb(skb, GFP_ATOMIC);
+   if (!skb)
+   return;
+   alloc = true;
+   }
+
+   while (tmp) {
+   next = list_next_or_null_rcu(&nn->netlink_tap_all, &tmp->list,
+   struct netlink_tap, list);
+
+   ret = __netlink_deliver_tap_skb(skb, tmp->dev, alloc, !next);
if (unlikely(ret))
break;
+   tmp = next;
}
 }
 
-- 
2.16.2



RE: [PATCH][net-next] netlink: avoid to allocate full skb when sending to many devices

2018-09-18 Thread Li,Rongqing
> On 09/17/2018 10:26 PM, Li RongQing wrote:
> > if skb->head is vmalloc address, when this skb is delivered, full
> > allocation for this skb is required, if there are many devices, the
> > ---
> >  net/netlink/af_netlink.c | 14 --
> >  1 file changed, 8 insertions(+), 6 deletions(-)
> >
> >
> 
> This looks very broken to me.
> 
> Only the original skb (given as an argument to __netlink_deliver_tap()) is
> guaranteed to not disappear while the loop is performed.
> 
> (There is no skb_clone() after the first netlink_to_full_skb())
> 

Thank you;

I will rework it

-RongQing


[PATCH][net-next] netlink: avoid to allocate full skb when sending to many devices

2018-09-17 Thread Li RongQing
if skb->head is a vmalloc address, delivering the skb to a tap requires
a full re-allocation of the skb; with many tap devices the full
allocation is repeated for every device

now reuse the skb allocated for the first device when iterating over
the other devices, reducing the full allocations and speeding up delivery.

Signed-off-by: Zhang Yu 
Signed-off-by: Li RongQing 
---
 net/netlink/af_netlink.c | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index e3a0538ec0be..095b99e3c1fb 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -278,11 +278,11 @@ static bool netlink_filter_tap(const struct sk_buff *skb)
return false;
 }
 
-static int __netlink_deliver_tap_skb(struct sk_buff *skb,
+static int __netlink_deliver_tap_skb(struct sk_buff **skb,
 struct net_device *dev)
 {
struct sk_buff *nskb;
-   struct sock *sk = skb->sk;
+   struct sock *sk = (*skb)->sk;
int ret = -ENOMEM;
 
if (!net_eq(dev_net(dev), sock_net(sk)))
@@ -290,10 +290,12 @@ static int __netlink_deliver_tap_skb(struct sk_buff *skb,
 
dev_hold(dev);
 
-   if (is_vmalloc_addr(skb->head))
-   nskb = netlink_to_full_skb(skb, GFP_ATOMIC);
+   if (is_vmalloc_addr((*skb)->head)) {
+   nskb = netlink_to_full_skb(*skb, GFP_ATOMIC);
+   *skb = nskb;
+   }
else
-   nskb = skb_clone(skb, GFP_ATOMIC);
+   nskb = skb_clone(*skb, GFP_ATOMIC);
if (nskb) {
nskb->dev = dev;
nskb->protocol = htons((u16) sk->sk_protocol);
@@ -318,7 +320,7 @@ static void __netlink_deliver_tap(struct sk_buff *skb, 
struct netlink_tap_net *n
return;
 
list_for_each_entry_rcu(tmp, &nn->netlink_tap_all, list) {
-   ret = __netlink_deliver_tap_skb(skb, tmp->dev);
+   ret = __netlink_deliver_tap_skb(&skb, tmp->dev);
if (unlikely(ret))
break;
}
-- 
2.16.2



[PATCH][net-next] veth: rename pcpu_vstats as pcpu_lstats

2018-09-17 Thread Li RongQing
struct pcpu_vstats and struct pcpu_lstats have the same members and
usage, and pcpu_lstats is used in many files, so rename pcpu_vstats
to pcpu_lstats and drop the duplicate definition.

Signed-off-by: Zhang Yu 
Signed-off-by: Li RongQing 
---
 drivers/net/veth.c| 22 --
 include/linux/netdevice.h |  1 -
 2 files changed, 8 insertions(+), 15 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index bc8faf13a731..aeecb5892e26 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -37,12 +37,6 @@
#define VETH_XDP_TX	BIT(0)
 #define VETH_XDP_REDIR BIT(1)
 
-struct pcpu_vstats {
-   u64 packets;
-   u64 bytes;
-   struct u64_stats_sync   syncp;
-};
-
 struct veth_rq {
struct napi_struct  xdp_napi;
struct net_device   *dev;
@@ -217,7 +211,7 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct 
net_device *dev)
 
skb_tx_timestamp(skb);
if (likely(veth_forward_skb(rcv, skb, rq, rcv_xdp) == NET_RX_SUCCESS)) {
-   struct pcpu_vstats *stats = this_cpu_ptr(dev->vstats);
+   struct pcpu_lstats *stats = this_cpu_ptr(dev->lstats);
 
u64_stats_update_begin(&stats->syncp);
stats->bytes += length;
@@ -236,7 +230,7 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct 
net_device *dev)
return NETDEV_TX_OK;
 }
 
-static u64 veth_stats_one(struct pcpu_vstats *result, struct net_device *dev)
+static u64 veth_stats_one(struct pcpu_lstats *result, struct net_device *dev)
 {
struct veth_priv *priv = netdev_priv(dev);
int cpu;
@@ -244,7 +238,7 @@ static u64 veth_stats_one(struct pcpu_vstats *result, 
struct net_device *dev)
result->packets = 0;
result->bytes = 0;
for_each_possible_cpu(cpu) {
-   struct pcpu_vstats *stats = per_cpu_ptr(dev->vstats, cpu);
+   struct pcpu_lstats *stats = per_cpu_ptr(dev->lstats, cpu);
u64 packets, bytes;
unsigned int start;
 
@@ -264,7 +258,7 @@ static void veth_get_stats64(struct net_device *dev,
 {
struct veth_priv *priv = netdev_priv(dev);
struct net_device *peer;
-   struct pcpu_vstats one;
+   struct pcpu_lstats one;
 
tot->tx_dropped = veth_stats_one(&one, dev);
tot->tx_bytes = one.bytes;
@@ -830,13 +824,13 @@ static int veth_dev_init(struct net_device *dev)
 {
int err;
 
-   dev->vstats = netdev_alloc_pcpu_stats(struct pcpu_vstats);
-   if (!dev->vstats)
+   dev->lstats = netdev_alloc_pcpu_stats(struct pcpu_lstats);
+   if (!dev->lstats)
return -ENOMEM;
 
err = veth_alloc_queues(dev);
if (err) {
-   free_percpu(dev->vstats);
+   free_percpu(dev->lstats);
return err;
}
 
@@ -846,7 +840,7 @@ static int veth_dev_init(struct net_device *dev)
 static void veth_dev_free(struct net_device *dev)
 {
veth_free_queues(dev);
-   free_percpu(dev->vstats);
+   free_percpu(dev->lstats);
 }
 
 #ifdef CONFIG_NET_POLL_CONTROLLER
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index baed5d5088c5..1cbbf77a685f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2000,7 +2000,6 @@ struct net_device {
struct pcpu_lstats __percpu *lstats;
struct pcpu_sw_netstats __percpu*tstats;
struct pcpu_dstats __percpu *dstats;
-   struct pcpu_vstats __percpu *vstats;
};
 
 #if IS_ENABLED(CONFIG_GARP)
-- 
2.16.2



[PATCH][net-next] net: move definition of pcpu_lstats to header file

2018-09-14 Thread Li RongQing
pcpu_lstats is defined in several files, so unify the definitions
into one and move it to a header file.

Signed-off-by: Zhang Yu 
Signed-off-by: Li RongQing 
---
 drivers/net/loopback.c|  6 --
 drivers/net/nlmon.c   |  6 --
 drivers/net/vsockmon.c| 14 --
 include/linux/netdevice.h |  6 ++
 4 files changed, 10 insertions(+), 22 deletions(-)

diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
index 30612497643c..a7207fa7e451 100644
--- a/drivers/net/loopback.c
+++ b/drivers/net/loopback.c
@@ -59,12 +59,6 @@
 #include 
 #include 
 
-struct pcpu_lstats {
-   u64 packets;
-   u64 bytes;
-   struct u64_stats_sync   syncp;
-};
-
 /* The higher levels take care of making this non-reentrant (it's
  * called with bh's disabled).
  */
diff --git a/drivers/net/nlmon.c b/drivers/net/nlmon.c
index 4b22955de191..dd0db7534cb3 100644
--- a/drivers/net/nlmon.c
+++ b/drivers/net/nlmon.c
@@ -6,12 +6,6 @@
 #include 
 #include 
 
-struct pcpu_lstats {
-   u64 packets;
-   u64 bytes;
-   struct u64_stats_sync syncp;
-};
-
 static netdev_tx_t nlmon_xmit(struct sk_buff *skb, struct net_device *dev)
 {
int len = skb->len;
diff --git a/drivers/net/vsockmon.c b/drivers/net/vsockmon.c
index c28bdce14fd5..7bad5c95551f 100644
--- a/drivers/net/vsockmon.c
+++ b/drivers/net/vsockmon.c
@@ -11,12 +11,6 @@
 #define DEFAULT_MTU (VIRTIO_VSOCK_MAX_PKT_BUF_SIZE + \
 sizeof(struct af_vsockmon_hdr))
 
-struct pcpu_lstats {
-   u64 rx_packets;
-   u64 rx_bytes;
-   struct u64_stats_sync syncp;
-};
-
 static int vsockmon_dev_init(struct net_device *dev)
 {
dev->lstats = netdev_alloc_pcpu_stats(struct pcpu_lstats);
@@ -56,8 +50,8 @@ static netdev_tx_t vsockmon_xmit(struct sk_buff *skb, struct 
net_device *dev)
struct pcpu_lstats *stats = this_cpu_ptr(dev->lstats);
 
u64_stats_update_begin(&stats->syncp);
-   stats->rx_bytes += len;
-   stats->rx_packets++;
+   stats->bytes += len;
+   stats->packets++;
u64_stats_update_end(>syncp);
 
dev_kfree_skb(skb);
@@ -80,8 +74,8 @@ vsockmon_get_stats64(struct net_device *dev, struct 
rtnl_link_stats64 *stats)
 
do {
start = u64_stats_fetch_begin_irq(&vstats->syncp);
-   tbytes = vstats->rx_bytes;
-   tpackets = vstats->rx_packets;
+   tbytes = vstats->bytes;
+   tpackets = vstats->packets;
} while (u64_stats_fetch_retry_irq(&vstats->syncp, start));
 
packets += tpackets;
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e2b3bd750c98..baed5d5088c5 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2382,6 +2382,12 @@ struct pcpu_sw_netstats {
struct u64_stats_sync   syncp;
 };
 
+struct pcpu_lstats {
+   u64 packets;
+   u64 bytes;
+   struct u64_stats_sync syncp;
+};
+
 #define __netdev_alloc_pcpu_stats(type, gfp)   \
 ({ \
typeof(type) __percpu *pcpu_stats = alloc_percpu_gfp(type, gfp);\
-- 
2.16.2



[PATCH][net-next][v2] netlink: remove hash::nelems check in netlink_insert

2018-09-10 Thread Li RongQing
The type of hash::nelems has been changed from size_t to atomic_t,
which is in fact an int, so there is no need to check whether
BITS_PER_LONG (the bit width of size_t) is bigger than 32

and rht_grow_above_max() is called anyway to check whether the
hashtable is too big, ensuring it cannot grow bigger than 1<<31

Signed-off-by: Zhang Yu 
Signed-off-by: Li RongQing 
---
 net/netlink/af_netlink.c | 5 -
 1 file changed, 5 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index b4a29bcc33b9..e3a0538ec0be 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -574,11 +574,6 @@ static int netlink_insert(struct sock *sk, u32 portid)
if (nlk_sk(sk)->bound)
goto err;
 
-   err = -ENOMEM;
-   if (BITS_PER_LONG > 32 &&
-   unlikely(atomic_read(&table->hash.nelems) >= UINT_MAX))
-   goto err;
-
nlk_sk(sk)->portid = portid;
sock_hold(sk);
 
-- 
2.16.2



Re: [PATCH] netlink: fix hash::nelems check

2018-09-09 Thread Li RongQing
After reconsidering, I think we can remove this check entirely, since
rht_grow_above_max() will be called to check the overflow again in
rhashtable_insert_one.

Also, atomic_read(&table->hash.nelems) is always compared against an
unsigned value, which forces a conversion to unsigned, so a hash.nelems
overflow would be accepted anyway.

-Rong
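
A short illustration of the conversion described above, using a
hypothetical table pointer:

	int nelems = atomic_read(&table->hash.nelems);

	/* in 'nelems >= UINT_MAX' the int is converted to unsigned int, so the
	 * test is only true once nelems has already wrapped to -1; long before
	 * that, rht_grow_above_max() stops insertions at the table's max_elems
	 * limit (capped at 1U << 31) */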


[PATCH] netlink: fix hash::nelems check

2018-09-08 Thread Li RongQing
The type of hash::nelems has been changed from size_t to atomic_t,
which is in fact an int, so it can never be bigger than UINT_MAX.

Fixes: 97defe1ecf86 ("rhashtable: Per bucket locks & deferred expansion/shrinking")
Signed-off-by: Zhang Yu 
Signed-off-by: Li RongQing 
---
 net/netlink/af_netlink.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index b4a29bcc33b9..412437baee63 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -575,8 +575,7 @@ static int netlink_insert(struct sock *sk, u32 portid)
goto err;
 
err = -ENOMEM;
-   if (BITS_PER_LONG > 32 &&
-   unlikely(atomic_read(&table->hash.nelems) >= UINT_MAX))
+   if (unlikely(atomic_read(&table->hash.nelems) == INT_MAX))
goto err;
 
nlk_sk(sk)->portid = portid;
-- 
2.16.2



[PATCH][net-next] vxlan: reduce dirty cache line in vxlan_find_mac

2018-08-28 Thread Li RongQing
vxlan_find_mac() unconditionally sets f->used for every packet. This
causes a cache miss for every packet, since remote, hlist and used of
vxlan_fdb share the same cache line and are accessed when sending
every packet.

So set f->used only when it differs from jiffies, to dirty the cache
line less often; this gives a 3% speed-up with small packets.

Signed-off-by: Zhang Yu 
Signed-off-by: Li RongQing 
---
 drivers/net/vxlan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index ababba37d735..e5d236595206 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -464,7 +464,7 @@ static struct vxlan_fdb *vxlan_find_mac(struct vxlan_dev 
*vxlan,
struct vxlan_fdb *f;
 
f = __vxlan_find_mac(vxlan, mac, vni);
-   if (f)
+   if (f && f->used != jiffies)
f->used = jiffies;
 
return f;
-- 
2.16.2



[PATCH][net-next] vxlan: reduce dirty cache line in vxlan_find_mac

2018-08-18 Thread Li RongQing
vxlan_find_mac() unconditionally sets f->used for every packet. This
causes a cache miss for every packet, since remote, hlist and used of
vxlan_fdb share the same cache line.

With this change f->used is set only when it differs from jiffies.
This gives up to 5% speed-up with small packets.

Signed-off-by: Zhang Yu 
Signed-off-by: Li RongQing 
---
 drivers/net/vxlan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index ababba37d735..e5d236595206 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -464,7 +464,7 @@ static struct vxlan_fdb *vxlan_find_mac(struct vxlan_dev 
*vxlan,
struct vxlan_fdb *f;
 
f = __vxlan_find_mac(vxlan, mac, vni);
-   if (f)
+   if (f && f->used != jiffies)
f->used = jiffies;
 
return f;
-- 
2.16.2



[PATCH][net-next][v2] packet: switch kvzalloc to allocate memory

2018-08-12 Thread Li RongQing
The patch includes the following changes:

*Use modern kvzalloc()/kvfree() instead of custom allocations.

*Remove the order argument from alloc_pg_vec; it can be derived from req.

*Remove the order argument from free_pg_vec; free_pg_vec now uses
kvfree, which does not need an order argument.

*Remove pg_vec_order from struct packet_ring_buffer; there is no longer
any need to save/restore 'order'.

*Remove the variable 'order' from packet_set_ring; it is now unused.

Signed-off-by: Zhang Yu 
Signed-off-by: Li RongQing 
---
 net/packet/af_packet.c | 44 +---
 net/packet/internal.h  |  1 -
 2 files changed, 13 insertions(+), 32 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 75c92a87e7b2..5610061e7f2e 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -4137,52 +4137,36 @@ static const struct vm_operations_struct 
packet_mmap_ops = {
.close  =   packet_mm_close,
 };
 
-static void free_pg_vec(struct pgv *pg_vec, unsigned int order,
-   unsigned int len)
+static void free_pg_vec(struct pgv *pg_vec, unsigned int len)
 {
int i;
 
for (i = 0; i < len; i++) {
if (likely(pg_vec[i].buffer)) {
-   if (is_vmalloc_addr(pg_vec[i].buffer))
-   vfree(pg_vec[i].buffer);
-   else
-   free_pages((unsigned long)pg_vec[i].buffer,
-  order);
+   kvfree(pg_vec[i].buffer);
pg_vec[i].buffer = NULL;
}
}
kfree(pg_vec);
 }
 
-static char *alloc_one_pg_vec_page(unsigned long order)
+static char *alloc_one_pg_vec_page(unsigned long size)
 {
char *buffer;
-   gfp_t gfp_flags = GFP_KERNEL | __GFP_COMP |
- __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY;
 
-   buffer = (char *) __get_free_pages(gfp_flags, order);
+   buffer = kvzalloc(size, GFP_KERNEL);
if (buffer)
return buffer;
 
-   /* __get_free_pages failed, fall back to vmalloc */
-   buffer = vzalloc(array_size((1 << order), PAGE_SIZE));
-   if (buffer)
-   return buffer;
+   buffer = kvzalloc(size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
 
-   /* vmalloc failed, lets dig into swap here */
-   gfp_flags &= ~__GFP_NORETRY;
-   buffer = (char *) __get_free_pages(gfp_flags, order);
-   if (buffer)
-   return buffer;
-
-   /* complete and utter failure */
-   return NULL;
+   return buffer;
 }
 
-static struct pgv *alloc_pg_vec(struct tpacket_req *req, int order)
+static struct pgv *alloc_pg_vec(struct tpacket_req *req)
 {
unsigned int block_nr = req->tp_block_nr;
+   unsigned long size = req->tp_block_size;
struct pgv *pg_vec;
int i;
 
@@ -4191,7 +4175,7 @@ static struct pgv *alloc_pg_vec(struct tpacket_req *req, 
int order)
goto out;
 
for (i = 0; i < block_nr; i++) {
-   pg_vec[i].buffer = alloc_one_pg_vec_page(order);
+   pg_vec[i].buffer = alloc_one_pg_vec_page(size);
if (unlikely(!pg_vec[i].buffer))
goto out_free_pgvec;
}
@@ -4200,7 +4184,7 @@ static struct pgv *alloc_pg_vec(struct tpacket_req *req, 
int order)
return pg_vec;
 
 out_free_pgvec:
-   free_pg_vec(pg_vec, order, block_nr);
+   free_pg_vec(pg_vec, block_nr);
pg_vec = NULL;
goto out;
 }
@@ -4210,9 +4194,9 @@ static int packet_set_ring(struct sock *sk, union 
tpacket_req_u *req_u,
 {
struct pgv *pg_vec = NULL;
struct packet_sock *po = pkt_sk(sk);
-   int was_running, order = 0;
struct packet_ring_buffer *rb;
struct sk_buff_head *rb_queue;
+   int was_running;
__be16 num;
int err = -EINVAL;
/* Added to avoid minimal code churn */
@@ -4274,8 +4258,7 @@ static int packet_set_ring(struct sock *sk, union 
tpacket_req_u *req_u,
goto out;
 
err = -ENOMEM;
-   order = get_order(req->tp_block_size);
-   pg_vec = alloc_pg_vec(req, order);
+   pg_vec = alloc_pg_vec(req);
if (unlikely(!pg_vec))
goto out;
switch (po->tp_version) {
@@ -4329,7 +4312,6 @@ static int packet_set_ring(struct sock *sk, union 
tpacket_req_u *req_u,
rb->frame_size = req->tp_frame_size;
spin_unlock_bh(&rb_queue->lock);
 
-   swap(rb->pg_vec_order, order);
swap(rb->pg_vec_len, req->tp_block_nr);
 
rb->pg_vec_pages = req->tp_block_size/PAGE_SIZE;
@@ -4355,7 +4337,7 @@ static int packet_set_ring(struct sock *sk, union 
tpacket_req_u *req_u,
}
 
if (pg_vec)
-   free_pg_vec(pg_vec, order, req->tp_block_nr);
+   

[PATCH][net-next] packet: switch kvzalloc to allocate memory

2018-08-10 Thread Li RongQing
Use modern kvzalloc()/kvfree() instead of the custom allocations,
and remove the order argument from free_pg_vec and alloc_pg_vec;
kvfree does not need it, and alloc_pg_vec can derive it from req.

Signed-off-by: Zhang Yu 
Signed-off-by: Li RongQing 
---
 net/packet/af_packet.c | 40 
 1 file changed, 12 insertions(+), 28 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 75c92a87e7b2..f28fcaba4f36 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -4137,52 +4137,36 @@ static const struct vm_operations_struct 
packet_mmap_ops = {
.close  =   packet_mm_close,
 };
 
-static void free_pg_vec(struct pgv *pg_vec, unsigned int order,
-   unsigned int len)
+static void free_pg_vec(struct pgv *pg_vec, unsigned int len)
 {
int i;
 
for (i = 0; i < len; i++) {
if (likely(pg_vec[i].buffer)) {
-   if (is_vmalloc_addr(pg_vec[i].buffer))
-   vfree(pg_vec[i].buffer);
-   else
-   free_pages((unsigned long)pg_vec[i].buffer,
-  order);
+   kvfree(pg_vec[i].buffer);
pg_vec[i].buffer = NULL;
}
}
kfree(pg_vec);
 }
 
-static char *alloc_one_pg_vec_page(unsigned long order)
+static char *alloc_one_pg_vec_page(unsigned long size)
 {
char *buffer;
-   gfp_t gfp_flags = GFP_KERNEL | __GFP_COMP |
- __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY;
 
-   buffer = (char *) __get_free_pages(gfp_flags, order);
+   buffer = kvzalloc(size, GFP_KERNEL);
if (buffer)
return buffer;
 
-   /* __get_free_pages failed, fall back to vmalloc */
-   buffer = vzalloc(array_size((1 << order), PAGE_SIZE));
-   if (buffer)
-   return buffer;
+   buffer = kvzalloc(size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
 
-   /* vmalloc failed, lets dig into swap here */
-   gfp_flags &= ~__GFP_NORETRY;
-   buffer = (char *) __get_free_pages(gfp_flags, order);
-   if (buffer)
-   return buffer;
-
-   /* complete and utter failure */
-   return NULL;
+   return buffer;
 }
 
-static struct pgv *alloc_pg_vec(struct tpacket_req *req, int order)
+static struct pgv *alloc_pg_vec(struct tpacket_req *req)
 {
unsigned int block_nr = req->tp_block_nr;
+   unsigned long size = req->tp_block_size;
struct pgv *pg_vec;
int i;
 
@@ -4191,7 +4175,7 @@ static struct pgv *alloc_pg_vec(struct tpacket_req *req, 
int order)
goto out;
 
for (i = 0; i < block_nr; i++) {
-   pg_vec[i].buffer = alloc_one_pg_vec_page(order);
+   pg_vec[i].buffer = alloc_one_pg_vec_page(size);
if (unlikely(!pg_vec[i].buffer))
goto out_free_pgvec;
}
@@ -4200,7 +4184,7 @@ static struct pgv *alloc_pg_vec(struct tpacket_req *req, 
int order)
return pg_vec;
 
 out_free_pgvec:
-   free_pg_vec(pg_vec, order, block_nr);
+   free_pg_vec(pg_vec, block_nr);
pg_vec = NULL;
goto out;
 }
@@ -4275,7 +4259,7 @@ static int packet_set_ring(struct sock *sk, union 
tpacket_req_u *req_u,
 
err = -ENOMEM;
order = get_order(req->tp_block_size);
-   pg_vec = alloc_pg_vec(req, order);
+   pg_vec = alloc_pg_vec(req);
if (unlikely(!pg_vec))
goto out;
switch (po->tp_version) {
@@ -4355,7 +4339,7 @@ static int packet_set_ring(struct sock *sk, union 
tpacket_req_u *req_u,
}
 
if (pg_vec)
-   free_pg_vec(pg_vec, order, req->tp_block_nr);
+   free_pg_vec(pg_vec, req->tp_block_nr);
 out:
return err;
 }
-- 
2.16.2



[PATCH][net-next] tun: not use hardcoded mask value

2018-08-03 Thread Li RongQing
0x3ff in tun_hashfn is the mask for TUN_NUM_FLOW_ENTRIES. Instead of
hardcoding it, define a macro that expresses its relationship to
TUN_NUM_FLOW_ENTRIES.

Signed-off-by: Li RongQing 
---
 drivers/net/tun.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 0a3134712652..2bbefe828670 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -200,6 +200,7 @@ struct tun_flow_entry {
 };
 
 #define TUN_NUM_FLOW_ENTRIES 1024
+#define TUN_MASK_FLOW_ENTRIES (TUN_NUM_FLOW_ENTRIES - 1)
 
 struct tun_prog {
struct rcu_head rcu;
@@ -406,7 +407,7 @@ static inline __virtio16 cpu_to_tun16(struct tun_struct 
*tun, u16 val)
 
 static inline u32 tun_hashfn(u32 rxhash)
 {
-   return rxhash & 0x3ff;
+   return rxhash & TUN_MASK_FLOW_ENTRIES;
 }
 
 static struct tun_flow_entry *tun_flow_find(struct hlist_head *head, u32 
rxhash)
-- 
2.16.2
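
Since the mask trick only works for power-of-two table sizes, a build-time
check could document that assumption; this is a hedged suggestion, not
part of the patch:

	static inline u32 tun_hashfn(u32 rxhash)
	{
		BUILD_BUG_ON(TUN_NUM_FLOW_ENTRIES & TUN_MASK_FLOW_ENTRIES);
		return rxhash & TUN_MASK_FLOW_ENTRIES;
	}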



[PATCH][net-next] net: check extack._msg before print

2018-08-03 Thread Li RongQing
dev_set_mtu_ext() can fail even with a valid mtu value; in that case
extack._msg is never set and contains random stack data, so the kernel
can crash when printing it.

Fixes: 7a4c53bee3324a ("net: report invalid mtu value via netlink extack")
Signed-off-by: Zhang Yu 
Signed-off-by: Li RongQing 
---
 net/core/dev.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 36e994519488..f68122f0ab02 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -7583,8 +7583,9 @@ int dev_set_mtu(struct net_device *dev, int new_mtu)
struct netlink_ext_ack extack;
int err;
 
+   memset(&extack, 0, sizeof(extack));
err = dev_set_mtu_ext(dev, new_mtu, &extack);
-   if (err)
+   if (err && extack._msg)
net_err_ratelimited("%s: %s\n", dev->name, extack._msg);
return err;
 }
-- 
2.16.2
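
An equivalent way to obtain a zeroed extack on the stack, shown only as a
sketch of the same fix:

	struct netlink_ext_ack extack = {};

	err = dev_set_mtu_ext(dev, new_mtu, &extack);
	if (err && extack._msg)
		net_err_ratelimited("%s: %s\n", dev->name, extack._msg);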



[PATCH][net-next] openvswitch: eliminate cpu_used_mask from sw_flow

2018-07-27 Thread Li RongQing
The size of struct cpumask varies with CONFIG_NR_CPUS. Some configs set
CONFIG_NR_CPUS very large, such as 5120, where struct cpumask takes 640
bytes; with thousands of flows this wastes a lot of memory.
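
For scale, the arithmetic behind that number on a 64-bit build:

	/* sizeof(struct cpumask) == BITS_TO_LONGS(CONFIG_NR_CPUS) * sizeof(long)
	 * with CONFIG_NR_CPUS = 5120: 5120 / 64 = 80 longs, 80 * 8 = 640 bytes,
	 * paid once per struct sw_flow */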

cpu_used_mask had two purposes:
1: assume the first CPU is cpu0, which may not be true; now use
   cpumask_first(cpu_possible_mask)
2: reduce the iteration when getting/clearing statistics; but that
   is not a hot path, so use for_each_possible_cpu

Signed-off-by: Zhang Yu 
Signed-off-by: Li RongQing 
---
 net/openvswitch/flow.c   | 11 +--
 net/openvswitch/flow.h   |  5 ++---
 net/openvswitch/flow_table.c | 11 +--
 3 files changed, 12 insertions(+), 15 deletions(-)

diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 56b8e7167790..ad580bec00fb 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -85,7 +85,9 @@ void ovs_flow_stats_update(struct sw_flow *flow, __be16 
tcp_flags,
if (cpu == 0 && unlikely(flow->stats_last_writer != cpu))
flow->stats_last_writer = cpu;
} else {
-   stats = rcu_dereference(flow->stats[0]); /* Pre-allocated. */
+   int cpu1 = cpumask_first(cpu_possible_mask);
+
+   stats = rcu_dereference(flow->stats[cpu1]); /* Pre-allocated. */
spin_lock(&stats->lock);
 
/* If the current CPU is the only writer on the
@@ -118,7 +120,6 @@ void ovs_flow_stats_update(struct sw_flow *flow, __be16 
tcp_flags,
 
rcu_assign_pointer(flow->stats[cpu],
   new_stats);
-   cpumask_set_cpu(cpu, &flow->cpu_used_mask);
goto unlock;
}
}
@@ -145,8 +146,7 @@ void ovs_flow_stats_get(const struct sw_flow *flow,
*tcp_flags = 0;
memset(ovs_stats, 0, sizeof(*ovs_stats));
 
-   /* We open code this to make sure cpu 0 is always considered */
-   for (cpu = 0; cpu < nr_cpu_ids; cpu = cpumask_next(cpu, &flow->cpu_used_mask)) {
+   for_each_possible_cpu(cpu) {
struct flow_stats *stats = 
rcu_dereference_ovsl(flow->stats[cpu]);
 
if (stats) {
@@ -169,8 +169,7 @@ void ovs_flow_stats_clear(struct sw_flow *flow)
 {
int cpu;
 
-   /* We open code this to make sure cpu 0 is always considered */
-   for (cpu = 0; cpu < nr_cpu_ids; cpu = cpumask_next(cpu, &flow->cpu_used_mask)) {
+   for_each_possible_cpu(cpu) {
struct flow_stats *stats = ovsl_dereference(flow->stats[cpu]);
 
if (stats) {
diff --git a/net/openvswitch/flow.h b/net/openvswitch/flow.h
index c670dd24b8b7..d0ea5d6ced3e 100644
--- a/net/openvswitch/flow.h
+++ b/net/openvswitch/flow.h
@@ -223,17 +223,16 @@ struct sw_flow {
u32 hash;
} flow_table, ufid_table;
int stats_last_writer;  /* CPU id of the last writer on
-* 'stats[0]'.
+* 'stats[first cpu id]'.
 */
struct sw_flow_key key;
struct sw_flow_id id;
-   struct cpumask cpu_used_mask;
struct sw_flow_mask *mask;
struct sw_flow_actions __rcu *sf_acts;
struct flow_stats __rcu *stats[]; /* One for each CPU.  First one
   * is allocated at flow creation time,
   * the rest are allocated on demand
-  * while holding the 'stats[0].lock'.
+  * while holding the 'stats[first cpu 
id].lock'
   */
 };
 
diff --git a/net/openvswitch/flow_table.c b/net/openvswitch/flow_table.c
index 80ea2a71852e..e4dbd65c308a 100644
--- a/net/openvswitch/flow_table.c
+++ b/net/openvswitch/flow_table.c
@@ -80,6 +80,7 @@ struct sw_flow *ovs_flow_alloc(void)
 {
struct sw_flow *flow;
struct flow_stats *stats;
+   int cpu = cpumask_first(cpu_possible_mask);
 
flow = kmem_cache_zalloc(flow_cache, GFP_KERNEL);
if (!flow)
@@ -90,15 +91,13 @@ struct sw_flow *ovs_flow_alloc(void)
/* Initialize the default stat node. */
stats = kmem_cache_alloc_node(flow_stats_cache,
  GFP_KERNEL | __GFP_ZERO,
- node_online(0) ? 0 : NUMA_NO_NODE);
+ cpu_to_node(cpu));
if (!stats)
goto err;
 
spin_lock_init(&stats->lock);
 
-   RCU_INIT_POINTER(flow->stats[0], stats);
-
-   cpumask_set_cpu(0, &flow->cpu_used_mask);
+   RCU_INIT_POINTER(flow->stats[cpu], stats);
 
return flow;
 err:
@@ -142,11 +141,11 @@ static void flow_free(struct sw_flow *flow)

[PATCH][v3] netfilter: use kvmalloc_array to allocate memory for hashtable

2018-07-25 Thread Li RongQing
nf_ct_alloc_hashtable is used to allocate memory for the conntrack,
NAT bysrc and expectation hashtables. Assuming a 64k bucket size,
which means a 7th-order page allocation, __get_free_pages, called
by nf_ct_alloc_hashtable, will trigger direct memory reclaim and
stall for a long time when the system is under memory stress.

so replace the combination of __get_free_pages and vzalloc with
kvmalloc_array, which provides an overflow check and a fallback
if no high-order memory is available, and does not retry direct
reclaim, reducing the stall
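
(A quick check of the order, assuming 8-byte pointers: 64k buckets *
sizeof(struct hlist_nulls_head) = 64k * 8 bytes = 512 KB, i.e. 128
contiguous 4 KB pages, which is an order-7 allocation.)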

and remove nf_ct_free_hashtable, since it is just a kvfree

Signed-off-by: Zhang Yu 
Signed-off-by: Wang Li 
Signed-off-by: Li RongQing 
---
 include/net/netfilter/nf_conntrack.h |  2 --
 net/netfilter/nf_conntrack_core.c| 29 ++---
 net/netfilter/nf_conntrack_expect.c  |  2 +-
 net/netfilter/nf_conntrack_helper.c  |  4 ++--
 net/netfilter/nf_nat_core.c  |  4 ++--
 5 files changed, 11 insertions(+), 30 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack.h 
b/include/net/netfilter/nf_conntrack.h
index a2b0ed025908..7e012312cd61 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -176,8 +176,6 @@ void nf_ct_netns_put(struct net *net, u8 nfproto);
  */
 void *nf_ct_alloc_hashtable(unsigned int *sizep, int nulls);
 
-void nf_ct_free_hashtable(void *hash, unsigned int size);
-
 int nf_conntrack_hash_check_insert(struct nf_conn *ct);
 bool nf_ct_delete(struct nf_conn *ct, u32 pid, int report);
 
diff --git a/net/netfilter/nf_conntrack_core.c 
b/net/netfilter/nf_conntrack_core.c
index 8a113ca1eea2..429151b4991a 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -2022,16 +2022,6 @@ static int kill_all(struct nf_conn *i, void *data)
return net_eq(nf_ct_net(i), data);
 }
 
-void nf_ct_free_hashtable(void *hash, unsigned int size)
-{
-   if (is_vmalloc_addr(hash))
-   vfree(hash);
-   else
-   free_pages((unsigned long)hash,
-  get_order(sizeof(struct hlist_head) * size));
-}
-EXPORT_SYMBOL_GPL(nf_ct_free_hashtable);
-
 void nf_conntrack_cleanup_start(void)
 {
conntrack_gc_work.exiting = true;
@@ -2042,7 +2032,7 @@ void nf_conntrack_cleanup_end(void)
 {
RCU_INIT_POINTER(nf_ct_hook, NULL);
cancel_delayed_work_sync(&conntrack_gc_work.dwork);
-   nf_ct_free_hashtable(nf_conntrack_hash, nf_conntrack_htable_size);
+   kvfree(nf_conntrack_hash);
 
nf_conntrack_proto_fini();
nf_conntrack_seqadj_fini();
@@ -2108,7 +2098,6 @@ void *nf_ct_alloc_hashtable(unsigned int *sizep, int 
nulls)
 {
struct hlist_nulls_head *hash;
unsigned int nr_slots, i;
-   size_t sz;
 
if (*sizep > (UINT_MAX / sizeof(struct hlist_nulls_head)))
return NULL;
@@ -2116,14 +2105,8 @@ void *nf_ct_alloc_hashtable(unsigned int *sizep, int 
nulls)
BUILD_BUG_ON(sizeof(struct hlist_nulls_head) != sizeof(struct 
hlist_head));
nr_slots = *sizep = roundup(*sizep, PAGE_SIZE / sizeof(struct 
hlist_nulls_head));
 
-   if (nr_slots > (UINT_MAX / sizeof(struct hlist_nulls_head)))
-   return NULL;
-
-   sz = nr_slots * sizeof(struct hlist_nulls_head);
-   hash = (void *)__get_free_pages(GFP_KERNEL | __GFP_NOWARN | __GFP_ZERO,
-   get_order(sz));
-   if (!hash)
-   hash = vzalloc(sz);
+   hash = kvmalloc_array(nr_slots, sizeof(struct hlist_nulls_head),
+   GFP_KERNEL | __GFP_ZERO);
 
if (hash && nulls)
for (i = 0; i < nr_slots; i++)
@@ -2150,7 +2133,7 @@ int nf_conntrack_hash_resize(unsigned int hashsize)
 
old_size = nf_conntrack_htable_size;
if (old_size == hashsize) {
-   nf_ct_free_hashtable(hash, hashsize);
+   kvfree(hash);
return 0;
}
 
@@ -2186,7 +2169,7 @@ int nf_conntrack_hash_resize(unsigned int hashsize)
local_bh_enable();
 
synchronize_net();
-   nf_ct_free_hashtable(old_hash, old_size);
+   kvfree(old_hash);
return 0;
 }
 
@@ -2350,7 +2333,7 @@ int nf_conntrack_init_start(void)
 err_expect:
kmem_cache_destroy(nf_conntrack_cachep);
 err_cachep:
-   nf_ct_free_hashtable(nf_conntrack_hash, nf_conntrack_htable_size);
+   kvfree(nf_conntrack_hash);
return ret;
 }
 
diff --git a/net/netfilter/nf_conntrack_expect.c 
b/net/netfilter/nf_conntrack_expect.c
index 3f586ba23d92..27b84231db10 100644
--- a/net/netfilter/nf_conntrack_expect.c
+++ b/net/netfilter/nf_conntrack_expect.c
@@ -712,5 +712,5 @@ void nf_conntrack_expect_fini(void)
 {
rcu_barrier(); /* Wait for call_rcu() before destroy */
kmem_cache_destroy(nf_ct_expect_cachep);
-   nf_ct_free_hashtable(nf_ct_expect_hash, nf_ct_expect_hsize);
+   kvfree(nf_ct_expect_hash);
 }
diff --gi

Re: [PATCH][v2] netfilter: use kvzalloc to allocate memory for hashtable

2018-07-24 Thread Li,Rongqing


> -----Original Message-----
> From: Eric Dumazet [mailto:eric.duma...@gmail.com]
> Sent: July 25, 2018 13:45
> To: Li,Rongqing ; netdev@vger.kernel.org;
> pa...@netfilter.org; kad...@blackhole.kfki.hu; f...@strlen.de; netfilter-
> de...@vger.kernel.org; coret...@netfilter.org; eduma...@google.com
> Subject: Re: [PATCH][v2] netfilter: use kvzalloc to allocate memory for
> hashtable
> 
> 
> 
> On 07/24/2018 10:34 PM, Li RongQing wrote:
> > nf_ct_alloc_hashtable is used to allocate memory for conntrack, NAT
> > bysrc and expectation hashtable. Assuming 64k bucket size, which means
> > 7th order page allocation, __get_free_pages, called by
> > nf_ct_alloc_hashtable, will trigger the direct memory reclaim and
> > stall for a long time, when system has lots of memory stress
> 
> ...
> 
> > sz = nr_slots * sizeof(struct hlist_nulls_head);
> > -   hash = (void *)__get_free_pages(GFP_KERNEL | __GFP_NOWARN |
> __GFP_ZERO,
> > -   get_order(sz));
> > -   if (!hash)
> > -   hash = vzalloc(sz);
> > +   hash = kvzalloc(sz, GFP_KERNEL);
> 
> 
> You could remove the @sz computation and call
> 
> hash = kvcalloc(nr_slots, sizeof(struct hlist_nulls_head), GFP_KERNEL);
> 
> Thanks to kvmalloc_array() check, you also could remove the :
> 
> if (nr_slots > (UINT_MAX / sizeof(struct hlist_nulls_head)))
> return NULL;
> 
> That would remove a lot of stuff now we have proper helpers.


Ok, I will send v3

Thanks

-RongQing


[PATCH][v2] netfilter: use kvzalloc to allocate memory for hashtable

2018-07-24 Thread Li RongQing
nf_ct_alloc_hashtable is used to allocate memory for the conntrack,
NAT bysrc and expectation hashtables. Assuming a 64k bucket size,
which means a 7th-order page allocation, __get_free_pages, called
by nf_ct_alloc_hashtable, will trigger direct memory reclaim and
stall for a long time when the system is under memory stress.

so replace the combination of __get_free_pages and vzalloc with
kvzalloc, which provides a fallback if no high-order memory is
available, and does not retry direct reclaim, reducing the stall

and remove nf_ct_free_hashtable, since it is just a kvfree

Signed-off-by: Zhang Yu 
Signed-off-by: Wang Li 
Signed-off-by: Li RongQing 
---
 include/net/netfilter/nf_conntrack.h |  2 --
 net/netfilter/nf_conntrack_core.c| 23 +--
 net/netfilter/nf_conntrack_expect.c  |  2 +-
 net/netfilter/nf_conntrack_helper.c  |  4 ++--
 net/netfilter/nf_nat_core.c  |  4 ++--
 5 files changed, 10 insertions(+), 25 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack.h 
b/include/net/netfilter/nf_conntrack.h
index a2b0ed025908..7e012312cd61 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -176,8 +176,6 @@ void nf_ct_netns_put(struct net *net, u8 nfproto);
  */
 void *nf_ct_alloc_hashtable(unsigned int *sizep, int nulls);
 
-void nf_ct_free_hashtable(void *hash, unsigned int size);
-
 int nf_conntrack_hash_check_insert(struct nf_conn *ct);
 bool nf_ct_delete(struct nf_conn *ct, u32 pid, int report);
 
diff --git a/net/netfilter/nf_conntrack_core.c 
b/net/netfilter/nf_conntrack_core.c
index 8a113ca1eea2..191140c469d5 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -2022,16 +2022,6 @@ static int kill_all(struct nf_conn *i, void *data)
return net_eq(nf_ct_net(i), data);
 }
 
-void nf_ct_free_hashtable(void *hash, unsigned int size)
-{
-   if (is_vmalloc_addr(hash))
-   vfree(hash);
-   else
-   free_pages((unsigned long)hash,
-  get_order(sizeof(struct hlist_head) * size));
-}
-EXPORT_SYMBOL_GPL(nf_ct_free_hashtable);
-
 void nf_conntrack_cleanup_start(void)
 {
conntrack_gc_work.exiting = true;
@@ -2042,7 +2032,7 @@ void nf_conntrack_cleanup_end(void)
 {
RCU_INIT_POINTER(nf_ct_hook, NULL);
cancel_delayed_work_sync(&conntrack_gc_work.dwork);
-   nf_ct_free_hashtable(nf_conntrack_hash, nf_conntrack_htable_size);
+   kvfree(nf_conntrack_hash);
 
nf_conntrack_proto_fini();
nf_conntrack_seqadj_fini();
@@ -2120,10 +2110,7 @@ void *nf_ct_alloc_hashtable(unsigned int *sizep, int 
nulls)
return NULL;
 
sz = nr_slots * sizeof(struct hlist_nulls_head);
-   hash = (void *)__get_free_pages(GFP_KERNEL | __GFP_NOWARN | __GFP_ZERO,
-   get_order(sz));
-   if (!hash)
-   hash = vzalloc(sz);
+   hash = kvzalloc(sz, GFP_KERNEL);
 
if (hash && nulls)
for (i = 0; i < nr_slots; i++)
@@ -2150,7 +2137,7 @@ int nf_conntrack_hash_resize(unsigned int hashsize)
 
old_size = nf_conntrack_htable_size;
if (old_size == hashsize) {
-   nf_ct_free_hashtable(hash, hashsize);
+   kvfree(hash);
return 0;
}
 
@@ -2186,7 +2173,7 @@ int nf_conntrack_hash_resize(unsigned int hashsize)
local_bh_enable();
 
synchronize_net();
-   nf_ct_free_hashtable(old_hash, old_size);
+   kvfree(old_hash);
return 0;
 }
 
@@ -2350,7 +2337,7 @@ int nf_conntrack_init_start(void)
 err_expect:
kmem_cache_destroy(nf_conntrack_cachep);
 err_cachep:
-   nf_ct_free_hashtable(nf_conntrack_hash, nf_conntrack_htable_size);
+   kvfree(nf_conntrack_hash);
return ret;
 }
 
diff --git a/net/netfilter/nf_conntrack_expect.c 
b/net/netfilter/nf_conntrack_expect.c
index 3f586ba23d92..27b84231db10 100644
--- a/net/netfilter/nf_conntrack_expect.c
+++ b/net/netfilter/nf_conntrack_expect.c
@@ -712,5 +712,5 @@ void nf_conntrack_expect_fini(void)
 {
rcu_barrier(); /* Wait for call_rcu() before destroy */
kmem_cache_destroy(nf_ct_expect_cachep);
-   nf_ct_free_hashtable(nf_ct_expect_hash, nf_ct_expect_hsize);
+   kvfree(nf_ct_expect_hash);
 }
diff --git a/net/netfilter/nf_conntrack_helper.c 
b/net/netfilter/nf_conntrack_helper.c
index d557a425289d..e24b762ffa1d 100644
--- a/net/netfilter/nf_conntrack_helper.c
+++ b/net/netfilter/nf_conntrack_helper.c
@@ -562,12 +562,12 @@ int nf_conntrack_helper_init(void)
 
return 0;
 out_extend:
-   nf_ct_free_hashtable(nf_ct_helper_hash, nf_ct_helper_hsize);
+   kvfree(nf_ct_helper_hash);
return ret;
 }
 
 void nf_conntrack_helper_fini(void)
 {
nf_ct_extend_unregister(_extend);
-   nf_ct_free_hashtable(nf_ct_helper_hash, nf_ct_helper_hsize);
+   kvfree(nf_ct_helper_hash);
 }
diff --git a/net/netfilter/nf_nat_core.c b/net/netfilter/nf_nat_c

Re: Re: [PATCH] netfilter: avoid stalls in nf_ct_alloc_hashtable

2018-07-24 Thread Li,Rongqing
> 
> On 07/24/2018 02:50 AM, Li,Rongqing wrote:
> 
> > Thanks, Your patch fixes my issue;
> >
> > My patch may be able to reduce stall when modprobe nf module in
> memory
> > stress, Do you think this patch has any value?
> 
> Only if you make it use kvzalloc()/kvfree()
> 
> Thanks.


I will send v2; feel free to add your signature.

Thanks,

-RongQing



Re: [PATCH] netfilter: avoid stalls in nf_ct_alloc_hashtable

2018-07-24 Thread Li,Rongqing


> -----Original Message-----
> From: Florian Westphal [mailto:f...@strlen.de]
> Sent: July 24, 2018 17:20
> To: Li,Rongqing 
> Cc: netdev@vger.kernel.org; pa...@netfilter.org;
> kad...@blackhole.kfki.hu; f...@strlen.de
> Subject: Re: [PATCH] netfilter: avoid stalls in nf_ct_alloc_hashtable
> 
> Li RongQing  wrote:
> > when system forks a process with CLONE_NEWNET flag under the high
> > memory pressure, it will trigger memory reclaim and stall for a long
> > time because nf_ct_alloc_hashtable need to allocate high-order memory
> > at that time. The calltrace as below:
> 
> > nf_ct_alloc_hashtable
> > nf_conntrack_init_net
> 
> This call trace is from a kernel < 4.7.
> 

Sorry;  it is

> commit 56d52d4892d0e478a005b99ed10d0a7f488ea8c1
> netfilter: conntrack: use a single hashtable for all namespaces
> 
> removed per-netns hash table.

Thanks, your patch fixes my issue.

My patch may still reduce the stall when the nf modules are loaded under
memory stress; do you think it has any value?

-RongQing


[PATCH] netfilter: avoid stalls in nf_ct_alloc_hashtable

2018-07-24 Thread Li RongQing
when the system forks a process with the CLONE_NEWNET flag under
high memory pressure, it triggers memory reclaim and stalls for a
long time, because nf_ct_alloc_hashtable needs to allocate
high-order memory at that time. The call trace is as below:

delay_tsc
__delay
_raw_spin_lock
_spin_lock
mmu_shrink
shrink_slab
zone_reclaim
get_page_from_freelist
__alloc_pages_nodemask
alloc_pages_current
__get_free_pages
nf_ct_alloc_hashtable
nf_conntrack_init_net
setup_net
copy_net_ns
create_new_namespaces
copy_namespaces
copy_process
do_fork
sys_clone
stub_clone
__clone

do not use the direct memory reclaim flag, to avoid the stall
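
With __GFP_DIRECT_RECLAIM masked out of GFP_KERNEL, the high-order
attempt fails quickly under memory pressure and the existing vzalloc()
fallback below is used instead; this is just a reading of the hunk, not
extra behaviour added by the patch.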

Signed-off-by: Ni Xun 
Signed-off-by: Zhang Yu 
Signed-off-by: Wang Li 
Signed-off-by: Li RongQing 
---
 net/netfilter/nf_conntrack_core.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/nf_conntrack_core.c 
b/net/netfilter/nf_conntrack_core.c
index 8a113ca1eea2..672c5960530d 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -2120,8 +2120,8 @@ void *nf_ct_alloc_hashtable(unsigned int *sizep, int 
nulls)
return NULL;
 
sz = nr_slots * sizeof(struct hlist_nulls_head);
-   hash = (void *)__get_free_pages(GFP_KERNEL | __GFP_NOWARN | __GFP_ZERO,
-   get_order(sz));
+   hash = (void *)__get_free_pages((GFP_KERNEL & ~__GFP_DIRECT_RECLAIM) |
+   __GFP_NOWARN | __GFP_ZERO, get_order(sz));
if (!hash)
hash = vzalloc(sz);
 
-- 
2.16.2



Re: Re: [PATCH][net-next] bridge: clean up mtu_set_by_user setting to false and comments

2018-07-13 Thread Li,Rongqing

> This is used to differentiate when auto adjust is used and when user has set
> the MTU.
> As I already said everything is working as expected and you should not
> remove this code.
> 

I see, thank you, and sorry for the noise.

-R


Re: [PATCH][net-next] bridge: clean up mtu_set_by_user setting to false and comments

2018-07-13 Thread Li,Rongqing


> -----Original Message-----
> From: Nikolay Aleksandrov [mailto:niko...@cumulusnetworks.com]
> Sent: July 13, 2018 16:01
> To: Li,Rongqing ; netdev@vger.kernel.org
> Subject: Re: [PATCH][net-next] bridge: clean up mtu_set_by_user setting to
> false and comments
> 
> On 13/07/18 09:47, Li RongQing wrote:
> > Once mtu_set_by_user is set to true, br_mtu_auto_adjust will not run,
> > and no chance to clear mtu_set_by_user.
> >
> ^^
> This was by design, there is no error here and no "cleanup" is needed.
> If you read the ndo_change_mtu() call you'll see the comment:
> /* this flag will be cleared if the MTU was automatically adjusted */
>
But after this comment, mtu_set_by_user is set to true, and br_mtu_auto_adjust
will not really run, so how does mtu_set_by_user ever get set back to false?

230 /* this flag will be cleared if the MTU was automatically adjusted */
231 br->mtu_set_by_user = true;

And line 457 is useless, since it only runs when the flag is already false?

445 void br_mtu_auto_adjust(struct net_bridge *br)
446 {
447 ASSERT_RTNL();
448 
449 /* if the bridge MTU was manually configured don't mess with it */
450 if (br->mtu_set_by_user)
451 return;
452 
453 /* change to the minimum MTU and clear the flag which was set by
454  * the bridge ndo_change_mtu callback
455  */
456 dev_set_mtu(br->dev, br_mtu_min(br));
457 br->mtu_set_by_user = false;
458 }


-R


[PATCH][net-next][v2] net: convert gro_count to bitmask

2018-07-13 Thread Li RongQing
gro_hash is 192 bytes and uses 3 cache lines. If there are few
flows, gro_hash may not be fully used, so it is unnecessary to iterate
over all of gro_hash in napi_gro_flush() and touch those cache lines.

convert gro_count to a bitmask, and rename it to gro_bitmask; each bit
represents one element of gro_hash, and a gro_hash element is only
flushed if the related bit is set, to speed up napi_gro_flush().

and update gro_bitmask only when it actually changes, to reduce cache
line dirtying

Suggested-by: Eric Dumazet 
Signed-off-by: Li RongQing 
Cc: Stefano Brivio 
---
netperf shows no difference, maybe because my test machine has a large
cache
 
 include/linux/netdevice.h |  9 +++--
 net/core/dev.c| 36 
 2 files changed, 31 insertions(+), 14 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 2daf2fa6554f..8837a998de3f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -308,9 +308,14 @@ struct gro_list {
 };
 
 /*
- * Structure for NAPI scheduling similar to tasklet but with weighting
+ * size of gro hash buckets, must less than bit number of
+ * napi_struct::gro_bitmask
  */
 #define GRO_HASH_BUCKETS   8
+
+/*
+ * Structure for NAPI scheduling similar to tasklet but with weighting
+ */
 struct napi_struct {
/* The poll_list must only be managed by the entity which
 * changes the state of the NAPI_STATE_SCHED bit.  This means
@@ -322,7 +327,7 @@ struct napi_struct {
 
unsigned long   state;
int weight;
-   unsigned intgro_count;
+   unsigned long   gro_bitmask;
int (*poll)(struct napi_struct *, int);
 #ifdef CONFIG_NETPOLL
int poll_owner;
diff --git a/net/core/dev.c b/net/core/dev.c
index 14a748ee8cc9..e39fef62e285 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5283,9 +5283,11 @@ static void __napi_gro_flush_chain(struct napi_struct 
*napi, u32 index,
list_del(&skb->list);
skb->next = NULL;
napi_gro_complete(skb);
-   napi->gro_count--;
napi->gro_hash[index].count--;
}
+
+   if (!napi->gro_hash[index].count)
+   __clear_bit(index, &napi->gro_bitmask);
 }
 
 /* napi->gro_hash[].list contains packets ordered by age.
@@ -5296,8 +5298,10 @@ void napi_gro_flush(struct napi_struct *napi, bool 
flush_old)
 {
u32 i;
 
-   for (i = 0; i < GRO_HASH_BUCKETS; i++)
-   __napi_gro_flush_chain(napi, i, flush_old);
+   for (i = 0; i < GRO_HASH_BUCKETS; i++) {
+   if (test_bit(i, &napi->gro_bitmask))
+   __napi_gro_flush_chain(napi, i, flush_old);
+   }
 }
 EXPORT_SYMBOL(napi_gro_flush);
 
@@ -5389,8 +5393,8 @@ static void gro_flush_oldest(struct list_head *head)
if (WARN_ON_ONCE(!oldest))
return;
 
-   /* Do not adjust napi->gro_count, caller is adding a new SKB to
-* the chain.
+   /* Do not adjust napi->gro_hash[].count, caller is adding a new
+* SKB to the chain.
 */
list_del(&oldest->list);
napi_gro_complete(oldest);
@@ -5465,7 +5469,6 @@ static enum gro_result dev_gro_receive(struct napi_struct 
*napi, struct sk_buff
list_del(&pp->list);
pp->next = NULL;
napi_gro_complete(pp);
-   napi->gro_count--;
napi->gro_hash[hash].count--;
}
 
@@ -5478,7 +5481,6 @@ static enum gro_result dev_gro_receive(struct napi_struct 
*napi, struct sk_buff
if (unlikely(napi->gro_hash[hash].count >= MAX_GRO_SKBS)) {
gro_flush_oldest(gro_head);
} else {
-   napi->gro_count++;
napi->gro_hash[hash].count++;
}
NAPI_GRO_CB(skb)->count = 1;
@@ -5493,6 +5495,13 @@ static enum gro_result dev_gro_receive(struct 
napi_struct *napi, struct sk_buff
if (grow > 0)
gro_pull_from_frag0(skb, grow);
 ok:
+   if (napi->gro_hash[hash].count) {
+   if (!test_bit(hash, &napi->gro_bitmask))
+   __set_bit(hash, &napi->gro_bitmask);
+   } else if (test_bit(hash, &napi->gro_bitmask)) {
+   __clear_bit(hash, &napi->gro_bitmask);
+   }
+
return ret;
 
 normal:
@@ -5891,7 +5900,7 @@ bool napi_complete_done(struct napi_struct *n, int 
work_done)
 NAPIF_STATE_IN_BUSY_POLL)))
return false;
 
-   if (n->gro_count) {
+   if (n->gro_bitmask) {
unsigned long timeout = 0;
 
if (work_done)
@@ -6100,7 +6109,7 @@ static enum hrtimer_restart napi_watchdog(struct hrtimer 
*timer)
/* Note : we use a relaxed variant of napi_schedule_prep() not setting
 * NAPI_STATE_MISSED, since we do not react to a device IRQ.
 */
-

[PATCH][net-next] bridge: clean up mtu_set_by_user setting to false and comments

2018-07-13 Thread Li RongQing
Once mtu_set_by_user is set to true, br_mtu_auto_adjust will
not run, and there is no chance to clear mtu_set_by_user.

and br_mtu_auto_adjust runs only if mtu_set_by_user is
false, so there is no need to set it to false again

Cc: Nikolay Aleksandrov 
Signed-off-by: Li RongQing 
---
 net/bridge/br_device.c | 1 -
 net/bridge/br_if.c | 4 
 2 files changed, 5 deletions(-)

diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
index e682a668ce57..c636bc2749c2 100644
--- a/net/bridge/br_device.c
+++ b/net/bridge/br_device.c
@@ -227,7 +227,6 @@ static int br_change_mtu(struct net_device *dev, int 
new_mtu)
 
dev->mtu = new_mtu;
 
-   /* this flag will be cleared if the MTU was automatically adjusted */
br->mtu_set_by_user = true;
 #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
/* remember the MTU in the rtable for PMTU */
diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
index 05e42d86882d..47c65da4b1be 100644
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -450,11 +450,7 @@ void br_mtu_auto_adjust(struct net_bridge *br)
if (br->mtu_set_by_user)
return;
 
-   /* change to the minimum MTU and clear the flag which was set by
-* the bridge ndo_change_mtu callback
-*/
dev_set_mtu(br->dev, br_mtu_min(br));
-   br->mtu_set_by_user = false;
 }
 
 static void br_set_gso_limits(struct net_bridge *br)
-- 
2.16.2



Re: [PATCH] net: convert gro_count to bitmask

2018-07-11 Thread Li,Rongqing


> -----Original Message-----
> From: Eric Dumazet [mailto:eric.duma...@gmail.com]
> Sent: July 11, 2018 19:32
> To: Li,Rongqing ; netdev@vger.kernel.org
> Subject: Re: [PATCH] net: convert gro_count to bitmask
> 
> 
> 
> On 07/11/2018 02:15 AM, Li RongQing wrote:
> > gro_hash size is 192 bytes, and uses 3 cache lines, if there is few
> > flows, gro_hash may be not fully used, so it is unnecessary to iterate
> > all gro_hash in napi_gro_flush(), to occupy unnecessary cacheline.
> >
> > convert gro_count to a bitmask, and rename it as gro_bitmask, each bit
> > represents a element of gro_hash, only flush a gro_hash element if the
> > related bit is set, to speed up napi_gro_flush().
> >
> > and update gro_bitmask only if it will be changed, to reduce cache
> > update
> >
> > Suggested-by: Eric Dumazet 
> > Signed-off-by: Li RongQing 
> > ---
> >  include/linux/netdevice.h |  2 +-
> >  net/core/dev.c| 35 +++
> >  2 files changed, 24 insertions(+), 13 deletions(-)
> >
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index b683971e500d..df49b36ef378 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -322,7 +322,7 @@ struct napi_struct {
> >
> > unsigned long   state;
> > int weight;
> > -   unsigned intgro_count;
> > +   unsigned long   gro_bitmask;
> > int (*poll)(struct napi_struct *, int);
> >  #ifdef CONFIG_NETPOLL
> > int poll_owner;
> > diff --git a/net/core/dev.c b/net/core/dev.c index
> > d13cddcac41f..a08dbdd217a6 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -5171,9 +5171,11 @@ static void __napi_gro_flush_chain(struct
> napi_struct *napi, u32 index,
> > return;
> > list_del_init(&skb->list);
> > napi_gro_complete(skb);
> > -   napi->gro_count--;
> > napi->gro_hash[index].count--;
> > }
> > +
> > +   if (!napi->gro_hash[index].count)
> > +   clear_bit(index, &napi->gro_bitmask);
> 
> I suggest you not add an atomic operation here.
> 
> Current cpu owns this NAPI after all.
> 
> Same remark for the whole patch.
> 
> ->  __clear_bit(), __set_bit() and similar operators
> 
> Ideally you should provide TCP_RR number with busy polling enabled, to
> eventually catch regressions.
> 

I will change it and run the test.
Thank you.

-RongQing

> Thanks.
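
A minimal sketch of the difference being discussed (illustrative only,
not taken from any patch in this thread): set_bit()/clear_bit() are
atomic read-modify-write operations, while __set_bit()/__clear_bit()
are plain non-atomic variants, which is enough here because only the
CPU that currently owns the NAPI instance updates gro_bitmask:

	unsigned long mask = 0;

	__set_bit(3, &mask);		/* plain or, no atomic/LOCK overhead */
	if (test_bit(3, &mask))		/* reads are the same for both APIs */
		__clear_bit(3, &mask);	/* plain and-not */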



Re: [PATCH] net: convert gro_count to bitmask

2018-07-11 Thread Li,Rongqing


> -----Original Message-----
> From: David Miller [mailto:da...@davemloft.net]
> Sent: July 12, 2018 10:49
> To: Li,Rongqing 
> Cc: netdev@vger.kernel.org
> Subject: Re: [PATCH] net: convert gro_count to bitmask
> 
> From: Li RongQing 
> Date: Wed, 11 Jul 2018 17:15:53 +0800
> 
> > +   clear_bit(index, >gro_bitmask);
> 
> Please don't use atomics here, at least use __clear_bit().
> 

Thanks, this is the same as Eric's suggestion.


> This is why I did the operations by hand in my version of the patch.
> Also, if you are going to preempt my patch, at least retain the comment I
> added around the GRO_HASH_BUCKETS definitions which warns the reader
> about the limit.
> 

I added a BUILD_BUG_ON in netdev_init, so I think we do not need to add the comment

@@ -9151,6 +9159,9 @@ static struct hlist_head * __net_init 
netdev_create_hash(void)
 /* Initialize per network namespace state */  static int __net_init 
netdev_init(struct net *net)  {
+   BUILD_BUG_ON(GRO_HASH_BUCKETS >
+   FIELD_SIZEOF(struct napi_struct, gro_bitmask));
+


-RongQing

> Thanks.


Re: [PATCH] net: convert gro_count to bitmask

2018-07-11 Thread Li,Rongqing


> -----Original Message-----
> From: Stefano Brivio [mailto:sbri...@redhat.com]
> Sent: July 11, 2018 18:52
> To: Li,Rongqing 
> Cc: netdev@vger.kernel.org; Eric Dumazet 
> Subject: Re: [PATCH] net: convert gro_count to bitmask
> 
> On Wed, 11 Jul 2018 17:15:53 +0800
> Li RongQing  wrote:
> 
> > @@ -5380,6 +5382,12 @@ static enum gro_result dev_gro_receive(struct
> napi_struct *napi, struct sk_buff
> > if (grow > 0)
> > gro_pull_from_frag0(skb, grow);
> >  ok:
> > +   if (napi->gro_hash[hash].count)
> > +   if (!test_bit(hash, &napi->gro_bitmask))
> > +   set_bit(hash, &napi->gro_bitmask);
> > +   else if (test_bit(hash, &napi->gro_bitmask))
> > +   clear_bit(hash, &napi->gro_bitmask);
> 
> This might not do what you want.
> 
> --

could you show more detail?

-RongQing

> Stefano
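
Presumably the problem Stefano is pointing at is the dangling else in
the hunk above: without braces, C binds the else to the innermost if,
so the snippet is parsed as (an illustrative re-indentation, not code
from the patch):

	if (napi->gro_hash[hash].count) {
		if (!test_bit(hash, &napi->gro_bitmask))
			set_bit(hash, &napi->gro_bitmask);
		else if (test_bit(hash, &napi->gro_bitmask))
			clear_bit(hash, &napi->gro_bitmask);
	}

i.e. an already-set bit of a non-empty bucket gets cleared again, and
the bit of a bucket whose count drops to zero is never cleared. The v2
of the patch adds explicit braces so that the clear path only runs when
the bucket is empty.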


[PATCH] net: convert gro_count to bitmask

2018-07-11 Thread Li RongQing
gro_hash is 192 bytes and uses 3 cache lines. If there are few
flows, gro_hash may not be fully used, so it is unnecessary to iterate
over all of gro_hash in napi_gro_flush() and touch those cache lines.

convert gro_count to a bitmask, and rename it to gro_bitmask; each bit
represents one element of gro_hash, and a gro_hash element is only
flushed if the related bit is set, to speed up napi_gro_flush().

and update gro_bitmask only when it actually changes, to reduce cache
line dirtying

Suggested-by: Eric Dumazet 
Signed-off-by: Li RongQing 
---
 include/linux/netdevice.h |  2 +-
 net/core/dev.c| 35 +++
 2 files changed, 24 insertions(+), 13 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index b683971e500d..df49b36ef378 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -322,7 +322,7 @@ struct napi_struct {
 
unsigned long   state;
int weight;
-   unsigned intgro_count;
+   unsigned long   gro_bitmask;
int (*poll)(struct napi_struct *, int);
 #ifdef CONFIG_NETPOLL
int poll_owner;
diff --git a/net/core/dev.c b/net/core/dev.c
index d13cddcac41f..a08dbdd217a6 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5171,9 +5171,11 @@ static void __napi_gro_flush_chain(struct napi_struct 
*napi, u32 index,
return;
list_del_init(&skb->list);
napi_gro_complete(skb);
-   napi->gro_count--;
napi->gro_hash[index].count--;
}
+
+   if (!napi->gro_hash[index].count)
+   clear_bit(index, &napi->gro_bitmask);
 }
 
 /* napi->gro_hash[].list contains packets ordered by age.
@@ -5184,8 +5186,10 @@ void napi_gro_flush(struct napi_struct *napi, bool 
flush_old)
 {
u32 i;
 
-   for (i = 0; i < GRO_HASH_BUCKETS; i++)
-   __napi_gro_flush_chain(napi, i, flush_old);
+   for (i = 0; i < GRO_HASH_BUCKETS; i++) {
+   if (test_bit(i, &napi->gro_bitmask))
+   __napi_gro_flush_chain(napi, i, flush_old);
+   }
 }
 EXPORT_SYMBOL(napi_gro_flush);
 
@@ -5277,8 +5281,8 @@ static void gro_flush_oldest(struct list_head *head)
if (WARN_ON_ONCE(!oldest))
return;
 
-   /* Do not adjust napi->gro_count, caller is adding a new SKB to
-* the chain.
+   /* Do not adjust napi->gro_hash[].count, caller is adding a new
+* SKB to the chain.
 */
list_del(&oldest->list);
napi_gro_complete(oldest);
@@ -5352,7 +5356,6 @@ static enum gro_result dev_gro_receive(struct napi_struct 
*napi, struct sk_buff
if (pp) {
list_del_init(&pp->list);
napi_gro_complete(pp);
-   napi->gro_count--;
napi->gro_hash[hash].count--;
}
 
@@ -5365,7 +5368,6 @@ static enum gro_result dev_gro_receive(struct napi_struct 
*napi, struct sk_buff
if (unlikely(napi->gro_hash[hash].count >= MAX_GRO_SKBS)) {
gro_flush_oldest(gro_head);
} else {
-   napi->gro_count++;
napi->gro_hash[hash].count++;
}
NAPI_GRO_CB(skb)->count = 1;
@@ -5380,6 +5382,12 @@ static enum gro_result dev_gro_receive(struct 
napi_struct *napi, struct sk_buff
if (grow > 0)
gro_pull_from_frag0(skb, grow);
 ok:
+   if (napi->gro_hash[hash].count)
+   if (!test_bit(hash, &napi->gro_bitmask))
+   set_bit(hash, &napi->gro_bitmask);
+   else if (test_bit(hash, &napi->gro_bitmask))
+   clear_bit(hash, &napi->gro_bitmask);
+
return ret;
 
 normal:
@@ -5778,7 +5786,7 @@ bool napi_complete_done(struct napi_struct *n, int 
work_done)
 NAPIF_STATE_IN_BUSY_POLL)))
return false;
 
-   if (n->gro_count) {
+   if (n->gro_bitmask) {
unsigned long timeout = 0;
 
if (work_done)
@@ -5987,7 +5995,7 @@ static enum hrtimer_restart napi_watchdog(struct hrtimer 
*timer)
/* Note : we use a relaxed variant of napi_schedule_prep() not setting
 * NAPI_STATE_MISSED, since we do not react to a device IRQ.
 */
-   if (napi->gro_count && !napi_disable_pending(napi) &&
+   if (napi->gro_bitmask && !napi_disable_pending(napi) &&
!test_and_set_bit(NAPI_STATE_SCHED, >state))
__napi_schedule_irqoff(napi);
 
@@ -6002,7 +6010,7 @@ void netif_napi_add(struct net_device *dev, struct 
napi_struct *napi,
INIT_LIST_HEAD(&napi->poll_list);
hrtimer_init(&napi->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
napi->timer.function = napi_watchdog;
-   napi->gro_count = 0;
+   napi->gro_bitmask = 0;
for (i = 0; i < GRO_

Re: [PATCH][net-next][v2] net: limit each hash list length to MAX_GRO_SKBS

2018-07-10 Thread Li,Rongqing


> -----Original Message-----
> From: Eric Dumazet [mailto:eric.duma...@gmail.com]
> Sent: July 8, 2018 8:22
> To: David Miller ; Li,Rongqing
> 
> Cc: netdev@vger.kernel.org
> Subject: Re: [PATCH][net-next][v2] net: limit each hash list length to
> MAX_GRO_SKBS
> 
> 
> 
> On 07/05/2018 03:20 AM, David Miller wrote:
> > From: Li RongQing 
> > Date: Thu,  5 Jul 2018 14:34:32 +0800
> >
> >> After commit 07d78363dcff ("net: Convert NAPI gro list into a small
> >> hash table.")' there is 8 hash buckets, which allows more flows to be
> >> held for merging.  but MAX_GRO_SKBS, the total held skb for merging,
> >> is 8 skb still, limit the hash table performance.
> >>
> >> keep MAX_GRO_SKBS as 8 skb, but limit each hash list length to 8 skb,
> >> not the total 8 skb
> >>
> >> Signed-off-by: Li RongQing 
> >
> > Applied, thanks.
> >
> 
> Maybe gro_count should be replaced by a bitmask, so that we can speed up
> napi_gro_flush(), since it now has to use 3 cache lines (gro_hash[] size is 
> 192
> bytes)

Do you mean something like this?

Subject: [PATCH][RFC][net-next] net: convert gro_count to bitmask

convert gro_count to a bitmask, and rename it as gro_bitmask to speed
up napi_gro_flush(), since gro_hash now has to use 3 cache lines

---
 include/linux/netdevice.h |  2 +-
 net/core/dev.c| 36 
 2 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index b683971e500d..df49b36ef378 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -322,7 +322,7 @@ struct napi_struct {
 
unsigned long   state;
int weight;
-   unsigned intgro_count;
+   unsigned long   gro_bitmask;
int (*poll)(struct napi_struct *, int);
 #ifdef CONFIG_NETPOLL
int poll_owner;
diff --git a/net/core/dev.c b/net/core/dev.c
index 89825c1eccdc..da2d1185eb82 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5161,9 +5161,11 @@ static void __napi_gro_flush_chain(struct napi_struct 
*napi, u32 index,
return;
list_del_init(&skb->list);
napi_gro_complete(skb);
-   napi->gro_count--;
napi->gro_hash[index].count--;
}
+
+   if (!napi->gro_hash[index].count)
+   clear_bit(index, &napi->gro_bitmask);
 }
 
 /* napi->gro_hash[].list contains packets ordered by age.
@@ -5174,8 +5176,10 @@ void napi_gro_flush(struct napi_struct *napi, bool 
flush_old)
 {
u32 i;
 
-   for (i = 0; i < GRO_HASH_BUCKETS; i++)
-   __napi_gro_flush_chain(napi, i, flush_old);
+   for (i = 0; i < GRO_HASH_BUCKETS; i++) {
+   if (test_bit(i, &napi->gro_bitmask))
+   __napi_gro_flush_chain(napi, i, flush_old);
+   }
 }
 EXPORT_SYMBOL(napi_gro_flush);
 
@@ -5267,8 +5271,8 @@ static void gro_flush_oldest(struct list_head *head)
if (WARN_ON_ONCE(!oldest))
return;
 
-   /* Do not adjust napi->gro_count, caller is adding a new SKB to
-* the chain.
+   /* Do not adjust napi->gro_hash[].count, caller is adding a new
+* SKB to the chain.
 */
list_del(&oldest->list);
napi_gro_complete(oldest);
@@ -5342,7 +5346,6 @@ static enum gro_result dev_gro_receive(struct napi_struct 
*napi, struct sk_buff
if (pp) {
list_del_init(&pp->list);
napi_gro_complete(pp);
-   napi->gro_count--;
napi->gro_hash[hash].count--;
}
 
@@ -5355,7 +5358,6 @@ static enum gro_result dev_gro_receive(struct napi_struct 
*napi, struct sk_buff
if (unlikely(napi->gro_hash[hash].count >= MAX_GRO_SKBS)) {
gro_flush_oldest(gro_head);
} else {
-   napi->gro_count++;
napi->gro_hash[hash].count++;
}
NAPI_GRO_CB(skb)->count = 1;
@@ -5370,6 +5372,13 @@ static enum gro_result dev_gro_receive(struct 
napi_struct *napi, struct sk_buff
if (grow > 0)
gro_pull_from_frag0(skb, grow);
 ok:
+
+   if (napi->gro_hash[hash].count)
+   if (!test_bit(hash, &napi->gro_bitmask))
+   set_bit(hash, &napi->gro_bitmask);
+   else if (test_bit(hash, &napi->gro_bitmask))
+   clear_bit(hash, &napi->gro_bitmask);
+
return ret;
 
 normal:
@@ -5768,7 +5777,7 @@ bool napi_complete_done(struct napi_struct *n, int 
work_done)
 NAPIF_STATE_IN_BUSY_POLL)))
return false;
 
-   if (n->gro_count) {
+   if (n->gro_bitmask) {
unsigned long timeout = 0;
 
if (work_done)
@@ -5977,7 +5

[PATCH][net-next] net: replace num_possible_cpus with nr_cpu_ids

2018-07-06 Thread Li RongQing
The return value of num_possible_cpus() is normally the same as
nr_cpu_ids, but using nr_cpu_ids avoids computing the cpumask weight
on every call

Signed-off-by: Li RongQing 
---
 net/core/dev.c | 4 ++--
 net/ipv4/inet_hashtables.c | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 89825c1eccdc..05c7bc6e4ce6 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2189,7 +2189,7 @@ static void netif_reset_xps_queues(struct net_device 
*dev, u16 offset,
if (!dev_maps)
goto out_no_maps;
 
-   if (num_possible_cpus() > 1)
+   if (nr_cpu_ids > 1)
possible_mask = cpumask_bits(cpu_possible_mask);
nr_ids = nr_cpu_ids;
clean_xps_maps(dev, possible_mask, dev_maps, nr_ids, offset, count,
@@ -2273,7 +2273,7 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
nr_ids = dev->num_rx_queues;
} else {
maps_sz = XPS_CPU_DEV_MAPS_SIZE(num_tc);
-   if (num_possible_cpus() > 1) {
+   if (nr_cpu_ids > 1) {
online_mask = cpumask_bits(cpu_online_mask);
possible_mask = cpumask_bits(cpu_possible_mask);
}
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 3647167c8fa3..80cadf06fd3f 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -825,7 +825,7 @@ int inet_ehash_locks_alloc(struct inet_hashinfo *hashinfo)
if (locksz != 0) {
/* allocate 2 cache lines or at least one spinlock per cpu */
nblocks = max(2U * L1_CACHE_BYTES / locksz, 1U);
-   nblocks = roundup_pow_of_two(nblocks * num_possible_cpus());
+   nblocks = roundup_pow_of_two(nblocks * nr_cpu_ids);
 
/* no more locks than number of hash buckets */
nblocks = min(nblocks, hashinfo->ehash_mask + 1);
-- 
2.16.2



[PATCH][net-next][v2] net: limit each hash list length to MAX_GRO_SKBS

2018-07-05 Thread Li RongQing
After commit 07d78363dcff ("net: Convert NAPI gro list into a small hash
table.") there are 8 hash buckets, which allows more flows to be held for
merging.  But MAX_GRO_SKBS, the total number of skbs held for merging, is
still 8, which limits the hash table performance.

keep MAX_GRO_SKBS as 8 skbs, but limit each hash list length to 8 skbs,
instead of limiting the total to 8
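
(With GRO_HASH_BUCKETS = 8 and a per-bucket limit of 8, the table can
now hold up to 8 * 8 = 64 skbs in total, while any single flow is still
flushed once its own bucket reaches 8; this is a consequence of the
change, not a measured number.)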

Signed-off-by: Li RongQing 
---
 include/linux/netdevice.h |  7 +-
 net/core/dev.c| 56 +++
 2 files changed, 29 insertions(+), 34 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 8bf8d6149f79..3b60ac51ddba 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -302,6 +302,11 @@ struct netdev_boot_setup {
 
 int __init netdev_boot_setup(char *str);
 
+struct gro_list {
+   struct list_headlist;
+   int count;
+};
+
 /*
  * Structure for NAPI scheduling similar to tasklet but with weighting
  */
@@ -323,7 +328,7 @@ struct napi_struct {
int poll_owner;
 #endif
struct net_device   *dev;
-   struct list_headgro_hash[GRO_HASH_BUCKETS];
+   struct gro_list gro_hash[GRO_HASH_BUCKETS];
struct sk_buff  *skb;
struct hrtimer  timer;
struct list_headdev_list;
diff --git a/net/core/dev.c b/net/core/dev.c
index 08d58e0debe5..38c58e32f5bc 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -149,7 +149,6 @@
 
 #include "net-sysfs.h"
 
-/* Instead of increasing this, you should create a hash table. */
 #define MAX_GRO_SKBS 8
 
 /* This should be increased if a protocol with a bigger head is added. */
@@ -4989,9 +4988,10 @@ static int napi_gro_complete(struct sk_buff *skb)
return netif_receive_skb_internal(skb);
 }
 
-static void __napi_gro_flush_chain(struct napi_struct *napi, struct list_head 
*head,
+static void __napi_gro_flush_chain(struct napi_struct *napi, u32 index,
   bool flush_old)
 {
+   struct list_head *head = &napi->gro_hash[index].list;
struct sk_buff *skb, *p;
 
list_for_each_entry_safe_reverse(skb, p, head, list) {
@@ -5000,22 +5000,20 @@ static void __napi_gro_flush_chain(struct napi_struct 
*napi, struct list_head *h
list_del_init(&skb->list);
napi_gro_complete(skb);
napi->gro_count--;
+   napi->gro_hash[index].count--;
}
 }
 
-/* napi->gro_hash contains packets ordered by age.
+/* napi->gro_hash[].list contains packets ordered by age.
  * youngest packets at the head of it.
  * Complete skbs in reverse order to reduce latencies.
  */
 void napi_gro_flush(struct napi_struct *napi, bool flush_old)
 {
-   int i;
-
-   for (i = 0; i < GRO_HASH_BUCKETS; i++) {
-   struct list_head *head = &napi->gro_hash[i];
+   u32 i;
 
-   __napi_gro_flush_chain(napi, head, flush_old);
-   }
+   for (i = 0; i < GRO_HASH_BUCKETS; i++)
+   __napi_gro_flush_chain(napi, i, flush_old);
 }
 EXPORT_SYMBOL(napi_gro_flush);
 
@@ -5027,7 +5025,7 @@ static struct list_head *gro_list_prepare(struct 
napi_struct *napi,
struct list_head *head;
struct sk_buff *p;
 
-   head = &napi->gro_hash[hash & (GRO_HASH_BUCKETS - 1)];
+   head = &napi->gro_hash[hash & (GRO_HASH_BUCKETS - 1)].list;
list_for_each_entry(p, head, list) {
unsigned long diffs;
 
@@ -5095,27 +5093,13 @@ static void gro_pull_from_frag0(struct sk_buff *skb, 
int grow)
}
 }
 
-static void gro_flush_oldest(struct napi_struct *napi)
+static void gro_flush_oldest(struct list_head *head)
 {
-   struct sk_buff *oldest = NULL;
-   unsigned long age = jiffies;
-   int i;
-
-   for (i = 0; i < GRO_HASH_BUCKETS; i++) {
-   struct list_head *head = &napi->gro_hash[i];
-   struct sk_buff *skb;
-
-   if (list_empty(head))
-   continue;
+   struct sk_buff *oldest;
 
-   skb = list_last_entry(head, struct sk_buff, list);
-   if (!oldest || time_before(NAPI_GRO_CB(skb)->age, age)) {
-   oldest = skb;
-   age = NAPI_GRO_CB(skb)->age;
-   }
-   }
+   oldest = list_last_entry(head, struct sk_buff, list);
 
-   /* We are called with napi->gro_count >= MAX_GRO_SKBS, so this is
+   /* We are called with head length >= MAX_GRO_SKBS, so this is
 * impossible.
 */
if (WARN_ON_ONCE(!oldest))
@@ -5130,6 +5114,7 @@ static void gro_flush_oldest(struct napi_struct *napi)
 
 static enum gro_result dev_gro_receive(struct napi_struct *napi, struct 
sk_buff *skb)
 {
+   u32 hash = skb_get_hash_raw(skb) & (GRO_HASH_BUCKETS - 1);
struct list_head *head = &offload_base;
struct packet_offload *ptype;
__be16 typ

[PATCH][net-next] net: limit each hash list length to MAX_GRO_SKBS

2018-07-04 Thread Li RongQing
After commit 07d78363dcff ("net: Convert NAPI gro list into a small hash
table.") there are 8 hash buckets, which allows more flows to be held for
merging.  But MAX_GRO_SKBS, the total number of skbs held for merging, is
still 8, which limits the hash table performance.

keep MAX_GRO_SKBS as 8 skbs, but limit each hash list length to 8 skbs,
instead of limiting the total to 8

Signed-off-by: Li RongQing 
---
 include/linux/netdevice.h |  7 +-
 net/core/dev.c| 54 +++
 2 files changed, 28 insertions(+), 33 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 8bf8d6149f79..3b60ac51ddba 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -302,6 +302,11 @@ struct netdev_boot_setup {
 
 int __init netdev_boot_setup(char *str);
 
+struct gro_list {
+   struct list_headlist;
+   int count;
+};
+
 /*
  * Structure for NAPI scheduling similar to tasklet but with weighting
  */
@@ -323,7 +328,7 @@ struct napi_struct {
int poll_owner;
 #endif
struct net_device   *dev;
-   struct list_headgro_hash[GRO_HASH_BUCKETS];
+   struct gro_list gro_hash[GRO_HASH_BUCKETS];
struct sk_buff  *skb;
struct hrtimer  timer;
struct list_headdev_list;
diff --git a/net/core/dev.c b/net/core/dev.c
index 08d58e0debe5..f8cdc27ee276 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -149,7 +149,6 @@
 
 #include "net-sysfs.h"
 
-/* Instead of increasing this, you should create a hash table. */
 #define MAX_GRO_SKBS 8
 
 /* This should be increased if a protocol with a bigger head is added. */
@@ -4989,10 +4988,11 @@ static int napi_gro_complete(struct sk_buff *skb)
return netif_receive_skb_internal(skb);
 }
 
-static void __napi_gro_flush_chain(struct napi_struct *napi, struct list_head 
*head,
+static void __napi_gro_flush_chain(struct napi_struct *napi, int index,
   bool flush_old)
 {
struct sk_buff *skb, *p;
+   struct list_head *head = &napi->gro_hash[index].list;
 
list_for_each_entry_safe_reverse(skb, p, head, list) {
if (flush_old && NAPI_GRO_CB(skb)->age == jiffies)
@@ -5000,10 +5000,11 @@ static void __napi_gro_flush_chain(struct napi_struct 
*napi, struct list_head *h
list_del_init(&skb->list);
napi_gro_complete(skb);
napi->gro_count--;
+   napi->gro_hash[index].count--;
}
 }
 
-/* napi->gro_hash contains packets ordered by age.
+/* napi->gro_hash[].list contains packets ordered by age.
  * youngest packets at the head of it.
  * Complete skbs in reverse order to reduce latencies.
  */
@@ -5011,11 +5012,8 @@ void napi_gro_flush(struct napi_struct *napi, bool 
flush_old)
 {
int i;
 
-   for (i = 0; i < GRO_HASH_BUCKETS; i++) {
-   struct list_head *head = &napi->gro_hash[i];
-
-   __napi_gro_flush_chain(napi, head, flush_old);
-   }
+   for (i = 0; i < GRO_HASH_BUCKETS; i++)
+   __napi_gro_flush_chain(napi, i, flush_old);
 }
 EXPORT_SYMBOL(napi_gro_flush);
 
@@ -5027,7 +5025,7 @@ static struct list_head *gro_list_prepare(struct 
napi_struct *napi,
struct list_head *head;
struct sk_buff *p;
 
-   head = &napi->gro_hash[hash & (GRO_HASH_BUCKETS - 1)];
+   head = &napi->gro_hash[hash & (GRO_HASH_BUCKETS - 1)].list;
list_for_each_entry(p, head, list) {
unsigned long diffs;
 
@@ -5095,27 +5093,13 @@ static void gro_pull_from_frag0(struct sk_buff *skb, 
int grow)
}
 }
 
-static void gro_flush_oldest(struct napi_struct *napi)
+static void gro_flush_oldest(struct list_head *head)
 {
-   struct sk_buff *oldest = NULL;
-   unsigned long age = jiffies;
-   int i;
+   struct sk_buff *oldest;
 
-   for (i = 0; i < GRO_HASH_BUCKETS; i++) {
-   struct list_head *head = &napi->gro_hash[i];
-   struct sk_buff *skb;
+   oldest = list_last_entry(head, struct sk_buff, list);
 
-   if (list_empty(head))
-   continue;
-
-   skb = list_last_entry(head, struct sk_buff, list);
-   if (!oldest || time_before(NAPI_GRO_CB(skb)->age, age)) {
-   oldest = skb;
-   age = NAPI_GRO_CB(skb)->age;
-   }
-   }
-
-   /* We are called with napi->gro_count >= MAX_GRO_SKBS, so this is
+   /* We are called with head length >= MAX_GRO_SKBS, so this is
 * impossible.
 */
if (WARN_ON_ONCE(!oldest))
@@ -5138,6 +5122,7 @@ static enum gro_result dev_gro_receive(struct napi_struct 
*napi, struct sk_buff
enum gro_result ret;
int same_flow;
int grow;
+   u32 hash = skb_get_hash_raw(skb) & (GRO_HASH_BUCKETS - 1);
 

Re: [PATCH][net-next] net: increase MAX_GRO_SKBS to 64

2018-07-03 Thread Li RongQing
On 7/2/18, David Miller  wrote:
> From: Li RongQing 
> Date: Mon,  2 Jul 2018 19:41:43 +0800
>
>> After 07d78363dcffd [net: Convert NAPI gro list into a small hash table]
>> there is 8 hash buckets, which allow more flows to be held for merging.
>>
>> keep each as original list length, so increase MAX_GRO_SKBS to 64
>>
>> Signed-off-by: Li RongQing 
>
> I would like to hear some feedback from Eric, 64 might be too big.
>
How about the below change?

commit 6270b973a973b2944fedb4b5f9926ed3e379d0c2 (HEAD -> master)
Author: Li RongQing 
Date:   Mon Jul 2 19:08:37 2018 +0800

net: limit each hash list length to MAX_GRO_SKBS

After 07d78363dcffd [net: Convert NAPI gro list into a small hash table]
there are 8 hash buckets, which allows more flows to be held for merging.
But MAX_GRO_SKBS, the total number of skbs held for merging, is still 8,
which limits the hash table performance.

keep MAX_GRO_SKBS as 8 skbs, but limit each hash list length to 8 skbs,
instead of limiting the total to 8

Signed-off-by: Li RongQing 

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 8bf8d6149f79..09d7764a8917 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -324,6 +324,7 @@ struct napi_struct {
 #endif
struct net_device   *dev;
struct list_headgro_hash[GRO_HASH_BUCKETS];
+   int list_len[GRO_HASH_BUCKETS];
struct sk_buff  *skb;
struct hrtimer  timer;
struct list_headdev_list;
diff --git a/net/core/dev.c b/net/core/dev.c
index 08d58e0debe5..3cf3c6676cb3 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -149,7 +149,6 @@

 #include "net-sysfs.h"

-/* Instead of increasing this, you should create a hash table. */
 #define MAX_GRO_SKBS 8

 /* This should be increased if a protocol with a bigger head is added. */
@@ -4989,10 +4988,11 @@ static int napi_gro_complete(struct sk_buff *skb)
return netif_receive_skb_internal(skb);
 }

-static void __napi_gro_flush_chain(struct napi_struct *napi, struct
list_head *head,
+static void __napi_gro_flush_chain(struct napi_struct *napi, int index,
   bool flush_old)
 {
struct sk_buff *skb, *p;
+   struct list_head *head = &napi->gro_hash[index];

list_for_each_entry_safe_reverse(skb, p, head, list) {
if (flush_old && NAPI_GRO_CB(skb)->age == jiffies)
@@ -5000,6 +5000,7 @@ static void __napi_gro_flush_chain(struct
napi_struct *napi, struct list_head *h
list_del_init(&skb->list);
napi_gro_complete(skb);
napi->gro_count--;
+   napi->list_len[index]--;
}
 }

@@ -5011,11 +5012,8 @@ void napi_gro_flush(struct napi_struct *napi,
bool flush_old)
 {
int i;

-   for (i = 0; i < GRO_HASH_BUCKETS; i++) {
-   struct list_head *head = &napi->gro_hash[i];
-
-   __napi_gro_flush_chain(napi, head, flush_old);
-   }
+   for (i = 0; i < GRO_HASH_BUCKETS; i++)
+   __napi_gro_flush_chain(napi, i, flush_old);
 }
 EXPORT_SYMBOL(napi_gro_flush);

@@ -5095,27 +5093,13 @@ static void gro_pull_from_frag0(struct sk_buff
*skb, int grow)
}
 }

-static void gro_flush_oldest(struct napi_struct *napi)
+static void gro_flush_oldest(struct list_head *head)
 {
struct sk_buff *oldest = NULL;
-   unsigned long age = jiffies;
-   int i;

-   for (i = 0; i < GRO_HASH_BUCKETS; i++) {
-   struct list_head *head = &napi->gro_hash[i];
-   struct sk_buff *skb;
+   oldest = list_last_entry(head, struct sk_buff, list);

-   if (list_empty(head))
-   continue;
-
-   skb = list_last_entry(head, struct sk_buff, list);
-   if (!oldest || time_before(NAPI_GRO_CB(skb)->age, age)) {
-   oldest = skb;
-   age = NAPI_GRO_CB(skb)->age;
-   }
-   }
-
-   /* We are called with napi->gro_count >= MAX_GRO_SKBS, so this is
+   /* We are called with head length >= MAX_GRO_SKBS, so this is
 * impossible.
 */
if (WARN_ON_ONCE(!oldest))
@@ -5138,6 +5122,7 @@ static enum gro_result dev_gro_receive(struct
napi_struct *napi, struct sk_buff
enum gro_result ret;
int same_flow;
int grow;
+   u32 hash = skb_get_hash_raw(skb) & (GRO_HASH_BUCKETS - 1);

if (netif_elide_gro(skb->dev))
goto normal;
@@ -5196,6 +5181,7 @@ static enum gro_result dev_gro_receive(struct
napi_struct *napi, struct sk_buff
list_del_init(&pp->list);
napi_gro_complete(pp);
napi->gro_count--;
+   napi->list_len[hash]--;
}

if (same_flow)
@@ -5204,10 +5190,11 @@ static enum gro_result dev_gro_receive(struct
napi_struct *napi

答复: [PATCH][net-next] net: increase MAX_GRO_SKBS to 64

2018-07-02 Thread Li,Rongqing

From: David Miller [da...@davemloft.net]
Sent: July 2, 2018 19:44
To: Li,Rongqing
Cc: netdev@vger.kernel.org; eric.duma...@gmail.com
Subject: Re: [PATCH][net-next] net: increase MAX_GRO_SKBS to 64

From: Li RongQing 
Date: Mon,  2 Jul 2018 19:41:43 +0800

>>  After 07d78363dcffd [net: Convert NAPI gro list into a small hash table]
> > there is 8 hash buckets, which allow more flows to be held for merging.
> >
> > keep each as original list length, so increase MAX_GRO_SKBS to 64
> >
> > Signed-off-by: Li RongQing 

> I would like to hear some feedback from Eric, 64 might be too big.


I think we should limit each list length to 8 skbs, instead of this change.

If there is only one flow, changing MAX_GRO_SKBS to 64 may generate a
large delay.
If we keep the total at 8 skbs, then for multiple flows the hash table may
be unable to improve the performance.

-RongQing

[PATCH][net-next] net: increase MAX_GRO_SKBS to 64

2018-07-02 Thread Li RongQing
After 07d78363dcffd [net: Convert NAPI gro list into a small hash table]
there are 8 hash buckets, which allow more flows to be held for merging.

keep each list at the original length, so increase MAX_GRO_SKBS to 64

Signed-off-by: Li RongQing 
---
 net/core/dev.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 08d58e0debe5..ac315e41d5e7 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -149,8 +149,7 @@
 
 #include "net-sysfs.h"
 
-/* Instead of increasing this, you should create a hash table. */
-#define MAX_GRO_SKBS 8
+#define MAX_GRO_SKBS 64
 
 /* This should be increased if a protocol with a bigger head is added. */
 #define GRO_MAX_HEAD (MAX_HEADER + 128)
-- 
2.16.2



[PATCH] net: propagate dev_get_valid_name return code

2018-06-19 Thread Li RongQing
if dev_get_valid_name fails, propagate its return code

and remove the setting of err to -ENODEV; it will be set to
0 again before dev_change_net_namespace exits.

Signed-off-by: Li RongQing 
---
 net/core/dev.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 1844d9bc5714..1c7a3761ec3c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -8661,7 +8661,8 @@ int dev_change_net_namespace(struct net_device *dev, 
struct net *net, const char
/* We get here if we can't use the current device name */
if (!pat)
goto out;
-   if (dev_get_valid_name(net, dev, pat) < 0)
+   err = dev_get_valid_name(net, dev, pat);
+   if (err < 0)
goto out;
}
 
@@ -8673,7 +8674,6 @@ int dev_change_net_namespace(struct net_device *dev, 
struct net *net, const char
dev_close(dev);
 
/* And unlink it from device chain */
-   err = -ENODEV;
unlist_netdevice(dev);
 
synchronize_net();
-- 
2.16.2



[PATCH][v2] xfrm: replace NR_CPU with nr_cpu_ids

2018-06-19 Thread Li RongQing
The default NR_CPUS can be very large, but the actual possible
nr_cpu_ids is usually very small. For some x86 distributions NR_CPUS
is 8192 while nr_cpu_ids is 4, so replace NR_CPUS to save some memory
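
(Rough savings, assuming struct work_struct is about 32 bytes without
debug options: 8192 * 32 bytes = 256 KB for the NR_CPUS-sized array
versus 4 * 32 bytes = 128 bytes with nr_cpu_ids = 4; the exact size of
work_struct depends on the config, so these numbers are illustrative.)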

Signed-off-by: Li RongQing 
Signed-off-by: Wang Li 
---
 net/xfrm/xfrm_policy.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 40b54cc64243..f8188685c1e9 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -2989,11 +2989,11 @@ void __init xfrm_init(void)
 {
int i;
 
-   xfrm_pcpu_work = kmalloc_array(NR_CPUS, sizeof(*xfrm_pcpu_work),
+   xfrm_pcpu_work = kmalloc_array(nr_cpu_ids, sizeof(*xfrm_pcpu_work),
   GFP_KERNEL);
BUG_ON(!xfrm_pcpu_work);
 
-   for (i = 0; i < NR_CPUS; i++)
+   for (i = 0; i < nr_cpu_ids; i++)
INIT_WORK(&xfrm_pcpu_work[i], xfrm_pcpu_work_fn);
 
register_pernet_subsys(&xfrm_net_ops);
-- 
2.16.2



Re: [PATCH][ipsec] xfrm: replace NR_CPU with num_possible_cpus()

2018-06-18 Thread Li RongQing
Sorry, please drop this patch.
I should replace NR_CPUS with nr_cpu_ids; I will resend it.

-R

On 6/15/18, Li RongQing  wrote:
> The default NR_CPUS can be very large, but actual possible nr_cpu_ids
> usually is very small. For some x86 distribution, the NR_CPUS is 8192
> and nr_cpu_ids is 4.
>
> when xfrm_init is running, num_possible_cpus() should work
>
> Signed-off-by: Li RongQing 
> Signed-off-by: Wang Li 
> ---
>  net/xfrm/xfrm_policy.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
> index 40b54cc64243..cbb862463cbd 100644
> --- a/net/xfrm/xfrm_policy.c
> +++ b/net/xfrm/xfrm_policy.c
> @@ -2988,12 +2988,13 @@ static struct pernet_operations __net_initdata
> xfrm_net_ops = {
>  void __init xfrm_init(void)
>  {
>   int i;
> + unsigned int nr_cpus = num_possible_cpus();
>
> - xfrm_pcpu_work = kmalloc_array(NR_CPUS, sizeof(*xfrm_pcpu_work),
> + xfrm_pcpu_work = kmalloc_array(nr_cpus, sizeof(*xfrm_pcpu_work),
>  GFP_KERNEL);
>   BUG_ON(!xfrm_pcpu_work);
>
> - for (i = 0; i < NR_CPUS; i++)
> + for (i = 0; i < nr_cpus; i++)
>   INIT_WORK(&xfrm_pcpu_work[i], xfrm_pcpu_work_fn);
>
>   register_pernet_subsys(&xfrm_net_ops);
> --
> 2.16.2
>
>


[PATCH][ipsec] xfrm: replace NR_CPU with num_possible_cpus()

2018-06-15 Thread Li RongQing
The default NR_CPUS can be very large, but the actual possible
nr_cpu_ids is usually very small. For some x86 distributions NR_CPUS
is 8192 while nr_cpu_ids is 4.

when xfrm_init is running, num_possible_cpus() should work

Signed-off-by: Li RongQing 
Signed-off-by: Wang Li 
---
 net/xfrm/xfrm_policy.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 40b54cc64243..cbb862463cbd 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -2988,12 +2988,13 @@ static struct pernet_operations __net_initdata 
xfrm_net_ops = {
 void __init xfrm_init(void)
 {
int i;
+   unsigned int nr_cpus = num_possible_cpus();
 
-   xfrm_pcpu_work = kmalloc_array(NR_CPUS, sizeof(*xfrm_pcpu_work),
+   xfrm_pcpu_work = kmalloc_array(nr_cpus, sizeof(*xfrm_pcpu_work),
   GFP_KERNEL);
BUG_ON(!xfrm_pcpu_work);
 
-   for (i = 0; i < NR_CPUS; i++)
+   for (i = 0; i < nr_cpus; i++)
INIT_WORK(&xfrm_pcpu_work[i], xfrm_pcpu_work_fn);
 
register_pernet_subsys(&xfrm_net_ops);
-- 
2.16.2



[net-next][PATCH] tcp: probe timer MUST NOT be less than 5 minutes for tcp PMTU

2018-06-01 Thread Li RongQing
RFC 4821 says: The value for this timer MUST NOT be less than
5 minutes and is recommended to be 10 minutes, per RFC 1981.
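
As a usage note (not part of the patch): with proc_dointvec_minmax and
the 300-second floor below, writing a value below 300 to
/proc/sys/net/ipv4/tcp_probe_interval should now be rejected with
EINVAL, while values of 300 and above are still accepted.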

Signed-off-by: Li RongQing 
---
 net/ipv4/sysctl_net_ipv4.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index d2eed3ddcb0a..ed8952bb6874 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -47,6 +47,7 @@ static int tcp_syn_retries_max = MAX_TCP_SYNCNT;
 static int ip_ping_group_range_min[] = { 0, 0 };
 static int ip_ping_group_range_max[] = { GID_T_MAX, GID_T_MAX };
 static int comp_sack_nr_max = 255;
+static int tcp_probe_interval_min = 300;
 
 /* obsolete */
 static int sysctl_tcp_low_latency __read_mostly;
@@ -711,7 +712,8 @@ static struct ctl_table ipv4_net_table[] = {
.data   = &init_net.ipv4.sysctl_tcp_probe_interval,
.maxlen = sizeof(int),
.mode   = 0644,
-   .proc_handler   = proc_dointvec,
+   .proc_handler   = proc_dointvec_minmax,
+   .extra1 = &tcp_probe_interval_min,
},
{
.procname   = "igmp_link_local_mcast_reports",
-- 
2.16.2



[net-next][PATCH] inet: Use switch case instead of multiple condition checks

2018-05-09 Thread Li RongQing
inet_csk_reset_xmit_timer uses multiple equality condition checks,
so it is better to use a switch case instead

after this patch, the increase in image size is acceptable

                                Before      After
size of net/ipv4/tcp_output.o:  721640      721648
size of vmlinux:                400236400   400236401

Signed-off-by: Li RongQing <lirongq...@baidu.com>
---
 include/net/inet_connection_sock.h | 23 ---
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/include/net/inet_connection_sock.h 
b/include/net/inet_connection_sock.h
index 2ab6667275df..d2e9314cf43d 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -239,22 +239,31 @@ static inline void inet_csk_reset_xmit_timer(struct sock 
*sk, const int what,
when = max_when;
}
 
-   if (what == ICSK_TIME_RETRANS || what == ICSK_TIME_PROBE0 ||
-   what == ICSK_TIME_EARLY_RETRANS || what == ICSK_TIME_LOSS_PROBE ||
-   what == ICSK_TIME_REO_TIMEOUT) {
+   switch (what) {
+   case ICSK_TIME_RETRANS:
+   /* fall through */
+   case ICSK_TIME_PROBE0:
+   /* fall through */
+   case ICSK_TIME_EARLY_RETRANS:
+   /* fall through */
+   case ICSK_TIME_LOSS_PROBE:
+   /* fall through */
+   case ICSK_TIME_REO_TIMEOUT:
icsk->icsk_pending = what;
icsk->icsk_timeout = jiffies + when;
		sk_reset_timer(sk, &icsk->icsk_retransmit_timer, 
icsk->icsk_timeout);
-   } else if (what == ICSK_TIME_DACK) {
+   break;
+   case ICSK_TIME_DACK:
icsk->icsk_ack.pending |= ICSK_ACK_TIMER;
icsk->icsk_ack.timeout = jiffies + when;
		sk_reset_timer(sk, &icsk->icsk_delack_timer, 
icsk->icsk_ack.timeout);
-   }
+   break;
 #ifdef INET_CSK_DEBUG
-   else {
+   default:
pr_debug("%s", inet_csk_timer_bug_msg);
-   }
+   break;
 #endif
+   }
 }
 
 static inline unsigned long
-- 
2.16.2



Re: [PATCH] net: net_cls: remove a NULL check for css_cls_state

2018-04-25 Thread Li RongQing
On 4/20/18, David Miller <da...@davemloft.net> wrote:
> From: Li RongQing <lirongq...@baidu.com>
> Date: Thu, 19 Apr 2018 12:59:21 +0800
>
>> The input of css_cls_state() is impossible to NULL except
>> cgrp_css_online, so simplify it
>>
>> Signed-off-by: Li RongQing <lirongq...@baidu.com>
>
> I don't view this as an improvement.  Just let the helper always check
> NULL and that way there are less situations to audit.
>
css_cls_state() may return NULL, but almost none of its callers check
the return value for NULL, which makes the check look inconsistent.

net/core/netclassid_cgroup.c:27:return
css_cls_state(task_css_check(p, net_cls_cgrp_id,
net/core/netclassid_cgroup.c:46:struct cgroup_cls_state *cs =
css_cls_state(css);
net/core/netclassid_cgroup.c:47:struct cgroup_cls_state
*parent = css_cls_state(css->parent);
net/core/netclassid_cgroup.c:57:kfree(css_cls_state(css));
net/core/netclassid_cgroup.c:82:   (void
*)(unsigned long)css_cls_state(css)->classid);
net/core/netclassid_cgroup.c:89:return css_cls_state(css)->classid;
net/core/netclassid_cgroup.c:95:struct cgroup_cls_state *cs =
css_cls_state(css);

> And it's not like this is a critical fast path either.
>

I see that css_cls_state() is called when sending packets if
CONFIG_NET_CLS_ACT and CONFIG_NET_EGRESS are enabled; the call stack is
as below:

css_cls_state
  task_cls_state
task_get_classid
   cls_cgroup_classify
  tcf_classify
sch_handle_egress
   __dev_queue_xmit
CONFIG_NET_CLS_ACT
 CONFIG_NET_EGRESS

-RongQing






> I'm not applying this, sorry.
>


Re: [PATCH][net-next] net: ip tos cgroup

2018-04-18 Thread Li,Rongqing


> -----Original Message-----
> From: Daniel Borkmann [mailto:dan...@iogearbox.net]
> Sent: April 17, 2018 22:11
> To: Li,Rongqing <lirongq...@baidu.com>
> Cc: netdev@vger.kernel.org; t...@kernel.org; a...@fb.com;
> bra...@fb.com
> Subject: Re: [PATCH][net-next] net: ip tos cgroup
> 
> On 04/17/2018 05:36 AM, Li RongQing wrote:
> > ip tos segment can be changed by setsockopt(IP_TOS), or by iptables;
> > this patch creates a new method to change socket tos segment of
> > processes based on cgroup
> >
> > The usage:
> >
> > 1. mount ip_tos cgroup, and setting tos value
> > mount -t cgroup -o ip_tos ip_tos /cgroups/tos
> > echo tos_value >/cgroups/tos/ip_tos.tos
> > 2. then move processes to cgroup, or create processes in cgroup
> >
> > Signed-off-by: jimyan <jim...@baidu.com>
> > Signed-off-by: Li RongQing <lirongq...@baidu.com>
> 
> This functionality is already possible through the help of BPF programs
> attached to cgroups, have you had a chance to look into that?
> 

I think this method is easier to use than BPF, and more efficient


-RongQing 




[PATCH] net: net_cls: remove a NULL check for css_cls_state

2018-04-18 Thread Li RongQing
The input of css_cls_state() is impossible to NULL except
cgrp_css_online, so simplify it

Signed-off-by: Li RongQing <lirongq...@baidu.com>
---
 net/core/netclassid_cgroup.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/net/core/netclassid_cgroup.c b/net/core/netclassid_cgroup.c
index 5e4f04004a49..ee087cf793c2 100644
--- a/net/core/netclassid_cgroup.c
+++ b/net/core/netclassid_cgroup.c
@@ -19,7 +19,7 @@
 
 static inline struct cgroup_cls_state *css_cls_state(struct 
cgroup_subsys_state *css)
 {
-   return css ? container_of(css, struct cgroup_cls_state, css) : NULL;
+   return container_of(css, struct cgroup_cls_state, css);
 }
 
 struct cgroup_cls_state *task_cls_state(struct task_struct *p)
@@ -44,10 +44,9 @@ cgrp_css_alloc(struct cgroup_subsys_state *parent_css)
 static int cgrp_css_online(struct cgroup_subsys_state *css)
 {
struct cgroup_cls_state *cs = css_cls_state(css);
-   struct cgroup_cls_state *parent = css_cls_state(css->parent);
 
-   if (parent)
-   cs->classid = parent->classid;
+   if (css->parent)
+   cs->classid = css_cls_state(css->parent)->classid;
 
return 0;
 }
-- 
2.11.0



[PATCH][net-next] net: ip tos cgroup

2018-04-16 Thread Li RongQing
The IP TOS field can be changed with setsockopt(IP_TOS) or by iptables;
this patch adds a new method to set the socket TOS field for all
processes in a cgroup.

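For comparison, the existing per-socket path mentioned above is a plain
setsockopt(); a minimal userspace sketch (IPTOS_LOWDELAY is only an
example value):

#include <netinet/in.h>
#include <netinet/ip.h>
#include <sys/socket.h>

/* Set the TOS byte on one already-created IPv4 socket. */
static int set_tos(int fd)
{
	int tos = IPTOS_LOWDELAY;	/* 0x10, example value */

	return setsockopt(fd, IPPROTO_IP, IP_TOS, &tos, sizeof(tos));
}
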
The usage:

1. mount the ip_tos cgroup and set the tos value:
mount -t cgroup -o ip_tos ip_tos /cgroups/tos
echo tos_value >/cgroups/tos/ip_tos.tos
2. then move processes into the cgroup, or create new processes in it

Signed-off-by: jimyan <jim...@baidu.com>
Signed-off-by: Li RongQing <lirongq...@baidu.com>
---
 include/linux/cgroup_subsys.h |   4 ++
 include/net/tos_cgroup.h  |  35 
 net/ipv4/Kconfig  |  10 
 net/ipv4/Makefile |   1 +
 net/ipv4/af_inet.c|   2 +
 net/ipv4/tos_cgroup.c | 128 ++
 net/ipv6/af_inet6.c   |   2 +
 7 files changed, 182 insertions(+)
 create mode 100644 include/net/tos_cgroup.h
 create mode 100644 net/ipv4/tos_cgroup.c

diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index acb77dcff3b4..1b86eda1c23e 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -61,6 +61,10 @@ SUBSYS(pids)
 SUBSYS(rdma)
 #endif
 
+#if IS_ENABLED(CONFIG_IP_TOS_CGROUP)
+SUBSYS(ip_tos)
+#endif
+
 /*
  * The following subsystems are not supported on the default hierarchy.
  */
diff --git a/include/net/tos_cgroup.h b/include/net/tos_cgroup.h
new file mode 100644
index ..0868e921faf3
--- /dev/null
+++ b/include/net/tos_cgroup.h
@@ -0,0 +1,35 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _IP_TOS_CGROUP_H
+#define _IP_TOS_CGROUP_H
+
+#include 
+#include 
+
+struct tos_cgroup_state {
+   struct cgroup_subsys_state css;
+   u32 tos;
+};
+
+#if IS_ENABLED(CONFIG_IP_TOS_CGROUP)
+static inline u32 task_ip_tos(struct task_struct *p)
+{
+   u32 tos;
+
+   if (in_interrupt())
+   return 0;
+
+   rcu_read_lock();
+   tos = container_of(task_css(p, ip_tos_cgrp_id),
+   struct tos_cgroup_state, css)->tos;
+   rcu_read_unlock();
+
+   return tos;
+}
+#else /* !CONFIG_IP_TOS_CGROUP */
+static inline u32 task_ip_tos(struct task_struct *p)
+{
+   return 0;
+}
+#endif /* CONFIG_IP_TOS_CGROUP */
+#endif  /* _IP_TOS_CGROUP_H */
diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
index 80dad301361d..57070bbb0394 100644
--- a/net/ipv4/Kconfig
+++ b/net/ipv4/Kconfig
@@ -753,3 +753,13 @@ config TCP_MD5SIG
  on the Internet.
 
  If unsure, say N.
+
+config IP_TOS_CGROUP
+   bool "ip tos cgroup"
+   depends on CGROUPS
+   default n
+   ---help---
+ Say Y here if you want to set ip packet tos based on the
+ control cgroup of their process.
+
+ This can set ip packet tos
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index a07b7dd06def..12c708142d1f 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -61,6 +61,7 @@ obj-$(CONFIG_TCP_CONG_LP) += tcp_lp.o
 obj-$(CONFIG_TCP_CONG_YEAH) += tcp_yeah.o
 obj-$(CONFIG_TCP_CONG_ILLINOIS) += tcp_illinois.o
 obj-$(CONFIG_NETLABEL) += cipso_ipv4.o
+obj-$(CONFIG_IP_TOS_CGROUP) += tos_cgroup.o
 
 obj-$(CONFIG_XFRM) += xfrm4_policy.o xfrm4_state.o xfrm4_input.o \
  xfrm4_output.o xfrm4_protocol.o
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index eaed0367e669..e2dd902b06dd 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -120,6 +120,7 @@
 #include 
 #endif
 #include 
+#include 
 
 #include 
 
@@ -356,6 +357,7 @@ static int inet_create(struct net *net, struct socket 
*sock, int protocol,
inet->mc_index  = 0;
inet->mc_list   = NULL;
inet->rcv_tos   = 0;
+   inet->tos   = task_ip_tos(current);
 
sk_refcnt_debug_inc(sk);
 
diff --git a/net/ipv4/tos_cgroup.c b/net/ipv4/tos_cgroup.c
new file mode 100644
index ..dbb828f5b464
--- /dev/null
+++ b/net/ipv4/tos_cgroup.c
@@ -0,0 +1,128 @@
+// SPDX-License-Identifier: GPL-2.0
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static inline
+struct tos_cgroup_state *css_tos_cgroup(struct cgroup_subsys_state *css)
+{
+   return css ? container_of(css, struct tos_cgroup_state, css) : NULL;
+}
+
+static inline struct tos_cgroup_state *task_tos_cgroup(struct task_struct 
*task)
+{
+   return css_tos_cgroup(task_css(task, ip_tos_cgrp_id));
+}
+
+static struct cgroup_subsys_state
+*cgrp_css_alloc(struct cgroup_subsys_state *parent_css)
+{
+   struct tos_cgroup_state *cs;
+
+   cs = kzalloc(sizeof(*cs), GFP_KERNEL);
+   if (!cs)
+   return ERR_PTR(-ENOMEM);
+
+   return &cs->css;
+}
+
+static void cgrp_css_free(struct cgroup_subsys_state *css)
+{
+   kfree(css_tos_cgroup(css));
+}
+
+static int update_tos(const void *v, struct file *file, unsigned int n)
+{
+   int err;
+   struct socket *sock = sock_from_file(file, &err);
+   u

[PATCH] net: ip tos cgroup

2018-04-01 Thread Li RongQing
The IP TOS field can be changed with setsockopt(IP_TOS) or by iptables;
this patch adds a new method to set the socket TOS field for all
processes in a cgroup.

The usage:

1. mount tos_cgroup, and setting tos value
mount -t cgroup -o ip_tos ip_tos /cgroups/tos
echo tos-value >/cgroups/tos/ip_tos.tos
2. then move processes to cgroup, or create processes in cgroup

Signed-off-by: jimyan <jim...@baidu.com>
Signed-off-by: Li RongQing <lirongq...@baidu.com>
---
 include/linux/cgroup_subsys.h |   4 ++
 include/net/tos_cgroup.h  |  46 ++
 net/ipv4/Kconfig  |   9 +++
 net/ipv4/Makefile |   1 +
 net/ipv4/af_inet.c|   2 +
 net/ipv4/tos_cgroup.c | 145 ++
 net/ipv6/af_inet6.c   |   2 +
 7 files changed, 209 insertions(+)
 create mode 100644 include/net/tos_cgroup.h
 create mode 100644 net/ipv4/tos_cgroup.c

diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index acb77dcff3b4..1b86eda1c23e 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -61,6 +61,10 @@ SUBSYS(pids)
 SUBSYS(rdma)
 #endif
 
+#if IS_ENABLED(CONFIG_IP_TOS_CGROUP)
+SUBSYS(ip_tos)
+#endif
+
 /*
  * The following subsystems are not supported on the default hierarchy.
  */
diff --git a/include/net/tos_cgroup.h b/include/net/tos_cgroup.h
new file mode 100644
index ..45c33733c4e9
--- /dev/null
+++ b/include/net/tos_cgroup.h
@@ -0,0 +1,46 @@
+/*
+ * tos_cgroup.hIP TOS Control Group
+ *
+ * Authors:Li RongQing <lirongq...@baidu.com>
+ *  Jim Yan <jim...@baidu.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ */
+
+#ifndef _IP_TOS_CGROUP_H
+#define _IP_TOS_CGROUP_H
+
+#include 
+#include 
+
+struct tos_cgroup_state {
+   struct cgroup_subsys_state css;
+   u32 tos;
+};
+
+#if IS_ENABLED(CONFIG_IP_TOS_CGROUP)
+static inline u32 task_ip_tos(struct task_struct *p)
+{
+   u32 tos;
+
+   if (in_interrupt())
+   return 0;
+
+   rcu_read_lock();
+   tos = container_of(task_css(p, ip_tos_cgrp_id),
+   struct tos_cgroup_state, css)->tos;
+   rcu_read_unlock();
+
+   return tos;
+}
+#else /* !CONFIG_IP_TOS_CGROUP */
+static inline u32 task_ip_tos(struct task_struct *p)
+{
+   return 0;
+}
+#endif /* CONFIG_IP_TOS_CGROUP */
+#endif  /* _IP_TOS_CGROUP_H */
diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
index f48fe6fc7e8c..6f8ce1b2ceb0 100644
--- a/net/ipv4/Kconfig
+++ b/net/ipv4/Kconfig
@@ -748,3 +748,12 @@ config TCP_MD5SIG
  on the Internet.
 
  If unsure, say N.
+
+config IP_TOS_CGROUP
+   bool "ip tos cgroup"
+   depends on CGROUPS
+   ---help---
+ Say Y here if you want to set ip packet tos based on the
+ control cgroup of their process.
+
+ This can set ip packet tos
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index 47a0a6649a9d..a2734c50db2e 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -60,6 +60,7 @@ obj-$(CONFIG_TCP_CONG_LP) += tcp_lp.o
 obj-$(CONFIG_TCP_CONG_YEAH) += tcp_yeah.o
 obj-$(CONFIG_TCP_CONG_ILLINOIS) += tcp_illinois.o
 obj-$(CONFIG_NETLABEL) += cipso_ipv4.o
+obj-$(CONFIG_IP_TOS_CGROUP) += tos_cgroup.o
 
 obj-$(CONFIG_XFRM) += xfrm4_policy.o xfrm4_state.o xfrm4_input.o \
  xfrm4_output.o xfrm4_protocol.o
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index e4329e161943..90842bedb500 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -120,6 +120,7 @@
 #include 
 #endif
 #include 
+#include 
 
 #include 
 
@@ -356,6 +357,7 @@ static int inet_create(struct net *net, struct socket 
*sock, int protocol,
inet->mc_index  = 0;
inet->mc_list   = NULL;
inet->rcv_tos   = 0;
+   inet->tos   = task_ip_tos(current);
 
sk_refcnt_debug_inc(sk);
 
diff --git a/net/ipv4/tos_cgroup.c b/net/ipv4/tos_cgroup.c
new file mode 100644
index ..17d2d7c02871
--- /dev/null
+++ b/net/ipv4/tos_cgroup.c
@@ -0,0 +1,145 @@
+/*
+ * net/ipv4/tos_cgroup.c   IP TOS Control Group
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * Authors:Li RongQing <lirongq...@baidu.com>
+ *  jimyan <jim...@baidu.com>
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static inline
+struct 

[PATCH] net: sched: do not emit messages while holding spinlock

2018-03-29 Thread Li RongQing
Move message emission out of the sch_tree_lock section to avoid holding
the lock for too long.

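The same pattern in isolation, as a hedged sketch (lock, structure and
limits are placeholders; only the shape matches the patch): record what
happened while the lock is held, and print after it is released.

#include <linux/printk.h>
#include <linux/spinlock.h>

struct my_class {
	int quantum;
};

static DEFINE_SPINLOCK(my_lock);

static void my_clamp_quantum(struct my_class *cl)
{
	int warn = 0;

	spin_lock_bh(&my_lock);
	if (cl->quantum < 1000) {
		warn = -1;
		cl->quantum = 1000;
	} else if (cl->quantum > 200000) {
		warn = 1;
		cl->quantum = 200000;
	}
	spin_unlock_bh(&my_lock);

	if (warn)
		pr_warn("quantum was %s, clamped\n",
			warn == -1 ? "small" : "big");
}
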
Signed-off-by: Li RongQing <lirongq...@baidu.com>
---
 net/sched/sch_htb.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
index 1ea9846cc6ce..2a4ab7caf553 100644
--- a/net/sched/sch_htb.c
+++ b/net/sched/sch_htb.c
@@ -1337,6 +1337,7 @@ static int htb_change_class(struct Qdisc *sch, u32 
classid,
struct nlattr *tb[TCA_HTB_MAX + 1];
struct tc_htb_opt *hopt;
u64 rate64, ceil64;
+   int warn = 0;
 
/* extract all subattrs from opt attr */
if (!opt)
@@ -1499,13 +1500,11 @@ static int htb_change_class(struct Qdisc *sch, u32 
classid,
cl->quantum = min_t(u64, quantum, INT_MAX);
 
if (!hopt->quantum && cl->quantum < 1000) {
-   pr_warn("HTB: quantum of class %X is small. Consider 
r2q change.\n",
-   cl->common.classid);
+   warn = -1;
cl->quantum = 1000;
}
if (!hopt->quantum && cl->quantum > 20) {
-   pr_warn("HTB: quantum of class %X is big. Consider r2q 
change.\n",
-   cl->common.classid);
+   warn = 1;
cl->quantum = 20;
}
if (hopt->quantum)
@@ -1519,6 +1518,10 @@ static int htb_change_class(struct Qdisc *sch, u32 
classid,
 
sch_tree_unlock(sch);
 
+   if (warn)
+   pr_warn("HTB: quantum of class %X is %s. Consider r2q 
change.\n",
+   cl->common.classid, (warn == -1 ? "small" : "big"));
+
	qdisc_class_hash_grow(sch, &q->clhash);
 
*arg = (unsigned long)cl;
-- 
2.11.0



[PATCH] tcp: release sk_frag.page in tcp_disconnect

2018-01-26 Thread Li RongQing
A socket can be disconnected and transformed back into a listening
socket. If sk_frag.page is not released, it will be cloned into a new
socket by sk_clone_lock(), but the reference count of this page is not
increased, leading to a use-after-free or double-free issue.

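For reference, a minimal userspace sketch of the transition described
above (a connected TCP socket turned back into a listener); it only
shows the state changes, error handling is omitted, and connect() with
AF_UNSPEC is the documented way to dissolve the association (it ends up
in tcp_disconnect()):

#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

static void reuse_as_listener(int fd, const struct sockaddr_in *peer)
{
	struct sockaddr sa;
	char buf[1] = { 0 };

	connect(fd, (const struct sockaddr *)peer, sizeof(*peer));
	send(fd, buf, sizeof(buf), 0);		/* socket has sent data */

	memset(&sa, 0, sizeof(sa));
	sa.sa_family = AF_UNSPEC;
	connect(fd, &sa, sizeof(sa));		/* disconnect the socket */

	listen(fd, 16);				/* reuse it as a listener */
}
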
Signed-off-by: Li RongQing <lirongq...@baidu.com>
Cc: Eric Dumazet <eduma...@google.com>
---
 net/ipv4/tcp.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index f08eebe60446..73f068406519 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2431,6 +2431,12 @@ int tcp_disconnect(struct sock *sk, int flags)
 
WARN_ON(inet->inet_num && !icsk->icsk_bind_hash);
 
+   if (sk->sk_frag.page) {
+   put_page(sk->sk_frag.page);
+   sk->sk_frag.page = NULL;
+   sk->sk_frag.offset = 0;
+   }
+
sk->sk_error_report(sk);
return err;
 }
-- 
2.11.0



Re: Re: [PATCH] net: clean the sk_frag.page of new cloned socket

2018-01-25 Thread Li,Rongqing


> -----Original Message-----
> From: Eric Dumazet [mailto:eric.duma...@gmail.com]
> Sent: January 26, 2018 11:14
> To: Li,Rongqing <lirongq...@baidu.com>; netdev@vger.kernel.org
> Cc: eduma...@google.com
> Subject: Re: Re: [PATCH] net: clean the sk_frag.page of new cloned socket
> 
> On Fri, 2018-01-26 at 02:09 +, Li,Rongqing wrote:
> 
> >
> > crash> bt 8683
> > PID: 8683   TASK: 881faa088000  CPU: 10  COMMAND: "mynode"
> >  #0 [881fff145e78] crash_nmi_callback at 81031712
> >  #1 [881fff145e88] nmi_handle at 816cafe9
> >  #2 [881fff145ec8] do_nmi at 816cb0f0
> >  #3 [881fff145ef0] end_repeat_nmi at 816ca4a1
> > [exception RIP: _raw_spin_lock_irqsave+62]
> > RIP: 816c9a9e  RSP: 881fa992b990  RFLAGS: 0002
> > RAX: 4358  RBX: 88207ffd7e80  RCX:
> 4358
> > RDX: 4356  RSI: 0246  RDI:
> 88207ffd7ee8
> > RBP: 881fa992b990   R8:    R9:
> 019a16e6
> > R10: 4d24  R11: 4000  R12:
> 0242
> > R13: 4d24  R14: 0001  R15:
> 
> > ORIG_RAX:   CS: 0010  SS: 0018
> > ---  ---
> >  #4 [881fa992b990] _raw_spin_lock_irqsave at 816c9a9e
> >  #5 [881fa992b998] get_page_from_freelist at 8113ce5f
> >  #6 [881fa992ba70] __alloc_pages_nodemask at 8113d15f
> >  #7 [881fa992bba0] alloc_pages_current at 8117ab29
> >  #8 [881fa992bbe8] sk_page_frag_refill at 815dd310
> >  #9 [881fa992bc18] tcp_sendmsg at 8163e4f3
> > #10 [881fa992bcd8] inet_sendmsg at 81668434
> > #11 [881fa992bd08] sock_sendmsg at 815d9719
> > #12 [881fa992be58] SYSC_sendto at 815d9c81
> > #13 [881fa992bf70] sys_sendto at 815da6ae
> > #14 [881fa992bf80] system_call_fastpath at 816d2189
> >
> 
> Note that tcp_sendmsg() does not use sk->sk_frag, but the per task page.
> 
> Unless something changes sk->sk_allocation, which a user application can
> not do.
> 
> Are you using a pristine upstream kernel ?

No

I do not know how to reproduce this bug; I have only seen it twice on
production machines.

-RongQing



Re: Re: [PATCH] net: clean the sk_frag.page of new cloned socket

2018-01-25 Thread Li,Rongqing

> > my kernel is 3.10, I did not find the root cause, I guest all kind of
> > possibility
> >
> 
> Have you backported 22a0e18eac7a9e986fec76c60fa4a2926d1291e2 ?
> 
> 
When I saw this bug, I found that commit and backported it,
but it does not seem related to my bug.


> > > I would rather move that in tcp_disconnect() that only fuzzers use,
> > > instead of doing this on every clone and slowing down normal users.
> > >
> >
> >
> > Do you mean we should fix it like below:
> >
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index
> > f08eebe60446..44f8320610ab 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -2431,6 +2431,12 @@ int tcp_disconnect(struct sock *sk, int flags)
> >
> > WARN_ON(inet->inet_num && !icsk->icsk_bind_hash);
> >
> > +
> > +   if (sk->sk_frag.page) {
> > +   put_page(sk->sk_frag.page);
> > +   sk->sk_frag.page = NULL;
> > +   }
> > +
> > sk->sk_error_report(sk);
> > return err;
> >  }
> 
> Yes, something like that.

Ok, thanks

-R



Re: [PATCH] net: clean the sk_frag.page of new cloned socket

2018-01-25 Thread Li,Rongqing


> > if (newsk->sk_prot->sockets_allocated)
> > sk_sockets_allocated_inc(newsk);
> 
> Good catch.
> 
> I suspect this was discovered by some syzkaller/syzbot run ?
> 


No.

I am seeing a panic: a page is on both a task's task_frag.page and the
buddy free list. That should not happen; page->lru.next and
page->lru.prev are 0xdead00100100 and 0xdead00200200 (list poison), and
when a page is then allocated from the buddy allocator the system
panics in __list_del() called from __rmqueue():

#0 [881fff0c3850] machine_kexec at 8103cca8
 #1 [881fff0c38a0] crash_kexec at 810c2443
 #2 [881fff0c3968] oops_end at 816cae70
 #3 [881fff0c3990] die at 810063eb
 #4 [881fff0c39c0] do_general_protection at 816ca7ce
 #5 [881fff0c39f0] general_protection at 816ca0d8
[exception RIP: __rmqueue+120]
RIP: 8113a918  RSP: 881fff0c3aa0  RFLAGS: 00010046
RAX: 88207ffd8018  RBX: 0003  RCX: 0003
RDX: 0001  RSI: ea006f4cf620  RDI: dead00200200
RBP: 881fff0c3b00   R8: 88207ffd8018   R9: 
R10: dead00100100  R11: ea007ecc6480  R12: ea006f4cf600
R13:   R14: 0003  R15: 88207ffd7e80
ORIG_RAX:   CS: 0010  SS: 
 #6 [881fff0c3b08] get_page_from_freelist at 8113ce71
 #7 [881fff0c3be0] __alloc_pages_nodemask at 8113d15f
 #8 [881fff0c3d10] __alloc_page_frag at 815e2362
 #9 [881fff0c3d40] __netdev_alloc_frag at 815e241b
#10 [881fff0c3d58] __alloc_rx_skb at 815e2f91
#11 [881fff0c3d78] __netdev_alloc_skb at 815e300b
#12 [881fff0c3d90] ixgbe_clean_rx_irq at a003a98f [ixgbe]
#13 [881fff0c3df8] ixgbe_poll at a003c233 [ixgbe]
#14 [881fff0c3e70] net_rx_action at 815f2f09
#15 [881fff0c3ec8] __do_softirq at 81064867
#16 [881fff0c3f38] call_softirq at 816d3a9c
#17 [881fff0c3f50] do_softirq at 81004e65
#18 [881fff0c3f68] irq_exit at 81064b7d
#19 [881fff0c3f78] do_IRQ at 816d4428

The page info is as below (some fields removed):

crash> struct page ea006f4cf600 -x
struct page {
  flags = 0x2f4000, 
  mapping = 0x0, 
  {
{
  counters = 0x2, 
  {
{
  _mapcount = {
counter = 0x
  }, 
  {
inuse = 0x, 
objects = 0x7fff, 
frozen = 0x1
  }, 
  units = 0x
}, 
_count = {
  counter = 0x2
}
  }
}
  }, 
  {
lru = {
  next = 0xdead00100100, 
  prev = 0xdead00200200
}, 
  }, 
…..
  }
}
crash>

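Those lru values are the list-poison constants: list_del() leaves them
behind, and dereferencing them later in __list_del() is exactly the
general-protection fault seen above. A hedged sketch, with the
constants as defined in the 3.10-era include/linux/poison.h (from
memory; POISON_POINTER_DELTA is 0xdead000000000000 on this x86_64
config):

#include <linux/list.h>
#include <linux/poison.h>

static inline void list_del_sketch(struct list_head *entry)
{
	__list_del(entry->prev, entry->next);
	entry->next = LIST_POISON1;	/* 0x00100100 + delta -> dead00100100 */
	entry->prev = LIST_POISON2;	/* 0x00200200 + delta -> dead00200200 */
}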

The page ea006f4cf600 is in another task's task_frag.page, and that
task's backtrace is as below:

crash> task 8683|grep ea006f4cf600 -A3  
page = 0xea006f4cf600, 
offset = 32768, 
size = 32768
  }, 
crash>

crash> bt 8683
PID: 8683   TASK: 881faa088000  CPU: 10  COMMAND: "mynode"
 #0 [881fff145e78] crash_nmi_callback at 81031712
 #1 [881fff145e88] nmi_handle at 816cafe9
 #2 [881fff145ec8] do_nmi at 816cb0f0
 #3 [881fff145ef0] end_repeat_nmi at 816ca4a1
[exception RIP: _raw_spin_lock_irqsave+62]
RIP: 816c9a9e  RSP: 881fa992b990  RFLAGS: 0002
RAX: 4358  RBX: 88207ffd7e80  RCX: 4358
RDX: 4356  RSI: 0246  RDI: 88207ffd7ee8
RBP: 881fa992b990   R8:    R9: 019a16e6
R10: 4d24  R11: 4000  R12: 0242
R13: 4d24  R14: 0001  R15: 
ORIG_RAX:   CS: 0010  SS: 0018
---  ---
 #4 [881fa992b990] _raw_spin_lock_irqsave at 816c9a9e
 #5 [881fa992b998] get_page_from_freelist at 8113ce5f
 #6 [881fa992ba70] __alloc_pages_nodemask at 8113d15f
 #7 [881fa992bba0] alloc_pages_current at 8117ab29
 #8 [881fa992bbe8] sk_page_frag_refill at 815dd310
 #9 [881fa992bc18] tcp_sendmsg at 8163e4f3
#10 [881fa992bcd8] inet_sendmsg at 81668434
#11 [881fa992bd08] sock_sendmsg at 815d9719
#12 [881fa992be58] SYSC_sendto at 815d9c81
#13 [881fa992bf70] sys_sendto at 815da6ae
#14 [881fa992bf80] system_call_fastpath at 816d2189
RIP: 7f5bfe1d804b  RSP: 7f5bfa63b3b0  RFLAGS: 0206
RAX: 002c  RBX: 816d2189  RCX: 7f5bfa63b420
RDX: 2000  RSI: 0c096000  RDI: 0040
RBP:    R8:    R9: 
R10:   R11: 0246  R12: 815da6ae
R13: 881fa992bf78  R14: a552  R15: 0016
ORIG_RAX: 002c  CS: 0033  SS: 002b
crash>


my kernel is 3.10, 

[PATCH] net: clean the sk_frag.page of new cloned socket

2018-01-25 Thread Li RongQing
Clear the sk_frag.page of a newly cloned socket, otherwise the page
will wrongly be released twice, since its reference count is not
increased by the clone.

sk_clone_lock() is used to clone a new socket from a sock in the
listening state, which normally has no sk_frag.page. But a socket that
has sent data can be transformed back into a listening socket, and it
will then allocate a tcp_sock through sk_clone_lock() when a new
connection comes in.

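To make the imbalance concrete, a hedged kernel-style sketch (the
structure and function are placeholders and error handling is omitted):
the clone copies the page pointer without taking a reference, so the
single reference is dropped twice.

#include <linux/gfp.h>
#include <linux/mm.h>

struct frag_owner {
	struct page *page;
};

static void demo_imbalance(void)
{
	struct frag_owner parent = { .page = alloc_page(GFP_KERNEL) };
	struct frag_owner child;

	child = parent;			/* pointer copied, no get_page() */

	put_page(child.page);		/* refcount 1 -> 0, page freed  */
	put_page(parent.page);		/* second put on a freed page   */
}
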
Signed-off-by: Li RongQing <lirongq...@baidu.com>
Cc: Eric Dumazet <eduma...@google.com>
---
 net/core/sock.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/core/sock.c b/net/core/sock.c
index c0b5b2f17412..c845856f26da 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1738,6 +1738,8 @@ struct sock *sk_clone_lock(const struct sock *sk, const 
gfp_t priority)
sk_refcnt_debug_inc(newsk);
sk_set_socket(newsk, NULL);
newsk->sk_wq = NULL;
+   newsk->sk_frag.page = NULL;
+   newsk->sk_frag.offset = 0;
 
if (newsk->sk_prot->sockets_allocated)
sk_sockets_allocated_inc(newsk);
-- 
2.11.0



Re: [PATCH][net-next] ipv6: replace write lock with read lock in addrconf_permanent_addr

2016-03-14 Thread Li RongQing
On Tue, Mar 15, 2016 at 12:25 AM, David Miller  wrote:
>
> We need it for the modifications made by fixup_permanent_addr().


fixup_permanent_addr() should be protected by ifp->lock, not by the
idev->lock write lock, since it is ifp that is modified.

-Roy


Re: [PATCH][net-next][v2] bridge: allow the maximum mtu to 64k

2016-02-24 Thread Li RongQing
On Thu, Feb 25, 2016 at 5:44 AM, Stephen Hemminger
 wrote:
>> This is especially annoying for the virtualization case because the
>> KVM's tap driver will by default adopt the bridge's MTU on startup
>> making it impossible (without the workaround) to use a large MTU on the
>> guest VMs.
>>
>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1399064
>
> This use case looks like KVM misusing bridge MTU. I.e it should set TAP
> MTU to what it wants then enslave it, not vice versa.

1. A user should be able to configure an empty bridge to an MTU higher
than 1500.

2. If the tap MTU is configured to a higher value first while other
ports use a lower value, PMTU will be used, which may lower
performance. The configuration process is implemented in libvirt
(virnetdevtap.c); of course it could be improved to fix this issue.
https://www.redhat.com/archives/libvir-list/2008-December/msg00083.html

-R


Re: [PATCH][net-next] bridge: increase mtu to 9000

2016-02-22 Thread Li RongQing
On Tue, Feb 23, 2016 at 1:58 AM, Stephen Hemminger
<step...@networkplumber.org> wrote:
>> guest VMs.
>>
>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1399064
>>
>> Signed-off-by: Li RongQing <roy.qing...@gmail.com>
>
> Your change works, but I agree with Hannes. Just allow up to 64 * 1024 like
> loopback does.  And no need for a #define for that it is only in one place.



thanks, I will change it as you suggest

-Roy


Re: question about vrf-lite

2016-01-06 Thread Li RongQing
>>
>> is it right?
>
>
> no. The above works fine for me. I literally copied and pasted all of the
> commands except the master ones which were adapted to my setup -- eth9 and
> eth11 for me instead of eth0 and eth1. tcpdump on N2, N3 show the right one
> is receiving packets based on which 'ping -I vrf' is run.
>
> Do tables 5 and 6 have the right routes?


Thanks, David;

It is not a VRF issue; it is a qemu configuration issue on my side.

I am testing VRF on qemu, and the issue (and its solution) is the same
as the one at the link below:

https://lists.gnu.org/archive/html/qemu-discuss/2014-06/msg00059.html

-Roy


Re: [PATCH] net: sysctl: fix a kmemleak warning

2015-10-23 Thread Li RongQing
On Fri, Oct 23, 2015 at 6:04 PM, David Miller  wrote:
>> +out1:
>> + unregister_sysctl_table(net_header);
>> + kfree(net_header);
>
> I read over unregister_sysctl_table() and it appears to do the kfree()
> for us, doesn't it?


you are right, thanks

-Roy


Re: [PATCH net-next v2] ipconfig: send Client-identifier in DHCP requests

2015-10-15 Thread Li RongQing
On Thu, Oct 15, 2015 at 10:56 PM, Florian Fainelli  wrote:
> Did not you mean strlen(dhcp_client_identifer) + 1 instead?


No;
dhcp_client_identifier[0] is the client identifier type, which may be 0;
dhcp_client_identifier + 1 is the start address of the client identifier value.

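To spell out the layout being described, a hedged sketch of building
DHCP option 61 from such a buffer (the function and buffer handling are
illustrative, not the ipconfig code):

#include <string.h>

static int put_client_id(unsigned char *opt, const unsigned char *id)
{
	int len = strlen((const char *)id + 1);	/* value length only */

	opt[0] = 61;		/* DHCP option: client-identifier   */
	opt[1] = len + 1;	/* option length: type byte + value */
	memcpy(&opt[2], id, len + 1);
	return 2 + len + 1;
}
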
-Roy


Re: [bug report or not] ping6 will lose packets when pinging lots of ipv6 addresses

2015-10-14 Thread Li RongQing
On Wed, Oct 14, 2015 at 12:11 AM, Martin KaFai Lau <ka...@fb.com> wrote:
> On Tue, Oct 13, 2015 at 08:46:49PM +0800, Li RongQing wrote:
>> 1. in a machine, configure 3000 ipv6 address in one interface
>>
>> for i in {1..3000}; do ip -6 addr add 4001:5013::$i/0 dev eth0; done
>>
>>
>> 2. in other machine, ping6 the upper configured ipv6 address, then
>> lots of lost packets
>>
>> ip -6 addr add 4001:5013::0/64 dev eth0
>> for i in {1..2000}; do ping6 -q -c1 4001:5013::$i; done;
>>
>> 3. increasing the gc thresh can handles these lost
>>
>> sysctl -w  net.ipv6.neigh.default.gc_thresh1=2000
>> sysctl -w  net.ipv6.neigh.default.gc_thresh2=3000
>> sysctl -w  net.ipv6.neigh.default.gc_thresh3=4000
>> sysctl -w net.ipv6.route.gc_thresh=3000
>> sysctl -w net.ipv6.route.max_size =3000
> Which kernel is used in this test?

All versions. I do not think this is a bug: the test makes the number of
neighbour entries grow larger than net.ipv6.neigh.default.gc_thresh3, so
new neighbour entries cannot be allocated and ping loses packets.
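
A simplified sketch of the gate involved (not the kernel code, only the
shape of the check): once the entry count reaches gc_thresh3 and forced
GC frees nothing, the new neighbour entry is simply not created, so the
resolution and the ping fail until entries age out or the thresholds
are raised.

#include <linux/slab.h>

struct my_neigh_tbl {
	int entries;
	int gc_thresh3;
};

static bool my_forced_gc(struct my_neigh_tbl *tbl)
{
	return false;			/* pretend GC reclaimed nothing */
}

static void *my_neigh_alloc(struct my_neigh_tbl *tbl, size_t size)
{
	if (tbl->entries >= tbl->gc_thresh3 && !my_forced_gc(tbl))
		return NULL;		/* allocation refused */

	tbl->entries++;
	return kzalloc(size, GFP_ATOMIC);
}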

Thanks


-Roy


Re: ICMPv6 too big Packet will makes the network unreachable

2015-10-14 Thread Li RongQing
On Tue, Oct 13, 2015 at 10:26 PM, Hannes Frederic Sowa
 wrote:
>>root@du1:~# ip route get 2001:1b70:82a8:18:650:65:0:2
>>2001:1b70:82a8:18:650:65:0:2 dev eth10.650  src
>> 2001:1b70:82a8:18:650:65:0:2  metric 0
>>cache
>>root@du1:~#
>
> Which kernel version did you test this on?
>
> Thanks,
> Hannes


I think it affects all versions.

-Roy


Re: ICMPv6 too big Packet will makes the network unreachable

2015-10-14 Thread Li RongQing
On Wed, Oct 14, 2015 at 5:18 PM, Sheng Yong <shengyo...@huawei.com> wrote:
> Hi, Rongqing,
>
> Cced Martin KaFai Lau <ka...@fb.com>
>
> It seems you trigger the problem that I met before, here is the link of 
> disucssion:
> http://www.spinics.net/lists/netdev/msg314717.html
>
> You can try these patches to check if they resolve your problem:
> 7035870d1219 | 2015-05-03 | ipv6: Check RTF_LOCAL on rt->rt6i_flags instead 
> of rt->dst.flags
> 653437d02f1f | 2015-04-28 | ipv6: Stop /128 route from disappearing after 
> pmtu update
>
> thanks,
> Sheng
>
> On 10/13/2015 3:09 PM, Li RongQing wrote:
>> 1. Machine with 2001:1b70:82a8:18:650:65:0:2 address, and receive wrong
>> icmp packets
>> root@du1:~# ifconfig
>> eth10.650 Link encap:Ethernet  HWaddr 74:c9:9a:a7:e5:88
>>   inet6 addr: fe80::76c9:9aff:fea7:e588/64 Scope:Link
>>   inet6 addr: 2001:1b70:82a8:18:650:65:0:2/80 Scope:Global
>>   UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>   RX packets:1 errors:0 dropped:0 overruns:0 frame:0
>>   TX packets:9 errors:0 dropped:0 overruns:0 carrier:0
>>   collisions:0 txqueuelen:0
>>   RX bytes:104 (104.0 B)  TX bytes:934 (934.0 B)
>>
>> 2. ICMPv6 packet is as below.
>>
>>###[ Ethernet ]###
>>  dst   = 74:C9:9A:A7:E5:88
>>  src   = ae:4f:44:f2:10:cc
>>  type  = 0x86dd
>>###[ IPv6 ]###
>> version   = 6
>> tc= 0
>> fl= 0
>> plen  = None
>> nh= ICMPv6
>> hlim  = 64
>> src   = 2001:1b70:82a8:18:650:65:0:4
>> dst   = 2001:1b70:82a8:18:650:65:0:2
>>
>>###[ ICMPv6 Packet Too Big ]###
>>type  = Packet too big
>>code  = 0
>>cksum = None
>>mtu   = 1280
>>
>>###[ IPv6 ]###
>>   version   = 6
>>   tc= 0
>>   fl= 0
>>   plen  = None
>>   nh= ICMPv6
>>   hlim  = 255
>>   src   = 2001:1b70:82a8:18:650:65:0:2
>>   dst   = 2001:1b70:82a8:18:650:65:0:2
>>###[ ICMPv6 Neighbor Discovery - Neighbor Advertisement ]###
>>  type  = Neighbor Advertisement
>>  code  = 0
>>  cksum = None
>>  R = 1
>>  S = 0
>>  O = 1
>>  res   = 0x0
>>  tgt   = 2001:1b70:82a8:18:650:65:0:2
>>
>># Test #
>>
>> 3. Send ICMPv6  with Scapy to trigger fault.
>>
>>conf.iface='eth1'
>>eth = Ether(src='ae:4f:44:f2:10:cc', dst='74:C9:9A:A7:E5:88')
>>base = IPv6(src='2001:1b70:82a8:18:650:65:0:4',
>> dst='2001:1b70:82a8:18:650:65:0:2')
>>ptb = ICMPv6PacketTooBig(type=2)
>>packet = eth/base/ptb
>>ptb_payload_na_base = IPv6(src='2001:1b70:82a8:18:650:65:0:2',
>> dst='2001:1b70:82a8:18:650:65:0:2')
>>ptb_payload_na = ICMPv6ND_NA(type=136, tgt='2001:1b70:82a8:18:650:65:0:2')
>>ptb_payload = ptb_payload_na_base/ptb_payload_na
>>packet = packet/ptb_payload
>>sendp(packet, iface="eth1.650", count=1)
>>
>> 4.  route information  will enter the faulty state after Wait 600 seconds,
>>
>>root@du1:~# ip route get 2001:1b70:82a8:18:650:65:0:2
>>local 2001:1b70:82a8:18:650:65:0:2 dev lo  proto none  src
>> 2001:1b70:82a8:18:650:65:0:2  metric 0  expires 7sec mtu 1280
>>
>>root@du1:~# ip route get 2001:1b70:82a8:18:650:65:0:2
>>local 2001:1b70:82a8:18:650:65:0:2 dev lo  proto none  src
>> 2001:1b70:82a8:18:650:65:0:2  metric 0  expires 3sec mtu 1280
>>
>>root@du1:~# ip route get 2001:1b70:82a8:18:650:65:0:2
>>2001:1b70:82a8:18:650:65:0:2 dev eth10.650  src
>> 2001:1b70:82a8:18:650:65:0:2  metric 0
>>cache
>>root@du1:~#


Re: ICMPv6 too big Packet will makes the network unreachable

2015-10-14 Thread Li RongQing
On Wed, Oct 14, 2015 at 5:53 PM, Li RongQing <roy.qing...@gmail.com> wrote:
> On Wed, Oct 14, 2015 at 5:18 PM, Sheng Yong <shengyo...@huawei.com> wrote:
>> Hi, Rongqing,
>>
>> Cced Martin KaFai Lau <ka...@fb.com>
>>
>> It seems you trigger the problem that I met before, here is the link of 
>> disucssion:
>> http://www.spinics.net/lists/netdev/msg314717.html
>>
>> You can try these patches to check if they resolve your problem:
>> 7035870d1219 | 2015-05-03 | ipv6: Check RTF_LOCAL on rt->rt6i_flags instead 
>> of rt->dst.flags
>> 653437d02f1f | 2015-04-28 | ipv6: Stop /128 route from disappearing after 
>> pmtu update
>>
>> thanks,
>> Sheng
>>


You are right, these two commits fixed my issue

Thanks

-Roy


Re: [PATCH] ipconfig: send Client-identifier in DHCP requests

2015-10-14 Thread Li RongQing
On Thu, Oct 15, 2015 at 11:27 AM, kbuild test robot  wrote:
> Hi Li,
>
> [auto build test WARNING on net/master -- if it's inappropriate base, please 
> suggest rules for selecting the more suitable base]
>
> url:
> https://github.com/0day-ci/linux/commits/roy-qing-li-gmail-com/ipconfig-send-Client-identifier-in-DHCP-requests/20151015-105553
> config: parisc-c3000_defconfig (attached as .config)
> reproduce:
> wget 
> https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
>  -O ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # save the attached .config to linux build tree
> make.cross ARCH=parisc
>
> All warnings (new ones prefixed by >>):
>
>>> net/ipv4/ipconfig.c:148:13: warning: 'dhcp_client_identifier' defined but 
>>> not used [-Wunused-variable]
> static char dhcp_client_identifier[253] __initdata;
> ^


Thanks, I will fix it

-Roy


ICMPv6 too big Packet will makes the network unreachable

2015-10-13 Thread Li RongQing
1. A machine with address 2001:1b70:82a8:18:650:65:0:2 receives the
forged ICMPv6 packet:
root@du1:~# ifconfig
eth10.650 Link encap:Ethernet  HWaddr 74:c9:9a:a7:e5:88
  inet6 addr: fe80::76c9:9aff:fea7:e588/64 Scope:Link
  inet6 addr: 2001:1b70:82a8:18:650:65:0:2/80 Scope:Global
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  RX packets:1 errors:0 dropped:0 overruns:0 frame:0
  TX packets:9 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:0
  RX bytes:104 (104.0 B)  TX bytes:934 (934.0 B)

2. ICMPv6 packet is as below.

   ###[ Ethernet ]###
 dst   = 74:C9:9A:A7:E5:88
 src   = ae:4f:44:f2:10:cc
 type  = 0x86dd
   ###[ IPv6 ]###
version   = 6
tc= 0
fl= 0
plen  = None
nh= ICMPv6
hlim  = 64
src   = 2001:1b70:82a8:18:650:65:0:4
dst   = 2001:1b70:82a8:18:650:65:0:2

   ###[ ICMPv6 Packet Too Big ]###
   type  = Packet too big
   code  = 0
   cksum = None
   mtu   = 1280

   ###[ IPv6 ]###
  version   = 6
  tc= 0
  fl= 0
  plen  = None
  nh= ICMPv6
  hlim  = 255
  src   = 2001:1b70:82a8:18:650:65:0:2
  dst   = 2001:1b70:82a8:18:650:65:0:2
   ###[ ICMPv6 Neighbor Discovery - Neighbor Advertisement ]###
 type  = Neighbor Advertisement
 code  = 0
 cksum = None
 R = 1
 S = 0
 O = 1
 res   = 0x0
 tgt   = 2001:1b70:82a8:18:650:65:0:2

   # Test #

3. Send the ICMPv6 packet with Scapy to trigger the fault.

   conf.iface='eth1'
   eth = Ether(src='ae:4f:44:f2:10:cc', dst='74:C9:9A:A7:E5:88')
   base = IPv6(src='2001:1b70:82a8:18:650:65:0:4',
dst='2001:1b70:82a8:18:650:65:0:2')
   ptb = ICMPv6PacketTooBig(type=2)
   packet = eth/base/ptb
   ptb_payload_na_base = IPv6(src='2001:1b70:82a8:18:650:65:0:2',
dst='2001:1b70:82a8:18:650:65:0:2')
   ptb_payload_na = ICMPv6ND_NA(type=136, tgt='2001:1b70:82a8:18:650:65:0:2')
   ptb_payload = ptb_payload_na_base/ptb_payload_na
   packet = packet/ptb_payload
   sendp(packet, iface="eth1.650", count=1)

4. The route information enters the faulty state after waiting 600 seconds:

   root@du1:~# ip route get 2001:1b70:82a8:18:650:65:0:2
   local 2001:1b70:82a8:18:650:65:0:2 dev lo  proto none  src
2001:1b70:82a8:18:650:65:0:2  metric 0  expires 7sec mtu 1280

   root@du1:~# ip route get 2001:1b70:82a8:18:650:65:0:2
   local 2001:1b70:82a8:18:650:65:0:2 dev lo  proto none  src
2001:1b70:82a8:18:650:65:0:2  metric 0  expires 3sec mtu 1280

   root@du1:~# ip route get 2001:1b70:82a8:18:650:65:0:2
   2001:1b70:82a8:18:650:65:0:2 dev eth10.650  src
2001:1b70:82a8:18:650:65:0:2  metric 0
   cache
   root@du1:~#


[bug report or not] ping6 will lose packets when pinging lots of ipv6 addresses

2015-10-13 Thread Li RongQing
1. On one machine, configure 3000 IPv6 addresses on one interface:

for i in {1..3000}; do ip -6 addr add 4001:5013::$i/0 dev eth0; done


2. From another machine, ping6 the addresses configured above; many
packets are lost:

ip -6 addr add 4001:5013::0/64 dev eth0
for i in {1..2000}; do ping6 -q -c1 4001:5013::$i; done;

3. Increasing the gc thresholds handles the loss:

sysctl -w  net.ipv6.neigh.default.gc_thresh1=2000
sysctl -w  net.ipv6.neigh.default.gc_thresh2=3000
sysctl -w  net.ipv6.neigh.default.gc_thresh3=4000
sysctl -w net.ipv6.route.gc_thresh=3000
sysctl -w net.ipv6.route.max_size=3000




-Roy


Re: [PATCH] xfrm: fix the xfrm_policy/state_walk

2015-04-22 Thread Li RongQing
On Wed, Apr 22, 2015 at 3:57 PM, Herbert Xu herb...@gondor.apana.org.au wrote:
> Signed-off-by: Li RongQing roy.qing...@gmail.com
>
> This is not a bug fix but an optimisation.  The walker entries are
> all marked as dead and will be skipped by the loop.
>
> However, I don't see anything wrong with this optimisation.
>
> Cheers,


thanks, you are right, I will rewrite the commit header

-Roy