[PATCH][net-next] tun: remove unnecessary check in tun_flow_update
The caller has guaranteed that rxhash is not zero.

Signed-off-by: Li RongQing
---
 drivers/net/tun.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index d0745dc81976..6760b86547df 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -529,10 +529,7 @@ static void tun_flow_update(struct tun_struct *tun, u32 rxhash,
 	unsigned long delay = tun->ageing_time;
 	u16 queue_index = tfile->queue_index;
 
-	if (!rxhash)
-		return;
-	else
-		head = &tun->flows[tun_hashfn(rxhash)];
+	head = &tun->flows[tun_hashfn(rxhash)];
 
 	rcu_read_lock();
--
2.16.2
[PATCH][net-next] tun: align write-heavy flow entry members to a cache line
The tun flow entry 'updated' field is written for every received packet, so
while a flow is receiving packets it causes false sharing with all the other
CPUs that look the entry up. Move it to its own cache line, and write
'queue_index' and 'updated' only when they actually change, to reduce the
cache false sharing.

Signed-off-by: Zhang Yu
Signed-off-by: Wang Li
Signed-off-by: Li RongQing
---
 drivers/net/tun.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 835c73f42ae7..d0745dc81976 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -201,7 +201,7 @@ struct tun_flow_entry {
 	u32 rxhash;
 	u32 rps_rxhash;
 	int queue_index;
-	unsigned long updated;
+	unsigned long updated ____cacheline_aligned_in_smp;
 };
 
 #define TUN_NUM_FLOW_ENTRIES 1024
@@ -539,8 +539,10 @@ static void tun_flow_update(struct tun_struct *tun, u32 rxhash,
 	e = tun_flow_find(head, rxhash);
 	if (likely(e)) {
 		/* TODO: keep queueing to old queue until it's empty? */
-		e->queue_index = queue_index;
-		e->updated = jiffies;
+		if (e->queue_index != queue_index)
+			e->queue_index = queue_index;
+		if (e->updated != jiffies)
+			e->updated = jiffies;
 		sock_rps_record_flow_hash(e->rps_rxhash);
 	} else {
 		spin_lock_bh(&tun->lock);
--
2.16.2
Reply: [PATCH][net-next] vhost:net: allocate 32KB memory instead of 32K pages when page frag refill
On 2018/11/23 10:04 AM, Li RongQing wrote:
> > when page frag refills, 32K pages, 128MB memory is asked, it hardly
> > succeeds when the system is under memory stress
>
> Looking at get_order(), it seems we get 3 after get_order(32768) since it
> accepts the size of the block.

You are right, I understood it wrongly. Please drop this patch, sorry for
the noise.

-Q
Reply: [PATCH] net: fix the per task frag allocator size
> get_order(8) returns zero here if I understood it correctly.

You are right, I understood it wrongly. Please drop this patch, sorry for
the noise.

-Q
[PATCH][net-next] vhost:net: allocate 32KB memory instead of 32K pages when page frag refill
When the page frag refills, 32K pages (128MB of memory) are asked for, which
hardly succeeds when the system is under memory stress.

Such a large memory size also causes the reference bias to underflow and
makes the page refcount chaotic, since the reference bias is decremented to
negative before the allocated memory is used up.

So 32KB of memory is the safe choice; meanwhile, remove an unnecessary check.

Fixes: e4dab1e6ea64 ("vhost_net: mitigate page reference counting during page frag refill")
Signed-off-by: Zhang Yu
Signed-off-by: Li RongQing
---
 drivers/vhost/net.c | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index d919284f103b..b933a4a8e4ba 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -641,7 +641,7 @@ static bool tx_can_batch(struct vhost_virtqueue *vq, size_t total_len)
 	       !vhost_vq_avail_empty(vq->dev, vq);
 }
 
-#define SKB_FRAG_PAGE_ORDER	get_order(32768)
+#define SKB_FRAG_PAGE_ORDER	3
 
 static bool vhost_net_page_frag_refill(struct vhost_net *net, unsigned int sz,
 				       struct page_frag *pfrag, gfp_t gfp)
@@ -654,17 +654,17 @@ static bool vhost_net_page_frag_refill(struct vhost_net *net, unsigned int sz,
 	pfrag->offset = 0;
 	net->refcnt_bias = 0;
-	if (SKB_FRAG_PAGE_ORDER) {
-		/* Avoid direct reclaim but allow kswapd to wake */
-		pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
-					  __GFP_COMP | __GFP_NOWARN |
-					  __GFP_NORETRY,
-					  SKB_FRAG_PAGE_ORDER);
-		if (likely(pfrag->page)) {
-			pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
-			goto done;
-		}
-	}
+
+	/* Avoid direct reclaim but allow kswapd to wake */
+	pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
+				  __GFP_COMP | __GFP_NOWARN |
+				  __GFP_NORETRY,
+				  SKB_FRAG_PAGE_ORDER);
+	if (likely(pfrag->page)) {
+		pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
+		goto done;
+	}
+
 	pfrag->page = alloc_page(gfp);
 	if (likely(pfrag->page)) {
 		pfrag->size = PAGE_SIZE;
--
2.16.2
[PATCH] net: fix the per task frag allocator size
When filling the task frag, 32K pages (128MB of memory) are asked for, which
hardly succeeds when the system is under memory stress.

And commit 5640f7685831 ("net: use a per task frag allocator") said it wants
32768 bytes, not 32768 pages:

  "(up to 32768 bytes per frag, thats order-3 pages on x86)"

Fixes: 5640f7685831e ("net: use a per task frag allocator")
Signed-off-by: Zhang Yu
Signed-off-by: Li RongQing
---
 net/core/sock.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 6d7e189e3cd9..e3cbefeedf5c 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2186,8 +2186,8 @@ static void sk_leave_memory_pressure(struct sock *sk)
 	}
 }
 
-/* On 32bit arches, an skb frag is limited to 2^15 */
-#define SKB_FRAG_PAGE_ORDER	get_order(32768)
+/* On 32bit arches, an skb frag is limited to 2^15 bytes */
+#define SKB_FRAG_PAGE_ORDER	get_order(8)
 
 /**
  * skb_page_frag_refill - check that a page_frag contains enough room
--
2.16.2
[PATCH][net-next] net: slightly optimize eth_type_trans
netperf UDP stream shows that eth_type_trans takes a certain amount of CPU,
so adjust the MAC address check order: first check whether it is the device
address, and only check whether it is a multicast address when it is not the
device address.

After this change:

To unicast, when the skb dst MAC is the device MAC (most of the time), this
saves one comparison.
To unicast, when the skb dst MAC is not the device MAC, nothing changes.
To multicast, it adds one comparison.

Before: 1.03% [kernel] [k] eth_type_trans
After:  0.78% [kernel] [k] eth_type_trans

Signed-off-by: Zhang Yu
Signed-off-by: Li RongQing
---
 net/ethernet/eth.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c
index fd8faa0dfa61..1c88f5c5d5b1 100644
--- a/net/ethernet/eth.c
+++ b/net/ethernet/eth.c
@@ -165,15 +165,17 @@ __be16 eth_type_trans(struct sk_buff *skb, struct net_device *dev)
 	eth = (struct ethhdr *)skb->data;
 	skb_pull_inline(skb, ETH_HLEN);
 
-	if (unlikely(is_multicast_ether_addr_64bits(eth->h_dest))) {
-		if (ether_addr_equal_64bits(eth->h_dest, dev->broadcast))
-			skb->pkt_type = PACKET_BROADCAST;
-		else
-			skb->pkt_type = PACKET_MULTICAST;
+	if (unlikely(!ether_addr_equal_64bits(eth->h_dest,
+					      dev->dev_addr))) {
+		if (unlikely(is_multicast_ether_addr_64bits(eth->h_dest))) {
+			if (ether_addr_equal_64bits(eth->h_dest, dev->broadcast))
+				skb->pkt_type = PACKET_BROADCAST;
+			else
+				skb->pkt_type = PACKET_MULTICAST;
+		} else {
+			skb->pkt_type = PACKET_OTHERHOST;
+		}
 	}
-	else if (unlikely(!ether_addr_equal_64bits(eth->h_dest,
-						   dev->dev_addr)))
-		skb->pkt_type = PACKET_OTHERHOST;
 
 	/*
 	 * Some variants of DSA tagging don't have an ethertype field
--
2.16.2
[PATCH][net-next][v2] net: remove BUG_ON from __pskb_pull_tail
If list is a NULL pointer, the following access of list will trigger a
panic, which is the same as BUG_ON.

Signed-off-by: Li RongQing
---
 net/core/skbuff.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 396fcb3baad0..d69503d66021 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1925,8 +1925,6 @@ void *__pskb_pull_tail(struct sk_buff *skb, int delta)
 		struct sk_buff *insp = NULL;
 
 		do {
-			BUG_ON(!list);
-
 			if (list->len <= eat) {
 				/* Eaten as whole. */
 				eat -= list->len;
--
2.16.2
[PATCH][xfrm-next] xfrm6: remove BUG_ON from xfrm6_dst_ifdown
If loopback_idev is a NULL pointer, the following access of loopback_idev
will trigger a panic, which is the same as BUG_ON.

Signed-off-by: Li RongQing
---
 net/ipv6/xfrm6_policy.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c
index d35bcf92969c..769f8f78d3b8 100644
--- a/net/ipv6/xfrm6_policy.c
+++ b/net/ipv6/xfrm6_policy.c
@@ -262,7 +262,6 @@ static void xfrm6_dst_ifdown(struct dst_entry *dst, struct net_device *dev,
 	if (xdst->u.rt6.rt6i_idev->dev == dev) {
 		struct inet6_dev *loopback_idev =
 			in6_dev_get(dev_net(dev)->loopback_dev);
-		BUG_ON(!loopback_idev);
 
 		do {
 			in6_dev_put(xdst->u.rt6.rt6i_idev);
--
2.16.2
[PATCH][net-next] net: remove BUG_ON from __pskb_pull_tail
If list is a NULL pointer, the following access of list will trigger a
panic, which is the same as BUG_ON.

Signed-off-by: Li RongQing
---
 net/core/skbuff.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 396fcb3baad0..cd668b52f96f 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1925,7 +1925,6 @@ void *__pskb_pull_tail(struct sk_buff *skb, int delta)
 		struct sk_buff *insp = NULL;
 
 		do {
-			BUG_ON(!list);
 
 			if (list->len <= eat) {
 				/* Eaten as whole. */
--
2.16.2
Reply: [PATCH][RFC] udp: cache sock to avoid searching it twice
> >  		return pp;
> >  	}
>
> What if 'pp' is NULL?
>
> Aside from that, this replaces a lookup with 2 atomic ops, and only when
> such lookup is amortized on multiple aggregated packets: I'm unsure if
> it's worthy and I don't understand how that improves RR tests (where
> the socket can't see multiple, consecutive skbs, AFAIK).
>
> Cheers,
>
> Paolo

If we do not release the socket in udp_gro_complete, we can save a UDP
socket lookup when doing IP early demux again; that may be more worthy.

I tested UDP_STREAM and found no difference, both can reach the NIC's
limit, 10G; so to test RR, I will do more tests.

-RongQing
Reply: [PATCH][RFC] udp: cache sock to avoid searching it twice
On Sat, Nov 10, 2018 at 1:29 AM Eric Dumazet wrote:
>
> On 11/08/2018 10:21 PM, Li RongQing wrote:
> > GRO for UDP needs to lookup socket twice, first is in gro receive,
> > second is gro complete, so if store sock to skb to avoid looking up
> > twice, this can give small performance boost
> >
> > netperf -t UDP_RR -l 10
> >
> > Before:
> >	Rate per sec: 28746.01
> > After:
> >	Rate per sec: 29401.67
> >
> > Signed-off-by: Li RongQing
> > ---
> >  net/ipv4/udp_offload.c | 18 +-
> >  1 file changed, 17 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
> > index 0646d61f4fa8..429570112a33 100644
> > --- a/net/ipv4/udp_offload.c
> > +++ b/net/ipv4/udp_offload.c
> > @@ -408,6 +408,11 @@ struct sk_buff *udp_gro_receive(struct list_head *head, struct sk_buff *skb,
> >
> >  	if (udp_sk(sk)->gro_enabled) {
> >  		pp = call_gro_receive(udp_gro_receive_segment, head, skb);
> > +
> > +		if (!IS_ERR(pp) && NAPI_GRO_CB(pp)->count > 1) {
> > +			sock_hold(sk);
> > +			pp->sk = sk;
>
> You also have to set pp->destructor to sock_edemux
>
> flush_gro_hash -> kfree_skb()
>
> If there is no destructor, the reference on pp->sk will never be released.

Ok, thanks. Does it need to reset sk in udp_gro_complete? IP early demuxing
will look up the UDP socket again; if we can keep it, we can avoid looking
up the socket again.

-RongQing

> > +		}
> >  		rcu_read_unlock();
> >  		return pp;
> >  	}
> > @@ -444,6 +449,10 @@ struct sk_buff *udp_gro_receive(struct list_head *head, struct sk_buff *skb,
> >  	skb_gro_postpull_rcsum(skb, uh, sizeof(struct udphdr));
> >  	pp = call_gro_receive_sk(udp_sk(sk)->gro_receive, sk, head, skb);
> >
> > +	if (!IS_ERR(pp) && NAPI_GRO_CB(pp)->count > 1) {
> > +		sock_hold(sk);
> > +		pp->sk = sk;
> > +	}
> >  out_unlock:
> >  	rcu_read_unlock();
> >  	skb_gro_flush_final(skb, pp, flush);
> > @@ -502,7 +511,9 @@ int udp_gro_complete(struct sk_buff *skb, int nhoff,
> >  	uh->len = newlen;
> >
> >  	rcu_read_lock();
> > -	sk = (*lookup)(skb, uh->source, uh->dest);
> > +	sk = skb->sk;
> > +	if (!sk)
> > +		sk = (*lookup)(skb, uh->source, uh->dest);
> >  	if (sk && udp_sk(sk)->gro_enabled) {
> >  		err = udp_gro_complete_segment(skb);
> >  	} else if (sk && udp_sk(sk)->gro_complete) {
> > @@ -516,6 +527,11 @@ int udp_gro_complete(struct sk_buff *skb, int nhoff,
> >  		err = udp_sk(sk)->gro_complete(sk, skb,
> >  					       nhoff + sizeof(struct udphdr));
> >  	}
> > +
> > +	if (skb->sk) {
> > +		sock_put(skb->sk);
> > +		skb->sk = NULL;
> > +	}
> >  	rcu_read_unlock();
> >
> >  	if (skb->remcsum_offload)
[PATCH][net-next] net: tcp: remove BUG_ON from tcp_v4_err
If skb is a NULL pointer, the following access of skb's skb_mstamp_ns will
trigger a panic, which is the same as BUG_ON.

Signed-off-by: Li RongQing
---
 net/ipv4/tcp_ipv4.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index a336787d75e5..5424a4077c27 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -542,7 +542,6 @@ int tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
 			icsk->icsk_rto = inet_csk_rto_backoff(icsk, TCP_RTO_MAX);
 
 		skb = tcp_rtx_queue_head(sk);
-		BUG_ON(!skb);
 
 		tcp_mstamp_refresh(tp);
 		delta_us = (u32)(tp->tcp_mstamp - tcp_skb_timestamp_us(skb));
--
2.16.2
[PATCH][RFC] udp: cache sock to avoid searching it twice
GRO for UDP needs to lookup the socket twice, first in gro receive, second
in gro complete, so storing the sock in the skb to avoid looking it up twice
can give a small performance boost.

netperf -t UDP_RR -l 10

Before:
	Rate per sec: 28746.01
After:
	Rate per sec: 29401.67

Signed-off-by: Li RongQing
---
 net/ipv4/udp_offload.c | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 0646d61f4fa8..429570112a33 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -408,6 +408,11 @@ struct sk_buff *udp_gro_receive(struct list_head *head, struct sk_buff *skb,
 
 	if (udp_sk(sk)->gro_enabled) {
 		pp = call_gro_receive(udp_gro_receive_segment, head, skb);
+
+		if (!IS_ERR(pp) && NAPI_GRO_CB(pp)->count > 1) {
+			sock_hold(sk);
+			pp->sk = sk;
+		}
 		rcu_read_unlock();
 		return pp;
 	}
@@ -444,6 +449,10 @@ struct sk_buff *udp_gro_receive(struct list_head *head, struct sk_buff *skb,
 	skb_gro_postpull_rcsum(skb, uh, sizeof(struct udphdr));
 	pp = call_gro_receive_sk(udp_sk(sk)->gro_receive, sk, head, skb);
 
+	if (!IS_ERR(pp) && NAPI_GRO_CB(pp)->count > 1) {
+		sock_hold(sk);
+		pp->sk = sk;
+	}
 out_unlock:
 	rcu_read_unlock();
 	skb_gro_flush_final(skb, pp, flush);
@@ -502,7 +511,9 @@ int udp_gro_complete(struct sk_buff *skb, int nhoff,
 	uh->len = newlen;
 
 	rcu_read_lock();
-	sk = (*lookup)(skb, uh->source, uh->dest);
+	sk = skb->sk;
+	if (!sk)
+		sk = (*lookup)(skb, uh->source, uh->dest);
 	if (sk && udp_sk(sk)->gro_enabled) {
 		err = udp_gro_complete_segment(skb);
 	} else if (sk && udp_sk(sk)->gro_complete) {
@@ -516,6 +527,11 @@ int udp_gro_complete(struct sk_buff *skb, int nhoff,
 		err = udp_sk(sk)->gro_complete(sk, skb,
 					       nhoff + sizeof(struct udphdr));
 	}
+
+	if (skb->sk) {
+		sock_put(skb->sk);
+		skb->sk = NULL;
+	}
 	rcu_read_unlock();
 
 	if (skb->remcsum_offload)
--
2.16.2
[PATCH][net-next] openvswitch: remove BUG_ON from get_dpdev
If local is a NULL pointer, the following access of local's dev will trigger
a panic, which is the same as BUG_ON.

Signed-off-by: Li RongQing
---
 net/openvswitch/vport-netdev.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/net/openvswitch/vport-netdev.c b/net/openvswitch/vport-netdev.c
index 2e5e7a41d8ef..9bec22e3e9e8 100644
--- a/net/openvswitch/vport-netdev.c
+++ b/net/openvswitch/vport-netdev.c
@@ -84,7 +84,6 @@ static struct net_device *get_dpdev(const struct datapath *dp)
 	struct vport *local;
 
 	local = ovs_vport_ovsl(dp, OVSP_LOCAL);
-	BUG_ON(!local);
 	return local->dev;
 }
 
--
2.16.2
[PATCH][net-next][v2] net/ipv6: compute anycast address hash only if dev is null
Avoid computing the hash value if dev is not NULL, since the hash value is
not used in that case.

Signed-off-by: Li RongQing
---
 net/ipv6/anycast.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/anycast.c b/net/ipv6/anycast.c
index 94999058e110..cca3b3603c42 100644
--- a/net/ipv6/anycast.c
+++ b/net/ipv6/anycast.c
@@ -433,7 +433,6 @@ static bool ipv6_chk_acast_dev(struct net_device *dev, const struct in6_addr *ad
 bool ipv6_chk_acast_addr(struct net *net, struct net_device *dev,
 			 const struct in6_addr *addr)
 {
-	unsigned int hash = inet6_acaddr_hash(net, addr);
 	struct net_device *nh_dev;
 	struct ifacaddr6 *aca;
 	bool found = false;
@@ -441,7 +440,9 @@ bool ipv6_chk_acast_addr(struct net *net, struct net_device *dev,
 	rcu_read_lock();
 	if (dev)
 		found = ipv6_chk_acast_dev(dev, addr);
-	else
+	else {
+		unsigned int hash = inet6_acaddr_hash(net, addr);
+
 		hlist_for_each_entry_rcu(aca, &inet6_acaddr_lst[hash],
 					 aca_addr_lst) {
 			nh_dev = fib6_info_nh_dev(aca->aca_rt);
@@ -452,6 +453,7 @@ bool ipv6_chk_acast_addr(struct net *net, struct net_device *dev,
 			break;
 		}
 	}
+	}
 	rcu_read_unlock();
 	return found;
 }
--
2.16.2
[PATCH][net-next] net/ipv6: compute anycast address hash only if dev is null
Avoid computing the hash value if dev is not NULL, since the hash value is
not used in that case.

Signed-off-by: Li RongQing
---
 net/ipv6/anycast.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/anycast.c b/net/ipv6/anycast.c
index 94999058e110..a20e344486cb 100644
--- a/net/ipv6/anycast.c
+++ b/net/ipv6/anycast.c
@@ -433,15 +433,16 @@ static bool ipv6_chk_acast_dev(struct net_device *dev, const struct in6_addr *ad
 bool ipv6_chk_acast_addr(struct net *net, struct net_device *dev,
 			 const struct in6_addr *addr)
 {
-	unsigned int hash = inet6_acaddr_hash(net, addr);
 	struct net_device *nh_dev;
 	struct ifacaddr6 *aca;
 	bool found = false;
+	unsigned int hash;
 
 	rcu_read_lock();
 	if (dev)
 		found = ipv6_chk_acast_dev(dev, addr);
-	else
+	else {
+		hash = inet6_acaddr_hash(net, addr);
 		hlist_for_each_entry_rcu(aca, &inet6_acaddr_lst[hash],
 					 aca_addr_lst) {
 			nh_dev = fib6_info_nh_dev(aca->aca_rt);
@@ -452,6 +453,7 @@ bool ipv6_chk_acast_addr(struct net *net, struct net_device *dev,
 			break;
 		}
 	}
+	}
 	rcu_read_unlock();
 	return found;
 }
--
2.16.2
[net-next][PATCH] net/ipv4: fix a net leak
Put the net when an invalid ifindex is input, otherwise it will be leaked.

Fixes: 5fcd266a9f64 ("net/ipv4: Add support for dumping addresses for a specific device")
Cc: David Ahern
Signed-off-by: Zhang Yu
Signed-off-by: Li RongQing
---
 net/ipv4/devinet.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 63d5b58fbfdb..fd0c5a47e742 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -1775,8 +1775,10 @@ static int inet_dump_ifaddr(struct sk_buff *skb, struct netlink_callback *cb)
 
 	if (fillargs.ifindex) {
 		dev = __dev_get_by_index(tgt_net, fillargs.ifindex);
-		if (!dev)
+		if (!dev) {
+			put_net(tgt_net);
 			return -ENODEV;
+		}
 
 		in_dev = __in_dev_get_rtnl(dev);
 		if (in_dev) {
--
2.16.2
[PATCH][ipsec-next] xfrm: use correct size to initialise sp->ovec
This place wants to initialize the array, not one element, so it should be
sizeof(array) instead of sizeof(element). For now this array only has one
element, so there is no error in this condition, since
XFRM_MAX_OFFLOAD_DEPTH is 1.

Signed-off-by: Li RongQing
---
 net/xfrm/xfrm_input.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c
index be3520e429c9..684c0bc01e2c 100644
--- a/net/xfrm/xfrm_input.c
+++ b/net/xfrm/xfrm_input.c
@@ -131,7 +131,7 @@ struct sec_path *secpath_dup(struct sec_path *src)
 
 	sp->len = 0;
 	sp->olen = 0;
 
-	memset(sp->ovec, 0, sizeof(sp->ovec[XFRM_MAX_OFFLOAD_DEPTH]));
+	memset(sp->ovec, 0, sizeof(sp->ovec));
 
 	if (src) {
 		int i;
--
2.16.2
[PATCH][ipsec-next] xfrm: remove unnecessary check in xfrmi_get_stats64
If the tstats of a device are not allocated, this device was not registered
correctly and cannot be used.

Signed-off-by: Li RongQing
---
 net/xfrm/xfrm_interface.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/net/xfrm/xfrm_interface.c b/net/xfrm/xfrm_interface.c
index dc5b20bf29cf..abafd49cc65d 100644
--- a/net/xfrm/xfrm_interface.c
+++ b/net/xfrm/xfrm_interface.c
@@ -561,9 +561,6 @@ static void xfrmi_get_stats64(struct net_device *dev,
 {
 	int cpu;
 
-	if (!dev->tstats)
-		return;
-
 	for_each_possible_cpu(cpu) {
 		struct pcpu_sw_netstats *stats;
 		struct pcpu_sw_netstats tmp;
--
2.16.2
Re: [PATCH][net-next] ipv6: drop container_of when convert dst to rt6_info
> I don't understand why you are doing this? It is not going to be
> faster (or safer) than container_of. container_of provides the
> same functionality and is safe against the position of the member
> in the structure.

In fact, most places convert dst to rt6_info directly, and only a few
places use container_of:

net/ipv6/ip6_output.c:	struct rt6_info *rt = (struct rt6_info *)skb_dst(skb);
net/ipv6/route.c:	const struct rt6_info *rt = (struct rt6_info *)dst;

-Li
Re: [PATCH][net-next] ipv6: drop container_of when convert dst to rt6_info
> +	BUILD_BUG_ON(offsetof(struct rt6_info, dst) != 0);
> +

Please drop this patch, thanks, since the BUILD_BUG_ON has already been
added in ip6_fib.h:

include/net/ip6_fib.h:	BUILD_BUG_ON(offsetof(struct rt6_info, dst) != 0);

-Li
[PATCH] xfrm: fix gro_cells leak when remove virtual xfrm interfaces
The device gro_cells has been initialized; it should be freed, otherwise it
will be leaked.

Fixes: f203b76d78092faf2 ("xfrm: Add virtual xfrm interfaces")
Signed-off-by: Zhang Yu
Signed-off-by: Li RongQing
---
 net/xfrm/xfrm_interface.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/xfrm/xfrm_interface.c b/net/xfrm/xfrm_interface.c
index 4b4ef4f662d9..9cc6e72bc802 100644
--- a/net/xfrm/xfrm_interface.c
+++ b/net/xfrm/xfrm_interface.c
@@ -116,6 +116,9 @@ static void xfrmi_unlink(struct xfrmi_net *xfrmn, struct xfrm_if *xi)
 
 static void xfrmi_dev_free(struct net_device *dev)
 {
+	struct xfrm_if *xi = netdev_priv(dev);
+
+	gro_cells_destroy(&xi->gro_cells);
 	free_percpu(dev->tstats);
 }
 
--
2.16.2
[PATCH][net-next] ipv6: drop container_of when convert dst to rt6_info
We can save the container_of computation and cast dst directly, since dst is
always the first member of struct rt6_info.

Add a BUILD_BUG_ON() to catch any change that could break this assertion.

Signed-off-by: Li RongQing
---
 include/net/ip6_route.h | 4 +++-
 net/ipv6/route.c        | 6 +++---
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 7b9c82de11cc..1f09298634cb 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -194,8 +194,10 @@ static inline const struct rt6_info *skb_rt6_info(const struct sk_buff *skb)
 	const struct dst_entry *dst = skb_dst(skb);
 	const struct rt6_info *rt6 = NULL;
 
+	BUILD_BUG_ON(offsetof(struct rt6_info, dst) != 0);
+
 	if (dst)
-		rt6 = container_of(dst, struct rt6_info, dst);
+		rt6 = (struct rt6_info *)dst;
 
 	return rt6;
 }
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index d28f83e01593..3fb8034fc2d0 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -217,7 +217,7 @@ static struct neighbour *ip6_dst_neigh_lookup(const struct dst_entry *dst,
 					      struct sk_buff *skb,
 					      const void *daddr)
 {
-	const struct rt6_info *rt = container_of(dst, struct rt6_info, dst);
+	const struct rt6_info *rt = (struct rt6_info *)dst;
 
 	return ip6_neigh_lookup(&rt->rt6i_gateway, dst->dev, skb, daddr);
 }
@@ -2187,7 +2187,7 @@ static struct dst_entry *ip6_dst_check(struct dst_entry *dst, u32 cookie)
 	struct fib6_info *from;
 	struct rt6_info *rt;
 
-	rt = container_of(dst, struct rt6_info, dst);
+	rt = (struct rt6_info *)dst;
 
 	rcu_read_lock();
 
@@ -4911,7 +4911,7 @@ static int inet6_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh,
 	}
 
-	rt = container_of(dst, struct rt6_info, dst);
+	rt = (struct rt6_info *)dst;
 	if (rt->dst.error) {
 		err = rt->dst.error;
 		ip6_rt_put(rt);
--
2.16.2
[PATCH][net-next] net: drop container_of in dst_cache_get_ip4
We can save the container_of computation and cast dst directly, since dst is
always the first member of struct rtable, and any breaking will be caught by
the BUILD_BUG_ON in route.h:

include/net/route.h:	BUILD_BUG_ON(offsetof(struct rtable, dst) != 0);

Signed-off-by: Li RongQing
---
 net/core/dst_cache.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/dst_cache.c b/net/core/dst_cache.c
index 64cef977484a..0753838480fd 100644
--- a/net/core/dst_cache.c
+++ b/net/core/dst_cache.c
@@ -87,7 +87,7 @@ struct rtable *dst_cache_get_ip4(struct dst_cache *dst_cache, __be32 *saddr)
 		return NULL;
 
 	*saddr = idst->in_saddr.s_addr;
-	return container_of(dst, struct rtable, dst);
+	return (struct rtable *)dst;
 }
 EXPORT_SYMBOL_GPL(dst_cache_get_ip4);
--
2.16.2
Reply: [PATCH][next-next][v2] netlink: avoid to allocate full skb when sending to many devices
> Re: [PATCH][next-next][v2] netlink: avoid to allocate full skb when
> sending to many devices
>
> On 09/20/2018 06:43 AM, Eric Dumazet wrote:
>
> > Sorry, I should cc to you.
>
> And lastly this patch looks way too complicated to me.
> You probably can write something much simpler.
> But it should not increase the negative performance
> Something like :
>
> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
> index 930d17fa906c9ebf1cf7b6031ce0a22f9f66c0e4..e0a81beb4f37751421dbbe794ccf3d5a46bdf900 100644
> --- a/net/netlink/af_netlink.c
> +++ b/net/netlink/af_netlink.c
> @@ -278,22 +278,26 @@ static bool netlink_filter_tap(const struct sk_buff *skb)
>  	return false;
>  }
>
> -static int __netlink_deliver_tap_skb(struct sk_buff *skb,
> +static int __netlink_deliver_tap_skb(struct sk_buff **pskb,
>  				     struct net_device *dev)
>  {
> -	struct sk_buff *nskb;
> +	struct sk_buff *nskb, *skb = *pskb;
>  	struct sock *sk = skb->sk;
>  	int ret = -ENOMEM;
>
>  	if (!net_eq(dev_net(dev), sock_net(sk)))
>  		return 0;
>
> -	dev_hold(dev);
> -
> -	if (is_vmalloc_addr(skb->head))
> +	if (is_vmalloc_addr(skb->head)) {
>  		nskb = netlink_to_full_skb(skb, GFP_ATOMIC);
> -	else
> -		nskb = skb_clone(skb, GFP_ATOMIC);
> +		if (!nskb)
> +			return -ENOMEM;
> +		consume_skb(skb);

The original skb can not be freed, since it will be used after sending to
the tap in __netlink_sendskb.

> +		skb = nskb;
> +		*pskb = skb;
> +	}
> +	dev_hold(dev);
> +	nskb = skb_clone(skb, GFP_ATOMIC);

Since the original skb can not be freed, skb_clone will lead to a leak.

>  	if (nskb) {
>  		nskb->dev = dev;
>  		nskb->protocol = htons((u16) sk->sk_protocol);
> @@ -318,7 +322,7 @@ static void __netlink_deliver_tap(struct sk_buff *skb, struct netlink_tap_net *n
>  		return;
>
>  	list_for_each_entry_rcu(tmp, &nn->netlink_tap_all, list) {
> -		ret = __netlink_deliver_tap_skb(skb, tmp->dev);
> +		ret = __netlink_deliver_tap_skb(&skb, tmp->dev);
>  		if (unlikely(ret))
>  			break;
>  	}
>

The below change seems simple, but it increases skb allocation and freeing
one more time:

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index e3a0538ec0be..b9631137f0fe 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -290,10 +290,8 @@ static int __netlink_deliver_tap_skb(struct sk_buff *skb,
 
 	dev_hold(dev);
 
-	if (is_vmalloc_addr(skb->head))
-		nskb = netlink_to_full_skb(skb, GFP_ATOMIC);
-	else
-		nskb = skb_clone(skb, GFP_ATOMIC);
+	nskb = skb_clone(skb, GFP_ATOMIC);
+
 	if (nskb) {
 		nskb->dev = dev;
 		nskb->protocol = htons((u16) sk->sk_protocol);
@@ -317,11 +315,20 @@ static void __netlink_deliver_tap(struct sk_buff *skb, struct netlink_tap_net *n
 	if (!netlink_filter_tap(skb))
 		return;
 
+	if (is_vmalloc_addr(skb->head)) {
+		skb = netlink_to_full_skb(skb, GFP_ATOMIC);
+		if (!skb)
+			return;
+		alloc = true;
+	}
+
 	list_for_each_entry_rcu(tmp, &nn->netlink_tap_all, list) {
 		ret = __netlink_deliver_tap_skb(skb, tmp->dev);
 		if (unlikely(ret))
 			break;
 	}
+
+	if (alloc)
+		consume_skb(skb);
 }

-Q
[PATCH][next-next][v2] netlink: avoid to allocate full skb when sending to many devices
If skb->head is a vmalloc address, a full allocation for this skb is
required when it is delivered; if there are many devices, the full
allocation will be done for every device.

Now if it is vmalloc, allocate a new skb whose data is not a vmalloc
address, and use the newly allocated skb to clone and send, to avoid doing
the full allocation every time.

Signed-off-by: Zhang Yu
Signed-off-by: Li RongQing
---
 net/netlink/af_netlink.c | 37 ++++++++++++++++++++++++++++---------
 1 file changed, 30 insertions(+), 7 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index e3a0538ec0be..a5b1bf706526 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -279,21 +279,25 @@ static bool netlink_filter_tap(const struct sk_buff *skb)
 }
 
 static int __netlink_deliver_tap_skb(struct sk_buff *skb,
-				     struct net_device *dev)
+				     struct net_device *dev, bool alloc, bool last)
 {
 	struct sk_buff *nskb;
 	struct sock *sk = skb->sk;
 	int ret = -ENOMEM;
 
-	if (!net_eq(dev_net(dev), sock_net(sk)))
+	if (!net_eq(dev_net(dev), sock_net(sk))) {
+		if (last && alloc)
+			consume_skb(skb);
 		return 0;
+	}
 
 	dev_hold(dev);
 
-	if (is_vmalloc_addr(skb->head))
-		nskb = netlink_to_full_skb(skb, GFP_ATOMIC);
+	if (unlikely(last && alloc))
+		nskb = skb;
 	else
 		nskb = skb_clone(skb, GFP_ATOMIC);
 
 	if (nskb) {
 		nskb->dev = dev;
 		nskb->protocol = htons((u16) sk->sk_protocol);
@@ -303,6 +307,8 @@ static int __netlink_deliver_tap_skb(struct sk_buff *skb,
 		ret = dev_queue_xmit(nskb);
 		if (unlikely(ret > 0))
 			ret = net_xmit_errno(ret);
+	} else if (alloc) {
+		kfree_skb(skb);
 	}
 
 	dev_put(dev);
@@ -311,16 +317,33 @@ static int __netlink_deliver_tap_skb(struct sk_buff *skb,
 
 static void __netlink_deliver_tap(struct sk_buff *skb, struct netlink_tap_net *nn)
 {
+	struct netlink_tap *tmp, *next;
+	bool alloc = false;
 	int ret;
-	struct netlink_tap *tmp;
 
 	if (!netlink_filter_tap(skb))
 		return;
 
-	list_for_each_entry_rcu(tmp, &nn->netlink_tap_all, list) {
-		ret = __netlink_deliver_tap_skb(skb, tmp->dev);
+	tmp = list_first_or_null_rcu(&nn->netlink_tap_all,
+				     struct netlink_tap, list);
+	if (!tmp)
+		return;
+
+	if (is_vmalloc_addr(skb->head)) {
+		skb = netlink_to_full_skb(skb, GFP_ATOMIC);
+		if (!skb)
+			return;
+		alloc = true;
+	}
+
+	while (tmp) {
+		next = list_next_or_null_rcu(&nn->netlink_tap_all, &tmp->list,
+					     struct netlink_tap, list);
+
+		ret = __netlink_deliver_tap_skb(skb, tmp->dev, alloc, !next);
 		if (unlikely(ret))
 			break;
+		tmp = next;
 	}
 }
--
2.16.2
Reply: [PATCH][net-next] netlink: avoid to allocate full skb when sending to many devices
> On 09/17/2018 10:26 PM, Li RongQing wrote:
> > if skb->head is vmalloc address, when this skb is delivered, full
> > allocation for this skb is required, if there are many devices, the
> > ---
> >  net/netlink/af_netlink.c | 14 ++++++++------
> >  1 file changed, 8 insertions(+), 6 deletions(-)
> >
>
> This looks very broken to me.
>
> Only the original skb (given as an argument to __netlink_deliver_tap())
> is guaranteed to not disappear while the loop is performed.
>
> (There is no skb_clone() after the first netlink_to_full_skb())

Thank you; I will rework it.

-RongQing
[PATCH][net-next] netlink: avoid to allocate full skb when sending to many devices
If skb->head is a vmalloc address, a full allocation for this skb is
required when it is delivered; if there are many devices, the full
allocation will be done for every device.

Now use the skb allocated the first time when iterating over the other
devices to send, to reduce full allocations and speed up delivery.

Signed-off-by: Zhang Yu
Signed-off-by: Li RongQing
---
 net/netlink/af_netlink.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index e3a0538ec0be..095b99e3c1fb 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -278,11 +278,11 @@ static bool netlink_filter_tap(const struct sk_buff *skb)
 	return false;
 }
 
-static int __netlink_deliver_tap_skb(struct sk_buff *skb,
+static int __netlink_deliver_tap_skb(struct sk_buff **skb,
 				     struct net_device *dev)
 {
 	struct sk_buff *nskb;
-	struct sock *sk = skb->sk;
+	struct sock *sk = (*skb)->sk;
 	int ret = -ENOMEM;
 
 	if (!net_eq(dev_net(dev), sock_net(sk)))
@@ -290,10 +290,12 @@ static int __netlink_deliver_tap_skb(struct sk_buff *skb,
 
 	dev_hold(dev);
 
-	if (is_vmalloc_addr(skb->head))
-		nskb = netlink_to_full_skb(skb, GFP_ATOMIC);
+	if (is_vmalloc_addr((*skb)->head)) {
+		nskb = netlink_to_full_skb(*skb, GFP_ATOMIC);
+		*skb = nskb;
+	}
 	else
-		nskb = skb_clone(skb, GFP_ATOMIC);
+		nskb = skb_clone(*skb, GFP_ATOMIC);
 	if (nskb) {
 		nskb->dev = dev;
 		nskb->protocol = htons((u16) sk->sk_protocol);
@@ -318,7 +320,7 @@ static void __netlink_deliver_tap(struct sk_buff *skb, struct netlink_tap_net *n
 		return;
 
 	list_for_each_entry_rcu(tmp, &nn->netlink_tap_all, list) {
-		ret = __netlink_deliver_tap_skb(skb, tmp->dev);
+		ret = __netlink_deliver_tap_skb(&skb, tmp->dev);
 		if (unlikely(ret))
 			break;
 	}
--
2.16.2
[PATCH][net-next] veth: rename pcpu_vstats as pcpu_lstats
struct pcpu_vstats and pcpu_lstats have same members and usage, and pcpu_lstats is used in many files, so rename pcpu_vstats as pcpu_lstats to reduce duplicate definition Signed-off-by: Zhang Yu Signed-off-by: Li RongQing --- drivers/net/veth.c| 22 -- include/linux/netdevice.h | 1 - 2 files changed, 8 insertions(+), 15 deletions(-) diff --git a/drivers/net/veth.c b/drivers/net/veth.c index bc8faf13a731..aeecb5892e26 100644 --- a/drivers/net/veth.c +++ b/drivers/net/veth.c @@ -37,12 +37,6 @@ #define VETH_XDP_TXBIT(0) #define VETH_XDP_REDIR BIT(1) -struct pcpu_vstats { - u64 packets; - u64 bytes; - struct u64_stats_sync syncp; -}; - struct veth_rq { struct napi_struct xdp_napi; struct net_device *dev; @@ -217,7 +211,7 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev) skb_tx_timestamp(skb); if (likely(veth_forward_skb(rcv, skb, rq, rcv_xdp) == NET_RX_SUCCESS)) { - struct pcpu_vstats *stats = this_cpu_ptr(dev->vstats); + struct pcpu_lstats *stats = this_cpu_ptr(dev->lstats); u64_stats_update_begin(>syncp); stats->bytes += length; @@ -236,7 +230,7 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev) return NETDEV_TX_OK; } -static u64 veth_stats_one(struct pcpu_vstats *result, struct net_device *dev) +static u64 veth_stats_one(struct pcpu_lstats *result, struct net_device *dev) { struct veth_priv *priv = netdev_priv(dev); int cpu; @@ -244,7 +238,7 @@ static u64 veth_stats_one(struct pcpu_vstats *result, struct net_device *dev) result->packets = 0; result->bytes = 0; for_each_possible_cpu(cpu) { - struct pcpu_vstats *stats = per_cpu_ptr(dev->vstats, cpu); + struct pcpu_lstats *stats = per_cpu_ptr(dev->lstats, cpu); u64 packets, bytes; unsigned int start; @@ -264,7 +258,7 @@ static void veth_get_stats64(struct net_device *dev, { struct veth_priv *priv = netdev_priv(dev); struct net_device *peer; - struct pcpu_vstats one; + struct pcpu_lstats one; tot->tx_dropped = veth_stats_one(, dev); tot->tx_bytes = one.bytes; @@ 
-830,13 +824,13 @@ static int veth_dev_init(struct net_device *dev) { int err; - dev->vstats = netdev_alloc_pcpu_stats(struct pcpu_vstats); - if (!dev->vstats) + dev->lstats = netdev_alloc_pcpu_stats(struct pcpu_lstats); + if (!dev->lstats) return -ENOMEM; err = veth_alloc_queues(dev); if (err) { - free_percpu(dev->vstats); + free_percpu(dev->lstats); return err; } @@ -846,7 +840,7 @@ static int veth_dev_init(struct net_device *dev) static void veth_dev_free(struct net_device *dev) { veth_free_queues(dev); - free_percpu(dev->vstats); + free_percpu(dev->lstats); } #ifdef CONFIG_NET_POLL_CONTROLLER diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index baed5d5088c5..1cbbf77a685f 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -2000,7 +2000,6 @@ struct net_device { struct pcpu_lstats __percpu *lstats; struct pcpu_sw_netstats __percpu*tstats; struct pcpu_dstats __percpu *dstats; - struct pcpu_vstats __percpu *vstats; }; #if IS_ENABLED(CONFIG_GARP) -- 2.16.2
[PATCH][net-next] net: move definition of pcpu_lstats to header file
pcpu_lstats is defined in several files, so unify them as one and move to header file Signed-off-by: Zhang Yu Signed-off-by: Li RongQing --- drivers/net/loopback.c| 6 -- drivers/net/nlmon.c | 6 -- drivers/net/vsockmon.c| 14 -- include/linux/netdevice.h | 6 ++ 4 files changed, 10 insertions(+), 22 deletions(-) diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c index 30612497643c..a7207fa7e451 100644 --- a/drivers/net/loopback.c +++ b/drivers/net/loopback.c @@ -59,12 +59,6 @@ #include #include -struct pcpu_lstats { - u64 packets; - u64 bytes; - struct u64_stats_sync syncp; -}; - /* The higher levels take care of making this non-reentrant (it's * called with bh's disabled). */ diff --git a/drivers/net/nlmon.c b/drivers/net/nlmon.c index 4b22955de191..dd0db7534cb3 100644 --- a/drivers/net/nlmon.c +++ b/drivers/net/nlmon.c @@ -6,12 +6,6 @@ #include #include -struct pcpu_lstats { - u64 packets; - u64 bytes; - struct u64_stats_sync syncp; -}; - static netdev_tx_t nlmon_xmit(struct sk_buff *skb, struct net_device *dev) { int len = skb->len; diff --git a/drivers/net/vsockmon.c b/drivers/net/vsockmon.c index c28bdce14fd5..7bad5c95551f 100644 --- a/drivers/net/vsockmon.c +++ b/drivers/net/vsockmon.c @@ -11,12 +11,6 @@ #define DEFAULT_MTU (VIRTIO_VSOCK_MAX_PKT_BUF_SIZE + \ sizeof(struct af_vsockmon_hdr)) -struct pcpu_lstats { - u64 rx_packets; - u64 rx_bytes; - struct u64_stats_sync syncp; -}; - static int vsockmon_dev_init(struct net_device *dev) { dev->lstats = netdev_alloc_pcpu_stats(struct pcpu_lstats); @@ -56,8 +50,8 @@ static netdev_tx_t vsockmon_xmit(struct sk_buff *skb, struct net_device *dev) struct pcpu_lstats *stats = this_cpu_ptr(dev->lstats); u64_stats_update_begin(>syncp); - stats->rx_bytes += len; - stats->rx_packets++; + stats->bytes += len; + stats->packets++; u64_stats_update_end(>syncp); dev_kfree_skb(skb); @@ -80,8 +74,8 @@ vsockmon_get_stats64(struct net_device *dev, struct rtnl_link_stats64 *stats) do { start = u64_stats_fetch_begin_irq(>syncp); 
- tbytes = vstats->rx_bytes; - tpackets = vstats->rx_packets; + tbytes = vstats->bytes; + tpackets = vstats->packets; } while (u64_stats_fetch_retry_irq(>syncp, start)); packets += tpackets; diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index e2b3bd750c98..baed5d5088c5 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -2382,6 +2382,12 @@ struct pcpu_sw_netstats { struct u64_stats_sync syncp; }; +struct pcpu_lstats { + u64 packets; + u64 bytes; + struct u64_stats_sync syncp; +}; + #define __netdev_alloc_pcpu_stats(type, gfp) \ ({ \ typeof(type) __percpu *pcpu_stats = alloc_percpu_gfp(type, gfp);\ -- 2.16.2
[PATCH][net-next][v2] netlink: remove hash::nelems check in netlink_insert
The type of hash::nelems has been changed from size_t to atomic_t, which is in fact int, so there is no need to check whether BITS_PER_LONG (the bit width of size_t) is bigger than 32. rht_grow_above_max() will be called to check whether the hashtable is too big, ensuring it cannot grow beyond 1<<31.

Signed-off-by: Zhang Yu
Signed-off-by: Li RongQing
---
 net/netlink/af_netlink.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index b4a29bcc33b9..e3a0538ec0be 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -574,11 +574,6 @@ static int netlink_insert(struct sock *sk, u32 portid)
 	if (nlk_sk(sk)->bound)
 		goto err;
 
-	err = -ENOMEM;
-	if (BITS_PER_LONG > 32 &&
-	    unlikely(atomic_read(&table->hash.nelems) >= UINT_MAX))
-		goto err;
-
 	nlk_sk(sk)->portid = portid;
 	sock_hold(sk);
 
-- 
2.16.2
Re: [PATCH] netlink: fix hash::nelems check
After reconsidering, I think we can remove this check directly, since rht_grow_above_max() will be called in rhashtable_insert_one() to check for overflow again. And atomic_read(&table->hash.nelems) is always compared against unsigned values, which forces a conversion to unsigned, so the hash.nelems overflow can be tolerated.

-Rong
[PATCH] netlink: fix hash::nelems check
The type of hash::nelems has been changed from size_t to atomic_t, which is int in fact, so it can never reach UINT_MAX.

Fixes: 97defe1ecf86 ("rhashtable: Per bucket locks & deferred expansion/shrinking")
Signed-off-by: Zhang Yu
Signed-off-by: Li RongQing
---
 net/netlink/af_netlink.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index b4a29bcc33b9..412437baee63 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -575,8 +575,7 @@ static int netlink_insert(struct sock *sk, u32 portid)
 		goto err;
 
 	err = -ENOMEM;
-	if (BITS_PER_LONG > 32 &&
-	    unlikely(atomic_read(&table->hash.nelems) >= UINT_MAX))
+	if (unlikely(atomic_read(&table->hash.nelems) == INT_MAX))
 		goto err;
 
 	nlk_sk(sk)->portid = portid;
-- 
2.16.2
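The signed/unsigned subtlety behind both versions of this check can be shown in a few lines of userspace C. This is an illustrative sketch, not kernel code: `old_check`/`new_check` are hypothetical names modeling the condition before and after the patch, with a plain `int` standing in for `atomic_read(&table->hash.nelems)`.

```c
#include <limits.h>
#include <stdbool.h>

/* Old condition: an int compared against UINT_MAX.  The usual
 * arithmetic conversions turn the int into unsigned, so this is only
 * true once the counter has overflowed into negative values
 * (e.g. -1 converts to UINT_MAX). */
static bool old_check(int nelems)
{
    return (unsigned int)nelems >= UINT_MAX;
}

/* Patched condition: trip exactly at the signed maximum, before the
 * counter can wrap. */
static bool new_check(int nelems)
{
    return nelems == INT_MAX;
}
```

So the old test could never fire at INT_MAX elements, only after wraparound, which is the inconsistency the follow-up discussion (removing the check entirely in v2) settles.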
[PATCH][net-next] vxlan: reduce dirty cache line in vxlan_find_mac
vxlan_find_mac() unconditionally sets f->used for every packet. This causes a cache miss for every packet, since remote, hlist and used of vxlan_fdb share the same cache line, which is accessed when sending every packet. So set f->used only if it is not equal to jiffies, to dirty the cache line less often; this gives a 3% speed-up with small packets.

Signed-off-by: Zhang Yu
Signed-off-by: Li RongQing
---
 drivers/net/vxlan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index ababba37d735..e5d236595206 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -464,7 +464,7 @@ static struct vxlan_fdb *vxlan_find_mac(struct vxlan_dev *vxlan,
 	struct vxlan_fdb *f;
 
 	f = __vxlan_find_mac(vxlan, mac, vni);
-	if (f)
+	if (f && f->used != jiffies)
 		f->used = jiffies;
 
 	return f;
-- 
2.16.2
[PATCH][net-next] vxlan: reduce dirty cache line in vxlan_find_mac
vxlan_find_mac() unconditionally sets f->used for every packet. This causes a cache miss for every packet, since remote, hlist and used of vxlan_fdb share the same cache line. With this change f->used is set only if it is not equal to jiffies. This gives up to 5% speed-up with small packets.

Signed-off-by: Zhang Yu
Signed-off-by: Li RongQing
---
 drivers/net/vxlan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index ababba37d735..e5d236595206 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -464,7 +464,7 @@ static struct vxlan_fdb *vxlan_find_mac(struct vxlan_dev *vxlan,
 	struct vxlan_fdb *f;
 
 	f = __vxlan_find_mac(vxlan, mac, vni);
-	if (f)
+	if (f && f->used != jiffies)
 		f->used = jiffies;
 
 	return f;
-- 
2.16.2
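The write-if-changed idea in both vxlan versions (and in the tun `updated`/`queue_index` patch above) can be sketched in userspace C. This is an assumption-laden mock, not driver code: `fake_jiffies` stands in for the kernel tick counter, and `fdb_touch` is a hypothetical helper that reports whether the store actually happened.

```c
#include <stdint.h>
#include <stdbool.h>

/* Mock of the kernel's jiffies tick counter. */
static unsigned long fake_jiffies;

struct fdb {
    unsigned long used;   /* last-used timestamp, as in vxlan_fdb */
};

/* Skip the store when the timestamp has not advanced: back-to-back
 * lookups within one jiffy leave the cache line clean, so readers on
 * other CPUs keep a shared (non-dirtied) copy. Returns true only when
 * the entry was actually written. */
static bool fdb_touch(struct fdb *f)
{
    if (f->used == fake_jiffies)
        return false;
    f->used = fake_jiffies;
    return true;
}
```

The trade-off is one extra compare per packet against one avoided store (and the resulting cache-line ownership transfer) for every packet after the first in each jiffy, which is where the quoted 3-5% comes from.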
[PATCH][net-next][v2] packet: switch kvzalloc to allocate memory
The patches includes following change: *Use modern kvzalloc()/kvfree() instead of custom allocations. *Remove order argument for alloc_pg_vec, it can get from req. *Remove order argument for free_pg_vec, free_pg_vec now uses kvfree which does not need order argument. *Remove pg_vec_order from struct packet_ring_buffer, no longer need to save/restore 'order' *Remove variable 'order' for packet_set_ring, it is now unused Signed-off-by: Zhang Yu Signed-off-by: Li RongQing --- net/packet/af_packet.c | 44 +--- net/packet/internal.h | 1 - 2 files changed, 13 insertions(+), 32 deletions(-) diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index 75c92a87e7b2..5610061e7f2e 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -4137,52 +4137,36 @@ static const struct vm_operations_struct packet_mmap_ops = { .close = packet_mm_close, }; -static void free_pg_vec(struct pgv *pg_vec, unsigned int order, - unsigned int len) +static void free_pg_vec(struct pgv *pg_vec, unsigned int len) { int i; for (i = 0; i < len; i++) { if (likely(pg_vec[i].buffer)) { - if (is_vmalloc_addr(pg_vec[i].buffer)) - vfree(pg_vec[i].buffer); - else - free_pages((unsigned long)pg_vec[i].buffer, - order); + kvfree(pg_vec[i].buffer); pg_vec[i].buffer = NULL; } } kfree(pg_vec); } -static char *alloc_one_pg_vec_page(unsigned long order) +static char *alloc_one_pg_vec_page(unsigned long size) { char *buffer; - gfp_t gfp_flags = GFP_KERNEL | __GFP_COMP | - __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY; - buffer = (char *) __get_free_pages(gfp_flags, order); + buffer = kvzalloc(size, GFP_KERNEL); if (buffer) return buffer; - /* __get_free_pages failed, fall back to vmalloc */ - buffer = vzalloc(array_size((1 << order), PAGE_SIZE)); - if (buffer) - return buffer; + buffer = kvzalloc(size, GFP_KERNEL | __GFP_RETRY_MAYFAIL); - /* vmalloc failed, lets dig into swap here */ - gfp_flags &= ~__GFP_NORETRY; - buffer = (char *) __get_free_pages(gfp_flags, order); - if (buffer) - return buffer; - - 
/* complete and utter failure */ - return NULL; + return buffer; } -static struct pgv *alloc_pg_vec(struct tpacket_req *req, int order) +static struct pgv *alloc_pg_vec(struct tpacket_req *req) { unsigned int block_nr = req->tp_block_nr; + unsigned long size = req->tp_block_size; struct pgv *pg_vec; int i; @@ -4191,7 +4175,7 @@ static struct pgv *alloc_pg_vec(struct tpacket_req *req, int order) goto out; for (i = 0; i < block_nr; i++) { - pg_vec[i].buffer = alloc_one_pg_vec_page(order); + pg_vec[i].buffer = alloc_one_pg_vec_page(size); if (unlikely(!pg_vec[i].buffer)) goto out_free_pgvec; } @@ -4200,7 +4184,7 @@ static struct pgv *alloc_pg_vec(struct tpacket_req *req, int order) return pg_vec; out_free_pgvec: - free_pg_vec(pg_vec, order, block_nr); + free_pg_vec(pg_vec, block_nr); pg_vec = NULL; goto out; } @@ -4210,9 +4194,9 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u, { struct pgv *pg_vec = NULL; struct packet_sock *po = pkt_sk(sk); - int was_running, order = 0; struct packet_ring_buffer *rb; struct sk_buff_head *rb_queue; + int was_running; __be16 num; int err = -EINVAL; /* Added to avoid minimal code churn */ @@ -4274,8 +4258,7 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u, goto out; err = -ENOMEM; - order = get_order(req->tp_block_size); - pg_vec = alloc_pg_vec(req, order); + pg_vec = alloc_pg_vec(req); if (unlikely(!pg_vec)) goto out; switch (po->tp_version) { @@ -4329,7 +4312,6 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u, rb->frame_size = req->tp_frame_size; spin_unlock_bh(_queue->lock); - swap(rb->pg_vec_order, order); swap(rb->pg_vec_len, req->tp_block_nr); rb->pg_vec_pages = req->tp_block_size/PAGE_SIZE; @@ -4355,7 +4337,7 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u, } if (pg_vec) - free_pg_vec(pg_vec, order, req->tp_block_nr); +
[PATCH][net-next] packet: switch kvzalloc to allocate memory
Use modern kvzalloc()/kvfree() instead of custom allocations. And remove order argument for free_pg_vec and alloc_pg_vec, this argument is useless to kvfree, or can get from req. Signed-off-by: Zhang Yu Signed-off-by: Li RongQing --- net/packet/af_packet.c | 40 1 file changed, 12 insertions(+), 28 deletions(-) diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index 75c92a87e7b2..f28fcaba4f36 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -4137,52 +4137,36 @@ static const struct vm_operations_struct packet_mmap_ops = { .close = packet_mm_close, }; -static void free_pg_vec(struct pgv *pg_vec, unsigned int order, - unsigned int len) +static void free_pg_vec(struct pgv *pg_vec, unsigned int len) { int i; for (i = 0; i < len; i++) { if (likely(pg_vec[i].buffer)) { - if (is_vmalloc_addr(pg_vec[i].buffer)) - vfree(pg_vec[i].buffer); - else - free_pages((unsigned long)pg_vec[i].buffer, - order); + kvfree(pg_vec[i].buffer); pg_vec[i].buffer = NULL; } } kfree(pg_vec); } -static char *alloc_one_pg_vec_page(unsigned long order) +static char *alloc_one_pg_vec_page(unsigned long size) { char *buffer; - gfp_t gfp_flags = GFP_KERNEL | __GFP_COMP | - __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY; - buffer = (char *) __get_free_pages(gfp_flags, order); + buffer = kvzalloc(size, GFP_KERNEL); if (buffer) return buffer; - /* __get_free_pages failed, fall back to vmalloc */ - buffer = vzalloc(array_size((1 << order), PAGE_SIZE)); - if (buffer) - return buffer; + buffer = kvzalloc(size, GFP_KERNEL | __GFP_RETRY_MAYFAIL); - /* vmalloc failed, lets dig into swap here */ - gfp_flags &= ~__GFP_NORETRY; - buffer = (char *) __get_free_pages(gfp_flags, order); - if (buffer) - return buffer; - - /* complete and utter failure */ - return NULL; + return buffer; } -static struct pgv *alloc_pg_vec(struct tpacket_req *req, int order) +static struct pgv *alloc_pg_vec(struct tpacket_req *req) { unsigned int block_nr = req->tp_block_nr; + unsigned long size = 
req->tp_block_size; struct pgv *pg_vec; int i; @@ -4191,7 +4175,7 @@ static struct pgv *alloc_pg_vec(struct tpacket_req *req, int order) goto out; for (i = 0; i < block_nr; i++) { - pg_vec[i].buffer = alloc_one_pg_vec_page(order); + pg_vec[i].buffer = alloc_one_pg_vec_page(size); if (unlikely(!pg_vec[i].buffer)) goto out_free_pgvec; } @@ -4200,7 +4184,7 @@ static struct pgv *alloc_pg_vec(struct tpacket_req *req, int order) return pg_vec; out_free_pgvec: - free_pg_vec(pg_vec, order, block_nr); + free_pg_vec(pg_vec, block_nr); pg_vec = NULL; goto out; } @@ -4275,7 +4259,7 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u, err = -ENOMEM; order = get_order(req->tp_block_size); - pg_vec = alloc_pg_vec(req, order); + pg_vec = alloc_pg_vec(req); if (unlikely(!pg_vec)) goto out; switch (po->tp_version) { @@ -4355,7 +4339,7 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u, } if (pg_vec) - free_pg_vec(pg_vec, order, req->tp_block_nr); + free_pg_vec(pg_vec, req->tp_block_nr); out: return err; } -- 2.16.2
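The try-then-fallback shape that `kvzalloc()` replaces in both packet patches can be sketched in userspace C. This is a hedged mock, not the kernel allocator: `fake_kvzalloc` and `contig_limit` are invented names, both paths are plain `calloc`, and the threshold merely stands in for the point where high-order contiguous allocation starts to fail.

```c
#include <stdlib.h>

/* Try a "physically contiguous" allocation first; fall back to a
 * second allocator when the request is too big for that path.  The
 * kernel's kvzalloc() does this internally (kmalloc, then vmalloc),
 * which is why the old three-attempt dance in alloc_one_pg_vec_page()
 * collapses to one or two calls. */
static void *fake_kvzalloc(size_t size, size_t contig_limit, int *used_fallback)
{
    *used_fallback = 0;
    if (size <= contig_limit)
        return calloc(1, size);   /* "kmalloc"-style path */
    *used_fallback = 1;
    return calloc(1, size);       /* "vmalloc"-style fallback */
}
```

A matching `kvfree()` then makes `free_pg_vec()` indifferent to which path produced the buffer, which is why the `order` argument could be dropped.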
[PATCH][net-next] tun: not use hardcoded mask value
0x3ff in tun_hashfn is the mask for TUN_NUM_FLOW_ENTRIES. Instead of hardcoding it, define a macro that derives the mask from TUN_NUM_FLOW_ENTRIES so the relationship is explicit.

Signed-off-by: Li RongQing
---
 drivers/net/tun.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 0a3134712652..2bbefe828670 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -200,6 +200,7 @@ struct tun_flow_entry {
 };
 
 #define TUN_NUM_FLOW_ENTRIES 1024
+#define TUN_MASK_FLOW_ENTRIES (TUN_NUM_FLOW_ENTRIES - 1)
 
 struct tun_prog {
 	struct rcu_head rcu;
@@ -406,7 +407,7 @@ static inline __virtio16 cpu_to_tun16(struct tun_struct *tun, u16 val)
 
 static inline u32 tun_hashfn(u32 rxhash)
 {
-	return rxhash & 0x3ff;
+	return rxhash & TUN_MASK_FLOW_ENTRIES;
}
 
 static struct tun_flow_entry *tun_flow_find(struct hlist_head *head, u32 rxhash)
-- 
2.16.2
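The reason the derived mask is safe to substitute for the literal: for any power-of-two table size N, `hash & (N - 1)` equals `hash % N`. A small standalone sketch (the macros mirror the patch; `tun_hashfn` here is a userspace copy, not the driver function):

```c
#include <stdint.h>

#define TUN_NUM_FLOW_ENTRIES 1024
#define TUN_MASK_FLOW_ENTRIES (TUN_NUM_FLOW_ENTRIES - 1)

/* Userspace copy of the patched hash function: keep the low 10 bits,
 * i.e. reduce the rxhash modulo the (power-of-two) table size. */
static uint32_t tun_hashfn(uint32_t rxhash)
{
    return rxhash & TUN_MASK_FLOW_ENTRIES;
}
```

If TUN_NUM_FLOW_ENTRIES were ever changed to a non-power-of-two, the mask trick would silently break, so the macro encodes the dependency but still relies on the size staying a power of two.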
[PATCH][net-next] net: check extack._msg before print
dev_set_mtu_ext is able to fail with a valid mtu value; in that condition, extack._msg is not set and is random since it lives on the stack, so the kernel will crash when printing it.

Fixes: 7a4c53bee3324a ("net: report invalid mtu value via netlink extack")
Signed-off-by: Zhang Yu
Signed-off-by: Li RongQing
---
 net/core/dev.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 36e994519488..f68122f0ab02 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -7583,8 +7583,9 @@ int dev_set_mtu(struct net_device *dev, int new_mtu)
 	struct netlink_ext_ack extack;
 	int err;
 
+	memset(&extack, 0, sizeof(extack));
 	err = dev_set_mtu_ext(dev, new_mtu, &extack);
-	if (err)
+	if (err && extack._msg)
 		net_err_ratelimited("%s: %s\n", dev->name, extack._msg);
 	return err;
 }
-- 
2.16.2
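The bug class here, reading a pointer field of an uninitialized stack struct, and the two-part fix (zero the struct, then test the field before printing) can be mocked in a few lines. This is an illustrative sketch: `fake_ext_ack` and `safe_to_print` are invented names standing in for a cut-down `netlink_ext_ack` and the patched print condition.

```c
#include <string.h>
#include <stdbool.h>

/* Cut-down stand-in for struct netlink_ext_ack: on the stack, _msg is
 * garbage unless somebody stores a message into it. */
struct fake_ext_ack {
    const char *_msg;
};

/* The patched condition: print only when the call failed AND a message
 * was actually attached.  Zeroing the struct beforehand makes the
 * "no message" case a well-defined NULL instead of a wild pointer. */
static bool safe_to_print(const struct fake_ext_ack *extack, int err)
{
    return err && extack->_msg;
}
```

Note that both halves are needed: without the `memset`, `_msg` may be non-NULL garbage and the `&& extack->_msg` test alone would still dereference it; without the test, a zeroed `_msg` would be passed to the printf-style printer as a NULL `%s`.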
[PATCH][net-next] openvswitch: eliminate cpu_used_mask from sw_flow
The size of struct cpumask varies with CONFIG_NR_CPUS, some config CONFIG_NR_CPUS is very larger, like 5120, struct cpumask will take 640 bytes, if there is thousands of flows, it will take lots of memory cpu_used_mask has two purposes 1: Assume first cpu as cpu0 which maybe not true; now use cpumask_first(cpu_possible_mask) 2: when get/clear statistic, reduce the iteratation; but it is not hot path, so use for_each_possible_cpu Signed-off-by: Zhang Yu Signed-off-by: Li RongQing --- net/openvswitch/flow.c | 11 +-- net/openvswitch/flow.h | 5 ++--- net/openvswitch/flow_table.c | 11 +-- 3 files changed, 12 insertions(+), 15 deletions(-) diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c index 56b8e7167790..ad580bec00fb 100644 --- a/net/openvswitch/flow.c +++ b/net/openvswitch/flow.c @@ -85,7 +85,9 @@ void ovs_flow_stats_update(struct sw_flow *flow, __be16 tcp_flags, if (cpu == 0 && unlikely(flow->stats_last_writer != cpu)) flow->stats_last_writer = cpu; } else { - stats = rcu_dereference(flow->stats[0]); /* Pre-allocated. */ + int cpu1 = cpumask_first(cpu_possible_mask); + + stats = rcu_dereference(flow->stats[cpu1]); /* Pre-allocated. 
*/ spin_lock(>lock); /* If the current CPU is the only writer on the @@ -118,7 +120,6 @@ void ovs_flow_stats_update(struct sw_flow *flow, __be16 tcp_flags, rcu_assign_pointer(flow->stats[cpu], new_stats); - cpumask_set_cpu(cpu, >cpu_used_mask); goto unlock; } } @@ -145,8 +146,7 @@ void ovs_flow_stats_get(const struct sw_flow *flow, *tcp_flags = 0; memset(ovs_stats, 0, sizeof(*ovs_stats)); - /* We open code this to make sure cpu 0 is always considered */ - for (cpu = 0; cpu < nr_cpu_ids; cpu = cpumask_next(cpu, >cpu_used_mask)) { + for_each_possible_cpu(cpu) { struct flow_stats *stats = rcu_dereference_ovsl(flow->stats[cpu]); if (stats) { @@ -169,8 +169,7 @@ void ovs_flow_stats_clear(struct sw_flow *flow) { int cpu; - /* We open code this to make sure cpu 0 is always considered */ - for (cpu = 0; cpu < nr_cpu_ids; cpu = cpumask_next(cpu, >cpu_used_mask)) { + for_each_possible_cpu(cpu) { struct flow_stats *stats = ovsl_dereference(flow->stats[cpu]); if (stats) { diff --git a/net/openvswitch/flow.h b/net/openvswitch/flow.h index c670dd24b8b7..d0ea5d6ced3e 100644 --- a/net/openvswitch/flow.h +++ b/net/openvswitch/flow.h @@ -223,17 +223,16 @@ struct sw_flow { u32 hash; } flow_table, ufid_table; int stats_last_writer; /* CPU id of the last writer on -* 'stats[0]'. +* 'stats[first cpu id]'. */ struct sw_flow_key key; struct sw_flow_id id; - struct cpumask cpu_used_mask; struct sw_flow_mask *mask; struct sw_flow_actions __rcu *sf_acts; struct flow_stats __rcu *stats[]; /* One for each CPU. First one * is allocated at flow creation time, * the rest are allocated on demand - * while holding the 'stats[0].lock'. 
+ * while holding the 'stats[first cpu id].lock' */ }; diff --git a/net/openvswitch/flow_table.c b/net/openvswitch/flow_table.c index 80ea2a71852e..e4dbd65c308a 100644 --- a/net/openvswitch/flow_table.c +++ b/net/openvswitch/flow_table.c @@ -80,6 +80,7 @@ struct sw_flow *ovs_flow_alloc(void) { struct sw_flow *flow; struct flow_stats *stats; + int cpu = cpumask_first(cpu_possible_mask); flow = kmem_cache_zalloc(flow_cache, GFP_KERNEL); if (!flow) @@ -90,15 +91,13 @@ struct sw_flow *ovs_flow_alloc(void) /* Initialize the default stat node. */ stats = kmem_cache_alloc_node(flow_stats_cache, GFP_KERNEL | __GFP_ZERO, - node_online(0) ? 0 : NUMA_NO_NODE); + cpu_to_node(cpu)); if (!stats) goto err; spin_lock_init(>lock); - RCU_INIT_POINTER(flow->stats[0], stats); - - cpumask_set_cpu(0, >cpu_used_mask); + RCU_INIT_POINTER(flow->stats[cpu], stats); return flow; err: @@ -142,11 +141,11 @@ static void flow_free(struct sw_flow *flow)
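The memory figure in the commit message is simple arithmetic worth making explicit: `struct cpumask` is a bitmap of CONFIG_NR_CPUS bits. A tiny sketch (`cpumask_bytes` is a hypothetical helper for the calculation, not a kernel function):

```c
#include <stddef.h>

/* A struct cpumask is NR_CPUS bits rounded up to whole bytes (the real
 * struct rounds to whole longs, which matches for multiples of 64).
 * With CONFIG_NR_CPUS=5120 that is 640 bytes embedded in every
 * sw_flow, which is the cost the patch removes. */
static size_t cpumask_bytes(size_t nr_cpus)
{
    return (nr_cpus + 7) / 8;
}
```

With thousands of concurrent flows, 640 bytes per flow adds up to megabytes, while the mask's two uses (naming the pre-allocated stats slot and shortening a non-hot-path iteration) are cheap to replace with `cpumask_first(cpu_possible_mask)` and `for_each_possible_cpu()`.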
[PATCH][v3] netfilter: use kvmalloc_array to allocate memory for hashtable
nf_ct_alloc_hashtable is used to allocate memory for conntrack, NAT bysrc and expectation hashtable. Assuming 64k bucket size, which means 7th order page allocation, __get_free_pages, called by nf_ct_alloc_hashtable, will trigger the direct memory reclaim and stall for a long time, when system has lots of memory stress so replace combination of __get_free_pages and vzalloc with kvmalloc_array, which provides a overflow check and a fallback if no high order memory is available, and do not retry to reclaim memory, reduce stall and remove nf_ct_free_hashtable, since it is just a kvfree Signed-off-by: Zhang Yu Signed-off-by: Wang Li Signed-off-by: Li RongQing --- include/net/netfilter/nf_conntrack.h | 2 -- net/netfilter/nf_conntrack_core.c| 29 ++--- net/netfilter/nf_conntrack_expect.c | 2 +- net/netfilter/nf_conntrack_helper.c | 4 ++-- net/netfilter/nf_nat_core.c | 4 ++-- 5 files changed, 11 insertions(+), 30 deletions(-) diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h index a2b0ed025908..7e012312cd61 100644 --- a/include/net/netfilter/nf_conntrack.h +++ b/include/net/netfilter/nf_conntrack.h @@ -176,8 +176,6 @@ void nf_ct_netns_put(struct net *net, u8 nfproto); */ void *nf_ct_alloc_hashtable(unsigned int *sizep, int nulls); -void nf_ct_free_hashtable(void *hash, unsigned int size); - int nf_conntrack_hash_check_insert(struct nf_conn *ct); bool nf_ct_delete(struct nf_conn *ct, u32 pid, int report); diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c index 8a113ca1eea2..429151b4991a 100644 --- a/net/netfilter/nf_conntrack_core.c +++ b/net/netfilter/nf_conntrack_core.c @@ -2022,16 +2022,6 @@ static int kill_all(struct nf_conn *i, void *data) return net_eq(nf_ct_net(i), data); } -void nf_ct_free_hashtable(void *hash, unsigned int size) -{ - if (is_vmalloc_addr(hash)) - vfree(hash); - else - free_pages((unsigned long)hash, - get_order(sizeof(struct hlist_head) * size)); -} 
-EXPORT_SYMBOL_GPL(nf_ct_free_hashtable); - void nf_conntrack_cleanup_start(void) { conntrack_gc_work.exiting = true; @@ -2042,7 +2032,7 @@ void nf_conntrack_cleanup_end(void) { RCU_INIT_POINTER(nf_ct_hook, NULL); cancel_delayed_work_sync(_gc_work.dwork); - nf_ct_free_hashtable(nf_conntrack_hash, nf_conntrack_htable_size); + kvfree(nf_conntrack_hash); nf_conntrack_proto_fini(); nf_conntrack_seqadj_fini(); @@ -2108,7 +2098,6 @@ void *nf_ct_alloc_hashtable(unsigned int *sizep, int nulls) { struct hlist_nulls_head *hash; unsigned int nr_slots, i; - size_t sz; if (*sizep > (UINT_MAX / sizeof(struct hlist_nulls_head))) return NULL; @@ -2116,14 +2105,8 @@ void *nf_ct_alloc_hashtable(unsigned int *sizep, int nulls) BUILD_BUG_ON(sizeof(struct hlist_nulls_head) != sizeof(struct hlist_head)); nr_slots = *sizep = roundup(*sizep, PAGE_SIZE / sizeof(struct hlist_nulls_head)); - if (nr_slots > (UINT_MAX / sizeof(struct hlist_nulls_head))) - return NULL; - - sz = nr_slots * sizeof(struct hlist_nulls_head); - hash = (void *)__get_free_pages(GFP_KERNEL | __GFP_NOWARN | __GFP_ZERO, - get_order(sz)); - if (!hash) - hash = vzalloc(sz); + hash = kvmalloc_array(nr_slots, sizeof(struct hlist_nulls_head), + GFP_KERNEL | __GFP_ZERO); if (hash && nulls) for (i = 0; i < nr_slots; i++) @@ -2150,7 +2133,7 @@ int nf_conntrack_hash_resize(unsigned int hashsize) old_size = nf_conntrack_htable_size; if (old_size == hashsize) { - nf_ct_free_hashtable(hash, hashsize); + kvfree(hash); return 0; } @@ -2186,7 +2169,7 @@ int nf_conntrack_hash_resize(unsigned int hashsize) local_bh_enable(); synchronize_net(); - nf_ct_free_hashtable(old_hash, old_size); + kvfree(old_hash); return 0; } @@ -2350,7 +2333,7 @@ int nf_conntrack_init_start(void) err_expect: kmem_cache_destroy(nf_conntrack_cachep); err_cachep: - nf_ct_free_hashtable(nf_conntrack_hash, nf_conntrack_htable_size); + kvfree(nf_conntrack_hash); return ret; } diff --git a/net/netfilter/nf_conntrack_expect.c b/net/netfilter/nf_conntrack_expect.c index 
3f586ba23d92..27b84231db10 100644 --- a/net/netfilter/nf_conntrack_expect.c +++ b/net/netfilter/nf_conntrack_expect.c @@ -712,5 +712,5 @@ void nf_conntrack_expect_fini(void) { rcu_barrier(); /* Wait for call_rcu() before destroy */ kmem_cache_destroy(nf_ct_expect_cachep); - nf_ct_free_hashtable(nf_ct_expect_hash, nf_ct_expect_hsize); + kvfree(nf_ct_expect_hash); } diff --gi
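The "64k buckets means a 7th-order allocation" claim in the commit message checks out numerically. A userspace sketch (assumptions: `fake_get_order` re-implements the kernel's `get_order()` for illustration, 4 KiB pages, 8-byte `hlist_nulls_head` on 64-bit):

```c
#include <stddef.h>

#define PAGE_SIZE_BYTES 4096UL

/* Userspace re-implementation of get_order(): smallest order such that
 * 2^order pages cover the requested size. */
static unsigned int fake_get_order(unsigned long size)
{
    unsigned long pages = (size + PAGE_SIZE_BYTES - 1) / PAGE_SIZE_BYTES;
    unsigned int order = 0;

    while ((1UL << order) < pages)
        order++;
    return order;
}
```

65536 buckets x 8 bytes = 512 KiB = 128 contiguous pages = 2^7, exactly the order at which `__get_free_pages` under memory pressure is prone to long direct-reclaim stalls, which is what motivates the `kvmalloc_array` fallback.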
Re: [PATCH][v2] netfilter: use kvzalloc to allocate memory for hashtable
> -----Original Message-----
> From: Eric Dumazet [mailto:eric.duma...@gmail.com]
> Sent: July 25, 2018 13:45
> To: Li,Rongqing ; netdev@vger.kernel.org; pa...@netfilter.org;
> kad...@blackhole.kfki.hu; f...@strlen.de; netfilter-de...@vger.kernel.org;
> coret...@netfilter.org; eduma...@google.com
> Subject: Re: [PATCH][v2] netfilter: use kvzalloc to allocate memory for hashtable
>
> On 07/24/2018 10:34 PM, Li RongQing wrote:
> > nf_ct_alloc_hashtable is used to allocate memory for conntrack, NAT
> > bysrc and expectation hashtable. Assuming 64k bucket size, which means
> > 7th order page allocation, __get_free_pages, called by
> > nf_ct_alloc_hashtable, will trigger the direct memory reclaim and
> > stall for a long time, when system has lots of memory stress
>
> ...
>
> > 	sz = nr_slots * sizeof(struct hlist_nulls_head);
> > -	hash = (void *)__get_free_pages(GFP_KERNEL | __GFP_NOWARN | __GFP_ZERO,
> > -					get_order(sz));
> > -	if (!hash)
> > -		hash = vzalloc(sz);
> > +	hash = kvzalloc(sz, GFP_KERNEL);
>
> You could remove the @sz computation and call
>
> hash = kvcalloc(nr_slots, sizeof(struct hlist_nulls_head), GFP_KERNEL);
>
> Thanks to the kvmalloc_array() check, you also could remove the:
>
> if (nr_slots > (UINT_MAX / sizeof(struct hlist_nulls_head)))
> 	return NULL;
>
> That would remove a lot of stuff now we have proper helpers.

Ok, I will send v3

Thanks

-RongQing
[PATCH][v2] netfilter: use kvzalloc to allocate memory for hashtable
nf_ct_alloc_hashtable is used to allocate memory for conntrack, NAT bysrc and expectation hashtable. Assuming 64k bucket size, which means 7th order page allocation, __get_free_pages, called by nf_ct_alloc_hashtable, will trigger the direct memory reclaim and stall for a long time, when system has lots of memory stress so replace combination of __get_free_pages and vzalloc with kvzalloc, which provides a fallback if no high order memory is available, and do not retry to reclaim memory, reduce stall and remove nf_ct_free_hashtable, since it is just a kvfree Signed-off-by: Zhang Yu Signed-off-by: Wang Li Signed-off-by: Li RongQing --- include/net/netfilter/nf_conntrack.h | 2 -- net/netfilter/nf_conntrack_core.c| 23 +-- net/netfilter/nf_conntrack_expect.c | 2 +- net/netfilter/nf_conntrack_helper.c | 4 ++-- net/netfilter/nf_nat_core.c | 4 ++-- 5 files changed, 10 insertions(+), 25 deletions(-) diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h index a2b0ed025908..7e012312cd61 100644 --- a/include/net/netfilter/nf_conntrack.h +++ b/include/net/netfilter/nf_conntrack.h @@ -176,8 +176,6 @@ void nf_ct_netns_put(struct net *net, u8 nfproto); */ void *nf_ct_alloc_hashtable(unsigned int *sizep, int nulls); -void nf_ct_free_hashtable(void *hash, unsigned int size); - int nf_conntrack_hash_check_insert(struct nf_conn *ct); bool nf_ct_delete(struct nf_conn *ct, u32 pid, int report); diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c index 8a113ca1eea2..191140c469d5 100644 --- a/net/netfilter/nf_conntrack_core.c +++ b/net/netfilter/nf_conntrack_core.c @@ -2022,16 +2022,6 @@ static int kill_all(struct nf_conn *i, void *data) return net_eq(nf_ct_net(i), data); } -void nf_ct_free_hashtable(void *hash, unsigned int size) -{ - if (is_vmalloc_addr(hash)) - vfree(hash); - else - free_pages((unsigned long)hash, - get_order(sizeof(struct hlist_head) * size)); -} -EXPORT_SYMBOL_GPL(nf_ct_free_hashtable); - void 
nf_conntrack_cleanup_start(void) { conntrack_gc_work.exiting = true; @@ -2042,7 +2032,7 @@ void nf_conntrack_cleanup_end(void) { RCU_INIT_POINTER(nf_ct_hook, NULL); cancel_delayed_work_sync(_gc_work.dwork); - nf_ct_free_hashtable(nf_conntrack_hash, nf_conntrack_htable_size); + kvfree(nf_conntrack_hash); nf_conntrack_proto_fini(); nf_conntrack_seqadj_fini(); @@ -2120,10 +2110,7 @@ void *nf_ct_alloc_hashtable(unsigned int *sizep, int nulls) return NULL; sz = nr_slots * sizeof(struct hlist_nulls_head); - hash = (void *)__get_free_pages(GFP_KERNEL | __GFP_NOWARN | __GFP_ZERO, - get_order(sz)); - if (!hash) - hash = vzalloc(sz); + hash = kvzalloc(sz, GFP_KERNEL); if (hash && nulls) for (i = 0; i < nr_slots; i++) @@ -2150,7 +2137,7 @@ int nf_conntrack_hash_resize(unsigned int hashsize) old_size = nf_conntrack_htable_size; if (old_size == hashsize) { - nf_ct_free_hashtable(hash, hashsize); + kvfree(hash); return 0; } @@ -2186,7 +2173,7 @@ int nf_conntrack_hash_resize(unsigned int hashsize) local_bh_enable(); synchronize_net(); - nf_ct_free_hashtable(old_hash, old_size); + kvfree(old_hash); return 0; } @@ -2350,7 +2337,7 @@ int nf_conntrack_init_start(void) err_expect: kmem_cache_destroy(nf_conntrack_cachep); err_cachep: - nf_ct_free_hashtable(nf_conntrack_hash, nf_conntrack_htable_size); + kvfree(nf_conntrack_hash); return ret; } diff --git a/net/netfilter/nf_conntrack_expect.c b/net/netfilter/nf_conntrack_expect.c index 3f586ba23d92..27b84231db10 100644 --- a/net/netfilter/nf_conntrack_expect.c +++ b/net/netfilter/nf_conntrack_expect.c @@ -712,5 +712,5 @@ void nf_conntrack_expect_fini(void) { rcu_barrier(); /* Wait for call_rcu() before destroy */ kmem_cache_destroy(nf_ct_expect_cachep); - nf_ct_free_hashtable(nf_ct_expect_hash, nf_ct_expect_hsize); + kvfree(nf_ct_expect_hash); } diff --git a/net/netfilter/nf_conntrack_helper.c b/net/netfilter/nf_conntrack_helper.c index d557a425289d..e24b762ffa1d 100644 --- a/net/netfilter/nf_conntrack_helper.c +++ 
b/net/netfilter/nf_conntrack_helper.c @@ -562,12 +562,12 @@ int nf_conntrack_helper_init(void) return 0; out_extend: - nf_ct_free_hashtable(nf_ct_helper_hash, nf_ct_helper_hsize); + kvfree(nf_ct_helper_hash); return ret; } void nf_conntrack_helper_fini(void) { nf_ct_extend_unregister(&helper_extend); - nf_ct_free_hashtable(nf_ct_helper_hash, nf_ct_helper_hsize); + kvfree(nf_ct_helper_hash); } diff --git a/net/netfilter/nf_nat_core.c b/net/netfilter/nf_nat_c
Re: Re: [PATCH] netfilter: avoid stalls in nf_ct_alloc_hashtable
> > On 07/24/2018 02:50 AM, Li,Rongqing wrote: > > > Thanks, Your patch fixes my issue; > > > > My patch may be able to reduce stall when modprobe nf module in > memory > > stress, Do you think this patch has any value? > > Only if you make it use kvzalloc()/kvfree() > > Thanks. I will send v2; feel free to add your signature. Thanks, -RongQing
Re: [PATCH] netfilter: avoid stalls in nf_ct_alloc_hashtable
> -----Original Message----- > From: Florian Westphal [mailto:f...@strlen.de] > Sent: July 24, 2018 17:20 > To: Li,Rongqing > Cc: netdev@vger.kernel.org; pa...@netfilter.org; > kad...@blackhole.kfki.hu; f...@strlen.de > Subject: Re: [PATCH] netfilter: avoid stalls in nf_ct_alloc_hashtable > > Li RongQing wrote: > > when system forks a process with CLONE_NEWNET flag under the high > > memory pressure, it will trigger memory reclaim and stall for a long > > time because nf_ct_alloc_hashtable need to allocate high-order memory > > at that time. The calltrace as below: > > > nf_ct_alloc_hashtable > > nf_conntrack_init_net > > This call trace is from a kernel < 4.7. > Sorry; it is > commit 56d52d4892d0e478a005b99ed10d0a7f488ea8c1 > netfilter: conntrack: use a single hashtable for all namespaces > > removed per-netns hash table. Thanks, your patch fixes my issue. My patch may still reduce stalls when loading an nf module under memory pressure; do you think this patch has any value? -RongQing
[PATCH] netfilter: avoid stalls in nf_ct_alloc_hashtable
When the system forks a process with the CLONE_NEWNET flag under high memory pressure, it will trigger memory reclaim and stall for a long time, because nf_ct_alloc_hashtable needs to allocate high-order memory at that time. The call trace is as below: delay_tsc __delay _raw_spin_lock _spin_lock mmu_shrink shrink_slab zone_reclaim get_page_from_freelist __alloc_pages_nodemask alloc_pages_current __get_free_pages nf_ct_alloc_hashtable nf_conntrack_init_net setup_net copy_net_ns create_new_namespaces copy_namespaces copy_process do_fork sys_clone stub_clone __clone Do not use the direct memory reclaim flag, to avoid the stall. Signed-off-by: Ni Xun Signed-off-by: Zhang Yu Signed-off-by: Wang Li Signed-off-by: Li RongQing --- net/netfilter/nf_conntrack_core.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c index 8a113ca1eea2..672c5960530d 100644 --- a/net/netfilter/nf_conntrack_core.c +++ b/net/netfilter/nf_conntrack_core.c @@ -2120,8 +2120,8 @@ void *nf_ct_alloc_hashtable(unsigned int *sizep, int nulls) return NULL; sz = nr_slots * sizeof(struct hlist_nulls_head); - hash = (void *)__get_free_pages(GFP_KERNEL | __GFP_NOWARN | __GFP_ZERO, - get_order(sz)); + hash = (void *)__get_free_pages((GFP_KERNEL & ~__GFP_DIRECT_RECLAIM) | + __GFP_NOWARN | __GFP_ZERO, get_order(sz)); if (!hash) hash = vzalloc(sz); -- 2.16.2
Re: Re: [PATCH][net-next] bridge: clean up mtu_set_by_user setting to false and comments
> This is used to differentiate when auto adjust is used and when user has set > the MTU. > As I already said everything is working as expected and you should not > remove this code. > I see, thank you, and sorry for the noise. -R
Re: [PATCH][net-next] bridge: clean up mtu_set_by_user setting to false and comments
> -----Original Message----- > From: Nikolay Aleksandrov [mailto:niko...@cumulusnetworks.com] > Sent: July 13, 2018 16:01 > To: Li,Rongqing ; netdev@vger.kernel.org > Subject: Re: [PATCH][net-next] bridge: clean up mtu_set_by_user setting to > false and comments > > On 13/07/18 09:47, Li RongQing wrote: > > Once mtu_set_by_user is set to true, br_mtu_auto_adjust will not run, > > and no chance to clear mtu_set_by_user. > > > ^^ > This was by design, there is no error here and no "cleanup" is needed. > If you read the ndo_change_mtu() call you'll see the comment: > /* this flag will be cleared if the MTU was automatically adjusted */ > But after this comment, mtu_set_by_user is set to true, and br_mtu_auto_adjust will never actually run, so how does mtu_set_by_user get set back to false? 230 /* this flag will be cleared if the MTU was automatically adjusted */ 231 br->mtu_set_by_user = true; And isn't line 457 useless, since br_mtu_auto_adjust runs only when the flag is already false? 445 void br_mtu_auto_adjust(struct net_bridge *br) 446 { 447 ASSERT_RTNL(); 448 449 /* if the bridge MTU was manually configured don't mess with it */ 450 if (br->mtu_set_by_user) 451 return; 452 453 /* change to the minimum MTU and clear the flag which was set by 454 * the bridge ndo_change_mtu callback 455 */ 456 dev_set_mtu(br->dev, br_mtu_min(br)); 457 br->mtu_set_by_user = false; 458 } -R
[PATCH][net-next][v2] net: convert gro_count to bitmask
gro_hash is 192 bytes and spans 3 cache lines. If there are few flows, gro_hash may not be fully used, so iterating over all of gro_hash in napi_gro_flush() touches cache lines unnecessarily. Convert gro_count to a bitmask and rename it gro_bitmask; each bit represents an element of gro_hash, and a gro_hash element is flushed only if the related bit is set, speeding up napi_gro_flush(). Also update gro_bitmask only when it actually changes, to reduce cache-line dirtying. Suggested-by: Eric Dumazet Signed-off-by: Li RongQing Cc: Stefano Brivio --- netperf shows no difference, maybe because my test machine has a large cache include/linux/netdevice.h | 9 +++-- net/core/dev.c | 36 2 files changed, 31 insertions(+), 14 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 2daf2fa6554f..8837a998de3f 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -308,9 +308,14 @@ struct gro_list { }; /* - * Structure for NAPI scheduling similar to tasklet but with weighting + * size of gro hash buckets, must be less than the number of bits in + * napi_struct::gro_bitmask */ #define GRO_HASH_BUCKETS 8 + +/* + * Structure for NAPI scheduling similar to tasklet but with weighting + */ struct napi_struct { /* The poll_list must only be managed by the entity which * changes the state of the NAPI_STATE_SCHED bit. 
This means @@ -322,7 +327,7 @@ struct napi_struct { unsigned long state; int weight; - unsigned intgro_count; + unsigned long gro_bitmask; int (*poll)(struct napi_struct *, int); #ifdef CONFIG_NETPOLL int poll_owner; diff --git a/net/core/dev.c b/net/core/dev.c index 14a748ee8cc9..e39fef62e285 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -5283,9 +5283,11 @@ static void __napi_gro_flush_chain(struct napi_struct *napi, u32 index, list_del(>list); skb->next = NULL; napi_gro_complete(skb); - napi->gro_count--; napi->gro_hash[index].count--; } + + if (!napi->gro_hash[index].count) + __clear_bit(index, >gro_bitmask); } /* napi->gro_hash[].list contains packets ordered by age. @@ -5296,8 +5298,10 @@ void napi_gro_flush(struct napi_struct *napi, bool flush_old) { u32 i; - for (i = 0; i < GRO_HASH_BUCKETS; i++) - __napi_gro_flush_chain(napi, i, flush_old); + for (i = 0; i < GRO_HASH_BUCKETS; i++) { + if (test_bit(i, >gro_bitmask)) + __napi_gro_flush_chain(napi, i, flush_old); + } } EXPORT_SYMBOL(napi_gro_flush); @@ -5389,8 +5393,8 @@ static void gro_flush_oldest(struct list_head *head) if (WARN_ON_ONCE(!oldest)) return; - /* Do not adjust napi->gro_count, caller is adding a new SKB to -* the chain. + /* Do not adjust napi->gro_hash[].count, caller is adding a new +* SKB to the chain. 
*/ list_del(>list); napi_gro_complete(oldest); @@ -5465,7 +5469,6 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff list_del(>list); pp->next = NULL; napi_gro_complete(pp); - napi->gro_count--; napi->gro_hash[hash].count--; } @@ -5478,7 +5481,6 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff if (unlikely(napi->gro_hash[hash].count >= MAX_GRO_SKBS)) { gro_flush_oldest(gro_head); } else { - napi->gro_count++; napi->gro_hash[hash].count++; } NAPI_GRO_CB(skb)->count = 1; @@ -5493,6 +5495,13 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff if (grow > 0) gro_pull_from_frag0(skb, grow); ok: + if (napi->gro_hash[hash].count) { + if (!test_bit(hash, >gro_bitmask)) + __set_bit(hash, >gro_bitmask); + } else if (test_bit(hash, >gro_bitmask)) { + __clear_bit(hash, >gro_bitmask); + } + return ret; normal: @@ -5891,7 +5900,7 @@ bool napi_complete_done(struct napi_struct *n, int work_done) NAPIF_STATE_IN_BUSY_POLL))) return false; - if (n->gro_count) { + if (n->gro_bitmask) { unsigned long timeout = 0; if (work_done) @@ -6100,7 +6109,7 @@ static enum hrtimer_restart napi_watchdog(struct hrtimer *timer) /* Note : we use a relaxed variant of napi_schedule_prep() not setting * NAPI_STATE_MISSED, since we do not react to a device IRQ. */ -
[PATCH][net-next] bridge: clean up mtu_set_by_user setting to false and comments
Once mtu_set_by_user is set to true, br_mtu_auto_adjust will not run, and there is no chance to clear mtu_set_by_user. Moreover, br_mtu_auto_adjust runs only if mtu_set_by_user is false, so there is no need to set it to false again. Cc: Nikolay Aleksandrov Signed-off-by: Li RongQing --- net/bridge/br_device.c | 1 - net/bridge/br_if.c | 4 2 files changed, 5 deletions(-) diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c index e682a668ce57..c636bc2749c2 100644 --- a/net/bridge/br_device.c +++ b/net/bridge/br_device.c @@ -227,7 +227,6 @@ static int br_change_mtu(struct net_device *dev, int new_mtu) dev->mtu = new_mtu; - /* this flag will be cleared if the MTU was automatically adjusted */ br->mtu_set_by_user = true; #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER) /* remember the MTU in the rtable for PMTU */ diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c index 05e42d86882d..47c65da4b1be 100644 --- a/net/bridge/br_if.c +++ b/net/bridge/br_if.c @@ -450,11 +450,7 @@ void br_mtu_auto_adjust(struct net_bridge *br) if (br->mtu_set_by_user) return; - /* change to the minimum MTU and clear the flag which was set by -* the bridge ndo_change_mtu callback -*/ dev_set_mtu(br->dev, br_mtu_min(br)); - br->mtu_set_by_user = false; } static void br_set_gso_limits(struct net_bridge *br) -- 2.16.2
Re: [PATCH] net: convert gro_count to bitmask
> -----Original Message----- > From: Eric Dumazet [mailto:eric.duma...@gmail.com] > Sent: July 11, 2018 19:32 > To: Li,Rongqing ; netdev@vger.kernel.org > Subject: Re: [PATCH] net: convert gro_count to bitmask > > > > On 07/11/2018 02:15 AM, Li RongQing wrote: > > gro_hash size is 192 bytes, and uses 3 cache lines, if there is few > > flows, gro_hash may be not fully used, so it is unnecessary to iterate > > all gro_hash in napi_gro_flush(), to occupy unnecessary cacheline. > > > > convert gro_count to a bitmask, and rename it as gro_bitmask, each bit > > represents a element of gro_hash, only flush a gro_hash element if the > > related bit is set, to speed up napi_gro_flush(). > > > > and update gro_bitmask only if it will be changed, to reduce cache > > update > > > > Suggested-by: Eric Dumazet > > Signed-off-by: Li RongQing > > --- > > include/linux/netdevice.h | 2 +- > > net/core/dev.c | 35 +++ > > 2 files changed, 24 insertions(+), 13 deletions(-) > > > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h > > index b683971e500d..df49b36ef378 100644 > > --- a/include/linux/netdevice.h > > +++ b/include/linux/netdevice.h > > @@ -322,7 +322,7 @@ struct napi_struct { > > > > unsigned long state; > > int weight; > > - unsigned int gro_count; > > + unsigned long gro_bitmask; > > int (*poll)(struct napi_struct *, int); > > #ifdef CONFIG_NETPOLL > > int poll_owner; > > diff --git a/net/core/dev.c b/net/core/dev.c index > > d13cddcac41f..a08dbdd217a6 100644 > > --- a/net/core/dev.c > > +++ b/net/core/dev.c > > @@ -5171,9 +5171,11 @@ static void __napi_gro_flush_chain(struct > napi_struct *napi, u32 index, > > return; > > list_del_init(&skb->list); > > napi_gro_complete(skb); > > - napi->gro_count--; > > napi->gro_hash[index].count--; > > } > > + > > + if (!napi->gro_hash[index].count) > > + clear_bit(index, &napi->gro_bitmask); > > I suggest you not add an atomic operation here. > > Current cpu owns this NAPI after all. > > Same remark for the whole patch. 
> > -> __clear_bit(), __set_bit() and similar operators > > Ideally you should provide TCP_RR number with busy polling enabled, to > eventually catch regressions. > I will change it and do the test. Thank you. -RongQing > Thanks.
Re: [PATCH] net: convert gro_count to bitmask
> -----Original Message----- > From: David Miller [mailto:da...@davemloft.net] > Sent: July 12, 2018 10:49 > To: Li,Rongqing > Cc: netdev@vger.kernel.org > Subject: Re: [PATCH] net: convert gro_count to bitmask > > From: Li RongQing > Date: Wed, 11 Jul 2018 17:15:53 +0800 > > > + clear_bit(index, &napi->gro_bitmask); > > Please don't use atomics here, at least use __clear_bit(). > Thanks, this is same as Eric's suggestion. > This is why I did the operations by hand in my version of the patch. > Also, if you are going to preempt my patch, at least retain the comment I > added around the GRO_HASH_BUCKETS definitions which warns the reader > about the limit. > I added a BUILD_BUG_ON in netdev_init, so I think we do not need to add the comment: @@ -9151,6 +9159,9 @@ static struct hlist_head * __net_init netdev_create_hash(void) /* Initialize per network namespace state */ static int __net_init netdev_init(struct net *net) { + BUILD_BUG_ON(GRO_HASH_BUCKETS > + FIELD_SIZEOF(struct napi_struct, gro_bitmask)); + -RongQing > Thanks.
Re: [PATCH] net: convert gro_count to bitmask
> -----Original Message----- > From: Stefano Brivio [mailto:sbri...@redhat.com] > Sent: July 11, 2018 18:52 > To: Li,Rongqing > Cc: netdev@vger.kernel.org; Eric Dumazet > Subject: Re: [PATCH] net: convert gro_count to bitmask > > On Wed, 11 Jul 2018 17:15:53 +0800 > Li RongQing wrote: > > > @@ -5380,6 +5382,12 @@ static enum gro_result dev_gro_receive(struct > napi_struct *napi, struct sk_buff > > if (grow > 0) > > gro_pull_from_frag0(skb, grow); > > ok: > > + if (napi->gro_hash[hash].count) > > + if (!test_bit(hash, &napi->gro_bitmask)) > > + set_bit(hash, &napi->gro_bitmask); > > + else if (test_bit(hash, &napi->gro_bitmask)) > > + clear_bit(hash, &napi->gro_bitmask); > > This might not do what you want. > > -- Could you show details? -RongQing > Stefano
[PATCH] net: convert gro_count to bitmask
gro_hash is 192 bytes and spans 3 cache lines. If there are few flows, gro_hash may not be fully used, so iterating over all of gro_hash in napi_gro_flush() touches cache lines unnecessarily. Convert gro_count to a bitmask and rename it gro_bitmask; each bit represents an element of gro_hash, and a gro_hash element is flushed only if the related bit is set, speeding up napi_gro_flush(). Also update gro_bitmask only when it actually changes, to reduce cache-line dirtying. Suggested-by: Eric Dumazet Signed-off-by: Li RongQing --- include/linux/netdevice.h | 2 +- net/core/dev.c | 35 +++ 2 files changed, 24 insertions(+), 13 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index b683971e500d..df49b36ef378 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -322,7 +322,7 @@ struct napi_struct { unsigned long state; int weight; - unsigned int gro_count; + unsigned long gro_bitmask; int (*poll)(struct napi_struct *, int); #ifdef CONFIG_NETPOLL int poll_owner; diff --git a/net/core/dev.c b/net/core/dev.c index d13cddcac41f..a08dbdd217a6 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -5171,9 +5171,11 @@ static void __napi_gro_flush_chain(struct napi_struct *napi, u32 index, return; list_del_init(&skb->list); napi_gro_complete(skb); - napi->gro_count--; napi->gro_hash[index].count--; } + + if (!napi->gro_hash[index].count) + clear_bit(index, &napi->gro_bitmask); } /* napi->gro_hash[].list contains packets ordered by age. 
@@ -5184,8 +5186,10 @@ void napi_gro_flush(struct napi_struct *napi, bool flush_old) { u32 i; - for (i = 0; i < GRO_HASH_BUCKETS; i++) - __napi_gro_flush_chain(napi, i, flush_old); + for (i = 0; i < GRO_HASH_BUCKETS; i++) { + if (test_bit(i, >gro_bitmask)) + __napi_gro_flush_chain(napi, i, flush_old); + } } EXPORT_SYMBOL(napi_gro_flush); @@ -5277,8 +5281,8 @@ static void gro_flush_oldest(struct list_head *head) if (WARN_ON_ONCE(!oldest)) return; - /* Do not adjust napi->gro_count, caller is adding a new SKB to -* the chain. + /* Do not adjust napi->gro_hash[].count, caller is adding a new +* SKB to the chain. */ list_del(>list); napi_gro_complete(oldest); @@ -5352,7 +5356,6 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff if (pp) { list_del_init(>list); napi_gro_complete(pp); - napi->gro_count--; napi->gro_hash[hash].count--; } @@ -5365,7 +5368,6 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff if (unlikely(napi->gro_hash[hash].count >= MAX_GRO_SKBS)) { gro_flush_oldest(gro_head); } else { - napi->gro_count++; napi->gro_hash[hash].count++; } NAPI_GRO_CB(skb)->count = 1; @@ -5380,6 +5382,12 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff if (grow > 0) gro_pull_from_frag0(skb, grow); ok: + if (napi->gro_hash[hash].count) + if (!test_bit(hash, >gro_bitmask)) + set_bit(hash, >gro_bitmask); + else if (test_bit(hash, >gro_bitmask)) + clear_bit(hash, >gro_bitmask); + return ret; normal: @@ -5778,7 +5786,7 @@ bool napi_complete_done(struct napi_struct *n, int work_done) NAPIF_STATE_IN_BUSY_POLL))) return false; - if (n->gro_count) { + if (n->gro_bitmask) { unsigned long timeout = 0; if (work_done) @@ -5987,7 +5995,7 @@ static enum hrtimer_restart napi_watchdog(struct hrtimer *timer) /* Note : we use a relaxed variant of napi_schedule_prep() not setting * NAPI_STATE_MISSED, since we do not react to a device IRQ. 
*/ - if (napi->gro_count && !napi_disable_pending(napi) && + if (napi->gro_bitmask && !napi_disable_pending(napi) && !test_and_set_bit(NAPI_STATE_SCHED, >state)) __napi_schedule_irqoff(napi); @@ -6002,7 +6010,7 @@ void netif_napi_add(struct net_device *dev, struct napi_struct *napi, INIT_LIST_HEAD(>poll_list); hrtimer_init(>timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED); napi->timer.function = napi_watchdog; - napi->gro_count = 0; + napi->gro_bitmask = 0; for (i = 0; i < GRO_
Re: [PATCH][net-next][v2] net: limit each hash list length to MAX_GRO_SKBS
> -----Original Message----- > From: Eric Dumazet [mailto:eric.duma...@gmail.com] > Sent: July 8, 2018 8:22 > To: David Miller ; Li,Rongqing > > Cc: netdev@vger.kernel.org > Subject: Re: [PATCH][net-next][v2] net: limit each hash list length to > MAX_GRO_SKBS > > > > On 07/05/2018 03:20 AM, David Miller wrote: > From: Li RongQing > > Date: Thu, 5 Jul 2018 14:34:32 +0800 > > > >> After commit 07d78363dcff ("net: Convert NAPI gro list into a small > >> hash table.")' there is 8 hash buckets, which allows more flows to be > >> held for merging. but MAX_GRO_SKBS, the total held skb for merging, > >> is 8 skb still, limit the hash table performance. > >> > >> keep MAX_GRO_SKBS as 8 skb, but limit each hash list length to 8 skb, > >> not the total 8 skb > >> > >> Signed-off-by: Li RongQing > > > > Applied, thanks. > > > > Maybe gro_count should be replaced by a bitmask, so that we can speed up > napi_gro_flush(), since it now has to use 3 cache lines (gro_hash[] size is > 192 > bytes) Do you mean this? Subject: [PATCH][RFC][net-next] net: convert gro_count to bitmask convert gro_count to a bitmask, and rename it as gro_bitmask to speed up napi_gro_flush(), since gro_hash now has to use 3 cache lines --- include/linux/netdevice.h | 2 +- net/core/dev.c | 36 2 files changed, 25 insertions(+), 13 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index b683971e500d..df49b36ef378 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -322,7 +322,7 @@ struct napi_struct { unsigned long state; int weight; - unsigned int gro_count; + unsigned long gro_bitmask; int (*poll)(struct napi_struct *, int); #ifdef CONFIG_NETPOLL int poll_owner; diff --git a/net/core/dev.c b/net/core/dev.c index 89825c1eccdc..da2d1185eb82 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -5161,9 +5161,11 @@ static void __napi_gro_flush_chain(struct napi_struct *napi, u32 index, return; list_del_init(&skb->list); napi_gro_complete(skb); - napi->gro_count--; napi->gro_hash[index].count--; 
} + + if (!napi->gro_hash[index].count) + clear_bit(index, >gro_bitmask); } /* napi->gro_hash[].list contains packets ordered by age. @@ -5174,8 +5176,10 @@ void napi_gro_flush(struct napi_struct *napi, bool flush_old) { u32 i; - for (i = 0; i < GRO_HASH_BUCKETS; i++) - __napi_gro_flush_chain(napi, i, flush_old); + for (i = 0; i < GRO_HASH_BUCKETS; i++) { + if (test_bit(i, >gro_bitmask)) + __napi_gro_flush_chain(napi, i, flush_old); + } } EXPORT_SYMBOL(napi_gro_flush); @@ -5267,8 +5271,8 @@ static void gro_flush_oldest(struct list_head *head) if (WARN_ON_ONCE(!oldest)) return; - /* Do not adjust napi->gro_count, caller is adding a new SKB to -* the chain. + /* Do not adjust napi->gro_hash[].count, caller is adding a new +* SKB to the chain. */ list_del(>list); napi_gro_complete(oldest); @@ -5342,7 +5346,6 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff if (pp) { list_del_init(>list); napi_gro_complete(pp); - napi->gro_count--; napi->gro_hash[hash].count--; } @@ -5355,7 +5358,6 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff if (unlikely(napi->gro_hash[hash].count >= MAX_GRO_SKBS)) { gro_flush_oldest(gro_head); } else { - napi->gro_count++; napi->gro_hash[hash].count++; } NAPI_GRO_CB(skb)->count = 1; @@ -5370,6 +5372,13 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff if (grow > 0) gro_pull_from_frag0(skb, grow); ok: + + if (napi->gro_hash[hash].count) + if (!test_bit(hash, >gro_bitmask)) + set_bit(hash, >gro_bitmask); + else if (test_bit(hash, >gro_bitmask)) + clear_bit(hash, >gro_bitmask); + return ret; normal: @@ -5768,7 +5777,7 @@ bool napi_complete_done(struct napi_struct *n, int work_done) NAPIF_STATE_IN_BUSY_POLL))) return false; - if (n->gro_count) { + if (n->gro_bitmask) { unsigned long timeout = 0; if (work_done) @@ -5977,7 +5
[PATCH][net-next] net: replace num_possible_cpus with nr_cpu_ids
The return value of num_possible_cpus() is the same as nr_cpu_ids, but using nr_cpu_ids avoids recomputing the cpumask weight on each call. Signed-off-by: Li RongQing --- net/core/dev.c | 4 ++-- net/ipv4/inet_hashtables.c | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/net/core/dev.c b/net/core/dev.c index 89825c1eccdc..05c7bc6e4ce6 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -2189,7 +2189,7 @@ static void netif_reset_xps_queues(struct net_device *dev, u16 offset, if (!dev_maps) goto out_no_maps; - if (num_possible_cpus() > 1) + if (nr_cpu_ids > 1) possible_mask = cpumask_bits(cpu_possible_mask); nr_ids = nr_cpu_ids; clean_xps_maps(dev, possible_mask, dev_maps, nr_ids, offset, count, @@ -2273,7 +2273,7 @@ int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask, nr_ids = dev->num_rx_queues; } else { maps_sz = XPS_CPU_DEV_MAPS_SIZE(num_tc); - if (num_possible_cpus() > 1) { + if (nr_cpu_ids > 1) { online_mask = cpumask_bits(cpu_online_mask); possible_mask = cpumask_bits(cpu_possible_mask); } diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c index 3647167c8fa3..80cadf06fd3f 100644 --- a/net/ipv4/inet_hashtables.c +++ b/net/ipv4/inet_hashtables.c @@ -825,7 +825,7 @@ int inet_ehash_locks_alloc(struct inet_hashinfo *hashinfo) if (locksz != 0) { /* allocate 2 cache lines or at least one spinlock per cpu */ nblocks = max(2U * L1_CACHE_BYTES / locksz, 1U); - nblocks = roundup_pow_of_two(nblocks * num_possible_cpus()); + nblocks = roundup_pow_of_two(nblocks * nr_cpu_ids); /* no more locks than number of hash buckets */ nblocks = min(nblocks, hashinfo->ehash_mask + 1); -- 2.16.2
[PATCH][net-next][v2] net: limit each hash list length to MAX_GRO_SKBS
After commit 07d78363dcff ("net: Convert NAPI gro list into a small hash table.") there are 8 hash buckets, which allows more flows to be held for merging. But MAX_GRO_SKBS, the total number of skbs held for merging, is still 8, limiting the hash table's benefit. Keep MAX_GRO_SKBS at 8 skbs, but make it a per-hash-list limit rather than a total limit. Signed-off-by: Li RongQing --- include/linux/netdevice.h | 7 +- net/core/dev.c | 56 +++ 2 files changed, 29 insertions(+), 34 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 8bf8d6149f79..3b60ac51ddba 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -302,6 +302,11 @@ struct netdev_boot_setup { int __init netdev_boot_setup(char *str); +struct gro_list { + struct list_head list; + int count; +}; + /* * Structure for NAPI scheduling similar to tasklet but with weighting */ @@ -323,7 +328,7 @@ struct napi_struct { int poll_owner; #endif struct net_device *dev; - struct list_head gro_hash[GRO_HASH_BUCKETS]; + struct gro_list gro_hash[GRO_HASH_BUCKETS]; struct sk_buff *skb; struct hrtimer timer; struct list_head dev_list; diff --git a/net/core/dev.c b/net/core/dev.c index 08d58e0debe5..38c58e32f5bc 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -149,7 +149,6 @@ #include "net-sysfs.h" -/* Instead of increasing this, you should create a hash table. */ #define MAX_GRO_SKBS 8 /* This should be increased if a protocol with a bigger head is added. 
*/ @@ -4989,9 +4988,10 @@ static int napi_gro_complete(struct sk_buff *skb) return netif_receive_skb_internal(skb); } -static void __napi_gro_flush_chain(struct napi_struct *napi, struct list_head *head, +static void __napi_gro_flush_chain(struct napi_struct *napi, u32 index, bool flush_old) { + struct list_head *head = >gro_hash[index].list; struct sk_buff *skb, *p; list_for_each_entry_safe_reverse(skb, p, head, list) { @@ -5000,22 +5000,20 @@ static void __napi_gro_flush_chain(struct napi_struct *napi, struct list_head *h list_del_init(>list); napi_gro_complete(skb); napi->gro_count--; + napi->gro_hash[index].count--; } } -/* napi->gro_hash contains packets ordered by age. +/* napi->gro_hash[].list contains packets ordered by age. * youngest packets at the head of it. * Complete skbs in reverse order to reduce latencies. */ void napi_gro_flush(struct napi_struct *napi, bool flush_old) { - int i; - - for (i = 0; i < GRO_HASH_BUCKETS; i++) { - struct list_head *head = >gro_hash[i]; + u32 i; - __napi_gro_flush_chain(napi, head, flush_old); - } + for (i = 0; i < GRO_HASH_BUCKETS; i++) + __napi_gro_flush_chain(napi, i, flush_old); } EXPORT_SYMBOL(napi_gro_flush); @@ -5027,7 +5025,7 @@ static struct list_head *gro_list_prepare(struct napi_struct *napi, struct list_head *head; struct sk_buff *p; - head = >gro_hash[hash & (GRO_HASH_BUCKETS - 1)]; + head = >gro_hash[hash & (GRO_HASH_BUCKETS - 1)].list; list_for_each_entry(p, head, list) { unsigned long diffs; @@ -5095,27 +5093,13 @@ static void gro_pull_from_frag0(struct sk_buff *skb, int grow) } } -static void gro_flush_oldest(struct napi_struct *napi) +static void gro_flush_oldest(struct list_head *head) { - struct sk_buff *oldest = NULL; - unsigned long age = jiffies; - int i; - - for (i = 0; i < GRO_HASH_BUCKETS; i++) { - struct list_head *head = >gro_hash[i]; - struct sk_buff *skb; - - if (list_empty(head)) - continue; + struct sk_buff *oldest; - skb = list_last_entry(head, struct sk_buff, list); - if (!oldest || 
time_before(NAPI_GRO_CB(skb)->age, age)) { - oldest = skb; - age = NAPI_GRO_CB(skb)->age; - } - } + oldest = list_last_entry(head, struct sk_buff, list); - /* We are called with napi->gro_count >= MAX_GRO_SKBS, so this is + /* We are called with head length >= MAX_GRO_SKBS, so this is * impossible. */ if (WARN_ON_ONCE(!oldest)) @@ -5130,6 +5114,7 @@ static void gro_flush_oldest(struct napi_struct *napi) static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff *skb) { + u32 hash = skb_get_hash_raw(skb) & (GRO_HASH_BUCKETS - 1); struct list_head *head = _base; struct packet_offload *ptype; __be16 typ
[PATCH][net-next] net: limit each hash list length to MAX_GRO_SKBS
After commit 07d78363dcff ("net: Convert NAPI gro list into a small hash table.") there are 8 hash buckets, which allows more flows to be held for merging. But MAX_GRO_SKBS, the total number of skbs held for merging, is still 8, limiting the hash table's benefit. Keep MAX_GRO_SKBS at 8 skbs, but make it a per-hash-list limit rather than a total limit. Signed-off-by: Li RongQing --- include/linux/netdevice.h | 7 +- net/core/dev.c | 54 +++ 2 files changed, 28 insertions(+), 33 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 8bf8d6149f79..3b60ac51ddba 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -302,6 +302,11 @@ struct netdev_boot_setup { int __init netdev_boot_setup(char *str); +struct gro_list { + struct list_head list; + int count; +}; + /* * Structure for NAPI scheduling similar to tasklet but with weighting */ @@ -323,7 +328,7 @@ struct napi_struct { int poll_owner; #endif struct net_device *dev; - struct list_head gro_hash[GRO_HASH_BUCKETS]; + struct gro_list gro_hash[GRO_HASH_BUCKETS]; struct sk_buff *skb; struct hrtimer timer; struct list_head dev_list; diff --git a/net/core/dev.c b/net/core/dev.c index 08d58e0debe5..f8cdc27ee276 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -149,7 +149,6 @@ #include "net-sysfs.h" -/* Instead of increasing this, you should create a hash table. */ #define MAX_GRO_SKBS 8 /* This should be increased if a protocol with a bigger head is added. 
*/ @@ -4989,10 +4988,11 @@ static int napi_gro_complete(struct sk_buff *skb) return netif_receive_skb_internal(skb); } -static void __napi_gro_flush_chain(struct napi_struct *napi, struct list_head *head, +static void __napi_gro_flush_chain(struct napi_struct *napi, int index, bool flush_old) { struct sk_buff *skb, *p; + struct list_head *head = >gro_hash[index].list; list_for_each_entry_safe_reverse(skb, p, head, list) { if (flush_old && NAPI_GRO_CB(skb)->age == jiffies) @@ -5000,10 +5000,11 @@ static void __napi_gro_flush_chain(struct napi_struct *napi, struct list_head *h list_del_init(>list); napi_gro_complete(skb); napi->gro_count--; + napi->gro_hash[index].count--; } } -/* napi->gro_hash contains packets ordered by age. +/* napi->gro_hash[].list contains packets ordered by age. * youngest packets at the head of it. * Complete skbs in reverse order to reduce latencies. */ @@ -5011,11 +5012,8 @@ void napi_gro_flush(struct napi_struct *napi, bool flush_old) { int i; - for (i = 0; i < GRO_HASH_BUCKETS; i++) { - struct list_head *head = >gro_hash[i]; - - __napi_gro_flush_chain(napi, head, flush_old); - } + for (i = 0; i < GRO_HASH_BUCKETS; i++) + __napi_gro_flush_chain(napi, i, flush_old); } EXPORT_SYMBOL(napi_gro_flush); @@ -5027,7 +5025,7 @@ static struct list_head *gro_list_prepare(struct napi_struct *napi, struct list_head *head; struct sk_buff *p; - head = >gro_hash[hash & (GRO_HASH_BUCKETS - 1)]; + head = >gro_hash[hash & (GRO_HASH_BUCKETS - 1)].list; list_for_each_entry(p, head, list) { unsigned long diffs; @@ -5095,27 +5093,13 @@ static void gro_pull_from_frag0(struct sk_buff *skb, int grow) } } -static void gro_flush_oldest(struct napi_struct *napi) +static void gro_flush_oldest(struct list_head *head) { - struct sk_buff *oldest = NULL; - unsigned long age = jiffies; - int i; + struct sk_buff *oldest; - for (i = 0; i < GRO_HASH_BUCKETS; i++) { - struct list_head *head = >gro_hash[i]; - struct sk_buff *skb; + oldest = list_last_entry(head, struct sk_buff, 
list); - if (list_empty(head)) - continue; - - skb = list_last_entry(head, struct sk_buff, list); - if (!oldest || time_before(NAPI_GRO_CB(skb)->age, age)) { - oldest = skb; - age = NAPI_GRO_CB(skb)->age; - } - } - - /* We are called with napi->gro_count >= MAX_GRO_SKBS, so this is + /* We are called with head length >= MAX_GRO_SKBS, so this is * impossible. */ if (WARN_ON_ONCE(!oldest)) @@ -5138,6 +5122,7 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff enum gro_result ret; int same_flow; int grow; + u32 hash = skb_get_hash_raw(skb) & (GRO_HASH_BUCKETS - 1);
Re: [PATCH][net-next] net: increase MAX_GRO_SKBS to 64
On 7/2/18, David Miller wrote:
> From: Li RongQing
> Date: Mon, 2 Jul 2018 19:41:43 +0800
>
>> After 07d78363dcffd [net: Convert NAPI gro list into a small hash table]
>> there are 8 hash buckets, which allows more flows to be held for merging.
>>
>> Keep each list at the original length, so increase MAX_GRO_SKBS to 64.
>>
>> Signed-off-by: Li RongQing
>
> I would like to hear some feedback from Eric, 64 might be too big.

How about the change below?

commit 6270b973a973b2944fedb4b5f9926ed3e379d0c2 (HEAD -> master)
Author: Li RongQing
Date:   Mon Jul 2 19:08:37 2018 +0800

    net: limit each hash list length to MAX_GRO_SKBS

    After 07d78363dcffd [net: Convert NAPI gro list into a small hash table]
    there are 8 hash buckets, which allows more flows to be held for
    merging, but MAX_GRO_SKBS, the total number of held skbs, is still 8,
    which limits the benefit of the hash table.

    Keep MAX_GRO_SKBS at 8 skbs, but limit each hash list's length to 8
    skbs, instead of limiting the total to 8.

    Signed-off-by: Li RongQing

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 8bf8d6149f79..09d7764a8917 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -324,6 +324,7 @@ struct napi_struct {
 #endif
 	struct net_device *dev;
 	struct list_head	gro_hash[GRO_HASH_BUCKETS];
+	int			list_len[GRO_HASH_BUCKETS];
 	struct sk_buff *skb;
 	struct hrtimer timer;
 	struct list_head	dev_list;
diff --git a/net/core/dev.c b/net/core/dev.c
index 08d58e0debe5..3cf3c6676cb3 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -149,7 +149,6 @@
 #include "net-sysfs.h"

-/* Instead of increasing this, you should create a hash table. */
 #define MAX_GRO_SKBS 8

 /* This should be increased if a protocol with a bigger head is added.
 */
@@ -4989,10 +4988,11 @@ static int napi_gro_complete(struct sk_buff *skb)
 	return netif_receive_skb_internal(skb);
 }

-static void __napi_gro_flush_chain(struct napi_struct *napi, struct list_head *head,
+static void __napi_gro_flush_chain(struct napi_struct *napi, int index,
 				   bool flush_old)
 {
 	struct sk_buff *skb, *p;
+	struct list_head *head = &napi->gro_hash[index];

 	list_for_each_entry_safe_reverse(skb, p, head, list) {
 		if (flush_old && NAPI_GRO_CB(skb)->age == jiffies)
@@ -5000,6 +5000,7 @@ static void __napi_gro_flush_chain(struct napi_struct *napi, struct list_head *h
 		list_del_init(&skb->list);
 		napi_gro_complete(skb);
 		napi->gro_count--;
+		napi->list_len[index]--;
 	}
 }

@@ -5011,11 +5012,8 @@ void napi_gro_flush(struct napi_struct *napi, bool flush_old)
 {
 	int i;

-	for (i = 0; i < GRO_HASH_BUCKETS; i++) {
-		struct list_head *head = &napi->gro_hash[i];
-
-		__napi_gro_flush_chain(napi, head, flush_old);
-	}
+	for (i = 0; i < GRO_HASH_BUCKETS; i++)
+		__napi_gro_flush_chain(napi, i, flush_old);
 }
 EXPORT_SYMBOL(napi_gro_flush);

@@ -5095,27 +5093,13 @@ static void gro_pull_from_frag0(struct sk_buff *skb, int grow)
 	}
 }

-static void gro_flush_oldest(struct napi_struct *napi)
+static void gro_flush_oldest(struct list_head *head)
 {
 	struct sk_buff *oldest = NULL;
-	unsigned long age = jiffies;
-	int i;

-	for (i = 0; i < GRO_HASH_BUCKETS; i++) {
-		struct list_head *head = &napi->gro_hash[i];
-		struct sk_buff *skb;
+	oldest = list_last_entry(head, struct sk_buff, list);

-		if (list_empty(head))
-			continue;
-
-		skb = list_last_entry(head, struct sk_buff, list);
-		if (!oldest || time_before(NAPI_GRO_CB(skb)->age, age)) {
-			oldest = skb;
-			age = NAPI_GRO_CB(skb)->age;
-		}
-	}
-
-	/* We are called with napi->gro_count >= MAX_GRO_SKBS, so this is
+	/* We are called with head length >= MAX_GRO_SKBS, so this is
 	 * impossible.
 	 */
 	if (WARN_ON_ONCE(!oldest))
@@ -5138,6 +5122,7 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
 	enum gro_result ret;
 	int same_flow;
 	int grow;
+	u32 hash = skb_get_hash_raw(skb) & (GRO_HASH_BUCKETS - 1);

 	if (netif_elide_gro(skb->dev))
 		goto normal;
@@ -5196,6 +5181,7 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
 		list_del_init(&pp->list);
 		napi_gro_complete(pp);
 		napi->gro_count--;
+		napi->list_len[hash]--;
 	}

 	if (same_flow)
@@ -5204,10 +5190,11 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi
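The per-list cap proposed above can be sketched in plain user-space C with a tiny singly linked list per bucket. This is an illustrative model, not kernel code: `flow_table`, `flow_add`, and `MAX_CHAIN_LEN` are made-up names standing in for `napi->gro_hash`, `dev_gro_receive()`, and `MAX_GRO_SKBS`.

```c
#include <assert.h>
#include <stdlib.h>

#define BUCKETS       8
#define MAX_CHAIN_LEN 8   /* per-bucket cap, mirroring MAX_GRO_SKBS per list */

struct flow {
	unsigned int hash;
	struct flow *next;
};

struct flow_table {
	struct flow *head[BUCKETS]; /* newest at head, oldest at tail */
	int len[BUCKETS];           /* per-bucket length, like napi->list_len[] */
};

/* Drop the oldest entry (the tail) of one bucket, like gro_flush_oldest(). */
static void flow_evict_oldest(struct flow_table *t, unsigned int b)
{
	struct flow **pp = &t->head[b];

	while (*pp && (*pp)->next)
		pp = &(*pp)->next;
	if (*pp) {
		free(*pp);
		*pp = NULL;
		t->len[b]--;
	}
}

/* Insert at head; only this bucket's length matters, not the table total. */
static void flow_add(struct flow_table *t, unsigned int hash)
{
	unsigned int b = hash & (BUCKETS - 1);
	struct flow *f = malloc(sizeof(*f)); /* no error handling: sketch only */

	if (t->len[b] >= MAX_CHAIN_LEN)
		flow_evict_oldest(t, b);
	f->hash = hash;
	f->next = t->head[b];
	t->head[b] = f;
	t->len[b]++;
}
```

With the cap applied per bucket, up to 64 flows spread over 8 buckets can coexist, while a single hot bucket still never grows past 8 entries — the behavior the patch argues for, without the latency risk of raising the global limit to 64.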
Re: [PATCH][net-next] net: increase MAX_GRO_SKBS to 64
From: David Miller [da...@davemloft.net]
Sent: July 2, 2018 19:44
To: Li,Rongqing
Cc: netdev@vger.kernel.org; eric.duma...@gmail.com
Subject: Re: [PATCH][net-next] net: increase MAX_GRO_SKBS to 64

From: Li RongQing
Date: Mon, 2 Jul 2018 19:41:43 +0800

>> After 07d78363dcffd [net: Convert NAPI gro list into a small hash table]
>> there are 8 hash buckets, which allows more flows to be held for merging.
>>
>> Keep each list at the original length, so increase MAX_GRO_SKBS to 64.
>>
>> Signed-off-by: Li RongQing

> I would like to hear some feedback from Eric, 64 might be too big.

I think we should limit each list's length to 8 skbs instead of making this
change.

If there is only one flow, raising MAX_GRO_SKBS to 64 may introduce a large
delay; and if the total stays at 8 skbs, the hash table may be unable to
improve performance for multiple flows.

-RongQing
[PATCH][net-next] net: increase MAX_GRO_SKBS to 64
After 07d78363dcffd [net: Convert NAPI gro list into a small hash table]
there are 8 hash buckets, which allows more flows to be held for merging.

Keep each list at the original length, so increase MAX_GRO_SKBS to 64.

Signed-off-by: Li RongQing
---
 net/core/dev.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 08d58e0debe5..ac315e41d5e7 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -149,8 +149,7 @@
 #include "net-sysfs.h"

-/* Instead of increasing this, you should create a hash table. */
-#define MAX_GRO_SKBS 8
+#define MAX_GRO_SKBS 64

 /* This should be increased if a protocol with a bigger head is added.
  */
 #define GRO_MAX_HEAD (MAX_HEADER + 128)
--
2.16.2
[PATCH] net: propagate dev_get_valid_name return code
If dev_get_valid_name() fails, propagate its return code, and drop the
err = -ENODEV assignment, which would in any case be overwritten with 0
before dev_change_net_namespace() exits.

Signed-off-by: Li RongQing
---
 net/core/dev.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 1844d9bc5714..1c7a3761ec3c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -8661,7 +8661,8 @@ int dev_change_net_namespace(struct net_device *dev, struct net *net, const char
 		/* We get here if we can't use the current device name */
 		if (!pat)
 			goto out;
-		if (dev_get_valid_name(net, dev, pat) < 0)
+		err = dev_get_valid_name(net, dev, pat);
+		if (err < 0)
 			goto out;
 	}

@@ -8673,7 +8674,6 @@ int dev_change_net_namespace(struct net_device *dev, struct net *net, const char
 	dev_close(dev);

 	/* And unlink it from device chain */
-	err = -ENODEV;
 	unlist_netdevice(dev);

 	synchronize_net();
--
2.16.2
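The pattern this patch applies — returning the callee's specific error code instead of clobbering it with a generic one — can be shown in a few lines of user-space C. The function names here (`get_valid_name`, `change_ns_*`) are illustrative stand-ins, not the kernel's APIs.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical callee: fails with a specific reason. */
static int get_valid_name(const char *pat)
{
	if (!pat)
		return -EINVAL;   /* the caller can learn *why* it failed */
	return 0;
}

/* Anti-pattern: squash every failure into one generic code. */
static int change_ns_generic(const char *pat)
{
	if (get_valid_name(pat) < 0)
		return -ENODEV;   /* specific cause is lost */
	return 0;
}

/* Pattern from the patch: keep the callee's return code. */
static int change_ns_propagate(const char *pat)
{
	int err = get_valid_name(pat);

	if (err < 0)
		return err;       /* -EINVAL reaches the caller intact */
	return 0;
}
```

A user-space caller of the propagating variant sees the precise errno, which matters for tools like `ip netns` that report the failure reason to the operator.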
[PATCH][v2] xfrm: replace NR_CPUS with nr_cpu_ids
The default NR_CPUS can be very large, but the actual possible nr_cpu_ids
is usually very small. On some x86 distributions, NR_CPUS is 8192 while
nr_cpu_ids is 4, so replace NR_CPUS to save some memory.

Signed-off-by: Li RongQing
Signed-off-by: Wang Li
---
 net/xfrm/xfrm_policy.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 40b54cc64243..f8188685c1e9 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -2989,11 +2989,11 @@ void __init xfrm_init(void)
 {
 	int i;

-	xfrm_pcpu_work = kmalloc_array(NR_CPUS, sizeof(*xfrm_pcpu_work),
+	xfrm_pcpu_work = kmalloc_array(nr_cpu_ids, sizeof(*xfrm_pcpu_work),
 				       GFP_KERNEL);
 	BUG_ON(!xfrm_pcpu_work);

-	for (i = 0; i < NR_CPUS; i++)
+	for (i = 0; i < nr_cpu_ids; i++)
 		INIT_WORK(&xfrm_pcpu_work[i], xfrm_pcpu_work_fn);

 	register_pernet_subsys(&xfrm_net_ops);
--
2.16.2
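The saving is easy to quantify: size the array by the runtime CPU count instead of the compile-time maximum. This user-space sketch uses made-up names (`alloc_pcpu_work`, `bytes_saved`); the 8192/4 figures are the ones quoted in the commit message.

```c
#include <assert.h>
#include <stdlib.h>

#define NR_CPUS 8192            /* compile-time maximum from the example distro */

struct work { void (*fn)(struct work *); };

/* Size the array by the runtime possible-CPU count, not NR_CPUS. */
static struct work *alloc_pcpu_work(unsigned int nr_cpu_ids)
{
	return calloc(nr_cpu_ids, sizeof(struct work));
}

/* Memory no longer wasted on CPUs that can never exist. */
static size_t bytes_saved(unsigned int nr_cpu_ids)
{
	return (NR_CPUS - nr_cpu_ids) * sizeof(struct work);
}
```

With nr_cpu_ids = 4, the allocation shrinks from 8192 entries to 4, saving 8188 entries' worth of memory for an array that is indexed only by possible CPU ids anyway.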
Re: [PATCH][ipsec] xfrm: replace NR_CPUS with num_possible_cpus()
Sorry, please drop this patch. I should replace NR_CPUS with nr_cpu_ids;
I will resend it.

-R

On 6/15/18, Li RongQing wrote:
> The default NR_CPUS can be very large, but the actual possible nr_cpu_ids
> is usually very small. On some x86 distributions, NR_CPUS is 8192
> and nr_cpu_ids is 4.
>
> num_possible_cpus() should work by the time xfrm_init is running
>
> Signed-off-by: Li RongQing
> Signed-off-by: Wang Li
> ---
>  net/xfrm/xfrm_policy.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
> index 40b54cc64243..cbb862463cbd 100644
> --- a/net/xfrm/xfrm_policy.c
> +++ b/net/xfrm/xfrm_policy.c
> @@ -2988,12 +2988,13 @@ static struct pernet_operations __net_initdata xfrm_net_ops = {
>  void __init xfrm_init(void)
>  {
>  	int i;
> +	unsigned int nr_cpus = num_possible_cpus();
>
> -	xfrm_pcpu_work = kmalloc_array(NR_CPUS, sizeof(*xfrm_pcpu_work),
> +	xfrm_pcpu_work = kmalloc_array(nr_cpus, sizeof(*xfrm_pcpu_work),
> 				       GFP_KERNEL);
>  	BUG_ON(!xfrm_pcpu_work);
>
> -	for (i = 0; i < NR_CPUS; i++)
> +	for (i = 0; i < nr_cpus; i++)
> 		INIT_WORK(&xfrm_pcpu_work[i], xfrm_pcpu_work_fn);
>
>  	register_pernet_subsys(&xfrm_net_ops);
> --
> 2.16.2
>
[PATCH][ipsec] xfrm: replace NR_CPUS with num_possible_cpus()
The default NR_CPUS can be very large, but the actual possible nr_cpu_ids
is usually very small. On some x86 distributions, NR_CPUS is 8192 and
nr_cpu_ids is 4.

num_possible_cpus() should work by the time xfrm_init is running.

Signed-off-by: Li RongQing
Signed-off-by: Wang Li
---
 net/xfrm/xfrm_policy.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 40b54cc64243..cbb862463cbd 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -2988,12 +2988,13 @@ static struct pernet_operations __net_initdata xfrm_net_ops = {
 void __init xfrm_init(void)
 {
 	int i;
+	unsigned int nr_cpus = num_possible_cpus();

-	xfrm_pcpu_work = kmalloc_array(NR_CPUS, sizeof(*xfrm_pcpu_work),
+	xfrm_pcpu_work = kmalloc_array(nr_cpus, sizeof(*xfrm_pcpu_work),
 				       GFP_KERNEL);
 	BUG_ON(!xfrm_pcpu_work);

-	for (i = 0; i < NR_CPUS; i++)
+	for (i = 0; i < nr_cpus; i++)
 		INIT_WORK(&xfrm_pcpu_work[i], xfrm_pcpu_work_fn);

 	register_pernet_subsys(&xfrm_net_ops);
--
2.16.2
[net-next][PATCH] tcp: probe timer must not be less than 5 minutes for TCP PMTU
RFC 4821 says:

	The value for this timer MUST NOT be less than 5 minutes and is
	recommended to be 10 minutes, per RFC 1981.

Signed-off-by: Li RongQing
---
 net/ipv4/sysctl_net_ipv4.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index d2eed3ddcb0a..ed8952bb6874 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -47,6 +47,7 @@ static int tcp_syn_retries_max = MAX_TCP_SYNCNT;
 static int ip_ping_group_range_min[] = { 0, 0 };
 static int ip_ping_group_range_max[] = { GID_T_MAX, GID_T_MAX };
 static int comp_sack_nr_max = 255;
+static int tcp_probe_interval_min = 300;

 /* obsolete */
 static int sysctl_tcp_low_latency __read_mostly;

@@ -711,7 +712,8 @@ static struct ctl_table ipv4_net_table[] = {
 		.data		= &init_net.ipv4.sysctl_tcp_probe_interval,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &tcp_probe_interval_min,
 	},
 	{
 		.procname	= "igmp_link_local_mcast_reports",
--
2.16.2
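The effect of switching to proc_dointvec_minmax can be modeled with a clamped setter in user-space C. This is a sketch of the behavior, not the kernel's sysctl machinery; `set_probe_interval` is a made-up name.

```c
#include <assert.h>

/* RFC 4821: "The value for this timer MUST NOT be less than 5 minutes". */
#define TCP_PROBE_INTERVAL_MIN 300  /* seconds */

/* Emulate proc_dointvec_minmax with only a lower bound (extra1):
 * writes below the floor are rejected and the old value is kept. */
static int set_probe_interval(int *sysctl, int val)
{
	if (val < TCP_PROBE_INTERVAL_MIN)
		return -1;   /* proc_dointvec_minmax would return -EINVAL */
	*sysctl = val;
	return 0;
}
```

Before the patch, `echo 10 > /proc/sys/net/ipv4/tcp_probe_interval` would be accepted; with the minmax handler and `extra1` floor, such a write fails and the previous interval stays in effect.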
[net-next][PATCH] inet: Use switch case instead of multiple condition checks
inet_csk_reset_xmit_timer() uses multiple equality condition checks, so it
is better to use a switch statement instead.

After this patch, the increase in image size is acceptable:

                                Before      After
 size of net/ipv4/tcp_output.o  721640      721648
 size of vmlinux                400236400   400236401

Signed-off-by: Li RongQing <lirongq...@baidu.com>
---
 include/net/inet_connection_sock.h | 23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index 2ab6667275df..d2e9314cf43d 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -239,22 +239,31 @@ static inline void inet_csk_reset_xmit_timer(struct sock *sk, const int what,
 		when = max_when;
 	}

-	if (what == ICSK_TIME_RETRANS || what == ICSK_TIME_PROBE0 ||
-	    what == ICSK_TIME_EARLY_RETRANS || what == ICSK_TIME_LOSS_PROBE ||
-	    what == ICSK_TIME_REO_TIMEOUT) {
+	switch (what) {
+	case ICSK_TIME_RETRANS:
+		/* fall through */
+	case ICSK_TIME_PROBE0:
+		/* fall through */
+	case ICSK_TIME_EARLY_RETRANS:
+		/* fall through */
+	case ICSK_TIME_LOSS_PROBE:
+		/* fall through */
+	case ICSK_TIME_REO_TIMEOUT:
 		icsk->icsk_pending = what;
 		icsk->icsk_timeout = jiffies + when;
 		sk_reset_timer(sk, &icsk->icsk_retransmit_timer, icsk->icsk_timeout);
-	} else if (what == ICSK_TIME_DACK) {
+		break;
+	case ICSK_TIME_DACK:
 		icsk->icsk_ack.pending |= ICSK_ACK_TIMER;
 		icsk->icsk_ack.timeout = jiffies + when;
 		sk_reset_timer(sk, &icsk->icsk_delack_timer, icsk->icsk_ack.timeout);
-	}
+		break;
 #ifdef INET_CSK_DEBUG
-	else {
+	default:
 		pr_debug("%s", inet_csk_timer_bug_msg);
-	}
+		break;
 #endif
+	}
 }

 static inline unsigned long
--
2.16.2
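The structure the patch adopts — grouped cases sharing one body via fallthrough, with a `default` for the debug branch — looks like this in a self-contained sketch. The enum values and the function name are illustrative, not the kernel's ICSK constants.

```c
#include <assert.h>

enum timer_kind { T_RETRANS, T_PROBE0, T_LOSS_PROBE, T_DACK, T_UNKNOWN };

/* Grouped cases fall through into one shared body, replacing a chain of
 * "what == A || what == B || ..." equality tests. */
static int uses_retransmit_timer(enum timer_kind what)
{
	switch (what) {
	case T_RETRANS:
		/* fall through */
	case T_PROBE0:
		/* fall through */
	case T_LOSS_PROBE:
		return 1;   /* these all arm the retransmit timer */
	case T_DACK:
		return 0;   /* delayed ACK uses its own timer */
	default:
		return -1;  /* unexpected value, like the INET_CSK_DEBUG branch */
	}
}
```

A switch over a dense set of constants also lets the compiler emit a jump table instead of sequential comparisons, which is why the image-size delta in the commit message is only a few bytes.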
Re: [PATCH] net: net_cls: remove a NULL check for css_cls_state
On 4/20/18, David Miller <da...@davemloft.net> wrote:
> From: Li RongQing <lirongq...@baidu.com>
> Date: Thu, 19 Apr 2018 12:59:21 +0800
>
>> The input of css_cls_state() can never be NULL except in
>> cgrp_css_online, so simplify it
>>
>> Signed-off-by: Li RongQing <lirongq...@baidu.com>
>
> I don't view this as an improvement.  Just let the helper always check
> NULL and that way there are less situations to audit.
>

css_cls_state() may return NULL, but almost no caller checks its return
value against NULL, which seems inconsistent:

net/core/netclassid_cgroup.c:27:	return css_cls_state(task_css_check(p, net_cls_cgrp_id,
net/core/netclassid_cgroup.c:46:	struct cgroup_cls_state *cs = css_cls_state(css);
net/core/netclassid_cgroup.c:47:	struct cgroup_cls_state *parent = css_cls_state(css->parent);
net/core/netclassid_cgroup.c:57:	kfree(css_cls_state(css));
net/core/netclassid_cgroup.c:82:	(void *)(unsigned long)css_cls_state(css)->classid);
net/core/netclassid_cgroup.c:89:	return css_cls_state(css)->classid;
net/core/netclassid_cgroup.c:95:	struct cgroup_cls_state *cs = css_cls_state(css);

> And it's not like this is a critical fast path either.
>

css_cls_state() is called on packet transmit when CONFIG_NET_CLS_ACT and
CONFIG_NET_EGRESS are enabled; the call stack is:

css_cls_state
  task_cls_state
    task_get_classid
      cls_cgroup_classify
        tcf_classify
          sch_handle_egress
            __dev_queue_xmit

-RongQing

> I'm not applying this, sorry.
>
Re: [PATCH][net-next] net: ip tos cgroup
> -----Original Message-----
> From: Daniel Borkmann [mailto:dan...@iogearbox.net]
> Sent: April 17, 2018 22:11
> To: Li,Rongqing <lirongq...@baidu.com>
> Cc: netdev@vger.kernel.org; t...@kernel.org; a...@fb.com; bra...@fb.com
> Subject: Re: [PATCH][net-next] net: ip tos cgroup
>
> On 04/17/2018 05:36 AM, Li RongQing wrote:
> > The IP TOS field can be changed by setsockopt(IP_TOS), or by iptables;
> > this patch creates a new method to change the socket TOS of
> > processes based on cgroup
> >
> > The usage:
> >
> > 1. mount the ip_tos cgroup, and set the tos value
> >    mount -t cgroup -o ip_tos ip_tos /cgroups/tos
> >    echo tos_value >/cgroups/tos/ip_tos.tos
> > 2. then move processes into the cgroup, or create processes in it
> >
> > Signed-off-by: jimyan <jim...@baidu.com>
> > Signed-off-by: Li RongQing <lirongq...@baidu.com>
>
> This functionality is already possible through the help of BPF programs
> attached to cgroups, have you had a chance to look into that?
>

I think this method is easier to use than BPF, and more efficient.

-RongQing
[PATCH] net: net_cls: remove a NULL check for css_cls_state
The css argument of css_cls_state() can never be NULL except when it is
called from cgrp_css_online(), so simplify the helper.

Signed-off-by: Li RongQing <lirongq...@baidu.com>
---
 net/core/netclassid_cgroup.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/net/core/netclassid_cgroup.c b/net/core/netclassid_cgroup.c
index 5e4f04004a49..ee087cf793c2 100644
--- a/net/core/netclassid_cgroup.c
+++ b/net/core/netclassid_cgroup.c
@@ -19,7 +19,7 @@ static inline struct cgroup_cls_state *css_cls_state(struct cgroup_subsys_state *css)
 {
-	return css ? container_of(css, struct cgroup_cls_state, css) : NULL;
+	return container_of(css, struct cgroup_cls_state, css);
 }

 struct cgroup_cls_state *task_cls_state(struct task_struct *p)
@@ -44,10 +44,9 @@ cgrp_css_alloc(struct cgroup_subsys_state *parent_css)
 static int cgrp_css_online(struct cgroup_subsys_state *css)
 {
 	struct cgroup_cls_state *cs = css_cls_state(css);
-	struct cgroup_cls_state *parent = css_cls_state(css->parent);

-	if (parent)
-		cs->classid = parent->classid;
+	if (css->parent)
+		cs->classid = css_cls_state(css->parent)->classid;

 	return 0;
 }
--
2.11.0
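Why the NULL check exists at all is worth spelling out: when the embedded member is not the first field of the container, `container_of(NULL, ...)` yields a bogus non-NULL pointer, so a helper that may receive NULL must check before converting. This user-space sketch re-implements the pattern with illustrative type names (`struct css`, `struct cls_state`).

```c
#include <assert.h>
#include <stddef.h>

struct css { int id; };

struct cls_state {
	long classid;
	struct css css;  /* NOT the first member: offset is non-zero */
};

/* Same definition as the kernel's container_of, minus type checking. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* The check matters because container_of(NULL) would return
 * (struct cls_state *)-offsetof(...), a garbage non-NULL pointer. */
static struct cls_state *css_cls_state(struct css *css)
{
	return css ? container_of(css, struct cls_state, css) : NULL;
}
```

The patch's argument is that, except in cgrp_css_online(), the css argument provably cannot be NULL, so every other call site pays for a branch that never triggers — which is also why DaveM's counter-argument (fewer situations to audit) is a defensible trade-off.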
[PATCH][net-next] net: ip tos cgroup
ip tos segment can be changed by setsockopt(IP_TOS), or by iptables; this patch creates a new method to change socket tos segment of processes based on cgroup The usage: 1. mount ip_tos cgroup, and setting tos value mount -t cgroup -o ip_tos ip_tos /cgroups/tos echo tos_value >/cgroups/tos/ip_tos.tos 2. then move processes to cgroup, or create processes in cgroup Signed-off-by: jimyan <jim...@baidu.com> Signed-off-by: Li RongQing <lirongq...@baidu.com> --- include/linux/cgroup_subsys.h | 4 ++ include/net/tos_cgroup.h | 35 net/ipv4/Kconfig | 10 net/ipv4/Makefile | 1 + net/ipv4/af_inet.c| 2 + net/ipv4/tos_cgroup.c | 128 ++ net/ipv6/af_inet6.c | 2 + 7 files changed, 182 insertions(+) create mode 100644 include/net/tos_cgroup.h create mode 100644 net/ipv4/tos_cgroup.c diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h index acb77dcff3b4..1b86eda1c23e 100644 --- a/include/linux/cgroup_subsys.h +++ b/include/linux/cgroup_subsys.h @@ -61,6 +61,10 @@ SUBSYS(pids) SUBSYS(rdma) #endif +#if IS_ENABLED(CONFIG_IP_TOS_CGROUP) +SUBSYS(ip_tos) +#endif + /* * The following subsystems are not supported on the default hierarchy. 
*/ diff --git a/include/net/tos_cgroup.h b/include/net/tos_cgroup.h new file mode 100644 index ..0868e921faf3 --- /dev/null +++ b/include/net/tos_cgroup.h @@ -0,0 +1,35 @@ +/* SPDX-License-Identifier: GPL-2.0 */ + +#ifndef _IP_TOS_CGROUP_H +#define _IP_TOS_CGROUP_H + +#include +#include + +struct tos_cgroup_state { + struct cgroup_subsys_state css; + u32 tos; +}; + +#if IS_ENABLED(CONFIG_IP_TOS_CGROUP) +static inline u32 task_ip_tos(struct task_struct *p) +{ + u32 tos; + + if (in_interrupt()) + return 0; + + rcu_read_lock(); + tos = container_of(task_css(p, ip_tos_cgrp_id), + struct tos_cgroup_state, css)->tos; + rcu_read_unlock(); + + return tos; +} +#else /* !CONFIG_IP_TOS_CGROUP */ +static inline u32 task_ip_tos(struct task_struct *p) +{ + return 0; +} +#endif /* CONFIG_IP_TOS_CGROUP */ +#endif /* _IP_TOS_CGROUP_H */ diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig index 80dad301361d..57070bbb0394 100644 --- a/net/ipv4/Kconfig +++ b/net/ipv4/Kconfig @@ -753,3 +753,13 @@ config TCP_MD5SIG on the Internet. If unsure, say N. + +config IP_TOS_CGROUP + bool "ip tos cgroup" + depends on CGROUPS + default n + ---help--- + Say Y here if you want to set ip packet tos based on the + control cgroup of their process. 
+ + This can set ip packet tos diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile index a07b7dd06def..12c708142d1f 100644 --- a/net/ipv4/Makefile +++ b/net/ipv4/Makefile @@ -61,6 +61,7 @@ obj-$(CONFIG_TCP_CONG_LP) += tcp_lp.o obj-$(CONFIG_TCP_CONG_YEAH) += tcp_yeah.o obj-$(CONFIG_TCP_CONG_ILLINOIS) += tcp_illinois.o obj-$(CONFIG_NETLABEL) += cipso_ipv4.o +obj-$(CONFIG_IP_TOS_CGROUP) += tos_cgroup.o obj-$(CONFIG_XFRM) += xfrm4_policy.o xfrm4_state.o xfrm4_input.o \ xfrm4_output.o xfrm4_protocol.o diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c index eaed0367e669..e2dd902b06dd 100644 --- a/net/ipv4/af_inet.c +++ b/net/ipv4/af_inet.c @@ -120,6 +120,7 @@ #include #endif #include +#include #include @@ -356,6 +357,7 @@ static int inet_create(struct net *net, struct socket *sock, int protocol, inet->mc_index = 0; inet->mc_list = NULL; inet->rcv_tos = 0; + inet->tos = task_ip_tos(current); sk_refcnt_debug_inc(sk); diff --git a/net/ipv4/tos_cgroup.c b/net/ipv4/tos_cgroup.c new file mode 100644 index ..dbb828f5b464 --- /dev/null +++ b/net/ipv4/tos_cgroup.c @@ -0,0 +1,128 @@ +// SPDX-License-Identifier: GPL-2.0 +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +static inline +struct tos_cgroup_state *css_tos_cgroup(struct cgroup_subsys_state *css) +{ + return css ? 
container_of(css, struct tos_cgroup_state, css) : NULL; +} + +static inline struct tos_cgroup_state *task_tos_cgroup(struct task_struct *task) +{ + return css_tos_cgroup(task_css(task, ip_tos_cgrp_id)); +} + +static struct cgroup_subsys_state +*cgrp_css_alloc(struct cgroup_subsys_state *parent_css) +{ + struct tos_cgroup_state *cs; + + cs = kzalloc(sizeof(*cs), GFP_KERNEL); + if (!cs) + return ERR_PTR(-ENOMEM); + + return >css; +} + +static void cgrp_css_free(struct cgroup_subsys_state *css) +{ + kfree(css_tos_cgroup(css)); +} + +static int update_tos(const void *v, struct file *file, unsigned int n) +{ + int err; + struct socket *sock = sock_from_file(file, ); + u
[PATCH] net: ip tos cgroup
ip tos segment can be changed by setsockopt(IP_TOS), or by iptables; this patch creates a new method to change socket tos segment of processes based on cgroup The usage: 1. mount tos_cgroup, and setting tos value mount -t cgroup -o ip_tos ip_tos /cgroups/tos echo tos-value >/cgroups/tos/ip_tos.tos 2. then move processes to cgroup, or create processes in cgroup Signed-off-by: jimyan <jim...@baidu.com> Signed-off-by: Li RongQing <lirongq...@baidu.com> --- include/linux/cgroup_subsys.h | 4 ++ include/net/tos_cgroup.h | 46 ++ net/ipv4/Kconfig | 9 +++ net/ipv4/Makefile | 1 + net/ipv4/af_inet.c| 2 + net/ipv4/tos_cgroup.c | 145 ++ net/ipv6/af_inet6.c | 2 + 7 files changed, 209 insertions(+) create mode 100644 include/net/tos_cgroup.h create mode 100644 net/ipv4/tos_cgroup.c diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h index acb77dcff3b4..1b86eda1c23e 100644 --- a/include/linux/cgroup_subsys.h +++ b/include/linux/cgroup_subsys.h @@ -61,6 +61,10 @@ SUBSYS(pids) SUBSYS(rdma) #endif +#if IS_ENABLED(CONFIG_IP_TOS_CGROUP) +SUBSYS(ip_tos) +#endif + /* * The following subsystems are not supported on the default hierarchy. */ diff --git a/include/net/tos_cgroup.h b/include/net/tos_cgroup.h new file mode 100644 index ..45c33733c4e9 --- /dev/null +++ b/include/net/tos_cgroup.h @@ -0,0 +1,46 @@ +/* + * tos_cgroup.hIP TOS Control Group + * + * Authors:Li RongQing <lirongq...@baidu.com> + * Jim Yan <jim...@baidu.com> + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License as published by the Free + * Software Foundation; either version 2 of the License, or (at your option) + * any later version. 
+ * + */ + +#ifndef _IP_TOS_CGROUP_H +#define _IP_TOS_CGROUP_H + +#include +#include + +struct tos_cgroup_state { + struct cgroup_subsys_state css; + u32 tos; +}; + +#if IS_ENABLED(CONFIG_IP_TOS_CGROUP) +static inline u32 task_ip_tos(struct task_struct *p) +{ + u32 tos; + + if (in_interrupt()) + return 0; + + rcu_read_lock(); + tos = container_of(task_css(p, ip_tos_cgrp_id), + struct tos_cgroup_state, css)->tos; + rcu_read_unlock(); + + return tos; +} +#else /* !CONFIG_IP_TOS_CGROUP */ +static inline u32 task_ip_tos(struct task_struct *p) +{ + return 0; +} +#endif /* CONFIG_IP_TOS_CGROUP */ +#endif /* _IP_TOS_CGROUP_H */ diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig index f48fe6fc7e8c..6f8ce1b2ceb0 100644 --- a/net/ipv4/Kconfig +++ b/net/ipv4/Kconfig @@ -748,3 +748,12 @@ config TCP_MD5SIG on the Internet. If unsure, say N. + +config IP_TOS_CGROUP + bool "ip tos cgroup" + depends on CGROUPS + ---help--- + Say Y here if you want to set ip packet tos based on the + control cgroup of their process. 
+ + This can set ip packet tos diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile index 47a0a6649a9d..a2734c50db2e 100644 --- a/net/ipv4/Makefile +++ b/net/ipv4/Makefile @@ -60,6 +60,7 @@ obj-$(CONFIG_TCP_CONG_LP) += tcp_lp.o obj-$(CONFIG_TCP_CONG_YEAH) += tcp_yeah.o obj-$(CONFIG_TCP_CONG_ILLINOIS) += tcp_illinois.o obj-$(CONFIG_NETLABEL) += cipso_ipv4.o +obj-$(CONFIG_IP_TOS_CGROUP) += tos_cgroup.o obj-$(CONFIG_XFRM) += xfrm4_policy.o xfrm4_state.o xfrm4_input.o \ xfrm4_output.o xfrm4_protocol.o diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c index e4329e161943..90842bedb500 100644 --- a/net/ipv4/af_inet.c +++ b/net/ipv4/af_inet.c @@ -120,6 +120,7 @@ #include #endif #include +#include #include @@ -356,6 +357,7 @@ static int inet_create(struct net *net, struct socket *sock, int protocol, inet->mc_index = 0; inet->mc_list = NULL; inet->rcv_tos = 0; + inet->tos = task_ip_tos(current); sk_refcnt_debug_inc(sk); diff --git a/net/ipv4/tos_cgroup.c b/net/ipv4/tos_cgroup.c new file mode 100644 index ..17d2d7c02871 --- /dev/null +++ b/net/ipv4/tos_cgroup.c @@ -0,0 +1,145 @@ +/* + * net/ipv4/tos_cgroup.c IP TOS Control Group + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + * + * Authors:Li RongQing <lirongq...@baidu.com> + * jimyan <jim...@baidu.com> + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +static inline +struct
[PATCH] net: sched: do not emit messages while holding spinlock
Move message emission out of sch_tree_lock() to avoid holding the lock
for too long.

Signed-off-by: Li RongQing <lirongq...@baidu.com>
---
 net/sched/sch_htb.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
index 1ea9846cc6ce..2a4ab7caf553 100644
--- a/net/sched/sch_htb.c
+++ b/net/sched/sch_htb.c
@@ -1337,6 +1337,7 @@ static int htb_change_class(struct Qdisc *sch, u32 classid,
 	struct nlattr *tb[TCA_HTB_MAX + 1];
 	struct tc_htb_opt *hopt;
 	u64 rate64, ceil64;
+	int warn = 0;

 	/* extract all subattrs from opt attr */
 	if (!opt)
@@ -1499,13 +1500,11 @@ static int htb_change_class(struct Qdisc *sch, u32 classid,
 		cl->quantum = min_t(u64, quantum, INT_MAX);

 		if (!hopt->quantum && cl->quantum < 1000) {
-			pr_warn("HTB: quantum of class %X is small. Consider r2q change.\n",
-				cl->common.classid);
+			warn = -1;
 			cl->quantum = 1000;
 		}
 		if (!hopt->quantum && cl->quantum > 200000) {
-			pr_warn("HTB: quantum of class %X is big. Consider r2q change.\n",
-				cl->common.classid);
+			warn = 1;
 			cl->quantum = 200000;
 		}
 		if (hopt->quantum)
@@ -1519,6 +1518,10 @@ static int htb_change_class(struct Qdisc *sch, u32 classid,

 	sch_tree_unlock(sch);

+	if (warn)
+		pr_warn("HTB: quantum of class %X is %s. Consider r2q change.\n",
+			cl->common.classid, (warn == -1 ? "small" : "big"));
+
 	qdisc_class_hash_grow(sch, &q->clhash);

 	*arg = (unsigned long)cl;
--
2.11.0
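The technique — record what to warn about while the lock is held, and emit the (potentially slow) message only after dropping it — can be sketched in user-space C. The toy lock and the clamp bounds here are illustrative, not the HTB code.

```c
#include <assert.h>
#include <stdio.h>

static int tree_lock;   /* stand-in for sch_tree_lock(); 1 = held */

static void lock(void)   { assert(!tree_lock); tree_lock = 1; }
static void unlock(void) { assert(tree_lock);  tree_lock = 0; }

/* Clamp a value under the lock, but defer the warning message until the
 * lock is released: printing can be slow (console, serial line) and must
 * not extend the critical section. */
static int clamp_quantum(long *quantum)
{
	int warn = 0;

	lock();
	if (*quantum < 1000) {
		warn = -1;             /* too small */
		*quantum = 1000;
	} else if (*quantum > 200000) {
		warn = 1;              /* too big */
		*quantum = 200000;
	}
	unlock();

	if (warn) {
		assert(!tree_lock);    /* invariant: never log while holding it */
		fprintf(stderr, "quantum is %s, clamped\n",
			warn == -1 ? "small" : "big");
	}
	return warn;
}
```

The single flag encodes both "a warning is needed" and "which one", so the message text only has to exist in one place after the unlock, exactly as the patch does with its `warn` variable.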
[PATCH] tcp: release sk_frag.page in tcp_disconnect
A socket can be disconnected and transformed back into a listening
socket. If sk_frag.page is not released, it will be cloned into a new
socket by sk_clone_lock(); but the reference count of this page is not
increased, leading to a use-after-free or double-free issue.

Signed-off-by: Li RongQing <lirongq...@baidu.com>
Cc: Eric Dumazet <eduma...@google.com>
---
 net/ipv4/tcp.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index f08eebe60446..73f068406519 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2431,6 +2431,12 @@ int tcp_disconnect(struct sock *sk, int flags)

 	WARN_ON(inet->inet_num && !icsk->icsk_bind_hash);

+	if (sk->sk_frag.page) {
+		put_page(sk->sk_frag.page);
+		sk->sk_frag.page = NULL;
+		sk->sk_frag.offset = 0;
+	}
+
 	sk->sk_error_report(sk);
 	return err;
 }
--
2.11.0
Re: Re: [PATCH] net: clean the sk_frag.page of new cloned socket
> -----Original Message-----
> From: Eric Dumazet [mailto:eric.duma...@gmail.com]
> Sent: January 26, 2018 11:14
> To: Li,Rongqing <lirongq...@baidu.com>; netdev@vger.kernel.org
> Cc: eduma...@google.com
> Subject: Re: Re: [PATCH] net: clean the sk_frag.page of new cloned socket
>
> On Fri, 2018-01-26 at 02:09 +, Li,Rongqing wrote:
> >
> > crash> bt 8683
> > PID: 8683  TASK: 881faa088000  CPU: 10  COMMAND: "mynode"
> >  #0 [881fff145e78] crash_nmi_callback at 81031712
> >  #1 [881fff145e88] nmi_handle at 816cafe9
> >  #2 [881fff145ec8] do_nmi at 816cb0f0
> >  #3 [881fff145ef0] end_repeat_nmi at 816ca4a1
> >     [exception RIP: _raw_spin_lock_irqsave+62]
> >     RIP: 816c9a9e  RSP: 881fa992b990  RFLAGS: 0002
> >     RAX: 4358  RBX: 88207ffd7e80  RCX: 4358
> >     RDX: 4356  RSI: 0246  RDI: 88207ffd7ee8
> >     RBP: 881fa992b990  R8:  R9: 019a16e6
> >     R10: 4d24  R11: 4000  R12: 0242
> >     R13: 4d24  R14: 0001  R15:
> >     ORIG_RAX:  CS: 0010  SS: 0018
> > --- ---
> >  #4 [881fa992b990] _raw_spin_lock_irqsave at 816c9a9e
> >  #5 [881fa992b998] get_page_from_freelist at 8113ce5f
> >  #6 [881fa992ba70] __alloc_pages_nodemask at 8113d15f
> >  #7 [881fa992bba0] alloc_pages_current at 8117ab29
> >  #8 [881fa992bbe8] sk_page_frag_refill at 815dd310
> >  #9 [881fa992bc18] tcp_sendmsg at 8163e4f3
> > #10 [881fa992bcd8] inet_sendmsg at 81668434
> > #11 [881fa992bd08] sock_sendmsg at 815d9719
> > #12 [881fa992be58] SYSC_sendto at 815d9c81
> > #13 [881fa992bf70] sys_sendto at 815da6ae
> > #14 [881fa992bf80] system_call_fastpath at 816d2189
>
> Note that tcp_sendmsg() does not use sk->sk_frag, but the per task page.
> Unless something changes sk->sk_allocation, which a user application can
> not do.
>
> Are you using a pristine upstream kernel ?

No.

I do not know how to reproduce my bug; I have seen it twice online.

-RongQing
Re: Re: [PATCH] net: clean the sk_frag.page of new cloned socket
> > my kernel is 3.10, I did not find the root cause, I guessed at all
> > kinds of possibilities
> >
> Have you backported 22a0e18eac7a9e986fec76c60fa4a2926d1291e2 ?
>

When I hit this bug I found that commit and backported it, but it seems
unrelated to my bug.

> > > I would rather move that in tcp_disconnect() that only fuzzers use,
> > > instead of doing this on every clone and slowing down normal users.
> > >
> >
> > Do you mean we should fix it like below:
> >
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index f08eebe60446..44f8320610ab 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -2431,6 +2431,12 @@ int tcp_disconnect(struct sock *sk, int flags)
> >
> > 	WARN_ON(inet->inet_num && !icsk->icsk_bind_hash);
> >
> > +
> > +	if (sk->sk_frag.page) {
> > +		put_page(sk->sk_frag.page);
> > +		sk->sk_frag.page = NULL;
> > +	}
> > +
> > 	sk->sk_error_report(sk);
> > 	return err;
> > }
>
> Yes, something like that.

Ok, thanks

-R
Re: [PATCH] net: clean the sk_frag.page of new cloned socket
> > if (newsk->sk_prot->sockets_allocated)
> >         sk_sockets_allocated_inc(newsk);
>
> Good catch.
>
> I suspect this was discovered by some syzkaller/syzbot run ?

No. I am seeing a panic: a page is on both task.task_frag.page and the buddy free list, which should not happen. The page's lru.next and lru.prev hold the list-poison values 0xdead00100100/0xdead00200200, so when a page is next allocated from the buddy allocator the system panics at __list_del in __rmqueue:

 #0 [881fff0c3850] machine_kexec at 8103cca8
 #1 [881fff0c38a0] crash_kexec at 810c2443
 #2 [881fff0c3968] oops_end at 816cae70
 #3 [881fff0c3990] die at 810063eb
 #4 [881fff0c39c0] do_general_protection at 816ca7ce
 #5 [881fff0c39f0] general_protection at 816ca0d8
    [exception RIP: __rmqueue+120]
    RIP: 8113a918  RSP: 881fff0c3aa0  RFLAGS: 00010046
    RAX: 88207ffd8018  RBX: 0003  RCX: 0003
    RDX: 0001  RSI: ea006f4cf620  RDI: dead00200200
    RBP: 881fff0c3b00  R8: 88207ffd8018  R9:
    R10: dead00100100  R11: ea007ecc6480  R12: ea006f4cf600
    R13:  R14: 0003  R15: 88207ffd7e80
    ORIG_RAX:  CS: 0010  SS:
 #6 [881fff0c3b08] get_page_from_freelist at 8113ce71
 #7 [881fff0c3be0] __alloc_pages_nodemask at 8113d15f
 #8 [881fff0c3d10] __alloc_page_frag at 815e2362
 #9 [881fff0c3d40] __netdev_alloc_frag at 815e241b
#10 [881fff0c3d58] __alloc_rx_skb at 815e2f91
#11 [881fff0c3d78] __netdev_alloc_skb at 815e300b
#12 [881fff0c3d90] ixgbe_clean_rx_irq at a003a98f [ixgbe]
#13 [881fff0c3df8] ixgbe_poll at a003c233 [ixgbe]
#14 [881fff0c3e70] net_rx_action at 815f2f09
#15 [881fff0c3ec8] __do_softirq at 81064867
#16 [881fff0c3f38] call_softirq at 816d3a9c
#17 [881fff0c3f50] do_softirq at 81004e65
#18 [881fff0c3f68] irq_exit at 81064b7d
#19 [881fff0c3f78] do_IRQ at 816d4428

The page info is like below (some fields were removed):

crash> struct page ea006f4cf600 -x
struct page {
  flags = 0x2f4000,
  mapping = 0x0,
  {
    {
      counters = 0x2,
      {
        {
          _mapcount = {
            counter = 0x
          },
          {
            inuse = 0x,
            objects = 0x7fff,
            frozen = 0x1
          },
          units = 0x
        },
        _count = {
          counter = 0x2
        }
      }
    }
  },
  {
    lru = {
      next = 0xdead00100100,
      prev = 0xdead00200200
    },
  },
  .....
}
crash>

The page ea006f4cf600 is in another task's task_frag.page, and that task's backtrace is like below:

crash> task 8683 | grep ea006f4cf600 -A3
    page = 0xea006f4cf600,
    offset = 32768,
    size = 32768
  },
crash>
crash> bt 8683
PID: 8683  TASK: 881faa088000  CPU: 10  COMMAND: "mynode"
 #0 [881fff145e78] crash_nmi_callback at 81031712
 #1 [881fff145e88] nmi_handle at 816cafe9
 #2 [881fff145ec8] do_nmi at 816cb0f0
 #3 [881fff145ef0] end_repeat_nmi at 816ca4a1
    [exception RIP: _raw_spin_lock_irqsave+62]
    RIP: 816c9a9e  RSP: 881fa992b990  RFLAGS: 0002
    RAX: 4358  RBX: 88207ffd7e80  RCX: 4358
    RDX: 4356  RSI: 0246  RDI: 88207ffd7ee8
    RBP: 881fa992b990  R8:  R9: 019a16e6
    R10: 4d24  R11: 4000  R12: 0242
    R13: 4d24  R14: 0001  R15:
    ORIG_RAX:  CS: 0010  SS: 0018
--- ---
 #4 [881fa992b990] _raw_spin_lock_irqsave at 816c9a9e
 #5 [881fa992b998] get_page_from_freelist at 8113ce5f
 #6 [881fa992ba70] __alloc_pages_nodemask at 8113d15f
 #7 [881fa992bba0] alloc_pages_current at 8117ab29
 #8 [881fa992bbe8] sk_page_frag_refill at 815dd310
 #9 [881fa992bc18] tcp_sendmsg at 8163e4f3
#10 [881fa992bcd8] inet_sendmsg at 81668434
#11 [881fa992bd08] sock_sendmsg at 815d9719
#12 [881fa992be58] SYSC_sendto at 815d9c81
#13 [881fa992bf70] sys_sendto at 815da6ae
#14 [881fa992bf80] system_call_fastpath at 816d2189
    RIP: 7f5bfe1d804b  RSP: 7f5bfa63b3b0  RFLAGS: 0206
    RAX: 002c  RBX: 816d2189  RCX: 7f5bfa63b420
    RDX: 2000  RSI: 0c096000  RDI: 0040
    RBP:  R8:  R9:
    R10:  R11: 0246  R12: 815da6ae
    R13: 881fa992bf78  R14: a552  R15: 0016
    ORIG_RAX: 002c  CS: 0033  SS: 002b
crash>

My kernel is 3.10.
[PATCH] net: clean the sk_frag.page of new cloned socket
Clean the sk_frag.page of a newly cloned socket; otherwise it will wrongly be released twice, since the reference count of this sk_frag page is not increased for the clone.

sk_clone_lock() is used to clone a new socket from a socket in the listening state, which has no sk_frag.page. But a socket that has sent data can be transformed back into a listening socket, and it will then allocate a tcp_sock through sk_clone_lock() when a new connection comes in.

Signed-off-by: Li RongQing <lirongq...@baidu.com>
Cc: Eric Dumazet <eduma...@google.com>
---
 net/core/sock.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/core/sock.c b/net/core/sock.c
index c0b5b2f17412..c845856f26da 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1738,6 +1738,8 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
 	sk_refcnt_debug_inc(newsk);
 	sk_set_socket(newsk, NULL);
 	newsk->sk_wq = NULL;
+	newsk->sk_frag.page = NULL;
+	newsk->sk_frag.offset = 0;

 	if (newsk->sk_prot->sockets_allocated)
 		sk_sockets_allocated_inc(newsk);
--
2.11.0
Re: [PATCH][net-next] ipv6: replace write lock with read lock in addrconf_permanent_addr
On Tue, Mar 15, 2016 at 12:25 AM, David Miller wrote:
>
> We need it for the modifications made by fixup_permanent_addr().

fixup_permanent_addr() should be protected by ifp->lock, not by the idev->lock write lock, since it is ifp that is modified.

-Roy
Re: [PATCH][net-next][v2] bridge: allow the maximum mtu to 64k
On Thu, Feb 25, 2016 at 5:44 AM, Stephen Hemminger wrote:
>> This is especially annoying for the virtualization case because the
>> KVM's tap driver will by default adopt the bridge's MTU on startup
>> making it impossible (without the workaround) to use a large MTU on the
>> guest VMs.
>>
>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1399064
>
> This use case looks like KVM misusing bridge MTU. I.e it should set TAP
> MTU to what it wants then enslave it, not vice versa.

1. A user should be able to configure an empty bridge with an MTU higher than 1500.

2. If the tap MTU is configured to a higher value first while the other ports have a lower one, the lower path MTU will be used, which may reduce performance.

The configuration process is written into libvirt (in virnetdevtap.c); of course it could be improved to fix this issue.

https://www.redhat.com/archives/libvir-list/2008-December/msg00083.html

-R
Re: [PATCH][net-next] bridge: increase mtu to 9000
On Tue, Feb 23, 2016 at 1:58 AM, Stephen Hemminger <step...@networkplumber.org> wrote:
>> guest VMs.
>>
>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1399064
>>
>> Signed-off-by: Li RongQing <roy.qing...@gmail.com>
>
> Your change works, but I agree with Hannes. Just allow up to 64 * 1024 like
> loopback does. And no need for a #define for that it is only in one place.

Thanks, I will change it as you suggest.

-Roy
Re: question about vrf-lite
>>
>> is it right?
>
> no. The above works fine for me. I literally copied and pasted all of the
> commands except the master ones which were adapted to my setup -- eth9 and
> eth11 for me instead of eth0 and eth1. tcpdump on N2, N3 show the right one
> is receiving packets based on which 'ping -I vrf' is run.
>
> Do tables 5 and 6 have the right routes?

Thanks, David. It is not a VRF issue; it is my configuration issue with qemu. I am testing VRF on qemu, and the issue/solution is the same as in the link below:

https://lists.gnu.org/archive/html/qemu-discuss/2014-06/msg00059.html

-Roy
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] net: sysctl: fix a kmemleak warning
On Fri, Oct 23, 2015 at 6:04 PM, David Miller wrote:
>> +out1:
>> +	unregister_sysctl_table(net_header);
>> +	kfree(net_header);
>
> I read over unregister_sysctl_table() and it appears to do the kfree()
> for us, doesn't it?

You are right, thanks.

-Roy
Re: [PATCH net-next v2] ipconfig: send Client-identifier in DHCP requests
On Thu, Oct 15, 2015 at 10:56 PM, Florian Fainelli wrote:
> Did not you mean strlen(dhcp_client_identifer) + 1 instead?

No; dhcp_client_identifier[0] is the client identifier type, and it may be 0; dhcp_client_identifier + 1 is the start address of the client identifier value.

-Roy
Re: [bug report or not] ping6 will lost packets when ping6 lots of ipv6 address
On Wed, Oct 14, 2015 at 12:11 AM, Martin KaFai Lau <ka...@fb.com> wrote:
> On Tue, Oct 13, 2015 at 08:46:49PM +0800, Li RongQing wrote:
>> 1. in a machine, configure 3000 ipv6 address in one interface
>>
>> for i in {1..3000}; do ip -6 addr add 4001:5013::$i/0 dev eth0; done
>>
>> 2. in other machine, ping6 the upper configured ipv6 address, then
>> lots of lost packets
>>
>> ip -6 addr add 4001:5013::0/64 dev eth0
>> for i in {1..2000}; do ping6 -q -c1 4001:5013::$i; done;
>>
>> 3. increasing the gc thresh can handles these lost
>>
>> sysctl -w net.ipv6.neigh.default.gc_thresh1=2000
>> sysctl -w net.ipv6.neigh.default.gc_thresh2=3000
>> sysctl -w net.ipv6.neigh.default.gc_thresh3=4000
>> sysctl -w net.ipv6.route.gc_thresh=3000
>> sysctl -w net.ipv6.route.max_size=3000
>
> Which kernel is used in this test?

All versions. I think this should not be a bug: this test makes the number of neighbour entries exceed net.ipv6.neigh.default.gc_thresh3, so new neighbour entries cannot be allocated and ping loses packets.

Thanks
-Roy
Re: ICMPv6 too big Packet will makes the network unreachable
On Tue, Oct 13, 2015 at 10:26 PM, Hannes Frederic Sowa wrote:
>> root@du1:~# ip route get 2001:1b70:82a8:18:650:65:0:2
>> 2001:1b70:82a8:18:650:65:0:2 dev eth10.650 src
>> 2001:1b70:82a8:18:650:65:0:2 metric 0
>>     cache
>> root@du1:~#
>
> Which kernel version did you test this on?
>
> Thanks,
> Hannes

I think it is all versions.

-Roy
Re: ICMPv6 too big Packet will makes the network unreachable
On Wed, Oct 14, 2015 at 5:18 PM, Sheng Yong <shengyo...@huawei.com> wrote:
> Hi, Rongqing,
>
> Cced Martin KaFai Lau <ka...@fb.com>
>
> It seems you trigger the problem that I met before, here is the link of
> disucssion:
> http://www.spinics.net/lists/netdev/msg314717.html
>
> You can try these patches to check if they resolve your problem:
> 7035870d1219 | 2015-05-03 | ipv6: Check RTF_LOCAL on rt->rt6i_flags instead of rt->dst.flags
> 653437d02f1f | 2015-04-28 | ipv6: Stop /128 route from disappearing after pmtu update
>
> thanks,
> Sheng
>
> On 10/13/2015 3:09 PM, Li RongQing wrote:
>> 1. Machine with 2001:1b70:82a8:18:650:65:0:2 address, and receive wrong
>> icmp packets
>>
>> root@du1:~# ifconfig
>> eth10.650 Link encap:Ethernet  HWaddr 74:c9:9a:a7:e5:88
>>           inet6 addr: fe80::76c9:9aff:fea7:e588/64 Scope:Link
>>           inet6 addr: 2001:1b70:82a8:18:650:65:0:2/80 Scope:Global
>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>           RX packets:1 errors:0 dropped:0 overruns:0 frame:0
>>           TX packets:9 errors:0 dropped:0 overruns:0 carrier:0
>>           collisions:0 txqueuelen:0
>>           RX bytes:104 (104.0 B)  TX bytes:934 (934.0 B)
>>
>> 2. ICMPv6 packet is as below.
>>
>> ###[ Ethernet ]###
>>   dst = 74:C9:9A:A7:E5:88
>>   src = ae:4f:44:f2:10:cc
>>   type = 0x86dd
>> ###[ IPv6 ]###
>>   version = 6
>>   tc = 0
>>   fl = 0
>>   plen = None
>>   nh = ICMPv6
>>   hlim = 64
>>   src = 2001:1b70:82a8:18:650:65:0:4
>>   dst = 2001:1b70:82a8:18:650:65:0:2
>> ###[ ICMPv6 Packet Too Big ]###
>>   type = Packet too big
>>   code = 0
>>   cksum = None
>>   mtu = 1280
>> ###[ IPv6 ]###
>>   version = 6
>>   tc = 0
>>   fl = 0
>>   plen = None
>>   nh = ICMPv6
>>   hlim = 255
>>   src = 2001:1b70:82a8:18:650:65:0:2
>>   dst = 2001:1b70:82a8:18:650:65:0:2
>> ###[ ICMPv6 Neighbor Discovery - Neighbor Advertisement ]###
>>   type = Neighbor Advertisement
>>   code = 0
>>   cksum = None
>>   R = 1
>>   S = 0
>>   O = 1
>>   res = 0x0
>>   tgt = 2001:1b70:82a8:18:650:65:0:2
>>
>> # Test #
>>
>> 3. Send ICMPv6 with Scapy to trigger fault.
>>
>> conf.iface='eth1'
>> eth = Ether(src='ae:4f:44:f2:10:cc', dst='74:C9:9A:A7:E5:88')
>> base = IPv6(src='2001:1b70:82a8:18:650:65:0:4',
>>             dst='2001:1b70:82a8:18:650:65:0:2')
>> ptb = ICMPv6PacketTooBig(type=2)
>> packet = eth/base/ptb
>> ptb_payload_na_base = IPv6(src='2001:1b70:82a8:18:650:65:0:2',
>>                            dst='2001:1b70:82a8:18:650:65:0:2')
>> ptb_payload_na = ICMPv6ND_NA(type=136, tgt='2001:1b70:82a8:18:650:65:0:2')
>> ptb_payload = ptb_payload_na_base/ptb_payload_na
>> packet = packet/ptb_payload
>> sendp(packet, iface="eth1.650", count=1)
>>
>> 4. route information will enter the faulty state after Wait 600 seconds,
>>
>> root@du1:~# ip route get 2001:1b70:82a8:18:650:65:0:2
>> local 2001:1b70:82a8:18:650:65:0:2 dev lo proto none src
>> 2001:1b70:82a8:18:650:65:0:2 metric 0 expires 7sec mtu 1280
>>
>> root@du1:~# ip route get 2001:1b70:82a8:18:650:65:0:2
>> local 2001:1b70:82a8:18:650:65:0:2 dev lo proto none src
>> 2001:1b70:82a8:18:650:65:0:2 metric 0 expires 3sec mtu 1280
>>
>> root@du1:~# ip route get 2001:1b70:82a8:18:650:65:0:2
>> 2001:1b70:82a8:18:650:65:0:2 dev eth10.650 src
>> 2001:1b70:82a8:18:650:65:0:2 metric 0
>>     cache
>> root@du1:~#
Re: ICMPv6 too big Packet will makes the network unreachable
On Wed, Oct 14, 2015 at 5:53 PM, Li RongQing <roy.qing...@gmail.com> wrote:
> On Wed, Oct 14, 2015 at 5:18 PM, Sheng Yong <shengyo...@huawei.com> wrote:
>> Hi, Rongqing,
>>
>> Cced Martin KaFai Lau <ka...@fb.com>
>>
>> It seems you trigger the problem that I met before, here is the link of
>> disucssion:
>> http://www.spinics.net/lists/netdev/msg314717.html
>>
>> You can try these patches to check if they resolve your problem:
>> 7035870d1219 | 2015-05-03 | ipv6: Check RTF_LOCAL on rt->rt6i_flags instead of rt->dst.flags
>> 653437d02f1f | 2015-04-28 | ipv6: Stop /128 route from disappearing after pmtu update
>>
>> thanks,
>> Sheng

You are right, these two commits fixed my issue.

Thanks
-Roy
Re: [PATCH] ipconfig: send Client-identifier in DHCP requests
On Thu, Oct 15, 2015 at 11:27 AM, kbuild test robot wrote:
> Hi Li,
>
> [auto build test WARNING on net/master -- if it's inappropriate base, please
> suggest rules for selecting the more suitable base]
>
> url: https://github.com/0day-ci/linux/commits/roy-qing-li-gmail-com/ipconfig-send-Client-identifier-in-DHCP-requests/20151015-105553
> config: parisc-c3000_defconfig (attached as .config)
> reproduce:
>         wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
>         chmod +x ~/bin/make.cross
>         # save the attached .config to linux build tree
>         make.cross ARCH=parisc
>
> All warnings (new ones prefixed by >>):
>
>>> net/ipv4/ipconfig.c:148:13: warning: 'dhcp_client_identifier' defined but not used [-Wunused-variable]
>     static char dhcp_client_identifier[253] __initdata;
>                 ^

Thanks, I will fix it.

-Roy
ICMPv6 too big Packet will makes the network unreachable
1. Machine with the 2001:1b70:82a8:18:650:65:0:2 address receives wrong ICMP packets:

root@du1:~# ifconfig
eth10.650 Link encap:Ethernet  HWaddr 74:c9:9a:a7:e5:88
          inet6 addr: fe80::76c9:9aff:fea7:e588/64 Scope:Link
          inet6 addr: 2001:1b70:82a8:18:650:65:0:2/80 Scope:Global
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1 errors:0 dropped:0 overruns:0 frame:0
          TX packets:9 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:104 (104.0 B)  TX bytes:934 (934.0 B)

2. The ICMPv6 packet is as below:

###[ Ethernet ]###
  dst = 74:C9:9A:A7:E5:88
  src = ae:4f:44:f2:10:cc
  type = 0x86dd
###[ IPv6 ]###
  version = 6
  tc = 0
  fl = 0
  plen = None
  nh = ICMPv6
  hlim = 64
  src = 2001:1b70:82a8:18:650:65:0:4
  dst = 2001:1b70:82a8:18:650:65:0:2
###[ ICMPv6 Packet Too Big ]###
  type = Packet too big
  code = 0
  cksum = None
  mtu = 1280
###[ IPv6 ]###
  version = 6
  tc = 0
  fl = 0
  plen = None
  nh = ICMPv6
  hlim = 255
  src = 2001:1b70:82a8:18:650:65:0:2
  dst = 2001:1b70:82a8:18:650:65:0:2
###[ ICMPv6 Neighbor Discovery - Neighbor Advertisement ]###
  type = Neighbor Advertisement
  code = 0
  cksum = None
  R = 1
  S = 0
  O = 1
  res = 0x0
  tgt = 2001:1b70:82a8:18:650:65:0:2

# Test #

3. Send the ICMPv6 packet with Scapy to trigger the fault:

conf.iface='eth1'
eth = Ether(src='ae:4f:44:f2:10:cc', dst='74:C9:9A:A7:E5:88')
base = IPv6(src='2001:1b70:82a8:18:650:65:0:4',
            dst='2001:1b70:82a8:18:650:65:0:2')
ptb = ICMPv6PacketTooBig(type=2)
packet = eth/base/ptb
ptb_payload_na_base = IPv6(src='2001:1b70:82a8:18:650:65:0:2',
                           dst='2001:1b70:82a8:18:650:65:0:2')
ptb_payload_na = ICMPv6ND_NA(type=136, tgt='2001:1b70:82a8:18:650:65:0:2')
ptb_payload = ptb_payload_na_base/ptb_payload_na
packet = packet/ptb_payload
sendp(packet, iface="eth1.650", count=1)

4. The route information enters the faulty state after waiting 600 seconds:

root@du1:~# ip route get 2001:1b70:82a8:18:650:65:0:2
local 2001:1b70:82a8:18:650:65:0:2 dev lo proto none src
2001:1b70:82a8:18:650:65:0:2 metric 0 expires 7sec mtu 1280

root@du1:~# ip route get 2001:1b70:82a8:18:650:65:0:2
local 2001:1b70:82a8:18:650:65:0:2 dev lo proto none src
2001:1b70:82a8:18:650:65:0:2 metric 0 expires 3sec mtu 1280

root@du1:~# ip route get 2001:1b70:82a8:18:650:65:0:2
2001:1b70:82a8:18:650:65:0:2 dev eth10.650 src
2001:1b70:82a8:18:650:65:0:2 metric 0
    cache
root@du1:~#
[bug report or not] ping6 will lost packets when ping6 lots of ipv6 address
1. On one machine, configure 3000 IPv6 addresses on one interface:

for i in {1..3000}; do ip -6 addr add 4001:5013::$i/0 dev eth0; done

2. From another machine, ping6 the addresses configured above; many packets are lost:

ip -6 addr add 4001:5013::0/64 dev eth0
for i in {1..2000}; do ping6 -q -c1 4001:5013::$i; done;

3. Increasing the gc thresholds handles these losses:

sysctl -w net.ipv6.neigh.default.gc_thresh1=2000
sysctl -w net.ipv6.neigh.default.gc_thresh2=3000
sysctl -w net.ipv6.neigh.default.gc_thresh3=4000
sysctl -w net.ipv6.route.gc_thresh=3000
sysctl -w net.ipv6.route.max_size=3000

-Roy
Re: [PATCH] xfrm: fix the xfrm_policy/state_walk
On Wed, Apr 22, 2015 at 3:57 PM, Herbert Xu <herb...@gondor.apana.org.au> wrote:
>> Signed-off-by: Li RongQing <roy.qing...@gmail.com>
>
> This is not a bug fix but an optimisation. The walker entries are all
> marked as dead and will be skipped by the loop. However, I don't see
> anything wrong with this optimisation.
>
> Cheers,

Thanks, you are right. I will rewrite the commit header.

-Roy