Re: [PATCH ipsec-next] xfrm: policy: fix policy hash rebuild
On Tue, Nov 27, 2018 at 01:28:54PM +0100, Florian Westphal wrote: > Dan Carpenter reports following static checker warning: > net/xfrm/xfrm_policy.c:1316 xfrm_hash_rebuild() > warn: 'dir' is out of bounds '3' vs '2' > > | 1280 /* reset the bydst and inexact table in all directions */ > | 1281 xfrm_hash_reset_inexact_table(net); > | 1282 > | 1283 for (dir = 0; dir < XFRM_POLICY_MAX; dir++) { > | ^ > |dir == XFRM_POLICY_MAX at the end of this loop. > | 1304 /* re-insert all policies by order of creation */ > | 1305 list_for_each_entry_reverse(policy, >xfrm.policy_all, > walk.all) { > [..] > | 1314 > xfrm_policy_id2dir(policy->index)); > | 1315 if (!chain) { > | 1316 void *p = > xfrm_policy_inexact_insert(policy, dir, 0); > > Fix this by updating 'dir' based on current policy. Otherwise, the > inexact policies won't be found anymore during lookup, as they get > hashed to a bogus bin. > > Reported-by: Dan Carpenter > Fixes: cc1bb845adc9 ("xfrm: policy: return NULL when inexact search needed") > Signed-off-by: Florian Westphal Applied, thanks Florian!
Re: [PATCH][xfrm-next] xfrm6: remove BUG_ON from xfrm6_dst_ifdown
On Mon, Nov 12, 2018 at 05:28:22PM +0800, Li RongQing wrote: > if loopback_idev is NULL pointer, and the following access of > loopback_idev will trigger panic, which is same as BUG_ON > > Signed-off-by: Li RongQing Patch applied, thanks!
Re: [BUG] xfrm: unable to handle kernel NULL pointer dereference
On Fri, Nov 16, 2018 at 08:12:46PM +0100, Steffen Klassert wrote: > On Fri, Nov 16, 2018 at 08:48:00PM +0200, Lennert Buytenhek wrote: > > On Sat, Nov 10, 2018 at 08:34:34PM +0100, Jean-Philippe Menil wrote: > > > > > we're seeing unexpected crashes from kernel 4.15 to 4.18.17, using > > > IPsec VTI interfaces, on several vpn hosts, since upgrade from 4.4. > > > > I looked into this with Jean-Philippe, and it appears to be crashing > > on a NULL pointer dereference in the inlined xfrm_policy_check() call > > in vti_rcv_cb(), and specifically on the skb_dst(skb) dereference in > > __xfrm_policy_check2(): > > > > return (!net->xfrm.policy_count[dir] && !skb->sp) || > > (skb_dst(skb)->flags & DST_NOPOLICY) || <= > > __xfrm_policy_check(sk, ndir, skb, family); > > > > Commit 9e1437937807 ("xfrm: Fix NULL pointer dereference when > > skb_dst_force clears the dst_entry.") fixes a very similar problem on > > the output and forward paths, but our issue seems to be triggering on > > the input path. > > Yes, this is the same problem. skb_dst_force() does not > really force a refcount anymore, it might clear the dst > pointer instead (maybe this function should be renamed). I plan to apply this patch to the ipsec tree: [PATCH RFC] xfrm: Fix NULL pointer dereference in xfrm_input when skb_dst_force clears the dst_entry. Since commit 222d7dbd258d ("net: prevent dst uses after free") skb_dst_force() might clear the dst_entry attached to the skb. The xfrm code don't expect this to happen, so we crash with a NULL pointer dereference in this case. Fix it by checking skb_dst(skb) for NULL after skb_dst_force() and drop the packet in cast the dst_entry was cleared. We also move the skb_dst_force() to a codepath that is not used when the transformation was offloaded, because in this case we don't have a dst_entry attached to the skb. The output and forwarding path was already fixed by commit 9e1437937807 ("xfrm: Fix NULL pointer dereference when skb_dst_force clears the dst_entry.") Fixes: 222d7dbd258d ("net: prevent dst uses after free") Reported-by: Jean-Philippe Menil Signed-off-by: Steffen Klassert --- net/xfrm/xfrm_input.c | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c index 684c0bc01e2c..d5635908587f 100644 --- a/net/xfrm/xfrm_input.c +++ b/net/xfrm/xfrm_input.c @@ -346,6 +346,12 @@ int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 spi, int encap_type) skb->sp->xvec[skb->sp->len++] = x; + skb_dst_force(skb); + if (!skb_dst(skb)) { + XFRM_INC_STATS(net, LINUX_MIB_XFRMINERROR); + goto drop; + } + lock: spin_lock(>lock); @@ -385,7 +391,6 @@ int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 spi, int encap_type) XFRM_SKB_CB(skb)->seq.input.low = seq; XFRM_SKB_CB(skb)->seq.input.hi = seq_hi; - skb_dst_force(skb); dev_hold(skb->dev); if (crypto_done) -- 2.17.1
Re: [PATCH ipsec-next] xfrm: policy: fix netlink/pf_key policy lookups
On Thu, Nov 15, 2018 at 02:51:57AM +0100, Florian Westphal wrote: > Colin Ian King says: > Static analysis with CoverityScan found a potential issue [..] > It seems that pointer pol is set to NULL and then a check to see if it > is non-null is used to set pol to tmp; howeverm this check is always > going to be false because pol is always NULL. > > Fix this and update test script to catch this. Updated script only: > ./xfrm_policy.sh ; echo $? > RTNETLINK answers: No such file or directory > FAIL: ip -net ns3 xfrm policy get src 10.0.1.0/24 dst 10.0.2.0/24 dir out > RTNETLINK answers: No such file or directory > [..] > PASS: policy before exception matches > PASS: ping to .254 bypassed ipsec tunnel > PASS: direct policy matches > PASS: policy matches > 1 > > Fixes: 6be3b0db6db ("xfrm: policy: add inexact policy search tree > infrastructure") > Reported-by: Colin Ian King > Signed-off-by: Florian Westphal Applied, thanks everyone!
Re: [BUG] xfrm: unable to handle kernel NULL pointer dereference
On Fri, Nov 16, 2018 at 08:48:00PM +0200, Lennert Buytenhek wrote: > On Sat, Nov 10, 2018 at 08:34:34PM +0100, Jean-Philippe Menil wrote: > > > we're seeing unexpected crashes from kernel 4.15 to 4.18.17, using > > IPsec VTI interfaces, on several vpn hosts, since upgrade from 4.4. > > I looked into this with Jean-Philippe, and it appears to be crashing > on a NULL pointer dereference in the inlined xfrm_policy_check() call > in vti_rcv_cb(), and specifically on the skb_dst(skb) dereference in > __xfrm_policy_check2(): > > return (!net->xfrm.policy_count[dir] && !skb->sp) || > (skb_dst(skb)->flags & DST_NOPOLICY) || <= > __xfrm_policy_check(sk, ndir, skb, family); > > Commit 9e1437937807 ("xfrm: Fix NULL pointer dereference when > skb_dst_force clears the dst_entry.") fixes a very similar problem on > the output and forward paths, but our issue seems to be triggering on > the input path. Yes, this is the same problem. skb_dst_force() does not really force a refcount anymore, it might clear the dst pointer instead (maybe this function should be renamed). Want to submit a fix? If not I'll go to fix that. Thanks!
Re: [PATCH ipsec-next 00/11] xfrm: policy: add inexact policy search tree
On Thu, Nov 08, 2018 at 07:00:14PM -0800, David Miller wrote: > From: Florian Westphal > Date: Wed, 7 Nov 2018 23:00:30 +0100 > > > This series attempts to improve xfrm policy lookup performance when > > a lot of (several hundred or even thousands) inexact policies exist > > on a system. > > > > On insert, a policy is either placed in hash table (all direct (/32 for > > ipv4, /128 policies, or all policies matching a user-configured threshold). > > All other policies get inserted into inexact list as per priority. > > > > Lookup then scans inexact list for first matching entry. > > > > This series instead makes it so that inexact policy is added to exactly > > one of four different search list classes. > > > > 1. "Any:Any" list, containing policies where both saddr and daddr are > >wildcards or have very coarse prefixes, e.g. 10.0.0.0/8 and the like. > > 2. "saddr:any" list, containing policies with a fixed saddr/prefixlen, > >but without destination restrictions. > >These lists are stored in rbtree nodes; each node contains those > >policies matching saddr/prefixlen. > > 3. "Any:daddr" list. Similar to 2), except for policies where only the > >destinations are specified. > > 4. "saddr:daddr" lists, containing policies that match the given > >source/destination network. > > > >The root of the saddr/daddr tree is stored in the nodes of the > >'daddr' tree. > ... > > Comments or questions welcome. > > Acked-by: David S. Miller This is now applied to ipsec-next, thanks a lot for your work Florian!
Re: [PATCH] xfrm: Fix bucket count reported to userspace
On Mon, Nov 05, 2018 at 05:00:53PM +0900, Benjamin Poirier wrote: > sadhcnt is reported by `ip -s xfrm state count` as "buckets count", not the > hash mask. > > Fixes: 28d8909bc790 ("[XFRM]: Export SAD info.") > Signed-off-by: Benjamin Poirier Patch applied, thanks!
Re: [PATCH] xfrm: Fix error return code in xfrm_output_one()
On Sat, Oct 27, 2018 at 06:12:06AM +, Wei Yongjun wrote: > xfrm_output_one() does not return a error code when there is > no dst_entry attached to the skb, it is still possible crash > with a NULL pointer dereference in xfrm_output_resume(). Fix > it by return error code -EHOSTUNREACH. > > Fixes: 9e1437937807 ("xfrm: Fix NULL pointer dereference when skb_dst_force > clears the dst_entry.") > Signed-off-by: Wei Yongjun Applied, thanks a lot!
Re: [RFC PATCH v2 00/10] udp: implement GRO support
On Tue, Oct 23, 2018 at 02:22:12PM +0200, Paolo Abeni wrote: > On Tue, 2018-10-23 at 14:10 +0200, Steffen Klassert wrote: > > > Some quick benchmark numbers with UDP packet forwarding > > (1460 byte packets) through two gateways: > > > > net-next: 16.4 Gbps > > > > net-next + UDP GRO: 20.3 Gbps > > uhmmm... what do you think about this speed-up ? skb_segment() burns a lot of cycles. If I do the same test with TCP and disable HW TSO, throughput drops also down to similar values. In case of software segmentation, the skb chain appropach is likely faster because packets are not mangled. So no need to allocate skbs, no new checksum calculations, less memcpy etc. If we have an early route lookup in GRO, we could have a good guess on the offload capabilities of the outgoing device. So in case that software segmentation is likely, use the skb chaining method. If HW segmentation is likely, merge IP packets. The chaining method might be also faster on non UDP GRO enabled sockets. I'll try to implement the skb chaining method on top of this to see what we get from that.
Re: [RFC PATCH v2 00/10] udp: implement GRO support
On Fri, Oct 19, 2018 at 04:25:10PM +0200, Paolo Abeni wrote: > This series implements GRO support for UDP sockets, as the RX counterpart > of commit bec1f6f69736 ("udp: generate gso with UDP_SEGMENT"). > The core functionality is implemented by the second patch, introducing a new > sockopt to enable UDP_GRO, while patch 3 implements support for passing the > segment size to the user space via a new cmsg. > UDP GRO performs a socket lookup for each ingress packets and aggregate > datagram > directed to UDP GRO enabled sockets with constant l4 tuple. > > UDP GRO packets can land on non GRO-enabled sockets, e.g. due to iptables NAT > rules, and that could potentially confuse existing applications. > > The solution adopted here is to de-segment the GRO packet before enqueuing > as needed. Since we must cope with packet reinsertion after de-segmentation, > the relevant code is factored-out in ipv4 and ipv6 specific helpers and > exposed > to UDP usage. > > While the current code can probably be improved, this safeguard ,implemented > in > the patches 4-7, allows future enachements to enable UDP GSO offload on more > virtual devices eventually even on forwarded packets. I was curious what I get with this when doing packet forwarding. So I added forwarding support with the patch below (on top of your patchset). While the packet processing could work the way I do it in this patch, I'm not sure about the way I enable UDP GRO on forwarding. Maybe there it a better way than just do UDP GRO if forwarding is enabled on the receiving device. Some quick benchmark numbers with UDP packet forwarding (1460 byte packets) through two gateways: net-next: 16.4 Gbps net-next + UDP GRO: 20.3 Gbps Subject: [PATCH RFC] udp: Allow gro for the forwarding path. This patch adds a early route lookup to inet_gro_receive() in case forwarding is enabled on the receiving device. To be forwarded packets are allowed to enter the UDP GRO handlers then. Signed-off-by: Steffen Klassert --- include/linux/netdevice.h | 2 +- include/net/dst.h | 1 + net/core/dev.c| 6 -- net/ipv4/af_inet.c| 12 net/ipv4/route.c | 1 + net/ipv4/udp_offload.c| 11 +-- 6 files changed, 28 insertions(+), 5 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index dc1d9ed33b31..2eb3fa960ad4 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -2304,7 +2304,7 @@ struct napi_gro_cb { /* Number of gro_receive callbacks this packet already went through */ u8 recursion_counter:4; - /* 1 bit hole */ + u8 is_fwd:1; /* used to support CHECKSUM_COMPLETE for tunneling protocols */ __wsum csum; diff --git a/include/net/dst.h b/include/net/dst.h index 6cf0870414c7..01bdf825a6f9 100644 --- a/include/net/dst.h +++ b/include/net/dst.h @@ -54,6 +54,7 @@ struct dst_entry { #define DST_XFRM_TUNNEL0x0020 #define DST_XFRM_QUEUE 0x0040 #define DST_METADATA 0x0080 +#define DST_FWD0x0100 /* A non-zero value of dst->obsolete forces by-hand validation * of the route entry. Positive values are set by the generic diff --git a/net/core/dev.c b/net/core/dev.c index 022ad73d6253..c6aaffb99456 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -5387,8 +5387,10 @@ static struct list_head *gro_list_prepare(struct napi_struct *napi, diffs = (unsigned long)p->dev ^ (unsigned long)skb->dev; diffs |= p->vlan_tci ^ skb->vlan_tci; - diffs |= skb_metadata_dst_cmp(p, skb); - diffs |= skb_metadata_differs(p, skb); + if (!NAPI_GRO_CB(p)->is_fwd) { + diffs |= skb_metadata_dst_cmp(p, skb); + diffs |= skb_metadata_differs(p, skb); + } if (maclen == ETH_HLEN) diffs |= compare_ether_header(skb_mac_header(p), skb_mac_header(skb)); diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c index 1fbe2f815474..6e4e8588c0b1 100644 --- a/net/ipv4/af_inet.c +++ b/net/ipv4/af_inet.c @@ -1479,8 +1479,20 @@ struct sk_buff *inet_gro_receive(struct list_head *head, struct sk_buff *skb) NAPI_GRO_CB(p)->flush_id = flush_id; else NAPI_GRO_CB(p)->flush_id |= flush_id; + + NAPI_GRO_CB(skb)->is_fwd = NAPI_GRO_CB(p)->is_fwd; + + goto found; } + if (IN_DEV_FORWARD(__in_dev_get_rcu(skb->dev))) { + if (!ip_route_input_noref(skb, iph->daddr, iph->saddr, iph->tos, + skb->dev)) { + if ((skb_dst(skb)->flags & DST_FWD)) +
Re: [RFC PATCH v2 06/10] udp: cope with UDP GRO packet misdirection
On Mon, Oct 22, 2018 at 02:51:56PM +0200, Paolo Abeni wrote: > > > + > > > +static int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb) > > > +{ > > > + struct sk_buff *next, *segs; > > > + int ret; > > > + > > > + if (likely(!udp_unexpected_gso(sk, skb))) > > > + return udp_queue_rcv_one_skb(sk, skb); > > > + > > > + BUILD_BUG_ON(sizeof(struct udp_skb_cb) > SKB_SGO_CB_OFFSET); > > > + __skb_push(skb, -skb_mac_offset(skb)); > > > + segs = udp_rcv_segment(sk, skb); > > > + for (skb = segs; skb; skb = next) { > > > + next = skb->next; > > > + __skb_pull(skb, skb_transport_offset(skb)); > > > + ret = udp_queue_rcv_one_skb(sk, skb); > > > > udp_queue_rcv_one_skb() starts with doing a xfrm4_policy_check(). > > Maybe we can do this on the GSO packet instead of the segments. > > So far this code is just for handling a corner case, but this might > > change. > > I thought about keeping the policy check here, but then I preferred > what looked the safest option. Perhaps we can improve with a follow-up? Fair enough. Let's keep it in mind and do it later.
Re: [RFC PATCH v2 06/10] udp: cope with UDP GRO packet misdirection
On Fri, Oct 19, 2018 at 04:25:16PM +0200, Paolo Abeni wrote: > + > +static inline struct sk_buff *udp_rcv_segment(struct sock *sk, > + struct sk_buff *skb) > +{ > + struct sk_buff *segs; > + > + /* the GSO CB lays after the UDP one, no need to save and restore any > + * CB fragment, just initialize it > + */ > + segs = __skb_gso_segment(skb, NETIF_F_SG, false); > + if (unlikely(IS_ERR(segs))) > + kfree_skb(skb); > + else if (segs) > + consume_skb(skb); > + return segs; > +} > + > + One empty line too much. > #define udp_portaddr_for_each_entry(__sk, list) \ > hlist_for_each_entry(__sk, list, __sk_common.skc_portaddr_node) > > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c > index 2331ac9de954..0d55145ce9f5 100644 > --- a/net/ipv4/udp.c > +++ b/net/ipv4/udp.c > @@ -1909,7 +1909,7 @@ EXPORT_SYMBOL(udp_encap_enable); > * Note that in the success and error cases, the skb is assumed to > * have either been requeued or freed. > */ > -static int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb) > +static int udp_queue_rcv_one_skb(struct sock *sk, struct sk_buff *skb) > { > struct udp_sock *up = udp_sk(sk); > int is_udplite = IS_UDPLITE(sk); > @@ -2012,6 +2012,29 @@ static int udp_queue_rcv_skb(struct sock *sk, struct > sk_buff *skb) > return -1; > } > > +void ip_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, int > proto); > + > +static int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb) > +{ > + struct sk_buff *next, *segs; > + int ret; > + > + if (likely(!udp_unexpected_gso(sk, skb))) > + return udp_queue_rcv_one_skb(sk, skb); > + > + BUILD_BUG_ON(sizeof(struct udp_skb_cb) > SKB_SGO_CB_OFFSET); > + __skb_push(skb, -skb_mac_offset(skb)); > + segs = udp_rcv_segment(sk, skb); > + for (skb = segs; skb; skb = next) { > + next = skb->next; > + __skb_pull(skb, skb_transport_offset(skb)); > + ret = udp_queue_rcv_one_skb(sk, skb); udp_queue_rcv_one_skb() starts with doing a xfrm4_policy_check(). Maybe we can do this on the GSO packet instead of the segments. So far this code is just for handling a corner case, but this might change.
Re: [RFC PATCH v2 02/10] udp: implement GRO for plain UDP sockets.
On Fri, Oct 19, 2018 at 04:25:12PM +0200, Paolo Abeni wrote: > > +#define UDO_GRO_CNT_MAX 64 Maybe better UDP_GRO_CNT_MAX? Btw. do we really need this explicit limit? We should not get more than 64 packets during one napi poll cycle. > +static struct sk_buff *udp_gro_receive_segment(struct list_head *head, > +struct sk_buff *skb) > +{ > + struct udphdr *uh = udp_hdr(skb); > + struct sk_buff *pp = NULL; > + struct udphdr *uh2; > + struct sk_buff *p; > + > + /* requires non zero csum, for simmetry with GSO */ > + if (!uh->check) { > + NAPI_GRO_CB(skb)->flush = 1; > + return NULL; > + } Why is the requirement of checksums different than in udp_gro_receive? It's not that I care much about UDP packets without a checksum, but you would not need to implement your own loop if the requirement could be the same as in udp_gro_receive. > + > + /* pull encapsulating udp header */ > + skb_gro_pull(skb, sizeof(struct udphdr)); > + skb_gro_postpull_rcsum(skb, uh, sizeof(struct udphdr)); > + > + list_for_each_entry(p, head, list) { > + if (!NAPI_GRO_CB(p)->same_flow) > + continue; > + > + uh2 = udp_hdr(p); > + > + /* Match ports only, as csum is always non zero */ > + if ((*(u32 *)>source != *(u32 *)>source)) { > + NAPI_GRO_CB(p)->same_flow = 0; > + continue; > + } > + > + /* Terminate the flow on len mismatch or if it grow "too much". > + * Under small packet flood GRO count could elsewhere grow a lot > + * leading to execessive truesize values > + */ > + if (!skb_gro_receive(p, skb) && > + NAPI_GRO_CB(p)->count > UDO_GRO_CNT_MAX) This allows to merge UDO_GRO_CNT_MAX + 1 packets. > + pp = p; > + else if (uh->len != uh2->len) > + pp = p; > + > + return pp; > + } > + > + /* mismatch, but we never need to flush */ > + return NULL; > +}
[PATCH 1/3] xfrm: remove unnecessary check in xfrmi_get_stats64
From: Li RongQing if tstats of a device is not allocated, this device is not registered correctly and can not be used. Signed-off-by: Li RongQing Signed-off-by: Steffen Klassert --- net/xfrm/xfrm_interface.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/net/xfrm/xfrm_interface.c b/net/xfrm/xfrm_interface.c index dc5b20bf29cf..abafd49cc65d 100644 --- a/net/xfrm/xfrm_interface.c +++ b/net/xfrm/xfrm_interface.c @@ -561,9 +561,6 @@ static void xfrmi_get_stats64(struct net_device *dev, { int cpu; - if (!dev->tstats) - return; - for_each_possible_cpu(cpu) { struct pcpu_sw_netstats *stats; struct pcpu_sw_netstats tmp; -- 2.17.1
pull request (net-next): ipsec-next 2018-10-18
1) Remove an unnecessary dev->tstats check in xfrmi_get_stats64. From Li RongQing. 2) We currently do a sizeof(element) instead of a sizeof(array) check when initializing the ovec array of the secpath. Currently this array can have only one element, so code is OK but error-prone. Change this to do a sizeof(array) check so that we can add more elements in future. From Li RongQing. 3) Improve xfrm IPv6 address hashing by using the complete IPv6 addresses for a hash. From Michal Kubecek. Please pull or let me know if there are problems. Thanks! The following changes since commit abf1a08ff3237a27188ff8cc2904f2cea893af55: net: vhost: remove bad code line (2018-10-07 21:31:32 -0700) are available in the Git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next.git master for you to fetch changes up to 8d4b6bce2559755cf2db6513a267fccdfbf7c3ab: xfrm: use complete IPv6 addresses for hash (2018-10-15 10:09:18 +0200) Li RongQing (2): xfrm: remove unnecessary check in xfrmi_get_stats64 xfrm: use correct size to initialise sp->ovec Michal Kubecek (1): xfrm: use complete IPv6 addresses for hash net/xfrm/xfrm_hash.h | 5 ++--- net/xfrm/xfrm_input.c | 2 +- net/xfrm/xfrm_interface.c | 3 --- 3 files changed, 3 insertions(+), 7 deletions(-)
[PATCH 2/3] xfrm: use correct size to initialise sp->ovec
From: Li RongQing This place should want to initialize array, not a element, so it should be sizeof(array) instead of sizeof(element) but now this array only has one element, so no error in this condition that XFRM_MAX_OFFLOAD_DEPTH is 1 Signed-off-by: Li RongQing Signed-off-by: Steffen Klassert --- net/xfrm/xfrm_input.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c index be3520e429c9..684c0bc01e2c 100644 --- a/net/xfrm/xfrm_input.c +++ b/net/xfrm/xfrm_input.c @@ -131,7 +131,7 @@ struct sec_path *secpath_dup(struct sec_path *src) sp->len = 0; sp->olen = 0; - memset(sp->ovec, 0, sizeof(sp->ovec[XFRM_MAX_OFFLOAD_DEPTH])); + memset(sp->ovec, 0, sizeof(sp->ovec)); if (src) { int i; -- 2.17.1
[PATCH 3/3] xfrm: use complete IPv6 addresses for hash
From: Michal Kubecek In some environments it is common that many hosts share the same lower half of their IPv6 addresses (in particular ::1). As __xfrm6_addr_hash() and __xfrm6_daddr_saddr_hash() calculate the hash only from the lower halves, as much as 1/3 of the hosts ends up in one hashtable chain which harms the performance. Use complete IPv6 addresses when calculating the hashes. Rather than just adding two more words to the xor, use jhash2() for consistency with __xfrm6_pref_hash() and __xfrm6_dpref_spref_hash(). Signed-off-by: Michal Kubecek Signed-off-by: Steffen Klassert --- net/xfrm/xfrm_hash.h | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/net/xfrm/xfrm_hash.h b/net/xfrm/xfrm_hash.h index 61be810389d8..ce66323102f9 100644 --- a/net/xfrm/xfrm_hash.h +++ b/net/xfrm/xfrm_hash.h @@ -13,7 +13,7 @@ static inline unsigned int __xfrm4_addr_hash(const xfrm_address_t *addr) static inline unsigned int __xfrm6_addr_hash(const xfrm_address_t *addr) { - return ntohl(addr->a6[2] ^ addr->a6[3]); + return jhash2((__force u32 *)addr->a6, 4, 0); } static inline unsigned int __xfrm4_daddr_saddr_hash(const xfrm_address_t *daddr, @@ -26,8 +26,7 @@ static inline unsigned int __xfrm4_daddr_saddr_hash(const xfrm_address_t *daddr, static inline unsigned int __xfrm6_daddr_saddr_hash(const xfrm_address_t *daddr, const xfrm_address_t *saddr) { - return ntohl(daddr->a6[2] ^ daddr->a6[3] ^ -saddr->a6[2] ^ saddr->a6[3]); + return __xfrm6_addr_hash(daddr) ^ __xfrm6_addr_hash(saddr); } static inline u32 __bits2mask32(__u8 bits) -- 2.17.1
[PATCH 3/4] net/xfrm: fix out-of-bounds packet access
From: Alexei Starovoitov BUG: KASAN: slab-out-of-bounds in _decode_session6+0x1331/0x14e0 net/ipv6/xfrm6_policy.c:161 Read of size 1 at addr 8801d882eec7 by task syz-executor1/6667 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113 print_address_description+0x6c/0x20b mm/kasan/report.c:256 kasan_report_error mm/kasan/report.c:354 [inline] kasan_report.cold.7+0x242/0x30d mm/kasan/report.c:412 __asan_report_load1_noabort+0x14/0x20 mm/kasan/report.c:430 _decode_session6+0x1331/0x14e0 net/ipv6/xfrm6_policy.c:161 __xfrm_decode_session+0x71/0x140 net/xfrm/xfrm_policy.c:2299 xfrm_decode_session include/net/xfrm.h:1232 [inline] vti6_tnl_xmit+0x3c3/0x1bc1 net/ipv6/ip6_vti.c:542 __netdev_start_xmit include/linux/netdevice.h:4313 [inline] netdev_start_xmit include/linux/netdevice.h:4322 [inline] xmit_one net/core/dev.c:3217 [inline] dev_hard_start_xmit+0x272/0xc10 net/core/dev.c:3233 __dev_queue_xmit+0x2ab2/0x3870 net/core/dev.c:3803 dev_queue_xmit+0x17/0x20 net/core/dev.c:3836 Reported-by: syzbot+acffccec848dc13fe...@syzkaller.appspotmail.com Reported-by: Eric Dumazet Signed-off-by: Alexei Starovoitov Signed-off-by: Steffen Klassert --- net/ipv6/xfrm6_policy.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c index ef3defaf43b9..d35bcf92969c 100644 --- a/net/ipv6/xfrm6_policy.c +++ b/net/ipv6/xfrm6_policy.c @@ -146,8 +146,8 @@ _decode_session6(struct sk_buff *skb, struct flowi *fl, int reverse) fl6->daddr = reverse ? hdr->saddr : hdr->daddr; fl6->saddr = reverse ? hdr->daddr : hdr->saddr; - while (nh + offset + 1 < skb->data || - pskb_may_pull(skb, nh + offset + 1 - skb->data)) { + while (nh + offset + sizeof(*exthdr) < skb->data || + pskb_may_pull(skb, nh + offset + sizeof(*exthdr) - skb->data)) { nh = skb_network_header(skb); exthdr = (struct ipv6_opt_hdr *)(nh + offset); -- 2.17.1
[PATCH 1/4] xfrm: fix gro_cells leak when remove virtual xfrm interfaces
From: Li RongQing The device gro_cells has been initialized, it should be freed, otherwise it will be leaked Fixes: f203b76d78092faf2 ("xfrm: Add virtual xfrm interfaces") Signed-off-by: Zhang Yu Signed-off-by: Li RongQing Signed-off-by: Steffen Klassert --- net/xfrm/xfrm_interface.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/net/xfrm/xfrm_interface.c b/net/xfrm/xfrm_interface.c index 31acc6f33d98..6f05e831a73e 100644 --- a/net/xfrm/xfrm_interface.c +++ b/net/xfrm/xfrm_interface.c @@ -116,6 +116,9 @@ static void xfrmi_unlink(struct xfrmi_net *xfrmn, struct xfrm_if *xi) static void xfrmi_dev_free(struct net_device *dev) { + struct xfrm_if *xi = netdev_priv(dev); + + gro_cells_destroy(>gro_cells); free_percpu(dev->tstats); } -- 2.17.1
[PATCH 4/4] xfrm: policy: use hlist rcu variants on insert
From: Florian Westphal bydst table/list lookups use rcu, so insertions must use rcu versions. Fixes: a7c44247f704e ("xfrm: policy: make xfrm_policy_lookup_bytype lockless") Signed-off-by: Florian Westphal Signed-off-by: Steffen Klassert --- net/xfrm/xfrm_policy.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c index f094d4b3520d..119a427d9b2b 100644 --- a/net/xfrm/xfrm_policy.c +++ b/net/xfrm/xfrm_policy.c @@ -632,9 +632,9 @@ static void xfrm_hash_rebuild(struct work_struct *work) break; } if (newpos) - hlist_add_behind(>bydst, newpos); + hlist_add_behind_rcu(>bydst, newpos); else - hlist_add_head(>bydst, chain); + hlist_add_head_rcu(>bydst, chain); } spin_unlock_bh(>xfrm.xfrm_policy_lock); @@ -774,9 +774,9 @@ int xfrm_policy_insert(int dir, struct xfrm_policy *policy, int excl) break; } if (newpos) - hlist_add_behind(>bydst, newpos); + hlist_add_behind_rcu(>bydst, newpos); else - hlist_add_head(>bydst, chain); + hlist_add_head_rcu(>bydst, chain); __xfrm_policy_link(policy, dir); /* After previous checking, family can either be AF_INET or AF_INET6 */ -- 2.17.1
[PATCH 2/4] MAINTAINERS: Remove net/core/flow.c
net/core/flow.c does not exist anymore, so remove it from the IPSEC NETWORKING section of the MAINTAINERS file. Signed-off-by: Steffen Klassert --- MAINTAINERS | 1 - 1 file changed, 1 deletion(-) diff --git a/MAINTAINERS b/MAINTAINERS index dcb0191c4f54..4ff21dac9b45 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -10130,7 +10130,6 @@ L: netdev@vger.kernel.org T: git git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec.git T: git git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next.git S: Maintained -F: net/core/flow.c F: net/xfrm/ F: net/key/ F: net/ipv4/xfrm* -- 2.17.1
pull request (net): ipsec 2018-10-18
1) Free the xfrm interface gro_cells when deleting the interface, otherwise we leak it. From Li RongQing. 2) net/core/flow.c does not exist anymore, so remove it from the MAINTAINERS file. 3) Fix a slab-out-of-bounds in _decode_session6. From Alexei Starovoitov. 4) Fix RCU protection when policies inserted into thei bydst lists. From Florian Westphal. Please pull or let me know if there are problems. Thanks! The following changes since commit 92d7c74b6f72a8a7d04970d5dcfb99673daaf91d: Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth (2018-10-01 22:40:39 -0700) are available in the Git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec.git master for you to fetch changes up to 9d200fd178f11dd50eb1fd8ccd0650c9284e: xfrm: policy: use hlist rcu variants on insert (2018-10-11 13:24:46 +0200) Alexei Starovoitov (1): net/xfrm: fix out-of-bounds packet access Florian Westphal (1): xfrm: policy: use hlist rcu variants on insert Li RongQing (1): xfrm: fix gro_cells leak when remove virtual xfrm interfaces Steffen Klassert (1): MAINTAINERS: Remove net/core/flow.c MAINTAINERS | 1 - net/ipv6/xfrm6_policy.c | 4 ++-- net/xfrm/xfrm_interface.c | 3 +++ net/xfrm/xfrm_policy.c| 8 4 files changed, 9 insertions(+), 7 deletions(-)
Re: [PATCH ipsec] xfrm: policy: use hlist rcu variants on insert
On Wed, Oct 10, 2018 at 06:02:21PM +0200, Florian Westphal wrote: > bydst table/list lookups use rcu, so insertions must use rcu versions. > > Fixes: a7c44247f704e ("xfrm: policy: make xfrm_policy_lookup_bytype lockless") > Signed-off-by: Florian Westphal Applied, thanks Florian!
Re: [PATCH net] net/xfrm: fix out-of-bounds packet access
On Tue, Oct 09, 2018 at 09:59:36AM -0700, Alexei Starovoitov wrote: > BUG: KASAN: slab-out-of-bounds in _decode_session6+0x1331/0x14e0 > net/ipv6/xfrm6_policy.c:161 > Read of size 1 at addr 8801d882eec7 by task syz-executor1/6667 > Call Trace: > __dump_stack lib/dump_stack.c:77 [inline] > dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113 > print_address_description+0x6c/0x20b mm/kasan/report.c:256 > kasan_report_error mm/kasan/report.c:354 [inline] > kasan_report.cold.7+0x242/0x30d mm/kasan/report.c:412 > __asan_report_load1_noabort+0x14/0x20 mm/kasan/report.c:430 > _decode_session6+0x1331/0x14e0 net/ipv6/xfrm6_policy.c:161 > __xfrm_decode_session+0x71/0x140 net/xfrm/xfrm_policy.c:2299 > xfrm_decode_session include/net/xfrm.h:1232 [inline] > vti6_tnl_xmit+0x3c3/0x1bc1 net/ipv6/ip6_vti.c:542 > __netdev_start_xmit include/linux/netdevice.h:4313 [inline] > netdev_start_xmit include/linux/netdevice.h:4322 [inline] > xmit_one net/core/dev.c:3217 [inline] > dev_hard_start_xmit+0x272/0xc10 net/core/dev.c:3233 > __dev_queue_xmit+0x2ab2/0x3870 net/core/dev.c:3803 > dev_queue_xmit+0x17/0x20 net/core/dev.c:3836 > > Reported-by: syzbot+acffccec848dc13fe...@syzkaller.appspotmail.com > Reported-by: Eric Dumazet > Signed-off-by: Alexei Starovoitov This looks good, applied to the ipsec tree. Thanks Alexei!
Re: [PATCH][ipsec-next] xfrm: use correct size to initialise sp->ovec
On Sun, Oct 07, 2018 at 10:22:42AM +0800, Li RongQing wrote: > This place should want to initialize array, not a element, > so it should be sizeof(array) instead of sizeof(element) > > but now this array only has one element, so no error in > this condition that XFRM_MAX_OFFLOAD_DEPTH is 1 > > Signed-off-by: Li RongQing Also applied, thanks a lot Li!
Re: [PATCH][ipsec-next] xfrm: remove unnecessary check in xfrmi_get_stats64
On Sun, Oct 07, 2018 at 09:56:15AM +0800, Li RongQing wrote: > if tstats of a device is not allocated, this device is not > registered correctly and can not be used. > > Signed-off-by: Li RongQing Applied to ipsec-next, thanks!
Re: [PATCH v2 ipsec] Clear secpath on loopback_xmit
On Mon, Oct 08, 2018 at 11:13:36AM -0700, Benedict Wong wrote: > This patch clears the skb->sp when transmitted over loopback. This > ensures that the loopback-ed packet does not have any secpath > information from the outbound transforms. > > At present, this causes XFRM tunnel mode packets to be dropped with > XFRMINNOPOLS, due to the outbound state being in the secpath, without > a matching inbound policy. Clearing the secpath ensures that all states > added to the secpath are exclusively from the inbound processing. > > Tests: xfrm tunnel mode tests added for loopback: > https://android-review.googlesource.com/c/kernel/tests/+/777328 > Fixes: 8fe7ee2ba983 ("[IPsec]: Strengthen policy checks") This patch is from 2003, it predates our git history. The fixes tag is mainly to ensure proper backporting. The commit you mentioned can't be found in the mainline git repo, so it does not help much. If this bug was really introduced in pre git times, use the very first git commit in our history. It should look like: Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") But I doubt that this commit introduced the bug. We started to use outbound secpaths when we added offloading support, so I think one of the offloading patches introduced it. Do you have CONFIG_INET_ESP_OFFLOAD or CONFIG_INET6_ESP_OFFLOAD enabled? Do you see the bug without these two config options enabled? > Signed-off-by: Benedict Wong > --- > drivers/net/loopback.c | 4 > 1 file changed, 4 insertions(+) > > diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c > index 30612497643c..a6bf54df94bd 100644 > --- a/drivers/net/loopback.c > +++ b/drivers/net/loopback.c > @@ -50,6 +50,7 @@ > #include > #include > #include > +#include > #include /* For the statistics structure. */ > #include /* For ARPHRD_ETHER */ > #include > @@ -82,6 +83,9 @@ static netdev_tx_t loopback_xmit(struct sk_buff *skb, >*/ > skb_dst_force(skb); > > + // Clear secpath to ensure xfrm policy check not tainted by outbound > SAs. Please align your comment to the format of the other comments in this file. It should look like this: /* Clear secpath to ensure xfrm policy check not tainted by * outbound SAs. */ > + secpath_reset(skb); > + > skb->protocol = eth_type_trans(skb, dev); > > /* it's OK to use per_cpu_ptr() because BHs are off */ > -- > 2.19.0.605.g01d371f741-goog
Re: [PATCH net-next RFC 0/8] udp and configurable gro
On Fri, Oct 05, 2018 at 10:41:47AM -0400, Willem de Bruijn wrote: > On Fri, Oct 5, 2018 at 9:53 AM Paolo Abeni wrote: > > > > Hi all, > > > > On Fri, 2018-09-14 at 13:59 -0400, Willem de Bruijn wrote: > > > This is a *very rough* draft. Mainly for discussion while we also > > > look at another partially overlapping approach [1]. > > > > I'm wondering how we go on from this ? I'm fine with either approaches. > > Let me send the udp gro static_key patch. Then we don't need > the enable udp on demand logic (patch 2/4). > > Your implementation of GRO is more fleshed out (patch 3/4) than > my quick hack. My only request would be to use a separate > UDP_GRO socket option instead of adding this to the existing > UDP_SEGMENT. > > Sounds good? > > > Also, I'm interested in [try to] enable GRO/GSO batching in the > > forwarding path, as you outlined initially in the GSO series > > submission. That should cover Steffen use-case, too, right? > > Great. Indeed. Though there is some unresolved discussion on > one large gso skb vs frag list. There has been various concerns > around the use of frag lists for GSO in the past, and it does not > match h/w offload. So I think the answer would be the first unless > the second proves considerably faster (in which case it could also > be added later as optimization). I think it depends a bit on the usecase and hardware etc. if the first or the second approach is faster. So it would be good if we can choose which one to use depending on that. For local socket receiving, building big GSO packets is likely faster than the chaining method. But on forwarding the chaining method might be faster because we don't have the overhead of creating GSO packets and of segmenting them back to their native form (at least as long as we don't have NICs that support hardware UDP GSO). Same applies to packets that undergo IPsec transformation. Another thing where the chaining method could be intersting is when we receive already big LRO or HW GRO packets from the NIC. Packets of the same flow could still travel together through the stack with the chaining method. I've never tried this, though. For now it is just an idea. I have the code for the chaining mehthod here, I'd just need some method to hook it in. Maybe it could be done with some sort of an inet_update_offload() as Paolo already propsed in his pachset, or we could make it configurable per device...
Re: [PATCH ipsec-next] Clear secpath on loopback_xmit
On Fri, Oct 05, 2018 at 11:23:28AM -0700, Benedict Wong wrote: > This patch clears the skb->sp when transmitted over loopback. This > ensures that the loopback-ed packet does not have any secpath > information from the outbound transforms. > > At present, this causes XFRM tunnel mode packets to be dropped with > XFRMINNOPOLS, due to the outbound state being in the secpath, without > a matching inbound policy. Clearing the secpath ensures that all states > added to the secpath are exclusively from the inbound processing. This looks like a fix, so the target should be the ipsec tree, not ipsec-next. > > Tests: xfrm tunnel mode tests added for loopback: > https://android-review.googlesource.com/c/kernel/tests/+/777328 > Signed-off-by: Benedict Wong Please add a 'Fixes:' tag. > --- > drivers/net/loopback.c | 4 > 1 file changed, 4 insertions(+) > > diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c > index 30612497643c..a6bf54df94bd 100644 > --- a/drivers/net/loopback.c > +++ b/drivers/net/loopback.c > @@ -50,6 +50,7 @@ > #include > #include > #include > +#include > #include /* For the statistics structure. */ > #include /* For ARPHRD_ETHER */ > #include > @@ -82,6 +83,9 @@ static netdev_tx_t loopback_xmit(struct sk_buff *skb, >*/ > skb_dst_force(skb); > > + // Clear secpath to ensure xfrm policy check not tainted by outbound > SAs. Comments should use /* ... */, like it is done below. > + secpath_reset(skb); > + > skb->protocol = eth_type_trans(skb, dev); > > /* it's OK to use per_cpu_ptr() because BHs are off */ > -- > 2.19.0.605.g01d371f741-goog
Re: [PATCH] xfrm: fix gro_cells leak when remove virtual xfrm interfaces
On Sun, Sep 30, 2018 at 03:06:06PM +0800, Li RongQing wrote: > The device gro_cells has been initialized, it should be freed, > otherwise it will be leaked > > Fixes: f203b76d78092faf2 ("xfrm: Add virtual xfrm interfaces") > Signed-off-by: Zhang Yu > Signed-off-by: Li RongQing Applied, thanks a lot!
[PATCH 3/3] xfrm: allow driver to quietly refuse offload
From: Shannon Nelson If the "offload" attribute is used to create an IPsec SA and the .xdo_dev_state_add() fails, the SA creation fails. However, if the "offload" attribute is used on a device that doesn't offer it, the attribute is quietly ignored and the SA is created without an offload. Along the same line of that second case, it would be good to have a way for the device to refuse to offload an SA without failing the whole SA creation. This patch adds that feature by allowing the driver to return -EOPNOTSUPP as a signal that the SA may be fine, it just can't be offloaded. This allows the user a little more flexibility in requesting offloads and not needing to know every detail at all times about each specific NIC when trying to create SAs. Signed-off-by: Shannon Nelson Signed-off-by: Steffen Klassert --- Documentation/networking/xfrm_device.txt | 4 net/xfrm/xfrm_device.c | 6 +- 2 files changed, 9 insertions(+), 1 deletion(-) diff --git a/Documentation/networking/xfrm_device.txt b/Documentation/networking/xfrm_device.txt index 50c34ca65efe..267f55b5f54a 100644 --- a/Documentation/networking/xfrm_device.txt +++ b/Documentation/networking/xfrm_device.txt @@ -68,6 +68,10 @@ and an indication of whether it is for Rx or Tx. The driver should - verify the algorithm is supported for offloads - store the SA information (key, salt, target-ip, protocol, etc) - enable the HW offload of the SA + - return status value: + 0 success + -EOPNETSUPP offload not supported, try SW IPsec + other fail the request The driver can also set an offload_handle in the SA, an opaque void pointer that can be used to convey context into the fast-path offload requests. diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c index 5611b7521020..3a1d9d6aefb4 100644 --- a/net/xfrm/xfrm_device.c +++ b/net/xfrm/xfrm_device.c @@ -192,9 +192,13 @@ int xfrm_dev_state_add(struct net *net, struct xfrm_state *x, err = dev->xfrmdev_ops->xdo_dev_state_add(x); if (err) { + xso->num_exthdrs = 0; + xso->flags = 0; xso->dev = NULL; dev_put(dev); - return err; + + if (err != -EOPNOTSUPP) + return err; } return 0; -- 2.17.1
pull request (net-next): ipsec-next 2018-10-01
1) Make xfrmi_get_link_net() static to silence a sparse warning. From Wei Yongjun. 2) Remove a unused esph pointer definition in esp_input(). From Haishuang Yan. 3) Allow the NIC driver to quietly refuse xfrm offload in case it does not support it, the SA is created without offload in this case. From Shannon Nelson. Please pull or let me know if there are problems. Thanks! The following changes since commit 817e60a7a2bb1f22052f18562990d675cb3a3762: Merge branch 'nfp-add-NFP5000-support' (2018-08-28 16:01:48 -0700) are available in the Git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next.git master for you to fetch changes up to 4a132095dd64fefabdc5dad1cd9e9809b126e582: xfrm: allow driver to quietly refuse offload (2018-08-29 08:04:44 +0200) Haishuang Yan (1): esp: remove redundant define esph Shannon Nelson (1): xfrm: allow driver to quietly refuse offload Wei Yongjun (1): xfrm: Make function xfrmi_get_link_net() static Documentation/networking/xfrm_device.txt | 4 net/ipv4/esp4.c | 7 +++ net/ipv6/esp6.c | 7 +++ net/xfrm/xfrm_device.c | 6 +- net/xfrm/xfrm_interface.c| 2 +- 5 files changed, 16 insertions(+), 10 deletions(-)
[PATCH 2/3] esp: remove redundant define esph
From: Haishuang Yan The pointer 'esph' is defined but is never used hence it is redundant and canbe removed. Signed-off-by: Haishuang Yan Signed-off-by: Steffen Klassert --- net/ipv4/esp4.c | 7 +++ net/ipv6/esp6.c | 7 +++ 2 files changed, 6 insertions(+), 8 deletions(-) diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c index 97689012b357..211caaf27f6e 100644 --- a/net/ipv4/esp4.c +++ b/net/ipv4/esp4.c @@ -683,12 +683,11 @@ static void esp_input_done_esn(struct crypto_async_request *base, int err) */ static int esp_input(struct xfrm_state *x, struct sk_buff *skb) { - struct ip_esp_hdr *esph; struct crypto_aead *aead = x->data; struct aead_request *req; struct sk_buff *trailer; int ivlen = crypto_aead_ivsize(aead); - int elen = skb->len - sizeof(*esph) - ivlen; + int elen = skb->len - sizeof(struct ip_esp_hdr) - ivlen; int nfrags; int assoclen; int seqhilen; @@ -698,13 +697,13 @@ static int esp_input(struct xfrm_state *x, struct sk_buff *skb) struct scatterlist *sg; int err = -EINVAL; - if (!pskb_may_pull(skb, sizeof(*esph) + ivlen)) + if (!pskb_may_pull(skb, sizeof(struct ip_esp_hdr) + ivlen)) goto out; if (elen <= 0) goto out; - assoclen = sizeof(*esph); + assoclen = sizeof(struct ip_esp_hdr); seqhilen = 0; if (x->props.flags & XFRM_STATE_ESN) { diff --git a/net/ipv6/esp6.c b/net/ipv6/esp6.c index 88a7579c23bd..63b2b66f9dfa 100644 --- a/net/ipv6/esp6.c +++ b/net/ipv6/esp6.c @@ -601,12 +601,11 @@ static void esp_input_done_esn(struct crypto_async_request *base, int err) static int esp6_input(struct xfrm_state *x, struct sk_buff *skb) { - struct ip_esp_hdr *esph; struct crypto_aead *aead = x->data; struct aead_request *req; struct sk_buff *trailer; int ivlen = crypto_aead_ivsize(aead); - int elen = skb->len - sizeof(*esph) - ivlen; + int elen = skb->len - sizeof(struct ip_esp_hdr) - ivlen; int nfrags; int assoclen; int seqhilen; @@ -616,7 +615,7 @@ static int esp6_input(struct xfrm_state *x, struct sk_buff *skb) u8 *iv; struct scatterlist *sg; - if (!pskb_may_pull(skb, sizeof(*esph) + ivlen)) { + if (!pskb_may_pull(skb, sizeof(struct ip_esp_hdr) + ivlen)) { ret = -EINVAL; goto out; } @@ -626,7 +625,7 @@ static int esp6_input(struct xfrm_state *x, struct sk_buff *skb) goto out; } - assoclen = sizeof(*esph); + assoclen = sizeof(struct ip_esp_hdr); seqhilen = 0; if (x->props.flags & XFRM_STATE_ESN) { -- 2.17.1
[PATCH 1/3] xfrm: Make function xfrmi_get_link_net() static
From: Wei Yongjun Fixes the following sparse warning: net/xfrm/xfrm_interface.c:745:12: warning: symbol 'xfrmi_get_link_net' was not declared. Should it be static? Fixes: f203b76d7809 ("xfrm: Add virtual xfrm interfaces") Signed-off-by: Wei Yongjun Signed-off-by: Steffen Klassert --- net/xfrm/xfrm_interface.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/xfrm/xfrm_interface.c b/net/xfrm/xfrm_interface.c index 31acc6f33d98..2c0a5c59dcd0 100644 --- a/net/xfrm/xfrm_interface.c +++ b/net/xfrm/xfrm_interface.c @@ -742,7 +742,7 @@ static int xfrmi_fill_info(struct sk_buff *skb, const struct net_device *dev) return -EMSGSIZE; } -struct net *xfrmi_get_link_net(const struct net_device *dev) +static struct net *xfrmi_get_link_net(const struct net_device *dev) { struct xfrm_if *xi = netdev_priv(dev); -- 2.17.1
[PATCH 1/6] xfrm: Validate address prefix lengths in the xfrm selector.
We don't validate the address prefix lengths in the xfrm selector we got from userspace. This can lead to undefined behaviour in the address matching functions if the prefix is too big for the given address family. Fix this by checking the prefixes and refuse SA/policy insertation when a prefix is invalid. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: Air Icy Signed-off-by: Steffen Klassert --- net/xfrm/xfrm_user.c | 12 1 file changed, 12 insertions(+) diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c index 33878e6e0d0a..5151b3ebf068 100644 --- a/net/xfrm/xfrm_user.c +++ b/net/xfrm/xfrm_user.c @@ -151,10 +151,16 @@ static int verify_newsa_info(struct xfrm_usersa_info *p, err = -EINVAL; switch (p->family) { case AF_INET: + if (p->sel.prefixlen_d > 32 || p->sel.prefixlen_s > 32) + goto out; + break; case AF_INET6: #if IS_ENABLED(CONFIG_IPV6) + if (p->sel.prefixlen_d > 128 || p->sel.prefixlen_s > 128) + goto out; + break; #else err = -EAFNOSUPPORT; @@ -1359,10 +1365,16 @@ static int verify_newpolicy_info(struct xfrm_userpolicy_info *p) switch (p->sel.family) { case AF_INET: + if (p->sel.prefixlen_d > 32 || p->sel.prefixlen_s > 32) + return -EINVAL; + break; case AF_INET6: #if IS_ENABLED(CONFIG_IPV6) + if (p->sel.prefixlen_d > 128 || p->sel.prefixlen_s > 128) + return -EINVAL; + break; #else return -EAFNOSUPPORT; -- 2.17.1
[PATCH 6/6] xfrm: validate template mode
From: Sean Tranchetti XFRM mode parameters passed as part of the user templates in the IP_XFRM_POLICY are never properly validated. Passing values other than valid XFRM modes can cause stack-out-of-bounds reads to occur later in the XFRM processing: [ 140.535608] [ 140.543058] BUG: KASAN: stack-out-of-bounds in xfrm_state_find+0x17e4/0x1cc4 [ 140.550306] Read of size 4 at addr ffc0238a7a58 by task repro/5148 [ 140.557369] [ 140.558927] Call trace: [ 140.558936] dump_backtrace+0x0/0x388 [ 140.558940] show_stack+0x24/0x30 [ 140.558946] __dump_stack+0x24/0x2c [ 140.558949] dump_stack+0x8c/0xd0 [ 140.558956] print_address_description+0x74/0x234 [ 140.558960] kasan_report+0x240/0x264 [ 140.558963] __asan_report_load4_noabort+0x2c/0x38 [ 140.558967] xfrm_state_find+0x17e4/0x1cc4 [ 140.558971] xfrm_resolve_and_create_bundle+0x40c/0x1fb8 [ 140.558975] xfrm_lookup+0x238/0x1444 [ 140.558977] xfrm_lookup_route+0x48/0x11c [ 140.558984] ip_route_output_flow+0x88/0xc4 [ 140.558991] raw_sendmsg+0xa74/0x266c [ 140.558996] inet_sendmsg+0x258/0x3b0 [ 140.559002] sock_sendmsg+0xbc/0xec [ 140.559005] SyS_sendto+0x3a8/0x5a8 [ 140.559008] el0_svc_naked+0x34/0x38 [ 140.559009] [ 140.592245] page dumped because: kasan: bad access detected [ 140.597981] page_owner info is not active (free page?) [ 140.603267] [ 140.653503] Signed-off-by: Sean Tranchetti Signed-off-by: Steffen Klassert --- net/xfrm/xfrm_user.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c index 5151b3ebf068..d0672c400c2f 100644 --- a/net/xfrm/xfrm_user.c +++ b/net/xfrm/xfrm_user.c @@ -1455,6 +1455,9 @@ static int validate_tmpl(int nr, struct xfrm_user_tmpl *ut, u16 family) (ut[i].family != prev_family)) return -EINVAL; + if (ut[i].mode >= XFRM_MODE_MAX) + return -EINVAL; + prev_family = ut[i].family; switch (ut[i].family) { -- 2.17.1
pull request (net): ipsec 2018-10-01
1) Validate address prefix lengths in the xfrm selector, otherwise we may hit undefined behaviour in the address matching functions if the prefix is too big for the given address family. 2) Fix skb leak on local message size errors. From Thadeu Lima de Souza Cascardo. 3) We currently reset the transport header back to the network header after a transport mode transformation is applied. This leads to an incorrect transport header when multiple transport mode transformations are applied. Reset the transport header only after all transformations are already applied to fix this. From Sowmini Varadhan. 4) We only support one offloaded xfrm, so reset crypto_done after the first transformation in xfrm_input(). Otherwise we may call the wrong input method for subsequent transformations. From Sowmini Varadhan. 5) Fix NULL pointer dereference when skb_dst_force clears the dst_entry. skb_dst_force does not really force a dst refcount anymore, it might clear it instead. xfrm code did not expect this, add a check to not dereference skb_dst() if it was cleared by skb_dst_force. 6) Validate xfrm template mode, otherwise we can get a stack-out-of-bounds read in xfrm_state_find. From Sean Tranchetti. Please pull or let me know if there are problems. Thanks! The following changes since commit 25432eba9cd8f2ef5afef55be811b010a004b5fa: openvswitch: meter: Fix setting meter id for new entries (2018-07-29 13:20:54 -0700) are available in the Git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec.git master for you to fetch changes up to 32bf94fb5c2ec4ec842152d0e5937cd4bb6738fa: xfrm: validate template mode (2018-09-20 08:30:42 +0200) Sean Tranchetti (1): xfrm: validate template mode Sowmini Varadhan (2): xfrm: reset transport header back to network header after all input transforms ahave been applied xfrm: reset crypto_done when iterating over multiple input xfrms Steffen Klassert (2): xfrm: Validate address prefix lengths in the xfrm selector. xfrm: Fix NULL pointer dereference when skb_dst_force clears the dst_entry. Thadeu Lima de Souza Cascardo (1): xfrm6: call kfree_skb when skb is toobig net/ipv4/xfrm4_input.c | 1 + net/ipv4/xfrm4_mode_transport.c | 4 +--- net/ipv6/xfrm6_input.c | 1 + net/ipv6/xfrm6_mode_transport.c | 4 +--- net/ipv6/xfrm6_output.c | 2 ++ net/xfrm/xfrm_input.c | 1 + net/xfrm/xfrm_output.c | 4 net/xfrm/xfrm_policy.c | 4 net/xfrm/xfrm_user.c| 15 +++ 9 files changed, 30 insertions(+), 6 deletions(-)
[PATCH 4/6] xfrm: reset crypto_done when iterating over multiple input xfrms
From: Sowmini Varadhan We only support one offloaded xfrm (we do not have devices that can handle more than one offload), so reset crypto_done in xfrm_input() when iterating over multiple transforms in xfrm_input, so that we can invoke the appropriate x->type->input for the non-offloaded transforms Fixes: d77e38e612a0 ("xfrm: Add an IPsec hardware offloading API") Signed-off-by: Sowmini Varadhan Signed-off-by: Steffen Klassert --- net/xfrm/xfrm_input.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c index 352abca2605f..86f5afbd0a0c 100644 --- a/net/xfrm/xfrm_input.c +++ b/net/xfrm/xfrm_input.c @@ -453,6 +453,7 @@ int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 spi, int encap_type) XFRM_INC_STATS(net, LINUX_MIB_XFRMINHDRERROR); goto drop; } + crypto_done = false; } while (!err); err = xfrm_rcv_cb(skb, family, x->type->proto, 0); -- 2.17.1
[PATCH 5/6] xfrm: Fix NULL pointer dereference when skb_dst_force clears the dst_entry.
Since commit 222d7dbd258d ("net: prevent dst uses after free") skb_dst_force() might clear the dst_entry attached to the skb. The xfrm code don't expect this to happen, so we crash with a NULL pointer dereference in this case. Fix it by checking skb_dst(skb) for NULL after skb_dst_force() and drop the packet in cast the dst_entry was cleared. Fixes: 222d7dbd258d ("net: prevent dst uses after free") Reported-by: Tobias Hommel Reported-by: Kristian Evensen Reported-by: Wolfgang Walter Signed-off-by: Steffen Klassert --- net/xfrm/xfrm_output.c | 4 net/xfrm/xfrm_policy.c | 4 2 files changed, 8 insertions(+) diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c index 89b178a78dc7..36d15a38ce5e 100644 --- a/net/xfrm/xfrm_output.c +++ b/net/xfrm/xfrm_output.c @@ -101,6 +101,10 @@ static int xfrm_output_one(struct sk_buff *skb, int err) spin_unlock_bh(>lock); skb_dst_force(skb); + if (!skb_dst(skb)) { + XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTERROR); + goto error_nolock; + } if (xfrm_offload(skb)) { x->type_offload->encap(x, skb); diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c index 7c5e8978aeaa..626e0f4d1749 100644 --- a/net/xfrm/xfrm_policy.c +++ b/net/xfrm/xfrm_policy.c @@ -2548,6 +2548,10 @@ int __xfrm_route_forward(struct sk_buff *skb, unsigned short family) } skb_dst_force(skb); + if (!skb_dst(skb)) { + XFRM_INC_STATS(net, LINUX_MIB_XFRMFWDHDRERROR); + return 0; + } dst = xfrm_lookup(net, skb_dst(skb), , NULL, XFRM_LOOKUP_QUEUE); if (IS_ERR(dst)) { -- 2.17.1
[PATCH 3/6] xfrm: reset transport header back to network header after all input transforms ahave been applied
From: Sowmini Varadhan A policy may have been set up with multiple transforms (e.g., ESP and ipcomp). In this situation, the ingress IPsec processing iterates in xfrm_input() and applies each transform in turn, processing the nexthdr to find any additional xfrm that may apply. This patch resets the transport header back to network header only after the last transformation so that subsequent xfrms can find the correct transport header. Fixes: 7785bba299a8 ("esp: Add a software GRO codepath") Suggested-by: Steffen Klassert Signed-off-by: Sowmini Varadhan Signed-off-by: Steffen Klassert --- net/ipv4/xfrm4_input.c | 1 + net/ipv4/xfrm4_mode_transport.c | 4 +--- net/ipv6/xfrm6_input.c | 1 + net/ipv6/xfrm6_mode_transport.c | 4 +--- 4 files changed, 4 insertions(+), 6 deletions(-) diff --git a/net/ipv4/xfrm4_input.c b/net/ipv4/xfrm4_input.c index bcfc00e88756..f8de2482a529 100644 --- a/net/ipv4/xfrm4_input.c +++ b/net/ipv4/xfrm4_input.c @@ -67,6 +67,7 @@ int xfrm4_transport_finish(struct sk_buff *skb, int async) if (xo && (xo->flags & XFRM_GRO)) { skb_mac_header_rebuild(skb); + skb_reset_transport_header(skb); return 0; } diff --git a/net/ipv4/xfrm4_mode_transport.c b/net/ipv4/xfrm4_mode_transport.c index 3d36644890bb..1ad2c2c4e250 100644 --- a/net/ipv4/xfrm4_mode_transport.c +++ b/net/ipv4/xfrm4_mode_transport.c @@ -46,7 +46,6 @@ static int xfrm4_transport_output(struct xfrm_state *x, struct sk_buff *skb) static int xfrm4_transport_input(struct xfrm_state *x, struct sk_buff *skb) { int ihl = skb->data - skb_transport_header(skb); - struct xfrm_offload *xo = xfrm_offload(skb); if (skb->transport_header != skb->network_header) { memmove(skb_transport_header(skb), @@ -54,8 +53,7 @@ static int xfrm4_transport_input(struct xfrm_state *x, struct sk_buff *skb) skb->network_header = skb->transport_header; } ip_hdr(skb)->tot_len = htons(skb->len + ihl); - if (!xo || !(xo->flags & XFRM_GRO)) - skb_reset_transport_header(skb); + skb_reset_transport_header(skb); return 0; } diff --git a/net/ipv6/xfrm6_input.c b/net/ipv6/xfrm6_input.c index 841f4a07438e..9ef490dddcea 100644 --- a/net/ipv6/xfrm6_input.c +++ b/net/ipv6/xfrm6_input.c @@ -59,6 +59,7 @@ int xfrm6_transport_finish(struct sk_buff *skb, int async) if (xo && (xo->flags & XFRM_GRO)) { skb_mac_header_rebuild(skb); + skb_reset_transport_header(skb); return -1; } diff --git a/net/ipv6/xfrm6_mode_transport.c b/net/ipv6/xfrm6_mode_transport.c index 9ad07a91708e..3c29da5defe6 100644 --- a/net/ipv6/xfrm6_mode_transport.c +++ b/net/ipv6/xfrm6_mode_transport.c @@ -51,7 +51,6 @@ static int xfrm6_transport_output(struct xfrm_state *x, struct sk_buff *skb) static int xfrm6_transport_input(struct xfrm_state *x, struct sk_buff *skb) { int ihl = skb->data - skb_transport_header(skb); - struct xfrm_offload *xo = xfrm_offload(skb); if (skb->transport_header != skb->network_header) { memmove(skb_transport_header(skb), @@ -60,8 +59,7 @@ static int xfrm6_transport_input(struct xfrm_state *x, struct sk_buff *skb) } ipv6_hdr(skb)->payload_len = htons(skb->len + ihl - sizeof(struct ipv6hdr)); - if (!xo || !(xo->flags & XFRM_GRO)) - skb_reset_transport_header(skb); + skb_reset_transport_header(skb); return 0; } -- 2.17.1
[PATCH 2/6] xfrm6: call kfree_skb when skb is toobig
From: Thadeu Lima de Souza Cascardo After commit d6990976af7c5d8f55903bfb4289b6fb030bf754 ("vti6: fix PMTU caching and reporting on xmit"), some too big skbs might be potentially passed down to __xfrm6_output, causing it to fail to transmit but not free the skb, causing a leak of skb, and consequentially a leak of dst references. After running pmtu.sh, that shows as failure to unregister devices in a namespace: [ 311.397671] unregister_netdevice: waiting for veth_b to become free. Usage count = 1 The fix is to call kfree_skb in case of transmit failures. Fixes: dd767856a36e ("xfrm6: Don't call icmpv6_send on local error") Signed-off-by: Thadeu Lima de Souza Cascardo Reviewed-by: Sabrina Dubroca Signed-off-by: Steffen Klassert --- net/ipv6/xfrm6_output.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/net/ipv6/xfrm6_output.c b/net/ipv6/xfrm6_output.c index 5959ce9620eb..6a74080005cf 100644 --- a/net/ipv6/xfrm6_output.c +++ b/net/ipv6/xfrm6_output.c @@ -170,9 +170,11 @@ static int __xfrm6_output(struct net *net, struct sock *sk, struct sk_buff *skb) if (toobig && xfrm6_local_dontfrag(skb)) { xfrm6_local_rxpmtu(skb, mtu); + kfree_skb(skb); return -EMSGSIZE; } else if (!skb->ignore_df && toobig && skb->sk) { xfrm_local_error(skb, mtu); + kfree_skb(skb); return -EMSGSIZE; } -- 2.17.1
Re: [PATCH net] xfrm: validate template mode
On Wed, Sep 19, 2018 at 01:54:56PM -0600, Sean Tranchetti wrote: > XFRM mode parameters passed as part of the user templates > in the IP_XFRM_POLICY are never properly validated. Passing > values other than valid XFRM modes can cause stack-out-of-bounds > reads to occur later in the XFRM processing: > > [ 140.535608] > > [ 140.543058] BUG: KASAN: stack-out-of-bounds in > xfrm_state_find+0x17e4/0x1cc4 > [ 140.550306] Read of size 4 at addr ffc0238a7a58 by task repro/5148 > [ 140.557369] > [ 140.558927] Call trace: > [ 140.558936] dump_backtrace+0x0/0x388 > [ 140.558940] show_stack+0x24/0x30 > [ 140.558946] __dump_stack+0x24/0x2c > [ 140.558949] dump_stack+0x8c/0xd0 > [ 140.558956] print_address_description+0x74/0x234 > [ 140.558960] kasan_report+0x240/0x264 > [ 140.558963] __asan_report_load4_noabort+0x2c/0x38 > [ 140.558967] xfrm_state_find+0x17e4/0x1cc4 > [ 140.558971] xfrm_resolve_and_create_bundle+0x40c/0x1fb8 > [ 140.558975] xfrm_lookup+0x238/0x1444 > [ 140.558977] xfrm_lookup_route+0x48/0x11c > [ 140.558984] ip_route_output_flow+0x88/0xc4 > [ 140.558991] raw_sendmsg+0xa74/0x266c > [ 140.558996] inet_sendmsg+0x258/0x3b0 > [ 140.559002] sock_sendmsg+0xbc/0xec > [ 140.559005] SyS_sendto+0x3a8/0x5a8 > [ 140.559008] el0_svc_naked+0x34/0x38 > [ 140.559009] > [ 140.592245] page dumped because: kasan: bad access detected > [ 140.597981] page_owner info is not active (free page?) > [ 140.603267] > [ 140.653503] > > > Signed-off-by: Sean Tranchetti Applied to the ipsec tree, thanks a lot!
Re: [PATCH net-next RFC 7/8] udp: gro behind static key
On Mon, Sep 17, 2018 at 10:19:22AM -0400, Willem de Bruijn wrote: > On Mon, Sep 17, 2018 at 6:37 AM Steffen Klassert > wrote: > > > > Maybe in case that forwarding is enabled on the receiving device, > > inet_gro_receive() could do a route lookup and allow GRO if the > > route lookup returned at forwarding route. > > That's a better solution, if the cost is acceptable. We do have to > be careful against increasing per packet cycle cost in this path > given that it's a possible vector for DoS attempts. I played with this already and I have not seen any significant overhead when doing a route lookup in GRO. We have to do a route lookup anyway, so it should not matter too much if we do it here or later in the IP layer. > > > For flows that are likely software segmented after that, it > > would be worth to build packet chains insted of merging the > > payload. Packets of the same flow could travel together, but > > it would save the cost of the packet merging and segmenting. > > With software GSO that is faster, as it would have to allocate > all the separate segment skbs in skb_segment later. Though > there is some complexity if MTUs differ. This can be handled the same way as it is done for TCP GSO packets. If the size of the original packets is bigger than the outgoing MTU, we either signal ICMP_FRAG_NEEDED to the sender, or split the chain back into single packets and do fragmentation on them. > > With hardware UDP GSO, having a single skb will be cheaper in > the forwarding path. Using napi_gro_frags, device drivers really > do only end up allocating one skb for the GSO packet. Right, this is why it would be nice to have this configurable.
Re: [PATCH net-next RFC 7/8] udp: gro behind static key
On Fri, Sep 14, 2018 at 01:59:40PM -0400, Willem de Bruijn wrote: > From: Willem de Bruijn > > Avoid the socket lookup cost in udp_gro_receive if no socket has a > gro callback configured. It would be nice if we could do GRO not just for GRO configured sockets, but also for flows that are going to be IPsec transformed or directly forwarded. Maybe in case that forwarding is enabled on the receiving device, inet_gro_receive() could do a route lookup and allow GRO if the route lookup returned at forwarding route. For flows that are likely software segmented after that, it would be worth to build packet chains insted of merging the payload. Packets of the same flow could travel together, but it would save the cost of the packet merging and segmenting. This could be done similar to what I proposed for the list receive case: https://www.spinics.net/lists/netdev/msg522706.html How GRO should be done could be even configured by replacing the net_offload pointer similar to what Paolo propsed in his pachset with the inet_update_offload() function.
Re: [PATCH net-next RFC 7/8] udp: gro behind static key
On Fri, Sep 14, 2018 at 01:59:40PM -0400, Willem de Bruijn wrote: > From: Willem de Bruijn > > Avoid the socket lookup cost in udp_gro_receive if no socket has a > gro callback configured. > > Signed-off-by: Willem de Bruijn ... > diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c > index 4f6aa95a9b12..f44fe328aa0f 100644 > --- a/net/ipv4/udp_offload.c > +++ b/net/ipv4/udp_offload.c > @@ -405,7 +405,7 @@ static struct sk_buff *udp4_gro_receive(struct list_head > *head, > { > struct udphdr *uh = udp_gro_udphdr(skb); > > - if (unlikely(!uh)) > + if (unlikely(!uh) || !static_branch_unlikely(_encap_needed_key)) > goto flush; If you use udp_encap_needed_key to enalbe UDP GRO, then a UDP encapsulation socket will enable it too. Not sure if this is intentional. That said, enabling UDP GRO on a UDP encapsulation socket (ESP in UPD etc.) will fail badly as then encrypted ESP packets might be merged together. So we somehow should make sure that this does not happen. Anyway, this reminds me that we can support GRO for UDP encapsulation. It just requires separate GRO callbacks for the different encapsulation types.
[PATCH RFC 1/2] net: Support flow sorted RX skb lists for IPv4.
This patch sorts RX skb lists into separate flows, using a flow dissector, at the IP input layer. Packets of the same flow are chained at the frag_list pointer of the first skb of this flow. After ip_list_rcv_finish() the skb list has this layout: |-||-||-| |flow 1 ||flow 1 ||flow 1 | |-||-||-| |frag_list|<-\ |frag_list||frag_list| |-| \|-||-| |next |<-\ \---|next |<---|next | |-| \|-||-| | | ||-||-||-| ||flow 2 ||flow 2 ||flow 2 | ||-||-||-| ||frag_list|<-\ |frag_list||frag_list| ||-| \|-||-| ||next |<-\ \---|next |<---|next | |-| \|-||-| | | ||-||-| |-| ||flow 3 ||flow 3 | |flow 3 | ||-||-| |-| ||frag_list|<-\ |frag_list| |frag_list| ||-| \|-| |-| ||next |\---|next |<--|next | |-||-| |-| With this approach route lookups etc. are done just for one representative packet of a given flow instead for each packet. ip_sublist_rcv_finish() splits these lists then into: |-||-||-| |flow 1 ||flow 1 ||flow 1 | |-||-||-| |frag_list|<-\ |frag_list||frag_list| |-| \|-||-| |next |\---|next |<---|next | |-||-||-| Packets of the same flow can still travel together after that point. On input, this is plumbed through to ip_local_deliver_finish(), here the skb chain is split back into single packets. My hope is that this can be plumbed through to the sockets receive queue. I have a patch for UDP, but it has still problems with UDP encapsulaion, so it is not included here. On forward, the skb chain can travel together to the TX path. __skb_gso_segment() will build a standard skb list from this. For now, this is only enabled if the receiving device allows forwarding, as the forwarding path has currently the most gain from this. Known issues: - I don't have a NIC whose driver supports to build skb lists to be received by netif_receive_skb_list(). To test this codepath I used a hack that builds skb lists at the napi layer. - Performance measurements were done with this hack, so I don't know if these measurements are really meaningful. - This is early stage work, so the functional tests are only done on a basic level, it might be still buggy. - This still uses the skb->next, skb->prev pointers to build skb lists. So needs to be converted to standard list handling at some point. Signed-off-by: Steffen Klassert --- include/linux/skbuff.h| 5 ++ net/core/dev.c| 45 +++- net/core/flow_dissector.c | 40 +++ net/core/skbuff.c | 52 ++ net/ipv4/ip_input.c | 139 ++ net/ipv4/ip_output.c | 3 +- 6 files changed, 270 insertions(+), 14 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 17a13e4785fc..d070d073a1dc 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -575,6 +575,8 @@ enum { SKB_GSO_UDP = 1 << 16, SKB_GSO_UDP_L4 = 1 << 17, + + SKB_GSO_FRAGLIST = 1 << 18, }; #if BITS_PER_LONG > 32 @@ -1226,6 +1228,8 @@ skb_flow_dissect_flow_keys_basic(const struct sk_buff *skb, data, proto, nhoff, hlen, flags); } +u32 skb_flow_keys_rx_digest(struct sk_buff *skb, struct flow_keys_digest *digest); + void skb_flow_dissect_tunnel_info(const struct sk_buff *skb, struct flow_dissector *flow_dissector, @@ -3302,6 +3306,7 @@ int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen); void skb_scrub_packet(struct sk_buff *skb, bool xnet); bool skb_gso_validate_network_len(const struct sk_buff *skb, unsigned int mtu); bool skb_gso_validate_mac_len(const struct sk_buff *skb, unsigned int len); +void skb_segment_list(
[PATCH RFC 0/2] Flow sorted receive skb lists
This patchset consists of two patches. Patch 1 adds support for flow sorted rx skb lists for IPv4. This means that it sorts the skb list so that packets from the same flow can to travel together through the stack. The second patch of this pachset is just a hack that disables GRO and does skb list receive instead. I don't have a NIC whose driver supports to build skb lists to be received by netif_receive_skb_list(), so I used this hack to test patch 1. This is early stage work, so it might be not complete and still buggy. I send this now because I want to hear some comments on the approach itself before I spend more time to work on this. Forwarding performance measurements: I used used my IPsec forwarding test setup for this: -->| router 1 |>| router 2 |-- | | | | | | |Spirent Testcenter|<-- net-next (September 6th): Single stream UDP frame size 1460 Bytes: 1.258.446 fps. Single stream UDP frame size 64 Bytes: 1.250.000 fps. -- net-next (September 6th) + patch 2 (list receive): Single stream UDP frame size 1460 Bytes: 1.469.595 fps (+16%). Single stream UDP frame size 64 Bytes: 1.502.976 fps (+20%). -- net-next (September 6th) + patch 1 + patch 2 (flow sorted list receive): Single stream UDP frame size 1460 Bytes: 2.525.338 fps (+100%). Single stream UDP frame size 64 Bytes: 2.541.881 fps (+103%). ---
[PATCH RFC 2/2] net: Hack to enable skb list receive in the napi layer.
This patch was used to test patch ("net: Support flow sorted skb lists for IPv4.") It is just a hack that disables GRO and does skb list receive instead. Not for merging! Signed-off-by: Steffen Klassert --- include/linux/netdevice.h | 5 - net/core/dev.c| 23 ++- net/ipv4/ip_input.c | 1 + 3 files changed, 19 insertions(+), 10 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index e2b3bd750c98..6518ceeda05d 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -333,7 +333,10 @@ struct napi_struct { int poll_owner; #endif struct net_device *dev; - struct gro_list gro_hash[GRO_HASH_BUCKETS]; + union { + struct list_headrx_list; + struct gro_list gro_hash[GRO_HASH_BUCKETS]; + }; struct sk_buff *skb; struct hrtimer timer; struct list_headdev_list; diff --git a/net/core/dev.c b/net/core/dev.c index 147da35d7380..68e00727c657 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -5470,6 +5470,10 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff int same_flow; int grow; + list_add_tail(>list, >rx_list); + ret = GRO_CONSUMED; + goto out; + if (netif_elide_gro(skb->dev)) goto normal; @@ -5559,7 +5563,7 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff } else if (test_bit(hash, >gro_bitmask)) { __clear_bit(hash, >gro_bitmask); } - +out: return ret; normal: @@ -5958,7 +5962,7 @@ bool napi_complete_done(struct napi_struct *n, int work_done) NAPIF_STATE_IN_BUSY_POLL))) return false; - if (n->gro_bitmask) { + if (!list_empty(>rx_list)) { unsigned long timeout = 0; if (work_done) @@ -5967,8 +5971,10 @@ bool napi_complete_done(struct napi_struct *n, int work_done) if (timeout) hrtimer_start(>timer, ns_to_ktime(timeout), HRTIMER_MODE_REL_PINNED); - else - napi_gro_flush(n, false); + else { + netif_receive_skb_list(>rx_list); + INIT_LIST_HEAD(>rx_list); + } } if (unlikely(!list_empty(>poll_list))) { /* If n->poll_list is not empty, we need to mask irqs */ @@ -6189,6 +6195,7 @@ void netif_napi_add(struct net_device *dev, struct napi_struct *napi, int (*poll)(struct napi_struct *, int), int weight) { INIT_LIST_HEAD(>poll_list); + INIT_LIST_HEAD(>rx_list); hrtimer_init(>timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED); napi->timer.function = napi_watchdog; init_gro_hash(napi); @@ -6289,11 +6296,9 @@ static int napi_poll(struct napi_struct *n, struct list_head *repoll) goto out_unlock; } - if (n->gro_bitmask) { - /* flush too old packets -* If HZ < 1000, flush all packets. -*/ - napi_gro_flush(n, HZ >= 1000); + if (!list_empty(>rx_list)) { + netif_receive_skb_list(>rx_list); + INIT_LIST_HEAD(>rx_list); } /* Some drivers may have called napi_schedule diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c index bf710bf95fea..847965f83a2f 100644 --- a/net/ipv4/ip_input.c +++ b/net/ipv4/ip_input.c @@ -729,6 +729,7 @@ void ip_list_rcv(struct list_head *head, struct packet_type *pt, } list_add_tail(>list, ); } + /* dispatch final sublist */ ip_sublist_rcv(, curr_dev, curr_net); } -- 2.17.1
Re: kernels > v4.12 oops/crash with ipsec-traffic: bisected to b838d5e1c5b6e57b10ec8af2268824041e3ea911: ipv4: mark DST_NOGC and remove the operation of dst_free()
On Tue, Sep 11, 2018 at 09:02:48PM +0200, Tobias Hommel wrote: > > > Subject: [PATCH RFC] xfrm: Fix NULL pointer dereference when skb_dst_force > > > clears the dst_entry. > > > > > > Since commit 222d7dbd258d ("net: prevent dst uses after free") > > > skb_dst_force() might clear the dst_entry attached to the skb. > > > The xfrm code don't expect this to happen, so we crash with > > > a NULL pointer dereference in this case. Fix it by checking > > > skb_dst(skb) for NULL after skb_dst_force() and drop the packet > > > in cast the dst_entry was cleared. > > > > > > Fixes: 222d7dbd258d ("net: prevent dst uses after free") > > > Reported-by: Tobias Hommel > > > Reported-by: Kristian Evensen > > > Reported-by: Wolfgang Walter > > > Signed-off-by: Steffen Klassert > > > --- > > > > This patch fixes the problem here. > > > > XfrmFwdHdrError gets around 80 at the very beginning and remains so. > > Probably > > this happens when some route are changed/set then. > > > > Regards and thanks, > > Same here, we're now running stable for ~6 hours, XfrmFwdHdrError is at 220. > This is less than 1 lost packet per minute, which seems to be okay for now. Thanks a lot for testing! This is now applied to the ipsec tree.
Re: kernels > v4.12 oops/crash with ipsec-traffic: bisected to b838d5e1c5b6e57b10ec8af2268824041e3ea911: ipv4: mark DST_NOGC and remove the operation of dst_free()
On Mon, Sep 10, 2018 at 10:18:47AM +0200, Kristian Evensen wrote: > Hi, > > Thanks everyone for all the effort in debugging this issue. > > On Mon, Sep 10, 2018 at 8:39 AM Steffen Klassert > wrote: > > The easy fix that could be backported to stable would be > > to check skb->dst for NULL and drop the packet in that case. > > Thought I should just chime in and say that we deployed this > work-around when we started observing the error back in June. Since > then we have not seen any crashes. Also, we have instrumented some of > our kernels to count the number of times the error is hit (overall + > consecutive). Compared to the overall number of packets, the error > happens very rarely. With our workloads, we on average see the error > once every couple of days. Thanks for letting us know! I plan to fix this in the ipsec tree with: Subject: [PATCH RFC] xfrm: Fix NULL pointer dereference when skb_dst_force clears the dst_entry. Since commit 222d7dbd258d ("net: prevent dst uses after free") skb_dst_force() might clear the dst_entry attached to the skb. The xfrm code don't expect this to happen, so we crash with a NULL pointer dereference in this case. Fix it by checking skb_dst(skb) for NULL after skb_dst_force() and drop the packet in cast the dst_entry was cleared. Fixes: 222d7dbd258d ("net: prevent dst uses after free") Reported-by: Tobias Hommel Reported-by: Kristian Evensen Reported-by: Wolfgang Walter Signed-off-by: Steffen Klassert --- net/xfrm/xfrm_output.c | 4 net/xfrm/xfrm_policy.c | 4 2 files changed, 8 insertions(+) diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c index 89b178a78dc7..36d15a38ce5e 100644 --- a/net/xfrm/xfrm_output.c +++ b/net/xfrm/xfrm_output.c @@ -101,6 +101,10 @@ static int xfrm_output_one(struct sk_buff *skb, int err) spin_unlock_bh(>lock); skb_dst_force(skb); + if (!skb_dst(skb)) { + XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTERROR); + goto error_nolock; + } if (xfrm_offload(skb)) { x->type_offload->encap(x, skb); diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c index 7c5e8978aeaa..626e0f4d1749 100644 --- a/net/xfrm/xfrm_policy.c +++ b/net/xfrm/xfrm_policy.c @@ -2548,6 +2548,10 @@ int __xfrm_route_forward(struct sk_buff *skb, unsigned short family) } skb_dst_force(skb); + if (!skb_dst(skb)) { + XFRM_INC_STATS(net, LINUX_MIB_XFRMFWDHDRERROR); + return 0; + } dst = xfrm_lookup(net, skb_dst(skb), , NULL, XFRM_LOOKUP_QUEUE); if (IS_ERR(dst)) { -- 2.17.1
Re: kernels > v4.12 oops/crash with ipsec-traffic: bisected to b838d5e1c5b6e57b10ec8af2268824041e3ea911: ipv4: mark DST_NOGC and remove the operation of dst_free()
On Fri, Sep 07, 2018 at 11:10:55PM +0200, Wolfgang Walter wrote: > Hello Steffen, > > in one of your emails to Thomas you wrote: > > xfrm_lookup+0x2a is at the very beginning of xfrm_lookup(), here we > > find: > > > > u16 family = dst_orig->ops->family; > > > > ops has an offset of 32 bytes (20 hex) in dst_orig, so looks like > > dst_orig is NULL. > > > > In the forwarding case, we get dst_orig from the skb and dst_orig > > can't be NULL here unless the skb itself is already fishy. > > Is this really true? > > If xfrm_lookup is called from > > __xfrm_route_forward(): > > int __xfrm_route_forward(struct sk_buff *skb, unsigned short family) > { > struct net *net = dev_net(skb->dev); > struct flowi fl; > struct dst_entry *dst; > int res = 1; > > if (xfrm_decode_session(skb, , family) < 0) { > XFRM_INC_STATS(net, LINUX_MIB_XFRMFWDHDRERROR); > return 0; > } > > skb_dst_force(skb); > > dst = xfrm_lookup(net, skb_dst(skb), , NULL, XFRM_LOOKUP_QUEUE); > if (IS_ERR(dst)) { > res = 0; > dst = NULL; > } > skb_dst_set(skb, dst); > return res; > } > > couldn't it be possible that skb_dst_force(skb) actually sets dst to NULL if > it cannot safely lock it? If it is absolutely sure that skb_dst_force() never > can set dst to NULL I wonder why it is called at all? Ugh, skb_dst_force apparently changed since I looked at it last time. I did not expect that it can clear skb->dst. This behaviour was introduced with: commit 222d7dbd258dad4cd5241c43ef818141fad5a87a net: prevent dst uses after free from Eric Dumazet (put him to Cc). The easy fix that could be backported to stable would be to check skb->dst for NULL and drop the packet in that case. I wonder if we can do better here. We can still use the dst_entry as long as we don't exit the RCU grace period. But looking deeper into it, the crypto layer might return asynchronously. In this case, we exit the RCU grace period and we have to drop the packet anyway. If I understand correct, the bug happens rarely. So maybe we could just stay with the easy fix (I'll do a patch today). The other thing I wonder about is why Tobias bisected this to commit b838d5e1c5b6e57b10ec8af2268824041e3ea911 ipv4: mark DST_NOGC and remove the operation of dst_free() from 'Jun 17 2017' and not to commit 222d7dbd258dad4cd5241c43ef818141fad5a87a net: prevent dst uses after free from 'Sep 21 2017'. Maybe Tobias has seen two bugs. Before ("net: prevent dst uses after free"), it was the use after free, and after this fix it was a NULL pointer derference of skb->dst.
Re: [PATCH V2 ipsec-next 0/2] xfrm: bug fixes when processing multiple transforms
On Mon, Sep 03, 2018 at 04:36:51AM -0700, Sowmini Varadhan wrote: > This series contains bug fixes that were encountered when I set > up a libreswan tunnel using the config below, which will set up > an IPsec policy involving 2 tmpls. > > type=transport > compress=yes > esp=aes_gcm_c-128-null # offloaded to Niantic > auto=start > > The non-offload test case uses esp=aes_gcm_c-256-null. > > Each patch has a technical description of the contents of the fix. > > V2: added Fixes tag so that it can be backported to the stable trees. > > Sowmini Varadhan (2): > xfrm: reset transport header back to network header after all input > transforms ahave been applied > xfrm: reset crypto_done when iterating over multiple input xfrms All applied to the ipsec tree, thanks a lot Sowmini!
Re: [PATCH v2] xfrm6: call kfree_skb when skb is toobig
On Fri, Aug 31, 2018 at 08:38:49AM -0300, Thadeu Lima de Souza Cascardo wrote: > After commit d6990976af7c5d8f55903bfb4289b6fb030bf754 ("vti6: fix PMTU caching > and reporting on xmit"), some too big skbs might be potentially passed down to > __xfrm6_output, causing it to fail to transmit but not free the skb, causing a > leak of skb, and consequentially a leak of dst references. > > After running pmtu.sh, that shows as failure to unregister devices in a > namespace: > > [ 311.397671] unregister_netdevice: waiting for veth_b to become free. Usage > count = 1 > > The fix is to call kfree_skb in case of transmit failures. > > Signed-off-by: Thadeu Lima de Souza Cascardo > Reviewed-by: Sabrina Dubroca > Fixes: dd767856a36e ("xfrm6: Don't call icmpv6_send on local error") Applied, thanks a lot!
Re: [PATCH ipsec-next 1/2] xfrm: reset transport header back to network header after all input transforms ahave been applied
On Sun, Sep 02, 2018 at 04:18:42PM -0700, Sowmini Varadhan wrote: > A policy may have been set up with multiple transforms (e.g., ESP > and ipcomp). In this situation, the ingress IPsec processing > iterates in xfrm_input() and applies each transform in turn, > processing the nexthdr to find any additional xfrm that may apply. > > This patch resets the transport header back to network header > only after the last transformation so that subsequent xfrms > can find the correct transport header. > > Suggested-by: Steffen Klassert > Signed-off-by: Sowmini Varadhan This patch is a fix, so please add a proper fixes tag so that it can be backported to the stable trees. Same applied to the other patch in this series.
Re: [PATCH RFC net-next 00/11] udp gso
On Fri, Aug 31, 2018 at 09:08:59AM -0400, Willem de Bruijn wrote: > On Fri, Aug 31, 2018 at 5:09 AM Paolo Abeni wrote: > > > > Hi, > > > > On Tue, 2018-04-17 at 17:07 -0400, Willem de Bruijn wrote: > > > That said, for negotiated flows an inverse GRO feature could > > > conceivably be implemented to reduce rx stack traversal, too. > > > Though due to interleaving of packets on the wire, it aggregation > > > would be best effort, similar to TCP TSO and GRO using the > > > PSH bit as packetization signal. > > > > Reviving this old thread, before I forgot again. I have some local > > patches implementing UDP GRO in a dual way to current GSO_UDP_L4 > > implementation: several datagram with the same length are aggregated > > into a single one, and the user space receive a single larger packet > > instead of multiple ones. I hope quic can leverage such scenario, but I > > really know nothing about the protocol. > > > > I measure roughly a 50% performance improvement with udpgso_bench in > > respect to UDP GSO, and ~100% using a pktgen sender, and a reduced CPU > > usage on the receiver[1]. Some additional hacking to the general GRO > > bits is required to avoid useless socket lookups for ingress UDP > > packets when UDP_GSO is not enabled. > > > > If there is interest on this topic, I can share some RFC patches > > (hopefully somewhat next week). > > As Eric pointed out, QUIC reception on mobile clients over the WAN > may not see much gain. But apparently there is a non-trivial amount > of traffic the other way, to servers. Again, WAN might limit whatever > gain we get, but I do want to look into that. And there are other UDP high > throughput workloads (with or without QUIC) between servers. > > If you have patches, please do share them. I actually also have a rough > patch that I did not consider ready to share yet. Based on Tom's existing > socket lookup in udp_gro_receive to detect whether a local destination > exists and whether it has set an option to support receiving coalesced > payloads (along with a cmsg to share the segment size). > > Converting udp_recvmsg to split apart gso packets to make this > transparent seems to me to be too complex and not worth the effort. > > If a local socket is not found in udp_gro_receive, this could also be > tentative interpreted as a non-local path (with false positives), enabling > transparent use of GRO + GSO batching on the forwarding path. On forwarding, it might be even better to build packet lists instead of merging the payload. This would save the cost of the merge at GRO and the split at GSO (at least when the split happens in software). I'm working on patches that builds such skb lists. The list is chained at the frag_list pointer of the first skb, all subsequent skbs are linked to the next pointer of the skb. It looks like this: |-||-||-| |flow 1 ||flow 1 ||flow 1 | |-||-||-| |frag_list|<-\ |frag_list||frag_list| |-| \|-||-| |next |\---|next |<---|next | |-||-||-| Furthermore, I'm trying to integrate this with the new skb listification work. Instead of just linking the packets into the lists as they arrive, I try to sort them into flows, so that packets of the same flow can travel together and lookups etc. are just done for one representative skb of a given flow. Here I build the packet lists like this: |-||-||-| |flow 1 ||flow 1 ||flow 1 | |-||-||-| |frag_list|<-\ |frag_list||frag_list| |-| \|-||-| |next |<-\ \---|next |<---|next | |-| \|-||-| | | ||-||-||-| ||flow 2 ||flow 2 ||flow 2 | ||-||-||-| ||frag_list|<-\ |frag_list||frag_list| ||-| \|-||-| ||next |<-\ \---|next |<---|next | |-| \|-||-| | | ||-||-| |-| ||flow 3 ||flow 3 | |flow 3 | ||-||-| |-| ||frag_list|<-\ |frag_list| |frag_list| ||-| \|-| |-| ||next |\---|next |<--|next |
Re: kernels > v4.12 oops/crash with ipsec-traffic: bisected to b838d5e1c5b6e57b10ec8af2268824041e3ea911: ipv4: mark DST_NOGC and remove the operation of dst_free()
On Thu, Aug 30, 2018 at 08:53:50PM +0200, Wolfgang Walter wrote: > Hello, > > kernels > 4.12 do not work on one of our main routers. They crash as soon > as ipsec-tunnels are configured and ipsec-traffic actually flows. Can you please send the backtrace of this crash? Thanks!
Re: [PATCH 1/2] xfrm6: call kfree_skb when skb is toobig
On Thu, Aug 30, 2018 at 03:23:11PM +0200, Sabrina Dubroca wrote: > 2018-08-30, 09:58:16 -0300, Thadeu Lima de Souza Cascardo wrote: > > After commit d6990976af7c5d8f55903bfb4289b6fb030bf754 ("vti6: fix PMTU > > caching > > and reporting on xmit"), some too big skbs might be potentially passed down > > to > > __xfrm6_output, causing it to fail to transmit but not free the skb, > > causing a > > leak of skb, and consequentially a leak of dst references. > > > > After running pmtu.sh, that shows as failure to unregister devices in a > > namespace: > > > > [ 311.397671] unregister_netdevice: waiting for veth_b to become free. > > Usage count = 1 > > > > The fix is to call kfree_skb in case of transmit failures. > > > > Signed-off-by: Thadeu Lima de Souza Cascardo Good catch! > Reviewed-by: Sabrina Dubroca > > I was about to post the same patch. Arguably, the commit introducing > this bug is the one that added those "return -EMSGSIZE" to > __xfrm6_output without freeing. > > Either way, it's missing a Fixes: tag, which should be one of those, > or both: > > Fixes: d6990976af7c ("vti6: fix PMTU caching and reporting on xmit") > Fixes: dd767856a36e ("xfrm6: Don't call icmpv6_send on local error") This bug can be triggered even without vti6, so the correct Fixes tag would be the latter. Thadeu, please resend this one with the Fixes tag. Thanks!
Re: [PATCH 2/2] vti6: do not check for ignore_df in order to update pmtu
On Thu, Aug 30, 2018 at 09:58:17AM -0300, Thadeu Lima de Souza Cascardo wrote: > Before commit d6990976af7c5d8f55903bfb4289b6fb030bf754 ("vti6: fix PMTU > caching > and reporting on xmit"), skb was scrubbed before checking for ignore_df. The > scrubbing meant ignore_df was false, making the check irrelevant. Now that the > scrubbing happens after that, some packets might fail the checking and dst > will not have its pmtu updated. > > Not only that, but too big skb will be potentially passed down to > __xfrm6_output, causing it to fail to transmit but not free the skb, causing a > leak of skb, and consequentially a leak of dst references. > > After running pmtu.sh, that shows as failure to unregister devices in a > namespace: > > [ 311.397671] unregister_netdevice: waiting for veth_b to become free. Usage > count = 1 > > Signed-off-by: Thadeu Lima de Souza Cascardo > --- > net/ipv6/ip6_vti.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/net/ipv6/ip6_vti.c b/net/ipv6/ip6_vti.c > index c72ae3a4fe09..fbd3752ea587 100644 > --- a/net/ipv6/ip6_vti.c > +++ b/net/ipv6/ip6_vti.c > @@ -481,7 +481,7 @@ vti6_xmit(struct sk_buff *skb, struct net_device *dev, > struct flowi *fl) > } > > mtu = dst_mtu(dst); > - if (!skb->ignore_df && skb->len > mtu) { > + if (skb->len > mtu) { > skb_dst_update_pmtu(skb, mtu); The very same patch went already to the net tree two day ago: commit 9f2895461439fda2801a7906fb4c5fb3dbb37a0a vti6: remove !skb->ignore_df check from vti6_xmit()
Re: [PATCH ipsec-next] xfrm: allow driver to quietly refuse offload
On Wed, Aug 22, 2018 at 02:38:10PM -0700, Shannon Nelson wrote: > If the "offload" attribute is used to create an IPsec SA > and the .xdo_dev_state_add() fails, the SA creation fails. > However, if the "offload" attribute is used on a device that > doesn't offer it, the attribute is quietly ignored and the SA > is created without an offload. > > Along the same line of that second case, it would be good to > have a way for the device to refuse to offload an SA without > failing the whole SA creation. This patch adds that feature > by allowing the driver to return -EOPNOTSUPP as a signal that > the SA may be fine, it just can't be offloaded. > > This allows the user a little more flexibility in requesting > offloads and not needing to know every detail at all times about > each specific NIC when trying to create SAs. > > Signed-off-by: Shannon Nelson Applied to ipsec-next, thanks Shannon!
Re: [PATCH net] vti6: remove !skb->ignore_df check from vti6_xmit()
On Thu, Aug 23, 2018 at 07:49:54PM +0300, Alexey Kodanev wrote: > Before the commit d6990976af7c ("vti6: fix PMTU caching and reporting > on xmit") '!skb->ignore_df' check was always true because the function > skb_scrub_packet() was called before it, resetting ignore_df to zero. > > In the commit, skb_scrub_packet() was moved below, and now this check > can be false for the packet, e.g. when sending it in the two fragments, > this prevents successful PMTU updates in such case. The next attempts > to send the packet lead to the same tx error. Moreover, vti6 initial > MTU value relies on PMTU adjustments. > > This issue can be reproduced with the following LTP test script: > udp_ipsec_vti.sh -6 -p ah -m tunnel -s 2000 > > Fixes: ccd740cbc6e0 ("vti6: Add pmtu handling to vti6_xmit.") > Signed-off-by: Alexey Kodanev > --- > Not sure about xfrmi_xmit2(), it has a similar check for ignore_df... > > net/ipv6/ip6_vti.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/net/ipv6/ip6_vti.c b/net/ipv6/ip6_vti.c > index 38dec9d..f48d196 100644 > --- a/net/ipv6/ip6_vti.c > +++ b/net/ipv6/ip6_vti.c > @@ -481,7 +481,7 @@ static bool vti6_state_check(const struct xfrm_state *x, > } > > mtu = dst_mtu(dst); > - if (!skb->ignore_df && skb->len > mtu) { > + if (skb->len > mtu) { > skb_dst_update_pmtu(skb, mtu); This looks OK to me. If I remember correct, the !skb->ignore_df check was taken from the native xfrm6 PMTU handling. There this check makes sense because the packet can be still fragmented along the way through the stack. In this case here it is too late as we are about to TX the packet through the vti device. So we should update to the new IPsec PMTU and notify the sender about this. Acked-by: Steffen Klassert
Re: [PATCH net-next] xfrm: Make function xfrmi_get_link_net() static
On Sat, Jul 28, 2018 at 06:49:48AM +, Wei Yongjun wrote: > Fixes the following sparse warning: > > net/xfrm/xfrm_interface.c:745:12: warning: > symbol 'xfrmi_get_link_net' was not declared. Should it be static? > > Fixes: f203b76d7809 ("xfrm: Add virtual xfrm interfaces") > Signed-off-by: Wei Yongjun Applied, thanks Wei!
Re: [PATCH RFC ipsec-next] xfrm: Check Reverse-Mark Lookup Before ADDSA/DELSA
On Wed, Jul 25, 2018 at 03:36:47PM -0700, Nathan Harold wrote: > > diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c > index b669262682c9..ee212a7c91a9 100644 > --- a/net/xfrm/xfrm_state.c > +++ b/net/xfrm/xfrm_state.c > @@ -815,10 +815,10 @@ xfrm_init_tempstate(struct xfrm_state *x, const struct > flowi *fl, > afinfo->init_temprop(x, tmpl, daddr, saddr); > } > > -static struct xfrm_state *__xfrm_state_lookup(struct net *net, u32 mark, > - const xfrm_address_t *daddr, > - __be32 spi, u8 proto, > - unsigned short family) > +static struct xfrm_state * > +__xfrm_state_lookup(struct net *net, u32 mark, u32 mask, > + const xfrm_address_t *daddr, > + __be32 spi, u8 proto, unsigned short family) The argument list of these functions are getting longer and longer. Can't you just put in a pointer to struct xfrm_mark and dereference inside the function? Looks good otherwise.
[PATCH 07/14] xfrm: use time64_t for in-kernel timestamps
From: Arnd Bergmann The lifetime managment uses '__u64' timestamps on the user space interface, but 'unsigned long' for reading the current time in the kernel with get_seconds(). While this is probably safe beyond y2038, it will still overflow in 2106, and the get_seconds() call is deprecated because fo that. This changes the xfrm time handling to use time64_t consistently, along with reading the time using the safer ktime_get_real_seconds(). It still suffers from problems that can happen from a concurrent settimeofday() call or (to a lesser degree) a leap second update, but since the time stamps are part of the user API, there is nothing we can do to prevent that. Signed-off-by: Arnd Bergmann Signed-off-by: Steffen Klassert --- net/xfrm/xfrm_policy.c | 24 net/xfrm/xfrm_state.c | 10 +- 2 files changed, 17 insertions(+), 17 deletions(-) diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c index ef75891450e7..5d2f734f4309 100644 --- a/net/xfrm/xfrm_policy.c +++ b/net/xfrm/xfrm_policy.c @@ -189,8 +189,8 @@ static inline unsigned long make_jiffies(long secs) static void xfrm_policy_timer(struct timer_list *t) { struct xfrm_policy *xp = from_timer(xp, t, timer); - unsigned long now = get_seconds(); - long next = LONG_MAX; + time64_t now = ktime_get_real_seconds(); + time64_t next = TIME64_MAX; int warn = 0; int dir; @@ -202,7 +202,7 @@ static void xfrm_policy_timer(struct timer_list *t) dir = xfrm_policy_id2dir(xp->index); if (xp->lft.hard_add_expires_seconds) { - long tmo = xp->lft.hard_add_expires_seconds + + time64_t tmo = xp->lft.hard_add_expires_seconds + xp->curlft.add_time - now; if (tmo <= 0) goto expired; @@ -210,7 +210,7 @@ static void xfrm_policy_timer(struct timer_list *t) next = tmo; } if (xp->lft.hard_use_expires_seconds) { - long tmo = xp->lft.hard_use_expires_seconds + + time64_t tmo = xp->lft.hard_use_expires_seconds + (xp->curlft.use_time ? : xp->curlft.add_time) - now; if (tmo <= 0) goto expired; @@ -218,7 +218,7 @@ static void xfrm_policy_timer(struct timer_list *t) next = tmo; } if (xp->lft.soft_add_expires_seconds) { - long tmo = xp->lft.soft_add_expires_seconds + + time64_t tmo = xp->lft.soft_add_expires_seconds + xp->curlft.add_time - now; if (tmo <= 0) { warn = 1; @@ -228,7 +228,7 @@ static void xfrm_policy_timer(struct timer_list *t) next = tmo; } if (xp->lft.soft_use_expires_seconds) { - long tmo = xp->lft.soft_use_expires_seconds + + time64_t tmo = xp->lft.soft_use_expires_seconds + (xp->curlft.use_time ? : xp->curlft.add_time) - now; if (tmo <= 0) { warn = 1; @@ -240,7 +240,7 @@ static void xfrm_policy_timer(struct timer_list *t) if (warn) km_policy_expired(xp, dir, 0, 0); - if (next != LONG_MAX && + if (next != TIME64_MAX && !mod_timer(>timer, jiffies + make_jiffies(next))) xfrm_pol_hold(xp); @@ -791,7 +791,7 @@ int xfrm_policy_insert(int dir, struct xfrm_policy *policy, int excl) } policy->index = delpol ? delpol->index : xfrm_gen_index(net, dir, policy->index); hlist_add_head(>byidx, net->xfrm.policy_byidx+idx_hash(net, policy->index)); - policy->curlft.add_time = get_seconds(); + policy->curlft.add_time = ktime_get_real_seconds(); policy->curlft.use_time = 0; if (!mod_timer(>timer, jiffies + HZ)) xfrm_pol_hold(policy); @@ -1282,7 +1282,7 @@ int xfrm_sk_policy_insert(struct sock *sk, int dir, struct xfrm_policy *pol) old_pol = rcu_dereference_protected(sk->sk_policy[dir], lockdep_is_held(>xfrm.xfrm_policy_lock)); if (pol) { - pol->curlft.add_time = get_seconds(); + pol->curlft.add_time = ktime_get_real_seconds(); pol->index = xfrm_gen_index(net, XFRM_POLICY_MAX+dir, 0); xfrm_sk_policy_link(pol, dir); } @@ -2132,7 +2132,7 @@ struct dst_entry *xfrm_lookup(struct net *net, struct dst_entry *dst_orig, } for (i = 0; i < num_pols; i++) - pols[i]->curlft.use_time = get_seconds(); + pols[i]->curlft.use_time = ktime_get_real_seconds(); if (num_xfrms < 0) { /* Prohibit the flow */ @@ -2352,7 +2352,7 @@ int __xfrm_policy_chec
[PATCH 01/14] xfrm: Extend the output_mark to support input direction and masking.
We already support setting an output mark at the xfrm_state, unfortunately this does not support the input direction and masking the marks that will be applied to the skb. This change adds support applying a masked value in both directions. The existing XFRMA_OUTPUT_MARK number is reused for this purpose and as it is now bi-directional, it is renamed to XFRMA_SET_MARK. An additional XFRMA_SET_MARK_MASK attribute is added for setting the mask. If the attribute mask not provided, it is set to 0x, keeping the XFRMA_OUTPUT_MARK existing 'full mask' semantics. Co-developed-by: Tobias Brunner Co-developed-by: Eyal Birger Co-developed-by: Lorenzo Colitti Signed-off-by: Steffen Klassert Signed-off-by: Tobias Brunner Signed-off-by: Eyal Birger Signed-off-by: Lorenzo Colitti --- include/net/xfrm.h| 9 - include/uapi/linux/xfrm.h | 4 +++- net/xfrm/xfrm_device.c| 3 ++- net/xfrm/xfrm_input.c | 2 ++ net/xfrm/xfrm_output.c| 3 +-- net/xfrm/xfrm_policy.c| 5 +++-- net/xfrm/xfrm_user.c | 48 +-- 7 files changed, 57 insertions(+), 17 deletions(-) diff --git a/include/net/xfrm.h b/include/net/xfrm.h index 557122846e0e..3dc83ba26f62 100644 --- a/include/net/xfrm.h +++ b/include/net/xfrm.h @@ -166,7 +166,7 @@ struct xfrm_state { int header_len; int trailer_len; u32 extra_flags; - u32 output_mark; + struct xfrm_marksmark; } props; struct xfrm_lifetime_cfg lft; @@ -2012,6 +2012,13 @@ static inline int xfrm_mark_put(struct sk_buff *skb, const struct xfrm_mark *m) return ret; } +static inline __u32 xfrm_smark_get(__u32 mark, struct xfrm_state *x) +{ + struct xfrm_mark *m = >props.smark; + + return (m->v & m->m) | (mark & ~m->m); +} + static inline int xfrm_tunnel_check(struct sk_buff *skb, struct xfrm_state *x, unsigned int family) { diff --git a/include/uapi/linux/xfrm.h b/include/uapi/linux/xfrm.h index e3af2859188b..5a6ed7ce5a29 100644 --- a/include/uapi/linux/xfrm.h +++ b/include/uapi/linux/xfrm.h @@ -305,9 +305,11 @@ enum xfrm_attr_type_t { XFRMA_ADDRESS_FILTER, /* struct xfrm_address_filter */ XFRMA_PAD, XFRMA_OFFLOAD_DEV, /* struct xfrm_state_offload */ - XFRMA_OUTPUT_MARK, /* __u32 */ + XFRMA_SET_MARK, /* __u32 */ + XFRMA_SET_MARK_MASK,/* __u32 */ __XFRMA_MAX +#define XFRMA_OUTPUT_MARK XFRMA_SET_MARK /* Compatibility */ #define XFRMA_MAX (__XFRMA_MAX - 1) }; diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c index 175941e15a6e..16c1230d20fa 100644 --- a/net/xfrm/xfrm_device.c +++ b/net/xfrm/xfrm_device.c @@ -162,7 +162,8 @@ int xfrm_dev_state_add(struct net *net, struct xfrm_state *x, } dst = __xfrm_dst_lookup(net, 0, 0, saddr, daddr, - x->props.family, x->props.output_mark); + x->props.family, + xfrm_smark_get(0, x)); if (IS_ERR(dst)) return 0; diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c index 352abca2605f..074810436242 100644 --- a/net/xfrm/xfrm_input.c +++ b/net/xfrm/xfrm_input.c @@ -339,6 +339,8 @@ int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 spi, int encap_type) goto drop; } + skb->mark = xfrm_smark_get(skb->mark, x); + skb->sp->xvec[skb->sp->len++] = x; lock: diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c index 89b178a78dc7..45ba07ab3e4f 100644 --- a/net/xfrm/xfrm_output.c +++ b/net/xfrm/xfrm_output.c @@ -66,8 +66,7 @@ static int xfrm_output_one(struct sk_buff *skb, int err) goto error_nolock; } - if (x->props.output_mark) - skb->mark = x->props.output_mark; + skb->mark = xfrm_smark_get(skb->mark, x); err = x->outer_mode->output(x, skb); if (err) { diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c index 5f48251c1319..7637637717ec 100644 --- a/net/xfrm/xfrm_policy.c +++ b/net/xfrm/xfrm_policy.c @@ -1607,10 +1607,11 @@ static struct dst_entry *xfrm_bundle_create(struct xfrm_policy *policy, dst_copy_metrics(dst1, dst); if (xfrm[i]->props.mode != XFRM_MODE_TRANSPORT) { + __u32 mark = xfrm_smark_get(fl->flowi_mark, xfrm[i]); + family = xfrm[i]->props.family; dst = xfrm_dst_lookup(xfrm[i], tos, fl->flowi_oif, - , ,
[PATCH 12/14] xfrm: fix 'passing zero to ERR_PTR()' warning
From: YueHaibing Fix a static code checker warning: net/xfrm/xfrm_policy.c:1836 xfrm_resolve_and_create_bundle() warn: passing zero to 'ERR_PTR' xfrm_tmpl_resolve return 0 just means no xdst found, return NULL instead of passing zero to ERR_PTR. Fixes: d809ec895505 ("xfrm: do not assume that template resolving always returns xfrms") Signed-off-by: YueHaibing Signed-off-by: Steffen Klassert --- net/xfrm/xfrm_policy.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c index 2f70fe68b9b0..69f06f879091 100644 --- a/net/xfrm/xfrm_policy.c +++ b/net/xfrm/xfrm_policy.c @@ -1752,7 +1752,10 @@ xfrm_resolve_and_create_bundle(struct xfrm_policy **pols, int num_pols, /* Try to instantiate a bundle */ err = xfrm_tmpl_resolve(pols, num_pols, fl, xfrm, family); if (err <= 0) { - if (err != 0 && err != -EAGAIN) + if (err == 0) + return NULL; + + if (err != -EAGAIN) XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTPOLERROR); return ERR_PTR(err); } -- 2.14.1
[PATCH 08/14] ipv6: xfrm: use 64-bit timestamps
From: Arnd Bergmann get_seconds() is deprecated because it can overflow on 32-bit architectures. For the xfrm_state->lastused member, we treat the data as a 64-bit number already, so we just need to use the right accessor that works on both 32-bit and 64-bit machines. Signed-off-by: Arnd Bergmann Signed-off-by: Steffen Klassert --- include/net/xfrm.h | 2 +- net/ipv6/xfrm6_mode_ro.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/include/net/xfrm.h b/include/net/xfrm.h index a5378613a49c..1350e2cf0749 100644 --- a/include/net/xfrm.h +++ b/include/net/xfrm.h @@ -227,7 +227,7 @@ struct xfrm_state { longsaved_tmo; /* Last used time */ - unsigned long lastused; + time64_tlastused; struct page_frag xfrag; diff --git a/net/ipv6/xfrm6_mode_ro.c b/net/ipv6/xfrm6_mode_ro.c index 07d36573f50b..da28e4407b8f 100644 --- a/net/ipv6/xfrm6_mode_ro.c +++ b/net/ipv6/xfrm6_mode_ro.c @@ -55,7 +55,7 @@ static int xfrm6_ro_output(struct xfrm_state *x, struct sk_buff *skb) __skb_pull(skb, hdr_len); memmove(ipv6_hdr(skb), iph, hdr_len); - x->lastused = get_seconds(); + x->lastused = ktime_get_real_seconds(); return 0; } -- 2.14.1
[PATCH 03/14] xfrm: Add a new lookup key to match xfrm interfaces.
This patch adds the xfrm interface id as a lookup key for xfrm states and policies. With this we can assign states and policies to virtual xfrm interfaces. Signed-off-by: Steffen Klassert Acked-by: Shannon Nelson Acked-by: Benedict Wong Tested-by: Benedict Wong Tested-by: Antony Antony Reviewed-by: Eyal Birger --- include/net/xfrm.h| 21 +- include/uapi/linux/xfrm.h | 1 + net/core/pktgen.c | 2 +- net/key/af_key.c | 6 +++--- net/xfrm/xfrm_policy.c| 18 +++- net/xfrm/xfrm_state.c | 19 - net/xfrm/xfrm_user.c | 54 +-- 7 files changed, 96 insertions(+), 25 deletions(-) diff --git a/include/net/xfrm.h b/include/net/xfrm.h index 3dc83ba26f62..e8bada4d2a45 100644 --- a/include/net/xfrm.h +++ b/include/net/xfrm.h @@ -147,6 +147,7 @@ struct xfrm_state { struct xfrm_id id; struct xfrm_selectorsel; struct xfrm_markmark; + u32 if_id; u32 tfcpad; u32 genid; @@ -574,6 +575,7 @@ struct xfrm_policy { atomic_tgenid; u32 priority; u32 index; + u32 if_id; struct xfrm_markmark; struct xfrm_selectorselector; struct xfrm_lifetime_cfg lft; @@ -1533,7 +1535,7 @@ struct xfrm_state *xfrm_state_find(const xfrm_address_t *daddr, struct xfrm_tmpl *tmpl, struct xfrm_policy *pol, int *err, unsigned short family); -struct xfrm_state *xfrm_stateonly_find(struct net *net, u32 mark, +struct xfrm_state *xfrm_stateonly_find(struct net *net, u32 mark, u32 if_id, xfrm_address_t *daddr, xfrm_address_t *saddr, unsigned short family, @@ -1690,20 +1692,20 @@ int xfrm_policy_walk(struct net *net, struct xfrm_policy_walk *walk, void *); void xfrm_policy_walk_done(struct xfrm_policy_walk *walk, struct net *net); int xfrm_policy_insert(int dir, struct xfrm_policy *policy, int excl); -struct xfrm_policy *xfrm_policy_bysel_ctx(struct net *net, u32 mark, +struct xfrm_policy *xfrm_policy_bysel_ctx(struct net *net, u32 mark, u32 if_id, u8 type, int dir, struct xfrm_selector *sel, struct xfrm_sec_ctx *ctx, int delete, int *err); -struct xfrm_policy *xfrm_policy_byid(struct net *net, u32 mark, u8, int dir, -u32 id, int delete, int *err); +struct xfrm_policy *xfrm_policy_byid(struct net *net, u32 mark, u32 if_id, u8, +int dir, u32 id, int delete, int *err); int xfrm_policy_flush(struct net *net, u8 type, bool task_valid); void xfrm_policy_hash_rebuild(struct net *net); u32 xfrm_get_acqseq(void); int verify_spi_info(u8 proto, u32 min, u32 max); int xfrm_alloc_spi(struct xfrm_state *x, u32 minspi, u32 maxspi); struct xfrm_state *xfrm_find_acq(struct net *net, const struct xfrm_mark *mark, -u8 mode, u32 reqid, u8 proto, +u8 mode, u32 reqid, u32 if_id, u8 proto, const xfrm_address_t *daddr, const xfrm_address_t *saddr, int create, unsigned short family); @@ -2019,6 +2021,15 @@ static inline __u32 xfrm_smark_get(__u32 mark, struct xfrm_state *x) return (m->v & m->m) | (mark & ~m->m); } +static inline int xfrm_if_id_put(struct sk_buff *skb, __u32 if_id) +{ + int ret = 0; + + if (if_id) + ret = nla_put_u32(skb, XFRMA_IF_ID, if_id); + return ret; +} + static inline int xfrm_tunnel_check(struct sk_buff *skb, struct xfrm_state *x, unsigned int family) { diff --git a/include/uapi/linux/xfrm.h b/include/uapi/linux/xfrm.h index 5a6ed7ce5a29..5f3b9fec7b5f 100644 --- a/include/uapi/linux/xfrm.h +++ b/include/uapi/linux/xfrm.h @@ -307,6 +307,7 @@ enum xfrm_attr_type_t { XFRMA_OFFLOAD_DEV, /* struct xfrm_state_offload */ XFRMA_SET_MARK, /* __u32 */ XFRMA_SET_MARK_MASK,/* __u32 */ + XFRMA_IF_ID,/* __u32 */ __XFRMA_MAX #define XFRMA_OUTPUT_MARK XFRMA_SET_MARK /* Compatibility */ diff --git a/net/core/pktgen.c b/net/core/pktgen.c index 49368e21d228..6d37dbf0aa64 100644 --- a/net/core/pktgen.c +++ b/net/core/pktgen.c @@ -2255,7 +2255,7 @@ static void get_ipsec_sa(struct pktgen_dev *pkt_dev, int flow) x = xfrm_state_lookup_byspi(pn->
[PATCH 05/14] xfrm: policy: remove pcpu policy cache
From: Florian Westphal Kristian Evensen says: In a project I am involved in, we are running ipsec (Strongswan) on different mt7621-based routers. Each router is configured as an initiator and has around ~30 tunnels to different responders (running on misc. devices). Before the flow cache was removed (kernel 4.9), we got a combined throughput of around 70Mbit/s for all tunnels on one router. However, we recently switched to kernel 4.14 (4.14.48), and the total throughput is somewhere around 57Mbit/s (best-case). I.e., a drop of around 20%. Reverting the flow cache removal restores, as expected, performance levels to that of kernel 4.9. When pcpu xdst exists, it has to be validated first before it can be used. A negative hit thus increases cost vs. no-cache. As number of tunnels increases, hit rate decreases so this pcpu caching isn't a viable strategy. Furthermore, the xdst cache also needs to run with BH off, so when removing this the bh disable/enable pairs can be removed too. Kristian tested a 4.14.y backport of this change and reported increased performance: In our tests, the throughput reduction has been reduced from around -20% to -5%. We also see that the overall throughput is independent of the number of tunnels, while before the throughput was reduced as the number of tunnels increased. Reported-by: Kristian Evensen Signed-off-by: Florian Westphal Signed-off-by: Steffen Klassert --- include/net/xfrm.h | 1 - net/xfrm/xfrm_device.c | 10 net/xfrm/xfrm_policy.c | 139 + net/xfrm/xfrm_state.c | 5 +- 4 files changed, 3 insertions(+), 152 deletions(-) diff --git a/include/net/xfrm.h b/include/net/xfrm.h index 3fa578a6a819..a5378613a49c 100644 --- a/include/net/xfrm.h +++ b/include/net/xfrm.h @@ -332,7 +332,6 @@ int xfrm_policy_register_afinfo(const struct xfrm_policy_afinfo *afinfo, int fam void xfrm_policy_unregister_afinfo(const struct xfrm_policy_afinfo *afinfo); void km_policy_notify(struct xfrm_policy *xp, int dir, const struct km_event *c); -void xfrm_policy_cache_flush(void); void km_state_notify(struct xfrm_state *x, const struct km_event *c); struct xfrm_tmpl; diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c index 16c1230d20fa..11d56a44e9e8 100644 --- a/net/xfrm/xfrm_device.c +++ b/net/xfrm/xfrm_device.c @@ -307,12 +307,6 @@ static int xfrm_dev_register(struct net_device *dev) return xfrm_api_check(dev); } -static int xfrm_dev_unregister(struct net_device *dev) -{ - xfrm_policy_cache_flush(); - return NOTIFY_DONE; -} - static int xfrm_dev_feat_change(struct net_device *dev) { return xfrm_api_check(dev); @@ -323,7 +317,6 @@ static int xfrm_dev_down(struct net_device *dev) if (dev->features & NETIF_F_HW_ESP) xfrm_dev_state_flush(dev_net(dev), dev, true); - xfrm_policy_cache_flush(); return NOTIFY_DONE; } @@ -335,9 +328,6 @@ static int xfrm_dev_event(struct notifier_block *this, unsigned long event, void case NETDEV_REGISTER: return xfrm_dev_register(dev); - case NETDEV_UNREGISTER: - return xfrm_dev_unregister(dev); - case NETDEV_FEAT_CHANGE: return xfrm_dev_feat_change(dev); diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c index d960ea6657b5..ef75891450e7 100644 --- a/net/xfrm/xfrm_policy.c +++ b/net/xfrm/xfrm_policy.c @@ -45,8 +45,6 @@ struct xfrm_flo { u8 flags; }; -static DEFINE_PER_CPU(struct xfrm_dst *, xfrm_last_dst); -static struct work_struct *xfrm_pcpu_work __read_mostly; static DEFINE_SPINLOCK(xfrm_if_cb_lock); static struct xfrm_if_cb const __rcu *xfrm_if_cb __read_mostly; @@ -1732,108 +1730,6 @@ static int xfrm_expand_policies(const struct flowi *fl, u16 family, } -static void xfrm_last_dst_update(struct xfrm_dst *xdst, struct xfrm_dst *old) -{ - this_cpu_write(xfrm_last_dst, xdst); - if (old) - dst_release(>u.dst); -} - -static void __xfrm_pcpu_work_fn(void) -{ - struct xfrm_dst *old; - - old = this_cpu_read(xfrm_last_dst); - if (old && !xfrm_bundle_ok(old)) - xfrm_last_dst_update(NULL, old); -} - -static void xfrm_pcpu_work_fn(struct work_struct *work) -{ - local_bh_disable(); - rcu_read_lock(); - __xfrm_pcpu_work_fn(); - rcu_read_unlock(); - local_bh_enable(); -} - -void xfrm_policy_cache_flush(void) -{ - struct xfrm_dst *old; - bool found = false; - int cpu; - - might_sleep(); - - local_bh_disable(); - rcu_read_lock(); - for_each_possible_cpu(cpu) { - old = per_cpu(xfrm_last_dst, cpu); - if (old && !xfrm_bundle_ok(old)) { - if (smp_processor_id() == cpu) { - __xfrm_pcpu_work_fn(); -
[PATCH 09/14] xfrm: don't check offload_handle for nonzero
From: Shannon Nelson The offload_handle should be an opaque data cookie for the driver to use, much like the data cookie for a timer or alarm callback. Thus, the XFRM stack should not be checking for non-zero, because the driver might use that to store an array reference, which could be zero, or some other zero but meaningful value. We can remove the checks for non-zero because there are plenty other attributes also being checked to see if there is an offload in place for the SA in question. Signed-off-by: Shannon Nelson Signed-off-by: Steffen Klassert --- net/ipv4/esp4_offload.c | 6 ++ net/ipv6/esp6_offload.c | 6 ++ net/xfrm/xfrm_device.c | 6 +++--- 3 files changed, 7 insertions(+), 11 deletions(-) diff --git a/net/ipv4/esp4_offload.c b/net/ipv4/esp4_offload.c index 7cf755ef9efb..133589d693a9 100644 --- a/net/ipv4/esp4_offload.c +++ b/net/ipv4/esp4_offload.c @@ -135,8 +135,7 @@ static struct sk_buff *esp4_gso_segment(struct sk_buff *skb, skb->encap_hdr_csum = 1; - if (!(features & NETIF_F_HW_ESP) || !x->xso.offload_handle || - (x->xso.dev != skb->dev)) + if (!(features & NETIF_F_HW_ESP) || x->xso.dev != skb->dev) esp_features = features & ~(NETIF_F_SG | NETIF_F_CSUM_MASK); else if (!(features & NETIF_F_HW_ESP_TX_CSUM)) esp_features = features & ~NETIF_F_CSUM_MASK; @@ -179,8 +178,7 @@ static int esp_xmit(struct xfrm_state *x, struct sk_buff *skb, netdev_features_ if (!xo) return -EINVAL; - if (!(features & NETIF_F_HW_ESP) || !x->xso.offload_handle || - (x->xso.dev != skb->dev)) { + if (!(features & NETIF_F_HW_ESP) || x->xso.dev != skb->dev) { xo->flags |= CRYPTO_FALLBACK; hw_offload = false; } diff --git a/net/ipv6/esp6_offload.c b/net/ipv6/esp6_offload.c index 27f59b61f70f..96af267835c3 100644 --- a/net/ipv6/esp6_offload.c +++ b/net/ipv6/esp6_offload.c @@ -162,8 +162,7 @@ static struct sk_buff *esp6_gso_segment(struct sk_buff *skb, skb->encap_hdr_csum = 1; - if (!(features & NETIF_F_HW_ESP) || !x->xso.offload_handle || - (x->xso.dev != skb->dev)) + if (!(features & NETIF_F_HW_ESP) || x->xso.dev != skb->dev) esp_features = features & ~(NETIF_F_SG | NETIF_F_CSUM_MASK); else if (!(features & NETIF_F_HW_ESP_TX_CSUM)) esp_features = features & ~NETIF_F_CSUM_MASK; @@ -207,8 +206,7 @@ static int esp6_xmit(struct xfrm_state *x, struct sk_buff *skb, netdev_features if (!xo) return -EINVAL; - if (!(features & NETIF_F_HW_ESP) || !x->xso.offload_handle || - (x->xso.dev != skb->dev)) { + if (!(features & NETIF_F_HW_ESP) || x->xso.dev != skb->dev) { xo->flags |= CRYPTO_FALLBACK; hw_offload = false; } diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c index 11d56a44e9e8..5611b7521020 100644 --- a/net/xfrm/xfrm_device.c +++ b/net/xfrm/xfrm_device.c @@ -56,7 +56,7 @@ struct sk_buff *validate_xmit_xfrm(struct sk_buff *skb, netdev_features_t featur if (skb_is_gso(skb)) { struct net_device *dev = skb->dev; - if (unlikely(!x->xso.offload_handle || (x->xso.dev != dev))) { + if (unlikely(x->xso.dev != dev)) { struct sk_buff *segs; /* Packet got rerouted, fixup features and segment it. */ @@ -211,8 +211,8 @@ bool xfrm_dev_offload_ok(struct sk_buff *skb, struct xfrm_state *x) if (!x->type_offload || x->encap) return false; - if ((!dev || (x->xso.offload_handle && (dev == xfrm_dst_path(dst)->dev))) && -(!xdst->child->xfrm && x->type->get_mtu)) { + if ((!dev || (dev == xfrm_dst_path(dst)->dev)) && + (!xdst->child->xfrm && x->type->get_mtu)) { mtu = x->type->get_mtu(x, xdst->child_mtu_cached); if (skb->len <= mtu) -- 2.14.1
pull request (net-next): ipsec-next 2018-07-27
1) Extend the output_mark to also support the input direction and masking the mark values before applying to the skb. 2) Add a new lookup key for the upcomming xfrm interfaces. 3) Extend the xfrm lookups to match xfrm interface IDs. 4) Add virtual xfrm interfaces. The purpose of these interfaces is to overcome the design limitations that the existing VTI devices have. The main limitations that we see with the current VTI are the following: VTI interfaces are L3 tunnels with configurable endpoints. For xfrm, the tunnel endpoint are already determined by the SA. So the VTI tunnel endpoints must be either the same as on the SA or wildcards. In case VTI tunnel endpoints are same as on the SA, we get a one to one correlation between the SA and the tunnel. So each SA needs its own tunnel interface. On the other hand, we can have only one VTI tunnel with wildcard src/dst tunnel endpoints in the system because the lookup is based on the tunnel endpoints. The existing tunnel lookup won't work with multiple tunnels with wildcard tunnel endpoints. Some usecases require more than on VTI tunnel of this type, for example if somebody has multiple namespaces and every namespace requires such a VTI. VTI needs separate interfaces for IPv4 and IPv6 tunnels. So when routing to a VTI, we have to know to which address family this traffic class is going to be encapsulated. This is a lmitation because it makes routing more complex and it is not always possible to know what happens behind the VTI, e.g. when the VTI is move to some namespace. VTI works just with tunnel mode SAs. We need generic interfaces that ensures transfomation, regardless of the xfrm mode and the encapsulated address family. VTI is configured with a combination GRE keys and xfrm marks. With this we have to deal with some extra cases in the generic tunnel lookup because the GRE keys on the VTI are actually not GRE keys, the GRE keys were just reused for something else. All extensions to the VTI interfaces would require to add even more complexity to the generic tunnel lookup. So to overcome this, we developed xfrm interfaces with the following design goal: It should be possible to tunnel IPv4 and IPv6 through the same interface. No limitation on xfrm mode (tunnel, transport and beet). Should be a generic virtual interface that ensures IPsec transformation, no need to know what happens behind the interface. Interfaces should be configured with a new key that must match a new policy/SA lookup key. The lookup logic should stay in the xfrm codebase, no need to change or extend generic routing and tunnel lookups. Should be possible to use IPsec hardware offloads of the underlying interface. 5) Remove xfrm pcpu policy cache. This was added after the flowcache removal, but it turned out to make things even worse. From Florian Westphal. 6) Allow to update the set mark on SA updates. From Nathan Harold. 7) Convert some timestamps to time64_t. From Arnd Bergmann. 8) Don't check the offload_handle in xfrm code, it is an opaque data cookie for the driver. From Shannon Nelson. 9) Remove xfrmi interface ID from flowi. After this pach no generic code is touched anymore to do xfrm interface lookups. From Benedict Wong. 10) Allow to update the xfrm interface ID on SA updates. From Nathan Harold. 11) Don't pass zero to ERR_PTR() in xfrm_resolve_and_create_bundle. From YueHaibing. 12) Return more detailed errors on xfrm interface creation. From Benedict Wong. 13) Use PTR_ERR_OR_ZERO instead of IS_ERR + PTR_ERR. From the kbuild test robot. Please pull or let me know if there are problems. Thanks! The following changes since commit dd55c4ea9e6ba957083f985f33f994c23b405e9e: Merge branch 'r8169-enable-ASPM-on-RTL8168E-VL' (2018-06-23 20:54:56 +0900) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next.git master for you to fetch changes up to c6f5e017df9dfa9f6cbe70da008e7d716d726f1b: xfrm: fix ptr_ret.cocci warnings (2018-07-27 06:47:28 +0200) Arnd Bergmann (2): xfrm: use time64_t for in-kernel timestamps ipv6: xfrm: use 64-bit timestamps Benedict Wong (2): xfrm: Remove xfrmi interface ID from flowi xfrm: Return detailed errors from xfrmi_newlink Florian Westphal (1): xfrm: policy: remove pcpu policy cache Nathan Harold (2): xfrm: Allow Set Mark to be Updated Using UPDSA xfrm: Allow xfrmi if_id to be updated by UPDSA Shannon Nelson (1): xfrm: don't check offload_handle for nonzero Steffen Klassert (4): xfrm: Extend the output_mark to support input direction and masking. flow: Extend flow informations with xfrm interface id. xfrm: Add a new lookup key to match xfrm interfaces. xfrm: Add virtual xfrm interfaces
[PATCH 11/14] xfrm: Allow xfrmi if_id to be updated by UPDSA
From: Nathan Harold Allow attaching an SA to an xfrm interface id after the creation of the SA, so that tasks such as keying which must be done as the SA is created, can remain separate from the decision on how to route traffic from an SA. This permits SA creation to be decomposed in to three separate steps: 1) allocation of a SPI 2) algorithm and key negotiation 3) insertion into the data path Signed-off-by: Nathan Harold Signed-off-by: Steffen Klassert --- net/xfrm/xfrm_state.c | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c index bd5cb7ad2447..b669262682c9 100644 --- a/net/xfrm/xfrm_state.c +++ b/net/xfrm/xfrm_state.c @@ -1561,10 +1561,14 @@ int xfrm_state_update(struct xfrm_state *x) if (x1->curlft.use_time) xfrm_state_check_expire(x1); - if (x->props.smark.m || x->props.smark.v) { + if (x->props.smark.m || x->props.smark.v || x->if_id) { spin_lock_bh(>xfrm.xfrm_state_lock); - x1->props.smark = x->props.smark; + if (x->props.smark.m || x->props.smark.v) + x1->props.smark = x->props.smark; + + if (x->if_id) + x1->if_id = x->if_id; __xfrm_state_bump_genids(x1); spin_unlock_bh(>xfrm.xfrm_state_lock); -- 2.14.1
[PATCH 04/14] xfrm: Add virtual xfrm interfaces
This patch adds support for virtual xfrm interfaces. Packets that are routed through such an interface are guaranteed to be IPsec transformed or dropped. It is a generic virtual interface that ensures IPsec transformation, no need to know what happens behind the interface. This means that we can tunnel IPv4 and IPv6 through the same interface and support all xfrm modes (tunnel, transport and beet) on it. Co-developed-by: Lorenzo Colitti Co-developed-by: Benedict Wong Signed-off-by: Lorenzo Colitti Signed-off-by: Benedict Wong Signed-off-by: Steffen Klassert Acked-by: Shannon Nelson Tested-by: Benedict Wong Tested-by: Antony Antony Reviewed-by: Eyal Birger --- include/net/xfrm.h | 24 ++ include/uapi/linux/if_link.h | 10 + net/xfrm/Kconfig | 8 + net/xfrm/Makefile| 1 + net/xfrm/xfrm_input.c| 3 + net/xfrm/xfrm_interface.c| 972 +++ net/xfrm/xfrm_policy.c | 43 ++ 7 files changed, 1061 insertions(+) create mode 100644 net/xfrm/xfrm_interface.c diff --git a/include/net/xfrm.h b/include/net/xfrm.h index e8bada4d2a45..3fa578a6a819 100644 --- a/include/net/xfrm.h +++ b/include/net/xfrm.h @@ -23,6 +23,7 @@ #include #include #include +#include #include @@ -293,6 +294,13 @@ struct xfrm_replay { int (*overflow)(struct xfrm_state *x, struct sk_buff *skb); }; +struct xfrm_if_cb { + struct xfrm_if *(*decode_session)(struct sk_buff *skb); +}; + +void xfrm_if_register_cb(const struct xfrm_if_cb *ifcb); +void xfrm_if_unregister_cb(void); + struct net_device; struct xfrm_type; struct xfrm_dst; @@ -1039,6 +1047,22 @@ static inline void xfrm_dst_destroy(struct xfrm_dst *xdst) void xfrm_dst_ifdown(struct dst_entry *dst, struct net_device *dev); +struct xfrm_if_parms { + char name[IFNAMSIZ];/* name of XFRM device */ + int link; /* ifindex of underlying L2 interface */ + u32 if_id; /* interface identifyer */ +}; + +struct xfrm_if { + struct xfrm_if __rcu *next; /* next interface in list */ + struct net_device *dev; /* virtual device associated with interface */ + struct net_device *phydev; /* physical device */ + struct net *net;/* netns for packet i/o */ + struct xfrm_if_parms p; /* interface parms */ + + struct gro_cells gro_cells; +}; + struct xfrm_offload { /* Output sequence number for replay protection on offloading. */ struct { diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h index cf01b6824244..bff0af507b32 100644 --- a/include/uapi/linux/if_link.h +++ b/include/uapi/linux/if_link.h @@ -459,6 +459,16 @@ enum { #define IFLA_MACSEC_MAX (__IFLA_MACSEC_MAX - 1) +/* XFRM section */ +enum { + IFLA_XFRM_UNSPEC, + IFLA_XFRM_LINK, + IFLA_XFRM_IF_ID, + __IFLA_XFRM_MAX +}; + +#define IFLA_XFRM_MAX (__IFLA_XFRM_MAX - 1) + enum macsec_validation_type { MACSEC_VALIDATE_DISABLED = 0, MACSEC_VALIDATE_CHECK = 1, diff --git a/net/xfrm/Kconfig b/net/xfrm/Kconfig index 286ed25c1a69..53381888a7b3 100644 --- a/net/xfrm/Kconfig +++ b/net/xfrm/Kconfig @@ -25,6 +25,14 @@ config XFRM_USER If unsure, say Y. +config XFRM_INTERFACE + tristate "Transformation virtual interface" + depends on XFRM && IPV6 + ---help--- + This provides a virtual interface to route IPsec traffic. + + If unsure, say N. + config XFRM_SUB_POLICY bool "Transformation sub policy support" depends on XFRM diff --git a/net/xfrm/Makefile b/net/xfrm/Makefile index 0bd2465a8c5a..fbc4552d17b8 100644 --- a/net/xfrm/Makefile +++ b/net/xfrm/Makefile @@ -10,3 +10,4 @@ obj-$(CONFIG_XFRM_STATISTICS) += xfrm_proc.o obj-$(CONFIG_XFRM_ALGO) += xfrm_algo.o obj-$(CONFIG_XFRM_USER) += xfrm_user.o obj-$(CONFIG_XFRM_IPCOMP) += xfrm_ipcomp.o +obj-$(CONFIG_XFRM_INTERFACE) += xfrm_interface.o diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c index 074810436242..b89c9c7f8c5c 100644 --- a/net/xfrm/xfrm_input.c +++ b/net/xfrm/xfrm_input.c @@ -320,6 +320,7 @@ int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 spi, int encap_type) seq = 0; if (!spi && (err = xfrm_parse_spi(skb, nexthdr, , )) != 0) { + secpath_reset(skb); XFRM_INC_STATS(net, LINUX_MIB_XFRMINHDRERROR); goto drop; } @@ -328,12 +329,14 @@ int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 spi, int encap_type) XFRM_SPI_SKB_CB(skb)->daddroff); do { if (skb->sp->len == XFRM_MAX_DEPTH) { + secpath_reset(skb); XFRM_INC_STATS(net, LINUX_MIB_XFRMINBUFFERERROR); goto drop; } x = xfrm_state_lookup(net, mark, daddr,
[PATCH 06/14] xfrm: Allow Set Mark to be Updated Using UPDSA
From: Nathan Harold Allow UPDSA to change "set mark" to permit policy separation of packet routing decisions from SA keying in systems that use mark-based routing. The set mark, used as a routing and firewall mark for outbound packets, is made update-able which allows routing decisions to be handled independently of keying/SA creation. To maintain consistency with other optional attributes, the set mark is only updated if sent with a non-zero value. The per-SA lock and the xfrm_state_lock are taken in that order to avoid a deadlock with xfrm_timer_handler(), which also takes the locks in that order. Signed-off-by: Nathan Harold Signed-off-by: Steffen Klassert --- net/xfrm/xfrm_state.c | 9 + 1 file changed, 9 insertions(+) diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c index e04a510ec992..c9ffcdfa89f6 100644 --- a/net/xfrm/xfrm_state.c +++ b/net/xfrm/xfrm_state.c @@ -1562,6 +1562,15 @@ int xfrm_state_update(struct xfrm_state *x) if (x1->curlft.use_time) xfrm_state_check_expire(x1); + if (x->props.smark.m || x->props.smark.v) { + spin_lock_bh(>xfrm.xfrm_state_lock); + + x1->props.smark = x->props.smark; + + __xfrm_state_bump_genids(x1); + spin_unlock_bh(>xfrm.xfrm_state_lock); + } + err = 0; x->km.state = XFRM_STATE_DEAD; __xfrm_state_put(x); -- 2.14.1
[PATCH 02/14] flow: Extend flow informations with xfrm interface id.
Add a new flowi_xfrm structure with informations needed to do a xfrm lookup. At the moment it keeps the informations about the new xfrm interface id needed to lookup xfrm interfaces that are introduced with a followup patch. We need this new lookup key as other possible keys, like the ifindex is already part of the xfrm selector and used as a key to enforce the output device after the transformation in the policy/state lookup. Signed-off-by: Steffen Klassert Acked-by: Shannon Nelson Acked-by: Benedict Wong Tested-by: Benedict Wong Tested-by: Antony Antony Reviewed-by: Eyal Birger --- include/net/flow.h | 9 + 1 file changed, 9 insertions(+) diff --git a/include/net/flow.h b/include/net/flow.h index 8ce21793094e..187c9bef672f 100644 --- a/include/net/flow.h +++ b/include/net/flow.h @@ -26,6 +26,10 @@ struct flowi_tunnel { __be64 tun_id; }; +struct flowi_xfrm { + __u32 if_id; +}; + struct flowi_common { int flowic_oif; int flowic_iif; @@ -39,6 +43,7 @@ struct flowi_common { #define FLOWI_FLAG_SKIP_NH_OIF 0x04 __u32 flowic_secid; struct flowi_tunnel flowic_tun_key; + struct flowi_xfrm xfrm; kuid_t flowic_uid; }; @@ -78,6 +83,7 @@ struct flowi4 { #define flowi4_secid __fl_common.flowic_secid #define flowi4_tun_key __fl_common.flowic_tun_key #define flowi4_uid __fl_common.flowic_uid +#define flowi4_xfrm__fl_common.xfrm /* (saddr,daddr) must be grouped, same order as in IP header */ __be32 saddr; @@ -109,6 +115,7 @@ static inline void flowi4_init_output(struct flowi4 *fl4, int oif, fl4->flowi4_flags = flags; fl4->flowi4_secid = 0; fl4->flowi4_tun_key.tun_id = 0; + fl4->flowi4_xfrm.if_id = 0; fl4->flowi4_uid = uid; fl4->daddr = daddr; fl4->saddr = saddr; @@ -138,6 +145,7 @@ struct flowi6 { #define flowi6_secid __fl_common.flowic_secid #define flowi6_tun_key __fl_common.flowic_tun_key #define flowi6_uid __fl_common.flowic_uid +#define flowi6_xfrm__fl_common.xfrm struct in6_addr daddr; struct in6_addr saddr; /* Note: flowi6_tos is encoded in flowlabel, too. */ @@ -185,6 +193,7 @@ struct flowi { #define flowi_secidu.__fl_common.flowic_secid #define flowi_tun_key u.__fl_common.flowic_tun_key #define flowi_uid u.__fl_common.flowic_uid +#define flowi_xfrm u.__fl_common.xfrm } __attribute__((__aligned__(BITS_PER_LONG/8))); static inline struct flowi *flowi4_to_flowi(struct flowi4 *fl4) -- 2.14.1
[PATCH 13/14] xfrm: Return detailed errors from xfrmi_newlink
From: Benedict Wong Currently all failure modes of xfrm interface creation return EEXIST. This change improves the granularity of errnos provided by also returning ENODEV or EINVAL if failures happen in looking up the underlying interface, or a required parameter is not provided. This change has been tested against the Android Kernel Networking Tests, with additional xfrmi_newlink tests here: https://android-review.googlesource.com/c/kernel/tests/+/715755 Signed-off-by: Benedict Wong Signed-off-by: Steffen Klassert --- net/xfrm/xfrm_interface.c | 32 1 file changed, 20 insertions(+), 12 deletions(-) diff --git a/net/xfrm/xfrm_interface.c b/net/xfrm/xfrm_interface.c index ccfe18d67e98..481d7307ab51 100644 --- a/net/xfrm/xfrm_interface.c +++ b/net/xfrm/xfrm_interface.c @@ -149,14 +149,18 @@ static struct xfrm_if *xfrmi_create(struct net *net, struct xfrm_if_parms *p) char name[IFNAMSIZ]; int err; - if (p->name[0]) + if (p->name[0]) { strlcpy(name, p->name, IFNAMSIZ); - else + } else { + err = -EINVAL; goto failed; + } dev = alloc_netdev(sizeof(*xi), name, NET_NAME_UNKNOWN, xfrmi_dev_setup); - if (!dev) + if (!dev) { + err = -EAGAIN; goto failed; + } dev_net_set(dev, net); @@ -165,8 +169,10 @@ static struct xfrm_if *xfrmi_create(struct net *net, struct xfrm_if_parms *p) xi->net = net; xi->dev = dev; xi->phydev = dev_get_by_index(net, p->link); - if (!xi->phydev) + if (!xi->phydev) { + err = -ENODEV; goto failed_free; + } err = xfrmi_create2(dev); if (err < 0) @@ -179,7 +185,7 @@ static struct xfrm_if *xfrmi_create(struct net *net, struct xfrm_if_parms *p) failed_free: free_netdev(dev); failed: - return NULL; + return ERR_PTR(err); } static struct xfrm_if *xfrmi_locate(struct net *net, struct xfrm_if_parms *p, @@ -194,13 +200,13 @@ static struct xfrm_if *xfrmi_locate(struct net *net, struct xfrm_if_parms *p, xip = >next) { if (xi->p.if_id == p->if_id) { if (create) - return NULL; + return ERR_PTR(-EEXIST); return xi; } } if (!create) - return NULL; + return ERR_PTR(-ENODEV); return xfrmi_create(net, p); } @@ -682,8 +688,9 @@ static int xfrmi_newlink(struct net *src_net, struct net_device *dev, nla_strlcpy(p->name, tb[IFLA_IFNAME], IFNAMSIZ); - if (!xfrmi_locate(net, p, 1)) - return -EEXIST; + xi = xfrmi_locate(net, p, 1); + if (IS_ERR(xi)) + return PTR_ERR(xi); return 0; } @@ -704,11 +711,12 @@ static int xfrmi_changelink(struct net_device *dev, struct nlattr *tb[], xi = xfrmi_locate(net, >p, 0); - if (xi) { + if (IS_ERR_OR_NULL(xi)) { + xi = netdev_priv(dev); + } else { if (xi->dev != dev) return -EEXIST; - } else - xi = netdev_priv(dev); + } return xfrmi_update(xi, >p); } -- 2.14.1
[PATCH 14/14] xfrm: fix ptr_ret.cocci warnings
From: kbuild test robot net/xfrm/xfrm_interface.c:692:1-3: WARNING: PTR_ERR_OR_ZERO can be used Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR Generated by: scripts/coccinelle/api/ptr_ret.cocci Fixes: 44e2b838c24d ("xfrm: Return detailed errors from xfrmi_newlink") CC: Benedict Wong Signed-off-by: kbuild test robot Signed-off-by: Steffen Klassert --- net/xfrm/xfrm_interface.c | 5 + 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/net/xfrm/xfrm_interface.c b/net/xfrm/xfrm_interface.c index 481d7307ab51..31acc6f33d98 100644 --- a/net/xfrm/xfrm_interface.c +++ b/net/xfrm/xfrm_interface.c @@ -689,10 +689,7 @@ static int xfrmi_newlink(struct net *src_net, struct net_device *dev, nla_strlcpy(p->name, tb[IFLA_IFNAME], IFNAMSIZ); xi = xfrmi_locate(net, p, 1); - if (IS_ERR(xi)) - return PTR_ERR(xi); - - return 0; + return PTR_ERR_OR_ZERO(xi); } static void xfrmi_dellink(struct net_device *dev, struct list_head *head) -- 2.14.1
[PATCH 10/14] xfrm: Remove xfrmi interface ID from flowi
From: Benedict Wong In order to remove performance impact of having the extra u32 in every single flowi, this change removes the flowi_xfrm struct, prefering to take the if_id as a method parameter where needed. In the inbound direction, if_id is only needed during the __xfrm_check_policy() function, and the if_id can be determined at that point based on the skb. As such, xfrmi_decode_session() is only called with the skb in __xfrm_check_policy(). In the outbound direction, the only place where if_id is needed is the xfrm_lookup() call in xfrmi_xmit2(). With this change, the if_id is directly passed into the xfrm_lookup_with_ifid() call. All existing callers can still call xfrm_lookup(), which uses a default if_id of 0. This change does not change any behavior of XFRMIs except for improving overall system performance via flowi size reduction. This change has been tested against the Android Kernel Networking Tests: https://android.googlesource.com/kernel/tests/+/master/net/test Signed-off-by: Benedict Wong Signed-off-by: Steffen Klassert --- include/net/dst.h | 14 +++ include/net/flow.h| 9 - include/net/xfrm.h| 2 +- net/xfrm/xfrm_interface.c | 4 +- net/xfrm/xfrm_policy.c| 98 +++ net/xfrm/xfrm_state.c | 3 +- 6 files changed, 83 insertions(+), 47 deletions(-) diff --git a/include/net/dst.h b/include/net/dst.h index b3219cd8a5a1..7f735e76ca73 100644 --- a/include/net/dst.h +++ b/include/net/dst.h @@ -475,6 +475,14 @@ static inline struct dst_entry *xfrm_lookup(struct net *net, return dst_orig; } +static inline struct dst_entry * +xfrm_lookup_with_ifid(struct net *net, struct dst_entry *dst_orig, + const struct flowi *fl, const struct sock *sk, + int flags, u32 if_id) +{ + return dst_orig; +} + static inline struct dst_entry *xfrm_lookup_route(struct net *net, struct dst_entry *dst_orig, const struct flowi *fl, @@ -494,6 +502,12 @@ struct dst_entry *xfrm_lookup(struct net *net, struct dst_entry *dst_orig, const struct flowi *fl, const struct sock *sk, int flags); +struct dst_entry *xfrm_lookup_with_ifid(struct net *net, + struct dst_entry *dst_orig, + const struct flowi *fl, + const struct sock *sk, int flags, + u32 if_id); + struct dst_entry *xfrm_lookup_route(struct net *net, struct dst_entry *dst_orig, const struct flowi *fl, const struct sock *sk, int flags); diff --git a/include/net/flow.h b/include/net/flow.h index 187c9bef672f..8ce21793094e 100644 --- a/include/net/flow.h +++ b/include/net/flow.h @@ -26,10 +26,6 @@ struct flowi_tunnel { __be64 tun_id; }; -struct flowi_xfrm { - __u32 if_id; -}; - struct flowi_common { int flowic_oif; int flowic_iif; @@ -43,7 +39,6 @@ struct flowi_common { #define FLOWI_FLAG_SKIP_NH_OIF 0x04 __u32 flowic_secid; struct flowi_tunnel flowic_tun_key; - struct flowi_xfrm xfrm; kuid_t flowic_uid; }; @@ -83,7 +78,6 @@ struct flowi4 { #define flowi4_secid __fl_common.flowic_secid #define flowi4_tun_key __fl_common.flowic_tun_key #define flowi4_uid __fl_common.flowic_uid -#define flowi4_xfrm__fl_common.xfrm /* (saddr,daddr) must be grouped, same order as in IP header */ __be32 saddr; @@ -115,7 +109,6 @@ static inline void flowi4_init_output(struct flowi4 *fl4, int oif, fl4->flowi4_flags = flags; fl4->flowi4_secid = 0; fl4->flowi4_tun_key.tun_id = 0; - fl4->flowi4_xfrm.if_id = 0; fl4->flowi4_uid = uid; fl4->daddr = daddr; fl4->saddr = saddr; @@ -145,7 +138,6 @@ struct flowi6 { #define flowi6_secid __fl_common.flowic_secid #define flowi6_tun_key __fl_common.flowic_tun_key #define flowi6_uid __fl_common.flowic_uid -#define flowi6_xfrm__fl_common.xfrm struct in6_addr daddr; struct in6_addr saddr; /* Note: flowi6_tos is encoded in flowlabel, too. */ @@ -193,7 +185,6 @@ struct flowi { #define flowi_secidu.__fl_common.flowic_secid #define flowi_tun_key u.__fl_common.flowic_tun_key #define flowi_uid u.__fl_common.flowic_uid -#define flowi_xfrm u.__fl_common.xfrm } __attribute__((__aligned__(BITS_PER_LONG/8))); static inline struct flowi *flowi4_to_flowi(struct flowi4 *fl4) diff --git a/include/net/xfrm.h b/include/net/xfrm.h index 1350e2cf0749..ca820945f30c 100644 ---
[PATCH 5/5] esp6: fix memleak on error path in esp6_input
From: Zhen Lei This ought to be an omission in e6194923237 ("esp: Fix memleaks on error paths."). The memleak on error path in esp6_input is similar to esp_input of esp4. Fixes: e6194923237 ("esp: Fix memleaks on error paths.") Fixes: 3f29770723f ("ipsec: check return value of skb_to_sgvec always") Signed-off-by: Zhen Lei Signed-off-by: Steffen Klassert --- net/ipv6/esp6.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/net/ipv6/esp6.c b/net/ipv6/esp6.c index 97513f35bcc5..88a7579c23bd 100644 --- a/net/ipv6/esp6.c +++ b/net/ipv6/esp6.c @@ -669,8 +669,10 @@ static int esp6_input(struct xfrm_state *x, struct sk_buff *skb) sg_init_table(sg, nfrags); ret = skb_to_sgvec(skb, sg, 0, skb->len); - if (unlikely(ret < 0)) + if (unlikely(ret < 0)) { + kfree(tmp); goto out; + } skb->ip_summed = CHECKSUM_NONE; -- 2.14.1
[PATCH 2/5] xfrm_user: prevent leaking 2 bytes of kernel memory
From: Eric Dumazet struct xfrm_userpolicy_type has two holes, so we should not use C99 style initializer. KMSAN report: BUG: KMSAN: kernel-infoleak in copyout lib/iov_iter.c:140 [inline] BUG: KMSAN: kernel-infoleak in _copy_to_iter+0x1b14/0x2800 lib/iov_iter.c:571 CPU: 1 PID: 4520 Comm: syz-executor841 Not tainted 4.17.0+ #5 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x185/0x1d0 lib/dump_stack.c:113 kmsan_report+0x188/0x2a0 mm/kmsan/kmsan.c:1117 kmsan_internal_check_memory+0x138/0x1f0 mm/kmsan/kmsan.c:1211 kmsan_copy_to_user+0x7a/0x160 mm/kmsan/kmsan.c:1253 copyout lib/iov_iter.c:140 [inline] _copy_to_iter+0x1b14/0x2800 lib/iov_iter.c:571 copy_to_iter include/linux/uio.h:106 [inline] skb_copy_datagram_iter+0x422/0xfa0 net/core/datagram.c:431 skb_copy_datagram_msg include/linux/skbuff.h:3268 [inline] netlink_recvmsg+0x6f1/0x1900 net/netlink/af_netlink.c:1959 sock_recvmsg_nosec net/socket.c:802 [inline] sock_recvmsg+0x1d6/0x230 net/socket.c:809 ___sys_recvmsg+0x3fe/0x810 net/socket.c:2279 __sys_recvmmsg+0x58e/0xe30 net/socket.c:2391 do_sys_recvmmsg+0x2a6/0x3e0 net/socket.c:2472 __do_sys_recvmmsg net/socket.c:2485 [inline] __se_sys_recvmmsg net/socket.c:2481 [inline] __x64_sys_recvmmsg+0x15d/0x1c0 net/socket.c:2481 do_syscall_64+0x15b/0x230 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x446ce9 RSP: 002b:7fc307918db8 EFLAGS: 0293 ORIG_RAX: 012b RAX: ffda RBX: 006dbc24 RCX: 00446ce9 RDX: 000a RSI: 20005040 RDI: 0003 RBP: 006dbc20 R08: 20004e40 R09: R10: 4000 R11: 0293 R12: R13: 7ffc8d2df32f R14: 7fc3079199c0 R15: 0001 Uninit was stored to memory at: kmsan_save_stack_with_flags mm/kmsan/kmsan.c:279 [inline] kmsan_save_stack mm/kmsan/kmsan.c:294 [inline] kmsan_internal_chain_origin+0x12b/0x210 mm/kmsan/kmsan.c:685 kmsan_memcpy_origins+0x11d/0x170 mm/kmsan/kmsan.c:527 __msan_memcpy+0x109/0x160 mm/kmsan/kmsan_instr.c:413 __nla_put lib/nlattr.c:569 [inline] nla_put+0x276/0x340 lib/nlattr.c:627 copy_to_user_policy_type net/xfrm/xfrm_user.c:1678 [inline] dump_one_policy+0xbe1/0x1090 net/xfrm/xfrm_user.c:1708 xfrm_policy_walk+0x45a/0xd00 net/xfrm/xfrm_policy.c:1013 xfrm_dump_policy+0x1c0/0x2a0 net/xfrm/xfrm_user.c:1749 netlink_dump+0x9b5/0x1550 net/netlink/af_netlink.c:2226 __netlink_dump_start+0x1131/0x1270 net/netlink/af_netlink.c:2323 netlink_dump_start include/linux/netlink.h:214 [inline] xfrm_user_rcv_msg+0x8a3/0x9b0 net/xfrm/xfrm_user.c:2577 netlink_rcv_skb+0x37e/0x600 net/netlink/af_netlink.c:2448 xfrm_netlink_rcv+0xb2/0xf0 net/xfrm/xfrm_user.c:2598 netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline] netlink_unicast+0x1680/0x1750 net/netlink/af_netlink.c:1336 netlink_sendmsg+0x104f/0x1350 net/netlink/af_netlink.c:1901 sock_sendmsg_nosec net/socket.c:629 [inline] sock_sendmsg net/socket.c:639 [inline] ___sys_sendmsg+0xec8/0x1320 net/socket.c:2117 __sys_sendmsg net/socket.c:2155 [inline] __do_sys_sendmsg net/socket.c:2164 [inline] __se_sys_sendmsg net/socket.c:2162 [inline] __x64_sys_sendmsg+0x331/0x460 net/socket.c:2162 do_syscall_64+0x15b/0x230 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Local variable description: upt.i@dump_one_policy Variable was created at: dump_one_policy+0x78/0x1090 net/xfrm/xfrm_user.c:1689 xfrm_policy_walk+0x45a/0xd00 net/xfrm/xfrm_policy.c:1013 Byte 130 of 137 is uninitialized Memory access starts at 88019550407f Fixes: c0144beaeca42 ("[XFRM] netlink: Use nla_put()/NLA_PUT() variantes") Signed-off-by: Eric Dumazet Reported-by: syzbot Cc: Steffen Klassert Cc: Herbert Xu Signed-off-by: Steffen Klassert --- net/xfrm/xfrm_user.c | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c index 080035f056d9..1e50b70ad668 100644 --- a/net/xfrm/xfrm_user.c +++ b/net/xfrm/xfrm_user.c @@ -1671,9 +1671,11 @@ static inline unsigned int userpolicy_type_attrsize(void) #ifdef CONFIG_XFRM_SUB_POLICY static int copy_to_user_policy_type(u8 type, struct sk_buff *skb) { - struct xfrm_userpolicy_type upt = { - .type = type, - }; + struct xfrm_userpolicy_type upt; + + /* Sadly there are two holes in struct xfrm_userpolicy_type */ + memset(, 0, sizeof(upt)); + upt.type = type; return nla_put(skb, XFRMA_POLICY_TYPE, sizeof(upt), ); } -- 2.14.1
pull request (net): ipsec 2018-07-27
1) Fix PMTU handling of vti6. We update the PMTU on the xfrm dst_entry which is not cached anymore after the flowchache removal. So update the PMTU of the original dst_entry instead. From Eyal Birger. 2) Fix a leak of kernel memory to userspace. From Eric Dumazet. 3) Fix a possible dst_entry memleak in xfrm_lookup_route. From Tommi Rantala. 4) Fix a skb leak in case we can't call nlmsg_multicast from xfrm_nlmsg_multicast. From Florian Westphal. 5) Fix a leak of a temporary buffer in the error path of esp6_input. From Zhen Lei. Please pull or let me know if there are problems. Thanks! The following changes since commit 1c8c5a9d38f607c0b6fd12c91cbe1a4418762a21: Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next (2018-06-06 18:39:49 -0700) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec.git master for you to fetch changes up to 7284fdf39a912322ce97de2d30def3c6068a418c: esp6: fix memleak on error path in esp6_input (2018-06-27 17:32:11 +0200) Eric Dumazet (1): xfrm_user: prevent leaking 2 bytes of kernel memory Eyal Birger (1): vti6: fix PMTU caching and reporting on xmit Florian Westphal (1): xfrm: free skb if nlsk pointer is NULL Tommi Rantala (1): xfrm: fix missing dst_release() after policy blocking lbcast and multicast Zhen Lei (1): esp6: fix memleak on error path in esp6_input net/ipv6/esp6.c| 4 +++- net/ipv6/ip6_vti.c | 11 ++- net/xfrm/xfrm_policy.c | 3 +++ net/xfrm/xfrm_user.c | 18 +++--- 4 files changed, 23 insertions(+), 13 deletions(-)
[PATCH 4/5] xfrm: free skb if nlsk pointer is NULL
From: Florian Westphal nlmsg_multicast() always frees the skb, so in case we cannot call it we must do that ourselves. Fixes: 21ee543edc0dea ("xfrm: fix race between netns cleanup and state expire notification") Signed-off-by: Florian Westphal Signed-off-by: Steffen Klassert --- net/xfrm/xfrm_user.c | 10 ++ 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c index 1e50b70ad668..33878e6e0d0a 100644 --- a/net/xfrm/xfrm_user.c +++ b/net/xfrm/xfrm_user.c @@ -1025,10 +1025,12 @@ static inline int xfrm_nlmsg_multicast(struct net *net, struct sk_buff *skb, { struct sock *nlsk = rcu_dereference(net->xfrm.nlsk); - if (nlsk) - return nlmsg_multicast(nlsk, skb, pid, group, GFP_ATOMIC); - else - return -1; + if (!nlsk) { + kfree_skb(skb); + return -EPIPE; + } + + return nlmsg_multicast(nlsk, skb, pid, group, GFP_ATOMIC); } static inline unsigned int xfrm_spdinfo_msgsize(void) -- 2.14.1
[PATCH 3/5] xfrm: fix missing dst_release() after policy blocking lbcast and multicast
From: Tommi Rantala Fix missing dst_release() when local broadcast or multicast traffic is xfrm policy blocked. For IPv4 this results to dst leak: ip_route_output_flow() allocates dst_entry via __ip_route_output_key() and passes it to xfrm_lookup_route(). xfrm_lookup returns ERR_PTR(-EPERM) that is propagated. The dst that was allocated is never released. IPv4 local broadcast testcase: ping -b 192.168.1.255 & sleep 1 ip xfrm policy add src 0.0.0.0/0 dst 192.168.1.255/32 dir out action block IPv4 multicast testcase: ping 224.0.0.1 & sleep 1 ip xfrm policy add src 0.0.0.0/0 dst 224.0.0.1/32 dir out action block For IPv6 the missing dst_release() causes trouble e.g. when used in netns: ip netns add TEST ip netns exec TEST ip link set lo up ip link add dummy0 type dummy ip link set dev dummy0 netns TEST ip netns exec TEST ip addr add fd00:: dev dummy0 ip netns exec TEST ip link set dummy0 up ip netns exec TEST ping -6 -c 5 ff02::1%dummy0 & sleep 1 ip netns exec TEST ip xfrm policy add src ::/0 dst ff02::1 dir out action block wait ip netns del TEST After netns deletion we see: [ 258.239097] unregister_netdevice: waiting for lo to become free. Usage count = 2 [ 268.279061] unregister_netdevice: waiting for lo to become free. Usage count = 2 [ 278.367018] unregister_netdevice: waiting for lo to become free. Usage count = 2 [ 288.375259] unregister_netdevice: waiting for lo to become free. Usage count = 2 Fixes: ac37e2515c1a ("xfrm: release dst_orig in case of error in xfrm_lookup()") Signed-off-by: Tommi Rantala Signed-off-by: Steffen Klassert --- net/xfrm/xfrm_policy.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c index 5f48251c1319..7c5e8978aeaa 100644 --- a/net/xfrm/xfrm_policy.c +++ b/net/xfrm/xfrm_policy.c @@ -2286,6 +2286,9 @@ struct dst_entry *xfrm_lookup_route(struct net *net, struct dst_entry *dst_orig, if (IS_ERR(dst) && PTR_ERR(dst) == -EREMOTE) return make_blackhole(net, dst_orig->ops->family, dst_orig); + if (IS_ERR(dst)) + dst_release(dst_orig); + return dst; } EXPORT_SYMBOL(xfrm_lookup_route); -- 2.14.1
[PATCH 1/5] vti6: fix PMTU caching and reporting on xmit
From: Eyal Birger When setting the skb->dst before doing the MTU check, the route PMTU caching and reporting is done on the new dst which is about to be released. Instead, PMTU handling should be done using the original dst. This is aligned with IPv4 VTI. Fixes: ccd740cbc6 ("vti6: Add pmtu handling to vti6_xmit.") Signed-off-by: Eyal Birger Signed-off-by: Steffen Klassert --- net/ipv6/ip6_vti.c | 11 ++- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/net/ipv6/ip6_vti.c b/net/ipv6/ip6_vti.c index b7f28deddaea..c72ae3a4fe09 100644 --- a/net/ipv6/ip6_vti.c +++ b/net/ipv6/ip6_vti.c @@ -480,10 +480,6 @@ vti6_xmit(struct sk_buff *skb, struct net_device *dev, struct flowi *fl) goto tx_err_dst_release; } - skb_scrub_packet(skb, !net_eq(t->net, dev_net(dev))); - skb_dst_set(skb, dst); - skb->dev = skb_dst(skb)->dev; - mtu = dst_mtu(dst); if (!skb->ignore_df && skb->len > mtu) { skb_dst_update_pmtu(skb, mtu); @@ -498,9 +494,14 @@ vti6_xmit(struct sk_buff *skb, struct net_device *dev, struct flowi *fl) htonl(mtu)); } - return -EMSGSIZE; + err = -EMSGSIZE; + goto tx_err_dst_release; } + skb_scrub_packet(skb, !net_eq(t->net, dev_net(dev))); + skb_dst_set(skb, dst); + skb->dev = skb_dst(skb)->dev; + err = dst_output(t->net, skb->sk, skb); if (net_xmit_eval(err) == 0) { struct pcpu_sw_netstats *tstats = this_cpu_ptr(dev->tstats); -- 2.14.1
Re: [PATCH] xfrm: fix ptr_ret.cocci warnings
On Thu, Jul 26, 2018 at 03:09:52PM +0800, kbuild test robot wrote: > From: kbuild test robot > > net/xfrm/xfrm_interface.c:692:1-3: WARNING: PTR_ERR_OR_ZERO can be used > > > Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR > > Generated by: scripts/coccinelle/api/ptr_ret.cocci > > Fixes: 44e2b838c24d ("xfrm: Return detailed errors from xfrmi_newlink") > CC: Benedict Wong > Signed-off-by: kbuild test robot Applied, thanks!
Re: [PATCH ipsec-next] xfrm: Return detailed errors from xfrmi_newlink
On Wed, Jul 25, 2018 at 01:45:29PM -0700, Benedict Wong wrote: > Currently all failure modes of xfrm interface creation return EEXIST. > This change improves the granularity of errnos provided by also > returning ENODEV or EINVAL if failures happen in looking up the > underlying interface, or a required parameter is not provided. > > This change has been tested against the Android Kernel Networking Tests, > with additional xfrmi_newlink tests here: > > https://android-review.googlesource.com/c/kernel/tests/+/715755 > > Signed-off-by: Benedict Wong Applied to ipsec-next, thanks Benedict!
Re: UDP GRO without Merging
On Wed, Jul 25, 2018 at 11:53:45AM +0200, Gauvain Roussel-Tarbouriech wrote: > Subject: > > Hello Netdev, > > I am working on WireGuard as part of Google Summer of Code and Jason and > I are working on adding GRO to WireGuard, on the udp_tunnel side of > things. The goal is to inform udp_tunnel’s gro_receive that certain > packets are part of the same flow. Then sometime later, we’d like to > have a list of those same-flow packets appear in gro_complete. As far as > I can tell, the API seems well suited for merging packets in > gro_receive, such that gro_complete only then gets one single merged > packet. However, we need to receive the entire list of same-flow packets > in gro_complete, unmerged. Is this possible with the present APIs? We recently posted a patchset that does this GRO packet chaining. Not sure if this is exactly what you need, but you can have a look here: https://www.spinics.net/lists/netdev/msg508093.html This is implemented as a part of a forward fastpath, but could be also moved generic code if needed.
Re: [PATCH ipsec-next] xfrm: Allow xfrmi if_id to be updated by UPDSA
On Thu, Jul 19, 2018 at 07:07:47PM -0700, Nathan Harold wrote: > Allow attaching an SA to an xfrm interface id after > the creation of the SA, so that tasks such as keying > which must be done as the SA is created, can remain > separate from the decision on how to route traffic > from an SA. This permits SA creation to be decomposed > in to three separate steps: > 1) allocation of a SPI > 2) algorithm and key negotiation > 3) insertion into the data path > > Signed-off-by: Nathan Harold Applied to ipsec-next, thanks Nathan!
Re: [PATCH ipsec-next] xfrm: Remove xfrmi interface ID from flowi
On Thu, Jul 19, 2018 at 10:50:44AM -0700, Benedict Wong wrote: > In order to remove performance impact of having the extra u32 in every > single flowi, this change removes the flowi_xfrm struct, prefering to > take the if_id as a method parameter where needed. > > In the inbound direction, if_id is only needed during the > __xfrm_check_policy() function, and the if_id can be determined at that > point based on the skb. As such, xfrmi_decode_session() is only called > with the skb in __xfrm_check_policy(). > > In the outbound direction, the only place where if_id is needed is the > xfrm_lookup() call in xfrmi_xmit2(). With this change, the if_id is > directly passed into the xfrm_lookup_with_ifid() call. All existing > callers can still call xfrm_lookup(), which uses a default if_id of 0. > > This change does not change any behavior of XFRMIs except for improving > overall system performance via flowi size reduction. > > This change has been tested against the Android Kernel Networking Tests: > > https://android.googlesource.com/kernel/tests/+/master/net/test > > Signed-off-by: Benedict Wong Nice improvement. Applied to ipsec-next, thanks a lot!
Re: [PATCH ipsec-next 1/1] xfrm: don't check offload_handle for nonzero
On Tue, Jun 26, 2018 at 02:19:10PM -0700, Shannon Nelson wrote: > The offload_handle should be an opaque data cookie for the driver > to use, much like the data cookie for a timer or alarm callback. > Thus, the XFRM stack should not be checking for non-zero, because > the driver might use that to store an array reference, which could > be zero, or some other zero but meaningful value. > > We can remove the checks for non-zero because there are plenty > other attributes also being checked to see if there is an offload > in place for the SA in question. > > Signed-off-by: Shannon Nelson Applied to ipsec-next, thanks Shannon!
Re: [RFC ipsec-next] xfrm: Remove xfrmi interface ID from flowi
On Tue, Jul 17, 2018 at 02:40:04PM -0700, Benedict Wong wrote: > @@ -2301,6 +2322,13 @@ int __xfrm_policy_check(struct sock *sk, int dir, > struct sk_buff *skb, > int reverse; > struct flowi fl; > int xerr_idx = -1; > + const struct xfrm_if_cb *ifcb; > + struct xfrm_if *xi; > + u32 if_id = 0; > + > + rcu_read_lock(); > + ifcb = xfrm_if_get_cb(); > + rcu_read_unlock(); > > reverse = dir & ~XFRM_POLICY_MASK; > dir &= XFRM_POLICY_MASK; > @@ -2325,10 +2353,16 @@ int __xfrm_policy_check(struct sock *sk, int dir, > struct sk_buff *skb, > } > } > > + if (ifcb) { > + xi = ifcb->decode_session(skb); > + if (xi) > + if_id = xi->p.if_id; > + } The usage of the ifcb pointer should go into the rcu_read_lock section above. Looks good otherwise, nice improvement. Please respin and do an official submission of this patch, I'd like to merge it before I send the pull request for the ipsec-next tree.
Re: [PATCH ipsec-next 1/1] xfrm: don't check offload_handle for nonzero
On Tue, Jun 26, 2018 at 02:19:10PM -0700, Shannon Nelson wrote: > The offload_handle should be an opaque data cookie for the driver > to use, much like the data cookie for a timer or alarm callback. > Thus, the XFRM stack should not be checking for non-zero, because > the driver might use that to store an array reference, which could > be zero, or some other zero but meaningful value. > > We can remove the checks for non-zero because there are plenty > other attributes also being checked to see if there is an offload > in place for the SA in question. > > Signed-off-by: Shannon Nelson I don't have access to my hw offload testlab currently, so I've queued this one until I'm back from my vacation.
Re: [PATCH v2 ipsec-next] xfrm: policy: remove pcpu policy cache
On Mon, Jun 25, 2018 at 05:26:02PM +0200, Florian Westphal wrote: > Kristian Evensen says: > In a project I am involved in, we are running ipsec (Strongswan) on > different mt7621-based routers. Each router is configured as an > initiator and has around ~30 tunnels to different responders (running > on misc. devices). Before the flow cache was removed (kernel 4.9), we > got a combined throughput of around 70Mbit/s for all tunnels on one > router. However, we recently switched to kernel 4.14 (4.14.48), and > the total throughput is somewhere around 57Mbit/s (best-case). I.e., a > drop of around 20%. Reverting the flow cache removal restores, as > expected, performance levels to that of kernel 4.9. > > When pcpu xdst exists, it has to be validated first before it can be > used. > > A negative hit thus increases cost vs. no-cache. > > As number of tunnels increases, hit rate decreases so this pcpu caching > isn't a viable strategy. > > Furthermore, the xdst cache also needs to run with BH off, so when > removing this the bh disable/enable pairs can be removed too. > > Kristian tested a 4.14.y backport of this change and reported > increased performance: > > In our tests, the throughput reduction has been reduced from around -20% > to -5%. We also see that the overall throughput is independent of the > number of tunnels, while before the throughput was reduced as the number > of tunnels increased. > > Reported-by: Kristian Evensen > Signed-off-by: Florian Westphal Applied to ipsec-next, thanks a lot!
Re: [PATCH ipsec] xfrm: free skb if nlsk pointer is NULL
On Mon, Jun 25, 2018 at 02:00:07PM +0200, Florian Westphal wrote: > nlmsg_multicast() always frees the skb, so in case we cannot call > it we must do that ourselves. > > Fixes: 21ee543edc0dea ("xfrm: fix race between netns cleanup and state expire > notification") > Signed-off-by: Florian Westphal Applied, thanks Florian!
Re: [PATCH ipsec-next] xfrm: policy: remove pcpu policy cache
On Mon, Jun 25, 2018 at 01:57:53PM +0200, Florian Westphal wrote: > Kristian Evensen says: > In a project I am involved in, we are running ipsec (Strongswan) on > different mt7621-based routers. Each router is configured as an > initiator and has around ~30 tunnels to different responders (running > on misc. devices). Before the flow cache was removed (kernel 4.9), we > got a combined throughput of around 70Mbit/s for all tunnels on one > router. However, we recently switched to kernel 4.14 (4.14.48), and > the total throughput is somewhere around 57Mbit/s (best-case). I.e., a > drop of around 20%. Reverting the flow cache removal restores, as > expected, performance levels to that of kernel 4.9. > > When pcpu xdst exists, it has to be validated first before it can be > used. > > A negative hit thus increases cost vs. no-cache. > > As number of tunnels increases, hit rate decreases so this pcpu caching > isn't a viable strategy. > > Furthermore, the xdst cache also needs to run with BH off, so when > removing this the bh disable/enable pairs can be removed too. > > Kristian tested a 4.14.y backport of this change and reported > increased performance: > > In our tests, the throughput reduction has been reduced from around -20% > to -5%. We also see that the overall throughput is independent of the > number of tunnels, while before the throughput was reduced as the number > of tunnels increased. > > Reported-by: Kristian Evensen > Signed-off-by: Florian Westphal Can you please rebase this to ipsec-next current? It does not apply cleanly after the merge of the xfrm interface patches. Thanks!
Re: [PATCH RFC v2 ipsec-next 0/3] Virtual xfrm interfaces
On Tue, Jun 12, 2018 at 09:56:07AM +0200, Steffen Klassert wrote: > This patchset introduces new virtual xfrm interfaces. > The design of virtual xfrm interfaces interfaces was > discussed at the Linux IPsec workshop 2018. This patchset > implements these interfaces as the IPsec userspace and > kernel developers agreed. The purpose of these interfaces > is to overcome the design limitations that the existing > VTI devices have. > > The main limitations that we see with the current VTI are the > following: > > - VTI interfaces are L3 tunnels with configurable endpoints. > For xfrm, the tunnel endpoint are already determined by the SA. > So the VTI tunnel endpoints must be either the same as on the > SA or wildcards. In case VTI tunnel endpoints are same as on > the SA, we get a one to one correlation between the SA and > the tunnel. So each SA needs its own tunnel interface. > > On the other hand, we can have only one VTI tunnel with > wildcard src/dst tunnel endpoints in the system because the > lookup is based on the tunnel endpoints. The existing tunnel > lookup won't work with multiple tunnels with wildcard > tunnel endpoints. Some usecases require more than on > VTI tunnel of this type, for example if somebody has multiple > namespaces and every namespace requires such a VTI. > > - VTI needs separate interfaces for IPv4 and IPv6 tunnels. > So when routing to a VTI, we have to know to which address > family this traffic class is going to be encapsulated. > This is a lmitation because it makes routing more complex > and it is not always possible to know what happens behind the > VTI, e.g. when the VTI is move to some namespace. > > - VTI works just with tunnel mode SAs. We need generic interfaces > that ensures transfomation, regardless of the xfrm mode and > the encapsulated address family. > > - VTI is configured with a combination GRE keys and xfrm marks. > With this we have to deal with some extra cases in the generic > tunnel lookup because the GRE keys on the VTI are actually > not GRE keys, the GRE keys were just reused for something else. > All extensions to the VTI interfaces would require to add > even more complexity to the generic tunnel lookup. > > To overcome this, we started with the following design goal: > > - It should be possible to tunnel IPv4 and IPv6 through the same > interface. > > - No limitation on xfrm mode (tunnel, transport and beet). > > - Should be a generic virtual interface that ensures IPsec > transformation, no need to know what happens behind the > interface. > > - Interfaces should be configured with a new key that must match a > new policy/SA lookup key. > > - The lookup logic should stay in the xfrm codebase, no need to > change or extend generic routing and tunnel lookups. > > - Should be possible to use IPsec hardware offloads of the underlying > interface. > > Changes from v1: > > - Document the limitations of VTI interfaces and the design of > the new xfrm interfaces more explicit in the commit messages. > > - No code changes. I have not got any further comments, so applied to ipsec-next.
Re: [PATCH RFC ipsec-next] xfrm: Extend the output_mark to support input direction and masking.
On Fri, Jun 15, 2018 at 08:55:14AM +0200, Steffen Klassert wrote: > We already support setting an output mark at the xfrm_state, > unfortunately this does not support the input direction and > masking the marks that will be applied to the skb. This change > adds support applying a masked value in both directions. > > The existing XFRMA_OUTPUT_MARK number is reused for this purpose > and as it is now bi-directional, it is renamed to XFRMA_SET_MARK. > > An additional XFRMA_SET_MARK_MASK attribute is added for setting the > mask. If the attribute mask not provided, it is set to 0x, > keeping the XFRMA_OUTPUT_MARK existing 'full mask' semantics. > > Co-developed-by: Tobias Brunner > Co-developed-by: Eyal Birger > Co-developed-by: Lorenzo Colitti > Signed-off-by: Steffen Klassert > Signed-off-by: Tobias Brunner > Signed-off-by: Eyal Birger > Signed-off-by: Lorenzo Colitti This is now applied to ipsec-next.
Re: [PATCH net] xfrm_user: prevent leaking 2 bytes of kernel memory
On Mon, Jun 18, 2018 at 09:35:07PM -0700, Eric Dumazet wrote: > struct xfrm_userpolicy_type has two holes, so we should not > use C99 style initializer. > > KMSAN report: > > BUG: KMSAN: kernel-infoleak in copyout lib/iov_iter.c:140 [inline] > BUG: KMSAN: kernel-infoleak in _copy_to_iter+0x1b14/0x2800 lib/iov_iter.c:571 > CPU: 1 PID: 4520 Comm: syz-executor841 Not tainted 4.17.0+ #5 > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > Google 01/01/2011 > Call Trace: > __dump_stack lib/dump_stack.c:77 [inline] > dump_stack+0x185/0x1d0 lib/dump_stack.c:113 > kmsan_report+0x188/0x2a0 mm/kmsan/kmsan.c:1117 > kmsan_internal_check_memory+0x138/0x1f0 mm/kmsan/kmsan.c:1211 > kmsan_copy_to_user+0x7a/0x160 mm/kmsan/kmsan.c:1253 > copyout lib/iov_iter.c:140 [inline] > _copy_to_iter+0x1b14/0x2800 lib/iov_iter.c:571 > copy_to_iter include/linux/uio.h:106 [inline] > skb_copy_datagram_iter+0x422/0xfa0 net/core/datagram.c:431 > skb_copy_datagram_msg include/linux/skbuff.h:3268 [inline] > netlink_recvmsg+0x6f1/0x1900 net/netlink/af_netlink.c:1959 > sock_recvmsg_nosec net/socket.c:802 [inline] > sock_recvmsg+0x1d6/0x230 net/socket.c:809 > ___sys_recvmsg+0x3fe/0x810 net/socket.c:2279 > __sys_recvmmsg+0x58e/0xe30 net/socket.c:2391 > do_sys_recvmmsg+0x2a6/0x3e0 net/socket.c:2472 > __do_sys_recvmmsg net/socket.c:2485 [inline] > __se_sys_recvmmsg net/socket.c:2481 [inline] > __x64_sys_recvmmsg+0x15d/0x1c0 net/socket.c:2481 > do_syscall_64+0x15b/0x230 arch/x86/entry/common.c:287 > entry_SYSCALL_64_after_hwframe+0x44/0xa9 > RIP: 0033:0x446ce9 > RSP: 002b:7fc307918db8 EFLAGS: 0293 ORIG_RAX: 012b > RAX: ffda RBX: 006dbc24 RCX: 00446ce9 > RDX: 000a RSI: 20005040 RDI: 0003 > RBP: 006dbc20 R08: 20004e40 R09: > R10: 4000 R11: 0293 R12: > R13: 7ffc8d2df32f R14: 7fc3079199c0 R15: 0001 > > Uninit was stored to memory at: > kmsan_save_stack_with_flags mm/kmsan/kmsan.c:279 [inline] > kmsan_save_stack mm/kmsan/kmsan.c:294 [inline] > kmsan_internal_chain_origin+0x12b/0x210 mm/kmsan/kmsan.c:685 > kmsan_memcpy_origins+0x11d/0x170 mm/kmsan/kmsan.c:527 > __msan_memcpy+0x109/0x160 mm/kmsan/kmsan_instr.c:413 > __nla_put lib/nlattr.c:569 [inline] > nla_put+0x276/0x340 lib/nlattr.c:627 > copy_to_user_policy_type net/xfrm/xfrm_user.c:1678 [inline] > dump_one_policy+0xbe1/0x1090 net/xfrm/xfrm_user.c:1708 > xfrm_policy_walk+0x45a/0xd00 net/xfrm/xfrm_policy.c:1013 > xfrm_dump_policy+0x1c0/0x2a0 net/xfrm/xfrm_user.c:1749 > netlink_dump+0x9b5/0x1550 net/netlink/af_netlink.c:2226 > __netlink_dump_start+0x1131/0x1270 net/netlink/af_netlink.c:2323 > netlink_dump_start include/linux/netlink.h:214 [inline] > xfrm_user_rcv_msg+0x8a3/0x9b0 net/xfrm/xfrm_user.c:2577 > netlink_rcv_skb+0x37e/0x600 net/netlink/af_netlink.c:2448 > xfrm_netlink_rcv+0xb2/0xf0 net/xfrm/xfrm_user.c:2598 > netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline] > netlink_unicast+0x1680/0x1750 net/netlink/af_netlink.c:1336 > netlink_sendmsg+0x104f/0x1350 net/netlink/af_netlink.c:1901 > sock_sendmsg_nosec net/socket.c:629 [inline] > sock_sendmsg net/socket.c:639 [inline] > ___sys_sendmsg+0xec8/0x1320 net/socket.c:2117 > __sys_sendmsg net/socket.c:2155 [inline] > __do_sys_sendmsg net/socket.c:2164 [inline] > __se_sys_sendmsg net/socket.c:2162 [inline] > __x64_sys_sendmsg+0x331/0x460 net/socket.c:2162 > do_syscall_64+0x15b/0x230 arch/x86/entry/common.c:287 > entry_SYSCALL_64_after_hwframe+0x44/0xa9 > Local variable description: upt.i@dump_one_policy > Variable was created at: > dump_one_policy+0x78/0x1090 net/xfrm/xfrm_user.c:1689 > xfrm_policy_walk+0x45a/0xd00 net/xfrm/xfrm_policy.c:1013 > > Byte 130 of 137 is uninitialized > Memory access starts at 88019550407f > > Fixes: c0144beaeca42 ("[XFRM] netlink: Use nla_put()/NLA_PUT() variantes") > Signed-off-by: Eric Dumazet > Reported-by: syzbot > Cc: Steffen Klassert > Cc: Herbert Xu > --- > net/xfrm/xfrm_user.c | 8 +--- It is a fix for xfrm, so I applied it to the ipsec tree. Thanks a lot Eric!
Re: [PATCH][v2] xfrm: replace NR_CPU with nr_cpu_ids
On Tue, Jun 19, 2018 at 09:53:49AM +0200, Florian Westphal wrote: > Li RongQing wrote: > > The default NR_CPUS can be very large, but actual possible nr_cpu_ids > > usually is very small. For some x86 distribution, the NR_CPUS is 8192 > > and nr_cpu_ids is 4, so replace NR_CPU to save some memory > > Steffen, > > I will soon submit a patch to remove the percpu cache; removal > improved performance for at least one user (and by quite a sizeable > amount). > > Would you consider such removal for ipsec or ipsec-next? I think this removel would better fit to ipsec-next. > If -next, consider applying this patch for ipsec.git. > > Otherwise consider not applying this, as the code will go away soon. This patch more an optimization than a fix, so I considered to apply it to ipsec-next. If you plan to remove it, I'll wait for that.
[PATCH RFC ipsec-next] xfrm: Extend the output_mark to support input direction and masking.
We already support setting an output mark at the xfrm_state, unfortunately this does not support the input direction and masking the marks that will be applied to the skb. This change adds support applying a masked value in both directions. The existing XFRMA_OUTPUT_MARK number is reused for this purpose and as it is now bi-directional, it is renamed to XFRMA_SET_MARK. An additional XFRMA_SET_MARK_MASK attribute is added for setting the mask. If the attribute mask not provided, it is set to 0x, keeping the XFRMA_OUTPUT_MARK existing 'full mask' semantics. Co-developed-by: Tobias Brunner Co-developed-by: Eyal Birger Co-developed-by: Lorenzo Colitti Signed-off-by: Steffen Klassert Signed-off-by: Tobias Brunner Signed-off-by: Eyal Birger Signed-off-by: Lorenzo Colitti --- include/net/xfrm.h| 9 - include/uapi/linux/xfrm.h | 4 +++- net/xfrm/xfrm_input.c | 2 ++ net/xfrm/xfrm_output.c| 3 +-- net/xfrm/xfrm_policy.c| 5 +++-- net/xfrm/xfrm_user.c | 48 +-- 6 files changed, 55 insertions(+), 16 deletions(-) diff --git a/include/net/xfrm.h b/include/net/xfrm.h index 45e75c36b738..8727b2484855 100644 --- a/include/net/xfrm.h +++ b/include/net/xfrm.h @@ -166,7 +166,7 @@ struct xfrm_state { int header_len; int trailer_len; u32 extra_flags; - u32 output_mark; + struct xfrm_marksmark; } props; struct xfrm_lifetime_cfg lft; @@ -2012,6 +2012,13 @@ static inline int xfrm_mark_put(struct sk_buff *skb, const struct xfrm_mark *m) return ret; } +static inline __u32 xfrm_smark_get(__u32 mark, struct xfrm_state *x) +{ + struct xfrm_mark *m = >props.smark; + + return (m->v & m->m) | (mark & ~m->m); +} + static inline int xfrm_tunnel_check(struct sk_buff *skb, struct xfrm_state *x, unsigned int family) { diff --git a/include/uapi/linux/xfrm.h b/include/uapi/linux/xfrm.h index e3af2859188b..5a6ed7ce5a29 100644 --- a/include/uapi/linux/xfrm.h +++ b/include/uapi/linux/xfrm.h @@ -305,9 +305,11 @@ enum xfrm_attr_type_t { XFRMA_ADDRESS_FILTER, /* struct xfrm_address_filter */ XFRMA_PAD, XFRMA_OFFLOAD_DEV, /* struct xfrm_state_offload */ - XFRMA_OUTPUT_MARK, /* __u32 */ + XFRMA_SET_MARK, /* __u32 */ + XFRMA_SET_MARK_MASK,/* __u32 */ __XFRMA_MAX +#define XFRMA_OUTPUT_MARK XFRMA_SET_MARK /* Compatibility */ #define XFRMA_MAX (__XFRMA_MAX - 1) }; diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c index 352abca2605f..074810436242 100644 --- a/net/xfrm/xfrm_input.c +++ b/net/xfrm/xfrm_input.c @@ -339,6 +339,8 @@ int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 spi, int encap_type) goto drop; } + skb->mark = xfrm_smark_get(skb->mark, x); + skb->sp->xvec[skb->sp->len++] = x; lock: diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c index 89b178a78dc7..45ba07ab3e4f 100644 --- a/net/xfrm/xfrm_output.c +++ b/net/xfrm/xfrm_output.c @@ -66,8 +66,7 @@ static int xfrm_output_one(struct sk_buff *skb, int err) goto error_nolock; } - if (x->props.output_mark) - skb->mark = x->props.output_mark; + skb->mark = xfrm_smark_get(skb->mark, x); err = x->outer_mode->output(x, skb); if (err) { diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c index 40b54cc64243..f95f5f75748c 100644 --- a/net/xfrm/xfrm_policy.c +++ b/net/xfrm/xfrm_policy.c @@ -1607,10 +1607,11 @@ static struct dst_entry *xfrm_bundle_create(struct xfrm_policy *policy, dst_copy_metrics(dst1, dst); if (xfrm[i]->props.mode != XFRM_MODE_TRANSPORT) { + __u32 mark = xfrm_smark_get(fl->flowi_mark, xfrm[i]); + family = xfrm[i]->props.family; dst = xfrm_dst_lookup(xfrm[i], tos, fl->flowi_oif, - , , family, - xfrm[i]->props.output_mark); + , , family, mark); err = PTR_ERR(dst); if (IS_ERR(dst)) goto put_states; diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c index 080035f056d9..9602cc9e05ab 100644 --- a/net/xfrm/xfrm_user.c +++ b/net/xfrm/xfrm_user.c @@ -527,6 +527,19 @@ static void xfrm_update_ae_params(struct xfrm_state *x, struct nlattr **attrs, x->replay_maxdiff = nla_get_u32(rt); } +static void xfrm_smark_init(struct nlattr **attrs, struct xfrm_mark *m) +{
Re: [PATCH iproute2-next] ip-xfrm: Add support for OUTPUT_MARK
On Tue, Jun 12, 2018 at 11:33:41AM +0900, Lorenzo Colitti wrote: > On Tue, Jun 12, 2018 at 11:12 AM Subash Abhinov Kasiviswanathan > wrote: > > > > This patch adds support for OUTPUT_MARK in xfrm state to exercise the > > functionality added by kernel commit 077fbac405bf > > ("net: xfrm: support setting an output mark."). > > > > Sample output with output-mark - > > > > src 192.168.1.1 dst 192.168.1.2 > > proto esp spi 0x4321 reqid 0 mode tunnel > > replay-window 0 flag af-unspec > > auth-trunc xcbc(aes) 0x3ed0af408cf5dcbf5d5d9a5fa806b211 96 > > enc cbc(aes) 0x3ed0af408cf5dcbf5d5d9a5fa806b233 > > anti-replay context: seq 0x0, oseq 0x0, bitmap 0x > > output-mark 0x2 > > Have you considered putting this earlier up in the output, where the > mark is printed as well? > > > + if (tb[XFRMA_OUTPUT_MARK]) { > > + __u32 output_mark = rta_getattr_u32(tb[XFRMA_OUTPUT_MARK]); > > + > > + fprintf(fp, "\toutput-mark 0x%x %s", output_mark, _SL_); > > + } > > } > > If you wanted to implement the suggestion above, I think you could do > that by moving this code into xfrm_xfrma_print. > > Other than that, LGTM. > > Acked-by: Lorenzo Colitti > > Steffen - what's the status of the set_mark patches? Are you holding > them until the tree opens again? Yes, I hold them back until after v4.18-rc1 is released and the -next trees open again. But I plan to do a RFC version this week, so that everybody knows about the plan we have.