Re: [PATCH ipsec-next] xfrm: policy: fix policy hash rebuild

2018-11-30 Thread Steffen Klassert
On Tue, Nov 27, 2018 at 01:28:54PM +0100, Florian Westphal wrote:
> Dan Carpenter reports following static checker warning:
>  net/xfrm/xfrm_policy.c:1316 xfrm_hash_rebuild()
>  warn: 'dir' is out of bounds '3' vs '2'
> 
>  |  1280  /* reset the bydst and inexact table in all directions */
>  |  1281  xfrm_hash_reset_inexact_table(net);
>  |  1282
>  |  1283  for (dir = 0; dir < XFRM_POLICY_MAX; dir++) {
>  |  ^
>  |dir == XFRM_POLICY_MAX at the end of this loop.
>  |  1304  /* re-insert all policies by order of creation */
>  |  1305  list_for_each_entry_reverse(policy, >xfrm.policy_all, 
> walk.all) {
>  [..]
>  |  1314
> xfrm_policy_id2dir(policy->index));
>  |  1315  if (!chain) {
>  |  1316  void *p = 
> xfrm_policy_inexact_insert(policy, dir, 0);
> 
> Fix this by updating 'dir' based on current policy.  Otherwise, the
> inexact policies won't be found anymore during lookup, as they get
> hashed to a bogus bin.
> 
> Reported-by: Dan Carpenter 
> Fixes: cc1bb845adc9 ("xfrm: policy: return NULL when inexact search needed")
> Signed-off-by: Florian Westphal 

Applied, thanks Florian!


Re: [PATCH][xfrm-next] xfrm6: remove BUG_ON from xfrm6_dst_ifdown

2018-11-22 Thread Steffen Klassert
On Mon, Nov 12, 2018 at 05:28:22PM +0800, Li RongQing wrote:
> if loopback_idev is NULL pointer, and the following access of
> loopback_idev will trigger panic, which is same as BUG_ON
> 
> Signed-off-by: Li RongQing 

Patch applied, thanks!


Re: [BUG] xfrm: unable to handle kernel NULL pointer dereference

2018-11-21 Thread Steffen Klassert
On Fri, Nov 16, 2018 at 08:12:46PM +0100, Steffen Klassert wrote:
> On Fri, Nov 16, 2018 at 08:48:00PM +0200, Lennert Buytenhek wrote:
> > On Sat, Nov 10, 2018 at 08:34:34PM +0100, Jean-Philippe Menil wrote:
> > 
> > > we're seeing unexpected crashes from kernel 4.15 to 4.18.17, using
> > > IPsec VTI interfaces, on several vpn hosts, since upgrade from 4.4.
> > 
> > I looked into this with Jean-Philippe, and it appears to be crashing
> > on a NULL pointer dereference in the inlined xfrm_policy_check() call
> > in vti_rcv_cb(), and specifically on the skb_dst(skb) dereference in
> > __xfrm_policy_check2():
> > 
> > return  (!net->xfrm.policy_count[dir] && !skb->sp) ||
> > (skb_dst(skb)->flags & DST_NOPOLICY) || <=
> > __xfrm_policy_check(sk, ndir, skb, family);
> > 
> > Commit 9e1437937807 ("xfrm: Fix NULL pointer dereference when
> > skb_dst_force clears the dst_entry.") fixes a very similar problem on
> > the output and forward paths, but our issue seems to be triggering on
> > the input path.
> 
> Yes, this is the same problem. skb_dst_force() does not
> really force a refcount anymore, it might clear the dst
> pointer instead (maybe this function should be renamed).

I plan to apply this patch to the ipsec tree:

[PATCH RFC] xfrm: Fix NULL pointer dereference in xfrm_input when skb_dst_force 
clears the dst_entry.

Since commit 222d7dbd258d ("net: prevent dst uses after free")
skb_dst_force() might clear the dst_entry attached to the skb.
The xfrm code don't expect this to happen, so we crash with
a NULL pointer dereference in this case.

Fix it by checking skb_dst(skb) for NULL after skb_dst_force()
and drop the packet in cast the dst_entry was cleared. We also
move the skb_dst_force() to a codepath that is not used when
the transformation was offloaded, because in this case we
don't have a  dst_entry attached to the skb.

The output and forwarding path was already fixed by
commit 9e1437937807 ("xfrm: Fix NULL pointer dereference when
skb_dst_force clears the dst_entry.")

Fixes: 222d7dbd258d ("net: prevent dst uses after free")
Reported-by: Jean-Philippe Menil 
Signed-off-by: Steffen Klassert 
---
 net/xfrm/xfrm_input.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c
index 684c0bc01e2c..d5635908587f 100644
--- a/net/xfrm/xfrm_input.c
+++ b/net/xfrm/xfrm_input.c
@@ -346,6 +346,12 @@ int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 
spi, int encap_type)
 
skb->sp->xvec[skb->sp->len++] = x;
 
+   skb_dst_force(skb);
+   if (!skb_dst(skb)) {
+   XFRM_INC_STATS(net, LINUX_MIB_XFRMINERROR);
+   goto drop;
+   }
+
 lock:
spin_lock(>lock);
 
@@ -385,7 +391,6 @@ int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 
spi, int encap_type)
XFRM_SKB_CB(skb)->seq.input.low = seq;
XFRM_SKB_CB(skb)->seq.input.hi = seq_hi;
 
-   skb_dst_force(skb);
dev_hold(skb->dev);
 
if (crypto_done)
-- 
2.17.1



Re: [PATCH ipsec-next] xfrm: policy: fix netlink/pf_key policy lookups

2018-11-19 Thread Steffen Klassert
On Thu, Nov 15, 2018 at 02:51:57AM +0100, Florian Westphal wrote:
> Colin Ian King says:
>  Static analysis with CoverityScan found a potential issue [..]
>  It seems that pointer pol is set to NULL and then a check to see if it
>  is non-null is used to set pol to tmp; howeverm this check is always
>  going to be false because pol is always NULL.
> 
> Fix this and update test script to catch this.  Updated script only:
> ./xfrm_policy.sh ; echo $?
> RTNETLINK answers: No such file or directory
> FAIL: ip -net ns3 xfrm policy get src 10.0.1.0/24 dst 10.0.2.0/24 dir out
> RTNETLINK answers: No such file or directory
> [..]
> PASS: policy before exception matches
> PASS: ping to .254 bypassed ipsec tunnel
> PASS: direct policy matches
> PASS: policy matches
> 1
> 
> Fixes: 6be3b0db6db ("xfrm: policy: add inexact policy search tree 
> infrastructure")
> Reported-by: Colin Ian King 
> Signed-off-by: Florian Westphal 

Applied, thanks everyone!


Re: [BUG] xfrm: unable to handle kernel NULL pointer dereference

2018-11-16 Thread Steffen Klassert
On Fri, Nov 16, 2018 at 08:48:00PM +0200, Lennert Buytenhek wrote:
> On Sat, Nov 10, 2018 at 08:34:34PM +0100, Jean-Philippe Menil wrote:
> 
> > we're seeing unexpected crashes from kernel 4.15 to 4.18.17, using
> > IPsec VTI interfaces, on several vpn hosts, since upgrade from 4.4.
> 
> I looked into this with Jean-Philippe, and it appears to be crashing
> on a NULL pointer dereference in the inlined xfrm_policy_check() call
> in vti_rcv_cb(), and specifically on the skb_dst(skb) dereference in
> __xfrm_policy_check2():
> 
>   return  (!net->xfrm.policy_count[dir] && !skb->sp) ||
>   (skb_dst(skb)->flags & DST_NOPOLICY) || <=
>   __xfrm_policy_check(sk, ndir, skb, family);
> 
> Commit 9e1437937807 ("xfrm: Fix NULL pointer dereference when
> skb_dst_force clears the dst_entry.") fixes a very similar problem on
> the output and forward paths, but our issue seems to be triggering on
> the input path.

Yes, this is the same problem. skb_dst_force() does not
really force a refcount anymore, it might clear the dst
pointer instead (maybe this function should be renamed).

Want to submit a fix? If not I'll go to fix that.

Thanks!


Re: [PATCH ipsec-next 00/11] xfrm: policy: add inexact policy search tree

2018-11-13 Thread Steffen Klassert
On Thu, Nov 08, 2018 at 07:00:14PM -0800, David Miller wrote:
> From: Florian Westphal 
> Date: Wed,  7 Nov 2018 23:00:30 +0100
> 
> > This series attempts to improve xfrm policy lookup performance when
> > a lot of (several hundred or even thousands) inexact policies exist
> > on a system.
> > 
> > On insert, a policy is either placed in hash table (all direct (/32 for
> > ipv4, /128 policies, or all policies matching a user-configured threshold).
> > All other policies get inserted into inexact list as per priority.
> > 
> > Lookup then scans inexact list for first matching entry.
> > 
> > This series instead makes it so that inexact policy is added to exactly
> > one of four different search list classes.
> > 
> > 1. "Any:Any" list, containing policies where both saddr and daddr are
> >wildcards or have very coarse prefixes, e.g. 10.0.0.0/8 and the like.
> > 2. "saddr:any" list, containing policies with a fixed saddr/prefixlen,
> >but without destination restrictions.
> >These lists are stored in rbtree nodes; each node contains those
> >policies matching saddr/prefixlen.
> > 3. "Any:daddr" list. Similar to 2), except for policies where only the
> >destinations are specified.
> > 4. "saddr:daddr" lists, containing policies that match the given
> >source/destination network.
> > 
> >The root of the saddr/daddr tree is stored in the nodes of the
> >'daddr' tree.
> ...
> > Comments or questions welcome.
> 
> Acked-by: David S. Miller 

This is now applied to ipsec-next, thanks a lot
for your work Florian!


Re: [PATCH] xfrm: Fix bucket count reported to userspace

2018-11-07 Thread Steffen Klassert
On Mon, Nov 05, 2018 at 05:00:53PM +0900, Benjamin Poirier wrote:
> sadhcnt is reported by `ip -s xfrm state count` as "buckets count", not the
> hash mask.
> 
> Fixes: 28d8909bc790 ("[XFRM]: Export SAD info.")
> Signed-off-by: Benjamin Poirier 

Patch applied, thanks!


Re: [PATCH] xfrm: Fix error return code in xfrm_output_one()

2018-10-30 Thread Steffen Klassert
On Sat, Oct 27, 2018 at 06:12:06AM +, Wei Yongjun wrote:
> xfrm_output_one() does not return a error code when there is
> no dst_entry attached to the skb, it is still possible crash
> with a NULL pointer dereference in xfrm_output_resume(). Fix
> it by return error code -EHOSTUNREACH.
> 
> Fixes: 9e1437937807 ("xfrm: Fix NULL pointer dereference when skb_dst_force 
> clears the dst_entry.")
> Signed-off-by: Wei Yongjun 

Applied, thanks a lot!


Re: [RFC PATCH v2 00/10] udp: implement GRO support

2018-10-24 Thread Steffen Klassert
On Tue, Oct 23, 2018 at 02:22:12PM +0200, Paolo Abeni wrote:
> On Tue, 2018-10-23 at 14:10 +0200, Steffen Klassert wrote:
> 
> > Some quick benchmark numbers with UDP packet forwarding
> > (1460 byte packets) through two gateways:
> > 
> > net-next: 16.4 Gbps
> > 
> > net-next + UDP GRO: 20.3 Gbps
> 
> uhmmm... what do you think about this speed-up ?

skb_segment() burns a lot of cycles. If I do the same test with
TCP and disable HW TSO, throughput drops also down to similar
values.

In case of software segmentation, the skb chain appropach
is likely faster because packets are not mangled. So no need
to allocate skbs, no new checksum calculations, less memcpy etc.

If we have an early route lookup in GRO, we could have a good
guess on the offload capabilities of the outgoing device.
So in case that software segmentation is likely, use the
skb chaining method. If HW segmentation is likely, merge
IP packets.

The chaining method might be also faster on non UDP GRO enabled
sockets.

I'll try to implement the skb chaining method on top of this
to see what we get from that.



Re: [RFC PATCH v2 00/10] udp: implement GRO support

2018-10-23 Thread Steffen Klassert
On Fri, Oct 19, 2018 at 04:25:10PM +0200, Paolo Abeni wrote:
> This series implements GRO support for UDP sockets, as the RX counterpart
> of commit bec1f6f69736 ("udp: generate gso with UDP_SEGMENT").
> The core functionality is implemented by the second patch, introducing a new
> sockopt to enable UDP_GRO, while patch 3 implements support for passing the
> segment size to the user space via a new cmsg.
> UDP GRO performs a socket lookup for each ingress packets and aggregate 
> datagram
> directed to UDP GRO enabled sockets with constant l4 tuple.
> 
> UDP GRO packets can land on non GRO-enabled sockets, e.g. due to iptables NAT
> rules, and that could potentially confuse existing applications.
> 
> The solution adopted here is to de-segment the GRO packet before enqueuing
> as needed. Since we must cope with packet reinsertion after de-segmentation,
> the relevant code is factored-out in ipv4 and ipv6 specific helpers and 
> exposed
> to UDP usage.
> 
> While the current code can probably be improved, this safeguard ,implemented 
> in
> the patches 4-7, allows future enachements to enable UDP GSO offload on more
> virtual devices eventually even on forwarded packets.

I was curious what I get with this when doing packet forwarding.
So I added forwarding support with the patch below (on top of
your patchset). While the packet processing could work the way
I do it in this patch, I'm not sure about the way I enable
UDP GRO on forwarding. Maybe there it a better way than just
do UDP GRO if forwarding is enabled on the receiving device.

Some quick benchmark numbers with UDP packet forwarding
(1460 byte packets) through two gateways:

net-next: 16.4 Gbps

net-next + UDP GRO: 20.3 Gbps

Subject: [PATCH RFC] udp: Allow gro for the forwarding path.

This patch adds a early route lookup to inet_gro_receive()
in case forwarding is enabled on the receiving device.
To be forwarded packets are allowed to enter the UDP
GRO handlers then.

Signed-off-by: Steffen Klassert 
---
 include/linux/netdevice.h |  2 +-
 include/net/dst.h |  1 +
 net/core/dev.c|  6 --
 net/ipv4/af_inet.c| 12 
 net/ipv4/route.c  |  1 +
 net/ipv4/udp_offload.c| 11 +--
 6 files changed, 28 insertions(+), 5 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index dc1d9ed33b31..2eb3fa960ad4 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2304,7 +2304,7 @@ struct napi_gro_cb {
/* Number of gro_receive callbacks this packet already went through */
u8 recursion_counter:4;
 
-   /* 1 bit hole */
+   u8  is_fwd:1;
 
/* used to support CHECKSUM_COMPLETE for tunneling protocols */
__wsum  csum;
diff --git a/include/net/dst.h b/include/net/dst.h
index 6cf0870414c7..01bdf825a6f9 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -54,6 +54,7 @@ struct dst_entry {
 #define DST_XFRM_TUNNEL0x0020
 #define DST_XFRM_QUEUE 0x0040
 #define DST_METADATA   0x0080
+#define DST_FWD0x0100
 
/* A non-zero value of dst->obsolete forces by-hand validation
 * of the route entry.  Positive values are set by the generic
diff --git a/net/core/dev.c b/net/core/dev.c
index 022ad73d6253..c6aaffb99456 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5387,8 +5387,10 @@ static struct list_head *gro_list_prepare(struct 
napi_struct *napi,
 
diffs = (unsigned long)p->dev ^ (unsigned long)skb->dev;
diffs |= p->vlan_tci ^ skb->vlan_tci;
-   diffs |= skb_metadata_dst_cmp(p, skb);
-   diffs |= skb_metadata_differs(p, skb);
+   if (!NAPI_GRO_CB(p)->is_fwd) {
+   diffs |= skb_metadata_dst_cmp(p, skb);
+   diffs |= skb_metadata_differs(p, skb);
+   }
if (maclen == ETH_HLEN)
diffs |= compare_ether_header(skb_mac_header(p),
  skb_mac_header(skb));
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 1fbe2f815474..6e4e8588c0b1 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1479,8 +1479,20 @@ struct sk_buff *inet_gro_receive(struct list_head *head, 
struct sk_buff *skb)
NAPI_GRO_CB(p)->flush_id = flush_id;
else
NAPI_GRO_CB(p)->flush_id |= flush_id;
+
+   NAPI_GRO_CB(skb)->is_fwd = NAPI_GRO_CB(p)->is_fwd;
+
+   goto found;
}
 
+   if (IN_DEV_FORWARD(__in_dev_get_rcu(skb->dev))) {
+   if (!ip_route_input_noref(skb, iph->daddr, iph->saddr, iph->tos,
+ skb->dev)) {
+   if ((skb_dst(skb)->flags & DST_FWD))
+   

Re: [RFC PATCH v2 06/10] udp: cope with UDP GRO packet misdirection

2018-10-23 Thread Steffen Klassert
On Mon, Oct 22, 2018 at 02:51:56PM +0200, Paolo Abeni wrote:
> > > +
> > > +static int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
> > > +{
> > > + struct sk_buff *next, *segs;
> > > + int ret;
> > > +
> > > + if (likely(!udp_unexpected_gso(sk, skb)))
> > > + return udp_queue_rcv_one_skb(sk, skb);
> > > +
> > > + BUILD_BUG_ON(sizeof(struct udp_skb_cb) > SKB_SGO_CB_OFFSET);
> > > + __skb_push(skb, -skb_mac_offset(skb));
> > > + segs = udp_rcv_segment(sk, skb);
> > > + for (skb = segs; skb; skb = next) {
> > > + next = skb->next;
> > > + __skb_pull(skb, skb_transport_offset(skb));
> > > + ret = udp_queue_rcv_one_skb(sk, skb);
> > 
> > udp_queue_rcv_one_skb() starts with doing a xfrm4_policy_check().
> > Maybe we can do this on the GSO packet instead of the segments.
> > So far this code is just for handling a corner case, but this might
> > change.
> 
> I thought about keeping the policy check here, but then I preferred
> what looked the safest option. Perhaps we can improve with a follow-up?

Fair enough. Let's keep it in mind and do it later.


Re: [RFC PATCH v2 06/10] udp: cope with UDP GRO packet misdirection

2018-10-22 Thread Steffen Klassert
On Fri, Oct 19, 2018 at 04:25:16PM +0200, Paolo Abeni wrote:
> +
> +static inline struct sk_buff *udp_rcv_segment(struct sock *sk,
> +   struct sk_buff *skb)
> +{
> + struct sk_buff *segs;
> +
> + /* the GSO CB lays after the UDP one, no need to save and restore any
> +  * CB fragment, just initialize it
> +  */
> + segs = __skb_gso_segment(skb, NETIF_F_SG, false);
> + if (unlikely(IS_ERR(segs)))
> + kfree_skb(skb);
> + else if (segs)
> + consume_skb(skb);
> + return segs;
> +}
> +
> +

One empty line too much.

>  #define udp_portaddr_for_each_entry(__sk, list) \
>   hlist_for_each_entry(__sk, list, __sk_common.skc_portaddr_node)
>  
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 2331ac9de954..0d55145ce9f5 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -1909,7 +1909,7 @@ EXPORT_SYMBOL(udp_encap_enable);
>   * Note that in the success and error cases, the skb is assumed to
>   * have either been requeued or freed.
>   */
> -static int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
> +static int udp_queue_rcv_one_skb(struct sock *sk, struct sk_buff *skb)
>  {
>   struct udp_sock *up = udp_sk(sk);
>   int is_udplite = IS_UDPLITE(sk);
> @@ -2012,6 +2012,29 @@ static int udp_queue_rcv_skb(struct sock *sk, struct 
> sk_buff *skb)
>   return -1;
>  }
>  
> +void ip_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, int 
> proto);
> +
> +static int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
> +{
> + struct sk_buff *next, *segs;
> + int ret;
> +
> + if (likely(!udp_unexpected_gso(sk, skb)))
> + return udp_queue_rcv_one_skb(sk, skb);
> +
> + BUILD_BUG_ON(sizeof(struct udp_skb_cb) > SKB_SGO_CB_OFFSET);
> + __skb_push(skb, -skb_mac_offset(skb));
> + segs = udp_rcv_segment(sk, skb);
> + for (skb = segs; skb; skb = next) {
> + next = skb->next;
> + __skb_pull(skb, skb_transport_offset(skb));
> + ret = udp_queue_rcv_one_skb(sk, skb);

udp_queue_rcv_one_skb() starts with doing a xfrm4_policy_check().
Maybe we can do this on the GSO packet instead of the segments.
So far this code is just for handling a corner case, but this might
change.



Re: [RFC PATCH v2 02/10] udp: implement GRO for plain UDP sockets.

2018-10-22 Thread Steffen Klassert
On Fri, Oct 19, 2018 at 04:25:12PM +0200, Paolo Abeni wrote:
>  
> +#define UDO_GRO_CNT_MAX 64

Maybe better UDP_GRO_CNT_MAX?

Btw. do we really need this explicit limit?
We should not get more than 64 packets during
one napi poll cycle.

> +static struct sk_buff *udp_gro_receive_segment(struct list_head *head,
> +struct sk_buff *skb)
> +{
> + struct udphdr *uh = udp_hdr(skb);
> + struct sk_buff *pp = NULL;
> + struct udphdr *uh2;
> + struct sk_buff *p;
> +
> + /* requires non zero csum, for simmetry with GSO */
> + if (!uh->check) {
> + NAPI_GRO_CB(skb)->flush = 1;
> + return NULL;
> + }

Why is the requirement of checksums different than in 
udp_gro_receive? It's not that I care much about UDP
packets without a checksum, but you would not need
to implement your own loop if the requirement could
be the same as in udp_gro_receive.

> +
> + /* pull encapsulating udp header */
> + skb_gro_pull(skb, sizeof(struct udphdr));
> + skb_gro_postpull_rcsum(skb, uh, sizeof(struct udphdr));
> +
> + list_for_each_entry(p, head, list) {
> + if (!NAPI_GRO_CB(p)->same_flow)
> + continue;
> +
> + uh2 = udp_hdr(p);
> +
> + /* Match ports only, as csum is always non zero */
> + if ((*(u32 *)>source != *(u32 *)>source)) {
> + NAPI_GRO_CB(p)->same_flow = 0;
> + continue;
> + }
> +
> + /* Terminate the flow on len mismatch or if it grow "too much".
> +  * Under small packet flood GRO count could elsewhere grow a lot
> +  * leading to execessive truesize values
> +  */
> + if (!skb_gro_receive(p, skb) &&
> + NAPI_GRO_CB(p)->count > UDO_GRO_CNT_MAX)

This allows to merge UDO_GRO_CNT_MAX + 1 packets.

> + pp = p;
> + else if (uh->len != uh2->len)
> + pp = p;
> +
> + return pp;
> + }
> +
> + /* mismatch, but we never need to flush */
> + return NULL;
> +}



[PATCH 1/3] xfrm: remove unnecessary check in xfrmi_get_stats64

2018-10-18 Thread Steffen Klassert
From: Li RongQing 

if tstats of a device is not allocated, this device is not
registered correctly and can not be used.

Signed-off-by: Li RongQing 
Signed-off-by: Steffen Klassert 
---
 net/xfrm/xfrm_interface.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/net/xfrm/xfrm_interface.c b/net/xfrm/xfrm_interface.c
index dc5b20bf29cf..abafd49cc65d 100644
--- a/net/xfrm/xfrm_interface.c
+++ b/net/xfrm/xfrm_interface.c
@@ -561,9 +561,6 @@ static void xfrmi_get_stats64(struct net_device *dev,
 {
int cpu;
 
-   if (!dev->tstats)
-   return;
-
for_each_possible_cpu(cpu) {
struct pcpu_sw_netstats *stats;
struct pcpu_sw_netstats tmp;
-- 
2.17.1



pull request (net-next): ipsec-next 2018-10-18

2018-10-18 Thread Steffen Klassert
1) Remove an unnecessary dev->tstats check in xfrmi_get_stats64.
   From Li RongQing.

2) We currently do a sizeof(element) instead of a sizeof(array)
   check when initializing the ovec array of the secpath.
   Currently this array can have only one element, so code is
   OK but error-prone. Change this to do a sizeof(array)
   check so that we can add more elements in future.
   From Li RongQing.

3) Improve xfrm IPv6 address hashing by using the complete IPv6
   addresses for a hash. From Michal Kubecek.

Please pull or let me know if there are problems.

Thanks!

The following changes since commit abf1a08ff3237a27188ff8cc2904f2cea893af55:

  net: vhost: remove bad code line (2018-10-07 21:31:32 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next.git master

for you to fetch changes up to 8d4b6bce2559755cf2db6513a267fccdfbf7c3ab:

  xfrm: use complete IPv6 addresses for hash (2018-10-15 10:09:18 +0200)


Li RongQing (2):
  xfrm: remove unnecessary check in xfrmi_get_stats64
  xfrm: use correct size to initialise sp->ovec

Michal Kubecek (1):
  xfrm: use complete IPv6 addresses for hash

 net/xfrm/xfrm_hash.h  | 5 ++---
 net/xfrm/xfrm_input.c | 2 +-
 net/xfrm/xfrm_interface.c | 3 ---
 3 files changed, 3 insertions(+), 7 deletions(-)


[PATCH 2/3] xfrm: use correct size to initialise sp->ovec

2018-10-18 Thread Steffen Klassert
From: Li RongQing 

This place should want to initialize array, not a element,
so it should be sizeof(array) instead of sizeof(element)

but now this array only has one element, so no error in
this condition that XFRM_MAX_OFFLOAD_DEPTH is 1

Signed-off-by: Li RongQing 
Signed-off-by: Steffen Klassert 
---
 net/xfrm/xfrm_input.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c
index be3520e429c9..684c0bc01e2c 100644
--- a/net/xfrm/xfrm_input.c
+++ b/net/xfrm/xfrm_input.c
@@ -131,7 +131,7 @@ struct sec_path *secpath_dup(struct sec_path *src)
sp->len = 0;
sp->olen = 0;
 
-   memset(sp->ovec, 0, sizeof(sp->ovec[XFRM_MAX_OFFLOAD_DEPTH]));
+   memset(sp->ovec, 0, sizeof(sp->ovec));
 
if (src) {
int i;
-- 
2.17.1



[PATCH 3/3] xfrm: use complete IPv6 addresses for hash

2018-10-18 Thread Steffen Klassert
From: Michal Kubecek 

In some environments it is common that many hosts share the same lower half
of their IPv6 addresses (in particular ::1). As __xfrm6_addr_hash() and
__xfrm6_daddr_saddr_hash() calculate the hash only from the lower halves,
as much as 1/3 of the hosts ends up in one hashtable chain which harms the
performance.

Use complete IPv6 addresses when calculating the hashes. Rather than just
adding two more words to the xor, use jhash2() for consistency with
__xfrm6_pref_hash() and __xfrm6_dpref_spref_hash().

Signed-off-by: Michal Kubecek 
Signed-off-by: Steffen Klassert 
---
 net/xfrm/xfrm_hash.h | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/net/xfrm/xfrm_hash.h b/net/xfrm/xfrm_hash.h
index 61be810389d8..ce66323102f9 100644
--- a/net/xfrm/xfrm_hash.h
+++ b/net/xfrm/xfrm_hash.h
@@ -13,7 +13,7 @@ static inline unsigned int __xfrm4_addr_hash(const 
xfrm_address_t *addr)
 
 static inline unsigned int __xfrm6_addr_hash(const xfrm_address_t *addr)
 {
-   return ntohl(addr->a6[2] ^ addr->a6[3]);
+   return jhash2((__force u32 *)addr->a6, 4, 0);
 }
 
 static inline unsigned int __xfrm4_daddr_saddr_hash(const xfrm_address_t 
*daddr,
@@ -26,8 +26,7 @@ static inline unsigned int __xfrm4_daddr_saddr_hash(const 
xfrm_address_t *daddr,
 static inline unsigned int __xfrm6_daddr_saddr_hash(const xfrm_address_t 
*daddr,
const xfrm_address_t *saddr)
 {
-   return ntohl(daddr->a6[2] ^ daddr->a6[3] ^
-saddr->a6[2] ^ saddr->a6[3]);
+   return __xfrm6_addr_hash(daddr) ^ __xfrm6_addr_hash(saddr);
 }
 
 static inline u32 __bits2mask32(__u8 bits)
-- 
2.17.1



[PATCH 3/4] net/xfrm: fix out-of-bounds packet access

2018-10-18 Thread Steffen Klassert
From: Alexei Starovoitov 

BUG: KASAN: slab-out-of-bounds in _decode_session6+0x1331/0x14e0
net/ipv6/xfrm6_policy.c:161
Read of size 1 at addr 8801d882eec7 by task syz-executor1/6667
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
  print_address_description+0x6c/0x20b mm/kasan/report.c:256
  kasan_report_error mm/kasan/report.c:354 [inline]
  kasan_report.cold.7+0x242/0x30d mm/kasan/report.c:412
  __asan_report_load1_noabort+0x14/0x20 mm/kasan/report.c:430
  _decode_session6+0x1331/0x14e0 net/ipv6/xfrm6_policy.c:161
  __xfrm_decode_session+0x71/0x140 net/xfrm/xfrm_policy.c:2299
  xfrm_decode_session include/net/xfrm.h:1232 [inline]
  vti6_tnl_xmit+0x3c3/0x1bc1 net/ipv6/ip6_vti.c:542
  __netdev_start_xmit include/linux/netdevice.h:4313 [inline]
  netdev_start_xmit include/linux/netdevice.h:4322 [inline]
  xmit_one net/core/dev.c:3217 [inline]
  dev_hard_start_xmit+0x272/0xc10 net/core/dev.c:3233
  __dev_queue_xmit+0x2ab2/0x3870 net/core/dev.c:3803
  dev_queue_xmit+0x17/0x20 net/core/dev.c:3836

Reported-by: syzbot+acffccec848dc13fe...@syzkaller.appspotmail.com
Reported-by: Eric Dumazet 
Signed-off-by: Alexei Starovoitov 
Signed-off-by: Steffen Klassert 
---
 net/ipv6/xfrm6_policy.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c
index ef3defaf43b9..d35bcf92969c 100644
--- a/net/ipv6/xfrm6_policy.c
+++ b/net/ipv6/xfrm6_policy.c
@@ -146,8 +146,8 @@ _decode_session6(struct sk_buff *skb, struct flowi *fl, int 
reverse)
fl6->daddr = reverse ? hdr->saddr : hdr->daddr;
fl6->saddr = reverse ? hdr->daddr : hdr->saddr;
 
-   while (nh + offset + 1 < skb->data ||
-  pskb_may_pull(skb, nh + offset + 1 - skb->data)) {
+   while (nh + offset + sizeof(*exthdr) < skb->data ||
+  pskb_may_pull(skb, nh + offset + sizeof(*exthdr) - skb->data)) {
nh = skb_network_header(skb);
exthdr = (struct ipv6_opt_hdr *)(nh + offset);
 
-- 
2.17.1



[PATCH 1/4] xfrm: fix gro_cells leak when remove virtual xfrm interfaces

2018-10-18 Thread Steffen Klassert
From: Li RongQing 

The device gro_cells has been initialized, it should be freed,
otherwise it will be leaked

Fixes: f203b76d78092faf2 ("xfrm: Add virtual xfrm interfaces")
Signed-off-by: Zhang Yu 
Signed-off-by: Li RongQing 
Signed-off-by: Steffen Klassert 
---
 net/xfrm/xfrm_interface.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/xfrm/xfrm_interface.c b/net/xfrm/xfrm_interface.c
index 31acc6f33d98..6f05e831a73e 100644
--- a/net/xfrm/xfrm_interface.c
+++ b/net/xfrm/xfrm_interface.c
@@ -116,6 +116,9 @@ static void xfrmi_unlink(struct xfrmi_net *xfrmn, struct 
xfrm_if *xi)
 
 static void xfrmi_dev_free(struct net_device *dev)
 {
+   struct xfrm_if *xi = netdev_priv(dev);
+
+   gro_cells_destroy(>gro_cells);
free_percpu(dev->tstats);
 }
 
-- 
2.17.1



[PATCH 4/4] xfrm: policy: use hlist rcu variants on insert

2018-10-18 Thread Steffen Klassert
From: Florian Westphal 

bydst table/list lookups use rcu, so insertions must use rcu versions.

Fixes: a7c44247f704e ("xfrm: policy: make xfrm_policy_lookup_bytype lockless")
Signed-off-by: Florian Westphal 
Signed-off-by: Steffen Klassert 
---
 net/xfrm/xfrm_policy.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index f094d4b3520d..119a427d9b2b 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -632,9 +632,9 @@ static void xfrm_hash_rebuild(struct work_struct *work)
break;
}
if (newpos)
-   hlist_add_behind(>bydst, newpos);
+   hlist_add_behind_rcu(>bydst, newpos);
else
-   hlist_add_head(>bydst, chain);
+   hlist_add_head_rcu(>bydst, chain);
}
 
spin_unlock_bh(>xfrm.xfrm_policy_lock);
@@ -774,9 +774,9 @@ int xfrm_policy_insert(int dir, struct xfrm_policy *policy, 
int excl)
break;
}
if (newpos)
-   hlist_add_behind(>bydst, newpos);
+   hlist_add_behind_rcu(>bydst, newpos);
else
-   hlist_add_head(>bydst, chain);
+   hlist_add_head_rcu(>bydst, chain);
__xfrm_policy_link(policy, dir);
 
/* After previous checking, family can either be AF_INET or AF_INET6 */
-- 
2.17.1



[PATCH 2/4] MAINTAINERS: Remove net/core/flow.c

2018-10-18 Thread Steffen Klassert
net/core/flow.c does not exist anymore, so remove it
from the IPSEC NETWORKING section of the MAINTAINERS
file.

Signed-off-by: Steffen Klassert 
---
 MAINTAINERS | 1 -
 1 file changed, 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index dcb0191c4f54..4ff21dac9b45 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -10130,7 +10130,6 @@ L:  netdev@vger.kernel.org
 T: git git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec.git
 T: git 
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next.git
 S: Maintained
-F: net/core/flow.c
 F: net/xfrm/
 F: net/key/
 F: net/ipv4/xfrm*
-- 
2.17.1



pull request (net): ipsec 2018-10-18

2018-10-18 Thread Steffen Klassert
1) Free the xfrm interface gro_cells when deleting the
   interface, otherwise we leak it. From Li RongQing.

2) net/core/flow.c does not exist anymore, so remove it
   from the MAINTAINERS file.

3) Fix a slab-out-of-bounds in _decode_session6.
   From Alexei Starovoitov.

4) Fix RCU protection when policies inserted into
   thei bydst lists. From Florian Westphal.

Please pull or let me know if there are problems.

Thanks!

The following changes since commit 92d7c74b6f72a8a7d04970d5dcfb99673daaf91d:

  Merge branch 'for-upstream' of 
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth (2018-10-01 
22:40:39 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec.git master

for you to fetch changes up to 9d200fd178f11dd50eb1fd8ccd0650c9284e:

  xfrm: policy: use hlist rcu variants on insert (2018-10-11 13:24:46 +0200)


Alexei Starovoitov (1):
  net/xfrm: fix out-of-bounds packet access

Florian Westphal (1):
  xfrm: policy: use hlist rcu variants on insert

Li RongQing (1):
  xfrm: fix gro_cells leak when remove virtual xfrm interfaces

Steffen Klassert (1):
  MAINTAINERS: Remove net/core/flow.c

 MAINTAINERS   | 1 -
 net/ipv6/xfrm6_policy.c   | 4 ++--
 net/xfrm/xfrm_interface.c | 3 +++
 net/xfrm/xfrm_policy.c| 8 
 4 files changed, 9 insertions(+), 7 deletions(-)


Re: [PATCH ipsec] xfrm: policy: use hlist rcu variants on insert

2018-10-12 Thread Steffen Klassert
On Wed, Oct 10, 2018 at 06:02:21PM +0200, Florian Westphal wrote:
> bydst table/list lookups use rcu, so insertions must use rcu versions.
> 
> Fixes: a7c44247f704e ("xfrm: policy: make xfrm_policy_lookup_bytype lockless")
> Signed-off-by: Florian Westphal 

Applied, thanks Florian!


Re: [PATCH net] net/xfrm: fix out-of-bounds packet access

2018-10-12 Thread Steffen Klassert
On Tue, Oct 09, 2018 at 09:59:36AM -0700, Alexei Starovoitov wrote:
> BUG: KASAN: slab-out-of-bounds in _decode_session6+0x1331/0x14e0
> net/ipv6/xfrm6_policy.c:161
> Read of size 1 at addr 8801d882eec7 by task syz-executor1/6667
> Call Trace:
>   __dump_stack lib/dump_stack.c:77 [inline]
>   dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
>   print_address_description+0x6c/0x20b mm/kasan/report.c:256
>   kasan_report_error mm/kasan/report.c:354 [inline]
>   kasan_report.cold.7+0x242/0x30d mm/kasan/report.c:412
>   __asan_report_load1_noabort+0x14/0x20 mm/kasan/report.c:430
>   _decode_session6+0x1331/0x14e0 net/ipv6/xfrm6_policy.c:161
>   __xfrm_decode_session+0x71/0x140 net/xfrm/xfrm_policy.c:2299
>   xfrm_decode_session include/net/xfrm.h:1232 [inline]
>   vti6_tnl_xmit+0x3c3/0x1bc1 net/ipv6/ip6_vti.c:542
>   __netdev_start_xmit include/linux/netdevice.h:4313 [inline]
>   netdev_start_xmit include/linux/netdevice.h:4322 [inline]
>   xmit_one net/core/dev.c:3217 [inline]
>   dev_hard_start_xmit+0x272/0xc10 net/core/dev.c:3233
>   __dev_queue_xmit+0x2ab2/0x3870 net/core/dev.c:3803
>   dev_queue_xmit+0x17/0x20 net/core/dev.c:3836
> 
> Reported-by: syzbot+acffccec848dc13fe...@syzkaller.appspotmail.com
> Reported-by: Eric Dumazet 
> Signed-off-by: Alexei Starovoitov 

This looks good, applied to the ipsec tree.
Thanks Alexei!


Re: [PATCH][ipsec-next] xfrm: use correct size to initialise sp->ovec

2018-10-09 Thread Steffen Klassert
On Sun, Oct 07, 2018 at 10:22:42AM +0800, Li RongQing wrote:
> This place should want to initialize array, not a element,
> so it should be sizeof(array) instead of sizeof(element)
> 
> but now this array only has one element, so no error in
> this condition that XFRM_MAX_OFFLOAD_DEPTH is 1
> 
> Signed-off-by: Li RongQing 

Also applied, thanks a lot Li!


Re: [PATCH][ipsec-next] xfrm: remove unnecessary check in xfrmi_get_stats64

2018-10-09 Thread Steffen Klassert
On Sun, Oct 07, 2018 at 09:56:15AM +0800, Li RongQing wrote:
> if tstats of a device is not allocated, this device is not
> registered correctly and can not be used.
> 
> Signed-off-by: Li RongQing 

Applied to ipsec-next, thanks!


Re: [PATCH v2 ipsec] Clear secpath on loopback_xmit

2018-10-08 Thread Steffen Klassert
On Mon, Oct 08, 2018 at 11:13:36AM -0700, Benedict Wong wrote:
> This patch clears the skb->sp when transmitted over loopback. This
> ensures that the loopback-ed packet does not have any secpath
> information from the outbound transforms.
> 
> At present, this causes XFRM tunnel mode packets to be dropped with
> XFRMINNOPOLS, due to the outbound state being in the secpath, without
> a matching inbound policy. Clearing the secpath ensures that all states
> added to the secpath are exclusively from the inbound processing.
> 
> Tests: xfrm tunnel mode tests added for loopback:
> https://android-review.googlesource.com/c/kernel/tests/+/777328
> Fixes: 8fe7ee2ba983 ("[IPsec]: Strengthen policy checks")

This patch is from 2003, it predates our git history.
The fixes tag is mainly to ensure proper backporting.
The commit you mentioned can't be found in the mainline
git repo, so it does not help much.

If this bug was really introduced in pre git times,
use the very first git commit in our history.

It should look like:

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")

But I doubt that this commit introduced the bug. We started to
use outbound secpaths when we added offloading support, so I
think one of the offloading patches introduced it.

Do you have CONFIG_INET_ESP_OFFLOAD or CONFIG_INET6_ESP_OFFLOAD
enabled? Do you see the bug without these two config options
enabled?

> Signed-off-by: Benedict Wong 
> ---
>  drivers/net/loopback.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
> index 30612497643c..a6bf54df94bd 100644
> --- a/drivers/net/loopback.c
> +++ b/drivers/net/loopback.c
> @@ -50,6 +50,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include   /* For the statistics structure. */
>  #include /* For ARPHRD_ETHER */
>  #include 
> @@ -82,6 +83,9 @@ static netdev_tx_t loopback_xmit(struct sk_buff *skb,
>*/
>   skb_dst_force(skb);
>  
> + // Clear secpath to ensure xfrm policy check not tainted by outbound 
> SAs.

Please align your comment to the format of the other comments in this file.
It should look like this:

/* Clear secpath to ensure xfrm policy check not tainted by
 * outbound SAs.
 */

> + secpath_reset(skb);
> +
>   skb->protocol = eth_type_trans(skb, dev);
>  
>   /* it's OK to use per_cpu_ptr() because BHs are off */
> -- 
> 2.19.0.605.g01d371f741-goog


Re: [PATCH net-next RFC 0/8] udp and configurable gro

2018-10-08 Thread Steffen Klassert
On Fri, Oct 05, 2018 at 10:41:47AM -0400, Willem de Bruijn wrote:
> On Fri, Oct 5, 2018 at 9:53 AM Paolo Abeni  wrote:
> >
> > Hi all,
> >
> > On Fri, 2018-09-14 at 13:59 -0400, Willem de Bruijn wrote:
> > > This is a *very rough* draft. Mainly for discussion while we also
> > > look at another partially overlapping approach [1].
> >
> > I'm wondering how we go on from this ? I'm fine with either approaches.
> 
> Let me send the udp gro static_key patch. Then we don't need
> the enable udp on demand logic (patch 2/4).
> 
> Your implementation of GRO is more fleshed out (patch 3/4) than
> my quick hack. My only request would be to use a separate
> UDP_GRO socket option instead of adding this to the existing
> UDP_SEGMENT.
> 
> Sounds good?
> 
> > Also, I'm interested in [try to] enable GRO/GSO batching in the
> > forwarding path, as you outlined initially in the GSO series
> > submission. That should cover Steffen use-case, too, right?
> 
> Great. Indeed. Though there is some unresolved discussion on
> one large gso skb vs frag list. There has been various concerns
> around the use of frag lists for GSO in the past, and it does not
> match h/w offload. So I think the answer would be the first unless
> the second proves considerably faster (in which case it could also
> be added later as optimization).

I think it depends a bit on the usecase and hardware etc.
if the first or the second approach is faster. So it would
be good if we can choose which one to use depending on that.

For local socket receiving, building big GSO packets
is likely faster than the chaining method.

But on forwarding the chaining method might be faster
because we don't have the overhead of creating GSO
packets and of segmenting them back to their native
form (at least as long as we don't have NICs that
support hardware UDP GSO). Same applies to packets
that undergo IPsec transformation.

Another thing where the chaining method could be intersting
is when we receive already big LRO or HW GRO packets from the
NIC. Packets of the same flow could still travel together
through the stack with the chaining method. I've never tried
this, though. For now it is just an idea.

I have the code for the chaining mehthod here, I'd just need
some method to hook it in. Maybe it could be done with
some sort of an inet_update_offload() as Paolo already
propsed in his pachset, or we could make it configurable
per device...



Re: [PATCH ipsec-next] Clear secpath on loopback_xmit

2018-10-08 Thread Steffen Klassert
On Fri, Oct 05, 2018 at 11:23:28AM -0700, Benedict Wong wrote:
> This patch clears the skb->sp when transmitted over loopback. This
> ensures that the loopback-ed packet does not have any secpath
> information from the outbound transforms.
> 
> At present, this causes XFRM tunnel mode packets to be dropped with
> XFRMINNOPOLS, due to the outbound state being in the secpath, without
> a matching inbound policy. Clearing the secpath ensures that all states
> added to the secpath are exclusively from the inbound processing.

This looks like a fix, so the target should be the ipsec tree,
not ipsec-next.

> 
> Tests: xfrm tunnel mode tests added for loopback:
> https://android-review.googlesource.com/c/kernel/tests/+/777328
> Signed-off-by: Benedict Wong 

Please add a 'Fixes:' tag.

> ---
>  drivers/net/loopback.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
> index 30612497643c..a6bf54df94bd 100644
> --- a/drivers/net/loopback.c
> +++ b/drivers/net/loopback.c
> @@ -50,6 +50,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include   /* For the statistics structure. */
>  #include /* For ARPHRD_ETHER */
>  #include 
> @@ -82,6 +83,9 @@ static netdev_tx_t loopback_xmit(struct sk_buff *skb,
>*/
>   skb_dst_force(skb);
>  
> + // Clear secpath to ensure xfrm policy check not tainted by outbound 
> SAs.

Comments should use /* ... */, like it is done below.

> + secpath_reset(skb);
> +
>   skb->protocol = eth_type_trans(skb, dev);
>  
>   /* it's OK to use per_cpu_ptr() because BHs are off */
> -- 
> 2.19.0.605.g01d371f741-goog


Re: [PATCH] xfrm: fix gro_cells leak when remove virtual xfrm interfaces

2018-10-03 Thread Steffen Klassert
On Sun, Sep 30, 2018 at 03:06:06PM +0800, Li RongQing wrote:
> The device gro_cells has been initialized, it should be freed,
> otherwise it will be leaked
> 
> Fixes: f203b76d78092faf2 ("xfrm: Add virtual xfrm interfaces")
> Signed-off-by: Zhang Yu 
> Signed-off-by: Li RongQing 

Applied, thanks a lot!


[PATCH 3/3] xfrm: allow driver to quietly refuse offload

2018-10-01 Thread Steffen Klassert
From: Shannon Nelson 

If the "offload" attribute is used to create an IPsec SA
and the .xdo_dev_state_add() fails, the SA creation fails.
However, if the "offload" attribute is used on a device that
doesn't offer it, the attribute is quietly ignored and the SA
is created without an offload.

Along the same line of that second case, it would be good to
have a way for the device to refuse to offload an SA without
failing the whole SA creation.  This patch adds that feature
by allowing the driver to return -EOPNOTSUPP as a signal that
the SA may be fine, it just can't be offloaded.

This allows the user a little more flexibility in requesting
offloads and not needing to know every detail at all times about
each specific NIC when trying to create SAs.

Signed-off-by: Shannon Nelson 
Signed-off-by: Steffen Klassert 
---
 Documentation/networking/xfrm_device.txt | 4 
 net/xfrm/xfrm_device.c   | 6 +-
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/Documentation/networking/xfrm_device.txt 
b/Documentation/networking/xfrm_device.txt
index 50c34ca65efe..267f55b5f54a 100644
--- a/Documentation/networking/xfrm_device.txt
+++ b/Documentation/networking/xfrm_device.txt
@@ -68,6 +68,10 @@ and an indication of whether it is for Rx or Tx.  The driver 
should
- verify the algorithm is supported for offloads
- store the SA information (key, salt, target-ip, protocol, etc)
- enable the HW offload of the SA
+   - return status value:
+   0 success
+   -EOPNETSUPP   offload not supported, try SW IPsec
+   other fail the request
 
 The driver can also set an offload_handle in the SA, an opaque void pointer
 that can be used to convey context into the fast-path offload requests.
diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c
index 5611b7521020..3a1d9d6aefb4 100644
--- a/net/xfrm/xfrm_device.c
+++ b/net/xfrm/xfrm_device.c
@@ -192,9 +192,13 @@ int xfrm_dev_state_add(struct net *net, struct xfrm_state 
*x,
 
err = dev->xfrmdev_ops->xdo_dev_state_add(x);
if (err) {
+   xso->num_exthdrs = 0;
+   xso->flags = 0;
xso->dev = NULL;
dev_put(dev);
-   return err;
+
+   if (err != -EOPNOTSUPP)
+   return err;
}
 
return 0;
-- 
2.17.1



pull request (net-next): ipsec-next 2018-10-01

2018-10-01 Thread Steffen Klassert
1) Make xfrmi_get_link_net() static to silence a sparse warning.
   From Wei Yongjun.

2) Remove a unused esph pointer definition in esp_input().
   From Haishuang Yan.

3) Allow the NIC driver to quietly refuse xfrm offload
   in case it does not support it, the SA is created
   without offload in this case.
   From Shannon Nelson.

Please pull or let me know if there are problems.

Thanks!

The following changes since commit 817e60a7a2bb1f22052f18562990d675cb3a3762:

  Merge branch 'nfp-add-NFP5000-support' (2018-08-28 16:01:48 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next.git master

for you to fetch changes up to 4a132095dd64fefabdc5dad1cd9e9809b126e582:

  xfrm: allow driver to quietly refuse offload (2018-08-29 08:04:44 +0200)


Haishuang Yan (1):
  esp: remove redundant define esph

Shannon Nelson (1):
  xfrm: allow driver to quietly refuse offload

Wei Yongjun (1):
  xfrm: Make function xfrmi_get_link_net() static

 Documentation/networking/xfrm_device.txt | 4 
 net/ipv4/esp4.c  | 7 +++
 net/ipv6/esp6.c  | 7 +++
 net/xfrm/xfrm_device.c   | 6 +-
 net/xfrm/xfrm_interface.c| 2 +-
 5 files changed, 16 insertions(+), 10 deletions(-)


[PATCH 2/3] esp: remove redundant define esph

2018-10-01 Thread Steffen Klassert
From: Haishuang Yan 

The pointer 'esph' is defined but is never used hence it is redundant
and canbe removed.

Signed-off-by: Haishuang Yan 
Signed-off-by: Steffen Klassert 
---
 net/ipv4/esp4.c | 7 +++
 net/ipv6/esp6.c | 7 +++
 2 files changed, 6 insertions(+), 8 deletions(-)

diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c
index 97689012b357..211caaf27f6e 100644
--- a/net/ipv4/esp4.c
+++ b/net/ipv4/esp4.c
@@ -683,12 +683,11 @@ static void esp_input_done_esn(struct 
crypto_async_request *base, int err)
  */
 static int esp_input(struct xfrm_state *x, struct sk_buff *skb)
 {
-   struct ip_esp_hdr *esph;
struct crypto_aead *aead = x->data;
struct aead_request *req;
struct sk_buff *trailer;
int ivlen = crypto_aead_ivsize(aead);
-   int elen = skb->len - sizeof(*esph) - ivlen;
+   int elen = skb->len - sizeof(struct ip_esp_hdr) - ivlen;
int nfrags;
int assoclen;
int seqhilen;
@@ -698,13 +697,13 @@ static int esp_input(struct xfrm_state *x, struct sk_buff 
*skb)
struct scatterlist *sg;
int err = -EINVAL;
 
-   if (!pskb_may_pull(skb, sizeof(*esph) + ivlen))
+   if (!pskb_may_pull(skb, sizeof(struct ip_esp_hdr) + ivlen))
goto out;
 
if (elen <= 0)
goto out;
 
-   assoclen = sizeof(*esph);
+   assoclen = sizeof(struct ip_esp_hdr);
seqhilen = 0;
 
if (x->props.flags & XFRM_STATE_ESN) {
diff --git a/net/ipv6/esp6.c b/net/ipv6/esp6.c
index 88a7579c23bd..63b2b66f9dfa 100644
--- a/net/ipv6/esp6.c
+++ b/net/ipv6/esp6.c
@@ -601,12 +601,11 @@ static void esp_input_done_esn(struct 
crypto_async_request *base, int err)
 
 static int esp6_input(struct xfrm_state *x, struct sk_buff *skb)
 {
-   struct ip_esp_hdr *esph;
struct crypto_aead *aead = x->data;
struct aead_request *req;
struct sk_buff *trailer;
int ivlen = crypto_aead_ivsize(aead);
-   int elen = skb->len - sizeof(*esph) - ivlen;
+   int elen = skb->len - sizeof(struct ip_esp_hdr) - ivlen;
int nfrags;
int assoclen;
int seqhilen;
@@ -616,7 +615,7 @@ static int esp6_input(struct xfrm_state *x, struct sk_buff 
*skb)
u8 *iv;
struct scatterlist *sg;
 
-   if (!pskb_may_pull(skb, sizeof(*esph) + ivlen)) {
+   if (!pskb_may_pull(skb, sizeof(struct ip_esp_hdr) + ivlen)) {
ret = -EINVAL;
goto out;
}
@@ -626,7 +625,7 @@ static int esp6_input(struct xfrm_state *x, struct sk_buff 
*skb)
goto out;
}
 
-   assoclen = sizeof(*esph);
+   assoclen = sizeof(struct ip_esp_hdr);
seqhilen = 0;
 
if (x->props.flags & XFRM_STATE_ESN) {
-- 
2.17.1



[PATCH 1/3] xfrm: Make function xfrmi_get_link_net() static

2018-10-01 Thread Steffen Klassert
From: Wei Yongjun 

Fixes the following sparse warning:

net/xfrm/xfrm_interface.c:745:12: warning:
 symbol 'xfrmi_get_link_net' was not declared. Should it be static?

Fixes: f203b76d7809 ("xfrm: Add virtual xfrm interfaces")
Signed-off-by: Wei Yongjun 
Signed-off-by: Steffen Klassert 
---
 net/xfrm/xfrm_interface.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/xfrm/xfrm_interface.c b/net/xfrm/xfrm_interface.c
index 31acc6f33d98..2c0a5c59dcd0 100644
--- a/net/xfrm/xfrm_interface.c
+++ b/net/xfrm/xfrm_interface.c
@@ -742,7 +742,7 @@ static int xfrmi_fill_info(struct sk_buff *skb, const 
struct net_device *dev)
return -EMSGSIZE;
 }
 
-struct net *xfrmi_get_link_net(const struct net_device *dev)
+static struct net *xfrmi_get_link_net(const struct net_device *dev)
 {
struct xfrm_if *xi = netdev_priv(dev);
 
-- 
2.17.1



[PATCH 1/6] xfrm: Validate address prefix lengths in the xfrm selector.

2018-10-01 Thread Steffen Klassert
We don't validate the address prefix lengths in the xfrm
selector we got from userspace. This can lead to undefined
behaviour in the address matching functions if the prefix
is too big for the given address family. Fix this by checking
the prefixes and refuse SA/policy insertation when a prefix
is invalid.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: Air Icy 
Signed-off-by: Steffen Klassert 
---
 net/xfrm/xfrm_user.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index 33878e6e0d0a..5151b3ebf068 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -151,10 +151,16 @@ static int verify_newsa_info(struct xfrm_usersa_info *p,
err = -EINVAL;
switch (p->family) {
case AF_INET:
+   if (p->sel.prefixlen_d > 32 || p->sel.prefixlen_s > 32)
+   goto out;
+
break;
 
case AF_INET6:
 #if IS_ENABLED(CONFIG_IPV6)
+   if (p->sel.prefixlen_d > 128 || p->sel.prefixlen_s > 128)
+   goto out;
+
break;
 #else
err = -EAFNOSUPPORT;
@@ -1359,10 +1365,16 @@ static int verify_newpolicy_info(struct 
xfrm_userpolicy_info *p)
 
switch (p->sel.family) {
case AF_INET:
+   if (p->sel.prefixlen_d > 32 || p->sel.prefixlen_s > 32)
+   return -EINVAL;
+
break;
 
case AF_INET6:
 #if IS_ENABLED(CONFIG_IPV6)
+   if (p->sel.prefixlen_d > 128 || p->sel.prefixlen_s > 128)
+   return -EINVAL;
+
break;
 #else
return  -EAFNOSUPPORT;
-- 
2.17.1



[PATCH 6/6] xfrm: validate template mode

2018-10-01 Thread Steffen Klassert
From: Sean Tranchetti 

XFRM mode parameters passed as part of the user templates
in the IP_XFRM_POLICY are never properly validated. Passing
values other than valid XFRM modes can cause stack-out-of-bounds
reads to occur later in the XFRM processing:

[  140.535608] 
[  140.543058] BUG: KASAN: stack-out-of-bounds in xfrm_state_find+0x17e4/0x1cc4
[  140.550306] Read of size 4 at addr ffc0238a7a58 by task repro/5148
[  140.557369]
[  140.558927] Call trace:
[  140.558936] dump_backtrace+0x0/0x388
[  140.558940] show_stack+0x24/0x30
[  140.558946] __dump_stack+0x24/0x2c
[  140.558949] dump_stack+0x8c/0xd0
[  140.558956] print_address_description+0x74/0x234
[  140.558960] kasan_report+0x240/0x264
[  140.558963] __asan_report_load4_noabort+0x2c/0x38
[  140.558967] xfrm_state_find+0x17e4/0x1cc4
[  140.558971] xfrm_resolve_and_create_bundle+0x40c/0x1fb8
[  140.558975] xfrm_lookup+0x238/0x1444
[  140.558977] xfrm_lookup_route+0x48/0x11c
[  140.558984] ip_route_output_flow+0x88/0xc4
[  140.558991] raw_sendmsg+0xa74/0x266c
[  140.558996] inet_sendmsg+0x258/0x3b0
[  140.559002] sock_sendmsg+0xbc/0xec
[  140.559005] SyS_sendto+0x3a8/0x5a8
[  140.559008] el0_svc_naked+0x34/0x38
[  140.559009]
[  140.592245] page dumped because: kasan: bad access detected
[  140.597981] page_owner info is not active (free page?)
[  140.603267]
[  140.653503] 

Signed-off-by: Sean Tranchetti 
Signed-off-by: Steffen Klassert 
---
 net/xfrm/xfrm_user.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index 5151b3ebf068..d0672c400c2f 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -1455,6 +1455,9 @@ static int validate_tmpl(int nr, struct xfrm_user_tmpl 
*ut, u16 family)
(ut[i].family != prev_family))
return -EINVAL;
 
+   if (ut[i].mode >= XFRM_MODE_MAX)
+   return -EINVAL;
+
prev_family = ut[i].family;
 
switch (ut[i].family) {
-- 
2.17.1



pull request (net): ipsec 2018-10-01

2018-10-01 Thread Steffen Klassert
1) Validate address prefix lengths in the xfrm selector,
   otherwise we may hit undefined behaviour in the
   address matching functions if the prefix is too
   big for the given address family.

2) Fix skb leak on local message size errors.
   From Thadeu Lima de Souza Cascardo.

3) We currently reset the transport header back to the network
   header after a transport mode transformation is applied. This
   leads to an incorrect transport header when multiple transport
   mode transformations are applied. Reset the transport header
   only after all transformations are already applied to fix this.
   From Sowmini Varadhan.

4) We only support one offloaded xfrm, so reset crypto_done after
   the first transformation in xfrm_input(). Otherwise we may call
   the wrong input method for subsequent transformations.
   From Sowmini Varadhan.

5) Fix NULL pointer dereference when skb_dst_force clears the dst_entry.
   skb_dst_force does not really force a dst refcount anymore, it might
   clear it instead. xfrm code did not expect this, add a check to not
   dereference skb_dst() if it was cleared by skb_dst_force.

6) Validate xfrm template mode, otherwise we can get a stack-out-of-bounds
   read in xfrm_state_find. From Sean Tranchetti.

Please pull or let me know if there are problems.

Thanks!

The following changes since commit 25432eba9cd8f2ef5afef55be811b010a004b5fa:

  openvswitch: meter: Fix setting meter id for new entries (2018-07-29 13:20:54 
-0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec.git master

for you to fetch changes up to 32bf94fb5c2ec4ec842152d0e5937cd4bb6738fa:

  xfrm: validate template mode (2018-09-20 08:30:42 +0200)


Sean Tranchetti (1):
  xfrm: validate template mode

Sowmini Varadhan (2):
  xfrm: reset transport header back to network header after all input 
transforms ahave been applied
  xfrm: reset crypto_done when iterating over multiple input xfrms

Steffen Klassert (2):
  xfrm: Validate address prefix lengths in the xfrm selector.
  xfrm: Fix NULL pointer dereference when skb_dst_force clears the 
dst_entry.

Thadeu Lima de Souza Cascardo (1):
  xfrm6: call kfree_skb when skb is toobig

 net/ipv4/xfrm4_input.c  |  1 +
 net/ipv4/xfrm4_mode_transport.c |  4 +---
 net/ipv6/xfrm6_input.c  |  1 +
 net/ipv6/xfrm6_mode_transport.c |  4 +---
 net/ipv6/xfrm6_output.c |  2 ++
 net/xfrm/xfrm_input.c   |  1 +
 net/xfrm/xfrm_output.c  |  4 
 net/xfrm/xfrm_policy.c  |  4 
 net/xfrm/xfrm_user.c| 15 +++
 9 files changed, 30 insertions(+), 6 deletions(-)


[PATCH 4/6] xfrm: reset crypto_done when iterating over multiple input xfrms

2018-10-01 Thread Steffen Klassert
From: Sowmini Varadhan 

We only support one offloaded xfrm (we do not have devices that
can handle more than one offload), so reset crypto_done in
xfrm_input() when iterating over multiple transforms in xfrm_input,
so that we can invoke the appropriate x->type->input for the
non-offloaded transforms

Fixes: d77e38e612a0 ("xfrm: Add an IPsec hardware offloading API")
Signed-off-by: Sowmini Varadhan 
Signed-off-by: Steffen Klassert 
---
 net/xfrm/xfrm_input.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c
index 352abca2605f..86f5afbd0a0c 100644
--- a/net/xfrm/xfrm_input.c
+++ b/net/xfrm/xfrm_input.c
@@ -453,6 +453,7 @@ int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 
spi, int encap_type)
XFRM_INC_STATS(net, LINUX_MIB_XFRMINHDRERROR);
goto drop;
}
+   crypto_done = false;
} while (!err);
 
err = xfrm_rcv_cb(skb, family, x->type->proto, 0);
-- 
2.17.1



[PATCH 5/6] xfrm: Fix NULL pointer dereference when skb_dst_force clears the dst_entry.

2018-10-01 Thread Steffen Klassert
Since commit 222d7dbd258d ("net: prevent dst uses after free")
skb_dst_force() might clear the dst_entry attached to the skb.
The xfrm code don't expect this to happen, so we crash with
a NULL pointer dereference in this case. Fix it by checking
skb_dst(skb) for NULL after skb_dst_force() and drop the packet
in cast the dst_entry was cleared.

Fixes: 222d7dbd258d ("net: prevent dst uses after free")
Reported-by: Tobias Hommel 
Reported-by: Kristian Evensen 
Reported-by: Wolfgang Walter 
Signed-off-by: Steffen Klassert 
---
 net/xfrm/xfrm_output.c | 4 
 net/xfrm/xfrm_policy.c | 4 
 2 files changed, 8 insertions(+)

diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
index 89b178a78dc7..36d15a38ce5e 100644
--- a/net/xfrm/xfrm_output.c
+++ b/net/xfrm/xfrm_output.c
@@ -101,6 +101,10 @@ static int xfrm_output_one(struct sk_buff *skb, int err)
spin_unlock_bh(>lock);
 
skb_dst_force(skb);
+   if (!skb_dst(skb)) {
+   XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTERROR);
+   goto error_nolock;
+   }
 
if (xfrm_offload(skb)) {
x->type_offload->encap(x, skb);
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 7c5e8978aeaa..626e0f4d1749 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -2548,6 +2548,10 @@ int __xfrm_route_forward(struct sk_buff *skb, unsigned 
short family)
}
 
skb_dst_force(skb);
+   if (!skb_dst(skb)) {
+   XFRM_INC_STATS(net, LINUX_MIB_XFRMFWDHDRERROR);
+   return 0;
+   }
 
dst = xfrm_lookup(net, skb_dst(skb), , NULL, XFRM_LOOKUP_QUEUE);
if (IS_ERR(dst)) {
-- 
2.17.1



[PATCH 3/6] xfrm: reset transport header back to network header after all input transforms ahave been applied

2018-10-01 Thread Steffen Klassert
From: Sowmini Varadhan 

A policy may have been set up with multiple transforms (e.g., ESP
and ipcomp). In this situation, the ingress IPsec processing
iterates in xfrm_input() and applies each transform in turn,
processing the nexthdr to find any additional xfrm that may apply.

This patch resets the transport header back to network header
only after the last transformation so that subsequent xfrms
can find the correct transport header.

Fixes: 7785bba299a8 ("esp: Add a software GRO codepath")
Suggested-by: Steffen Klassert 
Signed-off-by: Sowmini Varadhan 
Signed-off-by: Steffen Klassert 
---
 net/ipv4/xfrm4_input.c  | 1 +
 net/ipv4/xfrm4_mode_transport.c | 4 +---
 net/ipv6/xfrm6_input.c  | 1 +
 net/ipv6/xfrm6_mode_transport.c | 4 +---
 4 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/xfrm4_input.c b/net/ipv4/xfrm4_input.c
index bcfc00e88756..f8de2482a529 100644
--- a/net/ipv4/xfrm4_input.c
+++ b/net/ipv4/xfrm4_input.c
@@ -67,6 +67,7 @@ int xfrm4_transport_finish(struct sk_buff *skb, int async)
 
if (xo && (xo->flags & XFRM_GRO)) {
skb_mac_header_rebuild(skb);
+   skb_reset_transport_header(skb);
return 0;
}
 
diff --git a/net/ipv4/xfrm4_mode_transport.c b/net/ipv4/xfrm4_mode_transport.c
index 3d36644890bb..1ad2c2c4e250 100644
--- a/net/ipv4/xfrm4_mode_transport.c
+++ b/net/ipv4/xfrm4_mode_transport.c
@@ -46,7 +46,6 @@ static int xfrm4_transport_output(struct xfrm_state *x, 
struct sk_buff *skb)
 static int xfrm4_transport_input(struct xfrm_state *x, struct sk_buff *skb)
 {
int ihl = skb->data - skb_transport_header(skb);
-   struct xfrm_offload *xo = xfrm_offload(skb);
 
if (skb->transport_header != skb->network_header) {
memmove(skb_transport_header(skb),
@@ -54,8 +53,7 @@ static int xfrm4_transport_input(struct xfrm_state *x, struct 
sk_buff *skb)
skb->network_header = skb->transport_header;
}
ip_hdr(skb)->tot_len = htons(skb->len + ihl);
-   if (!xo || !(xo->flags & XFRM_GRO))
-   skb_reset_transport_header(skb);
+   skb_reset_transport_header(skb);
return 0;
 }
 
diff --git a/net/ipv6/xfrm6_input.c b/net/ipv6/xfrm6_input.c
index 841f4a07438e..9ef490dddcea 100644
--- a/net/ipv6/xfrm6_input.c
+++ b/net/ipv6/xfrm6_input.c
@@ -59,6 +59,7 @@ int xfrm6_transport_finish(struct sk_buff *skb, int async)
 
if (xo && (xo->flags & XFRM_GRO)) {
skb_mac_header_rebuild(skb);
+   skb_reset_transport_header(skb);
return -1;
}
 
diff --git a/net/ipv6/xfrm6_mode_transport.c b/net/ipv6/xfrm6_mode_transport.c
index 9ad07a91708e..3c29da5defe6 100644
--- a/net/ipv6/xfrm6_mode_transport.c
+++ b/net/ipv6/xfrm6_mode_transport.c
@@ -51,7 +51,6 @@ static int xfrm6_transport_output(struct xfrm_state *x, 
struct sk_buff *skb)
 static int xfrm6_transport_input(struct xfrm_state *x, struct sk_buff *skb)
 {
int ihl = skb->data - skb_transport_header(skb);
-   struct xfrm_offload *xo = xfrm_offload(skb);
 
if (skb->transport_header != skb->network_header) {
memmove(skb_transport_header(skb),
@@ -60,8 +59,7 @@ static int xfrm6_transport_input(struct xfrm_state *x, struct 
sk_buff *skb)
}
ipv6_hdr(skb)->payload_len = htons(skb->len + ihl -
   sizeof(struct ipv6hdr));
-   if (!xo || !(xo->flags & XFRM_GRO))
-   skb_reset_transport_header(skb);
+   skb_reset_transport_header(skb);
return 0;
 }
 
-- 
2.17.1



[PATCH 2/6] xfrm6: call kfree_skb when skb is toobig

2018-10-01 Thread Steffen Klassert
From: Thadeu Lima de Souza Cascardo 

After commit d6990976af7c5d8f55903bfb4289b6fb030bf754 ("vti6: fix PMTU caching
and reporting on xmit"), some too big skbs might be potentially passed down to
__xfrm6_output, causing it to fail to transmit but not free the skb, causing a
leak of skb, and consequentially a leak of dst references.

After running pmtu.sh, that shows as failure to unregister devices in a 
namespace:

[  311.397671] unregister_netdevice: waiting for veth_b to become free. Usage 
count = 1

The fix is to call kfree_skb in case of transmit failures.

Fixes: dd767856a36e ("xfrm6: Don't call icmpv6_send on local error")
Signed-off-by: Thadeu Lima de Souza Cascardo 
Reviewed-by: Sabrina Dubroca 
Signed-off-by: Steffen Klassert 
---
 net/ipv6/xfrm6_output.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv6/xfrm6_output.c b/net/ipv6/xfrm6_output.c
index 5959ce9620eb..6a74080005cf 100644
--- a/net/ipv6/xfrm6_output.c
+++ b/net/ipv6/xfrm6_output.c
@@ -170,9 +170,11 @@ static int __xfrm6_output(struct net *net, struct sock 
*sk, struct sk_buff *skb)
 
if (toobig && xfrm6_local_dontfrag(skb)) {
xfrm6_local_rxpmtu(skb, mtu);
+   kfree_skb(skb);
return -EMSGSIZE;
} else if (!skb->ignore_df && toobig && skb->sk) {
xfrm_local_error(skb, mtu);
+   kfree_skb(skb);
return -EMSGSIZE;
}
 
-- 
2.17.1



Re: [PATCH net] xfrm: validate template mode

2018-09-24 Thread Steffen Klassert
On Wed, Sep 19, 2018 at 01:54:56PM -0600, Sean Tranchetti wrote:
> XFRM mode parameters passed as part of the user templates
> in the IP_XFRM_POLICY are never properly validated. Passing
> values other than valid XFRM modes can cause stack-out-of-bounds
> reads to occur later in the XFRM processing:
> 
> [  140.535608] 
> 
> [  140.543058] BUG: KASAN: stack-out-of-bounds in 
> xfrm_state_find+0x17e4/0x1cc4
> [  140.550306] Read of size 4 at addr ffc0238a7a58 by task repro/5148
> [  140.557369]
> [  140.558927] Call trace:
> [  140.558936] dump_backtrace+0x0/0x388
> [  140.558940] show_stack+0x24/0x30
> [  140.558946] __dump_stack+0x24/0x2c
> [  140.558949] dump_stack+0x8c/0xd0
> [  140.558956] print_address_description+0x74/0x234
> [  140.558960] kasan_report+0x240/0x264
> [  140.558963] __asan_report_load4_noabort+0x2c/0x38
> [  140.558967] xfrm_state_find+0x17e4/0x1cc4
> [  140.558971] xfrm_resolve_and_create_bundle+0x40c/0x1fb8
> [  140.558975] xfrm_lookup+0x238/0x1444
> [  140.558977] xfrm_lookup_route+0x48/0x11c
> [  140.558984] ip_route_output_flow+0x88/0xc4
> [  140.558991] raw_sendmsg+0xa74/0x266c
> [  140.558996] inet_sendmsg+0x258/0x3b0
> [  140.559002] sock_sendmsg+0xbc/0xec
> [  140.559005] SyS_sendto+0x3a8/0x5a8
> [  140.559008] el0_svc_naked+0x34/0x38
> [  140.559009]
> [  140.592245] page dumped because: kasan: bad access detected
> [  140.597981] page_owner info is not active (free page?)
> [  140.603267]
> [  140.653503] 
> 
> 
> Signed-off-by: Sean Tranchetti 

Applied to the ipsec tree, thanks a lot!


Re: [PATCH net-next RFC 7/8] udp: gro behind static key

2018-09-18 Thread Steffen Klassert
On Mon, Sep 17, 2018 at 10:19:22AM -0400, Willem de Bruijn wrote:
> On Mon, Sep 17, 2018 at 6:37 AM Steffen Klassert
>  wrote:
> >
> > Maybe in case that forwarding is enabled on the receiving device,
> > inet_gro_receive() could do a route lookup and allow GRO if the
> > route lookup returned at forwarding route.
> 
> That's a better solution, if the cost is acceptable. We do have to
> be careful against increasing per packet cycle cost in this path
> given that it's a possible vector for DoS attempts.

I played with this already and I have not seen any significant
overhead when doing a route lookup in GRO. We have to do a route
lookup anyway, so it should not matter too much if we do it here
or later in the IP layer.

> 
> > For flows that are likely software segmented after that, it
> > would be worth to build packet chains insted of merging the
> > payload. Packets of the same flow could travel together, but
> > it would save the cost of the packet merging and segmenting.
> 
> With software GSO that is faster, as it would have to allocate
> all the separate segment skbs in skb_segment later. Though
> there is some complexity if MTUs differ.

This can be handled the same way as it is done for TCP GSO
packets. If the size of the original packets is bigger than
the outgoing MTU, we either signal ICMP_FRAG_NEEDED to the
sender, or split the chain back into single packets and do
fragmentation on them.

> 
> With hardware UDP GSO, having a single skb will be cheaper in
> the forwarding path. Using napi_gro_frags, device drivers really
> do only end up allocating one skb for the GSO packet.

Right, this is why it would be nice to have this
configurable.



Re: [PATCH net-next RFC 7/8] udp: gro behind static key

2018-09-17 Thread Steffen Klassert
On Fri, Sep 14, 2018 at 01:59:40PM -0400, Willem de Bruijn wrote:
> From: Willem de Bruijn 
> 
> Avoid the socket lookup cost in udp_gro_receive if no socket has a
> gro callback configured.

It would be nice if we could do GRO not just for GRO configured
sockets, but also for flows that are going to be IPsec transformed
or directly forwarded.

Maybe in case that forwarding is enabled on the receiving device,
inet_gro_receive() could do a route lookup and allow GRO if the
route lookup returned at forwarding route.

For flows that are likely software segmented after that, it
would be worth to build packet chains insted of merging the
payload. Packets of the same flow could travel together, but
it would save the cost of the packet merging and segmenting.
This could be done similar to what I proposed for the list
receive case:

https://www.spinics.net/lists/netdev/msg522706.html

How GRO should be done could be even configured
by replacing the net_offload pointer similar
to what Paolo propsed in his pachset with
the inet_update_offload() function.


Re: [PATCH net-next RFC 7/8] udp: gro behind static key

2018-09-17 Thread Steffen Klassert
On Fri, Sep 14, 2018 at 01:59:40PM -0400, Willem de Bruijn wrote:
> From: Willem de Bruijn 
> 
> Avoid the socket lookup cost in udp_gro_receive if no socket has a
> gro callback configured.
> 
> Signed-off-by: Willem de Bruijn 

...

> diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
> index 4f6aa95a9b12..f44fe328aa0f 100644
> --- a/net/ipv4/udp_offload.c
> +++ b/net/ipv4/udp_offload.c
> @@ -405,7 +405,7 @@ static struct sk_buff *udp4_gro_receive(struct list_head 
> *head,
>  {
>   struct udphdr *uh = udp_gro_udphdr(skb);
>  
> - if (unlikely(!uh))
> + if (unlikely(!uh) || !static_branch_unlikely(_encap_needed_key))
>   goto flush;

If you use udp_encap_needed_key to enalbe UDP GRO, then a UDP
encapsulation socket will enable it too. Not sure if this is
intentional.

That said, enabling UDP GRO on a UDP encapsulation socket
(ESP in UPD etc.) will fail badly as then encrypted ESP
packets might be merged together. So we somehow should
make sure that this does not happen.

Anyway, this reminds me that we can support GRO for
UDP encapsulation. It just requires separate GRO
callbacks for the different encapsulation types.


[PATCH RFC 1/2] net: Support flow sorted RX skb lists for IPv4.

2018-09-12 Thread Steffen Klassert
This patch sorts RX skb lists into separate flows, using
a flow dissector, at the IP input layer. Packets of the
same flow are chained at the frag_list pointer of the first
skb of this flow.

After ip_list_rcv_finish() the skb list has this layout:

|-||-||-|
|flow 1   ||flow 1   ||flow 1   |
|-||-||-|
|frag_list|<-\ |frag_list||frag_list|
|-|   \|-||-|
|next |<-\ \---|next |<---|next |
|-|   \|-||-|
  |
  |
  ||-||-||-|
  ||flow 2   ||flow 2   ||flow 2   |
  ||-||-||-|
  ||frag_list|<-\ |frag_list||frag_list|
  ||-|   \|-||-|
  ||next |<-\ \---|next |<---|next |
   |-|   \|-||-|
 |
 |
 ||-||-|   
|-|
 ||flow 3   ||flow 3   |   
|flow 3   |
 ||-||-|   
|-|
 ||frag_list|<-\ |frag_list|   
|frag_list|
 ||-|   \|-|   
|-|
 ||next |\---|next 
|<--|next |
  |-||-|   
|-|

With this approach route lookups etc. are done just for one
representative packet of a given flow instead for each packet.

ip_sublist_rcv_finish() splits these lists then into:

|-||-||-|
|flow 1   ||flow 1   ||flow 1   |
|-||-||-|
|frag_list|<-\ |frag_list||frag_list|
|-|   \|-||-|
|next |\---|next |<---|next |
|-||-||-|

Packets of the same flow can still travel together after that point.

On input, this is plumbed through to ip_local_deliver_finish(),
here the skb chain is split back into single packets.

My hope is that this can be plumbed through to the sockets
receive queue. I have a patch for UDP, but it has still
problems with UDP encapsulaion, so it is not included here.

On forward, the skb chain can travel together to the TX path.
__skb_gso_segment() will build a standard skb list from this.

For now, this is only enabled if the receiving device allows
forwarding, as the forwarding path has currently the most gain
from this.

Known issues:

- I don't have a NIC whose driver supports to build skb lists
  to be received by netif_receive_skb_list(). To test this
  codepath I used a hack that builds skb lists at the napi
  layer.

- Performance measurements were done with this hack, so I don't
  know if these measurements are really meaningful.

- This is early stage work, so the functional tests are only
  done on a basic level, it might be still buggy.

- This still uses the skb->next, skb->prev pointers to build
  skb lists. So needs to be converted to standard list handling
  at some point.

Signed-off-by: Steffen Klassert 
---
 include/linux/skbuff.h|   5 ++
 net/core/dev.c|  45 +++-
 net/core/flow_dissector.c |  40 +++
 net/core/skbuff.c |  52 ++
 net/ipv4/ip_input.c   | 139 ++
 net/ipv4/ip_output.c  |   3 +-
 6 files changed, 270 insertions(+), 14 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 17a13e4785fc..d070d073a1dc 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -575,6 +575,8 @@ enum {
SKB_GSO_UDP = 1 << 16,
 
SKB_GSO_UDP_L4 = 1 << 17,
+
+   SKB_GSO_FRAGLIST = 1 << 18,
 };
 
 #if BITS_PER_LONG > 32
@@ -1226,6 +1228,8 @@ skb_flow_dissect_flow_keys_basic(const struct sk_buff 
*skb,
  data, proto, nhoff, hlen, flags);
 }
 
+u32 skb_flow_keys_rx_digest(struct sk_buff *skb, struct flow_keys_digest 
*digest);
+
 void
 skb_flow_dissect_tunnel_info(const struct sk_buff *skb,
 struct flow_dissector *flow_dissector,
@@ -3302,6 +3306,7 @@ int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, 
int shiftlen);
 void skb_scrub_packet(struct sk_buff *skb, bool xnet);
 bool skb_gso_validate_network_len(const struct sk_buff *skb, unsigned int mtu);
 bool skb_gso_validate_mac_len(const struct sk_buff *skb, unsigned int len);
+void skb_segment_list(

[PATCH RFC 0/2] Flow sorted receive skb lists

2018-09-12 Thread Steffen Klassert
This patchset consists of two patches. Patch 1 adds support for flow
sorted rx skb lists for IPv4. This means that it sorts the skb list so
that packets from the same flow can to travel together through the stack.

The second patch of this pachset is just a hack that disables GRO and does
skb list receive instead. I don't have a NIC whose driver supports to build
skb lists to be received by netif_receive_skb_list(), so I used this hack
to test patch 1.

This is early stage work, so it might be not complete and still buggy.
I send this now because I want to hear some comments on the approach
itself before I spend more time to work on this.

Forwarding performance measurements:

I used used my IPsec forwarding test setup for this:

    
-->| router 1 |>| router 2 |--
|     |
| |
|     |
|Spirent Testcenter|<--


net-next (September 6th):

Single stream UDP frame size 1460 Bytes: 1.258.446 fps.

Single stream UDP frame size 64 Bytes: 1.250.000 fps.

--

net-next (September 6th) + patch 2 (list receive):

Single stream UDP frame size 1460 Bytes: 1.469.595 fps (+16%).

Single stream UDP frame size 64 Bytes: 1.502.976 fps  (+20%).

--

net-next (September 6th) + patch 1 + patch 2 (flow sorted list receive):

Single stream UDP frame size 1460 Bytes: 2.525.338 fps (+100%).

Single stream UDP frame size 64 Bytes: 2.541.881 fps (+103%).

---



[PATCH RFC 2/2] net: Hack to enable skb list receive in the napi layer.

2018-09-12 Thread Steffen Klassert
This patch was used to test patch ("net: Support flow sorted skb lists
for IPv4.") It is just a hack that disables GRO and does skb list
receive instead. Not for merging!

Signed-off-by: Steffen Klassert 
---
 include/linux/netdevice.h |  5 -
 net/core/dev.c| 23 ++-
 net/ipv4/ip_input.c   |  1 +
 3 files changed, 19 insertions(+), 10 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e2b3bd750c98..6518ceeda05d 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -333,7 +333,10 @@ struct napi_struct {
int poll_owner;
 #endif
struct net_device   *dev;
-   struct gro_list gro_hash[GRO_HASH_BUCKETS];
+   union {
+   struct list_headrx_list;
+   struct gro_list gro_hash[GRO_HASH_BUCKETS];
+   };
struct sk_buff  *skb;
struct hrtimer  timer;
struct list_headdev_list;
diff --git a/net/core/dev.c b/net/core/dev.c
index 147da35d7380..68e00727c657 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5470,6 +5470,10 @@ static enum gro_result dev_gro_receive(struct 
napi_struct *napi, struct sk_buff
int same_flow;
int grow;
 
+   list_add_tail(>list, >rx_list);
+   ret = GRO_CONSUMED;
+   goto out;
+
if (netif_elide_gro(skb->dev))
goto normal;
 
@@ -5559,7 +5563,7 @@ static enum gro_result dev_gro_receive(struct napi_struct 
*napi, struct sk_buff
} else if (test_bit(hash, >gro_bitmask)) {
__clear_bit(hash, >gro_bitmask);
}
-
+out:
return ret;
 
 normal:
@@ -5958,7 +5962,7 @@ bool napi_complete_done(struct napi_struct *n, int 
work_done)
 NAPIF_STATE_IN_BUSY_POLL)))
return false;
 
-   if (n->gro_bitmask) {
+   if (!list_empty(>rx_list)) {
unsigned long timeout = 0;
 
if (work_done)
@@ -5967,8 +5971,10 @@ bool napi_complete_done(struct napi_struct *n, int 
work_done)
if (timeout)
hrtimer_start(>timer, ns_to_ktime(timeout),
  HRTIMER_MODE_REL_PINNED);
-   else
-   napi_gro_flush(n, false);
+   else {
+   netif_receive_skb_list(>rx_list);
+   INIT_LIST_HEAD(>rx_list);
+   }
}
if (unlikely(!list_empty(>poll_list))) {
/* If n->poll_list is not empty, we need to mask irqs */
@@ -6189,6 +6195,7 @@ void netif_napi_add(struct net_device *dev, struct 
napi_struct *napi,
int (*poll)(struct napi_struct *, int), int weight)
 {
INIT_LIST_HEAD(>poll_list);
+   INIT_LIST_HEAD(>rx_list);
hrtimer_init(>timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
napi->timer.function = napi_watchdog;
init_gro_hash(napi);
@@ -6289,11 +6296,9 @@ static int napi_poll(struct napi_struct *n, struct 
list_head *repoll)
goto out_unlock;
}
 
-   if (n->gro_bitmask) {
-   /* flush too old packets
-* If HZ < 1000, flush all packets.
-*/
-   napi_gro_flush(n, HZ >= 1000);
+   if (!list_empty(>rx_list)) {
+   netif_receive_skb_list(>rx_list);
+   INIT_LIST_HEAD(>rx_list);
}
 
/* Some drivers may have called napi_schedule
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index bf710bf95fea..847965f83a2f 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -729,6 +729,7 @@ void ip_list_rcv(struct list_head *head, struct packet_type 
*pt,
}
list_add_tail(>list, );
}
+
/* dispatch final sublist */
ip_sublist_rcv(, curr_dev, curr_net);
 }
-- 
2.17.1



Re: kernels > v4.12 oops/crash with ipsec-traffic: bisected to b838d5e1c5b6e57b10ec8af2268824041e3ea911: ipv4: mark DST_NOGC and remove the operation of dst_free()

2018-09-12 Thread Steffen Klassert
On Tue, Sep 11, 2018 at 09:02:48PM +0200, Tobias Hommel wrote:
> > > Subject: [PATCH RFC] xfrm: Fix NULL pointer dereference when skb_dst_force
> > > clears the dst_entry.
> > > 
> > > Since commit 222d7dbd258d ("net: prevent dst uses after free")
> > > skb_dst_force() might clear the dst_entry attached to the skb.
> > > The xfrm code don't expect this to happen, so we crash with
> > > a NULL pointer dereference in this case. Fix it by checking
> > > skb_dst(skb) for NULL after skb_dst_force() and drop the packet
> > > in cast the dst_entry was cleared.
> > > 
> > > Fixes: 222d7dbd258d ("net: prevent dst uses after free")
> > > Reported-by: Tobias Hommel 
> > > Reported-by: Kristian Evensen 
> > > Reported-by: Wolfgang Walter 
> > > Signed-off-by: Steffen Klassert 
> > > ---
> > 
> > This patch fixes the problem here.
> > 
> > XfrmFwdHdrError gets around 80 at the very beginning and remains so. 
> > Probably 
> > this happens when some route are changed/set then. 
> > 
> > Regards and thanks,
> 
> Same here, we're now running stable for ~6 hours, XfrmFwdHdrError is at 220.
> This is less than 1 lost packet per minute, which seems to be okay for now.

Thanks a lot for testing! This is now applied to the ipsec tree.


Re: kernels > v4.12 oops/crash with ipsec-traffic: bisected to b838d5e1c5b6e57b10ec8af2268824041e3ea911: ipv4: mark DST_NOGC and remove the operation of dst_free()

2018-09-11 Thread Steffen Klassert
On Mon, Sep 10, 2018 at 10:18:47AM +0200, Kristian Evensen wrote:
> Hi,
> 
> Thanks everyone for all the effort in debugging this issue.
> 
> On Mon, Sep 10, 2018 at 8:39 AM Steffen Klassert
>  wrote:
> > The easy fix that could be backported to stable would be
> > to check skb->dst for NULL and drop the packet in that case.
> 
> Thought I should just chime in and say that we deployed this
> work-around when we started observing the error back in June. Since
> then we have not seen any crashes. Also, we have instrumented some of
> our kernels to count the number of times the error is hit (overall +
> consecutive). Compared to the overall number of packets, the error
> happens very rarely. With our workloads, we on average see the error
> once every couple of days.

Thanks for letting us know!

I plan to fix this in the ipsec tree with:

Subject: [PATCH RFC] xfrm: Fix NULL pointer dereference when skb_dst_force 
clears
 the dst_entry.

Since commit 222d7dbd258d ("net: prevent dst uses after free")
skb_dst_force() might clear the dst_entry attached to the skb.
The xfrm code don't expect this to happen, so we crash with
a NULL pointer dereference in this case. Fix it by checking
skb_dst(skb) for NULL after skb_dst_force() and drop the packet
in cast the dst_entry was cleared.

Fixes: 222d7dbd258d ("net: prevent dst uses after free")
Reported-by: Tobias Hommel 
Reported-by: Kristian Evensen 
Reported-by: Wolfgang Walter 
Signed-off-by: Steffen Klassert 
---
 net/xfrm/xfrm_output.c | 4 
 net/xfrm/xfrm_policy.c | 4 
 2 files changed, 8 insertions(+)

diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
index 89b178a78dc7..36d15a38ce5e 100644
--- a/net/xfrm/xfrm_output.c
+++ b/net/xfrm/xfrm_output.c
@@ -101,6 +101,10 @@ static int xfrm_output_one(struct sk_buff *skb, int err)
spin_unlock_bh(>lock);
 
skb_dst_force(skb);
+   if (!skb_dst(skb)) {
+   XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTERROR);
+   goto error_nolock;
+   }
 
if (xfrm_offload(skb)) {
x->type_offload->encap(x, skb);
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 7c5e8978aeaa..626e0f4d1749 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -2548,6 +2548,10 @@ int __xfrm_route_forward(struct sk_buff *skb, unsigned 
short family)
}
 
skb_dst_force(skb);
+   if (!skb_dst(skb)) {
+   XFRM_INC_STATS(net, LINUX_MIB_XFRMFWDHDRERROR);
+   return 0;
+   }
 
dst = xfrm_lookup(net, skb_dst(skb), , NULL, XFRM_LOOKUP_QUEUE);
if (IS_ERR(dst)) {
-- 
2.17.1



Re: kernels > v4.12 oops/crash with ipsec-traffic: bisected to b838d5e1c5b6e57b10ec8af2268824041e3ea911: ipv4: mark DST_NOGC and remove the operation of dst_free()

2018-09-10 Thread Steffen Klassert
On Fri, Sep 07, 2018 at 11:10:55PM +0200, Wolfgang Walter wrote:
> Hello Steffen,
> 
> in one of your emails to Thomas you wrote:
> > xfrm_lookup+0x2a is at the very beginning of xfrm_lookup(), here we
> > find:
> > 
> > u16 family = dst_orig->ops->family;
> > 
> > ops has an offset of 32 bytes (20 hex) in dst_orig, so looks like
> > dst_orig is NULL.
> > 
> > In the forwarding case, we get dst_orig from the skb and dst_orig
> > can't be NULL here unless the skb itself is already fishy.
> 
> Is this really true?
> 
> If xfrm_lookup is called from 
> 
> __xfrm_route_forward():
> 
> int __xfrm_route_forward(struct sk_buff *skb, unsigned short family)
> {
> struct net *net = dev_net(skb->dev);
> struct flowi fl;
> struct dst_entry *dst;
> int res = 1;
> 
> if (xfrm_decode_session(skb, , family) < 0) {
> XFRM_INC_STATS(net, LINUX_MIB_XFRMFWDHDRERROR);
> return 0;
> }
> 
> skb_dst_force(skb);
> 
> dst = xfrm_lookup(net, skb_dst(skb), , NULL, XFRM_LOOKUP_QUEUE);
> if (IS_ERR(dst)) {
> res = 0;
> dst = NULL;
> }
> skb_dst_set(skb, dst);
> return res;
> }
> 
> couldn't it be possible that skb_dst_force(skb) actually sets dst to NULL if 
> it cannot safely lock it? If it is absolutely sure that skb_dst_force() never 
> can set dst to NULL I wonder why it is called at all?

Ugh, skb_dst_force apparently changed since I looked at it last time.
I did not expect that it can clear skb->dst. This behaviour was
introduced with:

commit 222d7dbd258dad4cd5241c43ef818141fad5a87a
net: prevent dst uses after free

from Eric Dumazet (put him to Cc).

The easy fix that could be backported to stable would be
to check skb->dst for NULL and drop the packet in that case.

I wonder if we can do better here. We can still use the
dst_entry as long as we don't exit the RCU grace period.
But looking deeper into it, the crypto layer might return
asynchronously. In this case, we exit the RCU grace period
and we have to drop the packet anyway.

If I understand correct, the bug happens rarely. So maybe
we could just stay with the easy fix (I'll do a patch today).

The other thing I wonder about is why Tobias bisected this to

commit b838d5e1c5b6e57b10ec8af2268824041e3ea911
ipv4: mark DST_NOGC and remove the operation of dst_free()

from 'Jun 17 2017' and not to

commit 222d7dbd258dad4cd5241c43ef818141fad5a87a
net: prevent dst uses after free

from 'Sep 21 2017'.

Maybe Tobias has seen two bugs. Before
("net: prevent dst uses after free"), it was the
use after free, and after this fix it was a NULL
pointer derference of skb->dst.



Re: [PATCH V2 ipsec-next 0/2] xfrm: bug fixes when processing multiple transforms

2018-09-05 Thread Steffen Klassert
On Mon, Sep 03, 2018 at 04:36:51AM -0700, Sowmini Varadhan wrote:
> This series contains bug fixes that were encountered when I set
> up a libreswan tunnel using the config below, which will set up
> an IPsec policy involving 2 tmpls.
> 
> type=transport
> compress=yes
> esp=aes_gcm_c-128-null # offloaded to Niantic
> auto=start
> 
> The non-offload test case uses  esp=aes_gcm_c-256-null.
> 
> Each patch has a technical description of the contents of the fix.
> 
> V2: added Fixes tag so that it can be backported to the stable trees.
> 
> Sowmini Varadhan (2):
>   xfrm: reset transport header back to network header after all input
> transforms ahave been applied
>   xfrm: reset crypto_done when iterating over multiple input xfrms

All applied to the ipsec tree, thanks a lot Sowmini!


Re: [PATCH v2] xfrm6: call kfree_skb when skb is toobig

2018-09-03 Thread Steffen Klassert
On Fri, Aug 31, 2018 at 08:38:49AM -0300, Thadeu Lima de Souza Cascardo wrote:
> After commit d6990976af7c5d8f55903bfb4289b6fb030bf754 ("vti6: fix PMTU caching
> and reporting on xmit"), some too big skbs might be potentially passed down to
> __xfrm6_output, causing it to fail to transmit but not free the skb, causing a
> leak of skb, and consequentially a leak of dst references.
> 
> After running pmtu.sh, that shows as failure to unregister devices in a 
> namespace:
> 
> [  311.397671] unregister_netdevice: waiting for veth_b to become free. Usage 
> count = 1
> 
> The fix is to call kfree_skb in case of transmit failures.
> 
> Signed-off-by: Thadeu Lima de Souza Cascardo 
> Reviewed-by: Sabrina Dubroca 
> Fixes: dd767856a36e ("xfrm6: Don't call icmpv6_send on local error")

Applied, thanks a lot!


Re: [PATCH ipsec-next 1/2] xfrm: reset transport header back to network header after all input transforms ahave been applied

2018-09-03 Thread Steffen Klassert
On Sun, Sep 02, 2018 at 04:18:42PM -0700, Sowmini Varadhan wrote:
> A policy may have been set up with multiple transforms (e.g., ESP
> and ipcomp). In this situation, the ingress IPsec processing
> iterates in xfrm_input() and applies each transform in turn,
> processing the nexthdr to find any additional xfrm that may apply.
> 
> This patch resets the transport header back to network header
> only after the last transformation so that subsequent xfrms
> can find the correct transport header.
> 
> Suggested-by: Steffen Klassert 
> Signed-off-by: Sowmini Varadhan 

This patch is a fix, so please add a proper fixes tag
so that it can be backported to the stable trees.
Same applied to the other patch in this series.


Re: [PATCH RFC net-next 00/11] udp gso

2018-09-03 Thread Steffen Klassert
On Fri, Aug 31, 2018 at 09:08:59AM -0400, Willem de Bruijn wrote:
> On Fri, Aug 31, 2018 at 5:09 AM Paolo Abeni  wrote:
> >
> > Hi,
> >
> > On Tue, 2018-04-17 at 17:07 -0400, Willem de Bruijn wrote:
> > > That said, for negotiated flows an inverse GRO feature could
> > > conceivably be implemented to reduce rx stack traversal, too.
> > > Though due to interleaving of packets on the wire, it aggregation
> > > would be best effort, similar to TCP TSO and GRO using the
> > > PSH bit as packetization signal.
> >
> > Reviving this old thread, before I forgot again. I have some local
> > patches implementing UDP GRO in a dual way to current GSO_UDP_L4
> > implementation: several datagram with the same length are aggregated
> > into a single one, and the user space receive a single larger packet
> > instead of multiple ones. I hope quic can leverage such scenario, but I
> > really know nothing about the protocol.
> >
> > I measure roughly a 50% performance improvement with udpgso_bench in
> > respect to UDP GSO, and ~100% using a pktgen sender, and a reduced CPU
> > usage on the receiver[1]. Some additional hacking to the general GRO
> > bits is required to avoid useless socket lookups for ingress UDP
> > packets when UDP_GSO is not enabled.
> >
> > If there is interest on this topic, I can share some RFC patches
> > (hopefully somewhat next week).
> 
> As Eric pointed out, QUIC reception on mobile clients over the WAN
> may not see much gain. But apparently there is a non-trivial amount
> of traffic the other way, to servers. Again, WAN might limit whatever
> gain we get, but I do want to look into that. And there are other UDP high
> throughput workloads (with or without QUIC) between servers.
> 
> If you have patches, please do share them. I actually also have a rough
> patch that I did not consider ready to share yet. Based on Tom's existing
> socket lookup in udp_gro_receive to detect whether a local destination
> exists and whether it has set an option to support receiving coalesced
> payloads (along with a cmsg to share the segment size).
> 
> Converting udp_recvmsg to split apart gso packets to make this
> transparent seems to me to be too complex and not worth the effort.
> 
> If a local socket is not found in udp_gro_receive, this could also be
> tentative interpreted as a non-local path (with false positives), enabling
> transparent use of GRO + GSO batching on the forwarding path.

On forwarding, it might be even better to build packet lists instead
of merging the payload. This would save the cost of the merge at GRO
and the split at GSO (at least when the split happens in software).

I'm working on patches that builds such skb lists. The list is chained
at the frag_list pointer of the first skb, all subsequent skbs are linked
to the next pointer of the skb. It looks like this:

|-||-||-|
|flow 1   ||flow 1   ||flow 1   |
|-||-||-|
|frag_list|<-\ |frag_list||frag_list|
|-|   \|-||-|
|next |\---|next |<---|next |
|-||-||-|

Furthermore, I'm trying to integrate this with the new skb listification
work. Instead of just linking the packets into the lists as they arrive,
I try to sort them into flows, so that packets of the same flow can
travel together and lookups etc. are just done for one representative
skb of a given flow.

Here I build the packet lists like this:


|-||-||-|
|flow 1   ||flow 1   ||flow 1   |
|-||-||-|
|frag_list|<-\ |frag_list||frag_list|
|-|   \|-||-|
|next |<-\ \---|next |<---|next |
|-|   \|-||-|
  |
  |
  ||-||-||-|
  ||flow 2   ||flow 2   ||flow 2   |
  ||-||-||-|
  ||frag_list|<-\ |frag_list||frag_list|
  ||-|   \|-||-|
  ||next |<-\ \---|next |<---|next |
   |-|   \|-||-|
 |
 |
 ||-||-|   
|-|
 ||flow 3   ||flow 3   |   
|flow 3   |
 ||-||-|   
|-|
 ||frag_list|<-\ |frag_list|   
|frag_list|
 ||-|   \|-|   
|-|
 ||next |\---|next 
|<--|next |
 

Re: kernels > v4.12 oops/crash with ipsec-traffic: bisected to b838d5e1c5b6e57b10ec8af2268824041e3ea911: ipv4: mark DST_NOGC and remove the operation of dst_free()

2018-08-31 Thread Steffen Klassert
On Thu, Aug 30, 2018 at 08:53:50PM +0200, Wolfgang Walter wrote:
> Hello,
> 
> kernels > 4.12 do not work on one of our main routers. They crash as soon
> as ipsec-tunnels are configured and ipsec-traffic actually flows.

Can you please send the backtrace of this crash?

Thanks!


Re: [PATCH 1/2] xfrm6: call kfree_skb when skb is toobig

2018-08-31 Thread Steffen Klassert
On Thu, Aug 30, 2018 at 03:23:11PM +0200, Sabrina Dubroca wrote:
> 2018-08-30, 09:58:16 -0300, Thadeu Lima de Souza Cascardo wrote:
> > After commit d6990976af7c5d8f55903bfb4289b6fb030bf754 ("vti6: fix PMTU 
> > caching
> > and reporting on xmit"), some too big skbs might be potentially passed down 
> > to
> > __xfrm6_output, causing it to fail to transmit but not free the skb, 
> > causing a
> > leak of skb, and consequentially a leak of dst references.
> > 
> > After running pmtu.sh, that shows as failure to unregister devices in a 
> > namespace:
> > 
> > [  311.397671] unregister_netdevice: waiting for veth_b to become free. 
> > Usage count = 1
> > 
> > The fix is to call kfree_skb in case of transmit failures.
> > 
> > Signed-off-by: Thadeu Lima de Souza Cascardo 

Good catch!

> Reviewed-by: Sabrina Dubroca 
> 
> I was about to post the same patch. Arguably, the commit introducing
> this bug is the one that added those "return -EMSGSIZE" to
> __xfrm6_output without freeing.
> 
> Either way, it's missing a Fixes: tag, which should be one of those,
> or both:
> 
> Fixes: d6990976af7c ("vti6: fix PMTU caching and reporting on xmit")
> Fixes: dd767856a36e ("xfrm6: Don't call icmpv6_send on local error")

This bug can be triggered even without vti6, so the correct
Fixes tag would be the latter.

Thadeu, please resend this one with the Fixes tag.

Thanks!


Re: [PATCH 2/2] vti6: do not check for ignore_df in order to update pmtu

2018-08-31 Thread Steffen Klassert
On Thu, Aug 30, 2018 at 09:58:17AM -0300, Thadeu Lima de Souza Cascardo wrote:
> Before commit d6990976af7c5d8f55903bfb4289b6fb030bf754 ("vti6: fix PMTU 
> caching
> and reporting on xmit"), skb was scrubbed before checking for ignore_df. The
> scrubbing meant ignore_df was false, making the check irrelevant. Now that the
> scrubbing happens after that, some packets might fail the checking and dst
> will not have its pmtu updated.
> 
> Not only that, but too big skb will be potentially passed down to
> __xfrm6_output, causing it to fail to transmit but not free the skb, causing a
> leak of skb, and consequentially a leak of dst references.
> 
> After running pmtu.sh, that shows as failure to unregister devices in a 
> namespace:
> 
> [  311.397671] unregister_netdevice: waiting for veth_b to become free. Usage 
> count = 1
> 
> Signed-off-by: Thadeu Lima de Souza Cascardo 
> ---
>  net/ipv6/ip6_vti.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/ipv6/ip6_vti.c b/net/ipv6/ip6_vti.c
> index c72ae3a4fe09..fbd3752ea587 100644
> --- a/net/ipv6/ip6_vti.c
> +++ b/net/ipv6/ip6_vti.c
> @@ -481,7 +481,7 @@ vti6_xmit(struct sk_buff *skb, struct net_device *dev, 
> struct flowi *fl)
>   }
>  
>   mtu = dst_mtu(dst);
> - if (!skb->ignore_df && skb->len > mtu) {
> + if (skb->len > mtu) {
>   skb_dst_update_pmtu(skb, mtu);

The very same patch went already to the net tree two day ago:

commit 9f2895461439fda2801a7906fb4c5fb3dbb37a0a
vti6: remove !skb->ignore_df check from vti6_xmit()



Re: [PATCH ipsec-next] xfrm: allow driver to quietly refuse offload

2018-08-29 Thread Steffen Klassert
On Wed, Aug 22, 2018 at 02:38:10PM -0700, Shannon Nelson wrote:
> If the "offload" attribute is used to create an IPsec SA
> and the .xdo_dev_state_add() fails, the SA creation fails.
> However, if the "offload" attribute is used on a device that
> doesn't offer it, the attribute is quietly ignored and the SA
> is created without an offload.
> 
> Along the same line of that second case, it would be good to
> have a way for the device to refuse to offload an SA without
> failing the whole SA creation.  This patch adds that feature
> by allowing the driver to return -EOPNOTSUPP as a signal that
> the SA may be fine, it just can't be offloaded.
> 
> This allows the user a little more flexibility in requesting
> offloads and not needing to know every detail at all times about
> each specific NIC when trying to create SAs.
> 
> Signed-off-by: Shannon Nelson 

Applied to ipsec-next, thanks Shannon!


Re: [PATCH net] vti6: remove !skb->ignore_df check from vti6_xmit()

2018-08-29 Thread Steffen Klassert
On Thu, Aug 23, 2018 at 07:49:54PM +0300, Alexey Kodanev wrote:
> Before the commit d6990976af7c ("vti6: fix PMTU caching and reporting
> on xmit") '!skb->ignore_df' check was always true because the function
> skb_scrub_packet() was called before it, resetting ignore_df to zero.
> 
> In the commit, skb_scrub_packet() was moved below, and now this check
> can be false for the packet, e.g. when sending it in the two fragments,
> this prevents successful PMTU updates in such case. The next attempts
> to send the packet lead to the same tx error. Moreover, vti6 initial
> MTU value relies on PMTU adjustments.
> 
> This issue can be reproduced with the following LTP test script:
> udp_ipsec_vti.sh -6 -p ah -m tunnel -s 2000
> 
> Fixes: ccd740cbc6e0 ("vti6: Add pmtu handling to vti6_xmit.")
> Signed-off-by: Alexey Kodanev 
> ---
> Not sure about xfrmi_xmit2(), it has a similar check for ignore_df...
> 
>  net/ipv6/ip6_vti.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/ipv6/ip6_vti.c b/net/ipv6/ip6_vti.c
> index 38dec9d..f48d196 100644
> --- a/net/ipv6/ip6_vti.c
> +++ b/net/ipv6/ip6_vti.c
> @@ -481,7 +481,7 @@ static bool vti6_state_check(const struct xfrm_state *x,
>   }
>  
>   mtu = dst_mtu(dst);
> - if (!skb->ignore_df && skb->len > mtu) {
> + if (skb->len > mtu) {
>   skb_dst_update_pmtu(skb, mtu);

This looks OK to me. If I remember correct, the !skb->ignore_df
check was taken from the native xfrm6 PMTU handling. There this 
check makes sense because the packet can be still fragmented
along the way through the stack. In this case here it is too late
as we are about to TX the packet through the vti device. So
we should update to the new IPsec PMTU and notify the sender
about this.

Acked-by: Steffen Klassert 


Re: [PATCH net-next] xfrm: Make function xfrmi_get_link_net() static

2018-07-31 Thread Steffen Klassert
On Sat, Jul 28, 2018 at 06:49:48AM +, Wei Yongjun wrote:
> Fixes the following sparse warning:
> 
> net/xfrm/xfrm_interface.c:745:12: warning:
>  symbol 'xfrmi_get_link_net' was not declared. Should it be static?
> 
> Fixes: f203b76d7809 ("xfrm: Add virtual xfrm interfaces")
> Signed-off-by: Wei Yongjun 

Applied, thanks Wei!


Re: [PATCH RFC ipsec-next] xfrm: Check Reverse-Mark Lookup Before ADDSA/DELSA

2018-07-27 Thread Steffen Klassert
On Wed, Jul 25, 2018 at 03:36:47PM -0700, Nathan Harold wrote:
> 
> diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
> index b669262682c9..ee212a7c91a9 100644
> --- a/net/xfrm/xfrm_state.c
> +++ b/net/xfrm/xfrm_state.c
> @@ -815,10 +815,10 @@ xfrm_init_tempstate(struct xfrm_state *x, const struct 
> flowi *fl,
>   afinfo->init_temprop(x, tmpl, daddr, saddr);
>  }
>  
> -static struct xfrm_state *__xfrm_state_lookup(struct net *net, u32 mark,
> -   const xfrm_address_t *daddr,
> -   __be32 spi, u8 proto,
> -   unsigned short family)
> +static struct xfrm_state *
> +__xfrm_state_lookup(struct net *net, u32 mark, u32 mask,
> + const xfrm_address_t *daddr,
> + __be32 spi, u8 proto, unsigned short family)

The argument list of these functions are getting longer and longer.
Can't you just put in a pointer to struct xfrm_mark and dereference
inside the function?

Looks good otherwise.


[PATCH 07/14] xfrm: use time64_t for in-kernel timestamps

2018-07-27 Thread Steffen Klassert
From: Arnd Bergmann 

The lifetime managment uses '__u64' timestamps on the user space
interface, but 'unsigned long' for reading the current time in the kernel
with get_seconds().

While this is probably safe beyond y2038, it will still overflow in 2106,
and the get_seconds() call is deprecated because fo that.

This changes the xfrm time handling to use time64_t consistently, along
with reading the time using the safer ktime_get_real_seconds(). It still
suffers from problems that can happen from a concurrent settimeofday()
call or (to a lesser degree) a leap second update, but since the time
stamps are part of the user API, there is nothing we can do to prevent
that.

Signed-off-by: Arnd Bergmann 
Signed-off-by: Steffen Klassert 
---
 net/xfrm/xfrm_policy.c | 24 
 net/xfrm/xfrm_state.c  | 10 +-
 2 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index ef75891450e7..5d2f734f4309 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -189,8 +189,8 @@ static inline unsigned long make_jiffies(long secs)
 static void xfrm_policy_timer(struct timer_list *t)
 {
struct xfrm_policy *xp = from_timer(xp, t, timer);
-   unsigned long now = get_seconds();
-   long next = LONG_MAX;
+   time64_t now = ktime_get_real_seconds();
+   time64_t next = TIME64_MAX;
int warn = 0;
int dir;
 
@@ -202,7 +202,7 @@ static void xfrm_policy_timer(struct timer_list *t)
dir = xfrm_policy_id2dir(xp->index);
 
if (xp->lft.hard_add_expires_seconds) {
-   long tmo = xp->lft.hard_add_expires_seconds +
+   time64_t tmo = xp->lft.hard_add_expires_seconds +
xp->curlft.add_time - now;
if (tmo <= 0)
goto expired;
@@ -210,7 +210,7 @@ static void xfrm_policy_timer(struct timer_list *t)
next = tmo;
}
if (xp->lft.hard_use_expires_seconds) {
-   long tmo = xp->lft.hard_use_expires_seconds +
+   time64_t tmo = xp->lft.hard_use_expires_seconds +
(xp->curlft.use_time ? : xp->curlft.add_time) - now;
if (tmo <= 0)
goto expired;
@@ -218,7 +218,7 @@ static void xfrm_policy_timer(struct timer_list *t)
next = tmo;
}
if (xp->lft.soft_add_expires_seconds) {
-   long tmo = xp->lft.soft_add_expires_seconds +
+   time64_t tmo = xp->lft.soft_add_expires_seconds +
xp->curlft.add_time - now;
if (tmo <= 0) {
warn = 1;
@@ -228,7 +228,7 @@ static void xfrm_policy_timer(struct timer_list *t)
next = tmo;
}
if (xp->lft.soft_use_expires_seconds) {
-   long tmo = xp->lft.soft_use_expires_seconds +
+   time64_t tmo = xp->lft.soft_use_expires_seconds +
(xp->curlft.use_time ? : xp->curlft.add_time) - now;
if (tmo <= 0) {
warn = 1;
@@ -240,7 +240,7 @@ static void xfrm_policy_timer(struct timer_list *t)
 
if (warn)
km_policy_expired(xp, dir, 0, 0);
-   if (next != LONG_MAX &&
+   if (next != TIME64_MAX &&
!mod_timer(>timer, jiffies + make_jiffies(next)))
xfrm_pol_hold(xp);
 
@@ -791,7 +791,7 @@ int xfrm_policy_insert(int dir, struct xfrm_policy *policy, 
int excl)
}
policy->index = delpol ? delpol->index : xfrm_gen_index(net, dir, 
policy->index);
hlist_add_head(>byidx, net->xfrm.policy_byidx+idx_hash(net, 
policy->index));
-   policy->curlft.add_time = get_seconds();
+   policy->curlft.add_time = ktime_get_real_seconds();
policy->curlft.use_time = 0;
if (!mod_timer(>timer, jiffies + HZ))
xfrm_pol_hold(policy);
@@ -1282,7 +1282,7 @@ int xfrm_sk_policy_insert(struct sock *sk, int dir, 
struct xfrm_policy *pol)
old_pol = rcu_dereference_protected(sk->sk_policy[dir],
lockdep_is_held(>xfrm.xfrm_policy_lock));
if (pol) {
-   pol->curlft.add_time = get_seconds();
+   pol->curlft.add_time = ktime_get_real_seconds();
pol->index = xfrm_gen_index(net, XFRM_POLICY_MAX+dir, 0);
xfrm_sk_policy_link(pol, dir);
}
@@ -2132,7 +2132,7 @@ struct dst_entry *xfrm_lookup(struct net *net, struct 
dst_entry *dst_orig,
}
 
for (i = 0; i < num_pols; i++)
-   pols[i]->curlft.use_time = get_seconds();
+   pols[i]->curlft.use_time = ktime_get_real_seconds();
 
if (num_xfrms < 0) {
/* Prohibit the flow */
@@ -2352,7 +2352,7 @@ int __xfrm_policy_chec

[PATCH 01/14] xfrm: Extend the output_mark to support input direction and masking.

2018-07-27 Thread Steffen Klassert
We already support setting an output mark at the xfrm_state,
unfortunately this does not support the input direction and
masking the marks that will be applied to the skb. This change
adds support applying a masked value in both directions.

The existing XFRMA_OUTPUT_MARK number is reused for this purpose
and as it is now bi-directional, it is renamed to XFRMA_SET_MARK.

An additional XFRMA_SET_MARK_MASK attribute is added for setting the
mask. If the attribute mask not provided, it is set to 0x,
keeping the XFRMA_OUTPUT_MARK existing 'full mask' semantics.

Co-developed-by: Tobias Brunner 
Co-developed-by: Eyal Birger 
Co-developed-by: Lorenzo Colitti 
Signed-off-by: Steffen Klassert 
Signed-off-by: Tobias Brunner 
Signed-off-by: Eyal Birger 
Signed-off-by: Lorenzo Colitti 
---
 include/net/xfrm.h|  9 -
 include/uapi/linux/xfrm.h |  4 +++-
 net/xfrm/xfrm_device.c|  3 ++-
 net/xfrm/xfrm_input.c |  2 ++
 net/xfrm/xfrm_output.c|  3 +--
 net/xfrm/xfrm_policy.c|  5 +++--
 net/xfrm/xfrm_user.c  | 48 +--
 7 files changed, 57 insertions(+), 17 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 557122846e0e..3dc83ba26f62 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -166,7 +166,7 @@ struct xfrm_state {
int header_len;
int trailer_len;
u32 extra_flags;
-   u32 output_mark;
+   struct xfrm_marksmark;
} props;
 
struct xfrm_lifetime_cfg lft;
@@ -2012,6 +2012,13 @@ static inline int xfrm_mark_put(struct sk_buff *skb, 
const struct xfrm_mark *m)
return ret;
 }
 
+static inline __u32 xfrm_smark_get(__u32 mark, struct xfrm_state *x)
+{
+   struct xfrm_mark *m = >props.smark;
+
+   return (m->v & m->m) | (mark & ~m->m);
+}
+
 static inline int xfrm_tunnel_check(struct sk_buff *skb, struct xfrm_state *x,
unsigned int family)
 {
diff --git a/include/uapi/linux/xfrm.h b/include/uapi/linux/xfrm.h
index e3af2859188b..5a6ed7ce5a29 100644
--- a/include/uapi/linux/xfrm.h
+++ b/include/uapi/linux/xfrm.h
@@ -305,9 +305,11 @@ enum xfrm_attr_type_t {
XFRMA_ADDRESS_FILTER,   /* struct xfrm_address_filter */
XFRMA_PAD,
XFRMA_OFFLOAD_DEV,  /* struct xfrm_state_offload */
-   XFRMA_OUTPUT_MARK,  /* __u32 */
+   XFRMA_SET_MARK, /* __u32 */
+   XFRMA_SET_MARK_MASK,/* __u32 */
__XFRMA_MAX
 
+#define XFRMA_OUTPUT_MARK XFRMA_SET_MARK   /* Compatibility */
 #define XFRMA_MAX (__XFRMA_MAX - 1)
 };
 
diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c
index 175941e15a6e..16c1230d20fa 100644
--- a/net/xfrm/xfrm_device.c
+++ b/net/xfrm/xfrm_device.c
@@ -162,7 +162,8 @@ int xfrm_dev_state_add(struct net *net, struct xfrm_state 
*x,
}
 
dst = __xfrm_dst_lookup(net, 0, 0, saddr, daddr,
-   x->props.family, x->props.output_mark);
+   x->props.family,
+   xfrm_smark_get(0, x));
if (IS_ERR(dst))
return 0;
 
diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c
index 352abca2605f..074810436242 100644
--- a/net/xfrm/xfrm_input.c
+++ b/net/xfrm/xfrm_input.c
@@ -339,6 +339,8 @@ int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 
spi, int encap_type)
goto drop;
}
 
+   skb->mark = xfrm_smark_get(skb->mark, x);
+
skb->sp->xvec[skb->sp->len++] = x;
 
 lock:
diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
index 89b178a78dc7..45ba07ab3e4f 100644
--- a/net/xfrm/xfrm_output.c
+++ b/net/xfrm/xfrm_output.c
@@ -66,8 +66,7 @@ static int xfrm_output_one(struct sk_buff *skb, int err)
goto error_nolock;
}
 
-   if (x->props.output_mark)
-   skb->mark = x->props.output_mark;
+   skb->mark = xfrm_smark_get(skb->mark, x);
 
err = x->outer_mode->output(x, skb);
if (err) {
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 5f48251c1319..7637637717ec 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -1607,10 +1607,11 @@ static struct dst_entry *xfrm_bundle_create(struct 
xfrm_policy *policy,
dst_copy_metrics(dst1, dst);
 
if (xfrm[i]->props.mode != XFRM_MODE_TRANSPORT) {
+   __u32 mark = xfrm_smark_get(fl->flowi_mark, xfrm[i]);
+
family = xfrm[i]->props.family;
dst = xfrm_dst_lookup(xfrm[i], tos, fl->flowi_oif,
- , , 

[PATCH 12/14] xfrm: fix 'passing zero to ERR_PTR()' warning

2018-07-27 Thread Steffen Klassert
From: YueHaibing 

Fix a static code checker warning:

  net/xfrm/xfrm_policy.c:1836 xfrm_resolve_and_create_bundle() warn: passing 
zero to 'ERR_PTR'

xfrm_tmpl_resolve return 0 just means no xdst found, return NULL
instead of passing zero to ERR_PTR.

Fixes: d809ec895505 ("xfrm: do not assume that template resolving always 
returns xfrms")
Signed-off-by: YueHaibing 
Signed-off-by: Steffen Klassert 
---
 net/xfrm/xfrm_policy.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 2f70fe68b9b0..69f06f879091 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -1752,7 +1752,10 @@ xfrm_resolve_and_create_bundle(struct xfrm_policy 
**pols, int num_pols,
/* Try to instantiate a bundle */
err = xfrm_tmpl_resolve(pols, num_pols, fl, xfrm, family);
if (err <= 0) {
-   if (err != 0 && err != -EAGAIN)
+   if (err == 0)
+   return NULL;
+
+   if (err != -EAGAIN)
XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTPOLERROR);
return ERR_PTR(err);
}
-- 
2.14.1



[PATCH 08/14] ipv6: xfrm: use 64-bit timestamps

2018-07-27 Thread Steffen Klassert
From: Arnd Bergmann 

get_seconds() is deprecated because it can overflow on 32-bit
architectures.  For the xfrm_state->lastused member, we treat the data
as a 64-bit number already, so we just need to use the right accessor
that works on both 32-bit and 64-bit machines.

Signed-off-by: Arnd Bergmann 
Signed-off-by: Steffen Klassert 
---
 include/net/xfrm.h   | 2 +-
 net/ipv6/xfrm6_mode_ro.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index a5378613a49c..1350e2cf0749 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -227,7 +227,7 @@ struct xfrm_state {
longsaved_tmo;
 
/* Last used time */
-   unsigned long   lastused;
+   time64_tlastused;
 
struct page_frag xfrag;
 
diff --git a/net/ipv6/xfrm6_mode_ro.c b/net/ipv6/xfrm6_mode_ro.c
index 07d36573f50b..da28e4407b8f 100644
--- a/net/ipv6/xfrm6_mode_ro.c
+++ b/net/ipv6/xfrm6_mode_ro.c
@@ -55,7 +55,7 @@ static int xfrm6_ro_output(struct xfrm_state *x, struct 
sk_buff *skb)
__skb_pull(skb, hdr_len);
memmove(ipv6_hdr(skb), iph, hdr_len);
 
-   x->lastused = get_seconds();
+   x->lastused = ktime_get_real_seconds();
 
return 0;
 }
-- 
2.14.1



[PATCH 03/14] xfrm: Add a new lookup key to match xfrm interfaces.

2018-07-27 Thread Steffen Klassert
This patch adds the xfrm interface id as a lookup key
for xfrm states and policies. With this we can assign
states and policies to virtual xfrm interfaces.

Signed-off-by: Steffen Klassert 
Acked-by: Shannon Nelson 
Acked-by: Benedict Wong 
Tested-by: Benedict Wong 
Tested-by: Antony Antony 
Reviewed-by: Eyal Birger 
---
 include/net/xfrm.h| 21 +-
 include/uapi/linux/xfrm.h |  1 +
 net/core/pktgen.c |  2 +-
 net/key/af_key.c  |  6 +++---
 net/xfrm/xfrm_policy.c| 18 +++-
 net/xfrm/xfrm_state.c | 19 -
 net/xfrm/xfrm_user.c  | 54 +--
 7 files changed, 96 insertions(+), 25 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 3dc83ba26f62..e8bada4d2a45 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -147,6 +147,7 @@ struct xfrm_state {
struct xfrm_id  id;
struct xfrm_selectorsel;
struct xfrm_markmark;
+   u32 if_id;
u32 tfcpad;
 
u32 genid;
@@ -574,6 +575,7 @@ struct xfrm_policy {
atomic_tgenid;
u32 priority;
u32 index;
+   u32 if_id;
struct xfrm_markmark;
struct xfrm_selectorselector;
struct xfrm_lifetime_cfg lft;
@@ -1533,7 +1535,7 @@ struct xfrm_state *xfrm_state_find(const xfrm_address_t 
*daddr,
   struct xfrm_tmpl *tmpl,
   struct xfrm_policy *pol, int *err,
   unsigned short family);
-struct xfrm_state *xfrm_stateonly_find(struct net *net, u32 mark,
+struct xfrm_state *xfrm_stateonly_find(struct net *net, u32 mark, u32 if_id,
   xfrm_address_t *daddr,
   xfrm_address_t *saddr,
   unsigned short family,
@@ -1690,20 +1692,20 @@ int xfrm_policy_walk(struct net *net, struct 
xfrm_policy_walk *walk,
 void *);
 void xfrm_policy_walk_done(struct xfrm_policy_walk *walk, struct net *net);
 int xfrm_policy_insert(int dir, struct xfrm_policy *policy, int excl);
-struct xfrm_policy *xfrm_policy_bysel_ctx(struct net *net, u32 mark,
+struct xfrm_policy *xfrm_policy_bysel_ctx(struct net *net, u32 mark, u32 if_id,
  u8 type, int dir,
  struct xfrm_selector *sel,
  struct xfrm_sec_ctx *ctx, int delete,
  int *err);
-struct xfrm_policy *xfrm_policy_byid(struct net *net, u32 mark, u8, int dir,
-u32 id, int delete, int *err);
+struct xfrm_policy *xfrm_policy_byid(struct net *net, u32 mark, u32 if_id, u8,
+int dir, u32 id, int delete, int *err);
 int xfrm_policy_flush(struct net *net, u8 type, bool task_valid);
 void xfrm_policy_hash_rebuild(struct net *net);
 u32 xfrm_get_acqseq(void);
 int verify_spi_info(u8 proto, u32 min, u32 max);
 int xfrm_alloc_spi(struct xfrm_state *x, u32 minspi, u32 maxspi);
 struct xfrm_state *xfrm_find_acq(struct net *net, const struct xfrm_mark *mark,
-u8 mode, u32 reqid, u8 proto,
+u8 mode, u32 reqid, u32 if_id, u8 proto,
 const xfrm_address_t *daddr,
 const xfrm_address_t *saddr, int create,
 unsigned short family);
@@ -2019,6 +2021,15 @@ static inline __u32 xfrm_smark_get(__u32 mark, struct 
xfrm_state *x)
return (m->v & m->m) | (mark & ~m->m);
 }
 
+static inline int xfrm_if_id_put(struct sk_buff *skb, __u32 if_id)
+{
+   int ret = 0;
+
+   if (if_id)
+   ret = nla_put_u32(skb, XFRMA_IF_ID, if_id);
+   return ret;
+}
+
 static inline int xfrm_tunnel_check(struct sk_buff *skb, struct xfrm_state *x,
unsigned int family)
 {
diff --git a/include/uapi/linux/xfrm.h b/include/uapi/linux/xfrm.h
index 5a6ed7ce5a29..5f3b9fec7b5f 100644
--- a/include/uapi/linux/xfrm.h
+++ b/include/uapi/linux/xfrm.h
@@ -307,6 +307,7 @@ enum xfrm_attr_type_t {
XFRMA_OFFLOAD_DEV,  /* struct xfrm_state_offload */
XFRMA_SET_MARK, /* __u32 */
XFRMA_SET_MARK_MASK,/* __u32 */
+   XFRMA_IF_ID,/* __u32 */
__XFRMA_MAX
 
 #define XFRMA_OUTPUT_MARK XFRMA_SET_MARK   /* Compatibility */
diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index 49368e21d228..6d37dbf0aa64 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -2255,7 +2255,7 @@ static void get_ipsec_sa(struct pktgen_dev *pkt_dev, int 
flow)
x = xfrm_state_lookup_byspi(pn->

[PATCH 05/14] xfrm: policy: remove pcpu policy cache

2018-07-27 Thread Steffen Klassert
From: Florian Westphal 

Kristian Evensen says:
  In a project I am involved in, we are running ipsec (Strongswan) on
  different mt7621-based routers. Each router is configured as an
  initiator and has around ~30 tunnels to different responders (running
  on misc. devices). Before the flow cache was removed (kernel 4.9), we
  got a combined throughput of around 70Mbit/s for all tunnels on one
  router. However, we recently switched to kernel 4.14 (4.14.48), and
  the total throughput is somewhere around 57Mbit/s (best-case). I.e., a
  drop of around 20%. Reverting the flow cache removal restores, as
  expected, performance levels to that of kernel 4.9.

When pcpu xdst exists, it has to be validated first before it can be
used.

A negative hit thus increases cost vs. no-cache.

As number of tunnels increases, hit rate decreases so this pcpu caching
isn't a viable strategy.

Furthermore, the xdst cache also needs to run with BH off, so when
removing this the bh disable/enable pairs can be removed too.

Kristian tested a 4.14.y backport of this change and reported
increased performance:

  In our tests, the throughput reduction has been reduced from around -20%
  to -5%. We also see that the overall throughput is independent of the
  number of tunnels, while before the throughput was reduced as the number
  of tunnels increased.

Reported-by: Kristian Evensen 
Signed-off-by: Florian Westphal 
Signed-off-by: Steffen Klassert 
---
 include/net/xfrm.h |   1 -
 net/xfrm/xfrm_device.c |  10 
 net/xfrm/xfrm_policy.c | 139 +
 net/xfrm/xfrm_state.c  |   5 +-
 4 files changed, 3 insertions(+), 152 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 3fa578a6a819..a5378613a49c 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -332,7 +332,6 @@ int xfrm_policy_register_afinfo(const struct 
xfrm_policy_afinfo *afinfo, int fam
 void xfrm_policy_unregister_afinfo(const struct xfrm_policy_afinfo *afinfo);
 void km_policy_notify(struct xfrm_policy *xp, int dir,
  const struct km_event *c);
-void xfrm_policy_cache_flush(void);
 void km_state_notify(struct xfrm_state *x, const struct km_event *c);
 
 struct xfrm_tmpl;
diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c
index 16c1230d20fa..11d56a44e9e8 100644
--- a/net/xfrm/xfrm_device.c
+++ b/net/xfrm/xfrm_device.c
@@ -307,12 +307,6 @@ static int xfrm_dev_register(struct net_device *dev)
return xfrm_api_check(dev);
 }
 
-static int xfrm_dev_unregister(struct net_device *dev)
-{
-   xfrm_policy_cache_flush();
-   return NOTIFY_DONE;
-}
-
 static int xfrm_dev_feat_change(struct net_device *dev)
 {
return xfrm_api_check(dev);
@@ -323,7 +317,6 @@ static int xfrm_dev_down(struct net_device *dev)
if (dev->features & NETIF_F_HW_ESP)
xfrm_dev_state_flush(dev_net(dev), dev, true);
 
-   xfrm_policy_cache_flush();
return NOTIFY_DONE;
 }
 
@@ -335,9 +328,6 @@ static int xfrm_dev_event(struct notifier_block *this, 
unsigned long event, void
case NETDEV_REGISTER:
return xfrm_dev_register(dev);
 
-   case NETDEV_UNREGISTER:
-   return xfrm_dev_unregister(dev);
-
case NETDEV_FEAT_CHANGE:
return xfrm_dev_feat_change(dev);
 
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index d960ea6657b5..ef75891450e7 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -45,8 +45,6 @@ struct xfrm_flo {
u8 flags;
 };
 
-static DEFINE_PER_CPU(struct xfrm_dst *, xfrm_last_dst);
-static struct work_struct *xfrm_pcpu_work __read_mostly;
 static DEFINE_SPINLOCK(xfrm_if_cb_lock);
 static struct xfrm_if_cb const __rcu *xfrm_if_cb __read_mostly;
 
@@ -1732,108 +1730,6 @@ static int xfrm_expand_policies(const struct flowi *fl, 
u16 family,
 
 }
 
-static void xfrm_last_dst_update(struct xfrm_dst *xdst, struct xfrm_dst *old)
-{
-   this_cpu_write(xfrm_last_dst, xdst);
-   if (old)
-   dst_release(>u.dst);
-}
-
-static void __xfrm_pcpu_work_fn(void)
-{
-   struct xfrm_dst *old;
-
-   old = this_cpu_read(xfrm_last_dst);
-   if (old && !xfrm_bundle_ok(old))
-   xfrm_last_dst_update(NULL, old);
-}
-
-static void xfrm_pcpu_work_fn(struct work_struct *work)
-{
-   local_bh_disable();
-   rcu_read_lock();
-   __xfrm_pcpu_work_fn();
-   rcu_read_unlock();
-   local_bh_enable();
-}
-
-void xfrm_policy_cache_flush(void)
-{
-   struct xfrm_dst *old;
-   bool found = false;
-   int cpu;
-
-   might_sleep();
-
-   local_bh_disable();
-   rcu_read_lock();
-   for_each_possible_cpu(cpu) {
-   old = per_cpu(xfrm_last_dst, cpu);
-   if (old && !xfrm_bundle_ok(old)) {
-   if (smp_processor_id() == cpu) {
-   __xfrm_pcpu_work_fn();
- 

[PATCH 09/14] xfrm: don't check offload_handle for nonzero

2018-07-27 Thread Steffen Klassert
From: Shannon Nelson 

The offload_handle should be an opaque data cookie for the driver
to use, much like the data cookie for a timer or alarm callback.
Thus, the XFRM stack should not be checking for non-zero, because
the driver might use that to store an array reference, which could
be zero, or some other zero but meaningful value.

We can remove the checks for non-zero because there are plenty
other attributes also being checked to see if there is an offload
in place for the SA in question.

Signed-off-by: Shannon Nelson 
Signed-off-by: Steffen Klassert 
---
 net/ipv4/esp4_offload.c | 6 ++
 net/ipv6/esp6_offload.c | 6 ++
 net/xfrm/xfrm_device.c  | 6 +++---
 3 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/net/ipv4/esp4_offload.c b/net/ipv4/esp4_offload.c
index 7cf755ef9efb..133589d693a9 100644
--- a/net/ipv4/esp4_offload.c
+++ b/net/ipv4/esp4_offload.c
@@ -135,8 +135,7 @@ static struct sk_buff *esp4_gso_segment(struct sk_buff *skb,
 
skb->encap_hdr_csum = 1;
 
-   if (!(features & NETIF_F_HW_ESP) || !x->xso.offload_handle ||
-   (x->xso.dev != skb->dev))
+   if (!(features & NETIF_F_HW_ESP) || x->xso.dev != skb->dev)
esp_features = features & ~(NETIF_F_SG | NETIF_F_CSUM_MASK);
else if (!(features & NETIF_F_HW_ESP_TX_CSUM))
esp_features = features & ~NETIF_F_CSUM_MASK;
@@ -179,8 +178,7 @@ static int esp_xmit(struct xfrm_state *x, struct sk_buff 
*skb,  netdev_features_
if (!xo)
return -EINVAL;
 
-   if (!(features & NETIF_F_HW_ESP) || !x->xso.offload_handle ||
-   (x->xso.dev != skb->dev)) {
+   if (!(features & NETIF_F_HW_ESP) || x->xso.dev != skb->dev) {
xo->flags |= CRYPTO_FALLBACK;
hw_offload = false;
}
diff --git a/net/ipv6/esp6_offload.c b/net/ipv6/esp6_offload.c
index 27f59b61f70f..96af267835c3 100644
--- a/net/ipv6/esp6_offload.c
+++ b/net/ipv6/esp6_offload.c
@@ -162,8 +162,7 @@ static struct sk_buff *esp6_gso_segment(struct sk_buff *skb,
 
skb->encap_hdr_csum = 1;
 
-   if (!(features & NETIF_F_HW_ESP) || !x->xso.offload_handle ||
-   (x->xso.dev != skb->dev))
+   if (!(features & NETIF_F_HW_ESP) || x->xso.dev != skb->dev)
esp_features = features & ~(NETIF_F_SG | NETIF_F_CSUM_MASK);
else if (!(features & NETIF_F_HW_ESP_TX_CSUM))
esp_features = features & ~NETIF_F_CSUM_MASK;
@@ -207,8 +206,7 @@ static int esp6_xmit(struct xfrm_state *x, struct sk_buff 
*skb,  netdev_features
if (!xo)
return -EINVAL;
 
-   if (!(features & NETIF_F_HW_ESP) || !x->xso.offload_handle ||
-   (x->xso.dev != skb->dev)) {
+   if (!(features & NETIF_F_HW_ESP) || x->xso.dev != skb->dev) {
xo->flags |= CRYPTO_FALLBACK;
hw_offload = false;
}
diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c
index 11d56a44e9e8..5611b7521020 100644
--- a/net/xfrm/xfrm_device.c
+++ b/net/xfrm/xfrm_device.c
@@ -56,7 +56,7 @@ struct sk_buff *validate_xmit_xfrm(struct sk_buff *skb, 
netdev_features_t featur
if (skb_is_gso(skb)) {
struct net_device *dev = skb->dev;
 
-   if (unlikely(!x->xso.offload_handle || (x->xso.dev != dev))) {
+   if (unlikely(x->xso.dev != dev)) {
struct sk_buff *segs;
 
/* Packet got rerouted, fixup features and segment it. 
*/
@@ -211,8 +211,8 @@ bool xfrm_dev_offload_ok(struct sk_buff *skb, struct 
xfrm_state *x)
if (!x->type_offload || x->encap)
return false;
 
-   if ((!dev || (x->xso.offload_handle && (dev == 
xfrm_dst_path(dst)->dev))) &&
-(!xdst->child->xfrm && x->type->get_mtu)) {
+   if ((!dev || (dev == xfrm_dst_path(dst)->dev)) &&
+   (!xdst->child->xfrm && x->type->get_mtu)) {
mtu = x->type->get_mtu(x, xdst->child_mtu_cached);
 
if (skb->len <= mtu)
-- 
2.14.1



pull request (net-next): ipsec-next 2018-07-27

2018-07-27 Thread Steffen Klassert
1) Extend the output_mark to also support the input direction
   and masking the mark values before applying to the skb.

2) Add a new lookup key for the upcomming xfrm interfaces.

3) Extend the xfrm lookups to match xfrm interface IDs.

4) Add virtual xfrm interfaces. The purpose of these interfaces
   is to overcome the design limitations that the existing
   VTI devices have.

  The main limitations that we see with the current VTI are the
  following:

  VTI interfaces are L3 tunnels with configurable endpoints.
  For xfrm, the tunnel endpoint are already determined by the SA.
  So the VTI tunnel endpoints must be either the same as on the
  SA or wildcards. In case VTI tunnel endpoints are same as on
  the SA, we get a one to one correlation between the SA and
  the tunnel. So each SA needs its own tunnel interface.

  On the other hand, we can have only one VTI tunnel with
  wildcard src/dst tunnel endpoints in the system because the
  lookup is based on the tunnel endpoints. The existing tunnel
  lookup won't work with multiple tunnels with wildcard
  tunnel endpoints. Some usecases require more than on
  VTI tunnel of this type, for example if somebody has multiple
  namespaces and every namespace requires such a VTI.

  VTI needs separate interfaces for IPv4 and IPv6 tunnels.
  So when routing to a VTI, we have to know to which address
  family this traffic class is going to be encapsulated.
  This is a lmitation because it makes routing more complex
  and it is not always possible to know what happens behind the
  VTI, e.g. when the VTI is move to some namespace.

  VTI works just with tunnel mode SAs. We need generic interfaces
  that ensures transfomation, regardless of the xfrm mode and
  the encapsulated address family.

  VTI is configured with a combination GRE keys and xfrm marks.
  With this we have to deal with some extra cases in the generic
  tunnel lookup because the GRE keys on the VTI are actually
  not GRE keys, the GRE keys were just reused for something else.
  All extensions to the VTI interfaces would require to add
  even more complexity to the generic tunnel lookup.

  So to overcome this, we developed xfrm interfaces with the
  following design goal:

  It should be possible to tunnel IPv4 and IPv6 through the same
  interface.

  No limitation on xfrm mode (tunnel, transport and beet).

  Should be a generic virtual interface that ensures IPsec
  transformation, no need to know what happens behind the
  interface.

  Interfaces should be configured with a new key that must match a
  new policy/SA lookup key.

  The lookup logic should stay in the xfrm codebase, no need to
  change or extend generic routing and tunnel lookups.

  Should be possible to use IPsec hardware offloads of the underlying
  interface.

5) Remove xfrm pcpu policy cache. This was added after the flowcache
   removal, but it turned out to make things even worse.
   From Florian Westphal.

6) Allow to update the set mark on SA updates.
   From Nathan Harold.

7) Convert some timestamps to time64_t.
   From Arnd Bergmann.

8) Don't check the offload_handle in xfrm code,
   it is an opaque data cookie for the driver.
   From Shannon Nelson.

9) Remove xfrmi interface ID from flowi. After this pach
   no generic code is touched anymore to do xfrm interface
   lookups. From Benedict Wong.

10) Allow to update the xfrm interface ID on SA updates.
From Nathan Harold.

11) Don't pass zero to ERR_PTR() in xfrm_resolve_and_create_bundle.
From YueHaibing.

12) Return more detailed errors on xfrm interface creation.
From Benedict Wong.

13) Use PTR_ERR_OR_ZERO instead of IS_ERR + PTR_ERR.
From the kbuild test robot.

Please pull or let me know if there are problems.

Thanks!

The following changes since commit dd55c4ea9e6ba957083f985f33f994c23b405e9e:

  Merge branch 'r8169-enable-ASPM-on-RTL8168E-VL' (2018-06-23 20:54:56 +0900)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next.git master

for you to fetch changes up to c6f5e017df9dfa9f6cbe70da008e7d716d726f1b:

  xfrm: fix ptr_ret.cocci warnings (2018-07-27 06:47:28 +0200)


Arnd Bergmann (2):
  xfrm: use time64_t for in-kernel timestamps
  ipv6: xfrm: use 64-bit timestamps

Benedict Wong (2):
  xfrm: Remove xfrmi interface ID from flowi
  xfrm: Return detailed errors from xfrmi_newlink

Florian Westphal (1):
  xfrm: policy: remove pcpu policy cache

Nathan Harold (2):
  xfrm: Allow Set Mark to be Updated Using UPDSA
  xfrm: Allow xfrmi if_id to be updated by UPDSA

Shannon Nelson (1):
  xfrm: don't check offload_handle for nonzero

Steffen Klassert (4):
  xfrm: Extend the output_mark to support input direction and masking.
  flow: Extend flow informations with xfrm interface id.
  xfrm: Add a new lookup key to match xfrm interfaces.
  xfrm: Add virtual xfrm interfaces

[PATCH 11/14] xfrm: Allow xfrmi if_id to be updated by UPDSA

2018-07-27 Thread Steffen Klassert
From: Nathan Harold 

Allow attaching an SA to an xfrm interface id after
the creation of the SA, so that tasks such as keying
which must be done as the SA is created, can remain
separate from the decision on how to route traffic
from an SA. This permits SA creation to be decomposed
in to three separate steps:
1) allocation of a SPI
2) algorithm and key negotiation
3) insertion into the data path

Signed-off-by: Nathan Harold 
Signed-off-by: Steffen Klassert 
---
 net/xfrm/xfrm_state.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index bd5cb7ad2447..b669262682c9 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -1561,10 +1561,14 @@ int xfrm_state_update(struct xfrm_state *x)
if (x1->curlft.use_time)
xfrm_state_check_expire(x1);
 
-   if (x->props.smark.m || x->props.smark.v) {
+   if (x->props.smark.m || x->props.smark.v || x->if_id) {
spin_lock_bh(>xfrm.xfrm_state_lock);
 
-   x1->props.smark = x->props.smark;
+   if (x->props.smark.m || x->props.smark.v)
+   x1->props.smark = x->props.smark;
+
+   if (x->if_id)
+   x1->if_id = x->if_id;
 
__xfrm_state_bump_genids(x1);
spin_unlock_bh(>xfrm.xfrm_state_lock);
-- 
2.14.1



[PATCH 04/14] xfrm: Add virtual xfrm interfaces

2018-07-27 Thread Steffen Klassert
This patch adds support for virtual xfrm interfaces.
Packets that are routed through such an interface
are guaranteed to be IPsec transformed or dropped.
It is a generic virtual interface that ensures IPsec
transformation, no need to know what happens behind
the interface. This means that we can tunnel IPv4 and
IPv6 through the same interface and support all xfrm
modes (tunnel, transport and beet) on it.

Co-developed-by: Lorenzo Colitti 
Co-developed-by: Benedict Wong 
Signed-off-by: Lorenzo Colitti 
Signed-off-by: Benedict Wong 
Signed-off-by: Steffen Klassert 
Acked-by: Shannon Nelson 
Tested-by: Benedict Wong 
Tested-by: Antony Antony 
Reviewed-by: Eyal Birger 
---
 include/net/xfrm.h   |  24 ++
 include/uapi/linux/if_link.h |  10 +
 net/xfrm/Kconfig |   8 +
 net/xfrm/Makefile|   1 +
 net/xfrm/xfrm_input.c|   3 +
 net/xfrm/xfrm_interface.c| 972 +++
 net/xfrm/xfrm_policy.c   |  43 ++
 7 files changed, 1061 insertions(+)
 create mode 100644 net/xfrm/xfrm_interface.c

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index e8bada4d2a45..3fa578a6a819 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -293,6 +294,13 @@ struct xfrm_replay {
int (*overflow)(struct xfrm_state *x, struct sk_buff *skb);
 };
 
+struct xfrm_if_cb {
+   struct xfrm_if  *(*decode_session)(struct sk_buff *skb);
+};
+
+void xfrm_if_register_cb(const struct xfrm_if_cb *ifcb);
+void xfrm_if_unregister_cb(void);
+
 struct net_device;
 struct xfrm_type;
 struct xfrm_dst;
@@ -1039,6 +1047,22 @@ static inline void xfrm_dst_destroy(struct xfrm_dst 
*xdst)
 
 void xfrm_dst_ifdown(struct dst_entry *dst, struct net_device *dev);
 
+struct xfrm_if_parms {
+   char name[IFNAMSIZ];/* name of XFRM device */
+   int link;   /* ifindex of underlying L2 interface */
+   u32 if_id;  /* interface identifyer */
+};
+
+struct xfrm_if {
+   struct xfrm_if __rcu *next; /* next interface in list */
+   struct net_device *dev; /* virtual device associated with 
interface */
+   struct net_device *phydev;  /* physical device */
+   struct net *net;/* netns for packet i/o */
+   struct xfrm_if_parms p; /* interface parms */
+
+   struct gro_cells gro_cells;
+};
+
 struct xfrm_offload {
/* Output sequence number for replay protection on offloading. */
struct {
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index cf01b6824244..bff0af507b32 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -459,6 +459,16 @@ enum {
 
 #define IFLA_MACSEC_MAX (__IFLA_MACSEC_MAX - 1)
 
+/* XFRM section */
+enum {
+   IFLA_XFRM_UNSPEC,
+   IFLA_XFRM_LINK,
+   IFLA_XFRM_IF_ID,
+   __IFLA_XFRM_MAX
+};
+
+#define IFLA_XFRM_MAX (__IFLA_XFRM_MAX - 1)
+
 enum macsec_validation_type {
MACSEC_VALIDATE_DISABLED = 0,
MACSEC_VALIDATE_CHECK = 1,
diff --git a/net/xfrm/Kconfig b/net/xfrm/Kconfig
index 286ed25c1a69..53381888a7b3 100644
--- a/net/xfrm/Kconfig
+++ b/net/xfrm/Kconfig
@@ -25,6 +25,14 @@ config XFRM_USER
 
  If unsure, say Y.
 
+config XFRM_INTERFACE
+   tristate "Transformation virtual interface"
+   depends on XFRM && IPV6
+   ---help---
+ This provides a virtual interface to route IPsec traffic.
+
+ If unsure, say N.
+
 config XFRM_SUB_POLICY
bool "Transformation sub policy support"
depends on XFRM
diff --git a/net/xfrm/Makefile b/net/xfrm/Makefile
index 0bd2465a8c5a..fbc4552d17b8 100644
--- a/net/xfrm/Makefile
+++ b/net/xfrm/Makefile
@@ -10,3 +10,4 @@ obj-$(CONFIG_XFRM_STATISTICS) += xfrm_proc.o
 obj-$(CONFIG_XFRM_ALGO) += xfrm_algo.o
 obj-$(CONFIG_XFRM_USER) += xfrm_user.o
 obj-$(CONFIG_XFRM_IPCOMP) += xfrm_ipcomp.o
+obj-$(CONFIG_XFRM_INTERFACE) += xfrm_interface.o
diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c
index 074810436242..b89c9c7f8c5c 100644
--- a/net/xfrm/xfrm_input.c
+++ b/net/xfrm/xfrm_input.c
@@ -320,6 +320,7 @@ int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 
spi, int encap_type)
 
seq = 0;
if (!spi && (err = xfrm_parse_spi(skb, nexthdr, , )) != 0) {
+   secpath_reset(skb);
XFRM_INC_STATS(net, LINUX_MIB_XFRMINHDRERROR);
goto drop;
}
@@ -328,12 +329,14 @@ int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 
spi, int encap_type)
   XFRM_SPI_SKB_CB(skb)->daddroff);
do {
if (skb->sp->len == XFRM_MAX_DEPTH) {
+   secpath_reset(skb);
XFRM_INC_STATS(net, LINUX_MIB_XFRMINBUFFERERROR);
goto drop;
}
 
x = xfrm_state_lookup(net, mark, daddr,

[PATCH 06/14] xfrm: Allow Set Mark to be Updated Using UPDSA

2018-07-27 Thread Steffen Klassert
From: Nathan Harold 

Allow UPDSA to change "set mark" to permit
policy separation of packet routing decisions from
SA keying in systems that use mark-based routing.

The set mark, used as a routing and firewall mark
for outbound packets, is made update-able which
allows routing decisions to be handled independently
of keying/SA creation. To maintain consistency with
other optional attributes, the set mark is only
updated if sent with a non-zero value.

The per-SA lock and the xfrm_state_lock are taken in
that order to avoid a deadlock with
xfrm_timer_handler(), which also takes the locks in
that order.

Signed-off-by: Nathan Harold 
Signed-off-by: Steffen Klassert 
---
 net/xfrm/xfrm_state.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index e04a510ec992..c9ffcdfa89f6 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -1562,6 +1562,15 @@ int xfrm_state_update(struct xfrm_state *x)
if (x1->curlft.use_time)
xfrm_state_check_expire(x1);
 
+   if (x->props.smark.m || x->props.smark.v) {
+   spin_lock_bh(>xfrm.xfrm_state_lock);
+
+   x1->props.smark = x->props.smark;
+
+   __xfrm_state_bump_genids(x1);
+   spin_unlock_bh(>xfrm.xfrm_state_lock);
+   }
+
err = 0;
x->km.state = XFRM_STATE_DEAD;
__xfrm_state_put(x);
-- 
2.14.1



[PATCH 02/14] flow: Extend flow informations with xfrm interface id.

2018-07-27 Thread Steffen Klassert
Add a new flowi_xfrm structure with informations needed to do
a xfrm lookup. At the moment it keeps the informations about
the new xfrm interface id needed to lookup xfrm interfaces
that are introduced with a followup patch. We need this new
lookup key as other possible keys, like the ifindex is
already part of the xfrm selector and used as a key to
enforce the output device after the transformation in the
policy/state lookup.

Signed-off-by: Steffen Klassert 
Acked-by: Shannon Nelson 
Acked-by: Benedict Wong 
Tested-by: Benedict Wong 
Tested-by: Antony Antony 
Reviewed-by: Eyal Birger 
---
 include/net/flow.h | 9 +
 1 file changed, 9 insertions(+)

diff --git a/include/net/flow.h b/include/net/flow.h
index 8ce21793094e..187c9bef672f 100644
--- a/include/net/flow.h
+++ b/include/net/flow.h
@@ -26,6 +26,10 @@ struct flowi_tunnel {
__be64  tun_id;
 };
 
+struct flowi_xfrm {
+   __u32   if_id;
+};
+
 struct flowi_common {
int flowic_oif;
int flowic_iif;
@@ -39,6 +43,7 @@ struct flowi_common {
 #define FLOWI_FLAG_SKIP_NH_OIF 0x04
__u32   flowic_secid;
struct flowi_tunnel flowic_tun_key;
+   struct flowi_xfrm xfrm;
kuid_t  flowic_uid;
 };
 
@@ -78,6 +83,7 @@ struct flowi4 {
 #define flowi4_secid   __fl_common.flowic_secid
 #define flowi4_tun_key __fl_common.flowic_tun_key
 #define flowi4_uid __fl_common.flowic_uid
+#define flowi4_xfrm__fl_common.xfrm
 
/* (saddr,daddr) must be grouped, same order as in IP header */
__be32  saddr;
@@ -109,6 +115,7 @@ static inline void flowi4_init_output(struct flowi4 *fl4, 
int oif,
fl4->flowi4_flags = flags;
fl4->flowi4_secid = 0;
fl4->flowi4_tun_key.tun_id = 0;
+   fl4->flowi4_xfrm.if_id = 0;
fl4->flowi4_uid = uid;
fl4->daddr = daddr;
fl4->saddr = saddr;
@@ -138,6 +145,7 @@ struct flowi6 {
 #define flowi6_secid   __fl_common.flowic_secid
 #define flowi6_tun_key __fl_common.flowic_tun_key
 #define flowi6_uid __fl_common.flowic_uid
+#define flowi6_xfrm__fl_common.xfrm
struct in6_addr daddr;
struct in6_addr saddr;
/* Note: flowi6_tos is encoded in flowlabel, too. */
@@ -185,6 +193,7 @@ struct flowi {
 #define flowi_secidu.__fl_common.flowic_secid
 #define flowi_tun_key  u.__fl_common.flowic_tun_key
 #define flowi_uid  u.__fl_common.flowic_uid
+#define flowi_xfrm u.__fl_common.xfrm
 } __attribute__((__aligned__(BITS_PER_LONG/8)));
 
 static inline struct flowi *flowi4_to_flowi(struct flowi4 *fl4)
-- 
2.14.1



[PATCH 13/14] xfrm: Return detailed errors from xfrmi_newlink

2018-07-27 Thread Steffen Klassert
From: Benedict Wong 

Currently all failure modes of xfrm interface creation return EEXIST.
This change improves the granularity of errnos provided by also
returning ENODEV or EINVAL if failures happen in looking up the
underlying interface, or a required parameter is not provided.

This change has been tested against the Android Kernel Networking Tests,
with additional xfrmi_newlink tests here:

https://android-review.googlesource.com/c/kernel/tests/+/715755

Signed-off-by: Benedict Wong 
Signed-off-by: Steffen Klassert 
---
 net/xfrm/xfrm_interface.c | 32 
 1 file changed, 20 insertions(+), 12 deletions(-)

diff --git a/net/xfrm/xfrm_interface.c b/net/xfrm/xfrm_interface.c
index ccfe18d67e98..481d7307ab51 100644
--- a/net/xfrm/xfrm_interface.c
+++ b/net/xfrm/xfrm_interface.c
@@ -149,14 +149,18 @@ static struct xfrm_if *xfrmi_create(struct net *net, 
struct xfrm_if_parms *p)
char name[IFNAMSIZ];
int err;
 
-   if (p->name[0])
+   if (p->name[0]) {
strlcpy(name, p->name, IFNAMSIZ);
-   else
+   } else {
+   err = -EINVAL;
goto failed;
+   }
 
dev = alloc_netdev(sizeof(*xi), name, NET_NAME_UNKNOWN, 
xfrmi_dev_setup);
-   if (!dev)
+   if (!dev) {
+   err = -EAGAIN;
goto failed;
+   }
 
dev_net_set(dev, net);
 
@@ -165,8 +169,10 @@ static struct xfrm_if *xfrmi_create(struct net *net, 
struct xfrm_if_parms *p)
xi->net = net;
xi->dev = dev;
xi->phydev = dev_get_by_index(net, p->link);
-   if (!xi->phydev)
+   if (!xi->phydev) {
+   err = -ENODEV;
goto failed_free;
+   }
 
err = xfrmi_create2(dev);
if (err < 0)
@@ -179,7 +185,7 @@ static struct xfrm_if *xfrmi_create(struct net *net, struct 
xfrm_if_parms *p)
 failed_free:
free_netdev(dev);
 failed:
-   return NULL;
+   return ERR_PTR(err);
 }
 
 static struct xfrm_if *xfrmi_locate(struct net *net, struct xfrm_if_parms *p,
@@ -194,13 +200,13 @@ static struct xfrm_if *xfrmi_locate(struct net *net, 
struct xfrm_if_parms *p,
 xip = >next) {
if (xi->p.if_id == p->if_id) {
if (create)
-   return NULL;
+   return ERR_PTR(-EEXIST);
 
return xi;
}
}
if (!create)
-   return NULL;
+   return ERR_PTR(-ENODEV);
return xfrmi_create(net, p);
 }
 
@@ -682,8 +688,9 @@ static int xfrmi_newlink(struct net *src_net, struct 
net_device *dev,
 
nla_strlcpy(p->name, tb[IFLA_IFNAME], IFNAMSIZ);
 
-   if (!xfrmi_locate(net, p, 1))
-   return -EEXIST;
+   xi = xfrmi_locate(net, p, 1);
+   if (IS_ERR(xi))
+   return PTR_ERR(xi);
 
return 0;
 }
@@ -704,11 +711,12 @@ static int xfrmi_changelink(struct net_device *dev, 
struct nlattr *tb[],
 
xi = xfrmi_locate(net, >p, 0);
 
-   if (xi) {
+   if (IS_ERR_OR_NULL(xi)) {
+   xi = netdev_priv(dev);
+   } else {
if (xi->dev != dev)
return -EEXIST;
-   } else
-   xi = netdev_priv(dev);
+   }
 
return xfrmi_update(xi, >p);
 }
-- 
2.14.1



[PATCH 14/14] xfrm: fix ptr_ret.cocci warnings

2018-07-27 Thread Steffen Klassert
From: kbuild test robot 

net/xfrm/xfrm_interface.c:692:1-3: WARNING: PTR_ERR_OR_ZERO can be used

 Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR

Generated by: scripts/coccinelle/api/ptr_ret.cocci

Fixes: 44e2b838c24d ("xfrm: Return detailed errors from xfrmi_newlink")
CC: Benedict Wong 
Signed-off-by: kbuild test robot 
Signed-off-by: Steffen Klassert 
---
 net/xfrm/xfrm_interface.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/net/xfrm/xfrm_interface.c b/net/xfrm/xfrm_interface.c
index 481d7307ab51..31acc6f33d98 100644
--- a/net/xfrm/xfrm_interface.c
+++ b/net/xfrm/xfrm_interface.c
@@ -689,10 +689,7 @@ static int xfrmi_newlink(struct net *src_net, struct 
net_device *dev,
nla_strlcpy(p->name, tb[IFLA_IFNAME], IFNAMSIZ);
 
xi = xfrmi_locate(net, p, 1);
-   if (IS_ERR(xi))
-   return PTR_ERR(xi);
-
-   return 0;
+   return PTR_ERR_OR_ZERO(xi);
 }
 
 static void xfrmi_dellink(struct net_device *dev, struct list_head *head)
-- 
2.14.1



[PATCH 10/14] xfrm: Remove xfrmi interface ID from flowi

2018-07-27 Thread Steffen Klassert
From: Benedict Wong 

In order to remove performance impact of having the extra u32 in every
single flowi, this change removes the flowi_xfrm struct, prefering to
take the if_id as a method parameter where needed.

In the inbound direction, if_id is only needed during the
__xfrm_check_policy() function, and the if_id can be determined at that
point based on the skb. As such, xfrmi_decode_session() is only called
with the skb in __xfrm_check_policy().

In the outbound direction, the only place where if_id is needed is the
xfrm_lookup() call in xfrmi_xmit2(). With this change, the if_id is
directly passed into the xfrm_lookup_with_ifid() call. All existing
callers can still call xfrm_lookup(), which uses a default if_id of 0.

This change does not change any behavior of XFRMIs except for improving
overall system performance via flowi size reduction.

This change has been tested against the Android Kernel Networking Tests:

https://android.googlesource.com/kernel/tests/+/master/net/test

Signed-off-by: Benedict Wong 
Signed-off-by: Steffen Klassert 
---
 include/net/dst.h | 14 +++
 include/net/flow.h|  9 -
 include/net/xfrm.h|  2 +-
 net/xfrm/xfrm_interface.c |  4 +-
 net/xfrm/xfrm_policy.c| 98 +++
 net/xfrm/xfrm_state.c |  3 +-
 6 files changed, 83 insertions(+), 47 deletions(-)

diff --git a/include/net/dst.h b/include/net/dst.h
index b3219cd8a5a1..7f735e76ca73 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -475,6 +475,14 @@ static inline struct dst_entry *xfrm_lookup(struct net 
*net,
return dst_orig;
 }
 
+static inline struct dst_entry *
+xfrm_lookup_with_ifid(struct net *net, struct dst_entry *dst_orig,
+ const struct flowi *fl, const struct sock *sk,
+ int flags, u32 if_id)
+{
+   return dst_orig;
+}
+
 static inline struct dst_entry *xfrm_lookup_route(struct net *net,
  struct dst_entry *dst_orig,
  const struct flowi *fl,
@@ -494,6 +502,12 @@ struct dst_entry *xfrm_lookup(struct net *net, struct 
dst_entry *dst_orig,
  const struct flowi *fl, const struct sock *sk,
  int flags);
 
+struct dst_entry *xfrm_lookup_with_ifid(struct net *net,
+   struct dst_entry *dst_orig,
+   const struct flowi *fl,
+   const struct sock *sk, int flags,
+   u32 if_id);
+
 struct dst_entry *xfrm_lookup_route(struct net *net, struct dst_entry 
*dst_orig,
const struct flowi *fl, const struct sock 
*sk,
int flags);
diff --git a/include/net/flow.h b/include/net/flow.h
index 187c9bef672f..8ce21793094e 100644
--- a/include/net/flow.h
+++ b/include/net/flow.h
@@ -26,10 +26,6 @@ struct flowi_tunnel {
__be64  tun_id;
 };
 
-struct flowi_xfrm {
-   __u32   if_id;
-};
-
 struct flowi_common {
int flowic_oif;
int flowic_iif;
@@ -43,7 +39,6 @@ struct flowi_common {
 #define FLOWI_FLAG_SKIP_NH_OIF 0x04
__u32   flowic_secid;
struct flowi_tunnel flowic_tun_key;
-   struct flowi_xfrm xfrm;
kuid_t  flowic_uid;
 };
 
@@ -83,7 +78,6 @@ struct flowi4 {
 #define flowi4_secid   __fl_common.flowic_secid
 #define flowi4_tun_key __fl_common.flowic_tun_key
 #define flowi4_uid __fl_common.flowic_uid
-#define flowi4_xfrm__fl_common.xfrm
 
/* (saddr,daddr) must be grouped, same order as in IP header */
__be32  saddr;
@@ -115,7 +109,6 @@ static inline void flowi4_init_output(struct flowi4 *fl4, 
int oif,
fl4->flowi4_flags = flags;
fl4->flowi4_secid = 0;
fl4->flowi4_tun_key.tun_id = 0;
-   fl4->flowi4_xfrm.if_id = 0;
fl4->flowi4_uid = uid;
fl4->daddr = daddr;
fl4->saddr = saddr;
@@ -145,7 +138,6 @@ struct flowi6 {
 #define flowi6_secid   __fl_common.flowic_secid
 #define flowi6_tun_key __fl_common.flowic_tun_key
 #define flowi6_uid __fl_common.flowic_uid
-#define flowi6_xfrm__fl_common.xfrm
struct in6_addr daddr;
struct in6_addr saddr;
/* Note: flowi6_tos is encoded in flowlabel, too. */
@@ -193,7 +185,6 @@ struct flowi {
 #define flowi_secidu.__fl_common.flowic_secid
 #define flowi_tun_key  u.__fl_common.flowic_tun_key
 #define flowi_uid  u.__fl_common.flowic_uid
-#define flowi_xfrm u.__fl_common.xfrm
 } __attribute__((__aligned__(BITS_PER_LONG/8)));
 
 static inline struct flowi *flowi4_to_flowi(struct flowi4 *fl4)
diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 1350e2cf0749..ca820945f30c 100644
---

[PATCH 5/5] esp6: fix memleak on error path in esp6_input

2018-07-27 Thread Steffen Klassert
From: Zhen Lei 

This ought to be an omission in e6194923237 ("esp: Fix memleaks on error
paths."). The memleak on error path in esp6_input is similar to esp_input
of esp4.

Fixes: e6194923237 ("esp: Fix memleaks on error paths.")
Fixes: 3f29770723f ("ipsec: check return value of skb_to_sgvec always")
Signed-off-by: Zhen Lei 
Signed-off-by: Steffen Klassert 
---
 net/ipv6/esp6.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/esp6.c b/net/ipv6/esp6.c
index 97513f35bcc5..88a7579c23bd 100644
--- a/net/ipv6/esp6.c
+++ b/net/ipv6/esp6.c
@@ -669,8 +669,10 @@ static int esp6_input(struct xfrm_state *x, struct sk_buff 
*skb)
 
sg_init_table(sg, nfrags);
ret = skb_to_sgvec(skb, sg, 0, skb->len);
-   if (unlikely(ret < 0))
+   if (unlikely(ret < 0)) {
+   kfree(tmp);
goto out;
+   }
 
skb->ip_summed = CHECKSUM_NONE;
 
-- 
2.14.1



[PATCH 2/5] xfrm_user: prevent leaking 2 bytes of kernel memory

2018-07-27 Thread Steffen Klassert
From: Eric Dumazet 

struct xfrm_userpolicy_type has two holes, so we should not
use C99 style initializer.

KMSAN report:

BUG: KMSAN: kernel-infoleak in copyout lib/iov_iter.c:140 [inline]
BUG: KMSAN: kernel-infoleak in _copy_to_iter+0x1b14/0x2800 lib/iov_iter.c:571
CPU: 1 PID: 4520 Comm: syz-executor841 Not tainted 4.17.0+ #5
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x185/0x1d0 lib/dump_stack.c:113
 kmsan_report+0x188/0x2a0 mm/kmsan/kmsan.c:1117
 kmsan_internal_check_memory+0x138/0x1f0 mm/kmsan/kmsan.c:1211
 kmsan_copy_to_user+0x7a/0x160 mm/kmsan/kmsan.c:1253
 copyout lib/iov_iter.c:140 [inline]
 _copy_to_iter+0x1b14/0x2800 lib/iov_iter.c:571
 copy_to_iter include/linux/uio.h:106 [inline]
 skb_copy_datagram_iter+0x422/0xfa0 net/core/datagram.c:431
 skb_copy_datagram_msg include/linux/skbuff.h:3268 [inline]
 netlink_recvmsg+0x6f1/0x1900 net/netlink/af_netlink.c:1959
 sock_recvmsg_nosec net/socket.c:802 [inline]
 sock_recvmsg+0x1d6/0x230 net/socket.c:809
 ___sys_recvmsg+0x3fe/0x810 net/socket.c:2279
 __sys_recvmmsg+0x58e/0xe30 net/socket.c:2391
 do_sys_recvmmsg+0x2a6/0x3e0 net/socket.c:2472
 __do_sys_recvmmsg net/socket.c:2485 [inline]
 __se_sys_recvmmsg net/socket.c:2481 [inline]
 __x64_sys_recvmmsg+0x15d/0x1c0 net/socket.c:2481
 do_syscall_64+0x15b/0x230 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x446ce9
RSP: 002b:7fc307918db8 EFLAGS: 0293 ORIG_RAX: 012b
RAX: ffda RBX: 006dbc24 RCX: 00446ce9
RDX: 000a RSI: 20005040 RDI: 0003
RBP: 006dbc20 R08: 20004e40 R09: 
R10: 4000 R11: 0293 R12: 
R13: 7ffc8d2df32f R14: 7fc3079199c0 R15: 0001

Uninit was stored to memory at:
 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:279 [inline]
 kmsan_save_stack mm/kmsan/kmsan.c:294 [inline]
 kmsan_internal_chain_origin+0x12b/0x210 mm/kmsan/kmsan.c:685
 kmsan_memcpy_origins+0x11d/0x170 mm/kmsan/kmsan.c:527
 __msan_memcpy+0x109/0x160 mm/kmsan/kmsan_instr.c:413
 __nla_put lib/nlattr.c:569 [inline]
 nla_put+0x276/0x340 lib/nlattr.c:627
 copy_to_user_policy_type net/xfrm/xfrm_user.c:1678 [inline]
 dump_one_policy+0xbe1/0x1090 net/xfrm/xfrm_user.c:1708
 xfrm_policy_walk+0x45a/0xd00 net/xfrm/xfrm_policy.c:1013
 xfrm_dump_policy+0x1c0/0x2a0 net/xfrm/xfrm_user.c:1749
 netlink_dump+0x9b5/0x1550 net/netlink/af_netlink.c:2226
 __netlink_dump_start+0x1131/0x1270 net/netlink/af_netlink.c:2323
 netlink_dump_start include/linux/netlink.h:214 [inline]
 xfrm_user_rcv_msg+0x8a3/0x9b0 net/xfrm/xfrm_user.c:2577
 netlink_rcv_skb+0x37e/0x600 net/netlink/af_netlink.c:2448
 xfrm_netlink_rcv+0xb2/0xf0 net/xfrm/xfrm_user.c:2598
 netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
 netlink_unicast+0x1680/0x1750 net/netlink/af_netlink.c:1336
 netlink_sendmsg+0x104f/0x1350 net/netlink/af_netlink.c:1901
 sock_sendmsg_nosec net/socket.c:629 [inline]
 sock_sendmsg net/socket.c:639 [inline]
 ___sys_sendmsg+0xec8/0x1320 net/socket.c:2117
 __sys_sendmsg net/socket.c:2155 [inline]
 __do_sys_sendmsg net/socket.c:2164 [inline]
 __se_sys_sendmsg net/socket.c:2162 [inline]
 __x64_sys_sendmsg+0x331/0x460 net/socket.c:2162
 do_syscall_64+0x15b/0x230 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
Local variable description: upt.i@dump_one_policy
Variable was created at:
 dump_one_policy+0x78/0x1090 net/xfrm/xfrm_user.c:1689
 xfrm_policy_walk+0x45a/0xd00 net/xfrm/xfrm_policy.c:1013

Byte 130 of 137 is uninitialized
Memory access starts at 88019550407f

Fixes: c0144beaeca42 ("[XFRM] netlink: Use nla_put()/NLA_PUT() variantes")
Signed-off-by: Eric Dumazet 
Reported-by: syzbot 
Cc: Steffen Klassert 
Cc: Herbert Xu 
Signed-off-by: Steffen Klassert 
---
 net/xfrm/xfrm_user.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index 080035f056d9..1e50b70ad668 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -1671,9 +1671,11 @@ static inline unsigned int userpolicy_type_attrsize(void)
 #ifdef CONFIG_XFRM_SUB_POLICY
 static int copy_to_user_policy_type(u8 type, struct sk_buff *skb)
 {
-   struct xfrm_userpolicy_type upt = {
-   .type = type,
-   };
+   struct xfrm_userpolicy_type upt;
+
+   /* Sadly there are two holes in struct xfrm_userpolicy_type */
+   memset(, 0, sizeof(upt));
+   upt.type = type;
 
return nla_put(skb, XFRMA_POLICY_TYPE, sizeof(upt), );
 }
-- 
2.14.1



pull request (net): ipsec 2018-07-27

2018-07-27 Thread Steffen Klassert
1) Fix PMTU handling of vti6. We update the PMTU on
   the xfrm dst_entry which is not cached anymore
   after the flowchache removal. So update the
   PMTU of the original dst_entry instead.
   From Eyal Birger.

2) Fix a leak of kernel memory to userspace.
   From Eric Dumazet.

3) Fix a possible dst_entry memleak in xfrm_lookup_route.
   From Tommi Rantala.

4) Fix a skb leak in case we can't call nlmsg_multicast
   from xfrm_nlmsg_multicast. From Florian Westphal.

5) Fix a leak of a temporary buffer in the error path of
   esp6_input. From Zhen Lei.

Please pull or let me know if there are problems.

Thanks!

The following changes since commit 1c8c5a9d38f607c0b6fd12c91cbe1a4418762a21:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 
(2018-06-06 18:39:49 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec.git master

for you to fetch changes up to 7284fdf39a912322ce97de2d30def3c6068a418c:

  esp6: fix memleak on error path in esp6_input (2018-06-27 17:32:11 +0200)


Eric Dumazet (1):
  xfrm_user: prevent leaking 2 bytes of kernel memory

Eyal Birger (1):
  vti6: fix PMTU caching and reporting on xmit

Florian Westphal (1):
  xfrm: free skb if nlsk pointer is NULL

Tommi Rantala (1):
  xfrm: fix missing dst_release() after policy blocking lbcast and multicast

Zhen Lei (1):
  esp6: fix memleak on error path in esp6_input

 net/ipv6/esp6.c|  4 +++-
 net/ipv6/ip6_vti.c | 11 ++-
 net/xfrm/xfrm_policy.c |  3 +++
 net/xfrm/xfrm_user.c   | 18 +++---
 4 files changed, 23 insertions(+), 13 deletions(-)


[PATCH 4/5] xfrm: free skb if nlsk pointer is NULL

2018-07-27 Thread Steffen Klassert
From: Florian Westphal 

nlmsg_multicast() always frees the skb, so in case we cannot call
it we must do that ourselves.

Fixes: 21ee543edc0dea ("xfrm: fix race between netns cleanup and state expire 
notification")
Signed-off-by: Florian Westphal 
Signed-off-by: Steffen Klassert 
---
 net/xfrm/xfrm_user.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index 1e50b70ad668..33878e6e0d0a 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -1025,10 +1025,12 @@ static inline int xfrm_nlmsg_multicast(struct net *net, 
struct sk_buff *skb,
 {
struct sock *nlsk = rcu_dereference(net->xfrm.nlsk);
 
-   if (nlsk)
-   return nlmsg_multicast(nlsk, skb, pid, group, GFP_ATOMIC);
-   else
-   return -1;
+   if (!nlsk) {
+   kfree_skb(skb);
+   return -EPIPE;
+   }
+
+   return nlmsg_multicast(nlsk, skb, pid, group, GFP_ATOMIC);
 }
 
 static inline unsigned int xfrm_spdinfo_msgsize(void)
-- 
2.14.1



[PATCH 3/5] xfrm: fix missing dst_release() after policy blocking lbcast and multicast

2018-07-27 Thread Steffen Klassert
From: Tommi Rantala 

Fix missing dst_release() when local broadcast or multicast traffic is
xfrm policy blocked.

For IPv4 this results to dst leak: ip_route_output_flow() allocates
dst_entry via __ip_route_output_key() and passes it to
xfrm_lookup_route(). xfrm_lookup returns ERR_PTR(-EPERM) that is
propagated. The dst that was allocated is never released.

IPv4 local broadcast testcase:
 ping -b 192.168.1.255 &
 sleep 1
 ip xfrm policy add src 0.0.0.0/0 dst 192.168.1.255/32 dir out action block

IPv4 multicast testcase:
 ping 224.0.0.1 &
 sleep 1
 ip xfrm policy add src 0.0.0.0/0 dst 224.0.0.1/32 dir out action block

For IPv6 the missing dst_release() causes trouble e.g. when used in netns:
 ip netns add TEST
 ip netns exec TEST ip link set lo up
 ip link add dummy0 type dummy
 ip link set dev dummy0 netns TEST
 ip netns exec TEST ip addr add fd00:: dev dummy0
 ip netns exec TEST ip link set dummy0 up
 ip netns exec TEST ping -6 -c 5 ff02::1%dummy0 &
 sleep 1
 ip netns exec TEST ip xfrm policy add src ::/0 dst ff02::1 dir out action block
 wait
 ip netns del TEST

After netns deletion we see:
[  258.239097] unregister_netdevice: waiting for lo to become free. Usage count 
= 2
[  268.279061] unregister_netdevice: waiting for lo to become free. Usage count 
= 2
[  278.367018] unregister_netdevice: waiting for lo to become free. Usage count 
= 2
[  288.375259] unregister_netdevice: waiting for lo to become free. Usage count 
= 2

Fixes: ac37e2515c1a ("xfrm: release dst_orig in case of error in xfrm_lookup()")
Signed-off-by: Tommi Rantala 
Signed-off-by: Steffen Klassert 
---
 net/xfrm/xfrm_policy.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 5f48251c1319..7c5e8978aeaa 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -2286,6 +2286,9 @@ struct dst_entry *xfrm_lookup_route(struct net *net, 
struct dst_entry *dst_orig,
if (IS_ERR(dst) && PTR_ERR(dst) == -EREMOTE)
return make_blackhole(net, dst_orig->ops->family, dst_orig);
 
+   if (IS_ERR(dst))
+   dst_release(dst_orig);
+
return dst;
 }
 EXPORT_SYMBOL(xfrm_lookup_route);
-- 
2.14.1



[PATCH 1/5] vti6: fix PMTU caching and reporting on xmit

2018-07-27 Thread Steffen Klassert
From: Eyal Birger 

When setting the skb->dst before doing the MTU check, the route PMTU
caching and reporting is done on the new dst which is about to be
released.

Instead, PMTU handling should be done using the original dst.

This is aligned with IPv4 VTI.

Fixes: ccd740cbc6 ("vti6: Add pmtu handling to vti6_xmit.")
Signed-off-by: Eyal Birger 
Signed-off-by: Steffen Klassert 
---
 net/ipv6/ip6_vti.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/net/ipv6/ip6_vti.c b/net/ipv6/ip6_vti.c
index b7f28deddaea..c72ae3a4fe09 100644
--- a/net/ipv6/ip6_vti.c
+++ b/net/ipv6/ip6_vti.c
@@ -480,10 +480,6 @@ vti6_xmit(struct sk_buff *skb, struct net_device *dev, 
struct flowi *fl)
goto tx_err_dst_release;
}
 
-   skb_scrub_packet(skb, !net_eq(t->net, dev_net(dev)));
-   skb_dst_set(skb, dst);
-   skb->dev = skb_dst(skb)->dev;
-
mtu = dst_mtu(dst);
if (!skb->ignore_df && skb->len > mtu) {
skb_dst_update_pmtu(skb, mtu);
@@ -498,9 +494,14 @@ vti6_xmit(struct sk_buff *skb, struct net_device *dev, 
struct flowi *fl)
  htonl(mtu));
}
 
-   return -EMSGSIZE;
+   err = -EMSGSIZE;
+   goto tx_err_dst_release;
}
 
+   skb_scrub_packet(skb, !net_eq(t->net, dev_net(dev)));
+   skb_dst_set(skb, dst);
+   skb->dev = skb_dst(skb)->dev;
+
err = dst_output(t->net, skb->sk, skb);
if (net_xmit_eval(err) == 0) {
struct pcpu_sw_netstats *tstats = this_cpu_ptr(dev->tstats);
-- 
2.14.1



Re: [PATCH] xfrm: fix ptr_ret.cocci warnings

2018-07-26 Thread Steffen Klassert
On Thu, Jul 26, 2018 at 03:09:52PM +0800, kbuild test robot wrote:
> From: kbuild test robot 
> 
> net/xfrm/xfrm_interface.c:692:1-3: WARNING: PTR_ERR_OR_ZERO can be used
> 
> 
>  Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR
> 
> Generated by: scripts/coccinelle/api/ptr_ret.cocci
> 
> Fixes: 44e2b838c24d ("xfrm: Return detailed errors from xfrmi_newlink")
> CC: Benedict Wong 
> Signed-off-by: kbuild test robot 

Applied, thanks!


Re: [PATCH ipsec-next] xfrm: Return detailed errors from xfrmi_newlink

2018-07-26 Thread Steffen Klassert
On Wed, Jul 25, 2018 at 01:45:29PM -0700, Benedict Wong wrote:
> Currently all failure modes of xfrm interface creation return EEXIST.
> This change improves the granularity of errnos provided by also
> returning ENODEV or EINVAL if failures happen in looking up the
> underlying interface, or a required parameter is not provided.
> 
> This change has been tested against the Android Kernel Networking Tests,
> with additional xfrmi_newlink tests here:
> 
> https://android-review.googlesource.com/c/kernel/tests/+/715755
> 
> Signed-off-by: Benedict Wong 

Applied to ipsec-next, thanks Benedict!


Re: UDP GRO without Merging

2018-07-25 Thread Steffen Klassert
On Wed, Jul 25, 2018 at 11:53:45AM +0200, Gauvain Roussel-Tarbouriech wrote:
> Subject:
> 
> Hello Netdev,
> 
> I am working on WireGuard as part of Google Summer of Code and Jason and
> I are working on adding GRO to WireGuard, on the udp_tunnel side of
> things. The goal is to inform udp_tunnel’s gro_receive that certain
> packets are part of the same flow. Then sometime later, we’d like to
> have a list of those same-flow packets appear in gro_complete. As far as
> I can tell, the API seems well suited for merging packets in
> gro_receive, such that gro_complete only then gets one single merged
> packet. However, we need to receive the entire list of same-flow packets
> in gro_complete, unmerged. Is this possible with the present APIs?

We recently posted a patchset that does this GRO packet chaining.
Not sure if this is exactly what you need, but you can have a look here:

https://www.spinics.net/lists/netdev/msg508093.html

This is implemented as a part of a forward fastpath, but
could be also moved generic code if needed.


Re: [PATCH ipsec-next] xfrm: Allow xfrmi if_id to be updated by UPDSA

2018-07-22 Thread Steffen Klassert
On Thu, Jul 19, 2018 at 07:07:47PM -0700, Nathan Harold wrote:
> Allow attaching an SA to an xfrm interface id after
> the creation of the SA, so that tasks such as keying
> which must be done as the SA is created, can remain
> separate from the decision on how to route traffic
> from an SA. This permits SA creation to be decomposed
> in to three separate steps:
> 1) allocation of a SPI
> 2) algorithm and key negotiation
> 3) insertion into the data path
> 
> Signed-off-by: Nathan Harold 

Applied to ipsec-next, thanks Nathan!


Re: [PATCH ipsec-next] xfrm: Remove xfrmi interface ID from flowi

2018-07-22 Thread Steffen Klassert
On Thu, Jul 19, 2018 at 10:50:44AM -0700, Benedict Wong wrote:
> In order to remove performance impact of having the extra u32 in every
> single flowi, this change removes the flowi_xfrm struct, prefering to
> take the if_id as a method parameter where needed.
> 
> In the inbound direction, if_id is only needed during the
> __xfrm_check_policy() function, and the if_id can be determined at that
> point based on the skb. As such, xfrmi_decode_session() is only called
> with the skb in __xfrm_check_policy().
> 
> In the outbound direction, the only place where if_id is needed is the
> xfrm_lookup() call in xfrmi_xmit2(). With this change, the if_id is
> directly passed into the xfrm_lookup_with_ifid() call. All existing
> callers can still call xfrm_lookup(), which uses a default if_id of 0.
> 
> This change does not change any behavior of XFRMIs except for improving
> overall system performance via flowi size reduction.
> 
> This change has been tested against the Android Kernel Networking Tests:
> 
> https://android.googlesource.com/kernel/tests/+/master/net/test
> 
> Signed-off-by: Benedict Wong 

Nice improvement. Applied to ipsec-next, thanks a lot!


Re: [PATCH ipsec-next 1/1] xfrm: don't check offload_handle for nonzero

2018-07-22 Thread Steffen Klassert
On Tue, Jun 26, 2018 at 02:19:10PM -0700, Shannon Nelson wrote:
> The offload_handle should be an opaque data cookie for the driver
> to use, much like the data cookie for a timer or alarm callback.
> Thus, the XFRM stack should not be checking for non-zero, because
> the driver might use that to store an array reference, which could
> be zero, or some other zero but meaningful value.
> 
> We can remove the checks for non-zero because there are plenty
> other attributes also being checked to see if there is an offload
> in place for the SA in question.
> 
> Signed-off-by: Shannon Nelson 

Applied to ipsec-next, thanks Shannon!


Re: [RFC ipsec-next] xfrm: Remove xfrmi interface ID from flowi

2018-07-19 Thread Steffen Klassert
On Tue, Jul 17, 2018 at 02:40:04PM -0700, Benedict Wong wrote:

> @@ -2301,6 +2322,13 @@ int __xfrm_policy_check(struct sock *sk, int dir, 
> struct sk_buff *skb,
>   int reverse;
>   struct flowi fl;
>   int xerr_idx = -1;
> + const struct xfrm_if_cb *ifcb;
> + struct xfrm_if *xi;
> + u32 if_id = 0;
> +
> + rcu_read_lock();
> + ifcb = xfrm_if_get_cb();
> + rcu_read_unlock();
>  
>   reverse = dir & ~XFRM_POLICY_MASK;
>   dir &= XFRM_POLICY_MASK;
> @@ -2325,10 +2353,16 @@ int __xfrm_policy_check(struct sock *sk, int dir, 
> struct sk_buff *skb,
>   }
>   }
>  
> + if (ifcb) {
> + xi = ifcb->decode_session(skb);
> + if (xi)
> + if_id = xi->p.if_id;
> + }

The usage of the ifcb pointer should go into the
rcu_read_lock section above.

Looks good otherwise, nice improvement.

Please respin and do an official submission of this
patch, I'd like to merge it before I send the pull
request for the ipsec-next tree.


Re: [PATCH ipsec-next 1/1] xfrm: don't check offload_handle for nonzero

2018-06-27 Thread Steffen Klassert
On Tue, Jun 26, 2018 at 02:19:10PM -0700, Shannon Nelson wrote:
> The offload_handle should be an opaque data cookie for the driver
> to use, much like the data cookie for a timer or alarm callback.
> Thus, the XFRM stack should not be checking for non-zero, because
> the driver might use that to store an array reference, which could
> be zero, or some other zero but meaningful value.
> 
> We can remove the checks for non-zero because there are plenty
> other attributes also being checked to see if there is an offload
> in place for the SA in question.
> 
> Signed-off-by: Shannon Nelson 

I don't have access to my hw offload testlab currently,
so I've queued this one until I'm back from my vacation.


Re: [PATCH v2 ipsec-next] xfrm: policy: remove pcpu policy cache

2018-06-25 Thread Steffen Klassert
On Mon, Jun 25, 2018 at 05:26:02PM +0200, Florian Westphal wrote:
> Kristian Evensen says:
>   In a project I am involved in, we are running ipsec (Strongswan) on
>   different mt7621-based routers. Each router is configured as an
>   initiator and has around ~30 tunnels to different responders (running
>   on misc. devices). Before the flow cache was removed (kernel 4.9), we
>   got a combined throughput of around 70Mbit/s for all tunnels on one
>   router. However, we recently switched to kernel 4.14 (4.14.48), and
>   the total throughput is somewhere around 57Mbit/s (best-case). I.e., a
>   drop of around 20%. Reverting the flow cache removal restores, as
>   expected, performance levels to that of kernel 4.9.
> 
> When pcpu xdst exists, it has to be validated first before it can be
> used.
> 
> A negative hit thus increases cost vs. no-cache.
> 
> As number of tunnels increases, hit rate decreases so this pcpu caching
> isn't a viable strategy.
> 
> Furthermore, the xdst cache also needs to run with BH off, so when
> removing this the bh disable/enable pairs can be removed too.
> 
> Kristian tested a 4.14.y backport of this change and reported
> increased performance:
> 
>   In our tests, the throughput reduction has been reduced from around -20%
>   to -5%. We also see that the overall throughput is independent of the
>   number of tunnels, while before the throughput was reduced as the number
>   of tunnels increased.
> 
> Reported-by: Kristian Evensen 
> Signed-off-by: Florian Westphal 

Applied to ipsec-next, thanks a lot!


Re: [PATCH ipsec] xfrm: free skb if nlsk pointer is NULL

2018-06-25 Thread Steffen Klassert
On Mon, Jun 25, 2018 at 02:00:07PM +0200, Florian Westphal wrote:
> nlmsg_multicast() always frees the skb, so in case we cannot call
> it we must do that ourselves.
> 
> Fixes: 21ee543edc0dea ("xfrm: fix race between netns cleanup and state expire 
> notification")
> Signed-off-by: Florian Westphal 

Applied, thanks Florian!


Re: [PATCH ipsec-next] xfrm: policy: remove pcpu policy cache

2018-06-25 Thread Steffen Klassert
On Mon, Jun 25, 2018 at 01:57:53PM +0200, Florian Westphal wrote:
> Kristian Evensen says:
>   In a project I am involved in, we are running ipsec (Strongswan) on
>   different mt7621-based routers. Each router is configured as an
>   initiator and has around ~30 tunnels to different responders (running
>   on misc. devices). Before the flow cache was removed (kernel 4.9), we
>   got a combined throughput of around 70Mbit/s for all tunnels on one
>   router. However, we recently switched to kernel 4.14 (4.14.48), and
>   the total throughput is somewhere around 57Mbit/s (best-case). I.e., a
>   drop of around 20%. Reverting the flow cache removal restores, as
>   expected, performance levels to that of kernel 4.9.
> 
> When pcpu xdst exists, it has to be validated first before it can be
> used.
> 
> A negative hit thus increases cost vs. no-cache.
> 
> As number of tunnels increases, hit rate decreases so this pcpu caching
> isn't a viable strategy.
> 
> Furthermore, the xdst cache also needs to run with BH off, so when
> removing this the bh disable/enable pairs can be removed too.
> 
> Kristian tested a 4.14.y backport of this change and reported
> increased performance:
> 
>   In our tests, the throughput reduction has been reduced from around -20%
>   to -5%. We also see that the overall throughput is independent of the
>   number of tunnels, while before the throughput was reduced as the number
>   of tunnels increased.
> 
> Reported-by: Kristian Evensen 
> Signed-off-by: Florian Westphal 

Can you please rebase this to ipsec-next current?

It does not apply cleanly after the merge of the
xfrm interface patches.

Thanks!


Re: [PATCH RFC v2 ipsec-next 0/3] Virtual xfrm interfaces

2018-06-25 Thread Steffen Klassert
On Tue, Jun 12, 2018 at 09:56:07AM +0200, Steffen Klassert wrote:
> This patchset introduces new virtual xfrm interfaces.
> The design of virtual xfrm interfaces interfaces was
> discussed at the Linux IPsec workshop 2018. This patchset
> implements these interfaces as the IPsec userspace and
> kernel developers agreed. The purpose of these interfaces
> is to overcome the design limitations that the existing
> VTI devices have.
> 
> The main limitations that we see with the current VTI are the
> following:
> 
> - VTI interfaces are L3 tunnels with configurable endpoints.
>   For xfrm, the tunnel endpoint are already determined by the SA.
>   So the VTI tunnel endpoints must be either the same as on the
>   SA or wildcards. In case VTI tunnel endpoints are same as on
>   the SA, we get a one to one correlation between the SA and
>   the tunnel. So each SA needs its own tunnel interface.
> 
>   On the other hand, we can have only one VTI tunnel with
>   wildcard src/dst tunnel endpoints in the system because the
>   lookup is based on the tunnel endpoints. The existing tunnel
>   lookup won't work with multiple tunnels with wildcard
>   tunnel endpoints. Some usecases require more than on
>   VTI tunnel of this type, for example if somebody has multiple
>   namespaces and every namespace requires such a VTI.
> 
> - VTI needs separate interfaces for IPv4 and IPv6 tunnels.
>   So when routing to a VTI, we have to know to which address
>   family this traffic class is going to be encapsulated.
>   This is a lmitation because it makes routing more complex
>   and it is not always possible to know what happens behind the
>   VTI, e.g. when the VTI is move to some namespace.
> 
> - VTI works just with tunnel mode SAs. We need generic interfaces
>   that ensures transfomation, regardless of the xfrm mode and
>   the encapsulated address family.
> 
> - VTI is configured with a combination GRE keys and xfrm marks.
>   With this we have to deal with some extra cases in the generic
>   tunnel lookup because the GRE keys on the VTI are actually
>   not GRE keys, the GRE keys were just reused for something else.
>   All extensions to the VTI interfaces would require to add
>   even more complexity to the generic tunnel lookup.
> 
> To overcome this, we started with the following design goal:
> 
> - It should be possible to tunnel IPv4 and IPv6 through the same
>   interface.
> 
> - No limitation on xfrm mode (tunnel, transport and beet).
> 
> - Should be a generic virtual interface that ensures IPsec
>   transformation, no need to know what happens behind the
>   interface.
> 
> - Interfaces should be configured with a new key that must match a
>   new policy/SA lookup key.
> 
> - The lookup logic should stay in the xfrm codebase, no need to
>   change or extend generic routing and tunnel lookups.
> 
> - Should be possible to use IPsec hardware offloads of the underlying
>   interface.
> 
> Changes from v1:
> 
> - Document the limitations of VTI interfaces and the design of
>   the new xfrm interfaces more explicit in the commit messages.
> 
> - No code changes.

I have not got any further comments, so applied to ipsec-next.


Re: [PATCH RFC ipsec-next] xfrm: Extend the output_mark to support input direction and masking.

2018-06-25 Thread Steffen Klassert
On Fri, Jun 15, 2018 at 08:55:14AM +0200, Steffen Klassert wrote:
> We already support setting an output mark at the xfrm_state,
> unfortunately this does not support the input direction and
> masking the marks that will be applied to the skb. This change
> adds support applying a masked value in both directions.
> 
> The existing XFRMA_OUTPUT_MARK number is reused for this purpose
> and as it is now bi-directional, it is renamed to XFRMA_SET_MARK.
> 
> An additional XFRMA_SET_MARK_MASK attribute is added for setting the
> mask. If the attribute mask not provided, it is set to 0x,
> keeping the XFRMA_OUTPUT_MARK existing 'full mask' semantics.
> 
> Co-developed-by: Tobias Brunner 
> Co-developed-by: Eyal Birger 
> Co-developed-by: Lorenzo Colitti 
> Signed-off-by: Steffen Klassert 
> Signed-off-by: Tobias Brunner 
> Signed-off-by: Eyal Birger 
> Signed-off-by: Lorenzo Colitti 

This is now applied to ipsec-next.


Re: [PATCH net] xfrm_user: prevent leaking 2 bytes of kernel memory

2018-06-19 Thread Steffen Klassert
On Mon, Jun 18, 2018 at 09:35:07PM -0700, Eric Dumazet wrote:
> struct xfrm_userpolicy_type has two holes, so we should not
> use C99 style initializer.
> 
> KMSAN report:
> 
> BUG: KMSAN: kernel-infoleak in copyout lib/iov_iter.c:140 [inline]
> BUG: KMSAN: kernel-infoleak in _copy_to_iter+0x1b14/0x2800 lib/iov_iter.c:571
> CPU: 1 PID: 4520 Comm: syz-executor841 Not tainted 4.17.0+ #5
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:77 [inline]
>  dump_stack+0x185/0x1d0 lib/dump_stack.c:113
>  kmsan_report+0x188/0x2a0 mm/kmsan/kmsan.c:1117
>  kmsan_internal_check_memory+0x138/0x1f0 mm/kmsan/kmsan.c:1211
>  kmsan_copy_to_user+0x7a/0x160 mm/kmsan/kmsan.c:1253
>  copyout lib/iov_iter.c:140 [inline]
>  _copy_to_iter+0x1b14/0x2800 lib/iov_iter.c:571
>  copy_to_iter include/linux/uio.h:106 [inline]
>  skb_copy_datagram_iter+0x422/0xfa0 net/core/datagram.c:431
>  skb_copy_datagram_msg include/linux/skbuff.h:3268 [inline]
>  netlink_recvmsg+0x6f1/0x1900 net/netlink/af_netlink.c:1959
>  sock_recvmsg_nosec net/socket.c:802 [inline]
>  sock_recvmsg+0x1d6/0x230 net/socket.c:809
>  ___sys_recvmsg+0x3fe/0x810 net/socket.c:2279
>  __sys_recvmmsg+0x58e/0xe30 net/socket.c:2391
>  do_sys_recvmmsg+0x2a6/0x3e0 net/socket.c:2472
>  __do_sys_recvmmsg net/socket.c:2485 [inline]
>  __se_sys_recvmmsg net/socket.c:2481 [inline]
>  __x64_sys_recvmmsg+0x15d/0x1c0 net/socket.c:2481
>  do_syscall_64+0x15b/0x230 arch/x86/entry/common.c:287
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> RIP: 0033:0x446ce9
> RSP: 002b:7fc307918db8 EFLAGS: 0293 ORIG_RAX: 012b
> RAX: ffda RBX: 006dbc24 RCX: 00446ce9
> RDX: 000a RSI: 20005040 RDI: 0003
> RBP: 006dbc20 R08: 20004e40 R09: 
> R10: 4000 R11: 0293 R12: 
> R13: 7ffc8d2df32f R14: 7fc3079199c0 R15: 0001
> 
> Uninit was stored to memory at:
>  kmsan_save_stack_with_flags mm/kmsan/kmsan.c:279 [inline]
>  kmsan_save_stack mm/kmsan/kmsan.c:294 [inline]
>  kmsan_internal_chain_origin+0x12b/0x210 mm/kmsan/kmsan.c:685
>  kmsan_memcpy_origins+0x11d/0x170 mm/kmsan/kmsan.c:527
>  __msan_memcpy+0x109/0x160 mm/kmsan/kmsan_instr.c:413
>  __nla_put lib/nlattr.c:569 [inline]
>  nla_put+0x276/0x340 lib/nlattr.c:627
>  copy_to_user_policy_type net/xfrm/xfrm_user.c:1678 [inline]
>  dump_one_policy+0xbe1/0x1090 net/xfrm/xfrm_user.c:1708
>  xfrm_policy_walk+0x45a/0xd00 net/xfrm/xfrm_policy.c:1013
>  xfrm_dump_policy+0x1c0/0x2a0 net/xfrm/xfrm_user.c:1749
>  netlink_dump+0x9b5/0x1550 net/netlink/af_netlink.c:2226
>  __netlink_dump_start+0x1131/0x1270 net/netlink/af_netlink.c:2323
>  netlink_dump_start include/linux/netlink.h:214 [inline]
>  xfrm_user_rcv_msg+0x8a3/0x9b0 net/xfrm/xfrm_user.c:2577
>  netlink_rcv_skb+0x37e/0x600 net/netlink/af_netlink.c:2448
>  xfrm_netlink_rcv+0xb2/0xf0 net/xfrm/xfrm_user.c:2598
>  netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
>  netlink_unicast+0x1680/0x1750 net/netlink/af_netlink.c:1336
>  netlink_sendmsg+0x104f/0x1350 net/netlink/af_netlink.c:1901
>  sock_sendmsg_nosec net/socket.c:629 [inline]
>  sock_sendmsg net/socket.c:639 [inline]
>  ___sys_sendmsg+0xec8/0x1320 net/socket.c:2117
>  __sys_sendmsg net/socket.c:2155 [inline]
>  __do_sys_sendmsg net/socket.c:2164 [inline]
>  __se_sys_sendmsg net/socket.c:2162 [inline]
>  __x64_sys_sendmsg+0x331/0x460 net/socket.c:2162
>  do_syscall_64+0x15b/0x230 arch/x86/entry/common.c:287
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> Local variable description: upt.i@dump_one_policy
> Variable was created at:
>  dump_one_policy+0x78/0x1090 net/xfrm/xfrm_user.c:1689
>  xfrm_policy_walk+0x45a/0xd00 net/xfrm/xfrm_policy.c:1013
> 
> Byte 130 of 137 is uninitialized
> Memory access starts at 88019550407f
> 
> Fixes: c0144beaeca42 ("[XFRM] netlink: Use nla_put()/NLA_PUT() variantes")
> Signed-off-by: Eric Dumazet 
> Reported-by: syzbot 
> Cc: Steffen Klassert 
> Cc: Herbert Xu 
> ---
>  net/xfrm/xfrm_user.c | 8 +---

It is a fix for xfrm, so I applied it
to the ipsec tree.

Thanks a lot Eric!


Re: [PATCH][v2] xfrm: replace NR_CPU with nr_cpu_ids

2018-06-19 Thread Steffen Klassert
On Tue, Jun 19, 2018 at 09:53:49AM +0200, Florian Westphal wrote:
> Li RongQing  wrote:
> > The default NR_CPUS can be very large, but actual possible nr_cpu_ids
> > usually is very small. For some x86 distribution, the NR_CPUS is 8192
> > and nr_cpu_ids is 4, so replace NR_CPU to save some memory
> 
> Steffen,
> 
> I will soon submit a patch to remove the percpu cache; removal
> improved performance for at least one user (and by quite a sizeable
> amount).
> 
> Would you consider such removal for ipsec or ipsec-next?

I think this removel would better fit to ipsec-next.

> If -next, consider applying this patch for ipsec.git.
> 
> Otherwise consider not applying this, as the code will go away soon.

This patch more an optimization than a fix, so I
considered to apply it to ipsec-next. If you plan
to remove it, I'll wait for that.


[PATCH RFC ipsec-next] xfrm: Extend the output_mark to support input direction and masking.

2018-06-15 Thread Steffen Klassert
We already support setting an output mark at the xfrm_state,
unfortunately this does not support the input direction and
masking the marks that will be applied to the skb. This change
adds support applying a masked value in both directions.

The existing XFRMA_OUTPUT_MARK number is reused for this purpose
and as it is now bi-directional, it is renamed to XFRMA_SET_MARK.

An additional XFRMA_SET_MARK_MASK attribute is added for setting the
mask. If the attribute mask not provided, it is set to 0x,
keeping the XFRMA_OUTPUT_MARK existing 'full mask' semantics.

Co-developed-by: Tobias Brunner 
Co-developed-by: Eyal Birger 
Co-developed-by: Lorenzo Colitti 
Signed-off-by: Steffen Klassert 
Signed-off-by: Tobias Brunner 
Signed-off-by: Eyal Birger 
Signed-off-by: Lorenzo Colitti 
---
 include/net/xfrm.h|  9 -
 include/uapi/linux/xfrm.h |  4 +++-
 net/xfrm/xfrm_input.c |  2 ++
 net/xfrm/xfrm_output.c|  3 +--
 net/xfrm/xfrm_policy.c|  5 +++--
 net/xfrm/xfrm_user.c  | 48 +--
 6 files changed, 55 insertions(+), 16 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 45e75c36b738..8727b2484855 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -166,7 +166,7 @@ struct xfrm_state {
int header_len;
int trailer_len;
u32 extra_flags;
-   u32 output_mark;
+   struct xfrm_marksmark;
} props;
 
struct xfrm_lifetime_cfg lft;
@@ -2012,6 +2012,13 @@ static inline int xfrm_mark_put(struct sk_buff *skb, 
const struct xfrm_mark *m)
return ret;
 }
 
+static inline __u32 xfrm_smark_get(__u32 mark, struct xfrm_state *x)
+{
+   struct xfrm_mark *m = >props.smark;
+
+   return (m->v & m->m) | (mark & ~m->m);
+}
+
 static inline int xfrm_tunnel_check(struct sk_buff *skb, struct xfrm_state *x,
unsigned int family)
 {
diff --git a/include/uapi/linux/xfrm.h b/include/uapi/linux/xfrm.h
index e3af2859188b..5a6ed7ce5a29 100644
--- a/include/uapi/linux/xfrm.h
+++ b/include/uapi/linux/xfrm.h
@@ -305,9 +305,11 @@ enum xfrm_attr_type_t {
XFRMA_ADDRESS_FILTER,   /* struct xfrm_address_filter */
XFRMA_PAD,
XFRMA_OFFLOAD_DEV,  /* struct xfrm_state_offload */
-   XFRMA_OUTPUT_MARK,  /* __u32 */
+   XFRMA_SET_MARK, /* __u32 */
+   XFRMA_SET_MARK_MASK,/* __u32 */
__XFRMA_MAX
 
+#define XFRMA_OUTPUT_MARK XFRMA_SET_MARK   /* Compatibility */
 #define XFRMA_MAX (__XFRMA_MAX - 1)
 };
 
diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c
index 352abca2605f..074810436242 100644
--- a/net/xfrm/xfrm_input.c
+++ b/net/xfrm/xfrm_input.c
@@ -339,6 +339,8 @@ int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 
spi, int encap_type)
goto drop;
}
 
+   skb->mark = xfrm_smark_get(skb->mark, x);
+
skb->sp->xvec[skb->sp->len++] = x;
 
 lock:
diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
index 89b178a78dc7..45ba07ab3e4f 100644
--- a/net/xfrm/xfrm_output.c
+++ b/net/xfrm/xfrm_output.c
@@ -66,8 +66,7 @@ static int xfrm_output_one(struct sk_buff *skb, int err)
goto error_nolock;
}
 
-   if (x->props.output_mark)
-   skb->mark = x->props.output_mark;
+   skb->mark = xfrm_smark_get(skb->mark, x);
 
err = x->outer_mode->output(x, skb);
if (err) {
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 40b54cc64243..f95f5f75748c 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -1607,10 +1607,11 @@ static struct dst_entry *xfrm_bundle_create(struct 
xfrm_policy *policy,
dst_copy_metrics(dst1, dst);
 
if (xfrm[i]->props.mode != XFRM_MODE_TRANSPORT) {
+   __u32 mark = xfrm_smark_get(fl->flowi_mark, xfrm[i]);
+
family = xfrm[i]->props.family;
dst = xfrm_dst_lookup(xfrm[i], tos, fl->flowi_oif,
- , , family,
- xfrm[i]->props.output_mark);
+ , , family, mark);
err = PTR_ERR(dst);
if (IS_ERR(dst))
goto put_states;
diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index 080035f056d9..9602cc9e05ab 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -527,6 +527,19 @@ static void xfrm_update_ae_params(struct xfrm_state *x, 
struct nlattr **attrs,
x->replay_maxdiff = nla_get_u32(rt);
 }
 
+static void xfrm_smark_init(struct nlattr **attrs, struct xfrm_mark *m)
+{

Re: [PATCH iproute2-next] ip-xfrm: Add support for OUTPUT_MARK

2018-06-12 Thread Steffen Klassert
On Tue, Jun 12, 2018 at 11:33:41AM +0900, Lorenzo Colitti wrote:
> On Tue, Jun 12, 2018 at 11:12 AM Subash Abhinov Kasiviswanathan
>  wrote:
> >
> > This patch adds support for OUTPUT_MARK in xfrm state to exercise the
> > functionality added by kernel commit 077fbac405bf
> > ("net: xfrm: support setting an output mark.").
> >
> > Sample output with output-mark -
> >
> > src 192.168.1.1 dst 192.168.1.2
> > proto esp spi 0x4321 reqid 0 mode tunnel
> > replay-window 0 flag af-unspec
> > auth-trunc xcbc(aes) 0x3ed0af408cf5dcbf5d5d9a5fa806b211 96
> > enc cbc(aes) 0x3ed0af408cf5dcbf5d5d9a5fa806b233
> > anti-replay context: seq 0x0, oseq 0x0, bitmap 0x
> > output-mark 0x2
> 
> Have you considered putting this earlier up in the output, where the
> mark is printed as well?
> 
> > +   if (tb[XFRMA_OUTPUT_MARK]) {
> > +   __u32 output_mark = rta_getattr_u32(tb[XFRMA_OUTPUT_MARK]);
> > +
> > +   fprintf(fp, "\toutput-mark 0x%x %s", output_mark, _SL_);
> > +   }
> >  }
> 
> If you wanted to implement the suggestion above, I think you could do
> that by moving this code into xfrm_xfrma_print.
> 
> Other than that, LGTM.
> 
> Acked-by: Lorenzo Colitti 
> 
> Steffen - what's the status of the set_mark patches? Are you holding
> them until the tree opens again?

Yes, I hold them back until after v4.18-rc1 is released and the
-next trees open again. But I plan to do a RFC version this week,
so that everybody knows about the plan we have.



  1   2   3   4   5   6   7   8   9   >