[net PATCH 2/2] tools: selftests: add test for changing routes with PMTU exceptions

2021-01-05 Thread Sean Tranchetti
From: Sean Tranchetti 

Adds two new tests to the PMTU script: pmtu_ipv4/6_route_change.

These tests explicitly test for a recently discovered problem in the
IPv6 routing framework where PMTU exceptions were not properly released
when replacing a route via "ip route change ...".

After creating PMTU exceptions, the route from device A to R1 will be
replaced with a new route, then device A will be deleted. If the PMTU
exceptions were properly cleaned up by the kernel, this device deletion
will succeed. Otherwise, the unregistration of the device will stall, and
messages such as the following will be logged in dmesg:

unregister_netdevice: waiting for veth_A-R1 to become free. Usage count = 4

Signed-off-by: Sean Tranchetti 
---
 tools/testing/selftests/net/pmtu.sh | 71 +++--
 1 file changed, 69 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/net/pmtu.sh b/tools/testing/selftests/net/pmtu.sh
index 464e31e..64cd2e2 100755
--- a/tools/testing/selftests/net/pmtu.sh
+++ b/tools/testing/selftests/net/pmtu.sh
@@ -162,7 +162,15 @@
 # - list_flush_ipv6_exception
 #  Using the same topology as in pmtu_ipv6, create exceptions, and check
 #  they are shown when listing exception caches, gone after flushing them
-
+#
+# - pmtu_ipv4_route_change
+#  Use the same topology as in pmtu_ipv4, but issue a route replacement
+#  command and delete the corresponding device afterward. This tests for
+#  proper cleanup of the PMTU exceptions by the route replacement path.
+#  Device unregistration should complete successfully
+#
+# - pmtu_ipv6_route_change
+#  Same as above but with IPv6
 
 # Kselftest framework requirement - SKIP code is 4.
 ksft_skip=4
@@ -224,7 +232,9 @@ tests="
	cleanup_ipv4_exception		ipv4: cleanup of cached exceptions		1
	cleanup_ipv6_exception		ipv6: cleanup of cached exceptions		1
	list_flush_ipv4_exception	ipv4: list and flush cached exceptions		1
-	list_flush_ipv6_exception	ipv6: list and flush cached exceptions		1"
+	list_flush_ipv6_exception	ipv6: list and flush cached exceptions		1
+	pmtu_ipv4_route_change		ipv4: PMTU exception w/route replace		1
+	pmtu_ipv6_route_change		ipv6: PMTU exception w/route replace		1"
 
 NS_A="ns-A"
 NS_B="ns-B"
@@ -1782,6 +1792,63 @@ test_list_flush_ipv6_exception() {
return ${fail}
 }
 
+test_pmtu_ipvX_route_change() {
+   family=${1}
+
+   setup namespaces routing || return 2
+	trace "${ns_a}"  veth_A-R1    "${ns_r1}" veth_R1-A \
+	      "${ns_r1}" veth_R1-B    "${ns_b}"  veth_B-R1 \
+	      "${ns_a}"  veth_A-R2    "${ns_r2}" veth_R2-A \
+	      "${ns_r2}" veth_R2-B    "${ns_b}"  veth_B-R2
+
+   if [ ${family} -eq 4 ]; then
+   ping=ping
+   dst1="${prefix4}.${b_r1}.1"
+   dst2="${prefix4}.${b_r2}.1"
+   gw="${prefix4}.${a_r1}.2"
+   else
+   ping=${ping6}
+   dst1="${prefix6}:${b_r1}::1"
+   dst2="${prefix6}:${b_r2}::1"
+   gw="${prefix6}:${a_r1}::2"
+   fi
+
+   # Set up initial MTU values
+   mtu "${ns_a}"  veth_A-R1 2000
+   mtu "${ns_r1}" veth_R1-A 2000
+   mtu "${ns_r1}" veth_R1-B 1400
+   mtu "${ns_b}"  veth_B-R1 1400
+
+   mtu "${ns_a}"  veth_A-R2 2000
+   mtu "${ns_r2}" veth_R2-A 2000
+   mtu "${ns_r2}" veth_R2-B 1500
+   mtu "${ns_b}"  veth_B-R2 1500
+
+   # Create route exceptions
+   run_cmd ${ns_a} ${ping} -q -M want -i 0.1 -w 1 -s 1800 ${dst1}
+   run_cmd ${ns_a} ${ping} -q -M want -i 0.1 -w 1 -s 1800 ${dst2}
+
+   # Check that exceptions have been created with the correct PMTU
+   pmtu_1="$(route_get_dst_pmtu_from_exception "${ns_a}" ${dst1})"
+   check_pmtu_value "1400" "${pmtu_1}" "exceeding MTU" || return 1
+   pmtu_2="$(route_get_dst_pmtu_from_exception "${ns_a}" ${dst2})"
+   check_pmtu_value "1500" "${pmtu_2}" "exceeding MTU" || return 1
+
+   # Replace the route from A to R1
+   run_cmd ${ns_a} ip route change default via ${gw}
+
+   # Delete the device in A
+   run_cmd ${ns_a} ip link del "veth_A-R1"
+}
+
+test_pmtu_ipv4_route_change() {
+   test_pmtu_ipvX_route_change 4
+}
+
+test_pmtu_ipv6_route_change() {
+   test_pmtu_ipvX_route_change 6
+}
+
 usage() {
echo
echo "$0 [OPTIONS] [TEST]..."
-- 
2.7.4



[net PATCH 1/2] net: ipv6: fib: flush exceptions when purging route

2021-01-05 Thread Sean Tranchetti
From: Sean Tranchetti 

Route removal is handled by two code paths. The main removal path is via
fib6_del_route() which will handle purging any PMTU exceptions from the
cache, removing all per-cpu copies of the DST entry used by the route, and
releasing the fib6_info struct.

The second removal location is during fib6_add_rt2node() during a route
replacement operation. This path also calls fib6_purge_rt() to handle
cleaning up the per-cpu copies of the DST entries and releasing the
fib6_info associated with the older route, but it does not flush any PMTU
exceptions that the older route had. Since the older route is removed from
the tree during the replacement, we lose any way of accessing it again.

As these lingering DSTs and the fib6_info struct are holding references to
the underlying netdevice struct as well, unregistering that device from the
kernel can never complete.

Signed-off-by: Sean Tranchetti 
---
 net/ipv6/ip6_fib.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 605cdd3..f43e275 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -1025,6 +1025,8 @@ static void fib6_purge_rt(struct fib6_info *rt, struct fib6_node *fn,
 {
struct fib6_table *table = rt->fib6_table;
 
+   /* Flush all cached dst in exception table */
+   rt6_flush_exceptions(rt);
fib6_drop_pcpu_from(rt, table);
 
if (rt->nh && !list_empty(&rt->nh_list))
@@ -1927,9 +1929,6 @@ static void fib6_del_route(struct fib6_table *table, struct fib6_node *fn,
net->ipv6.rt6_stats->fib_rt_entries--;
net->ipv6.rt6_stats->fib_discarded_routes++;
 
-   /* Flush all cached dst in exception table */
-   rt6_flush_exceptions(rt);
-
/* Reset round-robin state, if necessary */
if (rcu_access_pointer(fn->rr_ptr) == rt)
fn->rr_ptr = NULL;
-- 
2.7.4



[PATCH net v2] genetlink: remove genl_bind

2020-06-30 Thread Sean Tranchetti
A potential deadlock can occur during registering or unregistering a
new generic netlink family between the main nl_table_lock and the
cb_lock where each thread wants the lock held by the other, as
demonstrated below.

1) Thread 1 is performing a netlink_bind() operation on a socket. As part
   of this call, it will call netlink_lock_table(), incrementing the
   nl_table_users count to 1.
2) Thread 2 is registering (or unregistering) a genl_family via the
   genl_(un)register_family() API. The cb_lock semaphore will be taken for
   writing.
3) Thread 1 will call genl_bind() as part of the bind operation to handle
   subscribing to GENL multicast groups at the request of the user. It will
   attempt to take the cb_lock semaphore for reading, but it will fail and
   be scheduled away, waiting for Thread 2 to finish the write.
4) Thread 2 will call netlink_table_grab() during the (un)registration
   call. However, as Thread 1 has incremented nl_table_users, it will not
   be able to proceed, and both threads will be stuck waiting for the
   other.

genl_bind() is a no-op unless a genl_family implements the mcast_bind()
function to handle setting up family-specific multicast operations. Since
nothing in-tree uses this functionality, as Cong pointed out, simply
removing the genl_bind() function removes the possibility of deadlock, as
Thread 1 above no longer attempts to take the cb_lock semaphore.

Fixes: c380d9a7afff ("genetlink: pass multicast bind/unbind to families")
Suggested-by: Cong Wang 
Acked-by: Johannes Berg 
Reported-by: kernel test robot 
Signed-off-by: Sean Tranchetti 
---
 include/net/genetlink.h |  8 
 net/netlink/genetlink.c | 49 -
 2 files changed, 57 deletions(-)

diff --git a/include/net/genetlink.h b/include/net/genetlink.h
index 7495066..4cf703d 100644
--- a/include/net/genetlink.h
+++ b/include/net/genetlink.h
@@ -35,12 +35,6 @@ struct genl_multicast_group {
  * do additional, common, filtering and return an error
  * @post_doit: called after an operation's doit callback, it may
  * undo operations done by pre_doit, for example release locks
- * @mcast_bind: a socket bound to the given multicast group (which
- * is given as the offset into the groups array)
- * @mcast_unbind: a socket was unbound from the given multicast group.
- * Note that unbind() will not be called symmetrically if the
- * generic netlink family is removed while there are still open
- * sockets.
  * @attrbuf: buffer to store parsed attributes (private)
  * @mcgrps: multicast groups used by this family
  * @n_mcgrps: number of multicast groups
@@ -64,8 +58,6 @@ struct genl_family {
void(*post_doit)(const struct genl_ops *ops,
 struct sk_buff *skb,
 struct genl_info *info);
-   int (*mcast_bind)(struct net *net, int group);
-   void(*mcast_unbind)(struct net *net, int group);
struct nlattr **attrbuf;/* private */
const struct genl_ops * ops;
const struct genl_multicast_group *mcgrps;
diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c
index 55ee680..dfaaefb 100644
--- a/net/netlink/genetlink.c
+++ b/net/netlink/genetlink.c
@@ -1166,60 +1166,11 @@ static int ctrl_dumppolicy(struct sk_buff *skb, struct netlink_callback *cb)
.netnsok = true,
 };
 
-static int genl_bind(struct net *net, int group)
-{
-   struct genl_family *f;
-   int err = -ENOENT;
-   unsigned int id;
-
-   down_read(&cb_lock);
-
-   idr_for_each_entry(&genl_fam_idr, f, id) {
-   if (group >= f->mcgrp_offset &&
-   group < f->mcgrp_offset + f->n_mcgrps) {
-   int fam_grp = group - f->mcgrp_offset;
-
-   if (!f->netnsok && net != &init_net)
-   err = -ENOENT;
-   else if (f->mcast_bind)
-   err = f->mcast_bind(net, fam_grp);
-   else
-   err = 0;
-   break;
-   }
-   }
-   up_read(&cb_lock);
-
-   return err;
-}
-
-static void genl_unbind(struct net *net, int group)
-{
-   struct genl_family *f;
-   unsigned int id;
-
-   down_read(&cb_lock);
-
-   idr_for_each_entry(&genl_fam_idr, f, id) {
-   if (group >= f->mcgrp_offset &&
-   group < f->mcgrp_offset + f->n_mcgrps) {
-   int fam_grp = group - f->mcgrp_offset;
-
-   if (f->mcast_unbind)
-   f->mcast_unbind(net, fam_grp);
-   break;
-   }
-   }
-   up_read(&cb_lock);
-}
-
 stati

[PATCH net] genetlink: take netlink table lock when (un)registering

2020-06-26 Thread Sean Tranchetti
A potential deadlock can occur during registering or unregistering a new
generic netlink family between the main nl_table_lock and the cb_lock where
each thread wants the lock held by the other, as demonstrated below.

1) Thread 1 is performing a netlink_bind() operation on a socket. As part
   of this call, it will call netlink_lock_table(), incrementing the
   nl_table_users count to 1.
2) Thread 2 is registering (or unregistering) a genl_family via the
   genl_(un)register_family() API. The cb_lock semaphore will be taken for
   writing.
3) Thread 1 will call genl_bind() as part of the bind operation to handle
   subscribing to GENL multicast groups at the request of the user. It will
   attempt to take the cb_lock semaphore for reading, but it will fail and
   be scheduled away, waiting for Thread 2 to finish the write.
4) Thread 2 will call netlink_table_grab() during the (un)registration
   call. However, as Thread 1 has incremented nl_table_users, it will not
   be able to proceed, and both threads will be stuck waiting for the
   other.

To avoid this scenario, the locks should be acquired in the same order by
both threads. Since both the register and unregister functions need to take
the nl_table_lock in their processing, it makes sense to explicitly acquire
them before they lock the genl_mutex and the cb_lock. In unregistering, no
other change is needed aside from this locking change.

Registering a family requires more ancillary operations, such as memory
allocation. Unfortunately, much of this allocation must be performed inside
the genl locks to ensure internal synchronization, so it must also be
performed inside the nl_table_lock, where sleeping is not allowed. As a
result, the allocation types must be changed to GFP_ATOMIC.

Fixes: def3117493ea ("genl: Allow concurrent genl callbacks")
Cc: Pravin B Shelar 
Cc: Subash Abhinov Kasiviswanathan 
Signed-off-by: Sean Tranchetti 
---
 net/netlink/genetlink.c | 29 -
 1 file changed, 20 insertions(+), 9 deletions(-)

diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c
index 55ee680..79e3b1b 100644
--- a/net/netlink/genetlink.c
+++ b/net/netlink/genetlink.c
@@ -155,14 +155,14 @@ static int genl_allocate_reserve_groups(int n_groups, int *first_id)
size_t nlen = new_longs * sizeof(unsigned long);
 
if (mc_groups == &mc_group_start) {
-   new_groups = kzalloc(nlen, GFP_KERNEL);
+   new_groups = kzalloc(nlen, GFP_ATOMIC);
if (!new_groups)
return -ENOMEM;
mc_groups = new_groups;
*mc_groups = mc_group_start;
} else {
new_groups = krealloc(mc_groups, nlen,
- GFP_KERNEL);
+ GFP_ATOMIC);
if (!new_groups)
return -ENOMEM;
mc_groups = new_groups;
@@ -229,7 +229,6 @@ static int genl_validate_assign_mc_groups(struct genl_family *family)
if (family->netnsok) {
struct net *net;
 
-   netlink_table_grab();
rcu_read_lock();
for_each_net_rcu(net) {
err = __netlink_change_ngroups(net->genl_sock,
@@ -245,10 +244,9 @@ static int genl_validate_assign_mc_groups(struct genl_family *family)
}
}
rcu_read_unlock();
-   netlink_table_ungrab();
} else {
-   err = netlink_change_ngroups(init_net.genl_sock,
-mc_groups_longs * BITS_PER_LONG);
+   err = __netlink_change_ngroups(init_net.genl_sock,
+  mc_groups_longs * BITS_PER_LONG);
}
 
if (groups_allocated && err) {
@@ -264,7 +262,6 @@ static void genl_unregister_mc_groups(const struct genl_family *family)
struct net *net;
int i;
 
-   netlink_table_grab();
rcu_read_lock();
for_each_net_rcu(net) {
for (i = 0; i < family->n_mcgrps; i++)
@@ -328,6 +325,10 @@ int genl_register_family(struct genl_family *family)
if (err)
return err;
 
+   /* Acquire netlink table lock before any GENL-specific locks to ensure
+* sync with any netlink operations making calls into the GENL code.
+*/
+   netlink_table_grab();
genl_lock_all();
 
if (genl_family_find_byname(family->name)) {
@@ -354,7 +355,7 @@ int genl_register_family(struct genl_family *family)
if (family->maxattr && !family->parallel_ops) {

[PATCH net-next v2] udp: Avoid post-GRO UDP checksum recalculation

2019-05-28 Thread Sean Tranchetti
Currently, when resegmenting an unexpected UDP GRO packet, the full UDP
checksum will be calculated for every new SKB created by skb_segment()
because the netdev features passed in by udp_rcv_segment() lack any
information about checksum offload capabilities.

Usually, we have no need to perform this calculation again, as
  1) The GRO implementation guarantees that any packets making it to the
 udp_rcv_segment() function had correct checksums, and, more
 importantly,
  2) Upon the successful return of udp_rcv_segment(), we immediately pull
 the UDP header off and either queue the segment to the socket or
 hand it off to a new protocol handler.

Unless userspace has set the IP_CHECKSUM sockopt to indicate that they
want the final checksum values, we can pass the needed netdev feature
flags to __skb_gso_segment() to avoid checksumming each segment in
skb_segment().

Fixes: cf329aa42b66 ("udp: cope with UDP GRO packet misdirection")
Cc: Paolo Abeni 
Cc: Subash Abhinov Kasiviswanathan 
Signed-off-by: Sean Tranchetti 
---
 include/net/udp.h | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/net/udp.h b/include/net/udp.h
index d8ce937..dbe030d 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -471,12 +471,19 @@ struct udp_iter_state {
 static inline struct sk_buff *udp_rcv_segment(struct sock *sk,
  struct sk_buff *skb, bool ipv4)
 {
+   netdev_features_t features = NETIF_F_SG;
struct sk_buff *segs;
 
+   /* Avoid csum recalculation by skb_segment unless userspace explicitly
+* asks for the final checksum values
+*/
+   if (!inet_get_convert_csum(sk))
+   features |= NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM;
+
/* the GSO CB lays after the UDP one, no need to save and restore any
 * CB fragment
 */
-   segs = __skb_gso_segment(skb, NETIF_F_SG, false);
+   segs = __skb_gso_segment(skb, features, false);
if (unlikely(IS_ERR_OR_NULL(segs))) {
int segs_nr = skb_shinfo(skb)->gso_segs;
 
-- 
1.9.1



[PATCH net-next] udp: Avoid post-GRO UDP checksum recalculation

2019-05-23 Thread Sean Tranchetti
Currently, when resegmenting an unexpected UDP GRO packet, the full UDP
checksum will be calculated for every new SKB created by skb_segment()
because the netdev features passed in by udp_rcv_segment() lack any
information about checksum offload capabilities.

We have no need to perform this calculation again, as
  1) The GRO implementation guarantees that any packets making it to the
 udp_rcv_segment() function had correct checksums, and, more
 importantly,
  2) Upon the successful return of udp_rcv_segment(), we immediately pull
 the UDP header off and either queue the segment to the socket or
 hand it off to a new protocol handler. In either case, the checksum
 is not needed.

Fixes: cf329aa42b66 ("udp: cope with UDP GRO packet misdirection")
Cc: Paolo Abeni 
Cc: Subash Abhinov Kasiviswanathan 
Signed-off-by: Sean Tranchetti 
---
 include/net/udp.h | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/net/udp.h b/include/net/udp.h
index d8ce937..6164d5c 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -472,11 +472,15 @@ static inline struct sk_buff *udp_rcv_segment(struct sock *sk,
  struct sk_buff *skb, bool ipv4)
 {
struct sk_buff *segs;
+   netdev_features_t features = NETIF_F_SG;
+
+   /* Avoid csum recalculation in skb_segment() */
+   features |= NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM;
 
/* the GSO CB lays after the UDP one, no need to save and restore any
 * CB fragment
 */
-   segs = __skb_gso_segment(skb, NETIF_F_SG, false);
+   segs = __skb_gso_segment(skb, features, false);
if (unlikely(IS_ERR_OR_NULL(segs))) {
int segs_nr = skb_shinfo(skb)->gso_segs;
 
-- 
1.9.1



[PATCH net] af_key: unconditionally clone on broadcast

2019-02-07 Thread Sean Tranchetti
Attempting to avoid cloning the skb when broadcasting by inflating
the refcount with sock_hold/sock_put while under RCU lock is dangerous
and violates RCU principles. It leads to subtle race conditions when
attempting to free the SKB, as we may reference sockets that have
already been freed by the stack.

Unable to handle kernel paging request at virtual address 6b6b6b6b6b6c4b
[006b6b6b6b6b6c4b] address between user and kernel address ranges
Internal error: Oops: 9604 [#1] PREEMPT SMP
task: fff78f65b380 task.stack: ff8049a88000
pc : sock_rfree+0x38/0x6c
lr : skb_release_head_state+0x6c/0xcc
Process repro (pid: 7117, stack limit = 0xff8049a88000)
Call trace:
sock_rfree+0x38/0x6c
skb_release_head_state+0x6c/0xcc
skb_release_all+0x1c/0x38
__kfree_skb+0x1c/0x30
kfree_skb+0xd0/0xf4
pfkey_broadcast+0x14c/0x18c
pfkey_sendmsg+0x1d8/0x408
sock_sendmsg+0x44/0x60
___sys_sendmsg+0x1d0/0x2a8
__sys_sendmsg+0x64/0xb4
SyS_sendmsg+0x34/0x4c
el0_svc_naked+0x34/0x38
Kernel panic - not syncing: Fatal exception

Suggested-by: Eric Dumazet 
Signed-off-by: Sean Tranchetti 
---
Realized I never actually sent this patch out after testing the changes
Eric recommended. Whoops. Better late than never, I suppose...

 net/key/af_key.c | 40 +++-
 1 file changed, 15 insertions(+), 25 deletions(-)

diff --git a/net/key/af_key.c b/net/key/af_key.c
index 655c787..e800891 100644
--- a/net/key/af_key.c
+++ b/net/key/af_key.c
@@ -196,30 +196,22 @@ static int pfkey_release(struct socket *sock)
return 0;
 }
 
-static int pfkey_broadcast_one(struct sk_buff *skb, struct sk_buff **skb2,
-  gfp_t allocation, struct sock *sk)
+static int pfkey_broadcast_one(struct sk_buff *skb, gfp_t allocation,
+  struct sock *sk)
 {
int err = -ENOBUFS;
 
-   sock_hold(sk);
-   if (*skb2 == NULL) {
-   if (refcount_read(&skb->users) != 1) {
-   *skb2 = skb_clone(skb, allocation);
-   } else {
-   *skb2 = skb;
-   refcount_inc(&skb->users);
-   }
-   }
-   if (*skb2 != NULL) {
-   if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf) {
-   skb_set_owner_r(*skb2, sk);
-   skb_queue_tail(&sk->sk_receive_queue, *skb2);
-   sk->sk_data_ready(sk);
-   *skb2 = NULL;
-   err = 0;
-   }
+   if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf)
+   return err;
+
+   skb = skb_clone(skb, allocation);
+
+   if (skb) {
+   skb_set_owner_r(skb, sk);
+   skb_queue_tail(&sk->sk_receive_queue, skb);
+   sk->sk_data_ready(sk);
+   err = 0;
}
-   sock_put(sk);
return err;
 }
 
@@ -234,7 +226,6 @@ static int pfkey_broadcast(struct sk_buff *skb, gfp_t allocation,
 {
struct netns_pfkey *net_pfkey = net_generic(net, pfkey_net_id);
struct sock *sk;
-   struct sk_buff *skb2 = NULL;
int err = -ESRCH;
 
/* XXX Do we need something like netlink_overrun?  I think
@@ -253,7 +244,7 @@ static int pfkey_broadcast(struct sk_buff *skb, gfp_t allocation,
 * socket.
 */
if (pfk->promisc)
-   pfkey_broadcast_one(skb, &skb2, GFP_ATOMIC, sk);
+   pfkey_broadcast_one(skb, GFP_ATOMIC, sk);
 
/* the exact target will be processed later */
if (sk == one_sk)
@@ -268,7 +259,7 @@ static int pfkey_broadcast(struct sk_buff *skb, gfp_t allocation,
continue;
}
 
-   err2 = pfkey_broadcast_one(skb, &skb2, GFP_ATOMIC, sk);
+   err2 = pfkey_broadcast_one(skb, GFP_ATOMIC, sk);
 
/* Error is cleared after successful sending to at least one
 * registered KM */
@@ -278,9 +269,8 @@ static int pfkey_broadcast(struct sk_buff *skb, gfp_t allocation,
rcu_read_unlock();
 
if (one_sk != NULL)
-   err = pfkey_broadcast_one(skb, &skb2, allocation, one_sk);
+   err = pfkey_broadcast_one(skb, allocation, one_sk);
 
-   kfree_skb(skb2);
kfree_skb(skb);
return err;
 }
-- 
1.9.1



[PATCH net v2] net: udp: fix handling of CHECKSUM_COMPLETE packets

2018-10-23 Thread Sean Tranchetti
Current handling of CHECKSUM_COMPLETE packets by the UDP stack is
incorrect for any packet that has an incorrect checksum value.

udp4/6_csum_init() will both make a call to
__skb_checksum_validate_complete() to initialize/validate the csum
field when receiving a CHECKSUM_COMPLETE packet. When this packet
fails validation, skb->csum will be overwritten with the pseudoheader
checksum so the packet can be fully validated by software, but the
skb->ip_summed value will be left as CHECKSUM_COMPLETE so that way
the stack can later warn the user about their hardware spewing bad
checksums. Unfortunately, leaving the SKB in this state can cause
problems later on in the checksum calculation.

Since the packet is still marked as CHECKSUM_COMPLETE,
udp_csum_pull_header() will SUBTRACT the checksum of the UDP header
from skb->csum instead of adding it, leaving us with a garbage value
in that field. Once we try to copy the packet to userspace in the
udp4/6_recvmsg(), we'll make a call to skb_copy_and_csum_datagram_msg()
to checksum the packet data and add it to the garbage skb->csum value
to perform our final validation check.

Since the value we're validating is not the proper checksum, it's possible
that the folded value could come out to 0, causing us not to drop the
packet. Instead, we believe that the packet was checksummed incorrectly
by hardware since skb->ip_summed is still CHECKSUM_COMPLETE, and we attempt
to warn the user with netdev_rx_csum_fault(skb->dev).

Unfortunately, since this is the UDP path, skb->dev has been overwritten
by skb->dev_scratch and is no longer a valid pointer, so we end up
reading invalid memory.

This patch addresses this problem in two ways:
1) Do not use the dev pointer when calling netdev_rx_csum_fault()
   from skb_copy_and_csum_datagram_msg(). Since this gets called
   from the UDP path where skb->dev has been overwritten, we have
   no way of knowing if the pointer is still valid. Also for the
   sake of consistency with the other uses of
   netdev_rx_csum_fault(), don't attempt to call it if the
   packet was checksummed by software.

2) Add better CHECKSUM_COMPLETE handling to udp4/6_csum_init().
   If we receive a packet that's CHECKSUM_COMPLETE that fails
   verification (i.e. skb->csum_valid == 0), check who performed
   the calculation. It's possible that the checksum was done in
   software by the network stack earlier (such as Netfilter's
   CONNTRACK module), and if that says the checksum is bad,
   we can drop the packet immediately instead of waiting until
   we try and copy it to userspace. Otherwise, we need to
   mark the SKB as CHECKSUM_NONE, since the skb->csum field
   no longer contains the full packet checksum after the
   call to __skb_checksum_validate_complete().

Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing")
Fixes: c84d949057ca ("udp: copy skb->truesize in the first cache line")
Cc: Sam Kumar 
Cc: Eric Dumazet 
Signed-off-by: Sean Tranchetti 
---
 net/core/datagram.c |  5 +++--
 net/ipv4/udp.c  | 20 ++--
 net/ipv6/ip6_checksum.c | 20 ++--
 3 files changed, 39 insertions(+), 6 deletions(-)

diff --git a/net/core/datagram.c b/net/core/datagram.c
index 9aac0d6..df16493 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -808,8 +808,9 @@ int skb_copy_and_csum_datagram_msg(struct sk_buff *skb,
return -EINVAL;
}
 
-   if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE))
-   netdev_rx_csum_fault(skb->dev);
+   if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE) &&
+   !skb->csum_complete_sw)
+   netdev_rx_csum_fault(NULL);
}
return 0;
 fault:
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index c32a4c1..f8183fd 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2120,8 +2120,24 @@ static inline int udp4_csum_init(struct sk_buff *skb, struct udphdr *uh,
/* Note, we are only interested in != 0 or == 0, thus the
 * force to int.
 */
-   return (__force int)skb_checksum_init_zero_check(skb, proto, uh->check,
-inet_compute_pseudo);
+   err = (__force int)skb_checksum_init_zero_check(skb, proto, uh->check,
+   inet_compute_pseudo);
+   if (err)
+   return err;
+
+   if (skb->ip_summed == CHECKSUM_COMPLETE && !skb->csum_valid) {
+   /* If SW calculated the value, we know it's bad */
+   if (skb->csum_complete_sw)
+   return 1;
+
+   /* HW says the value is

[PATCH net] net: udp: fix handling of CHECKSUM_COMPLETE packets

2018-10-11 Thread Sean Tranchetti
Current handling of CHECKSUM_COMPLETE packets by the UDP stack is
incorrect for any packet that has an incorrect checksum value.

udp4/6_csum_init() will both make a call to
__skb_checksum_validate_complete() to initialize/validate the csum
field when receiving a CHECKSUM_COMPLETE packet. When this packet
fails validation, skb->csum will be overwritten with the pseudoheader
checksum so the packet can be fully validated by software, but the
skb->ip_summed value will be left as CHECKSUM_COMPLETE so that way
the stack can later warn the user about their hardware spewing bad
checksums. Unfortunately, leaving the SKB in this state can cause
problems later on in the checksum calculation.

Since the packet is still marked as CHECKSUM_COMPLETE,
udp_csum_pull_header() will SUBTRACT the checksum of the UDP header
from skb->csum instead of adding it, leaving us with a garbage value
in that field. Once we try to copy the packet to userspace in the
udp4/6_recvmsg(), we'll make a call to skb_copy_and_csum_datagram_msg()
to checksum the packet data and add it to the garbage skb->csum value
to perform our final validation check.

Since the value we're validating is not the proper checksum, it's possible
that the folded value could come out to 0, causing us not to drop the
packet. Instead, we believe that the packet was checksummed incorrectly
by hardware since skb->ip_summed is still CHECKSUM_COMPLETE, and we attempt
to warn the user with netdev_rx_csum_fault(skb->dev).

Unfortunately, since this is the UDP path, skb->dev has been overwritten
by skb->dev_scratch and is no longer a valid pointer, so we end up
reading invalid memory.

This patch addresses this problem in two ways:
1) Remove the invalid netdev_rx_csum_fault(skb->dev) call from
   skb_copy_and_csum_datagram_msg(). Since this is used by UDP
   where skb->dev is invalid, trying to warn doesn't make sense.

2) Add better CHECKSUM_COMPLETE handling to udp4/6_csum_init().
   If we receive a packet that's CHECKSUM_COMPLETE that fails
   verification (i.e. skb->csum_valid == 0), check who performed
   the calculation. It's possible that the checksum was done in
   software by the network stack earlier (such as Netfilter's
   CONNTRACK module), and if that says the checksum is bad,
   we can drop the packet immediately instead of waiting until
   we try and copy it to userspace. Otherwise, we need to
   mark the SKB as CHECKSUM_NONE, since the skb->csum field
   no longer contains the full packet checksum after the
   call to __skb_checksum_validate_complete().

Signed-off-by: Sean Tranchetti 
---
 net/core/datagram.c |  3 ---
 net/ipv4/udp.c  | 20 ++--
 net/ipv6/ip6_checksum.c | 20 ++--
 3 files changed, 36 insertions(+), 7 deletions(-)

diff --git a/net/core/datagram.c b/net/core/datagram.c
index 9aac0d6..ab679c5 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -807,9 +807,6 @@ int skb_copy_and_csum_datagram_msg(struct sk_buff *skb,
iov_iter_revert(&msg->msg_iter, chunk);
return -EINVAL;
}
-
-   if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE))
-   netdev_rx_csum_fault(skb->dev);
}
return 0;
 fault:
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index c32a4c1..f8183fd 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2120,8 +2120,24 @@ static inline int udp4_csum_init(struct sk_buff *skb, struct udphdr *uh,
/* Note, we are only interested in != 0 or == 0, thus the
 * force to int.
 */
-   return (__force int)skb_checksum_init_zero_check(skb, proto, uh->check,
-inet_compute_pseudo);
+   err = (__force int)skb_checksum_init_zero_check(skb, proto, uh->check,
+   inet_compute_pseudo);
+   if (err)
+   return err;
+
+   if (skb->ip_summed == CHECKSUM_COMPLETE && !skb->csum_valid) {
+   /* If SW calculated the value, we know it's bad */
+   if (skb->csum_complete_sw)
+   return 1;
+
+   /* HW says the value is bad. Let's validate that.
+* skb->csum is no longer the full packet checksum,
+* so don't treat it as such.
+*/
+   skb_checksum_complete_unset(skb);
+   }
+
+   return 0;
 }
 
 /* wrapper for udp_queue_rcv_skb tacking care of csum conversion and
diff --git a/net/ipv6/ip6_checksum.c b/net/ipv6/ip6_checksum.c
index 547515e..3777170 100644
--- a/net/ipv6/ip6_checksum.c
+++ b/net/ipv6/ip6_checksum.c
@@ -88,8 +88,24 @@ int udp6_csum_init(struct sk_buff *skb, struct udphdr *uh, int

[PATCH net] netlabel: check for IPV4MASK in addrinfo_get

2018-09-20 Thread Sean Tranchetti
netlbl_unlabel_addrinfo_get() assumes that if it finds the
NLBL_UNLABEL_A_IPV4ADDR attribute, it must also have the
NLBL_UNLABEL_A_IPV4MASK attribute as well. However, this is
not necessarily the case as the current checks in
netlbl_unlabel_staticadd() and friends are not sufficient to
enforce this.

If passed a netlink message with NLBL_UNLABEL_A_IPV4ADDR,
NLBL_UNLABEL_A_IPV6ADDR, and NLBL_UNLABEL_A_IPV6MASK attributes,
these functions will all call netlbl_unlabel_addrinfo_get() which
will then attempt to dereference NULL when fetching the non-existent
NLBL_UNLABEL_A_IPV4MASK attribute:

Unable to handle kernel NULL pointer dereference at virtual address 0
Process unlab (pid: 31762, stack limit = 0xff80502d8000)
Call trace:
netlbl_unlabel_addrinfo_get+0x44/0xd8
netlbl_unlabel_staticremovedef+0x98/0xe0
genl_rcv_msg+0x354/0x388
netlink_rcv_skb+0xac/0x118
genl_rcv+0x34/0x48
netlink_unicast+0x158/0x1f0
netlink_sendmsg+0x32c/0x338
sock_sendmsg+0x44/0x60
___sys_sendmsg+0x1d0/0x2a8
__sys_sendmsg+0x64/0xb4
SyS_sendmsg+0x34/0x4c
el0_svc_naked+0x34/0x38
Code: 51001149 7100113f 54a0 f9401508 (79400108)
---[ end trace f6438a488e737143 ]---
Kernel panic - not syncing: Fatal exception

Signed-off-by: Sean Tranchetti 
---
 net/netlabel/netlabel_unlabeled.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/netlabel/netlabel_unlabeled.c b/net/netlabel/netlabel_unlabeled.c
index c070dfc..c92894c 100644
--- a/net/netlabel/netlabel_unlabeled.c
+++ b/net/netlabel/netlabel_unlabeled.c
@@ -781,7 +781,8 @@ static int netlbl_unlabel_addrinfo_get(struct genl_info *info,
 {
u32 addr_len;
 
-   if (info->attrs[NLBL_UNLABEL_A_IPV4ADDR]) {
+   if (info->attrs[NLBL_UNLABEL_A_IPV4ADDR] &&
+   info->attrs[NLBL_UNLABEL_A_IPV4MASK]) {
addr_len = nla_len(info->attrs[NLBL_UNLABEL_A_IPV4ADDR]);
if (addr_len != sizeof(struct in_addr) &&
addr_len != nla_len(info->attrs[NLBL_UNLABEL_A_IPV4MASK]))
-- 
1.9.1



[PATCH net] af_key: free SKBs under RCU protection

2018-09-19 Thread Sean Tranchetti
pfkey_broadcast() can make calls to pfkey_broadcast_one(), which
will clone or copy the passed-in SKB. This new SKB is assigned
sock_rfree() as its destructor, which requires that the socket
referenced by the SKB remain valid when the SKB is freed.

If this SKB is ever passed to pfkey_broadcast() again by some other
function (such as pfkey_dump() or pfkey_promisc), it will then be
freed there. However, since this free occurs outside of RCU protection,
it is possible that userspace could close the socket and trigger
pfkey_release() to free the socket before sock_rfree() can run, creating
the following race condition:

1: An SKB belonging to the pfkey socket is passed to pfkey_broadcast().
   It performs the broadcast to any other sockets, and calls
   rcu_read_unlock(), but does not yet reach kfree_skb().
2: Userspace closes the socket, triggering pfkey_release(). Since no one
   holds the RCU lock, synchronize_rcu() returns and it is allowed to
   continue. It calls sock_put() to free the socket.
3: pfkey_broadcast() now calls kfree_skb() on the original SKB it was
   passed, triggering a call to sock_rfree(). This function now accesses
   the freed struct sock * via skb->sk, and attempts to update invalid
   memory.

By ensuring that pfkey_broadcast() also frees the SKBs while it holds
the RCU lock, we can ensure that the socket will remain valid when the SKB
is freed, avoiding crashes like the following:

Unable to handle kernel paging request at virtual address 6b6b6b6b6b6c4b
[006b6b6b6b6b6c4b] address between user and kernel address ranges
Internal error: Oops: 9604 [#1] PREEMPT SMP
task: fff78f65b380 task.stack: ff8049a88000
pc : sock_rfree+0x38/0x6c
lr : skb_release_head_state+0x6c/0xcc
Process repro (pid: 7117, stack limit = 0xff8049a88000)
Call trace:
sock_rfree+0x38/0x6c
skb_release_head_state+0x6c/0xcc
skb_release_all+0x1c/0x38
__kfree_skb+0x1c/0x30
kfree_skb+0xd0/0xf4
pfkey_broadcast+0x14c/0x18c
pfkey_sendmsg+0x1d8/0x408
sock_sendmsg+0x44/0x60
___sys_sendmsg+0x1d0/0x2a8
__sys_sendmsg+0x64/0xb4
SyS_sendmsg+0x34/0x4c
el0_svc_naked+0x34/0x38
Kernel panic - not syncing: Fatal exception

Signed-off-by: Sean Tranchetti 
---
 net/key/af_key.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/key/af_key.c b/net/key/af_key.c
index 9d61266..dd257c7 100644
--- a/net/key/af_key.c
+++ b/net/key/af_key.c
@@ -275,13 +275,13 @@ static int pfkey_broadcast(struct sk_buff *skb, gfp_t allocation,
if ((broadcast_flags & BROADCAST_REGISTERED) && err)
err = err2;
}
-   rcu_read_unlock();
 
if (one_sk != NULL)
err = pfkey_broadcast_one(skb, &skb2, allocation, one_sk);
 
kfree_skb(skb2);
kfree_skb(skb);
+   rcu_read_unlock();
return err;
 }
 
-- 
1.9.1



[PATCH net] xfrm: validate template mode

2018-09-19 Thread Sean Tranchetti
XFRM mode parameters passed as part of the user templates
in the IP_XFRM_POLICY socket option are never properly validated.
Passing values other than valid XFRM modes can cause
stack-out-of-bounds reads to occur later in the XFRM processing:

[  140.535608] ==================================================================
[  140.543058] BUG: KASAN: stack-out-of-bounds in xfrm_state_find+0x17e4/0x1cc4
[  140.550306] Read of size 4 at addr ffc0238a7a58 by task repro/5148
[  140.557369]
[  140.558927] Call trace:
[  140.558936] dump_backtrace+0x0/0x388
[  140.558940] show_stack+0x24/0x30
[  140.558946] __dump_stack+0x24/0x2c
[  140.558949] dump_stack+0x8c/0xd0
[  140.558956] print_address_description+0x74/0x234
[  140.558960] kasan_report+0x240/0x264
[  140.558963] __asan_report_load4_noabort+0x2c/0x38
[  140.558967] xfrm_state_find+0x17e4/0x1cc4
[  140.558971] xfrm_resolve_and_create_bundle+0x40c/0x1fb8
[  140.558975] xfrm_lookup+0x238/0x1444
[  140.558977] xfrm_lookup_route+0x48/0x11c
[  140.558984] ip_route_output_flow+0x88/0xc4
[  140.558991] raw_sendmsg+0xa74/0x266c
[  140.558996] inet_sendmsg+0x258/0x3b0
[  140.559002] sock_sendmsg+0xbc/0xec
[  140.559005] SyS_sendto+0x3a8/0x5a8
[  140.559008] el0_svc_naked+0x34/0x38
[  140.559009]
[  140.592245] page dumped because: kasan: bad access detected
[  140.597981] page_owner info is not active (free page?)
[  140.603267]
[  140.653503] ==================================================================

Signed-off-by: Sean Tranchetti 
---
 net/xfrm/xfrm_user.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index 4791aa8..ab2f280 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -1480,6 +1480,9 @@ static int validate_tmpl(int nr, struct xfrm_user_tmpl *ut, u16 family)
(ut[i].family != prev_family))
return -EINVAL;
 
+   if (ut[i].mode >= XFRM_MODE_MAX)
+   return -EINVAL;
+
prev_family = ut[i].family;
 
switch (ut[i].family) {
-- 
1.9.1



[PATCH net-next] net: gro: Initialize backlog NAPI's gro_list

2018-07-20 Thread Sean Tranchetti
When using RPS, the target CPU uses the backlog NAPI struct.
This NAPI struct needs its gro_list initialized explicitly, since
the list is normally initialized by netif_napi_add() when a NAPI
instance is registered on a netdevice.

Unable to handle kernel NULL pointer dereference at virtual address 0008
Kernel BUG at ff9808909310 [verbose debug info unavailable]
PC is at napi_gro_flush+0x74/0x104
LR is at napi_gro_flush+0x70/0x104
[] napi_gro_flush+0x74/0x104
[] net_rx_action+0x3f8/0x4d4
[] __do_softirq+0x144/0x468
[] irq_exit+0x118/0x144
[] __handle_domain_irq+0x94/0xf8
[] gic_handle_irq+0xd0/0x1b4

Fixes: d4546c2509b1 ("net: Convert GRO SKB handling to list_head.")
Signed-off-by: Sean Tranchetti 
Signed-off-by: Subash Abhinov Kasiviswanathan 
---
 net/core/dev.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index 4f8b92d..6b76745 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -9556,6 +9556,7 @@ static int __init net_dev_init(void)
 
sd->backlog.poll = process_backlog;
sd->backlog.weight = weight_p;
+   INIT_LIST_HEAD(&sd->backlog.gro_list);
}
 
dev_boot_phase = 0;
-- 
1.9.1



[PATCH net-next] udp: Fix kernel panic in UDP GSO path

2018-05-10 Thread Sean Tranchetti
Using GSO in the UDP path on a device with the scatter-gather
netdevice feature disabled will result in a kernel panic with the
following call stack:

kernel BUG at net/core/skbuff.c:104!
Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
PC is at skb_panic+0x4c/0x54
LR is at skb_panic+0x4c/0x54
Process udpgso_bench_tx (pid: 4078, stack limit = 0xff8048de8000)
[] skb_panic+0x4c/0x54
[] skb_copy_bits+0x0/0x244
[] __ip_append_data+0x230/0x814
[] ip_make_skb+0xe4/0x178
[] udp_sendmsg+0x828/0x888
[] inet_sendmsg+0xe4/0x130
[] ___sys_sendmsg+0x1d8/0x2c0
[] SyS_sendmsg+0x90/0xe0

This panic is the result of allocating small SKBs for the newly
segmented SKB. If the scatter-gather feature is disabled, the code
attempts to call skb_put() on a small SKB with an argument of nearly
the entire unsegmented SKB length.

After this patch, attempting to use GSO with scatter-gather
disabled will result in -EINVAL being returned.

Fixes: 15e36f5b8e98 ("udp: paged allocation with gso")
Signed-off-by: Sean Tranchetti 
Signed-off-by: Subash Abhinov Kasiviswanathan 
---
 net/ipv4/ip_output.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index b5e21eb..0d63690 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1054,8 +1054,16 @@ static int __ip_append_data(struct sock *sk,
copy = length;
 
if (!(rt->dst.dev->features&NETIF_F_SG)) {
+   struct sk_buff *tmp;
unsigned int off;
 
+   if (paged) {
+   err = -EINVAL;
+   while ((tmp = __skb_dequeue(queue)) != NULL)
+   kfree_skb(tmp);
+   goto error;
+   }
+
off = skb->len;
if (getfrag(from, skb_put(skb, copy),
offset, copy, off, skb) < 0) {
-- 
1.9.1



[PATCH net-next] udp: Complement partial checksum for GSO packet

2018-04-30 Thread Sean Tranchetti
Using the udp_v4_check() function to calculate the pseudo header
for the newly segmented UDP packets results in assigning the complement
of the value to the UDP header checksum field.

Always undo the complement of the partial checksum value in order to
match the case where GSO is not used on the UDP transmit path.

Fixes: ee80d1ebe5ba ("udp: add udp gso")
Signed-off-by: Sean Tranchetti 
Signed-off-by: Subash Abhinov Kasiviswanathan 
---
 net/ipv4/udp_offload.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index f78fb36..0062570 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -223,6 +223,7 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
csum_replace2(&uh->check, htons(mss),
  htons(seg->len - hdrlen - sizeof(*uh)));
 
+   uh->check = ~uh->check;
seg->destructor = sock_wfree;
seg->sk = sk;
sum_truesize += seg->truesize;
-- 
1.9.1