[net PATCH 2/2] tools: selftests: add test for changing routes with PTMU exceptions
From: Sean Tranchetti Adds new 2 new tests to the PTMU script: pmtu_ipv4/6_route_change. These tests explicitly test for a recently discovered problem in the IPv6 routing framework where PMTU exceptions were not properly released when replacing a route via "ip route change ...". After creating PMTU exceptions, the route from the device A to R1 will be replaced with a new route, then device A will be deleted. If the PMTU exceptions were properly cleaned up by the kernel, this device deletion will succeed. Otherwise, the unregistration of the device will stall, and messages such as the following will be logged in dmesg: unregister_netdevice: waiting for veth_A-R1 to become free. Usage count = 4 Signed-off-by: Sean Tranchetti --- tools/testing/selftests/net/pmtu.sh | 71 +++-- 1 file changed, 69 insertions(+), 2 deletions(-) diff --git a/tools/testing/selftests/net/pmtu.sh b/tools/testing/selftests/net/pmtu.sh index 464e31e..64cd2e2 100755 --- a/tools/testing/selftests/net/pmtu.sh +++ b/tools/testing/selftests/net/pmtu.sh @@ -162,7 +162,15 @@ # - list_flush_ipv6_exception # Using the same topology as in pmtu_ipv6, create exceptions, and check # they are shown when listing exception caches, gone after flushing them - +# +# - pmtu_ipv4_route_change +# Use the same topology as in pmtu_ipv4, but issue a route replacement +# command and delete the corresponding device afterward. This tests for +# proper cleanup of the PMTU exceptions by the route replacement path. +# Device unregistration should complete successfully +# +# - pmtu_ipv6_route_change +# Same as above but with IPv6 # Kselftest framework requirement - SKIP code is 4. ksft_skip=4 @@ -224,7 +232,9 @@ tests=" cleanup_ipv4_exception ipv4: cleanup of cached exceptions 1 cleanup_ipv6_exception ipv6: cleanup of cached exceptions 1 list_flush_ipv4_exception ipv4: list and flush cached exceptions 1 - list_flush_ipv6_exception ipv6: list and flush cached exceptions 1" + list_flush_ipv6_exception ipv6: list and flush cached exceptions 1 + pmtu_ipv4_route_change ipv4: PMTU exception w/route replace 1 + pmtu_ipv6_route_change ipv6: PMTU exception w/route replace 1" NS_A="ns-A" NS_B="ns-B" @@ -1782,6 +1792,63 @@ test_list_flush_ipv6_exception() { return ${fail} } +test_pmtu_ipvX_route_change() { + family=${1} + + setup namespaces routing || return 2 + trace "${ns_a}" veth_A-R1"${ns_r1}" veth_R1-A \ + "${ns_r1}" veth_R1-B"${ns_b}" veth_B-R1 \ + "${ns_a}" veth_A-R2"${ns_r2}" veth_R2-A \ + "${ns_r2}" veth_R2-B"${ns_b}" veth_B-R2 + + if [ ${family} -eq 4 ]; then + ping=ping + dst1="${prefix4}.${b_r1}.1" + dst2="${prefix4}.${b_r2}.1" + gw="${prefix4}.${a_r1}.2" + else + ping=${ping6} + dst1="${prefix6}:${b_r1}::1" + dst2="${prefix6}:${b_r2}::1" + gw="${prefix6}:${a_r1}::2" + fi + + # Set up initial MTU values + mtu "${ns_a}" veth_A-R1 2000 + mtu "${ns_r1}" veth_R1-A 2000 + mtu "${ns_r1}" veth_R1-B 1400 + mtu "${ns_b}" veth_B-R1 1400 + + mtu "${ns_a}" veth_A-R2 2000 + mtu "${ns_r2}" veth_R2-A 2000 + mtu "${ns_r2}" veth_R2-B 1500 + mtu "${ns_b}" veth_B-R2 1500 + + # Create route exceptions + run_cmd ${ns_a} ${ping} -q -M want -i 0.1 -w 1 -s 1800 ${dst1} + run_cmd ${ns_a} ${ping} -q -M want -i 0.1 -w 1 -s 1800 ${dst2} + + # Check that exceptions have been created with the correct PMTU + pmtu_1="$(route_get_dst_pmtu_from_exception "${ns_a}" ${dst1})" + check_pmtu_value "1400" "${pmtu_1}" "exceeding MTU" || return 1 + pmtu_2="$(route_get_dst_pmtu_from_exception "${ns_a}" ${dst2})" + check_pmtu_value "1500" "${pmtu_2}" "exceeding MTU" || return 1 + + # Replace the route from A to R1 + run_cmd ${ns_a} ip route change default via ${gw} + + # Delete the device in A + run_cmd ${ns_a} ip link del "veth_A-R1" +} + +test_pmtu_ipv4_route_change() { + test_pmtu_ipvX_route_change 4 +} + +test_pmtu_ipv6_route_change() { + test_pmtu_ipvX_route_change 6 +} + usage() { echo echo "$0 [OPTIONS] [TEST]..." -- 2.7.4
[net PATCH 1/2] net: ipv6: fib: flush exceptions when purging route
From: Sean Tranchetti Route removal is handled by two code paths. The main removal path is via fib6_del_route() which will handle purging any PMTU exceptions from the cache, removing all per-cpu copies of the DST entry used by the route, and releasing the fib6_info struct. The second removal location is during fib6_add_rt2node() during a route replacement operation. This path also calls fib6_purge_rt() to handle cleaning up the per-cpu copies of the DST entries and releasing the fib6_info associated with the older route, but it does not flush any PMTU exceptions that the older route had. Since the older route is removed from the tree during the replacement, we lose any way of accessing it again. As these lingering DSTs and the fib6_info struct are holding references to the underlying netdevice struct as well, unregistering that device from the kernel can never complete. Signed-off-by: Sean Tranchetti --- net/ipv6/ip6_fib.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c index 605cdd3..f43e275 100644 --- a/net/ipv6/ip6_fib.c +++ b/net/ipv6/ip6_fib.c @@ -1025,6 +1025,8 @@ static void fib6_purge_rt(struct fib6_info *rt, struct fib6_node *fn, { struct fib6_table *table = rt->fib6_table; + /* Flush all cached dst in exception table */ + rt6_flush_exceptions(rt); fib6_drop_pcpu_from(rt, table); if (rt->nh && !list_empty(&rt->nh_list)) @@ -1927,9 +1929,6 @@ static void fib6_del_route(struct fib6_table *table, struct fib6_node *fn, net->ipv6.rt6_stats->fib_rt_entries--; net->ipv6.rt6_stats->fib_discarded_routes++; - /* Flush all cached dst in exception table */ - rt6_flush_exceptions(rt); - /* Reset round-robin state, if necessary */ if (rcu_access_pointer(fn->rr_ptr) == rt) fn->rr_ptr = NULL; -- 2.7.4
[PATCH net v2] genetlink: remove genl_bind
A potential deadlock can occur during registering or unregistering a new generic netlink family between the main nl_table_lock and the cb_lock where each thread wants the lock held by the other, as demonstrated below. 1) Thread 1 is performing a netlink_bind() operation on a socket. As part of this call, it will call netlink_lock_table(), incrementing the nl_table_users count to 1. 2) Thread 2 is registering (or unregistering) a genl_family via the genl_(un)register_family() API. The cb_lock semaphore will be taken for writing. 3) Thread 1 will call genl_bind() as part of the bind operation to handle subscribing to GENL multicast groups at the request of the user. It will attempt to take the cb_lock semaphore for reading, but it will fail and be scheduled away, waiting for Thread 2 to finish the write. 4) Thread 2 will call netlink_table_grab() during the (un)registration call. However, as Thread 1 has incremented nl_table_users, it will not be able to proceed, and both threads will be stuck waiting for the other. genl_bind() is a noop, unless a genl_family implements the mcast_bind() function to handle setting up family-specific multicast operations. Since no one in-tree uses this functionality as Cong pointed out, simply removing the genl_bind() function will remove the possibility for deadlock, as there is no attempt by Thread 1 above to take the cb_lock semaphore. Fixes: c380d9a7afff ("genetlink: pass multicast bind/unbind to families") Suggested-by: Cong Wang Acked-by: Johannes Berg Reported-by: kernel test robot Signed-off-by: Sean Tranchetti --- include/net/genetlink.h | 8 net/netlink/genetlink.c | 49 - 2 files changed, 57 deletions(-) diff --git a/include/net/genetlink.h b/include/net/genetlink.h index 7495066..4cf703d 100644 --- a/include/net/genetlink.h +++ b/include/net/genetlink.h @@ -35,12 +35,6 @@ struct genl_multicast_group { * do additional, common, filtering and return an error * @post_doit: called after an operation's doit callback, it may * undo operations done by pre_doit, for example release locks - * @mcast_bind: a socket bound to the given multicast group (which - * is given as the offset into the groups array) - * @mcast_unbind: a socket was unbound from the given multicast group. - * Note that unbind() will not be called symmetrically if the - * generic netlink family is removed while there are still open - * sockets. * @attrbuf: buffer to store parsed attributes (private) * @mcgrps: multicast groups used by this family * @n_mcgrps: number of multicast groups @@ -64,8 +58,6 @@ struct genl_family { void(*post_doit)(const struct genl_ops *ops, struct sk_buff *skb, struct genl_info *info); - int (*mcast_bind)(struct net *net, int group); - void(*mcast_unbind)(struct net *net, int group); struct nlattr **attrbuf;/* private */ const struct genl_ops * ops; const struct genl_multicast_group *mcgrps; diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c index 55ee680..dfaaefb 100644 --- a/net/netlink/genetlink.c +++ b/net/netlink/genetlink.c @@ -1166,60 +1166,11 @@ static int ctrl_dumppolicy(struct sk_buff *skb, struct netlink_callback *cb) .netnsok = true, }; -static int genl_bind(struct net *net, int group) -{ - struct genl_family *f; - int err = -ENOENT; - unsigned int id; - - down_read(&cb_lock); - - idr_for_each_entry(&genl_fam_idr, f, id) { - if (group >= f->mcgrp_offset && - group < f->mcgrp_offset + f->n_mcgrps) { - int fam_grp = group - f->mcgrp_offset; - - if (!f->netnsok && net != &init_net) - err = -ENOENT; - else if (f->mcast_bind) - err = f->mcast_bind(net, fam_grp); - else - err = 0; - break; - } - } - up_read(&cb_lock); - - return err; -} - -static void genl_unbind(struct net *net, int group) -{ - struct genl_family *f; - unsigned int id; - - down_read(&cb_lock); - - idr_for_each_entry(&genl_fam_idr, f, id) { - if (group >= f->mcgrp_offset && - group < f->mcgrp_offset + f->n_mcgrps) { - int fam_grp = group - f->mcgrp_offset; - - if (f->mcast_unbind) - f->mcast_unbind(net, fam_grp); - break; - } - } - up_read(&cb_lock); -} - stati
[PATCH net] genetlink: take netlink table lock when (un)registering
A potential deadlock can occur during registering or unregistering a new generic netlink family between the main nl_table_lock and the cb_lock where each thread wants the lock held by the other, as demonstrated below. 1) Thread 1 is performing a netlink_bind() operation on a socket. As part of this call, it will call netlink_lock_table(), incrementing the nl_table_users count to 1. 2) Thread 2 is registering (or unregistering) a genl_family via the genl_(un)register_family() API. The cb_lock semaphore will be taken for writing. 3) Thread 1 will call genl_bind() as part of the bind operation to handle subscribing to GENL multicast groups at the request of the user. It will attempt to take the cb_lock semaphore for reading, but it will fail and be scheduled away, waiting for Thread 2 to finish the write. 4) Thread 2 will call netlink_table_grab() during the (un)registration call. However, as Thread 1 has incremented nl_table_users, it will not be able to proceed, and both threads will be stuck waiting for the other. To avoid this scenario, the locks should be acquired in the same order by both threads. Since both the register and unregister functions need to take the nl_table_lock in their processing, it makes sense to explicitly acquire them before they lock the genl_mutex and the cb_lock. In unregistering, no other change is needed aside from this locking change. Registering a family requires more ancilary operations, such as memory allocation. Unfortunately, much of this allocation must be performed inside of the genl locks to ensure internal synchronization, so they must also be performed inside of the nl_table_lock where sleeping is not allowed. As a result, the allocation types must be changed to GFP_ATOMIC. Fixes: def3117493ea ("genl: Allow concurrent genl callbacks") Cc: Pravin B Shelar Cc: Subash Abhinov Kasiviswanathan Signed-off-by: Sean Tranchetti --- net/netlink/genetlink.c | 29 - 1 file changed, 20 insertions(+), 9 deletions(-) diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c index 55ee680..79e3b1b 100644 --- a/net/netlink/genetlink.c +++ b/net/netlink/genetlink.c @@ -155,14 +155,14 @@ static int genl_allocate_reserve_groups(int n_groups, int *first_id) size_t nlen = new_longs * sizeof(unsigned long); if (mc_groups == &mc_group_start) { - new_groups = kzalloc(nlen, GFP_KERNEL); + new_groups = kzalloc(nlen, GFP_ATOMIC); if (!new_groups) return -ENOMEM; mc_groups = new_groups; *mc_groups = mc_group_start; } else { new_groups = krealloc(mc_groups, nlen, - GFP_KERNEL); + GFP_ATOMIC); if (!new_groups) return -ENOMEM; mc_groups = new_groups; @@ -229,7 +229,6 @@ static int genl_validate_assign_mc_groups(struct genl_family *family) if (family->netnsok) { struct net *net; - netlink_table_grab(); rcu_read_lock(); for_each_net_rcu(net) { err = __netlink_change_ngroups(net->genl_sock, @@ -245,10 +244,9 @@ static int genl_validate_assign_mc_groups(struct genl_family *family) } } rcu_read_unlock(); - netlink_table_ungrab(); } else { - err = netlink_change_ngroups(init_net.genl_sock, -mc_groups_longs * BITS_PER_LONG); + err = __netlink_change_ngroups(init_net.genl_sock, + mc_groups_longs * BITS_PER_LONG); } if (groups_allocated && err) { @@ -264,7 +262,6 @@ static void genl_unregister_mc_groups(const struct genl_family *family) struct net *net; int i; - netlink_table_grab(); rcu_read_lock(); for_each_net_rcu(net) { for (i = 0; i < family->n_mcgrps; i++) @@ -328,6 +325,10 @@ int genl_register_family(struct genl_family *family) if (err) return err; + /* Acquire netlink table lock before any GENL-specific locks to ensure +* sync with any netlink operations making calls into the GENL code. +*/ + netlink_table_grab(); genl_lock_all(); if (genl_family_find_byname(family->name)) { @@ -354,7 +355,7 @@ int genl_register_family(struct genl_family *family) if (family->maxattr && !family->parallel_ops) {
[PATCH net-next v2] udp: Avoid post-GRO UDP checksum recalculation
Currently, when resegmenting an unexpected UDP GRO packet, the full UDP checksum will be calculated for every new SKB created by skb_segment() because the netdev features passed in by udp_rcv_segment() lack any information about checksum offload capabilities. Usually, we have no need to perform this calculation again, as 1) The GRO implementation guarantees that any packets making it to the udp_rcv_segment() function had correct checksums, and, more importantly, 2) Upon the successful return of udp_rcv_segment(), we immediately pull the UDP header off and either queue the segment to the socket or hand it off to a new protocol handler. Unless userspace has set the IP_CHECKSUM sockopt to indicate that they want the final checksum values, we can pass the needed netdev feature flags to __skb_gso_segment() to avoid checksumming each segment in skb_segment(). Fixes: cf329aa42b66 ("udp: cope with UDP GRO packet misdirection") Cc: Paolo Abeni Cc: Subash Abhinov Kasiviswanathan Signed-off-by: Sean Tranchetti --- include/net/udp.h | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/include/net/udp.h b/include/net/udp.h index d8ce937..dbe030d 100644 --- a/include/net/udp.h +++ b/include/net/udp.h @@ -471,12 +471,19 @@ struct udp_iter_state { static inline struct sk_buff *udp_rcv_segment(struct sock *sk, struct sk_buff *skb, bool ipv4) { + netdev_features_t features = NETIF_F_SG; struct sk_buff *segs; + /* Avoid csum recalculation by skb_segment unless userspace explicitly +* asks for the final checksum values +*/ + if (!inet_get_convert_csum(sk)) + features |= NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM; + /* the GSO CB lays after the UDP one, no need to save and restore any * CB fragment */ - segs = __skb_gso_segment(skb, NETIF_F_SG, false); + segs = __skb_gso_segment(skb, features, false); if (unlikely(IS_ERR_OR_NULL(segs))) { int segs_nr = skb_shinfo(skb)->gso_segs; -- 1.9.1
[PATCH net-next] udp: Avoid post-GRO UDP checksum recalculation
Currently, when resegmenting an unexpected UDP GRO packet, the full UDP checksum will be calculated for every new SKB created by skb_segment() because the netdev features passed in by udp_rcv_segment() lack any information about checksum offload capabilities. We have no need to perform this calculation again, as 1) The GRO implementation guarantees that any packets making it to the udp_rcv_segment() function had correct checksums, and, more importantly, 2) Upon the successful return of udp_rcv_segment(), we immediately pull the UDP header off and either queue the segment to the socket or hand it off to a new protocol handler. In either case, the checksum is not needed. Fixes: cf329aa42b66 ("udp: cope with UDP GRO packet misdirection") Cc: Paolo Abeni Cc: Subash Abhinov Kasiviswanathan Signed-off-by: Sean Tranchetti --- include/net/udp.h | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/include/net/udp.h b/include/net/udp.h index d8ce937..6164d5c 100644 --- a/include/net/udp.h +++ b/include/net/udp.h @@ -472,11 +472,15 @@ static inline struct sk_buff *udp_rcv_segment(struct sock *sk, struct sk_buff *skb, bool ipv4) { struct sk_buff *segs; + netdev_features_t features = NETIF_F_SG; + + /* Avoid csum recalculation in skb_segment() */ + features |= NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM; /* the GSO CB lays after the UDP one, no need to save and restore any * CB fragment */ - segs = __skb_gso_segment(skb, NETIF_F_SG, false); + segs = __skb_gso_segment(skb, features, false); if (unlikely(IS_ERR_OR_NULL(segs))) { int segs_nr = skb_shinfo(skb)->gso_segs; -- 1.9.1
[PATCH net] af_key: unconditionally clone on broadcast
Attempting to avoid cloning the skb when broadcasting by inflating the refcount with sock_hold/sock_put while under RCU lock is dangerous and violates RCU principles. It leads to subtle race conditions when attempting to free the SKB, as we may reference sockets that have already been freed by the stack. Unable to handle kernel paging request at virtual address 6b6b6b6b6b6c4b [006b6b6b6b6b6c4b] address between user and kernel address ranges Internal error: Oops: 9604 [#1] PREEMPT SMP task: fff78f65b380 task.stack: ff8049a88000 pc : sock_rfree+0x38/0x6c lr : skb_release_head_state+0x6c/0xcc Process repro (pid: 7117, stack limit = 0xff8049a88000) Call trace: sock_rfree+0x38/0x6c skb_release_head_state+0x6c/0xcc skb_release_all+0x1c/0x38 __kfree_skb+0x1c/0x30 kfree_skb+0xd0/0xf4 pfkey_broadcast+0x14c/0x18c pfkey_sendmsg+0x1d8/0x408 sock_sendmsg+0x44/0x60 ___sys_sendmsg+0x1d0/0x2a8 __sys_sendmsg+0x64/0xb4 SyS_sendmsg+0x34/0x4c el0_svc_naked+0x34/0x38 Kernel panic - not syncing: Fatal exception Suggested-by: Eric Dumazet Signed-off-by: Sean Tranchetti --- Realized I never actually sent this patch out after testing the changes Eric recommended. Whoops. Better late then never, I suppose... net/key/af_key.c | 40 +++- 1 file changed, 15 insertions(+), 25 deletions(-) diff --git a/net/key/af_key.c b/net/key/af_key.c index 655c787..e800891 100644 --- a/net/key/af_key.c +++ b/net/key/af_key.c @@ -196,30 +196,22 @@ static int pfkey_release(struct socket *sock) return 0; } -static int pfkey_broadcast_one(struct sk_buff *skb, struct sk_buff **skb2, - gfp_t allocation, struct sock *sk) +static int pfkey_broadcast_one(struct sk_buff *skb, gfp_t allocation, + struct sock *sk) { int err = -ENOBUFS; - sock_hold(sk); - if (*skb2 == NULL) { - if (refcount_read(&skb->users) != 1) { - *skb2 = skb_clone(skb, allocation); - } else { - *skb2 = skb; - refcount_inc(&skb->users); - } - } - if (*skb2 != NULL) { - if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf) { - skb_set_owner_r(*skb2, sk); - skb_queue_tail(&sk->sk_receive_queue, *skb2); - sk->sk_data_ready(sk); - *skb2 = NULL; - err = 0; - } + if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf) + return err; + + skb = skb_clone(skb, allocation); + + if (skb) { + skb_set_owner_r(skb, sk); + skb_queue_tail(&sk->sk_receive_queue, skb); + sk->sk_data_ready(sk); + err = 0; } - sock_put(sk); return err; } @@ -234,7 +226,6 @@ static int pfkey_broadcast(struct sk_buff *skb, gfp_t allocation, { struct netns_pfkey *net_pfkey = net_generic(net, pfkey_net_id); struct sock *sk; - struct sk_buff *skb2 = NULL; int err = -ESRCH; /* XXX Do we need something like netlink_overrun? I think @@ -253,7 +244,7 @@ static int pfkey_broadcast(struct sk_buff *skb, gfp_t allocation, * socket. */ if (pfk->promisc) - pfkey_broadcast_one(skb, &skb2, GFP_ATOMIC, sk); + pfkey_broadcast_one(skb, GFP_ATOMIC, sk); /* the exact target will be processed later */ if (sk == one_sk) @@ -268,7 +259,7 @@ static int pfkey_broadcast(struct sk_buff *skb, gfp_t allocation, continue; } - err2 = pfkey_broadcast_one(skb, &skb2, GFP_ATOMIC, sk); + err2 = pfkey_broadcast_one(skb, GFP_ATOMIC, sk); /* Error is cleared after successful sending to at least one * registered KM */ @@ -278,9 +269,8 @@ static int pfkey_broadcast(struct sk_buff *skb, gfp_t allocation, rcu_read_unlock(); if (one_sk != NULL) - err = pfkey_broadcast_one(skb, &skb2, allocation, one_sk); + err = pfkey_broadcast_one(skb, allocation, one_sk); - kfree_skb(skb2); kfree_skb(skb); return err; } -- 1.9.1
[PATCH net v2] net: udp: fix handling of CHECKSUM_COMPLETE packets
Current handling of CHECKSUM_COMPLETE packets by the UDP stack is incorrect for any packet that has an incorrect checksum value. udp4/6_csum_init() will both make a call to __skb_checksum_validate_complete() to initialize/validate the csum field when receiving a CHECKSUM_COMPLETE packet. When this packet fails validation, skb->csum will be overwritten with the pseudoheader checksum so the packet can be fully validated by software, but the skb->ip_summed value will be left as CHECKSUM_COMPLETE so that way the stack can later warn the user about their hardware spewing bad checksums. Unfortunately, leaving the SKB in this state can cause problems later on in the checksum calculation. Since the the packet is still marked as CHECKSUM_COMPLETE, udp_csum_pull_header() will SUBTRACT the checksum of the UDP header from skb->csum instead of adding it, leaving us with a garbage value in that field. Once we try to copy the packet to userspace in the udp4/6_recvmsg(), we'll make a call to skb_copy_and_csum_datagram_msg() to checksum the packet data and add it in the garbage skb->csum value to perform our final validation check. Since the value we're validating is not the proper checksum, it's possible that the folded value could come out to 0, causing us not to drop the packet. Instead, we believe that the packet was checksummed incorrectly by hardware since skb->ip_summed is still CHECKSUM_COMPLETE, and we attempt to warn the user with netdev_rx_csum_fault(skb->dev); Unfortunately, since this is the UDP path, skb->dev has been overwritten by skb->dev_scratch and is no longer a valid pointer, so we end up reading invalid memory. This patch addresses this problem in two ways: 1) Do not use the dev pointer when calling netdev_rx_csum_fault() from skb_copy_and_csum_datagram_msg(). Since this gets called from the UDP path where skb->dev has been overwritten, we have no way of knowing if the pointer is still valid. Also for the sake of consistency with the other uses of netdev_rx_csum_fault(), don't attempt to call it if the packet was checksummed by software. 2) Add better CHECKSUM_COMPLETE handling to udp4/6_csum_init(). If we receive a packet that's CHECKSUM_COMPLETE that fails verification (i.e. skb->csum_valid == 0), check who performed the calculation. It's possible that the checksum was done in software by the network stack earlier (such as Netfilter's CONNTRACK module), and if that says the checksum is bad, we can drop the packet immediately instead of waiting until we try and copy it to userspace. Otherwise, we need to mark the SKB as CHECKSUM_NONE, since the skb->csum field no longer contains the full packet checksum after the call to __skb_checksum_validate_complete(). Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing") Fixes: c84d949057ca ("udp: copy skb->truesize in the first cache line") Cc: Sam Kumar Cc: Eric Dumazet Signed-off-by: Sean Tranchetti --- net/core/datagram.c | 5 +++-- net/ipv4/udp.c | 20 ++-- net/ipv6/ip6_checksum.c | 20 ++-- 3 files changed, 39 insertions(+), 6 deletions(-) diff --git a/net/core/datagram.c b/net/core/datagram.c index 9aac0d6..df16493 100644 --- a/net/core/datagram.c +++ b/net/core/datagram.c @@ -808,8 +808,9 @@ int skb_copy_and_csum_datagram_msg(struct sk_buff *skb, return -EINVAL; } - if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE)) - netdev_rx_csum_fault(skb->dev); + if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE) && + !skb->csum_complete_sw) + netdev_rx_csum_fault(NULL); } return 0; fault: diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index c32a4c1..f8183fd 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -2120,8 +2120,24 @@ static inline int udp4_csum_init(struct sk_buff *skb, struct udphdr *uh, /* Note, we are only interested in != 0 or == 0, thus the * force to int. */ - return (__force int)skb_checksum_init_zero_check(skb, proto, uh->check, -inet_compute_pseudo); + err = (__force int)skb_checksum_init_zero_check(skb, proto, uh->check, + inet_compute_pseudo); + if (err) + return err; + + if (skb->ip_summed == CHECKSUM_COMPLETE && !skb->csum_valid) { + /* If SW calculated the value, we know it's bad */ + if (skb->csum_complete_sw) + return 1; + + /* HW says the value is
[PATCH net] net: udp: fix handling of CHECKSUM_COMPLETE packets
Current handling of CHECKSUM_COMPLETE packets by the UDP stack is incorrect for any packet that has an incorrect checksum value. udp4/6_csum_init() will both make a call to __skb_checksum_validate_complete() to initialize/validate the csum field when receiving a CHECKSUM_COMPLETE packet. When this packet fails validation, skb->csum will be overwritten with the pseudoheader checksum so the packet can be fully validated by software, but the skb->ip_summed value will be left as CHECKSUM_COMPLETE so that way the stack can later warn the user about their hardware spewing bad checksums. Unfortunately, leaving the SKB in this state can cause problems later on in the checksum calculation. Since the the packet is still marked as CHECKSUM_COMPLETE, udp_csum_pull_header() will SUBTRACT the checksum of the UDP header from skb->csum instead of adding it, leaving us with a garbage value in that field. Once we try to copy the packet to userspace in the udp4/6_recvmsg(), we'll make a call to skb_copy_and_csum_datagram_msg() to checksum the packet data and add it in the garbage skb->csum value to perform our final validation check. Since the value we're validating is not the proper checksum, it's possible that the folded value could come out to 0, causing us not to drop the packet. Instead, we believe that the packet was checksummed incorrectly by hardware since skb->ip_summed is still CHECKSUM_COMPLETE, and we attempt to warn the user with netdev_rx_csum_fault(skb->dev); Unfortunately, since this is the UDP path, skb->dev has been overwritten by skb->dev_scratch and is no longer a valid pointer, so we end up reading invalid memory. This patch addresses this problem in two ways: 1) Remove the invalid netdev_rx_csum_fault(skb->dev) call from skb_copy_and_csum_datagram_msg(). Since this is used by UDP where skb->dev is invalid, trying to warn doesn't make sense. 2) Add better CHECKSUM_COMPLETE handling to udp4/6_csum_init(). If we receive a packet that's CHECKSUM_COMPLETE that fails verification (i.e. skb->csum_valid == 0), check who performed the calculation. It's possible that the checksum was done in software by the network stack earlier (such as Netfilter's CONNTRACK module), and if that says the checksum is bad, we can drop the packet immediately instead of waiting until we try and copy it to userspace. Otherwise, we need to mark the SKB as CHECKSUM_NONE, since the skb->csum field no longer contains the full packet checksum after the call to __skb_checksum_validate_complete(). Signed-off-by: Sean Tranchetti --- net/core/datagram.c | 3 --- net/ipv4/udp.c | 20 ++-- net/ipv6/ip6_checksum.c | 20 ++-- 3 files changed, 36 insertions(+), 7 deletions(-) diff --git a/net/core/datagram.c b/net/core/datagram.c index 9aac0d6..ab679c5 100644 --- a/net/core/datagram.c +++ b/net/core/datagram.c @@ -807,9 +807,6 @@ int skb_copy_and_csum_datagram_msg(struct sk_buff *skb, iov_iter_revert(&msg->msg_iter, chunk); return -EINVAL; } - - if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE)) - netdev_rx_csum_fault(skb->dev); } return 0; fault: diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index c32a4c1..f8183fd 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -2120,8 +2120,24 @@ static inline int udp4_csum_init(struct sk_buff *skb, struct udphdr *uh, /* Note, we are only interested in != 0 or == 0, thus the * force to int. */ - return (__force int)skb_checksum_init_zero_check(skb, proto, uh->check, -inet_compute_pseudo); + err = (__force int)skb_checksum_init_zero_check(skb, proto, uh->check, + inet_compute_pseudo); + if (err) + return err; + + if (skb->ip_summed == CHECKSUM_COMPLETE && !skb->csum_valid) { + /* If SW calculated the value, we know it's bad */ + if (skb->csum_complete_sw) + return 1; + + /* HW says the value is bad. Let's validate that. +* skb->csum is no longer the full packet checksum, +* so don't treat it as such. +*/ + skb_checksum_complete_unset(skb); + } + + return 0; } /* wrapper for udp_queue_rcv_skb tacking care of csum conversion and diff --git a/net/ipv6/ip6_checksum.c b/net/ipv6/ip6_checksum.c index 547515e..3777170 100644 --- a/net/ipv6/ip6_checksum.c +++ b/net/ipv6/ip6_checksum.c @@ -88,8 +88,24 @@ int udp6_csum_init(struct sk_buff *skb, struct udphdr *uh, int
[PATCH net] netlabel: check for IPV4MASK in addrinfo_get
netlbl_unlabel_addrinfo_get() assumes that if it finds the NLBL_UNLABEL_A_IPV4ADDR attribute, it must also have the NLBL_UNLABEL_A_IPV4MASK attribute as well. However, this is not necessarily the case as the current checks in netlbl_unlabel_staticadd() and friends are not sufficent to enforce this. If passed a netlink message with NLBL_UNLABEL_A_IPV4ADDR, NLBL_UNLABEL_A_IPV6ADDR, and NLBL_UNLABEL_A_IPV6MASK attributes, these functions will all call netlbl_unlabel_addrinfo_get() which will then attempt dereference NULL when fetching the non-existent NLBL_UNLABEL_A_IPV4MASK attribute: Unable to handle kernel NULL pointer dereference at virtual address 0 Process unlab (pid: 31762, stack limit = 0xff80502d8000) Call trace: netlbl_unlabel_addrinfo_get+0x44/0xd8 netlbl_unlabel_staticremovedef+0x98/0xe0 genl_rcv_msg+0x354/0x388 netlink_rcv_skb+0xac/0x118 genl_rcv+0x34/0x48 netlink_unicast+0x158/0x1f0 netlink_sendmsg+0x32c/0x338 sock_sendmsg+0x44/0x60 ___sys_sendmsg+0x1d0/0x2a8 __sys_sendmsg+0x64/0xb4 SyS_sendmsg+0x34/0x4c el0_svc_naked+0x34/0x38 Code: 51001149 7100113f 54a0 f9401508 (79400108) ---[ end trace f6438a488e737143 ]--- Kernel panic - not syncing: Fatal exception Signed-off-by: Sean Tranchetti --- net/netlabel/netlabel_unlabeled.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/net/netlabel/netlabel_unlabeled.c b/net/netlabel/netlabel_unlabeled.c index c070dfc..c92894c 100644 --- a/net/netlabel/netlabel_unlabeled.c +++ b/net/netlabel/netlabel_unlabeled.c @@ -781,7 +781,8 @@ static int netlbl_unlabel_addrinfo_get(struct genl_info *info, { u32 addr_len; - if (info->attrs[NLBL_UNLABEL_A_IPV4ADDR]) { + if (info->attrs[NLBL_UNLABEL_A_IPV4ADDR] && + info->attrs[NLBL_UNLABEL_A_IPV4MASK]) { addr_len = nla_len(info->attrs[NLBL_UNLABEL_A_IPV4ADDR]); if (addr_len != sizeof(struct in_addr) && addr_len != nla_len(info->attrs[NLBL_UNLABEL_A_IPV4MASK])) -- 1.9.1
[PATCH net] af_key: free SKBs under RCU protection
pfkey_broadcast() can make calls to pfkey_broadcast_one() which will clone or copy the passed in SKB. This new SKB will be assigned the sock_rfree() function as its destructor, which requires that the socket reference the SKB contains is valid when the SKB is freed. If this SKB is ever passed to pfkey_broadcast() again by some other function (such as pkfey_dump() or pfkey_promisc) it will then be freed there. However, since this free occurs outside of RCU protection, it is possible that userspace could close the socket and trigger pfkey_release() to free the socket before sock_rfree() can run, creating the following race condition: 1: An SKB belonging to the pfkey socket is passed to pfkey_broadcast(). It performs the broadcast to any other sockets, and calls rcu_read_unlock(), but does not yet reach kfree_skb(). 2: Userspace closes the socket, triggering pfkey_realse(). Since no one holds the RCU lock, synchronize_rcu() returns and it is allowed to continue. It calls sock_put() to free the socket. 3: pfkey_broadcast() now calls kfree_skb() on the original SKB it was passed, triggering a call to sock_rfree(). This function now accesses the freed struct sock * via skb->sk, and attempts to update invalid memory. By ensuring that the pfkey_broadcast() also frees the SKBs while it holds the RCU lock, we can ensure that the socket will remain valid when the SKB is freed, avoiding crashes like the following: Unable to handle kernel paging request at virtual address 6b6b6b6b6b6c4b [006b6b6b6b6b6c4b] address between user and kernel address ranges Internal error: Oops: 9604 [#1] PREEMPT SMP task: fff78f65b380 task.stack: ff8049a88000 pc : sock_rfree+0x38/0x6c lr : skb_release_head_state+0x6c/0xcc Process repro (pid: 7117, stack limit = 0xff8049a88000) Call trace: sock_rfree+0x38/0x6c skb_release_head_state+0x6c/0xcc skb_release_all+0x1c/0x38 __kfree_skb+0x1c/0x30 kfree_skb+0xd0/0xf4 pfkey_broadcast+0x14c/0x18c pfkey_sendmsg+0x1d8/0x408 sock_sendmsg+0x44/0x60 ___sys_sendmsg+0x1d0/0x2a8 __sys_sendmsg+0x64/0xb4 SyS_sendmsg+0x34/0x4c el0_svc_naked+0x34/0x38 Kernel panic - not syncing: Fatal exception Signed-off-by: Sean Tranchetti --- net/key/af_key.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/key/af_key.c b/net/key/af_key.c index 9d61266..dd257c7 100644 --- a/net/key/af_key.c +++ b/net/key/af_key.c @@ -275,13 +275,13 @@ static int pfkey_broadcast(struct sk_buff *skb, gfp_t allocation, if ((broadcast_flags & BROADCAST_REGISTERED) && err) err = err2; } - rcu_read_unlock(); if (one_sk != NULL) err = pfkey_broadcast_one(skb, &skb2, allocation, one_sk); kfree_skb(skb2); kfree_skb(skb); + rcu_read_unlock(); return err; } -- 1.9.1
[PATCH net] xfrm: validate template mode
XFRM mode parameters passed as part of the user templates in the IP_XFRM_POLICY are never properly validated. Passing values other than valid XFRM modes can cause stack-out-of-bounds reads to occur later in the XFRM processing: [ 140.535608] [ 140.543058] BUG: KASAN: stack-out-of-bounds in xfrm_state_find+0x17e4/0x1cc4 [ 140.550306] Read of size 4 at addr ffc0238a7a58 by task repro/5148 [ 140.557369] [ 140.558927] Call trace: [ 140.558936] dump_backtrace+0x0/0x388 [ 140.558940] show_stack+0x24/0x30 [ 140.558946] __dump_stack+0x24/0x2c [ 140.558949] dump_stack+0x8c/0xd0 [ 140.558956] print_address_description+0x74/0x234 [ 140.558960] kasan_report+0x240/0x264 [ 140.558963] __asan_report_load4_noabort+0x2c/0x38 [ 140.558967] xfrm_state_find+0x17e4/0x1cc4 [ 140.558971] xfrm_resolve_and_create_bundle+0x40c/0x1fb8 [ 140.558975] xfrm_lookup+0x238/0x1444 [ 140.558977] xfrm_lookup_route+0x48/0x11c [ 140.558984] ip_route_output_flow+0x88/0xc4 [ 140.558991] raw_sendmsg+0xa74/0x266c [ 140.558996] inet_sendmsg+0x258/0x3b0 [ 140.559002] sock_sendmsg+0xbc/0xec [ 140.559005] SyS_sendto+0x3a8/0x5a8 [ 140.559008] el0_svc_naked+0x34/0x38 [ 140.559009] [ 140.592245] page dumped because: kasan: bad access detected [ 140.597981] page_owner info is not active (free page?) [ 140.603267] [ 140.653503] Signed-off-by: Sean Tranchetti --- net/xfrm/xfrm_user.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c index 4791aa8..ab2f280 100644 --- a/net/xfrm/xfrm_user.c +++ b/net/xfrm/xfrm_user.c @@ -1480,6 +1480,9 @@ static int validate_tmpl(int nr, struct xfrm_user_tmpl *ut, u16 family) (ut[i].family != prev_family)) return -EINVAL; + if (ut[i].mode >= XFRM_MODE_MAX) + return -EINVAL; + prev_family = ut[i].family; switch (ut[i].family) { -- 1.9.1
[PATCH net-next] net: gro: Initialize backlog NAPI's gro_list
When using RPS, the target CPU uses the backlog NAPI struct. This NAPI struct needs the list initialized explicitly as normally the list is initialized from netif_napi_add() on the netdevice. Unable to handle kernel NULL pointer dereference at virtual address 0008 Kernel BUG at ff9808909310 [verbose debug info unavailable] PC is at napi_gro_flush+0x74/0x104 LR is at napi_gro_flush+0x70/0x104 [] napi_gro_flush+0x74/0x104 [] net_rx_action+0x3f8/0x4d4 [] __do_softirq+0x144/0x468 [] irq_exit+0x118/0x144 [] __handle_domain_irq+0x94/0xf8 [] gic_handle_irq+0xd0/0x1b4 Fixes: d4546c2509b1 ("net: Convert GRO SKB handling to list_head.") Signed-off-by: Sean Tranchetti Signed-off-by: Subash Abhinov Kasiviswanathan --- net/core/dev.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/core/dev.c b/net/core/dev.c index 4f8b92d..6b76745 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -9556,6 +9556,7 @@ static int __init net_dev_init(void) sd->backlog.poll = process_backlog; sd->backlog.weight = weight_p; + INIT_LIST_HEAD(&sd->backlog.gro_list); } dev_boot_phase = 0; -- 1.9.1
[PATCH net-next] udp: Fix kernel panic in UDP GSO path
Using GSO in the UDP path on a device with scatter-gather netdevice feature disabled will result in a kernel panic with the following call stack: kernel BUG at net/core/skbuff.c:104! Internal error: Oops - BUG: 0 [#1] PREEMPT SMP PC is at skb_panic+0x4c/0x54 LR is at skb_panic+0x4c/0x54 Process udpgso_bench_tx (pid: 4078, stack limit = 0xff8048de8000) [] skb_panic+0x4c/0x54 [] skb_copy_bits+0x0/0x244 [] __ip_append_data+0x230/0x814 [] ip_make_skb+0xe4/0x178 [] udp_sendmsg+0x828/0x888 [] inet_sendmsg+0xe4/0x130 [] ___sys_sendmsg+0x1d8/0x2c0 [] SyS_sendmsg+0x90/0xe0 This panic is the result of allocating SKBs with small size for the newly segmented SKB. If the scatter-gather feature is disabled, the code attempts to call skb_put() on the small SKB with an argument of nearly the entire unsegmented SKB length. After this patch, attempting to use GSO with scatter-gather disabled will result in -EINVAL being returned. Fixes: 15e36f5b8e98 ("udp: paged allocation with gso") Signed-off-by: Sean Tranchetti Signed-off-by: Subash Abhinov Kasiviswanathan --- net/ipv4/ip_output.c | 8 1 file changed, 8 insertions(+) diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index b5e21eb..0d63690 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -1054,8 +1054,16 @@ static int __ip_append_data(struct sock *sk, copy = length; if (!(rt->dst.dev->features&NETIF_F_SG)) { + struct sk_buff *tmp; unsigned int off; + if (paged) { + err = -EINVAL; + while ((tmp = __skb_dequeue(queue)) != NULL) + kfree(tmp); + goto error; + } + off = skb->len; if (getfrag(from, skb_put(skb, copy), offset, copy, off, skb) < 0) { -- 1.9.1
[PATCH net-next] udp: Complement partial checksum for GSO packet
Using the udp_v4_check() function to calculate the pseudo header for the newly segmented UDP packets results in assigning the complement of the value to the UDP header checksum field. Always undo the complement the partial checksum value in order to match the case where GSO is not used on the UDP transmit path. Fixes: ee80d1ebe5ba ("udp: add udp gso") Signed-off-by: Sean Tranchetti Signed-off-by: Subash Abhinov Kasiviswanathan --- net/ipv4/udp_offload.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c index f78fb36..0062570 100644 --- a/net/ipv4/udp_offload.c +++ b/net/ipv4/udp_offload.c @@ -223,6 +223,7 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb, csum_replace2(&uh->check, htons(mss), htons(seg->len - hdrlen - sizeof(*uh))); + uh->check = ~uh->check; seg->destructor = sock_wfree; seg->sk = sk; sum_truesize += seg->truesize; -- 1.9.1