[PATCH] net: Remove remaining remnants of pm_qos from netdevice.h
Commit e2c6544829f removed pm_qos from struct net_device but left the
comment and the header include behind. Remove both.

Signed-off-by: David Ahern <dsah...@gmail.com>
Cc: Thomas Graf <tg...@suug.ch>
---
 include/linux/netdevice.h | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 1899c74a7127..05b9a694e213 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -25,7 +25,6 @@
 #ifndef _LINUX_NETDEVICE_H
 #define _LINUX_NETDEVICE_H
 
-#include <linux/pm_qos.h>
 #include <linux/timer.h>
 #include <linux/bug.h>
 #include <linux/delay.h>
@@ -1499,8 +1498,6 @@ enum netdev_priv_flags {
  *
  *	@qdisc_tx_busylock:	XXX: need comments on this one
  *
- *	@pm_qos_req:	Power Management QoS object
- *
  *	FIXME: cleanup struct net_device such that network protocol info
  *	moves out.
  */
-- 
2.2.1

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCH 1/2] e1000e: Add pm_qos header
Commit e2c6544829f moved pm_qos_req to e1000_adapter. Add the header
file that defines the struct.

Signed-off-by: David Ahern <dsah...@gmail.com>
Cc: Thomas Graf <tg...@suug.ch>
Cc: Jeff Kirsher <jeffrey.t.kirs...@intel.com>
---
 drivers/net/ethernet/intel/e1000e/e1000.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/intel/e1000e/e1000.h b/drivers/net/ethernet/intel/e1000e/e1000.h
index 5d9ceb17b4cb..0abc942c966e 100644
--- a/drivers/net/ethernet/intel/e1000e/e1000.h
+++ b/drivers/net/ethernet/intel/e1000e/e1000.h
@@ -40,6 +40,7 @@
 #include <linux/ptp_classify.h>
 #include <linux/mii.h>
 #include <linux/mdio.h>
+#include <linux/pm_qos.h>
 #include "hw.h"
 
 struct e1000_info;
-- 
2.2.1
Re: [RFC net-next 2/3] VRF driver and needed infrastructure
On 6/8/15 12:35 PM, Shrijeet Mukherjee wrote:
> diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
> index 019fcef..27a333c 100644
> --- a/drivers/net/Kconfig
> +++ b/drivers/net/Kconfig
> @@ -283,6 +283,12 @@ config NLMON
>  	  diagnostics, etc. This is mostly intended for developers or support
>  	  to debug netlink issues. If unsure, say N.
>
> +config NET_VRF
> +	tristate "Virtual Routing and Forwarding (Lite)"
> +	---help---
> +	  This option enables the support for mapping interfaces into VRF's. The
> +	  support enables VRF devices
> +
>  endif # NET_CORE
>
>  config SUNGEM_PHY

I think you need:

	depends on IP_MULTIPLE_TABLES && IPV6_MULTIPLE_TABLES

David
Re: [RFC net-next 0/3] Proposal for VRF-lite
On 6/8/15 12:35 PM, Shrijeet Mukherjee wrote:
> 5. Debugging is built-in as tcpdump and counters on the VRF device
>    works as is.

Is the intent that something like this

    tcpdump -i vrf0

can be used to see vrf traffic? vrf_handle_frame only bumps counters; it
does not switch skb->dev to the vrf device, so on the Rx path tcpdump
will not get the packets, i.e., tcpdump only shows outbound packets.
Re: [RFC net-next 3/3] rcv path changes for vrf traffic
On 6/8/15 1:58 PM, Hannes Frederic Sowa wrote:
> Hi Shrijeet,
>
> On Mo, 2015-06-08 at 11:35 -0700, Shrijeet Mukherjee wrote:
>> From: Shrijeet Mukherjee <s...@cumulusnetworks.com>
>>
>> Incoming frames for IP protocol stacks need the IIF to be changed
>> from the actual interface to the VRF device. This allows the IIF
>> rule to be used to select tables (or do regular PBR)
>>
>> This change selects the iif to be the VRF device if it exists and
>> the incoming iif is enslaved to the VRF device.
>>
>> Since VRF aware sockets are always bound to the VRF device this
>> system allows return traffic to find the socket of origin.
>>
>> changes are in the arp_rcv, icmp_rcv and ip_rcv paths
>>
>> Question : I did not wrap the rcv modifications, in CONFIG_NET_VRF
>> as it would create code variations and the vrf_ptr check is there
>> I can make that whole thing modular.
>
> From an architectural level I think the output path looks good. For the
> input path I would also to propose my (I think) more flexible solution:

Something is still not right on the output path. e.g., I see the wrong
source address showing up on ping -I vrf0:

# ping -I vrf0 1.1.1.254
ping: Warning: source address might be selected on device other than vrf0.
PING 1.1.1.254 (1.1.1.254) from 172.16.1.52 vrf0: 56(84) bytes of data.
64 bytes from 1.1.1.254: icmp_seq=1 ttl=64 time=0.215 ms
...

The reason is that the datagram connect function fails to look up the
outbound route in the vrf and falls back to the main table. (As an aside,
the fallback to other tables is something that should not be happening
for VRFs; you want to use the table specific to the VRF.)

The route lookup fails because it passes in oif = vrf device (this VRF
design relies on bind to device, which sets oif in the flow). That is
good for selecting the table to use for the lookups, but not good for
selecting the route within the table.

This is one way to fix the connect problem:

diff --git a/include/net/route.h b/include/net/route.h
index fe22d03afb6a..a18798caec25 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -245,11 +245,18 @@ static inline void ip_route_connect_init(struct flowi4 *fl4, __be32 dst, __be32
 					 __be16 sport, __be16 dport,
 					 struct sock *sk)
 {
+	struct net_device *dev = dev_get_by_index(sock_net(sk), oif);
 	__u8 flow_flags = 0;
 
 	if (inet_sk(sk)->transparent)
 		flow_flags |= FLOWI_FLAG_ANYSRC;
 
+	if (dev) {
+		if (netif_is_vrf(dev))
+			flow_flags |= FLOWI_FLAG_VRFSRC;
+		dev_put(dev);
+	}
+
 	flowi4_init_output(fl4, oif, sk->sk_mark, tos, RT_SCOPE_UNIVERSE,
 			   protocol, flow_flags, dst, src, dport, sport);
 }

which essentially tells fib_table_lookup to drop the OIF comparison after
selecting the table, per this change made in the patch Shrijeet posted:

	if (!(flp->flowi4_flags & FLOWI_FLAG_VRFSRC)) {
		if (flp->flowi4_oif && flp->flowi4_oif != nh->nh_oif)
			continue;
	}
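To make the effect of the flag concrete, here is a minimal user-space sketch of the nexthop check described above. It is only a model: struct model_flow, struct model_nh and model_nh_matches() are hypothetical stand-ins, not kernel types, and the flag value is the one from the flow.h hunk in this series.

```c
#include <stdbool.h>

/* Flag value as defined in the include/net/flow.h hunk of this series. */
#define FLOWI_FLAG_VRFSRC 0x04

/* Hypothetical, stripped-down stand-ins for struct flowi4 and struct fib_nh. */
struct model_flow {
	unsigned char flags;	/* models flowi4_flags */
	int oif;		/* models flowi4_oif; 0 means "any device" */
};

struct model_nh {
	int oif;		/* models nh_oif */
};

/*
 * Returns true when the nexthop is acceptable for this flow. Without the
 * VRF flag a non-zero oif must match the nexthop device; with the flag the
 * oif has already been used to pick the table and is ignored here.
 */
static bool model_nh_matches(const struct model_flow *flp,
			     const struct model_nh *nh)
{
	if (!(flp->flags & FLOWI_FLAG_VRFSRC)) {
		if (flp->oif && flp->oif != nh->oif)
			return false;
	}
	return true;
}
```

With oif set to the vrf device index and a route whose nexthop is the enslaved port, the check fails unless the flag is set, which matches the failure seen with ping -I vrf0.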
Re: [RFC net-next 3/3] rcv path changes for vrf traffic
On 6/8/15 1:58 PM, Hannes Frederic Sowa wrote:
> For rx layer I want to also propose my try:
>
> [PATCH net-next RFC] net: ipv4: arp: strong end system model semantics
> by per-interface local table override

I applied only the first 2 patches from Shrijeet and then tried to apply
your patch; it does not apply. Way too many failures. What branch should
it apply to?
Re: [RFC net-next 0/3] Proposal for VRF-lite
Hi Nicolas:

On 6/9/15 2:58 AM, Nicolas Dichtel wrote:
> I'm not really in favor of the name 'vrf'. This term is very controversial
> and having a consensus of what is/contains a 'vrf' is quite impossible.
> There was already a lot of discussions about this topic on quagga ml that
> show that everybody has a different opinion about this term ;-)

Are you referring to this thread?
https://lists.quagga.net/pipermail/quagga-dev/2014-November/011795.html

I could see differing opinions regarding the implementation of a VRF; is
there really a controversy on what a VRF is?

David
Re: [PATCH net-next] switchdev: fix BUG when port driver doesn't support set attr op
On 6/10/15 2:56 PM, sfel...@gmail.com wrote:
> From: Scott Feldman <sfel...@gmail.com>
>
> Fix a BUG() where CONFIG_NET_SWITCHDEV is set but the driver for a bridged
> port does not support switchdev_port_attr_set op. Don't BUG() if
> -EOPNOTSUPP is returned.
>
> Signed-off-by: Scott Feldman <sfel...@gmail.com>
> Reported-by: Brenden Blanco <bbla...@plumgrid.com>
> ---
>  net/switchdev/switchdev.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
> index e008057..99bced4 100644
> --- a/net/switchdev/switchdev.c
> +++ b/net/switchdev/switchdev.c
> @@ -103,7 +103,7 @@ static void switchdev_port_attr_set_work(struct work_struct *work)
>
>  	rtnl_lock();
>  	err = switchdev_port_attr_set(asw->dev, &asw->attr);
> -	BUG_ON(err);
> +	BUG_ON(err && err != -EOPNOTSUPP);
>  	rtnl_unlock();
>
>  	dev_put(asw->dev);

Should that be WARN_ON instead of BUG_ON?
Re: [PATCH net-next] switchdev: fix BUG when port driver doesn't support set attr op
On 6/10/15 3:47 PM, Scott Feldman wrote:
>> Should that be WARN_ON instead of BUG_ON?
>
> I think I had it as WARN when we were working on the initial patches,
> but we changed it to BUG_ON because we should only get an error here
> if the driver screwed something up between PREPARE phase and COMMIT
> phase, so it should be considered a driver bug which needs fixing.

Linus rants from time to time about the prolific use of BUG_ON. e.g.,
https://lkml.org/lkml/2015/4/28/528

  'BUG_ON() is for things where our internal data structures are so
   corrupted that we don't know what to do, and there's no way to
   continue. Not for "I want to sprinkle these things around and this
   should not happen".'

David
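As an illustration of the WARN-style alternative being suggested, here is a small user-space sketch. It is not the switchdev code: model_check_attr_set() and the enum are hypothetical names, and the only thing taken from the thread is the condition itself (any error other than -EOPNOTSUPP is unexpected). The point is that the caller gets a loud report and keeps running instead of halting.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical outcome of checking a deferred attr-set result. */
enum model_action {
	MODEL_OK,	/* success, or the op is simply not supported */
	MODEL_WARN	/* unexpected failure: report it and carry on */
};

/*
 * Models "WARN_ON(err && err != -EOPNOTSUPP)": an unsupported op is not
 * an error at all, anything else is reported but is not fatal.
 */
static enum model_action model_check_attr_set(int err)
{
	if (err && err != -EOPNOTSUPP) {
		fprintf(stderr, "attr set failed in commit phase: %d\n", err);
		return MODEL_WARN;
	}
	return MODEL_OK;
}
```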
Re: [RFC net-next 3/6] net: Introduce VRF device driver - v2
On 7/6/15 10:37 AM, Nikolay Aleksandrov wrote:
>> +static int vrf_add_slave(struct net_device *dev,
>> +			 struct net_device *port_dev)
>> +{
>> +	if (!dev || !port_dev || dev_net(dev) != dev_net(port_dev))
>> +		return -ENODEV;
>> +
>> +	if (!vrf_is_master(port_dev) && !vrf_is_slave(port_dev)) {
>> +		struct slave *s = kzalloc(sizeof(*s), GFP_KERNEL);
>> +		struct net_vrf *vrf = netdev_priv(dev);
>> +		struct slave_queue *queue = &vrf->queue;
>> +		bool is_running = netif_running(port_dev);
>> +		unsigned int flags = port_dev->flags;
>> +		int ret;
>> +
>> +		if (!s)
>> +			return -ENOMEM;
>> +
>> +		s->dev = port_dev;
>> +
>> +		spin_lock_bh(&queue->lock);
>> +		__vrf_insert_slave(queue, s, dev);
>> +		spin_unlock_bh(&queue->lock);
>> +
>> +		port_dev->vrf_ptr = kmalloc(sizeof(*port_dev->vrf_ptr),
>> +					    GFP_KERNEL);
>> +		if (!port_dev->vrf_ptr)
>> +			return -ENOMEM;
> ^
> I believe you'll have a slave in the list with inconsistent state which
> could even lead to null ptr dereference if vrf_ptr is used, also
> __vrf_insert_slave does dev_hold so the dev refcnt will be incorrect as
> well.

Right. Good catch, will fix.

>> +
>> +		port_dev->vrf_ptr->ifindex = dev->ifindex;
>> +		port_dev->vrf_ptr->tb_id = vrf->tb_id;
>> +
>> +		/* register the packet handler for slave ports */
>> +		ret = netdev_rx_handler_register(port_dev, vrf_handle_frame,
>> +						 (void *)dev);
>> +		if (ret) {
>> +			netdev_err(port_dev,
>> +				   "Device %s failed to register rx_handler\n",
>> +				   port_dev->name);
>> +			kfree(port_dev->vrf_ptr);
>> +			kfree(s);
>> +			return ret;
> ^^
> The slave is being freed while on the list here, device's refcnt will be
> wrong etc.

ack. Will fix.

>> +		}
>> +
>> +		if (is_running) {
>> +			ret = dev_change_flags(port_dev, flags & ~IFF_UP);
>> +			if (ret < 0)
>> +				goto out_fail;
>> +		}
>> +
>> +		ret = netdev_master_upper_dev_link(port_dev, dev);
>> +		if (ret < 0)
>> +			goto out_fail;
>> +
>> +		if (is_running) {
>> +			ret = dev_change_flags(port_dev, flags);
>> +			if (ret < 0)
>> +				goto out_fail;
>> +		}
>> +
>> +		port_dev->flags |= IFF_SLAVE;
>> +
>> +		return 0;
>> +
>> +out_fail:
>> +		spin_lock_bh(&queue->lock);
>> +		__vrf_kill_slave(queue, s);
>> +		spin_unlock_bh(&queue->lock);
>
> __vrf_kill_slave() doesn't do upper device unlink and the device can be
> linked if we fail in the dev_change_flags above.

will fix.

>> +
>> +		return ret;
>> +	}
>> +
>> +	return -EINVAL;
>> +}
>
> In my opinion the structure of the above function should change to
> something more straightforward with proper exit labels and cleanup upon
> failure, also a level of indentation can be avoided.

Sure. The indentation comes after the pointer checks so locals can be
initialized when declared. Will work on the clean up/simplification for
next rev.

>> +
>> +static int vrf_del_slave(struct net_device *dev,
>> +			 struct net_device *port_dev)
>> +{
>> +	struct net_vrf *vrf = netdev_priv(dev);
>> +	struct slave_queue *queue = &vrf->queue;
>> +	struct slave *slave = __vrf_find_slave_dev(queue, port_dev);
>> +	bool is_running = netif_running(port_dev);
>> +	unsigned int flags = port_dev->flags;
>> +	int ret = 0;
>
> ret seems unused/unchecked in this function

It is used but not checked. I struggled with what to do on the error
path. Do we want netdev_err() on a failure?

>> +
>> +	if (!slave)
>> +		return -EINVAL;
>> +
>> +	if (is_running)
>> +		ret = dev_change_flags(port_dev, flags & ~IFF_UP);
>> +
>> +	spin_lock_bh(&queue->lock);
>> +	__vrf_kill_slave(queue, slave);
>> +	spin_unlock_bh(&queue->lock);
>> +
>> +	netdev_upper_dev_unlink(port_dev, dev);
>> +
>> +	if (is_running)
>> +		ret = dev_change_flags(port_dev, flags);
>> +
>> +	return 0;
>> +}
>> +
>> +static int vrf_dev_init(struct net_device *dev)
>> +{
>> +	struct net_vrf *vrf = netdev_priv(dev);
>> +
>> +	spin_lock_init(&vrf->queue.lock);
>> +	INIT_LIST_HEAD(&vrf->queue.all_slaves);
>> +	vrf->queue.master_dev = dev;
>> +
>> +	dev->dstats = netdev_alloc_pcpu_stats(struct pcpu_dstats);
>> +	dev->flags = IFF_MASTER | IFF_NOARP;
>> +	if (!dev->dstats)
>> +		return -ENOMEM;
> ^
> nit: I'd suggest moving the check after the allocation

agreed.

David
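For reference, the "proper exit labels" structure asked for in the review could look roughly like the user-space sketch below. This is not the next revision of the driver: struct model_port, model_add_slave() and the fail_at knob are all hypothetical, and the counters only model dev_hold(), list membership and the upper-device link. What it is meant to show is the ordering: allocate everything first, publish the slave last, and unwind in reverse order through labels so no failure path leaves a freed slave on the list or a stale refcount.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical user-space stand-ins; none of these are the driver's types. */
struct model_port {
	int refcnt;	/* models dev_hold()/dev_put() */
	int on_list;	/* models membership in the slave list */
	int linked;	/* models netdev_master_upper_dev_link() */
	void *vrf_ptr;	/* models port_dev->vrf_ptr */
	void *slave;	/* models the struct slave on the list */
};

/* fail_at picks the step that fails (0 = none) so each path can be exercised. */
static int model_add_slave(struct model_port *port, int fail_at)
{
	void *slave;
	int ret;

	slave = malloc(16);
	if (!slave)
		return -ENOMEM;

	port->vrf_ptr = (fail_at == 1) ? NULL : malloc(16);
	if (!port->vrf_ptr) {
		ret = -ENOMEM;
		goto err_free_slave;
	}

	/* Publish the slave only after every allocation has succeeded. */
	port->slave = slave;
	port->on_list = 1;
	port->refcnt++;

	if (fail_at == 2) {	/* models rx handler registration failing */
		ret = -EBUSY;
		goto err_unlist;
	}

	port->linked = 1;
	if (fail_at == 3) {	/* models the final dev_change_flags() failing */
		ret = -EIO;
		goto err_unlink;
	}

	return 0;

err_unlink:
	port->linked = 0;
err_unlist:
	port->refcnt--;
	port->on_list = 0;
	port->slave = NULL;
	free(port->vrf_ptr);
	port->vrf_ptr = NULL;
err_free_slave:
	free(slave);
	return ret;
}
```

Each label undoes exactly one step, so adding a new step later means adding one label rather than auditing every early return.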
[RFC net-next 5/6] net: Add sk_bind_dev_if to task_struct
Allow tasks to have a default device index for binding sockets. If set the value is passed to all AF_INET/AF_INET6 sockets when they are created. The task setting is passed parent to child on fork, but can be set or changed after task creation using prctl (if task has CAP_NET_ADMIN permissions). The setting for a socket can be retrieved using prctl(). This option allows an administrator to restrict a task to only send/receive packets through the specified device. In the case of VRF devices this option restricts tasks to a specific VRF. Correlation of the device index to a specific VRF, ie., ifindex -- VRF device -- VRF id is left to userspace. Example using VRF devices: 1. vrf1 is created and assigned to table 5 2. eth2 is enslaved to vrf1 3. eth2 is given the address 1.1.1.1/24 $ ip route ls table 5 prohibit default 1.1.1.0/24 dev eth2 scope link local 1.1.1.1 dev eth2 proto kernel scope host src 1.1.1.1 With out setting a VRF context ping, tcp and udp attempts fail. e.g, $ ping 1.1.1.254 connect: Network is unreachable After binding the task to the vrf device ping succeeds: $ ./chvrf -v 1 ping -c1 1.1.1.254 PING 1.1.1.254 (1.1.1.254) 56(84) bytes of data. 
64 bytes from 1.1.1.254: icmp_seq=1 ttl=64 time=2.32 ms --- include/linux/sched.h | 3 +++ include/uapi/linux/prctl.h | 4 kernel/fork.c | 2 ++ kernel/sys.c | 35 +++ net/ipv4/af_inet.c | 1 + net/ipv6/af_inet6.c| 1 + 6 files changed, 46 insertions(+) diff --git a/include/linux/sched.h b/include/linux/sched.h index 6633e83e608a..0b6ab0e2ea57 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1543,6 +1543,9 @@ struct task_struct { struct files_struct *files; /* namespaces */ struct nsproxy *nsproxy; +/* network */ + /* if set INET/INET6 sockets are bound to given dev index on create */ + int sk_bind_dev_if; /* signal handlers */ struct signal_struct *signal; struct sighand_struct *sighand; diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 31891d9535e2..1ef45195d146 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -190,4 +190,8 @@ struct prctl_mm_map { # define PR_FP_MODE_FR (1 0)/* 64b FP registers */ # define PR_FP_MODE_FRE(1 1)/* 32b compatibility */ +/* get/set network interface sockets are bound to by default */ +#define PR_SET_SK_BIND_DEV_IF 47 +#define PR_GET_SK_BIND_DEV_IF 48 + #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/fork.c b/kernel/fork.c index 0bb88b50..d2c7f32370ef 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -375,6 +375,8 @@ static struct task_struct *dup_task_struct(struct task_struct *orig) tsk-splice_pipe = NULL; tsk-task_frag.page = NULL; + tsk-sk_bind_dev_if = orig-sk_bind_dev_if; + account_kernel_stack(ti, 1); return tsk; diff --git a/kernel/sys.c b/kernel/sys.c index 8571296b7ddb..7e56fb9dbf8e 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -52,6 +52,7 @@ #include linux/rcupdate.h #include linux/uidgid.h #include linux/cred.h +#include linux/netdevice.h #include linux/kmsg_dump.h /* Move somewhere else to avoid recompiling? 
*/ @@ -2243,6 +2244,40 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, case PR_GET_FP_MODE: error = GET_FP_MODE(me); break; +#ifdef CONFIG_NET + case PR_SET_SK_BIND_DEV_IF: + { + struct net_device *dev; + int idx = (int) arg2; + + if (!capable(CAP_NET_ADMIN)) + return -EPERM; + + if (idx) { + dev = dev_get_by_index(me-nsproxy-net_ns, idx); + if (!dev) + return -EINVAL; + dev_put(dev); + } + me-sk_bind_dev_if = idx; + break; + } + case PR_GET_SK_BIND_DEV_IF: + { + struct task_struct *tsk; + int sk_bind_dev_if = -EINVAL; + + rcu_read_lock(); + tsk = find_task_by_vpid(arg2); + if (tsk) + sk_bind_dev_if = tsk-sk_bind_dev_if; + rcu_read_unlock(); + if (tsk != me !capable(CAP_NET_ADMIN)) + return -EPERM; + error = sk_bind_dev_if; + break; + } +#endif default: error = -EINVAL; break; diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c index 9532ee87151f..a3b24f14e378 100644 --- a/net/ipv4/af_inet.c +++ b/net/ipv4/af_inet.c @@ -350,6 +350,7 @@ static int inet_create(struct net *net, struct socket *sock, int protocol, sk-sk_destruct= inet_sock_destruct; sk-sk_protocol= protocol; sk-sk_backlog_rcv = sk-sk_prot-backlog_rcv; +
[RFC net-next 3/6] net: Introduce VRF device driver - v2
This driver borrows heavily from IPvlan and teaming drivers. Routing domains (VRF-lite) are created by instantiating a device and enslaving all routed interfaces that participate in the domain. As part of the enslavement, all local routes pointing to enslaved devices are re-pointed to the vrf device, thus forcing outgoing sockets to bind to the vrf to function. Standard FIB rules can then bind the VRF device to tables and regular fib rule processing is followed. Routed traffic through the box, is fwded by using the VRF device as the IIF and following the IIF rule to a table which is mated with the VRF. Locally originated traffic is directed at the VRF device using SO_BINDTODEVICE or cmsg headers. This in turn drops the packet into the xmit function of the vrf driver, which then completes the ip lookup and output. This solution is completely orthogonal to namespaces and allow the L3 equivalent of vlans to exist allowing the routing space to be partitioned. Example: Create vrf 1: ip link add vrf1 type vrf table 5 ip rule add iif vrf1 table 5 ip rule add oif vrf1 table 5 ip route add table 5 prohibit default ip link set vrf1 up Add interface to vrf 1: ip link set eth1 master vrf1 Signed-off-by: Shrijeet Mukherjee s...@cumulusnetworks.com Signed-off-by: David Ahern d...@cumulusnetworks.com v2: - addressed comments from first RFC - significant changes to improve simplicity of implementation --- drivers/net/Kconfig | 7 + drivers/net/Makefile | 1 + drivers/net/vrf.c| 486 +++ include/net/vrf.h| 71 4 files changed, 565 insertions(+) create mode 100644 drivers/net/vrf.c create mode 100644 include/net/vrf.h diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig index 019fceffc9e5..b040aa233408 100644 --- a/drivers/net/Kconfig +++ b/drivers/net/Kconfig @@ -283,6 +283,13 @@ config NLMON diagnostics, etc. This is mostly intended for developers or support to debug netlink issues. If unsure, say N. 
+config NET_VRF + tristate Virtual Routing and Forwarding (Lite) + depends on IP_MULTIPLE_TABLES IPV6_MULTIPLE_TABLES + ---help--- + This option enables the support for mapping interfaces into VRF's. The + support enables VRF devices + endif # NET_CORE config SUNGEM_PHY diff --git a/drivers/net/Makefile b/drivers/net/Makefile index c12cb22478a7..ca16dd689b36 100644 --- a/drivers/net/Makefile +++ b/drivers/net/Makefile @@ -25,6 +25,7 @@ obj-$(CONFIG_VIRTIO_NET) += virtio_net.o obj-$(CONFIG_VXLAN) += vxlan.o obj-$(CONFIG_GENEVE) += geneve.o obj-$(CONFIG_NLMON) += nlmon.o +obj-$(CONFIG_NET_VRF) += vrf.o # # Networking Drivers diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c new file mode 100644 index ..b9f9ae68388d --- /dev/null +++ b/drivers/net/vrf.c @@ -0,0 +1,487 @@ +/* + * vrf.c: device driver to encapsulate a VRF space + * + * Copyright (c) 2015 Cumulus Networks + * + * Based on dummy, team and ipvlan drivers + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. 
+ */ + +#include linux/module.h +#include linux/kernel.h +#include linux/netdevice.h +#include linux/etherdevice.h +#include linux/ip.h +#include linux/init.h +#include linux/moduleparam.h +#include linux/rtnetlink.h +#include net/rtnetlink.h +#include net/arp.h +#include linux/u64_stats_sync.h +#include linux/hashtable.h + +#include linux/inetdevice.h +#include net/ip.h +#include net/ip_fib.h +#include net/ip6_route.h +#include net/rtnetlink.h +#include net/route.h +#include net/addrconf.h +#include net/vrf.h + +#define DRV_NAME vrf +#define DRV_VERSION1.0 + +#define vrf_is_slave(dev) ((dev)-flags IFF_SLAVE) +#define vrf_is_master(dev) ((dev)-flags IFF_MASTER) + +#define vrf_master_get_rcu(dev) \ + ((struct net_device *)rcu_dereference(dev-rx_handler_data)) + +struct pcpu_dstats { + u64 tx_pkts; + u64 tx_bytes; + u64 tx_drps; + u64 rx_pkts; + u64 rx_bytes; + struct u64_stats_sync syncp; +}; + +struct slave { + struct list_headlist; + struct net_device *dev; + longpriority; +}; + +struct slave_queue { + spinlock_t lock; /* lock for slave insert/delete */ + struct list_headall_slaves; + int num_slaves; + struct net_device *master_dev; +}; + +struct net_vrf { + struct slave_queue queue; + struct fib_table*tb; + u32 tb_id; +}; + +static int is_ip_rx_frame(struct sk_buff *skb) +{ + switch (skb-protocol) { + case htons
[RFC net-next 4/6] net: Modifications to ipv4 stack for VRF devices
With the following tweaks to the IPv4 stack: - enslaving devices to a VRF device automatically moves routes to the VRF table; removing the VRF master moves routes back to the main table - the following use cases work for both Rx and Tx: + ICMP (ping -I vrf-device ip) + TCP server and client bound to VRF device + TCP server not bound to VRF device but working through it * client connections are bound to VRF device + UDP server and client bound to VRF device Signed-off-by: Shrijeet Mukherjee s...@cumulusnetworks.com Signed-off-by: David Ahern d...@cumulusnetworks.com --- include/net/flow.h| 1 + include/net/inet_hashtables.h | 9 +++-- include/net/route.h | 4 net/ipv4/fib_frontend.c | 30 -- net/ipv4/fib_semantics.c | 25 - net/ipv4/fib_trie.c | 7 +-- net/ipv4/icmp.c | 4 net/ipv4/ping.c | 3 ++- net/ipv4/raw.c| 5 +++-- net/ipv4/route.c | 12 ++-- net/ipv4/syncookies.c | 4 +++- net/ipv4/tcp_input.c | 6 +- net/ipv4/tcp_ipv4.c | 6 -- net/ipv4/udp.c| 2 ++ 14 files changed, 90 insertions(+), 28 deletions(-) diff --git a/include/net/flow.h b/include/net/flow.h index 8109a159d1b3..69aaa99fdeb8 100644 --- a/include/net/flow.h +++ b/include/net/flow.h @@ -29,6 +29,7 @@ struct flowi_common { __u8flowic_flags; #define FLOWI_FLAG_ANYSRC 0x01 #define FLOWI_FLAG_KNOWN_NH0x02 +#define FLOWI_FLAG_VRFSRC 0x04 __u32 flowic_secid; }; diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h index b73c88a19dd4..e26c43823a13 100644 --- a/include/net/inet_hashtables.h +++ b/include/net/inet_hashtables.h @@ -31,6 +31,7 @@ #include net/route.h #include net/tcp_states.h #include net/netns/hash.h +#include net/vrf.h #include linux/atomic.h #include asm/byteorder.h @@ -300,10 +301,14 @@ static inline struct sock *__inet_lookup(struct net *net, struct inet_hashinfo *hashinfo, const __be32 saddr, const __be16 sport, const __be32 daddr, const __be16 dport, -const int dif) +int dif) { u16 hnum = ntohs(dport); - struct sock *sk = __inet_lookup_established(net, hashinfo, + struct sock 
*sk; + + dif = vrf_get_master_dev_idx(net, dif) ? : dif; + + sk = __inet_lookup_established(net, hashinfo, saddr, sport, daddr, hnum, dif); return sk ? : __inet_lookup_listener(net, hashinfo, saddr, sport, diff --git a/include/net/route.h b/include/net/route.h index fe22d03afb6a..460333bab217 100644 --- a/include/net/route.h +++ b/include/net/route.h @@ -188,6 +188,7 @@ void ipv4_sk_redirect(struct sk_buff *skb, struct sock *sk); void ip_rt_send_redirect(struct sk_buff *skb); unsigned int inet_addr_type(struct net *net, __be32 addr); +unsigned int inet_addr_type_table(struct net *net, __be32 addr, int tb_id); unsigned int inet_dev_addr_type(struct net *net, const struct net_device *dev, __be32 addr); void ip_rt_multicast_event(struct in_device *); @@ -250,6 +251,9 @@ static inline void ip_route_connect_init(struct flowi4 *fl4, __be32 dst, __be32 if (inet_sk(sk)-transparent) flow_flags |= FLOWI_FLAG_ANYSRC; + if (netif_idx_is_vrf(sock_net(sk), oif)) + flow_flags |= FLOWI_FLAG_VRFSRC; + flowi4_init_output(fl4, oif, sk-sk_mark, tos, RT_SCOPE_UNIVERSE, protocol, flow_flags, dst, src, dport, sport); } diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c index 974fa51effca..7c73eb058c91 100644 --- a/net/ipv4/fib_frontend.c +++ b/net/ipv4/fib_frontend.c @@ -45,6 +45,7 @@ #include net/ip_fib.h #include net/rtnetlink.h #include net/xfrm.h +#include net/vrf.h #ifndef CONFIG_IP_MULTIPLE_TABLES @@ -212,7 +213,7 @@ void fib_flush_external(struct net *net) */ static inline unsigned int __inet_dev_addr_type(struct net *net, const struct net_device *dev, - __be32 addr) + __be32 addr, int rt_table) { struct flowi4 fl4 = { .daddr = addr }; struct fib_result res; @@ -225,8 +226,7 @@ static inline unsigned int __inet_dev_addr_type(struct net *net, return RTN_MULTICAST; rcu_read_lock(); - - local_table = fib_get_table(net, RT_TABLE_LOCAL); + local_table = fib_get_table(net, rt_table); if (local_table) { ret = RTN_UNICAST; if (!fib_table_lookup(local_table, fl4, res
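The dif adjustment in __inet_lookup() is the piece that lets a socket bound to the VRF device match packets arriving on an enslaved port. A user-space sketch of the idea, assuming (as the "? :" usage implies) that the helper returns 0 when the device has no VRF master; struct model_dev and both functions are hypothetical:

```c
/* Hypothetical device table entry. */
struct model_dev {
	int ifindex;
	int master_ifindex;	/* 0 when the device is not enslaved */
};

/* Assumed analogue of vrf_get_master_dev_idx(): 0 if there is no master. */
static int model_master_idx(const struct model_dev *devs, int n, int ifindex)
{
	for (int i = 0; i < n; i++)
		if (devs[i].ifindex == ifindex)
			return devs[i].master_ifindex;
	return 0;
}

/* Device index to use for the socket lookup. */
static int model_lookup_dif(const struct model_dev *devs, int n, int dif)
{
	int master = model_master_idx(devs, n, dif);

	/* Portable spelling of the GNU "a ? : b" shorthand used in the patch. */
	return master ? master : dif;
}
```

So a packet received on an enslaved port is looked up with the VRF device's index, while traffic on devices outside any VRF is looked up exactly as before.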
[RFC PATCH] iproute2: Add support for VRF device
Allow user to create a vrf device and specify its table binding. Based on the iplink_vlan implementation. Signed-off-by: Shrijeet Mukherjee s...@cumulusnetworks.com Signed-off-by: David Ahern d...@cumulusnetworks.com --- include/linux/if_link.h | 8 + ip/Makefile | 2 +- ip/iplink.c | 2 +- ip/iplink_vrf.c | 88 + 4 files changed, 98 insertions(+), 2 deletions(-) create mode 100644 ip/iplink_vrf.c diff --git a/include/linux/if_link.h b/include/linux/if_link.h index 8df6a8466839..28872fbf6814 100644 --- a/include/linux/if_link.h +++ b/include/linux/if_link.h @@ -337,6 +337,14 @@ enum macvlan_macaddr_mode { #define MACVLAN_FLAG_NOPROMISC 1 +/* VRF section */ +enum { + IFLA_VRF_UNSPEC, + IFLA_VRF_TABLE, + __IFLA_VRF_MAX +}; + +#define IFLA_VRF_MAX (__IFLA_VRF_MAX - 1) /* IPVLAN section */ enum { IFLA_IPVLAN_UNSPEC, diff --git a/ip/Makefile b/ip/Makefile index 77653ecc5785..d8b38ac2e44b 100644 --- a/ip/Makefile +++ b/ip/Makefile @@ -7,7 +7,7 @@ IPOBJ=ip.o ipaddress.o ipaddrlabel.o iproute.o iprule.o ipnetns.o \ iplink_vxlan.o tcp_metrics.o iplink_ipoib.o ipnetconf.o link_ip6tnl.o \ link_iptnl.o link_gre6.o iplink_bond.o iplink_bond_slave.o iplink_hsr.o \ iplink_bridge.o iplink_bridge_slave.o ipfou.o iplink_ipvlan.o \ -iplink_geneve.o +iplink_geneve.o iplink_vrf.o RTMONOBJ=rtmon.o diff --git a/ip/iplink.c b/ip/iplink.c index e296e6f611b8..892e8bc8808b 100644 --- a/ip/iplink.c +++ b/ip/iplink.c @@ -93,7 +93,7 @@ void iplink_usage(void) fprintf(stderr, TYPE := { vlan | veth | vcan | dummy | ifb | macvlan | macvtap |\n); fprintf(stderr, bridge | bond | ipoib | ip6tnl | ipip | sit | vxlan |\n); fprintf(stderr, gre | gretap | ip6gre | ip6gretap | vti | nlmon |\n); - fprintf(stderr, bond_slave | ipvlan | geneve }\n); + fprintf(stderr, bond_slave | ipvlan | geneve | vrf }\n); } exit(-1); } diff --git a/ip/iplink_vrf.c b/ip/iplink_vrf.c new file mode 100644 index ..8d66802cf940 --- /dev/null +++ b/ip/iplink_vrf.c @@ -0,0 +1,88 @@ +/* iplink_vrf.cVRF device support + * + * This 
program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + * + * Authors: Shrijeet Mukherjee s...@cumulusnetworks.com + */ + +#include stdio.h +#include stdlib.h +#include string.h +#include sys/socket.h +#include linux/if_link.h + +#include rt_names.h +#include utils.h +#include ip_common.h + +static void vrf_explain(FILE *f) +{ + fprintf(f, Usage: ... vrf table TABLEID \n); +} + +static void explain(void) +{ + vrf_explain(stderr); +} + +static int table_arg(void) +{ + fprintf(stderr,Error: argument of \table\ must be 0-32767 and currently unused\n); + return -1; +} + +static int vrf_parse_opt(struct link_util *lu, int argc, char **argv, + struct nlmsghdr *n) +{ + while (argc 0) { + if (matches(*argv, table) == 0) { + __u32 table = 0; + NEXT_ARG(); + + table = atoi(*argv); + if (table 0 || table 32767) + return table_arg(); + /* XXX need a table in-use check here */ + fprintf(stderr, adding table %d\n, table); + addattr32(n, 1024, IFLA_VRF_TABLE, table); + } else if (matches(*argv, help) == 0) { + explain(); + return -1; + } else { + fprintf(stderr, vrf: unknown option \%s\?\n, + *argv); + explain(); + return -1; + } + argc--, argv++; + } + + return 0; +} + +static void vrf_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[]) +{ +printf(vrf_print_opt ...\n); + if (!tb) + return; + + if (tb[IFLA_VRF_TABLE]) + fprintf(f, table %u , rta_getattr_u32(tb[IFLA_VRF_TABLE])); +} + +static void vrf_print_help(struct link_util *lu, int argc, char **argv, + FILE *f) +{ + vrf_explain(f); +} + +struct link_util vrf_link_util = { + .id = vrf, + .maxattr= IFLA_VRF_MAX, + .parse_opt = vrf_parse_opt, + .print_opt = vrf_print_opt, + .print_help = vrf_print_help, +}; -- 2.3.2 (Apple Git-55) -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to 
majord...@vger.kernel.org More
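One note on the table argument handling in vrf_parse_opt(): atoi() cannot report malformed input, and a lower-bound test on a __u32 (presumably "table < 0" before the archive dropped the operators) can never be true. A stricter parse could look like the sketch below. parse_table_id() is a hypothetical helper, not an iproute2 function, and the 0-32767 range is simply the one quoted in the error message above.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse a table id in the range 0-32767.
 * Returns 0 and stores the value on success, -1 on malformed or
 * out-of-range input.
 */
static int parse_table_id(const char *arg, uint32_t *table)
{
	char *end;
	unsigned long val;

	/* strtoul() silently negates "-5", so reject a leading minus. */
	if (!arg || *arg == '\0' || *arg == '-')
		return -1;

	errno = 0;
	val = strtoul(arg, &end, 0);
	if (errno || *end != '\0' || val > 32767)
		return -1;

	*table = (uint32_t)val;
	return 0;
}
```

With this, "ip link add vrf1 type vrf table abc" would be rejected instead of quietly becoming table 0.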
[RFC net-next 1/6] fib: export symbols
This change is needed for the following VRF driver. No active code path
changes.

Signed-off-by: Shrijeet Mukherjee <s...@cumulusnetworks.com>
Signed-off-by: David Ahern <d...@cumulusnetworks.com>
---
 net/ipv4/fib_frontend.c | 1 +
 net/ipv4/fib_trie.c     | 1 +
 2 files changed, 2 insertions(+)

diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 6bbc54940eb4..974fa51effca 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -108,6 +108,7 @@ struct fib_table *fib_new_table(struct net *net, u32 id)
 	hlist_add_head_rcu(&tb->tb_hlist, &net->ipv4.fib_table_hash[h]);
 	return tb;
 }
+EXPORT_SYMBOL_GPL(fib_new_table);
 
 /* caller must hold either rtnl or rcu read lock */
 struct fib_table *fib_get_table(struct net *net, u32 id)
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 15d32612e3c6..ac2d828c6daa 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -1887,6 +1887,7 @@ void fib_free_table(struct fib_table *tb)
 {
 	call_rcu(&tb->rcu, __trie_free_rcu);
 }
+EXPORT_SYMBOL_GPL(fib_free_table);
 
 static int fn_trie_dump_leaf(struct key_vector *l, struct fib_table *tb,
 			     struct sk_buff *skb, struct netlink_callback *cb)
-- 
2.3.2 (Apple Git-55)
[RFC net-next 6/6] net: Add chvrf command
Example of how to use the default bind to interface option for tasks and correlate with VRF devices. Signed-off-by: David Ahern d...@cumulusnetworks.com --- tools/net/Makefile | 6 +- tools/net/chvrf.c | 225 + 2 files changed, 229 insertions(+), 2 deletions(-) create mode 100644 tools/net/chvrf.c diff --git a/tools/net/Makefile b/tools/net/Makefile index ee577ea03ba5..c13f11f5637a 100644 --- a/tools/net/Makefile +++ b/tools/net/Makefile @@ -10,7 +10,7 @@ YACC = bison %.lex.c: %.l $(LEX) -o $@ $ -all : bpf_jit_disasm bpf_dbg bpf_asm +all : bpf_jit_disasm bpf_dbg bpf_asm chvrf bpf_jit_disasm : CFLAGS = -Wall -O2 -DPACKAGE='bpf_jit_disasm' bpf_jit_disasm : LDLIBS = -lopcodes -lbfd -ldl @@ -25,8 +25,10 @@ bpf_asm : LDLIBS = bpf_asm : bpf_asm.o bpf_exp.yacc.o bpf_exp.lex.o bpf_exp.lex.o : bpf_exp.yacc.c +chvrf : CFLAGS = -Wall -O2 + clean : - rm -rf *.o bpf_jit_disasm bpf_dbg bpf_asm bpf_exp.yacc.* bpf_exp.lex.* + rm -rf *.o bpf_jit_disasm bpf_dbg bpf_asm bpf_exp.yacc.* bpf_exp.lex.* chvrf install : install bpf_jit_disasm $(prefix)/bin/bpf_jit_disasm diff --git a/tools/net/chvrf.c b/tools/net/chvrf.c new file mode 100644 index ..71cc925fd101 --- /dev/null +++ b/tools/net/chvrf.c @@ -0,0 +1,225 @@ +/* + * chvrf.c - Example of how to use the default bind-to-device option for + * tasks and correlate to VRFs via the VRF device. + * + * Copyright (c) 2015 Cumulus Networks + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. 
+ */ +#include sys/ioctl.h +#include sys/prctl.h +#include sys/socket.h +#include signal.h +#include string.h +#include stdio.h +#include stdlib.h +#include unistd.h +#include netinet/in.h +#include net/if.h /* for struct ifreq */ +#include libgen.h +#include errno.h + +#ifndef PR_SET_SK_BIND_DEV_IF +#define PR_SET_SK_BIND_DEV_IF 47 +#endif +#ifndef PR_GET_SK_BIND_DEV_IF +#define PR_GET_SK_BIND_DEV_IF 48 +#endif + +static int vrf_to_device(int vrf) +{ + struct ifreq ifdata; + int sd, rc; + + memset(ifdata, 0, sizeof(ifdata)); + snprintf(ifdata.ifr_name, sizeof(ifdata.ifr_name) - 1, vrf%d, vrf); + + sd = socket(PF_INET, SOCK_DGRAM, IPPROTO_IP); + if (sd 0) { + perror(socket failed); + return -1; + } + + /* Get the index for the specified interface */ + rc = ioctl(sd, SIOCGIFINDEX, (char *)ifdata); + close(sd); + if (rc != 0) { + perror(ioctl(SIOCGIFINDEX) failed); + return -1; + } + + return ifdata.ifr_ifindex; +} + +static int device_to_vrf(int idx) +{ + struct ifreq ifdata; + int sd, vrf, rc; + + memset(ifdata, 0, sizeof(ifdata)); + ifdata.ifr_ifindex = idx; + + sd = socket(PF_INET, SOCK_DGRAM, IPPROTO_IP); + if (sd 0) { + perror(socket failed); + return -1; + } + + /* Get the index for the specified interface */ + rc = ioctl(sd, SIOCGIFNAME, (char *)ifdata); + close(sd); + if (rc != 0) { + perror(ioctl(SIOCGIFNAME) failed); + return -1; + } + + if (sscanf(ifdata.ifr_name, vrf%d, vrf) != 1) { + fprintf(stderr, Unexpected device name (%s)\n, ifdata.ifr_name); + vrf = -1; + } + + return vrf; +} + +static int set_vrf(int vrf) +{ + int idx; + long err; + + /* convert vrf to device index */ + idx = vrf_to_device(vrf); + if (idx 0) { + fprintf(stderr, Failed to get device index for vrf %d\n, vrf); + return -1; + } + + /* set default device bind */ + err = prctl(PR_SET_SK_BIND_DEV_IF, idx); + if (err 0) { + fprintf(stderr, prctl failed to device index: %d\n, errno); + return -1; + } + + return 0; +} + +/* get vrf context for given process id */ +static int get_vrf(pid_t 
pid) +{ + int vrf; + long err; + + /* lookup device index pid is tied to */ + err = prctl(PR_GET_SK_BIND_DEV_IF, pid); + if (err 0) { + fprintf(stderr, prctl failed: %d\n, errno); + return -1; + } + + if (err == 0) + return 0; + + /* convert device index to vrf id */ + vrf = device_to_vrf((int)err); + if (vrf 0) { + fprintf(stderr, Failed to get device index for vrf %d\n, vrf); + return -1; + } + + return vrf; +} + +static int run_vrf(char **argv, int vrf) +{ + char *cmd; + + if (set_vrf(vrf) != 0) { + fprintf(stderr, Failed to set vrf context\n); + return 1; + } + + cmd = strdup(argv
[RFC net-next 0/6] Proposal for VRF-lite - v2
case a VRF global or agnostic process handles the connection (ie., this allows 1 listener socket to handle connections across VRFs). The child socket becomes bound to the VRF (sk_bound_dev_if is set to the VRF device). 5. Neighbor entries Neighbor entries are not impacted by the VRF device. Entries are associated with a particular interface; the VRF association is indirect via the interface-to-VRF device enslavement. TO-DO = 1. ipv4 multicast 2. ICMP and error path handling on connection attempts - e.g., connection attempt to a port with no listener 3. IPv6 4. netfilter integration 5. listen filter to restrict VRF connections - i.e., bind to VRF's a, b, c only or NOT VRFs e, f, g Bug-Fixes and ideas from Hannes, Roopa Prabhu, Jon Toppins, Jamal Patches can also be pulled from: https://github.com/dsahern/linux.git, vrf-dev-4.1 branch https://github.com/dsahern/iproute2, vrf-dev-4.1 branch Shrijeet Mukherjee and David Ahern (6): fib: export symbols net: Preparation for vrf device net: Introduce VRF device driver - v2 net: Modifications to ipv4 stack for VRF devices net: Add sk_bind_dev_if to task_struct net: Add chvrf command drivers/net/Kconfig | 7 + drivers/net/Makefile | 1 + drivers/net/vrf.c | 486 ++ include/linux/netdevice.h | 21 ++ include/linux/sched.h | 3 + include/net/flow.h| 1 + include/net/inet_hashtables.h | 9 +- include/net/route.h | 4 + include/net/vrf.h | 71 ++ include/uapi/linux/if_link.h | 9 + include/uapi/linux/prctl.h| 4 + kernel/fork.c | 2 + kernel/sys.c | 35 +++ net/ipv4/af_inet.c| 1 + net/ipv4/fib_frontend.c | 31 ++- net/ipv4/fib_semantics.c | 25 ++- net/ipv4/fib_trie.c | 8 +- net/ipv4/icmp.c | 4 + net/ipv4/ping.c | 3 +- net/ipv4/raw.c| 5 +- net/ipv4/route.c | 12 +- net/ipv4/syncookies.c | 4 +- net/ipv4/tcp_input.c | 6 +- net/ipv4/tcp_ipv4.c | 6 +- net/ipv4/udp.c| 2 + net/ipv6/af_inet6.c | 1 + tools/net/Makefile| 6 +- tools/net/chvrf.c | 225 +++ 28 files changed, 962 insertions(+), 30 deletions(-) create mode 100644 drivers/net/vrf.c create mode 
100644 include/net/vrf.h create mode 100644 tools/net/chvrf.c -- 2.3.2 (Apple Git-55) -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC net-next 2/6] net: Preparation for vrf device
Add a VRF_MASTER flag for interfaces and helper functions for determining if a device is a VRF_MASTER. Also, add link attribute for passing VRF_TABLE id. Both are used in the following patch that adds a VRF device driver. Signed-off-by: Shrijeet Mukherjee s...@cumulusnetworks.com Signed-off-by: David Ahern d...@cumulusnetworks.com --- include/linux/netdevice.h| 21 + include/uapi/linux/if_link.h | 9 + 2 files changed, 30 insertions(+) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index e20979dfd6a9..142cb64f139c 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1274,6 +1274,7 @@ enum netdev_priv_flags { IFF_XMIT_DST_RELEASE_PERM = 122, IFF_IPVLAN_MASTER = 123, IFF_IPVLAN_SLAVE= 124, + IFF_VRF_MASTER = 125, }; #define IFF_802_1Q_VLANIFF_802_1Q_VLAN @@ -1301,6 +1302,7 @@ enum netdev_priv_flags { #define IFF_XMIT_DST_RELEASE_PERM IFF_XMIT_DST_RELEASE_PERM #define IFF_IPVLAN_MASTER IFF_IPVLAN_MASTER #define IFF_IPVLAN_SLAVE IFF_IPVLAN_SLAVE +#define IFF_VRF_MASTER IFF_VRF_MASTER /** * struct net_device - The DEVICE structure. 
@@ -1417,6 +1419,7 @@ enum netdev_priv_flags { * @dn_ptr:DECnet specific data * @ip6_ptr: IPv6 specific data * @ax25_ptr: AX.25 specific data + * @vrf_ptr: VRF specific data * @ieee80211_ptr: IEEE 802.11 specific data, assign before registering * * @last_rx: Time of last Rx @@ -1629,6 +1632,7 @@ struct net_device { struct dn_dev __rcu *dn_ptr; struct inet6_dev __rcu *ip6_ptr; void*ax25_ptr; + struct net_vrf_dev *vrf_ptr; struct wireless_dev *ieee80211_ptr; struct wpan_dev *ieee802154_ptr; #if IS_ENABLED(CONFIG_MPLS_ROUTING) @@ -3781,6 +3785,23 @@ static inline bool netif_supports_nofcs(struct net_device *dev) return dev-priv_flags IFF_SUPP_NOFCS; } +static inline bool netif_is_vrf(struct net_device *dev) +{ + return dev-priv_flags IFF_VRF_MASTER; +} + +static inline bool netif_idx_is_vrf(struct net *net, int idx) +{ + struct net_device *dev = dev_get_by_index(net, idx); + bool rc = false; + + if (dev) { + rc = netif_is_vrf(dev); + dev_put(dev); + } + return rc; +} + /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */ static inline void netif_keep_dst(struct net_device *dev) { diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h index 2c7e8e3d3981..bfbb4d8eeec2 100644 --- a/include/uapi/linux/if_link.h +++ b/include/uapi/linux/if_link.h @@ -339,6 +339,15 @@ enum macvlan_macaddr_mode { #define MACVLAN_FLAG_NOPROMISC 1 +/* VRF section */ +enum { + IFLA_VRF_UNSPEC, + IFLA_VRF_TABLE, + __IFLA_VRF_MAX +}; + +#define IFLA_VRF_MAX (__IFLA_VRF_MAX - 1) + /* IPVLAN section */ enum { IFLA_IPVLAN_UNSPEC, -- 2.3.2 (Apple Git-55) -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] net: Updates to netif_index_is_vrf
As Eric noted netif_index_is_vrf is not called with rcu_read_lock held, so use dev_get_by_index instead of dev_get_by_index_rcu. If VRF is not enabled or oif is 0 skip the device lookup. Signed-off-by: David Ahern d...@cumulusnetworks.com --- include/linux/netdevice.h | 14 +++--- 1 file changed, 11 insertions(+), 3 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index f7a6ef2fae3a..dca36a618781 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -3819,12 +3819,20 @@ static inline bool netif_is_vrf(const struct net_device *dev) static inline bool netif_index_is_vrf(struct net *net, int ifindex) { - struct net_device *dev = dev_get_by_index_rcu(net, ifindex); bool rc = false; - if (dev) - rc = netif_is_vrf(dev); +#if IS_ENABLED(CONFIG_NET_VRF) + struct net_device *dev; + + if (ifindex == 0) + return false; + dev = dev_get_by_index(net, ifindex); + if (dev) { + rc = netif_is_vrf(dev); + dev_put(dev); + } +#endif return rc; } -- 2.3.2 (Apple Git-55) -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] net: Move VRF change to udp_sendmsg to inlined function
Functionally equivalent, but as an inlined function with VRF config check it completely compiles out if VRF is not enabled. Suggested-by: Tom Herbert t...@herbertland.com Signed-off-by: David Ahern d...@cumulusnetworks.com --- net/ipv4/udp.c | 44 1 file changed, 24 insertions(+), 20 deletions(-) diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index c0a15e7f359f..384f8d918033 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -873,6 +873,27 @@ int udp_push_pending_frames(struct sock *sk) } EXPORT_SYMBOL(udp_push_pending_frames); +/* unconnected socket. If output device is enslaved to a VRF + * device lookup source address from VRF table. This mimics + * behavior of ip_route_connect{_init}. + */ +static inline void udp_sendmsg_vrf_saddr(struct net *net, struct flowi4 *fl4, +int oif, struct sock *sk) +{ +#if IS_ENABLED(CONFIG_NET_VRF) + if (netif_index_is_vrf(net, oif)) { + __u8 flow_flags = fl4-flowi4_flags; + struct rtable *rt; + + fl4-flowi4_flags = flow_flags | FLOWI_FLAG_VRFSRC; + rt = ip_route_output_flow(net, fl4, sk); + if (!IS_ERR(rt)) + ip_rt_put(rt); + fl4-flowi4_flags = flow_flags; + } +#endif +} + int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) { struct inet_sock *inet = inet_sk(sk); @@ -1013,33 +1034,16 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) if (!rt) { struct net *net = sock_net(sk); - __u8 flow_flags = inet_sk_flowi_flags(sk); fl4 = fl4_stack; - /* unconnected socket. If output device is enslaved to a VRF -* device lookup source address from VRF table. This mimics -* behavior of ip_route_connect{_init}. 
-*/ - if (netif_index_is_vrf(net, ipc.oif)) { - flowi4_init_output(fl4, ipc.oif, sk-sk_mark, tos, - RT_SCOPE_UNIVERSE, sk-sk_protocol, - (flow_flags | FLOWI_FLAG_VRFSRC), - faddr, saddr, dport, - inet-inet_sport); - - rt = ip_route_output_flow(net, fl4, sk); - if (!IS_ERR(rt)) { - saddr = fl4-saddr; - ip_rt_put(rt); - } - } - flowi4_init_output(fl4, ipc.oif, sk-sk_mark, tos, RT_SCOPE_UNIVERSE, sk-sk_protocol, - flow_flags, + inet_sk_flowi_flags(sk), faddr, saddr, dport, inet-inet_sport); + udp_sendmsg_vrf_saddr(net, fl4, ipc.oif, sk); + security_sk_classify_flow(sk, flowi4_to_flowi(fl4)); rt = ip_route_output_flow(net, fl4, sk); if (IS_ERR(rt)) { -- 2.3.2 (Apple Git-55) -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] net: Updates to netif_index_is_vrf
On 8/15/15 6:39 PM, Florian Westphal wrote:
> David Ahern <d...@cumulusnetworks.com> wrote:
>> As Eric noted netif_index_is_vrf is not called with rcu_read_lock held,
>> so use dev_get_by_index instead of dev_get_by_index_rcu.
>>
>> If VRF is not enabled or oif is 0 skip the device lookup.
>>
>> Signed-off-by: David Ahern <d...@cumulusnetworks.com>
>
> Why not
>
>  static inline bool netif_index_is_vrf(struct net *net, int ifindex)
>  {
> -	struct net_device *dev = dev_get_by_index_rcu(net, ifindex);
>  	bool rc = false;
>
> -	if (dev)
> -		rc = netif_is_vrf(dev);
> +#if IS_ENABLED(CONFIG_NET_VRF)
> +	struct net_device *dev;
> +
> +	if (ifindex == 0)
> +		return false;
>
> 	rcu_read_lock();
> 	dev = dev_get_by_index_rcu(net, ifindex);
> 	if (dev)
> 		rc = netif_is_vrf(dev);
> 	rcu_read_unlock();
> +#endif
>  	return rc;
>
> instead?

sure. That saves the inc and dec on the refcnt. will respin.

David
[PATCH net-next v2] net: Updates to netif_index_is_vrf
As Eric noted netif_index_is_vrf is not called with rcu_read_lock held,
so wrap the dev_get_by_index_rcu in rcu_read_lock and unlock.

If VRF is not enabled or oif is 0 skip the device lookup. In both cases
index cannot be the VRF master.

Signed-off-by: David Ahern <d...@cumulusnetworks.com>
---
v2:
- per Florian's suggestion keep the dev_get_by_index_rcu and wrap with
  rcu_read_lock/unlock versus moving to dev_get_by_index with dev_hold/put

 include/linux/netdevice.h | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index f7a6ef2fae3a..2d3cd86c5618 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3819,12 +3819,22 @@ static inline bool netif_is_vrf(const struct net_device *dev)
 
 static inline bool netif_index_is_vrf(struct net *net, int ifindex)
 {
-	struct net_device *dev = dev_get_by_index_rcu(net, ifindex);
 	bool rc = false;
 
+#if IS_ENABLED(CONFIG_NET_VRF)
+	struct net_device *dev;
+
+	if (ifindex == 0)
+		return false;
+
+	rcu_read_lock();
+
+	dev = dev_get_by_index_rcu(net, ifindex);
 	if (dev)
 		rc = netif_is_vrf(dev);
 
+	rcu_read_unlock();
+#endif
 	return rc;
 }
 
-- 
2.3.2 (Apple Git-55)
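For readers outside the kernel tree, the locking shape of the v2 change can be sketched in plain userspace C. This is only a model under stated assumptions: rcu_read_lock()/rcu_read_unlock() are replaced by a hypothetical nesting counter, the device table is a made-up two-entry array, the net argument is dropped, and the flag value 1 << 25 is assumed for illustration. None of this is the kernel implementation.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for kernel primitives. */
#define IFF_VRF_MASTER (1u << 25)	/* assumed value, illustration only */

struct net_device {
	int ifindex;
	unsigned int priv_flags;
};

int rcu_depth;	/* models read-side critical section nesting */

static void rcu_read_lock(void)   { rcu_depth++; }
static void rcu_read_unlock(void) { rcu_depth--; }

/* Toy device table: ifindex 1 is a plain device, ifindex 2 a VRF master. */
static struct net_device devices[] = {
	{ .ifindex = 1, .priv_flags = 0 },
	{ .ifindex = 2, .priv_flags = IFF_VRF_MASTER },
};

/* Only meaningful inside a read-side section; in this model a caller
 * that forgot the lock simply gets NULL back.
 */
static struct net_device *dev_get_by_index_rcu(int ifindex)
{
	size_t i;

	if (rcu_depth <= 0)
		return NULL;

	for (i = 0; i < sizeof(devices) / sizeof(devices[0]); i++)
		if (devices[i].ifindex == ifindex)
			return &devices[i];
	return NULL;
}

static bool netif_is_vrf(const struct net_device *dev)
{
	return (dev->priv_flags & IFF_VRF_MASTER) != 0;
}

/* Same shape as the v2 patch: early exit on index 0, lookup under lock. */
bool netif_index_is_vrf(int ifindex)
{
	struct net_device *dev;
	bool rc = false;

	if (ifindex == 0)
		return false;

	rcu_read_lock();
	dev = dev_get_by_index_rcu(ifindex);
	if (dev)
		rc = netif_is_vrf(dev);
	rcu_read_unlock();

	return rc;
}
```

The point of the shape is that the device pointer never escapes the lock/unlock pair, so no reference count has to be taken and dropped.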
[PATCH] net: Fix docbook warning for IFF_VRF_MASTER enum
kbuild test robot reported:

tree:   git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git master
head:   d52736e24fe2e927c26817256f8d1a3c8b5d51a0
commit: 4e3c89920cd3a6cfce22c6f537690747c26128dd [751/762] net: Introduce VRF related flags and helpers
reproduce: make htmldocs

Warning(include/linux/netdevice.h:1293): Enum value 'IFF_VRF_MASTER' not described in enum 'netdev_priv_flags'

Signed-off-by: David Ahern <d...@cumulusnetworks.com>

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 2d3cd86c5618..aa8b79dd08d8 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1262,6 +1262,7 @@ struct net_device_ops {
  * @IFF_LIVE_ADDR_CHANGE: device supports hardware address
  *	change when it's running
  * @IFF_MACVLAN: Macvlan device
+ * @IFF_VRF_MASTER: device is a VRF master
  */
 enum netdev_priv_flags {
 	IFF_802_1Q_VLAN			= 1<<0,
-- 
2.3.2 (Apple Git-55)
Re: [net-next]: unable to add routes to tables
On 8/18/15 10:57 AM, Andreas Schultz wrote:
> Hi,
>
> It seems that the policy for adding routes to tables has changed
> between Linux 4.2-rc6 and net-next.
>
> In Linux main line (tested up to 4.2-rc6), with this main routing table:
>
> # ip route show table main
> ...
> 172.28.0.0/24 dev vnf-xe1p0 proto kernel scope link src 172.28.0.16
>
> and an empty table 100, this works:
>
> # ip route add 10.0.0.0/8 via 172.28.0.32 table 100 dev vnf-xe1p0
>
> With net-next at commit d52736e24fe2e927c26817256f8d1a3c8b5d51a0, the
> same command leads to an:
>
> # ip route add 10.0.0.0/8 via 172.28.0.32 table 100 dev vnf-xe1p0
> RTNETLINK answers: Resource temporarily unavailable
>
> Is this expected behavior?

That's going to be due to 3bfd847203c6d89532f836ad3f5b4ff4ced26dd9. I'll fix.

David
Re: [net-next]: unable to add routes to tables
On 8/18/15 10:57 AM, Andreas Schultz wrote:
> Hi,
>
> It seems that the policy for adding routes to tables has changed
> between Linux 4.2-rc6 and net-next.
>
> In Linux main line (tested up to 4.2-rc6), with this main routing table:
>
> # ip route show table main
> ...
> 172.28.0.0/24 dev vnf-xe1p0 proto kernel scope link src 172.28.0.16
>
> and an empty table 100, this works:
>
> # ip route add 10.0.0.0/8 via 172.28.0.32 table 100 dev vnf-xe1p0
>
> With net-next at commit d52736e24fe2e927c26817256f8d1a3c8b5d51a0, the
> same command leads to an:
>
> # ip route add 10.0.0.0/8 via 172.28.0.32 table 100 dev vnf-xe1p0
> RTNETLINK answers: Resource temporarily unavailable
>
> Is this expected behavior?

The attached works for me and so does my original problem. Can you
confirm it resolves your problem? If so I'll send a formal patch.

David

diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index c8025851dac7..01a237278dd2 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -710,9 +710,16 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
 			err = fib_table_lookup(tbl, &fl4, &res,
 					       FIB_LOOKUP_IGNORE_LINKSTATE |
 					       FIB_LOOKUP_NOREF);
-		else
+
+		/* on error or if no table given do full lookup. This is
+		 * needed for example when nexthops are in the local table
+		 * rather than the given table
+		 */
+		if (!tbl || err) {
 			err = fib_lookup(net, &fl4, &res,
 					 FIB_LOOKUP_IGNORE_LINKSTATE);
+		}
+
 		if (err) {
 			rcu_read_unlock();
 			return err;
Re: [PATCH net-next] vrf: plug skb leaks
Hi Nikolay:

On 8/18/15 8:12 PM, Nikolay Aleksandrov wrote:
> diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
> index ed208317cbb5..4aa06450fafa 100644
> --- a/drivers/net/vrf.c
> +++ b/drivers/net/vrf.c
> @@ -97,6 +97,12 @@ static bool is_ip_rx_frame(struct sk_buff *skb)
>  	return false;
>  }
>
> +static void vrf_tx_error(struct net_device *vrf_dev, struct sk_buff *skb)
> +{
> +	vrf_dev->stats.tx_errors++;
> +	kfree_skb(skb);
> +}
> +
>  /* note: already called with rcu_read_lock */
>  static rx_handler_result_t vrf_handle_frame(struct sk_buff **pskb)
>  {
> @@ -149,7 +155,8 @@ static struct rtnl_link_stats64 *vrf_get_stats64(struct net_device *dev,
>  static netdev_tx_t vrf_process_v6_outbound(struct sk_buff *skb,
>  					   struct net_device *dev)
>  {
> -	return 0;
> +	vrf_tx_error(dev, skb);
> +	return NET_XMIT_DROP;
>  }
>
>  static int vrf_send_v4_prep(struct sk_buff *skb, struct flowi4 *fl4,
> @@ -206,8 +213,7 @@ static netdev_tx_t vrf_process_v4_outbound(struct sk_buff *skb,
>  out:
>  	return ret;
>  err:
> -	vrf_dev->stats.tx_errors++;
> -	kfree_skb(skb);
> +	vrf_tx_error(vrf_dev, skb);
>  	goto out;
>  }
>
> @@ -219,6 +225,7 @@ static netdev_tx_t is_ip_tx_frame(struct sk_buff *skb, struct net_device *dev)
>  	case htons(ETH_P_IPV6):
>  		return vrf_process_v6_outbound(skb, dev);
>  	default:
> +		vrf_tx_error(dev, skb);
>  		return NET_XMIT_DROP;
>  	}
>  }

Would be simpler to do the vrf_tx_error at the end of is_ip_tx_frame()
if ret == NET_XMIT_DROP.

David
Re: [PATCH net-next 0/4] vrf: cleanups part 2
On 8/18/15 8:27 PM, Nikolay Aleksandrov wrote:
> From: Nikolay Aleksandrov <niko...@cumulusnetworks.com>
>
> Hi,
> This is the next part of vrf cleanups, patch 1 drops the SLAB_PANIC when
> creating kmem cache since it's handled, patch 02 removes a slave duplicate
> check which is already done by the lower/upper code, patch 3 moves the
> ndo_add_slave code around a bit so we can drop an error label and patch 4
> drops the master device checks which are unnecessary because the ops are
> taken from the master device itself so it can't be different.
>
> Cheers,
>  Nik
>
> Nikolay Aleksandrov (4):
>   vrf: don't panic on cache create failure
>   vrf: remove unnecessary duplicate check
>   vrf: move vrf_insert_slave so we can drop a goto label
>   vrf: ndo_add|del_slave drop unnecessary checks
>
>  drivers/net/vrf.c | 24 ++++--------------------
>  1 file changed, 4 insertions(+), 20 deletions(-)

Looks good to me. Thanks, Nikolay.

Acked-by: David Ahern <d...@cumulusnetworks.com>
Re: [PATCH net-next 0/4] vrf: a few simplifications and cleanups
On 8/18/15 11:28 AM, Nikolay Aleksandrov wrote:
> From: Nikolay Aleksandrov <niko...@cumulusnetworks.com>
>
> Hi,
> These patches remove some unnecessary checks (patches 3, 4), unnecessary
> num_slaves member and refcnt manipulations which are already done by the
> upper functions.
>
> Cheers,
>  Nik
>
> Nikolay Aleksandrov (4):
>   vrf: drop unnecessary dev refcnt changes
>   vrf: drop unused num_slaves member
>   vrf: don't check for dstats and rth in uninit path
>   vrf: simplify the netdev notifier function
>
>  drivers/net/vrf.c | 15 ++++-----------
>  include/net/vrf.h |  1 -
>  2 files changed, 4 insertions(+), 12 deletions(-)

Looks good to me. Thanks, Nikolay.

Acked-by: David Ahern <d...@cumulusnetworks.com>
[PATCH] net: Fix nexthop lookups
Andreas reported breakage adding routes with local nexthops:

$ ip route show table main
...
172.28.0.0/24 dev vnf-xe1p0 proto kernel scope link src 172.28.0.16

$ ip route add 10.0.0.0/8 via 172.28.0.32 table 100 dev vnf-xe1p0
RTNETLINK answers: Resource temporarily unavailable

3bfd847203c changed the lookup to use the passed in table but for cases like
this the nexthop is in the local table rather than the passed in table.

Fixes: 3bfd847203c ("net: Use passed in table for nexthop lookups")
Reported-by: Andreas Schultz <aschu...@tpip.net>
Signed-off-by: David Ahern <d...@cumulusnetworks.com>
---
 net/ipv4/fib_semantics.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index c8025851dac7..0ab5bf558805 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -710,9 +710,16 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
 			err = fib_table_lookup(tbl, &fl4, &res,
 					       FIB_LOOKUP_IGNORE_LINKSTATE |
 					       FIB_LOOKUP_NOREF);
-		else
+
+		/* on error or if no table given do full lookup. This
+		 * is needed for example when nexthops are in the local
+		 * table rather than the given table
+		 */
+		if (!tbl || err) {
 			err = fib_lookup(net, &fl4, &res,
 					 FIB_LOOKUP_IGNORE_LINKSTATE);
+		}
+
 		if (err) {
 			rcu_read_unlock();
 			return err;
-- 
2.3.2 (Apple Git-55)
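The control-flow change in this fix (try the passed-in table first, then fall back to a full lookup on error or when no table was given) can be modelled in a few lines of userspace C. Everything below is a hypothetical toy: a table is just a list of gateway addresses matched exactly, full_lookup() only consults one table, and -EAGAIN is chosen because it corresponds to the "Resource temporarily unavailable" message in the report. The real fib_table_lookup() and fib_lookup() do longest-prefix matching and take flow and result arguments.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical toy model: a table is a list of resolvable gateways. */
struct fib_table {
	const unsigned int *gateways;
	size_t count;
};

/* 172.28.0.32, the gateway from the report */
#define GW_EXAMPLE 0xAC1C0020u

static const unsigned int main_gateways[] = { GW_EXAMPLE };
const struct fib_table table_main = { main_gateways, 1 };
const struct fib_table table_100  = { NULL, 0 };	/* empty, as reported */

static int table_lookup(const struct fib_table *tbl, unsigned int gw)
{
	size_t i;

	for (i = 0; i < tbl->count; i++)
		if (tbl->gateways[i] == gw)
			return 0;
	return -EAGAIN;
}

/* Stand-in for the full lookup; here it only consults table_main. */
static int full_lookup(unsigned int gw)
{
	return table_lookup(&table_main, gw);
}

/* Behaviour described in the report: only the given table is consulted. */
int check_nh_table_only(const struct fib_table *tbl, unsigned int gw)
{
	return tbl ? table_lookup(tbl, gw) : full_lookup(gw);
}

/* Behaviour after the fix: fall back to the full lookup on error. */
int check_nh_with_fallback(const struct fib_table *tbl, unsigned int gw)
{
	int err = 0;

	if (tbl)
		err = table_lookup(tbl, gw);
	if (!tbl || err)
		err = full_lookup(gw);
	return err;
}
```

In this model, adding a route to the empty table_100 via GW_EXAMPLE fails with check_nh_table_only() and succeeds with check_nh_with_fallback(), which mirrors the reported symptom and its fix.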
Re: [PATCH net-next] vrf: vrf_master_ifindex_rcu is not always called with rcu read lock
On 8/18/15 10:17 AM, Nikolay Aleksandrov wrote:
> diff --git a/include/net/vrf.h b/include/net/vrf.h
> index 40e3793c7a05..22dfe2195092 100644
> --- a/include/net/vrf.h
> +++ b/include/net/vrf.h
> @@ -35,7 +35,6 @@ struct net_vrf {
>
>  #if IS_ENABLED(CONFIG_NET_VRF)
> -/* called with rcu_read_lock() */
>  static inline int vrf_master_ifindex_rcu(const struct net_device *dev)
>  {
>  	struct net_vrf_dev *vrf_ptr;
> @@ -44,12 +43,14 @@ static inline int vrf_master_ifindex_rcu(const struct net_device *dev)
>  	if (!dev)
>  		return 0;
>
> -	if (netif_is_vrf(dev))
> +	if (netif_is_vrf(dev)) {
>  		ifindex = dev->ifindex;
> -	else {
> +	} else {
> +		rcu_read_lock();
>  		vrf_ptr = rcu_dereference(dev->vrf_ptr);
>  		if (vrf_ptr)
>  			ifindex = vrf_ptr->ifindex;
> +		rcu_read_unlock();
>  	}
>
>  	return ifindex;

The intent of the _rcu in the name is to mean it is called with
rcu_read_lock held which is the case for __fib_validate_source and
ip_route_input_slow. It looks like the icmp callers (icmp_reply and
icmp_route_lookup) are the exceptions. For those create a

static inline int vrf_master_ifindex(const struct net_device *dev)
{
}

that does the rcu lock/unlock and calls vrf_master_ifindex_rcu in between.
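A rough userspace sketch of the suggested split follows: an _rcu variant that assumes the caller already holds the read lock, plus a thin wrapper for callers that do not. The lock is modelled by a hypothetical counter, the is_vrf_master field replaces the priv_flags test, the sample devices are invented, and the -1 returned on a missing lock exists only in this model to make the contract visible. The kernel helper has no such check.

```c
#include <stddef.h>

/* Hypothetical stand-ins; the kernel types carry far more state. */
int rcu_depth;

static void rcu_read_lock(void)   { rcu_depth++; }
static void rcu_read_unlock(void) { rcu_depth--; }

struct net_vrf_dev {
	int ifindex;			/* index of the VRF master */
};

struct net_device {
	int ifindex;
	int is_vrf_master;		/* replaces the priv_flags test */
	struct net_vrf_dev *vrf_ptr;	/* set when enslaved to a VRF */
};

/* Invented sample devices: vrf10 is a master, eth1 is enslaved to it,
 * eth2 is not part of any VRF.
 */
static struct net_vrf_dev vrf10_info = { .ifindex = 10 };
struct net_device vrf10 = { .ifindex = 10, .is_vrf_master = 1 };
struct net_device eth1  = { .ifindex = 3, .vrf_ptr = &vrf10_info };
struct net_device eth2  = { .ifindex = 4 };

/* Caller must hold the read lock. Returning -1 when it does not is a
 * feature of this model only.
 */
int vrf_master_ifindex_rcu(const struct net_device *dev)
{
	if (!dev)
		return 0;
	if (dev->is_vrf_master)
		return dev->ifindex;
	if (rcu_depth <= 0)
		return -1;
	return dev->vrf_ptr ? dev->vrf_ptr->ifindex : 0;
}

/* For callers that are not already inside a read-side section. */
int vrf_master_ifindex(const struct net_device *dev)
{
	int ifindex;

	rcu_read_lock();
	ifindex = vrf_master_ifindex_rcu(dev);
	rcu_read_unlock();

	return ifindex;
}
```

Keeping both variants lets hot paths that already hold the lock skip the extra lock/unlock pair, while other callers get a safe entry point.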
Re: [PATCH net-next] vrf: plug skb leaks
On 8/19/15 1:17 PM, Nikolay Aleksandrov wrote: On Aug 19, 2015, at 8:27 PM, Nikolay Aleksandrov niko...@cumulusnetworks.com wrote: On Aug 19, 2015 20:13, David Ahern d...@cumulusnetworks.com wrote: Hi Nikolay: On 8/18/15 8:12 PM, Nikolay Aleksandrov wrote: diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c index ed208317cbb5..4aa06450fafa 100644 --- a/drivers/net/vrf.c +++ b/drivers/net/vrf.c @@ -97,6 +97,12 @@ static bool is_ip_rx_frame(struct sk_buff *skb) return false; } +static void vrf_tx_error(struct net_device *vrf_dev, struct sk_buff *skb) +{ + vrf_dev-stats.tx_errors++; + kfree_skb(skb); +} + /* note: already called with rcu_read_lock */ static rx_handler_result_t vrf_handle_frame(struct sk_buff **pskb) { @@ -149,7 +155,8 @@ static struct rtnl_link_stats64 *vrf_get_stats64(struct net_device *dev, static netdev_tx_t vrf_process_v6_outbound(struct sk_buff *skb, struct net_device *dev) { - return 0; + vrf_tx_error(dev, skb); + return NET_XMIT_DROP; } static int vrf_send_v4_prep(struct sk_buff *skb, struct flowi4 *fl4, @@ -206,8 +213,7 @@ static netdev_tx_t vrf_process_v4_outbound(struct sk_buff *skb, out: return ret; err: - vrf_dev-stats.tx_errors++; - kfree_skb(skb); + vrf_tx_error(vrf_dev, skb); goto out; } @@ -219,6 +225,7 @@ static netdev_tx_t is_ip_tx_frame(struct sk_buff *skb, struct net_device *dev) case htons(ETH_P_IPV6): return vrf_process_v6_outbound(skb, dev); default: + vrf_tx_error(dev, skb); return NET_XMIT_DROP; } } Would be simpler to do the vrf_tx_error at the end of is_ip_tx_frame() if ret == NET_XMIT_DROP. David Sure, that will work too. Actually no, this will not work because ip_local_out() can return NET_XMIT_DROP and the packet can already be dropped. I’d prefer to keep these cases separate and explicit as they are in my patch. ok. Then the patch looks good to me. 
Acked-by: David Ahern d...@cumulusnetworks.com -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
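The objection raised in this thread, that a lower layer such as ip_local_out() can return NET_XMIT_DROP after it has already released the packet, can be illustrated with a small ownership model. This is a hypothetical sketch: the buffer counts frees instead of releasing memory, lower_layer_xmit() and its congested flag are invented stand-ins, and xmit_explicit()/xmit_catch_all() are illustrative names rather than functions from the driver.

```c
#define NET_XMIT_SUCCESS 0x00
#define NET_XMIT_DROP    0x01

/* Hypothetical buffer: counts frees instead of releasing memory, so a
 * double free shows up as freed == 2 rather than as a crash.
 */
struct sk_buff {
	int freed;
};

static void kfree_skb(struct sk_buff *skb)
{
	skb->freed++;
}

/* Stand-in for a lower layer: when "congested" it frees the buffer
 * itself and still reports NET_XMIT_DROP to the caller.
 */
static int lower_layer_xmit(struct sk_buff *skb, int congested)
{
	if (congested) {
		kfree_skb(skb);
		return NET_XMIT_DROP;
	}
	return NET_XMIT_SUCCESS;
}

/* Explicit style: free only at the sites that still own the buffer. */
int xmit_explicit(struct sk_buff *skb, int known_proto, int congested)
{
	if (!known_proto) {
		kfree_skb(skb);
		return NET_XMIT_DROP;
	}
	return lower_layer_xmit(skb, congested);
}

/* Catch-all style: free on any NET_XMIT_DROP. This frees a second time
 * when the lower layer has already consumed the buffer.
 */
int xmit_catch_all(struct sk_buff *skb, int known_proto, int congested)
{
	int ret = NET_XMIT_DROP;

	if (known_proto)
		ret = lower_layer_xmit(skb, congested);
	if (ret == NET_XMIT_DROP)
		kfree_skb(skb);
	return ret;
}
```

In this model a congested send through xmit_catch_all() leaves freed at 2, which is the double free the thread wants to avoid, while xmit_explicit() frees exactly once on every drop path.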
[PATCH net-next] vrf: Add ethernet header for pass through VRF device
The change to use a custom dst broke tcpdump captures on the VRF device: $ tcpdump -n -i vrf10 ... 05:32:29.009362 IP 10.2.1.254 10.2.1.2: ICMP echo request, id 21989, seq 1, length 64 05:32:29.009855 00:00:40:01:8d:36 45:00:00:54:d6:6f, ethertype Unknown (0x0a02), length 84: 0x: 0102 0a02 01fe 9181 55e5 0001 bd11 ..U. 0x0010: da55 bb5d 0700 1011 .U.] 0x0020: 1213 1415 1617 1819 1a1b 1c1d 1e1f 2021 ...! 0x0030: 2223 2425 2627 2829 2a2b 2c2d 2e2f 3031 #$%'()*+,-./01 0x0040: 3233 3435 3637 234567 Local packets going through the VRF device are missing an ethernet header. Fix by adding one and then stripping it off before pushing back to the IP stack. With this patch you get the expected dumps: ... 05:36:15.713944 IP 10.2.1.254 10.2.1.2: ICMP echo request, id 23795, seq 1, length 64 05:36:15.714160 IP 10.2.1.2 10.2.1.254: ICMP echo reply, id 23795, seq 1, length 64 ... Signed-off-by: David Ahern d...@cumulusnetworks.com --- drivers/net/vrf.c | 14 ++ 1 file changed, 14 insertions(+) diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c index dbeffe789185..e5c792e4c224 100644 --- a/drivers/net/vrf.c +++ b/drivers/net/vrf.c @@ -219,6 +219,9 @@ static netdev_tx_t vrf_process_v4_outbound(struct sk_buff *skb, static netdev_tx_t is_ip_tx_frame(struct sk_buff *skb, struct net_device *dev) { + /* strip the ethernet header added for pass through VRF device */ + __skb_pull(skb, skb_network_offset(skb)); + switch (skb-protocol) { case htons(ETH_P_IP): return vrf_process_v4_outbound(skb, dev); @@ -250,6 +253,17 @@ static netdev_tx_t vrf_xmit(struct sk_buff *skb, struct net_device *dev) static netdev_tx_t vrf_finish(struct sock *sk, struct sk_buff *skb) { + int err; + + __skb_pull(skb, skb_network_offset(skb)); + err = dev_hard_header(skb, skb-dev, ntohs(skb-protocol), + NULL, NULL, skb-len); + + if (err 0) { + vrf_tx_error(skb-dev, skb); + return -EINVAL; + } + return dev_queue_xmit(skb); } -- 2.3.2 (Apple Git-55) -- To unsubscribe from this list: send the line unsubscribe netdev in 
the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next] inetpeer: remove dead code
Remove various inlined functions not referenced in the kernel. Signed-off-by: David Ahern d...@cumulusnetworks.com --- include/net/inetpeer.h | 67 -- 1 file changed, 67 deletions(-) diff --git a/include/net/inetpeer.h b/include/net/inetpeer.h index d5332ddcea3f..002f0bd27001 100644 --- a/include/net/inetpeer.h +++ b/include/net/inetpeer.h @@ -65,71 +65,12 @@ struct inet_peer_base { int total; }; -#define INETPEER_BASE_BIT 0x1UL - -static inline struct inet_peer *inetpeer_ptr(unsigned long val) -{ - BUG_ON(val INETPEER_BASE_BIT); - return (struct inet_peer *) val; -} - -static inline struct inet_peer_base *inetpeer_base_ptr(unsigned long val) -{ - if (!(val INETPEER_BASE_BIT)) - return NULL; - val = ~INETPEER_BASE_BIT; - return (struct inet_peer_base *) val; -} - -static inline bool inetpeer_ptr_is_peer(unsigned long val) -{ - return !(val INETPEER_BASE_BIT); -} - -static inline void __inetpeer_ptr_set_peer(unsigned long *val, struct inet_peer *peer) -{ - /* This implicitly clears INETPEER_BASE_BIT */ - *val = (unsigned long) peer; -} - -static inline bool inetpeer_ptr_set_peer(unsigned long *ptr, struct inet_peer *peer) -{ - unsigned long val = (unsigned long) peer; - unsigned long orig = *ptr; - - if (!(orig INETPEER_BASE_BIT) || - cmpxchg(ptr, orig, val) != orig) - return false; - return true; -} - -static inline void inetpeer_init_ptr(unsigned long *ptr, struct inet_peer_base *base) -{ - *ptr = (unsigned long) base | INETPEER_BASE_BIT; -} - -static inline void inetpeer_transfer_peer(unsigned long *to, unsigned long *from) -{ - unsigned long val = *from; - - *to = val; - if (inetpeer_ptr_is_peer(val)) { - struct inet_peer *peer = inetpeer_ptr(val); - atomic_inc(peer-refcnt); - } -} - void inet_peer_base_init(struct inet_peer_base *); void inet_initpeers(void) __init; #define INETPEER_METRICS_NEW (~(u32) 0) -static inline bool inet_metrics_new(const struct inet_peer *p) -{ - return p-metrics[RTAX_LOCK-1] == INETPEER_METRICS_NEW; -} - /* can be called with or 
without local BH being disabled */ struct inet_peer *inet_getpeer(struct inet_peer_base *base, const struct inetpeer_addr *daddr, @@ -163,12 +104,4 @@ bool inet_peer_xrlim_allow(struct inet_peer *peer, int timeout); void inetpeer_invalidate_tree(struct inet_peer_base *); -/* - * temporary check to make sure we dont access rid, tcp_ts, - * tcp_ts_stamp if no refcount is taken on inet_peer - */ -static inline void inet_peer_refcheck(const struct inet_peer *p) -{ - WARN_ON_ONCE(atomic_read(p-refcnt) = 0); -} #endif /* _NET_INETPEER_H */ -- 2.3.2 (Apple Git-55) -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next] inetpeer: Add support for VRFs
inetpeer caches based on address only, so duplicate IP addresses within a namespace return the same cached entry. Similar to IP fragments handle duplicate addresses across VRFs by adding the VRF master device index to the lookup. Signed-off-by: David Ahern d...@cumulusnetworks.com --- include/net/inetpeer.h | 11 ++- net/ipv4/icmp.c| 3 ++- net/ipv4/inetpeer.c| 5 + net/ipv4/ip_fragment.c | 3 ++- net/ipv4/route.c | 7 +-- 5 files changed, 24 insertions(+), 5 deletions(-) diff --git a/include/net/inetpeer.h b/include/net/inetpeer.h index 002f0bd27001..a75b648b8545 100644 --- a/include/net/inetpeer.h +++ b/include/net/inetpeer.h @@ -26,6 +26,9 @@ struct inetpeer_addr_base { struct inetpeer_addr { struct inetpeer_addr_base addr; __u16 family; +#if IS_ENABLED(CONFIG_NET_VRF) + int vif; +#endif }; struct inet_peer { @@ -78,12 +81,15 @@ struct inet_peer *inet_getpeer(struct inet_peer_base *base, static inline struct inet_peer *inet_getpeer_v4(struct inet_peer_base *base, __be32 v4daddr, - int create) + int vif, int create) { struct inetpeer_addr daddr; daddr.addr.a4 = v4daddr; daddr.family = AF_INET; +#if IS_ENABLED(CONFIG_NET_VRF) + daddr.vif = vif; +#endif return inet_getpeer(base, daddr, create); } @@ -95,6 +101,9 @@ static inline struct inet_peer *inet_getpeer_v6(struct inet_peer_base *base, daddr.addr.in6 = *v6daddr; daddr.family = AF_INET6; +#if IS_ENABLED(CONFIG_NET_VRF) + daddr.vif = 0; /* placeholder until VRF suppoort is added to IPv6 */ +#endif return inet_getpeer(base, daddr, create); } diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c index f16488efa1c8..79fe05befcae 100644 --- a/net/ipv4/icmp.c +++ b/net/ipv4/icmp.c @@ -309,9 +309,10 @@ static bool icmpv4_xrlim_allow(struct net *net, struct rtable *rt, rc = false; if (icmp_global_allow()) { + int vif = vrf_master_ifindex(dst-dev); struct inet_peer *peer; - peer = inet_getpeer_v4(net-ipv4.peers, fl4-daddr, 1); + peer = inet_getpeer_v4(net-ipv4.peers, fl4-daddr, vif, 1); rc = inet_peer_xrlim_allow(peer, 
net-ipv4.sysctl_icmp_ratelimit); if (peer) diff --git a/net/ipv4/inetpeer.c b/net/ipv4/inetpeer.c index 241afd743d2c..b5f268a3ea6b 100644 --- a/net/ipv4/inetpeer.c +++ b/net/ipv4/inetpeer.c @@ -170,6 +170,11 @@ static int addr_compare(const struct inetpeer_addr *a, return 1; } +#if IS_ENABLED(CONFIG_NET_VRF) + if (a-vif != b-vif) + return a-vif b-vif ? -1 : 1; +#endif + return 0; } diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c index 15762e758861..fa7f15305f9a 100644 --- a/net/ipv4/ip_fragment.c +++ b/net/ipv4/ip_fragment.c @@ -151,7 +151,8 @@ static void ip4_frag_init(struct inet_frag_queue *q, const void *a) qp-vif = arg-vif; qp-user = arg-user; qp-peer = sysctl_ipfrag_max_dist ? - inet_getpeer_v4(net-ipv4.peers, arg-iph-saddr, 1) : NULL; + inet_getpeer_v4(net-ipv4.peers, arg-iph-saddr, arg-vif, 1) : + NULL; } static void ip4_frag_free(struct inet_frag_queue *q) diff --git a/net/ipv4/route.c b/net/ipv4/route.c index 2403e85107f0..6805d57152b9 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -838,6 +838,7 @@ void ip_rt_send_redirect(struct sk_buff *skb) struct inet_peer *peer; struct net *net; int log_martians; + int vif; rcu_read_lock(); in_dev = __in_dev_get_rcu(rt-dst.dev); @@ -846,10 +847,11 @@ void ip_rt_send_redirect(struct sk_buff *skb) return; } log_martians = IN_DEV_LOG_MARTIANS(in_dev); + vif = vrf_master_ifindex_rcu(rt-dst.dev); rcu_read_unlock(); net = dev_net(rt-dst.dev); - peer = inet_getpeer_v4(net-ipv4.peers, ip_hdr(skb)-saddr, 1); + peer = inet_getpeer_v4(net-ipv4.peers, ip_hdr(skb)-saddr, vif, 1); if (!peer) { icmp_send(skb, ICMP_REDIRECT, ICMP_REDIR_HOST, rt_nexthop(rt, ip_hdr(skb)-daddr)); @@ -938,7 +940,8 @@ static int ip_error(struct sk_buff *skb) break; } - peer = inet_getpeer_v4(net-ipv4.peers, ip_hdr(skb)-saddr, 1); + peer = inet_getpeer_v4(net-ipv4.peers, ip_hdr(skb)-saddr, + vrf_master_ifindex(skb-dev), 1); send = true; if (peer) { -- 2.3.2 (Apple Git-55) -- To unsubscribe from this list: send the line unsubscribe 
netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
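The effect of the addr_compare() change above can be illustrated outside the kernel. What follows is a minimal userspace sketch, not the kernel code: `struct peer_key` and `peer_key_compare()` are hypothetical names standing in for `struct inetpeer_addr` and `addr_compare()`, and only IPv4 is modeled. It shows how using the VRF master ifindex as a tie-breaker keeps two identical addresses in different VRFs from comparing equal.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct inetpeer_addr (IPv4 only). */
struct peer_key {
	uint32_t addr;	/* IPv4 address, host byte order for simplicity */
	int vif;	/* VRF master ifindex, 0 when the device is not enslaved */
};

/* Three-way compare: address first, then the VRF index as tie-breaker. */
static int peer_key_compare(const struct peer_key *a, const struct peer_key *b)
{
	assert(a && b);

	if (a->addr != b->addr)
		return a->addr < b->addr ? -1 : 1;
	if (a->vif != b->vif)
		return a->vif < b->vif ? -1 : 1;
	return 0;
}
```

With this ordering, two keys carrying the same address but different `vif` values land on different tree nodes, which is the behavior the patch description asks for.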
Re: [PATCH] net af_key: Fix RCU splat
On 8/20/15 9:51 AM, Eric Dumazet wrote:
> On Thu, 2015-08-20 at 08:51 -0700, David Ahern wrote:
>> Hit the following splat testing VRF change for ipsec:
>>
>> [ 113.475692] ===
>> [ 113.476194] [ INFO: suspicious RCU usage. ]
>> [ 113.476667] 4.2.0-rc6-1+deb7u2+clUNRELEASED #3.2.65-1+deb7u2+clUNRELEASED Not tainted
>> [ 113.477545] ---
>> [ 113.478013] /work/monster-14/dsa/kernel.git/include/linux/rcupdate.h:568 Illegal context switch in RCU read-side critical section!
>> [ 113.479288]
>> [ 113.479288] other info that might help us debug this:
>> [ 113.479288]
>> [ 113.480207]
>> [ 113.480207] rcu_scheduler_active = 1, debug_locks = 1
>> [ 113.480931] 2 locks held by setkey/6829:
>> [ 113.481371] #0: (&net->xfrm.xfrm_cfg_mutex){+.+.+.}, at: [814e9887] pfkey_sendmsg+0xfb/0x213
>> [ 113.482509] #1: (rcu_read_lock){..}, at: [814e767f] rcu_read_lock+0x0/0x6e
>> [ 113.483509]
>> [ 113.483509] stack backtrace:
>> [ 113.484041] CPU: 0 PID: 6829 Comm: setkey Not tainted 4.2.0-rc6-1+deb7u2+clUNRELEASED #3.2.65-1+deb7u2+clUNRELEASED
>> [ 113.485422] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5.1-0-g8936dbb-20141113_115728-nilsson.home.kraxel.org 04/01/2014
>> [ 113.486845] 0001 88001d4c7a98 81518af2 81086962
>> [ 113.487732] 88001d538480 88001d4c7ac8 8107ae75 8180a154
>> [ 113.488628] 0b30 00d0 88001d4c7ad8
>> [ 113.489525] Call Trace:
>> [ 113.489813] [81518af2] dump_stack+0x4c/0x65
>> [ 113.490389] [81086962] ? console_unlock+0x3d6/0x405
>> [ 113.491039] [8107ae75] lockdep_rcu_suspicious+0xfa/0x103
>> [ 113.491735] [81064032] rcu_preempt_sleep_check+0x45/0x47
>> [ 113.492442] [8106404d] ___might_sleep+0x19/0x1c8
>> [ 113.493077] [81064268] __might_sleep+0x6c/0x82
>> [ 113.493681] [81133190] cache_alloc_debugcheck_before.isra.50+0x1d/0x24
>> [ 113.494508] [81134876] kmem_cache_alloc+0x31/0x18f
>> [ 113.495149] [814012b5] skb_clone+0x64/0x80
>> [ 113.495712] [814e6f71] pfkey_broadcast_one+0x3d/0xff
>> [ 113.496380] [814e7b84] pfkey_broadcast+0xb5/0x11e
>> [ 113.497024] [814e82d1] pfkey_register+0x191/0x1b1
>> [ 113.497653] [814e9770] pfkey_process+0x162/0x17e
>> [ 113.498274] [814e9895] pfkey_sendmsg+0x109/0x213
>>
>> In pfkey_sendmsg the net mutex is taken and then pfkey_broadcast takes the RCU lock.
>>
>> Fix by using GFP_ATOMIC for the allocation flag.
>>
>> Signed-off-by: David Ahern d...@cumulusnetworks.com
>> ---
>>  net/key/af_key.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/net/key/af_key.c b/net/key/af_key.c
>> index b397f0aa9005..73527e7dd247 100644
>> --- a/net/key/af_key.c
>> +++ b/net/key/af_key.c
>> @@ -1670,7 +1670,7 @@ static int pfkey_register(struct sock *sk, struct sk_buff *skb, const struct sad
>>  		return -ENOBUFS;
>>  	}
>>
>> -	pfkey_broadcast(supp_skb, GFP_KERNEL, BROADCAST_REGISTERED, sk, sock_net(sk));
>> +	pfkey_broadcast(supp_skb, GFP_ATOMIC, BROADCAST_REGISTERED, sk, sock_net(sk));
>>
>>  	return 0;
>>  }
>
> I would rather remove the useless rcu locking from pfkey_broadcast() if a mutex properly protects the thing.

rcu_read_lock was added by Stephen with 7f6b9dbd5afbd. It does not appear the net->xfrm.xfrm_cfg_mutex mutex added by 283bc9f35bbbc properly covers the locking. ie., the rcu_read_lock is needed.

David
Re: [PATCH net-next v2] vrf: vrf_master_ifindex_rcu is not always called with rcu read lock
On 8/18/15 12:15 PM, Nikolay Aleksandrov wrote:
> diff --git a/include/net/vrf.h b/include/net/vrf.h
> index 3bb4af462ed6..b039850a94a3 100644
> --- a/include/net/vrf.h
> +++ b/include/net/vrf.h
> @@ -34,7 +34,6 @@ struct net_vrf {
>  #if IS_ENABLED(CONFIG_NET_VRF)
> -/* called with rcu_read_lock() */
>  static inline int vrf_master_ifindex_rcu(const struct net_device *dev)
>  {
>  	struct net_vrf_dev *vrf_ptr;

That comment is true for this version.

David
[PATCH] net: FIB tracepoints
Signed-off-by: David Ahern d...@cumulusnetworks.com --- I realize the sensitivity around adding tracepoints, but these have been invaluable developing the VRF device driver along with a return probe: perf probe -a 'fib_table_lookup_ret=fib_table_lookup%return ret=%ax' include/trace/events/fib.h | 90 ++ net/core/net-traces.c | 1 + net/ipv4/fib_frontend.c| 3 ++ net/ipv4/fib_trie.c| 5 +++ 4 files changed, 99 insertions(+) create mode 100644 include/trace/events/fib.h diff --git a/include/trace/events/fib.h b/include/trace/events/fib.h new file mode 100644 index ..1ac74ba0c977 --- /dev/null +++ b/include/trace/events/fib.h @@ -0,0 +1,90 @@ +#undef TRACE_SYSTEM +#define TRACE_SYSTEM fib + +#if !defined(_TRACE_FIB_H) || defined(TRACE_HEADER_MULTI_READ) +#define _TRACE_FIB_H + +#include linux/skbuff.h +#include linux/netdevice.h +#include net/ip_fib.h +#include linux/tracepoint.h + +TRACE_EVENT(fib_table_lookup, + + TP_PROTO(int tb_id, const struct flowi4 *flp), + + TP_ARGS(tb_id, flp), + + TP_STRUCT__entry( + __field(int,tb_id ) + __field(int,oif ) + __field(int,iif ) + __array(__u8, src,4 ) + __array(__u8, dst,4 ) + ), + + TP_fast_assign( + __entry-tb_id = tb_id; + __entry-oif = flp-flowi4_oif; + __entry-iif = flp-flowi4_iif; + memcpy(__entry-src, flp-saddr, 4); + memcpy(__entry-dst, flp-daddr, 4); + ), + + TP_printk(table %d oif %d iif %d src %pI4 dst %pI4, + __entry-tb_id, __entry-oif, __entry-iif, + __entry-src, __entry-dst) +); + +TRACE_EVENT(fib_table_lookup_nh, + + TP_PROTO(const struct fib_nh *nh), + + TP_ARGS(nh), + + TP_STRUCT__entry( + __string( name, nh-nh_dev-name) + __field(int,oif ) + __array(__u8, src,4 ) + ), + + TP_fast_assign( + __assign_str(name, nh-nh_dev ? 
nh-nh_dev-name : not set); + __entry-oif = nh-nh_oif; + memcpy(__entry-src, nh-nh_saddr, 4); + ), + + TP_printk(nexthop dev %s oif %d src %pI4, + __get_str(name), __entry-oif, __entry-src) +); + +TRACE_EVENT(fib_validate_source, + + TP_PROTO(const struct net_device *dev, const struct flowi4 *flp), + + TP_ARGS(dev, flp), + + TP_STRUCT__entry( + __string( name, dev-name ) + __field(int,oif ) + __field(int,iif ) + __array(__u8, src,4 ) + __array(__u8, dst,4 ) + ), + + TP_fast_assign( + __assign_str(name, dev ? dev-name : not set); + __entry-oif = flp-flowi4_oif; + __entry-iif = flp-flowi4_iif; + memcpy(__entry-src, flp-saddr, 4); + memcpy(__entry-dst, flp-daddr, 4); + ), + + TP_printk(dev %s oif %d iif %d src %pI4 dst %pI4, + __get_str(name), __entry-oif, __entry-iif, + __entry-src, __entry-dst) +); +#endif /* _TRACE_FIB_H */ + +/* This part must be outside protection */ +#include trace/define_trace.h diff --git a/net/core/net-traces.c b/net/core/net-traces.c index ba3c0120786c..adef015b2f41 100644 --- a/net/core/net-traces.c +++ b/net/core/net-traces.c @@ -31,6 +31,7 @@ #include trace/events/napi.h #include trace/events/sock.h #include trace/events/udp.h +#include trace/events/fib.h EXPORT_TRACEPOINT_SYMBOL_GPL(kfree_skb); diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c index 7fa277176c33..4036c94dfbe1 100644 --- a/net/ipv4/fib_frontend.c +++ b/net/ipv4/fib_frontend.c @@ -46,6 +46,7 @@ #include net/rtnetlink.h #include net/xfrm.h #include net/vrf.h +#include trace/events/fib.h #ifndef CONFIG_IP_MULTIPLE_TABLES @@ -344,6 +345,8 @@ static int __fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst, fl4.flowi4_mark = IN_DEV_SRC_VMARK(idev) ? 
skb-mark : 0; + trace_fib_validate_source(dev, fl4); + net = dev_net(dev); if (fib_lookup(net, fl4, res, 0)) goto last_resort; diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c index 1243c79cb5b0..f552ee31a39d 100644 --- a/net/ipv4/fib_trie.c +++ b/net/ipv4/fib_trie.c @@ -81,6 +81,7 @@ #include net/sock.h #include net/ip_fib.h #include net/switchdev.h +#include trace/events/fib.h #include fib_lookup.h #define MAX_STAT_DEPTH 32 @@ -1278,6 +1279,8 @@ int fib_table_lookup(struct fib_table *tb, const struct flowi4 *flp, unsigned long index; t_key cindex
[PATCH ipsec-next] xfrm: Use VRF master index if output device is enslaved
Directs route lookups to VRF table. Compiles out if NET_VRF is not enabled. With this patch able to successfully bring up ipsec tunnels in VRFs, even with duplicate network configuration (IPv4 tested).

Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 net/ipv4/xfrm4_policy.c | 7 +--
 net/ipv6/xfrm6_policy.c | 7 +--
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/xfrm4_policy.c b/net/ipv4/xfrm4_policy.c
index 55b3c0f4dde5..35757f6af2d5 100644
--- a/net/ipv4/xfrm4_policy.c
+++ b/net/ipv4/xfrm4_policy.c
@@ -15,6 +15,7 @@
 #include <net/dst.h>
 #include <net/xfrm.h>
 #include <net/ip.h>
+#include <net/vrf.h>
 
 static struct xfrm_policy_afinfo xfrm4_policy_afinfo;
 
@@ -107,8 +108,10 @@ _decode_session4(struct sk_buff *skb, struct flowi *fl, int reverse)
 	struct flowi4 *fl4 = &fl->u.ip4;
 	int oif = 0;
 
-	if (skb_dst(skb))
-		oif = skb_dst(skb)->dev->ifindex;
+	if (skb_dst(skb)) {
+		oif = vrf_master_ifindex_rcu(skb_dst(skb)->dev) ?
+			: skb_dst(skb)->dev->ifindex;
+	}
 
 	memset(fl4, 0, sizeof(struct flowi4));
 	fl4->flowi4_mark = skb->mark;
diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c
index a74013d3eceb..4a88b89becf5 100644
--- a/net/ipv6/xfrm6_policy.c
+++ b/net/ipv6/xfrm6_policy.c
@@ -20,6 +20,7 @@
 #include <net/ip.h>
 #include <net/ipv6.h>
 #include <net/ip6_route.h>
+#include <net/vrf.h>
 #if IS_ENABLED(CONFIG_IPV6_MIP6)
 #include <net/mip6.h>
 #endif
@@ -131,8 +132,10 @@ _decode_session6(struct sk_buff *skb, struct flowi *fl, int reverse)
 
 	nexthdr = nh[nhoff];
 
-	if (skb_dst(skb))
-		oif = skb_dst(skb)->dev->ifindex;
+	if (skb_dst(skb)) {
+		oif = vrf_master_ifindex_rcu(skb_dst(skb)->dev) ?
+			: skb_dst(skb)->dev->ifindex;
+	}
 
 	memset(fl6, 0, sizeof(struct flowi6));
 	fl6->flowi6_mark = skb->mark;
--
2.3.2 (Apple Git-55)
Re: [PATCH net-next v2] net: Move VRF change to udp_sendmsg to function
On 8/18/15 10:03 AM, Eric Dumazet wrote:
>> +/* unconnected socket. If output device is enslaved to a VRF
>> + * device lookup source address from VRF table.
>> + */
>> +static void udp_sendmsg_vrf_saddr(struct net *net, struct flowi4 *fl4,
>> +				  int oif, struct sock *sk)
>> +{
>> +	if (netif_index_is_vrf(net, oif)) {
>> +		__u8 flow_flags = fl4->flowi4_flags;
>> +		struct rtable *rt;
>> +
>> +		fl4->flowi4_flags = flow_flags | FLOWI_FLAG_VRFSRC;
>> +		rt = ip_route_output_flow(net, fl4, sk);
>> +		if (!IS_ERR(rt))
>> +			ip_rt_put(rt);
>
> This looks buggy.
>
> What happened to saddr = fl4->saddr; ?

Not needed.

>> +		fl4->flowi4_flags = flow_flags;
>> +	}
>> +}
>> +
>>  int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
>>  {
>>  	struct inet_sock *inet = inet_sk(sk);
>> @@ -1013,33 +1033,16 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
>>  	if (!rt) {
>>  		struct net *net = sock_net(sk);
>> -		__u8 flow_flags = inet_sk_flowi_flags(sk);
>>  		fl4 = &fl4_stack;
>> -		/* unconnected socket. If output device is enslaved to a VRF
>> -		 * device lookup source address from VRF table. This mimics
>> -		 * behavior of ip_route_connect{_init}.
>> -		 */
>> -		if (netif_index_is_vrf(net, ipc.oif)) {
>> -			flowi4_init_output(fl4, ipc.oif, sk->sk_mark, tos,
>> -					   RT_SCOPE_UNIVERSE, sk->sk_protocol,
>> -					   (flow_flags | FLOWI_FLAG_VRFSRC),
>> -					   faddr, saddr, dport,
>> -					   inet->inet_sport);
>> -
>> -			rt = ip_route_output_flow(net, fl4, sk);
>> -			if (!IS_ERR(rt)) {
>> -				saddr = fl4->saddr;
>> -				ip_rt_put(rt);
>> -			}
>> -		}
>> -
>>  		flowi4_init_output(fl4, ipc.oif, sk->sk_mark, tos,
>>  				   RT_SCOPE_UNIVERSE, sk->sk_protocol,
>> -				   flow_flags,
>> +				   inet_sk_flowi_flags(sk),
>>  				   faddr, saddr, dport, inet->inet_sport);
>>
>> +		udp_sendmsg_vrf_saddr(net, fl4, ipc.oif, sk);
>> +

fl4->saddr gets set in udp_sendmsg_vrf_saddr, stays for the next two...

>>  		security_sk_classify_flow(sk, flowi4_to_flowi(fl4));
>>  		rt = ip_route_output_flow(net, fl4, sk);
>>  		if (IS_ERR(rt)) {

and then right after the above block you have:

	if (msg->msg_flags&MSG_CONFIRM)
		goto do_confirm;
back_from_confirm:
	saddr = fl4->saddr;

So in short the original code change did not need the 'saddr = fl4->saddr;'.

David
Re: linux-next: unregister_netdevice: waiting for lo to become free. Usage count = 1
On 8/18/15 9:24 AM, Andrey Wagin wrote:
> Hello David,
>
> CRIU tests detect that references on net devices leak on 4.2.0-rc6-next-20150817. Looks like it started with v4.2-rc6-882-g3bfd847.

1e3136789975f03e461798149309034e5213c1b4 should have fixed it.

David
[PATCH net-next v2] net: Move VRF change to udp_sendmsg to function
Functionally equivalent, but as a separate function with VRF config check. After 2f52bdcf6ba (net: Updates to netif_index_is_vrf) function completely compiles out if VRF is not enabled; additional CONFIG check is not needed. Suggested-by: Tom Herbert t...@herbertland.com Signed-off-by: David Ahern d...@cumulusnetworks.com --- v2 - removed inline per Dave's comment - removed IS_ENABLED(CONFIG_NET_VRF) check; no longer needed after 2f52bdcf6ba net/ipv4/udp.c | 43 +++ 1 file changed, 23 insertions(+), 20 deletions(-) diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index c0a15e7f359f..76c5e5e945f8 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -873,6 +873,24 @@ int udp_push_pending_frames(struct sock *sk) } EXPORT_SYMBOL(udp_push_pending_frames); +/* unconnected socket. If output device is enslaved to a VRF + * device lookup source address from VRF table. + */ +static void udp_sendmsg_vrf_saddr(struct net *net, struct flowi4 *fl4, + int oif, struct sock *sk) +{ + if (netif_index_is_vrf(net, oif)) { + __u8 flow_flags = fl4-flowi4_flags; + struct rtable *rt; + + fl4-flowi4_flags = flow_flags | FLOWI_FLAG_VRFSRC; + rt = ip_route_output_flow(net, fl4, sk); + if (!IS_ERR(rt)) + ip_rt_put(rt); + fl4-flowi4_flags = flow_flags; + } +} + int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) { struct inet_sock *inet = inet_sk(sk); @@ -1013,33 +1033,16 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) if (!rt) { struct net *net = sock_net(sk); - __u8 flow_flags = inet_sk_flowi_flags(sk); fl4 = fl4_stack; - /* unconnected socket. If output device is enslaved to a VRF -* device lookup source address from VRF table. This mimics -* behavior of ip_route_connect{_init}. 
-*/ - if (netif_index_is_vrf(net, ipc.oif)) { - flowi4_init_output(fl4, ipc.oif, sk-sk_mark, tos, - RT_SCOPE_UNIVERSE, sk-sk_protocol, - (flow_flags | FLOWI_FLAG_VRFSRC), - faddr, saddr, dport, - inet-inet_sport); - - rt = ip_route_output_flow(net, fl4, sk); - if (!IS_ERR(rt)) { - saddr = fl4-saddr; - ip_rt_put(rt); - } - } - flowi4_init_output(fl4, ipc.oif, sk-sk_mark, tos, RT_SCOPE_UNIVERSE, sk-sk_protocol, - flow_flags, + inet_sk_flowi_flags(sk), faddr, saddr, dport, inet-inet_sport); + udp_sendmsg_vrf_saddr(net, fl4, ipc.oif, sk); + security_sk_classify_flow(sk, flowi4_to_flowi(fl4)); rt = ip_route_output_flow(net, fl4, sk); if (IS_ERR(rt)) { -- 2.3.2 (Apple Git-55) -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
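The helper in this patch relies on a side effect: the route lookup writes the selected source address into the flow, so the caller does not have to copy it back, and only the temporarily modified flags need restoring. The following is a simplified userspace sketch of that save, modify, restore pattern. `struct flow`, `FLAG_VRFSRC`, `route_lookup()` and `vrf_saddr_lookup()` are hypothetical names chosen for illustration, and the way the real lookup picks an address is not modeled.

```c
#include <assert.h>
#include <stdint.h>

#define FLAG_VRFSRC 0x08	/* hypothetical flag value, for illustration only */

/* Hypothetical, simplified flow descriptor. */
struct flow {
	uint8_t flags;
	uint32_t saddr;	/* 0 means "source address not chosen yet" */
};

/* Hypothetical lookup: fills in saddr when it is unset and the VRF
 * flag is present. Returns 0 on success. */
static int route_lookup(struct flow *fl, uint32_t vrf_saddr)
{
	assert(fl);

	if (!fl->saddr && (fl->flags & FLAG_VRFSRC))
		fl->saddr = vrf_saddr;
	return 0;
}

/* Set the flag only for the duration of the lookup, then restore it. */
static void vrf_saddr_lookup(struct flow *fl, uint32_t vrf_saddr)
{
	uint8_t saved = fl->flags;

	fl->flags = saved | FLAG_VRFSRC;
	(void)route_lookup(fl, vrf_saddr);
	fl->flags = saved;	/* flags are restored; saddr stays set */
}
```

After `vrf_saddr_lookup()` returns, the flags are back to their original value while `saddr` keeps whatever the lookup selected, which is why a later lookup on the same flow sees the VRF-derived source address.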
[PATCH] inet: Move VRF table lookup to inlined function
Table lookup compiles out when VRF is not enabled.

Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 include/net/vrf.h  | 24
 net/ipv4/af_inet.c | 10 +-
 2 files changed, 25 insertions(+), 9 deletions(-)

diff --git a/include/net/vrf.h b/include/net/vrf.h
index 0484d29d4589..40e3793c7a05 100644
--- a/include/net/vrf.h
+++ b/include/net/vrf.h
@@ -81,6 +81,25 @@ static inline int vrf_dev_table(const struct net_device *dev)
 	return tb_id;
 }
 
+static inline int vrf_dev_table_ifindex(struct net *net, int ifindex)
+{
+	struct net_device *dev;
+	int tb_id = 0;
+
+	if (!ifindex)
+		return 0;
+
+	rcu_read_lock();
+
+	dev = dev_get_by_index_rcu(net, ifindex);
+	if (dev)
+		tb_id = vrf_dev_table_rcu(dev);
+
+	rcu_read_unlock();
+
+	return tb_id;
+}
+
 /* called with rtnl */
 static inline int vrf_dev_table_rtnl(const struct net_device *dev)
 {
@@ -125,6 +144,11 @@ static inline int vrf_dev_table(const struct net_device *dev)
 	return 0;
 }
 
+static inline int vrf_dev_table_ifindex(struct net *net, int ifindex)
+{
+	return 0;
+}
+
 static inline int vrf_dev_table_rtnl(const struct net_device *dev)
 {
 	return 0;
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index c8b855882fa5..675e88cac2b4 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -450,15 +450,7 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 			goto out;
 	}
 
-	if (sk->sk_bound_dev_if) {
-		struct net_device *dev;
-
-		rcu_read_lock();
-		dev = dev_get_by_index_rcu(net, sk->sk_bound_dev_if);
-		if (dev)
-			tb_id = vrf_dev_table_rcu(dev) ? : tb_id;
-		rcu_read_unlock();
-	}
+	tb_id = vrf_dev_table_ifindex(net, sk->sk_bound_dev_if) ? : tb_id;
 	chk_addr_ret = inet_addr_type_table(net, addr->sin_addr.s_addr, tb_id);
 
 	/* Not specified by any standard per-se, however it breaks too
--
2.3.2 (Apple Git-55)
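The call site in this patch uses the GNU C shorthand `a ? : b`, which yields `a` when it is nonzero and `b` otherwise. Here is a small userspace sketch of the same "use the VRF table if there is one, else keep the default" logic, written in standard C. The device array, `table_for_ifindex()` and `table_or_default()` are hypothetical and only stand in for the kernel's device lookup; RCU locking is not modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical device entry: an ifindex and the VRF table it maps to. */
struct dev_entry {
	int ifindex;
	int vrf_table;	/* 0 when the device is not part of a VRF */
};

/* Return the VRF table for ifindex, or 0 if there is none. */
static int table_for_ifindex(const struct dev_entry *devs, size_t n, int ifindex)
{
	size_t i;

	assert(devs || n == 0);

	if (!ifindex)
		return 0;	/* socket is not bound to a device */

	for (i = 0; i < n; i++)
		if (devs[i].ifindex == ifindex)
			return devs[i].vrf_table;

	return 0;	/* unknown device: no VRF table */
}

/* Standard C spelling of the GNU "vrf_table ? : default_table". */
static int table_or_default(int vrf_table, int default_table)
{
	return vrf_table ? vrf_table : default_table;
}
```

A caller would combine the two as `table_or_default(table_for_ifindex(devs, n, ifindex), default_table)`, so an unbound socket, an unknown device, and a non-VRF device all fall through to the default table.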
Re: [PATCH net-next] inetpeer: Add support for VRFs
On 8/23/15 6:15 PM, Thomas Graf wrote:
> On 08/23/15 at 08:26am, David Ahern wrote:
>> inetpeer caches based on address only, so duplicate IP addresses within a namespace return the same cached entry. Similar to IP fragments handle duplicate addresses across VRFs by adding the VRF master device index to the lookup.
>
> We have a lot of other places which use the address only. Are you going to add the VRF id to all these places as well?

If appropriate, yes. I have fixed IP fragments and this patch fixes inetpeer cache. In both cases (L3 artifacts) the vrf device index provides the means to uniquely identify duplicate IP addresses within a namespace. If you know of other code that might be impacted I will investigate and fix as needed.

Thanks,
David
[PATCH] net af_key: Fix RCU splat
Hit the following splat testing VRF change for ipsec: [ 113.475692] === [ 113.476194] [ INFO: suspicious RCU usage. ] [ 113.476667] 4.2.0-rc6-1+deb7u2+clUNRELEASED #3.2.65-1+deb7u2+clUNRELEASED Not tainted [ 113.477545] --- [ 113.478013] /work/monster-14/dsa/kernel.git/include/linux/rcupdate.h:568 Illegal context switch in RCU read-side critical section! [ 113.479288] [ 113.479288] other info that might help us debug this: [ 113.479288] [ 113.480207] [ 113.480207] rcu_scheduler_active = 1, debug_locks = 1 [ 113.480931] 2 locks held by setkey/6829: [ 113.481371] #0: (net-xfrm.xfrm_cfg_mutex){+.+.+.}, at: [814e9887] pfkey_sendmsg+0xfb/0x213 [ 113.482509] #1: (rcu_read_lock){..}, at: [814e767f] rcu_read_lock+0x0/0x6e [ 113.483509] [ 113.483509] stack backtrace: [ 113.484041] CPU: 0 PID: 6829 Comm: setkey Not tainted 4.2.0-rc6-1+deb7u2+clUNRELEASED #3.2.65-1+deb7u2+clUNRELEASED [ 113.485422] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5.1-0-g8936dbb-20141113_115728-nilsson.home.kraxel.org 04/01/2014 [ 113.486845] 0001 88001d4c7a98 81518af2 81086962 [ 113.487732] 88001d538480 88001d4c7ac8 8107ae75 8180a154 [ 113.488628] 0b30 00d0 88001d4c7ad8 [ 113.489525] Call Trace: [ 113.489813] [81518af2] dump_stack+0x4c/0x65 [ 113.490389] [81086962] ? 
console_unlock+0x3d6/0x405 [ 113.491039] [8107ae75] lockdep_rcu_suspicious+0xfa/0x103 [ 113.491735] [81064032] rcu_preempt_sleep_check+0x45/0x47 [ 113.492442] [8106404d] ___might_sleep+0x19/0x1c8 [ 113.493077] [81064268] __might_sleep+0x6c/0x82 [ 113.493681] [81133190] cache_alloc_debugcheck_before.isra.50+0x1d/0x24 [ 113.494508] [81134876] kmem_cache_alloc+0x31/0x18f [ 113.495149] [814012b5] skb_clone+0x64/0x80 [ 113.495712] [814e6f71] pfkey_broadcast_one+0x3d/0xff [ 113.496380] [814e7b84] pfkey_broadcast+0xb5/0x11e [ 113.497024] [814e82d1] pfkey_register+0x191/0x1b1 [ 113.497653] [814e9770] pfkey_process+0x162/0x17e [ 113.498274] [814e9895] pfkey_sendmsg+0x109/0x213 In pfkey_sendmsg the net mutex is taken and then pfkey_broadcast takes the RCU lock. Fix by using GFP_ATOMIC for the allocation flag. Signed-off-by: David Ahern d...@cumulusnetworks.com --- net/key/af_key.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/key/af_key.c b/net/key/af_key.c index b397f0aa9005..73527e7dd247 100644 --- a/net/key/af_key.c +++ b/net/key/af_key.c @@ -1670,7 +1670,7 @@ static int pfkey_register(struct sock *sk, struct sk_buff *skb, const struct sad return -ENOBUFS; } - pfkey_broadcast(supp_skb, GFP_KERNEL, BROADCAST_REGISTERED, sk, sock_net(sk)); + pfkey_broadcast(supp_skb, GFP_ATOMIC, BROADCAST_REGISTERED, sk, sock_net(sk)); return 0; } -- 2.3.2 (Apple Git-55) -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
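The splat comes from an allocation that may sleep (GFP_KERNEL) being issued while an RCU read-side critical section is held. As a rough illustration of the rule the debug check enforces, here is a toy userspace model. It is not kernel code: `model_read_lock()`, `model_read_unlock()` and `alloc_allowed()` are hypothetical names, and real RCU and real allocator behavior are far more involved.

```c
#include <assert.h>
#include <stdbool.h>

/* Models the two allocation flags discussed: GFP_KERNEL may sleep,
 * GFP_ATOMIC may not. */
enum alloc_mode { ALLOC_MAY_SLEEP, ALLOC_ATOMIC };

static int read_side_depth;	/* nesting depth of the modeled read-side section */

static void model_read_lock(void)
{
	read_side_depth++;
}

static void model_read_unlock(void)
{
	assert(read_side_depth > 0);
	read_side_depth--;
}

/* A sleeping allocation is only legal outside the read-side section;
 * a non-sleeping one is legal anywhere. */
static bool alloc_allowed(enum alloc_mode mode)
{
	return mode == ALLOC_ATOMIC || read_side_depth == 0;
}
```

In this model the original call corresponds to `alloc_allowed(ALLOC_MAY_SLEEP)` evaluated between `model_read_lock()` and `model_read_unlock()`, which is false; switching to `ALLOC_ATOMIC` makes it true, mirroring the one-line change in the patch.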
[PATCH ipsec-next v2] xfrm: Use VRF master index if output device is enslaved
Directs route lookups to VRF table. Compiles out if NET_VRF is not enabled. With this patch able to successfully bring up ipsec tunnels in VRFs, even with duplicate network configuration.

Signed-off-by: David Ahern d...@cumulusnetworks.com
---
v2
- use vrf_master_ifindex rather than vrf_master_ifindex_rcu

 net/ipv4/xfrm4_policy.c | 7 +--
 net/ipv6/xfrm6_policy.c | 7 +--
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/xfrm4_policy.c b/net/ipv4/xfrm4_policy.c
index 55b3c0f4dde5..35757f6af2d5 100644
--- a/net/ipv4/xfrm4_policy.c
+++ b/net/ipv4/xfrm4_policy.c
@@ -15,6 +15,7 @@
 #include <net/dst.h>
 #include <net/xfrm.h>
 #include <net/ip.h>
+#include <net/vrf.h>
 
 static struct xfrm_policy_afinfo xfrm4_policy_afinfo;
 
@@ -107,8 +108,10 @@ _decode_session4(struct sk_buff *skb, struct flowi *fl, int reverse)
 	struct flowi4 *fl4 = &fl->u.ip4;
 	int oif = 0;
 
-	if (skb_dst(skb))
-		oif = skb_dst(skb)->dev->ifindex;
+	if (skb_dst(skb)) {
+		oif = vrf_master_ifindex(skb_dst(skb)->dev) ?
+			: skb_dst(skb)->dev->ifindex;
+	}
 
 	memset(fl4, 0, sizeof(struct flowi4));
 	fl4->flowi4_mark = skb->mark;
diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c
index a74013d3eceb..4a88b89becf5 100644
--- a/net/ipv6/xfrm6_policy.c
+++ b/net/ipv6/xfrm6_policy.c
@@ -20,6 +20,7 @@
 #include <net/ip.h>
 #include <net/ipv6.h>
 #include <net/ip6_route.h>
+#include <net/vrf.h>
 #if IS_ENABLED(CONFIG_IPV6_MIP6)
 #include <net/mip6.h>
 #endif
@@ -131,8 +132,10 @@ _decode_session6(struct sk_buff *skb, struct flowi *fl, int reverse)
 
 	nexthdr = nh[nhoff];
 
-	if (skb_dst(skb))
-		oif = skb_dst(skb)->dev->ifindex;
+	if (skb_dst(skb)) {
+		oif = vrf_master_ifindex(skb_dst(skb)->dev) ?
+			: skb_dst(skb)->dev->ifindex;
+	}
 
 	memset(fl6, 0, sizeof(struct flowi6));
 	fl6->flowi6_mark = skb->mark;
--
2.3.2 (Apple Git-55)
Re: [net-next 0/16] Proposal for VRF-lite - v3
On 7/27/15 2:30 PM, Eric W. Biederman wrote: This paragraph is false when it comes to sockets, as I have already pointed out. - VPN Routing and Forwarding (RFC4364 and it's kin) implies isolation strong enough to allow using the the same ip on different machines in different VPN instances and not have confusion. - The routing table is not the only table in the kernel that uses an ip address as a key. The result is that you can combine packets fragments that come in on different interfaces (irrespective of your VPN), confuse tcp parameters between interfaces, scramble your ipsec connections and I don't know what else. The duplicate IP address is a problem with the networking stack today; the VRF device does not introduce it. The VRF device does allow duplicate IP addresses within a namespace but separate VRFs, though yes various places that rely solely on source address like IP fragmentation do need to be fixed. I looked at the IPv4 fragmentation code yesterday and will continue today. So help me with the history: is there any reason why the device index is not used today? It seems like a straight forward change. 1. simple netdevices with the same IP address -- no problem using index in the lookup 2. 2 ipsec tunnels -- different netdevices, same IP address -- no problem using index 3. stacked devices like bonding and team interfaces appear to the stack as a single device -- no problem using index of stacked device 4. If an interface is deleted and a new one is created with the same IP address then we want to fail the lookup -- no problem using index 5. other??? Is there a use case where I can't add ifindex of the incoming device (or higher level device if skb-dev is changed) to the hash and lookup for fragments? Version 3 - addressed comments from first 2 RFCs with the exception of the name Nicolas: We will do the name conversion once we agree on what the correct name should be (vrf, mrf or something else) Not so. 
I described the deep problems between your goals and your implementation and they are not even mentioned let alone addressed. I have addressed comments to the extent that I can. As I stated in my last followup to you Eric I did not understand your point. I asked for clarification, a --verbose if you will. I can't read your mind, so I need you to elaborate on your points to be able to respond and address your concerns. - packets flow through the VRF device in both directions allowing the following: - tcpdump -i vrfn - tc rules on vrf device - netfilter rules on vrf device Ingo/Andy: I added you two as a start point for the proposed task related changes. Not sure who should be the reviewer; please let me know if someone else is more appropriate. Thanks. It looks like you are trying to implement a namespace that isn't a namespace. Given that it is broken by design you have my nack. This is an L3 separation within a namespace, not a device level separation which is what namespaces provide. David -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next 14/16] net: Add sk_bind_dev_if to task_struct
On 7/28/15 10:01 AM, Eric Dumazet wrote:
> On Tue, 2015-07-28 at 14:19 +0200, Hannes Frederic Sowa wrote:
>> Hello Eric,
>>
>> On Mon, 2015-07-27 at 15:33 -0500, Eric W. Biederman wrote:
>>> David Ahern d...@cumulusnetworks.com writes:
>>>
>>>> Allow tasks to have a default device index for binding sockets. If
>>>> set the value is passed to all AF_INET/AF_INET6 sockets when they
>>>> are created.
>>>>
>>>> The task setting is passed parent to child on fork, but can be set
>>>> or changed after task creation using prctl (if task has
>>>> CAP_NET_ADMIN permissions). The setting for a socket can be
>>>> retrieved using prctl().
>>>>
>>>> This option allows an administrator to restrict a task to only
>>>> send/receive packets through the specified device. In the case of
>>>> VRF devices this option restricts tasks to a specific VRF.
>>>>
>>>> Correlation of the device index to a specific VRF, i.e.,
>>>> ifindex <--> VRF device <--> VRF id
>>>> is left to userspace.
>>>
>>> Nacked-by: Eric W. Biederman ebied...@xmission.com
>>>
>>> Because it is broken by design. Your routing device is only safe for
>>> programs that know its limitations; it is not appropriate for general
>>> applications.
>>>
>>> Since you don't even seem to know its limitations I think this is a
>>> bad path to walk down.
>>
>> Can you please elaborate about the broken by design?
>>
>> Different operating systems are already using this approach with good
>> success. I read your other mail regarding isolation of different VRFs
>> and I agree that all code which persists state depending solely on the
>> IP address is affected by this and this must be dealt with and fixed
>> (actually, there aren't too many). But I wouldn't call that broken by
>> design. This stuff will get fixed like e.g. cross-talk between
>> fragmentation queues, icmp rate limiters etc, which could already
>> happen in the past.
>>
>> What is your opinion on the fundamental approach only from a user
>> perspective? Do you think that is broken, too?
>
> I agree with Eric here.
>
> This sk_bind_dev_if on task_struct is quite a hack.
>
> What will be added next ? An array of dev_if ?
> netfilter support ? af_packet support ?
> What about /proc files and netlink dumps ?

It could just as easily be a pointer to a struct (e.g., struct net_ctx)
such that the intrusion to task_struct is simply 8 bytes -- very similar
to the nsproxy used for the assorted namespaces. The struct can then
contain whatever network config is imposed on the task.

> We already have network namespaces. Extend this if needed, instead of
> bypassing them.

Problems with using network namespaces for VRFs have been discussed in
the past. e.g.,
    http://www.spinics.net/lists/netdev/msg298368.html

David

> No need to add something else (with lack of proper reporting for
> various tools)
Re: [PATCH net-next 14/16] net: Add sk_bind_dev_if to task_struct
On 7/28/15 9:25 AM, Andy Lutomirski wrote:
> On Jul 27, 2015 11:33 AM, David Ahern d...@cumulusnetworks.com wrote:
>>
>> Allow tasks to have a default device index for binding sockets. If set
>> the value is passed to all AF_INET/AF_INET6 sockets when they are
>> created.
>
> This is not intended to be a review of the concept. I haven't thought
> about whether the concept is a good idea, broken by design, or
> whatever. FWIW, if this were added to the kernel and didn't require
> excessive privilege, I'd probably use it. (I still don't really
> understand why binding to a device requires privilege in the first
> place, but, again, I haven't thought about it very much.)

The intent here is to restrict a task to only sending and receiving
packets from a single network device. The device can be a single
ethernet interface, a stacked device (e.g., bond) or in our case a VRF
device which restricts a task to interfaces (and hence network paths)
associated with the VRF.

>> +#ifdef CONFIG_NET
>> +	case PR_SET_SK_BIND_DEV_IF:
>> +	{
>> +		struct net_device *dev;
>> +		int idx = (int) arg2;
>> +
>> +		if (!capable(CAP_NET_ADMIN))
>> +			return -EPERM;
>> +
>
> Can you either use ns_capable or add a comment as to why not?

will do.

> Also, please return -EINVAL if unused args are nonzero.

ok.

>> +		if (idx) {
>> +			dev = dev_get_by_index(me->nsproxy->net_ns, idx);
>> +			if (!dev)
>> +				return -EINVAL;
>> +			dev_put(dev);
>> +		}
>> +		me->sk_bind_dev_if = idx;
>> +		break;
>> +	}
>> +	case PR_GET_SK_BIND_DEV_IF:
>> +	{
>> +		struct task_struct *tsk;
>> +		int sk_bind_dev_if = -EINVAL;
>> +
>> +		rcu_read_lock();
>> +		tsk = find_task_by_vpid(arg2);
>> +		if (tsk)
>> +			sk_bind_dev_if = tsk->sk_bind_dev_if;
>
> Why do you support different tasks here? Could this use proc instead?

In this case we want to allow a separate process to determine if a task
is restricted to a device.

> The same -EINVAL issue applies.
>
> Also, I think you need to hook setns and unshare to do something
> reasonable when the task is bound to a device.

ack on both.

Thanks for the review,
David
Re: [PATCH net-next 13/16] net: Introduce VRF device driver - v2
On 7/27/15 2:01 PM, Nikolay Aleksandrov wrote:
>> +
>> +	if (!vrf_is_master(dev) || vrf_is_master(port_dev) ||
>
> Hmm, this means that bonds won't be able to be VRF slaves.
> They have the IFF_MASTER flag set.

Right, will change to the IFF_VRF_MASTER flag.

>> +	    vrf_is_slave(port_dev))
>> +		return -EINVAL;
>> +
>> +	return do_vrf_add_slave(dev, port_dev);
>> +}
>> +
>> +/* inverse of do_vrf_add_slave */
>> +static int do_vrf_del_slave(struct net_device *dev, struct net_device *port_dev)
>> +{
>> +	struct net_vrf *vrf = netdev_priv(dev);
>> +	struct slave_queue *queue = &vrf->queue;
>> +	struct net_vrf_dev *vrf_ptr = NULL;
>> +	struct slave *slave;
>> +
>> +	vrf_ptr = rcu_dereference(dev->vrf_ptr);
>> +	RCU_INIT_POINTER(dev->vrf_ptr, NULL);
>
> I think this isn't safe, you should wait for a grace period before
> freeing the pointer. Actually you can just move the kfree() below the
> netdev_rx_handler_unregister() since it does synchronize_rcu() anyway.

ok

And ack on all other comments.

Thanks for the review,
David
[PATCH 08/10] net: Use passed in table for nexthop lookups
If a user passes in a table for new routes use that table for nexthop
lookups. Specifically, this solves the case where a connected route does
not exist in the main table, but only another table and then a subsequent
route is added with a next hop using the connected route. i.e.,

$ ip route ls
default via 10.0.2.2 dev eth0
10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.15
169.254.0.0/16 dev eth0 scope link metric 1003
192.168.56.0/24 dev eth1 proto kernel scope link src 192.168.56.51

$ ip route ls table 10
1.1.1.0/24 dev eth2 scope link

Without this patch adding a nexthop route fails:

$ ip route add table 10 2.2.2.0/24 via 1.1.1.10
RTNETLINK answers: Network is unreachable

With this patch the route is added successfully.

Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 net/ipv4/fib_semantics.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 85e9a8abf15c..b7f1d20a9615 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -691,6 +691,7 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
 		}
 		rcu_read_lock();
 		{
+			struct fib_table *tbl = NULL;
 			struct flowi4 fl4 = {
 				.daddr = nh->nh_gw,
 				.flowi4_scope = cfg->fc_scope + 1,
@@ -701,8 +702,16 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
 			/* It is not necessary, but requires a bit of thinking */
 			if (fl4.flowi4_scope < RT_SCOPE_LINK)
 				fl4.flowi4_scope = RT_SCOPE_LINK;
-			err = fib_lookup(net, &fl4, &res,
-					 FIB_LOOKUP_IGNORE_LINKSTATE);
+
+			if (cfg->fc_table)
+				tbl = fib_get_table(net, cfg->fc_table);
+
+			if (tbl)
+				err = fib_table_lookup(tbl, &fl4, &res,
+						       FIB_LOOKUP_IGNORE_LINKSTATE);
+			else
+				err = fib_lookup(net, &fl4, &res,
+						 FIB_LOOKUP_IGNORE_LINKSTATE);
 			if (err) {
 				rcu_read_unlock();
 				return err;
-- 
2.3.2 (Apple Git-55)
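The fallback added by this hunk can be modelled outside the kernel. What
follows is a minimal userspace sketch, not kernel code: struct table,
get_table(), lookup_in_table() and check_nexthop() are hypothetical
stand-ins for fib_get_table(), fib_table_lookup() and the lookup inside
fib_check_nh(). As a simplification the fallback path consults only a
"main" table, whereas fib_lookup() in the kernel walks the FIB rules, and
the default route from the commit message is left out because the scope
check would reject it as a gateway route anyway.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a FIB table: a list of prefixes. */
struct prefix {
	uint32_t net;	/* network address, host byte order */
	uint32_t mask;	/* netmask, host byte order */
};

struct table {
	uint32_t id;
	const struct prefix *routes;
	size_t nroutes;
};

#define TABLE_MAIN 254u	/* same value as RT_TABLE_MAIN */

/* Connected routes taken from the commit message example. */
static const struct prefix main_routes[] = {
	{ 0x0a000200u, 0xffffff00u },	/* 10.0.2.0/24 */
	{ 0xc0a83800u, 0xffffff00u },	/* 192.168.56.0/24 */
};

static const struct prefix table10_routes[] = {
	{ 0x01010100u, 0xffffff00u },	/* 1.1.1.0/24 */
};

static const struct table tables[] = {
	{ TABLE_MAIN, main_routes, 2 },
	{ 10u, table10_routes, 1 },
};

/* Return the table with the given id, or NULL if it does not exist. */
static const struct table *get_table(uint32_t id)
{
	size_t i;

	for (i = 0; i < sizeof(tables) / sizeof(tables[0]); i++)
		if (tables[i].id == id)
			return &tables[i];
	return NULL;
}

/* Return 0 if some prefix in the table covers addr, -1 otherwise. */
static int lookup_in_table(const struct table *t, uint32_t addr)
{
	size_t i;

	for (i = 0; i < t->nroutes; i++)
		if ((addr & t->routes[i].mask) == t->routes[i].net)
			return 0;
	return -1;
}

/*
 * Resolve a gateway: prefer the table named by the caller, and fall back
 * to the default lookup when no table was given or it does not exist.
 */
static int check_nexthop(uint32_t table_id, uint32_t gw)
{
	const struct table *tbl = NULL;

	if (table_id)
		tbl = get_table(table_id);
	if (!tbl)
		tbl = get_table(TABLE_MAIN);

	return tbl ? lookup_in_table(tbl, gw) : -1;
}
```

With these tables, check_nexthop(10, ...) resolves the gateway 1.1.1.10
while check_nexthop(0, ...) does not, which mirrors the "Network is
unreachable" case described in the commit message.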
[PATCH net-next 00/10] VRF-lite - v4
to the VRF (sk_bound_dev_if is set to the VRF device).

5. Neighbor entries

Neighbor entries are not impacted by the VRF device. Entries are
associated with a particular interface; the VRF association is indirect
via the interface-to-VRF device enslavement.

Version 4
- builds are clean with and without VRF device enabled (no, yes and
  module)

- tightened the driver implementation
  + device add/delete, slave add/remove, and module unload are all clean

- fixed RCU references
  + with RCU and lock debugging enabled changes are clean through the
    suite of tests

- TX path uses custom dst, so patch refactoring rtable allocation is
  dropped along with the patch adding rt_nexthop helper

- dropped the task patch that adds default bind to interface for sockets
  and the associated chvrf example command
  + the patches are a convenience for running unmodified code. They are
    not needed for the core functionality. Any application with support
    for SO_BINDTODEVICE works properly with this patch set.

Version 3
- addressed comments from first 2 RFCs with the exception of the name
  Nicolas: We will do the name conversion once we agree on what the
           correct name should be (vrf, mrf or something else)

- packets flow through the VRF device in both directions allowing the
  following:
  - tcpdump -i vrfn
  - tc rules on vrf device
  - netfilter rules on vrf device

TO-DO
=====
1. IPv6
2. ip fragments
3. ipsec, xfrms
4. listen filter to restrict VRF connections
   - i.e., bind to VRFs a, b, c only or NOT VRFs e, f, g

Eric B:
  I think I understand your points regarding ip fragments and ipsec now.
  I will release additional patches for both, but it takes time. For
  example, I have ipsec working with VRFs implemented using the VRF
  driver but more changes are needed. Once I have multiple tunnels with
  overlapping network spaces working I will be sending out patches for
  review.

Thanks to Nikolay for his many, many code reviews whipping the device
driver into shape, and bug fixes and ideas from Hannes, Roopa Prabhu,
Jon Toppins, Jamal.

Patches can also be pulled from:
    https://github.com/dsahern/linux.git, vrf-dev-v4 branch
    https://github.com/dsahern/iproute2, vrf-dev-v4 branch

David Ahern (10):
  net: Introduce VRF related flags and helpers
  net: Use VRF device index for lookups on RX
  net: Use VRF device index for lookups on TX
  udp: Handle VRF device
  net: Add inet_addr lookup by table
  net: Fix up inet_addr_type checks
  net: Add routes to the table associated with the device
  net: Use passed in table for nexthop lookups
  net: Use VRF device index for socket lookups
  net: Introduce VRF device driver

 drivers/net/Kconfig          |   7 +
 drivers/net/Makefile         |   1 +
 drivers/net/vrf.c            | 715 +++
 include/linux/netdevice.h    |  20 ++
 include/net/flow.h           |   1 +
 include/net/route.h          |   7 +
 include/net/vrf.h            | 176 +++
 include/uapi/linux/if_link.h |   9 +
 net/ipv4/af_inet.c           |  13 +-
 net/ipv4/arp.c               |  15 +-
 net/ipv4/fib_frontend.c      |  66 +++-
 net/ipv4/fib_semantics.c     |  44 ++-
 net/ipv4/fib_trie.c          |   7 +-
 net/ipv4/icmp.c              |   9 +-
 net/ipv4/route.c             |   8 +-
 net/ipv4/syncookies.c        |   5 +-
 net/ipv4/tcp_input.c         |   6 +-
 net/ipv4/tcp_ipv4.c          |  11 +-
 net/ipv4/udp.c               |  25 +-
 19 files changed, 1102 insertions(+), 43 deletions(-)
 create mode 100644 drivers/net/vrf.c
 create mode 100644 include/net/vrf.h

-- 
2.3.2 (Apple Git-55)
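Since the cover letter leans on SO_BINDTODEVICE, here is a minimal
userspace sketch of what "support for SO_BINDTODEVICE" means for an
application. bind_to_device() is a hypothetical helper name, and "vrf1"
is only an example device name taken from this series. setsockopt() with
SO_BINDTODEVICE is the existing socket(7) interface; depending on the
kernel version the call may need CAP_NET_RAW.

```c
#define _GNU_SOURCE	/* make sure SO_BINDTODEVICE is visible under strict -std modes */
#include <errno.h>
#include <net/if.h>	/* IFNAMSIZ */
#include <string.h>
#include <sys/socket.h>

/*
 * Bind an existing socket to a named device such as "vrf1".
 * Returns 0 on success, -1 with errno set on failure.
 */
static int bind_to_device(int fd, const char *ifname)
{
	size_t len = strlen(ifname);

	/* Reject names the kernel could not store in an interface name. */
	if (len == 0 || len >= IFNAMSIZ) {
		errno = EINVAL;
		return -1;
	}

	/* Pass the name including its terminating NUL byte. */
	return setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE,
			  ifname, (socklen_t)(len + 1));
}
```

A caller would create the socket as usual and then call
bind_to_device(fd, "vrf1") before connect() or bind(), so that lookups
for that socket are scoped to the VRF device.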
[PATCH] iproute2: Add support for VRF device
Allow user to create a vrf device and specify its table binding. Based on the iplink_vlan implementation. Signed-off-by: Shrijeet Mukherjee s...@cumulusnetworks.com Signed-off-by: David Ahern d...@cumulusnetworks.com --- include/linux/if_link.h | 8 + ip/Makefile | 2 +- ip/iplink.c | 2 +- ip/iplink_vrf.c | 85 + 4 files changed, 95 insertions(+), 2 deletions(-) create mode 100644 ip/iplink_vrf.c diff --git a/include/linux/if_link.h b/include/linux/if_link.h index b905cf7f4948..74dedf4320b8 100644 --- a/include/linux/if_link.h +++ b/include/linux/if_link.h @@ -338,6 +338,14 @@ enum macvlan_macaddr_mode { #define MACVLAN_FLAG_NOPROMISC 1 +/* VRF section */ +enum { + IFLA_VRF_UNSPEC, + IFLA_VRF_TABLE, + __IFLA_VRF_MAX +}; + +#define IFLA_VRF_MAX (__IFLA_VRF_MAX - 1) /* IPVLAN section */ enum { IFLA_IPVLAN_UNSPEC, diff --git a/ip/Makefile b/ip/Makefile index 77653ecc5785..d8b38ac2e44b 100644 --- a/ip/Makefile +++ b/ip/Makefile @@ -7,7 +7,7 @@ IPOBJ=ip.o ipaddress.o ipaddrlabel.o iproute.o iprule.o ipnetns.o \ iplink_vxlan.o tcp_metrics.o iplink_ipoib.o ipnetconf.o link_ip6tnl.o \ link_iptnl.o link_gre6.o iplink_bond.o iplink_bond_slave.o iplink_hsr.o \ iplink_bridge.o iplink_bridge_slave.o ipfou.o iplink_ipvlan.o \ -iplink_geneve.o +iplink_geneve.o iplink_vrf.o RTMONOBJ=rtmon.o diff --git a/ip/iplink.c b/ip/iplink.c index 369d50eab94e..14bf7211a447 100644 --- a/ip/iplink.c +++ b/ip/iplink.c @@ -94,7 +94,7 @@ void iplink_usage(void) fprintf(stderr, TYPE := { vlan | veth | vcan | dummy | ifb | macvlan | macvtap |\n); fprintf(stderr, bridge | bond | ipoib | ip6tnl | ipip | sit | vxlan |\n); fprintf(stderr, gre | gretap | ip6gre | ip6gretap | vti | nlmon |\n); - fprintf(stderr, bond_slave | ipvlan | geneve }\n); + fprintf(stderr, bond_slave | ipvlan | geneve | vrf }\n); } exit(-1); } diff --git a/ip/iplink_vrf.c b/ip/iplink_vrf.c new file mode 100644 index ..0d7e21c7c152 --- /dev/null +++ b/ip/iplink_vrf.c @@ -0,0 +1,85 @@ +/* iplink_vrf.cVRF device support + * + * This 
program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + * + * Authors: Shrijeet Mukherjee s...@cumulusnetworks.com + */ + +#include stdio.h +#include stdlib.h +#include string.h +#include sys/socket.h +#include linux/if_link.h + +#include rt_names.h +#include utils.h +#include ip_common.h + +static void vrf_explain(FILE *f) +{ + fprintf(f, Usage: ... vrf table TABLEID \n); +} + +static void explain(void) +{ + vrf_explain(stderr); +} + +static int table_arg(void) +{ + fprintf(stderr,Error: argument of \table\ must be 0-32767 and currently unused\n); + return -1; +} + +static int vrf_parse_opt(struct link_util *lu, int argc, char **argv, + struct nlmsghdr *n) +{ + while (argc 0) { + if (matches(*argv, table) == 0) { + __u32 table = 0; + NEXT_ARG(); + + table = atoi(*argv); + if (table 0 || table 32767) + return table_arg(); + addattr32(n, 1024, IFLA_VRF_TABLE, table); + } else if (matches(*argv, help) == 0) { + explain(); + return -1; + } else { + fprintf(stderr, vrf: unknown option \%s\?\n, + *argv); + explain(); + return -1; + } + argc--, argv++; + } + + return 0; +} + +static void vrf_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[]) +{ + if (!tb) + return; + + if (tb[IFLA_VRF_TABLE]) + fprintf(f, table %u , rta_getattr_u32(tb[IFLA_VRF_TABLE])); +} + +static void vrf_print_help(struct link_util *lu, int argc, char **argv, + FILE *f) +{ + vrf_explain(f); +} + +struct link_util vrf_link_util = { + .id = vrf, + .maxattr= IFLA_VRF_MAX, + .parse_opt = vrf_parse_opt, + .print_opt = vrf_print_opt, + .print_help = vrf_print_help, +}; -- 2.3.2 (Apple Git-55) -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
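A note on the table argument parsing in vrf_parse_opt(): the comparison
operators were lost in this archive copy, but assuming the range check
reads "table < 0 || table > 32767", the first half can never fire because
table is a __u32, and atoi() silently turns non-numeric input into 0. A
minimal sketch of a stricter parser follows. parse_table_id() and
VRF_TABLE_MAX are hypothetical names that are not part of the patch or of
iproute2; the upper bound is the one quoted in the patch's error text.

```c
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define VRF_TABLE_MAX 32767ul	/* upper bound quoted in the patch's error text */

/*
 * Parse a decimal table id in the range [0, VRF_TABLE_MAX].
 * Returns 0 and stores the id in *table on success, -1 on any error.
 */
static int parse_table_id(const char *arg, uint32_t *table)
{
	char *end;
	unsigned long val;

	/* strtoul() accepts leading whitespace and a sign; we do not. */
	if (!arg || !isdigit((unsigned char)*arg))
		return -1;

	errno = 0;
	val = strtoul(arg, &end, 10);
	if (errno || *end != '\0' || val > VRF_TABLE_MAX)
		return -1;

	*table = (uint32_t)val;
	return 0;
}
```

With this shape, "abc", "10x", "-1" and "32768" are all rejected instead
of being converted to a table id.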
[PATCH 01/10] net: Introduce VRF related flags and helpers
Add a VRF_MASTER flag for interfaces and helper functions for determining if a device is a VRF_MASTER. Add link attribute for passing VRF_TABLE id. Add vrf_ptr to netdevice. Add various macros for determining if a device is a VRF device, the index of the master VRF device and table associated with VRF device. Signed-off-by: Shrijeet Mukherjee s...@cumulusnetworks.com Signed-off-by: David Ahern d...@cumulusnetworks.com --- include/linux/netdevice.h| 20 + include/net/vrf.h| 177 +++ include/uapi/linux/if_link.h | 9 +++ 3 files changed, 206 insertions(+) create mode 100644 include/net/vrf.h diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 607b5f41f46f..f7a6ef2fae3a 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1289,6 +1289,7 @@ enum netdev_priv_flags { IFF_XMIT_DST_RELEASE_PERM = 122, IFF_IPVLAN_MASTER = 123, IFF_IPVLAN_SLAVE= 124, + IFF_VRF_MASTER = 125, }; #define IFF_802_1Q_VLANIFF_802_1Q_VLAN @@ -1316,6 +1317,7 @@ enum netdev_priv_flags { #define IFF_XMIT_DST_RELEASE_PERM IFF_XMIT_DST_RELEASE_PERM #define IFF_IPVLAN_MASTER IFF_IPVLAN_MASTER #define IFF_IPVLAN_SLAVE IFF_IPVLAN_SLAVE +#define IFF_VRF_MASTER IFF_VRF_MASTER /** * struct net_device - The DEVICE structure. 
@@ -1432,6 +1434,7 @@ enum netdev_priv_flags { * @dn_ptr:DECnet specific data * @ip6_ptr: IPv6 specific data * @ax25_ptr: AX.25 specific data + * @vrf_ptr: VRF specific data * @ieee80211_ptr: IEEE 802.11 specific data, assign before registering * * @last_rx: Time of last Rx @@ -1650,6 +1653,7 @@ struct net_device { struct dn_dev __rcu *dn_ptr; struct inet6_dev __rcu *ip6_ptr; void*ax25_ptr; + struct net_vrf_dev __rcu *vrf_ptr; struct wireless_dev *ieee80211_ptr; struct wpan_dev *ieee802154_ptr; #if IS_ENABLED(CONFIG_MPLS_ROUTING) @@ -3808,6 +3812,22 @@ static inline bool netif_supports_nofcs(struct net_device *dev) return dev-priv_flags IFF_SUPP_NOFCS; } +static inline bool netif_is_vrf(const struct net_device *dev) +{ + return dev-priv_flags IFF_VRF_MASTER; +} + +static inline bool netif_index_is_vrf(struct net *net, int ifindex) +{ + struct net_device *dev = dev_get_by_index_rcu(net, ifindex); + bool rc = false; + + if (dev) + rc = netif_is_vrf(dev); + + return rc; +} + /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */ static inline void netif_keep_dst(struct net_device *dev) { diff --git a/include/net/vrf.h b/include/net/vrf.h new file mode 100644 index ..5d4bd67a4902 --- /dev/null +++ b/include/net/vrf.h @@ -0,0 +1,177 @@ +/* + * include/net/net_vrf.h - adds vrf dev structure definitions + * Copyright (c) 2015 Cumulus Networks + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. 
+ */ + +#ifndef __LINUX_NET_VRF_H +#define __LINUX_NET_VRF_H + +struct net_vrf_dev { + struct rcu_head rcu; + int ifindex; /* ifindex of master dev */ + u32 tb_id; /* table id for VRF */ +}; + +struct slave { + struct list_headlist; + struct net_device *dev; +}; + +struct slave_queue { + struct list_headall_slaves; + int num_slaves; +}; + +struct net_vrf { + struct slave_queue queue; + struct fib_table*tb; + struct rtable *rth; + u32 tb_id; +}; + + +#if IS_ENABLED(CONFIG_NET_VRF) +/* called with rcu_read_lock() */ +static inline int vrf_master_ifindex_rcu(const struct net_device *dev) +{ + struct net_vrf_dev *vrf_ptr; + int ifindex = 0; + + if (!dev) + return 0; + + if (netif_is_vrf(dev)) + ifindex = dev-ifindex; + else { + vrf_ptr = rcu_dereference(dev-vrf_ptr); + if (vrf_ptr) + ifindex = vrf_ptr-ifindex; + } + + return ifindex; +} + +static inline int vrf_master_ifindex(const struct net_device *dev) +{ + int ifindex; + + rcu_read_lock(); + ifindex = vrf_master_ifindex_rcu(dev); + rcu_read_unlock(); + + return ifindex; +} + +static inline int vrf_master_ifindex_by_index(struct net *net, int ifindex) +{ + int rc = 0; + + if (ifindex) { + struct net_device *dev = dev_get_by_index(net, ifindex); + + if (dev) { + rc
[PATCH 05/10] net: Add inet_addr lookup by table
Currently inet_addr_type and inet_dev_addr_type expect local addresses
to be in the local table. With the VRF device local routes for devices
associated with a VRF will be in the table associated with the VRF.
Provide an alternate inet_addr lookup to use a specific table rather
than defaulting to the local table.

Signed-off-by: Shrijeet Mukherjee s...@cumulusnetworks.com
Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 include/net/route.h     |  1 +
 net/ipv4/fib_frontend.c | 22 +++---
 2 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/include/net/route.h b/include/net/route.h
index 94189d4bd899..6ba681f0b98d 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -189,6 +189,7 @@ void ipv4_sk_redirect(struct sk_buff *skb, struct sock *sk);
 void ip_rt_send_redirect(struct sk_buff *skb);
 
 unsigned int inet_addr_type(struct net *net, __be32 addr);
+unsigned int inet_addr_type_table(struct net *net, __be32 addr, int tb_id);
 unsigned int inet_dev_addr_type(struct net *net, const struct net_device *dev,
 				__be32 addr);
 void ip_rt_multicast_event(struct in_device *);
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index d8ced1d89f1b..b11321a8e58d 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -212,12 +212,12 @@ void fib_flush_external(struct net *net)
  */
 static inline unsigned int __inet_dev_addr_type(struct net *net,
 						const struct net_device *dev,
-						__be32 addr)
+						__be32 addr, int tb_id)
 {
 	struct flowi4 fl4 = { .daddr = addr };
 	struct fib_result res;
 	unsigned int ret = RTN_BROADCAST;
-	struct fib_table *local_table;
+	struct fib_table *table;
 
 	if (ipv4_is_zeronet(addr) || ipv4_is_lbcast(addr))
 		return RTN_BROADCAST;
@@ -226,10 +226,10 @@ static inline unsigned int __inet_dev_addr_type(struct net *net,
 
 	rcu_read_lock();
 
-	local_table = fib_get_table(net, RT_TABLE_LOCAL);
-	if (local_table) {
+	table = fib_get_table(net, tb_id);
+	if (table) {
 		ret = RTN_UNICAST;
-		if (!fib_table_lookup(local_table, &fl4, &res, FIB_LOOKUP_NOREF)) {
+		if (!fib_table_lookup(table, &fl4, &res, FIB_LOOKUP_NOREF)) {
 			if (!dev || dev == res.fi->fib_dev)
 				ret = res.type;
 		}
@@ -239,16 +239,24 @@ static inline unsigned int __inet_dev_addr_type(struct net *net,
 	return ret;
 }
 
+unsigned int inet_addr_type_table(struct net *net, __be32 addr, int tb_id)
+{
+	return __inet_dev_addr_type(net, NULL, addr, tb_id);
+}
+EXPORT_SYMBOL(inet_addr_type_table);
+
 unsigned int inet_addr_type(struct net *net, __be32 addr)
 {
-	return __inet_dev_addr_type(net, NULL, addr);
+	return __inet_dev_addr_type(net, NULL, addr, RT_TABLE_LOCAL);
 }
 EXPORT_SYMBOL(inet_addr_type);
 
 unsigned int inet_dev_addr_type(struct net *net, const struct net_device *dev,
 				__be32 addr)
 {
-	return __inet_dev_addr_type(net, dev, addr);
+	int rt_table = vrf_dev_table(dev) ? : RT_TABLE_LOCAL;
+
+	return __inet_dev_addr_type(net, dev, addr, rt_table);
 }
 EXPORT_SYMBOL(inet_dev_addr_type);
 
-- 
2.3.2 (Apple Git-55)
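To illustrate why the table id matters here, the following is a small
userspace model of the lookup, a sketch under stated assumptions rather
than the kernel implementation. enum addr_type, local_addrs[],
addr_type_table() and dev_addr_type() are hypothetical names; the
addresses and table 10 are made up for the example. The model keeps the
patch's behaviour of reporting "broadcast" when the table does not exist,
and it spells out the GNU "a ? : b" shorthand from the patch as a plain
conditional so that it builds with any C compiler.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result codes standing in for RTN_UNICAST/LOCAL/BROADCAST. */
enum addr_type { ADDR_UNICAST, ADDR_LOCAL, ADDR_BROADCAST };

#define TABLE_LOCAL 255u	/* same value as RT_TABLE_LOCAL */

/* One local-address route and the table that holds it. */
struct local_entry {
	uint32_t table;
	uint32_t addr;	/* host byte order */
};

static const struct local_entry local_addrs[] = {
	{ TABLE_LOCAL, 0x0a00020fu },	/* 10.0.2.15 lives in the local table */
	{ 10u,         0x01010101u },	/* 1.1.1.1 lives in VRF table 10 */
};

#define N_LOCAL (sizeof(local_addrs) / sizeof(local_addrs[0]))

/* A table "exists" in this model if at least one entry refers to it. */
static int table_exists(uint32_t table)
{
	size_t i;

	for (i = 0; i < N_LOCAL; i++)
		if (local_addrs[i].table == table)
			return 1;
	return 0;
}

/* Classify addr by consulting exactly one table. */
static enum addr_type addr_type_table(uint32_t addr, uint32_t table)
{
	size_t i;

	if (!table_exists(table))
		return ADDR_BROADCAST;

	for (i = 0; i < N_LOCAL; i++)
		if (local_addrs[i].table == table && local_addrs[i].addr == addr)
			return ADDR_LOCAL;

	return ADDR_UNICAST;
}

/* vrf_table is 0 for a device that is not associated with a VRF. */
static enum addr_type dev_addr_type(uint32_t addr, uint32_t vrf_table)
{
	/* Same meaning as "vrf_table ? : TABLE_LOCAL" in GNU C. */
	uint32_t table = vrf_table ? vrf_table : TABLE_LOCAL;

	return addr_type_table(addr, table);
}
```

In this model 1.1.1.1 is classified as local only when the caller passes
table 10, which is the situation the commit message describes for
addresses on devices enslaved to a VRF.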
[PATCH 03/10] net: Use VRF device index for lookups on TX
As with ingress use the index of VRF master device for route lookups on
egress. However, the oif should only be used to direct the lookups to a
specific table. Routes in the table are not based on the VRF device but
rather interfaces that are part of the VRF so do not consider the oif for
lookups within the table. The FLOWI_FLAG_VRFSRC is used to control this
latter part.

Signed-off-by: Shrijeet Mukherjee s...@cumulusnetworks.com
Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 include/net/flow.h  | 1 +
 include/net/route.h | 3 +++
 net/ipv4/fib_trie.c | 7 +--
 net/ipv4/icmp.c     | 4
 net/ipv4/route.c    | 5 +
 5 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/net/flow.h b/include/net/flow.h
index 3098ae33a178..f305588fc162 100644
--- a/include/net/flow.h
+++ b/include/net/flow.h
@@ -33,6 +33,7 @@ struct flowi_common {
 	__u8	flowic_flags;
 #define FLOWI_FLAG_ANYSRC		0x01
 #define FLOWI_FLAG_KNOWN_NH		0x02
+#define FLOWI_FLAG_VRFSRC		0x04
 	__u32	flowic_secid;
 	struct flowi_tunnel flowic_tun_key;
 };
diff --git a/include/net/route.h b/include/net/route.h
index 2d45f419477f..94189d4bd899 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -251,6 +251,9 @@ static inline void ip_route_connect_init(struct flowi4 *fl4, __be32 dst, __be32
 	if (inet_sk(sk)->transparent)
 		flow_flags |= FLOWI_FLAG_ANYSRC;
 
+	if (netif_index_is_vrf(sock_net(sk), oif))
+		flow_flags |= FLOWI_FLAG_VRFSRC;
+
 	flowi4_init_output(fl4, oif, sk->sk_mark, tos, RT_SCOPE_UNIVERSE,
 			   protocol, flow_flags, dst, src, dport, sport);
 }
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 37c4bb89a708..1243c79cb5b0 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -1423,8 +1423,11 @@ int fib_table_lookup(struct fib_table *tb, const struct flowi4 *flp,
 			    nh->nh_flags & RTNH_F_LINKDOWN &&
 			    !(fib_flags & FIB_LOOKUP_IGNORE_LINKSTATE))
 				continue;
-			if (flp->flowi4_oif && flp->flowi4_oif != nh->nh_oif)
-				continue;
+			if (!(flp->flowi4_flags & FLOWI_FLAG_VRFSRC)) {
+				if (flp->flowi4_oif &&
+				    flp->flowi4_oif != nh->nh_oif)
+					continue;
+			}
 
 			if (!(fib_flags & FIB_LOOKUP_NOREF))
 				atomic_inc(&fi->fib_clntref);
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index c0556f1e4bf0..1164fc4ce3bc 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -96,6 +96,7 @@
 #include <net/xfrm.h>
 #include <net/inet_common.h>
 #include <net/ip_fib.h>
+#include <net/vrf.h>
 
 /*
  *	Build xmit assembly blocks
@@ -425,6 +426,7 @@ static void icmp_reply(struct icmp_bxm *icmp_param, struct sk_buff *skb)
 	fl4.flowi4_mark = mark;
 	fl4.flowi4_tos = RT_TOS(ip_hdr(skb)->tos);
 	fl4.flowi4_proto = IPPROTO_ICMP;
+	fl4.flowi4_oif = vrf_master_ifindex_rcu(skb->dev) ? : skb->dev->ifindex;
 	security_skb_classify_flow(skb, flowi4_to_flowi(&fl4));
 	rt = ip_route_output_key(net, &fl4);
 	if (IS_ERR(rt))
@@ -458,6 +460,8 @@ static struct rtable *icmp_route_lookup(struct net *net,
 	fl4->flowi4_proto = IPPROTO_ICMP;
 	fl4->fl4_icmp_type = type;
 	fl4->fl4_icmp_code = code;
+	fl4->flowi4_oif = vrf_master_ifindex_rcu(skb_in->dev) ? : skb_in->dev->ifindex;
+
 	security_skb_classify_flow(skb_in, flowi4_to_flowi(fl4));
 	rt = __ip_route_output_key(net, fl4);
 	if (IS_ERR(rt))
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index c26ff1f7067d..2c89d294b669 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2131,6 +2131,11 @@ struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4)
 				fl4->saddr = inet_select_addr(dev_out, 0,
 							      RT_SCOPE_HOST);
 		}
+		if (netif_is_vrf(dev_out) &&
+		    !(fl4->flowi4_flags & FLOWI_FLAG_VRFSRC)) {
+			rth = vrf_dev_get_rth(dev_out);
+			goto out;
+		}
 	}
 
 	if (!fl4->daddr) {
-- 
2.3.2 (Apple Git-55)
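The change to fib_table_lookup() boils down to one predicate: when the
flow carries FLOWI_FLAG_VRFSRC, the oif only selected the table and must
not be compared against the nexthop device. A minimal sketch of that
predicate, written as plain userspace C, is below. struct flow and
nexthop_oif_matches() are hypothetical names for illustration; only the
flag value 0x04 is taken from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define FLAG_VRFSRC 0x04	/* same bit as FLOWI_FLAG_VRFSRC in the patch */

/* Hypothetical, cut-down flow key: just the two fields the check needs. */
struct flow {
	int oif;	/* requested output ifindex, 0 if unset */
	uint8_t flags;
};

/*
 * Return true if a nexthop leaving through nh_oif is acceptable for fl.
 * With FLAG_VRFSRC set, oif names the VRF device and was only used to
 * pick the table, so nexthops on enslaved interfaces must still match.
 */
static bool nexthop_oif_matches(const struct flow *fl, int nh_oif)
{
	if (fl->flags & FLAG_VRFSRC)
		return true;

	return fl->oif == 0 || fl->oif == nh_oif;
}
```

Without the flag, a flow with oif set to the VRF device index would never
match a route whose nexthop is on eth1, since the two ifindex values
differ.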
[PATCH 10/10] net: Introduce VRF device driver
This driver borrows heavily from IPvlan and teaming drivers. Routing domains (VRF-lite) are created by instantiating a VRF master device with an associated table and enslaving all routed interfaces that participate in the domain. As part of the enslavement, all connected routes for the enslaved devices are moved to the table associated with the VRF device. Outgoing sockets must bind to the VRF device to function. Standard FIB rules bind the VRF device to tables and regular fib rule processing is followed. Routed traffic through the box, is forwarded by using the VRF device as the IIF and following the IIF rule to a table that is mated with the VRF. Example: Create vrf 1: ip link add vrf1 type vrf table 5 ip rule add iif vrf1 table 5 ip rule add oif vrf1 table 5 ip route add table 5 prohibit default ip link set vrf1 up Add interface to vrf 1: ip link set eth1 master vrf1 Signed-off-by: Shrijeet Mukherjee s...@cumulusnetworks.com Signed-off-by: David Ahern d...@cumulusnetworks.com --- drivers/net/Kconfig | 7 + drivers/net/Makefile | 1 + drivers/net/vrf.c| 715 +++ include/net/vrf.h| 1 - 4 files changed, 723 insertions(+), 1 deletion(-) create mode 100644 drivers/net/vrf.c diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig index c18f9e62a9fa..e58468b02987 100644 --- a/drivers/net/Kconfig +++ b/drivers/net/Kconfig @@ -297,6 +297,13 @@ config NLMON diagnostics, etc. This is mostly intended for developers or support to debug netlink issues. If unsure, say N. +config NET_VRF + tristate Virtual Routing and Forwarding (Lite) + depends on IP_MULTIPLE_TABLES IPV6_MULTIPLE_TABLES + ---help--- + This option enables the support for mapping interfaces into VRF's. The + support enables VRF devices. 
+ endif # NET_CORE config SUNGEM_PHY diff --git a/drivers/net/Makefile b/drivers/net/Makefile index c12cb22478a7..ca16dd689b36 100644 --- a/drivers/net/Makefile +++ b/drivers/net/Makefile @@ -25,6 +25,7 @@ obj-$(CONFIG_VIRTIO_NET) += virtio_net.o obj-$(CONFIG_VXLAN) += vxlan.o obj-$(CONFIG_GENEVE) += geneve.o obj-$(CONFIG_NLMON) += nlmon.o +obj-$(CONFIG_NET_VRF) += vrf.o # # Networking Drivers diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c new file mode 100644 index ..75c06ee2efa3 --- /dev/null +++ b/drivers/net/vrf.c @@ -0,0 +1,715 @@ +/* + * vrf.c: device driver to encapsulate a VRF space + * + * Copyright (c) 2015 Cumulus Networks. All rights reserved. + * Copyright (c) 2015 Shrijeet Mukherjee s...@cumulusnetworks.com + * Copyright (c) 2015 David Ahern d...@cumulusnetworks.com + * + * Based on dummy, team and ipvlan drivers + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. 
+ */ + +#include linux/module.h +#include linux/kernel.h +#include linux/netdevice.h +#include linux/etherdevice.h +#include linux/ip.h +#include linux/init.h +#include linux/moduleparam.h +#include linux/netfilter.h +#include linux/rtnetlink.h +#include net/rtnetlink.h +#include linux/u64_stats_sync.h +#include linux/hashtable.h + +#include linux/inetdevice.h +#include net/ip.h +#include net/ip_fib.h +#include net/ip6_route.h +#include net/rtnetlink.h +#include net/route.h +#include net/addrconf.h +#include net/vrf.h + +#define DRV_NAME vrf +#define DRV_VERSION1.0 + +#define vrf_is_slave(dev) ((dev)-flags IFF_SLAVE) + +#define vrf_master_get_rcu(dev) \ + ((struct net_device *)rcu_dereference(dev-rx_handler_data)) + +struct pcpu_dstats { + u64 tx_pkts; + u64 tx_bytes; + u64 tx_drps; + u64 rx_pkts; + u64 rx_bytes; + struct u64_stats_sync syncp; +}; + +static struct dst_entry *vrf_ip_check(struct dst_entry *dst, u32 cookie) +{ + return dst; +} + +static int vrf_ip_local_out(struct sk_buff *skb) +{ + return ip_local_out(skb); +} + +static unsigned int vrf_v4_mtu(const struct dst_entry *dst) +{ + /* TO-DO: return max ethernet size? */ + return dst-dev-mtu; +} + +static void vrf_dst_destroy(struct dst_entry *dst) +{ + /* our dst lives forever - or until the device is closed */ +} + +static unsigned int vrf_default_advmss(const struct dst_entry *dst) +{ + return 65535 - 40; +} + +static struct dst_ops vrf_dst_ops = { + .family = AF_INET, + .local_out = vrf_ip_local_out, + .check = vrf_ip_check, + .mtu= vrf_v4_mtu, + .destroy= vrf_dst_destroy, + .default_advmss = vrf_default_advmss, +}; + +static bool is_ip_rx_frame(struct sk_buff *skb) +{ + switch (skb-protocol) { + case htons
[PATCH 02/10] net: Use VRF device index for lookups on RX
On ingress use index of VRF master device for route lookups if real
device is enslaved. Rules are expected to be installed for the VRF
device to direct lookups to a specific table.

Signed-off-by: Shrijeet Mukherjee s...@cumulusnetworks.com
Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 net/ipv4/fib_frontend.c | 8 +++-
 net/ipv4/route.c        | 3 ++-
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 6b98de0d7949..d8ced1d89f1b 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -45,6 +45,7 @@
 #include <net/ip_fib.h>
 #include <net/rtnetlink.h>
 #include <net/xfrm.h>
+#include <net/vrf.h>
 
 #ifndef CONFIG_IP_MULTIPLE_TABLES
 
@@ -309,7 +310,9 @@ static int __fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst,
 	bool dev_match;
 
 	fl4.flowi4_oif = 0;
-	fl4.flowi4_iif = oif ? : LOOPBACK_IFINDEX;
+	fl4.flowi4_iif = vrf_master_ifindex_rcu(dev);
+	if (!fl4.flowi4_iif)
+		fl4.flowi4_iif = oif ? : LOOPBACK_IFINDEX;
 	fl4.daddr = src;
 	fl4.saddr = dst;
 	fl4.flowi4_tos = tos;
@@ -339,6 +342,9 @@ static int __fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst,
 		if (nh->nh_dev == dev) {
 			dev_match = true;
 			break;
+		} else if (vrf_master_ifindex_rcu(nh->nh_dev) == dev->ifindex) {
+			dev_match = true;
+			break;
 		}
 	}
 #else
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 18fd7c9095c7..c26ff1f7067d 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -112,6 +112,7 @@
 #endif
 #include <net/secure_seq.h>
 #include <net/ip_tunnels.h>
+#include <net/vrf.h>
 
 #define RT_FL_TOS(oldflp4) \
 	((oldflp4)->flowi4_tos & (IPTOS_RT_MASK | RTO_ONLINK))
@@ -1726,7 +1727,7 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 	 *	Now we are ready to route packet.
 	 */
 	fl4.flowi4_oif = 0;
-	fl4.flowi4_iif = dev->ifindex;
+	fl4.flowi4_iif = vrf_master_ifindex_rcu(dev) ? : dev->ifindex;
 	fl4.flowi4_mark = skb->mark;
 	fl4.flowi4_tos = tos;
 	fl4.flowi4_scope = RT_SCOPE_UNIVERSE;
-- 
2.3.2 (Apple Git-55)
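The iif selection on ingress can be read as a two-step rule: ask for the
index of the VRF master, and fall back to the receiving device's own
index when there is none. Below is a minimal userspace model of that
rule, assuming the helper semantics shown in include/net/vrf.h in patch
01. struct model_dev, master_ifindex() and input_iif() are hypothetical
names, and the ifindex values used with them are made up.

```c
#include <stddef.h>

/* Hypothetical, cut-down device: only what the iif choice needs. */
struct model_dev {
	int ifindex;
	int is_vrf_master;	/* models the IFF_VRF_MASTER flag */
	int master_ifindex;	/* models vrf_ptr->ifindex, 0 if not enslaved */
};

/*
 * Model of vrf_master_ifindex_rcu(): a VRF master reports itself, an
 * enslaved device reports its master, anything else reports 0.
 */
static int master_ifindex(const struct model_dev *dev)
{
	if (!dev)
		return 0;
	if (dev->is_vrf_master)
		return dev->ifindex;
	return dev->master_ifindex;
}

/*
 * Pick the iif for the input route lookup. dev must not be NULL.
 * Equivalent to "master_ifindex(dev) ? : dev->ifindex" in GNU C.
 */
static int input_iif(const struct model_dev *dev)
{
	int idx = master_ifindex(dev);

	return idx ? idx : dev->ifindex;
}
```

For an eth1 with ifindex 3 enslaved to a VRF device with ifindex 7,
input_iif() yields 7, so an "iif vrf1" rule is what steers the lookup to
the VRF table; a device outside any VRF keeps its own index.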
[PATCH 04/10] udp: Handle VRF device
For unconnected UDP sockets using a VRF device lookup source address
based on VRF table. This allows the UDP header to be properly setup
before showing up at the VRF device via the dst.

Signed-off-by: Shrijeet Mukherjee s...@cumulusnetworks.com
Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 net/ipv4/udp.c | 25 +++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 83aa604f9273..b513d72a21b3 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -884,7 +884,7 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	struct rtable *rt = NULL;
 	int free = 0;
 	int connected = 0;
-	__be32 daddr, faddr, saddr;
+	__be32 daddr, faddr, saddr, vsaddr = 0;
 	__be16 dport;
 	u8  tos;
 	int err, is_udplite = IS_UDPLITE(sk);
@@ -1013,11 +1013,30 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 
 	if (!rt) {
 		struct net *net = sock_net(sk);
+		__u8 flow_flags = inet_sk_flowi_flags(sk);
 
 		fl4 = &fl4_stack;
+
+		/* unconnected socket. If output device is enslaved to a VRF
+		 * device lookup source address from VRF table. This mimics
+		 * behavior of ip_route_connect{_init}.
+		 */
+		if (netif_index_is_vrf(net, ipc.oif)) {
+			flowi4_init_output(fl4, ipc.oif, sk->sk_mark, tos,
+					   RT_SCOPE_UNIVERSE, sk->sk_protocol,
+					   (flow_flags | FLOWI_FLAG_VRFSRC),
+					   faddr, saddr, dport, inet->inet_sport);
+
+			rt = ip_route_output_flow(net, fl4, sk);
+			if (!IS_ERR(rt)) {
+				vsaddr = fl4->saddr;
+				ip_rt_put(rt);
+			}
+		}
+
 		flowi4_init_output(fl4, ipc.oif, sk->sk_mark, tos,
 				   RT_SCOPE_UNIVERSE, sk->sk_protocol,
-				   inet_sk_flowi_flags(sk),
+				   flow_flags,
 				   faddr, saddr, dport, inet->inet_sport);
 
 		security_sk_classify_flow(sk, flowi4_to_flowi(fl4));
@@ -1042,6 +1061,8 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		goto do_confirm;
 back_from_confirm:
 
+	if (vsaddr)
+		fl4->saddr = vsaddr;
 	saddr = fl4->saddr;
 	if (!ipc.addr)
 		daddr = ipc.addr = fl4->daddr;
-- 
2.3.2 (Apple Git-55)
[PATCH 09/10] net: Use VRF device index for socket lookups
The intent of the VRF device is to leverage the existing SO_BINDTODEVICE as a means of creating L3 domains. Since sockets are expected to be bound to the VRF device the index of the master device needs to be used for socket lookups. Signed-off-by: Shrijeet Mukherjee s...@cumulusnetworks.com Signed-off-by: David Ahern d...@cumulusnetworks.com --- net/ipv4/syncookies.c | 5 - net/ipv4/tcp_input.c | 6 +- net/ipv4/tcp_ipv4.c | 11 +-- 3 files changed, 18 insertions(+), 4 deletions(-) diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c index d70b1f603692..e5c8b1240278 100644 --- a/net/ipv4/syncookies.c +++ b/net/ipv4/syncookies.c @@ -18,6 +18,7 @@ #include linux/export.h #include net/tcp.h #include net/route.h +#include net/vrf.h extern int sysctl_tcp_syncookies; @@ -348,7 +349,9 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb) treq-snt_synack= tcp_opt.saw_tstamp ? tcp_opt.rcv_tsecr : 0; treq-tfo_listener = false; - ireq-ir_iif = sk-sk_bound_dev_if; + ireq-ir_iif = vrf_master_ifindex_by_index(sock_net(sk), skb-skb_iif); + if (!ireq-ir_iif) + ireq-ir_iif = sk-sk_bound_dev_if; /* We throwed the options of the initial SYN away, so we hope * the ACK carries the same options again (see RFC1122 4.2.3.8) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 4e4d6bcd0ca9..6b96240a4055 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -72,6 +72,7 @@ #include net/dst.h #include net/tcp.h #include net/inet_common.h +#include net/vrf.h #include linux/ipsec.h #include asm/unaligned.h #include linux/errqueue.h @@ -6141,7 +6142,10 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops, tcp_openreq_init(req, tmp_opt, skb, sk); /* Note: tcp_v6_init_req() might override ir_iif for link locals */ - inet_rsk(req)-ir_iif = sk-sk_bound_dev_if; + inet_rsk(req)-ir_iif = vrf_master_ifindex_by_index(sock_net(sk), + skb-skb_iif); + if (!inet_rsk(req)-ir_iif) + inet_rsk(req)-ir_iif = sk-sk_bound_dev_if; af_ops-init_req(req, sk, skb); diff --git 
a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index d27eb549ced6..0f8ed98a2e64 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -75,6 +75,7 @@ #include net/secure_seq.h #include net/tcp_memcontrol.h #include net/busy_poll.h +#include net/vrf.h #include linux/inet.h #include linux/ipv6.h @@ -682,6 +683,8 @@ static void tcp_v4_send_reset(struct sock *sk, struct sk_buff *skb) */ if (sk) arg.bound_dev_if = sk-sk_bound_dev_if; + if (!arg.bound_dev_if skb-dev) + arg.bound_dev_if = vrf_master_ifindex_rcu(skb-dev); arg.tos = ip_hdr(skb)-tos; ip_send_unicast_reply(*this_cpu_ptr(net-ipv4.tcp_sk), @@ -766,8 +769,10 @@ static void tcp_v4_send_ack(struct sk_buff *skb, u32 seq, u32 ack, ip_hdr(skb)-saddr, /* XXX */ arg.iov[0].iov_len, IPPROTO_TCP, 0); arg.csumoffset = offsetof(struct tcphdr, check) / 2; - if (oif) - arg.bound_dev_if = oif; + arg.bound_dev_if = oif ? : vrf_master_ifindex_rcu(skb_dst(skb)-dev); + if (!arg.bound_dev_if) + arg.bound_dev_if = vrf_master_ifindex_rcu(skb-dev); + arg.tos = tos; ip_send_unicast_reply(*this_cpu_ptr(net-ipv4.tcp_sk), skb, TCP_SKB_CB(skb)-header.h4.opt, @@ -1269,6 +1274,8 @@ struct sock *tcp_v4_syn_recv_sock(struct sock *sk, struct sk_buff *skb, ireq = inet_rsk(req); sk_daddr_set(newsk, ireq-ir_rmt_addr); sk_rcv_saddr_set(newsk, ireq-ir_loc_addr); + if (netif_index_is_vrf(sock_net(newsk), ireq-ir_iif)) + newsk-sk_bound_dev_if = ireq-ir_iif; newinet-inet_saddr = ireq-ir_loc_addr; inet_opt = ireq-opt; rcu_assign_pointer(newinet-inet_opt, inet_opt); -- 2.3.2 (Apple Git-55) -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 07/10] net: Add routes to the table associated with the device
When a device associated with a VRF is brought up or down routes
should be added to/removed from the table associated with the VRF.
fib_magic defaults to using the main or local tables. Have it use the
table with the device if there is one.

A part of this is directing prefsrc validations to the correct table
as well.

Signed-off-by: David Ahern <d...@cumulusnetworks.com>
---
 net/ipv4/fib_frontend.c  |  8
 net/ipv4/fib_semantics.c | 25 +++--
 2 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index d84ae0e30369..0a50a08ab844 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -803,6 +803,7 @@ static int inet_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
 static void fib_magic(int cmd, int type, __be32 dst, int dst_len, struct in_ifaddr *ifa)
 {
 	struct net *net = dev_net(ifa->ifa_dev->dev);
+	int tb_id = vrf_dev_table_rtnl(ifa->ifa_dev->dev);
 	struct fib_table *tb;
 	struct fib_config cfg = {
 		.fc_protocol = RTPROT_KERNEL,
@@ -817,11 +818,10 @@ static void fib_magic(int cmd, int type, __be32 dst, int dst_len, struct in_ifad
 		},
 	};
 
-	if (type == RTN_UNICAST)
-		tb = fib_new_table(net, RT_TABLE_MAIN);
-	else
-		tb = fib_new_table(net, RT_TABLE_LOCAL);
+	if (!tb_id)
+		tb_id = (type == RTN_UNICAST) ? RT_TABLE_MAIN : RT_TABLE_LOCAL;
 
+	tb = fib_new_table(net, tb_id);
 	if (!tb)
 		return;
 
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 410ddb67221e..85e9a8abf15c 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -838,6 +838,23 @@ __be32 fib_info_update_nh_saddr(struct net *net, struct fib_nh *nh)
 	return nh->nh_saddr;
 }
 
+static bool fib_valid_prefsrc(struct fib_config *cfg, __be32 fib_prefsrc)
+{
+	if (cfg->fc_type != RTN_LOCAL || !cfg->fc_dst ||
+	    fib_prefsrc != cfg->fc_dst) {
+		int tb_id = cfg->fc_table;
+
+		if (tb_id == RT_TABLE_MAIN)
+			tb_id = RT_TABLE_LOCAL;
+
+		if (inet_addr_type_table(cfg->fc_nlinfo.nl_net,
+					 fib_prefsrc, tb_id) != RTN_LOCAL) {
+			return false;
+		}
+	}
+	return true;
+}
+
 struct fib_info *fib_create_info(struct fib_config *cfg)
 {
 	int err;
@@ -1033,12 +1050,8 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
 		fi->fib_flags |= RTNH_F_LINKDOWN;
 	}
 
-	if (fi->fib_prefsrc) {
-		if (cfg->fc_type != RTN_LOCAL || !cfg->fc_dst ||
-		    fi->fib_prefsrc != cfg->fc_dst)
-			if (inet_addr_type(net, fi->fib_prefsrc) != RTN_LOCAL)
-				goto err_inval;
-	}
+	if (fi->fib_prefsrc && !fib_valid_prefsrc(cfg, fi->fib_prefsrc))
+		goto err_inval;
 
 	change_nexthops(fi) {
 		fib_info_update_nh_saddr(net, nexthop_nh);
-- 
2.3.2 (Apple Git-55)
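The two table choices made by this patch can be summarised in a small userspace sketch. The toy_* names are hypothetical; the numeric values are assumed to mirror RT_TABLE_MAIN (254), RT_TABLE_LOCAL (255), RTN_UNICAST (1) and RTN_LOCAL (2) from the uapi headers.

```c
/* Assumed to mirror the uapi table ids and route types. */
enum { TOY_RT_TABLE_MAIN = 254, TOY_RT_TABLE_LOCAL = 255 };
enum { TOY_RTN_UNICAST = 1, TOY_RTN_LOCAL = 2 };

/* Table that receives the automatic routes for an address on a device,
 * following the fib_magic() change: the VRF table if the device has one,
 * otherwise main for unicast routes and local for everything else.
 */
static unsigned int toy_magic_table(unsigned int vrf_tb_id, int type)
{
	if (vrf_tb_id)
		return vrf_tb_id;
	return type == TOY_RTN_UNICAST ? TOY_RT_TABLE_MAIN : TOY_RT_TABLE_LOCAL;
}

/* Table searched when validating a preferred source address, following
 * fib_valid_prefsrc(): a route going into main is checked against local,
 * any other table is checked against itself.
 */
static unsigned int toy_prefsrc_table(unsigned int cfg_table)
{
	return cfg_table == TOY_RT_TABLE_MAIN ? TOY_RT_TABLE_LOCAL : cfg_table;
}
```

The second helper reflects why a VRF table works for prefsrc checks: with this series, local routes for enslaved devices live in the VRF table itself rather than in the local table.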
[PATCH 06/10] net: Fix up inet_addr_type checks
Currently inet_addr_type and inet_dev_addr_type expect local addresses to be in the local table. With the VRF device local routes for devices associated with a VRF will be in the table associated with the VRF. Provide an alternate inet_addr lookup to use a specific table rather than defaulting to the local table. inet_addr_type_dev_table keeps the same semantics as inet_addr_type but if the passed in device is enslaved to a VRF then the table for that VRF is used for the lookup. Signed-off-by: David Ahern d...@cumulusnetworks.com --- include/net/route.h | 3 +++ net/ipv4/af_inet.c | 13 - net/ipv4/arp.c | 15 +-- net/ipv4/fib_frontend.c | 28 +--- net/ipv4/fib_semantics.c | 6 -- net/ipv4/icmp.c | 5 +++-- 6 files changed, 56 insertions(+), 14 deletions(-) diff --git a/include/net/route.h b/include/net/route.h index 6ba681f0b98d..6dda2c1bf8c6 100644 --- a/include/net/route.h +++ b/include/net/route.h @@ -192,6 +192,9 @@ unsigned int inet_addr_type(struct net *net, __be32 addr); unsigned int inet_addr_type_table(struct net *net, __be32 addr, int tb_id); unsigned int inet_dev_addr_type(struct net *net, const struct net_device *dev, __be32 addr); +unsigned int inet_addr_type_dev_table(struct net *net, + const struct net_device *dev, + __be32 addr); void ip_rt_multicast_event(struct in_device *); int ip_rt_ioctl(struct net *, unsigned int cmd, void __user *arg); void ip_rt_get_source(u8 *src, struct sk_buff *skb, struct rtable *rt); diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c index cc4e498a0ccf..96fba4f63454 100644 --- a/net/ipv4/af_inet.c +++ b/net/ipv4/af_inet.c @@ -119,6 +119,7 @@ #ifdef CONFIG_IP_MROUTE #include linux/mroute.h #endif +#include net/vrf.h /* The inetsw table contains everything that inet_create needs to @@ -427,6 +428,7 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len) struct net *net = sock_net(sk); unsigned short snum; int chk_addr_ret; + int tb_id = 0; int err; /* If the socket has its own bind function then use it. 
(RAW) */ @@ -448,7 +450,16 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len) goto out; } - chk_addr_ret = inet_addr_type(net, addr-sin_addr.s_addr); + if (sk-sk_bound_dev_if) { + struct net_device *dev; + + rcu_read_lock(); + dev = dev_get_by_index_rcu(net, sk-sk_bound_dev_if); + if (dev) + tb_id = vrf_dev_table_rcu(dev); + rcu_read_unlock(); + } + chk_addr_ret = inet_addr_type_table(net, addr-sin_addr.s_addr, tb_id); /* Not specified by any standard per-se, however it breaks too * many applications when removed. It is unfortunate since diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c index 34a308573f4b..30409b75e925 100644 --- a/net/ipv4/arp.c +++ b/net/ipv4/arp.c @@ -233,7 +233,7 @@ static int arp_constructor(struct neighbour *neigh) return -EINVAL; } - neigh-type = inet_addr_type(dev_net(dev), addr); + neigh-type = inet_addr_type_dev_table(dev_net(dev), dev, addr); parms = in_dev-arp_parms; __neigh_parms_put(neigh-parms); @@ -343,7 +343,7 @@ static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb) switch (IN_DEV_ARP_ANNOUNCE(in_dev)) { default: case 0: /* By default announce any local IP */ - if (skb inet_addr_type(dev_net(dev), + if (skb inet_addr_type_dev_table(dev_net(dev), dev, ip_hdr(skb)-saddr) == RTN_LOCAL) saddr = ip_hdr(skb)-saddr; break; @@ -351,7 +351,8 @@ static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb) if (!skb) break; saddr = ip_hdr(skb)-saddr; - if (inet_addr_type(dev_net(dev), saddr) == RTN_LOCAL) { + if (inet_addr_type_dev_table(dev_net(dev), dev, +saddr) == RTN_LOCAL) { /* saddr should be known to target */ if (inet_addr_onlink(in_dev, target, saddr)) break; @@ -751,7 +752,7 @@ static int arp_process(struct sock *sk, struct sk_buff *skb) /* Special case: IPv4 duplicate address detection packet (RFC2131) */ if (sip == 0) { if (arp-ar_op == htons(ARPOP_REQUEST) - inet_addr_type(net, tip) == RTN_LOCAL + inet_addr_type_dev_table(net, dev, tip) == RTN_LOCAL !arp_ignore(in_dev, sip, tip)) 
arp_send(ARPOP_REPLY, ETH_P_ARP, sip, dev, tip
Re: [PATCH 09/10] net: Use VRF device index for socket lookups
Hi Tom:

On 8/5/15 12:32 PM, Tom Herbert wrote:
> On Wed, Aug 5, 2015 at 10:14 AM, David Ahern <d...@cumulusnetworks.com> wrote:
>> The intent of the VRF device is to leverage the existing SO_BINDTODEVICE
>> as a means of creating L3 domains. Since sockets are expected to be bound
>> to the VRF device the index of the master device needs to be used for
>> socket lookups.
>>
> This patch set seems awfully invasive at the socket layer. Isn't there
> anyway this functionality be contained in the routing layer and
> sockets use existing API?

This patch is a leftover from earlier versions. It is no longer needed.
Will drop for v5.

David
Re: [PATCH RFC net-next 1/3] RDS-TCP: Make RDS-TCP work correctly when it is set up in a netns other than init_net
On 7/30/15 2:55 AM, Sowmini Varadhan wrote:
> diff --git a/net/rds/connection.c b/net/rds/connection.c
> index da6da57..3bea7b9 100644
> --- a/net/rds/connection.c
> +++ b/net/rds/connection.c
> @@ -117,7 +117,8 @@ static void rds_conn_reset(struct rds_connection *conn)
>   * For now they are not garbage collected once they're created. They
>   * are torn down as the module is removed, if ever.
>   */
> -static struct rds_connection *__rds_conn_create(__be32 laddr, __be32 faddr,
> +static struct rds_connection *__rds_conn_create(struct net *net,
> +						__be32 laddr, __be32 faddr,
>  				       struct rds_transport *trans, gfp_t gfp,
>  				       int is_outgoing)
>  {
> @@ -157,6 +158,7 @@ static struct rds_connection *__rds_conn_create(__be32 laddr, __be32 faddr,
>  	conn->c_faddr = faddr;
>  	spin_lock_init(&conn->c_lock);
>  	conn->c_next_tx_seq = 1;
> +	write_pnet(&conn->c_net, net);

these are typically in wrappers like sock_net and sock_net_set

> diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
> index 0da2a45..c38d8a0 100644
> --- a/net/rds/ib_cm.c
> +++ b/net/rds/ib_cm.c
> @@ -448,8 +448,8 @@ int rds_ib_cm_handle_connect(struct rdma_cm_id *cm_id,
>  		 (unsigned long long)be64_to_cpu(lguid),
>  		 (unsigned long long)be64_to_cpu(fguid));
>
> -	conn = rds_conn_create(dp->dp_daddr, dp->dp_saddr, &rds_ib_transport,
> -			       GFP_KERNEL);
> +	conn = rds_conn_create(&init_net, dp->dp_daddr, dp->dp_saddr,
> +			       &rds_ib_transport, GFP_KERNEL);

I forget what connection this is -- control channel? you should at
least put a note as to why it is using init_net.

> diff --git a/net/rds/iw_cm.c b/net/rds/iw_cm.c
> index 8f486fa..4ea55a3 100644
> --- a/net/rds/iw_cm.c
> +++ b/net/rds/iw_cm.c
> @@ -398,8 +398,8 @@ int rds_iw_cm_handle_connect(struct rdma_cm_id *cm_id,
>  		 &dp->dp_saddr, &dp->dp_daddr,
>  		 RDS_PROTOCOL_MAJOR(version), RDS_PROTOCOL_MINOR(version));
>
> -	conn = rds_conn_create(dp->dp_daddr, dp->dp_saddr, &rds_iw_transport,
> -			       GFP_KERNEL);
> +	conn = rds_conn_create(&init_net, dp->dp_daddr, dp->dp_saddr,
> +			       &rds_iw_transport, GFP_KERNEL);

Ditto here.

David
Re: [PATCH net-next 6/9] net: Fix up inet_addr_type checks
On 8/11/15 12:14 PM, David Miller wrote:
> From: David Ahern <d...@cumulusnetworks.com>
> Date: Mon, 10 Aug 2015 11:50:33 -0600
>
>> @@ -427,6 +428,7 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
>>  	struct net *net = sock_net(sk);
>>  	unsigned short snum;
>>  	int chk_addr_ret;
>> +	int tb_id = 0;
>>  	int err;
>>
>>  	/* If the socket has its own bind function then use it. (RAW) */
>> @@ -448,7 +450,16 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
>>  		goto out;
>>  	}
>>
>> -	chk_addr_ret = inet_addr_type(net, addr->sin_addr.s_addr);
>> +	if (sk->sk_bound_dev_if) {
>> +		struct net_device *dev;
>> +
>> +		rcu_read_lock();
>> +		dev = dev_get_by_index_rcu(net, sk->sk_bound_dev_if);
>> +		if (dev)
>> +			tb_id = vrf_dev_table_rcu(dev);
>> +		rcu_read_unlock();
>> +	}
>> +	chk_addr_ret = inet_addr_type_table(net, addr->sin_addr.s_addr, tb_id);
>>
>>  	/* Not specified by any standard per-se, however it breaks too
>>  	 * many applications when removed. It is unfortunate since
>> ...
>> diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
>> index b11321a8e58d..d84ae0e30369 100644
>> --- a/net/ipv4/fib_frontend.c
>> +++ b/net/ipv4/fib_frontend.c
>> @@ -226,6 +226,9 @@ static inline unsigned int __inet_dev_addr_type(struct net *net,
>>
>>  	rcu_read_lock();
>>
>> +	if (!tb_id)
>> +		tb_id = RT_TABLE_LOCAL;
>> +
>>  	table = fib_get_table(net, tb_id);
>
> All of this code that quietly translates table ID zero into
> RT_TABLE_LOCAL is confusing.
>
> It would be so much easier to understand if the code was structured
> like:
>
> 	int tb_id = RT_TABLE_LOCAL;
>
> 	if (doing_vrf_stuff)
> 		tb_id = foo;

The intent here was to default to current behavior and to keep the
details of that in one place. If you prefer table id to always enter
with the right value I can make that happen.

David
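For readers following the exchange, the two styles under discussion can be put side by side in a userspace sketch. Both functions are hypothetical illustrations, not code from either patch revision; TOY_RT_TABLE_LOCAL is assumed to mirror RT_TABLE_LOCAL (255).

```c
#define TOY_RT_TABLE_LOCAL 255u

/* Style in the posted patch: callers may pass 0 as a "no table" sentinel
 * and the lookup helper quietly maps it to the local table.
 */
static unsigned int toy_resolve_sentinel(unsigned int tb_id)
{
	return tb_id ? tb_id : TOY_RT_TABLE_LOCAL;
}

/* Style suggested in review: the caller starts from the real default and
 * overrides it only when a VRF table applies, so the lookup helper never
 * sees 0.
 */
static unsigned int toy_pick_table(unsigned int vrf_tb_id)
{
	unsigned int tb_id = TOY_RT_TABLE_LOCAL;

	if (vrf_tb_id)
		tb_id = vrf_tb_id;
	return tb_id;
}
```

Both produce the same table id for the same input; the difference is only where the default is visible to someone reading the caller.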
[PATCH net-next 08/11] net: Use passed in table for nexthop lookups
If a user passes in a table for new routes use that table for nexthop
lookups. Specifically, this solves the case where a connected route does
not exist in the main table, but only another table and then a subsequent
route is added with a next hop using the connected route. ie.,

$ ip route ls
default via 10.0.2.2 dev eth0
10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.15
169.254.0.0/16 dev eth0 scope link metric 1003
192.168.56.0/24 dev eth1 proto kernel scope link src 192.168.56.51

$ ip route ls table 10
1.1.1.0/24 dev eth2 scope link

Without this patch adding a nexthop route fails:

$ ip route add table 10 2.2.2.0/24 via 1.1.1.10
RTNETLINK answers: Network is unreachable

With this patch the route is added successfully.

Signed-off-by: David Ahern <d...@cumulusnetworks.com>
---
 net/ipv4/fib_semantics.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 85e9a8abf15c..b7f1d20a9615 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -691,6 +691,7 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
 		}
 		rcu_read_lock();
 		{
+			struct fib_table *tbl = NULL;
 			struct flowi4 fl4 = {
 				.daddr = nh->nh_gw,
 				.flowi4_scope = cfg->fc_scope + 1,
@@ -701,8 +702,16 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
 			/* It is not necessary, but requires a bit of thinking */
 			if (fl4.flowi4_scope < RT_SCOPE_LINK)
 				fl4.flowi4_scope = RT_SCOPE_LINK;
-			err = fib_lookup(net, &fl4, &res,
-					 FIB_LOOKUP_IGNORE_LINKSTATE);
+
+			if (cfg->fc_table)
+				tbl = fib_get_table(net, cfg->fc_table);
+
+			if (tbl)
+				err = fib_table_lookup(tbl, &fl4, &res,
+						   FIB_LOOKUP_IGNORE_LINKSTATE);
+			else
+				err = fib_lookup(net, &fl4, &res,
+						 FIB_LOOKUP_IGNORE_LINKSTATE);
 			if (err) {
 				rcu_read_unlock();
 				return err;
-- 
2.3.2 (Apple Git-55)
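The failure and the fix described in the commit message can be reproduced with a toy model. Everything here is a hypothetical simplification: tables hold only connected (link-scope) prefixes in host byte order, and the fallback when no table is given is reduced to "search the main table" rather than the kernel's full policy-rule lookup.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical connected route: prefix and mask in host byte order. */
struct toy_route {
	uint32_t prefix;
	uint32_t mask;
};

/* Hypothetical routing table holding only connected routes. */
struct toy_table {
	unsigned int id;
	const struct toy_route *routes;
	size_t n;
};

static const struct toy_table *toy_get_table(const struct toy_table *tables,
					     size_t ntables, unsigned int id)
{
	for (size_t i = 0; i < ntables; i++)
		if (tables[i].id == id)
			return &tables[i];
	return NULL;
}

static int toy_table_covers(const struct toy_table *t, uint32_t addr)
{
	for (size_t i = 0; i < t->n; i++)
		if ((addr & t->routes[i].mask) == t->routes[i].prefix)
			return 1;
	return 0;
}

/* Returns 0 if the gateway is reachable via a connected route, -1 if not
 * (standing in for "Network is unreachable"). If the new route names a
 * table and that table exists it is searched; otherwise the main table is.
 */
static int toy_check_nh(const struct toy_table *tables, size_t ntables,
			unsigned int cfg_table, unsigned int main_id,
			uint32_t gw)
{
	const struct toy_table *tbl = NULL;

	if (cfg_table)
		tbl = toy_get_table(tables, ntables, cfg_table);
	if (!tbl)
		tbl = toy_get_table(tables, ntables, main_id);

	return (tbl && toy_table_covers(tbl, gw)) ? 0 : -1;
}
```

Loading the tables from the listing above (1.1.1.0/24 only in table 10) and checking gateway 1.1.1.10 succeeds when table 10 is passed and fails when no table is passed, matching the before and after behaviour reported in the commit message.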
[PATCH net-next 01/11] net: Introduce VRF related flags and helpers
Add a VRF_MASTER flag for interfaces and helper functions for determining if a device is a VRF_MASTER. Add link attribute for passing VRF_TABLE id. Add vrf_ptr to netdevice. Add various macros for determining if a device is a VRF device, the index of the master VRF device and table associated with VRF device. Signed-off-by: Shrijeet Mukherjee s...@cumulusnetworks.com Signed-off-by: David Ahern d...@cumulusnetworks.com --- include/linux/netdevice.h| 20 +++ include/net/vrf.h| 139 +++ include/uapi/linux/if_link.h | 9 +++ 3 files changed, 168 insertions(+) create mode 100644 include/net/vrf.h diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 607b5f41f46f..f7a6ef2fae3a 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1289,6 +1289,7 @@ enum netdev_priv_flags { IFF_XMIT_DST_RELEASE_PERM = 122, IFF_IPVLAN_MASTER = 123, IFF_IPVLAN_SLAVE= 124, + IFF_VRF_MASTER = 125, }; #define IFF_802_1Q_VLANIFF_802_1Q_VLAN @@ -1316,6 +1317,7 @@ enum netdev_priv_flags { #define IFF_XMIT_DST_RELEASE_PERM IFF_XMIT_DST_RELEASE_PERM #define IFF_IPVLAN_MASTER IFF_IPVLAN_MASTER #define IFF_IPVLAN_SLAVE IFF_IPVLAN_SLAVE +#define IFF_VRF_MASTER IFF_VRF_MASTER /** * struct net_device - The DEVICE structure. 
@@ -1432,6 +1434,7 @@ enum netdev_priv_flags { * @dn_ptr:DECnet specific data * @ip6_ptr: IPv6 specific data * @ax25_ptr: AX.25 specific data + * @vrf_ptr: VRF specific data * @ieee80211_ptr: IEEE 802.11 specific data, assign before registering * * @last_rx: Time of last Rx @@ -1650,6 +1653,7 @@ struct net_device { struct dn_dev __rcu *dn_ptr; struct inet6_dev __rcu *ip6_ptr; void*ax25_ptr; + struct net_vrf_dev __rcu *vrf_ptr; struct wireless_dev *ieee80211_ptr; struct wpan_dev *ieee802154_ptr; #if IS_ENABLED(CONFIG_MPLS_ROUTING) @@ -3808,6 +3812,22 @@ static inline bool netif_supports_nofcs(struct net_device *dev) return dev-priv_flags IFF_SUPP_NOFCS; } +static inline bool netif_is_vrf(const struct net_device *dev) +{ + return dev-priv_flags IFF_VRF_MASTER; +} + +static inline bool netif_index_is_vrf(struct net *net, int ifindex) +{ + struct net_device *dev = dev_get_by_index_rcu(net, ifindex); + bool rc = false; + + if (dev) + rc = netif_is_vrf(dev); + + return rc; +} + /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */ static inline void netif_keep_dst(struct net_device *dev) { diff --git a/include/net/vrf.h b/include/net/vrf.h new file mode 100644 index ..0484d29d4589 --- /dev/null +++ b/include/net/vrf.h @@ -0,0 +1,139 @@ +/* + * include/net/net_vrf.h - adds vrf dev structure definitions + * Copyright (c) 2015 Cumulus Networks + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. 
+ */ + +#ifndef __LINUX_NET_VRF_H +#define __LINUX_NET_VRF_H + +struct net_vrf_dev { + struct rcu_head rcu; + int ifindex; /* ifindex of master dev */ + u32 tb_id; /* table id for VRF */ +}; + +struct slave { + struct list_headlist; + struct net_device *dev; +}; + +struct slave_queue { + struct list_headall_slaves; + int num_slaves; +}; + +struct net_vrf { + struct slave_queue queue; + struct rtable *rth; + u32 tb_id; +}; + + +#if IS_ENABLED(CONFIG_NET_VRF) +/* called with rcu_read_lock() */ +static inline int vrf_master_ifindex_rcu(const struct net_device *dev) +{ + struct net_vrf_dev *vrf_ptr; + int ifindex = 0; + + if (!dev) + return 0; + + if (netif_is_vrf(dev)) + ifindex = dev-ifindex; + else { + vrf_ptr = rcu_dereference(dev-vrf_ptr); + if (vrf_ptr) + ifindex = vrf_ptr-ifindex; + } + + return ifindex; +} + +/* called with rcu_read_lock */ +static inline int vrf_dev_table_rcu(const struct net_device *dev) +{ + int tb_id = 0; + + if (dev) { + struct net_vrf_dev *vrf_ptr; + + vrf_ptr = rcu_dereference(dev-vrf_ptr); + if (vrf_ptr) + tb_id = vrf_ptr-tb_id; + } + return tb_id; +} + +static inline int vrf_dev_table(const struct net_device *dev) +{ + int tb_id; + + rcu_read_lock(); + tb_id = vrf_dev_table_rcu(dev
[PATCH net-next 05/11] net: Add inet_addr lookup by table
Currently inet_addr_type and inet_dev_addr_type expect local addresses
to be in the local table. With the VRF device local routes for devices
associated with a VRF will be in the table associated with the VRF.
Provide an alternate inet_addr lookup to use a specific table rather
than defaulting to the local table.

Signed-off-by: Shrijeet Mukherjee <s...@cumulusnetworks.com>
Signed-off-by: David Ahern <d...@cumulusnetworks.com>
---
 include/net/route.h     |  1 +
 net/ipv4/fib_frontend.c | 22 +++---
 2 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/include/net/route.h b/include/net/route.h
index 94189d4bd899..6ba681f0b98d 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -189,6 +189,7 @@ void ipv4_sk_redirect(struct sk_buff *skb, struct sock *sk);
 void ip_rt_send_redirect(struct sk_buff *skb);
 
 unsigned int inet_addr_type(struct net *net, __be32 addr);
+unsigned int inet_addr_type_table(struct net *net, __be32 addr, int tb_id);
 unsigned int inet_dev_addr_type(struct net *net, const struct net_device *dev,
 				__be32 addr);
 void ip_rt_multicast_event(struct in_device *);
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index d8ced1d89f1b..b11321a8e58d 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -212,12 +212,12 @@ void fib_flush_external(struct net *net)
  */
 static inline unsigned int __inet_dev_addr_type(struct net *net,
 						const struct net_device *dev,
-						__be32 addr)
+						__be32 addr, int tb_id)
 {
 	struct flowi4 fl4 = { .daddr = addr };
 	struct fib_result res;
 	unsigned int ret = RTN_BROADCAST;
-	struct fib_table *local_table;
+	struct fib_table *table;
 
 	if (ipv4_is_zeronet(addr) || ipv4_is_lbcast(addr))
 		return RTN_BROADCAST;
@@ -226,10 +226,10 @@ static inline unsigned int __inet_dev_addr_type(struct net *net,
 
 	rcu_read_lock();
 
-	local_table = fib_get_table(net, RT_TABLE_LOCAL);
-	if (local_table) {
+	table = fib_get_table(net, tb_id);
+	if (table) {
 		ret = RTN_UNICAST;
-		if (!fib_table_lookup(local_table, &fl4, &res, FIB_LOOKUP_NOREF)) {
+		if (!fib_table_lookup(table, &fl4, &res, FIB_LOOKUP_NOREF)) {
 			if (!dev || dev == res.fi->fib_dev)
 				ret = res.type;
 		}
@@ -239,16 +239,24 @@ static inline unsigned int __inet_dev_addr_type(struct net *net,
 	return ret;
 }
 
+unsigned int inet_addr_type_table(struct net *net, __be32 addr, int tb_id)
+{
+	return __inet_dev_addr_type(net, NULL, addr, tb_id);
+}
+EXPORT_SYMBOL(inet_addr_type_table);
+
 unsigned int inet_addr_type(struct net *net, __be32 addr)
 {
-	return __inet_dev_addr_type(net, NULL, addr);
+	return __inet_dev_addr_type(net, NULL, addr, RT_TABLE_LOCAL);
 }
 EXPORT_SYMBOL(inet_addr_type);
 
 unsigned int inet_dev_addr_type(struct net *net, const struct net_device *dev,
 				__be32 addr)
 {
-	return __inet_dev_addr_type(net, dev, addr);
+	int rt_table = vrf_dev_table(dev) ? : RT_TABLE_LOCAL;
+
+	return __inet_dev_addr_type(net, dev, addr, rt_table);
 }
 EXPORT_SYMBOL(inet_dev_addr_type);
 
-- 
2.3.2 (Apple Git-55)
[PATCH net-next 00/10] VRF-lite - v6
comments from DaveM
- added patch to properly set oif in ip_send_unicast_reply. Needs to be
  set to VRF device for proper FIB lookup
- added patch to handle IP fragments

Version 5
- dropped patch regarding socket lookups; no longer needed
  + removed vrf helpers no longer needed after this patch is dropped
- removed dev_open and close operations
  + no need to reset vrf data on an ifdown and creates problems if a
    slave is deleted while the vrf interface is down (Thanks, Nikolay)
- cleanups for sparse warnings
  + make C=2 is now clean for vrf driver

Version 4
- builds are clean with and without VRF device enabled (no, yes and module)
- tightened the driver implementation
  + device add/delete, slave add/remove, and module unload are all clean
- fixed RCU references
  + with RCU and lock debugging enabled changes are clean through the
    suite of tests
- TX path uses custom dst, so patch refactoring rtable allocation is
  dropped along with the patch adding rt_nexthop helper
- dropped the task patch that adds default bind to interface for sockets
  and the associated chvrf example command
  + the patches are a convenience for running unmodified code. They are
    not needed for the core functionality. Any application with support
    for SO_BINDTODEVICE works properly with this patch set.

Version 3
- addressed comments from first 2 RFCs with the exception of the name
  Nicolas: We will do the name conversion once we agree on what the
           correct name should be (vrf, mrf or something else)
- packets flow through the VRF device in both directions allowing the
  following:
  - tcpdump -i vrfn
  - tc rules on vrf device
  - netfilter rules on vrf device

TO-DO
=====
1. IPv6
2. ipsec, xfrms
   - dst patch accepted into ipsec-next; will post VRF patch once merge
     happens
3. listen filter to allow 1 socket to work with multiple VRF devices
   - i.e., bind to VRF's a, b, c only or NOT VRFs e, f, g

Eric B: I have ipsec working with VRFs implemented using the VRF driver,
        including the worst case scenario of complete duplication in the
        networking config.

Thanks to Nikolay for his many, many code reviews whipping the device
driver into shape, and bug-Fixes and ideas from Hannes, Roopa Prabhu,
Jon Toppins, Jamal.

Patches can also be pulled from:
    https://github.com/dsahern/linux.git, vrf-dev-v6 branch
    https://github.com/dsahern/iproute2, vrf-dev-v6 branch

David Ahern (11):
  net: Introduce VRF related flags and helpers
  net: Use VRF device index for lookups on RX
  net: Use VRF device index for lookups on TX
  udp: Handle VRF device in sendmsg
  net: Add inet_addr lookup by table
  net: Fix up inet_addr_type checks
  net: Add routes to the table associated with the device
  net: Use passed in table for nexthop lookups
  net: Use VRF index for oif in ip_send_unicast_reply
  net: frags: Add VRF device index to cache and lookup
  net: Introduce VRF device driver

 drivers/net/Kconfig          |   7 +
 drivers/net/Makefile         |   1 +
 drivers/net/vrf.c            | 685 +++
 include/linux/netdevice.h    |  20 ++
 include/net/flow.h           |   1 +
 include/net/route.h          |   7 +
 include/net/vrf.h            | 139 +
 include/uapi/linux/if_link.h |   9 +
 net/ipv4/af_inet.c           |  13 +-
 net/ipv4/arp.c               |  15 +-
 net/ipv4/fib_frontend.c      |  63 +++-
 net/ipv4/fib_semantics.c     |  44 ++-
 net/ipv4/fib_trie.c          |   7 +-
 net/ipv4/icmp.c              |   9 +-
 net/ipv4/ip_fragment.c       |  18 +-
 net/ipv4/ip_output.c         |   7 +-
 net/ipv4/route.c             |   8 +-
 net/ipv4/udp.c               |  22 +-
 18 files changed, 1031 insertions(+), 44 deletions(-)
 create mode 100644 drivers/net/vrf.c
 create mode 100644 include/net/vrf.h

-- 
2.3.2 (Apple Git-55)
[PATCH net-next 09/11] net: Use VRF index for oif in ip_send_unicast_reply
If output device is not specified use VRF device if input device is
enslaved. This is needed to ensure tcp acks and resets go out VRF device.

Signed-off-by: David Ahern <d...@cumulusnetworks.com>
---
 net/ipv4/ip_output.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 6bf89a6312bc..0138fada0951 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1542,6 +1542,7 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb,
 	struct net *net = sock_net(sk);
 	struct sk_buff *nskb;
 	int err;
+	int oif;
 
 	if (__ip_options_echo(&replyopts.opt.opt, skb, sopt))
 		return;
@@ -1559,7 +1560,11 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb,
 		daddr = replyopts.opt.opt.faddr;
 	}
 
-	flowi4_init_output(&fl4, arg->bound_dev_if,
+	oif = arg->bound_dev_if;
+	if (!oif && netif_index_is_vrf(net, skb->skb_iif))
+		oif = skb->skb_iif;
+
+	flowi4_init_output(&fl4, oif,
 			   IP4_REPLY_MARK(net, skb->mark),
 			   RT_TOS(arg->tos),
 			   RT_SCOPE_UNIVERSE, ip_hdr(skb)->protocol,
-- 
2.3.2 (Apple Git-55)
[PATCH net-next] iproute2: Add support for VRF device
Allow user to create a vrf device and specify its table binding.
Based on the iplink_vlan implementation.

Signed-off-by: Shrijeet Mukherjee <s...@cumulusnetworks.com>
Signed-off-by: David Ahern <d...@cumulusnetworks.com>
---
 include/linux/if_link.h |  8 +
 ip/Makefile             |  2 +-
 ip/iplink.c             |  2 +-
 ip/iplink_vrf.c         | 86 +
 4 files changed, 96 insertions(+), 2 deletions(-)
 create mode 100644 ip/iplink_vrf.c

diff --git a/include/linux/if_link.h b/include/linux/if_link.h
index 4f0a558e8fcf..c8b569a79e80 100644
--- a/include/linux/if_link.h
+++ b/include/linux/if_link.h
@@ -339,6 +339,14 @@ enum macvlan_macaddr_mode {

 #define MACVLAN_FLAG_NOPROMISC	1

+/* VRF section */
+enum {
+	IFLA_VRF_UNSPEC,
+	IFLA_VRF_TABLE,
+	__IFLA_VRF_MAX
+};
+
+#define IFLA_VRF_MAX (__IFLA_VRF_MAX - 1)
 /* IPVLAN section */
 enum {
 	IFLA_IPVLAN_UNSPEC,
diff --git a/ip/Makefile b/ip/Makefile
index 77653ecc5785..d8b38ac2e44b 100644
--- a/ip/Makefile
+++ b/ip/Makefile
@@ -7,7 +7,7 @@ IPOBJ=ip.o ipaddress.o ipaddrlabel.o iproute.o iprule.o ipnetns.o \
     iplink_vxlan.o tcp_metrics.o iplink_ipoib.o ipnetconf.o link_ip6tnl.o \
     link_iptnl.o link_gre6.o iplink_bond.o iplink_bond_slave.o iplink_hsr.o \
     iplink_bridge.o iplink_bridge_slave.o ipfou.o iplink_ipvlan.o \
-    iplink_geneve.o
+    iplink_geneve.o iplink_vrf.o

 RTMONOBJ=rtmon.o

diff --git a/ip/iplink.c b/ip/iplink.c
index edee0f7a3b07..e2183e89a7f6 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -94,7 +94,7 @@ void iplink_usage(void)
 		fprintf(stderr, "TYPE := { vlan | veth | vcan | dummy | ifb | macvlan | macvtap |\n");
 		fprintf(stderr, "          bridge | bond | ipoib | ip6tnl | ipip | sit | vxlan |\n");
 		fprintf(stderr, "          gre | gretap | ip6gre | ip6gretap | vti | nlmon |\n");
-		fprintf(stderr, "          bond_slave | ipvlan | geneve | bridge_slave }\n");
+		fprintf(stderr, "          bond_slave | ipvlan | geneve | bridge_slave | vrf }\n");
 	}
 	exit(-1);
 }
diff --git a/ip/iplink_vrf.c b/ip/iplink_vrf.c
new file mode 100644
index ..913a2892c95b
--- /dev/null
+++ b/ip/iplink_vrf.c
@@ -0,0 +1,86 @@
+/* iplink_vrf.c	VRF device support
+ *
+ *	This program is free software; you can redistribute it and/or
+ *	modify it under the terms of the GNU General Public License
+ *	as published by the Free Software Foundation; either version
+ *	2 of the License, or (at your option) any later version.
+ *
+ * Authors:	Shrijeet Mukherjee <s...@cumulusnetworks.com>
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/socket.h>
+#include <linux/if_link.h>
+
+#include "rt_names.h"
+#include "utils.h"
+#include "ip_common.h"
+
+static void vrf_explain(FILE *f)
+{
+	fprintf(f, "Usage: ... vrf table TABLEID \n");
+}
+
+static void explain(void)
+{
+	vrf_explain(stderr);
+}
+
+static int table_arg(void)
+{
+	fprintf(stderr,"Error: argument of \"table\" must be 0-32767 and currently unused\n");
+	return -1;
+}
+
+static int vrf_parse_opt(struct link_util *lu, int argc, char **argv,
+			 struct nlmsghdr *n)
+{
+	while (argc > 0) {
+		if (matches(*argv, "table") == 0) {
+			__u32 table;
+
+			NEXT_ARG();
+
+			table = atoi(*argv);
+			if (table > 32767)
+				return table_arg();
+			addattr32(n, 1024, IFLA_VRF_TABLE, table);
+		} else if (matches(*argv, "help") == 0) {
+			explain();
+			return -1;
+		} else {
+			fprintf(stderr, "vrf: unknown option \"%s\"?\n",
+				*argv);
+			explain();
+			return -1;
+		}
+		argc--, argv++;
+	}
+
+	return 0;
+}
+
+static void vrf_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
+{
+	if (!tb)
+		return;
+
+	if (tb[IFLA_VRF_TABLE])
+		fprintf(f, "table %u ", rta_getattr_u32(tb[IFLA_VRF_TABLE]));
+}
+
+static void vrf_print_help(struct link_util *lu, int argc, char **argv,
+			   FILE *f)
+{
+	vrf_explain(f);
+}
+
+struct link_util vrf_link_util = {
+	.id		= "vrf",
+	.maxattr	= IFLA_VRF_MAX,
+	.parse_opt	= vrf_parse_opt,
+	.print_opt	= vrf_print_opt,
+	.print_help	= vrf_print_help,
+};
--
2.3.2 (Apple Git-55)
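A side note on the table argument: vrf_parse_opt() converts it with atoi(), which returns 0 for non-numeric input, so a typo would quietly select table 0. The following is only a sketch of a stricter parser under stated assumptions. parse_vrf_table and VRF_TABLE_MAX are hypothetical names and are not part of iproute2 or of this patch; the 0-32767 bound is taken from the patch's error message.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Upper bound taken from the patch's error message (0-32767). */
#define VRF_TABLE_MAX 32767u

/* Hypothetical helper, not part of iproute2: parse a decimal table id.
 * Returns 0 and stores the id in *table on success, -1 on any error
 * (empty string, leading minus sign, trailing junk, out of range).
 */
int parse_vrf_table(const char *arg, uint32_t *table)
{
	char *end;
	unsigned long val;

	assert(table != NULL);	/* caller must supply an output slot */

	if (arg == NULL || *arg == '\0' || *arg == '-')
		return -1;

	errno = 0;
	val = strtoul(arg, &end, 10);
	/* *end != '\0' catches both "abc" and "5x" */
	if (errno != 0 || *end != '\0' || val > VRF_TABLE_MAX)
		return -1;

	*table = (uint32_t)val;
	return 0;
}
```

With a helper like this, "table abc" would be rejected instead of being treated as table 0.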
[PATCH net-next 10/11] net: frags: Add VRF device index to cache and lookup
Fragmentation cache uses information from the IP header to reassemble
packets. That information can be duplicated across VRFs -- same source
and destination addresses, protocol and id. Handle fragmentation with
VRFs by adding the VRF device index to entries in the cache and the
lookup arg.

Signed-off-by: David Ahern <d...@cumulusnetworks.com>
---
 net/ipv4/ip_fragment.c | 18 +-
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index d96722ae8979..15762e758861 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -48,6 +48,7 @@
 #include <linux/inet.h>
 #include <linux/netfilter_ipv4.h>
 #include <net/inet_ecn.h>
+#include <net/vrf.h>

 /* NOTE. Logic of IP defragmentation is parallel to corresponding IPv6
  * code now. If you change something here, _PLEASE_ update ipv6/reassembly.c
@@ -77,6 +78,7 @@ struct ipq {
 	u8		ecn; /* RFC3168 support */
 	u16		max_df_size; /* largest frag with DF set seen */
 	int		iif;
+	int		vif;   /* VRF device index */
 	unsigned int	rid;
 	struct inet_peer *peer;
 };
@@ -99,6 +101,7 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
 struct ip4_create_arg {
 	struct iphdr *iph;
 	u32 user;
+	int vif;
 };

 static unsigned int ipqhashfn(__be16 id, __be32 saddr, __be32 daddr, u8 prot)
@@ -127,7 +130,8 @@ static bool ip4_frag_match(const struct inet_frag_queue *q, const void *a)
 		qp->saddr == arg->iph->saddr &&
 		qp->daddr == arg->iph->daddr &&
 		qp->protocol == arg->iph->protocol &&
-		qp->user == arg->user;
+		qp->user == arg->user &&
+		qp->vif == arg->vif;
 }

 static void ip4_frag_init(struct inet_frag_queue *q, const void *a)
@@ -144,6 +148,7 @@ static void ip4_frag_init(struct inet_frag_queue *q, const void *a)
 	qp->ecn = ip4_frag_ecn(arg->iph->tos);
 	qp->saddr = arg->iph->saddr;
 	qp->daddr = arg->iph->daddr;
+	qp->vif = arg->vif;
 	qp->user = arg->user;
 	qp->peer = sysctl_ipfrag_max_dist ?
 		inet_getpeer_v4(net->ipv4.peers, arg->iph->saddr, 1) : NULL;
@@ -244,7 +249,8 @@ static void ip_expire(unsigned long arg)
 /* Find the correct entry in the "incomplete datagrams" queue for
  * this IP datagram, and create new one, if nothing is found.
  */
-static struct ipq *ip_find(struct net *net, struct iphdr *iph, u32 user)
+static struct ipq *ip_find(struct net *net, struct iphdr *iph,
+			   u32 user, int vif)
 {
 	struct inet_frag_queue *q;
 	struct ip4_create_arg arg;
@@ -252,6 +258,7 @@ static struct ipq *ip_find(struct net *net, struct iphdr *iph, u32 user)

 	arg.iph = iph;
 	arg.user = user;
+	arg.vif = vif;

 	hash = ipqhashfn(iph->id, iph->saddr, iph->daddr, iph->protocol);

@@ -648,14 +655,15 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
 /* Process an incoming IP datagram fragment. */
 int ip_defrag(struct sk_buff *skb, u32 user)
 {
+	struct net_device *dev = skb->dev ? : skb_dst(skb)->dev;
+	int vif = vrf_master_ifindex_rcu(dev);
+	struct net *net = dev_net(dev);
 	struct ipq *qp;
-	struct net *net;

-	net = skb->dev ? dev_net(skb->dev) : dev_net(skb_dst(skb)->dev);
 	IP_INC_STATS_BH(net, IPSTATS_MIB_REASMREQDS);

 	/* Lookup (or create) queue header */
-	qp = ip_find(net, ip_hdr(skb), user);
+	qp = ip_find(net, ip_hdr(skb), user, vif);
 	if (qp) {
 		int ret;

--
2.3.2 (Apple Git-55)
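To illustrate why the extra key matters, here is a small userspace sketch of the match logic. It is not kernel code: struct frag_key and frag_key_match are hypothetical stand-ins for struct ipq / struct ip4_create_arg and ip4_frag_match(), reduced to the fields named in the commit message plus the new VRF index.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the reassembly key (hypothetical type). */
struct frag_key {
	uint32_t saddr;		/* source address */
	uint32_t daddr;		/* destination address */
	uint16_t id;		/* IP identification field */
	uint8_t  protocol;	/* L4 protocol number */
	int      vif;		/* VRF device index, 0 if not in a VRF */
};

/* Two fragments belong to the same datagram only if every header
 * field matches and they arrived in the same VRF.
 */
bool frag_key_match(const struct frag_key *q, const struct frag_key *arg)
{
	assert(q != NULL && arg != NULL);

	return q->id == arg->id &&
	       q->saddr == arg->saddr &&
	       q->daddr == arg->daddr &&
	       q->protocol == arg->protocol &&
	       q->vif == arg->vif;
}
```

Two fragments with identical addresses, protocol and id but different vif values no longer share a queue, which is the duplication case the commit message describes.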
[PATCH net-next 02/11] net: Use VRF device index for lookups on RX
On ingress use index of VRF master device for route lookups if real
device is enslaved. Rules are expected to be installed for the VRF
device to direct lookups to a specific table.

Signed-off-by: Shrijeet Mukherjee <s...@cumulusnetworks.com>
Signed-off-by: David Ahern <d...@cumulusnetworks.com>
---
 net/ipv4/fib_frontend.c | 8 +++-
 net/ipv4/route.c        | 3 ++-
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 6b98de0d7949..d8ced1d89f1b 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -45,6 +45,7 @@
 #include <net/ip_fib.h>
 #include <net/rtnetlink.h>
 #include <net/xfrm.h>
+#include <net/vrf.h>

 #ifndef CONFIG_IP_MULTIPLE_TABLES

@@ -309,7 +310,9 @@ static int __fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst,
 	bool dev_match;

 	fl4.flowi4_oif = 0;
-	fl4.flowi4_iif = oif ? : LOOPBACK_IFINDEX;
+	fl4.flowi4_iif = vrf_master_ifindex_rcu(dev);
+	if (!fl4.flowi4_iif)
+		fl4.flowi4_iif = oif ? : LOOPBACK_IFINDEX;
 	fl4.daddr = src;
 	fl4.saddr = dst;
 	fl4.flowi4_tos = tos;
@@ -339,6 +342,9 @@ static int __fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst,
 			if (nh->nh_dev == dev) {
 				dev_match = true;
 				break;
+			} else if (vrf_master_ifindex_rcu(nh->nh_dev) == dev->ifindex) {
+				dev_match = true;
+				break;
 			}
 		}
 #else
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 18fd7c9095c7..c26ff1f7067d 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -112,6 +112,7 @@
 #endif
 #include <net/secure_seq.h>
 #include <net/ip_tunnels.h>
+#include <net/vrf.h>

 #define RT_FL_TOS(oldflp4) \
 	((oldflp4)->flowi4_tos & (IPTOS_RT_MASK | RTO_ONLINK))
@@ -1726,7 +1727,7 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 	 *	Now we are ready to route packet.
 	 */
 	fl4.flowi4_oif = 0;
-	fl4.flowi4_iif = vrf_master_ifindex_rcu(dev) ? : dev->ifindex;
+	fl4.flowi4_iif = vrf_master_ifindex_rcu(dev) ? : dev->ifindex;
 	fl4.flowi4_mark = skb->mark;
 	fl4.flowi4_tos = tos;
 	fl4.flowi4_scope = RT_SCOPE_UNIVERSE;
--
2.3.2 (Apple Git-55)
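For readers unfamiliar with the GNU "a ? : b" shorthand used in this patch, the selection of the input interface can be sketched in portable C as follows. struct dev_sketch and lookup_iif are hypothetical names for illustration only; the real code uses struct net_device and vrf_master_ifindex_rcu(), and this sketch assumes, as the patch's fallback implies, that the helper yields 0 when the device is not enslaved.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a net_device: only what the sketch needs. */
struct dev_sketch {
	int ifindex;		/* index of this device */
	int master_ifindex;	/* index of the VRF master, 0 if not enslaved */
};

/* Portable spelling of "vrf_master_ifindex_rcu(dev) ? : dev->ifindex":
 * use the VRF master's index when there is one, else the device's own.
 */
int lookup_iif(const struct dev_sketch *dev)
{
	assert(dev != NULL);

	return dev->master_ifindex ? dev->master_ifindex : dev->ifindex;
}
```

An "ip rule add iif vrf1 table 5" rule then matches packets received on any interface enslaved to vrf1, because the lookup is keyed on the master's index.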
[PATCH net-next 04/11] udp: Handle VRF device in sendmsg
For unconnected UDP sockets using a VRF device lookup source address
based on VRF table. This allows the UDP header to be properly setup
before showing up at the VRF device via the dst.

Signed-off-by: Shrijeet Mukherjee <s...@cumulusnetworks.com>
Signed-off-by: David Ahern <d...@cumulusnetworks.com>
---
 net/ipv4/udp.c | 22 +-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 83aa604f9273..7af5052e3b1f 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1013,11 +1013,31 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)

 	if (!rt) {
 		struct net *net = sock_net(sk);
+		__u8 flow_flags = inet_sk_flowi_flags(sk);

 		fl4 = &fl4_stack;
+
+		/* unconnected socket. If output device is enslaved to a VRF
+		 * device lookup source address from VRF table. This mimics
+		 * behavior of ip_route_connect{_init}.
+		 */
+		if (netif_index_is_vrf(net, ipc.oif)) {
+			flowi4_init_output(fl4, ipc.oif, sk->sk_mark, tos,
+					   RT_SCOPE_UNIVERSE, sk->sk_protocol,
+					   (flow_flags | FLOWI_FLAG_VRFSRC),
+					   faddr, saddr, dport,
+					   inet->inet_sport);
+
+			rt = ip_route_output_flow(net, fl4, sk);
+			if (!IS_ERR(rt)) {
+				saddr = fl4->saddr;
+				ip_rt_put(rt);
+			}
+		}
+
 		flowi4_init_output(fl4, ipc.oif, sk->sk_mark, tos,
 				   RT_SCOPE_UNIVERSE, sk->sk_protocol,
-				   inet_sk_flowi_flags(sk),
+				   flow_flags,
 				   faddr, saddr, dport, inet->inet_sport);

 		security_sk_classify_flow(sk, flowi4_to_flowi(fl4));
--
2.3.2 (Apple Git-55)
[PATCH net-next 03/11] net: Use VRF device index for lookups on TX
As with ingress use the index of VRF master device for route lookups
on egress. However, the oif should only be used to direct the lookups
to a specific table. Routes in the table are not based on the VRF
device but rather interfaces that are part of the VRF so do not
consider the oif for lookups within the table. The FLOWI_FLAG_VRFSRC
is used to control this latter part.

Signed-off-by: Shrijeet Mukherjee <s...@cumulusnetworks.com>
Signed-off-by: David Ahern <d...@cumulusnetworks.com>
---
 include/net/flow.h  | 1 +
 include/net/route.h | 3 +++
 net/ipv4/fib_trie.c | 7 +--
 net/ipv4/icmp.c     | 4
 net/ipv4/route.c    | 5 +
 5 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/net/flow.h b/include/net/flow.h
index 3098ae33a178..f305588fc162 100644
--- a/include/net/flow.h
+++ b/include/net/flow.h
@@ -33,6 +33,7 @@ struct flowi_common {
 	__u8	flowic_flags;
 #define FLOWI_FLAG_ANYSRC		0x01
 #define FLOWI_FLAG_KNOWN_NH		0x02
+#define FLOWI_FLAG_VRFSRC		0x04
 	__u32	flowic_secid;
 	struct flowi_tunnel flowic_tun_key;
 };
diff --git a/include/net/route.h b/include/net/route.h
index 2d45f419477f..94189d4bd899 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -251,6 +251,9 @@ static inline void ip_route_connect_init(struct flowi4 *fl4, __be32 dst, __be32
 	if (inet_sk(sk)->transparent)
 		flow_flags |= FLOWI_FLAG_ANYSRC;

+	if (netif_index_is_vrf(sock_net(sk), oif))
+		flow_flags |= FLOWI_FLAG_VRFSRC;
+
 	flowi4_init_output(fl4, oif, sk->sk_mark, tos, RT_SCOPE_UNIVERSE,
 			   protocol, flow_flags, dst, src, dport, sport);
 }
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 37c4bb89a708..1243c79cb5b0 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -1423,8 +1423,11 @@ int fib_table_lookup(struct fib_table *tb, const struct flowi4 *flp,
 			    nh->nh_flags & RTNH_F_LINKDOWN &&
 			    !(fib_flags & FIB_LOOKUP_IGNORE_LINKSTATE))
 				continue;
-			if (flp->flowi4_oif && flp->flowi4_oif != nh->nh_oif)
-				continue;
+			if (!(flp->flowi4_flags & FLOWI_FLAG_VRFSRC)) {
+				if (flp->flowi4_oif &&
+				    flp->flowi4_oif != nh->nh_oif)
+					continue;
+			}

 			if (!(fib_flags & FIB_LOOKUP_NOREF))
 				atomic_inc(&fi->fib_clntref);
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index c0556f1e4bf0..1164fc4ce3bc 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -96,6 +96,7 @@
 #include <net/xfrm.h>
 #include <net/inet_common.h>
 #include <net/ip_fib.h>
+#include <net/vrf.h>

 /*
  *	Build xmit assembly blocks
@@ -425,6 +426,7 @@ static void icmp_reply(struct icmp_bxm *icmp_param, struct sk_buff *skb)
 	fl4.flowi4_mark = mark;
 	fl4.flowi4_tos = RT_TOS(ip_hdr(skb)->tos);
 	fl4.flowi4_proto = IPPROTO_ICMP;
+	fl4.flowi4_oif = vrf_master_ifindex_rcu(skb->dev) ? : skb->dev->ifindex;
 	security_skb_classify_flow(skb, flowi4_to_flowi(&fl4));
 	rt = ip_route_output_key(net, &fl4);
 	if (IS_ERR(rt))
@@ -458,6 +460,8 @@ static struct rtable *icmp_route_lookup(struct net *net,
 	fl4->flowi4_proto = IPPROTO_ICMP;
 	fl4->fl4_icmp_type = type;
 	fl4->fl4_icmp_code = code;
+	fl4->flowi4_oif = vrf_master_ifindex_rcu(skb_in->dev) ? : skb_in->dev->ifindex;
+
 	security_skb_classify_flow(skb_in, flowi4_to_flowi(fl4));
 	rt = __ip_route_output_key(net, fl4);
 	if (IS_ERR(rt))
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index c26ff1f7067d..2c89d294b669 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2131,6 +2131,11 @@ struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4)
 				fl4->saddr = inet_select_addr(dev_out, 0,
 							      RT_SCOPE_HOST);
 		}
+		if (netif_is_vrf(dev_out) &&
+		    !(fl4->flowi4_flags & FLOWI_FLAG_VRFSRC)) {
+			rth = vrf_dev_get_rth(dev_out);
+			goto out;
+		}
 	}

 	if (!fl4->daddr) {
--
2.3.2 (Apple Git-55)
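The effect of FLOWI_FLAG_VRFSRC on the nexthop filter in fib_table_lookup() can be sketched in isolation. This is a userspace illustration only: nexthop_passes_oif is a hypothetical function, and the flag value 0x04 is copied from the flow.h hunk of this patch.

```c
#include <assert.h>
#include <stdbool.h>

#define FLOWI_FLAG_VRFSRC 0x04	/* value taken from the patch */

/* Returns true if a nexthop with output interface nh_oif may be used
 * for a flow that requested flow_oif.
 *
 * Without the flag the old rule applies: a non-zero flow_oif must equal
 * the nexthop's oif. With the flag set, the oif was only used to pick
 * the table (it names the VRF device, which no route points at), so the
 * comparison is skipped.
 */
bool nexthop_passes_oif(unsigned int flow_flags, int flow_oif, int nh_oif)
{
	assert(flow_oif >= 0 && nh_oif >= 0);	/* ifindexes are never negative */

	if (flow_flags & FLOWI_FLAG_VRFSRC)
		return true;

	return flow_oif == 0 || flow_oif == nh_oif;
}
```

For example, a flow with oif set to the VRF device (say index 10) and a route through an enslaved interface (index 7) is rejected without the flag and accepted with it.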
[PATCH net-next 07/11] net: Add routes to the table associated with the device
When a device associated with a VRF is brought up or down routes
should be added to/removed from the table associated with the VRF.
fib_magic defaults to using the main or local tables. Have it use the
table with the device if there is one.

A part of this is directing prefsrc validations to the correct table
as well.

Signed-off-by: David Ahern <d...@cumulusnetworks.com>
---
 net/ipv4/fib_frontend.c  | 8
 net/ipv4/fib_semantics.c | 25 +++--
 2 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index c55723ec4c3e..7fa277176c33 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -800,6 +800,7 @@ static int inet_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
 static void fib_magic(int cmd, int type, __be32 dst, int dst_len, struct in_ifaddr *ifa)
 {
 	struct net *net = dev_net(ifa->ifa_dev->dev);
+	int tb_id = vrf_dev_table_rtnl(ifa->ifa_dev->dev);
 	struct fib_table *tb;
 	struct fib_config cfg = {
 		.fc_protocol = RTPROT_KERNEL,
@@ -814,11 +815,10 @@ static void fib_magic(int cmd, int type, __be32 dst, int dst_len, struct in_ifad
 		},
 	};

-	if (type == RTN_UNICAST)
-		tb = fib_new_table(net, RT_TABLE_MAIN);
-	else
-		tb = fib_new_table(net, RT_TABLE_LOCAL);
+	if (!tb_id)
+		tb_id = (type == RTN_UNICAST) ? RT_TABLE_MAIN : RT_TABLE_LOCAL;

+	tb = fib_new_table(net, tb_id);
 	if (!tb)
 		return;

diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 410ddb67221e..85e9a8abf15c 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -838,6 +838,23 @@ __be32 fib_info_update_nh_saddr(struct net *net, struct fib_nh *nh)
 	return nh->nh_saddr;
 }

+static bool fib_valid_prefsrc(struct fib_config *cfg, __be32 fib_prefsrc)
+{
+	if (cfg->fc_type != RTN_LOCAL || !cfg->fc_dst ||
+	    fib_prefsrc != cfg->fc_dst) {
+		int tb_id = cfg->fc_table;
+
+		if (tb_id == RT_TABLE_MAIN)
+			tb_id = RT_TABLE_LOCAL;
+
+		if (inet_addr_type_table(cfg->fc_nlinfo.nl_net,
+					 fib_prefsrc, tb_id) != RTN_LOCAL) {
+			return false;
+		}
+	}
+	return true;
+}
+
 struct fib_info *fib_create_info(struct fib_config *cfg)
 {
 	int err;
@@ -1033,12 +1050,8 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
 			fi->fib_flags |= RTNH_F_LINKDOWN;
 	}

-	if (fi->fib_prefsrc) {
-		if (cfg->fc_type != RTN_LOCAL || !cfg->fc_dst ||
-		    fi->fib_prefsrc != cfg->fc_dst)
-			if (inet_addr_type(net, fi->fib_prefsrc) != RTN_LOCAL)
-				goto err_inval;
-	}
+	if (fi->fib_prefsrc && !fib_valid_prefsrc(cfg, fi->fib_prefsrc))
+		goto err_inval;

 	change_nexthops(fi) {
 		fib_info_update_nh_saddr(net, nexthop_nh);
--
2.3.2 (Apple Git-55)
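The two table-selection rules in this patch can be summarized in a short userspace sketch. The SKETCH_ constants mirror the kernel's RT_TABLE_MAIN (254), RT_TABLE_LOCAL (255) and RTN_UNICAST (1) values but are redefined locally so the example builds without kernel headers; fib_magic_table and prefsrc_lookup_table are hypothetical names, not functions from the patch.

```c
#include <assert.h>

/* Local copies of the uapi values, renamed to avoid implying kernel headers. */
enum {
	SKETCH_RT_TABLE_MAIN  = 254,
	SKETCH_RT_TABLE_LOCAL = 255,
	SKETCH_RTN_UNICAST    = 1,
	SKETCH_RTN_LOCAL      = 2,
};

/* Table used by fib_magic(): the VRF's table when the device is in a VRF
 * (vrf_table != 0), otherwise main for unicast routes and local for the rest.
 */
int fib_magic_table(int vrf_table, int type)
{
	assert(vrf_table >= 0);

	if (vrf_table)
		return vrf_table;

	return type == SKETCH_RTN_UNICAST ? SKETCH_RT_TABLE_MAIN
					  : SKETCH_RT_TABLE_LOCAL;
}

/* Table consulted when validating a preferred source address: routes going
 * into main are checked against local, anything else against its own table.
 */
int prefsrc_lookup_table(int cfg_table)
{
	return cfg_table == SKETCH_RT_TABLE_MAIN ? SKETCH_RT_TABLE_LOCAL
						 : cfg_table;
}
```

With vrf1 bound to table 5, both the connected routes and the local routes of an enslaved interface land in table 5, and prefsrc checks for routes in table 5 look there as well.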
[PATCH net-next 06/11] net: Fix up inet_addr_type checks
Currently inet_addr_type and inet_dev_addr_type expect local addresses
to be in the local table. With the VRF device local routes for devices
associated with a VRF will be in the table associated with the VRF.
Provide an alternate inet_addr lookup to use a specific table rather
than defaulting to the local table.

inet_addr_type_dev_table keeps the same semantics as inet_addr_type but
if the passed in device is enslaved to a VRF then the table for that VRF
is used for the lookup.

Signed-off-by: David Ahern <d...@cumulusnetworks.com>
---
 include/net/route.h      | 3 +++
 net/ipv4/af_inet.c       | 13 -
 net/ipv4/arp.c           | 15 +--
 net/ipv4/fib_frontend.c  | 25 ++---
 net/ipv4/fib_semantics.c | 6 --
 net/ipv4/icmp.c          | 5 +++--
 6 files changed, 53 insertions(+), 14 deletions(-)

diff --git a/include/net/route.h b/include/net/route.h
index 6ba681f0b98d..6dda2c1bf8c6 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -192,6 +192,9 @@ unsigned int inet_addr_type(struct net *net, __be32 addr);
 unsigned int inet_addr_type_table(struct net *net, __be32 addr, int tb_id);
 unsigned int inet_dev_addr_type(struct net *net, const struct net_device *dev,
 				__be32 addr);
+unsigned int inet_addr_type_dev_table(struct net *net,
+				      const struct net_device *dev,
+				      __be32 addr);
 void ip_rt_multicast_event(struct in_device *);
 int ip_rt_ioctl(struct net *, unsigned int cmd, void __user *arg);
 void ip_rt_get_source(u8 *src, struct sk_buff *skb, struct rtable *rt);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index cc4e498a0ccf..c8b855882fa5 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -119,6 +119,7 @@
 #ifdef CONFIG_IP_MROUTE
 #include <linux/mroute.h>
 #endif
+#include <net/vrf.h>


 /* The inetsw table contains everything that inet_create needs to
@@ -427,6 +428,7 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 	struct net *net = sock_net(sk);
 	unsigned short snum;
 	int chk_addr_ret;
+	int tb_id = RT_TABLE_LOCAL;
 	int err;

 	/* If the socket has its own bind function then use it. (RAW) */
@@ -448,7 +450,16 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 			goto out;
 	}

-	chk_addr_ret = inet_addr_type(net, addr->sin_addr.s_addr);
+	if (sk->sk_bound_dev_if) {
+		struct net_device *dev;
+
+		rcu_read_lock();
+		dev = dev_get_by_index_rcu(net, sk->sk_bound_dev_if);
+		if (dev)
+			tb_id = vrf_dev_table_rcu(dev) ? : tb_id;
+		rcu_read_unlock();
+	}
+	chk_addr_ret = inet_addr_type_table(net, addr->sin_addr.s_addr, tb_id);

 	/* Not specified by any standard per-se, however it breaks too
 	 * many applications when removed. It is unfortunate since
diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index 34a308573f4b..30409b75e925 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -233,7 +233,7 @@ static int arp_constructor(struct neighbour *neigh)
 		return -EINVAL;
 	}

-	neigh->type = inet_addr_type(dev_net(dev), addr);
+	neigh->type = inet_addr_type_dev_table(dev_net(dev), dev, addr);

 	parms = in_dev->arp_parms;
 	__neigh_parms_put(neigh->parms);
@@ -343,7 +343,7 @@ static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb)
 	switch (IN_DEV_ARP_ANNOUNCE(in_dev)) {
 	default:
 	case 0:		/* By default announce any local IP */
-		if (skb && inet_addr_type(dev_net(dev),
+		if (skb && inet_addr_type_dev_table(dev_net(dev), dev,
 					  ip_hdr(skb)->saddr) == RTN_LOCAL)
 			saddr = ip_hdr(skb)->saddr;
 		break;
@@ -351,7 +351,8 @@ static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb)
 		if (!skb)
 			break;
 		saddr = ip_hdr(skb)->saddr;
-		if (inet_addr_type(dev_net(dev), saddr) == RTN_LOCAL) {
+		if (inet_addr_type_dev_table(dev_net(dev), dev,
+					     saddr) == RTN_LOCAL) {
 			/* saddr should be known to target */
 			if (inet_addr_onlink(in_dev, target, saddr))
 				break;
@@ -751,7 +752,7 @@ static int arp_process(struct sock *sk, struct sk_buff *skb)
 	/* Special case: IPv4 duplicate address detection packet (RFC2131) */
 	if (sip == 0) {
 		if (arp->ar_op == htons(ARPOP_REQUEST) &&
-		    inet_addr_type(net, tip) == RTN_LOCAL &&
+		    inet_addr_type_dev_table(net, dev, tip) == RTN_LOCAL &&
 		    !arp_ignore(in_dev, sip, tip))
 			arp_send(ARPOP_REPLY
[PATCH net-next 11/11] net: Introduce VRF device driver
This driver borrows heavily from IPvlan and teaming drivers.

Routing domains (VRF-lite) are created by instantiating a VRF master
device with an associated table and enslaving all routed interfaces that
participate in the domain. As part of the enslavement, all connected
routes for the enslaved devices are moved to the table associated with
the VRF device. Outgoing sockets must bind to the VRF device to function.

Standard FIB rules bind the VRF device to tables and regular fib rule
processing is followed. Routed traffic through the box, is forwarded by
using the VRF device as the IIF and following the IIF rule to a table
that is mated with the VRF.

Example:

   Create vrf 1:
     ip link add vrf1 type vrf table 5
     ip rule add iif vrf1 table 5
     ip rule add oif vrf1 table 5
     ip route add table 5 prohibit default
     ip link set vrf1 up

   Add interface to vrf 1:
     ip link set eth1 master vrf1

Signed-off-by: Shrijeet Mukherjee <s...@cumulusnetworks.com>
Signed-off-by: David Ahern <d...@cumulusnetworks.com>
---
 drivers/net/Kconfig  |   7 +
 drivers/net/Makefile |   1 +
 drivers/net/vrf.c    | 685 +++
 3 files changed, 693 insertions(+)
 create mode 100644 drivers/net/vrf.c

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index c18f9e62a9fa..e58468b02987 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -297,6 +297,13 @@ config NLMON
 	  diagnostics, etc. This is mostly intended for developers or support
 	  to debug netlink issues. If unsure, say N.

+config NET_VRF
+	tristate "Virtual Routing and Forwarding (Lite)"
+	depends on IP_MULTIPLE_TABLES && IPV6_MULTIPLE_TABLES
+	---help---
+	  This option enables the support for mapping interfaces into VRF's. The
+	  support enables VRF devices.
+
 endif # NET_CORE

 config SUNGEM_PHY
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index c12cb22478a7..ca16dd689b36 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -25,6 +25,7 @@ obj-$(CONFIG_VIRTIO_NET) += virtio_net.o
 obj-$(CONFIG_VXLAN) += vxlan.o
 obj-$(CONFIG_GENEVE) += geneve.o
 obj-$(CONFIG_NLMON) += nlmon.o
+obj-$(CONFIG_NET_VRF) += vrf.o

 #
 # Networking Drivers
diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
new file mode 100644
index ..95097cb79354
--- /dev/null
+++ b/drivers/net/vrf.c
@@ -0,0 +1,685 @@
+/*
+ * vrf.c: device driver to encapsulate a VRF space
+ *
+ * Copyright (c) 2015 Cumulus Networks. All rights reserved.
+ * Copyright (c) 2015 Shrijeet Mukherjee <s...@cumulusnetworks.com>
+ * Copyright (c) 2015 David Ahern <d...@cumulusnetworks.com>
+ *
+ * Based on dummy, team and ipvlan drivers
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/ip.h>
+#include <linux/init.h>
+#include <linux/moduleparam.h>
+#include <linux/netfilter.h>
+#include <linux/rtnetlink.h>
+#include <net/rtnetlink.h>
+#include <linux/u64_stats_sync.h>
+#include <linux/hashtable.h>
+
+#include <linux/inetdevice.h>
+#include <net/ip.h>
+#include <net/ip_fib.h>
+#include <net/ip6_route.h>
+#include <net/rtnetlink.h>
+#include <net/route.h>
+#include <net/addrconf.h>
+#include <net/vrf.h>
+
+#define DRV_NAME	"vrf"
+#define DRV_VERSION	"1.0"
+
+#define vrf_is_slave(dev)   ((dev)->flags & IFF_SLAVE)
+
+#define vrf_master_get_rcu(dev) \
+	((struct net_device *)rcu_dereference(dev->rx_handler_data))
+
+struct pcpu_dstats {
+	u64			tx_pkts;
+	u64			tx_bytes;
+	u64			tx_drps;
+	u64			rx_pkts;
+	u64			rx_bytes;
+	struct u64_stats_sync	syncp;
+};
+
+static struct dst_entry *vrf_ip_check(struct dst_entry *dst, u32 cookie)
+{
+	return dst;
+}
+
+static int vrf_ip_local_out(struct sk_buff *skb)
+{
+	return ip_local_out(skb);
+}
+
+static unsigned int vrf_v4_mtu(const struct dst_entry *dst)
+{
+	/* TO-DO: return max ethernet size? */
+	return dst->dev->mtu;
+}
+
+static void vrf_dst_destroy(struct dst_entry *dst)
+{
+	/* our dst lives forever - or until the device is closed */
+}
+
+static unsigned int vrf_default_advmss(const struct dst_entry *dst)
+{
+	return 65535 - 40;
+}
+
+static struct dst_ops vrf_dst_ops = {
+	.family		= AF_INET,
+	.local_out	= vrf_ip_local_out,
+	.check		= vrf_ip_check,
+	.mtu		= vrf_v4_mtu,
+	.destroy	= vrf_dst_destroy,
+	.default_advmss	= vrf_default_advmss,
+};
+
+static bool is_ip_rx_frame(struct sk_buff *skb)
+{
+	switch (skb->protocol) {
+	case htons(ETH_P_IP):
+	case htons(ETH_P_IPV6
[PATCH net-next 00/10] VRF-lite - v5
to the VRF (sk_bound_dev_if is set to the VRF device).

5. Neighbor entries

Neighbor entries are not impacted by the VRF device. Entries are
associated with a particular interface; the VRF association is indirect
via the interface-to-VRF device enslavement.

Version 5
- dropped patch regarding socket lookups; no longer needed
  + removed vrf helpers no longer needed after this patch is dropped

- removed dev_open and close operations
  + no need to reset vrf data on an ifdown and creates problems if a
    slave is deleted while the vrf interface is down (Thanks, Nikolay)

- cleanups for sparse warnings
  + make C=2 is now clean for vrf driver

Version 4
- builds are clean with and without VRF device enabled (no, yes and module)

- tightened the driver implementation
  + device add/delete, slave add/remove, and module unload are all clean

- fixed RCU references
  + with RCU and lock debugging enabled changes are clean through the
    suite of tests

- TX path uses custom dst, so patch refactoring rtable allocation is
  dropped along with the patch adding rt_nexthop helper

- dropped the task patch that adds default bind to interface for sockets
  and the associated chvrf example command
  + the patches are a convenience for running unmodified code. They are
    not needed for the core functionality. Any application with support
    for SO_BINDTODEVICE works properly with this patch set.

Version 3
- addressed comments from first 2 RFCs with the exception of the name
  Nicolas: We will do the name conversion once we agree on what the
           correct name should be (vrf, mrf or something else)

- packets flow through the VRF device in both directions allowing the
  following:
  - tcpdump -i vrfn
  - tc rules on vrf device
  - netfilter rules on vrf device

TO-DO
=====
1. IPv6
2. ip fragments
3. ipsec, xfrms
   - have this working now; will post patches soon
4. listen filter to restrict VRF connections
   - i.e., bind to VRF's a, b, c only or NOT VRFs e, f, g

Eric B:
  I have ipsec working with VRFs implemented using the VRF driver,
  including the worst case scenario of complete duplication in the
  networking config.

Thanks to Nikolay for his many, many code reviews whipping the device
driver into shape, and bug-fixes and ideas from Hannes, Roopa Prabhu,
Jon Toppins, Jamal.

Patches can also be pulled from:
    https://github.com/dsahern/linux.git, vrf-dev-v5 branch
    https://github.com/dsahern/iproute2, vrf-dev-v5 branch

David Ahern (9):
  net: Introduce VRF related flags and helpers
  net: Use VRF device index for lookups on RX
  net: Use VRF device index for lookups on TX
  udp: Handle VRF device in sendmsg
  net: Add inet_addr lookup by table
  net: Fix up inet_addr_type checks
  net: Add routes to the table associated with the device
  net: Use passed in table for nexthop lookups
  net: Introduce VRF device driver

 drivers/net/Kconfig          |   7 +
 drivers/net/Makefile         |   1 +
 drivers/net/vrf.c            | 685 +++
 include/linux/netdevice.h    |  20 ++
 include/net/flow.h           |   1 +
 include/net/route.h          |   7 +
 include/net/vrf.h            | 139 +
 include/uapi/linux/if_link.h |   9 +
 net/ipv4/af_inet.c           |  13 +-
 net/ipv4/arp.c               |  15 +-
 net/ipv4/fib_frontend.c      |  66 -
 net/ipv4/fib_semantics.c     |  44 ++-
 net/ipv4/fib_trie.c          |   7 +-
 net/ipv4/icmp.c              |   9 +-
 net/ipv4/route.c             |   8 +-
 net/ipv4/udp.c               |  22 +-
 16 files changed, 1015 insertions(+), 38 deletions(-)
 create mode 100644 drivers/net/vrf.c
 create mode 100644 include/net/vrf.h

--
2.3.2 (Apple Git-55)
[PATCH net-next 2/9] net: Use VRF device index for lookups on RX
On ingress use index of VRF master device for route lookups if real device
is enslaved. Rules are expected to be installed for the VRF device to
direct lookups to a specific table.

Signed-off-by: Shrijeet Mukherjee s...@cumulusnetworks.com
Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 net/ipv4/fib_frontend.c | 8 +++-
 net/ipv4/route.c        | 3 ++-
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 6b98de0d7949..d8ced1d89f1b 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -45,6 +45,7 @@
 #include <net/ip_fib.h>
 #include <net/rtnetlink.h>
 #include <net/xfrm.h>
+#include <net/vrf.h>
 
 #ifndef CONFIG_IP_MULTIPLE_TABLES
 
@@ -309,7 +310,9 @@ static int __fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst,
 	bool dev_match;
 
 	fl4.flowi4_oif = 0;
-	fl4.flowi4_iif = oif ? : LOOPBACK_IFINDEX;
+	fl4.flowi4_iif = vrf_master_ifindex_rcu(dev);
+	if (!fl4.flowi4_iif)
+		fl4.flowi4_iif = oif ? : LOOPBACK_IFINDEX;
 	fl4.daddr = src;
 	fl4.saddr = dst;
 	fl4.flowi4_tos = tos;
@@ -339,6 +342,9 @@ static int __fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst,
 			if (nh->nh_dev == dev) {
 				dev_match = true;
 				break;
+			} else if (vrf_master_ifindex_rcu(nh->nh_dev) == dev->ifindex) {
+				dev_match = true;
+				break;
 			}
 		}
 #else
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 18fd7c9095c7..c26ff1f7067d 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -112,6 +112,7 @@
 #endif
 #include <net/secure_seq.h>
 #include <net/ip_tunnels.h>
+#include <net/vrf.h>
 
 #define RT_FL_TOS(oldflp4) \
 	((oldflp4)->flowi4_tos & (IPTOS_RT_MASK | RTO_ONLINK))
@@ -1726,7 +1727,7 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 	 *	Now we are ready to route packet.
 	 */
 	fl4.flowi4_oif = 0;
-	fl4.flowi4_iif = dev->ifindex;
+	fl4.flowi4_iif = vrf_master_ifindex_rcu(dev) ? : dev->ifindex;
 	fl4.flowi4_mark = skb->mark;
 	fl4.flowi4_tos = tos;
 	fl4.flowi4_scope = RT_SCOPE_UNIVERSE;
-- 
2.3.2 (Apple Git-55)
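The ingress rule in this patch can be restated outside the kernel as a small user-space model. This is only a sketch: struct toy_dev and the toy_* functions are hypothetical stand-ins for struct net_device and vrf_master_ifindex_rcu(), with no RCU involved.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct net_device; only the fields used here. */
struct toy_dev {
	int ifindex;         /* index of this device */
	int is_vrf_master;   /* non-zero if the device is itself a VRF device */
	int master_ifindex;  /* ifindex of the VRF master, 0 if not enslaved */
};

/* Models the idea of vrf_master_ifindex_rcu(): 0 means "no VRF". */
static int toy_vrf_master_ifindex(const struct toy_dev *dev)
{
	if (!dev)
		return 0;
	if (dev->is_vrf_master)
		return dev->ifindex;
	return dev->master_ifindex;
}

/* iif used for the ingress route lookup: VRF master if any, else the device. */
static int toy_lookup_iif(const struct toy_dev *dev)
{
	int master;

	assert(dev != NULL);
	master = toy_vrf_master_ifindex(dev);
	return master ? master : dev->ifindex;
}
```

With this model, a device enslaved to a VRF with ifindex 10 is looked up as iif 10, so a rule keyed on the VRF device can steer the lookup to the VRF's table, while an unenslaved device keeps its own ifindex.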
[PATCH net-next 8/9] net: Use passed in table for nexthop lookups
If a user passes in a table for new routes use that table for nexthop
lookups. Specifically, this solves the case where a connected route does
not exist in the main table, but only another table and then a subsequent
route is added with a next hop using the connected route. i.e.,

$ ip route ls
default via 10.0.2.2 dev eth0
10.0.2.0/24 dev eth0  proto kernel  scope link  src 10.0.2.15
169.254.0.0/16 dev eth0  scope link  metric 1003
192.168.56.0/24 dev eth1  proto kernel  scope link  src 192.168.56.51

$ ip route ls table 10
1.1.1.0/24 dev eth2  scope link

Without this patch adding a nexthop route fails:

$ ip route add table 10 2.2.2.0/24 via 1.1.1.10
RTNETLINK answers: Network is unreachable

With this patch the route is added successfully.

Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 net/ipv4/fib_semantics.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 85e9a8abf15c..b7f1d20a9615 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -691,6 +691,7 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
 		}
 		rcu_read_lock();
 		{
+			struct fib_table *tbl = NULL;
 			struct flowi4 fl4 = {
 				.daddr = nh->nh_gw,
 				.flowi4_scope = cfg->fc_scope + 1,
@@ -701,8 +702,16 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
 			/* It is not necessary, but requires a bit of thinking */
 			if (fl4.flowi4_scope < RT_SCOPE_LINK)
 				fl4.flowi4_scope = RT_SCOPE_LINK;
-			err = fib_lookup(net, &fl4, &res,
-					 FIB_LOOKUP_IGNORE_LINKSTATE);
+
+			if (cfg->fc_table)
+				tbl = fib_get_table(net, cfg->fc_table);
+
+			if (tbl)
+				err = fib_table_lookup(tbl, &fl4, &res,
+						       FIB_LOOKUP_IGNORE_LINKSTATE);
+			else
+				err = fib_lookup(net, &fl4, &res,
+						 FIB_LOOKUP_IGNORE_LINKSTATE);
 			if (err) {
 				rcu_read_unlock();
 				return err;
-- 
2.3.2 (Apple Git-55)
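The failure and the fix described in this patch can be modelled in user space. The sketch below is a deliberate simplification: the toy_* types are hypothetical, routes are matched by prefix and mask only, and the fallback consults a single main table, whereas the kernel's fib_lookup() walks the configured rules.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_IP4(a, b, c, d) \
	(((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | ((uint32_t)(c) << 8) | (uint32_t)(d))

struct toy_route { uint32_t prefix; uint32_t mask; };
struct toy_table { uint32_t id; const struct toy_route *routes; size_t n; };

/* Find a table by id, or NULL if it does not exist. */
static const struct toy_table *toy_get_table(const struct toy_table *tables,
					     size_t ntables, uint32_t id)
{
	for (size_t i = 0; i < ntables; i++)
		if (tables[i].id == id)
			return &tables[i];
	return NULL;
}

/* True if any route in the table covers addr. */
static int toy_table_covers(const struct toy_table *t, uint32_t addr)
{
	for (size_t i = 0; i < t->n; i++)
		if ((addr & t->routes[i].mask) == t->routes[i].prefix)
			return 1;
	return 0;
}

/*
 * Check a gateway: use the requested table when one was passed in and
 * exists, otherwise fall back to the main table. Returns 0 if the
 * gateway is reachable, -1 otherwise.
 */
static int toy_check_nexthop(const struct toy_table *tables, size_t ntables,
			     uint32_t requested_table, uint32_t main_table,
			     uint32_t gw)
{
	const struct toy_table *tbl = NULL;

	assert(tables != NULL);
	if (requested_table)
		tbl = toy_get_table(tables, ntables, requested_table);
	if (!tbl)
		tbl = toy_get_table(tables, ntables, main_table);
	if (!tbl)
		return -1;
	return toy_table_covers(tbl, gw) ? 0 : -1;
}
```

Fed the routes from the commit message, the gateway 1.1.1.10 is unreachable when only the main table is consulted and reachable once table 10 is used for the check.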
[PATCH] iproute2: Add support for VRF device
Allow user to create a vrf device and specify its table binding. Based on the iplink_vlan implementation. Signed-off-by: Shrijeet Mukherjee s...@cumulusnetworks.com Signed-off-by: David Ahern d...@cumulusnetworks.com --- include/linux/if_link.h | 8 + ip/Makefile | 2 +- ip/iplink.c | 2 +- ip/iplink_vrf.c | 85 + 4 files changed, 95 insertions(+), 2 deletions(-) create mode 100644 ip/iplink_vrf.c diff --git a/include/linux/if_link.h b/include/linux/if_link.h index b905cf7f4948..74dedf4320b8 100644 --- a/include/linux/if_link.h +++ b/include/linux/if_link.h @@ -338,6 +338,14 @@ enum macvlan_macaddr_mode { #define MACVLAN_FLAG_NOPROMISC 1 +/* VRF section */ +enum { + IFLA_VRF_UNSPEC, + IFLA_VRF_TABLE, + __IFLA_VRF_MAX +}; + +#define IFLA_VRF_MAX (__IFLA_VRF_MAX - 1) /* IPVLAN section */ enum { IFLA_IPVLAN_UNSPEC, diff --git a/ip/Makefile b/ip/Makefile index 77653ecc5785..d8b38ac2e44b 100644 --- a/ip/Makefile +++ b/ip/Makefile @@ -7,7 +7,7 @@ IPOBJ=ip.o ipaddress.o ipaddrlabel.o iproute.o iprule.o ipnetns.o \ iplink_vxlan.o tcp_metrics.o iplink_ipoib.o ipnetconf.o link_ip6tnl.o \ link_iptnl.o link_gre6.o iplink_bond.o iplink_bond_slave.o iplink_hsr.o \ iplink_bridge.o iplink_bridge_slave.o ipfou.o iplink_ipvlan.o \ -iplink_geneve.o +iplink_geneve.o iplink_vrf.o RTMONOBJ=rtmon.o diff --git a/ip/iplink.c b/ip/iplink.c index 369d50eab94e..14bf7211a447 100644 --- a/ip/iplink.c +++ b/ip/iplink.c @@ -94,7 +94,7 @@ void iplink_usage(void) fprintf(stderr, TYPE := { vlan | veth | vcan | dummy | ifb | macvlan | macvtap |\n); fprintf(stderr, bridge | bond | ipoib | ip6tnl | ipip | sit | vxlan |\n); fprintf(stderr, gre | gretap | ip6gre | ip6gretap | vti | nlmon |\n); - fprintf(stderr, bond_slave | ipvlan | geneve }\n); + fprintf(stderr, bond_slave | ipvlan | geneve | vrf }\n); } exit(-1); } diff --git a/ip/iplink_vrf.c b/ip/iplink_vrf.c new file mode 100644 index ..0d7e21c7c152 --- /dev/null +++ b/ip/iplink_vrf.c @@ -0,0 +1,85 @@ +/* iplink_vrf.cVRF device support + * + * This 
program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + * + * Authors: Shrijeet Mukherjee s...@cumulusnetworks.com + */ + +#include stdio.h +#include stdlib.h +#include string.h +#include sys/socket.h +#include linux/if_link.h + +#include rt_names.h +#include utils.h +#include ip_common.h + +static void vrf_explain(FILE *f) +{ + fprintf(f, Usage: ... vrf table TABLEID \n); +} + +static void explain(void) +{ + vrf_explain(stderr); +} + +static int table_arg(void) +{ + fprintf(stderr,Error: argument of \table\ must be 0-32767 and currently unused\n); + return -1; +} + +static int vrf_parse_opt(struct link_util *lu, int argc, char **argv, + struct nlmsghdr *n) +{ + while (argc 0) { + if (matches(*argv, table) == 0) { + __u32 table = 0; + NEXT_ARG(); + + table = atoi(*argv); + if (table 0 || table 32767) + return table_arg(); + addattr32(n, 1024, IFLA_VRF_TABLE, table); + } else if (matches(*argv, help) == 0) { + explain(); + return -1; + } else { + fprintf(stderr, vrf: unknown option \%s\?\n, + *argv); + explain(); + return -1; + } + argc--, argv++; + } + + return 0; +} + +static void vrf_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[]) +{ + if (!tb) + return; + + if (tb[IFLA_VRF_TABLE]) + fprintf(f, table %u , rta_getattr_u32(tb[IFLA_VRF_TABLE])); +} + +static void vrf_print_help(struct link_util *lu, int argc, char **argv, + FILE *f) +{ + vrf_explain(f); +} + +struct link_util vrf_link_util = { + .id = vrf, + .maxattr= IFLA_VRF_MAX, + .parse_opt = vrf_parse_opt, + .print_opt = vrf_print_opt, + .print_help = vrf_print_help, +}; -- 2.3.2 (Apple Git-55) -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
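One detail in vrf_parse_opt is worth a note: the table id is stored in an unsigned __u32 and parsed with atoi, so a "less than zero" test on it can never be true and non-numeric input is not detected. A hedged sketch of a stricter parse follows; toy_parse_table and TOY_VRF_TABLE_MAX are hypothetical names, the 32767 bound is taken from the patch's error message, and this is not the code that was posted.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_VRF_TABLE_MAX 32767u  /* bound taken from the patch's error text */

/* Parse a decimal table id. Returns 0 and stores it in *out, or -1 on error. */
static int toy_parse_table(const char *arg, uint32_t *out)
{
	char *end = NULL;
	unsigned long val;

	assert(out != NULL);

	/* strtoul() would silently negate a leading '-', so reject it here. */
	if (!arg || *arg == '\0' || *arg == '-')
		return -1;

	errno = 0;
	val = strtoul(arg, &end, 10);
	if (errno != 0 || end == arg || *end != '\0')
		return -1;              /* overflow, no digits, or trailing junk */
	if (val > TOY_VRF_TABLE_MAX)
		return -1;

	*out = (uint32_t)val;
	return 0;
}
```

The design choice here is to report every malformed argument as an error instead of letting it collapse to table 0.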
[PATCH net-next 6/9] net: Fix up inet_addr_type checks
Currently inet_addr_type and inet_dev_addr_type expect local addresses to be in the local table. With the VRF device local routes for devices associated with a VRF will be in the table associated with the VRF. Provide an alternate inet_addr lookup to use a specific table rather than defaulting to the local table. inet_addr_type_dev_table keeps the same semantics as inet_addr_type but if the passed in device is enslaved to a VRF then the table for that VRF is used for the lookup. Signed-off-by: David Ahern d...@cumulusnetworks.com --- include/net/route.h | 3 +++ net/ipv4/af_inet.c | 13 - net/ipv4/arp.c | 15 +-- net/ipv4/fib_frontend.c | 28 +--- net/ipv4/fib_semantics.c | 6 -- net/ipv4/icmp.c | 5 +++-- 6 files changed, 56 insertions(+), 14 deletions(-) diff --git a/include/net/route.h b/include/net/route.h index 6ba681f0b98d..6dda2c1bf8c6 100644 --- a/include/net/route.h +++ b/include/net/route.h @@ -192,6 +192,9 @@ unsigned int inet_addr_type(struct net *net, __be32 addr); unsigned int inet_addr_type_table(struct net *net, __be32 addr, int tb_id); unsigned int inet_dev_addr_type(struct net *net, const struct net_device *dev, __be32 addr); +unsigned int inet_addr_type_dev_table(struct net *net, + const struct net_device *dev, + __be32 addr); void ip_rt_multicast_event(struct in_device *); int ip_rt_ioctl(struct net *, unsigned int cmd, void __user *arg); void ip_rt_get_source(u8 *src, struct sk_buff *skb, struct rtable *rt); diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c index cc4e498a0ccf..96fba4f63454 100644 --- a/net/ipv4/af_inet.c +++ b/net/ipv4/af_inet.c @@ -119,6 +119,7 @@ #ifdef CONFIG_IP_MROUTE #include linux/mroute.h #endif +#include net/vrf.h /* The inetsw table contains everything that inet_create needs to @@ -427,6 +428,7 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len) struct net *net = sock_net(sk); unsigned short snum; int chk_addr_ret; + int tb_id = 0; int err; /* If the socket has its own bind function then use it. 
(RAW) */ @@ -448,7 +450,16 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len) goto out; } - chk_addr_ret = inet_addr_type(net, addr-sin_addr.s_addr); + if (sk-sk_bound_dev_if) { + struct net_device *dev; + + rcu_read_lock(); + dev = dev_get_by_index_rcu(net, sk-sk_bound_dev_if); + if (dev) + tb_id = vrf_dev_table_rcu(dev); + rcu_read_unlock(); + } + chk_addr_ret = inet_addr_type_table(net, addr-sin_addr.s_addr, tb_id); /* Not specified by any standard per-se, however it breaks too * many applications when removed. It is unfortunate since diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c index 34a308573f4b..30409b75e925 100644 --- a/net/ipv4/arp.c +++ b/net/ipv4/arp.c @@ -233,7 +233,7 @@ static int arp_constructor(struct neighbour *neigh) return -EINVAL; } - neigh-type = inet_addr_type(dev_net(dev), addr); + neigh-type = inet_addr_type_dev_table(dev_net(dev), dev, addr); parms = in_dev-arp_parms; __neigh_parms_put(neigh-parms); @@ -343,7 +343,7 @@ static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb) switch (IN_DEV_ARP_ANNOUNCE(in_dev)) { default: case 0: /* By default announce any local IP */ - if (skb inet_addr_type(dev_net(dev), + if (skb inet_addr_type_dev_table(dev_net(dev), dev, ip_hdr(skb)-saddr) == RTN_LOCAL) saddr = ip_hdr(skb)-saddr; break; @@ -351,7 +351,8 @@ static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb) if (!skb) break; saddr = ip_hdr(skb)-saddr; - if (inet_addr_type(dev_net(dev), saddr) == RTN_LOCAL) { + if (inet_addr_type_dev_table(dev_net(dev), dev, +saddr) == RTN_LOCAL) { /* saddr should be known to target */ if (inet_addr_onlink(in_dev, target, saddr)) break; @@ -751,7 +752,7 @@ static int arp_process(struct sock *sk, struct sk_buff *skb) /* Special case: IPv4 duplicate address detection packet (RFC2131) */ if (sip == 0) { if (arp-ar_op == htons(ARPOP_REQUEST) - inet_addr_type(net, tip) == RTN_LOCAL + inet_addr_type_dev_table(net, dev, tip) == RTN_LOCAL !arp_ignore(in_dev, sip, tip)) 
arp_send(ARPOP_REPLY, ETH_P_ARP, sip, dev, tip
[PATCH net-next 7/9] net: Add routes to the table associated with the device
When a device associated with a VRF is brought up or down routes should be
added to/removed from the table associated with the VRF. fib_magic defaults
to using the main or local tables. Have it use the table with the device if
there is one.

A part of this is directing prefsrc validations to the correct table as
well.

Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 net/ipv4/fib_frontend.c  | 8 
 net/ipv4/fib_semantics.c | 25 +++--
 2 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index d84ae0e30369..0a50a08ab844 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -803,6 +803,7 @@ static int inet_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
 static void fib_magic(int cmd, int type, __be32 dst, int dst_len, struct in_ifaddr *ifa)
 {
 	struct net *net = dev_net(ifa->ifa_dev->dev);
+	int tb_id = vrf_dev_table_rtnl(ifa->ifa_dev->dev);
 	struct fib_table *tb;
 	struct fib_config cfg = {
 		.fc_protocol = RTPROT_KERNEL,
@@ -817,11 +818,10 @@ static void fib_magic(int cmd, int type, __be32 dst, int dst_len, struct in_ifad
 		},
 	};
 
-	if (type == RTN_UNICAST)
-		tb = fib_new_table(net, RT_TABLE_MAIN);
-	else
-		tb = fib_new_table(net, RT_TABLE_LOCAL);
+	if (!tb_id)
+		tb_id = (type == RTN_UNICAST) ? RT_TABLE_MAIN : RT_TABLE_LOCAL;
 
+	tb = fib_new_table(net, tb_id);
 	if (!tb)
 		return;
 
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 410ddb67221e..85e9a8abf15c 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -838,6 +838,23 @@ __be32 fib_info_update_nh_saddr(struct net *net, struct fib_nh *nh)
 	return nh->nh_saddr;
 }
 
+static bool fib_valid_prefsrc(struct fib_config *cfg, __be32 fib_prefsrc)
+{
+	if (cfg->fc_type != RTN_LOCAL || !cfg->fc_dst ||
+	    fib_prefsrc != cfg->fc_dst) {
+		int tb_id = cfg->fc_table;
+
+		if (tb_id == RT_TABLE_MAIN)
+			tb_id = RT_TABLE_LOCAL;
+
+		if (inet_addr_type_table(cfg->fc_nlinfo.nl_net,
+					 fib_prefsrc, tb_id) != RTN_LOCAL) {
+			return false;
+		}
+	}
+	return true;
+}
+
 struct fib_info *fib_create_info(struct fib_config *cfg)
 {
 	int err;
@@ -1033,12 +1050,8 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
 			fi->fib_flags |= RTNH_F_LINKDOWN;
 	}
 
-	if (fi->fib_prefsrc) {
-		if (cfg->fc_type != RTN_LOCAL || !cfg->fc_dst ||
-		    fi->fib_prefsrc != cfg->fc_dst)
-			if (inet_addr_type(net, fi->fib_prefsrc) != RTN_LOCAL)
-				goto err_inval;
-	}
+	if (fi->fib_prefsrc && !fib_valid_prefsrc(cfg, fi->fib_prefsrc))
+		goto err_inval;
 
 	change_nexthops(fi) {
 		fib_info_update_nh_saddr(net, nexthop_nh);
-- 
2.3.2 (Apple Git-55)
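The two table choices this patch makes can be written as pure functions. This is a user-space sketch only; the toy_* names are hypothetical, and the enum values are copies chosen to mirror the kernel constants of the same name.

```c
#include <assert.h>

/* Illustrative copies of the kernel constants; the names are hypothetical. */
enum toy_rtn { TOY_RTN_UNICAST = 1, TOY_RTN_LOCAL = 2 };
enum toy_rt_table { TOY_RT_TABLE_MAIN = 254, TOY_RT_TABLE_LOCAL = 255 };

/*
 * Table that receives the automatic routes for an address, following
 * the fib_magic change: the device's VRF table if it has one (non-zero),
 * otherwise main for unicast routes and local for everything else.
 */
static int toy_route_table(int vrf_tb_id, int type)
{
	assert(vrf_tb_id >= 0);
	if (vrf_tb_id)
		return vrf_tb_id;
	return (type == TOY_RTN_UNICAST) ? TOY_RT_TABLE_MAIN : TOY_RT_TABLE_LOCAL;
}

/*
 * Table consulted when validating a preferred source address, following
 * fib_valid_prefsrc: a route going into main is checked against local,
 * any other table is checked against itself.
 */
static int toy_prefsrc_table(int fc_table)
{
	return (fc_table == TOY_RT_TABLE_MAIN) ? TOY_RT_TABLE_LOCAL : fc_table;
}
```

For a device enslaved to a VRF bound to table 10, both the connected route and the local route land in table 10, and a prefsrc on a route added to table 10 is validated against table 10.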
[PATCH net-next 5/9] net: Add inet_addr lookup by table
Currently inet_addr_type and inet_dev_addr_type expect local addresses to be in the local table. With the VRF device local routes for devices associated with a VRF will be in the table associated with the VRF. Provide an alternate inet_addr lookup to use a specific table rather than defaulting to the local table. Signed-off-by: Shrijeet Mukherjee s...@cumulusnetworks.com Signed-off-by: David Ahern d...@cumulusnetworks.com --- include/net/route.h | 1 + net/ipv4/fib_frontend.c | 22 +++--- 2 files changed, 16 insertions(+), 7 deletions(-) diff --git a/include/net/route.h b/include/net/route.h index 94189d4bd899..6ba681f0b98d 100644 --- a/include/net/route.h +++ b/include/net/route.h @@ -189,6 +189,7 @@ void ipv4_sk_redirect(struct sk_buff *skb, struct sock *sk); void ip_rt_send_redirect(struct sk_buff *skb); unsigned int inet_addr_type(struct net *net, __be32 addr); +unsigned int inet_addr_type_table(struct net *net, __be32 addr, int tb_id); unsigned int inet_dev_addr_type(struct net *net, const struct net_device *dev, __be32 addr); void ip_rt_multicast_event(struct in_device *); diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c index d8ced1d89f1b..b11321a8e58d 100644 --- a/net/ipv4/fib_frontend.c +++ b/net/ipv4/fib_frontend.c @@ -212,12 +212,12 @@ void fib_flush_external(struct net *net) */ static inline unsigned int __inet_dev_addr_type(struct net *net, const struct net_device *dev, - __be32 addr) + __be32 addr, int tb_id) { struct flowi4 fl4 = { .daddr = addr }; struct fib_result res; unsigned int ret = RTN_BROADCAST; - struct fib_table *local_table; + struct fib_table *table; if (ipv4_is_zeronet(addr) || ipv4_is_lbcast(addr)) return RTN_BROADCAST; @@ -226,10 +226,10 @@ static inline unsigned int __inet_dev_addr_type(struct net *net, rcu_read_lock(); - local_table = fib_get_table(net, RT_TABLE_LOCAL); - if (local_table) { + table = fib_get_table(net, tb_id); + if (table) { ret = RTN_UNICAST; - if (!fib_table_lookup(local_table, fl4, res, 
FIB_LOOKUP_NOREF)) { + if (!fib_table_lookup(table, fl4, res, FIB_LOOKUP_NOREF)) { if (!dev || dev == res.fi-fib_dev) ret = res.type; } @@ -239,16 +239,24 @@ static inline unsigned int __inet_dev_addr_type(struct net *net, return ret; } +unsigned int inet_addr_type_table(struct net *net, __be32 addr, int tb_id) +{ + return __inet_dev_addr_type(net, NULL, addr, tb_id); +} +EXPORT_SYMBOL(inet_addr_type_table); + unsigned int inet_addr_type(struct net *net, __be32 addr) { - return __inet_dev_addr_type(net, NULL, addr); + return __inet_dev_addr_type(net, NULL, addr, RT_TABLE_LOCAL); } EXPORT_SYMBOL(inet_addr_type); unsigned int inet_dev_addr_type(struct net *net, const struct net_device *dev, __be32 addr) { - return __inet_dev_addr_type(net, dev, addr); + int rt_table = vrf_dev_table(dev) ? : RT_TABLE_LOCAL; + + return __inet_dev_addr_type(net, dev, addr, rt_table); } EXPORT_SYMBOL(inet_dev_addr_type); -- 2.3.2 (Apple Git-55) -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next 1/9] net: Introduce VRF related flags and helpers
Add a VRF_MASTER flag for interfaces and helper functions for determining if a device is a VRF_MASTER. Add link attribute for passing VRF_TABLE id. Add vrf_ptr to netdevice. Add various macros for determining if a device is a VRF device, the index of the master VRF device and table associated with VRF device. Signed-off-by: Shrijeet Mukherjee s...@cumulusnetworks.com Signed-off-by: David Ahern d...@cumulusnetworks.com --- include/linux/netdevice.h| 20 +++ include/net/vrf.h| 139 +++ include/uapi/linux/if_link.h | 9 +++ 3 files changed, 168 insertions(+) create mode 100644 include/net/vrf.h diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 607b5f41f46f..f7a6ef2fae3a 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1289,6 +1289,7 @@ enum netdev_priv_flags { IFF_XMIT_DST_RELEASE_PERM = 122, IFF_IPVLAN_MASTER = 123, IFF_IPVLAN_SLAVE= 124, + IFF_VRF_MASTER = 125, }; #define IFF_802_1Q_VLANIFF_802_1Q_VLAN @@ -1316,6 +1317,7 @@ enum netdev_priv_flags { #define IFF_XMIT_DST_RELEASE_PERM IFF_XMIT_DST_RELEASE_PERM #define IFF_IPVLAN_MASTER IFF_IPVLAN_MASTER #define IFF_IPVLAN_SLAVE IFF_IPVLAN_SLAVE +#define IFF_VRF_MASTER IFF_VRF_MASTER /** * struct net_device - The DEVICE structure. 
@@ -1432,6 +1434,7 @@ enum netdev_priv_flags { * @dn_ptr:DECnet specific data * @ip6_ptr: IPv6 specific data * @ax25_ptr: AX.25 specific data + * @vrf_ptr: VRF specific data * @ieee80211_ptr: IEEE 802.11 specific data, assign before registering * * @last_rx: Time of last Rx @@ -1650,6 +1653,7 @@ struct net_device { struct dn_dev __rcu *dn_ptr; struct inet6_dev __rcu *ip6_ptr; void*ax25_ptr; + struct net_vrf_dev __rcu *vrf_ptr; struct wireless_dev *ieee80211_ptr; struct wpan_dev *ieee802154_ptr; #if IS_ENABLED(CONFIG_MPLS_ROUTING) @@ -3808,6 +3812,22 @@ static inline bool netif_supports_nofcs(struct net_device *dev) return dev-priv_flags IFF_SUPP_NOFCS; } +static inline bool netif_is_vrf(const struct net_device *dev) +{ + return dev-priv_flags IFF_VRF_MASTER; +} + +static inline bool netif_index_is_vrf(struct net *net, int ifindex) +{ + struct net_device *dev = dev_get_by_index_rcu(net, ifindex); + bool rc = false; + + if (dev) + rc = netif_is_vrf(dev); + + return rc; +} + /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */ static inline void netif_keep_dst(struct net_device *dev) { diff --git a/include/net/vrf.h b/include/net/vrf.h new file mode 100644 index ..25c709fdb98f --- /dev/null +++ b/include/net/vrf.h @@ -0,0 +1,139 @@ +/* + * include/net/net_vrf.h - adds vrf dev structure definitions + * Copyright (c) 2015 Cumulus Networks + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. 
+ */ + +#ifndef __LINUX_NET_VRF_H +#define __LINUX_NET_VRF_H + +struct net_vrf_dev { + struct rcu_head rcu; + int ifindex; /* ifindex of master dev */ + u32 tb_id; /* table id for VRF */ +}; + +struct slave { + struct list_headlist; + struct net_device *dev; +}; + +struct slave_queue { + struct list_headall_slaves; + int num_slaves; +}; + +struct net_vrf { + struct slave_queue queue; + struct rtable *rth; + u32 tb_id; +}; + + +#if IS_ENABLED(CONFIG_NET_VRF) +/* called with rcu_read_lock() */ +static inline int vrf_master_ifindex_rcu(const struct net_device *dev) +{ + struct net_vrf_dev *vrf_ptr; + int ifindex = 0; + + if (!dev) + return 0; + + if (netif_is_vrf(dev)) + ifindex = dev-ifindex; + else { + vrf_ptr = rcu_dereference(dev-vrf_ptr); + if (vrf_ptr) + ifindex = vrf_ptr-ifindex; + } + + return ifindex; +} + +/* called with rcu_read_lock */ +static inline int vrf_dev_table_rcu(const struct net_device *dev) +{ + int tb_id = 0; + + if (dev) { + struct net_vrf_dev *vrf_ptr; + + vrf_ptr = rcu_dereference(dev-vrf_ptr); + if (vrf_ptr) + tb_id = vrf_ptr-tb_id; + } + return tb_id; +} + +static inline int vrf_dev_table(const struct net_device *dev) +{ + int tb_id = 0; + + rcu_read_lock(); + tb_id = vrf_dev_table_rcu(dev
[PATCH net-next 4/9] udp: Handle VRF device in sendmsg
For unconnected UDP sockets using a VRF device lookup source address based
on VRF table. This allows the UDP header to be properly setup before
showing up at the VRF device via the dst.

Signed-off-by: Shrijeet Mukherjee s...@cumulusnetworks.com
Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 net/ipv4/udp.c | 22 +-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 83aa604f9273..7af5052e3b1f 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1013,11 +1013,31 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 
 	if (!rt) {
 		struct net *net = sock_net(sk);
+		__u8 flow_flags = inet_sk_flowi_flags(sk);
 
 		fl4 = &fl4_stack;
+
+		/* unconnected socket. If output device is enslaved to a VRF
+		 * device lookup source address from VRF table. This mimics
+		 * behavior of ip_route_connect{_init}.
+		 */
+		if (netif_index_is_vrf(net, ipc.oif)) {
+			flowi4_init_output(fl4, ipc.oif, sk->sk_mark, tos,
+					   RT_SCOPE_UNIVERSE, sk->sk_protocol,
+					   (flow_flags | FLOWI_FLAG_VRFSRC),
+					   faddr, saddr, dport,
+					   inet->inet_sport);
+
+			rt = ip_route_output_flow(net, fl4, sk);
+			if (!IS_ERR(rt)) {
+				saddr = fl4->saddr;
+				ip_rt_put(rt);
+			}
+		}
+
 		flowi4_init_output(fl4, ipc.oif, sk->sk_mark, tos,
 				   RT_SCOPE_UNIVERSE, sk->sk_protocol,
-				   inet_sk_flowi_flags(sk),
+				   flow_flags,
 				   faddr, saddr, dport, inet->inet_sport);
 
 		security_sk_classify_flow(sk, flowi4_to_flowi(fl4));
-- 
2.3.2 (Apple Git-55)
[PATCH net-next 3/9] net: Use VRF device index for lookups on TX
As with ingress use the index of VRF master device for route lookups on egress. However, the oif should only be used to direct the lookups to a specific table. Routes in the table are not based on the VRF device but rather interfaces that are part of the VRF so do not consider the oif for lookups within the table. The FLOWI_FLAG_VRFSRC is used to control this latter part. Signed-off-by: Shrijeet Mukherjee s...@cumulusnetworks.com Signed-off-by: David Ahern d...@cumulusnetworks.com --- include/net/flow.h | 1 + include/net/route.h | 3 +++ net/ipv4/fib_trie.c | 7 +-- net/ipv4/icmp.c | 4 net/ipv4/route.c| 5 + 5 files changed, 18 insertions(+), 2 deletions(-) diff --git a/include/net/flow.h b/include/net/flow.h index 3098ae33a178..f305588fc162 100644 --- a/include/net/flow.h +++ b/include/net/flow.h @@ -33,6 +33,7 @@ struct flowi_common { __u8flowic_flags; #define FLOWI_FLAG_ANYSRC 0x01 #define FLOWI_FLAG_KNOWN_NH0x02 +#define FLOWI_FLAG_VRFSRC 0x04 __u32 flowic_secid; struct flowi_tunnel flowic_tun_key; }; diff --git a/include/net/route.h b/include/net/route.h index 2d45f419477f..94189d4bd899 100644 --- a/include/net/route.h +++ b/include/net/route.h @@ -251,6 +251,9 @@ static inline void ip_route_connect_init(struct flowi4 *fl4, __be32 dst, __be32 if (inet_sk(sk)-transparent) flow_flags |= FLOWI_FLAG_ANYSRC; + if (netif_index_is_vrf(sock_net(sk), oif)) + flow_flags |= FLOWI_FLAG_VRFSRC; + flowi4_init_output(fl4, oif, sk-sk_mark, tos, RT_SCOPE_UNIVERSE, protocol, flow_flags, dst, src, dport, sport); } diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c index 37c4bb89a708..1243c79cb5b0 100644 --- a/net/ipv4/fib_trie.c +++ b/net/ipv4/fib_trie.c @@ -1423,8 +1423,11 @@ int fib_table_lookup(struct fib_table *tb, const struct flowi4 *flp, nh-nh_flags RTNH_F_LINKDOWN !(fib_flags FIB_LOOKUP_IGNORE_LINKSTATE)) continue; - if (flp-flowi4_oif flp-flowi4_oif != nh-nh_oif) - continue; + if (!(flp-flowi4_flags FLOWI_FLAG_VRFSRC)) { + if (flp-flowi4_oif + flp-flowi4_oif != 
nh-nh_oif) + continue; + } if (!(fib_flags FIB_LOOKUP_NOREF)) atomic_inc(fi-fib_clntref); diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c index c0556f1e4bf0..1164fc4ce3bc 100644 --- a/net/ipv4/icmp.c +++ b/net/ipv4/icmp.c @@ -96,6 +96,7 @@ #include net/xfrm.h #include net/inet_common.h #include net/ip_fib.h +#include net/vrf.h /* * Build xmit assembly blocks @@ -425,6 +426,7 @@ static void icmp_reply(struct icmp_bxm *icmp_param, struct sk_buff *skb) fl4.flowi4_mark = mark; fl4.flowi4_tos = RT_TOS(ip_hdr(skb)-tos); fl4.flowi4_proto = IPPROTO_ICMP; + fl4.flowi4_oif = vrf_master_ifindex_rcu(skb-dev) ? : skb-dev-ifindex; security_skb_classify_flow(skb, flowi4_to_flowi(fl4)); rt = ip_route_output_key(net, fl4); if (IS_ERR(rt)) @@ -458,6 +460,8 @@ static struct rtable *icmp_route_lookup(struct net *net, fl4-flowi4_proto = IPPROTO_ICMP; fl4-fl4_icmp_type = type; fl4-fl4_icmp_code = code; + fl4-flowi4_oif = vrf_master_ifindex_rcu(skb_in-dev) ? : skb_in-dev-ifindex; + security_skb_classify_flow(skb_in, flowi4_to_flowi(fl4)); rt = __ip_route_output_key(net, fl4); if (IS_ERR(rt)) diff --git a/net/ipv4/route.c b/net/ipv4/route.c index c26ff1f7067d..2c89d294b669 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -2131,6 +2131,11 @@ struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4) fl4-saddr = inet_select_addr(dev_out, 0, RT_SCOPE_HOST); } + if (netif_is_vrf(dev_out) + !(fl4-flowi4_flags FLOWI_FLAG_VRFSRC)) { + rth = vrf_dev_get_rth(dev_out); + goto out; + } } if (!fl4-daddr) { -- 2.3.2 (Apple Git-55) -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
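The commit message says the oif should only select the table and must not be compared against nexthops inside it. The fib_table_lookup change amounts to the predicate below, shown as a user-space sketch; toy_nh_oif_ok is a hypothetical name, and the flag value 0x04 is the one defined in this patch.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_FLOWI_FLAG_VRFSRC 0x04u  /* value used by the patch for FLOWI_FLAG_VRFSRC */

/*
 * Decide whether a nexthop is acceptable with respect to the flow's oif.
 * With the VRFSRC flag set the oif only steered the lookup to a table,
 * so it is ignored here. Without it, a non-zero oif must match.
 */
static bool toy_nh_oif_ok(unsigned int flow_flags, int flow_oif, int nh_oif)
{
	assert(flow_oif >= 0 && nh_oif >= 0);

	if (flow_flags & TOY_FLOWI_FLAG_VRFSRC)
		return true;
	return flow_oif == 0 || flow_oif == nh_oif;
}
```

So a flow whose oif is the VRF device (say ifindex 10) can still resolve to a route whose nexthop device is an enslaved interface (say ifindex 3), but only when the flag is set.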
[PATCH] xfrm: Add oif to dst lookups
Rules can be installed that direct route lookups to specific tables based on oif. Plumb the oif through the xfrm lookups so it gets set in the flow struct and passed to the resolver routines. Signed-off-by: David Ahern d...@cumulusnetworks.com --- include/net/xfrm.h | 7 +-- net/ipv4/xfrm4_policy.c | 11 ++- net/ipv6/xfrm6_policy.c | 7 --- net/xfrm/xfrm_policy.c | 24 ++-- 4 files changed, 29 insertions(+), 20 deletions(-) diff --git a/include/net/xfrm.h b/include/net/xfrm.h index f0ee97eec24d..312e3fee9ccf 100644 --- a/include/net/xfrm.h +++ b/include/net/xfrm.h @@ -285,10 +285,13 @@ struct xfrm_policy_afinfo { unsigned short family; struct dst_ops *dst_ops; void(*garbage_collect)(struct net *net); - struct dst_entry*(*dst_lookup)(struct net *net, int tos, + struct dst_entry*(*dst_lookup)(struct net *net, + int tos, int oif, const xfrm_address_t *saddr, const xfrm_address_t *daddr); - int (*get_saddr)(struct net *net, xfrm_address_t *saddr, xfrm_address_t *daddr); + int (*get_saddr)(struct net *net, int oif, +xfrm_address_t *saddr, +xfrm_address_t *daddr); void(*decode_session)(struct sk_buff *skb, struct flowi *fl, int reverse); diff --git a/net/ipv4/xfrm4_policy.c b/net/ipv4/xfrm4_policy.c index bff69746e05f..55b3c0f4dde5 100644 --- a/net/ipv4/xfrm4_policy.c +++ b/net/ipv4/xfrm4_policy.c @@ -19,7 +19,7 @@ static struct xfrm_policy_afinfo xfrm4_policy_afinfo; static struct dst_entry *__xfrm4_dst_lookup(struct net *net, struct flowi4 *fl4, - int tos, + int tos, int oif, const xfrm_address_t *saddr, const xfrm_address_t *daddr) { @@ -28,6 +28,7 @@ static struct dst_entry *__xfrm4_dst_lookup(struct net *net, struct flowi4 *fl4, memset(fl4, 0, sizeof(*fl4)); fl4-daddr = daddr-a4; fl4-flowi4_tos = tos; + fl4-flowi4_oif = oif; if (saddr) fl4-saddr = saddr-a4; @@ -38,22 +39,22 @@ static struct dst_entry *__xfrm4_dst_lookup(struct net *net, struct flowi4 *fl4, return ERR_CAST(rt); } -static struct dst_entry *xfrm4_dst_lookup(struct net *net, int tos, +static struct 
dst_entry *xfrm4_dst_lookup(struct net *net, int tos, int oif, const xfrm_address_t *saddr, const xfrm_address_t *daddr) { struct flowi4 fl4; - return __xfrm4_dst_lookup(net, fl4, tos, saddr, daddr); + return __xfrm4_dst_lookup(net, fl4, tos, oif, saddr, daddr); } -static int xfrm4_get_saddr(struct net *net, +static int xfrm4_get_saddr(struct net *net, int oif, xfrm_address_t *saddr, xfrm_address_t *daddr) { struct dst_entry *dst; struct flowi4 fl4; - dst = __xfrm4_dst_lookup(net, fl4, 0, NULL, daddr); + dst = __xfrm4_dst_lookup(net, fl4, 0, oif, NULL, daddr); if (IS_ERR(dst)) return -EHOSTUNREACH; diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c index ed0583c1b9fc..a74013d3eceb 100644 --- a/net/ipv6/xfrm6_policy.c +++ b/net/ipv6/xfrm6_policy.c @@ -26,7 +26,7 @@ static struct xfrm_policy_afinfo xfrm6_policy_afinfo; -static struct dst_entry *xfrm6_dst_lookup(struct net *net, int tos, +static struct dst_entry *xfrm6_dst_lookup(struct net *net, int tos, int oif, const xfrm_address_t *saddr, const xfrm_address_t *daddr) { @@ -35,6 +35,7 @@ static struct dst_entry *xfrm6_dst_lookup(struct net *net, int tos, int err; memset(fl6, 0, sizeof(fl6)); + fl6.flowi6_oif = oif; memcpy(fl6.daddr, daddr, sizeof(fl6.daddr)); if (saddr) memcpy(fl6.saddr, saddr, sizeof(fl6.saddr)); @@ -50,13 +51,13 @@ static struct dst_entry *xfrm6_dst_lookup(struct net *net, int tos, return dst; } -static int xfrm6_get_saddr(struct net *net, +static int xfrm6_get_saddr(struct net *net, int oif, xfrm_address_t *saddr, xfrm_address_t *daddr) { struct dst_entry *dst; struct net_device *dev; - dst = xfrm6_dst_lookup(net, 0, NULL, daddr); + dst = xfrm6_dst_lookup(net, 0, oif, NULL, daddr); if (IS_ERR(dst)) return -EHOSTUNREACH; diff --git a/net/xfrm/xfrm_policy.c b/net
Re: [BUG net-next] infamous dev refcnt leak... again.
On 8/14/15 5:14 PM, Eric Dumazet wrote:
> On Fri, 2015-08-14 at 14:14 -0700, Eric Dumazet wrote:
>> While rebooting host running latest net-next
>>
>> unregister_netdevice: waiting for eth0 to become free. Usage count = 4
>>
>> Oh well...
>
> It looks like David Ahern recent changes uncover a bug ?
>
> Not clear which commit is at fault.
>
> Maybe 3bfd847203c6d89532f836ad3f5b4ff4ced26dd9 ?
>
> Somehow a down device can be found.

Can you elaborate on what you are doing to see the refcnt leak? I have not
seen that at all. I have to leave for soccer carpool in 45 minutes or so,
but can take a look this weekend.

David

> diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
> index b7f1d20..675a3b6 100644
> --- a/net/ipv4/fib_semantics.c
> +++ b/net/ipv4/fib_semantics.c
> @@ -725,10 +725,14 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
>  		nh->nh_dev = dev = FIB_RES_DEV(res);
>  		if (!dev)
>  			goto out;
> -		dev_hold(dev);
>  		if (!netif_carrier_ok(dev))
>  			nh->nh_flags |= RTNH_F_LINKDOWN;
> -		err = (dev->flags & IFF_UP) ? 0 : -ENETDOWN;
> +		if (dev->flags & IFF_UP) {
> +			err = 0;
> +			dev_hold(dev);
> +		} else {
> +			err = -ENETDOWN;
> +		}
>  	} else {
>  		struct in_device *in_dev;
>
Re: [PATCH net-next 04/11] udp: Handle VRF device in sendmsg
On 8/14/15 9:16 PM, Tom Herbert wrote:
> At least collect this code into one (static inline) function to better
> minimize the code churn in udp. If this is general functionality that
> can be used by other drivers then abstract it out as such. Also, if
> the VRF driver is not configured it seems like this code should be
> compiled out. As it stands now if (netif_index_is_vrf(net, ipc.oif)) {
> adds a conditional to every call of udp_sendmsg whether or not we are
> using VRF :-(.

Sure. I wanted to make sure all of the VRF related changes compiled out when the VRF driver is not enabled. This one slipped by me. I'll send a patch next week along with a couple of others per Eric D's comments.

David
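The "compile it out" request above usually comes down to a static inline helper with a constant-false stub for the disabled case. Below is a minimal userspace sketch of that pattern, not the follow-up patch itself. TOY_VRF stands in for the kernel config option, and toy_index_is_vrf() and toy_send_takes_vrf_path() are hypothetical names made up for illustration, including the toy rule that decides what counts as a VRF index.

```c
#include <assert.h>
#include <stdbool.h>

/* TOY_VRF is a hypothetical stand-in for the kernel config option;
 * build with -DTOY_VRF to enable the "real" helper.
 */
#ifdef TOY_VRF
/* Toy rule, for illustration only: indexes >= 100 are VRF masters. */
static inline bool toy_index_is_vrf(int ifindex)
{
	return ifindex >= 100;
}
#else
/* Stub: always false, so the compiler can drop any branch guarded by it. */
static inline bool toy_index_is_vrf(int ifindex)
{
	(void)ifindex;
	return false;
}
#endif

/* Toy send path: reports whether the VRF-specific branch was taken.
 * The caller contains no #ifdef; the special case lives in one helper.
 */
static bool toy_send_takes_vrf_path(int oif)
{
	if (toy_index_is_vrf(oif))
		return true;	/* VRF-specific handling would go here */
	return false;		/* ordinary path */
}
```

With TOY_VRF undefined, toy_send_takes_vrf_path(100) returns false and an optimizing compiler is free to remove the guarded branch entirely, which is the effect asked for in the review.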
Re: [PATCH net-next] ipv4: fix refcount leak in fib_check_nh()
On 8/15/15 11:54 AM, Eric Dumazet wrote:
> From: Eric Dumazet <eduma...@google.com>
>
> fib_lookup() forces FIB_LOOKUP_NOREF flag, while fib_table_lookup()
> does not.
>
> This patch solves the typical message at reboot time or device
> dismantle :
>
> unregister_netdevice: waiting for eth0 to become free. Usage count = 4
>
> Fixes: 3bfd847203c6 ("net: Use passed in table for nexthop lookups")
> Signed-off-by: Eric Dumazet <eduma...@google.com>
> Cc: David Ahern <d...@cumulusnetworks.com>

Still puzzled why I was not seeing the refcnt problem at reboot though I did see the extra dev_hold when I instrumented the hold and put. Anyways, thanks for resolving, Eric.

Acked-by: David Ahern <d...@cumulusnetworks.com>
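The difference Eric describes can be modeled in a few lines of userspace C. This is a toy sketch only: struct toy_info, TOY_LOOKUP_NOREF and the toy_* functions are hypothetical and merely imitate the idea that a lookup without the NOREF flag hands the caller a reference it must drop.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_LOOKUP_NOREF 0x1	/* hypothetical flag, modeled on FIB_LOOKUP_NOREF */

struct toy_info {
	int refcnt;	/* references currently held on this entry */
};

/* Return the entry. Unless TOY_LOOKUP_NOREF is set, the lookup takes a
 * reference that the caller owns and must release with toy_put().
 */
static struct toy_info *toy_lookup(struct toy_info *entry, int flags)
{
	if (!(flags & TOY_LOOKUP_NOREF))
		entry->refcnt++;
	return entry;
}

/* Drop one reference. */
static void toy_put(struct toy_info *entry)
{
	assert(entry->refcnt > 0);
	entry->refcnt--;
}

/* A caller that only inspects the result and never calls toy_put().
 * That is safe with TOY_LOOKUP_NOREF and leaks one reference without it.
 */
static void toy_peek(struct toy_info *entry, int flags)
{
	(void)toy_lookup(entry, flags);
}
```

Calling toy_peek() with flags of 0 leaves the count one higher than before, which is the kind of imbalance that eventually shows up as a "waiting for eth0 to become free" message.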
increase in time to delete an interface with 4.x kernels
Hi Alex:

I believe you did the recent overhaul to the fib implementation. I am seeing dramatically higher times to delete an interface with an ipv4 address in 4.2-rc3. perf-top points to update_suffix:

   PerfTop:   15834 irqs/sec  kernel:97.3%  exact:  0.0% [4000Hz cpu-clock],  (all, 4 CPUs)
-------------------------------------------------------------------------------
    74.69%  [kernel]  [k] update_suffix
     2.38%  [kernel]  [k] fib_table_flush
     2.20%  [kernel]  [k] fib6_walk_continue
     2.03%  [kernel]  [k] fib6_ifdown
     1.31%  [kernel]  [k] fib6_age

I have a simple script to create and assign an ipv4 address to 10k dummy interfaces:

l=0
for (( j = 1; j <= 40; j += 1))
do
    for (( k = 1 ; k <= 250 ; k += 1 ))
    do
        l=$((l + 1))
        ip link add dev dummy${l} type dummy
        ip addr add 72.$j.$k.1/24 dev dummy${l}
        ifconfig dummy${l} up
    done
done

and a counter script to delete them all:

k=$(ip link show | grep dummy | wc -l)
for (( j = 1; j <= k; j += 1))
do
    ip link del dev dummy${j}
done

Looking at v3.19:

# time ./tadd-dummy.sh

real    3m8.896s
user    0m7.104s
sys     0m22.020s

# time ./tdel-dummy.sh

real    7m18.207s
user    0m3.824s
sys     3m15.672s

And the time to delete 1 interface after all 10k have been created:

# time ip link del dev dummy

real    0m0.064s
user    0m0.000s
sys     0m0.020s

Contrast those times with 4.2.0-rc3+ running the exact same scripts:

# time ./tadd-dummy.sh

real    2m51.044s
user    0m7.220s
sys     0m29.520s

# time ip link del dev dummy

real    0m0.441s
user    0m0.000s
sys     0m0.416s

So here the time to delete 1 interface has gone up by roughly 7x (0.064s to 0.441s wall clock), and the sys time has gone up by about 20x.

# time ./tdel-dummy.sh
^C
real    14m10.000s
user    0m0.528s
sys     13m14.728s

I killed the delete; after 14 minutes only ~2k+ interfaces had been deleted:

# ip link show | grep dummy | wc -l
7822

In 4.2.0-rc3 it seems to take about 60 seconds to delete 150 interfaces, which is in line with the 1 interface time of 0.4 seconds.

David
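The per-interface figures in this report can be cross-checked from the bulk timings. The sketch below is only arithmetic on numbers quoted in the mail. It assumes exactly 10000 dummy devices existed before each delete run and that nothing else on the system matched "dummy". The function names are made up for illustration.

```c
#include <assert.h>

/* Average seconds per deleted interface for a run that took
 * `minutes` minutes plus `seconds` seconds of wall-clock time and
 * removed `deleted` devices.
 */
static double secs_per_delete(int minutes, double seconds, int deleted)
{
	assert(deleted > 0);
	return (minutes * 60 + seconds) / deleted;
}

/* v3.19: the full 10k-interface delete finished in 7m18.207s. */
static double v3_19_avg(void)
{
	return secs_per_delete(7, 18.207, 10000);
}

/* 4.2.0-rc3+: killed after 14m10s with 7822 of 10000 interfaces left. */
static double v4_2_rc3_avg(void)
{
	return secs_per_delete(14, 10.0, 10000 - 7822);
}
```

Under those assumptions the averages come out near 0.044 s per interface on v3.19 and 0.39 s on 4.2.0-rc3+, which agrees with the "60 seconds for 150 interfaces" estimate (150 x 0.4 s). Whether the cost stays constant as the table shrinks is not something these numbers can show.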
[PATCH net-next 06/16] net: Tx via VRF device
If out device is enslaved to a VRF device we want packets to go through the VRF master device first. This allows for example iptables rules and tc rules to be configured on the VRF as a whole as well as the option for rules on specific netdevices.

This is accomplished by updating the dev in the dst to point to the VRF device if it is enslaved.

Signed-off-by: Shrijeet Mukherjee <s...@cumulusnetworks.com>
Signed-off-by: David Ahern <d...@cumulusnetworks.com>
---
 net/ipv4/route.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 8119896e1159..050a3c1d89ba 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1903,6 +1903,23 @@ int ip_route_input_noref(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 }
 EXPORT_SYMBOL(ip_route_input_noref);
 
+/* if out device is enslaved to a VRF device update dst to
+ * send through it
+ */
+static void rt_use_vrf_dev(struct rtable *rth, struct net_device *dev_out)
+{
+#if IS_ENABLED(CONFIG_NET_VRF)
+	int ifindex = vrf_master_dev_ifindex(dev_out);
+	struct net_device *mdev;
+
+	mdev = dev_get_by_index(dev_net(dev_out), ifindex);
+	if (mdev) {
+		dev_put(rth->dst.dev);
+		rth->dst.dev = mdev;
+	}
+#endif
+}
+
 /* called with rcu_read_lock() */
 static struct rtable *__mkroute_output(const struct fib_result *res,
 				       const struct flowi4 *fl4, int orig_oif,
@@ -2008,6 +2025,7 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
 	}
 
 	rt_set_nexthop(rth, fl4->daddr, res, fnhe, fi, type, 0);
+	rt_use_vrf_dev(rth, dev_out);
 
 	return rth;
 }
-- 
2.3.2 (Apple Git-55)
[PATCH net-next 14/16] net: Add sk_bind_dev_if to task_struct
Allow tasks to have a default device index for binding sockets. If set the value is passed to all AF_INET/AF_INET6 sockets when they are created.

The task setting is passed parent to child on fork, but can be set or changed after task creation using prctl (if task has CAP_NET_ADMIN permissions). The setting for a socket can be retrieved using prctl().

This option allows an administrator to restrict a task to only send/receive packets through the specified device. In the case of VRF devices this option restricts tasks to a specific VRF.

Correlation of the device index to a specific VRF, i.e., ifindex --> VRF device --> VRF id is left to userspace.

Example using VRF devices:
1. vrf1 is created and assigned to table 5
2. eth2 is enslaved to vrf1
3. eth2 is given the address 1.1.1.1/24

$ ip route ls table 5
prohibit default
1.1.1.0/24 dev eth2  scope link
local 1.1.1.1 dev eth2  proto kernel  scope host  src 1.1.1.1

Without setting a VRF context ping, tcp and udp attempts fail. e.g.,
$ ping 1.1.1.254
connect: Network is unreachable

After binding the task to the vrf device ping succeeds:
$ ./chvrf -v 1 ping -c1 1.1.1.254
PING 1.1.1.254 (1.1.1.254) 56(84) bytes of data.
64 bytes from 1.1.1.254: icmp_seq=1 ttl=64 time=2.32 ms

Signed-off-by: David Ahern <d...@cumulusnetworks.com>
---
 include/linux/sched.h      |  3 +++
 include/uapi/linux/prctl.h |  4 ++++
 kernel/fork.c              |  2 ++
 kernel/sys.c               | 35 +++++++++++++++++++++++++++++++++++
 net/ipv4/af_inet.c         |  1 +
 net/ipv4/route.c           |  4 +++-
 net/ipv6/af_inet6.c        |  1 +
 net/ipv6/route.c           |  2 +-
 8 files changed, 50 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 04b5ada460b4..29b336b8a466 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1528,6 +1528,9 @@ struct task_struct {
 	struct files_struct *files;
 /* namespaces */
 	struct nsproxy *nsproxy;
+/* network */
+	/* if set INET/INET6 sockets are bound to given dev index on create */
+	int sk_bind_dev_if;
 /* signal handlers */
 	struct signal_struct *signal;
 	struct sighand_struct *sighand;
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 31891d9535e2..1ef45195d146 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -190,4 +190,8 @@ struct prctl_mm_map {
 # define PR_FP_MODE_FR		(1 << 0)	/* 64b FP registers */
 # define PR_FP_MODE_FRE		(1 << 1)	/* 32b compatibility */
 
+/* get/set network interface sockets are bound to by default */
+#define PR_SET_SK_BIND_DEV_IF	47
+#define PR_GET_SK_BIND_DEV_IF	48
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index dbd9b8d7b7cc..8b396e77d2bf 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -380,6 +380,8 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
 	tsk->splice_pipe = NULL;
 	tsk->task_frag.page = NULL;
 
+	tsk->sk_bind_dev_if = orig->sk_bind_dev_if;
+
 	account_kernel_stack(ti, 1);
 
 	return tsk;
diff --git a/kernel/sys.c b/kernel/sys.c
index 259fda25eb6b..59119ac0a0bd 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -52,6 +52,7 @@
 #include <linux/rcupdate.h>
 #include <linux/uidgid.h>
 #include <linux/cred.h>
+#include <linux/netdevice.h>
 
 #include <linux/kmsg_dump.h>
 /* Move somewhere else to avoid recompiling? */
@@ -2267,6 +2268,40 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_GET_FP_MODE:
 		error = GET_FP_MODE(me);
 		break;
+#ifdef CONFIG_NET
+	case PR_SET_SK_BIND_DEV_IF:
+	{
+		struct net_device *dev;
+		int idx = (int) arg2;
+
+		if (!capable(CAP_NET_ADMIN))
+			return -EPERM;
+
+		if (idx) {
+			dev = dev_get_by_index(me->nsproxy->net_ns, idx);
+			if (!dev)
+				return -EINVAL;
+			dev_put(dev);
+		}
+		me->sk_bind_dev_if = idx;
+		break;
+	}
+	case PR_GET_SK_BIND_DEV_IF:
+	{
+		struct task_struct *tsk;
+		int sk_bind_dev_if = -EINVAL;
+
+		rcu_read_lock();
+		tsk = find_task_by_vpid(arg2);
+		if (tsk)
+			sk_bind_dev_if = tsk->sk_bind_dev_if;
+		rcu_read_unlock();
+		if (tsk != me && !capable(CAP_NET_ADMIN))
+			return -EPERM;
+		error = sk_bind_dev_if;
+		break;
+	}
+#endif
 	default:
 		error = -EINVAL;
 		break;
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 09c7c1ee307e..0651efa18d39 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -352,6 +352,7 @@ static int inet_create(struct net *net, struct socket *sock, int protocol
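The three behaviours described in this patch (privileged set with validation, inheritance on fork, use as the default at socket creation) can be pictured with a small userspace model. Everything below is a toy under stated assumptions: the toy_* types and functions are hypothetical, the device table is faked, and the socket-creation step follows the commit message because the inet_create() hunk is cut off above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct toy_task {
	bool net_admin;		/* stands in for capable(CAP_NET_ADMIN) */
	int sk_bind_dev_if;	/* 0 means no default device */
};

struct toy_sock {
	int bound_dev_if;
};

/* Toy device table: only ifindex 1..TOY_MAX_IFINDEX exist. */
#define TOY_MAX_IFINDEX 8

static bool toy_dev_exists(int ifindex)
{
	return ifindex >= 1 && ifindex <= TOY_MAX_IFINDEX;
}

/* Models the PR_SET_SK_BIND_DEV_IF branch: privilege check first, then a
 * non-zero index must name an existing device; zero clears the default.
 */
static int toy_set_bind_dev(struct toy_task *t, int ifindex)
{
	if (!t->net_admin)
		return -EPERM;
	if (ifindex && !toy_dev_exists(ifindex))
		return -EINVAL;
	t->sk_bind_dev_if = ifindex;
	return 0;
}

/* Models the dup_task_struct() hunk: the child starts with the parent's value. */
static struct toy_task toy_fork(const struct toy_task *parent)
{
	return *parent;
}

/* Models socket creation as described in the commit message:
 * a new socket picks up the task default.
 */
static struct toy_sock toy_socket(const struct toy_task *t)
{
	struct toy_sock sk = { .bound_dev_if = t->sk_bind_dev_if };

	return sk;
}
```

In this model a wrapper such as chvrf would call the set operation once and then exec the target program, so every socket the program opens afterwards starts out bound to the chosen device.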
[PATCH net-next 13/16] net: Introduce VRF device driver - v2
This driver borrows heavily from IPvlan and teaming drivers.

Routing domains (VRF-lite) are created by instantiating a VRF master device with an associated table and enslaving all routed interfaces that participate in the domain. As part of the enslavement, all connected routes for the enslaved devices are moved to the table associated with the VRF device. Outgoing sockets must bind to the VRF device to function.

Standard FIB rules bind the VRF device to tables and regular fib rule processing is followed. Routed traffic through the box is forwarded by using the VRF device as the IIF and following the IIF rule to a table that is mated with the VRF.

Example:

   Create vrf 1:
     ip link add vrf1 type vrf table 5
     ip rule add iif vrf1 table 5
     ip rule add oif vrf1 table 5
     ip route add table 5 prohibit default
     ip link set vrf1 up

   Add interface to vrf 1:
     ip link set eth1 master vrf1

Signed-off-by: Shrijeet Mukherjee <s...@cumulusnetworks.com>
Signed-off-by: David Ahern <d...@cumulusnetworks.com>

v2:
- addressed comments from first RFC
- significant changes to improve simplicity of implementation
---
 drivers/net/Kconfig  |   7 +
 drivers/net/Makefile |   1 +
 drivers/net/vrf.c    | 596 +++
 3 files changed, 604 insertions(+)
 create mode 100644 drivers/net/vrf.c

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index c18f9e62a9fa..e58468b02987 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -297,6 +297,13 @@ config NLMON
 	  diagnostics, etc. This is mostly intended for developers or support
 	  to debug netlink issues. If unsure, say N.
 
+config NET_VRF
+	tristate "Virtual Routing and Forwarding (Lite)"
+	depends on IP_MULTIPLE_TABLES && IPV6_MULTIPLE_TABLES
+	---help---
+	  This option enables the support for mapping interfaces into VRF's. The
+	  support enables VRF devices.
+
 endif # NET_CORE
 
 config SUNGEM_PHY
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index c12cb22478a7..ca16dd689b36 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -25,6 +25,7 @@ obj-$(CONFIG_VIRTIO_NET) += virtio_net.o
 obj-$(CONFIG_VXLAN) += vxlan.o
 obj-$(CONFIG_GENEVE) += geneve.o
 obj-$(CONFIG_NLMON) += nlmon.o
+obj-$(CONFIG_NET_VRF) += vrf.o
 
 #
 # Networking Drivers
diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
new file mode 100644
index 000000000000..8669b0f9d749
--- /dev/null
+++ b/drivers/net/vrf.c
@@ -0,0 +1,596 @@
+/*
+ * vrf.c: device driver to encapsulate a VRF space
+ *
+ * Copyright (c) 2015 Cumulus Networks
+ *
+ * Based on dummy, team and ipvlan drivers
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/ip.h>
+#include <linux/init.h>
+#include <linux/moduleparam.h>
+#include <linux/rtnetlink.h>
+#include <net/rtnetlink.h>
+#include <linux/u64_stats_sync.h>
+#include <linux/hashtable.h>
+
+#include <linux/inetdevice.h>
+#include <net/ip.h>
+#include <net/ip_fib.h>
+#include <net/ip6_route.h>
+#include <net/rtnetlink.h>
+#include <net/route.h>
+#include <net/addrconf.h>
+#include <net/vrf.h>
+
+#define DRV_NAME	"vrf"
+#define DRV_VERSION	"1.0"
+
+#define vrf_is_slave(dev)   ((dev)->flags & IFF_SLAVE)
+#define vrf_is_master(dev)  ((dev)->flags & IFF_MASTER)
+
+#define vrf_master_get_rcu(dev) \
+	((struct net_device *)rcu_dereference(dev->rx_handler_data))
+
+struct pcpu_dstats {
+	u64			tx_pkts;
+	u64			tx_bytes;
+	u64			tx_drps;
+	u64			rx_pkts;
+	u64			rx_bytes;
+	struct u64_stats_sync	syncp;
+};
+
+struct slave {
+	struct list_head	list;
+	struct net_device	*dev;
+};
+
+struct slave_queue {
+	spinlock_t		lock; /* lock for slave insert/delete */
+	struct list_head	all_slaves;
+	int			num_slaves;
+};
+
+struct net_vrf {
+	struct slave_queue	queue;
+	struct fib_table	*tb;
+	u32			tb_id;
+};
+
+static bool is_ip_rx_frame(struct sk_buff *skb)
+{
+	switch (skb->protocol) {
+	case htons(ETH_P_IP):
+	case htons(ETH_P_IPV6):
+		return true;
+	}
+	return false;
+}
+
+/* note: already called with rcu_read_lock */
+static rx_handler_result_t vrf_handle_frame(struct sk_buff **pskb)
+{
+	struct sk_buff *skb = *pskb;
+
+	if (is_ip_rx_frame(skb)) {
+		struct net_device *dev = vrf_master_get_rcu(skb->dev);
+		struct pcpu_dstats *dstats = this_cpu_ptr(dev->dstats);
+
+		u64_stats_update_begin(&dstats->syncp
[PATCH] iproute2: Add support for VRF device
Allow user to create a vrf device and specify its table binding.

Based on the iplink_vlan implementation.

Signed-off-by: Shrijeet Mukherjee <s...@cumulusnetworks.com>
Signed-off-by: David Ahern <d...@cumulusnetworks.com>
---
 include/linux/if_link.h |  8 +
 ip/Makefile             |  2 +-
 ip/iplink.c             |  2 +-
 ip/iplink_vrf.c         | 87 +
 4 files changed, 97 insertions(+), 2 deletions(-)
 create mode 100644 ip/iplink_vrf.c

diff --git a/include/linux/if_link.h b/include/linux/if_link.h
index 8df6a8466839..28872fbf6814 100644
--- a/include/linux/if_link.h
+++ b/include/linux/if_link.h
@@ -337,6 +337,14 @@ enum macvlan_macaddr_mode {
 
 #define MACVLAN_FLAG_NOPROMISC	1
 
+/* VRF section */
+enum {
+	IFLA_VRF_UNSPEC,
+	IFLA_VRF_TABLE,
+	__IFLA_VRF_MAX
+};
+
+#define IFLA_VRF_MAX (__IFLA_VRF_MAX - 1)
 /* IPVLAN section */
 enum {
 	IFLA_IPVLAN_UNSPEC,
diff --git a/ip/Makefile b/ip/Makefile
index 77653ecc5785..d8b38ac2e44b 100644
--- a/ip/Makefile
+++ b/ip/Makefile
@@ -7,7 +7,7 @@ IPOBJ=ip.o ipaddress.o ipaddrlabel.o iproute.o iprule.o ipnetns.o \
     iplink_vxlan.o tcp_metrics.o iplink_ipoib.o ipnetconf.o link_ip6tnl.o \
     link_iptnl.o link_gre6.o iplink_bond.o iplink_bond_slave.o iplink_hsr.o \
     iplink_bridge.o iplink_bridge_slave.o ipfou.o iplink_ipvlan.o \
-    iplink_geneve.o
+    iplink_geneve.o iplink_vrf.o
 
 RTMONOBJ=rtmon.o
 
diff --git a/ip/iplink.c b/ip/iplink.c
index e296e6f611b8..892e8bc8808b 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -93,7 +93,7 @@ void iplink_usage(void)
 		fprintf(stderr, "TYPE := { vlan | veth | vcan | dummy | ifb | macvlan | macvtap |\n");
 		fprintf(stderr, "          bridge | bond | ipoib | ip6tnl | ipip | sit | vxlan |\n");
 		fprintf(stderr, "          gre | gretap | ip6gre | ip6gretap | vti | nlmon |\n");
-		fprintf(stderr, "          bond_slave | ipvlan | geneve }\n");
+		fprintf(stderr, "          bond_slave | ipvlan | geneve | vrf }\n");
 	}
 	exit(-1);
 }
diff --git a/ip/iplink_vrf.c b/ip/iplink_vrf.c
new file mode 100644
index 000000000000..bfcb3cdeaf35
--- /dev/null
+++ b/ip/iplink_vrf.c
@@ -0,0 +1,87 @@
+/* iplink_vrf.c	VRF device support
+ *
+ *	This program is free software; you can redistribute it and/or
+ *	modify it under the terms of the GNU General Public License
+ *	as published by the Free Software Foundation; either version
+ *	2 of the License, or (at your option) any later version.
+ *
+ * Authors:	Shrijeet Mukherjee <s...@cumulusnetworks.com>
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/socket.h>
+#include <linux/if_link.h>
+
+#include "rt_names.h"
+#include "utils.h"
+#include "ip_common.h"
+
+static void vrf_explain(FILE *f)
+{
+	fprintf(f, "Usage: ... vrf table TABLEID \n");
+}
+
+static void explain(void)
+{
+	vrf_explain(stderr);
+}
+
+static int table_arg(void)
+{
+	fprintf(stderr,"Error: argument of \"table\" must be 0-32767 and currently unused\n");
+	return -1;
+}
+
+static int vrf_parse_opt(struct link_util *lu, int argc, char **argv,
+			    struct nlmsghdr *n)
+{
+	while (argc > 0) {
+		if (matches(*argv, "table") == 0) {
+			__u32 table = 0;
+			NEXT_ARG();
+
+			table = atoi(*argv);
+			if (table < 0 || table > 32767)
+				return table_arg();
+			/* XXX need a table in-use check here */
+			fprintf(stderr, "adding table %d\n", table);
+			addattr32(n, 1024, IFLA_VRF_TABLE, table);
+		} else if (matches(*argv, "help") == 0) {
+			explain();
+			return -1;
+		} else {
+			fprintf(stderr, "vrf: unknown option \"%s\"?\n",
+				*argv);
+			explain();
+			return -1;
+		}
+		argc--, argv++;
+	}
+
+	return 0;
+}
+
+static void vrf_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
+{
+	if (!tb)
+		return;
+
+	if (tb[IFLA_VRF_TABLE])
+		fprintf(f, "table %u ", rta_getattr_u32(tb[IFLA_VRF_TABLE]));
+}
+
+static void vrf_print_help(struct link_util *lu, int argc, char **argv,
+			      FILE *f)
+{
+	vrf_explain(f);
+}
+
+struct link_util vrf_link_util = {
+	.id		= "vrf",
+	.maxattr	= IFLA_VRF_MAX,
+	.parse_opt	= vrf_parse_opt,
+	.print_opt	= vrf_print_opt,
+	.print_help	= vrf_print_help,
+};
-- 
2.3.2 (Apple Git-55)
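One detail in vrf_parse_opt() is worth a note: the table id is read with atoi() into a __u32, so the "table < 0" test can never be true, and non-numeric input such as "abc" is read as 0 and accepted. A stricter parse might look like the sketch below. It is an illustration only, not part of the patch: toy_parse_table() and TOY_TABLE_MAX are hypothetical names, and the 32767 limit is taken from the error message above.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_TABLE_MAX 32767	/* upper bound quoted in the patch's error message */

/* Parse a decimal table id into *out. Returns 0 on success and -1 on
 * NULL or empty input, a leading sign or space, trailing garbage, or a
 * value above TOY_TABLE_MAX.
 */
static int toy_parse_table(const char *arg, uint32_t *out)
{
	char *end;
	unsigned long val;

	/* Require a leading digit: rejects "", "-1", "+5", " 7" and "abc". */
	if (!arg || *arg < '0' || *arg > '9')
		return -1;

	errno = 0;
	val = strtoul(arg, &end, 10);
	if (errno || *end != '\0' || val > TOY_TABLE_MAX)
		return -1;

	*out = (uint32_t)val;
	return 0;
}
```

A caller would use it in place of the atoi() line, for example `if (toy_parse_table(*argv, &table)) return table_arg();`, leaving the rest of the option loop unchanged.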
[PATCH net-next 11/16] net: Use VRF device index for socket lookups
The intent of the VRF device is to leverage the existing SO_BINDTODEVICE as a means of creating L3 domains. Since sockets are expected to be bound to the VRF device the index of the master device needs to be used for socket lookups.

Signed-off-by: Shrijeet Mukherjee <s...@cumulusnetworks.com>
Signed-off-by: David Ahern <d...@cumulusnetworks.com>
---
 net/ipv4/syncookies.c |  5 ++++-
 net/ipv4/tcp_input.c  |  6 +++++-
 net/ipv4/tcp_ipv4.c   | 11 +++++++++--
 3 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index d70b1f603692..dab52fba5872 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -18,6 +18,7 @@
 #include <linux/export.h>
 #include <net/tcp.h>
 #include <net/route.h>
+#include <net/vrf.h>
 
 extern int sysctl_tcp_syncookies;
 
@@ -348,7 +349,9 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 	treq->snt_synack	= tcp_opt.saw_tstamp ? tcp_opt.rcv_tsecr : 0;
 	treq->tfo_listener	= false;
 
-	ireq->ir_iif = sk->sk_bound_dev_if;
+	ireq->ir_iif = vrf_get_master_dev_ifindex(sock_net(sk), skb->skb_iif);
+	if (!ireq->ir_iif)
+		ireq->ir_iif = sk->sk_bound_dev_if;
 
 	/* We throwed the options of the initial SYN away, so we hope
 	 * the ACK carries the same options again (see RFC1122 4.2.3.8)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 4e4d6bcd0ca9..df82fb05c459 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -72,6 +72,7 @@
 #include <net/dst.h>
 #include <net/tcp.h>
 #include <net/inet_common.h>
+#include <net/vrf.h>
 #include <linux/ipsec.h>
 #include <asm/unaligned.h>
 #include <linux/errqueue.h>
@@ -6141,7 +6142,10 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
 	tcp_openreq_init(req, &tmp_opt, skb, sk);
 
 	/* Note: tcp_v6_init_req() might override ir_iif for link locals */
-	inet_rsk(req)->ir_iif = sk->sk_bound_dev_if;
+	inet_rsk(req)->ir_iif = vrf_get_master_dev_ifindex(sock_net(sk),
+							   skb->skb_iif);
+	if (!inet_rsk(req)->ir_iif)
+		inet_rsk(req)->ir_iif = sk->sk_bound_dev_if;
 
 	af_ops->init_req(req, sk, skb);
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 486ba96ae91a..d0c40f4d9058 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -75,6 +75,7 @@
 #include <net/secure_seq.h>
 #include <net/tcp_memcontrol.h>
 #include <net/busy_poll.h>
+#include <net/vrf.h>
 
 #include <linux/inet.h>
 #include <linux/ipv6.h>
@@ -682,6 +683,8 @@ static void tcp_v4_send_reset(struct sock *sk, struct sk_buff *skb)
 	 */
 	if (sk)
 		arg.bound_dev_if = sk->sk_bound_dev_if;
+	if (!arg.bound_dev_if && skb->dev)
+		arg.bound_dev_if = vrf_master_dev_ifindex(skb->dev);
 
 	arg.tos = ip_hdr(skb)->tos;
 	ip_send_unicast_reply(*this_cpu_ptr(net->ipv4.tcp_sk),
@@ -766,8 +769,10 @@ static void tcp_v4_send_ack(struct sk_buff *skb, u32 seq, u32 ack,
 				      ip_hdr(skb)->saddr, /* XXX */
 				      arg.iov[0].iov_len, IPPROTO_TCP, 0);
 	arg.csumoffset = offsetof(struct tcphdr, check) / 2;
-	if (oif)
-		arg.bound_dev_if = oif;
+	arg.bound_dev_if = oif ? : vrf_master_dev_ifindex(skb_dst(skb)->dev);
+	if (!arg.bound_dev_if)
+		arg.bound_dev_if = vrf_master_dev_ifindex(skb->dev);
+
 	arg.tos = tos;
 	ip_send_unicast_reply(*this_cpu_ptr(net->ipv4.tcp_sk),
 			      skb, &TCP_SKB_CB(skb)->header.h4.opt,
@@ -1269,6 +1274,8 @@ struct sock *tcp_v4_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
 	ireq		      = inet_rsk(req);
 	sk_daddr_set(newsk, ireq->ir_rmt_addr);
 	sk_rcv_saddr_set(newsk, ireq->ir_loc_addr);
+	if (netif_index_is_vrf(sock_net(newsk), ireq->ir_iif))
+		newsk->sk_bound_dev_if = ireq->ir_iif;
 	newinet->inet_saddr	      = ireq->ir_loc_addr;
 	inet_opt	      = ireq->opt;
 	rcu_assign_pointer(newinet->inet_opt, inet_opt);
-- 
2.3.2 (Apple Git-55)
[PATCH net-next 07/16] net: Add inet_addr lookup by table
Currently inet_addr_type and inet_dev_addr_type expect local addresses to be in the local table. With the VRF device local routes for devices associated with a VRF will be in the table associated with the VRF. Provide an alternate inet_addr lookup to use a specific table rather than defaulting to the local table.

Signed-off-by: Shrijeet Mukherjee <s...@cumulusnetworks.com>
Signed-off-by: David Ahern <d...@cumulusnetworks.com>
---
 include/net/route.h     |  1 +
 net/ipv4/fib_frontend.c | 22 +++++++++++++++-------
 2 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/include/net/route.h b/include/net/route.h
index 54f97eea0fb2..3b51c339c269 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -192,6 +192,7 @@ void ipv4_sk_redirect(struct sk_buff *skb, struct sock *sk);
 void ip_rt_send_redirect(struct sk_buff *skb);
 
 unsigned int inet_addr_type(struct net *net, __be32 addr);
+unsigned int inet_addr_type_table(struct net *net, __be32 addr, int tb_id);
 unsigned int inet_dev_addr_type(struct net *net, const struct net_device *dev,
 				__be32 addr);
 void ip_rt_multicast_event(struct in_device *);
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 6e68a003d0fd..cc413b0170ed 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -214,12 +214,12 @@ void fib_flush_external(struct net *net)
  */
 static inline unsigned int __inet_dev_addr_type(struct net *net,
 						const struct net_device *dev,
-						__be32 addr)
+						__be32 addr, int tb_id)
 {
 	struct flowi4 fl4 = { .daddr = addr };
 	struct fib_result res;
 	unsigned int ret = RTN_BROADCAST;
-	struct fib_table *local_table;
+	struct fib_table *table;
 
 	if (ipv4_is_zeronet(addr) || ipv4_is_lbcast(addr))
 		return RTN_BROADCAST;
@@ -228,10 +228,10 @@ static inline unsigned int __inet_dev_addr_type(struct net *net,
 
 	rcu_read_lock();
 
-	local_table = fib_get_table(net, RT_TABLE_LOCAL);
-	if (local_table) {
+	table = fib_get_table(net, tb_id);
+	if (table) {
 		ret = RTN_UNICAST;
-		if (!fib_table_lookup(local_table, &fl4, &res, FIB_LOOKUP_NOREF)) {
+		if (!fib_table_lookup(table, &fl4, &res, FIB_LOOKUP_NOREF)) {
 			if (!dev || dev == res.fi->fib_dev)
 				ret = res.type;
 		}
@@ -241,16 +241,24 @@ static inline unsigned int __inet_dev_addr_type(struct net *net,
 	return ret;
 }
 
+unsigned int inet_addr_type_table(struct net *net, __be32 addr, int tb_id)
+{
+	return __inet_dev_addr_type(net, NULL, addr, tb_id);
+}
+EXPORT_SYMBOL(inet_addr_type_table);
+
 unsigned int inet_addr_type(struct net *net, __be32 addr)
 {
-	return __inet_dev_addr_type(net, NULL, addr);
+	return __inet_dev_addr_type(net, NULL, addr, RT_TABLE_LOCAL);
 }
 EXPORT_SYMBOL(inet_addr_type);
 
 unsigned int inet_dev_addr_type(struct net *net, const struct net_device *dev,
 				__be32 addr)
 {
-	return __inet_dev_addr_type(net, dev, addr);
+	int rt_table = vrf_dev_table(dev) ? : RT_TABLE_LOCAL;
+
+	return __inet_dev_addr_type(net, dev, addr, rt_table);
 }
 EXPORT_SYMBOL(inet_dev_addr_type);
 
-- 
2.3.2 (Apple Git-55)
[PATCH net-next 12/16] net: Add ipv4 route helper to set next hop
Signed-off-by: David Ahern <d...@cumulusnetworks.com>
---
 include/net/route.h |  3 +++
 net/ipv4/route.c    | 10 ++++++++++
 2 files changed, 13 insertions(+)

diff --git a/include/net/route.h b/include/net/route.h
index b14cbec93fbd..900d50fbcfc7 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -107,6 +107,7 @@ struct rt_cache_stat {
 extern struct ip_rt_acct __percpu *ip_rt_acct;
 
 struct in_device;
+struct fib_result;
 
 int ip_rt_init(void);
 void rt_cache_flush(struct net *net);
@@ -114,6 +115,8 @@ void rt_flush_dev(struct net_device *dev);
 struct rtable *ip_route_new_rtable(struct net_device *dev, unsigned int flags,
 				   u16 type, bool nopolicy, bool noxfrm,
 				   bool do_cache);
+void ip_route_set_nexthop(struct rtable *rt, __be32 daddr,
+			  const struct fib_result *res);
 struct rtable *__ip_route_output_key(struct net *, struct flowi4 *flp);
 struct rtable *ip_route_output_flow(struct net *, struct flowi4 *flp,
 				    struct sock *sk);
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 050a3c1d89ba..47dae001a000 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1537,6 +1537,16 @@ static int ip_route_input_mc(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 	return err;
 }
 
+void ip_route_set_nexthop(struct rtable *rt, __be32 daddr,
+			  const struct fib_result *res)
+{
+	struct fib_nh_exception *fnhe;
+
+	fnhe = find_exception(&FIB_RES_NH(*res), daddr);
+
+	rt_set_nexthop(rt, daddr, res, fnhe, res->fi, res->type, 0);
+}
+EXPORT_SYMBOL(ip_route_set_nexthop);
 
 static void ip_handle_martian_source(struct net_device *dev,
 				     struct in_device *in_dev,
-- 
2.3.2 (Apple Git-55)