[PATCH] net: Remove remaining remnants of pm_qos from netdevice.h

2015-05-09 Thread David Ahern
Commit e2c6544829f removed pm_qos from struct net_device but left the
comment and the header file include behind. Remove those.

Signed-off-by: David Ahern dsah...@gmail.com
Cc: Thomas Graf tg...@suug.ch
---
 include/linux/netdevice.h | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 1899c74a7127..05b9a694e213 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -25,7 +25,6 @@
 #ifndef _LINUX_NETDEVICE_H
 #define _LINUX_NETDEVICE_H
 
-#include <linux/pm_qos.h>
 #include <linux/timer.h>
 #include <linux/bug.h>
 #include <linux/delay.h>
@@ -1499,8 +1498,6 @@ enum netdev_priv_flags {
  *
  * @qdisc_tx_busylock: XXX: need comments on this one
  *
- * @pm_qos_req:Power Management QoS object
- *
  * FIXME: cleanup struct net_device such that network protocol info
  * moves out.
  */
-- 
2.2.1

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] e1000e: Add pm_qos header

2015-05-12 Thread David Ahern
Commit e2c6544829f moved pm_qos_req to e1000_adapter. Add the header file
that defines the struct.

Signed-off-by: David Ahern dsah...@gmail.com
Cc: Thomas Graf tg...@suug.ch
Cc: Jeff Kirsher jeffrey.t.kirs...@intel.com
---
 drivers/net/ethernet/intel/e1000e/e1000.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/intel/e1000e/e1000.h b/drivers/net/ethernet/intel/e1000e/e1000.h
index 5d9ceb17b4cb..0abc942c966e 100644
--- a/drivers/net/ethernet/intel/e1000e/e1000.h
+++ b/drivers/net/ethernet/intel/e1000e/e1000.h
@@ -40,6 +40,7 @@
 #include <linux/ptp_classify.h>
 #include <linux/mii.h>
 #include <linux/mdio.h>
+#include <linux/pm_qos.h>
 #include "hw.h"
 
 struct e1000_info;
-- 
2.2.1



Re: [RFC net-next 2/3] VRF driver and needed infrastructure

2015-06-08 Thread David Ahern

On 6/8/15 12:35 PM, Shrijeet Mukherjee wrote:

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 019fcef..27a333c 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -283,6 +283,12 @@ config NLMON
  diagnostics, etc. This is mostly intended for developers or support
  to debug netlink issues. If unsure, say N.

+config NET_VRF
+   tristate "Virtual Routing and Forwarding (Lite)"
+   ---help---
+  This option enables the support for mapping interfaces into VRF's. The
+  support enables VRF devices
+
  endif # NET_CORE

  config SUNGEM_PHY


I think you need:

depends on IP_MULTIPLE_TABLES && IPV6_MULTIPLE_TABLES

David


Re: [RFC net-next 0/3] Proposal for VRF-lite

2015-06-08 Thread David Ahern

On 6/8/15 12:35 PM, Shrijeet Mukherjee wrote:

5. Debugging is built-in as tcpdump and counters on the VRF device
works as is.


Is the intent that something like this

  tcpdump -i vrf0

can be used to see vrf traffic?

vrf_handle_frame only bumps counters; it does not switch skb->dev to the
vrf device, so on the Rx path tcpdump will not get the packets, i.e., tcpdump
only shows outbound packets.
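
To illustrate the point, here is a deliberately simplified userspace model, not kernel code. The `model_*` names are hypothetical, and the model ignores the real ordering of taps and rx handlers in the receive path. It only shows that a capture bound to the vrf device sees an inbound packet if the packet's device pointer is switched to the vrf device, and does not if the handler merely bumps counters.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct net_device and struct sk_buff. */
struct model_dev {
	const char *name;
	unsigned long rx_packets;	/* device counter */
	unsigned long tap_seen;		/* packets a capture on this device would see */
};

struct model_skb {
	struct model_dev *dev;		/* plays the role of skb->dev */
};

/* A capture bound to a device only sees packets whose dev points at it. */
static void model_deliver_to_taps(struct model_skb *skb)
{
	assert(skb != NULL && skb->dev != NULL);
	skb->dev->tap_seen++;
}

/* Variant A: bump the vrf counters only and leave skb->dev alone. */
static void model_rx_count_only(struct model_skb *skb, struct model_dev *vrf)
{
	vrf->rx_packets++;
	model_deliver_to_taps(skb);
}

/* Variant B: also re-point the packet at the vrf device before the taps run. */
static void model_rx_switch_dev(struct model_skb *skb, struct model_dev *vrf)
{
	vrf->rx_packets++;
	skb->dev = vrf;
	model_deliver_to_taps(skb);
}
```

With variant A the vrf counters increase but a capture on the vrf device stays empty for inbound traffic; with variant B the capture sees the packet.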



Re: [RFC net-next 3/3] rcv path changes for vrf traffic

2015-06-08 Thread David Ahern

On 6/8/15 1:58 PM, Hannes Frederic Sowa wrote:

Hi Shrijeet,

On Mo, 2015-06-08 at 11:35 -0700, Shrijeet Mukherjee wrote:

From: Shrijeet Mukherjee s...@cumulusnetworks.com

Incoming frames for IP protocol stacks need the IIF to be changed
from the actual interface to the VRF device. This allows the IIF
rule to be used to select tables (or do regular PBR)

This change selects the iif to be the VRF device if it exists and
the incoming iif is enslaved to the VRF device.

Since VRF aware sockets are always bound to the VRF device this
system allows return traffic to find the socket of origin.

changes are in the arp_rcv, icmp_rcv and ip_rcv paths

Question : I did not wrap the rcv modifications, in CONFIG_NET_VRF
as it would create code variations and the vrf_ptr check is there
I can make that whole thing modular.


From an architectural level I think the output path looks good. For the
input path I would also like to propose my (I think) more flexible solution:



Something is still not right on the output path. e.g., I see the wrong 
source address showing up on ping -I vrf0:


# ping -I vrf0 1.1.1.254
ping: Warning: source address might be selected on device other than vrf0.
PING 1.1.1.254 (1.1.1.254) from 172.16.1.52 vrf0: 56(84) bytes of data.
64 bytes from 1.1.1.254: icmp_seq=1 ttl=64 time=0.215 ms
...

The reason is that the datagram connect function fails to look up the
outbound route in the vrf and falls back to the main table. (As an aside,
the fallback to other tables is something that should not be happening
for VRFs; you want to use the table specific to the VRF.)


The route lookup fails because it passes in oif = vrf device (this VRF 
design relies on bind to device which sets oif in the flow). That is 
good for selecting the table to use for the lookups, but not good for 
selecting the route within the table.


This is one way to fix the connect problem:

diff --git a/include/net/route.h b/include/net/route.h
index fe22d03afb6a..a18798caec25 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -245,11 +245,18 @@ static inline void ip_route_connect_init(struct flowi4 *fl4, __be32 dst, __be32
                                          __be16 sport, __be16 dport,
                                          struct sock *sk)
 {
+   struct net_device *dev = dev_get_by_index(sock_net(sk), oif);
    __u8 flow_flags = 0;

    if (inet_sk(sk)->transparent)
            flow_flags |= FLOWI_FLAG_ANYSRC;

+   if (dev) {
+           if (netif_is_vrf(dev))
+                   flow_flags |= FLOWI_FLAG_VRFSRC;
+           dev_put(dev);
+   }
+
    flowi4_init_output(fl4, oif, sk->sk_mark, tos, RT_SCOPE_UNIVERSE,
                       protocol, flow_flags, dst, src, dport, sport);
 }


which essentially tells fib_table_lookup to drop the OIF comparison 
after selecting the table per this change made in the patch Shrijeet posted:


if (!(flp->flowi4_flags & FLOWI_FLAG_VRFSRC)) {
        if (flp->flowi4_oif &&
            flp->flowi4_oif != nh->nh_oif)
                continue;
}
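
To make the effect of that check concrete, here is a small userspace model of the nexthop selection; it is a sketch, not the fib_trie code. The `model_*` types and `model_select_nh()` are hypothetical stand-ins, and only the flag value 0x04 is taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_FLAG_VRFSRC 0x04	/* mirrors FLOWI_FLAG_VRFSRC from the patch */

struct model_nh {		/* stand-in for struct fib_nh */
	int oif;
};

struct model_flow {		/* stand-in for struct flowi4 */
	int oif;
	unsigned int flags;
};

/* Return the index of the first nexthop acceptable for the flow, or -1. */
static int model_select_nh(const struct model_flow *flp,
			   const struct model_nh *nhs, size_t n)
{
	size_t i;

	assert(flp != NULL && (nhs != NULL || n == 0));

	for (i = 0; i < n; i++) {
		/* The OIF comparison is skipped when the VRF flag is set. */
		if (!(flp->flags & MODEL_FLAG_VRFSRC)) {
			if (flp->oif && flp->oif != nhs[i].oif)
				continue;
		}
		return (int)i;
	}
	return -1;
}
```

With the flow's oif set to the vrf device index and no flag, no nexthop matches, since the routes in the table point at the enslaved devices. Once the flag is set, the first nexthop in the table is accepted.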



Re: [RFC net-next 3/3] rcv path changes for vrf traffic

2015-06-08 Thread David Ahern

On 6/8/15 1:58 PM, Hannes Frederic Sowa wrote:

For rx layer I want to also propose my try:

[PATCH net-next RFC] net: ipv4: arp: strong end system model semantics by 
per-interface local table override



I applied only the first 2 patches from Shrijeet and then tried to apply
your patch; it doesn't apply. Way too many failures. What branch should
it apply to?





Re: [RFC net-next 0/3] Proposal for VRF-lite

2015-06-09 Thread David Ahern

Hi Nicolas:

On 6/9/15 2:58 AM, Nicolas Dichtel wrote:

I'm not really in favor of the name 'vrf'. This term is very controversial and
having a consensus of what is/contains a 'vrf' is quite impossible.
There was already a lot of discussions about this topic on quagga ml that show
that everybody has a different opinion about this term ;-)


Are you referring to this thread?
https://lists.quagga.net/pipermail/quagga-dev/2014-November/011795.html

I could see differing opinions regarding the implementation of a VRF; is 
there really a controversy on what a VRF is?


David


Re: [PATCH net-next] switchdev: fix BUG when port driver doesn't support set attr op

2015-06-10 Thread David Ahern

On 6/10/15 2:56 PM, sfel...@gmail.com wrote:

From: Scott Feldman sfel...@gmail.com

Fix a BUG() where CONFIG_NET_SWITCHDEV is set but the driver for a bridged
port does not support switchdev_port_attr_set op.  Don't BUG() if
-EOPNOTSUPP is returned.

Signed-off-by: Scott Feldman sfel...@gmail.com
Reported-by: Brenden Blanco bbla...@plumgrid.com
---
  net/switchdev/switchdev.c |2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
index e008057..99bced4 100644
--- a/net/switchdev/switchdev.c
+++ b/net/switchdev/switchdev.c
@@ -103,7 +103,7 @@ static void switchdev_port_attr_set_work(struct work_struct *work)

    rtnl_lock();
    err = switchdev_port_attr_set(asw->dev, &asw->attr);
-   BUG_ON(err);
+   BUG_ON(err && err != -EOPNOTSUPP);
    rtnl_unlock();

    dev_put(asw->dev);



Should that be WARN_ON instead of BUG_ON?


Re: [PATCH net-next] switchdev: fix BUG when port driver doesn't support set attr op

2015-06-10 Thread David Ahern

On 6/10/15 3:47 PM, Scott Feldman wrote:

Should that be WARN_ON instead of BUG_ON?


I think I had it as WARN when we were working on the initial patches,
but we changed it to BUG_ON because we should only get an error here
if the driver screwed something up between PREPARE phase and COMMIT
phase, so it should be considered a driver bug which needs fixing.



Linus rants from time to time about the prolific use of BUG_ON. e.g.,
https://lkml.org/lkml/2015/4/28/528

'BUG_ON() is for things where our internal data structures are so 
corrupted that we don't know what to do, and there's no way to continue. 
Not for I want to sprinkle these things around and this should not 
happen.'
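
As a rough userspace sketch of what a non-fatal variant of the check would express: treat success and -EOPNOTSUPP as fine, and report anything else without halting. `model_check_commit()` and the enum are hypothetical names; in the kernel the reporting branch would be something like WARN_ON(), which logs and continues.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

enum model_action {
	MODEL_OK,	/* nothing to report */
	MODEL_WARN	/* report the failure, then carry on */
};

/* Classify the commit-phase return value without stopping the program. */
static enum model_action model_check_commit(int err)
{
	/* kernel-style return values: 0 or a negative errno */
	assert(err <= 0);

	if (err == 0 || err == -EOPNOTSUPP)
		return MODEL_OK;

	fprintf(stderr, "attr set failed in commit phase: err=%d\n", err);
	return MODEL_WARN;
}
```

The caller can then still unlock and drop its reference regardless of the outcome.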


David


Re: [RFC net-next 3/6] net: Introduce VRF device driver - v2

2015-07-06 Thread David Ahern

On 7/6/15 10:37 AM, Nikolay Aleksandrov wrote:

+static int vrf_add_slave(struct net_device *dev,
+                         struct net_device *port_dev)
+{
+   if (!dev || !port_dev || dev_net(dev) != dev_net(port_dev))
+           return -ENODEV;
+
+   if (!vrf_is_master(port_dev) && !vrf_is_slave(port_dev)) {
+           struct slave *s = kzalloc(sizeof(*s), GFP_KERNEL);
+           struct net_vrf *vrf = netdev_priv(dev);
+           struct slave_queue *queue = &vrf->queue;
+           bool is_running = netif_running(port_dev);
+           unsigned int flags = port_dev->flags;
+           int ret;
+
+           if (!s)
+                   return -ENOMEM;
+
+           s->dev = port_dev;
+
+           spin_lock_bh(&queue->lock);
+           __vrf_insert_slave(queue, s, dev);
+           spin_unlock_bh(&queue->lock);
+
+           port_dev->vrf_ptr = kmalloc(sizeof(*port_dev->vrf_ptr),
+                                       GFP_KERNEL);
+           if (!port_dev->vrf_ptr)
+                   return -ENOMEM;

 ^
I believe you'll have a slave in the list with inconsistent state, which could
even lead to a null pointer dereference if vrf_ptr is used. Also, __vrf_insert_slave
does dev_hold, so the dev refcnt will be incorrect as well.


Right. Good catch, will fix.




+
+           port_dev->vrf_ptr->ifindex = dev->ifindex;
+           port_dev->vrf_ptr->tb_id = vrf->tb_id;
+
+           /* register the packet handler for slave ports */
+           ret = netdev_rx_handler_register(port_dev, vrf_handle_frame,
+                                            (void *)dev);
+           if (ret) {
+                   netdev_err(port_dev,
+                              "Device %s failed to register rx_handler\n",
+                              port_dev->name);
+                   kfree(port_dev->vrf_ptr);
+                   kfree(s);
+                   return ret;

 ^^
The slave is being freed while still on the list here, the device's refcnt
will be wrong, etc.


ack. Will fix.




+           }
+
+           if (is_running) {
+                   ret = dev_change_flags(port_dev, flags & ~IFF_UP);
+                   if (ret < 0)
+                           goto out_fail;
+           }
+
+           ret = netdev_master_upper_dev_link(port_dev, dev);
+           if (ret < 0)
+                   goto out_fail;
+
+           if (is_running) {
+                   ret = dev_change_flags(port_dev, flags);
+                   if (ret < 0)
+                           goto out_fail;
+           }
+
+           port_dev->flags |= IFF_SLAVE;
+
+           return 0;
+
+out_fail:
+           spin_lock_bh(&queue->lock);
+           __vrf_kill_slave(queue, s);
+           spin_unlock_bh(&queue->lock);


__vrf_kill_slave() doesn't do upper device unlink and the device can be linked
if we fail in the dev_change_flags above.


will fix.




+
+           return ret;
+   }
+
+   return -EINVAL;
+}


In my opinion the structure of the above function should change to something more
straightforward with proper exit labels and cleanup upon failure; also a level of
indentation can be avoided.


Sure. The indentation comes after the pointer checks so locals can be
initialized when declared. Will work on the clean up/simplification for
next rev.
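
For reference, a minimal userspace sketch of the exit-label structure being asked for. This is not the planned rewrite of vrf_add_slave(); `struct model_port`, `model_add_slave()` and the `fail_at` knob are hypothetical and exist only to show each failure point unwinding exactly what was set up before it.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct model_port {
	int in_list;		/* set while the slave is linked into the queue */
	int refcnt;		/* stands in for dev_hold()/dev_put() */
	void *vrf_ptr;		/* per-port VRF info */
	int handler_set;	/* rx handler registered */
};

/* fail_at selects which step fails (0 = none); purely for illustration. */
static int model_add_slave(struct model_port *p, int fail_at)
{
	int ret;

	assert(p != NULL);

	/* step 1: allocate per-port info before touching shared state */
	p->vrf_ptr = (fail_at == 1) ? NULL : malloc(16);
	if (!p->vrf_ptr)
		return -ENOMEM;

	/* step 2: insert into the list, which takes a reference */
	p->in_list = 1;
	p->refcnt++;

	/* step 3: register the rx handler */
	if (fail_at == 3) {
		ret = -EBUSY;
		goto err_remove;
	}
	p->handler_set = 1;

	/* step 4: link to the master device */
	if (fail_at == 4) {
		ret = -EINVAL;
		goto err_unregister;
	}

	return 0;

err_unregister:
	p->handler_set = 0;
err_remove:
	p->in_list = 0;
	p->refcnt--;
	free(p->vrf_ptr);
	p->vrf_ptr = NULL;
	return ret;
}
```

Ordering the labels in reverse of the setup steps means a later failure falls through all earlier cleanups, so the list, refcount and allocation cannot get out of sync.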





+
+static int vrf_del_slave(struct net_device *dev,
+                         struct net_device *port_dev)
+{
+   struct net_vrf *vrf = netdev_priv(dev);
+   struct slave_queue *queue = &vrf->queue;
+   struct slave *slave = __vrf_find_slave_dev(queue, port_dev);
+   bool is_running = netif_running(port_dev);
+   unsigned int flags = port_dev->flags;
+   int ret = 0;


ret seems unused/unchecked in this function


It is used but not checked. I struggled with what to do on the error 
path. Do we want netdev_err() on a failure?





+
+   if (!slave)
+           return -EINVAL;
+
+   if (is_running)
+           ret = dev_change_flags(port_dev, flags & ~IFF_UP);
+
+   spin_lock_bh(&queue->lock);
+   __vrf_kill_slave(queue, slave);
+   spin_unlock_bh(&queue->lock);
+
+   netdev_upper_dev_unlink(port_dev, dev);
+
+   if (is_running)
+           ret = dev_change_flags(port_dev, flags);
+
+   return 0;
+}
+
+static int vrf_dev_init(struct net_device *dev)
+{
+   struct net_vrf *vrf = netdev_priv(dev);
+
+   spin_lock_init(&vrf->queue.lock);
+   INIT_LIST_HEAD(&vrf->queue.all_slaves);
+   vrf->queue.master_dev = dev;
+
+   dev->dstats = netdev_alloc_pcpu_stats(struct pcpu_dstats);
+   dev->flags = IFF_MASTER | IFF_NOARP;
+   if (!dev->dstats)
+           return -ENOMEM;

 ^
nit: I'd suggest moving the check after the allocation


agreed.

David

[RFC net-next 5/6] net: Add sk_bind_dev_if to task_struct

2015-07-06 Thread David Ahern
Allow tasks to have a default device index for binding sockets. If set
the value is passed to all AF_INET/AF_INET6 sockets when they are created.

The task setting is passed from parent to child on fork, but can be set or
changed after task creation using prctl() (if the task has CAP_NET_ADMIN
permissions). The setting for a task can be retrieved using prctl().
This option allows an administrator to restrict a task to only send/receive
packets through the specified device. In the case of VRF devices this
option restricts tasks to a specific VRF.

Correlation of the device index to a specific VRF, i.e.,
   ifindex -- VRF device -- VRF id
is left to userspace.
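
The inheritance described above can be pictured with a small userspace model. It is only a sketch of the semantics stated in this message: the `model_*` names are hypothetical, and the assumption that a new socket's bound device index is copied from the creating task's default is inferred from the description, not from the kernel hunks.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for task_struct and struct sock. */
struct model_task {
	int sk_bind_dev_if;	/* default device index, 0 = none */
};

struct model_sock {
	int sk_bound_dev_if;	/* device index the socket is bound to, 0 = none */
};

/* fork: the child starts with the parent's default device index. */
static struct model_task model_fork(const struct model_task *parent)
{
	struct model_task child;

	assert(parent != NULL);
	child.sk_bind_dev_if = parent->sk_bind_dev_if;
	return child;
}

/* socket creation: the new socket picks up the creator's default (assumed). */
static struct model_sock model_socket(const struct model_task *creator)
{
	struct model_sock sk;

	assert(creator != NULL);
	sk.sk_bound_dev_if = creator->sk_bind_dev_if;
	return sk;
}
```

Changing the child's value afterwards, which is what the set prctl would do, leaves the parent untouched and affects only sockets the child creates from then on.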

Example using VRF devices:
1. vrf1 is created and assigned to table 5
2. eth2 is enslaved to vrf1
3. eth2 is given the address 1.1.1.1/24

$ ip route ls table 5
prohibit default
1.1.1.0/24 dev eth2  scope link
local 1.1.1.1 dev eth2  proto kernel  scope host  src 1.1.1.1

Without setting a VRF context, ping, TCP and UDP attempts fail, e.g.,
$ ping 1.1.1.254
connect: Network is unreachable

After binding the task to the vrf device ping succeeds:
$ ./chvrf -v 1 ping -c1 1.1.1.254
PING 1.1.1.254 (1.1.1.254) 56(84) bytes of data.
64 bytes from 1.1.1.254: icmp_seq=1 ttl=64 time=2.32 ms
---
 include/linux/sched.h  |  3 +++
 include/uapi/linux/prctl.h |  4 
 kernel/fork.c  |  2 ++
 kernel/sys.c   | 35 +++
 net/ipv4/af_inet.c |  1 +
 net/ipv6/af_inet6.c|  1 +
 6 files changed, 46 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6633e83e608a..0b6ab0e2ea57 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1543,6 +1543,9 @@ struct task_struct {
struct files_struct *files;
 /* namespaces */
struct nsproxy *nsproxy;
+/* network */
+   /* if set INET/INET6 sockets are bound to given dev index on create */
+   int sk_bind_dev_if;
 /* signal handlers */
struct signal_struct *signal;
struct sighand_struct *sighand;
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 31891d9535e2..1ef45195d146 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -190,4 +190,8 @@ struct prctl_mm_map {
 # define PR_FP_MODE_FR         (1 << 0)        /* 64b FP registers */
 # define PR_FP_MODE_FRE        (1 << 1)        /* 32b compatibility */
 
+/* get/set network interface sockets are bound to by default */
+#define PR_SET_SK_BIND_DEV_IF   47
+#define PR_GET_SK_BIND_DEV_IF   48
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 0bb88b50..d2c7f32370ef 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -375,6 +375,8 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
    tsk->splice_pipe = NULL;
    tsk->task_frag.page = NULL;
 
+   tsk->sk_bind_dev_if = orig->sk_bind_dev_if;
+
account_kernel_stack(ti, 1);
 
return tsk;
diff --git a/kernel/sys.c b/kernel/sys.c
index 8571296b7ddb..7e56fb9dbf8e 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -52,6 +52,7 @@
 #include <linux/rcupdate.h>
 #include <linux/uidgid.h>
 #include <linux/cred.h>
+#include <linux/netdevice.h>
 
 #include <linux/kmsg_dump.h>
 /* Move somewhere else to avoid recompiling? */
@@ -2243,6 +2244,40 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
    case PR_GET_FP_MODE:
            error = GET_FP_MODE(me);
            break;
+#ifdef CONFIG_NET
+   case PR_SET_SK_BIND_DEV_IF:
+   {
+           struct net_device *dev;
+           int idx = (int) arg2;
+
+           if (!capable(CAP_NET_ADMIN))
+                   return -EPERM;
+
+           if (idx) {
+                   dev = dev_get_by_index(me->nsproxy->net_ns, idx);
+                   if (!dev)
+                           return -EINVAL;
+                   dev_put(dev);
+           }
+           me->sk_bind_dev_if = idx;
+           break;
+   }
+   case PR_GET_SK_BIND_DEV_IF:
+   {
+           struct task_struct *tsk;
+           int sk_bind_dev_if = -EINVAL;
+
+           rcu_read_lock();
+           tsk = find_task_by_vpid(arg2);
+           if (tsk)
+                   sk_bind_dev_if = tsk->sk_bind_dev_if;
+           rcu_read_unlock();
+           if (tsk != me && !capable(CAP_NET_ADMIN))
+                   return -EPERM;
+           error = sk_bind_dev_if;
+           break;
+   }
+#endif
default:
error = -EINVAL;
break;
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 9532ee87151f..a3b24f14e378 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -350,6 +350,7 @@ static int inet_create(struct net *net, struct socket *sock, int protocol,
    sk->sk_destruct    = inet_sock_destruct;
    sk->sk_protocol    = protocol;
    sk->sk_backlog_rcv = sk->sk_prot->backlog_rcv;
+   

[RFC net-next 3/6] net: Introduce VRF device driver - v2

2015-07-06 Thread David Ahern
This driver borrows heavily from IPvlan and teaming drivers.

Routing domains (VRF-lite) are created by instantiating a device
and enslaving all routed interfaces that participate in the domain.
As part of the enslavement, all local routes pointing to enslaved
devices are re-pointed to the vrf device, thus forcing outgoing
sockets to bind to the vrf to function.

Standard FIB rules can then bind the VRF device to tables and regular
fib rule processing is followed.

Routed traffic through the box is forwarded by using the VRF device as
the IIF and following the IIF rule to a table which is mated with
the VRF.

Locally originated traffic is directed at the VRF device using
SO_BINDTODEVICE or cmsg headers. This in turn drops the packet into
the xmit function of the vrf driver, which then completes the ip lookup
and output.

This solution is completely orthogonal to namespaces and allows the L3
equivalent of vlans to exist, allowing the routing space to be
partitioned.

Example:

   Create vrf 1:
 ip link add vrf1 type vrf table 5
 ip rule add iif vrf1 table 5
 ip rule add oif vrf1 table 5
 ip route add table 5 prohibit default
 ip link set vrf1 up

   Add interface to vrf 1:
 ip link set eth1 master vrf1

Signed-off-by: Shrijeet Mukherjee s...@cumulusnetworks.com
Signed-off-by: David Ahern d...@cumulusnetworks.com

v2:
- addressed comments from first RFC
- significant changes to improve simplicity of implementation
---
 drivers/net/Kconfig  |   7 +
 drivers/net/Makefile |   1 +
 drivers/net/vrf.c| 486 +++
 include/net/vrf.h|  71 
 4 files changed, 565 insertions(+)
 create mode 100644 drivers/net/vrf.c
 create mode 100644 include/net/vrf.h

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 019fceffc9e5..b040aa233408 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -283,6 +283,13 @@ config NLMON
  diagnostics, etc. This is mostly intended for developers or support
  to debug netlink issues. If unsure, say N.
 
+config NET_VRF
+   tristate "Virtual Routing and Forwarding (Lite)"
+   depends on IP_MULTIPLE_TABLES && IPV6_MULTIPLE_TABLES
+   ---help---
+  This option enables the support for mapping interfaces into VRF's. The
+  support enables VRF devices
+
 endif # NET_CORE
 
 config SUNGEM_PHY
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index c12cb22478a7..ca16dd689b36 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -25,6 +25,7 @@ obj-$(CONFIG_VIRTIO_NET) += virtio_net.o
 obj-$(CONFIG_VXLAN) += vxlan.o
 obj-$(CONFIG_GENEVE) += geneve.o
 obj-$(CONFIG_NLMON) += nlmon.o
+obj-$(CONFIG_NET_VRF) += vrf.o
 
 #
 # Networking Drivers
diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
new file mode 100644
index ..b9f9ae68388d
--- /dev/null
+++ b/drivers/net/vrf.c
@@ -0,0 +1,487 @@
+/*
+ * vrf.c: device driver to encapsulate a VRF space
+ *
+ * Copyright (c) 2015 Cumulus Networks
+ *
+ * Based on dummy, team and ipvlan drivers
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/ip.h>
+#include <linux/init.h>
+#include <linux/moduleparam.h>
+#include <linux/rtnetlink.h>
+#include <net/rtnetlink.h>
+#include <net/arp.h>
+#include <linux/u64_stats_sync.h>
+#include <linux/hashtable.h>
+
+#include <linux/inetdevice.h>
+#include <net/ip.h>
+#include <net/ip_fib.h>
+#include <net/ip6_route.h>
+#include <net/rtnetlink.h>
+#include <net/route.h>
+#include <net/addrconf.h>
+#include <net/vrf.h>
+
+#define DRV_NAME       "vrf"
+#define DRV_VERSION    "1.0"
+
+#define vrf_is_slave(dev)   ((dev)->flags & IFF_SLAVE)
+#define vrf_is_master(dev)  ((dev)->flags & IFF_MASTER)
+
+#define vrf_master_get_rcu(dev) \
+   ((struct net_device *)rcu_dereference(dev->rx_handler_data))
+
+struct pcpu_dstats {
+   u64                     tx_pkts;
+   u64                     tx_bytes;
+   u64                     tx_drps;
+   u64                     rx_pkts;
+   u64                     rx_bytes;
+   struct u64_stats_sync   syncp;
+};
+
+struct slave {
+   struct list_head        list;
+   struct net_device       *dev;
+   long                    priority;
+};
+
+struct slave_queue {
+   spinlock_t              lock; /* lock for slave insert/delete */
+   struct list_head        all_slaves;
+   int                     num_slaves;
+   struct net_device       *master_dev;
+};
+
+struct net_vrf {
+   struct slave_queue      queue;
+   struct fib_table        *tb;
+   u32                     tb_id;
+};
+
+static int is_ip_rx_frame(struct sk_buff *skb)
+{
+   switch (skb->protocol) {
+   case htons

[RFC net-next 4/6] net: Modifications to ipv4 stack for VRF devices

2015-07-06 Thread David Ahern
With the following tweaks to the IPv4 stack:
- enslaving devices to a VRF device automatically moves routes to the
  VRF table; removing the VRF master moves routes back to the main table

- the following use cases work for both Rx and Tx:
  + ICMP (ping -I vrf-device ip)
  + TCP server and client bound to VRF device
  + TCP server not bound to VRF device but working through it
* client connections are bound to VRF device
  + UDP server and client bound to VRF device

Signed-off-by: Shrijeet Mukherjee s...@cumulusnetworks.com
Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 include/net/flow.h|  1 +
 include/net/inet_hashtables.h |  9 +++--
 include/net/route.h   |  4 
 net/ipv4/fib_frontend.c   | 30 --
 net/ipv4/fib_semantics.c  | 25 -
 net/ipv4/fib_trie.c   |  7 +--
 net/ipv4/icmp.c   |  4 
 net/ipv4/ping.c   |  3 ++-
 net/ipv4/raw.c|  5 +++--
 net/ipv4/route.c  | 12 ++--
 net/ipv4/syncookies.c |  4 +++-
 net/ipv4/tcp_input.c  |  6 +-
 net/ipv4/tcp_ipv4.c   |  6 --
 net/ipv4/udp.c|  2 ++
 14 files changed, 90 insertions(+), 28 deletions(-)

diff --git a/include/net/flow.h b/include/net/flow.h
index 8109a159d1b3..69aaa99fdeb8 100644
--- a/include/net/flow.h
+++ b/include/net/flow.h
@@ -29,6 +29,7 @@ struct flowi_common {
    __u8    flowic_flags;
 #define FLOWI_FLAG_ANYSRC      0x01
 #define FLOWI_FLAG_KNOWN_NH    0x02
+#define FLOWI_FLAG_VRFSRC      0x04
    __u32   flowic_secid;
 };
 
diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index b73c88a19dd4..e26c43823a13 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -31,6 +31,7 @@
 #include <net/route.h>
 #include <net/tcp_states.h>
 #include <net/netns/hash.h>
+#include <net/vrf.h>
 
 #include <linux/atomic.h>
 #include <asm/byteorder.h>
@@ -300,10 +301,14 @@ static inline struct sock *__inet_lookup(struct net *net,
 struct inet_hashinfo *hashinfo,
 const __be32 saddr, const __be16 sport,
 const __be32 daddr, const __be16 dport,
-const int dif)
+int dif)
 {
u16 hnum = ntohs(dport);
-   struct sock *sk = __inet_lookup_established(net, hashinfo,
+   struct sock *sk;
+
+   dif = vrf_get_master_dev_idx(net, dif) ? : dif;
+
+   sk = __inet_lookup_established(net, hashinfo,
saddr, sport, daddr, hnum, dif);
 
return sk ? : __inet_lookup_listener(net, hashinfo, saddr, sport,
diff --git a/include/net/route.h b/include/net/route.h
index fe22d03afb6a..460333bab217 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -188,6 +188,7 @@ void ipv4_sk_redirect(struct sk_buff *skb, struct sock *sk);
 void ip_rt_send_redirect(struct sk_buff *skb);
 
 unsigned int inet_addr_type(struct net *net, __be32 addr);
+unsigned int inet_addr_type_table(struct net *net, __be32 addr, int tb_id);
 unsigned int inet_dev_addr_type(struct net *net, const struct net_device *dev,
__be32 addr);
 void ip_rt_multicast_event(struct in_device *);
@@ -250,6 +251,9 @@ static inline void ip_route_connect_init(struct flowi4 *fl4, __be32 dst, __be32
    if (inet_sk(sk)->transparent)
            flow_flags |= FLOWI_FLAG_ANYSRC;
 
+   if (netif_idx_is_vrf(sock_net(sk), oif))
+           flow_flags |= FLOWI_FLAG_VRFSRC;
+
    flowi4_init_output(fl4, oif, sk->sk_mark, tos, RT_SCOPE_UNIVERSE,
                       protocol, flow_flags, dst, src, dport, sport);
 }
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 974fa51effca..7c73eb058c91 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -45,6 +45,7 @@
 #include <net/ip_fib.h>
 #include <net/rtnetlink.h>
 #include <net/xfrm.h>
+#include <net/vrf.h>
 
 #ifndef CONFIG_IP_MULTIPLE_TABLES
 
@@ -212,7 +213,7 @@ void fib_flush_external(struct net *net)
  */
 static inline unsigned int __inet_dev_addr_type(struct net *net,
const struct net_device *dev,
-   __be32 addr)
+   __be32 addr, int rt_table)
 {
struct flowi4   fl4 = { .daddr = addr };
struct fib_result   res;
@@ -225,8 +226,7 @@ static inline unsigned int __inet_dev_addr_type(struct net *net,
return RTN_MULTICAST;
 
rcu_read_lock();
-
-   local_table = fib_get_table(net, RT_TABLE_LOCAL);
+   local_table = fib_get_table(net, rt_table);
if (local_table) {
ret = RTN_UNICAST;
            if (!fib_table_lookup(local_table, &fl4, &res

[RFC PATCH] iproute2: Add support for VRF device

2015-07-06 Thread David Ahern
Allow user to create a vrf device and specify its table binding.
Based on the iplink_vlan implementation.

Signed-off-by: Shrijeet Mukherjee s...@cumulusnetworks.com
Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 include/linux/if_link.h |  8 +
 ip/Makefile |  2 +-
 ip/iplink.c |  2 +-
 ip/iplink_vrf.c | 88 +
 4 files changed, 98 insertions(+), 2 deletions(-)
 create mode 100644 ip/iplink_vrf.c

diff --git a/include/linux/if_link.h b/include/linux/if_link.h
index 8df6a8466839..28872fbf6814 100644
--- a/include/linux/if_link.h
+++ b/include/linux/if_link.h
@@ -337,6 +337,14 @@ enum macvlan_macaddr_mode {
 
 #define MACVLAN_FLAG_NOPROMISC 1
 
+/* VRF section */
+enum {
+   IFLA_VRF_UNSPEC,
+   IFLA_VRF_TABLE,
+   __IFLA_VRF_MAX
+};
+
+#define IFLA_VRF_MAX (__IFLA_VRF_MAX - 1)
 /* IPVLAN section */
 enum {
IFLA_IPVLAN_UNSPEC,
diff --git a/ip/Makefile b/ip/Makefile
index 77653ecc5785..d8b38ac2e44b 100644
--- a/ip/Makefile
+++ b/ip/Makefile
@@ -7,7 +7,7 @@ IPOBJ=ip.o ipaddress.o ipaddrlabel.o iproute.o iprule.o ipnetns.o \
 iplink_vxlan.o tcp_metrics.o iplink_ipoib.o ipnetconf.o link_ip6tnl.o \
 link_iptnl.o link_gre6.o iplink_bond.o iplink_bond_slave.o iplink_hsr.o \
 iplink_bridge.o iplink_bridge_slave.o ipfou.o iplink_ipvlan.o \
-iplink_geneve.o
+iplink_geneve.o iplink_vrf.o
 
 RTMONOBJ=rtmon.o
 
diff --git a/ip/iplink.c b/ip/iplink.c
index e296e6f611b8..892e8bc8808b 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -93,7 +93,7 @@ void iplink_usage(void)
            fprintf(stderr, "TYPE := { vlan | veth | vcan | dummy | ifb | macvlan | macvtap |\n");
            fprintf(stderr, "          bridge | bond | ipoib | ip6tnl | ipip | sit | vxlan |\n");
            fprintf(stderr, "          gre | gretap | ip6gre | ip6gretap | vti | nlmon |\n");
-           fprintf(stderr, "          bond_slave | ipvlan | geneve }\n");
+           fprintf(stderr, "          bond_slave | ipvlan | geneve | vrf }\n");
}
exit(-1);
 }
diff --git a/ip/iplink_vrf.c b/ip/iplink_vrf.c
new file mode 100644
index ..8d66802cf940
--- /dev/null
+++ b/ip/iplink_vrf.c
@@ -0,0 +1,88 @@
+/* iplink_vrf.c    VRF device support
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License
+ *  as published by the Free Software Foundation; either version
+ *  2 of the License, or (at your option) any later version.
+ *
+ * Authors: Shrijeet Mukherjee s...@cumulusnetworks.com
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/socket.h>
+#include <linux/if_link.h>
+
+#include "rt_names.h"
+#include "utils.h"
+#include "ip_common.h"
+
+static void vrf_explain(FILE *f)
+{
+   fprintf(f, "Usage: ... vrf table TABLEID \n");
+}
+
+static void explain(void)
+{
+   vrf_explain(stderr);
+}
+
+static int table_arg(void)
+{
+   fprintf(stderr,"Error: argument of \"table\" must be 0-32767 and currently unused\n");
+   return -1;
+}
+
+static int vrf_parse_opt(struct link_util *lu, int argc, char **argv,
+   struct nlmsghdr *n)
+{
+   while (argc > 0) {
+           if (matches(*argv, "table") == 0) {
+                   __u32 table = 0;
+                   NEXT_ARG();
+
+                   table = atoi(*argv);
+                   if (table < 0 || table > 32767)
+                           return table_arg();
+                   /* XXX need a table in-use check here */
+                   fprintf(stderr, "adding table %d\n", table);
+                   addattr32(n, 1024, IFLA_VRF_TABLE, table);
+           } else if (matches(*argv, "help") == 0) {
+                   explain();
+                   return -1;
+           } else {
+                   fprintf(stderr, "vrf: unknown option \"%s\"?\n",
+                           *argv);
+                   explain();
+                   return -1;
+           }
+           argc--, argv++;
+   }
+
+   return 0;
+}
+
+static void vrf_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
+{
+   printf("vrf_print_opt ...\n");
+   if (!tb)
+           return;
+
+   if (tb[IFLA_VRF_TABLE])
+           fprintf(f, "table %u ", rta_getattr_u32(tb[IFLA_VRF_TABLE]));
+}
+
+static void vrf_print_help(struct link_util *lu, int argc, char **argv,
+ FILE *f)
+{
+   vrf_explain(f);
+}
+
+struct link_util vrf_link_util = {
+   .id         = "vrf",
+   .maxattr= IFLA_VRF_MAX,
+   .parse_opt  = vrf_parse_opt,
+   .print_opt  = vrf_print_opt,
+   .print_help = vrf_print_help,
+};
-- 
2.3.2 (Apple Git-55)


[RFC net-next 1/6] fib: export symbols

2015-07-06 Thread David Ahern
This change is needed for the following VRF driver.

No active code path changes.

Signed-off-by: Shrijeet Mukherjee s...@cumulusnetworks.com
Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 net/ipv4/fib_frontend.c | 1 +
 net/ipv4/fib_trie.c | 1 +
 2 files changed, 2 insertions(+)

diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 6bbc54940eb4..974fa51effca 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -108,6 +108,7 @@ struct fib_table *fib_new_table(struct net *net, u32 id)
hlist_add_head_rcu(&tb->tb_hlist, &net->ipv4.fib_table_hash[h]);
return tb;
 }
+EXPORT_SYMBOL_GPL(fib_new_table);
 
 /* caller must hold either rtnl or rcu read lock */
 struct fib_table *fib_get_table(struct net *net, u32 id)
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 15d32612e3c6..ac2d828c6daa 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -1887,6 +1887,7 @@ void fib_free_table(struct fib_table *tb)
 {
call_rcu(&tb->rcu, __trie_free_rcu);
 }
+EXPORT_SYMBOL_GPL(fib_free_table);
 
 static int fn_trie_dump_leaf(struct key_vector *l, struct fib_table *tb,
 struct sk_buff *skb, struct netlink_callback *cb)
-- 
2.3.2 (Apple Git-55)



[RFC net-next 6/6] net: Add chvrf command

2015-07-06 Thread David Ahern
Example of how to use the default bind to interface option for tasks and
correlate with VRF devices.

Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 tools/net/Makefile |   6 +-
 tools/net/chvrf.c  | 225 +
 2 files changed, 229 insertions(+), 2 deletions(-)
 create mode 100644 tools/net/chvrf.c

diff --git a/tools/net/Makefile b/tools/net/Makefile
index ee577ea03ba5..c13f11f5637a 100644
--- a/tools/net/Makefile
+++ b/tools/net/Makefile
@@ -10,7 +10,7 @@ YACC = bison
 %.lex.c: %.l
$(LEX) -o $@ $<
 
-all : bpf_jit_disasm bpf_dbg bpf_asm
+all : bpf_jit_disasm bpf_dbg bpf_asm chvrf
 
 bpf_jit_disasm : CFLAGS = -Wall -O2 -DPACKAGE='bpf_jit_disasm'
 bpf_jit_disasm : LDLIBS = -lopcodes -lbfd -ldl
@@ -25,8 +25,10 @@ bpf_asm : LDLIBS =
 bpf_asm : bpf_asm.o bpf_exp.yacc.o bpf_exp.lex.o
 bpf_exp.lex.o : bpf_exp.yacc.c
 
+chvrf : CFLAGS = -Wall -O2
+
 clean :
-   rm -rf *.o bpf_jit_disasm bpf_dbg bpf_asm bpf_exp.yacc.* bpf_exp.lex.*
+   rm -rf *.o bpf_jit_disasm bpf_dbg bpf_asm bpf_exp.yacc.* bpf_exp.lex.* chvrf
 
 install :
install bpf_jit_disasm $(prefix)/bin/bpf_jit_disasm
diff --git a/tools/net/chvrf.c b/tools/net/chvrf.c
new file mode 100644
index 000000000000..71cc925fd101
--- /dev/null
+++ b/tools/net/chvrf.c
@@ -0,0 +1,225 @@
+/*
+ * chvrf.c - Example of how to use the default bind-to-device option for
+ *   tasks and correlate to VRFs via the VRF device.
+ *
+ * Copyright (c) 2015 Cumulus Networks
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+#include <sys/ioctl.h>
+#include <sys/prctl.h>
+#include <sys/socket.h>
+#include <signal.h>
+#include <string.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <netinet/in.h>
+#include <net/if.h> /* for struct ifreq  */
+#include <libgen.h>
+#include <errno.h>
+
+#ifndef PR_SET_SK_BIND_DEV_IF
+#define PR_SET_SK_BIND_DEV_IF   47
+#endif
+#ifndef PR_GET_SK_BIND_DEV_IF
+#define PR_GET_SK_BIND_DEV_IF   48
+#endif
+
+static int vrf_to_device(int vrf)
+{
+   struct ifreq ifdata;
+   int sd, rc;
+
+   memset(&ifdata, 0, sizeof(ifdata));
+   snprintf(ifdata.ifr_name, sizeof(ifdata.ifr_name) - 1, "vrf%d", vrf);
+
+   sd = socket(PF_INET, SOCK_DGRAM, IPPROTO_IP);
+   if (sd < 0) {
+   perror("socket failed");
+   return -1;
+   }
+
+   /* Get the index for the specified interface */
+   rc = ioctl(sd, SIOCGIFINDEX, (char *)&ifdata);
+   close(sd);
+   if (rc != 0) {
+   perror("ioctl(SIOCGIFINDEX) failed");
+   return -1;
+   }
+
+   return ifdata.ifr_ifindex;
+}
+
+static int device_to_vrf(int idx)
+{
+   struct ifreq ifdata;
+   int sd, vrf, rc;
+
+   memset(&ifdata, 0, sizeof(ifdata));
+   ifdata.ifr_ifindex = idx;
+
+   sd = socket(PF_INET, SOCK_DGRAM, IPPROTO_IP);
+   if (sd < 0) {
+   perror("socket failed");
+   return -1;
+   }
+
+   /* Get the name for the specified interface index */
+   rc = ioctl(sd, SIOCGIFNAME, (char *)&ifdata);
+   close(sd);
+   if (rc != 0) {
+   perror("ioctl(SIOCGIFNAME) failed");
+   return -1;
+   }
+
+   if (sscanf(ifdata.ifr_name, "vrf%d", &vrf) != 1) {
+   fprintf(stderr, "Unexpected device name (%s)\n", ifdata.ifr_name);
+   vrf = -1;
+   }
+
+   return vrf;
+}
+
+static int set_vrf(int vrf)
+{
+   int idx;
+   long err;
+
+   /* convert vrf to device index */
+   idx = vrf_to_device(vrf);
+   if (idx < 0) {
+   fprintf(stderr, "Failed to get device index for vrf %d\n", vrf);
+   return -1;
+   }
+
+   /* set default device bind */
+   err = prctl(PR_SET_SK_BIND_DEV_IF, idx);
+   if (err < 0) {
+   fprintf(stderr, "prctl failed to device index: %d\n", errno);
+   return -1;
+   }
+
+   return 0;
+}
+
+/* get vrf context for given process id */
+static int get_vrf(pid_t pid)
+{
+   int vrf;
+   long err;
+
+   /* lookup device index pid is tied to */
+   err = prctl(PR_GET_SK_BIND_DEV_IF, pid);
+   if (err < 0) {
+   fprintf(stderr, "prctl failed: %d\n", errno);
+   return -1;
+   }
+
+   if (err == 0)
+   return 0;
+
+   /* convert device index to vrf id */
+   vrf = device_to_vrf((int)err);
+   if (vrf < 0) {
+   fprintf(stderr, "Failed to get device index for vrf %d\n", vrf);
+   return -1;
+   }
+
+   return vrf;
+}
+
+static int run_vrf(char **argv, int vrf)
+{
+   char *cmd;
+
+   if (set_vrf(vrf) != 0) {
+   fprintf(stderr, "Failed to set vrf context\n");
+   return 1;
+   }
+
+   cmd = strdup(argv

[RFC net-next 0/6] Proposal for VRF-lite - v2

2015-07-06 Thread David Ahern
 case a VRF global
   or agnostic process handles the connection (ie., this allows 1 listener
   socket to handle connections across VRFs). The child socket becomes
   bound to the VRF (sk_bound_dev_if is set to the VRF device).

5. Neighbor entries
   Neighbor entries are not impacted by the VRF device. Entries are
   associated with a particular interface; the VRF association is indirect
   via the interface-to-VRF device enslavement.

TO-DO
=====
1. ipv4 multicast

2. ICMP and error path handling on connection attempts
   - e.g., connection attempt to a port with no listener

3. IPv6

4. netfilter integration

5. listen filter to restrict VRF connections
   - i.e., bind to VRF's a, b, c only or NOT VRFs e, f, g


Bug-Fixes and ideas from Hannes, Roopa Prabhu, Jon Toppins, Jamal

Patches can also be pulled from:
https://github.com/dsahern/linux.git, vrf-dev-4.1 branch
https://github.com/dsahern/iproute2,  vrf-dev-4.1 branch


Shrijeet Mukherjee and David Ahern (6):
  fib: export symbols
  net: Preparation for vrf device
  net: Introduce VRF device driver - v2
  net: Modifications to ipv4 stack for VRF devices
  net: Add sk_bind_dev_if to task_struct
  net: Add chvrf command

 drivers/net/Kconfig   |   7 +
 drivers/net/Makefile  |   1 +
 drivers/net/vrf.c | 486 ++
 include/linux/netdevice.h |  21 ++
 include/linux/sched.h |   3 +
 include/net/flow.h|   1 +
 include/net/inet_hashtables.h |   9 +-
 include/net/route.h   |   4 +
 include/net/vrf.h |  71 ++
 include/uapi/linux/if_link.h  |   9 +
 include/uapi/linux/prctl.h|   4 +
 kernel/fork.c |   2 +
 kernel/sys.c  |  35 +++
 net/ipv4/af_inet.c|   1 +
 net/ipv4/fib_frontend.c   |  31 ++-
 net/ipv4/fib_semantics.c  |  25 ++-
 net/ipv4/fib_trie.c   |   8 +-
 net/ipv4/icmp.c   |   4 +
 net/ipv4/ping.c   |   3 +-
 net/ipv4/raw.c|   5 +-
 net/ipv4/route.c  |  12 +-
 net/ipv4/syncookies.c |   4 +-
 net/ipv4/tcp_input.c  |   6 +-
 net/ipv4/tcp_ipv4.c   |   6 +-
 net/ipv4/udp.c|   2 +
 net/ipv6/af_inet6.c   |   1 +
 tools/net/Makefile|   6 +-
 tools/net/chvrf.c | 225 +++
 28 files changed, 962 insertions(+), 30 deletions(-)
 create mode 100644 drivers/net/vrf.c
 create mode 100644 include/net/vrf.h
 create mode 100644 tools/net/chvrf.c

-- 
2.3.2 (Apple Git-55)



[RFC net-next 2/6] net: Preparation for vrf device

2015-07-06 Thread David Ahern
Add a VRF_MASTER flag for interfaces and helper functions for determining
if a device is a VRF_MASTER.

Also, add link attribute for passing VRF_TABLE id.

Both are used in the following patch that adds a VRF device driver.

Signed-off-by: Shrijeet Mukherjee s...@cumulusnetworks.com
Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 include/linux/netdevice.h| 21 +
 include/uapi/linux/if_link.h |  9 +
 2 files changed, 30 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e20979dfd6a9..142cb64f139c 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1274,6 +1274,7 @@ enum netdev_priv_flags {
IFF_XMIT_DST_RELEASE_PERM   = 1<<22,
IFF_IPVLAN_MASTER   = 1<<23,
IFF_IPVLAN_SLAVE    = 1<<24,
+   IFF_VRF_MASTER  = 1<<25,
 };
 
 #define IFF_802_1Q_VLAN            IFF_802_1Q_VLAN
@@ -1301,6 +1302,7 @@ enum netdev_priv_flags {
 #define IFF_XMIT_DST_RELEASE_PERM  IFF_XMIT_DST_RELEASE_PERM
 #define IFF_IPVLAN_MASTER  IFF_IPVLAN_MASTER
 #define IFF_IPVLAN_SLAVE   IFF_IPVLAN_SLAVE
+#define IFF_VRF_MASTER IFF_VRF_MASTER
 
 /**
  * struct net_device - The DEVICE structure.
@@ -1417,6 +1419,7 @@ enum netdev_priv_flags {
  * @dn_ptr:DECnet specific data
  * @ip6_ptr:   IPv6 specific data
  * @ax25_ptr:  AX.25 specific data
+ * @vrf_ptr:   VRF specific data
  * @ieee80211_ptr: IEEE 802.11 specific data, assign before registering
  *
  * @last_rx:   Time of last Rx
@@ -1629,6 +1632,7 @@ struct net_device {
struct dn_dev __rcu *dn_ptr;
struct inet6_dev __rcu  *ip6_ptr;
void*ax25_ptr;
+   struct net_vrf_dev  *vrf_ptr;
struct wireless_dev *ieee80211_ptr;
struct wpan_dev *ieee802154_ptr;
 #if IS_ENABLED(CONFIG_MPLS_ROUTING)
@@ -3781,6 +3785,23 @@ static inline bool netif_supports_nofcs(struct net_device *dev)
return dev->priv_flags & IFF_SUPP_NOFCS;
 }
 
+static inline bool netif_is_vrf(struct net_device *dev)
+{
+   return dev->priv_flags & IFF_VRF_MASTER;
+}
+
+static inline bool netif_idx_is_vrf(struct net *net, int idx)
+{
+   struct net_device *dev = dev_get_by_index(net, idx);
+   bool rc = false;
+
+   if (dev) {
+   rc = netif_is_vrf(dev);
+   dev_put(dev);
+   }
+   return rc;
+}
+
 /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
 static inline void netif_keep_dst(struct net_device *dev)
 {
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 2c7e8e3d3981..bfbb4d8eeec2 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -339,6 +339,15 @@ enum macvlan_macaddr_mode {
 
 #define MACVLAN_FLAG_NOPROMISC 1
 
+/* VRF section */
+enum {
+   IFLA_VRF_UNSPEC,
+   IFLA_VRF_TABLE,
+   __IFLA_VRF_MAX
+};
+
+#define IFLA_VRF_MAX (__IFLA_VRF_MAX - 1)
+
 /* IPVLAN section */
 enum {
IFLA_IPVLAN_UNSPEC,
-- 
2.3.2 (Apple Git-55)



[PATCH] net: Updates to netif_index_is_vrf

2015-08-15 Thread David Ahern
As Eric noted netif_index_is_vrf is not called with rcu_read_lock held,
so use dev_get_by_index instead of dev_get_by_index_rcu.

If VRF is not enabled or oif is 0 skip the device lookup.

Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 include/linux/netdevice.h | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index f7a6ef2fae3a..dca36a618781 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3819,12 +3819,20 @@ static inline bool netif_is_vrf(const struct net_device *dev)
 
 static inline bool netif_index_is_vrf(struct net *net, int ifindex)
 {
-   struct net_device *dev = dev_get_by_index_rcu(net, ifindex);
bool rc = false;
 
-   if (dev)
-   rc = netif_is_vrf(dev);
+#if IS_ENABLED(CONFIG_NET_VRF)
+   struct net_device *dev;
+
+   if (ifindex == 0)
+   return false;
 
+   dev = dev_get_by_index(net, ifindex);
+   if (dev) {
+   rc = netif_is_vrf(dev);
+   dev_put(dev);
+   }
+#endif
return rc;
 }
 
-- 
2.3.2 (Apple Git-55)



[PATCH] net: Move VRF change to udp_sendmsg to inlined function

2015-08-15 Thread David Ahern
Functionally equivalent, but as an inlined function with VRF config
check it completely compiles out if VRF is not enabled.
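To illustrate the pattern described above, here is a minimal user-space sketch. It is not the kernel code: `CONFIG_NET_VRF` is a plain macro here, the value of `FLOWI_FLAG_VRFSRC` is made up, and `route_lookup_mock()` and `sendmsg_vrf_saddr()` are hypothetical stand-ins for the route lookup and the helper. The point is only that the helper saves the flow flags, sets the VRF source flag for one lookup, restores the flags, and becomes an empty function when the config macro is 0.

```c
#include <stdint.h>

#define CONFIG_NET_VRF    1     /* assumption: normally set by the build config */
#define FLOWI_FLAG_VRFSRC 0x08  /* hypothetical value, for this sketch only */

/* Cut-down stand-in for the kernel's struct flowi4. */
struct flowi4 {
	uint8_t  flowi4_flags;
	uint32_t saddr;
};

/* Mock route lookup: picks a source address only when VRFSRC is requested. */
static int route_lookup_mock(struct flowi4 *fl4)
{
	if (fl4->flowi4_flags & FLOWI_FLAG_VRFSRC)
		fl4->saddr = 0x0a020101; /* 10.2.1.1, arbitrary */
	return 0;
}

/* With CONFIG_NET_VRF set to 0 the body is empty and the call site
 * costs nothing after inlining.
 */
static inline void sendmsg_vrf_saddr(struct flowi4 *fl4, int oif_is_vrf)
{
#if CONFIG_NET_VRF
	if (oif_is_vrf) {
		uint8_t saved = fl4->flowi4_flags;

		fl4->flowi4_flags = saved | FLOWI_FLAG_VRFSRC;
		route_lookup_mock(fl4);     /* fills in fl4->saddr */
		fl4->flowi4_flags = saved;  /* caller sees its original flags */
	}
#else
	(void)fl4;
	(void)oif_is_vrf;
#endif
}
```

Calling `sendmsg_vrf_saddr(&fl4, 1)` leaves `flowi4_flags` unchanged and only updates `saddr`, which is the behavior the regular lookup that follows relies on.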

Suggested-by: Tom Herbert t...@herbertland.com
Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 net/ipv4/udp.c | 44 
 1 file changed, 24 insertions(+), 20 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index c0a15e7f359f..384f8d918033 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -873,6 +873,27 @@ int udp_push_pending_frames(struct sock *sk)
 }
 EXPORT_SYMBOL(udp_push_pending_frames);
 
+/* unconnected socket. If output device is enslaved to a VRF
+ * device lookup source address from VRF table. This mimics
+ * behavior of ip_route_connect{_init}.
+ */
+static inline void udp_sendmsg_vrf_saddr(struct net *net, struct flowi4 *fl4,
+int oif, struct sock *sk)
+{
+#if IS_ENABLED(CONFIG_NET_VRF)
+   if (netif_index_is_vrf(net, oif)) {
+   __u8 flow_flags = fl4->flowi4_flags;
+   struct rtable *rt;
+
+   fl4->flowi4_flags = flow_flags | FLOWI_FLAG_VRFSRC;
+   rt = ip_route_output_flow(net, fl4, sk);
+   if (!IS_ERR(rt))
+   ip_rt_put(rt);
+   fl4->flowi4_flags = flow_flags;
+   }
+#endif
+}
+
 int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 {
struct inet_sock *inet = inet_sk(sk);
@@ -1013,33 +1034,16 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 
if (!rt) {
struct net *net = sock_net(sk);
-   __u8 flow_flags = inet_sk_flowi_flags(sk);
 
fl4 = &fl4_stack;
 
-   /* unconnected socket. If output device is enslaved to a VRF
-* device lookup source address from VRF table. This mimics
-* behavior of ip_route_connect{_init}.
-*/
-   if (netif_index_is_vrf(net, ipc.oif)) {
-   flowi4_init_output(fl4, ipc.oif, sk->sk_mark, tos,
-  RT_SCOPE_UNIVERSE, sk->sk_protocol,
-  (flow_flags | FLOWI_FLAG_VRFSRC),
-  faddr, saddr, dport,
-  inet->inet_sport);
-
-   rt = ip_route_output_flow(net, fl4, sk);
-   if (!IS_ERR(rt)) {
-   saddr = fl4->saddr;
-   ip_rt_put(rt);
-   }
-   }
-
flowi4_init_output(fl4, ipc.oif, sk->sk_mark, tos,
   RT_SCOPE_UNIVERSE, sk->sk_protocol,
-  flow_flags,
+  inet_sk_flowi_flags(sk),
   faddr, saddr, dport, inet->inet_sport);
 
+   udp_sendmsg_vrf_saddr(net, fl4, ipc.oif, sk);
+
security_sk_classify_flow(sk, flowi4_to_flowi(fl4));
rt = ip_route_output_flow(net, fl4, sk);
if (IS_ERR(rt)) {
-- 
2.3.2 (Apple Git-55)



Re: [PATCH] net: Updates to netif_index_is_vrf

2015-08-16 Thread David Ahern

On 8/15/15 6:39 PM, Florian Westphal wrote:

David Ahern d...@cumulusnetworks.com wrote:

As Eric noted netif_index_is_vrf is not called with rcu_read_lock held,
so use dev_get_by_index instead of dev_get_by_index_rcu.

If VRF is not enabled or oif is 0 skip the device lookup.

Signed-off-by: David Ahern d...@cumulusnetworks.com


Why not


  static inline bool netif_index_is_vrf(struct net *net, int ifindex)
  {
-   struct net_device *dev = dev_get_by_index_rcu(net, ifindex);
bool rc = false;

-   if (dev)
-   rc = netif_is_vrf(dev);
+#if IS_ENABLED(CONFIG_NET_VRF)
+   struct net_device *dev;
+
+   if (ifindex == 0)
+   return false;


rcu_read_lock();

dev = dev_get_by_index_rcu(net, ifindex);
if (dev)
rc = netif_is_vrf(dev);

rcu_read_unlock();


+#endif
return rc;


instead?


sure. That saves the inc and dec on the refcnt. will respin.

David



[PATCH net-next v2] net: Updates to netif_index_is_vrf

2015-08-16 Thread David Ahern
As Eric noted netif_index_is_vrf is not called with rcu_read_lock held,
so wrap the dev_get_by_index_rcu in rcu_read_lock and unlock.

If VRF is not enabled or oif is 0 skip the device lookup. In both cases
index cannot be the VRF master.

Signed-off-by: David Ahern d...@cumulusnetworks.com
---
v2:
- per Florian's suggestion keep the dev_get_by_index_rcu and wrap with
  rcu_read_lock/unlock versus moving to dev_get_by_index with dev_hold/put

 include/linux/netdevice.h | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index f7a6ef2fae3a..2d3cd86c5618 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3819,12 +3819,22 @@ static inline bool netif_is_vrf(const struct net_device *dev)
 
 static inline bool netif_index_is_vrf(struct net *net, int ifindex)
 {
-   struct net_device *dev = dev_get_by_index_rcu(net, ifindex);
bool rc = false;
 
+#if IS_ENABLED(CONFIG_NET_VRF)
+   struct net_device *dev;
+
+   if (ifindex == 0)
+   return false;
+
+   rcu_read_lock();
+
+   dev = dev_get_by_index_rcu(net, ifindex);
if (dev)
rc = netif_is_vrf(dev);
 
+   rcu_read_unlock();
+#endif
return rc;
 }
 
-- 
2.3.2 (Apple Git-55)



[PATCH] net: Fix docbook warning for IFF_VRF_MASTER enum

2015-08-16 Thread David Ahern
kbuild test robot reported:
tree:   git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git master
head:   d52736e24fe2e927c26817256f8d1a3c8b5d51a0
commit: 4e3c89920cd3a6cfce22c6f537690747c26128dd [751/762] net: Introduce VRF 
related flags and helpers
reproduce: make htmldocs

 Warning(include/linux/netdevice.h:1293): Enum value 'IFF_VRF_MASTER' not 
 described in enum 'netdev_priv_flags'

Signed-off-by: David Ahern d...@cumulusnetworks.com

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 2d3cd86c5618..aa8b79dd08d8 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1262,6 +1262,7 @@ struct net_device_ops {
  * @IFF_LIVE_ADDR_CHANGE: device supports hardware address
  * change when it's running
  * @IFF_MACVLAN: Macvlan device
+ * @IFF_VRF_MASTER: device is a VRF master
  */
 enum netdev_priv_flags {
IFF_802_1Q_VLAN = 1<<0,
-- 
2.3.2 (Apple Git-55)



Re: [net-next]: unable to add routes to tables

2015-08-18 Thread David Ahern

On 8/18/15 10:57 AM, Andreas Schultz wrote:

Hi,

It seems that the policy for adding routes to tables has changed between
Linux 4.2-rc6 and net-next.

In Linux main line (tested up to 4.2-rc6), with this main routing table:
# ip route show table main
...
172.28.0.0/24 dev vnf-xe1p0  proto kernel  scope link  src 172.28.0.16

and an empty table 100, this works:

# ip route add 10.0.0.0/8 via 172.28.0.32 table 100 dev vnf-xe1p0

With net-next at commit d52736e24fe2e927c26817256f8d1a3c8b5d51a0, the
same command leads to an:

# ip route add 10.0.0.0/8 via 172.28.0.32 table 100 dev vnf-xe1p0
RTNETLINK answers: Resource temporarily unavailable

Is this expected behavior?


That's going to be due to 3bfd847203c6d89532f836ad3f5b4ff4ced26dd9.

I'll fix.

David



Re: [net-next]: unable to add routes to tables

2015-08-18 Thread David Ahern

On 8/18/15 10:57 AM, Andreas Schultz wrote:

Hi,

It seems that the policy for adding routes to tables has changed between
Linux 4.2-rc6 and net-next.

In Linux main line (tested up to 4.2-rc6), with this main routing table:
# ip route show table main
...
172.28.0.0/24 dev vnf-xe1p0  proto kernel  scope link  src 172.28.0.16

and an empty table 100, this works:

# ip route add 10.0.0.0/8 via 172.28.0.32 table 100 dev vnf-xe1p0

With net-next at commit d52736e24fe2e927c26817256f8d1a3c8b5d51a0, the
same command leads to an:

# ip route add 10.0.0.0/8 via 172.28.0.32 table 100 dev vnf-xe1p0
RTNETLINK answers: Resource temporarily unavailable

Is this expected behavior?



The attached works for me and so does my original problem. Can you 
confirm it resolves your problem? If so I'll send a formal patch.


David


diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index c8025851dac7..01a237278dd2 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -710,9 +710,16 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
err = fib_table_lookup(tbl, &fl4, &res,
   FIB_LOOKUP_IGNORE_LINKSTATE |
   FIB_LOOKUP_NOREF);
-   else
+
+   /* on error or if no table given do full lookup. This is
+* needed for example when nexthops are in the local table
+* rather than the given table
+*/
+   if (!tbl || err) {
err = fib_lookup(net, &fl4, &res,
 FIB_LOOKUP_IGNORE_LINKSTATE);
+   }
+
if (err) {
rcu_read_unlock();
return err;


Re: [PATCH net-next] vrf: plug skb leaks

2015-08-19 Thread David Ahern

Hi Nikolay:

On 8/18/15 8:12 PM, Nikolay Aleksandrov wrote:

diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index ed208317cbb5..4aa06450fafa 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -97,6 +97,12 @@ static bool is_ip_rx_frame(struct sk_buff *skb)
return false;
  }

+static void vrf_tx_error(struct net_device *vrf_dev, struct sk_buff *skb)
+{
+   vrf_dev->stats.tx_errors++;
+   kfree_skb(skb);
+}
+
  /* note: already called with rcu_read_lock */
  static rx_handler_result_t vrf_handle_frame(struct sk_buff **pskb)
  {
@@ -149,7 +155,8 @@ static struct rtnl_link_stats64 *vrf_get_stats64(struct net_device *dev,
  static netdev_tx_t vrf_process_v6_outbound(struct sk_buff *skb,
   struct net_device *dev)
  {
-   return 0;
+   vrf_tx_error(dev, skb);
+   return NET_XMIT_DROP;
  }

  static int vrf_send_v4_prep(struct sk_buff *skb, struct flowi4 *fl4,
@@ -206,8 +213,7 @@ static netdev_tx_t vrf_process_v4_outbound(struct sk_buff *skb,
  out:
return ret;
  err:
-   vrf_dev->stats.tx_errors++;
-   kfree_skb(skb);
+   vrf_tx_error(vrf_dev, skb);
goto out;
  }

@@ -219,6 +225,7 @@ static netdev_tx_t is_ip_tx_frame(struct sk_buff *skb, struct net_device *dev)
case htons(ETH_P_IPV6):
return vrf_process_v6_outbound(skb, dev);
default:
+   vrf_tx_error(dev, skb);
return NET_XMIT_DROP;
}
  }



Would be simpler to do the vrf_tx_error at the end of is_ip_tx_frame() 
if ret == NET_XMIT_DROP.


David



Re: [PATCH net-next 0/4] vrf: cleanups part 2

2015-08-19 Thread David Ahern

On 8/18/15 8:27 PM, Nikolay Aleksandrov wrote:

From: Nikolay Aleksandrov niko...@cumulusnetworks.com

Hi,
This is the next part of vrf cleanups, patch 1 drops the SLAB_PANIC when
creating kmem cache since it's handled, patch 02 removes a slave duplicate
check which is already done by the lower/upper code, patch 3 moves the
ndo_add_slave code around a bit so we can drop an error label and patch 4
drops the master device checks which are unnecessary because the ops are
taken from the master device itself so it can't be different.

Cheers,
  Nik

Nikolay Aleksandrov (4):
   vrf: don't panic on cache create failure
   vrf: remove unnecessary duplicate check
   vrf: move vrf_insert_slave so we can drop a goto label
   vrf: ndo_add|del_slave drop unnecessary checks

  drivers/net/vrf.c | 24 
  1 file changed, 4 insertions(+), 20 deletions(-)



Looks good to me. Thanks, Nikolay.

Acked-by: David Ahern d...@cumulusnetworks.com


Re: [PATCH net-next 0/4] vrf: a few simplifications and cleanups

2015-08-18 Thread David Ahern

On 8/18/15 11:28 AM, Nikolay Aleksandrov wrote:

From: Nikolay Aleksandrov niko...@cumulusnetworks.com

Hi,
These patches remove some unnecessary checks (patches 3, 4), unnecessary
num_slaves member and refcnt manipulations which are already done by the
upper functions.

Cheers,
  Nik

Nikolay Aleksandrov (4):
   vrf: drop unnecessary dev refcnt changes
   vrf: drop unused num_slaves member
   vrf: don't check for dstats and rth in uninit path
   vrf: simplify the netdev notifier function

  drivers/net/vrf.c | 15 ---
  include/net/vrf.h |  1 -
  2 files changed, 4 insertions(+), 12 deletions(-)



Looks good to me. Thanks, Nikolay.

Acked-by: David Ahern d...@cumulusnetworks.com


[PATCH] net: Fix nexthop lookups

2015-08-19 Thread David Ahern
Andreas reported breakage adding routes with local nexthops:
$ ip route show table main
...
172.28.0.0/24 dev vnf-xe1p0  proto kernel  scope link  src 172.28.0.16

$ ip route add 10.0.0.0/8 via 172.28.0.32 table 100 dev vnf-xe1p0
RTNETLINK answers: Resource temporarily unavailable

3bfd847203c changed the lookup to use the passed in table but for cases like
this the nexthop is in the local table rather than the passed in table.
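The control-flow change can be sketched in user space as follows. This is an illustration under stated assumptions, not the kernel implementation: `struct fib_table` is reduced to a list of /24 networks, `table_lookup()` stands in for fib_table_lookup(), and the `full` table stands in for the full fib_lookup() path. The miss is reported as -EAGAIN, which is what the "Resource temporarily unavailable" message in the report corresponds to.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Mock table: a list of /24 network addresses (host byte order). */
struct fib_table {
	const uint32_t *nets;
	size_t n;
};

/* Stand-in for fib_table_lookup(): 0 on match, -EAGAIN on miss. */
static int table_lookup(const struct fib_table *tbl, uint32_t addr)
{
	size_t i;

	for (i = 0; i < tbl->n; i++)
		if (tbl->nets[i] == (addr & 0xffffff00u))
			return 0;
	return -EAGAIN;
}

/* Behavior after 3bfd847203c: only the passed-in table is consulted. */
static int check_nh_strict(const struct fib_table *tbl,
			   const struct fib_table *full, uint32_t gw)
{
	if (tbl)
		return table_lookup(tbl, gw);
	return table_lookup(full, gw);
}

/* Behavior with this fix: fall back to the full lookup on a miss. */
static int check_nh_fallback(const struct fib_table *tbl,
			     const struct fib_table *full, uint32_t gw)
{
	int err = -EAGAIN;

	if (tbl)
		err = table_lookup(tbl, gw);
	if (!tbl || err)
		err = table_lookup(full, gw);
	return err;
}
```

With `full` holding 172.28.0.0/24 and an empty table 100, the gateway 172.28.0.32 fails in `check_nh_strict()` and resolves in `check_nh_fallback()`, which mirrors Andreas's case.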

Fixes: 3bfd847203c (net: Use passed in table for nexthop lookups)
Reported-by: Andreas Schultz aschu...@tpip.net
Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 net/ipv4/fib_semantics.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index c8025851dac7..0ab5bf558805 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -710,9 +710,16 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
err = fib_table_lookup(tbl, &fl4, &res,
   FIB_LOOKUP_IGNORE_LINKSTATE |
   FIB_LOOKUP_NOREF);
-   else
+
+   /* on error or if no table given do full lookup. This
+* is needed for example when nexthops are in the local
+* table rather than the given table
+*/
+   if (!tbl || err) {
err = fib_lookup(net, &fl4, &res,
 FIB_LOOKUP_IGNORE_LINKSTATE);
+   }
+
if (err) {
rcu_read_unlock();
return err;
-- 
2.3.2 (Apple Git-55)



Re: [PATCH net-next] vrf: vrf_master_ifindex_rcu is not always called with rcu read lock

2015-08-18 Thread David Ahern

On 8/18/15 10:17 AM, Nikolay Aleksandrov wrote:

diff --git a/include/net/vrf.h b/include/net/vrf.h
index 40e3793c7a05..22dfe2195092 100644
--- a/include/net/vrf.h
+++ b/include/net/vrf.h
@@ -35,7 +35,6 @@ struct net_vrf {


  #if IS_ENABLED(CONFIG_NET_VRF)
-/* called with rcu_read_lock() */
  static inline int vrf_master_ifindex_rcu(const struct net_device *dev)
  {
struct net_vrf_dev *vrf_ptr;
@@ -44,12 +43,14 @@ static inline int vrf_master_ifindex_rcu(const struct net_device *dev)
if (!dev)
return 0;

-   if (netif_is_vrf(dev))
+   if (netif_is_vrf(dev)) {
ifindex = dev->ifindex;
-   else {
+   } else {
+   rcu_read_lock();
vrf_ptr = rcu_dereference(dev->vrf_ptr);
if (vrf_ptr)
ifindex = vrf_ptr->ifindex;
+   rcu_read_unlock();
}

return ifindex;



The intent of the _rcu in the name is to mean it is called with 
rcu_read_lock held which is the case for __fib_validate_source and 
ip_route_input_slow. It looks like the icmp callers (icmp_reply and 
icmp_route_lookup) are the exceptions. For those create a


static inline int vrf_master_ifindex(const struct net_device *dev)
{

}

that does the rcu lock/unlock and calls vrf_master_ifindex_rcu in between.
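Roughly what I have in mind, as a user-space sketch rather than a patch. The RCU primitives are replaced by counting stubs and the structs are cut down to the fields used; the `is_vrf_master` field is a stand-in for the netif_is_vrf() check. All of that is an assumption made so the sketch compiles on its own.

```c
#include <stdbool.h>
#include <stddef.h>

/* User-space stubs (assumptions): count read-side nesting only. */
static int rcu_depth;
static void rcu_read_lock(void)   { rcu_depth++; }
static void rcu_read_unlock(void) { rcu_depth--; }

/* Cut-down stand-ins for the kernel structures. */
struct net_vrf_dev {
	int ifindex;
};

struct net_device {
	int ifindex;
	bool is_vrf_master;           /* stand-in for netif_is_vrf(dev) */
	struct net_vrf_dev *vrf_ptr;  /* set when enslaved to a VRF device */
};

/* Caller must already be inside an RCU read-side section. */
static int vrf_master_ifindex_rcu(const struct net_device *dev)
{
	if (!dev)
		return 0;
	if (dev->is_vrf_master)
		return dev->ifindex;
	return dev->vrf_ptr ? dev->vrf_ptr->ifindex : 0;
}

/* For callers that do not hold the read lock (e.g. the icmp paths). */
static int vrf_master_ifindex(const struct net_device *dev)
{
	int ifindex;

	rcu_read_lock();
	ifindex = vrf_master_ifindex_rcu(dev);
	rcu_read_unlock();

	return ifindex;
}
```

That keeps the _rcu variant free of locking for the callers that already hold the read lock and confines the lock/unlock pair to the wrapper.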


Re: [PATCH net-next] vrf: plug skb leaks

2015-08-19 Thread David Ahern

On 8/19/15 1:17 PM, Nikolay Aleksandrov wrote:



On Aug 19, 2015, at 8:27 PM, Nikolay Aleksandrov niko...@cumulusnetworks.com 
wrote:


On Aug 19, 2015 20:13, David Ahern d...@cumulusnetworks.com wrote:


Hi Nikolay:


On 8/18/15 8:12 PM, Nikolay Aleksandrov wrote:


diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index ed208317cbb5..4aa06450fafa 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -97,6 +97,12 @@ static bool is_ip_rx_frame(struct sk_buff *skb)
 return false;
   }

+static void vrf_tx_error(struct net_device *vrf_dev, struct sk_buff *skb)
+{
+   vrf_dev->stats.tx_errors++;
+   kfree_skb(skb);
+}
+
   /* note: already called with rcu_read_lock */
   static rx_handler_result_t vrf_handle_frame(struct sk_buff **pskb)
   {
@@ -149,7 +155,8 @@ static struct rtnl_link_stats64 *vrf_get_stats64(struct net_device *dev,
   static netdev_tx_t vrf_process_v6_outbound(struct sk_buff *skb,
struct net_device *dev)
   {
-   return 0;
+   vrf_tx_error(dev, skb);
+   return NET_XMIT_DROP;
   }

   static int vrf_send_v4_prep(struct sk_buff *skb, struct flowi4 *fl4,
@@ -206,8 +213,7 @@ static netdev_tx_t vrf_process_v4_outbound(struct sk_buff *skb,
   out:
 return ret;
   err:
-   vrf_dev->stats.tx_errors++;
-   kfree_skb(skb);
+   vrf_tx_error(vrf_dev, skb);
 goto out;
   }

@@ -219,6 +225,7 @@ static netdev_tx_t is_ip_tx_frame(struct sk_buff *skb, struct net_device *dev)
 case htons(ETH_P_IPV6):
 return vrf_process_v6_outbound(skb, dev);
 default:
+   vrf_tx_error(dev, skb);
 return NET_XMIT_DROP;
 }
   }



Would be simpler to do the vrf_tx_error at the end of is_ip_tx_frame()
if ret == NET_XMIT_DROP.

David



Sure, that will work too.



Actually no, this will not work because ip_local_out() can return
NET_XMIT_DROP and the packet can already be dropped. I’d prefer to keep
these cases separate and explicit as they are in my patch.



ok. Then the patch looks good to me.

Acked-by: David Ahern d...@cumulusnetworks.com
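For the archive, a small user-space sketch of the ownership issue Nikolay points out. Everything here is a mock: `kfree_skb()` only counts frees, and `ip_local_out_mock()`, `xmit_explicit()` and `xmit_centralized()` are hypothetical names. The assumption being modelled is that the lower layer may consume the skb and still report NET_XMIT_DROP.

```c
#define NET_XMIT_SUCCESS 0
#define NET_XMIT_DROP    1

/* Mocks: the skb only records how many times it was freed. */
struct sk_buff    { int free_count; };
struct net_device { unsigned long tx_errors; };

static void kfree_skb(struct sk_buff *skb)
{
	skb->free_count++;
}

static void vrf_tx_error(struct net_device *dev, struct sk_buff *skb)
{
	dev->tx_errors++;
	kfree_skb(skb);
}

/* Mock lower layer: on a drop it has already freed the skb. */
static int ip_local_out_mock(struct sk_buff *skb, int congested)
{
	if (congested) {
		kfree_skb(skb);
		return NET_XMIT_DROP;
	}
	return NET_XMIT_SUCCESS;
}

/* As in the patch: free only on paths that still own the skb. */
static int xmit_explicit(struct net_device *dev, struct sk_buff *skb,
			 int is_ip, int congested)
{
	if (!is_ip) {
		vrf_tx_error(dev, skb);
		return NET_XMIT_DROP;
	}
	return ip_local_out_mock(skb, congested);
}

/* My earlier suggestion: free whenever the result is a drop. */
static int xmit_centralized(struct net_device *dev, struct sk_buff *skb,
			    int is_ip, int congested)
{
	int ret = is_ip ? ip_local_out_mock(skb, congested) : NET_XMIT_DROP;

	if (ret == NET_XMIT_DROP)
		vrf_tx_error(dev, skb); /* second free if the lower layer dropped */
	return ret;
}
```

With a congested lower layer the centralized variant frees the same skb twice, while the explicit variant frees it exactly once on every path.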




[PATCH net-next] vrf: Add ethernet header for pass through VRF device

2015-08-23 Thread David Ahern
The change to use a custom dst broke tcpdump captures on the VRF device:

$ tcpdump -n -i vrf10
...
05:32:29.009362 IP 10.2.1.254 > 10.2.1.2: ICMP echo request, id 21989, seq 1, length 64
05:32:29.009855 00:00:40:01:8d:36 > 45:00:00:54:d6:6f, ethertype Unknown (0x0a02), length 84:
0x0000:  0102 0a02 01fe 0000 9181 55e5 0001 bd11  ..........U.....
0x0010:  da55 0000 0000 bb5d 0700 0000 0000 1011  .U.....]........
0x0020:  1213 1415 1617 1819 1a1b 1c1d 1e1f 2021  .............. !
0x0030:  2223 2425 2627 2829 2a2b 2c2d 2e2f 3031  "#$%&'()*+,-./01
0x0040:  3233 3435 3637                           234567

Local packets going through the VRF device are missing an ethernet header.
Fix by adding one and then stripping it off before pushing back to the IP
stack. With this patch you get the expected dumps:

...
05:36:15.713944 IP 10.2.1.254 > 10.2.1.2: ICMP echo request, id 23795, seq 1, length 64
05:36:15.714160 IP 10.2.1.2 > 10.2.1.254: ICMP echo reply, id 23795, seq 1, length 64
...

Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 drivers/net/vrf.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index dbeffe789185..e5c792e4c224 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -219,6 +219,9 @@ static netdev_tx_t vrf_process_v4_outbound(struct sk_buff 
*skb,
 
 static netdev_tx_t is_ip_tx_frame(struct sk_buff *skb, struct net_device *dev)
 {
+   /* strip the ethernet header added for pass through VRF device */
+   __skb_pull(skb, skb_network_offset(skb));
+
switch (skb->protocol) {
case htons(ETH_P_IP):
return vrf_process_v4_outbound(skb, dev);
@@ -250,6 +253,17 @@ static netdev_tx_t vrf_xmit(struct sk_buff *skb, struct 
net_device *dev)
 
 static netdev_tx_t vrf_finish(struct sock *sk, struct sk_buff *skb)
 {
+   int err;
+
+   __skb_pull(skb, skb_network_offset(skb));
+   err = dev_hard_header(skb, skb->dev, ntohs(skb->protocol),
+ NULL, NULL, skb->len);
+
+   if (err < 0) {
+   vrf_tx_error(skb->dev, skb);
+   return -EINVAL;
+   }
+
return dev_queue_xmit(skb);
 }
 
-- 
2.3.2 (Apple Git-55)
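
The add-then-strip round trip described in this patch can be sketched in
userspace C. This is a minimal illustration, not the driver code: struct
pkt_sketch, the 32-byte headroom and the helper names are assumptions made
for the example, whereas the patch itself relies on dev_hard_header() and
__skb_pull().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_HLEN_SKETCH 14   /* dst MAC (6) + src MAC (6) + ethertype (2) */
#define HEADROOM_SKETCH 32   /* arbitrary headroom for this illustration */

/* Hypothetical stand-in for an skb: a buffer with headroom before the data. */
struct pkt_sketch {
    uint8_t buf[128];
    size_t  data;   /* offset of the first valid byte */
    size_t  len;    /* number of valid bytes */
};

/* Store an L3 packet, leaving room in front for a link-layer header. */
static int pkt_init(struct pkt_sketch *p, const uint8_t *l3, size_t l3_len)
{
    if (l3_len > sizeof(p->buf) - HEADROOM_SKETCH)
        return -1;
    p->data = HEADROOM_SKETCH;
    p->len = l3_len;
    memcpy(p->buf + p->data, l3, l3_len);
    return 0;
}

/* Prepend an ethernet header so a capture sees a well-formed frame. */
static int pkt_push_eth(struct pkt_sketch *p, uint16_t ethertype)
{
    uint8_t *hdr;

    if (p->data < ETH_HLEN_SKETCH)
        return -1;                          /* no headroom left */
    p->data -= ETH_HLEN_SKETCH;
    p->len += ETH_HLEN_SKETCH;
    hdr = p->buf + p->data;
    memset(hdr, 0, 12);                     /* zeroed MACs, illustration only */
    hdr[12] = (uint8_t)(ethertype >> 8);    /* ethertype in network byte order */
    hdr[13] = (uint8_t)(ethertype & 0xff);
    return 0;
}

/* Strip the header again before handing the packet back to the L3 code. */
static int pkt_pull_eth(struct pkt_sketch *p)
{
    if (p->len < ETH_HLEN_SKETCH)
        return -1;
    p->data += ETH_HLEN_SKETCH;
    p->len -= ETH_HLEN_SKETCH;
    return 0;
}
```

After a push the frame carries the ethertype at offsets 12 and 13, which is
what lets a capture tool decode the payload as IP; after the pull the first
byte is the IP header again.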



[PATCH net-next] inetpeer: remove dead code

2015-08-23 Thread David Ahern
Remove various inlined functions not referenced in the kernel.

Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 include/net/inetpeer.h | 67 --
 1 file changed, 67 deletions(-)

diff --git a/include/net/inetpeer.h b/include/net/inetpeer.h
index d5332ddcea3f..002f0bd27001 100644
--- a/include/net/inetpeer.h
+++ b/include/net/inetpeer.h
@@ -65,71 +65,12 @@ struct inet_peer_base {
int total;
 };
 
-#define INETPEER_BASE_BIT  0x1UL
-
-static inline struct inet_peer *inetpeer_ptr(unsigned long val)
-{
-   BUG_ON(val & INETPEER_BASE_BIT);
-   return (struct inet_peer *) val;
-}
-
-static inline struct inet_peer_base *inetpeer_base_ptr(unsigned long val)
-{
-   if (!(val & INETPEER_BASE_BIT))
-   return NULL;
-   val &= ~INETPEER_BASE_BIT;
-   return (struct inet_peer_base *) val;
-}
-
-static inline bool inetpeer_ptr_is_peer(unsigned long val)
-{
-   return !(val & INETPEER_BASE_BIT);
-}
-
-static inline void __inetpeer_ptr_set_peer(unsigned long *val, struct 
inet_peer *peer)
-{
-   /* This implicitly clears INETPEER_BASE_BIT */
-   *val = (unsigned long) peer;
-}
-
-static inline bool inetpeer_ptr_set_peer(unsigned long *ptr, struct inet_peer 
*peer)
-{
-   unsigned long val = (unsigned long) peer;
-   unsigned long orig = *ptr;
-
-   if (!(orig & INETPEER_BASE_BIT) ||
-   cmpxchg(ptr, orig, val) != orig)
-   return false;
-   return true;
-}
-
-static inline void inetpeer_init_ptr(unsigned long *ptr, struct inet_peer_base 
*base)
-{
-   *ptr = (unsigned long) base | INETPEER_BASE_BIT;
-}
-
-static inline void inetpeer_transfer_peer(unsigned long *to, unsigned long 
*from)
-{
-   unsigned long val = *from;
-
-   *to = val;
-   if (inetpeer_ptr_is_peer(val)) {
-   struct inet_peer *peer = inetpeer_ptr(val);
-   atomic_inc(&peer->refcnt);
-   }
-}
-
 void inet_peer_base_init(struct inet_peer_base *);
 
 void inet_initpeers(void) __init;
 
 #define INETPEER_METRICS_NEW   (~(u32) 0)
 
-static inline bool inet_metrics_new(const struct inet_peer *p)
-{
-   return p->metrics[RTAX_LOCK-1] == INETPEER_METRICS_NEW;
-}
-
 /* can be called with or without local BH being disabled */
 struct inet_peer *inet_getpeer(struct inet_peer_base *base,
   const struct inetpeer_addr *daddr,
@@ -163,12 +104,4 @@ bool inet_peer_xrlim_allow(struct inet_peer *peer, int 
timeout);
 
 void inetpeer_invalidate_tree(struct inet_peer_base *);
 
-/*
- * temporary check to make sure we dont access rid, tcp_ts,
- * tcp_ts_stamp if no refcount is taken on inet_peer
- */
-static inline void inet_peer_refcheck(const struct inet_peer *p)
-{
-   WARN_ON_ONCE(atomic_read(&p->refcnt) <= 0);
-}
 #endif /* _NET_INETPEER_H */
-- 
2.3.2 (Apple Git-55)



[PATCH net-next] inetpeer: Add support for VRFs

2015-08-23 Thread David Ahern
inetpeer caches based on address only, so duplicate IP addresses within
a namespace return the same cached entry. Similar to IP fragments, handle
duplicate addresses across VRFs by adding the VRF master device index to
the lookup.

Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 include/net/inetpeer.h | 11 ++-
 net/ipv4/icmp.c|  3 ++-
 net/ipv4/inetpeer.c|  5 +
 net/ipv4/ip_fragment.c |  3 ++-
 net/ipv4/route.c   |  7 +--
 5 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/include/net/inetpeer.h b/include/net/inetpeer.h
index 002f0bd27001..a75b648b8545 100644
--- a/include/net/inetpeer.h
+++ b/include/net/inetpeer.h
@@ -26,6 +26,9 @@ struct inetpeer_addr_base {
 struct inetpeer_addr {
struct inetpeer_addr_base   addr;
__u16   family;
+#if IS_ENABLED(CONFIG_NET_VRF)
+   int vif;
+#endif
 };
 
 struct inet_peer {
@@ -78,12 +81,15 @@ struct inet_peer *inet_getpeer(struct inet_peer_base *base,
 
 static inline struct inet_peer *inet_getpeer_v4(struct inet_peer_base *base,
__be32 v4daddr,
-   int create)
+   int vif, int create)
 {
struct inetpeer_addr daddr;
 
daddr.addr.a4 = v4daddr;
daddr.family = AF_INET;
+#if IS_ENABLED(CONFIG_NET_VRF)
+   daddr.vif = vif;
+#endif
return inet_getpeer(base, &daddr, create);
 }
 
@@ -95,6 +101,9 @@ static inline struct inet_peer *inet_getpeer_v6(struct 
inet_peer_base *base,
 
daddr.addr.in6 = *v6daddr;
daddr.family = AF_INET6;
+#if IS_ENABLED(CONFIG_NET_VRF)
+   daddr.vif = 0;   /* placeholder until VRF support is added to IPv6 */
+#endif
return inet_getpeer(base, &daddr, create);
 }
 
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index f16488efa1c8..79fe05befcae 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -309,9 +309,10 @@ static bool icmpv4_xrlim_allow(struct net *net, struct 
rtable *rt,
 
rc = false;
if (icmp_global_allow()) {
+   int vif = vrf_master_ifindex(dst->dev);
struct inet_peer *peer;
 
-   peer = inet_getpeer_v4(net->ipv4.peers, fl4->daddr, 1);
+   peer = inet_getpeer_v4(net->ipv4.peers, fl4->daddr, vif, 1);
rc = inet_peer_xrlim_allow(peer,
   net->ipv4.sysctl_icmp_ratelimit);
if (peer)
diff --git a/net/ipv4/inetpeer.c b/net/ipv4/inetpeer.c
index 241afd743d2c..b5f268a3ea6b 100644
--- a/net/ipv4/inetpeer.c
+++ b/net/ipv4/inetpeer.c
@@ -170,6 +170,11 @@ static int addr_compare(const struct inetpeer_addr *a,
return 1;
}
 
+#if IS_ENABLED(CONFIG_NET_VRF)
+   if (a->vif != b->vif)
+   return a->vif < b->vif ? -1 : 1;
+#endif
+
return 0;
 }
 
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 15762e758861..fa7f15305f9a 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -151,7 +151,8 @@ static void ip4_frag_init(struct inet_frag_queue *q, const 
void *a)
qp->vif = arg->vif;
qp->user = arg->user;
qp->peer = sysctl_ipfrag_max_dist ?
-   inet_getpeer_v4(net->ipv4.peers, arg->iph->saddr, 1) : NULL;
+   inet_getpeer_v4(net->ipv4.peers, arg->iph->saddr, arg->vif, 1) :
+   NULL;
 }
 
 static void ip4_frag_free(struct inet_frag_queue *q)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 2403e85107f0..6805d57152b9 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -838,6 +838,7 @@ void ip_rt_send_redirect(struct sk_buff *skb)
struct inet_peer *peer;
struct net *net;
int log_martians;
+   int vif;
 
rcu_read_lock();
in_dev = __in_dev_get_rcu(rt->dst.dev);
@@ -846,10 +847,11 @@ void ip_rt_send_redirect(struct sk_buff *skb)
return;
}
log_martians = IN_DEV_LOG_MARTIANS(in_dev);
+   vif = vrf_master_ifindex_rcu(rt->dst.dev);
rcu_read_unlock();
 
net = dev_net(rt->dst.dev);
-   peer = inet_getpeer_v4(net->ipv4.peers, ip_hdr(skb)->saddr, 1);
+   peer = inet_getpeer_v4(net->ipv4.peers, ip_hdr(skb)->saddr, vif, 1);
if (!peer) {
icmp_send(skb, ICMP_REDIRECT, ICMP_REDIR_HOST,
  rt_nexthop(rt, ip_hdr(skb)->daddr));
@@ -938,7 +940,8 @@ static int ip_error(struct sk_buff *skb)
break;
}
 
-   peer = inet_getpeer_v4(net->ipv4.peers, ip_hdr(skb)->saddr, 1);
+   peer = inet_getpeer_v4(net->ipv4.peers, ip_hdr(skb)->saddr,
+  vrf_master_ifindex(skb->dev), 1);
 
send = true;
if (peer) {
-- 
2.3.2 (Apple Git-55)
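
The effect of adding the VRF master index to the lookup key can be sketched
in a few lines of userspace C. This is an illustration only: struct peer_key
and peer_key_compare() are hypothetical simplifications (IPv4 only), loosely
modelled on the addr_compare() hunk above rather than copied from the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified lookup key: IPv4 address plus VRF master ifindex. */
struct peer_key {
    uint32_t addr;
    int      vif;    /* 0 when the device is not enslaved to a VRF */
};

/* Three-way compare: address first, then vif as the tie-breaker. */
static int peer_key_compare(const struct peer_key *a, const struct peer_key *b)
{
    if (a->addr != b->addr)
        return a->addr < b->addr ? -1 : 1;
    if (a->vif != b->vif)
        return a->vif < b->vif ? -1 : 1;
    return 0;
}
```

Two keys with the same address but different vif values now compare as
unequal, so each VRF gets its own cache entry; keys that match in both fields
still resolve to the same entry.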



Re: [PATCH] net af_key: Fix RCU splat

2015-08-20 Thread David Ahern

On 8/20/15 9:51 AM, Eric Dumazet wrote:

On Thu, 2015-08-20 at 08:51 -0700, David Ahern wrote:

Hit the following splat testing VRF change for ipsec:

[  113.475692] ===
[  113.476194] [ INFO: suspicious RCU usage. ]
[  113.476667] 4.2.0-rc6-1+deb7u2+clUNRELEASED #3.2.65-1+deb7u2+clUNRELEASED 
Not tainted
[  113.477545] ---
[  113.478013] /work/monster-14/dsa/kernel.git/include/linux/rcupdate.h:568 
Illegal context switch in RCU read-side critical section!
[  113.479288]
[  113.479288] other info that might help us debug this:
[  113.479288]
[  113.480207]
[  113.480207] rcu_scheduler_active = 1, debug_locks = 1
[  113.480931] 2 locks held by setkey/6829:
[  113.481371]  #0:  (net-xfrm.xfrm_cfg_mutex){+.+.+.}, at: 
[814e9887] pfkey_sendmsg+0xfb/0x213
[  113.482509]  #1:  (rcu_read_lock){..}, at: [814e767f] 
rcu_read_lock+0x0/0x6e
[  113.483509]
[  113.483509] stack backtrace:
[  113.484041] CPU: 0 PID: 6829 Comm: setkey Not tainted 
4.2.0-rc6-1+deb7u2+clUNRELEASED #3.2.65-1+deb7u2+clUNRELEASED
[  113.485422] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.7.5.1-0-g8936dbb-20141113_115728-nilsson.home.kraxel.org 04/01/2014
[  113.486845]  0001 88001d4c7a98 81518af2 
81086962
[  113.487732]  88001d538480 88001d4c7ac8 8107ae75 
8180a154
[  113.488628]  0b30  00d0 
88001d4c7ad8
[  113.489525] Call Trace:
[  113.489813]  [81518af2] dump_stack+0x4c/0x65
[  113.490389]  [81086962] ? console_unlock+0x3d6/0x405
[  113.491039]  [8107ae75] lockdep_rcu_suspicious+0xfa/0x103
[  113.491735]  [81064032] rcu_preempt_sleep_check+0x45/0x47
[  113.492442]  [8106404d] ___might_sleep+0x19/0x1c8
[  113.493077]  [81064268] __might_sleep+0x6c/0x82
[  113.493681]  [81133190] 
cache_alloc_debugcheck_before.isra.50+0x1d/0x24
[  113.494508]  [81134876] kmem_cache_alloc+0x31/0x18f
[  113.495149]  [814012b5] skb_clone+0x64/0x80
[  113.495712]  [814e6f71] pfkey_broadcast_one+0x3d/0xff
[  113.496380]  [814e7b84] pfkey_broadcast+0xb5/0x11e
[  113.497024]  [814e82d1] pfkey_register+0x191/0x1b1
[  113.497653]  [814e9770] pfkey_process+0x162/0x17e
[  113.498274]  [814e9895] pfkey_sendmsg+0x109/0x213

In pfkey_sendmsg the net mutex is taken and then pfkey_broadcast takes
the RCU lock. Fix by using GFP_ATOMIC for the allocation flag.

Signed-off-by: David Ahern d...@cumulusnetworks.com
---
  net/key/af_key.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/key/af_key.c b/net/key/af_key.c
index b397f0aa9005..73527e7dd247 100644
--- a/net/key/af_key.c
+++ b/net/key/af_key.c
@@ -1670,7 +1670,7 @@ static int pfkey_register(struct sock *sk, struct sk_buff 
*skb, const struct sad
return -ENOBUFS;
}

-   pfkey_broadcast(supp_skb, GFP_KERNEL, BROADCAST_REGISTERED, sk, 
sock_net(sk));
+   pfkey_broadcast(supp_skb, GFP_ATOMIC, BROADCAST_REGISTERED, sk, 
sock_net(sk));

return 0;
  }


I would rather remove the useless rcu locking from pfkey_broadcast() if
a mutex properly protects the thing.


rcu_read_lock was added by Stephen with 7f6b9dbd5afbd. It does not
appear that the net->xfrm.xfrm_cfg_mutex mutex added by 283bc9f35bbbc
properly covers the locking; i.e., the rcu_read_lock is needed.


David


Re: [PATCH net-next v2] vrf: vrf_master_ifindex_rcu is not always called with rcu read lock

2015-08-18 Thread David Ahern

On 8/18/15 12:15 PM, Nikolay Aleksandrov wrote:

diff --git a/include/net/vrf.h b/include/net/vrf.h
index 3bb4af462ed6..b039850a94a3 100644
--- a/include/net/vrf.h
+++ b/include/net/vrf.h
@@ -34,7 +34,6 @@ struct net_vrf {


  #if IS_ENABLED(CONFIG_NET_VRF)
-/* called with rcu_read_lock() */
  static inline int vrf_master_ifindex_rcu(const struct net_device *dev)
  {
struct net_vrf_dev *vrf_ptr;


That comment is true for this version.

David



[PATCH] net: FIB tracepoints

2015-08-18 Thread David Ahern
Signed-off-by: David Ahern d...@cumulusnetworks.com
---
I realize the sensitivity around adding tracepoints, but these have been
invaluable developing the VRF device driver along with a return probe:
  perf probe -a 'fib_table_lookup_ret=fib_table_lookup%return ret=%ax' 

 include/trace/events/fib.h | 90 ++
 net/core/net-traces.c  |  1 +
 net/ipv4/fib_frontend.c|  3 ++
 net/ipv4/fib_trie.c|  5 +++
 4 files changed, 99 insertions(+)
 create mode 100644 include/trace/events/fib.h

diff --git a/include/trace/events/fib.h b/include/trace/events/fib.h
new file mode 100644
index ..1ac74ba0c977
--- /dev/null
+++ b/include/trace/events/fib.h
@@ -0,0 +1,90 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM fib
+
+#if !defined(_TRACE_FIB_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_FIB_H
+
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <net/ip_fib.h>
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(fib_table_lookup,
+
+   TP_PROTO(int tb_id, const struct flowi4 *flp),
+
+   TP_ARGS(tb_id, flp),
+
+   TP_STRUCT__entry(
+   __field(int,tb_id   )
+   __field(int,oif )
+   __field(int,iif )
+   __array(__u8,   src,4   )
+   __array(__u8,   dst,4   )
+   ),
+
+   TP_fast_assign(
+   __entry->tb_id = tb_id;
+   __entry->oif = flp->flowi4_oif;
+   __entry->iif = flp->flowi4_iif;
+   memcpy(__entry->src, &flp->saddr, 4);
+   memcpy(__entry->dst, &flp->daddr, 4);
+   ),
+
+   TP_printk("table %d oif %d iif %d src %pI4 dst %pI4",
+ __entry->tb_id, __entry->oif, __entry->iif,
+ __entry->src, __entry->dst)
+);
+
+TRACE_EVENT(fib_table_lookup_nh,
+
+   TP_PROTO(const struct fib_nh *nh),
+
+   TP_ARGS(nh),
+
+   TP_STRUCT__entry(
+   __string(   name,   nh->nh_dev->name)
+   __field(int,oif )
+   __array(__u8,   src,4   )
+   ),
+
+   TP_fast_assign(
+   __assign_str(name, nh->nh_dev ? nh->nh_dev->name : "not set");
+   __entry->oif = nh->nh_oif;
+   memcpy(__entry->src, &nh->nh_saddr, 4);
+   ),
+
+   TP_printk("nexthop dev %s oif %d src %pI4",
+ __get_str(name), __entry->oif, __entry->src)
+);
+
+TRACE_EVENT(fib_validate_source,
+
+   TP_PROTO(const struct net_device *dev, const struct flowi4 *flp),
+
+   TP_ARGS(dev, flp),
+
+   TP_STRUCT__entry(
+   __string(   name,   dev->name   )
+   __field(int,oif )
+   __field(int,iif )
+   __array(__u8,   src,4   )
+   __array(__u8,   dst,4   )
+   ),
+
+   TP_fast_assign(
+   __assign_str(name, dev ? dev->name : "not set");
+   __entry->oif = flp->flowi4_oif;
+   __entry->iif = flp->flowi4_iif;
+   memcpy(__entry->src, &flp->saddr, 4);
+   memcpy(__entry->dst, &flp->daddr, 4);
+   ),
+
+   TP_printk("dev %s oif %d iif %d src %pI4 dst %pI4",
+ __get_str(name), __entry->oif, __entry->iif,
+ __entry->src, __entry->dst)
+);
+#endif /* _TRACE_FIB_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/net/core/net-traces.c b/net/core/net-traces.c
index ba3c0120786c..adef015b2f41 100644
--- a/net/core/net-traces.c
+++ b/net/core/net-traces.c
@@ -31,6 +31,7 @@
 #include <trace/events/napi.h>
 #include <trace/events/sock.h>
 #include <trace/events/udp.h>
+#include <trace/events/fib.h>
 
 EXPORT_TRACEPOINT_SYMBOL_GPL(kfree_skb);
 
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 7fa277176c33..4036c94dfbe1 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -46,6 +46,7 @@
 #include <net/rtnetlink.h>
 #include <net/xfrm.h>
 #include <net/vrf.h>
+#include <trace/events/fib.h>
 
 #ifndef CONFIG_IP_MULTIPLE_TABLES
 
@@ -344,6 +345,8 @@ static int __fib_validate_source(struct sk_buff *skb, 
__be32 src, __be32 dst,
 
fl4.flowi4_mark = IN_DEV_SRC_VMARK(idev) ? skb->mark : 0;
 
+   trace_fib_validate_source(dev, &fl4);
+
net = dev_net(dev);
if (fib_lookup(net, &fl4, &res, 0))
goto last_resort;
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 1243c79cb5b0..f552ee31a39d 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -81,6 +81,7 @@
 #include <net/sock.h>
 #include <net/ip_fib.h>
 #include <net/switchdev.h>
+#include <trace/events/fib.h>
 #include "fib_lookup.h"
 
 #define MAX_STAT_DEPTH 32
@@ -1278,6 +1279,8 @@ int fib_table_lookup(struct fib_table *tb, const struct 
flowi4 *flp,
unsigned long index;
t_key cindex

[PATCH ipsec-next] xfrm: Use VRF master index if output device is enslaved

2015-08-18 Thread David Ahern
Directs route lookups to VRF table. Compiles out if NET_VRF is not
enabled. With this patch, ipsec tunnels can be brought up successfully
in VRFs, even with duplicate network configuration (IPv4 tested).

Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 net/ipv4/xfrm4_policy.c | 7 +--
 net/ipv6/xfrm6_policy.c | 7 +--
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/xfrm4_policy.c b/net/ipv4/xfrm4_policy.c
index 55b3c0f4dde5..35757f6af2d5 100644
--- a/net/ipv4/xfrm4_policy.c
+++ b/net/ipv4/xfrm4_policy.c
@@ -15,6 +15,7 @@
 #include <net/dst.h>
 #include <net/xfrm.h>
 #include <net/ip.h>
+#include <net/vrf.h>
 
 static struct xfrm_policy_afinfo xfrm4_policy_afinfo;
 
@@ -107,8 +108,10 @@ _decode_session4(struct sk_buff *skb, struct flowi *fl, 
int reverse)
struct flowi4 *fl4 = &fl->u.ip4;
int oif = 0;
 
-   if (skb_dst(skb))
-   oif = skb_dst(skb)->dev->ifindex;
+   if (skb_dst(skb)) {
+   oif = vrf_master_ifindex_rcu(skb_dst(skb)->dev) ?
+   : skb_dst(skb)->dev->ifindex;
+   }
 
memset(fl4, 0, sizeof(struct flowi4));
fl4->flowi4_mark = skb->mark;
diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c
index a74013d3eceb..4a88b89becf5 100644
--- a/net/ipv6/xfrm6_policy.c
+++ b/net/ipv6/xfrm6_policy.c
@@ -20,6 +20,7 @@
 #include <net/ip.h>
 #include <net/ipv6.h>
 #include <net/ip6_route.h>
+#include <net/vrf.h>
 #if IS_ENABLED(CONFIG_IPV6_MIP6)
 #include <net/mip6.h>
 #endif
@@ -131,8 +132,10 @@ _decode_session6(struct sk_buff *skb, struct flowi *fl, 
int reverse)
 
nexthdr = nh[nhoff];
 
-   if (skb_dst(skb))
-   oif = skb_dst(skb)->dev->ifindex;
+   if (skb_dst(skb)) {
+   oif = vrf_master_ifindex_rcu(skb_dst(skb)->dev) ?
+   : skb_dst(skb)->dev->ifindex;
+   }
 
memset(fl6, 0, sizeof(struct flowi6));
fl6->flowi6_mark = skb->mark;
-- 
2.3.2 (Apple Git-55)



Re: [PATCH net-next v2] net: Move VRF change to udp_sendmsg to function

2015-08-18 Thread David Ahern

On 8/18/15 10:03 AM, Eric Dumazet wrote:


+/* unconnected socket. If output device is enslaved to a VRF
+ * device lookup source address from VRF table.
+ */
+static void udp_sendmsg_vrf_saddr(struct net *net, struct flowi4 *fl4,
+  int oif, struct sock *sk)
+{
+   if (netif_index_is_vrf(net, oif)) {
+   __u8 flow_flags = fl4->flowi4_flags;
+   struct rtable *rt;
+
+   fl4->flowi4_flags = flow_flags | FLOWI_FLAG_VRFSRC;
+   rt = ip_route_output_flow(net, fl4, sk);
+   if (!IS_ERR(rt))
+   ip_rt_put(rt);


This looks buggy. What happened to saddr = fl4->saddr; ?


Not needed.




+   fl4->flowi4_flags = flow_flags;
+   }
+}
+
  int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
  {
struct inet_sock *inet = inet_sk(sk);
@@ -1013,33 +1033,16 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t len)

if (!rt) {
struct net *net = sock_net(sk);
-   __u8 flow_flags = inet_sk_flowi_flags(sk);

fl4 = &fl4_stack;

-   /* unconnected socket. If output device is enslaved to a VRF
-* device lookup source address from VRF table. This mimics
-* behavior of ip_route_connect{_init}.
-*/
-   if (netif_index_is_vrf(net, ipc.oif)) {
-   flowi4_init_output(fl4, ipc.oif, sk->sk_mark, tos,
-  RT_SCOPE_UNIVERSE, sk->sk_protocol,
-  (flow_flags | FLOWI_FLAG_VRFSRC),
-  faddr, saddr, dport,
-  inet->inet_sport);
-
-   rt = ip_route_output_flow(net, fl4, sk);
-   if (!IS_ERR(rt)) {
-   saddr = fl4->saddr;
-   ip_rt_put(rt);
-   }
-   }
-
flowi4_init_output(fl4, ipc.oif, sk->sk_mark, tos,
   RT_SCOPE_UNIVERSE, sk->sk_protocol,
-  flow_flags,
+  inet_sk_flowi_flags(sk),
   faddr, saddr, dport, inet->inet_sport);

+   udp_sendmsg_vrf_saddr(net, fl4, ipc.oif, sk);
+


fl4->saddr gets set in udp_sendmsg_vrf_saddr, stays for the next two...


security_sk_classify_flow(sk, flowi4_to_flowi(fl4));
rt = ip_route_output_flow(net, fl4, sk);
if (IS_ERR(rt)) {




and then right after the above block you have:


if (msg->msg_flags&MSG_CONFIRM)
goto do_confirm;
back_from_confirm:

saddr = fl4->saddr;

So in short the original code change did not need the 'saddr = fl4->saddr;'.


David


Re: linux-next: unregister_netdevice: waiting for lo to become free. Usage count = 1

2015-08-18 Thread David Ahern

On 8/18/15 9:24 AM, Andrey Wagin wrote:

Hello David,

CRIU tests detect that references on net devices leak on
4.2.0-rc6-next-20150817. Looks like it started with
v4.2-rc6-882-g3bfd847.


1e3136789975f03e461798149309034e5213c1b4 should have fixed it.

David


[PATCH net-next v2] net: Move VRF change to udp_sendmsg to function

2015-08-18 Thread David Ahern
Functionally equivalent, but as a separate function with VRF config
check. After 2f52bdcf6ba (net: Updates to netif_index_is_vrf) function
completely compiles out if VRF is not enabled; additional CONFIG
check is not needed.

Suggested-by: Tom Herbert t...@herbertland.com
Signed-off-by: David Ahern d...@cumulusnetworks.com
---
v2
- removed inline per Dave's comment
- removed IS_ENABLED(CONFIG_NET_VRF) check; no longer needed after 2f52bdcf6ba

 net/ipv4/udp.c | 43 +++
 1 file changed, 23 insertions(+), 20 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index c0a15e7f359f..76c5e5e945f8 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -873,6 +873,24 @@ int udp_push_pending_frames(struct sock *sk)
 }
 EXPORT_SYMBOL(udp_push_pending_frames);
 
+/* unconnected socket. If output device is enslaved to a VRF
+ * device lookup source address from VRF table.
+ */
+static void udp_sendmsg_vrf_saddr(struct net *net, struct flowi4 *fl4,
+  int oif, struct sock *sk)
+{
+   if (netif_index_is_vrf(net, oif)) {
+   __u8 flow_flags = fl4->flowi4_flags;
+   struct rtable *rt;
+
+   fl4->flowi4_flags = flow_flags | FLOWI_FLAG_VRFSRC;
+   rt = ip_route_output_flow(net, fl4, sk);
+   if (!IS_ERR(rt))
+   ip_rt_put(rt);
+   fl4->flowi4_flags = flow_flags;
+   }
+}
+
 int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 {
struct inet_sock *inet = inet_sk(sk);
@@ -1013,33 +1033,16 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t len)
 
if (!rt) {
struct net *net = sock_net(sk);
-   __u8 flow_flags = inet_sk_flowi_flags(sk);
 
fl4 = &fl4_stack;
 
-   /* unconnected socket. If output device is enslaved to a VRF
-* device lookup source address from VRF table. This mimics
-* behavior of ip_route_connect{_init}.
-*/
-   if (netif_index_is_vrf(net, ipc.oif)) {
-   flowi4_init_output(fl4, ipc.oif, sk->sk_mark, tos,
-  RT_SCOPE_UNIVERSE, sk->sk_protocol,
-  (flow_flags | FLOWI_FLAG_VRFSRC),
-  faddr, saddr, dport,
-  inet->inet_sport);
-
-   rt = ip_route_output_flow(net, fl4, sk);
-   if (!IS_ERR(rt)) {
-   saddr = fl4->saddr;
-   ip_rt_put(rt);
-   }
-   }
-
flowi4_init_output(fl4, ipc.oif, sk->sk_mark, tos,
   RT_SCOPE_UNIVERSE, sk->sk_protocol,
-  flow_flags,
+  inet_sk_flowi_flags(sk),
   faddr, saddr, dport, inet->inet_sport);
 
+   udp_sendmsg_vrf_saddr(net, fl4, ipc.oif, sk);
+
security_sk_classify_flow(sk, flowi4_to_flowi(fl4));
rt = ip_route_output_flow(net, fl4, sk);
if (IS_ERR(rt)) {
-- 
2.3.2 (Apple Git-55)
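
The save, modify, restore pattern used by udp_sendmsg_vrf_saddr() can be
illustrated in userspace C. Everything here is a hypothetical stand-in:
struct flow_sketch, the flag value, the two addresses and
route_lookup_sketch() are invented for the example, and the assumption that
the lookup fills in an unset source address follows the explanation given in
the review thread rather than the kernel source.

```c
#include <assert.h>
#include <stdint.h>

#define FLAG_VRFSRC_SKETCH 0x08   /* hypothetical flag value, illustration only */

struct flow_sketch {
    uint8_t  flags;
    uint32_t saddr;   /* 0 means "source address not chosen yet" */
};

/* Stand-in for a route lookup: picks a source address if none is set.
 * With the VRF flag it chooses the (made-up) VRF-table address. */
static int route_lookup_sketch(struct flow_sketch *fl)
{
    if (fl->saddr == 0)
        fl->saddr = (fl->flags & FLAG_VRFSRC_SKETCH) ? 0x0a020101u
                                                     : 0xc0a80001u;
    return 0;
}

/* Temporarily set the VRF flag, run the lookup for its side effect on
 * saddr, then restore the caller's flags. */
static void vrf_saddr_sketch(struct flow_sketch *fl, int oif_is_vrf)
{
    if (oif_is_vrf) {
        uint8_t saved = fl->flags;

        fl->flags = saved | FLAG_VRFSRC_SKETCH;
        (void)route_lookup_sketch(fl);   /* side effect: fl->saddr is set */
        fl->flags = saved;               /* caller's flags are unchanged */
    }
}
```

Because the source address lives in the flow structure, a later lookup on the
same structure keeps the address chosen during the VRF pass, which is why no
separate copy into a local variable is needed at that point.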



[PATCH] inet: Move VRF table lookup to inlined function

2015-08-16 Thread David Ahern
Table lookup compiles out when VRF is not enabled.

Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 include/net/vrf.h  | 24 
 net/ipv4/af_inet.c | 10 +-
 2 files changed, 25 insertions(+), 9 deletions(-)

diff --git a/include/net/vrf.h b/include/net/vrf.h
index 0484d29d4589..40e3793c7a05 100644
--- a/include/net/vrf.h
+++ b/include/net/vrf.h
@@ -81,6 +81,25 @@ static inline int vrf_dev_table(const struct net_device *dev)
return tb_id;
 }
 
+static inline int vrf_dev_table_ifindex(struct net *net, int ifindex)
+{
+   struct net_device *dev;
+   int tb_id = 0;
+
+   if (!ifindex)
+   return 0;
+
+   rcu_read_lock();
+
+   dev = dev_get_by_index_rcu(net, ifindex);
+   if (dev)
+   tb_id = vrf_dev_table_rcu(dev);
+
+   rcu_read_unlock();
+
+   return tb_id;
+}
+
 /* called with rtnl */
 static inline int vrf_dev_table_rtnl(const struct net_device *dev)
 {
@@ -125,6 +144,11 @@ static inline int vrf_dev_table(const struct net_device 
*dev)
return 0;
 }
 
+static inline int vrf_dev_table_ifindex(struct net *net, int ifindex)
+{
+   return 0;
+}
+
 static inline int vrf_dev_table_rtnl(const struct net_device *dev)
 {
return 0;
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index c8b855882fa5..675e88cac2b4 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -450,15 +450,7 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, 
int addr_len)
goto out;
}
 
-   if (sk->sk_bound_dev_if) {
-   struct net_device *dev;
-
-   rcu_read_lock();
-   dev = dev_get_by_index_rcu(net, sk->sk_bound_dev_if);
-   if (dev)
-   tb_id = vrf_dev_table_rcu(dev) ? : tb_id;
-   rcu_read_unlock();
-   }
+   tb_id = vrf_dev_table_ifindex(net, sk->sk_bound_dev_if) ? : tb_id;
chk_addr_ret = inet_addr_type_table(net, addr->sin_addr.s_addr, tb_id);
 
/* Not specified by any standard per-se, however it breaks too
-- 
2.3.2 (Apple Git-55)
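
The lookup-with-fallback shape of vrf_dev_table_ifindex() and its caller can
be sketched in userspace C. The device array, table numbers and function
names below are hypothetical, and the sketch spells out the GNU "a ? : b"
shorthand used in the patch as a standard C conditional.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical device record: vrf_table is 0 when the device has no VRF table. */
struct dev_sketch {
    int ifindex;
    int vrf_table;
};

static const struct dev_sketch devs_sketch[] = {
    { 1, 0 },    /* plain device */
    { 2, 10 },   /* device associated with VRF table 10 */
    { 3, 0 },
};

/* Return the VRF table for an ifindex, or 0 if unbound, unknown or non-VRF. */
static int table_for_ifindex_sketch(int ifindex)
{
    size_t i;

    if (!ifindex)
        return 0;
    for (i = 0; i < sizeof(devs_sketch) / sizeof(devs_sketch[0]); i++) {
        if (devs_sketch[i].ifindex == ifindex)
            return devs_sketch[i].vrf_table;
    }
    return 0;
}

/* Use the VRF table when there is one, otherwise keep the caller's default.
 * Equivalent to the GNU form: table_for_ifindex_sketch(x) ? : default_table */
static int bind_table_sketch(int bound_ifindex, int default_table)
{
    int tb = table_for_ifindex_sketch(bound_ifindex);

    return tb ? tb : default_table;
}
```

Returning 0 for every "no VRF" case is what lets the caller collapse the
open-coded block into a single expression with a default.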



Re: [PATCH net-next] inetpeer: Add support for VRFs

2015-08-23 Thread David Ahern

On 8/23/15 6:15 PM, Thomas Graf wrote:

On 08/23/15 at 08:26am, David Ahern wrote:

inetpeer caches based on address only, so duplicate IP addresses within
a namespace return the same cached entry. Similar to IP fragments handle
duplicate addresses across VRFs by adding the VRF master device index to
the lookup.


We have a lot of other places which use the address only. Are you
going to add the VRF id to all these places as well?



If appropriate, yes. I have fixed IP fragments and this patch fixes 
inetpeer cache. In both cases (L3 artifacts) the vrf device index 
provides the means to uniquely identify duplicate IP addresses within a 
namespace. If you know of other code that might be impacted I will 
investigate and fix as needed.


Thanks,
David


[PATCH] net af_key: Fix RCU splat

2015-08-20 Thread David Ahern
Hit the following splat testing VRF change for ipsec:

[  113.475692] ===
[  113.476194] [ INFO: suspicious RCU usage. ]
[  113.476667] 4.2.0-rc6-1+deb7u2+clUNRELEASED #3.2.65-1+deb7u2+clUNRELEASED 
Not tainted
[  113.477545] ---
[  113.478013] /work/monster-14/dsa/kernel.git/include/linux/rcupdate.h:568 
Illegal context switch in RCU read-side critical section!
[  113.479288]
[  113.479288] other info that might help us debug this:
[  113.479288]
[  113.480207]
[  113.480207] rcu_scheduler_active = 1, debug_locks = 1
[  113.480931] 2 locks held by setkey/6829:
[  113.481371]  #0:  (net-xfrm.xfrm_cfg_mutex){+.+.+.}, at: 
[814e9887] pfkey_sendmsg+0xfb/0x213
[  113.482509]  #1:  (rcu_read_lock){..}, at: [814e767f] 
rcu_read_lock+0x0/0x6e
[  113.483509]
[  113.483509] stack backtrace:
[  113.484041] CPU: 0 PID: 6829 Comm: setkey Not tainted 
4.2.0-rc6-1+deb7u2+clUNRELEASED #3.2.65-1+deb7u2+clUNRELEASED
[  113.485422] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.7.5.1-0-g8936dbb-20141113_115728-nilsson.home.kraxel.org 04/01/2014
[  113.486845]  0001 88001d4c7a98 81518af2 
81086962
[  113.487732]  88001d538480 88001d4c7ac8 8107ae75 
8180a154
[  113.488628]  0b30  00d0 
88001d4c7ad8
[  113.489525] Call Trace:
[  113.489813]  [81518af2] dump_stack+0x4c/0x65
[  113.490389]  [81086962] ? console_unlock+0x3d6/0x405
[  113.491039]  [8107ae75] lockdep_rcu_suspicious+0xfa/0x103
[  113.491735]  [81064032] rcu_preempt_sleep_check+0x45/0x47
[  113.492442]  [8106404d] ___might_sleep+0x19/0x1c8
[  113.493077]  [81064268] __might_sleep+0x6c/0x82
[  113.493681]  [81133190] 
cache_alloc_debugcheck_before.isra.50+0x1d/0x24
[  113.494508]  [81134876] kmem_cache_alloc+0x31/0x18f
[  113.495149]  [814012b5] skb_clone+0x64/0x80
[  113.495712]  [814e6f71] pfkey_broadcast_one+0x3d/0xff
[  113.496380]  [814e7b84] pfkey_broadcast+0xb5/0x11e
[  113.497024]  [814e82d1] pfkey_register+0x191/0x1b1
[  113.497653]  [814e9770] pfkey_process+0x162/0x17e
[  113.498274]  [814e9895] pfkey_sendmsg+0x109/0x213

In pfkey_sendmsg the net mutex is taken and then pfkey_broadcast takes
the RCU lock. Fix by using GFP_ATOMIC for the allocation flag.

Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 net/key/af_key.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/key/af_key.c b/net/key/af_key.c
index b397f0aa9005..73527e7dd247 100644
--- a/net/key/af_key.c
+++ b/net/key/af_key.c
@@ -1670,7 +1670,7 @@ static int pfkey_register(struct sock *sk, struct sk_buff 
*skb, const struct sad
return -ENOBUFS;
}
 
-   pfkey_broadcast(supp_skb, GFP_KERNEL, BROADCAST_REGISTERED, sk, 
sock_net(sk));
+   pfkey_broadcast(supp_skb, GFP_ATOMIC, BROADCAST_REGISTERED, sk, 
sock_net(sk));
 
return 0;
 }
-- 
2.3.2 (Apple Git-55)



[PATCH ipsec-next v2] xfrm: Use VRF master index if output device is enslaved

2015-08-20 Thread David Ahern
Directs route lookups to VRF table. Compiles out if NET_VRF is not
enabled. With this patch, ipsec tunnels can be brought up successfully
in VRFs, even with duplicate network configuration.

Signed-off-by: David Ahern d...@cumulusnetworks.com
---
v2
- use vrf_master_ifindex rather than vrf_master_ifindex_rcu

 net/ipv4/xfrm4_policy.c | 7 +--
 net/ipv6/xfrm6_policy.c | 7 +--
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/xfrm4_policy.c b/net/ipv4/xfrm4_policy.c
index 55b3c0f4dde5..35757f6af2d5 100644
--- a/net/ipv4/xfrm4_policy.c
+++ b/net/ipv4/xfrm4_policy.c
@@ -15,6 +15,7 @@
 #include <net/dst.h>
 #include <net/xfrm.h>
 #include <net/ip.h>
+#include <net/vrf.h>
 
 static struct xfrm_policy_afinfo xfrm4_policy_afinfo;
 
@@ -107,8 +108,10 @@ _decode_session4(struct sk_buff *skb, struct flowi *fl, 
int reverse)
struct flowi4 *fl4 = &fl->u.ip4;
int oif = 0;
 
-   if (skb_dst(skb))
-   oif = skb_dst(skb)->dev->ifindex;
+   if (skb_dst(skb)) {
+   oif = vrf_master_ifindex(skb_dst(skb)->dev) ?
+   : skb_dst(skb)->dev->ifindex;
+   }
 
memset(fl4, 0, sizeof(struct flowi4));
fl4->flowi4_mark = skb->mark;
diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c
index a74013d3eceb..4a88b89becf5 100644
--- a/net/ipv6/xfrm6_policy.c
+++ b/net/ipv6/xfrm6_policy.c
@@ -20,6 +20,7 @@
 #include <net/ip.h>
 #include <net/ipv6.h>
 #include <net/ip6_route.h>
+#include <net/vrf.h>
 #if IS_ENABLED(CONFIG_IPV6_MIP6)
 #include <net/mip6.h>
 #endif
@@ -131,8 +132,10 @@ _decode_session6(struct sk_buff *skb, struct flowi *fl, int reverse)
 
nexthdr = nh[nhoff];
 
-   if (skb_dst(skb))
-   oif = skb_dst(skb)->dev->ifindex;
+   if (skb_dst(skb)) {
+   oif = vrf_master_ifindex(skb_dst(skb)->dev) ?
+   : skb_dst(skb)->dev->ifindex;
+   }
 
memset(fl6, 0, sizeof(struct flowi6));
fl6->flowi6_mark = skb->mark;
-- 
2.3.2 (Apple Git-55)
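
The "a ? : b" form used in both hunks is the GNU C conditional with an
omitted middle operand: the result is the master index when it is non-zero,
otherwise the device's own ifindex. The following is only a userspace sketch
of that selection. demo_dev, demo_master_ifindex and demo_select_oif are
hypothetical stand-ins for struct net_device and vrf_master_ifindex(), and
the fallback is spelled out in portable C rather than with the extension.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct net_device; only what the sketch needs. */
struct demo_dev {
	int ifindex;        /* index of this device */
	int master_ifindex; /* index of the VRF master, 0 when not enslaved */
};

/* Plays the role of vrf_master_ifindex(): 0 means "no VRF master". */
int demo_master_ifindex(const struct demo_dev *dev)
{
	return dev ? dev->master_ifindex : 0;
}

/* Portable spelling of: oif = master_ifindex(dev) ? : dev->ifindex; */
int demo_select_oif(const struct demo_dev *dev)
{
	int master;

	assert(dev != NULL);
	master = demo_master_ifindex(dev);
	return master ? master : dev->ifindex;
}
```

A device with no master keeps its own index as the oif, while an enslaved
device reports the index of its master, which is what steers the lookup to
the VRF table.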



Re: [net-next 0/16] Proposal for VRF-lite - v3

2015-07-28 Thread David Ahern

On 7/27/15 2:30 PM, Eric W. Biederman wrote:

This paragraph is false when it comes to sockets, as I have already
pointed out.

- VPN Routing and Forwarding (RFC4364 and its kin) implies isolation
   strong enough to allow using the same ip on different machines
   in different VPN instances and not have confusion.

- The routing table is not the only table in the kernel that uses
   an ip address as a key.

   The result is that you can combine packet fragments that come in
   on different interfaces (irrespective of your VPN), confuse tcp
   parameters between interfaces, scramble your ipsec connections and I
   don't know what else.


The duplicate IP address is a problem with the networking stack today;
the VRF device does not introduce it. The VRF device does allow
duplicate IP addresses within a namespace as long as they are in
separate VRFs, though yes, the various places that rely solely on the
source address, like IP fragmentation, do need to be fixed.


I looked at the IPv4 fragmentation code yesterday and will continue
today. So help me with the history: is there any reason why the device
index is not used today? It seems like a straightforward change.


1. simple netdevices with the same IP address
-- no problem using index in the lookup

2. 2 ipsec tunnels -- different netdevices, same IP address
-- no problem using index

3. stacked devices like bonding and team interfaces appear to the stack 
as a single device

-- no problem using index of stacked device

4. If an interface is deleted and a new one is created with the same IP 
address then we want to fail the lookup

-- no problem using index

5. other???

Is there a use case where I can't add ifindex of the incoming device (or
higher level device if skb->dev is changed) to the hash and lookup for
fragments?
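
To make the question concrete, here is a minimal userspace sketch of a
reassembly key that includes the device index. It is not the kernel's
fragment code: demo_frag_key, demo_frag_match and demo_frag_hash are
hypothetical names, and the FNV-1a hash is just a convenient stand-in for
whatever hash the real lookup uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reassembly key; names are illustrative, not the kernel's. */
struct demo_frag_key {
	uint32_t saddr;    /* source address */
	uint32_t daddr;    /* destination address */
	uint16_t id;       /* IP identification field */
	uint8_t  protocol; /* upper-layer protocol */
	int      ifindex;  /* incoming device, or its VRF master */
};

/* Fragments share a queue only when every field, ifindex included, matches. */
bool demo_frag_match(const struct demo_frag_key *a,
		     const struct demo_frag_key *b)
{
	assert(a != NULL && b != NULL);
	return a->saddr == b->saddr && a->daddr == b->daddr &&
	       a->id == b->id && a->protocol == b->protocol &&
	       a->ifindex == b->ifindex;
}

/* FNV-1a over the key fields, fed byte by byte; any reasonable hash would do. */
uint32_t demo_frag_hash(const struct demo_frag_key *k)
{
	uint32_t words[5];
	uint32_t h = 2166136261u; /* FNV offset basis */
	size_t i;
	int shift;

	assert(k != NULL);
	words[0] = k->saddr;
	words[1] = k->daddr;
	words[2] = k->id;
	words[3] = k->protocol;
	words[4] = (uint32_t)k->ifindex;

	for (i = 0; i < 5; i++) {
		for (shift = 0; shift < 32; shift += 8) {
			h ^= (words[i] >> shift) & 0xffu;
			h *= 16777619u; /* FNV prime */
		}
	}
	return h;
}
```

With the index in the key, two fragments that agree on addresses, id and
protocol but arrive on different devices no longer land in the same queue.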




Version 3
- addressed comments from first 2 RFCs with the exception of the name
   Nicolas: We will do the name conversion once we agree on what the
correct name should be (vrf, mrf or something else)


Not so.  I described the deep problems between your goals and your
implementation and they are not even mentioned let alone addressed.


I have addressed comments to the extent that I can. As I stated in my
last followup to you, Eric, I did not understand your point. I asked for
clarification, a --verbose if you will. I can't read your mind, so I
need you to elaborate on your points to be able to respond and address
your concerns.





-  packets flow through the VRF device in both directions allowing the
following:
- tcpdump -i vrfn
- tc rules on vrf device
- netfilter rules on vrf device

Ingo/Andy: I added you two as a start point for the proposed task related
changes. Not sure who should be the reviewer; please let me know
if someone else is more appropriate. Thanks.


It looks like you are trying to implement a namespace that isn't a
namespace.  Given that it is broken by design you have my nack.


This is an L3 separation within a namespace, not a device level 
separation which is what namespaces provide.


David


Re: [PATCH net-next 14/16] net: Add sk_bind_dev_if to task_struct

2015-07-28 Thread David Ahern

On 7/28/15 10:01 AM, Eric Dumazet wrote:

On Tue, 2015-07-28 at 14:19 +0200, Hannes Frederic Sowa wrote:

Hello Eric,

On Mon, 2015-07-27 at 15:33 -0500, Eric W. Biederman wrote:

David Ahern d...@cumulusnetworks.com writes:


Allow tasks to have a default device index for binding sockets. If set
the value is passed to all AF_INET/AF_INET6 sockets when they are
created.

The task setting is passed parent to child on fork, but can be set or
changed after task creation using prctl (if task has CAP_NET_ADMIN
permissions). The setting for a socket can be retrieved using prctl().
This option allows an administrator to restrict a task to only
send/receive packets through the specified device. In the case of VRF
devices this option restricts tasks to a specific VRF.

Correlation of the device index to a specific VRF, i.e.,
ifindex --> VRF device --> VRF id
is left to userspace.


Nacked-by: Eric W. Biederman ebied...@xmission.com

Because it is broken by design.  Your routing device is only safe for
programs that know its limitations; it is not appropriate for general
applications.

Since you don't even seem to know its limitations I think this is a
bad path to walk down.


Can you please elaborate about the broken by design?

Different operating systems are already using this approach with good
success. I read your other mail regarding isolation of different VRFs
and I agree that all code which persists state depending solely on the
IP address is affected by this and this must be dealt with and fixed
(actually, there aren't too many).

But I wouldn't call that broken by design. This stuff will get fixed
like e.g. cross-talk between fragmentation queues, icmp rate limiters
etc, which could already happen in the past.

What is your opinion on the fundamental approach only from a user
perspective? Do you think that is broken, too?


I agree with Eric here.

This sk_bind_dev_if on task_struct is quite a hack.

What will be added next ? An array of dev_if ? netfilter support ?
af_packet support ? What about /proc files and netlink dumps ?


It could just as easily be a pointer to a struct (e.g., struct net_ctx) 
such that the intrusion to task_struct is simply 8 bytes -- very similar 
to the nsproxy used for the assorted namespaces. The struct can then 
contain whatever network config is imposed on the task.




We already have network namespaces. Extend this if needed, instead of
bypassing them.


Problems with using network namespaces for VRFs have been discussed in
the past, e.g.,

http://www.spinics.net/lists/netdev/msg298368.html

David



No need to add something else (with lack of proper reporting for various
tools)






Re: [PATCH net-next 14/16] net: Add sk_bind_dev_if to task_struct

2015-07-28 Thread David Ahern

On 7/28/15 9:25 AM, Andy Lutomirski wrote:

On Jul 27, 2015 11:33 AM, David Ahern d...@cumulusnetworks.com wrote:


Allow tasks to have a default device index for binding sockets. If set
the value is passed to all AF_INET/AF_INET6 sockets when they are created.



This is not intended to be a review of the concept.  I haven't thought
about whether the concept is a good idea, broken by design, or
whatever.  FWIW, if this were added to the kernel and didn't require
excessive privilege, I'd probably use it.  (I still don't really
understand why binding to a device requires privilege in the first
place, but, again, I haven't thought about it very much.)


The intent here is to restrict a task to only sending and receiving
packets through a single network device. The device can be a single
ethernet interface, a stacked device (e.g., bond) or in our case a VRF
device which restricts a task to interfaces (and hence network paths)
associated with the VRF.





+#ifdef CONFIG_NET
+   case PR_SET_SK_BIND_DEV_IF:
+   {
+   struct net_device *dev;
+   int idx = (int) arg2;
+
+   if (!capable(CAP_NET_ADMIN))
+   return -EPERM;
+


Can you either use ns_capable or add a comment as to why not?


will do.



Also, please return -EINVAL if unused args are nonzero.


ok.




+   if (idx) {
+   dev = dev_get_by_index(me->nsproxy->net_ns, idx);
+   if (!dev)
+   return -EINVAL;
+   dev_put(dev);
+   }
+   me->sk_bind_dev_if = idx;
+   break;
+   }
+   case PR_GET_SK_BIND_DEV_IF:
+   {
+   struct task_struct *tsk;
+   int sk_bind_dev_if = -EINVAL;
+
+   rcu_read_lock();
+   tsk = find_task_by_vpid(arg2);
+   if (tsk)
+   sk_bind_dev_if = tsk->sk_bind_dev_if;


Why do you support different tasks here?  Could this use proc instead?


In this case we want to allow a separate process to determine if a task 
is restricted to a device.




The same -EINVAL issue applies.

Also, I think you need to hook setns and unshare to do something
reasonable when the task is bound to a device.


ack on both.

Thanks for the review,
David
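
For reference, the argument checks acked above (permission first, then
-EINVAL for non-zero unused arguments, then validation of the index) could
be ordered roughly as in this userspace sketch. Everything here is
hypothetical: demo_dev_exists stands in for dev_get_by_index(), the
net_admin flag stands in for the capability check (it does not model the
capable() versus ns_capable() question), and *bound stands in for the
task's sk_bind_dev_if.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for dev_get_by_index(): is idx a known device? */
bool demo_dev_exists(const int *known, size_t n, int idx)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (known[i] == idx)
			return true;
	return false;
}

/*
 * Validation order for a PR_SET_SK_BIND_DEV_IF-style request:
 * privilege first, then unused arguments, then the device index.
 * On success the chosen index is stored in *bound (0 clears the binding).
 */
int demo_set_bind_dev_if(const int *known, size_t n, bool net_admin,
			 unsigned long arg2, unsigned long arg3,
			 unsigned long arg4, unsigned long arg5, int *bound)
{
	int idx = (int)arg2;

	assert(bound != NULL);
	if (!net_admin)
		return -EPERM;
	if (arg3 || arg4 || arg5)
		return -EINVAL;
	if (idx && !demo_dev_exists(known, n, idx))
		return -EINVAL;

	*bound = idx;
	return 0;
}
```

A rejected request leaves the previous binding untouched, so a caller can
retry without first re-reading the current value.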


Re: [PATCH net-next 13/16] net: Introduce VRF device driver - v2

2015-07-28 Thread David Ahern

On 7/27/15 2:01 PM, Nikolay Aleksandrov wrote:

+
+   if (!vrf_is_master(dev) || vrf_is_master(port_dev) ||


Hmm, this means that bonds won't be able to be VRF slaves.
They have the IFF_MASTER flag set.


Right, will change to the IFF_VRF_MASTER flag.




+   vrf_is_slave(port_dev))
+   return -EINVAL;
+
+   return do_vrf_add_slave(dev, port_dev);
+}
+
+/* inverse of do_vrf_add_slave */
+static int do_vrf_del_slave(struct net_device *dev, struct net_device *port_dev)
+{
+   struct net_vrf *vrf = netdev_priv(dev);
+   struct slave_queue *queue = &vrf->queue;
+   struct net_vrf_dev *vrf_ptr = NULL;
+   struct slave *slave;
+
+   vrf_ptr = rcu_dereference(dev->vrf_ptr);
+   RCU_INIT_POINTER(dev->vrf_ptr, NULL);


I think this isn't safe, you should wait for a grace period before freeing the
pointer. Actually you can just move the kfree() below the 
netdev_rx_handler_unregister()
since it does synchronize_rcu() anyway.


ok

And ack on all other comments..

Thanks for the review,
David



[PATCH 08/10] net: Use passed in table for nexthop lookups

2015-08-05 Thread David Ahern
If a user passes in a table for new routes use that table for nexthop
lookups. Specifically, this solves the case where a connected route does
not exist in the main table, but only in another table, and then a
subsequent route is added with a next hop using the connected route. i.e.,

$ ip route ls
default via 10.0.2.2 dev eth0
10.0.2.0/24 dev eth0  proto kernel  scope link  src 10.0.2.15
169.254.0.0/16 dev eth0  scope link  metric 1003
192.168.56.0/24 dev eth1  proto kernel  scope link  src 192.168.56.51

$ ip route ls table 10
1.1.1.0/24 dev eth2  scope link

Without this patch adding a nexthop route fails:

$ ip route add table 10 2.2.2.0/24 via 1.1.1.10
RTNETLINK answers: Network is unreachable

With this patch the route is added successfully.

Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 net/ipv4/fib_semantics.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 85e9a8abf15c..b7f1d20a9615 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -691,6 +691,7 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
}
rcu_read_lock();
{
+   struct fib_table *tbl = NULL;
struct flowi4 fl4 = {
.daddr = nh->nh_gw,
.flowi4_scope = cfg->fc_scope + 1,
@@ -701,8 +702,16 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
/* It is not necessary, but requires a bit of thinking */
if (fl4.flowi4_scope < RT_SCOPE_LINK)
fl4.flowi4_scope = RT_SCOPE_LINK;
-   err = fib_lookup(net, &fl4, &res,
-FIB_LOOKUP_IGNORE_LINKSTATE);
+
+   if (cfg->fc_table)
+   tbl = fib_get_table(net, cfg->fc_table);
+
+   if (tbl)
+   err = fib_table_lookup(tbl, &fl4, &res,
+  FIB_LOOKUP_IGNORE_LINKSTATE);
+   else
+   err = fib_lookup(net, &fl4, &res,
+FIB_LOOKUP_IGNORE_LINKSTATE);
if (err) {
rcu_read_unlock();
return err;
-- 
2.3.2 (Apple Git-55)
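
The table selection in this patch can be modelled in userspace as follows.
This is only a sketch under simplifying assumptions: the demo_* types and
functions are hypothetical, a table is reduced to a flat list of
prefix/mask pairs, and the fallback consults a single default table where
the real fib_lookup() walks the FIB rules.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_IP(a, b, c, d) \
	(((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
	 ((uint32_t)(c) << 8) | (uint32_t)(d))

/* Hypothetical, heavily simplified routing structures. */
struct demo_route {
	uint32_t prefix;
	uint32_t mask;
};

struct demo_table {
	uint32_t id;
	const struct demo_route *routes;
	size_t nroutes;
};

/* Stand-in for fib_get_table(): find a table by id, or NULL. */
const struct demo_table *demo_get_table(const struct demo_table *tables,
					size_t ntables, uint32_t id)
{
	size_t i;

	for (i = 0; i < ntables; i++)
		if (tables[i].id == id)
			return &tables[i];
	return NULL;
}

/* Stand-in for fib_table_lookup(): 0 if some route covers addr. */
int demo_table_lookup(const struct demo_table *tbl, uint32_t addr)
{
	size_t i;

	assert(tbl != NULL);
	for (i = 0; i < tbl->nroutes; i++)
		if ((addr & tbl->routes[i].mask) == tbl->routes[i].prefix)
			return 0;
	return -ENETUNREACH;
}

/*
 * Resolve a gateway: prefer the table named in the request, otherwise
 * fall back to the default table (a simplification of fib_lookup()).
 */
int demo_check_nexthop(const struct demo_table *tables, size_t ntables,
		       uint32_t cfg_table, uint32_t default_table, uint32_t gw)
{
	const struct demo_table *tbl = NULL;

	if (cfg_table)
		tbl = demo_get_table(tables, ntables, cfg_table);
	if (!tbl)
		tbl = demo_get_table(tables, ntables, default_table);

	return tbl ? demo_table_lookup(tbl, gw) : -ENETUNREACH;
}
```

With 1.1.1.0/24 present only in table 10, a gateway of 1.1.1.10 resolves
when the request names table 10 and is unreachable when it does not, which
mirrors the before and after behaviour in the commit message.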



[PATCH net-next 00/10] VRF-lite - v4

2015-08-05 Thread David Ahern
 to the VRF (sk_bound_dev_if is set to the VRF device).

5. Neighbor entries
   Neighbor entries are not impacted by the VRF device. Entries are
   associated with a particular interface; the VRF association is indirect
   via the interface-to-VRF device enslavement.


Version 4
- builds are clean with and without VRF device enabled (no, yes and module)

- tightened the driver implementation
  + device add/delete, slave add/remove, and module unload are all clean

- fixed RCU references
  + with RCU and lock debugging enabled changes are clean through the
suite of tests

- TX path uses custom dst, so patch refactoring rtable allocation is
  dropped along with the patch adding rt_nexthop helper

- dropped the task patch that adds default bind to interface for sockets
  and the associated chvrf example command
  + the patches are a convenience for running unmodified code. They
are not needed for the core functionality. Any application with
support for SO_BINDTODEVICE works properly with this patch set.


Version 3
- addressed comments from first 2 RFCs with the exception of the name
  Nicolas: We will do the name conversion once we agree on what the
   correct name should be (vrf, mrf or something else)

-  packets flow through the VRF device in both directions allowing the
   following:
   - tcpdump -i vrfn
   - tc rules on vrf device
   - netfilter rules on vrf device


TO-DO
=====
1. IPv6

2. ip fragments

3. ipsec, xfrms

4. listen filter to restrict VRF connections
   - i.e., bind to VRF's a, b, c only or NOT VRFs e, f, g


Eric B:
  I think I understand your points regarding ip fragments and ipsec now.
  I will release additional patches for both, but it takes time. For
  example, I have ipsec working with VRFs implemented using the VRF
  driver but more changes are needed. Once I have multiple tunnels with
  overlapping network spaces working I will be sending out patches for
  review.


Thanks to Nikolay for his many, many code reviews whipping the device
driver into shape, and bug fixes and ideas from Hannes, Roopa Prabhu,
Jon Toppins, Jamal.

Patches can also be pulled from:
https://github.com/dsahern/linux.git, vrf-dev-v4 branch
https://github.com/dsahern/iproute2,  vrf-dev-v4 branch

David Ahern (10):
  net: Introduce VRF related flags and helpers
  net: Use VRF device index for lookups on RX
  net: Use VRF device index for lookups on TX
  udp: Handle VRF device
  net: Add inet_addr lookup by table
  net: Fix up inet_addr_type checks
  net: Add routes to the table associated with the device
  net: Use passed in table for nexthop lookups
  net: Use VRF device index for socket lookups
  net: Introduce VRF device driver

 drivers/net/Kconfig  |   7 +
 drivers/net/Makefile |   1 +
 drivers/net/vrf.c| 715 +++
 include/linux/netdevice.h|  20 ++
 include/net/flow.h   |   1 +
 include/net/route.h  |   7 +
 include/net/vrf.h| 176 +++
 include/uapi/linux/if_link.h |   9 +
 net/ipv4/af_inet.c   |  13 +-
 net/ipv4/arp.c   |  15 +-
 net/ipv4/fib_frontend.c  |  66 +++-
 net/ipv4/fib_semantics.c |  44 ++-
 net/ipv4/fib_trie.c  |   7 +-
 net/ipv4/icmp.c  |   9 +-
 net/ipv4/route.c |   8 +-
 net/ipv4/syncookies.c|   5 +-
 net/ipv4/tcp_input.c |   6 +-
 net/ipv4/tcp_ipv4.c  |  11 +-
 net/ipv4/udp.c   |  25 +-
 19 files changed, 1102 insertions(+), 43 deletions(-)
 create mode 100644 drivers/net/vrf.c
 create mode 100644 include/net/vrf.h

-- 
2.3.2 (Apple Git-55)


[PATCH] iproute2: Add support for VRF device

2015-08-05 Thread David Ahern
Allow user to create a vrf device and specify its table binding.
Based on the iplink_vlan implementation.

Signed-off-by: Shrijeet Mukherjee s...@cumulusnetworks.com
Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 include/linux/if_link.h |  8 +
 ip/Makefile |  2 +-
 ip/iplink.c |  2 +-
 ip/iplink_vrf.c | 85 +
 4 files changed, 95 insertions(+), 2 deletions(-)
 create mode 100644 ip/iplink_vrf.c

diff --git a/include/linux/if_link.h b/include/linux/if_link.h
index b905cf7f4948..74dedf4320b8 100644
--- a/include/linux/if_link.h
+++ b/include/linux/if_link.h
@@ -338,6 +338,14 @@ enum macvlan_macaddr_mode {
 
 #define MACVLAN_FLAG_NOPROMISC 1
 
+/* VRF section */
+enum {
+   IFLA_VRF_UNSPEC,
+   IFLA_VRF_TABLE,
+   __IFLA_VRF_MAX
+};
+
+#define IFLA_VRF_MAX (__IFLA_VRF_MAX - 1)
 /* IPVLAN section */
 enum {
IFLA_IPVLAN_UNSPEC,
diff --git a/ip/Makefile b/ip/Makefile
index 77653ecc5785..d8b38ac2e44b 100644
--- a/ip/Makefile
+++ b/ip/Makefile
@@ -7,7 +7,7 @@ IPOBJ=ip.o ipaddress.o ipaddrlabel.o iproute.o iprule.o ipnetns.o \
 iplink_vxlan.o tcp_metrics.o iplink_ipoib.o ipnetconf.o link_ip6tnl.o \
 link_iptnl.o link_gre6.o iplink_bond.o iplink_bond_slave.o iplink_hsr.o \
 iplink_bridge.o iplink_bridge_slave.o ipfou.o iplink_ipvlan.o \
-iplink_geneve.o
+iplink_geneve.o iplink_vrf.o
 
 RTMONOBJ=rtmon.o
 
diff --git a/ip/iplink.c b/ip/iplink.c
index 369d50eab94e..14bf7211a447 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -94,7 +94,7 @@ void iplink_usage(void)
fprintf(stderr, "TYPE := { vlan | veth | vcan | dummy | ifb | macvlan | macvtap |\n");
fprintf(stderr, "  bridge | bond | ipoib | ip6tnl | ipip | sit | vxlan |\n");
fprintf(stderr, "  gre | gretap | ip6gre | ip6gretap | vti | nlmon |\n");
-   fprintf(stderr, "  bond_slave | ipvlan | geneve }\n");
+   fprintf(stderr, "  bond_slave | ipvlan | geneve | vrf }\n");
}
exit(-1);
 }
diff --git a/ip/iplink_vrf.c b/ip/iplink_vrf.c
new file mode 100644
index ..0d7e21c7c152
--- /dev/null
+++ b/ip/iplink_vrf.c
@@ -0,0 +1,85 @@
+/* iplink_vrf.cVRF device support
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License
+ *  as published by the Free Software Foundation; either version
+ *  2 of the License, or (at your option) any later version.
+ *
+ * Authors: Shrijeet Mukherjee s...@cumulusnetworks.com
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/socket.h>
+#include <linux/if_link.h>
+
+#include "rt_names.h"
+#include "utils.h"
+#include "ip_common.h"
+
+static void vrf_explain(FILE *f)
+{
+   fprintf(f, "Usage: ... vrf table TABLEID \n");
+}
+
+static void explain(void)
+{
+   vrf_explain(stderr);
+}
+
+static int table_arg(void)
+{
+   fprintf(stderr,"Error: argument of \"table\" must be 0-32767 and currently unused\n");
+   return -1;
+}
+
+static int vrf_parse_opt(struct link_util *lu, int argc, char **argv,
+   struct nlmsghdr *n)
+{
+   while (argc > 0) {
+   if (matches(*argv, "table") == 0) {
+   __u32 table = 0;
+   NEXT_ARG();
+
+   table = atoi(*argv);
+   if (table < 0 || table > 32767)
+   return table_arg();
+   addattr32(n, 1024, IFLA_VRF_TABLE, table);
+   } else if (matches(*argv, "help") == 0) {
+   explain();
+   return -1;
+   } else {
+   fprintf(stderr, "vrf: unknown option \"%s\"?\n",
+   *argv);
+   explain();
+   return -1;
+   }
+   argc--, argv++;
+   }
+
+   return 0;
+}
+
+static void vrf_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
+{
+   if (!tb)
+   return;
+
+   if (tb[IFLA_VRF_TABLE])
+   fprintf(f, "table %u ", rta_getattr_u32(tb[IFLA_VRF_TABLE]));
+}
+
+static void vrf_print_help(struct link_util *lu, int argc, char **argv,
+ FILE *f)
+{
+   vrf_explain(f);
+}
+
+struct link_util vrf_link_util = {
+   .id = "vrf",
+   .maxattr= IFLA_VRF_MAX,
+   .parse_opt  = vrf_parse_opt,
+   .print_opt  = vrf_print_opt,
+   .print_help = vrf_print_help,
+};
-- 
2.3.2 (Apple Git-55)
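
A side note on the table argument: because the parsed value is stored in an
unsigned __u32, the "table < 0" comparison can never be true, and atoi()
gives no error indication, so a non-numeric argument would be read as
table 0. The sketch below shows one stricter way to parse the same 0-32767
range with strtoul(). demo_parse_table and DEMO_TABLE_MAX are hypothetical
names for illustration and are not part of iproute2, which may well prefer
its own parsing helpers here.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define DEMO_TABLE_MAX 32767UL /* upper bound quoted in the patch's error text */

/*
 * Parse a table id in the range 0..DEMO_TABLE_MAX.
 * Returns 0 and stores the value on success, -1 on any malformed input.
 */
int demo_parse_table(const char *arg, uint32_t *table)
{
	char *end;
	unsigned long val;

	assert(table != NULL);
	if (!arg || *arg == '\0' || *arg == '-')
		return -1;

	errno = 0;
	val = strtoul(arg, &end, 0);
	if (errno || *end != '\0' || val > DEMO_TABLE_MAX)
		return -1;

	*table = (uint32_t)val;
	return 0;
}
```

On failure the output value is left untouched, so the caller can report the
error and keep whatever default it already had.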



[PATCH 01/10] net: Introduce VRF related flags and helpers

2015-08-05 Thread David Ahern
Add a VRF_MASTER flag for interfaces and helper functions for determining
if a device is a VRF_MASTER.

Add link attribute for passing VRF_TABLE id.

Add vrf_ptr to netdevice.

Add various macros for determining if a device is a VRF device, the index
of the master VRF device and table associated with VRF device.

Signed-off-by: Shrijeet Mukherjee s...@cumulusnetworks.com
Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 include/linux/netdevice.h|  20 +
 include/net/vrf.h| 177 +++
 include/uapi/linux/if_link.h |   9 +++
 3 files changed, 206 insertions(+)
 create mode 100644 include/net/vrf.h

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 607b5f41f46f..f7a6ef2fae3a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1289,6 +1289,7 @@ enum netdev_priv_flags {
IFF_XMIT_DST_RELEASE_PERM   = 1<<22,
IFF_IPVLAN_MASTER   = 1<<23,
IFF_IPVLAN_SLAVE= 1<<24,
+   IFF_VRF_MASTER  = 1<<25,
 };
 
 #define IFF_802_1Q_VLAN IFF_802_1Q_VLAN
@@ -1316,6 +1317,7 @@ enum netdev_priv_flags {
 #define IFF_XMIT_DST_RELEASE_PERM  IFF_XMIT_DST_RELEASE_PERM
 #define IFF_IPVLAN_MASTER  IFF_IPVLAN_MASTER
 #define IFF_IPVLAN_SLAVE   IFF_IPVLAN_SLAVE
+#define IFF_VRF_MASTER IFF_VRF_MASTER
 
 /**
  * struct net_device - The DEVICE structure.
@@ -1432,6 +1434,7 @@ enum netdev_priv_flags {
  * @dn_ptr: DECnet specific data
  * @ip6_ptr:   IPv6 specific data
  * @ax25_ptr:  AX.25 specific data
+ * @vrf_ptr:   VRF specific data
  * @ieee80211_ptr: IEEE 802.11 specific data, assign before registering
  *
  * @last_rx:   Time of last Rx
@@ -1650,6 +1653,7 @@ struct net_device {
struct dn_dev __rcu *dn_ptr;
struct inet6_dev __rcu  *ip6_ptr;
void *ax25_ptr;
+   struct net_vrf_dev __rcu *vrf_ptr;
struct wireless_dev *ieee80211_ptr;
struct wpan_dev *ieee802154_ptr;
 #if IS_ENABLED(CONFIG_MPLS_ROUTING)
@@ -3808,6 +3812,22 @@ static inline bool netif_supports_nofcs(struct net_device *dev)
return dev->priv_flags & IFF_SUPP_NOFCS;
 }
 
+static inline bool netif_is_vrf(const struct net_device *dev)
+{
+   return dev->priv_flags & IFF_VRF_MASTER;
+}
+
+static inline bool netif_index_is_vrf(struct net *net, int ifindex)
+{
+   struct net_device *dev = dev_get_by_index_rcu(net, ifindex);
+   bool rc = false;
+
+   if (dev)
+   rc = netif_is_vrf(dev);
+
+   return rc;
+}
+
 /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
 static inline void netif_keep_dst(struct net_device *dev)
 {
diff --git a/include/net/vrf.h b/include/net/vrf.h
new file mode 100644
index ..5d4bd67a4902
--- /dev/null
+++ b/include/net/vrf.h
@@ -0,0 +1,177 @@
+/*
+ * include/net/net_vrf.h - adds vrf dev structure definitions
+ * Copyright (c) 2015 Cumulus Networks
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#ifndef __LINUX_NET_VRF_H
+#define __LINUX_NET_VRF_H
+
+struct net_vrf_dev {
+   struct rcu_head rcu;
+   int ifindex; /* ifindex of master dev */
+   u32 tb_id;   /* table id for VRF */
+};
+
+struct slave {
+   struct list_head list;
+   struct net_device   *dev;
+};
+
+struct slave_queue {
+   struct list_head all_slaves;
+   int num_slaves;
+};
+
+struct net_vrf {
+   struct slave_queue  queue;
+   struct fib_table *tb;
+   struct rtable   *rth;
+   u32 tb_id;
+};
+
+
+#if IS_ENABLED(CONFIG_NET_VRF)
+/* called with rcu_read_lock() */
+static inline int vrf_master_ifindex_rcu(const struct net_device *dev)
+{
+   struct net_vrf_dev *vrf_ptr;
+   int ifindex = 0;
+
+   if (!dev)
+   return 0;
+
+   if (netif_is_vrf(dev))
+   ifindex = dev->ifindex;
+   else {
+   vrf_ptr = rcu_dereference(dev->vrf_ptr);
+   if (vrf_ptr)
+   ifindex = vrf_ptr->ifindex;
+   }
+
+   return ifindex;
+}
+
+static inline int vrf_master_ifindex(const struct net_device *dev)
+{
+   int ifindex;
+
+   rcu_read_lock();
+   ifindex = vrf_master_ifindex_rcu(dev);
+   rcu_read_unlock();
+
+   return ifindex;
+}
+
+static inline int vrf_master_ifindex_by_index(struct net *net, int ifindex)
+{
+   int rc = 0;
+
+   if (ifindex) {
+   struct net_device *dev = dev_get_by_index(net, ifindex);
+
+   if (dev) {
+   rc

[PATCH 05/10] net: Add inet_addr lookup by table

2015-08-05 Thread David Ahern
Currently inet_addr_type and inet_dev_addr_type expect local addresses
to be in the local table. With the VRF device, local routes for devices
associated with a VRF will be in the table associated with the VRF.
Provide an alternate inet_addr lookup to use a specific table rather
than defaulting to the local table.

Signed-off-by: Shrijeet Mukherjee s...@cumulusnetworks.com
Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 include/net/route.h |  1 +
 net/ipv4/fib_frontend.c | 22 +++---
 2 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/include/net/route.h b/include/net/route.h
index 94189d4bd899..6ba681f0b98d 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -189,6 +189,7 @@ void ipv4_sk_redirect(struct sk_buff *skb, struct sock *sk);
 void ip_rt_send_redirect(struct sk_buff *skb);
 
 unsigned int inet_addr_type(struct net *net, __be32 addr);
+unsigned int inet_addr_type_table(struct net *net, __be32 addr, int tb_id);
 unsigned int inet_dev_addr_type(struct net *net, const struct net_device *dev,
__be32 addr);
 void ip_rt_multicast_event(struct in_device *);
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index d8ced1d89f1b..b11321a8e58d 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -212,12 +212,12 @@ void fib_flush_external(struct net *net)
  */
 static inline unsigned int __inet_dev_addr_type(struct net *net,
const struct net_device *dev,
-   __be32 addr)
+   __be32 addr, int tb_id)
 {
struct flowi4   fl4 = { .daddr = addr };
struct fib_result   res;
unsigned int ret = RTN_BROADCAST;
-   struct fib_table *local_table;
+   struct fib_table *table;
 
if (ipv4_is_zeronet(addr) || ipv4_is_lbcast(addr))
return RTN_BROADCAST;
@@ -226,10 +226,10 @@ static inline unsigned int __inet_dev_addr_type(struct net *net,
 
rcu_read_lock();
 
-   local_table = fib_get_table(net, RT_TABLE_LOCAL);
-   if (local_table) {
+   table = fib_get_table(net, tb_id);
+   if (table) {
ret = RTN_UNICAST;
-   if (!fib_table_lookup(local_table, &fl4, &res, FIB_LOOKUP_NOREF)) {
+   if (!fib_table_lookup(table, &fl4, &res, FIB_LOOKUP_NOREF)) {
if (!dev || dev == res.fi->fib_dev)
ret = res.type;
}
@@ -239,16 +239,24 @@ static inline unsigned int __inet_dev_addr_type(struct net *net,
return ret;
 }
 
+unsigned int inet_addr_type_table(struct net *net, __be32 addr, int tb_id)
+{
+   return __inet_dev_addr_type(net, NULL, addr, tb_id);
+}
+EXPORT_SYMBOL(inet_addr_type_table);
+
 unsigned int inet_addr_type(struct net *net, __be32 addr)
 {
-   return __inet_dev_addr_type(net, NULL, addr);
+   return __inet_dev_addr_type(net, NULL, addr, RT_TABLE_LOCAL);
 }
 EXPORT_SYMBOL(inet_addr_type);
 
 unsigned int inet_dev_addr_type(struct net *net, const struct net_device *dev,
__be32 addr)
 {
-   return __inet_dev_addr_type(net, dev, addr);
+   int rt_table = vrf_dev_table(dev) ? : RT_TABLE_LOCAL;
+
+   return __inet_dev_addr_type(net, dev, addr, rt_table);
 }
 EXPORT_SYMBOL(inet_dev_addr_type);
 
-- 
2.3.2 (Apple Git-55)



[PATCH 03/10] net: Use VRF device index for lookups on TX

2015-08-05 Thread David Ahern
As with ingress, use the index of the VRF master device for route lookups
on egress. However, the oif should only be used to direct the lookups to a
specific table. Routes in the table are not based on the VRF device but
rather on the interfaces that are part of the VRF, so do not consider the
oif for lookups within the table. The FLOWI_FLAG_VRFSRC flag is used to
control this latter part.

Signed-off-by: Shrijeet Mukherjee s...@cumulusnetworks.com
Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 include/net/flow.h  | 1 +
 include/net/route.h | 3 +++
 net/ipv4/fib_trie.c | 7 +--
 net/ipv4/icmp.c | 4 
 net/ipv4/route.c| 5 +
 5 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/net/flow.h b/include/net/flow.h
index 3098ae33a178..f305588fc162 100644
--- a/include/net/flow.h
+++ b/include/net/flow.h
@@ -33,6 +33,7 @@ struct flowi_common {
__u8 flowic_flags;
 #define FLOWI_FLAG_ANYSRC  0x01
 #define FLOWI_FLAG_KNOWN_NH 0x02
+#define FLOWI_FLAG_VRFSRC  0x04
__u32   flowic_secid;
struct flowi_tunnel flowic_tun_key;
 };
diff --git a/include/net/route.h b/include/net/route.h
index 2d45f419477f..94189d4bd899 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -251,6 +251,9 @@ static inline void ip_route_connect_init(struct flowi4 *fl4, __be32 dst, __be32
if (inet_sk(sk)->transparent)
flow_flags |= FLOWI_FLAG_ANYSRC;
 
+   if (netif_index_is_vrf(sock_net(sk), oif))
+   flow_flags |= FLOWI_FLAG_VRFSRC;
+
flowi4_init_output(fl4, oif, sk->sk_mark, tos, RT_SCOPE_UNIVERSE,
   protocol, flow_flags, dst, src, dport, sport);
 }
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 37c4bb89a708..1243c79cb5b0 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -1423,8 +1423,11 @@ int fib_table_lookup(struct fib_table *tb, const struct flowi4 *flp,
nh->nh_flags & RTNH_F_LINKDOWN &&
!(fib_flags & FIB_LOOKUP_IGNORE_LINKSTATE))
continue;
-   if (flp->flowi4_oif && flp->flowi4_oif != nh->nh_oif)
-   continue;
+   if (!(flp->flowi4_flags & FLOWI_FLAG_VRFSRC)) {
+   if (flp->flowi4_oif &&
+   flp->flowi4_oif != nh->nh_oif)
+   continue;
+   }
 
if (!(fib_flags & FIB_LOOKUP_NOREF))
atomic_inc(&fi->fib_clntref);
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index c0556f1e4bf0..1164fc4ce3bc 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -96,6 +96,7 @@
 #include <net/xfrm.h>
 #include <net/inet_common.h>
 #include <net/ip_fib.h>
+#include <net/vrf.h>
 
 /*
  * Build xmit assembly blocks
@@ -425,6 +426,7 @@ static void icmp_reply(struct icmp_bxm *icmp_param, struct sk_buff *skb)
fl4.flowi4_mark = mark;
fl4.flowi4_tos = RT_TOS(ip_hdr(skb)->tos);
fl4.flowi4_proto = IPPROTO_ICMP;
+   fl4.flowi4_oif = vrf_master_ifindex_rcu(skb->dev) ? : skb->dev->ifindex;
security_skb_classify_flow(skb, flowi4_to_flowi(&fl4));
rt = ip_route_output_key(net, &fl4);
if (IS_ERR(rt))
@@ -458,6 +460,8 @@ static struct rtable *icmp_route_lookup(struct net *net,
fl4->flowi4_proto = IPPROTO_ICMP;
fl4->fl4_icmp_type = type;
fl4->fl4_icmp_code = code;
+   fl4->flowi4_oif = vrf_master_ifindex_rcu(skb_in->dev) ? : skb_in->dev->ifindex;
+
security_skb_classify_flow(skb_in, flowi4_to_flowi(fl4));
rt = __ip_route_output_key(net, fl4);
if (IS_ERR(rt))
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index c26ff1f7067d..2c89d294b669 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2131,6 +2131,11 @@ struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4)
fl4->saddr = inet_select_addr(dev_out, 0,
  RT_SCOPE_HOST);
}
+   if (netif_is_vrf(dev_out) &&
+   !(fl4->flowi4_flags & FLOWI_FLAG_VRFSRC)) {
+   rth = vrf_dev_get_rth(dev_out);
+   goto out;
+   }
}
 
if (!fl4->daddr) {
-- 
2.3.2 (Apple Git-55)



[PATCH 10/10] net: Introduce VRF device driver

2015-08-05 Thread David Ahern
This driver borrows heavily from IPvlan and teaming drivers.

Routing domains (VRF-lite) are created by instantiating a VRF master
device with an associated table and enslaving all routed interfaces that
participate in the domain. As part of the enslavement, all connected
routes for the enslaved devices are moved to the table associated with
the VRF device. Outgoing sockets must bind to the VRF device to function.

Standard FIB rules bind the VRF device to tables, and regular FIB rule
processing is followed. Traffic routed through the box is forwarded by
using the VRF device as the IIF and following the IIF rule to a table
that is mated with the VRF.

Example:

   Create vrf 1:
 ip link add vrf1 type vrf table 5
 ip rule add iif vrf1 table 5
 ip rule add oif vrf1 table 5
 ip route add table 5 prohibit default
 ip link set vrf1 up

   Add interface to vrf 1:
 ip link set eth1 master vrf1
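
The rule setup in the example can be modelled in userspace as a minimal
sketch. The toy_rule type, the toy_rules list and toy_lookup_table() are
hypothetical stand-ins, not kernel APIs. The sketch only assumes that a rule
matching the VRF device as IIF or OIF selects table 5 and that any other
flow falls through to the main table (254).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical model of "ip rule add iif|oif <dev> table <id>". */
struct toy_rule {
	const char *iif;   /* match on input device name, or NULL */
	const char *oif;   /* match on output device name, or NULL */
	int table;         /* table selected when the rule matches */
};

/* Mirrors the two rules from the example: iif vrf1 and oif vrf1 -> table 5 */
static const struct toy_rule toy_rules[] = {
	{ "vrf1", NULL, 5 },
	{ NULL, "vrf1", 5 },
};

#define TOY_RT_TABLE_MAIN 254

/* Return the table in which a flow with the given iif/oif is looked up */
static int toy_lookup_table(const char *iif, const char *oif)
{
	size_t i;

	for (i = 0; i < sizeof(toy_rules) / sizeof(toy_rules[0]); i++) {
		const struct toy_rule *r = &toy_rules[i];

		if (r->iif && iif && strcmp(r->iif, iif) == 0)
			return r->table;
		if (r->oif && oif && strcmp(r->oif, oif) == 0)
			return r->table;
	}
	return TOY_RT_TABLE_MAIN; /* no VRF rule matched */
}
```

A packet received on eth1 would present vrf1 as its IIF once eth1 is
enslaved, so toy_lookup_table("vrf1", NULL) selects table 5 in this model.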

Signed-off-by: Shrijeet Mukherjee s...@cumulusnetworks.com
Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 drivers/net/Kconfig  |   7 +
 drivers/net/Makefile |   1 +
 drivers/net/vrf.c| 715 +++
 include/net/vrf.h|   1 -
 4 files changed, 723 insertions(+), 1 deletion(-)
 create mode 100644 drivers/net/vrf.c

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index c18f9e62a9fa..e58468b02987 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -297,6 +297,13 @@ config NLMON
  diagnostics, etc. This is mostly intended for developers or support
  to debug netlink issues. If unsure, say N.
 
+config NET_VRF
+   tristate "Virtual Routing and Forwarding (Lite)"
+   depends on IP_MULTIPLE_TABLES && IPV6_MULTIPLE_TABLES
+   ---help---
+ This option enables the support for mapping interfaces into VRF's. The
+ support enables VRF devices.
+
 endif # NET_CORE
 
 config SUNGEM_PHY
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index c12cb22478a7..ca16dd689b36 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -25,6 +25,7 @@ obj-$(CONFIG_VIRTIO_NET) += virtio_net.o
 obj-$(CONFIG_VXLAN) += vxlan.o
 obj-$(CONFIG_GENEVE) += geneve.o
 obj-$(CONFIG_NLMON) += nlmon.o
+obj-$(CONFIG_NET_VRF) += vrf.o
 
 #
 # Networking Drivers
diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
new file mode 100644
index 000000000000..75c06ee2efa3
--- /dev/null
+++ b/drivers/net/vrf.c
@@ -0,0 +1,715 @@
+/*
+ * vrf.c: device driver to encapsulate a VRF space
+ *
+ * Copyright (c) 2015 Cumulus Networks. All rights reserved.
+ * Copyright (c) 2015 Shrijeet Mukherjee s...@cumulusnetworks.com
+ * Copyright (c) 2015 David Ahern d...@cumulusnetworks.com
+ *
+ * Based on dummy, team and ipvlan drivers
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/ip.h>
+#include <linux/init.h>
+#include <linux/moduleparam.h>
+#include <linux/netfilter.h>
+#include <linux/rtnetlink.h>
+#include <net/rtnetlink.h>
+#include <linux/u64_stats_sync.h>
+#include <linux/hashtable.h>
+
+#include <linux/inetdevice.h>
+#include <net/ip.h>
+#include <net/ip_fib.h>
+#include <net/ip6_route.h>
+#include <net/rtnetlink.h>
+#include <net/route.h>
+#include <net/addrconf.h>
+#include <net/vrf.h>
+
+#define DRV_NAME   "vrf"
+#define DRV_VERSION    "1.0"
+
+#define vrf_is_slave(dev)   ((dev)->flags & IFF_SLAVE)
+
+#define vrf_master_get_rcu(dev) \
+   ((struct net_device *)rcu_dereference(dev->rx_handler_data))
+
+struct pcpu_dstats {
+   u64 tx_pkts;
+   u64 tx_bytes;
+   u64 tx_drps;
+   u64 rx_pkts;
+   u64 rx_bytes;
+   struct u64_stats_sync   syncp;
+};
+
+static struct dst_entry *vrf_ip_check(struct dst_entry *dst, u32 cookie)
+{
+   return dst;
+}
+
+static int vrf_ip_local_out(struct sk_buff *skb)
+{
+   return ip_local_out(skb);
+}
+
+static unsigned int vrf_v4_mtu(const struct dst_entry *dst)
+{
+   /* TO-DO: return max ethernet size? */
+   return dst->dev->mtu;
+}
+
+static void vrf_dst_destroy(struct dst_entry *dst)
+{
+   /* our dst lives forever - or until the device is closed */
+}
+
+static unsigned int vrf_default_advmss(const struct dst_entry *dst)
+{
+   return 65535 - 40;
+}
+
+static struct dst_ops vrf_dst_ops = {
+   .family = AF_INET,
+   .local_out  = vrf_ip_local_out,
+   .check  = vrf_ip_check,
+   .mtu    = vrf_v4_mtu,
+   .destroy    = vrf_dst_destroy,
+   .default_advmss = vrf_default_advmss,
+};
+
+static bool is_ip_rx_frame(struct sk_buff *skb)
+{
+   switch (skb->protocol) {
+   case htons

[PATCH 02/10] net: Use VRF device index for lookups on RX

2015-08-05 Thread David Ahern
On ingress use index of VRF master device for route lookups if real device
is enslaved. Rules are expected to be installed for the VRF device to
direct lookups to a specific table.
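
The ingress substitution described here can be sketched in userspace as
follows. The toy_dev type and both functions are hypothetical stand-ins for
struct net_device and the vrf_master_ifindex_rcu() helper; RCU and the real
vrf_ptr indirection are deliberately left out.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct net_device */
struct toy_dev {
	int ifindex;                  /* index of this device */
	const struct toy_dev *master; /* VRF master if enslaved, else NULL */
	int is_vrf_master;            /* non-zero for the VRF device itself */
};

/* Index of the VRF master for dev, or 0 if dev is not part of a VRF */
static int toy_master_ifindex(const struct toy_dev *dev)
{
	if (!dev)
		return 0;
	if (dev->is_vrf_master)
		return dev->ifindex;
	return dev->master ? dev->master->ifindex : 0;
}

/* Input interface index used for the route lookup on receive */
static int toy_rx_lookup_iif(const struct toy_dev *dev)
{
	int master = toy_master_ifindex(dev);

	/* same shape as "vrf_master_ifindex_rcu(dev) ? : dev->ifindex" */
	return master ? master : dev->ifindex;
}
```

An enslaved device reports its master's index, so an "iif vrf1" rule matches
traffic received on any slave, while devices outside a VRF keep their own
index.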

Signed-off-by: Shrijeet Mukherjee s...@cumulusnetworks.com
Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 net/ipv4/fib_frontend.c | 8 +++-
 net/ipv4/route.c| 3 ++-
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 6b98de0d7949..d8ced1d89f1b 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -45,6 +45,7 @@
 #include <net/ip_fib.h>
 #include <net/rtnetlink.h>
 #include <net/xfrm.h>
+#include <net/vrf.h>
 
 #ifndef CONFIG_IP_MULTIPLE_TABLES
 
@@ -309,7 +310,9 @@ static int __fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst,
bool dev_match;
 
fl4.flowi4_oif = 0;
-   fl4.flowi4_iif = oif ? : LOOPBACK_IFINDEX;
+   fl4.flowi4_iif = vrf_master_ifindex_rcu(dev);
+   if (!fl4.flowi4_iif)
+   fl4.flowi4_iif = oif ? : LOOPBACK_IFINDEX;
fl4.daddr = src;
fl4.saddr = dst;
fl4.flowi4_tos = tos;
@@ -339,6 +342,9 @@ static int __fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst,
if (nh->nh_dev == dev) {
dev_match = true;
break;
+   } else if (vrf_master_ifindex_rcu(nh->nh_dev) == dev->ifindex) {
+   dev_match = true;
+   break;
}
}
 #else
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 18fd7c9095c7..c26ff1f7067d 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -112,6 +112,7 @@
 #endif
 #include <net/secure_seq.h>
 #include <net/ip_tunnels.h>
+#include <net/vrf.h>
 
 #define RT_FL_TOS(oldflp4) \
((oldflp4)->flowi4_tos & (IPTOS_RT_MASK | RTO_ONLINK))
@@ -1726,7 +1727,7 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 *  Now we are ready to route packet.
 */
fl4.flowi4_oif = 0;
-   fl4.flowi4_iif = dev->ifindex;
+   fl4.flowi4_iif = vrf_master_ifindex_rcu(dev) ? : dev->ifindex;
fl4.flowi4_mark = skb->mark;
fl4.flowi4_tos = tos;
fl4.flowi4_scope = RT_SCOPE_UNIVERSE;
-- 
2.3.2 (Apple Git-55)



[PATCH 04/10] udp: Handle VRF device

2015-08-05 Thread David Ahern
For unconnected UDP sockets using a VRF device, look up the source address
based on the VRF table. This allows the UDP header to be properly set up
before showing up at the VRF device via the dst.

Signed-off-by: Shrijeet Mukherjee s...@cumulusnetworks.com
Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 net/ipv4/udp.c | 25 +++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 83aa604f9273..b513d72a21b3 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -884,7 +884,7 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
struct rtable *rt = NULL;
int free = 0;
int connected = 0;
-   __be32 daddr, faddr, saddr;
+   __be32 daddr, faddr, saddr, vsaddr = 0;
__be16 dport;
u8  tos;
int err, is_udplite = IS_UDPLITE(sk);
@@ -1013,11 +1013,30 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 
if (!rt) {
struct net *net = sock_net(sk);
+   __u8 flow_flags = inet_sk_flowi_flags(sk);
 
fl4 = &fl4_stack;
+
+   /* unconnected socket. If output device is enslaved to a VRF
+* device lookup source address from VRF table. This mimics
+* behavior of ip_route_connect{_init}.
+*/
+   if (netif_index_is_vrf(net, ipc.oif)) {
+   flowi4_init_output(fl4, ipc.oif, sk->sk_mark, tos,
+  RT_SCOPE_UNIVERSE, sk->sk_protocol,
+  (flow_flags | FLOWI_FLAG_VRFSRC),
+  faddr, saddr, dport, inet->inet_sport);
+
+   rt = ip_route_output_flow(net, fl4, sk);
+   if (!IS_ERR(rt)) {
+   vsaddr = fl4->saddr;
+   ip_rt_put(rt);
+   }
+   }
+
flowi4_init_output(fl4, ipc.oif, sk->sk_mark, tos,
   RT_SCOPE_UNIVERSE, sk->sk_protocol,
-  inet_sk_flowi_flags(sk),
+  flow_flags,
   faddr, saddr, dport, inet->inet_sport);
 
security_sk_classify_flow(sk, flowi4_to_flowi(fl4));
@@ -1042,6 +1061,8 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
goto do_confirm;
 back_from_confirm:
 
+   if (vsaddr)
+   fl4->saddr = vsaddr;
saddr = fl4->saddr;
if (!ipc.addr)
daddr = ipc.addr = fl4->daddr;
-- 
2.3.2 (Apple Git-55)



[PATCH 09/10] net: Use VRF device index for socket lookups

2015-08-05 Thread David Ahern
The intent of the VRF device is to leverage the existing SO_BINDTODEVICE
as a means of creating L3 domains. Since sockets are expected to be bound
to the VRF device the index of the master device needs to be used for
socket lookups.

Signed-off-by: Shrijeet Mukherjee s...@cumulusnetworks.com
Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 net/ipv4/syncookies.c |  5 -
 net/ipv4/tcp_input.c  |  6 +-
 net/ipv4/tcp_ipv4.c   | 11 +--
 3 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index d70b1f603692..e5c8b1240278 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -18,6 +18,7 @@
 #include <linux/export.h>
 #include <net/tcp.h>
 #include <net/route.h>
+#include <net/vrf.h>
 
 extern int sysctl_tcp_syncookies;
 
@@ -348,7 +349,9 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
treq->snt_synack= tcp_opt.saw_tstamp ? tcp_opt.rcv_tsecr : 0;
treq->tfo_listener  = false;
 
-   ireq->ir_iif = sk->sk_bound_dev_if;
+   ireq->ir_iif = vrf_master_ifindex_by_index(sock_net(sk), skb->skb_iif);
+   if (!ireq->ir_iif)
+   ireq->ir_iif = sk->sk_bound_dev_if;
 
/* We throwed the options of the initial SYN away, so we hope
 * the ACK carries the same options again (see RFC1122 4.2.3.8)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 4e4d6bcd0ca9..6b96240a4055 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -72,6 +72,7 @@
 #include <net/dst.h>
 #include <net/tcp.h>
 #include <net/inet_common.h>
+#include <net/vrf.h>
 #include <linux/ipsec.h>
 #include <asm/unaligned.h>
 #include <linux/errqueue.h>
@@ -6141,7 +6142,10 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
tcp_openreq_init(req, &tmp_opt, skb, sk);
 
/* Note: tcp_v6_init_req() might override ir_iif for link locals */
-   inet_rsk(req)->ir_iif = sk->sk_bound_dev_if;
+   inet_rsk(req)->ir_iif = vrf_master_ifindex_by_index(sock_net(sk),
+   skb->skb_iif);
+   if (!inet_rsk(req)->ir_iif)
+   inet_rsk(req)->ir_iif = sk->sk_bound_dev_if;
 
af_ops->init_req(req, sk, skb);
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index d27eb549ced6..0f8ed98a2e64 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -75,6 +75,7 @@
 #include <net/secure_seq.h>
 #include <net/tcp_memcontrol.h>
 #include <net/busy_poll.h>
+#include <net/vrf.h>
 
 #include <linux/inet.h>
 #include <linux/ipv6.h>
@@ -682,6 +683,8 @@ static void tcp_v4_send_reset(struct sock *sk, struct sk_buff *skb)
 */
if (sk)
arg.bound_dev_if = sk->sk_bound_dev_if;
+   if (!arg.bound_dev_if && skb->dev)
+   arg.bound_dev_if = vrf_master_ifindex_rcu(skb->dev);
 
arg.tos = ip_hdr(skb)->tos;
ip_send_unicast_reply(*this_cpu_ptr(net->ipv4.tcp_sk),
@@ -766,8 +769,10 @@ static void tcp_v4_send_ack(struct sk_buff *skb, u32 seq, u32 ack,
  ip_hdr(skb)->saddr, /* XXX */
  arg.iov[0].iov_len, IPPROTO_TCP, 0);
arg.csumoffset = offsetof(struct tcphdr, check) / 2;
-   if (oif)
-   arg.bound_dev_if = oif;
+   arg.bound_dev_if = oif ? : vrf_master_ifindex_rcu(skb_dst(skb)->dev);
+   if (!arg.bound_dev_if)
+   arg.bound_dev_if = vrf_master_ifindex_rcu(skb->dev);
+
arg.tos = tos;
ip_send_unicast_reply(*this_cpu_ptr(net->ipv4.tcp_sk),
  skb, &TCP_SKB_CB(skb)->header.h4.opt,
@@ -1269,6 +1274,8 @@ struct sock *tcp_v4_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
ireq  = inet_rsk(req);
sk_daddr_set(newsk, ireq->ir_rmt_addr);
sk_rcv_saddr_set(newsk, ireq->ir_loc_addr);
+   if (netif_index_is_vrf(sock_net(newsk), ireq->ir_iif))
+   newsk->sk_bound_dev_if = ireq->ir_iif;
newinet->inet_saddr   = ireq->ir_loc_addr;
inet_opt  = ireq->opt;
rcu_assign_pointer(newinet->inet_opt, inet_opt);
-- 
2.3.2 (Apple Git-55)



[PATCH 07/10] net: Add routes to the table associated with the device

2015-08-05 Thread David Ahern
When a device associated with a VRF is brought up or down, routes
should be added to or removed from the table associated with the VRF.
fib_magic defaults to using the main or local tables. Have it use
the table associated with the device if there is one.

A part of this is directing prefsrc validations to the correct
table as well.
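
The table choice made by fib_magic after this change can be sketched as a
small pure function. The TOY_ names are hypothetical; the numeric values
mirror the uapi constants RT_TABLE_MAIN (254), RT_TABLE_LOCAL (255),
RTN_UNICAST (1) and RTN_LOCAL (2). A vrf_tb_id of 0 is assumed to mean
"device is not associated with a VRF".

```c
/* Values mirror the rtnetlink uapi constants; the names are illustrative */
enum { TOY_RTN_UNICAST = 1, TOY_RTN_LOCAL = 2 };
enum { TOY_RT_TABLE_MAIN = 254, TOY_RT_TABLE_LOCAL = 255 };

/*
 * Pick the table that receives the automatically generated route.
 * vrf_tb_id: table of the device's VRF, or 0 if there is none.
 * type:      route type being installed.
 */
static int toy_fib_magic_table(int vrf_tb_id, int type)
{
	if (vrf_tb_id)
		return vrf_tb_id; /* VRF table takes precedence */

	return type == TOY_RTN_UNICAST ? TOY_RT_TABLE_MAIN
				       : TOY_RT_TABLE_LOCAL;
}
```

With a VRF table set, both the connected (unicast) route and the local route
land in that one table, which is why later lookups for local addresses also
have to be directed there.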

Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 net/ipv4/fib_frontend.c  |  8 
 net/ipv4/fib_semantics.c | 25 +++--
 2 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index d84ae0e30369..0a50a08ab844 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -803,6 +803,7 @@ static int inet_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
 static void fib_magic(int cmd, int type, __be32 dst, int dst_len, struct in_ifaddr *ifa)
 {
struct net *net = dev_net(ifa->ifa_dev->dev);
+   int tb_id = vrf_dev_table_rtnl(ifa->ifa_dev->dev);
struct fib_table *tb;
struct fib_config cfg = {
.fc_protocol = RTPROT_KERNEL,
@@ -817,11 +818,10 @@ static void fib_magic(int cmd, int type, __be32 dst, int dst_len, struct in_ifad
},
};
 
-   if (type == RTN_UNICAST)
-   tb = fib_new_table(net, RT_TABLE_MAIN);
-   else
-   tb = fib_new_table(net, RT_TABLE_LOCAL);
+   if (!tb_id)
+   tb_id = (type == RTN_UNICAST) ? RT_TABLE_MAIN : RT_TABLE_LOCAL;
 
+   tb = fib_new_table(net, tb_id);
if (!tb)
return;
 
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 410ddb67221e..85e9a8abf15c 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -838,6 +838,23 @@ __be32 fib_info_update_nh_saddr(struct net *net, struct fib_nh *nh)
return nh->nh_saddr;
 }
 
+static bool fib_valid_prefsrc(struct fib_config *cfg, __be32 fib_prefsrc)
+{
+   if (cfg->fc_type != RTN_LOCAL || !cfg->fc_dst ||
+   fib_prefsrc != cfg->fc_dst) {
+   int tb_id = cfg->fc_table;
+
+   if (tb_id == RT_TABLE_MAIN)
+   tb_id = RT_TABLE_LOCAL;
+
+   if (inet_addr_type_table(cfg->fc_nlinfo.nl_net,
+fib_prefsrc, tb_id) != RTN_LOCAL) {
+   return false;
+   }
+   }
+   return true;
+}
+
 struct fib_info *fib_create_info(struct fib_config *cfg)
 {
int err;
@@ -1033,12 +1050,8 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
fi->fib_flags |= RTNH_F_LINKDOWN;
}
 
-   if (fi->fib_prefsrc) {
-   if (cfg->fc_type != RTN_LOCAL || !cfg->fc_dst ||
-   fi->fib_prefsrc != cfg->fc_dst)
-   if (inet_addr_type(net, fi->fib_prefsrc) != RTN_LOCAL)
-   goto err_inval;
-   }
+   if (fi->fib_prefsrc && !fib_valid_prefsrc(cfg, fi->fib_prefsrc))
+   goto err_inval;
 
change_nexthops(fi) {
fib_info_update_nh_saddr(net, nexthop_nh);
-- 
2.3.2 (Apple Git-55)



[PATCH 06/10] net: Fix up inet_addr_type checks

2015-08-05 Thread David Ahern
Currently inet_addr_type and inet_dev_addr_type expect local addresses
to be in the local table. With the VRF device local routes for devices
associated with a VRF will be in the table associated with the VRF.
Provide an alternate inet_addr lookup to use a specific table rather
than defaulting to the local table.

inet_addr_type_dev_table keeps the same semantics as inet_addr_type but
if the passed in device is enslaved to a VRF then the table for that VRF
is used for the lookup.
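
The difference between the table-aware lookup and the device-aware wrapper
can be illustrated with a toy model. Everything prefixed toy_ is
hypothetical, the route list is invented for the illustration, and the real
functions can return more address types than the two modelled here.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_RT_TABLE_LOCAL 255
#define TOY_RTN_UNICAST    1
#define TOY_RTN_LOCAL      2

/* Hypothetical: one local address recorded in one table */
struct toy_local_route {
	int tb_id;
	uint32_t addr; /* host byte order, for readability only */
};

static const struct toy_local_route toy_routes[] = {
	{ TOY_RT_TABLE_LOCAL, 0x0a00020fu }, /* 10.0.2.15 in the local table */
	{ 5,                  0x01010101u }, /* 1.1.1.1 in VRF table 5 */
};

/* Table-aware lookup: only the given table is searched */
static unsigned int toy_addr_type_table(uint32_t addr, int tb_id)
{
	size_t i;

	for (i = 0; i < sizeof(toy_routes) / sizeof(toy_routes[0]); i++) {
		if (toy_routes[i].tb_id == tb_id && toy_routes[i].addr == addr)
			return TOY_RTN_LOCAL;
	}
	return TOY_RTN_UNICAST;
}

/*
 * Device-aware variant: use the VRF table when the device has one
 * (dev_vrf_tb_id != 0), otherwise fall back to the local table.
 */
static unsigned int toy_addr_type_dev_table(int dev_vrf_tb_id, uint32_t addr)
{
	int tb_id = dev_vrf_tb_id ? dev_vrf_tb_id : TOY_RT_TABLE_LOCAL;

	return toy_addr_type_table(addr, tb_id);
}
```

In this model 1.1.1.1 is local only when the query comes from a device whose
VRF uses table 5, which is the behaviour ARP needs for enslaved devices.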

Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 include/net/route.h  |  3 +++
 net/ipv4/af_inet.c   | 13 -
 net/ipv4/arp.c   | 15 +--
 net/ipv4/fib_frontend.c  | 28 +---
 net/ipv4/fib_semantics.c |  6 --
 net/ipv4/icmp.c  |  5 +++--
 6 files changed, 56 insertions(+), 14 deletions(-)

diff --git a/include/net/route.h b/include/net/route.h
index 6ba681f0b98d..6dda2c1bf8c6 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -192,6 +192,9 @@ unsigned int inet_addr_type(struct net *net, __be32 addr);
 unsigned int inet_addr_type_table(struct net *net, __be32 addr, int tb_id);
 unsigned int inet_dev_addr_type(struct net *net, const struct net_device *dev,
__be32 addr);
+unsigned int inet_addr_type_dev_table(struct net *net,
+ const struct net_device *dev,
+ __be32 addr);
 void ip_rt_multicast_event(struct in_device *);
 int ip_rt_ioctl(struct net *, unsigned int cmd, void __user *arg);
 void ip_rt_get_source(u8 *src, struct sk_buff *skb, struct rtable *rt);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index cc4e498a0ccf..96fba4f63454 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -119,6 +119,7 @@
 #ifdef CONFIG_IP_MROUTE
 #include <linux/mroute.h>
 #endif
+#include <net/vrf.h>
 
 
 /* The inetsw table contains everything that inet_create needs to
@@ -427,6 +428,7 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
struct net *net = sock_net(sk);
unsigned short snum;
int chk_addr_ret;
+   int tb_id = 0;
int err;
 
/* If the socket has its own bind function then use it. (RAW) */
@@ -448,7 +450,16 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
goto out;
}
 
-   chk_addr_ret = inet_addr_type(net, addr->sin_addr.s_addr);
+   if (sk->sk_bound_dev_if) {
+   struct net_device *dev;
+
+   rcu_read_lock();
+   dev = dev_get_by_index_rcu(net, sk->sk_bound_dev_if);
+   if (dev)
+   tb_id = vrf_dev_table_rcu(dev);
+   rcu_read_unlock();
+   }
+   chk_addr_ret = inet_addr_type_table(net, addr->sin_addr.s_addr, tb_id);
 
/* Not specified by any standard per-se, however it breaks too
 * many applications when removed.  It is unfortunate since
diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index 34a308573f4b..30409b75e925 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -233,7 +233,7 @@ static int arp_constructor(struct neighbour *neigh)
return -EINVAL;
}
 
-   neigh->type = inet_addr_type(dev_net(dev), addr);
+   neigh->type = inet_addr_type_dev_table(dev_net(dev), dev, addr);
 
parms = in_dev->arp_parms;
__neigh_parms_put(neigh->parms);
@@ -343,7 +343,7 @@ static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb)
switch (IN_DEV_ARP_ANNOUNCE(in_dev)) {
default:
case 0: /* By default announce any local IP */
-   if (skb && inet_addr_type(dev_net(dev),
+   if (skb && inet_addr_type_dev_table(dev_net(dev), dev,
  ip_hdr(skb)->saddr) == RTN_LOCAL)
saddr = ip_hdr(skb)->saddr;
break;
@@ -351,7 +351,8 @@ static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb)
if (!skb)
break;
saddr = ip_hdr(skb)->saddr;
-   if (inet_addr_type(dev_net(dev), saddr) == RTN_LOCAL) {
+   if (inet_addr_type_dev_table(dev_net(dev), dev,
+saddr) == RTN_LOCAL) {
/* saddr should be known to target */
if (inet_addr_onlink(in_dev, target, saddr))
break;
@@ -751,7 +752,7 @@ static int arp_process(struct sock *sk, struct sk_buff *skb)
/* Special case: IPv4 duplicate address detection packet (RFC2131) */
if (sip == 0) {
if (arp->ar_op == htons(ARPOP_REQUEST) &&
-   inet_addr_type(net, tip) == RTN_LOCAL &&
+   inet_addr_type_dev_table(net, dev, tip) == RTN_LOCAL &&
!arp_ignore(in_dev, sip, tip))
arp_send(ARPOP_REPLY, ETH_P_ARP, sip, dev, tip

Re: [PATCH 09/10] net: Use VRF device index for socket lookups

2015-08-05 Thread David Ahern

Hi Tom:

On 8/5/15 12:32 PM, Tom Herbert wrote:

> On Wed, Aug 5, 2015 at 10:14 AM, David Ahern d...@cumulusnetworks.com wrote:
>> The intent of the VRF device is to leverage the existing SO_BINDTODEVICE
>> as a means of creating L3 domains. Since sockets are expected to be bound
>> to the VRF device the index of the master device needs to be used for
>> socket lookups.
>
> This patch set seems awfully invasive at the socket layer. Isn't there
> any way this functionality can be contained in the routing layer, with
> sockets using the existing API?

This patch is a leftover from earlier versions. It is no longer needed.
I will drop it for v5.


David


Re: [PATCH RFC net-next 1/3] RDS-TCP: Make RDS-TCP work correctly when it is set up in a netns other than init_net

2015-07-30 Thread David Ahern

On 7/30/15 2:55 AM, Sowmini Varadhan wrote:

> diff --git a/net/rds/connection.c b/net/rds/connection.c
> index da6da57..3bea7b9 100644
> --- a/net/rds/connection.c
> +++ b/net/rds/connection.c
> @@ -117,7 +117,8 @@ static void rds_conn_reset(struct rds_connection *conn)
>    * For now they are not garbage collected once they're created.  They
>    * are torn down as the module is removed, if ever.
>    */
> -static struct rds_connection *__rds_conn_create(__be32 laddr, __be32 faddr,
> +static struct rds_connection *__rds_conn_create(struct net *net,
> +   __be32 laddr, __be32 faddr,
>    struct rds_transport *trans, gfp_t gfp,
>    int is_outgoing)
>   {
> @@ -157,6 +158,7 @@ static struct rds_connection *__rds_conn_create(__be32 laddr, __be32 faddr,
> conn->c_faddr = faddr;
> spin_lock_init(&conn->c_lock);
> conn->c_next_tx_seq = 1;
> +   write_pnet(&conn->c_net, net);

These are typically wrapped in helpers like sock_net and sock_net_set.

> diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
> index 0da2a45..c38d8a0 100644
> --- a/net/rds/ib_cm.c
> +++ b/net/rds/ib_cm.c
> @@ -448,8 +448,8 @@ int rds_ib_cm_handle_connect(struct rdma_cm_id *cm_id,
>  (unsigned long long)be64_to_cpu(lguid),
>  (unsigned long long)be64_to_cpu(fguid));
>
> -   conn = rds_conn_create(dp->dp_daddr, dp->dp_saddr, &rds_ib_transport,
> -  GFP_KERNEL);
> +   conn = rds_conn_create(&init_net, dp->dp_daddr, dp->dp_saddr,
> +  &rds_ib_transport, GFP_KERNEL);

I forget what connection this is. Control channel? You should at least
put a note as to why it is using init_net.

> diff --git a/net/rds/iw_cm.c b/net/rds/iw_cm.c
> index 8f486fa..4ea55a3 100644
> --- a/net/rds/iw_cm.c
> +++ b/net/rds/iw_cm.c
> @@ -398,8 +398,8 @@ int rds_iw_cm_handle_connect(struct rdma_cm_id *cm_id,
>  &dp->dp_saddr, &dp->dp_daddr,
>  RDS_PROTOCOL_MAJOR(version), RDS_PROTOCOL_MINOR(version));
>
> -   conn = rds_conn_create(dp->dp_daddr, dp->dp_saddr, &rds_iw_transport,
> -  GFP_KERNEL);
> +   conn = rds_conn_create(&init_net, dp->dp_daddr, dp->dp_saddr,
> +  &rds_iw_transport, GFP_KERNEL);

Ditto here.

David


Re: [PATCH net-next 6/9] net: Fix up inet_addr_type checks

2015-08-11 Thread David Ahern

On 8/11/15 12:14 PM, David Miller wrote:

> From: David Ahern d...@cumulusnetworks.com
> Date: Mon, 10 Aug 2015 11:50:33 -0600
>
>> @@ -427,6 +428,7 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
>> struct net *net = sock_net(sk);
>> unsigned short snum;
>> int chk_addr_ret;
>> +   int tb_id = 0;
>> int err;
>>
>> /* If the socket has its own bind function then use it. (RAW) */
>> @@ -448,7 +450,16 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
>> goto out;
>> }
>>
>> -   chk_addr_ret = inet_addr_type(net, addr->sin_addr.s_addr);
>> +   if (sk->sk_bound_dev_if) {
>> +   struct net_device *dev;
>> +
>> +   rcu_read_lock();
>> +   dev = dev_get_by_index_rcu(net, sk->sk_bound_dev_if);
>> +   if (dev)
>> +   tb_id = vrf_dev_table_rcu(dev);
>> +   rcu_read_unlock();
>> +   }
>> +   chk_addr_ret = inet_addr_type_table(net, addr->sin_addr.s_addr, tb_id);
>>
>> /* Not specified by any standard per-se, however it breaks too
>>  * many applications when removed.  It is unfortunate since
>
>   ...
>
>> diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
>> index b11321a8e58d..d84ae0e30369 100644
>> --- a/net/ipv4/fib_frontend.c
>> +++ b/net/ipv4/fib_frontend.c
>> @@ -226,6 +226,9 @@ static inline unsigned int __inet_dev_addr_type(struct net *net,
>>
>> rcu_read_lock();
>>
>> +   if (!tb_id)
>> +   tb_id = RT_TABLE_LOCAL;
>> +
>> table = fib_get_table(net, tb_id);
>
> All of this code that quietly translates table ID zero into RT_TABLE_LOCAL is
> confusing.
>
> It would be so much easier to understand if the code was structured like:
>
>     int tb_id = RT_TABLE_LOCAL;
>
>     if (doing_vrf_stuff)
>         tb_id = foo;


The intent here was to default to current behavior and to keep the 
details of that in one place. If you prefer table id to always enter 
with the right value I can make that happen.


David
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH net-next 08/11] net: Use passed in table for nexthop lookups

2015-08-13 Thread David Ahern
If a user passes in a table for new routes, use that table for nexthop
lookups. Specifically, this solves the case where a connected route exists
only in another table, not in the main table, and a subsequent route is
then added with a next hop that uses the connected route, e.g.:

$ ip route ls
default via 10.0.2.2 dev eth0
10.0.2.0/24 dev eth0  proto kernel  scope link  src 10.0.2.15
169.254.0.0/16 dev eth0  scope link  metric 1003
192.168.56.0/24 dev eth1  proto kernel  scope link  src 192.168.56.51

$ ip route ls table 10
1.1.1.0/24 dev eth2  scope link

Without this patch adding a nexthop route fails:

$ ip route add table 10 2.2.2.0/24 via 1.1.1.10
RTNETLINK answers: Network is unreachable

With this patch the route is added successfully.
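
The behaviour change can be sketched with a toy FIB that holds only the
connected routes from the listing. All toy_ names are hypothetical, addresses
are kept in host byte order for readability, and the fallback to the main
table is a simplification of the rule-based fib_lookup() path used when no
table is passed in or the table does not exist.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_RT_TABLE_MAIN 254

/* Hypothetical connected route: prefix/mask stored in one table */
struct toy_connected {
	int tb_id;
	uint32_t prefix;
	uint32_t mask;
};

static const struct toy_connected toy_fib[] = {
	{ TOY_RT_TABLE_MAIN, 0x0a000200u, 0xffffff00u }, /* 10.0.2.0/24 */
	{ TOY_RT_TABLE_MAIN, 0xa9fe0000u, 0xffff0000u }, /* 169.254.0.0/16 */
	{ TOY_RT_TABLE_MAIN, 0xc0a83800u, 0xffffff00u }, /* 192.168.56.0/24 */
	{ 10,                0x01010100u, 0xffffff00u }, /* 1.1.1.0/24 */
};

#define TOY_FIB_LEN (sizeof(toy_fib) / sizeof(toy_fib[0]))

static int toy_table_exists(int tb_id)
{
	size_t i;

	for (i = 0; i < TOY_FIB_LEN; i++)
		if (toy_fib[i].tb_id == tb_id)
			return 1;
	return 0;
}

static int toy_table_has_route(int tb_id, uint32_t gw)
{
	size_t i;

	for (i = 0; i < TOY_FIB_LEN; i++)
		if (toy_fib[i].tb_id == tb_id &&
		    (gw & toy_fib[i].mask) == toy_fib[i].prefix)
			return 1;
	return 0;
}

/* Returns 0 if the gateway is reachable, -1 ("Network is unreachable") if not */
static int toy_check_nh(uint32_t gw, int fc_table)
{
	int tb_id = TOY_RT_TABLE_MAIN;

	/* use the passed-in table when it is set and exists */
	if (fc_table && toy_table_exists(fc_table))
		tb_id = fc_table;

	return toy_table_has_route(tb_id, gw) ? 0 : -1;
}
```

Calling toy_check_nh() for 1.1.1.10 with table 10 succeeds, while the same
gateway checked without a table fails, matching the before and after
behaviour shown in the listing.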

Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 net/ipv4/fib_semantics.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 85e9a8abf15c..b7f1d20a9615 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -691,6 +691,7 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
}
rcu_read_lock();
{
+   struct fib_table *tbl = NULL;
struct flowi4 fl4 = {
.daddr = nh->nh_gw,
.flowi4_scope = cfg->fc_scope + 1,
@@ -701,8 +702,16 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
/* It is not necessary, but requires a bit of thinking */
if (fl4.flowi4_scope < RT_SCOPE_LINK)
fl4.flowi4_scope = RT_SCOPE_LINK;
-   err = fib_lookup(net, &fl4, &res,
-FIB_LOOKUP_IGNORE_LINKSTATE);
+
+   if (cfg->fc_table)
+   tbl = fib_get_table(net, cfg->fc_table);
+
+   if (tbl)
+   err = fib_table_lookup(tbl, &fl4, &res,
+  FIB_LOOKUP_IGNORE_LINKSTATE);
+   else
+   err = fib_lookup(net, &fl4, &res,
+FIB_LOOKUP_IGNORE_LINKSTATE);
if (err) {
rcu_read_unlock();
return err;
-- 
2.3.2 (Apple Git-55)



[PATCH net-next 01/11] net: Introduce VRF related flags and helpers

2015-08-13 Thread David Ahern
Add a VRF_MASTER flag for interfaces and helper functions for determining
if a device is a VRF_MASTER.

Add link attribute for passing VRF_TABLE id.

Add vrf_ptr to netdevice.

Add various macros for determining if a device is a VRF device, the index
of the master VRF device and table associated with VRF device.
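
The flag test behind the "is this a VRF device" helper can be sketched in
userspace as below. The toy_netdev type and TOY_IFF_VRF_MASTER are
hypothetical stand-ins; the sketch assumes the new flag occupies bit 25 of
priv_flags, the next free bit after the IPVLAN flags.

```c
#include <stdbool.h>

/* Assumed bit position of the new private flag (bit 25) */
#define TOY_IFF_VRF_MASTER (1u << 25)

/* Hypothetical stand-in for the relevant part of struct net_device */
struct toy_netdev {
	unsigned int priv_flags;
};

/* True when the device is a VRF master */
static bool toy_netif_is_vrf(const struct toy_netdev *dev)
{
	/* compare against 0 so the result is a clean boolean */
	return (dev->priv_flags & TOY_IFF_VRF_MASTER) != 0;
}
```

Other private flags do not affect the result, so a device carrying only,
say, bit 24 is not reported as a VRF master.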

Signed-off-by: Shrijeet Mukherjee s...@cumulusnetworks.com
Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 include/linux/netdevice.h|  20 +++
 include/net/vrf.h| 139 +++
 include/uapi/linux/if_link.h |   9 +++
 3 files changed, 168 insertions(+)
 create mode 100644 include/net/vrf.h

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 607b5f41f46f..f7a6ef2fae3a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1289,6 +1289,7 @@ enum netdev_priv_flags {
IFF_XMIT_DST_RELEASE_PERM   = 1<<22,
IFF_IPVLAN_MASTER   = 1<<23,
IFF_IPVLAN_SLAVE    = 1<<24,
+   IFF_VRF_MASTER  = 1<<25,
 };
 
 #define IFF_802_1Q_VLAN    IFF_802_1Q_VLAN
@@ -1316,6 +1317,7 @@ enum netdev_priv_flags {
 #define IFF_XMIT_DST_RELEASE_PERM  IFF_XMIT_DST_RELEASE_PERM
 #define IFF_IPVLAN_MASTER  IFF_IPVLAN_MASTER
 #define IFF_IPVLAN_SLAVE   IFF_IPVLAN_SLAVE
+#define IFF_VRF_MASTER IFF_VRF_MASTER
 
 /**
  * struct net_device - The DEVICE structure.
@@ -1432,6 +1434,7 @@ enum netdev_priv_flags {
  * @dn_ptr:    DECnet specific data
  * @ip6_ptr:   IPv6 specific data
  * @ax25_ptr:  AX.25 specific data
+ * @vrf_ptr:   VRF specific data
  * @ieee80211_ptr: IEEE 802.11 specific data, assign before registering
  *
  * @last_rx:   Time of last Rx
@@ -1650,6 +1653,7 @@ struct net_device {
struct dn_dev __rcu *dn_ptr;
struct inet6_dev __rcu  *ip6_ptr;
void    *ax25_ptr;
+   struct net_vrf_dev __rcu *vrf_ptr;
struct wireless_dev *ieee80211_ptr;
struct wpan_dev *ieee802154_ptr;
 #if IS_ENABLED(CONFIG_MPLS_ROUTING)
@@ -3808,6 +3812,22 @@ static inline bool netif_supports_nofcs(struct net_device *dev)
return dev->priv_flags & IFF_SUPP_NOFCS;
 }
 
+static inline bool netif_is_vrf(const struct net_device *dev)
+{
+   return dev->priv_flags & IFF_VRF_MASTER;
+}
+
+static inline bool netif_index_is_vrf(struct net *net, int ifindex)
+{
+   struct net_device *dev = dev_get_by_index_rcu(net, ifindex);
+   bool rc = false;
+
+   if (dev)
+   rc = netif_is_vrf(dev);
+
+   return rc;
+}
+
 /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
 static inline void netif_keep_dst(struct net_device *dev)
 {
diff --git a/include/net/vrf.h b/include/net/vrf.h
new file mode 100644
index 000000000000..0484d29d4589
--- /dev/null
+++ b/include/net/vrf.h
@@ -0,0 +1,139 @@
+/*
+ * include/net/net_vrf.h - adds vrf dev structure definitions
+ * Copyright (c) 2015 Cumulus Networks
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#ifndef __LINUX_NET_VRF_H
+#define __LINUX_NET_VRF_H
+
+struct net_vrf_dev {
+   struct rcu_head rcu;
+   int ifindex; /* ifindex of master dev */
+   u32 tb_id;   /* table id for VRF */
+};
+
+struct slave {
+   struct list_head    list;
+   struct net_device   *dev;
+};
+
+struct slave_queue {
+   struct list_head    all_slaves;
+   int num_slaves;
+};
+
+struct net_vrf {
+   struct slave_queue  queue;
+   struct rtable   *rth;
+   u32 tb_id;
+};
+
+
+#if IS_ENABLED(CONFIG_NET_VRF)
+/* called with rcu_read_lock() */
+static inline int vrf_master_ifindex_rcu(const struct net_device *dev)
+{
+   struct net_vrf_dev *vrf_ptr;
+   int ifindex = 0;
+
+   if (!dev)
+   return 0;
+
+   if (netif_is_vrf(dev))
+   ifindex = dev->ifindex;
+   else {
+   vrf_ptr = rcu_dereference(dev->vrf_ptr);
+   if (vrf_ptr)
+   ifindex = vrf_ptr->ifindex;
+   }
+
+   return ifindex;
+}
+
+/* called with rcu_read_lock */
+static inline int vrf_dev_table_rcu(const struct net_device *dev)
+{
+	int tb_id = 0;
+
+	if (dev) {
+		struct net_vrf_dev *vrf_ptr;
+
+		vrf_ptr = rcu_dereference(dev->vrf_ptr);
+		if (vrf_ptr)
+			tb_id = vrf_ptr->tb_id;
+	}
+	return tb_id;
+}
+
+static inline int vrf_dev_table(const struct net_device *dev)
+{
+   int tb_id;
+
+   rcu_read_lock();
+   tb_id = vrf_dev_table_rcu(dev

[PATCH net-next 05/11] net: Add inet_addr lookup by table

2015-08-13 Thread David Ahern
Currently inet_addr_type and inet_dev_addr_type expect local addresses
to be in the local table. With the VRF device, local routes for devices
associated with a VRF will be in the table associated with that VRF.
Provide an alternate inet_addr lookup to use a specific table rather
than defaulting to the local table.
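
The lookup-by-table idea can be sketched in user space as follows. This is
only an illustration, not kernel code: toy_fib, toy_addr_type_table() and
toy_addr_type() are hypothetical stand-ins for the FIB and for
inet_addr_type_table()/inet_addr_type(), and the sketch skips the
broadcast and multicast checks the real function performs. Only the
RT_TABLE_LOCAL and RTN_* values mirror the kernel's definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Values mirror the kernel's uapi definitions. */
enum { RTN_UNICAST = 1, RTN_LOCAL = 2 };
enum { RT_TABLE_LOCAL = 255 };

/* Hypothetical toy FIB entry: an address that is local in one table. */
struct toy_route {
	int      tb_id;
	uint32_t addr;
};

static const struct toy_route toy_fib[] = {
	{ RT_TABLE_LOCAL, 0x0a000001 },	/* 10.0.0.1 local in the local table */
	{ 5,              0x0a000001 },	/* same address, local in VRF table 5 */
	{ 5,              0x0a000002 },	/* 10.0.0.2 local only in VRF table 5 */
};

/* Look the address up in one specific table. */
unsigned int toy_addr_type_table(uint32_t addr, int tb_id)
{
	size_t i;

	for (i = 0; i < sizeof(toy_fib) / sizeof(toy_fib[0]); i++) {
		if (toy_fib[i].tb_id == tb_id && toy_fib[i].addr == addr)
			return RTN_LOCAL;
	}
	return RTN_UNICAST;
}

/* Pre-existing behaviour: always consult the local table. */
unsigned int toy_addr_type(uint32_t addr)
{
	return toy_addr_type_table(addr, RT_TABLE_LOCAL);
}
```

With this toy data, 10.0.0.2 is reported as local only when table 5 is
named explicitly, which is the case the new helper is meant to cover.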

Signed-off-by: Shrijeet Mukherjee s...@cumulusnetworks.com
Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 include/net/route.h |  1 +
 net/ipv4/fib_frontend.c | 22 +++---
 2 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/include/net/route.h b/include/net/route.h
index 94189d4bd899..6ba681f0b98d 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -189,6 +189,7 @@ void ipv4_sk_redirect(struct sk_buff *skb, struct sock *sk);
 void ip_rt_send_redirect(struct sk_buff *skb);
 
 unsigned int inet_addr_type(struct net *net, __be32 addr);
+unsigned int inet_addr_type_table(struct net *net, __be32 addr, int tb_id);
 unsigned int inet_dev_addr_type(struct net *net, const struct net_device *dev,
__be32 addr);
 void ip_rt_multicast_event(struct in_device *);
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index d8ced1d89f1b..b11321a8e58d 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -212,12 +212,12 @@ void fib_flush_external(struct net *net)
  */
 static inline unsigned int __inet_dev_addr_type(struct net *net,
const struct net_device *dev,
-   __be32 addr)
+   __be32 addr, int tb_id)
 {
struct flowi4   fl4 = { .daddr = addr };
struct fib_result   res;
unsigned int ret = RTN_BROADCAST;
-   struct fib_table *local_table;
+   struct fib_table *table;
 
if (ipv4_is_zeronet(addr) || ipv4_is_lbcast(addr))
return RTN_BROADCAST;
@@ -226,10 +226,10 @@ static inline unsigned int __inet_dev_addr_type(struct net *net,
 
 	rcu_read_lock();
 
-	local_table = fib_get_table(net, RT_TABLE_LOCAL);
-	if (local_table) {
+	table = fib_get_table(net, tb_id);
+	if (table) {
 		ret = RTN_UNICAST;
-		if (!fib_table_lookup(local_table, &fl4, &res, FIB_LOOKUP_NOREF)) {
+		if (!fib_table_lookup(table, &fl4, &res, FIB_LOOKUP_NOREF)) {
 			if (!dev || dev == res.fi->fib_dev)
 				ret = res.type;
 		}
@@ -239,16 +239,24 @@ static inline unsigned int __inet_dev_addr_type(struct net *net,
return ret;
 }
 
+unsigned int inet_addr_type_table(struct net *net, __be32 addr, int tb_id)
+{
+   return __inet_dev_addr_type(net, NULL, addr, tb_id);
+}
+EXPORT_SYMBOL(inet_addr_type_table);
+
 unsigned int inet_addr_type(struct net *net, __be32 addr)
 {
-   return __inet_dev_addr_type(net, NULL, addr);
+   return __inet_dev_addr_type(net, NULL, addr, RT_TABLE_LOCAL);
 }
 EXPORT_SYMBOL(inet_addr_type);
 
 unsigned int inet_dev_addr_type(struct net *net, const struct net_device *dev,
__be32 addr)
 {
-   return __inet_dev_addr_type(net, dev, addr);
+   int rt_table = vrf_dev_table(dev) ? : RT_TABLE_LOCAL;
+
+   return __inet_dev_addr_type(net, dev, addr, rt_table);
 }
 EXPORT_SYMBOL(inet_dev_addr_type);
 
-- 
2.3.2 (Apple Git-55)



[PATCH net-next 00/10] VRF-lite - v6

2015-08-13 Thread David Ahern
 comments from DaveM

- added patch to properly set oif in ip_send_unicast_reply. Needs to be
  set to VRF device for proper FIB lookup

- added patch to handle IP fragments


Version 5
- dropped patch regarding socket lookups; no longer needed
  + removed vrf helpers no longer needed after this patch is dropped
- removed dev_open and close operations
  + no need to reset vrf data on an ifdown, and doing so creates problems
    if a slave is deleted while the vrf interface is down (Thanks, Nikolay)
- cleanups for sparse warnings
  + make C=2 is now clean for vrf driver


Version 4
- builds are clean with and without VRF device enabled (no, yes and module)
- tightened the driver implementation
  + device add/delete, slave add/remove, and module unload are all clean
- fixed RCU references
  + with RCU and lock debugging enabled changes are clean through the
suite of tests
- TX path uses custom dst, so patch refactoring rtable allocation is
  dropped along with the patch adding rt_nexthop helper
- dropped the task patch that adds default bind to interface for sockets
  and the associated chvrf example command
  + the patches are a convenience for running unmodified code. They
are not needed for the core functionality. Any application with
support for SO_BINDTODEVICE works properly with this patch set.

Version 3
- addressed comments from first 2 RFCs with the exception of the name
  Nicolas: We will do the name conversion once we agree on what the
   correct name should be (vrf, mrf or something else)

-  packets flow through the VRF device in both directions allowing the
   following:
   - tcpdump -i vrfn
   - tc rules on vrf device
   - netfilter rules on vrf device


TO-DO
=====
1. IPv6

2. ipsec, xfrms
   - dst patch accepted into ipsec-next; will post VRF patch once merge happens

3. listen filter to allow 1 socket to work with multiple VRF devices
   - i.e., bind to VRFs a, b, c only or NOT VRFs e, f, g


Eric B:
  I have ipsec working with VRFs implemented using the VRF driver,
  including the worst case scenario of complete duplication in the
  networking config.


Thanks to Nikolay for his many, many code reviews whipping the device
driver into shape, and to Hannes, Roopa Prabhu, Jon Toppins, and Jamal
for bug fixes and ideas.

Patches can also be pulled from:
https://github.com/dsahern/linux.git, vrf-dev-v6 branch
https://github.com/dsahern/iproute2,  vrf-dev-v6 branch


David Ahern (11):
  net: Introduce VRF related flags and helpers
  net: Use VRF device index for lookups on RX
  net: Use VRF device index for lookups on TX
  udp: Handle VRF device in sendmsg
  net: Add inet_addr lookup by table
  net: Fix up inet_addr_type checks
  net: Add routes to the table associated with the device
  net: Use passed in table for nexthop lookups
  net: Use VRF index for oif in ip_send_unicast_reply
  net: frags: Add VRF device index to cache and lookup
  net: Introduce VRF device driver

 drivers/net/Kconfig  |   7 +
 drivers/net/Makefile |   1 +
 drivers/net/vrf.c| 685 +++
 include/linux/netdevice.h|  20 ++
 include/net/flow.h   |   1 +
 include/net/route.h  |   7 +
 include/net/vrf.h| 139 +
 include/uapi/linux/if_link.h |   9 +
 net/ipv4/af_inet.c   |  13 +-
 net/ipv4/arp.c   |  15 +-
 net/ipv4/fib_frontend.c  |  63 +++-
 net/ipv4/fib_semantics.c |  44 ++-
 net/ipv4/fib_trie.c  |   7 +-
 net/ipv4/icmp.c  |   9 +-
 net/ipv4/ip_fragment.c   |  18 +-
 net/ipv4/ip_output.c |   7 +-
 net/ipv4/route.c |   8 +-
 net/ipv4/udp.c   |  22 +-
 18 files changed, 1031 insertions(+), 44 deletions(-)
 create mode 100644 drivers/net/vrf.c
 create mode 100644 include/net/vrf.h

-- 
2.3.2 (Apple Git-55)



[PATCH net-next 09/11] net: Use VRF index for oif in ip_send_unicast_reply

2015-08-13 Thread David Ahern
If the output device is not specified, use the VRF device if the input
device is enslaved. This is needed to ensure TCP ACKs and resets go out
the VRF device.
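
The selection can be sketched in isolation as below. toy_reply_oif() is a
hypothetical user-space helper, and the iif_is_vrf flag stands in for the
result of netif_index_is_vrf() on the packet's input index; it is a
sketch of the decision only, not the kernel function.

```c
#include <stdbool.h>

/*
 * Hypothetical helper modelling the oif choice for a unicast reply:
 * keep an explicitly bound output device, otherwise fall back to the
 * ingress index when that index refers to a VRF device.
 */
int toy_reply_oif(int bound_dev_if, int skb_iif, bool iif_is_vrf)
{
	int oif = bound_dev_if;

	if (!oif && iif_is_vrf)
		oif = skb_iif;

	return oif;
}
```

An explicit binding always wins; the VRF index is used only when no output
device was requested.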

Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 net/ipv4/ip_output.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 6bf89a6312bc..0138fada0951 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1542,6 +1542,7 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb,
 	struct net *net = sock_net(sk);
 	struct sk_buff *nskb;
 	int err;
+	int oif;
 
 	if (__ip_options_echo(&replyopts.opt.opt, skb, sopt))
 		return;
@@ -1559,7 +1560,11 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb,
 			daddr = replyopts.opt.opt.faddr;
 	}
 
-	flowi4_init_output(&fl4, arg->bound_dev_if,
+	oif = arg->bound_dev_if;
+	if (!oif && netif_index_is_vrf(net, skb->skb_iif))
+		oif = skb->skb_iif;
+
+	flowi4_init_output(&fl4, oif,
 			   IP4_REPLY_MARK(net, skb->mark),
 			   RT_TOS(arg->tos),
 			   RT_SCOPE_UNIVERSE, ip_hdr(skb)->protocol,
-- 
2.3.2 (Apple Git-55)



[PATCH net-next] iproute2: Add support for VRF device

2015-08-13 Thread David Ahern
Allow user to create a vrf device and specify its table binding.
Based on the iplink_vlan implementation.
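
The patch parses the table id with atoi(), which yields 0 for input such
as "abc". One possible tightening, shown purely as a sketch and not part
of this patch, is a strtoul()-based check. toy_parse_table_id() is a
hypothetical name; the 0-32767 range is taken from the error message in
the patch.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical stricter parser for the "table" argument. Returns 0 and
 * stores the id on success, -1 if the string is not a plain decimal
 * number in the range 0-32767.
 */
int toy_parse_table_id(const char *arg, unsigned int *table)
{
	char *end;
	unsigned long val;

	if (!arg || *arg < '0' || *arg > '9')
		return -1;		/* rejects "", "-1", "+5", " 5" */

	errno = 0;
	val = strtoul(arg, &end, 10);
	if (errno || *end != '\0' || val > 32767)
		return -1;		/* rejects "12abc", "40000", overflow */

	*table = (unsigned int)val;
	return 0;
}
```

Checking the first character before calling strtoul() matters because
strtoul() itself accepts leading whitespace and a sign.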

Signed-off-by: Shrijeet Mukherjee s...@cumulusnetworks.com
Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 include/linux/if_link.h |  8 +
 ip/Makefile |  2 +-
 ip/iplink.c |  2 +-
 ip/iplink_vrf.c | 86 +
 4 files changed, 96 insertions(+), 2 deletions(-)
 create mode 100644 ip/iplink_vrf.c

diff --git a/include/linux/if_link.h b/include/linux/if_link.h
index 4f0a558e8fcf..c8b569a79e80 100644
--- a/include/linux/if_link.h
+++ b/include/linux/if_link.h
@@ -339,6 +339,14 @@ enum macvlan_macaddr_mode {
 
 #define MACVLAN_FLAG_NOPROMISC 1
 
+/* VRF section */
+enum {
+   IFLA_VRF_UNSPEC,
+   IFLA_VRF_TABLE,
+   __IFLA_VRF_MAX
+};
+
+#define IFLA_VRF_MAX (__IFLA_VRF_MAX - 1)
 /* IPVLAN section */
 enum {
IFLA_IPVLAN_UNSPEC,
diff --git a/ip/Makefile b/ip/Makefile
index 77653ecc5785..d8b38ac2e44b 100644
--- a/ip/Makefile
+++ b/ip/Makefile
@@ -7,7 +7,7 @@ IPOBJ=ip.o ipaddress.o ipaddrlabel.o iproute.o iprule.o ipnetns.o \
 iplink_vxlan.o tcp_metrics.o iplink_ipoib.o ipnetconf.o link_ip6tnl.o \
 link_iptnl.o link_gre6.o iplink_bond.o iplink_bond_slave.o iplink_hsr.o \
 iplink_bridge.o iplink_bridge_slave.o ipfou.o iplink_ipvlan.o \
-iplink_geneve.o
+iplink_geneve.o iplink_vrf.o
 
 RTMONOBJ=rtmon.o
 
diff --git a/ip/iplink.c b/ip/iplink.c
index edee0f7a3b07..e2183e89a7f6 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -94,7 +94,7 @@ void iplink_usage(void)
 		fprintf(stderr, "TYPE := { vlan | veth | vcan | dummy | ifb | macvlan | macvtap |\n");
 		fprintf(stderr, "          bridge | bond | ipoib | ip6tnl | ipip | sit | vxlan |\n");
 		fprintf(stderr, "          gre | gretap | ip6gre | ip6gretap | vti | nlmon |\n");
-		fprintf(stderr, "          bond_slave | ipvlan | geneve | bridge_slave }\n");
+		fprintf(stderr, "          bond_slave | ipvlan | geneve | bridge_slave | vrf }\n");
 	}
 	exit(-1);
 }
diff --git a/ip/iplink_vrf.c b/ip/iplink_vrf.c
new file mode 100644
index ..913a2892c95b
--- /dev/null
+++ b/ip/iplink_vrf.c
@@ -0,0 +1,86 @@
+/* iplink_vrf.c	VRF device support
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License
+ *  as published by the Free Software Foundation; either version
+ *  2 of the License, or (at your option) any later version.
+ *
+ * Authors: Shrijeet Mukherjee s...@cumulusnetworks.com
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/socket.h>
+#include <linux/if_link.h>
+
+#include "rt_names.h"
+#include "utils.h"
+#include "ip_common.h"
+
+static void vrf_explain(FILE *f)
+{
+	fprintf(f, "Usage: ... vrf table TABLEID \n");
+}
+
+static void explain(void)
+{
+   vrf_explain(stderr);
+}
+
+static int table_arg(void)
+{
+	fprintf(stderr,"Error: argument of \"table\" must be 0-32767 and currently unused\n");
+	return -1;
+}
+
+static int vrf_parse_opt(struct link_util *lu, int argc, char **argv,
+			    struct nlmsghdr *n)
+{
+	while (argc > 0) {
+		if (matches(*argv, "table") == 0) {
+			__u32 table;
+
+			NEXT_ARG();
+
+			table = atoi(*argv);
+			if (table > 32767)
+				return table_arg();
+			addattr32(n, 1024, IFLA_VRF_TABLE, table);
+		} else if (matches(*argv, "help") == 0) {
+			explain();
+			return -1;
+		} else {
+			fprintf(stderr, "vrf: unknown option \"%s\"?\n",
+				*argv);
+			explain();
+			return -1;
+		}
+		argc--, argv++;
+	}
+
+	return 0;
+}
+
+static void vrf_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
+{
+	if (!tb)
+		return;
+
+	if (tb[IFLA_VRF_TABLE])
+		fprintf(f, "table %u ", rta_getattr_u32(tb[IFLA_VRF_TABLE]));
+}
+
+static void vrf_print_help(struct link_util *lu, int argc, char **argv,
+			      FILE *f)
+{
+	vrf_explain(f);
+}
+
+struct link_util vrf_link_util = {
+	.id		= "vrf",
+	.maxattr	= IFLA_VRF_MAX,
+	.parse_opt	= vrf_parse_opt,
+	.print_opt	= vrf_print_opt,
+	.print_help	= vrf_print_help,
+};
-- 
2.3.2 (Apple Git-55)



[PATCH net-next 10/11] net: frags: Add VRF device index to cache and lookup

2015-08-13 Thread David Ahern
Fragmentation cache uses information from the IP header to reassemble
packets. That information can be duplicated across VRFs -- same source
and destination addresses, protocol and id. Handle fragmentation with
VRFs by adding the VRF device index to entries in the cache and the
lookup arg.
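
A minimal sketch of the extended match, assuming a simplified key: struct
toy_frag_key and toy_frag_match() are hypothetical user-space stand-ins
for struct ipq and ip4_frag_match(), with field names borrowed from the
patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified reassembly key; field names follow the patch. */
struct toy_frag_key {
	uint16_t id;
	uint32_t saddr;
	uint32_t daddr;
	uint8_t  protocol;
	uint32_t user;
	int      vif;	/* VRF device index, 0 when not in a VRF */
};

/* Two fragments belong to the same datagram only if every field matches. */
bool toy_frag_match(const struct toy_frag_key *a, const struct toy_frag_key *b)
{
	return a->id == b->id &&
	       a->saddr == b->saddr &&
	       a->daddr == b->daddr &&
	       a->protocol == b->protocol &&
	       a->user == b->user &&
	       a->vif == b->vif;
}
```

Two fragments with identical IP header fields but different vif values no
longer land in the same reassembly queue.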

Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 net/ipv4/ip_fragment.c | 18 +-
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index d96722ae8979..15762e758861 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -48,6 +48,7 @@
 #include <linux/inet.h>
 #include <linux/netfilter_ipv4.h>
 #include <net/inet_ecn.h>
+#include <net/vrf.h>
 
 /* NOTE. Logic of IP defragmentation is parallel to corresponding IPv6
  * code now. If you change something here, _PLEASE_ update ipv6/reassembly.c
@@ -77,6 +78,7 @@ struct ipq {
 	u8		ecn; /* RFC3168 support */
 	u16		max_df_size; /* largest frag with DF set seen */
 	int		iif;
+	int		vif;   /* VRF device index */
 	unsigned int	rid;
 	struct inet_peer *peer;
 };
@@ -99,6 +101,7 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
 struct ip4_create_arg {
 	struct iphdr *iph;
 	u32 user;
+	int vif;
 };
 
 static unsigned int ipqhashfn(__be16 id, __be32 saddr, __be32 daddr, u8 prot)
@@ -127,7 +130,8 @@ static bool ip4_frag_match(const struct inet_frag_queue *q, const void *a)
 		qp->saddr == arg->iph->saddr &&
 		qp->daddr == arg->iph->daddr &&
 		qp->protocol == arg->iph->protocol &&
-		qp->user == arg->user;
+		qp->user == arg->user &&
+		qp->vif == arg->vif;
 }
 
 static void ip4_frag_init(struct inet_frag_queue *q, const void *a)
@@ -144,6 +148,7 @@ static void ip4_frag_init(struct inet_frag_queue *q, const void *a)
 	qp->ecn = ip4_frag_ecn(arg->iph->tos);
 	qp->saddr = arg->iph->saddr;
 	qp->daddr = arg->iph->daddr;
+	qp->vif = arg->vif;
 	qp->user = arg->user;
 	qp->peer = sysctl_ipfrag_max_dist ?
 		inet_getpeer_v4(net->ipv4.peers, arg->iph->saddr, 1) : NULL;
@@ -244,7 +249,8 @@ static void ip_expire(unsigned long arg)
 /* Find the correct entry in the incomplete datagrams queue for
  * this IP datagram, and create new one, if nothing is found.
  */
-static struct ipq *ip_find(struct net *net, struct iphdr *iph, u32 user)
+static struct ipq *ip_find(struct net *net, struct iphdr *iph,
+  u32 user, int vif)
 {
struct inet_frag_queue *q;
struct ip4_create_arg arg;
@@ -252,6 +258,7 @@ static struct ipq *ip_find(struct net *net, struct iphdr *iph, u32 user)
 
 	arg.iph = iph;
 	arg.user = user;
+	arg.vif = vif;
 
 	hash = ipqhashfn(iph->id, iph->saddr, iph->daddr, iph->protocol);
 
@@ -648,14 +655,15 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
 /* Process an incoming IP datagram fragment. */
 int ip_defrag(struct sk_buff *skb, u32 user)
 {
+	struct net_device *dev = skb->dev ? : skb_dst(skb)->dev;
+	int vif = vrf_master_ifindex_rcu(dev);
+	struct net *net = dev_net(dev);
 	struct ipq *qp;
-	struct net *net;
 
-	net = skb->dev ? dev_net(skb->dev) : dev_net(skb_dst(skb)->dev);
 	IP_INC_STATS_BH(net, IPSTATS_MIB_REASMREQDS);
 
 	/* Lookup (or create) queue header */
-	qp = ip_find(net, ip_hdr(skb), user);
+	qp = ip_find(net, ip_hdr(skb), user, vif);
 	if (qp) {
 		int ret;
 
-- 
2.3.2 (Apple Git-55)



[PATCH net-next 02/11] net: Use VRF device index for lookups on RX

2015-08-13 Thread David Ahern
On ingress, use the index of the VRF master device for route lookups if
the real device is enslaved. Rules are expected to be installed for the
VRF device to direct lookups to a specific table.
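
The index substitution can be sketched in user space as follows. struct
toy_dev and both helpers are hypothetical; the kernel reaches the master
through dev->vrf_ptr under RCU, which this sketch replaces with a plain
pointer. toy_lookup_iif() assumes a non-NULL device.

```c
#include <stddef.h>

/* Hypothetical toy device; the kernel tracks this via dev->vrf_ptr. */
struct toy_dev {
	int ifindex;
	int is_vrf_master;		/* non-zero for the VRF device itself */
	const struct toy_dev *master;	/* VRF master if enslaved, else NULL */
};

/* Index of the VRF master for this device, or 0 if it is not in a VRF. */
int toy_vrf_master_ifindex(const struct toy_dev *dev)
{
	if (!dev)
		return 0;
	if (dev->is_vrf_master)
		return dev->ifindex;
	return dev->master ? dev->master->ifindex : 0;
}

/* iif for the route lookup: the VRF master if there is one, else the device. */
int toy_lookup_iif(const struct toy_dev *dev)
{
	int vrf = toy_vrf_master_ifindex(dev);

	return vrf ? vrf : dev->ifindex;
}
```

An enslaved interface therefore presents its master's index to the rule
lookup, so a single "iif" rule on the VRF device covers every slave.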

Signed-off-by: Shrijeet Mukherjee s...@cumulusnetworks.com
Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 net/ipv4/fib_frontend.c | 8 +++-
 net/ipv4/route.c| 3 ++-
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 6b98de0d7949..d8ced1d89f1b 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -45,6 +45,7 @@
 #include <net/ip_fib.h>
 #include <net/rtnetlink.h>
 #include <net/xfrm.h>
+#include <net/vrf.h>
 
 #ifndef CONFIG_IP_MULTIPLE_TABLES
 
@@ -309,7 +310,9 @@ static int __fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst,
bool dev_match;
 
fl4.flowi4_oif = 0;
-   fl4.flowi4_iif = oif ? : LOOPBACK_IFINDEX;
+   fl4.flowi4_iif = vrf_master_ifindex_rcu(dev);
+   if (!fl4.flowi4_iif)
+   fl4.flowi4_iif = oif ? : LOOPBACK_IFINDEX;
fl4.daddr = src;
fl4.saddr = dst;
fl4.flowi4_tos = tos;
@@ -339,6 +342,9 @@ static int __fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst,
 		if (nh->nh_dev == dev) {
 			dev_match = true;
 			break;
+		} else if (vrf_master_ifindex_rcu(nh->nh_dev) == dev->ifindex) {
+			dev_match = true;
+			break;
 		}
 	}
 #else
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 18fd7c9095c7..c26ff1f7067d 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -112,6 +112,7 @@
 #endif
 #include <net/secure_seq.h>
 #include <net/ip_tunnels.h>
+#include <net/vrf.h>
 
 #define RT_FL_TOS(oldflp4) \
 	((oldflp4)->flowi4_tos & (IPTOS_RT_MASK | RTO_ONLINK))
@@ -1726,7 +1727,7 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 	 *	Now we are ready to route packet.
 	 */
 	fl4.flowi4_oif = 0;
-	fl4.flowi4_iif = dev->ifindex;
+	fl4.flowi4_iif = vrf_master_ifindex_rcu(dev) ? : dev->ifindex;
 	fl4.flowi4_mark = skb->mark;
 	fl4.flowi4_tos = tos;
 	fl4.flowi4_scope = RT_SCOPE_UNIVERSE;
-- 
2.3.2 (Apple Git-55)



[PATCH net-next 04/11] udp: Handle VRF device in sendmsg

2015-08-13 Thread David Ahern
For unconnected UDP sockets using a VRF device, look up the source address
based on the VRF table. This allows the UDP header to be properly set up
before the packet shows up at the VRF device via the dst.

Signed-off-by: Shrijeet Mukherjee s...@cumulusnetworks.com
Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 net/ipv4/udp.c | 22 +-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 83aa604f9273..7af5052e3b1f 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1013,11 +1013,31 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 
 	if (!rt) {
 		struct net *net = sock_net(sk);
+		__u8 flow_flags = inet_sk_flowi_flags(sk);
 
 		fl4 = &fl4_stack;
+
+		/* unconnected socket. If output device is enslaved to a VRF
+		 * device lookup source address from VRF table. This mimics
+		 * behavior of ip_route_connect{_init}.
+		 */
+		if (netif_index_is_vrf(net, ipc.oif)) {
+			flowi4_init_output(fl4, ipc.oif, sk->sk_mark, tos,
+					   RT_SCOPE_UNIVERSE, sk->sk_protocol,
+					   (flow_flags | FLOWI_FLAG_VRFSRC),
+					   faddr, saddr, dport,
+					   inet->inet_sport);
+
+			rt = ip_route_output_flow(net, fl4, sk);
+			if (!IS_ERR(rt)) {
+				saddr = fl4->saddr;
+				ip_rt_put(rt);
+			}
+		}
+
 		flowi4_init_output(fl4, ipc.oif, sk->sk_mark, tos,
 				   RT_SCOPE_UNIVERSE, sk->sk_protocol,
-				   inet_sk_flowi_flags(sk),
+				   flow_flags,
 				   faddr, saddr, dport, inet->inet_sport);
 
 		security_sk_classify_flow(sk, flowi4_to_flowi(fl4));
-- 
2.3.2 (Apple Git-55)



[PATCH net-next 03/11] net: Use VRF device index for lookups on TX

2015-08-13 Thread David Ahern
As with ingress, use the index of the VRF master device for route lookups
on egress. However, the oif should only be used to direct the lookups to a
specific table. Routes in the table are not based on the VRF device but
rather on the interfaces that are part of the VRF, so do not consider the
oif for lookups within the table. The FLOWI_FLAG_VRFSRC flag is used to
control this latter part.
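
The effect of the flag on nexthop selection can be sketched as a small
predicate. toy_skip_nexthop() is a hypothetical user-space helper that
models only the oif comparison in fib_table_lookup(); the flag value is
the one defined in this patch.

```c
#include <stdbool.h>

#define FLOWI_FLAG_VRFSRC 0x04	/* value taken from the patch */

/*
 * Hypothetical predicate modelling the oif filter applied to each
 * nexthop: returns true when the nexthop should be skipped.
 */
bool toy_skip_nexthop(unsigned int flow_flags, int flow_oif, int nh_oif)
{
	if (flow_flags & FLOWI_FLAG_VRFSRC)
		return false;	/* oif only selected the table; ignore it here */

	return flow_oif && flow_oif != nh_oif;
}
```

Without the flag, a flow whose oif is the VRF device would reject every
nexthop in the table, since those nexthops point at the enslaved
interfaces rather than at the VRF device itself.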

Signed-off-by: Shrijeet Mukherjee s...@cumulusnetworks.com
Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 include/net/flow.h  | 1 +
 include/net/route.h | 3 +++
 net/ipv4/fib_trie.c | 7 +--
 net/ipv4/icmp.c | 4 
 net/ipv4/route.c| 5 +
 5 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/net/flow.h b/include/net/flow.h
index 3098ae33a178..f305588fc162 100644
--- a/include/net/flow.h
+++ b/include/net/flow.h
@@ -33,6 +33,7 @@ struct flowi_common {
 	__u8	flowic_flags;
 #define FLOWI_FLAG_ANYSRC		0x01
 #define FLOWI_FLAG_KNOWN_NH		0x02
+#define FLOWI_FLAG_VRFSRC		0x04
 	__u32	flowic_secid;
 	struct flowi_tunnel flowic_tun_key;
 };
diff --git a/include/net/route.h b/include/net/route.h
index 2d45f419477f..94189d4bd899 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -251,6 +251,9 @@ static inline void ip_route_connect_init(struct flowi4 *fl4, __be32 dst, __be32
 	if (inet_sk(sk)->transparent)
 		flow_flags |= FLOWI_FLAG_ANYSRC;
 
+	if (netif_index_is_vrf(sock_net(sk), oif))
+		flow_flags |= FLOWI_FLAG_VRFSRC;
+
 	flowi4_init_output(fl4, oif, sk->sk_mark, tos, RT_SCOPE_UNIVERSE,
 			   protocol, flow_flags, dst, src, dport, sport);
 }
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 37c4bb89a708..1243c79cb5b0 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -1423,8 +1423,11 @@ int fib_table_lookup(struct fib_table *tb, const struct flowi4 *flp,
 			    nh->nh_flags & RTNH_F_LINKDOWN &&
 			    !(fib_flags & FIB_LOOKUP_IGNORE_LINKSTATE))
 				continue;
-			if (flp->flowi4_oif && flp->flowi4_oif != nh->nh_oif)
-				continue;
+			if (!(flp->flowi4_flags & FLOWI_FLAG_VRFSRC)) {
+				if (flp->flowi4_oif &&
+				    flp->flowi4_oif != nh->nh_oif)
+					continue;
+			}
 
 			if (!(fib_flags & FIB_LOOKUP_NOREF))
 				atomic_inc(&fi->fib_clntref);
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index c0556f1e4bf0..1164fc4ce3bc 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -96,6 +96,7 @@
 #include <net/xfrm.h>
 #include <net/inet_common.h>
 #include <net/ip_fib.h>
+#include <net/vrf.h>
 
 /*
  * Build xmit assembly blocks
@@ -425,6 +426,7 @@ static void icmp_reply(struct icmp_bxm *icmp_param, struct sk_buff *skb)
 	fl4.flowi4_mark = mark;
 	fl4.flowi4_tos = RT_TOS(ip_hdr(skb)->tos);
 	fl4.flowi4_proto = IPPROTO_ICMP;
+	fl4.flowi4_oif = vrf_master_ifindex_rcu(skb->dev) ? : skb->dev->ifindex;
 	security_skb_classify_flow(skb, flowi4_to_flowi(&fl4));
 	rt = ip_route_output_key(net, &fl4);
 	if (IS_ERR(rt))
@@ -458,6 +460,8 @@ static struct rtable *icmp_route_lookup(struct net *net,
 	fl4->flowi4_proto = IPPROTO_ICMP;
 	fl4->fl4_icmp_type = type;
 	fl4->fl4_icmp_code = code;
+	fl4->flowi4_oif = vrf_master_ifindex_rcu(skb_in->dev) ? : skb_in->dev->ifindex;
+
 	security_skb_classify_flow(skb_in, flowi4_to_flowi(fl4));
 	rt = __ip_route_output_key(net, fl4);
 	if (IS_ERR(rt))
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index c26ff1f7067d..2c89d294b669 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2131,6 +2131,11 @@ struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4)
 				fl4->saddr = inet_select_addr(dev_out, 0,
 							      RT_SCOPE_HOST);
 		}
+		if (netif_is_vrf(dev_out) &&
+		    !(fl4->flowi4_flags & FLOWI_FLAG_VRFSRC)) {
+			rth = vrf_dev_get_rth(dev_out);
+			goto out;
+		}
 	}
 
 	if (!fl4->daddr) {
-- 
2.3.2 (Apple Git-55)



[PATCH net-next 07/11] net: Add routes to the table associated with the device

2015-08-13 Thread David Ahern
When a device associated with a VRF is brought up or down, routes
should be added to or removed from the table associated with the VRF.
fib_magic defaults to using the main or local tables. Have it use
the table associated with the device if there is one.

A part of this is directing prefsrc validations to the correct
table as well.
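
The table choice can be sketched on its own as below. toy_fib_magic_table()
is a hypothetical user-space helper; vrf_tb_id stands in for the value the
patch obtains from vrf_dev_table_rtnl(), and the RT_TABLE_* and RTN_*
values mirror the kernel's definitions.

```c
/* Values mirror the kernel's uapi definitions. */
enum { RTN_UNICAST = 1, RTN_LOCAL = 2 };
enum { RT_TABLE_MAIN = 254, RT_TABLE_LOCAL = 255 };

/*
 * Hypothetical helper modelling the table choice for automatically
 * managed routes: vrf_tb_id is the table of the device's VRF, or 0 if
 * the device is not in a VRF.
 */
int toy_fib_magic_table(int vrf_tb_id, int type)
{
	if (vrf_tb_id)
		return vrf_tb_id;

	return type == RTN_UNICAST ? RT_TABLE_MAIN : RT_TABLE_LOCAL;
}
```

For a device in a VRF, both the connected (unicast) and the local routes
end up in the single VRF table instead of being split across main and
local.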

Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 net/ipv4/fib_frontend.c  |  8 
 net/ipv4/fib_semantics.c | 25 +++--
 2 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index c55723ec4c3e..7fa277176c33 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -800,6 +800,7 @@ static int inet_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
 static void fib_magic(int cmd, int type, __be32 dst, int dst_len, struct in_ifaddr *ifa)
 {
 	struct net *net = dev_net(ifa->ifa_dev->dev);
+	int tb_id = vrf_dev_table_rtnl(ifa->ifa_dev->dev);
 	struct fib_table *tb;
 	struct fib_config cfg = {
 		.fc_protocol = RTPROT_KERNEL,
@@ -814,11 +815,10 @@ static void fib_magic(int cmd, int type, __be32 dst, int dst_len, struct in_ifad
 		},
 	};
 
-   if (type == RTN_UNICAST)
-   tb = fib_new_table(net, RT_TABLE_MAIN);
-   else
-   tb = fib_new_table(net, RT_TABLE_LOCAL);
+   if (!tb_id)
+   tb_id = (type == RTN_UNICAST) ? RT_TABLE_MAIN : RT_TABLE_LOCAL;
 
+   tb = fib_new_table(net, tb_id);
if (!tb)
return;
 
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 410ddb67221e..85e9a8abf15c 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -838,6 +838,23 @@ __be32 fib_info_update_nh_saddr(struct net *net, struct fib_nh *nh)
 	return nh->nh_saddr;
 }
 
+static bool fib_valid_prefsrc(struct fib_config *cfg, __be32 fib_prefsrc)
+{
+	if (cfg->fc_type != RTN_LOCAL || !cfg->fc_dst ||
+	    fib_prefsrc != cfg->fc_dst) {
+		int tb_id = cfg->fc_table;
+
+		if (tb_id == RT_TABLE_MAIN)
+			tb_id = RT_TABLE_LOCAL;
+
+		if (inet_addr_type_table(cfg->fc_nlinfo.nl_net,
+					 fib_prefsrc, tb_id) != RTN_LOCAL) {
+			return false;
+		}
+	}
+	return true;
+}
+
 struct fib_info *fib_create_info(struct fib_config *cfg)
 {
int err;
@@ -1033,12 +1050,8 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
 			fi->fib_flags |= RTNH_F_LINKDOWN;
 	}
 
-	if (fi->fib_prefsrc) {
-		if (cfg->fc_type != RTN_LOCAL || !cfg->fc_dst ||
-		    fi->fib_prefsrc != cfg->fc_dst)
-			if (inet_addr_type(net, fi->fib_prefsrc) != RTN_LOCAL)
-				goto err_inval;
-	}
+	if (fi->fib_prefsrc && !fib_valid_prefsrc(cfg, fi->fib_prefsrc))
+		goto err_inval;
 
 	change_nexthops(fi) {
 		fib_info_update_nh_saddr(net, nexthop_nh);
-- 
2.3.2 (Apple Git-55)



[PATCH net-next 06/11] net: Fix up inet_addr_type checks

2015-08-13 Thread David Ahern
Currently inet_addr_type and inet_dev_addr_type expect local addresses
to be in the local table. With the VRF device local routes for devices
associated with a VRF will be in the table associated with the VRF.
Provide an alternate inet_addr lookup to use a specific table rather
than defaulting to the local table.

inet_addr_type_dev_table keeps the same semantics as inet_addr_type but
if the passed in device is enslaved to a VRF then the table for that VRF
is used for the lookup.

Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 include/net/route.h  |  3 +++
 net/ipv4/af_inet.c   | 13 -
 net/ipv4/arp.c   | 15 +--
 net/ipv4/fib_frontend.c  | 25 ++---
 net/ipv4/fib_semantics.c |  6 --
 net/ipv4/icmp.c  |  5 +++--
 6 files changed, 53 insertions(+), 14 deletions(-)

diff --git a/include/net/route.h b/include/net/route.h
index 6ba681f0b98d..6dda2c1bf8c6 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -192,6 +192,9 @@ unsigned int inet_addr_type(struct net *net, __be32 addr);
 unsigned int inet_addr_type_table(struct net *net, __be32 addr, int tb_id);
 unsigned int inet_dev_addr_type(struct net *net, const struct net_device *dev,
__be32 addr);
+unsigned int inet_addr_type_dev_table(struct net *net,
+ const struct net_device *dev,
+ __be32 addr);
 void ip_rt_multicast_event(struct in_device *);
 int ip_rt_ioctl(struct net *, unsigned int cmd, void __user *arg);
 void ip_rt_get_source(u8 *src, struct sk_buff *skb, struct rtable *rt);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index cc4e498a0ccf..c8b855882fa5 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -119,6 +119,7 @@
 #ifdef CONFIG_IP_MROUTE
 #include <linux/mroute.h>
 #endif
+#include <net/vrf.h>
 
 
 /* The inetsw table contains everything that inet_create needs to
@@ -427,6 +428,7 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
struct net *net = sock_net(sk);
unsigned short snum;
int chk_addr_ret;
+   int tb_id = RT_TABLE_LOCAL;
int err;
 
/* If the socket has its own bind function then use it. (RAW) */
@@ -448,7 +450,16 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 			goto out;
 	}
 
-	chk_addr_ret = inet_addr_type(net, addr->sin_addr.s_addr);
+	if (sk->sk_bound_dev_if) {
+		struct net_device *dev;
+
+		rcu_read_lock();
+		dev = dev_get_by_index_rcu(net, sk->sk_bound_dev_if);
+		if (dev)
+			tb_id = vrf_dev_table_rcu(dev) ? : tb_id;
+		rcu_read_unlock();
+	}
+	chk_addr_ret = inet_addr_type_table(net, addr->sin_addr.s_addr, tb_id);
 
/* Not specified by any standard per-se, however it breaks too
 * many applications when removed.  It is unfortunate since
diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index 34a308573f4b..30409b75e925 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -233,7 +233,7 @@ static int arp_constructor(struct neighbour *neigh)
 		return -EINVAL;
 	}
 
-	neigh->type = inet_addr_type(dev_net(dev), addr);
+	neigh->type = inet_addr_type_dev_table(dev_net(dev), dev, addr);
 
 	parms = in_dev->arp_parms;
 	__neigh_parms_put(neigh->parms);
@@ -343,7 +343,7 @@ static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb)
 	switch (IN_DEV_ARP_ANNOUNCE(in_dev)) {
 	default:
 	case 0:		/* By default announce any local IP */
-		if (skb && inet_addr_type(dev_net(dev),
+		if (skb && inet_addr_type_dev_table(dev_net(dev), dev,
 					  ip_hdr(skb)->saddr) == RTN_LOCAL)
 			saddr = ip_hdr(skb)->saddr;
 		break;
@@ -351,7 +351,8 @@ static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb)
 		if (!skb)
 			break;
 		saddr = ip_hdr(skb)->saddr;
-		if (inet_addr_type(dev_net(dev), saddr) == RTN_LOCAL) {
+		if (inet_addr_type_dev_table(dev_net(dev), dev,
+					     saddr) == RTN_LOCAL) {
 			/* saddr should be known to target */
 			if (inet_addr_onlink(in_dev, target, saddr))
 				break;
@@ -751,7 +752,7 @@ static int arp_process(struct sock *sk, struct sk_buff *skb)
 	/* Special case: IPv4 duplicate address detection packet (RFC2131) */
 	if (sip == 0) {
 		if (arp->ar_op == htons(ARPOP_REQUEST) &&
-		    inet_addr_type(net, tip) == RTN_LOCAL &&
+		    inet_addr_type_dev_table(net, dev, tip) == RTN_LOCAL &&
 		    !arp_ignore(in_dev, sip, tip))
arp_send(ARPOP_REPLY

[PATCH net-next 11/11] net: Introduce VRF device driver

2015-08-13 Thread David Ahern
This driver borrows heavily from IPvlan and teaming drivers.

Routing domains (VRF-lite) are created by instantiating a VRF master
device with an associated table and enslaving all routed interfaces that
participate in the domain. As part of the enslavement, all connected
routes for the enslaved devices are moved to the table associated with
the VRF device. Outgoing sockets must bind to the VRF device to function.

Standard FIB rules bind the VRF device to tables and regular fib rule
processing is followed. Traffic routed through the box is forwarded by
using the VRF device as the IIF and following the IIF rule to a table
that is mated with the VRF.

Example:

   Create vrf 1:
 ip link add vrf1 type vrf table 5
 ip rule add iif vrf1 table 5
 ip rule add oif vrf1 table 5
 ip route add table 5 prohibit default
 ip link set vrf1 up

   Add interface to vrf 1:
 ip link set eth1 master vrf1

Signed-off-by: Shrijeet Mukherjee s...@cumulusnetworks.com
Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 drivers/net/Kconfig  |   7 +
 drivers/net/Makefile |   1 +
 drivers/net/vrf.c| 685 +++
 3 files changed, 693 insertions(+)
 create mode 100644 drivers/net/vrf.c

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index c18f9e62a9fa..e58468b02987 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -297,6 +297,13 @@ config NLMON
  diagnostics, etc. This is mostly intended for developers or support
  to debug netlink issues. If unsure, say N.
 
+config NET_VRF
+   tristate "Virtual Routing and Forwarding (Lite)"
+   depends on IP_MULTIPLE_TABLES && IPV6_MULTIPLE_TABLES
+   ---help---
+ This option enables the support for mapping interfaces into VRF's. The
+ support enables VRF devices.
+
 endif # NET_CORE
 
 config SUNGEM_PHY
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index c12cb22478a7..ca16dd689b36 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -25,6 +25,7 @@ obj-$(CONFIG_VIRTIO_NET) += virtio_net.o
 obj-$(CONFIG_VXLAN) += vxlan.o
 obj-$(CONFIG_GENEVE) += geneve.o
 obj-$(CONFIG_NLMON) += nlmon.o
+obj-$(CONFIG_NET_VRF) += vrf.o
 
 #
 # Networking Drivers
diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
new file mode 100644
index ..95097cb79354
--- /dev/null
+++ b/drivers/net/vrf.c
@@ -0,0 +1,685 @@
+/*
+ * vrf.c: device driver to encapsulate a VRF space
+ *
+ * Copyright (c) 2015 Cumulus Networks. All rights reserved.
+ * Copyright (c) 2015 Shrijeet Mukherjee s...@cumulusnetworks.com
+ * Copyright (c) 2015 David Ahern d...@cumulusnetworks.com
+ *
+ * Based on dummy, team and ipvlan drivers
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/ip.h>
+#include <linux/init.h>
+#include <linux/moduleparam.h>
+#include <linux/netfilter.h>
+#include <linux/rtnetlink.h>
+#include <net/rtnetlink.h>
+#include <linux/u64_stats_sync.h>
+#include <linux/hashtable.h>
+
+#include <linux/inetdevice.h>
+#include <net/ip.h>
+#include <net/ip_fib.h>
+#include <net/ip6_route.h>
+#include <net/rtnetlink.h>
+#include <net/route.h>
+#include <net/addrconf.h>
+#include <net/vrf.h>
+
+#define DRV_NAME   "vrf"
+#define DRV_VERSION   "1.0"
+
+#define vrf_is_slave(dev)   ((dev)->flags & IFF_SLAVE)
+
+#define vrf_master_get_rcu(dev) \
+   ((struct net_device *)rcu_dereference(dev->rx_handler_data))
+
+struct pcpu_dstats {
+   u64 tx_pkts;
+   u64 tx_bytes;
+   u64 tx_drps;
+   u64 rx_pkts;
+   u64 rx_bytes;
+   struct u64_stats_sync   syncp;
+};
+
+static struct dst_entry *vrf_ip_check(struct dst_entry *dst, u32 cookie)
+{
+   return dst;
+}
+
+static int vrf_ip_local_out(struct sk_buff *skb)
+{
+   return ip_local_out(skb);
+}
+
+static unsigned int vrf_v4_mtu(const struct dst_entry *dst)
+{
+   /* TO-DO: return max ethernet size? */
+   return dst->dev->mtu;
+}
+
+static void vrf_dst_destroy(struct dst_entry *dst)
+{
+   /* our dst lives forever - or until the device is closed */
+}
+
+static unsigned int vrf_default_advmss(const struct dst_entry *dst)
+{
+   return 65535 - 40;
+}
+
+static struct dst_ops vrf_dst_ops = {
+   .family = AF_INET,
+   .local_out  = vrf_ip_local_out,
+   .check  = vrf_ip_check,
+   .mtu= vrf_v4_mtu,
+   .destroy= vrf_dst_destroy,
+   .default_advmss = vrf_default_advmss,
+};
+
+static bool is_ip_rx_frame(struct sk_buff *skb)
+{
+   switch (skb->protocol) {
+   case htons(ETH_P_IP):
+   case htons(ETH_P_IPV6

[PATCH net-next 9/9] net: Introduce VRF device driver

2015-08-10 Thread David Ahern
This driver borrows heavily from IPvlan and teaming drivers.

Routing domains (VRF-lite) are created by instantiating a VRF master
device with an associated table and enslaving all routed interfaces that
participate in the domain. As part of the enslavement, all connected
routes for the enslaved devices are moved to the table associated with
the VRF device. Outgoing sockets must bind to the VRF device to function.

Standard FIB rules bind the VRF device to tables and regular fib rule
processing is followed. Traffic routed through the box is forwarded by
using the VRF device as the IIF and following the IIF rule to a table
that is mated with the VRF.

Example:

   Create vrf 1:
 ip link add vrf1 type vrf table 5
 ip rule add iif vrf1 table 5
 ip rule add oif vrf1 table 5
 ip route add table 5 prohibit default
 ip link set vrf1 up

   Add interface to vrf 1:
 ip link set eth1 master vrf1

Signed-off-by: Shrijeet Mukherjee s...@cumulusnetworks.com
Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 drivers/net/Kconfig  |   7 +
 drivers/net/Makefile |   1 +
 drivers/net/vrf.c| 685 +++
 3 files changed, 693 insertions(+)
 create mode 100644 drivers/net/vrf.c

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index c18f9e62a9fa..e58468b02987 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -297,6 +297,13 @@ config NLMON
  diagnostics, etc. This is mostly intended for developers or support
  to debug netlink issues. If unsure, say N.
 
+config NET_VRF
+   tristate "Virtual Routing and Forwarding (Lite)"
+   depends on IP_MULTIPLE_TABLES && IPV6_MULTIPLE_TABLES
+   ---help---
+ This option enables the support for mapping interfaces into VRF's. The
+ support enables VRF devices.
+
 endif # NET_CORE
 
 config SUNGEM_PHY
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index c12cb22478a7..ca16dd689b36 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -25,6 +25,7 @@ obj-$(CONFIG_VIRTIO_NET) += virtio_net.o
 obj-$(CONFIG_VXLAN) += vxlan.o
 obj-$(CONFIG_GENEVE) += geneve.o
 obj-$(CONFIG_NLMON) += nlmon.o
+obj-$(CONFIG_NET_VRF) += vrf.o
 
 #
 # Networking Drivers
diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
new file mode 100644
index ..95097cb79354
--- /dev/null
+++ b/drivers/net/vrf.c
@@ -0,0 +1,685 @@
+/*
+ * vrf.c: device driver to encapsulate a VRF space
+ *
+ * Copyright (c) 2015 Cumulus Networks. All rights reserved.
+ * Copyright (c) 2015 Shrijeet Mukherjee s...@cumulusnetworks.com
+ * Copyright (c) 2015 David Ahern d...@cumulusnetworks.com
+ *
+ * Based on dummy, team and ipvlan drivers
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/ip.h>
+#include <linux/init.h>
+#include <linux/moduleparam.h>
+#include <linux/netfilter.h>
+#include <linux/rtnetlink.h>
+#include <net/rtnetlink.h>
+#include <linux/u64_stats_sync.h>
+#include <linux/hashtable.h>
+
+#include <linux/inetdevice.h>
+#include <net/ip.h>
+#include <net/ip_fib.h>
+#include <net/ip6_route.h>
+#include <net/rtnetlink.h>
+#include <net/route.h>
+#include <net/addrconf.h>
+#include <net/vrf.h>
+
+#define DRV_NAME   "vrf"
+#define DRV_VERSION   "1.0"
+
+#define vrf_is_slave(dev)   ((dev)->flags & IFF_SLAVE)
+
+#define vrf_master_get_rcu(dev) \
+   ((struct net_device *)rcu_dereference(dev->rx_handler_data))
+
+struct pcpu_dstats {
+   u64 tx_pkts;
+   u64 tx_bytes;
+   u64 tx_drps;
+   u64 rx_pkts;
+   u64 rx_bytes;
+   struct u64_stats_sync   syncp;
+};
+
+static struct dst_entry *vrf_ip_check(struct dst_entry *dst, u32 cookie)
+{
+   return dst;
+}
+
+static int vrf_ip_local_out(struct sk_buff *skb)
+{
+   return ip_local_out(skb);
+}
+
+static unsigned int vrf_v4_mtu(const struct dst_entry *dst)
+{
+   /* TO-DO: return max ethernet size? */
+   return dst->dev->mtu;
+}
+
+static void vrf_dst_destroy(struct dst_entry *dst)
+{
+   /* our dst lives forever - or until the device is closed */
+}
+
+static unsigned int vrf_default_advmss(const struct dst_entry *dst)
+{
+   return 65535 - 40;
+}
+
+static struct dst_ops vrf_dst_ops = {
+   .family = AF_INET,
+   .local_out  = vrf_ip_local_out,
+   .check  = vrf_ip_check,
+   .mtu= vrf_v4_mtu,
+   .destroy= vrf_dst_destroy,
+   .default_advmss = vrf_default_advmss,
+};
+
+static bool is_ip_rx_frame(struct sk_buff *skb)
+{
+   switch (skb->protocol) {
+   case htons(ETH_P_IP):
+   case htons(ETH_P_IPV6

[PATCH net-next 00/10] VRF-lite - v5

2015-08-10 Thread David Ahern
 to the VRF (sk_bound_dev_if is set to the VRF device).

5. Neighbor entries
   Neighbor entries are not impacted by the VRF device. Entries are
   associated with a particular interface; the VRF association is indirect
   via the interface-to-VRF device enslavement.


Version 5
- dropped patch regarding socket lookups; no longer needed
  + removed vrf helpers no longer needed after this patch is dropped

- removed dev_open and close operations
  + no need to reset vrf data on an ifdown and creates problems if a
slave is deleted while the vrf interface is down (Thanks, Nikolay)

- cleanups for sparse warnings
  + make C=2 is now clean for vrf driver


Version 4
- builds are clean with and without VRF device enabled (no, yes and module)
- tightened the driver implementation
  + device add/delete, slave add/remove, and module unload are all clean
- fixed RCU references
  + with RCU and lock debugging enabled changes are clean through the
suite of tests
- TX path uses custom dst, so patch refactoring rtable allocation is
  dropped along with the patch adding rt_nexthop helper
- dropped the task patch that adds default bind to interface for sockets
  and the associated chvrf example command
  + the patches are a convenience for running unmodified code. They
are not needed for the core functionality. Any application with
support for SO_BINDTODEVICE works properly with this patch set.

Version 3
- addressed comments from first 2 RFCs with the exception of the name
  Nicolas: We will do the name conversion once we agree on what the
   correct name should be (vrf, mrf or something else)

-  packets flow through the VRF device in both directions allowing the
   following:
   - tcpdump -i vrfn
   - tc rules on vrf device
   - netfilter rules on vrf device


TO-DO
=====
1. IPv6

2. ip fragments

3. ipsec, xfrms
   - have this working now; will post patches soon

4. listen filter to restrict VRF connections
   - i.e., bind to VRFs a, b, c only, or to any VRF except e, f, g


Eric B:
  I have ipsec working with VRFs implemented using the VRF driver,
  including the worst case scenario of complete duplication in the
  networking config.


Thanks to Nikolay for his many, many code reviews whipping the device
driver into shape, and to Hannes, Roopa Prabhu, Jon Toppins, and Jamal
for bug fixes and ideas.

Patches can also be pulled from:
https://github.com/dsahern/linux.git, vrf-dev-v5 branch
https://github.com/dsahern/iproute2,  vrf-dev-v5 branch

David Ahern (9):
  net: Introduce VRF related flags and helpers
  net: Use VRF device index for lookups on RX
  net: Use VRF device index for lookups on TX
  udp: Handle VRF device in sendmsg
  net: Add inet_addr lookup by table
  net: Fix up inet_addr_type checks
  net: Add routes to the table associated with the device
  net: Use passed in table for nexthop lookups
  net: Introduce VRF device driver

 drivers/net/Kconfig  |   7 +
 drivers/net/Makefile |   1 +
 drivers/net/vrf.c| 685 +++
 include/linux/netdevice.h|  20 ++
 include/net/flow.h   |   1 +
 include/net/route.h  |   7 +
 include/net/vrf.h| 139 +
 include/uapi/linux/if_link.h |   9 +
 net/ipv4/af_inet.c   |  13 +-
 net/ipv4/arp.c   |  15 +-
 net/ipv4/fib_frontend.c  |  66 -
 net/ipv4/fib_semantics.c |  44 ++-
 net/ipv4/fib_trie.c  |   7 +-
 net/ipv4/icmp.c  |   9 +-
 net/ipv4/route.c |   8 +-
 net/ipv4/udp.c   |  22 +-
 16 files changed, 1015 insertions(+), 38 deletions(-)
 create mode 100644 drivers/net/vrf.c
 create mode 100644 include/net/vrf.h

-- 
2.3.2 (Apple Git-55)


[PATCH net-next 2/9] net: Use VRF device index for lookups on RX

2015-08-10 Thread David Ahern
On ingress use index of VRF master device for route lookups if real device
is enslaved. Rules are expected to be installed for the VRF device to
direct lookups to a specific table.

Signed-off-by: Shrijeet Mukherjee s...@cumulusnetworks.com
Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 net/ipv4/fib_frontend.c | 8 +++-
 net/ipv4/route.c| 3 ++-
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 6b98de0d7949..d8ced1d89f1b 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -45,6 +45,7 @@
 #include <net/ip_fib.h>
 #include <net/rtnetlink.h>
 #include <net/xfrm.h>
+#include <net/vrf.h>
 
 #ifndef CONFIG_IP_MULTIPLE_TABLES
 
@@ -309,7 +310,9 @@ static int __fib_validate_source(struct sk_buff *skb, 
__be32 src, __be32 dst,
bool dev_match;
 
fl4.flowi4_oif = 0;
-   fl4.flowi4_iif = oif ? : LOOPBACK_IFINDEX;
+   fl4.flowi4_iif = vrf_master_ifindex_rcu(dev);
+   if (!fl4.flowi4_iif)
+   fl4.flowi4_iif = oif ? : LOOPBACK_IFINDEX;
fl4.daddr = src;
fl4.saddr = dst;
fl4.flowi4_tos = tos;
@@ -339,6 +342,9 @@ static int __fib_validate_source(struct sk_buff *skb, 
__be32 src, __be32 dst,
if (nh->nh_dev == dev) {
dev_match = true;
break;
+   } else if (vrf_master_ifindex_rcu(nh->nh_dev) == dev->ifindex) {
+   dev_match = true;
+   break;
}
}
 #else
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 18fd7c9095c7..c26ff1f7067d 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -112,6 +112,7 @@
 #endif
 #include <net/secure_seq.h>
 #include <net/ip_tunnels.h>
+#include <net/vrf.h>
 
 #define RT_FL_TOS(oldflp4) \
((oldflp4)->flowi4_tos & (IPTOS_RT_MASK | RTO_ONLINK))
@@ -1726,7 +1727,7 @@ static int ip_route_input_slow(struct sk_buff *skb, 
__be32 daddr, __be32 saddr,
 *  Now we are ready to route packet.
 */
fl4.flowi4_oif = 0;
-   fl4.flowi4_iif = dev->ifindex;
+   fl4.flowi4_iif = vrf_master_ifindex_rcu(dev) ? : dev->ifindex;
fl4.flowi4_mark = skb->mark;
fl4.flowi4_tos = tos;
fl4.flowi4_scope = RT_SCOPE_UNIVERSE;
-- 
2.3.2 (Apple Git-55)
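
Note: the ingress interface selection described in the 2/9 commit message
can be modeled in userspace roughly as below. This is only a sketch under
stated assumptions: struct model_dev, the model_* functions, and the
ifindex values are made up for illustration and are not kernel code.
model_master_ifindex() stands in for vrf_master_ifindex_rcu() and covers
only the enslaved and non-enslaved cases.

```c
#include <stddef.h>

/* Same value as the kernel's LOOPBACK_IFINDEX. */
enum { MODEL_LOOPBACK_IFINDEX = 1 };

/* Hypothetical, heavily simplified stand-in for struct net_device. */
struct model_dev {
	int ifindex;
	const struct model_dev *master;	/* VRF master if enslaved, else NULL */
};

/* Made-up devices: eth1 is enslaved to vrf1, eth0 is not in any VRF. */
const struct model_dev vrf1 = { 10, NULL };
const struct model_dev eth0 = { 2, NULL };
const struct model_dev eth1 = { 3, &vrf1 };

/* Stand-in for vrf_master_ifindex_rcu(): 0 when there is no VRF master. */
int model_master_ifindex(const struct model_dev *dev)
{
	return dev->master ? dev->master->ifindex : 0;
}

/* Models the iif assignment patched into ip_route_input_slow(). */
int model_input_iif(const struct model_dev *dev)
{
	int master = model_master_ifindex(dev);

	return master ? master : dev->ifindex;
}

/* Models the iif assignment patched into __fib_validate_source(). */
int model_validate_source_iif(const struct model_dev *dev, int oif)
{
	int iif = model_master_ifindex(dev);

	if (!iif)
		iif = oif ? oif : MODEL_LOOPBACK_IFINDEX;
	return iif;
}
```

With these devices, a packet received on eth1 is looked up with iif 10
(the VRF master), so an "ip rule add iif vrf1 table 5" rule can direct it
to the VRF table, while a packet received on eth0 keeps iif 2.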



[PATCH net-next 8/9] net: Use passed in table for nexthop lookups

2015-08-10 Thread David Ahern
If a user passes in a table for new routes, use that table for nexthop
lookups. Specifically, this solves the case where a connected route does
not exist in the main table but only in another table, and a subsequent
route is then added with a next hop that uses the connected route, e.g.:

$ ip route ls
default via 10.0.2.2 dev eth0
10.0.2.0/24 dev eth0  proto kernel  scope link  src 10.0.2.15
169.254.0.0/16 dev eth0  scope link  metric 1003
192.168.56.0/24 dev eth1  proto kernel  scope link  src 192.168.56.51

$ ip route ls table 10
1.1.1.0/24 dev eth2  scope link

Without this patch adding a nexthop route fails:

$ ip route add table 10 2.2.2.0/24 via 1.1.1.10
RTNETLINK answers: Network is unreachable

With this patch the route is added successfully.

Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 net/ipv4/fib_semantics.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 85e9a8abf15c..b7f1d20a9615 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -691,6 +691,7 @@ static int fib_check_nh(struct fib_config *cfg, struct 
fib_info *fi,
}
rcu_read_lock();
{
+   struct fib_table *tbl = NULL;
struct flowi4 fl4 = {
.daddr = nh->nh_gw,
.flowi4_scope = cfg->fc_scope + 1,
@@ -701,8 +702,16 @@ static int fib_check_nh(struct fib_config *cfg, struct 
fib_info *fi,
/* It is not necessary, but requires a bit of thinking */
if (fl4.flowi4_scope < RT_SCOPE_LINK)
fl4.flowi4_scope = RT_SCOPE_LINK;
-   err = fib_lookup(net, &fl4, &res,
-FIB_LOOKUP_IGNORE_LINKSTATE);
+
+   if (cfg->fc_table)
+   tbl = fib_get_table(net, cfg->fc_table);
+
+   if (tbl)
+   err = fib_table_lookup(tbl, &fl4, &res,
+  FIB_LOOKUP_IGNORE_LINKSTATE);
+   else
+   err = fib_lookup(net, &fl4, &res,
+FIB_LOOKUP_IGNORE_LINKSTATE);
if (err) {
rcu_read_unlock();
return err;
-- 
2.3.2 (Apple Git-55)
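
Note: the failure and the fix described in the 8/9 commit message can be
reproduced with a small userspace model. Everything named model_* and the
IP4 macro are assumptions for illustration; the route data is taken from
the "ip route ls" output in the commit message, restricted to link-scope
routes (the default route via a gateway is left out, since the nexthop
check only accepts routes of link scope or narrower). The unpatched
policy lookup is simplified here to "main table only".

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model types; none of these names exist in the kernel. */
struct model_route { uint32_t prefix; int plen; };
struct model_table { int id; const struct model_route *routes; size_t n; };

#define IP4(a, b, c, d) \
	(((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
	 ((uint32_t)(c) << 8) | (uint32_t)(d))

enum { MODEL_TABLE_MAIN = 254 };	/* same value as RT_TABLE_MAIN */

static const struct model_route main_routes[] = {
	{ IP4(10, 0, 2, 0), 24 },
	{ IP4(169, 254, 0, 0), 16 },
	{ IP4(192, 168, 56, 0), 24 },
};
static const struct model_route table10_routes[] = {
	{ IP4(1, 1, 1, 0), 24 },
};
static const struct model_table model_tables[] = {
	{ MODEL_TABLE_MAIN, main_routes, 3 },
	{ 10, table10_routes, 1 },
};

static const struct model_table *model_get_table(int id)
{
	for (size_t i = 0; i < sizeof(model_tables) / sizeof(model_tables[0]); i++)
		if (model_tables[i].id == id)
			return &model_tables[i];
	return NULL;
}

/* Returns 0 if some route in the table covers addr, -1 otherwise. */
static int model_table_lookup(const struct model_table *t, uint32_t addr)
{
	for (size_t i = 0; i < t->n; i++) {
		int plen = t->routes[i].plen;
		uint32_t mask = plen ? ~(uint32_t)0 << (32 - plen) : 0;

		if ((addr & mask) == t->routes[i].prefix)
			return 0;
	}
	return -1;
}

/*
 * Models the gateway check: with the patch, a table passed by the user is
 * consulted first; an unset or unknown table falls back to the old lookup.
 * Returns 0 if the gateway is reachable, -1 otherwise.
 */
int model_check_nh(int fc_table, uint32_t gw, int patched)
{
	const struct model_table *tbl = NULL;

	if (patched && fc_table)
		tbl = model_get_table(fc_table);
	if (!tbl)
		tbl = model_get_table(MODEL_TABLE_MAIN);
	return model_table_lookup(tbl, gw);
}
```

In this model, model_check_nh(10, IP4(1, 1, 1, 10), 0) fails, matching
the "Network is unreachable" error, and the same call with patched set
to 1 succeeds.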



[PATCH] iproute2: Add support for VRF device

2015-08-10 Thread David Ahern
Allow user to create a vrf device and specify its table binding.
Based on the iplink_vlan implementation.

Signed-off-by: Shrijeet Mukherjee s...@cumulusnetworks.com
Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 include/linux/if_link.h |  8 +
 ip/Makefile |  2 +-
 ip/iplink.c |  2 +-
 ip/iplink_vrf.c | 85 +
 4 files changed, 95 insertions(+), 2 deletions(-)
 create mode 100644 ip/iplink_vrf.c

diff --git a/include/linux/if_link.h b/include/linux/if_link.h
index b905cf7f4948..74dedf4320b8 100644
--- a/include/linux/if_link.h
+++ b/include/linux/if_link.h
@@ -338,6 +338,14 @@ enum macvlan_macaddr_mode {
 
 #define MACVLAN_FLAG_NOPROMISC 1
 
+/* VRF section */
+enum {
+   IFLA_VRF_UNSPEC,
+   IFLA_VRF_TABLE,
+   __IFLA_VRF_MAX
+};
+
+#define IFLA_VRF_MAX (__IFLA_VRF_MAX - 1)
 /* IPVLAN section */
 enum {
IFLA_IPVLAN_UNSPEC,
diff --git a/ip/Makefile b/ip/Makefile
index 77653ecc5785..d8b38ac2e44b 100644
--- a/ip/Makefile
+++ b/ip/Makefile
@@ -7,7 +7,7 @@ IPOBJ=ip.o ipaddress.o ipaddrlabel.o iproute.o iprule.o 
ipnetns.o \
 iplink_vxlan.o tcp_metrics.o iplink_ipoib.o ipnetconf.o link_ip6tnl.o \
 link_iptnl.o link_gre6.o iplink_bond.o iplink_bond_slave.o iplink_hsr.o \
 iplink_bridge.o iplink_bridge_slave.o ipfou.o iplink_ipvlan.o \
-iplink_geneve.o
+iplink_geneve.o iplink_vrf.o
 
 RTMONOBJ=rtmon.o
 
diff --git a/ip/iplink.c b/ip/iplink.c
index 369d50eab94e..14bf7211a447 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -94,7 +94,7 @@ void iplink_usage(void)
fprintf(stderr, "TYPE := { vlan | veth | vcan | dummy | ifb | macvlan | macvtap |\n");
fprintf(stderr, "          bridge | bond | ipoib | ip6tnl | ipip | sit | vxlan |\n");
fprintf(stderr, "          gre | gretap | ip6gre | ip6gretap | vti | nlmon |\n");
-   fprintf(stderr, "          bond_slave | ipvlan | geneve }\n");
+   fprintf(stderr, "          bond_slave | ipvlan | geneve | vrf }\n");
}
exit(-1);
 }
diff --git a/ip/iplink_vrf.c b/ip/iplink_vrf.c
new file mode 100644
index ..0d7e21c7c152
--- /dev/null
+++ b/ip/iplink_vrf.c
@@ -0,0 +1,85 @@
+/* iplink_vrf.c   VRF device support
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License
+ *  as published by the Free Software Foundation; either version
+ *  2 of the License, or (at your option) any later version.
+ *
+ * Authors: Shrijeet Mukherjee s...@cumulusnetworks.com
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/socket.h>
+#include <linux/if_link.h>
+
+#include "rt_names.h"
+#include "utils.h"
+#include "ip_common.h"
+
+static void vrf_explain(FILE *f)
+{
+   fprintf(f, "Usage: ... vrf table TABLEID \n");
+}
+
+static void explain(void)
+{
+   vrf_explain(stderr);
+}
+
+static int table_arg(void)
+{
+   fprintf(stderr, "Error: argument of \"table\" must be 0-32767 and currently unused\n");
+   return -1;
+}
+
+static int vrf_parse_opt(struct link_util *lu, int argc, char **argv,
+   struct nlmsghdr *n)
+{
+   while (argc > 0) {
+   if (matches(*argv, "table") == 0) {
+   __u32 table = 0;
+   NEXT_ARG();
+
+   table = atoi(*argv);
+   if (table < 0 || table > 32767)
+   return table_arg();
+   addattr32(n, 1024, IFLA_VRF_TABLE, table);
+   } else if (matches(*argv, "help") == 0) {
+   explain();
+   return -1;
+   } else {
+   fprintf(stderr, "vrf: unknown option \"%s\"?\n",
+   *argv);
+   explain();
+   return -1;
+   }
+   argc--, argv++;
+   }
+
+   return 0;
+}
+
+static void vrf_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
+{
+   if (!tb)
+   return;
+
+   if (tb[IFLA_VRF_TABLE])
+   fprintf(f, "table %u ", rta_getattr_u32(tb[IFLA_VRF_TABLE]));
+}
+
+static void vrf_print_help(struct link_util *lu, int argc, char **argv,
+ FILE *f)
+{
+   vrf_explain(f);
+}
+
+struct link_util vrf_link_util = {
+   .id = "vrf",
+   .maxattr= IFLA_VRF_MAX,
+   .parse_opt  = vrf_parse_opt,
+   .print_opt  = vrf_print_opt,
+   .print_help = vrf_print_help,
+};
-- 
2.3.2 (Apple Git-55)



[PATCH net-next 6/9] net: Fix up inet_addr_type checks

2015-08-10 Thread David Ahern
Currently inet_addr_type and inet_dev_addr_type expect local addresses
to be in the local table. With the VRF device local routes for devices
associated with a VRF will be in the table associated with the VRF.
Provide an alternate inet_addr lookup to use a specific table rather
than defaulting to the local table.

inet_addr_type_dev_table keeps the same semantics as inet_addr_type but
if the passed in device is enslaved to a VRF then the table for that VRF
is used for the lookup.

Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 include/net/route.h  |  3 +++
 net/ipv4/af_inet.c   | 13 -
 net/ipv4/arp.c   | 15 +--
 net/ipv4/fib_frontend.c  | 28 +---
 net/ipv4/fib_semantics.c |  6 --
 net/ipv4/icmp.c  |  5 +++--
 6 files changed, 56 insertions(+), 14 deletions(-)

diff --git a/include/net/route.h b/include/net/route.h
index 6ba681f0b98d..6dda2c1bf8c6 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -192,6 +192,9 @@ unsigned int inet_addr_type(struct net *net, __be32 addr);
 unsigned int inet_addr_type_table(struct net *net, __be32 addr, int tb_id);
 unsigned int inet_dev_addr_type(struct net *net, const struct net_device *dev,
__be32 addr);
+unsigned int inet_addr_type_dev_table(struct net *net,
+ const struct net_device *dev,
+ __be32 addr);
 void ip_rt_multicast_event(struct in_device *);
 int ip_rt_ioctl(struct net *, unsigned int cmd, void __user *arg);
 void ip_rt_get_source(u8 *src, struct sk_buff *skb, struct rtable *rt);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index cc4e498a0ccf..96fba4f63454 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -119,6 +119,7 @@
 #ifdef CONFIG_IP_MROUTE
 #include <linux/mroute.h>
 #endif
+#include <net/vrf.h>
 
 
 /* The inetsw table contains everything that inet_create needs to
@@ -427,6 +428,7 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, 
int addr_len)
struct net *net = sock_net(sk);
unsigned short snum;
int chk_addr_ret;
+   int tb_id = 0;
int err;
 
/* If the socket has its own bind function then use it. (RAW) */
@@ -448,7 +450,16 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, 
int addr_len)
goto out;
}
 
-   chk_addr_ret = inet_addr_type(net, addr->sin_addr.s_addr);
+   if (sk->sk_bound_dev_if) {
+   struct net_device *dev;
+
+   rcu_read_lock();
+   dev = dev_get_by_index_rcu(net, sk->sk_bound_dev_if);
+   if (dev)
+   tb_id = vrf_dev_table_rcu(dev);
+   rcu_read_unlock();
+   }
+   chk_addr_ret = inet_addr_type_table(net, addr->sin_addr.s_addr, tb_id);
 
/* Not specified by any standard per-se, however it breaks too
 * many applications when removed.  It is unfortunate since
diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index 34a308573f4b..30409b75e925 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -233,7 +233,7 @@ static int arp_constructor(struct neighbour *neigh)
return -EINVAL;
}
 
-   neigh->type = inet_addr_type(dev_net(dev), addr);
+   neigh->type = inet_addr_type_dev_table(dev_net(dev), dev, addr);
 
parms = in_dev->arp_parms;
__neigh_parms_put(neigh->parms);
@@ -343,7 +343,7 @@ static void arp_solicit(struct neighbour *neigh, struct 
sk_buff *skb)
switch (IN_DEV_ARP_ANNOUNCE(in_dev)) {
default:
case 0: /* By default announce any local IP */
-   if (skb && inet_addr_type(dev_net(dev),
+   if (skb && inet_addr_type_dev_table(dev_net(dev), dev,
  ip_hdr(skb)->saddr) == RTN_LOCAL)
saddr = ip_hdr(skb)->saddr;
break;
@@ -351,7 +351,8 @@ static void arp_solicit(struct neighbour *neigh, struct 
sk_buff *skb)
if (!skb)
break;
saddr = ip_hdr(skb)->saddr;
-   if (inet_addr_type(dev_net(dev), saddr) == RTN_LOCAL) {
+   if (inet_addr_type_dev_table(dev_net(dev), dev,
+saddr) == RTN_LOCAL) {
/* saddr should be known to target */
if (inet_addr_onlink(in_dev, target, saddr))
break;
@@ -751,7 +752,7 @@ static int arp_process(struct sock *sk, struct sk_buff *skb)
/* Special case: IPv4 duplicate address detection packet (RFC2131) */
if (sip == 0) {
if (arp->ar_op == htons(ARPOP_REQUEST) &&
-   inet_addr_type(net, tip) == RTN_LOCAL &&
+   inet_addr_type_dev_table(net, dev, tip) == RTN_LOCAL &&
!arp_ignore(in_dev, sip, tip))
arp_send(ARPOP_REPLY, ETH_P_ARP, sip, dev, tip

[PATCH net-next 7/9] net: Add routes to the table associated with the device

2015-08-10 Thread David Ahern
When a device associated with a VRF is brought up or down routes
should be added to/removed from the table associated with the VRF.
fib_magic defaults to using the main or local tables. Have it use
the table with the device if there is one.

A part of this is directing prefsrc validations to the correct
table as well.

Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 net/ipv4/fib_frontend.c  |  8 
 net/ipv4/fib_semantics.c | 25 +++--
 2 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index d84ae0e30369..0a50a08ab844 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -803,6 +803,7 @@ static int inet_dump_fib(struct sk_buff *skb, struct 
netlink_callback *cb)
 static void fib_magic(int cmd, int type, __be32 dst, int dst_len, struct 
in_ifaddr *ifa)
 {
struct net *net = dev_net(ifa->ifa_dev->dev);
+   int tb_id = vrf_dev_table_rtnl(ifa->ifa_dev->dev);
struct fib_table *tb;
struct fib_config cfg = {
.fc_protocol = RTPROT_KERNEL,
@@ -817,11 +818,10 @@ static void fib_magic(int cmd, int type, __be32 dst, int 
dst_len, struct in_ifad
},
};
 
-   if (type == RTN_UNICAST)
-   tb = fib_new_table(net, RT_TABLE_MAIN);
-   else
-   tb = fib_new_table(net, RT_TABLE_LOCAL);
+   if (!tb_id)
+   tb_id = (type == RTN_UNICAST) ? RT_TABLE_MAIN : RT_TABLE_LOCAL;
 
+   tb = fib_new_table(net, tb_id);
if (!tb)
return;
 
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 410ddb67221e..85e9a8abf15c 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -838,6 +838,23 @@ __be32 fib_info_update_nh_saddr(struct net *net, struct 
fib_nh *nh)
return nh->nh_saddr;
 }
 
+static bool fib_valid_prefsrc(struct fib_config *cfg, __be32 fib_prefsrc)
+{
+   if (cfg->fc_type != RTN_LOCAL || !cfg->fc_dst ||
+   fib_prefsrc != cfg->fc_dst) {
+   int tb_id = cfg->fc_table;
+
+   if (tb_id == RT_TABLE_MAIN)
+   tb_id = RT_TABLE_LOCAL;
+
+   if (inet_addr_type_table(cfg->fc_nlinfo.nl_net,
+fib_prefsrc, tb_id) != RTN_LOCAL) {
+   return false;
+   }
+   }
+   return true;
+}
+
 struct fib_info *fib_create_info(struct fib_config *cfg)
 {
int err;
@@ -1033,12 +1050,8 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
fi->fib_flags |= RTNH_F_LINKDOWN;
}
 
-   if (fi->fib_prefsrc) {
-   if (cfg->fc_type != RTN_LOCAL || !cfg->fc_dst ||
-   fi->fib_prefsrc != cfg->fc_dst)
-   if (inet_addr_type(net, fi->fib_prefsrc) != RTN_LOCAL)
-   goto err_inval;
-   }
+   if (fi->fib_prefsrc && !fib_valid_prefsrc(cfg, fi->fib_prefsrc))
+   goto err_inval;
 
change_nexthops(fi) {
fib_info_update_nh_saddr(net, nexthop_nh);
-- 
2.3.2 (Apple Git-55)
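
Note: the table selection described in the 7/9 commit message reduces to
two small decisions, sketched below in userspace C. The MODEL_* constants
and model_* function names are assumptions for illustration; their
numeric values match RT_TABLE_MAIN, RT_TABLE_LOCAL, RTN_UNICAST and
RTN_LOCAL from the kernel's rtnetlink UAPI header. vrf_tb_id stands for
what vrf_dev_table_rtnl() returns, with 0 meaning the device is not
associated with a VRF.

```c
/* Values mirror the kernel's rtnetlink UAPI constants. */
enum { MODEL_RT_TABLE_MAIN = 254, MODEL_RT_TABLE_LOCAL = 255 };
enum { MODEL_RTN_UNICAST = 1, MODEL_RTN_LOCAL = 2 };

/*
 * Table that fib_magic() adds to or removes from: the VRF table when the
 * device has one, otherwise main for unicast routes and local for the rest.
 */
int model_fib_magic_table(int vrf_tb_id, int type)
{
	if (vrf_tb_id)
		return vrf_tb_id;
	return type == MODEL_RTN_UNICAST ? MODEL_RT_TABLE_MAIN
					 : MODEL_RT_TABLE_LOCAL;
}

/*
 * Table consulted when validating a preferred source address: routes in
 * main keep using local, any other table is checked against itself.
 */
int model_prefsrc_table(int fc_table)
{
	return fc_table == MODEL_RT_TABLE_MAIN ? MODEL_RT_TABLE_LOCAL
					       : fc_table;
}
```

For a device enslaved to a VRF bound to table 5, both the connected
(unicast) route and the local route land in table 5, which is why the
prefsrc check has to look there as well.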



[PATCH net-next 5/9] net: Add inet_addr lookup by table

2015-08-10 Thread David Ahern
Currently inet_addr_type and inet_dev_addr_type expect local addresses
to be in the local table. With the VRF device local routes for devices
associated with a VRF will be in the table associated with the VRF.
Provide an alternate inet_addr lookup to use a specific table rather
than defaulting to the local table.

Signed-off-by: Shrijeet Mukherjee s...@cumulusnetworks.com
Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 include/net/route.h |  1 +
 net/ipv4/fib_frontend.c | 22 +++---
 2 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/include/net/route.h b/include/net/route.h
index 94189d4bd899..6ba681f0b98d 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -189,6 +189,7 @@ void ipv4_sk_redirect(struct sk_buff *skb, struct sock *sk);
 void ip_rt_send_redirect(struct sk_buff *skb);
 
 unsigned int inet_addr_type(struct net *net, __be32 addr);
+unsigned int inet_addr_type_table(struct net *net, __be32 addr, int tb_id);
 unsigned int inet_dev_addr_type(struct net *net, const struct net_device *dev,
__be32 addr);
 void ip_rt_multicast_event(struct in_device *);
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index d8ced1d89f1b..b11321a8e58d 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -212,12 +212,12 @@ void fib_flush_external(struct net *net)
  */
 static inline unsigned int __inet_dev_addr_type(struct net *net,
const struct net_device *dev,
-   __be32 addr)
+   __be32 addr, int tb_id)
 {
struct flowi4   fl4 = { .daddr = addr };
struct fib_result   res;
unsigned int ret = RTN_BROADCAST;
-   struct fib_table *local_table;
+   struct fib_table *table;
 
if (ipv4_is_zeronet(addr) || ipv4_is_lbcast(addr))
return RTN_BROADCAST;
@@ -226,10 +226,10 @@ static inline unsigned int __inet_dev_addr_type(struct 
net *net,
 
rcu_read_lock();
 
-   local_table = fib_get_table(net, RT_TABLE_LOCAL);
-   if (local_table) {
+   table = fib_get_table(net, tb_id);
+   if (table) {
ret = RTN_UNICAST;
-   if (!fib_table_lookup(local_table, &fl4, &res, FIB_LOOKUP_NOREF)) {
+   if (!fib_table_lookup(table, &fl4, &res, FIB_LOOKUP_NOREF)) {
if (!dev || dev == res.fi->fib_dev)
ret = res.type;
}
@@ -239,16 +239,24 @@ static inline unsigned int __inet_dev_addr_type(struct 
net *net,
return ret;
 }
 
+unsigned int inet_addr_type_table(struct net *net, __be32 addr, int tb_id)
+{
+   return __inet_dev_addr_type(net, NULL, addr, tb_id);
+}
+EXPORT_SYMBOL(inet_addr_type_table);
+
 unsigned int inet_addr_type(struct net *net, __be32 addr)
 {
-   return __inet_dev_addr_type(net, NULL, addr);
+   return __inet_dev_addr_type(net, NULL, addr, RT_TABLE_LOCAL);
 }
 EXPORT_SYMBOL(inet_addr_type);
 
 unsigned int inet_dev_addr_type(struct net *net, const struct net_device *dev,
__be32 addr)
 {
-   return __inet_dev_addr_type(net, dev, addr);
+   int rt_table = vrf_dev_table(dev) ? : RT_TABLE_LOCAL;
+
+   return __inet_dev_addr_type(net, dev, addr, rt_table);
 }
 EXPORT_SYMBOL(inet_dev_addr_type);
 
-- 
2.3.2 (Apple Git-55)
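
Note: the effect of the 5/9 change on address classification can be
illustrated with a small userspace model. The m_* names, the entry list
and the addresses are made up for illustration; the numeric values of the
M_RTN_* and M_RT_TABLE_LOCAL constants match the kernel's. The sketch
keeps the return value flow of __inet_dev_addr_type() (BROADCAST when the
table does not exist, UNICAST when the table exists but has no match) and
leaves out the zeronet, limited broadcast, multicast and device checks.

```c
#include <stddef.h>
#include <stdint.h>

enum { M_RTN_UNICAST = 1, M_RTN_LOCAL = 2, M_RTN_BROADCAST = 3 };
enum { M_RT_TABLE_LOCAL = 255 };

/* Hypothetical flattened FIB: one exact-match entry per line. */
struct m_entry { int table; uint32_t addr; int type; };

static const struct m_entry m_entries[] = {
	{ M_RT_TABLE_LOCAL, 0x0a00020fu, M_RTN_LOCAL },	/* 10.0.2.15 */
	{ 5, 0x01010101u, M_RTN_LOCAL },	/* 1.1.1.1, VRF table 5 */
};

#define M_NENTRIES (sizeof(m_entries) / sizeof(m_entries[0]))

/* Models inet_addr_type_table(): classify addr using table tb_id only. */
int m_addr_type_table(uint32_t addr, int tb_id)
{
	int ret = M_RTN_BROADCAST;	/* result if the table is missing */

	for (size_t i = 0; i < M_NENTRIES; i++) {
		if (m_entries[i].table != tb_id)
			continue;
		if (ret == M_RTN_BROADCAST)
			ret = M_RTN_UNICAST;	/* table exists */
		if (m_entries[i].addr == addr)
			return m_entries[i].type;
	}
	return ret;
}

/* Models inet_addr_type(): always consults the local table. */
int m_addr_type(uint32_t addr)
{
	return m_addr_type_table(addr, M_RT_TABLE_LOCAL);
}
```

Here 1.1.1.1 is reported as unicast by m_addr_type() but as local by
m_addr_type_table(addr, 5), which is the case the new helper exists for.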



[PATCH net-next 1/9] net: Introduce VRF related flags and helpers

2015-08-10 Thread David Ahern
Add a VRF_MASTER flag for interfaces and helper functions for determining
if a device is a VRF_MASTER.

Add link attribute for passing VRF_TABLE id.

Add vrf_ptr to netdevice.

Add various macros for determining if a device is a VRF device, the index
of the master VRF device and table associated with VRF device.
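
The flag test behind these helpers can be sketched in userspace as below. This is only an illustration: `struct fake_net_device` and `fake_netif_is_vrf()` are hypothetical stand-ins, not kernel code, and the bit position for the VRF master flag is assumed to be the next free bit after the IPVLAN flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel private flags, for illustration. */
enum {
	FAKE_IFF_IPVLAN_SLAVE = 1 << 24,
	FAKE_IFF_VRF_MASTER   = 1 << 25,	/* assumed bit position */
};

/* Stripped-down device carrying only the private flags word. */
struct fake_net_device {
	unsigned int priv_flags;
};

/* Mirrors the idea of netif_is_vrf(): true when the master bit is set. */
static bool fake_netif_is_vrf(const struct fake_net_device *dev)
{
	assert(dev != NULL);
	return (dev->priv_flags & FAKE_IFF_VRF_MASTER) != 0;
}
```

A device with only the IPVLAN slave bit set is not reported as a VRF master, since the helper tests a single bit rather than comparing the whole flags word.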

Signed-off-by: Shrijeet Mukherjee s...@cumulusnetworks.com
Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 include/linux/netdevice.h|  20 +++
 include/net/vrf.h| 139 +++
 include/uapi/linux/if_link.h |   9 +++
 3 files changed, 168 insertions(+)
 create mode 100644 include/net/vrf.h

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 607b5f41f46f..f7a6ef2fae3a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1289,6 +1289,7 @@ enum netdev_priv_flags {
IFF_XMIT_DST_RELEASE_PERM   = 1<<22,
IFF_IPVLAN_MASTER   = 1<<23,
IFF_IPVLAN_SLAVE    = 1<<24,
+   IFF_VRF_MASTER  = 1<<25,
 };
 
 #define IFF_802_1Q_VLAN  IFF_802_1Q_VLAN
@@ -1316,6 +1317,7 @@ enum netdev_priv_flags {
 #define IFF_XMIT_DST_RELEASE_PERM  IFF_XMIT_DST_RELEASE_PERM
 #define IFF_IPVLAN_MASTER  IFF_IPVLAN_MASTER
 #define IFF_IPVLAN_SLAVE   IFF_IPVLAN_SLAVE
+#define IFF_VRF_MASTER IFF_VRF_MASTER
 
 /**
  * struct net_device - The DEVICE structure.
@@ -1432,6 +1434,7 @@ enum netdev_priv_flags {
  * @dn_ptr:    DECnet specific data
  * @ip6_ptr:   IPv6 specific data
  * @ax25_ptr:  AX.25 specific data
+ * @vrf_ptr:   VRF specific data
  * @ieee80211_ptr: IEEE 802.11 specific data, assign before registering
  *
  * @last_rx:   Time of last Rx
@@ -1650,6 +1653,7 @@ struct net_device {
struct dn_dev __rcu *dn_ptr;
struct inet6_dev __rcu  *ip6_ptr;
void    *ax25_ptr;
+   struct net_vrf_dev __rcu *vrf_ptr;
struct wireless_dev *ieee80211_ptr;
struct wpan_dev *ieee802154_ptr;
 #if IS_ENABLED(CONFIG_MPLS_ROUTING)
@@ -3808,6 +3812,22 @@ static inline bool netif_supports_nofcs(struct net_device *dev)
return dev->priv_flags & IFF_SUPP_NOFCS;
 }
 
+static inline bool netif_is_vrf(const struct net_device *dev)
+{
+   return dev->priv_flags & IFF_VRF_MASTER;
+}
+
+static inline bool netif_index_is_vrf(struct net *net, int ifindex)
+{
+   struct net_device *dev = dev_get_by_index_rcu(net, ifindex);
+   bool rc = false;
+
+   if (dev)
+   rc = netif_is_vrf(dev);
+
+   return rc;
+}
+
 /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
 static inline void netif_keep_dst(struct net_device *dev)
 {
diff --git a/include/net/vrf.h b/include/net/vrf.h
new file mode 100644
index ..25c709fdb98f
--- /dev/null
+++ b/include/net/vrf.h
@@ -0,0 +1,139 @@
+/*
+ * include/net/net_vrf.h - adds vrf dev structure definitions
+ * Copyright (c) 2015 Cumulus Networks
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#ifndef __LINUX_NET_VRF_H
+#define __LINUX_NET_VRF_H
+
+struct net_vrf_dev {
+   struct rcu_head rcu;
+   int ifindex; /* ifindex of master dev */
+   u32 tb_id;   /* table id for VRF */
+};
+
+struct slave {
+   struct list_head    list;
+   struct net_device   *dev;
+};
+
+struct slave_queue {
+   struct list_head    all_slaves;
+   int num_slaves;
+};
+
+struct net_vrf {
+   struct slave_queue  queue;
+   struct rtable   *rth;
+   u32 tb_id;
+};
+
+
+#if IS_ENABLED(CONFIG_NET_VRF)
+/* called with rcu_read_lock() */
+static inline int vrf_master_ifindex_rcu(const struct net_device *dev)
+{
+   struct net_vrf_dev *vrf_ptr;
+   int ifindex = 0;
+
+   if (!dev)
+   return 0;
+
+   if (netif_is_vrf(dev))
+   ifindex = dev->ifindex;
+   else {
+   vrf_ptr = rcu_dereference(dev->vrf_ptr);
+   if (vrf_ptr)
+   ifindex = vrf_ptr->ifindex;
+   }
+
+   return ifindex;
+}
+
+/* called with rcu_read_lock */
+static inline int vrf_dev_table_rcu(const struct net_device *dev)
+{
+   int tb_id = 0;
+
+   if (dev) {
+   struct net_vrf_dev *vrf_ptr;
+
+   vrf_ptr = rcu_dereference(dev->vrf_ptr);
+   if (vrf_ptr)
+   tb_id = vrf_ptr->tb_id;
+   }
+   return tb_id;
+}
+
+static inline int vrf_dev_table(const struct net_device *dev)
+{
+   int tb_id = 0;
+
+   rcu_read_lock();
+   tb_id = vrf_dev_table_rcu(dev

[PATCH net-next 4/9] udp: Handle VRF device in sendmsg

2015-08-10 Thread David Ahern
For unconnected UDP sockets using a VRF device, look up the source address
based on the VRF table. This allows the UDP header to be properly set up
before the packet shows up at the VRF device via the dst.

Signed-off-by: Shrijeet Mukherjee s...@cumulusnetworks.com
Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 net/ipv4/udp.c | 22 +-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 83aa604f9273..7af5052e3b1f 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1013,11 +1013,31 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 
if (!rt) {
struct net *net = sock_net(sk);
+   __u8 flow_flags = inet_sk_flowi_flags(sk);
 
fl4 = &fl4_stack;
+
+   /* unconnected socket. If output device is enslaved to a VRF
+* device lookup source address from VRF table. This mimics
+* behavior of ip_route_connect{_init}.
+*/
+   if (netif_index_is_vrf(net, ipc.oif)) {
+   flowi4_init_output(fl4, ipc.oif, sk->sk_mark, tos,
+  RT_SCOPE_UNIVERSE, sk->sk_protocol,
+  (flow_flags | FLOWI_FLAG_VRFSRC),
+  faddr, saddr, dport,
+  inet->inet_sport);
+
+   rt = ip_route_output_flow(net, fl4, sk);
+   if (!IS_ERR(rt)) {
+   saddr = fl4->saddr;
+   ip_rt_put(rt);
+   }
+   }
+
flowi4_init_output(fl4, ipc.oif, sk->sk_mark, tos,
   RT_SCOPE_UNIVERSE, sk->sk_protocol,
-  inet_sk_flowi_flags(sk),
+  flow_flags,
   faddr, saddr, dport, inet->inet_sport);
 
security_sk_classify_flow(sk, flowi4_to_flowi(fl4));
-- 
2.3.2 (Apple Git-55)



[PATCH net-next 3/9] net: Use VRF device index for lookups on TX

2015-08-10 Thread David Ahern
As with ingress, use the index of the VRF master device for route lookups on
egress. However, the oif should only be used to direct the lookups to a
specific table. Routes in the table are not based on the VRF device but
rather interfaces that are part of the VRF so do not consider the oif for
lookups within the table. The FLOWI_FLAG_VRFSRC is used to control this
latter part.
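
The effect of the flag on the per-nexthop device check can be sketched as follows. This is a simplified userspace model, not the kernel code: `struct flow_key`, `struct nexthop` and `nexthop_matches()` are hypothetical names, and only the flag value 0x04 is taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FLOWI_FLAG_VRFSRC 0x04	/* value used in the patch */

/* Hypothetical, simplified flow key and nexthop for illustration. */
struct flow_key {
	int oif;		/* requested output device, 0 = any */
	unsigned char flags;
};

struct nexthop {
	int oif;		/* device the route actually uses */
};

/* Returns true when the nexthop is acceptable for this flow. */
static bool nexthop_matches(const struct flow_key *fl, const struct nexthop *nh)
{
	assert(fl != NULL && nh != NULL);

	/* With VRFSRC set, oif only selected the table: skip the device check. */
	if (fl->flags & FLOWI_FLAG_VRFSRC)
		return true;

	/* Otherwise a non-zero oif must equal the nexthop's device. */
	return fl->oif == 0 || fl->oif == nh->oif;
}
```

With the flag clear, a flow whose oif is the VRF device would never match routes that point at the enslaved interfaces, which is why the check has to be skipped in that case.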

Signed-off-by: Shrijeet Mukherjee s...@cumulusnetworks.com
Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 include/net/flow.h  | 1 +
 include/net/route.h | 3 +++
 net/ipv4/fib_trie.c | 7 +--
 net/ipv4/icmp.c | 4 
 net/ipv4/route.c| 5 +
 5 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/net/flow.h b/include/net/flow.h
index 3098ae33a178..f305588fc162 100644
--- a/include/net/flow.h
+++ b/include/net/flow.h
@@ -33,6 +33,7 @@ struct flowi_common {
__u8    flowic_flags;
 #define FLOWI_FLAG_ANYSRC  0x01
 #define FLOWI_FLAG_KNOWN_NH    0x02
+#define FLOWI_FLAG_VRFSRC  0x04
__u32   flowic_secid;
struct flowi_tunnel flowic_tun_key;
 };
diff --git a/include/net/route.h b/include/net/route.h
index 2d45f419477f..94189d4bd899 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -251,6 +251,9 @@ static inline void ip_route_connect_init(struct flowi4 *fl4, __be32 dst, __be32
if (inet_sk(sk)->transparent)
flow_flags |= FLOWI_FLAG_ANYSRC;
 
+   if (netif_index_is_vrf(sock_net(sk), oif))
+   flow_flags |= FLOWI_FLAG_VRFSRC;
+
flowi4_init_output(fl4, oif, sk->sk_mark, tos, RT_SCOPE_UNIVERSE,
   protocol, flow_flags, dst, src, dport, sport);
 }
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 37c4bb89a708..1243c79cb5b0 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -1423,8 +1423,11 @@ int fib_table_lookup(struct fib_table *tb, const struct flowi4 *flp,
nh->nh_flags & RTNH_F_LINKDOWN &&
!(fib_flags & FIB_LOOKUP_IGNORE_LINKSTATE))
continue;
-   if (flp->flowi4_oif && flp->flowi4_oif != nh->nh_oif)
-   continue;
+   if (!(flp->flowi4_flags & FLOWI_FLAG_VRFSRC)) {
+   if (flp->flowi4_oif &&
+   flp->flowi4_oif != nh->nh_oif)
+   continue;
+   }
 
if (!(fib_flags & FIB_LOOKUP_NOREF))
atomic_inc(&fi->fib_clntref);
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index c0556f1e4bf0..1164fc4ce3bc 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -96,6 +96,7 @@
 #include <net/xfrm.h>
 #include <net/inet_common.h>
 #include <net/ip_fib.h>
+#include <net/vrf.h>
 
 /*
  * Build xmit assembly blocks
@@ -425,6 +426,7 @@ static void icmp_reply(struct icmp_bxm *icmp_param, struct sk_buff *skb)
fl4.flowi4_mark = mark;
fl4.flowi4_tos = RT_TOS(ip_hdr(skb)->tos);
fl4.flowi4_proto = IPPROTO_ICMP;
+   fl4.flowi4_oif = vrf_master_ifindex_rcu(skb->dev) ? : skb->dev->ifindex;
security_skb_classify_flow(skb, flowi4_to_flowi(&fl4));
rt = ip_route_output_key(net, &fl4);
if (IS_ERR(rt))
@@ -458,6 +460,8 @@ static struct rtable *icmp_route_lookup(struct net *net,
fl4->flowi4_proto = IPPROTO_ICMP;
fl4->fl4_icmp_type = type;
fl4->fl4_icmp_code = code;
+   fl4->flowi4_oif = vrf_master_ifindex_rcu(skb_in->dev) ? : skb_in->dev->ifindex;
+
security_skb_classify_flow(skb_in, flowi4_to_flowi(fl4));
rt = __ip_route_output_key(net, fl4);
if (IS_ERR(rt))
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index c26ff1f7067d..2c89d294b669 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2131,6 +2131,11 @@ struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4)
fl4->saddr = inet_select_addr(dev_out, 0,
  RT_SCOPE_HOST);
}
+   if (netif_is_vrf(dev_out) &&
+   !(fl4->flowi4_flags & FLOWI_FLAG_VRFSRC)) {
+   rth = vrf_dev_get_rth(dev_out);
+   goto out;
+   }
}
 
if (!fl4->daddr) {
-- 
2.3.2 (Apple Git-55)



[PATCH] xfrm: Add oif to dst lookups

2015-08-10 Thread David Ahern
Rules can be installed that direct route lookups to specific tables based
on oif. Plumb the oif through the xfrm lookups so it gets set in the flow
struct and passed to the resolver routines.

Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 include/net/xfrm.h  |  7 +--
 net/ipv4/xfrm4_policy.c | 11 ++-
 net/ipv6/xfrm6_policy.c |  7 ---
 net/xfrm/xfrm_policy.c  | 24 ++--
 4 files changed, 29 insertions(+), 20 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index f0ee97eec24d..312e3fee9ccf 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -285,10 +285,13 @@ struct xfrm_policy_afinfo {
unsigned short  family;
struct dst_ops  *dst_ops;
void    (*garbage_collect)(struct net *net);
-   struct dst_entry    *(*dst_lookup)(struct net *net, int tos,
+   struct dst_entry    *(*dst_lookup)(struct net *net,
+  int tos, int oif,
   const xfrm_address_t *saddr,
   const xfrm_address_t *daddr);
-   int (*get_saddr)(struct net *net, xfrm_address_t *saddr, xfrm_address_t *daddr);
+   int (*get_saddr)(struct net *net, int oif,
+xfrm_address_t *saddr,
+xfrm_address_t *daddr);
void    (*decode_session)(struct sk_buff *skb,
  struct flowi *fl,
  int reverse);
diff --git a/net/ipv4/xfrm4_policy.c b/net/ipv4/xfrm4_policy.c
index bff69746e05f..55b3c0f4dde5 100644
--- a/net/ipv4/xfrm4_policy.c
+++ b/net/ipv4/xfrm4_policy.c
@@ -19,7 +19,7 @@
 static struct xfrm_policy_afinfo xfrm4_policy_afinfo;
 
 static struct dst_entry *__xfrm4_dst_lookup(struct net *net, struct flowi4 *fl4,
-   int tos,
+   int tos, int oif,
const xfrm_address_t *saddr,
const xfrm_address_t *daddr)
 {
@@ -28,6 +28,7 @@ static struct dst_entry *__xfrm4_dst_lookup(struct net *net, struct flowi4 *fl4,
memset(fl4, 0, sizeof(*fl4));
fl4->daddr = daddr->a4;
fl4->flowi4_tos = tos;
+   fl4->flowi4_oif = oif;
if (saddr)
fl4->saddr = saddr->a4;
 
@@ -38,22 +39,22 @@ static struct dst_entry *__xfrm4_dst_lookup(struct net *net, struct flowi4 *fl4,
return ERR_CAST(rt);
 }
 
-static struct dst_entry *xfrm4_dst_lookup(struct net *net, int tos,
+static struct dst_entry *xfrm4_dst_lookup(struct net *net, int tos, int oif,
  const xfrm_address_t *saddr,
  const xfrm_address_t *daddr)
 {
struct flowi4 fl4;
 
-   return __xfrm4_dst_lookup(net, &fl4, tos, saddr, daddr);
+   return __xfrm4_dst_lookup(net, &fl4, tos, oif, saddr, daddr);
 }
 
-static int xfrm4_get_saddr(struct net *net,
+static int xfrm4_get_saddr(struct net *net, int oif,
   xfrm_address_t *saddr, xfrm_address_t *daddr)
 {
struct dst_entry *dst;
struct flowi4 fl4;
 
-   dst = __xfrm4_dst_lookup(net, &fl4, 0, NULL, daddr);
+   dst = __xfrm4_dst_lookup(net, &fl4, 0, oif, NULL, daddr);
if (IS_ERR(dst))
return -EHOSTUNREACH;
 
diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c
index ed0583c1b9fc..a74013d3eceb 100644
--- a/net/ipv6/xfrm6_policy.c
+++ b/net/ipv6/xfrm6_policy.c
@@ -26,7 +26,7 @@
 
 static struct xfrm_policy_afinfo xfrm6_policy_afinfo;
 
-static struct dst_entry *xfrm6_dst_lookup(struct net *net, int tos,
+static struct dst_entry *xfrm6_dst_lookup(struct net *net, int tos, int oif,
  const xfrm_address_t *saddr,
  const xfrm_address_t *daddr)
 {
@@ -35,6 +35,7 @@ static struct dst_entry *xfrm6_dst_lookup(struct net *net, int tos,
int err;
 
memset(&fl6, 0, sizeof(fl6));
+   fl6.flowi6_oif = oif;
memcpy(&fl6.daddr, daddr, sizeof(fl6.daddr));
if (saddr)
memcpy(&fl6.saddr, saddr, sizeof(fl6.saddr));
@@ -50,13 +51,13 @@ static struct dst_entry *xfrm6_dst_lookup(struct net *net, int tos,
return dst;
 }
 
-static int xfrm6_get_saddr(struct net *net,
+static int xfrm6_get_saddr(struct net *net, int oif,
   xfrm_address_t *saddr, xfrm_address_t *daddr)
 {
struct dst_entry *dst;
struct net_device *dev;
 
-   dst = xfrm6_dst_lookup(net, 0, NULL, daddr);
+   dst = xfrm6_dst_lookup(net, 0, oif, NULL, daddr);
if (IS_ERR(dst))
return -EHOSTUNREACH;
 
diff --git a/net/xfrm/xfrm_policy.c b/net

Re: [BUG net-next] infamous dev refcnt leak... again.

2015-08-14 Thread David Ahern

On 8/14/15 5:14 PM, Eric Dumazet wrote:

On Fri, 2015-08-14 at 14:14 -0700, Eric Dumazet wrote:

While rebooting host running latest net-next

  unregister_netdevice: waiting for eth0 to become free. Usage count = 4

Oh well...



It looks like David Ahern recent changes uncover a bug ?

Not clear which commit is at fault.

Maybe 3bfd847203c6d89532f836ad3f5b4ff4ced26dd9 ?

Somehow a down device can be found.


Can you elaborate on what you are doing to see the refcnt leak? I have 
not seen that at all. I have to leave for soccer carpool in 45 minutes 
or so, but can take a look this weekend.


David



diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index b7f1d20..675a3b6 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -725,10 +725,14 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
nh->nh_dev = dev = FIB_RES_DEV(res);
if (!dev)
goto out;
-   dev_hold(dev);
if (!netif_carrier_ok(dev))
nh->nh_flags |= RTNH_F_LINKDOWN;
-   err = (dev->flags & IFF_UP) ? 0 : -ENETDOWN;
+   if (dev->flags & IFF_UP) {
+   err = 0;
+   dev_hold(dev);
+   } else {
+   err = -ENETDOWN;
+   }
} else {
struct in_device *in_dev;







Re: [PATCH net-next 04/11] udp: Handle VRF device in sendmsg

2015-08-14 Thread David Ahern

On 8/14/15 9:16 PM, Tom Herbert wrote:

At least collect this code into one (static inline) function to better
minimize the code churn in udp. If this is general functionality that
can be used by other drivers then abstract it out as such. Also, if
the VRF driver is not configured it seems like this code should be
compiled out. As it stands now if (netif_index_is_vrf(net, ipc.oif))
{ adds a conditional to every call of udp_sendmsg whether or not we
are using VRF :-(.

Sure. I wanted to make sure all of the VRF related changes compiled out 
when the VRF driver is not enabled. This one slipped by me. I'll send a 
patch next week along with a couple of others per Eric D's comments.


David


Re: [PATCH net-next] ipv4: fix refcount leak in fib_check_nh()

2015-08-15 Thread David Ahern

On 8/15/15 11:54 AM, Eric Dumazet wrote:

From: Eric Dumazet eduma...@google.com

fib_lookup() forces FIB_LOOKUP_NOREF flag, while fib_table_lookup()
does not.

This patch solves the typical message at reboot time or device
dismantle :

unregister_netdevice: waiting for eth0 to become free. Usage count = 4

Fixes: 3bfd847203c6 ("net: Use passed in table for nexthop lookups")
Signed-off-by: Eric Dumazet eduma...@google.com
Cc: David Ahern d...@cumulusnetworks.com


Still puzzled why I was not seeing the refcnt problem at reboot though I 
did see the extra dev_hold when I instrumented the hold and put. 
Anyways, thanks for resolving, Eric.


Acked-by: David Ahern d...@cumulusnetworks.com




increase in time to delete an interface with 4.x kernels

2015-07-27 Thread David Ahern

Hi Alex:

I believe you did the recent overhaul to the fib implementation. I am 
seeing dramatically higher times to delete an interface with an ipv4 
address in 4.2-rc3. perf-top points to update_suffix:


   PerfTop:   15834 irqs/sec  kernel:97.3%  exact:  0.0% [4000Hz 
cpu-clock],  (all, 4 CPUs)

---

74.69%  [kernel]   [k] update_suffix
 2.38%  [kernel]   [k] fib_table_flush
 2.20%  [kernel]   [k] fib6_walk_continue
 2.03%  [kernel]   [k] fib6_ifdown
 1.31%  [kernel]   [k] fib6_age


I have a simple script to create and assign an ipv4 address to 10k dummy 
interfaces:


l=0
for (( j = 1; j <= 40; j += 1 ))
do
    for (( k = 1; k <= 250; k += 1 ))
    do
        l=$((l + 1))
        ip link add dev dummy${l} type dummy
        ip addr add 72.$j.$k.1/24 dev dummy${l}
        ifconfig dummy${l} up
    done
done


and a counter script to delete them all:

k=$(ip link show | grep dummy | wc -l)
for (( j = 1; j <= k; j += 1 ))
do
    ip link del dev dummy${j}
done


Looking at v3.19:

# time ./tadd-dummy.sh

real    3m8.896s
user    0m7.104s
sys     0m22.020s


# time ./tdel-dummy.sh

real    7m18.207s
user    0m3.824s
sys     3m15.672s


And the time to delete 1 interface after all 10k have been created:
# time ip link del dev dummy

real    0m0.064s
user    0m0.000s
sys     0m0.020s


Contrast those times with 4.2.0-rc3+ running the exact same scripts

# time ./tadd-dummy.sh

real    2m51.044s
user    0m7.220s
sys     0m29.520s

#  time ip link del dev dummy

real    0m0.441s
user    0m0.000s
sys     0m0.416s

so here the time to delete 1 interface has gone up by more than 10x.


# time ./tdel-dummy.sh
^C

real    14m10.000s
user    0m0.528s
sys     13m14.728s

I killed the delete; after 14 minutes only ~2k+ interfaces had been deleted:

# ip link show | grep dummy | wc -l
7822

In 4.2.0-rc3 it seems to take about 60 seconds to delete 150 interfaces,
which is in line with the 1 interface time of 0.4 seconds.
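
The arithmetic behind that comparison can be written out as a small sketch. The helper names below are hypothetical; the inputs are the timings quoted in this message.

```c
#include <assert.h>

/* Seconds per interface, given a total time and a number of deletions. */
static double per_interface_seconds(double total_seconds, int interfaces)
{
	assert(interfaces > 0);
	return total_seconds / interfaces;
}

/* Projected seconds to delete `count` interfaces at a fixed per-interface cost. */
static double projected_seconds(double per_interface, int count)
{
	assert(count >= 0);
	return per_interface * count;
}
```

For example, `per_interface_seconds(60.0, 150)` gives 0.4 seconds, and `projected_seconds(0.4, 10000)` gives roughly 4000 seconds (about 67 minutes) for all 10k interfaces if the cost stayed constant. The v3.19 run took 7m18s in total, which works out to about 0.044 seconds per interface.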


David


[PATCH net-next 06/16] net: Tx via VRF device

2015-07-27 Thread David Ahern
If out device is enslaved to a VRF device we want packets to go through the
VRF master device first. This allows for example iptables rules and tc rules
to be configured on the VRF as a whole as well as the option for rules on
specific netdevices. This is accomplished by updating the dev in the dst to
point to the VRF device if it is enslaved.
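
The reference handling involved in swapping the dst device can be sketched in userspace as follows. This is a model under stated assumptions: `struct mini_dev`, `struct mini_route` and the `mini_*` helpers are hypothetical, and the hold on the master stands in for the reference that a by-index device lookup would return.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical refcounted device and route; not the kernel structs. */
struct mini_dev {
	int ifindex;
	int refcnt;
};

struct mini_route {
	struct mini_dev *dev;	/* device the route currently points at */
};

static void mini_dev_hold(struct mini_dev *d)
{
	d->refcnt++;
}

static void mini_dev_put(struct mini_dev *d)
{
	assert(d->refcnt > 0);
	d->refcnt--;
}

/* Point the route at the master: take a ref on it, drop the old one. */
static void mini_use_master(struct mini_route *rt, struct mini_dev *master)
{
	assert(rt != NULL && rt->dev != NULL);
	if (!master)
		return;			/* not enslaved: leave the route alone */
	mini_dev_hold(master);		/* stands in for the lookup's reference */
	mini_dev_put(rt->dev);
	rt->dev = master;
}
```

The point of the sketch is that the swap keeps the counts balanced: the route gives up exactly one reference on the enslaved device and owns exactly one on the master afterwards.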

Signed-off-by: Shrijeet Mukherjee s...@cumulusnetworks.com
Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 net/ipv4/route.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 8119896e1159..050a3c1d89ba 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1903,6 +1903,23 @@ int ip_route_input_noref(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 }
 EXPORT_SYMBOL(ip_route_input_noref);
 
+/* if out device is enslaved to a VRF device update dst to
+ * send through it
+ */
+static void rt_use_vrf_dev(struct rtable *rth, struct net_device *dev_out)
+{
+#if IS_ENABLED(CONFIG_NET_VRF)
+   int ifindex = vrf_master_dev_ifindex(dev_out);
+   struct net_device *mdev;
+
+   mdev = dev_get_by_index(dev_net(dev_out), ifindex);
+   if (mdev) {
+   dev_put(rth->dst.dev);
+   rth->dst.dev = mdev;
+   }
+#endif
+}
+
 /* called with rcu_read_lock() */
 static struct rtable *__mkroute_output(const struct fib_result *res,
   const struct flowi4 *fl4, int orig_oif,
@@ -2008,6 +2025,7 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
}
 
rt_set_nexthop(rth, fl4->daddr, res, fnhe, fi, type, 0);
+   rt_use_vrf_dev(rth, dev_out);
 
return rth;
 }
-- 
2.3.2 (Apple Git-55)



[PATCH net-next 14/16] net: Add sk_bind_dev_if to task_struct

2015-07-27 Thread David Ahern
Allow tasks to have a default device index for binding sockets. If set
the value is passed to all AF_INET/AF_INET6 sockets when they are created.

The task setting is passed parent to child on fork, but can be set or
changed after task creation using prctl (if task has CAP_NET_ADMIN
permissions). The setting for a task can be retrieved using prctl().
This option allows an administrator to restrict a task to only send/receive
packets through the specified device. In the case of VRF devices this
option restricts tasks to a specific VRF.
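
The inheritance described above can be modelled in userspace as below. This is a sketch only: `struct mini_task`, `struct mini_sock`, `mini_fork()` and `mini_socket()` are hypothetical, and the assumption that a new socket simply copies the task value into its bound-device field is inferred from the description rather than shown in the diff.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature task and socket; field names follow the patch. */
struct mini_task {
	int sk_bind_dev_if;	/* 0 means no default device */
};

struct mini_sock {
	int sk_bound_dev_if;	/* device index the socket is bound to */
};

/* Like the fork hunk: the child inherits the parent's setting. */
static struct mini_task mini_fork(const struct mini_task *parent)
{
	assert(parent != NULL);
	struct mini_task child = { .sk_bind_dev_if = parent->sk_bind_dev_if };
	return child;
}

/* Assumed socket-create behaviour: start bound to the task's default. */
static struct mini_sock mini_socket(const struct mini_task *task)
{
	assert(task != NULL);
	struct mini_sock sk = { .sk_bound_dev_if = task->sk_bind_dev_if };
	return sk;
}
```

In this model a wrapper process can set the index once and every descendant's sockets pick it up without any change to the program being run.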

Correlation of the device index to a specific VRF, i.e.,
   ifindex --> VRF device --> VRF id
is left to userspace.

Example using VRF devices:
1. vrf1 is created and assigned to table 5
2. eth2 is enslaved to vrf1
3. eth2 is given the address 1.1.1.1/24

$ ip route ls table 5
prohibit default
1.1.1.0/24 dev eth2  scope link
local 1.1.1.1 dev eth2  proto kernel  scope host  src 1.1.1.1

Without setting a VRF context, ping, tcp and udp attempts fail, e.g.:
$ ping 1.1.1.254
connect: Network is unreachable

After binding the task to the vrf device ping succeeds:
$ ./chvrf -v 1 ping -c1 1.1.1.254
PING 1.1.1.254 (1.1.1.254) 56(84) bytes of data.
64 bytes from 1.1.1.254: icmp_seq=1 ttl=64 time=2.32 ms

Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 include/linux/sched.h  |  3 +++
 include/uapi/linux/prctl.h |  4 
 kernel/fork.c  |  2 ++
 kernel/sys.c   | 35 +++
 net/ipv4/af_inet.c |  1 +
 net/ipv4/route.c   |  4 +++-
 net/ipv6/af_inet6.c|  1 +
 net/ipv6/route.c   |  2 +-
 8 files changed, 50 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 04b5ada460b4..29b336b8a466 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1528,6 +1528,9 @@ struct task_struct {
struct files_struct *files;
 /* namespaces */
struct nsproxy *nsproxy;
+/* network */
+   /* if set INET/INET6 sockets are bound to given dev index on create */
+   int sk_bind_dev_if;
 /* signal handlers */
struct signal_struct *signal;
struct sighand_struct *sighand;
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 31891d9535e2..1ef45195d146 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -190,4 +190,8 @@ struct prctl_mm_map {
 # define PR_FP_MODE_FR     (1 << 0)    /* 64b FP registers */
 # define PR_FP_MODE_FRE    (1 << 1)    /* 32b compatibility */
 
+/* get/set network interface sockets are bound to by default */
+#define PR_SET_SK_BIND_DEV_IF   47
+#define PR_GET_SK_BIND_DEV_IF   48
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index dbd9b8d7b7cc..8b396e77d2bf 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -380,6 +380,8 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
tsk->splice_pipe = NULL;
tsk->task_frag.page = NULL;
 
+   tsk->sk_bind_dev_if = orig->sk_bind_dev_if;
+
account_kernel_stack(ti, 1);
 
return tsk;
diff --git a/kernel/sys.c b/kernel/sys.c
index 259fda25eb6b..59119ac0a0bd 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -52,6 +52,7 @@
 #include <linux/rcupdate.h>
 #include <linux/uidgid.h>
 #include <linux/cred.h>
+#include <linux/netdevice.h>
 
 #include <linux/kmsg_dump.h>
 /* Move somewhere else to avoid recompiling? */
@@ -2267,6 +2268,40 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
case PR_GET_FP_MODE:
error = GET_FP_MODE(me);
break;
+#ifdef CONFIG_NET
+   case PR_SET_SK_BIND_DEV_IF:
+   {
+   struct net_device *dev;
+   int idx = (int) arg2;
+
+   if (!capable(CAP_NET_ADMIN))
+   return -EPERM;
+
+   if (idx) {
+   dev = dev_get_by_index(me->nsproxy->net_ns, idx);
+   if (!dev)
+   return -EINVAL;
+   dev_put(dev);
+   }
+   me->sk_bind_dev_if = idx;
+   break;
+   }
+   case PR_GET_SK_BIND_DEV_IF:
+   {
+   struct task_struct *tsk;
+   int sk_bind_dev_if = -EINVAL;
+
+   rcu_read_lock();
+   tsk = find_task_by_vpid(arg2);
+   if (tsk)
+   sk_bind_dev_if = tsk->sk_bind_dev_if;
+   rcu_read_unlock();
+   if (tsk != me && !capable(CAP_NET_ADMIN))
+   return -EPERM;
+   error = sk_bind_dev_if;
+   break;
+   }
+#endif
default:
error = -EINVAL;
break;
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 09c7c1ee307e..0651efa18d39 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -352,6 +352,7 @@ static int inet_create(struct net *net, struct socket 
*sock, int protocol

[PATCH net-next 13/16] net: Introduce VRF device driver - v2

2015-07-27 Thread David Ahern
This driver borrows heavily from IPvlan and teaming drivers.

Routing domains (VRF-lite) are created by instantiating a VRF master
device with an associated table and enslaving all routed interfaces that
participate in the domain. As part of the enslavement, all connected
routes for the enslaved devices are moved to the table associated with
the VRF device. Outgoing sockets must bind to the VRF device to function.

Standard FIB rules bind the VRF device to tables and regular fib rule
processing is followed. Routed traffic through the box is forwarded by
using the VRF device as the IIF and following the IIF rule to a table
that is mated with the VRF.
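
The rule step can be pictured as a lookup from device index to table id. The sketch below is a userspace illustration, not how the kernel stores FIB rules: `struct mini_rule` and `mini_table_for_dev()` are hypothetical, and the fallback table is whatever the caller passes in.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flattened form of "ip rule add iif|oif DEV table N". */
struct mini_rule {
	int ifindex;		/* device the rule matches on */
	unsigned int table;	/* table the lookup is sent to */
};

/* Return the table for the first rule matching ifindex, else fallback. */
static unsigned int mini_table_for_dev(const struct mini_rule *rules, size_t n,
				       int ifindex, unsigned int fallback)
{
	assert(rules != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (rules[i].ifindex == ifindex)
			return rules[i].table;
	}
	return fallback;
}
```

With a single rule mapping the VRF device's index to table 5, a lookup that arrives with that index is sent to table 5, while any other index falls through to the caller's default.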

Example:

   Create vrf 1:
 ip link add vrf1 type vrf table 5
 ip rule add iif vrf1 table 5
 ip rule add oif vrf1 table 5
 ip route add table 5 prohibit default
 ip link set vrf1 up

   Add interface to vrf 1:
 ip link set eth1 master vrf1

Signed-off-by: Shrijeet Mukherjee s...@cumulusnetworks.com
Signed-off-by: David Ahern d...@cumulusnetworks.com

v2:
- addressed comments from first RFC
- significant changes to improve simplicity of implementation
---
 drivers/net/Kconfig  |   7 +
 drivers/net/Makefile |   1 +
 drivers/net/vrf.c| 596 +++
 3 files changed, 604 insertions(+)
 create mode 100644 drivers/net/vrf.c

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index c18f9e62a9fa..e58468b02987 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -297,6 +297,13 @@ config NLMON
  diagnostics, etc. This is mostly intended for developers or support
  to debug netlink issues. If unsure, say N.
 
+config NET_VRF
+   tristate "Virtual Routing and Forwarding (Lite)"
+   depends on IP_MULTIPLE_TABLES && IPV6_MULTIPLE_TABLES
+   ---help---
+ This option enables the support for mapping interfaces into VRF's. The
+ support enables VRF devices.
+
 endif # NET_CORE
 
 config SUNGEM_PHY
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index c12cb22478a7..ca16dd689b36 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -25,6 +25,7 @@ obj-$(CONFIG_VIRTIO_NET) += virtio_net.o
 obj-$(CONFIG_VXLAN) += vxlan.o
 obj-$(CONFIG_GENEVE) += geneve.o
 obj-$(CONFIG_NLMON) += nlmon.o
+obj-$(CONFIG_NET_VRF) += vrf.o
 
 #
 # Networking Drivers
diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
new file mode 100644
index ..8669b0f9d749
--- /dev/null
+++ b/drivers/net/vrf.c
@@ -0,0 +1,596 @@
+/*
+ * vrf.c: device driver to encapsulate a VRF space
+ *
+ * Copyright (c) 2015 Cumulus Networks
+ *
+ * Based on dummy, team and ipvlan drivers
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/ip.h>
+#include <linux/init.h>
+#include <linux/moduleparam.h>
+#include <linux/rtnetlink.h>
+#include <net/rtnetlink.h>
+#include <linux/u64_stats_sync.h>
+#include <linux/hashtable.h>
+
+#include <linux/inetdevice.h>
+#include <net/ip.h>
+#include <net/ip_fib.h>
+#include <net/ip6_route.h>
+#include <net/rtnetlink.h>
+#include <net/route.h>
+#include <net/addrconf.h>
+#include <net/vrf.h>
+
+#define DRV_NAME   "vrf"
+#define DRV_VERSION    "1.0"
+
+#define vrf_is_slave(dev)   ((dev)->flags & IFF_SLAVE)
+#define vrf_is_master(dev)  ((dev)->flags & IFF_MASTER)
+
+#define vrf_master_get_rcu(dev) \
+   ((struct net_device *)rcu_dereference(dev->rx_handler_data))
+
+struct pcpu_dstats {
+   u64 tx_pkts;
+   u64 tx_bytes;
+   u64 tx_drps;
+   u64 rx_pkts;
+   u64 rx_bytes;
+   struct u64_stats_sync   syncp;
+};
+
+struct slave {
+   struct list_head    list;
+   struct net_device   *dev;
+};
+
+struct slave_queue {
+   spinlock_t  lock; /* lock for slave insert/delete */
+   struct list_head    all_slaves;
+   int num_slaves;
+};
+
+struct net_vrf {
+   struct slave_queue  queue;
+   struct fib_table*tb;
+   u32 tb_id;
+};
+
+static bool is_ip_rx_frame(struct sk_buff *skb)
+{
+   switch (skb->protocol) {
+   case htons(ETH_P_IP):
+   case htons(ETH_P_IPV6):
+   return true;
+   }
+   return false;
+}
+
+/* note: already called with rcu_read_lock */
+static rx_handler_result_t vrf_handle_frame(struct sk_buff **pskb)
+{
+   struct sk_buff *skb = *pskb;
+
+   if (is_ip_rx_frame(skb)) {
+   struct net_device *dev = vrf_master_get_rcu(skb->dev);
+   struct pcpu_dstats *dstats = this_cpu_ptr(dev->dstats);
+
+   u64_stats_update_begin(&dstats->syncp

[PATCH] iproute2: Add support for VRF device

2015-07-27 Thread David Ahern
Allow user to create a vrf device and specify its table binding.
Based on the iplink_vlan implementation.

Signed-off-by: Shrijeet Mukherjee s...@cumulusnetworks.com
Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 include/linux/if_link.h |  8 +
 ip/Makefile |  2 +-
 ip/iplink.c |  2 +-
 ip/iplink_vrf.c | 87 +
 4 files changed, 97 insertions(+), 2 deletions(-)
 create mode 100644 ip/iplink_vrf.c

diff --git a/include/linux/if_link.h b/include/linux/if_link.h
index 8df6a8466839..28872fbf6814 100644
--- a/include/linux/if_link.h
+++ b/include/linux/if_link.h
@@ -337,6 +337,14 @@ enum macvlan_macaddr_mode {
 
 #define MACVLAN_FLAG_NOPROMISC 1
 
+/* VRF section */
+enum {
+   IFLA_VRF_UNSPEC,
+   IFLA_VRF_TABLE,
+   __IFLA_VRF_MAX
+};
+
+#define IFLA_VRF_MAX (__IFLA_VRF_MAX - 1)
 /* IPVLAN section */
 enum {
IFLA_IPVLAN_UNSPEC,
diff --git a/ip/Makefile b/ip/Makefile
index 77653ecc5785..d8b38ac2e44b 100644
--- a/ip/Makefile
+++ b/ip/Makefile
@@ -7,7 +7,7 @@ IPOBJ=ip.o ipaddress.o ipaddrlabel.o iproute.o iprule.o ipnetns.o \
 iplink_vxlan.o tcp_metrics.o iplink_ipoib.o ipnetconf.o link_ip6tnl.o \
 link_iptnl.o link_gre6.o iplink_bond.o iplink_bond_slave.o iplink_hsr.o \
 iplink_bridge.o iplink_bridge_slave.o ipfou.o iplink_ipvlan.o \
-iplink_geneve.o
+iplink_geneve.o iplink_vrf.o
 
 RTMONOBJ=rtmon.o
 
diff --git a/ip/iplink.c b/ip/iplink.c
index e296e6f611b8..892e8bc8808b 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -93,7 +93,7 @@ void iplink_usage(void)
        fprintf(stderr, "TYPE := { vlan | veth | vcan | dummy | ifb | macvlan | macvtap |\n");
        fprintf(stderr, "          bridge | bond | ipoib | ip6tnl | ipip | sit | vxlan |\n");
        fprintf(stderr, "          gre | gretap | ip6gre | ip6gretap | vti | nlmon |\n");
-       fprintf(stderr, "          bond_slave | ipvlan | geneve }\n");
+       fprintf(stderr, "          bond_slave | ipvlan | geneve | vrf }\n");
}
exit(-1);
 }
diff --git a/ip/iplink_vrf.c b/ip/iplink_vrf.c
new file mode 100644
index ..bfcb3cdeaf35
--- /dev/null
+++ b/ip/iplink_vrf.c
@@ -0,0 +1,87 @@
+/* iplink_vrf.c   VRF device support
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License
+ *  as published by the Free Software Foundation; either version
+ *  2 of the License, or (at your option) any later version.
+ *
+ * Authors: Shrijeet Mukherjee <s...@cumulusnetworks.com>
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/socket.h>
+#include <linux/if_link.h>
+
+#include "rt_names.h"
+#include "utils.h"
+#include "ip_common.h"
+
+static void vrf_explain(FILE *f)
+{
+   fprintf(f, "Usage: ... vrf table TABLEID \n");
+}
+
+static void explain(void)
+{
+   vrf_explain(stderr);
+}
+
+static int table_arg(void)
+{
+   fprintf(stderr,"Error: argument of \"table\" must be 0-32767 and currently unused\n");
+   return -1;
+}
+
+static int vrf_parse_opt(struct link_util *lu, int argc, char **argv,
+   struct nlmsghdr *n)
+{
+   while (argc > 0) {
+       if (matches(*argv, "table") == 0) {
+           __u32 table = 0;
+           NEXT_ARG();
+
+           table = atoi(*argv);
+           if (table < 0 || table > 32767)
+               return table_arg();
+           /* XXX need a table in-use check here */
+           fprintf(stderr, "adding table %d\n", table);
+           addattr32(n, 1024, IFLA_VRF_TABLE, table);
+       } else if (matches(*argv, "help") == 0) {
+           explain();
+           return -1;
+       } else {
+           fprintf(stderr, "vrf: unknown option \"%s\"?\n",
+               *argv);
+           explain();
+           return -1;
+       }
+       argc--, argv++;
+   }
+
+   return 0;
+}
+
+static void vrf_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
+{
+   if (!tb)
+       return;
+
+   if (tb[IFLA_VRF_TABLE])
+       fprintf(f, "table %u ", rta_getattr_u32(tb[IFLA_VRF_TABLE]));
+}
+
+static void vrf_print_help(struct link_util *lu, int argc, char **argv,
+ FILE *f)
+{
+   vrf_explain(f);
+}
+
+struct link_util vrf_link_util = {
+   .id = "vrf",
+   .maxattr= IFLA_VRF_MAX,
+   .parse_opt  = vrf_parse_opt,
+   .print_opt  = vrf_print_opt,
+   .print_help = vrf_print_help,
+};
-- 
2.3.2 (Apple Git-55)
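
A note on the "table" argument parsing in vrf_parse_opt() above: `table` is declared as `__u32`, so the `table < 0` test can never be true, and atoi() gives no indication of malformed input. The following is only a sketch of a stricter parse, not part of the patch and not an iproute2 API. The helper name `parse_table_id` and the macro `VRF_TABLE_MAX` are hypothetical; the 32767 bound is taken from the patch's own error message.

```c
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Upper bound copied from the error text in the patch (assumption: the
 * same limit is intended). */
#define VRF_TABLE_MAX 32767UL

/* Hypothetical helper, not part of iproute2.
 * Parses a decimal table ID. Returns 0 and stores the value on success,
 * -1 on empty, signed, non-numeric, trailing-garbage or out-of-range input. */
static int parse_table_id(const char *arg, uint32_t *table)
{
	char *end;
	unsigned long val;

	/* Require a leading digit so that "-1", "+5" and " 7" are rejected
	 * instead of being silently accepted by strtoul(). */
	if (!arg || !isdigit((unsigned char)arg[0]))
		return -1;

	errno = 0;
	val = strtoul(arg, &end, 10);
	if (errno != 0 || *end != '\0' || val > VRF_TABLE_MAX)
		return -1;

	*table = (uint32_t)val;
	return 0;
}
```

In the patch this would sit where `table = atoi(*argv);` is, with `return table_arg();` taken when the helper returns -1.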


[PATCH net-next 11/16] net: Use VRF device index for socket lookups

2015-07-27 Thread David Ahern
The intent of the VRF device is to leverage the existing SO_BINDTODEVICE
as a means of creating L3 domains. Since sockets are expected to be bound
to the VRF device the index of the master device needs to be used for
socket lookups.

Signed-off-by: Shrijeet Mukherjee <s...@cumulusnetworks.com>
Signed-off-by: David Ahern <d...@cumulusnetworks.com>
---
 net/ipv4/syncookies.c |  5 -
 net/ipv4/tcp_input.c  |  6 +-
 net/ipv4/tcp_ipv4.c   | 11 +--
 3 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index d70b1f603692..dab52fba5872 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -18,6 +18,7 @@
 #include <linux/export.h>
 #include <net/tcp.h>
 #include <net/route.h>
+#include <net/vrf.h>
 
 extern int sysctl_tcp_syncookies;
 
@@ -348,7 +349,9 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
    treq->snt_synack    = tcp_opt.saw_tstamp ? tcp_opt.rcv_tsecr : 0;
    treq->tfo_listener  = false;
 
-   ireq->ir_iif = sk->sk_bound_dev_if;
+   ireq->ir_iif = vrf_get_master_dev_ifindex(sock_net(sk), skb->skb_iif);
+   if (!ireq->ir_iif)
+       ireq->ir_iif = sk->sk_bound_dev_if;
 
/* We throwed the options of the initial SYN away, so we hope
 * the ACK carries the same options again (see RFC1122 4.2.3.8)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 4e4d6bcd0ca9..df82fb05c459 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -72,6 +72,7 @@
 #include <net/dst.h>
 #include <net/tcp.h>
 #include <net/inet_common.h>
+#include <net/vrf.h>
 #include <linux/ipsec.h>
 #include <asm/unaligned.h>
 #include <linux/errqueue.h>
@@ -6141,7 +6142,10 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
    tcp_openreq_init(req, &tmp_opt, skb, sk);
 
    /* Note: tcp_v6_init_req() might override ir_iif for link locals */
-   inet_rsk(req)->ir_iif = sk->sk_bound_dev_if;
+   inet_rsk(req)->ir_iif = vrf_get_master_dev_ifindex(sock_net(sk),
+                                                      skb->skb_iif);
+   if (!inet_rsk(req)->ir_iif)
+       inet_rsk(req)->ir_iif = sk->sk_bound_dev_if;
 
    af_ops->init_req(req, sk, skb);
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 486ba96ae91a..d0c40f4d9058 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -75,6 +75,7 @@
 #include <net/secure_seq.h>
 #include <net/tcp_memcontrol.h>
 #include <net/busy_poll.h>
+#include <net/vrf.h>
 
 #include <linux/inet.h>
 #include <linux/ipv6.h>
@@ -682,6 +683,8 @@ static void tcp_v4_send_reset(struct sock *sk, struct sk_buff *skb)
     */
    if (sk)
        arg.bound_dev_if = sk->sk_bound_dev_if;
+   if (!arg.bound_dev_if && skb->dev)
+       arg.bound_dev_if = vrf_master_dev_ifindex(skb->dev);
 
    arg.tos = ip_hdr(skb)->tos;
    ip_send_unicast_reply(*this_cpu_ptr(net->ipv4.tcp_sk),
@@ -766,8 +769,10 @@ static void tcp_v4_send_ack(struct sk_buff *skb, u32 seq, u32 ack,
                  ip_hdr(skb)->saddr, /* XXX */
                  arg.iov[0].iov_len, IPPROTO_TCP, 0);
    arg.csumoffset = offsetof(struct tcphdr, check) / 2;
-   if (oif)
-       arg.bound_dev_if = oif;
+   arg.bound_dev_if = oif ? : vrf_master_dev_ifindex(skb_dst(skb)->dev);
+   if (!arg.bound_dev_if)
+       arg.bound_dev_if = vrf_master_dev_ifindex(skb->dev);
+
    arg.tos = tos;
    ip_send_unicast_reply(*this_cpu_ptr(net->ipv4.tcp_sk),
                  skb, &TCP_SKB_CB(skb)->header.h4.opt,
@@ -1269,6 +1274,8 @@ struct sock *tcp_v4_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
    ireq                  = inet_rsk(req);
    sk_daddr_set(newsk, ireq->ir_rmt_addr);
    sk_rcv_saddr_set(newsk, ireq->ir_loc_addr);
+   if (netif_index_is_vrf(sock_net(newsk), ireq->ir_iif))
+       newsk->sk_bound_dev_if = ireq->ir_iif;
    newinet->inet_saddr   = ireq->ir_loc_addr;
    inet_opt              = ireq->opt;
    rcu_assign_pointer(newinet->inet_opt, inet_opt);
-- 
2.3.2 (Apple Git-55)
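
The hunks above repeat one pattern: prefer the ifindex of the VRF master of the ingress device, and fall back to the socket's bound device when there is no master. The sketch below is a userspace toy model of that fallback only. All `toy_*` names are hypothetical stand-ins; in particular `toy_master_ifindex` merely imitates what the series' `vrf_get_master_dev_ifindex()` is described as returning, and is not kernel code.

```c
#include <stddef.h>

/* Toy device record: its own ifindex and the ifindex of its VRF master
 * (0 means the device is not enslaved to a VRF). */
struct toy_dev {
	int ifindex;
	int master_ifindex;
};

/* Stand-in for the master lookup: returns the master's ifindex, or 0 when
 * the device is unknown or has no VRF master. */
static int toy_master_ifindex(const struct toy_dev *devs, size_t n, int ifindex)
{
	for (size_t i = 0; i < n; i++)
		if (devs[i].ifindex == ifindex)
			return devs[i].master_ifindex;
	return 0;
}

/* Mirrors the ir_iif assignment in the patch: VRF master first, then the
 * socket's bound device (which may itself be 0, meaning unbound). */
static int toy_request_iif(const struct toy_dev *devs, size_t n,
			   int skb_iif, int sk_bound_dev_if)
{
	int iif = toy_master_ifindex(devs, n, skb_iif);

	if (!iif)
		iif = sk_bound_dev_if;
	return iif;
}
```

With a device table of `{2, 10}` and `{3, 0}`, a packet arriving on ifindex 2 resolves to 10 regardless of the socket binding, while a packet on ifindex 3 keeps whatever the socket was bound to.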



[PATCH net-next 07/16] net: Add inet_addr lookup by table

2015-07-27 Thread David Ahern
Currently inet_addr_type and inet_dev_addr_type expect local addresses
to be in the local table. With the VRF device local routes for devices
associated with a VRF will be in the table associated with the VRF.
Provide an alternate inet_addr lookup to use a specific table rather
than defaulting to the local table.

Signed-off-by: Shrijeet Mukherjee <s...@cumulusnetworks.com>
Signed-off-by: David Ahern <d...@cumulusnetworks.com>
---
 include/net/route.h |  1 +
 net/ipv4/fib_frontend.c | 22 +++---
 2 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/include/net/route.h b/include/net/route.h
index 54f97eea0fb2..3b51c339c269 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -192,6 +192,7 @@ void ipv4_sk_redirect(struct sk_buff *skb, struct sock *sk);
 void ip_rt_send_redirect(struct sk_buff *skb);
 
 unsigned int inet_addr_type(struct net *net, __be32 addr);
+unsigned int inet_addr_type_table(struct net *net, __be32 addr, int tb_id);
 unsigned int inet_dev_addr_type(struct net *net, const struct net_device *dev,
__be32 addr);
 void ip_rt_multicast_event(struct in_device *);
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 6e68a003d0fd..cc413b0170ed 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -214,12 +214,12 @@ void fib_flush_external(struct net *net)
  */
 static inline unsigned int __inet_dev_addr_type(struct net *net,
const struct net_device *dev,
-   __be32 addr)
+   __be32 addr, int tb_id)
 {
struct flowi4   fl4 = { .daddr = addr };
struct fib_result   res;
unsigned int ret = RTN_BROADCAST;
-   struct fib_table *local_table;
+   struct fib_table *table;
 
if (ipv4_is_zeronet(addr) || ipv4_is_lbcast(addr))
return RTN_BROADCAST;
@@ -228,10 +228,10 @@ static inline unsigned int __inet_dev_addr_type(struct net *net,
 
    rcu_read_lock();
 
-   local_table = fib_get_table(net, RT_TABLE_LOCAL);
-   if (local_table) {
+   table = fib_get_table(net, tb_id);
+   if (table) {
        ret = RTN_UNICAST;
-       if (!fib_table_lookup(local_table, &fl4, &res, FIB_LOOKUP_NOREF)) {
+       if (!fib_table_lookup(table, &fl4, &res, FIB_LOOKUP_NOREF)) {
            if (!dev || dev == res.fi->fib_dev)
                ret = res.type;
        }
@@ -241,16 +241,24 @@ static inline unsigned int __inet_dev_addr_type(struct net *net,
return ret;
 }
 
+unsigned int inet_addr_type_table(struct net *net, __be32 addr, int tb_id)
+{
+   return __inet_dev_addr_type(net, NULL, addr, tb_id);
+}
+EXPORT_SYMBOL(inet_addr_type_table);
+
 unsigned int inet_addr_type(struct net *net, __be32 addr)
 {
-   return __inet_dev_addr_type(net, NULL, addr);
+   return __inet_dev_addr_type(net, NULL, addr, RT_TABLE_LOCAL);
 }
 EXPORT_SYMBOL(inet_addr_type);
 
 unsigned int inet_dev_addr_type(struct net *net, const struct net_device *dev,
__be32 addr)
 {
-   return __inet_dev_addr_type(net, dev, addr);
+   int rt_table = vrf_dev_table(dev) ? : RT_TABLE_LOCAL;
+
+   return __inet_dev_addr_type(net, dev, addr, rt_table);
 }
 EXPORT_SYMBOL(inet_dev_addr_type);
 
-- 
2.3.2 (Apple Git-55)
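
To illustrate why the table argument matters, here is a small userspace model of the lookup above. It is a sketch under stated assumptions, not kernel code: the `toy_*` names and the flat route array are invented for illustration, and only the control flow (default result, "table exists" result, per-table match) follows __inet_dev_addr_type() as shown in the diff. The value 255 matches RT_TABLE_LOCAL in linux/rtnetlink.h.

```c
#include <stddef.h>
#include <stdint.h>

enum toy_rtn { TOY_RTN_UNICAST, TOY_RTN_LOCAL, TOY_RTN_BROADCAST };

#define TOY_TABLE_LOCAL 255	/* stands in for RT_TABLE_LOCAL */

/* One entry of a toy FIB: which table it lives in, the address, its type. */
struct toy_route {
	int table;
	uint32_t addr;
	enum toy_rtn type;
};

/* Same selection as "vrf_dev_table(dev) ? : RT_TABLE_LOCAL", written in
 * standard C; a vrf_table of 0 means the device is not in a VRF. */
static int toy_dev_table(int vrf_table)
{
	return vrf_table ? vrf_table : TOY_TABLE_LOCAL;
}

/* Look up addr in one table only. A missing table yields BROADCAST, an
 * existing table without a match yields UNICAST, as in the diff. */
static enum toy_rtn toy_addr_type_table(const struct toy_route *r, size_t n,
					uint32_t addr, int tb_id)
{
	enum toy_rtn ret = TOY_RTN_BROADCAST;

	for (size_t i = 0; i < n; i++) {
		if (r[i].table != tb_id)
			continue;
		ret = TOY_RTN_UNICAST;	/* the table exists */
		if (r[i].addr == addr)
			return r[i].type;
	}
	return ret;
}
```

An address whose local route sits in table 10 is reported as unicast when only the local table is consulted, and as local once table 10 is used, which is the case the commit message describes.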



[PATCH net-next 12/16] net: Add ipv4 route helper to set next hop

2015-07-27 Thread David Ahern
Signed-off-by: David Ahern <d...@cumulusnetworks.com>
---
 include/net/route.h |  3 +++
 net/ipv4/route.c| 10 ++
 2 files changed, 13 insertions(+)

diff --git a/include/net/route.h b/include/net/route.h
index b14cbec93fbd..900d50fbcfc7 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -107,6 +107,7 @@ struct rt_cache_stat {
 extern struct ip_rt_acct __percpu *ip_rt_acct;
 
 struct in_device;
+struct fib_result;
 
 int ip_rt_init(void);
 void rt_cache_flush(struct net *net);
@@ -114,6 +115,8 @@ void rt_flush_dev(struct net_device *dev);
 struct rtable *ip_route_new_rtable(struct net_device *dev,
   unsigned int flags, u16 type,
   bool nopolicy, bool noxfrm, bool do_cache);
+void ip_route_set_nexthop(struct rtable *rt, __be32 daddr,
+ const struct fib_result *res);
 struct rtable *__ip_route_output_key(struct net *, struct flowi4 *flp);
 struct rtable *ip_route_output_flow(struct net *, struct flowi4 *flp,
struct sock *sk);
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 050a3c1d89ba..47dae001a000 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1537,6 +1537,16 @@ static int ip_route_input_mc(struct sk_buff *skb, __be32 daddr, __be32 saddr,
return err;
 }
 
+void ip_route_set_nexthop(struct rtable *rt, __be32 daddr,
+ const struct fib_result *res)
+{
+   struct fib_nh_exception *fnhe;
+
+   fnhe = find_exception(&FIB_RES_NH(*res), daddr);
+
+   rt_set_nexthop(rt, daddr, res, fnhe, res->fi, res->type, 0);
+}
+EXPORT_SYMBOL(ip_route_set_nexthop);
 
 static void ip_handle_martian_source(struct net_device *dev,
 struct in_device *in_dev,
-- 
2.3.2 (Apple Git-55)
