Fwd: [PATCH RFC net-next] net: SAIL based FIB lookup for XDP

2018-11-19 Thread Md. Islam
Forgot to add everyone in the reply..

---------- Forwarded message ---------
From: Md. Islam 
Date: Mon, Nov 19, 2018 at 11:35 PM
Subject: Re: [PATCH RFC net-next] net: SAIL based FIB lookup for XDP
To: 


On Sun, Nov 18, 2018 at 12:42 PM David Ahern  wrote:
>
> On 11/11/18 7:25 PM, Md. Islam wrote:
> > This patch implements SAIL[1] based routing table lookup for XDP. I
> > however made some changes from the original proposal (details are
> > described in the patch). These changes decreased the memory consumption
> > from 21.94 MB to 4.97 MB for my example routing table with 400K
> > routes.
> >
> > This patch can perform FIB lookup in 50 CPU cycles for the example
> > routing table (with 400K routes) whereas LC-trie takes around 350 CPU
> > cycles.
> >
> > I tried to follow all the advice I got from my last patch. Looking
> > forward to your review.
> >
> > 1. Yang, Tong, et al. "Guarantee IP lookup performance with FIB
> > explosion." ACM SIGCOMM Computer Communication Review. Vol. 44. No. 4.
> > ACM, 2014.
>
> This work you are doing on different FIB algorithms is interesting and
> probably has its use cases where it is beneficial, but you are still not
> integrating it into the Linux stack correctly.
>
> For starters, it is wrong to have 2 separate FIB data structures for the
> same network namespace. If SAIL is good enough for XDP, it should be
> good enough for normal routing in the namespace. There is no need to put
> the same route data in 2 places. That means the FIB algorithm needs to
> be selectable - either trie or sail but not both.

Thanks for reviewing the patch!!

Yes, SAIL will only be useful when the routing table is very big (such
as in the backbone routers of ISPs, where the number of routes has
already exceeded 720K). If the routing table is small, SAIL does not
make much sense: LC-trie consumes very little memory compared to SAIL
and performs quite well on small tables. This is why I didn't want to
replace LC-trie entirely in the first place.

While it is possible to integrate SAIL into the main kernel stack, it
will take some effort. I think it will be easier if it is included in
an incremental fashion.

>
> Further, you are not handling unexpected conditions - e.g., multipath or
> lwtunnel encaps, cleanup of device references on a FIB entry delete or
> device overflows ...

This patch was intended to perform routing table lookup in a backbone
router, where multipath routing is an optional feature. I will fix
the unexpected conditions.

>
> > diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
> > index 69c91d1..cc275c7 100644
> > --- a/include/net/ip_fib.h
> > +++ b/include/net/ip_fib.h
> > @@ -197,6 +197,62 @@ struct fib_entry_notifier_info {
> >  u32 tb_id;
> >  };
> >
> > +#if IS_ENABLED(CONFIG_FIB_SAIL_XDP)
> > +/*
> > + * The router can have up to 255 ports. This limitation
> > + * allows us to represent netdev_index as a u8
> > + */
> > +#define NETDEV_COUNT_MAX 255
>
> ... 255 is high for a physical port count but not a logical device
> count. In the presence of VLANs 255 is nothing and VLANs are an
> essential deployment feature.

Yes, 255 ports is sufficient in most backbone routers (to the best of
my knowledge). I'm not sure we want to use SAIL in a virtual network,
where the routing table is usually very small.

>
> > +
> > +struct chunk {
>
> chunk is too generic. sail_chunk at least puts it in the sail symbol
> namespace.
>
> > +/*256-bit bitmap. Here i-th bit (from LSB) is set to 1 if C24[i] > 0 */
> > +u64 bitmap[4];
> > +/*
> > + * Index to C24 where chunk is started. A chunk corresponds
> > + * to 256 elements. Instead of having just one start index for the
> > + * whole chunk, we divide the chunk into 4 parts and save start
> > + * index for each part.
> > + */
> > +u64 start_index[4];
> > +};
> > +
> > +struct sail {
> > +/*default next-hop (Level 0)*/
> > +u8 def_nh;
> > +
> > +/*Level 16*/
> > +u8 __rcu *N16;
> > +u8 __rcu *P16;
> > +u16 __rcu *C16;
> > +
> > +/*Level 24*/
> > +u8 __rcu *N24;
> > +u8 __rcu *P24;
> > +struct chunk __rcu *CK24;
> > +u32 __rcu *C24;
> > +u32 cnk24_count;/*Number of chunks in level 24*/
> > +
> > +/*Level 32*/
> > +u8 __rcu *N32;
> > +u8 __rcu *P32;
> > +u32 cnk32_count;/*Number of chunks in level 32*/
> > +
> > +/*Index to this array is stored in N16, N24 and N32*/
> > +struct net_device*netd

Re: [PATCH RFC net-next] net: SAIL based FIB lookup for XDP

2018-11-18 Thread Andrew Lunn
> > + * The router can have up to 255 ports. This limitation
> > + * allows us to represent netdev_index as a u8
> > + */
> > +#define NETDEV_COUNT_MAX 255
> 
> ... 255 is high for a physical port count but not a logical device
> count. In the presence of VLANs 255 is nothing and VLANs are an
> essential deployment feature.

I agree with David here. Doing some network simulation work using
namespaces, I've had more than 255 devices in the root namespace.

Andrew


Re: [PATCH RFC net-next] net: SAIL based FIB lookup for XDP

2018-11-18 Thread David Ahern
On 11/11/18 7:25 PM, Md. Islam wrote:
> This patch implements SAIL[1] based routing table lookup for XDP. I
> however made some changes from the original proposal (details are
> described in the patch). These changes decreased the memory consumption
> from 21.94 MB to 4.97 MB for my example routing table with 400K
> routes.
> 
> This patch can perform FIB lookup in 50 CPU cycles for the example
> routing table (with 400K routes) whereas LC-trie takes around 350 CPU
> cycles.
> 
> I tried to follow all the advice I got from my last patch. Looking
> forward to your review.
> 
> 1. Yang, Tong, et al. "Guarantee IP lookup performance with FIB
> explosion." ACM SIGCOMM Computer Communication Review. Vol. 44. No. 4.
> ACM, 2014.

This work you are doing on different FIB algorithms is interesting and
probably has its use cases where it is beneficial, but you are still not
integrating it into the Linux stack correctly.

For starters, it is wrong to have 2 separate FIB data structures for the
same network namespace. If SAIL is good enough for XDP, it should be
good enough for normal routing in the namespace. There is no need to put
the same route data in 2 places. That means the FIB algorithm needs to
be selectable - either trie or sail but not both.

Further, you are not handling unexpected conditions - e.g., multipath or
lwtunnel encaps, cleanup of device references on a FIB entry delete or
device overflows ...

> diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
> index 69c91d1..cc275c7 100644
> --- a/include/net/ip_fib.h
> +++ b/include/net/ip_fib.h
> @@ -197,6 +197,62 @@ struct fib_entry_notifier_info {
>  u32 tb_id;
>  };
> 
> +#if IS_ENABLED(CONFIG_FIB_SAIL_XDP)
> +/*
> + * The router can have up to 255 ports. This limitation
> + * allows us to represent netdev_index as a u8
> + */
> +#define NETDEV_COUNT_MAX 255

... 255 is high for a physical port count but not a logical device
count. In the presence of VLANs 255 is nothing and VLANs are an
essential deployment feature.

> +
> +struct chunk {

chunk is too generic. sail_chunk at least puts it in the sail symbol
namespace.

> +/*256-bit bitmap. Here i-th bit (from LSB) is set to 1 if C24[i] > 0 */
> +u64 bitmap[4];
> +/*
> + * Index to C24 where chunk is started. A chunk corresponds
> + * to 256 elements. Instead of having just one start index for the
> + * whole chunk, we divide the chunk into 4 parts and save start
> + * index for each part.
> + */
> +u64 start_index[4];
> +};
> +
> +struct sail {
> +/*default next-hop (Level 0)*/
> +u8 def_nh;
> +
> +/*Level 16*/
> +u8 __rcu *N16;
> +u8 __rcu *P16;
> +u16 __rcu *C16;
> +
> +/*Level 24*/
> +u8 __rcu *N24;
> +u8 __rcu *P24;
> +struct chunk __rcu *CK24;
> +u32 __rcu *C24;
> +u32 cnk24_count;/*Number of chunks in level 24*/
> +
> +/*Level 32*/
> +u8 __rcu *N32;
> +u8 __rcu *P32;
> +u32 cnk32_count;/*Number of chunks in level 32*/
> +
> +/*Index to this array is stored in N16, N24 and N32*/
> +struct net_device *netdevs[NETDEV_COUNT_MAX];
> +u8 netdev_count;/*Number of netdevs*/
> +
> +spinlock_t lock;
> +};
> +
> +int sail_insert(struct sail *s, u32 key,
> +u8 prefix_len, struct net_device *dev);
> +int sail_delete(struct sail *s, u32 key,
> +u8 prefix_len);
> +int sail_flush(struct sail *s);
> +int sail_lookup(const struct sail *s, const __be32 dest,
> +struct net_device **dev);
> +#endif

Put the new FIB algorithm specific defines in a new header.

> +
>  struct fib_nh_notifier_info {
>  struct fib_notifier_info info; /* must be first */
>  struct fib_nh *fib_nh;
> @@ -219,6 +275,10 @@ struct fib_table {
>  int tb_num_default;
>  struct rcu_headrcu;
>  unsigned long *tb_data;
> +#if IS_ENABLED(CONFIG_FIB_SAIL_XDP)
> +/*Each FIB table will have its own SAIL structure.*/
> +struct sail sail;

Per comment above a separate sail entry is unnecessary when this
algorithm is an alternative to trie; It overlaps tb_data - see code
references for it.

> +#endif
>  unsigned long __data[0];
>  };
> 

...

> diff --git a/net/ipv4/fib_sail_xdp.c b/net/ipv4/fib_sail_xdp.c
> new file mode 100644
> index 000..f3f56c5
> --- /dev/null
> +++ b/net/ipv4/fib_sail_xdp.c
> @@ -0,0 +1,939 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2018-19 MD Iftakharul Islam (Tamim) 
> + *
> + *   This program is free software; you can redistribute it and/or
> + *   modify it under the terms of the GNU General Public License
> + *   as published by the Free Software Foundation; either version
> + *   2 of the License, or (at your option) any later version.
> + *
> + *
> + * This is a SAIL_L-based routing table lookup, which was initially proposed in:
> + *
> + * Yang, Tong, Gaogang Xie, YanBiao Li, Qiaobin Fu, Alex X. Liu, Qi Li,
> + * and Laurent Mathy. "Guarantee IP lookup performance with FIB explosion."

[PATCH RFC net-next] net: SAIL based FIB lookup for XDP

2018-11-11 Thread Md. Islam
This patch implements SAIL[1] based routing table lookup for XDP. I
however made some changes from the original proposal (details are
described in the patch). These changes decreased the memory consumption
from 21.94 MB to 4.97 MB for my example routing table with 400K
routes.

This patch can perform FIB lookup in 50 CPU cycles for the example
routing table (with 400K routes) whereas LC-trie takes around 350 CPU
cycles.

I tried to follow all the advice I got from my last patch. Looking
forward to your review.

1. Yang, Tong, et al. "Guarantee IP lookup performance with FIB
explosion." ACM SIGCOMM Computer Communication Review. Vol. 44. No. 4.
ACM, 2014.

Thanks
Tamim

Signed-off-by: MD Iftakharul Islam (Tamim) 
---
 MAINTAINERS             |   5 +
 include/net/ip_fib.h    |  60
 net/core/filter.c       |  49 +++
 net/ipv4/Kconfig        |  11 +
 net/ipv4/Makefile       |   1 +
 net/ipv4/fib_sail_xdp.c | 939
 net/ipv4/fib_trie.c     |  12 +
 7 files changed, 1077 insertions(+)
 create mode 100644 net/ipv4/fib_sail_xdp.c

diff --git a/MAINTAINERS b/MAINTAINERS
index b22e7fd..969e625 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7630,6 +7630,11 @@ M:	Juanjo Ciarlante 
 S:	Maintained
 F:	net/ipv4/netfilter/ipt_MASQUERADE.c

+SAIL FIB LOOKUP
+M:	MD Iftakharul Islam (Tamim) 
+S:	Maintained
+F:	net/ipv4/fib_sail_xdp.c
+
 IPMI SUBSYSTEM
 M:	Corey Minyard 
 L:	openipmi-develo...@lists.sourceforge.net (moderated for non-subscribers)
diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 69c91d1..cc275c7 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -197,6 +197,62 @@ struct fib_entry_notifier_info {
 u32 tb_id;
 };

+#if IS_ENABLED(CONFIG_FIB_SAIL_XDP)
+/*
+ * The router can have up to 255 ports. This limitation
+ * allows us to represent netdev_index as a u8
+ */
+#define NETDEV_COUNT_MAX 255
+
+struct chunk {
+/*256-bit bitmap. Here i-th bit (from LSB) is set to 1 if C24[i] > 0 */
+u64 bitmap[4];
+/*
+ * Index to C24 where chunk is started. A chunk corresponds
+ * to 256 elements. Instead of having just one start index for the
+ * whole chunk, we divide the chunk into 4 parts and save start
+ * index for each part.
+ */
+u64 start_index[4];
+};
+
+struct sail {
+/*default next-hop (Level 0)*/
+u8 def_nh;
+
+/*Level 16*/
+u8 __rcu *N16;
+u8 __rcu *P16;
+u16 __rcu *C16;
+
+/*Level 24*/
+u8 __rcu *N24;
+u8 __rcu *P24;
+struct chunk __rcu *CK24;
+u32 __rcu *C24;
+u32 cnk24_count;/*Number of chunks in level 24*/
+
+/*Level 32*/
+u8 __rcu *N32;
+u8 __rcu *P32;
+u32 cnk32_count;/*Number of chunks in level 32*/
+
+/*Index to this array is stored in N16, N24 and N32*/
+struct net_device *netdevs[NETDEV_COUNT_MAX];
+u8 netdev_count;/*Number of netdevs*/
+
+spinlock_t lock;
+};
+
+int sail_insert(struct sail *s, u32 key,
+u8 prefix_len, struct net_device *dev);
+int sail_delete(struct sail *s, u32 key,
+u8 prefix_len);
+int sail_flush(struct sail *s);
+int sail_lookup(const struct sail *s, const __be32 dest,
+struct net_device **dev);
+#endif
+
 struct fib_nh_notifier_info {
 struct fib_notifier_info info; /* must be first */
 struct fib_nh *fib_nh;
@@ -219,6 +275,10 @@ struct fib_table {
 int tb_num_default;
 struct rcu_headrcu;
 unsigned long *tb_data;
+#if IS_ENABLED(CONFIG_FIB_SAIL_XDP)
+/*Each FIB table will have its own SAIL structure.*/
+struct sail sail;
+#endif
 unsigned long __data[0];
 };

diff --git a/net/core/filter.c b/net/core/filter.c
index 5e00f2b..e89b4bb 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4236,6 +4236,49 @@ static int bpf_fib_set_fwd_params(struct
bpf_fib_lookup *params,
 }
 #endif

+#if IS_ENABLED(CONFIG_FIB_SAIL_XDP)
+static int sail_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
+   u32 flags, bool check_mtu)
+{
+struct net_device *dev_in, *dev_out;
+struct fib_table *tb;
+struct neighbour *neigh;
+u32 tbid;
+int err;
+u32 mtu;
+
+if (flags & BPF_FIB_LOOKUP_DIRECT) {
+dev_in = dev_get_by_index_rcu(net, params->ifindex);
+if (unlikely(!dev_in))
+return -ENODEV;
+tbid = l3mdev_fib_table_rcu(dev_in) ? : RT_TABLE_MAIN;
+} else {
+tbid = RT_TABLE_MAIN;
+}
+
+tb = fib_get_table(net, tbid);
+if (unlikely(!tb))
+return BPF_FIB_LKUP_RET_NOT_FWDED;
+
+err = sail_lookup(&tb->sail, params->ipv4_dst, &dev_out);
+if (err)
+return -ENOENT;
+
+if (check_mtu) {
+mtu = min(READ_ONCE(dev_out->mtu), IP_MAX_MTU);
+if (params->tot_len > mtu)
+return BPF_FIB_LKUP_RET_FRAG_NEEDED;
+}
+
+neigh = __ipv4_neigh_lookup_noref(dev_out,
+  (__force u32)params->ipv4_dst);
+if (!neigh)
+retur