Re: [PATCH v1 1/2] bpf: add a longest prefix match trie map implementation

2017-01-06 Thread Alexei Starovoitov

On 1/6/17 2:43 AM, Daniel Borkmann wrote:

On 01/05/2017 09:14 PM, Daniel Mack wrote:
[...]

In my use case, the actual value of a node is in fact ignored, all that
matters is whether a node exists in a trie or not. The test code uses
u64 for its tests.

I can change it around so that the value size can be defined by
userspace, but ideally it would also support 0-byte lengths then. The
bpf map syscall handler should handle the latter just fine if I read the
code correctly?


Right now no map is allowed to have value size of 0, but since kmalloc()
would return ZERO_SIZE_PTR in such case, it looks like it should
work^tm, although I haven't checked whether it's guaranteed that all
the copy_{from,to}_user() implementations work with 0 size as well
and whether ubsan would complain on the ZERO_SIZE_PTR for memcpy() etc.
Perhaps better to reject value size of 0 initially and later on follow
up with making the syscall code more robust for such cases (afaik, for
the htab this was also on the todo list)?


yes. the support for value_size=0 was on todo list pretty much as
soon as htab was introduced and early on the verifier was done the way
to make sure such case should work as-is from bpf program point of view,
but for syscall lookup/update commands I didn't want to add checks
for zero value until it's actually needed. So definitely some work
around syscall handling is needed.
Also agree that for lpm I would check value_size > 0 initially and
then relax it for hash and lpm together.
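The attribute check being agreed on here (reject value_size of 0 for now, relax later together with htab) can be sketched in userspace; the struct and constants below are illustrative stand-ins, not the kernel's union bpf_attr:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical mock of the trie_alloc() sanity check under discussion.
 * LPM_KEY_HDR stands in for sizeof(struct bpf_lpm_trie_key). */
struct mock_attr {
	uint32_t max_entries;
	uint32_t map_flags;
	uint32_t key_size;
	uint32_t value_size;
};

#define LPM_KEY_HDR 4

static int trie_check_attr(const struct mock_attr *attr)
{
	if (attr->max_entries == 0 ||
	    attr->key_size < LPM_KEY_HDR + 1 ||
	    attr->key_size > LPM_KEY_HDR + 256 ||
	    attr->value_size == 0) /* reject 0 initially, relax with htab later */
		return -EINVAL;
	return 0;
}
```

A caller-chosen nonzero value_size passes; value_size of 0 is rejected until the syscall path is hardened for it.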



Re: [PATCH v1 1/2] bpf: add a longest prefix match trie map implementation

2017-01-06 Thread Daniel Borkmann

On 01/05/2017 09:14 PM, Daniel Mack wrote:
[...]

In my use case, the actual value of a node is in fact ignored, all that
matters is whether a node exists in a trie or not. The test code uses
u64 for its tests.

I can change it around so that the value size can be defined by
userspace, but ideally it would also support 0-byte lengths then. The
bpf map syscall handler should handle the latter just fine if I read the
code correctly?


Right now no map is allowed to have value size of 0, but since kmalloc()
would return ZERO_SIZE_PTR in such case, it looks like it should
work^tm, although I haven't checked whether it's guaranteed that all
the copy_{from,to}_user() implementations work with 0 size as well
and whether ubsan would complain on the ZERO_SIZE_PTR for memcpy() etc.
Perhaps better to reject value size of 0 initially and later on follow
up with making the syscall code more robust for such cases (afaik, for
the htab this was also on the todo list)?

Thanks,
Daniel


Re: [PATCH v1 1/2] bpf: add a longest prefix match trie map implementation

2017-01-05 Thread Daniel Borkmann

Hi Daniel,

On 01/05/2017 09:04 PM, Daniel Mack wrote:

On 01/05/2017 05:25 PM, Daniel Borkmann wrote:

On 12/29/2016 06:28 PM, Daniel Mack wrote:



diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
new file mode 100644
index 000..8b6a61d
--- /dev/null
+++ b/kernel/bpf/lpm_trie.c


[..]


+static struct bpf_map *trie_alloc(union bpf_attr *attr)
+{
+   struct lpm_trie *trie;
+
+   /* check sanity of attributes */
+   if (attr->max_entries == 0 || attr->map_flags ||
+   attr->key_size < sizeof(struct bpf_lpm_trie_key) + 1   ||
+   attr->key_size > sizeof(struct bpf_lpm_trie_key) + 256 ||
+   attr->value_size != sizeof(u64))
+   return ERR_PTR(-EINVAL);


The correct attr->map_flags test here would need to be ...

attr->map_flags != BPF_F_NO_PREALLOC

... since in this case we don't have any prealloc pool, and
should that come one day that test could be relaxed again.


+   trie = kzalloc(sizeof(*trie), GFP_USER | __GFP_NOWARN);
+   if (!trie)
+   return NULL;
+
+   /* copy mandatory map attributes */
+   trie->map.map_type = attr->map_type;
+   trie->map.key_size = attr->key_size;
+   trie->map.value_size = attr->value_size;
+   trie->map.max_entries = attr->max_entries;


You also need to fill in trie->map.pages as that is eventually
used to charge memory against in bpf_map_charge_memlock(), right
now that would remain as 0 meaning the map is not accounted for.


Hmm, okay. The nodes are, however, allocated dynamically at runtime in
this case. That means that we have trie->map.pages on each allocation,
right?


The current scheme (f.e. htab_map_alloc() has some details, although
probably not too obvious) that was done charges worst-case cost up front,
so it would be in trie_alloc() where you fill map.pages and map_create()
will later account for them.
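The up-front, worst-case charging scheme described above can be sketched as follows; the per-node cost and the formula are illustrative assumptions (the real cost depends on data_size and the final node layout), not the kernel's accounting code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096

/* Hedged sketch of filling map.pages in trie_alloc(): charge for the
 * worst case (max_entries fully allocated nodes) at creation time, the
 * way htab_map_alloc() charges its worst case up front. */
static uint32_t lpm_worst_case_pages(uint32_t max_entries, size_t node_size)
{
	uint64_t cost = (uint64_t)max_entries * node_size;

	/* round up to whole pages for bpf_map_charge_memlock() */
	return (uint32_t)((cost + PAGE_SIZE - 1) / PAGE_SIZE);
}
```

map_create() would then account these pages against the caller's RLIMIT_MEMLOCK.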

Thanks,
Daniel


Re: [PATCH v1 1/2] bpf: add a longest prefix match trie map implementation

2017-01-05 Thread Daniel Mack
Hi,

On 01/05/2017 09:01 PM, Daniel Borkmann wrote:
> On 01/05/2017 05:25 PM, Daniel Borkmann wrote:
>> On 12/29/2016 06:28 PM, Daniel Mack wrote:

> [...]
>>> +static struct bpf_map *trie_alloc(union bpf_attr *attr)
>>> +{
>>> +struct lpm_trie *trie;
>>> +
>>> +/* check sanity of attributes */
>>> +if (attr->max_entries == 0 || attr->map_flags ||
>>> +attr->key_size < sizeof(struct bpf_lpm_trie_key) + 1   ||
>>> +attr->key_size > sizeof(struct bpf_lpm_trie_key) + 256 ||
>>> +attr->value_size != sizeof(u64))
>>> +return ERR_PTR(-EINVAL);
> 
> One more question on this regarding value size as u64 (perhaps I
> missed it along the way): reason this was chosen was because for
> keeping stats? Why not making user choose a size as in other maps,
> so also custom structs could be stored there?

In my use case, the actual value of a node is in fact ignored, all that
matters is whether a node exists in a trie or not. The test code uses
u64 for its tests.

I can change it around so that the value size can be defined by
userspace, but ideally it would also support 0-byte lengths then. The
bpf map syscall handler should handle the latter just fine if I read the
code correctly?


Thanks,
Daniel


Re: [PATCH v1 1/2] bpf: add a longest prefix match trie map implementation

2017-01-05 Thread Daniel Mack
Hi Daniel,

Thanks for your feedback! I agree on all points. Two questions below.

On 01/05/2017 05:25 PM, Daniel Borkmann wrote:
> On 12/29/2016 06:28 PM, Daniel Mack wrote:

>> diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
>> new file mode 100644
>> index 000..8b6a61d
>> --- /dev/null
>> +++ b/kernel/bpf/lpm_trie.c

[..]

>> +static struct bpf_map *trie_alloc(union bpf_attr *attr)
>> +{
>> +struct lpm_trie *trie;
>> +
>> +/* check sanity of attributes */
>> +if (attr->max_entries == 0 || attr->map_flags ||
>> +attr->key_size < sizeof(struct bpf_lpm_trie_key) + 1   ||
>> +attr->key_size > sizeof(struct bpf_lpm_trie_key) + 256 ||
>> +attr->value_size != sizeof(u64))
>> +return ERR_PTR(-EINVAL);
> 
> The correct attr->map_flags test here would need to be ...
> 
>attr->map_flags != BPF_F_NO_PREALLOC
> 
> ... since in this case we don't have any prealloc pool, and
> should that come one day that test could be relaxed again.
> 
>> +trie = kzalloc(sizeof(*trie), GFP_USER | __GFP_NOWARN);
>> +if (!trie)
>> +return NULL;
>> +
>> +/* copy mandatory map attributes */
>> +trie->map.map_type = attr->map_type;
>> +trie->map.key_size = attr->key_size;
>> +trie->map.value_size = attr->value_size;
>> +trie->map.max_entries = attr->max_entries;
> 
> You also need to fill in trie->map.pages as that is eventually
> used to charge memory against in bpf_map_charge_memlock(), right
> now that would remain as 0 meaning the map is not accounted for.

Hmm, okay. The nodes are, however, allocated dynamically at runtime in
this case. That means that we have trie->map.pages on each allocation,
right?

>> +static void trie_free(struct bpf_map *map)
>> +{
>> +struct lpm_trie_node __rcu **slot;
>> +struct lpm_trie_node *node;
>> +struct lpm_trie *trie =
>> +container_of(map, struct lpm_trie, map);
>> +
>> +spin_lock(&trie->lock);
>> +
>> +/*
>> + * Always start at the root and walk down to a node that has no
>> + * children. Then free that node, nullify its pointer in the parent,
>> + * then start over.
>> + */
>> +
>> +for (;;) {
>> +slot = &trie->root;
>> +
>> +for (;;) {
>> +node = rcu_dereference_protected(*slot,
>> +lockdep_is_held(&trie->lock));
>> +if (!node)
>> +goto out;
>> +
>> +if (node->child[0]) {
> 
> rcu_access_pointer(node->child[0]) (at least to keep sparse happy?)

Done, but sparse does not actually complain here.



Thanks,
Daniel



Re: [PATCH v1 1/2] bpf: add a longest prefix match trie map implementation

2017-01-05 Thread Daniel Borkmann

On 01/05/2017 05:25 PM, Daniel Borkmann wrote:

On 12/29/2016 06:28 PM, Daniel Mack wrote:

This trie implements a longest prefix match algorithm that can be used
to match IP addresses to a stored set of ranges.

Internally, data is stored in an unbalanced trie of nodes that has a
maximum height of n, where n is the prefixlen the trie was created
with.

Tries may be created with prefix lengths that are multiples of 8, in
the range from 8 to 2048. The key used for lookup and update operations
is a struct bpf_lpm_trie_key, and the value is a uint64_t.

The code carries more information about the internal implementation.

Signed-off-by: Daniel Mack 
Reviewed-by: David Herrmann 


Thanks for working on it, and sorry for late reply. In addition to
Alexei's earlier comments on the cover letter, a few comments inline:


[...]

+static struct bpf_map *trie_alloc(union bpf_attr *attr)
+{
+struct lpm_trie *trie;
+
+/* check sanity of attributes */
+if (attr->max_entries == 0 || attr->map_flags ||
+attr->key_size < sizeof(struct bpf_lpm_trie_key) + 1   ||
+attr->key_size > sizeof(struct bpf_lpm_trie_key) + 256 ||
+attr->value_size != sizeof(u64))
+return ERR_PTR(-EINVAL);


One more question on this regarding value size as u64 (perhaps I
missed it along the way): reason this was chosen was because for
keeping stats? Why not making user choose a size as in other maps,
so also custom structs could be stored there?

Thanks,
Daniel


Re: [PATCH v1 1/2] bpf: add a longest prefix match trie map implementation

2017-01-05 Thread Daniel Borkmann

On 01/05/2017 05:25 PM, Daniel Borkmann wrote:

On 12/29/2016 06:28 PM, Daniel Mack wrote:

This trie implements a longest prefix match algorithm that can be used
to match IP addresses to a stored set of ranges.

Internally, data is stored in an unbalanced trie of nodes that has a
maximum height of n, where n is the prefixlen the trie was created
with.

Tries may be created with prefix lengths that are multiples of 8, in
the range from 8 to 2048. The key used for lookup and update operations
is a struct bpf_lpm_trie_key, and the value is a uint64_t.

The code carries more information about the internal implementation.

Signed-off-by: Daniel Mack 
Reviewed-by: David Herrmann 


Thanks for working on it, and sorry for late reply. In addition to
Alexei's earlier comments on the cover letter, a few comments inline:

[...]

diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
new file mode 100644
index 000..8b6a61d
--- /dev/null
+++ b/kernel/bpf/lpm_trie.c
@@ -0,0 +1,468 @@
+/*
+ * Longest prefix match list implementation
+ *
+ * Copyright (c) 2016 Daniel Mack
+ * Copyright (c) 2016 David Herrmann
+ *
+ * This file is subject to the terms and conditions of version 2 of the GNU
+ * General Public License.  See the file COPYING in the main directory of the
+ * Linux distribution for more details.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/* Intermediate node */
+#define LPM_TREE_NODE_FLAG_IM BIT(0)
+
+struct lpm_trie_node;
+
+struct lpm_trie_node {
+   struct rcu_head rcu;
+   struct lpm_trie_node __rcu  *child[2];
+   u32 prefixlen;
+   u32 flags;
+   u64 value;
+   u8  data[0];
+};
+
+struct lpm_trie {
+   struct bpf_map  map;
+   struct lpm_trie_node __rcu  *root;
+   size_t  n_entries;
+   size_t  max_prefixlen;
+   size_t  data_size;
+   spinlock_t  lock;
+};
+

[...]

+
+static inline int extract_bit(const u8 *data, size_t index)
+{
+   return !!(data[index / 8] & (1 << (7 - (index % 8))));
+}
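The quoted extract_bit() helper copies directly into userspace; bit index 0 is the most significant bit of data[0], matching the big-endian key layout the trie uses:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace copy of the extract_bit() helper from the patch: selects
 * bit `index` counting from the MSB of data[0] downward. */
static int extract_bit(const uint8_t *data, size_t index)
{
	return !!(data[index / 8] & (1 << (7 - (index % 8))));
}
```

For 192.168.0.0 ({0xc0, 0xa8, 0x00, 0x00}), bits 0 and 1 are set (0xc0 = 1100 0000) and bit 2 is clear.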
+

[...]

+
+static struct lpm_trie_node *lpm_trie_node_alloc(size_t data_size)
+{
+return kmalloc(sizeof(struct lpm_trie_node) + data_size,
+   GFP_ATOMIC | __GFP_NOWARN);
+}
+
+/* Called from syscall or from eBPF program */
+static int trie_update_elem(struct bpf_map *map,
+void *_key, void *value, u64 flags)
+{
+struct lpm_trie *trie = container_of(map, struct lpm_trie, map);
+struct lpm_trie_node *node, *im_node, *new_node = NULL;
+struct lpm_trie_node __rcu **slot;
+struct bpf_lpm_trie_key *key = _key;
+unsigned int next_bit;
+size_t matchlen = 0;
+int ret = 0;


We should guard for future map flags here:

 if (unlikely(flags > BPF_EXIST))
 return -EINVAL;

And further below we'd need to check for BPF_{NO,}EXIST when replacing
resp. adding the node?
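The flag semantics the review asks for can be sketched in userspace; the BPF_ANY/BPF_NOEXIST/BPF_EXIST values match the uapi, but the helper itself is an illustrative stand-in, not the kernel implementation:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define BPF_ANY     0 /* create new element or update existing */
#define BPF_NOEXIST 1 /* create new element only if it didn't exist */
#define BPF_EXIST   2 /* only update existing element */

/* Hedged sketch of the update-flags guard: reject unknown future
 * flags up front, then enforce NOEXIST/EXIST against the lookup
 * result when replacing resp. adding a node. */
static int check_update_flags(uint64_t flags, bool node_exists)
{
	if (flags > BPF_EXIST)
		return -EINVAL;  /* guard against future flags */
	if (flags == BPF_NOEXIST && node_exists)
		return -EEXIST;  /* would replace an existing node */
	if (flags == BPF_EXIST && !node_exists)
		return -ENOENT;  /* nothing to replace */
	return 0;
}
```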


+if (key->prefixlen > trie->max_prefixlen)
+return -EINVAL;
+
+spin_lock(&trie->lock);


That spin lock would need to be converted to a raw lock, see commit
ac00881f9221 ("bpf: convert hashtab lock to raw lock"). The comment
in htab also mentions that bpf_map_update_elem() can be called in
irq context (I assume as a map from tracing side?), so we'd need to
use the *_irqsave variants here as well.


+/* Allocate and fill a new node */
+
+if (trie->n_entries == trie->map.max_entries) {
+ret = -ENOSPC;
+goto out;
+}
+
+new_node = lpm_trie_node_alloc(trie->data_size);
+if (!new_node) {
+ret = -ENOMEM;
+goto out;
+}
+
+trie->n_entries++;
+new_node->value = *(u64 *) value;
+new_node->prefixlen = key->prefixlen;
+new_node->flags = 0;
+new_node->child[0] = NULL;
+new_node->child[1] = NULL;


Should this be ...

RCU_INIT_POINTER(new_node->child[0], NULL);
RCU_INIT_POINTER(new_node->child[1], NULL);


+memcpy(new_node->data, key->data, trie->data_size);
+
+/*
+ * Now find a slot to attach the new node. To do that, walk the tree
+ * from the root match as many bits as possible for each node until we
+ * either find an empty slot or a slot that needs to be replaced by an
+ * intermediate node.
+ */
+slot = &trie->root;
+
+while ((node = rcu_dereference_protected(*slot,
+lockdep_is_held(&trie->lock)))) {
+matchlen = longest_prefix_match(trie, node, key);
+
+if (node->prefixlen != matchlen ||
+node->prefixlen == key->prefixlen ||
+node->prefixlen == trie->max_prefixlen)
+break;
+
+next_bit = extract_bit(key->data, node->prefixlen);
+slot = >child[next_bit];
+}
+
+/*
+ * If the slot is empty (a free child pointer or an empty root),
+ * simply assign the @new_node to that slot and be done.
+ */
+if (!node) {
+rcu_assign_pointer(*slot, new_node);
+goto 

Re: [PATCH v1 1/2] bpf: add a longest prefix match trie map implementation

2017-01-05 Thread Daniel Borkmann

On 12/29/2016 06:28 PM, Daniel Mack wrote:

This trie implements a longest prefix match algorithm that can be used
to match IP addresses to a stored set of ranges.

Internally, data is stored in an unbalanced trie of nodes that has a
maximum height of n, where n is the prefixlen the trie was created
with.

Tries may be created with prefix lengths that are multiples of 8, in
the range from 8 to 2048. The key used for lookup and update operations
is a struct bpf_lpm_trie_key, and the value is a uint64_t.

The code carries more information about the internal implementation.

Signed-off-by: Daniel Mack 
Reviewed-by: David Herrmann 


Thanks for working on it, and sorry for late reply. In addition to
Alexei's earlier comments on the cover letter, a few comments inline:

[...]

diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
new file mode 100644
index 000..8b6a61d
--- /dev/null
+++ b/kernel/bpf/lpm_trie.c
@@ -0,0 +1,468 @@
+/*
+ * Longest prefix match list implementation
+ *
+ * Copyright (c) 2016 Daniel Mack
+ * Copyright (c) 2016 David Herrmann
+ *
+ * This file is subject to the terms and conditions of version 2 of the GNU
+ * General Public License.  See the file COPYING in the main directory of the
+ * Linux distribution for more details.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/* Intermediate node */
+#define LPM_TREE_NODE_FLAG_IM BIT(0)
+
+struct lpm_trie_node;
+
+struct lpm_trie_node {
+   struct rcu_head rcu;
+   struct lpm_trie_node __rcu  *child[2];
+   u32 prefixlen;
+   u32 flags;
+   u64 value;
+   u8  data[0];
+};
+
+struct lpm_trie {
+   struct bpf_map  map;
+   struct lpm_trie_node __rcu  *root;
+   size_t  n_entries;
+   size_t  max_prefixlen;
+   size_t  data_size;
+   spinlock_t  lock;
+};
+

[...]

+
+static inline int extract_bit(const u8 *data, size_t index)
+{
+   return !!(data[index / 8] & (1 << (7 - (index % 8))));
+}
+

[...]

+
+static struct lpm_trie_node *lpm_trie_node_alloc(size_t data_size)
+{
+   return kmalloc(sizeof(struct lpm_trie_node) + data_size,
+  GFP_ATOMIC | __GFP_NOWARN);
+}
+
+/* Called from syscall or from eBPF program */
+static int trie_update_elem(struct bpf_map *map,
+   void *_key, void *value, u64 flags)
+{
+   struct lpm_trie *trie = container_of(map, struct lpm_trie, map);
+   struct lpm_trie_node *node, *im_node, *new_node = NULL;
+   struct lpm_trie_node __rcu **slot;
+   struct bpf_lpm_trie_key *key = _key;
+   unsigned int next_bit;
+   size_t matchlen = 0;
+   int ret = 0;


We should guard for future map flags here:

if (unlikely(flags > BPF_EXIST))
return -EINVAL;

And further below we'd need to check for BPF_{NO,}EXIST when replacing
resp. adding the node?


+   if (key->prefixlen > trie->max_prefixlen)
+   return -EINVAL;
+
+   spin_lock(&trie->lock);


That spin lock would need to be converted to a raw lock, see commit
ac00881f9221 ("bpf: convert hashtab lock to raw lock"). The comment
in htab also mentions that bpf_map_update_elem() can be called in
irq context (I assume as a map from tracing side?), so we'd need to
use the *_irqsave variants here as well.


+   /* Allocate and fill a new node */
+
+   if (trie->n_entries == trie->map.max_entries) {
+   ret = -ENOSPC;
+   goto out;
+   }
+
+   new_node = lpm_trie_node_alloc(trie->data_size);
+   if (!new_node) {
+   ret = -ENOMEM;
+   goto out;
+   }
+
+   trie->n_entries++;
+   new_node->value = *(u64 *) value;
+   new_node->prefixlen = key->prefixlen;
+   new_node->flags = 0;
+   new_node->child[0] = NULL;
+   new_node->child[1] = NULL;


Should this be ...

RCU_INIT_POINTER(new_node->child[0], NULL);
RCU_INIT_POINTER(new_node->child[1], NULL);


+   memcpy(new_node->data, key->data, trie->data_size);
+
+   /*
+* Now find a slot to attach the new node. To do that, walk the tree
+* from the root match as many bits as possible for each node until we
+* either find an empty slot or a slot that needs to be replaced by an
+* intermediate node.
+*/
+   slot = &trie->root;
+
+   while ((node = rcu_dereference_protected(*slot,
+   lockdep_is_held(&trie->lock)))) {
+   matchlen = longest_prefix_match(trie, node, key);
+
+   if (node->prefixlen != matchlen ||
+   node->prefixlen == key->prefixlen ||
+   node->prefixlen == trie->max_prefixlen)
+   break;
+
+   

[PATCH v1 1/2] bpf: add a longest prefix match trie map implementation

2016-12-29 Thread Daniel Mack
This trie implements a longest prefix match algorithm that can be used
to match IP addresses to a stored set of ranges.

Internally, data is stored in an unbalanced trie of nodes that has a
maximum height of n, where n is the prefixlen the trie was created
with.

Tries may be created with prefix lengths that are multiples of 8, in
the range from 8 to 2048. The key used for lookup and update operations
is a struct bpf_lpm_trie_key, and the value is a uint64_t.

The code carries more information about the internal implementation.

Signed-off-by: Daniel Mack 
Reviewed-by: David Herrmann 
---
 include/uapi/linux/bpf.h |   7 +
 kernel/bpf/Makefile  |   2 +-
 kernel/bpf/lpm_trie.c| 468 +++
 3 files changed, 476 insertions(+), 1 deletion(-)
 create mode 100644 kernel/bpf/lpm_trie.c

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 0eb0e87..d564277 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -63,6 +63,12 @@ struct bpf_insn {
__s32   imm;/* signed immediate constant */
 };
 
+/* Key of a BPF_MAP_TYPE_LPM_TRIE entry */
+struct bpf_lpm_trie_key {
+   __u32   prefixlen;  /* up to 32 for AF_INET, 128 for AF_INET6 */
+   __u8data[0];/* Arbitrary size */
+};
+
 /* BPF syscall commands, see bpf(2) man-page for details. */
 enum bpf_cmd {
BPF_MAP_CREATE,
@@ -89,6 +95,7 @@ enum bpf_map_type {
BPF_MAP_TYPE_CGROUP_ARRAY,
BPF_MAP_TYPE_LRU_HASH,
BPF_MAP_TYPE_LRU_PERCPU_HASH,
+   BPF_MAP_TYPE_LPM_TRIE,
 };
 
 enum bpf_prog_type {
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 1276474..e1ce4f4 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -1,7 +1,7 @@
 obj-y := core.o
 
 obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o
-obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o
+obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o
 ifeq ($(CONFIG_PERF_EVENTS),y)
 obj-$(CONFIG_BPF_SYSCALL) += stackmap.o
 endif
diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
new file mode 100644
index 000..8b6a61d
--- /dev/null
+++ b/kernel/bpf/lpm_trie.c
@@ -0,0 +1,468 @@
+/*
+ * Longest prefix match list implementation
+ *
+ * Copyright (c) 2016 Daniel Mack
+ * Copyright (c) 2016 David Herrmann
+ *
+ * This file is subject to the terms and conditions of version 2 of the GNU
+ * General Public License.  See the file COPYING in the main directory of the
+ * Linux distribution for more details.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/* Intermediate node */
+#define LPM_TREE_NODE_FLAG_IM BIT(0)
+
+struct lpm_trie_node;
+
+struct lpm_trie_node {
+   struct rcu_head rcu;
+   struct lpm_trie_node __rcu  *child[2];
+   u32 prefixlen;
+   u32 flags;
+   u64 value;
+   u8  data[0];
+};
+
+struct lpm_trie {
+   struct bpf_map  map;
+   struct lpm_trie_node __rcu  *root;
+   size_t  n_entries;
+   size_t  max_prefixlen;
+   size_t  data_size;
+   spinlock_t  lock;
+};
+
+/*
+ * This trie implements a longest prefix match algorithm that can be used to
+ * match IP addresses to a stored set of ranges.
+ *
+ * Data stored in @data of struct bpf_lpm_key and struct lpm_trie_node is
+ * interpreted as big endian, so data[0] stores the most significant byte.
+ *
+ * Match ranges are internally stored in instances of struct lpm_trie_node
+ * which each contain their prefix length as well as two pointers that may
+ * lead to more nodes containing more specific matches. Each node also stores
+ * a value that is defined by and returned to userspace via the update_elem
+ * and lookup functions.
+ *
+ * For instance, let's start with a trie that was created with a prefix length
+ * of 32, so it can be used for IPv4 addresses, and one single element that
+ * matches 192.168.0.0/16. The data array would hence contain
+ * [0xc0, 0xa8, 0x00, 0x00] in big-endian notation. This documentation will
+ * stick to IP-address notation for readability though.
+ *
+ * As the trie is empty initially, the new node (1) will be placed as root
+ * node, denoted as (R) in the example below. As there are no other nodes, both
+ * child pointers are %NULL.
+ *
+ *  ++
+ *  |   (1)  (R) |
+ *  | 192.168.0.0/16 |
+ *  |value: 1|
+ *  |   [0][1]   |
+ *  ++
+ *
+ * Next, let's add a new node (2) matching 192.168.0.0/24. As there is already
+ * a node with the same data and a smaller prefix (i.e., a less specific one),
+ *