Re: [net-next V5 PATCH 1/5] bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP

2017-10-09 Thread Daniel Borkmann

On 10/09/2017 07:59 PM, Jesper Dangaard Brouer wrote:
[...]

+static void *cpu_map_lookup_elem(struct bpf_map *map, void *key)
+{
+   struct bpf_cpu_map_entry *rcpu =
+   __cpu_map_lookup_elem(map, *(u32 *)key);
+
+   return rcpu ? &rcpu->qsize : NULL;


I still think from my prior email/comment that we should use a per-cpu
scratch buffer here. Would be nice to keep the guarantee that no one
can modify it, it's just a tiny change.


Well, it's no longer really needed, right(?), as this patchset was
updated such that the bpf side cannot invoke this.  The userspace side
reading this will get a copy.


Ah sorry, you're right, the related change happens in a later patch, I
missed that. It would be good to avoid such a split in future; the other
option is to forbid usage initially in check_map_func_compatibility() by
bailing out in the BPF_MAP_TYPE_CPUMAP case unconditionally, and then
enabling it for BPF_FUNC_redirect_map in a next step, such that should
someone accidentally backport only this patch, we don't allow for
unintended misuse.
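The two-step enabling suggested above could be sketched roughly as follows. This is a hypothetical user-space model of the verifier's check_map_func_compatibility() switch, not the actual kernel code; the enum and function names below are simplified stand-ins:

```c
#include <assert.h>
#include <errno.h>

/* Simplified stand-ins for the kernel enums involved */
enum map_type { MAP_TYPE_DEVMAP, MAP_TYPE_CPUMAP };
enum func_id  { FUNC_map_lookup_elem, FUNC_redirect_map };

/* Step 1 (this patch): reject any helper access to cpumap, so a
 * partial backport cannot expose the map to bpf programs. */
static int check_map_func_compat_step1(enum map_type type, enum func_id fn)
{
	(void)fn;
	switch (type) {
	case MAP_TYPE_CPUMAP:
		return -EINVAL;	/* unconditionally forbidden */
	default:
		return 0;
	}
}

/* Step 2 (later patch): open up exactly one helper, bpf_redirect_map() */
static int check_map_func_compat_step2(enum map_type type, enum func_id fn)
{
	switch (type) {
	case MAP_TYPE_CPUMAP:
		return fn == FUNC_redirect_map ? 0 : -EINVAL;
	default:
		return 0;
	}
}
```

With this ordering, a backport containing only step 1 simply makes the map type unusable from bpf programs rather than silently permitting helpers that the rest of the series assumes are blocked.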

Thanks,
Daniel


Re: [net-next V5 PATCH 1/5] bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP

2017-10-09 Thread Jesper Dangaard Brouer
On Mon, 09 Oct 2017 15:31:21 +0200
Daniel Borkmann  wrote:

> On 10/06/2017 06:12 PM, Jesper Dangaard Brouer wrote:
> [...]
> > +static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
> > +{
> > +   struct bpf_cpu_map *cmap;
> > +   int err = -ENOMEM;  
> 
> err init here is basically not needed since overridden later anyway
> w/o being read, but ...

Thank you for catching this! Guess, I'll send a V6 tomorrow.

[...]
> > +   /* Notice: returns -EPERM if map size is larger than memlock limit */
> > +   err = bpf_map_precharge_memlock(cmap->map.pages);
> > +   if (err)
> > +   goto free_cmap;  
> 
> ... here, you need to set err = -ENOMEM.

Yes, I see my mistake of assigning "err" here.

[...]
> > +static void *cpu_map_lookup_elem(struct bpf_map *map, void *key)
> > +{
> > +   struct bpf_cpu_map_entry *rcpu =
> > +   __cpu_map_lookup_elem(map, *(u32 *)key);
> > +
> > +   return rcpu ? &rcpu->qsize : NULL;
> 
> I still think from my prior email/comment that we should use a per-cpu
> scratch buffer here. Would be nice to keep the guarantee that no one
> can modify it, it's just a tiny change.

Well, it's no longer really needed, right(?), as this patchset was
updated such that the bpf side cannot invoke this.  The userspace side
reading this will get a copy.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [net-next V5 PATCH 1/5] bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP

2017-10-09 Thread Daniel Borkmann

On 10/09/2017 03:31 PM, Daniel Borkmann wrote:

On 10/06/2017 06:12 PM, Jesper Dangaard Brouer wrote:

[...]

+/* Pre-limit array size based on NR_CPUS, not final CPU check */
+if (cmap->map.max_entries > NR_CPUS)


Nit: needs to be >= NR_CPUS.


Scratch that comment, you bail out on key_cpu >= cmap->map.max_entries
in the other handlers, so that's fine.


Re: [net-next V5 PATCH 1/5] bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP

2017-10-09 Thread Daniel Borkmann

On 10/06/2017 06:12 PM, Jesper Dangaard Brouer wrote:
[...]

+static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
+{
+   struct bpf_cpu_map *cmap;
+   int err = -ENOMEM;


err init here is basically not needed since overridden later anyway
w/o being read, but ...


+   u64 cost;
+
+   if (!capable(CAP_SYS_ADMIN))
+   return ERR_PTR(-EPERM);
+
+   /* check sanity of attributes */
+   if (attr->max_entries == 0 || attr->key_size != 4 ||
+   attr->value_size != 4 || attr->map_flags & ~BPF_F_NUMA_NODE)
+   return ERR_PTR(-EINVAL);
+
+   cmap = kzalloc(sizeof(*cmap), GFP_USER);
+   if (!cmap)
+   return ERR_PTR(-ENOMEM);
+
+   /* mandatory map attributes */
+   cmap->map.map_type = attr->map_type;
+   cmap->map.key_size = attr->key_size;
+   cmap->map.value_size = attr->value_size;
+   cmap->map.max_entries = attr->max_entries;
+   cmap->map.map_flags = attr->map_flags;
+   cmap->map.numa_node = bpf_map_attr_numa_node(attr);
+
+   /* Pre-limit array size based on NR_CPUS, not final CPU check */
+   if (cmap->map.max_entries > NR_CPUS)


Nit: needs to be >= NR_CPUS.


+   return ERR_PTR(-E2BIG);
+
+   /* make sure page count doesn't overflow */
+   cost = (u64) cmap->map.max_entries * sizeof(struct bpf_cpu_map_entry *);
+   cost += cpu_map_bitmap_size(attr) * num_possible_cpus();
+   if (cost >= U32_MAX - PAGE_SIZE)
+   goto free_cmap;
+   cmap->map.pages = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT;
+
+   /* Notice: returns -EPERM if map size is larger than memlock limit */
+   err = bpf_map_precharge_memlock(cmap->map.pages);
+   if (err)
+   goto free_cmap;


... here, you need to set err = -ENOMEM.


+   /* A per cpu bitfield with a bit per possible CPU in map  */
+   cmap->flush_needed = __alloc_percpu(cpu_map_bitmap_size(attr),
+   __alignof__(unsigned long));
+   if (!cmap->flush_needed)
+   goto free_cmap;


Otherwise when we fail here or in the error case for bpf_map_area_alloc()
below, we still return 0 although it's really -ENOMEM. And returning 0
would mean that find_and_alloc_map() will miss this since it only tests
for IS_ERR(), and we'll crash later on thinking we have a valid map
pointer.
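The failure mode described here hinges on the kernel's ERR_PTR/IS_ERR encoding: ERR_PTR(0) is just NULL, which IS_ERR() does not flag. A minimal user-space sketch of those helpers (simplified from the kernel's include/linux/err.h, not a verbatim copy) shows why returning ERR_PTR(err) with err still 0 slips past the IS_ERR() test in the caller:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Simplified user-space versions of the kernel's ERR_PTR/IS_ERR
 * helpers (see include/linux/err.h).  Errno values are encoded as
 * pointers in the top MAX_ERRNO bytes of the address space. */
#define MAX_ERRNO 4095
#define IS_ERR_VALUE(x) ((unsigned long)(x) >= (unsigned long)-MAX_ERRNO)

static inline void *ERR_PTR(long error)
{
	return (void *)error;
}

static inline int IS_ERR(const void *ptr)
{
	return IS_ERR_VALUE((unsigned long)ptr);
}
```

So if the allocation path jumps to the error label with err never set, ERR_PTR(0) == NULL is returned, IS_ERR() reports "no error", and the caller proceeds to dereference what it believes is a valid map pointer.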


+   /* Alloc array for possible remote "destination" CPUs */
+   cmap->cpu_map = bpf_map_area_alloc(cmap->map.max_entries *
+  sizeof(struct bpf_cpu_map_entry *),
+  cmap->map.numa_node);
+   if (!cmap->cpu_map)
+   goto free_cmap;
+
+   return &cmap->map;
+free_cmap:
+   free_percpu(cmap->flush_needed);
+   kfree(cmap);
+   return ERR_PTR(err);
+}
+

[...]

+int cpu_map_update_elem(struct bpf_map *map, void *key, void *value,
+   u64 map_flags)
+{
+   struct bpf_cpu_map *cmap = container_of(map, struct bpf_cpu_map, map);
+   struct bpf_cpu_map_entry *rcpu;
+
+   /* Array index key corresponds to CPU number */
+   u32 key_cpu = *(u32 *)key;
+   /* Value is the queue size */
+   u32 qsize = *(u32 *)value;
+
+   /* Make sure CPU is a valid possible cpu */
+   if (!cpu_possible(key_cpu))
+   return -ENODEV;


Nit: cpu_possible() expects that key_cpu < NR_CPUS, otherwise you'd
access the bitmap out of bounds.

Better move the below test for 'key_cpu >= cmap->map.max_entries'
first: since on map alloc you enforce an upper limit of NR_CPUS on
max_entries, the above cpu_possible() test will then be valid, too.


+   if (unlikely(map_flags > BPF_EXIST))
+   return -EINVAL;
+   if (unlikely(key_cpu >= cmap->map.max_entries))
+   return -E2BIG;
+   if (unlikely(map_flags == BPF_NOEXIST))
+   return -EEXIST;
+   if (unlikely(qsize > 16384)) /* sanity limit on qsize */
+   return -EOVERFLOW;
+
+   if (qsize == 0) {
+   rcpu = NULL; /* Same as deleting */
+   } else {
+   /* Updating qsize causes re-allocation of bpf_cpu_map_entry */
+   rcpu = __cpu_map_entry_alloc(qsize, key_cpu, map->id);
+   if (!rcpu)
+   return -ENOMEM;
+   }
+   rcu_read_lock();
+   __cpu_map_entry_replace(cmap, key_cpu, rcpu);
+   rcu_read_unlock();
+   return 0;
+}
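The ordering point above can be shown with a small user-space model: a bitmap-backed cpu_possible() is only defined for indices below NR_CPUS, so the max_entries bound check has to come first. This is a sketch with simplified stand-ins (mock bitmap, hypothetical validate_key_cpu() helper), not the kernel implementation:

```c
#include <assert.h>
#include <errno.h>

#define NR_CPUS_MOCK 8	/* stand-in for NR_CPUS */

/* Mock possible-CPU bitmap; the real cpu_possible() indexes a bitmap
 * and is only defined for cpu < NR_CPUS. */
static const unsigned char possible[NR_CPUS_MOCK] = { 1, 1, 1, 1, 0, 0, 0, 0 };

static int mock_cpu_possible(unsigned int cpu)
{
	return possible[cpu];	/* out-of-bounds read if cpu >= NR_CPUS_MOCK */
}

/* Validation in the suggested order: bound the index against
 * max_entries (itself capped at NR_CPUS on map alloc) before the
 * bitmap lookup, so mock_cpu_possible() is never probed out of range. */
static int validate_key_cpu(unsigned int key_cpu, unsigned int max_entries)
{
	if (key_cpu >= max_entries)
		return -E2BIG;
	if (!mock_cpu_possible(key_cpu))
		return -ENODEV;
	return 0;
}
```

With the checks in the original order, a key like 100 would index the possible-CPU bitmap before the bounds test ever runs.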

[...]

+struct bpf_cpu_map_entry *__cpu_map_lookup_elem(struct bpf_map *map, u32 key)
+{
+   struct bpf_cpu_map *cmap = container_of(map, struct bpf_cpu_map, map);
+   struct bpf_cpu_map_entry *rcpu;
+
+   if (key >= map->max_entries)
+   return NULL;
+
+   rcpu = READ_ONCE(cmap->cpu_map[key]);
+   return rcpu;
+}
+
+static void *cpu_map_lookup_elem(struct bpf_map *map, void *key)
+{
+   struct bpf_cpu_map_entry *rcpu =
+   __cpu_map_lookup_elem(map, *(u32 *)key);

[net-next V5 PATCH 1/5] bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP

2017-10-06 Thread Jesper Dangaard Brouer
The 'cpumap' is primarily used as a backend map for the XDP BPF helper
call bpf_redirect_map() and the XDP_REDIRECT action, like 'devmap'.

This patch implements the main part of the map.  It is not connected to
the XDP redirect system yet, and no SKB allocation is done yet.

The main concern in this patch is to ensure the datapath can run
without any locking.  This adds complexity to the setup and tear-down
procedure, whose assumptions are extra carefully documented in the
code comments.

V2:
 - make sure array isn't larger than NR_CPUS
 - make sure CPUs added is a valid possible CPU

V3: fix nitpicks from Jakub Kicinski 

V5:
 - Restrict map allocation to root / CAP_SYS_ADMIN
 - WARN_ON_ONCE if queue is not empty on tear-down
 - Return -EPERM on memlock limit instead of -ENOMEM
 - Error code in __cpu_map_entry_alloc() also handle ptr_ring_cleanup()
 - Moved cpu_map_enqueue() to next patch

Signed-off-by: Jesper Dangaard Brouer 
---
 include/linux/bpf_types.h  |1 
 include/uapi/linux/bpf.h   |1 
 kernel/bpf/Makefile|1 
 kernel/bpf/cpumap.c|  540 
 kernel/bpf/syscall.c   |8 +
 tools/include/uapi/linux/bpf.h |1 
 6 files changed, 551 insertions(+), 1 deletion(-)
 create mode 100644 kernel/bpf/cpumap.c

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 6f1a567667b8..814c1081a4a9 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -41,4 +41,5 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_DEVMAP, dev_map_ops)
 #ifdef CONFIG_STREAM_PARSER
 BPF_MAP_TYPE(BPF_MAP_TYPE_SOCKMAP, sock_map_ops)
 #endif
+BPF_MAP_TYPE(BPF_MAP_TYPE_CPUMAP, cpu_map_ops)
 #endif
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 6d2137b4cf38..03f8e2827a95 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -111,6 +111,7 @@ enum bpf_map_type {
BPF_MAP_TYPE_HASH_OF_MAPS,
BPF_MAP_TYPE_DEVMAP,
BPF_MAP_TYPE_SOCKMAP,
+   BPF_MAP_TYPE_CPUMAP,
 };
 
 enum bpf_prog_type {
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 897daa005b23..dba0bd33a43c 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -4,6 +4,7 @@ obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o
 obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o
 ifeq ($(CONFIG_NET),y)
 obj-$(CONFIG_BPF_SYSCALL) += devmap.o
+obj-$(CONFIG_BPF_SYSCALL) += cpumap.o
 ifeq ($(CONFIG_STREAM_PARSER),y)
 obj-$(CONFIG_BPF_SYSCALL) += sockmap.o
 endif
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
new file mode 100644
index ..d3e921620097
--- /dev/null
+++ b/kernel/bpf/cpumap.c
@@ -0,0 +1,540 @@
+/* bpf/cpumap.c
+ *
+ * Copyright (c) 2017 Jesper Dangaard Brouer, Red Hat Inc.
+ * Released under terms in GPL version 2.  See COPYING.
+ */
+
+/* The 'cpumap' is primarily used as a backend map for the XDP BPF helper
+ * call bpf_redirect_map() and the XDP_REDIRECT action, like 'devmap'.
+ *
+ * Unlike devmap, which redirects XDP frames out another NIC device,
+ * this map type redirects raw XDP frames to another CPU.  The remote
+ * CPU will do the SKB allocation and call the normal network stack.
+ *
+ * This is a scalability and isolation mechanism that allows
+ * separating the early driver network XDP layer from the rest of the
+ * netstack, and assigning dedicated CPUs for this stage.  This
+ * basically allows for 10G wirespeed pre-filtering via bpf.
+ */
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+
+/* General idea: XDP packets getting XDP redirected to another CPU
+ * will at most be stored/queued for one driver ->poll() call.  It is
+ * guaranteed that setting the flush bit and the flush operation happen
+ * on the same CPU.  Thus, the cpu_map_flush operation can deduce via
+ * this_cpu_ptr() which queue in bpf_cpu_map_entry contains packets.
+ */
+
+#define CPU_MAP_BULK_SIZE 8  /* 8 == one cacheline on 64-bit archs */
+struct xdp_bulk_queue {
+   void *q[CPU_MAP_BULK_SIZE];
+   unsigned int count;
+};
+
+/* Struct for every remote "destination" CPU in map */
+struct bpf_cpu_map_entry {
+   u32 cpu;/* kthread CPU and map index */
+   int map_id; /* Back reference to map */
+   u32 qsize;  /* Redundant queue size for map lookup */
+
+   /* XDP can run multiple RX-ring queues, need __percpu enqueue store */
+   struct xdp_bulk_queue __percpu *bulkq;
+
+   /* Queue with potential multi-producers, and single-consumer kthread */
+   struct ptr_ring *queue;
+   struct task_struct *kthread;
+   struct work_struct kthread_stop_wq;
+
+   atomic_t refcnt; /* Control when this struct can be free'ed */
+   struct rcu_head rcu;
+};
+
+struct bpf_cpu_map {
+   struct bpf_map map;
+   /* Below members specific for map type */
+   struct bpf_cpu_map_entry **cpu_map;
+   unsigned long __percpu