Re: [PATCH 0/4 POC] Allow executing code and syscalls in another address space

2021-04-14 Thread Jann Horn
 On Wed, Apr 14, 2021 at 2:20 PM Florian Weimer  wrote:
>
> * Jann Horn:
>
> > On Wed, Apr 14, 2021 at 12:27 PM Florian Weimer  wrote:
> >>
> >> * Andrei Vagin:
> >>
> >> > We already have process_vm_readv and process_vm_writev to read and write
> >> > to a process memory faster than we can do this with ptrace. And now it
> >> > is time for process_vm_exec that allows executing code in an address
> >> > space of another process. We can do this with ptrace but it is much
> >> > slower.
> >> >
> >> > = Use-cases =
> >>
> >> We also have some vaguely related within the same address space: running
> >> code on another thread, without modifying its stack, while it has signal
> >> handlers blocked, and without causing system calls to fail with EINTR.
> >> This can be used to implement certain kinds of memory barriers.
> >
> > That's what the membarrier() syscall is for, right? Unless you don't
> > want to register all threads for expedited membarrier use?
>
> membarrier is not sufficiently powerful for revoking biased locks, for
> example.

But on Linux >=5.10, together with rseq, it is, right? Then lock
acquisition could look roughly like this, in pseudo-C (yes, I know,
real rseq doesn't quite look like that, you'd need inline asm for that
unless the compiler adds special support for this):


enum local_state {
  STATE_FREE_OR_BIASED,
  STATE_LOCKED
};
#define OWNER_LOCKBIT (1U<<31)
#define OWNER_WAITER_BIT (1U<<30) /* notify futex when OWNER_LOCKBIT
is cleared */
struct biased_lock {
  unsigned int owner_with_lockbit;
  enum local_state local_state;
};

void lock(struct biased_lock *L) {
  unsigned int my_tid = THREAD_SELF->tid;
  RSEQ_SEQUENCE_START(); // restart here on failure
  if (READ_ONCE(L->owner_with_lockbit) == my_tid) {
if (READ_ONCE(L->local_state) == STATE_LOCKED) {
  RSEQ_SEQUENCE_END();
  /*
   * Deadlock, abort execution.
   * Note that we are not necessarily actually *holding* the lock;
   * this can also happen if we entered a signal handler while we
   * were in the process of acquiring the lock.
   * But in that case it could just as well have happened that we
   * already grabbed the lock, so the caller is wrong anyway.
   */
  fatal_error();
}
RSEQ_COMMIT(L->local_state = STATE_LOCKED);
return; /* fastpath success */
  }
  RSEQ_SEQUENCE_END();

  /* slowpath */
  /* acquire and lock owner field */
  unsigned int old_owner_with_lockbit;
  while (1) {
old_owner_with_lockbit = READ_ONCE(L->owner_with_lockbit);
if (old_owner_with_lockbit & OWNER_LOCKBIT) {
  if (!__sync_bool_compare_and_swap (&L->owner_with_lockbit,
old_owner_with_lockbit, my_tid | OWNER_LOCKBIT | OWNER_WAITER_BIT))
   continue;
  futex(&L->owner_with_lockbit, FUTEX_WAIT,
old_owner_with_lockbit, NULL, NULL, 0);
  continue;
} else {
  if (__sync_bool_compare_and_swap (&L->owner_with_lockbit,
old_owner_with_lockbit, my_tid | OWNER_LOCKBIT))
break;
}
  }

  /*
   * ensure old owner won't lock local_state anymore.
   * we only have to worry about the owner that directly preceded us here;
   * it will have done this step for the owners that preceded it before clearing
   * the LOCKBIT; so if we were the old owner, we don't have to sync.
   */
  if (old_owner_with_lockbit != my_tid) {
if (membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ, 0, 0))
  fatal_error();
  }

  /*
   * As soon as the lock becomes STATE_FREE_OR_BIASED, we own it; but
   * at this point it might still be locked.
   */
  while (READ_ONCE(L->local_state) == STATE_LOCKED) {
    futex(&L->local_state, FUTEX_WAIT, STATE_LOCKED, NULL, NULL, 0);
  }

  /* OK, now the lock is biased to us and we can grab it. */
  WRITE_ONCE(L->local_state, STATE_LOCKED);

  /* drop lockbit */
  while (1) {
    old_owner_with_lockbit = READ_ONCE(L->owner_with_lockbit);
    if (__sync_bool_compare_and_swap (&L->owner_with_lockbit,
                                      old_owner_with_lockbit, my_tid))
      break;
  }
  if (old_owner_with_lockbit & OWNER_WAITER_BIT)
    futex(&L->owner_with_lockbit, FUTEX_WAKE, INT_MAX, NULL, NULL, 0);
}

void unlock(struct biased_lock *L) {
  unsigned int my_tid = THREAD_SELF->tid;

  /*
   * If we run before the membarrier(), the lock() path will immediately
   * see the lock as uncontended, and we don't need to call futex().
   * If we run after the membarrier(), the ->owner_with_lockbit read
   * here will observe the new owner and we'll wake the futex.
   */
  RSEQ_SEQUENCE_START();
  unsigned int old_owner_with_lockbit = READ_ONCE(L->owner_with_lockbit);
  RSEQ_COMMIT(WRITE_ONCE(L->local_state, STATE_FREE_OR_BIASED));
  if (old_owner_with_lockbit != my_tid)
    futex(&L->local_state, FUTEX_WAKE, INT_MAX, NULL, NULL, 0);
}


Re: [PATCH 0/4 POC] Allow executing code and syscalls in another address space

2021-04-14 Thread Jann Horn
On Wed, Apr 14, 2021 at 12:27 PM Florian Weimer  wrote:
>
> * Andrei Vagin:
>
> > We already have process_vm_readv and process_vm_writev to read and write
> > to a process memory faster than we can do this with ptrace. And now it
> > is time for process_vm_exec that allows executing code in an address
> > space of another process. We can do this with ptrace but it is much
> > slower.
> >
> > = Use-cases =
>
> We also have some vaguely related within the same address space: running
> code on another thread, without modifying its stack, while it has signal
> handlers blocked, and without causing system calls to fail with EINTR.
> This can be used to implement certain kinds of memory barriers.

That's what the membarrier() syscall is for, right? Unless you don't
want to register all threads for expedited membarrier use?

> It is
> also necessary to implement set*id with POSIX semantics in userspace.
> (Linux only changes the current thread credentials, POSIX requires
> process-wide changes.)  We currently use a signal for set*id, but it has
> issues (it can be blocked, the signal could come from somewhere, etc.).
> We can't use signals for barriers because of the EINTR issue, and
> because the signal context is stored on the stack.

This essentially becomes a question of "how much is set*id allowed to
block, and what level of guarantee should there be, by the time it
returns, that no threads will perform privileged actions anymore",
right?

Like, if some piece of kernel code grabs a pointer to the current
credentials or acquires a temporary reference to some privileged
resource, then blocks on reading an argument from userspace, and then
performs a privileged action using the previously-grabbed credentials
or resource, what behavior do you want? Should setuid() block until
that privileged action has completed? Should it abort that action
(which is kinda what you get with the signals approach)? Should it
just return immediately even though an attacker who can write to
process memory at that point might still be able to influence a
privileged operation that hasn't read all its inputs yet? Should the
kernel be designed to keep track of whether it is currently holding a
privileged resource? Or should the kernel just specifically permit
credential changes in specific places where it is known that a task
might block for a long time and it is not holding any privileged
resources (kinda like the approach taken for freezer stuff)?

If userspace wants multithreaded setuid() without syscall aborting,
things get gnarly really fast; and having an interface to remotely
perform operations under another task's context isn't really relevant
to the core problem here, I think.


Re: [PATCH 0/4 POC] Allow executing code and syscalls in another address space

2021-04-14 Thread Jann Horn
On Wed, Apr 14, 2021 at 7:59 AM Andrei Vagin  wrote:
> We already have process_vm_readv and process_vm_writev to read and write
> to a process memory faster than we can do this with ptrace. And now it
> is time for process_vm_exec that allows executing code in an address
> space of another process. We can do this with ptrace but it is much
> slower.
>
> = Use-cases =

It seems to me like your proposed API doesn't really fit either one of
those usecases well...

> Here are two known use-cases. The first one is “application kernel”
> sandboxes like User-mode Linux and gVisor. In this case, we have a
> process that runs the sandbox kernel and a set of stub processes that
> are used to manage guest address spaces. Guest code is executed in the
> context of stub processes but all system calls are intercepted and
> handled in the sandbox kernel. Right now, these sort of sandboxes use
> PTRACE_SYSEMU to trap system calls, but the process_vm_exec can
> significantly speed them up.

In this case, since you really only want an mm_struct to run code
under, it seems weird to create a whole task with its own PID and so
on. It seems to me like something similar to the /dev/kvm API would be
more appropriate here? Implementation options that I see for that
would be:

1. mm_struct-based:
  a set of syscalls to create a new mm_struct,
  change memory mappings under that mm_struct, and switch to it
2. pagetable-mirroring-based:
  like /dev/kvm, an API to create a new pagetable, mirror parts of
  the mm_struct's pagetables over into it with modified permissions
  (like KVM_SET_USER_MEMORY_REGION),
  and run code under that context.
  page fault handling would first handle the fault against mm->pgd
  as normal, then mirror the PTE over into the secondary pagetables.
  invalidation could be handled with MMU notifiers.
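
For reference, a minimal sketch of the /dev/kvm pattern that option 2 is
modeled on (real KVM ioctls; error handling omitted, untested here) - a
process_vm_exec-style API could plausibly follow the same
create-context / map-memory / run-until-trap shape:

/*
 * Sketch of the /dev/kvm model referenced in option 2: create a separate
 * address space, mirror a host buffer into it, and run code under that
 * context until it traps back to the caller.
 */
#include <fcntl.h>
#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

int setup_guest_context(void)
{
  int kvm = open("/dev/kvm", O_RDWR);
  int vm = ioctl(kvm, KVM_CREATE_VM, 0);

  /* back guest physical address 0 with an anonymous host buffer */
  void *mem = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  struct kvm_userspace_memory_region region = {
    .slot = 0,
    .guest_phys_addr = 0,
    .memory_size = 0x1000,
    .userspace_addr = (unsigned long)mem,
  };
  ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region);

  /* a vcpu is the execution context; registers are set via KVM_SET_REGS/
   * KVM_SET_SREGS and the guest is entered with KVM_RUN */
  return ioctl(vm, KVM_CREATE_VCPU, 0);
}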

> Another use-case is CRIU (Checkpoint/Restore in User-space). Several
> process properties can be received only from the process itself. Right
> now, we use a parasite code that is injected into the process. We do
> this with ptrace but it is slow, unsafe, and tricky.

But this API will only let you run code under the *mm* of the target
process, not fully in the context of a target *task*, right? So you
still won't be able to use this for accessing anything other than
memory? That doesn't seem very generically useful to me.

Also, I don't doubt that anything involving ptrace is kinda tricky,
but it would be nice to have some more detail on what exactly makes
this slow, unsafe and tricky. Are there API additions for ptrace that
would make this work better? I imagine you're thinking of things like
an API for injecting a syscall into the target process without having
to first somehow find an existing SYSCALL instruction in the target
process?

> process_vm_exec can
> simplify the process of injecting a parasite code and it will allow
> pre-dump memory without stopping processes. The pre-dump here is when we
> enable a memory tracker and dump the memory while a process is continue
> running. On each interaction we dump memory that has been changed from
> the previous iteration. In the final step, we will stop processes and
> dump their full state. Right now the most effective way to dump process
> memory is to create a set of pipes and splice memory into these pipes
> from the parasite code. With process_vm_exec, we will be able to call
> vmsplice directly. It means that we will not need to stop a process to
> inject the parasite code.

Alternatively you could add splice support to /proc/$pid/mem or add a
syscall similar to process_vm_readv() that splices into a pipe, right?


Re: [PATCH v1 2/5] mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault/prealloc memory

2021-03-30 Thread Jann Horn
On Tue, Mar 30, 2021 at 5:01 PM David Hildenbrand  wrote:
> >> +long faultin_vma_page_range(struct vm_area_struct *vma, unsigned long 
> >> start,
> >> +   unsigned long end, bool write, int *locked)
> >> +{
> >> +   struct mm_struct *mm = vma->vm_mm;
> >> +   unsigned long nr_pages = (end - start) / PAGE_SIZE;
> >> +   int gup_flags;
> >> +
> >> +   VM_BUG_ON(!PAGE_ALIGNED(start));
> >> +   VM_BUG_ON(!PAGE_ALIGNED(end));
> >> +   VM_BUG_ON_VMA(start < vma->vm_start, vma);
> >> +   VM_BUG_ON_VMA(end > vma->vm_end, vma);
> >> +   mmap_assert_locked(mm);
> >> +
> >> +   /*
> >> +* FOLL_HWPOISON: Return -EHWPOISON instead of -EFAULT when we hit
> >> +*a poisoned page.
> >> +* FOLL_POPULATE: Always populate memory with VM_LOCKONFAULT.
> >> +* !FOLL_FORCE: Require proper access permissions.
> >> +*/
> >> +   gup_flags = FOLL_TOUCH | FOLL_POPULATE | FOLL_MLOCK | 
> >> FOLL_HWPOISON;
> >> +   if (write)
> >> +   gup_flags |= FOLL_WRITE;
> >> +
> >> +   /*
> >> +* See check_vma_flags(): Will return -EFAULT on incompatible 
> >> mappings
> >> +* or with insufficient permissions.
> >> +*/
> >> +   return __get_user_pages(mm, start, nr_pages, gup_flags,
> >> +   NULL, NULL, locked);
> >
> > You mentioned in the commit message that you don't want to actually
> > dirty all the file pages and force writeback; but doesn't
> > POPULATE_WRITE still do exactly that? In follow_page_pte(), if
> > FOLL_TOUCH and FOLL_WRITE are set, we mark the page as dirty:
>
> Well, I mention that POPULATE_READ explicitly doesn't do that. I
> primarily set it because populate_vma_page_range() also sets it.
>
> Is it safe to *not* set it? IOW, fault something writable into a page
> table (where the CPU could dirty it without additional page faults)
> without marking it accessed? For me, this made logically sense. Thus I
> also understood why populate_vma_page_range() set it.

FOLL_TOUCH doesn't have anything to do with installing the PTE - it
essentially means "the caller of get_user_pages wants to read/write
the contents of the returned page, so please do the same things you
would do if userspace was accessing the page". So in particular, if
you look up a page via get_user_pages() with FOLL_WRITE|FOLL_TOUCH,
that tells the MM subsystem "I will be writing into this page directly
from the kernel, bypassing the userspace page tables, so please mark
it as dirty now so that it will be properly written back later". Part
of that is that it marks the page as recently used, which has an
effect on LRU pageout behavior, I think - as far as I understand, that
is why populate_vma_page_range() uses FOLL_TOUCH.

If you look at __get_user_pages(), you can see that it is split up
into two major parts: faultin_page() for creating PTEs, and
follow_page_mask() for grabbing pages from PTEs. faultin_page()
ignores FOLL_TOUCH completely; only follow_page_mask() uses it.

In a way I guess maybe you do want the "mark as recently accessed"
part that FOLL_TOUCH would give you without FOLL_WRITE? But I think
you very much don't want the dirtying that FOLL_TOUCH|FOLL_WRITE leads
to. Maybe the ideal approach would be to add a new FOLL flag to say "I
only want to mark as recently used, I don't want to dirty". Or maybe
it's enough to just leave out the FOLL_TOUCH entirely, I don't know.
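
Concretely, the variant described above might look like this in
faultin_vma_page_range() - only a sketch, and whether dropping FOLL_TOUCH
for the write case is sufficient (or whether a new "mark accessed but
don't dirty" FOLL_* flag is needed) is exactly the open question:

  /* sketch: keep FOLL_TOUCH only where it cannot cause dirtying */
  gup_flags = FOLL_POPULATE | FOLL_MLOCK | FOLL_HWPOISON;
  if (write)
          gup_flags |= FOLL_WRITE;  /* no FOLL_TOUCH: don't mark pages dirty */
  else
          gup_flags |= FOLL_TOUCH;  /* still mark pages as recently used */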


Re: [PATCH v1 2/5] mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault/prealloc memory

2021-03-30 Thread Jann Horn
On Wed, Mar 17, 2021 at 12:07 PM David Hildenbrand  wrote:
> I. Background: Sparse Memory Mappings
>
> When we manage sparse memory mappings dynamically in user space - also
> sometimes involving MAP_NORESERVE - we want to dynamically populate/
> discard memory inside such a sparse memory region. Example users are
> hypervisors (especially implementing memory ballooning or similar
> technologies like virtio-mem) and memory allocators. In addition, we want
> to fail in a nice way (instead of generating SIGBUS) if populating does not
> succeed because we are out of backend memory (which can happen easily with
> file-based mappings, especially tmpfs and hugetlbfs).
>
> While MADV_DONTNEED, MADV_REMOVE and FALLOC_FL_PUNCH_HOLE allow for
> reliably discarding memory, there is no generic approach to populate
> page tables and preallocate memory.
>
> Although mmap() supports MAP_POPULATE, it is not applicable to the concept
> of sparse memory mappings, where we want to do populate/discard
> dynamically and avoid expensive/problematic remappings. In addition,
> we never actually report errors during the final populate phase - it is
> best-effort only.
>
> fallocate() can be used to preallocate file-based memory and fail in a safe
> way. However, it cannot really be used for any private mappings on
> anonymous files via memfd due to COW semantics. In addition, fallocate()
> does not actually populate page tables, so we still always get
> pagefaults on first access - which is sometimes undesired (i.e., real-time
> workloads) and requires real prefaulting of page tables, not just a
> preallocation of backend storage. There might be interesting use cases
> for sparse memory regions along with mlockall(MCL_ONFAULT) which
> fallocate() cannot satisfy as it does not prefault page tables.
>
> II. On preallcoation/prefaulting from user space
>
> Because we don't have a proper interface, what applications
> (like QEMU and databases) end up doing is touching (i.e., reading+writing
> one byte to not overwrite existing data) all individual pages.
>
> However, that approach
> 1) Can result in wear on storage backing, because we end up writing
>and thereby dirtying each page --- i.e., disks or pmem.
> 2) Can result in mmap_sem contention when prefaulting via multiple
>threads.
> 3) Requires expensive signal handling, especially to catch SIGBUS in case
>of hugetlbfs/shmem/file-backed memory. For example, this is
>problematic in hypervisors like QEMU where SIGBUS handlers might already
>be used by other subsystems concurrently to e.g, handle hardware errors.
>"Simply" doing preallocation concurrently from other thread is not that
>easy.
>
> III. On MADV_WILLNEED
>
> Extending MADV_WILLNEED is not an option because
> 1. It would change the semantics: "Expect access in the near future." and
>"might be a good idea to read some pages" vs. "Definitely populate/
>preallocate all memory and definitely fail on errors.".
> 2. Existing users (like virtio-balloon in QEMU when deflating the balloon)
>don't want populate/prealloc semantics. They treat this rather as a hint
>to give a little performance boost without too much overhead - and don't
>expect that a lot of memory might get consumed or a lot of time
>might be spent.
>
> IV. MADV_POPULATE_READ and MADV_POPULATE_WRITE
>
> Let's introduce MADV_POPULATE_READ and MADV_POPULATE_WRITE with the
> following semantics:
> 1. MADV_POPULATE_READ can be used to preallocate backend memory and
>prefault page tables just like manually reading each individual page.
>This will not break any COW mappings -- e.g., it will populate the
>shared zeropage when applicable.

Please clarify what is meant by "backend memory". As far as I can tell
from looking at the code, MADV_POPULATE_READ on file mappings will
allocate zeroed memory in the page cache, and map it as readonly pages
into userspace, but any attempt to actually write to that memory will
trigger the filesystem's ->page_mkwrite handler; and e.g. ext4 will
only try to allocate disk blocks at that point, which may fail. So as
far as I can tell, for files on filesystems like ext4, the current
implementation of MADV_POPULATE_READ does not replace fallocate(). Am
I missing something?

If the desired semantics are that disk blocks should be preallocated,
I think you may have to look up the ->vm_file and then internally call
vfs_fallocate() to address this, kinda like in madvise_remove()?
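
To make that concrete, a rough sketch loosely modeled on madvise_remove()
(the helper name is made up, and the locking/refcounting around dropping
mmap_lock is omitted):

static int madvise_populate_fallocate(struct vm_area_struct *vma,
                                      unsigned long start, unsigned long end)
{
        loff_t offset = (loff_t)(start - vma->vm_start) +
                        ((loff_t)vma->vm_pgoff << PAGE_SHIFT);

        if (!vma->vm_file)
                return 0;       /* anonymous mapping: nothing to preallocate */

        /* allocate backing blocks without changing the file size */
        return vfs_fallocate(vma->vm_file, FALLOC_FL_KEEP_SIZE,
                             offset, end - start);
}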

> 2. If MADV_POPULATE_READ succeeds, all page tables have been populated
>(prefaulted) readable once.
> 3. MADV_POPULATE_WRITE can be used to preallocate backend memory and
>prefault page tables just like manually writing (or
>reading+writing) each individual page. This will break any COW
>mappings -- e.g., the shared zeropage is never populated.
> 4. If MADV_POPULATE_WRITE succeeds, all page tables have been populated
>(prefaulted) writable once.
> 5. MADV_POPULATE_READ and 

Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state

2021-03-26 Thread Jann Horn
On Sun, Feb 21, 2021 at 7:56 PM Chang S. Bae  wrote:
> Intel's Extended Feature Disable (XFD) feature is an extension of the XSAVE
> architecture. XFD allows the kernel to enable a feature state in XCR0 and
> to receive a #NM trap when a task uses instructions accessing that state.
> In this way, Linux can defer allocating the large XSAVE buffer until tasks
> need it.
>
> XFD introduces two MSRs: IA32_XFD to enable/disable the feature and
> IA32_XFD_ERR to assist the #NM trap handler. Both use the same
> state-component bitmap format, used by XCR0.
>
> Use this hardware capability to find the right time to expand the xstate
> buffer. Introduce two sets of helper functions for that:
>
> 1. The first set is primarily for interacting with the XFD hardware:
> xdisable_setbits()
> xdisable_getbits()
> xdisable_switch()
>
> 2. The second set is for managing the first-use status and handling #NM
>trap:
> xfirstuse_enabled()
> xfirstuse_not_detected()
>
> The #NM handler induces the xstate buffer expansion to save the first-used
> states.
[...]
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index 7f5aec758f0e..821a7f408ad4 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
[...]
> +static __always_inline bool handle_xfirstuse_event(struct fpu *fpu)
> +{
> +   bool handled = false;
> +   u64 event_mask;
[...]
> +   if (alloc_xstate_buffer(fpu, event_mask))
> +   return handled;
[...]
> +}
> +
>  DEFINE_IDTENTRY(exc_device_not_available)
>  {
> unsigned long cr0 = read_cr0();
>
> +   if (handle_xfirstuse_event(&current->thread.fpu))
> +   return;

What happens if handle_xfirstuse_event() fails because vmalloc()
failed in alloc_xstate_buffer()? I think that should probably kill the
task with something like force_sig() - but as far as I can tell, at
the moment, it will instead end up at die(), which should only be used
for kernel bugs.
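
In other words, something along these lines in handle_xfirstuse_event() -
just a sketch of the suggestion, and the exact signal is debatable:

        if (alloc_xstate_buffer(fpu, event_mask)) {
                /* allocation failure is not a kernel bug: kill the task
                 * instead of letting the #NM fall through to die() */
                force_sig(SIGSEGV);
                return true;
        }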

> +
>  #ifdef CONFIG_MATH_EMULATION
> if (!boot_cpu_has(X86_FEATURE_FPU) && (cr0 & X86_CR0_EM)) {
> struct math_emu_info info = { };
> --
> 2.17.1
>
>


ARM FDPIC_FUNCPTRS personality flag handling looks broken

2021-03-25 Thread Jann Horn
Hi!

Tavis noticed that on ARM kernels with CONFIG_BINFMT_ELF_FDPIC, it
looks like the FDPIC_FUNCPTRS personality flag is not reset on
execve(). This would mean that if a process first executes an ELF
FDPIC binary (which forces the personality to PER_LINUX_FDPIC), and
then executes a non-FDPIC binary, the signal handling code
(setup_return()) will have bogus behavior (interpreting a normal
function pointer as an FDPIC function handle).

I think FDPIC_FUNCPTRS should probably either be reset on every
execve() or not be a personality flag at all (since AFAIU pretty much
the whole point of personality flags is that they control behavior
even across execve()).

(I don't have an FDPIC toolchain, so I'm writing this solely based on
having read the code, without having actually tested it.)
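
For illustration, the "reset on execve()" option could be as small as
clearing the flag wherever the ELF loader sets up the personality for a
non-FDPIC binary; the placement is hypothetical and untested:

        /* don't inherit FDPIC function-descriptor semantics from a
         * previously exec'd FDPIC binary */
        current->personality &= ~FDPIC_FUNCPTRS;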


Re: [PATCH v30 07/12] landlock: Support filesystem access-control

2021-03-23 Thread Jann Horn
On Tue, Mar 23, 2021 at 8:22 PM Mickaël Salaün  wrote:
> On 23/03/2021 18:49, Jann Horn wrote:
> > On Tue, Mar 23, 2021 at 4:54 PM Mickaël Salaün  wrote:
> >> On 23/03/2021 01:13, Jann Horn wrote:
> >>>  On Tue, Mar 16, 2021 at 9:43 PM Mickaël Salaün  wrote:
> >>>> Using Landlock objects and ruleset, it is possible to tag inodes
> >>>> according to a process's domain.
> >>> [...]
> >>>> +static void release_inode(struct landlock_object *const object)
> >>>> +   __releases(object->lock)
> >>>> +{
> >>>> +   struct inode *const inode = object->underobj;
> >>>> +   struct super_block *sb;
> >>>> +
> >>>> +   if (!inode) {
> >>>> +   spin_unlock(&object->lock);
> >>>> +   return;
> >>>> +   }
> >>>> +
> >>>> +   /*
> >>>> +* Protects against concurrent use by hook_sb_delete() of the 
> >>>> reference
> >>>> +* to the underlying inode.
> >>>> +*/
> >>>> +   object->underobj = NULL;
> >>>> +   /*
> >>>> +* Makes sure that if the filesystem is concurrently unmounted,
> >>>> +* hook_sb_delete() will wait for us to finish iput().
> >>>> +*/
> >>>> +   sb = inode->i_sb;
> >>>> +   atomic_long_inc(&landlock_superblock(sb)->inode_refs);
> >>>> +   spin_unlock(&object->lock);
> >>>> +   /*
> >>>> +* Because object->underobj was not NULL, hook_sb_delete() and
> >>>> +* get_inode_object() guarantee that it is safe to reset
> >>>> +* landlock_inode(inode)->object while it is not NULL.  It is 
> >>>> therefore
> >>>> +* not necessary to lock inode->i_lock.
> >>>> +*/
> >>>> +   rcu_assign_pointer(landlock_inode(inode)->object, NULL);
> >>>> +   /*
> >>>> +* Now, new rules can safely be tied to @inode with 
> >>>> get_inode_object().
> >>>> +*/
> >>>> +
> >>>> +   iput(inode);
> >>>> +   if 
> >>>> (atomic_long_dec_and_test(&landlock_superblock(sb)->inode_refs))
> >>>> wake_up_var(&landlock_superblock(sb)->inode_refs);
> >>>> +}
> >>> [...]
> >>>> +static struct landlock_object *get_inode_object(struct inode *const 
> >>>> inode)
> >>>> +{
> >>>> +   struct landlock_object *object, *new_object;
> >>>> +   struct landlock_inode_security *inode_sec = 
> >>>> landlock_inode(inode);
> >>>> +
> >>>> +   rcu_read_lock();
> >>>> +retry:
> >>>> +   object = rcu_dereference(inode_sec->object);
> >>>> +   if (object) {
> >>>> +   if (likely(refcount_inc_not_zero(&object->usage))) {
> >>>> +   rcu_read_unlock();
> >>>> +   return object;
> >>>> +   }
> >>>> +   /*
> >>>> +* We are racing with release_inode(), the object is 
> >>>> going
> >>>> +* away.  Wait for release_inode(), then retry.
> >>>> +*/
> >>>> +   spin_lock(&object->lock);
> >>>> +   spin_unlock(&object->lock);
> >>>> +   goto retry;
> >>>> +   }
> >>>> +   rcu_read_unlock();
> >>>> +
> >>>> +   /*
> >>>> +* If there is no object tied to @inode, then create a new one 
> >>>> (without
> >>>> +* holding any locks).
> >>>> +*/
> >>>> +   new_object = landlock_create_object(&landlock_fs_underops, 
> >>>> inode);
> >>>> +   if (IS_ERR(new_object))
> >>>> +   return new_object;
> >>>> +
> >>>> +   /* Protects against concurrent get_inode_object() calls. */
> >>>> +   spin_lock(&inode->i_lock);
> >>>> +   object = rcu_dereference_protected(inode_sec->object,
> >>>> +   lockdep_is_held(&inode->i_lock));
> >>>
> >>> rcu_derefer

Re: [PATCH v30 07/12] landlock: Support filesystem access-control

2021-03-23 Thread Jann Horn
On Tue, Mar 23, 2021 at 4:54 PM Mickaël Salaün  wrote:
> On 23/03/2021 01:13, Jann Horn wrote:
> >  On Tue, Mar 16, 2021 at 9:43 PM Mickaël Salaün  wrote:
> >> Using Landlock objects and ruleset, it is possible to tag inodes
> >> according to a process's domain.
> > [...]
> >> +static void release_inode(struct landlock_object *const object)
> >> +   __releases(object->lock)
> >> +{
> >> +   struct inode *const inode = object->underobj;
> >> +   struct super_block *sb;
> >> +
> >> +   if (!inode) {
> >> +   spin_unlock(&object->lock);
> >> +   return;
> >> +   }
> >> +
> >> +   /*
> >> +* Protects against concurrent use by hook_sb_delete() of the 
> >> reference
> >> +* to the underlying inode.
> >> +*/
> >> +   object->underobj = NULL;
> >> +   /*
> >> +* Makes sure that if the filesystem is concurrently unmounted,
> >> +* hook_sb_delete() will wait for us to finish iput().
> >> +*/
> >> +   sb = inode->i_sb;
> >> +   atomic_long_inc(&landlock_superblock(sb)->inode_refs);
> >> +   spin_unlock(&object->lock);
> >> +   /*
> >> +* Because object->underobj was not NULL, hook_sb_delete() and
> >> +* get_inode_object() guarantee that it is safe to reset
> >> +* landlock_inode(inode)->object while it is not NULL.  It is 
> >> therefore
> >> +* not necessary to lock inode->i_lock.
> >> +*/
> >> +   rcu_assign_pointer(landlock_inode(inode)->object, NULL);
> >> +   /*
> >> +* Now, new rules can safely be tied to @inode with 
> >> get_inode_object().
> >> +*/
> >> +
> >> +   iput(inode);
> >> +   if (atomic_long_dec_and_test(&landlock_superblock(sb)->inode_refs))
> >> +   wake_up_var(&landlock_superblock(sb)->inode_refs);
> >> +}
> > [...]
> >> +static struct landlock_object *get_inode_object(struct inode *const inode)
> >> +{
> >> +   struct landlock_object *object, *new_object;
> >> +   struct landlock_inode_security *inode_sec = landlock_inode(inode);
> >> +
> >> +   rcu_read_lock();
> >> +retry:
> >> +   object = rcu_dereference(inode_sec->object);
> >> +   if (object) {
> >> +   if (likely(refcount_inc_not_zero(&object->usage))) {
> >> +   rcu_read_unlock();
> >> +   return object;
> >> +   }
> >> +   /*
> >> +* We are racing with release_inode(), the object is going
> >> +* away.  Wait for release_inode(), then retry.
> >> +*/
> >> +   spin_lock(&object->lock);
> >> +   spin_unlock(&object->lock);
> >> +   goto retry;
> >> +   }
> >> +   rcu_read_unlock();
> >> +
> >> +   /*
> >> +* If there is no object tied to @inode, then create a new one 
> >> (without
> >> +* holding any locks).
> >> +*/
> >> +   new_object = landlock_create_object(&landlock_fs_underops, inode);
> >> +   if (IS_ERR(new_object))
> >> +   return new_object;
> >> +
> >> +   /* Protects against concurrent get_inode_object() calls. */
> >> +   spin_lock(&inode->i_lock);
> >> +   object = rcu_dereference_protected(inode_sec->object,
> >> +   lockdep_is_held(&inode->i_lock));
> >
> > rcu_dereference_protected() requires that inode_sec->object is not
> > concurrently changed, but I think another thread could call
> > get_inode_object() while we're in landlock_create_object(), and then
> > we could race with the NULL write in release_inode() here? (It
> > wouldn't actually be a UAF though because we're not actually accessing
> > `object` here.) Or am I missing a lock that prevents this?
> >
> > In v28 this wasn't an issue because release_inode() was holding
> > inode->i_lock (and object->lock) during the NULL store; but in v29 and
> > this version the NULL store in release_inode() moved out of the locked
> > region. I think you could just move the NULL store in release_inode()
> > back up (and maybe add a comment explaining the locking rules for
> > landlock_inode(...)->object)?
> >
> 

Re: [PATCH v30 07/12] landlock: Support filesystem access-control

2021-03-22 Thread Jann Horn
 On Tue, Mar 16, 2021 at 9:43 PM Mickaël Salaün  wrote:
> Using Landlock objects and ruleset, it is possible to tag inodes
> according to a process's domain.
[...]
> +static void release_inode(struct landlock_object *const object)
> +   __releases(object->lock)
> +{
> +   struct inode *const inode = object->underobj;
> +   struct super_block *sb;
> +
> +   if (!inode) {
> +   spin_unlock(&object->lock);
> +   return;
> +   }
> +
> +   /*
> +* Protects against concurrent use by hook_sb_delete() of the 
> reference
> +* to the underlying inode.
> +*/
> +   object->underobj = NULL;
> +   /*
> +* Makes sure that if the filesystem is concurrently unmounted,
> +* hook_sb_delete() will wait for us to finish iput().
> +*/
> +   sb = inode->i_sb;
> +   atomic_long_inc(&landlock_superblock(sb)->inode_refs);
> +   spin_unlock(&object->lock);
> +   /*
> +* Because object->underobj was not NULL, hook_sb_delete() and
> +* get_inode_object() guarantee that it is safe to reset
> +* landlock_inode(inode)->object while it is not NULL.  It is 
> therefore
> +* not necessary to lock inode->i_lock.
> +*/
> +   rcu_assign_pointer(landlock_inode(inode)->object, NULL);
> +   /*
> +* Now, new rules can safely be tied to @inode with 
> get_inode_object().
> +*/
> +
> +   iput(inode);
> +   if (atomic_long_dec_and_test(&landlock_superblock(sb)->inode_refs))
> +   wake_up_var(&landlock_superblock(sb)->inode_refs);
> +}
[...]
> +static struct landlock_object *get_inode_object(struct inode *const inode)
> +{
> +   struct landlock_object *object, *new_object;
> +   struct landlock_inode_security *inode_sec = landlock_inode(inode);
> +
> +   rcu_read_lock();
> +retry:
> +   object = rcu_dereference(inode_sec->object);
> +   if (object) {
> +   if (likely(refcount_inc_not_zero(&object->usage))) {
> +   rcu_read_unlock();
> +   return object;
> +   }
> +   /*
> +* We are racing with release_inode(), the object is going
> +* away.  Wait for release_inode(), then retry.
> +*/
> +   spin_lock(&object->lock);
> +   spin_unlock(&object->lock);
> +   goto retry;
> +   }
> +   rcu_read_unlock();
> +
> +   /*
> +* If there is no object tied to @inode, then create a new one 
> (without
> +* holding any locks).
> +*/
> +   new_object = landlock_create_object(&landlock_fs_underops, inode);
> +   if (IS_ERR(new_object))
> +   return new_object;
> +
> +   /* Protects against concurrent get_inode_object() calls. */
> +   spin_lock(&inode->i_lock);
> +   object = rcu_dereference_protected(inode_sec->object,
> +   lockdep_is_held(&inode->i_lock));

rcu_dereference_protected() requires that inode_sec->object is not
concurrently changed, but I think another thread could call
get_inode_object() while we're in landlock_create_object(), and then
we could race with the NULL write in release_inode() here? (It
wouldn't actually be a UAF though because we're not actually accessing
`object` here.) Or am I missing a lock that prevents this?

In v28 this wasn't an issue because release_inode() was holding
inode->i_lock (and object->lock) during the NULL store; but in v29 and
this version the NULL store in release_inode() moved out of the locked
region. I think you could just move the NULL store in release_inode()
back up (and maybe add a comment explaining the locking rules for
landlock_inode(...)->object)?

(Or alternatively you could use rcu_dereference_raw() with a comment
explaining that the read pointer is only used to check for NULL-ness,
and that it is guaranteed that the pointer can't change if it is NULL
and we're holding the lock. But that'd be needlessly complicated, I
think.)
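
To spell out the "move the NULL store back up" option, release_inode()
would roughly become the following (untested sketch; whether it also needs
to go back under inode->i_lock as in v28 is a separate question):

        object->underobj = NULL;
        sb = inode->i_sb;
        atomic_long_inc(&landlock_superblock(sb)->inode_refs);
        /* NULL store moved back inside the object->lock-protected region */
        rcu_assign_pointer(landlock_inode(inode)->object, NULL);
        spin_unlock(&object->lock);

        iput(inode);
        if (atomic_long_dec_and_test(&landlock_superblock(sb)->inode_refs))
                wake_up_var(&landlock_superblock(sb)->inode_refs);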


> +   if (unlikely(object)) {
> +   /* Someone else just created the object, bail out and retry. 
> */
> +   spin_unlock(&inode->i_lock);
> +   kfree(new_object);
> +
> +   rcu_read_lock();
> +   goto retry;
> +   }
> +
> +   rcu_assign_pointer(inode_sec->object, new_object);
> +   /*
> +* @inode will be released by hook_sb_delete() on its superblock
> +* shutdown.
> +*/
> +   ihold(inode);
> +   spin_unlock(&inode->i_lock);
> +   return new_object;
> +}


Re: [PATCH v30 02/12] landlock: Add ruleset and domain management

2021-03-22 Thread Jann Horn
On Tue, Mar 16, 2021 at 9:43 PM Mickaël Salaün  wrote:
> A Landlock ruleset is mainly a red-black tree with Landlock rules as
> nodes.  This enables quick update and lookup to match a requested
> access, e.g. to a file.  A ruleset is usable through a dedicated file
> descriptor (cf. following commit implementing syscalls) which enables a
> process to create and populate a ruleset with new rules.
>
> A domain is a ruleset tied to a set of processes.  This group of rules
> defines the security policy enforced on these processes and their future
> children.  A domain can transition to a new domain which is the
> intersection of all its constraints and those of a ruleset provided by
> the current process.  This modification only impact the current process.
> This means that a process can only gain more constraints (i.e. lose
> accesses) over time.
>
> Cc: James Morris 
> Cc: Jann Horn 
> Cc: Kees Cook 
> Signed-off-by: Mickaël Salaün 
> Acked-by: Serge Hallyn 
> Link: https://lore.kernel.org/r/20210316204252.427806-3-...@digikod.net

Reviewed-by: Jann Horn 


Re: [PATCH v3 0/3] Binder: Enable App Freezing Capability

2021-03-17 Thread Jann Horn
On Wed, Mar 17, 2021 at 7:00 PM Christian Brauner
 wrote:
> On Mon, Mar 15, 2021 at 06:16:27PM -0700, Li Li wrote:
> > To improve the user experience when switching between recently used
> > applications, the background applications which are not currently needed
> > are cached in the memory. Normally, a well designed application will not
> > consume valuable CPU resources in the background. However, it's possible
> > some applications are not able or willing to behave as expected, wasting
> > energy even after being cached.
> >
> > It is a good idea to freeze those applications when they're only being
> > kept alive for the sake of faster startup and energy saving. These kernel
> > patches will provide the necessary infrastructure for user space framework
> > to freeze and thaw a cached process, check the current freezing status and
> > correctly deal with outstanding binder transactions to frozen processes.

I just have some comments on the overall design:

This seems a bit convoluted to me; and I'm not sure whether this is
really something the kernel should get involved in, or whether this
patchset is operating at the right layer.

If there are non-binder threads that are misbehaving, could you
instead stop all those threads in pure userspace code (e.g. by sending
a thread-directed signal to all of them and letting the signal handler
sleep on a futex); and if the binder thread receives a transaction
that should be handled, wake up those threads again?
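
As a rough illustration of that userspace-only approach (untested sketch;
signal choice, handler registration via sigaction() and error handling are
all simplified):

#include <limits.h>
#include <linux/futex.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

static int frozen;      /* futex word: 1 = threads should park, 0 = run */

static void park_handler(int sig)       /* installed for SIGUSR1 */
{
        (void)sig;
        /* each "misbehaving" thread parks itself here until thawed */
        while (__atomic_load_n(&frozen, __ATOMIC_ACQUIRE) == 1)
                syscall(SYS_futex, &frozen, FUTEX_WAIT, 1, NULL, NULL, 0);
}

static void freeze_thread(pid_t tid)
{
        __atomic_store_n(&frozen, 1, __ATOMIC_RELEASE);
        syscall(SYS_tgkill, getpid(), tid, SIGUSR1);    /* thread-directed */
}

static void thaw_all(void)
{
        __atomic_store_n(&frozen, 0, __ATOMIC_RELEASE);
        syscall(SYS_futex, &frozen, FUTEX_WAKE, INT_MAX, NULL, NULL, 0);
}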

Or alternatively you could detect that the application is being woken
up frequently even though it's supposed to be idle (e.g. using
information from procfs), and kill it since you consider it to be
misbehaving?

Or if there are specific usage patterns you see frequently that you
consider to be wasting CPU resources (e.g. setting an interval timer
that fires in short intervals), you could try to delay such timers.


With your current approach, you're baking the assumption that all IPC
goes through binder into the kernel API; things like passing a file
descriptor to a pipe through binder or using shared futexes are no
longer usable for cross-process communication without making more
kernel changes. I'm not sure whether that's a good idea. On top of
that, if you freeze a process while it is in the middle of some
operation, resources associated with the operation will probably stay
in use for quite some time; for example, if an app is in the middle of
downloading some data over HTTP, and you freeze it, this may cause the
TCP connection to remain active and consume resources for send/receive
buffers on both the device and the server.


Re: [Intel-wired-lan] [PATCH][next] ixgbe: Fix out-of-bounds warning in ixgbe_host_interface_command()

2021-03-17 Thread Jann Horn
On Wed, Mar 17, 2021 at 9:04 PM Gustavo A. R. Silva
 wrote:
> On 3/17/21 13:57, Jann Horn wrote:
> >>>> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_common.c 
> >>>> b/drivers/net/ethernet/intel/ixgbe/ixgbe_common.c
> >>>> index 62ddb452f862..bff3dc1af702 100644
> >>>> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_common.c
> >>>> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_common.c
> >>>> @@ -3679,7 +3679,7 @@ s32 ixgbe_host_interface_command(struct ixgbe_hw 
> >>>> *hw, void *buffer,
> >>>> u32 hdr_size = sizeof(struct ixgbe_hic_hdr);
> >>>> union {
> >>>> struct ixgbe_hic_hdr hdr;
> >>>> -   u32 u32arr[1];
> >>>> +   u32 *u32arr;
> >>>> } *bp = buffer;
> >>>> u16 buf_len, dword_len;
> >>>> s32 status;
> >>>
> >>> This looks bogus. An array is inline, a pointer points elsewhere -
> >>> they're not interchangeable.
> >>
> >> Yep; but in this case these are the only places in the code where _u32arr_ 
> >> is
> >> being used:
> >>
> >> 3707 /* first pull in the header so we know the buffer length */
> >> 3708 for (bi = 0; bi < dword_len; bi++) {
> >> 3709 bp->u32arr[bi] = IXGBE_READ_REG_ARRAY(hw, 
> >> IXGBE_FLEX_MNG, bi);
> >> 3710 le32_to_cpus(&bp->u32arr[bi]);
> >> 3711 }
> >
> > So now line 3709 means: Read a pointer from bp->u32arr (the value
> > being read from there is not actually a valid pointer), and write to
> > that pointer at offset `bi`. I don't see how that line could execute
> > without crashing.
>
> Yeah; you're right. I see my confusion now. Apparently, there is no escape
> from allocating heap memory to fix this issue, as I was proposing in my
> last email.

Why? Can't you do something like this?

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_common.c
b/drivers/net/ethernet/intel/ixgbe/ixgbe_common.c
index 62ddb452f862..768fa124105b 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_common.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_common.c
@@ -3677,10 +3677,8 @@ s32 ixgbe_host_interface_command(struct
ixgbe_hw *hw, void *buffer,
 bool return_data)
 {
u32 hdr_size = sizeof(struct ixgbe_hic_hdr);
-   union {
-   struct ixgbe_hic_hdr hdr;
-   u32 u32arr[1];
-   } *bp = buffer;
+   u32 *bp = buffer;
+   struct ixgbe_hic_hdr hdr;
u16 buf_len, dword_len;
s32 status;
u32 bi;
@@ -3706,12 +3704,13 @@ s32 ixgbe_host_interface_command(struct
ixgbe_hw *hw, void *buffer,

/* first pull in the header so we know the buffer length */
for (bi = 0; bi < dword_len; bi++) {
-   bp->u32arr[bi] = IXGBE_READ_REG_ARRAY(hw, IXGBE_FLEX_MNG, bi);
-   le32_to_cpus(&bp->u32arr[bi]);
+   bp[bi] = IXGBE_READ_REG_ARRAY(hw, IXGBE_FLEX_MNG, bi);
+   le32_to_cpus(&bp[bi]);
}

/* If there is any thing in data position pull it in */
-   buf_len = bp->hdr.buf_len;
+   memcpy(&hdr, bp, sizeof(hdr));
+   buf_len = hdr.buf_len;
if (!buf_len)
goto rel_out;

@@ -3726,8 +3725,8 @@ s32 ixgbe_host_interface_command(struct ixgbe_hw
*hw, void *buffer,

/* Pull in the rest of the buffer (bi is where we left off) */
for (; bi <= dword_len; bi++) {
-   bp->u32arr[bi] = IXGBE_READ_REG_ARRAY(hw, IXGBE_FLEX_MNG, bi);
-   le32_to_cpus(&bp->u32arr[bi]);
+   bp[bi] = IXGBE_READ_REG_ARRAY(hw, IXGBE_FLEX_MNG, bi);
+   le32_to_cpus(&bp[bi]);
}

 rel_out:


Re: [Intel-wired-lan] [PATCH][next] ixgbe: Fix out-of-bounds warning in ixgbe_host_interface_command()

2021-03-17 Thread Jann Horn
On Wed, Mar 17, 2021 at 7:27 PM Gustavo A. R. Silva
 wrote:
> On 3/17/21 12:11, Jann Horn wrote:
> > On Wed, Mar 17, 2021 at 8:43 AM Gustavo A. R. Silva
> >  wrote:
> >> Fix the following out-of-bounds warning by replacing the one-element
> >> array in an anonymous union with a pointer:
> >>
> >>   CC [M]  drivers/net/ethernet/intel/ixgbe/ixgbe_common.o
> >> drivers/net/ethernet/intel/ixgbe/ixgbe_common.c: In function 
> >> ‘ixgbe_host_interface_command’:
> >> drivers/net/ethernet/intel/ixgbe/ixgbe_common.c:3729:13: warning: array 
> >> subscript 1 is above array bounds of ‘u32[1]’ {aka ‘unsigned int[1]’} 
> >> [-Warray-bounds]
> >>  3729 |   bp->u32arr[bi] = IXGBE_READ_REG_ARRAY(hw, IXGBE_FLEX_MNG, bi);
> >>   |   ~~^~~~
> >> drivers/net/ethernet/intel/ixgbe/ixgbe_common.c:3682:7: note: while 
> >> referencing ‘u32arr’
> >>  3682 |   u32 u32arr[1];
> >>   |   ^~
> >>
> >> This helps with the ongoing efforts to globally enable -Warray-bounds.
> >>
> >> Notice that, the usual approach to fix these sorts of issues is to
> >> replace the one-element array with a flexible-array member. However,
> >> flexible arrays should not be used in unions. That, together with the
> >> fact that the array notation is not being affected in any ways, is why
> >> the pointer approach was chosen in this case.
> >>
> >> Link: https://github.com/KSPP/linux/issues/109
> >> Signed-off-by: Gustavo A. R. Silva 
> >> ---
> >>  drivers/net/ethernet/intel/ixgbe/ixgbe_common.c | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_common.c 
> >> b/drivers/net/ethernet/intel/ixgbe/ixgbe_common.c
> >> index 62ddb452f862..bff3dc1af702 100644
> >> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_common.c
> >> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_common.c
> >> @@ -3679,7 +3679,7 @@ s32 ixgbe_host_interface_command(struct ixgbe_hw 
> >> *hw, void *buffer,
> >> u32 hdr_size = sizeof(struct ixgbe_hic_hdr);
> >> union {
> >> struct ixgbe_hic_hdr hdr;
> >> -   u32 u32arr[1];
> >> +   u32 *u32arr;
> >> } *bp = buffer;
> >> u16 buf_len, dword_len;
> >> s32 status;
> >
> > This looks bogus. An array is inline, a pointer points elsewhere -
> > they're not interchangeable.
>
> Yep; but in this case these are the only places in the code where _u32arr_ is
> being used:
>
> 3707 /* first pull in the header so we know the buffer length */
> 3708 for (bi = 0; bi < dword_len; bi++) {
> 3709 bp->u32arr[bi] = IXGBE_READ_REG_ARRAY(hw, 
> IXGBE_FLEX_MNG, bi);
> 3710 le32_to_cpus(&bp->u32arr[bi]);
> 3711 }

So now line 3709 means: Read a pointer from bp->u32arr (the value
being read from there is not actually a valid pointer), and write to
that pointer at offset `bi`. I don't see how that line could execute
without crashing.


Re: [PATCH][next] ixgbe: Fix out-of-bounds warning in ixgbe_host_interface_command()

2021-03-17 Thread Jann Horn
On Wed, Mar 17, 2021 at 8:43 AM Gustavo A. R. Silva
 wrote:
> Fix the following out-of-bounds warning by replacing the one-element
> array in an anonymous union with a pointer:
>
>   CC [M]  drivers/net/ethernet/intel/ixgbe/ixgbe_common.o
> drivers/net/ethernet/intel/ixgbe/ixgbe_common.c: In function 
> ‘ixgbe_host_interface_command’:
> drivers/net/ethernet/intel/ixgbe/ixgbe_common.c:3729:13: warning: array 
> subscript 1 is above array bounds of ‘u32[1]’ {aka ‘unsigned int[1]’} 
> [-Warray-bounds]
>  3729 |   bp->u32arr[bi] = IXGBE_READ_REG_ARRAY(hw, IXGBE_FLEX_MNG, bi);
>   |   ~~^~~~
> drivers/net/ethernet/intel/ixgbe/ixgbe_common.c:3682:7: note: while 
> referencing ‘u32arr’
>  3682 |   u32 u32arr[1];
>   |   ^~
>
> This helps with the ongoing efforts to globally enable -Warray-bounds.
>
> Notice that, the usual approach to fix these sorts of issues is to
> replace the one-element array with a flexible-array member. However,
> flexible arrays should not be used in unions. That, together with the
> fact that the array notation is not being affected in any ways, is why
> the pointer approach was chosen in this case.
>
> Link: https://github.com/KSPP/linux/issues/109
> Signed-off-by: Gustavo A. R. Silva 
> ---
>  drivers/net/ethernet/intel/ixgbe/ixgbe_common.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_common.c 
> b/drivers/net/ethernet/intel/ixgbe/ixgbe_common.c
> index 62ddb452f862..bff3dc1af702 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_common.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_common.c
> @@ -3679,7 +3679,7 @@ s32 ixgbe_host_interface_command(struct ixgbe_hw *hw, 
> void *buffer,
> u32 hdr_size = sizeof(struct ixgbe_hic_hdr);
> union {
> struct ixgbe_hic_hdr hdr;
> -   u32 u32arr[1];
> +   u32 *u32arr;
> } *bp = buffer;
> u16 buf_len, dword_len;
> s32 status;

This looks bogus. An array is inline, a pointer points elsewhere -
they're not interchangeable.


Re: [PATCH v4 1/1] fs: Allow no_new_privs tasks to call chroot(2)

2021-03-16 Thread Jann Horn
On Tue, Mar 16, 2021 at 8:26 PM Mickaël Salaün  wrote:
> On 16/03/2021 20:04, Jann Horn wrote:
> > On Tue, Mar 16, 2021 at 6:02 PM Mickaël Salaün  wrote:
> >> One could argue that chroot(2) is useless without a properly populated
> >> root hierarchy (i.e. without /dev and /proc).  However, there are
> >> multiple use cases that don't require the chrooting process to create
> >> file hierarchies with special files nor mount points, e.g.:
> >> * A process sandboxing itself, once all its libraries are loaded, may
> >>   not need files other than regular files, or even no file at all.
> >> * Some pre-populated root hierarchies could be used to chroot into,
> >>   provided for instance by development environments or tailored
> >>   distributions.
> >> * Processes executed in a chroot may not require access to these special
> >>   files (e.g. with minimal runtimes, or by emulating some special files
> >>   with a LD_PRELOADed library or seccomp).
> >>
> >> Unprivileged chroot is especially interesting for userspace developers
> >> wishing to harden their applications.  For instance, chroot(2) and Yama
> >> enable to build a capability-based security (i.e. remove filesystem
> >> ambient accesses) by calling chroot/chdir with an empty directory and
> >> accessing data through dedicated file descriptors obtained with
> >> openat2(2) and RESOLVE_BENEATH/RESOLVE_IN_ROOT/RESOLVE_NO_MAGICLINKS.
> >
> > I don't entirely understand. Are you writing this with the assumption
> > that a future change will make it possible to set these RESOLVE flags
> > process-wide, or something like that?
>
> No, this scenario is for applications willing to sandbox themselves and
> only use the FDs to access legitimate data.

But if you're chrooted to /proc/self/fdinfo and have an fd to some
directory - let's say /home/user/Downloads - there is nothing that
ensures that you only use that fd with RESOLVE_BENEATH, right? If the
application is compromised, it can do something like openat(fd,
"../.bashrc", O_RDWR), right? Or am I missing something?

> > As long as that doesn't exist, I think that to make this safe, you'd
> > have to do something like the following - let a child process set up a
> > new mount namespace for you, and then chroot() into that namespace's
> > root:
> >
> > struct shared_data {
> >   int root_fd;
> > };
> > int helper_fn(void *args) {
> >   struct shared_data *shared = args;
> >   mount("none", "/tmp", "tmpfs", MS_NOSUID|MS_NODEV, "");
> >   mkdir("/tmp/old_root", 0700);
> >   pivot_root("/tmp", "/tmp/old_root");
> >   umount("/tmp/old_root", "");
> >   shared->root_fd = open("/", O_PATH);
> > }
> > void setup_chroot() {
> >   struct shared_data shared = {};
> >   prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
> >   clone(helper_fn, my_stack,
> > CLONE_VFORK|CLONE_VM|CLONE_FILES|CLONE_NEWUSER|CLONE_NEWNS|SIGCHLD,
> > NULL);
> >   fchdir(shared.root_fd);
> >   chroot(".");
> > }
>
> What about this?
> chdir("/proc/self/fdinfo");
> chroot(".");
> close(all unnecessary FDs);

That breaks down if you can e.g. get a unix domain socket connected to
a process in a different chroot, right? Isn't that a bit too fragile?


Re: [PATCH v4 1/1] fs: Allow no_new_privs tasks to call chroot(2)

2021-03-16 Thread Jann Horn
On Tue, Mar 16, 2021 at 6:02 PM Mickaël Salaün  wrote:
> One could argue that chroot(2) is useless without a properly populated
> root hierarchy (i.e. without /dev and /proc).  However, there are
> multiple use cases that don't require the chrooting process to create
> file hierarchies with special files nor mount points, e.g.:
> * A process sandboxing itself, once all its libraries are loaded, may
>   not need files other than regular files, or even no file at all.
> * Some pre-populated root hierarchies could be used to chroot into,
>   provided for instance by development environments or tailored
>   distributions.
> * Processes executed in a chroot may not require access to these special
>   files (e.g. with minimal runtimes, or by emulating some special files
>   with a LD_PRELOADed library or seccomp).
>
> Unprivileged chroot is especially interesting for userspace developers
> wishing to harden their applications.  For instance, chroot(2) and Yama
> enable to build a capability-based security (i.e. remove filesystem
> ambient accesses) by calling chroot/chdir with an empty directory and
> accessing data through dedicated file descriptors obtained with
> openat2(2) and RESOLVE_BENEATH/RESOLVE_IN_ROOT/RESOLVE_NO_MAGICLINKS.

I don't entirely understand. Are you writing this with the assumption
that a future change will make it possible to set these RESOLVE flags
process-wide, or something like that?


As long as that doesn't exist, I think that to make this safe, you'd
have to do something like the following - let a child process set up a
new mount namespace for you, and then chroot() into that namespace's
root:

struct shared_data {
  int root_fd;
};
int helper_fn(void *args) {
  struct shared_data *shared = args;
  mount("none", "/tmp", "tmpfs", MS_NOSUID|MS_NODEV, "");
  mkdir("/tmp/old_root", 0700);
  pivot_root("/tmp", "/tmp/old_root");
  umount("/tmp/old_root", "");
  shared->root_fd = open("/", O_PATH);
}
void setup_chroot() {
  struct shared_data shared = {};
  prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
  clone(helper_fn, my_stack,
CLONE_VFORK|CLONE_VM|CLONE_FILES|CLONE_NEWUSER|CLONE_NEWNS|SIGCHLD,
NULL);
  fchdir(shared.root_fd);
  chroot(".");
}

[...]
> diff --git a/fs/open.c b/fs/open.c
[...]
> +static inline int current_chroot_allowed(void)
> +{
> +   /*
> +* Changing the root directory for the calling task (and its future
> +* children) requires that this task has CAP_SYS_CHROOT in its
> +* namespace, or be running with no_new_privs and not sharing its
> +* fs_struct and not escaping its current root (cf. create_user_ns()).
> +* As for seccomp, checking no_new_privs avoids scenarios where
> +* unprivileged tasks can affect the behavior of privileged children.
> +*/
> +   if (task_no_new_privs(current) && current->fs->users == 1 &&

this read of current->fs->users should be using READ_ONCE()

> +   !current_chrooted())
> +   return 0;
> +   if (ns_capable(current_user_ns(), CAP_SYS_CHROOT))
> +   return 0;
> +   return -EPERM;
> +}
[...]
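
i.e. something like (sketch):

        if (task_no_new_privs(current) && READ_ONCE(current->fs->users) == 1 &&
            !current_chrooted())
                return 0;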

Overall I think this change is a good idea.


Re: [PATCH v2 1/1] fs: Allow no_new_privs tasks to call chroot(2)

2021-03-10 Thread Jann Horn
On Wed, Mar 10, 2021 at 8:23 PM Eric W. Biederman  wrote:
>
> Mickaël Salaün  writes:
>
> > From: Mickaël Salaün 
> >
> > Being able to easily change root directories enable to ease some
> > development workflow and can be used as a tool to strengthen
> > unprivileged security sandboxes.  chroot(2) is not an access-control
> > mechanism per se, but it can be used to limit the absolute view of the
> > filesystem, and then limit ways to access data and kernel interfaces
> > (e.g. /proc, /sys, /dev, etc.).
> >
> > Users may not wish to expose namespace complexity to potentially
> > malicious processes, or limit their use because of limited resources.
> > The chroot feature is much more simple (and limited) than the mount
> > namespace, but can still be useful.  As for containers, users of
> > chroot(2) should take care of file descriptors or data accessible by
> > other means (e.g. current working directory, leaked FDs, passed FDs,
> > devices, mount points, etc.).  There is a lot of literature that discuss
> > the limitations of chroot, and users of this feature should be aware of
> > the multiple ways to bypass it.  Using chroot(2) for security purposes
> > can make sense if it is combined with other features (e.g. dedicated
> > user, seccomp, LSM access-controls, etc.).
> >
> > One could argue that chroot(2) is useless without a properly populated
> > root hierarchy (i.e. without /dev and /proc).  However, there are
> > multiple use cases that don't require the chrooting process to create
> > file hierarchies with special files nor mount points, e.g.:
> > * A process sandboxing itself, once all its libraries are loaded, may
> >   not need files other than regular files, or even no file at all.
> > * Some pre-populated root hierarchies could be used to chroot into,
> >   provided for instance by development environments or tailored
> >   distributions.
> > * Processes executed in a chroot may not require access to these special
> >   files (e.g. with minimal runtimes, or by emulating some special files
> >   with a LD_PRELOADed library or seccomp).
> >
> > Allowing a task to change its own root directory is not a threat to the
> > system if we can prevent confused deputy attacks, which could be
> > performed through execution of SUID-like binaries.  This can be
> > prevented if the calling task sets PR_SET_NO_NEW_PRIVS on itself with
> > prctl(2).  To only affect this task, its filesystem information must not
> > be shared with other tasks, which can be achieved by not passing
> > CLONE_FS to clone(2).  A similar no_new_privs check is already used by
> > seccomp to avoid the same kind of security issues.  Furthermore, because
> > of its security use and to avoid giving a new way for attackers to get
> > out of a chroot (e.g. using /proc/<pid>/root), an unprivileged chroot is
> > only allowed if the new root directory is the same or beneath the
> > current one.  This still allows a process to use a subset of its
> > legitimate filesystem to chroot into and then further reduce its view of
> > the filesystem.
> >
> > This change may not impact systems relying on other permission models
> > than POSIX capabilities (e.g. Tomoyo).  Being able to use chroot(2) on
> > such systems may require to update their security policies.
> >
> > Only the chroot system call is relaxed with this no_new_privs check; the
> > init_chroot() helper doesn't require such change.
> >
> > Allowing unprivileged users to use chroot(2) is one of the initial
> > objectives of no_new_privs:
> > https://www.kernel.org/doc/html/latest/userspace-api/no_new_privs.html
> > This patch is a follow-up of a previous one sent by Andy Lutomirski, but
> > with less limitations:
> > https://lore.kernel.org/lkml/0e2f0f54e19bff53a3739ecfddb4ffa9a6dbde4d.1327858005.git.l...@amacapital.net/
[...]
> Neither is_path_beneath nor path_is_under really help prevent escapes,
> as except for open files and files accessible from proc chroot already
> disallows going up.  The reason is the path is resolved with the current
> root before switching to it.

Yeah, this probably should use the same check as the CLONE_NEWUSER
logic, current_chrooted() from CLONE_NEWUSER; that check is already
used for guarding against the following syscall sequence, which has
similar security properties:
unshare(CLONE_NEWUSER); // gives the current process namespaced CAP_SYS_ADMIN
chroot("<...>"); // succeeds because of namespaced CAP_SYS_ADMIN

The current_chrooted() check in create_user_ns() is for the same
purpose as the check you're introducing here, so they should use the
same logic.
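
For illustration, the userspace sequence this is meant to enable could look
roughly like this (untested sketch; the helper name is made up and error
handling is minimal):

#include <sched.h>
#include <sys/prctl.h>
#include <unistd.h>

/* hypothetical helper: chroot into a subdirectory of the current root
 * without privileges, under the rules described in the patch */
static int enter_unprivileged_chroot(const char *subdir)
{
    /* no_new_privs: rules out later privilege gain via SUID/SGID execution */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
        return -1;
    /* stop sharing fs information (root, cwd) with other tasks */
    if (unshare(CLONE_FS))
        return -1;
    /* the new root must be the current root or beneath it */
    if (chroot(subdir))
        return -1;
    return chdir("/");
}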


Re: [PATCH] procfs/dmabuf: Add /proc/<pid>/task/<tid>/dmabuf_fds

2021-01-27 Thread Jann Horn
+jeffv from Android

On Tue, Jan 26, 2021 at 11:51 PM Kalesh Singh  wrote:
> In order to measure how much memory a process actually consumes, it is
> necessary to include the DMA buffer sizes for that process in the memory
> accounting. Since the handle to DMA buffers are raw FDs, it is important
> to be able to identify which processes have FD references to a DMA buffer.

Or you could try to let the DMA buffer take a reference on the
mm_struct and account its size into the mm_struct? That would probably
be nicer to work with than having to poke around in procfs separately
for DMA buffers.

> Currently, DMA buffer FDs can be accounted using /proc/<pid>/fd/* and
> /proc/<pid>/fdinfo -- both of which are only root readable, as follows:

That's not quite right. They can both also be accessed by the user
owning the process. Also, fdinfo is a standard interface for
inspecting process state that doesn't permit reading process memory or
manipulating process state - so I think it would be fine to permit
access to fdinfo under a PTRACE_MODE_READ_FSCREDS check, just like the
interface you're suggesting.

>   1. Do a readlink on each FD.
>   2. If the target path begins with "/dmabuf", then the FD is a dmabuf FD.
>   3. stat the file to get the dmabuf inode number.
>   4. Read /proc/<pid>/fdinfo/<fd> to get the DMA buffer size.
>
> Android captures per-process system memory state when certain low memory
> events (e.g a foreground app kill) occur, to identify potential memory
> hoggers. To include a process’s dmabuf usage as part of its memory state,
> the data collection needs to be fast enough to reflect the memory state at
> the time of such events.
>
> Since reading /proc/<pid>/fd/ and /proc/<pid>/fdinfo/ requires root
> privileges, this approach is not suitable for production builds.

It should be easy to add enough information to /proc/<pid>/fdinfo/<fd> so
that you don't need to look at /proc/<pid>/fd/ anymore.
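
As a rough, untested sketch of what the resulting userspace side could look
like, assuming the dma-buf entries in /proc/<pid>/fdinfo/<fd> carry an
"exp_name:" line (marking them as dma-bufs) and a "size:" line:

#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* sum up dma-buf sizes for a target pid using only fdinfo */
static unsigned long long dmabuf_bytes(int pid)
{
    char dirpath[64], filepath[320], line[256];
    unsigned long long total = 0;
    struct dirent *ent;
    DIR *dir;

    snprintf(dirpath, sizeof(dirpath), "/proc/%d/fdinfo", pid);
    dir = opendir(dirpath);
    if (!dir)
        return 0;
    while ((ent = readdir(dir))) {
        unsigned long long size = 0;
        int is_dmabuf = 0;
        FILE *f;

        if (ent->d_name[0] == '.')
            continue;
        snprintf(filepath, sizeof(filepath), "%s/%s", dirpath, ent->d_name);
        f = fopen(filepath, "r");
        if (!f)
            continue;
        while (fgets(line, sizeof(line), f)) {
            if (!strncmp(line, "size:", 5))
                sscanf(line + 5, "%llu", &size);
            else if (!strncmp(line, "exp_name:", 9))
                is_dmabuf = 1;
        }
        fclose(f);
        if (is_dmabuf)
            total += size;
    }
    closedir(dir);
    return total;
}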

> Granting
> root privileges even to a system process increases the attack surface and
> is highly undesirable. Additionally this is slow as it requires many
> context switches for searching and getting the dma-buf info.

What do you mean by "context switches"? Task switches or kernel/user
transitions (e.g. via syscall)?

> With the addition of per-buffer dmabuf stats in sysfs [1], the DMA buffer
> details can be queried using their unique inode numbers.
>
> This patch proposes adding a /proc/<pid>/task/<tid>/dmabuf_fds interface.
>
> /proc/<pid>/task/<tid>/dmabuf_fds contains a list of inode numbers for
> every DMA buffer FD that the task has. Entries with the same inode
> number can appear more than once, indicating the total FD references
> for the associated DMA buffer.
>
> If a thread shares the same files as the group leader then its
> dmabuf_fds file will be empty, as these dmabufs are reported by the
> group leader.
>
> The interface requires PTRACE_MODE_READ_FSCREDS (same as /proc/<pid>/maps)
> and allows the efficient accounting of per-process DMA buffer usage without
> requiring root privileges. (See data below)

I'm not convinced that introducing a new procfs file for this is the
right way to go. And the idea of having to poke into multiple
different files in procfs and in sysfs just to be able to compute a
proper memory usage score for a process seems weird to me. "How much
memory is this process using" seems like the kind of question the
kernel ought to be able to answer (and the kernel needs to be able to
answer somewhat accurately so that its own OOM killer can do its job
properly)?


Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order

2021-01-22 Thread Jann Horn
On Fri, Jan 22, 2021 at 2:05 PM Jann Horn  wrote:
> On Thu, Jan 21, 2021 at 7:19 PM Vlastimil Babka  wrote:
> > On 1/21/21 11:01 AM, Christoph Lameter wrote:
> > > On Thu, 21 Jan 2021, Bharata B Rao wrote:
> > >
> > >> > The problem is that calculate_order() is called a number of times
> > >> > before secondary CPUs are booted and it returns 1 instead of 224.
> > >> > This makes the use of num_online_cpus() irrelevant for those cases
> > >> >
> > >> > After adding in my command line "slub_min_objects=36" which equals to
> > >> > 4 * (fls(num_online_cpus()) + 1) with a correct num_online_cpus == 224
> > >> > , the regression disappears:
> > >> >
> > >> > 9 iterations of hackbench -l 16000 -g 16: 3.201sec (+/- 0.90%)
> >
> > I'm surprised that hackbench is that sensitive to slab performance, anyway.
> > It's supposed to be a scheduler benchmark? What exactly is going on?
>
> Uuuh, I think powerpc doesn't have cmpxchg_double?
>
> "vgrep cmpxchg_double arch/" just spits out arm64, s390 and x86? And
> <https://liblfds.org/mediawiki/index.php?title=Article:CAS_and_LL/SC_Implementation_Details_by_Processor_family>
> says under "POWERPC": "no DW LL/SC"
>
> So powerpc is probably hitting the page-bitlock-based implementation
> all the time for stuff like __slab_free()? Do you have detailed
> profiling results from "perf top" or something like that?
>
> (I actually have some WIP patches and a design document for getting
> rid of cmpxchg_double in struct page that I hacked together in the
> last couple days; I'm currently in the process of sending them over to
> some other folks in the company who hopefully have cycles to
> review/polish/benchmark them so that they can be upstreamed, assuming
> that those folks think they're important enough. I don't have the
> cycles for it...)

(The stuff I have in mind will only work on 64-bit though. We are
talking about PPC64 here, right?)


Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order

2021-01-22 Thread Jann Horn
On Thu, Jan 21, 2021 at 7:19 PM Vlastimil Babka  wrote:
> On 1/21/21 11:01 AM, Christoph Lameter wrote:
> > On Thu, 21 Jan 2021, Bharata B Rao wrote:
> >
> >> > The problem is that calculate_order() is called a number of times
> >> > before secondary CPUs are booted and it returns 1 instead of 224.
> >> > This makes the use of num_online_cpus() irrelevant for those cases
> >> >
> >> > After adding in my command line "slub_min_objects=36" which equals to
> >> > 4 * (fls(num_online_cpus()) + 1) with a correct num_online_cpus == 224
> >> > , the regression disappears:
> >> >
> >> > 9 iterations of hackbench -l 16000 -g 16: 3.201sec (+/- 0.90%)
>
> I'm surprised that hackbench is that sensitive to slab performance, anyway.
> It's supposed to be a scheduler benchmark? What exactly is going on?

Uuuh, I think powerpc doesn't have cmpxchg_double?

"vgrep cmpxchg_double arch/" just spits out arm64, s390 and x86? And
<https://liblfds.org/mediawiki/index.php?title=Article:CAS_and_LL/SC_Implementation_Details_by_Processor_family>
says under "POWERPC": "no DW LL/SC"

So powerpc is probably hitting the page-bitlock-based implementation
all the time for stuff like __slab_free()? Do you have detailed
profiling results from "perf top" or something like that?

(I actually have some WIP patches and a design document for getting
rid of cmpxchg_double in struct page that I hacked together in the
last couple days; I'm currently in the process of sending them over to
some other folks in the company who hopefully have cycles to
review/polish/benchmark them so that they can be upstreamed, assuming
that those folks think they're important enough. I don't have the
cycles for it...)


Re: [PATCH v2 1/1] mm/madvise: replace ptrace attach requirement for process_madvise

2021-01-20 Thread Jann Horn
On Wed, Jan 13, 2021 at 3:22 PM Michal Hocko  wrote:
> On Tue 12-01-21 09:51:24, Suren Baghdasaryan wrote:
> > On Tue, Jan 12, 2021 at 9:45 AM Oleg Nesterov  wrote:
> > >
> > > On 01/12, Michal Hocko wrote:
> > > >
> > > > On Mon 11-01-21 09:06:22, Suren Baghdasaryan wrote:
> > > >
> > > > > What we want is the ability for one process to influence another process
> > > > > in order to optimize performance across the entire system while leaving
> > > > > the security boundary intact.
> > > > > Replace PTRACE_MODE_ATTACH with a combination of PTRACE_MODE_READ
> > > > > and CAP_SYS_NICE. PTRACE_MODE_READ to prevent leaking ASLR metadata
> > > > > and CAP_SYS_NICE for influencing process performance.
> > > >
> > > > I have to say that ptrace modes are rather obscure to me. So I cannot
> > > > really judge whether MODE_READ is sufficient. My understanding has
> > > > always been that this is required for RO access to the address space. But
> > > > this operation clearly has a visible side effect. Do we have any actual
> > > > documentation for the existing modes?
> > > >
> > > > I would be really curious to hear from Jann and Oleg (now Cced).
> > >
> > > Can't comment, sorry. I never understood these security checks and never tried.
> > > IIUC only selinux/etc can treat ATTACH/READ differently and I have no idea what
> > > is the difference.

Yama in particular only does its checks on ATTACH and ignores READ,
that's the difference you're probably most likely to encounter on a
normal desktop system, since some distros turn Yama on by default.
Basically the idea there is that running "gdb -p $pid" or "strace -p
$pid" as a normal user will usually fail, but reading /proc/$pid/maps
still works; so you can see things like detailed memory usage
information and such, but you're not supposed to be able to directly
peek into a running SSH client and inject data into the existing SSH
connection, or steal the cryptographic keys for the current
connection, or something like that.
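
A rough, untested sketch of that split from userspace, on a system with
/proc/sys/kernel/yama/ptrace_scope set to 1 and a same-UID target that is
not a descendant of the caller:

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/ptrace.h>
#include <sys/types.h>

static void probe(pid_t pid)
{
    char path[64];
    FILE *f;

    /* READ-style access: /proc/<pid>/maps is still readable */
    snprintf(path, sizeof(path), "/proc/%d/maps", pid);
    f = fopen(path, "r");
    printf("maps readable: %s\n", f ? "yes" : "no");
    if (f)
        fclose(f);

    /* ATTACH-style access: Yama rejects this with EPERM */
    if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) == -1)
        printf("attach failed: %s\n", strerror(errno));
    else
        ptrace(PTRACE_DETACH, pid, NULL, NULL); /* a real tracer would waitpid() first */
}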

> > I haven't seen a written explanation on ptrace modes but when I
> > consulted Jann his explanation was:
> >
> > PTRACE_MODE_READ means you can inspect metadata about processes with
> > the specified domain, across UID boundaries.
> > PTRACE_MODE_ATTACH means you can fully impersonate processes with the
> > specified domain, across UID boundaries.
>
> Maybe this would be a good start to document expectations. Some more
> practical examples where the difference is visible would be great as
> well.

Before documenting the behavior, it would be a good idea to figure out
what to do with perf_event_open(). That one's weird in that it only
requires PTRACE_MODE_READ, but actually allows you to sample stuff
like userspace stack and register contents (if perf_event_paranoid is
1 or 2). Maybe for SELinux things (and maybe also for Yama), there
should be a level in between that allows fully inspecting the process
(for purposes like profiling) but without the ability to corrupt its
memory or registers or things like that. Or maybe perf_event_open()
should just use the ATTACH mode.


Re: [PATCH] mm, slub: splice cpu and page freelists in deactivate_slab()

2021-01-15 Thread Jann Horn
On Fri, Jan 15, 2021 at 7:35 PM Vlastimil Babka  wrote:
> In deactivate_slab() we currently move all but one objects on the cpu freelist
> to the page freelist one by one using the costly cmpxchg_double() operation.
> Then we unfreeze the page while moving the last object on page freelist, with
> a final cmpxchg_double().
>
> This can be optimized to avoid the cmpxchg_double() per object. Just count the
> objects on cpu freelist (to adjust page->inuse properly) and also remember the
> last object in the chain. Then splice page->freelist to the last object and
> effectively add the whole cpu freelist to page->freelist while unfreezing the
> page, with a single cmpxchg_double().

This might have some more (good) effects, although these might well be
too minuscule to notice:

 - The old code inverted the direction of the freelist, while the new
code preserves the direction.
 - We're no longer dirtying the cachelines of objects in the middle of
the freelist.

In the current code it probably doesn't really matter, since I think
we basically only take this path when handling NUMA mismatches,
PFMEMALLOC stuff, racing new_slab(), and flush_slab() for handling
flushing IPIs? But yeah, if we want to start automatically sending
flush IPIs, it might be a good idea, given that the next accesses to
the page will probably come from a different CPU (unless the page is
entirely unused, in which case it may be freed to the page allocator's
percpu list) and we don't want to create unnecessary cache/memory
traffic. (And it's a good cleanup regardless, I think.)

> Signed-off-by: Vlastimil Babka 

Reviewed-by: Jann Horn 

[...]
> /*
> -* Stage two: Ensure that the page is unfrozen while the
> -* list presence reflects the actual number of objects
> -* during unfreeze.
> +* Stage two: Unfreeze the page while splicing the per-cpu
> +* freelist to the head of page's freelist.
> +*
> +* Ensure that the page is unfrozen while the list presence
> +* reflects the actual number of objects during unfreeze.

(my computer complains about trailing whitespace here)

>  *
>  * We setup the list membership and then perform a cmpxchg
>  * with the count. If there is a mismatch then the page


Re: [PATCH v26 07/12] landlock: Support filesystem access-control

2021-01-15 Thread Jann Horn
On Fri, Jan 15, 2021 at 10:10 AM Mickaël Salaün  wrote:
> On 14/01/2021 23:43, Jann Horn wrote:
> > On Thu, Jan 14, 2021 at 7:54 PM Mickaël Salaün  wrote:
> >> On 14/01/2021 04:22, Jann Horn wrote:
> >>> On Wed, Dec 9, 2020 at 8:28 PM Mickaël Salaün  wrote:
> >>>> Thanks to the Landlock objects and ruleset, it is possible to identify
> >>>> inodes according to a process's domain.  To enable an unprivileged
> >>>> process to express a file hierarchy, it first needs to open a directory
> >>>> (or a file) and pass this file descriptor to the kernel through
> >>>> landlock_add_rule(2).  When checking if a file access request is
> >>>> allowed, we walk from the requested dentry to the real root, following
> >>>> the different mount layers.  The access to each "tagged" inodes are
> >>>> collected according to their rule layer level, and ANDed to create
> >>>> access to the requested file hierarchy.  This makes possible to identify
> >>>> a lot of files without tagging every inodes nor modifying the
> >>>> filesystem, while still following the view and understanding the user
> >>>> has from the filesystem.
> >>>>
> >>>> Add a new ARCH_EPHEMERAL_INODES for UML because it currently does not
> >>>> keep the same struct inodes for the same inodes whereas these inodes are
> >>>> in use.
> >>>>
> >>>> This commit adds a minimal set of supported filesystem access-control
> >>>> which doesn't enable to restrict all file-related actions.  This is the
> >>>> result of multiple discussions to minimize the code of Landlock to ease
> >>>> review.  Thanks to the Landlock design, extending this access-control
> >>>> without breaking user space will not be a problem.  Moreover, seccomp
> >>>> filters can be used to restrict the use of syscall families which may
> >>>> not be currently handled by Landlock.
> >>> [...]
> >>>> +static bool check_access_path_continue(
> >>>> +   const struct landlock_ruleset *const domain,
> >>>> +   const struct path *const path, const u32 access_request,
> >>>> +   u64 *const layer_mask)
> >>>> +{
> >>> [...]
> >>>> +   /*
> >>>> +* An access is granted if, for each policy layer, at least one rule
> >>>> +* encountered on the pathwalk grants the access, regardless of their
> >>>> +* position in the layer stack.  We must then check not-yet-seen layers
> >>>> +* for each inode, from the last one added to the first one.
> >>>> +*/
> >>>> +   for (i = 0; i < rule->num_layers; i++) {
> >>>> +   const struct landlock_layer *const layer = &rule->layers[i];
> >>>> +   const u64 layer_level = BIT_ULL(layer->level - 1);
> >>>> +
> >>>> +   if (!(layer_level & *layer_mask))
> >>>> +   continue;
> >>>> +   if ((layer->access & access_request) != access_request)
> >>>> +   return false;
> >>>> +   *layer_mask &= ~layer_level;
> >>>
> >>> Hmm... shouldn't the last 5 lines be replaced by the following?
> >>>
> >>> if ((layer->access & access_request) == access_request)
> >>> *layer_mask &= ~layer_level;
> >>>
> >>> And then, since this function would always return true, you could
> >>> change its return type to "void".
> >>>
> >>>
> >>> As far as I can tell, the current version will still, if a ruleset
> >>> looks like this:
> >>>
> >>> /usr read+write
> >>> /usr/lib/ read
> >>>
> >>> reject write access to /usr/lib, right?
> >>
> >> If these two rules are from different layers, then yes it would work as
> >> intended. However, if these rules are from the same layer the path walk
> >> will not stop at /usr/lib but go down to /usr, which grants write
> >> access.
> >
> > I don't see why the code would do what you're saying it does. And an
> > experiment seems to confirm what I said; I checked out landlock-v26,
> > and

Re: [PATCH v26 07/12] landlock: Support filesystem access-control

2021-01-14 Thread Jann Horn
On Thu, Jan 14, 2021 at 7:54 PM Mickaël Salaün  wrote:
> On 14/01/2021 04:22, Jann Horn wrote:
> > On Wed, Dec 9, 2020 at 8:28 PM Mickaël Salaün  wrote:
> >> Thanks to the Landlock objects and ruleset, it is possible to identify
> >> inodes according to a process's domain.  To enable an unprivileged
> >> process to express a file hierarchy, it first needs to open a directory
> >> (or a file) and pass this file descriptor to the kernel through
> >> landlock_add_rule(2).  When checking if a file access request is
> >> allowed, we walk from the requested dentry to the real root, following
> >> the different mount layers.  The access to each "tagged" inodes are
> >> collected according to their rule layer level, and ANDed to create
> >> access to the requested file hierarchy.  This makes possible to identify
> >> a lot of files without tagging every inodes nor modifying the
> >> filesystem, while still following the view and understanding the user
> >> has from the filesystem.
> >>
> >> Add a new ARCH_EPHEMERAL_INODES for UML because it currently does not
> >> keep the same struct inodes for the same inodes whereas these inodes are
> >> in use.
> >>
> >> This commit adds a minimal set of supported filesystem access-control
> >> which doesn't enable to restrict all file-related actions.  This is the
> >> result of multiple discussions to minimize the code of Landlock to ease
> >> review.  Thanks to the Landlock design, extending this access-control
> >> without breaking user space will not be a problem.  Moreover, seccomp
> >> filters can be used to restrict the use of syscall families which may
> >> not be currently handled by Landlock.
> > [...]
> >> +static bool check_access_path_continue(
> >> +   const struct landlock_ruleset *const domain,
> >> +   const struct path *const path, const u32 access_request,
> >> +   u64 *const layer_mask)
> >> +{
> > [...]
> >> +   /*
> >> +* An access is granted if, for each policy layer, at least one rule
> >> +* encountered on the pathwalk grants the access, regardless of their
> >> +* position in the layer stack.  We must then check not-yet-seen layers
> >> +* for each inode, from the last one added to the first one.
> >> +*/
> >> +   for (i = 0; i < rule->num_layers; i++) {
> >> +   const struct landlock_layer *const layer = &rule->layers[i];
> >> +   const u64 layer_level = BIT_ULL(layer->level - 1);
> >> +
> >> +   if (!(layer_level & *layer_mask))
> >> +   continue;
> >> +   if ((layer->access & access_request) != access_request)
> >> +   return false;
> >> +   *layer_mask &= ~layer_level;
> >
> > Hmm... shouldn't the last 5 lines be replaced by the following?
> >
> > if ((layer->access & access_request) == access_request)
> > *layer_mask &= ~layer_level;
> >
> > And then, since this function would always return true, you could
> > change its return type to "void".
> >
> >
> > As far as I can tell, the current version will still, if a ruleset
> > looks like this:
> >
> > /usr read+write
> > /usr/lib/ read
> >
> > reject write access to /usr/lib, right?
>
> If these two rules are from different layers, then yes it would work as
> intended. However, if these rules are from the same layer the path walk
> will not stop at /usr/lib but go down to /usr, which grants write
> access.

I don't see why the code would do what you're saying it does. And an
experiment seems to confirm what I said; I checked out landlock-v26,
and the behavior I get is:

user@vm:~/landlock$ dd if=/dev/null of=/tmp/aaa
0+0 records in
0+0 records out
0 bytes copied, 0.00106365 s, 0.0 kB/s
user@vm:~/landlock$ LL_FS_RO='/lib' LL_FS_RW='/' ./sandboxer dd
if=/dev/null of=/tmp/aaa
0+0 records in
0+0 records out
0 bytes copied, 0.000491814 s, 0.0 kB/s
user@vm:~/landlock$ LL_FS_RO='/tmp' LL_FS_RW='/' ./sandboxer dd
if=/dev/null of=/tmp/aaa
dd: failed to open '/tmp/aaa': Permission denied
user@vm:~/landlock$

Granting read access to /tmp prevents writing to it, even though write
access was granted to /.


Re: [PATCH v26 07/12] landlock: Support filesystem access-control

2021-01-13 Thread Jann Horn
On Wed, Dec 9, 2020 at 8:28 PM Mickaël Salaün  wrote:
> Thanks to the Landlock objects and ruleset, it is possible to identify
> inodes according to a process's domain.  To enable an unprivileged
> process to express a file hierarchy, it first needs to open a directory
> (or a file) and pass this file descriptor to the kernel through
> landlock_add_rule(2).  When checking if a file access request is
> allowed, we walk from the requested dentry to the real root, following
> the different mount layers.  The access to each "tagged" inodes are
> collected according to their rule layer level, and ANDed to create
> access to the requested file hierarchy.  This makes possible to identify
> a lot of files without tagging every inodes nor modifying the
> filesystem, while still following the view and understanding the user
> has from the filesystem.
>
> Add a new ARCH_EPHEMERAL_INODES for UML because it currently does not
> keep the same struct inodes for the same inodes whereas these inodes are
> in use.
>
> This commit adds a minimal set of supported filesystem access-control
> which doesn't enable to restrict all file-related actions.  This is the
> result of multiple discussions to minimize the code of Landlock to ease
> review.  Thanks to the Landlock design, extending this access-control
> without breaking user space will not be a problem.  Moreover, seccomp
> filters can be used to restrict the use of syscall families which may
> not be currently handled by Landlock.
[...]
> +static bool check_access_path_continue(
> +   const struct landlock_ruleset *const domain,
> +   const struct path *const path, const u32 access_request,
> +   u64 *const layer_mask)
> +{
[...]
> +   /*
> +* An access is granted if, for each policy layer, at least one rule
> +* encountered on the pathwalk grants the access, regardless of their
> +* position in the layer stack.  We must then check not-yet-seen layers
> +* for each inode, from the last one added to the first one.
> +*/
> +   for (i = 0; i < rule->num_layers; i++) {
> +   const struct landlock_layer *const layer = &rule->layers[i];
> +   const u64 layer_level = BIT_ULL(layer->level - 1);
> +
> +   if (!(layer_level & *layer_mask))
> +   continue;
> +   if ((layer->access & access_request) != access_request)
> +   return false;
> +   *layer_mask &= ~layer_level;

Hmm... shouldn't the last 5 lines be replaced by the following?

if ((layer->access & access_request) == access_request)
*layer_mask &= ~layer_level;

And then, since this function would always return true, you could
change its return type to "void".


As far as I can tell, the current version will still, if a ruleset
looks like this:

/usr read+write
/usr/lib/ read

reject write access to /usr/lib, right?


> +   }
> +   return true;
> +}
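
To spell out the difference on the example above with a tiny, untested
userspace model (single policy layer; the path walk hits the /usr/lib rule
before the /usr rule; all names here are made up):

#include <stdbool.h>
#include <stdio.h>

#define ACC_READ  0x1
#define ACC_WRITE 0x2

/* rules encountered on the walk from /usr/lib up towards the root */
static const unsigned int walk_rules[] = {
    ACC_READ,             /* rule tagged on /usr/lib */
    ACC_READ | ACC_WRITE, /* rule tagged on /usr */
};

/* v26 behavior: the layer's verdict is decided at the first tagged inode */
static bool allowed_current(unsigned int request)
{
    return (walk_rules[0] & request) == request;
}

/* suggested behavior: keep walking until some rule in the layer grants it */
static bool allowed_suggested(unsigned int request)
{
    for (int i = 0; i < 2; i++)
        if ((walk_rules[i] & request) == request)
            return true;
    return false;
}

int main(void)
{
    printf("write /usr/lib, current:   %d\n", allowed_current(ACC_WRITE));   /* 0 */
    printf("write /usr/lib, suggested: %d\n", allowed_suggested(ACC_WRITE)); /* 1 */
    return 0;
}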


Re: [PATCH v26 00/12] Landlock LSM

2021-01-13 Thread Jann Horn
On Wed, Dec 9, 2020 at 8:28 PM Mickaël Salaün  wrote:
> This patch series adds new build-time checks, a new test, renames some
> variables and functions to improve readability, and shift syscall
> numbers to align with -next.

Sorry, I've finally gotten around to looking at v26 - I hadn't
actually looked at v25 either yet. I think there's still one remaining
small issue in the filesystem access logic, but I think that's very
simple to fix, as long as we agree on what the expected semantics are.
Otherwise it basically looks good, apart from some typos.

I think v27 will be the final version of this series. :) (And I'll try
to actually look at that version much faster - I realize that waiting
for code reviews this long sucks.)


Re: [PATCH v26 02/12] landlock: Add ruleset and domain management

2021-01-13 Thread Jann Horn
On Wed, Dec 9, 2020 at 8:28 PM Mickaël Salaün  wrote:
> A Landlock ruleset is mainly a red-black tree with Landlock rules as
> nodes.  This enables quick update and lookup to match a requested
> access, e.g. to a file.  A ruleset is usable through a dedicated file
> descriptor (cf. following commit implementing syscalls) which enables a
> process to create and populate a ruleset with new rules.
>
> A domain is a ruleset tied to a set of processes.  This group of rules
> defines the security policy enforced on these processes and their future
> children.  A domain can transition to a new domain which is the
> intersection of all its constraints and those of a ruleset provided by
> the current process.  This modification only impacts the current process.
> This means that a process can only gain more constraints (i.e. lose
> accesses) over time.
>
> Cc: James Morris 
> Cc: Jann Horn 
> Cc: Kees Cook 
> Cc: Serge E. Hallyn 
> Signed-off-by: Mickaël Salaün 

Yeah, the layer stack stuff in this version looks good to me. :)

Reviewed-by: Jann Horn 


Re: [PATCH v26 11/12] samples/landlock: Add a sandbox manager example

2021-01-13 Thread Jann Horn
On Wed, Dec 9, 2020 at 8:29 PM Mickaël Salaün  wrote:
> Add a basic sandbox tool to launch a command which can only access a
> whitelist of file hierarchies in a read-only or read-write way.

I have to admit that I didn't really look at this closely before
because it's just sample code... but I guess I should. You can add

Reviewed-by: Jann Horn 

if you fix the following nits:

[...]
> diff --git a/samples/Kconfig b/samples/Kconfig
[...]
> +config SAMPLE_LANDLOCK
> +   bool "Build Landlock sample code"
> +   depends on HEADERS_INSTALL
> +   help
> + Build a simple Landlock sandbox manager able to launch a process
> + restricted by a user-defined filesystem access control.

nit: s/filesystem access control/filesystem access control policy/

[...]
> diff --git a/samples/landlock/sandboxer.c b/samples/landlock/sandboxer.c
[...]
> +/*
> + * Simple Landlock sandbox manager able to launch a process restricted by a
> + * user-defined filesystem access control.

nit: s/filesystem access control/filesystem access control policy/

[...]
> +int main(const int argc, char *const argv[], char *const *const envp)
> +{
[...]
> +   if (argc < 2) {
[...]
> +   fprintf(stderr, "* %s: list of paths allowed to be used in a read-only way.\n",
> +   ENV_FS_RO_NAME);
> +   fprintf(stderr, "* %s: list of paths allowed to be used in a read-write way.\n",
> +   ENV_FS_RO_NAME);

s/ENV_FS_RO_NAME/ENV_FS_RW_NAME/

> +   fprintf(stderr, "\nexample:\n"
> +   "%s=\"/bin:/lib:/usr:/proc:/etc:/dev/urandom\" "
> +   "%s=\"/dev/null:/dev/full:/dev/zero:/dev/pts:/tmp\" "
> +   "%s bash -i\n",
> +   ENV_FS_RO_NAME, ENV_FS_RW_NAME, argv[0]);
> +   return 1;
> +   }
> +
> +   ruleset_fd = landlock_create_ruleset(&ruleset_attr, sizeof(ruleset_attr), 0);
> +   if (ruleset_fd < 0) {
> +   perror("Failed to create a ruleset");
> +   switch (errno) {

(Just as a note: In theory perror() can change the value of errno, as
far as I know - so AFAIK you'd theoretically have to do something
like:

int errno_ = errno;
perror("...");
switch (errno_) {
 ...
}

It'll almost certainly work fine as-is in practice though.)


Re: SLUB: percpu partial object count is highly inaccurate, causing some memory wastage and maybe also worse tail latencies?

2021-01-13 Thread Jann Horn
On Wed, Jan 13, 2021 at 8:14 PM Vlastimil Babka  wrote:
> On 1/12/21 12:12 AM, Jann Horn wrote:
> It doesn't help that slabinfo (global or per-memcg) is also
> inaccurate as it cannot count free objects on per-cpu partial slabs and thus
> reports them as active.

Maybe SLUB could be taught to track how many objects are in the percpu
machinery, and then print that number separately so that you can at
least know how much data you're missing without having to collect data
with IPIs...

> > It might be a good idea to figure out whether it is possible to
> > efficiently keep track of a more accurate count of the free objects on
>
> As long as there are some inuse objects, it shouldn't matter much if the slab is
> sitting on per-cpu partial list or per-node list, as it can't be freed anyway.
> It becomes a real problem only after the slab becomes fully free. If we detected
> that in __slab_free() also for already-frozen slabs, we would need to know which
> CPU this slab belongs to (currently that's not tracked afaik),

Yeah, but at least on 64-bit systems we still have 32 completely
unused bits in the counter field that's updated via cmpxchg_double on
struct page. (On 32-bit systems the bitfields are also wider than they
strictly need to be, I think, at least if the system has 4K page
size.) So at least on 64-bit systems, we could squeeze a CPU number in
there, and then you'd know to which CPU the page belonged at the time
the object was freed.

> and send it an
> IPI to do some light version of unfreeze_partials() that would only remove empty
> slabs. The trick would be not to cause too many IPI's by this, obviously :/

Some brainstorming:

Maybe you could have an atomic counter in kmem_cache_cpu that tracks
the number of empty frozen pages that are associated with a specific
CPU? So the freeing slowpath would do its cmpxchg_double, and if the
new state after a successful cmpxchg_double is "inuse==0 && frozen ==
1" with a valid CPU number, you afterwards do
"atomic_long_inc(_cpu_ptr(cache->cpu_slab,
cpu)->empty_partial_pages)". I think it should be possible to
implement that such that the empty_partial_pages count, while not
immediately completely accurate, would be eventually consistent; and
readers on the CPU owning the kmem_cache_cpu should never see a number
that is too large, only one that is too small.
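
In rough pseudo-kernel code (a sketch only -- the empty_partial_pages field
and the cpu-number-in-the-counter-word encoding are assumptions from this
brainstorming, not existing SLUB code):

struct kmem_cache_cpu {
    /* ... existing fields ... */
    atomic_long_t empty_partial_pages; /* frozen pages with inuse == 0 */
};

/*
 * Would be called from the __slab_free() slowpath after a successful
 * cmpxchg_double left the page with inuse == 0 && frozen == 1; owner_cpu
 * is the CPU number stashed in the otherwise-unused counter bits.
 */
static void note_empty_frozen_page(struct kmem_cache *s, unsigned int owner_cpu)
{
    atomic_long_inc(&per_cpu_ptr(s->cpu_slab, owner_cpu)->empty_partial_pages);
}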

You could additionally have a plain percpu counter, not tied to the
kmem_cache, and increment it by 1<

SLUB: percpu partial object count is highly inaccurate, causing some memory wastage and maybe also worse tail latencies?

2021-01-11 Thread Jann Horn
[This is not something I intend to work on myself. But since I
stumbled over this issue, I figured I should at least document/report
it, in case anyone is willing to pick it up.]

Hi!

I was poking around in SLUB internals and noticed that the estimate of
how many free objects exist on a percpu partial list (tracked in
page->pobjects of the first page on the list and exposed to userspace
via /sys/kernel/slab/*/slabs_cpu_partial) is highly inaccurate.

From a naive first look, it seems like SLUB tries to roughly keep up
to cache->cpu_partial free objects per slab and CPU around.
cache->cpu_partial depends on the object size; the maximum is 30 (for
objects <256 bytes).

However, the accounting of free objects into page->pobjects only
happens when a page is added to the percpu partial list;
page->pobjects is not updated when objects are freed to a page that is
already on the percpu partial list. Pages can be added to the percpu
partial list in two cases:

1. When an object is freed from a page which previously had zero free
objects (via __slab_free()), meaning the page was not previously on
any list.
2. When the percpu partial list was empty and get_partial_node(),
after grabbing a partial page from the node, moves more partial pages
onto the percpu partial list to make the percpu partial list contain
around cache->cpu_partial/2 free objects.

In case 1, pages will almost always be counted as having one free
object, unless a race with a concurrent __slab_free() on the same page
happens, because the code path specifically handles the case where the
number of free objects just became 1.
In case 2, pages will probably often be counted as having many free
objects; however, this case doesn't appear to happen often in
practice, likely partly because pages outside of percpu partial lists
get freed immediately when they become empty.


This means that in practice, SLUB actually ends up keeping as many
**pages** on the percpu partial lists as it intends to keep **free
objects** there. To see this, you can append the snippet at the end of
this mail to mm/slub.c; that will add a debugfs entry that lets you
accurately dump the percpu partial lists of all running CPUs (at the
cost of an IPI storm when you read it).

Especially after running some forkbombs multiple times on a VM with a
bunch of CPUs, that will show you that some of the percpu lists look
like this (just showing a single slab's percpu lists on three CPU
cores here) - note the "inuse=0" everywhere:

task_delay_info on 10:
  page=f9eb9b00 base=988bd2e6c000 order=0 partial_pages=24
partial_objects=24 objects=51 inuse=0
  page=f9e44469f800 base=988bda7e order=0 partial_pages=23
partial_objects=23 objects=51 inuse=0
  page=f9e444e01380 base=988bf804e000 order=0 partial_pages=22
partial_objects=22 objects=51 inuse=0
  page=f9e444bdda40 base=988bef769000 order=0 partial_pages=21
partial_objects=21 objects=51 inuse=0
  page=f9e444c59700 base=988bf165c000 order=0 partial_pages=20
partial_objects=20 objects=51 inuse=0
  page=f9e44494d280 base=988be534a000 order=0 partial_pages=19
partial_objects=19 objects=51 inuse=0
  page=f9e444fd2e80 base=988bff4ba000 order=0 partial_pages=18
partial_objects=18 objects=51 inuse=0
  page=f9e444c47c80 base=988bf11f2000 order=0 partial_pages=17
partial_objects=17 objects=51 inuse=0
  page=f9e4448ff780 base=988be3fde000 order=0 partial_pages=16
partial_objects=16 objects=51 inuse=0
  page=f9e4443883c0 base=988bce20f000 order=0 partial_pages=15
partial_objects=15 objects=51 inuse=0
  page=f9e444eede40 base=988bfbb79000 order=0 partial_pages=14
partial_objects=14 objects=51 inuse=0
  page=f9e4442febc0 base=988bcbfaf000 order=0 partial_pages=13
partial_objects=13 objects=51 inuse=0
  page=f9e444e44940 base=988bf9125000 order=0 partial_pages=12
partial_objects=12 objects=51 inuse=0
  page=f9e4446f72c0 base=988bdbdcb000 order=0 partial_pages=11
partial_objects=11 objects=51 inuse=0
  page=f9e444dba080 base=988bf6e82000 order=0 partial_pages=10
partial_objects=10 objects=51 inuse=0
  page=f9e444a23c40 base=988be88f1000 order=0 partial_pages=9
partial_objects=9 objects=51 inuse=0
  page=f9e444786cc0 base=988bde1b3000 order=0 partial_pages=8
partial_objects=8 objects=51 inuse=0
  page=f9e444b2cf80 base=988becb3e000 order=0 partial_pages=7
partial_objects=7 objects=51 inuse=0
  page=f9e444f19cc0 base=988bfc673000 order=0 partial_pages=6
partial_objects=6 objects=51 inuse=0
  page=f9e444f08fc0 base=988bfc23f000 order=0 partial_pages=5
partial_objects=5 objects=51 inuse=0
  page=f9e444a0e540 base=988be8395000 order=0 partial_pages=4
partial_objects=4 objects=51 inuse=0
  page=f9e445127a00 base=988c049e8000 order=0 partial_pages=3
partial_objects=3 objects=51 inuse=0
  page=f9e44468ae40 base=988bda2b9000 order=0 partial_pages=2
partial_objects=2 objects=51 

Re: [PATCH] random: fix the RNDRESEEDCRNG ioctl

2021-01-11 Thread Jann Horn
On Mon, Jan 4, 2021 at 7:55 PM Eric Biggers  wrote:
> On Fri, Nov 20, 2020 at 10:52:14AM -0800, Eric Biggers wrote:
> > On Mon, Oct 26, 2020 at 09:33:43AM -0700, Eric Biggers wrote:
> > > On Tue, Oct 06, 2020 at 08:50:21PM -0700, Eric Biggers wrote:
> > > > On Tue, Sep 15, 2020 at 09:19:08PM -0700, Eric Biggers wrote:
> > > > > From: Eric Biggers 
> > > > >
> > > > > The RNDRESEEDCRNG ioctl reseeds the primary_crng from itself, which
> > > > > doesn't make sense.  Reseed it from the input_pool instead.
> > > > >
> > > > > Fixes: d848e5f8e1eb ("random: add new ioctl RNDRESEEDCRNG")
> > > > > Cc: sta...@vger.kernel.org
> > > > > Signed-off-by: Eric Biggers 
> > > >
> > > > Ping?
> > >
> > > Ping.
> >
> > Ping.
>
> Ping.

You may want to resend to akpm with a note that the maintainer is unresponsive.


Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()

2020-12-28 Thread Jann Horn
On Mon, Dec 28, 2020 at 6:14 PM Andy Lutomirski  wrote:
> On Mon, Dec 28, 2020 at 2:25 AM Russell King - ARM Linux admin
>  wrote:
> >
> > On Sun, Dec 27, 2020 at 01:36:13PM -0800, Andy Lutomirski wrote:
> > > On Sun, Dec 27, 2020 at 12:18 PM Mathieu Desnoyers
> > >  wrote:
> > > >
> > > > - On Dec 27, 2020, at 1:28 PM, Andy Lutomirski l...@kernel.org 
> > > > wrote:
> > > >
> > >
> > > > >
> > > > > I admit that I'm rather surprised that the code worked at all on 
> > > > > arm64,
> > > > > and I'm suspicious that it has never been very well tested.  My 
> > > > > apologies
> > > > > for not reviewing this more carefully in the first place.
> > > >
> > > > Please refer to 
> > > > Documentation/features/sched/membarrier-sync-core/arch-support.txt
> > > >
> > > > It clearly states that only arm, arm64, powerpc and x86 support the 
> > > > membarrier
> > > > sync core feature as of now:
> > >
> > > Sigh, I missed arm (32).  Russell or ARM folks, what's the right
> > > incantation to make the CPU notice instruction changes initiated by
> > > other cores on 32-bit ARM?
> >
> > You need to call flush_icache_range(), since the changes need to be
> > flushed from the data cache to the point of unification (of the Harvard
> > I and D), and the instruction cache needs to be invalidated so it can
> > then see those updated instructions. This will also take care of the
> > necessary barriers that the CPU requires for you.
>
> With what parameters?   From looking at the header, this is for the
> case in which the kernel writes some memory and then intends to
> execute it.  That's not what membarrier() does at all.  membarrier()
> works like this:
>
> User thread 1:
>
> write to RWX memory *or* write to an RW alias of an X region.
> membarrier(...);
> somehow tell thread 2 that we're ready (with a store release, perhaps,
> or even just a relaxed store.)
>
> User thread 2:
>
> wait for the indication from thread 1.
> barrier();
> jump to the code.
>
> membarrier() is, for better or for worse, not given a range of addresses.
>
> On x86, the documentation is a bit weak, but a strict reading
> indicates that thread 2 must execute a serializing instruction at some
> point after thread 1 writes the code and before thread 2 runs it.
> membarrier() does this by sending an IPI and ensuring that a
> "serializing" instruction (thanks for great terminology, Intel) is
> executed.  Note that flush_icache_range() is a no-op on x86, and I've
> asked Intel's architects to please clarify their precise rules.  No
> response yet.
>
> On arm64, flush_icache_range() seems to send an IPI, and that's not
> what I want.  membarrier() already does an IPI.

After chatting with rmk about this (but without claiming that any of
this is his opinion), based on the manpage, I think membarrier()
currently doesn't really claim to be synchronizing caches? It just
serializes cores. So arguably if userspace wants to use membarrier()
to synchronize code changes, userspace should first do the code
change, then flush icache as appropriate for the architecture, and
then do the membarrier() to ensure that the old code is unused?

For 32-bit arm, rmk pointed out that that would be the cacheflush()
syscall. That might cause you to end up with two IPIs instead of one
in total, but we probably don't care _that_ much about extra IPIs on
32-bit arm?

For arm64, I believe userspace can flush icache across the entire
system with some instructions from userspace - "DC CVAU" followed by
"DSB ISH", or something like that, I think? (See e.g.
compat_arm_syscall(), the arm64 compat code that implements the 32-bit
arm cacheflush() syscall.)
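
A rough, untested sketch of that arm64 userspace sequence (mirroring what the
kernel's flush_icache_range() does; a 64-byte cache line is assumed here
instead of reading CTR_EL0):

static void flush_icache_user(char *start, char *end)
{
    char *p;

    for (p = start; p < end; p += 64)
        asm volatile("dc cvau, %0" : : "r"(p) : "memory");
    asm volatile("dsb ish" : : : "memory");
    for (p = start; p < end; p += 64)
        asm volatile("ic ivau, %0" : : "r"(p) : "memory");
    asm volatile("dsb ish; isb" : : : "memory");
}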


[PATCH] mm, slub: Consider rest of partial list if acquire_slab() fails

2020-12-28 Thread Jann Horn
acquire_slab() fails if there is contention on the freelist of the page
(probably because some other CPU is concurrently freeing an object from the
page). In that case, it might make sense to look for a different page
(since there might be more remote frees to the page from other CPUs, and we
don't want contention on struct page).

However, the current code accidentally stops looking at the partial list
completely in that case. Especially on kernels without CONFIG_NUMA set,
this means that get_partial() fails and new_slab_objects() falls back to
new_slab(), allocating new pages. This could lead to an unnecessary
increase in memory fragmentation.

Fixes: 7ced37197196 ("slub: Acquire_slab() avoid loop")
Signed-off-by: Jann Horn 
---
 mm/slub.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/slub.c b/mm/slub.c
index 0c8b43a5b3b0..b1777ba06735 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1974,7 +1974,7 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
 
t = acquire_slab(s, n, page, object == NULL, &objects);
if (!t)
-   break;
+   continue; /* cmpxchg raced */
 
available += objects;
if (!object) {

base-commit: 5c8fe583cce542aa0b84adc939ce85293de36e5e
-- 
2.29.2.729.g45daf8777d-goog



Re: [PATCH v3 3/4] x86/signal: Prevent an alternate stack overflow before a signal delivery

2020-12-22 Thread Jann Horn
On Wed, Dec 23, 2020 at 2:57 AM Chang S. Bae  wrote:
> The kernel pushes data on the userspace stack when entering a signal. If
> using a sigaltstack(), the kernel precisely knows the user stack size.
>
> When the kernel knows that the user stack is too small, avoid the overflow
> and do an immediate SIGSEGV instead.
>
> This overflow is known to occur on systems with large XSAVE state. The
> effort to increase the size typically used for altstacks reduces the
> frequency of these overflows, but this approach is still useful for legacy
> binaries.
>
> Suggested-by: Jann Horn 
> Signed-off-by: Chang S. Bae 
> Reviewed-by: Len Brown 
> Cc: Jann Horn 
> Cc: x...@kernel.org
> Cc: linux-kernel@vger.kernel.org

Reviewed-by: Jann Horn 


Re: [Bug 210655] ptrace.2: documentation is incorrect about access checking threads in same thread group

2020-12-15 Thread Jann Horn
On Wed, Dec 16, 2020 at 3:21 AM Ted Estes  wrote:
> On 12/15/2020 6:01 PM, Jann Horn wrote:
> > On Wed, Dec 16, 2020 at 12:25 AM Alejandro Colomar (man-pages)
> >  wrote:
> >> On 12/16/20 12:23 AM, Alejandro Colomar (man-pages) wrote:
> >>> On 12/16/20 12:07 AM, Jann Horn wrote:
> >>>> As the comment explains, you can't actually *attach*
> >>>> to another task in the same thread group; but that's
> >>>> not because of the ptrace-style access check rules,
> >>>> but because specifically *attaching* to another task
> >>>> in the same thread group doesn't work.
> > As I said, attaching indeed doesn't work. But that's not what "Ptrace
> > access mode checking" means. As the first sentence of that section
> > says:
> >
> > | Various parts of the kernel-user-space API (not just ptrace()
> > | operations), require so-called "ptrace access mode" checks,
> > | whose outcome determines whether an operation is
> > | permitted (or, in a  few cases,  causes  a "read" operation
> > | to return sanitized data).
> >
> > You can find these places by grepping for \bptrace_may_access\b -
> > operations like e.g. the get_robust_list() syscall will always succeed
> > when inspecting other tasks in the caller's thread group thanks to
> > this rule.
>
> Ah, yes.  I missed that back reference while trying to digest that
> rather meaty man page.  A grep on the man page source tree does show a
> number of references to "ptrace access mode".
>
> That said, the ptrace(2) man page also directly references the ptrace
> access mode check under both PTRACE_ATTACH and PTRACE_SEIZE:
>
> | Permission to perform a PTRACE_ATTACH is governed by a ptrace
> | access mode PTRACE_MODE_ATTACH_REALCREDS check; see below.
>
> As confirmed, the "same thread group" rule does not apply to either of
> those operations. A re-wording of rule 1 similar to this might help avoid
> confusion:
>
> 1. If the calling thread and the target thread are in the same thread group:
>a. For ptrace() called with PTRACE_ATTACH or PTRACE_SEIZE, access is
>   NEVER allowed.
>b. For all other so-called "ptrace access mode checks", access is
>   ALWAYS allowed.
>
> --Ted

Yeah, maybe. OTOH I'm not sure whether it really makes sense to
explain this as being part of a security check, or whether it should
be explained separately as a restriction on PTRACE_ATTACH and
PTRACE_SEIZE (with a note like "(irrelevant for ptrace attachment)" on
rule 1). But I don't feel strongly about it either way.


Re: [Bug 210655] ptrace.2: documentation is incorrect about access checking threads in same thread group

2020-12-15 Thread Jann Horn
On Tue, Dec 15, 2020 at 06:01:25PM +0100, Alejandro Colomar (man-pages) wrote:
> Hi,
> 
> There's a bug report: https://bugzilla.kernel.org/show_bug.cgi?id=210655
> 
> [[
> Under "Ptrace access mode checking", the documentation states:
>   "1. If the calling thread and the target thread are in the same thread
> group, access is always allowed."
> 
> This is incorrect. A thread may never attach to another in the same group.

No, that is correct. ptrace-mode access checks do always short-circuit for
tasks in the same thread group:

/* Returns 0 on success, -errno on denial. */
static int __ptrace_may_access(struct task_struct *task, unsigned int mode)
{
[...]
/* May we inspect the given task?
 * This check is used both for attaching with ptrace
 * and for allowing access to sensitive information in /proc.
 *
 * ptrace_attach denies several cases that /proc allows
 * because setting up the necessary parent/child relationship
 * or halting the specified task is impossible.
 */

/* Don't let security modules deny introspection */
if (same_thread_group(task, current))
return 0;
[...]
}

As the comment explains, you can't actually *attach*
to another task in the same thread group; but that's
not because of the ptrace-style access check rules,
but because specifically *attaching* to another task
in the same thread group doesn't work.


Re: [Bug 210655] ptrace.2: documentation is incorrect about access checking threads in same thread group

2020-12-15 Thread Jann Horn
On Wed, Dec 16, 2020 at 12:25 AM Alejandro Colomar (man-pages)
 wrote:
> On 12/16/20 12:23 AM, Alejandro Colomar (man-pages) wrote:
> > On 12/16/20 12:07 AM, Jann Horn wrote:
> >> On Tue, Dec 15, 2020 at 06:01:25PM +0100, Alejandro Colomar (man-pages) wrote:
> >>> There's a bug report: https://bugzilla.kernel.org/show_bug.cgi?id=210655
> >>>
> >>> [[
> >>> Under "Ptrace access mode checking", the documentation states:
> >>>   "1. If the calling thread and the target thread are in the same thread
> >>> group, access is always allowed."
> >>>
> >>> This is incorrect. A thread may never attach to another in the same group.
> >>
> >> No, that is correct. ptrace-mode access checks do always short-circuit for
> >> tasks in the same thread group:
> >>
> >> /* Returns 0 on success, -errno on denial. */
> >> static int __ptrace_may_access(struct task_struct *task, unsigned int mode)
> >> {
> >> [...]
> >> /* May we inspect the given task?
> >>  * This check is used both for attaching with ptrace
> >>  * and for allowing access to sensitive information in /proc.
> >>  *
> >>  * ptrace_attach denies several cases that /proc allows
> >>  * because setting up the necessary parent/child relationship
> >>  * or halting the specified task is impossible.
> >>  */
> >>
> >> /* Don't let security modules deny introspection */
> >> if (same_thread_group(task, current))
> >> return 0;
> >> [...]
> >> }
> >
> > AFAICS, that code always returns non-zero,
>
> Sorry, I should have said "that code never returns 0".
>
> > at least when called from ptrace_attach().

Yes.

> > As you can see below,
> > __ptrace_may_access() is called some lines after
> > the code pointed to by the bug report.
> >
> >
> > static int ptrace_attach(struct task_struct *task, long request,
> >unsigned long addr,
> >unsigned long flags)
> > {
> > [...]
> >   if (same_thread_group(task, current))
> >   goto out;
> >
> >   /*
> >* Protect exec's credential calculations against our interference;
> >* SUID, SGID and LSM creds get determined differently
> >* under ptrace.
> >*/
> >   retval = -ERESTARTNOINTR;
> >   if (mutex_lock_interruptible(&task->signal->cred_guard_mutex))
> >   goto out;
> >
> >   task_lock(task);
> >   retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
> > [...]
> > }

I said exactly that in my last mail:

> >> As the comment explains, you can't actually *attach*
> >> to another task in the same thread group; but that's
> >> not because of the ptrace-style access check rules,
> >> but because specifically *attaching* to another task
> >> in the same thread group doesn't work.

As I said, attaching indeed doesn't work. But that's not what "Ptrace
access mode checking" means. As the first sentence of that section
says:

| Various parts of the kernel-user-space API (not just ptrace()
| operations), require so-called "ptrace access mode" checks,
| whose outcome determines whether an operation is
| permitted (or, in a  few cases,  causes  a "read" operation
| to return sanitized data).

You can find these places by grepping for \bptrace_may_access\b -
operations like e.g. the get_robust_list() syscall will always succeed
when inspecting other tasks in the caller's thread group thanks to
this rule.
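
A quick, untested sketch of one such case -- get_robust_list() on a sibling
thread succeeds regardless of Yama, because the same-thread-group rule
short-circuits the access check:

#define _GNU_SOURCE
#include <linux/futex.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

static volatile pid_t sibling_tid;

static void *thread_fn(void *arg)
{
    sibling_tid = syscall(SYS_gettid);
    pause(); /* keep the sibling alive */
    return NULL;
}

int main(void)
{
    struct robust_list_head *head;
    size_t len;
    pthread_t t;
    long ret;

    pthread_create(&t, NULL, thread_fn, NULL);
    while (!sibling_tid)
        ; /* crude synchronization, fine for a demo */
    ret = syscall(SYS_get_robust_list, sibling_tid, &head, &len);
    printf("get_robust_list on sibling: %ld\n", ret); /* 0: allowed */
    return 0;
}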


Re: [PATCH 1/2] mm/madvise: allow process_madvise operations on entire memory range

2020-12-11 Thread Jann Horn
On Sat, Dec 12, 2020 at 12:01 AM Minchan Kim  wrote:
> On Fri, Dec 11, 2020 at 09:27:46PM +0100, Jann Horn wrote:
> > +CC Christoph Hellwig for opinions on compat
> >
> > On Thu, Nov 26, 2020 at 12:22 AM Minchan Kim  wrote:
> > > On Mon, Nov 23, 2020 at 09:39:42PM -0800, Suren Baghdasaryan wrote:
> > > > process_madvise requires a vector of address ranges to be provided for
> > > > its operations. When an advice should be applied to the entire process,
> > > > the caller process has to obtain the list of VMAs of the target process
> > > > by reading the /proc/pid/maps or some other way. The cost of this
> > > > operation grows linearly with increasing number of VMAs in the target
> > > > process. Even constructing the input vector can be non-trivial when
> > > > target process has several thousands of VMAs and the syscall is being
> > > > issued during high memory pressure period when new allocations for such
> > > > a vector would only worsen the situation.
> > > > In the case when advice is being applied to the entire memory space of
> > > > the target process, this creates an extra overhead.
> > > > Add PMADV_FLAG_RANGE flag for process_madvise enabling the caller to
> > > > advise a memory range of the target process. For now, to keep it simple,
> > > > only the entire process memory range is supported, vec and vlen inputs
> > > > in this mode are ignored and can be NULL and 0.
> > > > Instead of returning the number of bytes that advice was successfully
> > > > applied to, the syscall in this mode returns 0 on success. This is due
> > > > to the fact that the number of bytes would not be useful for the caller
> > > > that does not know the amount of memory the call is supposed to affect.
> > > > Besides, the ssize_t return type can be too small to hold the number of
> > > > bytes affected when the operation is applied to a large memory range.
> > >
> > > Can we just use one element in iovec to indicate entire address rather
> > > than using up the reserved flags?
> > >
> > > struct iovec {
> > > .iov_base = NULL,
> > > .iov_len = (~(size_t)0),
> > > };
> >
> > In addition to Suren's objections, I think it's also worth considering
> > how this looks in terms of compat API. If a compat process does
> > process_madvise() on another compat process, it would be specifying
> > the maximum 32-bit number, rather than the maximum 64-bit number, so
> > you'd need special code to catch that case, which would be ugly.
> >
> > And when a compat process uses this API on a non-compat process, it
> > semantically gets really weird: The actual address range covered would
> > be larger than the address range specified.
> >
> > And if we want different access checks for the two flavors in the
> > future, gating that different behavior on special values in the iovec
> > would feel too magical to me.
> >
> > And the length value SIZE_MAX doesn't really make sense anyway because
> > the length of the whole address space would be SIZE_MAX+1, which you
> > can't express.
> >
> > So I'm in favor of a new flag, and strongly against using SIZE_MAX as
> > a magic number here.
>
> Can't we simply pass NULL as iovec as special id, then?

AFAIK in theory NULL can be a valid userspace pointer (although that
basically never happens and, on MMU systems, requires root to
explicitly do it). On the other hand, there are some parts of the UAPI
that already special-case NULL, so maybe that's considered acceptable?

Also, special-casing NULL slightly increases the chance that userspace
messes up and accidentally triggers completely different behavior
because an allocation failed or something like that.

So while I'm not strongly against using NULL here, I don't exactly
like the idea.


Re: [PATCH 1/2] mm/madvise: allow process_madvise operations on entire memory range

2020-12-11 Thread Jann Horn
+CC Christoph Hellwig for opinions on compat

On Thu, Nov 26, 2020 at 12:22 AM Minchan Kim  wrote:
> On Mon, Nov 23, 2020 at 09:39:42PM -0800, Suren Baghdasaryan wrote:
> > process_madvise requires a vector of address ranges to be provided for
> > its operations. When an advice should be applied to the entire process,
> > the caller process has to obtain the list of VMAs of the target process
> > by reading the /proc/pid/maps or some other way. The cost of this
> > operation grows linearly with increasing number of VMAs in the target
> > process. Even constructing the input vector can be non-trivial when
> > target process has several thousands of VMAs and the syscall is being
> > issued during high memory pressure period when new allocations for such
> > a vector would only worsen the situation.
> > In the case when advice is being applied to the entire memory space of
> > the target process, this creates an extra overhead.
> > Add PMADV_FLAG_RANGE flag for process_madvise enabling the caller to
> > advise a memory range of the target process. For now, to keep it simple,
> > only the entire process memory range is supported, vec and vlen inputs
> > in this mode are ignored and can be NULL and 0.
> > Instead of returning the number of bytes that advice was successfully
> > applied to, the syscall in this mode returns 0 on success. This is due
> > to the fact that the number of bytes would not be useful for the caller
> > that does not know the amount of memory the call is supposed to affect.
> > Besides, the ssize_t return type can be too small to hold the number of
> > bytes affected when the operation is applied to a large memory range.
>
> Can we just use one element in iovec to indicate entire address rather
> than using up the reserved flags?
>
> struct iovec {
> .iov_base = NULL,
> .iov_len = (~(size_t)0),
> };

In addition to Suren's objections, I think it's also worth considering
how this looks in terms of compat API. If a compat process does
process_madvise() on another compat process, it would be specifying
the maximum 32-bit number, rather than the maximum 64-bit number, so
you'd need special code to catch that case, which would be ugly.

And when a compat process uses this API on a non-compat process, it
semantically gets really weird: The actual address range covered would
be larger than the address range specified.

And if we want different access checks for the two flavors in the
future, gating that different behavior on special values in the iovec
would feel too magical to me.

And the length value SIZE_MAX doesn't really make sense anyway because
the length of the whole address space would be SIZE_MAX+1, which you
can't express.

So I'm in favor of a new flag, and strongly against using SIZE_MAX as
a magic number here.
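
For comparison, the two calling conventions discussed here, as a rough,
untested sketch (raw syscall(2) is used; PMADV_FLAG_RANGE is the flag
proposed in the patch and is defined below only as a placeholder value):

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef PMADV_FLAG_RANGE
#define PMADV_FLAG_RANGE 0x1 /* placeholder, not an ABI constant */
#endif

/* flag-based variant proposed in the patch: whole address space, no iovec */
static long madvise_whole_process(int pidfd, int advice)
{
    return syscall(SYS_process_madvise, pidfd, NULL, 0, advice,
                   PMADV_FLAG_RANGE);
}

/* magic-iovec variant argued against above: SIZE_MAX cannot even express the
 * full address space, and means different things for compat vs. 64-bit callers */
static long madvise_whole_process_magic(int pidfd, int advice)
{
    struct iovec iov = { .iov_base = NULL, .iov_len = (size_t)-1 };

    return syscall(SYS_process_madvise, pidfd, &iov, 1, advice, 0);
}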


Re: [PATCH 2/2] mm/madvise: add process_madvise MADV_DONTNEER support

2020-12-08 Thread Jann Horn
On Tue, Nov 24, 2020 at 6:50 AM Suren Baghdasaryan  wrote:
> In modern systems it's not unusual to have a system component monitoring
> memory conditions of the system and tasked with keeping system memory
> pressure under control. One way to accomplish that is to kill
> non-essential processes to free up memory for more important ones.
> Examples of this are Facebook's OOM killer daemon called oomd and
> Android's low memory killer daemon called lmkd.
> For such system component it's important to be able to free memory
> quickly and efficiently. Unfortunately the time process takes to free
> up its memory after receiving a SIGKILL might vary based on the state
> of the process (uninterruptible sleep), size and OPP level of the core
> the process is running.
> In such situation it is desirable to be able to free up the memory of the
> process being killed in a more controlled way.
> Enable MADV_DONTNEED to be used with process_madvise when applied to a
> dying process to reclaim its memory. This would allow userspace system
> components like oomd and lmkd to free memory of the target process in
> a more predictable way.
>
> Signed-off-by: Suren Baghdasaryan 
[...]
> @@ -1239,6 +1256,23 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const 
> struct iovec __user *, vec,
> goto release_task;
> }
>
> +   if (madvise_destructive(behavior)) {
> +   /* Allow destructive madvise only on a dying process */
> +   if (!signal_group_exit(task->signal)) {
> +   ret = -EINVAL;
> +   goto release_mm;
> +   }

Technically Linux allows processes to share mm_struct without being in
the same thread group, so I'm not sure whether this check is good
enough? AFAICS the normal OOM killer deals with this case by letting
__oom_kill_process() always kill all tasks that share the mm_struct.
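
Roughly what I have in mind, as an untested sketch (process_shares_mm() is
currently a static helper in mm/oom_kill.c, so a real patch would have to
open-code or export something equivalent):

/* Only allow the destructive advice if every task that shares the target
 * mm is in a dying thread group, similar to what __oom_kill_process()
 * has to consider. */
static bool mm_users_all_exiting(struct mm_struct *mm)
{
        struct task_struct *p;
        bool ret = true;

        rcu_read_lock();
        for_each_process(p) {
                if (!process_shares_mm(p, mm))
                        continue;
                if (!signal_group_exit(p->signal)) {
                        ret = false;
                        break;
                }
        }
        rcu_read_unlock();
        return ret;
}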


[PATCH pidfd] signal: Add missing __user annotation to copy_siginfo_from_user_any

2020-12-06 Thread Jann Horn
copy_siginfo_from_user_any() takes a userspace pointer as second
argument; annotate the parameter type accordingly.

Signed-off-by: Jann Horn 
---
I'm messing around with clang's version of __user annotation checking
and it spotted this issue:

kernel/signal.c:3759:44: warning: casting to dereferenceable pointer removes 
'noderef' attribute [-Wnoderef]
ret = copy_siginfo_from_user_any(&kinfo, info);
 ^~~~
Untracked cast to function pointer at kernel/signal.c:4294:26


Christian, since this is pidfd code, can you take this through your tree?
Or should I send this to akpm (or someone else)?

 kernel/signal.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index ef8f2a28d37c..4693191dc17c 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -3685,7 +3685,8 @@ static bool access_pidfd_pidns(struct pid *pid)
return true;
 }
 
-static int copy_siginfo_from_user_any(kernel_siginfo_t *kinfo, siginfo_t *info)
+static int copy_siginfo_from_user_any(kernel_siginfo_t *kinfo,
+   siginfo_t __user *info)
 {
 #ifdef CONFIG_COMPAT
/*

base-commit: 0477e92881850d44910a7e94fc2c46f96faa131f
-- 
2.29.2.576.ga3fc446d84-goog



Re: [NEEDS-REVIEW] [PATCH] do_exit(): panic() when double fault detected

2020-12-06 Thread Jann Horn
On Sun, Dec 6, 2020 at 4:37 PM Dave Hansen  wrote:
> On 12/6/20 5:10 AM, Vladimir Kondratiev wrote:
> > Double fault detected in do_exit() is symptom of integrity
> > compromised. For safety critical systems, it may be better to
> > panic() in this case to minimize risk.
>
> Does this fix a real problem that you have observed in practice?
>
> Or, is this a general "hardening" which you think is a good practice?
>
> What does this have to do specifically with safety critical systems?
>
> The kernel generally tries to fix things up and keep running whenever
> possible, if for no other reason than it helps debug problems.  If that
> is an undesirable property for your systems, then I think you have a
> much bigger problem than faults during exit().
>
> This option, "panic_on_double_fault", doesn't actually panic on all
> double-faults, which means to me that it's dangerously named.

I wonder whether part of the idea here is that normally, when the
kernel fixes up a kernel crash by killing the offending task, a
service management process in userspace (e.g. the init daemon) can
potentially detect this case because it looks as if the task died with
SIGBUS or something. (I don't think that actually always works in
practice though, since AFAICS kernel crashes only lead to the *task*
being killed, not the entire process, and I think killing a single
worker thread of a multithreaded process might just cause the rest of
the userspace process to lock up. Not sure whether that's intentional
or something that should ideally be changed.)

But if the kernel gives up on going through with do_exit() (because it
crashed in do_exit() before being able to mark the task as waitable),
the process may, to userspace, appear to still be alive even though
it's not actually doing anything anymore; and if the kernel doesn't
tell userspace that the process is no longer functional, userspace
can't restore the system to a working state.

But as Dave said, this best-effort fixup is probably not the kind of
behavior you'd want in a "safety critical" system anyway; for example,
often the offending thread will have held some critical spinlock or
mutex or something, and then the rest of the system piles on into a
gigantic deadlock involving the lock in question and possibly multiple
locks that nest around it. You might be better off enabling
panic_on_oops, ideally with something like pstore-based logging of the
crash, and then quickly bringing everything back to a clean state
instead of continuing from an unstable state and then possibly
blocking somewhere.
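
For reference, the kind of configuration I mean is roughly:

# /etc/sysctl.d/99-panic.conf
kernel.panic_on_oops = 1   # treat any oops as fatal
kernel.panic = 5           # reboot 5 seconds after panicking
# or equivalently on the kernel command line: oops=panic panic=5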


[PATCH] tty: Remove dead termiox code

2020-12-02 Thread Jann Horn
set_termiox() and the TCGETX handler bail out with -EINVAL immediately
if ->termiox is NULL, but there are no code paths that can set
->termiox to a non-NULL pointer; and no such code paths seem to have
existed since the termiox mechanism was introduced back in
commit 1d65b4a088de ("tty: Add termiox") in v2.6.28.
Similarly, no driver actually implements .set_termiox; and it looks like
no driver ever has.

Delete this dead code; but leave the definition of struct termiox in the
UAPI headers intact.

Signed-off-by: Jann Horn 
---
 drivers/tty/tty_ioctl.c| 61 ++
 include/linux/tty.h|  1 -
 include/linux/tty_driver.h |  9 --
 3 files changed, 2 insertions(+), 69 deletions(-)

diff --git a/drivers/tty/tty_ioctl.c b/drivers/tty/tty_ioctl.c
index e18f318586ab..4de1c6ddb8ff 100644
--- a/drivers/tty/tty_ioctl.c
+++ b/drivers/tty/tty_ioctl.c
@@ -443,51 +443,6 @@ static int get_termio(struct tty_struct *tty, struct 
termio __user *termio)
return 0;
 }
 
-
-#ifdef TCGETX
-
-/**
- * set_termiox -   set termiox fields if possible
- * @tty: terminal
- * @arg: termiox structure from user
- * @opt: option flags for ioctl type
- *
- * Implement the device calling points for the SYS5 termiox ioctl
- * interface in Linux
- */
-
-static int set_termiox(struct tty_struct *tty, void __user *arg, int opt)
-{
-   struct termiox tnew;
-   struct tty_ldisc *ld;
-
-   if (tty->termiox == NULL)
-   return -EINVAL;
-   if (copy_from_user(&tnew, arg, sizeof(struct termiox)))
-   return -EFAULT;
-
-   ld = tty_ldisc_ref(tty);
-   if (ld != NULL) {
-   if ((opt & TERMIOS_FLUSH) && ld->ops->flush_buffer)
-   ld->ops->flush_buffer(tty);
-   tty_ldisc_deref(ld);
-   }
-   if (opt & TERMIOS_WAIT) {
-   tty_wait_until_sent(tty, 0);
-   if (signal_pending(current))
-   return -ERESTARTSYS;
-   }
-
-   down_write(&tty->termios_rwsem);
-   if (tty->ops->set_termiox)
-   tty->ops->set_termiox(tty, &tnew);
-   up_write(&tty->termios_rwsem);
-   return 0;
-}
-
-#endif
-
-
 #ifdef TIOCGETP
 /*
  * These are deprecated, but there is limited support..
@@ -815,23 +770,11 @@ int tty_mode_ioctl(struct tty_struct *tty, struct file 
*file,
return ret;
 #endif
 #ifdef TCGETX
-   case TCGETX: {
-   struct termiox ktermx;
-   if (real_tty->termiox == NULL)
-   return -EINVAL;
-   down_read(&real_tty->termios_rwsem);
-   memcpy(&ktermx, real_tty->termiox, sizeof(struct termiox));
-   up_read(&real_tty->termios_rwsem);
-   if (copy_to_user(p, &ktermx, sizeof(struct termiox)))
-   ret = -EFAULT;
-   return ret;
-   }
+   case TCGETX:
case TCSETX:
-   return set_termiox(real_tty, p, 0);
case TCSETXW:
-   return set_termiox(real_tty, p, TERMIOS_WAIT);
case TCSETXF:
-   return set_termiox(real_tty, p, TERMIOS_FLUSH);
+   return -EINVAL;
 #endif 
case TIOCGSOFTCAR:
copy_termios(real_tty, &kterm);
diff --git a/include/linux/tty.h b/include/linux/tty.h
index a99e9b8e4e31..52f5544bcd85 100644
--- a/include/linux/tty.h
+++ b/include/linux/tty.h
@@ -303,7 +303,6 @@ struct tty_struct {
spinlock_t flow_lock;
/* Termios values are protected by the termios rwsem */
struct ktermios termios, termios_locked;
-   struct termiox *termiox;/* May be NULL for unsupported */
char name[64];
struct pid *pgrp;   /* Protected by ctrl lock */
struct pid *session;
diff --git a/include/linux/tty_driver.h b/include/linux/tty_driver.h
index 358446247ccd..61c3372d3f32 100644
--- a/include/linux/tty_driver.h
+++ b/include/linux/tty_driver.h
@@ -224,14 +224,6 @@
  * line). See tty_do_resize() if you need to wrap the standard method
  * in your own logic - the usual case.
  *
- * void (*set_termiox)(struct tty_struct *tty, struct termiox *new);
- *
- * Called when the device receives a termiox based ioctl. Passes down
- * the requested data from user space. This method will not be invoked
- * unless the tty also has a valid tty->termiox pointer.
- *
- * Optional: Called under the termios lock
- *
  * int (*get_icount)(struct tty_struct *tty, struct serial_icounter *icount);
  *
  * Called when the device receives a TIOCGICOUNT ioctl. Passed a kernel
@@ -285,7 +277,6 @@ struct tty_operations {
int (*tiocmset)(struct tty_struct *tty,
unsigned int set, unsigned int clear);
int (*resize)(struct tty_struct *tty, struct winsize *ws);
-   int (*set_termiox)(struct tty_struct *tty, struct termiox *tnew);

Re: [PATCH v2] drivers/virt: vmgenid: add vm generation id driver

2020-11-27 Thread Jann Horn
On Fri, Nov 27, 2020 at 8:04 PM Catangiu, Adrian Costin
 wrote:
> On 27/11/2020 20:22, Jann Horn wrote:
> > On Fri, Nov 20, 2020 at 11:29 PM Jann Horn  wrote:
> >> On Mon, Nov 16, 2020 at 4:35 PM Catangiu, Adrian Costin
> >>  wrote:
> >>> This patch is a driver that exposes a monotonic incremental Virtual
> >>> Machine Generation u32 counter via a char-dev FS interface that
> >>> provides sync and async VmGen counter updates notifications. It also
> >>> provides VmGen counter retrieval and confirmation mechanisms.
> >>>
> >>> The hw provided UUID is not exposed to userspace, it is internally
> >>> used by the driver to keep accounting for the exposed VmGen counter.
> >>> The counter starts from zero when the driver is initialized and
> >>> monotonically increments every time the hw UUID changes (the VM
> >>> generation changes).
> >>>
> >>> On each hw UUID change, the new hypervisor-provided UUID is also fed
> >>> to the kernel RNG.
> >> As for v1:
> >>
> >> Is there a reasonable usecase for the "confirmation" mechanism? It
> >> doesn't seem very useful to me.
>
> I think it adds value in complex scenarios with multiple users of the
> mechanism, potentially at varying layers of the stack, different
> processes and/or runtime libraries.
>
> The driver offers a natural place to handle minimal orchestration
> support and offer visibility in system-wide status.
>
> A high-level service that trusts all system components to properly use
> the confirmation mechanism can actually block and wait patiently for the
> system to adjust to the new world. Even if it doesn't trust all
> components it can still do a best-effort, timeout block.

What concrete action would that high-level service be able to take
after waiting for such an event?

My model of the vmgenid mechanism is that RNGs and cryptographic
libraries that use it need to be fundamentally written such that it is
guaranteed that a VM fork can not cause the same random number /
counter / ... to be reused for two different cryptographic operations
in a way visible to an attacker. This means that e.g. TLS libraries
need to, between accepting unencrypted input and sending out encrypted
data, check whether the vmgenid changed since the connection was set
up, and if a vmgenid change occurred, kill the connection.
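
In pseudo-C, that usage model would look roughly like this (the mapping setup
and the helper names kill_connection()/encrypt_and_send() are made up for
illustration, not part of this patch's API):

struct tls_conn {
        uint32_t vmgen_at_handshake;
        /* ... keys, sequence numbers, ... */
};

static volatile uint32_t *vmgen; /* counter page mapped once at startup */

int tls_send_record(struct tls_conn *c, const void *buf, size_t len)
{
        /*
         * Check *after* the plaintext has been accepted and *before* any
         * ciphertext leaves the process; if the VM was forked/restored in
         * between, the connection must die rather than reuse its state.
         */
        if (*vmgen != c->vmgen_at_handshake)
                return kill_connection(c);
        return encrypt_and_send(c, buf, len);
}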

Can you give a concrete example of a usecase where the vmgenid
mechanism is used securely and the confirmation mechanism is necessary
as part of that?

> >> How do you envision integrating this with libraries that have to work
> >> in restrictive seccomp sandboxes? If this was in the vDSO, that would
> >> be much easier.
>
> Since this mechanism targets all of userspace stack, the usecase greatly
> vary. I doubt we can have a single silver bullet interface.
>
> For example, the mmap interface targets user space RNGs, where as fast
> and as race free as possible is key. But there also higher level
> applications that don't manage their own memory or don't have access to
> low-level primitives so they can't use the mmap or even vDSO interfaces.
> That's what the rest of the logic is there for, the read+poll interface
> and all of the orchestration logic.

Are you saying that, because people might not want to write proper
bindings for this interface while also being unwilling to take the
performance hit of calling read() in every place where they would have
to do so to be fully correct, you want to build a "best-effort"
mechanism that is deliberately designed to allow some cryptographic
state reuse in a limited time window?

> Like you correctly point out, there are also scenarios like tight
> seccomp jails where even the FS interfaces is inaccessible. For cases
> like this and others, I believe we will have to work incrementally to
> build up the interface diversity to cater to all the user scenarios
> diversity.

It would be much nicer if we could have one simple interface that lets
everyone correctly do what they need to, though...


Re: [PATCH v2] drivers/virt: vmgenid: add vm generation id driver

2020-11-27 Thread Jann Horn
[resend in the hope that amazon will accept my mail this time instead
of replying "550 Too many invalid recipients" again]

On Fri, Nov 20, 2020 at 11:29 PM Jann Horn  wrote:
> On Mon, Nov 16, 2020 at 4:35 PM Catangiu, Adrian Costin
>  wrote:
> > This patch is a driver that exposes a monotonic incremental Virtual
> > Machine Generation u32 counter via a char-dev FS interface that
> > provides sync and async VmGen counter updates notifications. It also
> > provides VmGen counter retrieval and confirmation mechanisms.
> >
> > The hw provided UUID is not exposed to userspace, it is internally
> > used by the driver to keep accounting for the exposed VmGen counter.
> > The counter starts from zero when the driver is initialized and
> > monotonically increments every time the hw UUID changes (the VM
> > generation changes).
> >
> > On each hw UUID change, the new hypervisor-provided UUID is also fed
> > to the kernel RNG.
>
> As for v1:
>
> Is there a reasonable usecase for the "confirmation" mechanism? It
> doesn't seem very useful to me.
>
> How do you envision integrating this with libraries that have to work
> in restrictive seccomp sandboxes? If this was in the vDSO, that would
> be much easier.


Re: [PATCH v2 3/4] x86/signal: Prevent an alternate stack overflow before a signal delivery

2020-11-24 Thread Jann Horn
On Tue, Nov 24, 2020 at 9:43 PM Bae, Chang Seok
 wrote:
> > On Nov 24, 2020, at 10:41, Jann Horn  wrote:
> > On Tue, Nov 24, 2020 at 7:22 PM Bae, Chang Seok
> >  wrote:
> >>> On Nov 20, 2020, at 15:04, Jann Horn  wrote:
> >>> On Thu, Nov 19, 2020 at 8:40 PM Chang S. Bae  
> >>> wrote:
> >>>>
> >>>> diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
> >>>> index ee6f1ceaa7a2..cee41d684dc2 100644
> >>>> --- a/arch/x86/kernel/signal.c
> >>>> +++ b/arch/x86/kernel/signal.c
> >>>> @@ -251,8 +251,13 @@ get_sigframe(struct k_sigaction *ka, struct pt_regs 
> >>>> *regs, size_t frame_size,
> >>>>
> >>>>   /* This is the X/Open sanctioned signal stack switching.  */
> >>>>   if (ka->sa.sa_flags & SA_ONSTACK) {
> >>>> -   if (sas_ss_flags(sp) == 0)
> >>>> +   if (sas_ss_flags(sp) == 0) {
> >>>> +   /* If the altstack might overflow, die with 
> >>>> SIGSEGV: */
> >>>> +   if (!altstack_size_ok(current))
> >>>> +   return (void __user *)-1L;
> >>>> +
> >>>>   sp = current->sas_ss_sp + current->sas_ss_size;
> >>>> +   }
> >>>
> >>> A couple lines further down, we have this (since commit 14fc9fbc700d):
> >>>
> >>>   /*
> >>>* If we are on the alternate signal stack and would overflow it, 
> >>> don't.
> >>>* Return an always-bogus address instead so we will die with 
> >>> SIGSEGV.
> >>>*/
> >>>   if (onsigstack && !likely(on_sig_stack(sp)))
> >>>   return (void __user *)-1L;
> >>>
> >>> Is that not working?
> >>
> >> onsigstack is set at the beginning here. If a signal hits under normal 
> >> stack,
> >> this flag is not set. Then it will miss the overflow.
> >>
> >> The added check allows to detect the sigaltstack overflow (always).
> >
> > Ah, I think I understand what you're trying to do. But wouldn't the
> > better approach be to ensure that the existing on_sig_stack() check is
> > also used if we just switched to the signal stack? Something like:
> >
> > diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
> > index be0d7d4152ec..2f57842fb4d6 100644
> > --- a/arch/x86/kernel/signal.c
> > +++ b/arch/x86/kernel/signal.c
> > @@ -237,7 +237,7 @@ get_sigframe(struct k_sigaction *ka, struct
> > pt_regs *regs, size_t frame_size,
> >unsigned long math_size = 0;
> >unsigned long sp = regs->sp;
> >unsigned long buf_fx = 0;
> > -   int onsigstack = on_sig_stack(sp);
> > +   bool onsigstack = on_sig_stack(sp);
> >int ret;
> >
> >/* redzone */
> > @@ -246,8 +246,10 @@ get_sigframe(struct k_sigaction *ka, struct
> > pt_regs *regs, size_t frame_size,
> >
> >/* This is the X/Open sanctioned signal stack switching.  */
> >if (ka->sa.sa_flags & SA_ONSTACK) {
> > -   if (sas_ss_flags(sp) == 0)
> > +   if (sas_ss_flags(sp) == 0) {
> >sp = current->sas_ss_sp + current->sas_ss_size;
> > +   onsigstack = true;
> > +   }
> >} else if (IS_ENABLED(CONFIG_X86_32) &&
> >   !onsigstack &&
> >   regs->ss != __USER_DS &&
>
> Yeah, but wouldn't it better to avoid overwriting user data if we can? The old
> check raises segfault *after* overwritten.

Where is that overwrite happening? Between the point where your check
happens, and the point where the old check is, the only calls are to
fpu__alloc_mathframe() and align_sigframe(), right?
fpu__alloc_mathframe() just does some size calculations and doesn't
write anything. align_sigframe() also just does size calculations. Am
I missing something?


Re: [PATCH v2 3/4] x86/signal: Prevent an alternate stack overflow before a signal delivery

2020-11-24 Thread Jann Horn
On Tue, Nov 24, 2020 at 7:22 PM Bae, Chang Seok
 wrote:
> > On Nov 20, 2020, at 15:04, Jann Horn  wrote:
> > On Thu, Nov 19, 2020 at 8:40 PM Chang S. Bae  
> > wrote:
> >>
> >> diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
> >> index ee6f1ceaa7a2..cee41d684dc2 100644
> >> --- a/arch/x86/kernel/signal.c
> >> +++ b/arch/x86/kernel/signal.c
> >> @@ -251,8 +251,13 @@ get_sigframe(struct k_sigaction *ka, struct pt_regs 
> >> *regs, size_t frame_size,
> >>
> >>/* This is the X/Open sanctioned signal stack switching.  */
> >>if (ka->sa.sa_flags & SA_ONSTACK) {
> >> -   if (sas_ss_flags(sp) == 0)
> >> +   if (sas_ss_flags(sp) == 0) {
> >> +   /* If the altstack might overflow, die with 
> >> SIGSEGV: */
> >> +   if (!altstack_size_ok(current))
> >> +   return (void __user *)-1L;
> >> +
> >>sp = current->sas_ss_sp + current->sas_ss_size;
> >> +   }
> >
> > A couple lines further down, we have this (since commit 14fc9fbc700d):
> >
> >/*
> > * If we are on the alternate signal stack and would overflow it, 
> > don't.
> > * Return an always-bogus address instead so we will die with 
> > SIGSEGV.
> > */
> >if (onsigstack && !likely(on_sig_stack(sp)))
> >return (void __user *)-1L;
> >
> > Is that not working?
>
> onsigstack is set at the beginning here. If a signal hits under normal stack,
> this flag is not set. Then it will miss the overflow.
>
> The added check allows to detect the sigaltstack overflow (always).

Ah, I think I understand what you're trying to do. But wouldn't the
better approach be to ensure that the existing on_sig_stack() check is
also used if we just switched to the signal stack? Something like:

diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index be0d7d4152ec..2f57842fb4d6 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -237,7 +237,7 @@ get_sigframe(struct k_sigaction *ka, struct
pt_regs *regs, size_t frame_size,
unsigned long math_size = 0;
unsigned long sp = regs->sp;
unsigned long buf_fx = 0;
-   int onsigstack = on_sig_stack(sp);
+   bool onsigstack = on_sig_stack(sp);
int ret;

/* redzone */
@@ -246,8 +246,10 @@ get_sigframe(struct k_sigaction *ka, struct
pt_regs *regs, size_t frame_size,

/* This is the X/Open sanctioned signal stack switching.  */
if (ka->sa.sa_flags & SA_ONSTACK) {
-   if (sas_ss_flags(sp) == 0)
+   if (sas_ss_flags(sp) == 0) {
sp = current->sas_ss_sp + current->sas_ss_size;
+   onsigstack = true;
+   }
} else if (IS_ENABLED(CONFIG_X86_32) &&
   !onsigstack &&
   regs->ss != __USER_DS &&


Re: [PATCH] syscalls: Document OCI seccomp filter interactions & workaround

2020-11-24 Thread Jann Horn
On Tue, Nov 24, 2020 at 6:44 PM Greg KH  wrote:
> On Tue, Nov 24, 2020 at 06:30:28PM +0100, Jann Horn wrote:
> > On Tue, Nov 24, 2020 at 6:15 PM Greg KH  wrote:
> > > On Tue, Nov 24, 2020 at 06:06:38PM +0100, Jann Horn wrote:
> > > > +seccomp maintainers/reviewers
> > > > [thread context is at
> > > > https://lore.kernel.org/linux-api/87lfer2c0b@oldenburg2.str.redhat.com/
> > > > ]
> > > >
> > > > On Tue, Nov 24, 2020 at 5:49 PM Christoph Hellwig  
> > > > wrote:
> > > > > On Tue, Nov 24, 2020 at 03:08:05PM +0100, Mark Wielaard wrote:
> > > > > > For valgrind the issue is statx which we try to use before falling 
> > > > > > back
> > > > > > to stat64, fstatat or stat (depending on architecture, not all 
> > > > > > define
> > > > > > all of these). The problem with these fallbacks is that under some
> > > > > > containers (libseccomp versions) they might return EPERM instead of
> > > > > > ENOSYS. This causes really obscure errors that are really hard to
> > > > > > diagnose.
> > > > >
> > > > > So find a way to detect these completely broken container run times
> > > > > and refuse to run under them at all.  After all they've decided to
> > > > > deliberately break the syscall ABI.  (and yes, we gave the the rope
> > > > > to do that with seccomp :().
> > > >
> > > > FWIW, if the consensus is that seccomp filters that return -EPERM by
> > > > default are categorically wrong, I think it should be fairly easy to
> > > > add a check to the seccomp core that detects whether the installed
> > > > filter returns EPERM for some fixed unused syscall number and, if so,
> > > > prints a warning to dmesg or something along those lines...
> > >
> > > Why?  seccomp is saying "this syscall is not permitted", so -EPERM seems
> > > like the correct error to provide here.  It's not -ENOSYS as the syscall
> > > is present.
> > >
> > > As everyone knows, there are other ways to have -EPERM be returned from
> > > a syscall if you don't have the correct permissions to do something.
> > > Why is seccomp being singled out here?  It's doing the correct thing.
> >
> > AFAIU from what the others have said, it's being singled out because
> > it means that for two semantically equivalent operations (e.g.
> > openat() vs open()), one can fail while the other works because the
> > filter doesn't know about one of the syscalls. Normally semantically
> > equivalent syscalls are supposed to be subject to the same checks, and
> > if one of them fails, trying the other one won't help.
>
> They aren't being subject to the same checks, if the seccomp permissions
> are different for both of them, they will get different answers.
>
> Trying to use this to determine if the syscall is present or not is not
> ok, and as Christian just said, needs to be fixed in userspace.  We
> can't change the kernel ABI now, odds are someone else relies on the api
> we have had in place and it can not be changed :)

I don't think anyone was proposing changes to existing kernel API.


Re: [PATCH] syscalls: Document OCI seccomp filter interactions & workaround

2020-11-24 Thread Jann Horn
On Tue, Nov 24, 2020 at 6:15 PM Greg KH  wrote:
> On Tue, Nov 24, 2020 at 06:06:38PM +0100, Jann Horn wrote:
> > +seccomp maintainers/reviewers
> > [thread context is at
> > https://lore.kernel.org/linux-api/87lfer2c0b@oldenburg2.str.redhat.com/
> > ]
> >
> > On Tue, Nov 24, 2020 at 5:49 PM Christoph Hellwig  
> > wrote:
> > > On Tue, Nov 24, 2020 at 03:08:05PM +0100, Mark Wielaard wrote:
> > > > For valgrind the issue is statx which we try to use before falling back
> > > > to stat64, fstatat or stat (depending on architecture, not all define
> > > > all of these). The problem with these fallbacks is that under some
> > > > containers (libseccomp versions) they might return EPERM instead of
> > > > ENOSYS. This causes really obscure errors that are really hard to
> > > > diagnose.
> > >
> > > So find a way to detect these completely broken container run times
> > > and refuse to run under them at all.  After all they've decided to
> > > deliberately break the syscall ABI.  (and yes, we gave the the rope
> > > to do that with seccomp :().
> >
> > FWIW, if the consensus is that seccomp filters that return -EPERM by
> > default are categorically wrong, I think it should be fairly easy to
> > add a check to the seccomp core that detects whether the installed
> > filter returns EPERM for some fixed unused syscall number and, if so,
> > prints a warning to dmesg or something along those lines...
>
> Why?  seccomp is saying "this syscall is not permitted", so -EPERM seems
> like the correct error to provide here.  It's not -ENOSYS as the syscall
> is present.
>
> As everyone knows, there are other ways to have -EPERM be returned from
> a syscall if you don't have the correct permissions to do something.
> Why is seccomp being singled out here?  It's doing the correct thing.

AFAIU from what the others have said, it's being singled out because
it means that for two semantically equivalent operations (e.g.
openat() vs open()), one can fail while the other works because the
filter doesn't know about one of the syscalls. Normally semantically
equivalent syscalls are supposed to be subject to the same checks, and
if one of them fails, trying the other one won't help.

But if you can't tell whether the more modern syscall failed because
of a seccomp filter, you may be forced to retry with an older syscall
even on systems where the new syscall works fine, and such a fallback
may reduce security or reliability if you're trying to use some flags
that only the new syscall provides for security, or something like
that. (As a contrived example, imagine being forced to retry any
tgkill() that fails with EPERM as a tkill() just in case you're
running under a seccomp filter.)
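
To make that concrete, the fallback pattern in question looks roughly like
this (a sketch, not valgrind's actual code; fallback_to_fstatat() is a
made-up helper):

#include <errno.h>
#include <fcntl.h>
#include <linux/stat.h>
#include <sys/syscall.h>
#include <unistd.h>

int stat_compat(const char *path, struct statx *stx)
{
        if (syscall(__NR_statx, AT_FDCWD, path, 0, STATX_BASIC_STATS, stx) == 0)
                return 0;
        /*
         * ENOSYS is the unambiguous "old kernel, fall back" signal. Under a
         * filter that returns EPERM for unknown syscalls, the caller cannot
         * distinguish "statx is missing" from "statx is forbidden", and is
         * pushed towards also falling back on EPERM, losing whatever the
         * newer syscall was supposed to provide.
         */
        if (errno == ENOSYS /* || errno == EPERM ?! */)
                return fallback_to_fstatat(path, stx);
        return -1;
}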


Re: [PATCH] syscalls: Document OCI seccomp filter interactions & workaround

2020-11-24 Thread Jann Horn
+seccomp maintainers/reviewers
[thread context is at
https://lore.kernel.org/linux-api/87lfer2c0b@oldenburg2.str.redhat.com/
]

On Tue, Nov 24, 2020 at 5:49 PM Christoph Hellwig  wrote:
> On Tue, Nov 24, 2020 at 03:08:05PM +0100, Mark Wielaard wrote:
> > For valgrind the issue is statx which we try to use before falling back
> > to stat64, fstatat or stat (depending on architecture, not all define
> > all of these). The problem with these fallbacks is that under some
> > containers (libseccomp versions) they might return EPERM instead of
> > ENOSYS. This causes really obscure errors that are really hard to
> > diagnose.
>
> So find a way to detect these completely broken container run times
> and refuse to run under them at all.  After all they've decided to
> deliberately break the syscall ABI.  (and yes, we gave the the rope
> to do that with seccomp :().

FWIW, if the consensus is that seccomp filters that return -EPERM by
default are categorically wrong, I think it should be fairly easy to
add a check to the seccomp core that detects whether the installed
filter returns EPERM for some fixed unused syscall number and, if so,
prints a warning to dmesg or something along those lines...


Re: [PATCH] zlib: define get_unaligned16() only when used

2020-11-24 Thread Jann Horn
On Tue, Nov 24, 2020 at 11:40 AM Lukas Bulwahn  wrote:
> Since commit acaab7335bd6 ("lib/zlib: remove outdated and incorrect
> pre-increment optimization"), get_unaligned16() is only used when
> !CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS.
>
> Hence, make CC=clang W=1 warns:
>
>   lib/zlib_inflate/inffast.c:20:1:
> warning: unused function 'get_unaligned16' [-Wunused-function]
>
> Define get_unaligned16() only when it is actually used.
>
> Signed-off-by: Lukas Bulwahn 

AFAICS a nicer option would be to "#include <asm/unaligned.h>" and
then use "get_unaligned", which should automatically do the right
thing everywhere and remove the need for defining get_unaligned16()
and checking CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS entirely?
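
I mean something like this (untested; zlib_read16() is just a placeholder
name):

#include <asm/unaligned.h>

/* get_unaligned() already does the right thing on every architecture:
 * a plain load where unaligned accesses are cheap, byte loads elsewhere,
 * so the local helper and the config check become unnecessary. */
static inline unsigned short zlib_read16(const unsigned short *p)
{
        return get_unaligned(p);
}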


[PATCH] keys: Remove outdated __user annotations

2020-11-23 Thread Jann Horn
When the semantics of the ->read() handlers were changed such that "buffer"
is a kernel pointer, some __user annotations survived.
Since they're wrong now, get rid of them.

Fixes: d3ec10aa9581 ("KEYS: Don't write out to userspace while holding key 
semaphore")
Signed-off-by: Jann Horn 
---
 security/keys/keyring.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/security/keys/keyring.c b/security/keys/keyring.c
index 14abfe765b7e..977066208387 100644
--- a/security/keys/keyring.c
+++ b/security/keys/keyring.c
@@ -452,7 +452,7 @@ static void keyring_describe(const struct key *keyring, 
struct seq_file *m)
 struct keyring_read_iterator_context {
size_t  buflen;
size_t  count;
-   key_serial_t __user *buffer;
+   key_serial_t*buffer;
 };
 
 static int keyring_read_iterator(const void *object, void *data)
@@ -479,7 +479,7 @@ static int keyring_read_iterator(const void *object, void 
*data)
  * times.
  */
 static long keyring_read(const struct key *keyring,
-char __user *buffer, size_t buflen)
+char *buffer, size_t buflen)
 {
struct keyring_read_iterator_context ctx;
long ret;
@@ -491,7 +491,7 @@ static long keyring_read(const struct key *keyring,
 
/* Copy as many key IDs as fit into the buffer */
if (buffer && buflen) {
-   ctx.buffer = (key_serial_t __user *)buffer;
+   ctx.buffer = (key_serial_t *)buffer;
ctx.buflen = buflen;
ctx.count = 0;
ret = assoc_array_iterate(&keyring->keys,

base-commit: d5beb3140f91b1c8a3d41b14d729aefa4dcc58bc
-- 
2.29.2.454.gaff20da3a2-goog



Re: [PATCH v24 07/12] landlock: Support filesystem access-control

2020-11-23 Thread Jann Horn
On Mon, Nov 23, 2020 at 10:16 PM Mickaël Salaün  wrote:
> On 23/11/2020 20:44, Jann Horn wrote:
> > On Sat, Nov 21, 2020 at 11:06 AM Mickaël Salaün  wrote:
> >> On 21/11/2020 08:00, Jann Horn wrote:
> >>> On Thu, Nov 12, 2020 at 9:52 PM Mickaël Salaün  wrote:
> >>>> Thanks to the Landlock objects and ruleset, it is possible to identify
> >>>> inodes according to a process's domain.  To enable an unprivileged
> >>>> process to express a file hierarchy, it first needs to open a directory
> >>>> (or a file) and pass this file descriptor to the kernel through
> >>>> landlock_add_rule(2).  When checking if a file access request is
> >>>> allowed, we walk from the requested dentry to the real root, following
> >>>> the different mount layers.  The access to each "tagged" inodes are
> >>>> collected according to their rule layer level, and ANDed to create
> >>>> access to the requested file hierarchy.  This makes possible to identify
> >>>> a lot of files without tagging every inodes nor modifying the
> >>>> filesystem, while still following the view and understanding the user
> >>>> has from the filesystem.
> >>>>
> >>>> Add a new ARCH_EPHEMERAL_INODES for UML because it currently does not
> >>>> keep the same struct inodes for the same inodes whereas these inodes are
> >>>> in use.
> >>>>
> >>>> This commit adds a minimal set of supported filesystem access-control
> >>>> which doesn't enable to restrict all file-related actions.  This is the
> >>>> result of multiple discussions to minimize the code of Landlock to ease
> >>>> review.  Thanks to the Landlock design, extending this access-control
> >>>> without breaking user space will not be a problem.  Moreover, seccomp
> >>>> filters can be used to restrict the use of syscall families which may
> >>>> not be currently handled by Landlock.
> >>>>
> >>>> Cc: Al Viro 
> >>>> Cc: Anton Ivanov 
> >>>> Cc: James Morris 
> >>>> Cc: Jann Horn 
> >>>> Cc: Jeff Dike 
> >>>> Cc: Kees Cook 
> >>>> Cc: Richard Weinberger 
> >>>> Cc: Serge E. Hallyn 
> >>>> Signed-off-by: Mickaël Salaün 
> >>>> ---
> >>>>
> >>>> Changes since v23:
> >>>> * Enforce deterministic interleaved path rules.  To have consistent
> >>>>   layered rules, granting access to a path implies that all accesses
> >>>>   tied to inodes, from the requested file to the real root, must be
> >>>>   checked.  Otherwise, stacked rules may result to overzealous
> >>>>   restrictions.  By excluding the ability to add exceptions in the same
> >>>>   layer (e.g. /a allowed, /a/b denied, and /a/b/c allowed), we get
> >>>>   deterministic interleaved path rules.  This removes an optimization
> >>>
> >>> I don't understand the "deterministic interleaved path rules" part.
> >>
> I explain below.
> >>
> >>>
> >>>
> >>> What if I have a policy like this?
> >>>
> >>> /home/user READ
> >>> /home/user/Downloads READ+WRITE
> >>>
> >>> That's a reasonable policy, right?
> >>
> >> Definitely, I forgot this, thanks for the outside perspective!
> >>
> >>>
> >>> If I then try to open /home/user/Downloads/foo in WRITE mode, the loop
> >>> will first check against the READ+WRITE rule for /home/user, that
> >>> check will pass, and then it will check against the READ rule for /,
> >>> which will deny the access, right? That seems bad.
> >>
> >> Yes that was the intent.
> >>
> >>>
> >>>
> >>> The v22 code ensured that for each layer, the most specific rule (the
> >>> first we encounter on the walk) always wins, right? What's the problem
> >>> with that?
> >>
> >> This can be explained with the interleaved_masked_accesses test:
> >> https://github.com/landlock-lsm/linux/blob/landlock-v24/tools/testing/selftests/landlock/fs_test.c#L647
> >>
> >> In this case there is 4 stacked layers:
> >> layer 1: allows s1d1/s1d2/s1d3/file1
> >> layer 2: allows s1d1/s1d2/s1d3
> >>  denies s1d1/s1d2
> >> layer 3: allows s1d1
> >> 

Re: [PATCH v24 07/12] landlock: Support filesystem access-control

2020-11-23 Thread Jann Horn
On Sat, Nov 21, 2020 at 11:06 AM Mickaël Salaün  wrote:
> On 21/11/2020 08:00, Jann Horn wrote:
> > On Thu, Nov 12, 2020 at 9:52 PM Mickaël Salaün  wrote:
> >> Thanks to the Landlock objects and ruleset, it is possible to identify
> >> inodes according to a process's domain.  To enable an unprivileged
> >> process to express a file hierarchy, it first needs to open a directory
> >> (or a file) and pass this file descriptor to the kernel through
> >> landlock_add_rule(2).  When checking if a file access request is
> >> allowed, we walk from the requested dentry to the real root, following
> >> the different mount layers.  The access to each "tagged" inodes are
> >> collected according to their rule layer level, and ANDed to create
> >> access to the requested file hierarchy.  This makes possible to identify
> >> a lot of files without tagging every inodes nor modifying the
> >> filesystem, while still following the view and understanding the user
> >> has from the filesystem.
> >>
> >> Add a new ARCH_EPHEMERAL_INODES for UML because it currently does not
> >> keep the same struct inodes for the same inodes whereas these inodes are
> >> in use.
> >>
> >> This commit adds a minimal set of supported filesystem access-control
> >> which doesn't enable to restrict all file-related actions.  This is the
> >> result of multiple discussions to minimize the code of Landlock to ease
> >> review.  Thanks to the Landlock design, extending this access-control
> >> without breaking user space will not be a problem.  Moreover, seccomp
> >> filters can be used to restrict the use of syscall families which may
> >> not be currently handled by Landlock.
> >>
> >> Cc: Al Viro 
> >> Cc: Anton Ivanov 
> >> Cc: James Morris 
> >> Cc: Jann Horn 
> >> Cc: Jeff Dike 
> >> Cc: Kees Cook 
> >> Cc: Richard Weinberger 
> >> Cc: Serge E. Hallyn 
> >> Signed-off-by: Mickaël Salaün 
> >> ---
> >>
> >> Changes since v23:
> >> * Enforce deterministic interleaved path rules.  To have consistent
> >>   layered rules, granting access to a path implies that all accesses
> >>   tied to inodes, from the requested file to the real root, must be
> >>   checked.  Otherwise, stacked rules may result to overzealous
> >>   restrictions.  By excluding the ability to add exceptions in the same
> >>   layer (e.g. /a allowed, /a/b denied, and /a/b/c allowed), we get
> >>   deterministic interleaved path rules.  This removes an optimization
> >
> > I don't understand the "deterministic interleaved path rules" part.
>
> I explain below.
>
> >
> >
> > What if I have a policy like this?
> >
> > /home/user READ
> > /home/user/Downloads READ+WRITE
> >
> > That's a reasonable policy, right?
>
> Definitely, I forgot this, thanks for the outside perspective!
>
> >
> > If I then try to open /home/user/Downloads/foo in WRITE mode, the loop
> > will first check against the READ+WRITE rule for /home/user, that
> > check will pass, and then it will check against the READ rule for /,
> > which will deny the access, right? That seems bad.
>
> Yes that was the intent.
>
> >
> >
> > The v22 code ensured that for each layer, the most specific rule (the
> > first we encounter on the walk) always wins, right? What's the problem
> > with that?
>
> This can be explained with the interleaved_masked_accesses test:
> https://github.com/landlock-lsm/linux/blob/landlock-v24/tools/testing/selftests/landlock/fs_test.c#L647
>
> In this case there is 4 stacked layers:
> layer 1: allows s1d1/s1d2/s1d3/file1
> layer 2: allows s1d1/s1d2/s1d3
>  denies s1d1/s1d2
> layer 3: allows s1d1
> layer 4: allows s1d1/s1d2
>
> In the v23, access to file1 would be allowed until layer 3, but layer 4
> would merge a new rule for the s1d2 inode. Because we don't record where
> exactly the access come from, we can't tell that layer 2 allowed access
> thanks to s1d3 and that its s1d2 rule was ignored. I think this behavior
> doesn't make sense from the user point of view.

Aah, I think I'm starting to understand the issue now. Basically, with
the current UAPI, the semantics have to be "an access is permitted if,
for each policy layer, at least one rule encountered on the pathwalk
permits the access; rules that deny the access are irrelevant". And if
it turns out that someone needs to be able to deny access to specific
inodes, we'll have to extend struct landlock_path_bene

Re: [arm64] kernel BUG at kernel/seccomp.c:1309!

2020-11-23 Thread Jann Horn
On Mon, Nov 23, 2020 at 2:45 PM Arnd Bergmann  wrote:
> On Mon, Nov 23, 2020 at 12:15 PM Naresh Kamboju
>  wrote:
> >
> > While booting arm64 kernel the following kernel BUG noticed on several arm64
> > devices running linux next 20201123 tag kernel.
> >
> >
> > $ git log --oneline next-20201120..next-20201123 -- kernel/seccomp.c
> > 5c5c5fa055ea Merge remote-tracking branch 'seccomp/for-next/seccomp'
> > bce6a8cba7bf Merge branch 'linus'
> > 7ef95e3dbcee Merge branch 'for-linus/seccomp' into for-next/seccomp
> > fab686eb0307 seccomp: Remove bogus __user annotations
> > 0d831528 seccomp/cache: Report cache data through 
> > /proc/pid/seccomp_cache
> > 8e01b51a31a1 seccomp/cache: Add "emulator" to check if filter is constant 
> > allow
> > f9d480b6ffbe seccomp/cache: Lookup syscall allowlist bitmap for fast path
> > 23d67a54857a seccomp: Migrate to use SYSCALL_WORK flag
> >
> >
> > Please find these easy steps to reproduce the kernel build and boot.
>
> Adding Gabriel Krisman Bertazi to Cc, as the last patch (23d67a54857a) here
> seems suspicious: it changes
>
> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> index 02aef2844c38..47763f3999f7 100644
> --- a/include/linux/seccomp.h
> +++ b/include/linux/seccomp.h
> @@ -42,7 +42,7 @@ struct seccomp {
>  extern int __secure_computing(const struct seccomp_data *sd);
>  static inline int secure_computing(void)
>  {
> -   if (unlikely(test_thread_flag(TIF_SECCOMP)))
> +   if (unlikely(test_syscall_work(SECCOMP)))
> return  __secure_computing(NULL);
> return 0;
>  }
>
> which is in the call chain directly before
>
> int __secure_computing(const struct seccomp_data *sd)
> {
>int mode = current->seccomp.mode;
>
> ...
> switch (mode) {
> case SECCOMP_MODE_STRICT:
> __secure_computing_strict(this_syscall);  /* may call do_exit 
> */
> return 0;
> case SECCOMP_MODE_FILTER:
> return __seccomp_filter(this_syscall, sd, false);
> default:
> BUG();
> }
> }
>
> Clearly, current->seccomp.mode is set to something other
> than SECCOMP_MODE_STRICT or SECCOMP_MODE_FILTER
> while the test_syscall_work(SECCOMP) returns true, and this
> must have not been the case earlier.

Ah, I think the problem is actually in
3136b93c3fb2b7c19e853e049203ff8f2b9dd2cd ("entry: Expose helpers to
migrate TIF to SYSCALL_WORK flag"). In the !GENERIC_ENTRY case, it
adds this code:

+#define set_syscall_work(fl)   \
+   set_ti_thread_flag(current_thread_info(), SYSCALL_WORK_##fl)
+#define test_syscall_work(fl) \
+   test_ti_thread_flag(current_thread_info(), SYSCALL_WORK_##fl)
+#define clear_syscall_work(fl) \
+   clear_ti_thread_flag(current_thread_info(), SYSCALL_WORK_##fl)
+
+#define set_task_syscall_work(t, fl) \
+   set_ti_thread_flag(task_thread_info(t), TIF_##fl)
+#define test_task_syscall_work(t, fl) \
+   test_ti_thread_flag(task_thread_info(t), TIF_##fl)
+#define clear_task_syscall_work(t, fl) \
+   clear_ti_thread_flag(task_thread_info(t), TIF_##fl)

but the SYSCALL_WORK_* flags are not valid on !GENERIC_ENTRY, so we'll mix
up (on arm64) SYSCALL_WORK_BIT_SECCOMP (==0) and TIF_SIGPENDING (==0).

As part of fixing this, it might be a good idea to put "enum
syscall_work_bit" behind a "#ifdef CONFIG_GENERIC_ENTRY" to avoid
future accidents like this?
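
I mean something along these lines (untested sketch, abbreviated):

#ifdef CONFIG_GENERIC_ENTRY
enum syscall_work_bit {
        SYSCALL_WORK_BIT_SECCOMP,
        /* ... */
};

#define SYSCALL_WORK_SECCOMP    BIT(SYSCALL_WORK_BIT_SECCOMP)
/* ... */
#endif /* CONFIG_GENERIC_ENTRY */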


Re: [PATCH v24 08/12] landlock: Add syscall implementations

2020-11-20 Thread Jann Horn
On Thu, Nov 12, 2020 at 9:52 PM Mickaël Salaün  wrote:
> These 3 system calls are designed to be used by unprivileged processes
> to sandbox themselves:
> * landlock_create_ruleset(2): Creates a ruleset and returns its file
>   descriptor.
> * landlock_add_rule(2): Adds a rule (e.g. file hierarchy access) to a
>   ruleset, identified by the dedicated file descriptor.
> * landlock_enforce_ruleset_current(2): Enforces a ruleset on the current
>   thread and its future children (similar to seccomp).  This syscall has
>   the same usage restrictions as seccomp(2): the caller must have the
>   no_new_privs attribute set or have CAP_SYS_ADMIN in the current user
>   namespace.
>
> All these syscalls have a "flags" argument (not currently used) to
> enable extensibility.
>
> Here are the motivations for these new syscalls:
> * A sandboxed process may not have access to file systems, including
>   /dev, /sys or /proc, but it should still be able to add more
>   restrictions to itself.
> * Neither prctl(2) nor seccomp(2) (which was used in a previous version)
>   fit well with the current definition of a Landlock security policy.
>
> All passed structs (attributes) are checked at build time to ensure that
> they don't contain holes and that they are aligned the same way for each
> architecture.
>
> See the user and kernel documentation for more details (provided by a
> following commit):
> * Documentation/userspace-api/landlock.rst
> * Documentation/security/landlock.rst
>
> Cc: Arnd Bergmann 
> Cc: James Morris 
> Cc: Jann Horn 
> Cc: Kees Cook 
> Cc: Serge E. Hallyn 
> Signed-off-by: Mickaël Salaün 

Reviewed-by: Jann Horn 


Re: [PATCH v24 02/12] landlock: Add ruleset and domain management

2020-11-20 Thread Jann Horn
On Thu, Nov 12, 2020 at 9:51 PM Mickaël Salaün  wrote:
> A Landlock ruleset is mainly a red-black tree with Landlock rules as
> nodes.  This enables quick update and lookup to match a requested
> access, e.g. to a file.  A ruleset is usable through a dedicated file
> descriptor (cf. following commit implementing syscalls) which enables a
> process to create and populate a ruleset with new rules.
>
> A domain is a ruleset tied to a set of processes.  This group of rules
> defines the security policy enforced on these processes and their future
> children.  A domain can transition to a new domain which is the
> intersection of all its constraints and those of a ruleset provided by
> the current process.  This modification only impact the current process.
> This means that a process can only gain more constraints (i.e. lose
> accesses) over time.
>
> Cc: James Morris 
> Cc: Jann Horn 
> Cc: Kees Cook 
> Cc: Serge E. Hallyn 
> Signed-off-by: Mickaël Salaün 
> ---
>
> Changes since v23:
> * Always intersect access rights.  Following the filesystem change
>   logic, make ruleset updates more consistent by always intersecting
>   access rights (boolean AND) instead of combining them (boolean OR) for
>   the same layer.

This seems wrong to me. If some software e.g. builds a policy that
allows it to execute specific libraries and to open input files
specified on the command line, and the user then specifies a library
as an input file, this change will make that fail unless the software
explicitly deduplicates the rules.
Userspace will be forced to add extra complexity to work around this.
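
As a sketch with the UAPI from this series (error handling omitted, and
landlock_add_rule() written as if it had a libc wrapper):

/* policy part 1: allow executing the specific libraries we need */
struct landlock_path_beneath_attr lib = {
        .allowed_access = LANDLOCK_ACCESS_FS_EXECUTE,
        .parent_fd = open("/usr/lib/libfoo.so", O_PATH | O_CLOEXEC),
};
landlock_add_rule(ruleset_fd, LANDLOCK_RULE_PATH_BENEATH, &lib, 0);

/* policy part 2: allow reading the input file named on the command line,
 * which happens to be the same library (a different fd, but the same
 * inode, i.e. the same Landlock object) */
struct landlock_path_beneath_attr input = {
        .allowed_access = LANDLOCK_ACCESS_FS_READ_FILE,
        .parent_fd = open("/usr/lib/libfoo.so", O_PATH | O_CLOEXEC),
};
landlock_add_rule(ruleset_fd, LANDLOCK_RULE_PATH_BENEATH, &input, 0);

/* With the previous (OR) semantics this hierarchy allows EXECUTE and
 * READ_FILE, as intended; with the new (AND) semantics the two masks are
 * intersected to zero, so both accesses end up denied unless userspace
 * deduplicates rules by inode itself. */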

>   This defensive approach could also help avoid user
>   space to inadvertently allow multiple access rights for the same
>   object (e.g.  write and execute access on a path hierarchy) instead of
>   dealing with such inconsistency.  This can happen when there is no
>   deduplication of objects (e.g. paths and underlying inodes) whereas
>   they get different access rights with landlock_add_rule(2).

I don't see why that's an issue. If userspace wants to be able to
access the same object in different ways for different purposes, it
should be able to do that, no?

I liked the semantics from the previous version.


Re: [PATCH v24 07/12] landlock: Support filesystem access-control

2020-11-20 Thread Jann Horn
On Thu, Nov 12, 2020 at 9:52 PM Mickaël Salaün  wrote:
> Thanks to the Landlock objects and ruleset, it is possible to identify
> inodes according to a process's domain.  To enable an unprivileged
> process to express a file hierarchy, it first needs to open a directory
> (or a file) and pass this file descriptor to the kernel through
> landlock_add_rule(2).  When checking if a file access request is
> allowed, we walk from the requested dentry to the real root, following
> the different mount layers.  The access to each "tagged" inodes are
> collected according to their rule layer level, and ANDed to create
> access to the requested file hierarchy.  This makes possible to identify
> a lot of files without tagging every inodes nor modifying the
> filesystem, while still following the view and understanding the user
> has from the filesystem.
>
> Add a new ARCH_EPHEMERAL_INODES for UML because it currently does not
> keep the same struct inodes for the same inodes whereas these inodes are
> in use.
>
> This commit adds a minimal set of supported filesystem access-control
> which doesn't enable to restrict all file-related actions.  This is the
> result of multiple discussions to minimize the code of Landlock to ease
> review.  Thanks to the Landlock design, extending this access-control
> without breaking user space will not be a problem.  Moreover, seccomp
> filters can be used to restrict the use of syscall families which may
> not be currently handled by Landlock.
>
> Cc: Al Viro 
> Cc: Anton Ivanov 
> Cc: James Morris 
> Cc: Jann Horn 
> Cc: Jeff Dike 
> Cc: Kees Cook 
> Cc: Richard Weinberger 
> Cc: Serge E. Hallyn 
> Signed-off-by: Mickaël Salaün 
> ---
>
> Changes since v23:
> * Enforce deterministic interleaved path rules.  To have consistent
>   layered rules, granting access to a path implies that all accesses
>   tied to inodes, from the requested file to the real root, must be
>   checked.  Otherwise, stacked rules may result to overzealous
>   restrictions.  By excluding the ability to add exceptions in the same
>   layer (e.g. /a allowed, /a/b denied, and /a/b/c allowed), we get
>   deterministic interleaved path rules.  This removes an optimization

I don't understand the "deterministic interleaved path rules" part.


What if I have a policy like this?

/home/user READ
/home/user/Downloads READ+WRITE

That's a reasonable policy, right?

If I then try to open /home/user/Downloads/foo in WRITE mode, the loop
will first check against the READ+WRITE rule for /home/user, that
check will pass, and then it will check against the READ rule for /,
which will deny the access, right? That seems bad.


The v22 code ensured that for each layer, the most specific rule (the
first we encounter on the walk) always wins, right? What's the problem
with that?
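
For reference, my mental model of the v22 semantics, as C-ish pseudocode:

/* Walk from the accessed file up to the real root; for each policy layer,
 * the first (most specific) tagged inode seen on the walk decides, and
 * later, less specific rules for that layer are ignored. */
bool allowed[NUM_LAYERS];       /* per-layer verdict */
bool decided[NUM_LAYERS] = {};  /* has this layer been decided yet? */

for (d = accessed_dentry; ; d = parent_of(d)) {
        rule = lookup_rule(domain, d_inode(d));
        if (rule) {
                for_each_layer_in(rule, layer) {
                        if (!decided[layer]) {
                                decided[layer] = true;
                                allowed[layer] =
                                        (rule->access & access_request) ==
                                        access_request;
                        }
                }
        }
        if (is_real_root(d))
                break;
}
/* access is granted iff allowed[layer] holds for every layer
 * (layers with no rule on the path deny by default) */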

>   which could be replaced by a proper cache mechanism.  This also
>   further simplifies and explain check_access_path_continue().

>From the interdiff between v23 and v24 (git range-diff
99ade5d59b23~1..99ade5d59b23 faa8c09be9fd~1..faa8c09be9fd):

@@ security/landlock/fs.c (new)
 +  rcu_dereference(landlock_inode(inode)->object));
 +  rcu_read_unlock();
 +
-+  /* Checks for matching layers. */
-+  if (rule && (rule->layers | *layer_mask)) {
-+  if ((rule->access & access_request) == access_request) {
-+  *layer_mask &= ~rule->layers;
-+  return true;
-+  } else {
-+  return false;
-+  }
++  if (!rule)
++  /* Continues to walk if there is no rule for this inode. */
++  return true;
++  /*
++   * We must check all layers for each inode because we may encounter
++   * multiple different accesses from the same layer in a walk.  Each
++   * layer must at least allow the access request one time (i.e. with one
++   * inode).  This enables to have a deterministic behavior whatever
++   * inode is tagged within interleaved layers.
++   */
++  if ((rule->access & access_request) == access_request) {
++  /* Validates layers for which all accesses are allowed. */
++  *layer_mask &= ~rule->layers;
++  /* Continues to walk until all layers are validated. */
++  return true;
 +  }
-+  return true;
++  /* Stops if a rule in the path don't allow all requested access. */
++  return false;
 +}
 +
 +static int check_access_path(const struct landlock_ruleset *const domain,
@@ security/landlock/fs.c (new)
 +  _mask)) {
 +  struct dentry *parent_dentry;
 +
-+  /* Stops when a rule from each layer granted access. */
-+  if (layer_mask ==

Re: [PATCH v24 12/12] landlock: Add user and kernel documentation

2020-11-20 Thread Jann Horn
On Thu, Nov 12, 2020 at 9:52 PM Mickaël Salaün  wrote:
> This documentation can be built with the Sphinx framework.
>
> Cc: James Morris 
> Cc: Jann Horn 
> Cc: Kees Cook 
> Cc: Serge E. Hallyn 
> Signed-off-by: Mickaël Salaün 
> Reviewed-by: Vincent Dagonneau 

Reviewed-by: Jann Horn 


Re: [PATCH v24 01/12] landlock: Add object management

2020-11-20 Thread Jann Horn
On Thu, Nov 12, 2020 at 9:51 PM Mickaël Salaün  wrote:
> A Landlock object enables to identify a kernel object (e.g. an inode).
> A Landlock rule is a set of access rights allowed on an object.  Rules
> are grouped in rulesets that may be tied to a set of processes (i.e.
> subjects) to enforce a scoped access-control (i.e. a domain).
>
> Because Landlock's goal is to empower any process (especially
> unprivileged ones) to sandbox themselves, we cannot rely on a
> system-wide object identification such as file extended attributes.
> Indeed, we need innocuous, composable and modular access-controls.
>
> The main challenge with these constraints is to identify kernel objects
> while this identification is useful (i.e. when a security policy makes
> use of this object).  But this identification data should be freed once
> no policy is using it.  This ephemeral tagging should not and may not be
> written in the filesystem.  We then need to manage the lifetime of a
> rule according to the lifetime of its objects.  To avoid a global lock,
> this implementation make use of RCU and counters to safely reference
> objects.
>
> A following commit uses this generic object management for inodes.
>
> Cc: James Morris 
> Cc: Kees Cook 
> Cc: Serge E. Hallyn 
> Signed-off-by: Mickaël Salaün 
> Reviewed-by: Jann Horn 

Still looks good, except for one comment:

[...]
> +   /**
> +* @lock: Guards against concurrent modifications.  This lock might be
> +* held from the time @usage drops to zero until any weak references
> +* from @underobj to this object have been cleaned up.
> +*
> +* Lock ordering: inode->i_lock nests inside this.
> +*/
> +   spinlock_t lock;

Why did you change this to "might be held" (v22 had "must")? Is the
"might" a typo?


Re: [PATCH v2 3/4] x86/signal: Prevent an alternate stack overflow before a signal delivery

2020-11-20 Thread Jann Horn
On Thu, Nov 19, 2020 at 8:40 PM Chang S. Bae  wrote:
> The kernel pushes data on the userspace stack when entering a signal. If
> using a sigaltstack(), the kernel precisely knows the user stack size.
>
> When the kernel knows that the user stack is too small, avoid the overflow
> and do an immediate SIGSEGV instead.
>
> This overflow is known to occur on systems with large XSAVE state. The
> effort to increase the size typically used for altstacks reduces the
> frequency of these overflows, but this approach is still useful for legacy
> binaries.
>
> Here the kernel expects a bit conservative stack size (for 64-bit apps).
> Legacy binaries used a too-small sigaltstack would be already overflowed
> before this change, if they run on modern hardware.
[...]
> diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
> index ee6f1ceaa7a2..cee41d684dc2 100644
> --- a/arch/x86/kernel/signal.c
> +++ b/arch/x86/kernel/signal.c
> @@ -251,8 +251,13 @@ get_sigframe(struct k_sigaction *ka, struct pt_regs 
> *regs, size_t frame_size,
>
> /* This is the X/Open sanctioned signal stack switching.  */
> if (ka->sa.sa_flags & SA_ONSTACK) {
> -   if (sas_ss_flags(sp) == 0)
> +   if (sas_ss_flags(sp) == 0) {
> +   /* If the altstack might overflow, die with SIGSEGV: 
> */
> +   if (!altstack_size_ok(current))
> +   return (void __user *)-1L;
> +
> sp = current->sas_ss_sp + current->sas_ss_size;
> +   }

A couple lines further down, we have this (since commit 14fc9fbc700d):

/*
 * If we are on the alternate signal stack and would overflow it, don't.
 * Return an always-bogus address instead so we will die with SIGSEGV.
 */
if (onsigstack && !likely(on_sig_stack(sp)))
return (void __user *)-1L;

Is that not working?


(It won't handle the case where the kernel fills up almost all of the
alternate stack, and the userspace signal handler then overflows out
of the alternate signal stack. But there isn't much the kernel can do
about that...)


Re: [PATCH v2] drivers/virt: vmgenid: add vm generation id driver

2020-11-20 Thread Jann Horn
On Mon, Nov 16, 2020 at 4:35 PM Catangiu, Adrian Costin
 wrote:
> This patch is a driver that exposes a monotonic incremental Virtual
> Machine Generation u32 counter via a char-dev FS interface that
> provides sync and async VmGen counter updates notifications. It also
> provides VmGen counter retrieval and confirmation mechanisms.
>
> The hw provided UUID is not exposed to userspace, it is internally
> used by the driver to keep accounting for the exposed VmGen counter.
> The counter starts from zero when the driver is initialized and
> monotonically increments every time the hw UUID changes (the VM
> generation changes).
>
> On each hw UUID change, the new hypervisor-provided UUID is also fed
> to the kernel RNG.

As for v1:

Is there a reasonable usecase for the "confirmation" mechanism? It
doesn't seem very useful to me.

How do you envision integrating this with libraries that have to work
in restrictive seccomp sandboxes? If this was in the vDSO, that would
be much easier.


[PATCH v2] seccomp: Remove bogus __user annotations

2020-11-20 Thread Jann Horn
Buffers that are passed to read_actions_logged() and write_actions_logged()
are in kernel memory; the sysctl core takes care of copying from/to
userspace.

Fixes: 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler")
Reviewed-by: Tyler Hicks 
Signed-off-by: Jann Horn 
---
v2: corrected "Fixes" tag

 kernel/seccomp.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 8ad7a293255a..c2bff3561846 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1968,7 +1968,7 @@ static bool seccomp_actions_logged_from_names(u32 
*actions_logged, char *names)
return true;
 }
 
-static int read_actions_logged(struct ctl_table *ro_table, void __user *buffer,
+static int read_actions_logged(struct ctl_table *ro_table, void *buffer,
   size_t *lenp, loff_t *ppos)
 {
char names[sizeof(seccomp_actions_avail)];
@@ -1986,7 +1986,7 @@ static int read_actions_logged(struct ctl_table 
*ro_table, void __user *buffer,
	return proc_dostring(&table, 0, buffer, lenp, ppos);
 }
 
-static int write_actions_logged(struct ctl_table *ro_table, void __user 
*buffer,
+static int write_actions_logged(struct ctl_table *ro_table, void *buffer,
size_t *lenp, loff_t *ppos, u32 *actions_logged)
 {
char names[sizeof(seccomp_actions_avail)];

base-commit: 4d02da974ea85a62074efedf354e82778f910d82
-- 
2.29.2.454.gaff20da3a2-goog



Re: [PATCH] seccomp: Remove bogus __user annotations

2020-11-20 Thread Jann Horn
On Fri, Nov 20, 2020 at 4:36 PM Tyler Hicks  wrote:
> Hey Jann - Thanks for cleaning this up!
>
> On 2020-11-20 02:59:13, Jann Horn wrote:
> > Buffers that are passed to read_actions_logged() and write_actions_logged()
> > are in kernel memory; the sysctl core takes care of copying from/to
> > userspace.
> >
> > Fixes: 0ddec0fc8900 ("seccomp: Sysctl to configure actions that are allowed 
> > to be logged")
>
> After tracing back through the code, I was struggling to understand why
> I thought the __user annotation was needed back then. It turns out that
> __user was correct when I wrote 0ddec0fc8900 and that the Fixes tag
> should be changed to this:
>
>  Fixes: 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler")
>
> If you agree, please adjust and resubmit with:
>
>  Reviewed-by: Tyler Hicks 
>
> Thank you!

Aaaah, that makes sense. Thanks, will do.


[PATCH] seccomp: Remove bogus __user annotations

2020-11-19 Thread Jann Horn
Buffers that are passed to read_actions_logged() and write_actions_logged()
are in kernel memory; the sysctl core takes care of copying from/to
userspace.

Fixes: 0ddec0fc8900 ("seccomp: Sysctl to configure actions that are allowed to 
be logged")
Signed-off-by: Jann Horn 
---
 kernel/seccomp.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 8ad7a293255a..c2bff3561846 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1968,7 +1968,7 @@ static bool seccomp_actions_logged_from_names(u32 
*actions_logged, char *names)
return true;
 }
 
-static int read_actions_logged(struct ctl_table *ro_table, void __user *buffer,
+static int read_actions_logged(struct ctl_table *ro_table, void *buffer,
   size_t *lenp, loff_t *ppos)
 {
char names[sizeof(seccomp_actions_avail)];
@@ -1986,7 +1986,7 @@ static int read_actions_logged(struct ctl_table 
*ro_table, void __user *buffer,
	return proc_dostring(&table, 0, buffer, lenp, ppos);
 }
 
-static int write_actions_logged(struct ctl_table *ro_table, void __user 
*buffer,
+static int write_actions_logged(struct ctl_table *ro_table, void *buffer,
size_t *lenp, loff_t *ppos, u32 *actions_logged)
 {
char names[sizeof(seccomp_actions_avail)];

base-commit: 4d02da974ea85a62074efedf354e82778f910d82
-- 
2.29.2.454.gaff20da3a2-goog



Re: [PATCH v4] mm: Optional full ASLR for mmap() and mremap()

2020-11-18 Thread Jann Horn
On Tue, Nov 17, 2020 at 5:55 PM Matthew Wilcox  wrote:
> On Mon, Oct 26, 2020 at 06:05:18PM +0200, Topi Miettinen wrote:
> > Writing a new value of 3 to /proc/sys/kernel/randomize_va_space
> > enables full randomization of memory mappings created with mmap(NULL,
> > ...). With 2, the base of the VMA used for such mappings is random,
> > but the mappings are created in predictable places within the VMA and
> > in sequential order. With 3, new VMAs are created to fully randomize
> > the mappings. Also mremap(..., MREMAP_MAYMOVE) will move the mappings
> > even if not necessary.
>
> Is this worth it?
>
> https://www.ndss-symposium.org/ndss2017/ndss-2017-programme/aslrcache-practical-cache-attacks-mmu/

Yeah, against local attacks (including from JavaScript), ASLR isn't
very robust; but it should still help against true remote attacks
(modulo craziness like NetSpectre).

E.g. Mateusz Jurczyk's remote Samsung phone exploit via MMS messages
(https://googleprojectzero.blogspot.com/2020/08/mms-exploit-part-5-defeating-aslr-getting-rce.html)
would've probably been quite a bit harder to pull off if he hadn't
been able to rely on having all those memory mappings sandwiched
together.


Re: [PATCH 2/3] lib: zlib_inflate: improves decompression performance

2020-11-11 Thread Jann Horn
On Wed, Nov 11, 2020 at 5:06 PM Zhaoxiu Zeng  wrote:
> On 2020/11/11 11:46, Jann Horn wrote:
> > On Mon, Nov 9, 2020 at 8:27 PM  wrote:
> >> This patch does:
> >> 1. Cleanup code and reduce branches
> >> 2. Use copy_from_back to copy the matched bytes from the back output buffer
> >
> > What exactly is copy_from_back()? Is it like memmove()? If yes, have
> > you tried using memmove() instead of the code added in patch 1/3?
> >
>
> If use memcpy(or memmove), the code will be like this:
> while (dist < len) {
> memcpy(out, out - dist, dist);
> out += dist;
> len -= dist;
> }
> memcpy(out, out - dist, len);

Ah, thanks. So basically it means: "repeatedly copy a pattern of
length `dist` from `out-dist` to `out` until `len` bytes have been
written"


Re: [PATCH 2/3] lib: zlib_inflate: improves decompression performance

2020-11-10 Thread Jann Horn
On Mon, Nov 9, 2020 at 8:27 PM  wrote:
> This patch does:
> 1. Cleanup code and reduce branches
> 2. Use copy_from_back to copy the matched bytes from the back output buffer

What exactly is copy_from_back()? Is it like memmove()? If yes, have
you tried using memmove() instead of the code added in patch 1/3?


Re: [PATCH v7 3/9] arm64, kfence: enable KFENCE for ARM64

2020-11-03 Thread Jann Horn
On Tue, Nov 3, 2020 at 6:59 PM Marco Elver  wrote:
> Add architecture specific implementation details for KFENCE and enable
> KFENCE for the arm64 architecture. In particular, this implements the
> required interface in <asm/kfence.h>.
>
> KFENCE requires that attributes for pages from its memory pool can
> individually be set. Therefore, force the entire linear map to be mapped
> at page granularity. Doing so may result in extra memory allocated for
> page tables in case rodata=full is not set; however, currently
> CONFIG_RODATA_FULL_DEFAULT_ENABLED=y is the default, and the common case
> is therefore not affected by this change.
>
> Reviewed-by: Dmitry Vyukov 
> Co-developed-by: Alexander Potapenko 
> Signed-off-by: Alexander Potapenko 
> Signed-off-by: Marco Elver 

Reviewed-by: Jann Horn 


Re: [PATCH v7 8/9] kfence: add test suite

2020-11-03 Thread Jann Horn
On Tue, Nov 3, 2020 at 6:59 PM Marco Elver  wrote:
> Add KFENCE test suite, testing various error detection scenarios. Makes
> use of KUnit for test organization. Since KFENCE's interface to obtain
> error reports is via the console, the test verifies that KFENCE outputs
> expected reports to the console.
>
> Reviewed-by: Dmitry Vyukov 
> Co-developed-by: Alexander Potapenko 
> Signed-off-by: Alexander Potapenko 
> Signed-off-by: Marco Elver 

Reviewed-by: Jann Horn 


Re: [PATCH v7 7/9] kfence, Documentation: add KFENCE documentation

2020-11-03 Thread Jann Horn
On Tue, Nov 3, 2020 at 6:59 PM Marco Elver  wrote:
> Add KFENCE documentation in dev-tools/kfence.rst, and add to index.
>
> Reviewed-by: Dmitry Vyukov 
> Co-developed-by: Alexander Potapenko 
> Signed-off-by: Alexander Potapenko 
> Signed-off-by: Marco Elver 

Reviewed-by: Jann Horn 


Re: [PATCH v7 2/9] x86, kfence: enable KFENCE for x86

2020-11-03 Thread Jann Horn
On Tue, Nov 3, 2020 at 6:59 PM Marco Elver  wrote:
> Add architecture specific implementation details for KFENCE and enable
> KFENCE for the x86 architecture. In particular, this implements the
> required interface in <asm/kfence.h> for setting up the pool and
> providing helper functions for protecting and unprotecting pages.
>
> For x86, we need to ensure that the pool uses 4K pages, which is done
> using the set_memory_4k() helper function.
>
> Reviewed-by: Dmitry Vyukov 
> Co-developed-by: Marco Elver 
> Signed-off-by: Marco Elver 
> Signed-off-by: Alexander Potapenko 

Reviewed-by: Jann Horn 


Re: [PATCH v7 1/9] mm: add Kernel Electric-Fence infrastructure

2020-11-03 Thread Jann Horn
On Tue, Nov 3, 2020 at 6:58 PM Marco Elver  wrote:
> This adds the Kernel Electric-Fence (KFENCE) infrastructure. KFENCE is a
> low-overhead sampling-based memory safety error detector of heap
> use-after-free, invalid-free, and out-of-bounds access errors.

Reviewed-by: Jann Horn 


Re: [PATCH resend v3 2/2] exec: Broadly lock nascent mm until setup_arg_pages()

2020-11-02 Thread Jann Horn
On Tue, Oct 20, 2020 at 9:15 PM Jason Gunthorpe  wrote:
> On Sat, Oct 17, 2020 at 12:57:13AM +0200, Jann Horn wrote:
> > @@ -1545,6 +1532,18 @@ void setup_new_exec(struct linux_binprm * bprm)
> >   me->mm->task_size = TASK_SIZE;
> >   mutex_unlock(&me->signal->exec_update_mutex);
> >   mutex_unlock(&me->signal->cred_guard_mutex);
> > +
> > + if (!IS_ENABLED(CONFIG_MMU)) {
> > + /*
> > +  * On MMU, setup_arg_pages() wants to access bprm->vma after
> > +  * this point, so we can't drop the mmap lock yet.
> > +  * On !MMU, we have neither setup_arg_pages() nor bprm->vma,
> > +  * so we should drop the lock here.
> > +  */
> > + mmap_write_unlock(bprm->mm);
> > + mmput(bprm->mm);
> > + bprm->mm = NULL;
> > + }
>
> The only thing I dislike about this is how tricky the lock lifetime
> is, it all looks correct, but expecting the setup_arg_pages() or
> setup_new_exec() to unlock (depending!) is quite tricky.
>
> It feels like it would be clearer to have an explicit function to do
> this, like 'release_brp_mm()' indicating that current->mm is now the
> only way to get the mm and it must be locked.

That was a good suggestion; I tried to amend my patch as suggested,
and while trying to do that, noticed that under CONFIG_MMU,
binfmt_flat first does setup_new_exec(), then vm_mmap(), and then
later on setup_arg_pages()...

So your suggestion indeed helped make it clear that my patch was
wrong. Guess I'll have to go figure out how to rearrange the pieces in
binfmt_flat to make this work...


Re: [RFC PATCH resend 3/6] mm: Add refcount for preserving mm_struct without pgd

2020-11-02 Thread Jann Horn
On Tue, Nov 3, 2020 at 3:11 AM Jann Horn  wrote:
> On Sat, Oct 17, 2020 at 2:30 AM Jann Horn  wrote:
> > On Sat, Oct 17, 2020 at 1:21 AM Jason Gunthorpe  wrote:
> > > On Sat, Oct 17, 2020 at 01:09:12AM +0200, Jann Horn wrote:
> > > > Currently, mm_struct has two refcounts:
> > > >
> > > >  - mm_users: preserves everything - the mm_struct, the page tables, the
> > > >memory mappings, and so on
> > > >  - mm_count: preserves the mm_struct and pgd
> > > >
> > > > However, there are three types of users of mm_struct:
> > > >
> > > > 1. users that want page tables, memory mappings and so on
> > > > 2. users that want to preserve the pgd (for lazy TLB)
> > > > 3. users that just want to keep the mm_struct itself around (e.g. for
> > > >mmget_not_zero() or __ptrace_may_access())
> > > >
> > > > Dropping mm_count references can be messy because dropping mm_count to
> > > > zero deletes the pgd, which takes the pgd_lock on x86, meaning it 
> > > > doesn't
> > > > work from RCU callbacks (which run in IRQ context). In those cases,
> > > > mmdrop_async() must be used to punt the invocation of __mmdrop() to
> > > > workqueue context.
> > > >
> > > > That's fine when mmdrop_async() is a rare case, but the preceding patch
> > > > "ptrace: Keep mm around after exit_mm() for __ptrace_may_access()" 
> > > > makes it
> > > > the common case; we should probably avoid punting freeing to workqueue
> > > > context all the time if we can avoid it?
> > > >
> > > > To resolve this, add a third refcount that just protects the mm_struct 
> > > > and
> > > > the user_ns it points to, and which can be dropped with synchronous 
> > > > freeing
> > > > from (almost) any context.
> > > >
> > > > Signed-off-by: Jann Horn 
> > > > ---
> > > >  arch/x86/kernel/tboot.c|  2 ++
> > > >  drivers/firmware/efi/efi.c |  2 ++
> > > >  include/linux/mm_types.h   | 13 +++--
> > > >  include/linux/sched/mm.h   | 13 +
> > > >  kernel/fork.c  | 14 ++
> > > >  mm/init-mm.c   |  2 ++
> > > >  6 files changed, 40 insertions(+), 6 deletions(-)
> > >
> > > I think mmu notifiers and the stuff in drivers/infiniband/core/ can be
> > > converted to this as well..
> > >
> > > Actually I kind of wonder if you should go the reverse and find the
> > > few callers that care about the pgd and give them a new api with a
> > > better name (mmget_pgd?).
> >
> > Yeah, that might make more sense... as long as I'm really sure about
> > all the places I haven't changed. ^^
> >
> > I'll try to change it as you suggested for v2.
>
> Actually, no - I think it ends up being around 30 mentions of the
> "take reference without PGD" function and around 35 mentions of the
> "take reference with PGD" function (assuming that all the weird
> powerpc stuff I don't understand needs the mm_context to not yet be
> destroyed). (A decent chunk of which are all the per-arch functions
> for starting secondary processors.) So I don't think doing it the way
> you suggested would really make the patch(es) smaller.
>
> And I think that it is helpful for review purposes to have separate
> patches for every converted site, and leave things as-is by default.
> If the semantics change for every user that is *not* touched by the
> patch, that makes it really easy for mistakes to slip through.
>
> I could try to convert more callers though?

But really, I would be happiest if I could just leave all the callers
where both refcounts work as-is, and let people more familiar with the
subsystems switch them over when that actually becomes necessary. Is
that not acceptable?


Re: [RFC PATCH resend 3/6] mm: Add refcount for preserving mm_struct without pgd

2020-11-02 Thread Jann Horn
On Sat, Oct 17, 2020 at 2:30 AM Jann Horn  wrote:
> On Sat, Oct 17, 2020 at 1:21 AM Jason Gunthorpe  wrote:
> > On Sat, Oct 17, 2020 at 01:09:12AM +0200, Jann Horn wrote:
> > > Currently, mm_struct has two refcounts:
> > >
> > >  - mm_users: preserves everything - the mm_struct, the page tables, the
> > >memory mappings, and so on
> > >  - mm_count: preserves the mm_struct and pgd
> > >
> > > However, there are three types of users of mm_struct:
> > >
> > > 1. users that want page tables, memory mappings and so on
> > > 2. users that want to preserve the pgd (for lazy TLB)
> > > 3. users that just want to keep the mm_struct itself around (e.g. for
> > >mmget_not_zero() or __ptrace_may_access())
> > >
> > > Dropping mm_count references can be messy because dropping mm_count to
> > > zero deletes the pgd, which takes the pgd_lock on x86, meaning it doesn't
> > > work from RCU callbacks (which run in IRQ context). In those cases,
> > > mmdrop_async() must be used to punt the invocation of __mmdrop() to
> > > workqueue context.
> > >
> > > That's fine when mmdrop_async() is a rare case, but the preceding patch
> > > "ptrace: Keep mm around after exit_mm() for __ptrace_may_access()" makes 
> > > it
> > > the common case; we should probably avoid punting freeing to workqueue
> > > context all the time if we can avoid it?
> > >
> > > To resolve this, add a third refcount that just protects the mm_struct and
> > > the user_ns it points to, and which can be dropped with synchronous 
> > > freeing
> > > from (almost) any context.
> > >
> > > Signed-off-by: Jann Horn 
> > > ---
> > >  arch/x86/kernel/tboot.c|  2 ++
> > >  drivers/firmware/efi/efi.c |  2 ++
> > >  include/linux/mm_types.h   | 13 +++--
> > >  include/linux/sched/mm.h   | 13 +
> > >  kernel/fork.c  | 14 ++
> > >  mm/init-mm.c   |  2 ++
> > >  6 files changed, 40 insertions(+), 6 deletions(-)
> >
> > I think mmu notifiers and the stuff in drivers/infiniband/core/ can be
> > converted to this as well..
> >
> > Actually I kind of wonder if you should go the reverse and find the
> > few callers that care about the pgd and give them a new api with a
> > better name (mmget_pgd?).
>
> Yeah, that might make more sense... as long as I'm really sure about
> all the places I haven't changed. ^^
>
> I'll try to change it as you suggested for v2.

Actually, no - I think it ends up being around 30 mentions of the
"take reference without PGD" function and around 35 mentions of the
"take reference with PGD" function (assuming that all the weird
powerpc stuff I don't understand needs the mm_context to not yet be
destroyed). (A decent chunk of which are all the per-arch functions
for starting secondary processors.) So I don't think doing it the way
you suggested would really make the patch(es) smaller.

And I think that it is helpful for review purposes to have separate
patches for every converted site, and leave things as-is by default.
If the semantics change for every user that is *not* touched by the
patch, that makes it really easy for mistakes to slip through.

I could try to convert more callers though?


[PATCH 2/3] samples: seccomp: simplify user-trap sample

2020-11-02 Thread Jann Horn
Now that the blocking unotify API returns when all users of the filter are
gone, get rid of the helper task that kills the blocked task.

Note that since seccomp only tracks whether tasks are completely gone, not
whether they have exited, we need to handle SIGCHLD while blocked on
SECCOMP_IOCTL_NOTIF_RECV. Alternatively we could also set SIGCHLD to
SIG_IGN and let the kernel autoreap exiting children.
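
For illustration, the SIG_IGN alternative mentioned above would look roughly
like this (sketch, not part of the patch):

#include <signal.h>

/* Setting SIGCHLD to SIG_IGN makes the kernel autoreap exiting children,
 * so no handler and no waitpid() are needed. */
static void autoreap_children(void)
{
	signal(SIGCHLD, SIG_IGN);
}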

Signed-off-by: Jann Horn 
---
 samples/seccomp/user-trap.c | 163 +++-
 1 file changed, 87 insertions(+), 76 deletions(-)

diff --git a/samples/seccomp/user-trap.c b/samples/seccomp/user-trap.c
index b9e666f15998..92d0f9ba58b6 100644
--- a/samples/seccomp/user-trap.c
+++ b/samples/seccomp/user-trap.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -198,6 +199,21 @@ static int handle_req(struct seccomp_notif *req,
return ret;
 }
 
+static pid_t worker;
+static int worker_status;
+
+static void sigchld_handler(int sig, siginfo_t *info, void *ctx)
+{
+   if (info->si_pid != worker) {
+   fprintf(stderr, "SIGCHLD from unknown process\n");
+   return;
+   }
+   if (waitpid(worker, &worker_status, 0) != worker) {
+   perror("waitpid");
+   exit(1);
+   }
+}
+
 static void set_blocking(int fd)
 {
int file_status_flags = fcntl(fd, F_GETFL);
@@ -211,14 +227,26 @@ static void set_blocking(int fd)
 
 int main(void)
 {
-   int sk_pair[2], ret = 1, status, listener;
-   pid_t worker = 0 , tracer = 0;
+   int sk_pair[2], ret = 1, listener;
+   struct seccomp_notif *req;
+   struct seccomp_notif_resp *resp;
+   struct seccomp_notif_sizes sizes;
+   pid_t parent = getpid();
 
if (socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair) < 0) {
perror("socketpair");
return 1;
}
 
+   struct sigaction sigchld_act = {
+   .sa_sigaction = sigchld_handler,
+   .sa_flags = SA_SIGINFO
+   };
+   if (sigaction(SIGCHLD, &sigchld_act, NULL)) {
+   perror("sigaction");
+   return 1;
+   }
+
worker = fork();
if (worker < 0) {
perror("fork");
@@ -226,6 +254,14 @@ int main(void)
}
 
if (worker == 0) {
+   /* quit if the parent dies */
+   if (prctl(PR_SET_PDEATHSIG, SIGKILL)) {
+   perror("PR_SET_PDEATHSIG");
+   exit(1);
+   }
+   if (getppid() != parent)
+   exit(1);
+
listener = user_trap_syscall(__NR_mount,
 SECCOMP_FILTER_FLAG_NEW_LISTENER);
if (listener < 0) {
@@ -283,87 +319,69 @@ int main(void)
 */
listener = recv_fd(sk_pair[0]);
if (listener < 0)
-   goto out_kill;
+   goto close_pair;
 
set_blocking(listener);
 
-   /*
-* Fork a task to handle the requests. This isn't strictly necessary,
-* but it makes the particular writing of this sample easier, since we
-* can just wait ofr the tracee to exit and kill the tracer.
-*/
-   tracer = fork();
-   if (tracer < 0) {
-   perror("fork");
-   goto out_kill;
+   if (seccomp(SECCOMP_GET_NOTIF_SIZES, 0, &sizes) < 0) {
+   perror("seccomp(GET_NOTIF_SIZES)");
+   goto out_close;
}
 
-   if (tracer == 0) {
-   struct seccomp_notif *req;
-   struct seccomp_notif_resp *resp;
-   struct seccomp_notif_sizes sizes;
+   req = malloc(sizes.seccomp_notif);
+   if (!req)
+   goto out_close;
+
+   resp = malloc(sizes.seccomp_notif_resp);
+   if (!resp)
+   goto out_req;
+   memset(resp, 0, sizes.seccomp_notif_resp);
+
+   while (1) {
+   memset(req, 0, sizes.seccomp_notif);
+   if (ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, req)) {
+   if (errno == ENOTCONN) {
+   /* The child process went away. */
+   break;
+   } else if (errno == EINTR) {
+   continue;
+   }
 
-   if (seccomp(SECCOMP_GET_NOTIF_SIZES, 0, &sizes) < 0) {
-   perror("seccomp(GET_NOTIF_SIZES)");
-   goto out_close;
+   perror("ioctl recv");
+   goto out_resp;
}
 
-   req = malloc(sizes.seccomp_notif);
-   if (!req)
-   goto out_close;
-
-   resp = malloc(sizes.seccomp_notif_resp);
-   if (!resp)
-   goto

[PATCH 1/3] seccomp: Return from SECCOMP_IOCTL_NOTIF_RECV when children are gone

2020-11-02 Thread Jann Horn
At the moment, the seccomp notifier API is hard to use without combining
it with APIs like poll() or epoll(); if all target processes have gone
away, the polling APIs will raise an error indication on the file
descriptor, but SECCOMP_IOCTL_NOTIF_RECV will keep blocking indefinitely.

This usability problem was discovered when Michael Kerrisk wrote a
manpage for this API.

To fix it, get rid of the semaphore logic and let SECCOMP_IOCTL_NOTIF_RECV
behave as follows:

If O_NONBLOCK is set, SECCOMP_IOCTL_NOTIF_RECV always returns
immediately, no matter whether a notification is available or not.

If O_NONBLOCK is unset, SECCOMP_IOCTL_NOTIF_RECV blocks until either a
notification is delivered to userspace or all users of the filter have
gone away.

To avoid subtle breakage from eventloop-style code that doesn't set
O_NONBLOCK, set O_NONBLOCK by default - userspace can clear it if it
wants blocking behavior, and if blocking-style code forgets to do so,
that will be much more obvious than the breakage we'd get the other way
around.
This also means that UAPI breakage from this change should be limited to
blocking users of the API, of which, to my knowledge, there are none so far
(outside of in-tree sample and selftest code, which this patch adjusts - in
particular the code in samples/ has to change a bunch).
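
For illustration, a userspace supervisor that wants the blocking behavior
described above would have to clear the default O_NONBLOCK, roughly as in
this sketch (the fd name is illustrative):

#include <fcntl.h>

/* Sketch: opt in to blocking SECCOMP_IOCTL_NOTIF_RECV by clearing the
 * O_NONBLOCK flag that is now set by default on the notification fd. */
static int make_notif_fd_blocking(int listener)
{
	int flags = fcntl(listener, F_GETFL);

	if (flags == -1)
		return -1;
	return fcntl(listener, F_SETFL, flags & ~O_NONBLOCK);
}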

This should be backported because future userspace code might otherwise not
work properly on old kernels.

Cc: sta...@vger.kernel.org
Reported-by: Michael Kerrisk 
Signed-off-by: Jann Horn 
---
 kernel/seccomp.c  | 62 +--
 samples/seccomp/user-trap.c   | 16 +
 tools/testing/selftests/seccomp/seccomp_bpf.c | 21 +++
 3 files changed, 79 insertions(+), 20 deletions(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 8ad7a293255a..b3730740515f 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -43,6 +43,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * When SECCOMP_IOCTL_NOTIF_ID_VALID was first introduced, it had the
@@ -138,7 +139,6 @@ struct seccomp_kaddfd {
  * @notifications: A list of struct seccomp_knotif elements.
  */
 struct notification {
-   struct semaphore request;
u64 next_id;
struct list_head notifications;
 };
@@ -863,7 +863,6 @@ static int seccomp_do_user_notification(int this_syscall,
	list_add(&n.list, &match->notif->notifications);
	INIT_LIST_HEAD(&n.addfd);
 
-	up(&match->notif->request);
	wake_up_poll(&match->wqh, EPOLLIN | EPOLLRDNORM);
	mutex_unlock(&match->notify_lock);
 
@@ -1179,9 +1178,10 @@ find_notification(struct seccomp_filter *filter, u64 id)
 
 
 static long seccomp_notify_recv(struct seccomp_filter *filter,
-   void __user *buf)
+   void __user *buf, bool blocking)
 {
struct seccomp_knotif *knotif = NULL, *cur;
+   DEFINE_WAIT(wait);
struct seccomp_notif unotif;
ssize_t ret;
 
@@ -1194,11 +1194,9 @@ static long seccomp_notify_recv(struct seccomp_filter 
*filter,
 
	memset(&unotif, 0, sizeof(unotif));
 
-   ret = down_interruptible(&filter->notif->request);
-   if (ret < 0)
-   return ret;
-
	mutex_lock(&filter->notify_lock);
+
+retry:
	list_for_each_entry(cur, &filter->notif->notifications, list) {
if (cur->state == SECCOMP_NOTIFY_INIT) {
knotif = cur;
@@ -1206,14 +1204,40 @@ static long seccomp_notify_recv(struct seccomp_filter 
*filter,
}
}
 
-   /*
-* If we didn't find a notification, it could be that the task was
-* interrupted by a fatal signal between the time we were woken and
-* when we were able to acquire the rw lock.
-*/
if (!knotif) {
-   ret = -ENOENT;
-   goto out;
+   if (!blocking) {
+   if (refcount_read(&filter->users) == 0)
+   ret = -ENOTCONN;
+   else
+   ret = -ENOENT;
+   goto out;
+   }
+
+   /* This has to happen before checking &filter->users. */
+   prepare_to_wait(&filter->wqh, &wait, TASK_INTERRUPTIBLE);
+
+   /*
+* If all users of the filter are gone, throw an error
+* instead of pointlessly continuing to block.
+*/
+   if (refcount_read(&filter->users) == 0) {
+   finish_wait(&filter->wqh, &wait);
+   ret = -ENOTCONN;
+   goto out;
+   }
+
+   /* No notifications - wait for one... */
+   mutex_unlock(&filter->notify_lock);
+   freezable_schedule();
+   finish_wait(&filter->wqh, &wait);
+
+   /* ... and retry */
+   mutex_lock(&filter->notify_lock);
+   if (signal_pending(current)) {
+   ret = -EINTR;
+  

[PATCH 3/3] selftests/seccomp: Test NOTIF_RECV empty/dead errors

2020-11-02 Thread Jann Horn
Test that SECCOMP_IOCTL_NOTIF_RECV on a seccomp fd with zero users returns
-ENOTCONN, both in blocking and in non-blocking mode.
Also test that SECCOMP_IOCTL_NOTIF_RECV on a seccomp fd with no active
notifications returns -ENOENT in non-blocking mode.

Signed-off-by: Jann Horn 
---
 tools/testing/selftests/seccomp/seccomp_bpf.c | 127 ++
 1 file changed, 127 insertions(+)

diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c 
b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 5318f9cb1aec..8214c431ad4b 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -28,6 +28,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -315,6 +316,35 @@ static int __filecmp(pid_t pid1, pid_t pid2, int fd1, int 
fd2)
}   \
_ret; })
 
+static void store_and_wake(int *ptr, int value)
+{
+   __atomic_store(ptr, &value, __ATOMIC_RELEASE);
+   if (syscall(__NR_futex, ptr, FUTEX_WAKE, INT_MAX, NULL, NULL, 0) == -1) 
{
+   perror("FUTEX_WAKE failed unexpectedly");
+   exit(EXIT_FAILURE);
+   }
+}
+
+static int wait_and_load(int *ptr, int placeholder)
+{
+   int futex_ret;
+   int value;
+
+retry:
+   futex_ret = syscall(__NR_futex, ptr, FUTEX_WAIT, placeholder, NULL,
+   NULL, 0);
+   if (futex_ret == -1 && errno != EAGAIN) {
+   if (errno == EINTR)
+   goto retry;
+   perror("FUTEX_WAIT failed unexpectedly");
+   exit(EXIT_FAILURE);
+   }
+   value = __atomic_load_n(ptr, __ATOMIC_ACQUIRE);
+   if (value == placeholder)
+   goto retry;
+   return value;
+}
+
 TEST(kcmp)
 {
int ret;
@@ -4160,6 +4190,103 @@ TEST(user_notification_addfd_rlimit)
close(memfd);
 }
 
+TEST(user_notification_recv_dead)
+{
+   pid_t pid;
+   int status;
+   struct __clone_args args = {
+   .flags = CLONE_FILES,
+   .exit_signal = SIGCHLD,
+   };
+   struct seccomp_notif notif = {};
+   struct shared_data {
+   int notif_fd;
+   } *shared_data;
+
+   shared_data = mmap(NULL, sizeof(struct shared_data),
+  PROT_READ|PROT_WRITE, MAP_ANONYMOUS|MAP_SHARED,
+  -1, 0);
+   ASSERT_NE(NULL, shared_data);
+   ASSERT_EQ(0, prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
+   TH_LOG("Kernel does not support PR_SET_NO_NEW_PRIVS!");
+   }
+
+   pid = sys_clone3(&args, sizeof(args));
+   ASSERT_GE(pid, 0);
+
+   if (pid == 0) {
+   shared_data->notif_fd = user_notif_syscall(
+   __NR_mknodat, SECCOMP_FILTER_FLAG_NEW_LISTENER);
+   if (shared_data->notif_fd < 0)
+   _exit(EXIT_FAILURE);
+
+   _exit(EXIT_SUCCESS);
+   }
+
+   EXPECT_EQ(waitpid(pid, , 0), pid);
+   EXPECT_EQ(true, WIFEXITED(status));
+   EXPECT_EQ(0, WEXITSTATUS(status));
+
+   /* non-blocking recv */
+   EXPECT_EQ(-1, ioctl(shared_data->notif_fd, SECCOMP_IOCTL_NOTIF_RECV, 
&notif));
+   EXPECT_EQ(ENOTCONN, errno);
+
+   /* blocking recv */
+   set_blocking(shared_data->notif_fd);
+   EXPECT_EQ(-1, ioctl(shared_data->notif_fd, SECCOMP_IOCTL_NOTIF_RECV, 
&notif));
+   EXPECT_EQ(ENOTCONN, errno);
+}
+
+TEST(user_notification_recv_nonblock)
+{
+   pid_t pid;
+   struct __clone_args args = {
+   .flags = CLONE_FILES,
+   .exit_signal = SIGCHLD,
+   };
+   struct seccomp_notif notif = {};
+   struct shared_data {
+   int notif_fd;
+   } *shared_data;
+
+   shared_data = mmap(NULL, sizeof(struct shared_data),
+  PROT_READ|PROT_WRITE, MAP_ANONYMOUS|MAP_SHARED,
+  -1, 0);
+   ASSERT_NE(NULL, shared_data);
+   shared_data->notif_fd = -1;
+
+   ASSERT_EQ(0, prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
+   TH_LOG("Kernel does not support PR_SET_NO_NEW_PRIVS!");
+   }
+
+   pid = sys_clone3(&args, sizeof(args));
+   ASSERT_GE(pid, 0);
+
+   if (pid == 0) {
+   int fd;
+
+   fd = user_notif_syscall(__NR_mknodat,
+   SECCOMP_FILTER_FLAG_NEW_LISTENER);
+   if (fd < 0) {
+   store_and_wake(&shared_data->notif_fd, -2);
+   _exit(EXIT_FAILURE);
+   }
+   store_and_wake(&shared_data->notif_fd, fd);
+
+   while (1)
+   pause();
+   }
+
+   wait_and_load(&shared_data->notif_fd, -1);
+
+   /* non-blocking recv */
+   EXPECT_EQ(-1, ioctl(shared_data->notif_fd, SECCOMP_IOCTL_NOTIF_RECV,
+   &notif));
+   EXPECT_EQ(ENOENT, errno);
+
+ 

Re: For review: seccomp_user_notif(2) manual page [v2]

2020-11-02 Thread Jann Horn
On Mon, Nov 2, 2020 at 8:50 PM Sargun Dhillon  wrote:
> On Mon, Nov 2, 2020 at 11:45 AM Michael Kerrisk (man-pages)
>  wrote:
> >Caveats regarding blocking system calls
> >Suppose that the target performs a blocking system call (e.g.,
> >accept(2)) that the supervisor should handle.  The supervisor
> >might then in turn execute the same blocking system call.
> >
> >In this scenario, it is important to note that if the target's
> >system call is now interrupted by a signal, the supervisor is not
> >informed of this.  If the supervisor does not take suitable steps
> >to actively discover that the target's system call has been
> >canceled, various difficulties can occur.  Taking the example of
> >accept(2), the supervisor might remain blocked in its accept(2)
> >holding a port number that the target (which, after the
> >interruption by the signal handler, perhaps closed  its listening
> >socket) might expect to be able to reuse in a bind(2) call.
> >
> >Therefore, when the supervisor wishes to emulate a blocking system
> >call, it must do so in such a way that it gets informed if the
> >target's system call is interrupted by a signal handler.  For
> >example, if the supervisor itself executes the same blocking
> >system call, then it could employ a separate thread that uses the
> >SECCOMP_IOCTL_NOTIF_ID_VALID operation to check if the target is
> >still blocked in its system call.  Alternatively, in the accept(2)
> >example, the supervisor might use poll(2) to monitor both the
>notification file descriptor (so as to discover when the
> >target's accept(2) call has been interrupted) and the listening
> >file descriptor (so as to know when a connection is available).
> >
> >If the target's system call is interrupted, the supervisor must
> >take care to release resources (e.g., file descriptors) that it
> >acquired on behalf of the target.
> >
> > Does that seem okay?
> >
> This is far clearer than my explanation. The one thing is that *just*
> poll is not good enough, you would poll, with some timeout, and when
> that timeout is hit, check if all the current notifications are valid,
> as poll isn't woken up when an in progress notification goes off
> AFAIK.

Arguably that's so terrible that it qualifies for being in the BUGS
section of the manpage.

If you want this to be fixed properly, I recommend that someone
implements my proposal from
,
unless you can come up with something better.


Re: [RFC PATCH v1 4/4] Allow to change the user namespace in which user rlimits are counted

2020-11-02 Thread Jann Horn
On Mon, Nov 2, 2020 at 5:52 PM Alexey Gladkov  wrote:
> Add a new prctl to change the user namespace in which the process
> counter is located. A pointer to the user namespace is in cred struct
> to be inherited by all child processes.
[...]
> +   case PR_SET_RLIMIT_USER_NAMESPACE:
> +   if (!capable(CAP_SYS_RESOURCE))
> +   return -EPERM;
> +
> +   switch (arg2) {
> +   case PR_RLIMIT_BIND_GLOBAL_USERNS:
> +   error = set_rlimit_ns(&init_user_ns);
> +   break;
> +   case PR_RLIMIT_BIND_CURRENT_USERNS:
> +   error = set_rlimit_ns(current_user_ns());
> +   break;
> +   default:
> +   error = -EINVAL;
> +   }
> +   break;

I don't see how this can work. capable() requires that
current_user_ns() == &init_user_ns, so you can't use this API to bind
rlimits to any other user namespace.

Fundamentally, if it requires CAP_SYS_RESOURCE, this probably can't be
done as an API that a process uses to change its own rlimit scope. In
that case I would implement this as part of clone3() instead of
prctl(). (Then init_user_ns can set it if the caller has
CAP_SYS_RESOURCE. If you want to have support for doing the same thing
with nested namespaces, you'd also need a flag that the first-level
clone3() can set on the namespace to say "further rlimit splitting
should be allowed".)

Or alternatively, we could say that CAP_SYS_RESOURCE doesn't matter,
and instead you're allowed to move the rlimit scope if your current
hard rlimit is INFINITY. That might make more sense? Maybe?


ASSERT_GE definition is backwards

2020-11-02 Thread Jann Horn
ASSERT_GE() is defined as:

/**
 * ASSERT_GE(expected, seen)
 *
 * @expected: expected value
 * @seen: measured value
 *
 * ASSERT_GE(expected, measured): expected >= measured
 */
#define ASSERT_GE(expected, seen) \
__EXPECT(expected, #expected, seen, #seen, >=, 1)

but that means that logically, if you want to write "assert that the
measured PID X is >= the expected value 0", you actually have to use
ASSERT_LE(0, X). That's really awkward. Normally you'd be talking
about how the seen value compares to the expected one, not the other
way around.

At the moment I see tests that are instead written like ASSERT_GE(X,
0), but then that means that the expected and seen values are the
wrong way around.
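
For illustration, a hypothetical kselftest-harness test (assuming
kselftest_harness.h and <unistd.h>) showing the two ways to write the check:

/* Hypothetical test: asserting that a measured pid is >= 0 either reads
 * backwards or puts the measured value into the "expected" slot. */
TEST(pid_is_nonnegative)
{
	pid_t pid = fork();

	ASSERT_LE(0, pid);	/* documented order: expected first, reads backwards */
	/* ASSERT_GE(pid, 0);	   reads naturally, but swaps expected and seen */

	if (pid == 0)
		_exit(0);
}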

It might be good if someone could refactor the definitions of
ASSERT_GE and such to swap around which number is the expected and
which is the seen one.


Re: For review: seccomp_user_notif(2) manual page [v2]

2020-11-02 Thread Jann Horn
On Sat, Oct 31, 2020 at 9:51 AM Michael Kerrisk (man-pages)
 wrote:
> On 10/30/20 8:20 PM, Jann Horn wrote:
> > On Thu, Oct 29, 2020 at 8:14 PM Michael Kerrisk (man-pages)
> >  wrote:
> >> On 10/29/20 2:42 AM, Jann Horn wrote:
> >>> As discussed at
> >>> <https://lore.kernel.org/r/CAG48ez0m4Y24ZBZCh+Tf4ORMm9_q4n7VOzpGjwGF7_Fe8EQH=q...@mail.gmail.com>,
> >>> we need to re-check checkNotificationIdIsValid() after reading remote
> >>> memory but before using the read value in any way. Otherwise, the
> >>> syscall could in the meantime get interrupted by a signal handler, the
> >>> signal handler could return, and then the function that performed the
> >>> syscall could free() allocations or return (thereby freeing buffers on
> >>> the stack).
> >>>
> >>> In essence, this pread() is (unavoidably) a potential use-after-free
> >>> read; and to make that not have any security impact, we need to check
> >>> whether UAF read occurred before using the read value. This should
> >>> probably be called out elsewhere in the manpage, too...
> >>>
> >>> Now, of course, **reading** is the easy case. The difficult case is if
> >>> we have to **write** to the remote process... because then we can't
> >>> play games like that. If we write data to a freed pointer, we're
> >>> screwed, that's it. (And for somewhat unrelated bonus fun, consider
> >>> that /proc/$pid/mem is originally intended for process debugging,
> >>> including installing breakpoints, and will therefore happily write
> >>> over "readonly" private mappings, such as typical mappings of
> >>> executable code.)
> >>>
> >>> So, h... I guess if anyone wants to actually write memory back to
> >>> the target process, we'd better come up with some dedicated API for
> >>> that, using an ioctl on the seccomp fd that magically freezes the
> >>> target process inside the syscall while writing to its memory, or
> >>> something like that? And until then, the manpage should have a big fat
> >>> warning that writing to the target's memory is simply not possible
> >>> (safely).
> >>
> >> Thank you for your very clear explanation! It turned out to be
> >> trivially easy to demonstrate this issue with a slightly modified
> >> version of my program.
> >>
> >> As well as the change to the code example that I already mentioned
> >> my reply of a few hours ago, I've added the following text to the
> >> page:
> >>
> >>Caveats regarding the use of /proc/[tid]/mem
> >>The discussion above noted the need to use the
> >>SECCOMP_IOCTL_NOTIF_ID_VALID ioctl(2) when opening the
> >>/proc/[tid]/mem file of the target to avoid the possibility of
> >>accessing the memory of the wrong process in the event that the
> >>target terminates and its ID is recycled by another (unrelated)
> >>thread.  However, the use of this ioctl(2) operation is also
> >>necessary in other situations, as explained in the following
> >>pargraphs.
> >
> > (nit: paragraphs)
>
> I spotted that one also already. But thanks for reading carefully!
>
> >>Consider the following scenario, where the supervisor tries to
> >>read the pathname argument of a target's blocked mount(2) system
> >>call:
> > [...]
> >> Seem okay?
> >
> > Yeah, sounds good.
> >
> >> By the way, is there any analogous kind of issue concerning
> >> pidfd_getfd()? I'm thinking not, but I wonder if I've missed
> >> something.
> >
> > When it is used by a seccomp supervisor, you mean? I think basically
> > the same thing applies - when resource identifiers (such as memory
> > addresses or file descriptors) are passed to a syscall, it generally
> > has to be assumed that those identifiers may become invalid and be
> > reused as soon as the syscall has returned.
>
> I probably needed to be more explicit. Would the following (i.e., a
> single cookie check) not be sufficient to handle the above scenario.
> Here, the target is making a syscall a system call that employs the
> file descriptor 'tfd':
>
> T: makes syscall that triggers notification
> S: Get notification
> S: pidfd = pidfd_open(T, 0);
> S: sfd = pifd_getfd(pidfd, tfd, 0)
> S: check that the cookie is still valid
> S: do operation with sfd [*]
>
> By contrast, I can see 

Re: For review: seccomp_user_notif(2) manual page [v2]

2020-11-02 Thread Jann Horn
On Sat, Oct 31, 2020 at 9:31 AM Michael Kerrisk (man-pages)
 wrote:
> On 10/30/20 8:14 PM, Jann Horn wrote:
> > With the caveat that a cancelled syscall
> > could've also led to the memory being munmap()ed, so the nread==0 case
> > could also happen legitimately - so you might want to move this check
> > up above the nread==0 (mm went away) and nread==-1 (mm still exists,
> > but read from address failed, errno EIO) checks if the error message
> > shouldn't appear spuriously.
>
> In any case, I've been refactoring (simplifying) that code a little.
> I haven't so far rearranged the order of the checks, but I already
> log message for the nread==0 case. (Instead, there will eventually
> be an error when the response is sent.)
>
> I also haven't exactly tested the scenario you describe in the
> seccomp unotify scenario, but I think the above is not correct. Here
> are two scenarios I did test, simply with mmap() and /proc/PID/mem
> (no seccomp involved):
>
> Scenario 1:
> A creates a mapping at address X
> B opens /proc/A/mem and and lseeks on resulting FD to offset X
> A terminates
> B reads from FD ==> read() returns 0 (EOF)
>
> Scenario 2:
> A creates a mapping at address X
> B opens /proc/A/mem and and lseeks on resulting FD to offset X
> A unmaps mapping at address X
> B reads from FD ==> read() returns -1 / EIO.
>
> That last scenario seems to contradict what you say, since I
> think you meant that read() should return 0 in that case.
> Have I misunderstood you?

Sorry, I messed up the description when I wrote that. Yes, this looks
as expected - EIO if the VMA is gone, 0 if the mm_users of the
mm_struct have dropped to zero because all tasks that use the mm have
exited.
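
A small sketch of how a reader of the target's /proc/[tid]/mem might
distinguish the two cases described above (the names are illustrative):

#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Interpret pread() on the target's /proc/[tid]/mem per the scenarios above:
 * 0 means the whole mm is gone (all its tasks exited), EIO means the mm
 * still exists but there is no mapping at the given address. */
static ssize_t read_target_mem(int mem_fd, void *buf, size_t len, off_t addr)
{
	ssize_t nread = pread(mem_fd, buf, len, addr);

	if (nread == 0)
		fprintf(stderr, "target mm is gone\n");
	else if (nread == -1 && errno == EIO)
		fprintf(stderr, "no mapping at 0x%jx\n", (uintmax_t)addr);
	return nread;
}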


Re: For review: seccomp_user_notif(2) manual page [v2]

2020-10-30 Thread Jann Horn
On Thu, Oct 29, 2020 at 8:14 PM Michael Kerrisk (man-pages)
 wrote:
> On 10/29/20 2:42 AM, Jann Horn wrote:
> > As discussed at
> > <https://lore.kernel.org/r/CAG48ez0m4Y24ZBZCh+Tf4ORMm9_q4n7VOzpGjwGF7_Fe8EQH=q...@mail.gmail.com>,
> > we need to re-check checkNotificationIdIsValid() after reading remote
> > memory but before using the read value in any way. Otherwise, the
> > syscall could in the meantime get interrupted by a signal handler, the
> > signal handler could return, and then the function that performed the
> > syscall could free() allocations or return (thereby freeing buffers on
> > the stack).
> >
> > In essence, this pread() is (unavoidably) a potential use-after-free
> > read; and to make that not have any security impact, we need to check
> > whether UAF read occurred before using the read value. This should
> > probably be called out elsewhere in the manpage, too...
> >
> > Now, of course, **reading** is the easy case. The difficult case is if
> > we have to **write** to the remote process... because then we can't
> > play games like that. If we write data to a freed pointer, we're
> > screwed, that's it. (And for somewhat unrelated bonus fun, consider
> > that /proc/$pid/mem is originally intended for process debugging,
> > including installing breakpoints, and will therefore happily write
> > over "readonly" private mappings, such as typical mappings of
> > executable code.)
> >
> > So, h... I guess if anyone wants to actually write memory back to
> > the target process, we'd better come up with some dedicated API for
> > that, using an ioctl on the seccomp fd that magically freezes the
> > target process inside the syscall while writing to its memory, or
> > something like that? And until then, the manpage should have a big fat
> > warning that writing to the target's memory is simply not possible
> > (safely).
>
> Thank you for your very clear explanation! It turned out to be
> trivially easy to demonstrate this issue with a slightly modified
> version of my program.
>
> As well as the change to the code example that I already mentioned
> my reply of a few hours ago, I've added the following text to the
> page:
>
>Caveats regarding the use of /proc/[tid]/mem
>The discussion above noted the need to use the
>SECCOMP_IOCTL_NOTIF_ID_VALID ioctl(2) when opening the
>/proc/[tid]/mem file of the target to avoid the possibility of
>accessing the memory of the wrong process in the event that the
>target terminates and its ID is recycled by another (unrelated)
>thread.  However, the use of this ioctl(2) operation is also
>necessary in other situations, as explained in the following
>pargraphs.

(nit: paragraphs)

>Consider the following scenario, where the supervisor tries to
>read the pathname argument of a target's blocked mount(2) system
>call:
[...]
> Seem okay?

Yeah, sounds good.

> By the way, is there any analogous kind of issue concerning
> pidfd_getfd()? I'm thinking not, but I wonder if I've missed
> something.

When it is used by a seccomp supervisor, you mean? I think basically
the same thing applies - when resource identifiers (such as memory
addresses or file descriptors) are passed to a syscall, it generally
has to be assumed that those identifiers may become invalid and be
reused as soon as the syscall has returned.


Re: For review: seccomp_user_notif(2) manual page [v2]

2020-10-30 Thread Jann Horn
On Thu, Oct 29, 2020 at 8:53 PM Michael Kerrisk (man-pages)
 wrote:
> On 10/29/20 4:26 PM, Christian Brauner wrote:
> > I like this manpage. I think this is the most comprehensive explanation
> > of any seccomp feature
>
> Thanks (at least, I think so...)
>
> > and somewhat understandable.
>   
>
> (... but I'm not sure ;-).)

Relevant: http://tinefetz.net/files/gimgs/78_78_17.jpg


Re: For review: seccomp_user_notif(2) manual page [v2]

2020-10-30 Thread Jann Horn
On Thu, Oct 29, 2020 at 3:19 PM Michael Kerrisk (man-pages)
 wrote:
> On 10/29/20 2:42 AM, Jann Horn wrote:
> > On Mon, Oct 26, 2020 at 10:55 AM Michael Kerrisk (man-pages)
> >  wrote:
> >>static bool
> >>getTargetPathname(struct seccomp_notif *req, int notifyFd,
> >>  char *path, size_t len)
> >>{
> >>char procMemPath[PATH_MAX];
> >>
> >>snprintf(procMemPath, sizeof(procMemPath), "/proc/%d/mem", 
> >> req->pid);
> >>
> >>int procMemFd = open(procMemPath, O_RDONLY);
> >>if (procMemFd == -1)
> >>errExit("\tS: open");
> >>
> >>/* Check that the process whose info we are accessing is still 
> >> alive.
> >>   If the SECCOMP_IOCTL_NOTIF_ID_VALID operation (performed
> >>   in checkNotificationIdIsValid()) succeeds, we know that the
> >>   /proc/PID/mem file descriptor that we opened corresponds to 
> >> the
> >>   process for which we received a notification. If that process
> >>   subsequently terminates, then read() on that file descriptor
> >>   will return 0 (EOF). */
> >>
> >>checkNotificationIdIsValid(notifyFd, req->id);
> >>
> >>/* Read bytes at the location containing the pathname argument
> >>   (i.e., the first argument) of the mkdir(2) call */
> >>
> >>ssize_t nread = pread(procMemFd, path, len, req->data.args[0]);
> >>if (nread == -1)
> >>errExit("pread");
> >
> > As discussed at
> > <https://lore.kernel.org/r/CAG48ez0m4Y24ZBZCh+Tf4ORMm9_q4n7VOzpGjwGF7_Fe8EQH=q...@mail.gmail.com>,
> > we need to re-check checkNotificationIdIsValid() after reading remote
> > memory but before using the read value in any way. Otherwise, the
> > syscall could in the meantime get interrupted by a signal handler, the
> > signal handler could return, and then the function that performed the
> > syscall could free() allocations or return (thereby freeing buffers on
> > the stack).
> >
> > In essence, this pread() is (unavoidably) a potential use-after-free
> > read; and to make that not have any security impact, we need to check
> > whether UAF read occurred before using the read value. This should
> > probably be called out elsewhere in the manpage, too...
>
> Thanks very much for pointing me at this!
>
> So, I want to conform that the fix to the code is as simple as
> adding a check following the pread() call, something like:
>
> [[
>  ssize_t nread = pread(procMemFd, path, len, req->data.args[argNum]);
>  if (nread == -1)
> errExit("Supervisor: pread");
>
>  if (nread == 0) {
> fprintf(stderr, "\tS: pread() of /proc/PID/mem "
> "returned 0 (EOF)\n");
> exit(EXIT_FAILURE);
>  }
>
>  if (close(procMemFd) == -1)
> errExit("Supervisor: close-/proc/PID/mem");
>
> +/* Once again check that the notification ID is still valid. The
> +   case we are particularly concerned about here is that just
> +   before we fetched the pathname, the target's blocked system
> +   call was interrupted by a signal handler, and after the handler
> +   returned, the target carried on execution (past the interrupted
> +   system call). In that case, we have no guarantees about what we
> +   are reading, since the target's memory may have been arbitrarily
> +   changed by subsequent operations. */
> +
> +if (!notificationIdIsValid(notifyFd, req->id, "post-open"))
> +return false;
> +
>  /* We have no guarantees about what was in the memory of the target
> process. We therefore treat the buffer returned by pread() as
> untrusted input. The buffer should be terminated by a null byte;
> if not, then we will trigger an error for the target process. */
>
>  if (strnlen(path, nread) < nread)
>  return true;
> ]]

Yeah, that should do the job. With the caveat that a cancelled syscall
could've also led to the memory being munmap()ed, so the nread==0 case
could also happen legitimately - so you might want to move this check
up above the nread==0 (mm went away) and nread==-1 (mm still exists,
but read from address failed, errno EIO) checks if the error message
shouldn't appear spuriously.
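
A sketch of that reordering, reusing the helper and variable names from the
quoted example code above (so it is a fragment, not a standalone program):

	ssize_t nread = pread(procMemFd, path, len, req->data.args[argNum]);

	/* Re-check the notification ID before interpreting nread, so that a
	 * target that was interrupted (and may have munmap()ed or exited)
	 * doesn't trigger a spurious EOF/EIO error report here. */
	if (!notificationIdIsValid(notifyFd, req->id, "post-read"))
		return false;

	if (nread == -1)
		errExit("Supervisor: pread");
	if (nread == 0) {
		fprintf(stderr, "\tS: pread() of /proc/PID/mem returned 0 (EOF)\n");
		exit(EXIT_FAILURE);
	}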


Re: [PATCH v1 1/2] ptrace: Set PF_SUPERPRIV when checking capability

2020-10-30 Thread Jann Horn
On Fri, Oct 30, 2020 at 5:06 PM Mickaël Salaün  wrote:
> On 30/10/2020 16:47, Jann Horn wrote:
> > On Fri, Oct 30, 2020 at 1:39 PM Mickaël Salaün  wrote:
> >> Commit 69f594a38967 ("ptrace: do not audit capability check when outputing
> >> /proc/pid/stat") replaced the use of ns_capable() with
> >> has_ns_capability{,_noaudit}() which doesn't set PF_SUPERPRIV.
> >>
> >> Commit 6b3ad6649a4c ("ptrace: reintroduce usage of subjective credentials 
> >> in
> >> ptrace_has_cap()") replaced has_ns_capability{,_noaudit}() with
> >> security_capable(), which doesn't set PF_SUPERPRIV either.
> >>
> >> Since commit 98f368e9e263 ("kernel: Add noaudit variant of ns_capable()"), 
> >> a
> >> new ns_capable_noaudit() helper is available.  Let's use it!
> >>
> >> As a result, the signature of ptrace_has_cap() is restored to its original 
> >> one.
> >>
> >> Cc: Christian Brauner 
> >> Cc: Eric Paris 
> >> Cc: Jann Horn 
> >> Cc: Kees Cook 
> >> Cc: Oleg Nesterov 
> >> Cc: Serge E. Hallyn 
> >> Cc: Tyler Hicks 
> >> Cc: sta...@vger.kernel.org
> >> Fixes: 6b3ad6649a4c ("ptrace: reintroduce usage of subjective credentials 
> >> in ptrace_has_cap()")
> >> Fixes: 69f594a38967 ("ptrace: do not audit capability check when outputing 
> >> /proc/pid/stat")
> >> Signed-off-by: Mickaël Salaün 
> >
> > Yeah... I guess this makes sense. (We'd have to undo or change it if
> > we ever end up needing to use a different set of credentials, e.g.
> > from ->f_cred, but I guess that's really something we should avoid
> > anyway.)
> >
> > Reviewed-by: Jann Horn 
> >
> > with one nit:
> >
> >
> > [...]
> >>  /* Returns 0 on success, -errno on denial. */
> >>  static int __ptrace_may_access(struct task_struct *task, unsigned int 
> >> mode)
> >>  {
> >> -   const struct cred *cred = current_cred(), *tcred;
> >> +   const struct cred *const cred = current_cred(), *tcred;
> >
> > This is an unrelated change, and almost no kernel code marks local
> > pointer variables as "const". I would drop this change from the patch.
>
> This gives a guarantee that the cred variable will not be used for
> anything other than current_cred(), which kinda proves that this patch
> doesn't change the behavior of __ptrace_may_access() by not using cred
> in ptrace_has_cap(). It doesn't hurt, and I think it could be useful to
> spot issues when backporting.

And it might require an extra fixup while backporting because the next
line is different and that might cause the patch to not apply.


Re: [PATCH v1 1/2] ptrace: Set PF_SUPERPRIV when checking capability

2020-10-30 Thread Jann Horn
On Fri, Oct 30, 2020 at 1:39 PM Mickaël Salaün  wrote:
> Commit 69f594a38967 ("ptrace: do not audit capability check when outputing
> /proc/pid/stat") replaced the use of ns_capable() with
> has_ns_capability{,_noaudit}() which doesn't set PF_SUPERPRIV.
>
> Commit 6b3ad6649a4c ("ptrace: reintroduce usage of subjective credentials in
> ptrace_has_cap()") replaced has_ns_capability{,_noaudit}() with
> security_capable(), which doesn't set PF_SUPERPRIV either.
>
> Since commit 98f368e9e263 ("kernel: Add noaudit variant of ns_capable()"), a
> new ns_capable_noaudit() helper is available.  Let's use it!
>
> As a result, the signature of ptrace_has_cap() is restored to its original 
> one.
>
> Cc: Christian Brauner 
> Cc: Eric Paris 
> Cc: Jann Horn 
> Cc: Kees Cook 
> Cc: Oleg Nesterov 
> Cc: Serge E. Hallyn 
> Cc: Tyler Hicks 
> Cc: sta...@vger.kernel.org
> Fixes: 6b3ad6649a4c ("ptrace: reintroduce usage of subjective credentials in 
> ptrace_has_cap()")
> Fixes: 69f594a38967 ("ptrace: do not audit capability check when outputing 
> /proc/pid/stat")
> Signed-off-by: Mickaël Salaün 

Yeah... I guess this makes sense. (We'd have to undo or change it if
we ever end up needing to use a different set of credentials, e.g.
from ->f_cred, but I guess that's really something we should avoid
anyway.)

Reviewed-by: Jann Horn 

with one nit:


[...]
>  /* Returns 0 on success, -errno on denial. */
>  static int __ptrace_may_access(struct task_struct *task, unsigned int mode)
>  {
> -   const struct cred *cred = current_cred(), *tcred;
> +   const struct cred *const cred = current_cred(), *tcred;

This is an unrelated change, and almost no kernel code marks local
pointer variables as "const". I would drop this change from the patch.

> struct mm_struct *mm;
> kuid_t caller_uid;
> kgid_t caller_gid;


Re: [PATCH v1 2/2] seccomp: Set PF_SUPERPRIV when checking capability

2020-10-30 Thread Jann Horn
On Fri, Oct 30, 2020 at 1:39 PM Mickaël Salaün  wrote:
> Replace the use of security_capable(current_cred(), ...) with
> ns_capable_noaudit() which set PF_SUPERPRIV.
>
> Since commit 98f368e9e263 ("kernel: Add noaudit variant of
> ns_capable()"), a new ns_capable_noaudit() helper is available.  Let's
> use it!
>
> Cc: Jann Horn 
> Cc: Kees Cook 
> Cc: Tyler Hicks 
> Cc: Will Drewry 
> Cc: sta...@vger.kernel.org
> Fixes: e2cfabdfd075 ("seccomp: add system call filtering using BPF")
> Signed-off-by: Mickaël Salaün 

Reviewed-by: Jann Horn 


Re: [PATCH v6 2/9] x86, kfence: enable KFENCE for x86

2020-10-30 Thread Jann Horn
On Fri, Oct 30, 2020 at 2:00 PM Marco Elver  wrote:
> On Fri, 30 Oct 2020 at 03:49, Jann Horn  wrote:
> > On Thu, Oct 29, 2020 at 2:17 PM Marco Elver  wrote:
> > > Add architecture specific implementation details for KFENCE and enable
> > > KFENCE for the x86 architecture. In particular, this implements the
> > > required interface in <asm/kfence.h> for setting up the pool and
> > > providing helper functions for protecting and unprotecting pages.
> > >
> > > For x86, we need to ensure that the pool uses 4K pages, which is done
> > > using the set_memory_4k() helper function.
> > >
> > > Reviewed-by: Dmitry Vyukov 
> > > Co-developed-by: Marco Elver 
> > > Signed-off-by: Marco Elver 
> > > Signed-off-by: Alexander Potapenko 
> > [...]
> > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > [...]
> > > @@ -725,6 +726,9 @@ no_context(struct pt_regs *regs, unsigned long 
> > > error_code,
> > > if (IS_ENABLED(CONFIG_EFI))
> > > efi_recover_from_page_fault(address);
> > >
> > > +   if (kfence_handle_page_fault(address))
> > > +   return;
[...]
> > Unrelated sidenote: Since we're hooking after exception fixup
> > handling, the debug-only KFENCE_STRESS_TEST_FAULTS can probably still
> > cause some behavioral differences through spurious faults in places
> > like copy_user_enhanced_fast_string (where the exception table entries
> > are used even if the *kernel* pointer, not the user pointer, causes a
> > fault). But since KFENCE_STRESS_TEST_FAULTS is exclusively for KFENCE
> > development, the difference might not matter. And ordering them the
> > other way around definitely isn't possible, because the kernel relies
> > on being able to fixup OOB reads. So there probably isn't really
> > anything we can do better here; it's just something to keep in mind.
> > Maybe you can add a little warning to the help text for that Kconfig
> > entry that warns people about this?
>
> Thanks for pointing it out, but that option really is *only* to stress
> kfence with concurrent allocations/frees/page faults. If anybody
> enables this option for anything other than testing kfence, it's their
> own fault. ;-)

Sounds fair. :P

> I'll try to add a generic note to the Kconfig entry, but what you
> mention here seems quite x86-specific.

(FWIW, I think it could currently also happen on arm64 in the rare
cases where KERNEL_DS is used. But luckily Christoph Hellwig has
already gotten rid of most places that did that.)


Re: [PATCH v6 6/9] kfence, kasan: make KFENCE compatible with KASAN

2020-10-30 Thread Jann Horn
On Fri, Oct 30, 2020 at 2:46 PM Marco Elver  wrote:
> On Fri, 30 Oct 2020 at 03:50, Jann Horn  wrote:
> > On Thu, Oct 29, 2020 at 2:17 PM Marco Elver  wrote:
> > > We make KFENCE compatible with KASAN for testing KFENCE itself. In
> > > particular, KASAN helps to catch any potential corruptions to KFENCE
> > > state, or other corruptions that may be a result of freepointer
> > > corruptions in the main allocators.
> > >
> > > To indicate that the combination of the two is generally discouraged,
> > > CONFIG_EXPERT=y should be set. It also gives us the nice property that
> > > KFENCE will be build-tested by allyesconfig builds.
> > >
> > > Reviewed-by: Dmitry Vyukov 
> > > Co-developed-by: Marco Elver 
> > > Signed-off-by: Marco Elver 
> > > Signed-off-by: Alexander Potapenko 
> >
> > Reviewed-by: Jann Horn 
>
> Thanks!
>
> > with one nit:
> >
> > [...]
> > > diff --git a/mm/kasan/common.c b/mm/kasan/common.c
> > [...]
> > > @@ -141,6 +142,14 @@ void kasan_unpoison_shadow(const void *address, size_t size)
> > >  */
> > > address = reset_tag(address);
> > >
> > > +   /*
> > > +    * We may be called from SL*B internals, such as ksize(): with a size
> > > +    * not a multiple of machine-word size, avoid poisoning the invalid
> > > +    * portion of the word for KFENCE memory.
> > > +    */
> > > +   if (is_kfence_address(address))
> > > +   return;
> >
> > It might be helpful if you could add a comment that explains that
> > kasan_poison_object_data() does not need a similar guard because
> > kasan_poison_object_data() is always paired with
> > kasan_unpoison_object_data() - that threw me off a bit at first.
>
> Well, KFENCE objects should never be poisoned/unpoisoned because the
> kasan_alloc and free hooks have a kfence guard, and none of the code
> in sl*b.c that does kasan_{poison,unpoison}_object_data() should be
> executed for KFENCE objects.
>
> But I just noticed that kernel/scs.c seems to kasan_poison and
> unpoison objects, and keeps them poisoned for most of the object
> lifetime.

FWIW, I wouldn't be surprised if other parts of the kernel also ended
up wanting to have in-object redzones eventually - e.g. inside skb
buffers, which have a struct skb_shared_info at the end. AFAIU at the
moment, KASAN can't catch small OOB accesses from these buffers
because of the skb_shared_info structure that follows them.

> I think we better add a kfence guard to
> kasan_poison_shadow() as well.

Sounds good.
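
Just to spell out what I think that would look like - a sketch only,
not the actual mm/kasan/common.c body, just where the guard would sit,
mirroring the unpoison-side check quoted above:

void kasan_poison_shadow(const void *address, size_t size, u8 value)
{
	/* KFENCE objects are skipped entirely, mirroring the unpoison side. */
	if (is_kfence_address(address))
		return;

	address = reset_tag(address);

	/* ... existing shadow-poisoning logic ... */
}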


Re: [PATCH v22 08/12] landlock: Add syscall implementations

2020-10-29 Thread Jann Horn
On Thu, Oct 29, 2020 at 12:30 PM Mickaël Salaün  wrote:
> On 29/10/2020 02:06, Jann Horn wrote:
> > On Tue, Oct 27, 2020 at 9:04 PM Mickaël Salaün  wrote:
> >> These 3 system calls are designed to be used by unprivileged processes
> >> to sandbox themselves:
[...]
> >> +   /*
> >> +* Similar checks as for seccomp(2), except that an -EPERM may be
> >> +* returned.
> >> +*/
> >> +   if (!task_no_new_privs(current)) {
> >> +   err = security_capable(current_cred(), current_user_ns(),
> >> +   CAP_SYS_ADMIN, CAP_OPT_NOAUDIT);
> >
> > I think this should be ns_capable_noaudit(current_user_ns(), CAP_SYS_ADMIN)?
>
> Right. The main difference is that ns_capable*() set PF_SUPERPRIV in
> current->flags. I guess seccomp should use ns_capable_noaudit() as well?

Yeah. That seccomp code is from commit e2cfabdfd0756, with commit date
in April 2012, while ns_capable_noaudit() was introduced in commit
98f368e9e263, with commit date in June 2016; the seccomp code predates
the availability of that API.
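
Sketched out (untested, and assuming we keep returning -EPERM as the
comment above describes), the landlock side would be roughly:

	if (!task_no_new_privs(current)) {
		/*
		 * Unlike a bare security_capable() call, ns_capable_noaudit()
		 * also sets PF_SUPERPRIV on success.
		 */
		if (!ns_capable_noaudit(current_user_ns(), CAP_SYS_ADMIN))
			return -EPERM;
	}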

Do you want to send a patch to Kees for that, or should I?


Re: [PATCH v22 01/12] landlock: Add object management

2020-10-29 Thread Jann Horn
On Thu, Oct 29, 2020 at 10:30 AM Mickaël Salaün  wrote:
> On 29/10/2020 02:05, Jann Horn wrote:
> > On Tue, Oct 27, 2020 at 9:04 PM Mickaël Salaün  wrote:
> >> A Landlock object enables to identify a kernel object (e.g. an inode).
> >> A Landlock rule is a set of access rights allowed on an object.  Rules
> >> are grouped in rulesets that may be tied to a set of processes (i.e.
> >> subjects) to enforce a scoped access-control (i.e. a domain).
[...]
> >> diff --git a/security/landlock/object.c b/security/landlock/object.c
> > [...]
> >> +void landlock_put_object(struct landlock_object *const object)
> >> +{
> >> +   /*
> >> +* The call to @object->underops->release(object) might sleep e.g.,
> >
> > s/ e.g.,/, e.g./
>
> I indeed prefer the comma preceding the "e.g.", but it seems that there
> is a difference between UK english and US english:
> https://english.stackexchange.com/questions/16172/should-i-always-use-a-comma-after-e-g-or-i-e
> Looking at the kernel documentation makes it clear:
> $ git grep -F 'e.g. ' | wc -l
> 1179
> $ git grep -F 'e.g., ' | wc -l
> 160
>
> I'll apply your fix in the whole patch series.

Ooh, sorry. I didn't realize that that's valid in UK English...


Re: [PATCH v6 5/9] mm, kfence: insert KFENCE hooks for SLUB

2020-10-29 Thread Jann Horn
On Thu, Oct 29, 2020 at 2:17 PM Marco Elver  wrote:
> Inserts KFENCE hooks into the SLUB allocator.
>
> To pass the originally requested size to KFENCE, add an argument
> 'orig_size' to slab_alloc*(). The additional argument is required to
> preserve the requested original size for kmalloc() allocations, which
> uses size classes (e.g. an allocation of 272 bytes will return an object
> of size 512). Therefore, kmem_cache::size does not represent the
> kmalloc-caller's requested size, and we must introduce the argument
> 'orig_size' to propagate the originally requested size to KFENCE.
>
> Without the originally requested size, we would not be able to detect
> out-of-bounds accesses for objects placed at the end of a KFENCE object
> page if that object is not equal to the kmalloc-size class it was
> bucketed into.
>
> When KFENCE is disabled, there is no additional overhead, since
> slab_alloc*() functions are __always_inline.
>
> Reviewed-by: Dmitry Vyukov 
> Co-developed-by: Marco Elver 
> Signed-off-by: Marco Elver 
> Signed-off-by: Alexander Potapenko 

Reviewed-by: Jann Horn 

if you fix one nit:

[...]
> diff --git a/mm/slub.c b/mm/slub.c
[...]
> @@ -2658,7 +2664,8 @@ static inline void *get_freelist(struct kmem_cache *s, struct page *page)
>   * already disabled (which is the case for bulk allocation).
>   */
>  static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> - unsigned long addr, struct kmem_cache_cpu *c)
> + unsigned long addr, struct kmem_cache_cpu *c,
> + size_t orig_size)

orig_size is added as a new argument, but never used. (And if you
remove this argument, __slab_alloc will also not be using its
orig_size argument anymore.)



>  {
> void *freelist;
> struct page *page;
> @@ -2763,7 +2770,8 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>   * cpu changes by refetching the per cpu area pointer.
>   */
>  static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> - unsigned long addr, struct kmem_cache_cpu *c)
> + unsigned long addr, struct kmem_cache_cpu *c,
> + size_t orig_size)
>  {
> void *p;
> unsigned long flags;
> @@ -2778,7 +2786,7 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> c = this_cpu_ptr(s->cpu_slab);
>  #endif
>
> -   p = ___slab_alloc(s, gfpflags, node, addr, c);
> +   p = ___slab_alloc(s, gfpflags, node, addr, c, orig_size);
> local_irq_restore(flags);
> return p;
>  }


Re: [PATCH v6 8/9] kfence: add test suite

2020-10-29 Thread Jann Horn
On Thu, Oct 29, 2020 at 2:17 PM Marco Elver  wrote:
> Add KFENCE test suite, testing various error detection scenarios. Makes
> use of KUnit for test organization. Since KFENCE's interface to obtain
> error reports is via the console, the test verifies that KFENCE outputs
> expected reports to the console.
[...]
> diff --git a/mm/kfence/kfence_test.c b/mm/kfence/kfence_test.c
[...]
> +static void *test_alloc(struct kunit *test, size_t size, gfp_t gfp, enum allocation_policy policy)
> +{
> +   void *alloc;
> +   unsigned long timeout, resched_after;
[...]
> +   /*
> +    * 100x the sample interval should be more than enough to ensure we get
> +    * a KFENCE allocation eventually.
> +    */
> +   timeout = jiffies + msecs_to_jiffies(100 * CONFIG_KFENCE_SAMPLE_INTERVAL);
> +   /*
> +    * Especially for non-preemption kernels, ensure the allocation-gate
> +    * timer has time to catch up.
> +    */
> +   resched_after = jiffies + msecs_to_jiffies(CONFIG_KFENCE_SAMPLE_INTERVAL);
> +   do {
[...]
> +   if (time_after(jiffies, resched_after))
> +   cond_resched();

You probably meant to recalculate resched_after after the call to
cond_resched()?

> +   } while (time_before(jiffies, timeout));
> +
> +   KUNIT_ASSERT_TRUE_MSG(test, false, "failed to allocate from KFENCE");
> +   return NULL; /* Unreachable. */
> +}
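
I.e. roughly the following shape for the loop above (just a sketch,
with the actual allocation attempt elided):

	do {
		/* ... try the allocation, return if it is a KFENCE object ... */

		if (time_after(jiffies, resched_after)) {
			cond_resched();
			/* refresh the grace period now that we actually slept */
			resched_after = jiffies +
				msecs_to_jiffies(CONFIG_KFENCE_SAMPLE_INTERVAL);
		}
	} while (time_before(jiffies, timeout));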
[...]
> +/*
> + * KFENCE is unable to detect an OOB if the allocation's alignment requirements
> + * leave a gap between the object and the guard page. Specifically, an
> + * allocation of e.g. 73 bytes is aligned on 8 and 128 bytes for SLUB or SLAB
> + * respectively. Therefore it is impossible for the allocated object to adhere
> + * to either of the page boundaries.

Should this be "to the left page boundary" instead of "to either of
the page boundaries"?

> + * However, we test that an access to memory beyond the gap result in KFENCE

*results



> + * detecting an OOB access.
> + */
> +static void test_kmalloc_aligned_oob_read(struct kunit *test)
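
For readers skimming along, the idea of the test is roughly this
(illustrative only; the allocation_policy constant name is made up here
and the real test body is different):

	/* 73 is deliberately not a multiple of any relevant alignment. */
	char *buf = test_alloc(test, 73, GFP_KERNEL, ALLOCATE_RIGHT /* hypothetical name */);

	/* The object cannot end flush against the guard page; reading past
	 * any possible alignment gap (at most 127 bytes here) must still
	 * land in the guard page and be reported. */
	READ_ONCE(buf[73 + 127]);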


Re: [PATCH v6 3/9] arm64, kfence: enable KFENCE for ARM64

2020-10-29 Thread Jann Horn
On Thu, Oct 29, 2020 at 2:17 PM Marco Elver  wrote:
> Add architecture specific implementation details for KFENCE and enable
> KFENCE for the arm64 architecture. In particular, this implements the
> required interface in <asm/kfence.h>.
>
> KFENCE requires that attributes for pages from its memory pool can
> individually be set. Therefore, force the entire linear map to be mapped
> at page granularity. Doing so may result in extra memory allocated for
> page tables in case rodata=full is not set; however, currently
> CONFIG_RODATA_FULL_DEFAULT_ENABLED=y is the default, and the common case
> is therefore not affected by this change.
[...]
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
[...]
> +   select HAVE_ARCH_KFENCE if (!ARM64_16K_PAGES && !ARM64_64K_PAGES)

"if ARM64_4K_PAGES"?

[...]
> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
[...]
> @@ -312,6 +313,9 @@ static void __do_kernel_fault(unsigned long addr, unsigned int esr,
> "Ignoring spurious kernel translation fault at virtual address %016lx\n", addr))
> return;
>
> +   if (kfence_handle_page_fault(addr))
> +   return;

As in the X86 case, we may want to ensure that this doesn't run for
permission faults, only for non-present pages. Maybe move this down
into the third branch of the "if" block below (neither permission
fault nor NULL deref)?



> +
> if (is_el1_permission_fault(addr, esr, regs)) {
> if (esr & ESR_ELx_WNR)
> msg = "write to read-only memory";


Re: [PATCH v6 7/9] kfence, Documentation: add KFENCE documentation

2020-10-29 Thread Jann Horn
On Thu, Oct 29, 2020 at 2:17 PM Marco Elver  wrote:
> Add KFENCE documentation in dev-tools/kfence.rst, and add to index.
[...]
> +The KFENCE memory pool is of fixed size, and if the pool is exhausted, no
> +further KFENCE allocations occur. With ``CONFIG_KFENCE_NUM_OBJECTS`` (default
> +255), the number of available guarded objects can be controlled. Each object
> +requires 2 pages, one for the object itself and the other one used as a guard
> +page; object pages are interleaved with guard pages, and every object page is
> +therefore surrounded by two guard pages.
> +
> +The total memory dedicated to the KFENCE memory pool can be computed as::
> +
> +( #objects + 1 ) * 2 * PAGE_SIZE

Plus memory overhead from shattered hugepages. With the default object
count, on x86, we allocate 2MiB of memory pool, but if we have to
shatter a 2MiB hugepage for that, we may cause the allocation of one
extra page table, or 4KiB. Of course that's pretty much negligible.
But on arm64 it's worse, because there we have to disable hugepages in
the linear map completely. So on a device with 4GiB memory, we might
end up with something on the order of 4GiB/2MiB * 0x1000 bytes = 8MiB
of extra L1 page tables that wouldn't have been needed otherwise -
significantly more than the default memory pool size.

If the memory overhead is documented, this detail should probably be
documented, too.
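
For reference, the arithmetic above as a trivial standalone snippet
(assuming 4 KiB pages and the default 255 objects):

#include <stdio.h>

int main(void)
{
	unsigned long long page = 4096, objects = 255;
	unsigned long long pool = (objects + 1) * 2 * page;	/* KFENCE pool */
	unsigned long long ram = 4ULL << 30;			/* 4 GiB of RAM */
	unsigned long long block = 2ULL << 20;			/* 2 MiB leaf block */
	unsigned long long tables = (ram / block) * page;	/* extra page tables */

	/* prints: pool = 2048 KiB, extra page tables = 8192 KiB */
	printf("pool = %llu KiB, extra page tables = %llu KiB\n",
	       pool >> 10, tables >> 10);
	return 0;
}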

> +Using the default config, and assuming a page size of 4 KiB, results in
> +dedicating 2 MiB to the KFENCE memory pool.
[...]
> +For such errors, the address where the corruption as well as the invalidly

nit: "the address where the corruption occurred" or "the address of
the corruption"

> +written bytes (offset from the address) are shown; in this representation, '.'
> +denote untouched bytes. In the example above ``0xac`` is the value written to
> +the invalid address at offset 0, and the remaining '.' denote that no following
> +bytes have been touched. Note that, real values are only shown for
> +``CONFIG_DEBUG_KERNEL=y`` builds; to avoid information disclosure for non-debug
> +builds, '!' is used instead to denote invalidly written bytes.
[...]
> +KFENCE objects each reside on a dedicated page, at either the left or right
> +page boundaries selected at random. The pages to the left and right of the
> +object page are "guard pages", whose attributes are changed to a protected
> +state, and cause page faults on any attempted access. Such page faults are then
> +intercepted by KFENCE, which handles the fault gracefully by reporting an
> +out-of-bounds access.

... and marking the page as accessible so that the faulting code can
continue (wrongly) executing.
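
Roughly, the sequence is (a sketch with made-up helper names; the real
code in mm/kfence/ is organized differently):

bool kfence_handle_page_fault_sketch(unsigned long addr)
{
	if (!is_kfence_address((void *)addr))
		return false;	/* not a KFENCE guard page, normal fault path */

	report_oob_access(addr);	/* made-up reporting helper */

	/* Make the guard page accessible again so the faulting instruction
	 * can be retried and the caller continues (wrongly) executing. */
	unprotect_kfence_page(addr & PAGE_MASK);	/* made-up helper */
	return true;
}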


[...]
> +Interface
> +---------
> +
> +The following describes the functions which are used by allocators as well page

nit: "as well as"?



> +handling code to set up and deal with KFENCE allocations.


Re: [PATCH v6 9/9] MAINTAINERS: Add entry for KFENCE

2020-10-29 Thread Jann Horn
On Thu, Oct 29, 2020 at 2:17 PM Marco Elver  wrote:
> Add entry for KFENCE maintainers.
>
> Reviewed-by: Dmitry Vyukov 
> Reviewed-by: SeongJae Park 
> Co-developed-by: Alexander Potapenko 
> Signed-off-by: Alexander Potapenko 
> Signed-off-by: Marco Elver 
[...]
> diff --git a/MAINTAINERS b/MAINTAINERS
[...]
> +KFENCE
> +M: Alexander Potapenko 
> +M: Marco Elver 
> +R: Dmitry Vyukov 
> +L: kasan-...@googlegroups.com
> +S: Maintained
> +F: Documentation/dev-tools/kfence.rst
> +F: include/linux/kfence.h
> +F: lib/Kconfig.kfence
> +F: mm/kfence/

Plus arch/*/include/asm/kfence.h?

