Re: BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-11-04 Thread Dave Martin
On Thu, Oct 29, 2020 at 11:02:22AM +, Catalin Marinas via Libc-alpha wrote:
> On Tue, Oct 27, 2020 at 02:15:22PM +, Dave P Martin wrote:
> > I also wonder whether we actually care whether the pages are marked
> > executable or not here; probably the flags can just be independent.  This
> > rather depends on how the architecture treats the BTI (a.k.a.
> > GP) pagetable bit for non-executable pages.  I have a feeling we already
> > allow PROT_BTI && !PROT_EXEC through anyway.
> > 
> > 
> > What about a generic-ish set/clear interface that still works by just
> > adding a couple of PROT_ flags:
> > 
> >   switch (flags & (PROT_SET | PROT_CLEAR)) {
> >   case PROT_SET: prot |= flags; break;
> >   case PROT_CLEAR: prot &= ~flags; break;
> >   case 0: prot = flags; break;
> >
> >   default:
> >       return -EINVAL;
> >   }
> > 
> > This can't atomically set some flags while clearing some others, but for
> > simple stuff it seems sufficient and shouldn't be too invasive on the
> > kernel side.
> > 
> > We will still have to take the mm lock when doing a SET or CLEAR, but
> > not for the non-set/clear case.
> > 
> > 
> > Anyway, libc could now do:
> > 
> > mprotect(addr, len, PROT_SET | PROT_BTI);
> > 
> > with much the same effect as your PROT_BTI_IF_X.
> > 
> > 
> > JITting or breakpoint setting code that wants to change the permissions
> > temporarily, without needing to know whether PROT_BTI is set, say:
> > 
> > mprotect(addr, len, PROT_SET | PROT_WRITE);
> > *addr = BKPT_INSN;
> > mprotect(addr, len, PROT_CLEAR | PROT_WRITE);
> 
> The problem with this approach is that you can't catch
> PROT_EXEC|PROT_WRITE mappings via seccomp. So you'd have to limit it to
> some harmless PROT_ flags only. I don't like this limitation, nor the
> PROT_BTI_IF_X approach.

Ack; this is just one flavour of interface, and every approach seems to
have some shortcomings.

> The only generic solutions I see are to either use a stateful filter in
> systemd or pass the old state to the kernel in a cmpxchg style so that
> seccomp can check it (I think you suggest this at some point).

The "cmpxchg" option has the disadvantage that the caller needs to know
the original permissions.  It seems that glibc is prepared to work
around this, but it won't always be feasible in ancillary /
instrumentation code or libraries.

IMHO it would be preferable to apply a policy to mmap/mprotect in the
kernel proper rather than BPF being the only way to do it -- in any
case, the required checks seem to be out of the scope of what can be
done efficiently (or perhaps at all) in a syscall filter.

> The latter requires a new syscall which is not something we can address
> as a quick, back-portable fix here. If systemd cannot be changed to use
> a stateful filter for w^x detection, my suggestion is to go for the
> kernel setting PROT_BTI on the main executable with glibc changed to
> tolerate EPERM on mprotect(). I don't mind adding an AT_FLAGS bit if
> needed but I don't think it buys us much.

I agree, this seems the best short-term approach.

> Once the current problem is fixed, we can look at a better solution
> longer term as a new syscall.

Agreed, I think if we try to rush the addition of new syscalls, the
chance of coming up with a bad design is high...

Cheers
---Dave


Re: BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-10-29 Thread Catalin Marinas
On Tue, Oct 27, 2020 at 02:15:22PM +, Dave P Martin wrote:
> I also wonder whether we actually care whether the pages are marked
> executable or not here; probably the flags can just be independent.  This
> rather depends on how the architecture treats the BTI (a.k.a.
> GP) pagetable bit for non-executable pages.  I have a feeling we already
> allow PROT_BTI && !PROT_EXEC through anyway.
> 
> 
> What about a generic-ish set/clear interface that still works by just
> adding a couple of PROT_ flags:
> 
>   switch (flags & (PROT_SET | PROT_CLEAR)) {
>   case PROT_SET: prot |= flags; break;
>   case PROT_CLEAR: prot &= ~flags; break;
>   case 0: prot = flags; break;
> 
>   default:
>   return -EINVAL;
>   }
> 
> This can't atomically set some flags while clearing some others, but for
> simple stuff it seems sufficient and shouldn't be too invasive on the
> kernel side.
> 
> We will still have to take the mm lock when doing a SET or CLEAR, but
> not for the non-set/clear case.
> 
> 
> Anyway, libc could now do:
> 
>   mprotect(addr, len, PROT_SET | PROT_BTI);
> 
> with much the same effect as your PROT_BTI_IF_X.
> 
> 
> JITting or breakpoint setting code that wants to change the permissions
> temporarily, without needing to know whether PROT_BTI is set, say:
> 
>   mprotect(addr, len, PROT_SET | PROT_WRITE);
>   *addr = BKPT_INSN;
>   mprotect(addr, len, PROT_CLEAR | PROT_WRITE);

The problem with this approach is that you can't catch
PROT_EXEC|PROT_WRITE mappings via seccomp. So you'd have to limit it to
some harmless PROT_ flags only. I don't like this limitation, nor the
PROT_BTI_IF_X approach.

The only generic solutions I see are to either use a stateful filter in
systemd or pass the old state to the kernel in a cmpxchg style so that
seccomp can check it (I think you suggest this at some point).

The latter requires a new syscall which is not something we can address
as a quick, back-portable fix here. If systemd cannot be changed to use
a stateful filter for w^x detection, my suggestion is to go for the
kernel setting PROT_BTI on the main executable with glibc changed to
tolerate EPERM on mprotect(). I don't mind adding an AT_FLAGS bit if
needed but I don't think it buys us much.
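
For illustration only, a minimal C sketch of what "tolerate EPERM on
mprotect()" could look like in a loader. The names enable_bti() and
bti_failure_is_fatal() are hypothetical and are not glibc functions; the
fallback PROT_BTI value of 0x10 is the arm64 one. The sketch assumes the
filter returns an error rather than killing the process.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef PROT_BTI
#define PROT_BTI 0x10 /* arm64 value; fallback for headers that lack it */
#endif

/* Hypothetical policy helper: only EPERM (for example from a seccomp
 * filter that returns an error) means "carry on without BTI". */
bool bti_failure_is_fatal(int err)
{
    return err != EPERM;
}

/* Hypothetical loader step: try to add PROT_BTI to an existing mapping
 * whose current protection is 'prot'. Returns 0 if loading can continue. */
int enable_bti(void *addr, size_t len, int prot)
{
    if (mprotect(addr, len, prot | PROT_BTI) == 0)
        return 0;
    return bti_failure_is_fatal(errno) ? -1 : 0;
}
```

With this shape, a service under MemoryDenyWriteExecute would keep
running without BTI on the affected pages, while any other mprotect()
failure is still reported to the caller.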

Once the current problem is fixed, we can look at a better solution
longer term as a new syscall.

-- 
Catalin


Re: BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-10-27 Thread Dave Martin
On Mon, Oct 26, 2020 at 05:39:42PM -0500, Jeremy Linton via Libc-alpha wrote:
> Hi,
> 
> On 10/26/20 12:52 PM, Dave Martin wrote:
> >On Mon, Oct 26, 2020 at 04:57:55PM +, Szabolcs Nagy via Libc-alpha wrote:
> >>The 10/26/2020 16:24, Dave Martin via Libc-alpha wrote:
> >>>Unrolling this discussion a bit, this problem comes from a few sources:
> >>>
> >>>1) systemd is trying to implement a policy that doesn't fit SECCOMP
> >>>syscall filtering very well.
> >>>
> >>>2) The program is trying to do something not expressible through the
> >>>syscall interface: really the intent is to set PROT_BTI on the page,
> >>>with no intent to set PROT_EXEC on any page that didn't already have it
> >>>set.
> >>>
> >>>
> >>>This limitation of mprotect() was known when I originally added PROT_BTI,
> >>>but at that time we weren't aware of a clear use case that would fail.
> >>>
> >>>
> >>>Would it now help to add something like:
> >>>
> >>>int mchangeprot(void *addr, size_t len, int old_flags, int new_flags)
> >>>{
> >>>   int ret = -EINVAL;
> >>>   mmap_write_lock(current->mm);
> >>>   if (all vmas in [addr .. addr + len) have
> >>>   their mprotect flags set to old_flags) {
> >>>
> >>>   ret = mprotect(addr, len, new_flags);
> >>>   }
> >>>   
> >>>   mmap_write_unlock(current->mm);
> >>>   return ret;
> >>>}
> >>
> >>if more prot flags are introduced then the exact
> >>match for old_flags may be restrictive and currently
> >>there is no way to query these flags to figure out
> >>how to toggle one prot flag in a future proof way,
> >>so i don't think this solves the issue completely.
> >
> >Ack -- I illustrated this model because it makes the seccomp filter's
> >job easy, but it does have limitations.
> >
> >>i think we might need a new api, given that aarch64
> >>now has PROT_BTI and PROT_MTE while existing code
> >>expects RWX only, but i don't know what api is best.
> >
> >An alternative option would be a call that sets / clears chosen
> >flags and leaves others unchanged.
> 
> I tend to favor a set/clear API, but that could also just be done by
> creating a new PROT_BTI_IF_X which enables BTI for areas already set to
> _EXEC. That goes right by the seccomp filters too, and actually is closer to
> what glibc wants to do anyway.

That works, though I'm not so keen on treating PROT_BTI as a special case,
since the problem is likely to recur when other weird per-arch flags get
added...

I also wonder whether we actually care whether the pages are marked
executable or not here; probably the flags can just be independent.  This
rather depends on how the architecture treats the BTI (a.k.a.
GP) pagetable bit for non-executable pages.  I have a feeling we already
allow PROT_BTI && !PROT_EXEC through anyway.


What about a generic-ish set/clear interface that still works by just
adding a couple of PROT_ flags:

  switch (flags & (PROT_SET | PROT_CLEAR)) {
  case PROT_SET: prot |= flags; break;
  case PROT_CLEAR: prot &= ~flags; break;
  case 0: prot = flags; break;

  default:
      return -EINVAL;
  }

This can't atomically set some flags while clearing some others, but for
simple stuff it seems sufficient and shouldn't be too invasive on the
kernel side.

We will still have to take the mm lock when doing a SET or CLEAR, but
not for the non-set/clear case.


Anyway, libc could now do:

mprotect(addr, len, PROT_SET | PROT_BTI);

with much the same effect as your PROT_BTI_IF_X.


JITting or breakpoint setting code that wants to change the permissions
temporarily, without needing to know whether PROT_BTI is set, say:

mprotect(addr, len, PROT_SET | PROT_WRITE);
*addr = BKPT_INSN;
mprotect(addr, len, PROT_CLEAR | PROT_WRITE);
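
As a rough illustration, the set/clear semantics above can be modelled in
userspace as a pure function. PROT_SET and PROT_CLEAR do not exist in any
kernel; the names PROT_SET_HYP, PROT_CLEAR_HYP and apply_prot_flags() and
the bit values are assumptions made for this sketch. Unlike the
pseudocode, the sketch masks the mode bits out before applying them.

```c
#include <errno.h>
#include <sys/mman.h>

/* Hypothetical mode bits, chosen only to stay clear of the real PROT_*
 * values; no kernel defines them. */
#define PROT_SET_HYP   0x10000000
#define PROT_CLEAR_HYP 0x20000000

/* Return the protection that results from applying 'flags' to the
 * current protection 'prot', or -EINVAL if both modes are requested. */
int apply_prot_flags(int prot, int flags)
{
    int bits = flags & ~(PROT_SET_HYP | PROT_CLEAR_HYP);

    switch (flags & (PROT_SET_HYP | PROT_CLEAR_HYP)) {
    case PROT_SET_HYP:
        return prot | bits;   /* add the named bits, keep the rest */
    case PROT_CLEAR_HYP:
        return prot & ~bits;  /* drop the named bits, keep the rest */
    case 0:
        return bits;          /* classic mprotect(): replace everything */
    default:
        return -EINVAL;       /* SET and CLEAR together */
    }
}
```

In this model the breakpoint sequence never has to name PROT_BTI: setting
and then clearing PROT_WRITE leaves every other bit as it was.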


Thoughts?

I won't claim this doesn't still have some limitations...

Cheers
---Dave


Re: BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-10-27 Thread Florian Weimer
* Dave Martin via Libc-alpha:

> On Mon, Oct 26, 2020 at 05:45:42PM +0100, Florian Weimer via Libc-alpha wrote:
>> * Dave Martin via Libc-alpha:
>> 
>> > Would it now help to add something like:
>> >
>> > int mchangeprot(void *addr, size_t len, int old_flags, int new_flags)
>> > {
>> >   int ret = -EINVAL;
>> >   mmap_write_lock(current->mm);
>> >   if (all vmas in [addr .. addr + len) have
>> >       their mprotect flags set to old_flags) {
>> >
>> >       ret = mprotect(addr, len, new_flags);
>> >   }
>> >
>> >   mmap_write_unlock(current->mm);
>> >   return ret;
>> > }
>> 
>> I suggested something similar as well.  Ideally, the interface would
>> subsume pkey_mprotect, though, and have a separate flags argument from
>> the protection flags.  But then we run into argument list length limits.
>>
>> Thanks,
>> Florian
>
> I suppose.  Assuming that a syscall filter can inspect memory, we might
> be able to bundle arguments into a struct if necessary.

But that leads to a discussion about batch mmap/mprotect/munmap, and
that's again incompatible with seccomp (it would need a checking loop).

Thanks,
Florian
-- 
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill



Re: BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-10-27 Thread Dave Martin
On Mon, Oct 26, 2020 at 05:45:42PM +0100, Florian Weimer via Libc-alpha wrote:
> * Dave Martin via Libc-alpha:
> 
> > Would it now help to add something like:
> >
> > int mchangeprot(void *addr, size_t len, int old_flags, int new_flags)
> > {
> >     int ret = -EINVAL;
> >     mmap_write_lock(current->mm);
> >     if (all vmas in [addr .. addr + len) have
> >         their mprotect flags set to old_flags) {
> >
> >         ret = mprotect(addr, len, new_flags);
> >     }
> >
> >     mmap_write_unlock(current->mm);
> >     return ret;
> > }
> 
> I suggested something similar as well.  Ideally, the interface would
> subsume pkey_mprotect, though, and have a separate flags argument from
> the protection flags.  But then we run into argument list length limits.
>
> Thanks,
> Florian

I suppose.  Assuming that a syscall filter can inspect memory, we might
be able to bundle arguments into a struct if necessary.

[...]

Cheers
---Dave


Re: BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-10-26 Thread Jeremy Linton

Hi,

On 10/26/20 12:52 PM, Dave Martin wrote:

> On Mon, Oct 26, 2020 at 04:57:55PM +, Szabolcs Nagy via Libc-alpha wrote:
>> The 10/26/2020 16:24, Dave Martin via Libc-alpha wrote:
>>> Unrolling this discussion a bit, this problem comes from a few sources:
>>>
>>> 1) systemd is trying to implement a policy that doesn't fit SECCOMP
>>> syscall filtering very well.
>>>
>>> 2) The program is trying to do something not expressible through the
>>> syscall interface: really the intent is to set PROT_BTI on the page,
>>> with no intent to set PROT_EXEC on any page that didn't already have it
>>> set.
>>>
>>>
>>> This limitation of mprotect() was known when I originally added PROT_BTI,
>>> but at that time we weren't aware of a clear use case that would fail.
>>>
>>>
>>> Would it now help to add something like:
>>>
>>> int mchangeprot(void *addr, size_t len, int old_flags, int new_flags)
>>> {
>>>     int ret = -EINVAL;
>>>     mmap_write_lock(current->mm);
>>>     if (all vmas in [addr .. addr + len) have
>>>         their mprotect flags set to old_flags) {
>>>
>>>         ret = mprotect(addr, len, new_flags);
>>>     }
>>>
>>>     mmap_write_unlock(current->mm);
>>>     return ret;
>>> }
>>
>> if more prot flags are introduced then the exact
>> match for old_flags may be restrictive and currently
>> there is no way to query these flags to figure out
>> how to toggle one prot flag in a future proof way,
>> so i don't think this solves the issue completely.
>
> Ack -- I illustrated this model because it makes the seccomp filter's
> job easy, but it does have limitations.
>
>> i think we might need a new api, given that aarch64
>> now has PROT_BTI and PROT_MTE while existing code
>> expects RWX only, but i don't know what api is best.
>
> An alternative option would be a call that sets / clears chosen
> flags and leaves others unchanged.

I tend to favor a set/clear API, but that could also just be done by
creating a new PROT_BTI_IF_X which enables BTI for areas already set to
_EXEC. That goes right by the seccomp filters too, and actually is
closer to what glibc wants to do anyway.

> The trouble with that is that the MDWX policy then becomes hard to
> implement again.
>
>
> But policies might be best set via another route, such as a prctl,
> rather than being implemented completely in a seccomp filter.
>
> Cheers
> ---Dave





Re: BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-10-26 Thread Dave Martin
On Mon, Oct 26, 2020 at 04:57:55PM +, Szabolcs Nagy via Libc-alpha wrote:
> The 10/26/2020 16:24, Dave Martin via Libc-alpha wrote:
> > Unrolling this discussion a bit, this problem comes from a few sources:
> > 
> > 1) systemd is trying to implement a policy that doesn't fit SECCOMP
> > syscall filtering very well.
> > 
> > 2) The program is trying to do something not expressible through the
> > syscall interface: really the intent is to set PROT_BTI on the page,
> > with no intent to set PROT_EXEC on any page that didn't already have it
> > set.
> > 
> > 
> > This limitation of mprotect() was known when I originally added PROT_BTI,
> > but at that time we weren't aware of a clear use case that would fail.
> > 
> > 
> > Would it now help to add something like:
> > 
> > int mchangeprot(void *addr, size_t len, int old_flags, int new_flags)
> > {
> >     int ret = -EINVAL;
> >     mmap_write_lock(current->mm);
> >     if (all vmas in [addr .. addr + len) have
> >         their mprotect flags set to old_flags) {
> >
> >         ret = mprotect(addr, len, new_flags);
> >     }
> >
> >     mmap_write_unlock(current->mm);
> >     return ret;
> > }
> 
> if more prot flags are introduced then the exact
> match for old_flags may be restrictive and currently
> there is no way to query these flags to figure out
> how to toggle one prot flag in a future proof way,
> so i don't think this solves the issue completely.

Ack -- I illustrated this model because it makes the seccomp filter's
job easy, but it does have limitations.

> i think we might need a new api, given that aarch64
> now has PROT_BTI and PROT_MTE while existing code
> expects RWX only, but i don't know what api is best.

An alternative option would be a call that sets / clears chosen
flags and leaves others unchanged.

The trouble with that is that the MDWX policy then becomes hard to
implement again.


But policies might be best set via another route, such as a prctl,
rather than being implemented completely in a seccomp filter.

Cheers
---Dave


Re: BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-10-26 Thread Szabolcs Nagy
The 10/26/2020 16:24, Dave Martin via Libc-alpha wrote:
> Unrolling this discussion a bit, this problem comes from a few sources:
> 
> 1) systemd is trying to implement a policy that doesn't fit SECCOMP
> syscall filtering very well.
> 
> 2) The program is trying to do something not expressible through the
> syscall interface: really the intent is to set PROT_BTI on the page,
> with no intent to set PROT_EXEC on any page that didn't already have it
> set.
> 
> 
> This limitation of mprotect() was known when I originally added PROT_BTI,
> but at that time we weren't aware of a clear use case that would fail.
> 
> 
> Would it now help to add something like:
> 
> int mchangeprot(void *addr, size_t len, int old_flags, int new_flags)
> {
>   int ret = -EINVAL;
>   mmap_write_lock(current->mm);
>   if (all vmas in [addr .. addr + len) have
>   their mprotect flags set to old_flags) {
> 
>   ret = mprotect(addr, len, new_flags);
>   }
>   
>   mmap_write_unlock(current->mm);
>   return ret;
> }

if more prot flags are introduced then the exact
match for old_flags may be restrictive and currently
there is no way to query these flags to figure out
how to toggle one prot flag in a future proof way,
so i don't think this solves the issue completely.

i think we might need a new api, given that aarch64
now has PROT_BTI and PROT_MTE while existing code
expects RWX only, but i don't know what api is best.
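
A small userspace model may make the exact-match concern concrete. This
is only a sketch: mchangeprot_model() is a hypothetical stand-in for the
proposed call, 'cur' stands in for the protection recorded on the
mapping, and the fallback PROT_BTI / PROT_MTE values are the arm64 ones.

```c
#include <errno.h>
#include <sys/mman.h>

#ifndef PROT_BTI
#define PROT_BTI 0x10 /* arm64 value */
#endif
#ifndef PROT_MTE
#define PROT_MTE 0x20 /* arm64 value */
#endif

/* Model of the proposed exact-match semantics: the change is applied
 * only if the caller's idea of the old protection is exactly right. */
int mchangeprot_model(int *cur, int old_flags, int new_flags)
{
    if (*cur != old_flags)
        return -EINVAL;
    *cur = new_flags;
    return 0;
}
```

A caller that only knows about RWX passes PROT_READ | PROT_EXEC as
old_flags; if the mapping also carries PROT_MTE, the call fails even
though the caller only wanted to add one bit.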

> libc would now be able to do
> 
>   mchangeprot(addr, len, PROT_EXEC | PROT_READ,
>   PROT_EXEC | PROT_READ | PROT_BTI);
> 
> while systemd's MDWX filter would reject the call if
> 
>   (new_flags & PROT_EXEC) &&
>   (!(old_flags & PROT_EXEC) || (new_flags & PROT_WRITE))
> 
> 
> 
> This won't magically fix current code, but something along these lines
> might be better going forward.
> 
> 
> Thoughts?
> 
> ---Dave


Re: BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-10-26 Thread Mark Brown
On Mon, Oct 26, 2020 at 03:56:35PM +, Dave Martin wrote:
> On Mon, Oct 26, 2020 at 02:52:46PM +, Catalin Marinas via Libc-alpha wrote:

> > Now, if the dynamic loader silently ignores the mprotect() failure on
> > the main executable, is there much value in exposing a flag in the aux
> > vectors? It saves a few (one?) mprotect() calls but I don't think it
> > matters much. Anyway, I don't mind the flag.

> I don't see a problem with the aforementioned patch [2] to pre-set BTI
> on the pages of the main binary.

Me either FWIW.




Re: BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-10-26 Thread Florian Weimer
* Dave Martin via Libc-alpha:

> Would it now help to add something like:
>
> int mchangeprot(void *addr, size_t len, int old_flags, int new_flags)
> {
>   int ret = -EINVAL;
>   mmap_write_lock(current->mm);
>   if (all vmas in [addr .. addr + len) have
>   their mprotect flags set to old_flags) {
>
>   ret = mprotect(addr, len, new_flags);
>   }
>   
>   mmap_write_unlock(current->mm);
>   return ret;
> }

I suggested something similar as well.  Ideally, the interface would
subsume pkey_mprotect, though, and have a separate flags argument from
the protection flags.  But then we run into argument list length limits.

Thanks,
Florian



Re: BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-10-26 Thread Topi Miettinen

On 26.10.2020 18.24, Dave Martin wrote:

> On Wed, Oct 21, 2020 at 10:44:46PM -0500, Jeremy Linton via Libc-alpha wrote:
>> Hi,
>>
>> There is a problem with glibc+systemd on BTI enabled systems. Systemd
>> has a service flag "MemoryDenyWriteExecute" which uses seccomp to deny
>> PROT_EXEC changes. Glibc enables BTI only on segments which are marked as
>> being BTI compatible by calling mprotect PROT_EXEC|PROT_BTI. That call is
>> caught by the seccomp filter, resulting in service failures.
>>
>> So, at the moment one has to pick either denying PROT_EXEC changes, or BTI.
>> This is obviously not desirable.
>>
>> Various changes have been suggested, replacing the mprotect with mmap calls
>> having PROT_BTI set on the original mapping, re-mmapping the segments,
>> implying PROT_EXEC on mprotect PROT_BTI calls when VM_EXEC is already set,
>> and various modification to seccomp to allow particular mprotect cases to
>> bypass the filters. In each case there seems to be an undesirable attribute
>> to the solution.
>>
>> So, what's the best solution?
>
> Unrolling this discussion a bit, this problem comes from a few sources:
>
> 1) systemd is trying to implement a policy that doesn't fit SECCOMP
> syscall filtering very well.
>
> 2) The program is trying to do something not expressible through the
> syscall interface: really the intent is to set PROT_BTI on the page,
> with no intent to set PROT_EXEC on any page that didn't already have it
> set.
>
>
> This limitation of mprotect() was known when I originally added PROT_BTI,
> but at that time we weren't aware of a clear use case that would fail.
>
>
> Would it now help to add something like:
>
> int mchangeprot(void *addr, size_t len, int old_flags, int new_flags)
> {
>     int ret = -EINVAL;
>     mmap_write_lock(current->mm);
>     if (all vmas in [addr .. addr + len) have
>         their mprotect flags set to old_flags) {
>
>         ret = mprotect(addr, len, new_flags);
>     }
>
>     mmap_write_unlock(current->mm);
>     return ret;
> }
>
>
> libc would now be able to do
>
>     mchangeprot(addr, len, PROT_EXEC | PROT_READ,
>                 PROT_EXEC | PROT_READ | PROT_BTI);
>
> while systemd's MDWX filter would reject the call if
>
>     (new_flags & PROT_EXEC) &&
>     (!(old_flags & PROT_EXEC) || (new_flags & PROT_WRITE))
>
>
>
> This won't magically fix current code, but something along these lines
> might be better going forward.
>
>
> Thoughts?


Looks good to me.

-Topi



Re: BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-10-26 Thread Topi Miettinen

On 26.10.2020 16.52, Catalin Marinas wrote:

> On Sat, Oct 24, 2020 at 02:01:30PM +0300, Topi Miettinen wrote:
>> On 23.10.2020 12.02, Catalin Marinas wrote:
>>> On Thu, Oct 22, 2020 at 01:02:18PM -0700, Kees Cook wrote:
>>>> Regardless, it makes sense to me to have the kernel load the executable
>>>> itself with BTI enabled by default. I prefer gaining Catalin's suggested
>>>> patch[2]. :)
>>> [...]
>>>> [2] https://lore.kernel.org/linux-arm-kernel/20201022093104.GB1229@gaia/
>>>
>>> I think I first heard the idea at Mark R ;).
>>>
>>> It still needs glibc changes to avoid the mprotect(), or at least ignore
>>> the error. Since this is an ABI change and we don't know which kernels
>>> would have it backported, maybe better to still issue the mprotect() but
>>> ignore the failure.
>>
>> What about kernel adding an auxiliary vector as a flag to indicate that BTI
>> is supported and recommended by the kernel? Then dynamic loader could use
>> that to detect that a) the main executable is BTI protected and there's no
>> need to mprotect() it and b) PROT_BTI flag should be added to all PROT_EXEC
>> pages.
>
> We could add a bit to AT_FLAGS, it's always been 0 for Linux.

Great!

>> In absence of the vector, the dynamic loader might choose to skip doing
>> PROT_BTI at all (since the main executable isn't protected anyway either, or
>> maybe even the kernel is up-to-date but it knows that it's not recommended
>> for some reason, or maybe the kernel is so ancient that it doesn't know
>> about BTI). Optionally it could still read the flag from ELF later (for
>> compatibility with old kernels) and then do the mprotect() dance, which may
>> trip seccomp filters, possibly fatally.
>
> I think the safest is for the dynamic loader to issue an mprotect() and
> ignore the EPERM error. Not all user deployments have this seccomp
> filter, so they can still benefit, and user can't tell whether the
> kernel change has been backported.


But the seccomp filter can be set to kill the process, so that's 
definitely not the safest way. I think safest is that when the AT_FLAGS 
bit is seen, ld.so doesn't do any mprotect() calls but instead when 
mapping the segments, mmap() flags are adjusted to include PROT_BTI, so 
mprotect() calls are not necessary. If there's no seccomp filter, 
there's no disadvantage for avoiding the useless mprotect() calls.


I'd expect the backported kernel change to include both aux vector and 
also using PROT_BTI for the main executable. Then the logic would work 
with backported kernels as well.
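
A sketch of that loader logic, under loud assumptions: no AT_FLAGS bit
for BTI is defined by Linux, so AT_FLAGS_BTI_HYP and its value are
invented here, and segment_prot() / segment_prot_from_auxv() are
hypothetical helper names rather than ld.so functions. The idea is to
compute the protection passed to mmap() up front so that no later
mprotect() is needed.

```c
#include <elf.h>
#include <stdbool.h>
#include <sys/auxv.h>
#include <sys/mman.h>

#ifndef PROT_BTI
#define PROT_BTI 0x10 /* arm64 value; fallback for headers that lack it */
#endif

/* Hypothetical bit: Linux does not define any AT_FLAGS bits today. */
#define AT_FLAGS_BTI_HYP 0x1UL

/* Protection to use when mapping a segment. PROT_BTI is added only when
 * the kernel advertises the (hypothetical) flag, the ELF image is marked
 * BTI-compatible, and the segment is executable. */
int segment_prot(unsigned long at_flags, bool image_is_bti, int prot)
{
    if ((at_flags & AT_FLAGS_BTI_HYP) && image_is_bti && (prot & PROT_EXEC))
        return prot | PROT_BTI;
    return prot;
}

/* Same decision, reading AT_FLAGS from the real auxiliary vector. */
int segment_prot_from_auxv(bool image_is_bti, int prot)
{
    return segment_prot(getauxval(AT_FLAGS), image_is_bti, prot);
}
```

Because the decision is made before mmap(), a seccomp filter that only
watches mprotect() never sees a PROT_EXEC request from the loader.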


If there's no aux vector, all bets are off. The kernel could be old and 
unpatched, even so old that PROT_BTI is not known. Perhaps also in the 
future there may be new technologies which have replaced BTI and the 
kernel could want a previous generation ld.so not to try to use BTI, so 
this could be also indicated with the lack of aux vector. The dynamic 
loader could still attempt to mprotect() the pages, but that could be 
fatal. Getting to the point where the error can be ignored means that 
there's no seccomp filter, at least none set to kill. Perhaps the pain 
is only temporary, new or patched kernels should eventually replace the 
old versions.



> Now, if the dynamic loader silently ignores the mprotect() failure on
> the main executable, is there much value in exposing a flag in the aux
> vectors? It saves a few (one?) mprotect() calls but I don't think it
> matters much. Anyway, I don't mind the flag.


Saving a few system calls is indeed not an issue, but not being able to 
use MDWX and PROT_BTI simultaneously was the original problem (service 
failures).


-Topi


Re: BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-10-26 Thread Dave Martin
On Wed, Oct 21, 2020 at 10:44:46PM -0500, Jeremy Linton via Libc-alpha wrote:
> Hi,
> 
> There is a problem with glibc+systemd on BTI enabled systems. Systemd
> has a service flag "MemoryDenyWriteExecute" which uses seccomp to deny
> PROT_EXEC changes. Glibc enables BTI only on segments which are marked as
> being BTI compatible by calling mprotect PROT_EXEC|PROT_BTI. That call is
> caught by the seccomp filter, resulting in service failures.
> 
> So, at the moment one has to pick either denying PROT_EXEC changes, or BTI.
> This is obviously not desirable.
> 
> Various changes have been suggested, replacing the mprotect with mmap calls
> having PROT_BTI set on the original mapping, re-mmapping the segments,
> implying PROT_EXEC on mprotect PROT_BTI calls when VM_EXEC is already set,
> and various modification to seccomp to allow particular mprotect cases to
> bypass the filters. In each case there seems to be an undesirable attribute
> to the solution.
> 
> So, what's the best solution?

Unrolling this discussion a bit, this problem comes from a few sources:

1) systemd is trying to implement a policy that doesn't fit SECCOMP
syscall filtering very well.

2) The program is trying to do something not expressible through the
syscall interface: really the intent is to set PROT_BTI on the page,
with no intent to set PROT_EXEC on any page that didn't already have it
set.


This limitation of mprotect() was known when I originally added PROT_BTI,
but at that time we weren't aware of a clear use case that would fail.
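
To make point 2 concrete, here is a toy model of what a stateless
syscall filter gets to see. It is not systemd's actual filter; it only
assumes, as the original report describes, that any mprotect() request
containing PROT_EXEC is denied. stateless_filter_denies() is a
hypothetical name, and the fallback PROT_BTI value is the arm64 one.

```c
#include <stdbool.h>
#include <sys/mman.h>

#ifndef PROT_BTI
#define PROT_BTI 0x10 /* arm64 value; fallback for headers that lack it */
#endif

/* The filter sees only the requested protection, never the protection
 * the pages already have. */
bool stateless_filter_denies(int requested_prot)
{
    return (requested_prot & PROT_EXEC) != 0;
}
```

In this model, adding BTI to pages that are already executable
(PROT_READ | PROT_EXEC | PROT_BTI) and making writable data executable
(PROT_READ | PROT_EXEC) look alike, so both are denied.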


Would it now help to add something like:

int mchangeprot(void *addr, size_t len, int old_flags, int new_flags)
{
    int ret = -EINVAL;
    mmap_write_lock(current->mm);
    if (all vmas in [addr .. addr + len) have
        their mprotect flags set to old_flags) {

        ret = mprotect(addr, len, new_flags);
    }

    mmap_write_unlock(current->mm);
    return ret;
}


libc would now be able to do

mchangeprot(addr, len, PROT_EXEC | PROT_READ,
PROT_EXEC | PROT_READ | PROT_BTI);

while systemd's MDWX filter would reject the call if

    (new_flags & PROT_EXEC) &&
    (!(old_flags & PROT_EXEC) || (new_flags & PROT_WRITE))



This won't magically fix current code, but something along these lines
might be better going forward.
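
The rejection rule above is small enough to write out as a plain C
predicate. This is only a sketch of the condition: mdwx_rejects() is a
hypothetical name, it is not a seccomp program, and the fallback
PROT_BTI value is the arm64 one.

```c
#include <stdbool.h>
#include <sys/mman.h>

#ifndef PROT_BTI
#define PROT_BTI 0x10 /* arm64 value; fallback for headers that lack it */
#endif

/* Reject a change that makes pages newly executable, or that leaves
 * them both writable and executable. */
bool mdwx_rejects(int old_flags, int new_flags)
{
    return (new_flags & PROT_EXEC) &&
           (!(old_flags & PROT_EXEC) || (new_flags & PROT_WRITE));
}
```

With both old and new flags visible, adding PROT_BTI to an existing
PROT_READ | PROT_EXEC mapping passes, while turning PROT_READ into
PROT_READ | PROT_EXEC is still refused.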


Thoughts?

---Dave


Re: BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-10-26 Thread Dave Martin
On Mon, Oct 26, 2020 at 02:52:46PM +, Catalin Marinas via Libc-alpha wrote:
> On Sat, Oct 24, 2020 at 02:01:30PM +0300, Topi Miettinen wrote:
> > On 23.10.2020 12.02, Catalin Marinas wrote:
> > > On Thu, Oct 22, 2020 at 01:02:18PM -0700, Kees Cook wrote:
> > > > Regardless, it makes sense to me to have the kernel load the executable
> > > > itself with BTI enabled by default. I prefer gaining Catalin's suggested
> > > > patch[2]. :)
> > > [...]
> > > > [2] https://lore.kernel.org/linux-arm-kernel/20201022093104.GB1229@gaia/
> > > 
> > > I think I first heard the idea at Mark R ;).
> > > 
> > > It still needs glibc changes to avoid the mprotect(), or at least ignore
> > > the error. Since this is an ABI change and we don't know which kernels
> > > would have it backported, maybe better to still issue the mprotect() but
> > > ignore the failure.
> > 
> > What about kernel adding an auxiliary vector as a flag to indicate that BTI
> > is supported and recommended by the kernel? Then dynamic loader could use
> > that to detect that a) the main executable is BTI protected and there's no
> > need to mprotect() it and b) PROT_BTI flag should be added to all PROT_EXEC
> > pages.
> 
> We could add a bit to AT_FLAGS, it's always been 0 for Linux.
> 
> > In absence of the vector, the dynamic loader might choose to skip doing
> > PROT_BTI at all (since the main executable isn't protected anyway either, or
> > maybe even the kernel is up-to-date but it knows that it's not recommended
> > for some reason, or maybe the kernel is so ancient that it doesn't know
> > about BTI). Optionally it could still read the flag from ELF later (for
> > compatibility with old kernels) and then do the mprotect() dance, which may
> > trip seccomp filters, possibly fatally.
> 
> I think the safest is for the dynamic loader to issue an mprotect() and
> ignore the EPERM error. Not all user deployments have this seccomp
> filter, so they can still benefit, and user can't tell whether the
> kernel change has been backported.
> 
> Now, if the dynamic loader silently ignores the mprotect() failure on
> the main executable, is there much value in exposing a flag in the aux
> vectors? It saves a few (one?) mprotect() calls but I don't think it
> matters much. Anyway, I don't mind the flag.

I don't see a problem with the aforementioned patch [2] to pre-set BTI
on the pages of the main binary.

The original rationale here was that ld.so doesn't _need_ this, since it
is going to examine the binary's ELF headers anyway.  But equally, if
the binary is marked as supporting BTI then it's safe to enable BTI for
the binary's own pages.


I'd tend to agree that an AT_FLAGS flag doesn't add much.  I think real
EPERMs would only be seen in assert-fail type situations.  Failure of
mmap() is likely to result in a segfault later on, or correct operation
with weakened permissions on some pages.  Given the likely failure
modes, that situation doesn't feel too bad.


> The only potential risk is if the dynamic loader decides not to turn
> PROT_BTI on because of some mix and match of objects but AFAIK BTI
> allows interworking.

Yes, the design means that a page's PROT_BTI can be set safely if the
code in that page was compiled for BTI, irrespective of how other pages
were compiled.  The reasons why we don't do this at finer granularity
are (a) it's not very useful, and (b) ELF images only contain a BTI
property note for the whole image, not per segment.

I think that ld.so already makes this decision at ELF image granularity
(unless someone contradicts me).
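
To illustrate the "whole image" granularity mentioned above, here is a minimal sketch of how a loader might test the BTI bit in an image's GNU property note. It is not glibc's code: image_supports_bti() is a hypothetical helper, it assumes the ELF64 property layout (pr_type, pr_datasz, data padded to 8 bytes), and it assumes the caller has already located the NT_GNU_PROPERTY_TYPE_0 descriptor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Values from the AArch64 ELF ABI (also provided by glibc's <elf.h>). */
#define GNU_PROPERTY_AARCH64_FEATURE_1_AND 0xc0000000u
#define GNU_PROPERTY_AARCH64_FEATURE_1_BTI (1u << 0)

/*
 * Hypothetical helper: walk the descriptor of an NT_GNU_PROPERTY_TYPE_0
 * note and report whether the image as a whole is marked BTI-compatible.
 * Returns 1 if the BTI bit is set, 0 otherwise (including malformed input).
 */
static int image_supports_bti(const unsigned char *desc, size_t len)
{
	size_t off = 0;

	assert(desc != NULL || len == 0);
	while (off <= len && len - off >= 8) {
		uint32_t type, datasz;

		memcpy(&type, desc + off, 4);
		memcpy(&datasz, desc + off + 4, 4);
		off += 8;
		if (datasz > len - off)
			return 0;	/* truncated note: treat as "no BTI" */
		if (type == GNU_PROPERTY_AARCH64_FEATURE_1_AND && datasz >= 4) {
			uint32_t feat;

			memcpy(&feat, desc + off, 4);
			return (feat & GNU_PROPERTY_AARCH64_FEATURE_1_BTI) != 0;
		}
		/* Skip to the next property, which starts 8-byte aligned. */
		off += ((size_t)datasz + 7) & ~(size_t)7;
	}
	return 0;
}
```

The answer is one bit per image, which is why the decision cannot sensibly be made per segment.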

Cheers
---Dave


Re: BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-10-26 Thread Catalin Marinas
On Sat, Oct 24, 2020 at 02:01:30PM +0300, Topi Miettinen wrote:
> On 23.10.2020 12.02, Catalin Marinas wrote:
> > On Thu, Oct 22, 2020 at 01:02:18PM -0700, Kees Cook wrote:
> > > Regardless, it makes sense to me to have the kernel load the executable
> > > itself with BTI enabled by default. I prefer gaining Catalin's suggested
> > > patch[2]. :)
> > [...]
> > > [2] https://lore.kernel.org/linux-arm-kernel/20201022093104.GB1229@gaia/
> > 
> > I think I first heard the idea from Mark R ;).
> > 
> > It still needs glibc changes to avoid the mprotect(), or at least ignore
> > the error. Since this is an ABI change and we don't know which kernels
> > would have it backported, maybe better to still issue the mprotect() but
> > ignore the failure.
> 
> What about kernel adding an auxiliary vector as a flag to indicate that BTI
> is supported and recommended by the kernel? Then dynamic loader could use
> that to detect that a) the main executable is BTI protected and there's no
> need to mprotect() it and b) PROT_BTI flag should be added to all PROT_EXEC
> pages.

We could add a bit to AT_FLAGS, it's always been 0 for Linux.
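
A minimal sketch of how a loader might consume such a bit, assuming one were defined. No such AT_FLAGS bit exists in Linux today; the name AT_FLAGS_BTI_PREMAPPED_HYPOTHETICAL and its value are made up for illustration. Only getauxval() and AT_FLAGS are real interfaces.

```c
#include <assert.h>
#include <sys/auxv.h>

/*
 * HYPOTHETICAL bit: the name and value are invented to illustrate the
 * proposal; the kernel currently always passes AT_FLAGS as 0.
 */
#define AT_FLAGS_BTI_PREMAPPED_HYPOTHETICAL (1UL << 0)

/* Pure helper so the decision can be exercised without a patched kernel. */
static int flags_say_main_exe_has_bti(unsigned long at_flags)
{
	return (at_flags & AT_FLAGS_BTI_PREMAPPED_HYPOTHETICAL) != 0;
}

/* Returns 1 if the loader could skip mprotect() on the main executable. */
static int can_skip_main_exe_mprotect(void)
{
	/* getauxval() returns 0 for a missing entry, i.e. "do not skip". */
	return flags_say_main_exe_has_bti(getauxval(AT_FLAGS));
}
```

On a kernel that does not know about the bit, the loader would fall through to the existing mprotect() path.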

> In absence of the vector, the dynamic loader might choose to skip doing
> PROT_BTI at all (since the main executable isn't protected anyway either, or
> maybe even the kernel is up-to-date but it knows that it's not recommended
> for some reason, or maybe the kernel is so ancient that it doesn't know
> about BTI). Optionally it could still read the flag from ELF later (for
> compatibility with old kernels) and then do the mprotect() dance, which may
> trip seccomp filters, possibly fatally.

I think the safest is for the dynamic loader to issue an mprotect() and
ignore the EPERM error. Not all user deployments have this seccomp
filter, so they can still benefit, and user can't tell whether the
kernel change has been backported.
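
As a rough sketch of the "issue mprotect() and ignore the failure" approach (not glibc's actual code): try_enable_bti() is a hypothetical helper, and the fallback definition of PROT_BTI assumes the arm64 value so that the sketch also compiles elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef PROT_BTI
#define PROT_BTI 0x10	/* arm64 value; assumed here for other targets */
#endif

/*
 * Hypothetical helper: try to add PROT_BTI to an already-mapped range.
 * Returns 1 if BTI was enabled, 0 if the request was refused and the
 * mapping was left as it was. A refusal is never treated as fatal.
 */
static int try_enable_bti(void *addr, size_t len, int prot)
{
	assert(addr != NULL && len > 0);
	if (mprotect(addr, len, prot | PROT_BTI) == 0)
		return 1;
	/*
	 * EPERM: typically a seccomp filter such as MemoryDenyWriteExecute=.
	 * EINVAL: kernel or CPU without BTI support.
	 * Either way, carry on without BTI on this range.
	 */
	return 0;
}
```

This only helps when the filter returns an error. A filter configured to kill the process on a denied call cannot be handled this way.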

Now, if the dynamic loader silently ignores the mprotect() failure on
the main executable, is there much value in exposing a flag in the aux
vectors? It saves a few (one?) mprotect() calls but I don't think it
matters much. Anyway, I don't mind the flag.

The only potential risk is if the dynamic loader decides not to turn
PROT_BTI on because of some mix and match of objects but AFAIK BTI
allows interworking.

-- 
Catalin


Re: BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-10-25 Thread Jordan Glover
On Saturday, October 24, 2020 2:12 PM, Salvatore Mesoraca wrote:

> On Sat, 24 Oct 2020 at 12:34, Topi Miettinen toiwo...@gmail.com wrote:
>
> > On 23.10.2020 20.52, Salvatore Mesoraca wrote:
> >
> > > Hi,
> > > On Thu, 22 Oct 2020 at 23:24, Topi Miettinen toiwo...@gmail.com wrote:
> > >
> > > > SARA looks interesting. What is missing is a prctl() to enable all W^X
> > > > protections irrevocably for the current process, then systemd could
> > > > enable it for services with MemoryDenyWriteExecute=yes.
> > >
> > > SARA actually has a procattr[0] interface to do just that.
> > > There is also a library[1] to help using it.
> >
> > That means that /proc has to be available and writable at that point, so
> > setting up procattrs has to be done before mount namespaces are set up.
> > In general, it would be nice for sandboxing facilities in kernel if
> > there would be a way to start enforcing restrictions only at next
> > execve(), like setexeccon() for SELinux and aa_change_onexec() for
> > AppArmor. Otherwise the exact order of setting up various sandboxing
> > options can be very tricky to arrange correctly, since each option may
> > have a subtle effect to the sandboxing features enabled later. In case
> > of SARA, the operations done between shuffling the mount namespace and
> > before execve() shouldn't be affected so it isn't important. Even if it
> > did (a new sandboxing feature in the future would need trampolines or
> > JIT code generation), maybe the procattr file could be opened early but
> > it could be written closer to execve().
>
> A new "apply on exec" procattr file seems reasonable and relatively easy to add.
> As Kees pointed out, the main obstacle here is the fact that SARA is
> not upstream :(
>
> Salvatore

Is there a chance we will see new SARA iteration soon on lkml? :)

Jordan


Re: BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-10-24 Thread Salvatore Mesoraca
On Sat, 24 Oct 2020 at 12:34, Topi Miettinen  wrote:
>
> On 23.10.2020 20.52, Salvatore Mesoraca wrote:
> > Hi,
> >
> > On Thu, 22 Oct 2020 at 23:24, Topi Miettinen  wrote:
> >> SARA looks interesting. What is missing is a prctl() to enable all W^X
> >> protections irrevocably for the current process, then systemd could
> >> enable it for services with MemoryDenyWriteExecute=yes.
> >
> > SARA actually has a procattr[0] interface to do just that.
> > There is also a library[1] to help using it.
>
> That means that /proc has to be available and writable at that point, so
> setting up procattrs has to be done before mount namespaces are set up.
> In general, it would be nice for sandboxing facilities in kernel if
> there would be a way to start enforcing restrictions only at next
> execve(), like setexeccon() for SELinux and aa_change_onexec() for
> AppArmor. Otherwise the exact order of setting up various sandboxing
> options can be very tricky to arrange correctly, since each option may
> have a subtle effect to the sandboxing features enabled later. In case
> of SARA, the operations done between shuffling the mount namespace and
> before execve() shouldn't be affected so it isn't important. Even if it
> did (a new sandboxing feature in the future would need trampolines or
> JIT code generation), maybe the procattr file could be opened early but
> it could be written closer to execve().

A new "apply on exec" procattr file seems reasonable and relatively easy to add.
As Kees pointed out, the main obstacle here is the fact that SARA is
not upstream :(

Salvatore


Re: BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-10-24 Thread Topi Miettinen

On 23.10.2020 20.52, Salvatore Mesoraca wrote:
> Hi,
>
> On Thu, 22 Oct 2020 at 23:24, Topi Miettinen wrote:
>> SARA looks interesting. What is missing is a prctl() to enable all W^X
>> protections irrevocably for the current process, then systemd could
>> enable it for services with MemoryDenyWriteExecute=yes.
>
> SARA actually has a procattr[0] interface to do just that.
> There is also a library[1] to help using it.


That means that /proc has to be available and writable at that point, so 
setting up procattrs has to be done before mount namespaces are set up. 
In general, it would be nice for sandboxing facilities in kernel if 
there would be a way to start enforcing restrictions only at next 
execve(), like setexeccon() for SELinux and aa_change_onexec() for 
AppArmor. Otherwise the exact order of setting up various sandboxing 
options can be very tricky to arrange correctly, since each option may 
have a subtle effect to the sandboxing features enabled later. In case 
of SARA, the operations done between shuffling the mount namespace and 
before execve() shouldn't be affected so it isn't important. Even if it 
did (a new sandboxing feature in the future would need trampolines or 
JIT code generation), maybe the procattr file could be opened early but 
it could be written closer to execve().


-Topi


Re: BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-10-24 Thread Topi Miettinen

On 23.10.2020 12.02, Catalin Marinas wrote:
> On Thu, Oct 22, 2020 at 01:02:18PM -0700, Kees Cook wrote:
>> Regardless, it makes sense to me to have the kernel load the executable
>> itself with BTI enabled by default. I prefer gaining Catalin's suggested
>> patch[2]. :)
> [...]
>> [2] https://lore.kernel.org/linux-arm-kernel/20201022093104.GB1229@gaia/
>
> I think I first heard the idea from Mark R ;).
>
> It still needs glibc changes to avoid the mprotect(), or at least ignore
> the error. Since this is an ABI change and we don't know which kernels
> would have it backported, maybe better to still issue the mprotect() but
> ignore the failure.


What about kernel adding an auxiliary vector as a flag to indicate that 
BTI is supported and recommended by the kernel? Then dynamic loader 
could use that to detect that a) the main executable is BTI protected 
and there's no need to mprotect() it and b) PROT_BTI flag should be 
added to all PROT_EXEC pages.


In absence of the vector, the dynamic loader might choose to skip doing 
PROT_BTI at all (since the main executable isn't protected anyway 
either, or maybe even the kernel is up-to-date but it knows that it's 
not recommended for some reason, or maybe the kernel is so ancient that 
it doesn't know about BTI). Optionally it could still read the flag from 
ELF later (for compatibility with old kernels) and then do the 
mprotect() dance, which may trip seccomp filters, possibly fatally.


-Topi


Re: BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-10-23 Thread Salvatore Mesoraca
Hi,

On Thu, 22 Oct 2020 at 23:24, Topi Miettinen  wrote:
> SARA looks interesting. What is missing is a prctl() to enable all W^X
> protections irrevocably for the current process, then systemd could
> enable it for services with MemoryDenyWriteExecute=yes.

SARA actually has a procattr[0] interface to do just that.
There is also a library[1] to help using it.

> I didn't also see specific measures against memfd_create() or file
> system W, but perhaps those can be added later.

You are right, there are no measures against those vectors.
It would be interesting to add them, though.

> Maybe pkey_mprotect()
> is not handled either unless it uses the same LSM hook as mprotect().

IIRC mprotect is implemented more or less as a pkey_mprotect with -1 as pkey.
The same LSM hook should cover both.

Salvatore

[0] https://lore.kernel.org/lkml/1562410493-8661-10-git-send-email-s.mesorac...@gmail.com/
[1] https://github.com/smeso/libsara


Re: [systemd-devel] BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-10-23 Thread Catalin Marinas
On Fri, Oct 23, 2020 at 07:13:17AM +0100, Szabolcs Nagy wrote:
> The 10/22/2020 10:31, Catalin Marinas wrote:
> > IIUC, the problem is with the main executable which is mapped by the
> > kernel without PROT_BTI. The dynamic loader wants to set PROT_BTI but
> > does not have the original file descriptor to be able to remap. Its only
> > choice is mprotect() and this fails because of the MDWX policy.
> > 
> > Not sure whether the kernel has the right information but could it map
> > the main executable with PROT_BTI if the corresponding PT_GNU_PROPERTY
> > is found? The current ABI states it only sets PROT_BTI for the
> > interpreter who'd be responsible for setting the PROT_BTI on the main
> > executable. I can't tell whether it would break anything but it's worth
> > a try:
> 
> i think it would work, but now i can't easily
> tell from the libc if i have to do the mprotect
> on the main exe or not.
> 
> i guess i can just always mprotect and ignore
> the failure?

I replied to Kees before reading your email. So yeah, still issue
mprotect() but ignore the failure.

-- 
Catalin


Re: BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-10-23 Thread Catalin Marinas
On Thu, Oct 22, 2020 at 01:02:18PM -0700, Kees Cook wrote:
> Regardless, it makes sense to me to have the kernel load the executable
> itself with BTI enabled by default. I prefer gaining Catalin's suggested
> patch[2]. :)
[...]
> [2] https://lore.kernel.org/linux-arm-kernel/20201022093104.GB1229@gaia/

I think I first heard the idea from Mark R ;).

It still needs glibc changes to avoid the mprotect(), or at least ignore
the error. Since this is an ABI change and we don't know which kernels
would have it backported, maybe better to still issue the mprotect() but
ignore the failure.

-- 
Catalin


Re: [systemd-devel] BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-10-23 Thread Szabolcs Nagy
The 10/22/2020 10:31, Catalin Marinas wrote:
> IIUC, the problem is with the main executable which is mapped by the
> kernel without PROT_BTI. The dynamic loader wants to set PROT_BTI but
> does not have the original file descriptor to be able to remap. Its only
> choice is mprotect() and this fails because of the MDWX policy.
> 
> Not sure whether the kernel has the right information but could it map
> the main executable with PROT_BTI if the corresponding PT_GNU_PROPERTY
> is found? The current ABI states it only sets PROT_BTI for the
> interpreter who'd be responsible for setting the PROT_BTI on the main
> executable. I can't tell whether it would break anything but it's worth
> a try:

i think it would work, but now i can't easily
tell from the libc if i have to do the mprotect
on the main exe or not.

i guess i can just always mprotect and ignore
the failure?

> 
> diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
> index 4784011cecac..0a08fb9133e8 100644
> --- a/arch/arm64/kernel/process.c
> +++ b/arch/arm64/kernel/process.c
> @@ -730,14 +730,6 @@ asmlinkage void __sched arm64_preempt_schedule_irq(void)
>  int arch_elf_adjust_prot(int prot, const struct arch_elf_state *state,
>bool has_interp, bool is_interp)
>  {
> - /*
> -  * For dynamically linked executables the interpreter is
> -  * responsible for setting PROT_BTI on everything except
> -  * itself.
> -  */
> - if (is_interp != has_interp)
> - return prot;
> -
>   if (!(state->flags & ARM64_ELF_BTI))
>   return prot;
>  
> 
> -- 
> Catalin

-- 


Re: BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-10-22 Thread Topi Miettinen

On 22.10.2020 23.02, Kees Cook wrote:
> On Thu, Oct 22, 2020 at 01:39:07PM +0300, Topi Miettinen wrote:
>> But I think SELinux has a more complete solution (execmem) which can track
>> the pages better than is possible with seccomp solution which has a very
>> narrow field of view. Maybe this facility could be made available to
>> non-SELinux systems, for example with prctl()? Then the in-kernel MDWX could
>> allow mprotect(PROT_EXEC | PROT_BTI) in case the backing file hasn't been
>> modified, the source filesystem isn't writable for the calling process and
>> the file descriptor isn't created with memfd_create().
>
> Right. The problem here is that systemd is attempting to mediate a
> state change using only syscall details (i.e. with seccomp) instead of
> a stateful analysis. Using a MAC is likely the only sane way to do that.
> SELinux is a bit difficult to adjust "on the fly" the way systemd would
> like to do things, and the more dynamic approach seen with SARA[1] isn't
> yet in the kernel.

SARA looks interesting. What is missing is a prctl() to enable all W^X
protections irrevocably for the current process, then systemd could
enable it for services with MemoryDenyWriteExecute=yes.

I didn't also see specific measures against memfd_create() or file
system W, but perhaps those can be added later. Maybe pkey_mprotect()
is not handled either unless it uses the same LSM hook as mprotect().

> Trying to enforce memory W^X protection correctly
> via seccomp isn't really going to work well, as far as I can see.


Not in general, but I think it can work well in context of system 
services. Then you can ensure that for a specific service, 
memfd_create() is blocked by seccomp and the file systems are W^X 
because of mount namespaces etc., so there should not be any means to 
construct arbitrary executable pages.


-Topi


Re: BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-10-22 Thread Kees Cook
On Thu, Oct 22, 2020 at 01:39:07PM +0300, Topi Miettinen wrote:
> But I think SELinux has a more complete solution (execmem) which can track
> the pages better than is possible with seccomp solution which has a very
> narrow field of view. Maybe this facility could be made available to
> non-SELinux systems, for example with prctl()? Then the in-kernel MDWX could
> allow mprotect(PROT_EXEC | PROT_BTI) in case the backing file hasn't been
> modified, the source filesystem isn't writable for the calling process and
> the file descriptor isn't created with memfd_create().

Right. The problem here is that systemd is attempting to mediate a
state change using only syscall details (i.e. with seccomp) instead of
a stateful analysis. Using a MAC is likely the only sane way to do that.
SELinux is a bit difficult to adjust "on the fly" the way systemd would
like to do things, and the more dynamic approach seen with SARA[1] isn't
yet in the kernel. Trying to enforce memory W^X protection correctly
via seccomp isn't really going to work well, as far as I can see.

Regardless, it makes sense to me to have the kernel load the executable
itself with BTI enabled by default. I prefer gaining Catalin's suggested
patch[2]. :)

[1] https://lore.kernel.org/kernel-hardening/1562410493-8661-1-git-send-email-s.mesorac...@gmail.com/
[2] https://lore.kernel.org/linux-arm-kernel/20201022093104.GB1229@gaia/

-- 
Kees Cook


Re: BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-10-22 Thread Topi Miettinen

On 22.10.2020 10.54, Szabolcs Nagy wrote:
> The 10/21/2020 22:44, Jeremy Linton wrote:
>> There is a problem with glibc+systemd on BTI enabled systems. Systemd
>> has a service flag "MemoryDenyWriteExecute" which uses seccomp to deny
>> PROT_EXEC changes. Glibc enables BTI only on segments which are marked as
>> being BTI compatible by calling mprotect PROT_EXEC|PROT_BTI. That call is
>> caught by the seccomp filter, resulting in service failures.
>>
>> So, at the moment one has to pick either denying PROT_EXEC changes, or BTI.
>> This is obviously not desirable.
>>
>> Various changes have been suggested, replacing the mprotect with mmap calls
>> having PROT_BTI set on the original mapping, re-mmapping the segments,
>> implying PROT_EXEC on mprotect PROT_BTI calls when VM_EXEC is already set,
>> and various modification to seccomp to allow particular mprotect cases to
>> bypass the filters. In each case there seems to be an undesirable attribute
>> to the solution.
>>
>> So, what's the best solution?
>
> the easy fix in glibc is to ignore mprotect(PROT_BTI|PROT_EXEC)
> failures, so programs work with seccomp filters, but bti gets
> disabled (it's unreasonable to expect bti protection if mprotect
> is filtered). it will be a nasty silent failure though.

Some may also want to use seccomp filters so that they will immediately
kill the process and in this case they couldn't do it.

> and i'm also considering a fix that re-mmaps the executable
> segment with PROT_BTI instead of mprotect since that is not
> filtered. unfortunately the main exe is mmaped by the kernel
> without PROT_BTI and the libc does not have the fd to re-mmap.
> (bti can be left off for the main exe if mprotect fails and
> later we can teach the kernel to add bti there.) currently
> this is not a complete fix so i'm a bit hesitant about it.
>
> as for a kernel side fix: if there is a way to only filter
> PROT_EXEC mprotect on mappings that are not yet PROT_EXEC
> that would solve this problem (but likely needs new syscall
> or seccomp capability).

Problem with seccomp MDWX is that it's still possible for malicious 
programs to circumvent the filter by using memfd_create(), fill the 
memory with desired content and then use mmap(,,PROT_EXEC) to make it 
executable without triggering seccomp. This can be mitigated by 
filtering also memfd_create(), but then some programs want to use it. 
Also the protection can be bypassed if the program can write to a file 
system which isn't mounted with "noexec". This can be mitigated with 
private mount namespaces and global mount options, but again some 
programs are written to expect W & X.


But I think SELinux has a more complete solution (execmem) which can 
track the pages better than is possible with seccomp solution which has 
a very narrow field of view. Maybe this facility could be made available 
to non-SELinux systems, for example with prctl()? Then the in-kernel 
MDWX could allow mprotect(PROT_EXEC | PROT_BTI) in case the backing file 
hasn't been modified, the source filesystem isn't writable for the 
calling process and the file descriptor isn't created with memfd_create().


-Topi


Re: [systemd-devel] BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-10-22 Thread Florian Weimer
* Topi Miettinen:

> Allowing mprotect(PROT_EXEC|PROT_BTI) would mean that all you need to
> circumvent MDWX is to add PROT_BTI flag. I'd suggest getting the flags 
> right at mmap() time or failing that, reverting the PROT_BTI for
> legacy programs later.
>
> Could the kernel tell the loader of the BTI situation with auxiliary
> vectors? Then it would be easy for the loader to always use the best 
> mmap() flags without ever needing to mprotect().

I think what we want is a mprotect2 call with a flags argument (separate
from protection flags) that tells the kernel that the request *removes*
protection flags and should fail otherwise.  seccomp could easily filter
that then.
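
A small model of the check such a call might perform, purely as a sketch. mprotect2() does not exist, the flag name below is hypothetical, and the model assumes one possible reading of the proposal: access permissions (read, write, execute) may only be dropped, while hardening bits such as PROT_BTI do not count as permissions.

```c
#include <assert.h>
#include <errno.h>
#include <sys/mman.h>

#ifndef PROT_BTI
#define PROT_BTI 0x10	/* arm64 value; assumed here for other targets */
#endif

/* HYPOTHETICAL flag for the proposed mprotect2(); not a real kernel ABI. */
#define MPROTECT2_REMOVE_ONLY_HYPOTHETICAL 0x1

#define ACCESS_BITS (PROT_READ | PROT_WRITE | PROT_EXEC)

/*
 * Model of the proposed check. With the remove-only flag set, the new
 * access bits must be a subset of the current ones.
 * Returns 0 if the change would be allowed, -EPERM otherwise.
 */
static int mprotect2_check(int cur_prot, int new_prot, int flags)
{
	int added = (new_prot & ACCESS_BITS) & ~(cur_prot & ACCESS_BITS);

	if ((flags & MPROTECT2_REMOVE_ONLY_HYPOTHETICAL) && added != 0)
		return -EPERM;
	return 0;
}
```

Under this reading a seccomp filter could allow every call that carries the flag, since such a call can never turn a non-executable mapping into an executable one.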

But like the other proposals, the migration story isn't great.  You
would need kernel and seccomp/systemd etc. updates before glibc starts
working, even if glibc has a fallback from mprotect2 to mprotect
(because the latter would be blocked).

Thanks,
Florian
-- 
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill



Re: [systemd-devel] BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-10-22 Thread Topi Miettinen

On 22.10.2020 12.31, Catalin Marinas wrote:
> On Thu, Oct 22, 2020 at 10:38:23AM +0200, Lennart Poettering wrote:
>> On Do, 22.10.20 09:29, Szabolcs Nagy (szabolcs.n...@arm.com) wrote:
>>>>> The dynamic loader has to process the LOAD segments to get to the ELF
>>>>> note that says to enable BTI.  Maybe we could do a first pass and load
>>>>> only the segments that cover notes.  But that requires lots of changes
>>>>> to generic code in the loader.
>>>>
>>>> What if the loader always enabled BTI for PROT_EXEC pages, but then when
>>>> discovering that this was a mistake, mprotect() the pages without BTI? Then
>>>> both BTI and MDWX would work and the penalty of not getting MDWX would fall
>>>> to non-BTI programs. What's the expected proportion of BTI enabled code vs.
>>>> disabled in the future, is it perhaps expected that a distro would enable
>>>> the flag globally so eventually only a few legacy programs might be
>>>> unprotected?
>>>
>>> i thought mprotect(PROT_EXEC) would get filtered
>>> with or without bti, is that not the case?
>>
>> We can adjust the filter in systemd to match any combination of
>> flags to allow and to deny.
>
> Yes but Szabolcs' point to Topi was that if we can adjust the filters to
> allow mprotect(PROT_EXEC), why not allow mprotect(PROT_EXEC|PROT_BTI)
> instead? Anyway, I see the MDWX and BTI as complementary policies so
> ideally we shouldn't have to choose between one or the other. If we
> allow mprotect(PROT_EXEC), that would override MDWX and also disable
> BTI.


Allowing mprotect(PROT_EXEC|PROT_BTI) would mean that all you need to 
circumvent MDWX is to add PROT_BTI flag. I'd suggest getting the flags 
right at mmap() time or failing that, reverting the PROT_BTI for legacy 
programs later.


Could the kernel tell the loader of the BTI situation with auxiliary 
vectors? Then it would be easy for the loader to always use the best 
mmap() flags without ever needing to mprotect().


-Topi


Re: [systemd-devel] BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-10-22 Thread Topi Miettinen

On 22.10.2020 11.29, Szabolcs Nagy wrote:
> The 10/22/2020 11:17, Topi Miettinen via Libc-alpha wrote:
>> On 22.10.2020 10.54, Florian Weimer wrote:
>>> * Lennart Poettering:
>>>> Did you see Topi's comments on the systemd issue?
>>>>
>>>> https://github.com/systemd/systemd/issues/17368#issuecomment-710485532
>>>>
>>>> I think I agree with this: it's a bit weird to alter the bits after
>>>> the fact. Can't glibc set up everything right from the beginning? That
>>>> would keep both concepts working.
>>>
>>> The dynamic loader has to process the LOAD segments to get to the ELF
>>> note that says to enable BTI.  Maybe we could do a first pass and load
>>> only the segments that cover notes.  But that requires lots of changes
>>> to generic code in the loader.
>>
>> What if the loader always enabled BTI for PROT_EXEC pages, but then when
>> discovering that this was a mistake, mprotect() the pages without BTI? Then
>> both BTI and MDWX would work and the penalty of not getting MDWX would fall
>> to non-BTI programs. What's the expected proportion of BTI enabled code vs.
>> disabled in the future, is it perhaps expected that a distro would enable
>> the flag globally so eventually only a few legacy programs might be
>> unprotected?
>
> i thought mprotect(PROT_EXEC) would get filtered
> with or without bti, is that not the case?


It would be filtered, but the idea is that with modern binaries this 
would not happen since the pages would be mapped with mmap(,, PROT_EXEC 
| PROT_BTI,,) which is OK for the purposes of MDWX. The loader would have to
use mprotect(PROT_EXEC) to get rid of PROT_BTI only for the legacy binaries.


-Topi


Re: [systemd-devel] BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-10-22 Thread Catalin Marinas
On Thu, Oct 22, 2020 at 10:38:23AM +0200, Lennart Poettering wrote:
> On Do, 22.10.20 09:29, Szabolcs Nagy (szabolcs.n...@arm.com) wrote:
> > > > The dynamic loader has to process the LOAD segments to get to the ELF
> > > > note that says to enable BTI.  Maybe we could do a first pass and load
> > > > only the segments that cover notes.  But that requires lots of changes
> > > > to generic code in the loader.
> > >
> > > What if the loader always enabled BTI for PROT_EXEC pages, but then when
> > > > discovering that this was a mistake, mprotect() the pages without BTI? Then
> > > > both BTI and MDWX would work and the penalty of not getting MDWX would fall
> > > > to non-BTI programs. What's the expected proportion of BTI enabled code vs.
> > > disabled in the future, is it perhaps expected that a distro would enable
> > > the flag globally so eventually only a few legacy programs might be
> > > unprotected?
> >
> > i thought mprotect(PROT_EXEC) would get filtered
> > with or without bti, is that not the case?
> 
> We can adjust the filter in systemd to match any combination of
> flags to allow and to deny.

Yes but Szabolcs' point to Topi was that if we can adjust the filters to
allow mprotect(PROT_EXEC), why not allow mprotect(PROT_EXEC|PROT_BTI)
instead? Anyway, I see the MDWX and BTI as complementary policies so
ideally we shouldn't have to choose between one or the other. If we
allow mprotect(PROT_EXEC), that would override MDWX and also disable
BTI.

IIUC, the problem is with the main executable which is mapped by the
kernel without PROT_BTI. The dynamic loader wants to set PROT_BTI but
does not have the original file descriptor to be able to remap. Its only
choice is mprotect() and this fails because of the MDWX policy.

Not sure whether the kernel has the right information but could it map
the main executable with PROT_BTI if the corresponding PT_GNU_PROPERTY
is found? The current ABI states it only sets PROT_BTI for the
interpreter who'd be responsible for setting the PROT_BTI on the main
executable. I can't tell whether it would break anything but it's worth
a try:

diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index 4784011cecac..0a08fb9133e8 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -730,14 +730,6 @@ asmlinkage void __sched arm64_preempt_schedule_irq(void)
 int arch_elf_adjust_prot(int prot, const struct arch_elf_state *state,
 			 bool has_interp, bool is_interp)
 {
-	/*
-	 * For dynamically linked executables the interpreter is
-	 * responsible for setting PROT_BTI on everything except
-	 * itself.
-	 */
-	if (is_interp != has_interp)
-		return prot;
-
 	if (!(state->flags & ARM64_ELF_BTI))
 		return prot;
 

-- 
Catalin
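
The effect of the patch above can be modelled in userspace, as a sketch only. The *_MODEL constants stand in for the kernel's PROT_EXEC, PROT_BTI and ARM64_ELF_BTI. The tail of the function is not shown in the diff, so the model assumes it adds PROT_BTI to executable mappings only.

```c
#include <assert.h>
#include <stdbool.h>

#define PROT_EXEC_MODEL 0x4	/* stands in for PROT_EXEC */
#define PROT_BTI_MODEL  0x10	/* stands in for PROT_BTI on arm64 */
#define ELF_BTI_MODEL   0x1	/* stands in for ARM64_ELF_BTI */

/* Assumed tail of the function: only executable mappings get PROT_BTI. */
static int add_bti_if_exec(int prot)
{
	return (prot & PROT_EXEC_MODEL) ? (prot | PROT_BTI_MODEL) : prot;
}

/*
 * Before the patch: the kernel sets PROT_BTI on the interpreter, or on
 * a static executable, and leaves a dynamically linked main executable
 * to the interpreter.
 */
static int adjust_prot_old(int prot, unsigned int elf_flags,
			   bool has_interp, bool is_interp)
{
	if (is_interp != has_interp)
		return prot;
	if (!(elf_flags & ELF_BTI_MODEL))
		return prot;
	return add_bti_if_exec(prot);
}

/* After the patch: every BTI-marked image is mapped with PROT_BTI. */
static int adjust_prot_new(int prot, unsigned int elf_flags,
			   bool has_interp, bool is_interp)
{
	(void)has_interp;
	(void)is_interp;
	if (!(elf_flags & ELF_BTI_MODEL))
		return prot;
	return add_bti_if_exec(prot);
}
```

The only case that changes is a dynamically linked main executable (has_interp true, is_interp false), which is exactly the mapping the loader cannot re-create itself.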


Re: [systemd-devel] BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-10-22 Thread Lennart Poettering
On Do, 22.10.20 09:29, Szabolcs Nagy (szabolcs.n...@arm.com) wrote:

> > > The dynamic loader has to process the LOAD segments to get to the ELF
> > > note that says to enable BTI.  Maybe we could do a first pass and load
> > > only the segments that cover notes.  But that requires lots of changes
> > > to generic code in the loader.
> >
> > What if the loader always enabled BTI for PROT_EXEC pages, but then when
> > discovering that this was a mistake, mprotect() the pages without BTI? Then
> > both BTI and MDWX would work and the penalty of not getting MDWX would fall
> > to non-BTI programs. What's the expected proportion of BTI enabled code vs.
> > disabled in the future, is it perhaps expected that a distro would enable
> > the flag globally so eventually only a few legacy programs might be
> > unprotected?
>
> i thought mprotect(PROT_EXEC) would get filtered
> with or without bti, is that not the case?

We can adjust the filter in systemd to match any combination of
flags to allow and to deny.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-10-22 Thread Lennart Poettering
On Do, 22.10.20 09:05, Szabolcs Nagy (szabolcs.n...@arm.com) wrote:

> > > Various changes have been suggested, replacing the mprotect with mmap calls
> > > having PROT_BTI set on the original mapping, re-mmapping the segments,
> > > implying PROT_EXEC on mprotect PROT_BTI calls when VM_EXEC is already set,
> > > and various modification to seccomp to allow particular mprotect cases to
> > > bypass the filters. In each case there seems to be an undesirable attribute
> > > to the solution.
> > >
> > > So, what's the best solution?
> >
> > Did you see Topi's comments on the systemd issue?
> >
> > https://github.com/systemd/systemd/issues/17368#issuecomment-710485532
> >
> > I think I agree with this: it's a bit weird to alter the bits after
> > the fact. Can't glibc set up everything right from the beginning? That
> > would keep both concepts working.
>
> that's hard to do and does not work for the main exe currently
> (which is mmaped by the kernel).
>
> (it's hard to do because to know that the elf module requires
> bti the PT_GNU_PROPERTY notes have to be accessed that are
> often in the executable load segment, so either you mmap that
> or have to read that, but the latter has a lot more failure
> modes, so if i have to get the mmap flags right i'd do a mmap
> and then re-mmap if the flags were not right)

The only other option I see, then, is to neuter one of the two
mechanisms. We could certainly turn off MDWE on arm in systemd, if
people want that. Or make it a build-time choice, so that distros make
the choice: build everything with BTI xor support MDWE.

(Might make sense for glibc to gracefully fall back to non-BTI mode if
the mprotect() fails though, to make sure BTI-built binaries work
everywhere.)
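
A minimal sketch of that fallback, for illustration only. The helper names
`classify_bti_failure()` and `enable_bti_or_fall_back()` are hypothetical
and are not glibc's loader code; the choice of which errno values count as
"carry on without BTI" is an assumption, as is defining PROT_BTI to the
arm64 value (0x10) where the libc headers lack it.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef PROT_BTI
#define PROT_BTI 0x10 /* arm64 uapi value; assumed here when the headers lack it */
#endif

enum bti_result { BTI_ENABLED, BTI_FALLBACK, BTI_ERROR };

/* Hypothetical policy: decide whether a failed mprotect() should be
 * treated as "run without BTI" rather than as a fatal loader error. */
enum bti_result classify_bti_failure(int err)
{
    switch (err) {
    case EPERM:  /* e.g. denied by a seccomp filter such as MDWE */
    case EACCES: /* mapping cannot be given the requested access */
    case EINVAL: /* e.g. kernel or CPU does not accept PROT_BTI */
        return BTI_FALLBACK;
    default:
        return BTI_ERROR;
    }
}

/* Hypothetical helper: try to add PROT_BTI to an already executable
 * segment, and report whether the caller may continue without it. */
enum bti_result enable_bti_or_fall_back(void *addr, size_t len)
{
    if (mprotect(addr, len, PROT_READ | PROT_EXEC | PROT_BTI) == 0)
        return BTI_ENABLED;
    return classify_bti_failure(errno);
}
```

With something along these lines, a BTI-built binary started under a filter
that rejects the call would keep running, just without BTI on that segment.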

I figure your interest in ARM system security is bigger than mine. I
am totally fine to turn off MDWE on ARM if that's what the Linux ARM
folks want. I have no horse in the race. Just let me know.

[An acceptable compromise might be to allow
mprotect(PROT_EXEC|PROT_BTI) if MDWE is on, but prohibit
mprotect(PROT_EXEC) without PROT_BTI. Then at least you get one of the
two protections, but not both. I mean, MDWE is not perfect anyway on
non-x86-64 already: on 32bit i386 MDWE protection is not complete, due
to ipc() syscall multiplexing being unmatchable with seccomp. I
personally am happy as long as it works fully on x86-64]
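
The compromise can be modelled as a plain predicate over the mprotect()
prot argument. This is only a sketch of the decision, not a seccomp
program and not systemd code; `compromise_allows_mprotect()` is a
hypothetical name, and PROT_BTI is defined to the arm64 value (0x10) as an
assumption where the headers lack it.

```c
#include <stdbool.h>
#include <sys/mman.h>

#ifndef PROT_BTI
#define PROT_BTI 0x10 /* arm64 uapi value; assumed here when the headers lack it */
#endif

/* Hypothetical model of the compromise: a request is denied only when it
 * asks for PROT_EXEC without also asking for PROT_BTI. */
bool compromise_allows_mprotect(unsigned long prot)
{
    bool wants_exec = (prot & PROT_EXEC) != 0;
    bool wants_bti = (prot & PROT_BTI) != 0;

    return !wants_exec || wants_bti;
}
```

Note that in this model a request such as PROT_WRITE|PROT_EXEC|PROT_BTI
also passes, so a process that sets PROT_BTI is no longer held to MDWE.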

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-10-22 Thread Szabolcs Nagy
The 10/22/2020 11:17, Topi Miettinen via Libc-alpha wrote:
> On 22.10.2020 10.54, Florian Weimer wrote:
> > * Lennart Poettering:
> > > Did you see Topi's comments on the systemd issue?
> > > 
> > > https://github.com/systemd/systemd/issues/17368#issuecomment-710485532
> > > 
> > > I think I agree with this: it's a bit weird to alter the bits after
> > > the fact. Can't glibc set up everything right from the beginning? That
> > > would keep both concepts working.
> > 
> > The dynamic loader has to process the LOAD segments to get to the ELF
> > note that says to enable BTI.  Maybe we could do a first pass and load
> > only the segments that cover notes.  But that requires lots of changes
> > to generic code in the loader.
> 
> What if the loader always enabled BTI for PROT_EXEC pages, but then when
> discovering that this was a mistake, mprotect() the pages without BTI? Then
> both BTI and MDWX would work and the penalty of not getting MDWX would fall
> to non-BTI programs. What's the expected proportion of BTI enabled code vs.
> disabled in the future, is it perhaps expected that a distro would enable
> the flag globally so eventually only a few legacy programs might be
> unprotected?

i thought mprotect(PROT_EXEC) would get filtered
with or without bti, is that not the case?

then i guess we can do the protection that way
around, but then i don't see why the filter cannot
treat PROT_EXEC|PROT_BTI the same as PROT_EXEC.



Re: [systemd-devel] BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-10-22 Thread Florian Weimer
* Topi Miettinen:

>> The dynamic loader has to process the LOAD segments to get to the ELF
>> note that says to enable BTI.  Maybe we could do a first pass and
>> load only the segments that cover notes.  But that requires lots of
>> changes to generic code in the loader.
>
> What if the loader always enabled BTI for PROT_EXEC pages, but then
> when discovering that this was a mistake, mprotect() the pages without
> BTI?

Is that architecturally supported?  How costly is the mprotect change if
the pages have not been faulted in yet?

> Then both BTI and MDWX would work and the penalty of not getting
> MDWX would fall to non-BTI programs. What's the expected proportion of
> BTI enabled code vs. disabled in the future, is it perhaps expected
> that a distro would enable the flag globally so eventually only a few
> legacy programs might be unprotected?

Eventually, I expect that mainstream distributions build everything for
BTI, so yes, the PROT_BTI removal would only be needed for legacy
programs.

Thanks,
Florian
-- 
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill



Re: [systemd-devel] BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-10-22 Thread Topi Miettinen

On 22.10.2020 10.54, Florian Weimer wrote:
> * Lennart Poettering:
> > On Mi, 21.10.20 22:44, Jeremy Linton (jeremy.lin...@arm.com) wrote:
> > 
> > > Hi,
> > >
> > > There is a problem with glibc+systemd on BTI enabled systems. Systemd
> > > has a service flag "MemoryDenyWriteExecute" which uses seccomp to deny
> > > PROT_EXEC changes. Glibc enables BTI only on segments which are marked as
> > > being BTI compatible by calling mprotect PROT_EXEC|PROT_BTI. That call is
> > > caught by the seccomp filter, resulting in service failures.
> > >
> > > So, at the moment one has to pick either denying PROT_EXEC changes, or BTI.
> > > This is obviously not desirable.
> > >
> > > Various changes have been suggested, replacing the mprotect with mmap calls
> > > having PROT_BTI set on the original mapping, re-mmapping the segments,
> > > implying PROT_EXEC on mprotect PROT_BTI calls when VM_EXEC is already set,
> > > and various modifications to seccomp to allow particular mprotect cases to
> > > bypass the filters. In each case there seems to be an undesirable attribute
> > > to the solution.
> > >
> > > So, what's the best solution?
> > 
> > Did you see Topi's comments on the systemd issue?
> > 
> > https://github.com/systemd/systemd/issues/17368#issuecomment-710485532
> > 
> > I think I agree with this: it's a bit weird to alter the bits after
> > the fact. Can't glibc set up everything right from the beginning? That
> > would keep both concepts working.
> 
> The dynamic loader has to process the LOAD segments to get to the ELF
> note that says to enable BTI.  Maybe we could do a first pass and load
> only the segments that cover notes.  But that requires lots of changes
> to generic code in the loader.

What if the loader always enabled BTI for PROT_EXEC pages, but then when 
discovering that this was a mistake, mprotect() the pages without BTI? 
Then both BTI and MDWX would work and the penalty of not getting MDWX 
would fall to non-BTI programs. What's the expected proportion of BTI 
enabled code vs. disabled in the future, is it perhaps expected that a 
distro would enable the flag globally so eventually only a few legacy 
programs might be unprotected?
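
To make the proposed ordering concrete, here is a rough sketch of the two
decisions the loader would make. Both function names are hypothetical and
this is not glibc code; PROT_BTI is defined to the arm64 value (0x10) as an
assumption where the headers lack it.

```c
#include <stdbool.h>
#include <sys/mman.h>

#ifndef PROT_BTI
#define PROT_BTI 0x10 /* arm64 uapi value; assumed here when the headers lack it */
#endif

/* Step 1 (hypothetical): protection requested when a segment is first
 * mapped. Executable segments get PROT_BTI speculatively. */
int initial_segment_prot(int seg_prot)
{
    return (seg_prot & PROT_EXEC) ? (seg_prot | PROT_BTI) : seg_prot;
}

/* Step 2 (hypothetical): protection wanted once the property notes have
 * been read. If the result differs from mapped_prot, the loader would
 * need an mprotect() call to drop PROT_BTI again. */
int final_segment_prot(int mapped_prot, bool object_has_bti_note)
{
    return object_has_bti_note ? mapped_prot : (mapped_prot & ~PROT_BTI);
}
```

Only objects without the BTI note need the second mprotect(), and since
that call still carries PROT_EXEC it is the one a filter on PROT_EXEC
would catch.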


-Topi


Re: [systemd-devel] BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-10-22 Thread Szabolcs Nagy
The 10/22/2020 09:18, Lennart Poettering wrote:
> On Mi, 21.10.20 22:44, Jeremy Linton (jeremy.lin...@arm.com) wrote:
> 
> > Hi,
> >
> > There is a problem with glibc+systemd on BTI enabled systems. Systemd
> > has a service flag "MemoryDenyWriteExecute" which uses seccomp to deny
> > PROT_EXEC changes. Glibc enables BTI only on segments which are marked as
> > being BTI compatible by calling mprotect PROT_EXEC|PROT_BTI. That call is
> > caught by the seccomp filter, resulting in service failures.
> >
> > So, at the moment one has to pick either denying PROT_EXEC changes, or BTI.
> > This is obviously not desirable.
> >
> > Various changes have been suggested, replacing the mprotect with mmap calls
> > having PROT_BTI set on the original mapping, re-mmapping the segments,
> > implying PROT_EXEC on mprotect PROT_BTI calls when VM_EXEC is already set,
> > > and various modifications to seccomp to allow particular mprotect cases to
> > > bypass the filters. In each case there seems to be an undesirable attribute
> > > to the solution.
> > >
> > > So, what's the best solution?
> 
> Did you see Topi's comments on the systemd issue?
> 
> https://github.com/systemd/systemd/issues/17368#issuecomment-710485532
> 
> I think I agree with this: it's a bit weird to alter the bits after
> the fact. Can't glibc set up everything right from the beginning? That
> would keep both concepts working.

that's hard to do and does not work for the main exe currently
(which is mmaped by the kernel).

(it's hard to do because to know that the elf module requires
bti the PT_GNU_PROPERTY notes have to be accessed that are
often in the executable load segment, so either you mmap that
or have to read that, but the latter has a lot more failure
modes, so if i have to get the mmap flags right i'd do a mmap
and then re-mmap if the flags were not right)


Re: [systemd-devel] BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-10-22 Thread Florian Weimer
* Lennart Poettering:

> On Mi, 21.10.20 22:44, Jeremy Linton (jeremy.lin...@arm.com) wrote:
>
>> Hi,
>>
>> There is a problem with glibc+systemd on BTI enabled systems. Systemd
>> has a service flag "MemoryDenyWriteExecute" which uses seccomp to deny
>> PROT_EXEC changes. Glibc enables BTI only on segments which are marked as
>> being BTI compatible by calling mprotect PROT_EXEC|PROT_BTI. That call is
>> caught by the seccomp filter, resulting in service failures.
>>
>> So, at the moment one has to pick either denying PROT_EXEC changes, or BTI.
>> This is obviously not desirable.
>>
>> Various changes have been suggested, replacing the mprotect with mmap calls
>> having PROT_BTI set on the original mapping, re-mmapping the segments,
>> implying PROT_EXEC on mprotect PROT_BTI calls when VM_EXEC is already set,
>> and various modifications to seccomp to allow particular mprotect cases to
>> bypass the filters. In each case there seems to be an undesirable attribute
>> to the solution.
>>
>> So, what's the best solution?
>
> Did you see Topi's comments on the systemd issue?
>
> https://github.com/systemd/systemd/issues/17368#issuecomment-710485532
>
> I think I agree with this: it's a bit weird to alter the bits after
> the fact. Can't glibc set up everything right from the beginning? That
> would keep both concepts working.

The dynamic loader has to process the LOAD segments to get to the ELF
note that says to enable BTI.  Maybe we could do a first pass and load
only the segments that cover notes.  But that requires lots of changes
to generic code in the loader.

Thanks,
Florian



Re: [systemd-devel] BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-10-22 Thread Lennart Poettering
On Mi, 21.10.20 22:44, Jeremy Linton (jeremy.lin...@arm.com) wrote:

> Hi,
>
> There is a problem with glibc+systemd on BTI enabled systems. Systemd
> has a service flag "MemoryDenyWriteExecute" which uses seccomp to deny
> PROT_EXEC changes. Glibc enables BTI only on segments which are marked as
> being BTI compatible by calling mprotect PROT_EXEC|PROT_BTI. That call is
> caught by the seccomp filter, resulting in service failures.
>
> So, at the moment one has to pick either denying PROT_EXEC changes, or BTI.
> This is obviously not desirable.
>
> Various changes have been suggested, replacing the mprotect with mmap calls
> having PROT_BTI set on the original mapping, re-mmapping the segments,
> implying PROT_EXEC on mprotect PROT_BTI calls when VM_EXEC is already set,
> and various modifications to seccomp to allow particular mprotect cases to
> bypass the filters. In each case there seems to be an undesirable attribute
> to the solution.
>
> So, what's the best solution?

Did you see Topi's comments on the systemd issue?

https://github.com/systemd/systemd/issues/17368#issuecomment-710485532

I think I agree with this: it's a bit weird to alter the bits after
the fact. Can't glibc set up everything right from the beginning? That
would keep both concepts working.

Lennart

--
Lennart Poettering, Berlin