Re: [RFC][PATCH 0/3] arm64 relaxed ABI

2019-02-26 Thread Kevin Brodsky

On 25/02/2019 18:02, Szabolcs Nagy wrote:

On 25/02/2019 16:57, Catalin Marinas wrote:

On Tue, Feb 19, 2019 at 06:38:31PM +, Szabolcs Nagy wrote:

i think these rules work for the cases i care about, a more
tricky question is when/how to check for the new syscall abi
and when/how the TCR_EL1.TBI0 setting may be turned off.

I don't think turning TBI0 off is critical (it's handy for PAC with
52-bit VA but then it's short-lived if you want more security features
like MTE).

yes, i made a mistake assuming TBI0 off is
required for (or at least compatible with) MTE.

if TBI0 needs to be on for MTE then some of my
analysis is wrong, and i expect TBI0 to be on
in the foreseeable future.


consider the following cases (tb == top byte):

binary 1: user tb = any, syscall tb = 0
   tbi is on, "legacy binary"

binary 2: user tb = any, syscall tb = any
   tbi is on, "new binary using tb"
   for backward compat it needs to check for new syscall abi.

binary 3: user tb = 0, syscall tb = 0
   tbi can be off, "new binary",
   binary is marked to indicate unused tb,
   kernel may turn tbi off: additional pac bits.

binary 4: user tb = mte, syscall tb = mte
   like binary 3, but with mte, "new binary using mte"

so this should be "like binary 2, but with mte".


   does it have to check for new syscall abi?
   or MTE HWCAP would imply it?
   (is it possible to use mte without new syscall abi?)

I think MTE HWCAP should imply it.


in userspace we want most binaries to be like binary 3 and 4
eventually, i.e. marked as not-relying-on-tbi, if a dso is
loaded that is unmarked (legacy or new tb user), then either
the load fails (e.g. if mte is already used? or can we turn
mte off at runtime?) or tbi has to be enabled (prctl? does
this work with pac? or multi-threads?).

We could enable it via prctl. That's the plan for MTE as well (in
addition maybe to some ELF flag).


as for checking the new syscall abi: i don't see much semantic
difference between AT_HWCAP and AT_FLAGS (either way, the user
has to check a feature flag before using the feature of the
underlying system and it does not matter much if it's a syscall
abi feature or cpu feature), but i don't see anything wrong
with AT_FLAGS if the kernel prefers that.

The AT_FLAGS is aimed at capturing binary 2 case above, i.e. the
relaxation of the syscall ABI to accept tb = any. The MTE support will
have its own AT_HWCAP, likely in addition to AT_FLAGS. Arguably,
AT_FLAGS is either redundant here if MTE implies it (and no harm in
keeping it around) or the meaning is different: a tb != 0 may be checked
by the kernel against the allocation tag (i.e. get_user() could fail,
the tag is not entirely ignored).


the discussion here was mostly about binary 2,

That's because passing tb != 0 into the syscall ABI is the main blocker
here that needs clearing out before merging the MTE support. There is,
of course, a variation of binary 1 for MTE:

binary 5: user tb = mte, syscall tb = 0

but this requires a lot of C lib changes to support properly.

yes, i don't think we want to do that.

but it's ok to have both syscall tbi AT_FLAGS and MTE HWCAP.


but for
me the open question is if we can make binary 3/4 work.
(which requires some elf binary marking, that is recognised
by the kernel and dynamic loader, and efficient handling of
the TBI0 bit, ..if it's not possible, then i don't see how
mte will be deployed).

If we ignore binary 3, we can keep TBI0 = 1 permanently, whether we have
MTE or not.


and i guess on the kernel side the open question is if the
rules 1/2/3/4 can be made to work in corner cases e.g. when
pointers embedded into structs are passed down in ioctl.

We've been trying to track these down since last summer and we came to
the conclusion that it should be (mostly) fine for the non-weird memory
described above.

i think an interesting case is when userspace passes
a pointer to the kernel and later gets it back,
which is why i proposed rule 4 (kernel has to keep
the tag then).

but i wonder what's the right thing to do for sp
(user can malloc thread/sigalt/makecontext stack
which will be mte tagged in practice with mte on)
does tagged sp work? should userspace untag the
stack memory before setting it up as a stack?
(but then user pointers to that allocation may get
broken..)


Tagged SP does work, and it is actually a good idea (it avoids using the default tag 
for the stack). It would be quite easy for the kernel to tag the initial SP and the 
stack on execve(). For other stacks, it is up to userspace, as you say, and would be 
made easier by making it possible to choose how a mapping should be tagged by the 
kernel via a new mmap() flag. Some software that makes too many assumptions on the 
address of stack variables will be disturbed by a tagged SP, but this should be 
fairly rare.


In any case, I don't think this impacts this ABI proposal (beyond the fact that 
passing tagged pointers to the stack needs to be allowed).


Kevin


Re: [RFC][PATCH 0/3] arm64 relaxed ABI

2019-02-25 Thread Szabolcs Nagy
On 25/02/2019 16:57, Catalin Marinas wrote:
> On Tue, Feb 19, 2019 at 06:38:31PM +, Szabolcs Nagy wrote:
>> i think these rules work for the cases i care about, a more
>> tricky question is when/how to check for the new syscall abi
>> and when/how the TCR_EL1.TBI0 setting may be turned off.
> 
> I don't think turning TBI0 off is critical (it's handy for PAC with
> 52-bit VA but then it's short-lived if you want more security features
> like MTE).

yes, i made a mistake assuming TBI0 off is
required for (or at least compatible with) MTE.

if TBI0 needs to be on for MTE then some of my
analysis is wrong, and i expect TBI0 to be on
in the foreseeable future.

>> consider the following cases (tb == top byte):
>>
>> binary 1: user tb = any, syscall tb = 0
>>   tbi is on, "legacy binary"
>>
>> binary 2: user tb = any, syscall tb = any
>>   tbi is on, "new binary using tb"
>>   for backward compat it needs to check for new syscall abi.
>>
>> binary 3: user tb = 0, syscall tb = 0
>>   tbi can be off, "new binary",
>>   binary is marked to indicate unused tb,
>>   kernel may turn tbi off: additional pac bits.
>>
>> binary 4: user tb = mte, syscall tb = mte
>>   like binary 3, but with mte, "new binary using mte"

so this should be "like binary 2, but with mte".

>>   does it have to check for new syscall abi?
>>   or MTE HWCAP would imply it?
>>   (is it possible to use mte without new syscall abi?)
> 
> I think MTE HWCAP should imply it.
> 
>> in userspace we want most binaries to be like binary 3 and 4
>> eventually, i.e. marked as not-relying-on-tbi, if a dso is
>> loaded that is unmarked (legacy or new tb user), then either
>> the load fails (e.g. if mte is already used? or can we turn
>> mte off at runtime?) or tbi has to be enabled (prctl? does
>> this work with pac? or multi-threads?).
> 
> We could enable it via prctl. That's the plan for MTE as well (in
> addition maybe to some ELF flag).
> 
>> as for checking the new syscall abi: i don't see much semantic
>> difference between AT_HWCAP and AT_FLAGS (either way, the user
>> has to check a feature flag before using the feature of the
>> underlying system and it does not matter much if it's a syscall
>> abi feature or cpu feature), but i don't see anything wrong
>> with AT_FLAGS if the kernel prefers that.
> 
> The AT_FLAGS is aimed at capturing binary 2 case above, i.e. the
> relaxation of the syscall ABI to accept tb = any. The MTE support will
> have its own AT_HWCAP, likely in addition to AT_FLAGS. Arguably,
> AT_FLAGS is either redundant here if MTE implies it (and no harm in
> keeping it around) or the meaning is different: a tb != 0 may be checked
> by the kernel against the allocation tag (i.e. get_user() could fail,
> the tag is not entirely ignored).
> 
>> the discussion here was mostly about binary 2,
> 
> That's because passing tb != 0 into the syscall ABI is the main blocker
> here that needs clearing out before merging the MTE support. There is,
> of course, a variation of binary 1 for MTE:
> 
> binary 5: user tb = mte, syscall tb = 0
> 
> but this requires a lot of C lib changes to support properly.

yes, i don't think we want to do that.

but it's ok to have both syscall tbi AT_FLAGS and MTE HWCAP.

>> but for
>> me the open question is if we can make binary 3/4 work.
>> (which requires some elf binary marking, that is recognised
>> by the kernel and dynamic loader, and efficient handling of
>> the TBI0 bit, ..if it's not possible, then i don't see how
>> mte will be deployed).
> 
> If we ignore binary 3, we can keep TBI0 = 1 permanently, whether we have
> MTE or not.
> 
>> and i guess on the kernel side the open question is if the
>> rules 1/2/3/4 can be made to work in corner cases e.g. when
>> pointers embedded into structs are passed down in ioctl.
> 
> We've been trying to track these down since last summer and we came to
> the conclusion that it should be (mostly) fine for the non-weird memory
> described above.

i think an interesting case is when userspace passes
a pointer to the kernel and later gets it back,
which is why i proposed rule 4 (kernel has to keep
the tag then).

but i wonder what's the right thing to do for sp
(user can malloc thread/sigalt/makecontext stack
which will be mte tagged in practice with mte on)
does tagged sp work? should userspace untag the
stack memory before setting it up as a stack?
(but then user pointers to that allocation may get
broken..)


Re: [RFC][PATCH 0/3] arm64 relaxed ABI

2019-02-25 Thread Catalin Marinas
Hi Szabolcs,

Thanks for looking into this. Comments below.

On Tue, Feb 19, 2019 at 06:38:31PM +, Szabolcs Nagy wrote:
> i think these rules work for the cases i care about, a more
> tricky question is when/how to check for the new syscall abi
> and when/how the TCR_EL1.TBI0 setting may be turned off.

I don't think turning TBI0 off is critical (it's handy for PAC with
52-bit VA but then it's short-lived if you want more security features
like MTE).

> consider the following cases (tb == top byte):
> 
> binary 1: user tb = any, syscall tb = 0
>   tbi is on, "legacy binary"
> 
> binary 2: user tb = any, syscall tb = any
>   tbi is on, "new binary using tb"
>   for backward compat it needs to check for new syscall abi.
> 
> binary 3: user tb = 0, syscall tb = 0
>   tbi can be off, "new binary",
>   binary is marked to indicate unused tb,
>   kernel may turn tbi off: additional pac bits.
> 
> binary 4: user tb = mte, syscall tb = mte
>   like binary 3, but with mte, "new binary using mte"
>   does it have to check for new syscall abi?
>   or MTE HWCAP would imply it?
>   (is it possible to use mte without new syscall abi?)

I think MTE HWCAP should imply it.

> in userspace we want most binaries to be like binary 3 and 4
> eventually, i.e. marked as not-relying-on-tbi, if a dso is
> loaded that is unmarked (legacy or new tb user), then either
> the load fails (e.g. if mte is already used? or can we turn
> mte off at runtime?) or tbi has to be enabled (prctl? does
> this work with pac? or multi-threads?).

We could enable it via prctl. That's the plan for MTE as well (in
addition maybe to some ELF flag).

> as for checking the new syscall abi: i don't see much semantic
> difference between AT_HWCAP and AT_FLAGS (either way, the user
> has to check a feature flag before using the feature of the
> underlying system and it does not matter much if it's a syscall
> abi feature or cpu feature), but i don't see anything wrong
> with AT_FLAGS if the kernel prefers that.

The AT_FLAGS is aimed at capturing binary 2 case above, i.e. the
relaxation of the syscall ABI to accept tb = any. The MTE support will
have its own AT_HWCAP, likely in addition to AT_FLAGS. Arguably,
AT_FLAGS is either redundant here if MTE implies it (and no harm in
keeping it around) or the meaning is different: a tb != 0 may be checked
by the kernel against the allocation tag (i.e. get_user() could fail,
the tag is not entirely ignored).

> the discussion here was mostly about binary 2,

That's because passing tb != 0 into the syscall ABI is the main blocker
here that needs clearing out before merging the MTE support. There is,
of course, a variation of binary 1 for MTE:

binary 5: user tb = mte, syscall tb = 0

but this requires a lot of C lib changes to support properly.

> but for
> me the open question is if we can make binary 3/4 work.
> (which requires some elf binary marking, that is recognised
> by the kernel and dynamic loader, and efficient handling of
> the TBI0 bit, ..if it's not possible, then i don't see how
> mte will be deployed).

If we ignore binary 3, we can keep TBI0 = 1 permanently, whether we have
MTE or not.

> and i guess on the kernel side the open question is if the
> rules 1/2/3/4 can be made to work in corner cases e.g. when
> pointers embedded into structs are passed down in ioctl.

We've been trying to track these down since last summer and we came to
the conclusion that it should be (mostly) fine for the non-weird memory
described above.

-- 
Catalin


Re: [RFC][PATCH 0/3] arm64 relaxed ABI

2019-02-19 Thread Szabolcs Nagy
On 12/02/2019 18:02, Catalin Marinas wrote:
> On Mon, Feb 11, 2019 at 12:32:55PM -0800, Evgenii Stepanov wrote:
>> On Mon, Feb 11, 2019 at 9:28 AM Kevin Brodsky  wrote:
>>> On 19/12/2018 12:52, Dave Martin wrote:
 Really, the kernel should do the expected thing with all "non-weird"
 memory.

 In lieu of a proper definition of "non-weird", I think we should have
 some lists of things that are explicitly included, and also excluded:

 OK:
   kernel-allocated process stack
   brk area
   MAP_ANONYMOUS | MAP_PRIVATE
   MAP_PRIVATE mappings of /dev/zero

 Not OK:
   MAP_SHARED
   mmaps of non-memory-like devices
   mmaps of anything that is not a regular file
   the VDSO
   ...

 In general, userspace can tag memory that it "owns", and we do not assume
 a transfer of ownership except in the "OK" list above.  Otherwise, it's
 the kernel's memory, or the owner is simply not well defined.
>>>
>>> Agreed on the general idea: a process should be able to pass tagged 
>>> pointers at the
>>> syscall interface, as long as they point to memory privately owned by the 
>>> process. I
>>> think it would be possible to simplify the definition of "non-weird" memory 
>>> by using
>>> only this "OK" list:
>>> - mmap() done by the process itself, where either:
>>>* flags = MAP_PRIVATE | MAP_ANONYMOUS
>>>* flags = MAP_PRIVATE and fd refers to a regular file or a well-defined 
>>> list of
>>> device files (like /dev/zero)
>>> - brk() done by the process itself
>>> - Any memory mapped by the kernel in the new process's address space during 
>>> execve(),
>>> with the same restrictions as above ([vdso]/[vvar] are therefore excluded)
> 
> Sounds reasonable.

OK. this non-weird memory definition works for me too.

rule 1: if weird memory pointers are passed to the kernel
with top byte set then the behaviour is undefined.

   * When the kernel dereferences a pointer on userspace's behalf, it
 shall behave equivalently to userspace dereferencing the same pointer,
 including use of the same tag (where passed by userspace).

   * Where the pointer tag affects pointer dereference behaviour (i.e.,
 with hardware memory colouring) the kernel makes no guarantee to
 honour pointer tags correctly for every location a buffer based on a
 pointer passed by userspace to the kernel.

 (This means for example that for a read(fd, buf, size), we can check
 the tag for a single arbitrary location in *(char (*)[size])buf
 before passing the buffer to get_user_pages().  Hopefully this could
 be done in get_user_pages() itself rather than hunting call sites.
 For userspace, it means that you're on your own if you ask the
 kernel to operate on a buffer than spans multiple, independently-
 allocated objects, or a deliberately striped single object.)
>>>
>>> I think both points are reasonable. It is very valuable for the kernel to 
>>> access
>>> userspace memory using the user-provided tag, because it enables kernel 
>>> accesses to
>>> be checked in the same way as user accesses, allowing to detect bugs that 
>>> are
>>> potentially hard to find. For instance, if a pointer to an object is passed 
>>> to the
>>> kernel after it has been deallocated, this is invalid and should be 
>>> detected.
>>> However, you are absolutely right that the kernel cannot *guarantee* that 
>>> such a
>>> check is carried out for the entire memory range (or in fact at all); it 
>>> should be a
>>> best-effort approach.
>>
>> It would also be valuable to narrow down the set of "relaxed" (i.e.
>> not tag-checking) syscalls as reasonably possible. We would want to
>> provide tag-checking userspace wrappers for any important calls that
>> are not checked in the kernel. Is it correct to assume that anything
>> that goes through copy_from_user  / copy_to_user is checked?
> 
> I lost track of the context of this thread but if it's just about
> relaxing the ABI for hwasan, the kernel has no idea of the compiler
> generated structures in user space, so nothing is checked.
> 
> If we talk about tags in the context of MTE, than yes, with the current
> proposal the tag would be checked by copy_*_user() functions.

rule 2: kernel derefs as if user derefs when non-weird memory
pointers are passed to the kernel.

note that the important bit is what happens on valid pointer
derefs, invalid pointer deref is usually undefined for user
programs, so what happens in case of mte tag failures is
more of a QoI issue than abi i think.

(e.g. EFAULT is not guaranteed by the kernel currently, i can
successfully do write(open("/dev/null",O_WRONLY), 0, 1), or
get a crash when passing invalid pointer to a vdso function,
so userspace should not rely on some strict EFAULT behaviour).

   * The kernel shall not extend the lifetime of user pointers in ways
 t

Re: [RFC][PATCH 0/3] arm64 relaxed ABI

2019-02-14 Thread Kevin Brodsky

On 13/02/2019 21:41, Evgenii Stepanov wrote:

On Wed, Feb 13, 2019 at 9:43 AM Dave Martin  wrote:

On Wed, Feb 13, 2019 at 04:42:11PM +, Kevin Brodsky wrote:

(+Cc other people with MTE experience: Branislav, Ruben)

[...]


I'm wondering whether we can piggy-back on existing concepts.

We could say that recolouring memory is safe when and only when
unmapping of the page or removing permissions on the page (via
munmap/mremap/mprotect) would be safe.  Otherwise, the resulting
behaviour of the process is undefined.

Is that a sufficient requirement? I don't think that anything prevents you
from using mprotect() on say [vvar], but we don't necessarily want to map
[vvar] as tagged. I'm not sure it's easy to define what "safe" would mean
here.

I think the origin rules have to apply too: [vvar] is not a regular,
private page but a weird, shared thing mapped for you by the kernel.

Presumably userspace _cannot_ do mprotect(PROT_WRITE) on it.

I'm also assuming that userspace cannot recolour memory in read-only
pages.  That sounds bad if there's no way to prevent it.

That sounds like something we would like to do to catch out of bounds
read of .rodata globals.
Another potentially interesting use case for MTE is infinite hardware
watchpoints - that would require trapping reads for individual tagging
granules, include those in read-only binary segment.


I think we should keep this discussion for a later, separate thread. Vincenzo's 
proposal is about allowing userspace to pass tags at the syscall interface. The set 
of mappings allowed to be tagged by userspace (in MTE) should be contained in the set 
of mappings that userspace can pass tagged pointers to (at the syscall interface), 
but they are not necessarily the same. Private read-only mappings are an edge case 
(you can pass tagged pointers to them, the memory may or may not be mapped as tagged, 
but in any case it is not possible to change the memory tags via such mapping).





[...]


It might be reasonable to do the check in access_ok() and skip it in
__put_user() etc.

(I seem to remember some separate discussion about abolishing
__put_user() and friends though, due to the accident risk they pose.)

Keep in mind that with MTE, there is no need to do any explicit check when
accessing user memory via a user-provided pointer. The tagged user pointer
is directly passed to copy_*_user() or put_user(). If the load/store causes
a tag fault, then it is handled just like a page fault (i.e. invoking the
fixup handler). As far as I can tell, there's no need to do anything special
in access_ok() in that case.

[The above applies to precise mode. In imprecise mode, some more work will
be needed after the load/store to check whether a tag fault happened.]

Fair enough, I'm a bit hazy on the details as of right now..

[...]


There are many possible ways to deploy MTE, and debugging is just one of
them. For instance, you may want to turn on heap colouring for some
processes in the system, including in production.

To implement enforceable protection, or as a diagnostic tool for when
something goes wrong?

In the latter case it's still OK for the kernel's tag checking not to be
exhaustive.


Regarding those cases where it is impossible to check tags at the point of
accessing user memory, it is indeed possible to check the memory tags at the
point of stripping the tag from the user pointer. Given that some MTE
use-cases favour performance over tag check coverage, the ideal approach
would be to make these checks configurable (e.g. check one granule, check
all of them, or check none). I don't know how feasible this is in practice.

Check all granules of a massive DMA buffer?

That doesn't sounds feasible without explicit support in the hardware to
have the DMA check tags itself as the memory is accessed.  MTE by itself
doesn't provide for this IIUC (at least, it would require support in the
platform, not just the CPU).

We do not want to bake any assumptions into the ABI about whether a
given data transfer may or may not be offloaded to DMA.  That feels
like a slippery slope.

Providing we get the checks for free in put_user/get_user/
copy_{to,from}_user(), those will cover a lot of cases though, for
non-bulk-IO cases.


My assumption has been that at this point in time we are mainly aiming
to support the debug/diagnostic use cases today.


MTE can be used both for diagnostics (imprecise mode is especially suitable for 
that), and to halt execution when something wrong is detected. Even in the latter 
case, one cannot expect exhaustive checking from MTE, because the way it works is 
fundamentally statistical; an invalid pointer may by chance have the right tag to 
access the given location. So again, I think that a best-effort approach is 
appropriate when the kernel accesses user memory, in terms of checking that tags match.


More specifically, different use-cases come with different tradeoffs (performance / 
tag check coverage). That's why I am suggesting that in 

Re: [RFC][PATCH 0/3] arm64 relaxed ABI

2019-02-13 Thread Evgenii Stepanov
On Wed, Feb 13, 2019 at 9:43 AM Dave Martin  wrote:
>
> On Wed, Feb 13, 2019 at 04:42:11PM +, Kevin Brodsky wrote:
> > (+Cc other people with MTE experience: Branislav, Ruben)
>
> [...]
>
> > >I'm wondering whether we can piggy-back on existing concepts.
> > >
> > >We could say that recolouring memory is safe when and only when
> > >unmapping of the page or removing permissions on the page (via
> > >munmap/mremap/mprotect) would be safe.  Otherwise, the resulting
> > >behaviour of the process is undefined.
> >
> > Is that a sufficient requirement? I don't think that anything prevents you
> > from using mprotect() on say [vvar], but we don't necessarily want to map
> > [vvar] as tagged. I'm not sure it's easy to define what "safe" would mean
> > here.
>
> I think the origin rules have to apply too: [vvar] is not a regular,
> private page but a weird, shared thing mapped for you by the kernel.
>
> Presumably userspace _cannot_ do mprotect(PROT_WRITE) on it.
>
> I'm also assuming that userspace cannot recolour memory in read-only
> pages.  That sounds bad if there's no way to prevent it.

That sounds like something we would like to do to catch out of bounds
read of .rodata globals.
Another potentially interesting use case for MTE is infinite hardware
watchpoints - that would require trapping reads for individual tagging
granules, include those in read-only binary segment.

>
> [...]
>
> > >It might be reasonable to do the check in access_ok() and skip it in
> > >__put_user() etc.
> > >
> > >(I seem to remember some separate discussion about abolishing
> > >__put_user() and friends though, due to the accident risk they pose.)
> >
> > Keep in mind that with MTE, there is no need to do any explicit check when
> > accessing user memory via a user-provided pointer. The tagged user pointer
> > is directly passed to copy_*_user() or put_user(). If the load/store causes
> > a tag fault, then it is handled just like a page fault (i.e. invoking the
> > fixup handler). As far as I can tell, there's no need to do anything special
> > in access_ok() in that case.
> >
> > [The above applies to precise mode. In imprecise mode, some more work will
> > be needed after the load/store to check whether a tag fault happened.]
>
> Fair enough, I'm a bit hazy on the details as of right now..
>
> [...]
>
> > There are many possible ways to deploy MTE, and debugging is just one of
> > them. For instance, you may want to turn on heap colouring for some
> > processes in the system, including in production.
>
> To implement enforceable protection, or as a diagnostic tool for when
> something goes wrong?
>
> In the latter case it's still OK for the kernel's tag checking not to be
> exhaustive.
>
> > Regarding those cases where it is impossible to check tags at the point of
> > accessing user memory, it is indeed possible to check the memory tags at the
> > point of stripping the tag from the user pointer. Given that some MTE
> > use-cases favour performance over tag check coverage, the ideal approach
> > would be to make these checks configurable (e.g. check one granule, check
> > all of them, or check none). I don't know how feasible this is in practice.
>
> Check all granules of a massive DMA buffer?
>
> That doesn't sounds feasible without explicit support in the hardware to
> have the DMA check tags itself as the memory is accessed.  MTE by itself
> doesn't provide for this IIUC (at least, it would require support in the
> platform, not just the CPU).
>
> We do not want to bake any assumptions into the ABI about whether a
> given data transfer may or may not be offloaded to DMA.  That feels
> like a slippery slope.
>
> Providing we get the checks for free in put_user/get_user/
> copy_{to,from}_user(), those will cover a lot of cases though, for
> non-bulk-IO cases.
>
>
> My assumption has been that at this point in time we are mainly aiming
> to support the debug/diagnostic use cases today.
>
> At least, those are the low(ish)-hanging fruit.
>
> Others are better placed than me to comment on the goals here.
>
> Cheers
> ---Dave


Re: [RFC][PATCH 0/3] arm64 relaxed ABI

2019-02-13 Thread Dave Martin
On Wed, Feb 13, 2019 at 04:42:11PM +, Kevin Brodsky wrote:
> (+Cc other people with MTE experience: Branislav, Ruben)

[...]

> >I'm wondering whether we can piggy-back on existing concepts.
> >
> >We could say that recolouring memory is safe when and only when
> >unmapping of the page or removing permissions on the page (via
> >munmap/mremap/mprotect) would be safe.  Otherwise, the resulting
> >behaviour of the process is undefined.
> 
> Is that a sufficient requirement? I don't think that anything prevents you
> from using mprotect() on say [vvar], but we don't necessarily want to map
> [vvar] as tagged. I'm not sure it's easy to define what "safe" would mean
> here.

I think the origin rules have to apply too: [vvar] is not a regular,
private page but a weird, shared thing mapped for you by the kernel.

Presumably userspace _cannot_ do mprotect(PROT_WRITE) on it.

I'm also assuming that userspace cannot recolour memory in read-only
pages.  That sounds bad if there's no way to prevent it.

[...]

> >It might be reasonable to do the check in access_ok() and skip it in
> >__put_user() etc.
> >
> >(I seem to remember some separate discussion about abolishing
> >__put_user() and friends though, due to the accident risk they pose.)
> 
> Keep in mind that with MTE, there is no need to do any explicit check when
> accessing user memory via a user-provided pointer. The tagged user pointer
> is directly passed to copy_*_user() or put_user(). If the load/store causes
> a tag fault, then it is handled just like a page fault (i.e. invoking the
> fixup handler). As far as I can tell, there's no need to do anything special
> in access_ok() in that case.
> 
> [The above applies to precise mode. In imprecise mode, some more work will
> be needed after the load/store to check whether a tag fault happened.]

Fair enough, I'm a bit hazy on the details as of right now..

[...]

> There are many possible ways to deploy MTE, and debugging is just one of
> them. For instance, you may want to turn on heap colouring for some
> processes in the system, including in production.

To implement enforceable protection, or as a diagnostic tool for when
something goes wrong?

In the latter case it's still OK for the kernel's tag checking not to be
exhaustive.

> Regarding those cases where it is impossible to check tags at the point of
> accessing user memory, it is indeed possible to check the memory tags at the
> point of stripping the tag from the user pointer. Given that some MTE
> use-cases favour performance over tag check coverage, the ideal approach
> would be to make these checks configurable (e.g. check one granule, check
> all of them, or check none). I don't know how feasible this is in practice.

Check all granules of a massive DMA buffer?

That doesn't sounds feasible without explicit support in the hardware to
have the DMA check tags itself as the memory is accessed.  MTE by itself
doesn't provide for this IIUC (at least, it would require support in the
platform, not just the CPU).

We do not want to bake any assumptions into the ABI about whether a
given data transfer may or may not be offloaded to DMA.  That feels
like a slippery slope.

Providing we get the checks for free in put_user/get_user/
copy_{to,from}_user(), those will cover a lot of cases though, for
non-bulk-IO cases.


My assumption has been that at this point in time we are mainly aiming
to support the debug/diagnostic use cases today.

At least, those are the low(ish)-hanging fruit.

Others are better placed than me to comment on the goals here.

Cheers
---Dave


Re: [RFC][PATCH 0/3] arm64 relaxed ABI

2019-02-13 Thread Kevin Brodsky

(+Cc other people with MTE experience: Branislav, Ruben)

On 13/02/2019 14:58, Dave Martin wrote:

On Tue, Feb 12, 2019 at 06:02:24PM +, Catalin Marinas wrote:

On Mon, Feb 11, 2019 at 12:32:55PM -0800, Evgenii Stepanov wrote:

On Mon, Feb 11, 2019 at 9:28 AM Kevin Brodsky  wrote:

On 19/12/2018 12:52, Dave Martin wrote:

[...]


   * A single C object should be accessed using a single, fixed
 pointer tag throughout its entire lifetime.

Agreed.  Allocators themselves may need to be excluded though,
depending on how they represent their managed memory.


   * Tags can be changed only when there are no outstanding pointers to
 the affected object or region that may be used to access the object
 or region (i.e., if the object were allocated from the C heap and
 is it safe to realloc() it, then it is safe to change the tag; for
 other types of allocation, analogous arguments can be applied).

Tags can only be changed at the point of deallocation/
reallocation.  Pointers to the object become invalid and cannot
be used after it has been deallocated; memory tagging allows to
catch such invalid usage.

All the above sound well but that's mostly a guideline on what a C
library can do. It doesn't help much with defining the kernel ABI.
Anyway, it's good to clarify the use-cases.

My aim was to clarify the use case in userspace, since I wasn't directly
involved in that.  The kernel ABI needs to be compatible with the the
use case, but doesn't need to specify must of it.

I'm wondering whether we can piggy-back on existing concepts.

We could say that recolouring memory is safe when and only when
unmapping of the page or removing permissions on the page (via
munmap/mremap/mprotect) would be safe.  Otherwise, the resulting
behaviour of the process is undefined.


Is that a sufficient requirement? I don't think that anything prevents you from using 
mprotect() on say [vvar], but we don't necessarily want to map [vvar] as tagged. I'm 
not sure it's easy to define what "safe" would mean here.



Hopefully there are friendly fuzzers testing this kind of thing.

[...]


It would also be valuable to narrow down the set of "relaxed" (i.e.
not tag-checking) syscalls as reasonably possible. We would want to
provide tag-checking userspace wrappers for any important calls that
are not checked in the kernel. Is it correct to assume that anything
that goes through copy_from_user  / copy_to_user is checked?

I lost track of the context of this thread but if it's just about
relaxing the ABI for hwasan, the kernel has no idea of the compiler
generated structures in user space, so nothing is checked.

If we talk about tags in the context of MTE, than yes, with the current
proposal the tag would be checked by copy_*_user() functions.

Also put_user() and friends?

It might be reasonable to do the check in access_ok() and skip it in
__put_user() etc.

(I seem to remember some separate discussion about abolishing
__put_user() and friends though, due to the accident risk they pose.)


Keep in mind that with MTE, there is no need to do any explicit check when accessing 
user memory via a user-provided pointer. The tagged user pointer is directly passed 
to copy_*_user() or put_user(). If the load/store causes a tag fault, then it is 
handled just like a page fault (i.e. invoking the fixup handler). As far as I can 
tell, there's no need to do anything special in access_ok() in that case.


[The above applies to precise mode. In imprecise mode, some more work will be needed 
after the load/store to check whether a tag fault happened.]





For aio* operations it would be nice if the tag was checked at the
time of the actual userspace read/write, either instead of or in
addition to at the time of the system call.

With aio* (and synchronous iovec-based syscalls), the kernel may access
the memory while the corresponding user process is scheduled out. Given
that such access is not done in the context of the user process (and
using the user VA like copy_*_user), the kernel cannot handle potential
tag faults. Moreover, the transfer may be done by DMA and the device
does not understand tags.

I'd like to keep tags as a property of the pointer in a specific virtual
address space. The moment you convert it to a different address space
(e.g. kernel linear map, physical address), the tag property is stripped
and I don't think we should re-build it (and have it checked).

This is probably reasonable.

Ideally we would check the tag at the point of stripping it off, but
most likely it's going to be rather best-effort.

If memory tagging is essential a debugging feature then this seems
an acceptable compromise.


There are many possible ways to deploy MTE, and debugging is just one of them. For 
instance, you may want to turn on heap colouring for some processes in the system, 
including in production.


Regarding those cases where it is impossible to check tags at the point of accessing 
user memory, it is indeed possible to ch

Re: [RFC][PATCH 0/3] arm64 relaxed ABI

2019-02-13 Thread Dave Martin
On Tue, Feb 12, 2019 at 06:02:24PM +, Catalin Marinas wrote:
> On Mon, Feb 11, 2019 at 12:32:55PM -0800, Evgenii Stepanov wrote:
> > On Mon, Feb 11, 2019 at 9:28 AM Kevin Brodsky  wrote:
> > > On 19/12/2018 12:52, Dave Martin wrote:

[...]

> > > >   * A single C object should be accessed using a single, fixed
> > > > pointer tag throughout its entire lifetime.
> > >
> > > Agreed.  Allocators themselves may need to be excluded though,
> > > depending on how they represent their managed memory.
> > >
> > > >   * Tags can be changed only when there are no outstanding pointers to
> > > > the affected object or region that may be used to access the object
> > > > or region (i.e., if the object were allocated from the C heap and
> > > > is it safe to realloc() it, then it is safe to change the tag; for
> > > > other types of allocation, analogous arguments can be applied).
> > >
> > > Tags can only be changed at the point of deallocation/
> > > reallocation.  Pointers to the object become invalid and cannot
> > > be used after it has been deallocated; memory tagging allows to
> > > catch such invalid usage.
>
> All the above sound well but that's mostly a guideline on what a C
> library can do. It doesn't help much with defining the kernel ABI.
> Anyway, it's good to clarify the use-cases.

My aim was to clarify the use case in userspace, since I wasn't directly
involved in that.  The kernel ABI needs to be compatible with the the
use case, but doesn't need to specify must of it.

I'm wondering whether we can piggy-back on existing concepts.

We could say that recolouring memory is safe when and only when
unmapping of the page or removing permissions on the page (via
munmap/mremap/mprotect) would be safe.  Otherwise, the resulting
behaviour of the process is undefined.

Hopefully there are friendly fuzzers testing this kind of thing.

[...]

> > It would also be valuable to narrow down the set of "relaxed" (i.e.
> > not tag-checking) syscalls as reasonably possible. We would want to
> > provide tag-checking userspace wrappers for any important calls that
> > are not checked in the kernel. Is it correct to assume that anything
> > that goes through copy_from_user  / copy_to_user is checked?
> 
> I lost track of the context of this thread but if it's just about
> relaxing the ABI for hwasan, the kernel has no idea of the compiler
> generated structures in user space, so nothing is checked.
> 
> If we talk about tags in the context of MTE, than yes, with the current
> proposal the tag would be checked by copy_*_user() functions.

Also put_user() and friends?

It might be reasonable to do the check in access_ok() and skip it in
__put_user() etc.

(I seem to remember some separate discussion about abolishing
__put_user() and friends though, due to the accident risk they pose.)

> > For aio* operations it would be nice if the tag was checked at the
> > time of the actual userspace read/write, either instead of or in
> > addition to at the time of the system call.
> 
> With aio* (and synchronous iovec-based syscalls), the kernel may access
> the memory while the corresponding user process is scheduled out. Given
> that such access is not done in the context of the user process (and
> using the user VA like copy_*_user), the kernel cannot handle potential
> tag faults. Moreover, the transfer may be done by DMA and the device
> does not understand tags.
> 
> I'd like to keep tags as a property of the pointer in a specific virtual
> address space. The moment you convert it to a different address space
> (e.g. kernel linear map, physical address), the tag property is stripped
> and I don't think we should re-build it (and have it checked).

This is probably reasonable.

Ideally we would check the tag at the point of stripping it off, but
most likely it's going to be rather best-effort.

If memory tagging is essential a debugging feature then this seems
an acceptable compromise.

> > > >   * For purposes other than dereference, the kernel shall accept any
> > > > legitimately tagged pointer (according to the above rules) as
> > > > identifying the associated memory location.
> > > >
> > > > So, mprotect(some_page_aligned_object, ...); is valid irrespective
> > > > of where page_aligned_object() came from.  There is no implicit
> > > > derefence by the kernel here, hence no tag check.
> > > >
> > > > The kernel does not guarantee to work correctly if the wrong tag
> > > > is used, but there is not always a well-defined "right" tag, so
> > > > we can't really guarantee to check it.  So a pointer derived by
> > > > any reasonable means by userspace has to be treated as equally
> > > > valid.
> > >
> > > This is a disputed point :) In my opinion, this is the the most
> > > reasonable approach.
> > 
> > Yes, it would be nice if the kernel explicitly promised, ex.
> > mprotect() over a range of differently tagged pages to be allowed
> > (i.e. address tag should be u

Re: [RFC][PATCH 0/3] arm64 relaxed ABI

2019-02-12 Thread Catalin Marinas
On Mon, Feb 11, 2019 at 12:32:55PM -0800, Evgenii Stepanov wrote:
> On Mon, Feb 11, 2019 at 9:28 AM Kevin Brodsky  wrote:
> > On 19/12/2018 12:52, Dave Martin wrote:
> > > Really, the kernel should do the expected thing with all "non-weird"
> > > memory.
> > >
> > > In lieu of a proper definition of "non-weird", I think we should have
> > > some lists of things that are explicitly included, and also excluded:
> > >
> > > OK:
> > >   kernel-allocated process stack
> > >   brk area
> > >   MAP_ANONYMOUS | MAP_PRIVATE
> > >   MAP_PRIVATE mappings of /dev/zero
> > >
> > > Not OK:
> > >   MAP_SHARED
> > >   mmaps of non-memory-like devices
> > >   mmaps of anything that is not a regular file
> > >   the VDSO
> > >   ...
> > >
> > > In general, userspace can tag memory that it "owns", and we do not assume
> > > a transfer of ownership except in the "OK" list above.  Otherwise, it's
> > > the kernel's memory, or the owner is simply not well defined.
> >
> > Agreed on the general idea: a process should be able to pass tagged 
> > pointers at the
> > syscall interface, as long as they point to memory privately owned by the 
> > process. I
> > think it would be possible to simplify the definition of "non-weird" memory 
> > by using
> > only this "OK" list:
> > - mmap() done by the process itself, where either:
> >* flags = MAP_PRIVATE | MAP_ANONYMOUS
> >* flags = MAP_PRIVATE and fd refers to a regular file or a well-defined 
> > list of
> > device files (like /dev/zero)
> > - brk() done by the process itself
> > - Any memory mapped by the kernel in the new process's address space during 
> > execve(),
> > with the same restrictions as above ([vdso]/[vvar] are therefore excluded)

Sounds reasonable.

> > >   * Userspace should set tags at the point of allocation only.
> >
> > Yes, tags are only supposed to be set at the point of either allocation or
> > deallocation/reallocation. However, allocators can in principle be nested, 
> > so an
> > allocator could  take a region allocated by malloc() as input and subdivide 
> > it
> > (changing tags in the process). That said, this suballocator must not 
> > free() that
> > region until all the suballocations themselves have been freed (thereby 
> > restoring the
> > tags initially set by malloc()).
> >
> > >   * If you don't know how an object was allocated, you cannot modify the
> > > tag, period.
> >
> > Agreed, allocators that tag memory can only operate on memory with a 
> > well-defined
> > provenance (for instance anonymous mmap() or malloc()).
> >
> > >   * A single C object should be accessed using a single, fixed pointer tag
> > > throughout its entire lifetime.
> >
> > Agreed. Allocators themselves may need to be excluded though, depending on 
> > how they
> > represent their managed memory.
> >
> > >   * Tags can be changed only when there are no outstanding pointers to
> > > the affected object or region that may be used to access the object
> > > or region (i.e., if the object were allocated from the C heap and
> > > is it safe to realloc() it, then it is safe to change the tag; for
> > > other types of allocation, analogous arguments can be applied).
> >
> > Tags can only be changed at the point of deallocation/reallocation. 
> > Pointers to the
> > object become invalid and cannot be used after it has been deallocated; 
> > memory
> > tagging allows to catch such invalid usage.

All the above sound well but that's mostly a guideline on what a C
library can do. It doesn't help much with defining the kernel ABI.
Anyway, it's good to clarify the use-cases.

> > >   * When the kernel dereferences a pointer on userspace's behalf, it
> > > shall behave equivalently to userspace dereferencing the same pointer,
> > > including use of the same tag (where passed by userspace).
> > >
> > >   * Where the pointer tag affects pointer dereference behaviour (i.e.,
> > > with hardware memory colouring) the kernel makes no guarantee to
> > > honour pointer tags correctly for every location a buffer based on a
> > > pointer passed by userspace to the kernel.
> > >
> > > (This means for example that for a read(fd, buf, size), we can check
> > > the tag for a single arbitrary location in *(char (*)[size])buf
> > > before passing the buffer to get_user_pages().  Hopefully this could
> > > be done in get_user_pages() itself rather than hunting call sites.
> > > For userspace, it means that you're on your own if you ask the
> > > kernel to operate on a buffer than spans multiple, independently-
> > > allocated objects, or a deliberately striped single object.)
> >
> > I think both points are reasonable. It is very valuable for the kernel to 
> > access
> > userspace memory using the user-provided tag, because it enables kernel 
> > accesses to
> > be checked in the same way as user accesses, allowing to detect bugs that 
> > are
> > potentially hard to find. For insta

Re: [RFC][PATCH 0/3] arm64 relaxed ABI

2019-02-11 Thread Evgenii Stepanov
On Mon, Feb 11, 2019 at 9:28 AM Kevin Brodsky  wrote:
>
> On 19/12/2018 12:52, Dave Martin wrote:
> > On Tue, Dec 18, 2018 at 05:59:38PM +, Catalin Marinas wrote:
> >> On Tue, Dec 18, 2018 at 04:03:38PM +0100, Andrey Konovalov wrote:
> >>> On Wed, Dec 12, 2018 at 4:02 PM Catalin Marinas  
> >>> wrote:
>  The summary of our internal discussions (mostly between kernel
>  developers) is that we can't properly describe a user ABI that covers
>  future syscalls or syscall extensions while not all syscalls accept
>  tagged pointers. So we tweaked the requirements slightly to only allow
>  tagged pointers back into the kernel *if* the originating address is
>  from an anonymous mmap() or below sbrk(0). This should cover some of the
>  ioctls or getsockopt(TCP_ZEROCOPY_RECEIVE) where the user passes a
>  pointer to a buffer obtained via mmap() on the device operations.
> 
>  (sorry for not being clear on what Vincenzo's proposal implies)
> >>> OK, I see. So I need to make the following changes to my patchset AFAIU.
> >>>
> >>> 1. Make sure that we only allow tagged user addresses that originate
> >>> from an anonymous mmap() or below sbrk(0). How exactly should this
> >>> check be performed?
> >> I don't think we should perform such checks. That's rather stating that
> >> the kernel only guarantees that the tagged pointers work if they
> >> originated from these memory ranges.
> > I concur.
> >
> > Really, the kernel should do the expected thing with all "non-weird"
> > memory.
> >
> > In lieu of a proper definition of "non-weird", I think we should have
> > some lists of things that are explicitly included, and also excluded:
> >
> > OK:
> >   kernel-allocated process stack
> >   brk area
> >   MAP_ANONYMOUS | MAP_PRIVATE
> >   MAP_PRIVATE mappings of /dev/zero
> >
> > Not OK:
> >   MAP_SHARED
> >   mmaps of non-memory-like devices
> >   mmaps of anything that is not a regular file
> >   the VDSO
> >   ...
> >
> > In general, userspace can tag memory that it "owns", and we do not assume
> > a transfer of ownership except in the "OK" list above.  Otherwise, it's
> > the kernel's memory, or the owner is simply not well defined.
>
> Agreed on the general idea: a process should be able to pass tagged pointers 
> at the
> syscall interface, as long as they point to memory privately owned by the 
> process. I
> think it would be possible to simplify the definition of "non-weird" memory 
> by using
> only this "OK" list:
> - mmap() done by the process itself, where either:
>* flags = MAP_PRIVATE | MAP_ANONYMOUS
>* flags = MAP_PRIVATE and fd refers to a regular file or a well-defined 
> list of
> device files (like /dev/zero)
> - brk() done by the process itself
> - Any memory mapped by the kernel in the new process's address space during 
> execve(),
> with the same restrictions as above ([vdso]/[vvar] are therefore excluded)
>
> > I would also like to see advice for userspace developers, particularly
> > things like (strawman, please challenge!):
>
> To some extent, one could call me a userspace developer, so I'll try to help 
> :)
>
> >   * Userspace should set tags at the point of allocation only.
>
> Yes, tags are only supposed to be set at the point of either allocation or
> deallocation/reallocation. However, allocators can in principle be nested, so 
> an
> allocator could  take a region allocated by malloc() as input and subdivide it
> (changing tags in the process). That said, this suballocator must not free() 
> that
> region until all the suballocations themselves have been freed (thereby 
> restoring the
> tags initially set by malloc()).
>
> >   * If you don't know how an object was allocated, you cannot modify the
> > tag, period.
>
> Agreed, allocators that tag memory can only operate on memory with a 
> well-defined
> provenance (for instance anonymous mmap() or malloc()).
>
> >   * A single C object should be accessed using a single, fixed pointer tag
> > throughout its entire lifetime.
>
> Agreed. Allocators themselves may need to be excluded though, depending on 
> how they
> represent their managed memory.
>
> >   * Tags can be changed only when there are no outstanding pointers to
> > the affected object or region that may be used to access the object
> > or region (i.e., if the object were allocated from the C heap and
> > is it safe to realloc() it, then it is safe to change the tag; for
> > other types of allocation, analogous arguments can be applied).
>
> Tags can only be changed at the point of deallocation/reallocation. Pointers 
> to the
> object become invalid and cannot be used after it has been deallocated; memory
> tagging allows to catch such invalid usage.
>
> >   * When the kernel dereferences a pointer on userspace's behalf, it
> > shall behave equivalently to userspace dereferencing the same pointer,
> > including use of the same tag (where passed by userspace).
> 

Re: [RFC][PATCH 0/3] arm64 relaxed ABI

2019-02-11 Thread Kevin Brodsky

On 19/12/2018 12:52, Dave Martin wrote:

On Tue, Dec 18, 2018 at 05:59:38PM +, Catalin Marinas wrote:

On Tue, Dec 18, 2018 at 04:03:38PM +0100, Andrey Konovalov wrote:

On Wed, Dec 12, 2018 at 4:02 PM Catalin Marinas  wrote:

The summary of our internal discussions (mostly between kernel
developers) is that we can't properly describe a user ABI that covers
future syscalls or syscall extensions while not all syscalls accept
tagged pointers. So we tweaked the requirements slightly to only allow
tagged pointers back into the kernel *if* the originating address is
from an anonymous mmap() or below sbrk(0). This should cover some of the
ioctls or getsockopt(TCP_ZEROCOPY_RECEIVE) where the user passes a
pointer to a buffer obtained via mmap() on the device operations.

(sorry for not being clear on what Vincenzo's proposal implies)

OK, I see. So I need to make the following changes to my patchset AFAIU.

1. Make sure that we only allow tagged user addresses that originate
from an anonymous mmap() or below sbrk(0). How exactly should this
check be performed?

I don't think we should perform such checks. That's rather stating that
the kernel only guarantees that the tagged pointers work if they
originated from these memory ranges.

I concur.

Really, the kernel should do the expected thing with all "non-weird"
memory.

In lieu of a proper definition of "non-weird", I think we should have
some lists of things that are explicitly included, and also excluded:

OK:
kernel-allocated process stack
brk area
MAP_ANONYMOUS | MAP_PRIVATE
MAP_PRIVATE mappings of /dev/zero

Not OK:
MAP_SHARED
mmaps of non-memory-like devices
mmaps of anything that is not a regular file
the VDSO
...

In general, userspace can tag memory that it "owns", and we do not assume
a transfer of ownership except in the "OK" list above.  Otherwise, it's
the kernel's memory, or the owner is simply not well defined.


Agreed on the general idea: a process should be able to pass tagged pointers at the 
syscall interface, as long as they point to memory privately owned by the process. I 
think it would be possible to simplify the definition of "non-weird" memory by using 
only this "OK" list:

- mmap() done by the process itself, where either:
  * flags = MAP_PRIVATE | MAP_ANONYMOUS
  * flags = MAP_PRIVATE and fd refers to a regular file or a well-defined list of 
device files (like /dev/zero)

- brk() done by the process itself
- Any memory mapped by the kernel in the new process's address space during execve(), 
with the same restrictions as above ([vdso]/[vvar] are therefore excluded)



I would also like to see advice for userspace developers, particularly
things like (strawman, please challenge!):


To some extent, one could call me a userspace developer, so I'll try to help :)


  * Userspace should set tags at the point of allocation only.


Yes, tags are only supposed to be set at the point of either allocation or 
deallocation/reallocation. However, allocators can in principle be nested, so an 
allocator could  take a region allocated by malloc() as input and subdivide it 
(changing tags in the process). That said, this suballocator must not free() that 
region until all the suballocations themselves have been freed (thereby restoring the 
tags initially set by malloc()).



  * If you don't know how an object was allocated, you cannot modify the
tag, period.


Agreed, allocators that tag memory can only operate on memory with a well-defined 
provenance (for instance anonymous mmap() or malloc()).



  * A single C object should be accessed using a single, fixed pointer tag
throughout its entire lifetime.


Agreed. Allocators themselves may need to be excluded though, depending on how they 
represent their managed memory.



  * Tags can be changed only when there are no outstanding pointers to
the affected object or region that may be used to access the object
or region (i.e., if the object were allocated from the C heap and
is it safe to realloc() it, then it is safe to change the tag; for
other types of allocation, analogous arguments can be applied).


Tags can only be changed at the point of deallocation/reallocation. Pointers to the 
object become invalid and cannot be used after it has been deallocated; memory 
tagging allows to catch such invalid usage.



  * When the kernel dereferences a pointer on userspace's behalf, it
shall behave equivalently to userspace dereferencing the same pointer,
including use of the same tag (where passed by userspace).

  * Where the pointer tag affects pointer dereference behaviour (i.e.,
with hardware memory colouring) the kernel makes no guarantee to
honour pointer tags correctly for every location a buffer based on a
pointer passed by userspace to the kernel.

(This means for example that for a read(fd, buf, size), we can check
the tag for a single arbitrary location in *(cha

Re: [RFC][PATCH 0/3] arm64 relaxed ABI

2018-12-19 Thread Dave Martin
On Tue, Dec 18, 2018 at 05:59:38PM +, Catalin Marinas wrote:
> On Tue, Dec 18, 2018 at 04:03:38PM +0100, Andrey Konovalov wrote:
> > On Wed, Dec 12, 2018 at 4:02 PM Catalin Marinas  
> > wrote:
> > > The summary of our internal discussions (mostly between kernel
> > > developers) is that we can't properly describe a user ABI that covers
> > > future syscalls or syscall extensions while not all syscalls accept
> > > tagged pointers. So we tweaked the requirements slightly to only allow
> > > tagged pointers back into the kernel *if* the originating address is
> > > from an anonymous mmap() or below sbrk(0). This should cover some of the
> > > ioctls or getsockopt(TCP_ZEROCOPY_RECEIVE) where the user passes a
> > > pointer to a buffer obtained via mmap() on the device operations.
> > >
> > > (sorry for not being clear on what Vincenzo's proposal implies)
> > 
> > OK, I see. So I need to make the following changes to my patchset AFAIU.
> > 
> > 1. Make sure that we only allow tagged user addresses that originate
> > from an anonymous mmap() or below sbrk(0). How exactly should this
> > check be performed?
> 
> I don't think we should perform such checks. That's rather stating that
> the kernel only guarantees that the tagged pointers work if they
> originated from these memory ranges.

I concur.

Really, the kernel should do the expected thing with all "non-weird"
memory.

In lieu of a proper definition of "non-weird", I think we should have
some lists of things that are explicitly included, and also excluded:

OK:
kernel-allocated process stack
brk area
MAP_ANONYMOUS | MAP_PRIVATE
MAP_PRIVATE mappings of /dev/zero

Not OK:
MAP_SHARED
mmaps of non-memory-like devices
mmaps of anything that is not a regular file
the VDSO
...

In general, userspace can tag memory that it "owns", and we do not assume
a transfer of ownership except in the "OK" list above.  Otherwise, it's
the kernel's memory, or the owner is simply not well defined.


I would also like to see advice for userspace developers, particularly
things like (strawman, please challenge!):

 * Userspace should set tags at the point of allocation only.

 * If you don't know how an object was allocated, you cannot modify the
   tag, period.

 * A single C object should be accessed using a single, fixed pointer tag
   throughout its entire lifetime.

 * Tags can be changed only when there are no outstanding pointers to
   the affected object or region that may be used to access the object
   or region (i.e., if the object were allocated from the C heap and
   is it safe to realloc() it, then it is safe to change the tag; for
   other types of allocation, analogous arguments can be applied).

 * When the kernel dereferences a pointer on userspace's behalf, it
   shall behave equivalently to userspace dereferencing the same pointer,
   including use of the same tag (where passed by userspace).

 * Where the pointer tag affects pointer dereference behaviour (i.e.,
   with hardware memory colouring) the kernel makes no guarantee to
   honour pointer tags correctly for every location a buffer based on a
   pointer passed by userspace to the kernel.

   (This means for example that for a read(fd, buf, size), we can check
   the tag for a single arbitrary location in *(char (*)[size])buf
   before passing the buffer to get_user_pages().  Hopefully this could
   be done in get_user_pages() itself rather than hunting call sites.
   For userspace, it means that you're on your own if you ask the
   kernel to operate on a buffer than spans multiple, independently-
   allocated objects, or a deliberately striped single object.)

 * The kernel shall not extend the lifetime of user pointers in ways
   that are not clear from the specification of the syscall or
   interface to which the pointer is passed (and in any case shall not
   extend pointer lifetimes without good reason).

   So no clever transparent caching between syscalls, unless it _really_
   is transparent in the presence of tags.

 * For purposes other than dereference, the kernel shall accept any
   legitimately tagged pointer (according to the above rules) as
   identifying the associated memory location.

   So, mprotect(some_page_aligned_object, ...); is valid irrespective
   of where page_aligned_object() came from.  There is no implicit
   derefence by the kernel here, hence no tag check.

   The kernel does not guarantee to work correctly if the wrong tag
   is used, but there is not always a well-defined "right" tag, so
   we can't really guarantee to check it.  So a pointer derived by
   any reasonable means by userspace has to be treated as equally
   valid.
  

We would need to get some cross-arch buy-in for this, otherwise core
maintainers might just refuse to maintain the necessary guarantees.


> > 2. Allow tagged addressed passed to memory syscalls (as long as (1) is
> > satisfied). Do I understand correctly that this me

Re: [RFC][PATCH 0/3] arm64 relaxed ABI

2018-12-18 Thread Catalin Marinas
On Tue, Dec 18, 2018 at 04:03:38PM +0100, Andrey Konovalov wrote:
> On Wed, Dec 12, 2018 at 4:02 PM Catalin Marinas  
> wrote:
> > The summary of our internal discussions (mostly between kernel
> > developers) is that we can't properly describe a user ABI that covers
> > future syscalls or syscall extensions while not all syscalls accept
> > tagged pointers. So we tweaked the requirements slightly to only allow
> > tagged pointers back into the kernel *if* the originating address is
> > from an anonymous mmap() or below sbrk(0). This should cover some of the
> > ioctls or getsockopt(TCP_ZEROCOPY_RECEIVE) where the user passes a
> > pointer to a buffer obtained via mmap() on the device operations.
> >
> > (sorry for not being clear on what Vincenzo's proposal implies)
> 
> OK, I see. So I need to make the following changes to my patchset AFAIU.
> 
> 1. Make sure that we only allow tagged user addresses that originate
> from an anonymous mmap() or below sbrk(0). How exactly should this
> check be performed?

I don't think we should perform such checks. That's rather stating that
the kernel only guarantees that the tagged pointers work if they
originated from these memory ranges.

> 2. Allow tagged addressed passed to memory syscalls (as long as (1) is
> satisfied). Do I understand correctly that this means that I need to
> locate all find_vma() callers outside of mm/ and fix them up as well?

Yes (unless anyone as a better idea or objections to this approach).

BTW, I'll be off until the new year, so won't be able to follow up.

-- 
Catalin


Re: [RFC][PATCH 0/3] arm64 relaxed ABI

2018-12-18 Thread Andrey Konovalov
On Wed, Dec 12, 2018 at 4:02 PM Catalin Marinas  wrote:
>
> Hi Andrey,
>
> On Wed, Dec 12, 2018 at 03:23:25PM +0100, Andrey Konovalov wrote:
> > On Mon, Dec 10, 2018 at 3:31 PM Vincenzo Frascino
> >  wrote:
> > > On arm64 the TCR_EL1.TBI0 bit has been set since Linux 3.x hence
> > > the userspace (EL0) is allowed to set a non-zero value in the top
> > > byte but the resulting pointers are not allowed at the user-kernel
> > > syscall ABI boundary.
> > >
> > > This patchset proposes a relaxation of the ABI and a mechanism to
> > > advertise it to the userspace via an AT_FLAGS.
> > >
> > > The rationale behind the choice of AT_FLAGS is that the Unix System V
> > > ABI defines AT_FLAGS as "flags", leaving some degree of freedom in
> > > interpretation.
> > > There are two previous attempts of using AT_FLAGS in the Linux Kernel
> > > for different reasons: the first was more generic and was used to expose
> > > the support for the GNU STACK NX feature [1] and the second was done for
> > > the MIPS architecture and was used to expose the support of "MIPS ABI
> > > Extension for IEEE Std 754 Non-Compliant Interlinking" [2].
> > > Both the changes are currently _not_ merged in mainline.
> > > The only architecture that reserves some of the bits in AT_FLAGS is
> > > currently MIPS, which introduced the concept of platform specific ABI
> > > (psABI) reserving the top-byte [3].
> > >
> > > When ARM64_AT_FLAGS_SYSCALL_TBI is set the kernel is advertising
> > > to the userspace that a relaxed ABI is supported hence this type
> > > of pointers are now allowed to be passed to the syscalls when they are
> > > in memory ranges obtained by anonymous mmap() or brk().
> > >
> > > The userspace _must_ verify that the flag is set before passing tagged
> > > pointers to the syscalls allowed by this relaxation.
> > >
> > > More in general, exposing the ARM64_AT_FLAGS_SYSCALL_TBI flag and 
> > > mandating
> > > to the software to check that the feature is present, before using the
> > > associated functionality, it provides a degree of control on the decision
> > > of disabling such a feature in future without consequently breaking the
> > > userspace.
> [...]
> > Acked-by: Andrey Konovalov 
>
> Thanks for the ack. However, if we go ahead with this ABI proposal it
> means that your patches need to be reworked to allow a non-zero top byte
> in all syscalls, including mmap() and friends, ioctl(). There are ABI
> concerns in either case but I'd rather have this discussion in the open.
> It doesn't necessarily mean that I endorse this proposal, I would like
> feedback and not just from kernel developers but user space ones.
>
> The summary of our internal discussions (mostly between kernel
> developers) is that we can't properly describe a user ABI that covers
> future syscalls or syscall extensions while not all syscalls accept
> tagged pointers. So we tweaked the requirements slightly to only allow
> tagged pointers back into the kernel *if* the originating address is
> from an anonymous mmap() or below sbrk(0). This should cover some of the
> ioctls or getsockopt(TCP_ZEROCOPY_RECEIVE) where the user passes a
> pointer to a buffer obtained via mmap() on the device operations.
>
> (sorry for not being clear on what Vincenzo's proposal implies)

OK, I see. So I need to make the following changes to my patchset AFAIU.

1. Make sure that we only allow tagged user addresses that originate
from an anonymous mmap() or below sbrk(0). How exactly should this
check be performed?

2. Allow tagged addressed passed to memory syscalls (as long as (1) is
satisfied). Do I understand correctly that this means that I need to
locate all find_vma() callers outside of mm/ and fix them up as well?


Re: [RFC][PATCH 0/3] arm64 relaxed ABI

2018-12-12 Thread Catalin Marinas
Hi Andrey,

On Wed, Dec 12, 2018 at 03:23:25PM +0100, Andrey Konovalov wrote:
> On Mon, Dec 10, 2018 at 3:31 PM Vincenzo Frascino
>  wrote:
> > On arm64 the TCR_EL1.TBI0 bit has been set since Linux 3.x hence
> > the userspace (EL0) is allowed to set a non-zero value in the top
> > byte but the resulting pointers are not allowed at the user-kernel
> > syscall ABI boundary.
> >
> > This patchset proposes a relaxation of the ABI and a mechanism to
> > advertise it to the userspace via an AT_FLAGS.
> >
> > The rationale behind the choice of AT_FLAGS is that the Unix System V
> > ABI defines AT_FLAGS as "flags", leaving some degree of freedom in
> > interpretation.
> > There are two previous attempts of using AT_FLAGS in the Linux Kernel
> > for different reasons: the first was more generic and was used to expose
> > the support for the GNU STACK NX feature [1] and the second was done for
> > the MIPS architecture and was used to expose the support of "MIPS ABI
> > Extension for IEEE Std 754 Non-Compliant Interlinking" [2].
> > Both the changes are currently _not_ merged in mainline.
> > The only architecture that reserves some of the bits in AT_FLAGS is
> > currently MIPS, which introduced the concept of platform specific ABI
> > (psABI) reserving the top-byte [3].
> >
> > When ARM64_AT_FLAGS_SYSCALL_TBI is set the kernel is advertising
> > to the userspace that a relaxed ABI is supported hence this type
> > of pointers are now allowed to be passed to the syscalls when they are
> > in memory ranges obtained by anonymous mmap() or brk().
> >
> > The userspace _must_ verify that the flag is set before passing tagged
> > pointers to the syscalls allowed by this relaxation.
> >
> > More in general, exposing the ARM64_AT_FLAGS_SYSCALL_TBI flag and mandating
> > to the software to check that the feature is present, before using the
> > associated functionality, it provides a degree of control on the decision
> > of disabling such a feature in future without consequently breaking the
> > userspace.
[...]
> Acked-by: Andrey Konovalov 

Thanks for the ack. However, if we go ahead with this ABI proposal it
means that your patches need to be reworked to allow a non-zero top byte
in all syscalls, including mmap() and friends, ioctl(). There are ABI
concerns in either case but I'd rather have this discussion in the open.
It doesn't necessarily mean that I endorse this proposal, I would like
feedback and not just from kernel developers but user space ones.

The summary of our internal discussions (mostly between kernel
developers) is that we can't properly describe a user ABI that covers
future syscalls or syscall extensions while not all syscalls accept
tagged pointers. So we tweaked the requirements slightly to only allow
tagged pointers back into the kernel *if* the originating address is
from an anonymous mmap() or below sbrk(0). This should cover some of the
ioctls or getsockopt(TCP_ZEROCOPY_RECEIVE) where the user passes a
pointer to a buffer obtained via mmap() on the device operations.

(sorry for not being clear on what Vincenzo's proposal implies)

-- 
Catalin


Re: [RFC][PATCH 0/3] arm64 relaxed ABI

2018-12-12 Thread Andrey Konovalov
On Mon, Dec 10, 2018 at 3:31 PM Vincenzo Frascino
 wrote:
>
> On arm64 the TCR_EL1.TBI0 bit has been set since Linux 3.x hence
> the userspace (EL0) is allowed to set a non-zero value in the top
> byte but the resulting pointers are not allowed at the user-kernel
> syscall ABI boundary.
>
> This patchset proposes a relaxation of the ABI and a mechanism to
> advertise it to the userspace via an AT_FLAGS.
>
> The rationale behind the choice of AT_FLAGS is that the Unix System V
> ABI defines AT_FLAGS as "flags", leaving some degree of freedom in
> interpretation.
> There are two previous attempts of using AT_FLAGS in the Linux Kernel
> for different reasons: the first was more generic and was used to expose
> the support for the GNU STACK NX feature [1] and the second was done for
> the MIPS architecture and was used to expose the support of "MIPS ABI
> Extension for IEEE Std 754 Non-Compliant Interlinking" [2].
> Both the changes are currently _not_ merged in mainline.
> The only architecture that reserves some of the bits in AT_FLAGS is
> currently MIPS, which introduced the concept of platform specific ABI
> (psABI) reserving the top-byte [3].
>
> When ARM64_AT_FLAGS_SYSCALL_TBI is set the kernel is advertising
> to the userspace that a relaxed ABI is supported hence this type
> of pointers are now allowed to be passed to the syscalls when they are
> in memory ranges obtained by anonymous mmap() or brk().
>
> The userspace _must_ verify that the flag is set before passing tagged
> pointers to the syscalls allowed by this relaxation.
>
> More in general, exposing the ARM64_AT_FLAGS_SYSCALL_TBI flag and mandating
> to the software to check that the feature is present, before using the
> associated functionality, it provides a degree of control on the decision
> of disabling such a feature in future without consequently breaking the
> userspace.
>
> The change required a modification of the elf common code, because in Linux
> the AT_FLAGS are currently set to zero by default by the kernel.
>
> The newly added flag has been verified on arm64 using the code below.
> #include 
> #include 
> #include 
>
> #define ARM64_AT_FLAGS_SYSCALL_TBI (1 << 0)
>
> bool arm64_syscall_tbi_is_present(void)
> {
> unsigned long at_flags = getauxval(AT_FLAGS);
> if (at_flags & ARM64_AT_FLAGS_SYSCALL_TBI)
> return true;
>
> return false;
> }
>
> void main()
> {
> if (arm64_syscall_tbi_is_present())
> printf("ARM64_AT_FLAGS_SYSCALL_TBI is present\n");
> }
>
> This patchset should be merged together with [4].
>
> [1] https://patchwork.ozlabs.org/patch/579578/
> [2] https://lore.kernel.org/patchwork/cover/618280/
> [3] ftp://www.linux-mips.org/pub/linux/mips/doc/ABI/psABI_mips3.0.pdf
> [4] https://patchwork.kernel.org/cover/10674351/
>
> ABI References:
> ---
> Sco SysV ABI: http://www.sco.com/developers/gabi/2003-12-17/contents.html
> PowerPC AUXV: 
> http://openpowerfoundation.org/wp-content/uploads/resources/leabi/content/dbdoclet.50655242_98651.html
> AMD64 ABI: https://www.cs.tufts.edu/comp/40-2012f/readings/amd64-abi.pdf
> x86 ABI: https://www.uclibc.org/docs/psABI-i386.pdf
> MIPS ABI: ftp://www.linux-mips.org/pub/linux/mips/doc/ABI/psABI_mips3.0.pdf
> ARM ABI: 
> http://infocenter.arm.com/help/topic/com.arm.doc.ihi0044f/IHI0044F_aaelf.pdf
> SPARC ABI: http://math-atlas.sourceforge.net/devel/assembly/abi_sysV_sparc.pdf
>
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: Mark Rutland 
> Cc: Robin Murphy 
> Cc: Kees Cook 
> Cc: Kate Stewart 
> Cc: Greg Kroah-Hartman 
> Cc: Andrew Morton 
> Cc: Ingo Molnar 
> Cc: "Kirill A . Shutemov" 
> Cc: Shuah Khan 
> Cc: Chintan Pandya 
> Cc: Jacob Bramley 
> Cc: Ruben Ayrapetyan 
> Cc: Andrey Konovalov 
> Cc: Lee Smith 
> Cc: Kostya Serebryany 
> Cc: Dmitry Vyukov ,
> Cc: Ramana Radhakrishnan 
> Cc: Luc Van Oostenryck 
> Cc: Evgeniy Stepanov 
> CC: Alexander Viro 
> Signed-off-by: Vincenzo Frascino 
>
> Vincenzo Frascino (3):
>   elf: Make AT_FLAGS arch configurable
>   arm64: Define Documentation/arm64/elf_at_flags.txt
>   arm64: elf: Advertise relaxed ABI
>
>  Documentation/arm64/elf_at_flags.txt  | 111 ++
>  arch/arm64/include/asm/atflags.h  |   7 ++
>  arch/arm64/include/asm/elf.h  |   5 ++
>  arch/arm64/include/uapi/asm/atflags.h |   8 ++
>  fs/binfmt_elf.c   |   6 +-
>  fs/binfmt_elf_fdpic.c |   6 +-
>  fs/compat_binfmt_elf.c|   5 ++
>  7 files changed, 146 insertions(+), 2 deletions(-)
>  create mode 100644 Documentation/arm64/elf_at_flags.txt
>  create mode 100644 arch/arm64/include/asm/atflags.h
>  create mode 100644 arch/arm64/include/uapi/asm/atflags.h
>
> --
> 2.19.2
>

Acked-by: Andrey Konovalov