Re: [RFC][PATCH 0/3] arm64 relaxed ABI
On 25/02/2019 18:02, Szabolcs Nagy wrote: On 25/02/2019 16:57, Catalin Marinas wrote: On Tue, Feb 19, 2019 at 06:38:31PM +, Szabolcs Nagy wrote: i think these rules work for the cases i care about, a more tricky question is when/how to check for the new syscall abi and when/how the TCR_EL1.TBI0 setting may be turned off. I don't think turning TBI0 off is critical (it's handy for PAC with 52-bit VA but then it's short-lived if you want more security features like MTE). yes, i made a mistake assuming TBI0 off is required for (or at least compatible with) MTE. if TBI0 needs to be on for MTE then some of my analysis is wrong, and i expect TBI0 to be on in the foreseeable future. consider the following cases (tb == top byte): binary 1: user tb = any, syscall tb = 0 tbi is on, "legacy binary" binary 2: user tb = any, syscall tb = any tbi is on, "new binary using tb" for backward compat it needs to check for new syscall abi. binary 3: user tb = 0, syscall tb = 0 tbi can be off, "new binary", binary is marked to indicate unused tb, kernel may turn tbi off: additional pac bits. binary 4: user tb = mte, syscall tb = mte like binary 3, but with mte, "new binary using mte" so this should be "like binary 2, but with mte". does it have to check for new syscall abi? or MTE HWCAP would imply it? (is it possible to use mte without new syscall abi?) I think MTE HWCAP should imply it. in userspace we want most binaries to be like binary 3 and 4 eventually, i.e. marked as not-relying-on-tbi, if a dso is loaded that is unmarked (legacy or new tb user), then either the load fails (e.g. if mte is already used? or can we turn mte off at runtime?) or tbi has to be enabled (prctl? does this work with pac? or multi-threads?). We could enable it via prctl. That's the plan for MTE as well (in addition maybe to some ELF flag). as for checking the new syscall abi: i don't see much semantic difference between AT_HWCAP and AT_FLAGS (either way, the user has to check a feature flag before using the feature of the underlying system and it does not matter much if it's a syscall abi feature or cpu feature), but i don't see anything wrong with AT_FLAGS if the kernel prefers that. The AT_FLAGS is aimed at capturing binary 2 case above, i.e. the relaxation of the syscall ABI to accept tb = any. The MTE support will have its own AT_HWCAP, likely in addition to AT_FLAGS. Arguably, AT_FLAGS is either redundant here if MTE implies it (and no harm in keeping it around) or the meaning is different: a tb != 0 may be checked by the kernel against the allocation tag (i.e. get_user() could fail, the tag is not entirely ignored). the discussion here was mostly about binary 2, That's because passing tb != 0 into the syscall ABI is the main blocker here that needs clearing out before merging the MTE support. There is, of course, a variation of binary 1 for MTE: binary 5: user tb = mte, syscall tb = 0 but this requires a lot of C lib changes to support properly. yes, i don't think we want to do that. but it's ok to have both syscall tbi AT_FLAGS and MTE HWCAP. but for me the open question is if we can make binary 3/4 work. (which requires some elf binary marking, that is recognised by the kernel and dynamic loader, and efficient handling of the TBI0 bit, ..if it's not possible, then i don't see how mte will be deployed). If we ignore binary 3, we can keep TBI0 = 1 permanently, whether we have MTE or not. and i guess on the kernel side the open question is if the rules 1/2/3/4 can be made to work in corner cases e.g. when pointers embedded into structs are passed down in ioctl. We've been trying to track these down since last summer and we came to the conclusion that it should be (mostly) fine for the non-weird memory described above. i think an interesting case is when userspace passes a pointer to the kernel and later gets it back, which is why i proposed rule 4 (kernel has to keep the tag then). but i wonder what's the right thing to do for sp (user can malloc thread/sigalt/makecontext stack which will be mte tagged in practice with mte on) does tagged sp work? should userspace untag the stack memory before setting it up as a stack? (but then user pointers to that allocation may get broken..) Tagged SP does work, and it is actually a good idea (it avoids using the default tag for the stack). It would be quite easy for the kernel to tag the initial SP and the stack on execve(). For other stacks, it is up to userspace, as you say, and would be made easier by making it possible to choose how a mapping should be tagged by the kernel via a new mmap() flag. Some software that makes too many assumptions on the address of stack variables will be disturbed by a tagged SP, but this should be fairly rare. In any case, I don't think this impacts this ABI proposal (beyond the fact that passing tagged pointers to the stack needs to be allowed). Kevin
Re: [RFC][PATCH 0/3] arm64 relaxed ABI
On 25/02/2019 16:57, Catalin Marinas wrote: > On Tue, Feb 19, 2019 at 06:38:31PM +, Szabolcs Nagy wrote: >> i think these rules work for the cases i care about, a more >> tricky question is when/how to check for the new syscall abi >> and when/how the TCR_EL1.TBI0 setting may be turned off. > > I don't think turning TBI0 off is critical (it's handy for PAC with > 52-bit VA but then it's short-lived if you want more security features > like MTE). yes, i made a mistake assuming TBI0 off is required for (or at least compatible with) MTE. if TBI0 needs to be on for MTE then some of my analysis is wrong, and i expect TBI0 to be on in the foreseeable future. >> consider the following cases (tb == top byte): >> >> binary 1: user tb = any, syscall tb = 0 >> tbi is on, "legacy binary" >> >> binary 2: user tb = any, syscall tb = any >> tbi is on, "new binary using tb" >> for backward compat it needs to check for new syscall abi. >> >> binary 3: user tb = 0, syscall tb = 0 >> tbi can be off, "new binary", >> binary is marked to indicate unused tb, >> kernel may turn tbi off: additional pac bits. >> >> binary 4: user tb = mte, syscall tb = mte >> like binary 3, but with mte, "new binary using mte" so this should be "like binary 2, but with mte". >> does it have to check for new syscall abi? >> or MTE HWCAP would imply it? >> (is it possible to use mte without new syscall abi?) > > I think MTE HWCAP should imply it. > >> in userspace we want most binaries to be like binary 3 and 4 >> eventually, i.e. marked as not-relying-on-tbi, if a dso is >> loaded that is unmarked (legacy or new tb user), then either >> the load fails (e.g. if mte is already used? or can we turn >> mte off at runtime?) or tbi has to be enabled (prctl? does >> this work with pac? or multi-threads?). > > We could enable it via prctl. That's the plan for MTE as well (in > addition maybe to some ELF flag). > >> as for checking the new syscall abi: i don't see much semantic >> difference between AT_HWCAP and AT_FLAGS (either way, the user >> has to check a feature flag before using the feature of the >> underlying system and it does not matter much if it's a syscall >> abi feature or cpu feature), but i don't see anything wrong >> with AT_FLAGS if the kernel prefers that. > > The AT_FLAGS is aimed at capturing binary 2 case above, i.e. the > relaxation of the syscall ABI to accept tb = any. The MTE support will > have its own AT_HWCAP, likely in addition to AT_FLAGS. Arguably, > AT_FLAGS is either redundant here if MTE implies it (and no harm in > keeping it around) or the meaning is different: a tb != 0 may be checked > by the kernel against the allocation tag (i.e. get_user() could fail, > the tag is not entirely ignored). > >> the discussion here was mostly about binary 2, > > That's because passing tb != 0 into the syscall ABI is the main blocker > here that needs clearing out before merging the MTE support. There is, > of course, a variation of binary 1 for MTE: > > binary 5: user tb = mte, syscall tb = 0 > > but this requires a lot of C lib changes to support properly. yes, i don't think we want to do that. but it's ok to have both syscall tbi AT_FLAGS and MTE HWCAP. >> but for >> me the open question is if we can make binary 3/4 work. >> (which requires some elf binary marking, that is recognised >> by the kernel and dynamic loader, and efficient handling of >> the TBI0 bit, ..if it's not possible, then i don't see how >> mte will be deployed). > > If we ignore binary 3, we can keep TBI0 = 1 permanently, whether we have > MTE or not. > >> and i guess on the kernel side the open question is if the >> rules 1/2/3/4 can be made to work in corner cases e.g. when >> pointers embedded into structs are passed down in ioctl. > > We've been trying to track these down since last summer and we came to > the conclusion that it should be (mostly) fine for the non-weird memory > described above. i think an interesting case is when userspace passes a pointer to the kernel and later gets it back, which is why i proposed rule 4 (kernel has to keep the tag then). but i wonder what's the right thing to do for sp (user can malloc thread/sigalt/makecontext stack which will be mte tagged in practice with mte on) does tagged sp work? should userspace untag the stack memory before setting it up as a stack? (but then user pointers to that allocation may get broken..)
Re: [RFC][PATCH 0/3] arm64 relaxed ABI
Hi Szabolcs, Thanks for looking into this. Comments below. On Tue, Feb 19, 2019 at 06:38:31PM +, Szabolcs Nagy wrote: > i think these rules work for the cases i care about, a more > tricky question is when/how to check for the new syscall abi > and when/how the TCR_EL1.TBI0 setting may be turned off. I don't think turning TBI0 off is critical (it's handy for PAC with 52-bit VA but then it's short-lived if you want more security features like MTE). > consider the following cases (tb == top byte): > > binary 1: user tb = any, syscall tb = 0 > tbi is on, "legacy binary" > > binary 2: user tb = any, syscall tb = any > tbi is on, "new binary using tb" > for backward compat it needs to check for new syscall abi. > > binary 3: user tb = 0, syscall tb = 0 > tbi can be off, "new binary", > binary is marked to indicate unused tb, > kernel may turn tbi off: additional pac bits. > > binary 4: user tb = mte, syscall tb = mte > like binary 3, but with mte, "new binary using mte" > does it have to check for new syscall abi? > or MTE HWCAP would imply it? > (is it possible to use mte without new syscall abi?) I think MTE HWCAP should imply it. > in userspace we want most binaries to be like binary 3 and 4 > eventually, i.e. marked as not-relying-on-tbi, if a dso is > loaded that is unmarked (legacy or new tb user), then either > the load fails (e.g. if mte is already used? or can we turn > mte off at runtime?) or tbi has to be enabled (prctl? does > this work with pac? or multi-threads?). We could enable it via prctl. That's the plan for MTE as well (in addition maybe to some ELF flag). > as for checking the new syscall abi: i don't see much semantic > difference between AT_HWCAP and AT_FLAGS (either way, the user > has to check a feature flag before using the feature of the > underlying system and it does not matter much if it's a syscall > abi feature or cpu feature), but i don't see anything wrong > with AT_FLAGS if the kernel prefers that. The AT_FLAGS is aimed at capturing binary 2 case above, i.e. the relaxation of the syscall ABI to accept tb = any. The MTE support will have its own AT_HWCAP, likely in addition to AT_FLAGS. Arguably, AT_FLAGS is either redundant here if MTE implies it (and no harm in keeping it around) or the meaning is different: a tb != 0 may be checked by the kernel against the allocation tag (i.e. get_user() could fail, the tag is not entirely ignored). > the discussion here was mostly about binary 2, That's because passing tb != 0 into the syscall ABI is the main blocker here that needs clearing out before merging the MTE support. There is, of course, a variation of binary 1 for MTE: binary 5: user tb = mte, syscall tb = 0 but this requires a lot of C lib changes to support properly. > but for > me the open question is if we can make binary 3/4 work. > (which requires some elf binary marking, that is recognised > by the kernel and dynamic loader, and efficient handling of > the TBI0 bit, ..if it's not possible, then i don't see how > mte will be deployed). If we ignore binary 3, we can keep TBI0 = 1 permanently, whether we have MTE or not. > and i guess on the kernel side the open question is if the > rules 1/2/3/4 can be made to work in corner cases e.g. when > pointers embedded into structs are passed down in ioctl. We've been trying to track these down since last summer and we came to the conclusion that it should be (mostly) fine for the non-weird memory described above. -- Catalin
Re: [RFC][PATCH 0/3] arm64 relaxed ABI
On 12/02/2019 18:02, Catalin Marinas wrote: > On Mon, Feb 11, 2019 at 12:32:55PM -0800, Evgenii Stepanov wrote: >> On Mon, Feb 11, 2019 at 9:28 AM Kevin Brodsky wrote: >>> On 19/12/2018 12:52, Dave Martin wrote: Really, the kernel should do the expected thing with all "non-weird" memory. In lieu of a proper definition of "non-weird", I think we should have some lists of things that are explicitly included, and also excluded: OK: kernel-allocated process stack brk area MAP_ANONYMOUS | MAP_PRIVATE MAP_PRIVATE mappings of /dev/zero Not OK: MAP_SHARED mmaps of non-memory-like devices mmaps of anything that is not a regular file the VDSO ... In general, userspace can tag memory that it "owns", and we do not assume a transfer of ownership except in the "OK" list above. Otherwise, it's the kernel's memory, or the owner is simply not well defined. >>> >>> Agreed on the general idea: a process should be able to pass tagged >>> pointers at the >>> syscall interface, as long as they point to memory privately owned by the >>> process. I >>> think it would be possible to simplify the definition of "non-weird" memory >>> by using >>> only this "OK" list: >>> - mmap() done by the process itself, where either: >>>* flags = MAP_PRIVATE | MAP_ANONYMOUS >>>* flags = MAP_PRIVATE and fd refers to a regular file or a well-defined >>> list of >>> device files (like /dev/zero) >>> - brk() done by the process itself >>> - Any memory mapped by the kernel in the new process's address space during >>> execve(), >>> with the same restrictions as above ([vdso]/[vvar] are therefore excluded) > > Sounds reasonable. OK. this non-weird memory definition works for me too. rule 1: if weird memory pointers are passed to the kernel with top byte set then the behaviour is undefined. * When the kernel dereferences a pointer on userspace's behalf, it shall behave equivalently to userspace dereferencing the same pointer, including use of the same tag (where passed by userspace). * Where the pointer tag affects pointer dereference behaviour (i.e., with hardware memory colouring) the kernel makes no guarantee to honour pointer tags correctly for every location a buffer based on a pointer passed by userspace to the kernel. (This means for example that for a read(fd, buf, size), we can check the tag for a single arbitrary location in *(char (*)[size])buf before passing the buffer to get_user_pages(). Hopefully this could be done in get_user_pages() itself rather than hunting call sites. For userspace, it means that you're on your own if you ask the kernel to operate on a buffer than spans multiple, independently- allocated objects, or a deliberately striped single object.) >>> >>> I think both points are reasonable. It is very valuable for the kernel to >>> access >>> userspace memory using the user-provided tag, because it enables kernel >>> accesses to >>> be checked in the same way as user accesses, allowing to detect bugs that >>> are >>> potentially hard to find. For instance, if a pointer to an object is passed >>> to the >>> kernel after it has been deallocated, this is invalid and should be >>> detected. >>> However, you are absolutely right that the kernel cannot *guarantee* that >>> such a >>> check is carried out for the entire memory range (or in fact at all); it >>> should be a >>> best-effort approach. >> >> It would also be valuable to narrow down the set of "relaxed" (i.e. >> not tag-checking) syscalls as reasonably possible. We would want to >> provide tag-checking userspace wrappers for any important calls that >> are not checked in the kernel. Is it correct to assume that anything >> that goes through copy_from_user / copy_to_user is checked? > > I lost track of the context of this thread but if it's just about > relaxing the ABI for hwasan, the kernel has no idea of the compiler > generated structures in user space, so nothing is checked. > > If we talk about tags in the context of MTE, than yes, with the current > proposal the tag would be checked by copy_*_user() functions. rule 2: kernel derefs as if user derefs when non-weird memory pointers are passed to the kernel. note that the important bit is what happens on valid pointer derefs, invalid pointer deref is usually undefined for user programs, so what happens in case of mte tag failures is more of a QoI issue than abi i think. (e.g. EFAULT is not guaranteed by the kernel currently, i can successfully do write(open("/dev/null",O_WRONLY), 0, 1), or get a crash when passing invalid pointer to a vdso function, so userspace should not rely on some strict EFAULT behaviour). * The kernel shall not extend the lifetime of user pointers in ways t
Re: [RFC][PATCH 0/3] arm64 relaxed ABI
On 13/02/2019 21:41, Evgenii Stepanov wrote: On Wed, Feb 13, 2019 at 9:43 AM Dave Martin wrote: On Wed, Feb 13, 2019 at 04:42:11PM +, Kevin Brodsky wrote: (+Cc other people with MTE experience: Branislav, Ruben) [...] I'm wondering whether we can piggy-back on existing concepts. We could say that recolouring memory is safe when and only when unmapping of the page or removing permissions on the page (via munmap/mremap/mprotect) would be safe. Otherwise, the resulting behaviour of the process is undefined. Is that a sufficient requirement? I don't think that anything prevents you from using mprotect() on say [vvar], but we don't necessarily want to map [vvar] as tagged. I'm not sure it's easy to define what "safe" would mean here. I think the origin rules have to apply too: [vvar] is not a regular, private page but a weird, shared thing mapped for you by the kernel. Presumably userspace _cannot_ do mprotect(PROT_WRITE) on it. I'm also assuming that userspace cannot recolour memory in read-only pages. That sounds bad if there's no way to prevent it. That sounds like something we would like to do to catch out of bounds read of .rodata globals. Another potentially interesting use case for MTE is infinite hardware watchpoints - that would require trapping reads for individual tagging granules, include those in read-only binary segment. I think we should keep this discussion for a later, separate thread. Vincenzo's proposal is about allowing userspace to pass tags at the syscall interface. The set of mappings allowed to be tagged by userspace (in MTE) should be contained in the set of mappings that userspace can pass tagged pointers to (at the syscall interface), but they are not necessarily the same. Private read-only mappings are an edge case (you can pass tagged pointers to them, the memory may or may not be mapped as tagged, but in any case it is not possible to change the memory tags via such mapping). [...] It might be reasonable to do the check in access_ok() and skip it in __put_user() etc. (I seem to remember some separate discussion about abolishing __put_user() and friends though, due to the accident risk they pose.) Keep in mind that with MTE, there is no need to do any explicit check when accessing user memory via a user-provided pointer. The tagged user pointer is directly passed to copy_*_user() or put_user(). If the load/store causes a tag fault, then it is handled just like a page fault (i.e. invoking the fixup handler). As far as I can tell, there's no need to do anything special in access_ok() in that case. [The above applies to precise mode. In imprecise mode, some more work will be needed after the load/store to check whether a tag fault happened.] Fair enough, I'm a bit hazy on the details as of right now.. [...] There are many possible ways to deploy MTE, and debugging is just one of them. For instance, you may want to turn on heap colouring for some processes in the system, including in production. To implement enforceable protection, or as a diagnostic tool for when something goes wrong? In the latter case it's still OK for the kernel's tag checking not to be exhaustive. Regarding those cases where it is impossible to check tags at the point of accessing user memory, it is indeed possible to check the memory tags at the point of stripping the tag from the user pointer. Given that some MTE use-cases favour performance over tag check coverage, the ideal approach would be to make these checks configurable (e.g. check one granule, check all of them, or check none). I don't know how feasible this is in practice. Check all granules of a massive DMA buffer? That doesn't sounds feasible without explicit support in the hardware to have the DMA check tags itself as the memory is accessed. MTE by itself doesn't provide for this IIUC (at least, it would require support in the platform, not just the CPU). We do not want to bake any assumptions into the ABI about whether a given data transfer may or may not be offloaded to DMA. That feels like a slippery slope. Providing we get the checks for free in put_user/get_user/ copy_{to,from}_user(), those will cover a lot of cases though, for non-bulk-IO cases. My assumption has been that at this point in time we are mainly aiming to support the debug/diagnostic use cases today. MTE can be used both for diagnostics (imprecise mode is especially suitable for that), and to halt execution when something wrong is detected. Even in the latter case, one cannot expect exhaustive checking from MTE, because the way it works is fundamentally statistical; an invalid pointer may by chance have the right tag to access the given location. So again, I think that a best-effort approach is appropriate when the kernel accesses user memory, in terms of checking that tags match. More specifically, different use-cases come with different tradeoffs (performance / tag check coverage). That's why I am suggesting that in
Re: [RFC][PATCH 0/3] arm64 relaxed ABI
On Wed, Feb 13, 2019 at 9:43 AM Dave Martin wrote: > > On Wed, Feb 13, 2019 at 04:42:11PM +, Kevin Brodsky wrote: > > (+Cc other people with MTE experience: Branislav, Ruben) > > [...] > > > >I'm wondering whether we can piggy-back on existing concepts. > > > > > >We could say that recolouring memory is safe when and only when > > >unmapping of the page or removing permissions on the page (via > > >munmap/mremap/mprotect) would be safe. Otherwise, the resulting > > >behaviour of the process is undefined. > > > > Is that a sufficient requirement? I don't think that anything prevents you > > from using mprotect() on say [vvar], but we don't necessarily want to map > > [vvar] as tagged. I'm not sure it's easy to define what "safe" would mean > > here. > > I think the origin rules have to apply too: [vvar] is not a regular, > private page but a weird, shared thing mapped for you by the kernel. > > Presumably userspace _cannot_ do mprotect(PROT_WRITE) on it. > > I'm also assuming that userspace cannot recolour memory in read-only > pages. That sounds bad if there's no way to prevent it. That sounds like something we would like to do to catch out of bounds read of .rodata globals. Another potentially interesting use case for MTE is infinite hardware watchpoints - that would require trapping reads for individual tagging granules, include those in read-only binary segment. > > [...] > > > >It might be reasonable to do the check in access_ok() and skip it in > > >__put_user() etc. > > > > > >(I seem to remember some separate discussion about abolishing > > >__put_user() and friends though, due to the accident risk they pose.) > > > > Keep in mind that with MTE, there is no need to do any explicit check when > > accessing user memory via a user-provided pointer. The tagged user pointer > > is directly passed to copy_*_user() or put_user(). If the load/store causes > > a tag fault, then it is handled just like a page fault (i.e. invoking the > > fixup handler). As far as I can tell, there's no need to do anything special > > in access_ok() in that case. > > > > [The above applies to precise mode. In imprecise mode, some more work will > > be needed after the load/store to check whether a tag fault happened.] > > Fair enough, I'm a bit hazy on the details as of right now.. > > [...] > > > There are many possible ways to deploy MTE, and debugging is just one of > > them. For instance, you may want to turn on heap colouring for some > > processes in the system, including in production. > > To implement enforceable protection, or as a diagnostic tool for when > something goes wrong? > > In the latter case it's still OK for the kernel's tag checking not to be > exhaustive. > > > Regarding those cases where it is impossible to check tags at the point of > > accessing user memory, it is indeed possible to check the memory tags at the > > point of stripping the tag from the user pointer. Given that some MTE > > use-cases favour performance over tag check coverage, the ideal approach > > would be to make these checks configurable (e.g. check one granule, check > > all of them, or check none). I don't know how feasible this is in practice. > > Check all granules of a massive DMA buffer? > > That doesn't sounds feasible without explicit support in the hardware to > have the DMA check tags itself as the memory is accessed. MTE by itself > doesn't provide for this IIUC (at least, it would require support in the > platform, not just the CPU). > > We do not want to bake any assumptions into the ABI about whether a > given data transfer may or may not be offloaded to DMA. That feels > like a slippery slope. > > Providing we get the checks for free in put_user/get_user/ > copy_{to,from}_user(), those will cover a lot of cases though, for > non-bulk-IO cases. > > > My assumption has been that at this point in time we are mainly aiming > to support the debug/diagnostic use cases today. > > At least, those are the low(ish)-hanging fruit. > > Others are better placed than me to comment on the goals here. > > Cheers > ---Dave
Re: [RFC][PATCH 0/3] arm64 relaxed ABI
On Wed, Feb 13, 2019 at 04:42:11PM +, Kevin Brodsky wrote: > (+Cc other people with MTE experience: Branislav, Ruben) [...] > >I'm wondering whether we can piggy-back on existing concepts. > > > >We could say that recolouring memory is safe when and only when > >unmapping of the page or removing permissions on the page (via > >munmap/mremap/mprotect) would be safe. Otherwise, the resulting > >behaviour of the process is undefined. > > Is that a sufficient requirement? I don't think that anything prevents you > from using mprotect() on say [vvar], but we don't necessarily want to map > [vvar] as tagged. I'm not sure it's easy to define what "safe" would mean > here. I think the origin rules have to apply too: [vvar] is not a regular, private page but a weird, shared thing mapped for you by the kernel. Presumably userspace _cannot_ do mprotect(PROT_WRITE) on it. I'm also assuming that userspace cannot recolour memory in read-only pages. That sounds bad if there's no way to prevent it. [...] > >It might be reasonable to do the check in access_ok() and skip it in > >__put_user() etc. > > > >(I seem to remember some separate discussion about abolishing > >__put_user() and friends though, due to the accident risk they pose.) > > Keep in mind that with MTE, there is no need to do any explicit check when > accessing user memory via a user-provided pointer. The tagged user pointer > is directly passed to copy_*_user() or put_user(). If the load/store causes > a tag fault, then it is handled just like a page fault (i.e. invoking the > fixup handler). As far as I can tell, there's no need to do anything special > in access_ok() in that case. > > [The above applies to precise mode. In imprecise mode, some more work will > be needed after the load/store to check whether a tag fault happened.] Fair enough, I'm a bit hazy on the details as of right now.. [...] > There are many possible ways to deploy MTE, and debugging is just one of > them. For instance, you may want to turn on heap colouring for some > processes in the system, including in production. To implement enforceable protection, or as a diagnostic tool for when something goes wrong? In the latter case it's still OK for the kernel's tag checking not to be exhaustive. > Regarding those cases where it is impossible to check tags at the point of > accessing user memory, it is indeed possible to check the memory tags at the > point of stripping the tag from the user pointer. Given that some MTE > use-cases favour performance over tag check coverage, the ideal approach > would be to make these checks configurable (e.g. check one granule, check > all of them, or check none). I don't know how feasible this is in practice. Check all granules of a massive DMA buffer? That doesn't sounds feasible without explicit support in the hardware to have the DMA check tags itself as the memory is accessed. MTE by itself doesn't provide for this IIUC (at least, it would require support in the platform, not just the CPU). We do not want to bake any assumptions into the ABI about whether a given data transfer may or may not be offloaded to DMA. That feels like a slippery slope. Providing we get the checks for free in put_user/get_user/ copy_{to,from}_user(), those will cover a lot of cases though, for non-bulk-IO cases. My assumption has been that at this point in time we are mainly aiming to support the debug/diagnostic use cases today. At least, those are the low(ish)-hanging fruit. Others are better placed than me to comment on the goals here. Cheers ---Dave
Re: [RFC][PATCH 0/3] arm64 relaxed ABI
(+Cc other people with MTE experience: Branislav, Ruben) On 13/02/2019 14:58, Dave Martin wrote: On Tue, Feb 12, 2019 at 06:02:24PM +, Catalin Marinas wrote: On Mon, Feb 11, 2019 at 12:32:55PM -0800, Evgenii Stepanov wrote: On Mon, Feb 11, 2019 at 9:28 AM Kevin Brodsky wrote: On 19/12/2018 12:52, Dave Martin wrote: [...] * A single C object should be accessed using a single, fixed pointer tag throughout its entire lifetime. Agreed. Allocators themselves may need to be excluded though, depending on how they represent their managed memory. * Tags can be changed only when there are no outstanding pointers to the affected object or region that may be used to access the object or region (i.e., if the object were allocated from the C heap and is it safe to realloc() it, then it is safe to change the tag; for other types of allocation, analogous arguments can be applied). Tags can only be changed at the point of deallocation/ reallocation. Pointers to the object become invalid and cannot be used after it has been deallocated; memory tagging allows to catch such invalid usage. All the above sound well but that's mostly a guideline on what a C library can do. It doesn't help much with defining the kernel ABI. Anyway, it's good to clarify the use-cases. My aim was to clarify the use case in userspace, since I wasn't directly involved in that. The kernel ABI needs to be compatible with the the use case, but doesn't need to specify must of it. I'm wondering whether we can piggy-back on existing concepts. We could say that recolouring memory is safe when and only when unmapping of the page or removing permissions on the page (via munmap/mremap/mprotect) would be safe. Otherwise, the resulting behaviour of the process is undefined. Is that a sufficient requirement? I don't think that anything prevents you from using mprotect() on say [vvar], but we don't necessarily want to map [vvar] as tagged. I'm not sure it's easy to define what "safe" would mean here. Hopefully there are friendly fuzzers testing this kind of thing. [...] It would also be valuable to narrow down the set of "relaxed" (i.e. not tag-checking) syscalls as reasonably possible. We would want to provide tag-checking userspace wrappers for any important calls that are not checked in the kernel. Is it correct to assume that anything that goes through copy_from_user / copy_to_user is checked? I lost track of the context of this thread but if it's just about relaxing the ABI for hwasan, the kernel has no idea of the compiler generated structures in user space, so nothing is checked. If we talk about tags in the context of MTE, than yes, with the current proposal the tag would be checked by copy_*_user() functions. Also put_user() and friends? It might be reasonable to do the check in access_ok() and skip it in __put_user() etc. (I seem to remember some separate discussion about abolishing __put_user() and friends though, due to the accident risk they pose.) Keep in mind that with MTE, there is no need to do any explicit check when accessing user memory via a user-provided pointer. The tagged user pointer is directly passed to copy_*_user() or put_user(). If the load/store causes a tag fault, then it is handled just like a page fault (i.e. invoking the fixup handler). As far as I can tell, there's no need to do anything special in access_ok() in that case. [The above applies to precise mode. In imprecise mode, some more work will be needed after the load/store to check whether a tag fault happened.] For aio* operations it would be nice if the tag was checked at the time of the actual userspace read/write, either instead of or in addition to at the time of the system call. With aio* (and synchronous iovec-based syscalls), the kernel may access the memory while the corresponding user process is scheduled out. Given that such access is not done in the context of the user process (and using the user VA like copy_*_user), the kernel cannot handle potential tag faults. Moreover, the transfer may be done by DMA and the device does not understand tags. I'd like to keep tags as a property of the pointer in a specific virtual address space. The moment you convert it to a different address space (e.g. kernel linear map, physical address), the tag property is stripped and I don't think we should re-build it (and have it checked). This is probably reasonable. Ideally we would check the tag at the point of stripping it off, but most likely it's going to be rather best-effort. If memory tagging is essential a debugging feature then this seems an acceptable compromise. There are many possible ways to deploy MTE, and debugging is just one of them. For instance, you may want to turn on heap colouring for some processes in the system, including in production. Regarding those cases where it is impossible to check tags at the point of accessing user memory, it is indeed possible to ch
Re: [RFC][PATCH 0/3] arm64 relaxed ABI
On Tue, Feb 12, 2019 at 06:02:24PM +, Catalin Marinas wrote: > On Mon, Feb 11, 2019 at 12:32:55PM -0800, Evgenii Stepanov wrote: > > On Mon, Feb 11, 2019 at 9:28 AM Kevin Brodsky wrote: > > > On 19/12/2018 12:52, Dave Martin wrote: [...] > > > > * A single C object should be accessed using a single, fixed > > > > pointer tag throughout its entire lifetime. > > > > > > Agreed. Allocators themselves may need to be excluded though, > > > depending on how they represent their managed memory. > > > > > > > * Tags can be changed only when there are no outstanding pointers to > > > > the affected object or region that may be used to access the object > > > > or region (i.e., if the object were allocated from the C heap and > > > > is it safe to realloc() it, then it is safe to change the tag; for > > > > other types of allocation, analogous arguments can be applied). > > > > > > Tags can only be changed at the point of deallocation/ > > > reallocation. Pointers to the object become invalid and cannot > > > be used after it has been deallocated; memory tagging allows to > > > catch such invalid usage. > > All the above sound well but that's mostly a guideline on what a C > library can do. It doesn't help much with defining the kernel ABI. > Anyway, it's good to clarify the use-cases. My aim was to clarify the use case in userspace, since I wasn't directly involved in that. The kernel ABI needs to be compatible with the the use case, but doesn't need to specify must of it. I'm wondering whether we can piggy-back on existing concepts. We could say that recolouring memory is safe when and only when unmapping of the page or removing permissions on the page (via munmap/mremap/mprotect) would be safe. Otherwise, the resulting behaviour of the process is undefined. Hopefully there are friendly fuzzers testing this kind of thing. [...] > > It would also be valuable to narrow down the set of "relaxed" (i.e. > > not tag-checking) syscalls as reasonably possible. We would want to > > provide tag-checking userspace wrappers for any important calls that > > are not checked in the kernel. Is it correct to assume that anything > > that goes through copy_from_user / copy_to_user is checked? > > I lost track of the context of this thread but if it's just about > relaxing the ABI for hwasan, the kernel has no idea of the compiler > generated structures in user space, so nothing is checked. > > If we talk about tags in the context of MTE, than yes, with the current > proposal the tag would be checked by copy_*_user() functions. Also put_user() and friends? It might be reasonable to do the check in access_ok() and skip it in __put_user() etc. (I seem to remember some separate discussion about abolishing __put_user() and friends though, due to the accident risk they pose.) > > For aio* operations it would be nice if the tag was checked at the > > time of the actual userspace read/write, either instead of or in > > addition to at the time of the system call. > > With aio* (and synchronous iovec-based syscalls), the kernel may access > the memory while the corresponding user process is scheduled out. Given > that such access is not done in the context of the user process (and > using the user VA like copy_*_user), the kernel cannot handle potential > tag faults. Moreover, the transfer may be done by DMA and the device > does not understand tags. > > I'd like to keep tags as a property of the pointer in a specific virtual > address space. The moment you convert it to a different address space > (e.g. kernel linear map, physical address), the tag property is stripped > and I don't think we should re-build it (and have it checked). This is probably reasonable. Ideally we would check the tag at the point of stripping it off, but most likely it's going to be rather best-effort. If memory tagging is essential a debugging feature then this seems an acceptable compromise. > > > > * For purposes other than dereference, the kernel shall accept any > > > > legitimately tagged pointer (according to the above rules) as > > > > identifying the associated memory location. > > > > > > > > So, mprotect(some_page_aligned_object, ...); is valid irrespective > > > > of where page_aligned_object() came from. There is no implicit > > > > derefence by the kernel here, hence no tag check. > > > > > > > > The kernel does not guarantee to work correctly if the wrong tag > > > > is used, but there is not always a well-defined "right" tag, so > > > > we can't really guarantee to check it. So a pointer derived by > > > > any reasonable means by userspace has to be treated as equally > > > > valid. > > > > > > This is a disputed point :) In my opinion, this is the the most > > > reasonable approach. > > > > Yes, it would be nice if the kernel explicitly promised, ex. > > mprotect() over a range of differently tagged pages to be allowed > > (i.e. address tag should be u
Re: [RFC][PATCH 0/3] arm64 relaxed ABI
On Mon, Feb 11, 2019 at 12:32:55PM -0800, Evgenii Stepanov wrote: > On Mon, Feb 11, 2019 at 9:28 AM Kevin Brodsky wrote: > > On 19/12/2018 12:52, Dave Martin wrote: > > > Really, the kernel should do the expected thing with all "non-weird" > > > memory. > > > > > > In lieu of a proper definition of "non-weird", I think we should have > > > some lists of things that are explicitly included, and also excluded: > > > > > > OK: > > > kernel-allocated process stack > > > brk area > > > MAP_ANONYMOUS | MAP_PRIVATE > > > MAP_PRIVATE mappings of /dev/zero > > > > > > Not OK: > > > MAP_SHARED > > > mmaps of non-memory-like devices > > > mmaps of anything that is not a regular file > > > the VDSO > > > ... > > > > > > In general, userspace can tag memory that it "owns", and we do not assume > > > a transfer of ownership except in the "OK" list above. Otherwise, it's > > > the kernel's memory, or the owner is simply not well defined. > > > > Agreed on the general idea: a process should be able to pass tagged > > pointers at the > > syscall interface, as long as they point to memory privately owned by the > > process. I > > think it would be possible to simplify the definition of "non-weird" memory > > by using > > only this "OK" list: > > - mmap() done by the process itself, where either: > >* flags = MAP_PRIVATE | MAP_ANONYMOUS > >* flags = MAP_PRIVATE and fd refers to a regular file or a well-defined > > list of > > device files (like /dev/zero) > > - brk() done by the process itself > > - Any memory mapped by the kernel in the new process's address space during > > execve(), > > with the same restrictions as above ([vdso]/[vvar] are therefore excluded) Sounds reasonable. > > > * Userspace should set tags at the point of allocation only. > > > > Yes, tags are only supposed to be set at the point of either allocation or > > deallocation/reallocation. However, allocators can in principle be nested, > > so an > > allocator could take a region allocated by malloc() as input and subdivide > > it > > (changing tags in the process). That said, this suballocator must not > > free() that > > region until all the suballocations themselves have been freed (thereby > > restoring the > > tags initially set by malloc()). > > > > > * If you don't know how an object was allocated, you cannot modify the > > > tag, period. > > > > Agreed, allocators that tag memory can only operate on memory with a > > well-defined > > provenance (for instance anonymous mmap() or malloc()). > > > > > * A single C object should be accessed using a single, fixed pointer tag > > > throughout its entire lifetime. > > > > Agreed. Allocators themselves may need to be excluded though, depending on > > how they > > represent their managed memory. > > > > > * Tags can be changed only when there are no outstanding pointers to > > > the affected object or region that may be used to access the object > > > or region (i.e., if the object were allocated from the C heap and > > > is it safe to realloc() it, then it is safe to change the tag; for > > > other types of allocation, analogous arguments can be applied). > > > > Tags can only be changed at the point of deallocation/reallocation. > > Pointers to the > > object become invalid and cannot be used after it has been deallocated; > > memory > > tagging allows to catch such invalid usage. All the above sound well but that's mostly a guideline on what a C library can do. It doesn't help much with defining the kernel ABI. Anyway, it's good to clarify the use-cases. > > > * When the kernel dereferences a pointer on userspace's behalf, it > > > shall behave equivalently to userspace dereferencing the same pointer, > > > including use of the same tag (where passed by userspace). > > > > > > * Where the pointer tag affects pointer dereference behaviour (i.e., > > > with hardware memory colouring) the kernel makes no guarantee to > > > honour pointer tags correctly for every location a buffer based on a > > > pointer passed by userspace to the kernel. > > > > > > (This means for example that for a read(fd, buf, size), we can check > > > the tag for a single arbitrary location in *(char (*)[size])buf > > > before passing the buffer to get_user_pages(). Hopefully this could > > > be done in get_user_pages() itself rather than hunting call sites. > > > For userspace, it means that you're on your own if you ask the > > > kernel to operate on a buffer than spans multiple, independently- > > > allocated objects, or a deliberately striped single object.) > > > > I think both points are reasonable. It is very valuable for the kernel to > > access > > userspace memory using the user-provided tag, because it enables kernel > > accesses to > > be checked in the same way as user accesses, allowing to detect bugs that > > are > > potentially hard to find. For insta
Re: [RFC][PATCH 0/3] arm64 relaxed ABI
On Mon, Feb 11, 2019 at 9:28 AM Kevin Brodsky wrote: > > On 19/12/2018 12:52, Dave Martin wrote: > > On Tue, Dec 18, 2018 at 05:59:38PM +, Catalin Marinas wrote: > >> On Tue, Dec 18, 2018 at 04:03:38PM +0100, Andrey Konovalov wrote: > >>> On Wed, Dec 12, 2018 at 4:02 PM Catalin Marinas > >>> wrote: > The summary of our internal discussions (mostly between kernel > developers) is that we can't properly describe a user ABI that covers > future syscalls or syscall extensions while not all syscalls accept > tagged pointers. So we tweaked the requirements slightly to only allow > tagged pointers back into the kernel *if* the originating address is > from an anonymous mmap() or below sbrk(0). This should cover some of the > ioctls or getsockopt(TCP_ZEROCOPY_RECEIVE) where the user passes a > pointer to a buffer obtained via mmap() on the device operations. > > (sorry for not being clear on what Vincenzo's proposal implies) > >>> OK, I see. So I need to make the following changes to my patchset AFAIU. > >>> > >>> 1. Make sure that we only allow tagged user addresses that originate > >>> from an anonymous mmap() or below sbrk(0). How exactly should this > >>> check be performed? > >> I don't think we should perform such checks. That's rather stating that > >> the kernel only guarantees that the tagged pointers work if they > >> originated from these memory ranges. > > I concur. > > > > Really, the kernel should do the expected thing with all "non-weird" > > memory. > > > > In lieu of a proper definition of "non-weird", I think we should have > > some lists of things that are explicitly included, and also excluded: > > > > OK: > > kernel-allocated process stack > > brk area > > MAP_ANONYMOUS | MAP_PRIVATE > > MAP_PRIVATE mappings of /dev/zero > > > > Not OK: > > MAP_SHARED > > mmaps of non-memory-like devices > > mmaps of anything that is not a regular file > > the VDSO > > ... > > > > In general, userspace can tag memory that it "owns", and we do not assume > > a transfer of ownership except in the "OK" list above. Otherwise, it's > > the kernel's memory, or the owner is simply not well defined. > > Agreed on the general idea: a process should be able to pass tagged pointers > at the > syscall interface, as long as they point to memory privately owned by the > process. I > think it would be possible to simplify the definition of "non-weird" memory > by using > only this "OK" list: > - mmap() done by the process itself, where either: >* flags = MAP_PRIVATE | MAP_ANONYMOUS >* flags = MAP_PRIVATE and fd refers to a regular file or a well-defined > list of > device files (like /dev/zero) > - brk() done by the process itself > - Any memory mapped by the kernel in the new process's address space during > execve(), > with the same restrictions as above ([vdso]/[vvar] are therefore excluded) > > > I would also like to see advice for userspace developers, particularly > > things like (strawman, please challenge!): > > To some extent, one could call me a userspace developer, so I'll try to help > :) > > > * Userspace should set tags at the point of allocation only. > > Yes, tags are only supposed to be set at the point of either allocation or > deallocation/reallocation. However, allocators can in principle be nested, so > an > allocator could take a region allocated by malloc() as input and subdivide it > (changing tags in the process). That said, this suballocator must not free() > that > region until all the suballocations themselves have been freed (thereby > restoring the > tags initially set by malloc()). > > > * If you don't know how an object was allocated, you cannot modify the > > tag, period. > > Agreed, allocators that tag memory can only operate on memory with a > well-defined > provenance (for instance anonymous mmap() or malloc()). > > > * A single C object should be accessed using a single, fixed pointer tag > > throughout its entire lifetime. > > Agreed. Allocators themselves may need to be excluded though, depending on > how they > represent their managed memory. > > > * Tags can be changed only when there are no outstanding pointers to > > the affected object or region that may be used to access the object > > or region (i.e., if the object were allocated from the C heap and > > is it safe to realloc() it, then it is safe to change the tag; for > > other types of allocation, analogous arguments can be applied). > > Tags can only be changed at the point of deallocation/reallocation. Pointers > to the > object become invalid and cannot be used after it has been deallocated; memory > tagging allows to catch such invalid usage. > > > * When the kernel dereferences a pointer on userspace's behalf, it > > shall behave equivalently to userspace dereferencing the same pointer, > > including use of the same tag (where passed by userspace). >
Re: [RFC][PATCH 0/3] arm64 relaxed ABI
On 19/12/2018 12:52, Dave Martin wrote: On Tue, Dec 18, 2018 at 05:59:38PM +, Catalin Marinas wrote: On Tue, Dec 18, 2018 at 04:03:38PM +0100, Andrey Konovalov wrote: On Wed, Dec 12, 2018 at 4:02 PM Catalin Marinas wrote: The summary of our internal discussions (mostly between kernel developers) is that we can't properly describe a user ABI that covers future syscalls or syscall extensions while not all syscalls accept tagged pointers. So we tweaked the requirements slightly to only allow tagged pointers back into the kernel *if* the originating address is from an anonymous mmap() or below sbrk(0). This should cover some of the ioctls or getsockopt(TCP_ZEROCOPY_RECEIVE) where the user passes a pointer to a buffer obtained via mmap() on the device operations. (sorry for not being clear on what Vincenzo's proposal implies) OK, I see. So I need to make the following changes to my patchset AFAIU. 1. Make sure that we only allow tagged user addresses that originate from an anonymous mmap() or below sbrk(0). How exactly should this check be performed? I don't think we should perform such checks. That's rather stating that the kernel only guarantees that the tagged pointers work if they originated from these memory ranges. I concur. Really, the kernel should do the expected thing with all "non-weird" memory. In lieu of a proper definition of "non-weird", I think we should have some lists of things that are explicitly included, and also excluded: OK: kernel-allocated process stack brk area MAP_ANONYMOUS | MAP_PRIVATE MAP_PRIVATE mappings of /dev/zero Not OK: MAP_SHARED mmaps of non-memory-like devices mmaps of anything that is not a regular file the VDSO ... In general, userspace can tag memory that it "owns", and we do not assume a transfer of ownership except in the "OK" list above. Otherwise, it's the kernel's memory, or the owner is simply not well defined. Agreed on the general idea: a process should be able to pass tagged pointers at the syscall interface, as long as they point to memory privately owned by the process. I think it would be possible to simplify the definition of "non-weird" memory by using only this "OK" list: - mmap() done by the process itself, where either: * flags = MAP_PRIVATE | MAP_ANONYMOUS * flags = MAP_PRIVATE and fd refers to a regular file or a well-defined list of device files (like /dev/zero) - brk() done by the process itself - Any memory mapped by the kernel in the new process's address space during execve(), with the same restrictions as above ([vdso]/[vvar] are therefore excluded) I would also like to see advice for userspace developers, particularly things like (strawman, please challenge!): To some extent, one could call me a userspace developer, so I'll try to help :) * Userspace should set tags at the point of allocation only. Yes, tags are only supposed to be set at the point of either allocation or deallocation/reallocation. However, allocators can in principle be nested, so an allocator could take a region allocated by malloc() as input and subdivide it (changing tags in the process). That said, this suballocator must not free() that region until all the suballocations themselves have been freed (thereby restoring the tags initially set by malloc()). * If you don't know how an object was allocated, you cannot modify the tag, period. Agreed, allocators that tag memory can only operate on memory with a well-defined provenance (for instance anonymous mmap() or malloc()). * A single C object should be accessed using a single, fixed pointer tag throughout its entire lifetime. Agreed. Allocators themselves may need to be excluded though, depending on how they represent their managed memory. * Tags can be changed only when there are no outstanding pointers to the affected object or region that may be used to access the object or region (i.e., if the object were allocated from the C heap and is it safe to realloc() it, then it is safe to change the tag; for other types of allocation, analogous arguments can be applied). Tags can only be changed at the point of deallocation/reallocation. Pointers to the object become invalid and cannot be used after it has been deallocated; memory tagging allows to catch such invalid usage. * When the kernel dereferences a pointer on userspace's behalf, it shall behave equivalently to userspace dereferencing the same pointer, including use of the same tag (where passed by userspace). * Where the pointer tag affects pointer dereference behaviour (i.e., with hardware memory colouring) the kernel makes no guarantee to honour pointer tags correctly for every location a buffer based on a pointer passed by userspace to the kernel. (This means for example that for a read(fd, buf, size), we can check the tag for a single arbitrary location in *(cha
Re: [RFC][PATCH 0/3] arm64 relaxed ABI
On Tue, Dec 18, 2018 at 05:59:38PM +, Catalin Marinas wrote: > On Tue, Dec 18, 2018 at 04:03:38PM +0100, Andrey Konovalov wrote: > > On Wed, Dec 12, 2018 at 4:02 PM Catalin Marinas > > wrote: > > > The summary of our internal discussions (mostly between kernel > > > developers) is that we can't properly describe a user ABI that covers > > > future syscalls or syscall extensions while not all syscalls accept > > > tagged pointers. So we tweaked the requirements slightly to only allow > > > tagged pointers back into the kernel *if* the originating address is > > > from an anonymous mmap() or below sbrk(0). This should cover some of the > > > ioctls or getsockopt(TCP_ZEROCOPY_RECEIVE) where the user passes a > > > pointer to a buffer obtained via mmap() on the device operations. > > > > > > (sorry for not being clear on what Vincenzo's proposal implies) > > > > OK, I see. So I need to make the following changes to my patchset AFAIU. > > > > 1. Make sure that we only allow tagged user addresses that originate > > from an anonymous mmap() or below sbrk(0). How exactly should this > > check be performed? > > I don't think we should perform such checks. That's rather stating that > the kernel only guarantees that the tagged pointers work if they > originated from these memory ranges. I concur. Really, the kernel should do the expected thing with all "non-weird" memory. In lieu of a proper definition of "non-weird", I think we should have some lists of things that are explicitly included, and also excluded: OK: kernel-allocated process stack brk area MAP_ANONYMOUS | MAP_PRIVATE MAP_PRIVATE mappings of /dev/zero Not OK: MAP_SHARED mmaps of non-memory-like devices mmaps of anything that is not a regular file the VDSO ... In general, userspace can tag memory that it "owns", and we do not assume a transfer of ownership except in the "OK" list above. Otherwise, it's the kernel's memory, or the owner is simply not well defined. I would also like to see advice for userspace developers, particularly things like (strawman, please challenge!): * Userspace should set tags at the point of allocation only. * If you don't know how an object was allocated, you cannot modify the tag, period. * A single C object should be accessed using a single, fixed pointer tag throughout its entire lifetime. * Tags can be changed only when there are no outstanding pointers to the affected object or region that may be used to access the object or region (i.e., if the object were allocated from the C heap and is it safe to realloc() it, then it is safe to change the tag; for other types of allocation, analogous arguments can be applied). * When the kernel dereferences a pointer on userspace's behalf, it shall behave equivalently to userspace dereferencing the same pointer, including use of the same tag (where passed by userspace). * Where the pointer tag affects pointer dereference behaviour (i.e., with hardware memory colouring) the kernel makes no guarantee to honour pointer tags correctly for every location a buffer based on a pointer passed by userspace to the kernel. (This means for example that for a read(fd, buf, size), we can check the tag for a single arbitrary location in *(char (*)[size])buf before passing the buffer to get_user_pages(). Hopefully this could be done in get_user_pages() itself rather than hunting call sites. For userspace, it means that you're on your own if you ask the kernel to operate on a buffer than spans multiple, independently- allocated objects, or a deliberately striped single object.) * The kernel shall not extend the lifetime of user pointers in ways that are not clear from the specification of the syscall or interface to which the pointer is passed (and in any case shall not extend pointer lifetimes without good reason). So no clever transparent caching between syscalls, unless it _really_ is transparent in the presence of tags. * For purposes other than dereference, the kernel shall accept any legitimately tagged pointer (according to the above rules) as identifying the associated memory location. So, mprotect(some_page_aligned_object, ...); is valid irrespective of where page_aligned_object() came from. There is no implicit derefence by the kernel here, hence no tag check. The kernel does not guarantee to work correctly if the wrong tag is used, but there is not always a well-defined "right" tag, so we can't really guarantee to check it. So a pointer derived by any reasonable means by userspace has to be treated as equally valid. We would need to get some cross-arch buy-in for this, otherwise core maintainers might just refuse to maintain the necessary guarantees. > > 2. Allow tagged addressed passed to memory syscalls (as long as (1) is > > satisfied). Do I understand correctly that this me
Re: [RFC][PATCH 0/3] arm64 relaxed ABI
On Tue, Dec 18, 2018 at 04:03:38PM +0100, Andrey Konovalov wrote: > On Wed, Dec 12, 2018 at 4:02 PM Catalin Marinas > wrote: > > The summary of our internal discussions (mostly between kernel > > developers) is that we can't properly describe a user ABI that covers > > future syscalls or syscall extensions while not all syscalls accept > > tagged pointers. So we tweaked the requirements slightly to only allow > > tagged pointers back into the kernel *if* the originating address is > > from an anonymous mmap() or below sbrk(0). This should cover some of the > > ioctls or getsockopt(TCP_ZEROCOPY_RECEIVE) where the user passes a > > pointer to a buffer obtained via mmap() on the device operations. > > > > (sorry for not being clear on what Vincenzo's proposal implies) > > OK, I see. So I need to make the following changes to my patchset AFAIU. > > 1. Make sure that we only allow tagged user addresses that originate > from an anonymous mmap() or below sbrk(0). How exactly should this > check be performed? I don't think we should perform such checks. That's rather stating that the kernel only guarantees that the tagged pointers work if they originated from these memory ranges. > 2. Allow tagged addressed passed to memory syscalls (as long as (1) is > satisfied). Do I understand correctly that this means that I need to > locate all find_vma() callers outside of mm/ and fix them up as well? Yes (unless anyone as a better idea or objections to this approach). BTW, I'll be off until the new year, so won't be able to follow up. -- Catalin
Re: [RFC][PATCH 0/3] arm64 relaxed ABI
On Wed, Dec 12, 2018 at 4:02 PM Catalin Marinas wrote: > > Hi Andrey, > > On Wed, Dec 12, 2018 at 03:23:25PM +0100, Andrey Konovalov wrote: > > On Mon, Dec 10, 2018 at 3:31 PM Vincenzo Frascino > > wrote: > > > On arm64 the TCR_EL1.TBI0 bit has been set since Linux 3.x hence > > > the userspace (EL0) is allowed to set a non-zero value in the top > > > byte but the resulting pointers are not allowed at the user-kernel > > > syscall ABI boundary. > > > > > > This patchset proposes a relaxation of the ABI and a mechanism to > > > advertise it to the userspace via an AT_FLAGS. > > > > > > The rationale behind the choice of AT_FLAGS is that the Unix System V > > > ABI defines AT_FLAGS as "flags", leaving some degree of freedom in > > > interpretation. > > > There are two previous attempts of using AT_FLAGS in the Linux Kernel > > > for different reasons: the first was more generic and was used to expose > > > the support for the GNU STACK NX feature [1] and the second was done for > > > the MIPS architecture and was used to expose the support of "MIPS ABI > > > Extension for IEEE Std 754 Non-Compliant Interlinking" [2]. > > > Both the changes are currently _not_ merged in mainline. > > > The only architecture that reserves some of the bits in AT_FLAGS is > > > currently MIPS, which introduced the concept of platform specific ABI > > > (psABI) reserving the top-byte [3]. > > > > > > When ARM64_AT_FLAGS_SYSCALL_TBI is set the kernel is advertising > > > to the userspace that a relaxed ABI is supported hence this type > > > of pointers are now allowed to be passed to the syscalls when they are > > > in memory ranges obtained by anonymous mmap() or brk(). > > > > > > The userspace _must_ verify that the flag is set before passing tagged > > > pointers to the syscalls allowed by this relaxation. > > > > > > More in general, exposing the ARM64_AT_FLAGS_SYSCALL_TBI flag and > > > mandating > > > to the software to check that the feature is present, before using the > > > associated functionality, it provides a degree of control on the decision > > > of disabling such a feature in future without consequently breaking the > > > userspace. > [...] > > Acked-by: Andrey Konovalov > > Thanks for the ack. However, if we go ahead with this ABI proposal it > means that your patches need to be reworked to allow a non-zero top byte > in all syscalls, including mmap() and friends, ioctl(). There are ABI > concerns in either case but I'd rather have this discussion in the open. > It doesn't necessarily mean that I endorse this proposal, I would like > feedback and not just from kernel developers but user space ones. > > The summary of our internal discussions (mostly between kernel > developers) is that we can't properly describe a user ABI that covers > future syscalls or syscall extensions while not all syscalls accept > tagged pointers. So we tweaked the requirements slightly to only allow > tagged pointers back into the kernel *if* the originating address is > from an anonymous mmap() or below sbrk(0). This should cover some of the > ioctls or getsockopt(TCP_ZEROCOPY_RECEIVE) where the user passes a > pointer to a buffer obtained via mmap() on the device operations. > > (sorry for not being clear on what Vincenzo's proposal implies) OK, I see. So I need to make the following changes to my patchset AFAIU. 1. Make sure that we only allow tagged user addresses that originate from an anonymous mmap() or below sbrk(0). How exactly should this check be performed? 2. Allow tagged addressed passed to memory syscalls (as long as (1) is satisfied). Do I understand correctly that this means that I need to locate all find_vma() callers outside of mm/ and fix them up as well?
Re: [RFC][PATCH 0/3] arm64 relaxed ABI
Hi Andrey, On Wed, Dec 12, 2018 at 03:23:25PM +0100, Andrey Konovalov wrote: > On Mon, Dec 10, 2018 at 3:31 PM Vincenzo Frascino > wrote: > > On arm64 the TCR_EL1.TBI0 bit has been set since Linux 3.x hence > > the userspace (EL0) is allowed to set a non-zero value in the top > > byte but the resulting pointers are not allowed at the user-kernel > > syscall ABI boundary. > > > > This patchset proposes a relaxation of the ABI and a mechanism to > > advertise it to the userspace via an AT_FLAGS. > > > > The rationale behind the choice of AT_FLAGS is that the Unix System V > > ABI defines AT_FLAGS as "flags", leaving some degree of freedom in > > interpretation. > > There are two previous attempts of using AT_FLAGS in the Linux Kernel > > for different reasons: the first was more generic and was used to expose > > the support for the GNU STACK NX feature [1] and the second was done for > > the MIPS architecture and was used to expose the support of "MIPS ABI > > Extension for IEEE Std 754 Non-Compliant Interlinking" [2]. > > Both the changes are currently _not_ merged in mainline. > > The only architecture that reserves some of the bits in AT_FLAGS is > > currently MIPS, which introduced the concept of platform specific ABI > > (psABI) reserving the top-byte [3]. > > > > When ARM64_AT_FLAGS_SYSCALL_TBI is set the kernel is advertising > > to the userspace that a relaxed ABI is supported hence this type > > of pointers are now allowed to be passed to the syscalls when they are > > in memory ranges obtained by anonymous mmap() or brk(). > > > > The userspace _must_ verify that the flag is set before passing tagged > > pointers to the syscalls allowed by this relaxation. > > > > More in general, exposing the ARM64_AT_FLAGS_SYSCALL_TBI flag and mandating > > to the software to check that the feature is present, before using the > > associated functionality, it provides a degree of control on the decision > > of disabling such a feature in future without consequently breaking the > > userspace. [...] > Acked-by: Andrey Konovalov Thanks for the ack. However, if we go ahead with this ABI proposal it means that your patches need to be reworked to allow a non-zero top byte in all syscalls, including mmap() and friends, ioctl(). There are ABI concerns in either case but I'd rather have this discussion in the open. It doesn't necessarily mean that I endorse this proposal, I would like feedback and not just from kernel developers but user space ones. The summary of our internal discussions (mostly between kernel developers) is that we can't properly describe a user ABI that covers future syscalls or syscall extensions while not all syscalls accept tagged pointers. So we tweaked the requirements slightly to only allow tagged pointers back into the kernel *if* the originating address is from an anonymous mmap() or below sbrk(0). This should cover some of the ioctls or getsockopt(TCP_ZEROCOPY_RECEIVE) where the user passes a pointer to a buffer obtained via mmap() on the device operations. (sorry for not being clear on what Vincenzo's proposal implies) -- Catalin
Re: [RFC][PATCH 0/3] arm64 relaxed ABI
On Mon, Dec 10, 2018 at 3:31 PM Vincenzo Frascino wrote: > > On arm64 the TCR_EL1.TBI0 bit has been set since Linux 3.x hence > the userspace (EL0) is allowed to set a non-zero value in the top > byte but the resulting pointers are not allowed at the user-kernel > syscall ABI boundary. > > This patchset proposes a relaxation of the ABI and a mechanism to > advertise it to the userspace via an AT_FLAGS. > > The rationale behind the choice of AT_FLAGS is that the Unix System V > ABI defines AT_FLAGS as "flags", leaving some degree of freedom in > interpretation. > There are two previous attempts of using AT_FLAGS in the Linux Kernel > for different reasons: the first was more generic and was used to expose > the support for the GNU STACK NX feature [1] and the second was done for > the MIPS architecture and was used to expose the support of "MIPS ABI > Extension for IEEE Std 754 Non-Compliant Interlinking" [2]. > Both the changes are currently _not_ merged in mainline. > The only architecture that reserves some of the bits in AT_FLAGS is > currently MIPS, which introduced the concept of platform specific ABI > (psABI) reserving the top-byte [3]. > > When ARM64_AT_FLAGS_SYSCALL_TBI is set the kernel is advertising > to the userspace that a relaxed ABI is supported hence this type > of pointers are now allowed to be passed to the syscalls when they are > in memory ranges obtained by anonymous mmap() or brk(). > > The userspace _must_ verify that the flag is set before passing tagged > pointers to the syscalls allowed by this relaxation. > > More in general, exposing the ARM64_AT_FLAGS_SYSCALL_TBI flag and mandating > to the software to check that the feature is present, before using the > associated functionality, it provides a degree of control on the decision > of disabling such a feature in future without consequently breaking the > userspace. > > The change required a modification of the elf common code, because in Linux > the AT_FLAGS are currently set to zero by default by the kernel. > > The newly added flag has been verified on arm64 using the code below. > #include > #include > #include > > #define ARM64_AT_FLAGS_SYSCALL_TBI (1 << 0) > > bool arm64_syscall_tbi_is_present(void) > { > unsigned long at_flags = getauxval(AT_FLAGS); > if (at_flags & ARM64_AT_FLAGS_SYSCALL_TBI) > return true; > > return false; > } > > void main() > { > if (arm64_syscall_tbi_is_present()) > printf("ARM64_AT_FLAGS_SYSCALL_TBI is present\n"); > } > > This patchset should be merged together with [4]. > > [1] https://patchwork.ozlabs.org/patch/579578/ > [2] https://lore.kernel.org/patchwork/cover/618280/ > [3] ftp://www.linux-mips.org/pub/linux/mips/doc/ABI/psABI_mips3.0.pdf > [4] https://patchwork.kernel.org/cover/10674351/ > > ABI References: > --- > Sco SysV ABI: http://www.sco.com/developers/gabi/2003-12-17/contents.html > PowerPC AUXV: > http://openpowerfoundation.org/wp-content/uploads/resources/leabi/content/dbdoclet.50655242_98651.html > AMD64 ABI: https://www.cs.tufts.edu/comp/40-2012f/readings/amd64-abi.pdf > x86 ABI: https://www.uclibc.org/docs/psABI-i386.pdf > MIPS ABI: ftp://www.linux-mips.org/pub/linux/mips/doc/ABI/psABI_mips3.0.pdf > ARM ABI: > http://infocenter.arm.com/help/topic/com.arm.doc.ihi0044f/IHI0044F_aaelf.pdf > SPARC ABI: http://math-atlas.sourceforge.net/devel/assembly/abi_sysV_sparc.pdf > > Cc: Catalin Marinas > Cc: Will Deacon > Cc: Mark Rutland > Cc: Robin Murphy > Cc: Kees Cook > Cc: Kate Stewart > Cc: Greg Kroah-Hartman > Cc: Andrew Morton > Cc: Ingo Molnar > Cc: "Kirill A . Shutemov" > Cc: Shuah Khan > Cc: Chintan Pandya > Cc: Jacob Bramley > Cc: Ruben Ayrapetyan > Cc: Andrey Konovalov > Cc: Lee Smith > Cc: Kostya Serebryany > Cc: Dmitry Vyukov , > Cc: Ramana Radhakrishnan > Cc: Luc Van Oostenryck > Cc: Evgeniy Stepanov > CC: Alexander Viro > Signed-off-by: Vincenzo Frascino > > Vincenzo Frascino (3): > elf: Make AT_FLAGS arch configurable > arm64: Define Documentation/arm64/elf_at_flags.txt > arm64: elf: Advertise relaxed ABI > > Documentation/arm64/elf_at_flags.txt | 111 ++ > arch/arm64/include/asm/atflags.h | 7 ++ > arch/arm64/include/asm/elf.h | 5 ++ > arch/arm64/include/uapi/asm/atflags.h | 8 ++ > fs/binfmt_elf.c | 6 +- > fs/binfmt_elf_fdpic.c | 6 +- > fs/compat_binfmt_elf.c| 5 ++ > 7 files changed, 146 insertions(+), 2 deletions(-) > create mode 100644 Documentation/arm64/elf_at_flags.txt > create mode 100644 arch/arm64/include/asm/atflags.h > create mode 100644 arch/arm64/include/uapi/asm/atflags.h > > -- > 2.19.2 > Acked-by: Andrey Konovalov