Re: [PATCH 1/1] Replace macro "ARCH_HAVE_EXTRA_ELF_NOTES" with kconfig

2024-04-22 Thread Florian Weimer
* Vignesh Balasubramanian:

> diff --git a/include/linux/elf.h b/include/linux/elf.h
> index c9a46c4e183b..5c402788da19 100644
> --- a/include/linux/elf.h
> +++ b/include/linux/elf.h
> @@ -65,7 +65,7 @@ extern Elf64_Dyn _DYNAMIC [];
>  struct file;
>  struct coredump_params;
>  
> -#ifndef ARCH_HAVE_EXTRA_ELF_NOTES
> +#ifndef CONFIG_ARCH_HAVE_EXTRA_ELF_NOTES

You could add

  #pragma GCC poison ARCH_HAVE_EXTRA_ELF_NOTES

to a central header, to let GCC and Clang flag uses that have not been
converted.

Thanks,
Florian



Re: Add fchmodat2() - or add a more general syscall?

2023-07-25 Thread Florian Weimer
* David Howells:

> Rather than adding a fchmodat2() syscall, should we add a "set_file_attrs()"
> syscall that takes a mask and allows you to set a bunch of stuff all in one
> go?  Basically, an interface to notify_change() in the kernel that would allow
> several stats to be set atomically.  This might be of particular interest to
> network filesystems.

Do you mean atomically as in compare-and-swap (update only if old values
match), or just a way to update multiple file attributes with a single
system call?

Thanks,
Florian



Re: [PATCH v3 0/5] Add a new fchmodat4() syscall

2023-07-11 Thread Florian Weimer
* Alexey Gladkov:

> This patch set adds fchmodat4(), a new syscall. The actual
> implementation is super simple: essentially it's just the same as
> fchmodat(), but LOOKUP_FOLLOW is conditionally set based on the flags.
> I've attempted to make this match "man 2 fchmodat" as closely as
> possible, which says EINVAL is returned for invalid flags (as opposed to
> ENOTSUPP, which is currently returned by glibc for AT_SYMLINK_NOFOLLOW).
> I have a sketch of a glibc patch that I haven't even compiled yet, but
> seems fairly straight-forward:
>
> diff --git a/sysdeps/unix/sysv/linux/fchmodat.c 
> b/sysdeps/unix/sysv/linux/fchmodat.c
> index 6d9cbc1ce9e0..b1beab76d56c 100644
> --- a/sysdeps/unix/sysv/linux/fchmodat.c
> +++ b/sysdeps/unix/sysv/linux/fchmodat.c
> @@ -29,12 +29,36 @@
>  int
>  fchmodat (int fd, const char *file, mode_t mode, int flag)
>  {
> -  if (flag & ~AT_SYMLINK_NOFOLLOW)
> -return INLINE_SYSCALL_ERROR_RETURN_VALUE (EINVAL);
> -#ifndef __NR_lchmod  /* Linux so far has no lchmod syscall.  
> */
> +  /* There are four paths through this code:
> +  - The flags are zero.  In this case it's fine to call fchmodat.
> +  - The flags are non-zero and glibc doesn't have access to
> + __NR_fchmodat4.  In this case all we can do is emulate the error codes
> + defined by the glibc interface from userspace.
> +  - The flags are non-zero, glibc has __NR_fchmodat4, and the kernel 
> has
> + fchmodat4.  This is the simplest case, as the fchmodat4 syscall exactly
> + matches glibc's library interface so it can be called directly.
> +  - The flags are non-zero, glibc has __NR_fchmodat4, but the kernel 
> does

If you define __NR_fchmodat4 on all architectures, we can use these
constants directly in glibc.  That way, we no longer depend on the UAPI
definitions of those constants, to cut down the number of code variants,
and to make glibc's system call profile independent of the kernel header
version at build time.

Your version is based on 2.31, more recent versions have some reasonable
emulation for fchmodat based on /proc/self/fd.  I even wrote a comment
describing the same buggy behavior that you witnessed:

+  /* Some Linux versions with some file systems can actually
+     change symbolic link permissions via /proc, but this is not
+     intentional, and it gives inconsistent results (e.g., error
+     return despite mode change).  The expected behavior is that
+     symbolic link modes cannot be changed at all, and this check
+     enforces that.  */
+  if (S_ISLNK (st.st_mode))
+   {
+ __close_nocancel (pathfd);
+ __set_errno (EOPNOTSUPP);
+ return -1;
+   }

I think there was some kernel discussion about that behavior before, but
apparently, it hasn't led to fixes.

I wonder if it makes sense to add a similar error return to the system
call implementation?

> + not.  In this case we must respect the error codes defined by the glibc
> + interface instead of returning ENOSYS.
> +The intent here is to ensure that the kernel is called at most once 
> per
> +library call, and that the error types defined by glibc are always
> +respected.  */
> +
> +#ifdef __NR_fchmodat4
> +  long result;
> +#endif
> +
> +  if (flag == 0)
> +return INLINE_SYSCALL (fchmodat, 3, fd, file, mode);
> +
> +#ifdef __NR_fchmodat4
> +  result = INLINE_SYSCALL (fchmodat4, 4, fd, file, mode, flag);
> +  if (result == 0 || errno != ENOSYS)
> +return result;
> +#endif

The last if condition is the recommended approach, but in the past, it
broke container host compatibility pretty badly due to seccomp filters
that return EPERM instead of ENOSYS.  I guess we'll learn soon enough if
that's been fixed by now. 8-P

Thanks,
Florian



Re: [PATCH v3 5/5] selftests: add fchmodat4(2) selftest

2023-07-11 Thread Florian Weimer
* Alexey Gladkov:

> The test is marked as skipped if a syscall with the AT_SYMLINK_NOFOLLOW flag
> fails. This is because not all filesystems support changing the mode
> bits of symlinks properly. These filesystems return an error but change
> the mode bits:
>
> newfstatat(4, "regfile", {st_mode=S_IFREG|0640, st_size=0, ...}, 
> AT_SYMLINK_NOFOLLOW) = 0
> newfstatat(4, "symlink", {st_mode=S_IFLNK|0777, st_size=7, ...}, 
> AT_SYMLINK_NOFOLLOW) = 0
> syscall_0x1c3(0x4, 0x55fa1f244396, 0x180, 0x100, 0x55fa1f24438e, 0x34) = -1 
> EOPNOTSUPP (Operation not supported)
> newfstatat(4, "regfile", {st_mode=S_IFREG|0640, st_size=0, ...}, 
> AT_SYMLINK_NOFOLLOW) = 0
>
> This happens with btrfs and xfs:
>
>  $ /kernel/tools/testing/selftests/fchmodat4/fchmodat4_test
>  TAP version 13
>  1..1
>  ok 1 # SKIP fchmodat4(symlink)
>  # Totals: pass:0 fail:0 xfail:0 xpass:0 skip:1 error:0
>
>  $ stat /tmp/ksft-fchmodat4.*/symlink
>File: /tmp/ksft-fchmodat4.3NCqlE/symlink -> regfile
>Size: 7   Blocks: 0  IO Block: 4096   symbolic link
>  Device: 7,0 Inode: 133 Links: 1
>  Access: (0600/lrw-------)  Uid: (0/root)   Gid: (0/root)
>
> Signed-off-by: Alexey Gladkov 

This looks like a bug in those file systems?

As an extra test, “echo 3 > /proc/sys/vm/drop_caches” sometimes has
strange effects in such cases because the bits are not actually stored
on disk, only in the dentry cache.

Thanks,
Florian



Re: [PATCH Linux] powerpc: add documentation for HWCAPs

2022-05-24 Thread Florian Weimer
* Nicholas Piggin:

> +2. Facilities
> +-
> +The Power ISA uses the term "facility" to describe a class of instructions,
> +registers, interrupts, etc. The presence or absence of a facility indicates
> +whether this class is available to be used, but the specifics depend on the
> +ISA version. For example, if the VSX facility is available, the VSX
> +instructions that can be used differ between the v3.0B and v3.1B ISA
> versions.

The 2.07 ISA manual also has categories.  ISA 3.0 made a lot of things
mandatory.  It may make sense to clarify that feature bits for mandatory
aspects of the ISA are still set, to help with backwards compatibility.

Thanks,
Florian



Re: Linux powerpc new system call instruction and ABI

2021-05-21 Thread Florian Weimer
* Matheus Castanho via Libc-alpha:

> From: Nicholas Piggin 
> Subject: [PATCH 1/1] powerpc: Fix handling of scv return error codes
>
> When using scv on templated ASM syscalls, current code interprets any
> negative return value as error, but the only valid error codes are in
> the range -4095..-1 according to the ABI.
>
> Reviewed-by: Matheus Castanho 

Please reference bug 27892 in the commit message.  I'd also appreciate a
backport to the 2.33 release branch (where you need to add NEWS manually
to add the bug reference).

Thanks,
Florian



Re: [PATCH v2] powerpc/64/signal: balance return predictor stack in signal trampoline

2021-01-22 Thread Florian Weimer
* Nicholas Piggin:

> diff --git a/arch/powerpc/kernel/vdso64/sigtramp.S 
> b/arch/powerpc/kernel/vdso64/sigtramp.S
> index a8cc0409d7d2..bbf68cd01088 100644
> --- a/arch/powerpc/kernel/vdso64/sigtramp.S
> +++ b/arch/powerpc/kernel/vdso64/sigtramp.S
> @@ -6,6 +6,7 @@
>   * Copyright (C) 2004 Benjamin Herrenschmuidt (b...@kernel.crashing.org), 
> IBM Corp.
>   * Copyright (C) 2004 Alan Modra (amo...@au.ibm.com)), IBM Corp.
>   */
> +#include/* IFETCH_ALIGN_BYTES */
>  #include 
>  #include 
>  #include 
> @@ -14,21 +15,17 @@
>  
>   .text
>  
> -/* The nop here is a hack.  The dwarf2 unwind routines subtract 1 from
> -   the return address to get an address in the middle of the presumed
> -   call instruction.  Since we don't have a call here, we artificially
> -   extend the range covered by the unwind info by padding before the
> -   real start.  */
> - nop
>   .balign 8
> + .balign IFETCH_ALIGN_BYTES
>  V_FUNCTION_BEGIN(__kernel_sigtramp_rt64)
> -.Lsigrt_start = . - 4
> +.Lsigrt_start:
> + bctrl   /* call the handler */
>   addir1, r1, __SIGNAL_FRAMESIZE
>   li  r0,__NR_rt_sigreturn
>   sc
>  .Lsigrt_end:
>  V_FUNCTION_END(__kernel_sigtramp_rt64)
> -/* The ".balign 8" above and the following zeros mimic the old stack
> +/* The .balign 8 above and the following zeros mimic the old stack
> trampoline layout.  The last magic value is the ucontext pointer,
> chosen in such a way that older libgcc unwind code returns a zero
> for a sigcontext pointer.  */

As far as I understand it, this breaks cancellation handling on musl and
future glibc because it is necessary to look at the signal delivery
location to see if a system call sequence has resulted in an action, and
that location is no longer in user code after this change.

We have a glibc test in preparation of our change, and it started
failing:

  Linux 5.10 breaks sigcontext_get_pc on powerpc64
  

Isn't it possible to avoid the return predictor desynchronization by
adding the appropriate hint?

Thanks,
Florian
-- 
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill



Re: Add a new fchmodat4() syscall, v2

2020-06-09 Thread Florian Weimer
* Palmer Dabbelt:

> This patch set adds fchmodat4(), a new syscall. The actual
> implementation is super simple: essentially it's just the same as
> fchmodat(), but LOOKUP_FOLLOW is conditionally set based on the flags.
> I've attempted to make this match "man 2 fchmodat" as closely as
> possible, which says EINVAL is returned for invalid flags (as opposed to
> ENOTSUPP, which is currently returned by glibc for AT_SYMLINK_NOFOLLOW).
> I have a sketch of a glibc patch that I haven't even compiled yet, but
> seems fairly straight-forward:

What's the status here?  We'd really like to see this system call in the
kernel because our emulation in glibc has its problems (especially if
/proc is not mounted).

Thanks,
Florian



Re: [PATCH 2/2] selftests: vm: pkeys: Fix powerpc access right updates

2020-05-08 Thread Florian Weimer
* Sandipan Das:

> Hi Florian,
>
> On 08/05/20 11:31 pm, Florian Weimer wrote:
>> * Sandipan Das:
>> 
>>> The Power ISA mandates that all writes to the Authority
>>> Mask Register (AMR) must always be preceded as well as
>>> succeeded by a context-synchronizing instruction. This
>>> applies to both the privileged and unprivileged variants
>>> of the Move To AMR instruction.
>> 
>> Ugh.  Do you have a reference for that?
>> 
>> We need to fix this in glibc.
>> 
>
> This is from Table 6 of Chapter 11 in page 1134 of Power
> ISA 3.0B. The document can be found here:
> https://ibm.ent.box.com/s/1hzcwkwf8rbju5h9iyf44wm94amnlcrv

Thanks a lot!  I filed:

  <https://sourceware.org/bugzilla/show_bug.cgi?id=25954>

Florian



Re: [PATCH 2/2] selftests: vm: pkeys: Fix powerpc access right updates

2020-05-08 Thread Florian Weimer
* Sandipan Das:

> The Power ISA mandates that all writes to the Authority
> Mask Register (AMR) must always be preceded as well as
> succeeded by a context-synchronizing instruction. This
> applies to both the privileged and unprivileged variants
> of the Move To AMR instruction.

Ugh.  Do you have a reference for that?

We need to fix this in glibc.

Thanks,
Florian



Re: [musl] Re: Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-22 Thread Florian Weimer
* Nicholas Piggin:

> So I would be disinclined to use SIGSYS unless there are no other better 
> signal types, and we don't want to use SIGILL for some good reason -- is 
> there a good reason to add complexity for userspace by differentiating 
> these two situations?

No, SIGILL seems fine to me.  scv 0 and scv 1 could well be considered
different instructions eventually (with different mnemonics).


Re: [musl] Re: Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-22 Thread Florian Weimer
* Nicholas Piggin:

> Another option would be to use a different signal. I don't see that any 
> are more suitable.

SIGSYS comes to my mind.  But I don't know how exclusively it is
associated with seccomp these days.


Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-21 Thread Florian Weimer
* Szabolcs Nagy:

> * Nicholas Piggin  [2020-04-20 12:08:36 +1000]:
>> Excerpts from Rich Felker's message of April 20, 2020 11:29 am:
>> > Also, allowing patching of executable pages is generally frowned upon
>> > these days because W^X is a desirable hardening property.
>> 
>> Right, it would want be write-protected after being patched.
>
> "frowned upon" means that users may have to update
> their security policy setting in pax, selinux, apparmor,
> seccomp bpf filters and who knows what else that may
> monitor and flag W mprotect.
>
> libc update can break systems if the new libc does W

It's possible to map over pre-compiled alternative implementations,
though.  Basically, we would do the patching at build time and store
the results in the file.

It works best if the variance is concentrated on a few pages, and
there are very few alternatives.  For example, having two syscall APIs
and supporting threading and no-threading versions would need four
code versions in total, which is likely excessive.


Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-17 Thread Florian Weimer
* Segher Boessenkool:

> On Thu, Apr 16, 2020 at 08:34:42PM -0400, Rich Felker wrote:
>> On Thu, Apr 16, 2020 at 06:02:35PM -0500, Segher Boessenkool wrote:
>> > On Thu, Apr 16, 2020 at 08:12:19PM +0200, Florian Weimer wrote:
>> > > > I think my choice would be just making the inline syscall be a single
>> > > > call insn to an asm source file that out-of-lines the loading of TOC
>> > > > pointer and call through it or branch based on hwcap so that it's not
>> > > > repeated all over the place.
>> > > 
>> > > I don't know how problematic control flow out of an inline asm is on
>> > > POWER.  But this is basically the -moutline-atomics approach.
>> > 
>> > Control flow out of inline asm (other than with "asm goto") is not
>> > allowed at all, just like on any other target (and will not work in
>> > practice, either -- just like on any other target).  But the suggestion
>> > was to use actual assembler code, not inline asm?
>> 
>> Calling it control flow out of inline asm is something of a misnomer.
>> The enclosing state is not discarded or altered; the asm statement
>> exits normally, reaching the next instruction in the enclosing
>> block/function as soon as the call from the asm statement returns,
>> with all register/clobber constraints satisfied.
>
> Ah.  That should always Just Work, then -- our ABIs guarantee you can.

After thinking about it, I agree: GCC will handle spilling of the link
register.  Branch-and-link instructions do not clobber the protected
zone, so no stack adjustment is needed (which would be problematic to
reflect in the unwind information).

Of course, the target function has to be written in assembler because
it must not use a regular stack frame.


Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-16 Thread Florian Weimer
* Nicholas Piggin via Libc-alpha:

> We may or may not be getting a new ABI that will use instructions not 
> supported by old processors.
>
> https://sourceware.org/legacy-ml/binutils/2019-05/msg00331.html
>
> Current ABI continues to work of course and be the default for some 
> time, but building for new one would give some opportunity to drop
> such support for old procs, at least for glibc.

If I recall correctly, during last year's GNU Tools Cauldron, I think
it was pretty clear that this was only to be used for intra-DSO ABIs,
not cross-DSO optimization.  Relocatable object files have an ABI,
too, of course, which is why ABI documentation is needed.

For cross-DSO optimization, the link editor would look at the DSO
being linked in, check if it uses the -mfuture ABI, and apply some
shortcuts.  But at that point, if the DSO is swapped back to a version
built without -mfuture, it no longer works with those newly linked
binaries against the -mfuture version.  Such a thing is a clear ABI
bump, and based on what I remember from Cauldron, that is not the plan
here.

(I don't have any insider knowledge—I just don't want people to read
this and think: gosh, yet another POWER ABI bump.  But the PCREL stuff
*is* exciting!)


Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-16 Thread Florian Weimer
* Rich Felker:

> On Thu, Apr 16, 2020 at 06:42:32PM +0200, Florian Weimer wrote:
>> * Rich Felker:
>> 
>> > On Thu, Apr 16, 2020 at 06:48:44AM +0200, Florian Weimer wrote:
>> >> * Rich Felker:
>> >> 
>> >> > My preference would be that it work just like the i386 AT_SYSINFO
>> >> > where you just replace "int $128" with "call *%%gs:16" and the kernel
>> >> > provides a stub in the vdso that performs either scv or the old
>> >> > mechanism with the same calling convention.
>> >> 
>> >> The i386 mechanism has received some criticism because it provides an
>> >> effective means to redirect execution flow to anyone who can write to
>> >> the TCB.  I am not sure if it makes sense to copy it.
>> >
>> > Indeed that's a good point. Do you have ideas for making it equally
>> > efficient without use of a function pointer in the TCB?
>> 
>> We could add a shared non-writable mapping at a 64K offset from the
>> thread pointer and store the function pointer or the code there.  Then
>> it would be safe.
>> 
>> However, since this is apparently tied to POWER9 and we already have a
>> POWER9 multilib, and assuming that we are going to backport the kernel
>> change, I would tweak the selection criterion for that multilib to
>> include the new HWCAP2 flag.  If a user runs this glibc on a kernel
>> which does not have support, they will get the baseline (POWER8)
>> multilib, which still works.  This way, outside the dynamic loader, no
>> run-time dispatch is needed at all.  I guess this is not at all the
>> answer you were looking for. 8-)
>
> How does this work with -static? :-)

-static is not supported. 8-) (If you use the unsupported static
libraries, you get POWER8 code.)

(Just to be clear, in case someone doesn't get the joke: This is about
a potential approach for a heavily constrained, vertically integrated
environment.  It does not reflect general glibc recommendations.)

>> If a single binary is needed, I would perhaps follow what Arm did for
>> -moutline-atomics: lay out the code so that it's easy to execute for
>> the non-POWER9 case, assuming that POWER9 machines will be better at
>> predicting things than their predecessors.
>> 
>> Or you could also put the function pointer into a RELRO segment.  Then
>> there's overlap with the __libc_single_threaded discussion, where
>> people objected to this kind of optimization (although I did not
>> propose to change the TCB ABI, that would be required for
>> __libc_single_threaded because it's an external interface).
>
> Of course you can use a normal global, but now every call point needs
> to setup a TOC pointer (= two entry points and more icache lines for
> otherwise trivial functions).
>
> I think my choice would be just making the inline syscall be a single
> call insn to an asm source file that out-of-lines the loading of TOC
> pointer and call through it or branch based on hwcap so that it's not
> repeated all over the place.

I don't know how problematic control flow out of an inline asm is on
POWER.  But this is basically the -moutline-atomics approach.

> Alternatively, it would perhaps work to just put hwcap in the TCB and
> branch on it rather than making an indirect call to a function pointer
> in the TCB, so that the worst you could do by clobbering it is execute
> the wrong syscall insn and thereby get SIGILL.

The HWCAP is already in the TCB.  I expect this is what generic glibc
builds are going to use (perhaps with a bit of tweaking favorable to
POWER8 implementations, but we'll see).


Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-16 Thread Florian Weimer
* Rich Felker:

> On Thu, Apr 16, 2020 at 06:48:44AM +0200, Florian Weimer wrote:
>> * Rich Felker:
>> 
>> > My preference would be that it work just like the i386 AT_SYSINFO
>> > where you just replace "int $128" with "call *%%gs:16" and the kernel
>> > provides a stub in the vdso that performs either scv or the old
>> > mechanism with the same calling convention.
>> 
>> The i386 mechanism has received some criticism because it provides an
>> effective means to redirect execution flow to anyone who can write to
>> the TCB.  I am not sure if it makes sense to copy it.
>
> Indeed that's a good point. Do you have ideas for making it equally
> efficient without use of a function pointer in the TCB?

We could add a shared non-writable mapping at a 64K offset from the
thread pointer and store the function pointer or the code there.  Then
it would be safe.

However, since this is apparently tied to POWER9 and we already have a
POWER9 multilib, and assuming that we are going to backport the kernel
change, I would tweak the selection criterion for that multilib to
include the new HWCAP2 flag.  If a user runs this glibc on a kernel
which does not have support, they will get the baseline (POWER8)
multilib, which still works.  This way, outside the dynamic loader, no
run-time dispatch is needed at all.  I guess this is not at all the
answer you were looking for. 8-)

If a single binary is needed, I would perhaps follow what Arm did for
-moutline-atomics: lay out the code so that it's easy to execute for
the non-POWER9 case, assuming that POWER9 machines will be better at
predicting things than their predecessors.

Or you could also put the function pointer into a RELRO segment.  Then
there's overlap with the __libc_single_threaded discussion, where
people objected to this kind of optimization (although I did not
propose to change the TCB ABI, that would be required for
__libc_single_threaded because it's an external interface).


Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-15 Thread Florian Weimer
* Rich Felker:

> My preference would be that it work just like the i386 AT_SYSINFO
> where you just replace "int $128" with "call *%%gs:16" and the kernel
> provides a stub in the vdso that performs either scv or the old
> mechanism with the same calling convention.

The i386 mechanism has received some criticism because it provides an
effective means to redirect execution flow to anyone who can write to
the TCB.  I am not sure if it makes sense to copy it.


Re: powerpc Linux scv support and scv system call ABI proposal

2020-01-30 Thread Florian Weimer
* Segher Boessenkool:

>> I *assumed* that I would still get
>> the value of r0 (the register) from the associated extended asm in this
>> expression, even if it may now be a different register.  Your comment
>> made me think that this is undefined.
>
> Please show full(er) examples, I think we are talking about something
> else?

No, I think we are in agreement here how things should behave under the
new model.  But I have doubts whether that is implemented in GCC 9.

>> The GCC documentation has this warning:
>> 
>> |  _Warning:_ In the above example, be aware that a register (for
>> | example 'r0') can be call-clobbered by subsequent code, including
>> | function calls and library calls for arithmetic operators on other
>> | variables (for example the initialization of 'p2').
>
> Yes.  This does not matter for the only supported use.

I'm not so sure.  See below.

> This is why that *is* the only supported use.  The documentation could
> use a touch-up, I think.  Unless we still have problems here?

I really don't know.  GCC still has *some* support for the old behavior,
though.  For example, local register variables are treated as
initialized, and I think you can still use registers like global
variables.  GCC does not perform copy propagation here:

int f1 (int);

int
f (void)
{
  register int edi __asm__ ("edi");
  int edi_copy = edi;
  return f1 (1) + edi_copy;
}

And the case that we agreed should be defined in fact is not:

void f1 (int);

int
f (void)
{
  register int edi __asm__ ("edi");
  asm ("#" : "=r" (edi));
  f1 (1);
  return edi;
}

On x86-64, %edi is used to pass the first function parameter, so the
call clobbers %edi.  It is simply ambiguous whether edi (the variable)
should retain the value prior to the call to f1 (which I think is what
should happen under the new model, where register variables are only
affect asm operands), or if edi (the variable) should have the value of
%edi (the register) after the call (the old model).

Should we move this to the gcc list?

Thanks,
Florian



Re: powerpc Linux scv support and scv system call ABI proposal

2020-01-30 Thread Florian Weimer
* Segher Boessenkool:

> On Wed, Jan 29, 2020 at 06:02:34PM +0100, Florian Weimer wrote:
>> * Segher Boessenkool:
>> 
>> > On Wed, Jan 29, 2020 at 05:19:19PM +0100, Florian Weimer wrote:
>> >> * Segher Boessenkool:
>> >> >> But GCC doesn't expose them as integers to C code, so you can't do much
>> >> >> without them.
>> >> >
>> >> > Sure, it doesn't expose any other registers directly, either.
>> >> 
>> >> I can use r0 & 1 with a register variable r0 to check a bit.
>> >
>> > That is not reliable, or supported, and it *will* break.  This is
>> > explicit for local register asm, and global register asm is
>> > underdefined.
>> 
>> Ugh.  I did not know that.  And neither did the person who wrote
>> powerpc64/sysdep.h because it uses register variables in regular C
>> expressions. 8-(  Other architectures are affected as well.
>
> Where?  I don't see any?  Ah, the other one, heh (there are two).
>
> No, that *is* supported: as input to or output from an asm, a local
> register asm variable *is* guaranteed to live in the specified register.
> This is the *only* supported use.  Other uses may sometimes still work,
> but they never worked reliably, and it cannot be made reliable; it has
> been documented as not supported since ages, and it will not work at all
> anymore some day.

I must say I find this situation *very* confusing.

You said that r0 & 1 is undefined.  I *assumed* that I would still get
the value of r0 (the register) from the associated extended asm in this
expression, even if it may now be a different register.  Your comment
made me think that this is undefined.  But then the syscall wrappers use
this construct:

__asm__ __volatile__\
  ("sc\n\t" \
   "mfcr  %0\n\t"   \
   "0:" \
   : "=&r" (r0),\
 "=&r" (r3), "=&r" (r4), "=&r" (r5),\
 "=&r" (r6), "=&r" (r7), "=&r" (r8) \
   : ASM_INPUT_##nr \
   : "r9", "r10", "r11", "r12", \
 "cr0", "ctr", "memory");   \
  err = r0;  \
r3;  \

That lone r3 at the end would be equally undefined because it is not
used in an input or output operand of an extended asm statement.

The GCC documentation has this warning:

|  _Warning:_ In the above example, be aware that a register (for
| example 'r0') can be call-clobbered by subsequent code, including
| function calls and library calls for arithmetic operators on other
| variables (for example the initialization of 'p2').

On POWER, the LOADARGS macros attempt to deal with this by using
non-register temporaries.  However, I don't know how effective this is
if the compiler really doesn't deal with call-clobbered registers
properly.

For the extended asm use case (to express register assignments that
cannot be listed in constraints), I would expect that these variables
retain their values according to the C specification (so they are never
clobbered by calls), but that they only reside in their designated
registers when used as input or output operands in extended asm
statements.

However, this is incompatible with other (ab)uses of local and global
register variables, which may well expect that they are clobbered by
calls.  It looks like GCC uses the same construct for two unrelated
things.

Thanks,
Florian



Re: [PATCH v16 00/23] selftests, powerpc, x86: Memory Protection Keys

2020-01-29 Thread Florian Weimer
* Dave Hansen:

> Still doesn't build for me:
>
>> # make
>> make --no-builtin-rules ARCH=x86_64 -C ../../../.. headers_install
>> make[1]: Entering directory '/home/dave/linux.git'
>>   INSTALL ./usr/include
>> make[1]: Leaving directory '/home/dave/linux.git'
>> make: *** No rule to make target 
>> '/home/dave/linux.git/tools/testing/selftests/vm/protection_keys_32', needed 
>> by 'all'.  Stop.

Do you have 32-bit libraries installed?

Thanks,
Florian



Re: powerpc Linux scv support and scv system call ABI proposal

2020-01-29 Thread Florian Weimer
* Segher Boessenkool:

> On Wed, Jan 29, 2020 at 05:19:19PM +0100, Florian Weimer wrote:
>> * Segher Boessenkool:
>> >> But GCC doesn't expose them as integers to C code, so you can't do much
>> >> without them.
>> >
>> > Sure, it doesn't expose any other registers directly, either.
>> 
>> I can use r0 & 1 with a register variable r0 to check a bit.
>
> That is not reliable, or supported, and it *will* break.  This is
> explicit for local register asm, and global register asm is
> underdefined.

Ugh.  I did not know that.  And neither did the person who wrote
powerpc64/sysdep.h because it uses register variables in regular C
expressions. 8-(  Other architectures are affected as well.

One set of issues is less of a problem if system call arguments are
variables and not complex expressions, so that side effects do not
clobber registers in the initialization:

register long __a0 asm ("$4") = (long) (arg1);

But I wasn't aware of that constraint on the macro users at all.

Thanks,
Florian



Re: powerpc Linux scv support and scv system call ABI proposal

2020-01-29 Thread Florian Weimer
* Segher Boessenkool:

> On Tue, Jan 28, 2020 at 05:04:49PM +0100, Florian Weimer wrote:
>> * Segher Boessenkool:
>> 
>> >> > I don't think we can save LR in a regular register around the system
>> >> > call, explicitly in the inline asm statement, because we still have to
>> >> > generate proper unwinding information using CFI directives, something
>> >> > that you cannot do from within the asm statement.
>> >
>> > Why not?
>> 
>> As far as I know, there isn't a CFI directive that allows us to restore
>> the CFI state at the end of the inline assembly.  If we say that LR is
>> stored in a different register than what the rest of the function uses,
>> that would lead to incorrect CFI after the exit of the inline assembler
>> fragment.
>> 
>> At least that's what I think.  Compilers aren't really my thing.
>
> .cfi_restore?  Or .cfi_remember_state / .cfi_restore_state, that is
> probably easiest in inline assembler.

Oh, right, .cfi_remember_state and .cfi_restore_state should work, as
long as -fno-dwarf2-cfi-asm isn't used (but that's okay given that we
need this only for the glibc build).

But it looks like we can use an explicit "lr" clobber, so we should be
good anyway.

>> >> > GCC does not model the condition registers,
>> >
>> > Huh?  It does model the condition register, as 8 registers in GCC's
>> > internal model (one each for CR0..CR7).
>> 
>> But GCC doesn't expose them as integers to C code, so you can't do much
>> without them.
>
> Sure, it doesn't expose any other registers directly, either.

I can use r0 & 1 with a register variable r0 to check a bit.  I don't
think writing a similar check against a condition register is possible
today.

Thanks,
Florian



Re: powerpc Linux scv support and scv system call ABI proposal

2020-01-28 Thread Florian Weimer
* Segher Boessenkool:

>> > I don't think we can save LR in a regular register around the system
>> > call, explicitly in the inline asm statement, because we still have to
>> > generate proper unwinding information using CFI directives, something
>> > that you cannot do from within the asm statement.
>
> Why not?

As far as I know, there isn't a CFI directive that allows us to restore
the CFI state at the end of the inline assembly.  If we say that LR is
stored in a different register than what the rest of the function uses,
that would lead to incorrect CFI after the exit of the inline assembler
fragment.

At least that's what I think.  Compilers aren't really my thing.

>
>> >> - Error handling: use of CR0[SO] to indicate error requires a mtcr / mtocr
>> >>   instruction on the kernel side, and it is currently not implemented well
>> >>   in glibc, requiring a mfcr (mfocr should be possible and asm goto 
>> >> support
>> >>   would allow a better implementation). Is it worth continuing this style 
>> >> of
>> >>   error handling? Or just move to -ve return means error? Using a 
>> >> different
>> >>   bit would allow the kernel to piggy back the CR return code setting with
>> >>   a test for the error case exit.
>> > 
>> > GCC does not model the condition registers,
>
> Huh?  It does model the condition register, as 8 registers in GCC's
> internal model (one each for CR0..CR7).

But GCC doesn't expose them as integers to C code, so you can't do much
without them.

>> >> - Should this be for 64-bit only? 'scv 1' could be reserved for 32-bit
>> >>   calls if there was interest in developing an ABI for 32-bit programs.
>> >>   Marginal benefit in avoiding compat syscall selection.
>> > 
>> > We don't have an ELFv2 ABI for 32-bit.  I doubt it makes sense to
>> > provide an ELFv1 port for this given that it's POWER9-specific.
>
> We *do* have a 32-bit LE ABI.  And ELFv1 is not 32-bit either.  Please
> don't confuse these things :-)
>
> The 64-bit LE kernel does not really support 32-bit userland (or BE
> userland), *that* is what you want to say.

Sorry for the confusion.  Is POWER9 running kernels which are not 64-bit
LE really a thing in practice, though?

>> > From the glibc perspective, the major question is how we handle run-time
>> > selection of the system call instruction sequence.
>
> Well, if it is inlined you don't have this problem either!  :-)

How so?  We would have to put the conditional sequence into all inline
system calls, of course.

Thanks,
Florian



Re: powerpc Linux scv support and scv system call ABI proposal

2020-01-28 Thread Florian Weimer
* Nicholas Piggin:

> That gets the LR save/restore right when we're also using r0.

Yes, I agree it looks good.  Nice.

>> But the kernel uses the -errno convention internally, so I think it
>> would make sense to pass this to userspace and not convert back and
>> forth.  This would align with what most of the architectures do, and
>> also avoids the GCC oddity.
>
> Yes I would be interested in opinions for this option. It seems like
> matching other architectures is a good idea. Maybe there are some
> reasons not to.

If there were a POWER-specific system call that used all result bits and
didn't have room for the 4096 error states (or an error number that's
outside that range), that would be a blocker.  I can't find such a
system call wrapped in the glibc sources.  musl's inline syscalls always
convert the errno state to -errno, so it's not possible to use such a
system call there.
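Concretely, the userspace side of the -errno convention reduces to a single range check (a sketch; the function name is mine, not glibc's or musl's):

```c
#include <errno.h>

/* The kernel encodes errors as values in [-4095, -1]; anything else
   is a success value. */
static long
syscall_ret (unsigned long ret)
{
  if (ret > -4096UL)          /* i.e. (long) ret is in [-4095, -1] */
    {
      errno = -(long) ret;
      return -1;
    }
  return (long) ret;
}
```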

>>> - Should this be for 64-bit only? 'scv 1' could be reserved for 32-bit
>>>   calls if there was interest in developing an ABI for 32-bit programs.
>>>   Marginal benefit in avoiding compat syscall selection.
>> 
>> We don't have an ELFv2 ABI for 32-bit.  I doubt it makes sense to
>> provide an ELFv1 port for this given that it's POWER9-specific.
>
> Okay. There's no reason not to enable this for BE, at least for the
> kernel it's no additional work so it probably remains enabled (unless
> there is something really good we could do with the ABI if we exclude
> ELFv1 but I don't see anything).
>
> But if glibc only builds for ELFv2 support that's probably reasonable.

To be clear, we still support ELFv1 for POWER, but given that this
feature is POWER9 and later, I expect the number of users benefiting
from 32-bit support (or ELFv1 and thus big-endian support) to be quite
small.

Especially if we go the conditional branch route, I would restrict this
to ELFv2 in glibc, at least for default builds.

>> From the glibc perspective, the major question is how we handle run-time
>> selection of the system call instruction sequence.  On i386, we use a
>> function pointer in the TCB to call an instruction sequence in the vDSO.
>> That's problematic from a security perspective.  I expect that on
>> POWER9, using a pointer in read-only memory would be equally
>> non-attractive due to a similar lack of PC-relative addressing.  We
>> could use the HWCAP bit in the TCB, but that would add another (easy to
>> predict) conditional branch to every system call.
>
> I would have to defer to glibc devs on this. Conditional branch
> should be acceptable I think, scv improves speed as much as several
> mispredicted branches (about 90 cycles).

But we'd have to pay for that branch (and likely the LR clobber) on
legacy POWER, and that's something to consider, too.

Is there an additional performance hit if a process uses both the old
and new system call sequence?

Thanks,
Florian



Re: powerpc Linux scv support and scv system call ABI proposal

2020-01-28 Thread Florian Weimer
* Nicholas Piggin:

> * Proposal is for PPC_FEATURE2_SCV to indicate 'scv 0' support, all other
>   vectors will return -ENOSYS, and the decision for how to add support for
>   a new vector deferred until we see the next user.

Seems reasonable.  We don't have to decide this today.

> * Proposal is for scv 0 to provide the standard Linux system call ABI with 
> some
>   differences:
>
> - LR is volatile across scv calls. This is necessary for support because the
>   scv instruction clobbers LR.

I think we can express this in the glibc system call assembler wrapper
generators.  The mcount profiling wrappers already have this property.

But I don't think we are so lucky for the inline system calls.  GCC
recognizes an "lr" clobber with inline asm (even though it is not
documented), but it generates rather strange assembler output as a
result:

long
f (long x)
{
  long y;
  asm ("#" : "=r" (y) : "r" (x) : "lr");
  return y;
}

.abiversion 2
.section".text"
.align 2
.p2align 4,,15
.globl f
.type   f, @function
f:
.LFB0:
.cfi_startproc
mflr 0
.cfi_register 65, 0
#APP
 # 5 "t.c" 1
#
 # 0 "" 2
#NO_APP
std 0,16(1)
.cfi_offset 65, 16
ori 2,2,0
ld 0,16(1)
mtlr 0
.cfi_restore 65
blr
.long 0
.byte 0,0,0,1,0,0,0,0
.cfi_endproc
.LFE0:
.size   f,.-f


That's with GCC 8.3 at -O2.  I don't understand what the ori is about.

I don't think we can save LR in a regular register around the system
call, explicitly in the inline asm statement, because we still have to
generate proper unwinding information using CFI directives, something
that you cannot do from within the asm statement.

Supporting this in GCC should not be impossible, but someone who
actually knows this stuff needs to look at it.

> - CR1 and CR5-CR7 are volatile. This matches the C ABI and would allow the
>   system call exit to avoid restoring the CR register.

This sounds reasonable, but I don't know what kind of knock-on effects
this has.  The inline system call wrappers can handle this with minor
tweaks.

> - Error handling: use of CR0[SO] to indicate error requires a mtcr / mtocr
>   instruction on the kernel side, and it is currently not implemented well
>   in glibc, requiring a mfcr (mfocr should be possible and asm goto support
>   would allow a better implementation). Is it worth continuing this style of
>   error handling? Or just move to -ve return means error? Using a different
>   bit would allow the kernel to piggy back the CR return code setting with
>   a test for the error case exit.

GCC does not model the condition registers, so for inline system calls,
we have to produce a value anyway that the subsequence C code can check.
The assembler syscall wrappers do not need to do this, of course, but
I'm not sure which category of interfaces is more important.

But the kernel uses the -errno convention internally, so I think it
would make sense to pass this to userspace and not convert back and
forth.  This would align with what most of the architectures do, and
also avoids the GCC oddity.

> - Should this be for 64-bit only? 'scv 1' could be reserved for 32-bit
>   calls if there was interest in developing an ABI for 32-bit programs.
>   Marginal benefit in avoiding compat syscall selection.

We don't have an ELFv2 ABI for 32-bit.  I doubt it makes sense to
provide an ELFv1 port for this given that it's POWER9-specific.

From the glibc perspective, the major question is how we handle run-time
selection of the system call instruction sequence.  On i386, we use a
function pointer in the TCB to call an instruction sequence in the vDSO.
That's problematic from a security perspective.  I expect that on
POWER9, using a pointer in read-only memory would be equally
non-attractive due to a similar lack of PC-relative addressing.  We
could use the HWCAP bit in the TCB, but that would add another (easy to
predict) conditional branch to every system call.

I don't think it matters whether both system call variants use the same
error convention because we could have different error code extraction
code on the two branches.

Thanks,
Florian



Re: [PATCH v18 11/13] open: introduce openat2(2) syscall

2019-12-16 Thread Florian Weimer
* Aleksa Sarai:

> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
> index 1d338357df8a..58c3a0e543c6 100644
> --- a/include/uapi/linux/fcntl.h
> +++ b/include/uapi/linux/fcntl.h
> @@ -93,5 +93,40 @@
>  
>  #define AT_RECURSIVE 0x8000  /* Apply to the entire subtree */
>  
> +/*
> + * Arguments for how openat2(2) should open the target path. If @resolve is
> + * zero, then openat2(2) operates very similarly to openat(2).
> + *
> + * However, unlike openat(2), unknown bits in @flags result in -EINVAL rather
> + * than being silently ignored. @mode must be zero unless one of {O_CREAT,
> + * O_TMPFILE} are set.
> + *
> + * @flags: O_* flags.
> + * @mode: O_CREAT/O_TMPFILE file mode.
> + * @resolve: RESOLVE_* flags.
> + */
> +struct open_how {
> + __aligned_u64 flags;
> + __u16 mode;
> + __u16 __padding[3]; /* must be zeroed */
> + __aligned_u64 resolve;
> +};
> +
> +#define OPEN_HOW_SIZE_VER0   24 /* sizeof first published struct */
> +#define OPEN_HOW_SIZE_LATEST OPEN_HOW_SIZE_VER0
> +
> +/* how->resolve flags for openat2(2). */
> +#define RESOLVE_NO_XDEV  0x01 /* Block mount-point crossings
> + (includes bind-mounts). */
> +#define RESOLVE_NO_MAGICLINKS0x02 /* Block traversal through 
> procfs-style
> + "magic-links". */
> +#define RESOLVE_NO_SYMLINKS  0x04 /* Block traversal through all symlinks
> + (implies OEXT_NO_MAGICLINKS) */
> +#define RESOLVE_BENEATH  0x08 /* Block "lexical" trickery like
> + "..", symlinks, and absolute
> + paths which escape the dirfd. */
> +#define RESOLVE_IN_ROOT  0x10 /* Make all jumps to "/" and ".."
> + be scoped inside the dirfd
> + (similar to chroot(2)). */
>  
>  #endif /* _UAPI_LINUX_FCNTL_H */

Would it be possible to move these to a new UAPI header?

In glibc, we currently do not #include .  We need some of
the AT_* constants in POSIX mode, and the header is not necessarily
namespace-clean.  If there was a separate header for openat2 support, we
could use that easily, and we would only have to maintain the baseline
definitions (which never change).
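For illustration, a raw caller of the proposed syscall might look like this; it re-declares the UAPI bits locally, which is exactly what a dedicated header would make unnecessary. This is a hypothetical sketch: 437 is the syscall number openat2 was eventually assigned, kernels without it fail with ENOSYS, and RESOLVE_BENEATH rejects absolute paths with EXDEV.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Local copy of the struct from the quoted patch. */
struct open_how_v0
{
  unsigned long long flags;
  unsigned short mode;
  unsigned short __padding[3];   /* must be zeroed */
  unsigned long long resolve;
};

static long
openat2_beneath (int dirfd, const char *path)
{
  struct open_how_v0 how;
  memset (&how, 0, sizeof (how));
  how.flags = O_RDONLY;
  how.resolve = 0x08;            /* RESOLVE_BENEATH */
  return syscall (437, dirfd, path, &how, sizeof (how));
}
```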

Thanks,
Florian



Re: [PATCH] [RFC] Remove bdflush syscall stub

2019-05-28 Thread Florian Weimer
* Cyril Hrubis:

> Hi!
>> > I've tested the patch on i386. Before the patch calling bdflush() with
>> > attempt to tune a variable returned 0 and after the patch the syscall
>> > fails with EINVAL.
>> 
>> Should be ENOSYS, doesn't it?
>
> My bad, the LTP syscall wrapper handles ENOSYS and produces skipped
> results based on that.
>
> EINVAL is what you get for not yet implemented syscalls, i.e. new
> syscall on old kernel.

EINVAL?  Is that a bdflush-specific thing, test-specific, or is it more
general?

glibc has fallback paths that test for ENOSYS only.  EINVAL will be
passed to the application, skipping fallback.  For missing system calls,
this is not what we want.
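The fallback pattern glibc relies on can be sketched as follows (the syscall number here is deliberately bogus so the legacy path always triggers; real wrappers would use the new syscall's number):

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Try the new syscall first; take the legacy path only on ENOSYS.
   Any other error, such as EINVAL, must reach the caller. */
static long
with_fallback (void)
{
  long r = syscall (999999999);   /* hypothetical new syscall */
  if (r == -1 && errno == ENOSYS)
    return 42;                    /* legacy implementation */
  return r;
}
```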

Thanks,
Florian


Re: [PATCH 1/2] open: add close_range()

2019-05-21 Thread Florian Weimer
* Christian Brauner:

>> Solaris has an fdwalk function:
>> 
>>   
>> 
>> So a different way to implement this would expose a nextfd system call
>
> Meh. If nextfd() then I would like it to be able to:
> - get the nextfd(fd) >= fd
> - get highest open fd e.g. nextfd(-1)

The highest open descriptor isn't interesting for fdwalk because nextfd
would just fail.

> But then I wonder if nextfd() needs to be a syscall and isn't just
> either:
> fcntl(fd, F_GET_NEXT)?
> or
> prctl(PR_GET_NEXT)?

I think the fcntl route is a bit iffy because you might need it to get
the *first* valid descriptor.

>> to userspace, so that we can use that to implement both fdwalk and
>> closefrom.  But maybe fdwalk is just too obscure, given the existence of
>> /proc.
>
> Yeah we probably don't need fdwalk.

Agreed.  Just wanted to bring it up for completeness.  I certainly don't
want to derail the implementation of close_range.
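For completeness, a Solaris-style fdwalk built on /proc, which is why a nextfd syscall may not be needed (a sketch; the helper names are mine):

```c
#include <dirent.h>
#include <stdlib.h>

/* Calls func(cd, fd) for every open descriptor except the directory
   stream's own; stops early if func returns nonzero. */
static int
my_fdwalk (int (*func) (void *, int), void *cd)
{
  DIR *d = opendir ("/proc/self/fd");
  if (d == NULL)
    return -1;
  int self = dirfd (d);
  struct dirent *e;
  int r = 0;
  while (r == 0 && (e = readdir (d)) != NULL)
    {
      if (e->d_name[0] == '.')
        continue;                     /* skip "." and ".." */
      int fd = atoi (e->d_name);
      if (fd != self)
        r = func (cd, fd);
    }
  closedir (d);
  return r;
}

/* Example callback: count open descriptors. */
static int
count_cb (void *cd, int fd)
{
  (void) fd;
  ++*(int *) cd;
  return 0;
}
```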

Thanks,
Florian


Re: [PATCH 1/2] open: add close_range()

2019-05-21 Thread Florian Weimer
* Christian Brauner:

> +/**
> + * __close_range() - Close all file descriptors in a given range.
> + *
> + * @fd: starting file descriptor to close
> + * @max_fd: last file descriptor to close
> + *
> + * This closes a range of file descriptors. All file descriptors
> + * from @fd up to and including @max_fd are closed.
> + */
> +int __close_range(struct files_struct *files, unsigned fd, unsigned max_fd)
> +{
> + unsigned int cur_max;
> +
> + if (fd > max_fd)
> + return -EINVAL;
> +
> + rcu_read_lock();
> + cur_max = files_fdtable(files)->max_fds;
> + rcu_read_unlock();
> +
> + /* cap to last valid index into fdtable */
> + if (max_fd >= cur_max)
> + max_fd = cur_max - 1;
> +
> + while (fd <= max_fd)
> + __close_fd(files, fd++);
> +
> + return 0;
> +}

This seems rather drastic.  How long does this block in kernel mode?
Maybe it's okay as long as the maximum possible value for cur_max stays
around 4 million or so.

Solaris has an fdwalk function:

  

So a different way to implement this would expose a nextfd system call
to userspace, so that we can use that to implement both fdwalk and
closefrom.  But maybe fdwalk is just too obscure, given the existence of
/proc.

I'll happily implement closefrom on top of close_range in glibc (plus
fallback for older kernels based on /proc—with an abort in case that
doesn't work because the RLIMIT_NOFILE hack is unreliable
unfortunately).
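Such a closefrom might be sketched as below. Assumptions: 436 is the syscall number close_range was eventually assigned, and the flags argument matches the merged interface rather than necessarily this patch; the /proc fallback and the abort mirror the plan described above.

```c
#define _GNU_SOURCE
#include <dirent.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

static void
my_closefrom (int lowfd)
{
  /* Fast path: close_range(lowfd, ~0U, 0). */
  if (syscall (436, (unsigned) lowfd, ~0U, 0U) == 0)
    return;
  /* Fallback for kernels without close_range. */
  DIR *d = opendir ("/proc/self/fd");
  if (d == NULL)
    abort ();   /* the RLIMIT_NOFILE trick is unreliable, per the mail */
  int self = dirfd (d);
  struct dirent *e;
  while ((e = readdir (d)) != NULL)
    {
      if (e->d_name[0] == '.')
        continue;
      int fd = atoi (e->d_name);
      if (fd >= lowfd && fd != self)
        close (fd);
    }
  closedir (d);
}
```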

Thanks,
Florian


Re: [PATCH] Linux: Define struct termios2 in <termios.h> under _GNU_SOURCE [BZ #10339]

2019-04-16 Thread Florian Weimer
* hpa:

> Using symbol versioning doesn't really help much since the real
> problem is that struct termios can be passed around in userspace, and
> the interfaces between user space libraries don't have any
> versioning. However, my POC code deals with that too by only seeing
> BOTHER when necessary, so if the structure is extended garbage in the
> extra fields will be ignored unless new baud rates are in use.

That still doesn't solve the problem of changing struct offsets after a
struct field of type struct termios.

> Exporting termios2 to user space feels a bit odd at this stage as it
> would only be usable as a fallback on old glibc. Call it
> kernel_termios2 at least.

I'm not sure why we should do that?  The kernel calls it struct termios2
in its UAPI headers.  If that name is not appropriate, it should be
changed first in the UAPI headers.

Thanks,
Florian


Re: [PATCH] Linux: Define struct termios2 in <termios.h> under _GNU_SOURCE [BZ #10339]

2019-04-15 Thread Florian Weimer
* Florian Weimer:

> My gut feeling is that it's safer to add new interfaces, based on the
> actual kernel/userspace interface, rather than trying to fix up existing
> interfaces with symbol versioning.  The main reason is that code
> involving serial interfaces is difficult to test, so it will take years
> until we find the last application broken by the glibc interface bump.
>
> I don't feel strongly about this.  This came out of a request for
> enabling TCGETS2 support downstream.  If I can't fix this upstream, I
> will just reject that request.

Any further comments on this matter?

Thanks,
Florian


Re: [PATCH] Linux: Define struct termios2 in <termios.h> under _GNU_SOURCE [BZ #10339]

2019-04-12 Thread Florian Weimer
* Adhemerval Zanella:

> On 11/04/2019 08:07, Florian Weimer wrote:
>> * Adhemerval Zanella:
>> 
>>> This allows us to adjust the baud rates to non-standard values using termios
>>> interfaces without to resorting to add new headers and use a different API
>>> (ioctl).
>> 
>> How much symbol versioning will be required for this change?
>
> I think all interfaces that have termios as input for sparc and mips 
> (tcgetattr, tcsetattr, cfmakeraw, cfgetispeed, cfgetospeed, cfsetispeed,
> cfsetospeed, cfsetspeed).
>
> Alpha will also need to use termios1 for pre-4.20 kernels.

So only new symbol versions there?  Hmm.

>>> As Peter Anvin has indicated, he created a POC [1] with the aforementioned
>>> new interfaces.  It has not been rebased against master, more specifically
>>> against my termios refactor to simplify the multiple architecture header
>>> definitions, but I intend to use it as a base.
>> 
>> Reference [1] is still missing. 8-(
>
> Oops... it is 
> https://git.zytor.com/users/hpa/glibc/termbaud.git/log/?h=wip.termbaud

This doesn't really illuminate things.  “Drop explicit baud setting
interfaces in favor of cfenc|decspeed()” removes the new symbol version
for the cf* functions.

My gut feeling is that it's safer to add new interfaces, based on the
actual kernel/userspace interface, rather than trying to fix up existing
interfaces with symbol versioning.  The main reason is that code
involving serial interfaces is difficult to test, so it will take years
until we find the last application broken by the glibc interface bump.

I don't feel strongly about this.  This came out of a request for
enabling TCGETS2 support downstream.  If I can't fix this upstream, I
will just reject that request.

Thanks,
Florian


Re: [PATCH] Linux: Define struct termios2 in <termios.h> under _GNU_SOURCE [BZ #10339]

2019-04-11 Thread Florian Weimer
* Adhemerval Zanella:

> This allows us to adjust the baud rates to non-standard values using termios
> interfaces without to resorting to add new headers and use a different API
> (ioctl).

How much symbol versioning will be required for this change?

> As Peter Anvin has indicated, he created a POC [1] with the aforementioned
> new interfaces.  It has not been rebased against master, more specifically
> against my termios refactor to simplify the multiple architecture header
> definitions, but I intend to use it as a base.

Reference [1] is still missing. 8-(

Thanks,
Florian


Re: [PATCH] Linux: Define struct termios2 in <termios.h> under _GNU_SOURCE [BZ #10339]

2019-04-10 Thread Florian Weimer
* hpa:

> PowerPC doesn't need it at all.

Because the definition would be the same as struct termios?

I think we should still define struct termios2 using the struct termios
definition, and also define TCGETS2 and TCSETS2.  This way, applications
can use the struct termios2 type without affecting portability to POWER.
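As an illustration of what such definitions enable, here is a sketch of setting a non-standard baud rate via TCSETS2 and BOTHER. Today this needs the kernel's <asm/termbits.h>, which conflicts with <termios.h>; that conflict is the problem under discussion. The function name is mine, and it expects a real serial device fd (anything else fails with ENOTTY).

```c
#include <asm/termbits.h>   /* struct termios2, TCGETS2, TCSETS2, BOTHER */
#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>

static int
set_arbitrary_baud (int fd, unsigned int speed)
{
  struct termios2 tio;
  if (ioctl (fd, TCGETS2, &tio) != 0)
    return -1;
  tio.c_cflag &= ~(tcflag_t) CBAUD;   /* clear the legacy Bnnn bits */
  tio.c_cflag |= BOTHER;              /* use c_ispeed/c_ospeed instead */
  tio.c_ispeed = speed;
  tio.c_ospeed = speed;
  return ioctl (fd, TCSETS2, &tio);
}
```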

Thanks,
Florian


Re: [PATCH] Linux: Define struct termios2 in <termios.h> under _GNU_SOURCE [BZ #10339]

2019-04-09 Thread Florian Weimer
* hpa:

> Anything blocking the POC code I provided?

I don't recall that, sorry.  Would you please provide a reference?

Thanks,
Florian


[PATCH] Linux: Define struct termios2 in <termios.h> under _GNU_SOURCE [BZ #10339]

2019-04-09 Thread Florian Weimer
struct termios2 is required for setting arbitrary baud rates on serial
ports.  <termios.h> and <asm/termbits.h> have conflicting
definitions in the existing termios definitions, which means that it
is currently very difficult to use TCGETS2/TCSETS2 and struct termios2
with glibc.  Providing a definition within glibc resolves this problem.

This does not completely address bug 10339, but it at least exposes
the current kernel functionality in this area.

Support for struct termios2 is architecture-specific in the kernel.
Support on alpha was only added in Linux 4.20.  POWER support is
currently missing.  The expectation is that the kernel will eventually
use the generic UAPI definition for struct termios2.

2019-04-09  Florian Weimer  

[BZ #10339]
Linux: Define struct termios2 in <termios.h> under _GNU_SOURCE.
* sysdeps/unix/sysv/linux/Makefile [$(subdir) == termios] (tests):
Add tst-termios2.
* sysdeps/unix/sysv/linux/tst-termios2.c: New file.
* sysdeps/unix/sysv/linux/bits/termios2-struct.h: Likewise.
* sysdeps/unix/sysv/linux/bits/termios.h [__USE_GNU]: Include it.
* sysdeps/unix/sysv/linux/alpha/bits/termios2-struct.h: New file.
* sysdeps/unix/sysv/linux/sparc/bits/termios2-struct.h: Likewise.

diff --git a/NEWS b/NEWS
index b58e2469d4..5e6ecb9c7d 100644
--- a/NEWS
+++ b/NEWS
@@ -18,6 +18,9 @@ Major new features:
 
 * On Linux, the gettid function has been added.
 
+* On Linux, <termios.h> now provides a definition of struct termios2 with
+  the _GNU_SOURCE feature test macro.
+
 * Minguo (Republic of China) calendar support has been added as an
   alternative calendar for the following locales: zh_TW, cmn_TW, hak_TW,
   nan_TW, lzh_TW.
diff --git a/sysdeps/unix/sysv/linux/Makefile b/sysdeps/unix/sysv/linux/Makefile
index 52ac6ad484..4cb5e4f0d2 100644
--- a/sysdeps/unix/sysv/linux/Makefile
+++ b/sysdeps/unix/sysv/linux/Makefile
@@ -156,6 +156,7 @@ endif
 
 ifeq ($(subdir),termios)
 sysdep_headers += termio.h
+tests += tst-termios2
 endif
 
 ifeq ($(subdir),posix)
diff --git a/sysdeps/unix/sysv/linux/alpha/bits/termios2-struct.h 
b/sysdeps/unix/sysv/linux/alpha/bits/termios2-struct.h
new file mode 100644
index 00..5f09445e23
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/alpha/bits/termios2-struct.h
@@ -0,0 +1,33 @@
+/* struct termios2 definition.  Linux/alpha version.
+   Copyright (C) 2019 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef _TERMIOS_H
+# error "Never include <bits/termios2-struct.h> directly; use <termios.h> instead."
+#endif
+
+struct termios2
+{
+  tcflag_t c_iflag;
+  tcflag_t c_oflag;
+  tcflag_t c_cflag;
+  tcflag_t c_lflag;
+  cc_t c_cc[NCCS];
+  cc_t c_line;
+  speed_t c_ispeed;
+  speed_t c_ospeed;
+};
diff --git a/sysdeps/unix/sysv/linux/bits/termios.h 
b/sysdeps/unix/sysv/linux/bits/termios.h
index 997231cd03..45ac7affdf 100644
--- a/sysdeps/unix/sysv/linux/bits/termios.h
+++ b/sysdeps/unix/sysv/linux/bits/termios.h
@@ -25,6 +25,10 @@ typedef unsigned int speed_t;
 typedef unsigned int   tcflag_t;
 
#include <bits/termios-struct.h>
+#ifdef __USE_GNU
+# include <bits/termios2-struct.h>
+#endif
+
 #include 
 #include 
 #include 
diff --git a/sysdeps/unix/sysv/linux/bits/termios2-struct.h 
b/sysdeps/unix/sysv/linux/bits/termios2-struct.h
new file mode 100644
index 00..5a48e45ef3
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/bits/termios2-struct.h
@@ -0,0 +1,33 @@
+/* struct termios2 definition.  Linux/generic version.
+   Copyright (C) 2019 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef _TERMIOS_H
+# error "Never include <bits/termios2-struct.h> directly; use <termios.h> instead."

Re: pkeys: Reserve PKEY_DISABLE_READ

2018-12-05 Thread Florian Weimer
* Ram Pai:

> Ok. here is a patch, compiled but not tested. See if this meets the
> specifications.
>
> ---
>
> commit 3dc06e73f3795921265d5d1d935e428deab01616
> Author: Ram Pai 
> Date:   Tue Dec 4 00:04:11 2018 -0500
>
> pkeys: add support of PKEY_DISABLE_READ

Thanks.  In the x86 code, the translation of PKEY_DISABLE_READ |
PKEY_DISABLE_WRITE to PKEY_DISABLE_ACCESS appears to be missing.  I
believe the existing code produces PKEY_DISABLE_WRITE, which is wrong.

Rest looks okay to me (again not tested).
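The missing translation could be sketched as a normalization step before programming PKRU (the constants are local re-declarations following this thread's proposal; PKEY_DISABLE_READ is not in released UAPI headers):

```c
/* Requesting both disable-read and disable-write is exactly what the
   PKRU AD bit (PKEY_DISABLE_ACCESS) expresses on x86, so fold the
   pair into PKEY_DISABLE_ACCESS. */
#define PKEY_DISABLE_ACCESS_ 0x1
#define PKEY_DISABLE_WRITE_  0x2
#define PKEY_DISABLE_READ_   0x8   /* value proposed in this thread */

static unsigned long
normalize_pkey_rights (unsigned long rights)
{
  unsigned long rw = PKEY_DISABLE_READ_ | PKEY_DISABLE_WRITE_;
  if ((rights & rw) == rw)
    rights = (rights & ~rw) | PKEY_DISABLE_ACCESS_;
  return rights;
}
```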

Thanks,
Florian


Re: [PATCH] powerpc/pkeys: copy pkey-tracking-information at fork()

2018-12-04 Thread Florian Weimer
* Ram Pai:

> +void arch_dup_pkeys(struct mm_struct *oldmm, struct mm_struct *mm)
> +{
> + /* Duplicate the oldmm pkey state in mm: */
> + mm_pkey_allocation_map(mm) = mm_pkey_allocation_map(oldmm);
> + mm->context.execute_only_pkey = oldmm->context.execute_only_pkey;
> +}

Looks reasonable to me.

Thanks,
Florian (who is not a kernel developer though)


Re: pkeys: Reserve PKEY_DISABLE_READ

2018-12-03 Thread Florian Weimer
* Ram Pai:

> So the problem is as follows:
>
> Currently the kernel supports  'disable-write'  and 'disable-access'.
>
> On x86, cpu supports 'disable-write' and 'disable-access'. This
> matches with what the kernel supports. All good.
>
> However on power, cpu supports 'disable-read' too. Since userspace can
> program the cpu directly, userspace has the ability to set
> 'disable-read' too.  This can lead to inconsistency between the kernel
> and the userspace.
>
> We want the kernel to match userspace on all architectures.

Correct.

> Proposed Solution:
>
> Enhance the kernel to understand 'disable-read', and facilitate architectures
> that understand 'disable-read' to allow it.
>
> Also explicitly define the semantics of disable-access  as 
> 'disable-read and disable-write'
>
> Did I get this right?  Assuming I did, the implementation has to do
> the following --
>   
>   On power, sys_pkey_alloc() should succeed if the init_val
>   is PKEY_DISABLE_READ, PKEY_DISABLE_WRITE, PKEY_DISABLE_ACCESS
>   or any combination of the three.

Agreed.

>   On x86, sys_pkey_alloc() should succeed if the init_val is
>   PKEY_DISABLE_WRITE or PKEY_DISABLE_ACCESS or PKEY_DISABLE_READ
>   or any combination of the three, except  PKEY_DISABLE_READ
>   specified all by itself.

Again agreed.  That's a clever way of phrasing it actually.

>   On all other arches, none of the flags are supported.
>
>
> Are we on the same plate?

I think so, thanks.

Florian


Re: pkeys: Reserve PKEY_DISABLE_READ

2018-11-29 Thread Florian Weimer
* Dave Hansen:

> On 11/27/18 3:57 AM, Florian Weimer wrote:
>> I would have expected something that translates PKEY_DISABLE_WRITE |
>> PKEY_DISABLE_READ into PKEY_DISABLE_ACCESS, and also accepts
>> PKEY_DISABLE_ACCESS | PKEY_DISABLE_READ, for consistency with POWER.
>> 
>> (My understanding is that PKEY_DISABLE_ACCESS does not disable all
>> access, but produces execute-only memory.)
>
> Correct, it disables all data access, but not execution.

So I would expect something like this (completely untested, I did not
even compile this):

diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
index 20ebf153c871..bed23f9e8336 100644
--- a/arch/powerpc/include/asm/pkeys.h
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -199,6 +199,11 @@ static inline bool arch_pkeys_enabled(void)
return !static_branch_likely(&pkey_disabled);
 }
 
+static inline bool arch_pkey_access_rights_valid(unsigned long rights)
+{
+   return (rights & ~(unsigned long)PKEY_ACCESS_MASK) == 0;
+}
+
 extern void pkey_mm_init(struct mm_struct *mm);
 extern bool arch_supports_pkeys(int cap);
 extern unsigned int arch_usable_pkeys(void);
diff --git a/arch/x86/include/asm/pkeys.h b/arch/x86/include/asm/pkeys.h
index 19b137f1b3be..e3e1d5a316e8 100644
--- a/arch/x86/include/asm/pkeys.h
+++ b/arch/x86/include/asm/pkeys.h
@@ -14,6 +14,17 @@ static inline bool arch_pkeys_enabled(void)
return boot_cpu_has(X86_FEATURE_OSPKE);
 }
 
+static inline bool arch_pkey_access_rights_valid(unsigned long rights)
+{
+   if (rights & ~(unsigned long)PKEY_ACCESS_MASK)
+   return false;
+   if (rights & PKEY_DISABLE_READ) {
+   /* x86 can only disable read access along with write access. */
+   return rights & (PKEY_DISABLE_WRITE | PKEY_DISABLE_ACCESS);
+   }
+   return true;
+}
+
 /*
  * Try to dedicate one of the protection keys to be used as an
  * execute-only protection key.
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 87a57b7642d3..b9b78145017f 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -928,7 +928,13 @@ int arch_set_user_pkey_access(struct task_struct *tsk, int 
pkey,
return -EINVAL;
 
/* Set the bits we need in PKRU:  */
-   if (init_val & PKEY_DISABLE_ACCESS)
+   if (init_val & (PKEY_DISABLE_ACCESS | PKEY_DISABLE_READ))
+   /*
+* arch_pkey_access_rights_valid checked that
+* PKEY_DISABLE_READ is actually representable on x86
+* (that is, it comes with PKEY_DISABLE_ACCESS or
+* PKEY_DISABLE_WRITE).
+*/
new_pkru_bits |= PKRU_AD_BIT;
if (init_val & PKEY_DISABLE_WRITE)
new_pkru_bits |= PKRU_WD_BIT;
diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h
index 2955ba976048..2c330fabbe55 100644
--- a/include/linux/pkeys.h
+++ b/include/linux/pkeys.h
@@ -48,6 +48,11 @@ static inline void copy_init_pkru_to_fpregs(void)
 {
 }
 
+static inline bool arch_pkey_access_rights_valid(unsigned long rights)
+{
+   return false;
+}
+
 #endif /* ! CONFIG_ARCH_HAS_PKEYS */
 
 #endif /* _LINUX_PKEYS_H */
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 6d331620b9e5..f4cefc3540df 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -597,7 +597,7 @@ SYSCALL_DEFINE2(pkey_alloc, unsigned long, flags, unsigned 
long, init_val)
if (flags)
return -EINVAL;
/* check for unsupported init values */
-   if (init_val & ~PKEY_ACCESS_MASK)
+   if (!arch_pkey_access_rights_valid(init_val))
return -EINVAL;
 
down_write(&current->mm->mmap_sem);

Thanks,
Florian


Re: pkeys: Reserve PKEY_DISABLE_READ

2018-11-27 Thread Florian Weimer
* Ram Pai:

> diff --git a/arch/x86/include/uapi/asm/mman.h 
> b/arch/x86/include/uapi/asm/mman.h
> index d4a8d04..e9b121b 100644
> --- a/arch/x86/include/uapi/asm/mman.h
> +++ b/arch/x86/include/uapi/asm/mman.h
> @@ -24,6 +24,11 @@
>   ((key) & 0x2 ? VM_PKEY_BIT1 : 0) |  \
>   ((key) & 0x4 ? VM_PKEY_BIT2 : 0) |  \
>   ((key) & 0x8 ? VM_PKEY_BIT3 : 0))
> +
> +/* Override any generic PKEY permission defines */
> +#undef PKEY_ACCESS_MASK
> +#define PKEY_ACCESS_MASK   (PKEY_DISABLE_ACCESS |\
> + PKEY_DISABLE_WRITE)
>  #endif

I would have expected something that translates PKEY_DISABLE_WRITE |
PKEY_DISABLE_READ into PKEY_DISABLE_ACCESS, and also accepts
PKEY_DISABLE_ACCESS | PKEY_DISABLE_READ, for consistency with POWER.

(My understanding is that PKEY_DISABLE_ACCESS does not disable all
access, but produces execute-only memory.)

> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> index e7ee328..61168e4 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -71,7 +71,8 @@
>  
>  #define PKEY_DISABLE_ACCESS  0x1
>  #define PKEY_DISABLE_WRITE   0x2
> -#define PKEY_ACCESS_MASK (PKEY_DISABLE_ACCESS |\
> -  PKEY_DISABLE_WRITE)
> -
> +#define PKEY_DISABLE_EXECUTE 0x4
> +#define PKEY_DISABLE_READ0x8
> +#define PKEY_ACCESS_MASK 0x0 /* arch can override and define its own
> +mask bits */
>  #endif /* __ASM_GENERIC_MMAN_COMMON_H */

I think Dave requested a value for PKEY_DISABLE_READ which is further
away from the existing bits.

Thanks,
Florian


Re: pkeys: Reserve PKEY_DISABLE_READ

2018-11-12 Thread Florian Weimer
* Ram Pai:

> On Thu, Nov 08, 2018 at 09:23:35PM +0100, Florian Weimer wrote:
>> * Ram Pai:
>> 
>> > Florian,
>> >
>> >I can. But I am struggling to understand the requirement. Why is
>> >this needed?  Are we proposing an enhancement to the sys_pkey_alloc(),
>> >to be able to allocate keys that are initialized to disable-read
>> >only?
>> 
>> Yes, I think that would be a natural consequence.
>> 
>> However, my immediate need comes from the fact that the AMR register can
>> contain a flag combination that is not possible to represent with the
>> existing PKEY_DISABLE_WRITE and PKEY_DISABLE_ACCESS flags.  User code
>> could write to AMR directly, so I cannot rule out that certain flag
>> combinations exist there.
>> 
>> So I came up with this:
>> 
>> int
>> pkey_get (int key)
>> {
>>   if (key < 0 || key > PKEY_MAX)
>> {
>>   __set_errno (EINVAL);
>>   return -1;
>> }
>>   unsigned int index = pkey_index (key);
>>   unsigned long int amr = pkey_read ();
>>   unsigned int bits = (amr >> index) & 3;
>> 
>>   /* Translate from AMR values.  PKEY_AMR_READ standing alone is not
>>  currently representable.  */
>>   if (bits & PKEY_AMR_READ)
>
> this should be
>if (bits & (PKEY_AMR_READ|PKEY_AMR_WRITE))

This would return zero for PKEY_AMR_READ alone.

>> return PKEY_DISABLE_ACCESS;
>
>
>>   else if (bits == PKEY_AMR_WRITE)
>> return PKEY_DISABLE_WRITE;
>>   return 0;
>> }

It's hard to tell whether PKEY_DISABLE_ACCESS is better in this case.
Which is why I want PKEY_DISABLE_READ.

>> And this is not ideal.  I would prefer something like this instead:
>> 
>>   switch (bits)
>> {
>>   case PKEY_AMR_READ | PKEY_AMR_WRITE:
>> return PKEY_DISABLE_ACCESS;
>>   case PKEY_AMR_READ:
>> return PKEY_DISABLE_READ;
>>   case PKEY_AMR_WRITE:
>> return PKEY_DISABLE_WRITE;
>>   case 0:
>> return 0;
>> }
>
> yes.
>  and on x86 it will be something like:
>switch (bits)
>  {
>case PKEY_PKRU_ACCESS :
>  return PKEY_DISABLE_ACCESS;
>case PKEY_AMR_WRITE:
>  return PKEY_DISABLE_WRITE;
>case 0:
>  return 0;
>  }

x86 returns the PKRU bits directly, including the nonsensical case
(PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE).

> But for this to work, why do you need to enhance the sys_pkey_alloc()
> interface?  Not that I am against it. Trying to understand if the
> enhancement is really needed.

sys_pkey_alloc performs an implicit pkey_set for the newly allocated key
(that is, it updates the PKRU/AMR register).  It makes sense to match
the behavior of the userspace implementation.

Thanks,
Florian


Re: pkeys: Reserve PKEY_DISABLE_READ

2018-11-12 Thread Florian Weimer
* Ram Pai:

> On Thu, Nov 08, 2018 at 01:05:09PM +0100, Florian Weimer wrote:
>> Would it be possible to reserve a bit for PKEY_DISABLE_READ?
>> 
>> I think the POWER implementation can disable read access at the hardware
>> level, but not write access, and that cannot be expressed with the
>> current PKEY_DISABLE_ACCESS and PKEY_DISABLE_WRITE bits.
>
> POWER hardware can disable-read and can **also disable-write**
> at the hardware level. It can disable-execute aswell at the
> hardware level.   For example if the key bits for a given key in the AMR
> register is  
>   0b01  it is read-disable
>   0b10  it is write-disable
>
> To support access-disable, we make the key value 0b11.
>
> So in case if you want to know if the key is read-disable 'bitwise-and' it
> against 0x1.  i.e  (x & 0x1)

Not sure if we covered that alreay, but my problem is that I cannot
translate a 0b01 mask to a PKEY_DISABLE_* flag combination with the
current flags.  0b10 and 0b11 are fine.

POWER also loses the distinction between PKEY_DISABLE_ACCESS and
PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE, but that's fine.  This breaks
the current glibc test case, but I have a patch for that.  Arguably, the
test is wrong or at least overly strict in what it accepts.
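The AMR encoding described above (0b01 read-disable, 0b10 write-disable, 0b11 access-disable) translates cleanly once a PKEY_DISABLE_READ bit exists. A minimal sketch — AMR_RD/AMR_WR are invented names for the two hardware bits, and PKEY_DISABLE_READ uses the proposed 0x8 value:

```c
#define PKEY_DISABLE_ACCESS 0x1
#define PKEY_DISABLE_WRITE  0x2
#define PKEY_DISABLE_READ   0x8	/* proposed, not in mainline */

/* Hypothetical names for one key's two AMR bits, per the encoding
 * described in this thread. */
#define AMR_RD 0x1	/* 0b01: read-disable */
#define AMR_WR 0x2	/* 0b10: write-disable */

/* Translate one key's AMR bit pair to PKEY_DISABLE_* flags.  With a
 * PKEY_DISABLE_READ bit every hardware state has a faithful
 * translation; without it, AMR_RD alone does not. */
static int amr_bits_to_rights(unsigned int bits)
{
	switch (bits & 3) {
	case AMR_RD | AMR_WR:
		return PKEY_DISABLE_ACCESS;
	case AMR_RD:
		return PKEY_DISABLE_READ;
	case AMR_WR:
		return PKEY_DISABLE_WRITE;
	default:
		return 0;
	}
}
```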

Thanks,
Florian


Re: pkeys: Reserve PKEY_DISABLE_READ

2018-11-08 Thread Florian Weimer
* Ram Pai:

> Florian,
>
>   I can. But I am struggling to understand the requirement. Why is
>   this needed?  Are we proposing an enhancement to the sys_pkey_alloc(),
>   to be able to allocate keys that are initialized to disable-read
>   only?

Yes, I think that would be a natural consequence.

However, my immediate need comes from the fact that the AMR register can
contain a flag combination that is not possible to represent with the
existing PKEY_DISABLE_WRITE and PKEY_DISABLE_ACCESS flags.  User code
could write to AMR directly, so I cannot rule out that certain flag
combinations exist there.

So I came up with this:

int
pkey_get (int key)
{
  if (key < 0 || key > PKEY_MAX)
{
  __set_errno (EINVAL);
  return -1;
}
  unsigned int index = pkey_index (key);
  unsigned long int amr = pkey_read ();
  unsigned int bits = (amr >> index) & 3;

  /* Translate from AMR values.  PKEY_AMR_READ standing alone is not
 currently representable.  */
  if (bits & PKEY_AMR_READ)
return PKEY_DISABLE_ACCESS;
  else if (bits == PKEY_AMR_WRITE)
return PKEY_DISABLE_WRITE;
  return 0;
}

And this is not ideal.  I would prefer something like this instead:

  switch (bits)
{
  case PKEY_AMR_READ | PKEY_AMR_WRITE:
return PKEY_DISABLE_ACCESS;
  case PKEY_AMR_READ:
return PKEY_DISABLE_READ;
  case PKEY_AMR_WRITE:
return PKEY_DISABLE_WRITE;
  case 0:
return 0;
}

By the way, is the AMR register 64-bit or 32-bit on 32-bit POWER?

Thanks,
Florian


Re: [RFC PATCH v2 00/14] New TM Model

2018-11-06 Thread Florian Weimer
* Breno Leitao:

> This  patchset for the hardware transactional memory (TM) subsystem
> aims to avoid spending a lot of time on TM suspended mode in kernel
> space.  It basically changes where the reclaim/recheckpoint will be
> executed.

I assumed that we want to abort on every system call these days?

We have this commit in glibc:

commit f0458cf4f9ff3d870c43b624e6dccaaf657d5e83
Author: Adhemerval Zanella 
Date:   Mon Aug 27 09:42:50 2018 -0300

powerpc: Only enable TLE with PPC_FEATURE2_HTM_NOSC

Linux from 3.9 through 4.2 does not abort HTM transaction on syscalls,
instead it suspend and resume it when leaving the kernel.  The
side-effects of the syscall will always remain visible, even if the
transaction is aborted.  This is an issue when transaction is used along
with futex syscall, on pthread_cond_wait for instance, where the futex
call might succeed but the transaction is rolled back leading the
pthread_cond object in an inconsistent state.

Glibc used to prevent it by always aborting a transaction before issuing
a syscall.  Linux 4.2 also decided to abort active transaction in
syscalls which makes the glibc workaround superfluous.  Worse, glibc
transaction abortion leads to a performance issue on recent kernels
where the HTM state is saved/restore lazily (v4.9).  By aborting a
transaction on every syscalls, regardless whether a transaction has being
initiated before, GLIBS makes the kernel always save/restore HTM state
(it can not even lazily disable it after a certain number of syscall
iterations).

Because of this shortcoming, Transactional Lock Elision is just enabled
when it has been explicitly set (either by tunables of by a configure
switch) and if kernel aborts HTM transactions on syscalls
(PPC_FEATURE2_HTM_NOSC).  It is reported that using simple benchmark [1],
the context-switch is about 5% faster by not issuing a tabort in every
syscall in newer kernels.

I wonder how the new TM model interacts with the assumption we currently
have in glibc.

Thanks,
Florian


Re: PIE binaries are no longer mapped below 4 GiB on ppc64le

2018-11-01 Thread Florian Weimer
* Michael Ellerman:

> Hi Florian,
>
> Florian Weimer  writes:
>> We tried to use Go to build PIE binaries, and while the Go toolchain is
>> definitely not ready (it produces text relocations and problematic
>> relocations in general), it exposed what could be an accidental
>> userspace ABI change.
>>
>> With our 4.10-derived kernel, PIE binaries are mapped below 4 GiB, so
>> relocations like R_PPC64_ADDR16_HA work:
>>
>> 21f0-220d r-xp  fd:00 36593493   
>> /root/extld
>> 220d-220e r--p 001c fd:00 36593493   
>> /root/extld
>> 220e-2210 rw-p 001d fd:00 36593493   
>> /root/extld
> ...
>>
>> With a 4.18-derived kernel (with the hashed mm), we get this instead:
>>
>> 120e6-12103 rw-p  fd:00 102447141
>> /root/extld
>> 12103-12106 rw-p 001c fd:00 102447141
>> /root/extld
>> 12106-12108 rw-p  00:00 0 
>
> I assume that's caused by:
>
>   47ebb09d5485 ("powerpc: move ELF_ET_DYN_BASE to 4GB / 4MB")
>
> Which did roughly:
>
>   -#define ELF_ET_DYN_BASE0x2000
>   +#define ELF_ET_DYN_BASE(is_32bit_task() ? 0x00040UL : \
>   +  0x1UL)
>
> And went into 4.13.
>
>> ...
>> I'm not entirely sure what to make of this, but I'm worried that this
>> could be a regression that matters to userspace.
>
> It was a deliberate change, and it seemed to not break anything so we
> merged it. But obviously we didn't test widely enough.

* Michael Ellerman:

>> I'm not entirely sure what to make of this, but I'm worried that this
>> could be a regression that matters to userspace.
>
> It was a deliberate change, and it seemed to not break anything so we
> merged it. But obviously we didn't test widely enough.

Thanks for moving back the discussion to kernel matters. 8-)

> So I guess it clearly can matter to userspace, and it used to work, so
> therefore it is a regression.

Is there a knob to get back the old base address?

> But at the same time we haven't had any other reports of breakage, so is
> this somehow specific to something Go is doing?

Go uses 32-bit run-time relocations which (I think) were primarily
designed as link-time relocations for programs mapped under 4 GiB.  It's
amazing that the binaries work at all under old kernels.  On other
targets, the link editor refuses to produce an executable, or may even
produce a binary which crashes at run time.
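As a rough sketch (not glibc's exact check), the R_PPC64_ADDR16_HA overflow condition is that the adjusted high halfword no longer fits the 16-bit field — which is exactly what happens once the PIE base moves above 4 GiB:

```c
#include <stdint.h>

/* R_PPC64_ADDR16_HA stores ((addr + 0x8000) >> 16) in a 16-bit field.
 * Once PIE load addresses exceed roughly 4 GiB, that value no longer
 * fits and the dynamic linker reports a relocation overflow. */
static int addr16_ha_overflows(uint64_t addr)
{
	return ((addr + 0x8000) >> 16) > 0xffff;
}
```

With the old 4.10-era base (0x21f00000-ish) the check passes; with the new base (0x120e60000-ish) it fails, matching the loader error quoted above.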

> Or did we just get lucky up until now? Or is no one actually testing
> on Power? ;)

I'm not too worried about it.  It looks like a well-understood change to
me.  The glibc dynamic linker prints a reasonably informative error
message (in the sense that it doesn't crash without printing anything).
I think we can wait and see if someone comes up with a more compelling
case for backwards compatibility than the broken Go binaries (which we
will rebuild anyway because we don't want text relocations).  I assume
that it will be possible to add a personality flag if it ever proves
necessary—or maybe map the executable below 4 GiB in case of ASLR is
disabled, so that people have at least a workaround to get old binaries
going again.

But right now, that doesn't seem necessary.

Thanks,
Florian


Re: PIE binaries are no longer mapped below 4 GiB on ppc64le

2018-10-31 Thread Florian Weimer
* Tulio Magno Quites Machado Filho:

> Florian Weimer  writes:
>
>> * Michal Suchánek:
>>
>>> On Wed, 31 Oct 2018 18:20:56 +0100
>>> Florian Weimer  wrote:
>>>
>>>> And it needs to be built with:
>>>> 
>>>>   go build -ldflags=-extldflags=-pie extld.go
>>>> 
>>>> I'm not entirely sure what to make of this, but I'm worried that this
>>>> could be a regression that matters to userspace.
>>>
>>> I encountered the same when trying to build go on ppc64le. I am not
>>> familiar with the internals so I just let it be.
>>>
>>> It does not seem to matter to any other userspace.
>>
>> It would matter to C code which returns the address of a global variable
>> in the main program through an (implicit) int return value.
>
> I wonder if this is restricted to linker that Golang uses.
> Were you able to reproduce the same problem with Binutils' linker?

The example is carefully constructed to use the external linker.  It
invokes gcc, which then invokes the BFD linker in my case.

Based on the relocations, I assume there is only so much the linker can
do here.  I'm amazed that it produces an executable at all, let alone
one that runs correctly on some kernel versions!  I assume that the Go
toolchain simply lacks PIE support on ppc64le.

Thanks,
Florian


Re: PIE binaries are no longer mapped below 4 GiB on ppc64le

2018-10-31 Thread Florian Weimer
* Michal Suchánek:

> On Wed, 31 Oct 2018 18:20:56 +0100
> Florian Weimer  wrote:
>
>> We tried to use Go to build PIE binaries, and while the Go toolchain
>> is definitely not ready (it produces text relocations and problematic
>> relocations in general), it exposed what could be an accidental
>> userspace ABI change.
>> 
>> With our 4.10-derived kernel, PIE binaries are mapped below 4 GiB, so
>> relocations like R_PPC64_ADDR16_HA work:
>> 
> ...
>
>> There are fewer mappings because the loader detects a relocation
>> overflow and aborts (“error while loading shared libraries:
>> R_PPC64_ADDR16_HA reloc at 0x000120f0983c for symbol `' out of
>> range”), so I had to recover the mappings externally.  Disabling ASLR
>> does not help.
>> 
> ...
>> 
>> And it needs to be built with:
>> 
>>   go build -ldflags=-extldflags=-pie extld.go
>> 
>> I'm not entirely sure what to make of this, but I'm worried that this
>> could be a regression that matters to userspace.
>
> I encountered the same when trying to build go on ppc64le. I am not
> familiar with the internals so I just let it be.
>
> It does not seem to matter to any other userspace.

It would matter to C code which returns the address of a global variable
in the main program through an (implicit) int return value.

The old behavior hid some pointer truncation issues.
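A minimal sketch of that failure mode: a caller without a prototype treats the returned pointer as int, so the address is truncated to 32 bits and sign-extended back when used as a pointer again. Below 4 GiB the round trip happens to succeed; above it, the high bits are silently lost (the helper is illustrative, not from any real program):

```c
#include <stdint.h>

/* Model what an implicit-int declaration does to a returned pointer:
 * the 64-bit address passes through a 32-bit int and is sign-extended
 * back when converted to a pointer again. */
static uint64_t through_implicit_int(uint64_t addr)
{
	int truncated = (int)addr;		/* the missing prototype */
	return (uint64_t)(int64_t)truncated;	/* conversion back to a pointer */
}
```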

> Maybe it would be good idea to generate 64bit relocations on 64bit
> targets?

Yes, the Go toolchain definitely needs fixing for PIE.  I don't dispute
that.

Thanks,
Florian


PIE binaries are no longer mapped below 4 GiB on ppc64le

2018-10-31 Thread Florian Weimer
We tried to use Go to build PIE binaries, and while the Go toolchain is
definitely not ready (it produces text relocations and problematic
relocations in general), it exposed what could be an accidental
userspace ABI change.

With our 4.10-derived kernel, PIE binaries are mapped below 4 GiB, so
relocations like R_PPC64_ADDR16_HA work:

21f0-220d r-xp  fd:00 36593493   
/root/extld
220d-220e r--p 001c fd:00 36593493   
/root/extld
220e-2210 rw-p 001d fd:00 36593493   
/root/extld
2210-2212 rw-p  00:00 0
264b-264e rw-p  00:00 0  [heap]
c0-c1 rw-p  00:00 0
c41ffe-c42030 rw-p  00:00 0
3fff8c00-3fff8c03 rw-p  00:00 0
3fff8c03-3fff9000 ---p  00:00 0
3fff9000-3fff9003 rw-p  00:00 0
3fff9003-3fff9400 ---p  00:00 0
3fff9400-3fff9403 rw-p  00:00 0
3fff9403-3fff9800 ---p  00:00 0
3fff9800-3fff9803 rw-p  00:00 0
3fff9803-3fff9c00 ---p  00:00 0
3fff9c00-3fff9c03 rw-p  00:00 0
3fff9c03-3fffa000 ---p  00:00 0
3fffa229-3fffa22d rw-p  00:00 0
3fffa22d-3fffa22e ---p  00:00 0
3fffa22e-3fffa2ae rw-p  00:00 0
3fffa2ae-3fffa2af ---p  00:00 0
3fffa2af-3fffa32f rw-p  00:00 0
3fffa32f-3fffa330 ---p  00:00 0
3fffa330-3fffa3b0 rw-p  00:00 0
3fffa3b0-3fffa3b1 ---p  00:00 0
3fffa3b1-3fffa431 rw-p  00:00 0
3fffa431-3fffa432 ---p  00:00 0
3fffa432-3fffa4bb rw-p  00:00 0
3fffa4bb-3fffa4da r-xp  fd:00 34316081   
/usr/lib64/power9/libc-2.28.so
3fffa4da-3fffa4db r--p 001e fd:00 34316081   
/usr/lib64/power9/libc-2.28.so
3fffa4db-3fffa4dc rw-p 001f fd:00 34316081   
/usr/lib64/power9/libc-2.28.so
3fffa4dc-3fffa4df r-xp  fd:00 34316085   
/usr/lib64/power9/libpthread-2.28.so
3fffa4df-3fffa4e0 r--p 0002 fd:00 34316085   
/usr/lib64/power9/libpthread-2.28.so
3fffa4e0-3fffa4e1 rw-p 0003 fd:00 34316085   
/usr/lib64/power9/libpthread-2.28.so
3fffa4e1-3fffa4e2 rw-p  00:00 0
3fffa4e2-3fffa4e4 r-xp  00:00 0  [vdso]
3fffa4e4-3fffa4e7 r-xp  fd:00 874114 
/usr/lib64/ld-2.28.so
3fffa4e7-3fffa4e8 r--p 0002 fd:00 874114 
/usr/lib64/ld-2.28.so
3fffa4e8-3fffa4e9 rw-p 0003 fd:00 874114 
/usr/lib64/ld-2.28.so
3300-3303 rw-p  00:00 0
[stack]

With a 4.18-derived kernel (with the hashed mm), we get this instead:

120e6-12103 rw-p  fd:00 102447141
/root/extld
12103-12106 rw-p 001c fd:00 102447141
/root/extld
12106-12108 rw-p  00:00 0 
7fffb5b0-7fffb5cf r-xp  fd:00 67169871   
/usr/lib64/power9/libc-2.28.so
7fffb5cf-7fffb5d0 r--p 001e fd:00 67169871   
/usr/lib64/power9/libc-2.28.so
7fffb5d0-7fffb5d1 rw-p 001f fd:00 67169871   
/usr/lib64/power9/libc-2.28.so
7fffb5d1-7fffb5d4 r-xp  fd:00 67169875   
/usr/lib64/power9/libpthread-2.28.so
7fffb5d4-7fffb5d5 r--p 0002 fd:00 67169875   
/usr/lib64/power9/libpthread-2.28.so
7fffb5d5-7fffb5d6 rw-p 0003 fd:00 67169875   
/usr/lib64/power9/libpthread-2.28.so
7fffb5d6-7fffb5d7 r--p  fd:00 67780267   
/etc/ld.so.cache
7fffb5d7-7fffb5d9 r-xp  00:00 0  [vdso]
7fffb5d9-7fffb5dc r-xp  fd:00 1477   
/usr/lib64/ld-2.28.so
7fffb5dc-7fffb5de rw-p 0002 fd:00 1477   
/usr/lib64/ld-2.28.so
7f6c-7f6f rw-p  00:00 0  [stack]

There are fewer mappings because the loader detects a relocation
overflow and aborts (“error while loading shared libraries:
R_PPC64_ADDR16_HA reloc at 0x000120f0983c for symbol `' out of
range”), so I had to recover the mappings externally.  Disabling ASLR
does not help.

The Go program looks like this:

package main

import (
"fmt"
"io/ioutil"
"os"
)

// #include 
import "C"

func main() {
// Force external linking against glibc.
fmt.Printf("%#v\n", C.GoString(C.gnu_get_libc_version()))

maps, err := os.Open("/proc/self/maps")
if err != nil {
panic(err)
}
defer maps.Close()

Re: [PATCH] seccomp: Add pkru into seccomp_data

2018-10-25 Thread Florian Weimer
* Michael Sammler:

> Thank you for the pointer about the POWER implementation. I am not
> familiar with POWER in general and its protection key feature at
> all. Would the AMR register be the correct register to expose here?

Yes, according to my notes, the register is called AMR (special purpose
register 13).

> I understand your concern about exposing the number of protection keys
> in the ABI. One idea would be to state, that the pkru field (which
> should probably be renamed) contains an architecture specific value,
> which could then be the PKRU on x86 and AMR (or another register) on
> POWER. This new field should probably be extended to __u64 and the
> reserved field removed.

POWER also has proper read/write bit separation, not PKEY_DISABLE_ACCESS
(disable read and write) and PKEY_DISABLE_WRITE like Intel.  It's
currently translated by the kernel, but I really need a
PKEY_DISABLE_READ bit in glibc to implement pkey_get in case the memory
is write-only.

> Another idea would be to not add a field in the seccomp_data
> structure, but instead provide a new BPF instruction, which reads the
> value of a specified protection key.

I would prefer that if it's possible.  We should make sure that the bits
are the same as those returned from pkey_get.  I have an implementation
on POWER, but have yet to figure out the implications for 32-bit because
I do not know the AMR register size there.
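For reference, this is the offset such a load (or a dedicated BPF instruction) would need under the posted patch's layout. The struct below mirrors the patched seccomp_data locally, since the patch was never merged and the field does not exist in mainline headers:

```c
#include <stddef.h>
#include <stdint.h>

/* Local mirror of struct seccomp_data as extended by the posted
 * (unmerged) patch: pkru and a pad word appended after args[6]. */
struct seccomp_data_patched {
	int nr;
	uint32_t arch;
	uint64_t instruction_pointer;
	uint64_t args[6];
	uint32_t pkru;
	uint32_t reserved;
};

/* A classic-BPF filter would read the value with a 32-bit absolute
 * load at this offset; every preceding field has a fixed size, so
 * the offset is 64 on all architectures. */
static size_t pkru_field_offset(void)
{
	return offsetof(struct seccomp_data_patched, pkru);
}
```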

Thanks,
Florian


Re: [PATCH] seccomp: Add pkru into seccomp_data

2018-10-24 Thread Florian Weimer
* Michael Sammler:

> Add the current value of the PKRU register to data available for
> seccomp-bpf programs to work on. This allows filters based on the
> currently enabled protection keys.

> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
> index 9efc0e73..e8b9ecfc 100644
> --- a/include/uapi/linux/seccomp.h
> +++ b/include/uapi/linux/seccomp.h
> @@ -52,12 +52,16 @@
>   * @instruction_pointer: at the time of the system call.
>   * @args: up to 6 system call arguments always stored as 64-bit values
>   *regardless of the architecture.
> + * @pkru: value of the pkru register
> + * @reserved: pad the structure to a multiple of eight bytes
>   */
>  struct seccomp_data {
>   int nr;
>   __u32 arch;
>   __u64 instruction_pointer;
>   __u64 args[6];
> + __u32 pkru;
> + __u32 reserved;
>  };

This doesn't cover the POWER implementation.  Adding Cc:s.

And I think the kernel shouldn't expose the number of protection keys in
the ABI.

Thanks,
Florian


Re: [PATCH v2 0/6] powerpc/pkeys: fixes to pkeys

2018-06-20 Thread Florian Weimer

On 06/19/2018 02:40 PM, Michael Ellerman wrote:

I tested the whole series with the new selftests, with the printamr.c
program I posted earlier, and the glibc test for pkey_alloc.  The
latter required some test fixes, but now passes as well.  As far as I
can tell, everything looks good now.

Tested-By: Florian Weimer

Thanks. I'll add that to each patch I guess, if you're happy with that?


Sure, but I only tested the whole series as a whole.

Thanks,
Florian


Re: [PATCH v2 6/6] powerpc/pkeys: Deny read/write/execute by default

2018-06-19 Thread Florian Weimer

On 06/19/2018 02:39 PM, Michael Ellerman wrote:

Ram Pai  writes:

Deny all permissions on all keys, with some exceptions.  pkey-0 must
allow all permissions, or else everything comes to a screeching halt.
Execute-only key must allow execute permission.


Another ABI change.

Are we calling this a bug fix?


Yes.  Note that the default is configurable on x86 (default is deny), so 
it's not really an ABI change as such IMHO.


Thanks,
Florian


Re: [PATCH v13 00/24] selftests, powerpc, x86 : Memory Protection Keys

2018-06-14 Thread Florian Weimer

On 06/14/2018 02:44 AM, Ram Pai wrote:

Test

Verified for correctness on powerpc. Need help verifying on x86.
Compiles on x86.


It breaks make in tools/testing/selftests/x86:

make: *** No rule to make target `protection_keys.c', needed by `/home/linux/tools/testing/selftests/x86/protection_keys_64'.  Stop.


The generic implementation no longer builds 32-bit binaries.  Is this 
the intent?


It's possible to build 32-bit binaries with “make CC='gcc -m32'”, so 
perhaps this is good enough?


But with that, I get a warning:

protection_keys.c: In function ‘dump_mem’:
protection_keys.c:172:3: warning: format ‘%lx’ expects argument of type ‘long unsigned int’, but argument 4 has type ‘uint64_t’ [-Wformat=]

   dprintf1("dump[%03d][@%p]: %016lx\n", i, ptr, *ptr);
   ^

I suppose you could use %016llx and add a cast to unsigned long long to 
fix this.
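Concretely, the cast-plus-%016llx approach looks like this (the helper is invented here for illustration; PRIx64 from <inttypes.h> would be the other standard option):

```c
#include <stdint.h>
#include <stdio.h>

/* Print a uint64_t portably: cast to unsigned long long so that
 * %016llx matches the argument on both 32-bit and 64-bit builds.
 * (PRIx64 from <inttypes.h> would avoid the cast.) */
static int format_u64(char *buf, size_t n, uint64_t v)
{
	return snprintf(buf, n, "%016llx", (unsigned long long)v);
}
```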


Anyway, both the 32-bit and 64-bit tests fail here:

assert() at protection_keys.c::943 test_nr: 12 iteration: 1
running abort_hooks()...

I've yet checked what causes this.  It's with the kernel headers from 
4.17, but with other userspace headers based on glibc 2.17.  I hope to 
look into this some more before the weekend, but I eventually have to 
return the test machine to the pool.


Thanks,
Florian


Re: [PATCH v2 0/6] powerpc/pkeys: fixes to pkeys

2018-06-14 Thread Florian Weimer

On 06/14/2018 02:28 AM, Ram Pai wrote:

Assortment of fixes to pkey.

Patch 1  makes pkey consumable in multithreaded applications.

Patch 2  fixes fork behavior to inherit the key attributes.

Patch 3  A off-by-one bug made one key unusable. Fixes it.

Patch 4  Execute-only key is preallocated.

Patch 5  Makes pkey-0 less special.

Patch 6  Deny by default permissions on all unallocated keys.

Passes all selftests on powerpc. Also behavior verified to be correct
by Florian.

Changelog:

v2: . fixed merge conflict with upstream code.
. Add patch 6. Makes the behavior consistent
  with that on x86.


(Except signal handling, but I agree with Ram that the POWER behavior is 
the correct one.)



Ram Pai (6):
   powerpc/pkeys: Enable all user-allocatable pkeys at init.
   powerpc/pkeys: Save the pkey registers before fork
   powerpc/pkeys: fix calculation of total pkeys.
   powerpc/pkeys: Preallocate execute-only key
   powerpc/pkeys: make protection key 0 less special
   powerpc/pkeys: Deny read/write/execute by default


I tested the whole series with the new selftests, with the printamr.c 
program I posted earlier, and the glibc test for pkey_alloc.  The 
latter required some test fixes, but now passes as well.  As far as I 
can tell, everything looks good now.


Tested-By: Florian Weimer 

Thanks,
Florian


Re: [PATCH 5/5] powerpc/pkeys: make protection key 0 less special

2018-06-12 Thread Florian Weimer

On 06/05/2018 04:09 AM, Ram Pai wrote:

diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
index 0409c80..d349e22 100644
--- a/arch/powerpc/include/asm/pkeys.h
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -13,7 +13,10 @@
  
  DECLARE_STATIC_KEY_TRUE(pkey_disabled);

  extern int pkeys_total; /* total pkeys as per device tree */
-extern u32 initial_allocation_mask; /* bits set for reserved keys */
+extern u32 initial_allocation_mask; /* bits set for the initially allocated keys */
+extern u32 reserved_allocation_mask; /* bits set for reserved keys */
+
+#define PKEY_0 0
  
  /*

   * Define these here temporarily so we're not dependent on patching linux/mm.h.


This part no longer applies cleanly.

Thanks,
Florian


Re: pkeys on POWER: Access rights not reset on execve

2018-06-12 Thread Florian Weimer

On 06/11/2018 10:08 PM, Ram Pai wrote:

Ok. try this patch. This patch is on top of the 5 patches that I had
sent last week, i.e. "[PATCH 0/5] powerpc/pkeys: fixes to pkeys"

The following is a draft patch though to check if it meets your
expectations.

commit fe53b5fe2dcb3139ea27ade3ae7cbbe43c4af3be
Author: Ram Pai
Date:   Mon Jun 11 14:57:34 2018 -0500

 powerpc/pkeys: Deny read/write/execute by default


With this patch, my existing misc/tst-pkey test in glibc passes.  The 
in-tree version still has some incorrect assumptions on implementation 
behavior, but those are test bugs.  The kernel behavior with your patch 
look good to me.  Thanks.


Florian


Re: pkeys on POWER: Access rights not reset on execve

2018-06-11 Thread Florian Weimer

On 06/11/2018 07:23 PM, Ram Pai wrote:

On Fri, Jun 08, 2018 at 07:53:51AM +0200, Florian Weimer wrote:

On 06/08/2018 04:34 AM, Ram Pai wrote:


So the remaining question at this point is whether the Intel
behavior (default-deny instead of default-allow) is preferable.


Florian, remind me what behavior needs to fixed?


See the other thread.  The Intel register equivalent to the AMR by
default disallows access to yet-unallocated keys, so that threads
which are created before key allocation do not magically gain access
to a key allocated by another thread.


Are you referring to the thread
'[PATCH] pkeys: Introduce PKEY_ALLOC_SIGNALINHERIT and change signal semantics'



Otherwise please point me to the URL of that thread. Sorry and thankx. :)


No, it's this issue:

  <https://lists.ozlabs.org/pipermail/linuxppc-dev/2018-May/173157.html>

The UAMOR part has been fixed (thanks), but I think processes still 
start out with default-allow AMR.


Thanks,
Florian


Re: pkeys on POWER: Access rights not reset on execve

2018-06-08 Thread Florian Weimer

On 06/08/2018 03:49 PM, Michal Suchánek wrote:

On Fri, 8 Jun 2018 14:57:06 +0200
Florian Weimer  wrote:


On 06/08/2018 02:54 PM, Michal Suchánek wrote:

On Fri, 8 Jun 2018 12:44:53 +0200
Florian Weimer  wrote:
   

On 06/08/2018 12:15 PM, Michal Suchánek wrote:

On Fri, 8 Jun 2018 07:53:51 +0200
Florian Weimer  wrote:
  

On 06/08/2018 04:34 AM, Ram Pai wrote:


So the remaining question at this point is whether the Intel
behavior (default-deny instead of default-allow) is
preferable.


Florian, remind me what behavior needs to fixed?


See the other thread.  The Intel register equivalent to the AMR
by default disallows access to yet-unallocated keys, so that
threads which are created before key allocation do not magically
gain access to a key allocated by another thread.
 


That does not make any sense. The threads share the address space
so they should also share the keys.

Or in other words the keys are supposed to be acceleration of
mprotect() so if mprotect() magically gives access to threads that
did not call it so should pkey functions. If they cannot do that
then they fail the primary purpose.


That's not how protection keys work.  The access rights are
thread-specific, so that you can change them locally, without
synchronization and expensive inter-node communication.
  


And the association of a key with part of the address space is
thread-local as well?


No, that part is still per-process.


So as said above it does not make sense to make keys per-thread.


The keys are still global, but the access rights are per-thread and have 
to be for reliability reasons.


Thanks,
Florian


Re: pkeys on POWER: Access rights not reset on execve

2018-06-08 Thread Florian Weimer

On 06/08/2018 02:54 PM, Michal Suchánek wrote:

On Fri, 8 Jun 2018 12:44:53 +0200
Florian Weimer  wrote:


On 06/08/2018 12:15 PM, Michal Suchánek wrote:

On Fri, 8 Jun 2018 07:53:51 +0200
Florian Weimer  wrote:
   

On 06/08/2018 04:34 AM, Ram Pai wrote:


So the remaining question at this point is whether the Intel
behavior (default-deny instead of default-allow) is preferable.


Florian, remind me what behavior needs to fixed?


See the other thread.  The Intel register equivalent to the AMR by
default disallows access to yet-unallocated keys, so that threads
which are created before key allocation do not magically gain
access to a key allocated by another thread.
  


That does not make any sense. The threads share the address space so
they should also share the keys.

Or in other words the keys are supposed to be acceleration of
mprotect() so if mprotect() magically gives access to threads that
did not call it so should pkey functions. If they cannot do that
then they fail the primary purpose.


That's not how protection keys work.  The access rights are
thread-specific, so that you can change them locally, without
synchronization and expensive inter-node communication.



And the association of a key with part of the address space is
thread-local as well?


No, that part is still per-process.

Florian


Re: pkeys on POWER: Access rights not reset on execve

2018-06-08 Thread Florian Weimer

On 06/08/2018 12:15 PM, Michal Suchánek wrote:

On Fri, 8 Jun 2018 07:53:51 +0200
Florian Weimer  wrote:


On 06/08/2018 04:34 AM, Ram Pai wrote:


So the remaining question at this point is whether the Intel
behavior (default-deny instead of default-allow) is preferable.


Florian, remind me what behavior needs to fixed?


See the other thread.  The Intel register equivalent to the AMR by
default disallows access to yet-unallocated keys, so that threads
which are created before key allocation do not magically gain access
to a key allocated by another thread.



That does not make any sense. The threads share the address space so
they should also share the keys.

Or in other words the keys are supposed to be acceleration of
mprotect() so if mprotect() magically gives access to threads that did
not call it so should pkey functions. If they cannot do that then they
fail the primary purpose.


That's not how protection keys work.  The access rights are 
thread-specific, so that you can change them locally, without 
synchronization and expensive inter-node communication.


Thanks,
Florian


Re: pkeys on POWER: Access rights not reset on execve

2018-06-08 Thread Florian Weimer

On 06/08/2018 04:34 AM, Ram Pai wrote:


So the remaining question at this point is whether the Intel
behavior (default-deny instead of default-allow) is preferable.


Florian, remind me what behavior needs to fixed?


See the other thread.  The Intel register equivalent to the AMR by 
default disallows access to yet-unallocated keys, so that threads which 
are created before key allocation do not magically gain access to a key 
allocated by another thread.


Thanks,
Florian


Re: pkeys on POWER: Access rights not reset on execve

2018-06-04 Thread Florian Weimer

On 06/04/2018 09:02 PM, Ram Pai wrote:

On Mon, Jun 04, 2018 at 07:57:46PM +0200, Florian Weimer wrote:

On 06/04/2018 04:01 PM, Ram Pai wrote:

On Mon, Jun 04, 2018 at 12:12:07PM +0200, Florian Weimer wrote:

On 06/03/2018 10:18 PM, Ram Pai wrote:

On Mon, May 21, 2018 at 01:29:11PM +0200, Florian Weimer wrote:

On 05/20/2018 09:11 PM, Ram Pai wrote:

Florian,

Does the following patch fix the problem for you?  Just like x86
I am enabling all keys in the UAMOR register during
initialization itself. Hence any key created by any thread at
any time, will get activated on all threads. So any thread
can change the permission on that key. Smoke tested it
with your test program.


I think this goes in the right direction, but the AMR value after
fork is still strange:

AMR (PID 34912): 0x
AMR after fork (PID 34913): 0x
AMR (PID 34913): 0x
Allocated key in subprocess (PID 34913): 2
Allocated key (PID 34912): 2
Setting AMR: 0x
New AMR value (PID 34912): 0x0fff
About to call execl (PID 34912) ...
AMR (PID 34912): 0x0fff
AMR after fork (PID 34914): 0x0003
AMR (PID 34914): 0x0003
Allocated key in subprocess (PID 34914): 2
Allocated key (PID 34912): 2
Setting AMR: 0x
New AMR value (PID 34912): 0x0fff

I mean this line:

AMR after fork (PID 34914): 0x0003

Shouldn't it be the same as in the parent process?


Fixed it. Please try this patch. If it all works to your satisfaction, I
will clean it up further and send to Michael Ellerman (ppc maintainer).


commit 51f4208ed5baeab1edb9b0f8b68d719b3527
Author: Ram Pai 
Date:   Sun Jun 3 14:44:32 2018 -0500

 Fix for the fork bug.
 Signed-off-by: Ram Pai 


Is this on top of the previous patch, or a separate fix?


top of previous patch.


Thanks.  With this patch, I get this on an LPAR:

AMR (PID 1876): 0x0003
AMR after fork (PID 1877): 0x0003
AMR (PID 1877): 0x0003
Allocated key in subprocess (PID 1877): 2
Allocated key (PID 1876): 2
Setting AMR: 0x
New AMR value (PID 1876): 0x0fff
About to call execl (PID 1876) ...
AMR (PID 1876): 0x0003
AMR after fork (PID 1878): 0x0003
AMR (PID 1878): 0x0003
Allocated key in subprocess (PID 1878): 2
Allocated key (PID 1876): 2
Setting AMR: 0x
New AMR value (PID 1876): 0x0fff

Test program is still this one:

<https://lists.ozlabs.org/pipermail/linuxppc-dev/2018-May/173198.html>

So the process starts out with a different AMR value for some
reason. That could be a pre-existing bug that was just hidden by the
reset-to-zero on fork, or it could be intentional.  But the kernel


yes it is a bug, a patch for which is lined up for submission.

The fix is


commit eaf5b2ac002ad2f5bca118d7ce075ce28311aa8e
Author: Ram Pai 
Date:   Mon Jun 4 10:58:44 2018 -0500

 powerpc/pkeys: fix total pkeys calculation
 
 Total number of pkeys calculation is off by 1. Fix it.
 
 Signed-off-by: Ram Pai 


Looks good to me now.  Initial AMR value is zero, as is currently intended.

So the remaining question at this point is whether the Intel behavior 
(default-deny instead of default-allow) is preferable.


But if you can get the existing fixes into 4.18 and perhaps the relevant 
stable kernels, that would already be a great help for my glibc work.


Thanks,
Florian


Re: pkeys on POWER: Access rights not reset on execve

2018-06-04 Thread Florian Weimer

On 06/04/2018 04:01 PM, Ram Pai wrote:

On Mon, Jun 04, 2018 at 12:12:07PM +0200, Florian Weimer wrote:

On 06/03/2018 10:18 PM, Ram Pai wrote:

On Mon, May 21, 2018 at 01:29:11PM +0200, Florian Weimer wrote:

On 05/20/2018 09:11 PM, Ram Pai wrote:

Florian,

Does the following patch fix the problem for you?  Just like x86
I am enabling all keys in the UAMOR register during
initialization itself. Hence any key created by any thread at
any time, will get activated on all threads. So any thread
can change the permission on that key. Smoke tested it
with your test program.


I think this goes in the right direction, but the AMR value after
fork is still strange:

AMR (PID 34912): 0x
AMR after fork (PID 34913): 0x
AMR (PID 34913): 0x
Allocated key in subprocess (PID 34913): 2
Allocated key (PID 34912): 2
Setting AMR: 0x
New AMR value (PID 34912): 0x0fff
About to call execl (PID 34912) ...
AMR (PID 34912): 0x0fff
AMR after fork (PID 34914): 0x0003
AMR (PID 34914): 0x0003
Allocated key in subprocess (PID 34914): 2
Allocated key (PID 34912): 2
Setting AMR: 0x
New AMR value (PID 34912): 0x0fff

I mean this line:

AMR after fork (PID 34914): 0x0003

Shouldn't it be the same as in the parent process?


Fixed it. Please try this patch. If it all works to your satisfaction, I
will clean it up further and send to Michael Ellerman (ppc maintainer).


commit 51f4208ed5baeab1edb9b0f8b68d719b3527
Author: Ram Pai 
Date:   Sun Jun 3 14:44:32 2018 -0500

 Fix for the fork bug.
 Signed-off-by: Ram Pai 


Is this on top of the previous patch, or a separate fix?


top of previous patch.


Thanks.  With this patch, I get this on an LPAR:

AMR (PID 1876): 0x0003
AMR after fork (PID 1877): 0x0003
AMR (PID 1877): 0x0003
Allocated key in subprocess (PID 1877): 2
Allocated key (PID 1876): 2
Setting AMR: 0x
New AMR value (PID 1876): 0x0fff
About to call execl (PID 1876) ...
AMR (PID 1876): 0x0003
AMR after fork (PID 1878): 0x0003
AMR (PID 1878): 0x0003
Allocated key in subprocess (PID 1878): 2
Allocated key (PID 1876): 2
Setting AMR: 0x
New AMR value (PID 1876): 0x0fff

Test program is still this one:

<https://lists.ozlabs.org/pipermail/linuxppc-dev/2018-May/173198.html>

So the process starts out with a different AMR value for some reason. 
That could be a pre-existing bug that was just hidden by the 
reset-to-zero on fork, or it could be intentional.  But the kernel code 
does not indicate that key 63 is reserved (POWER numbers keys from the 
MSB to the LSB).


But it looks like we are finally getting somewhere. 8-)

Thanks,
Florian


Re: pkeys on POWER: Access rights not reset on execve

2018-06-04 Thread Florian Weimer

On 06/03/2018 10:18 PM, Ram Pai wrote:

On Mon, May 21, 2018 at 01:29:11PM +0200, Florian Weimer wrote:

On 05/20/2018 09:11 PM, Ram Pai wrote:

Florian,

Does the following patch fix the problem for you?  Just like x86
I am enabling all keys in the UAMOR register during
initialization itself. Hence any key created by any thread at
any time, will get activated on all threads. So any thread
can change the permission on that key. Smoke tested it
with your test program.


I think this goes in the right direction, but the AMR value after
fork is still strange:

AMR (PID 34912): 0x
AMR after fork (PID 34913): 0x
AMR (PID 34913): 0x
Allocated key in subprocess (PID 34913): 2
Allocated key (PID 34912): 2
Setting AMR: 0x
New AMR value (PID 34912): 0x0fff
About to call execl (PID 34912) ...
AMR (PID 34912): 0x0fff
AMR after fork (PID 34914): 0x0003
AMR (PID 34914): 0x0003
Allocated key in subprocess (PID 34914): 2
Allocated key (PID 34912): 2
Setting AMR: 0x
New AMR value (PID 34912): 0x0fff

I mean this line:

AMR after fork (PID 34914): 0x0003

Shouldn't it be the same as in the parent process?


Fixed it. Please try this patch. If it all works to your satisfaction, I
will clean it up further and send to Michael Ellermen(ppc maintainer).


commit 51f4208ed5baeab1edb9b0f8b68d719b3527
Author: Ram Pai 
Date:   Sun Jun 3 14:44:32 2018 -0500

 Fix for the fork bug.
 
 Signed-off-by: Ram Pai 


Is this on top of the previous patch, or a separate fix?

Thanks,
Florian


Re: pkeys on POWER: Access rights not reset on execve

2018-05-21 Thread Florian Weimer

On 05/20/2018 09:11 PM, Ram Pai wrote:

Florian,

Does the following patch fix the problem for you?  Just like x86
I am enabling all keys in the UAMOR register during
initialization itself. Hence any key created by any thread at
any time, will get activated on all threads. So any thread
can change the permission on that key. Smoke tested it
with your test program.


I think this goes in the right direction, but the AMR value after fork 
is still strange:


AMR (PID 34912): 0x
AMR after fork (PID 34913): 0x
AMR (PID 34913): 0x
Allocated key in subprocess (PID 34913): 2
Allocated key (PID 34912): 2
Setting AMR: 0x
New AMR value (PID 34912): 0x0fff
About to call execl (PID 34912) ...
AMR (PID 34912): 0x0fff
AMR after fork (PID 34914): 0x0003
AMR (PID 34914): 0x0003
Allocated key in subprocess (PID 34914): 2
Allocated key (PID 34912): 2
Setting AMR: 0x
New AMR value (PID 34912): 0x0fff

I mean this line:

AMR after fork (PID 34914): 0x0003

Shouldn't it be the same as in the parent process?

Thanks,
Florian


Re: pkeys on POWER: Access rights not reset on execve

2018-05-19 Thread Florian Weimer

On 05/19/2018 03:19 AM, Ram Pai wrote:

The issue you may be talking about here is that  --

"when you set the AMR register to 0x, it
just sets it to 0x0c00."

To me it looks like, exec/fork are not related to the issue.
Or are they also somehow connected to the issue?


The reason the AMR register does not get set to 0x,
is because none of those keys; except key 2, are active. So it ignores
all other bits and just sets the bits corresponding to key 2.


Here's a slightly different test:

#include <err.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Return the value of the AMR register.  */
static inline unsigned long int
pkey_read (void)
{
  unsigned long int result;
  __asm__ volatile ("mfspr %0, 13" : "=r" (result));
  return result;
}

/* Overwrite the AMR register with VALUE.  */
static inline void
pkey_write (unsigned long int value)
{
  __asm__ volatile ("mtspr 13, %0" : : "r" (value));
}

int
main (int argc, char **argv)
{
  printf ("AMR (PID %d): 0x%016lx\n", (int) getpid (), pkey_read());
  if (argc > 1 && strcmp (argv[1], "alloc") == 0)
{
  int key = syscall (__NR_pkey_alloc, 0, 0);
  if (key < 0)
err (1, "pkey_alloc");
  printf ("Allocated key in subprocess (PID %d): %d\n",
  (int) getpid (), key);
  return 0;
}

  pid_t pid = fork ();
  if (pid == 0)
{
  printf ("AMR after fork (PID %d): 0x%016lx\n",
  (int) getpid (), pkey_read());
  execl ("/proc/self/exe", argv[0], "alloc", NULL);
  _exit (1);
}
  if (pid < 0)
err (1, "fork");
  int status;
  if (waitpid (pid, &status, 0) < 0)
err (1, "waitpid");

  int key = syscall (__NR_pkey_alloc, 0, 0);
  if (key < 0)
err (1, "pkey_alloc");
  printf ("Allocated key (PID %d): %d\n", (int) getpid (), key);

  unsigned long int amr = -1;
  printf ("Setting AMR: 0x%016lx\n", amr);
  pkey_write (amr);
  printf ("New AMR value (PID %d): 0x%016lx\n",
  (int) getpid (), pkey_read());
  if (argc == 1)
{
  printf ("About to call execl (PID %d) ...\n", (int) getpid ());
  execl ("/proc/self/exe", argv[0], "execl", NULL);
  err (1, "exec");
  return 1;
}
  else
return 0;
}

It produces:

AMR (PID 110163): 0x
AMR after fork (PID 110164): 0x
AMR (PID 110164): 0x
Allocated key in subprocess (PID 110164): 2
Allocated key (PID 110163): 2
Setting AMR: 0x
New AMR value (PID 110163): 0x0c00
About to call execl (PID 110163) ...
AMR (PID 110163): 0x0c00
AMR after fork (PID 110165): 0x
AMR (PID 110165): 0x
Allocated key in subprocess (PID 110165): 2
Allocated key (PID 110163): 2
Setting AMR: 0x
New AMR value (PID 110163): 0x0c00

A few things which are odd stand out (apart the wrong default for AMR 
and the AMR update restriction covered in the other thread):


* execve does not reset AMR (see after “About to call execl”)
* fork resets AMR (see lines with PID 110165))
* After execve, a key with non-default access rights is allocated
  (see “Allocated key (PID 110163): 2”, second time, after execl)

No matter what you think about the AMR default, I posit that each of 
those are bugs (although the last one should be fixed by resetting AMR 
on execve).


Thanks,
Florian


Re: pkeys on POWER: Access rights not reset on execve

2018-05-18 Thread Florian Weimer

On 05/19/2018 03:50 AM, Andy Lutomirski wrote:

Now another thread calls pkey_alloc(), so UAMR is asynchronously changed,
and the thread will write zero to the relevant AMR bits.  If I understand
correctly, this means that the decision to mask off unallocated keys via
UAMR effectively forces the initial value of newly-allocated keys in other
threads in the allocating process to be zero, whatever zero means.  (I
didn't get far enough in the POWER docs to figure out what zero means.)  So


(Note that this belongs on the other thread, here I originally wanted to 
talk about the lack of reset of AMR to the default value on execve.)


I don't think UAMOR is updated asynchronously.  On pkey_alloc, it is 
only changed for the current thread, and future threads launched from 
it.  Existing threads are unaffected.  This still results in a 
programming model which is substantially different from x86.



I don't think you're doing anyone any favors by making UAMR dynamic.


This is still true, I think.

Thanks,
Florian


Re: pkeys on POWER: Default AMR, UAMOR values

2018-05-18 Thread Florian Weimer

On 05/19/2018 02:52 AM, Ram Pai wrote:


The POWER semantics make it very hard for a multithreaded program to
meaningfully use protection keys to prevent accidental access to important
memory.


And you can change access rights for unallocated keys (unallocated
at thread start time, allocated later) on x86.  I have extended the
misc/tst-pkeys test to verify that, and it passes on x86, but not on
POWER, where the access rights are stuck.


This is something I do not understand. How can a thread change permissions
on a key that is not even allocated in the first place?


It was allocated by another thread, and there is synchronization so that 
the allocation happens before the change in access rights.



Do you consider a key
allocated in some other thread's context, as allocated in this threads
context?


Yes, x86 does that.


If not, does that mean -- On x86, you can activate a key just
by changing its permission?


This is also true on x86, but just an artifact of the implementation.  You 
are supposed to call pkey_alloc before changing the flag.


Thanks,
Florian


Re: pkeys on POWER: Access rights not reset on execve

2018-05-18 Thread Florian Weimer

On 05/19/2018 03:19 AM, Ram Pai wrote:


New AMR value (PID 112291, before execl): 0x0c00
AMR (PID 112291): 0x0c00


The issue is here.  The second line is after the execl (printed from the 
start of main), and the AMR value is not reset to zero.



Allocated key (PID 112291): 2

I think this is a real bug and needs to be fixed even if the
defaults are kept as-is (see the other thread).


The issue you may be talking about here is that  --

"when you set the AMR register to 0x, it
just sets it to 0x0c00."

To me it looks like, exec/fork are not related to the issue.
Or are they also somehow connected to the issue?


Yes, this is the other issue.  It is not really important here.

Thanks,
Florian


Re: pkeys on POWER: Default AMR, UAMOR values

2018-05-18 Thread Florian Weimer

On 05/18/2018 09:39 PM, Andy Lutomirski wrote:

The difference is that x86 starts out with deny-all instead of allow-all.
The POWER semantics make it very hard for a multithreaded program to
meaningfully use protection keys to prevent accidental access to important
memory.


And you can change access rights for unallocated keys (unallocated at 
thread start time, allocated later) on x86.  I have extended the 
misc/tst-pkeys test to verify that, and it passes on x86, but not on 
POWER, where the access rights are stuck.


I believe this is due to an incorrect UAMOR setting.

Thanks,
Florian


Re: pkeys on POWER: Default AMR, UAMOR values

2018-05-18 Thread Florian Weimer

On 05/18/2018 07:44 PM, Ram Pai wrote:

Florian, is the behavior on x86 any different? A key allocated in the
context of one thread is not meaningful in the context of any other
thread.

Since thread B was created prior to the creation of the key, and the key
was created in the context of thread A, thread B neither inherits the
key nor its permissions. At least that is how the semantics are supposed
to work as per the man page.

man 7 pkey

" Applications  using  threads  and  protection  keys  should
be especially careful.  Threads inherit the protection key rights of the
parent at the time of the clone(2) system call.  Applications should
either ensure that their own permissions are appropriate for child
threads at the time when clone(2) is  called,  or ensure that each child
thread can perform its own initialization of protection key rights."


I reported two separate issues (actually three, but the execve bug is in 
a separate issue).  The default, and the write restrictions.


The default is just a difference to x86 (however, x86 can be booted with 
init_pkru=0 and behaves the same way, but we're probably going to remove 
that).


The POWER implementation has the additional wrinkle that threads 
launched early, before key allocation, can never change access rights 
because they inherited not just the access rights, but also the access 
rights access mask.  This is different from x86, where all threads can 
freely update access rights, and contradicts the behavior in the manpage 
which says that “each child thread can perform its own initialization of 
protection key rights”.  It can't do that if it is launched before key 
allocation, which is not the right behavior IMO.


Thanks,
Florian


pkeys on POWER: Access rights not reset on execve

2018-05-18 Thread Florian Weimer

This test program:

#include <err.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Return the value of the AMR register.  */
static inline unsigned long int
pkey_read (void)
{
  unsigned long int result;
  __asm__ volatile ("mfspr %0, 13" : "=r" (result));
  return result;
}

/* Overwrite the AMR register with VALUE.  */
static inline void
pkey_write (unsigned long int value)
{
  __asm__ volatile ("mtspr 13, %0" : : "r" (value));
}

int
main (int argc, char **argv)
{
  printf ("AMR (PID %d): 0x%016lx\n", (int) getpid (), pkey_read());
  if (argc > 1)
{
  int key = syscall (__NR_pkey_alloc, 0, 0);
  if (key < 0)
err (1, "pkey_alloc");
  printf ("Allocated key (PID %d): %d\n", (int) getpid (), key);
  return 0;
}

  pid_t pid = fork ();
  if (pid == 0)
{
  execl ("/proc/self/exe", argv[0], "subprocess", NULL);
  _exit (1);
}
  if (pid < 0)
err (1, "fork");
  int status;
  if (waitpid (pid, &status, 0) < 0)
err (1, "waitpid");

  int key = syscall (__NR_pkey_alloc, 0, 0);
  if (key < 0)
err (1, "pkey_alloc");
  printf ("Allocated key (PID %d): %d\n", (int) getpid (), key);

  unsigned long int amr = -1;
  printf ("Setting AMR: 0x%016lx\n", amr);
  pkey_write (amr);
  printf ("New AMR value (PID %d, before execl): 0x%016lx\n",
  (int) getpid (), pkey_read());
  execl ("/proc/self/exe", argv[0], "subprocess", NULL);
  err (1, "exec");
  return 1;
}

shows that the AMR register value is not reset on execve:

AMR (PID 112291): 0x
AMR (PID 112292): 0x
Allocated key (PID 112292): 2
Allocated key (PID 112291): 2
Setting AMR: 0x
New AMR value (PID 112291, before execl): 0x0c00
AMR (PID 112291): 0x0c00
Allocated key (PID 112291): 2

I think this is a real bug and needs to be fixed even if the defaults 
are kept as-is (see the other thread).


(Seen on 4.17.0-rc5.)

Thanks,
Florian


pkeys on POWER: Default AMR, UAMOR values

2018-05-18 Thread Florian Weimer
I'm working on adding POWER pkeys support to glibc.  The coding work is 
done, but I'm faced with some test suite failures.


Unlike the default x86 configuration, on POWER, existing threads have 
full access to newly allocated keys.


Or, more precisely, in this scenario:

* Thread A launches thread B
* Thread B waits
* Thread A allocates a protection key with pkey_alloc
* Thread A applies the key to a page
* Thread A signals thread B
* Thread B starts to run and accesses the page

Then at the end, the access will be granted.

I hope it's not too late to change this to denied access.

Furthermore, I think the UAMOR value is wrong as well because it 
prevents thread B at the end to set the AMR register.  In particular, if 
I do this


* … (as before)
* Thread A signals thread B
* Thread B sets the access rights for the key to PKEY_DISABLE_ACCESS
* Thread B reads the current access rights for the key

then it still gets 0 (all access permitted) because the original UAMOR 
value inherited from thread A prior to the key allocation masks out the 
access right update for the newly allocated key.


Thanks,
Florian


Re: [PATCH] pkeys: Introduce PKEY_ALLOC_SIGNALINHERIT and change signal semantics

2018-05-17 Thread Florian Weimer

On 05/16/2018 10:35 PM, Ram Pai wrote:

So let me see if I understand the overall idea.

Application can allocate new keys through a new syscall
sys_pkey_alloc_1(flags, init_val, sig_init_val)

'sig_init_val' is the permission-state of the key in signal context.


I would keep the existing system call and just add a flag, say 
PKEY_ALLOC_SETSIGNAL.  If the current thread needs different access 
rights, it can set those rights just after pkey_alloc returns.  There is 
no race that matters here, I think.


Thanks,
Florian


Re: [PATCH] pkeys: Introduce PKEY_ALLOC_SIGNALINHERIT and change signal semantics

2018-05-17 Thread Florian Weimer

On 05/16/2018 11:07 PM, Ram Pai wrote:


what would change the key-permission-values enforced in signal-handler
context?  Or can it never be changed, ones set through sys_pkey_alloc()?


The access rights can only be set by pkey_alloc and are unchanged after 
that (so we do not have to discuss whether the signal handler access 
rights are per-thread or not).



I suppose key-permission-value changes done in non-signal-handler context
will not apply to those in signal-handler context.


Correct, that is the plan.


Can the signal handler change the key-permission-values from the
signal-handler context?


Yes, changes are possible.  The access rights given to pkey_alloc only 
specify the initial access rights when the signal handler is entered.


We need to decide if we should restore it on exit from the signal 
handler.  There is also the matter of siglongjmp, which currently does 
not restore the current thread's access rights.  In general, this might 
be difficult to implement because of the limited space in jmp_buf.


Thanks,
Florian


Re: [PATCH] pkeys: Introduce PKEY_ALLOC_SIGNALINHERIT and change signal semantics

2018-05-14 Thread Florian Weimer

On 05/14/2018 05:32 PM, Andy Lutomirski wrote:





On May 14, 2018, at 5:01 AM, Florian Weimer <fwei...@redhat.com> wrote:


One thing we could do, though: the current initial state on process
creation is all access blocked on all keys.  We could change it so that
half the keys are fully blocked and half are read-only.  Then we could add
a PKEY_ALLOC_STRICT or similar that allocates a key with the correct
initial state *and* does the setsignal thing. If there are no keys left
with the correct initial state, then it fails.


The initial PKRU value can currently be configured by the system administrator. 
 I fear this approach has too many moving parts to be viable.




Honestly, I think we should drop that option. I don’t see how we can expect an 
administrator to do this usefully.


I don't disagree—it makes things way less predictable in practice.

Thanks,
Florian


Re: [PATCH] pkeys: Introduce PKEY_ALLOC_SIGNALINHERIT and change signal semantics

2018-05-14 Thread Florian Weimer

On 05/09/2018 04:41 PM, Andy Lutomirski wrote:

Hmm.  I can get on board with the idea that fork() / clone() /
pthread_create() are all just special cases of the idea that the thread
that *calls* them should have the right pkey values, and the latter is
already busted given our inability to asynchronously propagate the new mode
in pkey_alloc().  So let's do PKEY_ALLOC_SETSIGNAL as a starting point.


Ram, any suggestions for implementing this on POWER?


One thing we could do, though: the current initial state on process
creation is all access blocked on all keys.  We could change it so that
half the keys are fully blocked and half are read-only.  Then we could add
a PKEY_ALLOC_STRICT or similar that allocates a key with the correct
initial state *and* does the setsignal thing.  If there are no keys left
with the correct initial state, then it fails.


The initial PKRU value can currently be configured by the system 
administrator.  I fear this approach has too many moving parts to be viable.


Thanks,
Florian


Re: [PATCH] pkeys: Introduce PKEY_ALLOC_SIGNALINHERIT and change signal semantics

2018-05-08 Thread Florian Weimer

On 05/08/2018 04:49 AM, Andy Lutomirski wrote:

On Mon, May 7, 2018 at 2:48 AM Florian Weimer <fwei...@redhat.com> wrote:


On 05/03/2018 06:05 AM, Andy Lutomirski wrote:

On Wed, May 2, 2018 at 7:11 PM Ram Pai <linux...@us.ibm.com> wrote:


On Wed, May 02, 2018 at 09:23:49PM +, Andy Lutomirski wrote:



If I recall correctly, the POWER maintainer did express a strong desire
back then for (what is, I believe) their current semantics, which my
PKEY_ALLOC_SIGNALINHERIT patch implements for x86, too.


Ram, I really really don't like the POWER semantics.  Can you give some
justification for them?  Does POWER at least have an atomic way for
userspace to modify just the key it wants to modify or, even better,
special load and store instructions to use alternate keys?



I wouldn't call it POWER semantics. The way I implemented it on power
led to the semantics, given that nothing was explicitly stated
about how the semantics should work within a signal handler.


I think that this is further evidence that we should introduce a new
pkey_alloc() mode and deprecate the old.  To the extent possible, this
thing should work the same way on x86 and POWER.



Do you propose to change POWER or to change x86?


Sorry for being slow to reply.  I propose to introduce a new
PKEY_ALLOC_something variant on x86 and POWER and to make the behavior
match on both.


So basically implement PKEY_ALLOC_SETSIGNAL for POWER, and keep the 
existing (different) behavior without the flag?


Ram, would you be okay with that?  Could you give me a hand if 
necessary?  (I assume we have silicon in-house because it's a 
long-standing feature of the POWER platform which was simply dormant on 
Linux until now.)



It should at least update the values loaded when a signal
is delivered and it should probably also update it for new threads.


I think we should keep inheritance for new threads and fork.  pkey_alloc 
only has a single access rights argument, which makes it hard to reuse 
this interface if there are two (three) separate sets of access rights.


Is there precedent for process state reverting on fork, besides 
MADV_WIPEONFORK?  My gut feeling is that we should avoid that.



For glibc, for example, I assume that you want signals to be delivered with
write access disabled to the GOT.  Otherwise you would fail to protect
against exploits that occur in signal context.  Glibc controls thread
creation, so the initial state on thread startup doesn't really matter, but
there will be more users than just glibc.


glibc does not control thread, or more precisely, subprocess creation. 
Otherwise we wouldn't have faced that many issues with our PID cache. 8-/


Thanks,
Florian


Re: [PATCH] pkeys: Introduce PKEY_ALLOC_SIGNALINHERIT and change signal semantics

2018-05-07 Thread Florian Weimer

On 05/03/2018 01:38 AM, Ram Pai wrote:

This is a new requirement that I was not aware of. It's not documented
anywhere AFAICT.


Correct.  All inheritance behavior was deliberately left unspecified.

I'm surprised about the reluctance to fix the x86 behavior.  Are there 
any applications at all for the current semantics?


I guess I can implement this particular glibc hardening on POWER only 
for now.


Thanks,
Florian


Re: [PATCH] x86, powerpc : pkey-mprotect must allow pkey-0

2018-03-14 Thread Florian Weimer

On 03/14/2018 09:00 AM, Florian Weimer wrote:

On 03/09/2018 09:00 PM, Ram Pai wrote:

On Fri, Mar 09, 2018 at 12:04:49PM +0100, Florian Weimer wrote:

On 03/09/2018 09:12 AM, Ram Pai wrote:
Once an address range is associated with an allocated pkey, it 
cannot be
reverted back to key-0. There is no valid reason for the above 
behavior.


mprotect without a key does not necessarily use key 0, e.g. if
protection keys are used to emulate a page protection flag combination
which is not directly supported by the hardware.

Therefore, it seems to me that filtering out non-allocated keys is
the right thing to do.


I am not sure what you mean. Do you agree with the patch or otherwise?


I think it's inconsistent to make key 0 allocated, but not the key which 
is used for PROT_EXEC emulation, which is still reserved.  Even if you 
change the key 0 behavior, it is still not possible to emulate mprotect 
behavior faithfully with an allocated key.


Ugh.  Should have read the code first before replying:

	/* Do we need to assign a pkey for mm's execute-only maps? */
	if (execute_only_pkey == -1) {
		/* Go allocate one to use, which might fail */
		execute_only_pkey = mm_pkey_alloc(mm);
		if (execute_only_pkey < 0)
			return -1;
		need_to_set_mm_pkey = true;
	}

So we do allocate the PROT_EXEC-only key, and I assume it means that the 
key can be restored using pkey_mprotect.  So the key 0 behavior is a 
true exception after all, and it makes sense to realign the behavior 
with the other keys.


Thanks,
Florian


Re: [PATCH] x86, powerpc : pkey-mprotect must allow pkey-0

2018-03-14 Thread Florian Weimer

On 03/09/2018 09:00 PM, Ram Pai wrote:

On Fri, Mar 09, 2018 at 12:04:49PM +0100, Florian Weimer wrote:

On 03/09/2018 09:12 AM, Ram Pai wrote:

Once an address range is associated with an allocated pkey, it cannot be
reverted back to key-0. There is no valid reason for the above behavior.


mprotect without a key does not necessarily use key 0, e.g. if
protection keys are used to emulate a page protection flag combination
which is not directly supported by the hardware.

Therefore, it seems to me that filtering out non-allocated keys is
the right thing to do.


I am not sure what you mean. Do you agree with the patch or otherwise?


I think it's inconsistent to make key 0 allocated, but not the key which 
is used for PROT_EXEC emulation, which is still reserved.  Even if you 
change the key 0 behavior, it is still not possible to emulate mprotect 
behavior faithfully with an allocated key.


Thanks,
Florian


Re: [PATCH] x86, powerpc : pkey-mprotect must allow pkey-0

2018-03-09 Thread Florian Weimer

On 03/09/2018 09:12 AM, Ram Pai wrote:

Once an address range is associated with an allocated pkey, it cannot be
reverted back to key-0. There is no valid reason for the above behavior.


mprotect without a key does not necessarily use key 0, e.g. if 
protection keys are used to emulate a page protection flag combination 
which is not directly supported by the hardware.


Therefore, it seems to me that filtering out non-allocated keys is the 
right thing to do.


Thanks,
Florian


Re: Kernel 4.15 lost set_robust_list support on POWER 9

2018-02-05 Thread Florian Weimer

On 02/05/2018 01:48 PM, Florian Weimer wrote:

I get this:

7041  set_robust_list(0x7fff93dc3980, 24) = -1 ENOSYS (Function not 
implemented)


The system call works on 4.14.  Looks like the probing for 
futex_cmpxchg_enabled goes wrong.


Sorry, I have no idea where to start digging.  I don't see anything 
obvious in dmesg.


I'm trying to revert

commit 30d6e0a4190d37740e9447e4e4815f06992dd8c3
Author: Jiri Slaby <jsl...@suse.cz>
Date:   Thu Aug 24 09:31:05 2017 +0200

     futex: Remove duplicated code and fix undefined behaviour

to see if it makes a difference


Never mind, it must be something else because that commit is in 4.14, 
but set_robust_list is still working there.  (Let's hope that the 
probing doesn't fail randomly …)


Thanks,
Florian


Kernel 4.15 lost set_robust_list support on POWER 9

2018-02-05 Thread Florian Weimer

I get this:

7041  set_robust_list(0x7fff93dc3980, 24) = -1 ENOSYS (Function not 
implemented)


The system call works on 4.14.  Looks like the probing for 
futex_cmpxchg_enabled goes wrong.


Sorry, I have no idea where to start digging.  I don't see anything 
obvious in dmesg.


I'm trying to revert

commit 30d6e0a4190d37740e9447e4e4815f06992dd8c3
Author: Jiri Slaby 
Date:   Thu Aug 24 09:31:05 2017 +0200

futex: Remove duplicated code and fix undefined behaviour

to see if it makes a difference.

Thanks,
Florian


Re: POWER: Unexpected fault when writing to brk-allocated memory

2017-11-07 Thread Florian Weimer

On 11/08/2017 07:08 AM, Michael Ellerman wrote:

"Kirill A. Shutemov" <kir...@shutemov.name> writes:


On Tue, Nov 07, 2017 at 02:05:42PM +0100, Florian Weimer wrote:

On 11/07/2017 12:44 PM, Kirill A. Shutemov wrote:

On Tue, Nov 07, 2017 at 12:26:12PM +0100, Florian Weimer wrote:

On 11/07/2017 12:15 PM, Kirill A. Shutemov wrote:


First of all, using addr and MAP_FIXED to develop our heuristic can
never really give unchanged ABI. It's an in-band signal. brk() is a
good example that steadily keeps incrementing address, so depending
on malloc usage and address space randomization, you will get a brk()
that ends exactly at 128T, then the next one will be >
DEFAULT_MAP_WINDOW, and it will switch you to 56 bit address space.


No, it won't. You will hit stack first.


That's not actually true on POWER in some cases.  See the process maps I
posted here:

<https://marc.info/?l=linuxppc-embedded&m=150988538106263&w=2>


Hm? I see that in all three cases the [stack] is the last mapping.
Do I miss something?


Hah, I had not noticed.  Occasionally, the order of heap and stack is
reversed.  This happens in approximately 15% of the runs.


Heh. I guess ASLR on Power is too fancy :)


Fancy implies we're doing it on purpose :P


That's strange layout. It doesn't give that much (relatively speaking)
virtual address space for both stack and heap to grow.


I'm pretty sure it only happens when you're running an ELF interpreter
directly, because of Kees patch which changed the logic to load ELF
interpreters in the mmap region, vs PIE binaries which go to
ELF_ET_DYN_BASE. (eab09532d400 ("binfmt_elf: use ELF_ET_DYN_BASE only
for PIE"))


From that commit:

+   * There are effectively two types of ET_DYN
+   * binaries: programs (i.e. PIE: ET_DYN with INTERP)
+   * and loaders (ET_DYN without INTERP, since they
+   * _are_ the ELF interpreter). The loaders must

Note that the comment is a bit misleading: statically linked PIE 
binaries are ET_DYN without INTERP, too.


So any oddity which is observable today with an explicitly ld.so 
invocation only will gain more relevance once we get static PIE support 
in user space because it will then affect regular applications, too. 
(Well, statically linked ones.)  In this sense, process layouts which 
cause premature brk failure or insufficient stack allocations are real bugs.


Thanks,
Florian


Re: POWER: Unexpected fault when writing to brk-allocated memory

2017-11-07 Thread Florian Weimer

On 11/07/2017 12:44 PM, Kirill A. Shutemov wrote:

On Tue, Nov 07, 2017 at 12:26:12PM +0100, Florian Weimer wrote:

On 11/07/2017 12:15 PM, Kirill A. Shutemov wrote:


First of all, using addr and MAP_FIXED to develop our heuristic can
never really give unchanged ABI. It's an in-band signal. brk() is a
good example that steadily keeps incrementing address, so depending
on malloc usage and address space randomization, you will get a brk()
that ends exactly at 128T, then the next one will be >
DEFAULT_MAP_WINDOW, and it will switch you to 56 bit address space.


No, it won't. You will hit stack first.


That's not actually true on POWER in some cases.  See the process maps I
posted here:

   <https://marc.info/?l=linuxppc-embedded&m=150988538106263&w=2>


Hm? I see that in all three cases the [stack] is the last mapping.
Do I miss something?


Hah, I had not noticed.  Occasionally, the order of heap and stack is 
reversed.  This happens in approximately 15% of the runs.


See the attached example.

Thanks,
Florian
7fffacc5-7fffacc9 rw-p  00:00 0 
7fffacc9-7fffaccf r--p  fd:00 25167925   
/usr/lib/locale/en_US.utf8/LC_CTYPE
7fffaccf-7fffacd0 r--p  fd:00 25167928   
/usr/lib/locale/en_US.utf8/LC_NUMERIC
7fffacd0-7fffacd1 r--p  fd:00 16798929   
/usr/lib/locale/en_US.utf8/LC_TIME
7fffacd1-7ffface4 r--p  fd:00 25167924   
/usr/lib/locale/en_US.utf8/LC_COLLATE
7ffface4-7ffface5 r--p  fd:00 16798927   
/usr/lib/locale/en_US.utf8/LC_MONETARY
7ffface5-7ffface6 r--p  fd:00 2511   
/usr/lib/locale/en_US.utf8/LC_MESSAGES/SYS_LC_MESSAGES
7ffface6-7ffface7 r--p  fd:00 16798942   
/usr/lib/locale/en_US.utf8/LC_PAPER
7ffface7-7ffface8 r--p  fd:00 25167927   
/usr/lib/locale/en_US.utf8/LC_NAME
7ffface8-7ffface9 r--p  fd:00 16798924   
/usr/lib/locale/en_US.utf8/LC_ADDRESS
7ffface9-7fffacea r--p  fd:00 16798928   
/usr/lib/locale/en_US.utf8/LC_TELEPHONE
7fffacea-7fffaceb r--p  fd:00 16798926   
/usr/lib/locale/en_US.utf8/LC_MEASUREMENT
7fffaceb-7fffacec r--s  fd:00 8390657
/usr/lib64/gconv/gconv-modules.cache
7fffacec-7fffad0d r-xp  fd:00 8390335
/usr/lib64/libc-2.25.so
7fffad0d-7fffad0e ---p 0021 fd:00 8390335
/usr/lib64/libc-2.25.so
7fffad0e-7fffad0f r--p 0021 fd:00 8390335
/usr/lib64/libc-2.25.so
7fffad0f-7fffad10 rw-p 0022 fd:00 8390335
/usr/lib64/libc-2.25.so
7fffad10-7fffad11 r--p  fd:00 16798925   
/usr/lib/locale/en_US.utf8/LC_IDENTIFICATION
7fffad11-7fffad12 r-xp  fd:00 63543  
/usr/bin/cat
7fffad12-7fffad13 r--p  fd:00 63543  
/usr/bin/cat
7fffad13-7fffad14 rw-p 0001 fd:00 63543  
/usr/bin/cat
7fffad14-7fffad16 r-xp  00:00 0  [vdso]
7fffad16-7fffad1a r-xp  fd:00 8390328
/usr/lib64/ld-2.25.so
7fffad1a-7fffad1b r--p 0003 fd:00 8390328
/usr/lib64/ld-2.25.so
7fffad1b-7fffad1c rw-p 0004 fd:00 8390328
/usr/lib64/ld-2.25.so
7fffc2cf-7fffc2d2 rw-p  00:00 0  [stack]
7fffc8c1-7fffc8c4 rw-p  00:00 0  [heap]


Re: POWER: Unexpected fault when writing to brk-allocated memory

2017-11-07 Thread Florian Weimer

On 11/07/2017 12:15 PM, Kirill A. Shutemov wrote:


First of all, using addr and MAP_FIXED to develop our heuristic can
never really give unchanged ABI. It's an in-band signal. brk() is a
good example that steadily keeps incrementing address, so depending
on malloc usage and address space randomization, you will get a brk()
that ends exactly at 128T, then the next one will be >
DEFAULT_MAP_WINDOW, and it will switch you to 56 bit address space.


No, it won't. You will hit stack first.


That's not actually true on POWER in some cases.  See the process maps I 
posted here:

  <https://marc.info/?l=linuxppc-embedded&m=150988538106263&w=2>

Thanks,
Florian


Re: POWER: Unexpected fault when writing to brk-allocated memory

2017-11-07 Thread Florian Weimer

On 11/07/2017 06:07 AM, Nicholas Piggin wrote:


First of all, using addr and MAP_FIXED to develop our heuristic can
never really give unchanged ABI. It's an in-band signal. brk() is a
good example that steadily keeps incrementing address, so depending
on malloc usage and address space randomization, you will get a brk()
that ends exactly at 128T, then the next one will be >
DEFAULT_MAP_WINDOW, and it will switch you to 56 bit address space.


Note that this brk phenomenon is only a concern for some currently 
obscure process memory layouts where the heap ends up at the top of the 
address space.  Usually, there is something above it which eliminates 
the possibility that it can cross into the 128 TiB wilderness.  So the 
brk problem only happens on some architectures (e.g., not x86-64), and 
only with strange ways of running programs (explicitly ld.so invocation 
and likely static PIE, too).



So unless everyone else thinks I'm crazy and disagrees, I'd ask for
a bit more time to make sure we get this interface right. I would
hope for something like prctl PR_SET_MM which can be used to set
our user virtual address bits on a fine grained basis. Maybe a
sysctl, maybe a personality. Something out-of-band. I don't wan to
get too far into that discussion yet. First we need to agree whether
or not the code in the tree today is a problem.


There is certainly more demand for similar functionality, like creating 
mappings below 2 GB/4 GB/32 GB, and probably other bit patterns. 
Hotspot would use this to place the heap with compressed oops, instead 
of manually hunting for a suitable place for the mapping.  (Essentially, 
32-bit pointers on 64-bit architectures for sufficiently small heap 
sizes.)  It would perhaps be possible to use the hints address as a 
source of the bit count, for full flexibility.  And the mapping should 
be placed into the upper half of the selected window if possible.


MAP_FIXED is near-impossible to use correctly.  I hope you don't expect 
applications to do that.  If you want address-based opt in, it should 
work without MAP_FIXED.  Sure, in obscure cases, applications might 
still see out-of-range addresses, but I expected a full opt-out based on 
RLIMIT_AS would be sufficient for them.


Thanks,
Florian


Re: [PATCH v9 00/51] powerpc, mm: Memory Protection Keys

2017-11-06 Thread Florian Weimer
* Ram Pai:

> On Mon, Nov 06, 2017 at 10:28:41PM +0100, Florian Weimer wrote:
>> * Ram Pai:
>> 
>> > Testing:
>> > ---
>> > This patch series has passed all the protection key
>> > tests available in the selftest directory. The
>> > tests are updated to work on both x86 and powerpc.
>> > The selftests have passed on x86 and powerpc hardware.
>> 
>> How do you deal with the key reuse problem?  Is it the same as x86-64,
>> where it's quite easy to accidentally grant existing threads access to
>> a just-allocated key, either due to key reuse or a changed init_pkru
>> parameter?
>
> I am not sure how on x86-64, two threads get allocated the same key
> at the same time? the key allocation is guarded under the mmap_sem
> semaphore. So there cannot be a race where two threads get allocated
> the same key.

The problem is a pkey_alloc/pthread_create/pkey_free/pkey_alloc
sequence.  The pthread_create call makes the new thread inherit the
access rights of the current thread, but then the key is deallocated.
Reallocation of the same key will have that thread retain its access
rights, which is IMHO not correct.

> Can you point me to the issue, if it is already discussed somewhere?

See ‘MPK: pkey_free and key reuse’ on various lists (including
linux-mm and linux-arch).

It has a test case attached which demonstrates the behavior.

> As far as the semantics is concerned, a key allocated in one thread's
> context has no meaning if used in some other threads context within the
> same process.  The app should not try to re-use a key allocated in a
> thread's context in some other threads's context.

Uh-oh, that's not how this feature works on x86-64 at all.  There, the
keys are a process-global resource.  Treating them per-thread
seriously reduces their usefulness.

>> What about siglongjmp from a signal handler?
>
> On powerpc there is some relief.  the permissions on a key can be
> modified from anywhere, including from the signal handler, and the
> effect will be immediate.  You dont have to wait till the
> signal handler returns for the key permissions to be restore.

My concern is that the signal handler knows nothing about protection
keys, but the current x86-64 semantics will cause it to clobber the
access rights of the current thread.

> also after return from the sigsetjmp();
> possibly caused by siglongjmp(), the program can restore the permission
> on any key.

So that's not really an option.

> At least that is my theory. Can you give me a testcase, if you have one
> handy.

The glibc patch I posted under the ‘MPK: pkey_free and key reuse’
thread covers this, too.


Re: [PATCH v9 00/51] powerpc, mm: Memory Protection Keys

2017-11-06 Thread Florian Weimer
* Ram Pai:

> Testing:
> ---
> This patch series has passed all the protection key
> tests available in the selftest directory. The
> tests are updated to work on both x86 and powerpc.
> The selftests have passed on x86 and powerpc hardware.

How do you deal with the key reuse problem?  Is it the same as x86-64,
where it's quite easy to accidentally grant existing threads access to
a just-allocated key, either due to key reuse or a changed init_pkru
parameter?

What about siglongjmp from a signal handler?

I wonder if it's possible to fix some of these things before the exact
semantics of these interfaces are set in stone.


Re: [PATCH 0/5] VA allocator fixes

2017-11-06 Thread Florian Weimer

On 11/06/2017 11:03 AM, Nicholas Piggin wrote:

Florian found a nasty corner case with the VA allocation logic
for crossing from 128TB to 512TB limit on hash, and made a
really superb report of the problem -- traces, reproducer recipes,
analysis, etc. which already mostly solved it.

The first patch in the series should solve Florian's particular
case, the next 3 are other issues with addr_limit. The last
patch is technically a cleanup but I think it's fairly important
in terms of understanding the code and also enabling some BUG
checks (when addr_limit == 0).

I have not tested these exactly on Florian's test case, but
some tests of my own behave better afterwards. Hopefully he has
time to re-test. Some careful review would be welcome too.


I think I have applied the five patches you posted, but I still get a 
brk value above 128 TiB:


# /lib64/ld64.so.1 ./a.out
initial brk value: 0x7fffde96
probing at 0x8001fffc

I assumed you wanted to reject those?

In either case, I recommend to tweak the VM layout, so that ld.so does 
not land closely to to the 128 TiB limit, so that the brk failure or 
returning of 48-bit addresses is avoided.


Thanks,
Florian


Re: POWER: Unexpected fault when writing to brk-allocated memory

2017-11-06 Thread Florian Weimer

On 11/06/2017 09:30 AM, Aneesh Kumar K.V wrote:

On 11/06/2017 01:55 PM, Nicholas Piggin wrote:

On Mon, 6 Nov 2017 09:11:37 +0100
Florian Weimer <fwei...@redhat.com> wrote:


On 11/06/2017 07:47 AM, Nicholas Piggin wrote:

"You get < 128TB unless explicitly requested."

Simple, reasonable, obvious rule. Avoids breaking apps that store
some bits in the top of pointers (provided that memory allocator
userspace libraries also do the right thing).


So brk would simply fail instead of crossing the 128 TiB threshold?


Yes, that was the intention and that's what x86 seems to do.



glibc malloc should cope with that and switch to malloc, but this code
path is obviously less well-tested than the regular way.


Switch to mmap() I guess you meant?


Yes, sorry.


powerpc has a couple of bugs in corner cases, so those should be fixed
according to intended policy for stable kernels I think.

But I question the policy. Just seems like an ugly and ineffective wart.
Exactly for such cases as this -- behaviour would change from run to run
depending on your address space randomization for example! In case your
brk happens to land nicely on 128TB then the next one would succeed.


Why ? It should not change between run to run. We limit the free
area search range based on hint address. So we should get consistent 
results across runs, even if we changed the context.addr_limit.


The size of the gap to the 128 TiB limit varies between runs because of 
ASLR.  So some runs would use brk alone, others would use brk + malloc. 
That's not really desirable IMHO.


Thanks,
Florian


Re: POWER: Unexpected fault when writing to brk-allocated memory

2017-11-06 Thread Florian Weimer

On 11/06/2017 07:47 AM, Nicholas Piggin wrote:

"You get < 128TB unless explicitly requested."

Simple, reasonable, obvious rule. Avoids breaking apps that store
some bits in the top of pointers (provided that memory allocator
userspace libraries also do the right thing).


So brk would simply fail instead of crossing the 128 TiB threshold?

glibc malloc should cope with that and switch to malloc, but this code 
path is obviously less well-tested than the regular way.


Thanks,
Florian


Re: POWER: Unexpected fault when writing to brk-allocated memory

2017-11-06 Thread Florian Weimer

On 11/06/2017 07:18 AM, Aneesh Kumar K.V wrote:

We should not return that address, unless we requested with a hint value
of > 128TB. IIRC we discussed this early during the mmap interface
change and said, we will return an address > 128T only if the hint
address is above 128TB (not hint addr + length). I am not sure why
we are finding ourselves returning an address > 128TB with paca limit set to
128TB?


See the memory maps I posted.  I think it was not anticipated that the 
heap could be near the 128 TiB limit because it is placed next to the 
initially mapped object.


I think this could become worse once we have static PIE support because 
static PIE binaries likely have the same memory layout.  (Ordinary PIE 
does not.)


Thanks,
Florian


Re: POWER: Unexpected fault when writing to brk-allocated memory

2017-11-05 Thread Florian Weimer

On 11/05/2017 01:18 PM, Nicholas Piggin wrote:

Something like the following patch may help if you could test.


The patch appears to fix it:

# /lib64/ld64.so.1 ./a.out
initial brk value: 0x7fffe459
probing at 0x8001fffc

I used the follow simplified reproducer:

#include <err.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int
main (void)
{
  errno = 0;
  void *p = sbrk (0);
  if (errno != 0)
err (1, "sbrk (0)");
  printf ("initial brk value: %p\n", p);
  unsigned long long target = 0x800200000000ULL;
  if ((uintptr_t) p >= target)
errx (1, "initial brk value is already above target");
  unsigned long long increment = target - (uintptr_t) p;
  errno = 0;
  sbrk (increment);
  if (errno != 0)
err (1, "sbrk (0x%llx)", increment);
  volatile int *pi = (volatile int *) (target - 4);
  printf ("probing at %p\n", pi);
  *pi = 1;
}


It is still probabilistic because if the increment is too large, the 
second sbrk call will fail with an out of memory error (which is 
expected), so you'll have to run it a couple of times.


If the test fails, the write at the end will segfault.

Thanks,
Florian


Re: POWER: Unexpected fault when writing to brk-allocated memory

2017-11-05 Thread Florian Weimer

On 11/05/2017 01:18 PM, Nicholas Piggin wrote:


There was a recent change to move to 128TB address space by default,
and option for 512TB addresses if explicitly requested.


Do you have a commit hash for the introduction of 128TB by default?  Thanks.


Your brk request asked for > 128TB which the kernel gave it, but the
address limit in the paca that the SLB miss tests against was not
updated to reflect the switch to 512TB address space.

Why is your brk starting so high? Are you trying to test the > 128TB
case, or maybe something is confused by the 64->128TB change? What's
the strace look like if you run on a distro or <= 4.10 kernel?


I think it is a consequence of running with an explicit loader 
invocation.  With that, the heap is placed above ld.so, which can be 
quite high in the address space.


I'm attaching two runs of cat, one executing directly as /bin/cat, and 
one with /lib64/ld64.so.1 /bin/cat.


Fortunately, this does *not* apply to PIE binaries (also attached). 
However, explicit loader invocations are sometimes used in test suites 
(not just for glibc), and these sporadic test failures are quite annoying.


Do you still need the strace log?  And if yes, of what exactly?


Something like the following patch may help if you could test.


Okay, this will take some time.

Thanks,
Florian
1231d-1231e r-xp  fd:00 17852425 
/root/a.out
1231e-1231f r--p  fd:00 17852425 
/root/a.out
1231f-12320 rw-p 0001 fd:00 17852425 
/root/a.out
1000dbc-1000dbf rw-p  00:00 0[heap]
7fffa31d-7fffa340 r-xp  fd:00 25167936   
/usr/lib64/power8/libc-2.25.so
7fffa340-7fffa341 ---p 0023 fd:00 25167936   
/usr/lib64/power8/libc-2.25.so
7fffa341-7fffa342 r--p 0023 fd:00 25167936   
/usr/lib64/power8/libc-2.25.so
7fffa342-7fffa343 rw-p 0024 fd:00 25167936   
/usr/lib64/power8/libc-2.25.so
7fffa344-7fffa346 r-xp  00:00 0  [vdso]
7fffa346-7fffa34a r-xp  fd:00 8390329
/usr/lib64/ld-2.25.so
7fffa34a-7fffa34b r--p 0003 fd:00 8390329
/usr/lib64/ld-2.25.so
7fffa34b-7fffa34c rw-p 0004 fd:00 8390329
/usr/lib64/ld-2.25.so
7fffe945-7fffe948 rw-p  00:00 0  [stack]
7fff7e79-7fff7e7d rw-p  00:00 0 
7fff7e7d-7fff7e83 r--p  fd:00 25167925   
/usr/lib/locale/en_US.utf8/LC_CTYPE
7fff7e83-7fff7e84 r--p  fd:00 25167928   
/usr/lib/locale/en_US.utf8/LC_NUMERIC
7fff7e84-7fff7e85 r--p  fd:00 16798929   
/usr/lib/locale/en_US.utf8/LC_TIME
7fff7e85-7fff7e98 r--p  fd:00 25167924   
/usr/lib/locale/en_US.utf8/LC_COLLATE
7fff7e98-7fff7e99 r--p  fd:00 16798927   
/usr/lib/locale/en_US.utf8/LC_MONETARY
7fff7e99-7fff7e9a r--p  fd:00 2511   
/usr/lib/locale/en_US.utf8/LC_MESSAGES/SYS_LC_MESSAGES
7fff7e9a-7fff7e9b r--p  fd:00 16798942   
/usr/lib/locale/en_US.utf8/LC_PAPER
7fff7e9b-7fff7e9c r--p  fd:00 25167927   
/usr/lib/locale/en_US.utf8/LC_NAME
7fff7e9c-7fff7e9d r--p  fd:00 16798924   
/usr/lib/locale/en_US.utf8/LC_ADDRESS
7fff7e9d-7fff7e9e r--p  fd:00 16798928   
/usr/lib/locale/en_US.utf8/LC_TELEPHONE
7fff7e9e-7fff7e9f r--p  fd:00 16798926   
/usr/lib/locale/en_US.utf8/LC_MEASUREMENT
7fff7e9f-7fff7ea0 r--s  fd:00 8390669
/usr/lib64/gconv/gconv-modules.cache
7fff7ea0-7fff7ec3 r-xp  fd:00 25167936   
/usr/lib64/power8/libc-2.25.so
7fff7ec3-7fff7ec4 ---p 0023 fd:00 25167936   
/usr/lib64/power8/libc-2.25.so
7fff7ec4-7fff7ec5 r--p 0023 fd:00 25167936   
/usr/lib64/power8/libc-2.25.so
7fff7ec5-7fff7ec6 rw-p 0024 fd:00 25167936   
/usr/lib64/power8/libc-2.25.so
7fff7ec6-7fff7ec7 r--p  fd:00 16798925   
/usr/lib/locale/en_US.utf8/LC_IDENTIFICATION
7fff7ec7-7fff7ec8 r-xp  fd:00 202293 
/usr/bin/cat
7fff7ec8-7fff7ec9 r--p  fd:00 202293 
/usr/bin/cat
7fff7ec9-7fff7eca rw-p 0001 fd:00 202293 
/usr/bin/cat
7fff7eca-7fff7ecc r-xp  00:00 0  [vdso]
7fff7ecc-7fff7ed0 r-xp  fd:00 8390329
/usr/lib64/ld-2.25.so
7fff7ed0-7fff7ed1 r--p 0003 fd:00 8390329

POWER: Unexpected fault when writing to brk-allocated memory

2017-11-03 Thread Florian Weimer
We are seeing an issue on ppc64le and ppc64 (and perhaps on some arm 
variant, but I have not seen it on our own builders) where running 
localedef as part of the glibc build crashes with a segmentation fault.


Kernel version is 4.13.9 (Fedora 26 variant).

I have only seen this with an explicit loader invocation, like this:

while I18NPATH=. /lib64/ld64.so.1 /usr/bin/localedef 
--alias-file=../intl/locale.alias --no-archive -i locales/nl_AW -c -f 
charmaps/UTF-8 
--prefix=/builddir/build/BUILDROOT/glibc-2.26-16.fc27.ppc64 nl_AW ; do : 
; done


To be run in the localedata subdirectory of a glibc *source* tree, after 
a build.  You may have to create the 
/builddir/build/BUILDROOT/glibc-2.26-16.fc27.ppc64/usr/lib/locale 
directory.  I have only reproduced this inside a Fedora 27 chroot on a 
Fedora 26 host, but there it does not matter if you run the old (chroot) 
or newly built binary.


I filed this as a glibc bug for tracking:

  https://sourceware.org/bugzilla/show_bug.cgi?id=22390

There's an strace log and a coredump from the crash.

I think the data shows that the address in question should be writable.

The crossed 0x800000000000 boundary is very suggestive.  I think that 
based on the operation of glibc's malloc, this write would be the first 
time this happens during the lifetime of the process.


Does that ring any bells?  Is there anything I can do to provide more 
data?  The host is an LPAR with a stock Fedora 26 kernel, so I can use 
any diagnostics tool which is provided by Fedora.


I can try to come up with a better reproducer, but that appears to be 
difficult.


Thanks,
Florian


Re: [PATCH 3/4] powerpc/powernv: Enable TM without suspend if possible

2017-10-19 Thread Florian Weimer

On 10/19/2017 02:04 PM, Tulio Magno Quites Machado Filho wrote:

Florian Weimer <fwei...@redhat.com> writes:


On 10/12/2017 12:17 PM, Michael Ellerman wrote:

+   pr_info("Enabling TM (Transactional Memory) with Suspend Disabled\n");
+   cur_cpu_spec->cpu_features |= CPU_FTR_TM;
+   cur_cpu_spec->cpu_user_features2 |= PPC_FEATURE2_HTM_NO_SUSPEND;
+   tm_suspend_disabled = true;


This doesn't look right because if suspend is not available, you need to
clear the original PPC_FEATURE2_HTM flag because the semantics are not
right, so that applications can use fallback code.  Otherwise,
applications may incorrectly select the HTM code and break if running on
a system which supports HTM, but without the suspend state.


Just clarifying: it won't break an application, but abort the transaction,
which can cause a performance hit.


And in this case, it is better not to use HTM at all.


The new flag should say that HTM is supported, but without the suspend
state, and it should be always set if PPC_FEATURE2_HTM is set.


If we change the semantics of this bit, old applications that don't suspend
transactions won't run on these processors, even if they're safe to run.

If we adopt the current semantics, only applications that enter in suspend
state will have to be modified in order to not get a performance regression.


In this light, the trade-off implied by the posted patch seems to be the 
right one.  But the other discussion of an ABI change still makes me a 
bit nervous.


Thanks,
Florian

