------- Comment From s...@de.ibm.com 2020-07-23 11:06 EDT-------
Hi,

I was able to reproduce the "make: echo: Operation not permitted" error on my
Ubuntu 20.04 s390x machine.
I built and installed the mentioned make-dfsg_4.3-4ubuntu1 package without the
"--disable-posix-spawn" configure flag.
I then built flatpak-builder_1.0.11-1, which runs the test that triggers the
"Operation not permitted" error.

Then I adjusted the tests so that I can also run them without building the
package itself.
The test runs flatpak-builder, which prepares the environment (e.g. a root
directory with all needed files / binaries / libraries).
flatpak-builder then creates a container with bwrap and calls a configure
script, which generates a Makefile.
In a second invocation, make is run.

I adjusted the configure script so that it now executes a small program of my
own.
This program first waits for some time, which I use to determine its PID so
that I can attach either strace or gdb.
After the timeout, the program simply execve's make. Thus in the end I have a
process chain like:
flatpak-builder--bwrap---bwrap---configure---make

The strace output shows that the clone syscall fails with EPERM:
4269 17:08:47.914142 stat("/usr/bin/echo", {st_mode=S_IFREG|0755, 
st_size=39136, ...}) = 0 <0.000003>
4270 17:08:47.914155 geteuid()               = 1001 <0.000001>
4271 17:08:47.914167 getegid()               = 1001 <0.000002>
4272 17:08:47.914175 getuid()                = 1001 <0.000001>
4273 17:08:47.914182 getgid()                = 1001 <0.000001>
4274 17:08:47.914189 access("/usr/bin/echo", X_OK) = 0 <0.000005>
4275 17:08:47.914203 mmap(NULL, 36864, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x3ff9c86b000 <0.000002>
4276 17:08:47.914214 rt_sigprocmask(SIG_BLOCK, ~[], [HUP INT QUIT TERM CHLD 
XCPU XFSZ], 8) = 0 <0.000001>
4277 17:08:47.914224 clone(child_stack=0x3ff9c874000, 
flags=CLONE_VM|CLONE_VFORK|SIGCHLD) = -1 EPERM (Operation not permitted)  
<0.000001>
4278 17:08:47.914235 munmap(0x3ff9c86b000, 36864) = 0 <0.000004>
4279 17:08:47.914245 rt_sigprocmask(SIG_SETMASK, [HUP INT QUIT TERM CHLD XCPU 
XFSZ], NULL, 8) = 0 <0.000001>

A gdb session showed that make calls posix_spawn as follows (note: make uses
vfork() instead if configured with "--disable-posix-spawn"):
jobs.c:child_execute_job (struct childbase *child, int good_stdin, char **argv)
posix_spawnattr_t attr;
posix_spawn_file_actions_t fa;
short flags = 0;
posix_spawnattr_init (&attr)
posix_spawn_file_actions_init (&fa)
flags |= POSIX_SPAWN_SETSIGMASK; => 0x08
flags |= POSIX_SPAWN_USEVFORK; => 0x40
fdin=0, fdout=1, fderr=2
flags |= POSIX_SPAWN_RESETIDS; => 0x01
=> flags = 0x49
posix_spawnattr_setflags (&attr, flags)
/* Start the program.  */
while ((r = posix_spawn (&pid, cmd, &fa, &attr, argv,
child->environment)) == EINTR)
;

In glibc, posix_spawn does the following:
posix_spawn(...) -> __spawni(..., 0) -> __spawnix(..., __execve)
void *stack = __mmap (NULL, stack_size, prot, MAP_PRIVATE | MAP_ANONYMOUS | 
MAP_STACK, -1, 0);
/* Disable asynchronous cancellation.  */
__libc_signal_block_all (&args.oldmask);
# define CLONE(__fn, __stack, __stacksize, __flags, __args) \
__clone (__fn, __stack, __flags, __args)
new_pid = CLONE (__spawni_child, STACK (stack, stack_size), stack_size,
CLONE_VM | CLONE_VFORK | SIGCHLD, &args);
=> __clone (__spawni_child, __stack, CLONE_VM | CLONE_VFORK | SIGCHLD, &args);
<glibc-src>/sysdeps/unix/sysv/linux/s390/s390-64/clone.S
(gdb) i r r2 r3 r4 r5 r6
r2             0x3ffb22f53c0       4396740989888
r3             0x3ffb24f4000       4396743081984
r4             0x4111              16657
r5             0x3ffce97c9e0       4397217597920
r6             0xffffffffffffffff  18446744073709551615
  >0x3ffb2306760 <clone>           stg     %r6,48(%r15)
   0x3ffb2306766 <clone+6>         lgr     %r0,%r5
   0x3ffb230676a <clone+10>        ltgr    %r1,%r2
   0x3ffb230676e <clone+14>        je      0x3ffb23067a6 <clone+70>
   0x3ffb2306772 <clone+18>        ltgr    %r2,%r3
   0x3ffb2306776 <clone+22>        je      0x3ffb23067a6 <clone+70>
   0x3ffb230677a <clone+26>        lgr     %r3,%r4
   0x3ffb230677e <clone+30>        lgr     %r4,%r6
   0x3ffb2306782 <clone+34>        lg      %r5,168(%r15)
   0x3ffb2306788 <clone+40>        lg      %r6,160(%r15)
(gdb) i r r1 r2 r3 r4 r5 r6
r1             0x3ffb22f53c0       4396740989888
r2             0x3ffb24f4000       4396743081984
r3             0x4111              16657
# define CLONE_VM      0x00000100 /* Set if VM shared between processes.  */
# define CLONE_VFORK   0x00004000 /* Set if the parent wants the child to wake 
it up on mm_release.  */
<glibc-src>/sysdeps/unix/sysv/linux/bits/signum.h:41:#define SIGCHLD 17 => 0x11
r4             0xffffffffffffffff  18446744073709551615
r5             0x3ffce97c960       4397217597792
r6             0x0                 0
/* sys_clone  (void *child_stack, unsigned long flags, pid_t *parent_tid, pid_t 
*child_tid, void *tls);  */
   0x3ffb230678e <clone+46>        svc     120
=> sys_clone is returning EPERM instead of succeeding and jumping to 
__spawni_child().

At this point, the make process has the following open files:
find /proc/273963/fdinfo  -type f -printf "\ncat %p\n" -exec cat {} \;

cat /proc/273963/fdinfo/0
pos:    0
flags:  0100000
mnt_id: 25

cat /proc/273963/fdinfo/1
pos:    0
flags:  02001
mnt_id: 14

cat /proc/273963/fdinfo/2
pos:    0
flags:  02001
mnt_id: 14

ls -la /proc/273963/fd/*
lr-x------ 1 stli stli 64 Jul 23 10:31 /proc/273963/fd/0 -> /dev/null
l-wx------ 1 stli stli 64 Jul 23 10:35 /proc/273963/fd/1 -> 'pipe:[661251]'
l-wx------ 1 stli stli 64 Jul 23 10:35 /proc/273963/fd/2 -> 'pipe:[661251]'

A colleague of mine pointed out that he had seen a similar issue with podman
containers where a seccomp filter was applied.
I therefore used https://github.com/david942j/seccomp-tools with a private
patch from my colleague that adds s390x support.
And indeed, a seccomp filter is applied to the second bwrap process and its
children:
line  CODE  JT   JF      K
=================================
0000: 0x20 0x00 0x00 0x00000004  A = arch
0001: 0x15 0x00 0x1f 0x80000016  if (A != ARCH_S390X) goto 0033
0002: 0x20 0x00 0x00 0x00000000  A = sys_number
0003: 0x15 0x1c 0x00 0x00000015  if (A == mount) goto 0032
0004: 0x15 0x1b 0x00 0x00000033  if (A == acct) goto 0032
0005: 0x15 0x1a 0x00 0x00000056  if (A == uselib) goto 0032
0006: 0x15 0x19 0x00 0x00000067  if (A == syslog) goto 0032
0007: 0x15 0x18 0x00 0x00000083  if (A == quotactl) goto 0032
0008: 0x15 0x17 0x00 0x000000d9  if (A == pivot_root) goto 0032
0009: 0x15 0x16 0x00 0x0000010c  if (A == mbind) goto 0032
0010: 0x15 0x15 0x00 0x0000010d  if (A == get_mempolicy) goto 0032
0011: 0x15 0x14 0x00 0x0000010e  if (A == set_mempolicy) goto 0032
0012: 0x15 0x13 0x00 0x00000116  if (A == add_key) goto 0032
0013: 0x15 0x12 0x00 0x00000117  if (A == request_key) goto 0032
0014: 0x15 0x11 0x00 0x00000118  if (A == keyctl) goto 0032
0015: 0x15 0x10 0x00 0x0000011f  if (A == migrate_pages) goto 0032
0016: 0x15 0x0f 0x00 0x0000012f  if (A == unshare) goto 0032
0017: 0x15 0x0e 0x00 0x00000136  if (A == move_pages) goto 0032
0018: 0x15 0x00 0x05 0x00000036  if (A != ioctl) goto 0024
# => for clone, we goto 0024
0019: 0x20 0x00 0x00 0x00000018  A = cmd # ioctl(fd, cmd, arg)
0020: 0x54 0x00 0x00 0x00000000  A &= 0x0
0021: 0x15 0x00 0x09 0x00000000  if (A != 0) goto 0031
0022: 0x20 0x00 0x00 0x0000001c  A = cmd >> 32 # ioctl(fd, cmd, arg)
0023: 0x15 0x08 0x07 0x00005412  if (A == 0x5412) goto 0032 else goto 0031
0024: 0x15 0x00 0x06 0x00000078  if (A != clone) goto 0031
# => all other syscalls are allowed, but clone is handled here
0025: 0x20 0x00 0x00 0x00000010  A = clone_flags # clone(clone_flags, newsp, 
parent_tidptr, child_tidptr, tls)
0026: 0x54 0x00 0x00 0x00000000  A &= 0x0
0027: 0x15 0x00 0x03 0x00000000  if (A != 0) goto 0031
# => the previous check seems to be a nop
0028: 0x20 0x00 0x00 0x00000014  A = clone_flags >> 32 # clone(clone_flags, 
newsp, parent_tidptr, child_tidptr, tls)
# => The flags of clone are checked:
0029: 0x54 0x00 0x00 0x10000000  A &= 0x10000000
# define CLONE_NEWUSER  0x10000000 = 268435456  /* New user namespace.  */
0030: 0x15 0x01 0x00 0x10000000  if (A == 268435456) goto 0032
# => ERRNO(1) which is EPERM
0031: 0x06 0x00 0x00 0x7fff0000  return ALLOW
0032: 0x06 0x00 0x00 0x00050001  return ERRNO(1)
0033: 0x06 0x00 0x00 0x00000000  return KILL

Unfortunately, the order of the arguments of the clone syscall on s390x differs
from the order on x86_64!
=> The filter checks the first argument, which on s390x is the stack pointer
instead of the flags.
Note:
The order of the arguments and their names are hardcoded in the seccomp-tools
disassembler.
The seccomp filter itself uses the argument index.
The ">> 32" is likewise hardcoded seccomp-tools output, which depends on
whether the index of the argument is even or odd.

I saw that bwrap can apply a seccomp filter:
bubblewrap.c:do_init():
if (seccomp_prog != NULL &&
prctl (PR_SET_SECCOMP, SECCOMP_MODE_FILTER, seccomp_prog) != 0)
die_with_error ("prctl(PR_SET_SECCOMP)");

This is executed if bwrap is called with "--seccomp FD" (load and use seccomp
rules from FD).
I also dumped /proc/PID/cmdline for the processes:
flatpak-builder\0-v\0--repo=/path/to/workdir\0--force-clean\0appdir\0test.json\0
bwrap\0--args\012\0./configure\0--prefix=/app\0--some-arg\0
bwrap\0--args\012\0./configure\0--prefix=/app\0--some-arg\0
/bin/sh\0./configure\0--prefix=/app\0--some-arg\0

Thus I suppose bwrap is not adding this seccomp filter.
I also had a look at /proc/<PID of configure>/cgroup:
12:cpuset:/
11:perf_event:/
10:devices:/user.slice
9:rdma:/
8:pids:/user.slice/user-1001.slice/user@1001.service
7:memory:/user.slice/user-1001.slice/user@1001.service
6:hugetlb:/
5:net_cls,net_prio:/
4:blkio:/user.slice
3:cpu,cpuacct:/user.slice
2:freezer:/
1:name=systemd:/user.slice/user-1001.slice/user@1001.service/flatpak-org.test.Hello2-14224.scope
0::/user.slice/user-1001.slice/user@1001.service/flatpak-org.test.Hello2-14224.scope

It could be that systemd is applying the seccomp-filter, but I don't know how.
Can anybody help?

As a test, the seccomp filter could be adjusted to check the second argument
of the clone syscall.
Of course, for a real patch, the index has to be determined depending on the
current architecture.

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1886814

Title:
  posix_spawn usage in gnu make causes failures on s390x
