Re: KASAN: use-after-free Read in userfaultfd_release (2)

2020-07-20 Thread Daniel Colascione

On 7/20/20 9:00 AM, Al Viro wrote:

On Mon, Jul 13, 2020 at 04:45:12PM +0800, Hillf Danton wrote:


Bridge the gap between slab free and the fput in task work wrt
file's private data.


No.  This


@@ -2048,6 +2055,7 @@ SYSCALL_DEFINE1(userfaultfd, int, flags)

	fd = get_unused_fd_flags(O_RDONLY | O_CLOEXEC);
	if (fd < 0) {
+		file->private_data = NULL;
		fput(file);
		goto out;
	}



is fundamentally wrong; you really shouldn't take over the cleanups
if you ever do fput().


Yep. I don't recall how the O_CLOEXEC got in there: that's indeed wrong, 
and probably the result of patch-editing butchery. As for the exit 
cleanup: yes, that's a bug. I was trying to keep the exit paths 
together. We could fix it forward (which seems simple enough) or re-submit.
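
For the record, the ownership rule Al is stating looks like this in a
minimal sketch (the example_* names are hypothetical, not the
userfaultfd code): once fput() runs, ->release() owns the teardown of
file->private_data, so an error path should only drop its reference.

/* Hypothetical driver, for illustration only. */
static int example_release(struct inode *inode, struct file *file)
{
	struct example_ctx *ctx = file->private_data;

	example_ctx_put(ctx);	/* teardown happens here, exactly once */
	return 0;
}

static int example_install_fd(struct file *file)
{
	int fd = get_unused_fd_flags(O_CLOEXEC);

	if (fd < 0) {
		fput(file);	/* triggers ->release(); don't also poke
				 * file->private_data from this path */
		return fd;
	}
	fd_install(fd, file);
	return fd;
}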


Re: [PATCH 3/7] Add a UFFD_SECURE flag to the userfaultfd API.

2019-10-22 Thread Daniel Colascione
On Sat, Oct 12, 2019 at 6:14 PM Andy Lutomirski  wrote:
> [adding more people because this is going to be an ABI break, sigh]

Just pinging this thread. I'd like to rev my patch and I'm not sure
what we want to do about problem Andy identified. Are we removing
UFFD_EVENT_FORK?


Re: [PATCH 1/7] Add a new flags-accepting interface for anonymous inodes

2019-10-14 Thread Daniel Colascione
Thanks for taking a look.

On Mon, Oct 14, 2019 at 8:39 AM Jann Horn  wrote:
>
> On Sat, Oct 12, 2019 at 9:16 PM Daniel Colascione  wrote:
> > Add functions forwarding from the old names to the new ones so we
> > don't need to change any callers.
>
> This patch does more than the commit message says; it also refactors
> the body of the function. (I would've moved that refactoring over into
> patch 2, but I guess this works, too.)
>
> [...]
> > -struct file *anon_inode_getfile(const char *name,
> > -   const struct file_operations *fops,
> > -   void *priv, int flags)
> > +struct file *anon_inode_getfile2(const char *name,
> > +const struct file_operations *fops,
> > +void *priv, int flags, int anon_inode_flags)
>
> (AFAIK, normal kernel style is to slap a "__" prefix in front of the
> function name instead of appending a digit, but I guess it doesn't
> really matter.)

I thought prefixing "_" was for signaling "this is an implementation
detail and you probably don't want to call it unless you know what
you're doing", not "here's a new version that does a new thing".

> >  {
> > +   struct inode *inode;
> > struct file *file;
> >
> > -   if (IS_ERR(anon_inode_inode))
> > -   return ERR_PTR(-ENODEV);
> > -
> > -   if (fops->owner && !try_module_get(fops->owner))
> > -   return ERR_PTR(-ENOENT);
> > +   if (anon_inode_flags)
> > +   return ERR_PTR(-EINVAL);
> >
> > +   inode = anon_inode_inode;
> > +   if (IS_ERR(inode))
> > +   return ERR_PTR(-ENODEV);
> > /*
> > -* We know the anon_inode inode count is always greater than zero,
> > -* so ihold() is safe.
> > +* We know the anon_inode inode count is always
> > +* greater than zero, so ihold() is safe.
> >  */
>
> This looks like maybe you started editing the comment, then un-did the
> change, but left the modified line wrapping in your patch? Please
> avoid that - code changes with no real reason make "git blame" output
> more annoying and create trouble when porting patches between kernel
> versions.

I'll fix it.

>
> [...]
> >  EXPORT_SYMBOL_GPL(anon_inode_getfile);
> > +EXPORT_SYMBOL_GPL(anon_inode_getfile2);
> [...]
> >  EXPORT_SYMBOL_GPL(anon_inode_getfd);
> > +EXPORT_SYMBOL_GPL(anon_inode_getfd2);
>
> Since anon_inode_getfd() is now a static inline function in
> include/linux/anon_inodes.h, exporting it doesn't make sense anymore.
> Same for anon_inode_getfile().

I didn't want to break modules unnecessarily. Declaring the function
inline and also exporting it gives us an efficiency win while avoiding
an ABI break, right?

> [...]
> > +#define ANON_INODE_SECURE 1
>
> That #define belongs in a later patch, right?

Yep.


Re: [PATCH 3/7] Add a UFFD_SECURE flag to the userfaultfd API.

2019-10-12 Thread Daniel Colascione
On Sat, Oct 12, 2019 at 6:14 PM Andy Lutomirski  wrote:
>
> [...]
>
> > But maybe we can go further: let's separate authentication and
> > authorization, as we do in other LSM hooks. Let's split my
> > inode_init_security_anon into two hooks, inode_init_security_anon and
> > inode_create_anon. We'd define the former to just initialize the file
> > object's security information --- in the SELinux case, figuring out
> > its class and SID --- and define the latter to answer the yes/no
> > question of whether a particular anonymous inode creation should be
> > allowed. Normally, anon_inode_getfile2() would just call both hooks.
> > We'd add another anon_inode_getfd flag, ANON_INODE_SKIP_AUTHORIZATION
> > or something, that would tell anon_inode_getfile2() to skip calling
> > the authorization hook, effectively making the creation always
> > succeed. We can then make the UFFD code pass
> > ANON_INODE_SKIP_AUTHORIZATION when it's creating a file object in the
> > fork child while creating UFFD_EVENT_FORK messages.
>
> That sounds like an improvement.  Or maybe just teach SELinux that
> this particular fd creation is actually making an anon_inode that is a
> child of an existing anon inode and that the context should be copied
> or whatever SELinux wants to do.  Like this, maybe:
>
> static int resolve_userfault_fork(struct userfaultfd_ctx *ctx,
>   struct userfaultfd_ctx *new,
>   struct uffd_msg *msg)
> {
> int fd;
>
> Change this:
>
> fd = anon_inode_getfd("[userfaultfd]", &userfaultfd_fops, new,
>   O_RDWR | (new->flags & UFFD_SHARED_FCNTL_FLAGS));
>
> to something like:
>
>   fd = anon_inode_make_child_fd(..., ctx->inode, ...);
>
> where ctx->inode is the one context's inode.

Yeah. I figured we could just add a special-purpose hook for this
case. Having a special hook for this one case feels ugly though, and
at copy_mm time, we don't have a PID for the new child yet --- I don't
know whether LSMs would care about that. But maybe this is one of
those "doctor, it hurts when I do this!" situations and this child
process difficulty is just a hint that some other design might work
better.

> Now that you've pointed this mechanism out, it is utterly and
> completely broken and should be removed from the kernel outright or at
> least severely restricted.  A .read implementation MUST NOT ACT ON THE
> CALLING TASK.  Ever.  Just imagine the effect of passing a userfaultfd
> as stdin to a setuid program.
>
> So I think the right solution might be to attempt to *remove*
> UFFD_EVENT_FORK.  Maybe the solution is to say that, unless the
> creator of a userfaultfd() has global CAP_SYS_ADMIN, then it cannot
> use UFFD_FEATURE_EVENT_FORK) and print a warning (once) when
> UFFD_FEATURE_EVENT_FORK is allowed.  And, after some suitable
> deprecation period, just remove it.  If it's genuinely useful, it
> needs an entirely new API based on ioctl() or a syscall.  Or even
> recvmsg() :)

IMHO, userfaultfd should have been a datagram socket from the start.
As you point out, it's a good fit for the UFFD protocol, which
involves FD passing and a fixed message size.
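
For comparison, here is a userspace sketch of what receiving one
fixed-size event plus its descriptor would look like over such a
socket (the function name and message layout are illustrative, not a
proposed ABI):

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Receive one datagram and, if present, the SCM_RIGHTS fd riding
 * along with it.  Returns the fd or -1. */
static int recv_event_fd(int sock, void *msgbuf, size_t msglen)
{
	char cbuf[CMSG_SPACE(sizeof(int))];
	struct iovec iov = { .iov_base = msgbuf, .iov_len = msglen };
	struct msghdr mh = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = cbuf,
		.msg_controllen = sizeof(cbuf),
	};
	struct cmsghdr *c;
	int fd = -1;

	if (recvmsg(sock, &mh, 0) < 0)
		return -1;
	for (c = CMSG_FIRSTHDR(&mh); c; c = CMSG_NXTHDR(&mh, c))
		if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_RIGHTS)
			memcpy(&fd, CMSG_DATA(c), sizeof(fd));
	return fd;
}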

> And UFFD_SECURE should just become automatic, since you don't have a
> problem any more. :-p

Agreed. I'll wait to hear what everyone else has to say.


Re: [PATCH 3/7] Add a UFFD_SECURE flag to the userfaultfd API.

2019-10-12 Thread Daniel Colascione
On Sat, Oct 12, 2019 at 4:10 PM Andy Lutomirski  wrote:
>
> On Sat, Oct 12, 2019 at 12:16 PM Daniel Colascione  wrote:
> >
> > The new secure flag makes userfaultfd use a new "secure" anonymous
> > file object instead of the default one, letting security modules
> > supervise userfaultfd use.
> >
> > Requiring that users pass a new flag lets us avoid changing the
> > semantics for existing callers.
>
> Is there any good reason not to make this be the default?
>
>
> The only downside I can see is that it would increase the memory usage
> of userfaultfd(), but that doesn't seem like such a big deal.  A
> lighter-weight alternative would be to have a single inode shared by
> all userfaultfd instances, which would require a somewhat different
> internal anon_inode API.

I'd also prefer to just make SELinux use mandatory, but there's a
nasty interaction with UFFD_EVENT_FORK. Adding a new UFFD_SECURE mode
which blocks UFFD_EVENT_FORK sidesteps this problem. Maybe you know a
better way to deal with it.

Right now, when a process with a UFFD-managed VMA using
UFFD_EVENT_FORK forks, we make a new userfaultfd_ctx out of thin air
and enqueue it on the message queue for the parent process. When we
dequeue that context, we get to resolve_userfault_fork, which makes up
a new UFFD file object out of thin air in the context of the reading
process. Following normal SELinux rules, the SID attached to that new
file object would be the task SID of the process *reading* the fork
event, not the SID of the new fork child. That seems wrong, because
the label we give to the UFFD should correspond to the label of the
process that UFFD controls.

To try to solve this problem, we can move the file object creation to
the fork child and enqueue the file object itself instead of just the
userfaultfd_ctx, treating the dequeue as a file-descriptor-receive
operation just like a recvmsg of an AF_UNIX socket with SCM_RIGHTS.
(This approach seems more elegant anyway, since it reflects what's
actually going on.) The trouble with the early-file-object-creation
approach is that the fork child may not be allowed to create UFFD file
objects on its own and an LSM can't tell the difference between
UFFD_EVENT_FORK handling creating the file object and the fork child
just calling userfaultfd(), meaning an LSM could veto the creation of
the file object for the fork event. We can't just create a
non-ANON_INODE_SECURE file object instead: that would defeat the whole
purpose of supervising UFFD using SELinux.

But maybe we can go further: let's separate authentication and
authorization, as we do in other LSM hooks. Let's split my
inode_init_security_anon into two hooks, inode_init_security_anon and
inode_create_anon. We'd define the former to just initialize the file
object's security information --- in the SELinux case, figuring out
its class and SID --- and define the latter to answer the yes/no
question of whether a particular anonymous inode creation should be
allowed. Normally, anon_inode_getfile2() would just call both hooks.
We'd add another anon_inode_getfd flag, ANON_INODE_SKIP_AUTHORIZATION
or something, that would tell anon_inode_getfile2() to skip calling
the authorization hook, effectively making the creation always
succeed. We can then make the UFFD code pass
ANON_INODE_SKIP_AUTHORIZATION when it's creating a file object in the
fork child while creating UFFD_EVENT_FORK messages.
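
Inside anon_inode_getfile2(), the split might look roughly like this
(a sketch: inode_create_anon and ANON_INODE_SKIP_AUTHORIZATION are
the names proposed above, not existing API):

	/* Authentication: always label the inode. */
	error = security_inode_init_security_anon(inode, name, fops);
	/* Authorization: skippable, e.g. for UFFD_EVENT_FORK handling. */
	if (!error && !(anon_inode_flags & ANON_INODE_SKIP_AUTHORIZATION))
		error = security_inode_create_anon(inode, name, fops);
	if (error) {
		iput(inode);
		return ERR_PTR(error);
	}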

Granted, UFFD fork processing doesn't actually occur in the fork
child, but in copy_mm, in the parent --- but the right thing should
happen anyway, right?

I'm open to suggestions. In the meantime, I figured we'd just define a
UFFD_SECURE and make it incompatible with UFFD_EVENT_FORK.

> In any event, I don't think that "make me visible to SELinux" should
> be a choice that user code makes.

Right. The new unprivileged_userfaultfd setting is ugly, but it at
least removes the ability of unprivileged users to opt out of SELinux
supervision.


Re: [PATCH 4/7] Teach SELinux about a new userfaultfd class

2019-10-12 Thread Daniel Colascione
On Sat, Oct 12, 2019 at 4:09 PM Andy Lutomirski  wrote:
>
> On Sat, Oct 12, 2019 at 12:16 PM Daniel Colascione  wrote:
> >
> > Use the secure anonymous inode LSM hook we just added to let SELinux
> > policy place restrictions on userfaultfd use. The create operation
> > applies to processes creating new instances of these file objects;
> > transfer between processes is covered by restrictions on read, write,
> > and ioctl access already checked inside selinux_file_receive.
>
> This is great, and I suspect we'll want it for things like SGX, too.
> But the current design seems like it will make it essentially
> impossible for SELinux to reference an anon_inode class whose
> file_operations are in a module, and moving file_operations out of a
> module would be nasty.
>
> Could this instead be keyed off a new struct anon_inode_class, an
> enum, or even just a string?

The new LSM hook already receives the string that callers pass to the
anon_inode APIs; modules can look at that instead of the fops if they
want. The reason to pass both the name and the fops through the hook
is to allow LSMs to match using fops comparison (which seems less
prone to breakage) when possible and rely on string matching when it
isn't.
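
Sketched inside an LSM's inode_init_security_anon hook, that strategy
would look roughly like this (SECCLASS_UFFD and userfaultfd_fops are
from the patches in this series; the string-match fallback is
illustrative):

#ifdef CONFIG_USERFAULTFD
	/* Preferred: exact match on the fops the creator passed in. */
	if (fops == &userfaultfd_fops)
		isec->sclass = SECCLASS_UFFD;
#endif
	/* Fallback when the fops aren't visible to the LSM: match the
	 * name the anon_inode caller supplied. */
	if (isec->sclass == SECCLASS_FILE && !strcmp(name, "[userfaultfd]"))
		isec->sclass = SECCLASS_UFFD;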


[PATCH 7/7] Add a new sysctl for limiting userfaultfd to user mode faults

2019-10-12 Thread Daniel Colascione
Add a new sysctl knob unprivileged_userfaultfd_user_mode_only.
This sysctl can be set to either zero or one. When zero (the default)
the system lets all users call userfaultfd with or without
UFFD_USER_MODE_ONLY, modulo other access controls. When
unprivileged_userfaultfd_user_mode_only is set to one, users without
CAP_SYS_PTRACE must pass UFFD_USER_MODE_ONLY to userfaultfd or the API
will fail with EPERM. This facility allows administrators to reduce
the likelihood that an attacker with access to userfaultfd can delay
faulting kernel code to widen timing windows for other exploits.

Signed-off-by: Daniel Colascione 
---
 Documentation/admin-guide/sysctl/vm.rst | 13 +
 fs/userfaultfd.c| 12 ++--
 include/linux/userfaultfd_k.h   |  1 +
 kernel/sysctl.c |  9 +
 4 files changed, 33 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 6664eec7bd35..330fd82b3f4e 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -849,6 +849,19 @@ they pass the UFFD_SECURE flag, enabling MAC security checks.
 
 The default value is 1.
 
+unprivileged_userfaultfd_user_mode_only
+=======================================
+
+This flag controls whether unprivileged users can use the userfaultfd
+system calls to handle page faults in kernel mode.  If set to zero,
+userfaultfd works with or without UFFD_USER_MODE_ONLY, modulo
+unprivileged_userfaultfd above.  If set to one, users without
+CAP_SYS_PTRACE must pass UFFD_USER_MODE_ONLY in order for userfaultfd
+to succeed.  Prohibiting use of userfaultfd for handling faults from
+kernel mode may make certain vulnerabilities more difficult
+to exploit.
+
+The default value is 0.
 
 user_reserve_kbytes
 ===
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index aaed9347973e..02addd425ab7 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -29,6 +29,7 @@
 #include 
 
 int sysctl_unprivileged_userfaultfd __read_mostly = 1;
+int sysctl_unprivileged_userfaultfd_user_mode_only __read_mostly = 0;
 
 static struct kmem_cache *userfaultfd_ctx_cachep __read_mostly;
 
@@ -1963,8 +1964,15 @@ SYSCALL_DEFINE1(userfaultfd, int, flags)
struct userfaultfd_ctx *ctx;
int fd;
static const int uffd_flags = UFFD_SECURE | UFFD_USER_MODE_ONLY;
-   bool need_cap_check = sysctl_unprivileged_userfaultfd == 0 ||
-   (sysctl_unprivileged_userfaultfd == 2 && !(flags & UFFD_SECURE));
+   bool need_cap_check = false;
+
+   if (sysctl_unprivileged_userfaultfd == 0 ||
+   (sysctl_unprivileged_userfaultfd == 2 && !(flags & UFFD_SECURE)))
+   need_cap_check = true;
+
+   if (sysctl_unprivileged_userfaultfd_user_mode_only &&
+   (flags & UFFD_USER_MODE_ONLY) == 0)
+   need_cap_check = true;
 
if (need_cap_check && !capable(CAP_SYS_PTRACE))
return -EPERM;
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 549c8b0cca52..efe14abb2dc8 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -29,6 +29,7 @@
 #define UFFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS)
 
 extern int sysctl_unprivileged_userfaultfd;
+extern int sysctl_unprivileged_userfaultfd_user_mode_only;
 
 extern const struct file_operations userfaultfd_fops;
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index fc98d5df344e..4f296676c0ac 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1740,6 +1740,15 @@ static struct ctl_table vm_table[] = {
.extra1 = SYSCTL_ZERO,
.extra2 = &two,
},
+   {
+   .procname   = "unprivileged_userfaultfd_user_mode_only",
+   .data   = &sysctl_unprivileged_userfaultfd_user_mode_only,
+   .maxlen = sizeof(sysctl_unprivileged_userfaultfd_user_mode_only),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec_minmax,
+   .extra1 = SYSCTL_ZERO,
+   .extra2 = SYSCTL_ONE,
+   },
 #endif
{ }
 };
-- 
2.23.0.700.g56cf767bdb-goog



[PATCH 6/7] Allow users to require UFFD_SECURE

2019-10-12 Thread Daniel Colascione
This change adds 2 as an allowable value for
unprivileged_userfaultfd. (Previously, this sysctl could be either 0
or 1.) When unprivileged_userfaultfd is 2, users with CAP_SYS_PTRACE
may create userfaultfd with or without UFFD_SECURE, but users without
CAP_SYS_PTRACE must pass UFFD_SECURE to userfaultfd in order for the
system call to succeed, effectively forcing them to opt into
additional security checks.

Signed-off-by: Daniel Colascione 
---
 Documentation/admin-guide/sysctl/vm.rst | 6 --
 fs/userfaultfd.c| 4 +++-
 kernel/sysctl.c | 2 +-
 3 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 64aeee1009ca..6664eec7bd35 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -842,8 +842,10 @@ unprivileged_userfaultfd
 
 This flag controls whether unprivileged users can use the userfaultfd
 system calls.  Set this to 1 to allow unprivileged users to use the
-userfaultfd system calls, or set this to 0 to restrict userfaultfd to only
-privileged users (with SYS_CAP_PTRACE capability).
+userfaultfd system calls, or set this to 0 to restrict userfaultfd to
+only privileged users (with CAP_SYS_PTRACE capability).  If set to 2,
+unprivileged (non-CAP_SYS_PTRACE) users may use userfaultfd only if
+they pass the UFFD_SECURE flag, enabling MAC security checks.
 
 The default value is 1.
 
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 986d23b2cd33..aaed9347973e 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1963,8 +1963,10 @@ SYSCALL_DEFINE1(userfaultfd, int, flags)
struct userfaultfd_ctx *ctx;
int fd;
static const int uffd_flags = UFFD_SECURE | UFFD_USER_MODE_ONLY;
+   bool need_cap_check = sysctl_unprivileged_userfaultfd == 0 ||
+   (sysctl_unprivileged_userfaultfd == 2 && !(flags & UFFD_SECURE));
 
-   if (!sysctl_unprivileged_userfaultfd && !capable(CAP_SYS_PTRACE))
+   if (need_cap_check && !capable(CAP_SYS_PTRACE))
return -EPERM;
 
BUG_ON(!current->mm);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 00fcea236eba..fc98d5df344e 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1738,7 +1738,7 @@ static struct ctl_table vm_table[] = {
.mode   = 0644,
.proc_handler   = proc_dointvec_minmax,
.extra1 = SYSCTL_ZERO,
-   .extra2 = SYSCTL_ONE,
+   .extra2 = &two,
},
 #endif
{ }
-- 
2.23.0.700.g56cf767bdb-goog



[PATCH 4/7] Teach SELinux about a new userfaultfd class

2019-10-12 Thread Daniel Colascione
Use the secure anonymous inode LSM hook we just added to let SELinux
policy place restrictions on userfaultfd use. The create operation
applies to processes creating new instances of these file objects;
transfer between processes is covered by restrictions on read, write,
and ioctl access already checked inside selinux_file_receive.

Signed-off-by: Daniel Colascione 
---
 fs/userfaultfd.c|  4 +-
 include/linux/userfaultfd_k.h   |  2 +
 security/selinux/hooks.c| 68 +
 security/selinux/include/classmap.h |  2 +
 4 files changed, 73 insertions(+), 3 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 29f920fb236e..1123089c3d55 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1014,8 +1014,6 @@ static __poll_t userfaultfd_poll(struct file *file, poll_table *wait)
}
 }
 
-static const struct file_operations userfaultfd_fops;
-
 static int resolve_userfault_fork(struct userfaultfd_ctx *ctx,
  struct userfaultfd_ctx *new,
  struct uffd_msg *msg)
@@ -1934,7 +1932,7 @@ static void userfaultfd_show_fdinfo(struct seq_file *m, struct file *f)
 }
 #endif
 
-static const struct file_operations userfaultfd_fops = {
+const struct file_operations userfaultfd_fops = {
 #ifdef CONFIG_PROC_FS
.show_fdinfo= userfaultfd_show_fdinfo,
 #endif
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index ac9d71e24b81..549c8b0cca52 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -30,6 +30,8 @@
 
 extern int sysctl_unprivileged_userfaultfd;
 
+extern const struct file_operations userfaultfd_fops;
+
 extern vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason);
 
 extern ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 9625b99e677f..0b3a36cbfbdc 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -92,6 +92,10 @@
 #include 
 #include 
 
+#ifdef CONFIG_USERFAULTFD
+#include 
+#endif
+
 #include "avc.h"
 #include "objsec.h"
 #include "netif.h"
@@ -2943,6 +2947,69 @@ static int selinux_inode_init_security(struct inode *inode, struct inode *dir,
return 0;
 }
 
+static int selinux_inode_init_security_anon(struct inode *inode,
+   const char *name,
+   const struct file_operations *fops)
+{
+   const struct task_security_struct *tsec = selinux_cred(current_cred());
+   struct common_audit_data ad;
+   struct inode_security_struct *isec;
+
+   if (unlikely(IS_PRIVATE(inode)))
+   return 0;
+
+   /*
+* We shouldn't be creating secure anonymous inodes before LSM
+* initialization completes.
+*/
+   if (unlikely(!selinux_state.initialized))
+   return -EBUSY;
+
+   isec = selinux_inode(inode);
+
+   /*
+* We only get here once per ephemeral inode.  The inode has
+* been initialized via inode_alloc_security but is otherwise
+* untouched, so check that the state is as
+* inode_alloc_security left it.
+*/
+   BUG_ON(isec->initialized != LABEL_INVALID);
+   BUG_ON(isec->sclass != SECCLASS_FILE);
+
+#ifdef CONFIG_USERFAULTFD
+   if (fops == &userfaultfd_fops)
+   isec->sclass = SECCLASS_UFFD;
+#endif
+
+   if (isec->sclass == SECCLASS_FILE) {
+   printk(KERN_WARNING "refusing to create secure anonymous inode "
+  "of unknown type");
+   return -EOPNOTSUPP;
+   }
+   /*
+* Always give secure anonymous inodes the sid of the
+* creating task.
+*/
+
+   isec->sid = tsec->sid;
+   isec->initialized = LABEL_INITIALIZED;
+
+   /*
+* Now that we've initialized security, check whether we're
+* allowed to actually create this type of anonymous inode.
+*/
+
+   ad.type = LSM_AUDIT_DATA_INODE;
+   ad.u.inode = inode;
+
+   return avc_has_perm(&selinux_state,
+   tsec->sid,
+   isec->sid,
+   isec->sclass,
+   FILE__CREATE,
+   &ad);
+}
+
static int selinux_inode_create(struct inode *dir, struct dentry *dentry, umode_t mode)
 {
return may_create(dir, dentry, SECCLASS_FILE);
@@ -6840,6 +6907,7 @@ static struct security_hook_list selinux_hooks[] __lsm_ro_after_init = {
LSM_HOOK_INIT(inode_alloc_security, selinux_inode_alloc_security),
LSM_HOOK_INIT(inode_free_security, selinux_inode_free_security),
LSM_HOOK_INIT(inode_init_security, selinux_inode_init_security),
+   LSM_HOOK_INIT(inode_init_security_anon, selinux_inode_init_security_anon),

[PATCH 2/7] Add a concept of a "secure" anonymous file

2019-10-12 Thread Daniel Colascione
A secure anonymous file is one we hooked up to its own inode (as
opposed to the shared inode we use for non-secure anonymous files). A
new selinux hook gives security modules a chance to initialize, label,
and veto the creation of these secure anonymous files. Security
modules had limit ability to interact with non-secure anonymous files
due to all of these files sharing a single inode.

Signed-off-by: Daniel Colascione 
---
 fs/anon_inodes.c  | 45 ++-
 include/linux/lsm_hooks.h |  8 +++
 include/linux/security.h  |  2 ++
 security/security.c   |  8 +++
 4 files changed, 53 insertions(+), 10 deletions(-)

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index caa36019afca..d68d76523ad3 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -55,6 +55,23 @@ static struct file_system_type anon_inode_fs_type = {
.kill_sb= kill_anon_super,
 };
 
+struct inode *anon_inode_make_secure_inode(const char *name,
+  const struct file_operations *fops)
+{
+   struct inode *inode;
+   int error;
+   inode = alloc_anon_inode(anon_inode_mnt->mnt_sb);
+   if (IS_ERR(inode))
+   return ERR_PTR(PTR_ERR(inode));
+   inode->i_flags &= ~S_PRIVATE;
+   error = security_inode_init_security_anon(inode, name, fops);
+   if (error) {
+   iput(inode);
+   return ERR_PTR(error);
+   }
+   return inode;
+}
+
 /**
  * anon_inode_getfile2 - creates a new file instance by hooking it up to
  *   an anonymous inode, and a dentry that describe
@@ -72,7 +89,9 @@ static struct file_system_type anon_inode_fs_type = {
  * hence saving memory and avoiding code duplication for the file/inode/dentry
  * setup.  Returns the newly created file* or an error pointer.
  *
- * anon_inode_flags must be zero.
+ * If anon_inode_flags contains ANON_INODE_SECURE, create a new inode
+ * and enable security checks for it. Otherwise, attach a new file to
+ * a singleton placeholder inode with security checks disabled.
  */
 struct file *anon_inode_getfile2(const char *name,
 const struct file_operations *fops,
@@ -81,17 +100,23 @@ struct file *anon_inode_getfile2(const char *name,
struct inode *inode;
struct file *file;
 
-   if (anon_inode_flags)
+   if (anon_inode_flags & ~ANON_INODE_SECURE)
return ERR_PTR(-EINVAL);
 
-   inode = anon_inode_inode;
-   if (IS_ERR(inode))
-   return ERR_PTR(-ENODEV);
-   /*
-* We know the anon_inode inode count is always
-* greater than zero, so ihold() is safe.
-*/
-   ihold(inode);
+   if (anon_inode_flags & ANON_INODE_SECURE) {
+   inode = anon_inode_make_secure_inode(name, fops);
+   if (IS_ERR(inode))
+   return ERR_PTR(PTR_ERR(inode));
+   } else {
+   inode = anon_inode_inode;
+   if (IS_ERR(inode))
+   return ERR_PTR(-ENODEV);
+   /*
+* We know the anon_inode inode count is always
+* greater than zero, so ihold() is safe.
+*/
+   ihold(inode);
+   }
 
if (fops->owner && !try_module_get(fops->owner)) {
file = ERR_PTR(-ENOENT);
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index a3763247547c..3744ce9e9172 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -215,6 +215,10 @@
  * Returns 0 if @name and @value have been successfully set,
  * -EOPNOTSUPP if no security attribute is needed, or
  * -ENOMEM on memory allocation failure.
+ * @inode_init_security_anon:
+ *  Set up a secure anonymous inode.
+ * Returns 0 on success. Returns -EPERM if the security module denies
+ * the creation of this inode.
  * @inode_create:
  * Check permission to create a regular file.
  * @dir contains inode structure of the parent of the new file.
@@ -1552,6 +1556,9 @@ union security_list_options {
const struct qstr *qstr,
const char **name, void **value,
size_t *len);
+   int (*inode_init_security_anon)(struct inode *inode,
+   const char *name,
+   const struct file_operations *fops);
int (*inode_create)(struct inode *dir, struct dentry *dentry,
umode_t mode);
int (*inode_link)(struct dentry *old_dentry, struct inode *dir,
@@ -1876,6 +1883,7 @@ struct security_hook_heads {
struct hlist_head inode_alloc_security;
struct hlist_head inode_free_security;
struct hlist_head inode_init_security;
+   struct hlist_head inode_init_security_anon;
  

[PATCH 1/7] Add a new flags-accepting interface for anonymous inodes

2019-10-12 Thread Daniel Colascione
Add functions forwarding from the old names to the new ones so we
don't need to change any callers.

Signed-off-by: Daniel Colascione 
---
 fs/anon_inodes.c| 62 ++---
 include/linux/anon_inodes.h | 27 +---
 2 files changed, 59 insertions(+), 30 deletions(-)

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 89714308c25b..caa36019afca 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -56,60 +56,71 @@ static struct file_system_type anon_inode_fs_type = {
 };
 
 /**
- * anon_inode_getfile - creates a new file instance by hooking it up to an
- *  anonymous inode, and a dentry that describe the "class"
- *  of the file
+ * anon_inode_getfile2 - creates a new file instance by hooking it up to
+ *   an anonymous inode, and a dentry that describe
+ *   the "class" of the file
  *
 * @name:	[in]	name of the "class" of the new file
 * @fops:	[in]	file operations for the new file
 * @priv:	[in]	private data for the new file (will be file's private_data)
- * @flags:	[in]	flags
+ * @flags:	[in]	flags for the file
+ * @anon_inode_flags:	[in]	flags for anon_inode*
  *
- * Creates a new file by hooking it on a single inode. This is useful for files
+ * Creates a new file by hooking it on an unspecified inode. This is useful for files
  * that do not need to have a full-fledged inode in order to operate correctly.
  * All the files created with anon_inode_getfile() will share a single inode,
  * hence saving memory and avoiding code duplication for the file/inode/dentry
  * setup.  Returns the newly created file* or an error pointer.
+ *
+ * anon_inode_flags must be zero.
  */
-struct file *anon_inode_getfile(const char *name,
-   const struct file_operations *fops,
-   void *priv, int flags)
+struct file *anon_inode_getfile2(const char *name,
+const struct file_operations *fops,
+void *priv, int flags, int anon_inode_flags)
 {
+   struct inode *inode;
struct file *file;
 
-   if (IS_ERR(anon_inode_inode))
-   return ERR_PTR(-ENODEV);
-
-   if (fops->owner && !try_module_get(fops->owner))
-   return ERR_PTR(-ENOENT);
+   if (anon_inode_flags)
+   return ERR_PTR(-EINVAL);
 
+   inode = anon_inode_inode;
+   if (IS_ERR(inode))
+   return ERR_PTR(-ENODEV);
/*
-* We know the anon_inode inode count is always greater than zero,
-* so ihold() is safe.
+* We know the anon_inode inode count is always
+* greater than zero, so ihold() is safe.
 */
-   ihold(anon_inode_inode);
-   file = alloc_file_pseudo(anon_inode_inode, anon_inode_mnt, name,
+   ihold(inode);
+
+   if (fops->owner && !try_module_get(fops->owner)) {
+   file = ERR_PTR(-ENOENT);
+   goto err;
+   }
+
+   file = alloc_file_pseudo(inode, anon_inode_mnt, name,
 flags & (O_ACCMODE | O_NONBLOCK), fops);
if (IS_ERR(file))
goto err;
 
-   file->f_mapping = anon_inode_inode->i_mapping;
+   file->f_mapping = inode->i_mapping;
 
file->private_data = priv;
 
return file;
 
 err:
-   iput(anon_inode_inode);
+   iput(inode);
module_put(fops->owner);
return file;
 }
 EXPORT_SYMBOL_GPL(anon_inode_getfile);
+EXPORT_SYMBOL_GPL(anon_inode_getfile2);
 
 /**
- * anon_inode_getfd - creates a new file instance by hooking it up to an
- *			anonymous inode, and a dentry that describe the "class"
- *			of the file
+ * anon_inode_getfd2 - creates a new file instance by hooking it up to an
+ * anonymous inode, and a dentry that describe the "class"
+ * of the file
  *
 * @name:	[in]	name of the "class" of the new file
 * @fops:	[in]	file operations for the new file
@@ -122,8 +133,8 @@ EXPORT_SYMBOL_GPL(anon_inode_getfile);
  * hence saving memory and avoiding code duplication for the file/inode/dentry
  * setup.  Returns new descriptor or an error code.
  */
-int anon_inode_getfd(const char *name, const struct file_operations *fops,
-void *priv, int flags)
+int anon_inode_getfd2(const char *name, const struct file_operations *fops,
+ void *priv, int flags, int anon_inode_flags)
 {
int error, fd;
struct file *file;
@@ -133,7 +144,7 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops,
return error;
fd = error;
 
-   file = anon_inode_getfile(name, fops, priv, flags);
+   file = anon_inode_getfile2(name, fops, priv, flags, anon_inode_flags);

[PATCH 3/7] Add a UFFD_SECURE flag to the userfaultfd API.

2019-10-12 Thread Daniel Colascione
The new secure flag makes userfaultfd use a new "secure" anonymous
file object instead of the default one, letting security modules
supervise userfaultfd use.

Requiring that users pass a new flag lets us avoid changing the
semantics for existing callers.

Signed-off-by: Daniel Colascione 
---
 fs/userfaultfd.c | 28 +---
 include/uapi/linux/userfaultfd.h |  8 
 2 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index f9fd18670e22..29f920fb236e 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1022,6 +1022,13 @@ static int resolve_userfault_fork(struct userfaultfd_ctx *ctx,
 {
int fd;
 
+   /*
+* Using a secure-mode UFFD to monitor forks isn't supported
+* right now.
+*/
+   if (new->flags & UFFD_SECURE)
+   return -EOPNOTSUPP;
+
fd = anon_inode_getfd("[userfaultfd]", &userfaultfd_fops, new,
  O_RDWR | (new->flags & UFFD_SHARED_FCNTL_FLAGS));
if (fd < 0)
@@ -1841,6 +1848,18 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx,
ret = -EINVAL;
goto out;
}
+   if ((ctx->flags & UFFD_SECURE) &&
+   (features & UFFD_FEATURE_EVENT_FORK)) {
+   /*
+* We don't support UFFD_FEATURE_EVENT_FORK on a
+* secure-mode UFFD: doing so would need us to
+* construct the new file object in the context of the
+* fork child, and it's not worth it right now.
+*/
+   ret = -EINVAL;
+   goto out;
+   }
+
/* report all available features and ioctls to userland */
uffdio_api.features = UFFD_API_FEATURES;
uffdio_api.ioctls = UFFD_API_IOCTLS;
@@ -1942,6 +1961,7 @@ SYSCALL_DEFINE1(userfaultfd, int, flags)
 {
struct userfaultfd_ctx *ctx;
int fd;
+   static const int uffd_flags = UFFD_SECURE;
 
if (!sysctl_unprivileged_userfaultfd && !capable(CAP_SYS_PTRACE))
return -EPERM;
@@ -1951,8 +1971,9 @@ SYSCALL_DEFINE1(userfaultfd, int, flags)
/* Check the UFFD_* constants for consistency.  */
BUILD_BUG_ON(UFFD_CLOEXEC != O_CLOEXEC);
BUILD_BUG_ON(UFFD_NONBLOCK != O_NONBLOCK);
+   BUILD_BUG_ON(UFFD_SHARED_FCNTL_FLAGS & uffd_flags);
 
-   if (flags & ~UFFD_SHARED_FCNTL_FLAGS)
+   if (flags & ~(UFFD_SHARED_FCNTL_FLAGS | uffd_flags))
return -EINVAL;
 
ctx = kmem_cache_alloc(userfaultfd_ctx_cachep, GFP_KERNEL);
@@ -1969,8 +1990,9 @@ SYSCALL_DEFINE1(userfaultfd, int, flags)
/* prevent the mm struct to be freed */
mmgrab(ctx->mm);
 
-   fd = anon_inode_getfd("[userfaultfd]", &userfaultfd_fops, ctx,
- O_RDWR | (flags & UFFD_SHARED_FCNTL_FLAGS));
+   fd = anon_inode_getfd2("[userfaultfd]", &userfaultfd_fops, ctx,
+  O_RDWR | (flags & UFFD_SHARED_FCNTL_FLAGS),
+  ((flags & UFFD_SECURE) ? ANON_INODE_SECURE : 0));
if (fd < 0) {
mmdrop(ctx->mm);
kmem_cache_free(userfaultfd_ctx_cachep, ctx);
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 48f1a7c2f1f0..12d7d40d7f25 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -231,4 +231,12 @@ struct uffdio_zeropage {
__s64 zeropage;
 };
 
+/*
+ * Flags for the userfaultfd(2) system call itself.
+ */
+
+/*
+ * Create a userfaultfd with MAC security checks enabled.
+ */
+#define UFFD_SECURE 1
 #endif /* _LINUX_USERFAULTFD_H */
-- 
2.23.0.700.g56cf767bdb-goog



[PATCH 0/7] Harden userfaultfd

2019-10-12 Thread Daniel Colascione
Userfaultfd in unprivileged contexts is potentially very useful.
We'd like to harden userfaultfd to make such unprivileged use
less risky. This patch series allows SELinux to manage userfaultfd
file descriptors (via a new flag, for compatibility with existing
code) and allows administrators to limit userfaultfd to servicing
user-mode faults, increasing the difficulty of using userfaultfd in
exploit chains that rely on delaying kernel page faults.

A new anon_inodes interface allows callers to opt into SELinux
management of anonymous file objects. In this mode, anon_inodes
creates new ephemeral inodes for anonymous file objects instead of
reusing a singleton dummy inode. A new LSM hook gives security modules
an opportunity to configure and veto these ephemeral inodes.

Existing anon_inodes users must opt into the new functionality.
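
From userspace, opting in looks roughly like the following sketch
(the flag values are taken from patches 3/7 and 5/7 below and are not
part of the current uapi headers):

#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

#define UFFD_SECURE         1	/* patch 3/7 */
#define UFFD_USER_MODE_ONLY 2	/* patch 5/7 */

int main(void)
{
	/* Passing both opt-in flags keeps this working without
	 * CAP_SYS_PTRACE even under vm.unprivileged_userfaultfd=2 and
	 * vm.unprivileged_userfaultfd_user_mode_only=1. */
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK |
			   UFFD_SECURE | UFFD_USER_MODE_ONLY);

	return uffd < 0 ? 1 : 0;
}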

Daniel Colascione (7):
  Add a new flags-accepting interface for anonymous inodes
  Add a concept of a "secure" anonymous file
  Add a UFFD_SECURE flag to the userfaultfd API.
  Teach SELinux about a new userfaultfd class
  Let userfaultfd opt out of handling kernel-mode faults
  Allow users to require UFFD_SECURE
  Add a new sysctl for limiting userfaultfd to user mode faults

 Documentation/admin-guide/sysctl/vm.rst | 19 +-
 fs/anon_inodes.c| 89 +
 fs/userfaultfd.c| 47 +++--
 include/linux/anon_inodes.h | 27 ++--
 include/linux/lsm_hooks.h   |  8 +++
 include/linux/security.h|  2 +
 include/linux/userfaultfd_k.h   |  3 +
 include/uapi/linux/userfaultfd.h| 14 
 kernel/sysctl.c |  9 +++
 security/security.c |  8 +++
 security/selinux/hooks.c| 68 +++
 security/selinux/include/classmap.h |  2 +
 12 files changed, 256 insertions(+), 40 deletions(-)

-- 
2.23.0.700.g56cf767bdb-goog



[PATCH 5/7] Let userfaultfd opt out of handling kernel-mode faults

2019-10-12 Thread Daniel Colascione
userfaultfd handles page faults from both user and kernel code.  Add a
new UFFD_USER_MODE_ONLY flag for userfaultfd(2) that makes the
resulting userfaultfd object refuse to handle faults from kernel mode,
treating these faults as if SIGBUS were always raised, causing the
kernel code to fail with EFAULT.

A future patch adds a knob allowing administrators to give some
processes the ability to create userfaultfd file objects only if they
pass UFFD_USER_MODE_ONLY, reducing the likelihood that these processes
will exploit userfaultfd's ability to delay kernel page faults to open
timing windows for future exploits.

Signed-off-by: Daniel Colascione 
---
 fs/userfaultfd.c | 5 -
 include/uapi/linux/userfaultfd.h | 6 ++
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 1123089c3d55..986d23b2cd33 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -389,6 +389,9 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 
if (ctx->features & UFFD_FEATURE_SIGBUS)
goto out;
+   if ((vmf->flags & FAULT_FLAG_USER) == 0 &&
+   ctx->flags & UFFD_USER_MODE_ONLY)
+   goto out;
 
/*
 * If it's already released don't get it. This avoids to loop
@@ -1959,7 +1962,7 @@ SYSCALL_DEFINE1(userfaultfd, int, flags)
 {
struct userfaultfd_ctx *ctx;
int fd;
-   static const int uffd_flags = UFFD_SECURE;
+   static const int uffd_flags = UFFD_SECURE | UFFD_USER_MODE_ONLY;
 
if (!sysctl_unprivileged_userfaultfd && !capable(CAP_SYS_PTRACE))
return -EPERM;
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 12d7d40d7f25..eadd1497e7b5 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -239,4 +239,10 @@ struct uffdio_zeropage {
  * Create a userfaultfd with MAC security checks enabled.
  */
 #define UFFD_SECURE 1
+
+/*
+ * Create a userfaultfd that can handle page faults only in user mode.
+ */
+#define UFFD_USER_MODE_ONLY 2
+
 #endif /* _LINUX_USERFAULTFD_H */
-- 
2.23.0.700.g56cf767bdb-goog



Re: [PATCH] Make SPLIT_RSS_COUNTING configurable

2019-10-04 Thread Daniel Colascione
On Fri, Oct 4, 2019 at 6:26 AM Kirill A. Shutemov  wrote:
> On Fri, Oct 04, 2019 at 02:33:49PM +0200, Michal Hocko wrote:
> > On Wed 02-10-19 19:08:16, Daniel Colascione wrote:
> > > On Wed, Oct 2, 2019 at 6:56 PM Qian Cai  wrote:
> > > > > On Oct 2, 2019, at 4:29 PM, Daniel Colascione  
> > > > > wrote:
> > > > >
> > > > > Adding the correct linux-mm address.
> > > > >
> > > > >
> > > > >> +config SPLIT_RSS_COUNTING
> > > > >> +   bool "Per-thread mm counter caching"
> > > > >> +   depends on MMU
> > > > >> +   default y if NR_CPUS >= SPLIT_PTLOCK_CPUS
> > > > >> +   help
> > > > >> + Cache mm counter updates in thread structures and
> > > > >> + flush them to visible per-process statistics in batches.
> > > > >> + Say Y here to slightly reduce cache contention in processes
> > > > >> + with many threads at the expense of decreasing the accuracy
> > > > >> + of memory statistics in /proc.
> > > > >> +
> > > > >> endmenu
> > > >
> > > > All those vague words are going to make developers almost impossible to 
> > > > decide the right selection here. It sounds like we should kill 
> > > > SPLIT_RSS_COUNTING at all to simplify the code as the benefit is so 
> > > > small vs the side-effect?
> > >
> > > Killing SPLIT_RSS_COUNTING would be my first choice; IME, on mobile
> > > and a basic desktop, it doesn't make a difference. I figured making it
> > > a knob would help allay concerns about the performance impact in more
> > > extreme configurations.
> >
> > I do agree with Qian. Either it is really helpful (is it? probably on
> > the number of cpus) and it should be auto-enabled or it should be
> > dropped altogether. You cannot really expect people know how to enable
> > this without a deep understanding of the MM internals. Not to mention
> > all those users using distro kernels/configs.
> >
> > A config option sounds like a bad way forward.
>
> And I don't see much point anyway. Reading RSS counters from proc is
> inherently racy. It can just either way after the read due to process
> behaviour.

Split RSS accounting doesn't make reading from mm counters racy. It
makes these counters *wrong*. We flush task mm counters to the
mm_struct once every 64 page faults that a task incurs or when that
task exits. That means that if a thread takes 63 page faults and then
sleeps for a week, that thread's process's mm counters are wrong by 63
pages *for a week*. And some processes have a lot of threads,
compounding the error. Split RSS accounting means that memory usage
numbers don't add up. I don't think it's unreasonable to want a mode
where memory counters agree with other indicators of system
activity.
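
Here's a userspace sketch of that failure mode (the 64-event
threshold is TASK_RSS_EVENTS_THRESH in mm/memory.c; the 4096-byte
page size is an assumption; build with -pthread):

#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

/* Touch 63 anonymous pages -- one below the flush threshold -- then
 * sleep.  With split RSS accounting the 63 pages stay cached in this
 * thread's task_struct, so the process RSS in /proc/self/statm is
 * short by 63 pages until this thread faults again or exits. */
static void *toucher(void *arg)
{
	char *p = malloc(63 * 4096);
	int i;

	for (i = 0; i < 63; i++)
		p[i * 4096] = 1;	/* 63 minor faults */
	sleep(7 * 24 * 3600);		/* ...stale for a week */
	return arg;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, toucher, NULL);
	sleep(1);
	/* Compare the resident-pages column of /proc/self/statm here
	 * with what the toucher thread actually mapped. */
	pause();
	return 0;
}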

Nobody has demonstrated that split RSS accounting actually helps in
the real world. But I've described above, concretely, how split RSS
accounting hurts. I've been trying for over a year to either disable
split RSS accounting or to let people opt out of it. If you won't
remove split RSS accounting and you won't let me add a configuration
knob that lets people opt out of it, what will you accept?


Re: [PATCH] Make SPLIT_RSS_COUNTING configurable

2019-10-02 Thread Daniel Colascione
On Wed, Oct 2, 2019 at 6:56 PM Qian Cai  wrote:
> > On Oct 2, 2019, at 4:29 PM, Daniel Colascione  wrote:
> >
> > Adding the correct linux-mm address.
> >
> >
> >> On Wed, Oct 2, 2019 at 1:25 PM Daniel Colascione  wrote:
> >>
> >> Using the new config option, users can disable SPLIT_RSS_COUNTING to
> >> get increased accuracy in user-visible mm counters.
> >>
> >> Signed-off-by: Daniel Colascione 
> >> ---
> >> include/linux/mm.h|  4 ++--
> >> include/linux/mm_types_task.h |  5 ++---
> >> include/linux/sched.h |  2 +-
> >> kernel/fork.c |  2 +-
> >> mm/Kconfig| 11 +++
> >> mm/memory.c   |  6 +++---
> >> 6 files changed, 20 insertions(+), 10 deletions(-)
> >>
> >> diff --git a/include/linux/mm.h b/include/linux/mm.h
> >> index cc292273e6ba..221395de3cb4 100644
> >> --- a/include/linux/mm.h
> >> +++ b/include/linux/mm.h
> >> @@ -1637,7 +1637,7 @@ static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
> >> {
> >>long val = atomic_long_read(&mm->rss_stat.count[member]);
> >>
> >> -#ifdef SPLIT_RSS_COUNTING
> >> +#ifdef CONFIG_SPLIT_RSS_COUNTING
> >>/*
> >> * counter is updated in asynchronous manner and may go to minus.
> >> * But it's never be expected number for users.
> >> @@ -1723,7 +1723,7 @@ static inline void setmax_mm_hiwater_rss(unsigned long *maxrss,
> >>*maxrss = hiwater_rss;
> >> }
> >>
> >> -#if defined(SPLIT_RSS_COUNTING)
> >> +#ifdef CONFIG_SPLIT_RSS_COUNTING
> >> void sync_mm_rss(struct mm_struct *mm);
> >> #else
> >> static inline void sync_mm_rss(struct mm_struct *mm)
> >> diff --git a/include/linux/mm_types_task.h b/include/linux/mm_types_task.h
> >> index c1bc6731125c..d2adc8057e65 100644
> >> --- a/include/linux/mm_types_task.h
> >> +++ b/include/linux/mm_types_task.h
> >> @@ -48,14 +48,13 @@ enum {
> >>NR_MM_COUNTERS
> >> };
> >>
> >> -#if USE_SPLIT_PTE_PTLOCKS && defined(CONFIG_MMU)
> >> -#define SPLIT_RSS_COUNTING
> >> +#ifdef CONFIG_SPLIT_RSS_COUNTING
> >> /* per-thread cached information, */
> >> struct task_rss_stat {
> >>int events; /* for synchronization threshold */
> >>int count[NR_MM_COUNTERS];
> >> };
> >> -#endif /* USE_SPLIT_PTE_PTLOCKS */
> >> +#endif /* CONFIG_SPLIT_RSS_COUNTING */
> >>
> >> struct mm_rss_stat {
> >>atomic_long_t count[NR_MM_COUNTERS];
> >> diff --git a/include/linux/sched.h b/include/linux/sched.h
> >> index 2c2e56bd8913..22f354774540 100644
> >> --- a/include/linux/sched.h
> >> +++ b/include/linux/sched.h
> >> @@ -729,7 +729,7 @@ struct task_struct {
> >>/* Per-thread vma caching: */
> >>struct vmacache vmacache;
> >>
> >> -#ifdef SPLIT_RSS_COUNTING
> >> +#ifdef CONFIG_SPLIT_RSS_COUNTING
> >>struct task_rss_stat	rss_stat;
> >> #endif
> >>int exit_state;
> >> diff --git a/kernel/fork.c b/kernel/fork.c
> >> index f9572f416126..fc5e0889922b 100644
> >> --- a/kernel/fork.c
> >> +++ b/kernel/fork.c
> >> @@ -1917,7 +1917,7 @@ static __latent_entropy struct task_struct *copy_process(
> >>p->vtime.state = VTIME_INACTIVE;
> >> #endif
> >>
> >> -#if defined(SPLIT_RSS_COUNTING)
> >> +#ifdef CONFIG_SPLIT_RSS_COUNTING
> >>memset(&p->rss_stat, 0, sizeof(p->rss_stat));
> >> #endif
> >>
> >> diff --git a/mm/Kconfig b/mm/Kconfig
> >> index a5dae9a7eb51..372ef9449924 100644
> >> --- a/mm/Kconfig
> >> +++ b/mm/Kconfig
> >> @@ -736,4 +736,15 @@ config ARCH_HAS_PTE_SPECIAL
> >> config ARCH_HAS_HUGEPD
> >>bool
> >>
> >> +config SPLIT_RSS_COUNTING
> >> +   bool "Per-thread mm counter caching"
> >> +   depends on MMU
> >> +   default y if NR_CPUS >= SPLIT_PTLOCK_CPUS
> >> +   help
> >> + Cache mm counter updates in thread structures and
> >> + flush them to visible per-process statistics in batches.
> >> + Say Y here to slightly reduce cache contention in processes
> >> + with many threads at the expense of decreasing the accuracy
> >> + of memory statistics in /proc.
> >> +
> >> endmenu
>
> All those vague words are going to make developers almost impossible to 
> decide the right selection here. It sounds like we should kill 
> SPLIT_RSS_COUNTING at all to simplify the code as the benefit is so small vs 
> the side-effect?

Killing SPLIT_RSS_COUNTING would be my first choice; IME, on mobile
and a basic desktop, it doesn't make a difference. I figured making it
a knob would help allay concerns about the performance impact in more
extreme configurations.


Re: [PATCH] Make SPLIT_RSS_COUNTING configurable

2019-10-02 Thread Daniel Colascione
Adding the correct linux-mm address.


On Wed, Oct 2, 2019 at 1:25 PM Daniel Colascione  wrote:
>
> Using the new config option, users can disable SPLIT_RSS_COUNTING to
> get increased accuracy in user-visible mm counters.
>
> Signed-off-by: Daniel Colascione 
> ---
>  include/linux/mm.h|  4 ++--
>  include/linux/mm_types_task.h |  5 ++---
>  include/linux/sched.h |  2 +-
>  kernel/fork.c |  2 +-
>  mm/Kconfig| 11 +++
>  mm/memory.c   |  6 +++---
>  6 files changed, 20 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index cc292273e6ba..221395de3cb4 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1637,7 +1637,7 @@ static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
>  {
> long val = atomic_long_read(&mm->rss_stat.count[member]);
>
> -#ifdef SPLIT_RSS_COUNTING
> +#ifdef CONFIG_SPLIT_RSS_COUNTING
> /*
>  * counter is updated in asynchronous manner and may go to minus.
>  * But it's never be expected number for users.
> @@ -1723,7 +1723,7 @@ static inline void setmax_mm_hiwater_rss(unsigned long *maxrss,
> *maxrss = hiwater_rss;
>  }
>
> -#if defined(SPLIT_RSS_COUNTING)
> +#ifdef CONFIG_SPLIT_RSS_COUNTING
>  void sync_mm_rss(struct mm_struct *mm);
>  #else
>  static inline void sync_mm_rss(struct mm_struct *mm)
> diff --git a/include/linux/mm_types_task.h b/include/linux/mm_types_task.h
> index c1bc6731125c..d2adc8057e65 100644
> --- a/include/linux/mm_types_task.h
> +++ b/include/linux/mm_types_task.h
> @@ -48,14 +48,13 @@ enum {
> NR_MM_COUNTERS
>  };
>
> -#if USE_SPLIT_PTE_PTLOCKS && defined(CONFIG_MMU)
> -#define SPLIT_RSS_COUNTING
> +#ifdef CONFIG_SPLIT_RSS_COUNTING
>  /* per-thread cached information, */
>  struct task_rss_stat {
> int events; /* for synchronization threshold */
> int count[NR_MM_COUNTERS];
>  };
> -#endif /* USE_SPLIT_PTE_PTLOCKS */
> +#endif /* CONFIG_SPLIT_RSS_COUNTING */
>
>  struct mm_rss_stat {
> atomic_long_t count[NR_MM_COUNTERS];
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 2c2e56bd8913..22f354774540 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -729,7 +729,7 @@ struct task_struct {
> /* Per-thread vma caching: */
> struct vmacache vmacache;
>
> -#ifdef SPLIT_RSS_COUNTING
> +#ifdef CONFIG_SPLIT_RSS_COUNTING
> struct task_rss_stat	rss_stat;
>  #endif
> int exit_state;
> diff --git a/kernel/fork.c b/kernel/fork.c
> index f9572f416126..fc5e0889922b 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1917,7 +1917,7 @@ static __latent_entropy struct task_struct *copy_process(
> p->vtime.state = VTIME_INACTIVE;
>  #endif
>
> -#if defined(SPLIT_RSS_COUNTING)
> +#ifdef CONFIG_SPLIT_RSS_COUNTING
> memset(&p->rss_stat, 0, sizeof(p->rss_stat));
>  #endif
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index a5dae9a7eb51..372ef9449924 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -736,4 +736,15 @@ config ARCH_HAS_PTE_SPECIAL
>  config ARCH_HAS_HUGEPD
> bool
>
> +config SPLIT_RSS_COUNTING
> +   bool "Per-thread mm counter caching"
> +   depends on MMU
> +   default y if NR_CPUS >= SPLIT_PTLOCK_CPUS
> +   help
> + Cache mm counter updates in thread structures and
> + flush them to visible per-process statistics in batches.
> + Say Y here to slightly reduce cache contention in processes
> + with many threads at the expense of decreasing the accuracy
> + of memory statistics in /proc.
> +
>  endmenu
> diff --git a/mm/memory.c b/mm/memory.c
> index b1ca51a079f2..bf557ed5ba23 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -141,7 +141,7 @@ static int __init init_zero_pfn(void)
>  core_initcall(init_zero_pfn);
>
>
> -#if defined(SPLIT_RSS_COUNTING)
> +#ifdef CONFIG_SPLIT_RSS_COUNTING
>
>  void sync_mm_rss(struct mm_struct *mm)
>  {
> @@ -177,7 +177,7 @@ static void check_sync_rss_stat(struct task_struct *task)
> if (unlikely(task->rss_stat.events++ > TASK_RSS_EVENTS_THRESH))
> sync_mm_rss(task->mm);
>  }
> -#else /* SPLIT_RSS_COUNTING */
> +#else /* CONFIG_SPLIT_RSS_COUNTING */
>
>  #define inc_mm_counter_fast(mm, member) inc_mm_counter(mm, member)
>  #define dec_mm_counter_fast(mm, member) dec_mm_counter(mm, member)
> @@ -186,7 +186,7 @@ static void check_sync_rss_stat(struct task_struct *task)
>  {
>  }
>
> -#endif /* SPLIT_RSS_COUNTING */
> +#endif /* CONFIG_SPLIT_RSS_COUNTING */
>
>  /*
>   * Note: this doesn't free the actual pages themselves. That
> --
> 2.23.0.581.g78d2f28ef7-goog
>


[PATCH] Make SPLIT_RSS_COUNTING configurable

2019-10-02 Thread Daniel Colascione
Using the new config option, users can disable SPLIT_RSS_COUNTING to
get increased accuracy in user-visible mm counters.

Signed-off-by: Daniel Colascione 
---
 include/linux/mm.h|  4 ++--
 include/linux/mm_types_task.h |  5 ++---
 include/linux/sched.h |  2 +-
 kernel/fork.c |  2 +-
 mm/Kconfig| 11 +++
 mm/memory.c   |  6 +++---
 6 files changed, 20 insertions(+), 10 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index cc292273e6ba..221395de3cb4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1637,7 +1637,7 @@ static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
 {
long val = atomic_long_read(&mm->rss_stat.count[member]);
 
-#ifdef SPLIT_RSS_COUNTING
+#ifdef CONFIG_SPLIT_RSS_COUNTING
/*
 * counter is updated in asynchronous manner and may go to minus.
 * But it's never be expected number for users.
@@ -1723,7 +1723,7 @@ static inline void setmax_mm_hiwater_rss(unsigned long *maxrss,
*maxrss = hiwater_rss;
 }
 
-#if defined(SPLIT_RSS_COUNTING)
+#ifdef CONFIG_SPLIT_RSS_COUNTING
 void sync_mm_rss(struct mm_struct *mm);
 #else
 static inline void sync_mm_rss(struct mm_struct *mm)
diff --git a/include/linux/mm_types_task.h b/include/linux/mm_types_task.h
index c1bc6731125c..d2adc8057e65 100644
--- a/include/linux/mm_types_task.h
+++ b/include/linux/mm_types_task.h
@@ -48,14 +48,13 @@ enum {
NR_MM_COUNTERS
 };
 
-#if USE_SPLIT_PTE_PTLOCKS && defined(CONFIG_MMU)
-#define SPLIT_RSS_COUNTING
+#ifdef CONFIG_SPLIT_RSS_COUNTING
 /* per-thread cached information, */
 struct task_rss_stat {
int events; /* for synchronization threshold */
int count[NR_MM_COUNTERS];
 };
-#endif /* USE_SPLIT_PTE_PTLOCKS */
+#endif /* CONFIG_SPLIT_RSS_COUNTING */
 
 struct mm_rss_stat {
atomic_long_t count[NR_MM_COUNTERS];
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2c2e56bd8913..22f354774540 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -729,7 +729,7 @@ struct task_struct {
/* Per-thread vma caching: */
struct vmacache vmacache;
 
-#ifdef SPLIT_RSS_COUNTING
+#ifdef CONFIG_SPLIT_RSS_COUNTING
struct task_rss_stat	rss_stat;
 #endif
int exit_state;
diff --git a/kernel/fork.c b/kernel/fork.c
index f9572f416126..fc5e0889922b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1917,7 +1917,7 @@ static __latent_entropy struct task_struct *copy_process(
p->vtime.state = VTIME_INACTIVE;
 #endif
 
-#if defined(SPLIT_RSS_COUNTING)
+#ifdef CONFIG_SPLIT_RSS_COUNTING
memset(&p->rss_stat, 0, sizeof(p->rss_stat));
 #endif
 
diff --git a/mm/Kconfig b/mm/Kconfig
index a5dae9a7eb51..372ef9449924 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -736,4 +736,15 @@ config ARCH_HAS_PTE_SPECIAL
 config ARCH_HAS_HUGEPD
bool
 
+config SPLIT_RSS_COUNTING
+   bool "Per-thread mm counter caching"
+   depends on MMU
+   default y if NR_CPUS >= SPLIT_PTLOCK_CPUS
+   help
+ Cache mm counter updates in thread structures and
+ flush them to visible per-process statistics in batches.
+ Say Y here to slightly reduce cache contention in processes
+ with many threads at the expense of decreasing the accuracy
+ of memory statistics in /proc.
+ 
 endmenu
diff --git a/mm/memory.c b/mm/memory.c
index b1ca51a079f2..bf557ed5ba23 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -141,7 +141,7 @@ static int __init init_zero_pfn(void)
 core_initcall(init_zero_pfn);
 
 
-#if defined(SPLIT_RSS_COUNTING)
+#ifdef CONFIG_SPLIT_RSS_COUNTING
 
 void sync_mm_rss(struct mm_struct *mm)
 {
@@ -177,7 +177,7 @@ static void check_sync_rss_stat(struct task_struct *task)
if (unlikely(task->rss_stat.events++ > TASK_RSS_EVENTS_THRESH))
sync_mm_rss(task->mm);
 }
-#else /* SPLIT_RSS_COUNTING */
+#else /* CONFIG_SPLIT_RSS_COUNTING */
 
 #define inc_mm_counter_fast(mm, member) inc_mm_counter(mm, member)
 #define dec_mm_counter_fast(mm, member) dec_mm_counter(mm, member)
@@ -186,7 +186,7 @@ static void check_sync_rss_stat(struct task_struct *task)
 {
 }
 
-#endif /* SPLIT_RSS_COUNTING */
+#endif /* CONFIG_SPLIT_RSS_COUNTING */
 
 /*
  * Note: this doesn't free the actual pages themselves. That
-- 
2.23.0.581.g78d2f28ef7-goog



Re: For review: pidfd_send_signal(2) manual page

2019-09-24 Thread Daniel Colascione
On Tue, Sep 24, 2019 at 2:00 PM Michael Kerrisk (man-pages)
 wrote:
>
> Hello Christian,
>
> >>> If you're the parent of the process you can do this without CLONE_PIDFD:
> >>> pid = fork();
> >>> pidfd = pidfd_open();
> >>> ret = pidfd_send_signal(pidfd, 0, NULL, 0);
> >>> if (ret < 0 && errno == ESRCH)
> >>> /* pidfd refers to another, recycled process */
> >>
> >> Although there is still the race between the fork() and the
> >> pidfd_open(), right?
> >
> > Actually no and my code is even too complex.
> > If you are the parent, and this is really a sequence that obeys the
> > ordering pidfd_open() before waiting:
> >
> > pid = fork();
> > if (pid == 0)
> >   exit(EXIT_SUCCESS);
> > pidfd = pidfd_open(pid, 0);
> > waitid(pid, ...);
> >
> > Then you are guaranteed that pidfd will refer to pid. No recycling can
> > happen since the process has not been waited upon yet (That is,
>
> D'oh! Yes, of course.

You still have a race if you're the parent and you have SIGCHLD set to
SIG_IGN though.

> > excluding special cases such as where you have a mainloop where a
> > callback reacts to a SIGCHLD event and waits on the child behind your
> > back and your next callback in the mainloop calls pidfd_open() while the
> > pid has been recycled etc.).

That's a pretty common case though, especially if you're a library.

> > A race could only appear in sequences where waiting happens before
> > pidfd_open():
> >
> > pid = fork();
> > if (pid == 0)
> >   exit(EXIT_SUCCESS);
> > waitid(pid, ...);
> > pidfd = pidfd_open(pid, 0);
> >
> > which honestly simply doesn't make any sense. So if you're the parent
> > and you combine fork() + pidfd_open() correctly things should be fine
> > without even having to verify via pidfd_send_signal() (I missed that in
> > my first mail.).
>
> Thanks for the additional detail.
>
> I added the following to the pidfd_open() page, to
> prevent people making the same thinko as me:
>
>The following code sequence can be used to obtain a file  descrip‐
>tor for the child of fork(2):
>
>pid = fork();
>if (pid > 0) { /* If parent */
>pidfd = pidfd_open(pid, 0);
>...
>}
>
>Even  if  the  child process has already terminated by the time of
>the pidfd_open() call, the returned file descriptor is  guaranteed
>to refer to the child because the parent has not yet waited on the
>child (and therefore, the child's ID has not been recycled).

I'd prefer that sample code be robust in all cases.
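
For the record, the only acquisition path that's robust in all cases,
including SIGCHLD set to SIG_IGN (where the child can be auto-reaped
and its PID recycled before pidfd_open() ever runs), is to have the
kernel hand back the pidfd atomically at creation time. A sketch using
the raw syscall (CLONE_PIDFD is 5.2+; clone's argument order is
arch-specific, x86-64 shown):

    #define _GNU_SOURCE
    #include <sched.h>       /* CLONE_PIDFD; <linux/sched.h> on old libcs */
    #include <signal.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int pidfd = -1;
    /* The kernel stores the new pidfd through the parent_tid slot. */
    pid_t pid = syscall(SYS_clone, CLONE_PIDFD | SIGCHLD, NULL,
                        &pidfd, NULL, 0);
    if (pid > 0) {
            /* pidfd refers to this child; no recycling window at all. */
    }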


Re: For review: pidfd_send_signal(2) manual page

2019-09-23 Thread Daniel Colascione
On Mon, Sep 23, 2019 at 2:12 AM Michael Kerrisk (man-pages)
 wrote:
>The  pidfd_send_signal()  system call allows the avoidance of race
>conditions that occur when using traditional interfaces  (such  as
>kill(2)) to signal a process.  The problem is that the traditional
>interfaces specify the target process via a process ID (PID), with
>the  result  that the sender may accidentally send a signal to the
>wrong process if the originally intended target process has termi‐
>nated  and its PID has been recycled for another process.  By con‐
>trast, a PID file descriptor is a stable reference to  a  specific
>process;  if  that  process  terminates,  then the file descriptor
>ceases to be  valid

The file *descriptor* remains valid even after the process to which it
refers exits. You can close(2) the file descriptor without getting
EBADF. I'd say, instead, that "a PID file descriptor is a stable
reference to a specific process; process-related operations on a PID
file descriptor fail after that process exits".
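
A minimal sketch of the distinction, assuming your libc's
<sys/syscall.h> defines SYS_pidfd_send_signal and that the target has
already exited and been reaped:

    #include <errno.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int r = syscall(SYS_pidfd_send_signal, pidfd, 0, NULL, 0);
    /* r == -1 && errno == ESRCH: the *process* is gone... */
    close(pidfd);   /* ...but the *descriptor* is fine: no EBADF here */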


Re: For review: pidfd_open(2) manual page

2019-09-23 Thread Daniel Colascione
On Mon, Sep 23, 2019 at 3:53 AM Florian Weimer  wrote:
>
> * Michael Kerrisk:
>
> > SYNOPSIS
> >int pidfd_open(pid_t pid, unsigned int flags);
>
> Should this mention  for pid_t?
>
> > ERRORS
> >EINVAL flags is not 0.
> >
> >EINVAL pid is not valid.
> >
> >ESRCH  The process specified by pid does not exist.
>
> Presumably, EMFILE and ENFILE are also possible errors, and so is
> ENOMEM.
>
> >A  PID  file descriptor can be monitored using poll(2), select(2),
> >and epoll(7).  When the process that it refers to terminates,  the
> >file descriptor indicates as readable.

The phrase "becomes readable" is simpler than "indicates as readable"
and conveys the same meaning. I agree with Florian's comment on this
point below.

> > Note, however, that in the
> >current implementation, nothing can be read from the file descrip‐
> >tor.
>
> “is indicated as readable” or “becomes readable”?  Will reading block?
>
> >The  pidfd_open()  system call is the preferred way of obtaining a
> >PID file descriptor.  The alternative is to obtain a file descrip‐
> >tor by opening a /proc/[pid] directory.  However, the latter tech‐
> >nique is possible only if the proc(5) file system is mounted; fur‐
> >thermore,  the  file  descriptor  obtained in this way is not pol‐
> >lable.

Referring to procfs directory FDs as pidfds will probably confuse
people. I'd just omit this paragraph.

> One question is whether the glibc wrapper should fall back back to the
> /proc subdirectory if it is not available.  Probably not.

I'd prefer that glibc not provide this kind of fallback.
posix_fallocate-style emulation is, IMHO, too surprising.


Re: [RFC] Add critical process prctl

2019-09-10 Thread Daniel Colascione
On Tue, Sep 10, 2019 at 9:57 AM Andy Lutomirski  wrote:
>
> On Wed, Sep 4, 2019 at 5:53 PM Daniel Colascione  wrote:
> >
> > A task with CAP_SYS_ADMIN can mark itself PR_SET_TASK_CRITICAL,
> > meaning that if the task ever exits, the kernel panics. This facility
> > is intended for use by low-level core system processes that cannot
> > gracefully restart without a reboot. This prctl allows these processes
> > to ensure that the system restarts when they die regardless of whether
> > the rest of userspace is operational.
>
> The kind of panic produced by init crashing is awful -- logs don't get
> written, etc.

True today --- but that's a separate problem, and one that can be
solved in a few ways, e.g., pre-registering log buffers to be
incorporated into any kexec kernel memory dumps. If a system aiming
for reliability can't diagnose panics, that's a problem with or
without my patch.

> I'm wondering if you would be better off with a new
> watchdog-like device that, when closed, kills the system in a
> configurable way (e.g. after a certain amount of time, while still
> logging something and having a decent chance of getting the logs
> written out.)  This could plausibly even be an extension to the
> existing /dev/watchdog API.

There are lots of approaches that work today: a few people have
suggested just having init watch processes, perhaps with pidfds. What
I worry about is increasing the length (both in terms of time and
complexity) of the critical path between something going wrong in a
critical process and the system getting back into a known-good state.
A panic at the earliest moment we know that a marked-critical process
has become doomed seems like the most reliable approach, especially
since alternatives can get backed up behind things like file
descriptor closing and various forms of scheduling delay.


Re: [PATCH v2] mm: emit tracepoint when RSS changes by threshold

2019-09-05 Thread Daniel Colascione
On Thu, Sep 5, 2019 at 5:59 PM Joel Fernandes  wrote:
> On Thu, Sep 05, 2019 at 10:50:27AM -0700, Daniel Colascione wrote:
> > On Thu, Sep 5, 2019 at 10:35 AM Steven Rostedt  wrote:
> > > On Thu, 5 Sep 2019 09:03:01 -0700
> > > Suren Baghdasaryan  wrote:
> > >
> > > > On Thu, Sep 5, 2019 at 7:43 AM Michal Hocko  wrote:
> > > > >
> > > > > [Add Steven]
> > > > >
> > > > > On Wed 04-09-19 12:28:08, Joel Fernandes wrote:
> > > > > > On Wed, Sep 4, 2019 at 11:38 AM Michal Hocko  
> > > > > > wrote:
> > > > > > >
> > > > > > > On Wed 04-09-19 11:32:58, Joel Fernandes wrote:
> > > > > [...]
> > > > > > > > but also for reducing
> > > > > > > > tracing noise. Flooding the traces makes it less useful for 
> > > > > > > > long traces and
> > > > > > > > post-processing of traces. IOW, the overhead reduction is a 
> > > > > > > > bonus.
> > > > > > >
> > > > > > > This is not really anything special for this tracepoint though.
> > > > > > > Basically any tracepoint in a hot path is in the same situation 
> > > > > > > and I do
> > > > > > > not see a point why each of them should really invent its own way 
> > > > > > > to
> > > > > > > throttle. Maybe there is some way to do that in the tracing 
> > > > > > > subsystem
> > > > > > > directly.
> > > > > >
> > > > > > I am not sure if there is a way to do this easily. Add to that, the 
> > > > > > fact that
> > > > > > you still have to call into trace events. Why call into it at all, 
> > > > > > if you can
> > > > > > filter in advance and have a sane filtering default?
> > > > > >
> > > > > > The bigger improvement with the threshold is the number of trace 
> > > > > > records are
> > > > > > almost halved by using a threshold. The number of records went from 
> > > > > > 4.6K to
> > > > > > 2.6K.
> > > > >
> > > > > Steven, would it be feasible to add a generic tracepoint throttling?
> > > >
> > > > I might misunderstand this but is the issue here actually throttling
> > > > of the sheer number of trace records or tracing large enough changes
> > > > to RSS that user might care about? Small changes happen all the time
> > > > but we are likely not interested in those. Surely we could postprocess
> > > > the traces to extract changes large enough to be interesting but why
> > > > capture uninteresting information in the first place? IOW the
> > > > throttling here should be based not on the time between traces but on
> > > > the amount of change of the traced signal. Maybe a generic facility
> > > > like that would be a good idea?
> > >
> > > You mean like add a trigger (or filter) that only traces if a field has
> > > changed since the last time the trace was hit? Hmm, I think we could
> > > possibly do that. Perhaps even now with histogram triggers?
> >
> > I was thinking along the same lines. The histogram subsystem seems
> > like a very good fit here. Histogram triggers already let users talk
> > about specific fields of trace events, aggregate them in configurable
> > ways, and (importantly, IMHO) create synthetic new trace events that
> > the kernel emits under configurable conditions.
>
> Hmm, I think this tracing feature will be a good idea. But in order not to
> gate this patch, can we agree on keeping a temporary threshold for this
> patch? Once such idea is implemented in trace subsystem, then we can remove
> the temporary filter.
>
> As Tim said, we don't want our traces flooded and this is a very useful
> tracepoint as proven in our internal usage at Android. The threshold filter
> is just few lines of code.

I'm not sure the threshold filtering code you've added does the right
thing: we don't keep state, so if a counter constantly flips between
one "side" of the TRACE_MM_COUNTER_THRESHOLD and the other, we'll emit
ftrace events at high frequency. More generally, this filtering
couples the rate of counter logging to the *value* of the counter ---
that is, we log ftrace events at different times depending on how much
memory we happen to have used --- and that's not ideal from a
predictability POV.
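
For contrast, a stateful filter would look something like the sketch
below, where the emit decision depends on drift since the last
*emitted* value rather than on which side of a fixed boundary the
counter lands. (A sketch only: the last_traced field and the
trace_rss_threshold() event are hypothetical.)

    #define RSS_TRACE_THRESH 512    /* pages; hypothetical knob */

    static void maybe_trace_rss(struct mm_struct *mm, int member,
                                long value)
    {
            long last = atomic_long_read(&mm->last_traced[member]);

            /* Emit only once the counter has moved far enough from
             * the last value we actually reported. */
            if (abs(value - last) >= RSS_TRACE_THRESH &&
                atomic_long_cmpxchg(&mm->last_traced[member],
                                    last, value) == last)
                    trace_rss_threshold(mm, member, value);
    }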

All things being equal, I'd prefer that we get things upstream as fast
as possible. But in this case, I'd rather wait for a general-purpose
filtering facility (whether that facility is based on histogram, eBPF,
or something else) rather than hardcode one particular fixed filtering
strategy (which might be suboptimal) for one particular kind of event.
Is there some special urgency here?

How about we instead add non-filtered tracepoints for the mm counters?
These tracepoints will still be free when turned off.

Having added the basic tracepoints, we can discuss separately how to
do the rate limiting. Maybe instead of providing direct support for
the algorithm that I described above, we can just use a BPF program as
a yes/no predicate for whether to log to ftrace. That'd get us to the
same place as this patch, but more flexibly, right?


Re: [PATCH v2] mm: emit tracepoint when RSS changes by threshold

2019-09-05 Thread Daniel Colascione
On Thu, Sep 5, 2019 at 3:12 PM Daniel Colascione  wrote:
> Basically, what I have in mind is this:

Actually --- I wonder whether there's already enough power in the
trigger mechanism to do this without any code changes to ftrace
histograms themselves. I'm trying to think of the minimum additional
kernel facility that we'd need to implement the scheme I described
above, and it might be that we don't need to do anything at all except
add the actual level tracepoints.


Re: [PATCH v2] mm: emit tracepoint when RSS changes by threshold

2019-09-05 Thread Daniel Colascione
On Thu, Sep 5, 2019 at 2:14 PM Tom Zanussi  wrote:
>
> On Thu, 2019-09-05 at 13:24 -0700, Daniel Colascione wrote:
> > On Thu, Sep 5, 2019 at 12:56 PM Tom Zanussi 
> > wrote:
> > > On Thu, 2019-09-05 at 13:51 -0400, Joel Fernandes wrote:
> > > > On Thu, Sep 05, 2019 at 01:47:05PM -0400, Joel Fernandes wrote:
> > > > > On Thu, Sep 05, 2019 at 01:35:07PM -0400, Steven Rostedt wrote:
> > > > > >
> > > > > >
> > > > > > [ Added Tom ]
> > > > > >
> > > > > > On Thu, 5 Sep 2019 09:03:01 -0700
> > > > > > Suren Baghdasaryan  wrote:
> > > > > >
> > > > > > > On Thu, Sep 5, 2019 at 7:43 AM Michal Hocko wrote:
> > > > > > > >
> > > > > > > > [Add Steven]
> > > > > > > >
> > > > > > > > On Wed 04-09-19 12:28:08, Joel Fernandes wrote:
> > > > > > > > > On Wed, Sep 4, 2019 at 11:38 AM Michal Hocko wrote:
> > > > > > > > > >
> > > > > > > > > > On Wed 04-09-19 11:32:58, Joel Fernandes wrote:
> > > > > > > >
> > > > > > > > [...]
> > > > > > > > > > > but also for reducing
> > > > > > > > > > > tracing noise. Flooding the traces makes it less
> > > > > > > > > > > useful
> > > > > > > > > > > for long traces and
> > > > > > > > > > > post-processing of traces. IOW, the overhead
> > > > > > > > > > > reduction
> > > > > > > > > > > is a bonus.
> > > > > > > > > >
> > > > > > > > > > This is not really anything special for this
> > > > > > > > > > tracepoint
> > > > > > > > > > though.
> > > > > > > > > > Basically any tracepoint in a hot path is in the same
> > > > > > > > > > situation and I do
> > > > > > > > > > not see a point why each of them should really invent
> > > > > > > > > > its
> > > > > > > > > > own way to
> > > > > > > > > > throttle. Maybe there is some way to do that in the
> > > > > > > > > > tracing subsystem
> > > > > > > > > > directly.
> > > > > > > > >
> > > > > > > > > I am not sure if there is a way to do this easily. Add
> > > > > > > > > to
> > > > > > > > > that, the fact that
> > > > > > > > > you still have to call into trace events. Why call into
> > > > > > > > > it
> > > > > > > > > at all, if you can
> > > > > > > > > filter in advance and have a sane filtering default?
> > > > > > > > >
> > > > > > > > > The bigger improvement with the threshold is the number
> > > > > > > > > of
> > > > > > > > > trace records are
> > > > > > > > > almost halved by using a threshold. The number of
> > > > > > > > > records
> > > > > > > > > went from 4.6K to
> > > > > > > > > 2.6K.
> > > > > > > >
> > > > > > > > Steven, would it be feasible to add a generic tracepoint
> > > > > > > > throttling?
> > > > > > >
> > > > > > > I might misunderstand this but is the issue here actually
> > > > > > > throttling
> > > > > > > of the sheer number of trace records or tracing large
> > > > > > > enough
> > > > > > > changes
> > > > > > > to RSS that user might care about? Small changes happen all
> > > > > > > the
> > > > > > > time
> > > > > > > but we are likely not interested in those. Surely we could
> > > > > > > postprocess
> > > > > > > the traces to extract changes large enough t

Re: [PATCH v2] mm: emit tracepoint when RSS changes by threshold

2019-09-05 Thread Daniel Colascione
On Thu, Sep 5, 2019 at 12:56 PM Tom Zanussi  wrote:
> On Thu, 2019-09-05 at 13:51 -0400, Joel Fernandes wrote:
> > On Thu, Sep 05, 2019 at 01:47:05PM -0400, Joel Fernandes wrote:
> > > On Thu, Sep 05, 2019 at 01:35:07PM -0400, Steven Rostedt wrote:
> > > >
> > > >
> > > > [ Added Tom ]
> > > >
> > > > On Thu, 5 Sep 2019 09:03:01 -0700
> > > > Suren Baghdasaryan  wrote:
> > > >
> > > > > On Thu, Sep 5, 2019 at 7:43 AM Michal Hocko 
> > > > > wrote:
> > > > > >
> > > > > > [Add Steven]
> > > > > >
> > > > > > On Wed 04-09-19 12:28:08, Joel Fernandes wrote:
> > > > > > > On Wed, Sep 4, 2019 at 11:38 AM Michal Hocko wrote:
> > > > > > > >
> > > > > > > > On Wed 04-09-19 11:32:58, Joel Fernandes wrote:
> > > > > >
> > > > > > [...]
> > > > > > > > > but also for reducing
> > > > > > > > > tracing noise. Flooding the traces makes it less useful
> > > > > > > > > for long traces and
> > > > > > > > > post-processing of traces. IOW, the overhead reduction
> > > > > > > > > is a bonus.
> > > > > > > >
> > > > > > > > This is not really anything special for this tracepoint
> > > > > > > > though.
> > > > > > > > Basically any tracepoint in a hot path is in the same
> > > > > > > > situation and I do
> > > > > > > > not see a point why each of them should really invent its
> > > > > > > > own way to
> > > > > > > > throttle. Maybe there is some way to do that in the
> > > > > > > > tracing subsystem
> > > > > > > > directly.
> > > > > > >
> > > > > > > I am not sure if there is a way to do this easily. Add to
> > > > > > > that, the fact that
> > > > > > > you still have to call into trace events. Why call into it
> > > > > > > at all, if you can
> > > > > > > filter in advance and have a sane filtering default?
> > > > > > >
> > > > > > > The bigger improvement with the threshold is the number of
> > > > > > > trace records are
> > > > > > > almost halved by using a threshold. The number of records
> > > > > > > went from 4.6K to
> > > > > > > 2.6K.
> > > > > >
> > > > > > Steven, would it be feasible to add a generic tracepoint
> > > > > > throttling?
> > > > >
> > > > > I might misunderstand this but is the issue here actually
> > > > > throttling
> > > > > of the sheer number of trace records or tracing large enough
> > > > > changes
> > > > > to RSS that user might care about? Small changes happen all the
> > > > > time
> > > > > but we are likely not interested in those. Surely we could
> > > > > postprocess
> > > > > the traces to extract changes large enough to be interesting
> > > > > but why
> > > > > capture uninteresting information in the first place? IOW the
> > > > > throttling here should be based not on the time between traces
> > > > > but on
> > > > > the amount of change of the traced signal. Maybe a generic
> > > > > facility
> > > > > like that would be a good idea?
> > > >
> > > > You mean like add a trigger (or filter) that only traces if a
> > > > field has
> > > > changed since the last time the trace was hit? Hmm, I think we
> > > > could
> > > > possibly do that. Perhaps even now with histogram triggers?
> > >
> > >
> > > Hey Steve,
> > >
> > > Something like an analog to digital conversion function where you
> > > lose the
> > > granularity of the signal depending on how much trace data:
> > > https://www.globalspec.com/ImageRepository/LearnMore/20142/9ee38d1a
> > > 85d37fa23f86a14d3a9776ff67b0ec0f3b.gif
> >
> > s/how much trace data/what the resolution is/
> >
> > > so like, if you had a counter incrementing with values after the
> > > increments
> > > as:  1,3,4,8,12,14,30 and say 5 is the threshold at which to emit a
> > > trace,
> > > then you would get 1,8,12,30.
> > >
> > > So I guess what is needed is a way to reduce the quantity of trace
> > > data this
> > > way. For this usecase, the user mostly cares about spikes in the
> > > counter
> > > changing that accurate values of the different points.
> >
> > s/that accurate/than accurate/
> >
> > I think Tim, Suren, Dan and Michal are all saying the same thing as
> > well.
> >
>
> There's not a way to do this using existing triggers (histogram
> triggers have an onchange() that fires on any change, but that doesn't
> help here), and I wouldn't expect there to be - these sound like very
> specific cases that would never have support in the simple trigger
> 'language'.

I don't see the filtering under discussion as some "very specific"
esoteric need. You need this general kind of mechanism any time you
want to monitor at low frequency a thing that changes at high
frequency. The general pattern isn't specific to RSS or even memory in
general. One might imagine, say, wanting to trace large changes in TCP
window sizes. Any time something in the kernel has a "level" and that
level changes at high frequency and we want to learn about big swings
in that level, the mechanism we're talking about becomes useful. I
don't think it should be out of bounds for the histogram mechanism,
which is *almost* there right now. We already 

Re: [PATCH v2] mm: emit tracepoint when RSS changes by threshold

2019-09-05 Thread Daniel Colascione
On Thu, Sep 5, 2019 at 10:35 AM Steven Rostedt  wrote:
> On Thu, 5 Sep 2019 09:03:01 -0700
> Suren Baghdasaryan  wrote:
>
> > On Thu, Sep 5, 2019 at 7:43 AM Michal Hocko  wrote:
> > >
> > > [Add Steven]
> > >
> > > On Wed 04-09-19 12:28:08, Joel Fernandes wrote:
> > > > On Wed, Sep 4, 2019 at 11:38 AM Michal Hocko  wrote:
> > > > >
> > > > > On Wed 04-09-19 11:32:58, Joel Fernandes wrote:
> > > [...]
> > > > > > but also for reducing
> > > > > > tracing noise. Flooding the traces makes it less useful for long 
> > > > > > traces and
> > > > > > post-processing of traces. IOW, the overhead reduction is a bonus.
> > > > >
> > > > > This is not really anything special for this tracepoint though.
> > > > > Basically any tracepoint in a hot path is in the same situation and I 
> > > > > do
> > > > > not see a point why each of them should really invent its own way to
> > > > > throttle. Maybe there is some way to do that in the tracing subsystem
> > > > > directly.
> > > >
> > > > I am not sure if there is a way to do this easily. Add to that, the 
> > > > fact that
> > > > you still have to call into trace events. Why call into it at all, if 
> > > > you can
> > > > filter in advance and have a sane filtering default?
> > > >
> > > > The bigger improvement with the threshold is the number of trace 
> > > > records are
> > > > almost halved by using a threshold. The number of records went from 
> > > > 4.6K to
> > > > 2.6K.
> > >
> > > Steven, would it be feasible to add a generic tracepoint throttling?
> >
> > I might misunderstand this but is the issue here actually throttling
> > of the sheer number of trace records or tracing large enough changes
> > to RSS that user might care about? Small changes happen all the time
> > but we are likely not interested in those. Surely we could postprocess
> > the traces to extract changes large enough to be interesting but why
> > capture uninteresting information in the first place? IOW the
> > throttling here should be based not on the time between traces but on
> > the amount of change of the traced signal. Maybe a generic facility
> > like that would be a good idea?
>
> You mean like add a trigger (or filter) that only traces if a field has
> changed since the last time the trace was hit? Hmm, I think we could
> possibly do that. Perhaps even now with histogram triggers?

I was thinking along the same lines. The histogram subsystem seems
like a very good fit here. Histogram triggers already let users talk
about specific fields of trace events, aggregate them in configurable
ways, and (importantly, IMHO) create synthetic new trace events that
the kernel emits under configurable conditions.
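
For instance, assuming Joel's tracepoint lands as something like
kmem/rss_stat with a size field, a sketch of what the existing trigger
language can already express with no new kernel code:

    # bucket RSS updates by pid and power-of-two size
    echo 'hist:keys=common_pid,size.log2' > \
        /sys/kernel/tracing/events/kmem/rss_stat/trigger
    cat /sys/kernel/tracing/events/kmem/rss_stat/hist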


[RFC] Add critical process prctl

2019-09-04 Thread Daniel Colascione
A task with CAP_SYS_ADMIN can mark itself PR_SET_TASK_CRITICAL,
meaning that if the task ever exits, the kernel panics. This facility
is intended for use by low-level core system processes that cannot
gracefully restart without a reboot. This prctl allows these processes
to ensure that the system restarts when they die regardless of whether
the rest of userspace is operational.

Signed-off-by: Daniel Colascione 
---
 include/linux/sched.h  |  5 +
 include/uapi/linux/prctl.h |  5 +
 kernel/exit.c  |  2 ++
 kernel/sys.c   | 19 +++
 4 files changed, 31 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9f51932bd543..29420b9ebb63 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1526,6 +1526,7 @@ static inline bool is_percpu_thread(void)
 #define PFA_SPEC_IB_DISABLE		5	/* Indirect branch speculation restricted */
 #define PFA_SPEC_IB_FORCE_DISABLE	6	/* Indirect branch speculation permanently restricted */
 #define PFA_SPEC_SSB_NOEXEC		7	/* Speculative Store Bypass clear on execve() */
+#define PFA_CRITICAL			8	/* Panic system if process exits */
 
 #define TASK_PFA_TEST(name, func)  \
static inline bool task_##func(struct task_struct *p)   \
@@ -1568,6 +1569,10 @@ TASK_PFA_CLEAR(SPEC_IB_DISABLE, spec_ib_disable)
 TASK_PFA_TEST(SPEC_IB_FORCE_DISABLE, spec_ib_force_disable)
 TASK_PFA_SET(SPEC_IB_FORCE_DISABLE, spec_ib_force_disable)
 
+TASK_PFA_TEST(CRITICAL, critical)
+TASK_PFA_SET(CRITICAL, critical)
+TASK_PFA_CLEAR(CRITICAL, critical)
+
 static inline void
 current_restore_flags(unsigned long orig_flags, unsigned long flags)
 {
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 094bb03b9cc2..4964723bbd47 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -229,4 +229,9 @@ struct prctl_mm_map {
 # define PR_PAC_APDBKEY			(1UL << 3)
 # define PR_PAC_APGAKEY			(1UL << 4)
 
+/* Per-task criticality control */
+#define PR_SET_TASK_CRITICAL 55
+#define PR_CRITICAL_NOT_CRITICAL 0
+#define PR_CRITICAL_CRITICAL 1
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/exit.c b/kernel/exit.c
index 5b4a5dcce8f8..9b3d3411d935 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -788,6 +788,8 @@ void __noreturn do_exit(long code)
panic("Aiee, killing interrupt handler!");
if (unlikely(!tsk->pid))
panic("Attempted to kill the idle task!");
+   if (unlikely(task_critical(tsk)))
+   panic("Critical task died!");
 
/*
 * If do_exit is called because this processes oopsed, it's possible
diff --git a/kernel/sys.c b/kernel/sys.c
index 2969304c29fe..097e05ebaf94 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2269,6 +2269,20 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which,
return -EINVAL;
 }
 
+int task_do_set_critical(struct task_struct *t, unsigned long opt)
+{
+   if (opt != PR_CRITICAL_NOT_CRITICAL &&
+   opt != PR_CRITICAL_CRITICAL)
+   return -EINVAL;
+   if (!capable(CAP_SYS_ADMIN))
+   return -EPERM;
+   if (opt == PR_CRITICAL_NOT_CRITICAL)
+   task_clear_critical(t);
+   else
+   task_set_critical(t);
+   return 0;
+}
+
 SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
unsigned long, arg4, unsigned long, arg5)
 {
@@ -2492,6 +2506,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
return -EINVAL;
error = PAC_RESET_KEYS(me, arg2);
break;
+   case PR_SET_TASK_CRITICAL:
+   if (arg3 || arg4 || arg5)
+   return -EINVAL;
+   error = task_do_set_critical(me, arg2);
+   break;
default:
error = -EINVAL;
break;
-- 
2.23.0.187.g17f5b7556c-goog
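
For illustration, a sketch of how a critical daemon would arm this
from userspace, using the constants from the patch above (requires
CAP_SYS_ADMIN; the unused prctl arguments must be zero):

    #include <stdio.h>
    #include <sys/prctl.h>
    #include <unistd.h>

    #ifndef PR_SET_TASK_CRITICAL
    #define PR_SET_TASK_CRITICAL 55
    #define PR_CRITICAL_CRITICAL 1
    #endif

    int main(void)
    {
            if (prctl(PR_SET_TASK_CRITICAL, PR_CRITICAL_CRITICAL,
                      0, 0, 0)) {
                    perror("PR_SET_TASK_CRITICAL");
                    return 1;
            }
            /* From here on, if this task exits, the kernel panics. */
            for (;;)
                    pause();
    }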



Re: [PATCH v2] mm: emit tracepoint when RSS changes by threshold

2019-09-04 Thread Daniel Colascione
On Wed, Sep 4, 2019 at 8:38 AM Michal Hocko  wrote:
> > but also for reducing
> > tracing noise. Flooding the traces makes it less useful for long traces and
> > post-processing of traces. IOW, the overhead reduction is a bonus.
>
> This is not really anything special for this tracepoint though.
> Basically any tracepoint in a hot path is in the same situation and I do
> not see a point why each of them should really invent its own way to
> throttle. Maybe there is some way to do that in the tracing subsystem
> directly.

I agree. I'd rather not special-case RSS in this way, especially not
with a hardcoded aggregation and thresholding configuration. It should
be possible to handle high-frequency trace data point aggregation in a
general way. Why shouldn't we be able to get a time series for, say,
dirty pages? Or various slab counts? IMHO, any counter on the system
ought to be observable in a uniform way.


Re: [PATCH v2] mm: emit tracepoint when RSS changes by threshold

2019-09-04 Thread Daniel Colascione
On Wed, Sep 4, 2019 at 7:59 AM Joel Fernandes  wrote:
>
> On Tue, Sep 03, 2019 at 10:42:53PM -0700, Daniel Colascione wrote:
> > On Tue, Sep 3, 2019 at 10:15 PM Joel Fernandes  
> > wrote:
> > >
> > > On Tue, Sep 03, 2019 at 09:51:20PM -0700, Daniel Colascione wrote:
> > > > On Tue, Sep 3, 2019 at 9:45 PM Suren Baghdasaryan  
> > > > wrote:
> > > > >
> > > > > On Tue, Sep 3, 2019 at 1:09 PM Joel Fernandes (Google)
> > > > >  wrote:
> > > > > >
> > > > > > Useful to track how RSS is changing per TGID to detect spikes in 
> > > > > > RSS and
> > > > > > memory hogs. Several Android teams have been using this patch in 
> > > > > > various
> > > > > > kernel trees for half a year now. Many reported to me it is really
> > > > > > useful so I'm posting it upstream.
> > > >
> > > > It's also worth being able to turn off the per-task memory counter
> > > > caching, otherwise you'll have two levels of batching before the
> > > > counter gets updated, IIUC.
> > >
> > > I prefer to keep split RSS accounting turned on if it is available.
> >
> > Why? AFAIK, nobody's produced numbers showing that split accounting
> > has a real benefit.
>
> I am not too sure. Have you checked the original patches that added this
> stuff though? It seems to me the main win would be on big systems that have
> to pay for atomic updates.

I looked into this issue the last time I mentioned split mm
accounting. See [1]. It's my sense that the original change was
inadequately justified; Michal Hocko seems to agree. I've tried
disabling split rss accounting locally on a variety of systems ---
Android, laptop, desktop --- and failed to notice any difference. It's
possible that some difference appears at a scale beyond that to which
I have access, but if the benefit of split rss accounting is limited
to these cases, split rss accounting shouldn't be on by default, since
it comes at a cost in consistency.

[1] https://lore.kernel.org/linux-mm/20180227100234.gf15...@dhcp22.suse.cz/

> > > I think
> > > discussing split RSS accounting is a bit out of scope of this patch as 
> > > well.
> >
> > It's in-scope, because with split RSS accounting, allocated memory can
> > stay accumulated in task structs for an indefinite time without being
> > flushed to the mm. As a result, if you take the stream of virtual
> > memory management system calls that a program makes on one hand, and VM
> > counter values on the other, the two don't add up. For various kinds
> > of robustness (trace self-checking, say) it's important that various
> > sources of data add up.
> >
> > If we're adding a configuration knob that controls how often VM
> > counters get reflected in system trace points, we should also have a
> > knob to control delayed VM counter operations. The whole point is for
> > users to be able to specify how precisely they want VM counter changes
> > reported to analysis tools.
>
> We're not adding more configuration knobs.

This position doesn't seem to be the thread consensus yet.

> > > Any improvements on that front can be a follow-up.
> > >
> > > Curious, has split RSS accounting shown you any issue with this patch?
> >
> > Split accounting has been a source of confusion for a while now: it
> > causes that numbers-don't-add-up problem even when sampling from
> > procfs instead of reading memory tracepoint data.
>
> I think you can just disable split RSS accounting if it does not work well
> for your configuration.

There's no build-time configuration for split RSS accounting. It's not
reasonable to expect people to carry patches just to get their memory
usage numbers to add up.

> Also AFAIU, every TASK_RSS_EVENTS_THRESH the page fault code does sync the
> counters. So it does not indefinitely lurk.

If a thread incurs TASK_RSS_EVENTS_THRESH - 1 page faults and then
sleeps for a week, all memory counters observable from userspace will
be wrong for a week. Multiply this potential error by the number of
threads on a typical system and you have to conclude that split RSS
accounting produces a lot of potential uncertainty. What are we
getting in exchange for this uncertainty?
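
(Back-of-the-envelope, assuming the in-tree TASK_RSS_EVENTS_THRESH of
64 and 4 KiB pages: each thread can sit on up to 63 unsynced events,
so 100 idle threads can leave the visible counters off by roughly
100 * 63 pages, about 25 MB.)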

> The tracepoint's main intended
> use is to detect spikes which provides ample opportunity to sync the cache.

The intended use is measuring memory levels of various processes over
time, not just detecting "spikes". In order to make sense of the
resulting data series, we need to be able to place error bars on it.
The presence of split RSS accounting makes those error bars much
harder to establish.

Re: [PATCH v2] mm: emit tracepoint when RSS changes by threshold

2019-09-03 Thread Daniel Colascione
On Tue, Sep 3, 2019 at 10:15 PM Joel Fernandes  wrote:
>
> On Tue, Sep 03, 2019 at 09:51:20PM -0700, Daniel Colascione wrote:
> > On Tue, Sep 3, 2019 at 9:45 PM Suren Baghdasaryan  wrote:
> > >
> > > On Tue, Sep 3, 2019 at 1:09 PM Joel Fernandes (Google)
> > >  wrote:
> > > >
> > > > Useful to track how RSS is changing per TGID to detect spikes in RSS and
> > > > memory hogs. Several Android teams have been using this patch in various
> > > > kernel trees for half a year now. Many reported to me it is really
> > > > useful so I'm posting it upstream.
> >
> > It's also worth being able to turn off the per-task memory counter
> > caching, otherwise you'll have two levels of batching before the
> > counter gets updated, IIUC.
>
> I prefer to keep split RSS accounting turned on if it is available.

Why? AFAIK, nobody's produced numbers showing that split accounting
has a real benefit.

> I think
> discussing split RSS accounting is a bit out of scope of this patch as well.

It's in-scope, because with split RSS accounting, allocated memory can
stay accumulated in task structs for an indefinite time without being
flushed to the mm. As a result, if you take the stream of virtual
memory management system calls that a program makes on one hand, and VM
counter values on the other, the two don't add up. For various kinds
of robustness (trace self-checking, say) it's important that various
sources of data add up.

If we're adding a configuration knob that controls how often VM
counters get reflected in system trace points, we should also have a
knob to control delayed VM counter operations. The whole point is for
users to be able to specify how precisely they want VM counter changes
reported to analysis tools.

> Any improvements on that front can be a follow-up.
>
> Curious, has split RSS accounting shown you any issue with this patch?

Split accounting has been a source of confusion for a while now: it
causes that numbers-don't-add-up problem even when sampling from
procfs instead of reading memory tracepoint data.


Re: [PATCH v2] mm: emit tracepoint when RSS changes by threshold

2019-09-03 Thread Daniel Colascione
On Tue, Sep 3, 2019 at 9:45 PM Suren Baghdasaryan  wrote:
>
> On Tue, Sep 3, 2019 at 1:09 PM Joel Fernandes (Google)
>  wrote:
> >
> > Useful to track how RSS is changing per TGID to detect spikes in RSS and
> > memory hogs. Several Android teams have been using this patch in various
> > kernel trees for half a year now. Many reported to me it is really
> > useful so I'm posting it upstream.

It's also worth being able to turn off the per-task memory counter
caching, otherwise you'll have two levels of batching before the
counter gets updated, IIUC.


Re: [RFCv2 1/6] mm: introduce MADV_COLD

2019-06-03 Thread Daniel Colascione
On Mon, Jun 3, 2019 at 12:16 AM Michal Hocko  wrote:
> On Fri 31-05-19 23:34:07, Minchan Kim wrote:
> > On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote:
> > > On Fri 31-05-19 22:39:04, Minchan Kim wrote:
> > > > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote:
> > > > > On Fri 31-05-19 15:43:08, Minchan Kim wrote:
> > > > > > When a process expects no accesses to a certain memory range, it 
> > > > > > could
> > > > > > give a hint to kernel that the pages can be reclaimed when memory 
> > > > > > pressure
> > > > > > happens but data should be preserved for future use.  This could 
> > > > > > reduce
> > > > > > workingset eviction so it ends up increasing performance.
> > > > > >
> > > > > > This patch introduces the new MADV_COLD hint to madvise(2) syscall.
> > > > > > MADV_COLD can be used by a process to mark a memory range as not 
> > > > > > expected
> > > > > > to be used in the near future. The hint can help kernel in deciding 
> > > > > > which
> > > > > > pages to evict early during memory pressure.
> > > > > >
> > > > > > Internally, it works via deactivating pages from active list to 
> > > > > > inactive's
> > > > > > head if the page is private because inactive list could be full of
> > > > > > used-once pages which are first candidate for the reclaiming and 
> > > > > > that's a
> > > > > > reason why MADV_FREE move pages to head of inactive LRU list. 
> > > > > > Therefore,
> > > > > > if the memory pressure happens, they will be reclaimed earlier than 
> > > > > > other
> > > > > > active pages unless there is no access until the time.
> > > > >
> > > > > [I am intentionally not looking at the implementation because below
> > > > > points should be clear from the changelog - sorry about nagging ;)]
> > > > >
> > > > > What kind of pages can be deactivated? Anonymous/File backed.
> > > > > Private/shared? If shared, are there any restrictions?
> > > >
> > > > Both file and private pages could be deactivated from each active LRU
> > > > to each inactive LRU if the page has one map_count. In other words,
> > > >
> > > > if (page_mapcount(page) <= 1)
> > > > deactivate_page(page);
> > >
> > > Why do we restrict to pages that are single mapped?
> >
> > Because the page table of one of the processes sharing the page would
> > have the access bit set, so finally we couldn't reclaim the page. The
> > more processes share it, the more reclaim fails.
>
> So what? In other words why should it be restricted solely based on the
> map count. I can see a reason to restrict based on the access
> permissions because we do not want to simplify all sorts of side channel
> attacks but memory reclaim is capable of reclaiming shared pages and so
> far I haven't heard any sound argument why madvise should skip those.
> Again if there are any reasons, then document them in the changelog.

Whether to reclaim shared pages is a policy decision best left to
userland, IMHO.
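
From the calling process's side, the proposed hint is one line; a
sketch, assuming the constant gets exported to userspace headers:

    /* Contents are preserved, but under memory pressure the kernel
     * may reclaim these pages ahead of the rest of the working set. */
    if (madvise(buf, len, MADV_COLD) != 0)
            perror("madvise(MADV_COLD)");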


Re: [RFCv2 5/6] mm: introduce external memory hinting API

2019-05-31 Thread Daniel Colascione
On Thu, May 30, 2019 at 11:43 PM Minchan Kim  wrote:
>
> There are use cases where a centralized userspace daemon wants to give
> a memory hint like MADV_[COLD|PAGEOUT] to another process. Android's
> ActivityManagerService is one of them.
>
> It's similar in spirit to madvise(MADV_DONTNEED), but the information
> required to make the reclaim decision is not known to the app. Instead,
> it is known to the centralized userspace daemon(ActivityManagerService),
> and that daemon must be able to initiate reclaim on its own without
> any app involvement.
>
> To solve the issue, this patch introduces a new syscall, process_madvise(2).
> It could give a hint to the external process identified by pidfd.
>
>  int process_madvise(int pidfd, void *addr, size_t length, int advise,
> unsigned long cookie, unsigned long flag);
>
> Since it could affect another process's address range, only a privileged
> process (CAP_SYS_PTRACE), or one otherwise entitled to ptrace the target
> (e.g., one with the same UID), can use it successfully.
>
> The syscall has a cookie argument to provide atomicity (i.e., to detect
> that the target process's address space has changed since the monitor
> parsed its address ranges, so the operation can fail in case of such a
> race). Although there is no interface to get a cookie at this moment,
> it could be useful to keep it as an argument to avoid introducing
> another new syscall in the future. It could support *atomicity* for
> disruptive hints (e.g., MADV_DONTNEED|FREE). The flag argument is
> reserved for future use if we need to extend the API.

How about a compromise? Let's allow all madvise hints if the process
is calling process_madvise *on itself* (which will be useful once we
wire up the atomicity cookie) and restrict the cross-process case to
the hints you've mentioned. This way, the restriction on madvise hints
isn't tied to the specific API, but to the relationship between hinter
and hintee.
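
Concretely, under that compromise (a sketch against the RFC's
six-argument prototype; the cookie plumbing is hypothetical for now,
and pidfd_open(getpid(), 0) stands in for "a pidfd for yourself"):

    int self = pidfd_open(getpid(), 0);

    /* On yourself: the full madvise hint set, destructive included. */
    err = process_madvise(self, addr, len, MADV_DONTNEED, cookie, 0);

    /* On another process: only non-destructive hints like MADV_COLD. */
    err = process_madvise(target_pidfd, addr, len, MADV_COLD, 0, 0);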


Re: [RFC 6/7] mm: extend process_madvise syscall to support vector arrary

2019-05-30 Thread Daniel Colascione
On Thu, May 30, 2019 at 1:02 AM Minchan Kim  wrote:
>
> On Thu, May 30, 2019 at 08:57:55AM +0200, Michal Hocko wrote:
> > On Thu 30-05-19 11:17:48, Minchan Kim wrote:
> > > On Wed, May 29, 2019 at 12:33:52PM +0200, Michal Hocko wrote:
> > > > On Wed 29-05-19 03:08:32, Daniel Colascione wrote:
> > > > > On Mon, May 27, 2019 at 12:49 AM Minchan Kim  
> > > > > wrote:
> > > > > >
> > > > > > On Tue, May 21, 2019 at 12:37:26PM +0200, Michal Hocko wrote:
> > > > > > > On Tue 21-05-19 19:26:13, Minchan Kim wrote:
> > > > > > > > On Tue, May 21, 2019 at 08:24:21AM +0200, Michal Hocko wrote:
> > > > > > > > > On Tue 21-05-19 11:48:20, Minchan Kim wrote:
> > > > > > > > > > On Mon, May 20, 2019 at 11:22:58AM +0200, Michal Hocko 
> > > > > > > > > > wrote:
> > > > > > > > > > > [Cc linux-api]
> > > > > > > > > > >
> > > > > > > > > > > On Mon 20-05-19 12:52:53, Minchan Kim wrote:
> > > > > > > > > > > > Currently, process_madvise syscall works for only one 
> > > > > > > > > > > > address range
> > > > > > > > > > > > so user should call the syscall several times to give 
> > > > > > > > > > > > hints to
> > > > > > > > > > > > multiple address range.
> > > > > > > > > > >
> > > > > > > > > > > Is that a problem? How big of a problem? Any numbers?
> > > > > > > > > >
> > > > > > > > > > We easily have 2000+ vma so it's not trivial overhead. I 
> > > > > > > > > > will come up
> > > > > > > > > > with number in the description at respin.
> > > > > > > > >
> > > > > > > > > Does this really have to be a fast operation? I would expect 
> > > > > > > > > the monitor
> > > > > > > > > is by no means a fast path. The system call overhead is not 
> > > > > > > > > what it used
> > > > > > > > > to be, sigh, but still for something that is not a hot path 
> > > > > > > > > it should be
> > > > > > > > > tolerable, especially when the whole operation is quite 
> > > > > > > > > expensive on its
> > > > > > > > > own (wrt. the syscall entry/exit).
> > > > > > > >
> > > > > > > > What's different with process_vm_[readv|writev] and vmsplice?
> > > > > > > > If the range needed to be covered is a lot, vector operation 
> > > > > > > > makes sense
> > > > > > > > to me.
> > > > > > >
> > > > > > > I am not saying that the vector API is wrong. All I am trying to 
> > > > > > > say is
> > > > > > > that the benefit is not really clear so far. If you want to push 
> > > > > > > it
> > > > > > > through then you should better get some supporting data.
> > > > > >
> > > > > > I measured 1000 madvise syscall vs. a vector range syscall with 1000
> > > > > > ranges on a modern ARM64 device. Even though I saw a 15% improvement,
> > > > > > the absolute gain is just 1ms so I don't think it's worth to support.
> > > > > > I will drop vector support at next revision.
> > > > >
> > > > > Please do keep the vector support. Absolute timing is misleading,
> > > > > since in a tight loop, you're not going to contend on mmap_sem. We've
> > > > > seen tons of improvements in things like camera start come from
> > > > > coalescing mprotect calls, with the gains coming from taking and
> > > > > releasing various locks a lot less often and bouncing around less on
> > > > > the contended lock paths. Raw throughput doesn't tell the whole story,
> > > > > especially on mobile.
> > > >
> > > > This will always be a double edge sword. Taking a lock for longer can
> > > > improve a throughput of a single call but it would make a latency for
> > > > anybody contending on

Re: [RFC 6/7] mm: extend process_madvise syscall to support vector arrary

2019-05-29 Thread Daniel Colascione
On Mon, May 27, 2019 at 12:49 AM Minchan Kim  wrote:
>
> On Tue, May 21, 2019 at 12:37:26PM +0200, Michal Hocko wrote:
> > On Tue 21-05-19 19:26:13, Minchan Kim wrote:
> > > On Tue, May 21, 2019 at 08:24:21AM +0200, Michal Hocko wrote:
> > > > On Tue 21-05-19 11:48:20, Minchan Kim wrote:
> > > > > On Mon, May 20, 2019 at 11:22:58AM +0200, Michal Hocko wrote:
> > > > > > [Cc linux-api]
> > > > > >
> > > > > > On Mon 20-05-19 12:52:53, Minchan Kim wrote:
> > > > > > > Currently, process_madvise syscall works for only one address 
> > > > > > > range
> > > > > > > so user should call the syscall several times to give hints to
> > > > > > > multiple address range.
> > > > > >
> > > > > > Is that a problem? How big of a problem? Any numbers?
> > > > >
> > > > > We easily have 2000+ vma so it's not trivial overhead. I will come up
> > > > > with number in the description at respin.
> > > >
> > > > Does this really have to be a fast operation? I would expect the monitor
> > > > is by no means a fast path. The system call overhead is not what it used
> > > > to be, sigh, but still for something that is not a hot path it should be
> > > > tolerable, especially when the whole operation is quite expensive on its
> > > > own (wrt. the syscall entry/exit).
> > >
> > > What's different with process_vm_[readv|writev] and vmsplice?
> > > If the range needed to be covered is a lot, vector operation makes sense
> > > to me.
> >
> > I am not saying that the vector API is wrong. All I am trying to say is
> > that the benefit is not really clear so far. If you want to push it
> > through then you should better get some supporting data.
>
> I measured 1000 madvise syscall vs. a vector range syscall with 1000
> ranges on a modern ARM64 device. Even though I saw a 15% improvement, the
> absolute gain is just 1ms so I don't think it's worth to support.
> I will drop vector support at next revision.

Please do keep the vector support. Absolute timing is misleading,
since in a tight loop, you're not going to contend on mmap_sem. We've
seen tons of improvements in things like camera start come from
coalescing mprotect calls, with the gains coming from taking and
releasing various locks a lot less often and bouncing around less on
the contended lock paths. Raw throughput doesn't tell the whole story,
especially on mobile.
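
For the record, the natural shape for the vectored form mirrors
process_vm_readv(2); a sketch, not the RFC's exact signature:

    struct iovec ranges[] = {
            { .iov_base = addr1, .iov_len = len1 },
            { .iov_base = addr2, .iov_len = len2 },
            /* ... one entry per range, batched under one mmap_sem
             * acquire/release cycle ... */
    };
    err = process_madvise(pidfd, ranges,
                          sizeof(ranges) / sizeof(ranges[0]),
                          MADV_COLD, 0);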


Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER

2019-05-28 Thread Daniel Colascione
On Tue, May 28, 2019 at 4:56 AM Michal Hocko  wrote:
>
> On Tue 28-05-19 04:42:47, Daniel Colascione wrote:
> > On Tue, May 28, 2019 at 4:28 AM Michal Hocko  wrote:
> > >
> > > On Tue 28-05-19 20:12:08, Minchan Kim wrote:
> > > > On Tue, May 28, 2019 at 12:41:17PM +0200, Michal Hocko wrote:
> > > > > On Tue 28-05-19 19:32:56, Minchan Kim wrote:
> > > > > > On Tue, May 28, 2019 at 11:08:21AM +0200, Michal Hocko wrote:
> > > > > > > On Tue 28-05-19 17:49:27, Minchan Kim wrote:
> > > > > > > > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione 
> > > > > > > > wrote:
> > > > > > > > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim 
> > > > > > > > >  wrote:
> > > > > > > > > > if we went with the per vma fd approach then you would get 
> > > > > > > > > > this
> > > > > > > > > > > feature automatically because map_files would refer to 
> > > > > > > > > > > file backed
> > > > > > > > > > > mappings while map_anon could refer only to anonymous 
> > > > > > > > > > > mappings.
> > > > > > > > > >
> > > > > > > > > > The reason to add such filter option is to avoid the 
> > > > > > > > > > parsing overhead
> > > > > > > > > > so map_anon wouldn't be helpful.
> > > > > > > > >
> > > > > > > > > Without chiming on whether the filter option is a good idea, 
> > > > > > > > > I'd like
> > > > > > > > > to suggest that providing an efficient binary interfaces for 
> > > > > > > > > pulling
> > > > > > > > > memory map information out of processes.  Some 
> > > > > > > > > single-system-call
> > > > > > > > > method for retrieving a binary snapshot of a process's 
> > > > > > > > > address space
> > > > > > > > > complete with attributes (selectable, like statx?) for each 
> > > > > > > > > VMA would
> > > > > > > > > reduce complexity and increase performance in a variety of 
> > > > > > > > > areas,
> > > > > > > > > e.g., Android memory map debugging commands.
> > > > > > > >
> > > > > > > > I agree it's the best we can get *generally*.
> > > > > > > > Michal, any opinion?
> > > > > > >
> > > > > > > I am not really sure this is directly related. I think the primary
> > > > > > > question that we have to sort out first is whether we want to have
> > > > > > > the remote madvise call process or vma fd based. This is an 
> > > > > > > important
> > > > > > > distinction wrt. usability. I have only seen pid vs. pidfd 
> > > > > > > discussions
> > > > > > > so far unfortunately.
> > > > > >
> > > > > > With current usecase, it's per-process API with distinguishable 
> > > > > > anon/file
> > > > > > but thought it could be easily extended later for each address range
> > > > > > operation as userspace getting smarter with more information.
> > > > >
> > > > > Never design user API based on a single usecase, please. The "easily
> > > > > extended" part is by far not clear to me TBH. As I've already 
> > > > > mentioned
> > > > > several times, the synchronization model has to be thought through
> > > > > carefuly before a remote process address range operation can be
> > > > > implemented.
> > > >
> > > > I agree with you that we shouldn't design API on single usecase but what
> > > > you are concerning is actually not our usecase because we are resilient
> > > > with the race since MADV_COLD|PAGEOUT is not destructive.
> > > > Actually, many hints are already racy in that the upcoming pattern would
> > > > be different with the behavior you thought at the moment.
> > >
> > > How come they are racy wrt address ranges? You would have to be in
> > > multithreaded environment and then the onus of synchron

Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER

2019-05-28 Thread Daniel Colascione
On Tue, May 28, 2019 at 4:49 AM Michal Hocko  wrote:
>
> On Tue 28-05-19 04:21:44, Daniel Colascione wrote:
> > On Tue, May 28, 2019 at 3:33 AM Michal Hocko  wrote:
> > >
> > > On Tue 28-05-19 02:39:03, Daniel Colascione wrote:
> > > > On Tue, May 28, 2019 at 2:08 AM Michal Hocko  wrote:
> > > > >
> > > > > On Tue 28-05-19 17:49:27, Minchan Kim wrote:
> > > > > > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote:
> > > > > > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim  
> > > > > > > wrote:
> > > > > > > > if we went with the per vma fd approach then you would get this
> > > > > > > > > feature automatically because map_files would refer to file 
> > > > > > > > > backed
> > > > > > > > > mappings while map_anon could refer only to anonymous 
> > > > > > > > > mappings.
> > > > > > > >
> > > > > > > > The reason to add such filter option is to avoid the parsing 
> > > > > > > > overhead
> > > > > > > > so map_anon wouldn't be helpful.
> > > > > > >
> > > > > > > Without chiming on whether the filter option is a good idea, I'd 
> > > > > > > like
> > > > > > > to suggest that providing an efficient binary interfaces for 
> > > > > > > pulling
> > > > > > > memory map information out of processes.  Some single-system-call
> > > > > > > method for retrieving a binary snapshot of a process's address 
> > > > > > > space
> > > > > > > complete with attributes (selectable, like statx?) for each VMA 
> > > > > > > would
> > > > > > > reduce complexity and increase performance in a variety of areas,
> > > > > > > e.g., Android memory map debugging commands.
> > > > > >
> > > > > > I agree it's the best we can get *generally*.
> > > > > > Michal, any opinion?
> > > > >
> > > > > I am not really sure this is directly related. I think the primary
> > > > > question that we have to sort out first is whether we want to have
> > > > > the remote madvise call process or vma fd based. This is an important
> > > > > distinction wrt. usability. I have only seen pid vs. pidfd discussions
> > > > > so far unfortunately.
> > > >
> > > > I don't think the vma fd approach is viable. We have some processes
> > > > with a *lot* of VMAs --- system_server had 4204 when I checked just
> > > > now (and that's typical) --- and an FD operation per VMA would be
> > > > excessive.
> > >
> > > What do you mean by excessive here? Do you expect the process to have
> > > them open all at once?
> >
> > Minchan's already done timing. More broadly, in an era with various
> > speculative execution mitigations, making a system call is pretty
> > expensive.
>
> This is a completely separate discussion. This could be argued about
> many other syscalls.

Yes, it can be. That's why we have scatter-gather IO system calls in
the first place.

> Let's make the semantics correct first before we
> even start thinking about multiplexing. It is easier to multiplex on an
> existing and sane interface.

I don't think of it as "multiplexing" yet, not when the fundamental unit
of operation is the address range.

> Btw. Minchan concluded that multiplexing is not really all that
> important based on his numbers 
> http://lkml.kernel.org/r/20190527074940.gb6...@google.com
>
> [...]
>
> > > Is this really too much different from /proc/<pid>/map_files?
> >
> > It's very different. See below.
> >
> > > > > An interface to query address range information is a separate but
> > > > > although a related topic. We have /proc/<pid>/[s]maps for that right
> > > > > now and I understand it is not a general win for all usecases because
> > > > > it tends to be slow for some. I can see how /proc/<pid>/map_anons 
> > > > > could
> > > > > provide per vma information in a binary form via a fd based interface.
> > > > > But I would rather not conflate those two discussions much - well 
> > > > > except
> > > > > if it could give one of the approaches more justification but let's
> > > > > focus on the mad

Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER

2019-05-28 Thread Daniel Colascione
On Tue, May 28, 2019 at 4:44 AM Minchan Kim  wrote:
>
> On Tue, May 28, 2019 at 01:28:40PM +0200, Michal Hocko wrote:
> > On Tue 28-05-19 20:12:08, Minchan Kim wrote:
> > > On Tue, May 28, 2019 at 12:41:17PM +0200, Michal Hocko wrote:
> > > > On Tue 28-05-19 19:32:56, Minchan Kim wrote:
> > > > > On Tue, May 28, 2019 at 11:08:21AM +0200, Michal Hocko wrote:
> > > > > > On Tue 28-05-19 17:49:27, Minchan Kim wrote:
> > > > > > > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote:
> > > > > > > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim 
> > > > > > > >  wrote:
> > > > > > > > > if we went with the per vma fd approach then you would get 
> > > > > > > > > this
> > > > > > > > > > feature automatically because map_files would refer to file 
> > > > > > > > > > backed
> > > > > > > > > > mappings while map_anon could refer only to anonymous 
> > > > > > > > > > mappings.
> > > > > > > > >
> > > > > > > > > The reason to add such filter option is to avoid the parsing 
> > > > > > > > > overhead
> > > > > > > > > so map_anon wouldn't be helpful.
> > > > > > > >
> > > > > > > > Without chiming on whether the filter option is a good idea, 
> > > > > > > > I'd like
> > > > > > > > to suggest that providing an efficient binary interfaces for 
> > > > > > > > pulling
> > > > > > > > memory map information out of processes.  Some 
> > > > > > > > single-system-call
> > > > > > > > method for retrieving a binary snapshot of a process's address 
> > > > > > > > space
> > > > > > > > complete with attributes (selectable, like statx?) for each VMA 
> > > > > > > > would
> > > > > > > > reduce complexity and increase performance in a variety of 
> > > > > > > > areas,
> > > > > > > > e.g., Android memory map debugging commands.
> > > > > > >
> > > > > > > I agree it's the best we can get *generally*.
> > > > > > > Michal, any opinion?
> > > > > >
> > > > > > I am not really sure this is directly related. I think the primary
> > > > > > question that we have to sort out first is whether we want to have
> > > > > > the remote madvise call process or vma fd based. This is an 
> > > > > > important
> > > > > > distinction wrt. usability. I have only seen pid vs. pidfd 
> > > > > > discussions
> > > > > > so far unfortunately.
> > > > >
> > > > > With current usecase, it's per-process API with distinguishable 
> > > > > anon/file
> > > > > but thought it could be easily extended later for each address range
> > > > > operation as userspace getting smarter with more information.
> > > >
> > > > Never design user API based on a single usecase, please. The "easily
> > > > extended" part is by far not clear to me TBH. As I've already mentioned
> > > > several times, the synchronization model has to be thought through
> > > > carefuly before a remote process address range operation can be
> > > > implemented.
> > >
> > > I agree with you that we shouldn't design API on single usecase but what
> > > you are concerning is actually not our usecase because we are resilient
> > > with the race since MADV_COLD|PAGEOUT is not destructive.
> > > Actually, many hints are already racy in that the upcoming pattern would
> > > be different with the behavior you thought at the moment.
> >
> > How come they are racy wrt address ranges? You would have to be in
> > multithreaded environment and then the onus of synchronization is on
> > threads. That model is quite clear. But we are talking about separate
>
> Think about MADV_FREE. The allocator would think the chunk is worth
> marking "freeable", but soon the user of the allocator asks for the
> chunk - i.e., it's not freeable any longer once the user starts to use it.
>
> My point is that kinds of *hints* are always racy so any synchronization
> couldn't help a lot. That's why I want to restrict hints process_madvise
> supports as such kinds of non-destructive ones at next respin.

I think it's more natural for process_madvise to be a superset of
regular madvise. What's the harm? There are no security implications,
since anyone who could process_madvise could just ptrace anyway. I
also don't think limiting the hinting to non-destructive operations
guarantees safety (in a broad sense) either, since operating on the
wrong memory range can still cause unexpected system performance
issues even if there's no data loss.

More broadly, what I want to see is a family of process_* APIs that
provide a superset of the functionality that the existing intraprocess
APIs provide. I think this approach is elegant and generalizes easily.
I'm worried about prematurely limiting the interprocess memory APIs
and creating limitations that will last a long time in order to avoid
having to consider issues like VMA synchronization.
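
To make the "superset" point concrete, here is a minimal sketch, assuming
the madvise-like signature discussed in this thread (the eventual syscall,
if merged, may well look different):

#include <stddef.h>
#include <sys/mman.h>

/* Assumed signature, modeled on the RFC under discussion; not a real
 * kernel interface at the time of this thread. */
extern long process_madvise(int pidfd, void *addr, size_t len, int advice);

static void drop_range(int self_pidfd, void *addr, size_t len)
{
        /* Under the superset model, these two calls would be equivalent
         * when the target process is the caller itself: */
        madvise(addr, len, MADV_DONTNEED);
        process_madvise(self_pidfd, addr, len, MADV_DONTNEED);
}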


Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER

2019-05-28 Thread Daniel Colascione
On Tue, May 28, 2019 at 4:28 AM Michal Hocko  wrote:
>
> On Tue 28-05-19 20:12:08, Minchan Kim wrote:
> > On Tue, May 28, 2019 at 12:41:17PM +0200, Michal Hocko wrote:
> > > On Tue 28-05-19 19:32:56, Minchan Kim wrote:
> > > > On Tue, May 28, 2019 at 11:08:21AM +0200, Michal Hocko wrote:
> > > > > On Tue 28-05-19 17:49:27, Minchan Kim wrote:
> > > > > > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote:
> > > > > > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim  
> > > > > > > wrote:
> > > > > > > > > if we went with the per vma fd approach then you would get this
> > > > > > > > > feature automatically because map_files would refer to file 
> > > > > > > > > backed
> > > > > > > > > mappings while map_anon could refer only to anonymous 
> > > > > > > > > mappings.
> > > > > > > >
> > > > > > > > The reason to add such filter option is to avoid the parsing 
> > > > > > > > overhead
> > > > > > > > so map_anon wouldn't be helpful.
> > > > > > >
> > > > > > > Without chiming in on whether the filter option is a good idea,
> > > > > > > I'd like to suggest providing an efficient binary interface for
> > > > > > > pulling memory map information out of processes.  Some single-system-call
> > > > > > > method for retrieving a binary snapshot of a process's address 
> > > > > > > space
> > > > > > > complete with attributes (selectable, like statx?) for each VMA 
> > > > > > > would
> > > > > > > reduce complexity and increase performance in a variety of areas,
> > > > > > > e.g., Android memory map debugging commands.
> > > > > >
> > > > > > I agree it's the best we can get *generally*.
> > > > > > Michal, any opinion?
> > > > >
> > > > > I am not really sure this is directly related. I think the primary
> > > > > question that we have to sort out first is whether we want to have
> > > > > the remote madvise call process or vma fd based. This is an important
> > > > > distinction wrt. usability. I have only seen pid vs. pidfd discussions
> > > > > so far unfortunately.
> > > >
> > > > With current usecase, it's per-process API with distinguishable 
> > > > anon/file
> > > > but thought it could be easily extended later for each address range
> > > > operation as userspace getting smarter with more information.
> > >
> > > Never design user API based on a single usecase, please. The "easily
> > > extended" part is by far not clear to me TBH. As I've already mentioned
> > > several times, the synchronization model has to be thought through
> > > carefully before a remote process address range operation can be
> > > implemented.
> >
> > I agree with you that we shouldn't design API on single usecase but what
> > you are concerning is actually not our usecase because we are resilient
> > with the race since MADV_COLD|PAGEOUT is not destructive.
> > Actually, many hints are already racy in that the upcoming pattern would
> > be different with the behavior you thought at the moment.
>
> How come they are racy wrt address ranges? You would have to be in
> multithreaded environment and then the onus of synchronization is on
> threads. That model is quite clear. But we are talking about separate
> processes and some of them might be even not aware of an external entity
> tweaking their address space.

I don't think the difference between a thread and a process matters in
this context. Threads race on address space operations all the time
--- in the sense that multiple threads modify a process's address
space without synchronization. The main reasons that these races
haven't been a problem are: 1) threads mostly "mind their own business"
and modify different parts of the address space or use locks to ensure
that they don't stomp on each other (e.g., the malloc heap lock), and
2) POSIX mmap atomic-replacement semantics make certain classes of
operation (like "magic ring buffer" setup) safe even in the presence
of other threads stomping over an address space.
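
For reference, a minimal sketch of that "magic ring buffer" setup, which
leans on mmap()'s atomic-replacement behavior: MAP_FIXED over an existing
reservation never leaves a window in which another thread's mmap() could
grab the range. memfd_create() is just one convenient Linux-specific
choice of backing object here, and error cleanup is omitted:

#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>

static void *magic_ring_buffer(size_t size)  /* size: page-size multiple */
{
        int fd = memfd_create("ring", 0);
        char *base;

        if (fd < 0 || ftruncate(fd, size) < 0)
                return NULL;

        /* Reserve 2*size of address space. */
        base = mmap(NULL, 2 * size, PROT_NONE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (base == MAP_FAILED)
                return NULL;

        /* Atomically replace each half with a view of the same pages. */
        if (mmap(base, size, PROT_READ | PROT_WRITE,
                 MAP_SHARED | MAP_FIXED, fd, 0) == MAP_FAILED ||
            mmap(base + size, size, PROT_READ | PROT_WRITE,
                 MAP_SHARED | MAP_FIXED, fd, 0) == MAP_FAILED)
                return NULL;

        close(fd);
        return base;  /* base[i] and base[i + size] alias the same byte */
}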

The thing that's new in this discussion from a synchronization point
of view isn't that the VM operatio

Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER

2019-05-28 Thread Daniel Colascione
On Tue, May 28, 2019 at 3:41 AM Michal Hocko  wrote:
>
> On Tue 28-05-19 19:32:56, Minchan Kim wrote:
> > On Tue, May 28, 2019 at 11:08:21AM +0200, Michal Hocko wrote:
> > > On Tue 28-05-19 17:49:27, Minchan Kim wrote:
> > > > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote:
> > > > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim  
> > > > > wrote:
> > > > > > > if we went with the per vma fd approach then you would get this
> > > > > > > feature automatically because map_files would refer to file backed
> > > > > > > mappings while map_anon could refer only to anonymous mappings.
> > > > > >
> > > > > > The reason to add such filter option is to avoid the parsing 
> > > > > > overhead
> > > > > > so map_anon wouldn't be helpful.
> > > > >
> > > > > Without chiming in on whether the filter option is a good idea, I'd like
> > > > > to suggest providing an efficient binary interface for pulling
> > > > > memory map information out of processes.  Some single-system-call
> > > > > method for retrieving a binary snapshot of a process's address space
> > > > > complete with attributes (selectable, like statx?) for each VMA would
> > > > > reduce complexity and increase performance in a variety of areas,
> > > > > e.g., Android memory map debugging commands.
> > > >
> > > > I agree it's the best we can get *generally*.
> > > > Michal, any opinion?
> > >
> > > I am not really sure this is directly related. I think the primary
> > > question that we have to sort out first is whether we want to have
> > > the remote madvise call process or vma fd based. This is an important
> > > distinction wrt. usability. I have only seen pid vs. pidfd discussions
> > > so far unfortunately.
> >
> > With current usecase, it's per-process API with distinguishable anon/file
> > but thought it could be easily extended later for each address range
> > operation as userspace getting smarter with more information.
>
> Never design user API based on a single usecase, please. The "easily
> extended" part is by far not clear to me TBH. As I've already mentioned
> several times, the synchronization model has to be thought through
> carefully before a remote process address range operation can be
> implemented.

I don't think anyone is overfitting for a specific use case. When some
process A wants to manipulate process B's memory, it's fair for A to
want to know what memory it's manipulating. That's a general concern
that applies to a large family of cross-process memory operations.
It's less important for non-destructive hints than for some kind of
destructive operation, but the same idea applies. If there's a simple
way to solve this A-B information problem in a general way, it seems
to me that we should apply that general solution. Likewise, an API to
get an efficiently-collected snapshot of a process's address space
would be immediately useful in several very different use cases,
including debuggers, Android memory use reporting tools, and various
kinds of metric collection. Because we're talking about mechanisms
that solve several independent problems at the same time and in a
general way, it doesn't sound to me like overfitting for a particular
use case.


Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER

2019-05-28 Thread Daniel Colascione
On Tue, May 28, 2019 at 3:33 AM Michal Hocko  wrote:
>
> On Tue 28-05-19 02:39:03, Daniel Colascione wrote:
> > On Tue, May 28, 2019 at 2:08 AM Michal Hocko  wrote:
> > >
> > > On Tue 28-05-19 17:49:27, Minchan Kim wrote:
> > > > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote:
> > > > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim  
> > > > > wrote:
> > > > > > > if we went with the per vma fd approach then you would get this
> > > > > > > feature automatically because map_files would refer to file backed
> > > > > > > mappings while map_anon could refer only to anonymous mappings.
> > > > > >
> > > > > > The reason to add such filter option is to avoid the parsing 
> > > > > > overhead
> > > > > > so map_anon wouldn't be helpful.
> > > > >
> > > > > Without chiming in on whether the filter option is a good idea, I'd like
> > > > > to suggest providing an efficient binary interface for pulling
> > > > > memory map information out of processes.  Some single-system-call
> > > > > method for retrieving a binary snapshot of a process's address space
> > > > > complete with attributes (selectable, like statx?) for each VMA would
> > > > > reduce complexity and increase performance in a variety of areas,
> > > > > e.g., Android memory map debugging commands.
> > > >
> > > > I agree it's the best we can get *generally*.
> > > > Michal, any opinion?
> > >
> > > I am not really sure this is directly related. I think the primary
> > > question that we have to sort out first is whether we want to have
> > > the remote madvise call process or vma fd based. This is an important
> > > distinction wrt. usability. I have only seen pid vs. pidfd discussions
> > > so far unfortunately.
> >
> > I don't think the vma fd approach is viable. We have some processes
> > with a *lot* of VMAs --- system_server had 4204 when I checked just
> > now (and that's typical) --- and an FD operation per VMA would be
> > excessive.
>
> What do you mean by excessive here? Do you expect the process to have
> them open all at once?

Minchan's already done timing. More broadly, in an era with various
speculative execution mitigations, making a system call is pretty
expensive. If we have two options for remote VMA manipulation, one
that requires thousands of system calls (with the count proportional
to the address space size of the process) and one that requires only a
few system calls no matter how large the target process is, the latter
ought to start off with more points than the former under any kind of
design scoring.

> > VMAs also come and go pretty easily depending on changes in
> > protections and various faults.
>
> Is this really too much different from /proc//map_files?

It's very different. See below.

> > > An interface to query address range information is a separate but
> > > although a related topic. We have /proc//[s]maps for that right
> > > now and I understand it is not a general win for all usecases because
> > > it tends to be slow for some. I can see how /proc//map_anons could
> > > provide per vma information in a binary form via a fd based interface.
> > > But I would rather not conflate those two discussions much - well except
> > > if it could give one of the approaches more justification but let's
> > > focus on the madvise part first.
> >
> > I don't think it's a good idea to focus on one feature in a
> > multi-feature change when the interactions between features can be
> > very important for overall design of the multi-feature system and the
> > design of each feature.
> >
> > Here's my thinking on the high-level design:
> >
> > I'm imagining an address-range system that would work like this: we'd
> > create some kind of process_vm_getinfo(2) system call [1] that would
> > accept a statx-like attribute map and a pid/fd parameter as input and
> > return, on output, two things: 1) an array [2] of VMA descriptors
> > containing the requested information, and 2) a VMA configuration
> > sequence number. We'd then have process_madvise() and other
> > cross-process VM interfaces accept both address ranges and this
> > sequence number; they'd succeed only if the VMA configuration sequence
> > number is still current, i.e., the target process hasn't changed its
> > VMA configuration (implicitly or explicitly) since the call to
> > process_vm_getinfo().
>

Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER

2019-05-28 Thread Daniel Colascione
On Tue, May 28, 2019 at 2:08 AM Michal Hocko  wrote:
>
> On Tue 28-05-19 17:49:27, Minchan Kim wrote:
> > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote:
> > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim  wrote:
> > > > > if we went with the per vma fd approach then you would get this
> > > > > feature automatically because map_files would refer to file backed
> > > > > mappings while map_anon could refer only to anonymous mappings.
> > > >
> > > > The reason to add such filter option is to avoid the parsing overhead
> > > > so map_anon wouldn't be helpful.
> > >
> > > Without chiming in on whether the filter option is a good idea, I'd like
> > > to suggest providing an efficient binary interface for pulling
> > > memory map information out of processes.  Some single-system-call
> > > method for retrieving a binary snapshot of a process's address space
> > > complete with attributes (selectable, like statx?) for each VMA would
> > > reduce complexity and increase performance in a variety of areas,
> > > e.g., Android memory map debugging commands.
> >
> > I agree it's the best we can get *generally*.
> > Michal, any opinion?
>
> I am not really sure this is directly related. I think the primary
> question that we have to sort out first is whether we want to have
> the remote madvise call process or vma fd based. This is an important
> distinction wrt. usability. I have only seen pid vs. pidfd discussions
> so far unfortunately.

I don't think the vma fd approach is viable. We have some processes
with a *lot* of VMAs --- system_server had 4204 when I checked just
now (and that's typical) --- and an FD operation per VMA would be
excessive. VMAs also come and go pretty easily depending on changes in
protections and various faults. It's also not entirely clear what the
semantics of vma FDs should be over address space mutations, while the
semantics of address ranges are well-understood. I would much prefer
an interface operating on address ranges to one operating on VMA FDs,
both for efficiency and for consistency with other memory management
APIs.

> An interface to query address range information is a separate but
> although a related topic. We have /proc//[s]maps for that right
> now and I understand it is not a general win for all usecases because
> it tends to be slow for some. I can see how /proc//map_anons could
> provide per vma information in a binary form via a fd based interface.
> But I would rather not conflate those two discussions much - well except
> if it could give one of the approaches more justification but let's
> focus on the madvise part first.

I don't think it's a good idea to focus on one feature in a
multi-feature change when the interactions between features can be
very important for overall design of the multi-feature system and the
design of each feature.

Here's my thinking on the high-level design:

I'm imagining an address-range system that would work like this: we'd
create some kind of process_vm_getinfo(2) system call [1] that would
accept a statx-like attribute map and a pid/fd parameter as input and
return, on output, two things: 1) an array [2] of VMA descriptors
containing the requested information, and 2) a VMA configuration
sequence number. We'd then have process_madvise() and other
cross-process VM interfaces accept both address ranges and this
sequence number; they'd succeed only if the VMA configuration sequence
number is still current, i.e., the target process hasn't changed its
VMA configuration (implicitly or explicitly) since the call to
process_vm_getinfo().

This way, a process A that wants to perform some VM operation on
process B can slurp B's VMA configuration using process_vm_getinfo(),
figure out what it wants to do, and attempt to do it. If A finds that
its local knowledge of B's memory map has become invalid between the
process_vm_getinfo() and A taking some action based on the result
(because B modified its map in the meantime), A can retry [3]. While A
could instead ptrace or otherwise suspend B, *then* read B's memory
map (knowing B is quiescent), *then* operate on B, the optimistic
approach I'm describing would be much lighter-weight in the typical
case. It's also pretty simple, IMHO. If the "operate on B" step is
some kind of vectorized operation over multiple address ranges, this
approach also gets us all-or-nothing semantics.

Or maybe the whole sequence number thing is overkill and we don't need
atomicity? But if there's a concern that A shouldn't operate on B's
memory without knowing what it's operating on, then the scheme I've
proposed above solves this knowledge problem in a pretty lightweight
way.

[1] or some other interface
[2] or something more complicated if we want the descriptors to
contain variable-length elements, e.g., strings
[3] or override the sequence number check if it's feeling bold?
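
For illustration only, here is how a caller might use such a scheme. To
be clear, everything below is an assumption: process_vm_getinfo(), the
seq-taking process_madvise() variant, the descriptor layout, and the use
of ESTALE for "layout changed" are all invented for this sketch:

#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct vma_desc {                     /* assumed descriptor layout */
        uint64_t start, end;
        uint32_t prot, flags;
};

/* Assumed wrappers for the proposed interfaces. */
extern long process_vm_getinfo(int pidfd, uint64_t attr_mask,
                               struct vma_desc *vmas, size_t *count,
                               uint64_t *seq);
extern long process_madvise(int pidfd, uint64_t start, uint64_t len,
                            int advice, uint64_t seq);

static int remote_advise_all(int pidfd, int advice)
{
        static struct vma_desc vmas[1024];
        size_t count, i;
        uint64_t seq;

        for (;;) {
                count = sizeof(vmas) / sizeof(vmas[0]);
                if (process_vm_getinfo(pidfd, 0, vmas, &count, &seq) < 0)
                        return -1;
                for (i = 0; i < count; i++) {
                        if (process_madvise(pidfd, vmas[i].start,
                                            vmas[i].end - vmas[i].start,
                                            advice, seq) < 0) {
                                if (errno == ESTALE)  /* layout changed */
                                        break;        /* re-snapshot, retry */
                                return -1;
                        }
                }
                if (i == count)
                        return 0;     /* all ranges done atomically enough */
        }
}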


Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER

2019-05-28 Thread Daniel Colascione
On Tue, May 28, 2019 at 1:14 AM Minchan Kim  wrote:
> > if we went with the per vma fd approach then you would get this
> > feature automatically because map_files would refer to file backed
> > mappings while map_anon could refer only to anonymous mappings.
>
> The reason to add such filter option is to avoid the parsing overhead
> so map_anon wouldn't be helpful.

Without chiming in on whether the filter option is a good idea, I'd like
to suggest providing an efficient binary interface for pulling
memory map information out of processes.  Some single-system-call
method for retrieving a binary snapshot of a process's address space
complete with attributes (selectable, like statx?) for each VMA would
reduce complexity and increase performance in a variety of areas,
e.g., Android memory map debugging commands.


Re: [RFC 0/7] introduce memory hinting API for external process

2019-05-22 Thread Daniel Colascione
On Wed, May 22, 2019 at 9:01 AM Christian Brauner  wrote:
>
> On Wed, May 22, 2019 at 08:57:47AM -0700, Daniel Colascione wrote:
> > On Wed, May 22, 2019 at 8:48 AM Christian Brauner  
> > wrote:
> > >
> > > On Wed, May 22, 2019 at 08:17:23AM -0700, Daniel Colascione wrote:
> > > > On Wed, May 22, 2019 at 7:52 AM Christian Brauner 
> > > >  wrote:
> > > > > I'm not going to go into yet another long argument. I prefer pidfd_*.
> > > >
> > > > Ok. We're each allowed our opinion.
> > > >
> > > > > It's tied to the api, transparent for userspace, and disambiguates it
> > > > > from process_vm_{read,write}v that both take a pid_t.
> > > >
> > > > Speaking of process_vm_readv and process_vm_writev: both have a
> > > > currently-unused flags argument. Both should grow a flag that tells
> > > > them to interpret the pid argument as a pidfd. Or do you support
> > > > adding pidfd_vm_readv and pidfd_vm_writev system calls? If not, why
> > > > should process_madvise be called pidfd_madvise while process_vm_readv
> > > > isn't called pidfd_vm_readv?
> > >
> > > Actually, you should then do the same with process_madvise() and give it
> > > a flag for that too if that's not too crazy.
> >
> > I don't know what you mean. My gut feeling is that for the sake of
> > consistency, process_madvise, process_vm_readv, and process_vm_writev
> > should all accept a first argument interpreted as either a numeric PID
> > or a pidfd depending on a flag --- ideally the same flag. Is that what
> > you have in mind?
>
> Yes. For the sake of consistency they should probably all default to
> interpret as pid and if say PROCESS_{VM_}PIDFD is passed as flag
> interpret as pidfd.

Sounds good to me!
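
A tiny sketch of what that could look like for process_vm_readv();
PROCESS_PIDFD is the hypothetical flag from this exchange, not an
existing ABI, and current kernels reject any nonzero flags:

#define _GNU_SOURCE
#include <stddef.h>
#include <stdio.h>
#include <sys/uio.h>

#define PROCESS_PIDFD 1  /* hypothetical flag value */

static void read_remote(int pid_or_pidfd, unsigned long flags,
                        void *remote_addr, void *buf, size_t len)
{
        struct iovec local  = { .iov_base = buf,         .iov_len = len };
        struct iovec remote = { .iov_base = remote_addr, .iov_len = len };

        /* flags == 0: first argument is a numeric PID (current behavior);
         * flags == PROCESS_PIDFD: it would be interpreted as a pidfd. */
        if (process_vm_readv(pid_or_pidfd, &local, 1, &remote, 1, flags) < 0)
                perror("process_vm_readv");
}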


Re: [RFC 0/7] introduce memory hinting API for external process

2019-05-22 Thread Daniel Colascione
On Wed, May 22, 2019 at 8:48 AM Christian Brauner  wrote:
>
> On Wed, May 22, 2019 at 08:17:23AM -0700, Daniel Colascione wrote:
> > On Wed, May 22, 2019 at 7:52 AM Christian Brauner  
> > wrote:
> > > I'm not going to go into yet another long argument. I prefer pidfd_*.
> >
> > Ok. We're each allowed our opinion.
> >
> > > It's tied to the api, transparent for userspace, and disambiguates it
> > > from process_vm_{read,write}v that both take a pid_t.
> >
> > Speaking of process_vm_readv and process_vm_writev: both have a
> > currently-unused flags argument. Both should grow a flag that tells
> > them to interpret the pid argument as a pidfd. Or do you support
> > adding pidfd_vm_readv and pidfd_vm_writev system calls? If not, why
> > should process_madvise be called pidfd_madvise while process_vm_readv
> > isn't called pidfd_vm_readv?
>
> Actually, you should then do the same with process_madvise() and give it
> a flag for that too if that's not too crazy.

I don't know what you mean. My gut feeling is that for the sake of
consistency, process_madvise, process_vm_readv, and process_vm_writev
should all accept a first argument interpreted as either a numeric PID
or a pidfd depending on a flag --- ideally the same flag. Is that what
you have in mind?


Re: [RFC 0/7] introduce memory hinting API for external process

2019-05-22 Thread Daniel Colascione
On Wed, May 22, 2019 at 7:52 AM Christian Brauner  wrote:
> I'm not going to go into yet another long argument. I prefer pidfd_*.

Ok. We're each allowed our opinion.

> It's tied to the api, transparent for userspace, and disambiguates it
> from process_vm_{read,write}v that both take a pid_t.

Speaking of process_vm_readv and process_vm_writev: both have a
currently-unused flags argument. Both should grow a flag that tells
them to interpret the pid argument as a pidfd. Or do you support
adding pidfd_vm_readv and pidfd_vm_writev system calls? If not, why
should process_madvise be called pidfd_madvise while process_vm_readv
isn't called pidfd_vm_readv?


Re: [RFC 0/7] introduce memory hinting API for external process

2019-05-22 Thread Daniel Colascione
On Wed, May 22, 2019 at 1:22 AM Christian Brauner  wrote:
>
> On Wed, May 22, 2019 at 7:12 AM Daniel Colascione  wrote:
> >
> > On Tue, May 21, 2019 at 4:39 AM Christian Brauner  
> > wrote:
> > >
> > > On Tue, May 21, 2019 at 01:30:29PM +0200, Christian Brauner wrote:
> > > > On Tue, May 21, 2019 at 08:05:52PM +0900, Minchan Kim wrote:
> > > > > On Tue, May 21, 2019 at 10:42:00AM +0200, Christian Brauner wrote:
> > > > > > On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote:
> > > > > > > - Background
> > > > > > >
> > > > > > > The Android terminology used for forking a new process and 
> > > > > > > starting an app
> > > > > > > from scratch is a cold start, while resuming an existing app is a 
> > > > > > > hot start.
> > > > > > > While we continually try to improve the performance of cold 
> > > > > > > starts, hot
> > > > > > > starts will always be significantly less power hungry as well as 
> > > > > > > faster so
> > > > > > > we are trying to make hot start more likely than cold start.
> > > > > > >
> > > > > > > To increase hot start, Android userspace manages the order that 
> > > > > > > apps should
> > > > > > > be killed in a process called ActivityManagerService. 
> > > > > > > ActivityManagerService
> > > > > > > tracks every Android app or service that the user could be 
> > > > > > > interacting with
> > > > > > > at any time and translates that into a ranked list for lmkd(low 
> > > > > > > memory
> > > > > > > killer daemon). They are likely to be killed by lmkd if the 
> > > > > > > system has to
> > > > > > > reclaim memory. In that sense they are similar to entries in any 
> > > > > > > other cache.
> > > > > > > Those apps are kept alive for opportunistic performance 
> > > > > > > improvements but
> > > > > > > those performance improvements will vary based on the memory 
> > > > > > > requirements of
> > > > > > > individual workloads.
> > > > > > >
> > > > > > > - Problem
> > > > > > >
> > > > > > > Naturally, cached apps were dominant consumers of memory on the 
> > > > > > > system.
> > > > > > > However, they were not significant consumers of swap even though 
> > > > > > > they are
> > > > > > > good candidates for swap. Under investigation, swapping out only
> > > > > > > begins
> > > > > > > once the low zone watermark is hit and kswapd wakes up, but the 
> > > > > > > overall
> > > > > > > allocation rate in the system might trip lmkd thresholds and 
> > > > > > > cause a cached
> > > > > > > process to be killed(we measured performance swapping out vs. 
> > > > > > > zapping the
> > > > > > > memory by killing a process. Unsurprisingly, zapping is 10x
> > > > > > > faster
> > > > > > > even though we use zram which is much faster than real storage) 
> > > > > > > so kill
> > > > > > > from lmkd will often satisfy the high zone watermark, resulting 
> > > > > > > in very
> > > > > > > few pages actually being moved to swap.
> > > > > > >
> > > > > > > - Approach
> > > > > > >
> > > > > > > The approach we chose was to use a new interface to allow 
> > > > > > > userspace to
> > > > > > > proactively reclaim entire processes by leveraging platform 
> > > > > > > information.
> > > > > > > This allowed us to bypass the inaccuracy of the kernel’s LRUs for 
> > > > > > > pages
> > > > > > > that are known to be cold from userspace and to avoid races with 
> > > > > > > lmkd
> > > > > > > by reclaiming apps as soon as they entered the cached state. 
> > > > > > > Additionally,
> > > > > > > it could provide many chances for

Re: [RFC 0/7] introduce memory hinting API for external process

2019-05-21 Thread Daniel Colascione
On Tue, May 21, 2019 at 4:39 AM Christian Brauner  wrote:
>
> On Tue, May 21, 2019 at 01:30:29PM +0200, Christian Brauner wrote:
> > On Tue, May 21, 2019 at 08:05:52PM +0900, Minchan Kim wrote:
> > > On Tue, May 21, 2019 at 10:42:00AM +0200, Christian Brauner wrote:
> > > > On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote:
> > > > > - Background
> > > > >
> > > > > The Android terminology used for forking a new process and starting 
> > > > > an app
> > > > > from scratch is a cold start, while resuming an existing app is a hot 
> > > > > start.
> > > > > While we continually try to improve the performance of cold starts, 
> > > > > hot
> > > > > starts will always be significantly less power hungry as well as 
> > > > > faster so
> > > > > we are trying to make hot start more likely than cold start.
> > > > >
> > > > > To increase hot start, Android userspace manages the order that apps 
> > > > > should
> > > > > be killed in a process called ActivityManagerService. 
> > > > > ActivityManagerService
> > > > > tracks every Android app or service that the user could be 
> > > > > interacting with
> > > > > at any time and translates that into a ranked list for lmkd(low memory
> > > > > killer daemon). They are likely to be killed by lmkd if the system 
> > > > > has to
> > > > > reclaim memory. In that sense they are similar to entries in any 
> > > > > other cache.
> > > > > Those apps are kept alive for opportunistic performance improvements 
> > > > > but
> > > > > those performance improvements will vary based on the memory 
> > > > > requirements of
> > > > > individual workloads.
> > > > >
> > > > > - Problem
> > > > >
> > > > > Naturally, cached apps were dominant consumers of memory on the 
> > > > > system.
> > > > > However, they were not significant consumers of swap even though they 
> > > > > are
> > > > > good candidates for swap. Under investigation, swapping out only begins
> > > > > once the low zone watermark is hit and kswapd wakes up, but the 
> > > > > overall
> > > > > allocation rate in the system might trip lmkd thresholds and cause a 
> > > > > cached
> > > > > process to be killed(we measured performance swapping out vs. zapping 
> > > > > the
> > > > > memory by killing a process. Unsurprisingly, zapping is 10x
> > > > > faster
> > > > > even though we use zram which is much faster than real storage) so 
> > > > > kill
> > > > > from lmkd will often satisfy the high zone watermark, resulting in 
> > > > > very
> > > > > few pages actually being moved to swap.
> > > > >
> > > > > - Approach
> > > > >
> > > > > The approach we chose was to use a new interface to allow userspace to
> > > > > proactively reclaim entire processes by leveraging platform 
> > > > > information.
> > > > > This allowed us to bypass the inaccuracy of the kernel’s LRUs for 
> > > > > pages
> > > > > that are known to be cold from userspace and to avoid races with lmkd
> > > > > by reclaiming apps as soon as they entered the cached state. 
> > > > > Additionally,
> > > > > it could provide many chances for platform to use much information to
> > > > > optimize memory efficiency.
> > > > >
> > > > > IMHO we should spell it out that this patchset complements 
> > > > > MADV_DONTNEED
> > > > > and MADV_FREE by adding non-destructive ways to gain some free memory
> > > > > space. MADV_COLD is similar to MADV_DONTNEED in a way that it hints
> > > > > the
> > > > > kernel that memory region is not currently needed and should be 
> > > > > reclaimed
> > > > > immediately; MADV_COOL is similar to MADV_FREE in a way that it hints 
> > > > > the
> > > > > kernel that memory region is not currently needed and should be 
> > > > > reclaimed
> > > > > when memory pressure rises.
> > > > >
> > > > > To achieve the goal, the patchset introduces two new options for
> > > > > madvise.
> > > > > One is MADV_COOL which will deactivate activated pages and the other is
> > > > > MADV_COLD which will reclaim private pages instantly. These new 
> > > > > options
> > > > > complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways 
> > > > > to
> > > > > gain some free memory space. MADV_COLD is similar to MADV_DONTNEED in 
> > > > > a way
> > > > > that it hints the kernel that memory region is not currently needed 
> > > > > and
> > > > > should be reclaimed immediately; MADV_COOL is similar to MADV_FREE in 
> > > > > a way
> > > > > that it hints the kernel that memory region is not currently needed 
> > > > > and
> > > > > should be reclaimed when memory pressure rises.
> > > > >
> > > > > This approach is similar in spirit to madvise(MADV_DONTNEED), but the
> > > > > information required to make the reclaim decision is not known to the 
> > > > > app.
> > > > > Instead, it is known to a centralized userspace daemon, and that 
> > > > > daemon
> > > > > must be able to initiate reclaim on its own without any app 
> > > > > involvement.
> > > > > To solve the concern, this patch introduces new syscall 

Re: [PATCH v1 2/2] Add selftests for pidfd polling

2019-04-26 Thread Daniel Colascione
On Fri, Apr 26, 2019 at 10:26 AM Joel Fernandes  wrote:
> On Thu, Apr 25, 2019 at 03:07:48PM -0700, Daniel Colascione wrote:
> > On Thu, Apr 25, 2019 at 2:29 PM Christian Brauner  
> > wrote:
> > > This timing-based testing seems kinda odd to be honest. Can't we do
> > > something better than this?
> >
> > Agreed. Timing-based tests have a substantial risk of becoming flaky.
> > We ought to be able to make these tests fully deterministic and not
> > subject to breakage from odd scheduling outcomes. We don't have
> > sleepable events for everything, granted, but sleep-waiting on a
> > condition with exponential backoff is fine in test code. In general,
> > if you start with a robust test, you can insert a sleep(100) anywhere
> > and not break the logic. Violating this rule always causes pain sooner
> > or later.
>
> I prefer if you can be more specific about how to redesign the test. Please
> go through the code and make suggestions there. The tests have not been flaky
> in my experience.

You've been running them in an ideal environment.

> Some tests do depend on timing like the preemptoff tests,
> that can't be helped. Or a performance test that calculates framedrops.

Performance tests are *about* timing. This is a functional test. Here,
we care about sequencing, not timing, and using a bare sleep instead
of sleeping with a condition check (see below) is always flaky.

> In this case, we want to make sure that the poll unblocks at the right "time"
> that is when the non-leader thread exits, and not when the leader thread
> exits (test 1), or when the non-leader thread exits and not when the same
> non-leader previous did an execve (test 2).

Instead of sleeping, you want to wait for some condition. Right now,
in a bunch of places, the test does something like this:

do_something()
sleep(SOME_TIMEOUT)
check(some_condition())

You can replace each of these clauses with something like this:

do_something()
start_time = now()
while(!some_condition() && now() - start_time < LONG_TIMEOUT)
  sleep(SHORT_DELAY)
check(some_condition())

This way, you're insensitive to timing, up to LONG_TIMEOUT (which can
be something like a minute). Yes, you can always write a bare
sleep(LONG_TIMEOUT) instead, but with a good, robust value of LONG_TIMEOUT
(which should be tens of seconds), that would make the test take far too
long to run in the happy case.

Note that this code is fine:

check(!some_condition())
sleep(SOME_REASONABLE_TIMEOUT)
check(!some_condition())

It's okay to sleep for a little while and check that something did
*not* happen, but it's not okay for the test to *fail* due to
scheduling delays. The difference is that
sleeping-and-checking-that-something-didn't-happen can only generate
false negatives when checking for failures, and it's much better from
a code health perspective for a test to sometimes fail to detect a bug
than for it to fire occasionally when there's no bug actually present.
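
As a concrete (if rough) C version of the pattern, where some_condition()
stands for whatever predicate the test cares about:

#include <stdbool.h>
#include <time.h>

static bool wait_for(bool (*some_condition)(void), int long_timeout_sec)
{
        struct timespec delay = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 };
        time_t deadline = time(NULL) + long_timeout_sec;

        while (!some_condition() && time(NULL) < deadline) {
                nanosleep(&delay, NULL);
                /* exponential backoff, capped below one second */
                if (delay.tv_nsec < 500 * 1000 * 1000)
                        delay.tv_nsec *= 2;
        }
        return some_condition();
}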

> These are inherently timing related.

No they aren't. We don't care how long these operations take. We only
care that they happen in the right order.

(Well, we do care about performance, but not for the purposes of this
functional test.)

> Yes it is true that if this runs in a VM
> and if the VM CPU is preempted for a couple seconds, then the test can fail
> falsely. Still I would argue such a failure scenario of a multi-second CPU
> lock-up can cause more serious issues like RCU stalls, and that's not a test
> issue. We can increase the sleep intervals if you want, to reduce the risk of
> such scenarios.
>
> I would love to make the test not depend on timing, but I don't know how.

For threads, implement some_condition() above by opening the /proc
directory for the task you want. You can check for death by looking for
zombie status in stat, or for an ESRCH error.

If you want to test that poll() actually unblocks on exit (as opposed
to EPOLLIN-ing immediately when the waited process is already dead),
do something like this:

- [Main test thread] Start subprocess, getting a pidfd
- [Subprocess] Wait forever
- [Main test thread] Start a waiter thread
- [Waiter test thread] poll(2) (or epoll, if you insist) on process exit
- [Main test thread] sleep(FAIRLY_SHORT_TIMEOUT)
- [Main test thread] Check that the subprocess is alive
- [Main test thread] pthread_tryjoin_np (make sure the waiter thread
is still alive)
- [Main test thread] Kill the subprocess (or one of its threads, for
testing the leader-exit case)
- [Main test thread] pthread_timedjoin_np(LONG_TIMEOUT) the waiter thread
- [Waiter test thread] poll(2) returns and thread exits
- [Main test thread] pthread_join returns: test succeeds (or, if
pthread_timedjoin_np fails with ETIMEDOUT, it means poll(2) didn't
unblock, and the test should fail).
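
Roughly, in code. This is a sketch only: fork_with_pidfd() is a stand-in
for however the test obtains a pidfd for a child that blocks forever
(CLONE_PIDFD was still under review here), and the timeouts are
placeholders:

#define _GNU_SOURCE
#include <errno.h>
#include <poll.h>
#include <pthread.h>
#include <signal.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

extern pid_t fork_with_pidfd(int *pidfd_out);  /* hypothetical helper */

static int pidfd = -1;

/* Waiter thread: must stay blocked until the subprocess actually exits. */
static void *waiter(void *arg)
{
        struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
        poll(&pfd, 1, -1);
        return NULL;
}

int main(void)
{
        pthread_t thr;
        struct timespec deadline;
        pid_t child = fork_with_pidfd(&pidfd);  /* child pause()s forever */

        pthread_create(&thr, NULL, waiter, NULL);

        sleep(1);                               /* FAIRLY_SHORT_TIMEOUT */
        if (kill(child, 0) != 0)                /* child must be alive */
                exit(1);
        if (pthread_tryjoin_np(thr, NULL) != EBUSY)
                exit(1);                        /* waiter must be blocked */

        kill(child, SIGKILL);                   /* now let it die */

        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_sec += 60;                  /* LONG_TIMEOUT */
        if (pthread_timedjoin_np(thr, NULL, &deadline) != 0)
                exit(1);                        /* poll(2) never unblocked */
        return 0;
}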

Tests that sleep for synchronization *do* end up being flaky. That the
flakiness doesn't show up in local iterative testing doesn't mean that
the test is adequately robust.


Re: [PATCH v1 2/2] Add selftests for pidfd polling

2019-04-25 Thread Daniel Colascione
On Thu, Apr 25, 2019 at 2:29 PM Christian Brauner  wrote:
> This timing-based testing seems kinda odd to be honest. Can't we do
> something better than this?

Agreed. Timing-based tests have a substantial risk of becoming flaky.
We ought to be able to make these tests fully deterministic and not
subject to breakage from odd scheduling outcomes. We don't have
sleepable events for everything, granted, but sleep-waiting on a
condition with exponential backoff is fine in test code. In general,
if you start with a robust test, you can insert a sleep(100) anywhere
and not break the logic. Violating this rule always causes pain sooner
or later.

Other thoughts: IMHO, using poll(2) instead of epoll would simplify
the test code, and I think we can get away with calling
pthread_exit(3) instead of SYS_exit.


Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]

2019-04-20 Thread Daniel Colascione
On Sat, Apr 20, 2019 at 12:14 AM Kevin Easton  wrote:
> On Mon, Apr 15, 2019 at 01:29:23PM -0700, Andy Lutomirski wrote:
> > On Mon, Apr 15, 2019 at 12:59 PM Aleksa Sarai  wrote:
> > >
> > > On 2019-04-15, Enrico Weigelt, metux IT consult  wrote:
> > > > > This patchset makes it possible to retrieve pid file descriptors at
> > > > > process creation time by introducing the new flag CLONE_PIDFD to the
> > > > > clone() system call as previously discussed.
> > > >
> > > > Sorry, for highjacking this thread, but I'm curious on what things to
> > > > consider when introducing new CLONE_* flags.
> > > >
> > > > The reason I'm asking is:
> > > >
> > > > I'm working on implementing plan9-like fs namespaces, where unprivileged
> > > > processes can change their own namespace at will. For that, certain
> > > > traditional unix'ish things have to be disabled, most notably suid.
> > > > As forbidding suid can be helpful in other scenarios, too, I thought
> > > > about making this its own feature. Doing that switch on clone() seems
> > > > a nice place for that, IMHO.
> > >
> > > Just spit-balling -- is no_new_privs not sufficient for this usecase?
> > > Not granting privileges such as setuid during execve(2) is the main
> > > point of that flag.
> > >
> >
> > I would personally *love* it if distros started setting no_new_privs
> > for basically all processes.  And pidfd actually gets us part of the
> > way toward a straightforward way to make sudo and su still work in a
> > no_new_privs world: su could call into a daemon that would spawn the
> > privileged task, and su would get a (read-only!) pidfd back and then
> > wait for the fd and exit.  I suppose that, done naively, this might
> > cause some odd effects with respect to tty handling, but I bet it's
> > solveable.  I suppose it would be nifty if there were a way for a
> > process, by mutual agreement, to reparent itself to an unrelated
> > process.
> >
> > Anyway, clone(2) is an enormous mess.  Surely the right solution here
> > is to have a whole new process creation API that takes a big,
> > extensible struct as an argument, and supports *at least* the full
> > abilities of posix_spawn() and ideally covers all the use cases for
> > fork() + do stuff + exec().  It would be nifty if this API also had a
> > way to say "add no_new_privs and therefore enable extra functionality
> > that doesn't work without no_new_privs".  This functionality would
> > include things like returning a future extra-privileged pidfd that
> > gives ptrace-like access.
> >
> > As basic examples, the improved process creation API should take a
> > list of dup2() operations to perform, fds to remove the O_CLOEXEC flag
> > from, fds to close (or, maybe even better, a list of fds to *not*
> > close), a list of rlimit changes to make, a list of signal changes to
> > make, the ability to set sid, pgrp, uid, gid (as in
> > setresuid/setresgid), the ability to do capset() operations, etc.  The
> > posix_spawn() API, for all that it's rather complicated, covers a
> > bunch of the basics pretty well.
>
> The idea of a system call that takes an infinitely-extendable laundry
> list of operations to perform in kernel space seems quite inelegant, if
> only for the error-reporting reason.
>
> Instead, I suggest that what you'd want is a way to create a new
> embryonic process that has no address space and isn't yet schedulable.
> You then just need other-process-directed variants of all the normal
> setup functions - so pr_openat(pidfd, dirfd, pathname, flags, mode),
> pr_sigaction(pidfd, signum, act, oldact), pr_dup2(pidfd, oldfd, newfd)
> etc.

Providing process-directed versions of these functions would be useful
for a variety of management tasks anyway.

> Then when it's all set up you pr_execve() to kick it off.

Yes. That's the right general approach.
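
A sketch of how that flow might read. Every pr_*() call here, and
create_embryo(), is hypothetical, spelling out the proposal rather than
any real interface:

#include <fcntl.h>
#include <signal.h>

extern int create_embryo(void);  /* returns pidfd; no address space yet */
extern int pr_openat(int pidfd, int dirfd, const char *path,
                     int flags, int mode);
extern int pr_dup2(int pidfd, int oldfd, int newfd);
extern int pr_sigaction(int pidfd, int sig, const struct sigaction *act,
                        struct sigaction *old);
extern int pr_execve(int pidfd, const char *path,
                     char *const argv[], char *const envp[]);

static int spawn_daemon(int log_fd, char *const argv[], char *const envp[])
{
        struct sigaction ignore = { .sa_handler = SIG_IGN };
        int pidfd = create_embryo();            /* not yet schedulable */

        pr_openat(pidfd, AT_FDCWD, "/dev/null", O_RDONLY, 0);  /* stdin */
        pr_dup2(pidfd, log_fd, 2);                             /* stderr */
        pr_sigaction(pidfd, SIGPIPE, &ignore, NULL);
        pr_execve(pidfd, argv[0], argv, envp);  /* kick it off */
        return pidfd;
}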


Re: [PATCH RFC 1/2] Add polling support to pidfd

2019-04-19 Thread Daniel Colascione
On Fri, Apr 19, 2019 at 4:12 PM Christian Brauner  wrote:
>
> On Sat, Apr 20, 2019 at 12:46 AM Daniel Colascione  wrote:
> >
> > On Fri, Apr 19, 2019 at 3:02 PM Christian Brauner  
> > wrote:
> > >
> > > On Fri, Apr 19, 2019 at 11:48 PM Christian Brauner  
> > > wrote:
> > > >
> > > > On Fri, Apr 19, 2019 at 11:21 PM Daniel Colascione  
> > > > wrote:
> > > > >
> > > > > On Fri, Apr 19, 2019 at 1:57 PM Christian Brauner 
> > > > >  wrote:
> > > > > >
> > > > > > On Fri, Apr 19, 2019 at 10:34 PM Daniel Colascione 
> > > > > >  wrote:
> > > > > > >
> > > > > > > On Fri, Apr 19, 2019 at 12:49 PM Joel Fernandes 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > On Fri, Apr 19, 2019 at 09:18:59PM +0200, Christian Brauner 
> > > > > > > > wrote:
> > > > > > > > > On Fri, Apr 19, 2019 at 03:02:47PM -0400, Joel Fernandes 
> > > > > > > > > wrote:
> > > > > > > > > > On Thu, Apr 18, 2019 at 07:26:44PM +0200, Christian Brauner 
> > > > > > > > > > wrote:
> > > > > > > > > > > On April 18, 2019 7:23:38 PM GMT+02:00, Jann Horn 
> > > > > > > > > > >  wrote:
> > > > > > > > > > > >On Wed, Apr 17, 2019 at 3:09 PM Oleg Nesterov 
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >> On 04/16, Joel Fernandes wrote:
> > > > > > > > > > > >> > On Tue, Apr 16, 2019 at 02:04:31PM +0200, Oleg 
> > > > > > > > > > > >> > Nesterov wrote:
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > > Could you explain when it should return POLLIN? 
> > > > > > > > > > > >> > > When the whole
> > > > > > > > > > > >process exits?
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > It returns POLLIN when the task is dead or doesn't 
> > > > > > > > > > > >> > exist anymore,
> > > > > > > > > > > >or when it
> > > > > > > > > > > >> > is in a zombie state and there's no other thread in 
> > > > > > > > > > > >> > the thread
> > > > > > > > > > > >group.
> > > > > > > > > > > >>
> > > > > > > > > > > >> IOW, when the whole thread group exits, so it can't be 
> > > > > > > > > > > >> used to
> > > > > > > > > > > >monitor sub-threads.
> > > > > > > > > > > >>
> > > > > > > > > > > >> just in case... speaking of this patch it doesn't 
> > > > > > > > > > > >> modify
> > > > > > > > > > > >proc_tid_base_operations,
> > > > > > > > > > > >> so you can't poll("/proc/sub-thread-tid") anyway, but 
> > > > > > > > > > > >> iiuc you are
> > > > > > > > > > > >going to use
> > > > > > > > > > > >> the anonymous file returned by CLONE_PIDFD ?
> > > > > > > > > > > >
> > > > > > > > > > > >I don't think procfs works that way. 
> > > > > > > > > > > >/proc/sub-thread-tid has
> > > > > > > > > > > >proc_tgid_base_operations despite not being a thread 
> > > > > > > > > > > >group leader.
> > > > > > > > > > > >(Yes, that's kinda weird.) AFAICS the WARN_ON_ONCE() in 
> > > > > > > > > > > >this code can
> > > > > > > > > > > >be hit trivially, and then the code will misbehave.
> > > > > > > > > > > >
> > > > > > > > > > > >@Joel: I think you'll have to either rewrite this to 

Re: [PATCH RFC 1/2] Add polling support to pidfd

2019-04-19 Thread Daniel Colascione
On Fri, Apr 19, 2019 at 4:33 PM Linus Torvalds
 wrote:
>
> On Fri, Apr 19, 2019 at 4:20 PM Christian Brauner  
> wrote:
> >
> > On Sat, Apr 20, 2019 at 1:11 AM Linus Torvalds
> >  wrote:
> > >
> > > It's also worth noting that POLLERR/POLLHUP/POLLNVAL cannot be masked
> > > for "poll()". Even if you only ask for POLLIN/POLLOUT, you will always
> > > get POLLERR/POLLHUP notification. That is again historical behavior,
> > > and it's kind of a "you can't poll a hung up fd". But it once again
> > > means that you should consider POLLHUP to be something *exceptional*
> > > and final, where no further or other state changes can happen or are
> > > relevant.
> >
> > Which kind of makes sense for process exit. So the historical behavior
> > here is in our favor and having POLLIN | POLLHUP rather fitting.
> > It just seems right that POLLHUP indicates "there can be
> > no more state transitions".
>
> Note that that is *not* true of process exit.
>
> The final state transition isn't "exit", it is actually "process has
> been reaped".  That's the point where data no longer exists.

FWIW, I think the exit status should be available via pidfd even after
process reaping. A non-parent holder of a pidfd has no ability to
control when the parent reaps the child, or even if reaping is
necessary at all --- the parent could make SIGCHLD SIG_IGN. Someone
trying to read exit status via a pidfd shouldn't fail to get that exit
status just because he lost the race with a concurrent waitpid().
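
A small sketch of the semantics being discussed: poll a pidfd and treat
POLLIN as "exit status is readable" and POLLHUP as the final "no further
transitions" state. How (and whether) the two bits get split this way was
still an open question in this thread, so the mapping is an assumption,
not settled ABI:

#include <poll.h>
#include <stdio.h>

static void report(int pidfd)
{
        struct pollfd pfd = { .fd = pidfd, .events = POLLIN };

        if (poll(&pfd, 1, -1) < 0)
                return;
        if (pfd.revents & POLLIN)
                printf("process exited; status should be retrievable\n");
        if (pfd.revents & POLLHUP)
                printf("process gone for good (e.g., reaped)\n");
}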


Re: [PATCH RFC 1/2] Add polling support to pidfd

2019-04-19 Thread Daniel Colascione
On Fri, Apr 19, 2019 at 4:02 PM Christian Brauner  wrote:
>
> On Sat, Apr 20, 2019 at 12:35 AM Daniel Colascione  wrote:
> >
> > On Fri, Apr 19, 2019 at 2:48 PM Christian Brauner  
> > wrote:
> > >
> > > On Fri, Apr 19, 2019 at 11:21 PM Daniel Colascione  
> > > wrote:
> > > >
> > > > On Fri, Apr 19, 2019 at 1:57 PM Christian Brauner 
> > > >  wrote:
> > > > >
> > > > > On Fri, Apr 19, 2019 at 10:34 PM Daniel Colascione 
> > > > >  wrote:
> > > > > >
> > > > > > On Fri, Apr 19, 2019 at 12:49 PM Joel Fernandes 
> > > > > >  wrote:
> > > > > > >
> > > > > > > On Fri, Apr 19, 2019 at 09:18:59PM +0200, Christian Brauner wrote:
> > > > > > > > On Fri, Apr 19, 2019 at 03:02:47PM -0400, Joel Fernandes wrote:
> > > > > > > > > On Thu, Apr 18, 2019 at 07:26:44PM +0200, Christian Brauner 
> > > > > > > > > wrote:
> > > > > > > > > > On April 18, 2019 7:23:38 PM GMT+02:00, Jann Horn 
> > > > > > > > > >  wrote:
> > > > > > > > > > >On Wed, Apr 17, 2019 at 3:09 PM Oleg Nesterov 
> > > > > > > > > > > wrote:
> > > > > > > > > > >> On 04/16, Joel Fernandes wrote:
> > > > > > > > > > >> > On Tue, Apr 16, 2019 at 02:04:31PM +0200, Oleg 
> > > > > > > > > > >> > Nesterov wrote:
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > Could you explain when it should return POLLIN? When 
> > > > > > > > > > >> > > the whole
> > > > > > > > > > >process exits?
> > > > > > > > > > >> >
> > > > > > > > > > >> > It returns POLLIN when the task is dead or doesn't 
> > > > > > > > > > >> > exist anymore,
> > > > > > > > > > >or when it
> > > > > > > > > > >> > is in a zombie state and there's no other thread in 
> > > > > > > > > > >> > the thread
> > > > > > > > > > >group.
> > > > > > > > > > >>
> > > > > > > > > > >> IOW, when the whole thread group exits, so it can't be 
> > > > > > > > > > >> used to
> > > > > > > > > > >monitor sub-threads.
> > > > > > > > > > >>
> > > > > > > > > > >> just in case... speaking of this patch it doesn't modify
> > > > > > > > > > >proc_tid_base_operations,
> > > > > > > > > > >> so you can't poll("/proc/sub-thread-tid") anyway, but 
> > > > > > > > > > >> iiuc you are
> > > > > > > > > > >going to use
> > > > > > > > > > >> the anonymous file returned by CLONE_PIDFD ?
> > > > > > > > > > >
> > > > > > > > > > >I don't think procfs works that way. /proc/sub-thread-tid 
> > > > > > > > > > >has
> > > > > > > > > > >proc_tgid_base_operations despite not being a thread group 
> > > > > > > > > > >leader.
> > > > > > > > > > >(Yes, that's kinda weird.) AFAICS the WARN_ON_ONCE() in 
> > > > > > > > > > >this code can
> > > > > > > > > > >be hit trivially, and then the code will misbehave.
> > > > > > > > > > >
> > > > > > > > > > >@Joel: I think you'll have to either rewrite this to 
> > > > > > > > > > >explicitly bail
> > > > > > > > > > >out if you're dealing with a thread group leader, or make 
> > > > > > > > > > >the code
> > > > > > > > > > >work for threads, too.
> > > > > > > > > >
> > > > > > > > > > The latter case probably being preferred if this API is 
> > > > > > > > > > supposed to be
> > > >

Re: [PATCH RFC 1/2] Add polling support to pidfd

2019-04-19 Thread Daniel Colascione
On Fri, Apr 19, 2019 at 3:02 PM Christian Brauner  wrote:
>
> On Fri, Apr 19, 2019 at 11:48 PM Christian Brauner  
> wrote:
> >
> > On Fri, Apr 19, 2019 at 11:21 PM Daniel Colascione  
> > wrote:
> > >
> > > On Fri, Apr 19, 2019 at 1:57 PM Christian Brauner  
> > > wrote:
> > > >
> > > > On Fri, Apr 19, 2019 at 10:34 PM Daniel Colascione  
> > > > wrote:
> > > > >
> > > > > On Fri, Apr 19, 2019 at 12:49 PM Joel Fernandes 
> > > > >  wrote:
> > > > > >
> > > > > > On Fri, Apr 19, 2019 at 09:18:59PM +0200, Christian Brauner wrote:
> > > > > > > On Fri, Apr 19, 2019 at 03:02:47PM -0400, Joel Fernandes wrote:
> > > > > > > > On Thu, Apr 18, 2019 at 07:26:44PM +0200, Christian Brauner 
> > > > > > > > wrote:
> > > > > > > > > On April 18, 2019 7:23:38 PM GMT+02:00, Jann Horn 
> > > > > > > > >  wrote:
> > > > > > > > > >On Wed, Apr 17, 2019 at 3:09 PM Oleg Nesterov 
> > > > > > > > > > wrote:
> > > > > > > > > >> On 04/16, Joel Fernandes wrote:
> > > > > > > > > >> > On Tue, Apr 16, 2019 at 02:04:31PM +0200, Oleg Nesterov 
> > > > > > > > > >> > wrote:
> > > > > > > > > >> > >
> > > > > > > > > >> > > Could you explain when it should return POLLIN? When 
> > > > > > > > > >> > > the whole
> > > > > > > > > >process exits?
> > > > > > > > > >> >
> > > > > > > > > >> > It returns POLLIN when the task is dead or doesn't exist 
> > > > > > > > > >> > anymore,
> > > > > > > > > >or when it
> > > > > > > > > >> > is in a zombie state and there's no other thread in the 
> > > > > > > > > >> > thread
> > > > > > > > > >group.
> > > > > > > > > >>
> > > > > > > > > >> IOW, when the whole thread group exits, so it can't be 
> > > > > > > > > >> used to
> > > > > > > > > >monitor sub-threads.
> > > > > > > > > >>
> > > > > > > > > >> just in case... speaking of this patch it doesn't modify
> > > > > > > > > >proc_tid_base_operations,
> > > > > > > > > >> so you can't poll("/proc/sub-thread-tid") anyway, but iiuc 
> > > > > > > > > >> you are
> > > > > > > > > >going to use
> > > > > > > > > >> the anonymous file returned by CLONE_PIDFD ?
> > > > > > > > > >
> > > > > > > > > >I don't think procfs works that way. /proc/sub-thread-tid has
> > > > > > > > > >proc_tgid_base_operations despite not being a thread group 
> > > > > > > > > >leader.
> > > > > > > > > >(Yes, that's kinda weird.) AFAICS the WARN_ON_ONCE() in this 
> > > > > > > > > >code can
> > > > > > > > > >be hit trivially, and then the code will misbehave.
> > > > > > > > > >
> > > > > > > > > >@Joel: I think you'll have to either rewrite this to 
> > > > > > > > > >explicitly bail
> > > > > > > > > >out if you're dealing with a thread group leader, or make 
> > > > > > > > > >the code
> > > > > > > > > >work for threads, too.
> > > > > > > > >
> > > > > > > > > The latter case probably being preferred if this API is 
> > > > > > > > > supposed to be
> > > > > > > > > useable for thread management in userspace.
> > > > > > > >
> > > > > > > > At the moment, we are not planning to use this for sub-thread 
> > > > > > > > management. I
> > > > > > > > am reworking this patch to only work on clone(2) pidfds which 
> > > > > > > > makes the above
> > > > >

Re: [PATCH RFC 1/2] Add polling support to pidfd

2019-04-19 Thread Daniel Colascione
On Fri, Apr 19, 2019 at 3:18 PM Christian Brauner  wrote:
>
> On Sat, Apr 20, 2019 at 12:08 AM Daniel Colascione  wrote:
> >
> > On Fri, Apr 19, 2019 at 2:45 PM Joel Fernandes  
> > wrote:
> > >
> > > On Fri, Apr 19, 2019 at 02:24:09PM -0700, Daniel Colascione wrote:
> > > > On Fri, Apr 19, 2019 at 2:20 PM Joel Fernandes  
> > > > wrote:
> > > > >
> > > > > On Fri, Apr 19, 2019 at 10:57:11PM +0200, Christian Brauner wrote:
> > > > > > On Fri, Apr 19, 2019 at 10:34 PM Daniel Colascione 
> > > > > >  wrote:
> > > > > > >
> > > > > > > On Fri, Apr 19, 2019 at 12:49 PM Joel Fernandes 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > On Fri, Apr 19, 2019 at 09:18:59PM +0200, Christian Brauner 
> > > > > > > > wrote:
> > > > > > > > > On Fri, Apr 19, 2019 at 03:02:47PM -0400, Joel Fernandes 
> > > > > > > > > wrote:
> > > > > > > > > > On Thu, Apr 18, 2019 at 07:26:44PM +0200, Christian Brauner 
> > > > > > > > > > wrote:
> > > > > > > > > > > On April 18, 2019 7:23:38 PM GMT+02:00, Jann Horn 
> > > > > > > > > > >  wrote:
> > > > > > > > > > > >On Wed, Apr 17, 2019 at 3:09 PM Oleg Nesterov 
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >> On 04/16, Joel Fernandes wrote:
> > > > > > > > > > > >> > On Tue, Apr 16, 2019 at 02:04:31PM +0200, Oleg 
> > > > > > > > > > > >> > Nesterov wrote:
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > > Could you explain when it should return POLLIN? 
> > > > > > > > > > > >> > > When the whole
> > > > > > > > > > > >process exits?
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > It returns POLLIN when the task is dead or doesn't 
> > > > > > > > > > > >> > exist anymore,
> > > > > > > > > > > >or when it
> > > > > > > > > > > >> > is in a zombie state and there's no other thread in 
> > > > > > > > > > > >> > the thread
> > > > > > > > > > > >group.
> > > > > > > > > > > >>
> > > > > > > > > > > >> IOW, when the whole thread group exits, so it can't be 
> > > > > > > > > > > >> used to
> > > > > > > > > > > >monitor sub-threads.
> > > > > > > > > > > >>
> > > > > > > > > > > >> just in case... speaking of this patch it doesn't 
> > > > > > > > > > > >> modify
> > > > > > > > > > > >proc_tid_base_operations,
> > > > > > > > > > > >> so you can't poll("/proc/sub-thread-tid") anyway, but 
> > > > > > > > > > > >> iiuc you are
> > > > > > > > > > > >going to use
> > > > > > > > > > > >> the anonymous file returned by CLONE_PIDFD ?
> > > > > > > > > > > >
> > > > > > > > > > > >I don't think procfs works that way. 
> > > > > > > > > > > >/proc/sub-thread-tid has
> > > > > > > > > > > >proc_tgid_base_operations despite not being a thread 
> > > > > > > > > > > >group leader.
> > > > > > > > > > > >(Yes, that's kinda weird.) AFAICS the WARN_ON_ONCE() in 
> > > > > > > > > > > >this code can
> > > > > > > > > > > >be hit trivially, and then the code will misbehave.
> > > > > > > > > > > >
> > > > > > > > > > > >@Joel: I think you'll have to either rewrite this to 
> > > > > > > > > > > >explicitly bail

Re: [PATCH RFC 1/2] Add polling support to pidfd

2019-04-19 Thread Daniel Colascione
On Fri, Apr 19, 2019 at 2:48 PM Christian Brauner  wrote:
>
> On Fri, Apr 19, 2019 at 11:21 PM Daniel Colascione  wrote:
> >
> > On Fri, Apr 19, 2019 at 1:57 PM Christian Brauner  
> > wrote:
> > >
> > > On Fri, Apr 19, 2019 at 10:34 PM Daniel Colascione  
> > > wrote:
> > > >
> > > > On Fri, Apr 19, 2019 at 12:49 PM Joel Fernandes 
> > > >  wrote:
> > > > >
> > > > > On Fri, Apr 19, 2019 at 09:18:59PM +0200, Christian Brauner wrote:
> > > > > > On Fri, Apr 19, 2019 at 03:02:47PM -0400, Joel Fernandes wrote:
> > > > > > > On Thu, Apr 18, 2019 at 07:26:44PM +0200, Christian Brauner wrote:
> > > > > > > > On April 18, 2019 7:23:38 PM GMT+02:00, Jann Horn 
> > > > > > > >  wrote:
> > > > > > > > >On Wed, Apr 17, 2019 at 3:09 PM Oleg Nesterov 
> > > > > > > > > wrote:
> > > > > > > > >> On 04/16, Joel Fernandes wrote:
> > > > > > > > >> > On Tue, Apr 16, 2019 at 02:04:31PM +0200, Oleg Nesterov 
> > > > > > > > >> > wrote:
> > > > > > > > >> > >
> > > > > > > > >> > > Could you explain when it should return POLLIN? When the 
> > > > > > > > >> > > whole
> > > > > > > > >process exits?
> > > > > > > > >> >
> > > > > > > > >> > It returns POLLIN when the task is dead or doesn't exist 
> > > > > > > > >> > anymore,
> > > > > > > > >or when it
> > > > > > > > >> > is in a zombie state and there's no other thread in the 
> > > > > > > > >> > thread
> > > > > > > > >group.
> > > > > > > > >>
> > > > > > > > >> IOW, when the whole thread group exits, so it can't be used 
> > > > > > > > >> to
> > > > > > > > >monitor sub-threads.
> > > > > > > > >>
> > > > > > > > >> just in case... speaking of this patch it doesn't modify
> > > > > > > > >proc_tid_base_operations,
> > > > > > > > >> so you can't poll("/proc/sub-thread-tid") anyway, but iiuc 
> > > > > > > > >> you are
> > > > > > > > >going to use
> > > > > > > > >> the anonymous file returned by CLONE_PIDFD ?
> > > > > > > > >
> > > > > > > > >I don't think procfs works that way. /proc/sub-thread-tid has
> > > > > > > > >proc_tgid_base_operations despite not being a thread group 
> > > > > > > > >leader.
> > > > > > > > >(Yes, that's kinda weird.) AFAICS the WARN_ON_ONCE() in this 
> > > > > > > > >code can
> > > > > > > > >be hit trivially, and then the code will misbehave.
> > > > > > > > >
> > > > > > > > >@Joel: I think you'll have to either rewrite this to 
> > > > > > > > >explicitly bail
> > > > > > > > >out if you're dealing with a thread group leader, or make the 
> > > > > > > > >code
> > > > > > > > >work for threads, too.
> > > > > > > >
> > > > > > > > The latter case probably being preferred if this API is 
> > > > > > > > supposed to be
> > > > > > > > useable for thread management in userspace.
> > > > > > >
> > > > > > > At the moment, we are not planning to use this for sub-thread 
> > > > > > > management. I
> > > > > > > am reworking this patch to only work on clone(2) pidfds which 
> > > > > > > makes the above
> > > > > >
> > > > > > Indeed and agreed.
> > > > > >
> > > > > > > discussion about /proc a bit unnecessary I think. Per the latest 
> > > > > > > CLONE_PIDFD
> > > > > > > patches, CLONE_THREAD with pidfd is not supported.
> > > > > >
> > > > > > Yes. We have no one asking for it right now and we can easily add 
>

Re: [PATCH RFC 1/2] Add polling support to pidfd

2019-04-19 Thread Daniel Colascione
On Fri, Apr 19, 2019 at 2:45 PM Joel Fernandes  wrote:
>
> On Fri, Apr 19, 2019 at 02:24:09PM -0700, Daniel Colascione wrote:
> > On Fri, Apr 19, 2019 at 2:20 PM Joel Fernandes  
> > wrote:
> > >
> > > On Fri, Apr 19, 2019 at 10:57:11PM +0200, Christian Brauner wrote:
> > > > On Fri, Apr 19, 2019 at 10:34 PM Daniel Colascione  
> > > > wrote:
> > > > >
> > > > > On Fri, Apr 19, 2019 at 12:49 PM Joel Fernandes 
> > > > >  wrote:
> > > > > >
> > > > > > On Fri, Apr 19, 2019 at 09:18:59PM +0200, Christian Brauner wrote:
> > > > > > > On Fri, Apr 19, 2019 at 03:02:47PM -0400, Joel Fernandes wrote:
> > > > > > > > On Thu, Apr 18, 2019 at 07:26:44PM +0200, Christian Brauner 
> > > > > > > > wrote:
> > > > > > > > > On April 18, 2019 7:23:38 PM GMT+02:00, Jann Horn 
> > > > > > > > >  wrote:
> > > > > > > > > >On Wed, Apr 17, 2019 at 3:09 PM Oleg Nesterov 
> > > > > > > > > > wrote:
> > > > > > > > > >> On 04/16, Joel Fernandes wrote:
> > > > > > > > > >> > On Tue, Apr 16, 2019 at 02:04:31PM +0200, Oleg Nesterov 
> > > > > > > > > >> > wrote:
> > > > > > > > > >> > >
> > > > > > > > > >> > > Could you explain when it should return POLLIN? When 
> > > > > > > > > >> > > the whole
> > > > > > > > > >process exits?
> > > > > > > > > >> >
> > > > > > > > > >> > It returns POLLIN when the task is dead or doesn't exist 
> > > > > > > > > >> > anymore,
> > > > > > > > > >or when it
> > > > > > > > > >> > is in a zombie state and there's no other thread in the 
> > > > > > > > > >> > thread
> > > > > > > > > >group.
> > > > > > > > > >>
> > > > > > > > > >> IOW, when the whole thread group exits, so it can't be 
> > > > > > > > > >> used to
> > > > > > > > > >monitor sub-threads.
> > > > > > > > > >>
> > > > > > > > > >> just in case... speaking of this patch it doesn't modify
> > > > > > > > > >proc_tid_base_operations,
> > > > > > > > > >> so you can't poll("/proc/sub-thread-tid") anyway, but iiuc 
> > > > > > > > > >> you are
> > > > > > > > > >going to use
> > > > > > > > > >> the anonymous file returned by CLONE_PIDFD ?
> > > > > > > > > >
> > > > > > > > > >I don't think procfs works that way. /proc/sub-thread-tid has
> > > > > > > > > >proc_tgid_base_operations despite not being a thread group 
> > > > > > > > > >leader.
> > > > > > > > > >(Yes, that's kinda weird.) AFAICS the WARN_ON_ONCE() in this 
> > > > > > > > > >code can
> > > > > > > > > >be hit trivially, and then the code will misbehave.
> > > > > > > > > >
> > > > > > > > > >@Joel: I think you'll have to either rewrite this to 
> > > > > > > > > >explicitly bail
> > > > > > > > > >out if you're dealing with a thread group leader, or make 
> > > > > > > > > >the code
> > > > > > > > > >work for threads, too.
> > > > > > > > >
> > > > > > > > > The latter case probably being preferred if this API is 
> > > > > > > > > supposed to be
> > > > > > > > > useable for thread management in userspace.
> > > > > > > >
> > > > > > > > At the moment, we are not planning to use this for sub-thread 
> > > > > > > > management. I
> > > > > > > > am reworking this patch to only work on clone(2) pidfds which 
> > > > > > > > makes the above
> > > > > > >

Re: [PATCH RFC 1/2] Add polling support to pidfd

2019-04-19 Thread Daniel Colascione
On Fri, Apr 19, 2019 at 2:20 PM Joel Fernandes  wrote:
>
> On Fri, Apr 19, 2019 at 10:57:11PM +0200, Christian Brauner wrote:
> > On Fri, Apr 19, 2019 at 10:34 PM Daniel Colascione  
> > wrote:
> > >
> > > On Fri, Apr 19, 2019 at 12:49 PM Joel Fernandes  
> > > wrote:
> > > >
> > > > On Fri, Apr 19, 2019 at 09:18:59PM +0200, Christian Brauner wrote:
> > > > > On Fri, Apr 19, 2019 at 03:02:47PM -0400, Joel Fernandes wrote:
> > > > > > On Thu, Apr 18, 2019 at 07:26:44PM +0200, Christian Brauner wrote:
> > > > > > > On April 18, 2019 7:23:38 PM GMT+02:00, Jann Horn 
> > > > > > >  wrote:
> > > > > > > >On Wed, Apr 17, 2019 at 3:09 PM Oleg Nesterov  
> > > > > > > >wrote:
> > > > > > > >> On 04/16, Joel Fernandes wrote:
> > > > > > > >> > On Tue, Apr 16, 2019 at 02:04:31PM +0200, Oleg Nesterov 
> > > > > > > >> > wrote:
> > > > > > > >> > >
> > > > > > > >> > > Could you explain when it should return POLLIN? When the 
> > > > > > > >> > > whole
> > > > > > > >process exits?
> > > > > > > >> >
> > > > > > > >> > It returns POLLIN when the task is dead or doesn't exist 
> > > > > > > >> > anymore,
> > > > > > > >or when it
> > > > > > > >> > is in a zombie state and there's no other thread in the 
> > > > > > > >> > thread
> > > > > > > >group.
> > > > > > > >>
> > > > > > > >> IOW, when the whole thread group exits, so it can't be used to
> > > > > > > >monitor sub-threads.
> > > > > > > >>
> > > > > > > >> just in case... speaking of this patch it doesn't modify
> > > > > > > >proc_tid_base_operations,
> > > > > > > >> so you can't poll("/proc/sub-thread-tid") anyway, but iiuc you 
> > > > > > > >> are
> > > > > > > >going to use
> > > > > > > >> the anonymous file returned by CLONE_PIDFD ?
> > > > > > > >
> > > > > > > >I don't think procfs works that way. /proc/sub-thread-tid has
> > > > > > > >proc_tgid_base_operations despite not being a thread group 
> > > > > > > >leader.
> > > > > > > >(Yes, that's kinda weird.) AFAICS the WARN_ON_ONCE() in this 
> > > > > > > >code can
> > > > > > > >be hit trivially, and then the code will misbehave.
> > > > > > > >
> > > > > > > >@Joel: I think you'll have to either rewrite this to explicitly 
> > > > > > > >bail
> > > > > > > >out if you're dealing with a thread group leader, or make the 
> > > > > > > >code
> > > > > > > >work for threads, too.
> > > > > > >
> > > > > > > The latter case probably being preferred if this API is supposed 
> > > > > > > to be
> > > > > > > useable for thread management in userspace.
> > > > > >
> > > > > > At the moment, we are not planning to use this for sub-thread 
> > > > > > management. I
> > > > > > am reworking this patch to only work on clone(2) pidfds which makes 
> > > > > > the above
> > > > >
> > > > > Indeed and agreed.
> > > > >
> > > > > > discussion about /proc a bit unnecessary I think. Per the latest 
> > > > > > CLONE_PIDFD
> > > > > > patches, CLONE_THREAD with pidfd is not supported.
> > > > >
> > > > > Yes. We have no one asking for it right now and we can easily add this
> > > > > later.
> > > > >
> > > > > Admittedly I haven't gotten around to reviewing the patches here yet
> > > > > completely. But one thing about using POLLIN. FreeBSD is using POLLHUP
> > > > > on process exit which I think is nice as well. How about returning
> > > > > POLLIN | POLLHUP on process exit?
> > > > > We already do things like this. For example, when you 

Re: [PATCH RFC 1/2] Add polling support to pidfd

2019-04-19 Thread Daniel Colascione
On Fri, Apr 19, 2019 at 1:57 PM Christian Brauner  wrote:
>
> On Fri, Apr 19, 2019 at 10:34 PM Daniel Colascione  wrote:
> >
> > On Fri, Apr 19, 2019 at 12:49 PM Joel Fernandes  
> > wrote:
> > >
> > > On Fri, Apr 19, 2019 at 09:18:59PM +0200, Christian Brauner wrote:
> > > > On Fri, Apr 19, 2019 at 03:02:47PM -0400, Joel Fernandes wrote:
> > > > > On Thu, Apr 18, 2019 at 07:26:44PM +0200, Christian Brauner wrote:
> > > > > > On April 18, 2019 7:23:38 PM GMT+02:00, Jann Horn 
> > > > > >  wrote:
> > > > > > >On Wed, Apr 17, 2019 at 3:09 PM Oleg Nesterov  
> > > > > > >wrote:
> > > > > > >> On 04/16, Joel Fernandes wrote:
> > > > > > >> > On Tue, Apr 16, 2019 at 02:04:31PM +0200, Oleg Nesterov wrote:
> > > > > > >> > >
> > > > > > >> > > Could you explain when it should return POLLIN? When the 
> > > > > > >> > > whole
> > > > > > >process exits?
> > > > > > >> >
> > > > > > >> > It returns POLLIN when the task is dead or doesn't exist 
> > > > > > >> > anymore,
> > > > > > >or when it
> > > > > > >> > is in a zombie state and there's no other thread in the thread
> > > > > > >group.
> > > > > > >>
> > > > > > >> IOW, when the whole thread group exits, so it can't be used to
> > > > > > >monitor sub-threads.
> > > > > > >>
> > > > > > >> just in case... speaking of this patch it doesn't modify
> > > > > > >proc_tid_base_operations,
> > > > > > >> so you can't poll("/proc/sub-thread-tid") anyway, but iiuc you 
> > > > > > >> are
> > > > > > >going to use
> > > > > > >> the anonymous file returned by CLONE_PIDFD ?
> > > > > > >
> > > > > > >I don't think procfs works that way. /proc/sub-thread-tid has
> > > > > > >proc_tgid_base_operations despite not being a thread group leader.
> > > > > > >(Yes, that's kinda weird.) AFAICS the WARN_ON_ONCE() in this code 
> > > > > > >can
> > > > > > >be hit trivially, and then the code will misbehave.
> > > > > > >
> > > > > > >@Joel: I think you'll have to either rewrite this to explicitly 
> > > > > > >bail
> > > > > > >out if you're dealing with a thread group leader, or make the code
> > > > > > >work for threads, too.
> > > > > >
> > > > > > The latter case probably being preferred if this API is supposed to 
> > > > > > be
> > > > > > useable for thread management in userspace.
> > > > >
> > > > > At the moment, we are not planning to use this for sub-thread 
> > > > > management. I
> > > > > am reworking this patch to only work on clone(2) pidfds which makes 
> > > > > the above
> > > >
> > > > Indeed and agreed.
> > > >
> > > > > discussion about /proc a bit unnecessary I think. Per the latest 
> > > > > CLONE_PIDFD
> > > > > patches, CLONE_THREAD with pidfd is not supported.
> > > >
> > > > Yes. We have no one asking for it right now and we can easily add this
> > > > later.
> > > >
> > > > Admittedly I haven't gotten around to reviewing the patches here yet
> > > > completely. But one thing about using POLLIN. FreeBSD is using POLLHUP
> > > > on process exit which I think is nice as well. How about returning
> > > > POLLIN | POLLHUP on process exit?
> > > > We already do things like this. For example, when you proxy between
> > > > ttys. If the process that you're reading data from has exited and closed
> > > > its end you still can't usually simply exit because it might have still 
> > > > buffered data that you want to read.  The way one can deal with this
> > > > from  userspace is that you can observe a (POLLHUP | POLLIN) event and
> > > > you keep on reading until you only observe a POLLHUP without a POLLIN
> > > > event at which point you know you have read
> > > > all data.
> > > > I like the semantics for pid

Re: [PATCH RFC 1/2] Add polling support to pidfd

2019-04-19 Thread Daniel Colascione
On Fri, Apr 19, 2019 at 12:49 PM Joel Fernandes  wrote:
>
> On Fri, Apr 19, 2019 at 09:18:59PM +0200, Christian Brauner wrote:
> > On Fri, Apr 19, 2019 at 03:02:47PM -0400, Joel Fernandes wrote:
> > > On Thu, Apr 18, 2019 at 07:26:44PM +0200, Christian Brauner wrote:
> > > > On April 18, 2019 7:23:38 PM GMT+02:00, Jann Horn  
> > > > wrote:
> > > > >On Wed, Apr 17, 2019 at 3:09 PM Oleg Nesterov  wrote:
> > > > >> On 04/16, Joel Fernandes wrote:
> > > > >> > On Tue, Apr 16, 2019 at 02:04:31PM +0200, Oleg Nesterov wrote:
> > > > >> > >
> > > > >> > > Could you explain when it should return POLLIN? When the whole
> > > > >process exits?
> > > > >> >
> > > > >> > It returns POLLIN when the task is dead or doesn't exist anymore,
> > > > >or when it
> > > > >> > is in a zombie state and there's no other thread in the thread
> > > > >group.
> > > > >>
> > > > >> IOW, when the whole thread group exits, so it can't be used to
> > > > >monitor sub-threads.
> > > > >>
> > > > >> just in case... speaking of this patch it doesn't modify
> > > > >proc_tid_base_operations,
> > > > >> so you can't poll("/proc/sub-thread-tid") anyway, but iiuc you are
> > > > >going to use
> > > > >> the anonymous file returned by CLONE_PIDFD ?
> > > > >
> > > > >I don't think procfs works that way. /proc/sub-thread-tid has
> > > > >proc_tgid_base_operations despite not being a thread group leader.
> > > > >(Yes, that's kinda weird.) AFAICS the WARN_ON_ONCE() in this code can
> > > > >be hit trivially, and then the code will misbehave.
> > > > >
> > > > >@Joel: I think you'll have to either rewrite this to explicitly bail
> > > > >out if you're dealing with a thread group leader, or make the code
> > > > >work for threads, too.
> > > >
> > > > The latter case probably being preferred if this API is supposed to be
> > > > useable for thread management in userspace.
> > >
> > > At the moment, we are not planning to use this for sub-thread management. 
> > > I
> > > am reworking this patch to only work on clone(2) pidfds which makes the 
> > > above
> >
> > Indeed and agreed.
> >
> > > discussion about /proc a bit unnecessary I think. Per the latest 
> > > CLONE_PIDFD
> > > patches, CLONE_THREAD with pidfd is not supported.
> >
> > Yes. We have no one asking for it right now and we can easily add this
> > later.
> >
> > Admittedly I haven't gotten around to reviewing the patches here yet
> > completely. But one thing about using POLLIN. FreeBSD is using POLLHUP
> > on process exit which I think is nice as well. How about returning
> > POLLIN | POLLHUP on process exit?
> > We already do things like this. For example, when you proxy between
> > ttys. If the process that you're reading data from has exited and closed
> > its end you still can't usually simply exit because it might have still
> > buffered data that you want to read.  The way one can deal with this
> > from  userspace is that you can observe a (POLLHUP | POLLIN) event and
> > you keep on reading until you only observe a POLLHUP without a POLLIN
> > event at which point you know you have read
> > all data.
> > I like the semantics for pidfds as well, as it would indicate:
> > - POLLHUP -> process has exited
> > - POLLIN  -> information can be read
>
> Actually I think a bit differently about this; in my opinion the pidfd should
> always be readable (we would store the exit status somewhere in the future
> which would be readable, even after task_struct is dead). So I was thinking
> we always return EPOLLIN.  If the process has not exited, then it blocks.

ITYM that a pidfd polls as readable *once a task exits* and stays
readable forever. Before a task exit, a poll on a pidfd should *not*
yield POLLIN and reading that pidfd should *not* complete immediately.
There's no way that, having observed POLLIN on a pidfd, you should
ever then *not* see POLLIN on that pidfd in the future --- it's a
one-way transition from not-ready-to-get-exit-status to
ready-to-get-exit-status.
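
To make that contract concrete, here's a minimal sketch of the
consumer side, assuming a pidfd obtained via CLONE_PIDFD (or whatever
mechanism we settle on); it's illustrative, not from any posted patch:

#include <errno.h>
#include <poll.h>

/* Block until the process behind pidfd has exited. Because the
 * not-ready -> ready transition is one-way, a single POLLIN
 * observation is a durable "exit status is available" signal. */
static int wait_for_exit(int pidfd)
{
	struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
	int ret;

	do {
		ret = poll(&pfd, 1, -1);
	} while (ret < 0 && errno == EINTR);
	return (ret == 1 && (pfd.revents & POLLIN)) ? 0 : -1;
}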

Besides, didn't Linus say that he wanted waitpid(2) to be the function
for exit status on a pidfd, not read(2)? It doesn't really matter:
POLLIN on a pidfd would just mean "I, the kernel, say that waitpid on
this FD won't block", whereas for something like a socket, it would mean
"I, the kernel, say read(2) on this FD won't block".

I don't see the need for POLLHUP in pidfds at all. IMHO, we shouldn't
include it. "Hangup" doesn't have an obvious meaning distinct from
exit, and we might want the POLLHUP bit for something else in the
future. What would a caller do with POLLHUP? If the answer is
"something to do with ptrace", let's defer that to some future work,
since ptrace has a ton of moving parts I don't want to consider right
now.


Re: [PATCH RFC 1/2] Add polling support to pidfd

2019-04-18 Thread Daniel Colascione
On Thu, Apr 18, 2019 at 11:44 AM Jonathan Kowalski  wrote:
>
> On Tue, Apr 16, 2019 at 8:21 PM Joel Fernandes  wrote:
> >
> > On Tue, Apr 16, 2019 at 02:04:31PM +0200, Oleg Nesterov wrote:
> > > On 04/11, Joel Fernandes (Google) wrote:
> > > >
> > > > +static unsigned int proc_tgid_base_poll(struct file *file, struct 
> > > > poll_table_struct *pts)
> > > > +{
> > > > +   int poll_flags = 0;
> > > > +   struct task_struct *task;
> > > > +   struct pid *pid;
> > > > +
> > > > +   task = get_proc_task(file->f_path.dentry->d_inode);
> > > > +
> > > > +   WARN_ON_ONCE(task && !thread_group_leader(task));
> > > > +
> > > > +   /*
> > > > +* tasklist_lock must be held to avoid racing with
> > > > +* changes in exit_state and wake up. Basically to avoid:
> > > > +*
> > > > +* P0: read exit_state = 0
> > > > +* P1: write exit_state = EXIT_DEAD
> > > > +* P1: Do a wake up - wq is empty, so do nothing
> > > > +* P0: Queue for polling - wait forever.
> > > > +*/
> > > > +   read_lock(&tasklist_lock);
> > > > +   if (!task)
> > > > +   poll_flags = POLLIN | POLLRDNORM | POLLERR;
> > > > +   else if (task->exit_state == EXIT_DEAD)
> > > > +   poll_flags = POLLIN | POLLRDNORM;
> > > > +   else if (task->exit_state == EXIT_ZOMBIE && 
> > > > thread_group_empty(task))
> > > > +   poll_flags = POLLIN | POLLRDNORM;
> > > > +
> > > > +   if (!poll_flags) {
> > > > +   pid = proc_pid(file->f_path.dentry->d_inode);
> > > > +   poll_wait(file, &pid->wait_pidfd, pts);
> > > > +   }
> > >
> > > can't understand...
> > >
> > > Could you explain when it should return POLLIN? When the whole process 
> > > exits?
> >
> > It returns POLLIN when the task is dead or doesn't exist anymore, or when it
> > is in a zombie state and there's no other thread in the thread group.
> >
>
> Would using something other than POLLIN be an option (maybe POLLPRI)?
> The convention is to use it to indicate readability on the descriptor,
> and also possibly POLLHUP instead of POLLERR (the latter is less of a
> problem, but FreeBSD also does the same, so it'd help with some
> consistency for libraries wanting to use this, which aren't interested
> in other sub states).

Existing event loop libraries generally support checking only for
readability and writability. Not setting POLLIN would make these FDs
more difficult to use with existing event loop libraries. What
advantage would compensate for this difficulty?
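
As a concrete sketch, with plain POLLIN semantics a pidfd drops into
an epoll-based loop like any other fd; nothing below is
pidfd-specific:

#include <sys/epoll.h>

/* Register a pidfd in an ordinary epoll loop. EPOLLIN is the one
 * event bit every event-loop library already knows how to watch. */
static int watch_pidfd(int epfd, int pidfd)
{
	struct epoll_event ev = {
		.events = EPOLLIN,
		.data.fd = pidfd,
	};

	return epoll_ctl(epfd, EPOLL_CTL_ADD, pidfd, &ev);
}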


Re: [PATCH RFC 1/2] Add polling support to pidfd

2019-04-18 Thread Daniel Colascione
On Thu, Apr 18, 2019 at 10:26 AM Christian Brauner  wrote:
>
> On April 18, 2019 7:23:38 PM GMT+02:00, Jann Horn  wrote:
> >On Wed, Apr 17, 2019 at 3:09 PM Oleg Nesterov  wrote:
> >> On 04/16, Joel Fernandes wrote:
> >> > On Tue, Apr 16, 2019 at 02:04:31PM +0200, Oleg Nesterov wrote:
> >> > >
> >> > > Could you explain when it should return POLLIN? When the whole
> >process exits?
> >> >
> >> > It returns POLLIN when the task is dead or doesn't exist anymore,
> >or when it
> >> > is in a zombie state and there's no other thread in the thread
> >group.
> >>
> >> IOW, when the whole thread group exits, so it can't be used to
> >monitor sub-threads.
> >>
> >> just in case... speaking of this patch it doesn't modify
> >proc_tid_base_operations,
> >> so you can't poll("/proc/sub-thread-tid") anyway, but iiuc you are
> >going to use
> >> the anonymous file returned by CLONE_PIDFD ?
> >
> >I don't think procfs works that way. /proc/sub-thread-tid has
> >proc_tgid_base_operations despite not being a thread group leader.

Huh. That seems very weird. Is it too late to change that now? It feels like a bug.

> >(Yes, that's kinda weird.) AFAICS the WARN_ON_ONCE() in this code can
> >be hit trivially, and then the code will misbehave.
> >
> >@Joel: I think you'll have to either rewrite this to explicitly bail
> >out if you're dealing with a thread group leader

If you're _not_ dealing with a leader, right?

> , or make the code
> >work for threads, too.
> The latter case probably being preferred if this API is supposed to be 
> useable for thread management in userspace.

IMHO, focusing on the thread group case for now might be best. We can
always support thread management in future work.

Besides: I'm not sure that we need kernel support for thread
monitoring. Can't libc provide a pollable FD for a thread internally?
libc can always run code just before thread exit, and it can wake a
signalfd at that point. Directly terminating individual threads
without going through userland is something that breaks the process
anyway: it's legal and normal to SIGKILL a process as a whole, but if an
individual thread terminates without going through libc, the process
is likely fatally broken regardless. (What if it's holding the
heap lock?)

I know that some tools want to wait for the termination of individual
threads in an externally monitored process, but couldn't these tools
cooperate with libc to get these per-thread eventfds?

Is there a use case I'm missing?
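
For concreteness, here's a rough userland-only sketch of the
per-thread exit fd idea. Every name in it is invented, and an eventfd
stands in for the pollable object:

#include <pthread.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

struct thread_exit_note {
	int exit_fd;		/* becomes readable on thread exit */
	void *(*fn)(void *);
	void *arg;
};

static void fire_exit_note(void *cookie)
{
	struct thread_exit_note *note = cookie;
	uint64_t one = 1;

	/* Cleanup handlers run just before thread exit, even on
	 * pthread_exit() or cancellation. */
	write(note->exit_fd, &one, sizeof(one));
}

static void *trampoline(void *cookie)
{
	struct thread_exit_note *note = cookie;
	void *ret;

	pthread_cleanup_push(fire_exit_note, note);
	ret = note->fn(note->arg);
	pthread_cleanup_pop(1);
	return ret;
}

/* Like pthread_create(), except note->exit_fd can be handed to any
 * poll/epoll loop to watch for this thread's termination. */
static int create_watched_thread(pthread_t *tid,
				 struct thread_exit_note *note,
				 void *(*fn)(void *), void *arg)
{
	note->exit_fd = eventfd(0, EFD_CLOEXEC);
	if (note->exit_fd < 0)
		return -1;
	note->fn = fn;
	note->arg = arg;
	return pthread_create(tid, NULL, trampoline, note);
}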


Re: [PATCH 2/4] clone: add CLONE_PIDFD

2019-04-15 Thread Daniel Colascione
On Mon, Apr 15, 2019 at 10:15 AM Jonathan Kowalski  wrote:
> > Why else do we want pidfd?
>
> Apart from what others have already pointed out, there are two other
> things I am looking forward to:

Everything that Christian, Joel, and Jonathan have said is right.

If I can wax philosophical for a bit (as I've been accused to doing
:-)), there's a lot of value in consistency itself, a "more than the
sum of its parts" effect that arises from modeling all kernel-resource
handles as file descriptors. You get lifecycle consistency, API
consistency (e.g., dup, close), introspection consistency (via
/proc/pid/fd and friends), wait consistency, IPC consistency, and tons
of other benefits from using a file descriptor. The alternatives tend
to be very ugly: one of SysV IPC's* biggest mistakes, for example, was
having users manage its resources via non-file-descriptor kernel
handles. The process is, I think, the last major class of kernel
resource that users can't manipulate via file descriptor. Even if
using pidfds didn't provide strong immediate and direct benefits, it'd
*still* be worth moving to a file descriptor resource handle model for
the sake of making the system interface regular and uniform.

* Does anyone know *why* the SysV people didn't use FDs? The
consistency argument I'm making was just as relevant then as it is
now.


Re: [PATCH RFC 1/2] Add polling support to pidfd

2019-04-12 Thread Daniel Colascione
[Resending due to accidental HTML. I need to take Joel's advice and
switch to a real email client]

On Fri, Apr 12, 2019 at 5:54 PM Daniel Colascione  wrote:
>
> On Fri, Apr 12, 2019 at 5:09 PM Joel Fernandes  wrote:
>>
>> Hi Andy!
>>
>> On Fri, Apr 12, 2019 at 02:32:53PM -0700, Andy Lutomirski wrote:
>> > On Thu, Apr 11, 2019 at 10:51 AM Joel Fernandes (Google)
>> >  wrote:
>> > >
>> > > pidfd are /proc/pid directory file descriptors referring to a task group
>> > > leader. Android low memory killer (LMK) needs pidfd polling support to
>> > > replace code that currently checks for existence of /proc/pid for
>> > > knowing a process that is signalled to be killed has died, which is both
>> > > racy and slow. The pidfd poll approach is race-free, and also allows the
>> > > LMK to do other things (such as by polling on other fds) while awaiting
>> > > the process being killed to die.
>> > >
>> > > It prevents a situation where a PID is reused between when LMK sends a
>> > > kill signal and checks for existence of the PID, since the wrong PID is
>> > > now possibly checked for existence.
>> > >
> > > > In this patch, we follow the same mechanism used when the parent of the
>> > > task group is to be notified, that is when the tasks waiting on a poll
>> > > of pidfd are also awakened.
>> > >
>> > > We have decided to include the waitqueue in struct pid for the following
>> > > reasons:
>> > > 1. The wait queue has to survive for the lifetime of the poll. Including
>> > > it in task_struct would not be option in this case because the task can
>> > > be reaped and destroyed before the poll returns.
>> >
>> > Are you sure?  I admit I'm not all that familiar with the innards of
>> > poll() on Linux, but I thought that the waitqueue only had to survive
>> > long enough to kick the polling thread and did *not* have to survive
>> > until poll() actually returned.
>>
>> I am not sure now. I thought epoll(2) was based on the wait_event APIs,
>> however more closely looking at the eventpoll code, it looks like there are 2
>> waitqueues involved, one that we pass and the other that is a part of the
>> eventpoll session itself, so you could be right about that. Daniel Colascione
>> may have some more thoughts about it since he brought up the possibility of a
>> wq life-time issue. Daniel?   We were just playing it safe.

I think you (Joel) and Andy are talking about different meanings of
poll(). Joel is talking about the VFS method; Andy is talking about
the system call. ISTM that the lifetime of the wait queue we give to
poll_wait needs to last through the poll. Normally the wait queue gets
pinned by the struct file that we give to poll_wait (which takes a
reference on the struct file), but the pidfd struct file doesn't pin
the struct task, so we can't use a wait queue in struct task.
(remove_wait_queue, which poll implementations call to undo wait queue
additions, takes the wait queue head we pass to poll_wait, and we
don't want to pass a dangling pointer to remove_wait_queue.) If the
lifetime requirements for the queue aren't this strict, I don't see it
documented anywhere. Besides: if we don't actually need to pin the
waitqueue lifetime for the duration of the poll, why bother taking a
reference on the polled struct file?
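
To spell out the kernel-side shape that avoids the dangling pointer,
here's a stripped-down sketch (not the posted patch) assuming the
struct-pid-embedded queue from this series and a struct file whose
private_data pins the struct pid:

/* Sketch only: file->private_data is assumed to hold a reference on
 * a struct pid, so &pid->wait_pidfd outlives every waiter that
 * poll_wait() registers. A queue embedded in task_struct could be
 * freed while still on someone's poll table, which is exactly the
 * dangling-pointer problem described above. */
static unsigned int pidfd_poll(struct file *file,
			       struct poll_table_struct *pts)
{
	struct pid *pid = file->private_data;
	unsigned int poll_flags = 0;

	poll_wait(file, &pid->wait_pidfd, pts);
	/* Readiness check simplified: report readable once no live
	 * task is attached to the struct pid. */
	rcu_read_lock();
	if (!pid_task(pid, PIDTYPE_PID))
		poll_flags = POLLIN | POLLRDNORM;
	rcu_read_unlock();
	return poll_flags;
}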


Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing

2019-04-12 Thread Daniel Colascione
On Fri, Apr 12, 2019 at 7:14 AM Daniel Colascione  wrote:
>
> On Thu, Apr 11, 2019 at 11:53 PM Michal Hocko  wrote:
> >
> > On Thu 11-04-19 08:33:13, Matthew Wilcox wrote:
> > > On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote:
> > > > Add new SS_EXPEDITE flag to be used when sending SIGKILL via
> > > > pidfd_send_signal() syscall to allow expedited memory reclaim of the
> > > > victim process. The usage of this flag is currently limited to SIGKILL
> > > > signal and only to privileged users.
> > >
> > > What is the downside of doing expedited memory reclaim?  ie why not do it
> > > every time a process is going to die?
> >
> > Well, you are tearing down an address space which might be still in use
> > because the task not fully dead yeat. So there are two downsides AFAICS.
> > Core dumping which will not see the reaped memory so the resulting
>
> Test for SIGNAL_GROUP_COREDUMP before doing any of this then. If you
> try to start a core dump after reaping begins, too bad: you could have
> raced with process death anyway.
>
> > coredump might be incomplete. And unexpected #PF/gup on the reaped
> > memory will result in SIGBUS.
>
> It's a dying process. Why even bother returning from the fault
> handler? Just treat that situation as a thread exit. There's no need
> to make this observable to userspace at all.

Just for clarity, checking the code, I think we already do this.
zap_other_threads sets SIGKILL pending on every thread in the group,
and we'll handle SIGKILL in the process of taking any page fault or
doing any system call, so I don't think it's actually possible for a
thread in a dying process to observe the SIGBUS that reaping in theory
can generate.


Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing

2019-04-12 Thread Daniel Colascione
On Fri, Apr 12, 2019 at 7:15 AM Suren Baghdasaryan  wrote:
>
> On Thu, Apr 11, 2019 at 11:49 PM Michal Hocko  wrote:
> >
> > On Thu 11-04-19 10:47:50, Daniel Colascione wrote:
> > > On Thu, Apr 11, 2019 at 10:36 AM Matthew Wilcox  
> > > wrote:
> > > >
> > > > On Thu, Apr 11, 2019 at 10:33:32AM -0700, Daniel Colascione wrote:
> > > > > On Thu, Apr 11, 2019 at 10:09 AM Suren Baghdasaryan 
> > > > >  wrote:
> > > > > > On Thu, Apr 11, 2019 at 8:33 AM Matthew Wilcox 
> > > > > >  wrote:
> > > > > > >
> > > > > > > On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan 
> > > > > > > wrote:
> > > > > > > > Add new SS_EXPEDITE flag to be used when sending SIGKILL via
> > > > > > > > pidfd_send_signal() syscall to allow expedited memory reclaim 
> > > > > > > > of the
> > > > > > > > victim process. The usage of this flag is currently limited to 
> > > > > > > > SIGKILL
> > > > > > > > signal and only to privileged users.
> > > > > > >
> > > > > > > What is the downside of doing expedited memory reclaim?  ie why 
> > > > > > > not do it
> > > > > > > every time a process is going to die?
> > > > > >
> > > > > > I think with an implementation that does not use/abuse oom-reaper
> > > > > > thread this could be done for any kill. As I mentioned oom-reaper 
> > > > > > is a
> > > > > > limited resource which has access to memory reserves and should not 
> > > > > > be
> > > > > > abused in the way I do in this reference implementation.
> > > > > > While there might be downsides that I don't know of, I'm not sure 
> > > > > > it's
> > > > > > required to hurry every kill's memory reclaim. I think there are 
> > > > > > cases
> > > > > > when resource deallocation is critical, for example when we kill to
> > > > > > relieve resource shortage and there are kills when reclaim speed is
> > > > > > not essential. It would be great if we can identify urgent cases
> > > > > > without userspace hints, so I'm open to suggestions that do not
> > > > > > involve additional flags.
> > > > >
> > > > > I was imagining a PI-ish approach where we'd reap in case an RT
> > > > > process was waiting on the death of some other process. I'd still
> > > > > prefer the API I proposed in the other message because it gets the
> > > > > kernel out of the business of deciding what the right signal is. I'm a
> > > > > huge believer in "mechanism, not policy".
> > > >
> > > > It's not a question of the kernel deciding what the right signal is.
> > > > The kernel knows whether a signal is fatal to a particular process or 
> > > > not.
> > > > The question is whether the killing process should do the work of 
> > > > reaping
> > > > the dying process's resources sometimes, always or never.  Currently,
> > > > that is never (the process reaps its own resources); Suren is suggesting
> > > > sometimes, and I'm asking "Why not always?"
> > >
> > > FWIW, Suren's initial proposal is that the oom_reaper kthread do the
> > > reaping, not the process sending the kill. Are you suggesting that
> > > sending SIGKILL should spend a while in signal delivery reaping pages
> > > before returning? I thought about just doing it this way, but I didn't
> > > like the idea: it'd slow down mass-killing programs like killall(1).
> > > Programs expect sending SIGKILL to be a fast operation that returns
> > > immediately.
> >
> > I was thinking about this as well. And SYNC_SIGKILL would work around the

SYNC_SIGKILL (which, I presume, blocks in kill(2)) was proposed on
many occasions while we discussed pidfd waits over the past six months
or so. We've decided to just make pidfds pollable instead. The kernel
already has several ways to express the idea that a task should wait
for another task to die, and I don't think we need another. If you
want a process that's waiting for a task to exit to help reap that
task, great --- that's an option we've talked about --- but we don't
need a new interface to do it, since the kernel already has all the
information it needs.

> > current

Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing

2019-04-12 Thread Daniel Colascione
On Thu, Apr 11, 2019 at 11:53 PM Michal Hocko  wrote:
>
> On Thu 11-04-19 08:33:13, Matthew Wilcox wrote:
> > On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote:
> > > Add new SS_EXPEDITE flag to be used when sending SIGKILL via
> > > pidfd_send_signal() syscall to allow expedited memory reclaim of the
> > > victim process. The usage of this flag is currently limited to SIGKILL
> > > signal and only to privileged users.
> >
> > What is the downside of doing expedited memory reclaim?  ie why not do it
> > every time a process is going to die?
>
> Well, you are tearing down an address space which might be still in use
> because the task is not fully dead yet. So there are two downsides AFAICS.
> Core dumping which will not see the reaped memory so the resulting

Test for SIGNAL_GROUP_COREDUMP before doing any of this then. If you
try to start a core dump after reaping begins, too bad: you could have
raced with process death anyway.

> coredump might be incomplete. And unexpected #PF/gup on the reaped
> memory will result in SIGBUS.

It's a dying process. Why even bother returning from the fault
handler? Just treat that situation as a thread exit. There's no need
to make this observable to userspace at all.


Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing

2019-04-11 Thread Daniel Colascione
On Thu, Apr 11, 2019 at 10:36 AM Matthew Wilcox  wrote:
>
> On Thu, Apr 11, 2019 at 10:33:32AM -0700, Daniel Colascione wrote:
> > On Thu, Apr 11, 2019 at 10:09 AM Suren Baghdasaryan  
> > wrote:
> > > On Thu, Apr 11, 2019 at 8:33 AM Matthew Wilcox  
> > > wrote:
> > > >
> > > > On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote:
> > > > > Add new SS_EXPEDITE flag to be used when sending SIGKILL via
> > > > > pidfd_send_signal() syscall to allow expedited memory reclaim of the
> > > > > victim process. The usage of this flag is currently limited to SIGKILL
> > > > > signal and only to privileged users.
> > > >
> > > > What is the downside of doing expedited memory reclaim?  ie why not do 
> > > > it
> > > > every time a process is going to die?
> > >
> > > I think with an implementation that does not use/abuse oom-reaper
> > > thread this could be done for any kill. As I mentioned oom-reaper is a
> > > limited resource which has access to memory reserves and should not be
> > > abused in the way I do in this reference implementation.
> > > While there might be downsides that I don't know of, I'm not sure it's
> > > required to hurry every kill's memory reclaim. I think there are cases
> > > when resource deallocation is critical, for example when we kill to
> > > relieve resource shortage and there are kills when reclaim speed is
> > > not essential. It would be great if we can identify urgent cases
> > > without userspace hints, so I'm open to suggestions that do not
> > > involve additional flags.
> >
> > I was imagining a PI-ish approach where we'd reap in case an RT
> > process was waiting on the death of some other process. I'd still
> > prefer the API I proposed in the other message because it gets the
> > kernel out of the business of deciding what the right signal is. I'm a
> > huge believer in "mechanism, not policy".
>
> It's not a question of the kernel deciding what the right signal is.
> The kernel knows whether a signal is fatal to a particular process or not.
> The question is whether the killing process should do the work of reaping
> the dying process's resources sometimes, always or never.  Currently,
> that is never (the process reaps its own resources); Suren is suggesting
> sometimes, and I'm asking "Why not always?"

FWIW, Suren's initial proposal is that the oom_reaper kthread do the
reaping, not the process sending the kill. Are you suggesting that
sending SIGKILL should spend a while in signal delivery reaping pages
before returning? I thought about just doing it this way, but I didn't
like the idea: it'd slow down mass-killing programs like killall(1).
Programs expect sending SIGKILL to be a fast operation that returns
immediately.


Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing

2019-04-11 Thread Daniel Colascione
On Thu, Apr 11, 2019 at 10:09 AM Suren Baghdasaryan  wrote:
>
> On Thu, Apr 11, 2019 at 8:33 AM Matthew Wilcox  wrote:
> >
> > On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote:
> > > Add new SS_EXPEDITE flag to be used when sending SIGKILL via
> > > pidfd_send_signal() syscall to allow expedited memory reclaim of the
> > > victim process. The usage of this flag is currently limited to SIGKILL
> > > signal and only to privileged users.
> >
> > What is the downside of doing expedited memory reclaim?  ie why not do it
> > every time a process is going to die?
>
> I think with an implementation that does not use/abuse oom-reaper
> thread this could be done for any kill. As I mentioned oom-reaper is a
> limited resource which has access to memory reserves and should not be
> abused in the way I do in this reference implementation.
> While there might be downsides that I don't know of, I'm not sure it's
> required to hurry every kill's memory reclaim. I think there are cases
> when resource deallocation is critical, for example when we kill to
> relieve resource shortage and there are kills when reclaim speed is
> not essential. It would be great if we can identify urgent cases
> without userspace hints, so I'm open to suggestions that do not
> involve additional flags.

I was imagining a PI-ish approach where we'd reap in case an RT
process was waiting on the death of some other process. I'd still
prefer the API I proposed in the other message because it gets the
kernel out of the business of deciding what the right signal is. I'm a
huge believer in "mechanism, not policy".


Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing

2019-04-11 Thread Daniel Colascione
On Thu, Apr 11, 2019 at 8:23 AM Suren Baghdasaryan  wrote:
> > > On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote:
> > > > Add new SS_EXPEDITE flag to be used when sending SIGKILL via
> > > > pidfd_send_signal() syscall to allow expedited memory reclaim of the
> > > > victim process. The usage of this flag is currently limited to SIGKILL
> > > > signal and only to privileged users.

FWIW, I like Suren's general idea, but I was thinking of a different
way of exposing the same general functionality to userspace. The way I
look at it, it's very useful for an auto-balancing memory system like
Android (or, I presume, something that uses oomd) to recover memory
*immediately* after a SIGKILL instead of waiting for the process to
kill itself: a process's death can be delayed for a long time due to
factors like scheduling and being blocked in various uninterruptible
kernel-side operations. Suren's proposal is basically about pulling
forward in time page reclamation that would happen anyway.

What if we let userspace control exactly when this reclamation
happens? I'm imagining a new* kernel facility that basically looks
like this. It lets lmkd determine for itself how much work the system
should expend on reclaiming memory from dying processes.

size_t try_reap_dying_process(
  int pidfd,
  int flags /* must be zero */,
  size_t maximum_bytes_to_reap);

Precondition: process is pending group-exit (someone already sent it SIGKILL)
Postcondition: some memory reclaimed from dying process
Invariant: doesn't sleep; stops reaping after MAXIMUM_BYTES_TO_REAP

-> success: return number of bytes reaped
-> failure: (size_t)-1

EBUSY: couldn't get mmap_sem
EINVAL: PIDFD isn't a pidfd or otherwise invalid arguments
EPERM: process hasn't been sent SIGKILL: try_reap_dying_process on a
process that isn't dying is illegal

Kernel-side, try_reap_dying_process would try-acquire mmap_sem and
just fail if it couldn't get it. Once acquired, it would release
"easy" pages (using the same check the oom reaper uses) until it
either ran out of pages or hit the MAXIMUM_BYTES_TO_REAP cap. The
purpose of MAXIMUM_BYTES_TO_REAP is to let userspace bound from above the
amount of time we spend reclaiming pages. It'd be up to userspace to
set policy on retries, the actual value of the reap cap, the priority
at which we run TRY_REAP_DYING_PROCESS, and so on. We return the
number of bytes we managed to free this way so that lmkd can make an
informed decision about what to do next, e.g., kill something else or
wait a little while.
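
For illustration, the lmkd side might look something like the sketch
below. try_reap_dying_process() is hypothetical (it's exactly the
proposal above), so an extern declaration stands in for the eventual
syscall wrapper; REAP_CHUNK is a made-up policy knob:

#include <signal.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

#define REAP_CHUNK	(16u << 20)	/* example policy: 16 MiB per pass */

static inline int sys_pidfd_send_signal(int pidfd, int sig, siginfo_t *info,
					unsigned int flags)
{
	return syscall(__NR_pidfd_send_signal, pidfd, sig, info, flags);
}

/* Hypothetical: stands in for the proposed facility above. */
extern size_t try_reap_dying_process(int pidfd, int flags,
				     size_t maximum_bytes_to_reap);

static void kill_and_reap(int pidfd)
{
	size_t freed;

	if (sys_pidfd_send_signal(pidfd, SIGKILL, NULL, 0) < 0)
		return;
	do {
		/* Each pass is bounded above by REAP_CHUNK, so the
		 * caller controls how much reaping work it buys. */
		freed = try_reap_dying_process(pidfd, 0, REAP_CHUNK);
	} while (freed != 0 && freed != (size_t)-1);
}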

Personally, I like this approach a bit more than recruiting the oom
reaper, because it doesn't touch any kind of emergency memory
reserve permission and because it frees us from having to think about
whether the oom reaper's thread priority is right for this particular
job.

It also occurred to me that try_reap_dying_process might make a decent
shrinker callback. Shrinkers are there, AIUI, to reclaim memory that's
easy to free and that's not essential for correct kernel operation.
Usually, it's some kind of cache that meets these criteria. But the
private pages of a dying process also meet the criteria, don't they?
I'm imagining the shrinker just picking an arbitrary doomed (dying but
not yet dead) process and freeing some of its pages. I know there are
concerns about slow shrinkers causing havoc throughout the system, but
since this shrinker would be bounded above on CPU time and would never
block, I feel like it'd be pretty safe.

* insert standard missive about system calls being cheap, but we can
talk about the way in which we expose this functionality after we
agree that it's a good idea generally


Re: [RFC PATCH] fork: add CLONE_PIDFD

2019-04-10 Thread Daniel Colascione
Thanks for trying it both ways.

On Wed, Apr 10, 2019 at 4:43 PM Christian Brauner  wrote:
>
> Hey Linus,
>
> This is an RFC for adding a new CLONE_PIDFD flag to clone() as
> previously discussed.
> While implementing this Jann and I ran into additional complexity that
> prompted us to send out an initial RFC patchset to make sure we still
> think going forward with the current implementation is a good idea and
> also provide an alternative approach:
>
> RFC-1:
> This is an RFC for the implementation of pidfds as /proc/ file
> descriptors.
> The tricky part here is that we need to retrieve a file descriptor for
> /proc/ before clone's point of no return. Otherwise, we need to fail
> the creation of a process that has already passed all barriers and is
> visible in userspace. Getting that file descriptor then becomes a rather
> intricate dance including allocating a detached dentry that we need to
> commit once attach_pid() has been called.
> Note that this RFC only includes the logic we think is needed to return
> /proc/ file descriptors from clone. It does *not* yet include the even
> more complex logic needed to restrict procfs itself. And the additional
> logic needed to prevent attacks such as openat(pidfd, "..", ...) and access
> to /proc//net/ on top of the procfs restriction.

Why would filtering proc be all that complicated? Wouldn't it just be
adding a "sensitive" flag to struct pid_entry and skipping entries
with that flag when constructing proc entries?
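
Something like the sketch below is what I have in mind. struct
pid_entry already exists in fs/proc/base.c in roughly this shape; the
flags field, the PROC_ENTRY_SENSITIVE bit, and the visibility helper
are all hypothetical:

#define PROC_ENTRY_SENSITIVE	0x1

/* One extra field on the existing table of /proc/<pid> entries ... */
struct pid_entry {
	const char *name;
	unsigned int len;
	umode_t mode;
	const struct inode_operations *iop;
	const struct file_operations *fop;
	union proc_op op;
	unsigned int flags;	/* new: PROC_ENTRY_SENSITIVE etc. */
};

/* ... and a skip in the lookup/readdir paths whenever the mount (or
 * the pidfd) asked for the restricted view. */
static bool proc_entry_visible(const struct pid_entry *p, bool restricted)
{
	return !restricted || !(p->flags & PROC_ENTRY_SENSITIVE);
}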

> There are a couple of reasons why we stopped short of this and decided to
> sent out an RFC first:
> - Even the initial part of getting file descriptors from /proc/ out
>   of clone() required rather complex code that struck us as very
>   inelegant and heavy (which granted, might partially caused by not seeing
>   a cleaner way to implement this). Thus, it felt like we needed to see
>   whether this is even remotely considered acceptable.
> - While discussing further aspects of this approach with Al we received
>   rather substantiated opposition to exposing even more codepaths to
>   procfs.
> - Restricting access to procfs properly requires a lot of invasive work
>   even touching core vfs functions such as
>   follow_dotdot()/follow_dotdot_rcu() which also caused 2.

Wasn't an internal bind mount supposed to take care of the parent
traversal problem?


Re: [RFC-2 PATCH 4/4] samples: show race-free pidfd metadata access

2019-04-10 Thread Daniel Colascione
Thanks for providing this example. A few nits below.

On Wed, Apr 10, 2019 at 4:43 PM Christian Brauner  wrote:
>
> This is an sample program to show userspace how to get race-free access to
> process metadata from a pidfd.
> It is really not that difficult and instead of burdening the kernel with
> this task by using fds to /proc/ we can simply add a helper to libc
> that does it for the user.
>
> Signed-off-by: Christian Brauner 
> Signed-off-by: Jann Horn 
> Cc: Arnd Bergmann 
> Cc: "Eric W. Biederman" 
> Cc: Kees Cook 
> Cc: Alexey Dobriyan 
> Cc: Thomas Gleixner 
> Cc: David Howells 
> Cc: "Michael Kerrisk (man-pages)" 
> Cc: Jonathan Kowalski 
> Cc: "Dmitry V. Levin" 
> Cc: Andy Lutomirsky 
> Cc: Andrew Morton 
> Cc: Oleg Nesterov 
> Cc: Aleksa Sarai 
> Cc: Linus Torvalds 
> Cc: Al Viro 
> ---
>  samples/Makefile   |   2 +-
>  samples/pidfd/Makefile |   6 ++
>  samples/pidfd/pidfd-metadata.c | 169 +
>  3 files changed, 176 insertions(+), 1 deletion(-)
>  create mode 100644 samples/pidfd/Makefile
>  create mode 100644 samples/pidfd/pidfd-metadata.c
>
> diff --git a/samples/Makefile b/samples/Makefile
> index b1142a958811..fadadb1c3b05 100644
> --- a/samples/Makefile
> +++ b/samples/Makefile
> @@ -3,4 +3,4 @@
>  obj-$(CONFIG_SAMPLES)  += kobject/ kprobes/ trace_events/ livepatch/ \
>hw_breakpoint/ kfifo/ kdb/ hidraw/ rpmsg/ seccomp/ 
> \
>configfs/ connector/ v4l/ trace_printk/ \
> -  vfio-mdev/ statx/ qmi/ binderfs/
> +  vfio-mdev/ statx/ qmi/ binderfs/ pidfd/
> diff --git a/samples/pidfd/Makefile b/samples/pidfd/Makefile
> new file mode 100644
> index ..0ff97784177a
> --- /dev/null
> +++ b/samples/pidfd/Makefile
> @@ -0,0 +1,6 @@
> +# SPDX-License-Identifier: GPL-2.0
> +
> +hostprogs-y := pidfd-metadata
> +always := $(hostprogs-y)
> +HOSTCFLAGS_pidfd-metadata.o += -I$(objtree)/usr/include
> +all: pidfd-metadata
> diff --git a/samples/pidfd/pidfd-metadata.c b/samples/pidfd/pidfd-metadata.c
> new file mode 100644
> index ..c46c6c34a012
> --- /dev/null
> +++ b/samples/pidfd/pidfd-metadata.c
> @@ -0,0 +1,169 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#define _GNU_SOURCE
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#ifndef CLONE_PIDFD
> +#define CLONE_PIDFD 0x1000
> +#endif
> +
> +static int raw_clone_pidfd(void)
> +{
> +   unsigned long flags = CLONE_PIDFD;
> +
> +#if defined(__s390x__) || defined(__s390__) || defined(__CRIS__)
> +   /* On s390/s390x and cris the order of the first and second arguments
> + * of the system call is reversed.
> + */
> +   return (int)syscall(__NR_clone, NULL, flags | SIGCHLD);
> +#elif defined(__sparc__) && defined(__arch64__)
> +   {
> +   /*
> + * sparc64 always returns the other process id in %o0, and a
> + * boolean flag whether this is the child or the parent in 
> %o1.
> + * Inline assembly is needed to get the flag returned in %o1.
> + */
> +   int in_child;
> +   int child_pid;
> +   asm volatile("mov %2, %%g1\n\t"
> +"mov %3, %%o0\n\t"
> +"mov 0 , %%o1\n\t"
> +"t 0x6d\n\t"
> +"mov %%o1, %0\n\t"
> +"mov %%o0, %1"
> +: "=r"(in_child), "=r"(child_pid)
> +: "i"(__NR_clone), "r"(flags | SIGCHLD)
> +: "%o1", "%o0", "%g1");
> +
> +   if (in_child)
> +   return 0;
> +   else
> +   return child_pid;
> +   }
> +#elif defined(__ia64__)
> +   /* On ia64 the stack and stack size are passed as separate arguments. 
> */
> +   return (int)syscall(__NR_clone, flags | SIGCHLD, NULL, prctl_arg(0));
> +#else
> +   return (int)syscall(__NR_clone, flags | SIGCHLD, NULL);
> +#endif
> +}
> +
> +static inline int sys_pidfd_send_signal(int pidfd, int sig, siginfo_t *info,
> +   unsigned int flags)
> +{
> +   return syscall(__NR_pidfd_send_signal, pidfd, sig, info, flags);
> +}
> +
> +static int pidfd_metadata_fd(int pidfd)
> +{
> +   int procfd, ret;
> +   char path[100];
> +   FILE *f;
> +   size_t n = 0;
> +   char *line = NULL;
> +
> +   snprintf(path, sizeof(path), "/proc/self/fdinfo/%d", pidfd);
> +
> +   f = fopen(path, "re");
> +   if (!f)
> +   return -1;
> +
> +   ret = 0;
> > +   while (getline(&line, &n, f) != -1) {
> +   char *numstr;
> +   size_t len;
> +
> +   if (strncmp(line, 

Re: [PATCH v2 0/5] pid: add pidfd_open()

2019-04-01 Thread Daniel Colascione
On Mon, Apr 1, 2019 at 3:13 PM Linus Torvalds
 wrote:
>
> On Mon, Apr 1, 2019 at 2:58 PM Jonathan Kowalski  wrote:
> >
> > You mention the race about learning the PID, PID being recycled, and
> > pidfd_open getting the wrong reference.
> >
> > This exists with the /proc model too. How do you propose to address this?
>
> Note that that race exists _regardless_ of any interfaces.
> pidfd_open() has the same race: any time you have a pid, the lifetime
> of it is only as long as the process existing.
>
> That's why we talked about the CLONE_PIDFD flag, which would return
> the pidfd itself when creating a new process. That's one truly
> race-free way to handle it.

Yes. Returning a pidfd from clone seems like a simple and robust approach.

> Or just do the fork(), and know that the pid won't be re-used until
> you've done the wait() for it, and block SIGCHLD until you've done the
> lookup.

That doesn't work when some other thread is running a waitpid(-1)
loop. I think it's important to create an interface that libraries can
use without global coordination.

> That said, in *practice*, you can probably use any of the racy "look
> up pidfd using pid" models, as long as you just verify the end result
> after you've opened it.
>
> That verification could be as simple as "look up the parent pid of the
> pidfd I got", if you know you created it with fork() (and you can
> obviously track what _other_ thread you yourself created, so you can
> verify whether it is yours or not).
>
> For example, using "openat(pidfd, "status", ..)", but also by just
> tracking what you've done waitpid() on (but you need to look out for
> subtle races with another thread being in the process of doing so).
>
> Or you can just say that as long as you got the pidfd quickly after
> the fork(), any pid wrapping attack is practically not possible even
> if it might be racy in theory.

I don't like ignoring races just because they're rare. The cost of
complete race freedom for the process interface is low considering the
work we're doing on pidfds anyway.


Re: [PATCH v2 0/5] pid: add pidfd_open()

2019-04-01 Thread Daniel Colascione
On Mon, Apr 1, 2019 at 9:30 AM Linus Torvalds
 wrote:
>
> On Mon, Apr 1, 2019 at 9:22 AM Daniel Colascione  wrote:
> >
> > There's a subtlety: suppose I'm a library and I want to create a
> > private subprocess. I use the new clone facility, whatever it is, and
> > get a pidfd back. I need to be able to read the child's exit status
> > even if the child exits before clone returns in the parent. Doesn't
> > this requirement imply that the pidfd, kernel-side, contain something
> > a bit more than a struct pid?
>
> Note that clone() has always supported this, but it has basically
> never been used.
>
> One of the early thinkings behind clone() was that it could be used
> for basically "AIO in user space", by having exactly these kinds of
> library-private internal subthreads that are hidden as real threads.
>
> It's why we have that special CSIGNAL mask for setting the exit
> signal. It doesn't *just* set the signal to be sent when the thread
> exits - setting the signal to something else than SIGCHLD actually
> changes behavior in other ways too.
>
> In particular a non-SIGCHLD thread won't be found by a regular
> "waitpid()". You have to explicitly wait for it with __WCLONE (which
> in turn is supposed to be used with the explicit pid to be waited
> for).

But doesn't the CSIGNAL approach still require that libraries somehow
coordinate which non-SIGCHLD signal they use? (Signal coordination is a
separate problem, unfortunately.) If you don't set a signal, you don't
get any notification at all, and in that case, you have to burn a
thread on a blocking wait, right? And you don't *have* to wait for a
specific PID with __WCLONE. If two libraries each did a separate
__WCLONE or __WALL wait on all subprocesses, they'd still interfere,
AIUI, even if the interface isn't supposed to be used that way.

I like the pidfd way of doing private subprocesses a bit more than the
CSIGNAL approach because it's more "natural" (in that a pidfd is just
a handle like we have handles to tons of other things), integrates
with existing event and wait facilities, doesn't require any global
coordination at all, and works for more than this one use case.
(CSIGNAL is limited to immediate parents. You could, in principle,
SCM_RIGHTS a pidfd to someone else.)

> Now, none of this was ever really used. The people who wanted AIO
> wanted the brain-damaged POSIX kind of AIO, not something cool and
> clever. So I suspect the whole exit-signal thing was just used for
> random small per-project things.

AIO is a whole other ball of wax. :-)


Re: [PATCH v2 0/5] pid: add pidfd_open()

2019-04-01 Thread Daniel Colascione
On Mon, Apr 1, 2019 at 9:07 AM Jonathan Kowalski  wrote:
>
> On Mon, Apr 1, 2019 at 4:55 PM Daniel Colascione  wrote:
> >
> > On Mon, Apr 1, 2019 at 8:36 AM Linus Torvalds
> >  wrote:
> > >
> > > On Mon, Apr 1, 2019 at 4:41 AM Aleksa Sarai  wrote:
> > > >
> > > > Eric pitched a procfs2 which would *just* be the PIDs some time ago (in
> > > > an attempt to make it possible one day to mount /proc inside a container
> > > > without adding a bunch of masked paths), though it was just an idea and
> > > > I don't know if he ever had a patch for it.
> >
> > Couldn't this mode just be a relatively simple procfs mount option
> > instead of a whole new filesystem? It'd be a bit like hidepid, right?
> > The internal bind mount option and the no-dotdot-traversal options
> > also look good to me.
> >
> > > I wonder if we really want a fill procfs2, or maybe we could just make
> > > the pidfd readable (yes, it's a directory file descriptor, but we
> > > could allow reading).
> >
> > What would read(2) read?
> >
> > > What are the *actual* use cases for opening /proc files through it? If
> > > it's really just for a small subset that android wants to do this
> > > (getting basic process state like "running" etc), rather than anything
> > > else, then we could skip the whole /proc linking entirely and go the
> > > other way instead (ie open_pidfd() would get that limited IO model,
> > > and we could make the /proc directory node get the same limited IO
> > > model).
> >
> > We do a lot of process state inspection and manipulation, including
> > reading and writing the oom killer adjustment score, reading smaps,
> > and the occasional cgroup manipulation. More generally, I'd also like
> > to be able to write a race-free pkill(1). Doing this work via pidfd
> > would be convenient. More generally, we can't enumerate the specific
> > use cases, because what we want to do with processes isn't bounded in
> > advance, and we regularly find new things in /proc/pid that we want to
> > read and write. I'd rather not prematurely limit the applicability of
> > the pidfd interface, especially when there's a simple option (the
> > procfs directory file descriptor approach) that doesn't require
> > in-advance enumeration of supported process inspection and
> > manipulation actions or a separate per-option pidfd equivalent. I very
> > much want a general-purpose API that reuses the metadata interfaces
> > the kernel already exposes. It's not clear to me how this rich
> > interface could be matched by read(2) on a pidfd.
>
> With the POLLHUP model on a simple pidfd, you'd know when the process
> you were referring to is dead (and one can map POLLPRI to dead and
> POLLHUP to zombie, etc).

I agree that there needs to be *some* kind of pollable file descriptor
that fires on process death. Should it be the pidfd itself? I've been
thinking that a pollable directory would be too weird, but if that's
not actually a problem, I'm fine with it.

Mapping different state transitions to different poll bits is an
interesting idea, but I'm also worried about making poll the "source
of truth" with respect to process state as reported by a pidfd.
Consider a socket: for a socket, read(2)/recv(2) is the "source of
truth" and poll is just a hint that says "now is a good time to
attempt a read". Some other systems even allow for spurious wakeups
from poll. It's also worth keeping in mind that some pre-existing
event loops let you provide "is readable" and "is writable" callbacks,
but don't support the more exotic poll notification bits. That's why
I've tried to keep my proposals limited to poll signaling readability.
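Concretely, the model I have in mind looks something like this (a sketch only -- it assumes a pidfd that becomes readable on process death, which isn't settled semantics):

#include <errno.h>
#include <poll.h>

/* poll() here is only a hint that "now is a good time to read"; the
 * subsequent read/waitid remains the source of truth. */
int wait_until_notified(int pidfd)
{
  struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
  int n;
  do {
    n = poll(&pfd, 1, -1);
  } while (n < 0 && errno == EINTR);
  return n < 0 ? -1 : 0; /* 0: go read the authoritative process state */
}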

But I'm not really that picky. I just want something that works.

> This is just an extension of the child process model, since you'd know
> when it's dead, there's no race involved with opening the wrong
> /proc/<pid> entry. The entire thing is already non-racy for direct
> children, the same model can be extended to non-direct ones.
>
> This simplifies a lot of things, now I am essentially just passing a
> file descriptor pinning the struct pid associated with the original
> task, and not process state around to others (I may even want the
> other process to not read that stuff out even if it was allowed to, as
> it wouldn't have been able to otherwise, due to being in a different
> mount namespace). KISS.
>
> The upshot is this same descriptor can be returned from clone, which
> would allow you to directly register it in your event loop (like
> signalfd, timerfd, file fd, sockets, etc) and 

Re: [PATCH v2 0/5] pid: add pidfd_open()

2019-04-01 Thread Daniel Colascione
On Mon, Apr 1, 2019 at 9:01 AM Linus Torvalds
 wrote:
>
> On Mon, Apr 1, 2019 at 8:55 AM Daniel Colascione  wrote:
> >
> >
> > > I wonder if we really want a full procfs2, or maybe we could just make
> > > the pidfd readable (yes, it's a directory file descriptor, but we
> > > could allow reading).
> >
> > What would read(2) read?
>
> We could make it read anything, but it would have to be something
> people agree is sufficient (and not so expensive to create that rare
> users of that data would find the overhead excessive).

In my exithand patch last year, I took a minimal approach and just had
read(2) read EOF once the process exited and blocked until then. Maybe
we could do that? But if we're supporting waitid(2) on pidfds instead,
then we don't need read(2) at all on pidfds, right? Right now, I don't
see a situation in which I'd take advantage of a pidfd read(2) that
just gave me the running/sleeping/zombie state.
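For reference, the exithand-style read(2) I mean is just this (a sketch of that old proposal, not current kernel behavior):

#include <unistd.h>

int block_until_exit(int pidfd)
{
  char c;
  ssize_t n = read(pidfd, &c, 1); /* blocks while the process lives */
  return n == 0 ? 0 : -1;         /* EOF (0 bytes) means it exited */
}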

I'm also a bit worried about the implications of making a directory FD
also readable. Didn't problems with that come up in one of the
multiple-named-streams proposals?

> Eg we could make it return the same thing that /proc/<pid>/status
> reads right now.
>
> But it sounds like you need pretty much all of /proc/<pid>/xyz:
>
> > We do a lot of process state inspection and manipulation, including
> > reading and writing the oom killer adjustment score, reading smaps,
> > and the occasional cgroup manipulation. More generally, I'd also like
> > to be able to write a race-free pkill(1)
>
> I suspect most of what pkill wants is indeed in that 'status' file,
> but other things aren't.

Right. It's hard to predict what we might need. pkill also needs
/proc/pid/cmdline, FWIW.


Re: [PATCH v2 0/5] pid: add pidfd_open()

2019-04-01 Thread Daniel Colascione
On Mon, Apr 1, 2019 at 8:36 AM Linus Torvalds
 wrote:
>
> On Mon, Apr 1, 2019 at 4:41 AM Aleksa Sarai  wrote:
> >
> > Eric pitched a procfs2 which would *just* be the PIDs some time ago (in
> > an attempt to make it possible one day to mount /proc inside a container
> > without adding a bunch of masked paths), though it was just an idea and
> > I don't know if he ever had a patch for it.

Couldn't this mode just be a relatively simple procfs mount option
instead of a whole new filesystem? It'd be a bit like hidepid, right?
The internal bind mount option and the no-dotdot-traversal options
also look good to me.

> I wonder if we really want a full procfs2, or maybe we could just make
> the pidfd readable (yes, it's a directory file descriptor, but we
> could allow reading).

What would read(2) read?

> What are the *actual* use cases for opening /proc files through it? If
> it's really just for a small subset that android wants to do this
> (getting basic process state like "running" etc), rather than anything
> else, then we could skip the whole /proc linking entirely and go the
> other way instead (ie open_pidfd() would get that limited IO model,
> and we could make the /proc directory node get the same limited IO
> model).

We do a lot of process state inspection and manipulation, including
reading and writing the oom killer adjustment score, reading smaps,
and the occasional cgroup manipulation. More generally, I'd also like
to be able to write a race-free pkill(1). Doing this work via pidfd
would be convenient. Beyond that, we can't enumerate the specific
use cases, because what we want to do with processes isn't bounded in
advance, and we regularly find new things in /proc/pid that we want to
read and write. I'd rather not prematurely limit the applicability of
the pidfd interface, especially when there's a simple option (the
procfs directory file descriptor approach) that doesn't require
in-advance enumeration of supported process inspection and
manipulation actions or a separate per-option pidfd equivalent. I very
much want a general-purpose API that reuses the metadata interfaces
the kernel already exposes. It's not clear to me how this rich
interface could be matched by read(2) on a pidfd.


Re: [PATCH v2 0/5] pid: add pidfd_open()

2019-03-31 Thread Daniel Colascione
On Sun, Mar 31, 2019 at 8:05 AM Christian Brauner  wrote:
>
> On Sun, Mar 31, 2019 at 07:52:28AM -0700, Linus Torvalds wrote:
> > On Sat, Mar 30, 2019 at 9:47 PM Jann Horn  wrote:
> > >
> > > Sure, given a pidfd_clone() syscall, as long as the parent of the
> > > process is giving you a pidfd for it and you don't have to deal with
> > > grandchildren created by fork() calls outside your control, that
> > > works.
> >
> > Don't do pidfd_clone() and pidfd_wait().
> >
> > Both of those existing system calls already get a "flags" argument.
> > Just make a WPIDFD (for waitid) and CLONE_PIDFD (for clone) bit, and
> > make the existing system calls just take/return a pidfd.
>
> Yes, that's one of the options I was considering but was afraid of
> pitching it because of the very massive opposition I got
> regarding"multiplexers". I'm perfectly happy with doing it this way.

This approach is fine by me, FWIW. I like it more than a general-purpose pidctl.

> Btw, the /proc/<pid> race issue that is mentioned constantly is simply
> avoidable by placing the pid that the pidfd has stashed relative to the
> callers' procfs mount's pid namespace in the pidfd's fdinfo. So there's
> not even a need to really go through /proc/ in the first place. A
> caller wanting to get metadata access and avoid a race with pid
> recycling can then simply do:
>
> int pidfd = pidfd_open(pid, 0);
> int pid = parse_fdinfo("/proc/self/fdinfo/<pidfd>");
> int procpidfd = open("/proc/<pid>", ...);

IMHO, it's worth documenting this procedure in the pidfd man page.

> /* Test if process still exists by sending signal 0 through our pidfd. */

Are you planning on officially documenting this incantation in the
pidfd man page?

> int ret = pidfd_send_signal(pidfd, 0, NULL, PIDFD_SIGNAL_THREAD);
> if (ret < 0 && errno == ESRCH) {
> /* pid has been recycled and procpidfd refers to another process */
> }

I was going to suggest that WNOHANG also works for this purpose, but
that idea raises another question: normally, you can wait*(2) on a
process only once. Are you imagining waitid on a pidfd succeeding more
than once? ISTM that the pidfd would need to internally store not just
a struct pid, but the exit status as well or some way to get to it.
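For the record, spelled out in full, the recipe above would look roughly like this (a sketch: pidfd_open() as proposed in this series, pidfd_send_signal() as merged, and parse_fdinfo_pid() a hypothetical helper that pulls the pid field out of /proc/self/fdinfo/<pidfd>):

#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

/* Assumed thin wrappers over the raw syscalls (e.g., via syscall(2)). */
extern int pidfd_open(pid_t pid, unsigned int flags);
extern int pidfd_send_signal(int pidfd, int sig, siginfo_t *info,
                             unsigned int flags);
extern pid_t parse_fdinfo_pid(int pidfd); /* hypothetical helper */

int open_procdir(pid_t pid)
{
  int pidfd = pidfd_open(pid, 0);
  if (pidfd < 0)
    return -1;
  char path[64];
  snprintf(path, sizeof(path), "/proc/%d", (int)parse_fdinfo_pid(pidfd));
  int procfd = open(path, O_RDONLY | O_DIRECTORY);
  /* Liveness check: signal 0 through the pidfd. Failure (ESRCH) means
   * the pid was recycled and procfd, if we got one, is stale. */
  if (procfd >= 0 && pidfd_send_signal(pidfd, 0, NULL, 0) < 0) {
    close(procfd);
    procfd = -1;
  }
  close(pidfd);
  return procfd;
}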


Re: [PATCH v2 0/5] pid: add pidfd_open()

2019-03-30 Thread Daniel Colascione
On Sat, Mar 30, 2019 at 9:24 AM Linus Torvalds
 wrote:
>
> On Sat, Mar 30, 2019 at 9:19 AM Christian Brauner  
> wrote:
> >
> > From pure API perspective that's all I care about: independence of procfs.
> > Once we have pidfd_open() we can cleanly signal threads etc.
>
> But "independence from procfs" means that you damn well don't then do
> "oh, now I have a pidfd, I want to turn it into a /proc fd and then
> munge around there".
>
> So I'm literally saying that it had better really *be* independent
> from /proc. It is the standalone version, but it's most definitely
> also the version that doesn't then give you secret access to /proc.

Just to be clear, I'm not proposing granting secret access to procfs,
and as far as I can see, nobody else is either. We've been talking
about making it easier to avoid races when you happen to want a pidfd
and a procfs fd that point to the same process, not granting access
that you didn't have before. If you'd rather not connect procfs and
pidfds, we can take this functionality off the table.


Re: [PATCH v2 0/5] pid: add pidfd_open()

2019-03-30 Thread Daniel Colascione
On Sat, Mar 30, 2019 at 9:09 AM Linus Torvalds
 wrote:
>
> On Fri, Mar 29, 2019 at 8:54 AM Christian Brauner  
> wrote:
> >
> > /* Introduction */
> > This adds the pidfd_open() syscall.
> > pidfd_open() allows to retrieve file descriptors for a given pid. This
> > includes both file descriptors for processes and file descriptors for
> > threads.
>
> I'm ok with the pidfd_open() call to just get a pidfd, but that
> "pidfd_to_profs()" needs to go away.
>
> If you want to open the /proc/<pid>/status file, then just do that.
> This whole "do one and convert to the other" needs to die. No, no, no.

How do you propose that someone open the /proc/<pid>/status file in a
race-free manner starting with the result of calling pidfd_open?


Re: [PATCH 2/4] pid: add pidfd_open()

2019-03-30 Thread Daniel Colascione
On Sat, Mar 30, 2019 at 7:30 AM Jonathan Kowalski  wrote:
>
> On Sat, Mar 30, 2019 at 7:39 AM Daniel Colascione  wrote:
> >
> >  [SNIP]
> >
> > Thanks again.
> >
> > I agree that the operation we're discussing has a simple signature,
> > but signature flexibility isn't the only reason to prefer a system
> > call over an ioctl. There are other reasons for preferring system
> > calls to ioctls (safety, tracing, etc.) that apply even if the
> > operation we're discussing has a relatively simple signature: for
> > example, every system call has a distinct and convenient ftrace event,
> > but ioctls don't; strace filtering Just Works on a
> > system-call-by-system-call basis, but it doesn't for ioctls; and
>
> It does for those with a unique number.

There's no such thing as a unique ioctl number though. Anyone is free
to create an ioctl with a conflicting number. Sure, you can guarantee
ioctl request number uniqueness within things that ship with the
kernel, but you can also guarantee system call number uniqueness, so
what do we gain by using an ioctl? And yes, you can do *some*
filtering based on ioctl command, but it's frequently more awkward
than the system call equivalent and sometimes doesn't work at all.
What's the strace -e pidfd_open equivalent for an ioctl? Grep isn't
quite equivalent.

> > documentation for system calls is much more discoverable (e.g., man
> > -k) than documentation for ioctls. Even if the distinction doesn't
> > matter much, IMHO, it still matters a little, enough to favor a system
> > call without an offsetting advantage for the ioctl option.
>
> It will be alongside other I/O operations in the pidfd man page.
>
> >
> > > If you start adding a system call for every specific operation on file
> > > descriptors, it *will* become a problem.
> >
> > I'm not sure what you mean. Do you mean that adding a top-level system
> > call for every operation that might apply to one specific kind of file
> > descriptor would lead, as the overall result, to the kernel having
> > enough system calls to cause negative consequences? I'm not sure I
> > agree, but accepting this idea for the sake of discussion: shouldn't
> > we be more okay with system calls for features present on almost all
> > systems --- like procfs --- even if we punt very rarely-used
> > functionality to ioctls, e.g., some hypothetical special squeak noise
> > that you could get some specific 1995 AdLib clone to make?
>
> Consider if we were to do:
>
> int procpidfd_open(int pidfd, int procrootfd);
>
> This system call is useless besides a very specific operation for a
> very specific usecase on a file descriptor.

I don't understand this argument based on an operation being "very
specific". Are you suggesting that epoll_wait should have been an
ioctl? I don't see how an operation being specific to a type of file
descriptor is an argument for making that operation an ioctl instead
of a system call.

> Then, you figure you might
> want to support procpidfd back to pidfd (although YAGNI), so you will
> add a CMD or flags argument now,

Why would we need a system call at all? And why are we invoking
hypothetical argument B in an argument against adding operation A as a
system call? B doesn't exist yet. As Christian mentioned, we could
create a named /proc/pid/handle file which, when opened, would yield a
pidfd. More generally, though: if we expose new functionality, we can
make a new system call. Why is that worse than adding a new ioctl
code?

> int procpidfd_open(int pidfd, int procrootfd/procpidfd, unsigned int flags);
>
>   int procpidfd = procpidfd_open(fd, procrootfd, PIDFD_TO_PROCPIDFD);
>   int pidfd = procpidfd_open(-1, procpidfd, PROCPIDFD_TO_PIDFD);
>
> vs procpidfd2pidfd(int procpidfd); if you did not foresee the need.
> Then, you want pidfd2pid, and so on.
>
> If you don't, you will have to add a command interface in the new
> system call, which changes parameters depending on the flags. This is
> already starting to get ugly.

> In the end, it's a matter of taste,

I don't think it's quite a matter of taste. That idiom suggests that
the two options are roughly functionally equivalent. In this case,
there are clear and specific technical reasons to prefer system calls.
We're not talking about two equivalent approaches. Making the
operation a system call gives users more power, and we shouldn't
deprive them of this power without a good reason. So far, I haven't
seen such a reason.

>  this
> pattern if exploited leads to endless proliferation of the system call
> interface, often ridden with short-sighted APIs, because you cannot
> know if you want a focused call or a command style call.

Bad interfaces proliferate no matter what, 

Re: [PATCH 2/4] pid: add pidfd_open()

2019-03-30 Thread Daniel Colascione
On Fri, Mar 29, 2019 at 11:25 PM Jonathan Kowalski  wrote:
>
> On Sat, Mar 30, 2019 at 5:35 AM Daniel Colascione  wrote:
> >
> > On Thu, Mar 28, 2019 at 3:38 AM Christian Brauner  
> > wrote:
> > >
> > > > All that said, thanks for the work on this once again. My intention is
> > > > just that we don't end up with an API that could have been done better
> > > > and be cleaner to use for potential users in the coming years.
> > >
> > > Thanks for your input on all of this. I still don't find multiplexers in
> > > the style of seccomp()/fsconfig()/keyctl() to be a problem since they
> > > deal with a specific task. They are very much different from ioctl()s in
> > > that regard. But since Joel, you, and Daniel found the pidctl() approach
> > > not very nice I dropped it. The interface needs to be satisfactory for
> > > all of us especially since Android and other system managers will be the
> > > main consumers.
> >
> > Thanks.
> >
> > > So let's split this into pidfd_open(pid_t pid, unsigned int flags) which
> > > allows to cleanly get pidfds independent procfs and do the translation
> > > to procpidfds in an ioctl() as we've discussed in prior threads. This
> >
> > I sustain my objection to adding an ioctl. Compared to a system call,
> > an ioctl has a more rigid interface, greater susceptibility to
> > programmer error (due to the same ioctl control code potentially doing
> > different things for different file types), longer path length, and
> > more awkward filtering/monitoring/auditing/tracing. We've discussed
> > this issue at length before, and I thought we all agreed to use system
> > calls, not ioctl, for core kernel functionality. So why is an ioctl
> > suddenly back on the table? The way I see it, an ioctl has no
> > advantages except for 1) conserving system call numbers, which are not
> > scarce, and 2) avoiding the system call number coordination problem
> > (and the coordination problem isn't a factor for core kernel code). I
> > don't understand everyone's reluctance to add new system calls. What
> > am I missing? Why would we give up all the advantages that a system
> > call gives us?
> >
>
> I agree in general, but in this particular case a system call or an
> ioctl doesn't matter much, all it does is take the pidfd, the command,
> and /proc's dir fd.

Thanks again.

I agree that the operation we're discussing has a simple signature,
but signature flexibility isn't the only reason to prefer a system
call over an ioctl. There are other reasons for preferring system
calls to ioctls (safety, tracing, etc.) that apply even if the
operation we're discussing has a relatively simple signature: for
example, every system call has a distinct and convenient ftrace event,
but ioctls don't; strace filtering Just Works on a
system-call-by-system-call basis, but it doesn't for ioctls; and
documentation for system calls is much more discoverable (e.g., man
-k) than documentation for ioctls. Even if the distinction doesn't
matter much, IMHO, it still matters a little, enough to favor a system
call without an offsetting advantage for the ioctl option.

> If you start adding a system call for every specific operation on file
> descriptors, it *will* become a problem.

I'm not sure what you mean. Do you mean that adding a top-level system
call for every operation that might apply to one specific kind of file
descriptor would lead, as the overall result, to the kernel having
enough system calls to cause negative consequences? I'm not sure I
agree, but accepting this idea for the sake of discussion: shouldn't
we be more okay with system calls for features present on almost all
systems --- like procfs --- even if we punt very rarely-used
functionality to ioctls, e.g., some hypothetical special squeak noise
that you could get some specific 1995 AdLib clone to make?

> Besides, the translation is
> just there because it is racy to do in userspace; it is not some well
> defined core kernel functionality.
> Therefore, it is just a way to
> enter the kernel to do the openat in a race free and safe manner.

I agree that the translation has to be done in the kernel, not
userspace, and that the kernel must provide to userspace some
interface for requesting that the translation happen: we're just
discussing the shape of this interface. Shouldn't all interfaces
provided by the kernel to userspace be equally well defined? I'm not
sure that the internal simplicity of the operation matters much
either. There are already explicit system calls for some
simple-to-implement things, e.g., timerfd_gettime. It's worth noting
that timerfd is (IIRC), like procfs, a feature that's both ubiquitous
and optional.

>

Re: [PATCH 2/4] pid: add pidfd_open()

2019-03-29 Thread Daniel Colascione
On Thu, Mar 28, 2019 at 3:38 AM Christian Brauner  wrote:
>
> > All that said, thanks for the work on this once again. My intention is
> > just that we don't end up with an API that could have been done better
> > and be cleaner to use for potential users in the coming years.
>
> Thanks for your input on all of this. I still don't find multiplexers in
> the style of seccomp()/fsconfig()/keyctl() to be a problem since they
> deal with a specific task. They are very much different from ioctl()s in
> that regard. But since Joel, you, and Daniel found the pidctl() approach
> not very nice I dropped it. The interface needs to be satisfactory for
> all of us especially since Android and other system managers will be the
> main consumers.

Thanks.

> So let's split this into pidfd_open(pid_t pid, unsigned int flags) which
> allows to cleanly get pidfds independent procfs and do the translation
> to procpidfds in an ioctl() as we've discussed in prior threads. This

I sustain my objection to adding an ioctl. Compared to a system call,
an ioctl has a more rigid interface, greater susceptibility to
programmer error (due to the same ioctl control code potentially doing
different things for different file types), longer path length, and
more awkward filtering/monitoring/auditing/tracing. We've discussed
this issue at length before, and I thought we all agreed to use system
calls, not ioctl, for core kernel functionality. So why is an ioctl
suddenly back on the table? The way I see it, an ioctl has no
advantages except for 1) conserving system call numbers, which are not
scarce, and 2) avoiding the system call number coordination problem
(and the coordination problem isn't a factor for core kernel code). I
don't understand everyone's reluctance to add new system calls. What
am I missing? Why would we give up all the advantages that a system
call gives us?

I also don't understand Andy's argument on the other thread that an
ioctl is okay if it's an "operation on an FD" --- *most* system calls
are operations on FDs. We don't have an ioctl for sendmsg(2) and it's
an "operation on an FD".


Re: [PATCH v1 2/4] pid: add pidctl()

2019-03-26 Thread Daniel Colascione
On Tue, Mar 26, 2019 at 9:46 AM Christian Brauner  wrote:
>
> On Tue, Mar 26, 2019 at 09:42:59AM -0700, Andy Lutomirski wrote:
> > On Tue, Mar 26, 2019 at 9:34 AM Christian Brauner  
> > wrote:
> > >
> > > On Tue, Mar 26, 2019 at 05:31:42PM +0100, Christian Brauner wrote:
> > > > On Tue, Mar 26, 2019 at 05:23:37PM +0100, Christian Brauner wrote:
> > > > > On Tue, Mar 26, 2019 at 09:17:07AM -0700, Daniel Colascione wrote:
> > > > > > Thanks for the patch.
> > > > > >
> > > > > > On Tue, Mar 26, 2019 at 8:55 AM Christian Brauner 
> > > > > >  wrote:
> > > > > > >
> > > > > > > The pidctl() syscall builds on, extends, and improves 
> > > > > > > translate_pid() [4].
> > > > > > > I quote Konstantin's original patchset first that has already been 
> > > > > > > acked and
> > > > > > > picked up by Eric before and whose functionality is preserved in 
> > > > > > > this
> > > > > > > syscall:
> > > > > >
> > > > > > We still haven't had a much-needed conversation about splitting this
> > > > > > system call into smaller logical operations. It's important that we
> > > > > > address this point before this patch is merged and becomes permanent
> > > > > > kernel ABI.
> > > > >
> > > > > I don't particularly mind splitting this into an additional syscall 
> > > > > like
> > > > > e.g.  pidfd_open() but then we have - and yes, I know you'll say
> > > > > syscalls are cheap - translate_pid(), and pidfd_open(). What I like
> > > > > about this rn is that it connects both apis in a single syscall
> > > > > and allows pidfd retrieval across pid namespaces. So I guess we'll see
> > > > > what other people think.
> > > >
> > > > There's something to be said for
> > > >
> > > > pidfd_open(pid_t pid, int pidfd, unsigned int flags);
> > > >
> > > > /* get pidfd */
> > > > int pidfd = pidfd_open(1234, -1, 0);
> > > >
> > > > /* convert to procfd */
> > > > int procfd = pidfd_open(-1, 4, 0);
> > > >
> > > > /* convert to pidfd */
> > > > int pidfd = pidfd_open(4, -1, 0);
> > >
> > > probably rather:
> > >
> > > int pidfd = pidfd_open(-1, 4, PIDFD_TO_PROCFD);
> >
> > Do you mean:
> >
> > int procrootfd = open("/proc", O_DIRECTORY | O_RDONLY);
> > int procfd = pidfd_open(procrootfd, pidfd, PIDFD_TO_PROCFD);
> >
> > or do you have some other solution in mind to avoid the security problem?
>
> Yes, we need the proc root obviously. I just jotted this down.
>
> We probably would need where one of the fds can refer to the proc root.
>
> pidfd_open(pid_t, int fd, int fd, 0)

Indeed. This is precisely the pidfd-procfd translation API I proposed
in the last paragraph of [1].

[1] 
https://lore.kernel.org/lkml/cakozuetcfgu0b53+mgmq3+539mpt_tiu-pacx2atvihhrrm...@mail.gmail.com/


Re: [PATCH v1 2/4] pid: add pidctl()

2019-03-26 Thread Daniel Colascione
On Tue, Mar 26, 2019 at 9:44 AM Christian Brauner  wrote:
>
> On Tue, Mar 26, 2019 at 09:38:31AM -0700, Daniel Colascione wrote:
> > On Tue, Mar 26, 2019 at 9:34 AM Christian Brauner  
> > wrote:
> > >
> > > On Tue, Mar 26, 2019 at 05:31:42PM +0100, Christian Brauner wrote:
> > > > On Tue, Mar 26, 2019 at 05:23:37PM +0100, Christian Brauner wrote:
> > > > > On Tue, Mar 26, 2019 at 09:17:07AM -0700, Daniel Colascione wrote:
> > > > > > Thanks for the patch.
> > > > > >
> > > > > > On Tue, Mar 26, 2019 at 8:55 AM Christian Brauner 
> > > > > >  wrote:
> > > > > > >
> > > > > > > The pidctl() syscall builds on, extends, and improves 
> > > > > > > translate_pid() [4].
> > > > > > > I quote Konstantin's original patchset first that has already been 
> > > > > > > acked and
> > > > > > > picked up by Eric before and whose functionality is preserved in 
> > > > > > > this
> > > > > > > syscall:
> > > > > >
> > > > > > We still haven't had a much-needed conversation about splitting this
> > > > > > system call into smaller logical operations. It's important that we
> > > > > > address this point before this patch is merged and becomes permanent
> > > > > > kernel ABI.
> > > > >
> > > > > I don't particularly mind splitting this into an additional syscall 
> > > > > like
> > > > > e.g.  pidfd_open() but then we have - and yes, I know you'll say
> > > > > syscalls are cheap - translate_pid(), and pidfd_open(). What I like
> > > > > about this rn is that it connects both apis in a single syscall
> > > > > and allows pidfd retrieval across pid namespaces. So I guess we'll see
> > > > > what other people think.
> > > >
> > > > There's something to be said for
> > > >
> > > > pidfd_open(pid_t pid, int pidfd, unsigned int flags);
> > > >
> > > > /* get pidfd */
> > > > int pidfd = pidfd_open(1234, -1, 0);
> > > >
> > > > /* convert to procfd */
> > > > int procfd = pidfd_open(-1, 4, 0);
> > > >
> > > > /* convert to pidfd */
> > > > int pidfd = pidfd_open(4, -1, 0);
> > >
> > > probably rather:
> > >
> > > int pidfd = pidfd_open(-1, 4, PIDFD_TO_PROCFD);
> > > int procfd = pidfd_open(-1, 4, PROCFD_TO_PIDFD);
> > > int pidfd = pidfd_open(1234, -1, 0);
> >
> > These three operations look like three related but distinct functions
> > to me, and in the second case, the "pidfd_open" name is a bit of a
> > misnomer. IMHO, the presence of an "operation name" field in any API
> > is usually a good indication that we're looking at a family of related
> > APIs, not a single coherent operation.
>
> So I'm happy to accommodate the need for a clean api even though I
> disagree that what we have in pidctl() is unclean.
> But I will not start sending a pile of syscalls. There is nothing
> necessarily wrong with grouping related APIs together.

In the email I sent just now, I identified several specific technical
disadvantages arising from unnecessary grouping of system calls. We
have historical evidence in the form of socketcall that this grouping
tends to be regrettable. I don't recall your identifying any
offsetting technical advantages. Did I miss something?

> By these standards the
> new mount API would need to be like 30 different syscalls, same for
> keyring management.

Can you please point out the problem that would arise from splitting
the mount and keyring APIs this way? One could have made the same
argument about grouping socket operations, and this socket-operation
grouping ended up being a mistake.


Re: [PATCH v1 2/4] pid: add pidctl()

2019-03-26 Thread Daniel Colascione
On Tue, Mar 26, 2019 at 9:34 AM Christian Brauner  wrote:
>
> On Tue, Mar 26, 2019 at 05:31:42PM +0100, Christian Brauner wrote:
> > On Tue, Mar 26, 2019 at 05:23:37PM +0100, Christian Brauner wrote:
> > > On Tue, Mar 26, 2019 at 09:17:07AM -0700, Daniel Colascione wrote:
> > > > Thanks for the patch.
> > > >
> > > > On Tue, Mar 26, 2019 at 8:55 AM Christian Brauner 
> > > >  wrote:
> > > > >
> > > > > The pidctl() syscall builds on, extends, and improves 
> > > > > translate_pid() [4].
> > > > > I quote Konstantin's original patchset first that has already been 
> > > > > acked and
> > > > > picked up by Eric before and whose functionality is preserved in this
> > > > > syscall:
> > > >
> > > > We still haven't had a much-needed conversation about splitting this
> > > > system call into smaller logical operations. It's important that we
> > > > address this point before this patch is merged and becomes permanent
> > > > kernel ABI.
> > >
> > > I don't particularly mind splitting this into an additional syscall like
> > > e.g.  pidfd_open() but then we have - and yes, I know you'll say
> > > syscalls are cheap - translate_pid(), and pidfd_open(). What I like
> > > about this rn is that it connects both apis in a single syscall
> > > and allows pidfd retrieval across pid namespaces. So I guess we'll see
> > > what other people think.
> >
> > There's something to be said for
> >
> > pidfd_open(pid_t pid, int pidfd, unsigned int flags);
> >
> > /* get pidfd */
> > int pidfd = pidfd_open(1234, -1, 0);
> >
> > /* convert to procfd */
> > int procfd = pidfd_open(-1, 4, 0);
> >
> > /* convert to pidfd */
> > int pidfd = pidfd_open(4, -1, 0);
>
> probably rather:
>
> int pidfd = pidfd_open(-1, 4, PIDFD_TO_PROCFD);
> int procfd = pidfd_open(-1, 4, PROCFD_TO_PIDFD);
> int pidfd = pidfd_open(1234, -1, 0);

These three operations look like three related but distinct functions
to me, and in the second case, the "pidfd_open" name is a bit of a
misnomer. IMHO, the presence of an "operation name" field in any API
is usually a good indication that we're looking at a family of related
APIs, not a single coherent operation.
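Put differently, I'd rather see each operation get its own name. Something like this (all three spellings are hypothetical, just to show the shape; the procrootfd parameter anticipates the which-procfs question raised elsewhere in the thread):

int pidfd  = pidfd_open(1234, 0);                /* pid    -> pidfd  */
int procfd = pidfd_to_procfd(procrootfd, pidfd); /* pidfd  -> procfd */
int pidfd2 = procfd_to_pidfd(procfd);            /* procfd -> pidfd  */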


Re: [PATCH v1 2/4] pid: add pidctl()

2019-03-26 Thread Daniel Colascione
On Tue, Mar 26, 2019 at 9:23 AM Christian Brauner  wrote:
>
> On Tue, Mar 26, 2019 at 09:17:07AM -0700, Daniel Colascione wrote:
> > Thanks for the patch.
> >
> > On Tue, Mar 26, 2019 at 8:55 AM Christian Brauner  
> > wrote:
> > >
> > > The pidctl() syscall builds on, extends, and improves translate_pid() 
> > > [4].
> > > I quote Konstantin's original patchset first that has already been acked 
> > > and
> > > picked up by Eric before and whose functionality is preserved in this
> > > syscall:
> >
> > We still haven't had a much-needed conversation about splitting this
> > system call into smaller logical operations. It's important that we
> > address this point before this patch is merged and becomes permanent
> > kernel ABI.
>
> I don't particularly mind splitting this into an additional syscall like
> e.g.  pidfd_open() but then we have - and yes, I know you'll say
> syscalls are cheap - translate_pid(), and pidfd_open(). What I like
> about this rn is that it connects both apis in a single syscall
> and allows pidfd retrieval across pid namespaces. So I guess we'll see
> what other people think.

Thanks. I also appreciate a clean unification of related
functionality, but I'm concerned that this API in particular --- due
in part to its *ctl() name --- will become a catch-all facility for
doing *anything* with processes. (Granted, heavy use of a new, good,
and clean API would be a good problem to have.) This
single-system-call state of affairs would make it more awkward than
necessary to do system-call level logging (say, strace -e), enable or
disable tracing of specific operations with ftrace, apply some kinds
of SELinux policy, and so on, and the only advantage of the single
system call design that I can see right now is the logical
cleanliness.

I'd propose splitting the call, or if we can't do that, renaming it to
something else --- pidfd_query --- so that it's less likely to become
a catch-all operation holder.


Re: [PATCH v1 2/4] pid: add pidctl()

2019-03-26 Thread Daniel Colascione
Thanks for the patch.

On Tue, Mar 26, 2019 at 8:55 AM Christian Brauner  wrote:
>
> The pidctl() syscall builds on, extends, and improves translate_pid() [4].
> I quote Konstantin's original patchset first that has already been acked and
> picked up by Eric before and whose functionality is preserved in this
> syscall:

We still haven't had a much-needed conversation about splitting this
system call into smaller logical operations. It's important that we
address this point before this patch is merged and becomes permanent
kernel ABI.


Re: [PATCH 0/4] pid: add pidctl()

2019-03-25 Thread Daniel Colascione
On Mon, Mar 25, 2019 at 3:37 PM Jonathan Kowalski  wrote:
>
> On Mon, Mar 25, 2019 at 10:07 PM Daniel Colascione  wrote:
> >
> > On Mon, Mar 25, 2019 at 2:55 PM Jonathan Kowalski  
> > wrote:
> > >
> > > On Mon, Mar 25, 2019 at 9:43 PM Joel Fernandes  
> > > wrote:
> > > >
> > > > On Mon, Mar 25, 2019 at 10:19:26PM +0100, Jann Horn wrote:
> > > > > On Mon, Mar 25, 2019 at 10:11 PM Joel Fernandes 
> > > > >  wrote:
> > > > >
> > > > > But often you don't just want to wait for a single thing to happen;
> > > > > you want to wait for many things at once, and react as soon as any one
> > > > > of them happens. This is why the kernel has epoll and all the other
> > > > > "wait for event on FD" APIs. If waiting for a process isn't possible
> > > > > with fd-based APIs like epoll, users of this API have to spin up
> > > > > useless helper threads.
> > > >
> > > > This is true. I almost forgot about the polling requirement, sorry. So 
> > > > then a
> > > > new syscall it is.. About what to wait for, that can be a separate 
> > > > parameter
> > > > to pidfd_wait then.
> > > >
> > >
> > > This introduces a time window where the process changes state between
> > > "get pidfd" and "setup waitfd", it would be simpler if the pidfd
> > > itself supported .poll and on termination the exit status was made
> > > readable from the file descriptor.
> >
> > I don't see a race here. Process state moves in one direction, so
> > there's no race. If you want the poll to end when a process exits and
> > the process exits before you poll, the poll should finish immediately.
> > If the process exits before you even create the polling FD, whatever
> > creates the polling FD can fail with ESRCH, which is what usually
> > happens when you try to do anything with a process that's gone. Either
> > way, whatever's trying to set up the poll knows the state of the
> > process and there's no possibility of a missed wakeup.
> >
>
> poll will possibly work and return immediately, but you won't be able
> to read back anything, because for the kernel there will be no waiter
> before you open one. If you make pidfd pollable and readable (for
> parents, as an example), the poll ends immediately but there will be
> something to read from the fd.

You can lose exit status this way. That's not a problem, though,
because *without* synchronization between the exiting process and the
waiting process, the waiting process can lose the race no matter what,
and *with* synchronization, there's no change of losing the
information. (It's just like a condition variable.) Today,
synchronization between a parent and child is automatic (because
zombies), but direct children work reliably with pidfd too: the
descriptor that pidfd_clone (or whatever it ends up being called)
returns will *always* be created before the child process and so will
*always* have exit status information available, even with some
non-SIGCHLD option applied.

> > > Further, in the clone4 patchset, there was a mechanism to autoreap
> > > such a process so that it does not interfere with waiting a process
> > > does normally. How do you intend to handle this case if anyone except
> > > the parent is wanting to *wait* on the process (a second process,
> > > without reaping, so as to not disrupt any waiting in the parent), and
> > > for things like libraries that finally can manage their own set of
> > > process when pidfd_clone is added, by excluding this process from the
> > > process's normal wait handling logic.
> >
> > I think that discussion is best had around pidfd_clone. pidfd waiting
> > functionality shouldn't affect wait* in any way nor affect a zombie
> > transition or reaping.
>
> So this is more akin to waitfd and traditional wait* and friends, and
> a single notification fd for possibly multiple children?

I don't know what gave you this idea. I'm talking about a model in
which you wait for one child with one open file description.

> I just wanted
> to be sure one would (should) be able to use a pidfd obtained from
> pidfd_clone without affecting the existing waitfd, and other
> primitives.

I don't think the use of pidfd *waiting*, by itself, should affect the
current fork/wait/zombie model of process exit delivery. That is, by
default, a process should transition to the zombie state, send SIGCHLD
to its parent, and then get reaped at exactly the same times it does
today, with exactly the same semantics we have today. Pidfd 

Re: [PATCH 0/4] pid: add pidctl()

2019-03-25 Thread Daniel Colascione
On Mon, Mar 25, 2019 at 2:55 PM Jonathan Kowalski  wrote:
>
> On Mon, Mar 25, 2019 at 9:43 PM Joel Fernandes  wrote:
> >
> > On Mon, Mar 25, 2019 at 10:19:26PM +0100, Jann Horn wrote:
> > > On Mon, Mar 25, 2019 at 10:11 PM Joel Fernandes  
> > > wrote:
> > >
> > > But often you don't just want to wait for a single thing to happen;
> > > you want to wait for many things at once, and react as soon as any one
> > > of them happens. This is why the kernel has epoll and all the other
> > > "wait for event on FD" APIs. If waiting for a process isn't possible
> > > with fd-based APIs like epoll, users of this API have to spin up
> > > useless helper threads.
> >
> > This is true. I almost forgot about the polling requirement, sorry. So then 
> > a
> > new syscall it is.. About what to wait for, that can be a separate parameter
> > to pidfd_wait then.
> >
>
> This introduces a time window where the process changes state between
> "get pidfd" and "setup waitfd", it would be simpler if the pidfd
> itself supported .poll and on termination the exit status was made
> readable from the file descriptor.

I don't see a race here. Process state moves in one direction, so
there's no race. If you want the poll to end when a process exits and
the process exits before you poll, the poll should finish immediately.
If the process exits before you even create the polling FD, whatever
creates the polling FD can fail with ESRCH, which is what usually
happens when you try to do anything with a process that's gone. Either
way, whatever's trying to set up the poll knows the state of the
process and there's no possibility of a missed wakeup.
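To spell the orderings out (pidfd_open() standing in for whatever creates the pollable FD; assume a thin wrapper over the proposed syscall):

#include <errno.h>
#include <poll.h>
#include <sys/types.h>

extern int pidfd_open(pid_t pid, unsigned int flags); /* assumed wrapper */

int watch(pid_t pid)
{
  int fd = pidfd_open(pid, 0);
  if (fd < 0)
    return errno == ESRCH ? 0 : -1; /* exited before open: known right here */
  struct pollfd pfd = { .fd = fd, .events = POLLIN };
  return poll(&pfd, 1, -1); /* exited before poll: returns at once;
                             * exits during poll: we get woken up */
}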

> Further, in the clone4 patchset, there was a mechanism to autoreap
> such a process so that it does not interfere with waiting a process
> does normally. How do you intend to handle this case if anyone except
> the parent is wanting to *wait* on the process (a second process,
> without reaping, so as to not disrupt any waiting in the parent), and
> for things like libraries that finally can manage their own set of
> process when pidfd_clone is added, by excluding this process from the
> process's normal wait handling logic.

I think that discussion is best had around pidfd_clone. pidfd waiting
functionality shouldn't affect wait* in any way nor affect a zombie
transition or reaping.

> Moreover, should anyone except the parent process be allowed to open a
> readable pidfd (or waitfd), without additional capabilities?

That's a separate discussion. See
https://lore.kernel.org/lkml/cakozueussvwabqac+o9fw+mzayccvttkqzfwg0hh-cz+1yk...@mail.gmail.com/,
where we talked about permissions extensively. Anyone can hammer on
/proc today or hammer on kill(pid, 0) to learn about a process
exiting, so anyone should be able to wait for a process to die --- it
just automates what anyone can do manually. What's left is the
question of who should be able to learn a process's exit code. As I
mentioned in the linked email, process exit codes are public on
FreeBSD, and the simplest option is to make them public in Linux too.


Re: [PATCH 0/4] pid: add pidctl()

2019-03-25 Thread Daniel Colascione
On Mon, Mar 25, 2019 at 2:11 PM Joel Fernandes  wrote:
>
> On Mon, Mar 25, 2019 at 09:15:45PM +0100, Christian Brauner wrote:
> > On Mon, Mar 25, 2019 at 01:36:14PM -0400, Joel Fernandes wrote:
> > > On Mon, Mar 25, 2019 at 09:48:43AM -0700, Daniel Colascione wrote:
> > > > On Mon, Mar 25, 2019 at 9:21 AM Christian Brauner 
> > > >  wrote:
> > > > > The pidctl() syscall builds on, extends, and improves 
> > > > > translate_pid() [4].
> > > > > I quote Konstantin's original patchset first that has already been 
> > > > > acked and
> > > > > picked up by Eric before and whose functionality is preserved in this
> > > > > syscall. Multiple people have asked when this patchset will be sent in
> > > > > for merging (cf. [1], [2]). It has recently been revived by 
> > > > > Nagarathnam
> > > > > Muthusamy from Oracle [3].
> > > > >
> > > > > The intention of the original translate_pid() syscall was twofold:
> > > > > 1. Provide translation of pids between pid namespaces
> > > > > 2. Provide implicit pid namespace introspection
> > > > >
> > > > > Both functionalities are preserved. The latter task has been improved
> > > > > upon though. In the original version of the patchset, passing pid as 1
> > > > > would allow one to determine the relationship between the pid namespaces.
> > > > > This is inherently racy. If pid 1 inside a pid namespace has died it
> > > > > would report false negatives. For example, if pid 1 inside of the 
> > > > > target
> > > > > pid namespace already died, it would report that the target pid
> > > > > namespace cannot be reached from the source pid namespace because it
> > > > > couldn't find the pid inside of the target pid namespace and thus
> > > > > falsely report to the user that the two pid namespaces are not 
> > > > > related.
> > > > > This problem is simple to avoid. In the new version we simply walk the
> > > > > list of ancestors and check whether the namespaces are related to each
> > > > > other. By doing it this way we can reliably report what the 
> > > > > relationship
> > > > > between two pid namespace file descriptors looks like.
> > > > >
> > > > > Additionally, this syscall has been extended to allow the retrieval of
> > > > > pidfds independent of procfs. These pidfds can e.g. be used with the 
> > > > > new
> > > > > pidfd_send_signal() syscall we recently merged. The ability to 
> > > > > retrieve
> > > > > pidfds independent of procfs had already been requested in the
> > > > > pidfd_send_signal patchset by e.g. Andrew [4] and later again by 
> > > > > Alexey
> > > > > [5]. A use-case where a kernel is compiled without procfs but where
> > > > > pidfds are still useful has been outlined by Andy in [6]. Regular
> > > > > anon-inode based file descriptors are used that stash a reference to
> > > > > struct pid in file->private_data and drop that reference on close.
> > > > >
> > > > > With this translate_pid() has three closely related but still distinct
> > > > > functionalities. To clarify the semantics and to make it easier for
> > > > > userspace to use the syscall it has:
> > > > > - gained a command argument and three commands clearly reflecting the
> > > > >   distinct functionalities (PIDCMD_QUERY_PID, PIDCMD_QUERY_PIDNS,
> > > > >   PIDCMD_GET_PIDFD).
> > > > > - been renamed to pidctl()
> > > >
> > > [snip]
> > > > Also, I'm still confused about how metadata access is supposed to work
> > > > for these procfs-less pidfds. If I use PIDCMD_GET_PIDFD on a process,
> > > > You snipped out a portion of a previous email in which I asked about
> > > > your thoughts on this question. With the PIDCMD_GET_PIDFD command in
> > > > place, we have two different kinds of file descriptors for processes,
> > > > one derived from procfs and one that's independent. The former works
> > > > with openat(2). The latter does not. To be very specific: if I'm
> > > > writing a function that accepts a pidfd and I get a pidfd that comes
> > > > from PIDCMD_GET_PIDFD, how am I supposed to get the equivalent of
> > > > smaps or oom_score_adj or sta

Re: pidfd design

2019-03-25 Thread Daniel Colascione
On Mon, Mar 25, 2019 at 1:14 PM Jann Horn  wrote:
>
> On Mon, Mar 25, 2019 at 8:44 PM Andy Lutomirski  wrote:
> > On Wed, Mar 20, 2019 at 12:40 PM Daniel Colascione  
> > wrote:
> > > On Wed, Mar 20, 2019 at 12:14 PM Christian Brauner  
> > > wrote:
> > > > On Wed, Mar 20, 2019 at 11:58:57AM -0700, Andy Lutomirski wrote:
> > > > > On Wed, Mar 20, 2019 at 11:52 AM Christian Brauner 
> > > > >  wrote:
> > > > > >
> > > > > > You're misunderstanding. Again, I said in my previous mails it 
> > > > > > should
> > > > > > accept pidfds optionally as arguments, yes. But I don't want it to
> > > > > > return the status fds that you previously wanted pidfd_wait() to 
> > > > > > return.
> > > > > > I really want to see Joel's pidfd_wait() patchset and have more 
> > > > > > people
> > > > > > review the actual code.
> > > > >
> > > > > Just to make sure that no one is forgetting a material security 
> > > > > consideration:
> > > >
> > > > Andy, thanks for commenting!
> > > >
> > > > >
> > > > > $ ls /proc/self
> > > > > attr             exe        mountinfo      projid_map    status
> > > > > autogroup        fd         mounts         root          syscall
> > > > > auxv             fdinfo     mountstats     sched         task
> > > > > cgroup           gid_map    net            schedstat     timers
> > > > > clear_refs       io         ns             sessionid     timerslack_ns
> > > > > cmdline          latency    numa_maps      setgroups     uid_map
> > > > > comm             limits     oom_adj        smaps         wchan
> > > > > coredump_filter  loginuid   oom_score      smaps_rollup
> > > > > cpuset           map_files  oom_score_adj  stack
> > > > > cwd              maps       pagemap        stat
> > > > > environ          mem        personality    statm
> > > > >
> > > > > A bunch of this stuff makes sense to make accessible through a syscall
> > > > > interface that we expect to be used even in sandboxes.  But a bunch of
> > > > > it does not.  For example, *_map, mounts, mountstats, and net are all
> > > > > namespace-wide things that certain policies expect to be unavailable.
> > > > > stack, for example, is a potential attack surface.  Etc.
> > >
> > > If you can access these files via open(2) on /proc/<pid>, you
> > > should be able to access them via a pidfd. If you can't, you
> > > shouldn't. Which /proc? The one you'd get by mounting procfs. I don't
> > > see how pidfd makes any material changes to anyone's security. As far
> > > as I'm concerned, if a sandbox can't mount /proc at all, it's just a
> > > broken and unsupported configuration.
> >
> > It's not "broken and unsupported".  I know of an actual working,
> > deployed container-ish sandbox that does exactly this.  I would also
> > guess that quite a few not-at-all-container-like sandboxes work like
> > this.  (The obvious seccomp + unshare + pivot_root
> > deny-myself-access-to-lots-of-things trick results in no /proc, which
> > is by design.)
> >
> > >
> > > An actual threat model and real thought paid to access capabilities
> > > would help. Almost everything around the interaction of Linux kernel
> > > namespaces and security feels like a jumble of ad-hoc patches added as
> > > afterthoughts in response to random objections.
> >
> > I fully agree.  But if you start thinking for real about access
> > capabilities, there's no way that you're going to conclude that a
> > capability to access some process implies a capability to access the
> > settings of its network namespace.
> >
> > >
> > > > > All these new APIs either need to
> > > > > return something more restrictive than a proc dirfd or they need to
> > > > > follow the same rules.
> > >
> >
> > ...
> >
> > > What's special about libraries? How is a library any worse-off using
> > > openat(2) on a pidfd than it would be just opening the file called
> > > "/proc/$apid"?
> >
> > Because most libraries actually work, right now, without /proc.  Even
> > libraries that spawn subprocesses.  If we make the new API have the
> > property that it doesn't wo

Re: [PATCH 0/4] pid: add pidctl()

2019-03-25 Thread Daniel Colascione
On Mon, Mar 25, 2019 at 12:42 PM Jonathan Kowalski  wrote:
>
> On Mon, Mar 25, 2019 at 6:57 PM Daniel Colascione  wrote:
> >
> > On Mon, Mar 25, 2019 at 11:19 AM Jonathan Kowalski  
> > wrote:
> > >
> > > On Mon, Mar 25, 2019 at 5:53 PM Daniel Colascione  
> > > wrote:
> > > >
> > > > [..snip..]
> > > >
> > > > I don't like the idea of having one kind of pidfd be pollable and
> > > > another not. Such an interface would be confusing for users. If, as
> > > > you suggest below, we instead make the procfs-less FD the only thing
> > > > we call a "pidfd", then this proposal becomes less confusing and more
> > > > viable.
> > >
> > > That's certainly on the table, we remove the ability to open
> > > /proc/<pid> and use the dir fd with pidfd_send_signal. I'm in favor of
> > > this.
> > >
> > > > > But.. one thing we could do for Daniel usecase is if a /proc/pid 
> > > > > directory fd
> > > > > can be translated into a "pidfd" using another syscall or even a 
> > > > > node, like
> > > > > /proc/pid/handle or something. I think this is what Christian 
> > > > > suggested in
> > > > > the previous threads.
> > > >
> > > > /proc/pid/handle, if I understand correctly, "translates" a
> > > > procfs-based pidfd to a non-procfs one. The needed interface would have
> > > > to perform the opposite translation, providing a procfs directory for
> > > > a given pidfd.
> > > >
> > > > > And also for the translation the other way, add a syscall or modify
> > > > > translate_fd or something, to covert a anon_inode pidfd into a 
> > > > > /proc/pid
> > > > > directory fd. Then the user is welcomed to do openat(2) on _that_ 
> > > > > directory fd.
> > > > > Then we modify pidfd_send_signal to only send signals to pure pidfd 
> > > > > fds, not
> > > > > to /proc/pid directory fds.
> > > >
> > > > This approach would work, but there's one subtlety to take into
> > > > account: which procfs? You'd want to take, as inputs, 1) the procfs
> > > > root you want, and 2) the pidfd; this way you could return to the user
> > > > a directory FD in the right procfs.
> > > >
> > >
> > > I don't think metadata access is in the scope of such a pidfd itself.
> >
> > It should be possible to write a race-free pkill(1) using pidfds.
> > Without metadata access tied to the pidfd somehow, that can't be done.
> >
> >
> > > > > Should we work on patches for these? Please let us know if this idea 
> > > > > makes
> > > > > sense and thanks a lot for adding us to the review as well.
> > > >
> > > > I would strongly prefer that we not merge pidctl (or whatever it is)
> > > > without a story for metadata access, be it your suggestion or
> > > > something else.
> > >
> > > IMO, this is out of scope for a process handle. Hence, the need for
> > > metadata access mustn't stall it as is.
> >
> > On the contrary, rather than metadata being out of scope, metadata
> > access is essential. Given a file handle (an FD), you can learn about
> > the file backing that handle by using fstat(2). Given a socket handle
> > (a socket FD), you can learn about the socket with getsockopt(2) and
> > ioctl(2). It would be strange not to be able, similarly, to learn
> > things about a process given a handle (a pidfd) to that process. I
> > want process handles to be "boring" in that they let users query and
> > manipulate processes mostly like they manipulate other resources.
> >
>
> Yes, but everything in /proc is not equivalent to an attribute, or an
> option, and depending on its configuration, you may not want to allow
> processes to even be able to see /proc for any PIDs other than those
> running as their own user (hidepid). This means, even if this new
> system call is added, to respect hidepid, it must, depending on if
> /proc is mounted (and what hidepid is set to, and what gid= is set
> to), return EPERM, because then there is a discrepancy between how the
> two entrypoints to acquire a process handle do access control.

That's why I proposed that this translation mechanism accept a procfs
root directory --- so you'd specify *which* procfs you want and let
the kernel apply whatever hidepid access restrictions it wants.


Re: [PATCH 0/4] pid: add pidctl()

2019-03-25 Thread Daniel Colascione
On Mon, Mar 25, 2019 at 11:19 AM Jonathan Kowalski  wrote:
>
> On Mon, Mar 25, 2019 at 5:53 PM Daniel Colascione  wrote:
> >
> > [..snip..]
> >
> > I don't like the idea of having one kind of pidfd be pollable and
> > another not. Such an interface would be confusing for users. If, as
> > you suggest below, we instead make the procfs-less FD the only thing
> > we call a "pidfd", then this proposal becomes less confusing and more
> > viable.
>
> That's certainly on the table, we remove the ability to open
> /proc/<pid> and use the dir fd with pidfd_send_signal. I'm in favor of
> this.
>
> > > But.. one thing we could do for Daniel usecase is if a /proc/pid 
> > > directory fd
> > > can be translated into a "pidfd" using another syscall or even a node, 
> > > like
> > > /proc/pid/handle or something. I think this is what Christian suggested in
> > > the previous threads.
> >
> > /proc/pid/handle, if I understand correctly, "translates" a
> > procfs-based pidfd to a non-procfs one. The needed interface would have
> > to perform the opposite translation, providing a procfs directory for
> > a given pidfd.
> >
> > > And also for the translation the other way, add a syscall or modify
> > > translate_fd or something, to covert a anon_inode pidfd into a /proc/pid
> > > directory fd. Then the user is welcomed to do openat(2) on _that_ 
> > > directory fd.
> > > Then we modify pidfd_send_signal to only send signals to pure pidfd fds, 
> > > not
> > > to /proc/pid directory fds.
> >
> > This approach would work, but there's one subtlety to take into
> > account: which procfs? You'd want to take, as inputs, 1) the procfs
> > root you want, and 2) the pidfd; this way you could return to the user
> > a directory FD in the right procfs.
> >
>
> I don't think metadata access is in the scope of such a pidfd itself.

It should be possible to write a race-free pkill(1) using pidfds.
Without metadata access tied to the pidfd somehow, that can't be done.

> > > Should we work on patches for these? Please let us know if this idea makes
> > > sense and thanks a lot for adding us to the review as well.
> >
> > I would strongly prefer that we not merge pidctl (or whatever it is)
> > without a story for metadata access, be it your suggestion or
> > something else.
>
> IMO, this is out of scope for a process handle. Hence, the need for
> metadata access mustn't stall it as is.

On the contrary, rather than metadata being out of scope, metadata
access is essential. Given a file handle (an FD), you can learn about
the file backing that handle by using fstat(2). Given a socket handle
(a socket FD), you can learn about the socket with getsockopt(2) and
ioctl(2). It would be strange not to be able, similarly, to learn
things about a process given a handle (a pidfd) to that process. I
want process handles to be "boring" in that they let users query and
manipulate processes mostly like they manipulate other resources.
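The analogy, spelled out (filefd and sockfd are assumed open descriptors):

#include <sys/socket.h>
#include <sys/stat.h>

/* Handles to other resource types are queryable, so a process handle
 * should be too. */
void query_handles(int filefd, int sockfd)
{
  struct stat st;
  fstat(filefd, &st); /* file handle -> file metadata */

  int type;
  socklen_t len = sizeof(type);
  getsockopt(sockfd, SOL_SOCKET, SO_TYPE, &type, &len); /* socket handle */
  /* A pidfd ought to support the process-level equivalent. */
}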

> If you really need this, the pid this pidfd is mapped to can be
> exposed through fdinfo (which is perfectly fine for your case as you
> use /proc anyway). This means you can reliably determine if you're
> opening it for the same pid (and it hasn't been recycled) because this
> pidfd will be pollable for state change (in the future). Exposing it
> through fdinfo isn't a problem, pid ns is bound to the process during
> its lifetime, it's a process creation time property, so the value
> isn't going to change anyway. So you can do all metadata access you
> want subsequently.

Thanks for the proposal. I'm a bit confused. Are you suggesting that
we report the numeric pid in fdinfo? Are you saying it should work
basically like this?

int get_oom_score_adj(int pidfd) {
  unique_fd fdinfo_fd = open(fmt("/proc/self/fdinfo/%d", pidfd), O_RDONLY);
  string fdinfo = read_all(fdinfo_fd);
  int numeric_pid = parse_fdinfo_for_line(fdinfo, "pid");
  unique_fd procdir_fd = open(fmt("/proc/%d", numeric_pid), O_DIRECTORY);
  if (!is_pidfd_alive(pidfd)) { return -1; /* process died */ } /* LIVENESS CHECK */
  // We know the process named by pidfd was called NUMERIC_PID in our
  // namespace (because fdinfo told us) and we know that the named process
  // was alive after we successfully opened its /proc/pid directory; therefore,
  // PROCDIR_FD and PIDFD must refer to the same underlying process.
  unique_fd oom_score_adj_fd = openat(procdir_fd, "oom_score_adj", O_RDONLY);
  return parse_int(read_all(oom_score_adj_fd));
}

While this interface functions correctly, I have two concerns: 1) the
number of system calls necessary is very large -- by my count,
starting wi

Re: [PATCH 0/4] pid: add pidctl()

2019-03-25 Thread Daniel Colascione
On Mon, Mar 25, 2019 at 10:36 AM Joel Fernandes  wrote:
>
> On Mon, Mar 25, 2019 at 09:48:43AM -0700, Daniel Colascione wrote:
> > On Mon, Mar 25, 2019 at 9:21 AM Christian Brauner  
> > wrote:
> > > The pidctl() syscall builds on, extends, and improves translate_pid() 
> > > [4].
> > > I quote Konstantin's original patchset first, which has already been acked 
> > > and
> > > picked up by Eric before and whose functionality is preserved in this
> > > syscall. Multiple people have asked when this patchset will be sent in
> > > for merging (cf. [1], [2]). It has recently been revived by Nagarathnam
> > > Muthusamy from Oracle [3].
> > >
> > > The intention of the original translate_pid() syscall was twofold:
> > > 1. Provide translation of pids between pid namespaces
> > > 2. Provide implicit pid namespace introspection
> > >
> > > Both functionalities are preserved. The latter task has been improved
> > > upon though. In the original version of the patchset, passing pid as 1
> > > would allow one to determine the relationship between the pid namespaces.
> > > This is inherently racy: if pid 1 inside a pid namespace has died, it
> > > would report false negatives. For example, if pid 1 inside of the target
> > > pid namespace already died, it would report that the target pid
> > > namespace cannot be reached from the source pid namespace because it
> > > couldn't find the pid inside of the target pid namespace and thus
> > > falsely report to the user that the two pid namespaces are not related.
> > > This problem is simple to avoid. In the new version we simply walk the
> > > list of ancestors and check whether the namespaces are related to each
> > > other. By doing it this way we can reliably report what the relationship
> > > between two pid namespace file descriptors looks like.
> > >
> > > Additionally, this syscall has been extended to allow the retrieval of
> > > pidfds independent of procfs. These pidfds can e.g. be used with the new
> > > pidfd_send_signal() syscall we recently merged. The ability to retrieve
> > > pidfds independent of procfs had already been requested in the
> > > pidfd_send_signal patchset by e.g. Andrew [4] and later again by Alexey
> > > [5]. A use-case where a kernel is compiled without procfs but where
> > > pidfds are still useful has been outlined by Andy in [6]. Regular
> > > anon-inode based file descriptors are used that stash a reference to
> > > struct pid in file->private_data and drop that reference on close.
> > >
> > > With this translate_pid() has three closely related but still distinct
> > > functionalities. To clarify the semantics and to make it easier for
> > > userspace to use the syscall it has:
> > > - gained a command argument and three commands clearly reflecting the
> > >   distinct functionalities (PIDCMD_QUERY_PID, PIDCMD_QUERY_PIDNS,
> > >   PIDCMD_GET_PIDFD).
> > > - been renamed to pidctl()
> >
> [snip]
> > Also, I'm still confused about how metadata access is supposed to work
> > for these procfs-less pidfds. If I use PIDCMD_GET_PIDFD on a process, how
> > do I examine that process's metadata? You snipped out a portion of a
> > previous email in which I asked about
> > your thoughts on this question. With the PIDCMD_GET_PIDFD command in
> > place, we have two different kinds of file descriptors for processes,
> > one derived from procfs and one that's independent. The former works
> > with openat(2). The latter does not. To be very specific; if I'm
> > writing a function that accepts a pidfd and I get a pidfd that comes
> > from PIDCMD_GET_PIDFD, how am I supposed to get the equivalent of
> > smaps or oom_score_adj or statm for the named process in a race-free
> > manner?
>
> It's true that such a use case will not be supportable. But the advantage,
> on the other hand, is that such a "pidfd" can be made pollable or readable in
> the future. Potentially allowing us to return exit status without a new
> syscall (?). And we can add IOCTLs to the pidfd descriptor which we cannot do
> with proc.

I don't like the idea of having one kind of pidfd be pollable and
another not. Such an interface would be confusing for users. If, as
you suggest below, we instead make the procfs-less FD the only thing
we call a "pidfd", then this proposal becomes less confusing and more
viable.
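
If the anon-inode pidfd is the only kind, making it pollable could look
like the sketch below. To be clear, this is speculative: readable-on-exit
semantics for pidfds are only proposed in this thread, not merged:

#include <poll.h>

/* Hypothetical: block until the process behind a pidfd exits, assuming
 * the proposed (not yet merged) semantics that a pidfd becomes readable
 * when its process dies. */
static int wait_for_exit(int pidfd, int timeout_ms)
{
  struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
  if (poll(&pfd, 1, timeout_ms) <= 0)
    return -1;  /* error or timeout */
  return (pfd.revents & POLLIN) ? 0 : -1;
}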

> But... one thing we could do for Daniel's use case is if a /proc/pid directory fd
> can be translated into a "pidfd" using another syscall or even a node, like
> /proc/pid/handle or something. I

Re: [PATCH 0/4] pid: add pidctl()

2019-03-25 Thread Daniel Colascione
On Mon, Mar 25, 2019 at 10:05 AM Konstantin Khlebnikov
 wrote:
> On 25.03.2019 19:48, Daniel Colascione wrote:
> > On Mon, Mar 25, 2019 at 9:21 AM Christian Brauner  
> > wrote:
> >> The pidctl() syscall builds on, extends, and improves translate_pid() [4].
> >> I quote Konstantin's original patchset first, which has already been acked and
> >> picked up by Eric before and whose functionality is preserved in this
> >> syscall. Multiple people have asked when this patchset will be sent in
> >> for merging (cf. [1], [2]). It has recently been revived by Nagarathnam
> >> Muthusamy from Oracle [3].
> >>
> >> The intention of the original translate_pid() syscall was twofold:
> >> 1. Provide translation of pids between pid namespaces
> >> 2. Provide implicit pid namespace introspection
> >>
> >> Both functionalities are preserved. The latter task has been improved
> >> upon though. In the original version of the patchset, passing pid as 1
> >> would allow one to determine the relationship between the pid namespaces.
> >> This is inherently racy: if pid 1 inside a pid namespace has died, it
> >> would report false negatives. For example, if pid 1 inside of the target
> >> pid namespace already died, it would report that the target pid
> >> namespace cannot be reached from the source pid namespace because it
> >> couldn't find the pid inside of the target pid namespace and thus
> >> falsely report to the user that the two pid namespaces are not related.
> >> This problem is simple to avoid. In the new version we simply walk the
> >> list of ancestors and check whether the namespaces are related to each
> >> other. By doing it this way we can reliably report what the relationship
> >> between two pid namespace file descriptors looks like.
> >>
> >> Additionally, this syscall has been extended to allow the retrieval of
> >> pidfds independent of procfs. These pidfds can e.g. be used with the new
> >> pidfd_send_signal() syscall we recently merged. The ability to retrieve
> >> pidfds independent of procfs had already been requested in the
> >> pidfd_send_signal patchset by e.g. Andrew [4] and later again by Alexey
> >> [5]. A use-case where a kernel is compiled without procfs but where
> >> pidfds are still useful has been outlined by Andy in [6]. Regular
> >> anon-inode based file descriptors are used that stash a reference to
> >> struct pid in file->private_data and drop that reference on close.
> >>
> >> With this translate_pid() has three closely related but still distinct
> >> functionalities. To clarify the semantics and to make it easier for
> >> userspace to use the syscall it has:
> >> - gained a command argument and three commands clearly reflecting the
> >>distinct functionalities (PIDCMD_QUERY_PID, PIDCMD_QUERY_PIDNS,
> >>PIDCMD_GET_PIDFD).
> >> - been renamed to pidctl()
> >
> > Having made these changes, you've built a general-purpose command
> > multiplexer, not one operation that happens to be flexible.
> > The general-purpose command multiplexer is a common antipattern:
> > multiplexers make it hard to talk about different kernel-provided
> > operations using the common vocabulary we use to distinguish
> > kernel-related operations, the system call number. socketcall, for
> > example, turned out to be cumbersome for users like SELinux policy
> > writers. People had to do work later to split socketcall into
> > fine-grained system calls. Please split the pidctl system call so that
> > the design is clean from the start and we avoid work later. System
> > calls are cheap.
> >
> > Also, I'm still confused about how metadata access is supposed to work
> > for these procfs-less pidfds. If I use PIDCMD_GET_PIDFD on a process, how
> > do I examine that process's metadata? You snipped out a portion of a
> > previous email in which I asked about
> > your thoughts on this question. With the PIDCMD_GET_PIDFD command in
> > place, we have two different kinds of file descriptors for processes,
> > one derived from procfs and one that's independent. The former works
> > with openat(2). The latter does not. To be very specific; if I'm
> > writing a function that accepts a pidfd and I get a pidfd that comes
> > from PIDCMD_GET_PIDFD, how am I supposed to get the equivalent of
> > smaps or oom_score_adj or statm for the named process in a race-free
> > manner?
> >
>
> Task metadata could be exposed via "pages" identified by offset:
>
> struct pidfd_stats stats;
>
> pread(pidfd, &stats, sizeof(stats), PIDFD_STATS_OFFSET);
>
> I'm not sure that we need yet another binary procfs.
> But it will be faster than the current text-based interface, for sure.
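
Filling in Konstantin's sketch, the whole interface might look like the
following. Everything here is hypothetical -- the struct layout, the
offset constant, and the idea of reserving pread(2) offsets on a pidfd
are illustration only, not a proposed ABI:

#include <stdint.h>
#include <unistd.h>

#define PIDFD_STATS_OFFSET (1ULL << 20)  /* hypothetical metadata "page" */

struct pidfd_stats {                     /* hypothetical layout */
  uint64_t vm_rss_kb;
  uint64_t vm_swap_kb;
  int32_t  oom_score_adj;
};

/* One syscall fetches a consistent binary snapshot of task metadata,
 * addressed through the pidfd itself, so it is inherently race-free
 * with respect to pid reuse. */
static int read_pidfd_stats(int pidfd, struct pidfd_stats *stats)
{
  ssize_t n = pread(pidfd, stats, sizeof(*stats), PIDFD_STATS_OFFSET);
  return n == (ssize_t)sizeof(*stats) ? 0 : -1;
}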

There are many options. I have some detailed thoughts on several of
them, but long design emails seem rather unwelcome. Right now, I'd
like to hear how Christian intends his patch set to address this use
case.


Re: [PATCH 0/4] pid: add pidctl()

2019-03-25 Thread Daniel Colascione
On Mon, Mar 25, 2019 at 9:56 AM David Howells  wrote:
> Daniel Colascione  wrote:
>
> > System calls are cheap.
>
> Only to a point.  x86_64 will have an issue when we hit syscall 512.  We're
> currently at 427.

IIRC, a while ago, someone proposed restarting system call numbering
above (again, IIRC) 1024 for both 32- and 64-bit varieties to
establish a clean slate and sidestep this problem.


Re: [PATCH 0/4] pid: add pidctl()

2019-03-25 Thread Daniel Colascione
On Mon, Mar 25, 2019 at 9:21 AM Christian Brauner  wrote:
> The pidctl() syscall builds on, extends, and improves translate_pid() [4].
> I quote Konstantin's original patchset first, which has already been acked and
> picked up by Eric before and whose functionality is preserved in this
> syscall. Multiple people have asked when this patchset will be sent in
> for merging (cf. [1], [2]). It has recently been revived by Nagarathnam
> Muthusamy from Oracle [3].
>
> The intention of the original translate_pid() syscall was twofold:
> 1. Provide translation of pids between pid namespaces
> 2. Provide implicit pid namespace introspection
>
> Both functionalities are preserved. The latter task has been improved
> upon though. In the original version of the patchset, passing pid as 1
> would allow one to determine the relationship between the pid namespaces.
> This is inherently racy: if pid 1 inside a pid namespace has died, it
> would report false negatives. For example, if pid 1 inside of the target
> pid namespace already died, it would report that the target pid
> namespace cannot be reached from the source pid namespace because it
> couldn't find the pid inside of the target pid namespace and thus
> falsely report to the user that the two pid namespaces are not related.
> This problem is simple to avoid. In the new version we simply walk the
> list of ancestors and check whether the namespaces are related to each
> other. By doing it this way we can reliably report what the relationship
> between two pid namespace file descriptors looks like.
>
> Additionally, this syscall has been extended to allow the retrieval of
> pidfds independent of procfs. These pidfds can e.g. be used with the new
> pidfd_send_signal() syscall we recently merged. The ability to retrieve
> pidfds independent of procfs had already been requested in the
> pidfd_send_signal patchset by e.g. Andrew [4] and later again by Alexey
> [5]. A use-case where a kernel is compiled without procfs but where
> pidfds are still useful has been outlined by Andy in [6]. Regular
> anon-inode based file descriptors are used that stash a reference to
> struct pid in file->private_data and drop that reference on close.
>
> With this translate_pid() has three closely related but still distinct
> functionalities. To clarify the semantics and to make it easier for
> userspace to use the syscall it has:
> - gained a command argument and three commands clearly reflecting the
>   distinct functionalities (PIDCMD_QUERY_PID, PIDCMD_QUERY_PIDNS,
>   PIDCMD_GET_PIDFD).
> - been renamed to pidctl()
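
For reference, the ancestry walk the quoted cover letter describes
reduces to something like this kernel-side sketch (simplified: locking
and reference counting elided; struct pid_namespace really does carry
->parent and ->level, so relatedness is decidable without consulting
any pid):

/* Walk up the pid namespace tree from the candidate descendant; the
 * namespaces are related iff we hit the proposed ancestor before
 * running off the root (the init pid namespace has no parent). */
static bool pidns_is_ancestor(struct pid_namespace *ancestor,
                              struct pid_namespace *ns)
{
  for (; ns; ns = ns->parent)
    if (ns == ancestor)
      return true;
  return false;
}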

Having made these changes, you've built a general-purpose command
multiplexer, not one operation that happens to be flexible.
The general-purpose command multiplexer is a common antipattern:
multiplexers make it hard to talk about different kernel-provided
operations using the common vocabulary we use to distinguish
kernel-related operations, the system call number. socketcall, for
example, turned out to be cumbersome for users like SELinux policy
writers. People had to do work later to split socketcall into
fine-grained system calls. Please split the pidctl system call so that
the design is clean from the start and we avoid work later. System
calls are cheap.
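
The contrast shows up directly in the call shapes. In the sketch below,
the PIDCMD_* names come from this patchset; the split-out syscall names
and signatures are mine, purely for illustration:

#include <sys/types.h>

/* Hypothetical prototypes, for contrast only; none of these exact
 * signatures is merged ABI. */
enum { PIDCMD_QUERY_PID, PIDCMD_QUERY_PIDNS, PIDCMD_GET_PIDFD };
extern int pidctl(int cmd, pid_t pid, int source_ns_fd, int target_ns_fd);
extern int pidfd_open(pid_t pid, unsigned int flags);

static void contrast(pid_t pid)
{
  /* Multiplexed: one syscall number, behavior picked at runtime by
   * `cmd` -- opaque to per-syscall policy such as SELinux or audit. */
  int a = pidctl(PIDCMD_GET_PIDFD, pid, -1, -1);

  /* Fine-grained: one operation per syscall number, each one
   * independently nameable, filterable, and auditable. */
  int b = pidfd_open(pid, 0);
  (void)a; (void)b;
}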

Also, I'm still confused about how metadata access is supposed to work
for these procfs-less pidfds. If I use PIDCMD_GET_PIDFD on a process, how
do I examine that process's metadata? You snipped out a portion of a
previous email in which I asked about
your thoughts on this question. With the PIDCMD_GET_PIDFD command in
place, we have two different kinds of file descriptors for processes,
one derived from procfs and one that's independent. The former works
with openat(2). The latter does not. To be very specific; if I'm
writing a function that accepts a pidfd and I get a pidfd that comes
from PIDCMD_GET_PIDFD, how am I supposed to get the equivalent of
smaps or oom_score_adj or statm for the named process in a race-free
manner?

