Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK

2017-08-18 Thread Rik van Riel
On Fri, 2017-08-18 at 11:15 -0700, Andrew Morton wrote:
> On Fri, 18 Aug 2017 12:28:29 -0400 Rik van Riel 
> wrote:
> 
> > On Thu, 2017-08-17 at 15:50 -0700, Andrew Morton wrote:
> > > On Tue, 15 Aug 2017 22:18:19 -0400 Rik van Riel 
> > > wrote:
> > > 
> > > > > > --- a/mm/madvise.c
> > > > > > +++ b/mm/madvise.c
> > > > > > @@ -80,6 +80,17 @@ static long madvise_behavior(struct
> > > > > > vm_area_struct *vma,
> > > > > > __  }
> > > > > > __  new_flags &= ~VM_DONTCOPY;
> > > > > > __  break;
> > > > > > +   case MADV_WIPEONFORK:
> > > > > > +   /* MADV_WIPEONFORK is only supported on
> > > > > > anonymous
> > > > > > memory. */
> > > > > > +   if (vma->vm_file || vma->vm_flags &
> > > > > > VM_SHARED)
> > > > > > {
> > > > > > +   error = -EINVAL;
> > > > > > +   goto out;
> > > > > > +   }
> > > > > > +   new_flags |= VM_WIPEONFORK;
> > > > > > +   break;
> > > > > > +   case MADV_KEEPONFORK:
> > > > > > +   new_flags &= ~VM_WIPEONFORK;
> > > > > > +   break;
> > > > > > __  case MADV_DONTDUMP:
> > > > > > __  new_flags |= VM_DONTDUMP;
> > > > > > __  break;
> > > > > 
> > > > > It seems odd to permit MADV_KEEPONFORK against other-than-
> > > > > anon
> > > > > vmas?
> > > > 
> > > > Given that the only way to set VM_WIPEONFORK is through
> > > > MADV_WIPEONFORK, calling MADV_KEEPONFORK on an
> > > > other-than-anon vma would be equivalent to a noop.
> > > > 
> > > > If new_flags == vma->vm_flags, madvise_behavior() will
> > > > immediately exit.
> > > 
> > > Yes, but calling MADV_WIPEONFORK against an other-than-anon vma
> > > is
> > > presumably a userspace bug.A bug which will probably result
> > > in
> > > userspace having WIPEONFORK memory which it didn't want.The
> > > kernel
> > > can trivially tell userspace that it has this bug so why not do
> > > so?
> > 
> > Uh, what?
> > 
> 
> Braino.  I meant MADV_KEEPONFORK.  Calling MADV_KEEPONFORK against an
> other-than-anon vma is a presumptive userspace bug and the kernel
> should report that.

All MADV_KEEPONFORK does is clear the flag set by
MADV_WIPEONFORK. Since there is no way to set the
WIPEONFORK flag on other-than-anon VMAs, that means
MADV_KEEPONFORK is always a noop for those VMAs.

You remind me that I should send in a man page
addition, though, when this code gets sent to
Linus.


Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK

2017-08-18 Thread Rik van Riel
On Fri, 2017-08-18 at 11:15 -0700, Andrew Morton wrote:
> On Fri, 18 Aug 2017 12:28:29 -0400 Rik van Riel 
> wrote:
> 
> > On Thu, 2017-08-17 at 15:50 -0700, Andrew Morton wrote:
> > > On Tue, 15 Aug 2017 22:18:19 -0400 Rik van Riel 
> > > wrote:
> > > 
> > > > > > --- a/mm/madvise.c
> > > > > > +++ b/mm/madvise.c
> > > > > > @@ -80,6 +80,17 @@ static long madvise_behavior(struct
> > > > > > vm_area_struct *vma,
> > > > > > __  }
> > > > > > __  new_flags &= ~VM_DONTCOPY;
> > > > > > __  break;
> > > > > > +   case MADV_WIPEONFORK:
> > > > > > +   /* MADV_WIPEONFORK is only supported on
> > > > > > anonymous
> > > > > > memory. */
> > > > > > +   if (vma->vm_file || vma->vm_flags &
> > > > > > VM_SHARED)
> > > > > > {
> > > > > > +   error = -EINVAL;
> > > > > > +   goto out;
> > > > > > +   }
> > > > > > +   new_flags |= VM_WIPEONFORK;
> > > > > > +   break;
> > > > > > +   case MADV_KEEPONFORK:
> > > > > > +   new_flags &= ~VM_WIPEONFORK;
> > > > > > +   break;
> > > > > > __  case MADV_DONTDUMP:
> > > > > > __  new_flags |= VM_DONTDUMP;
> > > > > > __  break;
> > > > > 
> > > > > It seems odd to permit MADV_KEEPONFORK against other-than-
> > > > > anon
> > > > > vmas?
> > > > 
> > > > Given that the only way to set VM_WIPEONFORK is through
> > > > MADV_WIPEONFORK, calling MADV_KEEPONFORK on an
> > > > other-than-anon vma would be equivalent to a noop.
> > > > 
> > > > If new_flags == vma->vm_flags, madvise_behavior() will
> > > > immediately exit.
> > > 
> > > Yes, but calling MADV_WIPEONFORK against an other-than-anon vma
> > > is
> > > presumably a userspace bug.A bug which will probably result
> > > in
> > > userspace having WIPEONFORK memory which it didn't want.The
> > > kernel
> > > can trivially tell userspace that it has this bug so why not do
> > > so?
> > 
> > Uh, what?
> > 
> 
> Braino.  I meant MADV_KEEPONFORK.  Calling MADV_KEEPONFORK against an
> other-than-anon vma is a presumptive userspace bug and the kernel
> should report that.

All MADV_KEEPONFORK does is clear the flag set by
MADV_WIPEONFORK. Since there is no way to set the
WIPEONFORK flag on other-than-anon VMAs, that means
MADV_KEEPONFORK is always a noop for those VMAs.

You remind me that I should send in a man page
addition, though, when this code gets sent to
Linus.


Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK

2017-08-18 Thread Andrew Morton
On Fri, 18 Aug 2017 12:28:29 -0400 Rik van Riel  wrote:

> On Thu, 2017-08-17 at 15:50 -0700, Andrew Morton wrote:
> > On Tue, 15 Aug 2017 22:18:19 -0400 Rik van Riel 
> > wrote:
> > 
> > > > > --- a/mm/madvise.c
> > > > > +++ b/mm/madvise.c
> > > > > @@ -80,6 +80,17 @@ static long madvise_behavior(struct
> > > > > vm_area_struct *vma,
> > > > > __}
> > > > > __new_flags &= ~VM_DONTCOPY;
> > > > > __break;
> > > > > + case MADV_WIPEONFORK:
> > > > > + /* MADV_WIPEONFORK is only supported on
> > > > > anonymous
> > > > > memory. */
> > > > > + if (vma->vm_file || vma->vm_flags & VM_SHARED)
> > > > > {
> > > > > + error = -EINVAL;
> > > > > + goto out;
> > > > > + }
> > > > > + new_flags |= VM_WIPEONFORK;
> > > > > + break;
> > > > > + case MADV_KEEPONFORK:
> > > > > + new_flags &= ~VM_WIPEONFORK;
> > > > > + break;
> > > > > __case MADV_DONTDUMP:
> > > > > __new_flags |= VM_DONTDUMP;
> > > > > __break;
> > > > 
> > > > It seems odd to permit MADV_KEEPONFORK against other-than-anon
> > > > vmas?
> > > 
> > > Given that the only way to set VM_WIPEONFORK is through
> > > MADV_WIPEONFORK, calling MADV_KEEPONFORK on an
> > > other-than-anon vma would be equivalent to a noop.
> > > 
> > > If new_flags == vma->vm_flags, madvise_behavior() will
> > > immediately exit.
> > 
> > Yes, but calling MADV_WIPEONFORK against an other-than-anon vma is
> > presumably a userspace bug.A bug which will probably result in
> > userspace having WIPEONFORK memory which it didn't want.The kernel
> > can trivially tell userspace that it has this bug so why not do so?
> 
> Uh, what?
> 

Braino.  I meant MADV_KEEPONFORK.  Calling MADV_KEEPONFORK against an
other-than-anon vma is a presumptive userspace bug and the kernel
should report that.


Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK

2017-08-18 Thread Andrew Morton
On Fri, 18 Aug 2017 12:28:29 -0400 Rik van Riel  wrote:

> On Thu, 2017-08-17 at 15:50 -0700, Andrew Morton wrote:
> > On Tue, 15 Aug 2017 22:18:19 -0400 Rik van Riel 
> > wrote:
> > 
> > > > > --- a/mm/madvise.c
> > > > > +++ b/mm/madvise.c
> > > > > @@ -80,6 +80,17 @@ static long madvise_behavior(struct
> > > > > vm_area_struct *vma,
> > > > > __}
> > > > > __new_flags &= ~VM_DONTCOPY;
> > > > > __break;
> > > > > + case MADV_WIPEONFORK:
> > > > > + /* MADV_WIPEONFORK is only supported on
> > > > > anonymous
> > > > > memory. */
> > > > > + if (vma->vm_file || vma->vm_flags & VM_SHARED)
> > > > > {
> > > > > + error = -EINVAL;
> > > > > + goto out;
> > > > > + }
> > > > > + new_flags |= VM_WIPEONFORK;
> > > > > + break;
> > > > > + case MADV_KEEPONFORK:
> > > > > + new_flags &= ~VM_WIPEONFORK;
> > > > > + break;
> > > > > __case MADV_DONTDUMP:
> > > > > __new_flags |= VM_DONTDUMP;
> > > > > __break;
> > > > 
> > > > It seems odd to permit MADV_KEEPONFORK against other-than-anon
> > > > vmas?
> > > 
> > > Given that the only way to set VM_WIPEONFORK is through
> > > MADV_WIPEONFORK, calling MADV_KEEPONFORK on an
> > > other-than-anon vma would be equivalent to a noop.
> > > 
> > > If new_flags == vma->vm_flags, madvise_behavior() will
> > > immediately exit.
> > 
> > Yes, but calling MADV_WIPEONFORK against an other-than-anon vma is
> > presumably a userspace bug.A bug which will probably result in
> > userspace having WIPEONFORK memory which it didn't want.The kernel
> > can trivially tell userspace that it has this bug so why not do so?
> 
> Uh, what?
> 

Braino.  I meant MADV_KEEPONFORK.  Calling MADV_KEEPONFORK against an
other-than-anon vma is a presumptive userspace bug and the kernel
should report that.


Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK

2017-08-18 Thread Mike Kravetz
On 08/11/2017 02:28 PM, r...@redhat.com wrote:
> From: Rik van Riel 
> 
> Introduce MADV_WIPEONFORK semantics, which result in a VMA being
> empty in the child process after fork. This differs from MADV_DONTFORK
> in one important way.
> 
> If a child process accesses memory that was MADV_WIPEONFORK, it
> will get zeroes. The address ranges are still valid, they are just empty.
> 
> If a child process accesses memory that was MADV_DONTFORK, it will
> get a segmentation fault, since those address ranges are no longer
> valid in the child after fork.
> 
> Since MADV_DONTFORK also seems to be used to allow very large
> programs to fork in systems with strict memory overcommit restrictions,
> changing the semantics of MADV_DONTFORK might break existing programs.
> 
> MADV_WIPEONFORK only works on private, anonymous VMAs.
> 
> The use case is libraries that store or cache information, and
> want to know that they need to regenerate it in the child process
> after fork.
> 
> Examples of this would be:
> - systemd/pulseaudio API checks (fail after fork)
>   (replacing a getpid check, which is too slow without a PID cache)
> - PKCS#11 API reinitialization check (mandated by specification)
> - glibc's upcoming PRNG (reseed after fork)
> - OpenSSL PRNG (reseed after fork)
> 
> The security benefits of a forking server having a re-inialized
> PRNG in every child process are pretty obvious. However, due to
> libraries having all kinds of internal state, and programs getting
> compiled with many different versions of each library, it is
> unreasonable to expect calling programs to re-initialize everything
> manually after fork.
> 
> A further complication is the proliferation of clone flags,
> programs bypassing glibc's functions to call clone directly,
> and programs calling unshare, causing the glibc pthread_atfork
> hook to not get called.
> 
> It would be better to have the kernel take care of this automatically.
> 
> This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO:
> 
> https://man.openbsd.org/minherit.2
> 
> Reported-by: Florian Weimer 
> Reported-by: Colm MacCártaigh 
> Signed-off-by: Rik van Riel 

My primary concern with the first suggested patch was trying to define
semantics if MADV_WIPEONFORK was applied to a shared or file backed
mapping.  This is no longer allowed.

Reviewed-by: Mike Kravetz 

> ---
>  arch/alpha/include/uapi/asm/mman.h |  3 +++
>  arch/mips/include/uapi/asm/mman.h  |  3 +++
>  arch/parisc/include/uapi/asm/mman.h|  3 +++
>  arch/xtensa/include/uapi/asm/mman.h|  3 +++
>  fs/proc/task_mmu.c |  1 +
>  include/linux/mm.h |  2 +-
>  include/trace/events/mmflags.h |  8 +---
>  include/uapi/asm-generic/mman-common.h |  3 +++
>  kernel/fork.c  | 10 --
>  mm/madvise.c   | 13 +
>  10 files changed, 39 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/alpha/include/uapi/asm/mman.h 
> b/arch/alpha/include/uapi/asm/mman.h
> index 02760f6e6ca4..2a708a792882 100644
> --- a/arch/alpha/include/uapi/asm/mman.h
> +++ b/arch/alpha/include/uapi/asm/mman.h
> @@ -64,6 +64,9 @@
>  overrides the coredump filter bits */
>  #define MADV_DODUMP  17  /* Clear the MADV_NODUMP flag */
>  
> +#define MADV_WIPEONFORK 18   /* Zero memory on fork, child only */
> +#define MADV_KEEPONFORK 19   /* Undo MADV_WIPEONFORK */
> +
>  /* compatibility flags */
>  #define MAP_FILE 0
>  
> diff --git a/arch/mips/include/uapi/asm/mman.h 
> b/arch/mips/include/uapi/asm/mman.h
> index 655e2fb5395b..d59c57d60d7d 100644
> --- a/arch/mips/include/uapi/asm/mman.h
> +++ b/arch/mips/include/uapi/asm/mman.h
> @@ -91,6 +91,9 @@
>  overrides the coredump filter bits */
>  #define MADV_DODUMP  17  /* Clear the MADV_NODUMP flag */
>  
> +#define MADV_WIPEONFORK 18   /* Zero memory on fork, child only */
> +#define MADV_KEEPONFORK 19   /* Undo MADV_WIPEONFORK */
> +
>  /* compatibility flags */
>  #define MAP_FILE 0
>  
> diff --git a/arch/parisc/include/uapi/asm/mman.h 
> b/arch/parisc/include/uapi/asm/mman.h
> index 5979745815a5..e205e0179642 100644
> --- a/arch/parisc/include/uapi/asm/mman.h
> +++ b/arch/parisc/include/uapi/asm/mman.h
> @@ -60,6 +60,9 @@
>  overrides the coredump filter bits */
>  #define MADV_DODUMP  70  /* Clear the MADV_NODUMP flag */
>  
> +#define MADV_WIPEONFORK 71   /* Zero memory on fork, child only */
> +#define MADV_KEEPONFORK 72   /* Undo MADV_WIPEONFORK */
> +
>  /* compatibility flags */
>  #define MAP_FILE 0
>  #define MAP_VARIABLE 0
> diff --git a/arch/xtensa/include/uapi/asm/mman.h 
> b/arch/xtensa/include/uapi/asm/mman.h
> index 

Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK

2017-08-18 Thread Mike Kravetz
On 08/11/2017 02:28 PM, r...@redhat.com wrote:
> From: Rik van Riel 
> 
> Introduce MADV_WIPEONFORK semantics, which result in a VMA being
> empty in the child process after fork. This differs from MADV_DONTFORK
> in one important way.
> 
> If a child process accesses memory that was MADV_WIPEONFORK, it
> will get zeroes. The address ranges are still valid, they are just empty.
> 
> If a child process accesses memory that was MADV_DONTFORK, it will
> get a segmentation fault, since those address ranges are no longer
> valid in the child after fork.
> 
> Since MADV_DONTFORK also seems to be used to allow very large
> programs to fork in systems with strict memory overcommit restrictions,
> changing the semantics of MADV_DONTFORK might break existing programs.
> 
> MADV_WIPEONFORK only works on private, anonymous VMAs.
> 
> The use case is libraries that store or cache information, and
> want to know that they need to regenerate it in the child process
> after fork.
> 
> Examples of this would be:
> - systemd/pulseaudio API checks (fail after fork)
>   (replacing a getpid check, which is too slow without a PID cache)
> - PKCS#11 API reinitialization check (mandated by specification)
> - glibc's upcoming PRNG (reseed after fork)
> - OpenSSL PRNG (reseed after fork)
> 
> The security benefits of a forking server having a re-inialized
> PRNG in every child process are pretty obvious. However, due to
> libraries having all kinds of internal state, and programs getting
> compiled with many different versions of each library, it is
> unreasonable to expect calling programs to re-initialize everything
> manually after fork.
> 
> A further complication is the proliferation of clone flags,
> programs bypassing glibc's functions to call clone directly,
> and programs calling unshare, causing the glibc pthread_atfork
> hook to not get called.
> 
> It would be better to have the kernel take care of this automatically.
> 
> This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO:
> 
> https://man.openbsd.org/minherit.2
> 
> Reported-by: Florian Weimer 
> Reported-by: Colm MacCártaigh 
> Signed-off-by: Rik van Riel 

My primary concern with the first suggested patch was trying to define
semantics if MADV_WIPEONFORK was applied to a shared or file backed
mapping.  This is no longer allowed.

Reviewed-by: Mike Kravetz 

> ---
>  arch/alpha/include/uapi/asm/mman.h |  3 +++
>  arch/mips/include/uapi/asm/mman.h  |  3 +++
>  arch/parisc/include/uapi/asm/mman.h|  3 +++
>  arch/xtensa/include/uapi/asm/mman.h|  3 +++
>  fs/proc/task_mmu.c |  1 +
>  include/linux/mm.h |  2 +-
>  include/trace/events/mmflags.h |  8 +---
>  include/uapi/asm-generic/mman-common.h |  3 +++
>  kernel/fork.c  | 10 --
>  mm/madvise.c   | 13 +
>  10 files changed, 39 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/alpha/include/uapi/asm/mman.h 
> b/arch/alpha/include/uapi/asm/mman.h
> index 02760f6e6ca4..2a708a792882 100644
> --- a/arch/alpha/include/uapi/asm/mman.h
> +++ b/arch/alpha/include/uapi/asm/mman.h
> @@ -64,6 +64,9 @@
>  overrides the coredump filter bits */
>  #define MADV_DODUMP  17  /* Clear the MADV_NODUMP flag */
>  
> +#define MADV_WIPEONFORK 18   /* Zero memory on fork, child only */
> +#define MADV_KEEPONFORK 19   /* Undo MADV_WIPEONFORK */
> +
>  /* compatibility flags */
>  #define MAP_FILE 0
>  
> diff --git a/arch/mips/include/uapi/asm/mman.h 
> b/arch/mips/include/uapi/asm/mman.h
> index 655e2fb5395b..d59c57d60d7d 100644
> --- a/arch/mips/include/uapi/asm/mman.h
> +++ b/arch/mips/include/uapi/asm/mman.h
> @@ -91,6 +91,9 @@
>  overrides the coredump filter bits */
>  #define MADV_DODUMP  17  /* Clear the MADV_NODUMP flag */
>  
> +#define MADV_WIPEONFORK 18   /* Zero memory on fork, child only */
> +#define MADV_KEEPONFORK 19   /* Undo MADV_WIPEONFORK */
> +
>  /* compatibility flags */
>  #define MAP_FILE 0
>  
> diff --git a/arch/parisc/include/uapi/asm/mman.h 
> b/arch/parisc/include/uapi/asm/mman.h
> index 5979745815a5..e205e0179642 100644
> --- a/arch/parisc/include/uapi/asm/mman.h
> +++ b/arch/parisc/include/uapi/asm/mman.h
> @@ -60,6 +60,9 @@
>  overrides the coredump filter bits */
>  #define MADV_DODUMP  70  /* Clear the MADV_NODUMP flag */
>  
> +#define MADV_WIPEONFORK 71   /* Zero memory on fork, child only */
> +#define MADV_KEEPONFORK 72   /* Undo MADV_WIPEONFORK */
> +
>  /* compatibility flags */
>  #define MAP_FILE 0
>  #define MAP_VARIABLE 0
> diff --git a/arch/xtensa/include/uapi/asm/mman.h 
> b/arch/xtensa/include/uapi/asm/mman.h
> index 24365b30aae9..ed23e0a1b30d 100644
> --- a/arch/xtensa/include/uapi/asm/mman.h
> +++ 

Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK

2017-08-18 Thread Rik van Riel
On Thu, 2017-08-17 at 15:50 -0700, Andrew Morton wrote:
> On Tue, 15 Aug 2017 22:18:19 -0400 Rik van Riel 
> wrote:
> 
> > > > --- a/mm/madvise.c
> > > > +++ b/mm/madvise.c
> > > > @@ -80,6 +80,17 @@ static long madvise_behavior(struct
> > > > vm_area_struct *vma,
> > > > __  }
> > > > __  new_flags &= ~VM_DONTCOPY;
> > > > __  break;
> > > > +   case MADV_WIPEONFORK:
> > > > +   /* MADV_WIPEONFORK is only supported on
> > > > anonymous
> > > > memory. */
> > > > +   if (vma->vm_file || vma->vm_flags & VM_SHARED)
> > > > {
> > > > +   error = -EINVAL;
> > > > +   goto out;
> > > > +   }
> > > > +   new_flags |= VM_WIPEONFORK;
> > > > +   break;
> > > > +   case MADV_KEEPONFORK:
> > > > +   new_flags &= ~VM_WIPEONFORK;
> > > > +   break;
> > > > __  case MADV_DONTDUMP:
> > > > __  new_flags |= VM_DONTDUMP;
> > > > __  break;
> > > 
> > > It seems odd to permit MADV_KEEPONFORK against other-than-anon
> > > vmas?
> > 
> > Given that the only way to set VM_WIPEONFORK is through
> > MADV_WIPEONFORK, calling MADV_KEEPONFORK on an
> > other-than-anon vma would be equivalent to a noop.
> > 
> > If new_flags == vma->vm_flags, madvise_behavior() will
> > immediately exit.
> 
> Yes, but calling MADV_WIPEONFORK against an other-than-anon vma is
> presumably a userspace bug.  A bug which will probably result in
> userspace having WIPEONFORK memory which it didn't want.  The kernel
> can trivially tell userspace that it has this bug so why not do so?

Uh, what?

Calling MADV_WIPEONFORK on an other-than-anon vma results
in NOT getting VM_WIPEONFORK semantics on that VMA.

The code you are commenting on is the bit that CLEARS
the VM_WIPEONFORK code, not the bit where it gets set.


Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK

2017-08-18 Thread Rik van Riel
On Thu, 2017-08-17 at 15:50 -0700, Andrew Morton wrote:
> On Tue, 15 Aug 2017 22:18:19 -0400 Rik van Riel 
> wrote:
> 
> > > > --- a/mm/madvise.c
> > > > +++ b/mm/madvise.c
> > > > @@ -80,6 +80,17 @@ static long madvise_behavior(struct
> > > > vm_area_struct *vma,
> > > > __  }
> > > > __  new_flags &= ~VM_DONTCOPY;
> > > > __  break;
> > > > +   case MADV_WIPEONFORK:
> > > > +   /* MADV_WIPEONFORK is only supported on
> > > > anonymous
> > > > memory. */
> > > > +   if (vma->vm_file || vma->vm_flags & VM_SHARED)
> > > > {
> > > > +   error = -EINVAL;
> > > > +   goto out;
> > > > +   }
> > > > +   new_flags |= VM_WIPEONFORK;
> > > > +   break;
> > > > +   case MADV_KEEPONFORK:
> > > > +   new_flags &= ~VM_WIPEONFORK;
> > > > +   break;
> > > > __  case MADV_DONTDUMP:
> > > > __  new_flags |= VM_DONTDUMP;
> > > > __  break;
> > > 
> > > It seems odd to permit MADV_KEEPONFORK against other-than-anon
> > > vmas?
> > 
> > Given that the only way to set VM_WIPEONFORK is through
> > MADV_WIPEONFORK, calling MADV_KEEPONFORK on an
> > other-than-anon vma would be equivalent to a noop.
> > 
> > If new_flags == vma->vm_flags, madvise_behavior() will
> > immediately exit.
> 
> Yes, but calling MADV_WIPEONFORK against an other-than-anon vma is
> presumably a userspace bug.  A bug which will probably result in
> userspace having WIPEONFORK memory which it didn't want.  The kernel
> can trivially tell userspace that it has this bug so why not do so?

Uh, what?

Calling MADV_WIPEONFORK on an other-than-anon vma results
in NOT getting VM_WIPEONFORK semantics on that VMA.

The code you are commenting on is the bit that CLEARS
the VM_WIPEONFORK code, not the bit where it gets set.


Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK

2017-08-17 Thread Andrew Morton
On Tue, 15 Aug 2017 22:18:19 -0400 Rik van Riel  wrote:

> > > --- a/mm/madvise.c
> > > +++ b/mm/madvise.c
> > > @@ -80,6 +80,17 @@ static long madvise_behavior(struct
> > > vm_area_struct *vma,
> > > __}
> > > __new_flags &= ~VM_DONTCOPY;
> > > __break;
> > > + case MADV_WIPEONFORK:
> > > + /* MADV_WIPEONFORK is only supported on anonymous
> > > memory. */
> > > + if (vma->vm_file || vma->vm_flags & VM_SHARED) {
> > > + error = -EINVAL;
> > > + goto out;
> > > + }
> > > + new_flags |= VM_WIPEONFORK;
> > > + break;
> > > + case MADV_KEEPONFORK:
> > > + new_flags &= ~VM_WIPEONFORK;
> > > + break;
> > > __case MADV_DONTDUMP:
> > > __new_flags |= VM_DONTDUMP;
> > > __break;
> > 
> > It seems odd to permit MADV_KEEPONFORK against other-than-anon vmas?
> 
> Given that the only way to set VM_WIPEONFORK is through
> MADV_WIPEONFORK, calling MADV_KEEPONFORK on an
> other-than-anon vma would be equivalent to a noop.
> 
> If new_flags == vma->vm_flags, madvise_behavior() will
> immediately exit.

Yes, but calling MADV_WIPEONFORK against an other-than-anon vma is
presumably a userspace bug.  A bug which will probably result in
userspace having WIPEONFORK memory which it didn't want.  The kernel
can trivially tell userspace that it has this bug so why not do so?




Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK

2017-08-17 Thread Andrew Morton
On Tue, 15 Aug 2017 22:18:19 -0400 Rik van Riel  wrote:

> > > --- a/mm/madvise.c
> > > +++ b/mm/madvise.c
> > > @@ -80,6 +80,17 @@ static long madvise_behavior(struct
> > > vm_area_struct *vma,
> > > __}
> > > __new_flags &= ~VM_DONTCOPY;
> > > __break;
> > > + case MADV_WIPEONFORK:
> > > + /* MADV_WIPEONFORK is only supported on anonymous
> > > memory. */
> > > + if (vma->vm_file || vma->vm_flags & VM_SHARED) {
> > > + error = -EINVAL;
> > > + goto out;
> > > + }
> > > + new_flags |= VM_WIPEONFORK;
> > > + break;
> > > + case MADV_KEEPONFORK:
> > > + new_flags &= ~VM_WIPEONFORK;
> > > + break;
> > > __case MADV_DONTDUMP:
> > > __new_flags |= VM_DONTDUMP;
> > > __break;
> > 
> > It seems odd to permit MADV_KEEPONFORK against other-than-anon vmas?
> 
> Given that the only way to set VM_WIPEONFORK is through
> MADV_WIPEONFORK, calling MADV_KEEPONFORK on an
> other-than-anon vma would be equivalent to a noop.
> 
> If new_flags == vma->vm_flags, madvise_behavior() will
> immediately exit.

Yes, but calling MADV_WIPEONFORK against an other-than-anon vma is
presumably a userspace bug.  A bug which will probably result in
userspace having WIPEONFORK memory which it didn't want.  The kernel
can trivially tell userspace that it has this bug so why not do so?




Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK

2017-08-15 Thread Rik van Riel
On Tue, 2017-08-15 at 15:51 -0700, Andrew Morton wrote:
> On Fri, 11 Aug 2017 17:28:29 -0400 r...@redhat.com wrote:
> 
> > A further complication is the proliferation of clone flags,
> > programs bypassing glibc's functions to call clone directly,
> > and programs calling unshare, causing the glibc pthread_atfork
> > hook to not get called.
> > 
> > It would be better to have the kernel take care of this
> > automatically.
> 
> I'll add "The patch also adds MADV_KEEPONFORK, to undo the effects of
> a
> prior MADV_WIPEONFORK." here.
> 
> I guess it isn't worth mentioning that these things can cause VMA
> merges and splits. 

That's the same as every other Linux specific madvise operation.

> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -80,6 +80,17 @@ static long madvise_behavior(struct
> > vm_area_struct *vma,
> >     }
> >     new_flags &= ~VM_DONTCOPY;
> >     break;
> > +   case MADV_WIPEONFORK:
> > +   /* MADV_WIPEONFORK is only supported on anonymous
> > memory. */
> > +   if (vma->vm_file || vma->vm_flags & VM_SHARED) {
> > +   error = -EINVAL;
> > +   goto out;
> > +   }
> > +   new_flags |= VM_WIPEONFORK;
> > +   break;
> > +   case MADV_KEEPONFORK:
> > +   new_flags &= ~VM_WIPEONFORK;
> > +   break;
> >     case MADV_DONTDUMP:
> >     new_flags |= VM_DONTDUMP;
> >     break;
> 
> It seems odd to permit MADV_KEEPONFORK against other-than-anon vmas?

Given that the only way to set VM_WIPEONFORK is through
MADV_WIPEONFORK, calling MADV_KEEPONFORK on an
other-than-anon vma would be equivalent to a noop.

If new_flags == vma->vm_flags, madvise_behavior() will
immediately exit.



Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK

2017-08-15 Thread Rik van Riel
On Tue, 2017-08-15 at 15:51 -0700, Andrew Morton wrote:
> On Fri, 11 Aug 2017 17:28:29 -0400 r...@redhat.com wrote:
> 
> > A further complication is the proliferation of clone flags,
> > programs bypassing glibc's functions to call clone directly,
> > and programs calling unshare, causing the glibc pthread_atfork
> > hook to not get called.
> > 
> > It would be better to have the kernel take care of this
> > automatically.
> 
> I'll add "The patch also adds MADV_KEEPONFORK, to undo the effects of
> a
> prior MADV_WIPEONFORK." here.
> 
> I guess it isn't worth mentioning that these things can cause VMA
> merges and splits. 

That's the same as every other Linux specific madvise operation.

> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -80,6 +80,17 @@ static long madvise_behavior(struct
> > vm_area_struct *vma,
> >     }
> >     new_flags &= ~VM_DONTCOPY;
> >     break;
> > +   case MADV_WIPEONFORK:
> > +   /* MADV_WIPEONFORK is only supported on anonymous
> > memory. */
> > +   if (vma->vm_file || vma->vm_flags & VM_SHARED) {
> > +   error = -EINVAL;
> > +   goto out;
> > +   }
> > +   new_flags |= VM_WIPEONFORK;
> > +   break;
> > +   case MADV_KEEPONFORK:
> > +   new_flags &= ~VM_WIPEONFORK;
> > +   break;
> >     case MADV_DONTDUMP:
> >     new_flags |= VM_DONTDUMP;
> >     break;
> 
> It seems odd to permit MADV_KEEPONFORK against other-than-anon vmas?

Given that the only way to set VM_WIPEONFORK is through
MADV_WIPEONFORK, calling MADV_KEEPONFORK on an
other-than-anon vma would be equivalent to a noop.

If new_flags == vma->vm_flags, madvise_behavior() will
immediately exit.



Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK

2017-08-15 Thread Andrew Morton
On Fri, 11 Aug 2017 17:28:29 -0400 r...@redhat.com wrote:

> From: Rik van Riel 
> 
> Introduce MADV_WIPEONFORK semantics, which result in a VMA being
> empty in the child process after fork. This differs from MADV_DONTFORK
> in one important way.
> 
> If a child process accesses memory that was MADV_WIPEONFORK, it
> will get zeroes. The address ranges are still valid, they are just empty.
> 
> If a child process accesses memory that was MADV_DONTFORK, it will
> get a segmentation fault, since those address ranges are no longer
> valid in the child after fork.
> 
> Since MADV_DONTFORK also seems to be used to allow very large
> programs to fork in systems with strict memory overcommit restrictions,
> changing the semantics of MADV_DONTFORK might break existing programs.
> 
> MADV_WIPEONFORK only works on private, anonymous VMAs.
> 
> The use case is libraries that store or cache information, and
> want to know that they need to regenerate it in the child process
> after fork.
> 
> Examples of this would be:
> - systemd/pulseaudio API checks (fail after fork)
>   (replacing a getpid check, which is too slow without a PID cache)
> - PKCS#11 API reinitialization check (mandated by specification)
> - glibc's upcoming PRNG (reseed after fork)
> - OpenSSL PRNG (reseed after fork)
> 
> The security benefits of a forking server having a re-inialized
> PRNG in every child process are pretty obvious. However, due to
> libraries having all kinds of internal state, and programs getting
> compiled with many different versions of each library, it is
> unreasonable to expect calling programs to re-initialize everything
> manually after fork.
> 
> A further complication is the proliferation of clone flags,
> programs bypassing glibc's functions to call clone directly,
> and programs calling unshare, causing the glibc pthread_atfork
> hook to not get called.
> 
> It would be better to have the kernel take care of this automatically.

I'll add "The patch also adds MADV_KEEPONFORK, to undo the effects of a
prior MADV_WIPEONFORK." here.

I guess it isn't worth mentioning that these things can cause VMA
merges and splits. 

> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -80,6 +80,17 @@ static long madvise_behavior(struct vm_area_struct *vma,
>   }
>   new_flags &= ~VM_DONTCOPY;
>   break;
> + case MADV_WIPEONFORK:
> + /* MADV_WIPEONFORK is only supported on anonymous memory. */
> + if (vma->vm_file || vma->vm_flags & VM_SHARED) {
> + error = -EINVAL;
> + goto out;
> + }
> + new_flags |= VM_WIPEONFORK;
> + break;
> + case MADV_KEEPONFORK:
> + new_flags &= ~VM_WIPEONFORK;
> + break;
>   case MADV_DONTDUMP:
>   new_flags |= VM_DONTDUMP;
>   break;

It seems odd to permit MADV_KEEPONFORK against other-than-anon vmas?


Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK

2017-08-15 Thread Andrew Morton
On Fri, 11 Aug 2017 17:28:29 -0400 r...@redhat.com wrote:

> From: Rik van Riel 
> 
> Introduce MADV_WIPEONFORK semantics, which result in a VMA being
> empty in the child process after fork. This differs from MADV_DONTFORK
> in one important way.
> 
> If a child process accesses memory that was MADV_WIPEONFORK, it
> will get zeroes. The address ranges are still valid, they are just empty.
> 
> If a child process accesses memory that was MADV_DONTFORK, it will
> get a segmentation fault, since those address ranges are no longer
> valid in the child after fork.
> 
> Since MADV_DONTFORK also seems to be used to allow very large
> programs to fork in systems with strict memory overcommit restrictions,
> changing the semantics of MADV_DONTFORK might break existing programs.
> 
> MADV_WIPEONFORK only works on private, anonymous VMAs.
> 
> The use case is libraries that store or cache information, and
> want to know that they need to regenerate it in the child process
> after fork.
> 
> Examples of this would be:
> - systemd/pulseaudio API checks (fail after fork)
>   (replacing a getpid check, which is too slow without a PID cache)
> - PKCS#11 API reinitialization check (mandated by specification)
> - glibc's upcoming PRNG (reseed after fork)
> - OpenSSL PRNG (reseed after fork)
> 
> The security benefits of a forking server having a re-inialized
> PRNG in every child process are pretty obvious. However, due to
> libraries having all kinds of internal state, and programs getting
> compiled with many different versions of each library, it is
> unreasonable to expect calling programs to re-initialize everything
> manually after fork.
> 
> A further complication is the proliferation of clone flags,
> programs bypassing glibc's functions to call clone directly,
> and programs calling unshare, causing the glibc pthread_atfork
> hook to not get called.
> 
> It would be better to have the kernel take care of this automatically.

I'll add "The patch also adds MADV_KEEPONFORK, to undo the effects of a
prior MADV_WIPEONFORK." here.

I guess it isn't worth mentioning that these things can cause VMA
merges and splits. 

> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -80,6 +80,17 @@ static long madvise_behavior(struct vm_area_struct *vma,
>   }
>   new_flags &= ~VM_DONTCOPY;
>   break;
> + case MADV_WIPEONFORK:
> + /* MADV_WIPEONFORK is only supported on anonymous memory. */
> + if (vma->vm_file || vma->vm_flags & VM_SHARED) {
> + error = -EINVAL;
> + goto out;
> + }
> + new_flags |= VM_WIPEONFORK;
> + break;
> + case MADV_KEEPONFORK:
> + new_flags &= ~VM_WIPEONFORK;
> + break;
>   case MADV_DONTDUMP:
>   new_flags |= VM_DONTDUMP;
>   break;

It seems odd to permit MADV_KEEPONFORK against other-than-anon vmas?


Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK

2017-08-14 Thread kbuild test robot
Hi Rik,

[auto build test ERROR on linus/master]
[also build test ERROR on v4.13-rc5]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/riel-redhat-com/mm-fork-security-introduce-MADV_WIPEONFORK/20170806-035914
config: m68k-sun3_defconfig (attached as .config)
compiler: m68k-linux-gcc (GCC) 4.9.0
reproduce:
wget 
https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=m68k 

All errors (new ones prefixed by >>):

   In file included from mm/debug.c:12:0:
>> include/trace/events/mmflags.h:131:31: error: 'VM_ARCH_2' undeclared here 
>> (not in a function)
#define __VM_ARCH_SPECIFIC_2 {VM_ARCH_2, "arch_2" }
  ^
   include/trace/events/mmflags.h:165:2: note: in expansion of macro 
'__VM_ARCH_SPECIFIC_2'
 __VM_ARCH_SPECIFIC_2,  \
 ^
   mm/debug.c:39:2: note: in expansion of macro '__def_vmaflag_names'
 __def_vmaflag_names,
 ^

vim +/VM_ARCH_2 +131 include/trace/events/mmflags.h

bcf669179 Kirill A. Shutemov 2016-03-17  127  
bcf669179 Kirill A. Shutemov 2016-03-17  128  #if defined(CONFIG_X86)
bcf669179 Kirill A. Shutemov 2016-03-17  129  #define __VM_ARCH_SPECIFIC_2 
{VM_MPX, "mpx"   }
bcf669179 Kirill A. Shutemov 2016-03-17  130  #else
bcf669179 Kirill A. Shutemov 2016-03-17 @131  #define __VM_ARCH_SPECIFIC_2 
{VM_ARCH_2,  "arch_2"}
420adbe9f Vlastimil Babka2016-03-15  132  #endif
420adbe9f Vlastimil Babka2016-03-15  133  

:: The code at line 131 was first introduced by commit
:: bcf6691797f425b301f629bb783b7ff2d0bcfa5a mm, tracing: refresh 
__def_vmaflag_names

:: TO: Kirill A. Shutemov 
:: CC: Linus Torvalds 

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK

2017-08-14 Thread kbuild test robot
Hi Rik,

[auto build test ERROR on linus/master]
[also build test ERROR on v4.13-rc5]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/riel-redhat-com/mm-fork-security-introduce-MADV_WIPEONFORK/20170806-035914
config: m68k-sun3_defconfig (attached as .config)
compiler: m68k-linux-gcc (GCC) 4.9.0
reproduce:
wget 
https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=m68k 

All errors (new ones prefixed by >>):

   In file included from mm/debug.c:12:0:
>> include/trace/events/mmflags.h:131:31: error: 'VM_ARCH_2' undeclared here 
>> (not in a function)
#define __VM_ARCH_SPECIFIC_2 {VM_ARCH_2, "arch_2" }
  ^
   include/trace/events/mmflags.h:165:2: note: in expansion of macro 
'__VM_ARCH_SPECIFIC_2'
 __VM_ARCH_SPECIFIC_2,  \
 ^
   mm/debug.c:39:2: note: in expansion of macro '__def_vmaflag_names'
 __def_vmaflag_names,
 ^

vim +/VM_ARCH_2 +131 include/trace/events/mmflags.h

bcf669179 Kirill A. Shutemov 2016-03-17  127  
bcf669179 Kirill A. Shutemov 2016-03-17  128  #if defined(CONFIG_X86)
bcf669179 Kirill A. Shutemov 2016-03-17  129  #define __VM_ARCH_SPECIFIC_2 
{VM_MPX, "mpx"   }
bcf669179 Kirill A. Shutemov 2016-03-17  130  #else
bcf669179 Kirill A. Shutemov 2016-03-17 @131  #define __VM_ARCH_SPECIFIC_2 
{VM_ARCH_2,  "arch_2"}
420adbe9f Vlastimil Babka2016-03-15  132  #endif
420adbe9f Vlastimil Babka2016-03-15  133  

:: The code at line 131 was first introduced by commit
:: bcf6691797f425b301f629bb783b7ff2d0bcfa5a mm, tracing: refresh 
__def_vmaflag_names

:: TO: Kirill A. Shutemov 
:: CC: Linus Torvalds 

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK

2017-08-11 Thread Linus Torvalds
On Fri, Aug 11, 2017 at 1:27 PM, Rik van Riel  wrote:
>>
>> Yes, you don't do the page table copies. Fine. But you leave vma with
>> the the anon_vma pointer - doesn't that mean that it's still
>> connected
>> to the original anonvma chain, and we might end up swapping something
>> in?
>
> Swapping something in would require there to be a swap entry in
> the page table entries, which we are not copying, so this should
> not be a correctness issue.

Yeah, I thought the rmap code just used the offset from the start to
avoid even doing swap entries, but I guess we don't actually ever
populate the page tables without the swap entry being there.

> There is another test in copy_page_range already which ends up
> skipping the page table copy when it should not be done.

Well, the VM_DONTCOPY test is in dup_mmap(), and I think I'd rather
match up the VM_WIPEONFORK logic with VM_DONTCOPY than with the
copy_page_range() tests.

Because I assume you are talking about the "if it's a shared mapping,
we don't need to copy the page tables and can just do it at page fault
time instead" part? That's a rather different thing, which isn't so
much about semantics, as about just a trade-off about when to touch
the page tables.

But yes, that one *might* make sense in dup_mmap() too. I just don't
think it's really analogous to the WIPEONFORK and DONTCOPY tests.

  Linus


Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK

2017-08-11 Thread Linus Torvalds
On Fri, Aug 11, 2017 at 1:27 PM, Rik van Riel  wrote:
>>
>> Yes, you don't do the page table copies. Fine. But you leave vma with
>> the the anon_vma pointer - doesn't that mean that it's still
>> connected
>> to the original anonvma chain, and we might end up swapping something
>> in?
>
> Swapping something in would require there to be a swap entry in
> the page table entries, which we are not copying, so this should
> not be a correctness issue.

Yeah, I thought the rmap code just used the offset from the start to
avoid even doing swap entries, but I guess we don't actually ever
populate the page tables without the swap entry being there.

> There is another test in copy_page_range already which ends up
> skipping the page table copy when it should not be done.

Well, the VM_DONTCOPY test is in dup_mmap(), and I think I'd rather
match up the VM_WIPEONFORK logic with VM_DONTCOPY than with the
copy_page_range() tests.

Because I assume you are talking about the "if it's a shared mapping,
we don't need to copy the page tables and can just do it at page fault
time instead" part? That's a rather different thing, which isn't so
much about semantics, as about just a trade-off about when to touch
the page tables.

But yes, that one *might* make sense in dup_mmap() too. I just don't
think it's really analogous to the WIPEONFORK and DONTCOPY tests.

  Linus


Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK

2017-08-11 Thread Rik van Riel
On Fri, 2017-08-11 at 12:42 -0700, Linus Torvalds wrote:
> On Fri, Aug 11, 2017 at 12:19 PM,   wrote:
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 0e517be91a89..f9b0ad7feb57 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -1134,6 +1134,16 @@ int copy_page_range(struct mm_struct
> > *dst_mm, struct mm_struct *src_mm,
> > !vma->anon_vma)
> > return 0;
> > 
> > +   /*
> > +* With VM_WIPEONFORK, the child inherits the VMA from the
> > +* parent, but not its contents.
> > +*
> > +* A child accessing VM_WIPEONFORK memory will see all
> > zeroes;
> > +* a child accessing VM_DONTCOPY memory receives a
> > segfault.
> > +*/
> > +   if (vma->vm_flags & VM_WIPEONFORK)
> > +   return 0;
> > +
> 
> Is this right?
> 
> Yes, you don't do the page table copies. Fine. But you leave vma with
> the the anon_vma pointer - doesn't that mean that it's still
> connected
> to the original anonvma chain, and we might end up swapping something
> in?

Swapping something in would require there to be a swap entry in
the page table entries, which we are not copying, so this should
not be a correctness issue.

> And even if that ends up not being an issue, I'd expect that you'd
> want to break the anon_vma chain just to not make it grow
> unnecessarily.

This is a good point. I can send a v4 that skips the anon_vma_fork()
call if VM_WIPEONFORK, and calls anon_vma_prepare(), instead.

> So my gut feel is that doing this in "copy_page_range()" is wrong,
> and
> the logic should be moved up to dup_mmap(), where we can also
> short-circuit the anon_vma chain entirely.
> 
> No?

There is another test in copy_page_range already which ends up
skipping the page table copy when it should not be done.

If you want, I can move that test into a should_copy_page_range()
function, and call that from dup_mmap(), skipping the call to
copy_page_range() if should_copy_page_range() returns false.

Having only one of the two sets of tests in dup_mmap(), and
the other in copy_page_range() seems wrong.

Just let me know what you prefer, and I'll put that in v4.

> The madvice() interface looks fine to me.

That was the main reason for adding you to the thread :)

kind regards,

Rik


Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK

2017-08-11 Thread Rik van Riel
On Fri, 2017-08-11 at 12:42 -0700, Linus Torvalds wrote:
> On Fri, Aug 11, 2017 at 12:19 PM,   wrote:
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 0e517be91a89..f9b0ad7feb57 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -1134,6 +1134,16 @@ int copy_page_range(struct mm_struct
> > *dst_mm, struct mm_struct *src_mm,
> > !vma->anon_vma)
> > return 0;
> > 
> > +   /*
> > +* With VM_WIPEONFORK, the child inherits the VMA from the
> > +* parent, but not its contents.
> > +*
> > +* A child accessing VM_WIPEONFORK memory will see all
> > zeroes;
> > +* a child accessing VM_DONTCOPY memory receives a
> > segfault.
> > +*/
> > +   if (vma->vm_flags & VM_WIPEONFORK)
> > +   return 0;
> > +
> 
> Is this right?
> 
> Yes, you don't do the page table copies. Fine. But you leave vma with
> the the anon_vma pointer - doesn't that mean that it's still
> connected
> to the original anonvma chain, and we might end up swapping something
> in?

Swapping something in would require there to be a swap entry in
the page table entries, which we are not copying, so this should
not be a correctness issue.

> And even if that ends up not being an issue, I'd expect that you'd
> want to break the anon_vma chain just to not make it grow
> unnecessarily.

This is a good point. I can send a v4 that skips the anon_vma_fork()
call if VM_WIPEONFORK, and calls anon_vma_prepare(), instead.

> So my gut feel is that doing this in "copy_page_range()" is wrong,
> and
> the logic should be moved up to dup_mmap(), where we can also
> short-circuit the anon_vma chain entirely.
> 
> No?

There is another test in copy_page_range already which ends up
skipping the page table copy when it should not be done.

If you want, I can move that test into a should_copy_page_range()
function, and call that from dup_mmap(), skipping the call to
copy_page_range() if should_copy_page_range() returns false.

Having only one of the two sets of tests in dup_mmap(), and
the other in copy_page_range() seems wrong.

Just let me know what you prefer, and I'll put that in v4.

> The madvice() interface looks fine to me.

That was the main reason for adding you to the thread :)

kind regards,

Rik


Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK

2017-08-11 Thread Linus Torvalds
On Fri, Aug 11, 2017 at 12:19 PM,   wrote:
> diff --git a/mm/memory.c b/mm/memory.c
> index 0e517be91a89..f9b0ad7feb57 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1134,6 +1134,16 @@ int copy_page_range(struct mm_struct *dst_mm, struct 
> mm_struct *src_mm,
> !vma->anon_vma)
> return 0;
>
> +   /*
> +* With VM_WIPEONFORK, the child inherits the VMA from the
> +* parent, but not its contents.
> +*
> +* A child accessing VM_WIPEONFORK memory will see all zeroes;
> +* a child accessing VM_DONTCOPY memory receives a segfault.
> +*/
> +   if (vma->vm_flags & VM_WIPEONFORK)
> +   return 0;
> +

Is this right?

Yes, you don't do the page table copies. Fine. But you leave vma with
the the anon_vma pointer - doesn't that mean that it's still connected
to the original anonvma chain, and we might end up swapping something
in?

And even if that ends up not being an issue, I'd expect that you'd
want to break the anon_vma chain just to not make it grow
unnecessarily.

So my gut feel is that doing this in "copy_page_range()" is wrong, and
the logic should be moved up to dup_mmap(), where we can also
short-circuit the anon_vma chain entirely.

No?

The madvice() interface looks fine to me.

  Linus


Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK

2017-08-11 Thread Linus Torvalds
On Fri, Aug 11, 2017 at 12:19 PM,   wrote:
> diff --git a/mm/memory.c b/mm/memory.c
> index 0e517be91a89..f9b0ad7feb57 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1134,6 +1134,16 @@ int copy_page_range(struct mm_struct *dst_mm, struct 
> mm_struct *src_mm,
> !vma->anon_vma)
> return 0;
>
> +   /*
> +* With VM_WIPEONFORK, the child inherits the VMA from the
> +* parent, but not its contents.
> +*
> +* A child accessing VM_WIPEONFORK memory will see all zeroes;
> +* a child accessing VM_DONTCOPY memory receives a segfault.
> +*/
> +   if (vma->vm_flags & VM_WIPEONFORK)
> +   return 0;
> +

Is this right?

Yes, you don't do the page table copies. Fine. But you leave vma with
the the anon_vma pointer - doesn't that mean that it's still connected
to the original anonvma chain, and we might end up swapping something
in?

And even if that ends up not being an issue, I'd expect that you'd
want to break the anon_vma chain just to not make it grow
unnecessarily.

So my gut feel is that doing this in "copy_page_range()" is wrong, and
the logic should be moved up to dup_mmap(), where we can also
short-circuit the anon_vma chain entirely.

No?

The madvice() interface looks fine to me.

  Linus


Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK

2017-08-11 Thread Mike Kravetz
On 08/11/2017 09:59 AM, Rik van Riel wrote:
> On Fri, 2017-08-11 at 09:36 -0700, Mike Kravetz wrote:
>> On 08/11/2017 08:23 AM, Rik van Riel wrote:
>>> On Thu, 2017-08-10 at 17:23 +0200, Michal Hocko wrote:
 On Sun 06-08-17 10:04:25, Rik van Riel wrote:
 [...]
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 17921b0390b4..db1fb2802ecc 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -659,6 +659,13 @@ static __latent_entropy int
> dup_mmap(struct
> mm_struct *mm,
>   tmp->vm_flags &= ~(VM_LOCKED |
> VM_LOCKONFAULT);
>   tmp->vm_next = tmp->vm_prev = NULL;
>   file = tmp->vm_file;
> +
> + /* With VM_WIPEONFORK, the child gets an empty
> VMA. */
> + if (tmp->vm_flags & VM_WIPEONFORK) {
> + tmp->vm_file = file = NULL;
> + tmp->vm_ops = NULL;
> + }

 What about VM_SHARED/|VM)MAYSHARE flags. Is it OK to keep the
 around?
 At
 least do_anonymous_page SIGBUS on !vm_ops && VM_SHARED. Or do I
 miss
 where those flags are cleared?
>>>
>>> Huh, good spotting.  That makes me wonder why the test case that
>>> Mike and I ran worked just fine on a MAP_SHARED|MAP_ANONYMOUS VMA,
>>> and returned zero-filled memory when read by the child process.
>>
>> Well, I think I still got a BUG with a MAP_SHARED|MAP_ANONYMOUS vma
>> on
>> your v2 patch.  Did not really want to start a discussion on the
>> implementation until the issue of exactly what VM_WIPEONFORK was
>> supposed
>> to do was settled.
> 
> It worked here, but now I don't understand why :)
> 
>>>
>>> OK, I'll do a minimal implementation for now, which will return
>>> -EINVAL if MADV_WIPEONFORK is called on a VMA with MAP_SHARED
>>> and/or an mmapped file.
>>>
>>> It will work the way it is supposed to with anonymous MAP_PRIVATE
>>> memory, which is likely the only memory it will be used on, anyway.
>>>
>>
>> Seems reasonable.
>>
>> You should also add VM_HUGETLB to those returning -EINVAL.  IIRC, a
>> VM_HUGETLB vma even without VM_SHARED expects vm_file != NULL.
> 
> In other words (flags & MAP_SHARED || vma->vm_file) would catch
> hugetlbfs, too?

Yes, that should catch hugetlbfs.

-- 
Mike Kravetz


Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK

2017-08-11 Thread Mike Kravetz
On 08/11/2017 09:59 AM, Rik van Riel wrote:
> On Fri, 2017-08-11 at 09:36 -0700, Mike Kravetz wrote:
>> On 08/11/2017 08:23 AM, Rik van Riel wrote:
>>> On Thu, 2017-08-10 at 17:23 +0200, Michal Hocko wrote:
 On Sun 06-08-17 10:04:25, Rik van Riel wrote:
 [...]
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 17921b0390b4..db1fb2802ecc 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -659,6 +659,13 @@ static __latent_entropy int
> dup_mmap(struct
> mm_struct *mm,
>   tmp->vm_flags &= ~(VM_LOCKED |
> VM_LOCKONFAULT);
>   tmp->vm_next = tmp->vm_prev = NULL;
>   file = tmp->vm_file;
> +
> + /* With VM_WIPEONFORK, the child gets an empty
> VMA. */
> + if (tmp->vm_flags & VM_WIPEONFORK) {
> + tmp->vm_file = file = NULL;
> + tmp->vm_ops = NULL;
> + }

 What about VM_SHARED/|VM)MAYSHARE flags. Is it OK to keep the
 around?
 At
 least do_anonymous_page SIGBUS on !vm_ops && VM_SHARED. Or do I
 miss
 where those flags are cleared?
>>>
>>> Huh, good spotting.  That makes me wonder why the test case that
>>> Mike and I ran worked just fine on a MAP_SHARED|MAP_ANONYMOUS VMA,
>>> and returned zero-filled memory when read by the child process.
>>
>> Well, I think I still got a BUG with a MAP_SHARED|MAP_ANONYMOUS vma
>> on
>> your v2 patch.  Did not really want to start a discussion on the
>> implementation until the issue of exactly what VM_WIPEONFORK was
>> supposed
>> to do was settled.
> 
> It worked here, but now I don't understand why :)
> 
>>>
>>> OK, I'll do a minimal implementation for now, which will return
>>> -EINVAL if MADV_WIPEONFORK is called on a VMA with MAP_SHARED
>>> and/or an mmapped file.
>>>
>>> It will work the way it is supposed to with anonymous MAP_PRIVATE
>>> memory, which is likely the only memory it will be used on, anyway.
>>>
>>
>> Seems reasonable.
>>
>> You should also add VM_HUGETLB to those returning -EINVAL.  IIRC, a
>> VM_HUGETLB vma even without VM_SHARED expects vm_file != NULL.
> 
> In other words (flags & MAP_SHARED || vma->vm_file) would catch
> hugetlbfs, too?

Yes, that should catch hugetlbfs.

-- 
Mike Kravetz


Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK

2017-08-11 Thread Rik van Riel
On Fri, 2017-08-11 at 09:36 -0700, Mike Kravetz wrote:
> On 08/11/2017 08:23 AM, Rik van Riel wrote:
> > On Thu, 2017-08-10 at 17:23 +0200, Michal Hocko wrote:
> > > On Sun 06-08-17 10:04:25, Rik van Riel wrote:
> > > [...]
> > > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > > index 17921b0390b4..db1fb2802ecc 100644
> > > > --- a/kernel/fork.c
> > > > +++ b/kernel/fork.c
> > > > @@ -659,6 +659,13 @@ static __latent_entropy int
> > > > dup_mmap(struct
> > > > mm_struct *mm,
> > > >     tmp->vm_flags &= ~(VM_LOCKED |
> > > > VM_LOCKONFAULT);
> > > >     tmp->vm_next = tmp->vm_prev = NULL;
> > > >     file = tmp->vm_file;
> > > > +
> > > > +   /* With VM_WIPEONFORK, the child gets an empty
> > > > VMA. */
> > > > +   if (tmp->vm_flags & VM_WIPEONFORK) {
> > > > +   tmp->vm_file = file = NULL;
> > > > +   tmp->vm_ops = NULL;
> > > > +   }
> > > 
> > > What about VM_SHARED/|VM)MAYSHARE flags. Is it OK to keep the
> > > around?
> > > At
> > > least do_anonymous_page SIGBUS on !vm_ops && VM_SHARED. Or do I
> > > miss
> > > where those flags are cleared?
> > 
> > Huh, good spotting.  That makes me wonder why the test case that
> > Mike and I ran worked just fine on a MAP_SHARED|MAP_ANONYMOUS VMA,
> > and returned zero-filled memory when read by the child process.
> 
> Well, I think I still got a BUG with a MAP_SHARED|MAP_ANONYMOUS vma
> on
> your v2 patch.  Did not really want to start a discussion on the
> implementation until the issue of exactly what VM_WIPEONFORK was
> supposed
> to do was settled.

It worked here, but now I don't understand why :)

> > 
> > OK, I'll do a minimal implementation for now, which will return
> > -EINVAL if MADV_WIPEONFORK is called on a VMA with MAP_SHARED
> > and/or an mmapped file.
> > 
> > It will work the way it is supposed to with anonymous MAP_PRIVATE
> > memory, which is likely the only memory it will be used on, anyway.
> > 
> 
> Seems reasonable.
> 
> You should also add VM_HUGETLB to those returning -EINVAL.  IIRC, a
> VM_HUGETLB vma even without VM_SHARED expects vm_file != NULL.

In other words (flags & MAP_SHARED || vma->vm_file) would catch
hugetlbfs, too?



Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK

2017-08-11 Thread Rik van Riel
On Fri, 2017-08-11 at 09:36 -0700, Mike Kravetz wrote:
> On 08/11/2017 08:23 AM, Rik van Riel wrote:
> > On Thu, 2017-08-10 at 17:23 +0200, Michal Hocko wrote:
> > > On Sun 06-08-17 10:04:25, Rik van Riel wrote:
> > > [...]
> > > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > > index 17921b0390b4..db1fb2802ecc 100644
> > > > --- a/kernel/fork.c
> > > > +++ b/kernel/fork.c
> > > > @@ -659,6 +659,13 @@ static __latent_entropy int
> > > > dup_mmap(struct
> > > > mm_struct *mm,
> > > >     tmp->vm_flags &= ~(VM_LOCKED |
> > > > VM_LOCKONFAULT);
> > > >     tmp->vm_next = tmp->vm_prev = NULL;
> > > >     file = tmp->vm_file;
> > > > +
> > > > +   /* With VM_WIPEONFORK, the child gets an empty
> > > > VMA. */
> > > > +   if (tmp->vm_flags & VM_WIPEONFORK) {
> > > > +   tmp->vm_file = file = NULL;
> > > > +   tmp->vm_ops = NULL;
> > > > +   }
> > > 
> > > What about VM_SHARED/|VM)MAYSHARE flags. Is it OK to keep the
> > > around?
> > > At
> > > least do_anonymous_page SIGBUS on !vm_ops && VM_SHARED. Or do I
> > > miss
> > > where those flags are cleared?
> > 
> > Huh, good spotting.  That makes me wonder why the test case that
> > Mike and I ran worked just fine on a MAP_SHARED|MAP_ANONYMOUS VMA,
> > and returned zero-filled memory when read by the child process.
> 
> Well, I think I still got a BUG with a MAP_SHARED|MAP_ANONYMOUS vma
> on
> your v2 patch.  Did not really want to start a discussion on the
> implementation until the issue of exactly what VM_WIPEONFORK was
> supposed
> to do was settled.

It worked here, but now I don't understand why :)

> > 
> > OK, I'll do a minimal implementation for now, which will return
> > -EINVAL if MADV_WIPEONFORK is called on a VMA with MAP_SHARED
> > and/or an mmapped file.
> > 
> > It will work the way it is supposed to with anonymous MAP_PRIVATE
> > memory, which is likely the only memory it will be used on, anyway.
> > 
> 
> Seems reasonable.
> 
> You should also add VM_HUGETLB to those returning -EINVAL.  IIRC, a
> VM_HUGETLB vma even without VM_SHARED expects vm_file != NULL.

In other words (flags & MAP_SHARED || vma->vm_file) would catch
hugetlbfs, too?



Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK

2017-08-11 Thread Mike Kravetz
On 08/11/2017 08:23 AM, Rik van Riel wrote:
> On Thu, 2017-08-10 at 17:23 +0200, Michal Hocko wrote:
>> On Sun 06-08-17 10:04:25, Rik van Riel wrote:
>> [...]
>>> diff --git a/kernel/fork.c b/kernel/fork.c
>>> index 17921b0390b4..db1fb2802ecc 100644
>>> --- a/kernel/fork.c
>>> +++ b/kernel/fork.c
>>> @@ -659,6 +659,13 @@ static __latent_entropy int dup_mmap(struct
>>> mm_struct *mm,
>>> tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT);
>>> tmp->vm_next = tmp->vm_prev = NULL;
>>> file = tmp->vm_file;
>>> +
>>> +   /* With VM_WIPEONFORK, the child gets an empty
>>> VMA. */
>>> +   if (tmp->vm_flags & VM_WIPEONFORK) {
>>> +   tmp->vm_file = file = NULL;
>>> +   tmp->vm_ops = NULL;
>>> +   }
>>
>> What about VM_SHARED/|VM)MAYSHARE flags. Is it OK to keep the around?
>> At
>> least do_anonymous_page SIGBUS on !vm_ops && VM_SHARED. Or do I miss
>> where those flags are cleared?
> 
> Huh, good spotting.  That makes me wonder why the test case that
> Mike and I ran worked just fine on a MAP_SHARED|MAP_ANONYMOUS VMA,
> and returned zero-filled memory when read by the child process.

Well, I think I still got a BUG with a MAP_SHARED|MAP_ANONYMOUS vma on
your v2 patch.  Did not really want to start a discussion on the
implementation until the issue of exactly what VM_WIPEONFORK was supposed
to do was settled.

> 
> OK, I'll do a minimal implementation for now, which will return
> -EINVAL if MADV_WIPEONFORK is called on a VMA with MAP_SHARED
> and/or an mmapped file.
> 
> It will work the way it is supposed to with anonymous MAP_PRIVATE
> memory, which is likely the only memory it will be used on, anyway.
> 

Seems reasonable.

You should also add VM_HUGETLB to those returning -EINVAL.  IIRC, a
VM_HUGETLB vma even without VM_SHARED expects vm_file != NULL.

-- 
Mike Kravetz


Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK

2017-08-11 Thread Mike Kravetz
On 08/11/2017 08:23 AM, Rik van Riel wrote:
> On Thu, 2017-08-10 at 17:23 +0200, Michal Hocko wrote:
>> On Sun 06-08-17 10:04:25, Rik van Riel wrote:
>> [...]
>>> diff --git a/kernel/fork.c b/kernel/fork.c
>>> index 17921b0390b4..db1fb2802ecc 100644
>>> --- a/kernel/fork.c
>>> +++ b/kernel/fork.c
>>> @@ -659,6 +659,13 @@ static __latent_entropy int dup_mmap(struct
>>> mm_struct *mm,
>>> tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT);
>>> tmp->vm_next = tmp->vm_prev = NULL;
>>> file = tmp->vm_file;
>>> +
>>> +   /* With VM_WIPEONFORK, the child gets an empty
>>> VMA. */
>>> +   if (tmp->vm_flags & VM_WIPEONFORK) {
>>> +   tmp->vm_file = file = NULL;
>>> +   tmp->vm_ops = NULL;
>>> +   }
>>
>> What about VM_SHARED/|VM)MAYSHARE flags. Is it OK to keep the around?
>> At
>> least do_anonymous_page SIGBUS on !vm_ops && VM_SHARED. Or do I miss
>> where those flags are cleared?
> 
> Huh, good spotting.  That makes me wonder why the test case that
> Mike and I ran worked just fine on a MAP_SHARED|MAP_ANONYMOUS VMA,
> and returned zero-filled memory when read by the child process.

Well, I think I still got a BUG with a MAP_SHARED|MAP_ANONYMOUS vma on
your v2 patch.  Did not really want to start a discussion on the
implementation until the issue of exactly what VM_WIPEONFORK was supposed
to do was settled.

> 
> OK, I'll do a minimal implementation for now, which will return
> -EINVAL if MADV_WIPEONFORK is called on a VMA with MAP_SHARED
> and/or an mmapped file.
> 
> It will work the way it is supposed to with anonymous MAP_PRIVATE
> memory, which is likely the only memory it will be used on, anyway.
> 

Seems reasonable.

You should also add VM_HUGETLB to those returning -EINVAL.  IIRC, a
VM_HUGETLB vma even without VM_SHARED expects vm_file != NULL.

-- 
Mike Kravetz


Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK

2017-08-11 Thread Rik van Riel
On Thu, 2017-08-10 at 17:23 +0200, Michal Hocko wrote:
> On Sun 06-08-17 10:04:25, Rik van Riel wrote:
> [...]
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 17921b0390b4..db1fb2802ecc 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -659,6 +659,13 @@ static __latent_entropy int dup_mmap(struct
> > mm_struct *mm,
> >     tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT);
> >     tmp->vm_next = tmp->vm_prev = NULL;
> >     file = tmp->vm_file;
> > +
> > +   /* With VM_WIPEONFORK, the child gets an empty
> > VMA. */
> > +   if (tmp->vm_flags & VM_WIPEONFORK) {
> > +   tmp->vm_file = file = NULL;
> > +   tmp->vm_ops = NULL;
> > +   }
> 
> What about VM_SHARED/|VM)MAYSHARE flags. Is it OK to keep the around?
> At
> least do_anonymous_page SIGBUS on !vm_ops && VM_SHARED. Or do I miss
> where those flags are cleared?

Huh, good spotting.  That makes me wonder why the test case that
Mike and I ran worked just fine on a MAP_SHARED|MAP_ANONYMOUS VMA,
and returned zero-filled memory when read by the child process.

OK, I'll do a minimal implementation for now, which will return
-EINVAL if MADV_WIPEONFORK is called on a VMA with MAP_SHARED
and/or an mmapped file.

It will work the way it is supposed to with anonymous MAP_PRIVATE
memory, which is likely the only memory it will be used on, anyway.


Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK

2017-08-11 Thread Rik van Riel
On Thu, 2017-08-10 at 17:23 +0200, Michal Hocko wrote:
> On Sun 06-08-17 10:04:25, Rik van Riel wrote:
> [...]
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 17921b0390b4..db1fb2802ecc 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -659,6 +659,13 @@ static __latent_entropy int dup_mmap(struct
> > mm_struct *mm,
> >     tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT);
> >     tmp->vm_next = tmp->vm_prev = NULL;
> >     file = tmp->vm_file;
> > +
> > +   /* With VM_WIPEONFORK, the child gets an empty
> > VMA. */
> > +   if (tmp->vm_flags & VM_WIPEONFORK) {
> > +   tmp->vm_file = file = NULL;
> > +   tmp->vm_ops = NULL;
> > +   }
> 
> What about VM_SHARED/|VM)MAYSHARE flags. Is it OK to keep the around?
> At
> least do_anonymous_page SIGBUS on !vm_ops && VM_SHARED. Or do I miss
> where those flags are cleared?

Huh, good spotting.  That makes me wonder why the test case that
Mike and I ran worked just fine on a MAP_SHARED|MAP_ANONYMOUS VMA,
and returned zero-filled memory when read by the child process.

OK, I'll do a minimal implementation for now, which will return
-EINVAL if MADV_WIPEONFORK is called on a VMA with MAP_SHARED
and/or an mmapped file.

It will work the way it is supposed to with anonymous MAP_PRIVATE
memory, which is likely the only memory it will be used on, anyway.


Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK

2017-08-10 Thread Michal Hocko
On Sun 06-08-17 10:04:25, Rik van Riel wrote:
[...]
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 17921b0390b4..db1fb2802ecc 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -659,6 +659,13 @@ static __latent_entropy int dup_mmap(struct mm_struct 
> *mm,
>   tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT);
>   tmp->vm_next = tmp->vm_prev = NULL;
>   file = tmp->vm_file;
> +
> + /* With VM_WIPEONFORK, the child gets an empty VMA. */
> + if (tmp->vm_flags & VM_WIPEONFORK) {
> + tmp->vm_file = file = NULL;
> + tmp->vm_ops = NULL;
> + }

What about VM_SHARED/|VM)MAYSHARE flags. Is it OK to keep the around? At
least do_anonymous_page SIGBUS on !vm_ops && VM_SHARED. Or do I miss
where those flags are cleared?

> +
>   if (file) {
>   struct inode *inode = file_inode(file);
>   struct address_space *mapping = file->f_mapping;
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK

2017-08-10 Thread Michal Hocko
On Sun 06-08-17 10:04:25, Rik van Riel wrote:
[...]
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 17921b0390b4..db1fb2802ecc 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -659,6 +659,13 @@ static __latent_entropy int dup_mmap(struct mm_struct 
> *mm,
>   tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT);
>   tmp->vm_next = tmp->vm_prev = NULL;
>   file = tmp->vm_file;
> +
> + /* With VM_WIPEONFORK, the child gets an empty VMA. */
> + if (tmp->vm_flags & VM_WIPEONFORK) {
> + tmp->vm_file = file = NULL;
> + tmp->vm_ops = NULL;
> + }

What about VM_SHARED/|VM)MAYSHARE flags. Is it OK to keep the around? At
least do_anonymous_page SIGBUS on !vm_ops && VM_SHARED. Or do I miss
where those flags are cleared?

> +
>   if (file) {
>   struct inode *inode = file_inode(file);
>   struct address_space *mapping = file->f_mapping;
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK

2017-08-05 Thread kbuild test robot
Hi Rik,

[auto build test ERROR on linus/master]
[also build test ERROR on v4.13-rc3 next-20170804]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/riel-redhat-com/x86-mpx-make-mpx-depend-on-x86-64-to-free-up-VMA-flag/20170806-011851
config: tile-allmodconfig (attached as .config)
compiler: tilegx-linux-gcc (GCC) 4.6.2
reproduce:
wget 
https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=tile 

All errors (new ones prefixed by >>):

>> mm/debug.c:39:2: error: 'VM_ARCH_2' undeclared here (not in a function)

vim +/VM_ARCH_2 +39 mm/debug.c

420adbe9f Vlastimil Babka 2016-03-15  37  
edf14cdbf Vlastimil Babka 2016-03-15  38  const struct trace_print_flags 
vmaflag_names[] = {
edf14cdbf Vlastimil Babka 2016-03-15 @39__def_vmaflag_names,
edf14cdbf Vlastimil Babka 2016-03-15  40{0, NULL}
82742a3a5 Sasha Levin 2014-10-09  41  };
82742a3a5 Sasha Levin 2014-10-09  42  

:: The code at line 39 was first introduced by commit
:: edf14cdbf9a0e5ab52698ca66d07a76ade0d5c46 mm, printk: introduce new 
format string for flags

:: TO: Vlastimil Babka 
:: CC: Linus Torvalds 

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK

2017-08-05 Thread kbuild test robot
Hi Rik,

[auto build test ERROR on linus/master]
[also build test ERROR on v4.13-rc3 next-20170804]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/riel-redhat-com/x86-mpx-make-mpx-depend-on-x86-64-to-free-up-VMA-flag/20170806-011851
config: tile-allmodconfig (attached as .config)
compiler: tilegx-linux-gcc (GCC) 4.6.2
reproduce:
wget 
https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=tile 

All errors (new ones prefixed by >>):

>> mm/debug.c:39:2: error: 'VM_ARCH_2' undeclared here (not in a function)

vim +/VM_ARCH_2 +39 mm/debug.c

420adbe9f Vlastimil Babka 2016-03-15  37  
edf14cdbf Vlastimil Babka 2016-03-15  38  const struct trace_print_flags 
vmaflag_names[] = {
edf14cdbf Vlastimil Babka 2016-03-15 @39__def_vmaflag_names,
edf14cdbf Vlastimil Babka 2016-03-15  40{0, NULL}
82742a3a5 Sasha Levin 2014-10-09  41  };
82742a3a5 Sasha Levin 2014-10-09  42  

:: The code at line 39 was first introduced by commit
:: edf14cdbf9a0e5ab52698ca66d07a76ade0d5c46 mm, printk: introduce new 
format string for flags

:: TO: Vlastimil Babka 
:: CC: Linus Torvalds 

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK

2017-08-05 Thread kbuild test robot
Hi Rik,

[auto build test ERROR on linus/master]
[also build test ERROR on v4.13-rc3 next-20170804]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/riel-redhat-com/x86-mpx-make-mpx-depend-on-x86-64-to-free-up-VMA-flag/20170806-011851
config: xtensa-allmodconfig (attached as .config)
compiler: xtensa-linux-gcc (GCC) 4.9.0
reproduce:
wget 
https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=xtensa 

All error/warnings (new ones prefixed by >>):

   In file included from mm/debug.c:12:0:
>> include/trace/events/mmflags.h:131:31: error: 'VM_ARCH_2' undeclared here 
>> (not in a function)
#define __VM_ARCH_SPECIFIC_2 {VM_ARCH_2, "arch_2" }
  ^
>> include/trace/events/mmflags.h:165:2: note: in expansion of macro 
>> '__VM_ARCH_SPECIFIC_2'
 __VM_ARCH_SPECIFIC_2,  \
 ^
>> mm/debug.c:39:2: note: in expansion of macro '__def_vmaflag_names'
 __def_vmaflag_names,
 ^

vim +/VM_ARCH_2 +131 include/trace/events/mmflags.h

bcf669179 Kirill A. Shutemov 2016-03-17  127  
bcf669179 Kirill A. Shutemov 2016-03-17  128  #if defined(CONFIG_X86)
bcf669179 Kirill A. Shutemov 2016-03-17  129  #define __VM_ARCH_SPECIFIC_2 
{VM_MPX, "mpx"   }
bcf669179 Kirill A. Shutemov 2016-03-17  130  #else
bcf669179 Kirill A. Shutemov 2016-03-17 @131  #define __VM_ARCH_SPECIFIC_2 
{VM_ARCH_2,  "arch_2"}
420adbe9f Vlastimil Babka2016-03-15  132  #endif
420adbe9f Vlastimil Babka2016-03-15  133  
420adbe9f Vlastimil Babka2016-03-15  134  #ifdef CONFIG_MEM_SOFT_DIRTY
420adbe9f Vlastimil Babka2016-03-15  135  #define 
IF_HAVE_VM_SOFTDIRTY(flag,name) {flag, name },
420adbe9f Vlastimil Babka2016-03-15  136  #else
420adbe9f Vlastimil Babka2016-03-15  137  #define 
IF_HAVE_VM_SOFTDIRTY(flag,name)
420adbe9f Vlastimil Babka2016-03-15  138  #endif
420adbe9f Vlastimil Babka2016-03-15  139  
420adbe9f Vlastimil Babka2016-03-15  140  #define __def_vmaflag_names   
\
420adbe9f Vlastimil Babka2016-03-15  141{VM_READ,   
"read"  },  \
420adbe9f Vlastimil Babka2016-03-15  142{VM_WRITE,  
"write" },  \
420adbe9f Vlastimil Babka2016-03-15  143{VM_EXEC,   
"exec"  },  \
420adbe9f Vlastimil Babka2016-03-15  144{VM_SHARED, 
"shared"},  \
420adbe9f Vlastimil Babka2016-03-15  145{VM_MAYREAD,
"mayread"   },  \
420adbe9f Vlastimil Babka2016-03-15  146{VM_MAYWRITE,   
"maywrite"  },  \
420adbe9f Vlastimil Babka2016-03-15  147{VM_MAYEXEC,
"mayexec"   },  \
420adbe9f Vlastimil Babka2016-03-15  148{VM_MAYSHARE,   
"mayshare"  },  \
420adbe9f Vlastimil Babka2016-03-15  149{VM_GROWSDOWN,  
"growsdown" },  \
bcf669179 Kirill A. Shutemov 2016-03-17  150{VM_UFFD_MISSING,   
"uffd_missing"  },  \
420adbe9f Vlastimil Babka2016-03-15  151{VM_PFNMAP, 
"pfnmap"},  \
420adbe9f Vlastimil Babka2016-03-15  152{VM_DENYWRITE,  
"denywrite" },  \
bcf669179 Kirill A. Shutemov 2016-03-17  153{VM_UFFD_WP,
"uffd_wp"   },  \
420adbe9f Vlastimil Babka2016-03-15  154{VM_LOCKED, 
"locked"},  \
420adbe9f Vlastimil Babka2016-03-15  155{VM_IO, 
"io"},  \
420adbe9f Vlastimil Babka2016-03-15  156{VM_SEQ_READ,   
"seqread"   },  \
420adbe9f Vlastimil Babka2016-03-15  157{VM_RAND_READ,  
"randread"  },  \
420adbe9f Vlastimil Babka2016-03-15  158{VM_DONTCOPY,   
"dontcopy"  },  \
420adbe9f Vlastimil Babka2016-03-15  159{VM_DONTEXPAND, 
"dontexpand"},  \
bcf669179 Kirill A. Shutemov 2016-03-17  160{VM_LOCKONFAULT,
"lockonfault"   },  \
420adbe9f Vlastimil Babka2016-03-15  161{VM_ACCOUNT,
"account"   },  \
420adbe9f Vlastimil Babka2016-03-15  162{VM_NORESERVE,  
"noreserve" },  \
420adbe9f Vlastimil Babka2016-03-15  163{VM_HUGETLB,
"hugetlb"   },  \
bcf669179 Kirill A. Shutemov 2016-03-17  164__VM_ARCH_SPECIFIC_1

Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK

2017-08-05 Thread kbuild test robot
Hi Rik,

[auto build test ERROR on linus/master]
[also build test ERROR on v4.13-rc3 next-20170804]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/riel-redhat-com/x86-mpx-make-mpx-depend-on-x86-64-to-free-up-VMA-flag/20170806-011851
config: xtensa-allmodconfig (attached as .config)
compiler: xtensa-linux-gcc (GCC) 4.9.0
reproduce:
wget 
https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=xtensa 

All error/warnings (new ones prefixed by >>):

   In file included from mm/debug.c:12:0:
>> include/trace/events/mmflags.h:131:31: error: 'VM_ARCH_2' undeclared here 
>> (not in a function)
#define __VM_ARCH_SPECIFIC_2 {VM_ARCH_2, "arch_2" }
  ^
>> include/trace/events/mmflags.h:165:2: note: in expansion of macro 
>> '__VM_ARCH_SPECIFIC_2'
 __VM_ARCH_SPECIFIC_2,  \
 ^
>> mm/debug.c:39:2: note: in expansion of macro '__def_vmaflag_names'
 __def_vmaflag_names,
 ^

vim +/VM_ARCH_2 +131 include/trace/events/mmflags.h

bcf669179 Kirill A. Shutemov 2016-03-17  127  
bcf669179 Kirill A. Shutemov 2016-03-17  128  #if defined(CONFIG_X86)
bcf669179 Kirill A. Shutemov 2016-03-17  129  #define __VM_ARCH_SPECIFIC_2 
{VM_MPX, "mpx"   }
bcf669179 Kirill A. Shutemov 2016-03-17  130  #else
bcf669179 Kirill A. Shutemov 2016-03-17 @131  #define __VM_ARCH_SPECIFIC_2 
{VM_ARCH_2,  "arch_2"}
420adbe9f Vlastimil Babka2016-03-15  132  #endif
420adbe9f Vlastimil Babka2016-03-15  133  
420adbe9f Vlastimil Babka2016-03-15  134  #ifdef CONFIG_MEM_SOFT_DIRTY
420adbe9f Vlastimil Babka2016-03-15  135  #define 
IF_HAVE_VM_SOFTDIRTY(flag,name) {flag, name },
420adbe9f Vlastimil Babka2016-03-15  136  #else
420adbe9f Vlastimil Babka2016-03-15  137  #define 
IF_HAVE_VM_SOFTDIRTY(flag,name)
420adbe9f Vlastimil Babka2016-03-15  138  #endif
420adbe9f Vlastimil Babka2016-03-15  139  
420adbe9f Vlastimil Babka2016-03-15  140  #define __def_vmaflag_names   
\
420adbe9f Vlastimil Babka2016-03-15  141{VM_READ,   
"read"  },  \
420adbe9f Vlastimil Babka2016-03-15  142{VM_WRITE,  
"write" },  \
420adbe9f Vlastimil Babka2016-03-15  143{VM_EXEC,   
"exec"  },  \
420adbe9f Vlastimil Babka2016-03-15  144{VM_SHARED, 
"shared"},  \
420adbe9f Vlastimil Babka2016-03-15  145{VM_MAYREAD,
"mayread"   },  \
420adbe9f Vlastimil Babka2016-03-15  146{VM_MAYWRITE,   
"maywrite"  },  \
420adbe9f Vlastimil Babka2016-03-15  147{VM_MAYEXEC,
"mayexec"   },  \
420adbe9f Vlastimil Babka2016-03-15  148{VM_MAYSHARE,   
"mayshare"  },  \
420adbe9f Vlastimil Babka2016-03-15  149{VM_GROWSDOWN,  
"growsdown" },  \
bcf669179 Kirill A. Shutemov 2016-03-17  150{VM_UFFD_MISSING,   
"uffd_missing"  },  \
420adbe9f Vlastimil Babka2016-03-15  151{VM_PFNMAP, 
"pfnmap"},  \
420adbe9f Vlastimil Babka2016-03-15  152{VM_DENYWRITE,  
"denywrite" },  \
bcf669179 Kirill A. Shutemov 2016-03-17  153{VM_UFFD_WP,
"uffd_wp"   },  \
420adbe9f Vlastimil Babka2016-03-15  154{VM_LOCKED, 
"locked"},  \
420adbe9f Vlastimil Babka2016-03-15  155{VM_IO, 
"io"},  \
420adbe9f Vlastimil Babka2016-03-15  156{VM_SEQ_READ,   
"seqread"   },  \
420adbe9f Vlastimil Babka2016-03-15  157{VM_RAND_READ,  
"randread"  },  \
420adbe9f Vlastimil Babka2016-03-15  158{VM_DONTCOPY,   
"dontcopy"  },  \
420adbe9f Vlastimil Babka2016-03-15  159{VM_DONTEXPAND, 
"dontexpand"},  \
bcf669179 Kirill A. Shutemov 2016-03-17  160{VM_LOCKONFAULT,
"lockonfault"   },  \
420adbe9f Vlastimil Babka2016-03-15  161{VM_ACCOUNT,
"account"   },  \
420adbe9f Vlastimil Babka2016-03-15  162{VM_NORESERVE,  
"noreserve" },  \
420adbe9f Vlastimil Babka2016-03-15  163{VM_HUGETLB,
"hugetlb"   },  \
bcf669179 Kirill A. Shutemov 2016-03-17  164__VM_ARCH_SPECIFIC_1

Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK

2017-08-05 Thread Rik van Riel
On Fri, 2017-08-04 at 16:09 -0700, Mike Kravetz wrote:
> On 08/04/2017 12:07 PM, r...@redhat.com wrote:
> > From: Rik van Riel 
> > 
> > Introduce MADV_WIPEONFORK semantics, which result in a VMA being
> > empty in the child process after fork. This differs from
> > MADV_DONTFORK
> > in one important way.
> > 
> > If a child process accesses memory that was MADV_WIPEONFORK, it
> > will get zeroes. The address ranges are still valid, they are just
> > empty.
> > 
> This didn't seem 'quite right' to me for shared mappings and/or file
> backed mappings.  I wasn't exactly sure what it 'should' do in such
> cases.  So, I tried it with a mapping created as follows:
> 
> addr = mmap(ADDR, page_size,
> PROT_READ | PROT_WRITE,
> MAP_ANONYMOUS|MAP_SHARED, -1, 0);

Your test program is pretty much the same I used, except I
used MAP_PRIVATE instead of MAP_SHARED.

Let me see how the code paths differ for both cases...


> When setting MADV_WIPEONFORK on the vma/mapping, I got the following
> at task exit time:
> 
> [  694.558290] [ cut here ]
> [  694.558978] kernel BUG at mm/filemap.c:212!
> [  694.559476] invalid opcode:  [#1] SMP
> [  694.560023] Modules linked in: ip6t_REJECT nf_reject_ipv6
> ip6t_rpfilter xt_conntrack ebtable_broute bridge stp llc ebtable_nat
> ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6
> ip6table_raw ip6table_mangle ip6table_security iptable_nat
> nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack
> iptable_raw iptable_mangle 9p iptable_security ebtable_filter
> ebtables ip6table_filter ip6_tables snd_hda_codec_generic
> snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_seq ppdev
> snd_seq_device joydev crct10dif_pclmul crc32_pclmul crc32c_intel
> snd_pcm ghash_clmulni_intel 9pnet_virtio virtio_balloon snd_timer
> 9pnet parport_pc snd parport i2c_piix4 soundcore nfsd auth_rpcgss
> nfs_acl lockd grace sunrpc virtio_net virtio_blk virtio_console
> 8139too qxl drm_kms_helper ttm drm serio_raw 8139cp
> [  694.571554]  mii virtio_pci ata_generic virtio_ring virtio
> pata_acpi
> [  694.572608] CPU: 3 PID: 1200 Comm: test_wipe2 Not tainted 4.13.0-
> rc3+ #8
> [  694.573778] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS 1.9.1-1.fc24 04/01/2014
> [  694.574917] task: 880137178040 task.stack: c900019d4000
> [  694.575650] RIP: 0010:__delete_from_page_cache+0x344/0x410
> [  694.576409] RSP: 0018:c900019d7a88 EFLAGS: 00010082
> [  694.577238] RAX: 0021 RBX: ea00047d0e00 RCX:
> 0006
> [  694.578537] RDX:  RSI: 0096 RDI:
> 88023fd0db90
> [  694.579774] RBP: c900019d7ad8 R08: 000882b6 R09:
> 028a
> [  694.580754] R10: c900019d7da8 R11: 8211184d R12:
> ea00047d0e00
> [  694.582040] R13:  R14: 0202 R15:
> 8801384439e8
> [  694.583236] FS:  () GS:88023fd0()
> knlGS:
> [  694.584607] CS:  0010 DS:  ES:  CR0: 80050033
> [  694.585409] CR2: 7ff77a8da618 CR3: 01e09000 CR4:
> 001406e0
> [  694.586547] Call Trace:
> [  694.586996]  delete_from_page_cache+0x54/0x110
> [  694.587481]  truncate_inode_page+0xab/0x120
> [  694.588110]  shmem_undo_range+0x498/0xa50
> [  694.588813]  ? save_stack_trace+0x1b/0x20
> [  694.589529]  ? set_track+0x70/0x140
> [  694.590150]  ? init_object+0x69/0xa0
> [  694.590722]  ? __inode_wait_for_writeback+0x73/0xe0
> [  694.591525]  shmem_truncate_range+0x16/0x40
> [  694.592268]  shmem_evict_inode+0xb1/0x190
> [  694.592735]  evict+0xbb/0x1c0
> [  694.593147]  iput+0x1c0/0x210
> [  694.593497]  dentry_unlink_inode+0xb4/0x150
> [  694.593982]  __dentry_kill+0xc1/0x150
> [  694.594400]  dput+0x1c8/0x1e0
> [  694.594745]  __fput+0x172/0x1e0
> [  694.595103]  fput+0xe/0x10
> [  694.595463]  task_work_run+0x80/0xa0
> [  694.595886]  do_exit+0x2d6/0xb50
> [  694.596323]  ? __do_page_fault+0x288/0x4a0
> [  694.596818]  do_group_exit+0x47/0xb0
> [  694.597249]  SyS_exit_group+0x14/0x20
> [  694.597682]  entry_SYSCALL_64_fastpath+0x1a/0xa5
> [  694.598198] RIP: 0033:0x7ff77a5e78c8
> [  694.598612] RSP: 002b:7ffc5aece318 EFLAGS: 0246 ORIG_RAX:
> 00e7
> [  694.599804] RAX: ffda RBX:  RCX:
> 7ff77a5e78c8
> [  694.600609] RDX:  RSI: 003c RDI:
> 
> [  694.601424] RBP: 7ff77a8da618 R08: 00e7 R09:
> ff98
> [  694.602224] R10: 0003 R11: 0246 R12:
> 0001
> [  694.603151] R13: 7ff77a8dbc60 R14:  R15:
> 
> [  694.603984] Code: 60 f3 c5 81 e8 2e 7e 03 00 0f 0b 48 c7 c6 60 f3
> c5 81 4c 89 e7 e8 1d 7e 03 00 0f 0b 48 c7 c6 00 f4 c5 81 4c 89 e7 e8
> 0c 7e 03 00 <0f> 0b 48 c7 c6 38 f3 c5 81 4c 89 e7 e8 fb 7d 03 00 0f
> 0b 48 c7 
> [  694.606500] RIP: 

Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK

2017-08-05 Thread Rik van Riel
On Fri, 2017-08-04 at 16:09 -0700, Mike Kravetz wrote:
> On 08/04/2017 12:07 PM, r...@redhat.com wrote:
> > From: Rik van Riel 
> > 
> > Introduce MADV_WIPEONFORK semantics, which result in a VMA being
> > empty in the child process after fork. This differs from
> > MADV_DONTFORK
> > in one important way.
> > 
> > If a child process accesses memory that was MADV_WIPEONFORK, it
> > will get zeroes. The address ranges are still valid, they are just
> > empty.
> > 
> This didn't seem 'quite right' to me for shared mappings and/or file
> backed mappings.  I wasn't exactly sure what it 'should' do in such
> cases.  So, I tried it with a mapping created as follows:
> 
> addr = mmap(ADDR, page_size,
> PROT_READ | PROT_WRITE,
> MAP_ANONYMOUS|MAP_SHARED, -1, 0);

Your test program is pretty much the same I used, except I
used MAP_PRIVATE instead of MAP_SHARED.

Let me see how the code paths differ for both cases...


> When setting MADV_WIPEONFORK on the vma/mapping, I got the following
> at task exit time:
> 
> [  694.558290] [ cut here ]
> [  694.558978] kernel BUG at mm/filemap.c:212!
> [  694.559476] invalid opcode:  [#1] SMP
> [  694.560023] Modules linked in: ip6t_REJECT nf_reject_ipv6
> ip6t_rpfilter xt_conntrack ebtable_broute bridge stp llc ebtable_nat
> ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6
> ip6table_raw ip6table_mangle ip6table_security iptable_nat
> nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack
> iptable_raw iptable_mangle 9p iptable_security ebtable_filter
> ebtables ip6table_filter ip6_tables snd_hda_codec_generic
> snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_seq ppdev
> snd_seq_device joydev crct10dif_pclmul crc32_pclmul crc32c_intel
> snd_pcm ghash_clmulni_intel 9pnet_virtio virtio_balloon snd_timer
> 9pnet parport_pc snd parport i2c_piix4 soundcore nfsd auth_rpcgss
> nfs_acl lockd grace sunrpc virtio_net virtio_blk virtio_console
> 8139too qxl drm_kms_helper ttm drm serio_raw 8139cp
> [  694.571554]  mii virtio_pci ata_generic virtio_ring virtio
> pata_acpi
> [  694.572608] CPU: 3 PID: 1200 Comm: test_wipe2 Not tainted 4.13.0-
> rc3+ #8
> [  694.573778] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS 1.9.1-1.fc24 04/01/2014
> [  694.574917] task: 880137178040 task.stack: c900019d4000
> [  694.575650] RIP: 0010:__delete_from_page_cache+0x344/0x410
> [  694.576409] RSP: 0018:c900019d7a88 EFLAGS: 00010082
> [  694.577238] RAX: 0021 RBX: ea00047d0e00 RCX:
> 0006
> [  694.578537] RDX:  RSI: 0096 RDI:
> 88023fd0db90
> [  694.579774] RBP: c900019d7ad8 R08: 000882b6 R09:
> 028a
> [  694.580754] R10: c900019d7da8 R11: 8211184d R12:
> ea00047d0e00
> [  694.582040] R13:  R14: 0202 R15:
> 8801384439e8
> [  694.583236] FS:  () GS:88023fd0()
> knlGS:
> [  694.584607] CS:  0010 DS:  ES:  CR0: 80050033
> [  694.585409] CR2: 7ff77a8da618 CR3: 01e09000 CR4:
> 001406e0
> [  694.586547] Call Trace:
> [  694.586996]  delete_from_page_cache+0x54/0x110
> [  694.587481]  truncate_inode_page+0xab/0x120
> [  694.588110]  shmem_undo_range+0x498/0xa50
> [  694.588813]  ? save_stack_trace+0x1b/0x20
> [  694.589529]  ? set_track+0x70/0x140
> [  694.590150]  ? init_object+0x69/0xa0
> [  694.590722]  ? __inode_wait_for_writeback+0x73/0xe0
> [  694.591525]  shmem_truncate_range+0x16/0x40
> [  694.592268]  shmem_evict_inode+0xb1/0x190
> [  694.592735]  evict+0xbb/0x1c0
> [  694.593147]  iput+0x1c0/0x210
> [  694.593497]  dentry_unlink_inode+0xb4/0x150
> [  694.593982]  __dentry_kill+0xc1/0x150
> [  694.594400]  dput+0x1c8/0x1e0
> [  694.594745]  __fput+0x172/0x1e0
> [  694.595103]  fput+0xe/0x10
> [  694.595463]  task_work_run+0x80/0xa0
> [  694.595886]  do_exit+0x2d6/0xb50
> [  694.596323]  ? __do_page_fault+0x288/0x4a0
> [  694.596818]  do_group_exit+0x47/0xb0
> [  694.597249]  SyS_exit_group+0x14/0x20
> [  694.597682]  entry_SYSCALL_64_fastpath+0x1a/0xa5
> [  694.598198] RIP: 0033:0x7ff77a5e78c8
> [  694.598612] RSP: 002b:7ffc5aece318 EFLAGS: 0246 ORIG_RAX:
> 00e7
> [  694.599804] RAX: ffda RBX:  RCX:
> 7ff77a5e78c8
> [  694.600609] RDX:  RSI: 003c RDI:
> 
> [  694.601424] RBP: 7ff77a8da618 R08: 00e7 R09:
> ff98
> [  694.602224] R10: 0003 R11: 0246 R12:
> 0001
> [  694.603151] R13: 7ff77a8dbc60 R14:  R15:
> 
> [  694.603984] Code: 60 f3 c5 81 e8 2e 7e 03 00 0f 0b 48 c7 c6 60 f3
> c5 81 4c 89 e7 e8 1d 7e 03 00 0f 0b 48 c7 c6 00 f4 c5 81 4c 89 e7 e8
> 0c 7e 03 00 <0f> 0b 48 c7 c6 38 f3 c5 81 4c 89 e7 e8 fb 7d 03 00 0f
> 0b 48 c7 
> [  694.606500] RIP: 

Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK

2017-08-04 Thread Mike Kravetz
On 08/04/2017 12:07 PM, r...@redhat.com wrote:
> From: Rik van Riel 
> 
> Introduce MADV_WIPEONFORK semantics, which result in a VMA being
> empty in the child process after fork. This differs from MADV_DONTFORK
> in one important way.
> 
> If a child process accesses memory that was MADV_WIPEONFORK, it
> will get zeroes. The address ranges are still valid, they are just empty.
> 
> If a child process accesses memory that was MADV_DONTFORK, it will
> get a segmentation fault, since those address ranges are no longer
> valid in the child after fork.
> 
> Since MADV_DONTFORK also seems to be used to allow very large
> programs to fork in systems with strict memory overcommit restrictions,
> changing the semantics of MADV_DONTFORK might break existing programs.
> 
> The use case is libraries that store or cache information, and
> want to know that they need to regenerate it in the child process
> after fork.
> 
> Examples of this would be:
> - systemd/pulseaudio API checks (fail after fork)
>   (replacing a getpid check, which is too slow without a PID cache)
> - PKCS#11 API reinitialization check (mandated by specification)
> - glibc's upcoming PRNG (reseed after fork)
> - OpenSSL PRNG (reseed after fork)
> 
> The security benefits of a forking server having a re-inialized
> PRNG in every child process are pretty obvious. However, due to
> libraries having all kinds of internal state, and programs getting
> compiled with many different versions of each library, it is
> unreasonable to expect calling programs to re-initialize everything
> manually after fork.
> 
> A further complication is the proliferation of clone flags,
> programs bypassing glibc's functions to call clone directly,
> and programs calling unshare, causing the glibc pthread_atfork
> hook to not get called.
> 
> It would be better to have the kernel take care of this automatically.
> 
> This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO:
> 
> https://man.openbsd.org/minherit.2
> 
> Reported-by: Florian Weimer 
> Reported-by: Colm MacCártaigh 
> Signed-off-by: Rik van Riel 
> ---
>  arch/alpha/include/uapi/asm/mman.h |  3 +++
>  arch/mips/include/uapi/asm/mman.h  |  3 +++
>  arch/parisc/include/uapi/asm/mman.h|  3 +++
>  arch/xtensa/include/uapi/asm/mman.h|  3 +++
>  fs/proc/task_mmu.c |  1 +
>  include/linux/mm.h |  2 +-
>  include/uapi/asm-generic/mman-common.h |  3 +++
>  kernel/fork.c  |  8 ++--
>  mm/madvise.c   |  8 
>  mm/memory.c| 10 ++
>  10 files changed, 41 insertions(+), 3 deletions(-)
> 

This didn't seem 'quite right' to me for shared mappings and/or file
backed mappings.  I wasn't exactly sure what it 'should' do in such
cases.  So, I tried it with a mapping created as follows:

addr = mmap(ADDR, page_size,
PROT_READ | PROT_WRITE,
MAP_ANONYMOUS|MAP_SHARED, -1, 0);

When setting MADV_WIPEONFORK on the vma/mapping, I got the following
at task exit time:

[  694.558290] [ cut here ]
[  694.558978] kernel BUG at mm/filemap.c:212!
[  694.559476] invalid opcode:  [#1] SMP
[  694.560023] Modules linked in: ip6t_REJECT nf_reject_ipv6 ip6t_rpfilter 
xt_conntrack ebtable_broute bridge stp llc ebtable_nat ip6table_nat 
nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_raw ip6table_mangle 
ip6table_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 
nf_nat nf_conntrack iptable_raw iptable_mangle 9p iptable_security 
ebtable_filter ebtables ip6table_filter ip6_tables snd_hda_codec_generic 
snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_seq ppdev snd_seq_device 
joydev crct10dif_pclmul crc32_pclmul crc32c_intel snd_pcm ghash_clmulni_intel 
9pnet_virtio virtio_balloon snd_timer 9pnet parport_pc snd parport i2c_piix4 
soundcore nfsd auth_rpcgss nfs_acl lockd grace sunrpc virtio_net virtio_blk 
virtio_console 8139too qxl drm_kms_helper ttm drm serio_raw 8139cp
[  694.571554]  mii virtio_pci ata_generic virtio_ring virtio pata_acpi
[  694.572608] CPU: 3 PID: 1200 Comm: test_wipe2 Not tainted 4.13.0-rc3+ #8
[  694.573778] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.9.1-1.fc24 04/01/2014
[  694.574917] task: 880137178040 task.stack: c900019d4000
[  694.575650] RIP: 0010:__delete_from_page_cache+0x344/0x410
[  694.576409] RSP: 0018:c900019d7a88 EFLAGS: 00010082
[  694.577238] RAX: 0021 RBX: ea00047d0e00 RCX: 0006
[  694.578537] RDX:  RSI: 0096 RDI: 88023fd0db90
[  694.579774] RBP: c900019d7ad8 R08: 000882b6 R09: 028a
[  694.580754] R10: c900019d7da8 R11: 8211184d R12: ea00047d0e00
[  694.582040] R13:  R14: 0202 R15: 8801384439e8
[  694.583236] 

Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK

2017-08-04 Thread Mike Kravetz
On 08/04/2017 12:07 PM, r...@redhat.com wrote:
> From: Rik van Riel 
> 
> Introduce MADV_WIPEONFORK semantics, which result in a VMA being
> empty in the child process after fork. This differs from MADV_DONTFORK
> in one important way.
> 
> If a child process accesses memory that was MADV_WIPEONFORK, it
> will get zeroes. The address ranges are still valid, they are just empty.
> 
> If a child process accesses memory that was MADV_DONTFORK, it will
> get a segmentation fault, since those address ranges are no longer
> valid in the child after fork.
> 
> Since MADV_DONTFORK also seems to be used to allow very large
> programs to fork in systems with strict memory overcommit restrictions,
> changing the semantics of MADV_DONTFORK might break existing programs.
> 
> The use case is libraries that store or cache information, and
> want to know that they need to regenerate it in the child process
> after fork.
> 
> Examples of this would be:
> - systemd/pulseaudio API checks (fail after fork)
>   (replacing a getpid check, which is too slow without a PID cache)
> - PKCS#11 API reinitialization check (mandated by specification)
> - glibc's upcoming PRNG (reseed after fork)
> - OpenSSL PRNG (reseed after fork)
> 
> The security benefits of a forking server having a re-inialized
> PRNG in every child process are pretty obvious. However, due to
> libraries having all kinds of internal state, and programs getting
> compiled with many different versions of each library, it is
> unreasonable to expect calling programs to re-initialize everything
> manually after fork.
> 
> A further complication is the proliferation of clone flags,
> programs bypassing glibc's functions to call clone directly,
> and programs calling unshare, causing the glibc pthread_atfork
> hook to not get called.
> 
> It would be better to have the kernel take care of this automatically.
> 
> This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO:
> 
> https://man.openbsd.org/minherit.2
> 
> Reported-by: Florian Weimer 
> Reported-by: Colm MacCártaigh 
> Signed-off-by: Rik van Riel 
> ---
>  arch/alpha/include/uapi/asm/mman.h |  3 +++
>  arch/mips/include/uapi/asm/mman.h  |  3 +++
>  arch/parisc/include/uapi/asm/mman.h|  3 +++
>  arch/xtensa/include/uapi/asm/mman.h|  3 +++
>  fs/proc/task_mmu.c |  1 +
>  include/linux/mm.h |  2 +-
>  include/uapi/asm-generic/mman-common.h |  3 +++
>  kernel/fork.c  |  8 ++--
>  mm/madvise.c   |  8 
>  mm/memory.c| 10 ++
>  10 files changed, 41 insertions(+), 3 deletions(-)
> 

This didn't seem 'quite right' to me for shared mappings and/or file
backed mappings.  I wasn't exactly sure what it 'should' do in such
cases.  So, I tried it with a mapping created as follows:

addr = mmap(ADDR, page_size,
PROT_READ | PROT_WRITE,
MAP_ANONYMOUS|MAP_SHARED, -1, 0);

When setting MADV_WIPEONFORK on the vma/mapping, I got the following
at task exit time:

[  694.558290] [ cut here ]
[  694.558978] kernel BUG at mm/filemap.c:212!
[  694.559476] invalid opcode:  [#1] SMP
[  694.560023] Modules linked in: ip6t_REJECT nf_reject_ipv6 ip6t_rpfilter 
xt_conntrack ebtable_broute bridge stp llc ebtable_nat ip6table_nat 
nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_raw ip6table_mangle 
ip6table_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 
nf_nat nf_conntrack iptable_raw iptable_mangle 9p iptable_security 
ebtable_filter ebtables ip6table_filter ip6_tables snd_hda_codec_generic 
snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_seq ppdev snd_seq_device 
joydev crct10dif_pclmul crc32_pclmul crc32c_intel snd_pcm ghash_clmulni_intel 
9pnet_virtio virtio_balloon snd_timer 9pnet parport_pc snd parport i2c_piix4 
soundcore nfsd auth_rpcgss nfs_acl lockd grace sunrpc virtio_net virtio_blk 
virtio_console 8139too qxl drm_kms_helper ttm drm serio_raw 8139cp
[  694.571554]  mii virtio_pci ata_generic virtio_ring virtio pata_acpi
[  694.572608] CPU: 3 PID: 1200 Comm: test_wipe2 Not tainted 4.13.0-rc3+ #8
[  694.573778] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.9.1-1.fc24 04/01/2014
[  694.574917] task: 880137178040 task.stack: c900019d4000
[  694.575650] RIP: 0010:__delete_from_page_cache+0x344/0x410
[  694.576409] RSP: 0018:c900019d7a88 EFLAGS: 00010082
[  694.577238] RAX: 0021 RBX: ea00047d0e00 RCX: 0006
[  694.578537] RDX:  RSI: 0096 RDI: 88023fd0db90
[  694.579774] RBP: c900019d7ad8 R08: 000882b6 R09: 028a
[  694.580754] R10: c900019d7da8 R11: 8211184d R12: ea00047d0e00
[  694.582040] R13:  R14: 0202 R15: 8801384439e8
[  694.583236] FS:  () GS:88023fd0()