Re: [PATCH] kconfig: remove EXPERT from CHECKPOINT_RESTORE

2018-07-13 Thread Pavel Emelyanov
On 07/12/2018 07:33 PM, Eric W. Biederman wrote:
> 
> Adrian Reber  writes:
> 
>> The CHECKPOINT_RESTORE configuration option was introduced in 2012 and
>> combined with EXPERT. CHECKPOINT_RESTORE is already enabled in many
>> distribution kernels and also part of the defconfigs of various
>> architectures.
>>
>> To make it easier for distributions to enable CHECKPOINT_RESTORE this
>> removes EXPERT and moves the configuration option out of the EXPERT
>> block.
> 
> I think we should change the help text at the same time, to match
> our improved understanding of the situation.
> 
> Does anyone remember why this option was added at all?

Sure! Quoting Andrew's akpm branch merge e-mail from ~7 years ago:

   However I'm less confident than the developers that it will all
   eventually work! So what I'm asking them to do is to wrap each piece
   of new code inside CONFIG_CHECKPOINT_RESTORE.  So if it all
   eventually comes to tears and the project as a whole fails, it should
   be a simple matter to go through and delete all trace of it.

The best link with the full e-mail I could find is
https://gitlab.imag.fr/kaunetem/linux-kaunetem/commit/099469502f62fbe0d7e4f0b83a2f22538367f734

-- Pavel

> Why was this option placed under EXPERT?
> 
> What is the value of disabling this functionality ever?
> 
> Is there any reason why we don't just delete CONFIG_CHECKPOINT_RESTORE
> entirely?
> 
> Certainly we are at a point where distros are enabling this, so hiding
> it behind CONFIG_EXPERT with a default of N seems inappropriate.
> 
> The only thing I can imagine might be sensible is changing the default
> to Y and leaving it behind CONFIG_EXPERT.
> 
> I want to know what is the point of maintaining all of the complexity of
> the ifdefs.  If no one can come up with a reason I say let's just remove
> CONFIG_CHECKPOINT_RESTORE entirely.
> 
> Eric
> 
> 
>> Signed-off-by: Adrian Reber 
>> Cc: Oleg Nesterov 
>> Cc: Pavel Emelyanov 
>> Cc: Andrew Morton 
>> Cc: Eric W. Biederman 
>> Cc: Andrei Vagin 
>> Cc: Hendrik Brueckner 
>> ---
>>  init/Kconfig | 24 ++++++++++++------------
>>  1 file changed, 12 insertions(+), 12 deletions(-)
>>
>> diff --git a/init/Kconfig b/init/Kconfig
>> index 041f3a022122..9c529c763326 100644
>> --- a/init/Kconfig
>> +++ b/init/Kconfig
>> @@ -932,6 +932,18 @@ config NET_NS
>>  
>>  endif # NAMESPACES
>>  
>> +config CHECKPOINT_RESTORE
>> +bool "Checkpoint/restore support"
>> +select PROC_CHILDREN
>> +default n
>> +help
>> +  Enables additional kernel features in a sake of checkpoint/restore.
>> +  In particular it adds auxiliary prctl codes to setup process text,
>> +  data and heap segment sizes, and a few additional /proc filesystem
>> +  entries.
>> +
>> +  If unsure, say N here.
>> +
>>  config SCHED_AUTOGROUP
>>  bool "Automatic process group scheduling"
>>  select CGROUPS
>> @@ -1348,18 +1360,6 @@ config MEMBARRIER
>>  
>>If unsure, say Y.
>>  
>> -config CHECKPOINT_RESTORE
>> -bool "Checkpoint/restore support" if EXPERT
>> -select PROC_CHILDREN
>> -default n
>> -help
>> -  Enables additional kernel features in a sake of checkpoint/restore.
>> -  In particular it adds auxiliary prctl codes to setup process text,
>> -  data and heap segment sizes, and a few additional /proc filesystem
>> -  entries.
>> -
>> -  If unsure, say N here.
>> -
>>  config KALLSYMS
>>   bool "Load all symbols for debugging/ksymoops" if EXPERT
>>   default y
> .
> 



Re: [PATCH] kconfig: remove EXPERT from CHECKPOINT_RESTORE

2018-07-13 Thread Pavel Emelyanov
On 07/12/2018 04:07 PM, Adrian Reber wrote:
> The CHECKPOINT_RESTORE configuration option was introduced in 2012 and
> combined with EXPERT. CHECKPOINT_RESTORE is already enabled in many
> distribution kernels and also part of the defconfigs of various
> architectures.
> 
> To make it easier for distributions to enable CHECKPOINT_RESTORE this
> removes EXPERT and moves the configuration option out of the EXPERT
> block.
> 
> Signed-off-by: Adrian Reber 
> Cc: Oleg Nesterov 
> Cc: Pavel Emelyanov 
> Cc: Andrew Morton 
> Cc: Eric W. Biederman 
> Cc: Andrei Vagin 
> Cc: Hendrik Brueckner 

Acked-by: Pavel Emelyanov 

> ---
>  init/Kconfig | 24 ++++++++++++------------
>  1 file changed, 12 insertions(+), 12 deletions(-)
> 
> diff --git a/init/Kconfig b/init/Kconfig
> index 041f3a022122..9c529c763326 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -932,6 +932,18 @@ config NET_NS
>  
>  endif # NAMESPACES
>  
> +config CHECKPOINT_RESTORE
> + bool "Checkpoint/restore support"
> + select PROC_CHILDREN
> + default n
> + help
> +   Enables additional kernel features in a sake of checkpoint/restore.
> +   In particular it adds auxiliary prctl codes to setup process text,
> +   data and heap segment sizes, and a few additional /proc filesystem
> +   entries.
> +
> +   If unsure, say N here.
> +
>  config SCHED_AUTOGROUP
>   bool "Automatic process group scheduling"
>   select CGROUPS
> @@ -1348,18 +1360,6 @@ config MEMBARRIER
>  
> If unsure, say Y.
>  
> -config CHECKPOINT_RESTORE
> - bool "Checkpoint/restore support" if EXPERT
> - select PROC_CHILDREN
> - default n
> - help
> -   Enables additional kernel features in a sake of checkpoint/restore.
> -   In particular it adds auxiliary prctl codes to setup process text,
> -   data and heap segment sizes, and a few additional /proc filesystem
> -   entries.
> -
> -   If unsure, say N here.
> -
>  config KALLSYMS
>bool "Load all symbols for debugging/ksymoops" if EXPERT
>default y
> 



Re: [PATCH] userfaultfd: prevent non-cooperative events vs mcopy_atomic races

2018-05-25 Thread Pavel Emelyanov

>> But doesn't it race even with regular PF handling, not only the fork? How
>> do we handle this race?
> 
> With the regular #PF handling, the faulting thread patiently waits until
> the page fault is resolved. With fork(), mremap() etc. the thread that caused
> the event resumes once the uffd message is read by the monitor. That's
> surely way before the monitor had a chance to somehow process that message.

Ouch, yes. This is nasty :( So having no better solution in mind, let's
move forward with this.

Acked-by: Pavel Emelyanov <xe...@virtuozzo.com>


Re: [PATCH] userfaultfd: prevent non-cooperative events vs mcopy_atomic races

2018-05-24 Thread Pavel Emelyanov
On 05/24/2018 02:56 PM, Mike Rapoport wrote:
> On Thu, May 24, 2018 at 02:24:37PM +0300, Pavel Emelyanov wrote:
>> On 05/23/2018 10:42 AM, Mike Rapoport wrote:
>>> If a process monitored with userfaultfd changes its memory mappings or
>>> forks() at the same time as uffd monitor fills the process memory with
>>> UFFDIO_COPY, the actual creation of page table entries and copying of the
>>> data in mcopy_atomic may happen either before or after the memory mapping
>>> modifications and there is no way for the uffd monitor to maintain a
>>> consistent view of the process memory layout.
>>>
>>> For instance, let's consider fork() running in parallel with
>>> userfaultfd_copy():
>>>
>>> process              |  uffd monitor
>>> ---------------------+------------------------------
>>> fork()               | userfaultfd_copy()
>>> ...                  | ...
>>> dup_mmap()           | down_read(mmap_sem)
>>> down_write(mmap_sem) | /* create PTEs, copy data */
>>> dup_uffd()           | up_read(mmap_sem)
>>> copy_page_range()    |
>>> up_write(mmap_sem)   |
>>> dup_uffd_complete()  |
>>> /* notify monitor */ |
>>>
>>> If the userfaultfd_copy() takes the mmap_sem first, the new page(s) will be
>>> present by the time copy_page_range() is called and they will appear in the
>>> child's memory mappings. However, if the fork() is the first to take the
>>> mmap_sem, the new pages won't be mapped in the child's address space.
>>
>> But in this case the child should get an entry that emits a message to uffd
>> when stepped upon! And uffd will just userfaultfd_copy() it again. No?
>  
> There will be a message, indeed. But there is no way for monitor to tell
> whether the pages it copied are present or not in the child.

If there's a message, then they are not present, that's for sure :)

> Since the monitor cannot assume that the process will access all its memory
> it has to copy some pages "in the background". A simple monitor may look
> like:
> 
>   for (;;) {
>   wait_for_uffd_events(timeout);
>   handle_uffd_events();
>   uffd_copy(some not faulted pages);
>   }
> 
> Then, if the "background" uffd_copy() races with fork, the pages we've
> copied may be already present in parent's mappings before the call to
> copy_page_range() and may be not.
> 
> If the pages were not present, uffd_copy'ing them again to the child's
> memory would be ok.

Yes.

> But if uffd_copy() was first to catch mmap_sem, and we would uffd_copy them
> again, child process will get memory corruption.

You mean the background uffd_copy()? But doesn't it race even with regular
#PF handling, not only the fork? How do we handle this race?

-- Pavel


Re: [PATCH] userfaultfd: prevent non-cooperative events vs mcopy_atomic races

2018-05-24 Thread Pavel Emelyanov
On 05/23/2018 10:42 AM, Mike Rapoport wrote:
> If a process monitored with userfaultfd changes its memory mappings or
> forks() at the same time as uffd monitor fills the process memory with
> UFFDIO_COPY, the actual creation of page table entries and copying of the
> data in mcopy_atomic may happen either before or after the memory mapping
> modifications and there is no way for the uffd monitor to maintain a
> consistent view of the process memory layout.
> 
> For instance, let's consider fork() running in parallel with
> userfaultfd_copy():
> 
> process              |  uffd monitor
> ---------------------+------------------------------
> fork()               | userfaultfd_copy()
> ...                  | ...
> dup_mmap()           | down_read(mmap_sem)
> down_write(mmap_sem) | /* create PTEs, copy data */
> dup_uffd()           | up_read(mmap_sem)
> copy_page_range()    |
> up_write(mmap_sem)   |
> dup_uffd_complete()  |
> /* notify monitor */ |
> 
> If the userfaultfd_copy() takes the mmap_sem first, the new page(s) will be
> present by the time copy_page_range() is called and they will appear in the
> child's memory mappings. However, if the fork() is the first to take the
> mmap_sem, the new pages won't be mapped in the child's address space.

But in this case the child should get an entry that emits a message to uffd
when stepped upon! And uffd will just userfaultfd_copy() it again. No?

-- Pavel

> Since userfaultfd monitor has no way to determine what was the order, let's
> disallow userfaultfd_copy in parallel with the non-cooperative events. In
> such case we return -EAGAIN and the uffd monitor can understand that
> userfaultfd_copy() clashed with a non-cooperative event and take an
> appropriate action.
> 
> Signed-off-by: Mike Rapoport <r...@linux.vnet.ibm.com>
> Cc: Andrea Arcangeli <aarca...@redhat.com>
> Cc: Mike Kravetz <mike.krav...@oracle.com>
> Cc: Pavel Emelyanov <xe...@virtuozzo.com>
> Cc: Andrei Vagin <ava...@virtuozzo.com>
> ---
>  fs/userfaultfd.c  | 22 --
>  include/linux/userfaultfd_k.h |  6 --
>  mm/userfaultfd.c  | 22 +-
>  3 files changed, 41 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index cec550c8468f..123bf7d516fc 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -62,6 +62,8 @@ struct userfaultfd_ctx {
>   enum userfaultfd_state state;
>   /* released */
>   bool released;
> + /* memory mappings are changing because of non-cooperative event */
> + bool mmap_changing;
>   /* mm with one ore more vmas attached to this userfaultfd_ctx */
>   struct mm_struct *mm;
>  };
> @@ -641,6 +643,7 @@ static void userfaultfd_event_wait_completion(struct userfaultfd_ctx *ctx,
>* already released.
>*/
>  out:
> + WRITE_ONCE(ctx->mmap_changing, false);
>   userfaultfd_ctx_put(ctx);
>  }
>  
> @@ -686,10 +689,12 @@ int dup_userfaultfd(struct vm_area_struct *vma, struct list_head *fcs)
>   ctx->state = UFFD_STATE_RUNNING;
>   ctx->features = octx->features;
>   ctx->released = false;
> + ctx->mmap_changing = false;
>   ctx->mm = vma->vm_mm;
>   mmgrab(ctx->mm);
>  
>   userfaultfd_ctx_get(octx);
> + WRITE_ONCE(octx->mmap_changing, true);
>   fctx->orig = octx;
>   fctx->new = ctx;
>   list_add_tail(&fctx->list, fcs);
> @@ -732,6 +737,7 @@ void mremap_userfaultfd_prep(struct vm_area_struct *vma,
>   if (ctx && (ctx->features & UFFD_FEATURE_EVENT_REMAP)) {
>   vm_ctx->ctx = ctx;
>   userfaultfd_ctx_get(ctx);
> + WRITE_ONCE(ctx->mmap_changing, true);
>   }
>  }
>  
> @@ -772,6 +778,7 @@ bool userfaultfd_remove(struct vm_area_struct *vma,
>   return true;
>  
>   userfaultfd_ctx_get(ctx);
> + WRITE_ONCE(ctx->mmap_changing, true);
>   up_read(&mm->mmap_sem);
>  
>   msg_init();
> @@ -815,6 +822,7 @@ int userfaultfd_unmap_prep(struct vm_area_struct *vma,
>   return -ENOMEM;
>  
>   userfaultfd_ctx_get(ctx);
> + WRITE_ONCE(ctx->mmap_changing, true);
>   unmap_ctx->ctx = ctx;
>   unmap_ctx->start = start;
>   unmap_ctx->end = end;
> @@ -1653,6 +1661,10 @@ static int userfaultfd

Re: [PATCH 3/3] userfaultfd: non-cooperative: allow synchronous EVENT_REMOVE

2018-02-28 Thread Pavel Emelyanov

> @@ -52,6 +53,7 @@
>  #define _UFFDIO_WAKE (0x02)
>  #define _UFFDIO_COPY (0x03)
>  #define _UFFDIO_ZEROPAGE (0x04)
> +#define _UFFDIO_WAKE_SYNC_EVENT  (0x05)

Excuse my ignorance, but what's the difference between UFFDIO_WAKE and 
UFFDIO_WAKE_SYNC_EVENT?

-- Pavel


Re: [PATCH v5 0/4] vm: add a syscall to map a process memory into a pipe

2018-02-27 Thread Pavel Emelyanov
On 02/27/2018 05:18 AM, Dmitry V. Levin wrote:
> On Mon, Feb 26, 2018 at 12:02:25PM +0300, Pavel Emelyanov wrote:
>> On 02/21/2018 03:44 AM, Andrew Morton wrote:
>>> On Tue,  9 Jan 2018 08:30:49 +0200 Mike Rapoport <r...@linux.vnet.ibm.com> 
>>> wrote:
>>>
>>>> This patch introduces a new process_vmsplice system call that combines
>>>> the functionality of process_vm_read and vmsplice.
>>>
>>> All seems fairly straightforward.  The big question is: do we know that
>>> people will actually use this, and get sufficient value from it to
>>> justify its addition?
>>
>> Yes, that's what bothers us a lot too :) I've tried to start with finding
>> out if anyone used the sys_read/write_process_vm() calls, but failed :(
>> Does anybody know how popular these syscalls are?
> 
> Well, process_vm_readv itself is quite popular, it's used by debuggers
> nowadays, see e.g.
> $ strace -qq -esignal=none -eprocess_vm_readv strace -qq -o/dev/null cat /dev/null

I see. Well, yes, this use-case will not benefit much from remote splice. How
about more interactive debugging by, say, gdb? It may attach, then splice all
the memory, then analyze the victim code/data w/o copying it into its own
address space?

-- Pavel


Re: [PATCH v5 0/4] vm: add a syscall to map a process memory into a pipe

2018-02-26 Thread Pavel Emelyanov
On 02/21/2018 03:44 AM, Andrew Morton wrote:
> On Tue,  9 Jan 2018 08:30:49 +0200 Mike Rapoport  
> wrote:
> 
>> This patch introduces a new process_vmsplice system call that combines
>> the functionality of process_vm_read and vmsplice.
> 
> All seems fairly straightforward.  The big question is: do we know that
> people will actually use this, and get sufficient value from it to
> justify its addition?

Yes, that's what bothers us a lot too :) I've tried to start with finding out
if anyone used the sys_read/write_process_vm() calls, but failed :( Does
anybody know how popular these syscalls are? If their users operate on big
amounts of memory, they could benefit from the proposed splice extension.

-- Pavel


Re: [PATCH 1/4] fs/notify: fdinfo can report unsupported file handles.

2017-12-11 Thread Pavel Emelyanov
On 12/11/2017 05:08 PM, Amir Goldstein wrote:
> On Mon, Dec 11, 2017 at 3:46 PM, Pavel Emelyanov <xe...@virtuozzo.com> wrote:
>> On 12/11/2017 10:05 AM, Amir Goldstein wrote:
>>> On Mon, Dec 11, 2017 at 8:41 AM, Amir Goldstein <amir7...@gmail.com> wrote:
>>>> On Mon, Dec 11, 2017 at 8:04 AM, NeilBrown <ne...@suse.com> wrote:
>>>>> If a filesystem does not set sb->s_export_op, then it
>>>>> does not support filehandles and export_fs_encode_fh()
>>>>> and exportfs_encode_inode_fh() should not be called.
>>>>> They will use export_encode_fh(), which is a default
>>>>> that uses the inode number and generation number, but in general
>>>>> these may not be stable.
>>>>>
>>>>> So change exportfs_encode_inode_fh() to return FILEID_INVALID
>>>>> if called on an unsupported filesystem.  Currently only
>>>>> notify/fdinfo can do that.
>>>>>
>>>>
>>>> I wish you would leave this check to the caller, maybe add a helper
>>>> exportfs_can_decode_fh() for callers to use.
>>>>
>>>> Although there are no current uses for it in-tree, there is value in
>>>> being able to encode a unique file handle even when it cannot be
>>>> decoded back to an open file.
>>>>
>>>> I am using this property in my fanotify super block watch patches,
>>>> where the object identifier on the event is an encoded file handle
>>>> of the object, which delegates tracking filesystem objects to
>>>> userspace and prevents fanotify from keeping elevated refcounts
>>>> on inodes and dentries.
>>>>
>>>> There are quite a few userspace tools out there that are checking
>>>> that st_ino hasn't changed on a file between non atomic operations.
>>>> Those tools (or others) could benefit from a unique file handle if
>>>> we ever decide to provide a relaxed version of name_to_handle_at().
>>>>
>>>
>>> And this change needs a clause about not breaking userspace.
>>>
>>> Pavel,
>>>
>>> Will this break any version of criu in the wild?
>>
>> If there's no filehandle in the output, it will make the dump fail, but
>> we're already prepared for the case where no handle is at hand. In the
>> worst case criu will exit with an error.
>>
>> I also agree that it should only happen when current is OOM killed, and in
>> the case of CRIU this means killing the criu process itself.
>>
> 
> But this patch [1/4] changes behavior so you cannot dump fsnotify
> state if watched file system does not support *decoding* file handles.

That's OK :) After we get a filehandle we check that it's accessible, so for
filesystems that can't decode one, we'd fail the dump anyway.

> This means that criu anyway won't be able to restore the fsnotify state.
> Is it OK that criu dump state will fail in that case?
> 
> Amir.
> .
> 



Re: [PATCH 1/4] fs/notify: fdinfo can report unsupported file handles.

2017-12-11 Thread Pavel Emelyanov
On 12/11/2017 10:05 AM, Amir Goldstein wrote:
> On Mon, Dec 11, 2017 at 8:41 AM, Amir Goldstein  wrote:
>> On Mon, Dec 11, 2017 at 8:04 AM, NeilBrown  wrote:
>>> If a filesystem does not set sb->s_export_op, then it
>>> does not support filehandles and export_fs_encode_fh()
>>> and exportfs_encode_inode_fh() should not be called.
>>> They would fall back to export_encode_fh(), which is a default
>>> that uses the inode number and generation number, but in general
>>> these may not be stable.
>>>
>>> So change exportfs_encode_inode_fh() to return FILEID_INVALID
>>> if called on an unsupported filesystem.  Currently only
>>> notify/fdinfo can do that.
>>>
>>
>> I wish you would leave this check to the caller, maybe add a helper
>> exportfs_can_decode_fh() for callers to use.
>>
>> Although there are no current uses for it in-tree, there is value in
>> being able to encode a unique file handle even when it cannot be
>> decoded back to an open file.
>>
>> I am using this property in my fanotify super block watch patches,
>> where the object identifier on the event is an encoded file handle
>> of the object, which delegates tracking filesystem objects to
>> userspace and prevents fanotify from keeping elevated refcounts
>> on inodes and dentries.
>>
>> There are quite a few userspace tools out there that are checking
>> that st_ino hasn't changed on a file between non atomic operations.
>> Those tools (or others) could benefit from a unique file handle if
>> we ever decide to provide a relaxed version of name_to_handle_at().
>>
> 
> And this change needs a clause about not breaking userspace.
> 
> Pavel,
> 
> Will this break any version of criu in the wild?

If there's no filehandle in the output, it will make the dump fail, but we're
already prepared for the case where no handle is at hand. In the worst case
criu will exit with an error.

I also agree that it should only happen when current is OOM killed, and in the
case of CRIU this means killing the criu process itself.

-- Pavel




Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE

2017-05-24 Thread Pavel Emelyanov
On 05/24/2017 02:31 PM, Vlastimil Babka wrote:
> On 05/24/2017 12:39 PM, Mike Rapoport wrote:
>>> Hm so the prctl does:
>>>
>>> if (arg2)
>>> me->mm->def_flags |= VM_NOHUGEPAGE;
>>> else
>>> me->mm->def_flags &= ~VM_NOHUGEPAGE;
>>>
>>> That's rather lazy implementation IMHO. Could we change it so the flag
>>> is stored elsewhere in the mm, and the code that decides to (not) use
>>> THP will check both the per-vma flag and the per-mm flag?
>>
>> I'm afraid I don't understand how that can help.
>> What we need is an ability to temporarily disable collapse of the pages in
>> VMAs that do not have VM_*HUGEPAGE flags set and that after we re-enable
>> THP, the vma->vm_flags for those VMAs will remain intact.
> 
> That's what I'm saying - instead of implementing the prctl flag via
> mm->def_flags (which gets permanently propagated to newly created vma's
> but e.g. doesn't affect already existing ones), it would be setting a
> flag somewhere in mm, which khugepaged (and page faults) would check in
> addition to the per-vma flags.

I do not insist, but this would make the existing paths (checking for flags)
two times slower -- from now on they would need to check two bits (vma flags
and mm flags) which are 100% in different cache lines.

What Mike is proposing is a way to fine-tune the existing vma flags. This
would keep the current paths as fast (or slow ;) ) as they are now. All the
complexity would go to the rare cases when someone needs to turn THP off for
a while and then turn it back on.

-- Pavel




Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE

2017-05-24 Thread Pavel Emelyanov
On 05/24/2017 02:18 PM, Michal Hocko wrote:
> On Wed 24-05-17 13:39:48, Mike Rapoport wrote:
>> On Wed, May 24, 2017 at 09:58:06AM +0200, Vlastimil Babka wrote:
>>> On 05/24/2017 09:50 AM, Mike Rapoport wrote:
 On Mon, May 22, 2017 at 05:52:47PM +0200, Vlastimil Babka wrote:
> On 05/22/2017 04:29 PM, Mike Rapoport wrote:
>>
>> Probably I didn't explain it too well.
>>
>> The range is intentionally not populated. When we combine pre- and
>> post-copy for process migration, we create a memory pre-dump without
>> stopping the process, then we freeze the process without dumping the pages
>> it has dirtied between pre-dump and freeze, and then, during restore, we
>> populate the dirtied pages using userfaultfd.
>>
>> When CRIU restores a process in such scenario, it does something like:
>>
>> * mmap() memory region
>> * fill in the pages that were collected during the pre-dump
>> * do some other stuff
>> * register memory region with userfaultfd
>> * populate the missing memory on demand
>>
>> khugepaged collapses the pages in the partially populated regions before
>> we have a chance to register these regions with userfaultfd, which would
>> prevent the collapse.
>>
>> We could have used MADV_NOHUGEPAGE right after the mmap() call, and then
>> there would be no race because there would be nothing for khugepaged to
>> collapse at that point. But the problem is that we have no way to reset
>> *HUGEPAGE flags after the memory restore is complete.
>
> Hmm, I wouldn't be that sure if this is indeed race-free. Check that
> this scenario is indeed impossible?
>
> - you do the mmap
> - khugepaged will choose the process' mm to scan
> - khugepaged will get to the vma in question, it doesn't have
> MADV_NOHUGEPAGE yet
> - you set MADV_NOHUGEPAGE on the vma
> - you start populating the vma
> - khugepaged sees the vma is non-empty, collapses
>
> unless I'm wrong, the racers will have mmap_sem for reading only when
> setting/checking the MADV_NOHUGEPAGE? Might be actually considered a bug.
>
> However, can't you use prctl(PR_SET_THP_DISABLE) instead? "If arg2 has a
> nonzero value, the flag is set, otherwise it is cleared." says the
> manpage. Do it before the mmap and you avoid the race as well?

 Unfortunately, prctl(PR_SET_THP_DISABLE) didn't help :(
 When I've tried to use it, I've ended up with VM_NOHUGEPAGE set on all VMAs
 created after prctl(). This returns me to the state when checkpoint-restore
 alters the application vma->vm_flags although it shouldn't and I do not see
 a way to fix it using existing interfaces.
>>>
>>> [CC linux-api, should have been done in the initial posting already]
>>
>> Sorry, missed that.
>>  
>>> Hm so the prctl does:
>>>
>>> if (arg2)
>>> me->mm->def_flags |= VM_NOHUGEPAGE;
>>> else
>>> me->mm->def_flags &= ~VM_NOHUGEPAGE;
>>>
>>> That's rather lazy implementation IMHO. Could we change it so the flag
>>> is stored elsewhere in the mm, and the code that decides to (not) use
>>> THP will check both the per-vma flag and the per-mm flag?
>>
>> I'm afraid I don't understand how that can help.
>> What we need is an ability to temporarily disable collapse of the pages in
>> VMAs that do not have VM_*HUGEPAGE flags set and that after we re-enable
>> THP, the vma->vm_flags for those VMAs will remain intact.
> 
> Why cannot khugepaged simply skip over all VMAs which have userfault
> regions registered? This would sound like a less error prone approach to
> me.

It already does so. The problem is that there's a race window. We first
populate the VMA with pages, then register it in UFFD. Between these two
actions khugepaged comes and generates a huge page out of the populated
pages and holes. And the holes in question are not, well, holes -- they
should be populated later via the UFFD, while the generated huge page
prevents this from happening.

-- Pavel




Re: [PATCH] Add pidfs filesystem

2017-02-22 Thread Pavel Emelyanov
On 02/22/2017 03:04 PM, Alexey Gladkov wrote:
> On Wed, Feb 22, 2017 at 10:40:49AM +0300, Pavel Emelyanov wrote:
>> On 02/21/2017 05:57 PM, Oleg Nesterov wrote:
>>> On 02/18, Alexey Gladkov wrote:
>>>>
>>>> This patch allows to mount only the part of /proc related to pids
>>>> without rest objects. Since this is an addon to /proc, flags applied to
>>>> /proc have an effect on this pidfs filesystem.
>>>
>>> I leave this to you and Eric, but imo it would be nice to avoid another
>>> filesystem.
>>>
>>>> Why not implement it as another flag to /proc ?
>>>>
>>>> The /proc flags is stored in the pid_namespace and are global for
>>>> namespace. It means that if you add a flag to hide all except the pids,
>>>> then it will act on all mounted instances of /proc.
>>>
>>> But perhaps we can use mnt_flags? For example, lets abuse MNT_NODEV, see
>>> the simple patch below. Not sure it is correct/complete, just to illustrate
>>> the idea.
>>>
>>> With this patch you can mount proc with -onodev and it will only show
>>> pids/self/thread_self:
>>>
>>> # mkdir /tmp/D
>>> # mount -t proc -o nodev none /tmp/D
>>> # ls /tmp/D
>>> 1   11  13  15  17  19  20  22  24  28  3   31  33  4  56  7  9 
>>> thread-self
>>> 10  12  14  16  18  2   21  23  27  29  30  32  34  5  6   8  self
>>> # cat /tmp/D/meminfo
>>> cat: /tmp/D/meminfo: No such file or directory
>>> # ls /tmp/D/irq
>>> ls: cannot open directory /tmp/D/irq: No such file or directory
>>>
>>> No?
>>
>> Yes!!! If this whole effort with pidfs and overlayfs moves forward, I would
>> prefer seeing the nodev procfs version rather than another fs.
> 
> But this is not procfs anymore. If someone expects procfs here, they will
> be disappointed :)

Well, it depends on what files they're looking for in there. This is what the
overlay part is for.

>> As far as the overlayfs part is concerned, having an overlayfs mounted on
>> /proc inside a container may result in problems, as applications sometimes
>> check that /proc contains procfs (by checking statfs.f_type ==
>> PROC_SUPER_MAGIC or by reading /proc/mounts).
> 
> It is not a replacement for procfs. It's a subset of procfs. If someone wants
> the procfs in the code we should not deceive him.
> 
> No?

But this is what we actually do -- Docker does it with bind-mounts, LXC with
lxcfs, OpenVZ with kernel patches. Every time a container starts, the regular
/proc is mutated to hide some information.

-- Pavel




Re: [RFC 1/3] procfs: fdinfo -- Extend information about epoll target files

2017-02-22 Thread Pavel Emelyanov
On 02/22/2017 11:18 AM, Cyrill Gorcunov wrote:
> On Wed, Feb 22, 2017 at 11:09:23AM +0300, Pavel Emelyanov wrote:
>>>>
>>>> Actually it shouldn't. If you extend the kcmp argument to accept the
>>>> epollfd:epollslot pair, this would be effectively the same as if you
>>>> had all your epoll-ed files injected into your fdtable with "strange"
>>>> fd numbers. We already have two-level rbtree for this in criu, adding
>>>> extended ("strange") fd to it should be OK.
>>>
>>> Nope. Pavel, I guess you forgot how we handle the file tree in criu
>>> currently. We call kcmp only if we have to -- when the primary key for
>>> two entries is the same.
>>
>> True, but the latter is an optimization to reduce the number of syscalls.
> 
> Exactly. While syscalls are quite effective, they are still not coming
> for free, so I'm trying to reduce their number as much as possible.
> 
>> Look, in order to have a primary key you need to do some system call for the
>> fd you check (read from proc or stat the descriptor). But for target files in
>> e-polls you don't make this per-fd syscall to get primary key, just call the
>> kcmp instead.
> 
> I have to parse fdinfo anyway, because I need to fetch queued events and mask.
> So I'll _have_ to make this per-fd syscall for parsing. And this opens
> a way to optimize overall picture -- we can immediately read primary
> key and reduce kcmp calls.

You read fdinfo per-epoll, but the kcmp-s we're talking about here are
per-target-file. So having the dev:ino pair would help to reduce the number
of kcmps, but even w/o this extension we can work OK.

Besides, in most cases the fd number you'd read from epoll's fdinfo will
actually be present in the task's fdtable, so you can call a single kcmp,
make sure the file is correct and that's it. The need to actually _search_
for the runaway file with a set of kcmp calls will (should) be quite a rare
case.

-- Pavel



Re: [RFC 1/3] procfs: fdinfo -- Extend information about epoll target files

2017-02-22 Thread Pavel Emelyanov
On 02/22/2017 10:54 AM, Cyrill Gorcunov wrote:
> On Wed, Feb 22, 2017 at 10:44:07AM +0300, Pavel Emelyanov wrote:
>> On 02/21/2017 10:16 PM, Cyrill Gorcunov wrote:
>>> On Tue, Feb 21, 2017 at 10:41:12AM -0800, Andy Lutomirski wrote:
>>>>> Thus let's add the file position, inode and device number where
>>>>> this target lies. These three fields can be used as a primary
>>>>> key for sorting, and together with kcmp they help CRIU find
>>>>> out the exact file target (from the whole set of processes
>>>>> being checkpointed).
>>>>
>>>> I have no problem with this, but I'm wondering whether kcmp's ordered
>>>> comparisons could also be used for this purpose.
>>>
>>> Yes it can, but it would increase the number of kcmp calls significantly.
>>
>> Actually it shouldn't. If you extend the kcmp argument to accept the
>> epollfd:epollslot pair, this would be effectively the same as if you
>> had all your epoll-ed files injected into your fdtable with "strange"
>> fd numbers. We already have two-level rbtree for this in criu, adding
>> extended ("strange") fd to it should be OK.
> 
> Nope. Pavel, I guess you forgot how we handle the file tree in criu
> currently. We call kcmp only if we have to -- when the primary key for
> two entries is the same.

True, but the latter is an optimization to reduce the number of syscalls.

Look, in order to have a primary key you need to do some system call for the
fd you check (read from proc or stat the descriptor). But for target files in
e-polls you don't make this per-fd syscall to get the primary key; you just
call kcmp instead.

-- Pavel




Re: [RFC 1/3] procfs: fdinfo -- Extend information about epoll target files

2017-02-21 Thread Pavel Emelyanov
On 02/21/2017 10:16 PM, Cyrill Gorcunov wrote:
> On Tue, Feb 21, 2017 at 10:41:12AM -0800, Andy Lutomirski wrote:
>>> Thus let's add the file position, inode and device number where
>>> this target lies. These three fields can be used as a primary
>>> key for sorting, and together with kcmp they help CRIU find
>>> the exact file target (from the whole set of processes
>>> being checkpointed).
>>
>> I have no problem with this, but I'm wondering whether kcmp's ordered
>> comparisons could also be used for this purpose.
> 
> Yes it can, but it would increase the number of kcmp calls significantly.

Actually it shouldn't. If you extend the kcmp argument to accept the
epollfd:epollslot pair, this would be effectively the same as if you
had all your epoll-ed files injected into your fdtable with "strange"
fd numbers. We already have two-level rbtree for this in criu, adding
extended ("strange") fd to it should be OK.

> Look, here is how we build the files tree in criu: we take inode^sdev^pos
> as a primary key and remember it inside an rbtree while we're dumping files
> (note also that we don't keep files opened but rather dump them in
> chunks). Then once we find that two files have the same primary key
> we use kcmp to build a subtree. This really helps a lot. And I plan
> to do the same with target files from epolls:
> 
>  - they are going to be handled after all opened files of all processes
>    in the container (or child processes if dumping a single task),
>    thus the complete tree with primary keys will already be built
> 
>  - on every target file I calculate the primary key and then use
>    kcmp to find whether the file is exactly the one matching
> 



Re: [PATCH] Add pidfs filesystem

2017-02-21 Thread Pavel Emelyanov
On 02/21/2017 05:57 PM, Oleg Nesterov wrote:
> On 02/18, Alexey Gladkov wrote:
>>
>> This patch allows mounting only the part of /proc related to pids,
>> without the rest of the objects. Since this is an addon to /proc, flags
>> applied to /proc also affect this pidfs filesystem.
> 
> I leave this to you and Eric, but imo it would be nice to avoid another
> filesystem.
> 
>> Why not implement it as another flag to /proc ?
>>
>> The /proc flags are stored in the pid_namespace and are global for the
>> namespace. It means that if you add a flag to hide all except the pids,
>> then it will act on all mounted instances of /proc.
> 
> But perhaps we can use mnt_flags? For example, let's abuse MNT_NODEV, see
> the simple patch below. Not sure it is correct/complete, just to illustrate
> the idea.
> 
> With this patch you can mount proc with -onodev and it will only show
> pids/self/thread_self:
> 
>   # mkdir /tmp/D
>   # mount -t proc -o nodev none /tmp/D
>   # ls /tmp/D
>   1   11  13  15  17  19  20  22  24  28  3   31  33  4  56  7  9 
> thread-self
>   10  12  14  16  18  2   21  23  27  29  30  32  34  5  6   8  self
>   # cat /tmp/D/meminfo
>   cat: /tmp/D/meminfo: No such file or directory
>   # ls /tmp/D/irq
>   ls: cannot open directory /tmp/D/irq: No such file or directory
> 
> No?

Yes!!! If this whole effort with pidfs and overlayfs moves forward, I would
prefer seeing the nodev procfs version rather than another fs.

As far as the overlayfs part is concerned, having an overlayfs mounted on /proc
inside a container may result in problems, as applications sometimes check that
/proc contains procfs (by checking statfs.f_type == PROC_SUPER_MAGIC or by
reading /proc/mounts).

-- Pavel

> Oleg.
> 
> 
> --- a/fs/proc/generic.c
> +++ b/fs/proc/generic.c
> @@ -305,11 +305,22 @@ int proc_readdir_de(struct proc_dir_entry *de, struct file *file,
>  
>  int proc_readdir(struct file *file, struct dir_context *ctx)
>  {
> + int mnt_flags = file->f_path.mnt->mnt_flags;
>   struct inode *inode = file_inode(file);
>  
> + if (mnt_flags & MNT_NODEV)
> + return 1;
> +
>   return proc_readdir_de(PDE(inode), file, ctx);
>  }
>  
> +static int proc_dir_open(struct inode *inode, struct file *file)
> +{
> + if (file->f_path.mnt->mnt_flags & MNT_NODEV)
> + return -ENOENT;
> + return 0;
> +}
> +
>  /*
>   * These are the generic /proc directory operations. They
>   * use the in-memory "struct proc_dir_entry" tree to parse
> @@ -319,6 +330,7 @@ static const struct file_operations proc_dir_operations = {
>   .llseek = generic_file_llseek,
>   .read   = generic_read_dir,
>   .iterate_shared = proc_readdir,
> + .open   = proc_dir_open,
>  };
>  
>  /*
> --- a/fs/proc/inode.c
> +++ b/fs/proc/inode.c
> @@ -318,12 +318,16 @@ proc_reg_get_unmapped_area(struct file *file, unsigned long orig_addr,
>  
>  static int proc_reg_open(struct inode *inode, struct file *file)
>  {
> + int mnt_flags = file->f_path.mnt->mnt_flags;
>   struct proc_dir_entry *pde = PDE(inode);
>   int rv = 0;
>   int (*open)(struct inode *, struct file *);
>   int (*release)(struct inode *, struct file *);
>   struct pde_opener *pdeo;
>  
> + if (mnt_flags & MNT_NODEV)
> + return -ENOENT;
> +
>   /*
>* Ensure that
>* 1) PDE's ->release hook will be called no matter what
> 
> .
> 





Re: [CRIU] [PATCH net-next] tcp: allow to enable the repair mode for non-listening sockets

2016-11-15 Thread Pavel Emelyanov
On 11/15/2016 05:15 AM, Andrei Vagin wrote:
> The repair mode is used to get and restore sequence numbers and
> data from queues. It is used to checkpoint/restore connections.
> 
> Currently the repair mode can be enabled for sockets in the established
> and closed states, but for other states we have to dump the same socket
> properties, so let's allow enabling repair mode for these sockets too.
> 
> The repair mode reveals nothing more for sockets in other states.
> 
> Signed-off-by: Andrei Vagin <ava...@openvz.org>

Acked-by: Pavel Emelyanov <xe...@openvz.org>

> ---
>  net/ipv4/tcp.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 3251fe7..a2a3a8c 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -2302,7 +2302,7 @@ EXPORT_SYMBOL(tcp_disconnect);
>  static inline bool tcp_can_repair_sock(const struct sock *sk)
>  {
>   return ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN) &&
> - ((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_ESTABLISHED));
> + (sk->sk_state != TCP_LISTEN);
>  }
>  
>  static int tcp_repair_set_window(struct tcp_sock *tp, char __user *optbuf, int len)
> 





Re: [PATCH v2] locks: Filter /proc/locks output on proc pid ns

2016-08-03 Thread Pavel Emelyanov
On 08/03/2016 05:17 PM, Nikolay Borisov wrote:
> 
> 
> On 08/03/2016 04:46 PM, Jeff Layton wrote:
>> On Wed, 2016-08-03 at 10:35 +0300, Nikolay Borisov wrote:
>>> On busy container servers reading /proc/locks shows all the locks
>>> created by all clients. This can cause large latency spikes. In my
>>> case I observed lsof taking up to 5-10 seconds while processing around
>>> 50k locks. Fix this by limiting the locks shown only to those created
>>> in the same pidns as the one the proc was mounted in. When reading
>>> /proc/locks from the init_pid_ns show everything.
>>>
>>> Signed-off-by: Nikolay Borisov 
>>> ---
>>>  fs/locks.c | 6 ++
>>>  1 file changed, 6 insertions(+)
>>>
>>> diff --git a/fs/locks.c b/fs/locks.c
>>> index ee1b15f6fc13..751673d7f7fc 100644
>>> --- a/fs/locks.c
>>> +++ b/fs/locks.c
>>> @@ -2648,9 +2648,15 @@ static int locks_show(struct seq_file *f, void *v)
>>>  {
>>>  	struct locks_iterator *iter = f->private;
>>>  	struct file_lock *fl, *bfl;
>>> +	struct pid_namespace *proc_pidns = file_inode(f->file)->i_sb->s_fs_info;
>>> +	struct pid_namespace *current_pidns = task_active_pid_ns(current);
>>>  
>>>  	fl = hlist_entry(v, struct file_lock, fl_link);
>>>  
>>> +	if ((current_pidns != &init_pid_ns) && fl->fl_nspid
>>
>> Ok, so when you read from a process that's in the init_pid_ns
>> namespace, then you'll get the whole pile of locks, even when reading
>> this from a filesystem that was mounted in a different pid_ns?
>>
>> That seems odd to me if so. Any reason not to just uniformly use the
>> proc_pidns here?
> 
> [CCing some people from openvz/CRIU]

Thanks :)

> My train of thought was "we should have means which would be the one
> universal truth about everything and this would be a process in the
> init_pid_ns". I don't have strong preference as long as I'm not breaking
> userspace. As I said before - I think the CRIU guys might be using that
> interface.

This particular change won't break us mostly because we've switched to
reading the /proc/pid/fdinfo/n files for locks.

-- Pavel

>>
>>> +	    && (proc_pidns != ns_of_pid(fl->fl_nspid)))
>>> +		return 0;
>>> +
>>>  	lock_get_status(f, fl, iter->li_pos, "");
>>>  
>>>  	list_for_each_entry(bfl, &fl->fl_block, fl_block)
>>
> .
> 





Re: [PATCH 0/5] userfaultfd: extension for non cooperative uffd usage

2016-04-20 Thread Pavel Emelyanov
On 03/20/2016 03:42 PM, Mike Rapoport wrote:
> Hi,
> 
> This set is to address the issues that appear in userfaultfd usage
> scenarios when the task monitoring the uffd and the mm-owner do not
> cooperate with each other on VM changes such as remaps, madvises and
> fork()-s.
> 
> The patches are essentially the same as in the previous respin (1),
> they've just been rebased on the current tree.

Hi, Andrea.

Hopefully one day after LSFMM is a good time to try to get a bit of
your attention to this set :)

-- Pavel





Re: [PATCH 0/2] vfs: Define new syscall getumask.

2016-04-13 Thread Pavel Emelyanov
On 04/13/2016 02:43 PM, Richard W.M. Jones wrote:
> It's not possible to read the process umask without also modifying it,
> which is what umask(2) does.  A library cannot read umask safely,
> especially if the main program might be multithreaded.
> 
> This patch series adds a trivial system call "getumask" which returns
> the umask of the current process.

Ah! Thanks for this :)

Acked-by: Pavel Emelyanov <xe...@virtuozzo.com>





Re: [PATCH 0/5] userfaultfd: extension for non cooperative uffd usage

2016-03-21 Thread Pavel Emelyanov
On 03/20/2016 03:42 PM, Mike Rapoport wrote:
> Hi,
> 
> This set is to address the issues that appear in userfaultfd usage
> scenarios when the task monitoring the uffd and the mm-owner do not
> cooperate with each other on VM changes such as remaps, madvises and
> fork()-s.
> 
> The patches are essentially the same as in the previous respin [1],
> they've just been rebased on the current tree.
> 
> [1] http://thread.gmane.org/gmane.linux.kernel.mm/132662

Thanks, Mike!

Acked-by: Pavel Emelyanov <xe...@virtuozzo.com>





Re: [regression] x86/signal/64: Fix SS handling for signals delivered to 64-bit programs breaks dosemu

2015-08-14 Thread Pavel Emelyanov
On 08/14/2015 10:22 AM, Cyrill Gorcunov wrote:
> On Thu, Aug 13, 2015 at 01:09:47PM -0700, Linus Torvalds wrote:
>> On Thu, Aug 13, 2015 at 1:08 PM, Cyrill Gorcunov  wrote:
>>>
>>> If only I'm not missing something obvious this should not hurt us.
>>> But I'm going to build a test kernel and check to be sure tomorrow, ok?
> 
> Managed to test it. And criu works fine with the revert as expected.
> Actually it's because of commit c6f2062935c8 Oleg made us a patch:
> 
>  | commit 07dcf0dbb6ff97c255bc6b06569255a9479bccdd
>  | Author: Oleg Nesterov 
>  | Date:   Thu Mar 19 19:14:00 2015 +0300
>  |
>  | restore/x86: restore_gpregs() needs to initialize ->ss as well
>  |
>  | Before the recent "x86_64,signal: Fix SS handling for signals delivered
>  | to 64-bit programs" kernel patch, sigreturn paths forgot to restore ->ss
>  | after return from the signal handler.
>  |
>  | Now that the kernel was fixed, restore_gpregs() has to initialize ->ss
>  | too, it is no longer ignored.
>  | ...
>  |
>  | +++ b/arch/x86/include/asm/restorer.h
>  | @@ -53,7 +53,7 @@ struct rt_sigcontext {
>  | unsigned short  cs;
>  | unsigned short  gs;
>  | unsigned short  fs;
>  | -   unsigned short  __pad0;
>  | +   unsigned short  ss;
>  | unsigned long   err;
>  | unsigned long   trapno;
>  | unsigned long   oldmask;
> 
> IOW, we have not been setting up __pad0, which became ss
> inside the kernel (as a result we've been passing 0 here,
> which caused the problem).
> 
> fwiw, we declare that new criu versions may require new
> kernels to work, but never promised that new criu is going
> to be compatible with old kernels.

We did. Kernel 3.11 is still declared as supported (modulo that new stuff
added after 1.0 might not work, but basic sigframe management is not one
of those "new" things).

> That said if something
> get changed inside sigcontext structure in future
> we may update criu code as well.
> .
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/




Re: [PATCH v2 1/1] mm: move ->mremap() from file_operations to vm_operations_struct

2015-06-26 Thread Pavel Emelyanov
On 06/25/2015 11:45 PM, Oleg Nesterov wrote:
> vma->vm_ops->mremap() looks more natural and clean in move_vma(),
> and this way ->mremap() can have more users. Say, vdso.
> 
> While at it, s/aio_ring_remap/aio_ring_mremap/.
> 
> Signed-off-by: Oleg Nesterov 

Acked-by: Pavel Emelyanov 





Re: [PATCH 1/1] mm: move ->mremap() from file_operations to vm_operations_struct

2015-06-23 Thread Pavel Emelyanov
On 06/23/2015 09:02 PM, Oleg Nesterov wrote:
> vma->vm_ops->mremap() looks more natural and clean in move_vma(),
> and this way ->mremap() can have more users. Say, vdso.
> 
> Signed-off-by: Oleg Nesterov 

Looks-good-to: /me :) Thanks!



Re: vdso && f_op->mremap (Was: special_mapping_fault() is broken)

2015-06-23 Thread Pavel Emelyanov
On 06/23/2015 03:47 AM, Oleg Nesterov wrote:
> On 06/21, Oleg Nesterov wrote:
>>
>> Forgot to add Andy...
> 
> Add Pavel ;)
> 
> I never understood why ->mremap() lives in file_operations, not in
> vm_operations_struct. To me vma->vm_file->f_op in move_vma() just
> looks strange, vma->vm_ops->mremap(new_vma) looks "obviously better".
> 
> And afaics more useful. CRIU remaps vdso, but this does not update
> mm->context.vdso. OK, probably this does not matter currently, CRIU
> can't c/r the compat tasks, and 64-bit apps do not use context.vdso.

Yup, context.vdso does not exist for all vdso-s. There should have been
a patch for the power arch adding this change.

> Afaics. Still, I think we might want to have special_mapping_remap()
> and we can't do this because ->vm_file == NULL.

For aio (for now the only mapping with an mremap callback) the vm_file
is not NULL.

> And perhaps other architectures can depend on the "correct" value
> in mm->context.vdso more than x86, I dunno...
> 
> In short. Shouldn't we move ->mremap() to vm_operations_struct before
> it has another user? We need to fix aio.c, but this is trivial.
> 
> No?

I will be OK with this change :)

-- Pavel






Re: [PATCH v5] seccomp: add ptrace options for suspend/resume

2015-06-15 Thread Pavel Emelyanov
On 06/13/2015 06:02 PM, Tycho Andersen wrote:
> This patch is the first step in enabling checkpoint/restore of processes
> with seccomp enabled.
> 
> One of the things CRIU does while dumping tasks is inject code into them
> via ptrace to collect information that is only available to the process
> itself. However, if we are in a seccomp mode where these processes are
> prohibited from making these syscalls, then what CRIU does kills the task.
> 
> This patch adds a new ptrace option, PTRACE_O_SUSPEND_SECCOMP, that enables
> a task from the init user namespace which has CAP_SYS_ADMIN and no seccomp
> filters to disable (and re-enable) seccomp filters for another task so that
> they can be successfully dumped (and restored). We restrict the set of
> processes that can disable seccomp through ptrace because although today
> ptrace can be used to bypass seccomp, there is some discussion of closing
> this loophole in the future and we would like this patch to not depend on
> that behavior and be future proofed for when it is removed.
> 
> Note that seccomp can be suspended before any filters are actually
> installed; this behavior is useful on criu restore, so that we can suspend
> seccomp, restore the filters, unmap our restore code from the restored
> process' address space, and then resume the task by detaching and have the
> filters resumed as well.
> 
> v2 changes:
> 
> * require that the tracer have no seccomp filters installed
> * drop TIF_NOTSC manipulation from the patch
> * change from ptrace command to a ptrace option and use this ptrace option
>   as the flag to check. This means that as soon as the tracer
>   detaches/dies, seccomp is re-enabled and as a corollary that one can not
>   disable seccomp across PTRACE_ATTACHs.
> 
> v3 changes:
> 
> * get rid of various #ifdefs everywhere
> * report more sensible errors when PTRACE_O_SUSPEND_SECCOMP is incorrectly
>   used
> 
> v4 changes:
> 
> * get rid of may_suspend_seccomp() in favor of a capable() check in ptrace
>   directly
> 
> v5 changes:
> 
> * check that seccomp is not enabled (or suspended) on the tracer
> 
> Signed-off-by: Tycho Andersen 
> CC: Kees Cook 
> CC: Andy Lutomirski 
> CC: Will Drewry 
> CC: Roland McGrath 
> CC: Oleg Nesterov 
> CC: Pavel Emelyanov 
> CC: Serge E. Hallyn 

Acked-by: Pavel Emelyanov 




Re: [PATCH] seccomp: add ptrace commands for suspend/resume

2015-06-02 Thread Pavel Emelyanov

>> +int suspend_seccomp(struct task_struct *task)
>> +{
>> +int ret = -EACCES;
>> +
>> +spin_lock_irq(>sighand->siglock);
>> +
>> +if (!capable(CAP_SYS_ADMIN))
>> +goto out;
> 
> I am puzzled ;) Why do we need ->siglock? And even if we need it, why
> we can't check CAP_SYS_ADMIN lockless?
> 
> And I am not sure I understand why do we need the additional security
> check, but I leave this to you and Andy.
> 
> If you have the rights to trace this task, then you can do anything
> the tracee could do without the filtering.

I think _this_ check is required: otherwise the seccomp-ed task (in
filtered mode) fork-s a child, the child ptrace-attaches to the parent
(allowed) and then suspends its seccomp. And we end up with an
unprivileged process de-seccomped.

-- Pavel



Re: [PATCH 00/23] userfaultfd v4

2015-05-18 Thread Pavel Emelyanov
On 05/14/2015 08:30 PM, Andrea Arcangeli wrote:
> Hello everyone,
> 
> This is the latest userfaultfd patchset against mm-v4.1-rc3
> 2015-05-14-10:04.
> 
> The postcopy live migration feature on the qemu side is mostly ready
> to be merged and it entirely depends on the userfaultfd syscall to be
> merged as well. So it'd be great if this patchset could be reviewed
> for merging in -mm.
> 
> Userfaults allow to implement on demand paging from userland and more
> generally they allow userland to more efficiently take control of the
> behavior of page faults than what was available before
> (PROT_NONE + SIGSEGV trap).

Not to spam with 23 e-mails, all patches are

Acked-by: Pavel Emelyanov 

Thanks!

-- Pavel



[PATCH 3/5] uffd: Add fork() event

2015-05-13 Thread Pavel Emelyanov
When an mm with uffd-ed vmas fork()-s, the respective vmas
notify their uffds with an event that contains a descriptor
with the new uffd. This new descriptor can then be used to get
events from the child and populate its mm with data. Note
that there can be different uffd-s controlling different
vmas within one mm, so first we should collect all those
uffds (and ctx-s) in a list and then notify them all one by
one, but only once per fork().

The context is created at fork() time, but the descriptor, file
struct and anon inode object are created at event read time. So
some trickery is added to userfaultfd_ctx_read() to handle
the ctx queues' locking vs file creation.

Another thing worth noticing is that the task that fork()-s
waits for the uffd event to get processed WITHOUT the mmap sem.

Signed-off-by: Pavel Emelyanov 
---
 fs/userfaultfd.c | 137 +++
 include/linux/userfaultfd_k.h|  12 
 include/uapi/linux/userfaultfd.h |  13 ++--
 kernel/fork.c|   9 ++-
 4 files changed, 163 insertions(+), 8 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 83acb19..202d0ac 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -62,6 +62,12 @@ struct userfaultfd_ctx {
struct mm_struct *mm;
 };
 
+struct userfaultfd_fork_ctx {
+   struct userfaultfd_ctx *orig;
+   struct userfaultfd_ctx *new;
+   struct list_head list;
+};
+
 struct userfaultfd_wait_queue {
struct uffd_msg msg;
wait_queue_t wq;
@@ -432,6 +438,79 @@ static void userfaultfd_event_complete(struct 
userfaultfd_ctx *ctx,
__remove_wait_queue(>event_wqh, >wq);
 }
 
+int dup_userfaultfd(struct vm_area_struct *vma, struct list_head *fcs)
+{
+   struct userfaultfd_ctx *ctx = NULL, *octx;
+   struct userfaultfd_fork_ctx *fctx;
+
+   octx = vma->vm_userfaultfd_ctx.ctx;
+   if (!octx || !(octx->features & UFFD_FEATURE_EVENT_FORK)) {
+   vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
+   vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING);
+   return 0;
+   }
+
+   list_for_each_entry(fctx, fcs, list)
+   if (fctx->orig == octx) {
+   ctx = fctx->new;
+   break;
+   }
+
+   if (!ctx) {
+   fctx = kmalloc(sizeof(*fctx), GFP_KERNEL);
+   if (!fctx)
+   return -ENOMEM;
+
+   ctx = kmem_cache_alloc(userfaultfd_ctx_cachep, GFP_KERNEL);
+   if (!ctx) {
+   kfree(fctx);
+   return -ENOMEM;
+   }
+
+   atomic_set(>refcount, 1);
+   ctx->flags = octx->flags;
+   ctx->state = UFFD_STATE_RUNNING;
+   ctx->features = octx->features;
+   ctx->released = false;
+   ctx->mm = vma->vm_mm;
+   atomic_inc(>mm->mm_users);
+
+   userfaultfd_ctx_get(octx);
+   fctx->orig = octx;
+   fctx->new = ctx;
+   list_add_tail(>list, fcs);
+   }
+
+   vma->vm_userfaultfd_ctx.ctx = ctx;
+   return 0;
+}
+
+static int dup_fctx(struct userfaultfd_fork_ctx *fctx)
+{
+   struct userfaultfd_ctx *ctx = fctx->orig;
+   struct userfaultfd_wait_queue ewq;
+
+   msg_init();
+
+   ewq.msg.event = UFFD_EVENT_FORK;
+   ewq.msg.arg.reserved.reserved1 = (__u64)fctx->new;
+
+   return userfaultfd_event_wait_completion(ctx, );
+}
+
+void dup_userfaultfd_complete(struct list_head *fcs)
+{
+   int ret = 0;
+   struct userfaultfd_fork_ctx *fctx, *n;
+
+   list_for_each_entry_safe(fctx, n, fcs, list) {
+   if (!ret)
+   ret = dup_fctx(fctx);
+   list_del(>list);
+   kfree(fctx);
+   }
+}
+
 static int userfaultfd_release(struct inode *inode, struct file *file)
 {
struct userfaultfd_ctx *ctx = file->private_data;
@@ -563,12 +642,47 @@ static unsigned int userfaultfd_poll(struct file *file, 
poll_table *wait)
}
 }
 
+static const struct file_operations userfaultfd_fops;
+
+static int resolve_userfault_fork(struct userfaultfd_ctx *ctx,
+   struct userfaultfd_ctx *new, struct uffd_msg *msg)
+{
+   int fd;
+   struct file *file;
+
+   fd = get_unused_fd_flags(new->flags & UFFD_SHARED_FCNTL_FLAGS);
+   if (fd < 0)
+   return fd;
+
+   file = anon_inode_getfile("[userfaultfd]", _fops, new,
+ O_RDWR | (new->flags & 
UFFD_SHARED_FCNTL_FLAGS));
+   if (IS_ERR(file)) {
+   put_unused_fd(fd);
+   return PTR_ERR(file);
+   }
+
+   fd_install(fd, file);
+   msg->arg.reserved.reserved1 = 0;
+   msg->arg.fork.ufd = fd;
+
+   return 0;
+}
+
static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait

[PATCH 5/5] uffd: Add madvise() event for MADV_DONTNEED request

2015-05-13 Thread Pavel Emelyanov
If a page is punched out of the address space, the uffd reader
should know about it and zeromap the respective area when the
#PF event arrives.

Signed-off-by: Pavel Emelyanov 
---
 fs/userfaultfd.c | 26 ++
 include/linux/userfaultfd_k.h| 10 ++
 include/uapi/linux/userfaultfd.h |  9 -
 mm/madvise.c |  2 ++
 4 files changed, 46 insertions(+), 1 deletion(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 697e636..6e80a02 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -546,6 +546,32 @@ void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx 
vm_ctx, unsigned long
userfaultfd_event_wait_completion(ctx, );
 }
 
+void madvise_userfault_dontneed(struct vm_area_struct *vma,
+   struct vm_area_struct **prev,
+   unsigned long start, unsigned long end)
+{
+   struct userfaultfd_ctx *ctx;
+   struct userfaultfd_wait_queue ewq;
+
+   ctx = vma->vm_userfaultfd_ctx.ctx;
+   if (!ctx || !(ctx->features & UFFD_FEATURE_EVENT_MADVDONTNEED))
+   return;
+
+   userfaultfd_ctx_get(ctx);
+   *prev = NULL; /* We wait for ACK w/o the mmap semaphore */
+   up_read(>vm_mm->mmap_sem);
+
+   msg_init();
+
+   ewq.msg.event = UFFD_EVENT_MADVDONTNEED;
+   ewq.msg.arg.madv_dn.start = start;
+   ewq.msg.arg.madv_dn.end = end;
+
+   userfaultfd_event_wait_completion(ctx, );
+
+   down_read(>vm_mm->mmap_sem);
+}
+
 static int userfaultfd_release(struct inode *inode, struct file *file)
 {
struct userfaultfd_ctx *ctx = file->private_data;
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 0ed5dce..2f72069 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -82,6 +82,10 @@ extern void mremap_userfaultfd_prep(struct vm_area_struct *, 
struct vm_userfault
 extern void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx, unsigned 
long from,
unsigned long to, unsigned long len);
 
+extern void madvise_userfault_dontneed(struct vm_area_struct *vma,
+   struct vm_area_struct **prev, unsigned long start,
+   unsigned long end);
+
 #else /* CONFIG_USERFAULTFD */
 
 /* mm helpers */
@@ -131,6 +135,12 @@ static inline void mremap_userfaultfd_complete(struct 
vm_userfaultfd_ctx ctx,
unsigned long from, unsigned long to, unsigned long len)
 {
 }
+
+static inline void madvise_userfault_dontneed(struct vm_area_struct *vma,
+   struct vm_area_struct **prev, unsigned long start,
+   unsigned long end)
+{
+}
 #endif /* CONFIG_USERFAULTFD */
 
 #endif /* _LINUX_USERFAULTFD_K_H */
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 59a141c..d0d1ef1 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -14,7 +14,7 @@
  * After implementing the respective features it will become:
  * #define UFFD_API_FEATURES (UFFD_FEATURE_PAGEFAULT_FLAG_WP)
  */
-#define UFFD_API_FEATURES (UFFD_FEATURE_EVENT_FORK|UFFD_FEATURE_EVENT_REMAP)
+#define UFFD_API_FEATURES 
(UFFD_FEATURE_EVENT_FORK|UFFD_FEATURE_EVENT_REMAP|UFFD_FEATURE_EVENT_MADVDONTNEED)
 #define UFFD_API_IOCTLS\
((__u64)1 << _UFFDIO_REGISTER | \
 (__u64)1 << _UFFDIO_UNREGISTER |   \
@@ -79,6 +79,11 @@ struct uffd_msg {
} remap;
 
struct {
+   __u64   start;
+   __u64   end;
+   } madv_dn;
+
+   struct {
/* unused reserved fields */
__u64   reserved1;
__u64   reserved2;
@@ -93,6 +98,7 @@ struct uffd_msg {
 #define UFFD_EVENT_PAGEFAULT   0x12
 #define UFFD_EVENT_FORK0x13
 #define UFFD_EVENT_REMAP   0x14
+#define UFFD_EVENT_MADVDONTNEED0x15
 
 /* flags for UFFD_EVENT_PAGEFAULT */
 #define UFFD_PAGEFAULT_FLAG_WRITE  (1<<0)  /* If this was a write fault */
@@ -116,6 +122,7 @@ struct uffdio_api {
 #endif
 #define UFFD_FEATURE_EVENT_FORK(1<<1)
 #define UFFD_FEATURE_EVENT_REMAP   (1<<2)
+#define UFFD_FEATURE_EVENT_MADVDONTNEED(1<<3)
__u64 features;
 
__u64 ioctls;
diff --git a/mm/madvise.c b/mm/madvise.c
index 10f62b7..3ea20e2 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -10,6 +10,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -283,6 +284,7 @@ static long madvise_dontneed(struct vm_area_struct *vma,
return -EINVAL;
 
zap_page_range(vma, start, end - start, NULL);
+   madvise_userfault_dontneed(vma, prev, start, end);
return 0;
 }
 
-- 
1.9.3



[PATCH 4/5] uffd: Add mremap() event

2015-05-13 Thread Pavel Emelyanov
The event denotes that the area [start:end] moves to a different
location. A length change isn't reported: if "new" addresses appear
on the uffd reader side they will not contain any data, and the
latter can just zeromap them.

Waiting for the event ACK is also done outside of the mmap sem, as
for the fork event.

Signed-off-by: Pavel Emelyanov 
---
 fs/userfaultfd.c | 35 +++
 include/linux/userfaultfd_k.h| 12 
 include/uapi/linux/userfaultfd.h | 10 +-
 mm/mremap.c  | 31 ---
 4 files changed, 76 insertions(+), 12 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 202d0ac..697e636 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -511,6 +511,41 @@ void dup_userfaultfd_complete(struct list_head *fcs)
}
 }
 
+void mremap_userfaultfd_prep(struct vm_area_struct *vma, struct 
vm_userfaultfd_ctx *vm_ctx)
+{
+   struct userfaultfd_ctx *ctx;
+
+   ctx = vma->vm_userfaultfd_ctx.ctx;
+   if (ctx && (ctx->features & UFFD_FEATURE_EVENT_REMAP)) {
+   vm_ctx->ctx = ctx;
+   userfaultfd_ctx_get(ctx);
+   }
+}
+
+void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx vm_ctx, unsigned 
long from,
+   unsigned long to, unsigned long len)
+{
+   struct userfaultfd_ctx *ctx = vm_ctx.ctx;
+   struct userfaultfd_wait_queue ewq;
+
+   if (ctx == NULL)
+   return;
+
+   if (to & ~PAGE_MASK) {
+   userfaultfd_ctx_put(ctx);
+   return;
+   }
+
+   msg_init();
+
+   ewq.msg.event = UFFD_EVENT_REMAP;
+   ewq.msg.arg.remap.from = from;
+   ewq.msg.arg.remap.to = to;
+   ewq.msg.arg.remap.len = len;
+
+   userfaultfd_event_wait_completion(ctx, );
+}
+
 static int userfaultfd_release(struct inode *inode, struct file *file)
 {
struct userfaultfd_ctx *ctx = file->private_data;
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 44827f7..0ed5dce 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -78,6 +78,10 @@ static inline bool userfaultfd_armed(struct vm_area_struct 
*vma)
 extern int dup_userfaultfd(struct vm_area_struct *, struct list_head *);
 extern void dup_userfaultfd_complete(struct list_head *);
 
+extern void mremap_userfaultfd_prep(struct vm_area_struct *, struct 
vm_userfaultfd_ctx *);
+extern void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx, unsigned 
long from,
+   unsigned long to, unsigned long len);
+
 #else /* CONFIG_USERFAULTFD */
 
 /* mm helpers */
@@ -119,6 +123,14 @@ static inline void dup_userfaultfd_complete(struct 
list_head *)
 {
 }
 
+static inline void mremap_userfaultfd_prep(struct vm_area_struct *vma, struct 
vm_userfaultfd_ctx *ctx)
+{
+}
+
+static inline void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx ctx,
+   unsigned long from, unsigned long to, unsigned long len)
+{
+}
 #endif /* CONFIG_USERFAULTFD */
 
 #endif /* _LINUX_USERFAULTFD_K_H */
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index fb9ced2..59a141c 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -14,7 +14,7 @@
  * After implementing the respective features it will become:
  * #define UFFD_API_FEATURES (UFFD_FEATURE_PAGEFAULT_FLAG_WP)
  */
-#define UFFD_API_FEATURES (UFFD_FEATURE_EVENT_FORK)
+#define UFFD_API_FEATURES (UFFD_FEATURE_EVENT_FORK|UFFD_FEATURE_EVENT_REMAP)
 #define UFFD_API_IOCTLS\
((__u64)1 << _UFFDIO_REGISTER | \
 (__u64)1 << _UFFDIO_UNREGISTER |   \
@@ -73,6 +73,12 @@ struct uffd_msg {
} fork;
 
struct {
+   __u64   from;
+   __u64   to;
+   __u64   len;
+   } remap;
+
+   struct {
/* unused reserved fields */
__u64   reserved1;
__u64   reserved2;
@@ -86,6 +92,7 @@ struct uffd_msg {
  */
 #define UFFD_EVENT_PAGEFAULT   0x12
 #define UFFD_EVENT_FORK0x13
+#define UFFD_EVENT_REMAP   0x14
 
 /* flags for UFFD_EVENT_PAGEFAULT */
 #define UFFD_PAGEFAULT_FLAG_WRITE  (1<<0)  /* If this was a write fault */
@@ -108,6 +115,7 @@ struct uffdio_api {
 #define UFFD_FEATURE_PAGEFAULT_FLAG_WP (1<<0)
 #endif
 #define UFFD_FEATURE_EVENT_FORK(1<<1)
+#define UFFD_FEATURE_EVENT_REMAP   (1<<2)
__u64 features;
 
__u64 ioctls;
diff --git a/mm/mremap.c b/mm/mremap.c
index 034e2d3..a63287d 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -234,7 +235,8 @@ unsigned long move_page_tables(struct vm_are

[PATCH 2/5] uffd: Add ability to report non-PF events from uffd descriptor

2015-05-13 Thread Pavel Emelyanov
The custom events are queued in ctx->event_wqh so as not to disturb
the fast-path-ed PF queue-wait-wakeup functions.

The events to be generated (other than PF-s) are requested in the UFFD_API
ioctl with the uffd_api.features bits. Those known to the kernel are
then turned on and reported back to the user-space.

Signed-off-by: Pavel Emelyanov 
---
 fs/userfaultfd.c | 97 ++--
 1 file changed, 95 insertions(+), 2 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index c593a72..83acb19 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -12,6 +12,7 @@
  *  mm/ksm.c (mm hashing).
  */
 
+#include 
 #include 
 #include 
 #include 
@@ -43,12 +44,16 @@ struct userfaultfd_ctx {
wait_queue_head_t fault_pending_wqh;
/* waitqueue head for the userfaults */
wait_queue_head_t fault_wqh;
+   /* waitqueue head for events */
+   wait_queue_head_t event_wqh;
/* waitqueue head for the pseudo fd to wakeup poll/read */
wait_queue_head_t fd_wqh;
/* pseudo fd refcounting */
atomic_t refcount;
/* userfaultfd syscall flags */
unsigned int flags;
+   /* features requested from the userspace */
+   unsigned int features;
/* state machine */
enum userfaultfd_state state;
/* released */
@@ -120,6 +125,8 @@ static void userfaultfd_ctx_put(struct userfaultfd_ctx *ctx)
VM_BUG_ON(waitqueue_active(>fault_pending_wqh));
VM_BUG_ON(spin_is_locked(>fault_wqh.lock));
VM_BUG_ON(waitqueue_active(>fault_wqh));
+   VM_BUG_ON(spin_is_locked(>event_wqh.lock));
+   VM_BUG_ON(waitqueue_active(>event_wqh));
VM_BUG_ON(spin_is_locked(>fd_wqh.lock));
VM_BUG_ON(waitqueue_active(>fd_wqh));
mmput(ctx->mm);
@@ -373,6 +380,58 @@ out:
return ret;
 }
 
+static int __maybe_unused userfaultfd_event_wait_completion(struct 
userfaultfd_ctx *ctx,
+   struct userfaultfd_wait_queue *ewq)
+{
+   int ret = 0;
+
+   ewq->ctx = ctx;
+   init_waitqueue_entry(>wq, current);
+
+   spin_lock(>event_wqh.lock);
+   /*
+* After the __add_wait_queue the uwq is visible to userland
+* through poll/read().
+*/
+   __add_wait_queue(>event_wqh, >wq);
+   for (;;) {
+   set_current_state(TASK_KILLABLE);
+   if (ewq->msg.event == 0)
+   break;
+   if (ACCESS_ONCE(ctx->released) ||
+   fatal_signal_pending(current)) {
+   ret = -1;
+   __remove_wait_queue(>event_wqh, >wq);
+   break;
+   }
+
+   spin_unlock(>event_wqh.lock);
+
+   wake_up_poll(>fd_wqh, POLLIN);
+   schedule();
+
+   spin_lock(>event_wqh.lock);
+   }
+   __set_current_state(TASK_RUNNING);
+   spin_unlock(>event_wqh.lock);
+
+   /*
+* ctx may go away after this if the userfault pseudo fd is
+* already released.
+*/
+
+   userfaultfd_ctx_put(ctx);
+   return ret;
+}
+
+static void userfaultfd_event_complete(struct userfaultfd_ctx *ctx,
+   struct userfaultfd_wait_queue *ewq)
+{
+   ewq->msg.event = 0;
+   wake_up_locked(>event_wqh);
+   __remove_wait_queue(>event_wqh, >wq);
+}
+
 static int userfaultfd_release(struct inode *inode, struct file *file)
 {
struct userfaultfd_ctx *ctx = file->private_data;
@@ -458,6 +517,12 @@ static inline struct userfaultfd_wait_queue 
*find_userfault(
return find_userfault_in(>fault_pending_wqh);
 }
 
+static inline struct userfaultfd_wait_queue *find_userfault_evt(
+   struct userfaultfd_ctx *ctx)
+{
+   return find_userfault_in(>event_wqh);
+}
+
 static unsigned int userfaultfd_poll(struct file *file, poll_table *wait)
 {
struct userfaultfd_ctx *ctx = file->private_data;
@@ -489,6 +554,9 @@ static unsigned int userfaultfd_poll(struct file *file, 
poll_table *wait)
smp_mb();
if (waitqueue_active(>fault_pending_wqh))
ret = POLLIN;
+   else if (waitqueue_active(>event_wqh))
+   ret = POLLIN;
+
return ret;
default:
BUG();
@@ -528,6 +596,19 @@ static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx 
*ctx, int no_wait,
break;
}
spin_unlock(>fault_pending_wqh.lock);
+
+   spin_lock(>event_wqh.lock);
+   uwq = find_userfault_evt(ctx);
+   if (uwq) {
+   *msg = uwq->msg;
+
+   userfaultfd_event_complete(ctx, uwq);
+  

[PATCH 1/5] uffd: Split the find_userfault() routine

2015-05-13 Thread Pavel Emelyanov
I will need one to look up userfaultfd_wait_queue-s in a different
wait queue.

Signed-off-by: Pavel Emelyanov 
---
 fs/userfaultfd.c | 18 --
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index c89e96f..c593a72 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -433,23 +433,29 @@ static int userfaultfd_release(struct inode *inode, 
struct file *file)
 }
 
 /* fault_pending_wqh.lock must be hold by the caller */
-static inline struct userfaultfd_wait_queue *find_userfault(
-   struct userfaultfd_ctx *ctx)
+static inline struct userfaultfd_wait_queue *find_userfault_in(
+   wait_queue_head_t *wqh)
 {
wait_queue_t *wq;
struct userfaultfd_wait_queue *uwq;
 
-   VM_BUG_ON(!spin_is_locked(>fault_pending_wqh.lock));
+   VM_BUG_ON(!spin_is_locked(>lock));
 
uwq = NULL;
-   if (!waitqueue_active(>fault_pending_wqh))
+   if (!waitqueue_active(wqh))
goto out;
/* walk in reverse to provide FIFO behavior to read userfaults */
-   wq = list_last_entry(>fault_pending_wqh.task_list,
-typeof(*wq), task_list);
+   wq = list_last_entry(>task_list, typeof(*wq), task_list);
uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
 out:
return uwq;
+
+}
+
+static inline struct userfaultfd_wait_queue *find_userfault(
+   struct userfaultfd_ctx *ctx)
+{
+   return find_userfault_in(>fault_pending_wqh);
 }
 
 static unsigned int userfaultfd_poll(struct file *file, poll_table *wait)
-- 
1.9.3




[PATCH 0/5] UserfaultFD: Extension for non cooperative uffd usage (v2)

2015-05-13 Thread Pavel Emelyanov
Hi,

This set is to address the issues that appear in userfaultfd usage
scenarios when the task monitoring the uffd and the mm-owner do not
cooperate with each other on VM changes such as remaps, madvises and
fork()-s.

This is the re-based set on the recent userfaultfd branch; two major
changes are:

* No need for a separate API version: the uffd_msg introduced in the
  current code and the UFFD_API ioctl are enough for the needed extensions
* Two events added, for mremap() and madvise() MADV_DONTNEED

More details about the particular events are in patches 3 through 5.
Comments and suggestions are warmly welcome :)

The v1 discussion thread is here: https://lkml.org/lkml/2015/3/18/729

Thanks,
Pavel



[PATCH 0/5] UserfaultFD: Extension for non cooperative uffd usage (v2)

2015-05-13 Thread Pavel Emelyanov
Hi,

This set is to address the issues that appear in userfaultfd usage
scenarios when the task monitoring the uffd and the mm-owner do not
cooperate with each other on VM changes such as remaps, madvises and
fork()-s.

This is the re-based set on the recent userfaultfd branch, two major
changes are:

* No need for a separate API version, the uffd_msg introduced in the
  current code and the UFFD_API ioctl are enough for the needed extensions
* Two events added -- for mremap() and madvise() MADV_DONTNEED

More details about the particular events are in patches 3 through 4.
Comments and suggestions are warmly welcome :)

The v1 discussion thread is here: https://lkml.org/lkml/2015/3/18/729

Thanks,
Pavel
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/5] uffd: Split the find_userfault() routine

2015-05-13 Thread Pavel Emelyanov
I will need one to lookup for userfaultfd_wait_queue-s in different
wait queue.

Signed-off-by: Pavel Emelyanov xe...@parallels.com
---
 fs/userfaultfd.c | 18 --
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index c89e96f..c593a72 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -433,23 +433,29 @@ static int userfaultfd_release(struct inode *inode, 
struct file *file)
 }
 
 /* fault_pending_wqh.lock must be hold by the caller */
-static inline struct userfaultfd_wait_queue *find_userfault(
-   struct userfaultfd_ctx *ctx)
+static inline struct userfaultfd_wait_queue *find_userfault_in(
+   wait_queue_head_t *wqh)
 {
wait_queue_t *wq;
struct userfaultfd_wait_queue *uwq;
 
-   VM_BUG_ON(!spin_is_locked(ctx-fault_pending_wqh.lock));
+   VM_BUG_ON(!spin_is_locked(wqh-lock));
 
uwq = NULL;
-	if (!waitqueue_active(&ctx->fault_pending_wqh))
+   if (!waitqueue_active(wqh))
goto out;
/* walk in reverse to provide FIFO behavior to read userfaults */
-	wq = list_last_entry(&ctx->fault_pending_wqh.task_list,
-			     typeof(*wq), task_list);
+	wq = list_last_entry(&wqh->task_list, typeof(*wq), task_list);
uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
 out:
return uwq;
+
+}
+
+static inline struct userfaultfd_wait_queue *find_userfault(
+   struct userfaultfd_ctx *ctx)
+{
+	return find_userfault_in(&ctx->fault_pending_wqh);
 }
 
 static unsigned int userfaultfd_poll(struct file *file, poll_table *wait)
-- 
1.9.3




[PATCH 2/5] uffd: Add ability to report non-PF events from uffd descriptor

2015-05-13 Thread Pavel Emelyanov
The custom events are queued in ctx->event_wqh so as not to disturb the
fast-pathed PF queue-wait-wakeup functions.

The events to be generated (other than PF-s) are requested in UFFD_API
ioctl with the uffd_api.features bits. Those, known by the kernel, are
then turned on and reported back to the user-space.

Signed-off-by: Pavel Emelyanov xe...@parallels.com
---
 fs/userfaultfd.c | 97 ++--
 1 file changed, 95 insertions(+), 2 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index c593a72..83acb19 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -12,6 +12,7 @@
  *  mm/ksm.c (mm hashing).
  */
 
+#include <linux/list.h>
 #include <linux/hashtable.h>
 #include <linux/sched.h>
 #include <linux/mm.h>
@@ -43,12 +44,16 @@ struct userfaultfd_ctx {
wait_queue_head_t fault_pending_wqh;
/* waitqueue head for the userfaults */
wait_queue_head_t fault_wqh;
+   /* waitqueue head for events */
+   wait_queue_head_t event_wqh;
/* waitqueue head for the pseudo fd to wakeup poll/read */
wait_queue_head_t fd_wqh;
/* pseudo fd refcounting */
atomic_t refcount;
/* userfaultfd syscall flags */
unsigned int flags;
+   /* features requested from the userspace */
+   unsigned int features;
/* state machine */
enum userfaultfd_state state;
/* released */
@@ -120,6 +125,8 @@ static void userfaultfd_ctx_put(struct userfaultfd_ctx *ctx)
 	VM_BUG_ON(waitqueue_active(&ctx->fault_pending_wqh));
 	VM_BUG_ON(spin_is_locked(&ctx->fault_wqh.lock));
 	VM_BUG_ON(waitqueue_active(&ctx->fault_wqh));
+	VM_BUG_ON(spin_is_locked(&ctx->event_wqh.lock));
+	VM_BUG_ON(waitqueue_active(&ctx->event_wqh));
 	VM_BUG_ON(spin_is_locked(&ctx->fd_wqh.lock));
 	VM_BUG_ON(waitqueue_active(&ctx->fd_wqh));
 	mmput(ctx->mm);
@@ -373,6 +380,58 @@ out:
return ret;
 }
 
+static int __maybe_unused userfaultfd_event_wait_completion(struct userfaultfd_ctx *ctx,
+		struct userfaultfd_wait_queue *ewq)
+{
+	int ret = 0;
+
+	ewq->ctx = ctx;
+	init_waitqueue_entry(&ewq->wq, current);
+
+	spin_lock(&ctx->event_wqh.lock);
+	/*
+	 * After the __add_wait_queue the uwq is visible to userland
+	 * through poll/read().
+	 */
+	__add_wait_queue(&ctx->event_wqh, &ewq->wq);
+	for (;;) {
+		set_current_state(TASK_KILLABLE);
+		if (ewq->msg.event == 0)
+			break;
+		if (ACCESS_ONCE(ctx->released) ||
+		    fatal_signal_pending(current)) {
+			ret = -1;
+			__remove_wait_queue(&ctx->event_wqh, &ewq->wq);
+			break;
+		}
+
+		spin_unlock(&ctx->event_wqh.lock);
+
+		wake_up_poll(&ctx->fd_wqh, POLLIN);
+		schedule();
+
+		spin_lock(&ctx->event_wqh.lock);
+	}
+	__set_current_state(TASK_RUNNING);
+	spin_unlock(&ctx->event_wqh.lock);
+
+	/*
+	 * ctx may go away after this if the userfault pseudo fd is
+	 * already released.
+	 */
+
+	userfaultfd_ctx_put(ctx);
+	return ret;
+}
+
+static void userfaultfd_event_complete(struct userfaultfd_ctx *ctx,
+		struct userfaultfd_wait_queue *ewq)
+{
+	ewq->msg.event = 0;
+	wake_up_locked(&ctx->event_wqh);
+	__remove_wait_queue(&ctx->event_wqh, &ewq->wq);
+}
+
 static int userfaultfd_release(struct inode *inode, struct file *file)
 {
	struct userfaultfd_ctx *ctx = file->private_data;
@@ -458,6 +517,12 @@ static inline struct userfaultfd_wait_queue *find_userfault(
 	return find_userfault_in(&ctx->fault_pending_wqh);
 }
 
+static inline struct userfaultfd_wait_queue *find_userfault_evt(
+		struct userfaultfd_ctx *ctx)
+{
+	return find_userfault_in(&ctx->event_wqh);
+}
+
 static unsigned int userfaultfd_poll(struct file *file, poll_table *wait)
 {
struct userfaultfd_ctx *ctx = file-private_data;
@@ -489,6 +554,9 @@ static unsigned int userfaultfd_poll(struct file *file, poll_table *wait)
 		smp_mb();
 		if (waitqueue_active(&ctx->fault_pending_wqh))
 			ret = POLLIN;
+		else if (waitqueue_active(&ctx->event_wqh))
+			ret = POLLIN;
+
return ret;
default:
BUG();
@@ -528,6 +596,19 @@ static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
 			break;
 		}
 		spin_unlock(&ctx->fault_pending_wqh.lock);
+
+		spin_lock(&ctx->event_wqh.lock);
+		uwq = find_userfault_evt(ctx);
+		if (uwq) {
+			*msg = uwq->msg;
+
+   userfaultfd_event_complete(ctx, uwq

[PATCH 4/5] uffd: Add mremap() event

2015-05-13 Thread Pavel Emelyanov
The event denotes that an area [start:end] moves to a different
location. A length change isn't reported: any new addresses that
appear on the uffd reader side will not contain data, so the reader
can just zeromap them.

Waiting for the event ACK is also done outside of the mmap sem, as
for the fork event.

Signed-off-by: Pavel Emelyanov xe...@parallels.com
---
 fs/userfaultfd.c | 35 +++
 include/linux/userfaultfd_k.h| 12 
 include/uapi/linux/userfaultfd.h | 10 +-
 mm/mremap.c  | 31 ---
 4 files changed, 76 insertions(+), 12 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 202d0ac..697e636 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -511,6 +511,41 @@ void dup_userfaultfd_complete(struct list_head *fcs)
}
 }
 
+void mremap_userfaultfd_prep(struct vm_area_struct *vma,
+	struct vm_userfaultfd_ctx *vm_ctx)
+{
+	struct userfaultfd_ctx *ctx;
+
+	ctx = vma->vm_userfaultfd_ctx.ctx;
+	if (ctx && (ctx->features & UFFD_FEATURE_EVENT_REMAP)) {
+		vm_ctx->ctx = ctx;
+		userfaultfd_ctx_get(ctx);
+	}
+}
+
+void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx vm_ctx, unsigned long from,
+	unsigned long to, unsigned long len)
+{
+	struct userfaultfd_ctx *ctx = vm_ctx.ctx;
+	struct userfaultfd_wait_queue ewq;
+
+	if (ctx == NULL)
+		return;
+
+	if (to & ~PAGE_MASK) {
+		userfaultfd_ctx_put(ctx);
+		return;
+	}
+
+	msg_init(&ewq.msg);
+
+	ewq.msg.event = UFFD_EVENT_REMAP;
+	ewq.msg.arg.remap.from = from;
+	ewq.msg.arg.remap.to = to;
+	ewq.msg.arg.remap.len = len;
+
+	userfaultfd_event_wait_completion(ctx, &ewq);
+}
+
 static int userfaultfd_release(struct inode *inode, struct file *file)
 {
	struct userfaultfd_ctx *ctx = file->private_data;
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 44827f7..0ed5dce 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -78,6 +78,10 @@ static inline bool userfaultfd_armed(struct vm_area_struct *vma)
 extern int dup_userfaultfd(struct vm_area_struct *, struct list_head *);
 extern void dup_userfaultfd_complete(struct list_head *);
 
+extern void mremap_userfaultfd_prep(struct vm_area_struct *, struct vm_userfaultfd_ctx *);
+extern void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx, unsigned long from,
+	unsigned long to, unsigned long len);
+
 #else /* CONFIG_USERFAULTFD */
 
 /* mm helpers */
@@ -119,6 +123,14 @@ static inline void dup_userfaultfd_complete(struct list_head *)
 {
 }
 
+static inline void mremap_userfaultfd_prep(struct vm_area_struct *vma,
+	struct vm_userfaultfd_ctx *ctx)
+{
+}
+
+static inline void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx ctx,
+   unsigned long from, unsigned long to, unsigned long len)
+{
+}
 #endif /* CONFIG_USERFAULTFD */
 
 #endif /* _LINUX_USERFAULTFD_K_H */
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index fb9ced2..59a141c 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -14,7 +14,7 @@
  * After implementing the respective features it will become:
  * #define UFFD_API_FEATURES (UFFD_FEATURE_PAGEFAULT_FLAG_WP)
  */
-#define UFFD_API_FEATURES (UFFD_FEATURE_EVENT_FORK)
+#define UFFD_API_FEATURES (UFFD_FEATURE_EVENT_FORK|UFFD_FEATURE_EVENT_REMAP)
 #define UFFD_API_IOCTLS				\
 	((__u64)1 << _UFFDIO_REGISTER |		\
 	 (__u64)1 << _UFFDIO_UNREGISTER |	\
@@ -73,6 +73,12 @@ struct uffd_msg {
} fork;
 
struct {
+   __u64   from;
+   __u64   to;
+   __u64   len;
+   } remap;
+
+   struct {
/* unused reserved fields */
__u64   reserved1;
__u64   reserved2;
@@ -86,6 +92,7 @@ struct uffd_msg {
  */
 #define UFFD_EVENT_PAGEFAULT   0x12
 #define UFFD_EVENT_FORK0x13
+#define UFFD_EVENT_REMAP   0x14
 
 /* flags for UFFD_EVENT_PAGEFAULT */
 #define UFFD_PAGEFAULT_FLAG_WRITE	(1<<0)	/* If this was a write fault */
@@ -108,6 +115,7 @@ struct uffdio_api {
 #define UFFD_FEATURE_PAGEFAULT_FLAG_WP	(1<<0)
 #endif
 #define UFFD_FEATURE_EVENT_FORK		(1<<1)
+#define UFFD_FEATURE_EVENT_REMAP	(1<<2)
__u64 features;
 
__u64 ioctls;
diff --git a/mm/mremap.c b/mm/mremap.c
index 034e2d3..a63287d 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -22,6 +22,7 @@
 #include <linux/mmu_notifier.h>
 #include <linux/sched/sysctl.h>
 #include <linux/uaccess.h>
+#include <linux/userfaultfd_k.h>
 
 #include <asm/cacheflush.h>
 #include <asm/tlbflush.h>
@@ -234,7 +235,8 @@ unsigned long

[PATCH 5/5] uffd: Add madvise() event for MADV_DONTNEED request

2015-05-13 Thread Pavel Emelyanov
If the page is punched out of the address space, the uffd reader
should know this and zeromap the respective area in case of
a subsequent #PF event.

Signed-off-by: Pavel Emelyanov xe...@parallels.com
---
 fs/userfaultfd.c | 26 ++
 include/linux/userfaultfd_k.h| 10 ++
 include/uapi/linux/userfaultfd.h |  9 -
 mm/madvise.c |  2 ++
 4 files changed, 46 insertions(+), 1 deletion(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 697e636..6e80a02 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -546,6 +546,32 @@ void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx vm_ctx, unsigned long
userfaultfd_event_wait_completion(ctx, ewq);
 }
 
+void madvise_userfault_dontneed(struct vm_area_struct *vma,
+	struct vm_area_struct **prev,
+	unsigned long start, unsigned long end)
+{
+	struct userfaultfd_ctx *ctx;
+	struct userfaultfd_wait_queue ewq;
+
+	ctx = vma->vm_userfaultfd_ctx.ctx;
+	if (!ctx || !(ctx->features & UFFD_FEATURE_EVENT_MADVDONTNEED))
+		return;
+
+	userfaultfd_ctx_get(ctx);
+	*prev = NULL; /* We wait for ACK w/o the mmap semaphore */
+	up_read(&vma->vm_mm->mmap_sem);
+
+	msg_init(&ewq.msg);
+
+	ewq.msg.event = UFFD_EVENT_MADVDONTNEED;
+	ewq.msg.arg.madv_dn.start = start;
+	ewq.msg.arg.madv_dn.end = end;
+
+	userfaultfd_event_wait_completion(ctx, &ewq);
+
+	down_read(&vma->vm_mm->mmap_sem);
+}
+
 static int userfaultfd_release(struct inode *inode, struct file *file)
 {
	struct userfaultfd_ctx *ctx = file->private_data;
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 0ed5dce..2f72069 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -82,6 +82,10 @@ extern void mremap_userfaultfd_prep(struct vm_area_struct *, struct vm_userfaultfd_ctx *);
 extern void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx, unsigned long from,
 	unsigned long to, unsigned long len);
 
+extern void madvise_userfault_dontneed(struct vm_area_struct *vma,
+   struct vm_area_struct **prev, unsigned long start,
+   unsigned long end);
+
 #else /* CONFIG_USERFAULTFD */
 
 /* mm helpers */
@@ -131,6 +135,12 @@ static inline void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx ctx,
unsigned long from, unsigned long to, unsigned long len)
 {
 }
+
+static inline void madvise_userfault_dontneed(struct vm_area_struct *vma,
+   struct vm_area_struct **prev, unsigned long start,
+   unsigned long end)
+{
+}
 #endif /* CONFIG_USERFAULTFD */
 
 #endif /* _LINUX_USERFAULTFD_K_H */
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 59a141c..d0d1ef1 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -14,7 +14,7 @@
  * After implementing the respective features it will become:
  * #define UFFD_API_FEATURES (UFFD_FEATURE_PAGEFAULT_FLAG_WP)
  */
-#define UFFD_API_FEATURES (UFFD_FEATURE_EVENT_FORK|UFFD_FEATURE_EVENT_REMAP)
+#define UFFD_API_FEATURES (UFFD_FEATURE_EVENT_FORK|UFFD_FEATURE_EVENT_REMAP|UFFD_FEATURE_EVENT_MADVDONTNEED)
 #define UFFD_API_IOCTLS				\
 	((__u64)1 << _UFFDIO_REGISTER |		\
 	 (__u64)1 << _UFFDIO_UNREGISTER |	\
@@ -79,6 +79,11 @@ struct uffd_msg {
} remap;
 
struct {
+   __u64   start;
+   __u64   end;
+   } madv_dn;
+
+   struct {
/* unused reserved fields */
__u64   reserved1;
__u64   reserved2;
@@ -93,6 +98,7 @@ struct uffd_msg {
 #define UFFD_EVENT_PAGEFAULT   0x12
 #define UFFD_EVENT_FORK0x13
 #define UFFD_EVENT_REMAP   0x14
+#define UFFD_EVENT_MADVDONTNEED0x15
 
 /* flags for UFFD_EVENT_PAGEFAULT */
 #define UFFD_PAGEFAULT_FLAG_WRITE  (10)  /* If this was a write fault */
@@ -116,6 +122,7 @@ struct uffdio_api {
 #endif
 #define UFFD_FEATURE_EVENT_FORK		(1<<1)
 #define UFFD_FEATURE_EVENT_REMAP	(1<<2)
+#define UFFD_FEATURE_EVENT_MADVDONTNEED	(1<<3)
__u64 features;
 
__u64 ioctls;
diff --git a/mm/madvise.c b/mm/madvise.c
index 10f62b7..3ea20e2 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -10,6 +10,7 @@
 #include <linux/syscalls.h>
 #include <linux/mempolicy.h>
 #include <linux/page-isolation.h>
+#include <linux/userfaultfd_k.h>
 #include <linux/hugetlb.h>
 #include <linux/falloc.h>
 #include <linux/sched.h>
@@ -283,6 +284,7 @@ static long madvise_dontneed(struct vm_area_struct *vma,
return -EINVAL;
 
zap_page_range(vma, start, end - start, NULL);
+   madvise_userfault_dontneed(vma, prev, start, end);
return 0;
 }
 
-- 
1.9.3



Re: [PATCH 2/3] uffd: Introduce the v2 API

2015-04-30 Thread Pavel Emelyanov
On 04/28/2015 12:12 AM, Andrea Arcangeli wrote:
 Hello,
 
 On Thu, Apr 23, 2015 at 09:29:07AM +0300, Pavel Emelyanov wrote:
 So your proposal is to always report 16 bytes per PF from read() and
 let userspace decide itself how to handle the result?
 
 Reading 16bytes for each userfault (instead of 8) and sharing the same
 read(2) protocol (UFFD_API) for both the cooperative and
 non-cooperative usages, is something I just suggested for
 consideration after reading your patchset.
 
 The pros of using a single protocol for both is that it would reduce
 amount of code and there would be just one file operation for the
 .read method. The cons is that it will waste 8 bytes per userfault in
 terms of memory footprint. The other major cons is that it would force
 us to define the format of the non cooperative protocol now despite it's
 not fully finished yet.
 
 I'm also ok with two protocols if nobody else objects, but if we use
 two protocols, we should at least use different file operation methods
 and use __always_inline with constants passed as parameter to optimize
 away the branches at build time. This way we get the reduced memory
 footprint in the read syscall without other runtime overhead
 associated with it.

OK. I would go with two protocols then and will reshuffle the code to
use two ops.

 +struct uffd_v2_msg {
 +  __u64   type;
 +  __u64   arg;
 +};
 +
 +#define UFFD_PAGEFAULT	0x1
 +
 +#define UFFD_PAGEFAULT_BIT	(1 << (UFFD_PAGEFAULT - 1))
 +#define __UFFD_API_V2_BITS	(UFFD_PAGEFAULT_BIT)
 +
 +/*
 + * Lower PAGE_SHIFT bits are used to report those supported
 + * by the pagefault message itself. Other bits are used to
 + * report the message types v2 API supports
 + */
 +#define UFFD_API_V2_BITS	(__UFFD_API_V2_BITS << 12)
 +

 And why exactly is this 12 hardcoded?

 Ah, it should have been the PAGE_SHIFT one, but I was unsure whether it
 would be OK to have different shifts in different arches.

 But taking into account your comment that bits field is bad for these
 values, if we introduce the new .features one for api message, then this
 12 will just go away.
 
 Ok.
 
 And which field should be masked
 with the bits? In the V1 protocol it was the "arg" (userfault address)
 not the "type". So this is a bit confusing and probably requires
 simplification.

 I see. Actually I decided that since bits higher than 12th (for x86) is
 always 0 in api message (no bits allowed there, since pfn sits in this
 place), it would be OK to put non-PF bits there.
 
 That was ok yes.
 
 Should I better introduce another .features field in uffd API message?
 
 What about renaming "uffdio_api.bits" to "uffdio_api.features"?

Yup, agreed, will do.

 And then we set uffdio_api.features to
 UFFD_FEATURE_WRITE|UFFD_FEATURE_WP|UFFD_FEATURE_FORK as needed.
 
 UFFD_FEATURE_WRITE would always be enabled, it's there only in case we
 want to disable it later (mostly if some arch has trouble with it,
 which is unlikely, but qemu doesn't need that bit of information at
 all for example so qemu would be fine if UFFD_FEATURE_WRITE
 disappears).
 
 UFFD_FEATURE_WP would signal also that the wrprotection feature (not
 implemented yet) is available (then later the register ioctl would
 also show the new wrprotection ioctl numbers available to mangle the
 wrprotection). The UFFD_FEATURE_WP feature in the cooperative usage
 (qemu live snapshotting) can use the UFFD_API first protocol too.
 
 UFFD_FEATURE_FORK would be returned if the UFFD_API_V2 was set in
 uffdio.api, and it would be part of the incremental non-cooperative
 patchset.
 
 We could also not define UFFD_FEATURE_FORK at all and imply that
 fork/mremap/MADV_DONTNEED are all available if UFFD_API_V2 uffdio_api
 ioctl succeeds... That's only doable if we keep two different read
 protocols though. UFFD_FEATURE_FORK (or UFFD_FEATURE_NON_COOPERATIVE)
 are really strictly needed only if we share the same read(2) protocol
 for both the cooperative and non-cooperative usages.
 
 The idea is that there's not huge benefit of only having the fork
 feature supported but missing mremap and madv_dontneed.
 
 In fact if a new syscall that works like mremap is added later (call
 it mremap2), we would need to fail the UFFDIO_API_V2 and require a
 UFFDIO_API_V3 for such kernel that can return a new mremap2 type of
 event. Userland couldn't just assume it is ok to use postcopy live
 migration for containers, because
 UFFD_FEATURE_FORK|MREMAP|MADV_DONTNEED are present in the
 uffdio.features when it asked for API_V2. There shall be something
 that tells userland "hey, there's a new mremap2 that the software
 inside the container can run on top of this kernel, so you are going
 to get a new mremap2 type of userfault event too".

But that's why I assumed to use per-syscall bits -- UFFD_FEATURE_FORK,
_MREMAP, _MWHATEVER so that userspace can read those bits and make sure
it contains only bits it understands with other bits set to zero.

If we had only one UFFD_API_NON_COOPERATIVE userspace

Re: [PATCH 0/3] UserfaultFD: Extension for non cooperative uffd usage

2015-04-23 Thread Pavel Emelyanov
On 04/21/2015 03:02 PM, Andrea Arcangeli wrote:
> Hi Pavel,
> 
> On Wed, Mar 18, 2015 at 10:34:26PM +0300, Pavel Emelyanov wrote:
>> Hi,
>>
>> On the recent LSF Andrea presented his userfault-fd patches and
>> I had shown some issues that appear in usage scenarios when the
>> monitor task and mm task do not cooperate to each other on VM
>> changes (and fork()-s).
>>
>> Here's the implementation of the extended uffd API that would help 
>> us to address those issues.
>>
>> As proof of concept I've implemented only fix for fork() case, but
>> I also plan to add the mremap() and exit() notifications, both are
>> also required for such non-cooperative usage.
>>
>> More details about the extension itself is in patch #2 and the fork()
>> notification description is in patch #3.
>>
>> Comments and suggestion are warmly welcome :)
> 
> This looks feasible.
> 
>> Andrea, what's the best way to go on with the patches -- would you
>> prefer to include them in your git tree or should I instead continue
>> with them on my own, re-sending them when required? Either way would
>> be OK for me.
> 
> Ok so various improvements happened in userfaultfd patchset over the
> last month so your incremental patchset likely requires a rebase
> sorry. When you posted it I was in the middle of the updates. Now
> things are working stable and I have no pending updates, so it would
> be a good time for a rebase.

OK, thanks for the heads up! I will rebase my patches.

> I can merge it if you like, it's your call if you prefer to maintain
> it incrementally or if I should merge it, but I wouldn't push it to
> Andrew for upstream integration in the first batch, as this
> complicates things further and it's not fully complete yet (at least
> the version posted only handled fork). I think it can be merged
> incrementally in a second stage.

Sure!

> The major updates of the userfaultfd patchset over the last month were:
> 
> 1) Various mixed fixes thanks to the feedback from Dave Hansen and
>David Gilbert.
> 
>The most notable one is the use of mm_users instead of mm_count to
>pin the mm to avoid crashes that assumed the vma still existed (in
>the userfaultfd_release method and in the various ioctl). exit_mmap
>doesn't even set mm->mmap to NULL, so unless I introduce a
>userfaultfd_exit to call in mmput, I have to pin the mm_users to be
>safe. This is mainly an issue for the non-cooperative usage you're
>implementing. Can you catch the exit somehow so you can close the
>fd? The memory won't be released until you do. Is this ok with you?
>I suppose you had to close the fd somehow anyway.
> 
> 2) userfaults are waken immediately even if they're not been "read"
>yet, this can lead to POLLIN false positives (so I only allow poll
>if the fd is open in nonblocking mode to be sure it won't hang). Is
>it too paranoid to return POLLERR if the fd is not open in
>nonblocking mode?
> 
>http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?h=userfault&id=f222d9de0a5302dc8ac62d6fab53a84251098751
> 
> 3) optimize read to return entries in O(1) and poll which was already
>O(1) becomes lockless. This required to split the waitqueue in two,
>one for pending faults and one for non pending faults, and the
>faults are refiled across the two waitqueues when they're
>read. Both waitqueues are protected by a single lock to be simpler
>and faster at runtime (the fault_pending_wqh one).
> 
>http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?h=userfault&id=9aa033ed43a1134c2223dac8c5d9e02e0100fca1
> 
> 4) Allocate the ctx with kmem_cache_alloc. This is going to collide a
>bit with your cleanup patch sorry.
> 
>http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?h=userfault&id=f5a8db16d2876eed8906a4d36f1d0e06ca5490f6
> 
> 5) Originally qemu had two bitflags for each page and kept 3 states
>(of the 4 possible with two bits) for each page in order to deal
>with the races that can happen if one thread is reading the
>userfaults and another thread is calling the UFFDIO_COPY ioctl in
>the background. This patch solves all races in the kernel so the
>two bits per page can be dropped from qemu codebase. I started
>documenting the races that can materialize by using 2 threads
>(instead of running the workload single threaded with a single poll
>event loop), and how userland had to solve them until I decided it
>was simpler to fix the race in the kernel by running an ad-hoc
>pagetable walk inside the wait_event()-kind-of-section. This
>simp

Re: [PATCH 2/3] uffd: Introduce the v2 API

2015-04-23 Thread Pavel Emelyanov
On 04/21/2015 03:18 PM, Andrea Arcangeli wrote:
> On Wed, Mar 18, 2015 at 10:35:17PM +0300, Pavel Emelyanov wrote:
>> +if (!(ctx->features & UFFD_FEATURE_LONGMSG)) {
> 
> If we are to use different protocols, it'd be nicer to have two
> different methods to assign to userfaultfd_fops.read that calls an
> __always_inline function, so that the above check can be optimized
> away at build time when the inline is expanded. So the branch is
> converted to calling a different pointer to function which is zero
> additional cost.

OK :)

>> +/* careful to always initialize addr if ret == 0 */
>> +__u64 uninitialized_var(addr);
>> +__u64 uninitialized_var(mtype);
>> +if (count < sizeof(addr))
>> +return ret ? ret : -EINVAL;
>> +_ret = userfaultfd_ctx_read(ctx, no_wait, &addr, &mtype);
>> +if (_ret < 0)
>> +return ret ? ret : _ret;
>> +BUG_ON(mtype != UFFD_PAGEFAULT);
>> +if (put_user(addr, (__u64 __user *) buf))
>> +return ret ? ret : -EFAULT;
>> +_ret = sizeof(addr);
>> +} else {
>> +struct uffd_v2_msg msg;
>> +if (count < sizeof(msg))
>> +return ret ? ret : -EINVAL;
>> +_ret = userfaultfd_ctx_read(ctx, no_wait, &msg.arg, &msg.type);
>> +if (_ret < 0)
>> +return ret ? ret : _ret;
>> +if (copy_to_user(buf, &msg, sizeof(msg)))
>> +return ret ? ret : -EINVAL;
>> +_ret = sizeof(msg);
> 
> Reading 16bytes instead of 8bytes for each fault, probably wouldn't
> move the needle much in terms of userfaultfd_read performance. Perhaps
> we could consider using the uffd_v2_msg unconditionally and then have
> a single protocol differentiated by the feature bits.

So your proposal is to always report 16 bytes per PF from read() and
let userspace decide itself how to handle the result?
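(As an aside, under that always-16-bytes scheme the monitor's parsing side stays trivial. A minimal userspace sketch, assuming only the message layout from this patch — the helper name and loop structure are illustrative, not part of any API:)

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Message layout from this patch; always 16 bytes per event. */
struct uffd_v2_msg {
	uint64_t type;	/* event kind, e.g. UFFD_PAGEFAULT */
	uint64_t arg;	/* event payload, e.g. faulting address */
};

#define UFFD_PAGEFAULT 0x1

/* Parse one fixed-size record out of a read() buffer; returns the
 * number of bytes consumed, or -1 if the buffer is too short. */
static int uffd_msg_parse(const char *buf, size_t len,
			  struct uffd_v2_msg *out)
{
	if (len < sizeof(*out))
		return -1;
	memcpy(out, buf, sizeof(*out));
	return (int)sizeof(*out);
}
```

Each read() would then return some multiple of 16 bytes and the monitor walks the buffer with such a helper, regardless of which event types it understands.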

> The only reason to have two different protocols would be to be able to
> read 8 bytes per userfault, in the cooperative usage (i.e. qemu
> postcopy). But if we do that we want to use the __always_inline trick
> to avoid branches and additional runtime costs (otherwise we may as
> well forget all microoptimizations and read 16bytes always).
> 
>> @@ -992,6 +1013,12 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx,
>>  /* careful not to leak info, we only read the first 8 bytes */
>>  uffdio_api.bits = UFFD_API_BITS;
>>  uffdio_api.ioctls = UFFD_API_IOCTLS;
>> +
>> +if (uffdio_api.api == UFFD_API_V2) {
>> +ctx->features |= UFFD_FEATURE_LONGMSG;
>> +uffdio_api.bits |= UFFD_API_V2_BITS;
>> +}
>> +
>>  ret = -EFAULT;
>>  if (copy_to_user(buf, &uffdio_api, sizeof(uffdio_api)))
>>  goto out;
> 
> The original meaning of the bits is:
> 
> If UFFD_BIT_WRITE was set in api.bits, it means the
> !!(address & UFFD_BIT_WRITE) tells if it was a write fault (missing or
> WP).
> 
> If UFFD_BIT_WP was set in api.bits, it means the
> !!(address & UFFD_BIT_WP) tells if it was a WP fault (if not set it
> means it was a missing fault).
> 
> Currently api.bits sets only UFFD_BIT_WRITE, and UFFD_BIT_WP will be
> set later, after the WP tracking mode will be implemented.
> 
> I'm uncertain how bits translated to features and if they should be
> unified or only have features.
> 
>> +struct uffd_v2_msg {
>> +__u64   type;
>> +__u64   arg;
>> +};
>> +
>> +#define UFFD_PAGEFAULT  0x1
>> +
>> +#define UFFD_PAGEFAULT_BIT  (1 << (UFFD_PAGEFAULT - 1))
>> +#define __UFFD_API_V2_BITS  (UFFD_PAGEFAULT_BIT)
>> +
>> +/*
>> + * Lower PAGE_SHIFT bits are used to report those supported
>> + * by the pagefault message itself. Other bits are used to
>> + * report the message types v2 API supports
>> + */
>> +#define UFFD_API_V2_BITS(__UFFD_API_V2_BITS << 12)
>> +
> 
> And why exactly is this 12 hardcoded?

Ah, it should have been the PAGE_SHIFT one, but I was unsure whether it
would be OK to have different shifts in different arches.

But taking into account your comment that the bits field is bad for these
values, if we introduce the new .features one for the api message, then
this 12 will just go away.
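(For reference, the packing being discussed works out as below — the patch's macros restated with the hardcoded x86 shift, so the claim that the message-type bits land above the always-zero page-offset bits of a fault address can be checked directly:)

```c
#include <assert.h>

#define UFFD_PAGEFAULT      0x1
#define UFFD_PAGEFAULT_BIT  (1 << (UFFD_PAGEFAULT - 1))
#define __UFFD_API_V2_BITS  (UFFD_PAGEFAULT_BIT)

/* Bits below PAGE_SHIFT (12 on x86) describe the pagefault message
 * itself; the message-type bits are shifted above them, into the part
 * of a page-aligned address that is always zero. */
#define UFFD_API_V2_BITS    (__UFFD_API_V2_BITS << 12)
```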

> And which field should be masked
> with the bits? In the V1 protocol it was the arg (userfault address)
> not the type. So this is a bit confusing and probably requires
> simplification.

I see. Actually I decided that since bits higher than the 12th (for x86)
are always 0 in the api message (no bits allowed there, since the pfn sits
in this place), it would be OK to put non-PF bits there.

Should I better introduce another .features field in the uffd API message?

-- Pavel

Re: [PATCH 0/3] UserfaultFD: Extension for non cooperative uffd usage

2015-04-23 Thread Pavel Emelyanov
On 04/21/2015 03:02 PM, Andrea Arcangeli wrote:
> Hi Pavel,
> 
> On Wed, Mar 18, 2015 at 10:34:26PM +0300, Pavel Emelyanov wrote:
>> Hi,
>>
>> On the recent LSF Andrea presented his userfault-fd patches and
>> I had shown some issues that appear in usage scenarios when the
>> monitor task and mm task do not cooperate to each other on VM
>> changes (and fork()-s).
>>
>> Here's the implementation of the extended uffd API that would help
>> us to address those issues.
>>
>> As proof of concept I've implemented only fix for fork() case, but
>> I also plan to add the mremap() and exit() notifications, both are
>> also required for such non-cooperative usage.
>>
>> More details about the extension itself is in patch #2 and the fork()
>> notification description is in patch #3.
>>
>> Comments and suggestion are warmly welcome :)
> 
> This looks feasible.
> 
>> Andrea, what's the best way to go on with the patches -- would you
>> prefer to include them in your git tree or should I instead continue
>> with them on my own, re-sending them when required? Either way would
>> be OK for me.
> 
> Ok so various improvements happened in userfaultfd patchset over the
> last month so your incremental patchset likely requires a rebase
> sorry. When you posted it I was in the middle of the updates. Now
> things are working stable and I have no pending updates, so it would
> be a good time for a rebase.

OK, thanks for the heads up! I will rebase my patches.

> I can merge it if you like, it's your call if you prefer to maintain
> it incrementally or if I should merge it, but I wouldn't push it to
> Andrew for upstream integration in the first batch, as this
> complicates things further and it's not fully complete yet (at least
> the version posted only handled fork). I think it can be merged
> incrementally in a second stage.

Sure!

> The major updates of the userfaultfd patchset over the last month were:
> 
> 1) Various mixed fixes thanks to the feedback from Dave Hansen and
>    David Gilbert.
> 
>    The most notable one is the use of mm_users instead of mm_count to
>    pin the mm to avoid crashes that assumed the vma still existed (in
>    the userfaultfd_release method and in the various ioctl). exit_mmap
>    doesn't even set mm->mmap to NULL, so unless I introduce a
>    userfaultfd_exit to call in mmput, I have to pin the mm_users to be
>    safe. This is mainly an issue for the non-cooperative usage you're
>    implementing. Can you catch the exit somehow so you can close the
>    fd? The memory won't be released until you do. Is this ok with you?
>    I suppose you had to close the fd somehow anyway.
> 
> 2) userfaults are waken immediately even if they're not been read
>    yet, this can lead to POLLIN false positives (so I only allow poll
>    if the fd is open in nonblocking mode to be sure it won't hang). Is
>    it too paranoid to return POLLERR if the fd is not open in
>    nonblocking mode?
> 
>    http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?h=userfault&id=f222d9de0a5302dc8ac62d6fab53a84251098751
> 
> 3) optimize read to return entries in O(1) and poll which was already
>    O(1) becomes lockless. This required to split the waitqueue in two,
>    one for pending faults and one for non pending faults, and the
>    faults are refiled across the two waitqueues when they're
>    read. Both waitqueues are protected by a single lock to be simpler
>    and faster at runtime (the fault_pending_wqh one).
> 
>    http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?h=userfault&id=9aa033ed43a1134c2223dac8c5d9e02e0100fca1
> 
> 4) Allocate the ctx with kmem_cache_alloc. This is going to collide a
>    bit with your cleanup patch sorry.
> 
>    http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?h=userfault&id=f5a8db16d2876eed8906a4d36f1d0e06ca5490f6
> 
> 5) Originally qemu had two bitflags for each page and kept 3 states
>    (of the 4 possible with two bits) for each page in order to deal
>    with the races that can happen if one thread is reading the
>    userfaults and another thread is calling the UFFDIO_COPY ioctl in
>    the background. This patch solves all races in the kernel so the
>    two bits per page can be dropped from qemu codebase. I started
>    documenting the races that can materialize by using 2 threads
>    (instead of running the workload single threaded with a single poll
>    event loop), and how userland had to solve them until I decided it
>    was simpler to fix the race in the kernel by running an ad-hoc
>    pagetable walk inside the wait_event()-kind-of-section. This
>    simplified qemu significantly and it doesn't make the kernel much
>    more complicated.
> 
>    I tried this before in much older versions but I use gup_fast for
>    it and it didn't work well with gup_fast for various reasons.
> 
>    http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?h=userfault&id=41efeae4e93f0296436f2a9fc6b28b6b0158512a
> 
>    After this patch the only reason to call UFFDIO_WAKE is to handle
>    the userfaults in batches in combination
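(A toy model of the two-waitqueue scheme from item 3 above — plain singly linked lists standing in for the kernel wait queues, and no locking; purely illustrative:)

```c
#include <assert.h>
#include <stddef.h>

struct fault { unsigned long addr; struct fault *next; };

struct ctx {
	struct fault *pending;		/* fault_pending_wqh analogue */
	struct fault *read_done;	/* fault_wqh analogue */
};

/* poll(): O(1) -- POLLIN iff the pending list is non-empty */
static int poll_ready(struct ctx *c)
{
	return c->pending != NULL;
}

/* read(): pop one pending fault and refile it onto the other list, O(1) */
static struct fault *read_one(struct ctx *c)
{
	struct fault *f = c->pending;

	if (!f)
		return NULL;
	c->pending = f->next;
	f->next = c->read_done;
	c->read_done = f;
	return f;
}
```

The point of the refile is that poll() never has to scan entries to find a pending one, and read() never has to skip already-reported faults.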


Re: [RESEND PATCH v3 1/2] mm: Introducing arch_remap hook

2015-04-13 Thread Pavel Emelyanov
>>> I initially thought it would be enough to put it into
>>> <asm-generic/mmu_context.h>, expecting it works as
>>> <asm-generic/pgtable.h>. But that's not the case.
>>>
>>> It probably worth at some point rework all <asm/mmu_context.h> to include
>>> <asm-generic/mmu_context.h> at the end as we do for <asm/pgtable.h>.
>>> But that's outside the scope of the patchset, I guess.
>>>
>>> I don't see any better candidate for such dummy header. :-/
>>
>> Clearly, I'm not comfortable with a rewrite of <asm/mmu_context.h> :(
>>
>> So what about this patch, is this v3 acceptable ?
> 
> Acked-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>

Other than the #ifdef thing, the same:

Acked-by: Pavel Emelyanov <xe...@parallels.com>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RESEND PATCH v3 1/2] mm: Introducing arch_remap hook

2015-04-13 Thread Pavel Emelyanov
On 04/13/2015 04:21 PM, Laurent Dufour wrote:
> On 13/04/2015 15:13, Kirill A. Shutemov wrote:
>> On Mon, Apr 13, 2015 at 02:41:22PM +0200, Laurent Dufour wrote:
>>> On 13/04/2015 13:58, Kirill A. Shutemov wrote:
 On Mon, Apr 13, 2015 at 11:56:27AM +0200, Laurent Dufour wrote:
> Some architecture would like to be triggered when a memory area is moved
> through the mremap system call.
>
> This patch is introducing a new arch_remap mm hook which is placed in the
> path of mremap, and is called before the old area is unmapped (and the
> arch_unmap hook is called).
>
> The architectures which need to call this hook should define
> __HAVE_ARCH_REMAP in their asm/mmu_context.h and provide the arch_remap
> service with the following prototype:
> void arch_remap(struct mm_struct *mm,
> unsigned long old_start, unsigned long old_end,
> unsigned long new_start, unsigned long new_end);
>
> Signed-off-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>
> Reviewed-by: Ingo Molnar <mi...@kernel.org>
> ---
>  mm/mremap.c | 19 +--
>  1 file changed, 13 insertions(+), 6 deletions(-)
>
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 2dc44b1cb1df..009db5565893 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -25,6 +25,7 @@
>  
>  #include <asm/cacheflush.h>
>  #include <asm/tlbflush.h>
> +#include <asm/mmu_context.h>
>  
>  #include "internal.h"
>  
> @@ -286,13 +287,19 @@ static unsigned long move_vma(struct vm_area_struct 
> *vma,
>   old_len = new_len;
>   old_addr = new_addr;
>   new_addr = -ENOMEM;
> - } else if (vma->vm_file && vma->vm_file->f_op->mremap) {
> - err = vma->vm_file->f_op->mremap(vma->vm_file, new_vma);
> - if (err < 0) {
> - move_page_tables(new_vma, new_addr, vma, old_addr,
> -  moved_len, true);
> - return err;
> + } else {
> + if (vma->vm_file && vma->vm_file->f_op->mremap) {
> + err = vma->vm_file->f_op->mremap(vma->vm_file, new_vma);
> + if (err < 0) {
> + move_page_tables(new_vma, new_addr, vma,
> +   old_addr, moved_len, true);
> + return err;
> + }
>   }
> +#ifdef __HAVE_ARCH_REMAP

 It would be cleaner to provide dummy arch_remap() for !__HAVE_ARCH_REMAP
 in some generic header.
>>>
>>> The idea was to not impact all the architectures as arch_unmap(),
>>> arch_dup_mmap() or arch_exit_mmap() implies.
>>>
>>> I look at the headers where such a dummy arch_remap could be put but I
>>> can't figure out one which will not impact all the architecture.
>>> What about defining a dummy service earlier in mm/remap.c in the case
>>> __HAVE_ARCH_REMAP is not defined ?
>>> Something like :
>>> #ifndef __HAVE_ARCH_REMAP
>>> static inline void arch_remap(struct mm_struct *mm,
>>>unsigned long old_start,
>>>unsigned long old_end,
>>>unsigned long new_start,
>>>unsigned long new_end)
>>> {
>>> }
>>> #endif
>>
>> Or just #define arch_remap(...) do { } while (0)
>>
> 
> I guessed you wanted the arch_remap() prototype to be exposed somewhere
> in the code.
> 
> To be honest, I can't find the benefit of defining a dummy arch_remap()
> in mm/remap.c if __HAVE_ARCH_REMAP is not defined instead of calling it
> in move_vma if __HAVE_ARCH_REMAP is defined.
> Is it really what you want ?

I think Kirill meant something like e.g. the arch_enter_lazy_mmu_mode()
is implemented and called in mm/mremap.c -- the "generic" part is in the
include/asm-generic/pgtable.h and those architectures willing to have
their own implementation are in arch/$arch/...

Kirill, if I'm right with it, can you suggest the header where to put
the "generic" mremap hook's (empty) body?
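For comparison, the arch_enter_lazy_mmu_mode() pattern referred to above boils down to this shape: a generic header supplies an empty inline unless the architecture opts in. A standalone sketch (struct mm_struct stubbed out; names taken from the patch, the header placement itself being the open question):

```c
#include <assert.h>

struct mm_struct;	/* opaque here; <linux/mm_types.h> in-tree */

/* An arch that wants the hook defines __HAVE_ARCH_REMAP in its
 * asm/mmu_context.h and provides its own arch_remap(); everyone
 * else gets this empty inline from the generic header, so the
 * caller in move_vma() needs no #ifdef. */
#ifndef __HAVE_ARCH_REMAP
static inline void arch_remap(struct mm_struct *mm,
			      unsigned long old_start, unsigned long old_end,
			      unsigned long new_start, unsigned long new_end)
{
	(void)mm; (void)old_start; (void)old_end;
	(void)new_start; (void)new_end;
}
#endif
```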

-- Pavel



Re: [PATCH v3 1/2] x86_64,signal: Fix SS handling for signals delivered to 64-bit programs

2015-03-18 Thread Pavel Emelyanov
On 03/19/2015 12:26 AM, Andy Lutomirski wrote:
> On Wed, Mar 18, 2015 at 1:02 PM, Oleg Nesterov  wrote:
>> On 03/18, Andrey Wagin wrote:
>>>
>>> This patch fixes the problem. Oleg, could you send this path in the
>>> criu maillist?
>>
>> Sure, will do.
> 
> We still haven't answered one question: what's the kernel's position
> on ABI stability wrt CRIU?  We clearly shouldn't make changes that
> break the principle of CRIU, but CRIU encodes so many tricky
> assumptions about the inner workings of the kernel that it's really
> tough to avoid breaking old CRIU versions.

Well, we try hard to use only documented kernel API-s. Isn't the sigframe
considered to be some sort of "stable API"? I mean -- it's visible by the
userspace, nobody prevents glibc or gdb from messing with this stuff just
by reading it from memory.

If it's "parse-able" e.g. like VDSO is, but we don't do it in CRIU -- then
it's definitely a CRIU BUG to be fixed.

> So... do we introduce somewhat nasty code into the kernel to keep old
> CRIU versions working, or do we require that users who want to restore
> onto new kernels use new CRIU?

It's OK (I think) to require newer versions of CRIU, it's easy to update
one unlike the kernel ;)

But if "old" version of CRIU just crash the restored processes on "new"
kernels and there's no way to detect this properly -- that's the problem.

> (It seems clear to me that CRIU should apply the patch regardless of
> what the kernel does.  It will enable CRIU to work on the same class
> of programs that are fixed by the kernel change that started this
> thread.)
> 
> --Andy
> .
> 

Thanks,
Pavel



[PATCH 3/3] uffd: Introduce fork() notification

2015-03-18 Thread Pavel Emelyanov
As described in previous e-mail, we need to get informed when
the task, whose mm is monitored with uffd, calls fork().

The fork notification is the new uffd with the same regions
and flags as those on parent. When read()-ing from uffd the
monitor task would receive the new uffd's descriptor number
and will be able to start reading events from the new task.

The fork() of mm with uffd attached doesn't finish until the 
monitor "acks" the message by reading the new uffd descriptor.

Signed-off-by: Pavel Emelyanov <xe...@parallels.com>
---
 fs/userfaultfd.c | 175 ++-
 include/linux/userfaultfd_k.h|  12 +++
 include/uapi/linux/userfaultfd.h |   4 +-
 kernel/fork.c|   9 +-
 4 files changed, 194 insertions(+), 6 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index bd629b4..265f031 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -12,6 +12,7 @@
  *  mm/ksm.c (mm hashing).
  */
 
+#include 
 #include 
 #include 
 #include 
@@ -37,6 +38,8 @@ struct userfaultfd_ctx {
atomic_t refcount;
/* waitqueue head for the userfaultfd page faults */
wait_queue_head_t fault_wqh;
+   /* waitqueue head for fork-s */
+   wait_queue_head_t fork_wqh;
/* waitqueue head for the pseudo fd to wakeup poll/read */
wait_queue_head_t fd_wqh;
/* userfaultfd syscall flags */
@@ -51,10 +54,21 @@ struct userfaultfd_ctx {
struct mm_struct *mm;
 };
 
+struct userfaultfd_fork_ctx {
+   struct userfaultfd_ctx *orig;
+   struct userfaultfd_ctx *new;
+   struct list_head list;
+};
+
 #define UFFD_FEATURE_LONGMSG   0x1
+#define UFFD_FEATURE_FORK  0x2
 
 struct userfaultfd_wait_queue {
-   unsigned long address;
+   union {
+   unsigned long address;
+   struct userfaultfd_ctx *nctx;
+   int fd;
+   };
wait_queue_t wq;
bool pending;
struct userfaultfd_ctx *ctx;
@@ -75,6 +89,7 @@ static struct userfaultfd_ctx *userfaultfd_ctx_alloc(void)
if (ctx) {
atomic_set(&ctx->refcount, 1);
init_waitqueue_head(&ctx->fault_wqh);
+   init_waitqueue_head(&ctx->fork_wqh);
init_waitqueue_head(&ctx->fd_wqh);
ctx->released = false;
}
@@ -270,6 +285,111 @@ int handle_userfault(struct vm_area_struct *vma, unsigned 
long address,
return VM_FAULT_RETRY;
 }
 
+int dup_userfaultfd(struct vm_area_struct *vma, struct list_head *fcs)
+{
+   struct userfaultfd_ctx *ctx = NULL, *octx;
+   struct userfaultfd_fork_ctx *fctx;
+
+   octx = vma->vm_userfaultfd_ctx.ctx;
+   if (!octx || !(octx->features & UFFD_FEATURE_FORK)) {
+   vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
+   vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING);
+   return 0;
+   }
+
+   list_for_each_entry(fctx, fcs, list)
+   if (fctx->orig == octx) {
+   ctx = fctx->new;
+   break;
+   }
+
+   if (!ctx) {
+   fctx = kmalloc(sizeof(*fctx), GFP_KERNEL);
+   if (!fctx)
+   return -ENOMEM;
+
+   ctx = userfaultfd_ctx_alloc();
+   if (!ctx) {
+   kfree(fctx);
+   return -ENOMEM;
+   }
+
+   ctx->flags = octx->flags;
+   ctx->state = UFFD_STATE_RUNNING;
+   ctx->features = UFFD_FEATURE_FORK | UFFD_FEATURE_LONGMSG;
+   ctx->mm = vma->vm_mm;
+   atomic_inc(&ctx->mm->mm_count);
+
+   userfaultfd_ctx_get(octx);
+   fctx->orig = octx;
+   fctx->new = ctx;
+   list_add_tail(&fctx->list, fcs);
+   }
+
+   vma->vm_userfaultfd_ctx.ctx = ctx;
+   return 0;
+}
+
+static int dup_fctx(struct userfaultfd_fork_ctx *fctx)
+{
+   int ret = 0;
+   struct userfaultfd_ctx *ctx = fctx->orig;
+   struct userfaultfd_wait_queue uwq;
+
+   init_waitqueue_entry(&uwq.wq, current);
+   uwq.pending = true;
+   uwq.ctx = ctx;
+   uwq.nctx = fctx->new;
+
+   spin_lock(&ctx->fork_wqh.lock);
+   /*
+* After the __add_wait_queue the uwq is visible to userland
+* through poll/read().
+*/
+   __add_wait_queue(&ctx->fork_wqh, &uwq.wq);
+   for (;;) {
+   set_current_state(TASK_KILLABLE);
+   if (!uwq.pending)
+   break;
+   if (ACCESS_ONCE(ctx->released) ||
+   fatal_signal_pending(current)) {
+   ret = -1;
+   break;
+   }
+
+   spin_unlock(&ctx->fork_wqh.lock);
+
+   wake_up_poll(&ctx->fd_wqh, POLLIN);
+   schedule();
+
+   spin_lock(&ctx->fork_wqh.lock);
+   }
+   

[PATCH 2/3] uffd: Introduce the v2 API

2015-03-18 Thread Pavel Emelyanov
The new API will report more than just the page-faults. The
reason for this is -- when the task whose mm we monitor with 
uffd and the monitor task itself cannot cooperate with each
other, the former one can screw things up. Like this.

If the task fork()-s, the child process is detached from uffd and
thus all not-yet-faulted-in memory gets mapped with zero-pages
on touch.

Another example is mremap(). When the victim remaps the uffd-ed
region and starts touching it the monitor would receive fault
messages with addresses that were not register-ed with uffd
ioctl before it. Thus monitor will have no idea how to handle
those faults.

To address both we can send more events to the monitor. In
particular, on fork() we can create another uffd context,
register the same set of regions in it and "send" the descriptor
to monitor.

For mremap() we can send the message describing what change
has been performed.

So this patch prepares the ground for the feature described above
by introducing the v2 API of uffd. With the new API the kernel would
respond with a message containing the event type (pagefault,
fork or remap) and argument (fault address, new uffd descriptor
or region change respectively).
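(To illustrate the intent: with typed messages a monitor can dispatch on msg.type in one read loop. The UFFD_FORK/UFFD_REMAP values below are hypothetical placeholders — this patch only defines UFFD_PAGEFAULT:)

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct uffd_v2_msg { uint64_t type; uint64_t arg; };

#define UFFD_PAGEFAULT	0x1
#define UFFD_FORK	0x2	/* hypothetical: arg = new uffd descriptor */
#define UFFD_REMAP	0x3	/* hypothetical: arg describes the move */

/* Classify one event; a real monitor would branch to its handlers
 * here instead of returning a tag string. */
static const char *uffd_event_name(const struct uffd_v2_msg *msg)
{
	switch (msg->type) {
	case UFFD_PAGEFAULT:	return "pagefault";	/* arg = address */
	case UFFD_FORK:		return "fork";		/* arg = new fd  */
	case UFFD_REMAP:	return "remap";
	default:		return "unknown";
	}
}
```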

Signed-off-by: Pavel Emelyanov <xe...@parallels.com>
---
 fs/userfaultfd.c | 56 ++--
 include/uapi/linux/userfaultfd.h | 21 ++-
 2 files changed, 62 insertions(+), 15 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 6c9a2d6..bd629b4 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -41,6 +41,8 @@ struct userfaultfd_ctx {
wait_queue_head_t fd_wqh;
/* userfaultfd syscall flags */
unsigned int flags;
+   /* features flags */
+   unsigned int features;
/* state machine */
enum userfaultfd_state state;
/* released */
@@ -49,6 +51,8 @@ struct userfaultfd_ctx {
struct mm_struct *mm;
 };
 
+#define UFFD_FEATURE_LONGMSG   0x1
+
 struct userfaultfd_wait_queue {
unsigned long address;
wait_queue_t wq;
@@ -369,7 +373,7 @@ static unsigned int userfaultfd_poll(struct file *file, 
poll_table *wait)
 }
 
 static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
-   __u64 *addr)
+   __u64 *mtype, __u64 *addr)
 {
ssize_t ret;
DECLARE_WAITQUEUE(wait, current);
@@ -383,6 +387,7 @@ static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx 
*ctx, int no_wait,
if (find_userfault(ctx, &uwq)) {
uwq->pending = false;
/* careful to always initialize addr if ret == 0 */
+   *mtype = UFFD_PAGEFAULT;
*addr = uwq->address;
ret = 0;
break;
@@ -411,8 +416,6 @@ static ssize_t userfaultfd_read(struct file *file, char 
__user *buf,
 {
struct userfaultfd_ctx *ctx = file->private_data;
ssize_t _ret, ret = 0;
-   /* careful to always initialize addr if ret == 0 */
-   __u64 uninitialized_var(addr);
int no_wait = file->f_flags & O_NONBLOCK;
 
if (ctx->state == UFFD_STATE_WAIT_API)
@@ -420,16 +423,34 @@ static ssize_t userfaultfd_read(struct file *file, char 
__user *buf,
BUG_ON(ctx->state != UFFD_STATE_RUNNING);
 
for (;;) {
-   if (count < sizeof(addr))
-   return ret ? ret : -EINVAL;
-   _ret = userfaultfd_ctx_read(ctx, no_wait, &addr);
-   if (_ret < 0)
-   return ret ? ret : _ret;
-   if (put_user(addr, (__u64 __user *) buf))
-   return ret ? ret : -EFAULT;
-   ret += sizeof(addr);
-   buf += sizeof(addr);
-   count -= sizeof(addr);
+   if (!(ctx->features & UFFD_FEATURE_LONGMSG)) {
+   /* careful to always initialize addr if ret == 0 */
+   __u64 uninitialized_var(addr);
+   __u64 uninitialized_var(mtype);
+   if (count < sizeof(addr))
+   return ret ? ret : -EINVAL;
+   _ret = userfaultfd_ctx_read(ctx, no_wait, &mtype,
&addr);
+   if (_ret < 0)
+   return ret ? ret : _ret;
+   BUG_ON(mtype != UFFD_PAGEFAULT);
+   if (put_user(addr, (__u64 __user *) buf))
+   return ret ? ret : -EFAULT;
+   _ret = sizeof(addr);
+   } else {
+   struct uffd_v2_msg msg;
+   if (count < sizeof(msg))
+   return ret ? ret : -EINVAL;
+   _ret = userfaultfd_ctx_read(ctx, no_wait, &msg.type,
&msg.arg);
+   if (_ret < 0)
+   return ret ? ret : _ret;
+

[PATCH 0/3] UserfaultFD: Extension for non cooperative uffd usage

2015-03-18 Thread Pavel Emelyanov
Hi,

On the recent LSF Andrea presented his userfault-fd patches and
I had shown some issues that appear in usage scenarios when the
monitor task and mm task do not cooperate to each other on VM
changes (and fork()-s).

Here's the implementation of the extended uffd API that would help 
us to address those issues.

As proof of concept I've implemented only fix for fork() case, but
I also plan to add the mremap() and exit() notifications, both are
also required for such non-cooperative usage.

More details about the extension itself is in patch #2 and the fork()
notification description is in patch #3.

Comments and suggestion are warmly welcome :)


Andrea, what's the best way to go on with the patches -- would you
prefer to include them in your git tree or should I instead continue
with them on my own, re-sending them when required? Either way would
be OK for me.

Thanks,
Pavel
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/3] uffd: Tossing bits around

2015-03-18 Thread Pavel Emelyanov
Reformat the existing code a bit to make it easier for
further patching.

Signed-off-by: Pavel Emelyanov 
---
 fs/userfaultfd.c | 37 -
 1 file changed, 28 insertions(+), 9 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index b4c7f25..6c9a2d6 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -61,6 +61,23 @@ struct userfaultfd_wake_range {
unsigned long len;
 };
 
+static const struct file_operations userfaultfd_fops;
+
+static struct userfaultfd_ctx *userfaultfd_ctx_alloc(void)
+{
+   struct userfaultfd_ctx *ctx;
+
+   ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
+   if (ctx) {
+   atomic_set(&ctx->refcount, 1);
+   init_waitqueue_head(&ctx->fault_wqh);
+   init_waitqueue_head(&ctx->fd_wqh);
+   ctx->released = false;
+   }
+
+   return ctx;
+}
+
 static int userfaultfd_wake_function(wait_queue_t *wq, unsigned mode,
 int wake_flags, void *key)
 {
@@ -307,15 +324,15 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
return 0;
 }
 
-static inline unsigned int find_userfault(struct userfaultfd_ctx *ctx,
+static inline unsigned int do_find_userfault(wait_queue_head_t *wqh,
  struct userfaultfd_wait_queue **uwq)
 {
wait_queue_t *wq;
struct userfaultfd_wait_queue *_uwq;
unsigned int ret = 0;
 
-   spin_lock(&ctx->fault_wqh.lock);
-   list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) {
+   spin_lock(&wqh->lock);
+   list_for_each_entry(wq, &wqh->task_list, task_list) {
_uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
if (_uwq->pending) {
ret = POLLIN;
@@ -324,11 +341,17 @@ static inline unsigned int find_userfault(struct userfaultfd_ctx *ctx,
break;
}
}
-   spin_unlock(&ctx->fault_wqh.lock);
+   spin_unlock(&wqh->lock);
 
return ret;
 }
 
+static inline unsigned int find_userfault(struct userfaultfd_ctx *ctx,
+   struct userfaultfd_wait_queue **uwq)
+{
+   return do_find_userfault(&ctx->fault_wqh, uwq);
+}
+
 static unsigned int userfaultfd_poll(struct file *file, poll_table *wait)
 {
struct userfaultfd_ctx *ctx = file->private_data;
@@ -1080,16 +1103,12 @@ static struct file *userfaultfd_file_create(int flags)
goto out;
 
file = ERR_PTR(-ENOMEM);
-   ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
+   ctx = userfaultfd_ctx_alloc();
if (!ctx)
goto out;
 
-   atomic_set(&ctx->refcount, 1);
-   init_waitqueue_head(&ctx->fault_wqh);
-   init_waitqueue_head(&ctx->fd_wqh);
ctx->flags = flags;
ctx->state = UFFD_STATE_WAIT_API;
-   ctx->released = false;
ctx->mm = current->mm;
/* prevent the mm struct to be freed */
atomic_inc(&ctx->mm->mm_count);
-- 
1.8.4.2





[PATCH 3/3] uffd: Introduce fork() notification

2015-03-18 Thread Pavel Emelyanov
As described in the previous e-mail, we need to get informed when
the task, whose mm is monitored with uffd, calls fork().

The fork notification is a new uffd with the same regions
and flags as those of the parent. When read()-ing from uffd the
monitor task receives the new uffd's descriptor number
and can start reading events from the new task.

The fork() of mm with uffd attached doesn't finish until the 
monitor acks the message by reading the new uffd descriptor.
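
The handshake described above can be sketched from the monitor's side. This is an illustrative userspace sketch, not the posted kernel code: the (type, argument) message layout follows patch #2 as posted, the event constants are made up for illustration, and a pipe stands in for the userfault fd so the read loop is demonstrable.

```c
#include <assert.h>
#include <stdint.h>
#include <unistd.h>

/* Assumed v2 message layout from patch #2; values are illustrative. */
struct uffd_v2_msg {
	uint64_t type;	/* event kind */
	uint64_t arg;	/* fault address, or new uffd descriptor */
};
enum { UFFD_PAGEFAULT = 1, UFFD_FORK = 2 };

/* Read exactly one 16-byte event from the (userfault) fd. */
static int read_one_event(int ufd, struct uffd_v2_msg *msg)
{
	ssize_t n = read(ufd, msg, sizeof(*msg));
	return n == (ssize_t)sizeof(*msg) ? 0 : -1;
}

/* Demo harness: a pipe plays the kernel side of the fd, delivering a
 * fork event whose argument (7 here) is the new descriptor number.
 * Returns that number, or -1 on error. */
static long demo_fork_event(void)
{
	int pipefd[2];
	struct uffd_v2_msg sent = { .type = UFFD_FORK, .arg = 7 }, got;

	if (pipe(pipefd))
		return -1;
	if (write(pipefd[1], &sent, sizeof(sent)) != (ssize_t)sizeof(sent))
		return -1;
	if (read_one_event(pipefd[0], &got) || got.type != UFFD_FORK)
		return -1;
	/* In the real flow, read()-ing events from got.arg's descriptor
	 * is what acknowledges the fork and unblocks the forking task. */
	return (long)got.arg;
}
```

The blocking described above is the kernel-side dual of this loop: the forking task sleeps until the monitor's read() marks the queued event as no longer pending.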

Signed-off-by: Pavel Emelyanov xe...@parallels.com
---
 fs/userfaultfd.c | 175 ++-
 include/linux/userfaultfd_k.h|  12 +++
 include/uapi/linux/userfaultfd.h |   4 +-
 kernel/fork.c|   9 +-
 4 files changed, 194 insertions(+), 6 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index bd629b4..265f031 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -12,6 +12,7 @@
  *  mm/ksm.c (mm hashing).
  */
 
+#include <linux/list.h>
 #include <linux/hashtable.h>
 #include <linux/sched.h>
 #include <linux/mm.h>
@@ -37,6 +38,8 @@ struct userfaultfd_ctx {
atomic_t refcount;
/* waitqueue head for the userfaultfd page faults */
wait_queue_head_t fault_wqh;
+   /* waitqueue head for fork-s */
+   wait_queue_head_t fork_wqh;
/* waitqueue head for the pseudo fd to wakeup poll/read */
wait_queue_head_t fd_wqh;
/* userfaultfd syscall flags */
@@ -51,10 +54,21 @@ struct userfaultfd_ctx {
struct mm_struct *mm;
 };
 
+struct userfaultfd_fork_ctx {
+   struct userfaultfd_ctx *orig;
+   struct userfaultfd_ctx *new;
+   struct list_head list;
+};
+
 #define UFFD_FEATURE_LONGMSG   0x1
+#define UFFD_FEATURE_FORK  0x2
 
 struct userfaultfd_wait_queue {
-   unsigned long address;
+   union {
+   unsigned long address;
+   struct userfaultfd_ctx *nctx;
+   int fd;
+   };
wait_queue_t wq;
bool pending;
struct userfaultfd_ctx *ctx;
@@ -75,6 +89,7 @@ static struct userfaultfd_ctx *userfaultfd_ctx_alloc(void)
if (ctx) {
atomic_set(&ctx->refcount, 1);
init_waitqueue_head(&ctx->fault_wqh);
+   init_waitqueue_head(&ctx->fork_wqh);
init_waitqueue_head(&ctx->fd_wqh);
ctx->released = false;
}
@@ -270,6 +285,111 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
return VM_FAULT_RETRY;
 }
 
+int dup_userfaultfd(struct vm_area_struct *vma, struct list_head *fcs)
+{
+   struct userfaultfd_ctx *ctx = NULL, *octx;
+   struct userfaultfd_fork_ctx *fctx;
+
+   octx = vma->vm_userfaultfd_ctx.ctx;
+   if (!octx || !(octx->features & UFFD_FEATURE_FORK)) {
+   vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
+   vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING);
+   return 0;
+   }
+
+   list_for_each_entry(fctx, fcs, list)
+   if (fctx->orig == octx) {
+   ctx = fctx->new;
+   break;
+   }
+
+   if (!ctx) {
+   fctx = kmalloc(sizeof(*fctx), GFP_KERNEL);
+   if (!fctx)
+   return -ENOMEM;
+
+   ctx = userfaultfd_ctx_alloc();
+   if (!ctx) {
+   kfree(fctx);
+   return -ENOMEM;
+   }
+
+   ctx->flags = octx->flags;
+   ctx->state = UFFD_STATE_RUNNING;
+   ctx->features = UFFD_FEATURE_FORK | UFFD_FEATURE_LONGMSG;
+   ctx->mm = vma->vm_mm;
+   atomic_inc(&ctx->mm->mm_count);
+
+   userfaultfd_ctx_get(octx);
+   fctx->orig = octx;
+   fctx->new = ctx;
+   list_add_tail(&fctx->list, fcs);
+   }
+
+   vma->vm_userfaultfd_ctx.ctx = ctx;
+   return 0;
+}
+
+static int dup_fctx(struct userfaultfd_fork_ctx *fctx)
+{
+   int ret = 0;
+   struct userfaultfd_ctx *ctx = fctx->orig;
+   struct userfaultfd_wait_queue uwq;
+
+   init_waitqueue_entry(&uwq.wq, current);
+   uwq.pending = true;
+   uwq.ctx = ctx;
+   uwq.nctx = fctx->new;
+
+   spin_lock(&ctx->fork_wqh.lock);
+   /*
+* After the __add_wait_queue the uwq is visible to userland
+* through poll/read().
+*/
+   __add_wait_queue(&ctx->fork_wqh, &uwq.wq);
+   for (;;) {
+   set_current_state(TASK_KILLABLE);
+   if (!uwq.pending)
+   break;
+   if (ACCESS_ONCE(ctx->released) ||
+   fatal_signal_pending(current)) {
+   ret = -1;
+   break;
+   }
+
+   spin_unlock(&ctx->fork_wqh.lock);
+
+   wake_up_poll(&ctx->fd_wqh, POLLIN);
+   schedule();
+
+   spin_lock(&ctx->fork_wqh.lock);
+   }
+   __remove_wait_queue(&ctx->fork_wqh

[PATCH 2/3] uffd: Introduce the v2 API

2015-03-18 Thread Pavel Emelyanov
The new API will report more than just the page-faults. The
reason for this is -- when the task whose mm we monitor with 
uffd and the monitor task itself cannot cooperate with each
other, the former one can screw things up. Like this.

If the task fork()-s, the child process is detached from uffd and
thus all not-yet-faulted-in memory gets mapped with zero-pages
on touch.

Another example is mremap(). When the victim remaps the uffd-ed
region and starts touching it, the monitor will receive fault
messages with addresses that were not register-ed with the uffd
ioctl before. Thus the monitor will have no idea how to handle
those faults.

To address both we can send more events to the monitor. In
particular, on fork() we can create another uffd context,
register the same set of regions in it and send the descriptor
to monitor.

For mremap() we can send the message describing what change
has been performed.

So this patch prepares the ground for the feature described above
by introducing the v2 API of uffd. With the new API the kernel
responds with a message containing the event type (pagefault,
fork or remap) and an argument (fault address, new uffd descriptor
or region change respectively).
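
The difference between the two reply formats can be made concrete with a small userspace parser. This is a sketch under the assumption, taken from this patch, that a v1 reply is a bare __u64 fault address while a v2 reply is a (type, argument) pair mirroring struct uffd_v2_msg as posted; it is not the kernel code itself.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed v2 layout from this patch: event type plus one argument. */
struct uffd_v2_msg {
	uint64_t type;	/* pagefault, fork or remap */
	uint64_t arg;	/* fault address, new fd, or region change */
};

/* Parse one event out of 'buf'. In v1 mode the 8-byte payload is a
 * fault address; in v2 mode the 16-byte payload carries the type too.
 * Returns bytes consumed, or -1 if 'len' is short (the -EINVAL case
 * in userfaultfd_read above). */
static long parse_event(const void *buf, size_t len, int v2,
			struct uffd_v2_msg *out)
{
	if (!v2) {
		if (len < sizeof(uint64_t))
			return -1;
		out->type = 0;	/* v1 only ever reports pagefaults */
		memcpy(&out->arg, buf, sizeof(uint64_t));
		return sizeof(uint64_t);
	}
	if (len < sizeof(*out))
		return -1;
	memcpy(out, buf, sizeof(*out));
	return sizeof(*out);
}
```

This is also why the kernel must check the caller's buffer size against the negotiated message size before each read, exactly as the loop in the diff below does.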

Signed-off-by: Pavel Emelyanov xe...@parallels.com
---
 fs/userfaultfd.c | 56 ++--
 include/uapi/linux/userfaultfd.h | 21 ++-
 2 files changed, 62 insertions(+), 15 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 6c9a2d6..bd629b4 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -41,6 +41,8 @@ struct userfaultfd_ctx {
wait_queue_head_t fd_wqh;
/* userfaultfd syscall flags */
unsigned int flags;
+   /* features flags */
+   unsigned int features;
/* state machine */
enum userfaultfd_state state;
/* released */
@@ -49,6 +51,8 @@ struct userfaultfd_ctx {
struct mm_struct *mm;
 };
 
+#define UFFD_FEATURE_LONGMSG   0x1
+
 struct userfaultfd_wait_queue {
unsigned long address;
wait_queue_t wq;
@@ -369,7 +373,7 @@ static unsigned int userfaultfd_poll(struct file *file, poll_table *wait)
 }
 
 static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
-   __u64 *addr)
+   __u64 *mtype, __u64 *addr)
 {
ssize_t ret;
DECLARE_WAITQUEUE(wait, current);
@@ -383,6 +387,7 @@ static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
if (find_userfault(ctx, &uwq)) {
uwq->pending = false;
/* careful to always initialize addr if ret == 0 */
+   *mtype = UFFD_PAGEFAULT;
*addr = uwq->address;
ret = 0;
break;
@@ -411,8 +416,6 @@ static ssize_t userfaultfd_read(struct file *file, char __user *buf,
 {
struct userfaultfd_ctx *ctx = file->private_data;
ssize_t _ret, ret = 0;
-   /* careful to always initialize addr if ret == 0 */
-   __u64 uninitialized_var(addr);
int no_wait = file->f_flags & O_NONBLOCK;

if (ctx->state == UFFD_STATE_WAIT_API)
@@ -420,16 +423,34 @@ static ssize_t userfaultfd_read(struct file *file, char __user *buf,
BUG_ON(ctx->state != UFFD_STATE_RUNNING);
 
for (;;) {
-   if (count < sizeof(addr))
-   return ret ? ret : -EINVAL;
-   _ret = userfaultfd_ctx_read(ctx, no_wait, &addr);
-   if (_ret < 0)
-   return ret ? ret : _ret;
-   if (put_user(addr, (__u64 __user *) buf))
-   return ret ? ret : -EFAULT;
-   ret += sizeof(addr);
-   buf += sizeof(addr);
-   count -= sizeof(addr);
+   if (!(ctx->features & UFFD_FEATURE_LONGMSG)) {
+   /* careful to always initialize addr if ret == 0 */
+   __u64 uninitialized_var(addr);
+   __u64 uninitialized_var(mtype);
+   if (count < sizeof(addr))
+   return ret ? ret : -EINVAL;
+   _ret = userfaultfd_ctx_read(ctx, no_wait, &mtype, &addr);
+   if (_ret < 0)
+   return ret ? ret : _ret;
+   BUG_ON(mtype != UFFD_PAGEFAULT);
+   if (put_user(addr, (__u64 __user *) buf))
+   return ret ? ret : -EFAULT;
+   _ret = sizeof(addr);
+   } else {
+   struct uffd_v2_msg msg;
+   if (count < sizeof(msg))
+   return ret ? ret : -EINVAL;
+   _ret = userfaultfd_ctx_read(ctx, no_wait, &msg.type, &msg.arg);
+   if (_ret < 0)
+   return ret ? ret : _ret



Re: [PATCH v3 1/2] x86_64,signal: Fix SS handling for signals delivered to 64-bit programs

2015-03-18 Thread Pavel Emelyanov
On 03/19/2015 12:26 AM, Andy Lutomirski wrote:
> On Wed, Mar 18, 2015 at 1:02 PM, Oleg Nesterov o...@redhat.com wrote:
>> On 03/18, Andrey Wagin wrote:
>>
>>> This patch fixes the problem. Oleg, could you send this patch in the
>>> criu maillist?
>>
>> Sure, will do.
>
> We still haven't answered one question: what's the kernel's position
> on ABI stability wrt CRIU?  We clearly shouldn't make changes that
> break the principle of CRIU, but CRIU encodes so many tricky
> assumptions about the inner workings of the kernel that it's really
> tough to avoid breaking old CRIU versions.

Well, we try hard to use only documented kernel API-s. Isn't the sigframe
considered to be some sort of stable API? I mean -- it's visible by the
userspace, nobody prevents glibc or gdb from messing with this stuff just
by reading it from memory.

If it's parse-able e.g. like VDSO is, but we don't do it in CRIU -- then
it's definitely a CRIU BUG to be fixed.

> So... do we introduce somewhat nasty code into the kernel to keep old
> CRIU versions working, or do we require that users who want to restore
> onto new kernels use new CRIU?

It's OK (I think) to require newer versions of CRIU, it's easy to update
one unlike the kernel ;)

But if an old version of CRIU just crashes the restored processes on new
kernels and there's no way to detect this properly -- that's a problem.

> (It seems clear to me that CRIU should apply the patch regardless of
> what the kernel does.  It will enable CRIU to work on the same class
> of programs that are fixed by the kernel change that started this
> thread.)
>
> --Andy

Thanks,
Pavel



Re: [RFC, PATCH] pagemap: do not leak physical addresses to non-privileged userspace

2015-03-09 Thread Pavel Emelyanov
On 03/10/2015 12:11 AM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" 
> 
> As pointed by recent post[1] on exploiting DRAM physical imperfection,
> /proc/PID/pagemap exposes sensitive information which can be used to do
> attacks.
> 
> This is RFC patch which disallow anybody without CAP_SYS_ADMIN to read
> the pagemap.
> 
> Any comments?

If I'm not mistaken, the pagemap file is used by some userspace that does 
working-set size analysis. But this thing only needs the flags (referenced
bit) from the PTE-s. Maybe it would be better not to lock this file completely,
but instead report the PFN part as zero?

Other than this, I don't mind :) Although we use this heavily in CRIU, we
anyway work only with CAP_SYS_ADMIN, so adding the new check doesn't hurt.
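
The masking Pavel suggests is easy to sketch against the documented pagemap entry layout (bits 0-54 hold the PFN, bits 55-63 are flags such as soft-dirty and page-present). This is illustrative userspace code, not the kernel change under discussion:

```c
#include <assert.h>
#include <stdint.h>

/* /proc/PID/pagemap entry layout (Documentation/vm/pagemap.txt):
 * bits 0-54 page frame number, bit 55 soft-dirty, bit 61 file/shared,
 * bit 62 swapped, bit 63 present. */
#define PM_PFN_MASK	((UINT64_C(1) << 55) - 1)
#define PM_PRESENT	(UINT64_C(1) << 63)

/* What a "report the PFN part as zero" policy would hand to an
 * unprivileged reader: the flag bits survive (so working-set
 * analysis keeps working), the physical frame number does not. */
static uint64_t censor_entry(uint64_t entry)
{
	return entry & ~PM_PFN_MASK;
}
```

Working-set tools only need the flag bits, so this keeps them functional while denying the physical-address information the rowhammer attack relies on.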

Thanks,
Pavel

> [1] 
> http://googleprojectzero.blogspot.com/2015/03/exploiting-dram-rowhammer-bug-to-gain.html
> 
> Signed-off-by: Kirill A. Shutemov 
> Cc: Pavel Emelyanov 
> Cc: Konstantin Khlebnikov 
> Cc: Andrew Morton 
> Cc: Linus Torvalds 
> Cc: Mark Seaborn 
> Cc: Andy Lutomirski 
> ---
>  fs/proc/task_mmu.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 246eae84b13b..b72b36e64286 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1322,6 +1322,9 @@ out:
>  
>  static int pagemap_open(struct inode *inode, struct file *file)
>  {
> + /* do not disclose physical addresses: attack vector */
> + if (!capable(CAP_SYS_ADMIN))
> + return -EPERM;
>   pr_warn_once("Bits 55-60 of /proc/PID/pagemap entries are about "
>   "to stop being page-shift some time soon. See the "
>   "linux/Documentation/vm/pagemap.txt for details.\n");
> 





Re: [PATCH 00/21] RFC: userfaultfd v3

2015-03-05 Thread Pavel Emelyanov

> All UFFDIO_COPY/ZEROPAGE/REMAP methods already support CRIU postcopy
> live migration and the UFFD can be passed to a manager process through
> unix domain sockets to satisfy point 5).

Yup :) That's the best (from my POV) point of ufd -- the ability to delegate
the descriptor to some other task. Though there are several limitations (I've
expressed them in other e-mails), I'm definitely supporting this!

The respective CRIU code is still quite sloppy; I will try to brush it up
and show it soon.

Thanks,
Pavel



Re: [PATCH 14/21] userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation

2015-03-05 Thread Pavel Emelyanov
> +static int mcopy_atomic_pte(struct mm_struct *dst_mm,
> + pmd_t *dst_pmd,
> + struct vm_area_struct *dst_vma,
> + unsigned long dst_addr,
> + unsigned long src_addr)
> +{
> + struct mem_cgroup *memcg;
> + pte_t _dst_pte, *dst_pte;
> + spinlock_t *ptl;
> + struct page *page;
> + void *page_kaddr;
> + int ret;
> +
> + ret = -ENOMEM;
> + page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, dst_vma, dst_addr);
> + if (!page)
> + goto out;

Not a fatal thing, but still quite inconvenient. If there are two tasks that
have anonymous private VMAs that are still not COW-ed from each other, then
it will be impossible to keep the pages shared with userfault. Thus if we do
post-copy memory migration for tasks, then these guys will have their
memory COW-ed.


Thanks,
Pavel



Re: [PATCH 19/21] userfaultfd: remap_pages: UFFDIO_REMAP preparation

2015-03-05 Thread Pavel Emelyanov
> +ssize_t remap_pages(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> + unsigned long dst_start, unsigned long src_start,
> + unsigned long len, __u64 mode)
> +{
> + struct vm_area_struct *src_vma, *dst_vma;
> + long err = -EINVAL;
> + pmd_t *src_pmd, *dst_pmd;
> + pte_t *src_pte, *dst_pte;
> + spinlock_t *dst_ptl, *src_ptl;
> + unsigned long src_addr, dst_addr;
> + int thp_aligned = -1;
> + ssize_t moved = 0;
> +
> + /*
> +  * Sanitize the command parameters:
> +  */
> + BUG_ON(src_start & ~PAGE_MASK);
> + BUG_ON(dst_start & ~PAGE_MASK);
> + BUG_ON(len & ~PAGE_MASK);
> +
> + /* Does the address range wrap, or is the span zero-sized? */
> + BUG_ON(src_start + len <= src_start);
> + BUG_ON(dst_start + len <= dst_start);
> +
> + /*
> +  * Because these are read semaphores there's no risk of lock
> +  * inversion.
> +  */
> + down_read(&dst_mm->mmap_sem);
> + if (dst_mm != src_mm)
> + down_read(&src_mm->mmap_sem);
> +
> + /*
> +  * Make sure the vma is not shared, that the src and dst remap
> +  * ranges are both valid and fully within a single existing
> +  * vma.
> +  */
> + src_vma = find_vma(src_mm, src_start);
> + if (!src_vma || (src_vma->vm_flags & VM_SHARED))
> + goto out;
> + if (src_start < src_vma->vm_start ||
> + src_start + len > src_vma->vm_end)
> + goto out;
> +
> + dst_vma = find_vma(dst_mm, dst_start);
> + if (!dst_vma || (dst_vma->vm_flags & VM_SHARED))
> + goto out;

I again have a concern about the case when one task monitors the VM of the
other one. If the target task (owning the mm) unmaps a VMA then the monitor
task (holding and operating on the ufd) will get plain EINVAL on UFFDIO_REMAP
request. This is not fatal, but still inconvenient as it will be hard to
find out the reason for failure -- dst VMA is removed and the monitor should
just drop the respective pages with data, or some other error has occurred
and some other actions should be taken.

Thanks,
Pavel



Re: [PATCH 10/21] userfaultfd: add new syscall to provide memory externalization

2015-03-05 Thread Pavel Emelyanov

> +int handle_userfault(struct vm_area_struct *vma, unsigned long address,
> +  unsigned int flags, unsigned long reason)
> +{
> + struct mm_struct *mm = vma->vm_mm;
> + struct userfaultfd_ctx *ctx;
> + struct userfaultfd_wait_queue uwq;
> +
> + BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
> +
> + ctx = vma->vm_userfaultfd_ctx.ctx;
> + if (!ctx)
> + return VM_FAULT_SIGBUS;
> +
> + BUG_ON(ctx->mm != mm);
> +
> + VM_BUG_ON(reason & ~(VM_UFFD_MISSING|VM_UFFD_WP));
> + VM_BUG_ON(!(reason & VM_UFFD_MISSING) ^ !!(reason & VM_UFFD_WP));
> +
> + /*
> +  * If it's already released don't get it. This avoids to loop
> +  * in __get_user_pages if userfaultfd_release waits on the
> +  * caller of handle_userfault to release the mmap_sem.
> +  */
> + if (unlikely(ACCESS_ONCE(ctx->released)))
> + return VM_FAULT_SIGBUS;
> +
> + /* check that we can return VM_FAULT_RETRY */
> + if (unlikely(!(flags & FAULT_FLAG_ALLOW_RETRY))) {
> + /*
> +  * Validate the invariant that nowait must allow retry
> +  * to be sure not to return SIGBUS erroneously on
> +  * nowait invocations.
> +  */
> + BUG_ON(flags & FAULT_FLAG_RETRY_NOWAIT);
> +#ifdef CONFIG_DEBUG_VM
> + if (printk_ratelimit()) {
> + printk(KERN_WARNING
> +"FAULT_FLAG_ALLOW_RETRY missing %x\n", flags);
> + dump_stack();
> + }
> +#endif
> + return VM_FAULT_SIGBUS;
> + }
> +
> + /*
> +  * Handle nowait, not much to do other than tell it to retry
> +  * and wait.
> +  */
> + if (flags & FAULT_FLAG_RETRY_NOWAIT)
> + return VM_FAULT_RETRY;
> +
> + /* take the reference before dropping the mmap_sem */
> + userfaultfd_ctx_get(ctx);
> +
> + /* be gentle and immediately relinquish the mmap_sem */
> + up_read(&mm->mmap_sem);
> +
> + init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function);
> + uwq.wq.private = current;
> + uwq.address = userfault_address(address, flags, reason);

Since we report only the virtual address of the fault, this will make it
difficult for a task monitoring the address space of some other task.
Like this:

Let's assume a task creates a userfaultfd, activates one, registers several
VMAs in it and then sends the ufd descriptor to the other task. If later the
first task remaps those VMAs and starts touching pages, the monitor will
start receiving fault addresses from which it will not be able to guess the
exact vma the requests come from.

Thanks,
Pavel



Re: [PATCH 05/21] userfaultfd: add vm_userfaultfd_ctx to the vm_area_struct

2015-03-05 Thread Pavel Emelyanov
> diff --git a/kernel/fork.c b/kernel/fork.c
> index cf65139..cb215c0 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -425,6 +425,7 @@ static int dup_mmap(struct mm_struct *mm, struct 
> mm_struct *oldmm)
>   goto fail_nomem_anon_vma_fork;
>   tmp->vm_flags &= ~VM_LOCKED;
>   tmp->vm_next = tmp->vm_prev = NULL;
> + tmp->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;

This creates an interesting effect when the userfaultfd is used outside of
the process which created and activated one. If I try to monitor the memory
usage of one task with another, once the first task fork()-s, its child
begins to see zero-pages in the places where the monitor task was supposed
to insert pages with data.

>   file = tmp->vm_file;
>   if (file) {
>   struct inode *inode = file_inode(file);
> .
> 





Re: [PATCH 10/21] userfaultfd: add new syscall to provide memory externalization

2015-03-05 Thread Pavel Emelyanov

 +int handle_userfault(struct vm_area_struct *vma, unsigned long address,
 +		     unsigned int flags, unsigned long reason)
 +{
 +	struct mm_struct *mm = vma->vm_mm;
 +	struct userfaultfd_ctx *ctx;
 +	struct userfaultfd_wait_queue uwq;
 +
 +	BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
 +
 +	ctx = vma->vm_userfaultfd_ctx.ctx;
 +	if (!ctx)
 +		return VM_FAULT_SIGBUS;
 +
 +	BUG_ON(ctx->mm != mm);
 +
 +	VM_BUG_ON(reason & ~(VM_UFFD_MISSING|VM_UFFD_WP));
 +	VM_BUG_ON(!(reason & VM_UFFD_MISSING) ^ !!(reason & VM_UFFD_WP));
 +
 +	/*
 +	 * If it's already released don't get it. This avoids to loop
 +	 * in __get_user_pages if userfaultfd_release waits on the
 +	 * caller of handle_userfault to release the mmap_sem.
 +	 */
 +	if (unlikely(ACCESS_ONCE(ctx->released)))
 +		return VM_FAULT_SIGBUS;
 +
 +	/* check that we can return VM_FAULT_RETRY */
 +	if (unlikely(!(flags & FAULT_FLAG_ALLOW_RETRY))) {
 +		/*
 +		 * Validate the invariant that nowait must allow retry
 +		 * to be sure not to return SIGBUS erroneously on
 +		 * nowait invocations.
 +		 */
 +		BUG_ON(flags & FAULT_FLAG_RETRY_NOWAIT);
 +#ifdef CONFIG_DEBUG_VM
 +		if (printk_ratelimit()) {
 +			printk(KERN_WARNING
 +			       "FAULT_FLAG_ALLOW_RETRY missing %x\n", flags);
 +			dump_stack();
 +		}
 +#endif
 +		return VM_FAULT_SIGBUS;
 +	}
 +
 +	/*
 +	 * Handle nowait, not much to do other than tell it to retry
 +	 * and wait.
 +	 */
 +	if (flags & FAULT_FLAG_RETRY_NOWAIT)
 +		return VM_FAULT_RETRY;
 +
 +	/* take the reference before dropping the mmap_sem */
 +	userfaultfd_ctx_get(ctx);
 +
 +	/* be gentle and immediately relinquish the mmap_sem */
 +	up_read(&mm->mmap_sem);
 +
 +	init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function);
 +	uwq.wq.private = current;
 +	uwq.address = userfault_address(address, flags, reason);

Since we report only the virtual address of the fault, this will make
difficulties for a task monitoring the address space of some other task.
Like this:

Let's assume a task creates a userfaultfd, activates it, registers several
VMAs in it and then sends the ufd descriptor to the other task. If the first
task later remaps those VMAs and starts touching pages, the monitor will
start receiving fault addresses from which it will not be able to guess the
exact VMA the requests come from.
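A toy userspace sketch of the monitor's problem (hypothetical `toy_*` names, not part of the patch): the monitor can only remember the ranges it registered, so once the target mremap()-s a VMA, a reported fault address no longer matches any recorded range and cannot be attributed.

```c
#include <stddef.h>

struct toy_range { unsigned long start, end; };

/* Map a fault address reported through the ufd back to one of the
 * ranges the monitor registered; returns the index, or -1 when the
 * address matches nothing (e.g. the target remapped the VMA). */
int toy_resolve_fault(const struct toy_range *r, int n, unsigned long addr)
{
	for (int i = 0; i < n; i++)
		if (addr >= r[i].start && addr < r[i].end)
			return i;
	return -1;
}
```

With only the virtual address in the message, the -1 case is unavoidable after a remap; reporting some VMA-stable cookie alongside the address would let the monitor disambiguate.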

Thanks,
Pavel



Re: [PATCH 19/21] userfaultfd: remap_pages: UFFDIO_REMAP preparation

2015-03-05 Thread Pavel Emelyanov
 +ssize_t remap_pages(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 +		    unsigned long dst_start, unsigned long src_start,
 +		    unsigned long len, __u64 mode)
 +{
 +	struct vm_area_struct *src_vma, *dst_vma;
 +	long err = -EINVAL;
 +	pmd_t *src_pmd, *dst_pmd;
 +	pte_t *src_pte, *dst_pte;
 +	spinlock_t *dst_ptl, *src_ptl;
 +	unsigned long src_addr, dst_addr;
 +	int thp_aligned = -1;
 +	ssize_t moved = 0;
 +
 +	/*
 +	 * Sanitize the command parameters:
 +	 */
 +	BUG_ON(src_start & ~PAGE_MASK);
 +	BUG_ON(dst_start & ~PAGE_MASK);
 +	BUG_ON(len & ~PAGE_MASK);
 +
 +	/* Does the address range wrap, or is the span zero-sized? */
 +	BUG_ON(src_start + len <= src_start);
 +	BUG_ON(dst_start + len <= dst_start);
 +
 +	/*
 +	 * Because these are read semaphores there's no risk of lock
 +	 * inversion.
 +	 */
 +	down_read(&dst_mm->mmap_sem);
 +	if (dst_mm != src_mm)
 +		down_read(&src_mm->mmap_sem);
 +
 +	/*
 +	 * Make sure the vma is not shared, that the src and dst remap
 +	 * ranges are both valid and fully within a single existing
 +	 * vma.
 +	 */
 +	src_vma = find_vma(src_mm, src_start);
 +	if (!src_vma || (src_vma->vm_flags & VM_SHARED))
 +		goto out;
 +	if (src_start < src_vma->vm_start ||
 +	    src_start + len > src_vma->vm_end)
 +		goto out;
 +
 +	dst_vma = find_vma(dst_mm, dst_start);
 +	if (!dst_vma || (dst_vma->vm_flags & VM_SHARED))
 +		goto out;

I again have a concern about the case when one task monitors the VM of the
other one. If the target task (owning the mm) unmaps a VMA, then the monitor
task (holding and operating on the ufd) will get a plain EINVAL on the
UFFDIO_REMAP request. This is not fatal, but still inconvenient, as it will
be hard to find out the reason for the failure: either the dst VMA was
removed and the monitor should just drop the respective pages with data, or
some other error has occurred and some other action should be taken.
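A toy model of the validation path quoted above (hypothetical `toy_*` names, not the patch's code) shows how the two failure modes collapse into one error code:

```c
#define TOY_EINVAL 22	/* stand-in for the kernel's EINVAL */

struct toy_dst_vma { unsigned long start, end; };	/* NULL = unmapped */

/* Mirrors the dst-side checks in remap_pages(): both "the dst VMA is
 * gone" and "the request range does not fit the VMA" return the very
 * same -EINVAL, so the caller cannot tell the causes apart. */
long toy_remap_check(const struct toy_dst_vma *dst_vma,
		     unsigned long dst_start, unsigned long len)
{
	if (!dst_vma)				/* VMA unmapped by target */
		return -TOY_EINVAL;
	if (dst_start < dst_vma->start ||
	    dst_start + len > dst_vma->end)	/* bad range */
		return -TOY_EINVAL;
	return 0;
}
```

A distinct error code (or an event on the ufd when a registered VMA goes away) would let the monitor decide between "drop the pages" and "report a real error".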

Thanks,
Pavel


