Re: [PATCH] kconfig: remove EXPERT from CHECKPOINT_RESTORE
On 07/12/2018 07:33 PM, Eric W. Biederman wrote:
>
> Adrian Reber writes:
>
>> The CHECKPOINT_RESTORE configuration option was introduced in 2012 and
>> combined with EXPERT. CHECKPOINT_RESTORE is already enabled in many
>> distribution kernels and also part of the defconfigs of various
>> architectures.
>>
>> To make it easier for distributions to enable CHECKPOINT_RESTORE this
>> removes EXPERT and moves the configuration option out of the EXPERT
>> block.
>
> I think we should change the help text at the same time, to match
> our improved understanding of the situation.
>
> Does anyone remember why this option was added at all?

Sure! Quoting Andrew's akpm branch merge e-mail from ~7 years ago:

    However I'm less confident than the developers that it will all
    eventually work! So what I'm asking them to do is to wrap each piece of
    new code inside CONFIG_CHECKPOINT_RESTORE. So if it all eventually comes
    to tears and the project as a whole fails, it should be a simple matter
    to go through and delete all trace of it.

The best link with the full e-mail I could find is
https://gitlab.imag.fr/kaunetem/linux-kaunetem/commit/099469502f62fbe0d7e4f0b83a2f22538367f734

-- Pavel

> Why was this option placed under EXPERT?
>
> What is the value of ever disabling this functionality?
>
> Is there any reason why we don't just delete CONFIG_CHECKPOINT_RESTORE
> entirely?
>
> Certainly we are at a point where distros are enabling this, so hiding
> it behind CONFIG_EXPERT with a default of N seems inappropriate.
>
> The only thing I can imagine might be sensible is changing the default
> to Y and leaving it behind CONFIG_EXPERT.
>
> I want to know what is the point of maintaining all of the complexity of
> the ifdefs. If no one can come up with a reason I say let's just remove
> CONFIG_CHECKPOINT_RESTORE entirely.
>
> Eric
>
>
>> Signed-off-by: Adrian Reber
>> Cc: Oleg Nesterov
>> Cc: Pavel Emelyanov
>> Cc: Andrew Morton
>> Cc: Eric W. Biederman
>> Cc: Andrei Vagin
>> Cc: Hendrik Brueckner
>> ---
>>  init/Kconfig | 24 ++++++++++++------------
>>  1 file changed, 12 insertions(+), 12 deletions(-)
>>
>> diff --git a/init/Kconfig b/init/Kconfig
>> index 041f3a022122..9c529c763326 100644
>> --- a/init/Kconfig
>> +++ b/init/Kconfig
>> @@ -932,6 +932,18 @@ config NET_NS
>>
>>  endif # NAMESPACES
>>
>> +config CHECKPOINT_RESTORE
>> +	bool "Checkpoint/restore support"
>> +	select PROC_CHILDREN
>> +	default n
>> +	help
>> +	  Enables additional kernel features in a sake of checkpoint/restore.
>> +	  In particular it adds auxiliary prctl codes to setup process text,
>> +	  data and heap segment sizes, and a few additional /proc filesystem
>> +	  entries.
>> +
>> +	  If unsure, say N here.
>> +
>>  config SCHED_AUTOGROUP
>>  	bool "Automatic process group scheduling"
>>  	select CGROUPS
>> @@ -1348,18 +1360,6 @@ config MEMBARRIER
>>
>>  	  If unsure, say Y.
>>
>> -config CHECKPOINT_RESTORE
>> -	bool "Checkpoint/restore support" if EXPERT
>> -	select PROC_CHILDREN
>> -	default n
>> -	help
>> -	  Enables additional kernel features in a sake of checkpoint/restore.
>> -	  In particular it adds auxiliary prctl codes to setup process text,
>> -	  data and heap segment sizes, and a few additional /proc filesystem
>> -	  entries.
>> -
>> -	  If unsure, say N here.
>> -
>>  config KALLSYMS
>>  	bool "Load all symbols for debugging/ksymoops" if EXPERT
>>  	default y
Re: [PATCH] kconfig: remove EXPERT from CHECKPOINT_RESTORE
On 07/12/2018 04:07 PM, Adrian Reber wrote:
> The CHECKPOINT_RESTORE configuration option was introduced in 2012 and
> combined with EXPERT. CHECKPOINT_RESTORE is already enabled in many
> distribution kernels and also part of the defconfigs of various
> architectures.
>
> To make it easier for distributions to enable CHECKPOINT_RESTORE this
> removes EXPERT and moves the configuration option out of the EXPERT
> block.
>
> Signed-off-by: Adrian Reber
> Cc: Oleg Nesterov
> Cc: Pavel Emelyanov
> Cc: Andrew Morton
> Cc: Eric W. Biederman
> Cc: Andrei Vagin
> Cc: Hendrik Brueckner

Acked-by: Pavel Emelyanov

> ---
>  init/Kconfig | 24 ++++++++++++------------
>  1 file changed, 12 insertions(+), 12 deletions(-)
>
> diff --git a/init/Kconfig b/init/Kconfig
> index 041f3a022122..9c529c763326 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -932,6 +932,18 @@ config NET_NS
>
>  endif # NAMESPACES
>
> +config CHECKPOINT_RESTORE
> +	bool "Checkpoint/restore support"
> +	select PROC_CHILDREN
> +	default n
> +	help
> +	  Enables additional kernel features in a sake of checkpoint/restore.
> +	  In particular it adds auxiliary prctl codes to setup process text,
> +	  data and heap segment sizes, and a few additional /proc filesystem
> +	  entries.
> +
> +	  If unsure, say N here.
> +
>  config SCHED_AUTOGROUP
>  	bool "Automatic process group scheduling"
>  	select CGROUPS
> @@ -1348,18 +1360,6 @@ config MEMBARRIER
>
>  	  If unsure, say Y.
>
> -config CHECKPOINT_RESTORE
> -	bool "Checkpoint/restore support" if EXPERT
> -	select PROC_CHILDREN
> -	default n
> -	help
> -	  Enables additional kernel features in a sake of checkpoint/restore.
> -	  In particular it adds auxiliary prctl codes to setup process text,
> -	  data and heap segment sizes, and a few additional /proc filesystem
> -	  entries.
> -
> -	  If unsure, say N here.
> -
>  config KALLSYMS
>  	bool "Load all symbols for debugging/ksymoops" if EXPERT
>  	default y
Re: [PATCH] userfaultfd: prevent non-cooperative events vs mcopy_atomic races
>> But doesn't it race even with regular PF handling, not only the fork? How
>> do we handle this race?
>
> With the regular #PF handling, the faulting thread patiently waits until
> the page fault is resolved. With fork(), mremap() etc the thread that
> caused the event resumes once the uffd message is read by the monitor.
> That's surely way before the monitor had a chance to somehow process that
> message.

Ouch, yes. This is nasty :( So having no better solution in mind, let's
move forward with this.

Acked-by: Pavel Emelyanov <xe...@virtuozzo.com>
Re: [PATCH] userfaultfd: prevent non-cooperative events vs mcopy_atomic races
On 05/24/2018 02:56 PM, Mike Rapoport wrote:
> On Thu, May 24, 2018 at 02:24:37PM +0300, Pavel Emelyanov wrote:
>> On 05/23/2018 10:42 AM, Mike Rapoport wrote:
>>> If a process monitored with userfaultfd changes its memory mappings or
>>> forks() at the same time as the uffd monitor fills the process memory
>>> with UFFDIO_COPY, the actual creation of page table entries and copying
>>> of the data in mcopy_atomic may happen either before or after the memory
>>> mapping modifications and there is no way for the uffd monitor to
>>> maintain a consistent view of the process memory layout.
>>>
>>> For instance, let's consider fork() running in parallel with
>>> userfaultfd_copy():
>>>
>>> process                 | uffd monitor
>>> ------------------------+------------------------------
>>> fork()                  | userfaultfd_copy()
>>> ...                     | ...
>>> dup_mmap()              | down_read(mmap_sem)
>>> down_write(mmap_sem)    | /* create PTEs, copy data */
>>> dup_uffd()              | up_read(mmap_sem)
>>> copy_page_range()       |
>>> up_write(mmap_sem)      |
>>> dup_uffd_complete()     |
>>> /* notify monitor */    |
>>>
>>> If the userfaultfd_copy() takes the mmap_sem first, the new page(s) will
>>> be present by the time copy_page_range() is called and they will appear
>>> in the child's memory mappings. However, if the fork() is the first to
>>> take the mmap_sem, the new pages won't be mapped in the child's address
>>> space.
>>
>> But in this case the child should get an entry that emits a message to
>> uffd when stepped upon!
>> And uffd will just userfaultfd_copy() it again. No?
>
> There will be a message, indeed. But there is no way for the monitor to
> tell whether the pages it copied are present or not in the child.

If there's a message, then they are not present, that's for sure :)

> Since the monitor cannot assume that the process will access all its
> memory it has to copy some pages "in the background". A simple monitor
> may look like:
>
> 	for (;;) {
> 		wait_for_uffd_events(timeout);
> 		handle_uffd_events();
> 		uffd_copy(some not faulted pages);
> 	}
>
> Then, if the "background" uffd_copy() races with fork, the pages we've
> copied may be already present in the parent's mappings before the call to
> copy_page_range(), or may be not.
>
> If the pages were not present, uffd_copy'ing them again to the child's
> memory would be ok.

Yes.

> But if uffd_copy() was first to catch mmap_sem, and we would uffd_copy
> them again, the child process will get memory corruption.

You mean the background uffd_copy()? But doesn't it race even with regular
PF handling, not only the fork? How do we handle this race?

-- Pavel
Re: [PATCH] userfaultfd: prevent non-cooperative events vs mcopy_atomic races
On 05/23/2018 10:42 AM, Mike Rapoport wrote:
> If a process monitored with userfaultfd changes its memory mappings or
> forks() at the same time as the uffd monitor fills the process memory with
> UFFDIO_COPY, the actual creation of page table entries and copying of the
> data in mcopy_atomic may happen either before or after the memory mapping
> modifications and there is no way for the uffd monitor to maintain a
> consistent view of the process memory layout.
>
> For instance, let's consider fork() running in parallel with
> userfaultfd_copy():
>
> process                 | uffd monitor
> ------------------------+------------------------------
> fork()                  | userfaultfd_copy()
> ...                     | ...
> dup_mmap()              | down_read(mmap_sem)
> down_write(mmap_sem)    | /* create PTEs, copy data */
> dup_uffd()              | up_read(mmap_sem)
> copy_page_range()       |
> up_write(mmap_sem)      |
> dup_uffd_complete()     |
> /* notify monitor */    |
>
> If the userfaultfd_copy() takes the mmap_sem first, the new page(s) will be
> present by the time copy_page_range() is called and they will appear in the
> child's memory mappings. However, if the fork() is the first to take the
> mmap_sem, the new pages won't be mapped in the child's address space.

But in this case the child should get an entry that emits a message to uffd
when stepped upon!
And uffd will just userfaultfd_copy() it again. No?

-- Pavel

> Since the userfaultfd monitor has no way to determine what the order was,
> let's disallow userfaultfd_copy in parallel with the non-cooperative
> events. In such case we return -EAGAIN and the uffd monitor can understand
> that userfaultfd_copy() clashed with a non-cooperative event and take an
> appropriate action.
>
> Signed-off-by: Mike Rapoport <r...@linux.vnet.ibm.com>
> Cc: Andrea Arcangeli <aarca...@redhat.com>
> Cc: Mike Kravetz <mike.krav...@oracle.com>
> Cc: Pavel Emelyanov <xe...@virtuozzo.com>
> Cc: Andrei Vagin <ava...@virtuozzo.com>
> ---
>  fs/userfaultfd.c              | 22 --
>  include/linux/userfaultfd_k.h |  6 --
>  mm/userfaultfd.c              | 22 +-
>  3 files changed, 41 insertions(+), 9 deletions(-)
>
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index cec550c8468f..123bf7d516fc 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -62,6 +62,8 @@ struct userfaultfd_ctx {
>  	enum userfaultfd_state state;
>  	/* released */
>  	bool released;
> +	/* memory mappings are changing because of non-cooperative event */
> +	bool mmap_changing;
>  	/* mm with one ore more vmas attached to this userfaultfd_ctx */
>  	struct mm_struct *mm;
>  };
> @@ -641,6 +643,7 @@ static void userfaultfd_event_wait_completion(struct userfaultfd_ctx *ctx,
>  	 * already released.
>  	 */
>  out:
> +	WRITE_ONCE(ctx->mmap_changing, false);
>  	userfaultfd_ctx_put(ctx);
>  }
>
> @@ -686,10 +689,12 @@ int dup_userfaultfd(struct vm_area_struct *vma, struct list_head *fcs)
>  	ctx->state = UFFD_STATE_RUNNING;
>  	ctx->features = octx->features;
>  	ctx->released = false;
> +	ctx->mmap_changing = false;
>  	ctx->mm = vma->vm_mm;
>  	mmgrab(ctx->mm);
>
>  	userfaultfd_ctx_get(octx);
> +	WRITE_ONCE(octx->mmap_changing, true);
>  	fctx->orig = octx;
>  	fctx->new = ctx;
>  	list_add_tail(&fctx->list, fcs);
> @@ -732,6 +737,7 @@ void mremap_userfaultfd_prep(struct vm_area_struct *vma,
>  	if (ctx && (ctx->features & UFFD_FEATURE_EVENT_REMAP)) {
>  		vm_ctx->ctx = ctx;
>  		userfaultfd_ctx_get(ctx);
> +		WRITE_ONCE(ctx->mmap_changing, true);
>  	}
>  }
>
> @@ -772,6 +778,7 @@ bool userfaultfd_remove(struct vm_area_struct *vma,
>  		return true;
>
>  	userfaultfd_ctx_get(ctx);
> +	WRITE_ONCE(ctx->mmap_changing, true);
>  	up_read(&mm->mmap_sem);
>
>  	msg_init(&ewq.msg);
> @@ -815,6 +822,7 @@ int userfaultfd_unmap_prep(struct vm_area_struct *vma,
>  		return -ENOMEM;
>
>  	userfaultfd_ctx_get(ctx);
> +	WRITE_ONCE(ctx->mmap_changing, true);
>  	unmap_ctx->ctx = ctx;
>  	unmap_ctx->start = start;
>  	unmap_ctx->end = end;
> @@ -1653,6 +1661,10 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
>
>  	user_uffdio_copy = (struct uffdio_copy __user *) arg;
>
> +	ret = -E
Re: [PATCH 3/3] userfaultfd: non-cooperative: allow synchronous EVENT_REMOVE
> @@ -52,6 +53,7 @@
>  #define _UFFDIO_WAKE			(0x02)
>  #define _UFFDIO_COPY			(0x03)
>  #define _UFFDIO_ZEROPAGE		(0x04)
> +#define _UFFDIO_WAKE_SYNC_EVENT		(0x05)

Excuse my ignorance, but what's the difference between UFFDIO_WAKE and
UFFDIO_WAKE_SYNC_EVENT?

-- Pavel
Re: [PATCH v5 0/4] vm: add a syscall to map a process memory into a pipe
On 02/27/2018 05:18 AM, Dmitry V. Levin wrote:
> On Mon, Feb 26, 2018 at 12:02:25PM +0300, Pavel Emelyanov wrote:
>> On 02/21/2018 03:44 AM, Andrew Morton wrote:
>>> On Tue, 9 Jan 2018 08:30:49 +0200 Mike Rapoport <r...@linux.vnet.ibm.com> wrote:
>>>
>>>> This patch introduces a new process_vmsplice system call that combines
>>>> the functionality of process_vm_read and vmsplice.
>>>
>>> All seems fairly straightforward. The big question is: do we know that
>>> people will actually use this, and get sufficient value from it to
>>> justify its addition?
>>
>> Yes, that's what bothers us a lot too :) I've tried to start with finding
>> out if anyone used the sys_read/write_process_vm() calls, but failed :(
>> Does anybody know how popular these syscalls are?
>
> Well, process_vm_readv itself is quite popular; it's used by debuggers
> nowadays, see e.g.
>
> $ strace -qq -esignal=none -eprocess_vm_readv strace -qq -o/dev/null cat /dev/null

I see. Well, yes, this use-case will not benefit much from remote splice.
How about more interactive debugging by, say, gdb? It may attach, then
splice all the memory, then analyze the victim code/data w/o copying it
into its own address space?

-- Pavel
Re: [PATCH v5 0/4] vm: add a syscall to map a process memory into a pipe
On 02/21/2018 03:44 AM, Andrew Morton wrote:
> On Tue, 9 Jan 2018 08:30:49 +0200 Mike Rapoport wrote:
>
>> This patch introduces a new process_vmsplice system call that combines
>> the functionality of process_vm_read and vmsplice.
>
> All seems fairly straightforward. The big question is: do we know that
> people will actually use this, and get sufficient value from it to
> justify its addition?

Yes, that's what bothers us a lot too :) I've tried to start with finding
out if anyone used the sys_read/write_process_vm() calls, but failed :(
Does anybody know how popular these syscalls are? If their users operate
on big amounts of memory, they could benefit from the proposed splice
extension.

-- Pavel
Re: [PATCH 1/4] fs/notify: fdinfo can report unsupported file handles.
On 12/11/2017 05:08 PM, Amir Goldstein wrote:
> On Mon, Dec 11, 2017 at 3:46 PM, Pavel Emelyanov <xe...@virtuozzo.com> wrote:
>> On 12/11/2017 10:05 AM, Amir Goldstein wrote:
>>> On Mon, Dec 11, 2017 at 8:41 AM, Amir Goldstein <amir7...@gmail.com> wrote:
>>>> On Mon, Dec 11, 2017 at 8:04 AM, NeilBrown <ne...@suse.com> wrote:
>>>>> If a filesystem does not set sb->s_export_op, then it
>>>>> does not support filehandles and exportfs_encode_fh()
>>>>> and exportfs_encode_inode_fh() should not be called.
>>>>> They will fall back to export_encode_fh(), a default which
>>>>> uses the inode number and generation number, but in general
>>>>> these may not be stable.
>>>>>
>>>>> So change exportfs_encode_inode_fh() to return FILEID_INVALID
>>>>> if called on an unsupported filesystem. Currently only
>>>>> notify/fdinfo can do that.
>>>>>
>>>>
>>>> I wish you would leave this check to the caller, maybe add a helper
>>>> exportfs_can_decode_fh() for callers to use.
>>>>
>>>> Although there are no current uses for it in-tree, there is value in
>>>> being able to encode a unique file handle even when it cannot be
>>>> decoded back to an open file.
>>>>
>>>> I am using this property in my fanotify super block watch patches,
>>>> where the object identifier on the event is an encoded file handle
>>>> of the object, which delegates tracking filesystem objects to
>>>> userspace and prevents fanotify from keeping elevated refcounts
>>>> on inodes and dentries.
>>>>
>>>> There are quite a few userspace tools out there that are checking
>>>> that st_ino hasn't changed on a file between non-atomic operations.
>>>> Those tools (or others) could benefit from a unique file handle if
>>>> we ever decide to provide a relaxed version of name_to_handle_at().
>>>>
>>>
>>> And this change needs a clause about not breaking userspace.
>>>
>>> Pavel,
>>>
>>> Will this break any version of criu in the wild?
>>
>> If there's no filehandle in the output, it will make the dump fail, but
>> we're already prepared for the fact that there's no handle at hand. In
>> the worst case criu will exit with an error.
>>
>> I also agree that it should only happen when current is OOM killed, and
>> in case of CRIU this means killing the criu process itself.
>>
>
> But this patch [1/4] changes behavior so you cannot dump fsnotify
> state if the watched file system does not support *decoding* file handles.

That's OK :) After we get a filehandle we check it's accessible, so for FSs
that couldn't decode one, we'd fail the dump anyway.

> This means that criu anyway won't be able to restore the fsnotify state.
> Is it OK that criu dump state will fail in that case?
>
> Amir.
Re: [PATCH 1/4] fs/notify: fdinfo can report unsupported file handles.
On 12/11/2017 05:08 PM, Amir Goldstein wrote:
> On Mon, Dec 11, 2017 at 3:46 PM, Pavel Emelyanov wrote:
>> On 12/11/2017 10:05 AM, Amir Goldstein wrote:
>>> On Mon, Dec 11, 2017 at 8:41 AM, Amir Goldstein wrote:
>>>> On Mon, Dec 11, 2017 at 8:04 AM, NeilBrown wrote:
>>>>> If a filesystem does not set sb->s_export_op, then it
>>>>> does not support filehandles and exportfs_encode_fh()
>>>>> and exportfs_encode_inode_fh() should not be called.
>>>>> They would fall back to export_encode_fh(), which is a default
>>>>> that uses the inode number and generation number, but in general
>>>>> these may not be stable.
>>>>>
>>>>> So change exportfs_encode_inode_fh() to return FILEID_INVALID
>>>>> if called on an unsupported filesystem. Currently only
>>>>> notify/fdinfo can do that.
>>>>>
>>>>
>>>> I wish you would leave this check to the caller, maybe add a helper
>>>> exportfs_can_decode_fh() for callers to use.
>>>>
>>>> Although there are no current uses for it in-tree, there is value in
>>>> being able to encode a unique file handle even when it cannot be
>>>> decoded back to an open file.
>>>>
>>>> I am using this property in my fanotify super block watch patches,
>>>> where the object identifier on the event is an encoded file handle
>>>> of the object, which delegates tracking filesystem objects to
>>>> userspace and prevents fanotify from keeping elevated refcounts
>>>> on inodes and dentries.
>>>>
>>>> There are quite a few userspace tools out there that are checking
>>>> that st_ino hasn't changed on a file between non-atomic operations.
>>>> Those tools (or others) could benefit from a unique file handle if
>>>> we ever decide to provide a relaxed version of name_to_handle_at().
>>>>
>>>
>>> And this change needs a clause about not breaking userspace.
>>>
>>> Pavel,
>>>
>>> Will this break any version of criu in the wild?
>>
>> If there's no filehandle in the output, it will make dump fail, but we're
>> already prepared for the fact that there's no handle at hand. In the
>> worst case criu will exit with an error.
>>
>> I also agree that it should only happen when current is OOM killed, and in
>> case of CRIU this means killing the criu process itself.
>>
>
> But this patch [1/4] changes behavior so you cannot dump fsnotify
> state if the watched file system does not support *decoding* file handles.

That's OK :) After we get a filehandle we check that it's accessible, so for
FSs that couldn't decode one, we'd fail the dump anyway.

> This means that criu anyway won't be able to restore the fsnotify state.
> Is it OK that criu dump state will fail in that case?
>
> Amir.
Re: [PATCH 1/4] fs/notify: fdinfo can report unsupported file handles.
On 12/11/2017 10:05 AM, Amir Goldstein wrote:
> On Mon, Dec 11, 2017 at 8:41 AM, Amir Goldstein wrote:
>> On Mon, Dec 11, 2017 at 8:04 AM, NeilBrown wrote:
>>> If a filesystem does not set sb->s_export_op, then it
>>> does not support filehandles and exportfs_encode_fh()
>>> and exportfs_encode_inode_fh() should not be called.
>>> They would fall back to export_encode_fh(), which is a default
>>> that uses the inode number and generation number, but in general
>>> these may not be stable.
>>>
>>> So change exportfs_encode_inode_fh() to return FILEID_INVALID
>>> if called on an unsupported filesystem. Currently only
>>> notify/fdinfo can do that.
>>>
>>
>> I wish you would leave this check to the caller, maybe add a helper
>> exportfs_can_decode_fh() for callers to use.
>>
>> Although there are no current uses for it in-tree, there is value in
>> being able to encode a unique file handle even when it cannot be
>> decoded back to an open file.
>>
>> I am using this property in my fanotify super block watch patches,
>> where the object identifier on the event is an encoded file handle
>> of the object, which delegates tracking filesystem objects to
>> userspace and prevents fanotify from keeping elevated refcounts
>> on inodes and dentries.
>>
>> There are quite a few userspace tools out there that are checking
>> that st_ino hasn't changed on a file between non-atomic operations.
>> Those tools (or others) could benefit from a unique file handle if
>> we ever decide to provide a relaxed version of name_to_handle_at().
>>
>
> And this change needs a clause about not breaking userspace.
>
> Pavel,
>
> Will this break any version of criu in the wild?

If there's no filehandle in the output, it will make dump fail, but we're
already prepared for the fact that there's no handle at hand. In the worst
case criu will exit with an error.

I also agree that it should only happen when current is OOM killed, and in
case of CRIU this means killing the criu process itself.

-- Pavel
Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE
On 05/24/2017 02:31 PM, Vlastimil Babka wrote:
> On 05/24/2017 12:39 PM, Mike Rapoport wrote:
>>> Hm so the prctl does:
>>>
>>> if (arg2)
>>> 	me->mm->def_flags |= VM_NOHUGEPAGE;
>>> else
>>> 	me->mm->def_flags &= ~VM_NOHUGEPAGE;
>>>
>>> That's a rather lazy implementation IMHO. Could we change it so the flag
>>> is stored elsewhere in the mm, and the code that decides to (not) use
>>> THP will check both the per-vma flag and the per-mm flag?
>>
>> I'm afraid I don't understand how that can help.
>> What we need is the ability to temporarily disable collapse of the pages in
>> VMAs that do not have VM_*HUGEPAGE flags set, and that after we re-enable
>> THP, the vma->vm_flags for those VMAs will remain intact.
>
> That's what I'm saying - instead of implementing the prctl flag via
> mm->def_flags (which gets permanently propagated to newly created vma's
> but e.g. doesn't affect already existing ones), it would be setting a
> flag somewhere in mm, which khugepaged (and page faults) would check in
> addition to the per-vma flags.

I do not insist, but this would make the existing paths (checking for flags)
two times slower -- from now on these would need to check two bits (vma flags
and mm flags) which are 100% in different cache lines.

What Mike is proposing is a way to fine-tune the existing vma flags. This
would keep the current paths as fast (or slow ;) ) as they are now. All the
complexity would go to the rare cases when someone needs to turn THP off for
a while and then turn it back on.

-- Pavel
Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE
On 05/24/2017 02:18 PM, Michal Hocko wrote:
> On Wed 24-05-17 13:39:48, Mike Rapoport wrote:
>> On Wed, May 24, 2017 at 09:58:06AM +0200, Vlastimil Babka wrote:
>>> On 05/24/2017 09:50 AM, Mike Rapoport wrote: On Mon, May 22, 2017 at 05:52:47PM +0200, Vlastimil Babka wrote:
> On 05/22/2017 04:29 PM, Mike Rapoport wrote:
>>
>> Probably I didn't explain it too well.
>>
>> The range is intentionally not populated. When we combine pre- and
>> post-copy for process migration, we create a memory pre-dump without
>> stopping the process, then we freeze the process without dumping the pages
>> it has dirtied between pre-dump and freeze, and then, during restore, we
>> populate the dirtied pages using userfaultfd.
>>
>> When CRIU restores a process in such a scenario, it does something like:
>>
>> * mmap() memory region
>> * fill in the pages that were collected during the pre-dump
>> * do some other stuff
>> * register memory region with userfaultfd
>> * populate the missing memory on demand
>>
>> khugepaged collapses the pages in the partially populated regions before
>> we have a chance to register these regions with userfaultfd, which would
>> prevent the collapse.
>>
>> We could have used MADV_NOHUGEPAGE right after the mmap() call, and then
>> there would be no race because there would be nothing for khugepaged to
>> collapse at that point. But the problem is that we have no way to reset
>> *HUGEPAGE flags after the memory restore is complete.
>
> Hmm, I wouldn't be that sure this is indeed race-free. Can you check that
> this scenario is indeed impossible?
>
> - you do the mmap
> - khugepaged will choose the process' mm to scan
> - khugepaged will get to the vma in question, it doesn't have
>   MADV_NOHUGEPAGE yet
> - you set MADV_NOHUGEPAGE on the vma
> - you start populating the vma
> - khugepaged sees the vma is non-empty, collapses
>
> unless I'm wrong, the racers will have mmap_sem for reading only when
> setting/checking the MADV_NOHUGEPAGE? Might actually be considered a bug.
>
> However, can't you use prctl(PR_SET_THP_DISABLE) instead? "If arg2 has a
> nonzero value, the flag is set, otherwise it is cleared." says the
> manpage. Do it before the mmap and you avoid the race as well?

Unfortunately, prctl(PR_SET_THP_DISABLE) didn't help :(

When I tried to use it, I ended up with VM_NOHUGEPAGE set on all VMAs created
after the prctl(). This returns me to the state where checkpoint-restore
alters the application's vma->vm_flags although it shouldn't, and I do not
see a way to fix it using existing interfaces.

>>>
>>> [CC linux-api, should have been done in the initial posting already]
>>
>> Sorry, missed that.
>>
>>> Hm so the prctl does:
>>>
>>> if (arg2)
>>> 	me->mm->def_flags |= VM_NOHUGEPAGE;
>>> else
>>> 	me->mm->def_flags &= ~VM_NOHUGEPAGE;
>>>
>>> That's a rather lazy implementation IMHO. Could we change it so the flag
>>> is stored elsewhere in the mm, and the code that decides to (not) use
>>> THP will check both the per-vma flag and the per-mm flag?
>>
>> I'm afraid I don't understand how that can help.
>> What we need is the ability to temporarily disable collapse of the pages in
>> VMAs that do not have VM_*HUGEPAGE flags set, and that after we re-enable
>> THP, the vma->vm_flags for those VMAs will remain intact.
>
> Why can't khugepaged simply skip over all VMAs which have userfault
> regions registered? This would sound like a less error prone approach to
> me.

It already does so. The problem is that there's a race window. We first
populate the VMA with pages, then register it with UFFD. Between these two
actions khugepaged comes and generates a huge page out of populated pages and
holes. And the holes in question are not, well, holes -- they should be
populated later via UFFD, while the generated huge page prevents this from
happening.

-- Pavel
Re: [PATCH] Add pidfs filesystem
On 02/22/2017 03:04 PM, Alexey Gladkov wrote:
> On Wed, Feb 22, 2017 at 10:40:49AM +0300, Pavel Emelyanov wrote:
>> On 02/21/2017 05:57 PM, Oleg Nesterov wrote:
>>> On 02/18, Alexey Gladkov wrote:
>>>>
>>>> This patch allows to mount only the part of /proc related to pids
>>>> without the rest of the objects. Since this is an addon to /proc, flags
>>>> applied to /proc have an effect on this pidfs filesystem.
>>>
>>> I leave this to you and Eric, but imo it would be nice to avoid another
>>> filesystem.
>>>
>>>> Why not implement it as another flag to /proc ?
>>>>
>>>> The /proc flags are stored in the pid_namespace and are global for the
>>>> namespace. It means that if you add a flag to hide all except the pids,
>>>> then it will act on all mounted instances of /proc.
>>>
>>> But perhaps we can use mnt_flags? For example, let's abuse MNT_NODEV, see
>>> the simple patch below. Not sure it is correct/complete, just to illustrate
>>> the idea.
>>>
>>> With this patch you can mount proc with -onodev and it will only show
>>> pids/self/thread_self:
>>>
>>> # mkdir /tmp/D
>>> # mount -t proc -o nodev none /tmp/D
>>> # ls /tmp/D
>>> 1 11 13 15 17 19 20 22 24 28 3 31 33 4 56 7 9 thread-self
>>> 10 12 14 16 18 2 21 23 27 29 30 32 34 5 6 8 self
>>> # cat /tmp/D/meminfo
>>> cat: /tmp/D/meminfo: No such file or directory
>>> # ls /tmp/D/irq
>>> ls: cannot open directory /tmp/D/irq: No such file or directory
>>>
>>> No?
>>
>> Yes!!! If this whole effort with pidfs and overlayfs moves forward, I would
>> prefer seeing the nodev procfs version, rather than another fs.
>
> But this is not procfs anymore. If someone expects procfs here they will be
> disappointed :)

Well, it depends on which files they're looking for in there. That's what the
overlay part is for.

>> As far as the overlayfs part is concerned, having an overlayfs mounted on
>> /proc inside a container may result in problems as applications sometimes
>> check for /proc containing procfs (by checking statfs.f_type ==
>> PROC_SUPER_MAGIC or by reading the /proc/mounts).
>
> It is not a replacement for procfs. It's a subset of procfs. If someone wants
> the procfs in the code we should not deceive him.
>
> No?

But this is what we actually do -- Docker does it with bind-mounts, LXC with
lxcfs, OpenVZ with kernel patches. Every time a container starts, the regular
/proc is mutated to hide some information.

-- Pavel
Re: [RFC 1/3] procfs: fdinfo -- Extend information about epoll target files
On 02/22/2017 11:18 AM, Cyrill Gorcunov wrote:
> On Wed, Feb 22, 2017 at 11:09:23AM +0300, Pavel Emelyanov wrote:
>>>>
>>>> Actually it shouldn't. If you extend the kcmp argument to accept the
>>>> epollfd:epollslot pair, this would be effectively the same as if you
>>>> had all your epoll-ed files injected into your fdtable with "strange"
>>>> fd numbers. We already have a two-level rbtree for this in criu, adding
>>>> an extended ("strange") fd to it should be OK.
>>>
>>> Nope. Pavel, I guess you forget how we handle the file tree in criu currently.
>>> We call kcmp only if we have to -- when the primary key for two entries
>>> is the same.
>>
>> True, but the latter is an optimization to reduce the number of syscalls.
>
> Exactly. While syscalls are quite effective, they still don't come
> for free, so I'm trying to reduce their number as much as possible.
>
>> Look, in order to have a primary key you need to do some system call for the
>> fd you check (read from proc or stat the descriptor). But for target files in
>> e-polls you don't make this per-fd syscall to get the primary key, you just
>> call kcmp instead.
>
> I have to parse fdinfo anyway, because I need to fetch the queued events and
> mask. So I'll _have_ to make this per-fd syscall for parsing. And this opens
> a way to optimize the overall picture -- we can immediately read the primary
> key and reduce kcmp calls.

You read fdinfo per epoll fd, but the kcmp calls we're talking about here are
per target file. So having a dev:ino pair would help to reduce the number of
kcmps, but even w/o this extension we can work OK.

Besides, in most of the cases the fd number you'd read from epoll's fdinfo
will actually be present in the task's fdtable, so you can call a single kcmp,
make sure the file is correct and that's it. The need to actually _search_ for
the runaway file with the set of kcmps will (should) be quite a rare case.

-- Pavel
Re: [RFC 1/3] procfs: fdinfo -- Extend information about epoll target files
On 02/22/2017 10:54 AM, Cyrill Gorcunov wrote:
> On Wed, Feb 22, 2017 at 10:44:07AM +0300, Pavel Emelyanov wrote:
>> On 02/21/2017 10:16 PM, Cyrill Gorcunov wrote:
>>> On Tue, Feb 21, 2017 at 10:41:12AM -0800, Andy Lutomirski wrote:
>>>>> Thus let's add the file position, inode and device number where
>>>>> this target lies. These three fields can be used as a primary
>>>>> key for sorting, and together with kcmp help CRIU find
>>>>> out an exact file target (from the whole set of processes
>>>>> being checkpointed).
>>>>
>>>> I have no problem with this, but I'm wondering whether kcmp's ordered
>>>> comparisons could also be used for this purpose.
>>>
>>> Yes it can, but it would increase the number of kcmp calls significantly.
>>
>> Actually it shouldn't. If you extend the kcmp argument to accept the
>> epollfd:epollslot pair, this would be effectively the same as if you
>> had all your epoll-ed files injected into your fdtable with "strange"
>> fd numbers. We already have a two-level rbtree for this in criu, adding
>> an extended ("strange") fd to it should be OK.
>
> Nope. Pavel, I guess you forget how we handle the file tree in criu currently.
> We call kcmp only if we have to -- when the primary key for two entries
> is the same.

True, but the latter is an optimization to reduce the number of syscalls.

Look, in order to have a primary key you need to do some system call for the
fd you check (read from proc or stat the descriptor). But for target files in
e-polls you don't make this per-fd syscall to get the primary key, you just
call kcmp instead.

-- Pavel
Re: [RFC 1/3] procfs: fdinfo -- Extend information about epoll target files
On 02/21/2017 10:16 PM, Cyrill Gorcunov wrote:
> On Tue, Feb 21, 2017 at 10:41:12AM -0800, Andy Lutomirski wrote:
>>> Thus let's add the file position, inode and device number where
>>> this target lies. These three fields can be used as a primary
>>> key for sorting, and together with kcmp help CRIU find
>>> out an exact file target (from the whole set of processes
>>> being checkpointed).
>>
>> I have no problem with this, but I'm wondering whether kcmp's ordered
>> comparisons could also be used for this purpose.
>
> Yes it can, but it would increase the number of kcmp calls significantly.

Actually it shouldn't. If you extend the kcmp argument to accept the
epollfd:epollslot pair, this would be effectively the same as if you had all
your epoll-ed files injected into your fdtable with "strange" fd numbers. We
already have a two-level rbtree for this in criu, adding an extended
("strange") fd to it should be OK.

> Look, here is how we build the files tree in criu: we take inode^sdev^pos
> as a primary key and remember it inside an rbtree while we're dumping files
> (note also that we don't keep files opened but rather dump them in
> chunks). Then once we find that two files have the same primary key
> we use kcmp to build a subtree. This really helps a lot. And I plan
> to do the same with target files from epolls:
>
> - they are going to be handled after all opened files of all processes
>   in the container (or child processes if dumping a single task),
>   thus the complete tree with primary keys will already be built
>
> - on every target file I calculate the primary key and then using
>   kcmp will find if the file is exactly the one matching
Re: [PATCH] Add pidfs filesystem
On 02/21/2017 05:57 PM, Oleg Nesterov wrote: > On 02/18, Alexey Gladkov wrote: >> >> This patch allows to mount only the part of /proc related to pids >> without the rest of the objects. Since this is an addon to /proc, flags applied to >> /proc have an effect on this pidfs filesystem. > > I leave this to you and Eric, but imo it would be nice to avoid another > filesystem. > >> Why not implement it as another flag to /proc ? >> >> The /proc flags are stored in the pid_namespace and are global for the >> namespace. It means that if you add a flag to hide all except the pids, >> then it will act on all mounted instances of /proc. > > But perhaps we can use mnt_flags? For example, lets abuse MNT_NODEV, see > the simple patch below. Not sure it is correct/complete, just to illustrate > the idea. > > With this patch you can mount proc with -onodev and it will only show > pids/self/thread_self: > > # mkdir /tmp/D > # mount -t proc -o nodev none /tmp/D > # ls /tmp/D > 1 11 13 15 17 19 20 22 24 28 3 31 33 4 56 7 9 > thread-self > 10 12 14 16 18 2 21 23 27 29 30 32 34 5 6 8 self > # cat /tmp/D/meminfo > cat: /tmp/D/meminfo: No such file or directory > # ls /tmp/D/irq > ls: cannot open directory /tmp/D/irq: No such file or directory > > No? Yes!!! If this whole effort with pidfs and overlayfs moves forward, I would prefer to see the nodev procfs version rather than another fs. As far as the overlayfs part is concerned, having an overlayfs mounted on /proc inside a container may result in problems, as applications sometimes check that /proc contains procfs (by checking statfs.f_type == PROC_SUPER_MAGIC or by reading /proc/mounts). -- Pavel > Oleg.
> > > --- a/fs/proc/generic.c > +++ b/fs/proc/generic.c > @@ -305,11 +305,22 @@ int proc_readdir_de(struct proc_dir_entry *de, struct > file *file, > > int proc_readdir(struct file *file, struct dir_context *ctx) > { > + int mnt_flags = file->f_path.mnt->mnt_flags; > struct inode *inode = file_inode(file); > > + if (mnt_flags & MNT_NODEV) > + return 1; > + > return proc_readdir_de(PDE(inode), file, ctx); > } > > +static int proc_dir_open(struct inode *inode, struct file *file) > +{ > + if (file->f_path.mnt->mnt_flags & MNT_NODEV) > + return -ENOENT; > + return 0; > +} > + > /* > * These are the generic /proc directory operations. They > * use the in-memory "struct proc_dir_entry" tree to parse > @@ -319,6 +330,7 @@ static const struct file_operations proc_dir_operations = > { > .llseek = generic_file_llseek, > .read = generic_read_dir, > .iterate_shared = proc_readdir, > + .open = proc_dir_open, > }; > > /* > --- a/fs/proc/inode.c > +++ b/fs/proc/inode.c > @@ -318,12 +318,16 @@ proc_reg_get_unmapped_area(struct file *file, unsigned > long orig_addr, > > static int proc_reg_open(struct inode *inode, struct file *file) > { > + int mnt_flags = file->f_path.mnt->mnt_flags; > struct proc_dir_entry *pde = PDE(inode); > int rv = 0; > int (*open)(struct inode *, struct file *); > int (*release)(struct inode *, struct file *); > struct pde_opener *pdeo; > > + if (mnt_flags & MNT_NODEV) > + return -ENOENT; > + > /* >* Ensure that >* 1) PDE's ->release hook will be called no matter what > > . >
Re: [CRIU] [PATCH net-next] tcp: allow to enable the repair mode for non-listening sockets
On 11/15/2016 05:15 AM, Andrei Vagin wrote: > The repair mode is used to get and restore sequence numbers and > data from queues. It is used to checkpoint/restore connections. > > Currently the repair mode can be enabled for sockets in the established > and closed states, but for other states we have to dump the same socket > properties, so let's allow enabling repair mode for these sockets. > > The repair mode reveals nothing more for sockets in other states. > > Signed-off-by: Andrei Vagin <ava...@openvz.org> Acked-by: Pavel Emelyanov <xe...@openvz.org> > --- > net/ipv4/tcp.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c > index 3251fe7..a2a3a8c 100644 > --- a/net/ipv4/tcp.c > +++ b/net/ipv4/tcp.c > @@ -2302,7 +2302,7 @@ EXPORT_SYMBOL(tcp_disconnect); > static inline bool tcp_can_repair_sock(const struct sock *sk) > { > return ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN) && > - ((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_ESTABLISHED)); > + (sk->sk_state != TCP_LISTEN); > } > > static int tcp_repair_set_window(struct tcp_sock *tp, char __user *optbuf, > int len) >
Re: [PATCH v2] locks: Filter /proc/locks output on proc pid ns
On 08/03/2016 05:17 PM, Nikolay Borisov wrote: > > > On 08/03/2016 04:46 PM, Jeff Layton wrote: >> On Wed, 2016-08-03 at 10:35 +0300, Nikolay Borisov wrote: >>> On busy container servers reading /proc/locks shows all the locks >>> created by all clients. This can cause large latency spikes. In my >>> case I observed lsof taking up to 5-10 seconds while processing around >>> 50k locks. Fix this by limiting the locks shown only to those created >>> in the same pidns as the one the proc was mounted in. When reading >>> /proc/locks from the init_pid_ns show everything. >>> Signed-off-by: Nikolay Borisov >>> --- >>> fs/locks.c | 6 ++++++ >>> 1 file changed, 6 insertions(+) >>> >>> diff --git a/fs/locks.c b/fs/locks.c >>> index ee1b15f6fc13..751673d7f7fc 100644 >>> --- a/fs/locks.c >>> +++ b/fs/locks.c >>> @@ -2648,9 +2648,15 @@ static int locks_show(struct seq_file *f, void *v) >>> { struct locks_iterator *iter = f->private; struct file_lock *fl, *bfl; + struct pid_namespace *proc_pidns = file_inode(f->file)->i_sb->s_fs_info; + struct pid_namespace *current_pidns = task_active_pid_ns(current); >>> fl = hlist_entry(v, struct file_lock, fl_link); >>> > + if ((current_pidns != &init_pid_ns) && fl->fl_nspid >> >> Ok, so when you read from a process that's in the init_pid_ns >> namespace, then you'll get the whole pile of locks, even when reading >> this from a filesystem that was mounted in a different pid_ns? >> >> That seems odd to me if so. Any reason not to just uniformly use the >> proc_pidns here? > > [CCing some people from openvz/CRIU] Thanks :) > My train of thought was "we should have means which would be the one > universal truth about everything and this would be a process in the > init_pid_ns". I don't have a strong preference as long as I'm not breaking > userspace. As I said before - I think the CRIU guys might be using that > interface. This particular change won't break us, mostly because we've switched to reading the /proc/pid/fdinfo/n files for locks.
-- Pavel >> > + && (proc_pidns != ns_of_pid(fl->fl_nspid))) + return 0; >>> + lock_get_status(f, fl, iter->li_pos, ""); >>> list_for_each_entry(bfl, &fl->fl_block, fl_block) >> > . >
Re: [PATCH 0/5] userfaultfd: extension for non cooperative uffd usage
On 03/20/2016 03:42 PM, Mike Rapoport wrote: > Hi, > > This set is to address the issues that appear in userfaultfd usage > scenarios when the task monitoring the uffd and the mm-owner do not > cooperate with each other on VM changes such as remaps, madvises and > fork()-s. > > The patches are essentially the same as in the previous respin [1], > they've just been rebased on the current tree. Hi, Andrea. Hopefully one day after LSFMM is a good time to try to get a bit of your attention to this set :) -- Pavel
Re: [PATCH 0/2] vfs: Define new syscall getumask.
On 04/13/2016 02:43 PM, Richard W.M. Jones wrote: > It's not possible to read the process umask without also modifying it, > which is what umask(2) does. A library cannot read umask safely, > especially if the main program might be multithreaded. > > This patch series adds a trivial system call "getumask" which returns > the umask of the current process. Ah! Thanks for this :) Acked-by: Pavel Emelyanov <xe...@virtuozzo.com>
Re: [PATCH 0/5] userfaultfd: extension for non cooperative uffd usage
On 03/20/2016 03:42 PM, Mike Rapoport wrote: > Hi, > > This set is to address the issues that appear in userfaultfd usage > scenarios when the task monitoring the uffd and the mm-owner do not > cooperate with each other on VM changes such as remaps, madvises and > fork()-s. > > The patches are essentially the same as in the previous respin [1], > they've just been rebased on the current tree. > > [1] http://thread.gmane.org/gmane.linux.kernel.mm/132662 Thanks, Mike! Acked-by: Pavel Emelyanov <xe...@virtuozzo.com>
Re: [regression] x86/signal/64: Fix SS handling for signals delivered to 64-bit programs breaks dosemu
On 08/14/2015 10:22 AM, Cyrill Gorcunov wrote: > On Thu, Aug 13, 2015 at 01:09:47PM -0700, Linus Torvalds wrote: >> On Thu, Aug 13, 2015 at 1:08 PM, Cyrill Gorcunov wrote: >>> >>> If only I'm not missing something obvious this should not hurt us. >>> But I gonna build test kernel and check to be sure tomorrow, ok? > > Managed to test it. And criu works fine with the revert as expected. > Actually it's because of commit c6f2062935c8 Oleg made us a patch: > > | commit 07dcf0dbb6ff97c255bc6b06569255a9479bccdd > | Author: Oleg Nesterov > | Date: Thu Mar 19 19:14:00 2015 +0300 > | > | restore/x86: restore_gpregs() needs to initialize ->ss as well > | > | Before the recent "x86_64,signal: Fix SS handling for signals delivered > | to 64-bit programs" kernel patch, sigreturn paths forgot to restore ->ss > | after return from the signal handler. > | > | Now that the kernel was fixed, restore_gpregs() has to initialize ->ss > | too, it is no longer ignored. > | ... > | > | +++ b/arch/x86/include/asm/restorer.h > | @@ -53,7 +53,7 @@ struct rt_sigcontext { > | unsigned short cs; > | unsigned short gs; > | unsigned short fs; > | - unsigned short __pad0; > | + unsigned short ss; > | unsigned long err; > | unsigned long trapno; > | unsigned long oldmask; > > IOW we've been not setting up __pad0 which became ss > inside the kernel (in result we've been passing 0 here, > which caused the problem). > > fwiw, we declare that new criu versions may require new > kernels to work but never promised that new criu gonna > be compatible with old kernels. We did. Kernel 3.11 is still declared as supported (modulo new stuff we added after 1.0 might not work, but basic sigframe mgmt is not one of those "new" things). > That said if something > get changed inside sigcontext structure in future > we may update criu code as well. > . 
> -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 1/1] mm: move ->mremap() from file_operations to vm_operations_struct
On 06/25/2015 11:45 PM, Oleg Nesterov wrote: > vma->vm_ops->mremap() looks more natural and clean in move_vma(), > and this way ->mremap() can have more users. Say, vdso. > > While at it, s/aio_ring_remap/aio_ring_mremap/. > > Signed-off-by: Oleg Nesterov Acked-by: Pavel Emelyanov
Re: [PATCH 1/1] mm: move ->mremap() from file_operations to vm_operations_struct
On 06/23/2015 09:02 PM, Oleg Nesterov wrote: > vma->vm_ops->mremap() looks more natural and clean in move_vma(), > and this way ->mremap() can have more users. Say, vdso. > > Signed-off-by: Oleg Nesterov Looks-good-to: /me :) Thanks!
Re: vdso && f_op->mremap (Was: special_mapping_fault() is broken)
On 06/23/2015 03:47 AM, Oleg Nesterov wrote: > On 06/21, Oleg Nesterov wrote: >> >> Forgot to add Andy... > > Add Pavel ;) > > I never understood why ->mremap() lives in file_operations, not in > vm_operations_struct. To me vma->vm_file->f_op in move_vma() just > looks strange, vma->vm_ops->mremap(new_vma) looks "obviously better". > > And afaics more useful. CRIU remaps vdso, but this does not update > mm->context.vdso. OK, probably this does not matter currently, CRIU > can't c/r the compat tasks, and 64-bit apps do not use context.vdso. Yup, context.vdso does not exist for all vdso-s. There should have been a patch for the power arch adding this change. > Afaics. Still, I think we might want to have special_mapping_remap() > and we can't do this because ->vm_file == NULL. For aio (currently the only mapping with an mremap callback) the vm_file is not NULL. > And perhaps other architectures can depend on the "correct" value > in mm->context.vdso more than x86, I dunno... > > In short. Shouldn't we move ->mremap() to vm_operations_struct before > it has another user? We need to fix aio.c, but this is trivial. > > No? I will be OK with this change :) -- Pavel
Re: [PATCH v5] seccomp: add ptrace options for suspend/resume
On 06/13/2015 06:02 PM, Tycho Andersen wrote: > This patch is the first step in enabling checkpoint/restore of processes > with seccomp enabled. > > One of the things CRIU does while dumping tasks is inject code into them > via ptrace to collect information that is only available to the process > itself. However, if we are in a seccomp mode where these processes are > prohibited from making these syscalls, then what CRIU does kills the task. > > This patch adds a new ptrace option, PTRACE_O_SUSPEND_SECCOMP, that enables > a task from the init user namespace which has CAP_SYS_ADMIN and no seccomp > filters to disable (and re-enable) seccomp filters for another task so that > they can be successfully dumped (and restored). We restrict the set of > processes that can disable seccomp through ptrace because although today > ptrace can be used to bypass seccomp, there is some discussion of closing > this loophole in the future and we would like this patch to not depend on > that behavior and be future proofed for when it is removed. > > Note that seccomp can be suspended before any filters are actually > installed; this behavior is useful on criu restore, so that we can suspend > seccomp, restore the filters, unmap our restore code from the restored > process' address space, and then resume the task by detaching and have the > filters resumed as well. > > v2 changes: > > * require that the tracer have no seccomp filters installed > * drop TIF_NOTSC manipulation from the patch > * change from ptrace command to a ptrace option and use this ptrace option > as the flag to check. This means that as soon as the tracer > detaches/dies, seccomp is re-enabled and as a corollary that one can not > disable seccomp across PTRACE_ATTACHs. 
> > v3 changes: > > * get rid of various #ifdefs everywhere > * report more sensible errors when PTRACE_O_SUSPEND_SECCOMP is incorrectly > used > > v4 changes: > > * get rid of may_suspend_seccomp() in favor of a capable() check in ptrace > directly > > v5 changes: > > * check that seccomp is not enabled (or suspended) on the tracer > > Signed-off-by: Tycho Andersen > CC: Kees Cook > CC: Andy Lutomirski > CC: Will Drewry > CC: Roland McGrath > CC: Oleg Nesterov > CC: Pavel Emelyanov > CC: Serge E. Hallyn Acked-by: Pavel Emelyanov
Re: [PATCH] seccomp: add ptrace commands for suspend/resume
>> +int suspend_seccomp(struct task_struct *task) >> +{ >> + int ret = -EACCES; >> + >> + spin_lock_irq(&task->sighand->siglock); >> + >> + if (!capable(CAP_SYS_ADMIN)) >> + goto out; > > I am puzzled ;) Why do we need ->siglock? And even if we need it, why > can't we check CAP_SYS_ADMIN lockless? > > And I am not sure I understand why do we need the additional security > check, but I leave this to you and Andy. > > If you have the rights to trace this task, then you can do anything > the tracee could do without the filtering. I think _this_ check is required, otherwise the seccomp-ed task (in filtered mode) forks a child, then this child ptrace-attaches to the parent (allowed) and then suspends its seccomp. And -- we have an unprivileged process de-seccomped. -- Pavel
Re: [PATCH 00/23] userfaultfd v4
On 05/14/2015 08:30 PM, Andrea Arcangeli wrote: > Hello everyone, > > This is the latest userfaultfd patchset against mm-v4.1-rc3 > 2015-05-14-10:04. > > The postcopy live migration feature on the qemu side is mostly ready > to be merged and it entirely depends on the userfaultfd syscall to be > merged as well. So it'd be great if this patchset could be reviewed > for merging in -mm. > > Userfaults allow to implement on demand paging from userland and more > generally they allow userland to more efficiently take control of the > behavior of page faults than what was available before > (PROT_NONE + SIGSEGV trap). Not to spam with 23 e-mails, all patches are Acked-by: Pavel Emelyanov Thanks! -- Pavel
[PATCH 3/5] uffd: Add fork() event
When the mm with uffd-ed vmas fork()-s, the respective vmas notify their uffds with an event which contains a descriptor with the new uffd. This new descriptor can then be used to get events from the child and populate its mm with data. Note that there can be different uffd-s controlling different vmas within one mm, so first we should collect all those uffds (and ctx-s) in a list and then notify them all one by one, but only once per fork(). The context is created at fork() time but the descriptor, file struct and anon inode object are created at event read time. So some trickery is added to userfaultfd_ctx_read() to handle the ctx queues' locking vs file creation. Another thing worth noticing is that the task that fork()-s waits for the uffd event to get processed WITHOUT the mmap sem. Signed-off-by: Pavel Emelyanov --- fs/userfaultfd.c | 137 +++ include/linux/userfaultfd_k.h| 12 include/uapi/linux/userfaultfd.h | 13 ++-- kernel/fork.c| 9 ++- 4 files changed, 163 insertions(+), 8 deletions(-) diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index 83acb19..202d0ac 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -62,6 +62,12 @@ struct userfaultfd_ctx { struct mm_struct *mm; }; +struct userfaultfd_fork_ctx { + struct userfaultfd_ctx *orig; + struct userfaultfd_ctx *new; + struct list_head list; +}; + struct userfaultfd_wait_queue { struct uffd_msg msg; wait_queue_t wq; @@ -432,6 +438,79 @@ static void userfaultfd_event_complete(struct userfaultfd_ctx *ctx, __remove_wait_queue(&ctx->event_wqh, &ewq->wq); } +int dup_userfaultfd(struct vm_area_struct *vma, struct list_head *fcs) +{ + struct userfaultfd_ctx *ctx = NULL, *octx; + struct userfaultfd_fork_ctx *fctx; + + octx = vma->vm_userfaultfd_ctx.ctx; + if (!octx || !(octx->features & UFFD_FEATURE_EVENT_FORK)) { + vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX; + vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING); + return 0; + } + + list_for_each_entry(fctx, fcs, list) + if (fctx->orig == octx) { + ctx = fctx->new; + break;
} + + if (!ctx) { + fctx = kmalloc(sizeof(*fctx), GFP_KERNEL); + if (!fctx) + return -ENOMEM; + + ctx = kmem_cache_alloc(userfaultfd_ctx_cachep, GFP_KERNEL); + if (!ctx) { + kfree(fctx); + return -ENOMEM; + } + + atomic_set(&ctx->refcount, 1); + ctx->flags = octx->flags; + ctx->state = UFFD_STATE_RUNNING; + ctx->features = octx->features; + ctx->released = false; + ctx->mm = vma->vm_mm; + atomic_inc(&ctx->mm->mm_users); + + userfaultfd_ctx_get(octx); + fctx->orig = octx; + fctx->new = ctx; + list_add_tail(&fctx->list, fcs); + } + + vma->vm_userfaultfd_ctx.ctx = ctx; + return 0; +} + +static int dup_fctx(struct userfaultfd_fork_ctx *fctx) +{ + struct userfaultfd_ctx *ctx = fctx->orig; + struct userfaultfd_wait_queue ewq; + + msg_init(&ewq.msg); + + ewq.msg.event = UFFD_EVENT_FORK; + ewq.msg.arg.reserved.reserved1 = (__u64)fctx->new; + + return userfaultfd_event_wait_completion(ctx, &ewq); +} + +void dup_userfaultfd_complete(struct list_head *fcs) +{ + int ret = 0; + struct userfaultfd_fork_ctx *fctx, *n; + + list_for_each_entry_safe(fctx, n, fcs, list) { + if (!ret) + ret = dup_fctx(fctx); + list_del(&fctx->list); + kfree(fctx); + } +} + static int userfaultfd_release(struct inode *inode, struct file *file) { struct userfaultfd_ctx *ctx = file->private_data; @@ -563,12 +642,47 @@ static unsigned int userfaultfd_poll(struct file *file, poll_table *wait) } } +static const struct file_operations userfaultfd_fops; + +static int resolve_userfault_fork(struct userfaultfd_ctx *ctx, + struct userfaultfd_ctx *new, struct uffd_msg *msg) +{ + int fd; + struct file *file; + + fd = get_unused_fd_flags(new->flags & UFFD_SHARED_FCNTL_FLAGS); + if (fd < 0) + return fd; + + file = anon_inode_getfile("[userfaultfd]", &userfaultfd_fops, new, + O_RDWR | (new->flags & UFFD_SHARED_FCNTL_FLAGS)); + if (IS_ERR(file)) { + put_unused_fd(fd); + return PTR_ERR(file); + } + + fd_install(fd, file); + msg->arg.reserved.reserved1 = 0; + msg->arg.fork.ufd = fd; + + return 0; +} + static ssize
[PATCH 5/5] uffd: Add madvise() event for MADV_DONTNEED request
If the page is punched out of the address space, the uffd reader should know this and zeromap the respective area in case of the #PF event. Signed-off-by: Pavel Emelyanov --- fs/userfaultfd.c | 26 ++ include/linux/userfaultfd_k.h| 10 ++ include/uapi/linux/userfaultfd.h | 9 - mm/madvise.c | 2 ++ 4 files changed, 46 insertions(+), 1 deletion(-) diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index 697e636..6e80a02 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -546,6 +546,32 @@ void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx vm_ctx, unsigned long userfaultfd_event_wait_completion(ctx, &ewq); } +void madvise_userfault_dontneed(struct vm_area_struct *vma, + struct vm_area_struct **prev, + unsigned long start, unsigned long end) +{ + struct userfaultfd_ctx *ctx; + struct userfaultfd_wait_queue ewq; + + ctx = vma->vm_userfaultfd_ctx.ctx; + if (!ctx || !(ctx->features & UFFD_FEATURE_EVENT_MADVDONTNEED)) + return; + + userfaultfd_ctx_get(ctx); + *prev = NULL; /* We wait for ACK w/o the mmap semaphore */ + up_read(&vma->vm_mm->mmap_sem); + + msg_init(&ewq.msg); + + ewq.msg.event = UFFD_EVENT_MADVDONTNEED; + ewq.msg.arg.madv_dn.start = start; + ewq.msg.arg.madv_dn.end = end; + + userfaultfd_event_wait_completion(ctx, &ewq); + + down_read(&vma->vm_mm->mmap_sem); +} + static int userfaultfd_release(struct inode *inode, struct file *file) { struct userfaultfd_ctx *ctx = file->private_data; diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h index 0ed5dce..2f72069 100644 --- a/include/linux/userfaultfd_k.h +++ b/include/linux/userfaultfd_k.h @@ -82,6 +82,10 @@ extern void mremap_userfaultfd_prep(struct vm_area_struct *, struct vm_userfault extern void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx, unsigned long from, unsigned long to, unsigned long len); +extern void madvise_userfault_dontneed(struct vm_area_struct *vma, + struct vm_area_struct **prev, unsigned long start, + unsigned long end); + #else /* CONFIG_USERFAULTFD */ /* mm helpers */ @@ -131,6
+135,12 @@ static inline void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx ctx, unsigned long from, unsigned long to, unsigned long len) { } + +static inline void madvise_userfault_dontneed(struct vm_area_struct *vma, + struct vm_area_struct **prev, unsigned long start, + unsigned long end) +{ +} #endif /* CONFIG_USERFAULTFD */ #endif /* _LINUX_USERFAULTFD_K_H */ diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h index 59a141c..d0d1ef1 100644 --- a/include/uapi/linux/userfaultfd.h +++ b/include/uapi/linux/userfaultfd.h @@ -14,7 +14,7 @@ * After implementing the respective features it will become: * #define UFFD_API_FEATURES (UFFD_FEATURE_PAGEFAULT_FLAG_WP) */ -#define UFFD_API_FEATURES (UFFD_FEATURE_EVENT_FORK|UFFD_FEATURE_EVENT_REMAP) +#define UFFD_API_FEATURES (UFFD_FEATURE_EVENT_FORK|UFFD_FEATURE_EVENT_REMAP|UFFD_FEATURE_EVENT_MADVDONTNEED) #define UFFD_API_IOCTLS\ ((__u64)1 << _UFFDIO_REGISTER | \ (__u64)1 << _UFFDIO_UNREGISTER | \ @@ -79,6 +79,11 @@ struct uffd_msg { } remap; struct { + __u64 start; + __u64 end; + } madv_dn; + + struct { /* unused reserved fields */ __u64 reserved1; __u64 reserved2; @@ -93,6 +98,7 @@ struct uffd_msg { #define UFFD_EVENT_PAGEFAULT 0x12 #define UFFD_EVENT_FORK0x13 #define UFFD_EVENT_REMAP 0x14 +#define UFFD_EVENT_MADVDONTNEED0x15 /* flags for UFFD_EVENT_PAGEFAULT */ #define UFFD_PAGEFAULT_FLAG_WRITE (1<<0) /* If this was a write fault */ @@ -116,6 +122,7 @@ struct uffdio_api { #endif #define UFFD_FEATURE_EVENT_FORK(1<<1) #define UFFD_FEATURE_EVENT_REMAP (1<<2) +#define UFFD_FEATURE_EVENT_MADVDONTNEED(1<<3) __u64 features; __u64 ioctls; diff --git a/mm/madvise.c b/mm/madvise.c index 10f62b7..3ea20e2 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -10,6 +10,7 @@ #include #include #include +#include #include #include #include @@ -283,6 +284,7 @@ static long madvise_dontneed(struct vm_area_struct *vma, return -EINVAL; zap_page_range(vma, start, end - start, NULL); + 
madvise_userfault_dontneed(vma, prev, start, end); return 0; } -- 1.9.3
[PATCH 4/5] uffd: Add mremap() event
The event denotes that an area [start:end] moves to a different location. A length change isn't reported: if "new" addresses appear on the uffd reader side, they will not contain any data and the latter can just zeromap them. Waiting for the event ACK is also done outside of the mmap sem, as for the fork event. Signed-off-by: Pavel Emelyanov --- fs/userfaultfd.c | 35 +++ include/linux/userfaultfd_k.h| 12 include/uapi/linux/userfaultfd.h | 10 +- mm/mremap.c | 31 --- 4 files changed, 76 insertions(+), 12 deletions(-) diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index 202d0ac..697e636 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -511,6 +511,41 @@ void dup_userfaultfd_complete(struct list_head *fcs) } } +void mremap_userfaultfd_prep(struct vm_area_struct *vma, struct vm_userfaultfd_ctx *vm_ctx) +{ + struct userfaultfd_ctx *ctx; + + ctx = vma->vm_userfaultfd_ctx.ctx; + if (ctx && (ctx->features & UFFD_FEATURE_EVENT_REMAP)) { + vm_ctx->ctx = ctx; + userfaultfd_ctx_get(ctx); + } +} + +void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx vm_ctx, unsigned long from, + unsigned long to, unsigned long len) +{ + struct userfaultfd_ctx *ctx = vm_ctx.ctx; + struct userfaultfd_wait_queue ewq; + + if (ctx == NULL) + return; + + if (to & ~PAGE_MASK) { + userfaultfd_ctx_put(ctx); + return; + } + + msg_init(&ewq.msg); + + ewq.msg.event = UFFD_EVENT_REMAP; + ewq.msg.arg.remap.from = from; + ewq.msg.arg.remap.to = to; + ewq.msg.arg.remap.len = len; + + userfaultfd_event_wait_completion(ctx, &ewq); +} + static int userfaultfd_release(struct inode *inode, struct file *file) { struct userfaultfd_ctx *ctx = file->private_data; diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h index 44827f7..0ed5dce 100644 --- a/include/linux/userfaultfd_k.h +++ b/include/linux/userfaultfd_k.h @@ -78,6 +78,10 @@ static inline bool userfaultfd_armed(struct vm_area_struct *vma) extern int dup_userfaultfd(struct vm_area_struct *, struct list_head *); extern void
dup_userfaultfd_complete(struct list_head *); +extern void mremap_userfaultfd_prep(struct vm_area_struct *, struct vm_userfaultfd_ctx *); +extern void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx, unsigned long from, + unsigned long to, unsigned long len); + #else /* CONFIG_USERFAULTFD */ /* mm helpers */ @@ -119,6 +123,14 @@ static inline void dup_userfaultfd_complete(struct list_head *) { } +static inline void mremap_userfaultfd_prep(struct vm_area_struct *vma, struct vm_userfaultfd_ctx *ctx) +{ +} + +static inline void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx ctx, + unsigned long from, unsigned long to, unsigned long len) +{ +} #endif /* CONFIG_USERFAULTFD */ #endif /* _LINUX_USERFAULTFD_K_H */ diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h index fb9ced2..59a141c 100644 --- a/include/uapi/linux/userfaultfd.h +++ b/include/uapi/linux/userfaultfd.h @@ -14,7 +14,7 @@ * After implementing the respective features it will become: * #define UFFD_API_FEATURES (UFFD_FEATURE_PAGEFAULT_FLAG_WP) */ -#define UFFD_API_FEATURES (UFFD_FEATURE_EVENT_FORK) +#define UFFD_API_FEATURES (UFFD_FEATURE_EVENT_FORK|UFFD_FEATURE_EVENT_REMAP) #define UFFD_API_IOCTLS\ ((__u64)1 << _UFFDIO_REGISTER | \ (__u64)1 << _UFFDIO_UNREGISTER | \ @@ -73,6 +73,12 @@ struct uffd_msg { } fork; struct { + __u64 from; + __u64 to; + __u64 len; + } remap; + + struct { /* unused reserved fields */ __u64 reserved1; __u64 reserved2; @@ -86,6 +92,7 @@ struct uffd_msg { */ #define UFFD_EVENT_PAGEFAULT 0x12 #define UFFD_EVENT_FORK0x13 +#define UFFD_EVENT_REMAP 0x14 /* flags for UFFD_EVENT_PAGEFAULT */ #define UFFD_PAGEFAULT_FLAG_WRITE (1<<0) /* If this was a write fault */ @@ -108,6 +115,7 @@ struct uffdio_api { #define UFFD_FEATURE_PAGEFAULT_FLAG_WP (1<<0) #endif #define UFFD_FEATURE_EVENT_FORK(1<<1) +#define UFFD_FEATURE_EVENT_REMAP (1<<2) __u64 features; __u64 ioctls; diff --git a/mm/mremap.c b/mm/mremap.c index 034e2d3..a63287d 100644 --- a/mm/mremap.c 
+++ b/mm/mremap.c @@ -22,6 +22,7 @@ #include #include #include +#include #include #include @@ -234,7 +235,8 @@ unsigned long move_page_tables(struct vm_are
[PATCH 2/5] uffd: Add ability to report non-PF events from uffd descriptor
The custom events are queued in ctx->event_wqh not to disturb the fast-path-ed PF queue-wait-wakeup functions. The events to be generated (other than PF-s) are requested in the UFFD_API ioctl with the uffd_api.features bits. Those known by the kernel are then turned on and reported back to the user-space. Signed-off-by: Pavel Emelyanov --- fs/userfaultfd.c | 97 ++-- 1 file changed, 95 insertions(+), 2 deletions(-) diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index c593a72..83acb19 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -12,6 +12,7 @@ * mm/ksm.c (mm hashing). */ +#include <linux/list.h> #include <linux/hashtable.h> #include <linux/sched.h> #include <linux/mm.h> @@ -43,12 +44,16 @@ struct userfaultfd_ctx { wait_queue_head_t fault_pending_wqh; /* waitqueue head for the userfaults */ wait_queue_head_t fault_wqh; + /* waitqueue head for events */ + wait_queue_head_t event_wqh; /* waitqueue head for the pseudo fd to wakeup poll/read */ wait_queue_head_t fd_wqh; /* pseudo fd refcounting */ atomic_t refcount; /* userfaultfd syscall flags */ unsigned int flags; + /* features requested from the userspace */ + unsigned int features; /* state machine */ enum userfaultfd_state state; /* released */ @@ -120,6 +125,8 @@ static void userfaultfd_ctx_put(struct userfaultfd_ctx *ctx) VM_BUG_ON(waitqueue_active(&ctx->fault_pending_wqh)); VM_BUG_ON(spin_is_locked(&ctx->fault_wqh.lock)); VM_BUG_ON(waitqueue_active(&ctx->fault_wqh)); + VM_BUG_ON(spin_is_locked(&ctx->event_wqh.lock)); + VM_BUG_ON(waitqueue_active(&ctx->event_wqh)); VM_BUG_ON(spin_is_locked(&ctx->fd_wqh.lock)); VM_BUG_ON(waitqueue_active(&ctx->fd_wqh)); mmput(ctx->mm); @@ -373,6 +380,58 @@ out: return ret; } +static int __maybe_unused userfaultfd_event_wait_completion(struct userfaultfd_ctx *ctx, + struct userfaultfd_wait_queue *ewq) +{ + int ret = 0; + + ewq->ctx = ctx; + init_waitqueue_entry(&ewq->wq, current); + + spin_lock(&ctx->event_wqh.lock); + /* +* After the __add_wait_queue the uwq is visible to userland +* through poll/read().
+*/ + __add_wait_queue(&ctx->event_wqh, &ewq->wq); + for (;;) { + set_current_state(TASK_KILLABLE); + if (ewq->msg.event == 0) + break; + if (ACCESS_ONCE(ctx->released) || + fatal_signal_pending(current)) { + ret = -1; + __remove_wait_queue(&ctx->event_wqh, &ewq->wq); + break; + } + + spin_unlock(&ctx->event_wqh.lock); + + wake_up_poll(&ctx->fd_wqh, POLLIN); + schedule(); + + spin_lock(&ctx->event_wqh.lock); + } + __set_current_state(TASK_RUNNING); + spin_unlock(&ctx->event_wqh.lock); + + /* +* ctx may go away after this if the userfault pseudo fd is +* already released. +*/ + + userfaultfd_ctx_put(ctx); + return ret; +} + +static void userfaultfd_event_complete(struct userfaultfd_ctx *ctx, + struct userfaultfd_wait_queue *ewq) +{ + ewq->msg.event = 0; + wake_up_locked(&ctx->event_wqh); + __remove_wait_queue(&ctx->event_wqh, &ewq->wq); +} + static int userfaultfd_release(struct inode *inode, struct file *file) { struct userfaultfd_ctx *ctx = file->private_data; @@ -458,6 +517,12 @@ static inline struct userfaultfd_wait_queue *find_userfault( return find_userfault_in(&ctx->fault_pending_wqh); } +static inline struct userfaultfd_wait_queue *find_userfault_evt( + struct userfaultfd_ctx *ctx) +{ + return find_userfault_in(&ctx->event_wqh); +} + static unsigned int userfaultfd_poll(struct file *file, poll_table *wait) { struct userfaultfd_ctx *ctx = file->private_data; @@ -489,6 +554,9 @@ static unsigned int userfaultfd_poll(struct file *file, poll_table *wait) smp_mb(); if (waitqueue_active(&ctx->fault_pending_wqh)) ret = POLLIN; + else if (waitqueue_active(&ctx->event_wqh)) + ret = POLLIN; + return ret; default: BUG(); @@ -528,6 +596,19 @@ static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait, break; } spin_unlock(&ctx->fault_pending_wqh.lock); + + spin_lock(&ctx->event_wqh.lock); + uwq = find_userfault_evt(ctx); + if (uwq) { + *msg = uwq->msg; + + userfaultfd_event_complete(ctx, uwq); +
[PATCH 1/5] uffd: Split the find_userfault() routine
I will need one to look up userfaultfd_wait_queue-s in a different wait queue. Signed-off-by: Pavel Emelyanov --- fs/userfaultfd.c | 18 -- 1 file changed, 12 insertions(+), 6 deletions(-) diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index c89e96f..c593a72 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -433,23 +433,29 @@ static int userfaultfd_release(struct inode *inode, struct file *file) } /* fault_pending_wqh.lock must be held by the caller */ -static inline struct userfaultfd_wait_queue *find_userfault( - struct userfaultfd_ctx *ctx) +static inline struct userfaultfd_wait_queue *find_userfault_in( + wait_queue_head_t *wqh) { wait_queue_t *wq; struct userfaultfd_wait_queue *uwq; - VM_BUG_ON(!spin_is_locked(&ctx->fault_pending_wqh.lock)); + VM_BUG_ON(!spin_is_locked(&wqh->lock)); uwq = NULL; - if (!waitqueue_active(&ctx->fault_pending_wqh)) + if (!waitqueue_active(wqh)) goto out; /* walk in reverse to provide FIFO behavior to read userfaults */ - wq = list_last_entry(&ctx->fault_pending_wqh.task_list, -typeof(*wq), task_list); + wq = list_last_entry(&wqh->task_list, typeof(*wq), task_list); uwq = container_of(wq, struct userfaultfd_wait_queue, wq); out: return uwq; + +} + +static inline struct userfaultfd_wait_queue *find_userfault( + struct userfaultfd_ctx *ctx) +{ + return find_userfault_in(&ctx->fault_pending_wqh); } static unsigned int userfaultfd_poll(struct file *file, poll_table *wait) -- 1.9.3
[PATCH 0/5] UserfaultFD: Extension for non cooperative uffd usage (v2)
Hi, This set is to address the issues that appear in userfaultfd usage scenarios when the task monitoring the uffd and the mm-owner do not cooperate with each other on VM changes such as remaps, madvises and fork()-s. This is the set re-based on the recent userfaultfd branch; two major changes are: * No need for a separate API version, the uffd_msg introduced in the current code and the UFFD_API ioctl are enough for the needed extensions * Two events added -- for mremap() and madvise() MADV_DONTNEED More details about the particular events are in patches 3 through 5. Comments and suggestions are warmly welcome :) The v1 discussion thread is here: https://lkml.org/lkml/2015/3/18/729 Thanks, Pavel
[PATCH 3/5] uffd: Add fork() event
When the mm with uffd-ed vmas fork()-s the respective vmas notify their uffds with the event which contains a descriptor with new uffd. This new descriptor can then be used to get events from the child and populate its mm with data. Note, that there can be different uffd-s controlling different vmas within one mm, so first we should collect all those uffds (and ctx-s) in a list and then notify them all one by one but only once per fork(). The context is created at fork() time but the descriptor, file struct and anon inode object is created at event read time. So some trickery is added to the userfaultfd_ctx_read() to handle the ctx queues' locking vs file creation. Another thing worth noticing is that the task that fork()-s waits for the uffd event to get processed WITHOUT the mmap sem. Signed-off-by: Pavel Emelyanov xe...@parallels.com --- fs/userfaultfd.c | 137 +++ include/linux/userfaultfd_k.h| 12 include/uapi/linux/userfaultfd.h | 13 ++-- kernel/fork.c| 9 ++- 4 files changed, 163 insertions(+), 8 deletions(-) diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index 83acb19..202d0ac 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -62,6 +62,12 @@ struct userfaultfd_ctx { struct mm_struct *mm; }; +struct userfaultfd_fork_ctx { + struct userfaultfd_ctx *orig; + struct userfaultfd_ctx *new; + struct list_head list; +}; + struct userfaultfd_wait_queue { struct uffd_msg msg; wait_queue_t wq; @@ -432,6 +438,79 @@ static void userfaultfd_event_complete(struct userfaultfd_ctx *ctx, __remove_wait_queue(ctx-event_wqh, ewq-wq); } +int dup_userfaultfd(struct vm_area_struct *vma, struct list_head *fcs) +{ + struct userfaultfd_ctx *ctx = NULL, *octx; + struct userfaultfd_fork_ctx *fctx; + + octx = vma-vm_userfaultfd_ctx.ctx; + if (!octx || !(octx-features UFFD_FEATURE_EVENT_FORK)) { + vma-vm_userfaultfd_ctx = NULL_VM_UFFD_CTX; + vma-vm_flags = ~(VM_UFFD_WP | VM_UFFD_MISSING); + return 0; + } + + list_for_each_entry(fctx, fcs, list) + if (fctx-orig == octx) { + ctx = 
fctx-new; + break; + } + + if (!ctx) { + fctx = kmalloc(sizeof(*fctx), GFP_KERNEL); + if (!fctx) + return -ENOMEM; + + ctx = kmem_cache_alloc(userfaultfd_ctx_cachep, GFP_KERNEL); + if (!ctx) { + kfree(fctx); + return -ENOMEM; + } + + atomic_set(ctx-refcount, 1); + ctx-flags = octx-flags; + ctx-state = UFFD_STATE_RUNNING; + ctx-features = octx-features; + ctx-released = false; + ctx-mm = vma-vm_mm; + atomic_inc(ctx-mm-mm_users); + + userfaultfd_ctx_get(octx); + fctx-orig = octx; + fctx-new = ctx; + list_add_tail(fctx-list, fcs); + } + + vma-vm_userfaultfd_ctx.ctx = ctx; + return 0; +} + +static int dup_fctx(struct userfaultfd_fork_ctx *fctx) +{ + struct userfaultfd_ctx *ctx = fctx-orig; + struct userfaultfd_wait_queue ewq; + + msg_init(ewq.msg); + + ewq.msg.event = UFFD_EVENT_FORK; + ewq.msg.arg.reserved.reserved1 = (__u64)fctx-new; + + return userfaultfd_event_wait_completion(ctx, ewq); +} + +void dup_userfaultfd_complete(struct list_head *fcs) +{ + int ret = 0; + struct userfaultfd_fork_ctx *fctx, *n; + + list_for_each_entry_safe(fctx, n, fcs, list) { + if (!ret) + ret = dup_fctx(fctx); + list_del(fctx-list); + kfree(fctx); + } +} + static int userfaultfd_release(struct inode *inode, struct file *file) { struct userfaultfd_ctx *ctx = file-private_data; @@ -563,12 +642,47 @@ static unsigned int userfaultfd_poll(struct file *file, poll_table *wait) } } +static const struct file_operations userfaultfd_fops; + +static int resolve_userfault_fork(struct userfaultfd_ctx *ctx, + struct userfaultfd_ctx *new, struct uffd_msg *msg) +{ + int fd; + struct file *file; + + fd = get_unused_fd_flags(new-flags UFFD_SHARED_FCNTL_FLAGS); + if (fd 0) + return fd; + + file = anon_inode_getfile([userfaultfd], userfaultfd_fops, new, + O_RDWR | (new-flags UFFD_SHARED_FCNTL_FLAGS)); + if (IS_ERR(file)) { + put_unused_fd(fd); + return PTR_ERR(file); + } + + fd_install(fd, file); + msg-arg.reserved.reserved1 = 0; + msg-arg.fork.ufd = fd; + + return 0; +} + static ssize_t 
userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait
[PATCH 0/5] UserfaultFD: Extension for non cooperative uffd usage (v2)
Hi, This set is to address the issues that appear in userfaultfd usage scenarios when the task monitoring the uffd and the mm-owner do not cooperate to each other on VM changes such as remaps, madvises and fork()-s. This is the re-based set on the recent userfaultfd branch, two major changes are: * No need in separate API version, the uffd_msg introduced in the current code and UFFD_API ioctl are enough for the needed extentions * Two events added -- for mremap() and madvise() MADV_DONTNEED More details about the particular events are in patches 3 trough 4. Comments and suggestion are warmly welcome :) The v1 discussion thread is here: https://lkml.org/lkml/2015/3/18/729 Thanks, Pavel -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/5] uffd: Split the find_userfault() routine
I will need one to lookup for userfaultfd_wait_queue-s in different wait queue. Signed-off-by: Pavel Emelyanov xe...@parallels.com --- fs/userfaultfd.c | 18 -- 1 file changed, 12 insertions(+), 6 deletions(-) diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index c89e96f..c593a72 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -433,23 +433,29 @@ static int userfaultfd_release(struct inode *inode, struct file *file) } /* fault_pending_wqh.lock must be hold by the caller */ -static inline struct userfaultfd_wait_queue *find_userfault( - struct userfaultfd_ctx *ctx) +static inline struct userfaultfd_wait_queue *find_userfault_in( + wait_queue_head_t *wqh) { wait_queue_t *wq; struct userfaultfd_wait_queue *uwq; - VM_BUG_ON(!spin_is_locked(ctx-fault_pending_wqh.lock)); + VM_BUG_ON(!spin_is_locked(wqh-lock)); uwq = NULL; - if (!waitqueue_active(ctx-fault_pending_wqh)) + if (!waitqueue_active(wqh)) goto out; /* walk in reverse to provide FIFO behavior to read userfaults */ - wq = list_last_entry(ctx-fault_pending_wqh.task_list, -typeof(*wq), task_list); + wq = list_last_entry(wqh-task_list, typeof(*wq), task_list); uwq = container_of(wq, struct userfaultfd_wait_queue, wq); out: return uwq; + +} + +static inline struct userfaultfd_wait_queue *find_userfault( + struct userfaultfd_ctx *ctx) +{ + return find_userfault_in(ctx-fault_pending_wqh); } static unsigned int userfaultfd_poll(struct file *file, poll_table *wait) -- 1.9.3 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/5] uffd: Add ability to report non-PF events from uffd descriptor
The custom events are queued in ctx-event_wqh not to disturb the fast-path-ed PF queue-wait-wakeup functions. The events to be generated (other than PF-s) are requested in UFFD_API ioctl with the uffd_api.features bits. Those, known by the kernel, are then turned on and reported back to the user-space. Signed-off-by: Pavel Emelyanov xe...@parallels.com --- fs/userfaultfd.c | 97 ++-- 1 file changed, 95 insertions(+), 2 deletions(-) diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index c593a72..83acb19 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -12,6 +12,7 @@ * mm/ksm.c (mm hashing). */ +#include linux/list.h #include linux/hashtable.h #include linux/sched.h #include linux/mm.h @@ -43,12 +44,16 @@ struct userfaultfd_ctx { wait_queue_head_t fault_pending_wqh; /* waitqueue head for the userfaults */ wait_queue_head_t fault_wqh; + /* waitqueue head for events */ + wait_queue_head_t event_wqh; /* waitqueue head for the pseudo fd to wakeup poll/read */ wait_queue_head_t fd_wqh; /* pseudo fd refcounting */ atomic_t refcount; /* userfaultfd syscall flags */ unsigned int flags; + /* features requested from the userspace */ + unsigned int features; /* state machine */ enum userfaultfd_state state; /* released */ @@ -120,6 +125,8 @@ static void userfaultfd_ctx_put(struct userfaultfd_ctx *ctx) VM_BUG_ON(waitqueue_active(ctx-fault_pending_wqh)); VM_BUG_ON(spin_is_locked(ctx-fault_wqh.lock)); VM_BUG_ON(waitqueue_active(ctx-fault_wqh)); + VM_BUG_ON(spin_is_locked(ctx-event_wqh.lock)); + VM_BUG_ON(waitqueue_active(ctx-event_wqh)); VM_BUG_ON(spin_is_locked(ctx-fd_wqh.lock)); VM_BUG_ON(waitqueue_active(ctx-fd_wqh)); mmput(ctx-mm); @@ -373,6 +380,58 @@ out: return ret; } +static int __maybe_unused userfaultfd_event_wait_completion(struct userfaultfd_ctx *ctx, + struct userfaultfd_wait_queue *ewq) +{ + int ret = 0; + + ewq-ctx = ctx; + init_waitqueue_entry(ewq-wq, current); + + spin_lock(ctx-event_wqh.lock); + /* +* After the __add_wait_queue the uwq is visible to 
userland +* through poll/read(). +*/ + __add_wait_queue(ctx-event_wqh, ewq-wq); + for (;;) { + set_current_state(TASK_KILLABLE); + if (ewq-msg.event == 0) + break; + if (ACCESS_ONCE(ctx-released) || + fatal_signal_pending(current)) { + ret = -1; + __remove_wait_queue(ctx-event_wqh, ewq-wq); + break; + } + + spin_unlock(ctx-event_wqh.lock); + + wake_up_poll(ctx-fd_wqh, POLLIN); + schedule(); + + spin_lock(ctx-event_wqh.lock); + } + __set_current_state(TASK_RUNNING); + spin_unlock(ctx-event_wqh.lock); + + /* +* ctx may go away after this if the userfault pseudo fd is +* already released. +*/ + + userfaultfd_ctx_put(ctx); + return ret; +} + +static void userfaultfd_event_complete(struct userfaultfd_ctx *ctx, + struct userfaultfd_wait_queue *ewq) +{ + ewq-msg.event = 0; + wake_up_locked(ctx-event_wqh); + __remove_wait_queue(ctx-event_wqh, ewq-wq); +} + static int userfaultfd_release(struct inode *inode, struct file *file) { struct userfaultfd_ctx *ctx = file-private_data; @@ -458,6 +517,12 @@ static inline struct userfaultfd_wait_queue *find_userfault( return find_userfault_in(ctx-fault_pending_wqh); } +static inline struct userfaultfd_wait_queue *find_userfault_evt( + struct userfaultfd_ctx *ctx) +{ + return find_userfault_in(ctx-event_wqh); +} + static unsigned int userfaultfd_poll(struct file *file, poll_table *wait) { struct userfaultfd_ctx *ctx = file-private_data; @@ -489,6 +554,9 @@ static unsigned int userfaultfd_poll(struct file *file, poll_table *wait) smp_mb(); if (waitqueue_active(ctx-fault_pending_wqh)) ret = POLLIN; + else if (waitqueue_active(ctx-event_wqh)) + ret = POLLIN; + return ret; default: BUG(); @@ -528,6 +596,19 @@ static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait, break; } spin_unlock(ctx-fault_pending_wqh.lock); + + spin_lock(ctx-event_wqh.lock); + uwq = find_userfault_evt(ctx); + if (uwq) { + *msg = uwq-msg; + + userfaultfd_event_complete(ctx, uwq
[PATCH 4/5] uffd: Add mremap() event
The event denotes that an area [start:end] moves to a different location. The length change isn't reported, as the new addresses, if they appear on the uffd reader's side, will not contain any data and the latter can just zeromap them.

Waiting for the event ACK is also done outside of mmap sem, as for the fork event.

Signed-off-by: Pavel Emelyanov <xe...@parallels.com>
---
 fs/userfaultfd.c | 35 +++
 include/linux/userfaultfd_k.h| 12 
 include/uapi/linux/userfaultfd.h | 10 +-
 mm/mremap.c | 31 ---
 4 files changed, 76 insertions(+), 12 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 202d0ac..697e636 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -511,6 +511,41 @@ void dup_userfaultfd_complete(struct list_head *fcs)
 	}
 }
 
+void mremap_userfaultfd_prep(struct vm_area_struct *vma,
+			     struct vm_userfaultfd_ctx *vm_ctx)
+{
+	struct userfaultfd_ctx *ctx;
+
+	ctx = vma->vm_userfaultfd_ctx.ctx;
+	if (ctx && (ctx->features & UFFD_FEATURE_EVENT_REMAP)) {
+		vm_ctx->ctx = ctx;
+		userfaultfd_ctx_get(ctx);
+	}
+}
+
+void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx vm_ctx,
+				 unsigned long from, unsigned long to,
+				 unsigned long len)
+{
+	struct userfaultfd_ctx *ctx = vm_ctx.ctx;
+	struct userfaultfd_wait_queue ewq;
+
+	if (ctx == NULL)
+		return;
+
+	if (to & ~PAGE_MASK) {
+		userfaultfd_ctx_put(ctx);
+		return;
+	}
+
+	msg_init(&ewq.msg);
+
+	ewq.msg.event = UFFD_EVENT_REMAP;
+	ewq.msg.arg.remap.from = from;
+	ewq.msg.arg.remap.to = to;
+	ewq.msg.arg.remap.len = len;
+
+	userfaultfd_event_wait_completion(ctx, &ewq);
+}
+
 static int userfaultfd_release(struct inode *inode, struct file *file)
 {
 	struct userfaultfd_ctx *ctx = file->private_data;
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 44827f7..0ed5dce 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -78,6 +78,10 @@ static inline bool userfaultfd_armed(struct vm_area_struct *vma)
 extern int dup_userfaultfd(struct vm_area_struct *, struct list_head *);
 extern void dup_userfaultfd_complete(struct list_head *);
 
+extern void mremap_userfaultfd_prep(struct vm_area_struct *,
+				    struct vm_userfaultfd_ctx *);
+extern void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx,
+					unsigned long from, unsigned long to,
+					unsigned long len);
+
 #else /* CONFIG_USERFAULTFD */
 
 /* mm helpers */
@@ -119,6 +123,14 @@ static inline void dup_userfaultfd_complete(struct list_head *)
 {
 }
 
+static inline void mremap_userfaultfd_prep(struct vm_area_struct *vma,
+					   struct vm_userfaultfd_ctx *ctx)
+{
+}
+
+static inline void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx ctx,
+					       unsigned long from, unsigned long to,
+					       unsigned long len)
+{
+}
 #endif /* CONFIG_USERFAULTFD */
 
 #endif /* _LINUX_USERFAULTFD_K_H */
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index fb9ced2..59a141c 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -14,7 +14,7 @@
  * After implementing the respective features it will become:
  * #define UFFD_API_FEATURES (UFFD_FEATURE_PAGEFAULT_FLAG_WP)
  */
-#define UFFD_API_FEATURES (UFFD_FEATURE_EVENT_FORK)
+#define UFFD_API_FEATURES (UFFD_FEATURE_EVENT_FORK|UFFD_FEATURE_EVENT_REMAP)
 #define UFFD_API_IOCTLS				\
 	((__u64)1 << _UFFDIO_REGISTER |		\
	 (__u64)1 << _UFFDIO_UNREGISTER |	\
@@ -73,6 +73,12 @@ struct uffd_msg {
 		} fork;
 
 		struct {
+			__u64	from;
+			__u64	to;
+			__u64	len;
+		} remap;
+
+		struct {
 			/* unused reserved fields */
 			__u64	reserved1;
 			__u64	reserved2;
@@ -86,6 +92,7 @@ struct uffd_msg {
  */
 #define UFFD_EVENT_PAGEFAULT	0x12
 #define UFFD_EVENT_FORK		0x13
+#define UFFD_EVENT_REMAP	0x14
 
 /* flags for UFFD_EVENT_PAGEFAULT */
 #define UFFD_PAGEFAULT_FLAG_WRITE	(1<<0)	/* If this was a write fault */
@@ -108,6 +115,7 @@ struct uffdio_api {
 #define UFFD_FEATURE_PAGEFAULT_FLAG_WP	(1<<0)
 #endif
 #define UFFD_FEATURE_EVENT_FORK		(1<<1)
+#define UFFD_FEATURE_EVENT_REMAP	(1<<2)
 	__u64 features;
 
 	__u64 ioctls;
diff --git a/mm/mremap.c b/mm/mremap.c
index 034e2d3..a63287d 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -22,6 +22,7 @@
 #include <linux/mmu_notifier.h>
 #include <linux/sched/sysctl.h>
 #include <linux/uaccess.h>
+#include <linux/userfaultfd_k.h>
 
 #include <asm/cacheflush.h>
 #include <asm/tlbflush.h>
@@ -234,7 +235,8 @@ unsigned long
[PATCH 5/5] uffd: Add madvise() event for MADV_DONTNEED request
If the page is punched out of the address space, the uffd reader should know this and zeromap the respective area in case of the #PF event.

Signed-off-by: Pavel Emelyanov <xe...@parallels.com>
---
 fs/userfaultfd.c | 26 ++
 include/linux/userfaultfd_k.h| 10 ++
 include/uapi/linux/userfaultfd.h | 9 -
 mm/madvise.c | 2 ++
 4 files changed, 46 insertions(+), 1 deletion(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 697e636..6e80a02 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -546,6 +546,32 @@ void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx vm_ctx, unsigned long
 	userfaultfd_event_wait_completion(ctx, &ewq);
 }
 
+void madvise_userfault_dontneed(struct vm_area_struct *vma,
+				struct vm_area_struct **prev,
+				unsigned long start, unsigned long end)
+{
+	struct userfaultfd_ctx *ctx;
+	struct userfaultfd_wait_queue ewq;
+
+	ctx = vma->vm_userfaultfd_ctx.ctx;
+	if (!ctx || !(ctx->features & UFFD_FEATURE_EVENT_MADVDONTNEED))
+		return;
+
+	userfaultfd_ctx_get(ctx);
+	*prev = NULL; /* We wait for ACK w/o the mmap semaphore */
+	up_read(&vma->vm_mm->mmap_sem);
+
+	msg_init(&ewq.msg);
+
+	ewq.msg.event = UFFD_EVENT_MADVDONTNEED;
+	ewq.msg.arg.madv_dn.start = start;
+	ewq.msg.arg.madv_dn.end = end;
+
+	userfaultfd_event_wait_completion(ctx, &ewq);
+
+	down_read(&vma->vm_mm->mmap_sem);
+}
+
 static int userfaultfd_release(struct inode *inode, struct file *file)
 {
 	struct userfaultfd_ctx *ctx = file->private_data;
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 0ed5dce..2f72069 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -82,6 +82,10 @@ extern void mremap_userfaultfd_prep(struct vm_area_struct *, struct vm_userfault
 extern void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx,
					 unsigned long from, unsigned long to,
					 unsigned long len);
 
+extern void madvise_userfault_dontneed(struct vm_area_struct *vma,
+				       struct vm_area_struct **prev,
+				       unsigned long start, unsigned long end);
+
 #else /* CONFIG_USERFAULTFD */
 
 /* mm helpers */
@@ -131,6 +135,12 @@ static inline void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx ctx,
					       unsigned long from, unsigned long to,
					       unsigned long len)
 {
 }
+
+static inline void madvise_userfault_dontneed(struct vm_area_struct *vma,
+					      struct vm_area_struct **prev,
+					      unsigned long start, unsigned long end)
+{
+}
 #endif /* CONFIG_USERFAULTFD */
 
 #endif /* _LINUX_USERFAULTFD_K_H */
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 59a141c..d0d1ef1 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -14,7 +14,7 @@
  * After implementing the respective features it will become:
  * #define UFFD_API_FEATURES (UFFD_FEATURE_PAGEFAULT_FLAG_WP)
  */
-#define UFFD_API_FEATURES (UFFD_FEATURE_EVENT_FORK|UFFD_FEATURE_EVENT_REMAP)
+#define UFFD_API_FEATURES (UFFD_FEATURE_EVENT_FORK|UFFD_FEATURE_EVENT_REMAP|UFFD_FEATURE_EVENT_MADVDONTNEED)
 #define UFFD_API_IOCTLS				\
 	((__u64)1 << _UFFDIO_REGISTER |		\
	 (__u64)1 << _UFFDIO_UNREGISTER |	\
@@ -79,6 +79,11 @@ struct uffd_msg {
 		} remap;
 
 		struct {
+			__u64	start;
+			__u64	end;
+		} madv_dn;
+
+		struct {
 			/* unused reserved fields */
 			__u64	reserved1;
 			__u64	reserved2;
@@ -93,6 +98,7 @@ struct uffd_msg {
 #define UFFD_EVENT_PAGEFAULT	0x12
 #define UFFD_EVENT_FORK		0x13
 #define UFFD_EVENT_REMAP	0x14
+#define UFFD_EVENT_MADVDONTNEED	0x15
 
 /* flags for UFFD_EVENT_PAGEFAULT */
 #define UFFD_PAGEFAULT_FLAG_WRITE	(1<<0)	/* If this was a write fault */
@@ -116,6 +122,7 @@ struct uffdio_api {
 #endif
 #define UFFD_FEATURE_EVENT_FORK		(1<<1)
 #define UFFD_FEATURE_EVENT_REMAP	(1<<2)
+#define UFFD_FEATURE_EVENT_MADVDONTNEED	(1<<3)
 	__u64 features;
 
 	__u64 ioctls;
diff --git a/mm/madvise.c b/mm/madvise.c
index 10f62b7..3ea20e2 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -10,6 +10,7 @@
 #include <linux/syscalls.h>
 #include <linux/mempolicy.h>
 #include <linux/page-isolation.h>
+#include <linux/userfaultfd_k.h>
 #include <linux/hugetlb.h>
 #include <linux/falloc.h>
 #include <linux/sched.h>
@@ -283,6 +284,7 @@ static long madvise_dontneed(struct vm_area_struct *vma,
 		return -EINVAL;
 
 	zap_page_range(vma, start, end - start, NULL);
+	madvise_userfault_dontneed(vma, prev, start, end);
 	return 0;
 }
-- 
1.9.3
Re: [PATCH 2/3] uffd: Introduce the v2 API
On 04/28/2015 12:12 AM, Andrea Arcangeli wrote: > Hello, > > On Thu, Apr 23, 2015 at 09:29:07AM +0300, Pavel Emelyanov wrote: >> So your proposal is to always report 16 bytes per PF from read() and >> let userspace decide itself how to handle the result? > > Reading 16bytes for each userfault (instead of 8) and sharing the same > read(2) protocol (UFFD_API) for both the cooperative and > non-cooperative usages, is something I just suggested for > consideration after reading your patchset. > > The pros of using a single protocol for both is that it would reduce > amount of code and there would be just one file operation for the > .read method. The cons is that it will waste 8 bytes per userfault in > terms of memory footprint. The other major cons is that it would force > us to define the format of the non cooperative protocol now despite it's > not fully finished yet. > > I'm also ok with two protocols if nobody else objects, but if we use > two protocols, we should at least use different file operation methods > and use __always_inline with constants passed as parameter to optimize > away the branches at build time. This way we get the reduced memory > footprint in the read syscall without other runtime overhead > associated with it. OK. I would go with two protocols then and will reshuffle the code to use two ops. >>>> +struct uffd_v2_msg { >>>> + __u64 type; >>>> + __u64 arg; >>>> +}; >>>> + >>>> +#define UFFD_PAGEFAULT0x1 >>>> + >>>> +#define UFFD_PAGEFAULT_BIT(1 << (UFFD_PAGEFAULT - 1)) >>>> +#define __UFFD_API_V2_BITS(UFFD_PAGEFAULT_BIT) >>>> + >>>> +/* >>>> + * Lower PAGE_SHIFT bits are used to report those supported >>>> + * by the pagefault message itself. Other bits are used to >>>> + * report the message types v2 API supports >>>> + */ >>>> +#define UFFD_API_V2_BITS (__UFFD_API_V2_BITS << 12) >>>> + >>> >>> And why exactly is this 12 hardcoded? 
>> >> Ah, it should have been the PAGE_SHIFT one, but I was unsure whether it >> would be OK to have different shifts in different arches. >> >> But taking into account your comment that bits field id bad for these >> values, if we introduce the new .features one for api message, then this >> 12 will just go away. > > Ok. > >>> And which field should be masked >>> with the bits? In the V1 protocol it was the "arg" (userfault address) >>> not the "type". So this is a bit confusing and probably requires >>> simplification. >> >> I see. Actually I decided that since bits higher than 12th (for x86) is >> always 0 in api message (no bits allowed there, since pfn sits in this >> place), it would be OK to put non-PF bits there. > > That was ok yes. > >> Should I better introduce another .features field in uffd API message? > > What about renaming "uffdio_api.bits" to "uffdio_api.features"? Yup, agreed, will do. > And then we set uffdio_api.features to > UFFD_FEATURE_WRITE|UFFD_FEATURE_WP|UFFD_FEATURE_FORK as needed. > > UFFD_FEATURE_WRITE would always be enabled, it's there only in case we > want to disable it later (mostly if some arch has trouble with it, > which is unlikely, but qemu doesn't need that bit of information at > all for example so qemu would be fine if UFFD_FEATURE_WRITE > disappears). > > UFFD_FEATURE_WP would signal also that the wrprotection feature (not > implemented yet) is available (then later the register ioctl would > also show the new wrprotection ioctl numbers available to mangle the > wrprotection). The UFFD_FEATURE_WP feature in the cooperative usage > (qemu live snapshotting) can use the UFFD_API first protocol too. > > UFFD_FEATURE_FORK would be returned if the UFFD_API_V2 was set in > uffdio.api, and it would be part of the incremental non-cooperative > patchset. > > We could also not define "UFFD_FEATURE_FORK" at all and imply that > fork/mremap/MADV_DONTNEED are all available if UFFD_API_V2 uffdio_api > ioctl succeeds... 
> That's only doable if we keep two different read
> protocols though. UFFD_FEATURE_FORK (or UFFD_FEATURE_NON_COOPERATIVE)
> are really strictly needed only if we share the same read(2) protocol
> for both the cooperative and non-cooperative usages.
> 
> The idea is that there's not huge benefit of only having the "fork"
> feature supported but missing "mremap" and "madv_dontneed".
> 
> In fact if a new syscall that works like mremap is added later (call
> it mremap2), we would need to fail the UFFDIO_API_V2 and require a
> UFFDIO_API_V3 for such kernel that can return a new mremap2 type of
> event. Userland couldn't just assume it is ok to use postcopy live
> migration for containers, because UFFD_FEATURE_FORK|MREMAP|MADV_DONTNEED
> are present in the uffdio.features when it asked for API_V2. There
> shall be something that tells userland hey there's a new mremap2 that
> the software inside the container can run on top of this kernel, so
> you are going to get a new mremap2 type of userfault event too.

But that's why I assumed to use per-syscall bits -- UFFD_FEATURE_FORK,
_MREMAP, _MWHATEVER -- so that userspace can read those bits and make
sure it contains only bits it understands with other bits set to zero.
If we had only one UFFD_API_NON_COOPERATIVE userspace
Re: [PATCH 0/3] UserfaultFD: Extension for non cooperative uffd usage
On 04/21/2015 03:02 PM, Andrea Arcangeli wrote: > Hi Pavel, > > On Wed, Mar 18, 2015 at 10:34:26PM +0300, Pavel Emelyanov wrote: >> Hi, >> >> On the recent LSF Andrea presented his userfault-fd patches and >> I had shown some issues that appear in usage scenarios when the >> monitor task and mm task do not cooperate to each other on VM >> changes (and fork()-s). >> >> Here's the implementation of the extended uffd API that would help >> us to address those issues. >> >> As proof of concept I've implemented only fix for fork() case, but >> I also plan to add the mremap() and exit() notifications, both are >> also required for such non-cooperative usage. >> >> More details about the extension itself is in patch #2 and the fork() >> notification description is in patch #3. >> >> Comments and suggestion are warmly welcome :) > > This looks feasible. > >> Andrea, what's the best way to go on with the patches -- would you >> prefer to include them in your git tree or should I instead continue >> with them on my own, re-sending them when required? Either way would >> be OK for me. > > Ok so various improvements happened in userfaultfd patchset over the > last month so your incremental patchset likely requires a rebase > sorry. When you posted it I was in the middle of the updates. Now > things are working stable and I have no pending updates, so it would > be a good time for a rebase. OK, thanks for the heads up! I will rebase my patches. > I can merge it if you like, it's your call if you prefer to maintain > it incrementally or if I should merge it, but I wouldn't push it to > Andrew for upstream integration in the first batch, as this > complicates things further and it's not fully complete yet (at least > the version posted only handled fork). I think it can be merged > incrementally in a second stage. Sure! > The major updates of the userfaultfd patchset over the last month were: > > 1) Various mixed fixes thanks to the feedback from Dave Hansen and >David Gilbert. 
> 
>    The most notable one is the use of mm_users instead of mm_count to
>    pin the mm to avoid crashes that assumed the vma still existed (in
>    the userfaultfd_release method and in the various ioctl). exit_mmap
>    doesn't even set mm->mmap to NULL, so unless I introduce a
>    userfaultfd_exit to call in mmput, I have to pin the mm_users to be
>    safe. This is mainly an issue for the non-cooperative usage you're
>    implementing. Can you catch the exit somehow so you can close the
>    fd? The memory won't be released until you do. Is this ok with you?
>    I suppose you had to close the fd somehow anyway.
> 
> 2) userfaults are waken immediately even if they're not been "read"
>    yet, this can lead to POLLIN false positives (so I only allow poll
>    if the fd is open in nonblocking mode to be sure it won't hang). Is
>    it too paranoid to return POLLERR if the fd is not open in
>    nonblocking mode?
> 
>    http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?h=userfault&id=f222d9de0a5302dc8ac62d6fab53a84251098751
> 
> 3) optimize read to return entries in O(1) and poll which was already
>    O(1) becomes lockless. This required to split the waitqueue in two,
>    one for pending faults and one for non pending faults, and the
>    faults are refiled across the two waitqueues when they're
>    read. Both waitqueues are protected by a single lock to be simpler
>    and faster at runtime (the fault_pending_wqh one).
> 
>    http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?h=userfault&id=9aa033ed43a1134c2223dac8c5d9e02e0100fca1
> 
> 4) Allocate the ctx with kmem_cache_alloc. This is going to collide a
>    bit with your cleanup patch sorry.
> 
>    http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?h=userfault&id=f5a8db16d2876eed8906a4d36f1d0e06ca5490f6
> 
> 5) Originally qemu had two bitflags for each page and kept 3 states
>    (of the 4 possible with two bits) for each page in order to deal
>    with the races that can happen if one thread is reading the
>    userfaults and another thread is calling the UFFDIO_COPY ioctl in
>    the background. This patch solves all races in the kernel so the
>    two bits per page can be dropped from qemu codebase. I started
>    documenting the races that can materialize by using 2 threads
>    (instead of running the workload single threaded with a single poll
>    event loop), and how userland had to solve them until I decided it
>    was simpler to fix the race in the kernel by running an ad-hoc
>    pagetable walk inside the wait_event()-kind-of-section. This
>    simplified qemu significantly and it doesn't make the kernel much
>    more complicated. I tried this before in much older versions but I
>    use gup_fast for it and it didn't work well with gup_fast for
>    various reasons.
> 
>    http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?h=userfault&id=41efeae4e93f0296436f2a9fc6b28b6b0158512a
> 
>    After this patch the only reason to call UFFDIO_WAKE is to handle
>    the userfaults in batches in combination
Re: [PATCH 2/3] uffd: Introduce the v2 API
On 04/21/2015 03:18 PM, Andrea Arcangeli wrote:
> On Wed, Mar 18, 2015 at 10:35:17PM +0300, Pavel Emelyanov wrote:
>> +	if (!(ctx->features & UFFD_FEATURE_LONGMSG)) {
> 
> If we are to use different protocols, it'd be nicer to have two
> different methods to assign to userfaultfd_fops.read that calls an
> __always_inline function, so that the above check can be optimized
> away at build time when the inline is expanded. So the branch is
> converted to calling a different pointer to function which is zero
> additional cost.

OK :)

>> +		/* careful to always initialize addr if ret == 0 */
>> +		__u64 uninitialized_var(addr);
>> +		__u64 uninitialized_var(mtype);
>> +		if (count < sizeof(addr))
>> +			return ret ? ret : -EINVAL;
>> +		_ret = userfaultfd_ctx_read(ctx, no_wait, &mtype, &addr);
>> +		if (_ret < 0)
>> +			return ret ? ret : _ret;
>> +		BUG_ON(mtype != UFFD_PAGEFAULT);
>> +		if (put_user(addr, (__u64 __user *) buf))
>> +			return ret ? ret : -EFAULT;
>> +		_ret = sizeof(addr);
>> +	} else {
>> +		struct uffd_v2_msg msg;
>> +		if (count < sizeof(msg))
>> +			return ret ? ret : -EINVAL;
>> +		_ret = userfaultfd_ctx_read(ctx, no_wait, &msg.type, &msg.arg);
>> +		if (_ret < 0)
>> +			return ret ? ret : _ret;
>> +		if (copy_to_user(buf, &msg, sizeof(msg)))
>> +			return ret ? ret : -EINVAL;
>> +		_ret = sizeof(msg);
> 
> Reading 16bytes instead of 8bytes for each fault, probably wouldn't
> move the needle much in terms of userfaultfd_read performance. Perhaps
> we could consider using the uffd_v2_msg unconditionally and then have
> a single protocol differentiated by the feature bits.

So your proposal is to always report 16 bytes per PF from read() and
let userspace decide itself how to handle the result?

> The only reason to have two different protocols would be to be able to
> read 8 bytes per userfault, in the cooperative usage (i.e. qemu
> postcopy). But if we do that we want to use the __always_inline trick
> to avoid branches and additional runtime costs (otherwise we may as
> well forget all microoptimizations and read 16bytes always).
> 
>> @@ -992,6 +1013,12 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx,
>>  	/* careful not to leak info, we only read the first 8 bytes */
>>  	uffdio_api.bits = UFFD_API_BITS;
>>  	uffdio_api.ioctls = UFFD_API_IOCTLS;
>> +
>> +	if (uffdio_api.api == UFFD_API_V2) {
>> +		ctx->features |= UFFD_FEATURE_LONGMSG;
>> +		uffdio_api.bits |= UFFD_API_V2_BITS;
>> +	}
>> +
>>  	ret = -EFAULT;
>>  	if (copy_to_user(buf, &uffdio_api, sizeof(uffdio_api)))
>>  		goto out;
> 
> The original meaning of the bits is:
> 
> If UFFD_BIT_WRITE was set in api.bits, it means the
> !!(address & UFFD_BIT_WRITE) tells if it was a write fault (missing or
> WP).
> 
> If UFFD_BIT_WP was set in api.bits, it means the
> !!(address & UFFD_BIT_WP) tells if it was a WP fault (if not set it
> means it was a missing fault).
> 
> Currently api.bits sets only UFFD_BIT_WRITE, and UFFD_BIT_WP will be
> set later, after the WP tracking mode will be implemented.
> 
> I'm uncertain how bits translated to features and if they should be
> unified or only have features.
> 
>> +struct uffd_v2_msg {
>> +	__u64 type;
>> +	__u64 arg;
>> +};
>> +
>> +#define UFFD_PAGEFAULT		0x1
>> +
>> +#define UFFD_PAGEFAULT_BIT	(1 << (UFFD_PAGEFAULT - 1))
>> +#define __UFFD_API_V2_BITS	(UFFD_PAGEFAULT_BIT)
>> +
>> +/*
>> + * Lower PAGE_SHIFT bits are used to report those supported
>> + * by the pagefault message itself. Other bits are used to
>> + * report the message types v2 API supports
>> + */
>> +#define UFFD_API_V2_BITS	(__UFFD_API_V2_BITS << 12)
>> +
> 
> And why exactly is this 12 hardcoded?

Ah, it should have been the PAGE_SHIFT one, but I was unsure whether it
would be OK to have different shifts in different arches.

But taking into account your comment that the bits field is bad for
these values, if we introduce the new .features one for api message,
then this 12 will just go away.

> And which field should be masked
> with the bits? In the V1 protocol it was the "arg" (userfault address)
> not the "type". So this is a bit confusing and probably requires
> simplification.

I see. Actually I decided that since bits higher than 12th (for x86) is
always 0 in api message (no bits allowed there, since pfn sits in this
place), it would be OK to put non-PF bits there.

Should I better introduce another .features field in uffd API message?

-- Pavel
Re: [PATCH 2/3] uffd: Introduce the v2 API
On 04/21/2015 03:18 PM, Andrea Arcangeli wrote: On Wed, Mar 18, 2015 at 10:35:17PM +0300, Pavel Emelyanov wrote: +if (!(ctx-features UFFD_FEATURE_LONGMSG)) { If we are to use different protocols, it'd be nicer to have two different methods to assign to userfaultfd_fops.read that calls an __always_inline function, so that the above check can be optimized away at build time when the inline is expanded. So the branch is converted to calling a different pointer to function which is zero additional cost. OK :) +/* careful to always initialize addr if ret == 0 */ +__u64 uninitialized_var(addr); +__u64 uninitialized_var(mtype); +if (count sizeof(addr)) +return ret ? ret : -EINVAL; +_ret = userfaultfd_ctx_read(ctx, no_wait, mtype, addr); +if (_ret 0) +return ret ? ret : _ret; +BUG_ON(mtype != UFFD_PAGEFAULT); +if (put_user(addr, (__u64 __user *) buf)) +return ret ? ret : -EFAULT; +_ret = sizeof(addr); +} else { +struct uffd_v2_msg msg; +if (count sizeof(msg)) +return ret ? ret : -EINVAL; +_ret = userfaultfd_ctx_read(ctx, no_wait, msg.type, msg.arg); +if (_ret 0) +return ret ? ret : _ret; +if (copy_to_user(buf, msg, sizeof(msg))) +return ret ? ret : -EINVAL; +_ret = sizeof(msg); Reading 16bytes instead of 8bytes for each fault, probably wouldn't move the needle much in terms of userfaultfd_read performance. Perhaps we could consider using the uffd_v2_msg unconditionally and then have a single protocol differentiated by the feature bits. So your proposal is to always report 16 bytes per PF from read() and let userspace decide itself how to handle the result? The only reason to have two different protocols would be to be able to read 8 bytes per userfault, in the cooperative usage (i.e. qemu postcopy). But if we do that we want to use the __always_inline trick to avoid branches and additional runtime costs (otherwise we may as well forget all microoptimizations and read 16bytes always). 
@@ -992,6 +1013,12 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx, /* careful not to leak info, we only read the first 8 bytes */ uffdio_api.bits = UFFD_API_BITS; uffdio_api.ioctls = UFFD_API_IOCTLS; + +if (uffdio_api.api == UFFD_API_V2) { +ctx->features |= UFFD_FEATURE_LONGMSG; +uffdio_api.bits |= UFFD_API_V2_BITS; +} + ret = -EFAULT; if (copy_to_user(buf, &uffdio_api, sizeof(uffdio_api))) goto out; The original meaning of the bits is: if UFFD_BIT_WRITE was set in api.bits, then !!(address & UFFD_BIT_WRITE) tells if it was a write fault (missing or WP). If UFFD_BIT_WP was set in api.bits, then !!(address & UFFD_BIT_WP) tells if it was a WP fault (if not set it means it was a missing fault). Currently api.bits sets only UFFD_BIT_WRITE, and UFFD_BIT_WP will be set later, after the WP tracking mode is implemented. I'm uncertain how bits translate to features and whether they should be unified or we should only have features. +struct uffd_v2_msg { +__u64 type; +__u64 arg; +}; + +#define UFFD_PAGEFAULT 0x1 + +#define UFFD_PAGEFAULT_BIT (1 << (UFFD_PAGEFAULT - 1)) +#define __UFFD_API_V2_BITS (UFFD_PAGEFAULT_BIT) + +/* + * Lower PAGE_SHIFT bits are used to report those supported + * by the pagefault message itself. Other bits are used to + * report the message types v2 API supports + */ +#define UFFD_API_V2_BITS (__UFFD_API_V2_BITS << 12) + And why exactly is this 12 hardcoded? Ah, it should have been the PAGE_SHIFT one, but I was unsure whether it would be OK to have different shifts in different arches. But taking into account your comment that the bits field is bad for these values, if we introduce the new .features one for the api message, then this 12 will just go away. And which field should be masked with the bits? In the V1 protocol it was the arg (userfault address), not the type. So this is a bit confusing and probably requires simplification. I see.
Actually I decided that since bits higher than the 12th (for x86) are always 0 in the api message (no bits allowed there, since the pfn sits in this place), it would be OK to put non-PF bits there. Should I better introduce another .features field in the uffd API message? -- Pavel -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RESEND PATCH v3 1/2] mm: Introducing arch_remap hook
>>> I initially thought it would be enough to put it into >>> asm-generic/mmu_context.h, expecting it works as >>> asm-generic/pgtable.h. But that's not the case. >>> >>> It probably worth at some point rework all asm/mmu_context.h to include >>> asm-generic/mmu_context.h at the end as we do for asm/pgtable.h. >>> But that's outside the scope of the patchset, I guess. >>> >>> I don't see any better candidate for such dummy header. :-/ >> >> Clearly, I'm not comfortable with a rewrite of asm/mmu_context.h :( >> >> So what about this patch, is this v3 acceptable ? > > Acked-by: Kirill A. Shutemov Other than the #ifdef thing, the same: Acked-by: Pavel Emelyanov
Re: [RESEND PATCH v3 1/2] mm: Introducing arch_remap hook
On 04/13/2015 04:21 PM, Laurent Dufour wrote: > On 13/04/2015 15:13, Kirill A. Shutemov wrote: >> On Mon, Apr 13, 2015 at 02:41:22PM +0200, Laurent Dufour wrote: >>> On 13/04/2015 13:58, Kirill A. Shutemov wrote: On Mon, Apr 13, 2015 at 11:56:27AM +0200, Laurent Dufour wrote: > Some architectures would like to be triggered when a memory area is moved > through the mremap system call. > > This patch is introducing a new arch_remap mm hook which is placed in the > path of mremap, and is called before the old area is unmapped (and the > arch_unmap hook is called). > > The architectures which need to call this hook should define > __HAVE_ARCH_REMAP in their asm/mmu_context.h and provide the arch_remap > service with the following prototype: > void arch_remap(struct mm_struct *mm, > unsigned long old_start, unsigned long old_end, > unsigned long new_start, unsigned long new_end); > > Signed-off-by: Laurent Dufour > Reviewed-by: Ingo Molnar > --- > mm/mremap.c | 19 +-- > 1 file changed, 13 insertions(+), 6 deletions(-) > > diff --git a/mm/mremap.c b/mm/mremap.c > index 2dc44b1cb1df..009db5565893 100644 > --- a/mm/mremap.c > +++ b/mm/mremap.c > @@ -25,6 +25,7 @@ > > #include <asm/cacheflush.h> > #include <asm/tlbflush.h> > +#include <asm/mmu_context.h> > > #include "internal.h" > > @@ -286,13 +287,19 @@ static unsigned long move_vma(struct vm_area_struct > *vma, > old_len = new_len; > old_addr = new_addr; > new_addr = -ENOMEM; > - } else if (vma->vm_file && vma->vm_file->f_op->mremap) { > - err = vma->vm_file->f_op->mremap(vma->vm_file, new_vma); > - if (err < 0) { > - move_page_tables(new_vma, new_addr, vma, old_addr, > - moved_len, true); > - return err; > + } else { > + if (vma->vm_file && vma->vm_file->f_op->mremap) { > + err = vma->vm_file->f_op->mremap(vma->vm_file, new_vma); > + if (err < 0) { > + move_page_tables(new_vma, new_addr, vma, > + old_addr, moved_len, true); > + return err; > + } > } +#ifdef __HAVE_ARCH_REMAP It would be cleaner to provide dummy arch_remap() for !__HAVE_ARCH_REMAP in some generic header.
>>> >>> The idea was to not impact all the architectures as arch_unmap(), >>> arch_dup_mmap() or arch_exit_mmap() implies. >>> >>> I look at the headers where such a dummy arch_remap could be put but I >>> can't figure out one which will not impact all the architecture. >>> What about defining a dummy service earlier in mm/remap.c in the case >>> __HAVE_ARCH_REMAP is not defined ? >>> Something like : >>> #ifndef __HAVE_ARCH_REMAP >>> static inline void arch_remap(struct mm_struct *mm, >>>unsigned long old_start, >>>unsigned long old_end, >>>unsigned long new_start, >>>unsigned long new_end) >>> { >>> } >>> #endif >> >> Or just #define arch_remap(...) do { } while (0) >> > > I guessed you wanted the arch_remap() prototype to be exposed somewhere > in the code. > > To be honest, I can't find the benefit of defining a dummy arch_remap() > in mm/remap.c if __HAVE_ARCH_REMAP is not defined instead of calling it > in move_vma if __HAVE_ARCH_REMAP is defined. > Is it really what you want ? I think Kirill meant something like e.g. the arch_enter_lazy_mmu_mode() is implemented and called in mm/mremap.c -- the "generic" part is in the include/asm-generic/pgtable.h and those architectures willing to have their own implementation are in arch/$arch/... Kirill, if I'm right with it, can you suggest the header where to put the "generic" mremap hook's (empty) body? -- Pavel
Re: [PATCH v3 1/2] x86_64,signal: Fix SS handling for signals delivered to 64-bit programs
On 03/19/2015 12:26 AM, Andy Lutomirski wrote: > On Wed, Mar 18, 2015 at 1:02 PM, Oleg Nesterov wrote: >> On 03/18, Andrey Wagin wrote: >>> >>> This patch fixes the problem. Oleg, could you send this patch to the >>> criu maillist? >> >> Sure, will do. > > We still haven't answered one question: what's the kernel's position > on ABI stability wrt CRIU? We clearly shouldn't make changes that > break the principle of CRIU, but CRIU encodes so many tricky > assumptions about the inner workings of the kernel that it's really > tough to avoid breaking old CRIU versions. Well, we try hard to use only documented kernel API-s. Isn't the sigframe considered to be some sort of "stable API"? I mean -- it's visible by the userspace, nobody prevents glibc or gdb from messing with this stuff just by reading it from memory. If it's "parse-able" e.g. like VDSO is, but we don't do it in CRIU -- then it's definitely a CRIU BUG to be fixed. > So... do we introduce somewhat nasty code into the kernel to keep old > CRIU versions working, or do we require that users who want to restore > onto new kernels use new CRIU? It's OK (I think) to require newer versions of CRIU, it's easy to update one unlike the kernel ;) But if an "old" version of CRIU just crashes the restored processes on "new" kernels and there's no way to detect this properly -- that's the problem. > (It seems clear to me that CRIU should apply the patch regardless of > what the kernel does. It will enable CRIU to work on the same class > of programs that are fixed by the kernel change that started this > thread.) > > --Andy > . > Thanks, Pavel
[PATCH 3/3] uffd: Introduce fork() notification
As described in the previous e-mail, we need to get informed when the task, whose mm is monitored with uffd, calls fork(). The fork notification is the new uffd with the same regions and flags as those on the parent. When read()-ing from uffd the monitor task would receive the new uffd's descriptor number and will be able to start reading events from the new task. The fork() of an mm with uffd attached doesn't finish until the monitor "acks" the message by reading the new uffd descriptor. Signed-off-by: Pavel Emelyanov --- fs/userfaultfd.c | 175 ++- include/linux/userfaultfd_k.h| 12 +++ include/uapi/linux/userfaultfd.h | 4 +- kernel/fork.c| 9 +- 4 files changed, 194 insertions(+), 6 deletions(-) diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index bd629b4..265f031 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -12,6 +12,7 @@ * mm/ksm.c (mm hashing). */ +#include <linux/list.h> #include <linux/hashtable.h> #include <linux/sched.h> #include <linux/mm.h> @@ -37,6 +38,8 @@ struct userfaultfd_ctx { atomic_t refcount; /* waitqueue head for the userfaultfd page faults */ wait_queue_head_t fault_wqh; + /* waitqueue head for fork-s */ + wait_queue_head_t fork_wqh; /* waitqueue head for the pseudo fd to wakeup poll/read */ wait_queue_head_t fd_wqh; /* userfaultfd syscall flags */ @@ -51,10 +54,21 @@ struct userfaultfd_ctx { struct mm_struct *mm; }; +struct userfaultfd_fork_ctx { + struct userfaultfd_ctx *orig; + struct userfaultfd_ctx *new; + struct list_head list; +}; + #define UFFD_FEATURE_LONGMSG 0x1 +#define UFFD_FEATURE_FORK 0x2 struct userfaultfd_wait_queue { - unsigned long address; + union { + unsigned long address; + struct userfaultfd_ctx *nctx; + int fd; + }; wait_queue_t wq; bool pending; struct userfaultfd_ctx *ctx; @@ -75,6 +89,7 @@ static struct userfaultfd_ctx *userfaultfd_ctx_alloc(void) if (ctx) { atomic_set(&ctx->refcount, 1); init_waitqueue_head(&ctx->fault_wqh); + init_waitqueue_head(&ctx->fork_wqh); init_waitqueue_head(&ctx->fd_wqh); ctx->released = false; } @@ -270,6 +285,111 @@ int handle_userfault(struct vm_area_struct *vma,
unsigned long address, return VM_FAULT_RETRY; } +int dup_userfaultfd(struct vm_area_struct *vma, struct list_head *fcs) +{ + struct userfaultfd_ctx *ctx = NULL, *octx; + struct userfaultfd_fork_ctx *fctx; + + octx = vma->vm_userfaultfd_ctx.ctx; + if (!octx || !(octx->features & UFFD_FEATURE_FORK)) { + vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX; + vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING); + return 0; + } + + list_for_each_entry(fctx, fcs, list) + if (fctx->orig == octx) { + ctx = fctx->new; + break; + } + + if (!ctx) { + fctx = kmalloc(sizeof(*fctx), GFP_KERNEL); + if (!fctx) + return -ENOMEM; + + ctx = userfaultfd_ctx_alloc(); + if (!ctx) { + kfree(fctx); + return -ENOMEM; + } + + ctx->flags = octx->flags; + ctx->state = UFFD_STATE_RUNNING; + ctx->features = UFFD_FEATURE_FORK | UFFD_FEATURE_LONGMSG; + ctx->mm = vma->vm_mm; + atomic_inc(&ctx->mm->mm_count); + + userfaultfd_ctx_get(octx); + fctx->orig = octx; + fctx->new = ctx; + list_add_tail(&fctx->list, fcs); + } + + vma->vm_userfaultfd_ctx.ctx = ctx; + return 0; +} + +static int dup_fctx(struct userfaultfd_fork_ctx *fctx) +{ + int ret = 0; + struct userfaultfd_ctx *ctx = fctx->orig; + struct userfaultfd_wait_queue uwq; + + init_waitqueue_entry(&uwq.wq, current); + uwq.pending = true; + uwq.ctx = ctx; + uwq.nctx = fctx->new; + + spin_lock(&ctx->fork_wqh.lock); + /* +* After the __add_wait_queue the uwq is visible to userland +* through poll/read(). +*/ + __add_wait_queue(&ctx->fork_wqh, &uwq.wq); + for (;;) { + set_current_state(TASK_KILLABLE); + if (!uwq.pending) + break; + if (ACCESS_ONCE(ctx->released) || + fatal_signal_pending(current)) { + ret = -1; + break; + } + + spin_unlock(&ctx->fork_wqh.lock); + + wake_up_poll(&ctx->fd_wqh, POLLIN); + schedule(); + + spin_lock(&ctx->fork_wqh.lock); + }
[PATCH 2/3] uffd: Introduce the v2 API
The new API will report more than just the page-faults. The reason for this is: when the task whose mm we monitor with uffd and the monitor task itself cannot cooperate with each other, the former can screw things up. For example, if the task fork()-s, the child process is detached from uffd and thus all not-yet-faulted-in memory gets mapped with zero-pages on touch. Another example is mremap(). When the victim remaps the uffd-ed region and starts touching it, the monitor would receive fault messages with addresses that were not registered with the uffd ioctl before, so the monitor will have no idea how to handle those faults. To address both we can send more events to the monitor. In particular, on fork() we can create another uffd context, register the same set of regions in it and "send" the descriptor to the monitor. For mremap() we can send a message describing what change has been performed. So this patch prepares the ground for the features described above by introducing the v2 API of uffd. With the new API the kernel responds with a message containing the event type (pagefault, fork or remap) and an argument (fault address, new uffd descriptor or region change respectively).
Signed-off-by: Pavel Emelyanov --- fs/userfaultfd.c | 56 ++-- include/uapi/linux/userfaultfd.h | 21 ++- 2 files changed, 62 insertions(+), 15 deletions(-) diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index 6c9a2d6..bd629b4 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -41,6 +41,8 @@ struct userfaultfd_ctx { wait_queue_head_t fd_wqh; /* userfaultfd syscall flags */ unsigned int flags; + /* features flags */ + unsigned int features; /* state machine */ enum userfaultfd_state state; /* released */ @@ -49,6 +51,8 @@ struct userfaultfd_ctx { struct mm_struct *mm; }; +#define UFFD_FEATURE_LONGMSG 0x1 + struct userfaultfd_wait_queue { unsigned long address; wait_queue_t wq; @@ -369,7 +373,7 @@ static unsigned int userfaultfd_poll(struct file *file, poll_table *wait) } static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait, - __u64 *addr) + __u64 *mtype, __u64 *addr) { ssize_t ret; DECLARE_WAITQUEUE(wait, current); @@ -383,6 +387,7 @@ static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait, if (find_userfault(ctx, &uwq)) { uwq->pending = false; /* careful to always initialize addr if ret == 0 */ + *mtype = UFFD_PAGEFAULT; *addr = uwq->address; ret = 0; break; @@ -411,8 +416,6 @@ static ssize_t userfaultfd_read(struct file *file, char __user *buf, { struct userfaultfd_ctx *ctx = file->private_data; ssize_t _ret, ret = 0; - /* careful to always initialize addr if ret == 0 */ - __u64 uninitialized_var(addr); int no_wait = file->f_flags & O_NONBLOCK; if (ctx->state == UFFD_STATE_WAIT_API) @@ -420,16 +423,34 @@ static ssize_t userfaultfd_read(struct file *file, char __user *buf, BUG_ON(ctx->state != UFFD_STATE_RUNNING); for (;;) { - if (count < sizeof(addr)) - return ret ? ret : -EINVAL; - _ret = userfaultfd_ctx_read(ctx, no_wait, &addr); - if (_ret < 0) - return ret ? ret : _ret; - if (put_user(addr, (__u64 __user *) buf)) - return ret ?
ret : -EFAULT; - ret += sizeof(addr); - buf += sizeof(addr); - count -= sizeof(addr); + if (!(ctx->features & UFFD_FEATURE_LONGMSG)) { + /* careful to always initialize addr if ret == 0 */ + __u64 uninitialized_var(addr); + __u64 uninitialized_var(mtype); + if (count < sizeof(addr)) + return ret ? ret : -EINVAL; + _ret = userfaultfd_ctx_read(ctx, no_wait, &mtype, &addr); + if (_ret < 0) + return ret ? ret : _ret; + BUG_ON(mtype != UFFD_PAGEFAULT); + if (put_user(addr, (__u64 __user *) buf)) + return ret ? ret : -EFAULT; + _ret = sizeof(addr); + } else { + struct uffd_v2_msg msg; + if (count < sizeof(msg)) + return ret ? ret : -EINVAL; + _ret = userfaultfd_ctx_read(ctx, no_wait, &msg.type, &msg.arg); + if (_ret < 0) + return ret ? ret : _ret
[PATCH 0/3] UserfaultFD: Extension for non cooperative uffd usage
Hi, On the recent LSF Andrea presented his userfault-fd patches and I showed some issues that appear in usage scenarios when the monitor task and the mm task do not cooperate with each other on VM changes (and fork()-s). Here's the implementation of the extended uffd API that would help us to address those issues. As a proof of concept I've implemented only the fix for the fork() case, but I also plan to add the mremap() and exit() notifications; both are also required for such non-cooperative usage. More details about the extension itself are in patch #2 and the fork() notification description is in patch #3. Comments and suggestions are warmly welcome :) Andrea, what's the best way to go on with the patches -- would you prefer to include them in your git tree or should I instead continue with them on my own, re-sending them when required? Either way would be OK for me. Thanks, Pavel
[PATCH 1/3] uffd: Tossing bits around
Reformat the existing code a bit to make it easier for further patching. Signed-off-by: Pavel Emelyanov --- fs/userfaultfd.c | 37 - 1 file changed, 28 insertions(+), 9 deletions(-) diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index b4c7f25..6c9a2d6 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -61,6 +61,23 @@ struct userfaultfd_wake_range { unsigned long len; }; +static const struct file_operations userfaultfd_fops; + +static struct userfaultfd_ctx *userfaultfd_ctx_alloc(void) +{ + struct userfaultfd_ctx *ctx; + + ctx = kmalloc(sizeof(*ctx), GFP_KERNEL); + if (ctx) { + atomic_set(&ctx->refcount, 1); + init_waitqueue_head(&ctx->fault_wqh); + init_waitqueue_head(&ctx->fd_wqh); + ctx->released = false; + } + + return ctx; +} + static int userfaultfd_wake_function(wait_queue_t *wq, unsigned mode, int wake_flags, void *key) { @@ -307,15 +324,15 @@ static int userfaultfd_release(struct inode *inode, struct file *file) return 0; } -static inline unsigned int find_userfault(struct userfaultfd_ctx *ctx, +static inline unsigned int do_find_userfault(wait_queue_head_t *wqh, struct userfaultfd_wait_queue **uwq) { wait_queue_t *wq; struct userfaultfd_wait_queue *_uwq; unsigned int ret = 0; - spin_lock(&ctx->fault_wqh.lock); - list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) { + spin_lock(&wqh->lock); + list_for_each_entry(wq, &wqh->task_list, task_list) { _uwq = container_of(wq, struct userfaultfd_wait_queue, wq); if (_uwq->pending) { ret = POLLIN; @@ -324,11 +341,17 @@ static inline unsigned int find_userfault(struct userfaultfd_ctx *ctx, break; } } - spin_unlock(&ctx->fault_wqh.lock); + spin_unlock(&wqh->lock); return ret; } +static inline unsigned int find_userfault(struct userfaultfd_ctx *ctx, + struct userfaultfd_wait_queue **uwq) +{ + return do_find_userfault(&ctx->fault_wqh, uwq); +} + static unsigned int userfaultfd_poll(struct file *file, poll_table *wait) { struct userfaultfd_ctx *ctx = file->private_data; @@ -1080,16 +1103,12 @@ static struct file *userfaultfd_file_create(int flags) goto
out; file = ERR_PTR(-ENOMEM); - ctx = kmalloc(sizeof(*ctx), GFP_KERNEL); + ctx = userfaultfd_ctx_alloc(); if (!ctx) goto out; - atomic_set(&ctx->refcount, 1); - init_waitqueue_head(&ctx->fault_wqh); - init_waitqueue_head(&ctx->fd_wqh); ctx->flags = flags; ctx->state = UFFD_STATE_WAIT_API; - ctx->released = false; ctx->mm = current->mm; /* prevent the mm struct to be freed */ atomic_inc(&ctx->mm->mm_count); -- 1.8.4.2
[PATCH 1/3] uffd: Tossing bits around
Reformat the existing code a bit to make it easier for further patching. Signed-off-by: Pavel Emelyanov xe...@parallels.com --- fs/userfaultfd.c | 37 - 1 file changed, 28 insertions(+), 9 deletions(-) diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index b4c7f25..6c9a2d6 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -61,6 +61,23 @@ struct userfaultfd_wake_range { unsigned long len; }; +static const struct file_operations userfaultfd_fops; + +static struct userfaultfd_ctx *userfaultfd_ctx_alloc(void) +{ + struct userfaultfd_ctx *ctx; + + ctx = kmalloc(sizeof(*ctx), GFP_KERNEL); + if (ctx) { + atomic_set(ctx-refcount, 1); + init_waitqueue_head(ctx-fault_wqh); + init_waitqueue_head(ctx-fd_wqh); + ctx-released = false; + } + + return ctx; +} + static int userfaultfd_wake_function(wait_queue_t *wq, unsigned mode, int wake_flags, void *key) { @@ -307,15 +324,15 @@ static int userfaultfd_release(struct inode *inode, struct file *file) return 0; } -static inline unsigned int find_userfault(struct userfaultfd_ctx *ctx, +static inline unsigned int do_find_userfault(wait_queue_head_t *wqh, struct userfaultfd_wait_queue **uwq) { wait_queue_t *wq; struct userfaultfd_wait_queue *_uwq; unsigned int ret = 0; - spin_lock(ctx-fault_wqh.lock); - list_for_each_entry(wq, ctx-fault_wqh.task_list, task_list) { + spin_lock(wqh-lock); + list_for_each_entry(wq, wqh-task_list, task_list) { _uwq = container_of(wq, struct userfaultfd_wait_queue, wq); if (_uwq-pending) { ret = POLLIN; @@ -324,11 +341,17 @@ static inline unsigned int find_userfault(struct userfaultfd_ctx *ctx, break; } } - spin_unlock(ctx-fault_wqh.lock); + spin_unlock(wqh-lock); return ret; } +static inline unsigned int find_userfault(struct userfaultfd_ctx *ctx, + struct userfaultfd_wait_queue **uwq) +{ + return do_find_userfault(ctx-fault_wqh, uwq); +} + static unsigned int userfaultfd_poll(struct file *file, poll_table *wait) { struct userfaultfd_ctx *ctx = file-private_data; @@ -1080,16 +1103,12 @@ static struct 
file *userfaultfd_file_create(int flags) goto out; file = ERR_PTR(-ENOMEM); - ctx = kmalloc(sizeof(*ctx), GFP_KERNEL); + ctx = userfaultfd_ctx_alloc(); if (!ctx) goto out; - atomic_set(ctx-refcount, 1); - init_waitqueue_head(ctx-fault_wqh); - init_waitqueue_head(ctx-fd_wqh); ctx-flags = flags; ctx-state = UFFD_STATE_WAIT_API; - ctx-released = false; ctx-mm = current-mm; /* prevent the mm struct to be freed */ atomic_inc(ctx-mm-mm_count); -- 1.8.4.2 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 3/3] uffd: Introduce fork() notification
As described in previous e-mail, we need to get informed when the task, whose mm is monitored with uffd, calls fork(). The fork notification is the new uffd with the same regions and flags as those on parent. When read()-ing from uffd the monitor task would receive the new uffd's descriptor number and will be able to start reading events from the new task. The fork() of mm with uffd attached doesn't finish until the monitor acks the message by reading the new uffd descriptor. Signed-off-by: Pavel Emelyanov xe...@parallels.com --- fs/userfaultfd.c | 175 ++- include/linux/userfaultfd_k.h| 12 +++ include/uapi/linux/userfaultfd.h | 4 +- kernel/fork.c| 9 +- 4 files changed, 194 insertions(+), 6 deletions(-) diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index bd629b4..265f031 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -12,6 +12,7 @@ * mm/ksm.c (mm hashing). */ +#include linux/list.h #include linux/hashtable.h #include linux/sched.h #include linux/mm.h @@ -37,6 +38,8 @@ struct userfaultfd_ctx { atomic_t refcount; /* waitqueue head for the userfaultfd page faults */ wait_queue_head_t fault_wqh; + /* waitqueue head for fork-s */ + wait_queue_head_t fork_wqh; /* waitqueue head for the pseudo fd to wakeup poll/read */ wait_queue_head_t fd_wqh; /* userfaultfd syscall flags */ @@ -51,10 +54,21 @@ struct userfaultfd_ctx { struct mm_struct *mm; }; +struct userfaultfd_fork_ctx { + struct userfaultfd_ctx *orig; + struct userfaultfd_ctx *new; + struct list_head list; +}; + #define UFFD_FEATURE_LONGMSG 0x1 +#define UFFD_FEATURE_FORK 0x2 struct userfaultfd_wait_queue { - unsigned long address; + union { + unsigned long address; + struct userfaultfd_ctx *nctx; + int fd; + }; wait_queue_t wq; bool pending; struct userfaultfd_ctx *ctx; @@ -75,6 +89,7 @@ static struct userfaultfd_ctx *userfaultfd_ctx_alloc(void) if (ctx) { atomic_set(ctx-refcount, 1); init_waitqueue_head(ctx-fault_wqh); + init_waitqueue_head(ctx-fork_wqh); init_waitqueue_head(ctx-fd_wqh); ctx-released = 
false; } @@ -270,6 +285,111 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address, return VM_FAULT_RETRY; } +int dup_userfaultfd(struct vm_area_struct *vma, struct list_head *fcs) +{ + struct userfaultfd_ctx *ctx = NULL, *octx; + struct userfaultfd_fork_ctx *fctx; + + octx = vma-vm_userfaultfd_ctx.ctx; + if (!octx || !(octx-features UFFD_FEATURE_FORK)) { + vma-vm_userfaultfd_ctx = NULL_VM_UFFD_CTX; + vma-vm_flags = ~(VM_UFFD_WP | VM_UFFD_MISSING); + return 0; + } + + list_for_each_entry(fctx, fcs, list) + if (fctx-orig == octx) { + ctx = fctx-new; + break; + } + + if (!ctx) { + fctx = kmalloc(sizeof(*fctx), GFP_KERNEL); + if (!fctx) + return -ENOMEM; + + ctx = userfaultfd_ctx_alloc(); + if (!ctx) { + kfree(fctx); + return -ENOMEM; + } + + ctx-flags = octx-flags; + ctx-state = UFFD_STATE_RUNNING; + ctx-features = UFFD_FEATURE_FORK | UFFD_FEATURE_LONGMSG; + ctx-mm = vma-vm_mm; + atomic_inc(ctx-mm-mm_count); + + userfaultfd_ctx_get(octx); + fctx-orig = octx; + fctx-new = ctx; + list_add_tail(fctx-list, fcs); + } + + vma-vm_userfaultfd_ctx.ctx = ctx; + return 0; +} + +static int dup_fctx(struct userfaultfd_fork_ctx *fctx) +{ + int ret = 0; + struct userfaultfd_ctx *ctx = fctx-orig; + struct userfaultfd_wait_queue uwq; + + init_waitqueue_entry(uwq.wq, current); + uwq.pending = true; + uwq.ctx = ctx; + uwq.nctx = fctx-new; + + spin_lock(ctx-fork_wqh.lock); + /* +* After the __add_wait_queue the uwq is visible to userland +* through poll/read(). +*/ + __add_wait_queue(ctx-fork_wqh, uwq.wq); + for (;;) { + set_current_state(TASK_KILLABLE); + if (!uwq.pending) + break; + if (ACCESS_ONCE(ctx-released) || + fatal_signal_pending(current)) { + ret = -1; + break; + } + + spin_unlock(ctx-fork_wqh.lock); + + wake_up_poll(ctx-fd_wqh, POLLIN); + schedule(); + + spin_lock(ctx-fork_wqh.lock); + } + __remove_wait_queue(ctx-fork_wqh
[PATCH 2/3] uffd: Introduce the v2 API
The new API will report more than just the page-faults. The reason for this is -- when the task whose mm we monitor with uffd and the monitor task itself cannot cooperate with each other, the former one can screw things up. Like this. If task fork()-s the child process is detached from uffd and thus all not-yet-faulted-in memory gets mapped with zero-pages on touch. Another example is mremap(). When the victim remaps the uffd-ed region and starts touching it the monitor would receive fault messages with addresses that were not register-ed with uffd ioctl before it. Thus monitor will have no idea how to handle those faults. To address both we can send more events to the monitor. In particular, on fork() we can create another uffd context, register the same set of regions in it and send the descriptor to monitor. For mremap() we can send the message describing what change has been performed. So this patch prepares to ground for the described above feature by introducing the v2 API of uffd. With new API the kernel would respond with a message containing the event type (pagefault, fork or remap) and argument (fault address, new uffd descriptor or region change respectively). 
Signed-off-by: Pavel Emelyanov xe...@parallels.com
---
 fs/userfaultfd.c                 | 56 ++--
 include/uapi/linux/userfaultfd.h | 21 ++-
 2 files changed, 62 insertions(+), 15 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 6c9a2d6..bd629b4 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -41,6 +41,8 @@ struct userfaultfd_ctx {
 	wait_queue_head_t fd_wqh;
 	/* userfaultfd syscall flags */
 	unsigned int flags;
+	/* features flags */
+	unsigned int features;
 	/* state machine */
 	enum userfaultfd_state state;
 	/* released */
@@ -49,6 +51,8 @@ struct userfaultfd_ctx {
 	struct mm_struct *mm;
 };
 
+#define UFFD_FEATURE_LONGMSG	0x1
+
 struct userfaultfd_wait_queue {
 	unsigned long address;
 	wait_queue_t wq;
@@ -369,7 +373,7 @@ static unsigned int userfaultfd_poll(struct file *file, poll_table *wait)
 }
 
 static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
-				    __u64 *addr)
+				    __u64 *mtype, __u64 *addr)
 {
 	ssize_t ret;
 	DECLARE_WAITQUEUE(wait, current);
@@ -383,6 +387,7 @@ static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
 		if (find_userfault(ctx, &uwq)) {
 			uwq->pending = false;
 			/* careful to always initialize addr if ret == 0 */
+			*mtype = UFFD_PAGEFAULT;
 			*addr = uwq->address;
 			ret = 0;
 			break;
@@ -411,8 +416,6 @@ static ssize_t userfaultfd_read(struct file *file, char __user *buf,
 {
 	struct userfaultfd_ctx *ctx = file->private_data;
 	ssize_t _ret, ret = 0;
-	/* careful to always initialize addr if ret == 0 */
-	__u64 uninitialized_var(addr);
 	int no_wait = file->f_flags & O_NONBLOCK;
 
 	if (ctx->state == UFFD_STATE_WAIT_API)
@@ -420,16 +423,34 @@ static ssize_t userfaultfd_read(struct file *file, char __user *buf,
 	BUG_ON(ctx->state != UFFD_STATE_RUNNING);
 
 	for (;;) {
-		if (count < sizeof(addr))
-			return ret ? ret : -EINVAL;
-		_ret = userfaultfd_ctx_read(ctx, no_wait, &addr);
-		if (_ret < 0)
-			return ret ? ret : _ret;
-		if (put_user(addr, (__u64 __user *) buf))
-			return ret ? ret : -EFAULT;
-		ret += sizeof(addr);
-		buf += sizeof(addr);
-		count -= sizeof(addr);
+		if (!(ctx->features & UFFD_FEATURE_LONGMSG)) {
+			/* careful to always initialize addr if ret == 0 */
+			__u64 uninitialized_var(addr);
+			__u64 uninitialized_var(mtype);
+
+			if (count < sizeof(addr))
+				return ret ? ret : -EINVAL;
+			_ret = userfaultfd_ctx_read(ctx, no_wait, &mtype, &addr);
+			if (_ret < 0)
+				return ret ? ret : _ret;
+			BUG_ON(mtype != UFFD_PAGEFAULT);
+			if (put_user(addr, (__u64 __user *) buf))
+				return ret ? ret : -EFAULT;
+			_ret = sizeof(addr);
+		} else {
+			struct uffd_v2_msg msg;
+
+			if (count < sizeof(msg))
+				return ret ? ret : -EINVAL;
+			_ret = userfaultfd_ctx_read(ctx, no_wait, &msg.type, &msg.arg);
+			if (_ret < 0)
+				return ret ? ret : _ret
[PATCH 0/3] UserfaultFD: Extension for non cooperative uffd usage
Hi,

At the recent LSF Andrea presented his userfault-fd patches, and I showed some issues that appear in usage scenarios where the monitor task and the mm task do not cooperate with each other on VM changes (and fork()-s). Here's the implementation of the extended uffd API that would help us to address those issues. As a proof of concept I've implemented only the fix for the fork() case, but I also plan to add the mremap() and exit() notifications; both are also required for such non-cooperative usage.

More details about the extension itself are in patch #2, and the fork() notification description is in patch #3. Comments and suggestions are warmly welcome :)

Andrea, what's the best way to go on with the patches -- would you prefer to include them in your git tree, or should I instead continue with them on my own, re-sending them when required? Either way would be OK for me.

Thanks,
Pavel
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 1/2] x86_64,signal: Fix SS handling for signals delivered to 64-bit programs
On 03/19/2015 12:26 AM, Andy Lutomirski wrote:
> On Wed, Mar 18, 2015 at 1:02 PM, Oleg Nesterov o...@redhat.com wrote:
>> On 03/18, Andrey Wagin wrote:
>>> This patch fixes the problem. Oleg, could you send this patch to the criu maillist?
>>
>> Sure, will do.
>
> We still haven't answered one question: what's the kernel's position on
> ABI stability wrt CRIU? We clearly shouldn't make changes that break the
> principle of CRIU, but CRIU encodes so many tricky assumptions about the
> inner workings of the kernel that it's really tough to avoid breaking
> old CRIU versions.

Well, we try hard to use only documented kernel API-s. Isn't the sigframe considered to be some sort of stable API? I mean -- it's visible by the userspace, nobody prevents glibc or gdb from messing with this stuff just by reading it from memory. If it's parse-able e.g. like VDSO is, but we don't do it in CRIU -- then it's definitely a CRIU BUG to be fixed.

> So... do we introduce somewhat nasty code into the kernel to keep old
> CRIU versions working, or do we require that users who want to restore
> onto new kernels use new CRIU?

It's OK (I think) to require newer versions of CRIU; it's easy to update one, unlike the kernel ;) But if an old version of CRIU just crashes the restored processes on new kernels and there's no way to detect this properly -- that's the problem.

> (It seems clear to me that CRIU should apply the patch regardless of
> what the kernel does. It will enable CRIU to work on the same class of
> programs that are fixed by the kernel change that started this thread.)
>
> --Andy

Thanks,
Pavel
Re: [RFC, PATCH] pagemap: do not leak physical addresses to non-privileged userspace
On 03/10/2015 12:11 AM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" 
> 
> As pointed by recent post[1] on exploiting DRAM physical imperfection,
> /proc/PID/pagemap exposes sensitive information which can be used to do
> attacks.
> 
> This is RFC patch which disallow anybody without CAP_SYS_ADMIN to read
> the pagemap.
> 
> Any comments?

If I'm not mistaken, the pagemap file is used by some userspace that does working-set size analysis. But this thing only needs the flags (referenced bit) from the PTE-s. Maybe it would be better not to lock this file completely, but instead report the PFN part as zero?

Other than this, I don't mind :) Although we use this heavily in CRIU, we anyway work only with CAP_SYS_ADMIN, so adding the new one doesn't hurt.

Thanks,
Pavel

> [1] http://googleprojectzero.blogspot.com/2015/03/exploiting-dram-rowhammer-bug-to-gain.html
> 
> Signed-off-by: Kirill A. Shutemov 
> Cc: Pavel Emelyanov 
> Cc: Konstantin Khlebnikov 
> Cc: Andrew Morton 
> Cc: Linus Torvalds 
> Cc: Mark Seaborn 
> Cc: Andy Lutomirski 
> ---
>  fs/proc/task_mmu.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 246eae84b13b..b72b36e64286 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1322,6 +1322,9 @@ out:
>  
>  static int pagemap_open(struct inode *inode, struct file *file)
>  {
> +	/* do not disclose physical addresses: attack vector */
> +	if (!capable(CAP_SYS_ADMIN))
> +		return -EPERM;
>  	pr_warn_once("Bits 55-60 of /proc/PID/pagemap entries are about "
>  		     "to stop being page-shift some time soon. See the "
>  		     "linux/Documentation/vm/pagemap.txt for details.\n");
Re: [PATCH 00/21] RFC: userfaultfd v3
> All UFFDIO_COPY/ZEROPAGE/REMAP methods already support CRIU postcopy
> live migration and the UFFD can be passed to a manager process through
> unix domain sockets to satisfy point 5).

Yup :) That's the best (from my POV) point of ufd -- the ability to delegate the descriptor to some other task. Though there are several limitations (I've expressed them in other e-mails), I'm definitely supporting this! The respective CRIU code is quite sloppy yet; I will try to brush it up and show it soon.

Thanks,
Pavel
Re: [PATCH 14/21] userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation
> +static int mcopy_atomic_pte(struct mm_struct *dst_mm,
> +			    pmd_t *dst_pmd,
> +			    struct vm_area_struct *dst_vma,
> +			    unsigned long dst_addr,
> +			    unsigned long src_addr)
> +{
> +	struct mem_cgroup *memcg;
> +	pte_t _dst_pte, *dst_pte;
> +	spinlock_t *ptl;
> +	struct page *page;
> +	void *page_kaddr;
> +	int ret;
> +
> +	ret = -ENOMEM;
> +	page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, dst_vma, dst_addr);
> +	if (!page)
> +		goto out;

Not a fatal thing, but still quite inconvenient. If there are two tasks that have anonymous private VMAs that are still not COW-ed from each other, then it will be impossible to keep the pages shared with userfault. Thus if we do post-copy memory migration for tasks, these guys will have their memory COW-ed.

Thanks,
Pavel
Re: [PATCH 19/21] userfaultfd: remap_pages: UFFDIO_REMAP preparation
> +ssize_t remap_pages(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> +		    unsigned long dst_start, unsigned long src_start,
> +		    unsigned long len, __u64 mode)
> +{
> +	struct vm_area_struct *src_vma, *dst_vma;
> +	long err = -EINVAL;
> +	pmd_t *src_pmd, *dst_pmd;
> +	pte_t *src_pte, *dst_pte;
> +	spinlock_t *dst_ptl, *src_ptl;
> +	unsigned long src_addr, dst_addr;
> +	int thp_aligned = -1;
> +	ssize_t moved = 0;
> +
> +	/*
> +	 * Sanitize the command parameters:
> +	 */
> +	BUG_ON(src_start & ~PAGE_MASK);
> +	BUG_ON(dst_start & ~PAGE_MASK);
> +	BUG_ON(len & ~PAGE_MASK);
> +
> +	/* Does the address range wrap, or is the span zero-sized? */
> +	BUG_ON(src_start + len <= src_start);
> +	BUG_ON(dst_start + len <= dst_start);
> +
> +	/*
> +	 * Because these are read semaphores there's no risk of lock
> +	 * inversion.
> +	 */
> +	down_read(&dst_mm->mmap_sem);
> +	if (dst_mm != src_mm)
> +		down_read(&src_mm->mmap_sem);
> +
> +	/*
> +	 * Make sure the vma is not shared, that the src and dst remap
> +	 * ranges are both valid and fully within a single existing
> +	 * vma.
> +	 */
> +	src_vma = find_vma(src_mm, src_start);
> +	if (!src_vma || (src_vma->vm_flags & VM_SHARED))
> +		goto out;
> +	if (src_start < src_vma->vm_start ||
> +	    src_start + len > src_vma->vm_end)
> +		goto out;
> +
> +	dst_vma = find_vma(dst_mm, dst_start);
> +	if (!dst_vma || (dst_vma->vm_flags & VM_SHARED))
> +		goto out;

I again have a concern about the case when one task monitors the VM of the other one. If the target task (owning the mm) unmaps a VMA, then the monitor task (holding and operating on the ufd) will get a plain EINVAL on the UFFDIO_REMAP request. This is not fatal, but still inconvenient, as it will be hard to find out the reason for the failure -- has the dst VMA been removed, so the monitor should just drop the respective pages with data, or has some other error occurred and some other actions should be taken?

Thanks,
Pavel
Re: [PATCH 10/21] userfaultfd: add new syscall to provide memory externalization
> +int handle_userfault(struct vm_area_struct *vma, unsigned long address,
> +		     unsigned int flags, unsigned long reason)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	struct userfaultfd_ctx *ctx;
> +	struct userfaultfd_wait_queue uwq;
> +
> +	BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
> +
> +	ctx = vma->vm_userfaultfd_ctx.ctx;
> +	if (!ctx)
> +		return VM_FAULT_SIGBUS;
> +
> +	BUG_ON(ctx->mm != mm);
> +
> +	VM_BUG_ON(reason & ~(VM_UFFD_MISSING|VM_UFFD_WP));
> +	VM_BUG_ON(!(reason & VM_UFFD_MISSING) ^ !!(reason & VM_UFFD_WP));
> +
> +	/*
> +	 * If it's already released don't get it. This avoids to loop
> +	 * in __get_user_pages if userfaultfd_release waits on the
> +	 * caller of handle_userfault to release the mmap_sem.
> +	 */
> +	if (unlikely(ACCESS_ONCE(ctx->released)))
> +		return VM_FAULT_SIGBUS;
> +
> +	/* check that we can return VM_FAULT_RETRY */
> +	if (unlikely(!(flags & FAULT_FLAG_ALLOW_RETRY))) {
> +		/*
> +		 * Validate the invariant that nowait must allow retry
> +		 * to be sure not to return SIGBUS erroneously on
> +		 * nowait invocations.
> +		 */
> +		BUG_ON(flags & FAULT_FLAG_RETRY_NOWAIT);
> +#ifdef CONFIG_DEBUG_VM
> +		if (printk_ratelimit()) {
> +			printk(KERN_WARNING
> +			       "FAULT_FLAG_ALLOW_RETRY missing %x\n", flags);
> +			dump_stack();
> +		}
> +#endif
> +		return VM_FAULT_SIGBUS;
> +	}
> +
> +	/*
> +	 * Handle nowait, not much to do other than tell it to retry
> +	 * and wait.
> +	 */
> +	if (flags & FAULT_FLAG_RETRY_NOWAIT)
> +		return VM_FAULT_RETRY;
> +
> +	/* take the reference before dropping the mmap_sem */
> +	userfaultfd_ctx_get(ctx);
> +
> +	/* be gentle and immediately relinquish the mmap_sem */
> +	up_read(&mm->mmap_sem);
> +
> +	init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function);
> +	uwq.wq.private = current;
> +	uwq.address = userfault_address(address, flags, reason);

Since we report only the virtual address of the fault, this will make difficulties for a task monitoring the address space of some other task. Like this: let's assume a task creates a userfaultfd, activates one, registers several VMAs in it and then sends the ufd descriptor to another task. If later the first task remaps those VMAs and starts touching pages, the monitor will start receiving fault addresses using which it will not be able to guess the exact vma the requests come from.

Thanks,
Pavel
Re: [PATCH 05/21] userfaultfd: add vm_userfaultfd_ctx to the vm_area_struct
> diff --git a/kernel/fork.c b/kernel/fork.c
> index cf65139..cb215c0 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -425,6 +425,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
>  		goto fail_nomem_anon_vma_fork;
>  	tmp->vm_flags &= ~VM_LOCKED;
>  	tmp->vm_next = tmp->vm_prev = NULL;
> +	tmp->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;

This creates an interesting effect when the userfaultfd is used outside of the process which created and activated one. If I try to monitor the memory usage of one task with another, once the first task fork()-s, its child begins to see zero-pages in the places where the monitor task was supposed to insert pages with data.

>  	file = tmp->vm_file;
>  	if (file) {
>  		struct inode *inode = file_inode(file);