Re: [PATCHv5 25/37] x86/vdso: Switch image on setns()/clone()

2019-08-01 Thread Andy Lutomirski
On Wed, Jul 31, 2019 at 11:09 PM  wrote:
>
> On July 31, 2019 10:34:26 PM PDT, Andy Lutomirski  wrote:
> >On Mon, Jul 29, 2019 at 2:58 PM Dmitry Safonov  wrote:
> >>
> >> As it has been discussed on timens RFC, adding a new conditional
> >branch
> >> `if (inside_time_ns)` on VDSO for all processes is undesirable.
> >> It will add a penalty for everybody as branch predictor may
> >mispredict
> >> the jump. Also there are instruction cache lines wasted on cmp/jmp.
> >
> >
> >>
> >> +#ifdef CONFIG_TIME_NS
> >> +int vdso_join_timens(struct task_struct *task)
> >> +{
> >> +   struct mm_struct *mm = task->mm;
> >> +   struct vm_area_struct *vma;
> >> +
> >> +   if (down_write_killable(>mmap_sem))
> >> +   return -EINTR;
> >> +
> >> +   for (vma = mm->mmap; vma; vma = vma->vm_next) {
> >> +   unsigned long size = vma->vm_end - vma->vm_start;
> >> +
> >> +   if (vma_is_special_mapping(vma, _mapping) ||
> >> +   vma_is_special_mapping(vma, _mapping))
> >> +   zap_page_range(vma, vma->vm_start, size);
> >> +   }
> >
> >This is, unfortunately, fundamentally buggy.  If any thread is in the
> >vDSO or has the vDSO on the stack (due to a signal, for example), this
> >will crash it.  I can think of three solutions:
> >
> >1. Say that you can't setns() if you have other mms and ignore the
> >signal issue.  Anything with green threads will disapprove.  It's also
> >rather gross.
> >
> >2. Make it so that you can flip the static branch safely.  As in my
> >other email, you'll need to deal with CoW somehow,
> >
> >3. Make it so that you can't change timens, or at least that you can't
> >turn timens on or off, without execve() or fork().
> >
> >BTW, that static branch probably needs to be aligned to a cache line
> >or something similar to avoid all the nastiness with trying to poke
> >text that might be concurrently executing.  This will be a mess.
>
> Since we are talking about different physical addresses I believe we should 
> be okay as long as they don't cross page boundaries, and even if they do it 
> can be managed with proper page invalidation sequencing – it's not like the 
> problems of having to deal with XMC on live pages like in the kernel.
>
> Still, you really need each instruction sequence to be present, with the only 
> difference being specific patch sites.
>
> Any fundamental reason this can't be strictly data driven? Seems odd to me if 
> it couldn't, but I might be missing something obvious.

I think it can be.  There are at least two places where vDSO slow
paths could hook without affecting fast paths: vclock_mode and the low
bit of the sequence number.


Re: [PATCHv5 25/37] x86/vdso: Switch image on setns()/clone()

2019-08-01 Thread hpa
On July 31, 2019 10:34:26 PM PDT, Andy Lutomirski  wrote:
>On Mon, Jul 29, 2019 at 2:58 PM Dmitry Safonov  wrote:
>>
>> As it has been discussed on timens RFC, adding a new conditional
>branch
>> `if (inside_time_ns)` on VDSO for all processes is undesirable.
>> It will add a penalty for everybody as branch predictor may
>mispredict
>> the jump. Also there are instruction cache lines wasted on cmp/jmp.
>
>
>>
>> +#ifdef CONFIG_TIME_NS
>> +int vdso_join_timens(struct task_struct *task)
>> +{
>> +   struct mm_struct *mm = task->mm;
>> +   struct vm_area_struct *vma;
>> +
>> +   if (down_write_killable(>mmap_sem))
>> +   return -EINTR;
>> +
>> +   for (vma = mm->mmap; vma; vma = vma->vm_next) {
>> +   unsigned long size = vma->vm_end - vma->vm_start;
>> +
>> +   if (vma_is_special_mapping(vma, _mapping) ||
>> +   vma_is_special_mapping(vma, _mapping))
>> +   zap_page_range(vma, vma->vm_start, size);
>> +   }
>
>This is, unfortunately, fundamentally buggy.  If any thread is in the
>vDSO or has the vDSO on the stack (due to a signal, for example), this
>will crash it.  I can think of three solutions:
>
>1. Say that you can't setns() if you have other mms and ignore the
>signal issue.  Anything with green threads will disapprove.  It's also
>rather gross.
>
>2. Make it so that you can flip the static branch safely.  As in my
>other email, you'll need to deal with CoW somehow,
>
>3. Make it so that you can't change timens, or at least that you can't
>turn timens on or off, without execve() or fork().
>
>BTW, that static branch probably needs to be aligned to a cache line
>or something similar to avoid all the nastiness with trying to poke
>text that might be concurrently executing.  This will be a mess.

Since we are talking about different physical addresses I believe we should be 
okay as long as they don't cross page boundaries, and even if they do it can be 
managed with proper page invalidation sequencing – it's not like the problems 
of having to deal with XMC on live pages like in the kernel.

Still, you really need each instruction sequence to be present, with the only 
difference being specific patch sites.

Any fundamental reason this can't be strictly data driven? Seems odd to me if 
it couldn't, but I might be missing something obvious.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: [PATCHv5 25/37] x86/vdso: Switch image on setns()/clone()

2019-07-31 Thread Andy Lutomirski
On Mon, Jul 29, 2019 at 2:58 PM Dmitry Safonov  wrote:
>
> As it has been discussed on timens RFC, adding a new conditional branch
> `if (inside_time_ns)` on VDSO for all processes is undesirable.
> It will add a penalty for everybody as branch predictor may mispredict
> the jump. Also there are instruction cache lines wasted on cmp/jmp.


>
> +#ifdef CONFIG_TIME_NS
> +int vdso_join_timens(struct task_struct *task)
> +{
> +   struct mm_struct *mm = task->mm;
> +   struct vm_area_struct *vma;
> +
> +   if (down_write_killable(>mmap_sem))
> +   return -EINTR;
> +
> +   for (vma = mm->mmap; vma; vma = vma->vm_next) {
> +   unsigned long size = vma->vm_end - vma->vm_start;
> +
> +   if (vma_is_special_mapping(vma, _mapping) ||
> +   vma_is_special_mapping(vma, _mapping))
> +   zap_page_range(vma, vma->vm_start, size);
> +   }

This is, unfortunately, fundamentally buggy.  If any thread is in the
vDSO or has the vDSO on the stack (due to a signal, for example), this
will crash it.  I can think of three solutions:

1. Say that you can't setns() if you have other mms and ignore the
signal issue.  Anything with green threads will disapprove.  It's also
rather gross.

2. Make it so that you can flip the static branch safely.  As in my
other email, you'll need to deal with CoW somehow,

3. Make it so that you can't change timens, or at least that you can't
turn timens on or off, without execve() or fork().

BTW, that static branch probably needs to be aligned to a cache line
or something similar to avoid all the nastiness with trying to poke
text that might be concurrently executing.  This will be a mess.


[PATCHv5 25/37] x86/vdso: Switch image on setns()/clone()

2019-07-29 Thread Dmitry Safonov
As it has been discussed on timens RFC, adding a new conditional branch
`if (inside_time_ns)` on VDSO for all processes is undesirable.
It will add a penalty for everybody as branch predictor may mispredict
the jump. Also there are instruction cache lines wasted on cmp/jmp.

Those effects of introducing time namespace are very much unwanted
having in mind how much work have been spent on micro-optimisation
vdso code.

Addressing those problems, there are two versions of VDSO's .so:
for host tasks (without any penalty) and for processes inside of time
namespace with clk_to_ns() that subtracts offsets from host's time.

Whenever a user does setns() or unshare(CLONE_TIMENS) followed
by clone(), change VDSO image in mm and zap VVAR/VDSO page tables.
They will be re-faulted with corresponding image and VVAR offsets.

Co-developed-by: Andrei Vagin 
Signed-off-by: Andrei Vagin 
Signed-off-by: Dmitry Safonov 
---
 arch/x86/entry/vdso/vma.c   | 23 +++
 arch/x86/include/asm/vdso.h |  1 +
 kernel/time_namespace.c | 11 +++
 3 files changed, 35 insertions(+)

diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index 8a8211fd4cfc..91cf5a5c8c9e 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #if defined(CONFIG_X86_64)
 unsigned int __read_mostly vdso64_enabled = 1;
@@ -266,6 +267,28 @@ static const struct vm_special_mapping vvar_mapping = {
.mremap = vvar_mremap,
 };
 
+#ifdef CONFIG_TIME_NS
+int vdso_join_timens(struct task_struct *task)
+{
+   struct mm_struct *mm = task->mm;
+   struct vm_area_struct *vma;
+
+   if (down_write_killable(>mmap_sem))
+   return -EINTR;
+
+   for (vma = mm->mmap; vma; vma = vma->vm_next) {
+   unsigned long size = vma->vm_end - vma->vm_start;
+
+   if (vma_is_special_mapping(vma, _mapping) ||
+   vma_is_special_mapping(vma, _mapping))
+   zap_page_range(vma, vma->vm_start, size);
+   }
+
+   up_write(>mmap_sem);
+   return 0;
+}
+#endif
+
 /*
  * Add vdso and vvar mappings to current process.
  * @image  - blob to map
diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
index 03f468c63a24..ccf89dedd04f 100644
--- a/arch/x86/include/asm/vdso.h
+++ b/arch/x86/include/asm/vdso.h
@@ -45,6 +45,7 @@ extern struct vdso_image vdso_image_32;
 extern void __init init_vdso_image(struct vdso_image *image);
 
 extern int map_vdso_once(const struct vdso_image *image, unsigned long addr);
+extern int vdso_join_timens(struct task_struct *task);
 
 #endif /* __ASSEMBLER__ */
 
diff --git a/kernel/time_namespace.c b/kernel/time_namespace.c
index 9807c5c90cb2..4b2eb92ad595 100644
--- a/kernel/time_namespace.c
+++ b/kernel/time_namespace.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 
 ktime_t do_timens_ktime_to_host(clockid_t clockid, ktime_t tim,
struct timens_offsets *ns_offsets)
@@ -199,6 +200,7 @@ static void timens_put(struct ns_common *ns)
 static int timens_install(struct nsproxy *nsproxy, struct ns_common *new)
 {
struct time_namespace *ns = to_time_ns(new);
+   int ret;
 
if (!thread_group_empty(current))
return -EINVAL;
@@ -207,6 +209,10 @@ static int timens_install(struct nsproxy *nsproxy, struct 
ns_common *new)
!ns_capable(current_user_ns(), CAP_SYS_ADMIN))
return -EPERM;
 
+   ret = vdso_join_timens(current);
+   if (ret)
+   return ret;
+
get_time_ns(ns);
get_time_ns(ns);
put_time_ns(nsproxy->time_ns);
@@ -221,10 +227,15 @@ int timens_on_fork(struct nsproxy *nsproxy, struct 
task_struct *tsk)
 {
struct ns_common *nsc = >time_ns_for_children->ns;
struct time_namespace *ns = to_time_ns(nsc);
+   int ret;
 
if (nsproxy->time_ns == nsproxy->time_ns_for_children)
return 0;
 
+   ret = vdso_join_timens(tsk);
+   if (ret)
+   return ret;
+
get_time_ns(ns);
put_time_ns(nsproxy->time_ns);
nsproxy->time_ns = ns;
-- 
2.22.0