[tip: sched/core] rseq, ptrace: Add PTRACE_GET_RSEQ_CONFIGURATION request

2021-03-17 Thread tip-bot2 for Piotr Figiel
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 90f093fa8ea48e5d991332cee160b761423d55c1
Gitweb:
https://git.kernel.org/tip/90f093fa8ea48e5d991332cee160b761423d55c1
Author:Piotr Figiel 
AuthorDate:Fri, 26 Feb 2021 14:51:56 +01:00
Committer: Thomas Gleixner 
CommitterDate: Wed, 17 Mar 2021 16:15:39 +01:00

rseq, ptrace: Add PTRACE_GET_RSEQ_CONFIGURATION request

For userspace checkpoint and restore (C/R) a way of getting process state
containing RSEQ configuration is needed.

There are two ways this information is going to be used:
 - to re-enable RSEQ for threads which had it enabled before C/R
 - to detect if a thread was in a critical section during C/R

Since C/R preserves TLS memory and addresses RSEQ ABI will be restored
using the address registered before C/R.

Detection whether the thread is in a critical section during C/R is needed
to enforce behavior of RSEQ abort during C/R. Attaching with ptrace()
before registers are dumped itself doesn't cause RSEQ abort.
Restoring the instruction pointer within the critical section is
problematic because rseq_cs may get cleared before the control is passed
to the migrated application code leading to RSEQ invariants not being
preserved. C/R code will use RSEQ ABI address to find the abort handler
to which the instruction pointer needs to be set.

To achieve above goals expose the RSEQ ABI address and the signature value
with the new ptrace request PTRACE_GET_RSEQ_CONFIGURATION.

This new ptrace request can also be used by debuggers so they are aware
of stops within restartable sequences in progress.

Signed-off-by: Piotr Figiel 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Thomas Gleixner 
Reviewed-by: Michal Miroslaw 
Reviewed-by: Mathieu Desnoyers 
Acked-by: Oleg Nesterov 
Link: https://lkml.kernel.org/r/20210226135156.1081606-1-fig...@google.com
---
 include/uapi/linux/ptrace.h | 10 ++
 kernel/ptrace.c | 25 +
 2 files changed, 35 insertions(+)

diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h
index 83ee45f..3747bf8 100644
--- a/include/uapi/linux/ptrace.h
+++ b/include/uapi/linux/ptrace.h
@@ -102,6 +102,16 @@ struct ptrace_syscall_info {
};
 };
 
+#define PTRACE_GET_RSEQ_CONFIGURATION  0x420f
+
+struct ptrace_rseq_configuration {
+   __u64 rseq_abi_pointer;
+   __u32 rseq_abi_size;
+   __u32 signature;
+   __u32 flags;
+   __u32 pad;
+};
+
 /*
  * These values are stored in task->ptrace_message
  * by tracehook_report_syscall_* to describe the current syscall-stop.
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 821cf17..c71270a 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -31,6 +31,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include/* for syscall_get_* */
 
@@ -779,6 +780,24 @@ static int ptrace_peek_siginfo(struct task_struct *child,
return ret;
 }
 
+#ifdef CONFIG_RSEQ
+static long ptrace_get_rseq_configuration(struct task_struct *task,
+ unsigned long size, void __user *data)
+{
+   struct ptrace_rseq_configuration conf = {
+   .rseq_abi_pointer = (u64)(uintptr_t)task->rseq,
+   .rseq_abi_size = sizeof(*task->rseq),
+   .signature = task->rseq_sig,
+   .flags = 0,
+   };
+
+   size = min_t(unsigned long, size, sizeof(conf));
+   if (copy_to_user(data, , size))
+   return -EFAULT;
+   return sizeof(conf);
+}
+#endif
+
 #ifdef PTRACE_SINGLESTEP
 #define is_singlestep(request) ((request) == PTRACE_SINGLESTEP)
 #else
@@ -1222,6 +1241,12 @@ int ptrace_request(struct task_struct *child, long 
request,
ret = seccomp_get_metadata(child, addr, datavp);
break;
 
+#ifdef CONFIG_RSEQ
+   case PTRACE_GET_RSEQ_CONFIGURATION:
+   ret = ptrace_get_rseq_configuration(child, addr, datavp);
+   break;
+#endif
+
default:
break;
}


[tip: sched/core] rseq, ptrace: Add PTRACE_GET_RSEQ_CONFIGURATION request

2021-03-17 Thread tip-bot2 for Piotr Figiel
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 2c406d3f436db1deea55ec44cc4c3c0861c3c185
Gitweb:
https://git.kernel.org/tip/2c406d3f436db1deea55ec44cc4c3c0861c3c185
Author:Piotr Figiel 
AuthorDate:Fri, 26 Feb 2021 14:51:56 +01:00
Committer: Peter Zijlstra 
CommitterDate: Wed, 17 Mar 2021 14:05:40 +01:00

rseq, ptrace: Add PTRACE_GET_RSEQ_CONFIGURATION request

For userspace checkpoint and restore (C/R) a way of getting process state
containing RSEQ configuration is needed.

There are two ways this information is going to be used:
 - to re-enable RSEQ for threads which had it enabled before C/R
 - to detect if a thread was in a critical section during C/R

Since C/R preserves TLS memory and addresses RSEQ ABI will be restored
using the address registered before C/R.

Detection whether the thread is in a critical section during C/R is needed
to enforce behavior of RSEQ abort during C/R. Attaching with ptrace()
before registers are dumped itself doesn't cause RSEQ abort.
Restoring the instruction pointer within the critical section is
problematic because rseq_cs may get cleared before the control is passed
to the migrated application code leading to RSEQ invariants not being
preserved. C/R code will use RSEQ ABI address to find the abort handler
to which the instruction pointer needs to be set.

To achieve above goals expose the RSEQ ABI address and the signature value
with the new ptrace request PTRACE_GET_RSEQ_CONFIGURATION.

This new ptrace request can also be used by debuggers so they are aware
of stops within restartable sequences in progress.

Signed-off-by: Piotr Figiel 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Michal Miroslaw 
Reviewed-by: Mathieu Desnoyers 
Acked-by: Oleg Nesterov 
Link: https://lkml.kernel.org/r/20210226135156.1081606-1-fig...@google.com
---
 include/uapi/linux/ptrace.h | 10 ++
 kernel/ptrace.c | 25 +
 2 files changed, 35 insertions(+)

diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h
index 83ee45f..3747bf8 100644
--- a/include/uapi/linux/ptrace.h
+++ b/include/uapi/linux/ptrace.h
@@ -102,6 +102,16 @@ struct ptrace_syscall_info {
};
 };
 
+#define PTRACE_GET_RSEQ_CONFIGURATION  0x420f
+
+struct ptrace_rseq_configuration {
+   __u64 rseq_abi_pointer;
+   __u32 rseq_abi_size;
+   __u32 signature;
+   __u32 flags;
+   __u32 pad;
+};
+
 /*
  * These values are stored in task->ptrace_message
  * by tracehook_report_syscall_* to describe the current syscall-stop.
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 821cf17..c71270a 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -31,6 +31,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include/* for syscall_get_* */
 
@@ -779,6 +780,24 @@ static int ptrace_peek_siginfo(struct task_struct *child,
return ret;
 }
 
+#ifdef CONFIG_RSEQ
+static long ptrace_get_rseq_configuration(struct task_struct *task,
+ unsigned long size, void __user *data)
+{
+   struct ptrace_rseq_configuration conf = {
+   .rseq_abi_pointer = (u64)(uintptr_t)task->rseq,
+   .rseq_abi_size = sizeof(*task->rseq),
+   .signature = task->rseq_sig,
+   .flags = 0,
+   };
+
+   size = min_t(unsigned long, size, sizeof(conf));
+   if (copy_to_user(data, , size))
+   return -EFAULT;
+   return sizeof(conf);
+}
+#endif
+
 #ifdef PTRACE_SINGLESTEP
 #define is_singlestep(request) ((request) == PTRACE_SINGLESTEP)
 #else
@@ -1222,6 +1241,12 @@ int ptrace_request(struct task_struct *child, long 
request,
ret = seccomp_get_metadata(child, addr, datavp);
break;
 
+#ifdef CONFIG_RSEQ
+   case PTRACE_GET_RSEQ_CONFIGURATION:
+   ret = ptrace_get_rseq_configuration(child, addr, datavp);
+   break;
+#endif
+
default:
break;
}


Re: [PATCH v4] fs/proc: Expose RSEQ configuration

2021-03-10 Thread Piotr Figiel
On Tue, Feb 02, 2021 at 06:37:09PM +0100, Piotr Figiel wrote:
> For userspace checkpoint and restore (C/R) some way of getting process
> state containing RSEQ configuration is needed.

[...]

> To achieve above goals expose the RSEQ ABI address and the signature
> value with the new procfs file "/proc//rseq".

For the record: this idea was dropped in favor of ptrace approach, as
discussed over separate mail thread with Mathieu Desnoyers and Peter
Zijlstra.  The equivalent ptrace patch in its current version is here:

https://lore.kernel.org/lkml/20210226135156.1081606-1-fig...@google.com/

Best regards,
Piotr.


Re: [PATCH v2] ptrace: add PTRACE_GET_RSEQ_CONFIGURATION request

2021-02-26 Thread Piotr Figiel
Hi,

On Fri, Feb 26, 2021 at 10:32:35AM -0500, Mathieu Desnoyers wrote:
> > +static long ptrace_get_rseq_configuration(struct task_struct *task,
> > + unsigned long size, void __user *data)
> > +{
> > +   struct ptrace_rseq_configuration conf = {
> > +   .rseq_abi_pointer = (u64)(uintptr_t)task->rseq,
> > +   .rseq_abi_size = sizeof(*task->rseq),
> > +   .signature = task->rseq_sig,
> > +   .flags = 0,
> > +   };
> > +
> > +   size = min_t(unsigned long, size, sizeof(conf));
> > +   if (copy_to_user(data, , size))
> > +   return -EFAULT;
> > +   return sizeof(conf);
> > +}
> 
> I think what Florian was after would be:
> 
> struct ptrace_rseq_configuration {
>   __u32 size;  /* size of struct ptrace_rseq_configuration */
>   __u32 flags;
>   __u64 rseq_abi_pointer;
>   __u32 signature;
>   __u32 pad;
> };
> 
> where:
> 
> .size = sizeof(struct ptrace_rseq_configuration),
> 
> This way, the configuration structure can be expanded in the future. The
> rseq ABI structure is by definition fixed-size, so there is no point in
> having its size here.

Still rseq syscall accepts the rseq ABI structure size as a paremeter.
I think this way the information returned from ptrace is consistent with
the userspace view of the rseq state and allows expansion in case the
ABI structure would have to be extended (in spite of it's current
definition).

The configuration structure still can be expanded as its size is
reported to userspace as return value from the request (in line with
Dmitry's comments).

Best regards, Piotr.


Re: [PATCH] ptrace: add PTRACE_GET_RSEQ_CONFIGURATION request

2021-02-26 Thread Piotr Figiel
Hi,

On Mon, Feb 22, 2021 at 09:53:17AM -0500, Mathieu Desnoyers wrote:

> I notice that other structures defined in this UAPI header are not
> packed as well.  Should we add an attribute packed on new structures ?
> It seems like it is generally a safer course of action, even though
> each field is naturally aligned here (there is no padding/hole in the
> structure).

I considered this for quite a while. There are some gains for this
approach, i.e. it's safer towards the ISO C, as theoretically compiler
can generate arbitrary offsets as long as struct elements have correct
order in memory.
Also with packed attribute it would be harder to make it incorrect in
future modifications.
User code also could theoretically put the structure on any misaligned
address.

But the drawback is that all accesses to the structure contents are
inefficient and some compilers may generate large chunks of code
whenever the structure elements are accessed (I recall at least one ARM
compiler which generates series of single-byte accesses for those). For
kernel it doesn't matter much because the structure type is used in one
place, but it may be different for the application code.

The change would be also inconsistent with the rest of the file and IMO
the gains are only theoretical.

If there are more opinions on this or you have some argument I'm missing
please let me know I can send v3 with packed and explicit padding
removed. I think this is rather borderline trade off.

Best regards and thanks for looking at this,
Piotr.


[PATCH v2] ptrace: add PTRACE_GET_RSEQ_CONFIGURATION request

2021-02-26 Thread Piotr Figiel
For userspace checkpoint and restore (C/R) a way of getting process state
containing RSEQ configuration is needed.

There are two ways this information is going to be used:
 - to re-enable RSEQ for threads which had it enabled before C/R
 - to detect if a thread was in a critical section during C/R

Since C/R preserves TLS memory and addresses RSEQ ABI will be restored
using the address registered before C/R.

Detection whether the thread is in a critical section during C/R is needed
to enforce behavior of RSEQ abort during C/R. Attaching with ptrace()
before registers are dumped itself doesn't cause RSEQ abort.
Restoring the instruction pointer within the critical section is
problematic because rseq_cs may get cleared before the control is passed
to the migrated application code leading to RSEQ invariants not being
preserved. C/R code will use RSEQ ABI address to find the abort handler
to which the instruction pointer needs to be set.

To achieve above goals expose the RSEQ ABI address and the signature value
with the new ptrace request PTRACE_GET_RSEQ_CONFIGURATION.

This new ptrace request can also be used by debuggers so they are aware
of stops within restartable sequences in progress.

Signed-off-by: Piotr Figiel 
Reviewed-by: Michal Miroslaw 

---
v2:
Applied review comments:
 - changed return value from the ptrace request to the size of the
   configuration structure
 - expanded configuration structure with the flags field and
   the rseq abi structure size

v1:
 https://lore.kernel.org/lkml/20210222100443.4155938-1-fig...@google.com/

---
 include/uapi/linux/ptrace.h | 10 ++
 kernel/ptrace.c | 25 +
 2 files changed, 35 insertions(+)

diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h
index 83ee45fa634b..3747bf816f9a 100644
--- a/include/uapi/linux/ptrace.h
+++ b/include/uapi/linux/ptrace.h
@@ -102,6 +102,16 @@ struct ptrace_syscall_info {
};
 };
 
+#define PTRACE_GET_RSEQ_CONFIGURATION  0x420f
+
+struct ptrace_rseq_configuration {
+   __u64 rseq_abi_pointer;
+   __u32 rseq_abi_size;
+   __u32 signature;
+   __u32 flags;
+   __u32 pad;
+};
+
 /*
  * These values are stored in task->ptrace_message
  * by tracehook_report_syscall_* to describe the current syscall-stop.
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 61db50f7ca86..76f09456ec4b 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -31,6 +31,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include/* for syscall_get_* */
 
@@ -779,6 +780,24 @@ static int ptrace_peek_siginfo(struct task_struct *child,
return ret;
 }
 
+#ifdef CONFIG_RSEQ
+static long ptrace_get_rseq_configuration(struct task_struct *task,
+ unsigned long size, void __user *data)
+{
+   struct ptrace_rseq_configuration conf = {
+   .rseq_abi_pointer = (u64)(uintptr_t)task->rseq,
+   .rseq_abi_size = sizeof(*task->rseq),
+   .signature = task->rseq_sig,
+   .flags = 0,
+   };
+
+   size = min_t(unsigned long, size, sizeof(conf));
+   if (copy_to_user(data, , size))
+   return -EFAULT;
+   return sizeof(conf);
+}
+#endif
+
 #ifdef PTRACE_SINGLESTEP
 #define is_singlestep(request) ((request) == PTRACE_SINGLESTEP)
 #else
@@ -1222,6 +1241,12 @@ int ptrace_request(struct task_struct *child, long 
request,
ret = seccomp_get_metadata(child, addr, datavp);
break;
 
+#ifdef CONFIG_RSEQ
+   case PTRACE_GET_RSEQ_CONFIGURATION:
+   ret = ptrace_get_rseq_configuration(child, addr, datavp);
+   break;
+#endif
+
default:
break;
}
-- 
2.30.1.766.gb4fecdf3b7-goog



[PATCH] ptrace: add PTRACE_GET_RSEQ_CONFIGURATION request

2021-02-22 Thread Piotr Figiel
For userspace checkpoint and restore (C/R) a way of getting process state
containing RSEQ configuration is needed.

There are two ways this information is going to be used:
 - to re-enable RSEQ for threads which had it enabled before C/R
 - to detect if a thread was in a critical section during C/R

Since C/R preserves TLS memory and addresses RSEQ ABI will be restored
using the address registered before C/R.

Detection whether the thread is in a critical section during C/R is needed
to enforce behavior of RSEQ abort during C/R. Attaching with ptrace()
before registers are dumped itself doesn't cause RSEQ abort.
Restoring the instruction pointer within the critical section is
problematic because rseq_cs may get cleared before the control is passed
to the migrated application code leading to RSEQ invariants not being
preserved. C/R code will use RSEQ ABI address to find the abort handler
to which the instruction pointer needs to be set.

To achieve above goals expose the RSEQ ABI address and the signature value
with the new ptrace request PTRACE_GET_RSEQ_CONFIGURATION.

This new ptrace request can also be used by debuggers so they are aware
of stops within restartable sequences in progress.

Signed-off-by: Piotr Figiel 
Reviewed-by: Michal Miroslaw 

---
 include/uapi/linux/ptrace.h |  8 
 kernel/ptrace.c | 23 +++
 2 files changed, 31 insertions(+)

diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h
index 83ee45fa634b..d54cf6b6ce7c 100644
--- a/include/uapi/linux/ptrace.h
+++ b/include/uapi/linux/ptrace.h
@@ -102,6 +102,14 @@ struct ptrace_syscall_info {
};
 };
 
+#define PTRACE_GET_RSEQ_CONFIGURATION  0x420f
+
+struct ptrace_rseq_configuration {
+   __u64 rseq_abi_pointer;
+   __u32 signature;
+   __u32 pad;
+};
+
 /*
  * These values are stored in task->ptrace_message
  * by tracehook_report_syscall_* to describe the current syscall-stop.
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 61db50f7ca86..a936af66cf6f 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -31,6 +31,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include/* for syscall_get_* */
 
@@ -779,6 +780,22 @@ static int ptrace_peek_siginfo(struct task_struct *child,
return ret;
 }
 
+#ifdef CONFIG_RSEQ
+static long ptrace_get_rseq_configuration(struct task_struct *task,
+ unsigned long size, void __user *data)
+{
+   struct ptrace_rseq_configuration conf = {
+   .rseq_abi_pointer = (u64)(uintptr_t)task->rseq,
+   .signature = task->rseq_sig,
+   };
+
+   size = min_t(unsigned long, size, sizeof(conf));
+   if (copy_to_user(data, , size))
+   return -EFAULT;
+   return size;
+}
+#endif
+
 #ifdef PTRACE_SINGLESTEP
 #define is_singlestep(request) ((request) == PTRACE_SINGLESTEP)
 #else
@@ -1222,6 +1239,12 @@ int ptrace_request(struct task_struct *child, long 
request,
ret = seccomp_get_metadata(child, addr, datavp);
break;
 
+#ifdef CONFIG_RSEQ
+   case PTRACE_GET_RSEQ_CONFIGURATION:
+   ret = ptrace_get_rseq_configuration(child, addr, datavp);
+   break;
+#endif
+
default:
break;
}
-- 
2.30.0.617.g56c4b15f3c-goog



[PATCH v4] fs/proc: Expose RSEQ configuration

2021-02-02 Thread Piotr Figiel
For userspace checkpoint and restore (C/R) some way of getting process
state containing RSEQ configuration is needed.

There are two ways this information is going to be used:
 - to re-enable RSEQ for threads which had it enabled before C/R
 - to detect if a thread was in a critical section during C/R

Since C/R preserves TLS memory and addresses RSEQ ABI will be restored
using the address registered before C/R.

Detection whether the thread is in a critical section during C/R is
needed to enforce behavior of RSEQ abort during C/R. Attaching with
ptrace() before registers are dumped itself doesn't cause RSEQ abort.
Restoring the instruction pointer within the critical section is
problematic because rseq_cs may get cleared before the control is
passed to the migrated application code leading to RSEQ invariants not
being preserved.

To achieve above goals expose the RSEQ ABI address and the signature
value with the new procfs file "/proc//rseq".

Signed-off-by: Piotr Figiel 

---

v4:
 - added documentation and extended comment before task_lock()
v3:
 - added locking so that the proc file always shows consistent pair of
   RSEQ ABI address and the signature
 - changed string formatting to use %px for the RSEQ ABI address
v2:
 - fixed string formatting for 32-bit architectures
v1:
 - https://lkml.kernel.org/r/20210113174127.2500051-1-fig...@google.com

---
 Documentation/filesystems/proc.rst | 16 
 fs/exec.c  |  2 ++
 fs/proc/base.c | 22 ++
 include/linux/sched/task.h |  3 ++-
 kernel/rseq.c  |  4 
 5 files changed, 46 insertions(+), 1 deletion(-)

diff --git a/Documentation/filesystems/proc.rst 
b/Documentation/filesystems/proc.rst
index 2fa69f710e2a..d887666dc849 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -47,6 +47,7 @@ fixes/update part 1.1  Stefani Seibold   
  June 9 2009
   3.10  /proc//timerslack_ns - Task timerslack value
   3.11 /proc//patch_state - Livepatch patch operation state
   3.12 /proc//arch_status - Task architecture specific information
+  3.13 /proc//rseq - RSEQ configuration state
 
   4Configuring procfs
   4.1  Mount options
@@ -2131,6 +2132,21 @@ AVX512_elapsed_ms
   the task is unlikely an AVX512 user, but depends on the workload and the
   scheduling scenario, it also could be a false negative mentioned above.
 
+3.13   /proc//rseq - RSEQ configuration state
+---
+This file provides RSEQ configuration of a thread. Available fields correspond
+to the rseq() syscall parameters and are:
+
+ - RSEQ ABI structure address shared between the kernel and user-space
+ - signature value expected before the abort handler code
+
+Both values are in hexadecimal format, for example::
+
+   # cat /proc/12345/rseq
+   abcdef12340 aabb0011
+
+This file is only present if CONFIG_RSEQ is enabled.
+
 Chapter 4: Configuring procfs
 =
 
diff --git a/fs/exec.c b/fs/exec.c
index 5d4d52039105..5d84f98847f1 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1830,7 +1830,9 @@ static int bprm_execve(struct linux_binprm *bprm,
/* execve succeeded */
current->fs->in_exec = 0;
current->in_execve = 0;
+   task_lock(current);
rseq_execve(current);
+   task_unlock(current);
acct_update_integrals(current);
task_numa_free(current, false);
return retval;
diff --git a/fs/proc/base.c b/fs/proc/base.c
index b3422cda2a91..89232329d966 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -662,6 +662,22 @@ static int proc_pid_syscall(struct seq_file *m, struct 
pid_namespace *ns,
 
return 0;
 }
+
+#ifdef CONFIG_RSEQ
+static int proc_pid_rseq(struct seq_file *m, struct pid_namespace *ns,
+   struct pid *pid, struct task_struct *task)
+{
+   int res = lock_trace(task);
+
+   if (res)
+   return res;
+   task_lock(task);
+   seq_printf(m, "%px %08x\n", task->rseq, task->rseq_sig);
+   task_unlock(task);
+   unlock_trace(task);
+   return 0;
+}
+#endif /* CONFIG_RSEQ */
 #endif /* CONFIG_HAVE_ARCH_TRACEHOOK */
 
 //
@@ -3182,6 +3198,9 @@ static const struct pid_entry tgid_base_stuff[] = {
REG("comm",  S_IRUGO|S_IWUSR, proc_pid_set_comm_operations),
 #ifdef CONFIG_HAVE_ARCH_TRACEHOOK
ONE("syscall",S_IRUSR, proc_pid_syscall),
+#ifdef CONFIG_RSEQ
+   ONE("rseq",   S_IRUSR, proc_pid_rseq),
+#endif
 #endif
REG("cmdline",S_IRUGO, proc_pid_cmdline_ops),
ONE("stat",   S_IRUGO, proc_tgid_stat),
@@ -3522,6 +3541,9 @@ static const struct pid_entry tid_base_stuff[] = {
 _pid_set_comm_operations, {}),
 #ifdef CONFIG_HAVE_AR

Re: [PATCH v3] fs/proc: Expose RSEQ configuration

2021-01-27 Thread Piotr Figiel
On Tue, Jan 26, 2021 at 03:58:46PM -0500, Mathieu Desnoyers wrote:
> - On Jan 26, 2021, at 1:54 PM, Piotr Figiel fig...@google.com wrote:
> [...]
> > diff --git a/kernel/rseq.c b/kernel/rseq.c
> > index a4f86a9d6937..6aea67878065 100644
> > --- a/kernel/rseq.c
> > +++ b/kernel/rseq.c
> > @@ -322,8 +322,10 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32,
> > rseq_len,
> > ret = rseq_reset_rseq_cpu_id(current);
> > if (ret)
> > return ret;
> > +   task_lock(current);
> > current->rseq = NULL;
> > current->rseq_sig = 0;
> > +   task_unlock(current);
> > return 0;
> > }
> > 
> > @@ -353,8 +355,10 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32,
> > rseq_len,
> > return -EINVAL;
> > if (!access_ok(rseq, rseq_len))
> > return -EFAULT;
> > +   task_lock(current);
> > current->rseq = rseq;
> > current->rseq_sig = sig;
> > +   task_unlock(current);
> 
> So AFAIU, the locks are there to make sure that whenever a user-space
> thread reads that state through that new /proc file ABI, it observes
> coherent "rseq" vs "rseq_sig" values.

Yes, that was the intention.

> However, I'm not convinced this is the right approach to consistency
> here.
> 
> Because if you add locking as done here, you ensure that the /proc
> file reader sees coherent values, but between the point where those
> values are read from kernel-space, copied to user-space, and then
> acted upon by user-space, those can very well have become outdated if
> the observed process runs concurrently.

You are right here, but I think this comment is valid for most of the
process information exported via procfs. The user can almost always make
a time of check/time of use race when interacting with procfs. I agree
that the locking added in v3 doesn't help much, but at least it does
provide a well defined answer: i.e. at least in some point of time the
effective configuration was as returned.
It makes it a bit easier to document and reason about the file contents,
compared to the inconsistent version.

> So my understanding here is that the only non-racy way to effectively
> use those values is to either read them from /proc/self/* (from the
> thread owning the task struct), or to ensure that the thread is
> stopped/frozen while the read is done.

Constraining this solely to the owning thread I think is a bit too
limiting. I think we could limit it to stopped threads but I don't think
it eliminates the potential of time of check/time of use races for the
user.  In this shape as in v3 - it's up to the user to decide if there
is a relevant risk of a race, if it's unwanted then the thread can be
stopped with e.g. ptrace, cgroup freeze or SIGSTOP.

Best regards,
Piotr.


Re: [PATCH v3] fs/proc: Expose RSEQ configuration

2021-01-27 Thread Piotr Figiel
On Tue, Jan 26, 2021 at 11:25:47AM -0800, Andrew Morton wrote:
> On Tue, 26 Jan 2021 19:54:12 +0100 Piotr Figiel  wrote:
> > To achieve above goals expose the RSEQ structure address and the
> > signature value with the new per-thread procfs file "rseq".
> Using "/proc//rseq" would be more informative.
> 
> >  fs/exec.c  |  2 ++
> >  fs/proc/base.c | 22 ++
> >  kernel/rseq.c  |  4 
> 
> A Documentation/ update would be appropriate.
> 
> > +   task_lock(current);
> > rseq_execve(current);
> > +   task_unlock(current);
> 
> There's a comment over the task_lock() implementation which explains
> what things it locks.  An update to that would be helpful.

Agreed I'll include fixes for above comments in v4.

> > --- a/fs/proc/base.c
> > +++ b/fs/proc/base.c
> > @@ -662,6 +662,22 @@ static int proc_pid_syscall(struct seq_file *m, struct 
> > pid_namespace *ns,
> >  
> > return 0;
> >  }
> > +
> > +#ifdef CONFIG_RSEQ
> > +static int proc_pid_rseq(struct seq_file *m, struct pid_namespace *ns,
> > +   struct pid *pid, struct task_struct *task)
> > +{
> > +   int res = lock_trace(task);
> > +
> > +   if (res)
> > +   return res;
> > +   task_lock(task);
> > +   seq_printf(m, "%px %08x\n", task->rseq, task->rseq_sig);
> > +   task_unlock(task);
> > +   unlock_trace(task);
> > +   return 0;
> > +}
> 
> Do we actually need task_lock() for this purpose?  Would
> exec_update_lock() alone be adequate and appropriate?

Now rseq syscall which modifies those fields isn't synchronised with
exec_update_lock. So either a new lock or task_lock() could be used or
exec_update_lock could be reused in the syscall.  I decided against
exec_update_lock reuse in the syscall because it's normally used to
guard access checks against concurrent setuid exec. This could be
potentially confusing as it's not relevant for the the rseq syscall
code.
I think task_lock usage here is also consistent with how it's used
across the kernel.

Whether we need consistent rseq and rseq_sig pairs in the proc output, I
think there's some argument for it (discussed also in parallel thread
with Mathieu Desnoyers).

Best regards,
Piotr.


[PATCH v3] fs/proc: Expose RSEQ configuration

2021-01-26 Thread Piotr Figiel
For userspace checkpoint and restore (C/R) some way of getting process
state containing RSEQ configuration is needed.

There are two ways this information is going to be used:
 - to re-enable RSEQ for threads which had it enabled before C/R
 - to detect if a thread was in a critical section during C/R

Since C/R preserves TLS memory and addresses RSEQ ABI will be restored
using the address registered before C/R.

Detection whether the thread is in a critical section during C/R is
needed to enforce behavior of RSEQ abort during C/R. Attaching with
ptrace() before registers are dumped itself doesn't cause RSEQ abort.
Restoring the instruction pointer within the critical section is
problematic because rseq_cs may get cleared before the control is
passed to the migrated application code leading to RSEQ invariants not
being preserved.

To achieve above goals expose the RSEQ structure address and the
signature value with the new per-thread procfs file "rseq".

Signed-off-by: Piotr Figiel 

---

v3:
 - added locking so that the proc file always shows consistent pair of
   RSEQ ABI address and the signature
 - changed string formatting to use %px for the RSEQ ABI address
v2:
 - fixed string formatting for 32-bit architectures
v1:
 - https://lkml.kernel.org/r/20210113174127.2500051-1-fig...@google.com

---
 fs/exec.c  |  2 ++
 fs/proc/base.c | 22 ++
 kernel/rseq.c  |  4 
 3 files changed, 28 insertions(+)

diff --git a/fs/exec.c b/fs/exec.c
index 5d4d52039105..5d84f98847f1 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1830,7 +1830,9 @@ static int bprm_execve(struct linux_binprm *bprm,
/* execve succeeded */
current->fs->in_exec = 0;
current->in_execve = 0;
+   task_lock(current);
rseq_execve(current);
+   task_unlock(current);
acct_update_integrals(current);
task_numa_free(current, false);
return retval;
diff --git a/fs/proc/base.c b/fs/proc/base.c
index b3422cda2a91..89232329d966 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -662,6 +662,22 @@ static int proc_pid_syscall(struct seq_file *m, struct 
pid_namespace *ns,
 
return 0;
 }
+
+#ifdef CONFIG_RSEQ
+static int proc_pid_rseq(struct seq_file *m, struct pid_namespace *ns,
+   struct pid *pid, struct task_struct *task)
+{
+   int res = lock_trace(task);
+
+   if (res)
+   return res;
+   task_lock(task);
+   seq_printf(m, "%px %08x\n", task->rseq, task->rseq_sig);
+   task_unlock(task);
+   unlock_trace(task);
+   return 0;
+}
+#endif /* CONFIG_RSEQ */
 #endif /* CONFIG_HAVE_ARCH_TRACEHOOK */
 
 //
@@ -3182,6 +3198,9 @@ static const struct pid_entry tgid_base_stuff[] = {
REG("comm",  S_IRUGO|S_IWUSR, proc_pid_set_comm_operations),
 #ifdef CONFIG_HAVE_ARCH_TRACEHOOK
ONE("syscall",S_IRUSR, proc_pid_syscall),
+#ifdef CONFIG_RSEQ
+   ONE("rseq",   S_IRUSR, proc_pid_rseq),
+#endif
 #endif
REG("cmdline",S_IRUGO, proc_pid_cmdline_ops),
ONE("stat",   S_IRUGO, proc_tgid_stat),
@@ -3522,6 +3541,9 @@ static const struct pid_entry tid_base_stuff[] = {
 _pid_set_comm_operations, {}),
 #ifdef CONFIG_HAVE_ARCH_TRACEHOOK
ONE("syscall",   S_IRUSR, proc_pid_syscall),
+#ifdef CONFIG_RSEQ
+   ONE("rseq",  S_IRUSR, proc_pid_rseq),
+#endif
 #endif
REG("cmdline",   S_IRUGO, proc_pid_cmdline_ops),
ONE("stat",  S_IRUGO, proc_tid_stat),
diff --git a/kernel/rseq.c b/kernel/rseq.c
index a4f86a9d6937..6aea67878065 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -322,8 +322,10 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, 
rseq_len,
ret = rseq_reset_rseq_cpu_id(current);
if (ret)
return ret;
+   task_lock(current);
current->rseq = NULL;
current->rseq_sig = 0;
+   task_unlock(current);
return 0;
}
 
@@ -353,8 +355,10 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, 
rseq_len,
return -EINVAL;
if (!access_ok(rseq, rseq_len))
return -EFAULT;
+   task_lock(current);
current->rseq = rseq;
current->rseq_sig = sig;
+   task_unlock(current);
/*
 * If rseq was previously inactive, and has just been
 * registered, ensure the cpu_id_start and cpu_id fields
-- 
2.30.0.280.ga3ce27912f-goog



Re: [PATCH v2] fs/proc: Expose RSEQ configuration

2021-01-18 Thread Piotr Figiel
Hi, thanks for review.

On Fri, Jan 15, 2021 at 10:44:20AM -0500, Mathieu Desnoyers wrote:
> - On Jan 14, 2021, at 1:54 PM, Piotr Figiel fig...@google.com wrote:
> Added PeterZ, Paul and Boqun to CC. They are also listed as maintainers of 
> rseq.
> Please CC them in your next round of patches.

OK.

> > Since C/R preserves TLS memory and addresses RSEQ ABI will be
> > restored using the address registered before C/R.
> How do you plan to re-register the rseq TLS for each thread upon
> restore ?

In CRIU restorer there is a moment when the code runs on behalf of the
restored thread after the memory is already restored but before the
control is passed to the application code. I'm going to use rseq()
syscall there with the checkpointed values of ABI address and signatures
(obtained via the newly added procfs file).

> I suspect you move the return IP to the abort either at checkpoint or
> restore if you detect that the thread is running in a rseq critical
> section.

Actually in the prototype implementation I use PTRACE_SINGLESTEP during
checkpointing (with some safeguards) to force the kernel to jump out of
the critical section before registers values are fetched. This has the
drawback though that the first instruction of abort handler is executed
upon checkpointing.
I'll likely rework it to update instruction pointer by getting abort
address with PTRACE_PEEKTEXT (via RSEQ ABI).
I think an option is to have a kernel interface to trigger the abort on
userspace's request without need for some hacks. This could be a ptrace
extension.  Alternatively attach could trigger RSEQ logic, but this is
potentially a breaking change for debuggers.

> > Detection whether the thread is in a critical section during C/R is
> > needed to enforce behavior of RSEQ abort during C/R. Attaching with
> > ptrace() before registers are dumped itself doesn't cause RSEQ
> > abort.
> Right, because the RSEQ abort is only done when going back to
> user-space, and AFAIU the checkpointed process will cease to exist,
> and won't go back to user-space, therefore bypassing any RSEQ abort.

The checkpointed process doesn't have to cease to exist, actually it can
continue, and when it's unpaused kernel will schedule the process and
should call the abort handler for RSEQ CS. But this will happen on the
checkpointing side after process state was already checkpointed.
For C/R is important that the checkpoint (serialized process state) is
safe wrt RSEQ.

> > Restoring the instruction pointer within the critical section is
> > problematic because rseq_cs may get cleared before the control is
> > passed to the migrated application code leading to RSEQ invariants
> > not being preserved.
> The commit message should state that both the per-thread rseq TLS area
> address and the signature are dumped within this new proc file.

I'll include this in v3, thanks.

> AFAIU lock_trace prevents concurrent exec() from modifying the task's
> content.  What prevents a concurrent rseq register/unregister to be
> executed concurrently with proc_pid_rseq ?

Yes, in this shape only ptrace prevents, as it was the intended use
case. Do you think it would make sense to add a mutex on task_struct for
the purpose of casual reader (sys admin?) consistency? This would be
locked only here and in the syscall during setting.

(Alternatively SMP barrier could be used to enforce the order so that
the ABI address is always written first, and the signature wouldn't make
sense on ABI address = 0, but probably being simply consistent is
better).

> I wonder if all those parentheses are needed. Wouldn't it be enough to have:

Will remove thanks.

Best regards, Piotr.


Re: [PATCH] fs/proc: Expose RSEQ configuration

2021-01-14 Thread Piotr Figiel
On Thu, Jan 14, 2021 at 12:32:30AM +0300, Alexey Dobriyan wrote:
> On Wed, Jan 13, 2021 at 06:41:27PM +0100, Piotr Figiel wrote:
> > For userspace checkpoint and restore (C/R) some way of getting process
> > state containing RSEQ configuration is needed.
> > +   seq_printf(m, "0x%llx 0x%x\n", (uint64_t)task->rseq, task->rseq_sig);
> %llx is too much on 32-bit. "%tx %x" is better (or even %08x)

Hi, many thanks for the suggestion. I applied this on v2,
https://lore.kernel.org/linux-fsdevel/20210114185445.996-1-fig...@google.com
I had to cast it via uintptr_t to cast-away the user address space
without warnings. Could you please take a look?

Best regards, Piotr.


[PATCH v2] fs/proc: Expose RSEQ configuration

2021-01-14 Thread Piotr Figiel
For userspace checkpoint and restore (C/R) some way of getting process
state containing RSEQ configuration is needed.

There are two ways this information is going to be used:
 - to re-enable RSEQ for threads which had it enabled before C/R
 - to detect if a thread was in a critical section during C/R

Since C/R preserves TLS memory and addresses RSEQ ABI will be restored
using the address registered before C/R.

Detection whether the thread is in a critical section during C/R is
needed to enforce behavior of RSEQ abort during C/R. Attaching with
ptrace() before registers are dumped itself doesn't cause RSEQ abort.
Restoring the instruction pointer within the critical section is
problematic because rseq_cs may get cleared before the control is
passed to the migrated application code leading to RSEQ invariants not
being preserved.

Signed-off-by: Piotr Figiel 

---

v2:
 - fixed string formatting for 32-bit architectures

v1:
 - https://lkml.kernel.org/r/20210113174127.2500051-1-fig...@google.com

---
 fs/proc/base.c | 21 +
 1 file changed, 21 insertions(+)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index b3422cda2a91..7cc36a224b8b 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -662,6 +662,21 @@ static int proc_pid_syscall(struct seq_file *m, struct 
pid_namespace *ns,
 
return 0;
 }
+
+#ifdef CONFIG_RSEQ
+static int proc_pid_rseq(struct seq_file *m, struct pid_namespace *ns,
+   struct pid *pid, struct task_struct *task)
+{
+   int res = lock_trace(task);
+
+   if (res)
+   return res;
+   seq_printf(m, "%tx %08x\n", (ptrdiff_t)((uintptr_t)task->rseq),
+  task->rseq_sig);
+   unlock_trace(task);
+   return 0;
+}
+#endif /* CONFIG_RSEQ */
 #endif /* CONFIG_HAVE_ARCH_TRACEHOOK */
 
 //
@@ -3182,6 +3197,9 @@ static const struct pid_entry tgid_base_stuff[] = {
REG("comm",  S_IRUGO|S_IWUSR, proc_pid_set_comm_operations),
 #ifdef CONFIG_HAVE_ARCH_TRACEHOOK
ONE("syscall",S_IRUSR, proc_pid_syscall),
+#ifdef CONFIG_RSEQ
+   ONE("rseq",   S_IRUSR, proc_pid_rseq),
+#endif
 #endif
REG("cmdline",S_IRUGO, proc_pid_cmdline_ops),
ONE("stat",   S_IRUGO, proc_tgid_stat),
@@ -3522,6 +3540,9 @@ static const struct pid_entry tid_base_stuff[] = {
 _pid_set_comm_operations, {}),
 #ifdef CONFIG_HAVE_ARCH_TRACEHOOK
ONE("syscall",   S_IRUSR, proc_pid_syscall),
+#ifdef CONFIG_RSEQ
+   ONE("rseq",  S_IRUSR, proc_pid_rseq),
+#endif
 #endif
REG("cmdline",   S_IRUGO, proc_pid_cmdline_ops),
ONE("stat",  S_IRUGO, proc_tid_stat),
-- 
2.30.0.284.gd98b1dd5eaa7-goog



[PATCH] fs/proc: Expose RSEQ configuration

2021-01-13 Thread Piotr Figiel
For userspace checkpoint and restore (C/R) some way of getting process
state containing RSEQ configuration is needed.

There are two ways this information is going to be used:
 - to re-enable RSEQ for threads which had it enabled before C/R
 - to detect if a thread was in a critical section during C/R

Since C/R preserves TLS memory and addresses RSEQ ABI will be restored
using the address registered before C/R.

Detection whether the thread is in a critical section during C/R is
needed to enforce behavior of RSEQ abort during C/R. Attaching with
ptrace() before registers are dumped itself doesn't cause RSEQ abort.
Restoring the instruction pointer within the critical section is
problematic because rseq_cs may get cleared before the control is
passed to the migrated application code leading to RSEQ invariants not
being preserved.

Signed-off-by: Piotr Figiel 
---
 fs/proc/base.c | 20 
 1 file changed, 20 insertions(+)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index b3422cda2a91..3d4712ac4370 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -662,6 +662,20 @@ static int proc_pid_syscall(struct seq_file *m, struct 
pid_namespace *ns,
 
return 0;
 }
+
+#ifdef CONFIG_RSEQ
+static int proc_pid_rseq(struct seq_file *m, struct pid_namespace *ns,
+   struct pid *pid, struct task_struct *task)
+{
+   int res = lock_trace(task);
+
+   if (res)
+   return res;
+   seq_printf(m, "0x%llx 0x%x\n", (uint64_t)task->rseq, task->rseq_sig);
+   unlock_trace(task);
+   return 0;
+}
+#endif /* CONFIG_RSEQ */
 #endif /* CONFIG_HAVE_ARCH_TRACEHOOK */
 
 //
@@ -3182,6 +3196,9 @@ static const struct pid_entry tgid_base_stuff[] = {
REG("comm",  S_IRUGO|S_IWUSR, proc_pid_set_comm_operations),
 #ifdef CONFIG_HAVE_ARCH_TRACEHOOK
ONE("syscall",S_IRUSR, proc_pid_syscall),
+#ifdef CONFIG_RSEQ
+   ONE("rseq",   S_IRUSR, proc_pid_rseq),
+#endif
 #endif
REG("cmdline",S_IRUGO, proc_pid_cmdline_ops),
ONE("stat",   S_IRUGO, proc_tgid_stat),
@@ -3522,6 +3539,9 @@ static const struct pid_entry tid_base_stuff[] = {
 _pid_set_comm_operations, {}),
 #ifdef CONFIG_HAVE_ARCH_TRACEHOOK
ONE("syscall",   S_IRUSR, proc_pid_syscall),
+#ifdef CONFIG_RSEQ
+   ONE("rseq",  S_IRUSR, proc_pid_rseq),
+#endif
 #endif
REG("cmdline",   S_IRUGO, proc_pid_cmdline_ops),
ONE("stat",  S_IRUGO, proc_tid_stat),
-- 
2.30.0.284.gd98b1dd5eaa7-goog