Re: [PATCH 2/8] pstore: Do not use crash buffer for decompression

2018-11-29 Thread Joel Fernandes
On Thu, Nov 29, 2018 at 02:06:39PM -0800, Kees Cook wrote:
> On Tue, Nov 13, 2018 at 11:56 PM Kees Cook  wrote:
> > On Fri, Nov 2, 2018 at 1:24 PM, Joel Fernandes  
> > wrote:
> > > On Thu, Nov 01, 2018 at 04:51:54PM -0700, Kees Cook wrote:
> > >> + workspace = kmalloc(unzipped_len + record->ecc_notice_size,
> > >
> > > Should this be unzipped_len + record->ecc_notice_size + 1? The extra byte
> > > being for the NUL character of the ecc notice?
> > >
> > > This occurred to me when I saw the + 1 in ram.c. It could be better to
> > > just abstract the size as a macro.
> >
> > Ooh, yes, good catch. I'll get this fixed.
> 
> I spent more time looking at this, and it seems that only the initial
> creation of this string needs the +1, since all other operations are
> byte-based, not NUL-terminated string based. It's a bit odd, and I
> might try to clean it up differently, but as it stands, this is okay.
> (See inode.c which doesn't include the trailing NUL byte.)

Ok. Yes, it does seem a bit inconsistent, but I agree it's not an issue for
this particular patch. Sorry to waste your time, thanks!
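
For illustration, the macro abstraction suggested above might look roughly
like this (a sketch with a made-up name, not the actual pstore code):

	/* +1 for the NUL terminator written when the ecc notice is created */
	#define PSTORE_DECOMP_BUF_SIZE(record, len) \
		((len) + (record)->ecc_notice_size + 1)

	workspace = kmalloc(PSTORE_DECOMP_BUF_SIZE(record, unzipped_len),
			    GFP_KERNEL);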

 - Joel



Re: [RFC][PATCH 11/14] function_graph: Convert ret_stack to a series of longs

2018-11-27 Thread Joel Fernandes
On Mon, Nov 26, 2018 at 11:26:03AM -0500, Steven Rostedt wrote:
> On Tue, 27 Nov 2018 01:07:55 +0900
> Masami Hiramatsu  wrote:
> 
> > > > --- a/include/linux/sched.h
> > > > +++ b/include/linux/sched.h
> > > > @@ -1119,7 +1119,7 @@ struct task_struct {
> > > > int curr_ret_depth;
> > > >  
> > > > /* Stack of return addresses for return function tracing: */
> > > > -   struct ftrace_ret_stack *ret_stack;
> > > > +   unsigned long   *ret_stack;
> > > >  
> > > > /* Timestamp for last schedule: */
> > > > unsigned long long  ftrace_timestamp;
> > > > diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
> > > > index 9b85638ecded..1389fe39f64c 100644
> > > > --- a/kernel/trace/fgraph.c
> > > > +++ b/kernel/trace/fgraph.c
> > > > @@ -23,6 +23,17 @@
> > > >  #define ASSIGN_OPS_HASH(opsname, val)
> > > >  #endif
> > > >  
> > > > +#define FGRAPH_RET_SIZE (sizeof(struct ftrace_ret_stack))
> > > > +#define FGRAPH_RET_INDEX (ALIGN(FGRAPH_RET_SIZE, sizeof(long)) / sizeof(long))
> > > > +#define SHADOW_STACK_SIZE (FTRACE_RETFUNC_DEPTH * FGRAPH_RET_SIZE)
> > > > +#define SHADOW_STACK_INDEX \
> > > > +   (ALIGN(SHADOW_STACK_SIZE, sizeof(long)) / sizeof(long))
> > > > +#define SHADOW_STACK_MAX_INDEX (SHADOW_STACK_INDEX - FGRAPH_RET_INDEX)
> > > > +
> > > > +#define RET_STACK(t, index) ((struct ftrace_ret_stack *)(&(t)->ret_stack[index]))
> > > > +#define RET_STACK_INC(c) ({ c += FGRAPH_RET_INDEX; })
> > > > +#define RET_STACK_DEC(c) ({ c -= FGRAPH_RET_INDEX; })
> > > > +  
> > > [...]  
> > > > @@ -514,7 +531,7 @@ void ftrace_graph_init_task(struct task_struct *t)
> > > >  
> > > >  void ftrace_graph_exit_task(struct task_struct *t)
> > > >  {
> > > > -   struct ftrace_ret_stack *ret_stack = t->ret_stack;
> > > > +   unsigned long *ret_stack = t->ret_stack;
> > > >  
> > > > t->ret_stack = NULL;
> > > > /* NULL must become visible to IRQs before we free it: */
> > > > @@ -526,12 +543,10 @@ void ftrace_graph_exit_task(struct task_struct *t)
> > > >  /* Allocate a return stack for each task */
> > > >  static int start_graph_tracing(void)
> > > >  {
> > > > -   struct ftrace_ret_stack **ret_stack_list;
> > > > +   unsigned long **ret_stack_list;
> > > > int ret, cpu;
> > > >  
> > > > -   ret_stack_list = kmalloc_array(FTRACE_RETSTACK_ALLOC_SIZE,
> > > > -  sizeof(struct ftrace_ret_stack *),
> > > > -  GFP_KERNEL);
> > > > +   ret_stack_list = kmalloc(SHADOW_STACK_SIZE, GFP_KERNEL);
> > > >
> > > 
> > > I had dumped the fgraph size-related macros to understand the patch
> > > better, and I got:
> > > [0.909528] val of FGRAPH_RET_SIZE is 40
> > > [0.910250] val of FGRAPH_RET_INDEX is 5
> > > [0.910866] val of FGRAPH_ARRAY_SIZE is 16
> > > [0.911488] val of FGRAPH_ARRAY_MASK is 255
> > > [0.912134] val of FGRAPH_MAX_INDEX is 16
> > > [0.912751] val of FGRAPH_INDEX_SHIFT is 8
> > > [0.913382] val of FGRAPH_FRAME_SIZE is 168
> > > [0.914033] val of FGRAPH_FRAME_INDEX is 21
> > >   FTRACE_RETFUNC_DEPTH is 50
> > > [0.914686] val of SHADOW_STACK_SIZE is 8400
> > > 
> > > I had a concern about memory overhead per-task. It seems the total memory
> > > needed per task for the stack is 8400 bytes, i.e. 50 entries of 168 bytes
> > > each (with my configuration, with FUNCTION_PROFILE turned off).
> > > 
> > > Whereas before it would be 32 * 40 = 1280 bytes. That looks like ~7 times
> > > more than before.  
> > 
> > Hmm, this seems too big... I thought the shadow-stack size should be
> > smaller than 1 page (4kB). Steve, can we give a 4k page for shadow stack
> > and define FTRACE_RETFUNC_DEPTH = 4096 / FGRAPH_RET_SIZE ?
> 
> For the first pass, I decided not to worry about the size. It made the
> code less complex :-)
> 
> Yes, I plan on working on making the size of the stack smaller, but
> that will probably be added on patches to do so.

Cool, sounds good.

> > > On my system with ~4000 threads, that becomes ~32MB which seems a bit
> > > wasteful especially if there was only one or 2 function graph callbacks
> > > registered and most of the callback array in the stack isn't used.
> 
> Note, all 4000 threads could be doing those trace backs, and if you are
> doing full function graph tracing, it will use a lot.

But I think each of the threads will only use N entries in the callback array,
where N is the number of function graph callback users who registered, right?
So the remaining (total - N) allocated callback array entries per thread will
go unused.

> > > Could we make the array size configurable at compile time and start it
> > > with a small number like 4 or 6?  
> > 
> > Or, we can introduce online setting :)
> 
> Yes, that can easily be added. I didn't try to make this into the
> perfect solution, I wanted 

Re: [PATCH -next 1/2] mm/memfd: make F_SEAL_FUTURE_WRITE seal more robust

2018-11-26 Thread Joel Fernandes
On Sat, Nov 24, 2018 at 04:47:36PM -0800, Matthew Wilcox wrote:
> On Sat, Nov 24, 2018 at 04:42:29PM -0800, Andrew Morton wrote:
> > This changelog doesn't have the nifty test case code which was in
> > earlier versions?
> 
> Why do we put regression tests in the changelogs anyway?  We have
> tools/testing/selftests/vm/ already, perhaps they should go there?

The reason I didn't add it is that the test case went out of date, and the
updated version of the test case went into the selftests in patch 2/2. I
thought that would suffice since it covers all the cases. That's why I
dropped it.  Would that be Ok?

The changelog of the previous series had it because the selftest was added
only later.

Let me know, thanks,

 - Joel



Re: [RFC][PATCH 11/14] function_graph: Convert ret_stack to a series of longs

2018-11-23 Thread Joel Fernandes
On Wed, Nov 21, 2018 at 08:27:19PM -0500, Steven Rostedt wrote:
> From: "Steven Rostedt (VMware)" 
> 
> In order to make it possible to have multiple callbacks registered with the
> function_graph tracer, the retstack needs to be converted from an array of
> ftrace_ret_stack structures to an array of longs. This will allow storing
> the list of callbacks on the stack for the return side of the functions.
> 
> [ Note, this currently breaks architectures that access the ret_stack of a
>   task to handle unwinding when 'return_to_handler' is on the stack ]
> 
> Signed-off-by: Steven Rostedt (VMware) 
> ---
>  include/linux/sched.h |   2 +-
>  kernel/trace/fgraph.c | 123 +++---
>  2 files changed, 70 insertions(+), 55 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index d6183a55e8eb..71a084a300da 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1119,7 +1119,7 @@ struct task_struct {
>   int curr_ret_depth;
>  
>   /* Stack of return addresses for return function tracing: */
> - struct ftrace_ret_stack *ret_stack;
> + unsigned long   *ret_stack;
>  
>   /* Timestamp for last schedule: */
>   unsigned long long  ftrace_timestamp;
> diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
> index 9b85638ecded..1389fe39f64c 100644
> --- a/kernel/trace/fgraph.c
> +++ b/kernel/trace/fgraph.c
> @@ -23,6 +23,17 @@
>  #define ASSIGN_OPS_HASH(opsname, val)
>  #endif
>  
> +#define FGRAPH_RET_SIZE (sizeof(struct ftrace_ret_stack))
> +#define FGRAPH_RET_INDEX (ALIGN(FGRAPH_RET_SIZE, sizeof(long)) / sizeof(long))
> +#define SHADOW_STACK_SIZE (FTRACE_RETFUNC_DEPTH * FGRAPH_RET_SIZE)
> +#define SHADOW_STACK_INDEX   \
> + (ALIGN(SHADOW_STACK_SIZE, sizeof(long)) / sizeof(long))
> +#define SHADOW_STACK_MAX_INDEX (SHADOW_STACK_INDEX - FGRAPH_RET_INDEX)
> +
> +#define RET_STACK(t, index) ((struct ftrace_ret_stack *)(&(t)->ret_stack[index]))
> +#define RET_STACK_INC(c) ({ c += FGRAPH_RET_INDEX; })
> +#define RET_STACK_DEC(c) ({ c -= FGRAPH_RET_INDEX; })
> +
[...]
> @@ -514,7 +531,7 @@ void ftrace_graph_init_task(struct task_struct *t)
>  
>  void ftrace_graph_exit_task(struct task_struct *t)
>  {
> - struct ftrace_ret_stack *ret_stack = t->ret_stack;
> + unsigned long *ret_stack = t->ret_stack;
>  
>   t->ret_stack = NULL;
>   /* NULL must become visible to IRQs before we free it: */
> @@ -526,12 +543,10 @@ void ftrace_graph_exit_task(struct task_struct *t)
>  /* Allocate a return stack for each task */
>  static int start_graph_tracing(void)
>  {
> - struct ftrace_ret_stack **ret_stack_list;
> + unsigned long **ret_stack_list;
>   int ret, cpu;
>  
> - ret_stack_list = kmalloc_array(FTRACE_RETSTACK_ALLOC_SIZE,
> -sizeof(struct ftrace_ret_stack *),
> -GFP_KERNEL);
> + ret_stack_list = kmalloc(SHADOW_STACK_SIZE, GFP_KERNEL);
>  
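
To make the index arithmetic concrete, here is a sketch of how the macros
above walk the long-array shadow stack (illustrative only, not code from the
patch):

	struct task_struct *t = current;
	struct ftrace_ret_stack *frame;
	int index = 0;

	frame = RET_STACK(t, index);	/* first ftrace_ret_stack frame */
	RET_STACK_INC(index);		/* advance by FGRAPH_RET_INDEX longs */
	frame = RET_STACK(t, index);	/* second frame */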

I had dumped the fgraph size related macros to understand the patch better, I
got:
[0.909528] val of FGRAPH_RET_SIZE is 40
[0.910250] val of FGRAPH_RET_INDEX is 5
[0.910866] val of FGRAPH_ARRAY_SIZE is 16
[0.911488] val of FGRAPH_ARRAY_MASK is 255
[0.912134] val of FGRAPH_MAX_INDEX is 16
[0.912751] val of FGRAPH_INDEX_SHIFT is 8
[0.913382] val of FGRAPH_FRAME_SIZE is 168
[0.914033] val of FGRAPH_FRAME_INDEX is 21
  FTRACE_RETFUNC_DEPTH is 50
[0.914686] val of SHADOW_STACK_SIZE is 8400

I had a concern about memory overhead per-task. It seems the total memory
needed per task for the stack is 8400 bytes, i.e. 50 entries of 168 bytes
each (with my configuration, with FUNCTION_PROFILE turned off).

Whereas before it would be 32 * 40 = 1280 bytes. That looks like ~7 times
more than before.

On my system with ~4000 threads, that becomes ~32MB which seems a bit
wasteful especially if there was only one or 2 function graph callbacks
registered and most of the callback array in the stack isn't used.

Could we make the array size configurable at compile time and start it with a
small number like 4 or 6?

Also for patches 1 through 10:
Reviewed-by: Joel Fernandes (Google) 

thanks,

 - Joel



Re: [RFC][PATCH 06/14] fgraph: Move function graph specific code into fgraph.c

2018-11-23 Thread Joel Fernandes
On Fri, Nov 23, 2018 at 01:11:38PM -0500, Steven Rostedt wrote:
> On Fri, 23 Nov 2018 12:58:34 -0500
> Steven Rostedt  wrote:
> 
> > I think the better answer is to move it into trace_functions_graph.c.
> 
> I take that back. I think the better answer is to not call that
> function if the profiler is not set, nor have that option even
> available. Because it has no meaning without the profiler.

Agreed, that's better. Thanks,

 - Joel


Re: [RFC][PATCH 06/14] fgraph: Move function graph specific code into fgraph.c

2018-11-22 Thread Joel Fernandes
On Wed, Nov 21, 2018 at 08:27:14PM -0500, Steven Rostedt wrote:
> From: "Steven Rostedt (VMware)" 
> 
> To make the function graph infrastructure more manageable, the code needs to
> be in its own file (fgraph.c). Move the code that is specific for managing
> the function graph infrastructure out of ftrace.c and into fgraph.c
> 
> Signed-off-by: Steven Rostedt (VMware) 

I think this patch causes a build error if CONFIG_FUNCTION_PROFILER is
disabled but function graph is enabled. The following diff fixes it for me.

thanks,

 - Joel

 8<--
 
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 3b8c307c7ff0..ce38bb962f91 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -382,6 +382,15 @@ static void ftrace_update_pid_func(void)
update_ftrace_function();
 }
 
+#ifdef CONFIG_FUNCTION_GRAPH_TRACER
+static bool fgraph_graph_time = true;
+
+void ftrace_graph_graph_time_control(bool enable)
+{
+   fgraph_graph_time = enable;
+}
+#endif
+
 #ifdef CONFIG_FUNCTION_PROFILER
 struct ftrace_profile {
struct hlist_node   node;
@@ -783,12 +792,6 @@ function_profile_call(unsigned long ip, unsigned long parent_ip,
 }
 
 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
-static bool fgraph_graph_time = true;
-
-void ftrace_graph_graph_time_control(bool enable)
-{
-   fgraph_graph_time = enable;
-}
 
 static int profile_graph_entry(struct ftrace_graph_ent *trace)
 {


Re: [RFC][PATCH 09/14] function_graph: Move ftrace_graph_get_addr() to fgraph.c

2018-11-22 Thread Joel Fernandes
On Wed, Nov 21, 2018 at 08:27:17PM -0500, Steven Rostedt wrote:
> From: "Steven Rostedt (VMware)" 
> 
> Move the function function_graph_get_addr() to fgraph.c, as the management
> of the curr_ret_stack is going to change, and all the accesses to ret_stack
> need to be done in fgraph.c.

s/ftrace_graph_get_addr/ftrace_graph_ret_addr/

thanks,

 - Joel

> 
> Signed-off-by: Steven Rostedt (VMware) 
> ---
>  kernel/trace/fgraph.c| 55 
>  kernel/trace/trace_functions_graph.c | 55 
>  2 files changed, 55 insertions(+), 55 deletions(-)
> 
> diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
> index f3a89ecac671..c7d612897e33 100644
> --- a/kernel/trace/fgraph.c
> +++ b/kernel/trace/fgraph.c
> @@ -233,6 +233,61 @@ unsigned long ftrace_return_to_handler(unsigned long frame_pointer)
>   return ret;
>  }
>  
> +/**
> + * ftrace_graph_ret_addr - convert a potentially modified stack return address
> + *  to its original value
> + *
> + * This function can be called by stack unwinding code to convert a found stack
> + * return address ('ret') to its original value, in case the function graph
> + * tracer has modified it to be 'return_to_handler'.  If the address hasn't
> + * been modified, the unchanged value of 'ret' is returned.
> + *
> + * 'idx' is a state variable which should be initialized by the caller to zero
> + * before the first call.
> + *
> + * 'retp' is a pointer to the return address on the stack.  It's ignored if
> + * the arch doesn't have HAVE_FUNCTION_GRAPH_RET_ADDR_PTR defined.
> + */
> +#ifdef HAVE_FUNCTION_GRAPH_RET_ADDR_PTR
> +unsigned long ftrace_graph_ret_addr(struct task_struct *task, int *idx,
> + unsigned long ret, unsigned long *retp)
> +{
> + int index = task->curr_ret_stack;
> + int i;
> +
> + if (ret != (unsigned long)return_to_handler)
> + return ret;
> +
> + if (index < 0)
> + return ret;
> +
> + for (i = 0; i <= index; i++)
> + if (task->ret_stack[i].retp == retp)
> + return task->ret_stack[i].ret;
> +
> + return ret;
> +}
> +#else /* !HAVE_FUNCTION_GRAPH_RET_ADDR_PTR */
> +unsigned long ftrace_graph_ret_addr(struct task_struct *task, int *idx,
> + unsigned long ret, unsigned long *retp)
> +{
> + int task_idx;
> +
> + if (ret != (unsigned long)return_to_handler)
> + return ret;
> +
> + task_idx = task->curr_ret_stack;
> +
> + if (!task->ret_stack || task_idx < *idx)
> + return ret;
> +
> + task_idx -= *idx;
> + (*idx)++;
> +
> + return task->ret_stack[task_idx].ret;
> +}
> +#endif /* HAVE_FUNCTION_GRAPH_RET_ADDR_PTR */
> +
>  static struct ftrace_ops graph_ops = {
>   .func   = ftrace_stub,
>   .flags  = FTRACE_OPS_FL_RECURSION_SAFE |
> diff --git a/kernel/trace/trace_functions_graph.c b/kernel/trace/trace_functions_graph.c
> index 7c7fd13d2373..0f9cbc30645d 100644
> --- a/kernel/trace/trace_functions_graph.c
> +++ b/kernel/trace/trace_functions_graph.c
> @@ -90,61 +90,6 @@ static void
>  print_graph_duration(struct trace_array *tr, unsigned long long duration,
>struct trace_seq *s, u32 flags);
>  
> -/**
> - * ftrace_graph_ret_addr - convert a potentially modified stack return address
> - *  to its original value
> - *
> - * This function can be called by stack unwinding code to convert a found stack
> - * return address ('ret') to its original value, in case the function graph
> - * tracer has modified it to be 'return_to_handler'.  If the address hasn't
> - * been modified, the unchanged value of 'ret' is returned.
> - *
> - * 'idx' is a state variable which should be initialized by the caller to zero
> - * before the first call.
> - *
> - * 'retp' is a pointer to the return address on the stack.  It's ignored if
> - * the arch doesn't have HAVE_FUNCTION_GRAPH_RET_ADDR_PTR defined.
> - */
> -#ifdef HAVE_FUNCTION_GRAPH_RET_ADDR_PTR
> -unsigned long ftrace_graph_ret_addr(struct task_struct *task, int *idx,
> - unsigned long ret, unsigned long *retp)
> -{
> - int index = task->curr_ret_stack;
> - int i;
> -
> - if (ret != (unsigned long)return_to_handler)
> - return ret;
> -
> - if (index < 0)
> - return ret;
> -
> - for (i = 0; i <= index; i++)
> - if (task->ret_stack[i].retp == retp)
> - return task->ret_stack[i].ret;
> -
> - return ret;
> -}
> -#else /* !HAVE_FUNCTION_GRAPH_RET_ADDR_PTR */
> -unsigned long ftrace_graph_ret_addr(struct task_struct *task, int *idx,
> - unsigned long ret, unsigned long *retp)
> -{
> - int task_idx;
> -
> - if (ret != (unsigned long)return_to_handler)
> - return ret;
> -
> - task_idx = 
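
For context, an arch unwinder consumes this helper roughly as follows (a
sketch based on the kernel-doc above; 'task', 'found_ret' and 'retp' stand in
for whatever the unwinder has at hand):

	unsigned long addr;
	int graph_idx = 0;	/* state variable, zeroed before the first call */

	/* for each return address found while walking the stack: */
	addr = ftrace_graph_ret_addr(task, &graph_idx, found_ret, retp);
	/* addr is the original return address, even if the stack slot had
	 * been rewritten to return_to_handler by the graph tracer */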

Re: [RFC][PATCH 07/14] fgraph: Add new fgraph_ops structure to enable function graph hooks

2018-11-22 Thread Joel Fernandes
On Wed, Nov 21, 2018 at 08:27:15PM -0500, Steven Rostedt wrote:
> From: "Steven Rostedt (VMware)" 
> 
> Currently the registering of function graph is to pass in an entry and return
> function. We need to have a way to associate those functions together where
> the entry can determine to run the return hook. Having a structure that
> contains both functions will facilitate the process of converting the code
> to be able to do so.
> 
> This is similar to the way function hooks are enabled (it passes in
> ftrace_ops). Instead of passing in the functions to use, a single structure
> is passed in to the registering function.
> 
> The unregister function is now passed in the fgraph_ops handle. When we
> allow more than one callback to the function graph hooks, this will let the
> system know which one to remove.
> 
> Signed-off-by: Steven Rostedt (VMware) 
> ---
>  include/linux/ftrace.h   | 24 +---
>  kernel/trace/fgraph.c|  9 -
>  kernel/trace/ftrace.c| 10 +++---
>  kernel/trace/trace_functions_graph.c | 21 -
>  kernel/trace/trace_irqsoff.c | 10 +++---
>  kernel/trace/trace_sched_wakeup.c| 10 +++---
>  kernel/trace/trace_selftest.c|  8 ++--
>  7 files changed, 64 insertions(+), 28 deletions(-)
> 
> diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
> index f98063e273e5..477ff9412d26 100644
> --- a/include/linux/ftrace.h
> +++ b/include/linux/ftrace.h
> @@ -749,6 +749,18 @@ typedef int (*trace_func_graph_ent_t)(struct ftrace_graph_ent *); /* entry */
>  
>  #ifdef CONFIG_FUNCTION_GRAPH_TRACER
>  
> +struct fgraph_ops {
> + trace_func_graph_ent_t  entryfunc;
> + trace_func_graph_ret_t  retfunc;
> + struct fgraph_ops __rcu *next;
> + unsigned long   flags;
> + void*private;
> +#ifdef CONFIG_DYNAMIC_FTRACE
> + struct ftrace_ops_hash  local_hash;
> + struct ftrace_ops_hash  *func_hash;
> +#endif
> +};
> +
>  /*
>   * Stack of return addresses for functions
>   * of a thread.
> @@ -792,8 +804,9 @@ unsigned long ftrace_graph_ret_addr(struct task_struct *task, int *idx,
>  
>  #define FTRACE_RETFUNC_DEPTH 50
>  #define FTRACE_RETSTACK_ALLOC_SIZE 32
> -extern int register_ftrace_graph(trace_func_graph_ret_t retfunc,
> - trace_func_graph_ent_t entryfunc);
> +
> +extern int register_ftrace_graph(struct fgraph_ops *ops);
> +extern void unregister_ftrace_graph(struct fgraph_ops *ops);
>  
>  extern bool ftrace_graph_is_dead(void);
>  extern void ftrace_graph_stop(void);
> @@ -802,8 +815,6 @@ extern void ftrace_graph_stop(void);
>  extern trace_func_graph_ret_t ftrace_graph_return;
>  extern trace_func_graph_ent_t ftrace_graph_entry;
>  
> -extern void unregister_ftrace_graph(void);
> -
>  extern void ftrace_graph_init_task(struct task_struct *t);
>  extern void ftrace_graph_exit_task(struct task_struct *t);
>  extern void ftrace_graph_init_idle_task(struct task_struct *t, int cpu);
> @@ -830,12 +841,11 @@ static inline void ftrace_graph_init_task(struct task_struct *t) { }
>  static inline void ftrace_graph_exit_task(struct task_struct *t) { }
>  static inline void ftrace_graph_init_idle_task(struct task_struct *t, int cpu) { }
>  
> -static inline int register_ftrace_graph(trace_func_graph_ret_t retfunc,
> -   trace_func_graph_ent_t entryfunc)
> +static inline int register_ftrace_graph(struct fgraph_ops *ops);
>  {
>   return -1;
>  }
> -static inline void unregister_ftrace_graph(void) { }
> +static inline void unregister_ftrace_graph(struct fgraph_ops *ops) { }
>  
>  static inline int task_curr_ret_stack(struct task_struct *tsk)
>  {
> diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
> index b9c7dbbbdd96..f3a89ecac671 100644
> --- a/kernel/trace/fgraph.c
> +++ b/kernel/trace/fgraph.c
> @@ -491,8 +491,7 @@ static int start_graph_tracing(void)
>   return ret;
>  }
>  
> -int register_ftrace_graph(trace_func_graph_ret_t retfunc,
> - trace_func_graph_ent_t entryfunc)
> +int register_ftrace_graph(struct fgraph_ops *gops)
>  {
>   int ret = 0;
>  
> @@ -513,7 +512,7 @@ int register_ftrace_graph(trace_func_graph_ret_t retfunc,
>   goto out;
>   }
>  
> - ftrace_graph_return = retfunc;
> + ftrace_graph_return = gops->retfunc;
>  
>   /*
>* Update the indirect function to the entryfunc, and the
> @@ -521,7 +520,7 @@ int register_ftrace_graph(trace_func_graph_ret_t retfunc,
>* call the update fgraph entry function to determine if
>* the entryfunc should be called directly or not.
>*/
> - __ftrace_graph_entry = entryfunc;
> + __ftrace_graph_entry = gops->entryfunc;
>   ftrace_graph_entry = ftrace_graph_entry_test;
>   update_function_graph_func();
>  
> @@ -531,7 +530,7 @@ int 
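
For context, with this patch in place a function graph user registers roughly
like this (a sketch derived from the header changes above; the callback
bodies are illustrative only):

	static int my_entry(struct ftrace_graph_ent *trace)
	{
		return 1;	/* trace this function */
	}

	static void my_return(struct ftrace_graph_ret *trace)
	{
	}

	static struct fgraph_ops my_gops = {
		.entryfunc	= my_entry,
		.retfunc	= my_return,
	};

	register_ftrace_graph(&my_gops);
	/* ... tracing runs ... */
	unregister_ftrace_graph(&my_gops);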

Re: [PATCH -next 2/2] selftests/memfd: modify tests for F_SEAL_FUTURE_WRITE seal

2018-11-22 Thread Joel Fernandes
On Mon, Nov 19, 2018 at 09:21:37PM -0800, Joel Fernandes (Google) wrote:
> Modify the tests for F_SEAL_FUTURE_WRITE based on the changes
> introduced in previous patch.
> 
> Also add a test to make sure the reopen issue pointed by Jann Horn [1]
> is fixed.
> 
> [1] https://lore.kernel.org/lkml/CAG48ez1h=v-JYnDw81HaYJzOfrNhwYksxmc2r=cjvdqvgym...@mail.gmail.com/
> 
> Cc: Jann Horn 
> Signed-off-by: Joel Fernandes (Google) 
> ---
>  tools/testing/selftests/memfd/memfd_test.c | 88 +++---
>  1 file changed, 44 insertions(+), 44 deletions(-)

Since we squashed [1] the mm/memfd patch modifications suggested by Andy into
the original patch, I also squashed the selftests modifications and appended
the patch inline below if you want to take this instead:

[1] https://lore.kernel.org/lkml/20181122230906.ga198...@google.com/T/#m8ba68f67f3ec24913a977b62bcaeafc4b194b8c8

---8<---

From: "Joel Fernandes (Google)" 
Subject: [PATCH v4] selftests/memfd: add tests for F_SEAL_FUTURE_WRITE seal

Add tests to verify sealing memfds with the F_SEAL_FUTURE_WRITE works as
expected.

Signed-off-by: Joel Fernandes (Google) 
---
 tools/testing/selftests/memfd/memfd_test.c | 74 ++
 1 file changed, 74 insertions(+)

diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c
index 10baa1652fc2..c67d32eeb668 100644
--- a/tools/testing/selftests/memfd/memfd_test.c
+++ b/tools/testing/selftests/memfd/memfd_test.c
@@ -54,6 +54,22 @@ static int mfd_assert_new(const char *name, loff_t sz, unsigned int flags)
return fd;
 }
 
+static int mfd_assert_reopen_fd(int fd_in)
+{
+   int r, fd;
+   char path[100];
+
+   sprintf(path, "/proc/self/fd/%d", fd_in);
+
+   fd = open(path, O_RDWR);
+   if (fd < 0) {
+   printf("re-open of existing fd %d failed\n", fd_in);
+   abort();
+   }
+
+   return fd;
+}
+
 static void mfd_fail_new(const char *name, unsigned int flags)
 {
int r;
@@ -255,6 +271,25 @@ static void mfd_assert_read(int fd)
munmap(p, mfd_def_size);
 }
 
+/* Test that PROT_READ + MAP_SHARED mappings work. */
+static void mfd_assert_read_shared(int fd)
+{
+   void *p;
+
+   /* verify PROT_READ and MAP_SHARED *is* allowed */
+   p = mmap(NULL,
+mfd_def_size,
+PROT_READ,
+MAP_SHARED,
+fd,
+0);
+   if (p == MAP_FAILED) {
+   printf("mmap() failed: %m\n");
+   abort();
+   }
+   munmap(p, mfd_def_size);
+}
+
 static void mfd_assert_write(int fd)
 {
ssize_t l;
@@ -692,6 +727,44 @@ static void test_seal_write(void)
close(fd);
 }
 
+/*
+ * Test SEAL_FUTURE_WRITE
+ * Test whether SEAL_FUTURE_WRITE actually prevents modifications.
+ */
+static void test_seal_future_write(void)
+{
+   int fd, fd2;
+   void *p;
+
+   printf("%s SEAL-FUTURE-WRITE\n", memfd_str);
+
+   fd = mfd_assert_new("kern_memfd_seal_future_write",
+   mfd_def_size,
+   MFD_CLOEXEC | MFD_ALLOW_SEALING);
+
+   p = mfd_assert_mmap_shared(fd);
+
+   mfd_assert_has_seals(fd, 0);
+
+   mfd_assert_add_seals(fd, F_SEAL_FUTURE_WRITE);
+   mfd_assert_has_seals(fd, F_SEAL_FUTURE_WRITE);
+
+   /* read should pass, writes should fail */
+   mfd_assert_read(fd);
+   mfd_assert_read_shared(fd);
+   mfd_fail_write(fd);
+
+   fd2 = mfd_assert_reopen_fd(fd);
+   /* read should pass, writes should still fail */
+   mfd_assert_read(fd2);
+   mfd_assert_read_shared(fd2);
+   mfd_fail_write(fd2);
+
+   munmap(p, mfd_def_size);
+   close(fd2);
+   close(fd);
+}
+
 /*
  * Test SEAL_SHRINK
  * Test whether SEAL_SHRINK actually prevents shrinking
@@ -945,6 +1018,7 @@ int main(int argc, char **argv)
test_basic();
 
test_seal_write();
+   test_seal_future_write();
test_seal_shrink();
test_seal_grow();
test_seal_resize();
-- 
2.19.1.1215.g8438c0b245-goog



Re: [PATCH -next 1/2] mm/memfd: make F_SEAL_FUTURE_WRITE seal more robust

2018-11-22 Thread Joel Fernandes
On Wed, Nov 21, 2018 at 07:25:26PM -0800, Andy Lutomirski wrote:
> On Wed, Nov 21, 2018 at 6:27 PM Andrew Morton  
> wrote:
> >
> > On Tue, 20 Nov 2018 13:13:35 -0800 Joel Fernandes  
> > wrote:
> >
> > > > > I am Ok with whatever Andrew wants to do, if it is better to squash 
> > > > > it with
> > > > > the original, then I can do that and send another patch.
> > > > >
> > > > >
> > > >
> > > > From experience, Andrew will fold in fixups on request :)
> > >
> > > Andrew, could you squash this patch into the one titled ("mm: Add an
> > > F_SEAL_FUTURE_WRITE seal to memfd")?
> >
> > Sure.
> >
> > I could of course queue them separately but I rarely do so - I don't
> > think that the intermediate development states are useful in the
> > infinite-term, and I make them available via additional Link: tags in
> > the changelog footers anyway.
> >
> > I think that the magnitude of these patches is such that John Stultz's
> > Reviewed-by is invalidated, so this series is now in the "unreviewed"
> > state.
> >
> > So can we have a re-review please?  For convenience, here's the
> > folded-together [1/1] patch, as it will go to Linus.

Sure, I removed the old tags and also provide an updated patch below inline.

> > From: "Joel Fernandes (Google)" 
> > Subject: mm: Add an F_SEAL_FUTURE_WRITE seal to memfd
> >
> > Android uses ashmem for sharing memory regions.  We are looking forward to
> > migrating all usecases of ashmem to memfd so that we can possibly remove
> > the ashmem driver in the future from staging while also benefiting from
> > using memfd and contributing to it.  Note staging drivers are also not ABI
> > and generally can be removed at anytime.
[...]
> > --- a/include/uapi/linux/fcntl.h~mm-add-an-f_seal_future_write-seal-to-memfd
> > +++ a/include/uapi/linux/fcntl.h
> > @@ -41,6 +41,7 @@
> >  #define F_SEAL_SHRINK  0x0002  /* prevent file from shrinking */
> >  #define F_SEAL_GROW0x0004  /* prevent file from growing */
> >  #define F_SEAL_WRITE   0x0008  /* prevent writes */
> > +#define F_SEAL_FUTURE_WRITE 0x0010  /* prevent future writes while mapped */
> >  /* (1U << 31) is reserved for signed error codes */
> >
> >  /*
> > --- a/mm/memfd.c~mm-add-an-f_seal_future_write-seal-to-memfd
> > +++ a/mm/memfd.c
> > @@ -131,7 +131,8 @@ static unsigned int *memfd_file_seals_ptr(struct file *file)
> >  #define F_ALL_SEALS (F_SEAL_SEAL | \
> >  F_SEAL_SHRINK | \
> >  F_SEAL_GROW | \
> > -F_SEAL_WRITE)
> > +F_SEAL_WRITE | \
> > +F_SEAL_FUTURE_WRITE)
> >
> >  static int memfd_add_seals(struct file *file, unsigned int seals)
> >  {
> > --- a/fs/hugetlbfs/inode.c~mm-add-an-f_seal_future_write-seal-to-memfd
> > +++ a/fs/hugetlbfs/inode.c
> > @@ -530,7 +530,7 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
> > inode_lock(inode);
> >
> > /* protected by i_mutex */
> > -   if (info->seals & F_SEAL_WRITE) {
> > +   if (info->seals & (F_SEAL_WRITE | F_SEAL_FUTURE_WRITE)) {
> > inode_unlock(inode);
> > return -EPERM;
> > }
> > --- a/mm/shmem.c~mm-add-an-f_seal_future_write-seal-to-memfd
> > +++ a/mm/shmem.c
> > @@ -2119,6 +2119,23 @@ out_nomem:
> >
> >  static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
> >  {
> > +   struct shmem_inode_info *info = SHMEM_I(file_inode(file));
> > +
> > +   /*
> > +* New PROT_READ and MAP_SHARED mmaps are not allowed when "future
> 
> PROT_WRITE, perhaps?

Yes, fixed.

> > +* write" seal active.
> > +*/
> > +   if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_WRITE) &&
> > +   (info->seals & F_SEAL_FUTURE_WRITE))
> > +   return -EPERM;
> > +
> > +   /*
> > +* Since the F_SEAL_FUTURE_WRITE seals allow for a MAP_SHARED read-only
> > +* mapping, take care to not allow mprotect to revert protections.
> > +*/
> > +   if (info->seals & F_SEAL_FUTURE_WRITE)
> > +   vma->vm_flags &= ~(VM_MAYWRITE);
> > +
> 
> This might all be clearer as:
> 
> if (info->seals & F_SEAL_FUTURE_WRITE) {
>   if (vma->
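
Andy's suggestion is truncated above; judging from the two checks being
discussed, the nested form would presumably read something like this (a
reconstruction, not his actual text):

	if (info->seals & F_SEAL_FUTURE_WRITE) {
		/*
		 * New PROT_WRITE and MAP_SHARED mmaps are not allowed when
		 * the "future write" seal is active.
		 */
		if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_WRITE))
			return -EPERM;

		/*
		 * Since the seal allows a MAP_SHARED read-only mapping,
		 * take care to not let mprotect revert protections.
		 */
		vma->vm_flags &= ~(VM_MAYWRITE);
	}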

Re: dyntick-idle CPU and node's qsmask

2018-11-20 Thread Joel Fernandes
On Tue, Nov 20, 2018 at 06:41:07PM -0800, Paul E. McKenney wrote:
[...] 
> > > > I was thinking if we could simplify rcu_note_context_switch (the parts
> > > > that call rcu_momentary_dyntick_idle), if we did the following in
> > > > rcu_implicit_dynticks_qs.
> > > > 
> > > > Since we already call rcu_qs in rcu_note_context_switch, that would
> > > > clear the rdp->cpu_no_qs flag. Then there should be no need to call
> > > > rcu_momentary_dyntick_idle from rcu_note_context_switch.
> > > 
> > > But does this also work for the rcu_all_qs() code path?
> > 
> > Could we not do something like this in rcu_all_qs? as some over-simplified
> > pseudo code:
> > 
> > rcu_all_qs() {
> >   if (!urgent_qs || !heavy_qs)
> >  return;
> > 
> >   rcu_qs();   // This clears the rdp->cpu_no_qs flags which we can monitor
> >   // in the diff in my last email (from rcu_implicit_dynticks_qs)
> >   //  the diff in my last email (from rcu_implicit_dynticks_qs)
> > }
> 
> Except that rcu_qs() doesn't necessarily report the quiescent state to
> the RCU core.  Keeping down context-switch overhead and all that.

Sure yeah, but I think the QS will be reported indirectly anyway by the
force_qs_rnp() path if we detect that rcu_qs() happened on the CPU?

> > > > I think this would simplify cond_resched as well.  Could this avoid the
> > > > need for having an rcu_all_qs at all? Hopefully I didn't miss some
> > > > Tasks-RCU corner cases..
> > > 
> > > There is also the code path from cond_resched() in PREEMPT=n kernels.
> > > This needs rcu_all_qs().  Though it is quite possible that some additional
> > > code collapsing is possible.
> > > 
> > > > Basically for some background, I was thinking can we simplify the code
> > > > that calls "rcu_momentary_dyntick_idle" since we already register a qs
> > > > in other ways (like by resetting cpu_no_qs).
> > > 
> > > One complication is that rcu_all_qs() is invoked with interrupts
> > > and preemption enabled, while rcu_note_context_switch() is
> > > invoked with interrupts disabled.  Also, as you say, Tasks RCU.
> > > Plus rcu_all_qs() wants to exit immediately if there is nothing to
> > > do, while rcu_note_context_switch() must unconditionally do rcu_qs()
> > > -- yes, it could check, but that would be redundant with the checks
> > 
> > This immediate exit is taken care of in the above pseudo code; would that
> > help the cond_resched performance?
> 
> It looks like you are cautiously edging towards the two wrapper functions
> calling common code, relying on inlining and simplification.  Why not just
> try doing it?  ;-)

Sure yeah. I was more thinking of the ambitious goal of getting rid of the
complexity and exploring the general design idea, rather than
containing/managing the complexity by reducing code duplication. :D

> > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > index c818e0c91a81..5aa0259c014d 100644
> > > > --- a/kernel/rcu/tree.c
> > > > +++ b/kernel/rcu/tree.c
> > > > @@ -1063,7 +1063,7 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
> > > >  * read-side critical section that started before the beginning
> > > >  * of the current RCU grace period.
> > > >  */
> > > > -   if (rcu_dynticks_in_eqs_since(rdp, rdp->dynticks_snap)) {
> > > > +   if (rcu_dynticks_in_eqs_since(rdp, rdp->dynticks_snap) || !rdp->cpu_no_qs.b.norm) {
> > > 
> > > If I am not too confused, this change could cause trouble for
> > > nohz_full CPUs looping in the kernel.  Such CPUs don't necessarily take
> > > scheduler-clock interrupts, last I checked, and this could prevent the
> > > CPU from reporting its quiescent state to core RCU.
> > 
> > Would that still be a problem if rcu_all_qs called rcu_qs? Also the above
> > diff is an OR condition so it is more relaxed than before.
> 
> Yes, because rcu_qs() is only guaranteed to capture the quiescent
> state on the current CPU, not necessarily report it to the RCU core.

The reporting to the core is necessary to call rcu_report_qs_rnp so that the
QS information is propagated up the tree, right?

Wouldn't that reporting be done anyway by:

force_qs_rnp
  -> rcu_implicit_dynticks_qs  (which returns 1 because rdp->cpu_no_qs.b.norm
 was cleared by rcu_qs() and we detect that
 with the help of the above diff)

  -> rcu_report_qs_rnp is called with the mask bit set for the corresponding
 CPU that has !rdp->cpu_no_qs.b.norm


I think that's what I am missing - why wouldn't the above scheme work?
The only difference is that reporting to the RCU core might invoke pending
callbacks, but I'm not sure if that matters for this. I'll try these changes,
trace it out and study it more.  Thanks for the patience,

 - Joel
 


Re: dyntick-idle CPU and node's qsmask

2018-11-20 Thread Joel Fernandes
On Tue, Nov 20, 2018 at 02:28:14PM -0800, Paul E. McKenney wrote:
> On Tue, Nov 20, 2018 at 12:42:43PM -0800, Joel Fernandes wrote:
> > On Sun, Nov 11, 2018 at 10:36:18AM -0800, Paul E. McKenney wrote:
> > > On Sun, Nov 11, 2018 at 10:09:16AM -0800, Joel Fernandes wrote:
> > > > On Sat, Nov 10, 2018 at 08:22:10PM -0800, Paul E. McKenney wrote:
> > > > > On Sat, Nov 10, 2018 at 07:09:25PM -0800, Joel Fernandes wrote:
> > > > > > On Sat, Nov 10, 2018 at 03:04:36PM -0800, Paul E. McKenney wrote:
> > > > > > > On Sat, Nov 10, 2018 at 01:46:59PM -0800, Joel Fernandes wrote:
> > > > > > > > Hi Paul and everyone,
> > > > > > > > 
> > > > > > > > I was tracing/studying the RCU code today in paul/dev branch 
> > > > > > > > and noticed that
> > > > > > > > for dyntick-idle CPUs, the RCU GP thread is clearing the 
> > > > > > > > rnp->qsmask
> > > > > > > > corresponding to the leaf node for the idle CPU, and reporting 
> > > > > > > > a QS on their
> > > > > > > > behalf.
> > > > > > > > 
> > > > > > > > rcu_sched-10 [003] 40.008039: rcu_fqs: rcu_sched 792 0 dti
> > > > > > > > rcu_sched-10 [003] 40.008039: rcu_fqs: rcu_sched 801 2 dti
> > > > > > > > rcu_sched-10 [003] 40.008041: rcu_quiescent_state_report: rcu_sched 805 5>0 0 0 3 0
> > > > > > > > 
> > > > > > > > That's all good but I was wondering if we can do better for the
> > > > > > > > idle CPUs if we can somehow not set the qsmask of the node in
> > > > > > > > the first place. Then no reporting of quiescent states would be
> > > > > > > > needed for idle CPUs, right?
> > > > > > > > And we would also not need to acquire the rnp lock I think.
> > > > > > > > 
> > > > > > > > At least for a single node tree RCU system, it seems that would 
> > > > > > > > avoid needing
> > > > > > > > to acquire the lock without complications. Anyway let me know 
> > > > > > > > your thoughts
> > > > > > > > and happy to discuss this at the hallways of the LPC as well 
> > > > > > > > for folks
> > > > > > > > attending :)
> > > > > > > 
> > > > > > > We could, but that would require consulting the rcu_data 
> > > > > > > structure for
> > > > > > > each CPU while initializing the grace period, thus increasing the 
> > > > > > > number
> > > > > > > of cache misses during grace-period initialization and also 
> > > > > > > shortly after
> > > > > > > for any non-idle CPUs.  This seems backwards on busy systems 
> > > > > > > where each
> > > > > > 
> > > > > > When I traced, it appears to me that rcu_data structure of a remote 
> > > > > > CPU was
> > > > > > being consulted anyway by the rcu_sched thread. So it seems like 
> > > > > > such cache
> > > > > > miss would happen anyway whether it is during grace-period 
> > > > > > initialization or
> > > > > > during the fqs stage? I guess I'm trying to say, the consultation 
> > > > > > of remote
> > > > > > CPU's rcu_data happens anyway.
> > > > > 
> > > > > Hmmm...
> > > > > 
> > > > > The rcu_gp_init() function does access an rcu_data structure, but it 
> > > > > is
> > > > > that of the current CPU, so shouldn't involve a communications cache 
> > > > > miss,
> > > > > at least not in the common case.
> > > > > 
> > > > > Or are you seeing these cross-CPU rcu_data accesses in rcu_gp_fqs() or
> > > > > functions that it calls?  In that case, please see below.
> > > > 
> > > > Yes, it was rcu_implicit_dynticks_qs called from rcu_gp_fqs.
> > > > 
> > > > > > > CP

Re: [PATCH 2/8] pstore: Do not use crash buffer for decompression

2018-11-20 Thread Joel Fernandes
On Wed, Nov 14, 2018 at 01:56:09AM -0600, Kees Cook wrote:
> On Fri, Nov 2, 2018 at 1:24 PM, Joel Fernandes  wrote:
> > On Thu, Nov 01, 2018 at 04:51:54PM -0700, Kees Cook wrote:
> >>  static void decompress_record(struct pstore_record *record)
> >>  {
> >> + int ret;
> >>   int unzipped_len;
> >
> > nit: We could get rid of the unzipped_len variable now I think.
> 
> I didn't follow this -- it gets used quite a bit. I don't see a clean
> way to remove it?

You are right. Sorry I missed that crypto_comp_decompress actually uses it.

thanks,

 - Joel


Re: [PATCH -next 1/2] mm/memfd: make F_SEAL_FUTURE_WRITE seal more robust

2018-11-20 Thread Joel Fernandes
On Tue, Nov 20, 2018 at 02:02:49PM -0700, Andy Lutomirski wrote:
> 
> > On Nov 20, 2018, at 1:47 PM, Joel Fernandes  wrote:
> > 
> >> On Tue, Nov 20, 2018 at 01:33:18PM -0700, Andy Lutomirski wrote:
> >> 
> >>> On Nov 20, 2018, at 1:07 PM, Stephen Rothwell  
> >>> wrote:
> >>> 
> >>> Hi Joel,
> >>> 
> >>>>> On Tue, 20 Nov 2018 10:39:26 -0800 Joel Fernandes 
> >>>>>  wrote:
> >>>>> 
> >>>>> On Tue, Nov 20, 2018 at 07:13:17AM -0800, Andy Lutomirski wrote:
> >>>>> On Mon, Nov 19, 2018 at 9:21 PM Joel Fernandes (Google)
> >>>>>  wrote:  
> >>>>>> 
> >>>>>> A better way to do F_SEAL_FUTURE_WRITE seal was discussed [1] last week
> >>>>>> where we don't need to modify core VFS structures to get the same
> >>>>>> behavior of the seal. This solves several side-effects pointed out by
> >>>>>> Andy [2].
> >>>>>> 
> >>>>>> [1] https://lore.kernel.org/lkml/2018173650.ga256...@google.com/
> >>>>>> [2] https://lore.kernel.org/lkml/69ce06cc-e47c-4992-848a-66eb23ee6...@amacapital.net/
> >>>>>> 
> >>>>>> Suggested-by: Andy Lutomirski 
> >>>>>> Fixes: 5e653c2923fd ("mm: Add an F_SEAL_FUTURE_WRITE seal to memfd")  
> >>>>> 
> >>>>> What tree is that commit in?  Can we not just fold this in?  
> >>>> 
> >>>> It is in linux-next. Could we keep both commits so we have the history?
> >>> 
> >>> Well, its in Andrew's mmotm, so its up to him.
> >>> 
> >>> 
> >> 
> >> Unless mmotm is more magical than I think, the commit hash in your Fixes
> >> tag is already nonsense. mmotm gets rebased all the time, and is only
> >> barely a git tree.
> > 
> > I wouldn't go so far as to call it nonsense. It was a working patch, it
> > just did things differently. Your help with improving the patch is much
> > appreciated.
> 
> I’m not saying the patch is nonsense — I’m saying the *hash* may be
> nonsense. akpm uses a bunch of .patch files and all kinds of crazy scripts,
> and the mmotm.git tree is not stable at all.
> 

Oh, ok. Sorry for misunderstanding and thanks for clarification. :-)

> > I am Ok with whatever Andrew wants to do, if it is better to squash it with
> > the original, then I can do that and send another patch.
> > 
> > 
> 
> From experience, Andrew will fold in fixups on request :)

Andrew, could you squash this patch into the one titled ("mm: Add an
F_SEAL_FUTURE_WRITE seal to memfd")? That one was already picked up by -next
but I imagine you might have a crazy script as Andy pointed out for exactly
these situations. ;-)

thanks,

 - Joel



Re: [PATCH -next 1/2] mm/memfd: make F_SEAL_FUTURE_WRITE seal more robust

2018-11-20 Thread Joel Fernandes
On Tue, Nov 20, 2018 at 01:33:18PM -0700, Andy Lutomirski wrote:
> 
> > On Nov 20, 2018, at 1:07 PM, Stephen Rothwell  wrote:
> > 
> > Hi Joel,
> > 
> >> On Tue, 20 Nov 2018 10:39:26 -0800 Joel Fernandes  
> >> wrote:
> >> 
> >>> On Tue, Nov 20, 2018 at 07:13:17AM -0800, Andy Lutomirski wrote:
> >>> On Mon, Nov 19, 2018 at 9:21 PM Joel Fernandes (Google)
> >>>  wrote:  
> >>>> 
> >>>> A better way to do F_SEAL_FUTURE_WRITE seal was discussed [1] last week
> >>>> where we don't need to modify core VFS structures to get the same
> >>>> behavior of the seal. This solves several side-effects pointed out by
> >>>> Andy [2].
> >>>> 
> >>>> [1] https://lore.kernel.org/lkml/2018173650.ga256...@google.com/
> >>>> [2] https://lore.kernel.org/lkml/69ce06cc-e47c-4992-848a-66eb23ee6...@amacapital.net/
> >>>> 
> >>>> Suggested-by: Andy Lutomirski 
> >>>> Fixes: 5e653c2923fd ("mm: Add an F_SEAL_FUTURE_WRITE seal to memfd")  
> >>> 
> >>> What tree is that commit in?  Can we not just fold this in?  
> >> 
> >> It is in linux-next. Could we keep both commits so we have the history?
> > 
> > Well, its in Andrew's mmotm, so its up to him.
> > 
> > 
> 
> Unless mmotm is more magical than I think, the commit hash in your Fixes
> tag is already nonsense. mmotm gets rebased all the time, and is only
> barely a git tree.

I wouldn't go so far as to call it nonsense. It was a working patch, it just
did things differently. Your help with improving the patch is much appreciated.

I am Ok with whatever Andrew wants to do, if it is better to squash it with
the original, then I can do that and send another patch.

- Joel


Re: dyntick-idle CPU and node's qsmask

2018-11-20 Thread Joel Fernandes
On Sun, Nov 11, 2018 at 10:36:18AM -0800, Paul E. McKenney wrote:
> On Sun, Nov 11, 2018 at 10:09:16AM -0800, Joel Fernandes wrote:
> > On Sat, Nov 10, 2018 at 08:22:10PM -0800, Paul E. McKenney wrote:
> > > On Sat, Nov 10, 2018 at 07:09:25PM -0800, Joel Fernandes wrote:
> > > > On Sat, Nov 10, 2018 at 03:04:36PM -0800, Paul E. McKenney wrote:
> > > > > On Sat, Nov 10, 2018 at 01:46:59PM -0800, Joel Fernandes wrote:
> > > > > > Hi Paul and everyone,
> > > > > > 
> > > > > > I was tracing/studying the RCU code today in paul/dev branch and 
> > > > > > noticed that
> > > > > > for dyntick-idle CPUs, the RCU GP thread is clearing the rnp->qsmask
> > > > > > corresponding to the leaf node for the idle CPU, and reporting a QS 
> > > > > > on their
> > > > > > behalf.
> > > > > > 
> > > > > > rcu_sched-10 [003] 40.008039: rcu_fqs: rcu_sched 792 0 dti
> > > > > > rcu_sched-10 [003] 40.008039: rcu_fqs: rcu_sched 801 2 dti
> > > > > > rcu_sched-10 [003] 40.008041: rcu_quiescent_state_report: rcu_sched 805 5>0 0 0 3 0
> > > > > > 
> > > > > > That's all good but I was wondering if we can do better for the
> > > > > > idle CPUs if we can somehow not set the qsmask of the node in the
> > > > > > first place. Then no reporting of quiescent states would be needed
> > > > > > for idle CPUs, right?
> > > > > > And we would also not need to acquire the rnp lock I think.
> > > > > > 
> > > > > > At least for a single node tree RCU system, it seems that would 
> > > > > > avoid needing
> > > > > > to acquire the lock without complications. Anyway let me know your 
> > > > > > thoughts
> > > > > > and happy to discuss this at the hallways of the LPC as well for 
> > > > > > folks
> > > > > > attending :)
> > > > > 
> > > > > We could, but that would require consulting the rcu_data structure for
> > > > > each CPU while initializing the grace period, thus increasing the 
> > > > > number
> > > > > of cache misses during grace-period initialization and also shortly 
> > > > > after
> > > > > for any non-idle CPUs.  This seems backwards on busy systems where 
> > > > > each
> > > > 
> > > > When I traced, it appears to me that rcu_data structure of a remote CPU 
> > > > was
> > > > being consulted anyway by the rcu_sched thread. So it seems like such 
> > > > cache
> > > > miss would happen anyway whether it is during grace-period 
> > > > initialization or
> > > > during the fqs stage? I guess I'm trying to say, the consultation of 
> > > > remote
> > > > CPU's rcu_data happens anyway.
> > > 
> > > Hmmm...
> > > 
> > > The rcu_gp_init() function does access an rcu_data structure, but it is
> > > that of the current CPU, so shouldn't involve a communications cache miss,
> > > at least not in the common case.
> > > 
> > > Or are you seeing these cross-CPU rcu_data accesses in rcu_gp_fqs() or
> > > functions that it calls?  In that case, please see below.
> > 
> > Yes, it was rcu_implicit_dynticks_qs called from rcu_gp_fqs.
> > 
> > > > > CPU will with high probability report its own quiescent state before 
> > > > > three
> > > > > jiffies pass, in which case the cache misses on the rcu_data 
> > > > > structures
> > > > > would be wasted motion.
> > > > 
> > > > If all the CPUs are busy and reporting their QS themselves, then I 
> > > > think the
> > > > qsmask is likely 0 so then rcu_implicit_dynticks_qs (called from
> > > > force_qs_rnp) wouldn't be called and so there would no cache misses on
> > > > rcu_data right?
> > > 
> > > Yes, but assuming that all CPUs report their quiescent states before
> > > the first call to rcu_gp_fqs().  One exception is when some CPU is
> > > looping in the kernel for many milliseconds without passing through a
> > > quiescent state.  This is because for 

Re: [PATCH -next 1/2] mm/memfd: make F_SEAL_FUTURE_WRITE seal more robust

2018-11-20 Thread Joel Fernandes
On Tue, Nov 20, 2018 at 07:13:17AM -0800, Andy Lutomirski wrote:
> On Mon, Nov 19, 2018 at 9:21 PM Joel Fernandes (Google)
>  wrote:
> >
> > A better way to do F_SEAL_FUTURE_WRITE seal was discussed [1] last week
> > where we don't need to modify core VFS structures to get the same
> > behavior of the seal. This solves several side-effects pointed out by
> > Andy [2].
> >
> > [1] https://lore.kernel.org/lkml/2018173650.ga256...@google.com/
> > [2] https://lore.kernel.org/lkml/69ce06cc-e47c-4992-848a-66eb23ee6...@amacapital.net/
> >
> > Suggested-by: Andy Lutomirski 
> > Fixes: 5e653c2923fd ("mm: Add an F_SEAL_FUTURE_WRITE seal to memfd")
> 
> What tree is that commit in?  Can we not just fold this in?

It is in linux-next. Could we keep both commits so we have the history?

thanks,

 - Joel


[PATCH -manpage 2/2] memfd_create.2: Update manpage with new memfd F_SEAL_FUTURE_WRITE seal

2018-11-19 Thread Joel Fernandes (Google)
More details of the seal can be found in the LKML patch:
https://lore.kernel.org/lkml/20181120052137.74317-1-j...@joelfernandes.org/T/#t

Signed-off-by: Joel Fernandes (Google) 
---
 man2/memfd_create.2 | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/man2/memfd_create.2 b/man2/memfd_create.2
index 3cd392d1b4d9..fce2bf8d0fff 100644
--- a/man2/memfd_create.2
+++ b/man2/memfd_create.2
@@ -280,7 +280,15 @@ in order to restrict further modifications on the file.
 (If placing the seal
 .BR F_SEAL_WRITE ,
 then it will be necessary to first unmap the shared writable mapping
-created in the previous step.)
+created in the previous step. Otherwise, behavior similar to
+.BR F_SEAL_WRITE
+can be achieved, by using
+.BR F_SEAL_FUTURE_WRITE
+which will prevent future writes via
+.BR mmap (2)
+and
+.BR write (2)
+from succeeding, while keeping existing shared writable mappings).
 .IP 4.
 A second process obtains a file descriptor for the
 .BR tmpfs (5)
@@ -425,6 +433,7 @@ main(int argc, char *argv[])
 fprintf(stderr, "\\t\\tg \- F_SEAL_GROW\\n");
 fprintf(stderr, "\\t\\ts \- F_SEAL_SHRINK\\n");
 fprintf(stderr, "\\t\\tw \- F_SEAL_WRITE\\n");
+fprintf(stderr, "\\t\\tW \- F_SEAL_FUTURE_WRITE\\n");
 fprintf(stderr, "\\t\\tS \- F_SEAL_SEAL\\n");
 exit(EXIT_FAILURE);
 }
@@ -463,6 +472,8 @@ main(int argc, char *argv[])
 seals |= F_SEAL_SHRINK;
 if (strchr(seals_arg, \(aqw\(aq) != NULL)
 seals |= F_SEAL_WRITE;
+if (strchr(seals_arg, \(aqW\(aq) != NULL)
+seals |= F_SEAL_FUTURE_WRITE;
 if (strchr(seals_arg, \(aqS\(aq) != NULL)
 seals |= F_SEAL_SEAL;
 
@@ -518,6 +529,8 @@ main(int argc, char *argv[])
 printf(" GROW");
 if (seals & F_SEAL_WRITE)
 printf(" WRITE");
+if (seals & F_SEAL_FUTURE_WRITE)
+printf(" FUTURE_WRITE");
 if (seals & F_SEAL_SHRINK)
 printf(" SHRINK");
 printf("\\n");
-- 
2.19.1.1215.g8438c0b245-goog



[PATCH -manpage 1/2] fcntl.2: Update manpage with new memfd F_SEAL_FUTURE_WRITE seal

2018-11-19 Thread Joel Fernandes (Google)
More details of the seal can be found in the LKML patch:
https://lore.kernel.org/lkml/20181120052137.74317-1-j...@joelfernandes.org/T/#t

Signed-off-by: Joel Fernandes (Google) 
---
 man2/fcntl.2 | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/man2/fcntl.2 b/man2/fcntl.2
index 03533d65b49d..54772f94964c 100644
--- a/man2/fcntl.2
+++ b/man2/fcntl.2
@@ -1525,6 +1525,21 @@ Furthermore, if there are any asynchronous I/O operations
 .RB ( io_submit (2))
 pending on the file,
 all outstanding writes will be discarded.
+.TP
+.BR F_SEAL_FUTURE_WRITE
+If this seal is set, the contents of the file can be modified only from
+existing writeable mappings that were created prior to the seal being set.
+Any attempt to create a new writeable mapping on the memfd via
+.BR mmap (2)
+will fail with
+.BR EPERM.
+Also any attempts to write to the memfd via
+.BR write (2)
+will fail with
+.BR EPERM.
+This is useful in situations where existing writable mapped regions need to be
+kept intact while preventing any future writes. For example, to share a
+read-only memory buffer with other processes that only the sender can write to.
 .\"
 .SS File read/write hints
 Write lifetime hints can be used to inform the kernel about the relative
-- 
2.19.1.1215.g8438c0b245-goog
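
For illustration, typical userspace use of the new seal, per the description
above, might look like this (a sketch; error handling omitted, and it assumes
a libc that exposes memfd_create(2)):

	#include <fcntl.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int fd = memfd_create("buf", MFD_CLOEXEC | MFD_ALLOW_SEALING);
	ftruncate(fd, 4096);

	/* a writable mapping created before sealing stays usable */
	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	fcntl(fd, F_ADD_SEALS, F_SEAL_FUTURE_WRITE);

	p[0] = 'x';		/* still works: pre-existing mapping */
	write(fd, "x", 1);	/* fails with EPERM */
	/* new writable MAP_SHARED mappings now fail with EPERM too */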



[PATCH -next 1/2] mm/memfd: make F_SEAL_FUTURE_WRITE seal more robust

2018-11-19 Thread Joel Fernandes (Google)
A better way to do F_SEAL_FUTURE_WRITE seal was discussed [1] last week
where we don't need to modify core VFS structures to get the same
behavior of the seal. This solves several side-effects pointed out by
Andy [2].

[1] https://lore.kernel.org/lkml/2018173650.ga256...@google.com/
[2] https://lore.kernel.org/lkml/69ce06cc-e47c-4992-848a-66eb23ee6...@amacapital.net/

Suggested-by: Andy Lutomirski 
Fixes: 5e653c2923fd ("mm: Add an F_SEAL_FUTURE_WRITE seal to memfd")
Signed-off-by: Joel Fernandes (Google) 
---
 fs/hugetlbfs/inode.c |  2 +-
 mm/memfd.c   | 19 ---
 mm/shmem.c   | 24 +---
 3 files changed, 22 insertions(+), 23 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 762028994f47..5b54bf893a67 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -558,7 +558,7 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
inode_lock(inode);
 
/* protected by i_mutex */
-   if (info->seals & F_SEAL_WRITE) {
+   if (info->seals & (F_SEAL_WRITE | F_SEAL_FUTURE_WRITE)) {
inode_unlock(inode);
return -EPERM;
}
diff --git a/mm/memfd.c b/mm/memfd.c
index 63fff5e77114..650e65a46b9c 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -201,25 +201,6 @@ static int memfd_add_seals(struct file *file, unsigned int seals)
}
}
 
-   if ((seals & F_SEAL_FUTURE_WRITE) &&
-   !(*file_seals & F_SEAL_FUTURE_WRITE)) {
-   /*
-* The FUTURE_WRITE seal also prevents growing and shrinking
-* so we need them to be already set, or requested now.
-*/
-   int test_seals = (seals | *file_seals) &
-(F_SEAL_GROW | F_SEAL_SHRINK);
-
-   if (test_seals != (F_SEAL_GROW | F_SEAL_SHRINK)) {
-   error = -EINVAL;
-   goto unlock;
-   }
-
-   spin_lock(&file->f_lock);
-   file->f_mode &= ~(FMODE_WRITE | FMODE_PWRITE);
-   spin_unlock(&file->f_lock);
-   }
-
*file_seals |= seals;
error = 0;
 
diff --git a/mm/shmem.c b/mm/shmem.c
index 32eb29bd72c6..cee9878c87f1 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2121,6 +2121,23 @@ int shmem_lock(struct file *file, int lock, struct user_struct *user)
 
 static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
 {
+   struct shmem_inode_info *info = SHMEM_I(file_inode(file));
+
+   /*
+* New PROT_READ and MAP_SHARED mmaps are not allowed when "future
+* write" seal active.
+*/
+   if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_WRITE) &&
+   (info->seals & F_SEAL_FUTURE_WRITE))
+   return -EPERM;
+
+   /*
+* Since the F_SEAL_FUTURE_WRITE seals allow for a MAP_SHARED read-only
+* mapping, take care to not allow mprotect to revert protections.
+*/
+   if (info->seals & F_SEAL_FUTURE_WRITE)
+   vma->vm_flags &= ~(VM_MAYWRITE);
+
file_accessed(file);
vma->vm_ops = &shmem_vm_ops;
if (IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE) &&
@@ -2346,8 +2363,9 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
pgoff_t index = pos >> PAGE_SHIFT;
 
/* i_mutex is held by caller */
-   if (unlikely(info->seals & (F_SEAL_WRITE | F_SEAL_GROW))) {
-   if (info->seals & F_SEAL_WRITE)
+   if (unlikely(info->seals & (F_SEAL_GROW |
+  F_SEAL_WRITE | F_SEAL_FUTURE_WRITE))) {
+   if (info->seals & (F_SEAL_WRITE | F_SEAL_FUTURE_WRITE))
return -EPERM;
if ((info->seals & F_SEAL_GROW) && pos + len > inode->i_size)
return -EPERM;
@@ -2610,7 +2628,7 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
DECLARE_WAIT_QUEUE_HEAD_ONSTACK(shmem_falloc_waitq);
 
/* protected by i_mutex */
-   if (info->seals & F_SEAL_WRITE) {
+   if (info->seals & (F_SEAL_WRITE | F_SEAL_FUTURE_WRITE)) {
error = -EPERM;
goto out;
}
-- 
2.19.1.1215.g8438c0b245-goog



[PATCH -next 2/2] selftests/memfd: modify tests for F_SEAL_FUTURE_WRITE seal

2018-11-19 Thread Joel Fernandes (Google)
Modify the tests for F_SEAL_FUTURE_WRITE based on the changes
introduced in previous patch.

Also add a test to make sure the reopen issue pointed by Jann Horn [1]
is fixed.

[1] https://lore.kernel.org/lkml/CAG48ez1h=v-JYnDw81HaYJzOfrNhwYksxmc2r=cjvdqvgym...@mail.gmail.com/

Cc: Jann Horn 
Signed-off-by: Joel Fernandes (Google) 
---
 tools/testing/selftests/memfd/memfd_test.c | 88 +++---
 1 file changed, 44 insertions(+), 44 deletions(-)

diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c
index 32b207ca7372..c67d32eeb668 100644
--- a/tools/testing/selftests/memfd/memfd_test.c
+++ b/tools/testing/selftests/memfd/memfd_test.c
@@ -54,6 +54,22 @@ static int mfd_assert_new(const char *name, loff_t sz, unsigned int flags)
return fd;
 }
 
+static int mfd_assert_reopen_fd(int fd_in)
+{
+   int r, fd;
+   char path[100];
+
+   sprintf(path, "/proc/self/fd/%d", fd_in);
+
+   fd = open(path, O_RDWR);
+   if (fd < 0) {
+   printf("re-open of existing fd %d failed\n", fd_in);
+   abort();
+   }
+
+   return fd;
+}
+
 static void mfd_fail_new(const char *name, unsigned int flags)
 {
int r;
@@ -255,6 +271,25 @@ static void mfd_assert_read(int fd)
munmap(p, mfd_def_size);
 }
 
+/* Test that PROT_READ + MAP_SHARED mappings work. */
+static void mfd_assert_read_shared(int fd)
+{
+   void *p;
+
+   /* verify PROT_READ and MAP_SHARED *is* allowed */
+   p = mmap(NULL,
+mfd_def_size,
+PROT_READ,
+MAP_SHARED,
+fd,
+0);
+   if (p == MAP_FAILED) {
+   printf("mmap() failed: %m\n");
+   abort();
+   }
+   munmap(p, mfd_def_size);
+}
+
 static void mfd_assert_write(int fd)
 {
ssize_t l;
@@ -698,7 +733,7 @@ static void test_seal_write(void)
  */
 static void test_seal_future_write(void)
 {
-   int fd;
+   int fd, fd2;
void *p;
 
printf("%s SEAL-FUTURE-WRITE\n", memfd_str);
@@ -710,58 +745,23 @@ static void test_seal_future_write(void)
p = mfd_assert_mmap_shared(fd);
 
mfd_assert_has_seals(fd, 0);
-   /* Not adding grow/shrink seals makes the future write
-* seal fail to get added
-*/
-   mfd_fail_add_seals(fd, F_SEAL_FUTURE_WRITE);
-
-   mfd_assert_add_seals(fd, F_SEAL_GROW);
-   mfd_assert_has_seals(fd, F_SEAL_GROW);
-
-   /* Should still fail since shrink seal has
-* not yet been added
-*/
-   mfd_fail_add_seals(fd, F_SEAL_FUTURE_WRITE);
-
-   mfd_assert_add_seals(fd, F_SEAL_SHRINK);
-   mfd_assert_has_seals(fd, F_SEAL_GROW |
-F_SEAL_SHRINK);
 
-   /* Now should succeed, also verifies that the seal
-* could be added with an existing writable mmap
-*/
mfd_assert_add_seals(fd, F_SEAL_FUTURE_WRITE);
-   mfd_assert_has_seals(fd, F_SEAL_SHRINK |
-F_SEAL_GROW |
-F_SEAL_FUTURE_WRITE);
+   mfd_assert_has_seals(fd, F_SEAL_FUTURE_WRITE);
 
/* read should pass, writes should fail */
mfd_assert_read(fd);
+   mfd_assert_read_shared(fd);
mfd_fail_write(fd);
 
-   munmap(p, mfd_def_size);
-   close(fd);
-
-   /* Test adding all seals (grow, shrink, future write) at once */
-   fd = mfd_assert_new("kern_memfd_seal_future_write2",
-   mfd_def_size,
-   MFD_CLOEXEC | MFD_ALLOW_SEALING);
-
-   p = mfd_assert_mmap_shared(fd);
-
-   mfd_assert_has_seals(fd, 0);
-   mfd_assert_add_seals(fd, F_SEAL_SHRINK |
-F_SEAL_GROW |
-F_SEAL_FUTURE_WRITE);
-   mfd_assert_has_seals(fd, F_SEAL_SHRINK |
-F_SEAL_GROW |
-F_SEAL_FUTURE_WRITE);
-
-   /* read should pass, writes should fail */
-   mfd_assert_read(fd);
-   mfd_fail_write(fd);
+   fd2 = mfd_assert_reopen_fd(fd);
+   /* read should pass, writes should still fail */
+   mfd_assert_read(fd2);
+   mfd_assert_read_shared(fd2);
+   mfd_fail_write(fd2);
 
munmap(p, mfd_def_size);
+   close(fd2);
close(fd);
 }
 
-- 
2.19.1.1215.g8438c0b245-goog



Re: [PATCH RFC v2 0/3] cleanups for pstore and ramoops

2018-11-14 Thread Joel Fernandes
On Wed, Nov 14, 2018 at 01:14:57AM -0600, Kees Cook wrote:
> On Sat, Nov 3, 2018 at 6:38 PM, Joel Fernandes (Google)
>  wrote:
> > Here are some simple cleanups and fixes for ramoops in pstore. Let me know
> > what you think, thanks.
> 
> I took these and slightly tweaked code locations for the first one.
> I'll send out the series for review when I'm back from Plumber's.

Cool, thanks!

 - Joel



Re: dyntick-idle CPU and node's qsmask

2018-11-11 Thread Joel Fernandes
On Sun, Nov 11, 2018 at 10:36:18AM -0800, Paul E. McKenney wrote:
[..]
> > > > > CPU will with high probability report its own quiescent state before 
> > > > > three
> > > > > jiffies pass, in which case the cache misses on the rcu_data 
> > > > > structures
> > > > > would be wasted motion.
> > > > 
> > > > If all the CPUs are busy and reporting their QS themselves, then I 
> > > > think the
> > > > qsmask is likely 0 so then rcu_implicit_dynticks_qs (called from
> > > > force_qs_rnp) wouldn't be called and so there would no cache misses on
> > > > rcu_data right?
> > > 
> > > Yes, but assuming that all CPUs report their quiescent states before
> > > the first call to rcu_gp_fqs().  One exception is when some CPU is
> > > looping in the kernel for many milliseconds without passing through a
> > > quiescent state.  This is because for recent kernels, cond_resched()
> > > is not a quiescent state until the grace period is something like 100
> > > milliseconds old.  (For older kernels, cond_resched() was never an RCU
> > > quiescent state unless it actually scheduled.)
> > > 
> > > Why wait 100 milliseconds?  Because otherwise the increase in
> > > cond_resched() overhead shows up all too well, causing 0day test robot
> > > to complain bitterly.  Besides, I would expect that in the common case,
> > > CPUs would be executing usermode code.
> > 
> > Makes sense. I was also wondering about this other thing you mentioned about
> > waiting for 3 jiffies before reporting the idle CPU's quiescent state. Does
> > that mean that even if a single CPU is dyntick-idle for a long period of
> > time, then the minimum grace period duration would be at least 3 jiffies? In
> > our mobile embedded devices, jiffies is set to 3.33ms (HZ=300) to keep power
> > consumption low. Not that I'm saying it's an issue or anything (since IIUC if
> > someone wants shorter grace periods, they should just use expedited GPs), but
> > it sounds like it would be a shorter GP if we just set the qsmask early on
> > somehow and we can manage the overhead of doing so.
> 
> First, there is some autotuning of the delay based on HZ:
> 
> #define RCU_JIFFIES_TILL_FORCE_QS (1 + (HZ > 250) + (HZ > 500))
> 
> So at HZ=300, you should be seeing a two-jiffy delay rather than the
> usual HZ=1000 three-jiffy delay.  Of course, this means that the delay
> is 6.67ms rather than the usual 3ms, but the theory is that lower HZ
> rates often mean slower instruction execution and thus a desire for
> lower RCU overhead.  There is further autotuning based on number of
> CPUs, but this does not kick in until you have 256 CPUs on your system,
> and I bet that smartphones aren't there yet.  Nevertheless, check out
> RCU_JIFFIES_FQS_DIV for more info on this.

Got it. I agree with that heuristic.
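
As a worked example of the autotuning above (my own standalone illustration;
in the kernel, HZ is a build-time constant rather than a parameter):

#include <stdio.h>

/* Mirrors RCU_JIFFIES_TILL_FORCE_QS, with HZ made a parameter. */
#define JIFFIES_TILL_FORCE_QS(hz) (1 + ((hz) > 250) + ((hz) > 500))

int main(void)
{
	/* HZ=300: 2 jiffies x 3.33ms = ~6.67ms of delay */
	printf("HZ=300  -> %d jiffies\n", JIFFIES_TILL_FORCE_QS(300));
	/* HZ=1000: 3 jiffies x 1ms = 3ms of delay */
	printf("HZ=1000 -> %d jiffies\n", JIFFIES_TILL_FORCE_QS(1000));
	return 0;
}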

> But you can always override this autotuning using the following kernel
> boot paramters:
> 
> rcutree.jiffies_till_first_fqs
> rcutree.jiffies_till_next_fqs
> 
> You can even set the first one to zero if you want the effect of pre-scanning
> for idle CPUs.  ;-)
> 
> The second must be set to one or greater.
> 
> Both are capped at one second (HZ).

Got it. Thanks a lot for the explanations.
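
For example (values are illustrative, not a recommendation), booting with:

    rcutree.jiffies_till_first_fqs=0 rcutree.jiffies_till_next_fqs=1

would give the pre-scan effect on the first scan and then re-scan every jiffy.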

> > > > Anyway it was just an idea that popped up when I was going through 
> > > > traces :)
> > > > Thanks for the discussion and happy to discuss further or try out 
> > > > anything.
> > > 
> > > Either way, I do appreciate your going through this.  People have found
> > > RCU bugs this way, one of which involved RCU uselessly calling a 
> > > particular
> > > function twice in quick succession.  ;-)
> >  
> > Thanks.  It is my pleasure and happy to help :) I'll keep digging into it.
> 
> Looking forward to further questions and patches.  ;-)

Will do! thanks,

 - Joel



Re: dyntick-idle CPU and node's qsmask

2018-11-11 Thread Joel Fernandes
On Sat, Nov 10, 2018 at 08:22:10PM -0800, Paul E. McKenney wrote:
> On Sat, Nov 10, 2018 at 07:09:25PM -0800, Joel Fernandes wrote:
> > On Sat, Nov 10, 2018 at 03:04:36PM -0800, Paul E. McKenney wrote:
> > > On Sat, Nov 10, 2018 at 01:46:59PM -0800, Joel Fernandes wrote:
> > > > Hi Paul and everyone,
> > > > 
> > > > I was tracing/studying the RCU code today in paul/dev branch and 
> > > > noticed that
> > > > for dyntick-idle CPUs, the RCU GP thread is clearing the rnp->qsmask
> > > > corresponding to the leaf node for the idle CPU, and reporting a QS on 
> > > > their
> > > > behalf.
> > > > 
> > > > rcu_sched-10 [003] 40.008039: rcu_fqs: rcu_sched 792 0 dti
> > > > rcu_sched-10 [003] 40.008039: rcu_fqs: rcu_sched 801 2 dti
> > > > rcu_sched-10 [003] 40.008041: rcu_quiescent_state_report: rcu_sched 805 5>0 0 0 3 0
> > > > 
> > > > That's all good but I was wondering if we can do better for the idle CPUs if
> > > > we can somehow not set the qsmask of the node in the first place. Then no
> > > > reporting of a quiescent state would be needed for idle CPUs, right?
> > > > And we would also not need to acquire the rnp lock I think.
> > > > 
> > > > At least for a single node tree RCU system, it seems that would avoid 
> > > > needing
> > > > to acquire the lock without complications. Anyway let me know your 
> > > > thoughts
> > > > and happy to discuss this at the hallways of the LPC as well for folks
> > > > attending :)
> > > 
> > > We could, but that would require consulting the rcu_data structure for
> > > each CPU while initializing the grace period, thus increasing the number
> > > of cache misses during grace-period initialization and also shortly after
> > > for any non-idle CPUs.  This seems backwards on busy systems where each
> > 
> > When I traced, it appears to me that rcu_data structure of a remote CPU was
> > being consulted anyway by the rcu_sched thread. So it seems like such cache
> > miss would happen anyway whether it is during grace-period initialization or
> > during the fqs stage? I guess I'm trying to say, the consultation of remote
> > CPU's rcu_data happens anyway.
> 
> Hmmm...
> 
> The rcu_gp_init() function does access an rcu_data structure, but it is
> that of the current CPU, so shouldn't involve a communications cache miss,
> at least not in the common case.
> 
> Or are you seeing these cross-CPU rcu_data accesses in rcu_gp_fqs() or
> functions that it calls?  In that case, please see below.

Yes, it was rcu_implicit_dynticks_qs called from rcu_gp_fqs.

> > > CPU will with high probability report its own quiescent state before three
> > > jiffies pass, in which case the cache misses on the rcu_data structures
> > > would be wasted motion.
> > 
> > If all the CPUs are busy and reporting their QS themselves, then I think the
> > qsmask is likely 0 so then rcu_implicit_dynticks_qs (called from
> > force_qs_rnp) wouldn't be called and so there would no cache misses on
> > rcu_data right?
> 
> Yes, but assuming that all CPUs report their quiescent states before
> the first call to rcu_gp_fqs().  One exception is when some CPU is
> looping in the kernel for many milliseconds without passing through a
> quiescent state.  This is because for recent kernels, cond_resched()
> is not a quiescent state until the grace period is something like 100
> milliseconds old.  (For older kernels, cond_resched() was never an RCU
> quiescent state unless it actually scheduled.)
> 
> Why wait 100 milliseconds?  Because otherwise the increase in
> cond_resched() overhead shows up all too well, causing 0day test robot
> to complain bitterly.  Besides, I would expect that in the common case,
> CPUs would be executing usermode code.

Makes sense. I was also wondering about this other thing you mentioned about
waiting for 3 jiffies before reporting the idle CPU's quiescent state. Does
that mean that even if a single CPU is dyntick-idle for a long period of
time, then the minimum grace period duration would be at least 3 jiffies? In
our mobile embedded devices, jiffies is set to 3.33ms (HZ=300) to keep power
consumption low. Not that I'm saying it's an issue or anything (since IIUC if
someone wants shorter grace periods, they should just use expedited GPs), but
it sounds like it would be a shorter GP if we just set the qsmask early on
somehow and we can manage the overhead of doing so.

Re: [PATCH v3 resend 1/2] mm: Add an F_SEAL_FUTURE_WRITE seal to memfd

2018-11-11 Thread Joel Fernandes
too once the seal 
> >>>> is
> >>>> set, not just the mmap. That means we have to add code in mm/shmem.c to 
> >>>> do
> >>>> that in all those handlers, to check for the seal (and hope we didn't 
> >>>> miss a
> >>>> file_operations handler). Is that what you are proposing?
> >>> 
> >>> The existing code already does this. That’s why I suggested grepping :)
> >>> 
> >>>> 
> >>>> Also, it means we have to keep CONFIG_TMPFS enabled so that the
> >>>> shmem_file_operations write handlers like write_iter are hooked up. 
> >>>> Currently
> >>>> memfd works even with !CONFIG_TMPFS.
> >>> 
> >>> If so, that sounds like it may already be a bug.
> > 
> > Why shouldn't memfd work independently of CONFIG_TMPFS? In particular,
> > write(2) on tmpfs FDs shouldn't work differently. If it does, that's a
> > kernel implementation detail leaking out into userspace.
> > 
> >>>>> - add_seals won’t need the wait_for_pins and mapping_deny_write logic.
> >>>>> 
> >>>>> That really should be all that’s needed.
> >>>> 
> >>>> It seems a fair idea what you're saying. But I don't see how its less
> >>>> complex.. IMO its far more simple to have VFS do the denial of the 
> >>>> operations
> >>>> based on the flags of its datastructures.. and if it works (which I will 
> >>>> test
> >>>> to be sure it will), then we should be good.
> >>> 
> >>> I agree it’s complicated, but the code is already written.  You should 
> >>> just
> >>> need to adjust some masks.
> >>> 
> >> 
> >> Its actually not that bad and a great idea, I did something like the
> >> following and it works pretty well. I would say its cleaner than the old
> >> approach for sure (and I also added a /proc/pid/fd/N reopen test to the
> >> selftest and made sure that issue goes away).
> >> 
> >> Side note: One subtlety I discovered from the existing selftests is once the
> >> F_SEAL_WRITE seal is active, an mmap of a PROT_READ and MAP_SHARED region is
> >> expected to fail. This is also evident from this code in mmap_region:
> >>if (vm_flags & VM_SHARED) {
> >>error = mapping_map_writable(file->f_mapping);
> >>if (error)
> >>goto allow_write_and_free_vma;
> >>}
> >> 
> > 
> > This behavior seems like a bug. Why should MAP_SHARED writes be denied
> > here? There's no semantic incompatibility between shared mappings and
> > the seal. And I think this change would represent an ABI break using
> > memfd seals for ashmem, since ashmem currently allows MAP_SHARED
> > mappings after changing prot_mask.
> 
> Hmm. I’m guessing the intent is that the mmap count should track writable
> mappings in addition to mappings that could be made writable using
> mprotect.  I think you could address this for SEAL_FUTURE in two ways:
> 
> 1. In shmem_mmap, mask off VM_MAYWRITE if SEAL_FUTURE is set, or
> 
> 2. Add a new vm operation that allows a vma to reject an mprotect attempt,
> like security_file_mprotect but per vma.  Then give it reasonable semantics
> for shmem.
> 
> (1) probably gives the semantics you want for SEAL_FUTURE: old maps can be
> mprotected, but new maps can’t.

Thanks Andy and Daniel! This occurred to me too and I like the solution in (1).
I tested that now PROT_READ + MAP_SHARED works and that mprotect is not able
to revert the protection. In fact (1) is exactly what we do in the ashmem
driver.

The updated patch now looks like the following:

---8<---

From: "Joel Fernandes" 
Subject: [PATCH] mm/memfd: implement future write seal using shmem ops

Signed-off-by: Joel Fernandes 
---
 fs/hugetlbfs/inode.c |  2 +-
 mm/memfd.c   | 19 ---
 mm/shmem.c   | 24 +---
 3 files changed, 22 insertions(+), 23 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 32920a10100e..1978581abfdf 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -530,7 +530,7 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
inode_lock(inode);
 
/* protected by i_mutex */
-   if (info->seals & F_SEAL_WRITE) {
+   if (info->seals & (F_SEAL_WRITE | F_SEAL_FUTURE_WRITE)) {
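
For reference, a minimal sketch of the shmem_mmap() side of option (1) above
(simplified from the full diff posted with the selftests elsewhere in this
series; the real function also does THP setup, omitted here):

static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
{
	struct shmem_inode_info *info = SHMEM_I(file_inode(file));

	/* Deny new shared, writable mappings once the seal is active. */
	if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_WRITE) &&
	    (info->seals & F_SEAL_FUTURE_WRITE))
		return -EPERM;

	/* Allow read-only MAP_SHARED, but clear VM_MAYWRITE so that
	 * mprotect() cannot upgrade the mapping to writable later.
	 */
	if (info->seals & F_SEAL_FUTURE_WRITE)
		vma->vm_flags &= ~VM_MAYWRITE;

	file_accessed(file);
	vma->vm_ops = &shmem_vm_ops;
	return 0;
}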
 

Re: [PATCH v3 resend 1/2] mm: Add an F_SEAL_FUTURE_WRITE seal to memfd

2018-11-11 Thread Joel Fernandes
On Sat, Nov 10, 2018 at 07:40:10PM -0800, Andy Lutomirski wrote:
[...]
> >>>>>>> I see two reasonable solutions:
> >>>>>>> 
> >>>>>>> 1. Don’t fiddle with the struct file at all. Instead make the inode 
> >>>>>>> flag
> >>>>>>> work by itself.
> >>>>>> 
> >>>>>> Currently, the various VFS paths check only the struct file's f_mode 
> >>>>>> to deny
> >>>>>> writes of already opened files. This would mean more checking in all 
> >>>>>> those
> >>>>>> paths (and modification of all those paths).
> >>>>>> 
> >>>>>> Anyway going with that idea, we could
> >>>>>> 1. call deny_write_access(file) from the memfd's seal path which 
> >>>>>> decrements
> >>>>>> the inode::i_writecount.
> >>>>>> 2. call get_write_access(inode) in the various VFS paths in addition to
> >>>>>> checking for FMODE_*WRITE and deny the write (incase i_writecount is 
> >>>>>> negative)
> >>>>>> 
> >>>>>> That will prevent both reopens, and writes from succeeding. However I 
> >>>>>> worry a
> >>>>>> bit about 2 not being too familiar with VFS internals, about what the
> >>>>>> consequences of doing that may be.
> >>>>> 
> >>>>> IMHO, modifying both the inode and the struct file separately is fine,
> >>>>> since they mean different things. In regular filesystems, it's fine to
> >>>>> have a read-write open file description for a file whose inode grants
> >>>>> write permission to nobody. Speaking of which: is fchmod enough to
> >>>>> prevent this attack?
> >>>> 
> >>>> Well, yes and no. fchmod does prevent reopening the file RW, but
> >>>> anyone with permissions (owner, CAP_FOWNER) can just fchmod it back. A
> >>>> seal is supposed to be irrevocable, so fchmod-as-inode-seal probably
> >>>> isn't sufficient by itself. While it might be good enough for Android
> >>>> (in the sense that it'll prevent RW-reopens from other security
> >>>> contexts to which we send an open memfd file), it's still conceptually
> >>>> ugly, IMHO. Let's go with the original approach of just tweaking the
> >>>> inode so that open-for-write is permanently blocked.
> >>> 
> >>> Agreed with the idea of modifying both file and inode flags. I was 
> >>> thinking
> >>> modifying i_mode may do the trick but as you pointed out, it probably could be
> >>> reverted by chmod or some other attribute setting calls.
> >>> 
> >>> OTOH, I don't think deny_write_access(file) can be reverted from any
> >>> user-facing path so we could do that from the seal to prevent the future
> >>> opens in write mode. I'll double check and test that out tomorrow.
> >>> 
> >>> 
> >> 
> >> This seems considerably more complicated and more fragile than needed. Just
> >> add a new F_SEAL_WRITE_FUTURE.  Grep for F_SEAL_WRITE and make the _FUTURE
> >> variant work exactly like it with two exceptions:
> >> 
> >> - shmem_mmap and maybe its hugetlbfs equivalent should check for it and act
> >> accordingly.
> > 
> > There's more to it than that, we also need to block future writes through
> > write syscall, so we have to hook into the write path too once the seal is
> > set, not just the mmap. That means we have to add code in mm/shmem.c to do
> > that in all those handlers, to check for the seal (and hope we didn't miss a
> > file_operations handler). Is that what you are proposing?
> 
> The existing code already does this. That’s why I suggested grepping :)
> 
> > 
> > Also, it means we have to keep CONFIG_TMPFS enabled so that the
> > shmem_file_operations write handlers like write_iter are hooked up. 
> > Currently
> > memfd works even with !CONFIG_TMPFS.
> 
> If so, that sounds like it may already be a bug.
> 
> > 
> >> - add_seals won’t need the wait_for_pins and mapping_deny_write logic.
> >> 
> >> That really should be all that’s needed.
> > 
> > It seems a fair idea what you're saying. But I don't see how its less
> > complex.. IMO its far more simple to have VFS do the denial of the 
> > operations
> > based on the flag

Re: [PATCH v3 resend 1/2] mm: Add an F_SEAL_FUTURE_WRITE seal to memfd

2018-11-10 Thread Joel Fernandes
On Sat, Nov 10, 2018 at 07:40:10PM -0800, Andy Lutomirski wrote:
> 
> 
> > On Nov 10, 2018, at 6:38 PM, Joel Fernandes  wrote:
> > 
> >> On Sat, Nov 10, 2018 at 02:18:23PM -0800, Andy Lutomirski wrote:
> >> 
> >>>> On Nov 10, 2018, at 2:09 PM, Joel Fernandes  
> >>>> wrote:
> >>>> 
> >>>>> On Sat, Nov 10, 2018 at 11:11:27AM -0800, Daniel Colascione wrote:
> >>>>>> On Sat, Nov 10, 2018 at 10:45 AM, Daniel Colascione 
> >>>>>>  wrote:
> >>>>>> On Sat, Nov 10, 2018 at 10:24 AM, Joel Fernandes 
> >>>>>>  wrote:
> >>>>>> Thanks Andy for your thoughts, my comments below:
> >>>> [snip]
> >>>>>> I don't see it as warty, different seals will work differently. It 
> >>>>>> works
> >>>>>> quite well for our usecase, and since Linux is all about solving real
> >>>>>> problems in the real world, it would be useful to have it.
> >>>>>> 
> >>>>>>> - causes a probably-observable effect in the file mode in F_GETFL.
> >>>>>> 
> >>>>>> Wouldn't that be the right thing to observe anyway?
> >>>>>> 
> >>>>>>> - causes reopen to fail.
> >>>>>> 
> >>>>>> So this concern isn't true anymore if we make reopen fail only for 
> >>>>>> WRITE
> >>>>>> opens as Daniel suggested. I will make this change so that the 
> >>>>>> security fix
> >>>>>> is a clean one.
> >>>>>> 
> >>>>>>> - does *not* affect other struct files that may already exist on the 
> >>>>>>> same inode.
> >>>>>> 
> >>>>>> TBH if you really want to block all writes to the file, then you want
> >>>>>> F_SEAL_WRITE, not this seal. The usecase we have is the fd is sent 
> >>>>>> over IPC
> >>>>>> to another process and we want to prevent any new writes in the 
> >>>>>> receiver
> >>>>>> side. There is no way this other receiving process can have an 
> >>>>>> existing fd
> >>>>>> unless it was already sent one without the seal applied.  The proposed 
> >>>>>> seal
> >>>>>> could be renamed to F_SEAL_FD_WRITE if that is preferred.
> >>>>>> 
> >>>>>>> - mysteriously malfunctions if you try to set it again on another 
> >>>>>>> struct
> >>>>>>> file that already exists
> >>>>>>> 
> >>>>>> 
> >>>>>> I didn't follow this, could you explain more?
> >>>>>> 
> >>>>>>> - probably is insecure when used on hugetlbfs.
> >>>>>> 
> >>>>>> The usecase is not expected to prevent all writes, indeed the usecase
> >>>>>> requires existing mmaps to continue to be able to write into the 
> >>>>>> memory map.
> >>>>>> So would you call that a security issue too? The use of the seal wants 
> >>>>>> to
> >>>>>> allow existing mmap regions to continue to be written into (I 
> >>>>>> mentioned
> >>>>>> more details in the cover letter).
> >>>>>> 
> >>>>>>> I see two reasonable solutions:
> >>>>>>> 
> >>>>>>> 1. Don’t fiddle with the struct file at all. Instead make the inode 
> >>>>>>> flag
> >>>>>>> work by itself.
> >>>>>> 
> >>>>>> Currently, the various VFS paths check only the struct file's f_mode 
> >>>>>> to deny
> >>>>>> writes of already opened files. This would mean more checking in all 
> >>>>>> those
> >>>>>> paths (and modification of all those paths).
> >>>>>> 
> >>>>>> Anyway going with that idea, we could
> >>>>>> 1. call deny_write_access(file) from the memfd's seal path which 
> >>>>>> decrements
> >>>>>> the inode::i_writecount.
> >>>>>> 2. call get_write_access(inode) in the various VFS paths in addition to
> >>>>>> checking for FMODE_*W

Re: dyntick-idle CPU and node's qsmask

2018-11-10 Thread Joel Fernandes
On Sat, Nov 10, 2018 at 03:04:36PM -0800, Paul E. McKenney wrote:
> On Sat, Nov 10, 2018 at 01:46:59PM -0800, Joel Fernandes wrote:
> > Hi Paul and everyone,
> > 
> > I was tracing/studying the RCU code today in paul/dev branch and noticed 
> > that
> > for dyntick-idle CPUs, the RCU GP thread is clearing the rnp->qsmask
> > corresponding to the leaf node for the idle CPU, and reporting a QS on their
> > behalf.
> > 
> > rcu_sched-10 [003] 40.008039: rcu_fqs: rcu_sched 792 0 dti
> > rcu_sched-10 [003] 40.008039: rcu_fqs: rcu_sched 801 2 dti
> > rcu_sched-10 [003] 40.008041: rcu_quiescent_state_report: rcu_sched 805 5>0 0 0 3 0
> > 
> > That's all good but I was wondering if we can do better for the idle CPUs if
> > we can somehow not set the qsmask of the node in the first place. Then no
> > reporting of a quiescent state would be needed for idle CPUs, right?
> > And we would also not need to acquire the rnp lock I think.
> > 
> > At least for a single node tree RCU system, it seems that would avoid 
> > needing
> > to acquire the lock without complications. Anyway let me know your thoughts
> > and happy to discuss this at the hallways of the LPC as well for folks
> > attending :)
> 
> We could, but that would require consulting the rcu_data structure for
> each CPU while initializing the grace period, thus increasing the number
> of cache misses during grace-period initialization and also shortly after
> for any non-idle CPUs.  This seems backwards on busy systems where each

When I traced, it appears to me that the rcu_data structure of a remote CPU was
being consulted anyway by the rcu_sched thread. So it seems like such a cache
miss would happen anyway, whether it is during grace-period initialization or
during the fqs stage? I guess I'm trying to say, the consultation of the remote
CPU's rcu_data happens anyway.

> CPU will with high probability report its own quiescent state before three
> jiffies pass, in which case the cache misses on the rcu_data structures
> would be wasted motion.

If all the CPUs are busy and reporting their QS themselves, then I think the
qsmask is likely 0 so then rcu_implicit_dynticks_qs (called from
force_qs_rnp) wouldn't be called and so there would no cache misses on
rcu_data right?

> Now, this does increase overhead on mostly idle systems, but the theory
> is that mostly idle systems are most able to absorb this extra overhead.

Yes. Could we use rcuperf to check the impact of such a change?

Anyway it was just an idea that popped up when I was going through traces :)
Thanks for the discussion and happy to discuss further or try out anything.

- Joel



Re: [PATCH v3 resend 1/2] mm: Add an F_SEAL_FUTURE_WRITE seal to memfd

2018-11-10 Thread Joel Fernandes
On Sat, Nov 10, 2018 at 02:18:23PM -0800, Andy Lutomirski wrote:
> 
> > On Nov 10, 2018, at 2:09 PM, Joel Fernandes  wrote:
> > 
> >> On Sat, Nov 10, 2018 at 11:11:27AM -0800, Daniel Colascione wrote:
> >>> On Sat, Nov 10, 2018 at 10:45 AM, Daniel Colascione  
> >>> wrote:
> >>>> On Sat, Nov 10, 2018 at 10:24 AM, Joel Fernandes 
> >>>>  wrote:
> >>>> Thanks Andy for your thoughts, my comments below:
> >> [snip]
> >>>> I don't see it as warty, different seals will work differently. It works
> >>>> quite well for our usecase, and since Linux is all about solving real
> >>>> problems in the real world, it would be useful to have it.
> >>>> 
> >>>>> - causes a probably-observable effect in the file mode in F_GETFL.
> >>>> 
> >>>> Wouldn't that be the right thing to observe anyway?
> >>>> 
> >>>>> - causes reopen to fail.
> >>>> 
> >>>> So this concern isn't true anymore if we make reopen fail only for WRITE
> >>>> opens as Daniel suggested. I will make this change so that the security 
> >>>> fix
> >>>> is a clean one.
> >>>> 
> >>>>> - does *not* affect other struct files that may already exist on the 
> >>>>> same inode.
> >>>> 
> >>>> TBH if you really want to block all writes to the file, then you want
> >>>> F_SEAL_WRITE, not this seal. The usecase we have is the fd is sent over 
> >>>> IPC
> >>>> to another process and we want to prevent any new writes in the receiver
> >>>> side. There is no way this other receiving process can have an existing 
> >>>> fd
> >>>> unless it was already sent one without the seal applied.  The proposed 
> >>>> seal
> >>>> could be renamed to F_SEAL_FD_WRITE if that is preferred.
> >>>> 
> >>>>> - mysteriously malfunctions if you try to set it again on another struct
> >>>>> file that already exists
> >>>>> 
> >>>> 
> >>>> I didn't follow this, could you explain more?
> >>>> 
> >>>>> - probably is insecure when used on hugetlbfs.
> >>>> 
> >>>> The usecase is not expected to prevent all writes, indeed the usecase
> >>>> requires existing mmaps to continue to be able to write into the memory 
> >>>> map.
> >>>> So would you call that a security issue too? The use of the seal wants to
> >>>> allow existing mmap regions to continue to be written into (I 
> >>>> mentioned
> >>>> more details in the cover letter).
> >>>> 
> >>>>> I see two reasonable solutions:
> >>>>> 
> >>>>> 1. Don’t fiddle with the struct file at all. Instead make the inode flag
> >>>>> work by itself.
> >>>> 
> >>>> Currently, the various VFS paths check only the struct file's f_mode to 
> >>>> deny
> >>>> writes of already opened files. This would mean more checking in all 
> >>>> those
> >>>> paths (and modification of all those paths).
> >>>> 
> >>>> Anyway going with that idea, we could
> >>>> 1. call deny_write_access(file) from the memfd's seal path which 
> >>>> decrements
> >>>> the inode::i_writecount.
> >>>> 2. call get_write_access(inode) in the various VFS paths in addition to
> >>>> checking for FMODE_*WRITE and deny the write (incase i_writecount is 
> >>>> negative)
> >>>> 
> >>>> That will prevent both reopens, and writes from succeeding. However I 
> >>>> worry a
> >>>> bit about 2 not being too familiar with VFS internals, about what the
> >>>> consequences of doing that may be.
> >>> 
> >>> IMHO, modifying both the inode and the struct file separately is fine,
> >>> since they mean different things. In regular filesystems, it's fine to
> >>> have a read-write open file description for a file whose inode grants
> >>> write permission to nobody. Speaking of which: is fchmod enough to
> >>> prevent this attack?
> >> 
> >> Well, yes and no. fchmod does prevent reopening the file RW, but
> >> anyone with permissions (owner, CAP_FOWNER) can just fchmod it back. A
> >

Re: [PATCH v3 resend 1/2] mm: Add an F_SEAL_FUTURE_WRITE seal to memfd

2018-11-10 Thread Joel Fernandes
On Sat, Nov 10, 2018 at 11:11:27AM -0800, Daniel Colascione wrote:
> On Sat, Nov 10, 2018 at 10:45 AM, Daniel Colascione  wrote:
> > On Sat, Nov 10, 2018 at 10:24 AM, Joel Fernandes  
> > wrote:
> >> Thanks Andy for your thoughts, my comments below:
> [snip]
> >> I don't see it as warty, different seals will work differently. It works
> >> quite well for our usecase, and since Linux is all about solving real
> >> problems in the real world, it would be useful to have it.
> >>
> >>> - causes a probably-observable effect in the file mode in F_GETFL.
> >>
> >> Wouldn't that be the right thing to observe anyway?
> >>
> >>> - causes reopen to fail.
> >>
> >> So this concern isn't true anymore if we make reopen fail only for WRITE
> >> opens as Daniel suggested. I will make this change so that the security fix
> >> is a clean one.
> >>
> >>> - does *not* affect other struct files that may already exist on the same 
> >>> inode.
> >>
> >> TBH if you really want to block all writes to the file, then you want
> >> F_SEAL_WRITE, not this seal. The usecase we have is the fd is sent over IPC
> >> to another process and we want to prevent any new writes in the receiver
> >> side. There is no way this other receiving process can have an existing fd
> >> unless it was already sent one without the seal applied.  The proposed seal
> >> could be renamed to F_SEAL_FD_WRITE if that is preferred.
> >>
> >>> - mysteriously malfunctions if you try to set it again on another struct
> >>> file that already exists
> >>>
> >>
> >> I didn't follow this, could you explain more?
> >>
> >>> - probably is insecure when used on hugetlbfs.
> >>
> >> The usecase is not expected to prevent all writes, indeed the usecase
> >> requires existing mmaps to continue to be able to write into the memory 
> >> map.
> >> So would you call that a security issue too? The use of the seal wants to
> >> allow existing mmap regions to continue to be written into (I mentioned
> >> more details in the cover letter).
> >>
> >>> I see two reasonable solutions:
> >>>
> >>> 1. Don’t fiddle with the struct file at all. Instead make the inode flag
> >>> work by itself.
> >>
> >> Currently, the various VFS paths check only the struct file's f_mode to 
> >> deny
> >> writes of already opened files. This would mean more checking in all those
> >> paths (and modification of all those paths).
> >>
> >> Anyway going with that idea, we could
> >> 1. call deny_write_access(file) from the memfd's seal path which decrements
> >> the inode::i_writecount.
> >> 2. call get_write_access(inode) in the various VFS paths in addition to
> >> checking for FMODE_*WRITE and deny the write (incase i_writecount is 
> >> negative)
> >>
> >> That will prevent both reopens, and writes from succeeding. However I 
> >> worry a
> >> bit about 2 not being too familiar with VFS internals, about what the
> >> consequences of doing that may be.
> >
> > IMHO, modifying both the inode and the struct file separately is fine,
> > since they mean different things. In regular filesystems, it's fine to
> > have a read-write open file description for a file whose inode grants
> > write permission to nobody. Speaking of which: is fchmod enough to
> > prevent this attack?
> 
> Well, yes and no. fchmod does prevent reopening the file RW, but
> anyone with permissions (owner, CAP_FOWNER) can just fchmod it back. A
> seal is supposed to be irrevocable, so fchmod-as-inode-seal probably
> isn't sufficient by itself. While it might be good enough for Android
> (in the sense that it'll prevent RW-reopens from other security
> contexts to which we send an open memfd file), it's still conceptually
> ugly, IMHO. Let's go with the original approach of just tweaking the
> inode so that open-for-write is permanently blocked.

Agreed with the idea of modifying both file and inode flags. I was thinking
modifying i_mode may do the trick but as you pointed out, it probably could be
reverted by chmod or some other attribute setting calls.
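
To make that concrete, a small userspace demo (my own sketch; run it
unprivileged, since CAP_DAC_OVERRIDE would bypass the mode check) of why
fchmod() is not a seal:

#define _GNU_SOURCE
#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	char path[64];
	int fd = syscall(__NR_memfd_create, "testfd", 0);

	if (fd == -1)
		err(1, "memfd_create");
	sprintf(path, "/proc/self/fd/%d", fd);

	fchmod(fd, 0400);		/* drop write permission */
	printf("reopen RW: %d\n", open(path, O_RDWR));	/* expect -1 */

	fchmod(fd, 0600);		/* ...and trivially restore it */
	printf("reopen RW: %d\n", open(path, O_RDWR));	/* expect >= 0 */
	return 0;
}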

OTOH, I don't think deny_write_access(file) can be reverted from any
user-facing path, so we could do that from the seal path to prevent future
opens in write mode. I'll double check and test that out tomorrow.
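
A minimal sketch of what I mean (illustrative only, untested as I said;
deny_write_access() pushes inode::i_writecount negative so later write opens
fail, and itself fails with -ETXTBSY if the file is already open for write):

	/* In memfd_add_seals(), roughly: */
	if ((seals & F_SEAL_FUTURE_WRITE) &&
	    !(*file_seals & F_SEAL_FUTURE_WRITE)) {
		error = deny_write_access(file); /* never released: seals are permanent */
		if (error)
			goto unlock;
	}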

thanks,

 - Joel



dyntick-idle CPU and node's qsmask

2018-11-10 Thread Joel Fernandes
Hi Paul and everyone,

I was tracing/studying the RCU code today in paul/dev branch and noticed that
for dyntick-idle CPUs, the RCU GP thread is clearing the rnp->qsmask
corresponding to the leaf node for the idle CPU, and reporting a QS on their
behalf.

rcu_sched-10 [003] 40.008039: rcu_fqs: rcu_sched 792 0 dti
rcu_sched-10 [003] 40.008039: rcu_fqs: rcu_sched 801 2 dti
rcu_sched-10 [003] 40.008041: rcu_quiescent_state_report: rcu_sched 805 5>0 0 0 3 0

That's all good but I was wondering if we can do better for the idle CPUs if
we can somehow not set the qsmask of the node in the first place. Then no
reporting of a quiescent state would be needed for idle CPUs, right?
And we would also not need to acquire the rnp lock I think.

At least for a single node tree RCU system, it seems that would avoid needing
to acquire the lock without complications. Anyway let me know your thoughts
and happy to discuss this at the hallways of the LPC as well for folks
attending :)

thanks,

- Joel


Re: [PATCH v3 resend 1/2] mm: Add an F_SEAL_FUTURE_WRITE seal to memfd

2018-11-10 Thread Joel Fernandes
Thanks Andy for your thoughts, my comments below:

On Fri, Nov 09, 2018 at 10:05:14PM -0800, Andy Lutomirski wrote:
> 
> 
> > On Nov 9, 2018, at 7:20 PM, Joel Fernandes  wrote:
> > 
> >> On Fri, Nov 09, 2018 at 10:19:03PM +0100, Jann Horn wrote:
> >>> On Fri, Nov 9, 2018 at 10:06 PM Jann Horn  wrote:
> >>> On Fri, Nov 9, 2018 at 9:46 PM Joel Fernandes (Google)
> >>>  wrote:
> >>>> Android uses ashmem for sharing memory regions. We are looking forward
> >>>> to migrating all usecases of ashmem to memfd so that we can possibly
> >>>> remove the ashmem driver in the future from staging while also
> >>>> benefiting from using memfd and contributing to it. Note staging drivers
> >>>> are also not ABI and generally can be removed at anytime.
> >>>> 
> >>>> One of the main usecases Android has is the ability to create a region
> >>>> and mmap it as writeable, then add protection against making any
> >>>> "future" writes while keeping the existing already mmap'ed
> >>>> writeable-region active.  This allows us to implement a usecase where
> >>>> receivers of the shared memory buffer can get a read-only view, while
> >>>> the sender continues to write to the buffer.
> >>>> See CursorWindow documentation in Android for more details:
> >>>> https://developer.android.com/reference/android/database/CursorWindow
> >>>> 
> >>>> This usecase cannot be implemented with the existing F_SEAL_WRITE seal.
> >>>> To support the usecase, this patch adds a new F_SEAL_FUTURE_WRITE seal
> >>>> which prevents any future mmap and write syscalls from succeeding while
> >>>> keeping the existing mmap active.
> >>> 
> >>> Please CC linux-api@ on patches like this. If you had done that, I
> >>> might have criticized your v1 patch instead of your v3 patch...
> >>> 
> >>>> The following program shows the seal
> >>>> working in action:
> >>> [...]
> >>>> Cc: jr...@google.com
> >>>> Cc: john.stu...@linaro.org
> >>>> Cc: tk...@google.com
> >>>> Cc: gre...@linuxfoundation.org
> >>>> Cc: h...@infradead.org
> >>>> Reviewed-by: John Stultz 
> >>>> Signed-off-by: Joel Fernandes (Google) 
> >>>> ---
> >>> [...]
> >>>> diff --git a/mm/memfd.c b/mm/memfd.c
> >>>> index 2bb5e257080e..5ba9804e9515 100644
> >>>> --- a/mm/memfd.c
> >>>> +++ b/mm/memfd.c
> >>> [...]
> >>>> @@ -219,6 +220,25 @@ static int memfd_add_seals(struct file *file, 
> >>>> unsigned int seals)
> >>>>}
> >>>>}
> >>>> 
> >>>> +   if ((seals & F_SEAL_FUTURE_WRITE) &&
> >>>> +   !(*file_seals & F_SEAL_FUTURE_WRITE)) {
> >>>> +   /*
> >>>> +* The FUTURE_WRITE seal also prevents growing and shrinking
> >>>> +* so we need them to be already set, or requested now.
> >>>> +*/
> >>>> +   int test_seals = (seals | *file_seals) &
> >>>> +(F_SEAL_GROW | F_SEAL_SHRINK);
> >>>> +
> >>>> +   if (test_seals != (F_SEAL_GROW | F_SEAL_SHRINK)) {
> >>>> +   error = -EINVAL;
> >>>> +   goto unlock;
> >>>> +   }
> >>>> +
> >>>> +   spin_lock(&file->f_lock);
> >>>> +   file->f_mode &= ~(FMODE_WRITE | FMODE_PWRITE);
> >>>> +   spin_unlock(&file->f_lock);
> >>>> +   }
> >>> 
> >>> So you're fiddling around with the file, but not the inode? How are
> >>> you preventing code like the following from re-opening the file as
> >>> writable?
> >>> 
> >>> $ cat memfd.c
> >>> #define _GNU_SOURCE
> >>> #include <unistd.h>
> >>> #include <sys/syscall.h>
> >>> #include <sys/mman.h>
> >>> #include <fcntl.h>
> >>> #include <stdio.h>
> >>> #include <err.h>
> >>> 
> >>> int main(void) {
> >>>  int fd = syscall(__NR_memfd_create, "testfd", 0);
> >>>  if (fd == -1) err(1, "memfd");
> >>>  ch

Re: [PATCH v3 resend 1/2] mm: Add an F_SEAL_FUTURE_WRITE seal to memfd

2018-11-10 Thread Joel Fernandes
On Sat, Nov 10, 2018 at 04:26:46AM -0800, Daniel Colascione wrote:
> On Friday, November 9, 2018, Joel Fernandes  wrote:
> 
> > On Fri, Nov 09, 2018 at 10:19:03PM +0100, Jann Horn wrote:
> > > On Fri, Nov 9, 2018 at 10:06 PM Jann Horn  wrote:
> > > > On Fri, Nov 9, 2018 at 9:46 PM Joel Fernandes (Google)
> > > >  wrote:
> > > > > Android uses ashmem for sharing memory regions. We are looking forward
> > > > > to migrating all usecases of ashmem to memfd so that we can possibly
> > > > > remove the ashmem driver in the future from staging while also
> > > > > benefiting from using memfd and contributing to it. Note staging drivers
> > > > > are also not ABI and generally can be removed at any time.
> > > > >
> > > > > One of the main usecases Android has is the ability to create a region
> > > > > and mmap it as writeable, then add protection against making any
> > > > > "future" writes while keeping the existing already mmap'ed
> > > > > writeable-region active.  This allows us to implement a usecase where
> > > > > receivers of the shared memory buffer can get a read-only view, while
> > > > > the sender continues to write to the buffer.
> > > > > See CursorWindow documentation in Android for more details:
> > > > > https://developer.android.com/reference/android/database/CursorWindow
> > > > >
> > > > > This usecase cannot be implemented with the existing F_SEAL_WRITE seal.
> > > > > To support the usecase, this patch adds a new F_SEAL_FUTURE_WRITE seal
> > > > > which prevents any future mmap and write syscalls from succeeding while
> > > > > keeping the existing mmap active.
> > > >
> > > > Please CC linux-api@ on patches like this. If you had done that, I
> > > > might have criticized your v1 patch instead of your v3 patch...
> > > >
> > > > > The following program shows the seal
> > > > > working in action:
> > > > [...]
> > > > > Cc: jr...@google.com
> > > > > Cc: john.stu...@linaro.org
> > > > > Cc: tk...@google.com
> > > > > Cc: gre...@linuxfoundation.org
> > > > > Cc: h...@infradead.org
> > > > > Reviewed-by: John Stultz 
> > > > > Signed-off-by: Joel Fernandes (Google) 
> > > > > ---
> > > > [...]
> > > > > diff --git a/mm/memfd.c b/mm/memfd.c
> > > > > index 2bb5e257080e..5ba9804e9515 100644
> > > > > --- a/mm/memfd.c
> > > > > +++ b/mm/memfd.c
> > > > [...]
> > > > > @@ -219,6 +220,25 @@ static int memfd_add_seals(struct file *file, unsigned int seals)
> > > > > }
> > > > > }
> > > > >
> > > > > +   if ((seals & F_SEAL_FUTURE_WRITE) &&
> > > > > +   !(*file_seals & F_SEAL_FUTURE_WRITE)) {
> > > > > +   /*
> > > > > +* The FUTURE_WRITE seal also prevents growing and shrinking
> > > > > +* so we need them to be already set, or requested now.
> > > > > +*/
> > > > > +   int test_seals = (seals | *file_seals) &
> > > > > +(F_SEAL_GROW | F_SEAL_SHRINK);
> > > > > +
> > > > > +   if (test_seals != (F_SEAL_GROW | F_SEAL_SHRINK)) {
> > > > > +   error = -EINVAL;
> > > > > +   goto unlock;
> > > > > +   }
> > > > > +
> > > > > +   spin_lock(&file->f_lock);
> > > > > +   file->f_mode &= ~(FMODE_WRITE | FMODE_PWRITE);
> > > > > +   spin_unlock(&file->f_lock);
> > > > > +   }
> > > >
> > > > So you're fiddling around with the file, but not the inode? How are
> > > > you preventing code like the following from re-opening the file as
> > > > writable?
> > > >
> > > > $ cat memfd.c
> > > > #define _GNU_SOURCE
> > > > #include <unistd.h>
> > > > #include <sys/syscall.h>
> > > > #include <sys/mman.h>
> > > > #include <fcntl.h>
> > > > #include <stdio.h>
> > > > #include <err.h>
> > > >
> > > &

Re: [PATCH v3 resend 1/2] mm: Add an F_SEAL_FUTURE_WRITE seal to memfd

2018-11-09 Thread Joel Fernandes
On Fri, Nov 09, 2018 at 12:36:34PM -0800, Andrew Morton wrote:
> On Wed,  7 Nov 2018 20:15:36 -0800 "Joel Fernandes (Google)" 
>  wrote:
> 
> > Android uses ashmem for sharing memory regions. We are looking forward
> > to migrating all usecases of ashmem to memfd so that we can possibly
> > remove the ashmem driver in the future from staging while also
> > benefiting from using memfd and contributing to it. Note staging drivers
> > are also not ABI and generally can be removed at any time.
> > 
> > One of the main usecases Android has is the ability to create a region
> > and mmap it as writeable, then add protection against making any
> > "future" writes while keeping the existing already mmap'ed
> > writeable-region active.  This allows us to implement a usecase where
> > receivers of the shared memory buffer can get a read-only view, while
> > the sender continues to write to the buffer.
> > See CursorWindow documentation in Android for more details:
> > https://developer.android.com/reference/android/database/CursorWindow
> 
> It appears that the memfd_create and fcntl manpages will require
> updating.  Please attend to this at the appropriate time?

Yes, I am planning to send those out shortly. I finished working on them.

Also just to let you know, I posted a fix for the security issue Jann Horn
reported and asked him to test it:
https://lore.kernel.org/lkml/20181109234636.ga136...@google.com/T/#m8d9d185e6480d095f0ab8f84bcb103892181f77d

This fix, along with the 2 other patches I posted in v3, is all that's needed. 
thanks!

- Joel



Re: [PATCH v3 resend 1/2] mm: Add an F_SEAL_FUTURE_WRITE seal to memfd

2018-11-09 Thread Joel Fernandes
On Fri, Nov 09, 2018 at 10:19:03PM +0100, Jann Horn wrote:
> On Fri, Nov 9, 2018 at 10:06 PM Jann Horn  wrote:
> > On Fri, Nov 9, 2018 at 9:46 PM Joel Fernandes (Google)
> >  wrote:
> > > Android uses ashmem for sharing memory regions. We are looking forward
> > > to migrating all usecases of ashmem to memfd so that we can possibly
> > > remove the ashmem driver in the future from staging while also
> > > benefiting from using memfd and contributing to it. Note staging drivers
> > > are also not ABI and generally can be removed at any time.
> > >
> > > One of the main usecases Android has is the ability to create a region
> > > and mmap it as writeable, then add protection against making any
> > > "future" writes while keeping the existing already mmap'ed
> > > writeable-region active.  This allows us to implement a usecase where
> > > receivers of the shared memory buffer can get a read-only view, while
> > > the sender continues to write to the buffer.
> > > See CursorWindow documentation in Android for more details:
> > > https://developer.android.com/reference/android/database/CursorWindow
> > >
> > > This usecase cannot be implemented with the existing F_SEAL_WRITE seal.
> > > To support the usecase, this patch adds a new F_SEAL_FUTURE_WRITE seal
> > > which prevents any future mmap and write syscalls from succeeding while
> > > keeping the existing mmap active.
> >
> > Please CC linux-api@ on patches like this. If you had done that, I
> > might have criticized your v1 patch instead of your v3 patch...
> >
> > > The following program shows the seal
> > > working in action:
> > [...]
> > > Cc: jr...@google.com
> > > Cc: john.stu...@linaro.org
> > > Cc: tk...@google.com
> > > Cc: gre...@linuxfoundation.org
> > > Cc: h...@infradead.org
> > > Reviewed-by: John Stultz 
> > > Signed-off-by: Joel Fernandes (Google) 
> > > ---
> > [...]
> > > diff --git a/mm/memfd.c b/mm/memfd.c
> > > index 2bb5e257080e..5ba9804e9515 100644
> > > --- a/mm/memfd.c
> > > +++ b/mm/memfd.c
> > [...]
> > > @@ -219,6 +220,25 @@ static int memfd_add_seals(struct file *file, 
> > > unsigned int seals)
> > > }
> > > }
> > >
> > > +   if ((seals & F_SEAL_FUTURE_WRITE) &&
> > > +   !(*file_seals & F_SEAL_FUTURE_WRITE)) {
> > > +   /*
> > > +* The FUTURE_WRITE seal also prevents growing and shrinking
> > > +* so we need them to be already set, or requested now.
> > > +*/
> > > +   int test_seals = (seals | *file_seals) &
> > > +(F_SEAL_GROW | F_SEAL_SHRINK);
> > > +
> > > +   if (test_seals != (F_SEAL_GROW | F_SEAL_SHRINK)) {
> > > +   error = -EINVAL;
> > > +   goto unlock;
> > > +   }
> > > +
> > > +   spin_lock(&file->f_lock);
> > > +   file->f_mode &= ~(FMODE_WRITE | FMODE_PWRITE);
> > > +   spin_unlock(&file->f_lock);
> > > +   }
> >
> > So you're fiddling around with the file, but not the inode? How are
> > you preventing code like the following from re-opening the file as
> > writable?
> >
> > $ cat memfd.c
> > #define _GNU_SOURCE
> > #include <unistd.h>
> > #include <sys/syscall.h>
> > #include <sys/mman.h>
> > #include <fcntl.h>
> > #include <stdio.h>
> > #include <err.h>
> >
> > int main(void) {
> >   int fd = syscall(__NR_memfd_create, "testfd", 0);
> >   if (fd == -1) err(1, "memfd");
> >   char path[100];
> >   sprintf(path, "/proc/self/fd/%d", fd);
> >   int fd2 = open(path, O_RDWR);
> >   if (fd2 == -1) err(1, "reopen");
> >   printf("reopen successful: %d\n", fd2);
> > }
> > $ gcc -o memfd memfd.c
> > $ ./memfd
> > reopen successful: 4
> > $
> >
> > That aside: I wonder whether a better API would be something that
> > allows you to create a new readonly file descriptor, instead of
> > fiddling with the writability of an existing fd.
> 
> My favorite approach would be to forbid open() on memfds, hope that
> nobody notices the tiny API break, and then add an ioctl for "reopen
> this memfd with reduced permissions" - but that's just my personal
> opinion.

I did something along these lines and 

Re: [PATCH v3 resend 1/2] mm: Add an F_SEAL_FUTURE_WRITE seal to memfd

2018-11-09 Thread Joel Fernandes
On Fri, Nov 09, 2018 at 08:02:14PM +, Michael Tirado wrote:
[...]
> > > That aside: I wonder whether a better API would be something that
> > > allows you to create a new readonly file descriptor, instead of
> > > fiddling with the writability of an existing fd.
> >
> > Every now and then I try to write a patch to prevent using proc to reopen
> > a file with greater permission than the original open.
> >
> > I like your idea to have a clean way to reopen a a memfd with reduced
> > permissions. But I would make it a syscall instead and maybe make it only
> > work for memfd at first.  And the proc issue would need to be fixed, too.
> 
> IMO the best solution would handle the issue at memfd creation time by
> removing the race condition.

I agree, this is another idea I'm exploring. We could add a new .open
callback to shmem_file_operations and check for seals there.
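
A rough sketch of that idea (function name and placement are my assumption;
this is just the exploration, not a tested fix):

/* Would be hooked up as .open in shmem's file_operations. */
static int shmem_open(struct inode *inode, struct file *file)
{
	struct shmem_inode_info *info = SHMEM_I(inode);

	/* Refuse to hand out new write-capable descriptors once sealed. */
	if ((file->f_mode & FMODE_WRITE) &&
	    (info->seals & F_SEAL_FUTURE_WRITE))
		return -EPERM;

	return 0;
}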

thanks,

 - Joel


Re: [PATCH v3 resend 1/2] mm: Add an F_SEAL_FUTURE_WRITE seal to memfd

2018-11-09 Thread Joel Fernandes
On Fri, Nov 09, 2018 at 03:14:02PM -0800, Andy Lutomirski wrote:
>  That aside: I wonder whether a better API would be something that
>  allows you to create a new readonly file descriptor, instead of
>  fiddling with the writability of an existing fd.
> >>> 
> >>> That doesn't work, unfortunately. The ashmem API we're replacing with
> >>> memfd requires file descriptor continuity. I also looked into opening
> >>> a new FD and dup2(2)ing atop the old one, but this approach doesn't
> >>> work in the case that the old FD has already leaked to some other
> >>> context (e.g., another dup, SCM_RIGHTS). See
> >>> https://developer.android.com/ndk/reference/group/memory. We can't
> >>> break ASharedMemory_setProt.
> >> 
> >> 
> >> Hmm.  If we fix the general reopen bug, a way to drop write access from
> >> an existing struct file would do what Android needs, right?  I don’t
> >> know if there are general VFS issues with that.
> > 

I don't think there is a way to fix this in /proc/pid/fd. At the proc
level, the /proc/pid/fd/N files are just soft symlinks that follow through to
the actual file. The open is actually done on that inode/file. I think
changing it the way being discussed here means changing the way symlinks work
in Linux.

I think the right way to fix this is at the memfd inode level. I am working
on a follow up patch on top of this patch, and will send that out in a few
days (along with the man page updates).

thanks!

 - Joel


Re: [PATCH v3 resend 1/2] mm: Add an F_SEAL_FUTURE_WRITE seal to memfd

2018-11-09 Thread Joel Fernandes
On Fri, Nov 09, 2018 at 10:06:31PM +0100, Jann Horn wrote:
> +linux-api for API addition
> +hughd as FYI since this is somewhat related to mm/shmem
> 
> On Fri, Nov 9, 2018 at 9:46 PM Joel Fernandes (Google)
>  wrote:
> > Android uses ashmem for sharing memory regions. We are looking forward
> > to migrating all usecases of ashmem to memfd so that we can possibly
> > remove the ashmem driver in the future from staging while also
> > benefiting from using memfd and contributing to it. Note staging drivers
> > are also not ABI and generally can be removed at any time.
> >
> > One of the main usecases Android has is the ability to create a region
> > and mmap it as writeable, then add protection against making any
> > "future" writes while keeping the existing already mmap'ed
> > writeable-region active.  This allows us to implement a usecase where
> > receivers of the shared memory buffer can get a read-only view, while
> > the sender continues to write to the buffer.
> > See CursorWindow documentation in Android for more details:
> > https://developer.android.com/reference/android/database/CursorWindow
> >
> > This usecase cannot be implemented with the existing F_SEAL_WRITE seal.
> > To support the usecase, this patch adds a new F_SEAL_FUTURE_WRITE seal
> > which prevents any future mmap and write syscalls from succeeding while
> > keeping the existing mmap active.
> 
> Please CC linux-api@ on patches like this. If you had done that, I
> might have criticized your v1 patch instead of your v3 patch...

Ok, will do next time.

> > The following program shows the seal
> > working in action:
> [...]
> > Cc: jr...@google.com
> > Cc: john.stu...@linaro.org
> > Cc: tk...@google.com
> > Cc: gre...@linuxfoundation.org
> > Cc: h...@infradead.org
> > Reviewed-by: John Stultz 
> > Signed-off-by: Joel Fernandes (Google) 
> > ---
> [...]
> > diff --git a/mm/memfd.c b/mm/memfd.c
> > index 2bb5e257080e..5ba9804e9515 100644
> > --- a/mm/memfd.c
> > +++ b/mm/memfd.c
> [...]
> > @@ -219,6 +220,25 @@ static int memfd_add_seals(struct file *file, unsigned int seals)
> > }
> > }
> >
> > +   if ((seals & F_SEAL_FUTURE_WRITE) &&
> > +   !(*file_seals & F_SEAL_FUTURE_WRITE)) {
> > +   /*
> > +* The FUTURE_WRITE seal also prevents growing and shrinking
> > +* so we need them to be already set, or requested now.
> > +*/
> > +   int test_seals = (seals | *file_seals) &
> > +(F_SEAL_GROW | F_SEAL_SHRINK);
> > +
> > +   if (test_seals != (F_SEAL_GROW | F_SEAL_SHRINK)) {
> > +   error = -EINVAL;
> > +   goto unlock;
> > +   }
> > +
> > +   spin_lock(&file->f_lock);
> > +   file->f_mode &= ~(FMODE_WRITE | FMODE_PWRITE);
> > +   spin_unlock(&file->f_lock);
> > +   }
> 
> So you're fiddling around with the file, but not the inode? How are
> you preventing code like the following from re-opening the file as
> writable?
> 
> $ cat memfd.c
> #define _GNU_SOURCE
> #include <unistd.h>
> #include <sys/syscall.h>
> #include <sys/mman.h>
> #include <fcntl.h>
> #include <stdio.h>
> #include <err.h>
> 
> int main(void) {
>   int fd = syscall(__NR_memfd_create, "testfd", 0);
>   if (fd == -1) err(1, "memfd");
>   char path[100];
>   sprintf(path, "/proc/self/fd/%d", fd);
>   int fd2 = open(path, O_RDWR);
>   if (fd2 == -1) err(1, "reopen");
>   printf("reopen successful: %d\n", fd2);
> }
> $ gcc -o memfd memfd.c
> $ ./memfd
> reopen successful: 4

Great catch and this is indeed an issue :-(. I verified it too.

> That aside: I wonder whether a better API would be something that
> allows you to create a new readonly file descriptor, instead of
> fiddling with the writability of an existing fd.

Android usecases cannot deal with a new fd number because it breaks the
continuity of having the same old fd, as Dan also pointed out.

Also, wouldn't such an API have the same issues you brought up?

thanks,

 - Joel



Re: [PATCH v3 resend 1/2] mm: Add an F_SEAL_FUTURE_WRITE seal to memfd

2018-11-09 Thread Joel Fernandes
On Wed, Nov 07, 2018 at 08:15:36PM -0800, Joel Fernandes (Google) wrote:
> Android uses ashmem for sharing memory regions. We are looking forward
> to migrating all usecases of ashmem to memfd so that we can possibly
> remove the ashmem driver in the future from staging while also
> benefiting from using memfd and contributing to it. Note staging drivers
> are also not ABI and generally can be removed at any time.
> 
> One of the main usecases Android has is the ability to create a region
> and mmap it as writeable, then add protection against making any
> "future" writes while keeping the existing already mmap'ed
> writeable-region active.  This allows us to implement a usecase where
> receivers of the shared memory buffer can get a read-only view, while
> the sender continues to write to the buffer.
> See CursorWindow documentation in Android for more details:
> https://developer.android.com/reference/android/database/CursorWindow
> 
> This usecase cannot be implemented with the existing F_SEAL_WRITE seal.
> To support the usecase, this patch adds a new F_SEAL_FUTURE_WRITE seal
> which prevents any future mmap and write syscalls from succeeding while
> keeping the existing mmap active. The following program shows the seal
> working in action:
> 
[...] 
> The output of running this program is as follows:
> ret=3
> map 0 passed
> write passed
> map 1 prot-write passed as expected
> future-write seal now active
> write failed as expected due to future-write seal
> map 2 prot-write failed as expected due to seal
> : Permission denied
> map 3 prot-read passed as expected
> 
> Cc: jr...@google.com
> Cc: john.stu...@linaro.org
> Cc: tk...@google.com
> Cc: gre...@linuxfoundation.org
> Cc: h...@infradead.org
> Reviewed-by: John Stultz 
> Signed-off-by: Joel Fernandes (Google) 
> ---
> v1->v2: No change, just added selftests to the series. manpages are
> ready and I'll submit them once the patches are accepted.
> 
> v2->v3: Updated commit message to have more support code (John Stultz)
>   Renamed seal from F_SEAL_FS_WRITE to F_SEAL_FUTURE_WRITE
>   (Christoph Hellwig)
>   Allow for this seal only if grow/shrink seals are also
>   either previous set, or are requested along with this seal.
>   (Christoph Hellwig)
>   Added locking to synchronize access to file->f_mode.
>   (Christoph Hellwig)


Christoph, do the patches look Ok to you now? If so, then could you give an
Acked-by or Reviewed-by tag?

Thanks a lot,

 - Joel


>  include/uapi/linux/fcntl.h |  1 +
>  mm/memfd.c | 22 +-
>  2 files changed, 22 insertions(+), 1 deletion(-)
> 
> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
> index 6448cdd9a350..a2f8658f1c55 100644
> --- a/include/uapi/linux/fcntl.h
> +++ b/include/uapi/linux/fcntl.h
> @@ -41,6 +41,7 @@
>  #define F_SEAL_SHRINK 0x0002  /* prevent file from shrinking */
>  #define F_SEAL_GROW  0x0004  /* prevent file from growing */
>  #define F_SEAL_WRITE 0x0008  /* prevent writes */
> +#define F_SEAL_FUTURE_WRITE  0x0010  /* prevent future writes while mapped */
>  /* (1U << 31) is reserved for signed error codes */
>  
>  /*
> diff --git a/mm/memfd.c b/mm/memfd.c
> index 2bb5e257080e..5ba9804e9515 100644
> --- a/mm/memfd.c
> +++ b/mm/memfd.c
> @@ -150,7 +150,8 @@ static unsigned int *memfd_file_seals_ptr(struct file 
> *file)
>  #define F_ALL_SEALS (F_SEAL_SEAL | \
>F_SEAL_SHRINK | \
>F_SEAL_GROW | \
> -  F_SEAL_WRITE)
> +  F_SEAL_WRITE | \
> +  F_SEAL_FUTURE_WRITE)
>  
>  static int memfd_add_seals(struct file *file, unsigned int seals)
>  {
> @@ -219,6 +220,25 @@ static int memfd_add_seals(struct file *file, unsigned 
> int seals)
>   }
>   }
>  
> + if ((seals & F_SEAL_FUTURE_WRITE) &&
> + !(*file_seals & F_SEAL_FUTURE_WRITE)) {
> + /*
> +  * The FUTURE_WRITE seal also prevents growing and shrinking
> +  * so we need them to be already set, or requested now.
> +  */
> + int test_seals = (seals | *file_seals) &
> +  (F_SEAL_GROW | F_SEAL_SHRINK);
> +
> + if (test_seals != (F_SEAL_GROW | F_SEAL_SHRINK)) {
> + error = -EINVAL;
> + goto unlock;
> + }
> +
> + spin_lock(&file->f_lock);
> + file->f_mode &= ~(FMODE_WRITE | FMODE_PWRITE);
> + spin_unlock(&file->f_lock);
> + }
> +
>   *file_seals |= seals;
>   error = 0;
>  
> -- 
> 2.19.1.930.g4563a0d9d0-goog


[PATCH v3 resend 1/2] mm: Add an F_SEAL_FUTURE_WRITE seal to memfd

2018-11-07 Thread Joel Fernandes (Google)
Android uses ashmem for sharing memory regions. We are looking forward
to migrating all usecases of ashmem to memfd so that we can possibly
remove the ashmem driver in the future from staging while also
benefiting from using memfd and contributing to it. Note staging drivers
are also not ABI and generally can be removed at any time.

One of the main usecases Android has is the ability to create a region
and mmap it as writeable, then add protection against making any
"future" writes while keeping the existing already mmap'ed
writeable-region active.  This allows us to implement a usecase where
receivers of the shared memory buffer can get a read-only view, while
the sender continues to write to the buffer.
See CursorWindow documentation in Android for more details:
https://developer.android.com/reference/android/database/CursorWindow

This usecase cannot be implemented with the existing F_SEAL_WRITE seal.
To support the usecase, this patch adds a new F_SEAL_FUTURE_WRITE seal
which prevents any future mmap and write syscalls from succeeding while
keeping the existing mmap active. The following program shows the seal
working in action:

 #include <stdio.h>
 #include <errno.h>
 #include <unistd.h>
 #include <linux/fcntl.h>
 #include <linux/memfd.h>
 #include <sys/mman.h>
 #include <sys/syscall.h>
 #define F_SEAL_FUTURE_WRITE 0x0010
 #define REGION_SIZE (5 * 1024 * 1024)

int memfd_create_region(const char *name, size_t size)
{
int ret;
int fd = syscall(__NR_memfd_create, name, MFD_ALLOW_SEALING);
if (fd < 0) return fd;
ret = ftruncate(fd, size);
if (ret < 0) { close(fd); return ret; }
return fd;
}

int main() {
int ret, fd;
void *addr, *addr2, *addr3, *addr1;
ret = memfd_create_region("test_region", REGION_SIZE);
printf("ret=%d\n", ret);
fd = ret;

// Create map
addr = mmap(0, REGION_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
if (addr == MAP_FAILED)
printf("map 0 failed\n");
else
printf("map 0 passed\n");

if ((ret = write(fd, "test", 4)) != 4)
printf("write failed even though no future-write seal "
   "(ret=%d errno =%d)\n", ret, errno);
else
printf("write passed\n");

addr1 = mmap(0, REGION_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
if (addr1 == MAP_FAILED)
perror("map 1 prot-write failed even though no seal\n");
else
printf("map 1 prot-write passed as expected\n");

ret = fcntl(fd, F_ADD_SEALS, F_SEAL_FUTURE_WRITE |
 F_SEAL_GROW |
 F_SEAL_SHRINK);
if (ret == -1)
printf("fcntl failed, errno: %d\n", errno);
else
printf("future-write seal now active\n");

if ((ret = write(fd, "test", 4)) != 4)
printf("write failed as expected due to future-write seal\n");
else
printf("write passed (unexpected)\n");

addr2 = mmap(0, REGION_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
if (addr2 == MAP_FAILED)
perror("map 2 prot-write failed as expected due to seal\n");
else
printf("map 2 passed\n");

addr3 = mmap(0, REGION_SIZE, PROT_READ, MAP_SHARED, fd, 0);
if (addr3 == MAP_FAILED)
perror("map 3 failed\n");
else
printf("map 3 prot-read passed as expected\n");
}

The output of running this program is as follows:
ret=3
map 0 passed
write passed
map 1 prot-write passed as expected
future-write seal now active
write failed as expected due to future-write seal
map 2 prot-write failed as expected due to seal
: Permission denied
map 3 prot-read passed as expected

Cc: jr...@google.com
Cc: john.stu...@linaro.org
Cc: tk...@google.com
Cc: gre...@linuxfoundation.org
Cc: h...@infradead.org
Reviewed-by: John Stultz 
Signed-off-by: Joel Fernandes (Google) 
---
v1->v2: No change, just added selftests to the series. manpages are
ready and I'll submit them once the patches are accepted.

v2->v3: Updated commit message to have more support code (John Stultz)
Renamed seal from F_SEAL_FS_WRITE to F_SEAL_FUTURE_WRITE
(Christoph Hellwig)
Allow for this seal only if grow/shrink seals are also
either previously set, or requested along with this seal.
(Christoph Hellwig)
Added locking to synchronize access to file->f_mode.
(Christoph Hellwig)

 include/uapi/linux/fcntl.h |  1 +
 mm/memfd.c | 22 +-
 2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 6448cdd9a350..a2f8658f1c55 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -41,6 +41,7

[PATCH v3 resend 2/2] selftests/memfd: Add tests for F_SEAL_FUTURE_WRITE seal

2018-11-07 Thread Joel Fernandes (Google)
Add tests to verify sealing memfds with the F_SEAL_FUTURE_WRITE works as
expected.

Cc: dan...@google.com
Cc: minc...@kernel.org
Reviewed-by: John Stultz 
Signed-off-by: Joel Fernandes (Google) 
---
 tools/testing/selftests/memfd/memfd_test.c | 74 ++
 1 file changed, 74 insertions(+)

diff --git a/tools/testing/selftests/memfd/memfd_test.c 
b/tools/testing/selftests/memfd/memfd_test.c
index 10baa1652fc2..32b207ca7372 100644
--- a/tools/testing/selftests/memfd/memfd_test.c
+++ b/tools/testing/selftests/memfd/memfd_test.c
@@ -692,6 +692,79 @@ static void test_seal_write(void)
close(fd);
 }
 
+/*
+ * Test SEAL_FUTURE_WRITE
+ * Test whether SEAL_FUTURE_WRITE actually prevents modifications.
+ */
+static void test_seal_future_write(void)
+{
+   int fd;
+   void *p;
+
+   printf("%s SEAL-FUTURE-WRITE\n", memfd_str);
+
+   fd = mfd_assert_new("kern_memfd_seal_future_write",
+   mfd_def_size,
+   MFD_CLOEXEC | MFD_ALLOW_SEALING);
+
+   p = mfd_assert_mmap_shared(fd);
+
+   mfd_assert_has_seals(fd, 0);
+   /* Not adding grow/shrink seals makes the future write
+* seal fail to get added
+*/
+   mfd_fail_add_seals(fd, F_SEAL_FUTURE_WRITE);
+
+   mfd_assert_add_seals(fd, F_SEAL_GROW);
+   mfd_assert_has_seals(fd, F_SEAL_GROW);
+
+   /* Should still fail since shrink seal has
+* not yet been added
+*/
+   mfd_fail_add_seals(fd, F_SEAL_FUTURE_WRITE);
+
+   mfd_assert_add_seals(fd, F_SEAL_SHRINK);
+   mfd_assert_has_seals(fd, F_SEAL_GROW |
+F_SEAL_SHRINK);
+
+   /* Now should succeed, also verifies that the seal
+* could be added with an existing writable mmap
+*/
+   mfd_assert_add_seals(fd, F_SEAL_FUTURE_WRITE);
+   mfd_assert_has_seals(fd, F_SEAL_SHRINK |
+F_SEAL_GROW |
+F_SEAL_FUTURE_WRITE);
+
+   /* read should pass, writes should fail */
+   mfd_assert_read(fd);
+   mfd_fail_write(fd);
+
+   munmap(p, mfd_def_size);
+   close(fd);
+
+   /* Test adding all seals (grow, shrink, future write) at once */
+   fd = mfd_assert_new("kern_memfd_seal_future_write2",
+   mfd_def_size,
+   MFD_CLOEXEC | MFD_ALLOW_SEALING);
+
+   p = mfd_assert_mmap_shared(fd);
+
+   mfd_assert_has_seals(fd, 0);
+   mfd_assert_add_seals(fd, F_SEAL_SHRINK |
+F_SEAL_GROW |
+F_SEAL_FUTURE_WRITE);
+   mfd_assert_has_seals(fd, F_SEAL_SHRINK |
+F_SEAL_GROW |
+F_SEAL_FUTURE_WRITE);
+
+   /* read should pass, writes should fail */
+   mfd_assert_read(fd);
+   mfd_fail_write(fd);
+
+   munmap(p, mfd_def_size);
+   close(fd);
+}
+
 /*
  * Test SEAL_SHRINK
  * Test whether SEAL_SHRINK actually prevents shrinking
@@ -945,6 +1018,7 @@ int main(int argc, char **argv)
test_basic();
 
test_seal_write();
+   test_seal_future_write();
test_seal_shrink();
test_seal_grow();
test_seal_resize();
-- 
2.19.1.930.g4563a0d9d0-goog



Re: [PATCH v2] driver-staging: vsoc.c: Add sysfs support for examining the permissions of regions.

2018-11-06 Thread Joel Fernandes
On Tue, Nov 06, 2018 at 01:21:16PM +0800, Jerry Lin wrote:
> Add an attribute called permissions under the vsoc device node for examining
> current granted permissions in vsoc_device.
> 
> This file will display permissions in the following format:
>   begin_offset  end_offset  owner_offset  owned_value
>   %x  %x  %x  %x
> 
> Signed-off-by: Jerry Lin 

Please always post patches to the list with reviewers on CC. Also CC the
following addresses:
astrac...@google.com
ghart...@google.com (in fact you deleted a TODO he added, so do CC him)

And just one more minor nit below, otherwise LGTM.

thanks,

- Joel

> ---
>  drivers/staging/android/vsoc.c | 49 
> +++---
>  1 file changed, 46 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/staging/android/vsoc.c b/drivers/staging/android/vsoc.c
> index 22571ab..481554a 100644
> --- a/drivers/staging/android/vsoc.c
> +++ b/drivers/staging/android/vsoc.c
> @@ -128,9 +128,10 @@ struct vsoc_device {
>  
>  static struct vsoc_device vsoc_dev;
>  
> -/*
> - * TODO(ghartman): Add a /sys filesystem entry that summarizes the 
> permissions.
> - */
> +static ssize_t permissions_show(struct device *dev,
> + struct device_attribute *attr,
> + char *buf);
> +static DEVICE_ATTR_RO(permissions);
>  
>  struct fd_scoped_permission_node {
>   struct fd_scoped_permission permission;
> @@ -718,6 +719,38 @@ static ssize_t vsoc_write(struct file *filp, const char 
> __user *buffer,
>   return len;
>  }
>  
> +static ssize_t permissions_show(struct device *dev,
> + struct device_attribute *attr,
> + char *buffer)
> +{
> + struct fd_scoped_permission_node *node;
> + char *row;
> + int ret;
> + ssize_t written = 0;
> +
> + row = kzalloc(sizeof(char) * 128, GFP_KERNEL);
> + if (!row)
> + return 0;
> + mutex_lock(&vsoc_dev.mtx);
> + list_for_each_entry(node, &vsoc_dev.permissions, list) {
> + ret = snprintf(row, 128, "%x\t%x\t%x\t%x\n",
> +node->permission.begin_offset,
> +node->permission.end_offset,
> +node->permission.owner_offset,
> +node->permission.owned_value);
> + if (ret < 0)
> + goto done;
> + row[ret] = '\0';

Do you really need null termination after snprintf?

> + memcpy(buffer + written, row, ret);
> + written += ret;
> + }
> +
> +done:
> + mutex_unlock(&vsoc_dev.mtx);
> + kfree(row);
> + return written;
> +}
> +
>  static irqreturn_t vsoc_interrupt(int irq, void *region_data_v)
>  {
>   struct vsoc_region_data *region_data =
> @@ -942,6 +975,15 @@ static int vsoc_probe_device(struct pci_dev *pdev,
>   }
>   vsoc_dev.regions_data[i].device_created = true;
>   }
> + /*
> +  * Create permission attribute on device node.
> +  */
> + result = device_create_file(&pdev->dev, &dev_attr_permissions);
> + if (result) {
> + dev_err(&vsoc_dev.dev->dev, "device_create_file failed\n");
> + vsoc_remove_device(pdev);
> + return -EFAULT;
> + }
>   return 0;
>  }
>  
> @@ -967,6 +1009,7 @@ static void vsoc_remove_device(struct pci_dev *pdev)
>   if (!pdev || !vsoc_dev.dev)
>   return;
>   dev_info(>dev, "remove_device\n");
> + device_remove_file(&pdev->dev, &dev_attr_permissions);
>   if (vsoc_dev.regions_data) {
>   for (i = 0; i < vsoc_dev.layout->region_count; ++i) {
>   if (vsoc_dev.regions_data[i].device_created) {
> -- 
> 2.7.4
> 


Re: [PATCH 8/8] pstore/ram: Correctly calculate usable PRZ bytes

2018-11-05 Thread Joel Fernandes
On Mon, Nov 05, 2018 at 09:04:13AM -0800, Kees Cook wrote:
> On Sun, Nov 4, 2018 at 8:42 PM, Joel Fernandes  wrote:
> > Dumping the magic bytes of the non-decompressible .enc.z files, I get this
> > which shows a valid zlib compressed header:
> >
> > Something like:
> > 48 89 85 54 4d 6f 1a 31
> >
> > The 0b1000 in the first byte means it is "deflate". The file tool indeed
> > successfully shows "zlib compressed data" and I did the math for the header
> > and it is indeed valid. So I don't think the data is insane. The buffer has
> > enough room because even the very small dumps are not decompressible.
> 
> Interesting. So the kernel wouldn't decompress it even though it's the
> right algo and untruncated? That seems worth fixing.

I just can't reproduce the issue though :) I am wondering if it has something
to do with the compression having been done by an older or buggy version of
the algorithm, while the decompressor is newer. Or something.

Not sure if it's worth spending more time on, I'll park it for now unless you
want me to dig deeper into it.

- Joel



Re: [PATCH RFC v2 1/3] pstore: map pstore types to names

2018-11-04 Thread Joel Fernandes
On Sat, Nov 03, 2018 at 04:38:16PM -0700, Joel Fernandes (Google) wrote:
> In later patches we will need to map types to names, so create a table
> for that which can also be used and reused in different parts of old and
> new code. Also use it to save the type in the PRZ which will be useful
> in later patches.
> 
> Signed-off-by: Joel Fernandes (Google) 
> ---
>  fs/pstore/inode.c  | 53 +-
>  fs/pstore/ram.c|  4 ++-
>  include/linux/pstore.h | 37 ++
>  include/linux/pstore_ram.h |  2 ++
>  4 files changed, 48 insertions(+), 48 deletions(-)

The original patch had an unused variable warning; below is the updated one.
Sorry about that and thanks!

---8<---

From de981d893c5c09c62c7c6d5fc9fe73cfd99ffa48 Mon Sep 17 00:00:00 2001
From: "Joel Fernandes (Google)" 
Date: Sat, 3 Nov 2018 16:10:12 -0700
Subject: [PATCH RFC v3 1/3] pstore: map pstore types to names

In later patches we will need to map types to names, so create a table
for that which can also be used and reused in different parts of old and
new code. Also use it to save the type in the PRZ which will be useful
in later patches.

Signed-off-by: Joel Fernandes (Google) 
---
 fs/pstore/inode.c  | 52 --
 fs/pstore/ram.c|  4 ++-
 include/linux/pstore.h | 37 +++
 include/linux/pstore_ram.h |  2 ++
 4 files changed, 47 insertions(+), 48 deletions(-)

diff --git a/fs/pstore/inode.c b/fs/pstore/inode.c
index 8cf2218b46a7..4a259fd22df0 100644
--- a/fs/pstore/inode.c
+++ b/fs/pstore/inode.c
@@ -335,53 +335,11 @@ int pstore_mkfile(struct dentry *root, struct 
pstore_record *record)
goto fail_alloc;
private->record = record;
 
-   switch (record->type) {
-   case PSTORE_TYPE_DMESG:
-   scnprintf(name, sizeof(name), "dmesg-%s-%llu%s",
- record->psi->name, record->id,
- record->compressed ? ".enc.z" : "");
-   break;
-   case PSTORE_TYPE_CONSOLE:
-   scnprintf(name, sizeof(name), "console-%s-%llu",
- record->psi->name, record->id);
-   break;
-   case PSTORE_TYPE_FTRACE:
-   scnprintf(name, sizeof(name), "ftrace-%s-%llu",
- record->psi->name, record->id);
-   break;
-   case PSTORE_TYPE_MCE:
-   scnprintf(name, sizeof(name), "mce-%s-%llu",
- record->psi->name, record->id);
-   break;
-   case PSTORE_TYPE_PPC_RTAS:
-   scnprintf(name, sizeof(name), "rtas-%s-%llu",
- record->psi->name, record->id);
-   break;
-   case PSTORE_TYPE_PPC_OF:
-   scnprintf(name, sizeof(name), "powerpc-ofw-%s-%llu",
- record->psi->name, record->id);
-   break;
-   case PSTORE_TYPE_PPC_COMMON:
-   scnprintf(name, sizeof(name), "powerpc-common-%s-%llu",
- record->psi->name, record->id);
-   break;
-   case PSTORE_TYPE_PMSG:
-   scnprintf(name, sizeof(name), "pmsg-%s-%llu",
- record->psi->name, record->id);
-   break;
-   case PSTORE_TYPE_PPC_OPAL:
-   scnprintf(name, sizeof(name), "powerpc-opal-%s-%llu",
- record->psi->name, record->id);
-   break;
-   case PSTORE_TYPE_UNKNOWN:
-   scnprintf(name, sizeof(name), "unknown-%s-%llu",
- record->psi->name, record->id);
-   break;
-   default:
-   scnprintf(name, sizeof(name), "type%d-%s-%llu",
- record->type, record->psi->name, record->id);
-   break;
-   }
+   scnprintf(name, sizeof(name), "%s-%s-%llu%s",
+   pstore_type_to_name(record->type),
+   record->psi->name, record->id,
+   (record->type == PSTORE_TYPE_DMESG
+&& record->compressed) ? ".enc.z" : "");
 
dentry = d_alloc_name(root, name);
if (!dentry)
diff --git a/fs/pstore/ram.c b/fs/pstore/ram.c
index 10ac4d23c423..b174d0fc009f 100644
--- a/fs/pstore/ram.c
+++ b/fs/pstore/ram.c
@@ -611,6 +611,7 @@ static int ramoops_init_przs(const char *name,
goto fail;
}
*paddr += zone_sz;
+   prz_ar[i]->type = pstore_name_to_type(name);
}
 
*przs = prz_ar;
@@ -650,6
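
The diff is truncated here by the archive before the include/linux/pstore.h
hunk. As a rough sketch only -- helper names taken from the call sites
above, the actual table in the patch may differ -- the mapping could look
like:

struct pstore_type_name {
	enum pstore_type_id type;
	const char *name;
};

static const struct pstore_type_name pstore_type_names[] = {
	{ PSTORE_TYPE_DMESG,		"dmesg" },
	{ PSTORE_TYPE_MCE,		"mce" },
	{ PSTORE_TYPE_CONSOLE,		"console" },
	{ PSTORE_TYPE_FTRACE,		"ftrace" },
	{ PSTORE_TYPE_PPC_RTAS,		"rtas" },
	{ PSTORE_TYPE_PPC_OF,		"powerpc-ofw" },
	{ PSTORE_TYPE_PPC_COMMON,	"powerpc-common" },
	{ PSTORE_TYPE_PMSG,		"pmsg" },
	{ PSTORE_TYPE_PPC_OPAL,		"powerpc-opal" },
};

static inline const char *pstore_type_to_name(enum pstore_type_id type)
{
	int i;

	/* Linear scan; the table is tiny and this is not a hot path. */
	for (i = 0; i < ARRAY_SIZE(pstore_type_names); i++)
		if (pstore_type_names[i].type == type)
			return pstore_type_names[i].name;
	return "unknown";
}

static inline enum pstore_type_id pstore_name_to_type(const char *name)
{
	int i;

	for (i = 0; i < ARRAY_SIZE(pstore_type_names); i++)
		if (!strcmp(pstore_type_names[i].name, name))
			return pstore_type_names[i].type;
	return PSTORE_TYPE_UNKNOWN;
}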

Re: [RFC] doc: rcu: remove note on smp_mb during synchronize_rcu

2018-11-04 Thread Joel Fernandes
On Sun, Nov 04, 2018 at 07:43:30PM -0800, Paul E. McKenney wrote:
[...]
> > > > > > > > Also about GP memory ordering and RCU-tree-locking, I think you 
> > > > > > > > mentioned to
> > > > > > > > me that the RCU reader-sections are virtually extended both 
> > > > > > > > forward and
> > > > > > > > backward and whereever it ends, those paths do heavy-weight 
> > > > > > > > synchronization
> > > > > > > > that should be sufficient to prevent memory ordering issues 
> > > > > > > > (such as those
> > > > > > > > you mentioned in the Requierments document). That is exactly 
> > > > > > > > why we don't
> > > > > > > > need explicit barriers during rcu_read_unlock. If I recall I 
> > > > > > > > asked you why
> > > > > > > > those are not needed. So that answer made sense, but then now 
> > > > > > > > on going
> > > > > > > > through the 'Memory Ordering' document, I see that you 
> > > > > > > > mentioned there is
> > > > > > > > reliance on the locking. Is that reliance on locking necessary 
> > > > > > > > to maintain
> > > > > > > > ordering then?
> > > > > > > 
> > > > > > > There is a "network" of locking augmented by 
> > > > > > > smp_mb__after_unlock_lock()
> > > > > > > that implements the all-to-all memory ordering mentioned above.  
> > > > > > > But it
> > > > > > > also needs to handle all the possible 
> > > > > > > complete()/wait_for_completion()
> > > > > > > races, even those assisted by hypervisor vCPU preemption.
> > > > > > 
> > > > > > I see, so it sounds like the lock network is just a partial 
> > > > > > solution. For
> > > > > > some reason I thought before that complete() was even called on the 
> > > > > > CPU
> > > > > > executing the callback, all the CPUs would have acquired and 
> > > > > > released a lock
> > > > > > in the "lock network" atleast once thus ensuring the ordering (due 
> > > > > > to the
> > > > > > fact that the quiescent state reporting has to travel up the tree 
> > > > > > starting
> > > > > > from the leaves), but I think that's not necessarily true so I see 
> > > > > > your point
> > > > > > now.
> > > > > 
> > > > > There is indeed a lock that is unconditionally acquired and released 
> > > > > by
> > > > > wait_for_completion(), but it lacks the smp_mb__after_unlock_lock() 
> > > > > that
> > > > > is required to get full-up any-to-any ordering.  And unfortunate 
> > > > > timing
> > > > > (as well as spurious wakeups) allow the interaction to have only 
> > > > > normal
> > > > > lock-release/acquire ordering, which does not suffice in all cases.
> > > > 
> > > > Sorry to be so persistent, but I did spend some time on this and I still
> > > > don't get why every CPU would _not_ have executed 
> > > > smp_mb__after_unlock_lock at least
> > > > once before the wait_for_completion() returns, because every CPU should 
> > > > have
> > > > atleast called rcu_report_qs_rdp() -> rcu_report_qs_rnp() atleast once 
> > > > to
> > > > report its QS up the tree right?. Before that procedure, the complete()
> > > > cannot happen because the complete() itself is in an RCU callback which 
> > > > is
> > > > executed only once all the QS(s) have been reported.
> > > > 
> > > > So I still couldn't see how the synchronize_rcu can return without the
> > > > rcu_report_qs_rnp called atleast once on the CPU reporting its QS 
> > > > during a
> > > > grace period.
> > > > 
> > > > Would it be possible to provide a small example showing this in least 
> > > > number
> > > > of steps? I appreciate your time and it would be really helpful. If you 
> > > > feel
> > > > its too complicated, then feel free to keep this for LPC discussion :)
> > > 
> > > The key point is that "at least once" does not suffice, other than for the
> > > CPU that detects the end of the grace period.  The rest of the CPUs must
> > > do at least -two- full barriers, which could of course be either smp_mb()
> > > on the one hand or smp_mb__after_unlock_lock() after a lock on the other.
> > 
> > I thought I'll atleast get an understanding of the "atleast two full
> > barriers" point and ask you any questions at LPC, because that's what I'm
> > missing I think. Trying to understand what can go wrong without two full
> > barriers. I'm sure an RCU implementation BoF could really help in this regard.
> > 
> > I guess its also documented somewhere in Tree-RCU-Memory-Ordering.html but a
> > quick search through that document didn't show a mention of the two full
> > barriers need.. I think its also a great idea for us to document it there
> > and/or discuss it during the conference.
> > 
> > I went through the litmus test here for some hints on the two-barriers but
> > couldn't find any:
> > https://lkml.org/lkml/2017/10/5/636
> > 
> > At least this commit made me think no extra memory barrier is needed for
> > tree RCU:  :-\
> > https://lore.kernel.org/patchwork/patch/813386/
> > 
> > I'm sure your last email will be useful to me in the future once I can make
> > more sense of the ordering and the need for two 

Re: [PATCH 8/8] pstore/ram: Correctly calculate usable PRZ bytes

2018-11-04 Thread Joel Fernandes
Hi Kees,

On Fri, Nov 02, 2018 at 01:00:08PM -0700, Kees Cook wrote:
[..] 
> >> This corruption was visible with "ramoops.mem_size=204800 ramoops.ecc=1".
> >> Any stored crashes could not be decompressed (producing a pstorefs
> >> "dmesg-*.enc.z" file), triggering errors at boot:
> >>
> >>   [2.790759] pstore: crypto_comp_decompress failed, ret = -22!
> >>
> >> Reported-by: Joel Fernandes 
> >> Fixes: b0aad7a99c1d ("pstore: Add compression support to pstore")
> >> Signed-off-by: Kees Cook 
> >
> > Thanks!
> > Reviewed-by: Joel Fernandes (Google) 
> 
> Thanks!
> 
> > Also should this be fixed for other backends or are those good? AFAIR, I saw
> > this for EFI too.
> 
> It seemed like the other backends were doing it correctly (e.g. erst
> removes the header from calculation, etc). I did see that EFI
> allocates more memory than needed?
> 
> efi_pstore_info.buf = kmalloc(4096, GFP_KERNEL);
> if (!efi_pstore_info.buf)
> return -ENOMEM;
> 
> efi_pstore_info.bufsize = 1024;
> 
> efi_pstore_write() does:
> 
> ret = efivar_entry_set_safe(efi_name, vendor, PSTORE_EFI_ATTRIBUTES,
>   !pstore_cannot_block_path(record->reason),
>   record->size, record->psi->buf);
> 
> and efivar_entry_set_safe() says:
> 
>  * Returns 0 on success, -ENOSPC if the firmware does not have enough
>  * space for set_variable() to succeed, or a converted EFI status code
>  * if set_variable() fails.
> 
> So I don't see how this could get truncated. (I'm not saying it
> didn't... just that I can't see it in an obvious place.)


So I *think* the issue is that pstore had old compressed dmesg dumps in
EFI on my laptop, and the crypto layer in the kernel has since changed
enough to make the data non-decompressible, if that makes any sense. So older
code did the compression in a certain way, and newer code is doing the
decompression, or something like that.

I did some sysrq crashes on my laptop and the deflate decompress is working
fine with pstore+EFI. It's interesting that I see some .enc.z files which
fail to decompress (the older ones), and others which decompress fine
(the newer ones) ;-)

Dumping the magic bytes of the non-decompressible .enc.z files, I get this
which shows a valid zlib compressed header:

Something like:
48 89 85 54 4d 6f 1a 31

The 0b1000 in the first byte means it is "deflate". The file tool indeed
successfully shows "zlib compressed data" and I did the math for the header
and it is indeed valid. So I don't think the data is insane. The buffer has
enough room because even the very small dumps are not decompressible.
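
For reference, the header math can be checked with a few lines of userspace
C (a sketch of my own, using the two bytes dumped above):

#include <stdio.h>

int main(void)
{
	unsigned char cmf = 0x48, flg = 0x89;	/* first two dump bytes */

	/* RFC 1950: low nibble of CMF is the method, 8 == deflate */
	printf("method: %u\n", cmf & 0x0f);
	/* high nibble is CINFO: LZ77 window = 1 << (CINFO + 8) bytes */
	printf("window: %u bytes\n", 1u << ((cmf >> 4) + 8));
	/* FCHECK: CMF*256 + FLG must be a multiple of 31 */
	printf("header: %s\n", (((cmf << 8) | flg) % 31) ? "invalid" : "valid");
	return 0;
}

This prints method 8 (deflate), a 4096-byte window, and a valid check
byte (0x4889 == 18569 == 31 * 599), matching the math above.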

At this point we can park this issue I guess, but a scenario that is still
broken is:
Say someone crashes the system on compress algo X and then recompiles with
compress algo Y, then the decompress would fail, no?

One way to fix that is to store the compression method in the buffer as well,
then initialize all algorithms at boot and pick the right one for the buffer.
Otherwise we should at least print a message saying "buffer is encoded with
algo X but compression selected is Y" or something. But I agree it's a very
low priority "doctor it hurts if I do this" kind of issue :)
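
A minimal sketch of what tagging the buffer could look like -- entirely
hypothetical, pstore stores no such header today and these names are made
up for illustration:

/* Hypothetical prefix identifying the compressor used at crash time. */
struct pstore_comp_hdr {
	__le32 magic;		/* marks records that carry this header */
	char   algo[8];		/* crypto_comp name: "deflate", "lz4", ... */
} __packed;

/* At read time: warn instead of silently failing on a mismatch. */
static bool pstore_comp_matches(const struct pstore_comp_hdr *hdr,
				const char *selected)
{
	if (strncmp(hdr->algo, selected, sizeof(hdr->algo))) {
		pr_warn("buffer encoded with algo %.8s but compression selected is %s\n",
			hdr->algo, selected);
		return false;
	}
	return true;
}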

Anyway, let me know what you think :)

thanks,

- Joel



Re: [RFC] doc: rcu: remove note on smp_mb during synchronize_rcu

2018-11-03 Thread Joel Fernandes
On Sat, Nov 03, 2018 at 04:22:59PM -0700, Paul E. McKenney wrote:
> On Fri, Nov 02, 2018 at 10:12:26PM -0700, Joel Fernandes wrote:
> > On Thu, Nov 01, 2018 at 09:13:07AM -0700, Paul E. McKenney wrote:
> > > On Wed, Oct 31, 2018 at 10:00:19PM -0700, Joel Fernandes wrote:
> > > > On Wed, Oct 31, 2018 at 11:17:48AM -0700, Paul E. McKenney wrote:
> > > > > On Tue, Oct 30, 2018 at 06:11:19PM -0700, Joel Fernandes wrote:
> > > > > > Hi Paul,
> > > > > > 
> > > > > > On Tue, Oct 30, 2018 at 04:43:36PM -0700, Paul E. McKenney wrote:
> > > > > > > On Tue, Oct 30, 2018 at 03:26:49PM -0700, Joel Fernandes wrote:
> > > > > > > > Hi Paul,
> > > > > > > > 
> > > > > > > > On Sat, Oct 27, 2018 at 09:30:46PM -0700, Joel Fernandes 
> > > > > > > > (Google) wrote:
> > > > > > > > > As per this thread [1], it seems this smp_mb isn't needed 
> > > > > > > > > anymore:
> > > > > > > > > "So the smp_mb() that I was trying to add doesn't need to be 
> > > > > > > > > there."
> > > > > > > > > 
> > > > > > > > > So let us remove this part from the memory ordering 
> > > > > > > > > documentation.
> > > > > > > > > 
> > > > > > > > > [1] https://lkml.org/lkml/2017/10/6/707
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Joel Fernandes (Google) 
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > I was just checking about this patch. Do you feel it is correct 
> > > > > > > > to remove
> > > > > > > > this part from the docs? Are you satisified that a barrier 
> > > > > > > > isn't needed there
> > > > > > > > now? Or did I miss something?
> > > > > > > 
> > > > > > > Apologies, it got lost in the shuffle.  I have now applied it 
> > > > > > > with a
> > > > > > > bit of rework to the commit log, thank you!
> > > > > > 
> > > > > > No worries, thanks for taking it!
> > > > > > 
> > > > > > Just wanted to update you on my progress reading/correcting the 
> > > > > > docs. The
> > > > > > 'Memory Ordering' is taking a bit of time so I paused that and I'm 
> > > > > > focusing
> > > > > > on finishing all the other low hanging fruit. This activity is 
> > > > > > mostly during
> > > > > > night hours after the baby is asleep but some times I also manage 
> > > > > > to sneak it
> > > > > > into the day job ;-)
> > > > > 
> > > > > If there is anything I can do to make this a more sustainable task for
> > > > > you, please do not keep it a secret!!!
> > > > 
> > > > Thanks a lot, that means a lot to me! Will do!
> > > > 
> > > > > > BTW I do want to discuss about this smp_mb patch above with you at 
> > > > > > LPC if you
> > > > > > had time, even though we are removing it from the documentation. I 
> > > > > > thought
> > > > > > about it a few times, and I was not able to fully appreciate the 
> > > > > > need for the
> > > > > > barrier (that is even assuming that complete() etc did not do the 
> > > > > > right
> > > > > > thing).  Specifically I was wondering same thing Peter said in the 
> > > > > > above
> > > > > > thread I think that - if that rcu_read_unlock() triggered all the 
> > > > > > spin
> > > > > > locking up the tree of nodes, then why is that locking not 
> > > > > > sufficient to
> > > > > > prevent reads from the read-side section from bleeding out? That 
> > > > > > would
> > > > > > prevent the reader that just unlocked from seeing anything that 
> > > > > > happens
> > > > > > _after_ the synchronize_rcu.
> > > > > 
> > > > > Actually, I recall an smp_mb() being added, but am not seeing it 
> > > > > anywhere
> > > > > relevant to wait_for_completion().  

[PATCH RFC v2 2/3] pstore: simplify ramoops_get_next_prz arguments

2018-11-03 Thread Joel Fernandes (Google)
(1) remove type argument from ramoops_get_next_prz

Since we store the type of the prz when we initialize it, we no longer
need to pass it again in ramoops_get_next_prz since we can just use that
to set up the pstore record. So let's remove it from the argument list.

(2) remove max argument from ramoops_get_next_prz

From the code flow, the 'max' checks are already being done on the prz
passed to ramoops_get_next_prz. Let's remove it to simplify this function
and reduce its arguments.

(3) further reduce ramoops_get_next_prz arguments by passing record

Both the id and type fields of a pstore_record are set by
ramoops_get_next_prz. So we can just pass a pointer to the pstore_record
instead of passing individual elements. This results in cleaner more
readable code and fewer lines.

In addition let's also remove the 'update' argument since we can detect
that. Changes are squashed into a single patch to reduce fixup conflicts.

Signed-off-by: Joel Fernandes (Google) 
---
 fs/pstore/ram.c | 48 ++--
 1 file changed, 18 insertions(+), 30 deletions(-)

diff --git a/fs/pstore/ram.c b/fs/pstore/ram.c
index b174d0fc009f..202eaa82bcc6 100644
--- a/fs/pstore/ram.c
+++ b/fs/pstore/ram.c
@@ -124,19 +124,17 @@ static int ramoops_pstore_open(struct pstore_info *psi)
 }
 
 static struct persistent_ram_zone *
-ramoops_get_next_prz(struct persistent_ram_zone *przs[], uint *c, uint max,
-u64 *id,
-enum pstore_type_id *typep, enum pstore_type_id type,
-bool update)
+ramoops_get_next_prz(struct persistent_ram_zone *przs[], int id,
+struct pstore_record *record)
 {
struct persistent_ram_zone *prz;
-   int i = (*c)++;
+   bool update = (record->type == PSTORE_TYPE_DMESG);
 
/* Give up if we never existed or have hit the end. */
-   if (!przs || i >= max)
+   if (!przs)
return NULL;
 
-   prz = przs[i];
+   prz = przs[id];
if (!prz)
return NULL;
 
@@ -147,8 +145,8 @@ ramoops_get_next_prz(struct persistent_ram_zone *przs[], 
uint *c, uint max,
if (!persistent_ram_old_size(prz))
return NULL;
 
-   *typep = type;
-   *id = i;
+   record->type = prz->type;
+   record->id = id;
 
return prz;
 }
@@ -255,10 +253,8 @@ static ssize_t ramoops_pstore_read(struct pstore_record 
*record)
 
/* Find the next valid persistent_ram_zone for DMESG */
while (cxt->dump_read_cnt < cxt->max_dump_cnt && !prz) {
-   prz = ramoops_get_next_prz(cxt->dprzs, &cxt->dump_read_cnt,
-  cxt->max_dump_cnt, &record->id,
-  &record->type,
-  PSTORE_TYPE_DMESG, 1);
+   prz = ramoops_get_next_prz(cxt->dprzs, cxt->dump_read_cnt++,
+  record);
if (!prz_ok(prz))
continue;
header_length = ramoops_read_kmsg_hdr(persistent_ram_old(prz),
@@ -272,22 +268,18 @@ static ssize_t ramoops_pstore_read(struct pstore_record 
*record)
}
}
 
-   if (!prz_ok(prz))
-   prz = ramoops_get_next_prz(&cxt->cprz, &cxt->console_read_cnt,
-  1, &record->id, &record->type,
-  PSTORE_TYPE_CONSOLE, 0);
+   if (!prz_ok(prz) && !cxt->console_read_cnt++)
+   prz = ramoops_get_next_prz(&cxt->cprz, 0 /* single */, record);
 
-   if (!prz_ok(prz))
-   prz = ramoops_get_next_prz(&cxt->mprz, &cxt->pmsg_read_cnt,
-  1, &record->id, &record->type,
-  PSTORE_TYPE_PMSG, 0);
+   if (!prz_ok(prz) && !cxt->pmsg_read_cnt++)
+   prz = ramoops_get_next_prz(&cxt->mprz, 0 /* single */, record);
 
/* ftrace is last since it may want to dynamically allocate memory. */
if (!prz_ok(prz)) {
-   if (!(cxt->flags & RAMOOPS_FLAG_FTRACE_PER_CPU)) {
-   prz = ramoops_get_next_prz(cxt->fprzs,
-   &cxt->ftrace_read_cnt, 1, &record->id,
-   &record->type, PSTORE_TYPE_FTRACE, 0);
+   if (!(cxt->flags & RAMOOPS_FLAG_FTRACE_PER_CPU) &&
+   !cxt->ftrace_read_cnt++) {
+   prz = ramoops_get_next_prz(cxt->fprzs, 0 /* single */,
+  record);
} else {
/*
 * Build a new dummy record which combines all the
@@ -303,11 +295,7 @@ static ssize_t ramoops_pstore_read(struct pstore_record 
*record)
 
while (cxt->ftrace_read_cnt < cxt->m

[PATCH RFC v2 0/3] cleanups for pstore and ramoops

2018-11-03 Thread Joel Fernandes (Google)
Here are some simple cleanups and fixes for ramoops in pstore. Let me know
what you think, thanks.

Joel Fernandes (Google) (3):
pstore: map pstore types to names
pstore: simplify ramoops_get_next_prz arguments
pstore: do not treat empty buffers as valid

fs/pstore/inode.c  | 53 +-
fs/pstore/ram.c| 52 +++--
fs/pstore/ram_core.c   |  2 +-
include/linux/pstore.h | 37 ++
include/linux/pstore_ram.h |  2 ++
5 files changed, 67 insertions(+), 79 deletions(-)

--
2.19.1.930.g4563a0d9d0-goog



[PATCH RFC v2 1/3] pstore: map pstore types to names

2018-11-03 Thread Joel Fernandes (Google)
In later patches we will need to map types to names, so create a table
for that which can also be used and reused in different parts of old and
new code. Also use it to save the type in the PRZ which will be useful
in later patches.

Signed-off-by: Joel Fernandes (Google) 
---
 fs/pstore/inode.c  | 53 +-
 fs/pstore/ram.c|  4 ++-
 include/linux/pstore.h | 37 ++
 include/linux/pstore_ram.h |  2 ++
 4 files changed, 48 insertions(+), 48 deletions(-)

diff --git a/fs/pstore/inode.c b/fs/pstore/inode.c
index 8cf2218b46a7..c5c6b8b4b70a 100644
--- a/fs/pstore/inode.c
+++ b/fs/pstore/inode.c
@@ -304,6 +304,7 @@ int pstore_mkfile(struct dentry *root, struct pstore_record 
*record)
struct dentry   *dentry;
struct inode*inode;
int rc = 0;
+   enum pstore_type_id type;
charname[PSTORE_NAMELEN];
struct pstore_private   *private, *pos;
unsigned long   flags;
@@ -335,53 +336,11 @@ int pstore_mkfile(struct dentry *root, struct 
pstore_record *record)
goto fail_alloc;
private->record = record;
 
-   switch (record->type) {
-   case PSTORE_TYPE_DMESG:
-   scnprintf(name, sizeof(name), "dmesg-%s-%llu%s",
- record->psi->name, record->id,
- record->compressed ? ".enc.z" : "");
-   break;
-   case PSTORE_TYPE_CONSOLE:
-   scnprintf(name, sizeof(name), "console-%s-%llu",
- record->psi->name, record->id);
-   break;
-   case PSTORE_TYPE_FTRACE:
-   scnprintf(name, sizeof(name), "ftrace-%s-%llu",
- record->psi->name, record->id);
-   break;
-   case PSTORE_TYPE_MCE:
-   scnprintf(name, sizeof(name), "mce-%s-%llu",
- record->psi->name, record->id);
-   break;
-   case PSTORE_TYPE_PPC_RTAS:
-   scnprintf(name, sizeof(name), "rtas-%s-%llu",
- record->psi->name, record->id);
-   break;
-   case PSTORE_TYPE_PPC_OF:
-   scnprintf(name, sizeof(name), "powerpc-ofw-%s-%llu",
- record->psi->name, record->id);
-   break;
-   case PSTORE_TYPE_PPC_COMMON:
-   scnprintf(name, sizeof(name), "powerpc-common-%s-%llu",
- record->psi->name, record->id);
-   break;
-   case PSTORE_TYPE_PMSG:
-   scnprintf(name, sizeof(name), "pmsg-%s-%llu",
- record->psi->name, record->id);
-   break;
-   case PSTORE_TYPE_PPC_OPAL:
-   scnprintf(name, sizeof(name), "powerpc-opal-%s-%llu",
- record->psi->name, record->id);
-   break;
-   case PSTORE_TYPE_UNKNOWN:
-   scnprintf(name, sizeof(name), "unknown-%s-%llu",
- record->psi->name, record->id);
-   break;
-   default:
-   scnprintf(name, sizeof(name), "type%d-%s-%llu",
- record->type, record->psi->name, record->id);
-   break;
-   }
+   scnprintf(name, sizeof(name), "%s-%s-%llu%s",
+   pstore_type_to_name(record->type),
+   record->psi->name, record->id,
+   (record->type == PSTORE_TYPE_DMESG
+&& record->compressed) ? ".enc.z" : "");
 
dentry = d_alloc_name(root, name);
if (!dentry)
diff --git a/fs/pstore/ram.c b/fs/pstore/ram.c
index 10ac4d23c423..b174d0fc009f 100644
--- a/fs/pstore/ram.c
+++ b/fs/pstore/ram.c
@@ -611,6 +611,7 @@ static int ramoops_init_przs(const char *name,
goto fail;
}
*paddr += zone_sz;
+   prz_ar[i]->type = pstore_name_to_type(name);
}
 
*przs = prz_ar;
@@ -650,6 +651,7 @@ static int ramoops_init_prz(const char *name,
}
 
*paddr += sz;
+   (*prz)->type = pstore_name_to_type(name);
 
return 0;
 }
@@ -785,7 +787,7 @@ static int ramoops_probe(struct platform_device *pdev)
 
dump_mem_sz = cxt->size - cxt->console_size - cxt->ftrace_size
- cxt->pmsg_size;
-   err = ramoops_init_przs("dump", dev, cxt, >dprzs, ,
+   err = ramoops_init_przs("dmesg", dev, cxt, >dprzs, ,
dump_mem_sz, cxt->record_size,
>

[PATCH RFC v2 3/3] pstore: do not treat empty buffers as valid

2018-11-03 Thread Joel Fernandes (Google)
pstore currently calls persistent_ram_save_old even if a buffer is
empty. While this appears to work, it does not seem like the right
thing to do and could lead to future bugs, so let's avoid that. It also
prevents misleading prints in the logs which claim the buffer is valid.

I got something like:
found existing buffer, size 0, start 0

When I was expecting:
no valid data in buffer (sig = ...)

Signed-off-by: Joel Fernandes (Google) 
---
Note that if you feel this patch is not necessary, then feel free to
drop it. I would say it is harmless and a good cleanup.

 fs/pstore/ram_core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/pstore/ram_core.c b/fs/pstore/ram_core.c
index e6375439c5ac..196e4fd7ba8c 100644
--- a/fs/pstore/ram_core.c
+++ b/fs/pstore/ram_core.c
@@ -510,7 +510,7 @@ static int persistent_ram_post_init(struct 
persistent_ram_zone *prz, u32 sig,
 
sig ^= PERSISTENT_RAM_SIG;
 
-   if (prz->buffer->sig == sig) {
+   if (prz->buffer->sig == sig && buffer_size(prz)) {
if (buffer_size(prz) > prz->buffer_size ||
buffer_start(prz) > buffer_size(prz)) {
pr_info("found existing invalid buffer, size %zu, start 
%zu\n",
-- 
2.19.1.930.g4563a0d9d0-goog


Re: [RFC] doc: rcu: remove note on smp_mb during synchronize_rcu

2018-11-02 Thread Joel Fernandes
On Thu, Nov 01, 2018 at 09:13:07AM -0700, Paul E. McKenney wrote:
> On Wed, Oct 31, 2018 at 10:00:19PM -0700, Joel Fernandes wrote:
> > On Wed, Oct 31, 2018 at 11:17:48AM -0700, Paul E. McKenney wrote:
> > > On Tue, Oct 30, 2018 at 06:11:19PM -0700, Joel Fernandes wrote:
> > > > Hi Paul,
> > > > 
> > > > On Tue, Oct 30, 2018 at 04:43:36PM -0700, Paul E. McKenney wrote:
> > > > > On Tue, Oct 30, 2018 at 03:26:49PM -0700, Joel Fernandes wrote:
> > > > > > Hi Paul,
> > > > > > 
> > > > > > On Sat, Oct 27, 2018 at 09:30:46PM -0700, Joel Fernandes (Google) 
> > > > > > wrote:
> > > > > > > As per this thread [1], it seems this smp_mb isn't needed anymore:
> > > > > > > "So the smp_mb() that I was trying to add doesn't need to be 
> > > > > > > there."
> > > > > > > 
> > > > > > > So let us remove this part from the memory ordering documentation.
> > > > > > > 
> > > > > > > [1] https://lkml.org/lkml/2017/10/6/707
> > > > > > > 
> > > > > > > Signed-off-by: Joel Fernandes (Google) 
> > > > > > 
> > > > > > I was just checking about this patch. Do you feel it is correct to 
> > > > > > remove
> > > > > > this part from the docs? Are you satisified that a barrier isn't 
> > > > > > needed there
> > > > > > now? Or did I miss something?
> > > > > 
> > > > > Apologies, it got lost in the shuffle.  I have now applied it with a
> > > > > bit of rework to the commit log, thank you!
> > > > 
> > > > No worries, thanks for taking it!
> > > > 
> > > > Just wanted to update you on my progress reading/correcting the docs. 
> > > > The
> > > > 'Memory Ordering' is taking a bit of time so I paused that and I'm 
> > > > focusing
> > > > on finishing all the other low hanging fruit. This activity is mostly 
> > > > during
> > > > night hours after the baby is asleep but some times I also manage to 
> > > > sneak it
> > > > into the day job ;-)
> > > 
> > > If there is anything I can do to make this a more sustainable task for
> > > you, please do not keep it a secret!!!
> > 
> > Thanks a lot, that means a lot to me! Will do!
> > 
> > > > BTW I do want to discuss about this smp_mb patch above with you at LPC 
> > > > if you
> > > > had time, even though we are removing it from the documentation. I 
> > > > thought
> > > > about it a few times, and I was not able to fully appreciate the need 
> > > > for the
> > > > barrier (that is even assuming that complete() etc did not do the right
> > > > thing).  Specifically I was wondering same thing Peter said in the above
> > > > thread I think that - if that rcu_read_unlock() triggered all the spin
> > > > locking up the tree of nodes, then why is that locking not sufficient to
> > > > prevent reads from the read-side section from bleeding out? That would
> > > > prevent the reader that just unlocked from seeing anything that happens
> > > > _after_ the synchronize_rcu.
> > > 
> > > Actually, I recall an smp_mb() being added, but am not seeing it anywhere
> > > relevant to wait_for_completion().  So I might need to add the smp_mb()
> > > to synchronize_rcu() and remove the patch (retaining the typo fix).  :-/
> > 
> > No problem, I'm glad atleast the patch resurfaced the topic of the potential
> > issue :-)
> 
> And an smp_mb() is needed in Tree RCU's __wait_rcu_gp().  This is
> because wait_for_completion() might get a "fly-by" wakeup, which would
> mean no ordering for code naively thinking that it was ordered after a
> grace period.
> 
> > > The short form answer is that anything before a grace period on any CPU
> > > must be seen by any CPU as being before anything on any CPU after that
> > > same grace period.  This guarantee requires a rather big hammer.
> > > 
> > > But yes, let's talk at LPC!
> > 
> > Sounds great, looking forward to discussing this.
> 
> Would it make sense to have an RCU-implementation BoF?
> 
> > > > Also about GP memory ordering and RCU-tree-locking, I think you 
> > > > mentioned to
> > > > me that the RCU reader-sections are virtually extended bot

Re: [RFC] doc: rcu: remove note on smp_mb during synchronize_rcu

2018-11-02 Thread Joel Fernandes
t.
> > 
> > In the previous paragraph, you mentioned the bug "requires a reader before
> > the GP to do a store". However, condition 1 is really different - it is a
> > reader holding a reference to a pointer that is used *after* the
> > synchronize_rcu returns. So that reader's load of the pointer should have
> > completed by the time GP ends, otherwise the reader can look at kfree'd 
> > data.
> > That's different right?
> 
> More specifically, the fix prevents a prior reader's -store- within its
> critical section to be seen as happening after a load that follows the
> end of the grace period.  So I stand by Condition 1.  ;-)
> And again, a store within an RCU read-side critical section is a bit
> on the strange side, but this sort of thing is perfectly legal and
> is used, albeit rather rarely.

Cool :) I never thought about condition 1 this way but good to know that's
possible :)

> > For condition 2, I analyzed it below, let me know what you think:
> > 
> > >   Thanx, Paul
> > > 
> > > 
> > > 
> > > commit bf3c11b7b9789283f993d9beb80caaabc4403916
> > > Author: Paul E. McKenney 
> > > Date:   Thu Nov 1 09:05:02 2018 -0700
> > > 
> > > rcu: Add full memory barrier in __wait_rcu_gp()
> > > 
> > > RCU grace periods have extremely strong any-to-any ordering
> > > requirements that are met by placing full barriers in various places
> > > in the grace-period computation.  However, normal grace period 
> > > requests
> > > might be subjected to a "fly-by" wakeup in which the requesting 
> > > process
> > > doesn't actually sleep and in which the corresponding CPU is not yet
> > > aware that the grace period has ended.  In this case, loads in the 
> > > code
> > > immediately following the synchronize_rcu() call might potentially see
> > > values before stores preceding the grace period on other CPUs.
> > > 
> > > This is an unusual use case, because RCU readers normally read.  
> > > However,
> > > they can do writes, and if they do, we need post-grace-period reads to
> > > see those writes.
> > > 
> > > This commit therefore adds an smp_mb() to the end of __wait_rcu_gp().
> > > 
> > > Many thanks to Joel Fernandes for the series of questions leading to 
> > > me
> > > realizing that this bug exists!
> > > 
> > > Signed-off-by: Paul E. McKenney 
> > > 
> > > diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
> > > index 1971869c4072..74020b558216 100644
> > > --- a/kernel/rcu/update.c
> > > +++ b/kernel/rcu/update.c
> > > @@ -360,6 +360,7 @@ void __wait_rcu_gp(bool checktiny, int n, 
> > > call_rcu_func_t *crcu_array,
> > >   wait_for_completion(_array[i].completion);
> > >   destroy_rcu_head_on_stack(_array[i].head);
> > >   }
> > > + smp_mb(); /* Provide ordering in case of fly-by wakeup. */
> > >  }
> > >  EXPORT_SYMBOL_GPL(__wait_rcu_gp);
> > >  
>
> > The fix looks fine to me. Thanks.
> > 
> > If I understand correctly the wait_for_completion() is an ACQUIRE operation,
> > and the complete() is a RELEASE operation aka the "MP pattern". The
> > ACQUIRE/RELEASE semantics allow any writes that happened before the ACQUIRE
> > to get ordered after it. So that would actually imply it is not strong 
> > enough
> > for ordering purposes during a "fly-by" wake up scenario and would be a
> > violation of CONDITION 2, I think (not only condition 1 as you said).  This
> > is because future readers may accidentally see the writes that happened
> > *before* the synchronize_rcu which is CONDITION 2 in the requirements:
> > https://goo.gl/8mrDHN  (I had to shortlink it ;))
> 
> I do appreciate the shorter link.  ;-)
> 
> A write happening before the grace period is ordered by the grace period's
> network of strong barriers, so the fix does not matter in that case.

I was talking about the acquire/release pairs in the calls to spin_lock and
spin_unlock in wait_for_completion, not in the grace period network of rnp 
locks.
Does that make sense?

I thought that during a "fly-by" wake up, that network of strong barriers
doesn't really trigger and that that's the problematic scenario. Did I miss
something?  I was talking about the acquire/release pair in
wait_for_completion during that fly-by scenario.
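
To pin down the pattern I have in mind, a minimal sketch (my own
illustration of the scenario in the commit log, not code from the patch):

	int x;	/* plain shared variable, not RCU-protected data */

	/* CPU 0: a reader that also writes (unusual but legal). */
	rcu_read_lock();
	WRITE_ONCE(x, 1);
	rcu_read_unlock();

	/* CPU 1: the updater. */
	synchronize_rcu();
	r1 = READ_ONCE(x);	/* must be 1 if CPU 0's section began first */

On a fly-by wakeup, wait_for_completion() on CPU 1 returns having done only
lock-release/acquire ordering, so without a trailing smp_mb() nothing
guarantees that the load of x is ordered after the reader's pre-GP store.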

> Also, the exact end of the grace period is irrelevant for Condition 2,
> it is instead the beginning of the grace period compared to the beginning
> of later RCU read-side critical sections.
> 
> Not saying that Condition 2 cannot somehow happen without the memory
> barrier, just saying that it will take quite a bit more creativity to
> find a relevant scenario.
> 
> Please see below for the updated patch, containing only the typo fix.
> Please let me know if I messed anything up.

Looks good to me, thanks!

 - Joel




Re: [PATCH 7/8] pstore: Remove needless lock during console writes

2018-11-02 Thread Joel Fernandes
On Fri, Nov 02, 2018 at 01:40:06PM -0700, Kees Cook wrote:
> On Fri, Nov 2, 2018 at 11:32 AM, Joel Fernandes  
> wrote:
> > On Thu, Nov 01, 2018 at 04:51:59PM -0700, Kees Cook wrote:
> >> Since commit 70ad35db3321 ("pstore: Convert console write to use
> >> ->write_buf"), the console writer does not use the preallocated crash
> >> dump buffer any more, so there is no reason to perform locking around it.
> >
> > Out of curiosity, what was the reason for having this preallocated crash
> > buffer in the first place? I thought the 'console' type only did regular
> > kernel console logging, not crash dumps.
> 
> The primary reason is that the dumper needs to write to somewhere and
> we don't know the state of the system (memory allocation may not work
> for example).
> 
> The other frontends tend to run at "sane" locations in the kernel. The
> dumper, however, is quite fragile.

Makes sense. thanks.

> > Also I wonder if Namhyung is still poking around that virtio pstore driver 
> > he
> > mentioned in the commit mentioned above. :)
> 
> Did that never land? I thought it mostly had to happen at the qemu end?
> 
> With nvdimm emulation, we can just use ramoops. :)
> 

Yes it seems like it never landed: https://lwn.net/Articles/694742/

One of the nice things about his virtio set, though, vs the nvdimm way, is
that he actually gets a directory on the host instead of a backing memory
file. Then he can just list this directory and see all the pstore files as
he shows in his example.

I guess it should not be too hard to write a tool to post-process the nvdimm
images and convert them to files anyway :)

thanks,

 - Joel



Re: [PATCH 7/8] pstore: Remove needless lock during console writes

2018-11-02 Thread Joel Fernandes
On Thu, Nov 01, 2018 at 04:51:59PM -0700, Kees Cook wrote:
> Since commit 70ad35db3321 ("pstore: Convert console write to use
> ->write_buf"), the console writer does not use the preallocated crash
> dump buffer any more, so there is no reason to perform locking around it.

Out of curiosity, what was the reason for having this preallocated crash
buffer in the first place? I thought the 'console' type only did regular
kernel console logging, not crash dumps.

I looked at all the patches and had some minor nits. With the nits addressed
(if you agree with them), feel free to add my Reviewed-by on future respins:

Reviewed-by: Joel Fernandes (Google) 

Also I wonder if Namhyung is still poking around that virtio pstore driver he
mentioned in the commit mentioned above. :)

thanks,

- Joel

> Signed-off-by: Kees Cook 
> ---
>  fs/pstore/platform.c | 29 ++---
>  1 file changed, 6 insertions(+), 23 deletions(-)
> 
> diff --git a/fs/pstore/platform.c b/fs/pstore/platform.c
> index a956c7bc3f67..32340e7dd6a5 100644
> --- a/fs/pstore/platform.c
> +++ b/fs/pstore/platform.c
> @@ -461,31 +461,14 @@ static void pstore_unregister_kmsg(void)
>  #ifdef CONFIG_PSTORE_CONSOLE
>  static void pstore_console_write(struct console *con, const char *s, 
> unsigned c)
>  {
> - const char *e = s + c;
> + struct pstore_record record;
>  
> - while (s < e) {
> - struct pstore_record record;
> - unsigned long flags;
> -
> - pstore_record_init(&record, psinfo);
> - record.type = PSTORE_TYPE_CONSOLE;
> -
> - if (c > psinfo->bufsize)
> - c = psinfo->bufsize;
> + pstore_record_init(&record, psinfo);
> + record.type = PSTORE_TYPE_CONSOLE;
>  
> - if (oops_in_progress) {
> - if (!spin_trylock_irqsave(&psinfo->buf_lock, flags))
> - break;
> - } else {
> - spin_lock_irqsave(&psinfo->buf_lock, flags);
> - }
> - record.buf = (char *)s;
> - record.size = c;
> - psinfo->write(&record);
> - spin_unlock_irqrestore(&psinfo->buf_lock, flags);
> - s += c;
> - c = e - s;
> - }
> + record.buf = (char *)s;
> + record.size = c;
> + psinfo->write(&record);
>  }
>  
>  static struct console pstore_console = {
> -- 
> 2.17.1
> 


Re: [PATCH 2/8] pstore: Do not use crash buffer for decompression

2018-11-02 Thread Joel Fernandes
On Thu, Nov 01, 2018 at 04:51:54PM -0700, Kees Cook wrote:
> The pre-allocated compression buffer used for crash dumping was also
> being used for decompression. This isn't technically safe, since it's
> possible the kernel may attempt a crashdump while pstore is populating the
> pstore filesystem (and performing decompression). Instead, just allocate

Yeah, that would be bad if it happened ;)

> a separate buffer for decompression. Correctness is preferred over
> performance here.
> 
> Signed-off-by: Kees Cook 
> ---
>  fs/pstore/platform.c | 56 
>  1 file changed, 25 insertions(+), 31 deletions(-)
> 
> diff --git a/fs/pstore/platform.c b/fs/pstore/platform.c
> index b821054ca3ed..8b6028948cf3 100644
> --- a/fs/pstore/platform.c
> +++ b/fs/pstore/platform.c
> @@ -258,20 +258,6 @@ static int pstore_compress(const void *in, void *out,
>   return outlen;
>  }
>  
> -static int pstore_decompress(void *in, void *out,
> -  unsigned int inlen, unsigned int outlen)
> -{
> - int ret;
> -
> - ret = crypto_comp_decompress(tfm, in, inlen, out, &outlen);
> - if (ret) {
> - pr_err("crypto_comp_decompress failed, ret = %d!\n", ret);
> - return ret;
> - }
> -
> - return outlen;
> -}
> -
>  static void allocate_buf_for_compression(void)
>  {
>   struct crypto_comp *ctx;
> @@ -656,8 +642,9 @@ EXPORT_SYMBOL_GPL(pstore_unregister);
>  
>  static void decompress_record(struct pstore_record *record)
>  {
> + int ret;
>   int unzipped_len;

nit: We could get rid of the unzipped_len variable now I think.

> - char *decompressed;
> + char *unzipped, *workspace;
>  
>   if (!record->compressed)
>   return;
> @@ -668,35 +655,42 @@ static void decompress_record(struct pstore_record 
> *record)
>   return;
>   }
>  
> - /* No compression method has created the common buffer. */
> + /* Missing compression buffer means compression was not initialized. */
>   if (!big_oops_buf) {
> - pr_warn("no decompression buffer allocated\n");
> + pr_warn("no decompression method initialized!\n");
>   return;
>   }
>  
> - unzipped_len = pstore_decompress(record->buf, big_oops_buf,
> -  record->size, big_oops_buf_sz);
> - if (unzipped_len <= 0) {
> - pr_err("decompression failed: %d\n", unzipped_len);
> + /* Allocate enough space to hold max decompression and ECC. */
> + unzipped_len = big_oops_buf_sz;
> + workspace = kmalloc(unzipped_len + record->ecc_notice_size,

Should this be unzipped_len + record->ecc_notice_size + 1? The extra byte
being for the NUL character of the ECC notice?

This occurred to me when I saw the + 1 in ram.c. It could be better to just
abstract the size as a macro.
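
Something like this, say (just a sketch of the suggestion, name made up;
big_oops_buf_sz is the existing global used above):

#define PSTORE_DECOMP_BUF_SIZE(record) \
	(big_oops_buf_sz + (record)->ecc_notice_size + 1)

where the "+ 1" leaves room for the ECC notice's trailing NUL byte.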

> + GFP_KERNEL);
> + if (!workspace)
>   return;
> - }
>  
> - /* Build new buffer for decompressed contents. */
> - decompressed = kmalloc(unzipped_len + record->ecc_notice_size,
> -GFP_KERNEL);
> - if (!decompressed) {
> - pr_err("decompression ran out of memory\n");
> + /* After decompression "unzipped_len" is almost certainly smaller. */
> + ret = crypto_comp_decompress(tfm, record->buf, record->size,
> +   workspace, &unzipped_len);
> + if (ret) {
> + pr_err("crypto_comp_decompress failed, ret = %d!\n", ret);
> + kfree(workspace);
>   return;
>   }
> - memcpy(decompressed, big_oops_buf, unzipped_len);
>  
>   /* Append ECC notice to decompressed buffer. */
> - memcpy(decompressed + unzipped_len, record->buf + record->size,
> + memcpy(workspace + unzipped_len, record->buf + record->size,
>  record->ecc_notice_size);
>  
> - /* Swap out compresed contents with decompressed contents. */
> + /* Copy decompressed contents into a minimum-sized allocation. */
> + unzipped = kmemdup(workspace, unzipped_len + record->ecc_notice_size,
> +GFP_KERNEL);
> + kfree(workspace);
> + if (!unzipped)
> + return;
> +
> + /* Swap out compressed contents with decompressed contents. */
>   kfree(record->buf);
> - record->buf = decompressed;
> + record->buf = unzipped;

Rest of it LGTM, thanks!

 - Joel


Re: [PATCH 8/8] pstore/ram: Correctly calculate usable PRZ bytes

2018-11-02 Thread Joel Fernandes
On Thu, Nov 01, 2018 at 04:52:00PM -0700, Kees Cook wrote:
> The actual number of bytes stored in a PRZ is smaller than the
> bytes requested by platform data, since there is a header on each
> PRZ. Additionally, if ECC is enabled, there are trailing bytes used
> as well. Normally this mismatch doesn't matter since PRZs are circular
> buffers and the leading "overflow" bytes are just thrown away. However, in
> the case of a compressed record, this rather badly corrupts the results.

Actually, this would also mean there was some data loss for non-compressed
records before as well, which is now fixed?
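
To make the mismatch concrete, a rough sketch of the layout (names
illustrative; the real math lives in ram_core.c):

/*
 *  |<-------------------- record_size -------------------->|
 *  | PRZ header (sig/start/size) |  usable bytes  |  ECC   |
 */
static size_t prz_usable_bytes(size_t record_size, size_t hdr_size,
			       size_t ecc_size)
{
	return record_size - hdr_size - ecc_size;
}

The fix below sizes pstore.bufsize to this usable count (the PRZ's
buffer_size) instead of the full record_size, so a compressed dump can no
longer spill past the usable region and lose its tail.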

> This corruption was visible with "ramoops.mem_size=204800 ramoops.ecc=1".
> Any stored crashes could not be decompressed (producing a pstorefs
> "dmesg-*.enc.z" file), triggering errors at boot:
> 
>   [2.790759] pstore: crypto_comp_decompress failed, ret = -22!
> 
> Reported-by: Joel Fernandes 
> Fixes: b0aad7a99c1d ("pstore: Add compression support to pstore")
> Signed-off-by: Kees Cook 

Thanks!
Reviewed-by: Joel Fernandes (Google) 

Also should this be fixed for other backends or are those good? AFAIR, I saw
this for EFI too.

- Joel



> ---
>  fs/pstore/ram.c| 15 ++-
>  include/linux/pstore.h |  5 -
>  2 files changed, 10 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/pstore/ram.c b/fs/pstore/ram.c
> index 25bede911809..10ac4d23c423 100644
> --- a/fs/pstore/ram.c
> +++ b/fs/pstore/ram.c
> @@ -814,17 +814,14 @@ static int ramoops_probe(struct platform_device *pdev)
>  
>   cxt->pstore.data = cxt;
>   /*
> -  * Console can handle any buffer size, so prefer LOG_LINE_MAX. If we
> -  * have to handle dumps, we must have at least record_size buffer. And
> -  * for ftrace, bufsize is irrelevant (if bufsize is 0, buf will be
> -  * ZERO_SIZE_PTR).
> +  * Since bufsize is only used for dmesg crash dumps, it
> +  * must match the size of the dprz record (after PRZ header
> +  * and ECC bytes have been accounted for).
>*/
> - if (cxt->console_size)
> - cxt->pstore.bufsize = 1024; /* LOG_LINE_MAX */
> - cxt->pstore.bufsize = max(cxt->record_size, cxt->pstore.bufsize);
> - cxt->pstore.buf = kmalloc(cxt->pstore.bufsize, GFP_KERNEL);
> + cxt->pstore.bufsize = cxt->dprzs[0]->buffer_size;
> + cxt->pstore.buf = kzalloc(cxt->pstore.bufsize, GFP_KERNEL);
>   if (!cxt->pstore.buf) {
> - pr_err("cannot allocate pstore buffer\n");
> + pr_err("cannot allocate pstore crash dump buffer\n");
>   err = -ENOMEM;
>   goto fail_clear;
>   }
> diff --git a/include/linux/pstore.h b/include/linux/pstore.h
> index 3549f2ba865c..f46e5df76b58 100644
> --- a/include/linux/pstore.h
> +++ b/include/linux/pstore.h
> @@ -90,7 +90,10 @@ struct pstore_record {
>   *
>   * @buf_lock:spinlock to serialize access to @buf
>   * @buf: preallocated crash dump buffer
> - * @bufsize: size of @buf available for crash dump writes
> + * @bufsize: size of @buf available for crash dump bytes (must match
> + *   smallest number of bytes available for writing to a
> + *   backend entry, since compressed bytes don't take kindly
> + *   to being truncated)
>   *
>   * @read_mutex:  serializes @open, @read, @close, and @erase callbacks
>   * @flags:   bitfield of frontends the backend can accept writes for
> -- 
> 2.17.1
> 


Re: [RFC] doc: rcu: remove note on smp_mb during synchronize_rcu

2018-11-02 Thread Joel Fernandes
> barrier.  The probability of
> failure is extremely low in the common case, which involves all sorts
> of synchronization on the wakeup path.  It would be quite strange (but
> not impossible) for the wait_for_completion() exit path to -not- do
> a full wakeup.  Plus the bug requires a reader before the grace period
> to do a store to some location that post-grace-period code loads from.
> Which is a very rare use case.
> 
> But it still should be fixed.  ;-)
> 
> > Did you feel this will violate condition 1. or condition 2. in 
> > "Memory-Barrier
> > Guarantees"? Or both?
> > https://www.kernel.org/doc/Documentation/RCU/Design/Requirements/Requirements.html#Memory-Barrier%20Guarantees
> 
> Condition 1.  There might be some strange combination of events that
> could also cause it to also violate condition 2, but I am not immediately
> seeing it.

In the previous paragraph, you mentioned the bug "requires a reader before
the GP to do a store". However, condition 1 is really different - it is a
reader holding a reference to a pointer that is used *after* the
synchronize_rcu returns. So that reader's load of the pointer should have
completed by the time GP ends, otherwise the reader can look at kfree'd data.
That's different, right?
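(To make sure we are talking about the same scenario, a minimal sketch of
condition 1; struct foo and do_something_with() are made up:)

	struct foo *gp;	/* RCU-protected pointer */

	void reader(void)
	{
		struct foo *p;

		rcu_read_lock();
		p = rcu_dereference(gp);
		if (p)
			do_something_with(p);	/* must finish before the kfree() below */
		rcu_read_unlock();
	}

	void updater(struct foo *newp)	/* update-side lock assumed held */
	{
		struct foo *q = gp;

		rcu_assign_pointer(gp, newp);
		synchronize_rcu();	/* waits for all pre-existing readers */
		kfree(q);		/* condition 1 is what makes this safe */
	}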

For condition 2, I analyzed it below, let me know what you think:

>   Thanx, Paul
> 
> 
> 
> commit bf3c11b7b9789283f993d9beb80caaabc4403916
> Author: Paul E. McKenney 
> Date:   Thu Nov 1 09:05:02 2018 -0700
> 
> rcu: Add full memory barrier in __wait_rcu_gp()
> 
> RCU grace periods have extremely strong any-to-any ordering
> requirements that are met by placing full barriers in various places
> in the grace-period computation.  However, normal grace period requests
> might be subjected to a "fly-by" wakeup in which the requesting process
> doesn't actually sleep and in which the corresponding CPU is not yet
> aware that the grace period has ended.  In this case, loads in the code
> immediately following the synchronize_rcu() call might potentially see
> values before stores preceding the grace period on other CPUs.
> 
> This is an unusual use case, because RCU readers normally read.  However,
> they can do writes, and if they do, we need post-grace-period reads to
> see those writes.
> 
> This commit therefore adds an smp_mb() to the end of __wait_rcu_gp().
> 
> Many thanks to Joel Fernandes for the series of questions leading to me
> realizing that this bug exists!
> 
> Signed-off-by: Paul E. McKenney 
> 
> diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
> index 1971869c4072..74020b558216 100644
> --- a/kernel/rcu/update.c
> +++ b/kernel/rcu/update.c
> @@ -360,6 +360,7 @@ void __wait_rcu_gp(bool checktiny, int n, call_rcu_func_t 
> *crcu_array,
>   wait_for_completion(&rs_array[i].completion);
>   destroy_rcu_head_on_stack(&rs_array[i].head);
>   }
> + smp_mb(); /* Provide ordering in case of fly-by wakeup. */
>  }
>  EXPORT_SYMBOL_GPL(__wait_rcu_gp);
>  

The fix looks fine to me. Thanks.

If I understand correctly, the wait_for_completion() is an ACQUIRE operation
and the complete() is a RELEASE operation, aka the "MP pattern". The
ACQUIRE/RELEASE semantics allow any writes that happened before the ACQUIRE
to get ordered after it. So that would actually imply it is not strong enough
for ordering purposes during a "fly-by" wakeup scenario and would be a
violation of CONDITION 2, I think (not only condition 1 as you said). This
is because future readers may accidentally see the writes that happened
*before* the synchronize_rcu, which is CONDITION 2 in the requirements:
https://goo.gl/8mrDHN  (I had to shortlink it ;))
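(Spelling out the MP pattern I mean with kernel primitives, as a sketch; the
comment in writer() is the property that matters for the fly-by case:)

	int data, flag;

	void writer(void)	/* the complete() side, by analogy */
	{
		WRITE_ONCE(data, 1);
		smp_store_release(&flag, 1);	/* orders the data store before flag */
		/* accesses placed *after* this release are NOT ordered by it */
	}

	void reader(void)	/* the wait_for_completion() side, by analogy */
	{
		if (smp_load_acquire(&flag))	/* orders the data load after flag */
			BUG_ON(READ_ONCE(data) != 1);	/* guaranteed to pass */
	}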

Cheers,

- Joel


Re: [RFC PATCH] Implement /proc/pid/kill

2018-11-01 Thread Joel Fernandes
On Tue, Oct 30, 2018 at 09:24:00PM -0700, Joel Fernandes wrote:
> On Tue, Oct 30, 2018 at 7:56 PM, Aleksa Sarai  wrote:
> > On 2018-10-31, Christian Brauner  wrote:
> >> > I think Aleksa's larger point is that it's useful to treat processes
> >> > as other file-descriptor-named, poll-able, wait-able resources.
> >> > Consistency is important. A process is just another system resource,
> >> > and like any other system resource, you should be open to hold a file
> >> > descriptor to it and do things to that process via that file
> >> > descriptor. The precise form of this process-handle FD is up for
> >> > debate. The existing /proc/$PID directory FD is a good candidate for a
> >> > process handle FD, since it does almost all of what's needed. But
> >> > regardless of what form a process handle FD takes, we need it. I don't
> >> > see a case for continuing to treat processes in a non-unixy,
> >> > non-file-descriptor-based manner.
> >>
> >> That's what I'm proposing in the API for which I'm gathering feedback.
> >> I have presented parts of this in various discussions at LSS Europe last 
> >> week
> >> and will be at LPC.
> >> We don't want to rush an API like this though. It was tried before in
> >> other forms
> >> and these proposals didn't make it.
> >
> > :+1: on a well thought-out and generic proposal. As we've discussed
> > elsewhere, this is an issue that really would be great to (finally)
> > solve.
> 
> Excited to see this and please count me in for discussions around this. 
> thanks.
> 

Just a quick question, is there a track planned at LPC for discussing this
new proposal or topics around/related to the proposal?

If not, should that be planned?

- Joel



Re: [RFC] doc: rcu: remove note on smp_mb during synchronize_rcu

2018-10-31 Thread Joel Fernandes
On Wed, Oct 31, 2018 at 11:17:48AM -0700, Paul E. McKenney wrote:
> On Tue, Oct 30, 2018 at 06:11:19PM -0700, Joel Fernandes wrote:
> > Hi Paul,
> > 
> > On Tue, Oct 30, 2018 at 04:43:36PM -0700, Paul E. McKenney wrote:
> > > On Tue, Oct 30, 2018 at 03:26:49PM -0700, Joel Fernandes wrote:
> > > > Hi Paul,
> > > > 
> > > > On Sat, Oct 27, 2018 at 09:30:46PM -0700, Joel Fernandes (Google) wrote:
> > > > > As per this thread [1], it seems this smp_mb isn't needed anymore:
> > > > > "So the smp_mb() that I was trying to add doesn't need to be there."
> > > > > 
> > > > > So let us remove this part from the memory ordering documentation.
> > > > > 
> > > > > [1] https://lkml.org/lkml/2017/10/6/707
> > > > > 
> > > > > Signed-off-by: Joel Fernandes (Google) 
> > > > 
> > > > I was just checking about this patch. Do you feel it is correct to 
> > > > remove
> > > > this part from the docs? Are you satisified that a barrier isn't needed 
> > > > there
> > > > now? Or did I miss something?
> > > 
> > > Apologies, it got lost in the shuffle.  I have now applied it with a
> > > bit of rework to the commit log, thank you!
> > 
> > No worries, thanks for taking it!
> > 
> > Just wanted to update you on my progress reading/correcting the docs. The
> > 'Memory Ordering' is taking a bit of time so I paused that and I'm focusing
> > on finishing all the other low hanging fruit. This activity is mostly during
> > night hours after the baby is asleep but sometimes I also manage to sneak
> > it into the day job ;-)
> 
> If there is anything I can do to make this a more sustainable task for
> you, please do not keep it a secret!!!

Thanks a lot, that means a lot to me! Will do!

> > BTW I do want to discuss this smp_mb patch above with you at LPC if
> > you
> > had time, even though we are removing it from the documentation. I thought
> > about it a few times, and I was not able to fully appreciate the need for 
> > the
> > barrier (that is even assuming that complete() etc did not do the right
> > thing).  Specifically I was wondering the same thing Peter said in the above
> > thread I think that - if that rcu_read_unlock() triggered all the spin
> > locking up the tree of nodes, then why is that locking not sufficient to
> > prevent reads from the read-side section from bleeding out? That would
> > prevent the reader that just unlocked from seeing anything that happens
> > _after_ the synchronize_rcu.
> 
> Actually, I recall an smp_mb() being added, but am not seeing it anywhere
> relevant to wait_for_completion().  So I might need to add the smp_mb()
> to synchronize_rcu() and remove the patch (retaining the typo fix).  :-/

No problem, I'm glad at least the patch resurfaced the topic of the potential
issue :-)

> The short form answer is that anything before a grace period on any CPU
> must be seen by any CPU as being before anything on any CPU after that
> same grace period.  This guarantee requires a rather big hammer.
> 
> But yes, let's talk at LPC!

Sounds great, looking forward to discussing this.

> > Also about GP memory ordering and RCU-tree-locking, I think you mentioned to
> > me that the RCU reader-sections are virtually extended both forward and
> > backward and wherever it ends, those paths do heavy-weight synchronization
> > that should be sufficient to prevent memory ordering issues (such as those
> > you mentioned in the Requirements document). That is exactly why we don't
> > need explicit barriers during rcu_read_unlock. If I recall I asked you why
> > those are not needed. So that answer made sense, but then now on going
> > through the 'Memory Ordering' document, I see that you mentioned there is
> > reliance on the locking. Is that reliance on locking necessary to maintain
> > ordering then?
> 
> There is a "network" of locking augmented by smp_mb__after_unlock_lock()
> that implements the all-to-all memory ordering mentioned above.  But it
> also needs to handle all the possible complete()/wait_for_completion()
> races, even those assisted by hypervisor vCPU preemption.

I see, so it sounds like the lock network is just a partial solution. For
some reason I thought that before complete() was even called on the CPU
executing the callback, all the CPUs would have acquired and released a lock
in the "lock network" at least once, thus ensuring the ordering (due to the
fact that the quiescent state reporting has to travel up the tree start

Re: [PATCH v2] Implement /proc/pid/kill

2018-10-31 Thread Joel Fernandes
On Thu, Nov 01, 2018 at 04:33:53AM +1100, Aleksa Sarai wrote:
> On 2018-10-31, Joel Fernandes  wrote:
> > I suggest to maintainers we take this in as an intermediate solution
> > since we don't have anything close to it and this is a real issue, and
> > the fix proposed is simple.
> 
> I would suggest we wait until after LPC to see what Christian's design
> is (given that, from what I've heard, it will help us solve more
> problems than just the kill(2) issue).
> 
> I am very skeptical of adding new procfs files as an "intermediate
> solution" (procfs is the most obvious example of an interface which is
> effectively written in stone). Especially if this would conflict with
> the idea Christian will propose -- as he said, there were proposals to
> do this in the past and they didn't get anywhere because of lack of
> discussion and brainstorming before posting patches.
> 

Fine with me, thanks.



Re: [PATCH v2] Implement /proc/pid/kill

2018-10-31 Thread Joel Fernandes
On Wed, Oct 31, 2018 at 02:37:44PM +, Daniel Colascione wrote:
> Add a simple proc-based kill interface. To use /proc/pid/kill, just
> write the signal number in base-10 ASCII to the kill file of the
> process to be killed: for example, 'echo 9 > /proc/$$/kill'.
> 
> Semantically, /proc/pid/kill works like kill(2), except that the
> process ID comes from the proc filesystem context instead of from an
> explicit system call parameter. This way, it's possible to avoid races
> between inspecting some aspect of a process and that process's PID
> being reused for some other process.
> 
> Note that only the real user ID that opened a /proc/pid/kill file can
> write to it; other users get EPERM.  This check prevents confused
> deputy attacks via, e.g., standard output of setuid programs.
> 
> With /proc/pid/kill, it's possible to write a proper race-free and
> safe pkill(1). An approximation follows. A real program might use
> openat(2), having opened a process's /proc/pid directory explicitly,
> with the directory file descriptor serving as a sort of "process
> handle".
> 
> #!/bin/bash
> set -euo pipefail
> pat=$1
> for proc_status in /proc/*/status; do (
> cd $(dirname $proc_status)
> readarray -d '' proc_argv < cmdline
> if ((${#proc_argv[@]} > 0)) &&
>[[ ${proc_argv[0]} = *$pat* ]];
> then
> echo 15 > kill
> fi
> ) || true; done
> 
> Signed-off-by: Daniel Colascione 
> ---
> 
> Added a real-user-ID check to prevent confused deputy attacks.
> 
>  fs/proc/base.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 51 insertions(+)
> 
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 7e9f07bf260d..74e494f24b28 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -205,6 +205,56 @@ static int proc_root_link(struct dentry *dentry, struct 
> path *path)
>   return result;
>  }
>  
> +static ssize_t proc_pid_kill_write(struct file *file,
> +const char __user *buf,
> +size_t count, loff_t *ppos)
> +{
> + ssize_t res;
> + int sig;
> + char buffer[4];
> +
> + /* This check prevents a confused deputy attack in which an
> +  * unprivileged process opens /proc/victim/kill and convinces
> +  * a privileged process to write to that kill FD, effectively
> +  * performing a kill with the privileges of the unwitting
> +  * privileged process.  Here, we just fail the kill operation
> +  * if someone calls write(2) with a real user ID that differs
> +  * from the one used to open the kill FD.
> +  */
> + res = -EPERM;
> + if (file->f_cred->user != current_user())
> + goto out;

nit: You could get rid of the out label and just do direct returns. Will save
a few lines and is more readable.
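Something like this, roughly (a sketch of the same checks, untested):

	if (file->f_cred->user != current_user())
		return -EPERM;
	if (*ppos != 0)
		return -EINVAL;
	if (count > sizeof(buffer) - 1)
		return -EINVAL;
	if (copy_from_user(buffer, buf, count))
		return -EFAULT;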

> +
> + res = -EINVAL;
> + if (*ppos != 0)
> + goto out;
> +
> + res = -EINVAL;
> + if (count > sizeof(buffer) - 1)
> + goto out;
> +
> + res = -EFAULT;
> + if (copy_from_user(buffer, buf, count))
> + goto out;
> +
> + buffer[count] = '\0';

I think you can just zero-initialize buffer with "= {};" and get rid of this 
line.

> + res = kstrtoint(strstrip(buffer), 10, &sig);
> + if (res)
> + goto out;


> +
> + res = kill_pid(proc_pid(file_inode(file)), sig, 0);
> + if (res)
> + goto out;

if (res)
return res;

Other than the security issues, which I think you're still discussing: since
we need this, I suggest the maintainers take this in as an intermediate
solution. We don't have anything close to it, this is a real issue, and the
fix proposed is simple.  So FWIW feel free to add my Reviewed-by (with the
above nits and security issues taken care of) on any future
respins:

Reviewed-by: Joel Fernandes (Google) 

thanks,

- Joel



Re: [RFC PATCH] Minimal non-child process exit notification support

2018-10-31 Thread Joel Fernandes
On Wed, Oct 31, 2018 at 7:41 AM, Joel Fernandes  wrote:
[...]
>>> > Indeed, to avoid killing the wrong process you need to have opened
>>> > some node of /proc/pid/* (maybe cmdline) before sending the kill
>>> > signal.
>>>
>>> The kernel really needs better documentation of the semantics of
>>> procfs file descriptors. You're not the only person to think,
>>> mistakenly, that keeping a reference to a /proc/$PID/something FD
>>> reserves $PID and prevents it being used for another process. Procfs
>>> FDs do no such thing. kill(2) is unsafe whether or not
>>> /proc/pid/cmdline or any other /proc file is open.
>>
>> Interesting.
>> Linux 'fixed' the problem of pid reuse in the kernel by adding (IIRC)
>> 'struct pid' that reference counts the pid stopping reuse.
>
> This is incorrect if you mean numeric pids. See the end of these
> comments in include/linux/pid.h . A pid value can be reused, it just
> works Ok because it causes a new struct pid allocation. That doesn't
mean there isn't a numeric reuse. There's also nowhere in pid_alloc()
> where we prevent the numeric reuse AFAICT.

Bleh, I mean alloc_pid().


Re: [RFC PATCH] Minimal non-child process exit notification support

2018-10-31 Thread Joel Fernandes
On Wed, Oct 31, 2018 at 7:25 AM, David Laight  wrote:
> From: Daniel Colascione
>> Sent: 31 October 2018 12:56
>> On Wed, Oct 31, 2018 at 12:27 PM, David Laight  
>> wrote:
>> > From: Daniel Colascione
>> >> Sent: 29 October 2018 17:53
>> >>
>> >> This patch adds a new file under /proc/pid, /proc/pid/exithand.
>> >> Attempting to read from an exithand file will block until the
>> >> corresponding process exits, at which point the read will successfully
>> >> complete with EOF.  The file descriptor supports both blocking
>> >> operations and poll(2). It's intended to be a minimal interface for
>> >> allowing a program to wait for the exit of a process that is not one
>> >> of its children.
>> >
>> > Why do you need an extra file?
>>
>> Because no current file suffices.
>
> That doesn't stop you making something work on any/all of the existing files.
>
>> > It ought to be possible to use poll() to wait for POLLERR having set
>> > 'events' to zero on any of the nodes in /proc/pid - or even on
>> > the directory itself.
>>
>> That doesn't actually work today. And waiting on a directory with
>> POLLERR would be very weird, since directories in general don't do
>> things like blocking reads or poll support. A separate file with
>> self-contained, well-defined semantics is cleaner.
>
> Device drivers will (well ought to) return POLLERR when a device
> is removed.
> Making procfs behave the same way wouldn't be too stupid.
>
>> > Indeed, to avoid killing the wrong process you need to have opened
>> > some node of /proc/pid/* (maybe cmdline) before sending the kill
>> > signal.
>>
>> The kernel really needs better documentation of the semantics of
>> procfs file descriptors. You're not the only person to think,
>> mistakenly, that keeping a reference to a /proc/$PID/something FD
>> reserves $PID and prevents it being used for another process. Procfs
>> FDs do no such thing. kill(2) is unsafe whether or not
>> /proc/pid/cmdline or any other /proc file is open.
>
> Interesting.
> Linux 'fixed' the problem of pid reuse in the kernel by adding (IIRC)
> 'struct pid' that reference counts the pid stopping reuse.

This is incorrect if you mean numeric pids. See the end of these
comments in include/linux/pid.h . A pid value can be reused, it just
works Ok because it causes a new struct pid allocation. That doesn't
mean there isn't a numeric reuse. There's also nowhere in pid_alloc()
where we prevent the numeric reuse AFAICT.

/*
 * What is struct pid?
 *
 * A struct pid is the kernel's internal notion of a process identifier.
 * It refers to individual tasks, process groups, and sessions.  While
 * there are processes attached to it the struct pid lives in a hash
 * table, so it and then the processes that it refers to can be found
 * quickly from the numeric pid value.  The attached processes may be
 * quickly accessed by following pointers from struct pid.
 *
 * Storing pid_t values in the kernel and referring to them later has a
 * problem.  The process originally with that pid may have exited and the
 * pid allocator wrapped, and another process could have come along
 * and been assigned that pid.
 *
 * Referring to user space processes by holding a reference to struct
 * task_struct has a problem.  When the user space process exits
 * the now useless task_struct is still kept.  A task_struct plus a
 * stack consumes around 10K of low kernel memory.  More precisely
 * this is THREAD_SIZE + sizeof(struct task_struct).  By comparison
 * a struct pid is about 64 bytes.
 *
 * Holding a reference to struct pid solves both of these problems.
 * It is small so holding a reference does not consume a lot of
 * resources, and since a new struct pid is allocated when the numeric pid
 * value is reused (when pids wrap around) we don't mistakenly refer to new
 * processes.
 */
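In other words, the reuse-safety comes from holding the reference, not from
the number. A sketch with the in-kernel API ("task" assumed in scope):

	struct pid *p;

	p = get_task_pid(task, PIDTYPE_PID);	/* takes a reference */
	/* ... the numeric pid may be recycled for a new task here ... */
	kill_pid(p, SIGTERM, 1);	/* hits the original task, or nothing if
					 * it already exited; never the new one */
	put_pid(p);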


Re: [RFC PATCH] Implement /proc/pid/kill

2018-10-30 Thread Joel Fernandes
On Tue, Oct 30, 2018 at 7:56 PM, Aleksa Sarai  wrote:
> On 2018-10-31, Christian Brauner  wrote:
>> > I think Aleksa's larger point is that it's useful to treat processes
>> > as other file-descriptor-named, poll-able, wait-able resources.
>> > Consistency is important. A process is just another system resource,
>> > and like any other system resource, you should be open to hold a file
>> > descriptor to it and do things to that process via that file
>> > descriptor. The precise form of this process-handle FD is up for
>> > debate. The existing /proc/$PID directory FD is a good candidate for a
>> > process handle FD, since it does almost all of what's needed. But
>> > regardless of what form a process handle FD takes, we need it. I don't
>> > see a case for continuing to treat processes in a non-unixy,
>> > non-file-descriptor-based manner.
>>
>> That's what I'm proposing in the API for which I'm gathering feedback.
>> I have presented parts of this in various discussions at LSS Europe last week
>> and will be at LPC.
>> We don't want to rush an API like this though. It was tried before in
>> other forms
>> and these proposals didn't make it.
>
> :+1: on a well thought-out and generic proposal. As we've discussed
> elsewhere, this is an issue that really would be great to (finally)
> solve.

Excited to see this and please count me in for discussions around this. thanks.

 - Joel


Re: [PATCH v4] pstore: Avoid duplicate call of persistent_ram_zap()

2018-10-30 Thread Joel Fernandes
On Tue, Oct 30, 2018 at 8:57 PM, Peng15 Wang 王鹏  wrote:
>
>>From: Joel Fernandes 
>>Sent: Wednesday, October 31, 2018 6:16
>>To: Kees Cook
>>Cc: Peng15 Wang 王鹏; Anton Vorontsov; Colin Cross; Tony Luck; LKML; 
>>vipwangerx...@gmail.com
>>Subject: Re: [PATCH v4] pstore: Avoid duplicate call of persistent_ram_zap()
>>
>>On Tue, Oct 30, 2018 at 02:52:43PM -0700, Kees Cook wrote:
>>> On Tue, Oct 30, 2018 at 2:38 PM, Joel Fernandes  
>>> wrote:
>>> > On Tue, Oct 30, 2018 at 03:52:34PM +0800, Peng Wang wrote:
>>> >> When initializing prz with invalid data in the buffer (no PERSISTENT_RAM_SIG),
>>> >> function call path is like this:
>>> >>
>>> >> ramoops_init_prz ->
>>> >> |
>>> >> |-> persistent_ram_new -> persistent_ram_post_init -> persistent_ram_zap
>>> >> |
>>> >> |-> persistent_ram_zap
>>> >>
>>> >> As we can see, persistent_ram_zap() is called twice.
>>> >> We can avoid this by adding an option to persistent_ram_new(), and
>>> >> only call persistent_ram_zap() when it is needed.
>>> >>
>>> >> Signed-off-by: Peng Wang 
>>> >> ---
>>> >>  fs/pstore/ram.c| 4 +---
>>> >>  fs/pstore/ram_core.c   | 5 +++--
>>> >>  include/linux/pstore_ram.h | 1 +
>>> >>  3 files changed, 5 insertions(+), 5 deletions(-)
>>> >>
>>> >> diff --git a/fs/pstore/ram.c b/fs/pstore/ram.c
>>> >> index ffcff6516e89..b51901f97dc2 100644
>>> >> --- a/fs/pstore/ram.c
>>> >> +++ b/fs/pstore/ram.c
>>> >> @@ -640,7 +640,7 @@ static int ramoops_init_prz(const char *name,
>>> >>
>>> >>   label = kasprintf(GFP_KERNEL, "ramoops:%s", name);
>>> >>   *prz = persistent_ram_new(*paddr, sz, sig, &cxt->ecc_info,
>>> >> -   cxt->memtype, 0, label);
>>> >> +   cxt->memtype, PRZ_FLAG_ZAP_OLD, label);
>>> >>   if (IS_ERR(*prz)) {
>>> >>   int err = PTR_ERR(*prz);
>>> >
>>> > Looks good to me except the minor comment below:
>>> >
>>> >>
>>> >> @@ -649,8 +649,6 @@ static int ramoops_init_prz(const char *name,
>>> >>   return err;
>>> >>   }
>>> >>
>>> >> - persistent_ram_zap(*prz);
>>> >> -
>>> >>   *paddr += sz;
>>> >>
>>> >>   return 0;
>>> >> diff --git a/fs/pstore/ram_core.c b/fs/pstore/ram_core.c
>>> >> index 12e21f789194..2ededd1ea1c2 100644
>>> >> --- a/fs/pstore/ram_core.c
>>> >> +++ b/fs/pstore/ram_core.c
>>> >> @@ -505,15 +505,16 @@ static int persistent_ram_post_init(struct 
>>> >> persistent_ram_zone *prz, u32 sig,
>>> >>   pr_debug("found existing buffer, size %zu, start 
>>> >> %zu\n",
>>> >>buffer_size(prz), buffer_start(prz));
>>> >>   persistent_ram_save_old(prz);
>>> >> - return 0;
>>> >> + if (!(prz->flags & PRZ_FLAG_ZAP_OLD))
>>> >> + return 0;
>>> >
>>> > This could be written differently.
>>> >
>>> > We could just do:
>>> >
>>> > if (prz->flags & PRZ_FLAG_ZAP_OLD)
>>> > persistent_ram_zap(prz);
>>> >
>>> > And remove the zap from below.
>>>
>>> I actually rearranged things a little to avoid additional round-trips
>>> on the mailing list. :)
>>>
>>> > Since Kees already took this patch, I can just patch this in my series if
>>> > Kees and you are Ok with this suggestion.
>>>
>>> I've put it up here:
>>> https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/commit/?h=pstore/devel&id=ac564e023248e3f4d87917b91d12376ddfca5000
>>
>>Cool, it LGTM :)
>>
>>- Joel
>>
>
> Thank you all for these warm help.
>
> This is my first time submitting a patch to the community. It feels great!

Congrats and welcome to the mother ship ;-)

 - Joel


Re: [RFC] doc: rcu: remove note on smp_mb during synchronize_rcu

2018-10-30 Thread Joel Fernandes
Hi Paul,

On Tue, Oct 30, 2018 at 04:43:36PM -0700, Paul E. McKenney wrote:
> On Tue, Oct 30, 2018 at 03:26:49PM -0700, Joel Fernandes wrote:
> > Hi Paul,
> > 
> > On Sat, Oct 27, 2018 at 09:30:46PM -0700, Joel Fernandes (Google) wrote:
> > > As per this thread [1], it seems this smp_mb isn't needed anymore:
> > > "So the smp_mb() that I was trying to add doesn't need to be there."
> > > 
> > > So let us remove this part from the memory ordering documentation.
> > > 
> > > [1] https://lkml.org/lkml/2017/10/6/707
> > > 
> > > Signed-off-by: Joel Fernandes (Google) 
> > 
> > I was just checking about this patch. Do you feel it is correct to remove
> > this part from the docs? Are you satisified that a barrier isn't needed 
> > there
> > now? Or did I miss something?
> 
> Apologies, it got lost in the shuffle.  I have now applied it with a
> bit of rework to the commit log, thank you!

No worries, thanks for taking it!

Just wanted to update you on my progress reading/correcting the docs. The
'Memory Ordering' is taking a bit of time so I paused that and I'm focusing
on finishing all the other low hanging fruit. This activity is mostly during
night hours after the baby is asleep but sometimes I also manage to sneak it
into the day job ;-)

BTW I do want to discuss this smp_mb patch above with you at LPC if you
had time, even though we are removing it from the documentation. I thought
about it a few times, and I was not able to fully appreciate the need for the
barrier (that is even assuming that complete() etc did not do the right
thing).  Specifically I was wondering the same thing Peter said in the above
thread I think that - if that rcu_read_unlock() triggered all the spin
locking up the tree of nodes, then why is that locking not sufficient to
prevent reads from the read-side section from bleeding out? That would
prevent the reader that just unlocked from seeing anything that happens
_after_ the synchronize_rcu.

Also about GP memory ordering and RCU-tree-locking, I think you mentioned to
me that the RCU reader-sections are virtually extended both forward and
backward and wherever it ends, those paths do heavy-weight synchronization
that should be sufficient to prevent memory ordering issues (such as those
you mentioned in the Requirements document). That is exactly why we don't
need explicit barriers during rcu_read_unlock. If I recall I asked you why
those are not needed. So that answer made sense, but then now on going
through the 'Memory Ordering' document, I see that you mentioned there is
reliance on the locking. Is that reliance on locking necessary to maintain
ordering then?
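(For concreteness, the locking pattern I mean, a sketch based on my reading
of the tree code, with rnp and flags assumed in scope:)

	/* Each rcu_node ->lock acquisition is immediately followed by
	 * smp_mb__after_unlock_lock(), so each unlock+lock pair in the
	 * "network" acts as a full memory barrier. */
	raw_spin_lock_irqsave(&rnp->lock, flags);
	smp_mb__after_unlock_lock();
	/* ... grace-period state updates ... */
	raw_spin_unlock_irqrestore(&rnp->lock, flags);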

Or did I miss the points completely? :(

--
TODO list of the index file marking which ones I have finished perusing:

arrayRCU.txtDONE
checklist.txt   DONE
listRCU.txt DONE
lockdep.txt DONE
lockdep-splat.txt   DONE
NMI-RCU.txt
rcu_dereference.txt
rcubarrier.txt
rculist_nulls.txt
rcuref.txt
rcu.txt
RTFP.txtDONE
stallwarn.txt   DONE
torture.txt
UP.txt
whatisRCU.txt   DONE

Design
 - Data-Structures  DONE
 - Requirements DONE
 - Expedited-Grace-Periods
 - Memory Ordering  next



Re: [RFC] rcu: doc: update example about stale data

2018-10-30 Thread Joel Fernandes
On Tue, Oct 30, 2018 at 04:50:39PM -0700, Paul E. McKenney wrote:
> On Sun, Oct 28, 2018 at 06:16:31PM -0700, Joel Fernandes wrote:
> > On Sun, Oct 28, 2018 at 10:21:42AM -0700, Paul E. McKenney wrote:
> > > On Sat, Oct 27, 2018 at 07:16:53PM -0700, Joel Fernandes (Google) wrote:
> > > > The RCU example for 'rejecting stale data' on system-call auditing
> > > > stops iterating through the rules if a deleted one is found. It makes
> > > > more sense to continue looking at other rules once a deleted one is
> > > > rejected. Although the original example is fine, this makes it more
> > > > meaningful.
> > > > 
> > > > Signed-off-by: Joel Fernandes (Google) 
> > > 
> > > Does the actual audit code that this was copied from now include the
> > > continue statement?  If so, please update the commit log to state that
> > > and then I will take the resulting patch.  (This example was inspired
> > > by a long-ago version of the actual audit code.)
> > 
> > The document talks of a situation that could arise but is not actually in the
> > implementation. It says "If the system-call audit module were to ever need
> > to reject stale data". So it's not really something implemented. I was just
> > correcting the example you had there since it made more sense to me to
> > continue looking for other rules in the list once a rule was shown to be
> > stale. It just makes the example more correct.
> > 
> > But I'm Ok if you want to leave that alone ;-) Hence, the RFC tag to this
> > patch ;-)
> 
> Well, I do agree that there are situations where you need to keep
> going.  But in the common case where only one instance of a given key
> is allowed, and where the list is either (1) sorted and/or (2) added
> to at the beginning, if you find a deleted element with a given key,
> you are guaranteed that you won't find another with that key even if
> you continue scanning the list.  After all, if you did find a deleted
> element, the duplicate either is not on the list in the sorted case
> or is behind you in the add-at-front case.
> 
> And in the more complex cases where persistent searching is required,
> you usually have to restart the search instead of continuing it.  Besides,
> things like the Issaquah Challenge don't seem to belong in introductory
> documentation on RCU.  ;-)

Ok, agreed. Let's drop this :)

-Joel



Re: [RFC PATCH] Implement /proc/pid/kill

2018-10-30 Thread Joel Fernandes
On Tue, Oct 30, 2018 at 11:10:47PM +, Daniel Colascione wrote:
> On Tue, Oct 30, 2018 at 10:33 PM, Joel Fernandes  
> wrote:
> > On Wed, Oct 31, 2018 at 09:23:39AM +1100, Aleksa Sarai wrote:
> >> On 2018-10-30, Joel Fernandes  wrote:
> >> > On Wed, Oct 31, 2018 at 07:45:01AM +1100, Aleksa Sarai wrote:
> >> > [...]
> >> > > > > (Unfortunately
> >> > > > > there are lots of things that make it a bit difficult to use 
> >> > > > > /proc/$pid
> >> > > > > exclusively for introspection of a process -- especially in the 
> >> > > > > context
> >> > > > > of containers.)
> >> > > >
> >> > > > Tons of things already break without a working /proc. What do you 
> >> > > > have in mind?
> >> > >
> >> > > Heh, if only that was the only blocker. :P
> >> > >
> >> > > The basic problem is that currently container runtimes either depend on
> >> > > some non-transient on-disk state (which becomes invalid on machine
> >> > > reboots or dead processes and so on), or on long-running processes that
> >> > > keep file descriptors required for administration of a container alive
> >> > > (think O_PATH to /dev/pts/ptmx to avoid malicious container filesystem
> >> > > attacks). Usually both.
> >> > >
> >> > > What would be really useful would be having some way of "hiding away" a
> >> > > mount namespace (of the pid1 of the container) that has all of the
> >> > > information and bind-mounts-to-file-descriptors that are necessary for
> >> > > administration. If the container's pid1 dies all of the transient state
> >> > > has disappeared automatically -- because the stashed mount namespace 
> >> > > has
> >> > > died. In addition, if this was done the way I'm thinking with (and this
> >> > > is the contentious bit) hierarchical mount namespaces you could make it
> >> > > so that the pid1 could not manipulate its current mount namespace to
> >> > > confuse the administrative process. You would also then create an
> >> > > intermediate user namespace to help with several race conditions (that
> >> > > have caused security bugs like CVE-2016-9962) we've seen when joining
> >> > > containers.
> >> > >
> >> > > Unfortunately this all depends on hierarchical mount namespaces (and
> >> > > note that this would just be that NS_GET_PARENT gives you the mount
> >> > > namespace that it was created in -- I'm not suggesting we redesign 
> >> > > peers
> >> > > or anything like that). This makes it basically a non-starter.
> >> > >
> >> > > But if, on top of this ground-work, we then referenced containers
> >> > > entirely via an fd to /proc/$pid then you could also avoid PID reuse
> >> > > races (as well as being able to find out implicitly whether a container
> >> > > has died thanks to the error semantics of /proc/$pid). And that's the
> >> > > way I would suggest doing it (if we had these other things in place).
> >> >
> >> > I didn't fully follow exactly what you mean. If you can explain for the
> >> > layman who doesn't have much experience with containers..
> >> >
> >> > Are you saying that keeping open a /proc/$pid directory handle is not
> >> > sufficient to prevent PID reuse while the proc entries under /proc/$pid 
> >> > are
> >> > being looked into? If it's not sufficient, then isn't that a bug? If it is
> >> > sufficient, then can we not just keep the handle open while we do 
> >> > whatever we
> >> > want under /proc/$pid ?
> >>
> >> Sorry, I went on a bit of a tangent about various internals of container
> >> runtimes. My main point is that I would love to use /proc/$pid because
> >> it makes reuse handling very trivial and is always correct, but that
> >> there are things which stop us from being able to use it for everything
> >> (which is what my incoherent rambling was on about).
> >
> > Ok thanks. So I am guessing if the following sequence works, then Dan's
> > patch is not needed.
> >
> > 1. open /proc/<pid> directory
> > 2. inspect /proc/<pid> or do whatever with <pid>
> > 3. Issue the kill on <pid>
> > 4. Close the /proc/<pid> directory opened in step 1.

Re: [RFC PATCH] Implement /proc/pid/kill

2018-10-30 Thread Joel Fernandes
On Wed, Oct 31, 2018 at 09:49:08AM +1100, Aleksa Sarai wrote:
> On 2018-10-30, Joel Fernandes  wrote:
> > > > [...] 
> > > > > > > (Unfortunately
> > > > > > > there are lots of things that make it a bit difficult to use 
> > > > > > > /proc/$pid
> > > > > > > exclusively for introspection of a process -- especially in the 
> > > > > > > context
> > > > > > > of containers.)
> > > > > > 
> > > > > > Tons of things already break without a working /proc. What do you 
> > > > > > have in mind?
> > > > > 
> > > > > Heh, if only that was the only blocker. :P
> > > > > 
> > > > > The basic problem is that currently container runtimes either depend 
> > > > > on
> > > > > some non-transient on-disk state (which becomes invalid on machine
> > > > > reboots or dead processes and so on), or on long-running processes 
> > > > > that
> > > > > keep file descriptors required for administration of a container alive
> > > > > (think O_PATH to /dev/pts/ptmx to avoid malicious container filesystem
> > > > > attacks). Usually both.
> > > > > 
> > > > > What would be really useful would be having some way of "hiding away" 
> > > > > a
> > > > > mount namespace (of the pid1 of the container) that has all of the
> > > > > information and bind-mounts-to-file-descriptors that are necessary for
> > > > > administration. If the container's pid1 dies all of the transient 
> > > > > state
> > > > > has disappeared automatically -- because the stashed mount namespace 
> > > > > has
> > > > > died. In addition, if this was done the way I'm thinking with (and 
> > > > > this
> > > > > is the contentious bit) hierarchical mount namespaces you could make 
> > > > > it
> > > > > so that the pid1 could not manipulate its current mount namespace to
> > > > > confuse the administrative process. You would also then create an
> > > > > intermediate user namespace to help with several race conditions (that
> > > > > have caused security bugs like CVE-2016-9962) we've seen when joining
> > > > > containers.
> > > > > 
> > > > > Unfortunately this all depends on hierarchical mount namespaces (and
> > > > > note that this would just be that NS_GET_PARENT gives you the mount
> > > > > namespace that it was created in -- I'm not suggesting we redesign 
> > > > > peers
> > > > > or anything like that). This makes it basically a non-starter.
> > > > > 
> > > > > But if, on top of this ground-work, we then referenced containers
> > > > > entirely via an fd to /proc/$pid then you could also avoid PID reuse
> > > > > races (as well as being able to find out implicitly whether a 
> > > > > container
> > > > > has died thanks to the error semantics of /proc/$pid). And that's the
> > > > > way I would suggest doing it (if we had these other things in place).
> > > > 
> > > > I didn't fully follow exactly what you mean. If you can explain for the
> > > > layman who doesn't have much experience with containers..
> > > > 
> > > > Are you saying that keeping open a /proc/$pid directory handle is not
> > > > sufficient to prevent PID reuse while the proc entries under /proc/$pid 
> > > > are
> > > > being looked into? If it's not sufficient, then isn't that a bug? If it
> > > > is
> > > > sufficient, then can we not just keep the handle open while we do 
> > > > whatever we
> > > > want under /proc/$pid ?
> > > 
> > > Sorry, I went on a bit of a tangent about various internals of container
> > > runtimes. My main point is that I would love to use /proc/$pid because
> > > it makes reuse handling very trivial and is always correct, but that
> > > there are things which stop us from being able to use it for everything
> > > (which is what my incoherent rambling was on about).
> > 
> > Ok thanks. So I am guessing if the following sequence works, then Dan's
> > patch is not needed.
> > 
> > 1. open /proc/<pid> directory
> > 2. inspect /proc/<pid> or do whatever with <pid>
> > 3. Issue the kill on <pid>
> > 4. Close the /proc/<pid> directory opened in step 1.

Re: [RFC PATCH] Implement /proc/pid/kill

2018-10-30 Thread Joel Fernandes
On Wed, Oct 31, 2018 at 09:23:39AM +1100, Aleksa Sarai wrote:
> On 2018-10-30, Joel Fernandes  wrote:
> > On Wed, Oct 31, 2018 at 07:45:01AM +1100, Aleksa Sarai wrote:
> > [...] 
> > > > > (Unfortunately
> > > > > there are lots of things that make it a bit difficult to use 
> > > > > /proc/$pid
> > > > > exclusively for introspection of a process -- especially in the 
> > > > > context
> > > > > of containers.)
> > > > 
> > > > Tons of things already break without a working /proc. What do you have 
> > > > in mind?
> > > 
> > > Heh, if only that was the only blocker. :P
> > > 
> > > The basic problem is that currently container runtimes either depend on
> > > some non-transient on-disk state (which becomes invalid on machine
> > > reboots or dead processes and so on), or on long-running processes that
> > > keep file descriptors required for administration of a container alive
> > > (think O_PATH to /dev/pts/ptmx to avoid malicious container filesystem
> > > attacks). Usually both.
> > > 
> > > What would be really useful would be having some way of "hiding away" a
> > > mount namespace (of the pid1 of the container) that has all of the
> > > information and bind-mounts-to-file-descriptors that are necessary for
> > > administration. If the container's pid1 dies all of the transient state
> > > has disappeared automatically -- because the stashed mount namespace has
> > > died. In addition, if this was done the way I'm thinking with (and this
> > > is the contentious bit) hierarchical mount namespaces you could make it
> > > so that the pid1 could not manipulate its current mount namespace to
> > > confuse the administrative process. You would also then create an
> > > intermediate user namespace to help with several race conditions (that
> > > have caused security bugs like CVE-2016-9962) we've seen when joining
> > > containers.
> > > 
> > > Unfortunately this all depends on hierarchical mount namespaces (and
> > > note that this would just be that NS_GET_PARENT gives you the mount
> > > namespace that it was created in -- I'm not suggesting we redesign peers
> > > or anything like that). This makes it basically a non-starter.
> > > 
> > > But if, on top of this ground-work, we then referenced containers
> > > entirely via an fd to /proc/$pid then you could also avoid PID reuse
> > > races (as well as being able to find out implicitly whether a container
> > > has died thanks to the error semantics of /proc/$pid). And that's the
> > > way I would suggest doing it (if we had these other things in place).
> > 
> > I didn't fully follow exactly what you mean. If you can explain for the
> > layman who doesn't have much experience with containers..
> > 
> > Are you saying that keeping open a /proc/$pid directory handle is not
> > sufficient to prevent PID reuse while the proc entries under /proc/$pid are
> > being looked into? If it's not sufficient, then isn't that a bug? If it is
> > sufficient, then can we not just keep the handle open while we do whatever 
> > we
> > want under /proc/$pid ?
> 
> Sorry, I went on a bit of a tangent about various internals of container
> runtimes. My main point is that I would love to use /proc/$pid because
> it makes reuse handling very trivial and is always correct, but that
> there are things which stop us from being able to use it for everything
> (which is what my incoherent rambling was on about).

Ok thanks. So I am guessing if the following sequence works, then Dan's patch
is not needed.

1. open /proc/<pid> directory
2. inspect /proc/<pid> or do whatever with <pid>
3. Issue the kill on <pid>
4. Close the /proc/<pid> directory opened in step 1.

So unless I missed something, the above sequence will not cause any PID reuse
races.
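Expressed as code, the sequence would be roughly (illustrative sketch, error
handling elided):

	#include <fcntl.h>
	#include <signal.h>
	#include <stdio.h>
	#include <sys/types.h>
	#include <unistd.h>

	static void inspect_then_kill(pid_t pid)
	{
		char path[32];
		int dirfd, fd;

		snprintf(path, sizeof(path), "/proc/%d", (int)pid);
		dirfd = open(path, O_RDONLY | O_DIRECTORY);	/* step 1 */
		fd = openat(dirfd, "cmdline", O_RDONLY);	/* step 2 */
		/* ... read fd and inspect ... */
		kill(pid, SIGTERM);				/* step 3 */
		close(fd);
		close(dirfd);					/* step 4 */
	}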

- Joel




Re: [RFC] doc: rcu: remove note on smp_mb during synchronize_rcu

2018-10-30 Thread Joel Fernandes
Hi Paul,

On Sat, Oct 27, 2018 at 09:30:46PM -0700, Joel Fernandes (Google) wrote:
> As per this thread [1], it seems this smp_mb isn't needed anymore:
> "So the smp_mb() that I was trying to add doesn't need to be there."
> 
> So let us remove this part from the memory ordering documentation.
> 
> [1] https://lkml.org/lkml/2017/10/6/707
> 
> Signed-off-by: Joel Fernandes (Google) 

I was just checking about this patch. Do you feel it is correct to remove
this part from the docs? Are you satisified that a barrier isn't needed there
now? Or did I miss something?

thanks,

- Joel


> ---
>  .../Tree-RCU-Memory-Ordering.html | 32 +-------------------------------
>  1 file changed, 1 insertion(+), 31 deletions(-)
> 
> diff --git 
> a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html 
> b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html
> index a346ce0116eb..0fb1511763d4 100644
> --- a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html
> +++ b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html
> @@ -77,7 +77,7 @@ The key point is that the lock-acquisition functions, 
> including
>  smp_mb__after_unlock_lock() immediately after successful
>  acquisition of the lock.
>  
> -Therefore, for any given rcu_node struction, any access
> +Therefore, for any given rcu_node structure, any access
>  happening before one of the above lock-release functions will be seen
>  by all CPUs as happening before any access happening after a later
>  one of the above lock-acquisition functions.
> @@ -162,36 +162,6 @@ an atomic_add_return() of zero) to detect idle 
> CPUs.
>  
>  
>  
> -The approach must be extended to handle one final case, that
> -of waking a task blocked in synchronize_rcu().
> -This task might be affinitied to a CPU that is not yet aware that
> -the grace period has ended, and thus might not yet be subject to
> -the grace period's memory ordering.
> -Therefore, there is an smp_mb() after the return from
> -wait_for_completion() in the synchronize_rcu()
> -code path.
> -
> -
> -
> -Quick Quiz:
> -
> - What?  Where???
> - I don't see any smp_mb() after the return from
> - wait_for_completion()!!!
> -
> -Answer:
> -
> - That would be because I spotted the need for that
> - smp_mb() during the creation of this documentation,
> - and it is therefore unlikely to hit mainline before v4.14.
> - Kudos to Lance Roy, Will Deacon, Peter Zijlstra, and
> - Jonathan Cameron for asking questions that sensitized me
> - to the rather elaborate sequence of events that demonstrate
> - the need for this memory barrier.
> -
> -
> -
> -
>  Tree RCU's grace--period memory-ordering guarantees rely most
>  heavily on the rcu_node structure's -lock
>  field, so much so that it is necessary to abbreviate this pattern
> -- 
> 2.19.1.568.g152ad8e336-goog
> 


Re: [RFC PATCH] Minimal non-child process exit notification support

2018-10-30 Thread Joel Fernandes
On Tue, Oct 30, 2018 at 08:59:25AM +, Daniel Colascione wrote:
> On Tue, Oct 30, 2018 at 3:06 AM, Joel Fernandes  wrote:
> > On Mon, Oct 29, 2018 at 1:01 PM Daniel Colascione  wrote:
> >>
> >> Thanks for taking a look.
> >>
> >> On Mon, Oct 29, 2018 at 7:45 PM, Joel Fernandes  wrote:
> >> >
> >> > On Mon, Oct 29, 2018 at 10:53 AM Daniel Colascione  
> >> > wrote:
> >> > >
> >> > > This patch adds a new file under /proc/pid, /proc/pid/exithand.
> >> > > Attempting to read from an exithand file will block until the
> >> > > corresponding process exits, at which point the read will successfully
> >> > > complete with EOF.  The file descriptor supports both blocking
> >> > > operations and poll(2). It's intended to be a minimal interface for
> >> > > allowing a program to wait for the exit of a process that is not one
> >> > > of its children.
> >> > >
> >> > > Why might we want this interface? Android's lmkd kills processes in
> >> > > order to free memory in response to various memory pressure
> >> > > signals. It's desirable to wait until a killed process actually exits
> >> > > before moving on (if needed) to killing the next process. Since the
> >> > > processes that lmkd kills are not lmkd's children, lmkd currently
> >> > > lacks a way to wait for a process to actually die after being sent
> >> > > SIGKILL; today, lmkd resorts to polling the proc filesystem pid
> >> >
> >> > Any idea why it needs to wait and then send SIGKILL? Why not do
> >> > SIGKILL and look for errno == ESRCH in a loop with a delay.
> >>
> >> I want to get polling loops out of the system. Polling loops are bad
> >> for wakeup attribution, bad for power, bad for priority inheritance,
> >> and bad for latency. There's no right answer to the question "How long
> >> should I wait before checking $CONDITION again?". If we can have an
> >> explicit waitqueue interface to something, we should. Besides, PID
> >> polling is vulnerable to PID reuse, whereas this mechanism (just like
> >> anything based on struct pid) is immune to it.
> >
> > The argument sounds Ok to me. I would also add more details in the commit
> > message about the alternate methods to do this (such as kill polling
> > or ptrace) and why they don't work well etc., so no one asks any
> > questions. Like maybe under an "other ways to do this" section. A bit
> > of googling also showed a netlink way of doing it without polling
> > (though I don't look into that much and wouldn't be surprised if its
> > more complicated)
> 
> Thanks for taking a look. I'll add to the commit message.
> 
> Re: netlink isn't enabled everywhere and is subject to lossy buffer
> overruns, AIUI. You could also monitor process exit by setting up
> ftrace and watching events, or by installing BPF that watched for
> process exit and sent a perf event. :-) All of these interfaces feel
> like abusing a "monitoring" API for controlling system operations, and
> this kind of abuse tends to have ugly failure modes. I'm looking for
> something a bit more explicit and robust.

Sounds good to me!

 - Joel



Re: [PATCH tip/core/rcu 02/19] rcu: Defer reporting RCU-preempt quiescent states when disabled

2018-10-30 Thread Joel Fernandes
On Tue, Oct 30, 2018 at 05:58:00AM -0700, Paul E. McKenney wrote:
> On Mon, Oct 29, 2018 at 08:44:52PM -0700, Joel Fernandes wrote:
> > On Mon, Oct 29, 2018 at 07:27:35AM -0700, Paul E. McKenney wrote:
> > > On Mon, Oct 29, 2018 at 11:24:42AM +, Ran Rozenstein wrote:
> > > > Hi Paul and all,
> > > > 
> > > > > -Original Message-
> > > > > From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> > > > > ow...@vger.kernel.org] On Behalf Of Paul E. McKenney
> > > > > Sent: Thursday, August 30, 2018 01:21
> > > > > To: linux-kernel@vger.kernel.org
> > > > > Cc: mi...@kernel.org; jiangshan...@gmail.com; dipan...@in.ibm.com;
> > > > > a...@linux-foundation.org; mathieu.desnoy...@efficios.com;
> > > > > j...@joshtriplett.org; t...@linutronix.de; pet...@infradead.org;
> > > > > rost...@goodmis.org; dhowe...@redhat.com; eduma...@google.com;
> > > > > fweis...@gmail.com; o...@redhat.com; j...@joelfernandes.org; Paul E.
> > > > > McKenney 
> > > > > Subject: [PATCH tip/core/rcu 02/19] rcu: Defer reporting RCU-preempt
> > > > > quiescent states when disabled
> > > > > 
> > > > > This commit defers reporting of RCU-preempt quiescent states at
> > > > > rcu_read_unlock_special() time when any of interrupts, softirq, or
> > > > > preemption are disabled.  These deferred quiescent states are 
> > > > > reported at a
> > > > > later RCU_SOFTIRQ, context switch, idle entry, or CPU-hotplug offline
> > > > > operation.  Of course, if another RCU read-side critical section has 
> > > > > started in
> > > > > the meantime, the reporting of the quiescent state will be further 
> > > > > deferred.
> > > > > 
> > > > > This also means that disabling preemption, interrupts, and/or 
> > > > > softirqs will act
> > > > > as an RCU-preempt read-side critical section.
> > > > > This is enforced by checking preempt_count() as needed.
> > > > > 
> > > > > Some special cases must be handled on an ad-hoc basis, for example,
> > > > > context switch is a quiescent state even though both the scheduler and
> > > > > do_exit() disable preemption.  In these cases, additional calls to
> > > > > rcu_preempt_deferred_qs() override the preemption disabling.  Similar 
> > > > > logic
> > > > > overrides disabled interrupts in rcu_preempt_check_callbacks() 
> > > > > because in
> > > > > this case the quiescent state happened just before the corresponding
> > > > > scheduling-clock interrupt.
> > > > > 
> > > > > In theory, this change lifts a long-standing restriction that 
> > > > > required that if
> > > > > interrupts were disabled across a call to rcu_read_unlock() that the 
> > > > > matching
> > > > > rcu_read_lock() also be contained within that interrupts-disabled 
> > > > > region of
> > > > > code.  Because the reporting of the corresponding RCU-preempt 
> > > > > quiescent
> > > > > state is now deferred until after interrupts have been enabled, it is 
> > > > > no longer
> > > > > possible for this situation to result in deadlocks involving the 
> > > > > scheduler's
> > > > > runqueue and priority-inheritance locks.  This may allow some code
> > > > > simplification that might reduce interrupt latency a bit.  
> > > > > Unfortunately, in
> > > > > practice this would also defer deboosting a low-priority task that 
> > > > > had been
> > > > > subjected to RCU priority boosting, so real-time-response 
> > > > > considerations
> > > > > might well force this restriction to remain in place.
> > > > > 
> > > > > Because RCU-preempt grace periods are now blocked not only by RCU 
> > > > > read-
> > > > > side critical sections, but also by disabling of interrupts, 
> > > > > preemption, and
> > > > > softirqs, it will be possible to eliminate RCU-bh and RCU-sched in 
> > > > > favor of
> > > > > RCU-preempt in CONFIG_PREEMPT=y kernels.  This may require some
> > > > > additional plumbing to provide the network denial-of-service 
> > > > > guarant

Re: [PATCH v4] pstore: Avoid duplicate call of persistent_ram_zap()

2018-10-30 Thread Joel Fernandes
On Tue, Oct 30, 2018 at 02:52:43PM -0700, Kees Cook wrote:
> On Tue, Oct 30, 2018 at 2:38 PM, Joel Fernandes  
> wrote:
> > On Tue, Oct 30, 2018 at 03:52:34PM +0800, Peng Wang wrote:
> >> When initializing prz with invalid data in the buffer (no PERSISTENT_RAM_SIG),
> >> function call path is like this:
> >>
> >> ramoops_init_prz ->
> >> |
> >> |-> persistent_ram_new -> persistent_ram_post_init -> persistent_ram_zap
> >> |
> >> |-> persistent_ram_zap
> >>
> >> As we can see, persistent_ram_zap() is called twice.
> >> We can avoid this by adding an option to persistent_ram_new(), and
> >> only call persistent_ram_zap() when it is needed.
> >>
> >> Signed-off-by: Peng Wang 
> >> ---
> >>  fs/pstore/ram.c| 4 +---
> >>  fs/pstore/ram_core.c   | 5 +++--
> >>  include/linux/pstore_ram.h | 1 +
> >>  3 files changed, 5 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/fs/pstore/ram.c b/fs/pstore/ram.c
> >> index ffcff6516e89..b51901f97dc2 100644
> >> --- a/fs/pstore/ram.c
> >> +++ b/fs/pstore/ram.c
> >> @@ -640,7 +640,7 @@ static int ramoops_init_prz(const char *name,
> >>
> >>   label = kasprintf(GFP_KERNEL, "ramoops:%s", name);
> >>   *prz = persistent_ram_new(*paddr, sz, sig, &cxt->ecc_info,
> >> -   cxt->memtype, 0, label);
> >> +   cxt->memtype, PRZ_FLAG_ZAP_OLD, label);
> >>   if (IS_ERR(*prz)) {
> >>   int err = PTR_ERR(*prz);
> >
> > Looks good to me except the minor comment below:
> >
> >>
> >> @@ -649,8 +649,6 @@ static int ramoops_init_prz(const char *name,
> >>   return err;
> >>   }
> >>
> >> - persistent_ram_zap(*prz);
> >> -
> >>   *paddr += sz;
> >>
> >>   return 0;
> >> diff --git a/fs/pstore/ram_core.c b/fs/pstore/ram_core.c
> >> index 12e21f789194..2ededd1ea1c2 100644
> >> --- a/fs/pstore/ram_core.c
> >> +++ b/fs/pstore/ram_core.c
> >> @@ -505,15 +505,16 @@ static int persistent_ram_post_init(struct 
> >> persistent_ram_zone *prz, u32 sig,
> >>   pr_debug("found existing buffer, size %zu, start 
> >> %zu\n",
> >>buffer_size(prz), buffer_start(prz));
> >>   persistent_ram_save_old(prz);
> >> - return 0;
> >> + if (!(prz->flags & PRZ_FLAG_ZAP_OLD))
> >> + return 0;
> >
> > This could be written differently.
> >
> > We could just do:
> >
> > if (prz->flags & PRZ_FLAG_ZAP_OLD)
> > persistent_ram_zap(prz);
> >
> > And remove the zap from below.
> 
> I actually rearranged things a little to avoid additional round-trips
> on the mailing list. :)
> 
> > Since Kees already took this patch, I can just patch this in my series if
> > Kees and you are Ok with this suggestion.
> 
> I've put it up here:
> https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/commit/?h=pstore/devel&id=ac564e023248e3f4d87917b91d12376ddfca5000

Cool, it LGTM :)

- Joel



Re: [RFC PATCH] Implement /proc/pid/kill

2018-10-30 Thread Joel Fernandes
On Wed, Oct 31, 2018 at 07:45:01AM +1100, Aleksa Sarai wrote:
[...] 
> > > (Unfortunately
> > > there are lots of things that make it a bit difficult to use /proc/$pid
> > > exclusively for introspection of a process -- especially in the context
> > > of containers.)
> > 
> > Tons of things already break without a working /proc. What do you have in 
> > mind?
> 
> Heh, if only that was the only blocker. :P
> 
> The basic problem is that currently container runtimes either depend on
> some non-transient on-disk state (which becomes invalid on machine
> reboots or dead processes and so on), or on long-running processes that
> keep file descriptors required for administration of a container alive
> (think O_PATH to /dev/pts/ptmx to avoid malicious container filesystem
> attacks). Usually both.
> 
> What would be really useful would be having some way of "hiding away" a
> mount namespace (of the pid1 of the container) that has all of the
> information and bind-mounts-to-file-descriptors that are necessary for
> administration. If the container's pid1 dies all of the transient state
> has disappeared automatically -- because the stashed mount namespace has
> died. In addition, if this was done the way I'm thinking with (and this
> is the contentious bit) hierarchical mount namespaces you could make it
> so that the pid1 could not manipulate its current mount namespace to
> confuse the administrative process. You would also then create an
> intermediate user namespace to help with several race conditions (that
> have caused security bugs like CVE-2016-9962) we've seen when joining
> containers.
> 
> Unfortunately this all depends on hierarchical mount namespaces (and
> note that this would just be that NS_GET_PARENT gives you the mount
> namespace that it was created in -- I'm not suggesting we redesign peers
> or anything like that). This makes it basically a non-starter.
> 
> But if, on top of this ground-work, we then referenced containers
> entirely via an fd to /proc/$pid then you could also avoid PID reuse
> races (as well as being able to find out implicitly whether a container
> has died thanks to the error semantics of /proc/$pid). And that's the
> way I would suggest doing it (if we had these other things in place).

I didn't fully follow exactly what you mean. If you can explain for the
layman who doesn't have much experience with containers..

Are you saying that keeping open a /proc/$pid directory handle is not
sufficient to prevent PID reuse while the proc entries under /proc/$pid are
being looked into? If it's not sufficient, then isn't that a bug? If it is
sufficient, then can we not just keep the handle open while we do whatever we
want under /proc/$pid?
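To make the question concrete, the pattern I have in mind is roughly the
following user-space sketch (illustrative only, error handling trimmed):

#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hold /proc/$pid open as a directory fd; later lookups go through
 * openat() on that fd instead of re-resolving /proc/$pid, so they keep
 * referring to the same /proc entry for as long as the fd is open.
 */
int inspect(pid_t pid)
{
	char path[64];
	int dirfd, fd;

	snprintf(path, sizeof(path), "/proc/%d", (int)pid);
	dirfd = open(path, O_RDONLY | O_DIRECTORY);
	if (dirfd < 0)
		return -1;

	fd = openat(dirfd, "status", O_RDONLY);
	if (fd >= 0) {
		/* ... read and parse "status" here ... */
		close(fd);
	}

	close(dirfd);
	return 0;
}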

- Joel



Re: [PATCH v4] pstore: Avoid duplicate call of persistent_ram_zap()

2018-10-30 Thread Joel Fernandes
On Tue, Oct 30, 2018 at 03:52:34PM +0800, Peng Wang wrote:
> When initializing prz with invalid data in buffer (no PERSISTENT_RAM_SIG),
> function call path is like this:
> 
> ramoops_init_prz ->
> |
> |-> persistent_ram_new -> persistent_ram_post_init -> persistent_ram_zap
> |
> |-> persistent_ram_zap
> 
> As we can see, persistent_ram_zap() is called twice.
> We can avoid this by adding an option to persistent_ram_new(), and
> only call persistent_ram_zap() when it is needed.
> 
> Signed-off-by: Peng Wang 
> ---
>  fs/pstore/ram.c| 4 +---
>  fs/pstore/ram_core.c   | 5 +++--
>  include/linux/pstore_ram.h | 1 +
>  3 files changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/pstore/ram.c b/fs/pstore/ram.c
> index ffcff6516e89..b51901f97dc2 100644
> --- a/fs/pstore/ram.c
> +++ b/fs/pstore/ram.c
> @@ -640,7 +640,7 @@ static int ramoops_init_prz(const char *name,
>  
>   label = kasprintf(GFP_KERNEL, "ramoops:%s", name);
>   *prz = persistent_ram_new(*paddr, sz, sig, &cxt->ecc_info,
> -   cxt->memtype, 0, label);
> +   cxt->memtype, PRZ_FLAG_ZAP_OLD, label);
>   if (IS_ERR(*prz)) {
>   int err = PTR_ERR(*prz);

Looks good to me except the minor comment below:

>  
> @@ -649,8 +649,6 @@ static int ramoops_init_prz(const char *name,
>   return err;
>   }
>  
> - persistent_ram_zap(*prz);
> -
>   *paddr += sz;
>  
>   return 0;
> diff --git a/fs/pstore/ram_core.c b/fs/pstore/ram_core.c
> index 12e21f789194..2ededd1ea1c2 100644
> --- a/fs/pstore/ram_core.c
> +++ b/fs/pstore/ram_core.c
> @@ -505,15 +505,16 @@ static int persistent_ram_post_init(struct 
> persistent_ram_zone *prz, u32 sig,
>   pr_debug("found existing buffer, size %zu, start %zu\n",
>buffer_size(prz), buffer_start(prz));
>   persistent_ram_save_old(prz);
> - return 0;
> + if (!(prz->flags & PRZ_FLAG_ZAP_OLD))
> + return 0;

This could be written differently.

We could just do:

if (prz->flags & PRZ_FLAG_ZAP_OLD)
persistent_ram_zap(prz);

And remove the zap from below.

Since Kees already took this patch, I can just patch this in my series if
Kees and you are Ok with this suggestion.

Sorry for the delay in my RFC series, I just got back from paternity leave
and I'm catching up with email.

thanks,

- Joel

>   }
>   } else {
>   pr_debug("no valid data in buffer (sig = 0x%08x)\n",
>prz->buffer->sig);
> + prz->buffer->sig = sig;
>   }
>  
>   /* Rewind missing or invalid memory area. */
> - prz->buffer->sig = sig;
>   persistent_ram_zap(prz);
>  
>   return 0;


Re: [RFC PATCH] Implement /proc/pid/kill

2018-10-30 Thread Joel Fernandes
On Tue, Oct 30, 2018 at 1:50 AM Daniel Colascione  wrote:
>
> On Tue, Oct 30, 2018 at 3:21 AM, Joel Fernandes  wrote:
> > On Mon, Oct 29, 2018 at 3:11 PM Daniel Colascione  wrote:
> >>
> >> Add a simple proc-based kill interface. To use /proc/pid/kill, just
> >> write the signal number in base-10 ASCII to the kill file of the
> >> process to be killed: for example, 'echo 9 > /proc/$$/kill'.
> >>
> >> Semantically, /proc/pid/kill works like kill(2), except that the
> >> process ID comes from the proc filesystem context instead of from an
> >> explicit system call parameter. This way, it's possible to avoid races
> >> between inspecting some aspect of a process and that process's PID
> >> being reused for some other process.
> >>
> >> With /proc/pid/kill, it's possible to write a proper race-free and
> >> safe pkill(1). An approximation follows. A real program might use
> >> openat(2), having opened a process's /proc/pid directory explicitly,
> >> with the directory file descriptor serving as a sort of "process
> >> handle".
> >
> > How long does the 'inspection' procedure take? If it's a short
> > duration, then is PID reuse really an issue? I mean, the PIDs are not
> > reused until wrap around, and the only reason this can be a problem is
> > if you have the wrap around while the 'inspecting some aspect'
> > procedure takes really long.
>
> It's a race. Would you make similar statements about a similar fix for
> a race condition involving a mutex and a double-free just because the
> race didn't crash most of the time? The issue I'm trying to fix here
> is the same problem, one level higher up in the abstraction hierarchy.

I was just curious whether this was a real issue you are hitting in a
production system; it wasn't clear from the commit message. When I
read your commit I thought "Does the inspection process take so long
that we wrap around an entire PID range?". So perhaps you should amend
your commit message to clarify that it is not really a problem you ARE
seeing, but rather something you anticipate, and that this patch would
be a nice-to-have to avoid that. Typically there should be good
reasons/real use-cases to add a new interface to the kernel. Linus has
repeatedly rejected new interfaces on the grounds of non-existent
use-cases or non-real-world use-cases. Again, if I am missing something
here, then please improve the commit message so others don't have
similar questions :) It's completely up to you though.

> > IMO without a really good reason for this, it could really be a hard
> > sell but the RFC was worth it anyway to discuss it ;-)
>
> The traditional unix process API is down there at level -10 of Rusty
> Russel's old bad API scale: "It's impossible to get right". The races
> in the current API are unavoidable. That most programs don't hit these
> races most of the time doesn't mean that the race isn't present.
>
> We've moved to a model where we identify other system resources, like
> DRM fences, locks, sockets, and everything else via file descriptors.
> This change is a step toward using procfs file descriptors to work
> with processes, which makes the system more regular and easier to
> reason about. A clean API that's possible to use correctly is a
> worthwhile project.

Ok, agreed. thanks,

- Joel


[PATCH] doc: correct parameter in stallwarn

2018-10-29 Thread Joel Fernandes (Google)
The stallwarn document incorrectly mentions 'fps=' instead of 'fqs='.
Correct that.

Signed-off-by: Joel Fernandes (Google) 
---
 Documentation/RCU/stallwarn.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/RCU/stallwarn.txt b/Documentation/RCU/stallwarn.txt
index b01bcafc64aa..073dbc12d1ea 100644
--- a/Documentation/RCU/stallwarn.txt
+++ b/Documentation/RCU/stallwarn.txt
@@ -205,7 +205,7 @@ handlers are no longer able to execute on this CPU.  This 
can happen if
 the stalled CPU is spinning with interrupts are disabled, or, in -rt
 kernels, if a high-priority process is starving RCU's softirq handler.
 
-The "fps=" shows the number of force-quiescent-state idle/offline
+The "fqs=" shows the number of force-quiescent-state idle/offline
 detection passes that the grace-period kthread has made across this
 CPU since the last time that this CPU noted the beginning of a grace
 period.
-- 
2.19.1.568.g152ad8e336-goog



Re: [PATCH v2] pstore: Avoid duplicate call of persistent_ram_zap()

2018-10-29 Thread Joel Fernandes
On Mon, Oct 29, 2018 at 06:37:53AM +, Peng15 Wang 王鹏 wrote:
> 
> 
> >From: Kees Cook 
> >Sent: Monday, October 29, 2018 0:03
> >To: Peng15 Wang 王鹏
> >Cc: an...@enomsg.org; ccr...@android.com; tony.l...@intel.com; 
> >linux-kernel@vger.kernel.org; Joel Fernandes
> >Subject: Re: [PATCH v2] pstore: Avoid duplicate call of persistent_ram_zap()
> >
> >On Sat, Oct 27, 2018 at 2:08 PM, Peng15 Wang 王鹏  
> >wrote:
> >> When initializing prz with invalid data in buffer (no PERSISTENT_RAM_SIG),
> >> function call path is like this:
> >>
> >> ramoops_init_prz ->
> >> |
> >> |-> persistent_ram_new -> persistent_ram_post_init -> persistent_ram_zap
> >> |
> >> |-> persistent_ram_zap
> >>
> >> As we can see, persistent_ram_zap() is called twice.
> >> We can avoid this by adding an option to persistent_ram_new(), and
> >> only call persistent_ram_zap() when it is needed.
> >>
> >> Signed-off-by: Peng Wang 
> >> ---
> >>  fs/pstore/ram.c|  5 +++--
> >>  fs/pstore/ram_core.c   | 11 +++
> >>  include/linux/pstore_ram.h |  3 ++-
> >>  3 files changed, 12 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/fs/pstore/ram.c b/fs/pstore/ram.c
> >> index ffcff6516e89..3044274de2f0 100644
> >> --- a/fs/pstore/ram.c
> >> +++ b/fs/pstore/ram.c
> >> @@ -596,7 +596,8 @@ static int ramoops_init_przs(const char *name,
> >>   name, i, *cnt - 1);
> >> prz_ar[i] = persistent_ram_new(*paddr, zone_sz, sig,
> >>    &cxt->ecc_info,
> >> -  cxt->memtype, flags, label);
> >> +  cxt->memtype, flags,
> >> +  label, true);
> >> if (IS_ERR(prz_ar[i])) {
> >> err = PTR_ERR(prz_ar[i]);
> >> dev_err(dev, "failed to request %s mem region 
> >> (0x%zx@0x%llx): %d\n",
> >> @@ -640,7 +641,7 @@ static int ramoops_init_prz(const char *name,
> >>
> >> label = kasprintf(GFP_KERNEL, "ramoops:%s", name);
> >> *prz = persistent_ram_new(*paddr, sz, sig, &cxt->ecc_info,
> >> - cxt->memtype, 0, label);
> >> + cxt->memtype, 0, label, false);
> >> if (IS_ERR(*prz)) {
> >> int err = PTR_ERR(*prz);
> >>
> >> diff --git a/fs/pstore/ram_core.c b/fs/pstore/ram_core.c
> >> index 12e21f789194..d8a520c8741c 100644
> >> --- a/fs/pstore/ram_core.c
> >> +++ b/fs/pstore/ram_core.c
> >> @@ -486,7 +486,8 @@ static int persistent_ram_buffer_map(phys_addr_t 
> >> start, phys_addr_t size,
> >>  }
> >>
> >>  static int persistent_ram_post_init(struct persistent_ram_zone *prz, u32 
> >> sig,
> >> -   struct persistent_ram_ecc_info 
> >> *ecc_info)
> >> +   struct persistent_ram_ecc_info 
> >> *ecc_info,
> >> +   bool zap_option)
> >>  {
> >> int ret;
> >>
> >> @@ -514,7 +515,8 @@ static int persistent_ram_post_init(struct 
> >> persistent_ram_zone *prz, u32 sig,
> >>
> >> /* Rewind missing or invalid memory area. */
> >> prz->buffer->sig = sig;
> >> -   persistent_ram_zap(prz);
> >> +   if (zap_option)
> >> +   persistent_ram_zap(prz);
> >
> >This part of persistent_ram_post_init() handles the "invalid buffer"
> >case, which should always zap. The question is whether or not to zap
> >in the case of a valid buffer (the "return 0" earlier in the
> >function). I think you v2 patch needs similar changes found in your
> >v1: the v2 patch also needs to remove the "return 0" and replace it
> >with "zap_option = true;" and to remove the zap call from
> >ramoops_init_prz(). Then I think all the paths will be consolidated.
> 
> Thank you so much for the tips!
> 
> Furthermore, we can make "zap_option" stand for whether its caller wants to
> zap in case of a valid buffer. So ramoops_init_przs() would say "false", and
> ramoops_init_prz() would say "true".
> 
> In persistent_ram_post_init(), if zap_option says "false", we return 
> immediately after 
> persistent_ram_save_old(), otherwise persistent_ram_zap would be called at 
> the end.

Can you not just add it to the flags, something like PRZ_ZAP_NEW, and set
that flag before calling ramoops_init_prz*, then check the flag in
persistent_ram_new? We are already passing flags to persistent_ram_new.

That way no new function arguments are needed and it's simple.
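Something along these lines, as a rough sketch (the flag name follows the
PRZ_ZAP_NEW suggestion above, and the bit value is just a placeholder):

/* In pstore_ram.h, a new flag bit (name and value illustrative): */
#define PRZ_ZAP_NEW	BIT(1)

/* Callers that want the zap behavior pass it via the existing flags: */
*prz = persistent_ram_new(*paddr, sz, sig, &cxt->ecc_info,
			  cxt->memtype, PRZ_ZAP_NEW, label);

/* ...and persistent_ram_post_init() checks prz->flags instead of
 * taking an extra bool argument:
 */
if (prz->flags & PRZ_ZAP_NEW)
	persistent_ram_zap(prz);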

 - Joel



Re: [PATCH tip/core/rcu 02/19] rcu: Defer reporting RCU-preempt quiescent states when disabled

2018-10-29 Thread Joel Fernandes
On Mon, Oct 29, 2018 at 07:27:35AM -0700, Paul E. McKenney wrote:
> On Mon, Oct 29, 2018 at 11:24:42AM +, Ran Rozenstein wrote:
> > Hi Paul and all,
> > 
> > > -Original Message-
> > > From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> > > ow...@vger.kernel.org] On Behalf Of Paul E. McKenney
> > > Sent: Thursday, August 30, 2018 01:21
> > > To: linux-kernel@vger.kernel.org
> > > Cc: mi...@kernel.org; jiangshan...@gmail.com; dipan...@in.ibm.com;
> > > a...@linux-foundation.org; mathieu.desnoy...@efficios.com;
> > > j...@joshtriplett.org; t...@linutronix.de; pet...@infradead.org;
> > > rost...@goodmis.org; dhowe...@redhat.com; eduma...@google.com;
> > > fweis...@gmail.com; o...@redhat.com; j...@joelfernandes.org; Paul E.
> > > McKenney 
> > > Subject: [PATCH tip/core/rcu 02/19] rcu: Defer reporting RCU-preempt
> > > quiescent states when disabled
> > > 
> > > This commit defers reporting of RCU-preempt quiescent states at
> > > rcu_read_unlock_special() time when any of interrupts, softirq, or
> > > preemption are disabled.  These deferred quiescent states are reported at 
> > > a
> > > later RCU_SOFTIRQ, context switch, idle entry, or CPU-hotplug offline
> > > operation.  Of course, if another RCU read-side critical section has 
> > > started in
> > > the meantime, the reporting of the quiescent state will be further 
> > > deferred.
> > > 
> > > This also means that disabling preemption, interrupts, and/or softirqs 
> > > will act
> > > as an RCU-preempt read-side critical section.
> > > This is enforced by checking preempt_count() as needed.
> > > 
> > > Some special cases must be handled on an ad-hoc basis, for example,
> > > context switch is a quiescent state even though both the scheduler and
> > > do_exit() disable preemption.  In these cases, additional calls to
> > > rcu_preempt_deferred_qs() override the preemption disabling.  Similar 
> > > logic
> > > overrides disabled interrupts in rcu_preempt_check_callbacks() because in
> > > this case the quiescent state happened just before the corresponding
> > > scheduling-clock interrupt.
> > > 
> > > In theory, this change lifts a long-standing restriction that required 
> > > that if
> > > interrupts were disabled across a call to rcu_read_unlock() that the 
> > > matching
> > > rcu_read_lock() also be contained within that interrupts-disabled region 
> > > of
> > > code.  Because the reporting of the corresponding RCU-preempt quiescent
> > > state is now deferred until after interrupts have been enabled, it is no 
> > > longer
> > > possible for this situation to result in deadlocks involving the 
> > > scheduler's
> > > runqueue and priority-inheritance locks.  This may allow some code
> > > simplification that might reduce interrupt latency a bit.  Unfortunately, 
> > > in
> > > practice this would also defer deboosting a low-priority task that had 
> > > been
> > > subjected to RCU priority boosting, so real-time-response considerations
> > > might well force this restriction to remain in place.
> > > 
> > > Because RCU-preempt grace periods are now blocked not only by RCU read-
> > > side critical sections, but also by disabling of interrupts, preemption, 
> > > and
> > > softirqs, it will be possible to eliminate RCU-bh and RCU-sched in favor 
> > > of
> > > RCU-preempt in CONFIG_PREEMPT=y kernels.  This may require some
> > > additional plumbing to provide the network denial-of-service guarantees
> > > that have been traditionally provided by RCU-bh.  Once these are in place,
> > > CONFIG_PREEMPT=n kernels will be able to fold RCU-bh into RCU-sched.
> > > This would mean that all kernels would have but one flavor of RCU, which
> > > would open the door to significant code cleanup.
> > > 
> > > Moving to a single flavor of RCU would also have the beneficial effect of
> > > reducing the NOCB kthreads by at least a factor of two.
> > > 
> > > Signed-off-by: Paul E. McKenney  [ paulmck:
> > > Apply rcu_read_unlock_special() preempt_count() feedback
> > >   from Joel Fernandes. ]
> > > [ paulmck: Adjust rcu_eqs_enter() call to rcu_preempt_deferred_qs() in
> > >   response to bug reports from kbuild test robot. ] [ paulmck: Fix bug 
> > > located
> > > by kbuild test robot involving recursion
> > >

Re: [RFC PATCH] Implement /proc/pid/kill

2018-10-29 Thread Joel Fernandes
On Mon, Oct 29, 2018 at 3:11 PM Daniel Colascione  wrote:
>
> Add a simple proc-based kill interface. To use /proc/pid/kill, just
> write the signal number in base-10 ASCII to the kill file of the
> process to be killed: for example, 'echo 9 > /proc/$$/kill'.
>
> Semantically, /proc/pid/kill works like kill(2), except that the
> process ID comes from the proc filesystem context instead of from an
> explicit system call parameter. This way, it's possible to avoid races
> between inspecting some aspect of a process and that process's PID
> being reused for some other process.
>
> With /proc/pid/kill, it's possible to write a proper race-free and
> safe pkill(1). An approximation follows. A real program might use
> openat(2), having opened a process's /proc/pid directory explicitly,
> with the directory file descriptor serving as a sort of "process
> handle".

How long does the 'inspection' procedure take? If it's a short
duration, then is PID reuse really an issue? I mean, the PIDs are not
reused until wrap around, and the only reason this can be a problem is
if you have the wrap around while the 'inspecting some aspect'
procedure takes really long.

Also, the proc fs is typically not the right place for this. Some
entries in proc are writeable, but those are for changing values of
kernel data structures. The title of man proc(5) is "proc - process
information pseudo-filesystem". So it's "information", right?

IMO without a really good reason for this, it could really be a hard
sell but the RFC was worth it anyway to discuss it ;-)

thanks,

- Joel


Re: [RFC PATCH] Minimal non-child process exit notification support

2018-10-29 Thread Joel Fernandes
On Mon, Oct 29, 2018 at 1:01 PM Daniel Colascione  wrote:
>
> Thanks for taking a look.
>
> On Mon, Oct 29, 2018 at 7:45 PM, Joel Fernandes  wrote:
> >
> > On Mon, Oct 29, 2018 at 10:53 AM Daniel Colascione  
> > wrote:
> > >
> > > This patch adds a new file under /proc/pid, /proc/pid/exithand.
> > > Attempting to read from an exithand file will block until the
> > > corresponding process exits, at which point the read will successfully
> > > complete with EOF.  The file descriptor supports both blocking
> > > operations and poll(2). It's intended to be a minimal interface for
> > > allowing a program to wait for the exit of a process that is not one
> > > of its children.
> > >
> > > Why might we want this interface? Android's lmkd kills processes in
> > > order to free memory in response to various memory pressure
> > > signals. It's desirable to wait until a killed process actually exits
> > > before moving on (if needed) to killing the next process. Since the
> > > processes that lmkd kills are not lmkd's children, lmkd currently
> > > lacks a way to wait for a proces to actually die after being sent
> > > SIGKILL; today, lmkd resorts to polling the proc filesystem pid
> >
> > Any idea why it needs to wait and then send SIGKILL? Why not do
> > SIGKILL and look for errno == ESRCH in a loop with a delay.
>
> I want to get polling loops out of the system. Polling loops are bad
> for wakeup attribution, bad for power, bad for priority inheritance,
> and bad for latency. There's no right answer to the question "How long
> should I wait before checking $CONDITION again?". If we can have an
> explicit waitqueue interface to something, we should. Besides, PID
> polling is vulnerable to PID reuse, whereas this mechanism (just like
> anything based on struct pid) is immune to it.

The argument sounds OK to me. I would also add more details in the commit
message about the alternate methods to do this (such as kill polling
or ptrace) and why they don't work well, so no one asks any
questions. Maybe put it under an "other ways to do this" section. A bit
of googling also showed a netlink way of doing it without polling
(though I didn't look into that much and wouldn't be surprised if it's
more complicated).
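For reference, the netlink route appears to be the proc connector. A rough
user-space sketch from memory (assumes CONFIG_PROC_EVENTS; treat the details
as approximate, not verified):

#include <linux/cn_proc.h>
#include <linux/connector.h>
#include <linux/netlink.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Subscribe to proc connector events and block until the target exits. */
int wait_for_exit_event(pid_t target)
{
	struct sockaddr_nl sa = {
		.nl_family = AF_NETLINK,
		.nl_groups = CN_IDX_PROC,
		.nl_pid    = getpid(),
	};
	char req[NLMSG_SPACE(sizeof(struct cn_msg) +
			     sizeof(enum proc_cn_mcast_op))]
		__attribute__((aligned(NLMSG_ALIGNTO)));
	struct nlmsghdr *nlh = (struct nlmsghdr *)req;
	struct cn_msg *cn = (struct cn_msg *)NLMSG_DATA(nlh);
	char buf[4096];
	int sk;

	sk = socket(PF_NETLINK, SOCK_DGRAM, NETLINK_CONNECTOR);
	if (sk < 0 || bind(sk, (struct sockaddr *)&sa, sizeof(sa)) < 0)
		return -1;

	/* Ask the kernel to start multicasting process events. */
	memset(req, 0, sizeof(req));
	nlh->nlmsg_len  = sizeof(req);
	nlh->nlmsg_type = NLMSG_DONE;
	cn->id.idx = CN_IDX_PROC;
	cn->id.val = CN_VAL_PROC;
	cn->len    = sizeof(enum proc_cn_mcast_op);
	*(enum proc_cn_mcast_op *)cn->data = PROC_CN_MCAST_LISTEN;
	if (send(sk, req, sizeof(req), 0) < 0)
		return -1;

	for (;;) {
		struct nlmsghdr *rnlh = (struct nlmsghdr *)buf;
		struct proc_event *ev;

		if (recv(sk, buf, sizeof(buf), 0) <= 0)
			return -1;
		ev = (struct proc_event *)
			((struct cn_msg *)NLMSG_DATA(rnlh))->data;
		if (ev->what == PROC_EVENT_EXIT &&
		    ev->event_data.exit.process_pid == target)
			return 0;	/* target has exited */
	}
}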

Also I guess when you send a patch, it'd be good to pass
"--cc-cmd='./scripts/get_maintainer.pl" to git-send-email so it
automatically CCs the maintainers who maintain this.

thanks,

- Joel


Re: [RFC PATCH] Minimal non-child process exit notification support

2018-10-29 Thread Joel Fernandes
On Mon, Oct 29, 2018 at 12:45 PM Joel Fernandes  wrote:
>
> On Mon, Oct 29, 2018 at 10:53 AM Daniel Colascione  wrote:
> >
> > This patch adds a new file under /proc/pid, /proc/pid/exithand.
> > Attempting to read from an exithand file will block until the
> > corresponding process exits, at which point the read will successfully
> > complete with EOF.  The file descriptor supports both blocking
> > operations and poll(2). It's intended to be a minimal interface for
> > allowing a program to wait for the exit of a process that is not one
> > of its children.
> >
> > Why might we want this interface? Android's lmkd kills processes in
> > order to free memory in response to various memory pressure
> > signals. It's desirable to wait until a killed process actually exits
> > before moving on (if needed) to killing the next process. Since the
> > processes that lmkd kills are not lmkd's children, lmkd currently
> > lacks a way to wait for a proces to actually die after being sent
> > SIGKILL; today, lmkd resorts to polling the proc filesystem pid
>
> Any idea why it needs to wait and then send SIGKILL? Why not do
> SIGKILL and look for errno == ESRCH in a loop with a delay.
>

Sorry, I take that back, I see it needs to wait after sending the kill,
not before (duh). Anyway, if the polling is ever rewritten, another way
could be to do kill(pid, 0) and then check for a return of -1 and errno
== ESRCH, instead of looking at /proc/
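That is, something like this sketch (still a polling loop, and still racy
against PID reuse, as discussed elsewhere in the thread):

#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/* Poll with kill(pid, 0) until the pid disappears (ESRCH). */
void wait_for_death(pid_t pid)
{
	kill(pid, SIGKILL);
	while (kill(pid, 0) == 0 || errno != ESRCH)
		usleep(10 * 1000);	/* arbitrary polling interval */
}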


Re: [RFC PATCH] Minimal non-child process exit notification support

2018-10-29 Thread Joel Fernandes
On Mon, Oct 29, 2018 at 10:53 AM Daniel Colascione  wrote:
>
> This patch adds a new file under /proc/pid, /proc/pid/exithand.
> Attempting to read from an exithand file will block until the
> corresponding process exits, at which point the read will successfully
> complete with EOF.  The file descriptor supports both blocking
> operations and poll(2). It's intended to be a minimal interface for
> allowing a program to wait for the exit of a process that is not one
> of its children.
>
> Why might we want this interface? Android's lmkd kills processes in
> order to free memory in response to various memory pressure
> signals. It's desirable to wait until a killed process actually exits
> before moving on (if needed) to killing the next process. Since the
> processes that lmkd kills are not lmkd's children, lmkd currently
> lacks a way to wait for a proces to actually die after being sent
> SIGKILL; today, lmkd resorts to polling the proc filesystem pid

Any idea why it needs to wait and then send SIGKILL? Why not do
SIGKILL and look for errno == ESRCH in a loop with a delay.

> entry. This interface allow lmkd to give up polling and instead block
> and wait for process death.

Can we use ptrace(2) for the exit notifications? I am assuming you
already thought about it, but I'm curious what the reason is that this is
better.
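For concreteness, the ptrace route I was thinking of looks roughly like this
sketch (from memory, untested; it needs ptrace permissions, and only one
tracer can attach to a process at a time):

#include <signal.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

/* PTRACE_SEIZE + PTRACE_O_TRACEEXIT stops the tracee and notifies us
 * when it is about to exit, even if it is not our child.
 */
int wait_for_exit_ptrace(pid_t pid)
{
	int status;

	if (ptrace(PTRACE_SEIZE, pid, 0, PTRACE_O_TRACEEXIT) < 0)
		return -1;

	for (;;) {
		if (waitpid(pid, &status, 0) < 0)
			return -1;
		if (WIFEXITED(status) || WIFSIGNALED(status))
			return 0;	/* tracee is gone */
		if (status >> 8 == (SIGTRAP | (PTRACE_EVENT_EXIT << 8)))
			return 0;	/* tracee is about to exit */
		ptrace(PTRACE_CONT, pid, 0, 0);	/* not our stop, resume */
	}
}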

thanks,

-Joel


Re: [RFC] rcu: doc: update example about stale data

2018-10-28 Thread Joel Fernandes
On Sun, Oct 28, 2018 at 10:21:42AM -0700, Paul E. McKenney wrote:
> On Sat, Oct 27, 2018 at 07:16:53PM -0700, Joel Fernandes (Google) wrote:
> > The RCU example for 'rejecting stale data' on system-call auditing
> > stops iterating through the rules if a deleted one is found. It makes
> > more sense to continue looking at other rules once a deleted one is
> > rejected. Although the original example is fine, this makes it more
> > meaningful.
> > 
> > Signed-off-by: Joel Fernandes (Google) 
> 
> Does the actual audit code that this was copied from now include the
> continue statement?  If so, please update the commit log to state that
> and then I will take the resulting patch.  (This example was inspired
> by a long-ago version of the actual audit code.)

The document talks of a situation that could exist but is not really in the
implementation. It says "If the system-call audit module were to ever need to
reject stale data". So it's not really something implemented. I was just
correcting the example you had there, since it made more sense to me to
continue looking for other rules in the list once a rule was shown to be
stale. It just makes the example more correct.

But I'm Ok if you want to leave that alone ;-) Hence, the RFC tag to this
patch ;-)

- Joel
 


Re: [RFC] rcu: doc: update example about stale data

2018-10-27 Thread Joel Fernandes
On Sat, Oct 27, 2018 at 7:16 PM, Joel Fernandes (Google)
 wrote:
> The RCU example for 'rejecting stale data' on system-call auditing
> stops iterating through the rules if a deleted one is found. It makes
> more sense to continue looking at other rules once a deleted one is
> rejected. Although the original example is fine, this makes it more
> meaningful.

Sorry, I messed up the patch title, it is supposed to be 'doc: rcu:
...'. I can resend it if you want.

thanks,

- Joel


[RFC] doc: rcu: remove note on smp_mb during synchronize_rcu

2018-10-27 Thread Joel Fernandes (Google)
As per this thread [1], it seems this smp_mb isn't needed anymore:
"So the smp_mb() that I was trying to add doesn't need to be there."

So let us remove this part from the memory ordering documentation.

[1] https://lkml.org/lkml/2017/10/6/707

Signed-off-by: Joel Fernandes (Google) 
---
 .../Tree-RCU-Memory-Ordering.html | 32 +--
 1 file changed, 1 insertion(+), 31 deletions(-)

diff --git 
a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html 
b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html
index a346ce0116eb..0fb1511763d4 100644
--- a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html
+++ b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html
@@ -77,7 +77,7 @@ The key point is that the lock-acquisition functions, 
including
 smp_mb__after_unlock_lock() immediately after successful
 acquisition of the lock.
 
-Therefore, for any given rcu_node struction, any access
+Therefore, for any given rcu_node structure, any access
 happening before one of the above lock-release functions will be seen
 by all CPUs as happening before any access happening after a later
 one of the above lock-acquisition functions.
@@ -162,36 +162,6 @@ an atomic_add_return() of zero) to detect idle 
CPUs.
 
 
 
-The approach must be extended to handle one final case, that
-of waking a task blocked in synchronize_rcu().
-This task might be affinitied to a CPU that is not yet aware that
-the grace period has ended, and thus might not yet be subject to
-the grace period's memory ordering.
-Therefore, there is an smp_mb() after the return from
-wait_for_completion() in the synchronize_rcu()
-code path.
-
-
-
-Quick Quiz:
-
-   What?  Where???
-   I don't see any smp_mb() after the return from
-   wait_for_completion()!!!
-
-Answer:
-
-   That would be because I spotted the need for that
-   smp_mb() during the creation of this documentation,
-   and it is therefore unlikely to hit mainline before v4.14.
-   Kudos to Lance Roy, Will Deacon, Peter Zijlstra, and
-   Jonathan Cameron for asking questions that sensitized me
-   to the rather elaborate sequence of events that demonstrate
-   the need for this memory barrier.
-
-
-
-
 Tree RCU's grace-period memory-ordering guarantees rely most
 heavily on the rcu_node structure's ->lock
 field, so much so that it is necessary to abbreviate this pattern
-- 
2.19.1.568.g152ad8e336-goog



[RFC] rcu: doc: update example about stale data

2018-10-27 Thread Joel Fernandes (Google)
The RCU example for 'rejecting stale data' on system-call auditing
stops iterating through the rules if a deleted one is found. It makes
more sense to continue looking at other rules once a deleted one is
rejected. Although the original example is fine, this makes it more
meaningful.

Signed-off-by: Joel Fernandes (Google) 
---
 Documentation/RCU/listRCU.txt | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/Documentation/RCU/listRCU.txt b/Documentation/RCU/listRCU.txt
index adb5a3782846..09e9a4fc723e 100644
--- a/Documentation/RCU/listRCU.txt
+++ b/Documentation/RCU/listRCU.txt
@@ -250,8 +250,7 @@ as follows:
spin_lock(&e->lock);
if (e->deleted) {
spin_unlock(&e->lock);
-   rcu_read_unlock();
-   return AUDIT_BUILD_CONTEXT;
+   continue;
}
rcu_read_unlock();
return state;
-- 
2.19.1.568.g152ad8e336-goog



Re: [RFC 1/6] pstore: map pstore types to names

2018-10-26 Thread Joel Fernandes
On Fri, Oct 26, 2018 at 08:04:24PM +0100, Kees Cook wrote:
> On Fri, Oct 26, 2018 at 7:00 PM, Joel Fernandes (Google)
>  wrote:
> > In later patches we will need to map types to names, so create a table
> > for that which can also be used and reused in different parts of old and
> > new code. Also use it to save the type in the PRZ which will be useful
> > in later patches.
> 
> Yes, I like it. :) Comments below...

I'm glad, thanks, my replies are below:

> > Signed-off-by: Joel Fernandes (Google) 
> > ---
> >  fs/pstore/inode.c  | 44 --
> >  fs/pstore/ram.c|  4 +++-
> >  include/linux/pstore.h | 29 +
> >  include/linux/pstore_ram.h |  2 ++
> >  4 files changed, 57 insertions(+), 22 deletions(-)
> >
> > diff --git a/fs/pstore/inode.c b/fs/pstore/inode.c
> > index 5fcb845b9fec..43757049d384 100644
> > --- a/fs/pstore/inode.c
> > +++ b/fs/pstore/inode.c
> > @@ -304,6 +304,7 @@ int pstore_mkfile(struct dentry *root, struct 
> > pstore_record *record)
> > struct dentry   *dentry;
> > struct inode*inode;
> > int rc = 0;
> > +   enum pstore_type_id type;
> > charname[PSTORE_NAMELEN];
> > struct pstore_private   *private, *pos;
> > unsigned long   flags;
> > @@ -335,43 +336,44 @@ int pstore_mkfile(struct dentry *root, struct 
> > pstore_record *record)
> > goto fail_alloc;
> > private->record = record;
> >
> > -   switch (record->type) {
> > +   type = record->type;
> 
> Let's rename PSTORE_TYPE_UNKNOWN in the enum to be PSTORE_TYPE_MAX and
> != 255 (just leave it at the end). The value is never exposed to
> userspace (nor to backend storage), so we can instead use it as the
> bounds-check for doing type -> name mappings. (The one use in erst can
> just be renamed.)
> 
> Then we can add a function to do the bounds checking and mapping
> (instead of using a bare array lookup).
> 
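Ok, so something like this sketch, I take it (pstore_type_to_name() and
PSTORE_TYPE_MAX here are the names you suggested, not merged code):

static const char *pstore_type_to_name(enum pstore_type_id type)
{
	/* Bounds-check instead of a bare pstore_names[type] lookup. */
	if (WARN_ON_ONCE(type >= PSTORE_TYPE_MAX))
		return "unknown";

	return pstore_names[type];
}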
> > +   switch (type) {
> > case PSTORE_TYPE_DMESG:
> > -   scnprintf(name, sizeof(name), "dmesg-%s-%llu%s",
> > - record->psi->name, record->id,
> > - record->compressed ? ".enc.z" : "");
> > +   scnprintf(name, sizeof(name), "%s-%s-%llu%s",
> > +pstore_names[type], record->psi->name, record->id,
> > +record->compressed ? ".enc.z" : "");
> > break;
> > case PSTORE_TYPE_CONSOLE:
> > -   scnprintf(name, sizeof(name), "console-%s-%llu",
> > - record->psi->name, record->id);
> > +   scnprintf(name, sizeof(name), "%s-%s-%llu",
> > + pstore_names[type], record->psi->name, 
> > record->id);
> > break;
> > case PSTORE_TYPE_FTRACE:
> > -   scnprintf(name, sizeof(name), "ftrace-%s-%llu",
> > - record->psi->name, record->id);
> > +   scnprintf(name, sizeof(name), "%s-%s-%llu",
> > + pstore_names[type], record->psi->name, 
> > record->id);
> > break;
> > case PSTORE_TYPE_MCE:
> > -   scnprintf(name, sizeof(name), "mce-%s-%llu",
> > - record->psi->name, record->id);
> > +   scnprintf(name, sizeof(name), "%s-%s-%llu",
> > + pstore_names[type], record->psi->name, 
> > record->id);
> > break;
> > case PSTORE_TYPE_PPC_RTAS:
> > -   scnprintf(name, sizeof(name), "rtas-%s-%llu",
> > - record->psi->name, record->id);
> > +   scnprintf(name, sizeof(name), "%s-%s-%llu",
> > + pstore_names[type], record->psi->name, 
> > record->id);
> > break;
> > case PSTORE_TYPE_PPC_OF:
> > -   scnprintf(name, sizeof(name), "powerpc-ofw-%s-%llu",
> > - record->psi->name, record->id);
> > +   scnprintf(name, sizeof(name), "%s-%s-%llu",
> > + pstore_names[type], record->psi-

Re: [RFC 5/6] pstore: donot treat empty buffers as valid

2018-10-26 Thread Joel Fernandes
On Fri, Oct 26, 2018 at 08:39:13PM +0100, Kees Cook wrote:
> On Fri, Oct 26, 2018 at 7:00 PM, Joel Fernandes (Google)
>  wrote:
> > pstore currently calls persistent_ram_save_old even if a buffer is
> > empty. While this appears to work, it is simply not the right thing to
> > do and could lead to bugs, so let's avoid that. It also prevents misleading
> > prints in the logs which claim the buffer is valid.
> 
> I need to be better convinced that a present zero length record is the
> same as a non-present record. This seems true, but there is
> potentially still metadata available from a backend. What were the
> misleading prints in logs?

I got something like:
found existing buffer, size 0, start 0

When I was expecting:
no valid data in buffer (sig = ...)

The other thing is that a call to persistent_ram_zap is also prevented on the
buffer, which prevents zero-initializing prz->ecc_info.par. Since we are
dropping patch 6/6, the zap will not happen. But I'm not familiar enough with
the ecc bits of this code to say whether that's an issue.

About the present zero-length record, I would argue that it should not be
"present" at all. When the system first boots, the record is not present but
the signatures are initialized; then on reboots, because the signatures were
initialized, the buffer appears as valid even if it was unused. So for dmesg,
all max_dump_cnt buffers would appear as if they are valid, which is a bit
strange because there was no crash at all, so on reboot it should be in the
same state as if there had been no crash. That could be a matter of
perspective though, so I leave it to you how you prefer to do it :)

thanks,

- Joel



Re: [RFC 6/6] Revert "pstore/ram_core: Do not reset restored zone's position and size"

2018-10-26 Thread Joel Fernandes
On Fri, Oct 26, 2018 at 08:42:12PM +0100, Kees Cook wrote:
> On Fri, Oct 26, 2018 at 7:22 PM, Joel Fernandes  
> wrote:
> > On Fri, Oct 26, 2018 at 07:16:28PM +0100, Kees Cook wrote:
> >> On Fri, Oct 26, 2018 at 7:00 PM, Joel Fernandes (Google)
> >>  wrote:
> >> > This reverts commit 25b63da64708212985c06c7f8b089d356efdd9cf.
> >> >
> >> > Due to the commit which is being reverted here, it is not possible to
> >> > know if pstore's messages were from a previous boot, or from very old
> >> > boots. This creates an awkward situation where it's unclear if crash or
> >> > other logs are from the previous boot or from very old boots. Also
> >> > typically we dump the pstore buffers after one reboot and are interested
> >> > in only the previous boot's crash so let us reset the buffer after we
> >> > save them.
> >> >
> >> > Lastly, if we don't zap them, then I think it is possible that part of
> >> > the buffer will be from this boot and the other parts will be from
> >> > previous boots. So this revert fixes all of this by calling
> >> > persistent_ram_zap always.
> >>
> >> I like the other patches (comments coming), but not this one: it's
> >> very intentional to keep all crashes around until they're explicitly
> >> unlinked from the pstore filesystem from userspace. Especially true
> >> for catching chains of kernel crashes, or a failed log collection,
> >> etc. Surviving multiple reboots is the expected behavior on Chrome OS
> >> too.
> >
> > Oh, ok. Hence the RFC tag ;-) We can drop this one then. I forgot that
> > unlinking was another way to clear the logs.
> 
> In another thread I discovered that the "single prz" ones actually
> _are_ zapped at boot. I didn't realize, but it explains why pmsg would
> vanish on me sometimes. ;) I always thought I was just doing something
> wrong with it. (And I wonder if it's actually a bug that pmsg is
> zapped -- console doesn't matter: it's overwritten every boot by
> design.)

Oh yeah, they are. So it seems like some are zapped on boot and some aren't.
Hmm, I would think it makes sense not to ever boot-zap dmesg, since those are
crash logs someone may want to see after many reboots. But console and pmsg
should be, since those are just "what happened on the last boot". I guess it
should be made clear in some structure or something which types are zapped on
boot, and which ones aren't. That'll make the behavior clear when adding a
new type, instead of relying on the assumption that single przs are
zapped and multiple ones aren't. Like for ftrace: since the per-cpu
configuration was added, it will now be zapped on boot if it is using a
per-cpu configuration and not zapped on boot if it isn't, right? That would
seem a bit inconsistent.
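To sketch what I mean by "some structure" (struct and field names are purely
illustrative, not existing code):

/* One entry per pstore type, so the zap-on-boot policy is explicit
 * rather than implied by single vs. per-cpu przs.
 */
struct pstore_type_info {
	enum pstore_type_id	type;
	const char		*name;
	bool			zap_on_boot;	/* console/pmsg: true, dmesg: false */
};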

thanks!
-Joel




Re: [RFC 3/6] pstore: remove max argument from ramoops_get_next_prz

2018-10-26 Thread Joel Fernandes
On Fri, Oct 26, 2018 at 08:27:49PM +0100, Kees Cook wrote:
> On Fri, Oct 26, 2018 at 8:22 PM, Joel Fernandes  
> wrote:
> > On Fri, Oct 26, 2018 at 11:00:39AM -0700, Joel Fernandes (Google) wrote:
> >> From the code flow, the 'max' checks are already being done on the prz
> >> passed to ramoops_get_next_prz. Lets remove it to simplify this function
> >> and reduce its arguments.
> >>
> >> Signed-off-by: Joel Fernandes (Google) 
> >> ---
> >>  fs/pstore/ram.c | 14 ++
> >>  1 file changed, 6 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/fs/pstore/ram.c b/fs/pstore/ram.c
> >> index cbfdf4b8e89d..3055e05acab1 100644
> >> --- a/fs/pstore/ram.c
> >> +++ b/fs/pstore/ram.c
> >> @@ -124,14 +124,14 @@ static int ramoops_pstore_open(struct pstore_info 
> >> *psi)
> >>  }
> >>
> >>  static struct persistent_ram_zone *
> >> -ramoops_get_next_prz(struct persistent_ram_zone *przs[], uint *c, uint 
> >> max,
> >> +ramoops_get_next_prz(struct persistent_ram_zone *przs[], uint *c,
> >>u64 *id, enum pstore_type_id *typep, bool update)
> >>  {
> >>   struct persistent_ram_zone *prz;
> >>   int i = (*c)++;
> >>
> >>   /* Give up if we never existed or have hit the end. */
> >> - if (!przs || i >= max)
> >> + if (!przs)
> >>   return NULL;
> >>
> >>   prz = przs[i];
> >
> > Ah, looks like I may have introduced an issue here since 'i' isn't checked
> > by the caller for the single prz case; it's only checked for the multiple
> > prz cases, so something like below could be folded in. I still feel it's
> > better than passing the max argument.
> >
> > Another thought is, even better we could have a different function when
> > there's only one prz and not have to pass an array, just pass the first
> > element? Something like...
> >
> > ramoops_get_next_prz_single(struct persistent_ram_zone *prz, uint *c,
> > enum pstore_type_id *typep, bool update)
> > And for the _single case, we also wouldn't need to pass id, so that's one
> > less argument.
> >
> > Let me know what you think, otherwise something like the below will need to
> > be folded in to fix this patch... thanks.
> >
> > 8<---
> >
> > diff --git a/fs/pstore/ram.c b/fs/pstore/ram.c
> > index 5702b692bdb9..061d2af2485b 100644
> > --- a/fs/pstore/ram.c
> > +++ b/fs/pstore/ram.c
> > @@ -268,17 +268,19 @@ static ssize_t ramoops_pstore_read(struct 
> > pstore_record *record)
> > }
> > }
> >
> > -   if (!prz_ok(prz))
> > +   if (!prz_ok(prz) && !cxt->console_read_cnt) {
> > prz = ramoops_get_next_prz(>cprz, 
> > >console_read_cnt,
> >record, 0);
> > +   }
> >
> > -   if (!prz_ok(prz))
> > +   if (!prz_ok(prz) && !cxt->pmsg_read_cnt)
> > prz = ramoops_get_next_prz(&cxt->mprz, &cxt->pmsg_read_cnt,
> >record, 0);
> >
> > /* ftrace is last since it may want to dynamically allocate memory. 
> > */
> > if (!prz_ok(prz)) {
> > -   if (!(cxt->flags & RAMOOPS_FLAG_FTRACE_PER_CPU)) {
> > +   if (!(cxt->flags & RAMOOPS_FLAG_FTRACE_PER_CPU) &&
> > +   !cxt->ftrace_read_cnt) {
> > prz = ramoops_get_next_prz(cxt->fprzs,
> > &cxt->ftrace_read_cnt, record, 0);
> > } else {
> 
> Ah yeah, good catch! I think your added fix is right. I was pondering
> asking you to remove the & on the *_read_cnt and having the caller do
> the increment:
> 
> while (cxt->dump_read_cnt < cxt->max_dump_cnt && !prz) {
> prz = ramoops_get_next_prz(cxt->dprzs, cxt->dump_read_cnt++,
>&record->id,
>&record->type,
>PSTORE_TYPE_DMESG, 1);

Sure, that's better, I'll do that. That we don't have to pass a pointer, the
caller knows about the increment, and its a local variable less. thanks!

 - Joel



Re: [RFC 4/6] pstore: further reduce ramoops_get_next_prz arguments by passing record

2018-10-26 Thread Joel Fernandes
On Fri, Oct 26, 2018 at 08:32:16PM +0100, Kees Cook wrote:
> On Fri, Oct 26, 2018 at 7:00 PM, Joel Fernandes (Google)
>  wrote:
> > Both the id and type fields of a pstore_record are set by
> > ramoops_get_next_prz. So we can just pass a pointer to the pstore_record
> > instead of passing individual elements. This results in cleaner more
> > readable code and fewer lines.
> >
> > Signed-off-by: Joel Fernandes (Google) 
> > ---
> >  fs/pstore/ram.c | 18 --
> >  1 file changed, 8 insertions(+), 10 deletions(-)
> >
> > diff --git a/fs/pstore/ram.c b/fs/pstore/ram.c
> > index 3055e05acab1..710c3d30bac0 100644
> > --- a/fs/pstore/ram.c
> > +++ b/fs/pstore/ram.c
> > @@ -125,7 +125,7 @@ static int ramoops_pstore_open(struct pstore_info *psi)
> >
> >  static struct persistent_ram_zone *
> >  ramoops_get_next_prz(struct persistent_ram_zone *przs[], uint *c,
> > -u64 *id, enum pstore_type_id *typep, bool update)
> > +struct pstore_record *record, bool update)
> >  {
> > struct persistent_ram_zone *prz;
> > int i = (*c)++;
> > @@ -145,8 +145,8 @@ ramoops_get_next_prz(struct persistent_ram_zone 
> > *przs[], uint *c,
> > if (!persistent_ram_old_size(prz))
> > return NULL;
> >
> > -   *typep = prz->type;
> > -   *id = i;
> > +   record->type = prz->type;
> > +   record->id = i;
> 
> Yes yes. I've been meaning to get all this cleaned up after I
> refactored everything to actually HAVE record at all. :P
> 
> >
> > return prz;
> >  }
> > @@ -254,7 +254,7 @@ static ssize_t ramoops_pstore_read(struct pstore_record 
> > *record)
> > /* Find the next valid persistent_ram_zone for DMESG */
> > while (cxt->dump_read_cnt < cxt->max_dump_cnt && !prz) {
> > prz = ramoops_get_next_prz(cxt->dprzs, &cxt->dump_read_cnt,
> > -  &record->id, &record->type, 1);
> > +  record, 1);
> 
> In another patch, I think you could drop the "update" field too, and
> use the record->type instead to determine if update is needed. Like:
> 
> static struct persistent_ram_zone *
> ramoops_get_next_prz(struct persistent_ram_zone *przs[], uint c,
>   struct pstore_record *record)
> {
> bool update = (record->type == PSTORE_TYPE_DMESG);
> ...

Yes, I agree, I'll do that :)

thanks!

 - Joel


Re: [RFC 3/6] pstore: remove max argument from ramoops_get_next_prz

2018-10-26 Thread Joel Fernandes
On Fri, Oct 26, 2018 at 11:00:39AM -0700, Joel Fernandes (Google) wrote:
> From the code flow, the 'max' checks are already being done on the prz
> passed to ramoops_get_next_prz. Lets remove it to simplify this function
> and reduce its arguments.
> 
> Signed-off-by: Joel Fernandes (Google) 
> ---
>  fs/pstore/ram.c | 14 ++
>  1 file changed, 6 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/pstore/ram.c b/fs/pstore/ram.c
> index cbfdf4b8e89d..3055e05acab1 100644
> --- a/fs/pstore/ram.c
> +++ b/fs/pstore/ram.c
> @@ -124,14 +124,14 @@ static int ramoops_pstore_open(struct pstore_info *psi)
>  }
>  
>  static struct persistent_ram_zone *
> -ramoops_get_next_prz(struct persistent_ram_zone *przs[], uint *c, uint max,
> +ramoops_get_next_prz(struct persistent_ram_zone *przs[], uint *c,
>u64 *id, enum pstore_type_id *typep, bool update)
>  {
>   struct persistent_ram_zone *prz;
>   int i = (*c)++;
>  
>   /* Give up if we never existed or have hit the end. */
> - if (!przs || i >= max)
> + if (!przs)
>   return NULL;
>  
>   prz = przs[i];

Ah, looks like I may have introduced an issue here since 'i' isn't checked by
the caller for the single prz case; it's only checked for the multiple prz
cases, so something like below could be folded in. I still feel it's better
than passing the max argument.

Another thought is, even better we could have a different function when
there's only one prz and not have to pass an array, just pass the first
element? Something like...

ramoops_get_next_prz_single(struct persistent_ram_zone *prz, uint *c,
enum pstore_type_id *typep, bool update)
And for the _single case, we also wouldn't need to pass id, so that's one
less argument.
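A rough sketch of that variant, mirroring the existing function body
(untested, just to show the shape):

static struct persistent_ram_zone *
ramoops_get_next_prz_single(struct persistent_ram_zone *prz, uint *c,
			    enum pstore_type_id *typep, bool update)
{
	/* A single zone is only ever read once. */
	if (!prz || (*c)++)
		return NULL;

	/* Update old/shadowed buffer. */
	if (update)
		persistent_ram_save_old(prz);

	if (!persistent_ram_old_size(prz))
		return NULL;

	*typep = prz->type;
	return prz;
}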

Let me know what you think, otherwise something like the below will need to
be folded in to fix this patch... thanks.

8<---

diff --git a/fs/pstore/ram.c b/fs/pstore/ram.c
index 5702b692bdb9..061d2af2485b 100644
--- a/fs/pstore/ram.c
+++ b/fs/pstore/ram.c
@@ -268,17 +268,19 @@ static ssize_t ramoops_pstore_read(struct pstore_record 
*record)
}
}
 
-   if (!prz_ok(prz))
+   if (!prz_ok(prz) && !cxt->console_read_cnt) {
prz = ramoops_get_next_prz(&cxt->cprz, &cxt->console_read_cnt,
   record, 0);
+   }
 
-   if (!prz_ok(prz))
+   if (!prz_ok(prz) && !cxt->pmsg_read_cnt)
prz = ramoops_get_next_prz(&cxt->mprz, &cxt->pmsg_read_cnt,
   record, 0);
 
/* ftrace is last since it may want to dynamically allocate memory. */
if (!prz_ok(prz)) {
-   if (!(cxt->flags & RAMOOPS_FLAG_FTRACE_PER_CPU)) {
+   if (!(cxt->flags & RAMOOPS_FLAG_FTRACE_PER_CPU) &&
+   !cxt->ftrace_read_cnt) {
prz = ramoops_get_next_prz(cxt->fprzs,
&cxt->ftrace_read_cnt, record, 0);
} else {


Re: [RFC 6/6] Revert "pstore/ram_core: Do not reset restored zone's position and size"

2018-10-26 Thread Joel Fernandes
On Fri, Oct 26, 2018 at 07:16:28PM +0100, Kees Cook wrote:
> On Fri, Oct 26, 2018 at 7:00 PM, Joel Fernandes (Google)
>  wrote:
> > This reverts commit 25b63da64708212985c06c7f8b089d356efdd9cf.
> >
> > Due to the commit which is being reverted here, it is not possible to
> > know if pstore's messages were from a previous boot, or from very old
> > boots. This creates an awkward situation where it's unclear if crash or
> > other logs are from the previous boot or from very old boots. Also
> > typically we dump the pstore buffers after one reboot and are interested
> > in only the previous boot's crash so let us reset the buffer after we
> > save them.
> >
> > Lastly, if we don't zap them, then I think it is possible that part of
> > the buffer will be from this boot and the other parts will be from
> > previous boots. So this revert fixes all of this by calling
> > persistent_ram_zap always.
> 
> I like the other patches (comments coming), but not this one: it's
> very intentional to keep all crashes around until they're explicitly
> unlinked from the pstore filesystem from userspace. Especially true
> for catching chains of kernel crashes, or a failed log collection,
> etc. Surviving multiple reboots is the expected behavior on Chrome OS
> too.

Oh, ok. Hence the RFC tag ;-) We can drop this one then. I forgot that
unlinking was another way to clear the logs.

thanks!

- Joel



[RFC 2/6] pstore: remove type argument from ramoops_get_next_prz

2018-10-26 Thread Joel Fernandes (Google)
Since we store the type of the prz when we initialize it, we no longer
need to pass it again in ramoops_get_next_prz since we can just use that
to setup the pstore record. So lets remove it from the argument list.

Signed-off-by: Joel Fernandes (Google) 
---
 fs/pstore/ram.c | 20 +++-
 1 file changed, 7 insertions(+), 13 deletions(-)

diff --git a/fs/pstore/ram.c b/fs/pstore/ram.c
index c7cd858adce7..cbfdf4b8e89d 100644
--- a/fs/pstore/ram.c
+++ b/fs/pstore/ram.c
@@ -125,9 +125,7 @@ static int ramoops_pstore_open(struct pstore_info *psi)
 
 static struct persistent_ram_zone *
 ramoops_get_next_prz(struct persistent_ram_zone *przs[], uint *c, uint max,
-u64 *id,
-enum pstore_type_id *typep, enum pstore_type_id type,
-bool update)
+u64 *id, enum pstore_type_id *typep, bool update)
 {
struct persistent_ram_zone *prz;
int i = (*c)++;
@@ -147,7 +145,7 @@ ramoops_get_next_prz(struct persistent_ram_zone *przs[], 
uint *c, uint max,
if (!persistent_ram_old_size(prz))
return NULL;
 
-   *typep = type;
+   *typep = prz->type;
*id = i;
 
return prz;
@@ -257,8 +255,7 @@ static ssize_t ramoops_pstore_read(struct pstore_record 
*record)
while (cxt->dump_read_cnt < cxt->max_dump_cnt && !prz) {
prz = ramoops_get_next_prz(cxt->dprzs, &cxt->dump_read_cnt,
   cxt->max_dump_cnt, &record->id,
-  &record->type,
-  PSTORE_TYPE_DMESG, 1);
+  &record->type, 1);
if (!prz_ok(prz))
continue;
header_length = ramoops_read_kmsg_hdr(persistent_ram_old(prz),
@@ -274,20 +271,18 @@ static ssize_t ramoops_pstore_read(struct pstore_record 
*record)
 
if (!prz_ok(prz))
prz = ramoops_get_next_prz(&cxt->cprz, &cxt->console_read_cnt,
-  1, &record->id, &record->type,
-  PSTORE_TYPE_CONSOLE, 0);
+  1, &record->id, &record->type, 0);
 
if (!prz_ok(prz))
prz = ramoops_get_next_prz(&cxt->mprz, &cxt->pmsg_read_cnt,
-  1, &record->id, &record->type,
-  PSTORE_TYPE_PMSG, 0);
+  1, &record->id, &record->type, 0);
 
/* ftrace is last since it may want to dynamically allocate memory. */
if (!prz_ok(prz)) {
if (!(cxt->flags & RAMOOPS_FLAG_FTRACE_PER_CPU)) {
prz = ramoops_get_next_prz(cxt->fprzs,
&cxt->ftrace_read_cnt, 1, &record->id,
-   &record->type, PSTORE_TYPE_FTRACE, 0);
+   &record->type, 0);
} else {
/*
 * Build a new dummy record which combines all the
@@ -306,8 +301,7 @@ static ssize_t ramoops_pstore_read(struct pstore_record 
*record)
&cxt->ftrace_read_cnt,
cxt->max_ftrace_cnt,
&record->id,
-   &record->type,
-   PSTORE_TYPE_FTRACE, 0);
+   &record->type, 0);
 
if (!prz_ok(prz_next))
continue;
-- 
2.19.1.568.g152ad8e336-goog



[RFC 4/6] pstore: further reduce ramoops_get_next_prz arguments by passing record

2018-10-26 Thread Joel Fernandes (Google)
Both the id and type fields of a pstore_record are set by
ramoops_get_next_prz. So we can just pass a pointer to the pstore_record
instead of passing individual elements. This results in cleaner more
readable code and fewer lines.

Signed-off-by: Joel Fernandes (Google) 
---
 fs/pstore/ram.c | 18 --
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/fs/pstore/ram.c b/fs/pstore/ram.c
index 3055e05acab1..710c3d30bac0 100644
--- a/fs/pstore/ram.c
+++ b/fs/pstore/ram.c
@@ -125,7 +125,7 @@ static int ramoops_pstore_open(struct pstore_info *psi)
 
 static struct persistent_ram_zone *
 ramoops_get_next_prz(struct persistent_ram_zone *przs[], uint *c,
-u64 *id, enum pstore_type_id *typep, bool update)
+struct pstore_record *record, bool update)
 {
struct persistent_ram_zone *prz;
int i = (*c)++;
@@ -145,8 +145,8 @@ ramoops_get_next_prz(struct persistent_ram_zone *przs[], 
uint *c,
if (!persistent_ram_old_size(prz))
return NULL;
 
-   *typep = prz->type;
-   *id = i;
+   record->type = prz->type;
+   record->id = i;
 
return prz;
 }
@@ -254,7 +254,7 @@ static ssize_t ramoops_pstore_read(struct pstore_record 
*record)
/* Find the next valid persistent_ram_zone for DMESG */
while (cxt->dump_read_cnt < cxt->max_dump_cnt && !prz) {
prz = ramoops_get_next_prz(cxt->dprzs, &cxt->dump_read_cnt,
-  &record->id, &record->type, 1);
+  record, 1);
if (!prz_ok(prz))
continue;
header_length = ramoops_read_kmsg_hdr(persistent_ram_old(prz),
@@ -270,18 +270,17 @@ static ssize_t ramoops_pstore_read(struct pstore_record 
*record)
 
if (!prz_ok(prz))
prz = ramoops_get_next_prz(&cxt->cprz, &cxt->console_read_cnt,
-  &record->id, &record->type, 0);
+  record, 0);
 
if (!prz_ok(prz))
prz = ramoops_get_next_prz(&cxt->mprz, &cxt->pmsg_read_cnt,
-  &record->id, &record->type, 0);
+  record, 0);
 
/* ftrace is last since it may want to dynamically allocate memory. */
if (!prz_ok(prz)) {
if (!(cxt->flags & RAMOOPS_FLAG_FTRACE_PER_CPU)) {
prz = ramoops_get_next_prz(cxt->fprzs,
-   &cxt->ftrace_read_cnt, &record->id,
-   &record->type, 0);
+   &cxt->ftrace_read_cnt, record, 0);
} else {
/*
 * Build a new dummy record which combines all the
@@ -298,8 +297,7 @@ static ssize_t ramoops_pstore_read(struct pstore_record 
*record)
while (cxt->ftrace_read_cnt < cxt->max_ftrace_cnt) {
prz_next = ramoops_get_next_prz(cxt->fprzs,
&cxt->ftrace_read_cnt,
-   &record->id,
-   &record->type, 0);
+   record, 0);
 
if (!prz_ok(prz_next))
continue;
-- 
2.19.1.568.g152ad8e336-goog



[RFC 3/6] pstore: remove max argument from ramoops_get_next_prz

2018-10-26 Thread Joel Fernandes (Google)
From the code flow, the 'max' checks are already being done on the prz
passed to ramoops_get_next_prz. Lets remove it to simplify this function
and reduce its arguments.

Signed-off-by: Joel Fernandes (Google) 
---
 fs/pstore/ram.c | 14 ++
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/fs/pstore/ram.c b/fs/pstore/ram.c
index cbfdf4b8e89d..3055e05acab1 100644
--- a/fs/pstore/ram.c
+++ b/fs/pstore/ram.c
@@ -124,14 +124,14 @@ static int ramoops_pstore_open(struct pstore_info *psi)
 }
 
 static struct persistent_ram_zone *
-ramoops_get_next_prz(struct persistent_ram_zone *przs[], uint *c, uint max,
+ramoops_get_next_prz(struct persistent_ram_zone *przs[], uint *c,
 u64 *id, enum pstore_type_id *typep, bool update)
 {
struct persistent_ram_zone *prz;
int i = (*c)++;
 
/* Give up if we never existed or have hit the end. */
-   if (!przs || i >= max)
+   if (!przs)
return NULL;
 
prz = przs[i];
@@ -254,8 +254,7 @@ static ssize_t ramoops_pstore_read(struct pstore_record 
*record)
/* Find the next valid persistent_ram_zone for DMESG */
while (cxt->dump_read_cnt < cxt->max_dump_cnt && !prz) {
prz = ramoops_get_next_prz(cxt->dprzs, &cxt->dump_read_cnt,
-  cxt->max_dump_cnt, &record->id,
-  &record->type, 1);
+  &record->id, &record->type, 1);
if (!prz_ok(prz))
continue;
header_length = ramoops_read_kmsg_hdr(persistent_ram_old(prz),
@@ -271,17 +270,17 @@ static ssize_t ramoops_pstore_read(struct pstore_record 
*record)
 
if (!prz_ok(prz))
prz = ramoops_get_next_prz(&cxt->cprz, &cxt->console_read_cnt,
-  1, &record->id, &record->type, 0);
+  &record->id, &record->type, 0);
 
if (!prz_ok(prz))
prz = ramoops_get_next_prz(&cxt->mprz, &cxt->pmsg_read_cnt,
-  1, &record->id, &record->type, 0);
+  &record->id, &record->type, 0);
 
/* ftrace is last since it may want to dynamically allocate memory. */
if (!prz_ok(prz)) {
if (!(cxt->flags & RAMOOPS_FLAG_FTRACE_PER_CPU)) {
prz = ramoops_get_next_prz(cxt->fprzs,
-   >ftrace_read_cnt, 1, >id,
+   >ftrace_read_cnt, >id,
>type, 0);
} else {
/*
@@ -299,7 +298,6 @@ static ssize_t ramoops_pstore_read(struct pstore_record 
*record)
while (cxt->ftrace_read_cnt < cxt->max_ftrace_cnt) {
prz_next = ramoops_get_next_prz(cxt->fprzs,
&cxt->ftrace_read_cnt,
-   cxt->max_ftrace_cnt,
&record->id,
&record->type, 0);
 
-- 
2.19.1.568.g152ad8e336-goog



[RFC 5/6] pstore: donot treat empty buffers as valid

2018-10-26 Thread Joel Fernandes (Google)
pstore currently calls persistent_ram_save_old even if a buffer is
empty. While this appears to work, it is simply not the right thing to
do and could lead to bugs, so let's avoid that. It also prevents misleading
prints in the logs which claim the buffer is valid.

Signed-off-by: Joel Fernandes (Google) 
---
 fs/pstore/ram_core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/pstore/ram_core.c b/fs/pstore/ram_core.c
index 0792595ebcfb..1299aa3ea734 100644
--- a/fs/pstore/ram_core.c
+++ b/fs/pstore/ram_core.c
@@ -495,7 +495,7 @@ static int persistent_ram_post_init(struct 
persistent_ram_zone *prz, u32 sig,
 
sig ^= PERSISTENT_RAM_SIG;
 
-   if (prz->buffer->sig == sig) {
+   if (prz->buffer->sig == sig && buffer_size(prz)) {
if (buffer_size(prz) > prz->buffer_size ||
buffer_start(prz) > buffer_size(prz))
pr_info("found existing invalid buffer, size %zu, start 
%zu\n",
-- 
2.19.1.568.g152ad8e336-goog


