[PATCH v3] taint/module: Clean up global and module taint flags handling

2016-09-21 Thread Petr Mladek
The commit 66cc69e34e86a231 ("Fix: module signature vs tracepoints:
add new TAINT_UNSIGNED_MODULE") updated module_taint_flags() to
potentially print one more character. But it did not increase the
size of the corresponding buffers in m_show() and print_modules().

We have recently made the same mistake when adding a taint flag
for livepatching, see
https://lkml.kernel.org/g/cfba2c823bb984690b73572aaae1db596b54a082.1472137475.git.jpoim...@redhat.com

Also struct module uses an incompatible type for the mod->taints flags.
It survived from the commit 2bc2d61a9638dab670d ("[PATCH] list module
taint flags in Oops/panic"). "int" was used for the global taint
flags at that time. But only the global taint flags were later changed
to "unsigned long" by the commit 25ddbb18aae33ad2 ("Make the taint
flags reliable").

This patch defines TAINT_FLAGS_COUNT that can be used to create
arrays and buffers of the right size. Note that we could not use
enum because the taint flag indexes are used also in assembly code.

Then it reworks the table that describes the taint flags. The TAINT_*
numbers can be used as the index. Instead, each entry now carries
the information whether the taint flag is also shown per-module.

Finally, it uses "unsigned long", bit operations, and the updated
taint_flags table also for mod->taints.

It is not optimal because only a few taint flags can be printed by
module_taint_flags(). But it is better to be on the safe side. IMHO, it is
not worth the optimization and this is a good compromise.

Signed-off-by: Petr Mladek 
---
Changes against v2:

  + fixed a typo in a comment

  + rebased on top of the for-next branch from git/jikos/livepatching.git
    that has the commit 2992ef29ae01af9983 ("livepatch/module:
    make TAINT_LIVEPATCH module-specific").


Changes against v1:

  + reverted the change to enums because it broke asm code

  + instead, forced the size of the taint_flags table;
used taint numbers as the index; used the table also
in module.c

  + fixed the type of mod->taints


 include/linux/kernel.h |  9 +
 include/linux/module.h |  2 +-
 kernel/module.c        | 33 +--
 kernel/panic.c         | 53 --
 4 files changed, 48 insertions(+), 49 deletions(-)

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index d96a6118d26a..2e2b9477c5b8 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -509,6 +509,15 @@ extern enum system_states {
 #define TAINT_UNSIGNED_MODULE  13
 #define TAINT_SOFTLOCKUP   14
 #define TAINT_LIVEPATCH15
+#define TAINT_FLAGS_COUNT  16
+
+struct taint_flag {
+   char true;  /* character printed when tainted */
+   char false; /* character printed when not tainted */
+   bool module;/* also show as a per-module taint flag */
+};
+
+extern const struct taint_flag taint_flags[TAINT_FLAGS_COUNT];
 
 extern const char hex_asc[];
 #define hex_asc_lo(x)  hex_asc[((x) & 0x0f)]
diff --git a/include/linux/module.h b/include/linux/module.h
index 0c3207d26ac0..f6ee569c62bb 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -399,7 +399,7 @@ struct module {
/* Arch-specific module values */
struct mod_arch_specific arch;
 
-   unsigned int taints;/* same bits as kernel:tainted */
+   unsigned long taints;   /* same bits as kernel:taint_flags */
 
 #ifdef CONFIG_GENERIC_BUG
/* Support for BUG */
diff --git a/kernel/module.c b/kernel/module.c
index f57dd63186e6..a4acd8f403ae 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -330,7 +330,7 @@ static inline void add_taint_module(struct module *mod, 
unsigned flag,
enum lockdep_ok lockdep_ok)
 {
add_taint(flag, lockdep_ok);
-   mod->taints |= (1U << flag);
+   set_bit(flag, &mod->taints);
 }
 
 /*
@@ -1138,24 +1138,13 @@ static inline int module_unload_init(struct module *mod)
 static size_t module_flags_taint(struct module *mod, char *buf)
 {
size_t l = 0;
+   int i;
+
+   for (i = 0; i < TAINT_FLAGS_COUNT; i++) {
+   if (taint_flags[i].module && test_bit(i, &mod->taints))
+   buf[l++] = taint_flags[i].true;
+   }
 
-   if (mod->taints & (1 << TAINT_PROPRIETARY_MODULE))
-   buf[l++] = 'P';
-   if (mod->taints & (1 << TAINT_OOT_MODULE))
-   buf[l++] = 'O';
-   if (mod->taints & (1 << TAINT_FORCED_MODULE))
-   buf[l++] = 'F';
-   if (mod->taints & (1 << TAINT_CRAP))
-   buf[l++] = 'C';
-   if (mod->taints & (1 << TAINT_UNSIGNED_MODULE))
-   buf[l++] = 'E';
-   if (mod->taints & (1 <<
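
For completeness, the resulting taint_flags table in kernel/panic.c looks
like this; it matches the table from the v2 review below, with only the
TAINT_LIVEPATCH entry switched to true after the rebase on the commit
2992ef29ae01af9983:

const struct taint_flag taint_flags[TAINT_FLAGS_COUNT] = {
        { 'P', 'G', true },     /* TAINT_PROPRIETARY_MODULE */
        { 'F', ' ', true },     /* TAINT_FORCED_MODULE */
        { 'S', ' ', false },    /* TAINT_CPU_OUT_OF_SPEC */
        { 'R', ' ', false },    /* TAINT_FORCED_RMMOD */
        { 'M', ' ', false },    /* TAINT_MACHINE_CHECK */
        { 'B', ' ', false },    /* TAINT_BAD_PAGE */
        { 'U', ' ', false },    /* TAINT_USER */
        { 'D', ' ', false },    /* TAINT_DIE */
        { 'A', ' ', false },    /* TAINT_OVERRIDDEN_ACPI_TABLE */
        { 'W', ' ', false },    /* TAINT_WARN */
        { 'C', ' ', true },     /* TAINT_CRAP */
        { 'I', ' ', false },    /* TAINT_FIRMWARE_WORKAROUND */
        { 'O', ' ', true },     /* TAINT_OOT_MODULE */
        { 'E', ' ', true },     /* TAINT_UNSIGNED_MODULE */
        { 'L', ' ', false },    /* TAINT_SOFTLOCKUP */
        { 'K', ' ', true },     /* TAINT_LIVEPATCH */
};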

Re: [PATCH v2 7/7] sched/core: Add debug code to catch missing update_rq_clock()

2016-09-21 Thread Petr Mladek
On Wed 2016-09-21 14:38:13, Matt Fleming wrote:
> There's no diagnostic checks for figuring out when we've accidentally
> missed update_rq_clock() calls. Let's add some by piggybacking on the
> rq_*pin_lock() wrappers.
> 
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index bf48e7975c23..91f4b3d58d56 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> +/*
> + * rq::clock_update_flags bits
> + *
> + * %RQCF_REQ_SKIP - will request skipping of clock update on the next
> + *  call to __schedule(). This is an optimisation to avoid
> + *  neighbouring rq clock updates.
> + *
> + * %RQCF_ACT_SKIP - is set from inside of __schedule() when skipping is
> + *  in effect and calls to update_rq_clock() are being ignored.
> + *
> + * %RQCF_UPDATED - is a debug flag that indicates whether a call has been
> + *  made to update_rq_clock() since the last time rq::lock was pinned.
> + *
> + * If inside of __schedule(), clock_update_flags will have been
> + * shifted left (a left shift is a cheap operation for the fast path
> + * to promote %RQCF_REQ_SKIP to %RQCF_ACT_SKIP), so you must use,
> + *
> + *   if (rq->clock_update_flags >= RQCF_UPDATED)
> + *
> + * to check if %RQCF_UPDATED is set. It'll never be shifted more than
> + * one position though, because the next rq_unpin_lock() will shift it
> + * back.
> + */
> +#define RQCF_REQ_SKIP0x01
> +#define RQCF_ACT_SKIP0x02
> +#define RQCF_UPDATED 0x04
> +
> +static inline void assert_clock_updated(struct rq *rq)
> +{
> +#ifdef CONFIG_SCHED_DEBUG
> + /*
> +  * The only reason for not seeing a clock update since the
> +  * last rq_pin_lock() is if we're currently skipping updates.
> +  */
> + WARN_ON_ONCE(rq->clock_update_flags < RQCF_ACT_SKIP);
> +#endif
> +}

I am afraid that it might eventually create a deadlock.
For example, there is the following call chain:

+ printk()
  + vprintk_func -> vprintk_default()
    + vprintk_emit()
      + console_unlock()
        + up_console_sem()
          + up()                        # takes &sem->lock
            + __up()
              + wake_up_process()
                + try_to_wake_up()
                  + ttwu_queue()
                    + ttwu_do_activate()
                      + ttwu_do_wakeup()
                        + rq_clock()
                          + lockdep_assert_held()
                            + WARN_ON_ONCE()
                              + printk()
                                + vprintk_func -> vprintk_default()
                                  + vprintk_emit()
                                    + console_trylock()
                                      + down_trylock_console_sem()
                                        + __down_trylock_console_sem()
                                          + down_trylock()

                                            DEADLOCK: Unable to take &sem->lock


We have recently discussed a similar deadlock, see the thread
around https://lkml.kernel.org/r/20160714221251.GE3057@ubuntu

A temporary solution would be to replace the WARN_ON_ONCE()
by printk_deferred(). Of course, this is far from ideal because
you do not get the stack, ...
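
Just to illustrate that temporary workaround: keep the check, but route
the report through the deferred path, for example (a rough sketch only):

static inline void assert_clock_updated(struct rq *rq)
{
#ifdef CONFIG_SCHED_DEBUG
        /*
         * Sketch: report via printk_deferred() so that the message
         * cannot recurse into the console semaphore from the wakeup
         * path. The downside is that there is no stack trace.
         */
        if (unlikely(rq->clock_update_flags < RQCF_ACT_SKIP))
                printk_deferred(KERN_WARNING
                        "rq clock looks stale; missing update_rq_clock()?\n");
#endif
}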

Sergey is working on WARN_ON_ONCE_DEFERRED() but it is not
an easy task.


>  static inline u64 rq_clock(struct rq *rq)
>  {
>   lockdep_assert_held(&rq->lock);
> + assert_clock_updated(rq);
> +
>   return rq->clock;
>  }
>  

I am not sure how realistic the above call chain is. But adding
WARN_ON() into the scheduler paths is risky in general.

Best Regards,
Petr


Re: qemu:metag image runtime failure in -next due to 'kthread: allow to cancel kthread work'

2016-09-27 Thread Petr Mladek
On Mon 2016-09-19 08:45:09, Guenter Roeck wrote:
> On Mon, Sep 19, 2016 at 03:55:29PM +0100, James Hogan wrote:
> > On Sat, Sep 17, 2016 at 12:32:49AM +0100, James Hogan wrote:
> > > Here this version of QEMU puts the args at where it thinks the end of
> > > the loaded image is, which is based on the number of bytes copied from
> > > the ELF, i.e. the total MemSiz's, not taking into account the alignment
> > > gap in between, so it puts them at 0x40377348.
> > 
> > QEMU meta-v1.3.1 branch updated at:
> > https://github.com/img-meta/qemu.git
> > 
> > Hopefully that'll fix it for you Guenter.
> > 
> Confirmed fixed.

Could you please confirm that the boot problem has been fixed
on the qemu side? I guess that it is
https://github.com/img-meta/qemu/commit/0a2402860228198ae2729048f1de05aeedb7d642

Could Andrew enable all the kthread worker API improvements in -mm
tree again?

I think that the kthread worker patch has been an innocent victim.
It added some functions that were not used anywhere. I think
that it has triggered the boot problem just by chance.

Best Regards,
Petr


[PATCH] module/taint: Automatically increase the buffer size for new taint flags

2016-09-07 Thread Petr Mladek
The commit 66cc69e34e86a231 ("Fix: module signature vs tracepoints:
add new TAINT_UNSIGNED_MODULE") updated module_taint_flags() to
potentially print one more character. But it did not increase the
size of the corresponding buffers in m_show() and print_modules().

We have recently made the same mistake when adding a taint flag
for livepatching, see
https://lkml.kernel.org/g/cfba2c823bb984690b73572aaae1db596b54a082.1472137475.git.jpoim...@redhat.com

Let's convert the taint flags into enum and handle the buffer size
almost automatically.

It is not optimal because only a few taint flags can be printed by
module_taint_flags(). But it is better to be on the safe side. IMHO, it is
not worth the optimization and this is a good compromise.

Signed-off-by: Petr Mladek 
---
 include/linux/kernel.h | 44 
 kernel/module.c        |  8 ++--
 kernel/panic.c         |  4 ++--
 3 files changed, 32 insertions(+), 24 deletions(-)

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index d96a6118d26a..1809bc82b7a5 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -472,14 +472,10 @@ static inline void set_arch_panic_timeout(int timeout, 
int arch_default_timeout)
if (panic_timeout == arch_default_timeout)
panic_timeout = timeout;
 }
-extern const char *print_tainted(void);
 enum lockdep_ok {
LOCKDEP_STILL_OK,
LOCKDEP_NOW_UNRELIABLE
 };
-extern void add_taint(unsigned flag, enum lockdep_ok);
-extern int test_taint(unsigned flag);
-extern unsigned long get_taint(void);
 extern int root_mountflags;
 
 extern bool early_boot_irqs_disabled;
@@ -493,22 +489,30 @@ extern enum system_states {
SYSTEM_RESTART,
 } system_state;
 
-#define TAINT_PROPRIETARY_MODULE   0
-#define TAINT_FORCED_MODULE1
-#define TAINT_CPU_OUT_OF_SPEC  2
-#define TAINT_FORCED_RMMOD 3
-#define TAINT_MACHINE_CHECK4
-#define TAINT_BAD_PAGE 5
-#define TAINT_USER 6
-#define TAINT_DIE  7
-#define TAINT_OVERRIDDEN_ACPI_TABLE8
-#define TAINT_WARN 9
-#define TAINT_CRAP 10
-#define TAINT_FIRMWARE_WORKAROUND  11
-#define TAINT_OOT_MODULE   12
-#define TAINT_UNSIGNED_MODULE  13
-#define TAINT_SOFTLOCKUP   14
-#define TAINT_LIVEPATCH15
+enum taint_flags {
+   TAINT_PROPRIETARY_MODULE,   /*  0 */
+   TAINT_FORCED_MODULE,/*  1 */
+   TAINT_CPU_OUT_OF_SPEC,  /*  2 */
+   TAINT_FORCED_RMMOD, /*  3 */
+   TAINT_MACHINE_CHECK,/*  4 */
+   TAINT_BAD_PAGE, /*  5 */
+   TAINT_USER, /*  6 */
+   TAINT_DIE,  /*  7 */
+   TAINT_OVERRIDDEN_ACPI_TABLE,/*  8 */
+   TAINT_WARN, /*  9 */
+   TAINT_CRAP, /* 10 */
+   TAINT_FIRMWARE_WORKAROUND,  /* 11 */
+   TAINT_OOT_MODULE,   /* 12 */
+   TAINT_UNSIGNED_MODULE,  /* 13 */
+   TAINT_SOFTLOCKUP,   /* 14 */
+   TAINT_LIVEPATCH,/* 15 */
+   TAINT_FLAGS_COUNT   /* keep last! */
+};
+
+extern const char *print_tainted(void);
+extern void add_taint(enum taint_flags flag, enum lockdep_ok);
+extern int test_taint(enum taint_flags flag);
+extern unsigned long get_taint(void);
 
 extern const char hex_asc[];
 #define hex_asc_lo(x)  hex_asc[((x) & 0x0f)]
diff --git a/kernel/module.c b/kernel/module.c
index 529efae9f481..fb6c0d425b47 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -4036,6 +4036,10 @@ int module_kallsyms_on_each_symbol(int (*fn)(void *, 
const char *,
 }
 #endif /* CONFIG_KALLSYMS */
 
+/* Maximum number of characters written by module_flags() */
+#define MODULE_FLAGS_BUF_SIZE (TAINT_FLAGS_COUNT + 4)
+
+/* Keep in sync with MODULE_FLAGS_BUF_SIZE !!! */
 static char *module_flags(struct module *mod, char *buf)
 {
int bx = 0;
@@ -4080,7 +4084,7 @@ static void m_stop(struct seq_file *m, void *p)
 static int m_show(struct seq_file *m, void *p)
 {
struct module *mod = list_entry(p, struct module, list);
-   char buf[8];
+   char buf[MODULE_FLAGS_BUF_SIZE];
 
/* We always ignore unformed modules. */
if (mod->state == MODULE_STATE_UNFORMED)
@@ -4251,7 +4255,7 @@ EXPORT_SYMBOL_GPL(__module_text_address);
 void print_modules(void)
 {
struct module *mod;
-   char buf[8];
+   char buf[MODULE_FLAGS_BUF_SIZE];
 
printk(KERN_DEFAULT "Modules linked in:");
/* Most callers should already have preempt disabled, but make sure */
diff --git a/kernel/panic.c b/kernel/panic.c
index ca8cea1ef673..e90125bf9238 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -334,7 +334,7 @@ const char *print_tainted(void)
return buf;
 }
 
-int test_taint(uns

Re: taint/module: Clean up global and module taint flags handling

2016-09-14 Thread Petr Mladek
On Tue 2016-09-13 16:36:09, Jessica Yu wrote:
> +++ Petr Mladek [12/09/16 16:13 +0200]:
> >The commit 66cc69e34e86a231 ("Fix: module signature vs tracepoints:
> >add new TAINT_UNSIGNED_MODULE") updated module_taint_flags() to
> >potentially print one more character. But it did not increase the
> >size of the corresponding buffers in m_show() and print_modules().
> >
> >We have recently done the same mistake when adding a taint flag
> >for livepatching, see
> >https://lkml.kernel.org/g/cfba2c823bb984690b73572aaae1db596b54a082.1472137475.git.jpoim...@redhat.com
> >
> >Also struct module uses an incompatible type for mod-taints flags.
> >It survived from the commit 2bc2d61a9638dab670d ("[PATCH] list module
> >taint flags in Oops/panic"). There was used "int" for the global taint
> >flags at these times. But only the global tain flags was later changed
> >to "unsigned long" by the commit 25ddbb18aae33ad2 ("Make the taint
> >flags reliable").
> >
> >This patch defines TAINT_FLAGS_COUNT that can be used to create
> >arrays and buffers of the right size. Note that we could not use
> >enum because the taint flag indexes are used also in assembly code.
> >
> >Then it reworks the table that describes the taint flags. The TAINT_*
> >numbers can be used as the index. Instead, we add information
> >if the taint flag is also shown per-module.
> >
> >Finally, it uses "unsigned long", bit operations, and the updated
> >taint_flags table also for mod->taints.
> >
> >It is not optimal because only few taint flags can be printed by
> >module_taint_flags(). But better be on the safe side. IMHO, it is
> >not worth the optimization and this is a good compromise.
> >
> >Signed-off-by: Petr Mladek 
> >---
> >
> >Changes against v1:
> >
> > + reverted the change to enums because it broke asm code
> >
> > + instead, forced the size of the taint_flags table;
> >   used taint numbers as the index; used the table also
> >   in module.c
> >
> > + fixed the type of mod->taints
> >
> >
> >include/linux/kernel.h |  9 +
> >include/linux/module.h |  2 +-
> >kernel/module.c| 31 +
> >kernel/panic.c | 53 
> >--
> >4 files changed, 48 insertions(+), 47 deletions(-)
> >
> >diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> >index d96a6118d26a..33e88ff3af40 100644
> >--- a/include/linux/kernel.h
> >+++ b/include/linux/kernel.h
> >@@ -509,6 +509,15 @@ extern enum system_states {
> >#define TAINT_UNSIGNED_MODULE13
> >#define TAINT_SOFTLOCKUP 14
> >#define TAINT_LIVEPATCH  15
> >+#define TAINT_FLAGS_COUNT   16
> >+
> >+struct taint_flag {
> >+char true;  /* character printed when tained */
> >+char false; /* character printed when not tained */
> 
> s/tained/tainted

Great catch!
 
> >diff --git a/kernel/panic.c b/kernel/panic.c
> >index ca8cea1ef673..36d4fa264b2c 100644
> >--- a/kernel/panic.c
> >+++ b/kernel/panic.c
> >+/*
> >+ * TAINT_FORCED_RMMOD could be a per-module flag but the module
> >+ * is being removed anyway.
> >+ */
> >+const struct taint_flag taint_flags[TAINT_FLAGS_COUNT] = {
> >+{ 'P', 'G', true }, /* TAINT_PROPRIETARY_MODULE */
> >+{ 'F', ' ', true }, /* TAINT_FORCED_MODULE */
> >+{ 'S', ' ', false },/* TAINT_CPU_OUT_OF_SPEC */
> >+{ 'R', ' ', false },/* TAINT_FORCED_RMMOD */
> >+{ 'M', ' ', false },/* TAINT_MACHINE_CHECK */
> >+{ 'B', ' ', false },/* TAINT_BAD_PAGE */
> >+{ 'U', ' ', false },/* TAINT_USER */
> >+{ 'D', ' ', false },/* TAINT_DIE */
> >+{ 'A', ' ', false },/* TAINT_OVERRIDDEN_ACPI_TABLE */
> >+{ 'W', ' ', false },/* TAINT_WARN */
> >+{ 'C', ' ', true }, /* TAINT_CRAP */
> >+{ 'I', ' ', false },/* TAINT_FIRMWARE_WORKAROUND */
> >+{ 'O', ' ', true }, /* TAINT_OOT_MODULE */
> >+{ 'E', ' ', true }, /* TAINT_UNSIGNED_MODULE */
> >+{ 'L', ' ', false },/* TAINT_SOFTLOCKUP */
> >+{ 'K', ' ', false },/* TAINT_LIVEPATCH */
> 
> This should be true here, right? TAINT_LIVEPATCH has been made a
> module-specific taint by commit 2992ef29ae ("livepatch/module: make
> TAINT_LIVEPATCH module-specific").

I was not sure which maintainer's tree would be used. So, I rather based
this on Linus' tree. If it goes via Jikos' livepatching tree,
I could rebase it on top of the commit 2992ef29ae ("livepatch/module: make
TAINT_LIVEPATCH module-specific").

> I think the rest looks fine, thanks for working on the cleanups.

Thanks for review.

Best Regards,
Petr


Re: [PATCH v10 1/2] printk: Make printk() completely async

2016-08-31 Thread Petr Mladek
On Wed 2016-08-31 11:31:35, Sergey Senozhatsky wrote:
> On (08/30/16 11:29), Petr Mladek wrote:
> > > you didn't miss anything, I think I wasn't too descriptive and that caused
> > > some confusion. this patch is not a replacement of wake_up_process() patch
> > > posted earlier in the loop, but an addition to it. not only every WARN/BUG
> > > issued from wake_up_process() will do no good, but every lock we take is
> > > potentially dangerous as well. In the simplest case because of 
> > > $LOCK-debug.c
> > > files in kernel/locking (spin_lock in our case); in the worst case --
> > > because of WARNs issued by log_store() and friends (there may be custom
> > > modifications) or by violations of spinlock atomicity requirements.
> > > 
> > > For example,
> > > 
> > >   vprintk_emit()
> > >   local_irq_save()
> > >   raw_spin_lock()
> > >   text_len = vscnprintf(text, sizeof(textbuf), fmt, args)
> > >   {
> > >   vsnprintf()
> > >   {
> > >   if (WARN_ON_ONCE(size > INT_MAX))
> > >   return 0;
> > >   }
> > >   }
> > >   ...
> > > 
> > > this is a rather unlikely event, sure, there must be some sort of
> > > memory corruption or something else, but the thing is -- if it will
> > > happen, printk() will not be willing to help.
> > > 
> > > wake_up_process() change, posted earlier, is using a deferred version of
> > > WARN macro, but we definitely can (and we better do) switch to lockless
> > > alternative printk() in both cases and don't bother with new macros.
> > > replacing all of the existing ones with 'safe' deferred versions is
> > > a difficult task, but keeping track of a newly introduced ones is even
> > > harder (if possible at all).
> > 
> > I see. It makes some sense. I would like to be on the safe side. I am
> > just afraid that adding yet another per-CPU buffer is too complex.
> > It adds quite some complexity to the code. And it even more scatters
> > the messages so that it will be harder to get them from the
> > crash dump or flush them to the console when the system goes down.
> > 
> > It took few years to get in the solution for NMIs even when
> > it fixed real life deadlocks for many people and customers.
> > I am afraid that it is not realistic to get in similar complex
> > code to fix rather theoretical problems.
> 
> well, I still can try it in my spare time. we can't fix printk() without
> ever touching it, can we? so far we basically only acknowledge the
> existing printk() problems. we can do better than that, I think.

Ah, I do not want to discourage you from finding a solution for these
problems. I just wanted to point out problems with this particular
line of thinking (more per-CPU buffers, shuffling data between
them and the main buffer and console). But I might be wrong.

Sigh, there are many problems with printk(). I think that we recently
discussed the following problems:

  1. Hung task or blocked irq handler when preemption/irqs
 are disabled and there are too many messages pushed to
 the console.

  2. Potential deadlocks when calling wake_up_process() by
 async printk and console_unlock().

  3. Clean up the console handling to separate manipulation of the
     console settings from pushing the messages. In other words,
     allow pushing the console messages only when wanted.

  4. Messed output with continuous lines.


They are related but only partly. IMHO, it is not realistic to
fix all the problems in a single patchset. I wonder how to move
forward.

Our primary target was to solve the 1st problem with the async printk.
It has stalled because we hit the other areas. Let's look at them
from this point of view.


Ad 2. The potential deadlock with wake_up_process(). It popped up
      when using async printk during suspend.

      But it is not new! up() called by console_unlock() has the
      same problem. I thought that it was different because
      console_trylock() would prevent recursion but I was wrong.
      There seems to be a similar deadlock:

      console_unlock()
        up_console_sem()
          up()
            __up()
              raw_spin_lock_irqsave(&sem->lock, flags);
              wake_up_process()
                WARN()
                  printk()
                    vprintk_emit()
                      console_trylock()
                        down_trylock_console_sem()
                          __down_trylock_console_sem()
dow

Re: [PATCH] printk/nmi: avoid direct printk()-s from __printk_nmi_flush()

2016-09-01 Thread Petr Mladek
On Thu 2016-09-01 16:55:07, Sergey Senozhatsky wrote:
> On (08/30/16 13:19), Petr Mladek wrote:
> > 
> > I see. But then we will need to be more careful because printk_func
> > and printk_func_saved will be manipulated in different contexts:
> > normal, irq, nmi. A solution might be using an atomic counter
> > and selecting the right vprintk_func according to the value.
> 
> alt_printk_enter() must be done with local IRQs disabled. so IRQ cannot
> race with `normal' alt_printk. other IRQs cannot race with the current IRQ,
> because we have local IRQs disabled. the only thing that can race here is - 
> NMI.
> both `normal' and IRQ alt_printk can use the same per-CPU buffer, they never
> race. NMI needs to have its own.

Yes. Well, my concern was how to atomically change the printk_func
pointer and save the previous value at the same time. You could not
use locks because NMIs are involved.
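
To show what I mean by the counter, something like this might work;
it is a very rough sketch and alt_printk_enter()/vprintk_alt() are
just placeholder names for this discussion:

/* No pointer swapping; select the implementation by context. */
static DEFINE_PER_CPU(int, alt_printk_nesting);

void alt_printk_enter(void)             /* called with IRQs disabled */
{
        this_cpu_inc(alt_printk_nesting);
}

void alt_printk_exit(void)
{
        this_cpu_dec(alt_printk_nesting);
}

static printk_func_t printk_func_for_context(void)
{
        if (in_nmi())
                return vprintk_nmi;     /* per-CPU NMI buffer */
        if (this_cpu_read(alt_printk_nesting))
                return vprintk_alt;     /* shared normal/IRQ per-CPU buffer */
        return vprintk_default;
}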

Best Regards,
Petr


Re: [PATCH v10 1/2] printk: Make printk() completely async

2016-09-01 Thread Petr Mladek
On Wed 2016-08-31 21:52:24, Sergey Senozhatsky wrote:
> On (08/31/16 11:38), Petr Mladek wrote:
> >   2. Potential deadlocks when calling wake_up_process() by
> >  async printk and console_unlock().
> 
> * there are many reasons to those recursive printk() calls -- some
> can be addressed, some cannot. for instance, it doesn't matter how many
> per-CPU buffers we use for alternative printk() once the logbuf_lock is
> corrupted.

Yup and BTW: Peter Zijlstra wants to avoid zapping locks whenever
possible because it corrupts the state. It might fix the immediate
situation but it might cause a deadlock by a double unlock.

> another `deadlock' example would be:
> 
> SyS_ioctl
>  do_vfs_ioctl
>   tty_ioctl
>n_tty_ioctl
> tty_mode_ioctl
>  set_termios
>   tty_set_termios
>uart_set_termios
> uart_change_speed
>  FOO_serial_set_termios
>   spin_lock_irqsave(&port->lock) // lock the output port
>   
>   !! WARN() or pr_err() or printk()
>   vprintk_emit()
>/* console_trylock() */
>console_unlock()
> call_console_drivers()
>  FOO_write()
>   spin_lock_irqsave(&port->lock) // already
>  locked

Great catch! From the already mentioned solutions, I would prefer
using deferred variants of WARN()/BUG()/printk() in these locations,
together with using lockdep to find them.

Also there is Peter Zijlstra's idea of using a lockless
"early" console to debug the situations where it happens.
It might make sense to make such a console easy to use.

I am unable to find any other generic solution that would prevent this
from the printk() side at the moment.

> 5. not 100% guaranteed printing on panic
> not entirely related to printk(), but to console output mechanism in
> general. we have console_flush_on_panic() which ignores console semaphore
> state, to increase our chances of seeing the backtrace. however, there are
> more that just one lock involved: logbuf_lock, serial driver locks. so we may
> start zap_locks() in console_flush_on_panic() to re-init the logbuf_lock,
> but underlying serial driver's locks are still in unclear state. most of
> the drivers (if not all of them) take the port->lock under disabled IRQs,
> so if panic-CPU is not the one that holds the port->lock then the port->lock
> owner CPU will probably unlock the spin_lock before processing its STOP_IPI.
> if it's the port->lock CPU that panic() the system (nmi_panic() or BUG())
> then things can be bad.

That might be very hard to solve in general as well. Again, PeterZ's
idea with the lockless console might help here.

> > I wonder how to separate the problems and make them more manageable.
> 
> so I was thinking for a moment about doing the recursion detection rework
> before the async_printk. just because better recursion detection is a nice
> thing to have in the first place and it probably may help us catching some
> of the surprises that async_printk might have. but it probably will be more
> problematic than I thought.
> 
> then async_printk. I have a refreshed series on my hands, addressing
> Viresh's reports. it certainly makes things better, but it doesn't
> eliminate all of the lockups/etc sources.

We must separate historically possible lockups from new regressions.
Only regressions should block the async printk series. Old
bugs should be fixed separately to keep the series manageable.

Anyway, I think that the async printk will make sense even
when we solve all the other issues. If async printk does not
cause regressions, why not get it in?


> a console_unlock() doing
> wake_up_process(printk_kthread) would make it better.

I am not sure what you mean by this.

Thanks for working on it.

Best Regards,
Petr


Re: [PATCH v10 1/2] printk: Make printk() completely async

2016-09-02 Thread Petr Mladek
On Fri 2016-09-02 16:58:08, Sergey Senozhatsky wrote:
> On (09/01/16 10:58), Petr Mladek wrote:
> > On Wed 2016-08-31 21:52:24, Sergey Senozhatsky wrote:
> > > a console_unlock() doing
> > > wake_up_process(printk_kthread) would make it better.
> > 
> > I am not sure what you mean by this.
> 
> I meant that this thing
> 
>   local_irq_save() // or preempt_disable()
>   ...
>   if (console_trylock())
>   console_unlock();
>   ...
>   local_irq_restore() // or preempt_enable()

I see.

> can easily lockup the system if console_trylock() was successful and there
> are enough messages to print. printk_kthread can't help, because here we
> basically enforce the `old' behavior. we have async printk, but not async
> console output. tweaking console_unlock() to offload the actual printing loop
> to printk_kthread would make the entire console output async:
> 
>   static void console_sync_flush_and_unlock(void)
>   {
>   for (;;) {
>   ...
>   call_console_drivers();
>   ...
>   }
>   }
> 
>   void console_unlock(void)
>   {
>   if (!MOTORMOUTH && can_printk_async()) {
>   up();
>   wake_up_process(printk_kthread);
>   return;
>   }
>   console_sync_flush_and_unlock();
>   }

Something like this would make sense. But I would do it in a separate
patch(set). We need to go through all console_unlock() callers and
make sure that they are fine with the potential async behavior.
I would not complicate the async printk patchset by this.

Best Regards,
Petr


[PATCH] thermal/powerclamp: Prevent division by zero when counting interval

2016-08-04 Thread Petr Mladek
I have got a zero division error when disabling the forced
idle injection from the intel powerclamp. I did

  echo 0 >/sys/class/thermal/cooling_device48/cur_state

and got

[  986.072632] divide error:  [#1] PREEMPT SMP
[  986.078989] Modules linked in:
[  986.083618] CPU: 17 PID: 24967 Comm: kidle_inject/17 Not tainted 
4.7.0-1-default+ #3055
[  986.093781] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS 
RMLSDP.86I.R3.27.D685.1305151734 05/15/2013
[  986.106227] task: 880430e1c080 task.stack: 880427ef
[  986.114122] RIP: 0010:[]  [] 
clamp_thread+0x1d9/0x600
[  986.124609] RSP: 0018:880427ef3e20  EFLAGS: 00010246
[  986.131860] RAX: 0258 RBX: 0006 RCX: 0001
[  986.141179] RDX:  RSI:  RDI: 0018
[  986.150478] RBP: 880427ef3ec8 R08: 880427ef R09: 0002
[  986.159779] R10: 3df2 R11: 0018 R12: 0002
[  986.169089] R13:  R14: 880427ef R15: 880427ef
[  986.178388] FS:  () GS:88043594() 
knlGS:
[  986.188785] CS:  0010 DS:  ES:  CR0: 80050033
[  986.196559] CR2: 7f1d0caf CR3: 02006000 CR4: 001406e0
[  986.205909] Stack:
[  986.209524]  8802be897b00 880430e1c080 0011 
006a35959780
[  986.219236]  0011 880427ef0008  
8804359503d0
[  986.228966]  000100029d93 81794140  
0511
[  986.238686] Call Trace:
[  986.242825]  [] ? pkg_state_counter+0x80/0x80
[  986.250866]  [] ? powerclamp_set_cur_state+0x180/0x180
[  986.259797]  [] kthread+0xc9/0xe0
[  986.266682]  [] ret_from_fork+0x1f/0x40
[  986.274142]  [] ? kthread_create_on_node+0x180/0x180
[  986.282869] Code: d1 ea 48 89 d6 80 3d 6a d0 d4 00 00 ba 64 00 00 00 89 d8 
41 0f 45 f5 0f af c2 42 8d 14 2e be 31 00 00 00 83 fa 31 0f 42 f2 31 d2  f6 
48 8b 15 9e 07 87 00 48 8b 3d 97 07 87 00 48 63 f0 83 e8
[  986.307806] RIP  [] clamp_thread+0x1d9/0x600
[  986.315871]  RSP 

RIP points to the following lines:

compensation = get_compensation(target_ratio);
interval = duration_jiffies*100/(target_ratio+compensation);

A solution would be to switch the following two commands in
powerclamp_set_cur_state():

set_target_ratio = 0;
end_power_clamp();

But I think that the zero division might happen also when target_ratio
is non-zero because the compensation might be negative. Therefore
it is better to check the sum of target_ratio and compensation
explicitly.

Also the compensated_ratio variable is always set. Therefore there
is no need to initialize it.

Signed-off-by: Petr Mladek 
---
 drivers/thermal/intel_powerclamp.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/thermal/intel_powerclamp.c 
b/drivers/thermal/intel_powerclamp.c
index 015ce2eb6eb7..65b865184990 100644
--- a/drivers/thermal/intel_powerclamp.c
+++ b/drivers/thermal/intel_powerclamp.c
@@ -388,7 +388,7 @@ static int clamp_thread(void *arg)
int sleeptime;
unsigned long target_jiffies;
unsigned int guard;
-   unsigned int compensation = 0;
+   unsigned int compensated_ratio;
int interval; /* jiffies to sleep for each attempt */
unsigned int duration_jiffies = msecs_to_jiffies(duration);
unsigned int window_size_now;
@@ -409,8 +409,11 @@ static int clamp_thread(void *arg)
 * c-states, thus we need to compensate the injected idle ratio
 * to achieve the actual target reported by the HW.
 */
-   compensation = get_compensation(target_ratio);
-   interval = duration_jiffies*100/(target_ratio+compensation);
+   compensated_ratio = target_ratio +
+   get_compensation(target_ratio);
+   if (compensated_ratio <= 0)
+   compensated_ratio = 1;
+   interval = duration_jiffies * 100 / compensated_ratio;
 
/* align idle time */
target_jiffies = roundup(jiffies, interval);
-- 
1.8.5.6



Re: [PATCH] thermal/powerclamp: Prevent division by zero when counting interval

2016-08-05 Thread Petr Mladek
On Thu 2016-08-04 10:32:00, Jacob Pan wrote:
> On Thu,  4 Aug 2016 16:56:46 +0200
> Petr Mladek  wrote:
> 
> > I have got a zero division error when disabling the forced
> > idle injection from the intel powerclamp. I did
> > 
> >   echo 0 >/sys/class/thermal/cooling_device48/cur_state
> > 
> > and got
> > 
> > [  986.072632] divide error:  [#1] PREEMPT SMP
> > [  986.078989] Modules linked in:
> > [  986.083618] CPU: 17 PID: 24967 Comm: kidle_inject/17 Not tainted
> > 4.7.0-1-default+ #3055 [  986.093781] Hardware name: Intel
> > Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.R3.27.D685.1305151734
> > 05/15/2013 [  986.106227] task: 880430e1c080 task.stack:
> > 880427ef [  986.114122] RIP: 0010:[]
> > [] clamp_thread+0x1d9/0x600 [  986.124609] RSP:
> > 0018:880427ef3e20  EFLAGS: 00010246 [  986.131860] RAX:
> > 0258 RBX: 0006 RCX: 0001
> > [  986.141179] RDX:  RSI:  RDI:
> > 0018 [  986.150478] RBP: 880427ef3ec8 R08:
> > 880427ef R09: 0002 [  986.159779] R10:
> > 3df2 R11: 0018 R12: 0002
> > [  986.169089] R13:  R14: 880427ef R15:
> > 880427ef [  986.178388] FS:  ()
> > GS:88043594() knlGS: [  986.188785] CS:
> > 0010 DS:  ES:  CR0: 80050033 [  986.196559] CR2:
> > 7f1d0caf CR3: 02006000 CR4: 001406e0
> > [  986.205909] Stack: [  986.209524]  8802be897b00
> > 880430e1c080 0011 006a35959780 [  986.219236]
> > 0011 880427ef0008  8804359503d0
> > [  986.228966]  000100029d93 81794140 
> > 0511 [  986.238686] Call Trace: [  986.242825]
> > [] ? pkg_state_counter+0x80/0x80 [  986.250866]
> > [] ? powerclamp_set_cur_state+0x180/0x180
> > [  986.259797]  [] kthread+0xc9/0xe0
> > [  986.266682]  [] ret_from_fork+0x1f/0x40
> > [  986.274142]  [] ?
> > kthread_create_on_node+0x180/0x180 [  986.282869] Code: d1 ea 48 89
> > d6 80 3d 6a d0 d4 00 00 ba 64 00 00 00 89 d8 41 0f 45 f5 0f af c2 42
> > 8d 14 2e be 31 00 00 00 83 fa 31 0f 42 f2 31 d2  f6 48 8b 15 9e
> > 07 87 00 48 8b 3d 97 07 87 00 48 63 f0 83 e8 [  986.307806] RIP
> > [] clamp_thread+0x1d9/0x600 [  986.315871]  RSP
> > 
> > 
> > RIP points to the following lines:
> > 
> > compensation = get_compensation(target_ratio);
> > interval = duration_jiffies*100/(target_ratio+compensation);
> > 
> > A solution would be to switch the following two commands in
> > powerclamp_set_cur_state():
> > 
> > set_target_ratio = 0;
> > end_power_clamp();
> > 
> I see, there is race condition, clamping threads should be stopped if
> target ratio is 0.
> > But I think that the zero division might happen also when target_ratio
> > is non-zero because the compensation might be negative. Therefore
> > it is better to check the sum of target_ratio and compensation
> > explicitly.
> > 
> compensation should never be negative. since it is the additional idle
> ratio added on top of requested ratio.

I am not sure if you are talking about the desired behavior or the
current code. get_compensation() returns a value computed from the
steady_comp values. These values are assigned in adjust_compensation()
and the code seems to store even negative values. But I did not
try to investigate it much deeper.

> If actual idle is more than requested, we will skip injection period.
> So i prefer to have both changes.

OK, I'll send an updated patch.

Best Regards,
Petr


[PATCH v2] thermal/powerclamp: Prevent division by zero when counting interval

2016-08-05 Thread Petr Mladek
I have got a zero division error when disabling the forced
idle injection from the intel powerclamp. I did

  echo 0 >/sys/class/thermal/cooling_device48/cur_state

and got

[  986.072632] divide error:  [#1] PREEMPT SMP
[  986.078989] Modules linked in:
[  986.083618] CPU: 17 PID: 24967 Comm: kidle_inject/17 Not tainted 
4.7.0-1-default+ #3055
[  986.093781] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS 
RMLSDP.86I.R3.27.D685.1305151734 05/15/2013
[  986.106227] task: 880430e1c080 task.stack: 880427ef
[  986.114122] RIP: 0010:[]  [] 
clamp_thread+0x1d9/0x600
[  986.124609] RSP: 0018:880427ef3e20  EFLAGS: 00010246
[  986.131860] RAX: 0258 RBX: 0006 RCX: 0001
[  986.141179] RDX:  RSI:  RDI: 0018
[  986.150478] RBP: 880427ef3ec8 R08: 880427ef R09: 0002
[  986.159779] R10: 3df2 R11: 0018 R12: 0002
[  986.169089] R13:  R14: 880427ef R15: 880427ef
[  986.178388] FS:  () GS:88043594() 
knlGS:
[  986.188785] CS:  0010 DS:  ES:  CR0: 80050033
[  986.196559] CR2: 7f1d0caf CR3: 02006000 CR4: 001406e0
[  986.205909] Stack:
[  986.209524]  8802be897b00 880430e1c080 0011 
006a35959780
[  986.219236]  0011 880427ef0008  
8804359503d0
[  986.228966]  000100029d93 81794140  
0511
[  986.238686] Call Trace:
[  986.242825]  [] ? pkg_state_counter+0x80/0x80
[  986.250866]  [] ? powerclamp_set_cur_state+0x180/0x180
[  986.259797]  [] kthread+0xc9/0xe0
[  986.266682]  [] ret_from_fork+0x1f/0x40
[  986.274142]  [] ? kthread_create_on_node+0x180/0x180
[  986.282869] Code: d1 ea 48 89 d6 80 3d 6a d0 d4 00 00 ba 64 00 00 00 89 d8 
41 0f 45 f5 0f af c2 42 8d 14 2e be 31 00 00 00 83 fa 31 0f 42 f2 31 d2  f6 
48 8b 15 9e 07 87 00 48 8b 3d 97 07 87 00 48 63 f0 83 e8
[  986.307806] RIP  [] clamp_thread+0x1d9/0x600
[  986.315871]  RSP 

RIP points to the following lines:

compensation = get_compensation(target_ratio);
interval = duration_jiffies*100/(target_ratio+compensation);

A solution would be to switch the following two commands in
powerclamp_set_cur_state():

set_target_ratio = 0;
end_power_clamp();

But I think that the zero division might happen also when target_ratio
is non-zero because the compensation might be negative. Therefore
we also check the sum of target_ratio and compensation explicitly.

Also the compensated_ratio variable is always set. Therefore there
is no need to initialize it.

Signed-off-by: Petr Mladek 
---
Changes against v1:

+ Also set_target_ratio to 0 after the threads are stopped.

 drivers/thermal/intel_powerclamp.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/thermal/intel_powerclamp.c 
b/drivers/thermal/intel_powerclamp.c
index 015ce2eb6eb7..0e4dc0afcfd2 100644
--- a/drivers/thermal/intel_powerclamp.c
+++ b/drivers/thermal/intel_powerclamp.c
@@ -388,7 +388,7 @@ static int clamp_thread(void *arg)
int sleeptime;
unsigned long target_jiffies;
unsigned int guard;
-   unsigned int compensation = 0;
+   unsigned int compensated_ratio;
int interval; /* jiffies to sleep for each attempt */
unsigned int duration_jiffies = msecs_to_jiffies(duration);
unsigned int window_size_now;
@@ -409,8 +409,11 @@ static int clamp_thread(void *arg)
 * c-states, thus we need to compensate the injected idle ratio
 * to achieve the actual target reported by the HW.
 */
-   compensation = get_compensation(target_ratio);
-   interval = duration_jiffies*100/(target_ratio+compensation);
+   compensated_ratio = target_ratio +
+   get_compensation(target_ratio);
+   if (compensated_ratio <= 0)
+   compensated_ratio = 1;
+   interval = duration_jiffies * 100 / compensated_ratio;
 
/* align idle time */
target_jiffies = roundup(jiffies, interval);
@@ -647,8 +650,8 @@ static int powerclamp_set_cur_state(struct 
thermal_cooling_device *cdev,
goto exit_set;
} else  if (set_target_ratio > 0 && new_target_ratio == 0) {
pr_info("Stop forced idle injection\n");
-   set_target_ratio = 0;
end_power_clamp();
+   set_target_ratio = 0;
} else  /* adjust currently running */ {
set_target_ratio = new_target_ratio;
/* make new set_target_ratio visible to other cpus */
-- 
1.8.5.6



Re: [PATCH v6 1/4] nmi_backtrace: add more trigger_*_cpu_backtrace() methods

2016-08-08 Thread Petr Mladek
On Thu 2016-07-14 16:50:29, Chris Metcalf wrote:
> Currently you can only request a backtrace of either all cpus, or
> all cpus but yourself.  It can also be helpful to request a remote
> backtrace of a single cpu, and since we want that, the logical
> extension is to support a cpumask as the underlying primitive.
> 
> This change modifies the existing lib/nmi_backtrace.c code to take
> a cpumask as its basic primitive, and modifies the linux/nmi.h code
> to use either the old "all/all_but_self" arch methods, or the new
> "cpumask" method, depending on which is available.

> --- a/include/linux/nmi.h
> +++ b/include/linux/nmi.h
> @@ -31,38 +31,75 @@ static inline void hardlockup_detector_disable(void) {}
>  #endif
>  
>  /*
> - * Create trigger_all_cpu_backtrace() out of the arch-provided
> - * base function. Return whether such support was available,
> + * Create trigger_all_cpu_backtrace() etc out of the arch-provided
> + * base function(s). Return whether such support was available,
>   * to allow calling code to fall back to some other mechanism:
>   */
> -#ifdef arch_trigger_all_cpu_backtrace
>  static inline bool trigger_all_cpu_backtrace(void)
>  {
> +#if defined(arch_trigger_all_cpu_backtrace)
>   arch_trigger_all_cpu_backtrace(true);
> -
>   return true;
> +#elif defined(arch_trigger_cpumask_backtrace)
> + arch_trigger_cpumask_backtrace(cpu_online_mask);
> + return true;
> +#else
> + return false;
> +#endif
>  }
> +
>  static inline bool trigger_allbutself_cpu_backtrace(void)
>  {
> +#if defined(arch_trigger_all_cpu_backtrace)
>   arch_trigger_all_cpu_backtrace(false);
>   return true;
> -}
> -
> -/* generic implementation */
> -void nmi_trigger_all_cpu_backtrace(bool include_self,
> -void (*raise)(cpumask_t *mask));
> -bool nmi_cpu_backtrace(struct pt_regs *regs);
> +#elif defined(arch_trigger_cpumask_backtrace)
> + cpumask_var_t mask;
> + int cpu = get_cpu();
>  
> + if (!alloc_cpumask_var(&mask, GFP_KERNEL))
> + return false;

I tested this patch by the following change:

diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
index 52bbd27e93ae..404a32699554 100644
--- a/drivers/tty/sysrq.c
+++ b/drivers/tty/sysrq.c
@@ -242,6 +242,7 @@ static void sysrq_handle_showallcpus(int key)
 * backtrace printing did not succeed or the
 * architecture has no support for it:
 */
+   printk("---  All CPUs: -\n");
if (!trigger_all_cpu_backtrace()) {
struct pt_regs *regs = get_irq_regs();
 
@@ -251,6 +252,10 @@ static void sysrq_handle_showallcpus(int key)
}
schedule_work(&sysrq_showallcpus);
}
+   printk("---  All but itself: -\n");
+   trigger_allbutself_cpu_backtrace();
+   printk("---  Only two: -\n");
+   trigger_single_cpu_backtrace(2);
 }
 
 static struct sysrq_key_op sysrq_showallcpus_op = {


Then I triggered this function using

  echo l >/proc/sysrq-trigger


and got

[  270.791328] ---  All but itself: -

[  270.791331] ===
[  270.791331] [ INFO: suspicious RCU usage. ]
[  270.791333] 4.8.0-rc1-4-default+ #3086 Not tainted
[  270.791333] ---
[  270.791335] ./include/linux/rcupdate.h:556 Illegal context switch in RCU 
read-side critical section!
[  270.791339] 
   other info that might help us debug this:

[  270.791340] 
   rcu_scheduler_active = 1, debug_locks = 0
[  270.791341] 2 locks held by bash/3720:
[  270.791347]  #0:  (sb_writers#5){.+.+.+}, at: [] 
__sb_start_write+0xd1/0xf0
[  270.791351]  #1:  (rcu_read_lock){..}, at: [] 
__handle_sysrq+0x5/0x220
[  270.791352] 
   stack backtrace:
[  270.791354] CPU: 3 PID: 3720 Comm: bash Not tainted 4.8.0-rc1-4-default+ 
#3086
[  270.791355] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  270.791359]   88013688fc58 8143ddac 
880135748600
[  270.791362]  0001 88013688fc88 810c9727 
88013fd98c00
[  270.791365]  00018c00 024000c0  
88013688fce0
[  270.791366] Call Trace:
[  270.791369]  [] dump_stack+0x85/0xc9
[  270.791372]  [] lockdep_rcu_suspicious+0xe7/0x120
[  270.791374]  [] __schedule+0x4eb/0x820
[  270.791377]  [] preempt_schedule_common+0x18/0x31
[  270.791379]  [] _cond_resched+0x1c/0x30
[  270.791382]  [] kmem_cache_alloc_node_trace+0x224/0x340
[  270.791385]  [] __kmalloc_node+0x31/0x40
[  270.791388]  [] alloc_cpumask_var_node+0x24/0x30
[  270.791391]  [] alloc_cpumask_var+0xe/0x10
[  270.791393]  [] sysrq_handle_showallcpus+0x4b/0xd0
[  270.791395]  [] __handle_sysrq+0x136/0x220
[  270.791398]  [] ? __handle_sysrq+0x5/0x220
[  270.791401]  [] write_sysrq_trigger+0x46/0x60
[  270.791403]  [] proc_reg_write+0x3d/0x70
[  270.791406]  [] ? rcu_sync_lo

Re: [PATCH][RFC] printk: make pr_cont buffer per-cpu

2016-08-24 Thread Petr Mladek
On Wed 2016-08-24 10:14:20, Sergey Senozhatsky wrote:
> Hello,
> 
> On (08/23/16 13:47), Petr Mladek wrote:
> [..]
> > >   if (!(lflags & LOG_NEWLINE)) {
> > > + if (!this_cpu_read(cont_printing)) {
> > > + if (system_state == SYSTEM_RUNNING) {
> > > + this_cpu_write(cont_printing, true);
> > > + preempt_disable();
> > > + }
> > > + }
> > 
> > I am afraid that this is not acceptable. It means that printk() will have
> > an unexpected side effect. The missing "\n" at the end of a printed
> > string would disable preemption. See below for more.
> 
> missing '\n' must WARN about "sched while atomic" eventually, so it
> shouldn't go unnoticed or stay hidden.

Well, it will still force people to rebuild a test kernel because they
forgot to use '\n' and the test kernel is unusable.

IMHO, the connection between '\n' and preemption is not
intuitive and is hard to spot. We should do our best to avoid it.


> > I think that cont lines should be a corner case. There should be only
> > a limited use of them. We should not make too complicated things to
> > support them. Also printk() must not get harder to use because of them.
> > I still see a messed output rather as a cosmetic problem in compare with
> > possible possible deadlocks or hung tasks.
> 
> oh, I would love it if pr_cont() was never used in SMP. but this is not
> the case, unfortunately. and, ironically, where pr_cont really matters
> is debugging -- for instance, look at arch/x86/kernel/dumpstack_{32,64}.c
> show_regs() or show_stack_log_lvl()

Sure, but how big is the problem in daily life? I have never heard
colleagues complaining about messed cont lines. It is rare and it
is always possible to restore it. Checking BUG/WARN messages is
a rather hard detective work anyway.

The most painful situation would be if backtraces from different
CPUs are mixed. But there will be problem even with mixed lines.
Fortunately, this usually happens when printing backtraces from
all CPUs in NMI and the output is serialized.


> well, I do understand what you mean and agree with it, but I'm
> afraid pr_cont() kinda matters after all and people *probably*
> expect it to be SMP safe (I'm not entirely sure whether all of
> those pr_cont() calls were put there with the idea that the
> output can be messed up and quite hard to read).

This was even worse before the cont lines buffer.

Sigh, I do not feel experienced enough to decide about this.
I wonder if this is a rather theoretical problem or if there
are many real complaints about it.

I feel that we might be trapped by perfectionism. Perfect output
would be great. But it must not make printk() hard to use
in daily life.

Best Regards,
Petr


Re: livepatch/kprobes incompatibility

2016-08-24 Thread Petr Mladek
On Tue 2016-08-23 21:13:00, Jessica Yu wrote:
> Hi Masami, Petr,
> 
> I'm trying to figure out where we are exactly with fixing the problems with
> livepatch + kprobes, and I was wondering if there will be any more updates to
> the ipmodify patchset that was originally merged back in 2014 (See:
> https://lkml.org/lkml/2014/11/20/808). It seems that patch 4/5 ("kprobes: Set
> IPMODIFY flag only if the probe can change regs->ip") wasn't merged due to
> other ongoing work, and this patch in particular was needed to enforce a hard
> conflict between livepatch and jprobes while still enabling livepatch and
> kprobes to co-exist.
>
> Currently, it looks like livepatch/kpatch and kprobes are still in direct
> conflict, since both kprobe_ftrace_ops and klp_ops have FTRACE_OPS_FL_IPMODIFY
> set. *But* it seems like this mutual exclusion wasn't 100% implemented; I'm
> not sure if this was intentional, but kprobes registration will still return
> success even when ftrace registration fails due to an ipmodify conflict, and
> instead we just get WARNs (See: arm_kprobe_ftrace()).
> 
> So we still end up with buggy situations like the following:
>   (1) livepatch patches meminfo_proc_show [ succeeds ]
>   (2) systemtap probes meminfo_proc_show (using kprobes) [ fails ]
>   * BUT from the user's perspective, it would look like systemtap 
> succeeded,
> since register_kprobe() returned success, but the handler will never 
> fire
> and only when we look at dmesg do we see that something went wrong
> (i.e. ftrace registration had failed since livepatch already reserved
> ipmodify in step 1).

I tried to improve the error handling of kprobes, see
https://lkml.kernel.org/r/1424967232-2923-1-git-send-email-pmla...@suse.cz

My last notes about this patch set are:

  + looked again at the error handling of ftrace operations;
found that my patches would break optimized kprobes;
uff, the kprobes design is not ideal; there are many
flags that need to be checked before each operation;
it is easy to forget to check one or modify the flags
in a wrong order

  + asked Masami if he would be interested in the 1st patch
that was OK; put the rest on hold for a bit


> >From what I understand though, there was work being planned to limit this
> direct conflict to just livepatch and jprobes, since most of the time kprobes
> doesn't change regs->ip. Just wondering what the current state of this work 
> is.

My notes about this are:

  + Jprobe must cause a hard conflict because it modifies regs->ip;
when the ftrace handlers are finished, the code continues with
the Jprobe .entry handler; the .entry handlers must end with
jprobe_return(). It is quite a tricky function because it modifies
the stack and triggers an int3 break. That is handled by the so-called
break_handler() from kprobes. It calls post_handler() if any,
restores the registers and the stack, and goes back to the original
function.

I am not sure why it works in this complicated way. It probably
allows calling the .entry handler in a better context, with
IRQs enabled?

Anyway, the important point is that it modifies regs->ip and forces
the ftrace framework to continue with another function. So,
it does exactly the same as live patching and therefore they
could not work together (at least not in the current state).
A minimal jprobe is sketched right after these notes.


  + kprobe is safe even when it is located on the function+0x0 address.
The default kprobe handler does not modify regs->ip; well, in theory
kprobe could be used for patching and could do this;


  + kretprobe is safe as well; the kprobe handler does not modify regs->ip;
it just modifies the return address from the function; it does not affect
livepatching because the address is defined by the function caller
and livepatching keeps it as is
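
For illustration, a minimal jprobe looks roughly like this (the symbol
name and the handler prototype below are only placeholders; the handler
must mirror the probed function's signature):

/*
 * The jprobes core redirects regs->ip to the .entry handler. The handler
 * has to end with jprobe_return(), which triggers an int3 that is handled
 * by break_handler() and brings the execution back to the original function.
 */
static long my_entry_handler(unsigned long arg0)
{
        pr_info("probed function called with arg0=%lu\n", arg0);
        jprobe_return();        /* never returns to this point */
        return 0;               /* unreached */
}

static struct jprobe my_jprobe = {
        .entry          = my_entry_handler,
        .kp.symbol_name = "some_probed_function",       /* placeholder */
};

/* register_jprobe(&my_jprobe); ... unregister_jprobe(&my_jprobe); */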


Well, there is one more problem. We should also warn when a kprobe
is no longer accessible because the function call is redirected
by a livepatch. My last notes about it are:

  + worked on the check for lost kprobes; decided that only kprobes
knows about all probes and needs to be informed about patching;
added KPROBE_FLAG_PATCHED and its handling; it will be used
by a fake probe that will just signal that the function is
patched; added helper functions that will register and unregister
that fake probe; the patchset still needs some clean up before
sending


Unfortunately, this task has been snowed under in my TODO list and I
have not touched it since spring 2015. I gave it lower priority
because we were on the safe side and nobody complained.

Best Regards,
Petr


Re: [PATCH] livepatch/module: make TAINT_LIVEPATCH module-specific

2016-08-25 Thread Petr Mladek
Hi,

I have spent some time understanding the change. I hope that the
comments below will help others.

On Wed 2016-08-24 16:33:00, Josh Poimboeuf wrote:
> There's no reliable way to determine which module tainted the kernel
> with CONFIG_LIVEPATCH.  For example, /sys/module//taint
> doesn't report it.  Neither does the "mod -t" command in the crash tool.
> 
> Make it crystal clear who the guilty party is by converting
> CONFIG_LIVEPATCH to a module taint flag.

The above paragraph is a bit confusing. The patch adds TAINT_LIVEPATCH into the
list of module taint flags.
 
> This changes the behavior a bit: now the the flag gets set when the
> module is loaded, rather than when it's enabled.
> 
> Reviewed-by: Chunyu Hu 
> Signed-off-by: Josh Poimboeuf 
> ---
>  kernel/livepatch/core.c |  3 ---
>  kernel/module.c | 35 ---
>  2 files changed, 12 insertions(+), 26 deletions(-)
> 
> diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
> index 5fbabe0..af46438 100644
> --- a/kernel/livepatch/core.c
> +++ b/kernel/livepatch/core.c
> @@ -545,9 +545,6 @@ static int __klp_enable_patch(struct klp_patch *patch)
>   list_prev_entry(patch, list)->state == KLP_DISABLED)
>   return -EBUSY;
>  
> - pr_notice_once("tainting kernel with TAINT_LIVEPATCH\n");
> - add_taint(TAINT_LIVEPATCH, LOCKDEP_STILL_OK);

The first important thing is that add_taint() is replaced with
add_taint_module(). The latter also sets mod->taints.

It is a module taint flag, so it really makes sense to call it
when the module is loaded.
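
For reference, add_taint_module() currently is just:

static inline void add_taint_module(struct module *mod, unsigned flag,
                                    enum lockdep_ok lockdep_ok)
{
        add_taint(flag, lockdep_ok);
        mod->taints |= (1U << flag);
}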

> -
>   pr_notice("enabling patch '%s'\n", patch->mod->name);
>  
>   klp_for_each_object(patch, obj) {
> diff --git a/kernel/module.c b/kernel/module.c
> index 529efae..fd5f95b 100644
> --- a/kernel/module.c
> +++ b/kernel/module.c
> @@ -1149,6 +1149,8 @@ static size_t module_flags_taint(struct module *mod, 
> char *buf)
>   buf[l++] = 'C';
>   if (mod->taints & (1 << TAINT_UNSIGNED_MODULE))
>   buf[l++] = 'E';
> + if (mod->taints & (1 << TAINT_LIVEPATCH))
> + buf[l++] = 'K';

This is the second important part of the change. It shows the flag
in /sys/module//taint.

The rest is just reshuffling of the code. But it has problems,
as already reported by the kbuild test robot.

The change looks good to me. We just need to fix the compilation
problem by adding some #ifdefs.

Best Regards,
Petr


Re: [PATCH v6 20/20] thermal/intel_powerclamp: Convert the kthread to kthread worker API

2016-08-25 Thread Petr Mladek
On Thu 2016-08-25 10:33:17, Sebastian Andrzej Siewior wrote:
> On 2016-04-14 17:14:39 [+0200], Petr Mladek wrote:
> > Kthreads are currently implemented as an infinite loop. Each
> > has its own variant of checks for terminating, freezing,
> > awakening. In many cases it is unclear to say in which state
> > it is and sometimes it is done a wrong way.
> 
> What is the status of this? This is the last email I received and it is
> from April.

There were still some discussions about the kthread worker API.
Anyway, the needed kthread API changes are in Andrew's -mm tree now
and will hopefully be included in 4.9.

I did not want to send the patches using the API before the API
changes are upstream. But I could send the two intel_powerclamp
patches now if you are comfortable with having them on top of
the -mm tree or linux-next.

Best Regards,
Petr


Re: [PATCH v2] livepatch/module: make TAINT_LIVEPATCH module-specific

2016-08-25 Thread Petr Mladek
On Thu 2016-08-25 10:04:45, Josh Poimboeuf wrote:
> There's no reliable way to determine which module tainted the kernel
> with TAINT_LIVEPATCH.  For example, /sys/module//taint
> doesn't report it.  Neither does the "mod -t" command in the crash tool.
> 
> Make it crystal clear who the guilty party is by associating
> TAINT_LIVEPATCH with any module which sets the "livepatch" modinfo
> attribute.  The flag will still get set in the kernel like before, but
> now it also sets the same flag in mod->taint.
> 
> Note that now the taint flag gets set when the module is loaded rather
> than when it's enabled.
> 
> I also renamed find_livepatch_modinfo() to check_modinfo_livepatch() to
> better reflect its purpose: it's basically a livepatch-specific
> sub-function of check_modinfo().
> 
> Reported-by: Chunyu Hu 
> Signed-off-by: Josh Poimboeuf 

Everything looks fine now.

Reviewed-by: Petr Mladek 


Best Regards,
Petr


Re: [PATCH v10 1/2] printk: Make printk() completely async

2016-08-25 Thread Petr Mladek
On Mon 2016-08-22 13:15:20, Sergey Senozhatsky wrote:
> Hello,
> 
> On (08/20/16 14:24), Sergey Senozhatsky wrote:
> > On (08/19/16 21:00), Jan Kara wrote:
> > > > > depending on .config BUG() may never return back -- passing control
> > > > > to do_exit(), so printk_deferred_exit() won't be executed. thus we
> > > > > probably need to have a per-cpu variable that would indicate that
> > > > > we are in deferred_bug. hm... but do we really need deferred BUG()
> > > > > in the first place?
> > > > 
> since we are basically interested in wake_up_process() only from
> printk() POV. not sure how acceptable 2 * preempt_count and 2 * per-CPU
> writes for every try_to_wake_up().
> 
> 
> the other thing I just thought of is doing something as follows
> !!!not tested, will not compile, just an idea!!!
> 
> ---
> 
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index 6e260a0..bb8d719 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -1789,6 +1789,7 @@ asmlinkage int vprintk_emit(int facility, int level,
> printk_delay();
>  
> local_irq_save(flags);
> +   printk_nmi_enter();
> this_cpu = smp_processor_id();
>  
> /*
> @@ -1804,6 +1805,7 @@ asmlinkage int vprintk_emit(int facility, int level,
>  */
> if (!oops_in_progress && !lockdep_recursing(current)) {
> recursion_bug = true;
> +   printk_nmi_exit();
> local_irq_restore(flags);
> return 0;
> }
> @@ -1920,6 +1922,7 @@ asmlinkage int vprintk_emit(int facility, int level,
> logbuf_cpu = UINT_MAX;
> raw_spin_unlock(&logbuf_lock);
> lockdep_on();
> +   printk_nmi_exit();
> local_irq_restore(flags);
>  
> /* If called from the scheduler, we can not call up(). */

I was so taken by the idea of temporarily forcing a lockless and
"trivial" printk implementation that I missed one thing.

Your patch uses the alternative printk() variant around logbuf_lock.
But this is not the problem with wake_up_process(). printk_deferred()
takes logbuf_lock without problems.

Our problem is with calling wake_up_process() recursively. The
deadlock is in the scheduler locks.

But the patch still inspired me. What about blocking the problematic
wake_up_process() call with a per-CPU variable? I mean something like
this completely untested code:

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index ca9733b802ce..93915eb1fd0d 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1708,6 +1708,8 @@ static size_t cont_print_text(char *text, size_t size)
return textlen;
 }
 
+DEFINE_PER_CPU(bool, printk_wakeup);
+
 asmlinkage int vprintk_emit(int facility, int level,
const char *dict, size_t dictlen,
const char *fmt, va_list args)
@@ -1902,8 +1904,17 @@ asmlinkage int vprintk_emit(int facility, int level,
lockdep_off();
 
if  (printk_kthread && !in_panic) {
+   bool *printk_wakeup_ptr;
+
/* Offload printing to a schedulable context. */
-   wake_up_process(printk_kthread);
+   local_irq_save(flags);
+   printk_wakeup_ptr = this_cpu_ptr(&printk_wakeup);
+   if (!*printk_wakeup_ptr) {
+   *printk_wakeup_ptr = true;
+   wake_up_process(printk_kthread);
+   *printk_wakeup_ptr = false;
+   }
+   local_irq_restore(flags);
goto out_lockdep;
} else {
/*


We might eventually hide this in a wake_up_process_safe() or so.

We might also need to use it in console_unlock() to avoid similar
recursion there.
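
Just to show the idea, such a wrapper could reuse the per-CPU flag from
the diff above (completely untested sketch, the name is made up):

static void wake_up_process_safe(struct task_struct *task)
{
	unsigned long flags;
	bool *in_wakeup;

	local_irq_save(flags);
	in_wakeup = this_cpu_ptr(&printk_wakeup);
	/* Block the recursive wake-up that deadlocks on the scheduler locks. */
	if (!*in_wakeup) {
		*in_wakeup = true;
		wake_up_process(task);
		*in_wakeup = false;
	}
	local_irq_restore(flags);
}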

Best Regards,
Petr


Re: [PATCH][RFC] printk: make pr_cont buffer per-cpu

2016-08-25 Thread Petr Mladek
On Thu 2016-08-25 23:27:40, Petr Mladek wrote:
> On Wed 2016-08-24 23:27:29, Sergey Senozhatsky wrote:
> > On (08/24/16 10:19), Petr Mladek wrote:
> > > > On (08/23/16 13:47), Petr Mladek wrote:
> > > > [..]
> > > > > > if (!(lflags & LOG_NEWLINE)) {
> > > > > > +   if (!this_cpu_read(cont_printing)) {
> > > > > > +   if (system_state == SYSTEM_RUNNING) {
> > > > > > +   this_cpu_write(cont_printing, true);
> > > > > > +   preempt_disable();
> > > > > > +   }
> > > > > > +   }
> > > > > 
> > > > > I am afraid that this is not acceptable. It means that printk() will 
> > > > > have
> > > > > an unexpected side effect. The missing "\n" at the end of a printed
> > > > > string would disable preemption. See below for more.
> > > > 
> > > > missing '\n' must WARN about "sched while atomic" eventually, so it
> > > > shouldn't go unnoticed or stay hidden.
> > > 
> > > Well, it will still force people to rebuilt a test kernel because they
> > > forget to use '\n" and the test kernel is unusable.
> > 
> > you are right. misusage of printk() will now force user to go and fix
> > it. the kernel most likely will be rebuilt anyway - there is a missing
> > \n after all.
 
> Of course, it would be great to fix it transparently. But if there must
> be a burden, I would prefer to keep it on the "corner" case users
> rather than to push it on everyday users.

Not to mention that a messed-up log is much less painful than a locked-up system.

Best Regards,
Petr


Re: [PATCH][RFC] printk: make pr_cont buffer per-cpu

2016-08-25 Thread Petr Mladek
On Wed 2016-08-24 23:27:29, Sergey Senozhatsky wrote:
> On (08/24/16 10:19), Petr Mladek wrote:
> > > On (08/23/16 13:47), Petr Mladek wrote:
> > > [..]
> > > > >   if (!(lflags & LOG_NEWLINE)) {
> > > > > + if (!this_cpu_read(cont_printing)) {
> > > > > + if (system_state == SYSTEM_RUNNING) {
> > > > > + this_cpu_write(cont_printing, true);
> > > > > + preempt_disable();
> > > > > + }
> > > > > + }
> > > > 
> > > > I am afraid that this is not acceptable. It means that printk() will 
> > > > have
> > > > an unexpected side effect. The missing "\n" at the end of a printed
> > > > string would disable preemption. See below for more.
> > > 
> > > missing '\n' must WARN about "sched while atomic" eventually, so it
> > > shouldn't go unnoticed or stay hidden.
> > 
> > Well, it will still force people to rebuilt a test kernel because they
> > forget to use '\n" and the test kernel is unusable.
> 
> you are right. misusage of printk() will now force user to go and fix
> it. the kernel most likely will be rebuilt anyway - there is a missing
> \n after all.

It is not a big problem for the final build. But it is a problem when adding
temporary debugging messages. They will get removed anyway. But people
will hate that some debugging builds are unusable just because of
a missing "\n".

> and rebuild here is pretty much incremental because basically
> only one line is getting updated (missing `\n' in printk() message). so
> it's like 15 seconds, perhaps.

This is not completely true. Sometimes the developer is not able to
test it himself and needs to send a patched kernel with some debug
messages to the customer. The turnaround might be hours or even days.

Also, we have a build service for building packages here at SUSE.
It uses rather powerful machines. If we need to do testing on
some less powerful machine, it is sometimes faster to build
the package with the build service. Sometimes it is the only
possibility if the test machine does not have enough disk space
or memory. The point is that the build service always does
a clean build and the turnaround takes minutes.

Not to mention that the missing newline might be in an error path
that is hard to test and might appear even in the released build.


> > IMHO, the connection between '\n' and preemption is not
> > intuitive and hard to spot. We should do our best to avoid it.
> 
> yes. what can be done is a simple hint -- we are in preempt disabled
> path when we report `sched while atomic', so it's safe to check if
> this_cpu(cont).len != 0 and if it is then additionally report:
>   "incomplete cont line".
> 
> in normal life no one _probably_ would see the change. well, I saw broken
> backtraces on my laptop before, but didn't consider it to be important.
> for the last two days the problem is not theoretical anymore on my side,
> as now I'm looking at actual pr_cont()-related bug report [internal].
> I'm truing to move the whole thing into 'don't use the cont lines' direction;
> there is a mix of numerous pr_cont() calls & missing new-lines [internal 
> code].
> the kernel, unfortunately, IMHO, was too nice in the latter case, flushing
> incomplete cont buffer with LOG_NEWLINE. so many of those `accidental' cont
> lines went unnoticed.
> 
> this "don't use the cont lines" is actually a bit tricky and doesn't
> look so solid. there is a need for cont lines, and the kernel has cont
> printk... but it's sort of broken, so people have to kinda work-around
> it. not the best example, but still (arch/arm64/kernel/traps.c)
> 
> static void __dump_instr(const char *lvl, struct pt_regs *regs)
> {
>   unsigned long addr = instruction_pointer(regs);
>   char str[sizeof(" ") * 5 + 2 + 1], *p = str;
>   int i;
> 
>   for (i = -4; i < 1; i++) {
>   unsigned int val, bad;
> 
>   bad = __get_user(val, &((u32 *)addr)[i]);
> 
>   if (!bad)
>   p += sprintf(p, i == 0 ? "(%08x) " : "%08x ", val);
>   else {
>   p += sprintf(p, "bad PC value");
>   break;
>   }
>   }
>   printk("%sCode: %s\n", lvl, str);
> }
> 
> which is fine, but getting a bit boring once you have more than,
> let's say, 20 function that want to do cont output.

Well, there are only two extra lines: the buffer definition and
the final printk().

Of course, it would be great to fix it transparently. But if there must
be a burden, I would prefer to keep it on the "corner" case users
rather than to push it on everyday users.

Best Regards,
Petr


Re: [RFC V2] printk: add warning while drop partial text in msg

2017-08-16 Thread Petr Mladek
On Fri 2017-08-11 00:55:48, pierre kuo wrote:
> hi Sergey:
> (Please ignore previous mail, I apologize for pressing send button too early 
> :)
> >> this is not the only place that can truncate the message.
> >> vprintk_emit() can do so as well /* vscnprintf() */. but
> >> I think we don't care that much. a user likely will  notice
> >> truncated messages. we report lost messages, because this
> >> is a completely different sort of problem.
> Usually people will more easily find message truncated from semantics
> by vscnprintf, since it brute force truncate input message by the
> upper limit of output buffer.

Do you see this problem in real life?

I ask because msg_print_text() seems to be used carefully.

For example, syslog_idx is bumped in syslog_print() only when
the message fits into the buffer. It repeats the read with an
empty buffer until the bigger userspace buffer is full.

Also kmsg_dump_get_buffer() first checks the size of the messages.
Then it calls msg_print_text() only for messages that fit into
the buffer.

These functions are called from userspace. Of course, not all messages
fit into the userspace buffer at once. But userspace repeats the read
until all messages are read. IMHO, nothing is really dropped here.
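
For reference, the userspace side is basically a klogd-style loop; each
read() returns only the messages that fit completely into the buffer and
the rest is returned by the next call (illustrative sketch only):

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	ssize_t len;
	int fd = open("/proc/kmsg", O_RDONLY);

	if (fd < 0)
		return 1;

	/* Keep reading; truncated tails show up in the following read(). */
	while ((len = read(fd, buf, sizeof(buf))) > 0)
		write(STDOUT_FILENO, buf, len);

	return 0;
}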

Best Regards,
Petr


Re: [PATCH v2 1/3] livepatch: Add force sysfs attribute

2017-08-16 Thread Petr Mladek
On Thu 2017-08-10 12:48:13, Miroslav Benes wrote:
> Add read-write force attribute to livepatch sysfs infrastructure. We can
> use it later to force couple of events during a live patching process.
> Be it a sending of a fake signal or forcing of the tasks' successful
> conversion.
> 
> It does not make sense to use the force facility when there is no
> transaction running (although there is no harm doing that). Therefore we
> limit it only to situations when klp_transition_patch variable is set.
> Normally, klp_mutex lock should be acquired, because the variable is
> shared. However that would hold the action back unnecessarily because of
> waiting for the lock, so we omit the lock here. The resulting race
> window is harmless (using force when there is no transaction running).
> 
> diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
> index b9628e43c78f..79022b7eca2c 100644
> --- a/kernel/livepatch/core.c
> +++ b/kernel/livepatch/core.c
> @@ -954,6 +990,13 @@ static int __init klp_init(void)
>   if (!klp_root_kobj)
>   return -ENOMEM;
>  
> + ret = sysfs_create_group(klp_root_kobj, &klp_sysfs_group);
> + if (ret) {
> + pr_err("cannot create livepatch attributes in sysfs\n");
> + kobject_put(klp_root_kobj);

We need to set klp_root_kobj = NULL here. Or we need to set the global
klp_root_kobj only after the attributes are created. Otherwise,
klp_initialized() would return true and registering a patch would
push the system off the safe path.

Note that this actually opens a small race window where the livepatching
core pretends to be initialized even though the initialization still
might fail. It is rather theoretical, but it would be nice to avoid
it if it can be done an easy way, e.g. by setting klp_root_kobj later.
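
I mean something like this (sketch based on the quoted hunk; only the tail
of klp_init() is shown and the global is assigned as the last step):

static int __init klp_init(void)
{
	struct kobject *root_kobj;
	int ret;

	/* ... module notifier registration etc. kept as is ... */

	root_kobj = kobject_create_and_add("livepatch", kernel_kobj);
	if (!root_kobj)
		return -ENOMEM;

	ret = sysfs_create_group(root_kobj, &klp_sysfs_group);
	if (ret) {
		pr_err("cannot create livepatch attributes in sysfs\n");
		kobject_put(root_kobj);
		return ret;
	}

	/* Publish only a fully initialized kobject. */
	klp_root_kobj = root_kobj;

	return 0;
}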

Best Regards,
Petr


Re: [PATCH v2 0/3] livepatch: Introduce force sysfs attribute

2017-08-16 Thread Petr Mladek
On Thu 2017-08-10 12:48:12, Miroslav Benes wrote:
> Currently, livepatch gradually migrate the system from an unpatched to a
> patched state (or vice versa). Each task drops its TIF_PATCH_PENDING
> itself when crossing the kernel/user space boundary or it is cleared
> using the stack checking approach. If there is a task which sleeps on a
> patched function, the whole transition can get stuck indefinitely.
> 
> TODO:
> Now there is a sysfs attribute called "force", which provides two
> functionalities, "signal" and "force" (previously "unmark"). I haven't
> managed to come up with better names. Proposals are welcome. On the
> other hand I do not mind it much.

What about calling the attribute one of these?

 transition-speedup
 transition-urge

In any case, I would make it clearer that the attribute
is related to the transition attribute of each patch.

Best Regards,
Petr


Re: [PATCH v2 2/3] livepatch: send a fake signal to all blocking tasks

2017-08-16 Thread Petr Mladek
On Thu 2017-08-10 12:48:14, Miroslav Benes wrote:
> Live patching consistency model is of LEAVE_PATCHED_SET and
> SWITCH_THREAD. This means that all tasks in the system have to be marked
> one by one as safe to call a new patched function. Safe means when a
> task is not (sleeping) in a set of patched functions. That is, no
> patched function is on the task's stack. Another clearly safe place is
> the boundary between kernel and userspace. The patching waits for all
> tasks to get outside of the patched set or to cross the boundary. The
> transition is completed afterwards.
> 
> diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
> index 79022b7eca2c..a359340c924d 100644
> --- a/kernel/livepatch/core.c
> +++ b/kernel/livepatch/core.c
> @@ -452,7 +452,7 @@ EXPORT_SYMBOL_GPL(klp_enable_patch);
>  static ssize_t force_show(struct kobject *kobj,
> struct kobj_attribute *attr, char *buf)
>  {
> - return sprintf(buf, "No operation is currently permitted.\n");
> + return sprintf(buf, "signal\n");

This invalidates the "NOTE:" above this function ;-)

Best Regards,
Petr


Re: [PATCH v2 0/3] livepatch: Introduce force sysfs attribute

2017-08-16 Thread Petr Mladek
On Fri 2017-08-11 16:11:31, Josh Poimboeuf wrote:
> On Thu, Aug 10, 2017 at 12:48:12PM +0200, Miroslav Benes wrote:
> > Now there is a sysfs attribute called "force", which provides two
> > functionalities, "signal" and "force" (previously "unmark"). I haven't
> > managed to come up with better names. Proposals are welcome. On the
> > other hand I do not mind it much.
> 
> Now "force" has two meanings, which is a little confusing.  What do you
> think about just having two separate write-only sysfs flags?
> 
>   echo 1 > /sys/kernel/livepatch/signal
>   echo 1 > /sys/kernel/livepatch/force

I like the simplicity but I wonder if there might be more actions
that need to be forced in the future. Then this might cause
confusion.

For example, we have a force_module_load attribute in kGraft.
It allows loading a module even when it is refused by a livepatch.
It is handy when there is a harmless bug in the patch.

Best Regards,
Petr


Re: [PATCH v4] livepatch: introduce shadow variable API

2017-08-17 Thread Petr Mladek
On Mon 2017-08-14 16:02:43, Joe Lawrence wrote:
> Add exported API for livepatch modules:
> 
>   klp_shadow_get()
>   klp_shadow_attach()
>   klp_shadow_get_or_attach()
>   klp_shadow_update_or_attach()
>   klp_shadow_detach()
>   klp_shadow_detach_all()
> 
> that implement "shadow" variables, which allow callers to associate new
> shadow fields to existing data structures.  This is intended to be used
> by livepatch modules seeking to emulate additions to data structure
> definitions.
> 
> See Documentation/livepatch/shadow-vars.txt for a summary of the new
> shadow variable API, including a few common use cases.
> 
> See samples/livepatch/livepatch-shadow-* for example modules that
> demonstrate shadow variables.
> 
> diff --git a/kernel/livepatch/shadow.c b/kernel/livepatch/shadow.c
> new file mode 100644
> index ..0ebd4b635e4f
> --- /dev/null
> +++ b/kernel/livepatch/shadow.c
> +/**
> + * klp_shadow_match() - verify a shadow variable matches given 
> + * @shadow:  shadow variable to match
> + * @obj: pointer to parent object
> + * @id:  data identifier
> + *
> + * Return: true if the shadow variable matches.
> + *
> + * Callers should hold the klp_shadow_lock.
> + */
> +static inline bool klp_shadow_match(struct klp_shadow *shadow, void *obj,
> + unsigned long id)
> +{
> + return shadow->obj == obj && shadow->id == id;
> +}

Do we really need this function? It is called only in situations
where shadow->obj == obj is always true. Especially the use in
klp_shadow_detach_all() is funny because we pass shadow->obj as
the shadow parameter.

> +
> +/**
> + * klp_shadow_get() - retrieve a shadow variable data pointer
> + * @obj: pointer to parent object
> + * @id:  data identifier
> + *
> + * Return: the shadow variable data element, NULL on failure.
> + */
> +void *klp_shadow_get(void *obj, unsigned long id)
> +{
> + struct klp_shadow *shadow;
> +
> + rcu_read_lock();
> +
> + hash_for_each_possible_rcu(klp_shadow_hash, shadow, node,
> +(unsigned long)obj) {
> +
> + if (klp_shadow_match(shadow, obj, id)) {
> + rcu_read_unlock();
> + return shadow->data;
> + }
> + }
> +
> + rcu_read_unlock();
> +
> + return NULL;
> +}
> +EXPORT_SYMBOL_GPL(klp_shadow_get);
> +
> +/*
> + * klp_shadow_set() - initialize a shadow variable
> + * @shadow:  shadow variable to initialize
> + * @obj: pointer to parent object
> + * @id:  data identifier
> + * @data:pointer to data to attach to parent
> + * @size:size of attached data
> + *
> + * Callers should hold the klp_shadow_lock.
> + */
> +static inline void klp_shadow_set(struct klp_shadow *shadow, void *obj,
> +   unsigned long id, void *data, size_t size)
> +{
> + shadow->obj = obj;
> + shadow->id = id;
> +
> + if (data)
> + memcpy(shadow->data, data, size);
> +}

The function name suggests that it is a counterpart of
klp_shadow_get(), but it is not, which is a bit confusing.

Hmm, the purpose of this function is to reduce the amount of cut&pasted
code between all the klp_shadow_*attach() variants. But there
is still too much cut&pasted code. In fact, the base logic of all
the variants is basically the same. The only small difference should be
how they handle the situation when the variable is already there.

OK, there is a locking difference in the update variant but
it is questionable, see below.

I would suggest to do something like this:

enum klp_shadow_attach_existing_handling {
 KLP_SHADOW_EXISTING_RETURN,
 KLP_SHADOW_EXISTING_WARN,
 KLP_SHADOW_EXISTING_UPDATE,
};

void *__klp_shadow_get_or_attach(void *obj, unsigned long id, void *data,
   size_t size, gfp_t gfp_flags,
   enum klp_shadow_attach_existing_handling existing_handling)
{
struct klp_shadow *new_shadow;
void *shadow_data;
unsigned long flags;

/* Check if the shadow variable already exists */
shadow_data = klp_shadow_get(obj, id);
if (shadow_data)
goto exists;

/* Allocate a new shadow variable for use inside the lock below */
new_shadow = kzalloc(size + sizeof(*new_shadow), gfp_flags);
if (!new_shadow) {
pr_err("failed to allocate shadow variable <0x%p, %lu>\n",
 obj, id);
return NULL;
}

new_shadow->obj = obj;
new_shadow->id = id;

/* Initialize the shadow variable if data was provided */
if (data)
memcpy(new_shadow->data, data, size);

/* Look for <obj, id> again under the lock */
spin_lock_irqsave(&klp_shadow_lock, flags);
shadow_data = klp_shadow_get(obj, id);
if (unlikely(shadow_data)) {
/*
 * Shadow variable was found, th

Re: [PATCH v4] livepatch: introduce shadow variable API

2017-08-18 Thread Petr Mladek
On Thu 2017-08-17 12:01:33, Joe Lawrence wrote:
> On 08/17/2017 10:05 AM, Petr Mladek wrote:
> >> diff --git a/kernel/livepatch/shadow.c b/kernel/livepatch/shadow.c
> >> new file mode 100644
> >> index ..0ebd4b635e4f
> >> --- /dev/null
> >> +++ b/kernel/livepatch/shadow.c
> >> +/**
> >> + * klp_shadow_match() - verify a shadow variable matches given 
> >> + * @shadow:   shadow variable to match
> >> + * @obj:  pointer to parent object
> >> + * @id:   data identifier
> >> + *
> >> + * Return: true if the shadow variable matches.
> >> + *
> >> + * Callers should hold the klp_shadow_lock.
> >> + */
> >> +static inline bool klp_shadow_match(struct klp_shadow *shadow, void *obj,
> >> +  unsigned long id)
> >> +{
> >> +  return shadow->obj == obj && shadow->id == id;
> >> +}
> > 
> > Do we really need this function? It is called only in situations
> > where shadow->obj == obj is always true. Especially the use in
> > klp_shadow_detach_all() is funny because we pass shadow->obj as
> > the shadow parameter.
> 
> Personal preference.  Abstracting out all of the routines that operated
> on the shadow variables (setting up, comparison) did save some code
> lines and centralized these common bits.

I take this back. We actually need to check obj because different
objects might have the same hash.
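
To spell it out: the lookup walks a whole hash bucket, so the obj check is
what tells apart two parents that collide there (sketch of the relevant
part of klp_shadow_get() from your patch):

	rcu_read_lock();
	/* Visits every entry whose key hashed into obj's bucket... */
	hash_for_each_possible_rcu(klp_shadow_hash, shadow, node,
				   (unsigned long)obj) {
		/* ...so both fields must be compared, not only the id. */
		if (shadow->obj == obj && shadow->id == id) {
			rcu_read_unlock();
			return shadow->data;
		}
	}
	rcu_read_unlock();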

I think that I made the same mistake the last time as well. I hope that
I will be able to fix this in my mind faster than the "never" vs. "newer"
typo that I have been making for years.

Also I forgot to say that you did great work. Each version of the
patch is much better than the previous one.

Best Regards,
Petr


Re: [PATCH] printk: Remove superfluous memory barriers from printk_safe

2017-10-16 Thread Petr Mladek
On Sun 2017-10-15 20:27:15, Steven Rostedt wrote:
> On Sat, 14 Oct 2017 18:21:29 +0900
> Sergey Senozhatsky  wrote:
> 
> > On (10/11/17 12:46), Steven Rostedt wrote:
> > > From: Steven Rostedt (VMware) 
> > > 
> > > The variable printk_safe_irq_ready is set and never cleared at system
> > > boot up, when there's only one CPU active. It is set before other
> > > CPUs come on line. Also, it is extremely unlikely that an NMI would
> > > trigger this early in boot up (which I wonder why we even have this
> > > variable at all).  
> > 
> > it's not only NMI related, printk() recursion can happen at any stages,
> > including... um... wait a second. ... including the "before we set up
> > per-CPU areas" stage? hmm... smells like a bug?
> 
> I think this was just being overly paranoid.

I was curious because it was not only about reading the per-CPU
variables. We set and clear the printk_context per-CPU variable
in every printk() call. I wondered if we accessed some
uninitialized data.

Fortunately, it seems that we are on the safe side.

If I get it correctly, the per-CPU variables are set up in
setup_per_cpu_areas(). But some per-CPU variables are used even
before, see

  boot_cpu_init()
smp_processor_id()
  raw_smp_processor_id()
this_cpu_read(cpu_number)

IMHO, the trick is the following code in setup_per_cpu_areas()
from arch/x86/kernel/setup_percpu.c:

/*
 * Up to this point, the boot CPU has been using .init.data
 * area.  Reload any changed state for the boot CPU.
 */
if (!cpu)
switch_to_new_gdt(cpu);

IMHO, this means that per-CPU variables for the boot CPU
can be used at any time. And all the interesting functions,
boot_cpu_init(), setup_per_cpu_areas(), and printk_safe_init(), are
still called in single-CPU mode.

Best Regards,
Petr


Re: NMI watchdog dump does not print on hard lockup

2017-10-16 Thread Petr Mladek
On Fri 2017-10-13 12:12:29, Linus Torvalds wrote:
> On Fri, Oct 13, 2017 at 6:18 AM, Steven Rostedt  wrote:
> >
> > Or add the following case: The watchdog triggers, does the print, then
> > if it triggers again in a certain amount of time, and the print still
> > hasn't been flushed, the flush happens then.

Sounds good to me.

> By the time 40 sec has passed, I suspect most people have just
> rebooted the machine.

This might be the case for a desktop. But people might be more
conservative in the case of big servers or when debugging. They might
be desperate to keep the system going or to see at least something.

> I think an NMI watchdog should just force the flush - the same way an
> oops should. Deadlocks aren't really relevant if something doesn't get
> printed out anyway.

We explicitly flush the NMI buffers in panic() when there is
no other way to see them. But it is questionable in other situations.
Sometimes the flush might be the only way to see the messages,
and sometimes printk() might unnecessarily cause a deadlock.
IMHO, the only solution is to make it optional.
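
A possible shape of such an opt-in knob (purely hypothetical sketch; the
parameter name and the hook are made up, only printk_safe_flush() is an
existing helper):

/* Hypothetical knob: flush the per-CPU printk buffers from the watchdog. */
static bool hardlockup_flush_printk;
module_param(hardlockup_flush_printk, bool, 0644);

static void watchdog_maybe_flush_printk(void)
{
	if (hardlockup_flush_printk)
		printk_safe_flush();
}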

Best Regards,
Petr


Re: [PATCH v4 2/3] livepatch: shuffle core.c function order

2017-10-16 Thread Petr Mladek
On Thu 2017-10-12 17:12:28, Jason Baron wrote:
> In preparation for __klp_enable_patch() to call a number of 'static'
> functions, in a subsequent patch, move them earlier in core.c. This patch
> should be a nop from a functional pov.
> 
> Signed-off-by: Jason Baron 
> Cc: Josh Poimboeuf 
> Cc: Jessica Yu 
> Cc: Jiri Kosina 
> Cc: Miroslav Benes 
> Cc: Petr Mladek 
> ---
>  kernel/livepatch/core.c | 349 
> 
>  1 file changed, 173 insertions(+), 176 deletions(-)
> 
> diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
> index b7f77be..f53eed5 100644
> --- a/kernel/livepatch/core.c
> +++ b/kernel/livepatch/core.c
> @@ -283,6 +283,179 @@ static int klp_write_object_relocations(struct module 
> *pmod,
> +static int klp_init_func(struct klp_object *obj, struct klp_func *func)
> +{
> + if (!func->old_name || !func->new_func)
> + return -EINVAL;
> +
> + INIT_LIST_HEAD(&func->stack_node);
> + func->patched = false;
> + func->transition = false;

You lost the change from the 1st patch:

list_add(&func->func_entry, &obj->func_list);

> +
> + /* The format for the sysfs directory is  where sympos
> +  * is the nth occurrence of this symbol in kallsyms for the patched
> +  * object. If the user selects 0 for old_sympos, then 1 will be used
> +  * since a unique symbol will be the first occurrence.
> +  */
> + return kobject_init_and_add(&func->kobj, &klp_ktype_func,
> + &obj->kobj, "%s,%lu", func->old_name,
> + func->old_sympos ? func->old_sympos : 1);
> +}

[...]

> +static int klp_init_object(struct klp_patch *patch, struct klp_object *obj)
> +{
> + struct klp_func *func;
> + int ret;
> + const char *name;
> +
> + if (!obj->funcs)
> + return -EINVAL;
> +
> + obj->patched = false;
> + obj->mod = NULL;
> +
> + klp_find_object_module(obj);
> +
> + name = klp_is_module(obj) ? obj->name : "vmlinux";
> + ret = kobject_init_and_add(&obj->kobj, &klp_ktype_object,
> +&patch->kobj, "%s", name);
> + if (ret)
> + return ret;
> +

Same here:

list_add(&obj->obj_entry, &patch->obj_list);
INIT_LIST_HEAD(&obj->func_list);
klp_for_each_func_static(obj, func) {

> + klp_for_each_func(obj, func) {
> + ret = klp_init_func(obj, func);
> + if (ret)
> + goto free;
> + }
> +
> + if (klp_is_object_loaded(obj)) {
> + ret = klp_init_object_loaded(patch, obj);
> + if (ret)
> + goto free;
> + }
> +
> + return 0;
> +
> +free:
> + klp_free_funcs_limited(obj, func);
> + kobject_put(&obj->kobj);
> + return ret;
> +}

Best Regards,
Petr


Re: [PATCH v4 3/3] livepatch: add atomic replace

2017-10-17 Thread Petr Mladek
On Thu 2017-10-12 17:12:29, Jason Baron wrote:
> Sometimes we would like to revert a particular fix. This is currently
> This is not easy because we want to keep all other fixes active and we
> could revert only the last applied patch.
> 
> One solution would be to apply new patch that implemented all
> the reverted functions like in the original code. It would work
> as expected but there will be unnecessary redirections. In addition,
> it would also require knowing which functions need to be reverted at
> build time.
> 
> A better solution would be to say that a new patch replaces
> an older one. This might be complicated if we want to replace
> a particular patch. But it is rather easy when a new cummulative
> patch replaces all others.
> 
> diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
> index f53eed5..d1c7a06 100644
> --- a/kernel/livepatch/core.c
> +++ b/kernel/livepatch/core.c
> @@ -283,8 +301,21 @@ static int klp_write_object_relocations(struct module 
> *pmod,
>   return ret;
>  }
>  
> +atomic_t klp_nop_release_count;
> +static DECLARE_WAIT_QUEUE_HEAD(klp_nop_release_wait);
> +
>  static void klp_kobj_release_object(struct kobject *kobj)
>  {
> + struct klp_object *obj;
> +
> + obj = container_of(kobj, struct klp_object, kobj);
> + /* Free dynamically allocated object */
> + if (!obj->funcs) {
> + kfree(obj->name);
> + kfree(obj);
> + atomic_dec(&klp_nop_release_count);
> + wake_up(&klp_nop_release_wait);

I would slightly optimize this by

if (atomic_dec_and_test(&klp_nop_release_count))
	wake_up(&klp_nop_release_wait);

> + }
>  }
>  
>  static struct kobj_type klp_ktype_object = {
> @@ -294,6 +325,16 @@ static struct kobj_type klp_ktype_object = {
>  
>  static void klp_kobj_release_func(struct kobject *kobj)
>  {
> + struct klp_func *func;
> +
> + func = container_of(kobj, struct klp_func, kobj);
> + /* Free dynamically allocated functions */
> + if (!func->new_func) {
> + kfree(func->old_name);
> + kfree(func);
> + atomic_dec(&klp_nop_release_count);
> + wake_up(&klp_nop_release_wait);

Same here

if (atomic_dec_and_test(&klp_nop_release_count))
	wake_up(&klp_nop_release_wait);


> + }
>  }
>  
>  static struct kobj_type klp_ktype_func = {
> @@ -436,8 +480,14 @@ static int klp_init_object(struct klp_patch *patch, 
> struct klp_object *obj)
>   if (ret)
>   return ret;
>  
> - klp_for_each_func(obj, func) {
> - ret = klp_init_func(obj, func);
> + list_add(&obj->obj_entry, &patch->obj_list);
> + INIT_LIST_HEAD(&obj->func_list);
> +
> + if (nop)
> + return 0;

Ah, this is something that I wanted to avoid. It makes the code
very hard to read and maintain. It forces us to duplicate
some code in klp_alloc_nop_func(). I think that I complained
about this in v2 already.

I understand that you actually kept it because of me.
It is related to the possibility of re-enabling replaced
patches :-(

The klp_init_*() stuff is called from __klp_enable_patch()
for the "nop" functions now. And it has already been called
for the statically defined structures in klp_register_patch().
Therefore we need to avoid calling it twice for the static
structures.

One solution would be to do these operations for the statically
defined structures in __klp_enable_patch() as well. But this
would mean a big redesign of the code.

Another solution would be to give up on the idea that the replaced
patches might be re-enabled without re-loading. I am afraid
that this is the only reasonable approach. It would also help to
avoid the extra klp_replaced_patches list. All this will help to
make the code much simpler.

I am really sorry that I asked you to do this exercise and
support the patch re-enablement. It looked like a good idea.
I did not expect that it would be that complicated.

I am stopping my review of this patch here because it will look quite
different again. I will only keep some random comments
around that I added before finding this main design flaw.

Thanks a lot for the hard work. v4 looks much better than
v2 in many ways. I think that we are going the right way.

> +
> + klp_for_each_func_static(obj, func) {
> + ret = klp_init_func(obj, func, false);
>   if (ret)
>   goto free;
>   }
> @@ -456,6 +506,226 @@ static int klp_init_object(struct klp_patch *patch, 
> struct klp_object *obj)
>   return ret;
>  }

[...]

> +/* Add 'nop' functions which simply return to the caller to run
> + * the original function. The 'nop' functions are added to a
> + * patch to facilitate a 'replace' mode
> + */
> +static int klp_add_nops(struct klp_patch *patch)
> +{
> + struct klp_patch *old_patch;
> + struct klp_object *old_obj;
> + int err = 0;
> +
> + if (!patch->replace)
> + ret

Re: [PATCH v3 2/2] livepatch: add atomic replace

2017-10-17 Thread Petr Mladek
On Tue 2017-10-17 11:02:29, Miroslav Benes wrote:
> On Tue, 10 Oct 2017, Jason Baron wrote:
> > On 10/06/2017 06:32 PM, Josh Poimboeuf wrote:
> > > I don't really like allowing a previously replaced patch to replace the
> > > current patch.  It's just more unnecessary complexity.

I am sorry to say it, but it really makes the code too complex.

> > > If the user
> > > wants to atomically revert back to kpatch-a, they should be able to:
> > > 
> > >   rmmod kpatch-a
> > >   insmod kpatch-a.ko
> > >
> > Right - that's how I sent v1 (using rmmod/insmod to revert), but it
> > didn't account for the fact the patch or some functions may be marked
> > 'immediate' and thus its not possible to just do 'rmmod'. Thus, since in
> > some cases 'rmmod' was not feasible, I thought it would be simpler from
> > an operational pov to just say we always revert by re-enabling a
> > previously replaced patch as opposed to rmmod/insmod.
> > 
> Hm. Would it make sense to remove immediate and rely only on the 
> consistency model? At least for the architectures where the model is 
> implemented (x86_64)?
> 
> If not, then I'd keep such modules there without a possibility to remove 
> them ever. If its functionality was required again, it would of course 
> mean to insmod a new module with it.

I am fine with this compromise. It seems to be the only way to keep the
livepatch code somewhat sane.

Best Regards,
Petr


Re: [RFC][PATCHv3 2/5] printk: introduce printing kernel thread

2017-05-29 Thread Petr Mladek
On Wed 2017-05-10 14:59:35, Sergey Senozhatsky wrote:
> This patch introduces a '/sys/module/printk/parameters/atomic_print_limit'
> sysfs param, which sets the limit on number of lines a process can print
> from console_unlock(). Value 0 corresponds to the current behavior (no
> limitation). The printing offloading is happening from console_unlock()
> function and, briefly, looks as follows: as soon as process prints more
> than `atomic_print_limit' lines it attempts to offload printing to another
> process. Since nothing guarantees that there will another process sleeping
> on the console_sem or calling printk() on another CPU simultaneously, the
> patch also introduces an auxiliary kernel thread - printk_kthread, the
> main purpose of which is to take over printing duty. The workflow is, thus,
> turns into: as soon as process prints more than `atomic_print_limit' lines
> it wakes up printk_kthread and unlocks the console_sem. So in the best case
> at this point there will be at least 1 processes trying to lock the
> console_sem: printk_kthread. (There can also be a process that was sleeping
> on the console_sem and that was woken up by console semaphore up(); and
> concurrent printk() invocations from other CPUs). But in the worst case
> there won't be any processes ready to take over the printing duty: it
> may take printk_kthread some time to become running; or printk_kthread
> may even never become running (a misbehaving scheduler, or other critical
> condition). That's why after we wake_up() printk_kthread we can't
> immediately leave the printing loop, we must ensure that the console_sem
> has a new owner before we do so. Therefore, `atomic_print_limit' is a soft
> limit, not the hard one: we let task to overrun `atomic_print_limit'.
> But, at the same time, the console_unlock() printing loop behaves differently
> for tasks that have exceeded `atomic_print_limit': after every printed
> logbuf entry (call_console_drivers()) such a process wakes up printk_kthread,
> unlocks the console_sem and attempts to console_trylock() a bit later
> (if there any are pending messages in the logbuf, of course). In the best case
> scenario either printk_kthread or some other tasks will lock the console_sem,
> so current printing task will see failed console_trylock(), which will
> indicate a successful printing offloading. In the worst case, however,
> current will successfully console_trylock(), which will indicate that
> offloading did not take place and we can't return from console_unlock(),
> so the printing task will print one more line from the logbuf and attempt
> to offload printing once again; and it will continue doing so until another
> process locks the console_sem or until there are pending messages in the
> logbuf. So if everything goes wrong - we can't wakeup printk_kthread and
> there are no other processes sleeping on the console_sem or trying to down()
> it - then we will have the existing console_unlock() behavior: print all
> pending messages in one shot.

Please, try to avoid such long paragraphs ;-) It looks fine when you
read it for the first time. But it becomes problematic when you try
to go back and re-read some detail.

> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index 2cb7f4753b76..a113f684066c 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -2155,6 +2216,85 @@ static inline int can_use_console(void)
>   return cpu_online(raw_smp_processor_id()) || have_callable_console();
>  }
>  
> +/*
> + * Under heavy printing load or with a slow serial console (or both)
> + * console_unlock() can stall CPUs, which can result in soft/hard-lockups,
> + * lost interrupts, RCU stalls, etc. Therefore we attempt to limit the
> + * number of lines a process can print from console_unlock().
> + *
> + * There is one more reason to do offloading - there might be other processes
> + * (including user space tasks) sleeping on the console_sem in 
> uninterruptible
> + * state; and keeping the console_sem locked for long time may have a 
> negative
> + * impact on them.
> + *
> + * This function must be called from 'printk_safe' context under
> + * console_sem lock.
> + */
> +static inline bool console_offload_printing(void)
> +{
> + static struct task_struct *printing_task;
> + static unsigned long long lines_printed;
> + static unsigned long saved_csw;
> +
> + if (!printk_offloading_enabled())
> + return false;
> +
> + if (system_state != SYSTEM_RUNNING || oops_in_progress)
> + return false;

Wow, I really like this test. I was not aware of this variable.

Well, I would allow using the offload during boot as well.
We have reports where softlockups happened during boot.
The offloading can easily be disabled via the command line,
but not the other way around.


> + /* A new task - reset the counters. */
> + if (printing_task != current) {
> + lines_printed = 0;
> + printing_task = current;
> 

Re: [PATCH 3/3] livepatch: force transition process to finish

2017-05-29 Thread Petr Mladek
On Fri 2017-05-26 12:37:56, Josh Poimboeuf wrote:
> On Thu, May 25, 2017 at 06:03:07PM +0200, Petr Mladek wrote:
> > On Thu 2017-05-25 14:59:55, Miroslav Benes wrote:
> > > 
> > > > > > In fact, I would suggest to take klp_mutex in force_store()
> > > > > > and do all actions synchronously, including the check
> > > > > > of klp_transition_patch.
> > > > > 
> > > > > I still think it is better not do it. klp_unmark_tasks() does nothing 
> > > > > else 
> > > > > than tasks already do. They call klp_update_patch_state() by 
> > > > > themselves 
> > > > > and they do not grab klp_mutex lock for doing that. 
> > > > > klp_unmark_tasks() 
> > > > > only forces this action.
> > > > 
> > > > You have a point. But I am not convinced ;-) klp_update_patch_state()
> > > > was called very carefully only when it was safe. The forcing
> > > > intentionally breaks the consistency model. User should really know
> > > > what they are doing when they use this feature.
> > > > 
> > > > I think that we should actually taint the kernel. Developers should
> > > > know when users were pulling their legs.
> > > 
> > > We could do that. I can change pr_warn() to WARN_ON_ONCE(), which would 
> > > of 
> > > course taint the kernel.
> > 
> > Sounds good to me.
> 
> I'm thinking that WARN_ON_ONCE() seems too severe.  If the patch didn't
> need a consistency model in the first place then it wouldn't be worth
> warning about.
> 
> We have to trust that the user knows what they're doing.  And that's
> true for the entire live patching process, including patch analysis and
> patch creation.  And anyway we already have a taint flag for that:
> TAINT_LIVEPATCH.

But the force is done on the user side. Let's say that the authors of
the livepatch code and of the patches know what they are doing.
Could we expect the same from the admins who apply the patches?

TAINT_LIVEPATCH is set because the system behaves differently
than with the original code. But it should still be consistent.
Using the forced migration might move the system into a wonderland.


> > > > > On the other hand, I do not see a problem in doing that. We already 
> > > > > have a 
> > > > > relationship between klp_mutex and tasklist_lock defined elsewhere, 
> > > > > so it 
> > > > > is safe.
> > > > 
> > > > Yup.
> > > > 
> > > > > It would only serialize things needlessly.
> > > > 
> > > > I do not agree. The speed is not important here. Also look
> > > > into klp_reverse_transition(). We explicitly clear all
> > > > TIF_PATCH_PENDING flags and call synchronize_rcu() just
> > > > to make the situation easier and reduce space for potential
> > > > mistakes.
> > > 
> > > Yes, because we had to do that. We ran into problems otherwise. We do not 
> > > have to do it here. It does not help anything in my opinion.
> > 
> > AFAIK, we did not have to do it, see
> > https://lkml.kernel.org/r/20161222143452.gk25...@pathway.suse.cz
> > and the comment starting with "It would still leave a small".
> > 
> > Just for record, the idea of disabling the TIF flags came from Josh
> > in another mail. I have just repeated it.
> > 
> > I think that the problem already is complex enough and the
> > serialization would reduce the space of potential races.
> > But it is possible that I see it just too complex here.
> 
> IMO we can skip the mutex.  The consistency model will be broken anyway,
> so all bets are off.

I just hope that I will never be forced to debug a system crash
after this operation.

Imagine a situation where we send a livepatch, using the hybrid
consistency model, that should be safe also in the immediate mode.
Some processes get stuck. We suggest forcing because
it should be safe. And it breaks. Then we will want to know
why this happened. If the forcing is not serialized, we will
need to consider/check many more parallel operations.

But if I am the only one who thinks this way, it might mean
that I am over-pessimistic in this context. I will buy
a head bandage to be prepared and can live without
the serialization.

Best Regards,
Petr


Re: [RFC][PATCHv3 5/5] printk: register PM notifier

2017-05-30 Thread Petr Mladek
On Tue 2017-05-09 17:28:59, Sergey Senozhatsky wrote:
> It's not always possible/safe to wake_up() printk kernel
> thread. For example, late suspend/early resume may printk()
> while timekeeping is not initialized yet, so calling into the
> scheduler may result in recursive warnings.
> 
> Another thing to notice is the fact PM at some point
> freezes user space and kernel threads: freeze_processes()
> and freeze_kernel_threads(), correspondingly. Thus we need
> printk() to operate in emergency mode there and attempt to
> immediately flush pending kernel message to the console.
> 
> This patch registers PM notifier, so PM can switch printk
> to emergency mode from PM_FOO_PREPARE notifiers and return
> back to printk threaded mode from PM_POST_FOO notifiers.
> 
> Signed-off-by: Sergey Senozhatsky 
> Suggested-by: Andreas Mohr 
> ---
>  kernel/printk/printk.c | 27 +++
>  1 file changed, 27 insertions(+)
> 
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index 81ea575728b9..6aae36a29aca 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -2928,6 +2929,30 @@ static DEFINE_PER_CPU(struct irq_work, 
> wake_up_klogd_work) = {
>   .flags = IRQ_WORK_LAZY,
>  };
>  
> +static int printk_pm_notify(struct notifier_block *notify_block,
> + unsigned long mode, void *unused)
> +{
> + switch (mode) {
> + case PM_HIBERNATION_PREPARE:
> + case PM_SUSPEND_PREPARE:
> + case PM_RESTORE_PREPARE:
> + printk_emergency_begin();
> + break;
> +
> + case PM_POST_SUSPEND:
> + case PM_POST_HIBERNATION:
> + case PM_POST_RESTORE:
> + printk_emergency_end();
> + break;
> + }
> +
> + return 0;

Heh, it seems that the meaning of the return values
is a bit unclear and messy. For example, see an older problem
with the Android memory handlers at https://lkml.org/lkml/2012/4/9/177

My understanding is that notifiers should return NOTIFY_OK
when they handled it and NOTIFY_DONE when they did nothing.
For example, see cpu_hotplug_pm_callback().

In reality, there does not seem to be a big difference,
see notifier_to_errno() in __pm_notifier_call_chain().

Anyway, I would try to follow the cpu_hotplug_pm_callback()
example.
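
Following that example, the handler above would then look roughly like
this (sketch):

static int printk_pm_notify(struct notifier_block *notify_block,
			    unsigned long mode, void *unused)
{
	switch (mode) {
	case PM_HIBERNATION_PREPARE:
	case PM_SUSPEND_PREPARE:
	case PM_RESTORE_PREPARE:
		printk_emergency_begin();
		return NOTIFY_OK;

	case PM_POST_SUSPEND:
	case PM_POST_HIBERNATION:
	case PM_POST_RESTORE:
		printk_emergency_end();
		return NOTIFY_OK;
	}

	return NOTIFY_DONE;	/* the event was not handled here */
}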


> +}
> +
> +static struct notifier_block printk_pm_nb = {
> + .notifier_call = printk_pm_notify,
> +};
> +
>  static int printk_kthread_func(void *data)
>  {
>   while (1) {
> @@ -2961,6 +2986,8 @@ static int __init init_printk_kthread(void)
>  
>   sched_setscheduler(thread, SCHED_FIFO, ¶m);
>   printk_kthread = thread;
> +
> + WARN_ON(register_pm_notifier(&printk_pm_nb) != 0);

I think that a simple error message might be enough.

Also, we might want to force the emergency mode in case of an error.
Otherwise, the messages would not appear during suspend and similar
operations.
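
I mean something like this (sketch):

	if (register_pm_notifier(&printk_pm_nb)) {
		pr_err("printk: failed to register PM notifier, staying in emergency mode\n");
		printk_emergency_begin();
	}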


BTW: Do you know about some locations that would need to be
patched explicitly, for example, kexec?

Some old versions of this patch touched console_suspend().
This brought me to snapshot_ioctl(). It looks like an
API that allows creating snapshots from user space.
I wonder if we should switch to the emergency mode
there as well, probably in the SNAPSHOT_FREEZE
stage.

Best Regards,
Petr


Re: [PATCH V9] printk: hash addresses printed with %p

2017-10-31 Thread Petr Mladek
On Mon 2017-10-30 09:59:16, Tobin C. Harding wrote:
> Currently there are many places in the kernel where addresses are being
> printed using an unadorned %p. Kernel pointers should be printed using
> %pK allowing some control via the kptr_restrict sysctl. Exposing addresses
> gives attackers sensitive information about the kernel layout in memory.
> 
> We can reduce the attack surface by hashing all addresses printed with
> %p. This will of course break some users, forcing code printing needed
> addresses to be updated.

I am sorry for my ignorance but what is the right update, please?
I expect that there are several possibilities:

  + remove the pointer altogether

  + replace it with %pK so that it honors kptr_restrict setting

  + any other option?

Is kptr_restrict considered a safe mechanism?

Also, kptr_restrict seems to be primarily for the messages that are available
via /proc and /sys. Is it good enough for the messages logged by
printk()?

Will there be a debug option that would allow seeing the original
pointers? Or what is the preferred way for debug messages?


> For what it's worth, usage of unadorned %p can be broken down as
> follows (thanks to Joe Perches).
> 
> $ git grep -E '%p[^A-Za-z0-9]' | cut -f1 -d"/" | sort | uniq -c
>1084 arch
>  20 block
>  10 crypto
>  32 Documentation
>8121 drivers
>1221 fs
> 143 include
> 101 kernel
>  69 lib
> 100 mm
>1510 net
>  40 samples
>   7 scripts
>  11 security
> 166 sound
> 152 tools
>   2 virt

It is evident that it will hit many people. I guess that they will
be surprised and might have similar questions. It might make sense
to describe this in Documentation/printk-formats.txt.

Best Regards,
Petr


Re: [PATCH V9] printk: hash addresses printed with %p

2017-11-01 Thread Petr Mladek
On Wed 2017-11-01 10:53:45, Tobin C. Harding wrote:
> On Tue, Oct 31, 2017 at 04:39:44PM +0100, Petr Mladek wrote:
> > On Mon 2017-10-30 09:59:16, Tobin C. Harding wrote:
> > > Currently there are many places in the kernel where addresses are being
> > > printed using an unadorned %p. Kernel pointers should be printed using
> > > %pK allowing some control via the kptr_restrict sysctl. Exposing addresses
> > > gives attackers sensitive information about the kernel layout in memory.
> > > 
> > > We can reduce the attack surface by hashing all addresses printed with
> > > %p. This will of course break some users, forcing code printing needed
> > > addresses to be updated.
> > 
> > I am sorry for my ignorance but what is the right update, please?
> 
> Can I say first that I am in no way an expert, I am new to both this
> problem and kernel dev in general.

Sure. There are many experienced people in CC. I hope that they
have some opinion on this.

My concern is that this patch breaks some functionality because it
is dangerous. That makes perfect sense. But people will want to fix
it in a safe way. And that way is not clear to me.


> > I expect that there are several possibilities:
> > 
> >   + remove the pointer at all
> 
> This definitely stops the leak!

Sure. But this is not a solution for debugging tools, e.g. lockdep
or sysrq-related dumps. Of course, the pointers must not be visible
to normal users on a production system, but there should be a way
to see them for debugging purposes.

> >   + replace it with %pK so that it honors kptr_restrict setting
> 
> I think this is the option of choice, see concerns below however. I get
> the feeling that the hope with this patch is that a vast majority of
> users of %p won't care so stopping all those addresses is the real win
> for this patch.
> 
> The next hoped benefit is that the hashing will shed light on this topic
> and get developers to think about the issue before _wildly_ printing
> addresses. Having to work harder to print the address in future will aid
> this (assuming everyone doesn't just start using %x).



> >   + any other option?
> 
> Use %x or %X - really bad, this will create more pain in the future.

Yes, using %x or %X is dangerous. It would be used by people
who do not care about security. Or is there any situation where
the original pointer is always needed as a string
for real functionality?

IMHO, the string form is needed only for human-readable output. And that
always smells of an information leak.


> > Is kptr_restrict considered a safe mechanism?
> > 
> > Also kptr_restrict seems to be primary for the messages that are available
> > via /proc and /sys. Is it good enough for the messages logged by
> > printk()?
> 
> There is some concern that kptr_restrict is not overly great. Linus is
> the main purveyor of this argument. I won't paraphrase here because I
> will not do the argument justice.
> 
> See this thread for the whole discussion
> 
> [kernel-hardening] [RFC V2 0/6] add more kernel pointer filter options

I saw the discussion but it was a bit hard to follow. I would
highlight the following three concerns related to kptr_restrict:

First, it did not help improve the situation. IMHO, this is mainly
because it was opt-in and did not force people to fix the code. This
is being addressed by this patch, which actively breaks all users.

Second, the user credentials are checked when the string is formatted.
They do not reflect the path by which the string is passed to the user.
Therefore it might be circumvented. This opens the question whether
kptr_restrict will stay in the kernel and whether the conversion from
%p to %pK is one of the proposed solutions.

Third, kptr_restrict can be modified too late. It means that
the pointers printed during early boot will either never or always
be readable. You would therefore need two different kernel
builds for production and debugging. Well, this is probably
the smallest issue.


All in all, I wonder if it would make sense to introduce
some compile-time or command line option that would allow
disabling the hashing of pointers for debugging
purposes.
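
For example, a debugging-only boot parameter might look like this (purely
hypothetical sketch, the parameter name is made up; whether such an escape
hatch is acceptable is exactly my question):

static bool debug_unhashed_pointers __ro_after_init;

static int __init debug_unhashed_pointers_setup(char *str)
{
	debug_unhashed_pointers = true;
	pr_warn("** %%p will print raw kernel pointers - debugging only! **\n");
	return 0;
}
early_param("debug_unhashed_pointers", debug_unhashed_pointers_setup);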

Best Regards,
Petr


Re: [PATCH] mm: don't warn about allocations which stall for too long

2017-11-01 Thread Petr Mladek
On Wed 2017-11-01 09:30:05, Vlastimil Babka wrote:
> On 10/31/2017 08:32 PM, Steven Rostedt wrote:
> > 
> > Thank you for the perfect timing. You posted this the day after I
> > proposed a new solution at Kernel Summit in Prague for the printk lock
> > loop that you experienced here.
> > 
> > I attached the pdf that I used for that discussion (ignore the last
> > slide, it was left over and I never went there).
> > 
> > My proposal is to do something like this with printk:
> > 
> > Three types of printk usages:
> > 
> > 1) Active printer (actively writing to the console).
> > 2) Waiter (active printer, first user)
> > 3) Sees active printer and a waiter, and just adds to the log buffer
> >and leaves.
> > 
> > (new globals)
> > static DEFINE_SPIN_LOCK(console_owner_lock);
> > static struct task_struct console_owner;
> > static bool waiter;
> > 
> > console_unlock() {
> > 
> > [ Assumes this part can not preempt ]
> > 
> > spin_lock(console_owner_lock);
> > console_owner = current;
> > spin_unlock(console_owner_lock);
> > 
> > for each message
> > write message out to console
> > 
> > if (READ_ONCE(waiter))
> > break;
> 
> Ah, these two lines clarified for me what I didn't get from your talk,
> so I got the wrong impression that the new scheme is just postponing the
> problem.
> 
> But still, it seems to me that the scheme only works as long as there
> are printk()'s coming with some reasonable frequency. There's still a
> corner case when a storm of printk()'s can come that will fill the ring
> buffers, and while during the storm the printing will be distributed
> between CPUs nicely, the last unfortunate CPU after the storm subsides
> will be left with a large accumulated buffer to print, and there will be
> no waiters to take over if there are no more printk()'s coming. What
> then, should it detect such situation and defer the flushing?

This was my fear as well. Steven argued that this was theoretical.
And I do not have real-life ammunition against this argument at
the moment.

My current main worry with Steven's approach is the risk of deadlocks
that Jan Kara saw when he played with a similar solution.

Also I am afraid that it would add yet another twist to the console
locking operations. It is already quite hard to follow the logic,
see the games with:

+ console_locked
+ console_suspended
+ can_use_console()
+ exclusive_console

And Steven is going to add:

+ console_owner
+ waiter

But let's wait for the patch. It might look and work nicely
in the end.

Best Regards,
Petr


Re: [PATCH v3 1/2] livepatch: send a fake signal to all blocking tasks

2017-11-01 Thread Petr Mladek
On Tue 2017-10-31 12:48:52, Miroslav Benes wrote:
> Live patching consistency model is of LEAVE_PATCHED_SET and
> SWITCH_THREAD. This means that all tasks in the system have to be marked
> one by one as safe to call a new patched function. Safe means when a
> task is not (sleeping) in a set of patched functions. That is, no
> patched function is on the task's stack. Another clearly safe place is
> the boundary between kernel and userspace. The patching waits for all
> tasks to get outside of the patched set or to cross the boundary. The
> transition is completed afterwards.
> 
> The problem is that a task can block the transition for quite a long
> time, if not forever. It could sleep in a set of patched functions, for
> example.  Luckily we can force the task to leave the set by sending it a
> fake signal, that is a signal with no data in signal pending structures
> (no handler, no sign of proper signal delivered). Suspend/freezer use
> this to freeze the tasks as well. The task gets TIF_SIGPENDING set and
> is woken up (if it has been sleeping in the kernel before) or kicked by
> rescheduling IPI (if it was running on other CPU). This causes the task
> to go to kernel/userspace boundary where the signal would be handled and
> the task would be marked as safe in terms of live patching.
> 
> diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
> index b004a1fb6032..6700d3b22615 100644
> --- a/kernel/livepatch/transition.c
> +++ b/kernel/livepatch/transition.c
> @@ -577,3 +577,43 @@ void klp_copy_process(struct task_struct *child)
>  
>   /* TIF_PATCH_PENDING gets copied in setup_thread_stack() */
>  }
> +
> +/*
> + * Sends a fake signal to all non-kthread tasks with TIF_PATCH_PENDING set.
> + * Kthreads with TIF_PATCH_PENDING set are woken up. Only admin can request 
> this
> + * action currently.
> + */
> +void klp_force_signals(void)
> +{
> + struct task_struct *g, *task;
> +
> + pr_notice("signaling remaining tasks\n");
> +
> + read_lock(&tasklist_lock);
> + for_each_process_thread(g, task) {
> + if (!klp_patch_pending(task))
> + continue;
> +
> + /*
> +  * There is a small race here. We could see TIF_PATCH_PENDING
> +  * set and decide to wake up a kthread or send a fake signal.
> +  * Meanwhile the task could migrate itself and the action
> +  * would be meaningless. It is not serious though.
> +  */
> + if (task->flags & PF_KTHREAD) {
> + /*
> +  * Wake up a kthread which still has not been migrated.
> +  */
> + wake_up_process(task);

I have just noticed that the freezer uses wake_up_state(p, TASK_INTERRUPTIBLE).
IMHO, we should do so as well.

wake_up_process() also wakes tasks in TASK_UNINTERRUPTIBLE state.
These might not be ready for an unexpected wakeup. For example,
see concat_dev_erase() in drivers/mtd/mtdconcat.c.
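
I.e. something like this in klp_force_signals() (sketch of the suggested change):

		if (task->flags & PF_KTHREAD) {
			/* Wake up only kthreads that sleep interruptibly. */
			wake_up_state(task, TASK_INTERRUPTIBLE);
		}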

With this change, feel free to use

Reviewed-by: Petr Mladek 

Best Regards,
Petr


Re: [PATCH v3 2/2] livepatch: force transition process to finish

2017-11-01 Thread Petr Mladek
On Tue 2017-10-31 12:48:53, Miroslav Benes wrote:
> If a task sleeps in a set of patched functions uninterruptedly, it could
> block the whole transition process indefinitely.  Thus it may be useful
> to clear its TIF_PATCH_PENDING to allow the process to finish.
> 
> Admin can do that now by writing to force sysfs attribute in livepatch
> sysfs directory. TIF_PATCH_PENDING is then cleared for all tasks and the
> transition can finish successfully.
> 
> Important note! Use wisely. Admin must be sure that it is safe to
> execute such action. This means that it must be checked that by doing so
> the consistency model guarantees are not violated.
> 
> Signed-off-by: Miroslav Benes 

If no animals were harmed when developing this brute force then
feel free to use:

Reviewed-by: Petr Mladek 

Best Regards,
Petr


Re: [PATCH] mm: don't warn about allocations which stall for too long

2017-11-02 Thread Petr Mladek
On Wed 2017-11-01 11:36:47, Steven Rostedt wrote:
> On Wed, 1 Nov 2017 14:38:45 +0100
> Petr Mladek  wrote:
> > My current main worry with Steven's approach is a risk of deadlocks
> > that Jan Kara saw when he played with similar solution.
> 
> And if there exists such a deadlock, then the deadlock exists today.

The patch is going to effectively change console_trylock() to
console_lock() and this might add problems.

The simplest example is:

   console_lock()
     printk()
       console_trylock()    was SAFE.

   console_lock()
     printk()
       console_lock()       causes DEADLOCK!

Sure, we could detect this and avoid waiting when
console_owner == current. But does this cover all
situations? What about?

CPU0                            CPU1

console_lock()                  func()
  console->write()                take_lockA()
    func()                          printk()
                                      busy wait for console_lock()

      take_lockA()

In other words, it used to be safe to call printk() from
console->write() functions because printk() used console_trylock().
Your patch is going to change this. It is even worse because
you probably will not use console_lock() directly and therefore
this might be hidden from lockdep.

BTW: I am still not sure how to make the busy waiter preferred
over console_lock() callers. I mean that the busy waiter has
to get console_sem even if there are other tasks in the wait queue.


> > But let's wait for the patch. It might look and work nicely
> > in the end.
> 
> Oh, I need to write a patch? Bah, I guess I should. Where's all those
> developers dying to do kernel programing where I can pass this off to?

Yes, where are those days when my primary task was to learn kernel
hacking? This would have been great training material.

I still have to invest time into fixing printk. But I personally
think that the lazy offloading to kthreads is the more promising
way to go. It is pretty straightforward. The only problem is
the guarantee of the takeover. But there must be a reasonable
way to detect that the system's heart is still beating
and we are not the only working CPU.

Best Regards,
Petr


Re: printk discussions at KS

2017-11-02 Thread Petr Mladek
On Wed 2017-11-01 19:12:23, Joe Perches wrote:
> As I was not there, and I know about as much as anyone
> about printk internals, can you please post a recap of
> what was discussed, technical and other, about printk
> improvements at the kernel-summit?
> 
> If there was a pdf/powerpoint, that'd be nice to post too.

There is a nice summary of the discussion at
https://lwn.net/Articles/737822/

In short, it is the old problem with possible soft-lockups.
Many variants of a solution based on offloading and kthreads
have been sent. We discussed yet another solution proposed
by Steven.

Best Regards,
Petr


Re: [GIT pull] printk updates for 4.15

2017-11-14 Thread Petr Mladek
On Mon 2017-11-13 17:18:33, Linus Torvalds wrote:
> On Mon, Nov 13, 2017 at 1:36 AM, Thomas Gleixner  wrote:
> > Linus,
> >
> > please pull the latest core-printk-for-linus git tree from:
> >
> >git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 
> > core-printk-for-linus
> >
> > This update adds the mechanisms to emit printk timestamps based on
> > different clocks:
> >
> >   - scheduler clock (default)
> >   - monotonic time
> >   - boot time
> >   - wall clock time
> >
> > This helps to correlate dmesg with user space log information, tracing,
> > etc. This can be local correlation or in case of wall clock time correlated
> > across machines, assumed that all machines are synchronized via
> > NTP/PTP.
> 
> Honestly, this just seems bogus to me, particularly since it's a single 
> choice.
> 
> The *sane* model would be to
> 
>  (a) continue to use the existing time that we always have
> (local_clock()) in the printk timestamps, and don't confuse people
> with the semantics of that field changing.
> 
>  (b) just emit a "synchronization printk" every once in a while, which
> is obviously also using the same standard time source, but the line
> actually _says_ what the other time sources are.

This was actually the original approach by Mark Salyzyn, see
https://lkml.kernel.org/r/20170720182505.9357-1-saly...@android.com


> Then it's easy to see what the printk time source is, in relation to
> any _number_ of other timesources. And if that synchronization printk
> is nicely formatted, it will even be something that people appreciate
> seeing in dmesg _irrespective_ of any actual synchronization issues.
> 
> And something that reads the journal could trivially pick up on the
> synchronization printk line, and then correct the local timesource to
> whatever internal journal timesource it wants to. And the important
> thing is that because you just give *all* timesources in the
> synchronization line, that choice isn't fixed by some random kernel
> configuration or setting.

One risk is that the messages might get lost. For example, they might be
filtered by loglevel or during a flood of messages on slow consoles.


> Instead, this seems to have a completely broken "pick one time source
> model at random" approach, so now different machines will have
> different models, and it will likely _break_ existing code that picks
> the timesource from the kernel dmesg, unless you just pick the local
> one.

AFAIK, the local clock is not synchronized between different machines,
or even between CPUs on the same machine. It was used in printk()
because it was lockless. Therefore it is kind of random itself.

You could do some post-synchronization using the printed
timestamps from other clocks. But it is not reliable (lost
messages) and somewhat inconvenient.

I am not super happy that userspace might need an update with
the approach in this pull request. But it seems to be rather
trivial. The timestamp (number) in the log can be converted into
a date+time as follows:

  + realtime: timestamp ~= number of micro sec. since 1.1.1970
  + other clocks: timestamp ~= number of micro sec. since boot
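Just to illustrate the realtime case, a trivial userspace helper could do
the conversion like this (only a sketch; the helper name is made up):

#include <stdio.h>
#include <time.h>

/* Convert a realtime timestamp (micro sec. since 1.1.1970) to date+time. */
static void print_realtime_stamp(unsigned long long usec)
{
	time_t sec = usec / 1000000ULL;
	struct tm tm;
	char buf[32];

	gmtime_r(&sec, &tm);
	strftime(buf, sizeof(buf), "%Y-%m-%d %H:%M:%S", &tm);
	printf("%s.%06llu UTC\n", buf, usec % 1000000ULL);
}

int main(void)
{
	print_realtime_stamp(1510650000123456ULL);
	return 0;
}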


> That seems like bad design, and really stupid.
>
> Am I missing something? Because as-is, this just seems like a horribly
> bad feature to me. I'm not pulling it without some very good arguments
> for this all.

I wonder if the current approach might be acceptable if we print
some suffix after real-time or any non-local_clock timestamps.
This would allow userspace to always handle this correctly.
IMHO, it would be more reliable and convenient than the
"synchronization printks".

Best Regards,
Petr


Re: [PATCH 1/3] printk: Introduce per-console loglevel setting

2017-11-03 Thread Petr Mladek
On Thu 2017-09-28 17:43:55, Calvin Owens wrote:
> This patch introduces a new per-console loglevel setting, and changes
> console_unlock() to use max(global_level, per_console_level) when
> deciding whether or not to emit a given log message.

> diff --git a/include/linux/console.h b/include/linux/console.h
> index b8920a0..a5b5d79 100644
> --- a/include/linux/console.h
> +++ b/include/linux/console.h
> @@ -147,6 +147,7 @@ struct console {
>   int cflag;
>   void*data;
>   struct   console *next;
> + int level;

I would make the meaning clearer and call this min_loglevel.

>  };
>  
>  /*
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index 512f7c2..3f1675e 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -1141,9 +1141,14 @@ module_param(ignore_loglevel, bool, S_IRUGO | S_IWUSR);
>  MODULE_PARM_DESC(ignore_loglevel,
>"ignore loglevel setting (prints all kernel messages to the 
> console)");
>  
> -static bool suppress_message_printing(int level)
> +static int effective_loglevel(struct console *con)
>  {
> - return (level >= console_loglevel && !ignore_loglevel);
> + return max(console_loglevel, con ? con->level : LOGLEVEL_EMERG);
> +}
> +
> +static bool suppress_message_printing(int level, struct console *con)
> +{
> + return (level >= effective_loglevel(con) && !ignore_loglevel);
>  }

We need to be more careful here:

First, there is one ugly level called CONSOLE_LOGLEVEL_SILENT. Fortunately,
it is used only by vkdb_printf(). I guess that the purpose is to store
messages into the log buffer without showing them on the consoles.

It is a hack and it is racy. It would hide the messages only when
console_lock() is not already taken. A similar hack is used in more
places, e.g. in __handle_sysrq(), and these are racy as well.
We need to come up with something better in the future but this
is a task for another patchset.


Second, these functions are called with NULL when we need to take
all usable consoles into account. You simplified it by ignoring
the per-console setting, but that is not correct. For example,
you might need to delay the printing in boot_delay_msec()
even on a fast console. This was also the reason to remove
one optimization in console_unlock().

I thought about a reasonable solution and came up with something like:

static bool suppress_message_printing(int level, struct console *con)
{
	int callable_loglevel;

	if (ignore_loglevel || console_loglevel == CONSOLE_LOGLEVEL_MOTORMOUTH)
		return false;

	/* Make silent even fast consoles. */
	if (console_loglevel == CONSOLE_LOGLEVEL_SILENT)
		return true;

	if (con)
		callable_loglevel = con->min_loglevel;
	else
		callable_loglevel = max_custom_console_loglevel;

	/* Global setting might make all consoles more verbose. */
	if (callable_loglevel < console_loglevel)
		callable_loglevel = console_loglevel;

	return level >= callable_loglevel;
}

Yes, it is complicated. But the logic is complicated. IMHO, this has
the advantage that we make most of the decisions in a single place
and it might be easier to get the whole picture.

Anyway, max_custom_console_loglevel would be a global variable
defined as:

/*
 * Minimum loglevel of the most talkative registered console.
 * It is a maximum of all registered con->min_loglevel values.
 */
static int max_custom_console_loglevel = LOGLEVEL_EMERG;

The value should get updated when any console is registered
and when a registered console is manipulated. It means in
register_console(), unregister_console(), and the sysfs
write callbacks.
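For completeness, the update could be a trivial helper called from those
places (only a sketch, assuming the min_loglevel member proposed above;
it would have to be called under console_lock()):

static void update_max_custom_console_loglevel(void)
{
	struct console *con;
	int max_level = LOGLEVEL_EMERG;

	/* Remember the most talkative registered console. */
	for_each_console(con) {
		if (con->min_loglevel > max_level)
			max_level = con->min_loglevel;
	}

	max_custom_console_loglevel = max_level;
}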


>  #ifdef CONFIG_BOOT_PRINTK_DELAY
> @@ -2199,22 +2205,11 @@ void console_unlock(void)
>   } else {
>   len = 0;
>   }
> -skip:
> +
>   if (console_seq == log_next_seq)
>   break;
>  
>   msg = log_from_idx(console_idx);
> - if (suppress_message_printing(msg->level)) {
> - /*
> -  * Skip record we have buffered and already printed
> -  * directly to the console when we received it, and
> -  * record that has level above the console loglevel.
> -  */
> - console_idx = log_next(console_idx);
> - console_seq++;
> - goto skip;
> - }

I would like to keep this code. It does not make sense to prepare the
text buffer if it won't be used at all. It would work with the change
that I proposed above.


>   len += msg_print_text(msg, false, text + len, sizeof(text) - 
> len);
>   if (nr_ext_console_drivers) {
>   ext_len = msg_print_ext_header(ext_text,
> @@ -2230,7 +2225,7 @@ void console_unlock(void)
>   raw_spin_unlock(&logbuf_lock);
>  
>

Re: [PATCH 2/3] printk: Add /sys/consoles/ interface

2017-11-03 Thread Petr Mladek
On Thu 2017-09-28 17:43:56, Calvin Owens wrote:
> This adds a new sysfs interface that contains a directory for each
> console registered on the system. Each directory contains a single
> "loglevel" file for reading and setting the per-console loglevel.
> 
> We can let kobject destruction race with console removal: if it does,
> loglevel_{show,store}() will safely fail with -ENODEV. This is a little
> weird, but avoids embedding the kobject and therefore needing to totally
> refactor the way we handle console struct lifetime.

It looks like a sane approach. It might be worth a comment in the code.


>  Documentation/ABI/testing/sysfs-consoles | 13 +
>  include/linux/console.h  |  1 +
>  kernel/printk/printk.c   | 88 
> 
>  3 files changed, 102 insertions(+)
>  create mode 100644 Documentation/ABI/testing/sysfs-consoles
> 
> diff --git a/Documentation/ABI/testing/sysfs-consoles 
> b/Documentation/ABI/testing/sysfs-consoles
> new file mode 100644
> index 000..6a1593e
> --- /dev/null
> +++ b/Documentation/ABI/testing/sysfs-consoles
> @@ -0,0 +1,13 @@
> +What:/sys/consoles/

I would rather add Greg to CC. I am not 100% sure that a top-level
directory is the right thing to do.

An alternative might be to hide this under /sys/kernel/consoles/.


> +Date:September 2017
> +KernelVersion:   4.15
> +Contact: Calvin Owens 
> +Description: The /sys/consoles tree contains a directory for each console
> + configured on the system. These directories contain the
> + following attributes:
> +
> + * "loglevel"Set the per-console loglevel: the kernel uses
> + max(system_loglevel, perconsole_loglevel) when
> + deciding whether to emit a given message. The
> + default is 0, which means max() always yields
> + the system setting in the kernel.printk sysctl.

I would call the attribute "min_loglevel". The name "loglevel" should
be reserved for the loglevel that is actually used, which also depends
on the global loglevel value.


> diff --git a/include/linux/console.h b/include/linux/console.h
> index a5b5d79..76840be 100644
> --- a/include/linux/console.h
> +++ b/include/linux/console.h
> @@ -148,6 +148,7 @@ struct console {
>   void*data;
>   struct   console *next;
>   int level;
> + struct kobject *kobj;
>  };
>  
>  /*
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index 3f1675e..488bda3 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -105,6 +105,8 @@ enum devkmsg_log_masks {
>  
>  static unsigned int __read_mostly devkmsg_log = DEVKMSG_LOG_MASK_DEFAULT;
>  
> +static struct kobject *consoles_dir_kobj;
>
>  static int __control_devkmsg(char *str)
>  {
>   if (!str)
> @@ -2371,6 +2373,82 @@ static int __init keep_bootcon_setup(char *str)
>  
>  early_param("keep_bootcon", keep_bootcon_setup);
>  
> +static ssize_t loglevel_show(struct kobject *kobj, struct kobj_attribute 
> *attr,
> +  char *buf)
> +{
> + struct console *con;
> + ssize_t ret = -ENODEV;
> +

This might deserve a comment. Something like:

/*
 * Find the related struct console in a safe way. The kobject
 * destruction is asynchronous.
 */
> + console_lock();
> + for_each_console(con) {
> + if (con->kobj == kobj) {
> + ret = sprintf(buf, "%d\n", con->level);
> + break;
> + }
> + }
> + console_unlock();
> +
> + return ret;
> +}
> +
> +static ssize_t loglevel_store(struct kobject *kobj, struct kobj_attribute 
> *attr,
> +   const char *buf, size_t count)
> +{
> + struct console *con;
> + ssize_t ret;
> + int tmp;

I would use a more meaningful name, e.g. new_level ;-)

> + ret = kstrtoint(buf, 10, &tmp);
> + if (ret < 0)
> + return ret;
> +
> + if (tmp < LOGLEVEL_EMERG)
> + return -ERANGE;
> +
> + /*
> +  * Mimic the behavior of /dev/kmsg with respect to minimum_loglevel
> +  */
> + if (tmp < minimum_console_loglevel)
> + tmp = minimum_console_loglevel;

Hmm, I would remove this "mimic" stuff. minimum_console_loglevel is currently
used to limit operations by the syslog system call. But root is still
able to modify the minimum_console_loglevel by writing into
/proc/sys/kernel/printk.

My plan is that the /sys/console interface would eventually replace the
crazy /proc/sys/kernel/printk one.

In any case, the default con->level value is zero. It would be
weird if people were not able to set this value.

> +
> + ret = -ENODEV;

I would repeat the same comment here:

/*
 * Find the related struct console in a safe way. The kobject
 * destruction is asynchronous.
 */

> + console_lock();

Re: [PATCH 3/3] printk: Add ability to set loglevel via "console=" cmdline

2017-11-03 Thread Petr Mladek
On Thu 2017-09-28 17:43:57, Calvin Owens wrote:
> This extends the "console=" interface to allow setting the per-console
> loglevel by adding "/N" to the string, where N is the desired loglevel
> expressed as a base 10 integer. Invalid values are silently ignored.
> 
> Cc: Petr Mladek 
> Cc: Steven Rostedt 
> Cc: Sergey Senozhatsky 
> Signed-off-by: Calvin Owens 
> ---
>  Documentation/admin-guide/kernel-parameters.txt |  6 ++---
>  kernel/printk/console_cmdline.h |  1 +
>  kernel/printk/printk.c  | 30 
> -
>  3 files changed, 28 insertions(+), 9 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt 
> b/Documentation/admin-guide/kernel-parameters.txt
> index 0549662..f22b992 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -607,10 +607,10 @@
>   ttyS[,options]
>   ttyUSB0[,options]
>   Use the specified serial port.  The options are of
> - the form "pnf", where "" is the baud rate,
> + the form "pnf/l", where "" is the baud rate,
>   "p" is parity ("n", "o", or "e"), "n" is number of
> - bits, and "f" is flow control ("r" for RTS or
> - omit it).  Default is "9600n8".
> + bits, "f" is flow control ("r" for RTS or omit it),
> + and "l" is the loglevel on [0,7]. Default is "9600n8".
>  

If I get this correctly, the patch allows defining the loglevel for any
console. I think that we need to describe it in a generic
way. Something like:

	console=	[KNL] Output console device and options.

			Format: name[,options][/min_loglevel]

			Where "name" is the console name, "options"
			are console-specific options, and "min_loglevel"
			allows increasing the loglevel for a particular
			console over the global one.

		tty<n>	Use the virtual console device <n>.

I would also add a cross-reference in the loglevel= section, noting
that the global loglevel might be overridden by a higher
console-specific min_loglevel value.
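Just to illustrate, splitting the suffix off the console= string could
look something like this (only a sketch, not taken from the patch):

static int parse_console_min_loglevel(char *str, int *min_loglevel)
{
	char *slash = strchr(str, '/');

	*min_loglevel = 0;	/* default: only the global loglevel applies */
	if (!slash)
		return 0;

	*slash = '\0';		/* cut off the "/N" part from the options */
	return kstrtoint(slash + 1, 10, min_loglevel);
}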


>   See Documentation/admin-guide/serial-console.rst for 
> more
>   information.  See
> diff --git a/kernel/printk/console_cmdline.h b/kernel/printk/console_cmdline.h
> index 2ca4a8b..269e666 100644
> --- a/kernel/printk/console_cmdline.h
> +++ b/kernel/printk/console_cmdline.h
> @@ -5,6 +5,7 @@ struct console_cmdline
>  {
>   charname[16];   /* Name of the driver   */
>   int index;  /* Minor dev. to use*/
> + int loglevel;   /* Loglevel to use */

Again, I would use "min_loglevel".


>   char*options;   /* Options for the driver   */
>  #ifdef CONFIG_A11Y_BRAILLE_CONSOLE
>   char*brl_options;   /* Options for braille driver */
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index 488bda3..4c14cf2 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -2541,6 +2552,12 @@ void register_console(struct console *newcon)
>   if (newcon->index < 0)
>   newcon->index = c->index;
>  
> + /*
> +  * Carry over the loglevel from the cmdline
> +  */
> + newcon->level = c->loglevel;
> + extant = true;

I would personally do the following:

if (!newcon->min_loglevel)
	newcon->min_loglevel = c->min_loglevel;

It is similar to newcon->index handling above. It will use the
command line setting only when the console is registered for the first
time.

All this is based on my assumption that all non-initialized struct
console members are zero. At least I do not see any location where
some other members would be explicitly zeroed. And there might
be candidates, e.g. data, match(), next.

In any case, I do not know what the name "extant" stands for
and I feel confused ;-)


>   if (_braille_register_console(newcon, c))
>   return;
>  
> @@ -2572,8 +2589,9 @@ void register_console(struct console *newcon)
>   /*
>* By default, the per-co

Re: [PATCH 2/3] printk: Add /sys/consoles/ interface

2017-11-03 Thread Petr Mladek
On Fri 2017-11-03 15:32:34, Kroah-Hartman wrote:
> > > diff --git a/Documentation/ABI/testing/sysfs-consoles 
> > > b/Documentation/ABI/testing/sysfs-consoles
> > > new file mode 100644
> > > index 000..6a1593e
> > > --- /dev/null
> > > +++ b/Documentation/ABI/testing/sysfs-consoles
> > > @@ -0,0 +1,13 @@
> > > +What:/sys/consoles/
> 
> Eeek, what!
> 
> > I rather add Greg in CC. I am not 100% sure that the top level
> > directory is the right thing to do.
> 
> Neither do I.
> 
> > Alternative might be to hide this under /sys/kernel/consoles/.
> 
> No no no.
> 
> > > diff --git a/include/linux/console.h b/include/linux/console.h
> > > index a5b5d79..76840be 100644
> > > --- a/include/linux/console.h
> > > +++ b/include/linux/console.h
> > > @@ -148,6 +148,7 @@ struct console {
> > >   void*data;
> > >   struct   console *next;
> > >   int level;
> > > + struct kobject *kobj;
> 
> Why are you using "raw" kobjects and not a "real" struct device?  This
> is a device, use that interface instead please.

Hmm, struct console has a member

struct tty_driver *(*device)(struct console *, int *);

but it is set only when the console has a tty binding.

> If you need a console 'bus' to place them on, fine, but the virtual bus
> is probably best and simpler to use.
> 
> That is if you _really_ feel you need sysfs interaction with the console
> layer (hint, I am not yet convinced...)

The purpose of this patch is to make a user-friendly interface
for setting a console-specific loglevel (message filtering).

It currently uses kobjects to create a simple directory
structure under /sys/. It is inspired by the /sys/power/, /sys/kernel/mm/,
and /sys/kernel/debug/tracing/ stuff.

There are ideas to add more files that would allow modifying even the
global setting. This is currently possible via the four numbers in
/proc/sys/kernel/printk. Nobody knows what the four numbers mean.
IMHO, the following interface would be easier to work with:

   /sys/console/loglevel
   /sys/console/min_loglevel
   /sys/console/default_loglevel

> > /*
> >  * Find the related struct console a safe way. The kobject
> >  * desctruction is asynchronous.
> >  */
> > > + console_lock();
> > > + for_each_console(con) {
> > > + if (con->kobj == kobj) {
> 
> You are doing something wrong, go from kobj to your console directly,
> the fact that you can not do that here is a _huge_ hint that your
> structure is not correct.
> 
> Hint, it's not correct at all :)

I know that we are not following the original purpose of sysfs.
But there are more (mis)users.

> Please cc: me on stuff like this if you want a review, and as you are
> adding a random new sysfs root directory, please always cc: me on that
> so I can talk you out of it...

Good to know.

Best Regards,
Petr


Re: [PATCH 0001/0001] format idle IP output func+offset/length

2017-11-06 Thread Petr Mladek
On Mon 2017-11-06 11:19:31, Liu, Changcheng wrote:
> kaslr feature is enabled in kernel.
> Remove kernel text address when dumping idle IP info

> diff --git a/lib/nmi_backtrace.c b/lib/nmi_backtrace.c
> index 0bc0a35..9cc4178 100644
> --- a/lib/nmi_backtrace.c
> +++ b/lib/nmi_backtrace.c
> @@ -92,7 +92,7 @@ bool nmi_cpu_backtrace(struct pt_regs *regs)
> if (cpumask_test_cpu(cpu, to_cpumask(backtrace_mask))) {
> arch_spin_lock(&lock);
> if (regs && cpu_in_idle(instruction_pointer(regs))) {
> -   pr_warn("NMI backtrace for cpu %d skipped: idling at pc %#lx\n",
> +   pr_warn("NMI backtrace for cpu %d skipped: idling at pc %pS\n",

Great catch!

Reviewed-by: Petr Mladek 

Best Regards,
Petr


Re: [PATCH 0001/0001] format idle IP output func+offset/length

2017-11-06 Thread Petr Mladek
On Mon 2017-11-06 18:52:03, Liu, Changcheng wrote:
> kaslr feature is enabled in kernel.
> Remove kernel text address when dumping idle IP info
> 
> Signed-off-by: Liu Changcheng 
> Signed-off-by: Jerry Liu 
> 
> diff --git a/lib/nmi_backtrace.c b/lib/nmi_backtrace.c
> index 0bc0a35..9cc4178 100644
> --- a/lib/nmi_backtrace.c
> +++ b/lib/nmi_backtrace.c
> @@ -92,7 +92,7 @@ bool nmi_cpu_backtrace(struct pt_regs *regs)
>   if (cpumask_test_cpu(cpu, to_cpumask(backtrace_mask))) {
>   arch_spin_lock(&lock);
>   if (regs && cpu_in_idle(instruction_pointer(regs))) {
> - pr_warn("NMI backtrace for cpu %d skipped: idling at pc 
> %#lx\n",
> + pr_warn("NMI backtrace for cpu %d skipped: idling at 
> %pS\n",

Yup, removing "pc" makes sense as well.

Reviewed-by: Petr Mladek 

Best Regards,
Petr


Re: [PATCH v3 2/2] livepatch: add atomic replace

2017-10-18 Thread Petr Mladek
On Wed 2017-10-18 11:10:09, Miroslav Benes wrote:
> On Tue, 17 Oct 2017, Jason Baron wrote:
> > If the atomic replace patch does
> > not contain any immediates, then we can drop the reference on the
> > immediately preceding patch only. That is because there may have been
> > previous transitions to immediate functions in the func stack, and the
> > transition to the atomic replace patch only checks immediately preceding
> > transition. It would be possible to check all of the previous immediate
> > function transitions, but this adds complexity and seems like not a
> > common pattern. So I would suggest that we just drop the reference on
> > the previous patch if the atomic replace patch does not contain any
> > immediate functions.
> 
> It is even more complicated and it is not connected only to atomic replace 
> patch (I realized this while reading the first part of your email and 
> then you confirmed it with this paragraph). The consistency model is 
> broken with respect to immediate patches.
> 
> func  a
> patches   1i
>   2i
>   3
> 
> Now, when you're applying 3, only 2i function is checked. But there might 
> be a task sleeping in 1i. Such task would be migrated to 3, because we do 
> not check 1 in klp_check_stack_func() at all.
> 
> I see three solutions.
> 
> 1. Say it is an user's fault. Since it is not obvious and it is 
> easy-to-make mistake, I would not go this way.
> 
> 2. We can fix klp_check_stack_func() in an exact way you're proposing. 
> We'd go back in func stack as long as there are immediate patches there. 
> This adds complexity and I'm not sure if all the problems would be solved 
> because scenarios how patches are stacked and applied to different 
> functions may be quite complex.
> 
> 3. Drop immediate. It causes problems only and its advantages on x86_64 
> are theoretical. You would still need to solve the interaction with atomic 
> replace on other architecture with immediate preserved, but that may be 
> easier. Or we can be aggressive and drop immediate completely. The force 
> transition I proposed earlier could achieve the same.

To make it clear: we currently rely on the immediate handling on
architectures without reliable stack checking. The question
is whether anyone uses it for another purpose in practice.

A solution would be to remove the per-func immediate flag
and invert the logic of the per-patch one. We could rename
it to something like "consistency_required" or "semantic_changes".
A patch with this flag set then might be refused on systems
without reliable stacks. Otherwise, the consistency model
would be used for all patches.

As a result, all patches would be handled either using
the consistency model or immediately. We would need to
care about any mix of these.

Best Regards,
Petr


Re: [PATCH] printk: simplify no_printk()

2017-10-19 Thread Petr Mladek
On Thu 2017-10-19 19:48:15, Masahiro Yamada wrote:
> Hi Petr,
> 
> 2017-10-02 23:56 GMT+09:00 Petr Mladek :
> > On Mon 2017-09-18 00:01:44, Masahiro Yamada wrote:
> >> Commit 069f0cd00df0 ("printk: Make the printk*once() variants return
> >> a value") surrounded the macro implementation with ({ ... }).
> >>
> >> Now, the inner do { ... } while (0); is redundant.
> >>
> >> Signed-off-by: Masahiro Yamada 
> >
> > Looks fine to me. The return value is slightly more visible now ;-)
> >
> > Reviewed-by: Petr Mladek 
> >
> > JFYI, I have pushed it into for-4.15 branch.
> >
> > Best Regards,
> > Petr
> 
> 
> A minor problem.
> 
> I think "Reviewed-by: " is missing before Sergey.

Great catch! Thanks for double checking!

> I am not sure if this is too late, or not...

Just pushed a fix.

Best Regards,
Petr


Re: [RFC] scripts: add leaking_addresses.pl

2017-10-19 Thread Petr Mladek
On Thu 2017-10-19 17:34:44, Tobin C. Harding wrote:
> diff --git a/scripts/leaking_addresses.pl b/scripts/leaking_addresses.pl
> new file mode 100755
> index ..940547b716e3
> --- /dev/null
> +++ b/scripts/leaking_addresses.pl
> @@ -0,0 +1,139 @@
> +#!/usr/bin/env perl
> +#
> +# leaking_addresses.pl scan kernel for potential leaking addresses.
> +
> +use warnings;
> +use strict;
> +use File::Basename;
> +use feature 'say';

It seems that the 'say' feature is not used in the end.

> +my $DEBUG = 0;
> +my @dirs = ('/proc', '/sys');
> +
> +parse_dmesg();
> +
> +foreach(@dirs)
> +{
> +walk($_);
> +}
> +
> +exit 0;
> +
> +#
> +# TODO
> +#
> +# - Add support for 32 bit architectures.
> +#
> +sub may_leak_address
> +{
> +my $line = $_[0];
> +my $regex = '[a-fA-F0-9]{12}';
> +my $mask = '';
> +
> +if ($line =~ /$mask/) {
> +return

I would personally return 0; instead of nothing.
Well, I am used to reading C and not perl ;-)

Also I wonder if we really need to define the pattern
as a variable. It might be better to use it directly
in the regex and put a comment above, e.g.

# Ignore addresses that say nothing
if ($line =~ // or
$line =~ //) {
return 0;


> +}
> +
> +if ($line =~ /$regex/) {
> +return 1;
> +}
> +return;
> +}
> +
> +sub parse_dmesg
> +{
> +my $line;
> +open my $cmd, '-|', 'dmesg';
> +while ($line = <$cmd>) {
> +if (may_leak_address($line)) {
> +print 'dmesg: ' . $line;
> +}
> +}
> +close $cmd;
> +}
> +
> +# We should skip these files
> +sub skip_file
> +{
> +my $path = $_[0];
> +
> +my @skip_paths = ('/proc/kmsg', '/proc/kcore', '/proc/kallsyms',
> +  '/proc/fs/ext4/sdb1/mb_groups', 
> '/sys/kernel/debug/tracing/trace_pipe',
> +  '/sys/kernel/security/apparmor/revision');

I would suggest putting each directory on a separate line.
It is easier to review and patch.

> +my @skip_files = ('pagemap', 'events', 'access','registers', 
> 'snapshot_raw',
> +  'trace_pipe_raw', 'trace_pipe');

Same here.

> +
> +foreach(@skip_paths) {
> +if ($_ eq $_[0]) {
> +return 1;
> +}
> +}
> +
> +my($filename, $dirs, $suffix) = fileparse($path);
> +
> +foreach(@skip_files) {
> +if ($_ eq $filename) {
> +return 1;
> +}
> +}
> +
> +return;
> +}
> +
> +sub parse_file
> +{
> +my $file = $_[0];
> +
> +if (! -R $file) {
> +return;
> +}
> +
> +if (skip_file($file)) {
> +if ($DEBUG == 1) {
> +print "skipping file: $file\n";
> +}
> +return;
> +}
> +if ($DEBUG == 1) {
> +print "parsing $file\n";
> +}
> +
> +open my $fh, $file or return;
> +
> +while( my $line = <$fh>)  {
> +if (may_leak_address($line)) {
> +print $file . ': ' . $line;
> +}
> +}
> +
> +close $fh;
> +}
> +
> +# Recursively walk directory tree
> +sub walk
> +{
> +my @dirs = ($_[0]);
> +my %seen;
> +
> +while (my $pwd = shift @dirs) {
> +if (!opendir(DIR,"$pwd")) {
> +print STDERR "Cannot open $pwd\n";  

I would print the error only when $DEBUG == 1.
If a directory cannot be opened, it does not leak anything.
The same holds for files that cannot be opened.

IMHO, it would make sense to show only real problems.
Otherwise people would have trouble interpreting the output.

> +next;
> +} 
> +my @files = readdir(DIR);
> +closedir(DIR);
> +foreach my $file (@files) {
> +next if ($file eq '.' or $file eq '..');
> +
> +my $path = "$pwd/$file";
> +next if (-l $path);
> +
> +if (-d $path and !$seen{$path}) {
> +$seen{$path} = 1;

How is it possible to see a path twice, please?

> +push @dirs, "$path";
> +} else {
> +parse_file("$path");
> +}
> +}
> +}
> +}

Best Regards,
Petr


Re: [PATCH] printk: fix typo in printk_safe.c

2017-10-30 Thread Petr Mladek
On Sun 2017-10-22 22:30:55, Baoquan He wrote:
> Signed-off-by: Baoquan He 

Reviewed-by: Petr Mladek 

I have pushed it into for-4.15 branch in printk.git.

Thanks for the fix.

Best Regards,
Petr


Re: [RFC v2 07/18] kthread: Allow to cancel kthread work

2015-10-05 Thread Petr Mladek
On Mon 2015-10-05 12:07:58, Petr Mladek wrote:
> On Fri 2015-10-02 15:24:53, Tejun Heo wrote:
> > Hello,
> > 
> > On Fri, Oct 02, 2015 at 05:43:36PM +0200, Petr Mladek wrote:
> > > IMHO, we need both locks. The worker manipulates more works and
> > > need its own lock. We need work-specific lock because the work
> > > might be assigned to different workers and we need to be sure
> > > that the operations are really serialized, e.g. queuing.
> > 
> > I don't think we need per-work lock.  Do we have such usage in kernel
> > at all?  If you're worried, let the first queueing record the worker
> > and trigger warning if someone tries to queue it anywhere else.  This
> > doesn't need to be full-on general like workqueue.  Let's make
> > reasonable trade-offs where possible.
> 
> I actually thought about this simplification as well. But then I am
> in doubts about the API. It would make sense to assign the worker
> when the work is being initialized and avoid the duplicate information
> when the work is being queued:
> 
>   init_kthread_work(work, fn, worker);
>   queue_work(work);
> 
> Or would you prefer to keep the API similar to workqueues even when
> it makes less sense here?
> 
> 
> In each case, we need a way to switch the worker if the old one
> is destroyed and a new one is started later. We would need
> something like:
> 
>   reset_work(work, worker)
> or
>   reinit_work(work, fn, worker)

I was too fast. We could set "work->worker = NULL" when the work
finishes and it is not pending. It means that it will be connected
to the particular worker only when used. Then we could keep the
workqueue-like API and would not need reset_work().

I am going to play with this. I feel that it might work.

Best Regards,
Petr


[PATCH] bcache: Really show state of work pending bit

2015-10-05 Thread Petr Mladek
WORK_STRUCT_PENDING is a mask for testing the pending bit.
test_bit() expects the number of the bit and we need to
use WORK_STRUCT_PENDING_BIT there.

Also work_data_bits() is defined in workqueue.h now.

I have noticed this just by chance when looking at how
WORK_STRUCT_PENDING_BIT is used. The change is compile
tested.

Signed-off-by: Petr Mladek 
---
 drivers/md/bcache/closure.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/md/bcache/closure.c b/drivers/md/bcache/closure.c
index 7a228de95fd7..9eaf1d6e8302 100644
--- a/drivers/md/bcache/closure.c
+++ b/drivers/md/bcache/closure.c
@@ -167,8 +167,6 @@ EXPORT_SYMBOL(closure_debug_destroy);
 
 static struct dentry *debug;
 
-#define work_data_bits(work) ((unsigned long *)(&(work)->data))
-
 static int debug_seq_show(struct seq_file *f, void *data)
 {
struct closure *cl;
@@ -182,7 +180,7 @@ static int debug_seq_show(struct seq_file *f, void *data)
   r & CLOSURE_REMAINING_MASK);
 
seq_printf(f, "%s%s%s%s\n",
-  test_bit(WORK_STRUCT_PENDING,
+  test_bit(WORK_STRUCT_PENDING_BIT,
work_data_bits(&cl->work)) ? "Q" : "",
   r & CLOSURE_RUNNING  ? "R" : "",
   r & CLOSURE_STACK? "S" : "",
-- 
1.8.5.6



Re: [RFC v2 07/18] kthread: Allow to cancel kthread work

2015-10-07 Thread Petr Mladek
On Mon 2015-10-05 13:09:24, Petr Mladek wrote:
> On Mon 2015-10-05 12:07:58, Petr Mladek wrote:
> > On Fri 2015-10-02 15:24:53, Tejun Heo wrote:
> > > Hello,
> > > 
> > > On Fri, Oct 02, 2015 at 05:43:36PM +0200, Petr Mladek wrote:
> > > > IMHO, we need both locks. The worker manipulates more works and
> > > > need its own lock. We need work-specific lock because the work
> > > > might be assigned to different workers and we need to be sure
> > > > that the operations are really serialized, e.g. queuing.
> > > 
> > > I don't think we need per-work lock.  Do we have such usage in kernel
> > > at all?  If you're worried, let the first queueing record the worker
> > > and trigger warning if someone tries to queue it anywhere else.  This
> > > doesn't need to be full-on general like workqueue.  Let's make
> > > reasonable trade-offs where possible.
> > 
> > I actually thought about this simplification as well. But then I am
> > in doubts about the API. It would make sense to assign the worker
> > when the work is being initialized and avoid the duplicate information
> > when the work is being queued:
> > 
> > init_kthread_work(work, fn, worker);
> > queue_work(work);
> > 
> > Or would you prefer to keep the API similar to workqueues even when
> > it makes less sense here?
> > 
> > 
> > In each case, we need a way to switch the worker if the old one
> > is destroyed and a new one is started later. We would need
> > something like:
> > 
> > reset_work(work, worker)
> > or
> > reinit_work(work, fn, worker)
> 
> I was too fast. We could set "work->worker = NULL" when the work
> finishes and it is not pending. It means that it will be connected
> to the particular worker only when used. Then we could keep the
> workqueues-like API and do not need reset_work().

I have played with this idea and the result is not satisfactory.
I am not able to make the code simpler using a single lock.

First, the worker lock is not enough to safely queue the work
without a test_and_set() atomic operation. Let me show this with
pseudo code:

bool queue_kthread_work(worker, work)
{
	bool ret = false;

	lock(&worker->lock);

	if (test_bit(WORK_PENDING, work->flags))
		goto out;

	if (WARN(work->worker != worker,
		 "Work could not be used by two workers at the same time\n"))
		goto out;

	set_bit(WORK_PENDING, work->flags);
	work->worker = worker;
	insert_work(worker->work_list, work);
	ret = true;

out:
	unlock(&worker->lock);
	return ret;
}

Now, let's have one work: W, two workers: A, B, and try to queue
the same work to the two workers at the same time:

CPU0                                    CPU1

queue_kthread_work(A, W);               queue_kthread_work(B, W);
  lock(&A->lock);                         lock(&B->lock);
  test_bit(WORK_PENDING, W->flags)        test_bit(WORK_PENDING, W->flags)
    # false                                 # false
  WARN(W->worker != A);                   WARN(W->worker != B);
    # false                                 # false

  set_bit(WORK_PENDING, W->flags);        set_bit(WORK_PENDING, W->flags);
  W->worker = A;                          W->worker = B;
  insert_work(A->work_list, W);           insert_work(B->work_list, W);

  unlock(&A->lock);                       unlock(&B->lock);

=> It is possible and the result is unclear.

We would need to set either the WORK_PENDING flag or work->worker
using a test_and_set() atomic operation and bail out if it fails.
But then we are back at the original code.
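Just to illustrate, the atomic variant would be something like this
(still pseudo code):

	/* Atomically mark the work pending; bail out if it already was. */
	if (test_and_set_bit(WORK_PENDING, work->flags))
		goto out;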


Second, we still need the busy waiting for a pending timer callback.
Yes, we could set some flag so that the callback does not queue
the work. But cancel_kthread_work_sync() still has to wait.
It cannot return while there is still some pending operation
on the struct kthread_work. Otherwise, it could never
be freed in a safe way.

Also note that we still need the WORK_PENDING flag. Otherwise, we
would not be able to detect the race when the timer has been removed
but the callback has not run yet.


Let me repeat that using both a per-work and a per-worker lock is not
an option either. We would need some crazy hacks to avoid ABBA deadlocks.


All in all, I would prefer to keep the original approach that is
heavily inspired by the workqueues. I think that it is actually
an advantage to reuse a working concept rather than reinvent the wheel.


Best Regards,
Petr


Re: [RFC v2 07/18] kthread: Allow to cancel kthread work

2015-10-14 Thread Petr Mladek
On Wed 2015-10-07 07:24:46, Tejun Heo wrote:
>  At each turn, you come up with non-issues and declare that it needs to
> be full workqueue-like implementation but the issues you're raising
> seem all rather irrelevant.  Can you please try to take a step back
> and put some distance from the implementation details of workqueue?

JFYI, I took a step back and am trying to convert more kthreads to
the kthread worker API. It helps me to get a better insight into
the problem.

I am still not sure where you see the difference between
workqueues and the kthread worker API. My view is that
the main differences are:

    Workqueues                          Kthread worker

    + pool of kthreads                  + dedicated kthread

    + kthreads created and              + kthread created and
      destroyed on demand                 destroyed with the worker

    + can process more works            + one work is processed at a time
      in parallel                         from one queue

Otherwise, a similar basic set of operations would be useful:

  + create_worker
  + queue_work, queue_delayed_work
  + mod_delayed_work
  + cancel_work, cancel_delayed_work
  + flush_work
  + flush_worker
  + drain_worker
  + destroy_worker

where the queue, mod, and cancel operations should also work from IRQ
context.
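As a reference point, the subset of this API that already exists is used
along these lines (a minimal sketch; the create/destroy, delayed, and
cancel variants listed above are still only the proposal):

#include <linux/err.h>
#include <linux/kthread.h>
#include <linux/printk.h>

static struct kthread_worker demo_worker;
static struct kthread_work demo_work;

static void demo_work_fn(struct kthread_work *work)
{
	pr_info("processed in a dedicated kthread\n");
}

static int demo_start(void)
{
	struct task_struct *task;

	init_kthread_worker(&demo_worker);
	task = kthread_run(kthread_worker_fn, &demo_worker, "demo_worker");
	if (IS_ERR(task))
		return PTR_ERR(task);

	init_kthread_work(&demo_work, demo_work_fn);
	queue_kthread_work(&demo_worker, &demo_work);
	flush_kthread_worker(&demo_worker);

	kthread_stop(task);
	return 0;
}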

There are a few potentially complicated and sensitive users of the
kthread worker API, e.g. handling NFS callbacks, some kthreads
used for handling network packets, and possibly the RCU stuff.
Here the operations need to be safe and rather fast.

IMHO, it would be great if it were easy to convert between the
kthread worker and workqueue APIs. It would allow choosing
the most effective variant for a given purpose. IMHO, this is
sometimes hard to tell without real-life testing.

I wonder if I am missing some important angle of view.


In any case, it is still not clear if the API will be acceptable
to the affected parties. Therefore I do not want to spend too
much time on perfecting the API implementation at this
point. Is that OK, please?

Thanks for feedback.

Best Regards,
Petr


PS: I am not convinced that all my concerns were non-issues.
For example, I agree that a race when queuing the same work
to more kthread workers might look theoretical. On the other
hand, the API allows it and it might be hard to debug. IMHO,
it might be an acceptable trade-off if the implementation is
much easier and more robust in other areas. But my draft
implementation did not suggest this.

For example, there were more situations where I needed to double
check that the work was still connected with the locked worker
after taking the lock. I know that it will not happen when
the API is used in a reasonable way but...

Ah, I am back in the details. I have to stop it for now ;-)


[PATCH v3 0/2] ring_buffer: Make the benchmark slightly more safe

2015-09-07 Thread Petr Mladek
These two patches fix potential races in the ring buffer benchmark.
The first two versions were reviewed as part of the patchset that
tried to convert some kthreads into the kthread worker API, see
http://thread.gmane.org/gmane.linux.kernel.api/13224/focus=13821

Changes in v3:

+ fixed several comments (suggested by Steven)
+ removed duplicate memory barrier (suggested by Steven)


Changes in v2:

+ keep the extra initialization; fix a race instead
+ move the setting of the current state (suggested by Steven)

Petr Mladek (2):
  ring_buffer: Do no not complete benchmark reader too early
  ring_buffer: Fix more races when terminating the producer in the
benchmark

 kernel/trace/ring_buffer_benchmark.c | 77 
 1 file changed, 44 insertions(+), 33 deletions(-)

-- 
1.8.5.6



[PATCH v3 1/2] ring_buffer: Do no not complete benchmark reader too early

2015-09-07 Thread Petr Mladek
It seems that complete(&read_done) might be called too early
in some situations.

1st scenario:
-------------

CPU0                                    CPU1

ring_buffer_producer_thread()
  wake_up_process(consumer);
  wait_for_completion(&read_start);

                                        ring_buffer_consumer_thread()
                                          complete(&read_start);

  ring_buffer_producer()
    # producing data in
    # the do-while cycle

                                          ring_buffer_consumer();
                                            # reading data
                                            # got error
                                            # set kill_test = 1;
                                            set_current_state(
                                              TASK_INTERRUPTIBLE);
                                            if (reader_finish)  # false
                                            schedule();

    # producer still in the middle of
    # do-while cycle
    if (consumer && !(cnt % wakeup_interval))
      wake_up_process(consumer);

                                            # spurious wakeup
                                            while (!reader_finish &&
                                                   !kill_test)
                                              # leaving because
                                              # kill_test == 1
                                              reader_finish = 0;
                                              complete(&read_done);

1st BANG: We might access uninitialized "read_done" if this is
          the first round.

    # producer finally leaving
    # the do-while cycle because kill_test == 1;

    if (consumer) {
      reader_finish = 1;
      wake_up_process(consumer);
      wait_for_completion(&read_done);

2nd BANG: This will never complete because consumer already did
          the completion.

2nd scenario:
-------------

CPU0                                    CPU1

ring_buffer_producer_thread()
  wake_up_process(consumer);
  wait_for_completion(&read_start);

                                        ring_buffer_consumer_thread()
                                          complete(&read_start);

  ring_buffer_producer()
    # CPU3 removes the module   <--- difference from
    # and stops producer        <--- the 1st scenario
    if (kthread_should_stop())
      kill_test = 1;

                                          ring_buffer_consumer();
                                            while (!reader_finish &&
                                                   !kill_test)
                                            # kill_test == 1 => we never go
                                            # into the top level while()
                                            reader_finish = 0;
                                            complete(&read_done);

    # producer still in the middle of
    # do-while cycle
    if (consumer && !(cnt % wakeup_interval))
      wake_up_process(consumer);

                                            # spurious wakeup
                                            while (!reader_finish &&
                                                   !kill_test)
                                            # leaving because kill_test == 1
                                            reader_finish = 0;
                                            complete(&read_done);

BANG: We are in the same "bang" situations as in the 1st scenario.

Root of the problem:
--------------------

ring_buffer_consumer() must complete "read_done" only when "reader_finish"
variable is set. It must not be skipped due to other conditions.

Note that we still must keep the check for "reader_finish" in a loop
because there might be spurious wakeups as described in the
above scenarios.

Solution:
---------

The top level cycle in ring_buffer_consumer() will finish only when
"reader_finish" is set. The data will be read in "while-do" cycle
so that they are not read after an error (kill_test == 1)
or a spurious wake up.

In addition, "reader_finish" is manipulated by the producer thread.
Therefore we add READ_ONCE() to make sure that the fresh value is
read in each cycle. Also we add the corresponding barrier
to synchronize the sleep check.

Next we set the state back to TASK_RUNNING for the situation where we
did not sleep.

Just from paranoid reasons, we initialize both completions statically.
This is safer, in case there are other races that we are unaware of.

As a side effect we could remove the memory barrier from
ring_buffer_producer_thread(). IMHO, this was the reason for
the barrier. ring_buffer_reset() uses spin locks that should
provide the needed memory barrier for using the buffer.

Signed-off-by: Petr Mladek 
---
 kernel/trace/ring_b

[PATCH v3 2/2] ring_buffer: Fix more races when terminating the producer in the benchmark

2015-09-07 Thread Petr Mladek
The commit b44754d8262d3aab8 ("ring_buffer: Allow to exit the ring
buffer benchmark immediately") added a hack into ring_buffer_producer()
that set @kill_test when kthread_should_stop() returned true. It improved
the situation a lot. It stopped the kthread in most cases because
the producer spent most of the time in the patched while cycle.

But there are still few possible races when kthread_should_stop()
is set outside of the cycle. Then we do not set @kill_test and
some other checks pass.

This patch adds a better fix. It renames @test_kill/TEST_KILL() into
a better descriptive @test_error/TEST_ERROR(). Also it introduces
break_test() function that checks for both @test_error and
kthread_should_stop().

The new function is used in the producer when the check for @test_error
is not enough. It is not used in the consumer because its state
is manipulated by the producer via the "reader_finish" variable.

Also we add a missing check into ring_buffer_producer_thread()
between setting TASK_INTERRUPTIBLE and calling schedule_timeout().
Otherwise, we might miss a wakeup from kthread_stop().

Signed-off-by: Petr Mladek 
---
 kernel/trace/ring_buffer_benchmark.c | 54 +++-
 1 file changed, 29 insertions(+), 25 deletions(-)

diff --git a/kernel/trace/ring_buffer_benchmark.c 
b/kernel/trace/ring_buffer_benchmark.c
index 9ea7949366b3..9e00fd178226 100644
--- a/kernel/trace/ring_buffer_benchmark.c
+++ b/kernel/trace/ring_buffer_benchmark.c
@@ -60,12 +60,12 @@ MODULE_PARM_DESC(consumer_fifo, "fifo prio for consumer");
 
 static int read_events;
 
-static int kill_test;
+static int test_error;
 
-#define KILL_TEST()\
+#define TEST_ERROR()   \
do {\
-   if (!kill_test) {   \
-   kill_test = 1;  \
+   if (!test_error) {  \
+   test_error = 1; \
WARN_ON(1); \
}   \
} while (0)
@@ -75,6 +75,11 @@ enum event_status {
EVENT_DROPPED,
 };
 
+static bool break_test(void)
+{
+   return test_error || kthread_should_stop();
+}
+
 static enum event_status read_event(int cpu)
 {
struct ring_buffer_event *event;
@@ -87,7 +92,7 @@ static enum event_status read_event(int cpu)
 
entry = ring_buffer_event_data(event);
if (*entry != cpu) {
-   KILL_TEST();
+   TEST_ERROR();
return EVENT_DROPPED;
}
 
@@ -115,10 +120,10 @@ static enum event_status read_page(int cpu)
rpage = bpage;
/* The commit may have missed event flags set, clear them */
commit = local_read(&rpage->commit) & 0xf;
-   for (i = 0; i < commit && !kill_test; i += inc) {
+   for (i = 0; i < commit && !test_error ; i += inc) {
 
if (i >= (PAGE_SIZE - offsetof(struct rb_page, data))) {
-   KILL_TEST();
+   TEST_ERROR();
break;
}
 
@@ -128,7 +133,7 @@ static enum event_status read_page(int cpu)
case RINGBUF_TYPE_PADDING:
/* failed writes may be discarded events */
if (!event->time_delta)
-   KILL_TEST();
+   TEST_ERROR();
inc = event->array[0] + 4;
break;
case RINGBUF_TYPE_TIME_EXTEND:
@@ -137,12 +142,12 @@ static enum event_status read_page(int cpu)
case 0:
entry = ring_buffer_event_data(event);
if (*entry != cpu) {
-   KILL_TEST();
+   TEST_ERROR();
break;
}
read++;
if (!event->array[0]) {
-   KILL_TEST();
+   TEST_ERROR();
break;
}
inc = event->array[0] + 4;
@@ -150,17 +155,17 @@ static enum event_status read_page(int cpu)
default:
entry = ring_buffer_event_data(event);
if (*entry != cpu) {
-   KILL_TEST();
+   TEST_ERROR();
break;
}
  

Re: [PATCH 1/2] rcu: Show the real fqs_state

2015-09-07 Thread Petr Mladek
On Fri 2015-09-04 16:24:22, Paul E. McKenney wrote:
> On Fri, Sep 04, 2015 at 02:11:29PM +0200, Petr Mladek wrote:
> > The value of "fqs_state" in struct rcu_state is always RCU_GP_IDLE.
> > 
> > The real state is stored in a local variable in rcu_gp_kthread().
> > It is modified by rcu_gp_fqs() via parameter and return value.
> > But the actual value is never stored to rsp->fqs_state.
> > 
> > The result is that print_one_rcu_state() does not show the real
> > state.
> > 
> > This code has been added 3 years ago by the commit 4cdfc175c25c89ee
> > ("rcu: Move quiescent-state forcing into kthread"). I guess that it
> > was an overlook or optimization.
> > 
> > Anyway, the value seems to be manipulated only by the thread, except
> > for shoving the status. I do not see any risk in updating it directly
> > in the struct.
> > 
> > Signed-off-by: Petr Mladek 
> 
> Good catch, but how about the following fix instead?
> 
>   Thanx, Paul
> 
> 
> 
> rcu: Finish folding ->fqs_state into ->gp_state
> 
> Commit commit 4cdfc175c25c89ee ("rcu: Move quiescent-state forcing
> into kthread") started the process of folding the old ->fqs_state
> into ->gp_state, but did not complete it.  This situation does not
> cause any malfunction, but can result in extremely confusing trace
> output.  This commit completes this task of eliminating ->fqs_state
> in favor of ->gp_state.

It makes sense but it breaks dynticks handling in rcu_gp_fqs(), see
below.

> 
> Reported-by: Petr Mladek 
> Signed-off-by: Paul E. McKenney 
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 69ab7ce2cf7b..04234936d897 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -1949,16 +1949,15 @@ static bool rcu_gp_fqs_check_wake(struct rcu_state 
> *rsp, int *gfp)
>  /*
>   * Do one round of quiescent-state forcing.
>   */
> -static int rcu_gp_fqs(struct rcu_state *rsp, int fqs_state_in)
> +static void rcu_gp_fqs(struct rcu_state *rsp)
>  {
> - int fqs_state = fqs_state_in;
>   bool isidle = false;
>   unsigned long maxj;
>   struct rcu_node *rnp = rcu_get_root(rsp);
>  
>   WRITE_ONCE(rsp->gp_activity, jiffies);
>   rsp->n_force_qs++;
> - if (fqs_state == RCU_SAVE_DYNTICK) {
> + if (rsp->gp_state == RCU_SAVE_DYNTICK) {

This will never happen because rcu_gp_kthread() modifies rsp->gp_state
many times. The last value before calling rcu_gp_fqs() is
RCU_GP_DOING_FQS.

I am thinking about passing this information via a separate bool.

[...]

> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> index d5f58e717c8b..9faad70a8246 100644
> --- a/kernel/rcu/tree.h
> +++ b/kernel/rcu/tree.h
> @@ -417,12 +417,11 @@ struct rcu_data {
>   struct rcu_state *rsp;
>  };
>  
> -/* Values for fqs_state field in struct rcu_state. */
> +/* Values for gp_state field in struct rcu_state. */
>  #define RCU_GP_IDLE  0   /* No grace period in progress. */

This value seems to be used instead of the new RCU_GP_WAIT_INIT.

>  #define RCU_GP_INIT  1   /* Grace period being initialized. */

This value is unused.

>  #define RCU_SAVE_DYNTICK 2   /* Need to scan dyntick state. */

This one is no longer preserved when merged with the other state.

>  #define RCU_FORCE_QS 3   /* Need to force quiescent state. */

The meaning of this one is strange. If I get it correctly,
it is set after the state was forced. But the comment suggests
that it is before.

In other words, these states seem to be obsoleted by

/* Values for rcu_state structure's gp_flags field. */
#define RCU_GP_WAIT_INIT 0  /* Initial state. */
#define RCU_GP_WAIT_GPS  1  /* Wait for grace-period start. */
#define RCU_GP_DONE_GPS  2  /* Wait done for grace-period start. */
#define RCU_GP_WAIT_FQS  3  /* Wait for force-quiescent-state time. */
#define RCU_GP_DOING_FQS 4  /* Wait done for force-quiescent-state time. */
#define RCU_GP_CLEANUP   5  /* Grace-period cleanup started. */
#define RCU_GP_CLEANED   6  /* Grace-period cleanup complete. */


Please find below your commit updated with my ideas:

+ use bool save_dyntick instead of the RCU_SAVE_DYNTICK
  and RCU_FORCE_QS states
+ rename RCU_GP_WAIT_INIT -> RCU_GP_IDLE
+ remove all the obsolete states

I am sorry if I handled the "Signed-off-by" flags the wrong way. It is
basically your patch with a few small updates from me. I am not sure
what is the right process in this case. Feel free to

Re: [PATCH 2/2] rcu: Fix up timeouts for forcing the quiescent state

2015-09-07 Thread Petr Mladek
On Fri 2015-09-04 16:49:46, Paul E. McKenney wrote:
> On Fri, Sep 04, 2015 at 02:11:30PM +0200, Petr Mladek wrote:
> > The deadline to force the quiescent state (jiffies_force_qs) is currently
> > updated only when the previous timeout passed. But the timeout used for
> > wait_event() is always the entire original timeout. This is strange.
> 
> They tell me that kthreads aren't supposed to every catch signals,
> hence the WARN_ON() in the early-exit case stray-signal case.

Yup, I have investigated this recently. All signals are really blocked
for kthreads by default. There are a few kthreads that use signals, but
they explicitly enable them with allow_signal().


> In the case where we were awakened with an explicit force-quiescent-state
> request, we do the scan, and then wait the full time for the next scan.
> So the point of the delay is to space out the scans, not to fit a
> pre-determined schedule.
> 
> The reason we get awakened with an explicit force-quiescent-state
> request is that a given CPU just got inundated with RCU callbacks
> or that rcutorture wants to hammer this code path.
> 
> So I am not seeing this as anything in need of fixing.
> 
> Am I missing something subtle here?

There is the commit 88d6df612cc3c99f5 ("rcu: Prevent spurious-wakeup
DoS attack on rcu_gp_kthread()"). It suggests that the spurious
wakeups are possible.

I would consider this patch a fix/clean-up of that DoS-attack fix.
Huh, I forgot to mention it in the commit message.

To be honest, I personally do not know how to trigger the spurious
wakeup in the current state of the code. I am trying to convert
the kthread into the kthread worker API and there I got the spurious
wakeups but this is another story.

Thanks a lot for reviewing.

Best Regards,
Petr


Re: [PATCH 1/2] rcu: Show the real fqs_state

2015-09-09 Thread Petr Mladek
On Tue 2015-09-08 12:59:15, Paul E. McKenney wrote:
> On Mon, Sep 07, 2015 at 04:58:27PM +0200, Petr Mladek wrote:
> > On Fri 2015-09-04 16:24:22, Paul E. McKenney wrote:
> > > On Fri, Sep 04, 2015 at 02:11:29PM +0200, Petr Mladek wrote:
[...]

> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index 69ab7ce2cf7b..04234936d897 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -1949,16 +1949,15 @@ static bool rcu_gp_fqs_check_wake(struct 
> > > rcu_state *rsp, int *gfp)
> > >  /*
> > >   * Do one round of quiescent-state forcing.
> > >   */
> > > -static int rcu_gp_fqs(struct rcu_state *rsp, int fqs_state_in)
> > > +static void rcu_gp_fqs(struct rcu_state *rsp)
> > >  {
> > > - int fqs_state = fqs_state_in;
> > >   bool isidle = false;
> > >   unsigned long maxj;
> > >   struct rcu_node *rnp = rcu_get_root(rsp);
> > >  
> > >   WRITE_ONCE(rsp->gp_activity, jiffies);
> > >   rsp->n_force_qs++;
> > > - if (fqs_state == RCU_SAVE_DYNTICK) {
> > > + if (rsp->gp_state == RCU_SAVE_DYNTICK) {
> > 
> > This will never happen because rcu_gp_kthread() modifies rsp->gp_state
> > many times. The last value before calling rcu_gp_fqs() is
> > RCU_GP_DOING_FQS.
> > 
> > I think about passing this information via a separate bool.
> > 
> > [...]
> > 
> > > diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> > > index d5f58e717c8b..9faad70a8246 100644
> > > --- a/kernel/rcu/tree.h
> > > +++ b/kernel/rcu/tree.h
> > > @@ -417,12 +417,11 @@ struct rcu_data {
> > >   struct rcu_state *rsp;
> > >  };
> > >  
> > > -/* Values for fqs_state field in struct rcu_state. */
> > > +/* Values for gp_state field in struct rcu_state. */
> > >  #define RCU_GP_IDLE  0   /* No grace period in progress. */
> > 
> > This value seems to be used instead of the new RCU_GP_WAIT_INIT.
> > 
> > >  #define RCU_GP_INIT  1   /* Grace period being initialized. */
> > 
> > This value is unused.
> > 
> > >  #define RCU_SAVE_DYNTICK 2   /* Need to scan dyntick state. */
> > 
> > This one is no longer preserved when merged with the other state.
> > 
> > >  #define RCU_FORCE_QS 3   /* Need to force quiescent state. */
> > 
> > The meaning of this one is strange. If I get it correctly,
> > it is set after the state was forced. But the comment suggests
> > that it is before.
> > 
> > In other words, these states seem to be obsoleted by
> > 
> > /* Values for rcu_state structure's gp_flags field. */
> > #define RCU_GP_WAIT_INIT 0  /* Initial state. */
> > #define RCU_GP_WAIT_GPS  1  /* Wait for grace-period start. */
> > #define RCU_GP_DONE_GPS  2  /* Wait done for grace-period start. */
> > #define RCU_GP_WAIT_FQS  3  /* Wait for force-quiescent-state time. */
> > #define RCU_GP_DOING_FQS 4  /* Wait done for force-quiescent-state time. */
> > #define RCU_GP_CLEANUP   5  /* Grace-period cleanup started. */
> > #define RCU_GP_CLEANED   6  /* Grace-period cleanup complete. */
> > 
> > 
> > Please, find below your commit updated with my ideas:
> > 
> > + used bool save_dyntick instead of RCU_SAVE_DYNTICK
> >   and RCU_FORCE_QS states
> > + rename RCU_GP_WAIT_INIT -> RCU_GP_IDLE
> > + remove all the obsolete states
> > 
> > I am sorry if I handled "Signed-off-by" flags a wrong way. It is
> > basically your patch with few small updates from me. I am not sure
> > what is the right process in this case. Feel free to use Reviewed-by
> > instead of Signed-off-by with my name.
> > 
> > Well, I guess that this is not the final state ;-)
> 
> Good points, but perhaps an easier solution would be to have a
> "firsttime" argument to rcu_gp_fqs() that said whether or not this
> was the first call to rcu_gp_fqs() during the current grace period.
> If this is the first call, then take the "if" branch that passes
> dyntick_save_progress_counter() to force_qs_rnp(), otherwise take the
> other branch.

This seems to be the most elegant solution at the moment.
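For illustration, a minimal sketch of how that could look, based on the
rcu_gp_fqs() code quoted above (the sysidle handling and the final clearing
of RCU_GP_FLAG_FQS are trimmed to comments; rcu_implicit_dynticks_qs is
assumed for the part of the else branch that is not visible in the quote):

	/* sketch only, not the final patch */
	static void rcu_gp_fqs(struct rcu_state *rsp, bool first_time)
	{
		bool isidle = false;
		unsigned long maxj;

		WRITE_ONCE(rsp->gp_activity, jiffies);
		rsp->n_force_qs++;
		if (first_time) {
			/* Collect dyntick-idle snapshots. */
			force_qs_rnp(rsp, dyntick_save_progress_counter,
				     &isidle, &maxj);
			rcu_sysidle_report_gp(rsp, isidle, maxj);
		} else {
			/* Handle dyntick-idle and offline CPUs. */
			isidle = true;
			force_qs_rnp(rsp, rcu_implicit_dynticks_qs,
				     &isidle, &maxj);
		}
		/* clearing of RCU_GP_FLAG_FQS under the root node lock as before */
	}

The caller in rcu_gp_kthread() would then pass first_time = true only for
the first forcing of the current grace period.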

> But I am not generating the patch today, just flew across the Pacific
> yesterday.  ;-)

Please, find below the updated patch where I used the first_time
parameter.

Again, I am not sure about the commit person and Signed-off-by
tags. Many parts of the patch are yours.

[PATCH] x86/spinlocks: Avoid a deadlock when someone unlocks a zapped ticket spinlock

2015-10-21 Thread Petr Mladek
There are a few situations when we reinitialize (zap) ticket spinlocks. It
typically happens when the system is going down after an error and we
want to avoid a deadlock in some important services. For example,
zap_locks() in printk.c and ioapic_zap_locks().

Peter pointed out that a partial deadlock was still possible. It happens
when someone owns a ticket spinlock, we reinitialize it, and the old
owner releases it. Then the head is above the tail and the following
spin_lock() will never[*] succeed.

We could detect this situation in arch_spin_lock() and simply ignore
the superfluous head increment.

We need to do it in the lock() side because the unlock() side works
only with the head to avoid an overflow. Therefore we do not see
the consistent state of the head and the tail there.

Note that we cannot simply check for (head == TICKET_LOCK_INC && !tail)
because the reinitialized lock might be taken several times before
the old owner releases the lock. In other words, the superfluous
head increment might happen at any time.

The change looks quite harmless. It should not affect the fast path
when the lock is taken immediately. It does not make the situation
where two processes might own the lock after zapping any worse.
It just avoids the partial deadlock.

[*] unless the ticket number overflows.
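For illustration, a worked example of the ticket values; it assumes
TICKET_LOCK_INC == 1 and that both head and tail start at 0:

	CPU0: spin_lock()    ->  head = 0, tail = 1; CPU0 owns the lock
	      zap_locks()    ->  head = 0, tail = 0
	CPU0: spin_unlock()  ->  head = 1, tail = 0  (the superfluous increment)
	CPU1: spin_lock()    ->  xadd() returns head = 1, tail = 0, so the
	                         ticket (0) can never match the head again and
	                         CPU1 would spin forever[*]; with the patch it
	                         sees inc.head == inc.tail + TICKET_LOCK_INC,
	                         retries the xadd(), gets head = 1, tail = 1,
	                         and takes the lock immediately.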

Reported-by: Peter Zijlstra 
Signed-off-by: Petr Mladek 
---
 arch/x86/include/asm/spinlock.h | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index be0a05913b91..f732abf57c6f 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -105,12 +105,21 @@ static __always_inline int 
arch_spin_value_unlocked(arch_spinlock_t lock)
  */
 static __always_inline void arch_spin_lock(arch_spinlock_t *lock)
 {
-   register struct __raw_tickets inc = { .tail = TICKET_LOCK_INC };
+   register struct __raw_tickets inc;
 
+again:
+   inc = (struct __raw_tickets){ .tail = TICKET_LOCK_INC };
inc = xadd(&lock->tickets, inc);
if (likely(inc.head == inc.tail))
goto out;
 
+   /*
+* Avoid a stall when an old owner unlocked a reinitialized spinlock.
+* Simply ignore the superfluous increment of the head.
+*/
+   if (unlikely(inc.head == inc.tail + TICKET_LOCK_INC))
+   goto again;
+
for (;;) {
unsigned count = SPIN_THRESHOLD;
 
-- 
1.8.5.6

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86/spinlocks: Avoid a deadlock when someone unlocks a zapped ticket spinlock

2015-10-22 Thread Petr Mladek
On Wed 2015-10-21 13:11:20, Peter Zijlstra wrote:
> On Wed, Oct 21, 2015 at 11:18:09AM +0200, Petr Mladek wrote:
> > There are few situations when we reinitialize (zap) ticket spinlocks. It
> > typically happens when the system is going down after an error and we
> > want to avoid deadlock in some important services. For example,
> > zap_locks() in printk.c and ioapic_zap_locks().
> 
> So there's a few problems here. On x86 the code you patch is dead code,
> x86 no longer uses ticket locks. Other archs might still.
>
> And I entirely detest adding instructions to any lock path, be it the
> utmost fast path or not, for something that will _never_ happen (on a
> healthy system).

OK

> I would still very much recommend getting rid of the need for
> zap_locks() in the first place.
>
> What I did back when is punt on the whole printk buffer madness and dump
> things to early_printk() without any locking.

My problem with this approach is that the early console is a black hole if
people do not watch it on a remote machine. We would stop using the
console that is actually in use even when it would work in most cases. IMHO,
zap_locks() is far from ideal but it would give better results in most
cases.

Maybe we could add support for using the early console in case of an Oops
or panic. But we should not do this by default. It would help, especially
when the problem is reproducible and we get stuck with the normal
console.


> I think that as long as the printk buffer has locks you have to accept
> to lose some data when really bad stuff goes down.

Yup, but we could try a bit harder. I am afraid that a lockless buffer,
e.g. the ring_buffer, is not an option here because it would be too complex,
rather slow, and hard to handle in a crash dump.

Thanks a lot for feedback.


Best Regards,
Petr


Re: [PATCH] printk: Don't discard earlier unprinted messages to make space

2015-10-22 Thread Petr Mladek
Added Andrew, who maintains the printk() code, to CC.

On Thu 2015-10-22 11:16:50, David Howells wrote:
> printk() currently discards earlier messages to make space for new messages
> arriving.  This has the distinct downside that if the kernel starts
> churning out messages because of some initial incident, the report of the
> initial incident is likely to be lost under a blizzard of:
> 
>   ** NNN printk messages dropped **
> 
> messages from console_unlock().
> 
> The first message generated (typically an oops) is usually the most
> important - the one you want to solve first - so we really want to see
> that.

The oops=panic or panic_on_warn kernel parameters might be more useful in
this situation.

I would expect that the first few messages are printed to the console
before the buffer is wrapped. IMHO, in many cases, you are interested
in the final messages that describe why the system went down. If there
is no time to print them, you want to have them in the crash dump
(ring buffer) at least.


> To this end, change log_store() to only write a message into the buffer if
> there is sufficient space to hold that message.  The message may be
> truncated if it will then fit.
>
> This patch could be improved by noting that some messages got discarded
> when next there is space to do so.
>
> Signed-off-by: David Howells 
> ---

[...]

> + if (CIRC_SPACE_TO_END(log_next_idx, log_first_idx, log_buf_len) >= 
> wsize)
> + goto have_space_no_wrap;
> +
> + if (CIRC_SPACE(log_next_idx, log_first_idx, log_buf_len) >= size)
> + goto have_space_wrap;
> +
> + /* Try to truncate the message. */
> + size = truncate_msg(&text_len, &trunc_msg_len, &dict_len, &pad_len);
> + wsize = size + sizeof(struct printk_log);

truncate_msg() currently works only for messages that are
bigger than half of the buffer.

Also, if the flood of messages is faster than the printing, you might
end up with the buffer full of "<truncated>" strings with a minimum
of useful information.


> + if (CIRC_SPACE_TO_END(log_next_idx, log_first_idx, log_buf_len) >= 
> wsize)
> + goto have_space_no_wrap;
> +
> + if (CIRC_SPACE(log_next_idx, log_first_idx, log_buf_len) < size)
> + return 0;

Please, try to avoid copy&pasted code.
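For example, both pairs of checks could go through one helper along these
lines (a sketch only; log_has_space() is a hypothetical name, while the
CIRC_* macros and the log_* variables are the ones already used in the
patch above):

	/* hypothetical helper to avoid the duplicated space checks */
	static bool log_has_space(u32 size, u32 wsize, bool *wrap)
	{
		if (CIRC_SPACE_TO_END(log_next_idx, log_first_idx,
				      log_buf_len) >= wsize) {
			*wrap = false;
			return true;
		}
		if (CIRC_SPACE(log_next_idx, log_first_idx, log_buf_len) >= size) {
			*wrap = true;
			return true;
		}
		return false;
	}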

Best Regards,
Petr


Re: [RFC PATCH 09/14] ring_buffer: Initialize completions statically in the benchmark

2015-09-04 Thread Petr Mladek
On Mon 2015-08-03 14:31:09, Steven Rostedt wrote:
> On Tue, 28 Jul 2015 16:39:26 +0200
> Petr Mladek  wrote:
> 
> > It looks strange to initialize the completions repeatedly.
> > 
> > This patch uses static initialization. It simplifies the code
> > and even helps to get rid of two memory barriers.
> 
> There was a reason I did it this way and did not use static
> initializers. But I can't recall why I did that. :-/
> 
> I'll have to think about this some more.

Heh, parallel programming is real fun. I tried to understand
the code in more detail and sometimes felt like Duane Dibbley.

Anyway, I found a few possible races related to the completions.
One scenario was opened by my previous fix b44754d8262d3aab8429
("ring_buffer: Allow to exit the ring buffer benchmark immediately").

The races can be fixed by the patch below. I still do not see any
scenario where the extra initialization of the two completions
is needed but I am not brave enough to remove it after all ;-)


From ad75428b1e5e5127bf7dc6062f880ece11dbdbbf Mon Sep 17 00:00:00 2001
From: Petr Mladek 
Date: Fri, 28 Aug 2015 15:59:00 +0200
Subject: [PATCH 1/2] ring_buffer: Do not complete benchmark reader too
 early

It seems that complete(&read_done) might be called too early
in some situations.

1st scenario:
-

CPU0CPU1

ring_buffer_producer_thread()
  wake_up_process(consumer);
  wait_for_completion(&read_start);

ring_buffer_consumer_thread()
  complete(&read_start);

  ring_buffer_producer()
# producing data in
# the do-while cycle

  ring_buffer_consumer();
# reading data
# got error
# set kill_test = 1;
set_current_state(
TASK_INTERRUPTIBLE);
if (reader_finish)  # false
schedule();

# producer still in the middle of
# do-while cycle
if (consumer && !(cnt % wakeup_interval))
  wake_up_process(consumer);

# spurious wakeup
while (!reader_finish &&
   !kill_test)
# leaving because
# kill_test == 1
reader_finish = 0;
complete(&read_done);

1st BANG: We might access an uninitialized "read_done" if this is
  the first round.

# producer finally leaving
# the do-while cycle because kill_test == 1;

if (consumer) {
  reader_finish = 1;
  wake_up_process(consumer);
  wait_for_completion(&read_done);

2nd BANG: This will never complete because consumer already did
  the completion.

2nd scenario:
-

CPU0CPU1

ring_buffer_producer_thread()
  wake_up_process(consumer);
  wait_for_completion(&read_start);

ring_buffer_consumer_thread()
  complete(&read_start);

  ring_buffer_producer()
# CPU3 removes the module <--- difference from
# and stops producer  <--- the 1st scenario
if (kthread_should_stop())
  kill_test = 1;

  ring_buffer_consumer();
while (!reader_finish &&
   !kill_test)
# kill_test == 1 => we never go
# into the top level while()
reader_finish = 0;
complete(&read_done);

# producer still in the middle of
# do-while cycle
if (consumer && !(cnt % wakeup_interval))
  wake_up_process(consumer);

# spurious wakeup
while (!reader_finish &&
   !kill_test)
# leaving because kill_test == 1
reader_finish = 0;
complete(&read_done);

BANG: We are in the same "bang" situation as in the 1st scenario.

Root of the problem:


ring_buffer_consumer() must complete "read_don

Re: [RFC PATCH 10/14] ring_buffer: Fix more races when terminating the producer in the benchmark

2015-09-04 Thread Petr Mladek
On Mon 2015-08-03 14:33:23, Steven Rostedt wrote:
> On Tue, 28 Jul 2015 16:39:27 +0200
> Petr Mladek  wrote:
> 
> > @@ -384,7 +389,7 @@ static int ring_buffer_consumer_thread(void *arg)
> >  
> >  static int ring_buffer_producer_thread(void *arg)
> >  {
> > -   while (!kthread_should_stop() && !kill_test) {
> > +   while (!break_test()) {
> > ring_buffer_reset(buffer);
> >  
> > if (consumer) {
> > @@ -393,11 +398,15 @@ static int ring_buffer_producer_thread(void *arg)
> > }
> >  
> > ring_buffer_producer();
> > -   if (kill_test)
> > +   if (break_test())
> > goto out_kill;
> >  
> > trace_printk("Sleeping for 10 secs\n");
> > set_current_state(TASK_INTERRUPTIBLE);
> > +   if (break_test()) {
> > +   __set_current_state(TASK_RUNNING);
> 
> Move the setting of the current state to after the out_kill label.

Please, find below the updated version of this patch.

I also reverted some changes in the consumer code. It never stays
in a loop for too long and it must stay in ring_buffer_producer()
until "reader_finish" variable is set.


From 7f5b1a5b8cf8248245897b55ffc51a6d74c8e15b Mon Sep 17 00:00:00 2001
From: Petr Mladek 
Date: Fri, 19 Jun 2015 14:38:36 +0200
Subject: [PATCH 2/2] ring_buffer: Fix more races when terminating the producer
 in the benchmark

The commit b44754d8262d3aab8 ("ring_buffer: Allow to exit the ring
buffer benchmark immediately") added a hack into ring_buffer_producer()
that set @kill_test when kthread_should_stop() returned true. It improved
the situation a lot. It stopped the kthread in most cases because
the producer spent most of the time in the patched while cycle.

But there are still a few possible races when kthread_should_stop()
is set outside of the cycle. Then we do not set @kill_test and
some other checks pass.

This patch adds a better fix. It renames @kill_test/KILL_TEST() to
the more descriptive @test_error/TEST_ERROR(). Also it introduces
break_test() function that checks for both @test_error and
kthread_should_stop().

The new function is used in the producer when the check for @test_error
is not enough. It is not used in the consumer because its state
is manipulated by the producer via the "reader_finish" variable.

Also we add a missing check into ring_buffer_producer_thread()
between setting TASK_INTERRUPTIBLE and calling schedule_timeout().
Otherwise, we might miss a wakeup from kthread_stop().

Signed-off-by: Petr Mladek 
---
 kernel/trace/ring_buffer_benchmark.c | 54 +++-
 1 file changed, 29 insertions(+), 25 deletions(-)

diff --git a/kernel/trace/ring_buffer_benchmark.c 
b/kernel/trace/ring_buffer_benchmark.c
index 045e0a24c2a0..d1bfe4399e96 100644
--- a/kernel/trace/ring_buffer_benchmark.c
+++ b/kernel/trace/ring_buffer_benchmark.c
@@ -60,12 +60,12 @@ MODULE_PARM_DESC(consumer_fifo, "fifo prio for consumer");
 
 static int read_events;
 
-static int kill_test;
+static int test_error;
 
-#define KILL_TEST()\
+#define TEST_ERROR()   \
do {\
-   if (!kill_test) {   \
-   kill_test = 1;  \
+   if (!test_error) {  \
+   test_error = 1; \
WARN_ON(1); \
}   \
} while (0)
@@ -75,6 +75,11 @@ enum event_status {
EVENT_DROPPED,
 };
 
+static bool break_test(void)
+{
+   return test_error || kthread_should_stop();
+}
+
 static enum event_status read_event(int cpu)
 {
struct ring_buffer_event *event;
@@ -87,7 +92,7 @@ static enum event_status read_event(int cpu)
 
entry = ring_buffer_event_data(event);
if (*entry != cpu) {
-   KILL_TEST();
+   TEST_ERROR();
return EVENT_DROPPED;
}
 
@@ -115,10 +120,10 @@ static enum event_status read_page(int cpu)
rpage = bpage;
/* The commit may have missed event flags set, clear them */
commit = local_read(&rpage->commit) & 0xf;
-   for (i = 0; i < commit && !kill_test; i += inc) {
+   for (i = 0; i < commit && !test_error ; i += inc) {
 
if (i >= (PAGE_SIZE - offsetof(struct rb_page, data))) {
-   KILL_TEST();
+   TEST_ERROR();
break;
}
 
@@ -128,7 +133,7 @@ static enum event_status read_page(int cpu)
case RINGBUF_TYPE_PADDIN

[PATCH 0/2] rcu: two small fixes for RCU kthreads

2015-09-04 Thread Petr Mladek
I am trying to convert kthreads into a more sane API. I also played
with RCU kthreads and found two small problems. They are
independent of the conversion, so I am sending the patches already now.

Petr Mladek (2):
  rcu: Show the real fqs_state
  rcu: Fix up timeouts for forcing the quiescent state

 kernel/rcu/tree.c | 90 +++
 1 file changed, 58 insertions(+), 32 deletions(-)

-- 
1.8.5.6



[PATCH 2/2] rcu: Fix up timeouts for forcing the quiescent state

2015-09-04 Thread Petr Mladek
The deadline to force the quiescent state (jiffies_force_qs) is currently
updated only when the previous timeout passed. But the timeout used for
wait_event() is always the entire original timeout. This is strange.

First, we might miss the deadline if we wait after a spurious wake up
or after sleeping in cond_resched() because we wait too long.

Second, we might do another forcing too early if the previous forcing
was done earlier because of RCU_GP_FLAG_FQS and we later get a spurious
wake up. IMHO, we should reset the deadline in this case.

This patch updates the deadline "jiffies_force_qs" right after forcing
the quiescent state by rcu_gp_fqs().

Also it updates the remaining timeout according to the current jiffies and
the requested deadline.

It moves the cond_resched_rcu_qs() to a single place. It changes the order
of the check for the pending signal. But there should never be a pending
signal. If there were, we would have bigger problems because wait_event()
would never sleep again until someone flushed the signal.

I found these problems when trying to understand the code. I do not
have any reproducer. I think that the issue is hardly visible because
the spurious wakeup is rather theoretical.

Signed-off-by: Petr Mladek 
---
 kernel/rcu/tree.c | 77 ++-
 1 file changed, 53 insertions(+), 24 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 54af8d5f9f7b..aaeeabcba545 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2035,13 +2035,45 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
 }
 
 /*
+ * Normalize, update, and return the first timeout.
+ */
+static unsigned long normalize_jiffies_till_first_fqs(void)
+{
+   unsigned long j = jiffies_till_first_fqs;
+
+   if (unlikely(j > HZ)) {
+   j = HZ;
+   jiffies_till_first_fqs = HZ;
+   }
+
+   return j;
+}
+
+/*
+ * Normalize, update, and return the next timeout.
+ */
+static unsigned long normalize_jiffies_till_next_fqs(void)
+{
+   unsigned long j = jiffies_till_next_fqs;
+
+   if (unlikely(j > HZ)) {
+   j = HZ;
+   jiffies_till_next_fqs = HZ;
+   } else if (unlikely(j < 1)) {
+   j = 1;
+   jiffies_till_next_fqs = 1;
+   }
+
+   return j;
+}
+
+/*
  * Body of kthread that handles grace periods.
  */
 static int __noreturn rcu_gp_kthread(void *arg)
 {
int gf;
-   unsigned long j;
-   int ret;
+   unsigned long timeout, j;
struct rcu_state *rsp = arg;
struct rcu_node *rnp = rcu_get_root(rsp);
 
@@ -2071,22 +2103,18 @@ static int __noreturn rcu_gp_kthread(void *arg)
 
/* Handle quiescent-state forcing. */
rsp->fqs_state = RCU_SAVE_DYNTICK;
-   j = jiffies_till_first_fqs;
-   if (j > HZ) {
-   j = HZ;
-   jiffies_till_first_fqs = HZ;
-   }
-   ret = 0;
+   timeout = normalize_jiffies_till_first_fqs();
+   rsp->jiffies_force_qs = jiffies + timeout;
for (;;) {
-   if (!ret)
-   rsp->jiffies_force_qs = jiffies + j;
trace_rcu_grace_period(rsp->name,
   READ_ONCE(rsp->gpnum),
   TPS("fqswait"));
rsp->gp_state = RCU_GP_WAIT_FQS;
-   ret = wait_event_interruptible_timeout(rsp->gp_wq,
-   rcu_gp_fqs_check_wake(rsp, &gf), j);
+   wait_event_interruptible_timeout(rsp->gp_wq,
+   rcu_gp_fqs_check_wake(rsp, &gf),
+   timeout);
rsp->gp_state = RCU_GP_DOING_FQS;
+try_again:
/* Locking provides needed memory barriers. */
/* If grace period done, leave loop. */
if (!READ_ONCE(rnp->qsmask) &&
@@ -2099,28 +2127,29 @@ static int __noreturn rcu_gp_kthread(void *arg)
   READ_ONCE(rsp->gpnum),
   TPS("fqsstart"));
rcu_gp_fqs(rsp);
+   timeout = normalize_jiffies_till_next_fqs();
+   rsp->jiffies_force_qs = jiffies + timeout;
trace_rcu_grace_period(rsp->name,
   READ_ONCE(rsp->gpnum),
   TPS("fqsend"));
-   cond_resched_rcu_qs();
-   WRITE_ONCE(rsp->

[PATCH 1/2] rcu: Show the real fqs_state

2015-09-04 Thread Petr Mladek
The value of "fqs_state" in struct rcu_state is always RCU_GP_IDLE.

The real state is stored in a local variable in rcu_gp_kthread().
It is modified by rcu_gp_fqs() via parameter and return value.
But the actual value is never stored to rsp->fqs_state.

The result is that print_one_rcu_state() does not show the real
state.

This code was added 3 years ago by the commit 4cdfc175c25c89ee
("rcu: Move quiescent-state forcing into kthread"). I guess that it
was an oversight or an optimization.

Anyway, the value seems to be manipulated only by the kthread, except
for showing the status. I do not see any risk in updating it directly
in the struct.

Signed-off-by: Petr Mladek 
---
 kernel/rcu/tree.c | 13 +
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 9f75f25cc5d9..54af8d5f9f7b 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1927,16 +1927,15 @@ static bool rcu_gp_fqs_check_wake(struct rcu_state 
*rsp, int *gfp)
 /*
  * Do one round of quiescent-state forcing.
  */
-static int rcu_gp_fqs(struct rcu_state *rsp, int fqs_state_in)
+static void rcu_gp_fqs(struct rcu_state *rsp)
 {
-   int fqs_state = fqs_state_in;
bool isidle = false;
unsigned long maxj;
struct rcu_node *rnp = rcu_get_root(rsp);
 
WRITE_ONCE(rsp->gp_activity, jiffies);
rsp->n_force_qs++;
-   if (fqs_state == RCU_SAVE_DYNTICK) {
+   if (rsp->fqs_state == RCU_SAVE_DYNTICK) {
/* Collect dyntick-idle snapshots. */
if (is_sysidle_rcu_state(rsp)) {
isidle = true;
@@ -1945,7 +1944,7 @@ static int rcu_gp_fqs(struct rcu_state *rsp, int 
fqs_state_in)
force_qs_rnp(rsp, dyntick_save_progress_counter,
 &isidle, &maxj);
rcu_sysidle_report_gp(rsp, isidle, maxj);
-   fqs_state = RCU_FORCE_QS;
+   rsp->fqs_state = RCU_FORCE_QS;
} else {
/* Handle dyntick-idle and offline CPUs. */
isidle = true;
@@ -1959,7 +1958,6 @@ static int rcu_gp_fqs(struct rcu_state *rsp, int 
fqs_state_in)
   READ_ONCE(rsp->gp_flags) & ~RCU_GP_FLAG_FQS);
raw_spin_unlock_irq(&rnp->lock);
}
-   return fqs_state;
 }
 
 /*
@@ -2041,7 +2039,6 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
  */
 static int __noreturn rcu_gp_kthread(void *arg)
 {
-   int fqs_state;
int gf;
unsigned long j;
int ret;
@@ -2073,7 +2070,7 @@ static int __noreturn rcu_gp_kthread(void *arg)
}
 
/* Handle quiescent-state forcing. */
-   fqs_state = RCU_SAVE_DYNTICK;
+   rsp->fqs_state = RCU_SAVE_DYNTICK;
j = jiffies_till_first_fqs;
if (j > HZ) {
j = HZ;
@@ -2101,7 +2098,7 @@ static int __noreturn rcu_gp_kthread(void *arg)
trace_rcu_grace_period(rsp->name,
   READ_ONCE(rsp->gpnum),
   TPS("fqsstart"));
-   fqs_state = rcu_gp_fqs(rsp, fqs_state);
+   rcu_gp_fqs(rsp);
trace_rcu_grace_period(rsp->name,
   READ_ONCE(rsp->gpnum),
   TPS("fqsend"));
-- 
1.8.5.6



Re: [PATCH] printk: Don't discard earlier unprinted messages to make space

2015-10-23 Thread Petr Mladek
On Thu 2015-10-22 14:19:26, David Howells wrote:
> Petr Mladek  wrote:
> 
> > I would expect that the first few messages are printed to the console
> > before the buffer is wrapped. IMHO, in many cases, you are interested
> > into the final messages that describe why the system went down.
> 
> The last message might tell you that the machine panicked because the NMI
> handler triggered due to a spinlocked section taking too long or something.
> This doesn't help if the oops that caused the spinlock to remain held or
> whatever gets discarded from the buffer due to several intervening complaints
> that result secondarily from the initial oops.

There might be a flood of messages if you enable debugging or so. In
this case, the interesting messages would be at the end of the buffer.

IMHO, you want both the beginning of the flood and the end when
the machine goes down. You do not want the repeated blob in the middle.
Do you really miss the first few lines on the serial console?
Did you consider the panic-on-Oops (oops=panic) kernel parameter?


> > If there is no time to print them, you want to have them in the crash dump
> > (ring buffer) at least.
>
> But not at the expense of discarding the first oops report.  *That* one is the
> most important.

But your patch discards the initial messages as well once they are printed.


> Perhaps things could be arranged such that messages *can* be discarded from
> the front of the buffer *provided* they are not oops messages.

Another possibility would be to decide this by the importance level
of the message. For example, we might start ignoring less important
messages when the buffer is getting full. But this should be optional.
It makes sense only when someone prints/logs the messages somewhere.

Anyway, be warned that any additional complexity must have very good
reasons. The printk code is already bloated and there is strong
resistance to making it worse.

Best Regards,
Petr


Re: [RFC v2 17/18] rcu: Convert RCU gp kthreads into kthread worker API

2015-10-01 Thread Petr Mladek
On Mon 2015-09-28 10:14:37, Paul E. McKenney wrote:
> On Mon, Sep 21, 2015 at 03:03:58PM +0200, Petr Mladek wrote:
> > Kthreads are currently implemented as an infinite loop. Each
> > has its own variant of checks for terminating, freezing,
> > awakening. In many cases it is unclear to say in which state
> > it is and sometimes it is done a wrong way.
> > 
> > The plan is to convert kthreads into kthread_worker or workqueues
> > API. It allows to split the functionality into separate operations.
> > It helps to make a better structure. Also it defines a clean state
> > where no locks are taken, IRQs blocked, the kthread might sleep
> > or even be safely migrated.
> > 
> > The kthread worker API is useful when we want to have a dedicated
> > single kthread for the work. It helps to make sure that it is
> > available when needed. Also it allows a better control, e.g.
> > define a scheduling priority.
> > 
> > This patch converts RCU gp threads into the kthread worker API.
> > They modify the scheduling, have their own logic to bind the process.
> > They provide functions that are critical for the system to work
> > and thus deserve a dedicated kthread.
> > 
> > This patch tries to split start of the grace period and the quiescent
> > state handling into separate works. The motivation is to avoid
> > wait_events inside the work. Instead it queues the works when
> > appropriate which is more typical for this API.
> > 
> > On one hand, it should reduce spurious wakeups where the condition
> > in the wait_event failed and the kthread went to sleep again.
> > 
> > On the other hand, there is a small race window when the other
> > work might get queued. We could detect and fix this situation
> > at the beginning of the work but it is a bit ugly.
> > 
> > The patch renames the functions kthread_wake() to kthread_worker_poke()
> > that sounds more appropriate.
> > 
> > Otherwise, the logic should stay the same. I did a lot of torturing
> > and I did not see any problem with the current patch. But of course,
> > it would deserve much more testing and reviewing before applying.
> 
> Suppose I later need to add helper kthreads to parallelize grace-period
> initialization.  How would I implement that in a freeze-friendly way?

I have been convinced that there are only a few kthreads that really need
freezing. See the discussion around my first attempt at
https://lkml.org/lkml/2015/6/13/190

In fact, RCU is a good example of kthreads that should not get
frozen because they are needed even later when the system
is suspended.

If I understand it correctly, they will do the job until most devices
and all non-boot CPUs are disabled. Then the task doing the suspend
will get scheduled. It will write the image and stop the machine.
RCU should not be needed by this very last step.

In other words, RCU should not be much concerned about freezing.

If you are concerned about adding more kthreads, it should be
possible to just add more workers if we agree on using the
kthread worker API.


Best Regards,
Petr


Re: [RFC v2 00/18] kthread: Use kthread worker API more widely

2015-10-01 Thread Petr Mladek
On Tue 2015-09-29 22:08:33, Paul E. McKenney wrote:
> On Mon, Sep 21, 2015 at 03:03:41PM +0200, Petr Mladek wrote:
> > My intention is to make it easier to manipulate kthreads. This RFC tries
> > to use the kthread worker API. It is based on comments from the
> > first attempt. See https://lkml.org/lkml/2015/7/28/648 and
> > the list of changes below.
> > 
> > 1st..8th patches: improve the existing kthread worker API
> > 
> > 9th, 12th, 17th patches: convert three kthreads into the new API,
> >  namely: khugepaged, ring buffer benchmark, RCU gp kthreads[*]
> > 
> > 10th, 11th patches: fix potential problems in the ring buffer
> >   benchmark; also sent separately
> > 
> > 13th patch: small fix for RCU kthread; also sent separately;
> >  being tested by Paul
> > 
> > 14th..16th patches: preparation steps for the RCU threads
> >  conversion; they are needed _only_ if we split GP start
> >  and QS handling into separate works[*]
> > 
> > 18th patch: does a possible improvement of the kthread worker API;
> >  it adds an extra parameter to the create*() functions, so I
> >  rather put it into this draft
> >  
> > 
> > [*] IMPORTANT: I tried to split RCU GP start and GS state handling
> > into separate works this time. But there is a problem with
> > a race in rcu_gp_kthread_worker_poke(). It might queue
> > the wrong work. It can be detected and fixed by the work
> > itself but it is a bit ugly. Alternative solution is to
> > do both operations in one work. But then we sleep too much
> > in the work which is ugly as well. Any idea is appreciated.
> 
> I think that the kernel is trying really hard to tell you that splitting
> up the RCU grace-period kthreads in this manner is not such a good idea.

Yup, I guess that it would be better to stay with the approach taken
in the previous RFC. I mean to start the grace period and handle
the quiescent state in a single work. See
https://lkml.org/lkml/2015/7/28/650. It basically keeps the
functionality. The only difference is that we regularly leave
the RCU-specific function, so it will be possible to patch it.

The RCU kthreads are very special because they basically ignore
the freezer and they never stop. They do not show the advantage
of any new API well. I tried to convert them primarily because they
were so sensitive. I thought that it was good for testing the limits
of the API.


> So what are we really trying to accomplish here?  I am guessing something
> like the following:
> 
> 1.Get each grace-period kthread to a known safe state within a
>   short time of having requested a safe state.  If I recall
>   correctly, the point of this is to allow no-downtime kernel
>   patches to the functions executed by the grace-period kthreads.
> 
> 2.At the same time, if someone suddenly needs a grace period
>   at some point in this process, the grace period kthreads are
>   going to have to wake back up and handle the grace period.
>   Or do you have some tricky way to guarantee that no one is
>   going to need a grace period beyond the time you freeze
>   the grace-period kthreads?
> 
> 3.The boost kthreads should not be a big problem because failing
>   to boost simply lets the grace period run longer.
> 
> 4.The callback-offload kthreads are likely to be a big problem,
>   because in systems configured with them, they need to be running
>   to invoke the callbacks, and if the callbacks are not invoked,
>   the grace period might just as well have failed to end.
> 
> 5.The per-CPU kthreads are in the same boat as the callback-offload
>   kthreads.  One approach is to offline all the CPUs but one, and
>   that will park all but the last per-CPU kthread.  But handling
>   that last per-CPU kthread would likely be "good clean fun"...
> 
> 6.Other requirements?
> 
> One approach would be to simply say that the top-level rcu_gp_kthread()
> function cannot be patched, and arrange for the grace-period kthreads
> to park at some point within this function.  Or is there some requirement
> that I am missing?

I am a bit confused by the above paragraphs because they mix patching,
stopping, and parking. Note that we do not need to stop any process
when live patching.

I hope that it is clearer after my response in the other mail about
freezing. Or maybe I am missing something.

Anyway, thanks a lot for looking at the patches and for the feedback.


Best Regards,
Petr


Re: [RFC v2 00/18] kthread: Use kthread worker API more widely

2015-10-02 Thread Petr Mladek
On Thu 2015-10-01 10:00:53, Paul E. McKenney wrote:
> On Thu, Oct 01, 2015 at 05:59:43PM +0200, Petr Mladek wrote:
> > On Tue 2015-09-29 22:08:33, Paul E. McKenney wrote:
> > > On Mon, Sep 21, 2015 at 03:03:41PM +0200, Petr Mladek wrote:
> > > > My intention is to make it easier to manipulate kthreads. This RFC tries
> > > > to use the kthread worker API. It is based on comments from the
> > > > first attempt. See https://lkml.org/lkml/2015/7/28/648 and
> > > > the list of changes below.
> > > > 
> If the point of these patches was simply to test your API, and if you are
> not looking to get them upstream, we are OK.

I would like to eventually transform all kthreads into an API that
will better define the kthread workflow. It need not be this one,
though. I am still looking for a good API that will be acceptable[*].

One of the reasons that I played with RCU, khugepaged, and ring buffer
kthreads is that they are maintained by core developers. I hope that
it will help to get a better consensus.


> If you want them upstream, you need to explain to me why the patches
> help something.

As I said, the RCU kthreads do not show a big win because they ignore
the freezer, are not parked, never stop, and do not handle signals.
But the change will allow them to be live patched because they leave
the main function at a safe place.

The ring buffer benchmark is a much better example. It reduced
the main function of the consumer kthread to two lines.
It removed some error-prone code that modified the task state,
called the scheduler, and handled kthread_should_stop(). IMHO, the
workflow is better and safer now.

I am going to prepare and send more examples where the change makes
the workflow easier.


> And also how the patches avoid breaking things.

I do my best to keep the original functionality. If we decide to use
the kthread worker API, my first attempt is much safer, see
https://lkml.org/lkml/2015/7/28/650. It basically replaces the
top-level for cycle with one self-queuing work. There are some more
instructions to go back to the cycle but they define a common
safe point that will be maintained in a single location for
all kthread workers.


[*] I have played with two APIs so far. They define a safe point
for freezing, parking, stopping, signal handling, and live patching.
Also some non-trivial logic of the main cycle is maintained
in a single location.

Here are some details:

1. iterant API
--

  It allows defining three callbacks that are called the following
  way:

 init();
 while (!stop)
func();
 destroy();

  See also https://lkml.org/lkml/2015/6/5/556.

  Advantages:
+ simple and clear workflow
+ simple use
+ simple conversion from the current kthreads API

  Disadvantages:
+ problematic solution of sleeping between events
+ completely new API


2. kthread worker API
-

  It is similar to workqueues. The difference is that the works
  have a dedicated kthread, so we can better control the resources,
  e.g. priority, scheduling policy, ... A minimal usage sketch is
  shown below, after the pros and cons.

  Advantages:
+ already in use
+ design proven to work (workqueues)
+ natural way to wait for work in the common code (worker)
  using event driven works and delayed works
+ easy to convert to/from workqueues API

  Disadvantages:
+ more code needed to define, initialize, and queue works
+ more complicated conversion from the current API
  if we want to do it in a clean way (event driven)
+ might need more synchronization in some cases[**]

  Questionable:
+ event driven vs. procedural programming style
+ allows a more fine-grained split of the functionality into
  separate units (works) that might be queued
  as needed


[**] wake_up() is a nop for an empty waitqueue. But queuing a work
 into a non-existing worker might cause a crash. Well, this is
 usually already synchronized.
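For illustration, a minimal usage sketch of the kthread worker API with
the helpers that already exist in include/linux/kthread.h (the names
my_worker/my_work/my_work_fn and the error handling are made up for the
example):

	/* one dedicated kthread serving the queued works */
	static struct kthread_worker my_worker;
	static struct kthread_work my_work;

	static void my_work_fn(struct kthread_work *work)
	{
		/* the real job; runs in the dedicated "my_worker" kthread */
	}

	static int __init my_init(void)
	{
		struct task_struct *task;

		init_kthread_worker(&my_worker);
		task = kthread_run(kthread_worker_fn, &my_worker, "my_worker");
		if (IS_ERR(task))
			return PTR_ERR(task);

		/* the dedicated kthread makes it easy to tweak e.g. the priority */

		init_kthread_work(&my_work, my_work_fn);
		queue_kthread_work(&my_worker, &my_work);
		return 0;
	}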


Any thoughts or preferences are highly appreciated.

Best Regards,
Petr


Re: [RFC v2 07/18] kthread: Allow to cancel kthread work

2015-10-02 Thread Petr Mladek
On Mon 2015-09-28 13:03:14, Tejun Heo wrote:
> Hello, Petr.
> 
> On Fri, Sep 25, 2015 at 01:26:17PM +0200, Petr Mladek wrote:
> > 1) PENDING state plus -EAGAIN/busy loop cycle
> > -
> > 
> > IMHO, we want to use the timer because it is an elegant solution.
> > Then we must release the lock when the timer is running. The lock
> > must be taken by the timer->function(). And there is a small window
> > when the timer is no longer pending but timer->function is not running:
> > 
> > CPU0CPU1
> > 
> > run_timer_softirq()
> >   __run_timers()
> > detach_expired_timer()
> >   detach_timer()
> > #clear_pending
> > 
> > try_to_grab_pending_kthread_work()
> >   del_timer()
> > # fails because not pending
> > 
> >   test_and_set_bit(KTHREAD_WORK_PENDING_BIT)
> > # fails because already set
> > 
> >   if (!list_empty(&work->node))
> > # fails because still not queued
> > 
> > !!! problematic window !!!
> > 
> > call_timer_fn()
> >  queue_kthread_work()
> 
> Let's say each work item has a state variable which is protected by a
> lock and the state can be one of IDLE, PENDING, CANCELING.  Let's also
> assume that all cancelers synchronize with each other via mutex, so we
> only have to worry about a single canceler.  Wouldn't something like
> the following work while being a lot simpler?
> 
> Delayed queueing and execution.
> 
> 1. Lock and check whether state is IDLE.  If not, nothing to do.
> 
> 2. Set state to PENDING and schedule the timer and unlock.
> 
> 3. On expiration, timer_fn grabs the lock and see whether state is
>still PENDING.  If so, schedule the work item for execution;
>otherwise, nothing to do.
> 
> 4. After dequeueing from execution queue with lock held, the worker is
>marked as executing the work item and state is reset to IDLE.
> 
> Canceling
> 
> 1. Lock, dequeue and set the state to CANCELING.
> 
> 2. Unlock and perform del_timer_sync().
> 
> 3. Flush the work item.
> 
> 4. Lock and reset the state to IDLE and unlock.
> 
> 
> > 2) CANCEL state plus custom waitqueue
> > -
> > 
> > cancel_kthread_work_sync() has to wait for the running work. It might take
> > quite some time. Therefore we could not block others by a spinlock.
> > Also others could not wait for the spin lock in a busy wait.
> 
> Hmmm?  Cancelers can synchronize amongst them using a mutex and the
> actual work item wait can use flushing.
> 
> > IMHO, the proposed and rather complex solutions are needed in both cases.
> > 
> > Or did I miss a possible trick, please?
> 
> I probably have missed something in the above and it is not completely
> correct but I do think it can be way simpler than how workqueue does
> it.

I have played with this idea and it opens a can of worms with locking
problems and it looks even more complicated.

Let me show this with a snippet of code:

struct kthread_worker {
spinlock_t  lock;
struct list_headwork_list;
struct kthread_work *current_work;
};

enum {
KTHREAD_WORK_IDLE,
KTHREAD_WORK_PENDING,
KTHREAD_WORK_CANCELING,
};

struct kthread_work {
unsigned intflags;
spinlock_t  lock;
struct list_headnode;
kthread_work_func_t func;
struct kthread_worker   *worker;
};


/* the main kthread worker cycle */
int kthread_worker_fn(void *worker_ptr)
{
struct kthread_worker *worker = worker_ptr;
struct kthread_work *work;

repeat:

work = NULL;
spin_lock_irq(&worker->lock);
if (!list_empty(&worker->work_list)) {
work = list_first_entry(&worker->work_list,
struct kthread_work, node);
spin_lock(&work->lock);
list_del_init(&work->node);
work->flags = KTHREAD_WORK_IDLE;
spin_unlock(&work->lock);
}
worker->current_work = work;
spin_unlock_irq(&worker->lock);

if (work) {
__set_current_state(TASK_RUNNING);
work->func(work);
} else if (!freezing(current))
schedule();

goto repeat;
}
EXPORT_SYMBOL_GPL(kthread_worker_fn);


stat

Re: [RFC v2 07/18] kthread: Allow to cancel kthread work

2015-10-05 Thread Petr Mladek
On Fri 2015-10-02 15:24:53, Tejun Heo wrote:
> Hello,
> 
> On Fri, Oct 02, 2015 at 05:43:36PM +0200, Petr Mladek wrote:
> > IMHO, we need both locks. The worker manipulates more works and
> > need its own lock. We need work-specific lock because the work
> > might be assigned to different workers and we need to be sure
> > that the operations are really serialized, e.g. queuing.
> 
> I don't think we need per-work lock.  Do we have such usage in kernel
> at all?  If you're worried, let the first queueing record the worker
> and trigger warning if someone tries to queue it anywhere else.  This
> doesn't need to be full-on general like workqueue.  Let's make
> reasonable trade-offs where possible.

I actually thought about this simplification as well. But then I am
in doubt about the API. It would make sense to assign the worker
when the work is being initialized and avoid the duplicate information
when the work is being queued:

init_kthread_work(work, fn, worker);
queue_work(work);

Or would you prefer to keep the API similar to workqueues even when
it makes less sense here?


In each case, we need a way to switch the worker if the old one
is destroyed and a new one is started later. We would need
something like:

reset_work(work, worker)
or
reinit_work(work, fn, worker)


Thanks for feedback.

Best Regards,
Petr


Re: [PATCH] lib/vsprintf: Do not handle %pO[^F] as %px

2018-08-07 Thread Petr Mladek
On Mon 2018-08-06 15:34:21, Bart Van Assche wrote:
> This patch avoids that gcc reports the following when building with W=1:
> 
> lib/vsprintf.c:1941:3: warning: this statement may fall through 
> [-Wimplicit-fallthrough=]
>switch (fmt[1]) {
>^~
> 
> Fixes: ce4fecf1fe15 ("vsprintf: Add %p extension "%pOF" for device tree")

To be precise, the above commit was fine because it was the last
"case" in the "switch". It got broken by the commit
7b1924a1d930eb2 ("vsprintf: add printk specifier %px").

Other than that, the patch fixes a real problem. I have updated the "Fixes"
tag, added a stable tag, and pushed it into printk.git, the for-4.19 branch.

Best Regards,
Petr


Re: [PATCH V2] riscv: Convert uses of REG_FMT to %p

2018-08-07 Thread Petr Mladek
On Sat 2018-07-28 09:39:57, Joe Perches wrote:
> Use %p pointer output instead of REG_FMT and cast the unsigned longs to
> (void *) to avoid exposing kernel addresses.
> 
> Miscellanea:
> 
> o Convert pr_cont to printk(KERN_DEFAULT as these uses are
>   new logging lines and not previous line continuations
> o Remove the now unused REG_FMT defines
> 
> Signed-off-by: Joe Perches 
> ---
> 
> v2: sigh: Add missing fault.c
> 
>  arch/riscv/include/asm/ptrace.h |  6 -
>  arch/riscv/kernel/process.c | 52 
> +
>  arch/riscv/kernel/traps.c   |  4 ++--
>  arch/riscv/mm/fault.c   |  6 ++---
>  4 files changed, 32 insertions(+), 36 deletions(-)
> 
> diff --git a/arch/riscv/include/asm/ptrace.h b/arch/riscv/include/asm/ptrace.h
> index 2c5df945d43c..b123e723f8fa 100644
> --- a/arch/riscv/include/asm/ptrace.h
> +++ b/arch/riscv/include/asm/ptrace.h
> @@ -60,12 +60,6 @@ struct pt_regs {
>  unsigned long orig_a0;
>  };
>  
> -#ifdef CONFIG_64BIT
> -#define REG_FMT "%016lx"
> -#else
> -#define REG_FMT "%08lx"
> -#endif
> -
>  #define user_mode(regs) (((regs)->sstatus & SR_SPP) == 0)
>  
>  
> diff --git a/arch/riscv/kernel/process.c b/arch/riscv/kernel/process.c
> index d7c6ca7c95ae..7223f6715ff3 100644
> --- a/arch/riscv/kernel/process.c
> +++ b/arch/riscv/kernel/process.c
> @@ -36,7 +36,7 @@
>  extern asmlinkage void ret_from_fork(void);
>  extern asmlinkage void ret_from_kernel_thread(void);
>  
> -void arch_cpu_idle(void)
> +void arch_yycpu_idle(void)
>  {
>   wait_for_interrupt();
>   local_irq_enable();
> @@ -46,31 +46,33 @@ void show_regs(struct pt_regs *regs)
>  {
>   show_regs_print_info(KERN_DEFAULT);
>  
> - pr_cont("sepc: " REG_FMT " ra : " REG_FMT " sp : " REG_FMT "\n",
> - regs->sepc, regs->ra, regs->sp);
> - pr_cont(" gp : " REG_FMT " tp : " REG_FMT " t0 : " REG_FMT "\n",
> - regs->gp, regs->tp, regs->t0);
> - pr_cont(" t1 : " REG_FMT " t2 : " REG_FMT " s0 : " REG_FMT "\n",
> - regs->t1, regs->t2, regs->s0);
> - pr_cont(" s1 : " REG_FMT " a0 : " REG_FMT " a1 : " REG_FMT "\n",
> - regs->s1, regs->a0, regs->a1);
> - pr_cont(" a2 : " REG_FMT " a3 : " REG_FMT " a4 : " REG_FMT "\n",
> - regs->a2, regs->a3, regs->a4);
> - pr_cont(" a5 : " REG_FMT " a6 : " REG_FMT " a7 : " REG_FMT "\n",
> - regs->a5, regs->a6, regs->a7);
> - pr_cont(" s2 : " REG_FMT " s3 : " REG_FMT " s4 : " REG_FMT "\n",
> - regs->s2, regs->s3, regs->s4);
> - pr_cont(" s5 : " REG_FMT " s6 : " REG_FMT " s7 : " REG_FMT "\n",
> - regs->s5, regs->s6, regs->s7);
> - pr_cont(" s8 : " REG_FMT " s9 : " REG_FMT " s10: " REG_FMT "\n",
> - regs->s8, regs->s9, regs->s10);
> - pr_cont(" s11: " REG_FMT " t3 : " REG_FMT " t4 : " REG_FMT "\n",
> - regs->s11, regs->t3, regs->t4);
> - pr_cont(" t5 : " REG_FMT " t6 : " REG_FMT "\n",
> - regs->t5, regs->t6);
> + printk(KERN_DEFAULT "sepc: %p ra : %p sp : %p\n",
> +(void *)regs->sepc, (void *)regs->ra, (void *)regs->sp);
> + printk(KERN_DEFAULT " gp : %p tp : %p t0 : %p\n",
> +(void *)regs->gp, (void *)regs->tp, (void *)regs->t0);
> + printk(KERN_DEFAULT " t1 : %p t2 : %p s0 : %p\n",
> +(void *)regs->t1, (void *)regs->t2, (void *)regs->s0);
> + printk(KERN_DEFAULT " s1 : %p a0 : %p a1 : %p\n",
> +(void *)regs->s1, (void *)regs->a0, (void *)regs->a1);
> + printk(KERN_DEFAULT " a2 : %p a3 : %p a4 : %p\n",
> +(void *)regs->a2, (void *)regs->a3, (void *)regs->a4);
> + printk(KERN_DEFAULT " a5 : %p a6 : %p a7 : %p\n",
> +(void *)regs->a5, (void *)regs->a6, (void *)regs->a7);
> + printk(KERN_DEFAULT " s2 : %p s3 : %p s4 : %p\n",
> +(void *)regs->s2, (void *)regs->s3, (void *)regs->s4);
> + printk(KERN_DEFAULT " s5 : %p s6 : %p s7 : %p\n",
> +(void *)regs->s5, (void *)regs->s6, (void *)regs->s7);
> + printk(KERN_DEFAULT " s8 : %p s9 : %p s10: %p\n",
> +(void *)regs->s8, (void *)regs->s9, (void *)regs->s10);
> + printk(KERN_DEFAULT " s11: %p t3 : %p t4 : %p\n",
> +(void *)regs->s11, (void *)regs->t3, (void *)regs->t4);
> + printk(KERN_DEFAULT " t5 : %p t6 : %p\n",
> +(void *)regs->t5, (void *)regs->t6);
>  
> - pr_cont("sstatus: " REG_FMT " sbadaddr: " REG_FMT " scause: " REG_FMT 
> "\n",
> - regs->sstatus, regs->sbadaddr, regs->scause);
> + printk(KERN_DEFAULT "sstatus: %p sbadaddr: %p scause: %p\n",
> +(void *)regs->sstatus,
> +(void *)regs->sbadaddr,
> +(void *)regs->scause);
>  }

This change makes the dump almost unusable. Note that registers contain any
kind of information, not only pointers.

My understanding is that %px was introduced because printing the
pointer directly is sometimes worth the security risk. IMHO, this
is one place where %px would be worth it.
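For illustration, a sketch of what one of the lines above could look like
if we agree that the raw values are worth it here (%px prints the pointer
without hashing):

	printk(KERN_DEFAULT "sepc: %px ra : %px sp : %px\n",
	       (void *)regs->sepc, (void *)regs->ra, (void *)regs->sp);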

Re: [PATCH RESEND] lib/test_printf.c: call wait_for_random_bytes() before plain %p tests

2018-06-25 Thread Petr Mladek
On Fri 2018-06-22 23:50:20, Thierry Escande wrote:
> On 22/06/2018 22:53, Steven Rostedt wrote:
> > On Thu, Jun 07, 2018 at 02:24:34PM +0200, Petr Mladek wrote:
> > > On Mon 2018-06-04 13:37:08, Thierry Escande wrote:
> > > > If the test_printf module is loaded before the crng is initialized, the
> > > > plain 'p' tests will fail because the printed address will not be hashed
> > > > and the buffer will contain '(ptrval)' instead.
> > > > This patch adds a call to wait_for_random_bytes() before plain 'p' tests
> > > > to make sure the crng is initialized.
> > > 
> > > Hmm, my system did not boot with this patch and
> > > CONFIG_TEST_PRINTF=y
> > 
> > And neither does my test box. It killed my tests I was running, as one of 
> > the
> > configs I test has this set.
> > 
> > It appears that Andrew pulled it in and sent it to Linus, as it is in
> > 4.18-rc1, and I need to now revert this patch to make my tests work.
> 
> This patch has been superseded with a v2 and a v3 pushed into Petr
> printk.git tree 
> (https://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk.git/commit/?h=for-4.19&id=ce041c43f22298485122bab15c14d062383fbc67).
> Sorry for the mess...

Andrew,

should I send the revert and the better fix to Linus or would you like
to do so?

Best Regards,
Petr


Re: [PATCH RESEND] lib/test_printf.c: call wait_for_random_bytes() before plain %p tests

2018-06-25 Thread Petr Mladek
On Mon 2018-06-25 09:50:20, Petr Mladek wrote:
> On Fri 2018-06-22 23:50:20, Thierry Escande wrote:
> > On 22/06/2018 22:53, Steven Rostedt wrote:
> > > On Thu, Jun 07, 2018 at 02:24:34PM +0200, Petr Mladek wrote:
> > > > On Mon 2018-06-04 13:37:08, Thierry Escande wrote:
> > > > > If the test_printf module is loaded before the crng is initialized, 
> > > > > the
> > > > > plain 'p' tests will fail because the printed address will not be 
> > > > > hashed
> > > > > and the buffer will contain '(ptrval)' instead.
> > > > > This patch adds a call to wait_for_random_bytes() before plain 'p' 
> > > > > tests
> > > > > to make sure the crng is initialized.
> > > > 
> > > > Hmm, my system did not boot with this patch and
> > > > CONFIG_TEST_PRINTF=y
> > > 
> > > And neither does my test box. It killed my tests I was running, as one of 
> > > the
> > > configs I test has this set.
> > > 
> > > It appears that Andrew pulled it in and sent it to Linus, as it is in
> > > 4.18-rc1, and I need to now revert this patch to make my tests work.
> > 
> > This patch has been superseded with a v2 and a v3 pushed into Petr
> > printk.git tree 
> > (https://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk.git/commit/?h=for-4.19&id=ce041c43f22298485122bab15c14d062383fbc67).
> > Sorry for the mess...
> 
> Andrew,
> 
> should I send the revert and the better fix to Linus or would you like
> to do so?

Below is the proposed revert-commit just in case people want to add
Reviewed-by tags or so.


From 043f891b70e6197bc181f3b087c2bd04c60fddd2 Mon Sep 17 00:00:00 2001
From: Petr Mladek 
Date: Mon, 25 Jun 2018 13:28:06 +0200
Subject: [PATCH] Revert "lib/test_printf.c: call wait_for_random_bytes()
 before plain %p tests"

This reverts commit ee410f15b1418f2f4428e79980674c979081bcb7.

It might prevent the machine from booting. It would wait for enough
randomness at the very beginning of kernel_init(). But there is
basically nothing running in parallel that would help to produce
any randomness.

Reported-by: Steven Rostedt (VMware) 
Signed-off-by: Petr Mladek 
---
 lib/test_printf.c | 7 ---
 1 file changed, 7 deletions(-)

diff --git a/lib/test_printf.c b/lib/test_printf.c
index b2aa8f514844..cea592f402ed 100644
--- a/lib/test_printf.c
+++ b/lib/test_printf.c
@@ -260,13 +260,6 @@ plain(void)
 {
int err;
 
-   /*
-* Make sure crng is ready. Otherwise we get "(ptrval)" instead
-* of a hashed address when printing '%p' in plain_hash() and
-* plain_format().
-*/
-   wait_for_random_bytes();
-
err = plain_hash();
if (err) {
pr_warn("plain 'p' does not appear to be hashed\n");
-- 
2.13.7
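
Just as an aside: the superseded v2/v3 approach mentioned above avoids
blocking boot by not waiting for entropy at all. A rough sketch of that
idea, assuming rng_is_initialized() from <linux/random.h> (this is only
an illustration, not necessarily what the v2/v3 patches do):

	/*
	 * Sketch only: skip the plain '%p' checks when the crng is not
	 * ready instead of blocking boot in wait_for_random_bytes().
	 */
	static bool __init plain_hash_ready(void)
	{
		if (rng_is_initialized())
			return true;

		pr_warn("crng not ready, skipping plain 'p' tests\n");
		return false;
	}

plain() could then simply return early when this helper returns false.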



Re: [PATCH] printk: remove unnecessary kmalloc() from syslog during clear

2018-06-25 Thread Petr Mladek
On Wed 2018-06-20 19:26:19, Namit Gupta wrote:
> When the request is only for clearing logs, there is no need for
> allocation/deallocation. Only the indexes need to be reset and returned.
> The rest of the patch is mostly made up of changes because of indentation.
> 
> Signed-off-by: Namit Gupta 
> Signed-off-by: Himanshu Maithani 

> ---
>  kernel/printk/printk.c | 111 ++---
>  1 file changed, 60 insertions(+), 51 deletions(-)
> 
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index 512f7c2..53952ce 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -1348,71 +1348,80 @@ static int syslog_print_all(char __user *buf, int size, bool clear)
>  {
>   char *text;
>   int len = 0;
> + u64 next_seq;
> + u64 seq;
> + u32 idx;
> +
> + if (!buf) {
> + if (clear) {
> + logbuf_lock_irq();
> + clear_seq = log_next_seq;
> + clear_idx = log_next_idx;
> + logbuf_unlock_irq();

I pushed a slightly different version into printk.git, branch for-4.19,
see below. It removes the code duplication and keeps the original
indentation. IMHO, this helps to better distinguish the code for printing
from the code for clearing.

It is rather a cosmetic change, so I do not want to ask you to resend
Reviewed-by tags. But feel free to disagree and ask me to use
the original variant.


This is in printk.git now:

From 41cb6dcedd9257d51fd310bf9b2958d11d93aa2b Mon Sep 17 00:00:00 2001
From: Namit Gupta 
Date: Mon, 25 Jun 2018 14:58:05 +0200
Subject: [PATCH] printk: remove unnecessary kmalloc() from syslog during
 clear

When the request is only for clearing logs, there is no need for
allocation/deallocation. Only the indexes need to be reset and returned.
The rest of the patch is mostly made up of changes because of indentation.

Link: http://lkml.kernel.org/r/20180620135951epcas5p3bd2a8f25ec689ca333bce861b527dba2~54wykct0_3155531555epcas5...@epcas5p3.samsung.com
Cc: linux-kernel@vger.kernel.org
Cc: panka...@samsung.com
Cc: a.sahra...@samsung.com
Signed-off-by: Namit Gupta 
Signed-off-by: Himanshu Maithani 
Reviewed-by: Steven Rostedt (VMware) 
Reviewed-by: Sergey Senozhatsky 
[pmla...@suse.com: Removed code duplication.]
Signed-off-by: Petr Mladek 
---
 kernel/printk/printk.c | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 247808333ba4..0fa2ca6fd8f9 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1350,12 +1350,14 @@ static int syslog_print(char __user *buf, int size)
 
 static int syslog_print_all(char __user *buf, int size, bool clear)
 {
-   char *text;
+   char *text = NULL;
int len = 0;
 
-   text = kmalloc(LOG_LINE_MAX + PREFIX_MAX, GFP_KERNEL);
-   if (!text)
-   return -ENOMEM;
+   if (buf) {
+   text = kmalloc(LOG_LINE_MAX + PREFIX_MAX, GFP_KERNEL);
+   if (!text)
+   return -ENOMEM;
+   }
 
logbuf_lock_irq();
if (buf) {
@@ -1426,7 +1428,8 @@ static int syslog_print_all(char __user *buf, int size, bool clear)
}
logbuf_unlock_irq();
 
-   kfree(text);
+   if (text)
+   kfree(text);
return len;
 }
 
-- 
2.13.7
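
For reference, the buf == NULL && clear case handled above corresponds to
a plain "clear the log" request from user space (what e.g. `dmesg -C`
issues). A minimal, hypothetical user-space trigger using the glibc
klogctl() wrapper could look like:

	#include <stdio.h>
	#include <sys/klog.h>

	/* Action 5 is "clear", see the syslog(2) man page. */
	#define SYSLOG_ACTION_CLEAR	5

	int main(void)
	{
		/* buf == NULL, len == 0: only the clear branch is exercised. */
		if (klogctl(SYSLOG_ACTION_CLEAR, NULL, 0) != 0) {
			perror("klogctl");
			return 1;
		}
		return 0;
	}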


Re: [PATCH v2] printk: make sure to print log on console.

2018-06-25 Thread Petr Mladek
On Wed 2018-06-20 10:55:25, Sergey Senozhatsky wrote:
> On (06/19/18 12:52), Petr Mladek wrote:
> > > But when I set /sys/module/printk/parameters/ignore_loglevel I naturally
> > > expect it to take an immediate action. Without waiting for the consoles
> > > to catch up and to discard N messages [if the consoles were behind the
> > > logbuf head].
> > 
> > Yeah, I understand this view. I thought about it as well. But did you
> > ever need this behavior in real life?
> >
> > I personally changed ignore_loglevel only before I wanted to reproduce a
> > bug. Then it would be perfectly fine to handle it only in
> > vprintk_emit(). In fact, it would be even better because it would
> > affect only messages that happened after I triggered the bug.
> 
> So maybe the patch can stand the way it is, after all. JFI, still haven't
> seen those "helps in real life a lot" examples, tho.

I have personally seen these races when testing printk in NMI. I
combined iptables logging, ping -f and sysrq-l. I am not sure
how often they happen in real life but I can understand
that it might be annoying.

This patch goes in the right direction and nobody really blocks
it. Therefore I pushed it into printk.git, branch for-4.19.

Best Regards,
Petr


Re: [PATCH] printk: Make CONSOLE_LOGLEVEL_QUIET configurable

2018-06-25 Thread Petr Mladek
On Wed 2018-06-20 15:37:47, Hans de Goede wrote:
> Hi,
> 
> On 20-06-18 13:03, Petr Mladek wrote:
> > On Tue 2018-06-19 13:57:26, Hans de Goede wrote:
> > > The goal of passing the "quiet" option to the kernel is for the kernel
> > > to be quiet unless something really is wrong.
> > > 
> > > So far passing quiet has been (mostly) equivalent to passing
> > > loglevel=4 on the kernel commandline. Which means to show any messages
> > > with a level of KERN_ERR or higher severity on the console.
> > > 
> > > In practice this often does not result in a quiet boot though, since
> > > there are many false-positive or otherwise harmless error messages printed,
> > > defeating the purpose of the quiet option. Esp. the ACPICA code is really
> > > bad wrt this, but there are plenty of others too.
> > 
> > I see your pain. But this sounds like a workaround for a broken code.
> > This change might just encourage people to create even more mess.
> 
> I've been submitting patches upstream to fix false-positive KERN_ERR
> messages for more than a year now and getting a KERN_ERR free kernel
> (on more than 1 specific model hw) is just undoable. Every release some
> new nonsense error comes up, like e.g.:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1568276
> 
> Besides this random KERN_ERR cases (of which there are plenty
> by themselves) I've also had long discussions with the ACPICA upstream
> maintainers, but they refuse to change this instead insisting that:
> 
> a) Vendors should fix their DSDTs to be perfect; and
> b) end-users should then update their BIOS to fix this
> 
> Neither of which is a realistic expectation in any way.

Thanks for the many examples. It would help me to argue if anyone
later complains about this change ;-)


> > > This commit makes CONSOLE_LOGLEVEL_QUIET configurable.
> > > 
> > > This for example will allow distros which want quiet to really mean quiet
> > > to set CONSOLE_LOGLEVEL_QUIET so that only messages with a higher severity
> > > than KERN_ERR (CRIT, ALERT, EMERG) get printed, avoiding an endless game
> > > of whack-a-mole silencing harmless error messages.
> > 
> > I find it a bit confusing that "quiet" would mean something different
> > on different systems.
> 
> The kernel is so configurable already that I don't think this really is much
> of an issue, quiet will still mean quiet on all systems, some might just
> be a tad more quiet (or actually be quiet) compared to others.

Just for the record: some people actually do a lot to avoid adding new
configuration options. The many possibilities cause trouble for users
and even experienced developers.

A common argument against adding new options is: "How could users
choose a reasonable value when even experts (maintainers, developers)
are not able to agree on it?"


> I went with making this configurable because I expect that to
> be a controversial change.

Exactly, I thought about changing the default. But it might just
bring a lot of bike-shedding.

I do not see any other good option. More people like this patch.
So I pushed it into printk.git, branch for-4.19.
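
Just for context, the mechanics behind "quiet" are simple: it is an early
boot parameter that sets console_loglevel to CONSOLE_LOGLEVEL_QUIET, and
this patch turns that constant into a Kconfig-driven value. Roughly (a
sketch of the existing handling in init/main.c, shown only for
illustration, not quoted from the tree):

	static int __init quiet_kernel(char *str)
	{
		/* "quiet": show only messages more severe than this level. */
		console_loglevel = CONSOLE_LOGLEVEL_QUIET;
		return 0;
	}
	early_param("quiet", quiet_kernel);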

Best Regards,
Petr


[PATCH v12 00/12] livepatch: Atomic replace feature

2018-08-28 Thread Petr Mladek

The atomic replace feature allows creating cumulative patches. They
are useful when you maintain many livepatches and want to remove
one that is lower on the stack. It is also very useful when
several patches touch the same function and there are dependencies
between them.

This version does another big refactoring based on feedback against
v11[*]. In particular, it removes the registration step, changes
the API and handling of livepatch dependencies. The aim is
to keep the number of possible variants on a sane level.
It helps to keep the feature "easy" to use and maintain.

[*] https://lkml.kernel.org/r/20180323120028.31451-1-pmla...@suse.com


Changes against v11:

  + Functional changes:

+ Livepatches get automatically unregistered when disabled.
  Note that the sysfs interface disappears at this point.
  It simplifies the API and code. The only drawback is that
  the patch can be enabled again only by reloading the module.

+ Refuse to load conflicting patches. The same function can
  be patched again only by a new cumulative patch that
  replaces all older ones.

+ Non-conflicting patches can be loaded and disabled in any
  order.
  

  + API related changes:

 + Change void *new_func -> unsigned long new_addr in
   struct klp_func.

 + Several new macros to hide implementation details and
   avoid casting when defining struct klp_func and klp_object.

 + Remove the obsolete klp_register_patch() and klp_unregister_patch() API


  + Change in selftest against v4:

 + Use new macros to define struct klp_func and klp_object.

 + Remove klp_register_patch()/klp_unregister_patch() calls.

 + Replace load_mod() + wait_for_transition() with three
   variants load_mod(), load_lp(), load_lp_nowait(). IMHO,
   it is easier to use because we need to detect the end
   of the transition another way after disable_lp() now.

 + Replaced unload_mod() with two variants unload_mod(),
   unload_lp() to match the above change.

 + Wait for the end of transition in disable_lp()
   instead of the unreliable check of the sysfs interface.

 Note that I did not touch the logs with expected result.
 They stay exactly the same as in v4 posted by Joe.
 I hope that it is a good sign ;-)


Changes against v10:

  + Bug fixes and functional changes:
+ Handle Nops in klp_ftrace_handler() to avoid an infinite loop [Mirek]
+ Really add dynamically allocated klp_object into the list [Petr]
+ Clear patch->replace when transition finishes [Josh]

  + Refactoring and clean up [Josh]:
+ Replace enum types with bools
+ Avoid using ERR_PTR
+ Remove too paranoid warnings
+ Distinguish registered patches by a flag instead of a list
+ Squash some functions
+ Update comments, documentation, and commit messages
+ Squashed and split patches to do more controversial changes later

Changes against v9:

  + Fixed check of valid NOPs for already loaded objects,
regression introduced in v9 [Joe, Mirek]
  + Allow to replace even disabled patches [Evgenii]

Changes against v8:

  + Fixed handling of statically defined struct klp_object
with empty array of functions [Joe, Mirek]
  + Removed redundant func->new_func assignment for NOPs [Mirek]
  + Improved some wording [Mirek]

Changes against v7:

  + Fixed handling of NOPs for not-yet-loaded modules
  + Made klp_replaced_patches list static [Mirek]
  + Made klp_free_object() public later [Mirek]
  + Fixed several reported typos [Mirek, Joe]
  + Updated documentation according to the feedback [Joe]
  + Added some Acks [Mirek]

Changes against v6:

  + used list_move when disabling replaced patches [Jason]
  + renamed KLP_FUNC_ORIGINAL -> KLP_FUNC_STATIC [Mirek]
  + used klp_is_func_type() in klp_unpatch_object() [Mirek]
  + moved static definition of klp_get_or_add_object() [Mirek]
  + updated comment about synchronization in forced mode [Mirek]
  + added user documentation
  + fixed several typos


Jason Baron (2):
  livepatch: Use lists to manage patches, objects and functions
  livepatch: Add atomic replace

Joe Lawrence (1):
  selftests/livepatch: introduce tests

Petr Mladek (9):
  livepatch: Change void *new_func -> unsigned long new_addr in struct
klp_func
  livepatch: Helper macros to define livepatch structures
  livepatch: Shuffle klp_enable_patch()/klp_disable_patch() code
  livepatch: Consolidate klp_free functions
  livepatch: Refuse to unload only livepatches available during a forced
transition
  livepatch: Simplify API by removing registration step
  livepatch: Remove Nop structures when unused
  livepatch: Atomic replace and cumulative patches documentation
  livepatch: Remove ordering and refuse loading conflicting patches

 Documentation/livepatch/callbacks.txt  | 489 +---
 Documentation/livepatch/cumulative-patches.txt | 105 +++
 Documentatio

[PATCH v12 02/12] livepatch: Helper macros to define livepatch structures

2018-08-28 Thread Petr Mladek
The definition of struct klp_func might be a bit confusing.
The original function is defined by name as a string.
The new function is defined by a function pointer
cast to unsigned long.

This patch adds helper macros that hide the different types.
The functions are defined just by the name. For example:

static struct klp_func funcs[] = {
{
.old_name = "function_A",
.new_addr = (unsigned long)livepatch_function_A,
}, {
.old_name = "function_B",
.new_addr = (unsigned long)livepatch_function_B,
}, { }
};

can be defined as:

static struct klp_func funcs[] = {
KLP_FUNC(function_A,
 livepatch_function_A),
KLP_FUNC(function_B,
 livepatch_function_B),
KLP_FUNC_END
};

Just for completeness, this patch adds similar macros to define
struct klp_object. For example,

static struct klp_object objs[] = {
{
/* name being NULL means vmlinux */
.funcs = funcs_vmlinux,
}, {
.name = "module_A",
.funcs = funcs_module_A,
}, {
.name = "module_B",
.funcs = funcs_module_B,
}, { }
};

can be defined as:

static struct klp_object objs[] = {
KLP_VMLINUX(funcs_vmlinux),
KLP_OBJECT(module_A,
   funcs_module_A),
KLP_OBJECT(module_B,
   funcs_module_B),
    KLP_OBJECT_END
};
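
Side note (not part of the patch description): the callback variants added
below can be used the same way. The demo_* callback names here are made up
purely for illustration:

static struct klp_object objs[] = {
        KLP_VMLINUX(funcs_vmlinux),
        KLP_OBJECT_CALLBACKS(module_A, funcs_module_A,
                             demo_pre_patch, demo_post_patch,
                             demo_pre_unpatch, demo_post_unpatch),
        KLP_OBJECT_END
};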

Signed-off-by: Petr Mladek 
---
 include/linux/livepatch.h| 40 
 samples/livepatch/livepatch-callbacks-demo.c | 55 +++-
 samples/livepatch/livepatch-sample.c | 13 +++
 samples/livepatch/livepatch-shadow-fix1.c| 20 --
 samples/livepatch/livepatch-shadow-fix2.c| 20 --
 5 files changed, 83 insertions(+), 65 deletions(-)

diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h
index 817a737b49e8..1163742b27c0 100644
--- a/include/linux/livepatch.h
+++ b/include/linux/livepatch.h
@@ -152,6 +152,46 @@ struct klp_patch {
struct completion finish;
 };
 
+#define KLP_FUNC(_old_func, _new_func) {   \
+   .old_name = #_old_func, \
+   .new_addr = (unsigned long)(_new_func), \
+   }
+#define KLP_FUNC_POS(_old_func, _new_func, _sympos) {  \
+   .old_name = #_old_func, \
+   .new_addr = (unsigned long)_new_func,   \
+   .sympos = _sympos,  \
+   }
+#define KLP_FUNC_END { }
+
+#define KLP_OBJECT(_obj, _funcs) { \
+   .name = #_obj,  \
+   .funcs = _funcs,\
+   }
+#define KLP_OBJECT_CALLBACKS(_obj, _funcs, \
+_pre_patch, _post_patch,   \
+_pre_unpatch, _post_unpatch) { \
+   .name = #_obj,  \
+   .funcs = _funcs,\
+   .callbacks.pre_patch = _pre_patch,  \
+   .callbacks.post_patch = _post_patch,\
+   .callbacks.pre_unpatch = _pre_unpatch,  \
+   .callbacks.post_unpatch = _post_unpatch,\
+   }
+/* name being NULL means vmlinux */
+#define KLP_VMLINUX(_funcs) {  \
+   .funcs = _funcs,\
+   }
+#define KLP_VMLINUX_CALLBACKS(_funcs,  \
+_pre_patch, _post_patch,   \
+_pre_unpatch, _post_unpatch) { \
+   .funcs = _funcs,\
+   .callbacks.pre_patch = _pre_patch,  \
+   .callbacks.post_patch = _post_patch,\
+   .callbacks.pre_unpatch = _pre_unpatch,  \
+   .callbacks.post_unpatch = _post_unpatch,\
+   }
+#define KLP_OBJECT_END { }
+
 #define klp_for_each_object(patch, obj) \
for (obj = patch->objs; obj->funcs || obj->name; obj++)
 
diff --git a/samples/livepatch/livepatch-callbacks-demo.c b/samples/livepatch/livepatch-callbacks-demo.c
index 4b1aec474bb7..001a0c672251 100644
--- a/samples/livepatch/livepatch-callbacks-demo.c
+++ b/samples/livepatch/livepatch-callbacks-demo.c
@@ -147,45 +147,34 @@ static void patched_work_func(struct work_struct *work)
 }
 
 static struct klp_func no_funcs[] = {
-   { }
+   KLP_FUNC_END
 };
 
 static struct klp_func busymod_funcs[] = {
-   {
-   .old_name = "busymod_work_func",
-   .new_addr = (unsigned long)patched_w

[PATCH v12 01/12] livepatch: Change void *new_func -> unsigned long new_addr in struct klp_func

2018-08-28 Thread Petr Mladek
The addresses of the function to be patched and of the new function
are stored in struct klp_func as:

void *new_func;
unsigned long old_addr;

The different naming scheme and type are derived from the way
the addresses are set. @old_addr is assigned at runtime using
kallsyms-based search. @new_func is statically initialized,
for example:

  static struct klp_func funcs[] = {
{
.old_name = "cmdline_proc_show",
.new_func = livepatch_cmdline_proc_show,
}, { }
  };

This patch changes void *new_func -> unsigned long new_addr. It removes
some confusion when these addresses are later used in the code. It is
motivated by a follow-up patch that adds special NOP struct klp_func
entries where we want to assign func->new_addr = func->old_addr
(with the old naming, func->new_func = func->old_addr).

This patch does not modify the existing behavior.

IMPORTANT: This patch modifies ABI. The patches will need to use,
for example:

  static struct klp_func funcs[] = {
{
.old_name = "cmdline_proc_show",
.new_addr = (unsigned long)livepatch_cmdline_proc_show,
}, { }
  };

Suggested-by: Josh Poimboeuf 
Signed-off-by: Petr Mladek 
---
 include/linux/livepatch.h| 6 +++---
 kernel/livepatch/core.c  | 4 ++--
 kernel/livepatch/patch.c | 2 +-
 kernel/livepatch/transition.c| 4 ++--
 samples/livepatch/livepatch-callbacks-demo.c | 2 +-
 samples/livepatch/livepatch-sample.c | 2 +-
 samples/livepatch/livepatch-shadow-fix1.c| 4 ++--
 samples/livepatch/livepatch-shadow-fix2.c| 4 ++--
 8 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h
index aec44b1d9582..817a737b49e8 100644
--- a/include/linux/livepatch.h
+++ b/include/linux/livepatch.h
@@ -37,7 +37,7 @@
 /**
  * struct klp_func - function structure for live patching
  * @old_name:  name of the function to be patched
- * @new_func:  pointer to the patched function code
+ * @new_addr:  address of the new function (function pointer)
  * @old_sympos: a hint indicating which symbol position the old function
  * can be found (optional)
  * @old_addr:  the address of the function being patched
@@ -66,7 +66,7 @@
 struct klp_func {
/* external */
const char *old_name;
-   void *new_func;
+   unsigned long new_addr;
/*
 * The old_sympos field is optional and can be used to resolve
 * duplicate symbol names in livepatch objects. If this field is zero,
@@ -157,7 +157,7 @@ struct klp_patch {
 
 #define klp_for_each_func(obj, func) \
for (func = obj->funcs; \
-func->old_name || func->new_func || func->old_sympos; \
+func->old_name || func->new_addr || func->old_sympos; \
 func++)
 
 int klp_register_patch(struct klp_patch *);
diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
index 5b77a7314e01..577ebeb43024 100644
--- a/kernel/livepatch/core.c
+++ b/kernel/livepatch/core.c
@@ -675,7 +675,7 @@ static void klp_free_patch(struct klp_patch *patch)
 
 static int klp_init_func(struct klp_object *obj, struct klp_func *func)
 {
-   if (!func->old_name || !func->new_func)
+   if (!func->old_name || !func->new_addr)
return -EINVAL;
 
if (strlen(func->old_name) >= KSYM_NAME_LEN)
@@ -733,7 +733,7 @@ static int klp_init_object_loaded(struct klp_patch *patch,
return -ENOENT;
}
 
-   ret = kallsyms_lookup_size_offset((unsigned long)func->new_func,
+   ret = kallsyms_lookup_size_offset(func->new_addr,
  &func->new_size, NULL);
if (!ret) {
pr_err("kallsyms size lookup failed for '%s' replacement\n",
diff --git a/kernel/livepatch/patch.c b/kernel/livepatch/patch.c
index 82d584225dc6..82927f59d3ff 100644
--- a/kernel/livepatch/patch.c
+++ b/kernel/livepatch/patch.c
@@ -118,7 +118,7 @@ static void notrace klp_ftrace_handler(unsigned long ip,
}
}
 
-   klp_arch_set_pc(regs, (unsigned long)func->new_func);
+   klp_arch_set_pc(regs, func->new_addr);
 unlock:
preempt_enable_notrace();
 }
diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
index 5bc349805e03..982a2e4c6120 100644
--- a/kernel/livepatch/transition.c
+++ b/kernel/livepatch/transition.c
@@ -217,7 +217,7 @@ static int klp_check_stack_func(struct klp_func *func,
  * Check for the to-be-unpatched function
  * (the func itself).
  */
-   func_addr = (unsigned long)func->new_func;
+   func_addr = func->new_addr;
   

[PATCH v12 03/12] livepatch: Shuffle klp_enable_patch()/klp_disable_patch() code

2018-08-28 Thread Petr Mladek
We are going to simplify the API and code by removing the registration
step. This would require calling init/free functions from enable/disable
ones.

This patch just moves the code to avoid the need for more forward
declarations.

It does not change the code except for the two forward declarations.

Signed-off-by: Petr Mladek 
---
 kernel/livepatch/core.c | 330 
 1 file changed, 166 insertions(+), 164 deletions(-)

diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
index 577ebeb43024..b3956cce239e 100644
--- a/kernel/livepatch/core.c
+++ b/kernel/livepatch/core.c
@@ -278,170 +278,6 @@ static int klp_write_object_relocations(struct module *pmod,
return ret;
 }
 
-static int __klp_disable_patch(struct klp_patch *patch)
-{
-   struct klp_object *obj;
-
-   if (WARN_ON(!patch->enabled))
-   return -EINVAL;
-
-   if (klp_transition_patch)
-   return -EBUSY;
-
-   /* enforce stacking: only the last enabled patch can be disabled */
-   if (!list_is_last(&patch->list, &klp_patches) &&
-   list_next_entry(patch, list)->enabled)
-   return -EBUSY;
-
-   klp_init_transition(patch, KLP_UNPATCHED);
-
-   klp_for_each_object(patch, obj)
-   if (obj->patched)
-   klp_pre_unpatch_callback(obj);
-
-   /*
-* Enforce the order of the func->transition writes in
-* klp_init_transition() and the TIF_PATCH_PENDING writes in
-* klp_start_transition().  In the rare case where klp_ftrace_handler()
-* is called shortly after klp_update_patch_state() switches the task,
-* this ensures the handler sees that func->transition is set.
-*/
-   smp_wmb();
-
-   klp_start_transition();
-   klp_try_complete_transition();
-   patch->enabled = false;
-
-   return 0;
-}
-
-/**
- * klp_disable_patch() - disables a registered patch
- * @patch: The registered, enabled patch to be disabled
- *
- * Unregisters the patched functions from ftrace.
- *
- * Return: 0 on success, otherwise error
- */
-int klp_disable_patch(struct klp_patch *patch)
-{
-   int ret;
-
-   mutex_lock(&klp_mutex);
-
-   if (!klp_is_patch_registered(patch)) {
-   ret = -EINVAL;
-   goto err;
-   }
-
-   if (!patch->enabled) {
-   ret = -EINVAL;
-   goto err;
-   }
-
-   ret = __klp_disable_patch(patch);
-
-err:
-   mutex_unlock(&klp_mutex);
-   return ret;
-}
-EXPORT_SYMBOL_GPL(klp_disable_patch);
-
-static int __klp_enable_patch(struct klp_patch *patch)
-{
-   struct klp_object *obj;
-   int ret;
-
-   if (klp_transition_patch)
-   return -EBUSY;
-
-   if (WARN_ON(patch->enabled))
-   return -EINVAL;
-
-   /* enforce stacking: only the first disabled patch can be enabled */
-   if (patch->list.prev != &klp_patches &&
-   !list_prev_entry(patch, list)->enabled)
-   return -EBUSY;
-
-   /*
-* A reference is taken on the patch module to prevent it from being
-* unloaded.
-*/
-   if (!try_module_get(patch->mod))
-   return -ENODEV;
-
-   pr_notice("enabling patch '%s'\n", patch->mod->name);
-
-   klp_init_transition(patch, KLP_PATCHED);
-
-   /*
-* Enforce the order of the func->transition writes in
-* klp_init_transition() and the ops->func_stack writes in
-* klp_patch_object(), so that klp_ftrace_handler() will see the
-* func->transition updates before the handler is registered and the
-* new funcs become visible to the handler.
-*/
-   smp_wmb();
-
-   klp_for_each_object(patch, obj) {
-   if (!klp_is_object_loaded(obj))
-   continue;
-
-   ret = klp_pre_patch_callback(obj);
-   if (ret) {
-   pr_warn("pre-patch callback failed for object '%s'\n",
-   klp_is_module(obj) ? obj->name : "vmlinux");
-   goto err;
-   }
-
-   ret = klp_patch_object(obj);
-   if (ret) {
-   pr_warn("failed to patch object '%s'\n",
-   klp_is_module(obj) ? obj->name : "vmlinux");
-   goto err;
-   }
-   }
-
-   klp_start_transition();
-   klp_try_complete_transition();
-   patch->enabled = true;
-
-   return 0;
-err:
-   pr_warn("failed to enable patch '%s'\n", patch->mod->name);
-
-   klp_cancel_transition();
-   return ret;
-}
-
-/**
- * klp_enable_patch() - enables a registered patch
- * @patch: The registered, d

[PATCH v12 05/12] livepatch: Refuse to unload only livepatches available during a forced transition

2018-08-28 Thread Petr Mladek
module_put() is currently never called in klp_complete_transition() when
klp_force is set. As a result, we might keep the reference count even when
klp_enable_patch() fails and klp_cancel_transition() is called.

This might suggest that a module could get blocked in some
strange init state. Fortunately, it is not the case. The reference count
is ignored when mod->init fails and erroneous modules are always removed.

Anyway, this might cause some confusion. Instead, this patch moves
the global klp_forced flag into struct klp_patch. As a result,
we block only modules that might still be in use after a forced
transition. Newly loaded livepatches might be eventually completely
removed later.

It is not a big deal. But the code is at least consistent with
the reality.

Signed-off-by: Petr Mladek 
---
 include/linux/livepatch.h |  2 ++
 kernel/livepatch/core.c   |  4 +++-
 kernel/livepatch/core.h   |  1 +
 kernel/livepatch/transition.c | 10 +-
 4 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h
index 22e0767d64b0..86b484b39326 100644
--- a/include/linux/livepatch.h
+++ b/include/linux/livepatch.h
@@ -138,6 +138,7 @@ struct klp_object {
  * @list:  list node for global list of registered patches
  * @kobj:  kobject for sysfs resources
  * @enabled:   the patch is enabled (but operation may be incomplete)
+ * @forced:was involved in a forced transition
  * @wait_free: wait until the patch is freed
  * @finish:for waiting till it is safe to remove the patch module
  */
@@ -150,6 +151,7 @@ struct klp_patch {
struct list_head list;
struct kobject kobj;
bool enabled;
+   bool forced;
bool wait_free;
struct completion finish;
 };
diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
index 3ca404545150..18af1dc0e199 100644
--- a/kernel/livepatch/core.c
+++ b/kernel/livepatch/core.c
@@ -45,7 +45,8 @@
  */
 DEFINE_MUTEX(klp_mutex);
 
-static LIST_HEAD(klp_patches);
+/* Registered patches */
+LIST_HEAD(klp_patches);
 
 static struct kobject *klp_root_kobj;
 
@@ -660,6 +661,7 @@ static int klp_init_patch(struct klp_patch *patch)
mutex_lock(&klp_mutex);
 
patch->enabled = false;
+   patch->forced = false;
INIT_LIST_HEAD(&patch->list);
init_completion(&patch->finish);
 
diff --git a/kernel/livepatch/core.h b/kernel/livepatch/core.h
index 48a83d4364cf..d0cb5390e247 100644
--- a/kernel/livepatch/core.h
+++ b/kernel/livepatch/core.h
@@ -5,6 +5,7 @@
 #include 
 
 extern struct mutex klp_mutex;
+extern struct list_head klp_patches;
 
 static inline bool klp_is_object_loaded(struct klp_object *obj)
 {
diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
index 982a2e4c6120..30a28634c88c 100644
--- a/kernel/livepatch/transition.c
+++ b/kernel/livepatch/transition.c
@@ -33,8 +33,6 @@ struct klp_patch *klp_transition_patch;
 
 static int klp_target_state = KLP_UNDEFINED;
 
-static bool klp_forced = false;
-
 /*
  * This work can be performed periodically to finish patching or unpatching any
  * "straggler" tasks which failed to transition in the first attempt.
@@ -137,10 +135,10 @@ static void klp_complete_transition(void)
  klp_target_state == KLP_PATCHED ? "patching" : "unpatching");
 
/*
-* klp_forced set implies unbounded increase of module's ref count if
+* patch->forced set implies unbounded increase of module's ref count if
 * the module is disabled/enabled in a loop.
 */
-   if (!klp_forced && klp_target_state == KLP_UNPATCHED)
+   if (!klp_transition_patch->forced && klp_target_state == KLP_UNPATCHED)
module_put(klp_transition_patch->mod);
 
klp_target_state = KLP_UNDEFINED;
@@ -620,6 +618,7 @@ void klp_send_signals(void)
  */
 void klp_force_transition(void)
 {
+   struct klp_patch *patch;
struct task_struct *g, *task;
unsigned int cpu;
 
@@ -633,5 +632,6 @@ void klp_force_transition(void)
for_each_possible_cpu(cpu)
klp_update_patch_state(idle_task(cpu));
 
-   klp_forced = true;
+   list_for_each_entry(patch, &klp_patches, list)
+   patch->forced = true;
 }
-- 
2.13.7



[PATCH v12 08/12] livepatch: Add atomic replace

2018-08-28 Thread Petr Mladek
From: Jason Baron 

Sometimes we would like to revert a particular fix. Currently, this
is not easy because we want to keep all other fixes active and we
could revert only the last applied patch.

One solution would be to apply a new patch that implements all
the reverted functions as they were in the original code. It would work
as expected but there would be unnecessary redirections. In addition,
it would also require knowing which functions need to be reverted at
build time.

Another problem is when there are many patches that touch the same
functions. There might be dependencies between patches that are
not enforced on the kernel side. Also it might be pretty hard to
actually prepare the patch and ensure compatibility with the other
patches.

Atomic replace && cumulative patches:

A better solution would be to create a cumulative patch and say that
it replaces all older ones.

This patch adds a new "replace" flag to struct klp_patch. When it is
enabled, a set of 'nop' klp_func will be dynamically created for all
functions that are already being patched but that will no longer be
modified by the new patch. They are used as a new target during
the patch transition.

The idea is to handle Nops' structures like the static ones. When
the dynamic structures are allocated, we initialize all values that
are normally statically defined.

The only exception is "new_addr" in struct klp_func. It has to point
to the original function and the address is known only when the object
(module) is loaded. Note that we really need to set it. The address is
used, for example, in klp_check_stack_func().

Nevertheless we still need to distinguish the dynamically allocated
structures in some operations. For this, we add "nop" flag into
struct klp_func and "dynamic" flag into struct klp_object. They
need special handling in the following situations:

  + The structures are added into the lists of objects and functions
immediately. In fact, the lists were created for this purpose.

  + The address of the original function is known only when the patched
object (module) is loaded. Therefore it is copied later in
klp_init_object_loaded().

  + The ftrace handler must not set PC to func->new_addr. It would cause
an infinite loop because the address points back to the beginning of
the original function (see the sketch after this list).

  + The various free() functions must free the structure itself.
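
A sketch of that handler change (illustration only; the "nop" flag is the
one introduced by this patch, klp_arch_set_pc() is the existing helper):

	/* In klp_ftrace_handler(), once the right func has been picked: */
	if (func->nop)
		goto unlock;	/* let the original function run */

	klp_arch_set_pc(regs, func->new_addr);
unlock:
	preempt_enable_notrace();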

Note that other ways to detect the dynamic structures are not considered
safe. For example, even a statically defined struct klp_object might
include an empty funcs array. It might be there just to run some callbacks.

Special callbacks handling:

The callbacks from the replaced patches are intentionally _not_ called.
It would be pretty hard to define reasonable semantics and implement them.

It might even be counter-productive. The new patch is cumulative. It is
supposed to include most of the changes from older patches. In most cases,
it will not want to call the pre_unpatch()/post_unpatch() callbacks from
the replaced patches. It would disable/break things for no good reason.
Also it should be easier to handle various scenarios in a single script
in the new patch than to think about interactions caused by running many
scripts from older patches. Not to mention that the old scripts would
not even expect to be called in this situation.

Removing replaced patches:

One nice effect of the cumulative patches is that the code from the
older patches is no longer used. Therefore the replaced patches can
be removed. It has several advantages:

  + Nops' structs will no longer be necessary and might be removed.
This would save memory, restore performance (no ftrace handler),
and allow a clear view of what is really patched.

  + Disabling the patch will cause the original code to be used everywhere.
Therefore the livepatch callbacks need to handle only one scenario.
Note that the situation is already complex enough when the patch
gets enabled. It is currently solved by calling callbacks only from
the new cumulative patch.

  + The state is clean in both the sysfs interface and lsmod. The modules
with the replaced livepatches might even get removed from the system.

Some people actually expected this behavior from the beginning. After all
a cumulative patch is supposed to "completely" replace an existing one.
It is like when a new version of an application replaces an older one.

This patch does the first step. It removes the replaced patches from
the list of patches. It is safe. The consistency model ensures that
they are no longer used. In other words, each process works only with
the structures from klp_transition_patch.

The removal is done by a special function. It combines actions done by
__disable_patch() and klp_complete_transition(). But it is a fast
track without all the transaction-related stuff.

Signed-off-by: Jason Baron 
[pmla...@suse.com: Split, reuse existing cod

[PATCH v12 10/12] livepatch: Atomic replace and cumulative patches documentation

2018-08-28 Thread Petr Mladek
User documentation for the atomic replace feature. It makes it easier
to maintain livepatches using so-called cumulative patches.

Signed-off-by: Petr Mladek 
---
 Documentation/livepatch/cumulative-patches.txt | 105 +
 1 file changed, 105 insertions(+)
 create mode 100644 Documentation/livepatch/cumulative-patches.txt

diff --git a/Documentation/livepatch/cumulative-patches.txt b/Documentation/livepatch/cumulative-patches.txt
new file mode 100644
index ..206b7f98d270
--- /dev/null
+++ b/Documentation/livepatch/cumulative-patches.txt
@@ -0,0 +1,105 @@
+===
+Atomic Replace & Cumulative Patches
+===
+
+There might be dependencies between livepatches. If multiple patches need
+to do different changes to the same function(s) then we need to define
+an order in which the patches will be installed. And function implementations
+from any newer livepatch must be done on top of the older ones.
+
+This might become a maintenance nightmare. Especially if anyone would want
+to remove a patch that is in the middle of the stack.
+
+An elegant solution comes with the feature called "Atomic Replace". It allows
+to create so called "Cumulative Patches". They include all wanted changes
+from all older livepatches and completely replace them in one transition.
+
+Usage
+-
+
+The atomic replace can be enabled by setting "replace" flag in struct klp_patch,
+for example:
+
+   static struct klp_patch patch = {
+   .mod = THIS_MODULE,
+   .objs = objs,
+   .replace = true,
+   };
+
+Such a patch is added on top of the livepatch stack when registered. It can
+be enabled even when some earlier patches have not been enabled yet.
+
+All processes are then migrated to use the code only from the new patch.
+Once the transition is finished, all older patches are removed from the stack
+of patches. Even the older not-enabled patches mentioned above. They can
+even be unregistered and the related modules unloaded.
+
+Ftrace handlers are transparently removed from functions that are no
+longer modified by the new cumulative patch.
+
+As a result, the livepatch authors might maintain sources only for one
+cumulative patch. It helps to keep the patch consistent while adding or
+removing various fixes or features.
+
+Users could keep only the last patch installed on the system after
+the transition has finished. It helps to clearly see what code is
+actually in use. Also the livepatch might then be seen as a "normal"
+module that modifies the kernel behavior. The only difference is that
+it can be updated at runtime without breaking its functionality.
+
+
+Features
+
+
+The atomic replace allows:
+
+  + Atomically revert some functions in a previous patch while
+upgrading other functions.
+
+  + Remove eventual performance impact caused by core redirection
+for functions that are no longer patched.
+
+  + Decrease user confusion about stacking order and what patches are
+currently in effect.
+
+
+Limitations:
+
+
+  + Replaced patches can no longer be enabled. But if the transition
+to the cumulative patch was not forced, the kernel modules with
+the older livepatches can be removed and eventually added again.
+
+A good practice is to set .replace flag in any released livepatch.
+Then re-adding an older livepatch is equivalent to downgrading
+to that patch. This is safe as long as the livepatches do _not_ do
+extra modifications in (un)patching callbacks or in the module_init()
+or module_exit() functions, see below.
+
+
+  + Only the (un)patching callbacks from the _new_ cumulative livepatch are
+executed. Any callbacks from the replaced patches are ignored.
+
+In other words, the cumulative patch is responsible for doing any actions
+that are necessary to properly replace any older patch.
+
+As a result, it might be dangerous to replace newer cumulative patches by
+older ones. The old livepatches might not provide the necessary callbacks.
+
+This might be seen as a limitation in some scenarios. But it makes the life
+easier in many others. Only the new cumulative livepatch knows what
+fixes/features are added/removed and what special actions are necessary
+for a smooth transition.
+
+In each case, it would be a nightmare to think about the order of
+the various callbacks and their interactions if the callbacks from all
+enabled patches were called.
+
+
+  + There is no special handling of shadow variables. Livepatch authors
+must create their own rules how to pass them from one cumulative
+patch to the other. Especially they should not blindly remove them
+in module_exit() functions.
+
+A good practice might be to remove shadow variables in the post-unpatch
+callback. It is called only when the livepatch is properly disabled.
-- 
2.13.7



[PATCH v12 11/12] livepatch: Remove ordering and refuse loading conflicting patches

2018-08-28 Thread Petr Mladek
The atomic replace and cumulative patches were introduced as a more secure
way to handle dependent patches. They simplify the logic:

  + Any new cumulative patch is supposed to take over shadow variables
and changes made by callbacks from previous livepatches.

  + All replaced patches are discarded and the modules can be unloaded.
As a result, there is only one scenario when a cumulative livepatch
gets disabled.

The different handling of "normal" and cumulative patches might cause
confusion. It would make sense to keep only one mode. On the other hand,
it would be rude to enforce using the cumulative livepatches even for
trivial and independent (hot) fixes.

This patch removes the stack of patches. The list of enabled patches
is still needed but the ordering is no longer enforced.

Note that it is not possible to catch all possible dependencies. It is
the responsibility of the livepatch authors to decide.

Nevertheless this patch prevents having two patches for the same function
enabled at the same time after the transition finishes. It might help
to catch obvious mistakes. But more importantly, we do not need to
handle the situation when a patch in the middle of the function stack
(ops->func_stack) is being removed.

Signed-off-by: Petr Mladek 
---
 Documentation/livepatch/livepatch.txt | 30 +++
 kernel/livepatch/core.c   | 56 +++
 2 files changed, 68 insertions(+), 18 deletions(-)

diff --git a/Documentation/livepatch/livepatch.txt b/Documentation/livepatch/livepatch.txt
index 7fb01d27d81d..8d985cab0a21 100644
--- a/Documentation/livepatch/livepatch.txt
+++ b/Documentation/livepatch/livepatch.txt
@@ -141,9 +141,9 @@ without HAVE_RELIABLE_STACKTRACE are not considered fully supported by
 the kernel livepatching.
 
 The /sys/kernel/livepatch//transition file shows whether a patch
-is in transition.  Only a single patch (the topmost patch on the stack)
-can be in transition at a given time.  A patch can remain in transition
-indefinitely, if any of the tasks are stuck in the initial patch state.
+is in transition.  Only a single patch can be in transition at a given
+time.  A patch can remain in transition indefinitely, if any of the tasks
+are stuck in the initial patch state.
 
 A transition can be reversed and effectively canceled by writing the
 opposite value to the /sys/kernel/livepatch//enabled file while
@@ -327,9 +327,10 @@ successfully disabled via the sysfs interface.
 Livepatch modules have to call klp_enable_patch() in module_init() callback.
 This function is rather complex and might even fail in the early phase.
 
-First, the addresses of the patched functions are found according to their
-names. The special relocations, mentioned in the section "New functions",
-are applied. The relevant entries are created under
+First, possible conflicts are checked for non-cumulative patches with
+disabled replace flag. The addresses of the patched functions are found
+according to their names. The special relocations, mentioned in the section
+"New functions", are applied. The relevant entries are created under
 /sys/kernel/livepatch/. The patch is rejected when any above
 operation fails.
 
@@ -343,11 +344,11 @@ this process, see the "Consistency model" section.
 Finally, once all tasks have been patched, the 'transition' value changes
 to '0'.
 
-[*] Note that functions might be patched multiple times. The ftrace handler
-is registered only once for a given function. Further patches just add
-an entry to the list (see field `func_stack`) of the struct klp_ops.
-The right implementation is selected by the ftrace handler, see
-the "Consistency model" section.
+[*] Note that two patches might modify the same function during the transition
+to a new cumulative patch. The ftrace handler is registered only once
+for a given function. The new patch just adds an entry to the list
+(see field `func_stack`) of the struct klp_ops. The right implementation
+is selected by the ftrace handler, see the "Consistency model" section.
 
 
 5.2. Disabling
@@ -374,8 +375,11 @@ Third, the sysfs interface is destroyed.
 Finally, the module can be removed if the transition was not forced and the
 last sysfs entry has gone.
 
-Note that patches must be disabled in exactly the reverse order in which
-they were enabled. It makes the problem and the implementation much easier.
+Note that any patch dependencies have to be handled by the atomic replace
+and cumulative patches, see Documentation/livepatch/cumulative-patches.txt.
+Therefore there is usually only one patch enabled on the system. There is
+still possibility to have more trivial and independent livepatches enabled
+at the same time. These can be enabled and disabled in any order.
 
 
 6. Sysfs
diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
index 695d565f23c1..f3e199

[PATCH v12 12/12] selftests/livepatch: introduce tests

2018-08-28 Thread Petr Mladek
From: Joe Lawrence 

Add a few livepatch modules and simple target modules that the included
regression suite can run tests against:

  - basic livepatching (multiple patches, atomic replace)
  - pre/post (un)patch callbacks
  - shadow variable API

Signed-off-by: Joe Lawrence 
---
 Documentation/livepatch/callbacks.txt  | 489 +
 MAINTAINERS|   1 +
 lib/Kconfig.debug  |  21 +
 lib/Makefile   |   2 +
 lib/livepatch/Makefile |  15 +
 lib/livepatch/test_klp_atomic_replace.c|  53 ++
 lib/livepatch/test_klp_callbacks_busy.c|  43 ++
 lib/livepatch/test_klp_callbacks_demo.c| 109 
 lib/livepatch/test_klp_callbacks_demo2.c   |  89 
 lib/livepatch/test_klp_callbacks_mod.c |  24 +
 lib/livepatch/test_klp_livepatch.c |  47 ++
 lib/livepatch/test_klp_shadow_vars.c   | 236 +
 tools/testing/selftests/Makefile   |   1 +
 tools/testing/selftests/livepatch/Makefile |   8 +
 tools/testing/selftests/livepatch/README   |  43 ++
 tools/testing/selftests/livepatch/config   |   1 +
 tools/testing/selftests/livepatch/functions.sh | 203 +++
 .../testing/selftests/livepatch/test-callbacks.sh  | 587 +
 .../testing/selftests/livepatch/test-livepatch.sh  | 168 ++
 .../selftests/livepatch/test-shadow-vars.sh|  60 +++
 20 files changed, 1716 insertions(+), 484 deletions(-)
 create mode 100644 lib/livepatch/Makefile
 create mode 100644 lib/livepatch/test_klp_atomic_replace.c
 create mode 100644 lib/livepatch/test_klp_callbacks_busy.c
 create mode 100644 lib/livepatch/test_klp_callbacks_demo.c
 create mode 100644 lib/livepatch/test_klp_callbacks_demo2.c
 create mode 100644 lib/livepatch/test_klp_callbacks_mod.c
 create mode 100644 lib/livepatch/test_klp_livepatch.c
 create mode 100644 lib/livepatch/test_klp_shadow_vars.c
 create mode 100644 tools/testing/selftests/livepatch/Makefile
 create mode 100644 tools/testing/selftests/livepatch/README
 create mode 100644 tools/testing/selftests/livepatch/config
 create mode 100644 tools/testing/selftests/livepatch/functions.sh
 create mode 100755 tools/testing/selftests/livepatch/test-callbacks.sh
 create mode 100755 tools/testing/selftests/livepatch/test-livepatch.sh
 create mode 100755 tools/testing/selftests/livepatch/test-shadow-vars.sh

diff --git a/Documentation/livepatch/callbacks.txt b/Documentation/livepatch/callbacks.txt
index c9776f48e458..182e31d4abce 100644
--- a/Documentation/livepatch/callbacks.txt
+++ b/Documentation/livepatch/callbacks.txt
@@ -118,488 +118,9 @@ similar change to their hw_features value.  (Client 
functions of the
 value may need to be updated accordingly.)
 
 
-Test cases
-==
-
-What follows is not an exhaustive test suite of every possible livepatch
-pre/post-(un)patch combination, but a selection that demonstrates a few
-important concepts.  Each test case uses the kernel modules located in
-the samples/livepatch/ and assumes that no livepatches are loaded at the
-beginning of the test.
-
-
-Test 1
---
-
-Test a combination of loading a kernel module and a livepatch that
-patches a function in the first module.  (Un)load the target module
-before the livepatch module:
-
-- load target module
-- load livepatch
-- disable livepatch
-- unload target module
-- unload livepatch
-
-First load a target module:
-
-  % insmod samples/livepatch/livepatch-callbacks-mod.ko
-  [   34.475708] livepatch_callbacks_mod: livepatch_callbacks_mod_init
-
-On livepatch enable, before the livepatch transition starts, pre-patch
-callbacks are executed for vmlinux and livepatch_callbacks_mod (those
-klp_objects currently loaded).  After klp_objects are patched according
-to the klp_patch, their post-patch callbacks run and the transition
-completes:
-
-  % insmod samples/livepatch/livepatch-callbacks-demo.ko
-  [   36.503719] livepatch: enabling patch 'livepatch_callbacks_demo'
-  [   36.504213] livepatch: 'livepatch_callbacks_demo': initializing patching 
transition
-  [   36.504238] livepatch_callbacks_demo: pre_patch_callback: vmlinux
-  [   36.504721] livepatch_callbacks_demo: pre_patch_callback: 
livepatch_callbacks_mod -> [MODULE_STATE_LIVE] Normal state
-  [   36.505849] livepatch: 'livepatch_callbacks_demo': starting patching 
transition
-  [   37.727133] livepatch: 'livepatch_callbacks_demo': completing patching 
transition
-  [   37.727232] livepatch_callbacks_demo: post_patch_callback: vmlinux
-  [   37.727860] livepatch_callbacks_demo: post_patch_callback: 
livepatch_callbacks_mod -> [MODULE_STATE_LIVE] Normal state
-  [   37.728792] livepatch: 'livepatch_callbacks_demo': patching complete
-
-Similarly, on livepatch disable, pre-patch callbacks run before the
-unpatching transition starts.  klp_objects are reverted, post-patc

[PATCH v12 07/12] livepatch: Use lists to manage patches, objects and functions

2018-08-28 Thread Petr Mladek
From: Jason Baron 

Currently klp_patch contains a pointer to a statically allocated array of
struct klp_object and struct klp_object contains a pointer to a statically
allocated array of klp_func. In order to allow for the dynamic allocation
of objects and functions, link klp_patch, klp_object, and klp_func together
via linked lists. This allows us to more easily allocate new objects and
functions, while having the iterator be a simple linked list walk.

The static structures are added to the lists early. It allows adding
the dynamically allocated objects before the klp_init_object() and
klp_init_func() calls. Therefore it reduces the further changes
to the code.

This patch does not change the existing behavior.
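
As a quick illustration (not taken from the patch), iterating all patched
functions then becomes a plain list walk with the new macros:

	/* Hypothetical helper, for illustration only. */
	static void dump_patch(struct klp_patch *patch)
	{
		struct klp_object *obj;
		struct klp_func *func;

		klp_for_each_object(patch, obj)
			klp_for_each_func(obj, func)
				pr_info("%s: %s\n",
					obj->name ? obj->name : "vmlinux",
					func->old_name);
	}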

Signed-off-by: Jason Baron 
[pmla...@suse.com: Initialize lists before init calls]
Signed-off-by: Petr Mladek 
Cc: Josh Poimboeuf 
Cc: Jessica Yu 
Cc: Jiri Kosina 
Cc: Miroslav Benes 
---
 include/linux/livepatch.h | 19 +--
 kernel/livepatch/core.c   | 16 
 2 files changed, 33 insertions(+), 2 deletions(-)

diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h
index b4424ef7e0ce..e48a4917fee3 100644
--- a/include/linux/livepatch.h
+++ b/include/linux/livepatch.h
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #if IS_ENABLED(CONFIG_LIVEPATCH)
 
@@ -42,6 +43,7 @@
  * can be found (optional)
  * @old_addr:  the address of the function being patched
  * @kobj:  kobject for sysfs resources
+ * @node:  list node for klp_object func_list
  * @stack_node:list node for klp_ops func_stack list
  * @old_size:  size of the old function
  * @new_size:  size of the new function
@@ -79,6 +81,7 @@ struct klp_func {
/* internal */
unsigned long old_addr;
struct kobject kobj;
+   struct list_head node;
struct list_head stack_node;
unsigned long old_size, new_size;
bool patched;
@@ -117,6 +120,8 @@ struct klp_callbacks {
  * @kobj:  kobject for sysfs resources
  * @mod:   kernel module associated with the patched object
  * (NULL for vmlinux)
+ * @func_list: dynamic list of the function entries
+ * @node:  list node for klp_patch obj_list
  * @patched:   the object's funcs have been added to the klp_ops list
  */
 struct klp_object {
@@ -127,6 +132,8 @@ struct klp_object {
 
/* internal */
struct kobject kobj;
+   struct list_head func_list;
+   struct list_head node;
struct module *mod;
bool patched;
 };
@@ -137,6 +144,7 @@ struct klp_object {
  * @objs:  object entries for kernel objects to be patched
  * @list:  list node for global list of registered patches
  * @kobj:  kobject for sysfs resources
+ * @obj_list:  dynamic list of the object entries
  * @enabled:   the patch is enabled (but operation may be incomplete)
  * @wait_free: wait until the patch is freed
  * @module_put: module reference taken and patch not forced
@@ -150,6 +158,7 @@ struct klp_patch {
/* internal */
struct list_head list;
struct kobject kobj;
+   struct list_head obj_list;
bool enabled;
bool wait_free;
bool module_put;
@@ -196,14 +205,20 @@ struct klp_patch {
}
 #define KLP_OBJECT_END { }
 
-#define klp_for_each_object(patch, obj) \
+#define klp_for_each_object_static(patch, obj) \
for (obj = patch->objs; obj->funcs || obj->name; obj++)
 
-#define klp_for_each_func(obj, func) \
+#define klp_for_each_object(patch, obj)\
+   list_for_each_entry(obj, &patch->obj_list, node)
+
+#define klp_for_each_func_static(obj, func) \
for (func = obj->funcs; \
 func->old_name || func->new_addr || func->old_sympos; \
 func++)
 
+#define klp_for_each_func(obj, func)   \
+   list_for_each_entry(func, &obj->func_list, node)
+
 int klp_enable_patch(struct klp_patch *);
 
 void arch_klp_init_object_loaded(struct klp_patch *patch,
diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
index 6a47b36a6c9a..7bc23a106b5b 100644
--- a/kernel/livepatch/core.c
+++ b/kernel/livepatch/core.c
@@ -50,6 +50,21 @@ LIST_HEAD(klp_patches);
 
 static struct kobject *klp_root_kobj;
 
+static void klp_init_lists(struct klp_patch *patch)
+{
+   struct klp_object *obj;
+   struct klp_func *func;
+
+   INIT_LIST_HEAD(&patch->obj_list);
+   klp_for_each_object_static(patch, obj) {
+   list_add(&obj->node, &patch->obj_list);
+
+   INIT_LIST_HEAD(&obj->func_list);
+   klp_for_each_func_static(obj, func)
+   list_add(&func->node, &obj->func_list);
+   }
+}
+
 static bool klp_is_module(struct klp_object *obj)
 {
return obj->name;
@@ -664,6 +679,7 @@ static int klp_init_patch(struct klp_patch *patch)
patch->module_put = false;
INIT_LIST_HEAD(&patch->list

[PATCH v12 09/12] livepatch: Remove Nop structures when unused

2018-08-28 Thread Petr Mladek
Replaced patches are removed from the stack when the transition is
finished. It means that Nop structures will never be needed again
and can be removed. Why should we care?

  + Nop structures give a false impression that the function is patched
even though the ftrace handler has no effect.

  + Ftrace handlers are not completely for free. They cause a slowdown that
might be visible in some workloads. The ftrace-related slowdown might
actually be the reason why the function is no longer patched in
the new cumulative patch. One would expect that a cumulative patch
would allow solving these problems as well.

  + Cumulative patches are supposed to replace any earlier version of
the patch. The number of NOPs depends on which version was replaced.
This multiplies the number of scenarios that might happen.

One might say that NOPs are innocent. But there are even optimized
NOP instructions for different processors, see for example
arch/x86/kernel/alternative.c. And klp_ftrace_handler() is much
more complicated.

  + It sounds natural to clean up a mess that is no longer needed.
It could only get worse if we do not do it.

This patch allows unpatching and freeing the dynamic structures
independently when the transition finishes.

The free part is a bit tricky because kobject free callbacks are called
asynchronously. We cannot easily wait for them. Fortunately, we do
not have to. Any further access can be avoided by removing them from
the dynamic lists.

Signed-off-by: Petr Mladek 
---
 include/linux/livepatch.h |  6 
 kernel/livepatch/core.c   | 72 ++-
 kernel/livepatch/core.h   |  2 +-
 kernel/livepatch/patch.c  | 31 ---
 kernel/livepatch/patch.h  |  1 +
 kernel/livepatch/transition.c |  2 +-
 6 files changed, 99 insertions(+), 15 deletions(-)

diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h
index 97c3f366cf18..5d897a396dc4 100644
--- a/include/linux/livepatch.h
+++ b/include/linux/livepatch.h
@@ -214,6 +214,9 @@ struct klp_patch {
 #define klp_for_each_object_static(patch, obj) \
for (obj = patch->objs; obj->funcs || obj->name; obj++)
 
+#define klp_for_each_object_safe(patch, obj, tmp_obj)  \
+   list_for_each_entry_safe(obj, tmp_obj, &patch->obj_list, node)
+
 #define klp_for_each_object(patch, obj)\
list_for_each_entry(obj, &patch->obj_list, node)
 
@@ -222,6 +225,9 @@ struct klp_patch {
 func->old_name || func->new_addr || func->old_sympos; \
 func++)
 
+#define klp_for_each_func_safe(obj, func, tmp_func)\
+   list_for_each_entry_safe(func, tmp_func, &obj->func_list, node)
+
 #define klp_for_each_func(obj, func)   \
list_for_each_entry(func, &obj->func_list, node)
 
diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
index db12c86c4f26..695d565f23c1 100644
--- a/kernel/livepatch/core.c
+++ b/kernel/livepatch/core.c
@@ -630,11 +630,20 @@ static struct kobj_type klp_ktype_func = {
.sysfs_ops = &kobj_sysfs_ops,
 };
 
-static void klp_free_funcs(struct klp_object *obj)
+static void __klp_free_funcs(struct klp_object *obj, bool free_all)
 {
-   struct klp_func *func;
+   struct klp_func *func, *tmp_func;
+
+   klp_for_each_func_safe(obj, func, tmp_func) {
+   if (!free_all && !func->nop)
+   continue;
+
+   /*
+* Avoid double free. It would be tricky to wait for kobject
+* callbacks when only NOPs are handled.
+*/
+   list_del(&func->node);
 
-   klp_for_each_func(obj, func) {
/* Might be called from klp_init_patch() error path. */
if (func->kobj.state_initialized)
kobject_put(&func->kobj);
@@ -658,12 +667,21 @@ static void klp_free_object_loaded(struct klp_object *obj)
}
 }
 
-static void klp_free_objects(struct klp_patch *patch)
+static void __klp_free_objects(struct klp_patch *patch, bool free_all)
 {
-   struct klp_object *obj;
+   struct klp_object *obj, *tmp_obj;
 
-   klp_for_each_object(patch, obj) {
-   klp_free_funcs(obj);
+   klp_for_each_object_safe(patch, obj, tmp_obj) {
+   __klp_free_funcs(obj, free_all);
+
+   if (!free_all && !obj->dynamic)
+   continue;
+
+   /*
+* Avoid double free. It would be tricky to wait for kobject
+* callbacks when only dynamic objects are handled.
+*/
+   list_del(&obj->node);
 
/* Might be called from klp_init_patch() error path. */
if (obj->kobj.state_initialized)
@@ -673,6 +691,16 @@ static void klp_free_objects(struct klp_patch *patch)
}
 }
 
+static vo

[PATCH v12 06/12] livepatch: Simplify API by removing registration step

2018-08-28 Thread Petr Mladek
The possibility to re-enable a registered patch was useful for immediate
patches where the livepatch module had to stay loaded until the system
rebooted. The improved consistency model allows achieving the same result
by unloading and loading the livepatch module again.

Also we are going to add a feature called atomic replace. It will allow
creating a patch that replaces all already registered patches. The
aim is to handle dependent patches in a more secure way. It will obsolete
the stack of patches that has helped to handle the dependencies so far.
Then it might be unclear when re-enabling a cumulative patch is safe.

It would be complicated to support all these modes. Instead, we can
actually make the API and code simpler.

This patch removes the two-step public API. All the checks and init calls
are moved from klp_register_patch() to klp_enable_patch(). Also the patch
is automatically freed, including the sysfs interface, when the transition
to the disabled state is completed.

As a result, there is never a disabled patch on top of the stack.
Therefore we do not need to check the stack in __klp_enable_patch().
And we can simplify the check in __klp_disable_patch().

Also the API and logic are much easier. It is enough to call
klp_enable_patch() from module_init(). The patch can then be disabled
by writing '0' into /sys/kernel/livepatch/<patch>/enabled, and the module
can be removed once the transition finishes and the sysfs interface is freed.
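
For illustration, a minimal sketch of what a livepatch module looks like
with the simplified API, loosely modeled on samples/livepatch/livepatch-sample.c
(the patched function and its replacement are the usual sample ones):

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/seq_file.h>
#include <linux/livepatch.h>

static int livepatch_cmdline_proc_show(struct seq_file *m, void *v)
{
	seq_printf(m, "%s\n", "this has been live patched");
	return 0;
}

static struct klp_func funcs[] = {
	{
		.old_name = "cmdline_proc_show",
		.new_func = livepatch_cmdline_proc_show,
	}, { }
};

static struct klp_object objs[] = {
	{
		/* NULL name means vmlinux */
		.funcs = funcs,
	}, { }
};

static struct klp_patch patch = {
	.mod = THIS_MODULE,
	.objs = objs,
};

static int livepatch_init(void)
{
	/* A single call; there is no separate registration step any more. */
	return klp_enable_patch(&patch);
}

static void livepatch_exit(void)
{
	/* Nothing to do; the patch was freed when the transition finished. */
}

module_init(livepatch_init);
module_exit(livepatch_exit);
MODULE_LICENSE("GPL");
MODULE_INFO(livepatch, "Y");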

IMPORTANT: We only need to be really careful about when and where to call
module_put(). It has to be called only when:

   + the reference was taken before
   + the module structures and code will no longer be accessed

Now, the disable operation is triggered from the sysfs interface. We clearly
cannot wait there until the interface is destroyed. Instead, we need to call
module_put() from the release callback of patch->kobj. It is safe because:

  + The patch can no longer get re-enabled from enabled_store().

  + kobjects are designed to be part of structures that are freed from
the release callback. We just need to make sure that module_put()
is the last call accessing the patch in the callback, see the sketch below.
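
A rough sketch of the release callback as described above; the @module_put
flag and the exact ordering are assumptions taken from this changelog, not
copied from the patch itself:

static void klp_kobj_release_patch(struct kobject *kobj)
{
	struct klp_patch *patch = container_of(kobj, struct klp_patch, kobj);

	complete(&patch->finish);

	/*
	 * The very last access to the patch. Once the reference is dropped,
	 * the module (and the static struct klp_patch in it) may go away.
	 */
	if (patch->module_put)
		module_put(patch->mod);
}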

In theory, we could be more relaxed in the klp_enable_patch() error paths
because they are called from module_init(). But it is better to be on the
safe side.

This patch does the following to keep the code sane:

  + patch->forced is replaced with patch->module_put and inverted logic.
Then the free path can be used in klp_enable_patch() error handling
even before the reference is taken.

  + try_module_get() is called before initializing patch->kobj. It makes
it more symmetric with the moved module_put().

  + module_put() is the last action also in klp_free_patch_sync_end(). It makes
it safe for use outside module_init().

Suggested-by: Josh Poimboeuf 
Signed-off-by: Petr Mladek 
---
 Documentation/livepatch/livepatch.txt| 121 +---
 include/linux/livepatch.h|   7 +-
 kernel/livepatch/core.c  | 280 ++-
 kernel/livepatch/core.h  |   2 +
 kernel/livepatch/transition.c|  15 +-
 samples/livepatch/livepatch-callbacks-demo.c |  13 +-
 samples/livepatch/livepatch-sample.c |  13 +-
 samples/livepatch/livepatch-shadow-fix1.c|  14 +-
 samples/livepatch/livepatch-shadow-fix2.c|  14 +-
 9 files changed, 157 insertions(+), 322 deletions(-)

diff --git a/Documentation/livepatch/livepatch.txt 
b/Documentation/livepatch/livepatch.txt
index 2d7ed09dbd59..7fb01d27d81d 100644
--- a/Documentation/livepatch/livepatch.txt
+++ b/Documentation/livepatch/livepatch.txt
@@ -14,10 +14,8 @@ Table of Contents:
4.2. Metadata
4.3. Livepatch module handling
 5. Livepatch life-cycle
-   5.1. Registration
-   5.2. Enabling
-   5.3. Disabling
-   5.4. Unregistration
+   5.1. Enabling
+   5.2. Disabling
 6. Sysfs
 7. Limitations
 
@@ -303,9 +301,8 @@ into three levels:
 
 The usual behavior is that the new functions will get used when
 the livepatch module is loaded. For this, the module init() function
-has to register the patch (struct klp_patch) and enable it. See the
-section "Livepatch life-cycle" below for more details about these
-two operations.
+has to enable the patch (struct klp_patch). See the section "Livepatch
+life-cycle" below for more details.
 
 Module removal is only safe when there are no users of the underlying
 functions. This is the reason why the force feature permanently disables
@@ -319,96 +316,66 @@ forced it is guaranteed that no task sleeps or runs in 
the old code.
 5. Livepatch life-cycle
 ===
 
-Livepatching defines four basic operations that define the life cycle of each
-live patch: registration, enabling, disabling and unregistration.  There are
-several reasons why it is done this way.
+Livepatches get automatically enabled when the respective module is loaded.
+On the 

[PATCH v12 04/12] livepatch: Consolidate klp_free functions

2018-08-28 Thread Petr Mladek
The code for freeing livepatch structures is a bit scattered and tricky:

  + direct calls to klp_free_*_limited() and kobject_put() are
used to release partially initialized objects

  + klp_free_patch() removes the patch from the public list
and releases all objects except for patch->kobj

  + kobject_put(&patch->kobj) and the related wait_for_completion()
are called directly outside klp_mutex; this code is duplicated

Now, we are going to remove the registration stage to simplify the API
and the code. This would require handling more situations in
klp_enable_patch() error paths.

More importantly, we are going to add a feature called atomic replace.
It will need to dynamically create func and object structures. We will
want to reuse the existing init() and free() functions. This would
create even more error path scenarios.

This patch implements more clever free functions:

  + checks kobj.state_initialized instead of @limit

  + initializes patch->list early so that the check for an empty list
always works

  + The actions that have to be done outside klp_mutex are done
in a separate klp_free_patch_end() function. It waits only
when patch->kobj was really released via the _begin() part.

Note that it is safe to put patch->kobj under klp_mutex. kobject_put()
calls the release callback only when the reference count reaches zero.
Therefore it does not block any related sysfs operation that took
a reference and might eventually wait for klp_mutex.

Note that __klp_free_patch() is split out because it will later be
used in a _nowait() variant. Also klp_free_patch_end() makes
sense on its own because it will later get more complicated.
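
A rough sketch of the intended split; the way @wait_free is set here is
an assumption based on this changelog, the diff below is authoritative:

/* Must be called under klp_mutex. */
static void klp_free_patch_begin(struct klp_patch *patch)
{
	/* Remember whether the kobject will really be put and released. */
	patch->wait_free = patch->kobj.state_initialized;

	__klp_free_patch(patch);
}

/* Must be called outside klp_mutex. */
static void klp_free_patch_end(struct klp_patch *patch)
{
	/*
	 * Wait for the kobject release callback only when the kobject
	 * was really released by the _begin() part.
	 */
	if (patch->wait_free)
		wait_for_completion(&patch->finish);
}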

This patch does not change the existing behavior.

Signed-off-by: Petr Mladek 
Cc: Josh Poimboeuf 
Cc: Jessica Yu 
Cc: Jiri Kosina 
Cc: Jason Baron 
Acked-by: Miroslav Benes 
---
 include/linux/livepatch.h |  2 ++
 kernel/livepatch/core.c   | 92 +--
 2 files changed, 59 insertions(+), 35 deletions(-)

diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h
index 1163742b27c0..22e0767d64b0 100644
--- a/include/linux/livepatch.h
+++ b/include/linux/livepatch.h
@@ -138,6 +138,7 @@ struct klp_object {
  * @list:  list node for global list of registered patches
  * @kobj:  kobject for sysfs resources
  * @enabled:   the patch is enabled (but operation may be incomplete)
+ * @wait_free: wait until the patch is freed
  * @finish:for waiting till it is safe to remove the patch module
  */
 struct klp_patch {
@@ -149,6 +150,7 @@ struct klp_patch {
struct list_head list;
struct kobject kobj;
bool enabled;
+   bool wait_free;
struct completion finish;
 };
 
diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
index b3956cce239e..3ca404545150 100644
--- a/kernel/livepatch/core.c
+++ b/kernel/livepatch/core.c
@@ -465,17 +465,15 @@ static struct kobj_type klp_ktype_func = {
.sysfs_ops = &kobj_sysfs_ops,
 };
 
-/*
- * Free all functions' kobjects in the array up to some limit. When limit is
- * NULL, all kobjects are freed.
- */
-static void klp_free_funcs_limited(struct klp_object *obj,
-  struct klp_func *limit)
+static void klp_free_funcs(struct klp_object *obj)
 {
struct klp_func *func;
 
-   for (func = obj->funcs; func->old_name && func != limit; func++)
-   kobject_put(&func->kobj);
+   klp_for_each_func(obj, func) {
+   /* Might be called from klp_init_patch() error path. */
+   if (func->kobj.state_initialized)
+   kobject_put(&func->kobj);
+   }
 }
 
 /* Clean up when a patched object is unloaded */
@@ -489,26 +487,59 @@ static void klp_free_object_loaded(struct klp_object *obj)
func->old_addr = 0;
 }
 
-/*
- * Free all objects' kobjects in the array up to some limit. When limit is
- * NULL, all kobjects are freed.
- */
-static void klp_free_objects_limited(struct klp_patch *patch,
-struct klp_object *limit)
+static void klp_free_objects(struct klp_patch *patch)
 {
struct klp_object *obj;
 
-   for (obj = patch->objs; obj->funcs && obj != limit; obj++) {
-   klp_free_funcs_limited(obj, NULL);
-   kobject_put(&obj->kobj);
+   klp_for_each_object(patch, obj) {
+   klp_free_funcs(obj);
+
+   /* Might be called from klp_init_patch() error path. */
+   if (obj->kobj.state_initialized)
+   kobject_put(&obj->kobj);
}
 }
 
-static void klp_free_patch(struct klp_patch *patch)
+static void __klp_free_patch(struct klp_patch *patch)
 {
-   klp_free_objects_limited(patch, NULL);
if (!list_empty(&patch->list))
list_del(&patch->list);
+
+   klp_free_objects(patch);
+
+ 

Re: [PATCH v4] console: Add console=spcr option

2018-08-30 Thread Petr Mladek
On Thu 2018-08-30 08:38:49, Prarit Bhargava wrote:
> ACPI may contain an SPCR table that defines a default system console.
> On ARM if the table is present then the SPCR console is enabled by default.
> On x86 the SPCR console is used if 'earlycon' (no parameters) is
> specified as a kernel parameter and is used only as the early console.
> To use the SPCR data as a console a user must boot with 'earlycon',
> grep logs & specify a console= kernel parameter, and then reboot again.
> 
> Add 'console=spcr' that enables a firmware or hardware console, and on
> x86 enable the SPCR console if 'console=spcr' is specified.

> diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
> index 3b20607d581b..a43a34734f02 100644
> --- a/arch/x86/kernel/acpi/boot.c
> +++ b/arch/x86/kernel/acpi/boot.c
> @@ -1771,3 +1771,17 @@ void __init 
> arch_reserve_mem_area(acpi_physical_address addr, size_t size)
>   e820__range_add(addr, size, E820_TYPE_ACPI);
>   e820__update_table_print();
>  }
> +
> +int __init arch_console_setup(char *str)
> +{
> + int ret;
> +
> + if (strcmp("spcr", str))
> + return 1;
> +
> + ret = acpi_parse_spcr(false, true);
> + if (ret)
> + pr_err(PREFIX "ERROR: SPCR console is not enabled (%d)\n", ret);
> +
> + return 0;
> +}
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index 924e37fb1620..ceee021a37ec 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -2107,6 +2112,9 @@ static int __init console_setup(char *str)
>   char *s, *options, *brl_options = NULL;
>   int idx;
>  
> + if (!arch_console_setup(str))
> + return 1;
> +
>   if (_braille_console_setup(&str, &brl_options))
>   return 1;

Sigh, I am still a bit confused by the error handling. In particular,
I am not sure why console_setup() always returns 1.

It looks like it means an error when called from do_early_param().
But it means that the option was processed when called from
obsolete_checksetup(). Do I get this correctly, please?

If this is true, we should change the logic in arch_console_setup().
It should return 1 when "spcr" is handled and it should return 0
otherwise. Also it should be commented.
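
I mean something like the following (just a sketch of the suggested logic,
not tested); console_setup() would then need to check for a non-zero
return value instead:

int __init arch_console_setup(char *str)
{
	int ret;

	/* Not our parameter, let console_setup() handle it. */
	if (strcmp("spcr", str))
		return 0;

	ret = acpi_parse_spcr(false, true);
	if (ret)
		pr_err("ERROR: SPCR console is not enabled (%d)\n", ret);

	/* "spcr" was consumed, even when enabling the console failed. */
	return 1;
}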

It looks like a historic mess. This is why I do not ask you to do
any bigger clean up. But the new patch should not make it even
more confusing with inverted 0/1 logic.

Best Regards,
Petr


[GIT PULL] printk for 4.19

2018-08-14 Thread Petr Mladek
Linus,

please pull the latest printk changes from

  git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk 
tags/printk-for-4.19



- Different vendors have different expectations about console quietness.
  Make it configurable to reduce bike-shedding about the upstream default.

- Decide about message visibility when the message is stored. It avoids
  races caused by delayed console handling.

- Always store printk() messages into the per-CPU buffers again in NMI.
  The only exception is when flushing the trace log in panic(). There
  the risk of losing messages is worth an eventual reordering.

- Handle invalid %pO printf modifiers correctly.

- Better handle %p printf modifier tests before crng is initialized.

- Some clean up.



Bart Van Assche (1):
  lib/vsprintf: Do not handle %pO[^F] as %px

Hans de Goede (1):
  printk: Make CONSOLE_LOGLEVEL_QUIET configurable

Maninder Singh (1):
  printk: make sure to print log on console.

Namit Gupta (1):
  printk: Remove unnecessary kmalloc() from syslog during clear

Petr Mladek (6):
  printk: Clean up syslog_print_all()
  printk: Split the code for storing a message into the log buffer
  printk: Create helper function to queue deferred console handling
  printk/nmi: Prevent deadlock when accessing the main log buffer in NMI
  printk: Fix warning about unused suppress_message_printing
  Merge branch 'for-4.19-nmi' into for-linus

Thierry Escande (1):
  lib/test_printf.c: accept "ptrval" as valid result for plain 'p' tests

 include/linux/printk.h  |  10 ++-
 kernel/printk/internal.h|   9 ++-
 kernel/printk/printk.c  | 181 
 kernel/printk/printk_safe.c |  58 +-
 kernel/trace/trace.c|   4 +-
 lib/Kconfig.debug   |  11 +++
 lib/nmi_backtrace.c |   3 -
 lib/test_printf.c   |  24 +-
 lib/vsprintf.c  |   1 +
 9 files changed, 189 insertions(+), 112 deletions(-)

