Re: REGRESSION: boot stalls on several old dual core Intel CPUs

2018-09-04 Thread Niklas Cassel
On Mon, Sep 03, 2018 at 11:33:05AM +0200, Peter Zijlstra wrote:
> On Mon, Sep 03, 2018 at 10:54:23AM +0200, Peter Zijlstra wrote:
> > On Mon, Sep 03, 2018 at 09:38:15AM +0200, Thomas Gleixner wrote:
> > > On Mon, 3 Sep 2018, Peter Zijlstra wrote:
> > > > On Sat, Sep 01, 2018 at 11:51:26AM +0930, Kevin Shanahan wrote:
> > > > > commit 01548f4d3e8e94caf323a4f664eb347fd34a34ab
> > > > > Author: Martin Schwidefsky 
> > > > > Date:   Tue Aug 18 17:09:42 2009 +0200
> > > > > 
> > > > > clocksource: Avoid clocksource watchdog circular locking 
> > > > > dependency
> > > > > 
> > > > > stop_machine from a multithreaded workqueue is not allowed because
> > > > > of a circular locking dependency between cpu_down and the 
> > > > > workqueue
> > > > > execution. Use a kernel thread to do the clocksource downgrade.
> > > > 
> > > > I cannot find stop_machine usage there; either it went away or I need to
> > > > like wake up.
> > > 
> > > timekeeping_notify() which is involved in switching clock source uses 
> > > stomp
> > > machine.
> > 
> > ARGH... OK, lemme see if I can come up with something other than
> > endlessly spawning that kthread.
> > 
> > A special purpose kthread_worker would make more sense than that.
> 
> Can someone test this?
> 
> ---
>  kernel/time/clocksource.c | 28 ++++++++++++++++++++++------
>  1 file changed, 22 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index f74fb00d8064..898976d0082a 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -112,13 +112,28 @@ static int finished_booting;
>  static u64 suspend_start;
>  
>  #ifdef CONFIG_CLOCKSOURCE_WATCHDOG
> -static void clocksource_watchdog_work(struct work_struct *work);
> +static void clocksource_watchdog_work(struct kthread_work *work);
>  static void clocksource_select(void);
>  
>  static LIST_HEAD(watchdog_list);
>  static struct clocksource *watchdog;
>  static struct timer_list watchdog_timer;
> -static DECLARE_WORK(watchdog_work, clocksource_watchdog_work);
> +
> +/*
> + * We must use a kthread_worker here, because:
> + *
> + *   clocksource_watchdog_work()
> + * clocksource_select()
> + *   __clocksource_select()
> + * timekeeping_notify()
> + *   stop_machine()
> + *
> + * cannot be called from a regular workqueue, because of deadlocks between
> + * workqueue and stop_machine.
> + */
> +static struct kthread_worker *watchdog_worker;
> +static DEFINE_KTHREAD_WORK(watchdog_work, clocksource_watchdog_work);
> +
>  static DEFINE_SPINLOCK(watchdog_lock);
>  static int watchdog_running;
>  static atomic_t watchdog_reset_pending;
> @@ -158,7 +173,7 @@ static void __clocksource_unstable(struct clocksource *cs)
>  
>   /* kick clocksource_watchdog_work() */
>   if (finished_booting)
> -	schedule_work(&watchdog_work);
> +	kthread_queue_work(watchdog_worker, &watchdog_work);
>  }
>  
>  /**
> @@ -199,7 +214,7 @@ static void clocksource_watchdog(struct timer_list 
> *unused)
>   /* Clocksource already marked unstable? */
>   if (cs->flags & CLOCK_SOURCE_UNSTABLE) {
>   if (finished_booting)
> -			schedule_work(&watchdog_work);
> +			kthread_queue_work(watchdog_worker, &watchdog_work);
>   continue;
>   }
>  
> @@ -269,7 +284,7 @@ static void clocksource_watchdog(struct timer_list 
> *unused)
>*/
>   if (cs != curr_clocksource) {
>   cs->flags |= CLOCK_SOURCE_RESELECT;
> -		schedule_work(&watchdog_work);
> +		kthread_queue_work(watchdog_worker, &watchdog_work);
>   } else {
>   tick_clock_notify();
>   }
> @@ -418,7 +433,7 @@ static int __clocksource_watchdog_work(void)
>   return select;
>  }
>  
> -static void clocksource_watchdog_work(struct work_struct *work)
> +static void clocksource_watchdog_work(struct kthread_work *work)
>  {
> 	mutex_lock(&clocksource_mutex);
>   if (__clocksource_watchdog_work())
> @@ -806,6 +821,7 @@ static int __init clocksource_done_booting(void)
>  {
> 	mutex_lock(&clocksource_mutex);
>   curr_clocksource = clocksource_default_clock();
> + watchdog_worker = kthread_create_worker(0, "cs-watchdog");

Hello Peter,

watchdog_worker is only defined if CONFIG_CLOCKSOURCE_WATCHDOG is set,
so you might want to wrap it with an ifdef to avoid build errors.

Kind regards,
Niklas

>   finished_booting = 1;
>   /*
>* Run the watchdog first to eliminate unstable clock sources


Re: REGRESSION: boot stalls on several old dual core Intel CPUs

2018-09-03 Thread Siegfried Metz

On 9/3/18 11:33 AM, Peter Zijlstra wrote:

On Mon, Sep 03, 2018 at 10:54:23AM +0200, Peter Zijlstra wrote:

On Mon, Sep 03, 2018 at 09:38:15AM +0200, Thomas Gleixner wrote:

On Mon, 3 Sep 2018, Peter Zijlstra wrote:

On Sat, Sep 01, 2018 at 11:51:26AM +0930, Kevin Shanahan wrote:

commit 01548f4d3e8e94caf323a4f664eb347fd34a34ab
Author: Martin Schwidefsky 
Date:   Tue Aug 18 17:09:42 2009 +0200

 clocksource: Avoid clocksource watchdog circular locking dependency

 stop_machine from a multithreaded workqueue is not allowed because
 of a circular locking dependency between cpu_down and the workqueue
 execution. Use a kernel thread to do the clocksource downgrade.


I cannot find stop_machine usage there; either it went away or I need to
like wake up.


timekeeping_notify() which is involved in switching clock source uses stomp
machine.


ARGH... OK, lemme see if I can come up with something other than
endlessly spawning that kthread.

A special purpose kthread_worker would make more sense than that.


Can someone test this?

---
  kernel/time/clocksource.c | 28 ++++++++++++++++++++++------
  1 file changed, 22 insertions(+), 6 deletions(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index f74fb00d8064..898976d0082a 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -112,13 +112,28 @@ static int finished_booting;
  static u64 suspend_start;
  
  #ifdef CONFIG_CLOCKSOURCE_WATCHDOG

-static void clocksource_watchdog_work(struct work_struct *work);
+static void clocksource_watchdog_work(struct kthread_work *work);
  static void clocksource_select(void);
  
  static LIST_HEAD(watchdog_list);

  static struct clocksource *watchdog;
  static struct timer_list watchdog_timer;
-static DECLARE_WORK(watchdog_work, clocksource_watchdog_work);
+
+/*
+ * We must use a kthread_worker here, because:
+ *
+ *   clocksource_watchdog_work()
+ * clocksource_select()
+ *   __clocksource_select()
+ * timekeeping_notify()
+ *   stop_machine()
+ *
+ * cannot be called from a regular workqueue, because of deadlocks between
+ * workqueue and stop_machine.
+ */
+static struct kthread_worker *watchdog_worker;
+static DEFINE_KTHREAD_WORK(watchdog_work, clocksource_watchdog_work);
+
  static DEFINE_SPINLOCK(watchdog_lock);
  static int watchdog_running;
  static atomic_t watchdog_reset_pending;
@@ -158,7 +173,7 @@ static void __clocksource_unstable(struct clocksource *cs)
  
  	/* kick clocksource_watchdog_work() */

if (finished_booting)
-	schedule_work(&watchdog_work);
+	kthread_queue_work(watchdog_worker, &watchdog_work);
  }
  
  /**

@@ -199,7 +214,7 @@ static void clocksource_watchdog(struct timer_list *unused)
/* Clocksource already marked unstable? */
if (cs->flags & CLOCK_SOURCE_UNSTABLE) {
if (finished_booting)
-			schedule_work(&watchdog_work);
+			kthread_queue_work(watchdog_worker, &watchdog_work);
continue;
}
  
@@ -269,7 +284,7 @@ static void clocksource_watchdog(struct timer_list *unused)

 */
if (cs != curr_clocksource) {
cs->flags |= CLOCK_SOURCE_RESELECT;
-		schedule_work(&watchdog_work);
+		kthread_queue_work(watchdog_worker, &watchdog_work);
} else {
tick_clock_notify();
}
@@ -418,7 +433,7 @@ static int __clocksource_watchdog_work(void)
return select;
  }
  
-static void clocksource_watchdog_work(struct work_struct *work)

+static void clocksource_watchdog_work(struct kthread_work *work)
  {
	mutex_lock(&clocksource_mutex);
if (__clocksource_watchdog_work())
@@ -806,6 +821,7 @@ static int __init clocksource_done_booting(void)
  {
	mutex_lock(&clocksource_mutex);
curr_clocksource = clocksource_default_clock();
+   watchdog_worker = kthread_create_worker(0, "cs-watchdog");
finished_booting = 1;
/*
 * Run the watchdog first to eliminate unstable clock sources



Successfully booted my Intel Core 2 Duo with the patch applied on top of
4.18.5 (based on default Arch Linux config).

I tested with at least 8/8 successful boots in total - with no
additional kernel boot parameters and also with "quiet", and "debug".
No problems seen so far.

Thank you for your effort and developing this patch.


Siegfried



Re: REGRESSION: boot stalls on several old dual core Intel CPUs

2018-09-03 Thread Kevin Shanahan
On Mon, Sep 03, 2018 at 11:33:05AM +0200, Peter Zijlstra wrote:
> On Mon, Sep 03, 2018 at 10:54:23AM +0200, Peter Zijlstra wrote:
> > On Mon, Sep 03, 2018 at 09:38:15AM +0200, Thomas Gleixner wrote:
> > > On Mon, 3 Sep 2018, Peter Zijlstra wrote:
> > > > On Sat, Sep 01, 2018 at 11:51:26AM +0930, Kevin Shanahan wrote:
> > > > > commit 01548f4d3e8e94caf323a4f664eb347fd34a34ab
> > > > > Author: Martin Schwidefsky 
> > > > > Date:   Tue Aug 18 17:09:42 2009 +0200
> > > > > 
> > > > > clocksource: Avoid clocksource watchdog circular locking 
> > > > > dependency
> > > > > 
> > > > > stop_machine from a multithreaded workqueue is not allowed because
> > > > > of a circular locking dependency between cpu_down and the 
> > > > > workqueue
> > > > > execution. Use a kernel thread to do the clocksource downgrade.
> > > > 
> > > > I cannot find stop_machine usage there; either it went away or I need to
> > > > like wake up.
> > > 
> > > timekeeping_notify() which is involved in switching clock source uses 
> > > stomp
> > > machine.
> > 
> > ARGH... OK, lemme see if I can come up with something other than
> > endlessly spawning that kthread.
> > 
> > A special purpose kthread_worker would make more sense than that.
> 
> Can someone test this?

Boots for me (applied on top of 4.18.5).

Tested-by: Kevin Shanahan 

> ---
>  kernel/time/clocksource.c | 28 ++++++++++++++++++++++------
>  1 file changed, 22 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index f74fb00d8064..898976d0082a 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -112,13 +112,28 @@ static int finished_booting;
>  static u64 suspend_start;
>  
>  #ifdef CONFIG_CLOCKSOURCE_WATCHDOG
> -static void clocksource_watchdog_work(struct work_struct *work);
> +static void clocksource_watchdog_work(struct kthread_work *work);
>  static void clocksource_select(void);
>  
>  static LIST_HEAD(watchdog_list);
>  static struct clocksource *watchdog;
>  static struct timer_list watchdog_timer;
> -static DECLARE_WORK(watchdog_work, clocksource_watchdog_work);
> +
> +/*
> + * We must use a kthread_worker here, because:
> + *
> + *   clocksource_watchdog_work()
> + * clocksource_select()
> + *   __clocksource_select()
> + * timekeeping_notify()
> + *   stop_machine()
> + *
> + * cannot be called from a regular workqueue, because of deadlocks between
> + * workqueue and stop_machine.
> + */
> +static struct kthread_worker *watchdog_worker;
> +static DEFINE_KTHREAD_WORK(watchdog_work, clocksource_watchdog_work);
> +
>  static DEFINE_SPINLOCK(watchdog_lock);
>  static int watchdog_running;
>  static atomic_t watchdog_reset_pending;
> @@ -158,7 +173,7 @@ static void __clocksource_unstable(struct clocksource *cs)
>  
>   /* kick clocksource_watchdog_work() */
>   if (finished_booting)
> -	schedule_work(&watchdog_work);
> +	kthread_queue_work(watchdog_worker, &watchdog_work);
>  }
>  
>  /**
> @@ -199,7 +214,7 @@ static void clocksource_watchdog(struct timer_list 
> *unused)
>   /* Clocksource already marked unstable? */
>   if (cs->flags & CLOCK_SOURCE_UNSTABLE) {
>   if (finished_booting)
> -			schedule_work(&watchdog_work);
> +			kthread_queue_work(watchdog_worker, &watchdog_work);
>   continue;
>   }
>  
> @@ -269,7 +284,7 @@ static void clocksource_watchdog(struct timer_list 
> *unused)
>*/
>   if (cs != curr_clocksource) {
>   cs->flags |= CLOCK_SOURCE_RESELECT;
> -		schedule_work(&watchdog_work);
> +		kthread_queue_work(watchdog_worker, &watchdog_work);
>   } else {
>   tick_clock_notify();
>   }
> @@ -418,7 +433,7 @@ static int __clocksource_watchdog_work(void)
>   return select;
>  }
>  
> -static void clocksource_watchdog_work(struct work_struct *work)
> +static void clocksource_watchdog_work(struct kthread_work *work)
>  {
> 	mutex_lock(&clocksource_mutex);
>   if (__clocksource_watchdog_work())
> @@ -806,6 +821,7 @@ static int __init clocksource_done_booting(void)
>  {
> 	mutex_lock(&clocksource_mutex);
>   curr_clocksource = clocksource_default_clock();
> + watchdog_worker = kthread_create_worker(0, "cs-watchdog");
>   finished_booting = 1;
>   /*
>* Run the watchdog first to eliminate unstable clock sources


Re: REGRESSION: boot stalls on several old dual core Intel CPUs

2018-09-03 Thread Viktor Jägersküpper
Peter Zijlstra:
> On Mon, Sep 03, 2018 at 10:54:23AM +0200, Peter Zijlstra wrote:
>> On Mon, Sep 03, 2018 at 09:38:15AM +0200, Thomas Gleixner wrote:
>>> On Mon, 3 Sep 2018, Peter Zijlstra wrote:
 On Sat, Sep 01, 2018 at 11:51:26AM +0930, Kevin Shanahan wrote:
> commit 01548f4d3e8e94caf323a4f664eb347fd34a34ab
> Author: Martin Schwidefsky 
> Date:   Tue Aug 18 17:09:42 2009 +0200
>
> clocksource: Avoid clocksource watchdog circular locking dependency
>
> stop_machine from a multithreaded workqueue is not allowed because
> of a circular locking dependency between cpu_down and the workqueue
> execution. Use a kernel thread to do the clocksource downgrade.

 I cannot find stop_machine usage there; either it went away or I need to
 like wake up.
>>>
>>> timekeeping_notify() which is involved in switching clock source uses stomp
>>> machine.
>>
>> ARGH... OK, lemme see if I can come up with something other than
>> endlessly spawning that kthread.
>>
>> A special purpose kthread_worker would make more sense than that.
> 
> Can someone test this?
> 
> ---
>  kernel/time/clocksource.c | 28 ++++++++++++++++++++++------
>  1 file changed, 22 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index f74fb00d8064..898976d0082a 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -112,13 +112,28 @@ static int finished_booting;
>  static u64 suspend_start;
>  
>  #ifdef CONFIG_CLOCKSOURCE_WATCHDOG
> -static void clocksource_watchdog_work(struct work_struct *work);
> +static void clocksource_watchdog_work(struct kthread_work *work);
>  static void clocksource_select(void);
>  
>  static LIST_HEAD(watchdog_list);
>  static struct clocksource *watchdog;
>  static struct timer_list watchdog_timer;
> -static DECLARE_WORK(watchdog_work, clocksource_watchdog_work);
> +
> +/*
> + * We must use a kthread_worker here, because:
> + *
> + *   clocksource_watchdog_work()
> + * clocksource_select()
> + *   __clocksource_select()
> + * timekeeping_notify()
> + *   stop_machine()
> + *
> + * cannot be called from a regular workqueue, because of deadlocks between
> + * workqueue and stop_machine.
> + */
> +static struct kthread_worker *watchdog_worker;
> +static DEFINE_KTHREAD_WORK(watchdog_work, clocksource_watchdog_work);
> +
>  static DEFINE_SPINLOCK(watchdog_lock);
>  static int watchdog_running;
>  static atomic_t watchdog_reset_pending;
> @@ -158,7 +173,7 @@ static void __clocksource_unstable(struct clocksource *cs)
>  
>   /* kick clocksource_watchdog_work() */
>   if (finished_booting)
> -	schedule_work(&watchdog_work);
> +	kthread_queue_work(watchdog_worker, &watchdog_work);
>  }
>  
>  /**
> @@ -199,7 +214,7 @@ static void clocksource_watchdog(struct timer_list 
> *unused)
>   /* Clocksource already marked unstable? */
>   if (cs->flags & CLOCK_SOURCE_UNSTABLE) {
>   if (finished_booting)
> -			schedule_work(&watchdog_work);
> +			kthread_queue_work(watchdog_worker, &watchdog_work);
>   continue;
>   }
>  
> @@ -269,7 +284,7 @@ static void clocksource_watchdog(struct timer_list 
> *unused)
>*/
>   if (cs != curr_clocksource) {
>   cs->flags |= CLOCK_SOURCE_RESELECT;
> -		schedule_work(&watchdog_work);
> +		kthread_queue_work(watchdog_worker, &watchdog_work);
>   } else {
>   tick_clock_notify();
>   }
> @@ -418,7 +433,7 @@ static int __clocksource_watchdog_work(void)
>   return select;
>  }
>  
> -static void clocksource_watchdog_work(struct work_struct *work)
> +static void clocksource_watchdog_work(struct kthread_work *work)
>  {
> 	mutex_lock(&clocksource_mutex);
>   if (__clocksource_watchdog_work())
> @@ -806,6 +821,7 @@ static int __init clocksource_done_booting(void)
>  {
> 	mutex_lock(&clocksource_mutex);
>   curr_clocksource = clocksource_default_clock();
> + watchdog_worker = kthread_create_worker(0, "cs-watchdog");
>   finished_booting = 1;
>   /*
>* Run the watchdog first to eliminate unstable clock sources
> 

Applied on mainline tag v4.19-rc2. Tested without additional parameters,
with "quiet" and with "debug", my PC booted successfully in all three
cases, whereas it stalled almost always in these three cases before.

Thanks!


Re: REGRESSION: boot stalls on several old dual core Intel CPUs

2018-09-03 Thread Viktor Jägersküpper
Peter Zijlstra:
> On Mon, Sep 03, 2018 at 10:54:23AM +0200, Peter Zijlstra wrote:
>> On Mon, Sep 03, 2018 at 09:38:15AM +0200, Thomas Gleixner wrote:
>>> On Mon, 3 Sep 2018, Peter Zijlstra wrote:
 On Sat, Sep 01, 2018 at 11:51:26AM +0930, Kevin Shanahan wrote:
> commit 01548f4d3e8e94caf323a4f664eb347fd34a34ab
> Author: Martin Schwidefsky 
> Date:   Tue Aug 18 17:09:42 2009 +0200
>
> clocksource: Avoid clocksource watchdog circular locking dependency
>
> stop_machine from a multithreaded workqueue is not allowed because
> of a circular locking dependency between cpu_down and the workqueue
> execution. Use a kernel thread to do the clocksource downgrade.

 I cannot find stop_machine usage there; either it went away or I need to
 like wake up.
>>>
>>> timekeeping_notify() which is involved in switching clock source uses stomp
>>> machine.
>>
>> ARGH... OK, lemme see if I can come up with something other than
>> endlessly spawning that kthread.
>>
>> A special purpose kthread_worker would make more sense than that.
> 
> Can someone test this?
> 
> ---
>  kernel/time/clocksource.c | 28 ++--
>  1 file changed, 22 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index f74fb00d8064..898976d0082a 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -112,13 +112,28 @@ static int finished_booting;
>  static u64 suspend_start;
>  
>  #ifdef CONFIG_CLOCKSOURCE_WATCHDOG
> -static void clocksource_watchdog_work(struct work_struct *work);
> +static void clocksource_watchdog_work(struct kthread_work *work);
>  static void clocksource_select(void);
>  
>  static LIST_HEAD(watchdog_list);
>  static struct clocksource *watchdog;
>  static struct timer_list watchdog_timer;
> -static DECLARE_WORK(watchdog_work, clocksource_watchdog_work);
> +
> +/*
> + * We must use a kthread_worker here, because:
> + *
> + *   clocksource_watchdog_work()
> + * clocksource_select()
> + *   __clocksource_select()
> + * timekeeping_notify()
> + *   stop_machine()
> + *
> + * cannot be called from a reqular workqueue, because of deadlocks between
> + * workqueue and stopmachine.
> + */
> +static struct kthread_worker *watchdog_worker;
> +static DEFINE_KTHREAD_WORK(watchdog_work, clocksource_watchdog_work);
> +
>  static DEFINE_SPINLOCK(watchdog_lock);
>  static int watchdog_running;
>  static atomic_t watchdog_reset_pending;
> @@ -158,7 +173,7 @@ static void __clocksource_unstable(struct clocksource *cs)
>  
>   /* kick clocksource_watchdog_work() */
>   if (finished_booting)
> -		schedule_work(&watchdog_work);
> +		kthread_queue_work(watchdog_worker, &watchdog_work);
>  }
>  
>  /**
> @@ -199,7 +214,7 @@ static void clocksource_watchdog(struct timer_list *unused)
>   /* Clocksource already marked unstable? */
>   if (cs->flags & CLOCK_SOURCE_UNSTABLE) {
>   if (finished_booting)
> -			schedule_work(&watchdog_work);
> +			kthread_queue_work(watchdog_worker, &watchdog_work);
>   continue;
>   }
>  
> @@ -269,7 +284,7 @@ static void clocksource_watchdog(struct timer_list *unused)
>*/
>   if (cs != curr_clocksource) {
>   cs->flags |= CLOCK_SOURCE_RESELECT;
> -			schedule_work(&watchdog_work);
> +			kthread_queue_work(watchdog_worker, &watchdog_work);
>   } else {
>   tick_clock_notify();
>   }
> @@ -418,7 +433,7 @@ static int __clocksource_watchdog_work(void)
>   return select;
>  }
>  
> -static void clocksource_watchdog_work(struct work_struct *work)
> +static void clocksource_watchdog_work(struct kthread_work *work)
>  {
> 	mutex_lock(&clocksource_mutex);
>   if (__clocksource_watchdog_work())
> @@ -806,6 +821,7 @@ static int __init clocksource_done_booting(void)
>  {
>   mutex_lock(_mutex);
>   curr_clocksource = clocksource_default_clock();
> + watchdog_worker = kthread_create_worker(0, "cs-watchdog");
>   finished_booting = 1;
>   /*
>* Run the watchdog first to eliminate unstable clock sources
> 

Applied on top of mainline tag v4.19-rc2. Tested without additional
parameters, with "quiet", and with "debug": my PC booted successfully in
all three cases, whereas before it stalled almost every time.

Thanks!
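For readers unfamiliar with the dedicated-worker pattern the patch above switches to, here is a minimal userspace analogue using pthreads. All names are invented for illustration; the in-kernel kthread_worker API differs, the point is only the shape: one private thread draining a FIFO of work items, so queued work never runs on a shared pool thread.

```c
#include <pthread.h>
#include <stdlib.h>

/* Userspace sketch of the dedicated-worker pattern: one private
 * thread drains a FIFO of work items.  Invented names, not the
 * kernel kthread_worker API. */

struct work_item {
	void (*fn)(void *arg);
	void *arg;
	struct work_item *next;
};

struct worker {
	pthread_t thread;
	pthread_mutex_t lock;
	pthread_cond_t cond;
	struct work_item *head, *tail;
	int stop;
};

static void *worker_loop(void *p)
{
	struct worker *w = p;

	for (;;) {
		pthread_mutex_lock(&w->lock);
		while (!w->head && !w->stop)
			pthread_cond_wait(&w->cond, &w->lock);
		if (!w->head) {			/* stop requested, queue drained */
			pthread_mutex_unlock(&w->lock);
			return NULL;
		}
		struct work_item *item = w->head;
		w->head = item->next;
		if (!w->head)
			w->tail = NULL;
		pthread_mutex_unlock(&w->lock);

		item->fn(item->arg);		/* run the work outside the lock */
		free(item);
	}
}

void worker_start(struct worker *w)
{
	pthread_mutex_init(&w->lock, NULL);
	pthread_cond_init(&w->cond, NULL);
	w->head = w->tail = NULL;
	w->stop = 0;
	pthread_create(&w->thread, NULL, worker_loop, w);
}

void worker_queue(struct worker *w, void (*fn)(void *), void *arg)
{
	struct work_item *item = malloc(sizeof(*item));

	item->fn = fn;
	item->arg = arg;
	item->next = NULL;
	pthread_mutex_lock(&w->lock);
	if (w->tail)
		w->tail->next = item;
	else
		w->head = item;
	w->tail = item;
	pthread_cond_signal(&w->cond);
	pthread_mutex_unlock(&w->lock);
}

void worker_stop(struct worker *w)	/* drains remaining work, then joins */
{
	pthread_mutex_lock(&w->lock);
	w->stop = 1;
	pthread_cond_signal(&w->cond);
	pthread_mutex_unlock(&w->lock);
	pthread_join(w->thread, NULL);
}

static void bump(void *arg) { ++*(int *)arg; }

int run_demo(void)	/* queue three increments, stop, report the count */
{
	struct worker w;
	int count = 0;

	worker_start(&w);
	for (int i = 0; i < 3; i++)
		worker_queue(&w, bump, &count);
	worker_stop(&w);
	return count;
}
```

Because the worker is private, work queued on it can block on heavyweight synchronization (stop_machine in the kernel case) without tying up a pool thread that other subsystems depend on.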


Re: REGRESSION: boot stalls on several old dual core Intel CPUs

2018-09-03 Thread Peter Zijlstra
On Mon, Sep 03, 2018 at 10:54:23AM +0200, Peter Zijlstra wrote:
> On Mon, Sep 03, 2018 at 09:38:15AM +0200, Thomas Gleixner wrote:
> > On Mon, 3 Sep 2018, Peter Zijlstra wrote:
> > > On Sat, Sep 01, 2018 at 11:51:26AM +0930, Kevin Shanahan wrote:
> > > > commit 01548f4d3e8e94caf323a4f664eb347fd34a34ab
> > > > Author: Martin Schwidefsky 
> > > > Date:   Tue Aug 18 17:09:42 2009 +0200
> > > > 
> > > > clocksource: Avoid clocksource watchdog circular locking dependency
> > > > 
> > > > stop_machine from a multithreaded workqueue is not allowed because
> > > > of a circular locking dependency between cpu_down and the workqueue
> > > > execution. Use a kernel thread to do the clocksource downgrade.
> > > 
> > > I cannot find stop_machine usage there; either it went away or I need to
> > > like wake up.
> > 
> > timekeeping_notify() which is involved in switching clock source uses stomp
> > machine.
> 
> ARGH... OK, lemme see if I can come up with something other than
> endlessly spawning that kthread.
> 
> A special purpose kthread_worker would make more sense than that.

Can someone test this?

---
 kernel/time/clocksource.c | 28 ++--
 1 file changed, 22 insertions(+), 6 deletions(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index f74fb00d8064..898976d0082a 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -112,13 +112,28 @@ static int finished_booting;
 static u64 suspend_start;
 
 #ifdef CONFIG_CLOCKSOURCE_WATCHDOG
-static void clocksource_watchdog_work(struct work_struct *work);
+static void clocksource_watchdog_work(struct kthread_work *work);
 static void clocksource_select(void);
 
 static LIST_HEAD(watchdog_list);
 static struct clocksource *watchdog;
 static struct timer_list watchdog_timer;
-static DECLARE_WORK(watchdog_work, clocksource_watchdog_work);
+
+/*
+ * We must use a kthread_worker here, because:
+ *
+ *   clocksource_watchdog_work()
+ * clocksource_select()
+ *   __clocksource_select()
+ * timekeeping_notify()
+ *   stop_machine()
+ *
+ * cannot be called from a regular workqueue, because of deadlocks between
+ * workqueue and stop_machine.
+ */
+static struct kthread_worker *watchdog_worker;
+static DEFINE_KTHREAD_WORK(watchdog_work, clocksource_watchdog_work);
+
 static DEFINE_SPINLOCK(watchdog_lock);
 static int watchdog_running;
 static atomic_t watchdog_reset_pending;
@@ -158,7 +173,7 @@ static void __clocksource_unstable(struct clocksource *cs)
 
/* kick clocksource_watchdog_work() */
if (finished_booting)
-		schedule_work(&watchdog_work);
+		kthread_queue_work(watchdog_worker, &watchdog_work);
 }
 
 /**
@@ -199,7 +214,7 @@ static void clocksource_watchdog(struct timer_list *unused)
/* Clocksource already marked unstable? */
if (cs->flags & CLOCK_SOURCE_UNSTABLE) {
if (finished_booting)
-			schedule_work(&watchdog_work);
+			kthread_queue_work(watchdog_worker, &watchdog_work);
continue;
}
 
@@ -269,7 +284,7 @@ static void clocksource_watchdog(struct timer_list *unused)
 */
if (cs != curr_clocksource) {
cs->flags |= CLOCK_SOURCE_RESELECT;
-			schedule_work(&watchdog_work);
+			kthread_queue_work(watchdog_worker, &watchdog_work);
} else {
tick_clock_notify();
}
@@ -418,7 +433,7 @@ static int __clocksource_watchdog_work(void)
return select;
 }
 
-static void clocksource_watchdog_work(struct work_struct *work)
+static void clocksource_watchdog_work(struct kthread_work *work)
 {
	mutex_lock(&clocksource_mutex);
if (__clocksource_watchdog_work())
@@ -806,6 +821,7 @@ static int __init clocksource_done_booting(void)
 {
mutex_lock(_mutex);
curr_clocksource = clocksource_default_clock();
+   watchdog_worker = kthread_create_worker(0, "cs-watchdog");
finished_booting = 1;
/*
 * Run the watchdog first to eliminate unstable clock sources


Re: REGRESSION: boot stalls on several old dual core Intel CPUs

2018-09-03 Thread Peter Zijlstra
On Mon, Sep 03, 2018 at 09:38:15AM +0200, Thomas Gleixner wrote:
> On Mon, 3 Sep 2018, Peter Zijlstra wrote:
> > On Sat, Sep 01, 2018 at 11:51:26AM +0930, Kevin Shanahan wrote:
> > > commit 01548f4d3e8e94caf323a4f664eb347fd34a34ab
> > > Author: Martin Schwidefsky 
> > > Date:   Tue Aug 18 17:09:42 2009 +0200
> > > 
> > > clocksource: Avoid clocksource watchdog circular locking dependency
> > > 
> > > stop_machine from a multithreaded workqueue is not allowed because
> > > of a circular locking dependency between cpu_down and the workqueue
> > > execution. Use a kernel thread to do the clocksource downgrade.
> > 
> > I cannot find stop_machine usage there; either it went away or I need to
> > like wake up.
> 
> timekeeping_notify() which is involved in switching clock source uses stomp
> machine.

ARGH... OK, lemme see if I can come up with something other than
endlessly spawning that kthread.

A special purpose kthread_worker would make more sense than that.


Re: REGRESSION: boot stalls on several old dual core Intel CPUs

2018-09-03 Thread Thomas Gleixner
On Mon, 3 Sep 2018, Peter Zijlstra wrote:
> On Sat, Sep 01, 2018 at 11:51:26AM +0930, Kevin Shanahan wrote:
> > commit 01548f4d3e8e94caf323a4f664eb347fd34a34ab
> > Author: Martin Schwidefsky 
> > Date:   Tue Aug 18 17:09:42 2009 +0200
> > 
> > clocksource: Avoid clocksource watchdog circular locking dependency
> > 
> > stop_machine from a multithreaded workqueue is not allowed because
> > of a circular locking dependency between cpu_down and the workqueue
> > execution. Use a kernel thread to do the clocksource downgrade.
> 
> I cannot find stop_machine usage there; either it went away or I need to
> like wake up.

timekeeping_notify() which is involved in switching clock source uses stomp
machine.

Thanks,

tglx



Re: REGRESSION: boot stalls on several old dual core Intel CPUs

2018-09-03 Thread Peter Zijlstra
On Sat, Sep 01, 2018 at 11:51:26AM +0930, Kevin Shanahan wrote:
> On Thu, Aug 30, 2018 at 03:04:39PM +0200, Peter Zijlstra wrote:
> > On Thu, Aug 30, 2018 at 12:55:30PM +0200, Siegfried Metz wrote:
> > > Dear kernel developers,
> > > 
> > > since mainline kernel 4.18 (up to the latest mainline kernel 4.18.5)
> > > Intel Core 2 Duo processors are affected by boot stalling early in the
> > > boot process. As it is so early there is no dmesg output (or any log).
> > > 
> > > A few users in the Arch Linux community used git bisect and tracked the
> > > issue down to this the bad commit:
> > > 7197e77abcb65a71d0b21d67beb24f153a96055e clocksource: Remove kthread
> > 
> > I just dug out my core2duo laptop (Lenovo T500) and built a tip/master
> > kernel for it (x86_64 debian distro .config).
> > 
> > Seems to boot just fine.. 3/3 so far.
> > 
> > Any other clues?
> 
> One additional data point, my affected system is a Dell Latitude E6400
> laptop which has a P8400 CPU:
> 
>   vendor_id : GenuineIntel
>   cpu family: 6
>   model : 23
>   model name: Intel(R) Core(TM)2 Duo CPU P8400  @ 2.26GHz
>   stepping  : 6
>   microcode : 0x610
> 
> Judging from what is being discussed in the Arch forums, it does seem
> to be related to the CPU having unstable TSC and transitioning to another
> clock source.

Yes; Core2 doesn't have stable TSC.

> Workarounds that seem to be reliable are either booting
> with clocksource= or with nosmp.

nosmp is weird, because even on UP the TSC should stop in C state.

processor_idle (acpi_idle) should mark the TSC as unstable on Core2 when
it loads (does so on my T500).

> One person did point out that the commit that introduced the kthread
> did so to remove a deadlock - is the circular locking dependency
> mentioned in that commit still relevant?
> 
> commit 01548f4d3e8e94caf323a4f664eb347fd34a34ab
> Author: Martin Schwidefsky 
> Date:   Tue Aug 18 17:09:42 2009 +0200
> 
> clocksource: Avoid clocksource watchdog circular locking dependency
> 
> stop_machine from a multithreaded workqueue is not allowed because
> of a circular locking dependency between cpu_down and the workqueue
> execution. Use a kernel thread to do the clocksource downgrade.

I cannot find stop_machine usage there; either it went away or I need to
like wake up.


Re: REGRESSION: boot stalls on several old dual core Intel CPUs

2018-08-31 Thread Kevin Shanahan
On Thu, Aug 30, 2018 at 03:04:39PM +0200, Peter Zijlstra wrote:
> On Thu, Aug 30, 2018 at 12:55:30PM +0200, Siegfried Metz wrote:
> > Dear kernel developers,
> > 
> > since mainline kernel 4.18 (up to the latest mainline kernel 4.18.5)
> > Intel Core 2 Duo processors are affected by boot stalling early in the
> > boot process. As it is so early there is no dmesg output (or any log).
> > 
> > A few users in the Arch Linux community used git bisect and tracked the
> > issue down to this bad commit:
> > 7197e77abcb65a71d0b21d67beb24f153a96055e clocksource: Remove kthread
> 
> I just dug out my core2duo laptop (Lenovo T500) and built a tip/master
> kernel for it (x86_64 debian distro .config).
> 
> Seems to boot just fine.. 3/3 so far.
> 
> Any other clues?

One additional data point, my affected system is a Dell Latitude E6400
laptop which has a P8400 CPU:

  vendor_id : GenuineIntel
  cpu family: 6
  model : 23
  model name: Intel(R) Core(TM)2 Duo CPU P8400  @ 2.26GHz
  stepping  : 6
  microcode : 0x610

Judging from what is being discussed in the Arch forums, it does seem
to be related to the CPU having unstable TSC and transitioning to another
clock source.  Workarounds that seem to be reliable are either booting
with clocksource= or with nosmp.

One person did point out that the commit that introduced the kthread
did so to remove a deadlock - is the circular locking dependency
mentioned in that commit still relevant?

commit 01548f4d3e8e94caf323a4f664eb347fd34a34ab
Author: Martin Schwidefsky 
Date:   Tue Aug 18 17:09:42 2009 +0200

clocksource: Avoid clocksource watchdog circular locking dependency

stop_machine from a multithreaded workqueue is not allowed because
of a circular locking dependency between cpu_down and the workqueue
execution. Use a kernel thread to do the clocksource downgrade.

Signed-off-by: Martin Schwidefsky 
Cc: Peter Zijlstra 
Cc: john stultz 
LKML-Reference: <20090818170942.3ab80c91@skybase>
Signed-off-by: Thomas Gleixner 

Thanks,
Kevin.


Re: REGRESSION: boot stalls on several old dual core Intel CPUs

2018-08-30 Thread Peter Zijlstra
On Thu, Aug 30, 2018 at 03:04:39PM +0200, Peter Zijlstra wrote:
> On Thu, Aug 30, 2018 at 12:55:30PM +0200, Siegfried Metz wrote:
> > Dear kernel developers,
> > 
> > since mainline kernel 4.18 (up to the latest mainline kernel 4.18.5)
> > Intel Core 2 Duo processors are affected by boot stalling early in the
> > boot process. As it is so early there is no dmesg output (or any log).
> > 
> > A few users in the Arch Linux community used git bisect and tracked the
> > issue down to this bad commit:
> > 7197e77abcb65a71d0b21d67beb24f153a96055e clocksource: Remove kthread
> 
> I just dug out my core2duo laptop (Lenovo T500) and built a tip/master
> kernel for it (x86_64 debian distro .config).
> 
> Seems to boot just fine.. 3/3 so far.
> 
> Any other clues?

The .config from https://bugzilla.kernel.org/attachment.cgi?id=278183
seems to boot far enough to drop me into a busybox -- it doesn't seem to
contain enough bits to actually boot my system and I can't be arsed to
figure out what's missing.




Re: REGRESSION: boot stalls on several old dual core Intel CPUs

2018-08-30 Thread Peter Zijlstra
On Thu, Aug 30, 2018 at 12:55:30PM +0200, Siegfried Metz wrote:
> Dear kernel developers,
> 
> since mainline kernel 4.18 (up to the latest mainline kernel 4.18.5)
> Intel Core 2 Duo processors are affected by boot stalling early in the
> boot process. As it is so early there is no dmesg output (or any log).
> 
> A few users in the Arch Linux community used git bisect and tracked the
> issue down to this bad commit:
> 7197e77abcb65a71d0b21d67beb24f153a96055e clocksource: Remove kthread

I just dug out my core2duo laptop (Lenovo T500) and built a tip/master
kernel for it (x86_64 debian distro .config).

Seems to boot just fine.. 3/3 so far.

Any other clues?


REGRESSION: boot stalls on several old dual core Intel CPUs

2018-08-30 Thread Siegfried Metz

Dear kernel developers,

since mainline kernel 4.18 (up to the latest mainline kernel 4.18.5)
Intel Core 2 Duo processors are affected by boot stalling early in the
boot process. As it is so early there is no dmesg output (or any log).

A few users in the Arch Linux community used git bisect and tracked the
issue down to this bad commit:
7197e77abcb65a71d0b21d67beb24f153a96055e clocksource: Remove kthread

Either reverting the bad commit or as a workaround using an additional
kernel boot parameter "clocksource=hpet" has been successfully used to
boot with Intel Core 2 processors and avoid the stalling.

Valid clocksource= parameters include "tsc", "hpet" and "acpi_pm";
most of the time hpet did the trick.


Additional information:
Kernel 4.17.19 is unaffected by this issue; the kernel 4.18 series and
4.19-rc1 are still affected.

Also there is the bug-report:
https://bugzilla.kernel.org/show_bug.cgi?id=200957
and a duplicate one:
https://bugzilla.kernel.org/show_bug.cgi?id=200959


Thanks for fixing this issue.

Siegfried
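The values accepted by the clocksource= boot parameter can be inspected at runtime through sysfs, which is useful when picking a workaround for an affected machine. A small helper along these lines (illustrative sketch; helper names are invented, error handling kept minimal):

```c
#include <stdio.h>
#include <string.h>

/* Read the first line of a file into buf; returns its length, or -1
 * if the file cannot be opened or read. */
int read_first_line(const char *path, char *buf, size_t len)
{
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;			/* e.g. sysfs not mounted */
	if (!fgets(buf, len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0';	/* strip trailing newline */
	return (int)strlen(buf);
}

/* Print the clocksources the kernel will accept and the one in use;
 * these sysfs files correspond to the clocksource= boot parameter. */
void print_clocksources(void)
{
	static const char *files[] = {
		"/sys/devices/system/clocksource/clocksource0/available_clocksource",
		"/sys/devices/system/clocksource/clocksource0/current_clocksource",
	};
	char buf[256];

	for (int i = 0; i < 2; i++)
		if (read_first_line(files[i], buf, sizeof(buf)) > 0)
			printf("%s: %s\n", files[i], buf);
}
```

Writing a clocksource name into current_clocksource switches it at runtime (as root), which is a quicker way to test a workaround than rebooting with clocksource= on the kernel command line.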

