Re: REGRESSION: boot stalls on several old dual core Intel CPUs
On Mon, Sep 03, 2018 at 11:33:05AM +0200, Peter Zijlstra wrote: > On Mon, Sep 03, 2018 at 10:54:23AM +0200, Peter Zijlstra wrote: > > On Mon, Sep 03, 2018 at 09:38:15AM +0200, Thomas Gleixner wrote: > > > On Mon, 3 Sep 2018, Peter Zijlstra wrote: > > > > On Sat, Sep 01, 2018 at 11:51:26AM +0930, Kevin Shanahan wrote: > > > > > commit 01548f4d3e8e94caf323a4f664eb347fd34a34ab > > > > > Author: Martin Schwidefsky > > > > > Date: Tue Aug 18 17:09:42 2009 +0200 > > > > > > > > > > clocksource: Avoid clocksource watchdog circular locking > > > > > dependency > > > > > > > > > > stop_machine from a multithreaded workqueue is not allowed because > > > > > of a circular locking dependency between cpu_down and the > > > > > workqueue > > > > > execution. Use a kernel thread to do the clocksource downgrade. > > > > > > > > I cannot find stop_machine usage there; either it went away or I need to > > > > like wake up. > > > > > > timekeeping_notify() which is involved in switching clock source uses > > > stomp > > > machine. > > > > ARGH... OK, lemme see if I can come up with something other than > > endlessly spawning that kthread. > > > > A special purpose kthread_worker would make more sense than that. > > Can someone test this? > > --- > kernel/time/clocksource.c | 28 ++-- > 1 file changed, 22 insertions(+), 6 deletions(-) > > diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c > index f74fb00d8064..898976d0082a 100644 > --- a/kernel/time/clocksource.c > +++ b/kernel/time/clocksource.c > @@ -112,13 +112,28 @@ static int finished_booting; > static u64 suspend_start; > > #ifdef CONFIG_CLOCKSOURCE_WATCHDOG > -static void clocksource_watchdog_work(struct work_struct *work); > +static void clocksource_watchdog_work(struct kthread_work *work); > static void clocksource_select(void); > > static LIST_HEAD(watchdog_list); > static struct clocksource *watchdog; > static struct timer_list watchdog_timer; > -static DECLARE_WORK(watchdog_work, clocksource_watchdog_work); > + > +/* > + * We must use a kthread_worker here, because: > + * > + * clocksource_watchdog_work() > + * clocksource_select() > + * __clocksource_select() > + * timekeeping_notify() > + * stop_machine() > + * > + * cannot be called from a reqular workqueue, because of deadlocks between > + * workqueue and stopmachine. > + */ > +static struct kthread_worker *watchdog_worker; > +static DEFINE_KTHREAD_WORK(watchdog_work, clocksource_watchdog_work); > + > static DEFINE_SPINLOCK(watchdog_lock); > static int watchdog_running; > static atomic_t watchdog_reset_pending; > @@ -158,7 +173,7 @@ static void __clocksource_unstable(struct clocksource *cs) > > /* kick clocksource_watchdog_work() */ > if (finished_booting) > - schedule_work(_work); > + kthread_queue_work(watchdog_worker, _work); > } > > /** > @@ -199,7 +214,7 @@ static void clocksource_watchdog(struct timer_list > *unused) > /* Clocksource already marked unstable? */ > if (cs->flags & CLOCK_SOURCE_UNSTABLE) { > if (finished_booting) > - schedule_work(_work); > + kthread_queue_work(watchdog_worker, > _work); > continue; > } > > @@ -269,7 +284,7 @@ static void clocksource_watchdog(struct timer_list > *unused) >*/ > if (cs != curr_clocksource) { > cs->flags |= CLOCK_SOURCE_RESELECT; > - schedule_work(_work); > + kthread_queue_work(watchdog_worker, > _work); > } else { > tick_clock_notify(); > } > @@ -418,7 +433,7 @@ static int __clocksource_watchdog_work(void) > return select; > } > > -static void clocksource_watchdog_work(struct work_struct *work) > +static void clocksource_watchdog_work(struct kthread_work *work) > { > mutex_lock(_mutex); > if (__clocksource_watchdog_work()) > @@ -806,6 +821,7 @@ static int __init clocksource_done_booting(void) > { > mutex_lock(_mutex); > curr_clocksource = clocksource_default_clock(); > + watchdog_worker = kthread_create_worker(0, "cs-watchdog"); Hello Peter, watchdog_worker is only defined if CONFIG_CLOCKSOURCE_WATCHDOG is set, so you might want to wrap it with an ifdef to avoid build errors. Kind regards, Niklas > finished_booting = 1; > /* >* Run the watchdog first to eliminate unstable clock sources
Re: REGRESSION: boot stalls on several old dual core Intel CPUs
On Mon, Sep 03, 2018 at 11:33:05AM +0200, Peter Zijlstra wrote: > On Mon, Sep 03, 2018 at 10:54:23AM +0200, Peter Zijlstra wrote: > > On Mon, Sep 03, 2018 at 09:38:15AM +0200, Thomas Gleixner wrote: > > > On Mon, 3 Sep 2018, Peter Zijlstra wrote: > > > > On Sat, Sep 01, 2018 at 11:51:26AM +0930, Kevin Shanahan wrote: > > > > > commit 01548f4d3e8e94caf323a4f664eb347fd34a34ab > > > > > Author: Martin Schwidefsky > > > > > Date: Tue Aug 18 17:09:42 2009 +0200 > > > > > > > > > > clocksource: Avoid clocksource watchdog circular locking > > > > > dependency > > > > > > > > > > stop_machine from a multithreaded workqueue is not allowed because > > > > > of a circular locking dependency between cpu_down and the > > > > > workqueue > > > > > execution. Use a kernel thread to do the clocksource downgrade. > > > > > > > > I cannot find stop_machine usage there; either it went away or I need to > > > > like wake up. > > > > > > timekeeping_notify() which is involved in switching clock source uses > > > stomp > > > machine. > > > > ARGH... OK, lemme see if I can come up with something other than > > endlessly spawning that kthread. > > > > A special purpose kthread_worker would make more sense than that. > > Can someone test this? > > --- > kernel/time/clocksource.c | 28 ++-- > 1 file changed, 22 insertions(+), 6 deletions(-) > > diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c > index f74fb00d8064..898976d0082a 100644 > --- a/kernel/time/clocksource.c > +++ b/kernel/time/clocksource.c > @@ -112,13 +112,28 @@ static int finished_booting; > static u64 suspend_start; > > #ifdef CONFIG_CLOCKSOURCE_WATCHDOG > -static void clocksource_watchdog_work(struct work_struct *work); > +static void clocksource_watchdog_work(struct kthread_work *work); > static void clocksource_select(void); > > static LIST_HEAD(watchdog_list); > static struct clocksource *watchdog; > static struct timer_list watchdog_timer; > -static DECLARE_WORK(watchdog_work, clocksource_watchdog_work); > + > +/* > + * We must use a kthread_worker here, because: > + * > + * clocksource_watchdog_work() > + * clocksource_select() > + * __clocksource_select() > + * timekeeping_notify() > + * stop_machine() > + * > + * cannot be called from a reqular workqueue, because of deadlocks between > + * workqueue and stopmachine. > + */ > +static struct kthread_worker *watchdog_worker; > +static DEFINE_KTHREAD_WORK(watchdog_work, clocksource_watchdog_work); > + > static DEFINE_SPINLOCK(watchdog_lock); > static int watchdog_running; > static atomic_t watchdog_reset_pending; > @@ -158,7 +173,7 @@ static void __clocksource_unstable(struct clocksource *cs) > > /* kick clocksource_watchdog_work() */ > if (finished_booting) > - schedule_work(_work); > + kthread_queue_work(watchdog_worker, _work); > } > > /** > @@ -199,7 +214,7 @@ static void clocksource_watchdog(struct timer_list > *unused) > /* Clocksource already marked unstable? */ > if (cs->flags & CLOCK_SOURCE_UNSTABLE) { > if (finished_booting) > - schedule_work(_work); > + kthread_queue_work(watchdog_worker, > _work); > continue; > } > > @@ -269,7 +284,7 @@ static void clocksource_watchdog(struct timer_list > *unused) >*/ > if (cs != curr_clocksource) { > cs->flags |= CLOCK_SOURCE_RESELECT; > - schedule_work(_work); > + kthread_queue_work(watchdog_worker, > _work); > } else { > tick_clock_notify(); > } > @@ -418,7 +433,7 @@ static int __clocksource_watchdog_work(void) > return select; > } > > -static void clocksource_watchdog_work(struct work_struct *work) > +static void clocksource_watchdog_work(struct kthread_work *work) > { > mutex_lock(_mutex); > if (__clocksource_watchdog_work()) > @@ -806,6 +821,7 @@ static int __init clocksource_done_booting(void) > { > mutex_lock(_mutex); > curr_clocksource = clocksource_default_clock(); > + watchdog_worker = kthread_create_worker(0, "cs-watchdog"); Hello Peter, watchdog_worker is only defined if CONFIG_CLOCKSOURCE_WATCHDOG is set, so you might want to wrap it with an ifdef to avoid build errors. Kind regards, Niklas > finished_booting = 1; > /* >* Run the watchdog first to eliminate unstable clock sources
Re: REGRESSION: boot stalls on several old dual core Intel CPUs
On 9/3/18 11:33 AM, Peter Zijlstra wrote: On Mon, Sep 03, 2018 at 10:54:23AM +0200, Peter Zijlstra wrote: On Mon, Sep 03, 2018 at 09:38:15AM +0200, Thomas Gleixner wrote: On Mon, 3 Sep 2018, Peter Zijlstra wrote: On Sat, Sep 01, 2018 at 11:51:26AM +0930, Kevin Shanahan wrote: commit 01548f4d3e8e94caf323a4f664eb347fd34a34ab Author: Martin Schwidefsky Date: Tue Aug 18 17:09:42 2009 +0200 clocksource: Avoid clocksource watchdog circular locking dependency stop_machine from a multithreaded workqueue is not allowed because of a circular locking dependency between cpu_down and the workqueue execution. Use a kernel thread to do the clocksource downgrade. I cannot find stop_machine usage there; either it went away or I need to like wake up. timekeeping_notify() which is involved in switching clock source uses stomp machine. ARGH... OK, lemme see if I can come up with something other than endlessly spawning that kthread. A special purpose kthread_worker would make more sense than that. Can someone test this? --- kernel/time/clocksource.c | 28 ++-- 1 file changed, 22 insertions(+), 6 deletions(-) diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c index f74fb00d8064..898976d0082a 100644 --- a/kernel/time/clocksource.c +++ b/kernel/time/clocksource.c @@ -112,13 +112,28 @@ static int finished_booting; static u64 suspend_start; #ifdef CONFIG_CLOCKSOURCE_WATCHDOG -static void clocksource_watchdog_work(struct work_struct *work); +static void clocksource_watchdog_work(struct kthread_work *work); static void clocksource_select(void); static LIST_HEAD(watchdog_list); static struct clocksource *watchdog; static struct timer_list watchdog_timer; -static DECLARE_WORK(watchdog_work, clocksource_watchdog_work); + +/* + * We must use a kthread_worker here, because: + * + * clocksource_watchdog_work() + * clocksource_select() + * __clocksource_select() + * timekeeping_notify() + * stop_machine() + * + * cannot be called from a reqular workqueue, because of deadlocks between + * workqueue and stopmachine. + */ +static struct kthread_worker *watchdog_worker; +static DEFINE_KTHREAD_WORK(watchdog_work, clocksource_watchdog_work); + static DEFINE_SPINLOCK(watchdog_lock); static int watchdog_running; static atomic_t watchdog_reset_pending; @@ -158,7 +173,7 @@ static void __clocksource_unstable(struct clocksource *cs) /* kick clocksource_watchdog_work() */ if (finished_booting) - schedule_work(_work); + kthread_queue_work(watchdog_worker, _work); } /** @@ -199,7 +214,7 @@ static void clocksource_watchdog(struct timer_list *unused) /* Clocksource already marked unstable? */ if (cs->flags & CLOCK_SOURCE_UNSTABLE) { if (finished_booting) - schedule_work(_work); + kthread_queue_work(watchdog_worker, _work); continue; } @@ -269,7 +284,7 @@ static void clocksource_watchdog(struct timer_list *unused) */ if (cs != curr_clocksource) { cs->flags |= CLOCK_SOURCE_RESELECT; - schedule_work(_work); + kthread_queue_work(watchdog_worker, _work); } else { tick_clock_notify(); } @@ -418,7 +433,7 @@ static int __clocksource_watchdog_work(void) return select; } -static void clocksource_watchdog_work(struct work_struct *work) +static void clocksource_watchdog_work(struct kthread_work *work) { mutex_lock(_mutex); if (__clocksource_watchdog_work()) @@ -806,6 +821,7 @@ static int __init clocksource_done_booting(void) { mutex_lock(_mutex); curr_clocksource = clocksource_default_clock(); + watchdog_worker = kthread_create_worker(0, "cs-watchdog"); finished_booting = 1; /* * Run the watchdog first to eliminate unstable clock sources Successfully booted my Intel Core 2 Duo with the patch applied on top of 4.18.5 (based on default Arch Linux config). I tested with at least 8/8 successful boots in total - with no additional kernel boot parameters and also with "quiet", and "debug". No problems seen so far. Thank you for your effort and developing this patch. Siegfried
Re: REGRESSION: boot stalls on several old dual core Intel CPUs
On 9/3/18 11:33 AM, Peter Zijlstra wrote: On Mon, Sep 03, 2018 at 10:54:23AM +0200, Peter Zijlstra wrote: On Mon, Sep 03, 2018 at 09:38:15AM +0200, Thomas Gleixner wrote: On Mon, 3 Sep 2018, Peter Zijlstra wrote: On Sat, Sep 01, 2018 at 11:51:26AM +0930, Kevin Shanahan wrote: commit 01548f4d3e8e94caf323a4f664eb347fd34a34ab Author: Martin Schwidefsky Date: Tue Aug 18 17:09:42 2009 +0200 clocksource: Avoid clocksource watchdog circular locking dependency stop_machine from a multithreaded workqueue is not allowed because of a circular locking dependency between cpu_down and the workqueue execution. Use a kernel thread to do the clocksource downgrade. I cannot find stop_machine usage there; either it went away or I need to like wake up. timekeeping_notify() which is involved in switching clock source uses stomp machine. ARGH... OK, lemme see if I can come up with something other than endlessly spawning that kthread. A special purpose kthread_worker would make more sense than that. Can someone test this? --- kernel/time/clocksource.c | 28 ++-- 1 file changed, 22 insertions(+), 6 deletions(-) diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c index f74fb00d8064..898976d0082a 100644 --- a/kernel/time/clocksource.c +++ b/kernel/time/clocksource.c @@ -112,13 +112,28 @@ static int finished_booting; static u64 suspend_start; #ifdef CONFIG_CLOCKSOURCE_WATCHDOG -static void clocksource_watchdog_work(struct work_struct *work); +static void clocksource_watchdog_work(struct kthread_work *work); static void clocksource_select(void); static LIST_HEAD(watchdog_list); static struct clocksource *watchdog; static struct timer_list watchdog_timer; -static DECLARE_WORK(watchdog_work, clocksource_watchdog_work); + +/* + * We must use a kthread_worker here, because: + * + * clocksource_watchdog_work() + * clocksource_select() + * __clocksource_select() + * timekeeping_notify() + * stop_machine() + * + * cannot be called from a reqular workqueue, because of deadlocks between + * workqueue and stopmachine. + */ +static struct kthread_worker *watchdog_worker; +static DEFINE_KTHREAD_WORK(watchdog_work, clocksource_watchdog_work); + static DEFINE_SPINLOCK(watchdog_lock); static int watchdog_running; static atomic_t watchdog_reset_pending; @@ -158,7 +173,7 @@ static void __clocksource_unstable(struct clocksource *cs) /* kick clocksource_watchdog_work() */ if (finished_booting) - schedule_work(_work); + kthread_queue_work(watchdog_worker, _work); } /** @@ -199,7 +214,7 @@ static void clocksource_watchdog(struct timer_list *unused) /* Clocksource already marked unstable? */ if (cs->flags & CLOCK_SOURCE_UNSTABLE) { if (finished_booting) - schedule_work(_work); + kthread_queue_work(watchdog_worker, _work); continue; } @@ -269,7 +284,7 @@ static void clocksource_watchdog(struct timer_list *unused) */ if (cs != curr_clocksource) { cs->flags |= CLOCK_SOURCE_RESELECT; - schedule_work(_work); + kthread_queue_work(watchdog_worker, _work); } else { tick_clock_notify(); } @@ -418,7 +433,7 @@ static int __clocksource_watchdog_work(void) return select; } -static void clocksource_watchdog_work(struct work_struct *work) +static void clocksource_watchdog_work(struct kthread_work *work) { mutex_lock(_mutex); if (__clocksource_watchdog_work()) @@ -806,6 +821,7 @@ static int __init clocksource_done_booting(void) { mutex_lock(_mutex); curr_clocksource = clocksource_default_clock(); + watchdog_worker = kthread_create_worker(0, "cs-watchdog"); finished_booting = 1; /* * Run the watchdog first to eliminate unstable clock sources Successfully booted my Intel Core 2 Duo with the patch applied on top of 4.18.5 (based on default Arch Linux config). I tested with at least 8/8 successful boots in total - with no additional kernel boot parameters and also with "quiet", and "debug". No problems seen so far. Thank you for your effort and developing this patch. Siegfried
Re: REGRESSION: boot stalls on several old dual core Intel CPUs
On Mon, Sep 03, 2018 at 11:33:05AM +0200, Peter Zijlstra wrote: > On Mon, Sep 03, 2018 at 10:54:23AM +0200, Peter Zijlstra wrote: > > On Mon, Sep 03, 2018 at 09:38:15AM +0200, Thomas Gleixner wrote: > > > On Mon, 3 Sep 2018, Peter Zijlstra wrote: > > > > On Sat, Sep 01, 2018 at 11:51:26AM +0930, Kevin Shanahan wrote: > > > > > commit 01548f4d3e8e94caf323a4f664eb347fd34a34ab > > > > > Author: Martin Schwidefsky > > > > > Date: Tue Aug 18 17:09:42 2009 +0200 > > > > > > > > > > clocksource: Avoid clocksource watchdog circular locking > > > > > dependency > > > > > > > > > > stop_machine from a multithreaded workqueue is not allowed because > > > > > of a circular locking dependency between cpu_down and the > > > > > workqueue > > > > > execution. Use a kernel thread to do the clocksource downgrade. > > > > > > > > I cannot find stop_machine usage there; either it went away or I need to > > > > like wake up. > > > > > > timekeeping_notify() which is involved in switching clock source uses > > > stomp > > > machine. > > > > ARGH... OK, lemme see if I can come up with something other than > > endlessly spawning that kthread. > > > > A special purpose kthread_worker would make more sense than that. > > Can someone test this? Boots for me (applied on top of 4.18.5). Tested-by: Kevin Shanahan > --- > kernel/time/clocksource.c | 28 ++-- > 1 file changed, 22 insertions(+), 6 deletions(-) > > diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c > index f74fb00d8064..898976d0082a 100644 > --- a/kernel/time/clocksource.c > +++ b/kernel/time/clocksource.c > @@ -112,13 +112,28 @@ static int finished_booting; > static u64 suspend_start; > > #ifdef CONFIG_CLOCKSOURCE_WATCHDOG > -static void clocksource_watchdog_work(struct work_struct *work); > +static void clocksource_watchdog_work(struct kthread_work *work); > static void clocksource_select(void); > > static LIST_HEAD(watchdog_list); > static struct clocksource *watchdog; > static struct timer_list watchdog_timer; > -static DECLARE_WORK(watchdog_work, clocksource_watchdog_work); > + > +/* > + * We must use a kthread_worker here, because: > + * > + * clocksource_watchdog_work() > + * clocksource_select() > + * __clocksource_select() > + * timekeeping_notify() > + * stop_machine() > + * > + * cannot be called from a reqular workqueue, because of deadlocks between > + * workqueue and stopmachine. > + */ > +static struct kthread_worker *watchdog_worker; > +static DEFINE_KTHREAD_WORK(watchdog_work, clocksource_watchdog_work); > + > static DEFINE_SPINLOCK(watchdog_lock); > static int watchdog_running; > static atomic_t watchdog_reset_pending; > @@ -158,7 +173,7 @@ static void __clocksource_unstable(struct clocksource *cs) > > /* kick clocksource_watchdog_work() */ > if (finished_booting) > - schedule_work(_work); > + kthread_queue_work(watchdog_worker, _work); > } > > /** > @@ -199,7 +214,7 @@ static void clocksource_watchdog(struct timer_list > *unused) > /* Clocksource already marked unstable? */ > if (cs->flags & CLOCK_SOURCE_UNSTABLE) { > if (finished_booting) > - schedule_work(_work); > + kthread_queue_work(watchdog_worker, > _work); > continue; > } > > @@ -269,7 +284,7 @@ static void clocksource_watchdog(struct timer_list > *unused) >*/ > if (cs != curr_clocksource) { > cs->flags |= CLOCK_SOURCE_RESELECT; > - schedule_work(_work); > + kthread_queue_work(watchdog_worker, > _work); > } else { > tick_clock_notify(); > } > @@ -418,7 +433,7 @@ static int __clocksource_watchdog_work(void) > return select; > } > > -static void clocksource_watchdog_work(struct work_struct *work) > +static void clocksource_watchdog_work(struct kthread_work *work) > { > mutex_lock(_mutex); > if (__clocksource_watchdog_work()) > @@ -806,6 +821,7 @@ static int __init clocksource_done_booting(void) > { > mutex_lock(_mutex); > curr_clocksource = clocksource_default_clock(); > + watchdog_worker = kthread_create_worker(0, "cs-watchdog"); > finished_booting = 1; > /* >* Run the watchdog first to eliminate unstable clock sources
Re: REGRESSION: boot stalls on several old dual core Intel CPUs
On Mon, Sep 03, 2018 at 11:33:05AM +0200, Peter Zijlstra wrote: > On Mon, Sep 03, 2018 at 10:54:23AM +0200, Peter Zijlstra wrote: > > On Mon, Sep 03, 2018 at 09:38:15AM +0200, Thomas Gleixner wrote: > > > On Mon, 3 Sep 2018, Peter Zijlstra wrote: > > > > On Sat, Sep 01, 2018 at 11:51:26AM +0930, Kevin Shanahan wrote: > > > > > commit 01548f4d3e8e94caf323a4f664eb347fd34a34ab > > > > > Author: Martin Schwidefsky > > > > > Date: Tue Aug 18 17:09:42 2009 +0200 > > > > > > > > > > clocksource: Avoid clocksource watchdog circular locking > > > > > dependency > > > > > > > > > > stop_machine from a multithreaded workqueue is not allowed because > > > > > of a circular locking dependency between cpu_down and the > > > > > workqueue > > > > > execution. Use a kernel thread to do the clocksource downgrade. > > > > > > > > I cannot find stop_machine usage there; either it went away or I need to > > > > like wake up. > > > > > > timekeeping_notify() which is involved in switching clock source uses > > > stomp > > > machine. > > > > ARGH... OK, lemme see if I can come up with something other than > > endlessly spawning that kthread. > > > > A special purpose kthread_worker would make more sense than that. > > Can someone test this? Boots for me (applied on top of 4.18.5). Tested-by: Kevin Shanahan > --- > kernel/time/clocksource.c | 28 ++-- > 1 file changed, 22 insertions(+), 6 deletions(-) > > diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c > index f74fb00d8064..898976d0082a 100644 > --- a/kernel/time/clocksource.c > +++ b/kernel/time/clocksource.c > @@ -112,13 +112,28 @@ static int finished_booting; > static u64 suspend_start; > > #ifdef CONFIG_CLOCKSOURCE_WATCHDOG > -static void clocksource_watchdog_work(struct work_struct *work); > +static void clocksource_watchdog_work(struct kthread_work *work); > static void clocksource_select(void); > > static LIST_HEAD(watchdog_list); > static struct clocksource *watchdog; > static struct timer_list watchdog_timer; > -static DECLARE_WORK(watchdog_work, clocksource_watchdog_work); > + > +/* > + * We must use a kthread_worker here, because: > + * > + * clocksource_watchdog_work() > + * clocksource_select() > + * __clocksource_select() > + * timekeeping_notify() > + * stop_machine() > + * > + * cannot be called from a reqular workqueue, because of deadlocks between > + * workqueue and stopmachine. > + */ > +static struct kthread_worker *watchdog_worker; > +static DEFINE_KTHREAD_WORK(watchdog_work, clocksource_watchdog_work); > + > static DEFINE_SPINLOCK(watchdog_lock); > static int watchdog_running; > static atomic_t watchdog_reset_pending; > @@ -158,7 +173,7 @@ static void __clocksource_unstable(struct clocksource *cs) > > /* kick clocksource_watchdog_work() */ > if (finished_booting) > - schedule_work(_work); > + kthread_queue_work(watchdog_worker, _work); > } > > /** > @@ -199,7 +214,7 @@ static void clocksource_watchdog(struct timer_list > *unused) > /* Clocksource already marked unstable? */ > if (cs->flags & CLOCK_SOURCE_UNSTABLE) { > if (finished_booting) > - schedule_work(_work); > + kthread_queue_work(watchdog_worker, > _work); > continue; > } > > @@ -269,7 +284,7 @@ static void clocksource_watchdog(struct timer_list > *unused) >*/ > if (cs != curr_clocksource) { > cs->flags |= CLOCK_SOURCE_RESELECT; > - schedule_work(_work); > + kthread_queue_work(watchdog_worker, > _work); > } else { > tick_clock_notify(); > } > @@ -418,7 +433,7 @@ static int __clocksource_watchdog_work(void) > return select; > } > > -static void clocksource_watchdog_work(struct work_struct *work) > +static void clocksource_watchdog_work(struct kthread_work *work) > { > mutex_lock(_mutex); > if (__clocksource_watchdog_work()) > @@ -806,6 +821,7 @@ static int __init clocksource_done_booting(void) > { > mutex_lock(_mutex); > curr_clocksource = clocksource_default_clock(); > + watchdog_worker = kthread_create_worker(0, "cs-watchdog"); > finished_booting = 1; > /* >* Run the watchdog first to eliminate unstable clock sources
Re: REGRESSION: boot stalls on several old dual core Intel CPUs
Peter Zijlstra: > On Mon, Sep 03, 2018 at 10:54:23AM +0200, Peter Zijlstra wrote: >> On Mon, Sep 03, 2018 at 09:38:15AM +0200, Thomas Gleixner wrote: >>> On Mon, 3 Sep 2018, Peter Zijlstra wrote: On Sat, Sep 01, 2018 at 11:51:26AM +0930, Kevin Shanahan wrote: > commit 01548f4d3e8e94caf323a4f664eb347fd34a34ab > Author: Martin Schwidefsky > Date: Tue Aug 18 17:09:42 2009 +0200 > > clocksource: Avoid clocksource watchdog circular locking dependency > > stop_machine from a multithreaded workqueue is not allowed because > of a circular locking dependency between cpu_down and the workqueue > execution. Use a kernel thread to do the clocksource downgrade. I cannot find stop_machine usage there; either it went away or I need to like wake up. >>> >>> timekeeping_notify() which is involved in switching clock source uses stomp >>> machine. >> >> ARGH... OK, lemme see if I can come up with something other than >> endlessly spawning that kthread. >> >> A special purpose kthread_worker would make more sense than that. > > Can someone test this? > > --- > kernel/time/clocksource.c | 28 ++-- > 1 file changed, 22 insertions(+), 6 deletions(-) > > diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c > index f74fb00d8064..898976d0082a 100644 > --- a/kernel/time/clocksource.c > +++ b/kernel/time/clocksource.c > @@ -112,13 +112,28 @@ static int finished_booting; > static u64 suspend_start; > > #ifdef CONFIG_CLOCKSOURCE_WATCHDOG > -static void clocksource_watchdog_work(struct work_struct *work); > +static void clocksource_watchdog_work(struct kthread_work *work); > static void clocksource_select(void); > > static LIST_HEAD(watchdog_list); > static struct clocksource *watchdog; > static struct timer_list watchdog_timer; > -static DECLARE_WORK(watchdog_work, clocksource_watchdog_work); > + > +/* > + * We must use a kthread_worker here, because: > + * > + * clocksource_watchdog_work() > + * clocksource_select() > + * __clocksource_select() > + * timekeeping_notify() > + * stop_machine() > + * > + * cannot be called from a reqular workqueue, because of deadlocks between > + * workqueue and stopmachine. > + */ > +static struct kthread_worker *watchdog_worker; > +static DEFINE_KTHREAD_WORK(watchdog_work, clocksource_watchdog_work); > + > static DEFINE_SPINLOCK(watchdog_lock); > static int watchdog_running; > static atomic_t watchdog_reset_pending; > @@ -158,7 +173,7 @@ static void __clocksource_unstable(struct clocksource *cs) > > /* kick clocksource_watchdog_work() */ > if (finished_booting) > - schedule_work(_work); > + kthread_queue_work(watchdog_worker, _work); > } > > /** > @@ -199,7 +214,7 @@ static void clocksource_watchdog(struct timer_list > *unused) > /* Clocksource already marked unstable? */ > if (cs->flags & CLOCK_SOURCE_UNSTABLE) { > if (finished_booting) > - schedule_work(_work); > + kthread_queue_work(watchdog_worker, > _work); > continue; > } > > @@ -269,7 +284,7 @@ static void clocksource_watchdog(struct timer_list > *unused) >*/ > if (cs != curr_clocksource) { > cs->flags |= CLOCK_SOURCE_RESELECT; > - schedule_work(_work); > + kthread_queue_work(watchdog_worker, > _work); > } else { > tick_clock_notify(); > } > @@ -418,7 +433,7 @@ static int __clocksource_watchdog_work(void) > return select; > } > > -static void clocksource_watchdog_work(struct work_struct *work) > +static void clocksource_watchdog_work(struct kthread_work *work) > { > mutex_lock(_mutex); > if (__clocksource_watchdog_work()) > @@ -806,6 +821,7 @@ static int __init clocksource_done_booting(void) > { > mutex_lock(_mutex); > curr_clocksource = clocksource_default_clock(); > + watchdog_worker = kthread_create_worker(0, "cs-watchdog"); > finished_booting = 1; > /* >* Run the watchdog first to eliminate unstable clock sources > Applied on mainline tag v4.19-rc2. Tested without additional parameters, with "quiet" and with "debug", my PC booted successfully in all three cases, whereas it stalled almost always in these three cases before. Thanks!
Re: REGRESSION: boot stalls on several old dual core Intel CPUs
Peter Zijlstra: > On Mon, Sep 03, 2018 at 10:54:23AM +0200, Peter Zijlstra wrote: >> On Mon, Sep 03, 2018 at 09:38:15AM +0200, Thomas Gleixner wrote: >>> On Mon, 3 Sep 2018, Peter Zijlstra wrote: On Sat, Sep 01, 2018 at 11:51:26AM +0930, Kevin Shanahan wrote: > commit 01548f4d3e8e94caf323a4f664eb347fd34a34ab > Author: Martin Schwidefsky > Date: Tue Aug 18 17:09:42 2009 +0200 > > clocksource: Avoid clocksource watchdog circular locking dependency > > stop_machine from a multithreaded workqueue is not allowed because > of a circular locking dependency between cpu_down and the workqueue > execution. Use a kernel thread to do the clocksource downgrade. I cannot find stop_machine usage there; either it went away or I need to like wake up. >>> >>> timekeeping_notify() which is involved in switching clock source uses stomp >>> machine. >> >> ARGH... OK, lemme see if I can come up with something other than >> endlessly spawning that kthread. >> >> A special purpose kthread_worker would make more sense than that. > > Can someone test this? > > --- > kernel/time/clocksource.c | 28 ++-- > 1 file changed, 22 insertions(+), 6 deletions(-) > > diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c > index f74fb00d8064..898976d0082a 100644 > --- a/kernel/time/clocksource.c > +++ b/kernel/time/clocksource.c > @@ -112,13 +112,28 @@ static int finished_booting; > static u64 suspend_start; > > #ifdef CONFIG_CLOCKSOURCE_WATCHDOG > -static void clocksource_watchdog_work(struct work_struct *work); > +static void clocksource_watchdog_work(struct kthread_work *work); > static void clocksource_select(void); > > static LIST_HEAD(watchdog_list); > static struct clocksource *watchdog; > static struct timer_list watchdog_timer; > -static DECLARE_WORK(watchdog_work, clocksource_watchdog_work); > + > +/* > + * We must use a kthread_worker here, because: > + * > + * clocksource_watchdog_work() > + * clocksource_select() > + * __clocksource_select() > + * timekeeping_notify() > + * stop_machine() > + * > + * cannot be called from a reqular workqueue, because of deadlocks between > + * workqueue and stopmachine. > + */ > +static struct kthread_worker *watchdog_worker; > +static DEFINE_KTHREAD_WORK(watchdog_work, clocksource_watchdog_work); > + > static DEFINE_SPINLOCK(watchdog_lock); > static int watchdog_running; > static atomic_t watchdog_reset_pending; > @@ -158,7 +173,7 @@ static void __clocksource_unstable(struct clocksource *cs) > > /* kick clocksource_watchdog_work() */ > if (finished_booting) > - schedule_work(_work); > + kthread_queue_work(watchdog_worker, _work); > } > > /** > @@ -199,7 +214,7 @@ static void clocksource_watchdog(struct timer_list > *unused) > /* Clocksource already marked unstable? */ > if (cs->flags & CLOCK_SOURCE_UNSTABLE) { > if (finished_booting) > - schedule_work(_work); > + kthread_queue_work(watchdog_worker, > _work); > continue; > } > > @@ -269,7 +284,7 @@ static void clocksource_watchdog(struct timer_list > *unused) >*/ > if (cs != curr_clocksource) { > cs->flags |= CLOCK_SOURCE_RESELECT; > - schedule_work(_work); > + kthread_queue_work(watchdog_worker, > _work); > } else { > tick_clock_notify(); > } > @@ -418,7 +433,7 @@ static int __clocksource_watchdog_work(void) > return select; > } > > -static void clocksource_watchdog_work(struct work_struct *work) > +static void clocksource_watchdog_work(struct kthread_work *work) > { > mutex_lock(_mutex); > if (__clocksource_watchdog_work()) > @@ -806,6 +821,7 @@ static int __init clocksource_done_booting(void) > { > mutex_lock(_mutex); > curr_clocksource = clocksource_default_clock(); > + watchdog_worker = kthread_create_worker(0, "cs-watchdog"); > finished_booting = 1; > /* >* Run the watchdog first to eliminate unstable clock sources > Applied on mainline tag v4.19-rc2. Tested without additional parameters, with "quiet" and with "debug", my PC booted successfully in all three cases, whereas it stalled almost always in these three cases before. Thanks!
Re: REGRESSION: boot stalls on several old dual core Intel CPUs
On Mon, Sep 03, 2018 at 10:54:23AM +0200, Peter Zijlstra wrote: > On Mon, Sep 03, 2018 at 09:38:15AM +0200, Thomas Gleixner wrote: > > On Mon, 3 Sep 2018, Peter Zijlstra wrote: > > > On Sat, Sep 01, 2018 at 11:51:26AM +0930, Kevin Shanahan wrote: > > > > commit 01548f4d3e8e94caf323a4f664eb347fd34a34ab > > > > Author: Martin Schwidefsky > > > > Date: Tue Aug 18 17:09:42 2009 +0200 > > > > > > > > clocksource: Avoid clocksource watchdog circular locking dependency > > > > > > > > stop_machine from a multithreaded workqueue is not allowed because > > > > of a circular locking dependency between cpu_down and the workqueue > > > > execution. Use a kernel thread to do the clocksource downgrade. > > > > > > I cannot find stop_machine usage there; either it went away or I need to > > > like wake up. > > > > timekeeping_notify() which is involved in switching clock source uses stomp > > machine. > > ARGH... OK, lemme see if I can come up with something other than > endlessly spawning that kthread. > > A special purpose kthread_worker would make more sense than that. Can someone test this? --- kernel/time/clocksource.c | 28 ++-- 1 file changed, 22 insertions(+), 6 deletions(-) diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c index f74fb00d8064..898976d0082a 100644 --- a/kernel/time/clocksource.c +++ b/kernel/time/clocksource.c @@ -112,13 +112,28 @@ static int finished_booting; static u64 suspend_start; #ifdef CONFIG_CLOCKSOURCE_WATCHDOG -static void clocksource_watchdog_work(struct work_struct *work); +static void clocksource_watchdog_work(struct kthread_work *work); static void clocksource_select(void); static LIST_HEAD(watchdog_list); static struct clocksource *watchdog; static struct timer_list watchdog_timer; -static DECLARE_WORK(watchdog_work, clocksource_watchdog_work); + +/* + * We must use a kthread_worker here, because: + * + * clocksource_watchdog_work() + * clocksource_select() + * __clocksource_select() + * timekeeping_notify() + * stop_machine() + * + * cannot be called from a reqular workqueue, because of deadlocks between + * workqueue and stopmachine. + */ +static struct kthread_worker *watchdog_worker; +static DEFINE_KTHREAD_WORK(watchdog_work, clocksource_watchdog_work); + static DEFINE_SPINLOCK(watchdog_lock); static int watchdog_running; static atomic_t watchdog_reset_pending; @@ -158,7 +173,7 @@ static void __clocksource_unstable(struct clocksource *cs) /* kick clocksource_watchdog_work() */ if (finished_booting) - schedule_work(_work); + kthread_queue_work(watchdog_worker, _work); } /** @@ -199,7 +214,7 @@ static void clocksource_watchdog(struct timer_list *unused) /* Clocksource already marked unstable? */ if (cs->flags & CLOCK_SOURCE_UNSTABLE) { if (finished_booting) - schedule_work(_work); + kthread_queue_work(watchdog_worker, _work); continue; } @@ -269,7 +284,7 @@ static void clocksource_watchdog(struct timer_list *unused) */ if (cs != curr_clocksource) { cs->flags |= CLOCK_SOURCE_RESELECT; - schedule_work(_work); + kthread_queue_work(watchdog_worker, _work); } else { tick_clock_notify(); } @@ -418,7 +433,7 @@ static int __clocksource_watchdog_work(void) return select; } -static void clocksource_watchdog_work(struct work_struct *work) +static void clocksource_watchdog_work(struct kthread_work *work) { mutex_lock(_mutex); if (__clocksource_watchdog_work()) @@ -806,6 +821,7 @@ static int __init clocksource_done_booting(void) { mutex_lock(_mutex); curr_clocksource = clocksource_default_clock(); + watchdog_worker = kthread_create_worker(0, "cs-watchdog"); finished_booting = 1; /* * Run the watchdog first to eliminate unstable clock sources
Re: REGRESSION: boot stalls on several old dual core Intel CPUs
On Mon, Sep 03, 2018 at 10:54:23AM +0200, Peter Zijlstra wrote: > On Mon, Sep 03, 2018 at 09:38:15AM +0200, Thomas Gleixner wrote: > > On Mon, 3 Sep 2018, Peter Zijlstra wrote: > > > On Sat, Sep 01, 2018 at 11:51:26AM +0930, Kevin Shanahan wrote: > > > > commit 01548f4d3e8e94caf323a4f664eb347fd34a34ab > > > > Author: Martin Schwidefsky > > > > Date: Tue Aug 18 17:09:42 2009 +0200 > > > > > > > > clocksource: Avoid clocksource watchdog circular locking dependency > > > > > > > > stop_machine from a multithreaded workqueue is not allowed because > > > > of a circular locking dependency between cpu_down and the workqueue > > > > execution. Use a kernel thread to do the clocksource downgrade. > > > > > > I cannot find stop_machine usage there; either it went away or I need to > > > like wake up. > > > > timekeeping_notify() which is involved in switching clock source uses stomp > > machine. > > ARGH... OK, lemme see if I can come up with something other than > endlessly spawning that kthread. > > A special purpose kthread_worker would make more sense than that. Can someone test this? --- kernel/time/clocksource.c | 28 ++-- 1 file changed, 22 insertions(+), 6 deletions(-) diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c index f74fb00d8064..898976d0082a 100644 --- a/kernel/time/clocksource.c +++ b/kernel/time/clocksource.c @@ -112,13 +112,28 @@ static int finished_booting; static u64 suspend_start; #ifdef CONFIG_CLOCKSOURCE_WATCHDOG -static void clocksource_watchdog_work(struct work_struct *work); +static void clocksource_watchdog_work(struct kthread_work *work); static void clocksource_select(void); static LIST_HEAD(watchdog_list); static struct clocksource *watchdog; static struct timer_list watchdog_timer; -static DECLARE_WORK(watchdog_work, clocksource_watchdog_work); + +/* + * We must use a kthread_worker here, because: + * + * clocksource_watchdog_work() + * clocksource_select() + * __clocksource_select() + * timekeeping_notify() + * stop_machine() + * + * cannot be called from a reqular workqueue, because of deadlocks between + * workqueue and stopmachine. + */ +static struct kthread_worker *watchdog_worker; +static DEFINE_KTHREAD_WORK(watchdog_work, clocksource_watchdog_work); + static DEFINE_SPINLOCK(watchdog_lock); static int watchdog_running; static atomic_t watchdog_reset_pending; @@ -158,7 +173,7 @@ static void __clocksource_unstable(struct clocksource *cs) /* kick clocksource_watchdog_work() */ if (finished_booting) - schedule_work(_work); + kthread_queue_work(watchdog_worker, _work); } /** @@ -199,7 +214,7 @@ static void clocksource_watchdog(struct timer_list *unused) /* Clocksource already marked unstable? */ if (cs->flags & CLOCK_SOURCE_UNSTABLE) { if (finished_booting) - schedule_work(_work); + kthread_queue_work(watchdog_worker, _work); continue; } @@ -269,7 +284,7 @@ static void clocksource_watchdog(struct timer_list *unused) */ if (cs != curr_clocksource) { cs->flags |= CLOCK_SOURCE_RESELECT; - schedule_work(_work); + kthread_queue_work(watchdog_worker, _work); } else { tick_clock_notify(); } @@ -418,7 +433,7 @@ static int __clocksource_watchdog_work(void) return select; } -static void clocksource_watchdog_work(struct work_struct *work) +static void clocksource_watchdog_work(struct kthread_work *work) { mutex_lock(_mutex); if (__clocksource_watchdog_work()) @@ -806,6 +821,7 @@ static int __init clocksource_done_booting(void) { mutex_lock(_mutex); curr_clocksource = clocksource_default_clock(); + watchdog_worker = kthread_create_worker(0, "cs-watchdog"); finished_booting = 1; /* * Run the watchdog first to eliminate unstable clock sources
Re: REGRESSION: boot stalls on several old dual core Intel CPUs
On Mon, Sep 03, 2018 at 09:38:15AM +0200, Thomas Gleixner wrote: > On Mon, 3 Sep 2018, Peter Zijlstra wrote: > > On Sat, Sep 01, 2018 at 11:51:26AM +0930, Kevin Shanahan wrote: > > > commit 01548f4d3e8e94caf323a4f664eb347fd34a34ab > > > Author: Martin Schwidefsky > > > Date: Tue Aug 18 17:09:42 2009 +0200 > > > > > > clocksource: Avoid clocksource watchdog circular locking dependency > > > > > > stop_machine from a multithreaded workqueue is not allowed because > > > of a circular locking dependency between cpu_down and the workqueue > > > execution. Use a kernel thread to do the clocksource downgrade. > > > > I cannot find stop_machine usage there; either it went away or I need to > > like wake up. > > timekeeping_notify() which is involved in switching clock source uses stomp > machine. ARGH... OK, lemme see if I can come up with something other than endlessly spawning that kthread. A special purpose kthread_worker would make more sense than that.
Re: REGRESSION: boot stalls on several old dual core Intel CPUs
On Mon, Sep 03, 2018 at 09:38:15AM +0200, Thomas Gleixner wrote: > On Mon, 3 Sep 2018, Peter Zijlstra wrote: > > On Sat, Sep 01, 2018 at 11:51:26AM +0930, Kevin Shanahan wrote: > > > commit 01548f4d3e8e94caf323a4f664eb347fd34a34ab > > > Author: Martin Schwidefsky > > > Date: Tue Aug 18 17:09:42 2009 +0200 > > > > > > clocksource: Avoid clocksource watchdog circular locking dependency > > > > > > stop_machine from a multithreaded workqueue is not allowed because > > > of a circular locking dependency between cpu_down and the workqueue > > > execution. Use a kernel thread to do the clocksource downgrade. > > > > I cannot find stop_machine usage there; either it went away or I need to > > like wake up. > > timekeeping_notify() which is involved in switching clock source uses stomp > machine. ARGH... OK, lemme see if I can come up with something other than endlessly spawning that kthread. A special purpose kthread_worker would make more sense than that.
Re: REGRESSION: boot stalls on several old dual core Intel CPUs
On Mon, 3 Sep 2018, Peter Zijlstra wrote: > On Sat, Sep 01, 2018 at 11:51:26AM +0930, Kevin Shanahan wrote: > > commit 01548f4d3e8e94caf323a4f664eb347fd34a34ab > > Author: Martin Schwidefsky > > Date: Tue Aug 18 17:09:42 2009 +0200 > > > > clocksource: Avoid clocksource watchdog circular locking dependency > > > > stop_machine from a multithreaded workqueue is not allowed because > > of a circular locking dependency between cpu_down and the workqueue > > execution. Use a kernel thread to do the clocksource downgrade. > > I cannot find stop_machine usage there; either it went away or I need to > like wake up. timekeeping_notify() which is involved in switching clock source uses stomp machine. Thanks, tglx
Re: REGRESSION: boot stalls on several old dual core Intel CPUs
On Mon, 3 Sep 2018, Peter Zijlstra wrote: > On Sat, Sep 01, 2018 at 11:51:26AM +0930, Kevin Shanahan wrote: > > commit 01548f4d3e8e94caf323a4f664eb347fd34a34ab > > Author: Martin Schwidefsky > > Date: Tue Aug 18 17:09:42 2009 +0200 > > > > clocksource: Avoid clocksource watchdog circular locking dependency > > > > stop_machine from a multithreaded workqueue is not allowed because > > of a circular locking dependency between cpu_down and the workqueue > > execution. Use a kernel thread to do the clocksource downgrade. > > I cannot find stop_machine usage there; either it went away or I need to > like wake up. timekeeping_notify() which is involved in switching clock source uses stomp machine. Thanks, tglx
Re: REGRESSION: boot stalls on several old dual core Intel CPUs
On Sat, Sep 01, 2018 at 11:51:26AM +0930, Kevin Shanahan wrote: > On Thu, Aug 30, 2018 at 03:04:39PM +0200, Peter Zijlstra wrote: > > On Thu, Aug 30, 2018 at 12:55:30PM +0200, Siegfried Metz wrote: > > > Dear kernel developers, > > > > > > since mainline kernel 4.18 (up to the latest mainline kernel 4.18.5) > > > Intel Core 2 Duo processors are affected by boot stalling early in the > > > boot process. As it is so early there is no dmesg output (or any log). > > > > > > A few users in the Arch Linux community used git bisect and tracked the > > > issue down to this the bad commit: > > > 7197e77abcb65a71d0b21d67beb24f153a96055e clocksource: Remove kthread > > > > I just dug out my core2duo laptop (Lenovo T500) and build a tip/master > > kernel for it (x86_64 debian distro .config). > > > > Seems to boot just fine.. 3/3 so far. > > > > Any other clues? > > One additional data point, my affected system is a Dell Latitude E6400 > laptop which has a P8400 CPU: > > vendor_id : GenuineIntel > cpu family: 6 > model : 23 > model name: Intel(R) Core(TM)2 Duo CPU P8400 @ 2.26GHz > stepping : 6 > microcode : 0x610 > > Judging from what is being discussed in the Arch forums, it does seem > to related to the CPU having unstable TSC and transitioning to another > clock source. Yes; Core2 doesn't have stable TSC. > Workarounds that seem to be reliable are either booting > with clocksource= or with nosmp. nosmp is weird; because even on UP TSC should stop in C state. processor_idle (acpi_idle) should mark the TSC as unstable on Core2 when it loads (does so on my T500). > One person did point out that the commit that introduced the kthread > did so to remove a deadlock - is the circular locking dependency > mentioned in that commit still relevant? > > commit 01548f4d3e8e94caf323a4f664eb347fd34a34ab > Author: Martin Schwidefsky > Date: Tue Aug 18 17:09:42 2009 +0200 > > clocksource: Avoid clocksource watchdog circular locking dependency > > stop_machine from a multithreaded workqueue is not allowed because > of a circular locking dependency between cpu_down and the workqueue > execution. Use a kernel thread to do the clocksource downgrade. I cannot find stop_machine usage there; either it went away or I need to like wake up.
Re: REGRESSION: boot stalls on several old dual core Intel CPUs
On Sat, Sep 01, 2018 at 11:51:26AM +0930, Kevin Shanahan wrote: > On Thu, Aug 30, 2018 at 03:04:39PM +0200, Peter Zijlstra wrote: > > On Thu, Aug 30, 2018 at 12:55:30PM +0200, Siegfried Metz wrote: > > > Dear kernel developers, > > > > > > since mainline kernel 4.18 (up to the latest mainline kernel 4.18.5) > > > Intel Core 2 Duo processors are affected by boot stalling early in the > > > boot process. As it is so early there is no dmesg output (or any log). > > > > > > A few users in the Arch Linux community used git bisect and tracked the > > > issue down to this the bad commit: > > > 7197e77abcb65a71d0b21d67beb24f153a96055e clocksource: Remove kthread > > > > I just dug out my core2duo laptop (Lenovo T500) and build a tip/master > > kernel for it (x86_64 debian distro .config). > > > > Seems to boot just fine.. 3/3 so far. > > > > Any other clues? > > One additional data point, my affected system is a Dell Latitude E6400 > laptop which has a P8400 CPU: > > vendor_id : GenuineIntel > cpu family: 6 > model : 23 > model name: Intel(R) Core(TM)2 Duo CPU P8400 @ 2.26GHz > stepping : 6 > microcode : 0x610 > > Judging from what is being discussed in the Arch forums, it does seem > to related to the CPU having unstable TSC and transitioning to another > clock source. Yes; Core2 doesn't have stable TSC. > Workarounds that seem to be reliable are either booting > with clocksource= or with nosmp. nosmp is weird; because even on UP TSC should stop in C state. processor_idle (acpi_idle) should mark the TSC as unstable on Core2 when it loads (does so on my T500). > One person did point out that the commit that introduced the kthread > did so to remove a deadlock - is the circular locking dependency > mentioned in that commit still relevant? > > commit 01548f4d3e8e94caf323a4f664eb347fd34a34ab > Author: Martin Schwidefsky > Date: Tue Aug 18 17:09:42 2009 +0200 > > clocksource: Avoid clocksource watchdog circular locking dependency > > stop_machine from a multithreaded workqueue is not allowed because > of a circular locking dependency between cpu_down and the workqueue > execution. Use a kernel thread to do the clocksource downgrade. I cannot find stop_machine usage there; either it went away or I need to like wake up.
Re: REGRESSION: boot stalls on several old dual core Intel CPUs
On Thu, Aug 30, 2018 at 03:04:39PM +0200, Peter Zijlstra wrote: > On Thu, Aug 30, 2018 at 12:55:30PM +0200, Siegfried Metz wrote: > > Dear kernel developers, > > > > since mainline kernel 4.18 (up to the latest mainline kernel 4.18.5) > > Intel Core 2 Duo processors are affected by boot stalling early in the > > boot process. As it is so early there is no dmesg output (or any log). > > > > A few users in the Arch Linux community used git bisect and tracked the > > issue down to this the bad commit: > > 7197e77abcb65a71d0b21d67beb24f153a96055e clocksource: Remove kthread > > I just dug out my core2duo laptop (Lenovo T500) and build a tip/master > kernel for it (x86_64 debian distro .config). > > Seems to boot just fine.. 3/3 so far. > > Any other clues? One additional data point, my affected system is a Dell Latitude E6400 laptop which has a P8400 CPU: vendor_id : GenuineIntel cpu family: 6 model : 23 model name: Intel(R) Core(TM)2 Duo CPU P8400 @ 2.26GHz stepping : 6 microcode : 0x610 Judging from what is being discussed in the Arch forums, it does seem to related to the CPU having unstable TSC and transitioning to another clock source. Workarounds that seem to be reliable are either booting with clocksource= or with nosmp. One person did point out that the commit that introduced the kthread did so to remove a deadlock - is the circular locking dependency mentioned in that commit still relevant? commit 01548f4d3e8e94caf323a4f664eb347fd34a34ab Author: Martin Schwidefsky Date: Tue Aug 18 17:09:42 2009 +0200 clocksource: Avoid clocksource watchdog circular locking dependency stop_machine from a multithreaded workqueue is not allowed because of a circular locking dependency between cpu_down and the workqueue execution. Use a kernel thread to do the clocksource downgrade. Signed-off-by: Martin Schwidefsky Cc: Peter Zijlstra Cc: john stultz LKML-Reference: <20090818170942.3ab80c91@skybase> Signed-off-by: Thomas Gleixner Thanks, Kevin.
Re: REGRESSION: boot stalls on several old dual core Intel CPUs
On Thu, Aug 30, 2018 at 03:04:39PM +0200, Peter Zijlstra wrote: > On Thu, Aug 30, 2018 at 12:55:30PM +0200, Siegfried Metz wrote: > > Dear kernel developers, > > > > since mainline kernel 4.18 (up to the latest mainline kernel 4.18.5) > > Intel Core 2 Duo processors are affected by boot stalling early in the > > boot process. As it is so early there is no dmesg output (or any log). > > > > A few users in the Arch Linux community used git bisect and tracked the > > issue down to this the bad commit: > > 7197e77abcb65a71d0b21d67beb24f153a96055e clocksource: Remove kthread > > I just dug out my core2duo laptop (Lenovo T500) and build a tip/master > kernel for it (x86_64 debian distro .config). > > Seems to boot just fine.. 3/3 so far. > > Any other clues? One additional data point, my affected system is a Dell Latitude E6400 laptop which has a P8400 CPU: vendor_id : GenuineIntel cpu family: 6 model : 23 model name: Intel(R) Core(TM)2 Duo CPU P8400 @ 2.26GHz stepping : 6 microcode : 0x610 Judging from what is being discussed in the Arch forums, it does seem to related to the CPU having unstable TSC and transitioning to another clock source. Workarounds that seem to be reliable are either booting with clocksource= or with nosmp. One person did point out that the commit that introduced the kthread did so to remove a deadlock - is the circular locking dependency mentioned in that commit still relevant? commit 01548f4d3e8e94caf323a4f664eb347fd34a34ab Author: Martin Schwidefsky Date: Tue Aug 18 17:09:42 2009 +0200 clocksource: Avoid clocksource watchdog circular locking dependency stop_machine from a multithreaded workqueue is not allowed because of a circular locking dependency between cpu_down and the workqueue execution. Use a kernel thread to do the clocksource downgrade. Signed-off-by: Martin Schwidefsky Cc: Peter Zijlstra Cc: john stultz LKML-Reference: <20090818170942.3ab80c91@skybase> Signed-off-by: Thomas Gleixner Thanks, Kevin.
Re: REGRESSION: boot stalls on several old dual core Intel CPUs
On Thu, Aug 30, 2018 at 03:04:39PM +0200, Peter Zijlstra wrote: > On Thu, Aug 30, 2018 at 12:55:30PM +0200, Siegfried Metz wrote: > > Dear kernel developers, > > > > since mainline kernel 4.18 (up to the latest mainline kernel 4.18.5) > > Intel Core 2 Duo processors are affected by boot stalling early in the > > boot process. As it is so early there is no dmesg output (or any log). > > > > A few users in the Arch Linux community used git bisect and tracked the > > issue down to this the bad commit: > > 7197e77abcb65a71d0b21d67beb24f153a96055e clocksource: Remove kthread > > I just dug out my core2duo laptop (Lenovo T500) and build a tip/master > kernel for it (x86_64 debian distro .config). > > Seems to boot just fine.. 3/3 so far. > > Any other clues? The .config from https://bugzilla.kernel.org/attachment.cgi?id=278183 seems to boot far enough to drop me into a busybox -- it doesn't seem to contain enough bits to actually boot my system and I can't be arsed to figure out what's missing.
Re: REGRESSION: boot stalls on several old dual core Intel CPUs
On Thu, Aug 30, 2018 at 03:04:39PM +0200, Peter Zijlstra wrote: > On Thu, Aug 30, 2018 at 12:55:30PM +0200, Siegfried Metz wrote: > > Dear kernel developers, > > > > since mainline kernel 4.18 (up to the latest mainline kernel 4.18.5) > > Intel Core 2 Duo processors are affected by boot stalling early in the > > boot process. As it is so early there is no dmesg output (or any log). > > > > A few users in the Arch Linux community used git bisect and tracked the > > issue down to this the bad commit: > > 7197e77abcb65a71d0b21d67beb24f153a96055e clocksource: Remove kthread > > I just dug out my core2duo laptop (Lenovo T500) and build a tip/master > kernel for it (x86_64 debian distro .config). > > Seems to boot just fine.. 3/3 so far. > > Any other clues? The .config from https://bugzilla.kernel.org/attachment.cgi?id=278183 seems to boot far enough to drop me into a busybox -- it doesn't seem to contain enough bits to actually boot my system and I can't be arsed to figure out what's missing.
Re: REGRESSION: boot stalls on several old dual core Intel CPUs
On Thu, Aug 30, 2018 at 12:55:30PM +0200, Siegfried Metz wrote: > Dear kernel developers, > > since mainline kernel 4.18 (up to the latest mainline kernel 4.18.5) > Intel Core 2 Duo processors are affected by boot stalling early in the > boot process. As it is so early there is no dmesg output (or any log). > > A few users in the Arch Linux community used git bisect and tracked the > issue down to this the bad commit: > 7197e77abcb65a71d0b21d67beb24f153a96055e clocksource: Remove kthread I just dug out my core2duo laptop (Lenovo T500) and build a tip/master kernel for it (x86_64 debian distro .config). Seems to boot just fine.. 3/3 so far. Any other clues?
Re: REGRESSION: boot stalls on several old dual core Intel CPUs
On Thu, Aug 30, 2018 at 12:55:30PM +0200, Siegfried Metz wrote: > Dear kernel developers, > > since mainline kernel 4.18 (up to the latest mainline kernel 4.18.5) > Intel Core 2 Duo processors are affected by boot stalling early in the > boot process. As it is so early there is no dmesg output (or any log). > > A few users in the Arch Linux community used git bisect and tracked the > issue down to this the bad commit: > 7197e77abcb65a71d0b21d67beb24f153a96055e clocksource: Remove kthread I just dug out my core2duo laptop (Lenovo T500) and build a tip/master kernel for it (x86_64 debian distro .config). Seems to boot just fine.. 3/3 so far. Any other clues?
REGRESSION: boot stalls on several old dual core Intel CPUs
Dear kernel developers, since mainline kernel 4.18 (up to the latest mainline kernel 4.18.5) Intel Core 2 Duo processors are affected by boot stalling early in the boot process. As it is so early there is no dmesg output (or any log). A few users in the Arch Linux community used git bisect and tracked the issue down to this the bad commit: 7197e77abcb65a71d0b21d67beb24f153a96055e clocksource: Remove kthread Either reverting the bad commit or as a workaround using an additional kernel boot parameter "clocksource=hpet" has been successfully used to boot with Intel Core 2 processors and avoid the stalling. clocksource= options parameters are one of "tsc", "hpet", "acpi_pm", most of the time hpet did the trick. Additional information: Kernel 4.17.19 is unaffected by this issues, kernel 4.18 series and 4.19-rc1 are still affected. Also there is the bug-report: https://bugzilla.kernel.org/show_bug.cgi?id=200957 and a duplicate one: https://bugzilla.kernel.org/show_bug.cgi?id=200959 Thanks for fixing this issue. Siegfried
REGRESSION: boot stalls on several old dual core Intel CPUs
Dear kernel developers, since mainline kernel 4.18 (up to the latest mainline kernel 4.18.5) Intel Core 2 Duo processors are affected by boot stalling early in the boot process. As it is so early there is no dmesg output (or any log). A few users in the Arch Linux community used git bisect and tracked the issue down to this the bad commit: 7197e77abcb65a71d0b21d67beb24f153a96055e clocksource: Remove kthread Either reverting the bad commit or as a workaround using an additional kernel boot parameter "clocksource=hpet" has been successfully used to boot with Intel Core 2 processors and avoid the stalling. clocksource= options parameters are one of "tsc", "hpet", "acpi_pm", most of the time hpet did the trick. Additional information: Kernel 4.17.19 is unaffected by this issues, kernel 4.18 series and 4.19-rc1 are still affected. Also there is the bug-report: https://bugzilla.kernel.org/show_bug.cgi?id=200957 and a duplicate one: https://bugzilla.kernel.org/show_bug.cgi?id=200959 Thanks for fixing this issue. Siegfried