Re: [PATCH 00/17] Backport rt/deadline crash and the ardous story of FUTEX_UNLOCK_PI to 4.4
On Fri, Dec 14, 2018 at 08:18:26AM +0100, Greg Kroah-Hartman wrote:
> On Mon, Nov 19, 2018 at 12:27:21PM +0100, Henrik Austad wrote:
> > On Fri, Nov 09, 2018 at 11:35:31AM +0100, Henrik Austad wrote:
> > > On Fri, Nov 09, 2018 at 11:07:28AM +0100, Henrik Austad wrote:
> > > > From: Henrik Austad
> > > >
> > > > Short story:
> > >
> > > Sorry for the spam, it looks like I was not very specific in /which/
> > > version I targeted this to, as well as not providing a full Cc-list for
> > > the cover-letter.
> >
> > Gentle prod. I realize this was sent out just before plumbers and that
> > people had pretty packed agendas, so a small nudge to gain a spot closer to
> > the top of the inbox :)
> >
> > This series has now been running on an arm64 system for 9 days without any
> > issues and pi_stress showed a dramatic improvement from ~30 seconds and up
> > to several hours (it finally deadlocked at 3.9e9 inversions).
> >
> > I'd greatly appreciate if someone could give the list of patches a quick
> > glance to verify that I got all the required patches and then if it could
> > be added to 4.4.y.

Hi Greg,

> This is a really intrusive series of patches, and without some testing
> and verification by others, I am really reluctant to take these patches.

Yes I know, they are intrusive, and they touch core parts of the kernel in
interesting ways.

I completely agree with the need for testing, and I do not _expect_ these
patches to be merged. It was a "this was useful for us, it is probably
useful for others" kind of series.

Perhaps there are not that many others out there using a pi_futex shared
between a sched_rr thread and a sched_deadline thread, which is how you
back yourself into this corner.

> Why not just move to the 4.9.y tree, or better yet, 4.19.y to resolve
> this issue for your systems?

That would indeed be the best solution, but the vendor will not update the
kernel past 4.4 for this particular SoC, so we have no way of moving this
to a later kernel :(

Anyway, I'm happy to carry these in our local tree for our own use. If
something pops up in our internal testing requiring an update to the
series, I'll send an update for others to see should they experience the
same issue. :)

Thanks for the reply!

--
Henrik Austad
Re: [PATCH 00/17] Backport rt/deadline crash and the ardous story of FUTEX_UNLOCK_PI to 4.4
On Mon, Nov 19, 2018 at 12:27:21PM +0100, Henrik Austad wrote:
> On Fri, Nov 09, 2018 at 11:35:31AM +0100, Henrik Austad wrote:
> > On Fri, Nov 09, 2018 at 11:07:28AM +0100, Henrik Austad wrote:
> > > From: Henrik Austad
> > >
> > > Short story:
> >
> > Sorry for the spam, it looks like I was not very specific in /which/
> > version I targeted this to, as well as not providing a full Cc-list for the
> > cover-letter.
>
> Gentle prod. I realize this was sent out just before plumbers and that
> people had pretty packed agendas, so a small nudge to gain a spot closer to
> the top of the inbox :)
>
> This series has now been running on an arm64 system for 9 days without any
> issues and pi_stress showed a dramatic improvement from ~30 seconds and up
> to several hours (it finally deadlocked at 3.9e9 inversions).
>
> I'd greatly appreciate if someone could give the list of patches a quick
> glance to verify that I got all the required patches and then if it could
> be added to 4.4.y.

This is a really intrusive series of patches, and without some testing
and verification by others, I am really reluctant to take these patches.

Why not just move to the 4.9.y tree, or better yet, 4.19.y to resolve
this issue for your systems?

thanks,

greg k-h
Re: [PATCH 00/17] Backport rt/deadline crash and the ardous story of FUTEX_UNLOCK_PI to 4.4
On Fri, Nov 09, 2018 at 11:35:31AM +0100, Henrik Austad wrote:
> On Fri, Nov 09, 2018 at 11:07:28AM +0100, Henrik Austad wrote:
> > From: Henrik Austad
> >
> > Short story:
>
> Sorry for the spam, it looks like I was not very specific in /which/
> version I targeted this to, as well as not providing a full Cc-list for the
> cover-letter.

Gentle prod. I realize this was sent out just before plumbers and that
people had pretty packed agendas, so a small nudge to gain a spot closer to
the top of the inbox :)

This series has now been running on an arm64 system for 9 days without any
issues and pi_stress showed a dramatic improvement from ~30 seconds and up
to several hours (it finally deadlocked at 3.9e9 inversions).

I'd greatly appreciate if someone could give the list of patches a quick
glance to verify that I got all the required patches and then if it could
be added to 4.4.y.

Thanks!
-Henrik

> The series is targeted at stable v4.4.162.
>
> Expanding Cc-list to those missing from the first attempt.
>
> -Henrik
>
> > The following patches are needed on a 4.4 kernel to avoid an Oops in the
> > scheduler when a sched_rr and a sched_deadline task contend on the same
> > futex (with PI).
> >
> > Longer story:
> >
> > On one of our arm64 systems, we occasionally crash with an Oops in the
> > scheduler with the following backtrace.
> >
> >     [] enqueue_task_dl+0x1f0/0x420
> >     [] activate_task+0x7c/0x90
> >     [] push_dl_task+0x164/0x1c8
> >     [] push_dl_tasks+0x20/0x30
> >     [] __balance_callback+0x44/0x68
> >     [] __schedule+0x6f0/0x728
> >     [] schedule+0x78/0x98
> >     [] __rt_mutex_slowlock+0x9c/0x108
> >     [] rt_mutex_slowlock+0xd8/0x198
> >     [] rt_mutex_timed_futex_lock+0x30/0x40
> >     [] futex_lock_pi+0x200/0x3b0
> >     [] do_futex+0x1c4/0x550
> >     [] compat_SyS_futex+0x10c/0x138
> >     [] __sys_trace_return+0x0/0x4
> >
> > This seems to be the same bug Xunlei Pang triggered and fixed in
> > e96a7705e7d3 "sched/rtmutex/deadline: Fix a PI crash for deadline
> > tasks". As noted by Peter Zijlstra in the previous attempt, this fix
> > requires a few other patches, most notably the FUTEX_UNLOCK_PI series
> > [1].
> >
> > Testing this on a dual-core VM I have not been able to reproduce the
> > same crash, but pi_stress (part of the rt-tests suite) reveals that
> > vanilla 4.4.162 behaves rather badly with a mix of deadline and
> > sched_(rr|fifo) tasks:
> >
> >     time pi_stress --rr --mlockall --sched id=high,policy=deadline,runtime=10,deadline=20,period=20
> >     Starting PI Stress Test
> >     Number of thread groups: 1
> >     Duration of test run: infinite
> >     Number of inversions per group: unlimited
> >     Admin thread SCHED_RR priority 4
> >     1 groups of 3 threads will be created
> >     High thread SCHED_DEADLINE runtime 10 deadline 20 period 20
> >     Med thread SCHED_RR priority 2
> >     Low thread SCHED_RR priority 1
> >     Current Inversions: 141627
> >     WATCHDOG triggered: group 0 is deadlocked!
> >     reporter stopping due to watchdog event
> >     Stopping test
> >     Terminated
> >
> >     real    0m26.291s
> >     user    0m0.148s
> >     sys     0m18.819s
> >
> > With this series applied, the test ran for ~4.5 hours and again for 129
> > minutes (when I remembered to time it) before crashing:
> >
> >     time pi_stress --rr --mlockall --sched id=high,policy=deadline,runtime=10,deadline=20,period=20
> >     Starting PI Stress Test
> >     Number of thread groups: 1
> >     Duration of test run: infinite
> >     Number of inversions per group: unlimited
> >     Admin thread SCHED_RR priority 4
> >     1 groups of 3 threads will be created
> >     High thread SCHED_DEADLINE runtime 10 deadline 20 period 20
> >     Med thread SCHED_RR priority 2
> >     Low thread SCHED_RR priority 1
> >     Current Inversions: 51985223
> >     WATCHDOG triggered: group 0 is deadlocked!
> >     reporter stopping due to watchdog event
> >     Stopping test
> >     Terminated
> >
> >     real    129m38.807s
> >     user    0m59.084s
> >     sys     109m53.666s
> >
> > So clearly not perfect, but a *lot* better.
> >
> > The same series on our vendor-4.4 kernel moves pi_stress up from ~30
> > seconds before deadlock up to the same level as the VM (the test is
> > still going as of this writing).
> >
> > I suspect other users of 4.4 would benefit from having these patches
> > backported, so tag them for stable. I assume 4.9 and 4.14 could benefit
> > as well, but I have not had time to look into those.
> >
> > 1) https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1359667.html
> >
> > Peter Zijlstra (13):
> >   futex: Cleanup variable names for futex_top_waiter()
> >   futex: Use smp_store_release() in mark_wake_futex()
> >   futex: Remove rt_mutex_deadlock_account_*()
> >   futex,rt_mutex: Provide futex specific rt_mutex API
> >   futex: Change locking rules
> >   futex: Cleanup refcounting
> >   futex: Rework inconsistent rt_mutex/futex_q state
> >   futex: Pull rt_mutex_futex_unlock() out from under hb->lock
Re: [PATCH 00/17] Backport rt/deadline crash and the ardous story of FUTEX_UNLOCK_PI to 4.4
On Fri, Nov 09, 2018 at 11:07:28AM +0100, Henrik Austad wrote:
> From: Henrik Austad
>
> Short story:

Sorry for the spam, it looks like I was not very specific in /which/
version I targeted this to, as well as not providing a full Cc-list for the
cover-letter.

The series is targeted at stable v4.4.162.

Expanding Cc-list to those missing from the first attempt.

-Henrik

> The following patches are needed on a 4.4 kernel to avoid an Oops in the
> scheduler when a sched_rr and a sched_deadline task contend on the same
> futex (with PI).
>
> Longer story:
>
> On one of our arm64 systems, we occasionally crash with an Oops in the
> scheduler with the following backtrace.
>
>     [] enqueue_task_dl+0x1f0/0x420
>     [] activate_task+0x7c/0x90
>     [] push_dl_task+0x164/0x1c8
>     [] push_dl_tasks+0x20/0x30
>     [] __balance_callback+0x44/0x68
>     [] __schedule+0x6f0/0x728
>     [] schedule+0x78/0x98
>     [] __rt_mutex_slowlock+0x9c/0x108
>     [] rt_mutex_slowlock+0xd8/0x198
>     [] rt_mutex_timed_futex_lock+0x30/0x40
>     [] futex_lock_pi+0x200/0x3b0
>     [] do_futex+0x1c4/0x550
>     [] compat_SyS_futex+0x10c/0x138
>     [] __sys_trace_return+0x0/0x4
>
> This seems to be the same bug Xunlei Pang triggered and fixed in
> e96a7705e7d3 "sched/rtmutex/deadline: Fix a PI crash for deadline
> tasks". As noted by Peter Zijlstra in the previous attempt, this fix
> requires a few other patches, most notably the FUTEX_UNLOCK_PI series
> [1].
>
> Testing this on a dual-core VM I have not been able to reproduce the
> same crash, but pi_stress (part of the rt-tests suite) reveals that
> vanilla 4.4.162 behaves rather badly with a mix of deadline and
> sched_(rr|fifo) tasks:
>
>     time pi_stress --rr --mlockall --sched id=high,policy=deadline,runtime=10,deadline=20,period=20
>     Starting PI Stress Test
>     Number of thread groups: 1
>     Duration of test run: infinite
>     Number of inversions per group: unlimited
>     Admin thread SCHED_RR priority 4
>     1 groups of 3 threads will be created
>     High thread SCHED_DEADLINE runtime 10 deadline 20 period 20
>     Med thread SCHED_RR priority 2
>     Low thread SCHED_RR priority 1
>     Current Inversions: 141627
>     WATCHDOG triggered: group 0 is deadlocked!
>     reporter stopping due to watchdog event
>     Stopping test
>     Terminated
>
>     real    0m26.291s
>     user    0m0.148s
>     sys     0m18.819s
>
> With this series applied, the test ran for ~4.5 hours and again for 129
> minutes (when I remembered to time it) before crashing:
>
>     time pi_stress --rr --mlockall --sched id=high,policy=deadline,runtime=10,deadline=20,period=20
>     Starting PI Stress Test
>     Number of thread groups: 1
>     Duration of test run: infinite
>     Number of inversions per group: unlimited
>     Admin thread SCHED_RR priority 4
>     1 groups of 3 threads will be created
>     High thread SCHED_DEADLINE runtime 10 deadline 20 period 20
>     Med thread SCHED_RR priority 2
>     Low thread SCHED_RR priority 1
>     Current Inversions: 51985223
>     WATCHDOG triggered: group 0 is deadlocked!
>     reporter stopping due to watchdog event
>     Stopping test
>     Terminated
>
>     real    129m38.807s
>     user    0m59.084s
>     sys     109m53.666s
>
> So clearly not perfect, but a *lot* better.
>
> The same series on our vendor-4.4 kernel moves pi_stress up from ~30
> seconds before deadlock up to the same level as the VM (the test is
> still going as of this writing).
>
> I suspect other users of 4.4 would benefit from having these patches
> backported, so tag them for stable. I assume 4.9 and 4.14 could benefit
> as well, but I have not had time to look into those.
>
> 1) https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1359667.html
>
> Peter Zijlstra (13):
>   futex: Cleanup variable names for futex_top_waiter()
>   futex: Use smp_store_release() in mark_wake_futex()
>   futex: Remove rt_mutex_deadlock_account_*()
>   futex,rt_mutex: Provide futex specific rt_mutex API
>   futex: Change locking rules
>   futex: Cleanup refcounting
>   futex: Rework inconsistent rt_mutex/futex_q state
>   futex: Pull rt_mutex_futex_unlock() out from under hb->lock
>   futex,rt_mutex: Introduce rt_mutex_init_waiter()
>   futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock()
>   futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock()
>   futex: Futex_unlock_pi() determinism
>   futex: Drop hb->lock before enqueueing on the rtmutex
>
> Thomas Gleixner (2):
>   rtmutex: Make wait_lock irq safe
>   futex: Rename free_pi_state() to put_pi_state()
>
> Xunlei Pang (2):
>   rtmutex: Deboost before waking up the top waiter
>   sched/rtmutex/deadline: Fix a PI crash for deadline tasks
>
>  include/linux/init_task.h       |   1 +
>  include/linux/sched.h           |   2 +
>  include/linux/sched/rt.h        |   1 +
>  kernel/fork.c                   |   1 +
>  kernel/futex.c                  | 532 ++--
>  kernel/locking/rtmutex-debug.c  |   9 -
>  kernel/locking/rtmutex-debug.h  |   3 -
[PATCH 00/17] Backport rt/deadline crash and the ardous story of FUTEX_UNLOCK_PI to 4.4
From: Henrik Austad

Short story:

The following patches are needed on a 4.4 kernel to avoid an Oops in the
scheduler when a sched_rr and a sched_deadline task contend on the same
futex (with PI).

Longer story:

On one of our arm64 systems, we occasionally crash with an Oops in the
scheduler with the following backtrace.

    [] enqueue_task_dl+0x1f0/0x420
    [] activate_task+0x7c/0x90
    [] push_dl_task+0x164/0x1c8
    [] push_dl_tasks+0x20/0x30
    [] __balance_callback+0x44/0x68
    [] __schedule+0x6f0/0x728
    [] schedule+0x78/0x98
    [] __rt_mutex_slowlock+0x9c/0x108
    [] rt_mutex_slowlock+0xd8/0x198
    [] rt_mutex_timed_futex_lock+0x30/0x40
    [] futex_lock_pi+0x200/0x3b0
    [] do_futex+0x1c4/0x550
    [] compat_SyS_futex+0x10c/0x138
    [] __sys_trace_return+0x0/0x4

This seems to be the same bug Xunlei Pang triggered and fixed in
e96a7705e7d3 "sched/rtmutex/deadline: Fix a PI crash for deadline
tasks". As noted by Peter Zijlstra in the previous attempt, this fix
requires a few other patches, most notably the FUTEX_UNLOCK_PI series
[1].

Testing this on a dual-core VM I have not been able to reproduce the
same crash, but pi_stress (part of the rt-tests suite) reveals that
vanilla 4.4.162 behaves rather badly with a mix of deadline and
sched_(rr|fifo) tasks:

    time pi_stress --rr --mlockall --sched id=high,policy=deadline,runtime=10,deadline=20,period=20
    Starting PI Stress Test
    Number of thread groups: 1
    Duration of test run: infinite
    Number of inversions per group: unlimited
    Admin thread SCHED_RR priority 4
    1 groups of 3 threads will be created
    High thread SCHED_DEADLINE runtime 10 deadline 20 period 20
    Med thread SCHED_RR priority 2
    Low thread SCHED_RR priority 1
    Current Inversions: 141627
    WATCHDOG triggered: group 0 is deadlocked!
    reporter stopping due to watchdog event
    Stopping test
    Terminated

    real    0m26.291s
    user    0m0.148s
    sys     0m18.819s

With this series applied, the test ran for ~4.5 hours and again for 129
minutes (when I remembered to time it) before crashing:

    time pi_stress --rr --mlockall --sched id=high,policy=deadline,runtime=10,deadline=20,period=20
    Starting PI Stress Test
    Number of thread groups: 1
    Duration of test run: infinite
    Number of inversions per group: unlimited
    Admin thread SCHED_RR priority 4
    1 groups of 3 threads will be created
    High thread SCHED_DEADLINE runtime 10 deadline 20 period 20
    Med thread SCHED_RR priority 2
    Low thread SCHED_RR priority 1
    Current Inversions: 51985223
    WATCHDOG triggered: group 0 is deadlocked!
    reporter stopping due to watchdog event
    Stopping test
    Terminated

    real    129m38.807s
    user    0m59.084s
    sys     109m53.666s

So clearly not perfect, but a *lot* better.

The same series on our vendor-4.4 kernel moves pi_stress up from ~30
seconds before deadlock up to the same level as the VM (the test is
still going as of this writing).

I suspect other users of 4.4 would benefit from having these patches
backported, so tag them for stable. I assume 4.9 and 4.14 could benefit
as well, but I have not had time to look into those.

1) https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1359667.html

Peter Zijlstra (13):
  futex: Cleanup variable names for futex_top_waiter()
  futex: Use smp_store_release() in mark_wake_futex()
  futex: Remove rt_mutex_deadlock_account_*()
  futex,rt_mutex: Provide futex specific rt_mutex API
  futex: Change locking rules
  futex: Cleanup refcounting
  futex: Rework inconsistent rt_mutex/futex_q state
  futex: Pull rt_mutex_futex_unlock() out from under hb->lock
  futex,rt_mutex: Introduce rt_mutex_init_waiter()
  futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock()
  futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock()
  futex: Futex_unlock_pi() determinism
  futex: Drop hb->lock before enqueueing on the rtmutex

Thomas Gleixner (2):
  rtmutex: Make wait_lock irq safe
  futex: Rename free_pi_state() to put_pi_state()

Xunlei Pang (2):
  rtmutex: Deboost before waking up the top waiter
  sched/rtmutex/deadline: Fix a PI crash for deadline tasks

 include/linux/init_task.h       |   1 +
 include/linux/sched.h           |   2 +
 include/linux/sched/rt.h        |   1 +
 kernel/fork.c                   |   1 +
 kernel/futex.c                  | 532 ++--
 kernel/locking/rtmutex-debug.c  |   9 -
 kernel/locking/rtmutex-debug.h  |   3 -
 kernel/locking/rtmutex.c        | 406 ++
 kernel/locking/rtmutex.h        |   2 -
 kernel/locking/rtmutex_common.h |  24 +-
 kernel/sched/core.c             |   2 +
 11 files changed, 620 insertions(+), 363 deletions(-)

--
2.7.4
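
For anyone wanting to put rough numbers on the two pi_stress runs quoted in the cover letter, the `time` output and the inversion counts can be compared with a few lines of Python. This is only a sketch: the helper names (`parse_time`, `summarize`) and the `RUNS` table are made up for illustration and are not part of rt-tests or the series; the figures are copied from the two runs in the cover letter.

```python
import re


def parse_time(stamp):
    """Convert a `time`-style stamp such as '129m38.807s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognised time stamp: {stamp!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


# Figures copied from the two runs in the cover letter (hypothetical table).
RUNS = {
    "vanilla 4.4.162": {
        "real": "0m26.291s", "sys": "0m18.819s", "inversions": 141627,
    },
    "with series": {
        "real": "129m38.807s", "sys": "109m53.666s", "inversions": 51985223,
    },
}


def summarize(run):
    """Derive wall-clock seconds, sys share and inversion rate for one run."""
    real = parse_time(run["real"])
    sys_time = parse_time(run["sys"])
    return {
        "real_s": real,
        "sys_share": sys_time / real,
        "inversions_per_s": run["inversions"] / real,
    }


if __name__ == "__main__":
    stats = {name: summarize(run) for name, run in RUNS.items()}
    for name, s in stats.items():
        print(f"{name}: {s['real_s']:.1f} s wall clock, "
              f"{s['sys_share']:.0%} in sys, "
              f"{s['inversions_per_s']:.0f} inversions/s")
    before, after = stats["vanilla 4.4.162"], stats["with series"]
    print(f"time to deadlock grew {after['real_s'] / before['real_s']:.0f}x")
```

On these figures the time to deadlock grows by roughly 300x while the inversion rate stays in the same ballpark (a few thousand per second), which suggests the gain is in how long the test survives rather than in throughput.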