Re: [PATCH 00/17] Backport rt/deadline crash and the ardous story of FUTEX_UNLOCK_PI to 4.4
On Fri, Dec 14, 2018 at 08:18:26AM +0100, Greg Kroah-Hartman wrote:
> On Mon, Nov 19, 2018 at 12:27:21PM +0100, Henrik Austad wrote:
> > On Fri, Nov 09, 2018 at 11:35:31AM +0100, Henrik Austad wrote:
> > > On Fri, Nov 09, 2018 at 11:07:28AM +0100, Henrik Austad wrote:
> > > > From: Henrik Austad
> > > >
> > > > Short story:
> > >
> > > Sorry for the spam, it looks like I was not very specific in /which/
> > > version I targeted this to, as well as not providing a full Cc-list for
> > > the cover-letter.
> >
> > Gentle prod. I realize this was sent out just before plumbers and that
> > people had pretty packed agendas, so a small nudge to gain a spot closer to
> > the top of the inbox :)
> >
> > This series has now been running on an arm64 system for 9 days without any
> > issues and pi_stress showed a dramatic improvement from ~30 seconds and up
> > to several hours (it finally deadlocked at 3.9e9 inversions).
> >
> > I'd greatly appreciate if someone could give the list of patches a quick
> > glance to verify that I got all the required patches and then if it could
> > be added to 4.4.y.

Hi Greg,

> This is a really intrusive series of patches, and without some testing
> and verification by others, I am really reluctant to take these patches.

Yes I know, they are intrusive, and they touch core parts of the kernel in
interesting ways.

I completely agree with the need for testing, and I do not _expect_ these
patches to be merged. It was a "this was useful for us, it is probably
useful for others" kind of series.

Perhaps there are not that many others out there using a pi_futex shared
between a sched_rr thread and a sched_deadline thread, which is how you
back yourself into this corner.

> Why not just move to the 4.9.y tree, or better yet, 4.19.y to resolve
> this issue for your systems?

That would indeed be the best solution, but the vendor will not update the
kernel past 4.4 for this particular SoC, so we have no way of moving this
to a later kernel :(

Anyway, I'm happy to carry these in our local tree for our own use. If
something pops up in our internal testing requiring an update to the
series, I'll send an update for others to see should they experience the
same issue. :)

Thanks for the reply!

-- 
Henrik Austad
Re: [PATCH 00/17] Backport rt/deadline crash and the ardous story of FUTEX_UNLOCK_PI to 4.4
On Fri, Nov 09, 2018 at 11:35:31AM +0100, Henrik Austad wrote:
> On Fri, Nov 09, 2018 at 11:07:28AM +0100, Henrik Austad wrote:
> > From: Henrik Austad
> >
> > Short story:
>
> Sorry for the spam, it looks like I was not very specific in /which/
> version I targeted this to, as well as not providing a full Cc-list for the
> cover-letter.

Gentle prod. I realize this was sent out just before plumbers and that
people had pretty packed agendas, so a small nudge to gain a spot closer to
the top of the inbox :)

This series has now been running on an arm64 system for 9 days without any
issues and pi_stress showed a dramatic improvement from ~30 seconds and up
to several hours (it finally deadlocked at 3.9e9 inversions).

I'd greatly appreciate if someone could give the list of patches a quick
glance to verify that I got all the required patches and then if it could
be added to 4.4.y.

Thanks!
-Henrik

> The series is targeted at stable v4.4.162.
>
> Expanding Cc-list to those missing from the first attempt.
>
> -Henrik
>
> > The following patches are needed on a 4.4 kernel to avoid
> > Oops in the scheduler when a sched_rr and sched_deadline task contends
> > on the same futex (with PI).
> >
> > Longer story:
> >
> > On one of our arm64 systems, we occasionally crash with an Oops in the
> > scheduler with the following backtrace.
> >
> > [] enqueue_task_dl+0x1f0/0x420
> > [] activate_task+0x7c/0x90
> > [] push_dl_task+0x164/0x1c8
> > [] push_dl_tasks+0x20/0x30
> > [] __balance_callback+0x44/0x68
> > [] __schedule+0x6f0/0x728
> > [] schedule+0x78/0x98
> > [] __rt_mutex_slowlock+0x9c/0x108
> > [] rt_mutex_slowlock+0xd8/0x198
> > [] rt_mutex_timed_futex_lock+0x30/0x40
> > [] futex_lock_pi+0x200/0x3b0
> > [] do_futex+0x1c4/0x550
> > [] compat_SyS_futex+0x10c/0x138
> > [] __sys_trace_return+0x0/0x4
> >
> > This seems to be the same bug Xunlei Pang triggered and fixed in
> > e96a7705e7d3 "sched/rtmutex/deadline: Fix a PI crash for deadline
> > tasks". As noted by Peter Zijlstra in the previous attempt, this fix
> > requires a few other patches, most notably the FUTEX_UNLOCK_PI series
> > [1]
> >
> > Testing this on a dual-core VM I have not been able to reproduce the
> > same crash, but pi_stress (part of the rt-test suite) reveals that
> > vanilla 4.4.162 behaves rather badly with a mix of deadline and
> > sched_(rr|fifo) tasks:
> >
> > time pi_stress --rr --mlockall --sched
> > id=high,policy=deadline,runtime=10,deadline=20,period=20
> > Starting PI Stress Test
> > Number of thread groups: 1
> > Duration of test run: infinite
> > Number of inversions per group: unlimited
> > Admin thread SCHED_RR priority 4
> > 1 groups of 3 threads will be created
> > High thread SCHED_DEADLINE runtime 10 deadline 20 period 20
> > Med thread SCHED_RR priority 2
> > Low thread SCHED_RR priority 1
> > Current Inversions: 141627
> > WATCHDOG triggered: group 0 is deadlocked!
> > reporter stopping due to watchdog event
> > Stopping test
> > Terminated
> >
> > real	0m26.291s
> > user	0m0.148s
> > sys	0m18.819s
> >
> > With this series applied, the test ran for ~4.5 hours and again for 129
> > minutes (when I remembered to time it) before crashing:
> >
> > time pi_stress --rr --mlockall --sched
> > id=high,policy=deadline,runtime=10,deadline=20,period=20
> > Starting PI Stress Test
> > Number of thread groups: 1
> > Duration of test run: infinite
> > Number of inversions per group: unlimited
> > Admin thread SCHED_RR priority 4
> > 1 groups of 3 threads will be created
> > High thread SCHED_DEADLINE runtime 10 deadline 20 period 20
> > Med thread SCHED_RR priority 2
> > Low thread SCHED_RR priority 1
> > Current Inversions: 51985223
> > WATCHDOG triggered: group 0 is deadlocked!
> > reporter stopping due to watchdog event
> > Stopping test
> > Terminated
> >
> > real	129m38.807s
> > user	0m59.084s
> > sys	109m53.666s
> >
> >
> > So clearly not perfect, but a *lot* better.
> >
> > The same series on our vendor-4.4 kernel moves pi_stress up from ~30
> > seconds before deadlock up to the same level as the VM (the test is
> > still going as of this writing).
> >
> > I suspect other users of 4.4 would benefit from having these patches
> > backported, so tag them for stable. I assume 4.9 an
Re: [PATCH 00/17] Backport rt/deadline crash and the ardous story of FUTEX_UNLOCK_PI to 4.4
On Fri, Nov 09, 2018 at 11:07:28AM +0100, Henrik Austad wrote:
> From: Henrik Austad
>
> Short story:

Sorry for the spam, it looks like I was not very specific in /which/
version I targeted this to, as well as not providing a full Cc-list for the
cover-letter.

The series is targeted at stable v4.4.162.

Expanding Cc-list to those missing from the first attempt.

-Henrik

> The following patches are needed on a 4.4 kernel to avoid
> Oops in the scheduler when a sched_rr and sched_deadline task contends
> on the same futex (with PI).
>
> Longer story:
>
> On one of our arm64 systems, we occasionally crash with an Oops in the
> scheduler with the following backtrace.
>
> [] enqueue_task_dl+0x1f0/0x420
> [] activate_task+0x7c/0x90
> [] push_dl_task+0x164/0x1c8
> [] push_dl_tasks+0x20/0x30
> [] __balance_callback+0x44/0x68
> [] __schedule+0x6f0/0x728
> [] schedule+0x78/0x98
> [] __rt_mutex_slowlock+0x9c/0x108
> [] rt_mutex_slowlock+0xd8/0x198
> [] rt_mutex_timed_futex_lock+0x30/0x40
> [] futex_lock_pi+0x200/0x3b0
> [] do_futex+0x1c4/0x550
> [] compat_SyS_futex+0x10c/0x138
> [] __sys_trace_return+0x0/0x4
>
> This seems to be the same bug Xunlei Pang triggered and fixed in
> e96a7705e7d3 "sched/rtmutex/deadline: Fix a PI crash for deadline
> tasks". As noted by Peter Zijlstra in the previous attempt, this fix
> requires a few other patches, most notably the FUTEX_UNLOCK_PI series
> [1]
>
> Testing this on a dual-core VM I have not been able to reproduce the
> same crash, but pi_stress (part of the rt-test suite) reveals that
> vanilla 4.4.162 behaves rather badly with a mix of deadline and
> sched_(rr|fifo) tasks:
>
> time pi_stress --rr --mlockall --sched
> id=high,policy=deadline,runtime=10,deadline=20,period=20
> Starting PI Stress Test
> Number of thread groups: 1
> Duration of test run: infinite
> Number of inversions per group: unlimited
> Admin thread SCHED_RR priority 4
> 1 groups of 3 threads will be created
> High thread SCHED_DEADLINE runtime 10 deadline 20 period 20
> Med thread SCHED_RR priority 2
> Low thread SCHED_RR priority 1
> Current Inversions: 141627
> WATCHDOG triggered: group 0 is deadlocked!
> reporter stopping due to watchdog event
> Stopping test
> Terminated
>
> real	0m26.291s
> user	0m0.148s
> sys	0m18.819s
>
> With this series applied, the test ran for ~4.5 hours and again for 129
> minutes (when I remembered to time it) before crashing:
>
> time pi_stress --rr --mlockall --sched
> id=high,policy=deadline,runtime=10,deadline=20,period=20
> Starting PI Stress Test
> Number of thread groups: 1
> Duration of test run: infinite
> Number of inversions per group: unlimited
> Admin thread SCHED_RR priority 4
> 1 groups of 3 threads will be created
> High thread SCHED_DEADLINE runtime 10 deadline 20 period 20
> Med thread SCHED_RR priority 2
> Low thread SCHED_RR priority 1
> Current Inversions: 51985223
> WATCHDOG triggered: group 0 is deadlocked!
> reporter stopping due to watchdog event
> Stopping test
> Terminated
>
> real	129m38.807s
> user	0m59.084s
> sys	109m53.666s
>
>
> So clearly not perfect, but a *lot* better.
>
> The same series on our vendor-4.4 kernel moves pi_stress up from ~30
> seconds before deadlock up to the same level as the VM (the test is
> still going as of this writing).
>
> I suspect other users of 4.4 would benefit from having these patches
> backported, so tag them for stable. I assume 4.9 and 4.14 could benefit
> as well, but I have not had time to look into those.
>
> 1) https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1359667.html
>
> Peter Zijlstra (13):
>   futex: Cleanup variable names for futex_top_waiter()
>   futex: Use smp_store_release() in mark_wake_futex()
>   futex: Remove rt_mutex_deadlock_account_*()
>   futex,rt_mutex: Provide futex specific rt_mutex API
>   futex: Change locking rules
>   futex: Cleanup refcounting
>   futex: Rework inconsistent rt_mutex/futex_q state
>   futex: Pull rt_mutex_futex_unlock() out from under hb->lock
>   futex,rt_mutex: Introduce rt_mutex_init_waiter()
>   futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock()
>   futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock()
>   futex: Futex_unlock_pi() determinism
>   futex: Drop hb->lock before enqueueing on the rtmutex
>
> Thomas Gleixner (2):
>   rtmutex: Make wait_lock irq safe
>   futex: Rename free_pi_state() to put_pi_state()
>
> Xunlei Pang (2):
>   rtmutex: Deboost before waking up the top waiter
>   sched/rtmutex/deadline: Fix a P
[PATCH 06/17] futex: Change locking rules
From: Peter Zijlstra

commit 734009e96d1983ad739e5b656e03430b3660c913 upstream.

Currently futex-pi relies on hb->lock to serialize everything. But hb->lock
creates another set of problems, especially priority inversions on RT where
hb->lock becomes a rt_mutex itself.

The rt_mutex::wait_lock is the most obvious protection for keeping the
futex user space value and the kernel internal pi_state in sync.

Rework and document the locking so rt_mutex::wait_lock is held across all
operations which modify the user space value and the pi state.

This allows to invoke rt_mutex_unlock() (including deboost) without holding
hb->lock as a next step.

Nothing yet relies on the new locking rules.

Signed-off-by: Peter Zijlstra (Intel)
Cc: juri.le...@arm.com
Cc: bige...@linutronix.de
Cc: xlp...@redhat.com
Cc: rost...@goodmis.org
Cc: mathieu.desnoy...@efficios.com
Cc: jdesfos...@efficios.com
Cc: dvh...@infradead.org
Cc: bris...@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.751993...@infradead.org
Signed-off-by: Thomas Gleixner
Tested-by: Henrik Austad
---
 kernel/futex.c | 165 +
 1 file changed, 132 insertions(+), 33 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index e1200b9..52e3678 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -967,6 +967,39 @@ void exit_pi_state_list(struct task_struct *curr)
  *
  * [10] There is no transient state which leaves owner and user space
  *	TID out of sync.
+ *
+ *
+ * Serialization and lifetime rules:
+ *
+ * hb->lock:
+ *
+ *	hb -> futex_q, relation
+ *	futex_q -> pi_state, relation
+ *
+ *	(cannot be raw because hb can contain arbitrary amount
+ *	 of futex_q's)
+ *
+ * pi_mutex->wait_lock:
+ *
+ *	{uval, pi_state}
+ *
+ *	(and pi_mutex 'obviously')
+ *
+ * p->pi_lock:
+ *
+ *	p->pi_state_list -> pi_state->list, relation
+ *
+ * pi_state->refcount:
+ *
+ *	pi_state lifetime
+ *
+ *
+ * Lock order:
+ *
+ *   hb->lock
+ *     pi_mutex->wait_lock
+ *       p->pi_lock
+ *
  */
 
 /*
@@ -974,10 +1007,12 @@ void exit_pi_state_list(struct task_struct *curr)
  * the pi_state against the user space value. If correct, attach to
  * it.
  */
-static int attach_to_pi_state(u32 uval, struct futex_pi_state *pi_state,
+static int attach_to_pi_state(u32 __user *uaddr, u32 uval,
+			      struct futex_pi_state *pi_state,
 			      struct futex_pi_state **ps)
 {
 	pid_t pid = uval & FUTEX_TID_MASK;
+	int ret, uval2;
 
 	/*
 	 * Userspace might have messed up non-PI and PI futexes [3]
@@ -985,9 +1020,34 @@ static int attach_to_pi_state(u32 uval, struct futex_pi_state *pi_state,
 	if (unlikely(!pi_state))
 		return -EINVAL;
 
+	/*
+	 * We get here with hb->lock held, and having found a
+	 * futex_top_waiter(). This means that futex_lock_pi() of said futex_q
+	 * has dropped the hb->lock in between queue_me() and unqueue_me_pi(),
+	 * which in turn means that futex_lock_pi() still has a reference on
+	 * our pi_state.
+	 */
 	WARN_ON(!atomic_read(&pi_state->refcount));
 
 	/*
+	 * Now that we have a pi_state, we can acquire wait_lock
+	 * and do the state validation.
+	 */
+	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
+
+	/*
+	 * Since {uval, pi_state} is serialized by wait_lock, and our current
+	 * uval was read without holding it, it can have changed. Verify it
+	 * still is what we expect it to be, otherwise retry the entire
+	 * operation.
+	 */
+	if (get_futex_value_locked(&uval2, uaddr))
+		goto out_efault;
+
+	if (uval != uval2)
+		goto out_eagain;
+
+	/*
 	 * Handle the owner died case:
 	 */
 	if (uval & FUTEX_OWNER_DIED) {
@@ -1002,11 +1062,11 @@ static int attach_to_pi_state(u32 uval, struct futex_pi_state *pi_state,
 			 * is not 0. Inconsistent state. [5]
 			 */
 			if (pid)
-				return -EINVAL;
+				goto out_einval;
 			/*
 			 * Take a ref on the state and return success. [4]
 			 */
-			goto out_state;
+			goto out_attach;
 		}
 
 		/*
@@ -1018,14 +1078,14 @@ static int attach_to_pi_state(u32 uval, struct futex_pi_state *pi_state,
 		 * Take a ref on the state and return success. [6]
 		 */
 		if (!pid)
-			goto out_state;
+			goto out_attach;
 	} else {
 		/*
 		 * If the owner died bit is not set, then the pi_state
 		 * must have an owner. [7]
 		 */
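The key step in the hunk above is: the caller sampled uval without holding wait_lock, so after taking the lock the value is read again and the operation is retried if it changed. A minimal user-space sketch of that pattern follows. It is not kernel code: struct pi_state_sketch and attach_sketch() are hypothetical names, a pthread mutex stands in for the raw spinlock, and a C11 atomic stands in for the user-space futex word.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's futex_pi_state. */
struct pi_state_sketch {
	pthread_mutex_t wait_lock;	/* plays the role of pi_mutex->wait_lock */
	int refcount;			/* only modified under wait_lock here */
};

/*
 * Attach to @ps only if *@uaddr still holds @uval, the value the caller
 * read before taking wait_lock. Returns 0 on success, -EAGAIN when the
 * value changed and the caller has to restart the whole operation.
 */
int attach_sketch(_Atomic uint32_t *uaddr, uint32_t uval,
		  struct pi_state_sketch *ps)
{
	uint32_t uval2;
	int ret = 0;

	pthread_mutex_lock(&ps->wait_lock);

	/* Re-read under the lock that serializes {uval, pi_state}. */
	uval2 = atomic_load(uaddr);
	if (uval != uval2)
		ret = -EAGAIN;
	else
		ps->refcount++;		/* validated, take a reference */

	pthread_mutex_unlock(&ps->wait_lock);
	return ret;
}
```

The point of the design is that validation and the reference grab happen in one critical section, so no other holder of wait_lock can change the pair in between.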
[PATCH 07/17] futex: Cleanup refcounting
From: Peter Zijlstra

commit bf92cf3a5100f5a0d5f9834787b130159397cb22 upstream.

Add a put_pi_state() as counterpart for get_pi_state() so the refcounting
becomes consistent.

Signed-off-by: Peter Zijlstra (Intel)
Cc: juri.le...@arm.com
Cc: bige...@linutronix.de
Cc: xlp...@redhat.com
Cc: rost...@goodmis.org
Cc: mathieu.desnoy...@efficios.com
Cc: jdesfos...@efficios.com
Cc: dvh...@infradead.org
Cc: bris...@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.801778...@infradead.org
Signed-off-by: Thomas Gleixner

Conflicts:
	kernel/futex.c
Tested-by: Henrik Austad
---
 kernel/futex.c | 19 ++-
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 52e3678..9d7d462 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -799,7 +799,7 @@ static int refill_pi_state_cache(void)
 	return 0;
 }
 
-static struct futex_pi_state * alloc_pi_state(void)
+static struct futex_pi_state *alloc_pi_state(void)
 {
 	struct futex_pi_state *pi_state = current->pi_state_cache;
 
@@ -809,6 +809,11 @@ static struct futex_pi_state * alloc_pi_state(void)
 	return pi_state;
 }
 
+static void get_pi_state(struct futex_pi_state *pi_state)
+{
+	WARN_ON_ONCE(!atomic_inc_not_zero(&pi_state->refcount));
+}
+
 /*
  * Must be called with the hb lock held.
  */
@@ -850,7 +855,7 @@ static void free_pi_state(struct futex_pi_state *pi_state)
  * Look up the task based on what TID userspace gave us.
  * We dont trust it.
  */
-static struct task_struct * futex_find_get_task(pid_t pid)
+static struct task_struct *futex_find_get_task(pid_t pid)
 {
 	struct task_struct *p;
 
@@ -1097,7 +1102,7 @@ static int attach_to_pi_state(u32 __user *uaddr, u32 uval,
 		goto out_einval;
 
 out_attach:
-	atomic_inc(&pi_state->refcount);
+	get_pi_state(pi_state);
 	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
 	*ps = pi_state;
 	return 0;
@@ -2019,8 +2024,12 @@ retry_private:
 		 * of requeue_pi if we couldn't acquire the lock atomically.
 		 */
 		if (requeue_pi) {
-			/* Prepare the waiter to take the rt_mutex. */
-			atomic_inc(&pi_state->refcount);
+			/*
+			 * Prepare the waiter to take the rt_mutex. Take a
+			 * refcount on the pi_state and store the pointer in
+			 * the futex_q object of the waiter.
+			 */
+			get_pi_state(pi_state);
 			this->pi_state = pi_state;
 			ret = rt_mutex_start_proxy_lock(&pi_state->pi_mutex,
 							this->rt_waiter,
-- 
2.7.4
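The new get_pi_state() relies on atomic_inc_not_zero(): a reference may only be taken while the count is still non-zero, because a count of zero means the object is already on its way to being freed. A minimal user-space sketch of that idea with C11 atomics follows. The names ref_get_not_zero() and ref_put() are hypothetical and only illustrate the semantics; they are not the kernel implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Take a reference unless the count has already dropped to zero.
 * Returns true if the reference was taken.
 */
bool ref_get_not_zero(atomic_int *ref)
{
	int old = atomic_load(ref);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(ref, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true when the caller dropped the last one. */
bool ref_put(atomic_int *ref)
{
	return atomic_fetch_sub(ref, 1) == 1;
}
```

A plain increment would happily turn 0 into 1 and resurrect an object that another thread is about to free, which is why the kernel helper warns when the increment-if-not-zero fails.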
[PATCH 04/17] rtmutex: Make wait_lock irq safe
From: Thomas Gleixner

commit b4abf91047cf054f203dcfac97e1038388826937 upstream.

Sasha reported a lockdep splat about a potential deadlock between RCU boosting
rtmutex and the posix timer it_lock.

CPU0					CPU1

rtmutex_lock(&rcu->rt_mutex)
  spin_lock(&rcu->rt_mutex.wait_lock)
					local_irq_disable()
					spin_lock(&timer->it_lock)
					spin_lock(&rcu->mutex.wait_lock)
--> Interrupt
    spin_lock(&timer->it_lock)

This is caused by the following code sequence on CPU1

     rcu_read_lock()
     x = lookup();
     if (x)
	spin_lock_irqsave(&x->it_lock);
     rcu_read_unlock();
     return x;

We could fix that in the posix timer code by keeping rcu read locked across
the spinlocked and irq disabled section, but the above sequence is common and
there is no reason not to support it.

Taking rt_mutex.wait_lock irq safe prevents the deadlock.

Reported-by: Sasha Levin
Signed-off-by: Thomas Gleixner
Cc: Peter Zijlstra
Cc: Paul McKenney
Tested-by: Henrik Austad
---
 kernel/futex.c           |  18 +++
 kernel/locking/rtmutex.c | 135 +--
 2 files changed, 81 insertions(+), 72 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 9e92f12..0f44952 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1307,7 +1307,7 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter
 	if (pi_state->owner != current)
 		return -EINVAL;
 
-	raw_spin_lock(&pi_state->pi_mutex.wait_lock);
+	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
 	new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);
 
 	/*
@@ -1343,22 +1343,22 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter
 		ret = -EINVAL;
 	}
 	if (ret) {
-		raw_spin_unlock(&pi_state->pi_mutex.wait_lock);
+		raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
 		return ret;
 	}
 
-	raw_spin_lock_irq(&pi_state->owner->pi_lock);
+	raw_spin_lock(&pi_state->owner->pi_lock);
 	WARN_ON(list_empty(&pi_state->list));
 	list_del_init(&pi_state->list);
-	raw_spin_unlock_irq(&pi_state->owner->pi_lock);
+	raw_spin_unlock(&pi_state->owner->pi_lock);
 
-	raw_spin_lock_irq(&new_owner->pi_lock);
+	raw_spin_lock(&new_owner->pi_lock);
 	WARN_ON(!list_empty(&pi_state->list));
 	list_add(&pi_state->list, &new_owner->pi_state_list);
 	pi_state->owner = new_owner;
-	raw_spin_unlock_irq(&new_owner->pi_lock);
+	raw_spin_unlock(&new_owner->pi_lock);
 
-	raw_spin_unlock(&pi_state->pi_mutex.wait_lock);
+	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
 
 	deboost = rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q);
 
@@ -2269,11 +2269,11 @@ static int fixup_owner(u32 __user *uaddr, struct futex_q *q, int locked)
 		 * we returned due to timeout or signal without taking the
 		 * rt_mutex. Too late.
 		 */
-		raw_spin_lock(&q->pi_state->pi_mutex.wait_lock);
+		raw_spin_lock_irq(&q->pi_state->pi_mutex.wait_lock);
 		owner = rt_mutex_owner(&q->pi_state->pi_mutex);
 		if (!owner)
 			owner = rt_mutex_next_owner(&q->pi_state->pi_mutex);
-		raw_spin_unlock(&q->pi_state->pi_mutex.wait_lock);
+		raw_spin_unlock_irq(&q->pi_state->pi_mutex.wait_lock);
 		ret = fixup_pi_state_owner(uaddr, q, owner);
 		goto out;
 	}
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 6cf9dab7..b8d08c7 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -163,13 +163,14 @@ static inline void mark_rt_mutex_waiters(struct rt_mutex *lock)
  * 2) Drop lock->wait_lock
  * 3) Try to unlock the lock with cmpxchg
  */
-static inline bool unlock_rt_mutex_safe(struct rt_mutex *lock)
+static inline bool unlock_rt_mutex_safe(struct rt_mutex *lock,
+					unsigned long flags)
 	__releases(lock->wait_lock)
 {
 	struct task_struct *owner = rt_mutex_owner(lock);
 
 	clear_rt_mutex_waiters(lock);
-	raw_spin_unlock(&lock->wait_lock);
+	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 	/*
 	 * If a new waiter comes in between the unlock and the cmpxchg
 	 * we have two situations:
@@ -211,11 +212,12 @@ static inline void mark_rt_mutex_waiters(struct rt_mutex *lock)
 /*
  * Simple slow path only version: lock->owner is protected by lock->wait_lock.
  */
-static inline bool unlock_rt_mutex_safe(struct rt_mutex *lock)
+static inline bool unlock_rt_mutex_safe(struct rt_mutex *lock,
+					unsigned long flags)
 	__releases(lock->wait_lock)
 {
 	lock->owner = NULL;
-	raw_spin_unlock(&lock->wait_lock);
+	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 	return
[PATCH 02/17] futex: Use smp_store_release() in mark_wake_futex()
From: Peter Zijlstra

commit 1b367ece0d7e696cab1c8501bab282cc6a538b3f upstream.

Since the futex_q can disappear the instruction after assigning NULL,
this really should be a RELEASE barrier. That stops loads from hitting
dead memory too.

Signed-off-by: Peter Zijlstra (Intel)
Cc: juri.le...@arm.com
Cc: bige...@linutronix.de
Cc: xlp...@redhat.com
Cc: rost...@goodmis.org
Cc: mathieu.desnoy...@efficios.com
Cc: jdesfos...@efficios.com
Cc: dvh...@infradead.org
Cc: bris...@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.604296...@infradead.org
Signed-off-by: Thomas Gleixner
Tested-by: Henrik Austad
---
 kernel/futex.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index bb87324..9e92f12 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1284,8 +1284,7 @@ static void mark_wake_futex(struct wake_q_head *wake_q, struct futex_q *q)
 	 * memory barrier is required here to prevent the following
 	 * store to lock_ptr from getting ahead of the plist_del.
 	 */
-	smp_wmb();
-	q->lock_ptr = NULL;
+	smp_store_release(&q->lock_ptr, NULL);
 }
 
 static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter,
-- 
2.7.4
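As a rough user-space analogy for the change above (not the kernel primitive itself): a C11 release store orders every earlier load and store of the writer before the store that publishes "done", whereas smp_wmb() only orders earlier stores. The sketch below assumes a hypothetical struct waiter_sketch whose lock_ptr going NULL tells the waiter it may proceed and reuse or free the object, so the waker must not touch it afterwards.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a futex_q that lives on the waiter's stack. */
struct waiter_sketch {
	int payload;			/* written by the waker */
	_Atomic(int *) lock_ptr;	/* NULL means "woken, object may go away" */
};

/* Waker side: finish all accesses to @w, then publish with RELEASE. */
void mark_woken(struct waiter_sketch *w, int result)
{
	w->payload = result;
	/* Nothing above may be reordered past this store. */
	atomic_store_explicit(&w->lock_ptr, NULL, memory_order_release);
	/* @w must not be dereferenced after this point. */
}

/* Waiter side: an ACQUIRE load pairs with the release store above. */
bool is_woken(struct waiter_sketch *w)
{
	return atomic_load_explicit(&w->lock_ptr, memory_order_acquire) == NULL;
}
```

Once is_woken() returns true, the waiter is guaranteed to see the payload written by mark_woken(), and the waker is guaranteed to have finished with the object.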
[PATCH 06/17] futex: Change locking rules
From: Peter Zijlstra

commit 734009e96d1983ad739e5b656e03430b3660c913 upstream.

Currently futex-pi relies on hb->lock to serialize everything. But hb->lock
creates another set of problems, especially priority inversions on RT where
hb->lock becomes a rt_mutex itself.

The rt_mutex::wait_lock is the most obvious protection for keeping the
futex user space value and the kernel internal pi_state in sync.

Rework and document the locking so rt_mutex::wait_lock is held across all
operations which modify the user space value and the pi state.

This allows to invoke rt_mutex_unlock() (including deboost) without holding
hb->lock as a next step.

Nothing yet relies on the new locking rules.

Signed-off-by: Peter Zijlstra (Intel)
Cc: juri.le...@arm.com
Cc: bige...@linutronix.de
Cc: xlp...@redhat.com
Cc: rost...@goodmis.org
Cc: mathieu.desnoy...@efficios.com
Cc: jdesfos...@efficios.com
Cc: dvh...@infradead.org
Cc: bris...@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.751993...@infradead.org
Signed-off-by: Thomas Gleixner
Tested-by: Henrik Austad
---
 kernel/futex.c | 165 +
 1 file changed, 132 insertions(+), 33 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index e1200b9..52e3678 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -967,6 +967,39 @@ void exit_pi_state_list(struct task_struct *curr)
  *
  * [10] There is no transient state which leaves owner and user space
  *	TID out of sync.
+ *
+ *
+ * Serialization and lifetime rules:
+ *
+ * hb->lock:
+ *
+ *	hb -> futex_q, relation
+ *	futex_q -> pi_state, relation
+ *
+ *	(cannot be raw because hb can contain arbitrary amount
+ *	 of futex_q's)
+ *
+ * pi_mutex->wait_lock:
+ *
+ *	{uval, pi_state}
+ *
+ *	(and pi_mutex 'obviously')
+ *
+ * p->pi_lock:
+ *
+ *	p->pi_state_list -> pi_state->list, relation
+ *
+ * pi_state->refcount:
+ *
+ *	pi_state lifetime
+ *
+ *
+ * Lock order:
+ *
+ *   hb->lock
+ *     pi_mutex->wait_lock
+ *       p->pi_lock
+ *
  */

 /*
@@ -974,10 +1007,12 @@ void exit_pi_state_list(struct task_struct *curr)
  * the pi_state against the user space value. If correct, attach to
  * it.
  */
-static int attach_to_pi_state(u32 uval, struct futex_pi_state *pi_state,
+static int attach_to_pi_state(u32 __user *uaddr, u32 uval,
+			      struct futex_pi_state *pi_state,
 			      struct futex_pi_state **ps)
 {
 	pid_t pid = uval & FUTEX_TID_MASK;
+	int ret, uval2;

 	/*
 	 * Userspace might have messed up non-PI and PI futexes [3]
@@ -985,9 +1020,34 @@ static int attach_to_pi_state(u32 uval, struct futex_pi_state *pi_state,
 	if (unlikely(!pi_state))
 		return -EINVAL;

+	/*
+	 * We get here with hb->lock held, and having found a
+	 * futex_top_waiter(). This means that futex_lock_pi() of said futex_q
+	 * has dropped the hb->lock in between queue_me() and unqueue_me_pi(),
+	 * which in turn means that futex_lock_pi() still has a reference on
+	 * our pi_state.
+	 */
 	WARN_ON(!atomic_read(&pi_state->refcount));

 	/*
+	 * Now that we have a pi_state, we can acquire wait_lock
+	 * and do the state validation.
+	 */
+	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
+
+	/*
+	 * Since {uval, pi_state} is serialized by wait_lock, and our current
+	 * uval was read without holding it, it can have changed. Verify it
+	 * still is what we expect it to be, otherwise retry the entire
+	 * operation.
+	 */
+	if (get_futex_value_locked(&uval2, uaddr))
+		goto out_efault;
+
+	if (uval != uval2)
+		goto out_eagain;
+
+	/*
 	 * Handle the owner died case:
 	 */
 	if (uval & FUTEX_OWNER_DIED) {
@@ -1002,11 +1062,11 @@ static int attach_to_pi_state(u32 uval, struct futex_pi_state *pi_state,
 			 * is not 0. Inconsistent state. [5]
 			 */
 			if (pid)
-				return -EINVAL;
+				goto out_einval;
 			/*
 			 * Take a ref on the state and return success. [4]
 			 */
-			goto out_state;
+			goto out_attach;
 		}

 		/*
@@ -1018,14 +1078,14 @@ static int attach_to_pi_state(u32 uval, struct futex_pi_state *pi_state,
 		 * Take a ref on the state and return success. [6]
 		 */
 		if (!pid)
-			goto out_state;
+			goto out_attach;
 	} else {
 		/*
 		 * If the owner died bit is not set, then the pi_state
 		 * must have an owner. [7]
 		 */
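The "read the value unlocked, take wait_lock, re-read and retry on mismatch" rule from the new comment block can be modelled in a few lines of user-space C. This is a sketch under assumptions: struct pi_model, model_lock(), model_unlock() and attach_checked() are hypothetical names, and a C11 atomic_flag spinlock stands in for rt_mutex::wait_lock. It is not the kernel's attach_to_pi_state().

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical model: wait_lock serializes {uval, refcount}. */
struct pi_model {
	atomic_flag wait_lock;	/* stand-in for rt_mutex::wait_lock */
	uint32_t uval;		/* stand-in for the user space futex word */
	int refcount;
};

static void model_lock(struct pi_model *p)
{
	while (atomic_flag_test_and_set_explicit(&p->wait_lock,
						 memory_order_acquire))
		;	/* spin until the flag was previously clear */
}

static void model_unlock(struct pi_model *p)
{
	atomic_flag_clear_explicit(&p->wait_lock, memory_order_release);
}

/*
 * 'seen' was read before taking wait_lock, so it may be stale.
 * Re-read under the lock and ask the caller to retry on mismatch.
 */
static int attach_checked(struct pi_model *p, uint32_t seen)
{
	int ret = 0;

	model_lock(p);
	if (p->uval != seen)
		ret = -EAGAIN;		/* value changed: retry the operation */
	else
		p->refcount++;		/* consistent: take a reference */
	model_unlock(p);
	return ret;
}
```

The point of the design is that the reference is only taken while the value and the state are known to agree, so a caller never attaches to a pi_state that belongs to an older futex value.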
[PATCH 07/17] futex: Cleanup refcounting
From: Peter Zijlstra

commit bf92cf3a5100f5a0d5f9834787b130159397cb22 upstream.

Add a put_pi_state() as counterpart for get_pi_state() so the refcounting
becomes consistent.

Signed-off-by: Peter Zijlstra (Intel)
Cc: juri.le...@arm.com
Cc: bige...@linutronix.de
Cc: xlp...@redhat.com
Cc: rost...@goodmis.org
Cc: mathieu.desnoy...@efficios.com
Cc: jdesfos...@efficios.com
Cc: dvh...@infradead.org
Cc: bris...@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.801778...@infradead.org
Signed-off-by: Thomas Gleixner

Conflicts:
	kernel/futex.c
Tested-by: Henrik Austad
---
 kernel/futex.c | 19 ++-
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 52e3678..9d7d462 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -799,7 +799,7 @@ static int refill_pi_state_cache(void)
 	return 0;
 }

-static struct futex_pi_state * alloc_pi_state(void)
+static struct futex_pi_state *alloc_pi_state(void)
 {
 	struct futex_pi_state *pi_state = current->pi_state_cache;

@@ -809,6 +809,11 @@ static struct futex_pi_state * alloc_pi_state(void)
 	return pi_state;
 }

+static void get_pi_state(struct futex_pi_state *pi_state)
+{
+	WARN_ON_ONCE(!atomic_inc_not_zero(&pi_state->refcount));
+}
+
 /*
  * Must be called with the hb lock held.
  */
@@ -850,7 +855,7 @@ static void free_pi_state(struct futex_pi_state *pi_state)
  * Look up the task based on what TID userspace gave us.
  * We dont trust it.
  */
-static struct task_struct * futex_find_get_task(pid_t pid)
+static struct task_struct *futex_find_get_task(pid_t pid)
 {
 	struct task_struct *p;

@@ -1097,7 +1102,7 @@ static int attach_to_pi_state(u32 __user *uaddr, u32 uval,
 		goto out_einval;

 out_attach:
-	atomic_inc(&pi_state->refcount);
+	get_pi_state(pi_state);
 	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
 	*ps = pi_state;
 	return 0;
@@ -2019,8 +2024,12 @@ retry_private:
 		 * of requeue_pi if we couldn't acquire the lock atomically.
 		 */
 		if (requeue_pi) {
-			/* Prepare the waiter to take the rt_mutex. */
-			atomic_inc(&pi_state->refcount);
+			/*
+			 * Prepare the waiter to take the rt_mutex. Take a
+			 * refcount on the pi_state and store the pointer in
+			 * the futex_q object of the waiter.
+			 */
+			get_pi_state(pi_state);
 			this->pi_state = pi_state;
 			ret = rt_mutex_start_proxy_lock(&pi_state->pi_mutex,
 							this->rt_waiter,
-- 
2.7.4
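The reason get_pi_state() uses an "increment unless zero" operation instead of a plain increment can be shown with a small C11 model. This is only a sketch: ref_get_not_zero() is a hypothetical user-space helper written for this note, not the kernel's atomic_inc_not_zero().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Take a reference only if the object is still alive (count > 0). */
static bool ref_get_not_zero(atomic_int *refcount)
{
	int old = atomic_load_explicit(refcount, memory_order_relaxed);

	while (old != 0) {
		/* On failure 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak_explicit(refcount, &old,
							  old + 1,
							  memory_order_acquire,
							  memory_order_relaxed))
			return true;
	}
	return false;	/* count already hit zero: object may be gone */
}
```

A plain increment would happily turn 0 into 1 and resurrect an object that is already on its way to being freed; the conditional form reports that case instead, which is what the WARN_ON_ONCE() in the patch catches.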
[PATCH 04/17] rtmutex: Make wait_lock irq safe
From: Thomas Gleixner

commit b4abf91047cf054f203dcfac97e1038388826937 upstream.

Sasha reported a lockdep splat about a potential deadlock between RCU boosting
rtmutex and the posix timer it_lock.

CPU0					CPU1

rtmutex_lock(&rcu->rt_mutex)
  spin_lock(&rcu->rt_mutex.wait_lock)
					local_irq_disable()
					spin_lock(&timer->it_lock)
					spin_lock(&rcu->mutex.wait_lock)
--> Interrupt
    spin_lock(&timer->it_lock)

This is caused by the following code sequence on CPU1

     rcu_read_lock()
     x = lookup();
     if (x)
	spin_lock_irqsave(&x->it_lock);
     rcu_read_unlock();
     return x;

We could fix that in the posix timer code by keeping rcu read locked across
the spinlocked and irq disabled section, but the above sequence is common and
there is no reason not to support it.

Taking rt_mutex.wait_lock irq safe prevents the deadlock.

Reported-by: Sasha Levin
Signed-off-by: Thomas Gleixner
Cc: Peter Zijlstra
Cc: Paul McKenney
Tested-by: Henrik Austad
---
 kernel/futex.c           |  18 +++
 kernel/locking/rtmutex.c | 135 +--
 2 files changed, 81 insertions(+), 72 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 9e92f12..0f44952 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1307,7 +1307,7 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter
 	if (pi_state->owner != current)
 		return -EINVAL;

-	raw_spin_lock(&pi_state->pi_mutex.wait_lock);
+	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
 	new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);

 	/*
@@ -1343,22 +1343,22 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter
 			ret = -EINVAL;
 	}
 	if (ret) {
-		raw_spin_unlock(&pi_state->pi_mutex.wait_lock);
+		raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
 		return ret;
 	}

-	raw_spin_lock_irq(&pi_state->owner->pi_lock);
+	raw_spin_lock(&pi_state->owner->pi_lock);
 	WARN_ON(list_empty(&pi_state->list));
 	list_del_init(&pi_state->list);
-	raw_spin_unlock_irq(&pi_state->owner->pi_lock);
+	raw_spin_unlock(&pi_state->owner->pi_lock);

-	raw_spin_lock_irq(&new_owner->pi_lock);
+	raw_spin_lock(&new_owner->pi_lock);
 	WARN_ON(!list_empty(&pi_state->list));
 	list_add(&pi_state->list, &new_owner->pi_state_list);
 	pi_state->owner = new_owner;
-	raw_spin_unlock_irq(&new_owner->pi_lock);
+	raw_spin_unlock(&new_owner->pi_lock);

-	raw_spin_unlock(&pi_state->pi_mutex.wait_lock);
+	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);

 	deboost = rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q);

@@ -2269,11 +2269,11 @@ static int fixup_owner(u32 __user *uaddr, struct futex_q *q, int locked)
 		 * we returned due to timeout or signal without taking the
 		 * rt_mutex. Too late.
 		 */
-		raw_spin_lock(&q->pi_state->pi_mutex.wait_lock);
+		raw_spin_lock_irq(&q->pi_state->pi_mutex.wait_lock);
 		owner = rt_mutex_owner(&q->pi_state->pi_mutex);
 		if (!owner)
 			owner = rt_mutex_next_owner(&q->pi_state->pi_mutex);
-		raw_spin_unlock(&q->pi_state->pi_mutex.wait_lock);
+		raw_spin_unlock_irq(&q->pi_state->pi_mutex.wait_lock);
 		ret = fixup_pi_state_owner(uaddr, q, owner);
 		goto out;
 	}
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 6cf9dab7..b8d08c7 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -163,13 +163,14 @@ static inline void mark_rt_mutex_waiters(struct rt_mutex *lock)
  * 2) Drop lock->wait_lock
  * 3) Try to unlock the lock with cmpxchg
  */
-static inline bool unlock_rt_mutex_safe(struct rt_mutex *lock)
+static inline bool unlock_rt_mutex_safe(struct rt_mutex *lock,
+					unsigned long flags)
 	__releases(lock->wait_lock)
 {
 	struct task_struct *owner = rt_mutex_owner(lock);

 	clear_rt_mutex_waiters(lock);
-	raw_spin_unlock(&lock->wait_lock);
+	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 	/*
 	 * If a new waiter comes in between the unlock and the cmpxchg
 	 * we have two situations:
@@ -211,11 +212,12 @@ static inline void mark_rt_mutex_waiters(struct rt_mutex *lock)
 /*
  * Simple slow path only version: lock->owner is protected by lock->wait_lock.
  */
-static inline bool unlock_rt_mutex_safe(struct rt_mutex *lock)
+static inline bool unlock_rt_mutex_safe(struct rt_mutex *lock,
+					unsigned long flags)
 	__releases(lock->wait_lock)
 {
 	lock->owner = NULL;
-	raw_spin_unlock(&lock->wait_lock);
+	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 	return
[PATCH 00/17] Backport rt/deadline crash and the ardous story of FUTEX_UNLOCK_PI to 4.4
From: Henrik Austad

Short story:

The following patches are needed on a 4.4 kernel to avoid an Oops in the
scheduler when a sched_rr and a sched_deadline task contend on the same
futex (with PI).

Longer story:

On one of our arm64 systems, we occasionally crash with an Oops in the
scheduler with the following backtrace.

[] enqueue_task_dl+0x1f0/0x420
[] activate_task+0x7c/0x90
[] push_dl_task+0x164/0x1c8
[] push_dl_tasks+0x20/0x30
[] __balance_callback+0x44/0x68
[] __schedule+0x6f0/0x728
[] schedule+0x78/0x98
[] __rt_mutex_slowlock+0x9c/0x108
[] rt_mutex_slowlock+0xd8/0x198
[] rt_mutex_timed_futex_lock+0x30/0x40
[] futex_lock_pi+0x200/0x3b0
[] do_futex+0x1c4/0x550
[] compat_SyS_futex+0x10c/0x138
[] __sys_trace_return+0x0/0x4

This seems to be the same bug Xunlei Pang triggered and fixed in
e96a7705e7d3 "sched/rtmutex/deadline: Fix a PI crash for deadline
tasks". As noted by Peter Zijlstra in the previous attempt, this fix
requires a few other patches, most notably the FUTEX_UNLOCK_PI series [1].

Testing this on a dual-core VM I have not been able to reproduce the same
crash, but pi_stress (part of the rt-tests suite) reveals that vanilla 4.4.162
behaves rather badly with a mix of deadline and sched_(rr|fifo) tasks:

time pi_stress --rr --mlockall --sched id=high,policy=deadline,runtime=10,deadline=20,period=20
Starting PI Stress Test
Number of thread groups: 1
Duration of test run: infinite
Number of inversions per group: unlimited
Admin thread SCHED_RR priority 4
1 groups of 3 threads will be created
High thread SCHED_DEADLINE runtime 10 deadline 20 period 20
Med thread SCHED_RR priority 2
Low thread SCHED_RR priority 1
Current Inversions: 141627
WATCHDOG triggered: group 0 is deadlocked!
reporter stopping due to watchdog event
Stopping test
Terminated

real	0m26.291s
user	0m0.148s
sys	0m18.819s

With this series applied, the test ran for ~4.5 hours and again for 129
minutes (when I remembered to time it) before deadlocking:

time pi_stress --rr --mlockall --sched id=high,policy=deadline,runtime=10,deadline=20,period=20
Starting PI Stress Test
Number of thread groups: 1
Duration of test run: infinite
Number of inversions per group: unlimited
Admin thread SCHED_RR priority 4
1 groups of 3 threads will be created
High thread SCHED_DEADLINE runtime 10 deadline 20 period 20
Med thread SCHED_RR priority 2
Low thread SCHED_RR priority 1
Current Inversions: 51985223
WATCHDOG triggered: group 0 is deadlocked!
reporter stopping due to watchdog event
Stopping test
Terminated

real	129m38.807s
user	0m59.084s
sys	109m53.666s

So clearly not perfect, but a *lot* better. The same series on our
vendor-4.4 kernel moves pi_stress up from ~30 seconds before deadlock up
to the same level as the VM (the test is still going as of this writing).

I suspect other users of 4.4 would benefit from having these patches
backported, so tag them for stable. I assume 4.9 and 4.14 could benefit as
well, but I have not had time to look into those.

1) https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1359667.html

Peter Zijlstra (13):
  futex: Cleanup variable names for futex_top_waiter()
  futex: Use smp_store_release() in mark_wake_futex()
  futex: Remove rt_mutex_deadlock_account_*()
  futex,rt_mutex: Provide futex specific rt_mutex API
  futex: Change locking rules
  futex: Cleanup refcounting
  futex: Rework inconsistent rt_mutex/futex_q state
  futex: Pull rt_mutex_futex_unlock() out from under hb->lock
  futex,rt_mutex: Introduce rt_mutex_init_waiter()
  futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock()
  futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock()
  futex: Futex_unlock_pi() determinism
  futex: Drop hb->lock before enqueueing on the rtmutex

Thomas Gleixner (2):
  rtmutex: Make wait_lock irq safe
  futex: Rename free_pi_state() to put_pi_state()

Xunlei Pang (2):
  rtmutex: Deboost before waking up the top waiter
  sched/rtmutex/deadline: Fix a PI crash for deadline tasks

 include/linux/init_task.h       |   1 +
 include/linux/sched.h           |   2 +
 include/linux/sched/rt.h        |   1 +
 kernel/fork.c                   |   1 +
 kernel/futex.c                  | 532 ++--
 kernel/locking/rtmutex-debug.c  |   9 -
 kernel/locking/rtmutex-debug.h  |   3 -
 kernel/locking/rtmutex.c        | 406 ++
 kernel/locking/rtmutex.h        |   2 -
 kernel/locking/rtmutex_common.h |  24 +-
 kernel/sched/core.c             |   2 +
 11 files changed, 620 insertions(+), 363 deletions(-)

-- 
2.7.4
[PATCH 09/17] futex: Rename free_pi_state() to put_pi_state()
From: Thomas Gleixner

commit 29e9ee5d48c35d6cf8afe09bdf03f77125c9ac11 upstream.

free_pi_state() is confusing as it is in fact only freeing/caching the
pi state when the last reference is gone. Rename it to put_pi_state()
which reflects better what it is doing.

Signed-off-by: Thomas Gleixner
Cc: Peter Zijlstra
Cc: Darren Hart
Cc: Davidlohr Bueso
Cc: bhuvanesh_surach...@mentor.com
Cc: Andy Lowe
Link: http://lkml.kernel.org/r/20151219200607.259636...@linutronix.de
Signed-off-by: Thomas Gleixner
Tested-by: Henrik Austad
---
 kernel/futex.c | 17 ++---
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 91acb65..09f698a 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -815,9 +815,12 @@ static void get_pi_state(struct futex_pi_state *pi_state)
 }

 /*
+ * Drops a reference to the pi_state object and frees or caches it
+ * when the last reference is gone.
+ *
  * Must be called with the hb lock held.
  */
-static void free_pi_state(struct futex_pi_state *pi_state)
+static void put_pi_state(struct futex_pi_state *pi_state)
 {
 	if (!pi_state)
 		return;
@@ -1959,7 +1962,7 @@ retry_private:
 		case 0:
 			break;
 		case -EFAULT:
-			free_pi_state(pi_state);
+			put_pi_state(pi_state);
 			pi_state = NULL;
 			double_unlock_hb(hb1, hb2);
 			hb_waiters_dec(hb2);
@@ -1976,7 +1979,7 @@ retry_private:
 			 *   exit to complete.
 			 * - The user space value changed.
 			 */
-			free_pi_state(pi_state);
+			put_pi_state(pi_state);
 			pi_state = NULL;
 			double_unlock_hb(hb1, hb2);
 			hb_waiters_dec(hb2);
@@ -2049,7 +2052,7 @@ retry_private:
 			} else if (ret) {
 				/* -EDEADLK */
 				this->pi_state = NULL;
-				free_pi_state(pi_state);
+				put_pi_state(pi_state);
 				goto out_unlock;
 			}
 		}
@@ -2058,7 +2061,7 @@ retry_private:
 	}

 out_unlock:
-	free_pi_state(pi_state);
+	put_pi_state(pi_state);
 	double_unlock_hb(hb1, hb2);
 	wake_up_q(&wake_q);
 	hb_waiters_dec(hb2);
@@ -2211,7 +2214,7 @@ static void unqueue_me_pi(struct futex_q *q)
 	__unqueue_futex(q);

 	BUG_ON(!q->pi_state);
-	free_pi_state(q->pi_state);
+	put_pi_state(q->pi_state);
 	q->pi_state = NULL;

 	spin_unlock(q->lock_ptr);
@@ -2993,7 +2996,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 			 * Drop the reference to the pi state which
 			 * the requeue_pi() code acquired for us.
 			 */
-			free_pi_state(q.pi_state);
+			put_pi_state(q.pi_state);
 			spin_unlock(q.lock_ptr);
 		}
 	} else {
-- 
2.7.4
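Why "put" is the better name can be seen from a tiny model of the drop side: most calls only decrement, and the object is released (or parked in a cache) only when the last reference goes away. The sketch below is an assumption-laden illustration: struct obj, obj_cache and obj_put() are hypothetical names, the cache is a single slot, and it is written for a single thread, so it is not the kernel's put_pi_state().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical refcounted object, loosely modelled on a pi_state. */
struct obj {
	atomic_int refcount;
};

static struct obj *obj_cache;	/* one-slot cache; single-threaded model */

/* Drop one reference; returns true when this was the last one. */
static bool obj_put(struct obj *o)
{
	if (!o)
		return false;

	/* fetch_sub returns the previous value: 1 means last reference. */
	if (atomic_fetch_sub_explicit(&o->refcount, 1,
				      memory_order_acq_rel) != 1)
		return false;

	obj_cache = o;	/* "free or cache": this model only caches */
	return true;
}
```

With two references held, the first obj_put() leaves the object alone and only the second one hands it to the cache, which is exactly the behaviour a name like free_pi_state() hides.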
[PATCH 11/17] futex,rt_mutex: Introduce rt_mutex_init_waiter()
From: Peter Zijlstra

commit 50809358dd7199aa7ce232f6877dd09ec30ef374 upstream.

Since there are already two copies of this code, introduce a helper now
before adding a third one.

Signed-off-by: Peter Zijlstra (Intel)
Cc: juri.le...@arm.com
Cc: bige...@linutronix.de
Cc: xlp...@redhat.com
Cc: rost...@goodmis.org
Cc: mathieu.desnoy...@efficios.com
Cc: jdesfos...@efficios.com
Cc: dvh...@infradead.org
Cc: bris...@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.950039...@infradead.org
Signed-off-by: Thomas Gleixner
Tested-by: Henrik Austad
---
 kernel/futex.c                  |  5 +
 kernel/locking/rtmutex.c        | 12 +---
 kernel/locking/rtmutex_common.h |  1 +
 3 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 7054ca3..4d70fd7 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2969,10 +2969,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 	 * The waiter is allocated on our stack, manipulated by the requeue
 	 * code while we sleep on uaddr.
 	 */
-	debug_rt_mutex_init_waiter(&rt_waiter);
-	RB_CLEAR_NODE(&rt_waiter.pi_tree_entry);
-	RB_CLEAR_NODE(&rt_waiter.tree_entry);
-	rt_waiter.task = NULL;
+	rt_mutex_init_waiter(&rt_waiter);

 	ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2, VERIFY_WRITE);
 	if (unlikely(ret != 0))
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 28c1d40..8778ac3 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1151,6 +1151,14 @@ void rt_mutex_adjust_pi(struct task_struct *task)
 				   next_lock, NULL, task);
 }

+void rt_mutex_init_waiter(struct rt_mutex_waiter *waiter)
+{
+	debug_rt_mutex_init_waiter(waiter);
+	RB_CLEAR_NODE(&waiter->pi_tree_entry);
+	RB_CLEAR_NODE(&waiter->tree_entry);
+	waiter->task = NULL;
+}
+
 /**
  * __rt_mutex_slowlock() - Perform the wait-wake-try-to-take loop
  * @lock:		 the rt_mutex to take
@@ -1233,9 +1241,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
 	unsigned long flags;
 	int ret = 0;

-	debug_rt_mutex_init_waiter(&waiter);
-	RB_CLEAR_NODE(&waiter.pi_tree_entry);
-	RB_CLEAR_NODE(&waiter.tree_entry);
+	rt_mutex_init_waiter(&waiter);

 	/*
 	 * Technically we could use raw_spin_[un]lock_irq() here, but this can
diff --git a/kernel/locking/rtmutex_common.h b/kernel/locking/rtmutex_common.h
index 2441c2d..d16de236 100644
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -103,6 +103,7 @@ extern void rt_mutex_init_proxy_locked(struct rt_mutex *lock,
 				       struct task_struct *proxy_owner);
 extern void rt_mutex_proxy_unlock(struct rt_mutex *lock,
 				  struct task_struct *proxy_owner);
+extern void rt_mutex_init_waiter(struct rt_mutex_waiter *waiter);
 extern int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
 				     struct rt_mutex_waiter *waiter,
 				     struct task_struct *task);
-- 
2.7.4
[PATCH 16/17] rtmutex: Deboost before waking up the top waiter
From: Xunlei Pang

commit 2a1c6029940675abb2217b590512dbf691867ec4 upstream.

We should deboost before waking the high-priority task, such that we
don't run two tasks with the same "state" (priority, deadline,
sched_class, etc).

In order to make sure the boosting task doesn't start running between
unlock and deboost (due to 'spurious' wakeup), we move the deboost
under the wait_lock, that way it's serialized against the wait loop in
__rt_mutex_slowlock().

Doing the deboost early can however lead to priority-inversion if
current would get preempted after the deboost but before waking our
high-prio task, hence we disable preemption before doing deboost, and
enabling it after the wake up is over.

This gets us the right semantic order, but most importantly however;
this change ensures pointer stability for the next patch, where we
have rt_mutex_setprio() cache a pointer to the top-most waiter task.
If we, as before this change, do the wakeup first and then deboost,
this pointer might point into thin air.

[peterz: Changelog + patch munging]
Suggested-by: Peter Zijlstra
Signed-off-by: Xunlei Pang
Signed-off-by: Peter Zijlstra (Intel)
Acked-by: Steven Rostedt
Cc: juri.le...@arm.com
Cc: bige...@linutronix.de
Cc: mathieu.desnoy...@efficios.com
Cc: jdesfos...@efficios.com
Cc: bris...@redhat.com
Link: http://lkml.kernel.org/r/20170323150216.110065...@infradead.org
Signed-off-by: Thomas Gleixner
Tested-by: Henrik Austad
---
 kernel/futex.c                  |  5 +---
 kernel/locking/rtmutex.c        | 59 ++---
 kernel/locking/rtmutex_common.h |  2 +-
 3 files changed, 34 insertions(+), 32 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index afb02a7..63fa840 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1457,10 +1457,7 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_pi_state *pi_
 out_unlock:
 	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);

-	if (deboost) {
-		wake_up_q(&wake_q);
-		rt_mutex_adjust_prio(current);
-	}
+	rt_mutex_postunlock(&wake_q, deboost);

 	return ret;
 }
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index b061a79..c01d7f4 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -371,24 +371,6 @@ static void __rt_mutex_adjust_prio(struct task_struct *task)
 }

 /*
- * Adjust task priority (undo boosting). Called from the exit path of
- * rt_mutex_slowunlock() and rt_mutex_slowlock().
- *
- * (Note: We do this outside of the protection of lock->wait_lock to
- * allow the lock to be taken while or before we readjust the priority
- * of task. We do not use the spin_xx_mutex() variants here as we are
- * outside of the debug path.)
- */
-void rt_mutex_adjust_prio(struct task_struct *task)
-{
-	unsigned long flags;
-
-	raw_spin_lock_irqsave(&task->pi_lock, flags);
-	__rt_mutex_adjust_prio(task);
-	raw_spin_unlock_irqrestore(&task->pi_lock, flags);
-}
-
-/*
  * Deadlock detection is conditional:
  *
  * If CONFIG_DEBUG_RT_MUTEXES=n, deadlock detection is only conducted
@@ -1049,6 +1031,7 @@ static void mark_wakeup_next_waiter(struct wake_q_head *wake_q,
 	 * lock->wait_lock.
 	 */
 	rt_mutex_dequeue_pi(current, waiter);
+	__rt_mutex_adjust_prio(current);

 	/*
 	 * As we are waking up the top waiter, and the waiter stays
@@ -1391,6 +1374,16 @@ static bool __sched rt_mutex_slowunlock(struct rt_mutex *lock,
 	 */
 	mark_wakeup_next_waiter(wake_q, lock);

+	/*
+	 * We should deboost before waking the top waiter task such that
+	 * we don't run two tasks with the 'same' priority. This however
+	 * can lead to prio-inversion if we would get preempted after
+	 * the deboost but before waking our high-prio task, hence the
+	 * preempt_disable before unlock. Pairs with preempt_enable() in
+	 * rt_mutex_postunlock();
+	 */
+	preempt_disable();
+
 	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);

 	/* check PI boosting */
@@ -1440,6 +1433,18 @@ rt_mutex_fasttrylock(struct rt_mutex *lock,
 	return slowfn(lock);
 }

+/*
+ * Undo pi boosting (if necessary) and wake top waiter.
+ */
+void rt_mutex_postunlock(struct wake_q_head *wake_q, bool deboost)
+{
+	wake_up_q(wake_q);
+
+	/* Pairs with preempt_disable() in rt_mutex_slowunlock() */
+	if (deboost)
+		preempt_enable();
+}
+
 static inline void
 rt_mutex_fastunlock(struct rt_mutex *lock,
 		    bool (*slowfn)(struct rt_mutex *lock,
@@ -1453,11 +1458,7 @@ rt_mutex_fastunlock(struct rt_mutex *lock,

 	deboost = slowfn(lock, &wake_q);

-	wake_up_q(&wake_q);
-
-	/* Undo pi boosting if necessary: */
-	if (deboost)
-		rt_mutex_adjust_prio(current);
+	rt_mutex_postunlock(&wake_q, deboost);
 }

 /**
@@ -1570,6 +1571,13 @@ bool __sched __rt_mutex_futex_u
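The ordering this patch establishes (deboost while still holding wait_lock, keep preemption off across the gap, wake the waiter, then allow preemption again) can be traced with a toy single-threaded model. Everything here is hypothetical and exists only to make the order of steps visible: struct model, model_slowunlock() and model_postunlock() are not kernel APIs, and the counter merely mimics the pairing of preempt_disable() and preempt_enable().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical single-threaded model; none of these are kernel APIs. */
struct model {
	int prio;			/* current, possibly boosted priority */
	int normal_prio;		/* priority without boosting */
	int preempt_count;		/* > 0 means preemption is disabled */
	bool woke_waiter;
	bool deboosted_before_wake;
};

/* Slow unlock: deboost first, then block preemption until the wake. */
static bool model_slowunlock(struct model *m)
{
	m->prio = m->normal_prio;	/* deboost, still under the lock */
	m->preempt_count++;		/* pairs with model_postunlock() */
	return true;			/* caller must run the post-unlock step */
}

/* Post unlock: wake the top waiter, then allow preemption again. */
static void model_postunlock(struct model *m, bool deboost)
{
	m->deboosted_before_wake = (m->prio == m->normal_prio);
	m->woke_waiter = true;
	if (deboost)
		m->preempt_count--;
}
```

Between the two calls the model is deboosted but cannot be preempted, which is the window the changelog describes: without the preemption block, the task could lose the CPU after giving up its boost and before the high-priority waiter is runnable.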
[PATCH 08/17] futex: Rework inconsistent rt_mutex/futex_q state
From: Peter Zijlstra

commit 73d786bd043ebc855f349c81ea805f6b11cbf2aa upstream.

There is a weird state in the futex_unlock_pi() path when it interleaves
with a concurrent futex_lock_pi() at the point where it drops hb->lock.

In this case, it can happen that the rt_mutex wait_list and the futex_q
disagree on pending waiters, in particular rt_mutex will find no pending
waiters where futex_q thinks there are. In this case the rt_mutex unlock
code cannot assign an owner.

The futex side fixup code has to cleanup the inconsistencies with quite a
bunch of interesting corner cases.

Simplify all this by changing wake_futex_pi() to return -EAGAIN when this
situation occurs. This then gives the futex_lock_pi() code the opportunity
to continue and the retried futex_unlock_pi() will now observe a coherent
state.

The only problem is that this breaks RT timeliness guarantees. That
is, consider the following scenario:

  T1 and T2 are both pinned to CPU0. prio(T2) > prio(T1)

    CPU0

    T1
      lock_pi()
      queue_me()  <- Waiter is visible

    preemption

    T2
      unlock_pi()
        loops with -EAGAIN forever

Which is undesirable for PI primitives. Future patches will rectify
this.

Signed-off-by: Peter Zijlstra (Intel)
Cc: juri.le...@arm.com
Cc: bige...@linutronix.de
Cc: xlp...@redhat.com
Cc: rost...@goodmis.org
Cc: mathieu.desnoy...@efficios.com
Cc: jdesfos...@efficios.com
Cc: dvh...@infradead.org
Cc: bris...@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.850383...@infradead.org
Signed-off-by: Thomas Gleixner
Tested-by: Henrik Austad
---
 kernel/futex.c | 50 ++
 1 file changed, 14 insertions(+), 36 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 9d7d462..91acb65 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1398,12 +1398,19 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter
 	new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);

 	/*
-	 * It is possible that the next waiter (the one that brought
-	 * top_waiter owner to the kernel) timed out and is no longer
-	 * waiting on the lock.
+	 * When we interleave with futex_lock_pi() where it does
+	 * rt_mutex_timed_futex_lock(), we might observe @this futex_q waiter,
+	 * but the rt_mutex's wait_list can be empty (either still, or again,
+	 * depending on which side we land).
+	 *
+	 * When this happens, give up our locks and try again, giving the
+	 * futex_lock_pi() instance time to complete, either by waiting on the
+	 * rtmutex or removing itself from the futex queue.
 	 */
-	if (!new_owner)
-		new_owner = top_waiter->task;
+	if (!new_owner) {
+		raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
+		return -EAGAIN;
+	}

 	/*
 	 * We pass it to the next owner. The WAITERS bit is always
@@ -2342,7 +2349,6 @@ static long futex_wait_restart(struct restart_block *restart);
  */
 static int fixup_owner(u32 __user *uaddr, struct futex_q *q, int locked)
 {
-	struct task_struct *owner;
 	int ret = 0;

 	if (locked) {
@@ -2356,43 +2362,15 @@ static int fixup_owner(u32 __user *uaddr, struct futex_q *q, int locked)
 	}

 	/*
-	 * Catch the rare case, where the lock was released when we were on the
-	 * way back before we locked the hash bucket.
-	 */
-	if (q->pi_state->owner == current) {
-		/*
-		 * Try to get the rt_mutex now. This might fail as some other
-		 * task acquired the rt_mutex after we removed ourself from the
-		 * rt_mutex waiters list.
-		 */
-		if (rt_mutex_futex_trylock(&q->pi_state->pi_mutex)) {
-			locked = 1;
-			goto out;
-		}
-
-		/*
-		 * pi_state is incorrect, some other task did a lock steal and
-		 * we returned due to timeout or signal without taking the
-		 * rt_mutex. Too late.
-		 */
-		raw_spin_lock_irq(&q->pi_state->pi_mutex.wait_lock);
-		owner = rt_mutex_owner(&q->pi_state->pi_mutex);
-		if (!owner)
-			owner = rt_mutex_next_owner(&q->pi_state->pi_mutex);
-		raw_spin_unlock_irq(&q->pi_state->pi_mutex.wait_lock);
-		ret = fixup_pi_state_owner(uaddr, q, owner);
-		goto out;
-	}
-
-	/*
 	 * Paranoia check. If we did not take the lock, then we should not be
 	 * the owner of the rt_mutex.
 	 */
-	if (rt_mutex_owner(&q->pi_state->pi_mutex) == current)
+	if (rt_mutex_owner(&q->pi_state->pi_mutex) == current) {
 		printk(KERN_ERR "fixup_owner: ret = %d pi-mutex: %p "
 				"
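
Not part of the patch: a minimal userspace sketch of the -EAGAIN retry described in the changelog of 08/17. Everything here is a hypothetical stand-in (struct toy_pi_state, toy_wake_futex_pi(), toy_lock_pi_progress() and toy_unlock_pi() are made-up names, not kernel API), and the "waiter" is modelled as a plain function call rather than a second task. It only tries to show why the retry converges when the locker can make progress, and why it does not when the locker is starved as in the T1/T2 scenario.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical, heavily simplified stand-in for the futex/rt_mutex state. */
struct toy_pi_state {
	int futex_q_waiters;	/* waiters visible on the futex hash-bucket queue */
	int rtmutex_waiters;	/* waiters enqueued on the rt_mutex wait_list */
	int owner;		/* task id of the current owner */
};

/* Shape of the patched wake_futex_pi(): do not guess an owner, ask for a retry. */
static int toy_wake_futex_pi(struct toy_pi_state *s, int next_task)
{
	assert(s->futex_q_waiters > 0);	/* caller only gets here with a queued waiter */

	if (s->rtmutex_waiters == 0)
		return -EAGAIN;		/* futex_q sees a waiter, the rt_mutex does not */

	s->rtmutex_waiters--;
	s->futex_q_waiters--;
	s->owner = next_task;
	return 0;
}

/* Stand-in for the concurrent futex_lock_pi() completing its enqueue. */
static void toy_lock_pi_progress(struct toy_pi_state *s)
{
	if (s->rtmutex_waiters < s->futex_q_waiters)
		s->rtmutex_waiters++;
}

/*
 * Retry loop around the wake. Returns the number of attempts needed, or -1
 * if max_tries was exhausted. waiter_can_run == 0 models the case where the
 * locker never gets CPU time between retries.
 */
static int toy_unlock_pi(struct toy_pi_state *s, int next_task,
			 int max_tries, int waiter_can_run)
{
	for (int tries = 1; tries <= max_tries; tries++) {
		if (toy_wake_futex_pi(s, next_task) == 0)
			return tries;
		if (waiter_can_run)
			toy_lock_pi_progress(s);
	}
	return -1;
}
```

With waiter_can_run set, the second attempt succeeds in this model; with it cleared, the loop is bounded only by max_tries, which is the unbounded-retry problem the changelog says later patches address.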
[PATCH 05/17] futex,rt_mutex: Provide futex specific rt_mutex API
From: Peter Zijlstra

commit 5293c2efda37775346885c7e924d4ef7018ea60b upstream.

Part of what makes futex_unlock_pi() intricate is that
rt_mutex_futex_unlock() -> rt_mutex_slowunlock() can drop
rt_mutex::wait_lock.

This means it cannot rely on the atomicity of wait_lock, which would be
preferred in order to not rely on hb->lock so much.

The reason rt_mutex_slowunlock() needs to drop wait_lock is because it can
race with the rt_mutex fastpath, however futexes have their own fast path.

Since futexes already have a bunch of separate rt_mutex accessors, complete
that set and implement a rt_mutex variant without fastpath for them.

Signed-off-by: Peter Zijlstra (Intel)
Cc: juri.le...@arm.com
Cc: bige...@linutronix.de
Cc: xlp...@redhat.com
Cc: rost...@goodmis.org
Cc: mathieu.desnoy...@efficios.com
Cc: jdesfos...@efficios.com
Cc: dvh...@infradead.org
Cc: bris...@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.702962...@infradead.org
Signed-off-by: Thomas Gleixner
Tested-by: Henrik Austad
---
 kernel/futex.c                  | 30 +++---
 kernel/locking/rtmutex.c        | 55 ++---
 kernel/locking/rtmutex_common.h |  9 +--
 3 files changed, 62 insertions(+), 32 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 0f44952..e1200b9 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -910,7 +910,7 @@ void exit_pi_state_list(struct task_struct *curr)
 		pi_state->owner = NULL;
 		raw_spin_unlock_irq(&curr->pi_lock);

-		rt_mutex_unlock(&pi_state->pi_mutex);
+		rt_mutex_futex_unlock(&pi_state->pi_mutex);

 		spin_unlock(&hb->lock);

@@ -1358,20 +1358,18 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter
 	pi_state->owner = new_owner;
 	raw_spin_unlock(&new_owner->pi_lock);

-	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
-
-	deboost = rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q);
-
 	/*
-	 * First unlock HB so the waiter does not spin on it once he got woken
-	 * up. Second wake up the waiter before the priority is adjusted. If we
-	 * deboost first (and lose our higher priority), then the task might get
-	 * scheduled away before the wake up can take place.
+	 * We've updated the uservalue, this unlock cannot fail.
 	 */
+	deboost = __rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q);
+
+	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
 	spin_unlock(&hb->lock);
-	wake_up_q(&wake_q);
-	if (deboost)
+
+	if (deboost) {
+		wake_up_q(&wake_q);
 		rt_mutex_adjust_prio(current);
+	}

 	return 0;
 }
@@ -2259,7 +2257,7 @@ static int fixup_owner(u32 __user *uaddr, struct futex_q *q, int locked)
 		 * task acquired the rt_mutex after we removed ourself from the
 		 * rt_mutex waiters list.
 		 */
-		if (rt_mutex_trylock(&q->pi_state->pi_mutex)) {
+		if (rt_mutex_futex_trylock(&q->pi_state->pi_mutex)) {
 			locked = 1;
 			goto out;
 		}
@@ -2574,7 +2572,7 @@ retry_private:
 	if (!trylock) {
 		ret = rt_mutex_timed_futex_lock(&q.pi_state->pi_mutex, to);
 	} else {
-		ret = rt_mutex_trylock(&q.pi_state->pi_mutex);
+		ret = rt_mutex_futex_trylock(&q.pi_state->pi_mutex);
 		/* Fixup the trylock return value: */
 		ret = ret ? 0 : -EWOULDBLOCK;
 	}
@@ -2597,7 +2595,7 @@ retry_private:
 	 * it and return the fault to userspace.
 	 */
 	if (ret && (rt_mutex_owner(&q.pi_state->pi_mutex) == current))
-		rt_mutex_unlock(&q.pi_state->pi_mutex);
+		rt_mutex_futex_unlock(&q.pi_state->pi_mutex);

 	/* Unqueue and drop the lock */
 	unqueue_me_pi(&q);
@@ -2904,7 +2902,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 			spin_lock(q.lock_ptr);
 			ret = fixup_pi_state_owner(uaddr2, &q, current);
 			if (ret && rt_mutex_owner(&q.pi_state->pi_mutex) == current)
-				rt_mutex_unlock(&q.pi_state->pi_mutex);
+				rt_mutex_futex_unlock(&q.pi_state->pi_mutex);
 			/*
 			 * Drop the reference to the pi state which
 			 * the requeue_pi() code acquired for us.
@@ -2944,7 +2942,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 		 * userspace.
 		 */
 		if (ret && rt_mutex_owner(pi_mutex) == current)
-			rt_mutex_unlock(pi_mutex);
+			rt_mutex_futex_unlock(pi_mutex);

 		/* Unqueue and drop the lock. */
 		unqueue_me_pi(&q);
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index b8d08c7..28c1d40 100644
---
[PATCH 10/17] futex: Pull rt_mutex_futex_unlock() out from under hb->lock
From: Peter Zijlstra

commit 16ffa12d742534d4ff73e8b3a4e81c1de39196f0 upstream.

There's a number of 'interesting' problems, all caused by holding
hb->lock while doing the rt_mutex_unlock() equivalent.

Notably:

 - a PI inversion on hb->lock; and,

 - a SCHED_DEADLINE crash because of pointer instability.

The previous changes:

 - changed the locking rules to cover {uval,pi_state} with wait_lock.

 - allow to do rt_mutex_futex_unlock() without dropping wait_lock; which in
   turn allows to rely on wait_lock atomicity completely.

 - simplified the waiter conundrum.

It's now sufficient to hold rtmutex::wait_lock and a reference on the
pi_state to protect the state consistency, so hb->lock can be dropped
before calling rt_mutex_futex_unlock().

Signed-off-by: Peter Zijlstra (Intel)
Cc: juri.le...@arm.com
Cc: bige...@linutronix.de
Cc: xlp...@redhat.com
Cc: rost...@goodmis.org
Cc: mathieu.desnoy...@efficios.com
Cc: jdesfos...@efficios.com
Cc: dvh...@infradead.org
Cc: bris...@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.92...@infradead.org
Signed-off-by: Thomas Gleixner
Conflicts: kernel/futex.c
Tested-by: Henrik Austad
---
 kernel/futex.c | 154 +
 1 file changed, 100 insertions(+), 54 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 09f698a..7054ca3 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -918,10 +918,12 @@ void exit_pi_state_list(struct task_struct *curr)
 		pi_state->owner = NULL;
 		raw_spin_unlock_irq(&curr->pi_lock);

-		rt_mutex_futex_unlock(&pi_state->pi_mutex);
-
+		get_pi_state(pi_state);
 		spin_unlock(&hb->lock);

+		rt_mutex_futex_unlock(&pi_state->pi_mutex);
+		put_pi_state(pi_state);
+
 		raw_spin_lock_irq(&curr->pi_lock);
 	}
 	raw_spin_unlock_irq(&curr->pi_lock);
@@ -1034,6 +1036,11 @@ static int attach_to_pi_state(u32 __user *uaddr, u32 uval,
 	 * has dropped the hb->lock in between queue_me() and unqueue_me_pi(),
 	 * which in turn means that futex_lock_pi() still has a reference on
 	 * our pi_state.
+	 *
+	 * The waiter holding a reference on @pi_state also protects against
+	 * the unlocked put_pi_state() in futex_unlock_pi(), futex_lock_pi()
+	 * and futex_wait_requeue_pi() as it cannot go to 0 and consequently
+	 * free pi_state before we can take a reference ourselves.
 	 */
 	WARN_ON(!atomic_read(&pi_state->refcount));

@@ -1377,48 +1384,40 @@ static void mark_wake_futex(struct wake_q_head *wake_q, struct futex_q *q)
 	smp_store_release(&q->lock_ptr, NULL);
 }

-static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter,
-			 struct futex_hash_bucket *hb)
+/*
+ * Caller must hold a reference on @pi_state.
+ */
+static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_pi_state *pi_state)
 {
-	struct task_struct *new_owner;
-	struct futex_pi_state *pi_state = top_waiter->pi_state;
 	u32 uninitialized_var(curval), newval;
+	struct task_struct *new_owner;
+	bool deboost = false;
 	WAKE_Q(wake_q);
-	bool deboost;
 	int ret = 0;

-	if (!pi_state)
-		return -EINVAL;
-
-	/*
-	 * If current does not own the pi_state then the futex is
-	 * inconsistent and user space fiddled with the futex value.
-	 */
-	if (pi_state->owner != current)
-		return -EINVAL;
-
 	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
 	new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);
-
-	/*
-	 * When we interleave with futex_lock_pi() where it does
-	 * rt_mutex_timed_futex_lock(), we might observe @this futex_q waiter,
-	 * but the rt_mutex's wait_list can be empty (either still, or again,
-	 * depending on which side we land).
-	 *
-	 * When this happens, give up our locks and try again, giving the
-	 * futex_lock_pi() instance time to complete, either by waiting on the
-	 * rtmutex or removing itself from the futex queue.
-	 */
 	if (!new_owner) {
-		raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
-		return -EAGAIN;
+		/*
+		 * Since we held neither hb->lock nor wait_lock when coming
+		 * into this function, we could have raced with futex_lock_pi()
+		 * such that we might observe @this futex_q waiter, but the
+		 * rt_mutex's wait_list can be empty (either still, or again,
+		 * depending on which side we land).
+		 *
+		 * When this happens, give up our locks and try again, giving
+		 * the futex_lock_pi() instance time to complete, either by
+		 * waiting on the rtmutex or removing itself from the futex
+		 * queue.
+		 */
+
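
Not part of the patch: a small userspace sketch of the "take a reference, then drop hb->lock, then unlock" ordering that the exit_pi_state_list() hunk of 10/17 introduces. It assumes a pthread mutex as a stand-in for hb->lock and a C11 atomic counter as a stand-in for the pi_state refcount; struct toy_pi_state and the toy_*() helpers are hypothetical names, and "freeing" is modelled by a flag so the state can still be inspected afterwards.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical stand-in for struct futex_pi_state; not the kernel layout. */
struct toy_pi_state {
	atomic_int refcount;	/* one reference is held by the queued waiter */
	int mutex_held;		/* stands in for the rt_mutex being owned */
	int freed;		/* set when the last reference is dropped */
};

/* Stand-in for hb->lock. */
static pthread_mutex_t toy_hb_lock = PTHREAD_MUTEX_INITIALIZER;

static void toy_get_pi_state(struct toy_pi_state *s)
{
	atomic_fetch_add(&s->refcount, 1);
}

static void toy_put_pi_state(struct toy_pi_state *s)
{
	/* atomic_fetch_sub() returns the old value: 1 means we held the last ref. */
	if (atomic_fetch_sub(&s->refcount, 1) == 1)
		s->freed = 1;
}

static void toy_unlock_path(struct toy_pi_state *s)
{
	pthread_mutex_lock(&toy_hb_lock);
	toy_get_pi_state(s);			/* pin the state first ... */
	pthread_mutex_unlock(&toy_hb_lock);	/* ... then drop the bucket lock */

	/* Our own reference keeps the state alive outside toy_hb_lock. */
	assert(!s->freed);
	s->mutex_held = 0;			/* rt_mutex_futex_unlock() stand-in */

	toy_put_pi_state(s);
}
```

The point of the ordering is that the unlock itself no longer runs under the bucket lock, while the extra reference guarantees the object cannot disappear in the window where no lock is held.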
[PATCH 15/17] futex: Drop hb->lock before enqueueing on the rtmutex
From: Peter Zijlstra

commit 56222b212e8edb1cf51f5dd73ff645809b082b40 upstream.

When PREEMPT_RT_FULL does the spinlock -> rt_mutex substitution the PI
chain code will (falsely) report a deadlock and BUG.

The problem is that it holds hb->lock (now an rt_mutex) while doing
task_blocks_on_rt_mutex on the futex's pi_state::rtmutex. This, when
interleaved just right with futex_unlock_pi(), leads it to believe it sees
an AB-BA deadlock.

  Task1 (holds rt_mutex,          Task2 (does FUTEX_LOCK_PI)
         does FUTEX_UNLOCK_PI)

                                  lock hb->lock
                                  lock rt_mutex (as per start_proxy)
  lock hb->lock

Which is a trivial AB-BA.

It is not an actual deadlock, because it won't be holding hb->lock by the
time it actually blocks on the rt_mutex, but the chainwalk code doesn't
know that and it would be a nightmare to handle this gracefully.

To avoid this problem, do the same as in futex_unlock_pi() and drop
hb->lock after acquiring wait_lock. This still fully serializes against
futex_unlock_pi(), since adding to the wait_list does the very same lock
dance, and removing it holds both locks.

Aside from solving the RT problem this makes the lock and unlock mechanism
symmetric and reduces the hb->lock held time.

Reported-and-tested-by: Sebastian Andrzej Siewior
Suggested-by: Thomas Gleixner
Signed-off-by: Peter Zijlstra (Intel)
Cc: juri.le...@arm.com
Cc: xlp...@redhat.com
Cc: rost...@goodmis.org
Cc: mathieu.desnoy...@efficios.com
Cc: jdesfos...@efficios.com
Cc: dvh...@infradead.org
Cc: bris...@redhat.com
Link: http://lkml.kernel.org/r/20170322104152.161341...@infradead.org
Signed-off-by: Thomas Gleixner
Tested-by: Henrik Austad
---
 kernel/futex.c                  | 30 +
 kernel/locking/rtmutex.c        | 49 +++--
 kernel/locking/rtmutex_common.h |  3 +++
 3 files changed, 52 insertions(+), 30 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 14d270e..afb02a7 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2667,20 +2667,33 @@ retry_private:
 		goto no_block;
 	}

+	rt_mutex_init_waiter(&rt_waiter);
+
 	/*
-	 * We must add ourselves to the rt_mutex waitlist while holding hb->lock
-	 * such that the hb and rt_mutex wait lists match.
+	 * On PREEMPT_RT_FULL, when hb->lock becomes an rt_mutex, we must not
+	 * hold it while doing rt_mutex_start_proxy(), because then it will
+	 * include hb->lock in the blocking chain, even through we'll not in
+	 * fact hold it while blocking. This will lead it to report -EDEADLK
+	 * and BUG when futex_unlock_pi() interleaves with this.
+	 *
+	 * Therefore acquire wait_lock while holding hb->lock, but drop the
+	 * latter before calling rt_mutex_start_proxy_lock(). This still fully
+	 * serializes against futex_unlock_pi() as that does the exact same
+	 * lock handoff sequence.
 	 */
-	rt_mutex_init_waiter(&rt_waiter);
-	ret = rt_mutex_start_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter, current);
+	raw_spin_lock_irq(&q.pi_state->pi_mutex.wait_lock);
+	spin_unlock(q.lock_ptr);
+	ret = __rt_mutex_start_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter, current);
+	raw_spin_unlock_irq(&q.pi_state->pi_mutex.wait_lock);
+
 	if (ret) {
 		if (ret == 1)
 			ret = 0;

+		spin_lock(q.lock_ptr);
 		goto no_block;
 	}

-	spin_unlock(q.lock_ptr);

 	if (unlikely(to))
 		hrtimer_start_expires(&to->timer, HRTIMER_MODE_ABS);
@@ -2693,6 +2706,9 @@ retry_private:
 	 * first acquire the hb->lock before removing the lock from the
 	 * rt_mutex waitqueue, such that we can keep the hb and rt_mutex
 	 * wait lists consistent.
+	 *
+	 * In particular; it is important that futex_unlock_pi() can not
+	 * observe this inconsistency.
 	 */
 	if (ret && !rt_mutex_cleanup_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter))
 		ret = 0;
@@ -2804,10 +2820,6 @@ retry:
 		get_pi_state(pi_state);

 		/*
-		 * Since modifying the wait_list is done while holding both
-		 * hb->lock and wait_lock, holding either is sufficient to
-		 * observe it.
-		 *
 		 * By taking wait_lock while still holding hb->lock, we ensure
 		 * there is no point where we hold neither; and therefore
 		 * wake_futex_pi() must observe a state consistent with what we
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 3025f61..b061a79 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1659,31 +1659,14 @@ void rt_mutex_proxy_unlock(struct rt_mutex *lock,
 	rt_mutex_set_owner(lock, NULL);
 }

-/**
- * rt_mutex_st
[PATCH 14/17] futex: Futex_unlock_pi() determinism
From: Peter Zijlstra

commit bebe5b514345f09be2c15e414d076b02ecb9cce8 upstream.

The problem with returning -EAGAIN when the waiter state mismatches is that
it becomes very hard to prove a bounded execution time on the
operation. And seeing that this is a RT operation, this is somewhat
important.

While in practise, given the previous patch, it will be very unlikely to
ever really take more than one or two rounds, proving so becomes rather
hard.

However, now that modifying wait_list is done while holding both hb->lock
and wait_lock, the scenario can be avoided entirely by acquiring wait_lock
while still holding hb->lock. This does a hand-over without leaving a hole.

Signed-off-by: Peter Zijlstra (Intel)
Cc: juri.le...@arm.com
Cc: bige...@linutronix.de
Cc: xlp...@redhat.com
Cc: rost...@goodmis.org
Cc: mathieu.desnoy...@efficios.com
Cc: jdesfos...@efficios.com
Cc: dvh...@infradead.org
Cc: bris...@redhat.com
Link: http://lkml.kernel.org/r/20170322104152.112378...@infradead.org
Signed-off-by: Thomas Gleixner
Tested-by: Henrik Austad
---
 kernel/futex.c | 24 +++-
 1 file changed, 11 insertions(+), 13 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 1cc40dd..14d270e 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1395,15 +1395,10 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_pi_state *pi_
 	WAKE_Q(wake_q);
 	int ret = 0;

-	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
 	new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);
-	if (!new_owner) {
+	if (WARN_ON_ONCE(!new_owner)) {
 		/*
-		 * Since we held neither hb->lock nor wait_lock when coming
-		 * into this function, we could have raced with futex_lock_pi()
-		 * such that we might observe @this futex_q waiter, but the
-		 * rt_mutex's wait_list can be empty (either still, or again,
-		 * depending on which side we land).
+		 * As per the comment in futex_unlock_pi() this should not happen.
 		 *
 		 * When this happens, give up our locks and try again, giving
 		 * the futex_lock_pi() instance time to complete, either by
@@ -2807,15 +2802,18 @@ retry:
 		if (pi_state->owner != current)
 			goto out_unlock;

+		get_pi_state(pi_state);
 		/*
-		 * Grab a reference on the pi_state and drop hb->lock.
+		 * Since modifying the wait_list is done while holding both
+		 * hb->lock and wait_lock, holding either is sufficient to
+		 * observe it.
 		 *
-		 * The reference ensures pi_state lives, dropping the hb->lock
-		 * is tricky.. wake_futex_pi() will take rt_mutex::wait_lock to
-		 * close the races against futex_lock_pi(), but in case of
-		 * _any_ fail we'll abort and retry the whole deal.
+		 * By taking wait_lock while still holding hb->lock, we ensure
+		 * there is no point where we hold neither; and therefore
+		 * wake_futex_pi() must observe a state consistent with what we
+		 * observed.
 		 */
-		get_pi_state(pi_state);
+		raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
 		spin_unlock(&hb->lock);

 		ret = wake_futex_pi(uaddr, uval, pi_state);
--
2.7.4

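For readers who want to see the hand-over in isolation, here is a minimal userspace sketch. It is not kernel code: the demo_* names are hypothetical, pthread mutexes stand in for hb->lock and rt_mutex::wait_lock, and a plain counter stands in for the wait_list. It only illustrates the ordering the changelog describes, taking wait_lock before dropping hb->lock so that there is no window where neither is held.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-ins for the kernel objects; assumptions, not kernel API. */
struct demo_bucket {
	pthread_mutex_t lock;		/* plays the role of hb->lock */
};

struct demo_rtmutex {
	pthread_mutex_t wait_lock;	/* plays the role of rt_mutex::wait_lock */
	int nr_waiters;			/* plays the role of the wait_list */
};

/* Lock side: the wait list is only modified with both locks held. */
static void demo_enqueue_waiter(struct demo_bucket *hb, struct demo_rtmutex *m)
{
	pthread_mutex_lock(&hb->lock);
	pthread_mutex_lock(&m->wait_lock);
	m->nr_waiters++;
	pthread_mutex_unlock(&m->wait_lock);
	pthread_mutex_unlock(&hb->lock);
}

/*
 * Unlock side: hand over from hb->lock to wait_lock. Because wait_lock is
 * taken before hb->lock is dropped, the waiter count read afterwards is
 * consistent with whatever was observed under hb->lock.
 */
static int demo_unlock_handover(struct demo_bucket *hb, struct demo_rtmutex *m)
{
	int seen;

	pthread_mutex_lock(&hb->lock);
	/* ... waiter lookup under hb->lock would happen here ... */
	pthread_mutex_lock(&m->wait_lock);	/* both locks held */
	pthread_mutex_unlock(&hb->lock);	/* only wait_lock held */
	seen = m->nr_waiters;
	pthread_mutex_unlock(&m->wait_lock);

	return seen;
}
```

Calling demo_enqueue_waiter() once and then demo_unlock_handover() on the same objects reports one waiter. With a concurrent enqueuer the unlock side sees the waiter either fully queued or not at all, which is the property that lets the patch turn the old -EAGAIN retry into a WARN_ON_ONCE().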
[PATCH 17/17] sched/rtmutex/deadline: Fix a PI crash for deadline tasks
From: Xunlei Pang

commit e96a7705e7d3fef96aec9b590c63b2f6f7d2ba22 upstream.

A crash happened while I was playing with deadline PI rtmutex.

  BUG: unable to handle kernel NULL pointer dereference at 0018
  IP: [] rt_mutex_get_top_task+0x1f/0x30
  PGD 232a75067 PUD 230947067 PMD 0
  Oops: [#1] SMP
  CPU: 1 PID: 10994 Comm: a.out Not tainted

  Call Trace:
  [] enqueue_task+0x2c/0x80
  [] activate_task+0x23/0x30
  [] pull_dl_task+0x1d5/0x260
  [] pre_schedule_dl+0x16/0x20
  [] __schedule+0xd3/0x900
  [] schedule+0x29/0x70
  [] __rt_mutex_slowlock+0x4b/0xc0
  [] rt_mutex_slowlock+0xd1/0x190
  [] rt_mutex_timed_lock+0x53/0x60
  [] futex_lock_pi.isra.18+0x28c/0x390
  [] do_futex+0x190/0x5b0
  [] SyS_futex+0x80/0x180

This is because rt_mutex_enqueue_pi() and rt_mutex_dequeue_pi()
are only protected by pi_lock when operating on pi waiters, while
rt_mutex_get_top_task() will access them with the rq lock held but
without holding pi_lock.

In order to tackle it, we introduce a new "pi_top_task" pointer
cached in task_struct, and add a new rt_mutex_update_top_task()
to update its value; it can be called by rt_mutex_setprio(),
which holds both the owner's pi_lock and the rq lock. Thus "pi_top_task"
can be safely accessed by enqueue_task_dl() under the rq lock.

Originally-From: Peter Zijlstra
Signed-off-by: Xunlei Pang
Signed-off-by: Peter Zijlstra (Intel)
Acked-by: Steven Rostedt
Reviewed-by: Thomas Gleixner
Cc: juri.le...@arm.com
Cc: bige...@linutronix.de
Cc: mathieu.desnoy...@efficios.com
Cc: jdesfos...@efficios.com
Cc: bris...@redhat.com
Link: http://lkml.kernel.org/r/20170323150216.157682...@infradead.org
Signed-off-by: Thomas Gleixner

Conflicts:
	include/linux/sched.h
Tested-by: Henrik Austad
---
 include/linux/init_task.h |  1 +
 include/linux/sched.h     |  2 ++
 include/linux/sched/rt.h  |  1 +
 kernel/fork.c             |  1 +
 kernel/locking/rtmutex.c  | 29 +
 kernel/sched/core.c       |  2 ++
 6 files changed, 28 insertions(+), 8 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 1c1ff7e..a561ce0c 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -162,6 +162,7 @@ extern struct task_group root_task_group;
 #ifdef CONFIG_RT_MUTEXES
 # define INIT_RT_MUTEXES(tsk)						\
 	.pi_waiters = RB_ROOT,						\
+	.pi_top_task = NULL,						\
 	.pi_waiters_leftmost = NULL,
 #else
 # define INIT_RT_MUTEXES(tsk)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b30540d..89cd0d0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1617,6 +1617,8 @@ struct task_struct {
 	/* PI waiters blocked on a rt_mutex held by this task */
 	struct rb_root pi_waiters;
 	struct rb_node *pi_waiters_leftmost;
+	/* Updated under owner's pi_lock and rq lock */
+	struct task_struct *pi_top_task;
 	/* Deadlock detection and priority inheritance handling */
 	struct rt_mutex_waiter *pi_blocked_on;
 #endif
diff --git a/include/linux/sched/rt.h b/include/linux/sched/rt.h
index a30b172..60d0c47 100644
--- a/include/linux/sched/rt.h
+++ b/include/linux/sched/rt.h
@@ -19,6 +19,7 @@ static inline int rt_task(struct task_struct *p)
 extern int rt_mutex_getprio(struct task_struct *p);
 extern void rt_mutex_setprio(struct task_struct *p, int prio);
 extern int rt_mutex_get_effective_prio(struct task_struct *task, int newprio);
+extern void rt_mutex_update_top_task(struct task_struct *p);
 extern struct task_struct *rt_mutex_get_top_task(struct task_struct *task);
 extern void rt_mutex_adjust_pi(struct task_struct *p);
 static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
diff --git a/kernel/fork.c b/kernel/fork.c
index dd2f79a..9376270 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1242,6 +1242,7 @@ static void rt_mutex_init_task(struct task_struct *p)
 #ifdef CONFIG_RT_MUTEXES
 	p->pi_waiters = RB_ROOT;
 	p->pi_waiters_leftmost = NULL;
+	p->pi_top_task = NULL;
 	p->pi_blocked_on = NULL;
 #endif
 }
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index c01d7f4..dd3b1e9 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -321,6 +321,19 @@ rt_mutex_dequeue_pi(struct task_struct *task, struct rt_mutex_waiter *waiter)
 }

 /*
+ * Must hold both p->pi_lock and task_rq(p)->lock.
+ */
+void rt_mutex_update_top_task(struct task_struct *p)
+{
+	if (!task_has_pi_waiters(p)) {
+		p->pi_top_task = NULL;
+		return;
+	}
+
+	p->pi_top_task = task_top_pi_waiter(p)->task;
+}
+
+/*
  * Calculate task priority from the waiter tree priority
  *
  * Return task->normal_prio when the waiter tree is empty or when
@@ -335,12 +348,12 @@ int rt_mutex_getprio(struct task_struct *task)
 			   task->normal_prio);
 }

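The idea behind pi_top_task, as I read the changelog, is to keep a cached copy of the top waiter that is written only while both locks are held, so that a reader holding just one of them never has to walk the waiter tree. A small userspace sketch of that idea follows. The demo_* names and the single top_waiter field are hypothetical simplifications (the kernel uses an rbtree and real locks), and the priority comparison is only there to give the reader side something to do.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified task; not the kernel's task_struct. */
struct demo_task {
	int prio;			/* lower value means higher priority */
	struct demo_task *top_waiter;	/* "tree" head, protected by pi_lock only */
	struct demo_task *pi_top_task;	/* cached copy, written under pi_lock + rq lock */
};

/*
 * Writer side. The caller is assumed to hold both the (imaginary) pi_lock
 * and rq lock, mirroring the locking rule stated in the patch.
 */
static void demo_update_top_task(struct demo_task *p)
{
	p->pi_top_task = p->top_waiter;	/* NULL when there are no waiters */
}

/*
 * Reader side. The caller is assumed to hold only the rq lock, so it reads
 * the cached pointer and never touches top_waiter.
 */
static int demo_effective_prio(const struct demo_task *p)
{
	const struct demo_task *top = p->pi_top_task;

	if (top && top->prio < p->prio)
		return top->prio;
	return p->prio;
}
```

An owner with prio 50 and no waiters reports 50. After a prio 10 waiter is attached and demo_update_top_task() has run, it reports 10. A waiter attached without the update call is not visible to the reader, which is the point: the reader only ever sees values published under both locks.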
[PATCH 12/17] futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock()
From: Peter Zijlstra

commit 38d589f2fd08f1296aea3ce62bebd185125c6d81 upstream.

With the ultimate goal of keeping rt_mutex wait_list and futex_q waiters
consistent it's necessary to split 'rt_mutex_futex_lock()' into finer
parts, such that only the actual blocking can be done without hb->lock
held.

Split rt_mutex_finish_proxy_lock() into two parts, one that does the
blocking and one that does remove_waiter() when the lock acquire failed.

When the rtmutex was acquired successfully the waiter can be removed in the
acquisition path safely, since there is no concurrency on the lock owner.

This means that, except for futex_lock_pi(), all wait_list modifications
are done with both hb->lock and wait_lock held.

[bige...@linutronix.de: fix for futex_requeue_pi_signal_restart]

Signed-off-by: Peter Zijlstra (Intel)
Cc: juri.le...@arm.com
Cc: bige...@linutronix.de
Cc: xlp...@redhat.com
Cc: rost...@goodmis.org
Cc: mathieu.desnoy...@efficios.com
Cc: jdesfos...@efficios.com
Cc: dvh...@infradead.org
Cc: bris...@redhat.com
Link: http://lkml.kernel.org/r/20170322104152.001659...@infradead.org
Signed-off-by: Thomas Gleixner
Tested-by: Henrik Austad
---
 kernel/futex.c                  |  7 --
 kernel/locking/rtmutex.c        | 52 +++--
 kernel/locking/rtmutex_common.h |  8 ---
 3 files changed, 55 insertions(+), 12 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 4d70fd7..dce3250 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -3045,10 +3045,13 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 		 */
 		WARN_ON(!q.pi_state);
 		pi_mutex = &q.pi_state->pi_mutex;
-		ret = rt_mutex_finish_proxy_lock(pi_mutex, to, &rt_waiter);
-		debug_rt_mutex_free_waiter(&rt_waiter);
+		ret = rt_mutex_wait_proxy_lock(pi_mutex, to, &rt_waiter);

 		spin_lock(q.lock_ptr);
+		if (ret && !rt_mutex_cleanup_proxy_lock(pi_mutex, &rt_waiter))
+			ret = 0;
+
+		debug_rt_mutex_free_waiter(&rt_waiter);
 		/*
 		 * Fixup the pi_state owner and possibly acquire the lock if we
 		 * haven't already.
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 8778ac3..78ecea6 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1743,21 +1743,23 @@ struct task_struct *rt_mutex_next_owner(struct rt_mutex *lock)
 }

 /**
- * rt_mutex_finish_proxy_lock() - Complete lock acquisition
+ * rt_mutex_wait_proxy_lock() - Wait for lock acquisition
  * @lock:		the rt_mutex we were woken on
  * @to:			the timeout, null if none. hrtimer should already have
  *			been started.
  * @waiter:		the pre-initialized rt_mutex_waiter
  *
- * Complete the lock acquisition started our behalf by another thread.
+ * Wait for the the lock acquisition started on our behalf by
+ * rt_mutex_start_proxy_lock(). Upon failure, the caller must call
+ * rt_mutex_cleanup_proxy_lock().
  *
  * Returns:
  *  0 - success
  * <0 - error, one of -EINTR, -ETIMEDOUT
  *
- * Special API call for PI-futex requeue support
+ * Special API call for PI-futex support
  */
-int rt_mutex_finish_proxy_lock(struct rt_mutex *lock,
+int rt_mutex_wait_proxy_lock(struct rt_mutex *lock,
 			       struct hrtimer_sleeper *to,
 			       struct rt_mutex_waiter *waiter)
 {
@@ -1770,9 +1772,6 @@ int rt_mutex_finish_proxy_lock(struct rt_mutex *lock,
 	/* sleep on the mutex */
 	ret = __rt_mutex_slowlock(lock, TASK_INTERRUPTIBLE, to, waiter);

-	if (unlikely(ret))
-		remove_waiter(lock, waiter);
-
 	/*
 	 * try_to_take_rt_mutex() sets the waiter bit unconditionally. We might
 	 * have to fix that up.
@@ -1783,3 +1782,42 @@ int rt_mutex_finish_proxy_lock(struct rt_mutex *lock,

 	return ret;
 }
+
+/**
+ * rt_mutex_cleanup_proxy_lock() - Cleanup failed lock acquisition
+ * @lock:		the rt_mutex we were woken on
+ * @waiter:		the pre-initialized rt_mutex_waiter
+ *
+ * Attempt to clean up after a failed rt_mutex_wait_proxy_lock().
+ *
+ * Unless we acquired the lock; we're still enqueued on the wait-list and can
+ * in fact still be granted ownership until we're removed. Therefore we can
+ * find we are in fact the owner and must disregard the
+ * rt_mutex_wait_proxy_lock() failure.
+ *
+ * Returns:
+ *  true  - did the cleanup, we done.
+ *  false - we acquired the lock after rt_mutex_wait_proxy_lock() returned,
+ *          caller should disregards its return value.
+ *
+ * Special API call for PI-futex support
+ */
+bool rt_mutex_cleanup_proxy_lock(struct rt_mutex *lock,
+				 struct rt_mutex_waiter *waiter)
+{
+	bool cleanup = false;
+
+	raw_spin_lock_irq(&lock->wait_lock);
+	/*
+	 * Unless

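The caller-side contract of the wait/cleanup split can be tried out in userspace. The sketch below is only a model: the demo_* names are hypothetical, there is no real locking, and the body of the cleanup helper is my guess from the kerneldoc above (the diff is cut off before the function body), so treat it as an assumption rather than the kernel implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, lock-free model of an rt_mutex; not the kernel structure. */
struct demo_lock {
	int owner;		/* 0 means unowned, otherwise a task id */
	bool waiter_queued;	/* whether our waiter is still on the wait list */
};

/*
 * Model of the cleanup step. Returns true when the waiter was removed and
 * the earlier failure stands, false when we turned out to be the owner.
 */
static bool demo_cleanup_proxy_lock(struct demo_lock *l, int self)
{
	bool cleanup = false;

	if (l->owner != self) {
		l->waiter_queued = false;	/* stands in for remove_waiter() */
		cleanup = true;
	}
	return cleanup;
}

/* Caller pattern from the futex.c hunk: a failed wait is only final if cleanup succeeds. */
static int demo_finish(struct demo_lock *l, int self, int wait_ret)
{
	int ret = wait_ret;

	if (ret && !demo_cleanup_proxy_lock(l, self))
		ret = 0;	/* we own the lock after all; disregard the failure */
	return ret;
}
```

A timeout on a lock owned by nobody or by someone else stays a timeout and the waiter is dequeued. If ownership was handed to us between the wait returning and the cleanup running, the error is turned into success, which is the race the kerneldoc describes.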
[PATCH 13/17] futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock()
From: Peter Zijlstra

commit cfafcd117da0216520568c195cb2f6cd1980c4bb upstream.

By changing futex_lock_pi() to use rt_mutex_*_proxy_lock() all wait_list
modifications are done under both hb->lock and wait_lock.

This closes the obvious interleave pattern between futex_lock_pi() and
futex_unlock_pi(), but not entirely so. See below:

Before:

futex_lock_pi()                 futex_unlock_pi()
  unlock hb->lock

                                  lock hb->lock
                                  unlock hb->lock

                                  lock rt_mutex->wait_lock
                                  unlock rt_mutex_wait_lock
                                    -EAGAIN

  lock rt_mutex->wait_lock
  list_add
  unlock rt_mutex->wait_lock

  schedule()

  lock rt_mutex->wait_lock
  list_del
  unlock rt_mutex->wait_lock

                                    -EAGAIN

  lock hb->lock


After:

futex_lock_pi()                 futex_unlock_pi()

  lock hb->lock
  lock rt_mutex->wait_lock
  list_add
  unlock rt_mutex->wait_lock
  unlock hb->lock

  schedule()
                                  lock hb->lock
                                  unlock hb->lock
  lock hb->lock
  lock rt_mutex->wait_lock
  list_del
  unlock rt_mutex->wait_lock

                                  lock rt_mutex->wait_lock
                                  unlock rt_mutex_wait_lock
                                    -EAGAIN

  unlock hb->lock


It does however solve the earlier starvation/live-lock scenario which got
introduced with the -EAGAIN since unlike the before scenario; where the
-EAGAIN happens while futex_unlock_pi() doesn't hold any locks; in the
after scenario it happens while futex_unlock_pi() actually holds a lock,
and then it is serialized on that lock.

Signed-off-by: Peter Zijlstra (Intel)
Cc: juri.le...@arm.com
Cc: bige...@linutronix.de
Cc: xlp...@redhat.com
Cc: rost...@goodmis.org
Cc: mathieu.desnoy...@efficios.com
Cc: jdesfos...@efficios.com
Cc: dvh...@infradead.org
Cc: bris...@redhat.com
Link: http://lkml.kernel.org/r/20170322104152.062785...@infradead.org
Signed-off-by: Thomas Gleixner
Tested-by: Henrik Austad
---
 kernel/futex.c                  | 77 +
 kernel/locking/rtmutex.c        | 26 --
 kernel/locking/rtmutex_common.h |  1 -
 3 files changed, 62 insertions(+), 42 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index dce3250..1cc40dd 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2112,20 +2112,7 @@ queue_unlock(struct futex_hash_bucket *hb)
 	hb_waiters_dec(hb);
 }

-/**
- * queue_me() - Enqueue the futex_q on the futex_hash_bucket
- * @q:	The futex_q to enqueue
- * @hb:	The destination hash bucket
- *
- * The hb->lock must be held by the caller, and is released here. A call to
- * queue_me() is typically paired with exactly one call to unqueue_me(). The
- * exceptions involve the PI related operations, which may use unqueue_me_pi()
- * or nothing if the unqueue is done as part of the wake process and the unqueue
- * state is implicit in the state of woken task (see futex_wait_requeue_pi() for
- * an example).
- */
-static inline void queue_me(struct futex_q *q, struct futex_hash_bucket *hb)
-	__releases(&hb->lock)
+static inline void __queue_me(struct futex_q *q, struct futex_hash_bucket *hb)
 {
 	int prio;

@@ -2142,6 +2129,24 @@ static inline void queue_me(struct futex_q *q, struct futex_hash_bucket *hb)
 	plist_node_init(&q->list, prio);
 	plist_add(&q->list, &hb->chain);
 	q->task = current;
+}
+
+/**
+ * queue_me() - Enqueue the futex_q on the futex_hash_bucket
+ * @q:	The futex_q to enqueue
+ * @hb:	The destination hash bucket
+ *
+ * The hb->lock must be held by the caller, and is released here. A call to
+ * queue_me() is typically paired with exactly one call to unqueue_me(). The
+ * exceptions involve the PI related operations, which may use unqueue_me_pi()
+ * or nothing if the unqueue is done as part of the wake process and the unqueue
+ * state is implicit in the state of woken task (see futex_wait_requeue_pi() for
+ * an example).
+ */
+static inline void queue_me(struct futex_q *q, struct futex_hash_bucket *hb)
+	__releases(&hb->lock)
+{
+	__queue_me(q, hb);
 	spin_unlock(&hb->lock);
 }

@@ -2600,6 +2605,7 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags,
 {
 	struct hrtimer_sleeper timeout, *to = NULL;
 	struct futex_pi_state *pi_state = NULL;
+	struct rt_mutex_waiter rt_waiter;
 	struct futex_hash_bucket *hb;
 	struct futex_q q = futex_q_init;
 	int res, ret;
@@ -2652,25 +2658,52 @@ retry_private:
 		}
 	}

+	WARN_ON(!q.pi_state);
+
 	/*
 	 * Only actually queue now that the atomic ops are done:
 	 */
-	queue_me(&q, hb);
+	__q

[PATCH 12/17] futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock()
From: Peter Zijlstra commit 38d589f2fd08f1296aea3ce62bebd185125c6d81 upstream. With the ultimate goal of keeping rt_mutex wait_list and futex_q waiters consistent it's necessary to split 'rt_mutex_futex_lock()' into finer parts, such that only the actual blocking can be done without hb->lock held. Split split_mutex_finish_proxy_lock() into two parts, one that does the blocking and one that does remove_waiter() when the lock acquire failed. When the rtmutex was acquired successfully the waiter can be removed in the acquisiton path safely, since there is no concurrency on the lock owner. This means that, except for futex_lock_pi(), all wait_list modifications are done with both hb->lock and wait_lock held. [bige...@linutronix.de: fix for futex_requeue_pi_signal_restart] Signed-off-by: Peter Zijlstra (Intel) Cc: juri.le...@arm.com Cc: bige...@linutronix.de Cc: xlp...@redhat.com Cc: rost...@goodmis.org Cc: mathieu.desnoy...@efficios.com Cc: jdesfos...@efficios.com Cc: dvh...@infradead.org Cc: bris...@redhat.com Link: http://lkml.kernel.org/r/20170322104152.001659...@infradead.org Signed-off-by: Thomas Gleixner Tested-by: Henrik Austad --- kernel/futex.c | 7 -- kernel/locking/rtmutex.c| 52 +++-- kernel/locking/rtmutex_common.h | 8 --- 3 files changed, 55 insertions(+), 12 deletions(-) diff --git a/kernel/futex.c b/kernel/futex.c index 4d70fd7..dce3250 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -3045,10 +3045,13 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags, */ WARN_ON(!q.pi_state); pi_mutex = _state->pi_mutex; - ret = rt_mutex_finish_proxy_lock(pi_mutex, to, _waiter); - debug_rt_mutex_free_waiter(_waiter); + ret = rt_mutex_wait_proxy_lock(pi_mutex, to, _waiter); spin_lock(q.lock_ptr); + if (ret && !rt_mutex_cleanup_proxy_lock(pi_mutex, _waiter)) + ret = 0; + + debug_rt_mutex_free_waiter(_waiter); /* * Fixup the pi_state owner and possibly acquire the lock if we * haven't already. 
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c index 8778ac3..78ecea6 100644 --- a/kernel/locking/rtmutex.c +++ b/kernel/locking/rtmutex.c @@ -1743,21 +1743,23 @@ struct task_struct *rt_mutex_next_owner(struct rt_mutex *lock) } /** - * rt_mutex_finish_proxy_lock() - Complete lock acquisition + * rt_mutex_wait_proxy_lock() - Wait for lock acquisition * @lock: the rt_mutex we were woken on * @to:the timeout, null if none. hrtimer should already have * been started. * @waiter:the pre-initialized rt_mutex_waiter * - * Complete the lock acquisition started our behalf by another thread. + * Wait for the the lock acquisition started on our behalf by + * rt_mutex_start_proxy_lock(). Upon failure, the caller must call + * rt_mutex_cleanup_proxy_lock(). * * Returns: * 0 - success * <0 - error, one of -EINTR, -ETIMEDOUT * - * Special API call for PI-futex requeue support + * Special API call for PI-futex support */ -int rt_mutex_finish_proxy_lock(struct rt_mutex *lock, +int rt_mutex_wait_proxy_lock(struct rt_mutex *lock, struct hrtimer_sleeper *to, struct rt_mutex_waiter *waiter) { @@ -1770,9 +1772,6 @@ int rt_mutex_finish_proxy_lock(struct rt_mutex *lock, /* sleep on the mutex */ ret = __rt_mutex_slowlock(lock, TASK_INTERRUPTIBLE, to, waiter); - if (unlikely(ret)) - remove_waiter(lock, waiter); - /* * try_to_take_rt_mutex() sets the waiter bit unconditionally. We might * have to fix that up. @@ -1783,3 +1782,42 @@ int rt_mutex_finish_proxy_lock(struct rt_mutex *lock, return ret; } + +/** + * rt_mutex_cleanup_proxy_lock() - Cleanup failed lock acquisition + * @lock: the rt_mutex we were woken on + * @waiter:the pre-initialized rt_mutex_waiter + * + * Attempt to clean up after a failed rt_mutex_wait_proxy_lock(). + * + * Unless we acquired the lock; we're still enqueued on the wait-list and can + * in fact still be granted ownership until we're removed. 
Therefore we can + * find we are in fact the owner and must disregard the + * rt_mutex_wait_proxy_lock() failure. + * + * Returns: + * true - did the cleanup, we done. + * false - we acquired the lock after rt_mutex_wait_proxy_lock() returned, + * caller should disregards its return value. + * + * Special API call for PI-futex support + */ +bool rt_mutex_cleanup_proxy_lock(struct rt_mutex *lock, +struct rt_mutex_waiter *waiter) +{ + bool cleanup = false; + + raw_spin_lock_irq(>wait_lock); + /* +* Unless
[PATCH 13/17] futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock()
From: Peter Zijlstra commit cfafcd117da0216520568c195cb2f6cd1980c4bb upstream. By changing futex_lock_pi() to use rt_mutex_*_proxy_lock() all wait_list modifications are done under both hb->lock and wait_lock. This closes the obvious interleave pattern between futex_lock_pi() and futex_unlock_pi(), but not entirely so. See below: Before: futex_lock_pi() futex_unlock_pi() unlock hb->lock lock hb->lock unlock hb->lock lock rt_mutex->wait_lock unlock rt_mutex_wait_lock -EAGAIN lock rt_mutex->wait_lock list_add unlock rt_mutex->wait_lock schedule() lock rt_mutex->wait_lock list_del unlock rt_mutex->wait_lock -EAGAIN lock hb->lock After: futex_lock_pi() futex_unlock_pi() lock hb->lock lock rt_mutex->wait_lock list_add unlock rt_mutex->wait_lock unlock hb->lock schedule() lock hb->lock unlock hb->lock lock hb->lock lock rt_mutex->wait_lock list_del unlock rt_mutex->wait_lock lock rt_mutex->wait_lock unlock rt_mutex_wait_lock -EAGAIN unlock hb->lock It does however solve the earlier starvation/live-lock scenario which got introduced with the -EAGAIN since unlike the before scenario; where the -EAGAIN happens while futex_unlock_pi() doesn't hold any locks; in the after scenario it happens while futex_unlock_pi() actually holds a lock, and then it is serialized on that lock. 
Signed-off-by: Peter Zijlstra (Intel)
Cc: juri.le...@arm.com
Cc: bige...@linutronix.de
Cc: xlp...@redhat.com
Cc: rost...@goodmis.org
Cc: mathieu.desnoy...@efficios.com
Cc: jdesfos...@efficios.com
Cc: dvh...@infradead.org
Cc: bris...@redhat.com
Link: http://lkml.kernel.org/r/20170322104152.062785...@infradead.org
Signed-off-by: Thomas Gleixner
Tested-by: Henrik Austad
---
 kernel/futex.c                  | 77 +
 kernel/locking/rtmutex.c        | 26 --
 kernel/locking/rtmutex_common.h |  1 -
 3 files changed, 62 insertions(+), 42 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index dce3250..1cc40dd 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2112,20 +2112,7 @@ queue_unlock(struct futex_hash_bucket *hb)
 	hb_waiters_dec(hb);
 }
 
-/**
- * queue_me() - Enqueue the futex_q on the futex_hash_bucket
- * @q:	The futex_q to enqueue
- * @hb:	The destination hash bucket
- *
- * The hb->lock must be held by the caller, and is released here. A call to
- * queue_me() is typically paired with exactly one call to unqueue_me().  The
- * exceptions involve the PI related operations, which may use unqueue_me_pi()
- * or nothing if the unqueue is done as part of the wake process and the unqueue
- * state is implicit in the state of woken task (see futex_wait_requeue_pi() for
- * an example).
- */
-static inline void queue_me(struct futex_q *q, struct futex_hash_bucket *hb)
-	__releases(&hb->lock)
+static inline void __queue_me(struct futex_q *q, struct futex_hash_bucket *hb)
 {
 	int prio;
 
@@ -2142,6 +2129,24 @@ static inline void queue_me(struct futex_q *q, struct futex_hash_bucket *hb)
 	plist_node_init(&q->list, prio);
 	plist_add(&q->list, &hb->chain);
 	q->task = current;
+}
+
+/**
+ * queue_me() - Enqueue the futex_q on the futex_hash_bucket
+ * @q:	The futex_q to enqueue
+ * @hb:	The destination hash bucket
+ *
+ * The hb->lock must be held by the caller, and is released here. A call to
+ * queue_me() is typically paired with exactly one call to unqueue_me().  The
+ * exceptions involve the PI related operations, which may use unqueue_me_pi()
+ * or nothing if the unqueue is done as part of the wake process and the unqueue
+ * state is implicit in the state of woken task (see futex_wait_requeue_pi() for
+ * an example).
+ */
+static inline void queue_me(struct futex_q *q, struct futex_hash_bucket *hb)
+	__releases(&hb->lock)
+{
+	__queue_me(q, hb);
 	spin_unlock(&hb->lock);
 }
 
@@ -2600,6 +2605,7 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags,
 {
 	struct hrtimer_sleeper timeout, *to = NULL;
 	struct futex_pi_state *pi_state = NULL;
+	struct rt_mutex_waiter rt_waiter;
 	struct futex_hash_bucket *hb;
 	struct futex_q q = futex_q_init;
 	int res, ret;
@@ -2652,25 +2658,52 @@ retry_private:
 		}
 	}
 
+	WARN_ON(!q.pi_state);
+
 	/*
 	 * Only actually queue now that the atomic ops are done:
 	 */
-	queue_me(&q, hb);
+	__q
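The locking rule from the commit message (every wait_list modification happens under both hb->lock and wait_lock) can be sketched in user space. This is only an illustration, assuming pthread mutexes as stand-ins for the two kernel locks; `hb_lock`, `wait_lock`, `nr_waiters` and the three helper functions are hypothetical names, not kernel API.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-ins for hb->lock and rt_mutex->wait_lock. */
static pthread_mutex_t hb_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t wait_lock = PTHREAD_MUTEX_INITIALIZER;

/* Models the length of the rt_mutex wait_list. */
static int nr_waiters;

/* "After" ordering on the lock side: list_add with both locks held. */
static void waiter_enqueue(void)
{
	pthread_mutex_lock(&hb_lock);
	pthread_mutex_lock(&wait_lock);
	nr_waiters++;
	pthread_mutex_unlock(&wait_lock);
	pthread_mutex_unlock(&hb_lock);
}

/* list_del follows the same nesting, so every writer holds both locks. */
static void waiter_dequeue(void)
{
	pthread_mutex_lock(&hb_lock);
	pthread_mutex_lock(&wait_lock);
	assert(nr_waiters > 0);
	nr_waiters--;
	pthread_mutex_unlock(&wait_lock);
	pthread_mutex_unlock(&hb_lock);
}

/*
 * Unlock side: because writers hold both locks, holding hb_lock alone
 * is enough to get a stable answer to "is anyone queued?".
 */
static int waiters_seen_under_hb_lock(void)
{
	int n;

	pthread_mutex_lock(&hb_lock);
	n = nr_waiters;
	pthread_mutex_unlock(&hb_lock);
	return n;
}
```

The point of the nesting is that a reader holding either lock sees a consistent list, which is what lets the unlock path serialize against the lock path instead of spinning on -EAGAIN with no lock held.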
[PATCH 17/17] sched/rtmutex/deadline: Fix a PI crash for deadline tasks
From: Xunlei Pang

commit e96a7705e7d3fef96aec9b590c63b2f6f7d2ba22 upstream.

A crash happened while I was playing with deadline PI rtmutex.

    BUG: unable to handle kernel NULL pointer dereference at 0018
    IP: [] rt_mutex_get_top_task+0x1f/0x30
    PGD 232a75067 PUD 230947067 PMD 0
    Oops: [#1] SMP
    CPU: 1 PID: 10994 Comm: a.out Not tainted

    Call Trace:
    [] enqueue_task+0x2c/0x80
    [] activate_task+0x23/0x30
    [] pull_dl_task+0x1d5/0x260
    [] pre_schedule_dl+0x16/0x20
    [] __schedule+0xd3/0x900
    [] schedule+0x29/0x70
    [] __rt_mutex_slowlock+0x4b/0xc0
    [] rt_mutex_slowlock+0xd1/0x190
    [] rt_mutex_timed_lock+0x53/0x60
    [] futex_lock_pi.isra.18+0x28c/0x390
    [] do_futex+0x190/0x5b0
    [] SyS_futex+0x80/0x180

This is because rt_mutex_enqueue_pi() and rt_mutex_dequeue_pi() are only
protected by pi_lock when operating on the pi waiters, while
rt_mutex_get_top_task() accesses them with the rq lock held but without
holding pi_lock.

To tackle this, we introduce a new "pi_top_task" pointer cached in
task_struct, and add a new rt_mutex_update_top_task() to update its value.
It can be called by rt_mutex_setprio(), which holds both the owner's
pi_lock and the rq lock. Thus "pi_top_task" can be safely accessed by
enqueue_task_dl() under the rq lock.
Originally-From: Peter Zijlstra
Signed-off-by: Xunlei Pang
Signed-off-by: Peter Zijlstra (Intel)
Acked-by: Steven Rostedt
Reviewed-by: Thomas Gleixner
Cc: juri.le...@arm.com
Cc: bige...@linutronix.de
Cc: mathieu.desnoy...@efficios.com
Cc: jdesfos...@efficios.com
Cc: bris...@redhat.com
Link: http://lkml.kernel.org/r/20170323150216.157682...@infradead.org
Signed-off-by: Thomas Gleixner

Conflicts:
	include/linux/sched.h
Tested-by: Henrik Austad
---
 include/linux/init_task.h |  1 +
 include/linux/sched.h     |  2 ++
 include/linux/sched/rt.h  |  1 +
 kernel/fork.c             |  1 +
 kernel/locking/rtmutex.c  | 29 +
 kernel/sched/core.c       |  2 ++
 6 files changed, 28 insertions(+), 8 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 1c1ff7e..a561ce0c 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -162,6 +162,7 @@ extern struct task_group root_task_group;
 #ifdef CONFIG_RT_MUTEXES
 # define INIT_RT_MUTEXES(tsk)						\
 	.pi_waiters = RB_ROOT,						\
+	.pi_top_task = NULL,						\
 	.pi_waiters_leftmost = NULL,
 #else
 # define INIT_RT_MUTEXES(tsk)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b30540d..89cd0d0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1617,6 +1617,8 @@ struct task_struct {
 	/* PI waiters blocked on a rt_mutex held by this task */
 	struct rb_root pi_waiters;
 	struct rb_node *pi_waiters_leftmost;
+	/* Updated under owner's pi_lock and rq lock */
+	struct task_struct *pi_top_task;
 	/* Deadlock detection and priority inheritance handling */
 	struct rt_mutex_waiter *pi_blocked_on;
 #endif
diff --git a/include/linux/sched/rt.h b/include/linux/sched/rt.h
index a30b172..60d0c47 100644
--- a/include/linux/sched/rt.h
+++ b/include/linux/sched/rt.h
@@ -19,6 +19,7 @@ static inline int rt_task(struct task_struct *p)
 extern int rt_mutex_getprio(struct task_struct *p);
 extern void rt_mutex_setprio(struct task_struct *p, int prio);
 extern int rt_mutex_get_effective_prio(struct task_struct *task, int newprio);
+extern void rt_mutex_update_top_task(struct task_struct *p);
 extern struct task_struct *rt_mutex_get_top_task(struct task_struct *task);
 extern void rt_mutex_adjust_pi(struct task_struct *p);
 static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
diff --git a/kernel/fork.c b/kernel/fork.c
index dd2f79a..9376270 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1242,6 +1242,7 @@ static void rt_mutex_init_task(struct task_struct *p)
 #ifdef CONFIG_RT_MUTEXES
 	p->pi_waiters = RB_ROOT;
 	p->pi_waiters_leftmost = NULL;
+	p->pi_top_task = NULL;
 	p->pi_blocked_on = NULL;
 #endif
 }
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index c01d7f4..dd3b1e9 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -321,6 +321,19 @@ rt_mutex_dequeue_pi(struct task_struct *task, struct rt_mutex_waiter *waiter)
 }
 
 /*
+ * Must hold both p->pi_lock and task_rq(p)->lock.
+ */
+void rt_mutex_update_top_task(struct task_struct *p)
+{
+	if (!task_has_pi_waiters(p)) {
+		p->pi_top_task = NULL;
+		return;
+	}
+
+	p->pi_top_task = task_top_pi_waiter(p)->task;
+}
+
+/*
  * Calculate task priority from the waiter tree priority
  *
  * Return task->normal_prio when the waiter tree is empty or when
@@ -335,12 +348,12 @@ int rt_mutex_getprio(struct task_struct *task)
 			   task->normal_prio);
 }
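The caching idea described in the commit message can be modelled in user space. This is a rough sketch under stated assumptions, not the kernel code: two pthread mutexes stand in for pi_lock and the rq lock, a single pointer replaces the waiter rbtree, and `struct toy_task` with its `toy_*` helpers are hypothetical names invented for the illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names are kernel API. */
struct toy_task {
	pthread_mutex_t pi_lock;	/* protects top_waiter */
	pthread_mutex_t rq_lock;	/* stand-in for task_rq(p)->lock */
	struct toy_task *top_waiter;	/* written under pi_lock only */
	struct toy_task *pi_top_task;	/* cached copy, written under both */
	int prio;			/* lower value means higher priority */
};

static void toy_task_init(struct toy_task *t, int prio)
{
	pthread_mutex_init(&t->pi_lock, NULL);
	pthread_mutex_init(&t->rq_lock, NULL);
	t->top_waiter = NULL;
	t->pi_top_task = NULL;
	t->prio = prio;
}

/* Like rt_mutex_enqueue_pi(): only pi_lock is held. */
static void toy_enqueue_pi(struct toy_task *owner, struct toy_task *waiter)
{
	assert(owner != waiter);

	pthread_mutex_lock(&owner->pi_lock);
	if (!owner->top_waiter || waiter->prio < owner->top_waiter->prio)
		owner->top_waiter = waiter;
	pthread_mutex_unlock(&owner->pi_lock);
}

/* Like the rt_mutex_setprio() path: refresh the cache with both locks held. */
static void toy_update_top_task(struct toy_task *owner)
{
	pthread_mutex_lock(&owner->pi_lock);
	pthread_mutex_lock(&owner->rq_lock);
	owner->pi_top_task = owner->top_waiter;
	pthread_mutex_unlock(&owner->rq_lock);
	pthread_mutex_unlock(&owner->pi_lock);
}

/* Like enqueue_task_dl(): reads only the cache, under rq_lock alone. */
static struct toy_task *toy_get_top_task(struct toy_task *owner)
{
	struct toy_task *top;

	pthread_mutex_lock(&owner->rq_lock);
	top = owner->pi_top_task;
	pthread_mutex_unlock(&owner->rq_lock);
	return top;
}
```

A reader that holds only `rq_lock` never touches `top_waiter`, so a concurrent enqueue under `pi_lock` cannot hand it a half-updated structure. The cost is that the cache lags behind until the next `toy_update_top_task()` call.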
[PATCH 01/17] futex: Cleanup variable names for futex_top_waiter()
From: Peter Zijlstra

commit 499f5aca2cdd5e958b27e2655e7e7f82524f46b1 upstream.

futex_top_waiter() returns the top-waiter on the pi_mutex. Assigning
this to a variable 'match' totally obscures the code.

Signed-off-by: Peter Zijlstra (Intel)
Cc: juri.le...@arm.com
Cc: bige...@linutronix.de
Cc: xlp...@redhat.com
Cc: rost...@goodmis.org
Cc: mathieu.desnoy...@efficios.com
Cc: jdesfos...@efficios.com
Cc: dvh...@infradead.org
Cc: bris...@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.554710...@infradead.org
Signed-off-by: Thomas Gleixner
Tested-by: Henrik Austad
---
 kernel/futex.c | 30 +++---
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index a26d217..bb87324 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1116,14 +1116,14 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key,
 static int lookup_pi_state(u32 uval, struct futex_hash_bucket *hb,
 			   union futex_key *key, struct futex_pi_state **ps)
 {
-	struct futex_q *match = futex_top_waiter(hb, key);
+	struct futex_q *top_waiter = futex_top_waiter(hb, key);
 
 	/*
 	 * If there is a waiter on that futex, validate it and
 	 * attach to the pi_state when the validation succeeds.
 	 */
-	if (match)
-		return attach_to_pi_state(uval, match->pi_state, ps);
+	if (top_waiter)
+		return attach_to_pi_state(uval, top_waiter->pi_state, ps);
 
 	/*
 	 * We are the first waiter - try to look up the owner based on
@@ -1170,7 +1170,7 @@ static int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
 				struct task_struct *task, int set_waiters)
 {
 	u32 uval, newval, vpid = task_pid_vnr(task);
-	struct futex_q *match;
+	struct futex_q *top_waiter;
 	int ret;
 
 	/*
@@ -1196,9 +1196,9 @@ static int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
 	 * Lookup existing state first. If it exists, try to attach to
 	 * its pi_state.
 	 */
-	match = futex_top_waiter(hb, key);
-	if (match)
-		return attach_to_pi_state(uval, match->pi_state, ps);
+	top_waiter = futex_top_waiter(hb, key);
+	if (top_waiter)
+		return attach_to_pi_state(uval, top_waiter->pi_state, ps);
 
 	/*
 	 * No waiter and user TID is 0. We are here because the
@@ -1288,11 +1288,11 @@ static void mark_wake_futex(struct wake_q_head *wake_q, struct futex_q *q)
 	q->lock_ptr = NULL;
 }
 
-static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this,
+static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter,
 			 struct futex_hash_bucket *hb)
 {
 	struct task_struct *new_owner;
-	struct futex_pi_state *pi_state = this->pi_state;
+	struct futex_pi_state *pi_state = top_waiter->pi_state;
 	u32 uninitialized_var(curval), newval;
 	WAKE_Q(wake_q);
 	bool deboost;
@@ -1313,11 +1313,11 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this,
 
 	/*
 	 * It is possible that the next waiter (the one that brought
-	 * this owner to the kernel) timed out and is no longer
+	 * top_waiter owner to the kernel) timed out and is no longer
 	 * waiting on the lock.
 	 */
 	if (!new_owner)
-		new_owner = this->task;
+		new_owner = top_waiter->task;
 
 	/*
 	 * We pass it to the next owner. The WAITERS bit is always
@@ -2639,7 +2639,7 @@ static int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
 	u32 uninitialized_var(curval), uval, vpid = task_pid_vnr(current);
 	union futex_key key = FUTEX_KEY_INIT;
 	struct futex_hash_bucket *hb;
-	struct futex_q *match;
+	struct futex_q *top_waiter;
 	int ret;
 
 retry:
@@ -2663,9 +2663,9 @@ retry:
 	 * all and we at least want to know if user space fiddled
 	 * with the futex value instead of blindly unlocking.
 	 */
-	match = futex_top_waiter(hb, &key);
-	if (match) {
-		ret = wake_futex_pi(uaddr, uval, match, hb);
+	top_waiter = futex_top_waiter(hb, &key);
+	if (top_waiter) {
+		ret = wake_futex_pi(uaddr, uval, top_waiter, hb);
 		/*
 		 * In case of success wake_futex_pi dropped the hash
 		 * bucket lock.
-- 
2.7.4
[PATCH 03/17] futex: Remove rt_mutex_deadlock_account_*()
From: Peter Zijlstra

commit fffa954fb528963c2fb7b0c0084eb77e2be7ab52 upstream

These are unused and clutter up the code.

Signed-off-by: Peter Zijlstra (Intel)
Cc: juri.le...@arm.com
Cc: bige...@linutronix.de
Cc: xlp...@redhat.com
Cc: rost...@goodmis.org
Cc: mathieu.desnoy...@efficios.com
Cc: jdesfos...@efficios.com
Cc: dvh...@infradead.org
Cc: bris...@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.652692...@infradead.org
Signed-off-by: Thomas Gleixner

Conflicts:
	kernel/locking/rtmutex.c (WAKE_Q)
Tested-by: Henrik Austad
---
 kernel/locking/rtmutex-debug.c |  9
 kernel/locking/rtmutex-debug.h |  3 ---
 kernel/locking/rtmutex.c       | 47 --
 kernel/locking/rtmutex.h       |  2 --
 4 files changed, 18 insertions(+), 43 deletions(-)

diff --git a/kernel/locking/rtmutex-debug.c b/kernel/locking/rtmutex-debug.c
index 62b6cee..0613c4b 100644
--- a/kernel/locking/rtmutex-debug.c
+++ b/kernel/locking/rtmutex-debug.c
@@ -173,12 +173,3 @@ void debug_rt_mutex_init(struct rt_mutex *lock, const char *name)
 	lock->name = name;
 }
 
-void
-rt_mutex_deadlock_account_lock(struct rt_mutex *lock, struct task_struct *task)
-{
-}
-
-void rt_mutex_deadlock_account_unlock(struct task_struct *task)
-{
-}
-
diff --git a/kernel/locking/rtmutex-debug.h b/kernel/locking/rtmutex-debug.h
index d0519c3..b585af9 100644
--- a/kernel/locking/rtmutex-debug.h
+++ b/kernel/locking/rtmutex-debug.h
@@ -9,9 +9,6 @@
  * This file contains macros used solely by rtmutex.c. Debug version.
  */
 
-extern void
-rt_mutex_deadlock_account_lock(struct rt_mutex *lock, struct task_struct *task);
-extern void rt_mutex_deadlock_account_unlock(struct task_struct *task);
 extern void debug_rt_mutex_init_waiter(struct rt_mutex_waiter *waiter);
 extern void debug_rt_mutex_free_waiter(struct rt_mutex_waiter *waiter);
 extern void debug_rt_mutex_init(struct rt_mutex *lock, const char *name);
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index b066724d..6cf9dab7 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -937,8 +937,6 @@ takeit:
 	 */
 	rt_mutex_set_owner(lock, task);
 
-	rt_mutex_deadlock_account_lock(lock, task);
-
 	return 1;
 }
 
@@ -1331,8 +1329,6 @@ static bool __sched rt_mutex_slowunlock(struct rt_mutex *lock,
 
 	debug_rt_mutex_unlock(lock);
 
-	rt_mutex_deadlock_account_unlock(current);
-
 	/*
 	 * We must be careful here if the fast path is enabled. If we
 	 * have no waiters queued we cannot set owner to NULL here
@@ -1398,11 +1394,10 @@ rt_mutex_fastlock(struct rt_mutex *lock, int state,
 				struct hrtimer_sleeper *timeout,
 				enum rtmutex_chainwalk chwalk))
 {
-	if (likely(rt_mutex_cmpxchg_acquire(lock, NULL, current))) {
-		rt_mutex_deadlock_account_lock(lock, current);
+	if (likely(rt_mutex_cmpxchg_acquire(lock, NULL, current)))
 		return 0;
-	} else
-		return slowfn(lock, state, NULL, RT_MUTEX_MIN_CHAINWALK);
+
+	return slowfn(lock, state, NULL, RT_MUTEX_MIN_CHAINWALK);
 }
 
 static inline int
@@ -1414,21 +1409,19 @@ rt_mutex_timed_fastlock(struct rt_mutex *lock, int state,
 				      enum rtmutex_chainwalk chwalk))
 {
 	if (chwalk == RT_MUTEX_MIN_CHAINWALK &&
-	    likely(rt_mutex_cmpxchg_acquire(lock, NULL, current))) {
-		rt_mutex_deadlock_account_lock(lock, current);
+	    likely(rt_mutex_cmpxchg_acquire(lock, NULL, current)))
 		return 0;
-	} else
-		return slowfn(lock, state, timeout, chwalk);
+
+	return slowfn(lock, state, timeout, chwalk);
 }
 
 static inline int
 rt_mutex_fasttrylock(struct rt_mutex *lock,
 		     int (*slowfn)(struct rt_mutex *lock))
 {
-	if (likely(rt_mutex_cmpxchg_acquire(lock, NULL, current))) {
-		rt_mutex_deadlock_account_lock(lock, current);
+	if (likely(rt_mutex_cmpxchg_acquire(lock, NULL, current)))
 		return 1;
-	}
+
 	return slowfn(lock);
 }
 
@@ -1438,19 +1431,18 @@ rt_mutex_fastunlock(struct rt_mutex *lock,
 				   struct wake_q_head *wqh))
 {
 	WAKE_Q(wake_q);
+	bool deboost;
 
-	if (likely(rt_mutex_cmpxchg_release(lock, current, NULL))) {
-		rt_mutex_deadlock_account_unlock(current);
+	if (likely(rt_mutex_cmpxchg_release(lock, current, NULL)))
+		return;
 
-	} else {
-		bool deboost = slowfn(lock, &wake_q);
+	deboost = slowfn(lock, &wake_q);
 
-		wake_up_q(&wake_q);
+	wake_up_q(&wake_q);
 
-		/* Undo pi boosting if necessary: */
-		if (deboost)
-			rt_mutex_adjust_prio(current);
-	}
+	/* Undo pi boosting if necessary: */
+	if (deboost)
+		rt_mutex_adjust_p
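With the accounting hooks gone, the fast paths in the rtmutex.c hunks reduce to one shape: try a compare-and-exchange on the owner field, return on success, otherwise fall through to the slow function. A minimal user-space sketch of that shape, assuming C11 atomics in place of rt_mutex_cmpxchg_acquire()/rt_mutex_cmpxchg_release(); `struct toy_lock`, `toy_fastlock()`, `toy_fastunlock()` and `toy_slowlock()` are hypothetical names, and the slow path here only counts calls instead of blocking.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical lock whose only state is the current owner (NULL = free). */
struct toy_lock {
	_Atomic(void *) owner;
};

/* Counts slow-path entries; a real slow path would queue and block. */
static int slow_calls;

static int toy_slowlock(struct toy_lock *lock, void *self)
{
	(void)lock;
	(void)self;
	slow_calls++;
	return -1;	/* report contention instead of sleeping */
}

/* Fast path: one cmpxchg with acquire ordering, else the slow function. */
static int toy_fastlock(struct toy_lock *lock, void *self,
			int (*slowfn)(struct toy_lock *lock, void *self))
{
	void *expected = NULL;

	if (atomic_compare_exchange_strong_explicit(&lock->owner, &expected,
						    self,
						    memory_order_acquire,
						    memory_order_relaxed))
		return 0;

	return slowfn(lock, self);
}

/* Unlock fast path: release ordering, only succeeds for the owner. */
static void toy_fastunlock(struct toy_lock *lock, void *self)
{
	void *expected = self;

	atomic_compare_exchange_strong_explicit(&lock->owner, &expected, NULL,
						memory_order_release,
						memory_order_relaxed);
}
```

Returning early from the successful cmpxchg, as the patch does, keeps the uncontended case to a single atomic operation and leaves the slow function as the only place that needs bookkeeping.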
Re: [PATCH] backport: sched/rtmutex/deadline: Fix a PI crash for deadline tasks
On Tue, Nov 06, 2018 at 02:22:10PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 06, 2018 at 01:47:21PM +0100, Henrik Austad wrote:
> > From: Xunlei Pang
> >
> > On some of our systems, we notice this error popping up on occasion,
> > completely hanging the system.
> >
> >    [] enqueue_task_dl+0x1f0/0x420
> >    [] activate_task+0x7c/0x90
> >    [] push_dl_task+0x164/0x1c8
> >    [] push_dl_tasks+0x20/0x30
> >    [] __balance_callback+0x44/0x68
> >    [] __schedule+0x6f0/0x728
> >    [] schedule+0x78/0x98
> >    [] __rt_mutex_slowlock+0x9c/0x108
> >    [] rt_mutex_slowlock+0xd8/0x198
> >    [] rt_mutex_timed_futex_lock+0x30/0x40
> >    [] futex_lock_pi+0x200/0x3b0
> >    [] do_futex+0x1c4/0x550
> >
> > It runs a 4.4 kernel on an arm64 rig. The signature looks suspiciously
> > similar to what Xunlei Pang observed in his crash, and with this fix, my
> > issue goes away (my system has survived approx 1500 reboots and a few
> > nasty tests so far)
> >
> > Alongside this patch in the tree, there are a few other bits and pieces
> > pertaining to futex, rtmutex and kernel/sched/, but those patches create
> > weird crashes that I have not been able to dissect yet. Once (if) I have
> > been able to figure those out (and test), they will be sent later.
> >
> > I am sure other users of LTS that also use sched_deadline will run into
> > this issue, so I think it is a good candidate for 4.4-stable. Possibly
> > also to 4.9 and 4.14, but I have not had time to test for those versions.
>
> But this patch relies on:
>
>   2a1c60299406 ("rtmutex: Deboost before waking up the top waiter")

Yes, I have that one in my other queue (that crashes)

> for pointer stability, but that patch in turn relies on the whole
> FUTEX_UNLOCK_PI patch set:
>
>  $ git log --oneline 499f5aca2cdd5e958b27e2655e7e7f82524f46b1..56222b212e8edb1cf51f5dd73ff645809b082b40
>
>  56222b212e8e futex: Drop hb->lock before enqueueing on the rtmutex
>  bebe5b514345 futex: Futex_unlock_pi() determinism
>  cfafcd117da0 futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock()
>  38d589f2fd08 futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock()
>  50809358dd71 futex,rt_mutex: Introduce rt_mutex_init_waiter()
>  16ffa12d7425 futex: Pull rt_mutex_futex_unlock() out from under hb->lock
>  73d786bd043e futex: Rework inconsistent rt_mutex/futex_q state
>  bf92cf3a5100 futex: Cleanup refcounting
>  734009e96d19 futex: Change locking rules
>  5293c2efda37 futex,rt_mutex: Provide futex specific rt_mutex API
>  fffa954fb528 futex: Remove rt_mutex_deadlock_account_*()
>  1b367ece0d7e futex: Use smp_store_release() in mark_wake_futex()
>
> and possibly some follow-up fixes on that (I have vague memories of
> that).

ok, so this looks a bit like the queue I have, thanks!

> As is, just the one patch you propose isn't correct :/
>
> Yes, that was a ginormous amount of work to fix a seemingly simple splat
> :-(

Yep, well, on the positive side, I now know that I have to figure out the
crashes, which is useful knowledge! Thanks!

I'll hammer away at the full series of backports for this then and resend
once I've hammered out the issues.

Thanks for the feedback, much appreciated!

-- 
Henrik Austad
CVTG Eng - Endpoints
Cisco Systems Norway
[PATCH] backport: sched/rtmutex/deadline: Fix a PI crash for deadline tasks
From: Xunlei Pang

On some of our systems, we notice this error popping up on occasion,
completely hanging the system.

    [] enqueue_task_dl+0x1f0/0x420
    [] activate_task+0x7c/0x90
    [] push_dl_task+0x164/0x1c8
    [] push_dl_tasks+0x20/0x30
    [] __balance_callback+0x44/0x68
    [] __schedule+0x6f0/0x728
    [] schedule+0x78/0x98
    [] __rt_mutex_slowlock+0x9c/0x108
    [] rt_mutex_slowlock+0xd8/0x198
    [] rt_mutex_timed_futex_lock+0x30/0x40
    [] futex_lock_pi+0x200/0x3b0
    [] do_futex+0x1c4/0x550

It runs a 4.4 kernel on an arm64 rig. The signature looks suspiciously
similar to what Xunlei Pang observed in his crash, and with this fix, my
issue goes away (my system has survived approx 1500 reboots and a few
nasty tests so far).

Alongside this patch in the tree, there are a few other bits and pieces
pertaining to futex, rtmutex and kernel/sched/, but those patches create
weird crashes that I have not been able to dissect yet. Once (if) I have
been able to figure those out (and test), they will be sent later.

I am sure other users of LTS that also use sched_deadline will run into
this issue, so I think it is a good candidate for 4.4-stable. Possibly also
to 4.9 and 4.14, but I have not had time to test for those versions.

Apart from a minor conflict in sched.h, the patch applied cleanly.

(Tested on arm64 running 4.4.)

-Henrik

A crash happened while I was playing with deadline PI rtmutex.

    BUG: unable to handle kernel NULL pointer dereference at 0018
    IP: [] rt_mutex_get_top_task+0x1f/0x30
    PGD 232a75067 PUD 230947067 PMD 0
    Oops: [#1] SMP
    CPU: 1 PID: 10994 Comm: a.out Not tainted

    Call Trace:
    [] enqueue_task+0x2c/0x80
    [] activate_task+0x23/0x30
    [] pull_dl_task+0x1d5/0x260
    [] pre_schedule_dl+0x16/0x20
    [] __schedule+0xd3/0x900
    [] schedule+0x29/0x70
    [] __rt_mutex_slowlock+0x4b/0xc0
    [] rt_mutex_slowlock+0xd1/0x190
    [] rt_mutex_timed_lock+0x53/0x60
    [] futex_lock_pi.isra.18+0x28c/0x390
    [] do_futex+0x190/0x5b0
    [] SyS_futex+0x80/0x180

This is because rt_mutex_enqueue_pi() and rt_mutex_dequeue_pi() are only
protected by pi_lock when operating on the pi waiters, while
rt_mutex_get_top_task() accesses them with the rq lock held but without
holding pi_lock.

To tackle this, we introduce a new "pi_top_task" pointer cached in
task_struct, and add a new rt_mutex_update_top_task() to update its value.
It can be called by rt_mutex_setprio(), which holds both the owner's
pi_lock and the rq lock. Thus "pi_top_task" can be safely accessed by
enqueue_task_dl() under the rq lock.
Originally-From: Peter Zijlstra Signed-off-by: Xunlei Pang Signed-off-by: Peter Zijlstra (Intel) Acked-by: Steven Rostedt Reviewed-by: Thomas Gleixner Cc: juri.le...@arm.com Cc: bige...@linutronix.de Cc: mathieu.desnoy...@efficios.com Cc: jdesfos...@efficios.com Cc: bris...@redhat.com Link: http://lkml.kernel.org/r/20170323150216.157682...@infradead.org Signed-off-by: Thomas Gleixner (cherry picked from commit e96a7705e7d3fef96aec9b590c63b2f6f7d2ba22) Conflicts: include/linux/sched.h Backported-and-tested-by: Henrik Austad Cc: Greg Kroah-Hartman --- include/linux/init_task.h | 1 + include/linux/sched.h | 2 ++ include/linux/sched/rt.h | 1 + kernel/fork.c | 1 + kernel/locking/rtmutex.c | 29 + kernel/sched/core.c | 2 ++ 6 files changed, 28 insertions(+), 8 deletions(-) diff --git a/include/linux/init_task.h b/include/linux/init_task.h index 1c1ff7e4faa4..a561ce0c5d7f 100644 --- a/include/linux/init_task.h +++ b/include/linux/init_task.h @@ -162,6 +162,7 @@ extern struct task_group root_task_group; #ifdef CONFIG_RT_MUTEXES # define INIT_RT_MUTEXES(tsk) \ .pi_waiters = RB_ROOT, \ + .pi_top_task = NULL,\ .pi_waiters_leftmost = NULL, #else # define INIT_RT_MUTEXES(tsk) diff --git a/include/linux/sched.h b/include/linux/sched.h index a464ba71a993..19a3f946caf0 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1628,6 +1628,8 @@ struct task_struct { /* PI waiters blocked on a rt_mutex held by this task */ struct rb_root pi_waiters; struct rb_node *pi_waiters_leftmost; + /* Updated under owner's pi_lock and rq lock */ + struct task_struct *pi_top_task; /* Deadlock detection and priority inheritance handling */ struct rt_mutex_waiter *pi_blocked_on; #endif diff --git a/include/linux/sched/rt.h b/include/linux/sched/rt.h index a30b172df6e1..60d0c4740b9f 100644 --- a/include/linux/sched/rt.h +++ b/include/linux/sched/rt.h @@ -19,6 +19,7 @@ static inline int rt_task(struct task_struct *p) extern int rt_mutex_getprio(struct task_struct *p); extern void 
rt_mutex_setprio(struct task_struct *p, int prio); extern int rt_mutex_get_effective_prio(struct ta
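The locking rule from the commit message above can be sketched in user space. This is only an illustration under stated assumptions, not the kernel code: struct task, update_top_task() and get_top_task() are hypothetical names, pthread mutexes stand in for pi_lock and the rq lock, and the waiter tree is reduced to a single pointer. The writer updates the cached pointer while holding both locks, so a reader holding only the rq lock never has to look at the pi_lock-protected waiter data.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for task_struct; only the fields the sketch needs. */
struct task {
	pthread_mutex_t pi_lock;  /* models task->pi_lock */
	struct task *top_waiter;  /* models the leftmost pi waiter, pi_lock only */
	struct task *pi_top_task; /* cached copy, written under pi_lock + rq_lock */
	int prio;
};

/* Models the runqueue lock (a single runqueue is assumed). */
static pthread_mutex_t rq_lock = PTHREAD_MUTEX_INITIALIZER;

/* Writer side: takes pi_lock first, then the rq lock, and updates both views. */
static void update_top_task(struct task *owner, struct task *new_top)
{
	pthread_mutex_lock(&owner->pi_lock);
	owner->top_waiter = new_top;

	pthread_mutex_lock(&rq_lock);
	owner->pi_top_task = new_top;
	pthread_mutex_unlock(&rq_lock);

	pthread_mutex_unlock(&owner->pi_lock);
}

/* Reader side: holds only the rq lock and touches only the cached pointer. */
static struct task *get_top_task(struct task *owner)
{
	struct task *top;

	pthread_mutex_lock(&rq_lock);
	top = owner->pi_top_task;
	pthread_mutex_unlock(&rq_lock);

	return top;
}
```

The point of the cached pointer is that the reader path (the enqueue side in the trace above) does not need pi_lock at all.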
Re: [RFD/RFC PATCH 0/8] Towards implementing proxy execution
On Tue, Oct 09, 2018 at 11:24:26AM +0200, Juri Lelli wrote:
> Hi all,

Hi, nice series, I have a lot of details to grok, but I like the idea of PE.

> Proxy Execution (also goes under several other names) isn't a new
> concept, it has been mentioned already in the past to this community
> (both in email discussions and at conferences [1, 2]), but no actual
> implementation that applies to a fairly recent kernel exists as of today
> (of which I'm aware of at least - happy to be proven wrong).
>
> Very broadly speaking, more info below, proxy execution enables a task
> to run using the context of some other task that is "willing" to
> participate in the mechanism, as this helps both tasks to improve
> performance (w.r.t. the latter task not participating to proxy
> execution).

From what I remember, PEP was originally proposed for a global EDF, and as
far as my head has been able to read this series, this implementation is
planned for not only deadline, but eventually also for
sched_(rr|fifo|other) - is that correct?

I have a bit of concern when it comes to affinities and where the lock
owner will actually execute while in the context of the proxy, especially
when you run into the situation where you have disjoint CPU affinities for
_rr tasks to ensure the deadlines.

I believe there were some papers circulated last year that looked at
something similar to this when you had overlapping or completely disjoint
CPUsets, which I think would be nice to drag into the discussion. Has this
been considered? (if so, sorry for adding line-noise!)

Let me know if my attempt at translating brainlanguage into semi-coherent
English failed and I'll do another attempt.

> This RFD/proof of concept aims at starting a discussion about how we can
> get proxy execution in mainline. But, first things first, why do we even
> care about it?
>
> I'm pretty confident with saying that the line of development that is
> mainly interested in this at the moment is the one that might benefit
> in allowing non privileged processes to use deadline scheduling [3].
> The main missing bit before we can safely relax the root privileges
> constraint is a proper priority inheritance mechanism, which translates
> to bandwidth inheritance [4, 5] for deadline scheduling, or to some sort
> of interpretation of the concept of running a task holding a (rt_)mutex
> within the bandwidth allotment of some other task that is blocked on the
> same (rt_)mutex.
>
> The concept itself is pretty general however, and it is not hard to
> foresee possible applications in other scenarios (say for example nice
> values/shares across co-operating CFS tasks or clamping values [6]).
> But I'm already digressing, so let's get back to the code that comes
> with this cover letter.
>
> One can define the scheduling context of a task as all the information
> in task_struct that the scheduler needs to implement a policy and the
> execution context as all the state required to actually "run" the task.
> An example of scheduling context might be the information contained in
> task_struct se, rt and dl fields; affinity pertains instead to execution
> context (and I guess deciding what pertains to what is actually up for
> discussion as well ;-). Patch 04/08 implements such distinction.

I really like the idea of splitting scheduling ctx and execution context!

> As implemented in this set, a link between scheduling contexts of
> different tasks might be established when a task blocks on a mutex held
> by some other task (blocked_on relation). In this case the former task
> starts to be considered a potential proxy for the latter (mutex owner).
> One key change in how mutexes work made in here is that waiters don't
> really sleep: they are not dequeued, so they can be picked up by the
> scheduler when it runs. If a waiter (potential proxy) task is selected
> by the scheduler, the blocked_on relation is used to find the mutex
> owner and put that to run on the CPU, using the proxy task scheduling
> context.
>
>    Follow the blocked-on relation:
>
>                   ,-> task           <- proxy, picked by scheduler
>                   |     | blocked-on
>                   |     v
>      blocked-task |   mutex
>                   |     | owner
>                   |     v
>                   `-- task           <- gets to run using proxy info
>
> Now, the situation is (of course) more tricky than depicted so far
> because we have to deal with all sort of possible states the mutex
> owner might be in while a potential proxy is selected by the scheduler,
> e.g. owner might be sleeping, running on a different CPU, blocked on
> another mutex itself... so, I'd kindly refer people to have a look at
> 05/08 proxy() implementation and comments.

My head hurts already.. :)

> Peter kindly shared his WIP patches with us (me, Luca, Tommaso, Claudio,
> Daniel, the Pisa gang) a while ago, but I could seriously have a decent
> look at them only recently (thanks a lot to the
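The "follow the blocked-on relation" step quoted above can be sketched as a small chain walk. This is a user-space illustration only, not the proxy() code from patch 05/08: struct task, struct mutex, find_exec_ctx() and the max_depth guard are hypothetical, and the hard cases the cover letter lists (owner sleeping, owner running on another CPU) are left out on purpose.

```c
#include <stddef.h>

struct mutex;

/* Hypothetical, stripped-down task: only the blocked_on relation is modelled. */
struct task {
	const char *name;
	struct mutex *blocked_on; /* NULL when the task is not blocked */
};

struct mutex {
	struct task *owner; /* NULL when the mutex is free */
};

/*
 * Starting from the task the scheduler picked (the proxy), follow
 * blocked_on -> owner until a task that is not blocked is found; that
 * task would run using the proxy's scheduling context.
 * Returns NULL for an ownerless mutex or when the chain is longer than
 * max_depth hops (a simple guard against cycles).
 */
static struct task *find_exec_ctx(struct task *picked, int max_depth)
{
	struct task *t = picked;

	while (t && t->blocked_on) {
		if (max_depth-- <= 0)
			return NULL;
		t = t->blocked_on->owner;
	}

	return t;
}
```

With A blocked on a mutex owned by B, and B blocked on a mutex owned by C, picking A resolves to C.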
[PATCH] net: export netdev_txq_to_tc to allow sch_mqprio to compile as module
In commit 32302902ff09 ("mqprio: Reserve last 32 classid values for HW
traffic classes and misc IDs") sch_mqprio started using netdev_txq_to_tc
to find the correct tc instead of dev->tc_to_txq[].

However, when mqprio is compiled as a module, it cannot resolve the
symbol, leading to this error:

     ERROR: "netdev_txq_to_tc" [net/sched/sch_mqprio.ko] undefined!

This adds an EXPORT_SYMBOL() since the other user in the kernel
(netif_set_xps_queue) is also EXPORT_SYMBOL() (and not _GPL) or in a
sysfs-callback.

Cc: Alexander Duyck <alexander.h.du...@intel.com>
Cc: Jesus Sanchez-Palencia <jesus.sanchez-palen...@intel.com>
Cc: David S. Miller <da...@davemloft.net>
Signed-off-by: Henrik Austad <haus...@cisco.com>
---
 net/core/dev.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index fcddccb..d2b20e7 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2040,6 +2040,7 @@ int netdev_txq_to_tc(struct net_device *dev, unsigned int txq)
 
 	return 0;
 }
+EXPORT_SYMBOL(netdev_txq_to_tc);
 
 #ifdef CONFIG_XPS
 static DEFINE_MUTEX(xps_map_mutex);
-- 
2.7.4
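For readers wondering what a txq-to-tc lookup over dev->tc_to_txq[] looks like, here is a simplified user-space model. It is a sketch under assumptions and not the function from net/core/dev.c: struct fake_netdev, struct tc_txq, txq_to_tc() and MAX_TC are made-up names, and each traffic class is assumed to own one contiguous range of queues described by an offset and a count.

```c
#include <stdint.h>

#define MAX_TC 16 /* assumed upper bound on traffic classes for this sketch */

/* Hypothetical per-class queue range: queues [offset, offset + count). */
struct tc_txq {
	uint16_t count;
	uint16_t offset;
};

/* Hypothetical stand-in for the relevant part of struct net_device. */
struct fake_netdev {
	int num_tc;
	struct tc_txq tc_to_txq[MAX_TC];
};

/*
 * Map a tx queue index to its traffic class.
 * Returns 0 when no traffic classes are configured, the class index when
 * the queue falls inside a class range, and -1 when it matches none.
 */
static int txq_to_tc(const struct fake_netdev *dev, unsigned int txq)
{
	int i;

	if (!dev->num_tc)
		return 0;

	for (i = 0; i < dev->num_tc && i < MAX_TC; i++) {
		const struct tc_txq *tc = &dev->tc_to_txq[i];

		if (txq >= tc->offset && txq - tc->offset < tc->count)
			return i;
	}

	return -1;
}
```

A qdisc built as a module can only call such a helper in the core if the symbol is exported, which is what the one-line patch above does.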
Re: [TSN RFC v2 0/9] TSN driver for the kernel
ations must be aware of. Could be that we are talking about the same thing,
just from different perspectives.

> * Kernel Space
>
> 1. Providing frames with a future transmit time. For normal sockets,
>    this can be in the CMSG data. For mmap'ed buffers, we will need a
>    new format. (I think Arnd is working on a new layout.)

I need to revisit that discussion again I think.

> 2. Time based qdisc for transmitted frames. For MACs that support
>    this (like the i210), we only have to place the frame into the
>    correct queue. For normal HW, we want to be able to reserve a time
>    window in which non-TSN frames are blocked. This is some work, but
>    in the end it should be a generic solution that not only works
>    "perfectly" with TSN HW but also provides best effort service using
>    any NIC.

Yes, indeed, that would be one good solution, and quite a lot of work.

> 3. ALSA support for tunable AD/DA clocks. The rate of the Listener's
>    DA clock must match that of the Talker and the other Listeners.

To nitpick a bit, all AD/DAs should match that of the gPTP grandmaster
(which in most settings would be the Talker). But yes, you need to adjust
the AD/DA. SRC is slow and painful, best to avoid.

>    Either you adjust it in HW using a VCO or similar, or you do
>    adaptive sample rate conversion in the application. (And that is
>    another reason for *not* having a shared kernel buffer.) For the
>    Talker, either you adjust the AD clock to match the PTP time, or
>    you measure the frequency offset.

Yes, some hook into adjusting the clock is needed. I wonder if this is
possible via V4L2, or if the monitor-world is a completely different beast.

> 4. ALSA support for time triggered playback. The patch series
>    completely ignores the critical issue of media clock recovery. The
>    Listener must buffer the stream in order to play it exactly at a
>    specified time. It cannot simply send the stream ASAP to the audio
>    HW, because some other Listener might need longer. AFAICT, there is
>    nothing in ALSA that allows you to say, sample X should be played at
>    time Y.

Yes, and this requires a lot of change to ALSA (and probably something in
V4L2 as well?), so before we get to that, perhaps have a set of patches
that does this best effort and *then* work on getting time-triggered
playback into the kernel?

Another item that was brought up last round was getting timing-information
to/from ALSA, see driver/media/avb/avb_alsa.c. As a start it updates the
time for the last incoming/outgoing frame so that userspace can get that
information. Probably buggy as heck :)

* Back to your email from last night*

> You are trying to put tons of code into the kernel that really belongs
> in user space, and at the same time, you omit critical functions that
> only the kernel can provide.

Some (well, to be honest, most) of the critical functions that my driver
omits are omitted because they require substantial effort to implement -
and before there's a need for this, that won't happen. So, consider the
TSN-driver such a need!

I'd love to use a qdisc that uses a time-triggered transmit, that would
drop the need for a lot of the stuff in tsn_core.c. The same goes for
time-triggered playback in media.

> > There is at least one AVB-driver (the AV-part of TSN) in the kernel
> > already.
>
> And which driver is that?

Ah, a proverbial slip of the changelog, we visited this the last
iteration. That would be the ravb-driver (which is an AVB capable NIC),
but it does not include much in the way of AVB-support *in* kernel. Sorry
about that!

Since then, the iMX7 from NXP has arrived, and this also has HW-support
for TSN, but not in the kernel AFAICT.

So, the next issue I plan to tackle is how I do buffers. The current
approach where tsn_core allocates memory is on its way out and I'll let
the shim (which means alsa/v4l2) provide a buffer. Then I'll start looking
at qdisc.

Thanks!

-- 
Henrik Austad

signature.asc
Description: Digital signature
Re: [TSN RFC v2 5/9] Add TSN header for the driver
On Fri, Dec 16, 2016 at 11:09:38PM +0100, Richard Cochran wrote:
> On Fri, Dec 16, 2016 at 06:59:09PM +0100, hen...@austad.us wrote:
> > +/*
> > + * List of current subtype fields in the common header of AVTPDU
> > + *
> > + * Note: AVTPDU is a remnant of the standards from when it was AVB.
> > + *
> > + * The list has been updated with the recent values from IEEE 1722, draft 16.
> > + */
> > +enum avtp_subtype {
> > +	TSN_61883_IIDC = 0,	/* IEC 61883/IIDC Format */
> > +	TSN_MMA_STREAM,		/* MMA Streams */
> > +	TSN_AAF,		/* AVTP Audio Format */
> > +	TSN_CVF,		/* Compressed Video Format */
> > +	TSN_CRF,		/* Clock Reference Format */
> > +	TSN_TSCF,		/* Time-Synchronous Control Format */
> > +	TSN_SVF,		/* SDI Video Format */
> > +	TSN_RVF,		/* Raw Video Format */
> > +	/* 0x08 - 0x6D reserved */
> > +	TSN_AEF_CONTINOUS = 0x6e, /* AES Encrypted Format Continuous */
> > +	TSN_VSF_STREAM,		/* Vendor Specific Format Stream */
> > +	/* 0x70 - 0x7e reserved */
> > +	TSN_EF_STREAM = 0x7f,	/* Experimental Format Stream */
> > +	/* 0x80 - 0x81 reserved */
> > +	TSN_NTSCF = 0x82,	/* Non Time-Synchronous Control Format */
> > +	/* 0x83 - 0xeb reserved */
> > +	TSN_ESCF = 0xec,	/* ECC Signed Control Format */
> > +	TSN_EECF,		/* ECC Encrypted Control Format */
> > +	TSN_AEF_DISCRETE,	/* AES Encrypted Format Discrete */
> > +	/* 0xef - 0xf9 reserved */
> > +	TSN_ADP = 0xfa,		/* AVDECC Discovery Protocol */
> > +	TSN_AECP,		/* AVDECC Enumeration and Control Protocol */
> > +	TSN_ACMP,		/* AVDECC Connection Management Protocol */
> > +	/* 0xfd reserved */
> > +	TSN_MAAP = 0xfe,	/* MAAP Protocol */
> > +	TSN_EF_CONTROL,		/* Experimental Format Control */
> > +};
>
> The kernel shouldn't be in the business of assembling media packets.

No, but assembling the packets and shipping frames to a destination is not
necessarily the same thing.

A nice workflow would be to signal to the shim that "I'm sending a
compressed video format" and then the shim/tsn_core will ship out the
frames over the network - and then you need to set TSN_CVF as subtype in
each header.

That does not mean you should do H.264 encode/decode *in* the kernel.

Perhaps this is better placed in include/uapi/tsn.h so that userspace and
kernel share the same header?

-- 
Henrik Austad

signature.asc
Description: PGP signature
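The "set TSN_CVF as subtype in each header" step can be sketched like this. It is a minimal illustration under assumptions, not code from the patch series: stamp_subtype() is a hypothetical helper, only a few enum values are repeated (with the values the quoted enum gives them), and the subtype is assumed to sit in the first octet of the AVTPDU common header, which should be checked against IEEE 1722 before relying on it.

```c
#include <stddef.h>
#include <stdint.h>

/* Subset of the quoted enum, with the same numeric values. */
enum avtp_subtype_sketch {
	SK_61883_IIDC = 0x00, /* IEC 61883/IIDC Format */
	SK_AAF        = 0x02, /* AVTP Audio Format */
	SK_CVF        = 0x03, /* Compressed Video Format */
	SK_CRF        = 0x04, /* Clock Reference Format */
};

/*
 * Stamp the subtype into an already allocated frame payload.
 * Assumption: the subtype is the first octet of the common header.
 * Returns 0 on success, -1 when there is no room for the octet.
 */
static int stamp_subtype(uint8_t *avtpdu, size_t len, uint8_t subtype)
{
	if (!avtpdu || len < 1)
		return -1;

	avtpdu[0] = subtype;
	return 0;
}
```

The media payload itself (for example an H.264 elementary stream) would still be produced elsewhere; only the tag in the header is written here.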
Re: [TSN RFC v2 0/9] TSN driver for the kernel
On Fri, Dec 16, 2016 at 01:20:57PM -0500, David Miller wrote:
> From: Greg <gvrose8...@gmail.com>
> Date: Fri, 16 Dec 2016 10:12:44 -0800
>
> > On Fri, 2016-12-16 at 18:59 +0100, hen...@austad.us wrote:
> >> From: Henrik Austad <haus...@cisco.com>
> >>
> >>
> >> The driver is directed via ConfigFS as we need userspace to handle
> >> stream-reservation (MSRP), discovery and enumeration (IEEE 1722.1) and
> >> whatever other management is needed. This also includes running an
> >> appropriate PTP daemon (TSN favors gPTP).
> >
> > I suggest using a generic netlink interface to communicate with the
> > driver to set up and/or configure your drivers.
> >
> > I think configfs is frowned upon for network drivers. YMMV.
>
> Agreed.

Ok - thanks!

I will have a look at netlink and see if I can wrap my head around it and
if I can apply it to how to bring the media-devices up once the TSN-link
has been configured.

Thanks! :)

-- 
Henrik Austad

signature.asc
Description: PGP signature
Re: [PATCH] tracing: (backport) Replace kmap with copy_from_user() in trace_marker
On Fri, Dec 09, 2016 at 08:22:05AM +0100, Greg KH wrote:
> On Fri, Dec 09, 2016 at 07:34:04AM +0100, Henrik Austad wrote:
> > Instead of using get_user_pages_fast() and kmap_atomic() when writing
> > to the trace_marker file, just allocate enough space on the ring buffer
> > directly, and write into it via copy_from_user().
> >
> > Writing into the trace_marker file used to allocate a temporary buffer
> > to perform the copy_from_user(), as we didn't want to write into the
> > ring buffer if the copy failed. But as a trace_marker write is supposed
> > to be extremely fast, and allocating memory causes other tracepoints to
> > trigger, Peter Zijlstra suggested using get_user_pages_fast() and
> > kmap_atomic() to keep the user space pages in memory and reading it
> > directly.
> >
> > Instead, just allocate the space in the ring buffer and use
> > copy_from_user() directly. If it faults, return -EFAULT and write
> > "<faulted>" into the ring buffer.
> >
> > On architectures without an arch-specific get_user_pages_fast(), this
> > will end up in the generic get_user_pages_fast() and this grabs
> > mm->mmap_sem. Once you do this, then suddenly writing to the
> > trace_marker can cause priority-inversions.
> >
> > This is a backport of Steven Rostedt's patch [1] and applied to 3.10.x so the
> > signed-off-chain by is somewhat uncertain at this stage.
> >
> > The patch compiles, boots and does not immediately explode on impact. By
> > definition [2] it must therefore be perfect
> >
> > 1) https://www.spinics.net/lists/kernel/msg2400769.html
> > 2) http://lkml.iu.edu/hypermail/linux/kernel/9804.1/0149.html
> >
> > Cc: Ingo Molnar <mi...@kernel.org>
> > Cc: Henrik Austad <hen...@austad.us>
> > Cc: Peter Zijlstra <pet...@infradead.org>
> > Cc: Steven Rostedt <rost...@goodmis.org>
> > Cc: sta...@vger.kernel.org
> >
> > Suggested-by: Thomas Gleixner <t...@linutronix.de>
> > Used-to-be-signed-off-by: Steven Rostedt <rost...@goodmis.org>
> > Backported-by: Henrik Austad <haus...@cisco.com>
> > Tested-by: Henrik Austad <haus...@cisco.com>
> > Signed-off-by: Henrik Austad <haus...@cisco.com>
> > ---
> >  kernel/trace/trace.c | 78 +++-
> >  1 file changed, 22 insertions(+), 56 deletions(-)
>
> What is the git commit id of this patch in Linus's tree? And what
> stable trees do you feel it should be applied to?

Ah, perhaps I jumped the gun here. I don't think Linus has picked this one
up yet, Steven sent out the patch yesterday.

Since then, I've backported it to 3.10 and ran the first set of tests over
night and it looks good. So ideally this would find its way into
3.10(.104).

Do you want me to resubmit when Steven's patch is merged upstream?

-Henrik
[PATCH] tracing: (backport) Replace kmap with copy_from_user() in trace_marker
Instead of using get_user_pages_fast() and kmap_atomic() when writing
to the trace_marker file, just allocate enough space on the ring buffer
directly, and write into it via copy_from_user().

Writing into the trace_marker file used to allocate a temporary buffer
to perform the copy_from_user(), as we didn't want to write into the
ring buffer if the copy failed. But as a trace_marker write is supposed
to be extremely fast, and allocating memory causes other tracepoints to
trigger, Peter Zijlstra suggested using get_user_pages_fast() and
kmap_atomic() to keep the user space pages in memory and reading it
directly.

Instead, just allocate the space in the ring buffer and use
copy_from_user() directly. If it faults, return -EFAULT and write
"<faulted>" into the ring buffer.

On architectures without an arch-specific get_user_pages_fast(), this
will end up in the generic get_user_pages_fast() and this grabs
mm->mmap_sem. Once you do this, then suddenly writing to the
trace_marker can cause priority-inversions.

This is a backport of Steven Rostedt's patch [1] and applied to 3.10.x so
the signed-off-by chain is somewhat uncertain at this stage.

The patch compiles, boots and does not immediately explode on impact. By
definition [2] it must therefore be perfect

1) https://www.spinics.net/lists/kernel/msg2400769.html
2) http://lkml.iu.edu/hypermail/linux/kernel/9804.1/0149.html

Cc: Ingo Molnar <mi...@kernel.org>
Cc: Henrik Austad <hen...@austad.us>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Steven Rostedt <rost...@goodmis.org>
Cc: sta...@vger.kernel.org

Suggested-by: Thomas Gleixner <t...@linutronix.de>
Used-to-be-signed-off-by: Steven Rostedt <rost...@goodmis.org>
Backported-by: Henrik Austad <haus...@cisco.com>
Tested-by: Henrik Austad <haus...@cisco.com>
Signed-off-by: Henrik Austad <haus...@cisco.com>
---
 kernel/trace/trace.c | 78 +++-
 1 file changed, 22 insertions(+), 56 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 18cdf91..94eb1ee 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4501,15 +4501,13 @@ tracing_mark_write(struct file *filp, const char __user *ubuf,
 	struct ring_buffer *buffer;
 	struct print_entry *entry;
 	unsigned long irq_flags;
-	struct page *pages[2];
-	void *map_page[2];
-	int nr_pages = 1;
+	const char faulted[] = "<faulted>";
 	ssize_t written;
-	int offset;
 	int size;
 	int len;
-	int ret;
-	int i;
+
+/* Used in tracing_mark_raw_write() as well */
+#define FAULTED_SIZE (sizeof(faulted) - 1) /* '\0' is already accounted for */
 
 	if (tracing_disabled)
 		return -EINVAL;
@@ -4520,60 +4518,34 @@ tracing_mark_write(struct file *filp, const char __user *ubuf,
 	if (cnt > TRACE_BUF_SIZE)
 		cnt = TRACE_BUF_SIZE;
 
-	/*
-	 * Userspace is injecting traces into the kernel trace buffer.
-	 * We want to be as non intrusive as possible.
-	 * To do so, we do not want to allocate any special buffers
-	 * or take any locks, but instead write the userspace data
-	 * straight into the ring buffer.
-	 *
-	 * First we need to pin the userspace buffer into memory,
-	 * which, most likely it is, because it just referenced it.
-	 * But there's no guarantee that it is. By using get_user_pages_fast()
-	 * and kmap_atomic/kunmap_atomic() we can get access to the
-	 * pages directly. We then write the data directly into the
-	 * ring buffer.
-	 */
 	BUILD_BUG_ON(TRACE_BUF_SIZE >= PAGE_SIZE);
 
-	/* check if we cross pages */
-	if ((addr & PAGE_MASK) != ((addr + cnt) & PAGE_MASK))
-		nr_pages = 2;
-
-	offset = addr & (PAGE_SIZE - 1);
-	addr &= PAGE_MASK;
-
-	ret = get_user_pages_fast(addr, nr_pages, 0, pages);
-	if (ret < nr_pages) {
-		while (--ret >= 0)
-			put_page(pages[ret]);
-		written = -EFAULT;
-		goto out;
-	}
+	local_save_flags(irq_flags);
+	size = sizeof(*entry) + cnt + 2; /* add '\0' and possible '\n' */
 
-	for (i = 0; i < nr_pages; i++)
-		map_page[i] = kmap_atomic(pages[i]);
+	/* If less than "<faulted>", then make sure we can still add that */
+	if (cnt < FAULTED_SIZE)
+		size += FAULTED_SIZE - cnt;
 
-	local_save_flags(irq_flags);
-	size = sizeof(*entry) + cnt + 2; /* possible \n added */
 	buffer = tr->trace_buffer.buffer;
 	event = trace_buffer_lock_reserve(buffer, TRACE_PRINT, size,
 					  irq_flags, preempt_count());
-	if (!event) {
-		/* Ring buffer disabled, return as if not open for write */
-		written = -EBADF;
-		goto out_unlock;
-	}
+
+	if (unlikely(!event))
+		/* Ring buffer disabled, return as if not open for write */
+		return -EBADF;
 
 	entry = ring_buffer_event_data(event);
 	ent
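The fault handling described in the commit message can be sketched in plain userspace C. This is only an illustration of the idea (reserve enough room up front, copy straight into the slot, fall back to a marker string on a fault) and not the kernel code: mark_reserve_size(), mark_write(), copy_fn, copy_ok() and copy_fault() are hypothetical names, the copy callback merely stands in for copy_from_user(), the sizeof(*entry) header from the patch is left out, and the marker is assumed to be "<faulted>".

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

#define FAULTED_STR	"<faulted>"
#define FAULTED_SIZE	(sizeof(FAULTED_STR) - 1)	/* without the trailing '\0' */

/* Hypothetical stand-in for copy_from_user(): returns the bytes NOT copied. */
typedef size_t (*copy_fn)(char *dst, const char *src, size_t n);

/* A copy that always succeeds. */
static size_t copy_ok(char *dst, const char *src, size_t n)
{
	memcpy(dst, src, n);
	return 0;
}

/* A copy that always faults, i.e. nothing could be copied. */
static size_t copy_fault(char *dst, const char *src, size_t n)
{
	(void)dst;
	(void)src;
	return n;
}

/*
 * Bytes to reserve for the payload: cnt + '\0' + a possible '\n',
 * and never less than what the fault marker needs.
 */
static size_t mark_reserve_size(size_t cnt)
{
	size_t size = cnt + 2;

	if (cnt < FAULTED_SIZE)
		size += FAULTED_SIZE - cnt;
	return size;
}

/*
 * Copy directly into the reserved slot. If the copy faults, store the
 * marker instead and report -EFAULT. The slot must be at least
 * mark_reserve_size(cnt) bytes long.
 */
static ssize_t mark_write(char *slot, const char *ubuf, size_t cnt, copy_fn copy)
{
	ssize_t written = (ssize_t)cnt;

	if (copy(slot, ubuf, cnt)) {
		memcpy(slot, FAULTED_STR, FAULTED_SIZE);
		cnt = FAULTED_SIZE;
		written = -EFAULT;
	}

	/* Terminate the entry, adding a newline if the payload lacks one. */
	if (cnt == 0 || slot[cnt - 1] != '\n') {
		slot[cnt] = '\n';
		slot[cnt + 1] = '\0';
	} else {
		slot[cnt] = '\0';
	}
	return written;
}
```

Because the reservation is padded up front, the fallback never needs a second allocation: a 2-byte write reserves 11 bytes, which is exactly what "<faulted>" plus newline and terminator occupy.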
Re: [RFD] sched/deadline: Support single CPU affinity
On Thu, Nov 10, 2016 at 01:38:40PM +0100, luca abeni wrote:
> Hi Henrik,

Hi Luca,

> On Thu, 10 Nov 2016 13:21:00 +0100
> Henrik Austad <hen...@austad.us> wrote:
> > On Thu, Nov 10, 2016 at 09:08:07AM +0100, Peter Zijlstra wrote:
> [...]
> > > We define the time to fail as:
> > >
> > >   ttf(t) := t_d - t_b; where
> > >
> > >   t_d is t's absolute deadline
> > >   t_b is t's remaining budget
> > >
> > > This is the last possible moment we must schedule this task such
> > > that it can complete its work and not miss its deadline.
> >
> > To elaborate a bit on this (this is a modified LLF approach if my
> > memory serves):
> >
> > You have the dynamic time-to-failure (TtF), i.e. as the task
> > progresses (scheduled to run), the relative time-to-failure will
> > remain constant. This can be used to compare tasks to a running task
> > and should minimize the number of calculations required.
> >
> > Then you have the static Time-of-failure (ToF), which is the absolute
> > time when a task will no longer be able to meet its deadline. This is
> > what you use for keeping a sorted list of tasks in the runqueue. As
> > this is a fixed point in time, you do not have to dynamically update
> > or do crazy calculation when inserting/removing threads from the rq.
>
> Sorry, I am missing something here: if ttf is defined as
> ttf_i = d_i - q_i

So I picked the naming somewhat independently of Peter; his approach is the
_absolute_ time of failure, the actual time X, irrespective of the task
running or not.

I added 2 different measures for the same thing:

* ToF:
  The absolute time of failure is the point in time when the task will no
  longer be able to meet its deadline. If a task is scheduled and is
  running on a CPU, this value will move forward at the speed of
  execution, i.e. when the task is running, this value is changing. When
  the task is waiting in the runqueue, this value is constant.

* TtF:
  The relative time to failure is the value that is tied to the local CPU,
  so to speak. When a task is running, this value is constant as it is the
  remaining time until the task is no longer able to meet its deadline.
  When the task is enqueued, this value will steadily decrease as it draws
  closer to the time when it will fail.

So when a task is running on a CPU, you use TtF; when it is in the runqueue
you compare ToF.

> (where d_i is the deadline of task i and q_i is its remaining budget),
> then it also is the time before which you have to schedule task i if
> you do not want to miss the deadline... No? So, I do not understand the
> difference with tof.

So you can calculate one from the other given absolute deadline and
remaining budget (or consumed CPU-time). But it is handy to use both as it
removes a lot of duplication and, once you get the hang of the terms, makes
it a bit easier to reason about the system.

> > > If we then augment the regular EDF rules by, for local tasks,
> > > considering the time to fail and let this measure override the
> > > regular EDF pick when the time to fail can be overrun by the EDF
> > > pick.
> >
> > Then, if you do this - do you need to constrict this to a local CPU?
> > I *think* you could do this in a global scheduler if you use ToF/TtF
> > for all deadline-tasks, I think you should be able to meet deadlines.
>
> I think the ToF/TtF scheduler will be equivalent to LLF (see previous
> emails)... Or am I misunderstanding something?

(see above)

> And LLF is not optimal on multiple CPUs, so I do not think it will be
> able to meet deadlines if you use it as a global scheduler.

I think I called it Earliest Failure First (I really wanted to call it
failure-driven scheduling but that implied a crappy scheduler ;)

LLF is prone to a high task-switch count when multiple threads get close to
0 laxity. But as I said, it's been a while since I last worked through the
theory, so I have some homework to do before arguing too hard about this.

> > I had a rant about this way back [1,2 Sec 11.4], I need to sit down
> > and re-read most of it, it has been a few too many years, but the
> > idea was to minimize the number of context-switches (which LLF is
> > prone to get a lot of) as well as minimize the computational overhead
> > by avoiding re-calculating time-of-failure/time-to-failure a lot.
> >
> > > That is, when:
> > >
> > >   now + left_b > min(ttf)
> >
> > Why not just use ttf/tof for all deadline-tasks? We have all the
> > information available anyway and it would probably make the internal
> > logic easier?
>
> I think LLF causes more preemptions and migrations than EDF.

Yes, it does, which is why you need to adjust LLF to minimize the number of
task-switches.

-Henrik
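To make the ToF/TtF distinction from this mail concrete, here is a minimal userspace sketch, assuming a task is described only by its absolute deadline and its remaining budget. struct dl_task, tof(), ttf() and run_for() are hypothetical names chosen for the illustration; none of this is taken from kernel/sched/deadline.c.

```c
#include <stdint.h>

/* Hypothetical task descriptor, all values in the same time unit. */
struct dl_task {
	uint64_t deadline;	/* absolute deadline */
	uint64_t budget;	/* remaining budget */
};

/* ToF: absolute time of failure, the latest instant the task may start. */
static uint64_t tof(const struct dl_task *t)
{
	return t->deadline - t->budget;
}

/* TtF: time left until ToF as seen at `now`; negative once it has passed. */
static int64_t ttf(const struct dl_task *t, uint64_t now)
{
	return (int64_t)(tof(t) - now);
}

/* The task executes for `delta` time units, consuming budget. */
static void run_for(struct dl_task *t, uint64_t delta)
{
	t->budget -= delta;
}
```

With deadline 100 and budget 30, ToF is 70 and TtF at now=10 is 60. After the task runs for 5 units (now=15) ToF has moved to 75 while TtF is still 60. If it then waits until now=20, ToF stays at 75 and TtF drops to 55, which matches the running/enqueued behaviour described above.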
Re: [RFD] sched/deadline: Support single CPU affinity
On Thu, Nov 10, 2016 at 09:08:07AM +0100, Peter Zijlstra wrote:
>
>
> Add support for single CPU affinity to SCHED_DEADLINE; the supposed reason for
> wanting single CPU affinity is better QoS than provided by G-EDF.
>
> Therefore the aim is to provide harder guarantees, similar to UP, for single
> CPU affine tasks. This then leads to a mixed criticality scheduling
> requirement for the CPU scheduler. G-EDF like for the non-affine (global)
> tasks and UP like for the single CPU tasks.
>
>
>
> ADMISSION CONTROL
>
> Do simple UP admission control on the CPU local tasks, and subtract the
> admitted bandwidth from the global total when doing global admission control.
>
>   single cpu: U[n] := \Sum tl_u,n <= 1
>   global: \Sum tg_u <= N - \Sum U[n]
>
>
>
> MIXED CRITICALITY SCHEDULING
>
> Since we want to provide better guarantees for single CPU affine tasks than
> the G-EDF scheduler provides for the single CPU tasks, we need to somehow
> alter the scheduling algorithm.
>
> The trivial layered EDF/G-EDF approach is obviously flawed in that it will
> result in many unnecessary deadline misses. The trivial example is having a
> single CPU task with a deadline after a runnable global task. By always
> running single CPU tasks over global tasks we can make the global task miss
> its deadline even though we could easily have run both within the allotted
> time.
>
> Therefore we must use a more complicated scheme. By adding a second measure
> present in the sporadic task model to the scheduling function we can try and
> distinguish between the constraints of handling the two cases in a single
> scheduler.
>
> We define the time to fail as:
>
>   ttf(t) := t_d - t_b; where
>
>   t_d is t's absolute deadline
>   t_b is t's remaining budget
>
> This is the last possible moment we must schedule this task such that it can
> complete its work and not miss its deadline.

To elaborate a bit on this (this is a modified LLF approach if my memory
serves):

You have the dynamic time-to-failure (TtF), i.e. as the task progresses
(scheduled to run), the relative time-to-failure will remain constant. This
can be used to compare tasks to a running task and should minimize the
number of calculations required.

Then you have the static Time-of-failure (ToF), which is the absolute time
when a task will no longer be able to meet its deadline. This is what you
use for keeping a sorted list of tasks in the runqueue. As this is a fixed
point in time, you do not have to dynamically update or do crazy
calculation when inserting/removing threads from the rq.

> If we then augment the regular EDF rules by, for local tasks, considering the
> time to fail and let this measure override the regular EDF pick when the
> time to fail can be overrun by the EDF pick.

Then, if you do this - do you need to constrict this to a local CPU? I
*think* you could do this in a global scheduler if you use ToF/TtF for all
deadline-tasks, I think you should be able to meet deadlines.

I had a rant about this way back [1,2 Sec 11.4], I need to sit down and
re-read most of it, it has been a few too many years, but the idea was to
minimize the number of context-switches (which LLF is prone to get a lot
of) as well as minimize the computational overhead by avoiding
re-calculating time-of-failure/time-to-failure a lot.

> That is, when:
>
>   now + left_b > min(ttf)

Why not just use ttf/tof for all deadline-tasks? We have all the
information available anyway and it would probably make the internal logic
easier?

> Use augmented RB-tree to store absolute deadlines of all rq local tasks and
> keep the heap sorted on the earliest time to fail of a locally affine task.
>
> TODO
>
>  - finish patch, this only sketches the outlines
>  - formal analysis of the proposed scheduling function; this is only a hunch.

I think you are on the right track, but then again, you agree with some of
the stuff I messed around with a while ago, so no wonder I think you're
right :)

1) https://lkml.org/lkml/2009/7/10/380
2) https://brage.bibsys.no/xmlui/bitstream/handle/11250/259744/347756_FULLTEXT01.pdf

-Henrik

> ---
>  include/linux/sched.h   |   1 +
>  kernel/sched/core.c     |  75 ++---
>  kernel/sched/deadline.c | 142
>  kernel/sched/sched.h    |  12 ++--
>  4 files changed, 191 insertions(+), 39 deletions(-)
>
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 3762fe4e3a80..32f948615d4c 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1412,6 +1412,7 @@ struct sched_rt_entity {
>
>  struct sched_dl_entity {
>  	struct rb_node rb_node;
> +	u64 __subtree_ttf;

Didn't you say augmented rb-tree?

>  	/*
>  	 * Original scheduling parameters. Copied here from sched_attr
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index bee18baa603a..46995c060a89 100644
> --- a/kernel/sched/core.c
> +++
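The admission-control inequalities and the override condition quoted from the RFD can be written down as a small userspace sketch. This is an illustration under assumptions, not code from the patch: utilization is stored as fixed point with a made-up scale UTIL_ONE, and admit_local(), admit_global() and edf_pick_overruns_ttf() are hypothetical names. The last helper follows the RFD's definition, where ttf is an absolute time (t_d - t_b) and left_b is the remaining budget of the regular EDF pick.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define UTIL_ONE 1024u	/* fixed-point 1.0; the scale is an assumption */

/* single cpu: U[n] := \Sum tl_u,n <= 1 */
static bool admit_local(uint32_t cpu_util, uint32_t new_util)
{
	return (uint64_t)cpu_util + new_util <= UTIL_ONE;
}

/* global: \Sum tg_u <= N - \Sum U[n] */
static bool admit_global(const uint32_t *local_util, size_t ncpus,
			 uint32_t global_util, uint32_t new_util)
{
	uint64_t avail = (uint64_t)ncpus * UTIL_ONE;
	size_t i;

	/* Each local_util[i] is assumed to have passed admit_local(). */
	for (i = 0; i < ncpus; i++)
		avail -= local_util[i];

	return (uint64_t)global_util + new_util <= avail;
}

/*
 * Override rule: now + left_b > min(ttf). If running the EDF pick to
 * completion would pass the earliest time to fail of a local task,
 * that local task has to be picked instead.
 */
static bool edf_pick_overruns_ttf(uint64_t now, uint64_t left_b, uint64_t min_ttf)
{
	return now + left_b > min_ttf;
}
```

For example, with two CPUs carrying local utilizations of 512 and 256, the global pool is 2048 - 768 = 1280, so a global load of 1000 can admit another 200 but not another 300.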
Re: [Intel-wired-lan] [PATCH] igb: add missing fields to TXDCTL-register
On Wed, Oct 19, 2016 at 07:25:10AM -0700, Jesse Brandeburg wrote:
> On Wed, 19 Oct 2016 14:37:59 +0200
> Henrik Austad <hen...@austad.us> wrote:
>
> > The current list of E1000_TXDCTL-registers is incomplete. This adds
> > the missing parts for the Transmit Descriptor Control (TXDCTL)
> > register.
> >
> > The rest of these values (threshold for descriptor read/write) for
> > TXDCTL seems to be defined in igb/igb.h, not sure why this is split
> > though.
>
> Hi Henrik, thanks for helping with our code.
>
> While totally correct, having defines added to the kernel that are not
> being used anywhere in the code isn't really very useful.  Often the
> upstream maintainers/reviewers will reject a patch like this that just
> adds to a .h file, because there are no actual users of the defines.

Yes, I agree, best to avoid bloat whenever possible.

> If the transmit or ethtool code were to use these (via the same patch)
> or something like that, then the patch would be more likely to be
> accepted.

Ah, good to know. I am in the process of spinning out a new set of
TSN-patches (previous version: see [1]) and setting the priority-bit for
the Tx-queues is required. This means that I'm hacking more at igb_main.c.

So this was more about laying the groundwork for the series. I'll leave
this patch in the tsn-series then, and resend once I'm ready and hope you
can provide some feedback on the rest of the series then :)

> Jesse
>
> PS In the future no need to copy linux-kernel for patches going to our
> submaintainer list.

Ok, I'll remember that, thanks!

1) https://lkml.org/lkml/2016/6/11/187

-- 
Henrik Austad
[PATCH] igb: add missing fields to TXDCTL-register
The current list of E1000_TXDCTL-registers is incomplete. This adds the missing parts for the Transmit Descriptor Control (TXDCTL) register. The rest of these values (threshold for descriptor read/write) for TXDCTL seems to be defined in igb/igb.h, not sure why this is split though. It seems that this was left out in the commit that added support for 82575 Gigabit Ethernet driver 9d5c8243 (igb: PCI-Express 82575 Gigabit Ethernet driver). Signed-off-by: Henrik Austad <hen...@austad.us> Cc: linux-kernel@vger.kernel.org Cc: Jeff Kirsher <jeffrey.t.kirs...@intel.com> Cc: intel-wired-...@lists.osuosl.org Signed-off-by: Henrik Austad <hen...@austad.us> --- drivers/net/ethernet/intel/igb/e1000_82575.h | 4 1 file changed, 4 insertions(+) diff --git a/drivers/net/ethernet/intel/igb/e1000_82575.h b/drivers/net/ethernet/intel/igb/e1000_82575.h index 199ff98..212dbb8 100644 --- a/drivers/net/ethernet/intel/igb/e1000_82575.h +++ b/drivers/net/ethernet/intel/igb/e1000_82575.h @@ -158,7 +158,11 @@ struct e1000_adv_tx_context_desc { /* Additional Transmit Descriptor Control definitions */ #define E1000_TXDCTL_QUEUE_ENABLE 0x0200 /* Enable specific Tx Queue */ + +/* Transmit Software Flush, sw-triggered desc writeback */ +#define E1000_TXDCTL_SWFLSH 0x0400 /* Tx Queue Arbitration Priority 0=low, 1=high */ +#define E1000_TXDCTL_PRIORITY 0x0800 /* Additional Receive Descriptor Control definitions */ #define E1000_RXDCTL_QUEUE_ENABLE 0x0200 /* Enable specific Rx Queue */ -- 2.7.4
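As a rough illustration of how the added masks might be used once there is an in-tree user, the sketch below sets and clears the queue-priority bit in a local copy of a TXDCTL value. The mask values are copied exactly as they appear in the patch text above. `txdctl_set_priority()` is a hypothetical helper and not igb driver code; a real driver would read and write the register through its own MMIO accessors.

```c
#include <stdint.h>

/* Mask values copied as they appear in the patch text above. */
#define E1000_TXDCTL_QUEUE_ENABLE 0x0200 /* Enable specific Tx Queue */
#define E1000_TXDCTL_SWFLSH       0x0400 /* sw-triggered desc writeback */
#define E1000_TXDCTL_PRIORITY     0x0800 /* arbitration: 0=low, 1=high */

/*
 * Hypothetical helper: return a copy of txdctl with the arbitration
 * priority bit set (high != 0) or cleared (high == 0). All other bits
 * are left untouched.
 */
static uint32_t txdctl_set_priority(uint32_t txdctl, int high)
{
	if (high)
		return txdctl | E1000_TXDCTL_PRIORITY;
	return txdctl & ~(uint32_t)E1000_TXDCTL_PRIORITY;
}
```

Operating on a plain value keeps the bit logic separate from the register access, which is presumably where a TSN series would plug it into igb_main.c.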
Re: [alsa-devel] [very-RFC 0/8] TSN driver for the kernel
On Tue, Jun 21, 2016 at 10:45:18AM -0700, Pierre-Louis Bossart wrote: > On 6/20/16 5:18 AM, Richard Cochran wrote: > >On Mon, Jun 20, 2016 at 01:08:27PM +0200, Pierre-Louis Bossart wrote: > >>The ALSA API provides support for 'audio' timestamps (playback/capture rate > >>defined by audio subsystem) and 'system' timestamps (typically linked to > >>TSC/ART) with one option to take synchronized timestamps should the hardware > >>support them. > > > >Thanks for the info. I just skimmed > >Documentation/sound/alsa/timestamping.txt. > > > >That is fairly new, only since v4.1. Are then any apps in the wild > >that I can look at? AFAICT, OpenAVB, gstreamer, etc, don't use the > >new API. > > The ALSA API supports a generic .get_time_info callback, its implementation > is for now limited to a regular 'DMA' or 'link' timestamp for HDaudio - the > difference being which counters are used and how close they are to the link > serializer. The synchronized part is still WIP but should come 'soon' Interesting, would you mind CCing me in on those patches? > >>The intent was that the 'audio' timestamps are translated to a shared time > >>reference managed in userspace by gPTP, which in turn would define if > >>(adaptive) audio sample rate conversion is needed. There is no support at > >>the moment for a 'play_at' function in ALSA, only means to control a > >>feedback loop. > > > >Documentation/sound/alsa/timestamping.txt says: > > > > If supported in hardware, the absolute link time could also be used > > to define a precise start time (patches WIP) > > > >Two questions: > > > >1. Where are the patches? (If some are coming, I would appreciate > > being on CC!) > > > >2. Can you mention specific HW that would support this? > > You can experiment with the 'dma' and 'link' timestamps today on any > HDaudio-based device. 
Like I said the synchronized part has not been > upstreamed yet (delays + dependency on ART-to-TSC conversions that made it > in the kernel recently) Ok, I think I see a way to hook this into timestamps from the skbuf on incoming frames and a somewhat messy way on outgoing. Having time coupled with 'avail' and 'delay' is useful, and from the looks of it, 'link'-time is the appropriate level to add this. I'm working on storing the time in the tsn_link struct I use, and then reading that from the avb_alsa-shim. Details are still a bit fuzzy though, but I plan to do that and then see what audio-time gives me once it is up and running. Richard: is it fair to assume that if ptp4l is running and is part of a PTP domain, ktime_get() will return PTP-adjusted time for the system? Or do I also need to run phc2sys in order to sync the system-time to PTP-time? Note that this is for outgoing traffic; Rx should perhaps use the timestamp in skb. Hooking into ktime_get() instead of directly to the PTP-subsystem (if that is even possible) makes it a lot easier to debug when running this in a VM as it doesn't *have* to use PTP-time when I'm crashing a new kernel :) Thanks! -- Henrik Austad signature.asc Description: Digital signature
Re: [alsa-devel] [very-RFC 0/8] TSN driver for the kernel
On Mon, Jun 20, 2016 at 01:08:27PM +0200, Pierre-Louis Bossart wrote: > > >Presentation time is either set by > >a) Local sound card performing capture (in which case it will be 'capture > > time') > >b) Local media application sending a stream across the network > > (time when the sample should be played out remotely) > >c) Remote media application streaming data *to* host, in which case it will > > be local presentation time on local soundcard > > > >>This value is dominant to the number of events included in an IEC 61883-1 > >>packet. If this TSN subsystem decides it, most of these items don't need > >>to be in ALSA. > > > >Not sure if I understand this correctly. > > > >TSN should have a reference to the timing-domain of each *local* > >sound-device (for local capture or playback) as well as the shared > >time-reference provided by gPTP. > > > >Unless an End-station acts as GrandMaster for the gPTP-domain, time set > >forth by gPTP is immutable and cannot be adjusted. It follows that the > >sample-frequency of the local audio-devices must be adjusted, or the > >audio-streams to/from said devices must be resampled. > > The ALSA API provides support for 'audio' timestamps > (playback/capture rate defined by audio subsystem) and 'system' > timestamps (typically linked to TSC/ART) with one option to take > synchronized timestamps should the hardware support them. Ok, this sounds promising, and very much in line with what AVB would need. > The intent was that the 'audio' timestamps are translated to a > shared time reference managed in userspace by gPTP, which in turn > would define if (adaptive) audio sample rate conversion is needed. > There is no support at the moment for a 'play_at' function in ALSA, > only means to control a feedback loop. Ok, I understand that the 'play_at' is difficult to obtain, but it sounds like it is doable to achieve something useful. 
Looks like I will be looking into what to put in the .trigger-handler in the ALSA shim and experimenting with this to see how it makes sense to connect it from the TSN-stream. Thanks! -- Henrik Austad signature.asc Description: Digital signature
Re: [very-RFC 0/8] TSN driver for the kernel
On Sun, Jun 19, 2016 at 11:46:29AM +0200, Richard Cochran wrote: > On Sun, Jun 19, 2016 at 12:45:50AM +0200, Henrik Austad wrote: > > edit: this turned out to be a somewhat lengthy answer. I have tried to > > shorten it down somewhere. it is getting late and I'm getting increasingly > > incoherent (Richard probably knows what I'm talking about ;) so I'll stop > > for now. > > Thanks for your responses, Henrik. I think your explanations are on spot. > > > note that an adjustable sample-clock is not a *requirement* but in general > > you'd want to avoid resampling in software. > > Yes, but.. > > Adjusting the local clock rate to match the AVB network rate is > essential. You must be able to *continuously* adjust the rate in > order to compensate drift. Again, there are exactly two ways to do > it, namely in hardware (think VCO) or in software (dynamic > resampling). Don't get me wrong, having an adjustable clock for the sampling is essential, but it is not *required*. > What you cannot do is simply buffer the AV data and play it out > blindly at the local clock rate. No, you cannot do that, that would not be pretty :) > Regarding the media clock, if I understand correctly, there the talker > has two possibilities. Either the talker samples the stream at the > gPTP rate, or the talker must tell the listeners the relationship > (phase offset and frequency ratio) between the media clock and the > gPTP time. Please correct me if I got the wrong impression... Last first; AFAIK, there is no way for the Talker to tell a Listener the phase offset/freq ratio other than how each end-station/bridge in the gPTP-domain calculates this on psync_update event messages. I could be wrong though, and different encoding formats can probably convey such information. I have not seen any such mechanisms in the underlying 1722 format though. So a Talker should send a stream sampled as if the gPTP time drove the AD/DA sample frequency directly. 
Whether the local sampling is driven by gPTP or resampled to match gPTP-time prior to transmit is left as an implementation detail for the end-station. Did all that make sense? Thanks! -- Henrik Austad signature.asc Description: Digital signature
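The drift compensation discussed above comes down to comparing how far the local media clock and gPTP time have each advanced over the same interval. A minimal sketch of that arithmetic follows. The function names, and the idea of sampling both clocks at two instants to get the elapsed values, are illustrative assumptions and not part of the TSN patches or of any standard.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Ratio of local media-clock progress to gPTP progress over one
 * measurement interval. A value above 1.0 means the local clock runs
 * fast relative to gPTP time.
 */
static double clock_rate_ratio(uint64_t media_elapsed_ns,
			       uint64_t gptp_elapsed_ns)
{
	assert(gptp_elapsed_ns > 0);
	return (double)media_elapsed_ns / (double)gptp_elapsed_ns;
}

/* The same deviation expressed in parts per million. */
static double clock_drift_ppm(uint64_t media_elapsed_ns,
			      uint64_t gptp_elapsed_ns)
{
	return (clock_rate_ratio(media_elapsed_ns, gptp_elapsed_ns) - 1.0) * 1e6;
}

/*
 * Sample rate, measured against gPTP time, of a converter that is
 * nominally running at nominal_hz but is clocked by the local clock.
 */
static double effective_sample_rate(double nominal_hz, double rate_ratio)
{
	return nominal_hz * rate_ratio;
}
```

Whether the resulting ratio is fed to an adjustable hardware clock or to a software resampler is the implementation detail mentioned above; the measurement is the same in both cases.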
Re: [very-RFC 0/8] TSN driver for the kernel
ernel > >> land. In alsa-lib, sampling rate conversion is implemented in shared > >> object. > >> When userspace applications start playbacking/capturing, depending on PCM > >> node to access, these applications load the shared object and convert PCM > >> frames from buffer in userspace to mmapped DMA-buffer, then commit them. > > > > The AVB use case places an additional requirement on the rate > > conversion. You will need to adjust the frequency on the fly, as the > > stream is playing. I would guess that ALSA doesn't have that option? > > In ALSA kernel/userspace interfaces , the specification cannot be > supported, at all. > > Please explain about this requirement, where it comes from, which > specification and clause describe it (802.1AS or 802.1Q?). As long as I > read IEEE 1722, I cannot find such a requirement. 1722 only describes how the L2 frames are constructed and transmitted. You are correct that it does not mention adjustable clocks there. - 802.1BA gives an overview of AVB - 802.1Q-2011 Sec 34 and 35 describe forwarding and queueing and Stream Reservation (basically what the network needs in order to correctly prioritize TSN streams) - 802.1AS-2011 (gPTP) describes the timing in great detail (from a PTP point of view) and describes in more detail how the clocks should be syntonized (802.1AS-2011, 7.3.3). Since the clock that drives the sample-rate for the DA/AD must be controlled by the shared clock, the fact that gPTP can adjust the time means that the DA/AD circuit needs to be adjustable as well. Note that an adjustable sample-clock is not a *requirement* but in general you'd want to avoid resampling in software. > (When considering about actual hardware codecs, on-board serial bus such > as Inter-IC Sound, corresponding controller, immediate change of > sampling rate is something imaginary for semi-realtime applications. And > the idea has no meaning for typical playback/capture softwares.) Yes, and no. 
When you play back a stored file to your soundcard, data is pulled by the card from memory. So you only have a single timing-domain to worry about. So I'd say the idea has meaning in normal scenarios as well, you just don't have to worry about it. When you send a stream across the network, you cannot let the Listener pull data from you, you have to have a common sense of time in order to send just enough data, and that is why the gPTP domain is so important. 802.1Q gives you low latency through the network, but more importantly, no dropped frames. gPTP gives you a central reference to time. > [1] [alsa-lib][PATCH 0/9 v3] ctl: add APIs for control element set > http://mailman.alsa-project.org/pipermail/alsa-devel/2016-June/109274.html > [2] IEEE 1722-2011 > http://ieeexplore.ieee.org/servlet/opac?punumber=5764873 > [3] 5.5 Timing and Synchronization > op. cit. > [4] 1394 Open Host Controller Interface Specification > http://download.microsoft.com/download/1/6/1/161ba512-40e2-4cc9-843a-923143f3456c/ohci_11.pdf I hope this cleared up some of the questions -- Henrik Austad signature.asc Description: Digital signature
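To make "send just enough data" concrete, here is a small sketch of pacing samples into fixed-interval packets. It assumes the sender transmits a fixed number of packets per second of shared (gPTP) time; 8000 packets per second is used as an example figure and does not come from this thread. `struct pkt_pacer` and `pkt_pacer_next()` are hypothetical names, and this is not how IEEE 1722 actually packetises audio; it only shows the bookkeeping needed so that the long-term sample count matches the sample rate exactly.

```c
#include <stdint.h>

/* Hypothetical pacing state for one outgoing stream. */
struct pkt_pacer {
	uint32_t sample_rate_hz;  /* e.g. 48000 */
	uint32_t packets_per_sec; /* assumed fixed transmit rate, e.g. 8000 */
	uint32_t remainder;       /* fractional samples carried to next packet */
};

/*
 * Number of samples to place in the next packet. The remainder is carried
 * over, so rates that do not divide evenly (44100 / 8000) still add up to
 * exactly sample_rate_hz samples every packets_per_sec packets.
 */
static uint32_t pkt_pacer_next(struct pkt_pacer *p)
{
	uint32_t total = p->sample_rate_hz + p->remainder;
	uint32_t n = total / p->packets_per_sec;

	p->remainder = total % p->packets_per_sec;
	return n;
}
```

With 48000 Hz and 8000 packets per second every packet carries 6 samples; with 44100 Hz the count alternates between 5 and 6. The packet interval itself has to be counted against the shared clock, which is where gPTP comes in.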
Re: [very-RFC 7/8] AVB ALSA - Add ALSA shim for TSN
On Wed, Jun 15, 2016 at 01:49:08PM +0200, Richard Cochran wrote: > Now that I understand better... > > On Sun, Jun 12, 2016 at 01:01:35AM +0200, Henrik Austad wrote: > > Userspace is supposed to reserve bandwidth, find StreamID etc. > > > > To use as a Talker: > > > > mkdir /config/tsn/test/eth0/talker > > cd /config/tsn/test/eth0/talker > > echo 65535 > buffer_size > > echo 08:00:27:08:9f:c3 > remote_mac > > echo 42 > stream_id > > echo alsa > enabled > > This is exactly why configfs is the wrong interface. If you implement > the AVB device in alsa-lib user space, then you can handle the > reservations, configuration, UDP sockets, etc, in a way transparent to > the aplay program. And how would v4l2 benefit from this being in alsa-lib? Should we require both V4L and ALSA to implement the same thing, or should we place it in a common place for all? And what about those systems that want to use TSN but are not media devices? They should be given a raw socket to send traffic over; should they also implement something in a library? So no, here I think configfs is an apt choice. > Heck, if done properly, your layer could discover the AVB nodes in the > network and present each one as a separate device... No, you definitely do not want the kernel to automagically add devices whenever something pops up on the network; for this you need userspace to be in control. 1722.1 should not be handled in-kernel. -- Henrik Austad signature.asc Description: Digital signature
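For a userspace daemon that has already done the reservation work, the quoted Talker setup could also be driven from C rather than a shell. The sketch below mirrors the `mkdir` and `echo` steps from the quoted example; the attribute names and values are taken from that example, while `write_attr()` and `setup_talker()` are hypothetical helpers and not part of the posted series.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Write one attribute value, equivalent to `echo value > dir/name`. */
static int write_attr(const char *dir, const char *name, const char *value)
{
	char path[512];
	FILE *f;
	int n;

	n = snprintf(path, sizeof(path), "%s/%s", dir, name);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;
	if (fprintf(f, "%s\n", value) < 0) {
		fclose(f);
		return -1;
	}
	return fclose(f) == 0 ? 0 : -1;
}

/*
 * Create the link directory (its parent must already exist) and fill in
 * the attributes in the same order as the quoted shell commands.
 * Returns 0 on success, -1 on the first failure.
 */
static int setup_talker(const char *dir)
{
	if (mkdir(dir, 0755) != 0 && errno != EEXIST)
		return -1;
	if (write_attr(dir, "buffer_size", "65535") ||
	    write_attr(dir, "remote_mac", "08:00:27:08:9f:c3") ||
	    write_attr(dir, "stream_id", "42") ||
	    write_attr(dir, "enabled", "alsa"))
		return -1;
	return 0;
}
```

A caller would invoke `setup_talker("/config/tsn/test/eth0/talker")` once the StreamID and bandwidth reservation are in place, which keeps that policy in userspace as argued above.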
Re: [very-RFC 0/8] TSN driver for the kernel
On Wed, Jun 15, 2016 at 09:04:41AM +0200, Richard Cochran wrote: > On Tue, Jun 14, 2016 at 10:38:10PM +0200, Henrik Austad wrote: > > Whereas I want to do > > > > aplay some_song.wav > > Can you please explain how your patches accomplish this? In short: modprobe tsn modprobe avb_alsa mkdir /sys/kernel/config/eth0/link cd /sys/kernel/config/eth0/link echo alsa > enabled aplay -Ddefault:CARD=avb some_song.wav Likewise on the receiver side, except add 'Listener' to end_station attribute arecord -c2 -r48000 -f S16_LE -Ddefault:CARD=avb > some_recording.wav I've not had time to fully fix the hw-params for alsa, so some manual tweaking of arecord is required. Again, this is a very early attempt to get something useful done with TSN, I know there are rough edges, I know buffer handling and timestamping is not finished. Note: if you don't have an Intel card, load tsn in debug-mode and it will let you use all NICs present. modprobe tsn in_debug=1 -- Henrik Austad signature.asc Description: Digital signature
Re: [very-RFC 0/8] TSN driver for the kernel
On Tue, Jun 14, 2016 at 08:26:15PM +0200, Richard Cochran wrote: > On Tue, Jun 14, 2016 at 11:30:00AM +0200, Henrik Austad wrote: > > So loop data from kernel -> userspace -> kernelspace and finally back to > > userspace and the media application? > > Huh? I wonder where you got that idea. Let me show an example of > what I mean. > > void listener() > { > int in = socket(); > int out = open("/dev/dsp"); > char buf[]; > > while (1) { > recv(in, buf, packetsize); > write(out, buf + offset, datasize); > } > } > > See? Where is your media-application in this? You only loop the audio from network to the dsp, is the media-application attached to the dsp-device? Whereas I want to do aplay some_song.wav or mplayer or spotify or .. > > Yes, I know some audio apps "use networking", I can stream netradio, I can > > use jack to connect devices using RTP and probably a whole lot of other > > applications do similar things. However, AVB is more about using the > > network as a virtual sound-card. > > That is news to me. I don't recall ever having seen AVB described > like that before. > > > For the media application, it should not > > have to care if the device it is using is a soundcard inside the box or a > > set of AVB-capable speakers somewhere on the network. > > So you would like a remote listener to appear in the system as a local > PCM audio sink? And a remote talker would be like a local media URL? > Sounds unworkable to me, but even if you were to implement it, the > logic would surely belong in alsa-lib and not in the kernel. Behind > the emulated device, the library would run a loop like the example, > above. > > In any case, your patches don't implement that sort of thing at all, > do they? Subject: [very-RFC 7/8] AVB ALSA - Add ALSA shim for TSN Did you even bother to look? -- Henrik Austad signature.asc Description: Digital signature
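For readers who want to try the approach in the quoted `listener()` pseudocode, the loop body can be fleshed out into something that compiles. This is only a sketch of the userspace forwarding step: the file descriptors are supplied by the caller, the fixed 1500-byte buffer is an assumption, and `hdr_len` stands in for `offset` in the quote. It does nothing about timing, reservations, or stream setup.

```c
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Receive one packet from sock_fd, skip hdr_len bytes of header and write
 * the remaining payload to out_fd. Returns the number of payload bytes
 * written, 0 if the packet carried no payload, or -1 on error.
 */
static ssize_t forward_one_packet(int sock_fd, int out_fd, size_t hdr_len)
{
	unsigned char buf[1500];
	ssize_t len, done = 0;

	len = recv(sock_fd, buf, sizeof(buf), 0);
	if (len < 0)
		return -1;
	if ((size_t)len <= hdr_len)
		return 0;

	/* write() may be partial, so loop until the payload is out. */
	len -= (ssize_t)hdr_len;
	while (done < len) {
		ssize_t n = write(out_fd, buf + hdr_len + done,
				  (size_t)(len - done));
		if (n < 0)
			return -1;
		done += n;
	}
	return done;
}
```

Calling this in a `while (1)` loop gives the behaviour of the quoted example. Note that the audio still crosses the kernel/userspace boundary on the way in and again on the way out, which is the round trip being debated in this thread.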
Re: [very-RFC 0/8] TSN driver for the kernel
On Mon, Jun 13, 2016 at 09:32:10PM +0200, Richard Cochran wrote:
> On Mon, Jun 13, 2016 at 03:00:59PM +0200, Henrik Austad wrote:
> > On Mon, Jun 13, 2016 at 01:47:13PM +0200, Richard Cochran wrote:
> > > Which driver is that?
> >
> > drivers/net/ethernet/renesas/
>
> That driver is merely a PTP capable MAC driver, nothing more.
> Although AVB is in the device name, the driver doesn't implement
> anything beyond the PTP bits.

Yes, I think they do the rest from userspace, not sure though :)

> > What is the rationale for no new sockets? To avoid cluttering? or do
> > sockets have a drawback I'm not aware of?
>
> The current raw sockets will work just fine. Again, there should be a
> application that sits in between with the network socket and the audio
> interface.

So loop data from kernel -> userspace -> kernelspace and finally back to userspace and the media application?

I agree that you need a way to pipe the incoming data directly from the network to userspace for those TSN users that can handle it. But again, for media-applications that don't know (or care) about AVB, it should be fed to ALSA/v4l2 directly and not jump between kernel and userspace an extra round.

I get the point of not including every single audio/video encoder in the kernel, but raw audio should be piped directly to ALSA. V4L2 has a way of piping encoded video through the system and to the media application (in order to support cameras that do encoding). The same approach should be doable for AVB, no? (someone from alsa/v4l2 should probably comment on this)

> > Why is configfs wrong?
>
> Because the application will use the already existing network and
> audio interfaces to configure the system.

Configuring this via the audio-interface is going to be a challenge, since you need to configure the stream through the network before you can create the audio interface. If not, you will have to either drop data or block the caller until the link has been fully configured.

This is actually the reason why configfs is used in the series now, as it allows userspace to figure out all the different attributes and configure the link before letting ALSA start pushing data.

> > > Lets take a look at the big picture. One aspect of TSN is already
> > > fully supported, namely the gPTP. Using the linuxptp user stack and a
> > > modern kernel, you have a complete 802.1AS-2011 solution.
> >
> > Yes, I thought so, which is also why I have put that to the side and why
> > I'm using ktime_get() for timestamps at the moment. There's also the issue
> > of hooking the time into ALSA/V4L2
>
> So lets get that issue solved before anything else. It is absolutely
> essential for TSN. Without the synchronization, you are only playing
> audio over the network. We already have software for that.

Yes, I agree, presentation-time and local time need to be handled properly. The same goes for adjusting sample-rate etc. This is a lot of work, so I hope you can understand why I started out with a simple approach to spark a discussion before moving on to the larger bits.

> > > 2. A user space audio application that puts it all together, making
> > >    use of the services in #1, the linuxptp gPTP service, the ALSA
> > >    services, and the network connections. This program will have all
> > >    the knowledge about packet formats, AV encodings, and the local HW
> > >    capabilities. This program cannot yet be written, as we still need
> > >    some kernel work in the audio and networking subsystems.
> >
> > Why?
>
> Because user space is right place to place the knowledge of the myriad
> formats and options.

See response above; better to let anything but uncompressed raw data trickle through.

> > the whole point should be to make it as easy for userspace as
> > possible. If you need to tailor each individual media-application to use
> > AVB, it is not going to be very useful outside pro-Audio. Sure, there will
> > be challenges, but one key element here should be to *not* require
> > upgrading every single media application.
> >
> > Then, back to the suggestion of adding a TSN_SOCKET (which you didn't like,
> > but can we agree on a term "raw interface to TSN", and mode of transport
> > can be defined later? ), was to let those applications that are TSN-aware
> > to do what they need to do, whether it is controlling robots or media
> > streams.
>
> First you say you don't want to upgrade media applications, but then
> you invent a new socket type. That is a contradiction in terms.

Hehe, no, bad phrasing on my part. I want *both* (hence the shim-interface) :)

> Audio apps already use networking, and the
Re: [very-RFC 0/8] TSN driver for the kernel
On Mon, Jun 13, 2016 at 08:56:44AM -0700, John Fastabend wrote:
> On 16-06-13 04:47 AM, Richard Cochran wrote:
> > [...]
> > Here is what is missing to support audio TSN:
> >
> > * User Space
> >
> > 1. A proper userland stack for AVDECC, MAAP, FQTSS, and so on. The
> >    OpenAVB project does not offer much beyond simple examples.
> >
> > 2. A user space audio application that puts it all together, making
> >    use of the services in #1, the linuxptp gPTP service, the ALSA
> >    services, and the network connections. This program will have all
> >    the knowledge about packet formats, AV encodings, and the local HW
> >    capabilities. This program cannot yet be written, as we still need
> >    some kernel work in the audio and networking subsystems.
> >
> > * Kernel Space
> >
> > 1. Providing frames with a future transmit time. For normal sockets,
> >    this can be in the CMESG data. For mmap'ed buffers, we will need a
> >    new format. (I think Arnd is working on a new layout.)
> >
> > 2. Time based qdisc for transmitted frames. For MACs that support
> >    this (like the i210), we only have to place the frame into the
> >    correct queue. For normal HW, we want to be able to reserve a time
> >    window in which non-TSN frames are blocked. This is some work, but
> >    in the end it should be a generic solution that not only works
> >    "perfectly" with TSN HW but also provides best effort service using
> >    any NIC.
> >
>
> When I looked at this awhile ago I convinced myself that it could fit
> fairly well into the DCB stack (DCB is also part of 802.1Q). A lot of
> the traffic class to queue mappings and priorities could be handled here.
> It might be worth taking a look at ./net/sched/mqprio.c and ./net/dcb/.

Interesting, I'll have a look at dcb and mqprio; I'm not familiar with those systems. Thanks for pointing those out!

I hope that the complexity doesn't run crazy though. TSN is not aimed at datacenters; a lot of the endpoints are going to be embedded devices, so introducing a massive stack for handling every eventuality in 802.1Q is going to be counterproductive.

> Unfortunately I didn't get too far along but we probably don't want
> another mechanism to map hw queues/tcs/etc if the existing interfaces
> work or can be extended to support this.

Sure, I get that, as long as the complexity for setting up a link doesn't go through the roof :)

Thanks!

--
Henrik Austad
signature.asc Description: Digital signature
Re: [very-RFC 0/8] TSN driver for the kernel
On Mon, Jun 13, 2016 at 01:47:13PM +0200, Richard Cochran wrote:
> Henrik,

Hi Richard,

> On Sun, Jun 12, 2016 at 01:01:28AM +0200, Henrik Austad wrote:
> > There is at least one AVB-driver (the AV-part of TSN) in the kernel
> > already,
>
> Which driver is that?

drivers/net/ethernet/renesas/

> > however this driver aims to solve a wider scope as TSN can do
> > much more than just audio. A very basic ALSA-driver is added to the end
> > that allows you to play music between 2 machines using aplay in one end
> > and arecord | aplay on the other (some fiddling required) We have plans
> > for doing the same for v4l2 eventually (but there are other fishes to
> > fry first). The same goes for a TSN_SOCK type approach as well.
>
> Please, no new socket type for this.

The idea was to create a tsn-driver and then allow userspace to use it either for media or for whatever else they'd like, and then a socket made sense. Or so I thought :)

What is the rationale for no new sockets? To avoid cluttering? Or do sockets have a drawback I'm not aware of?

> > What remains
> > - tie to (g)PTP properly, currently using ktime_get() for presentation
> >   time
> > - get time from shim into TSN and vice versa
>
> ... and a whole lot more, see below.
>
> > - let shim create/manage buffer
>
> (BTW, shim is a terrible name for that.)

So something thin that is placed between two subsystems should rather be called.. flimsy? The point of the name was to indicate that it glues 2 pieces together. If you have a better suggestion, I'm all ears.

> [sigh]
>
> People have been asking me about TSN and Linux, and we've made some
> thoughts about it. The interest is there, and so I am glad to see
> discussion on this topic.

I'm not aware of any such discussions; could you point me to where TSN has been discussed? It would be nice to see other people's thoughts on the matter (which was one of the ideas behind this series in the first place).

> Having said that, your series does not even begin to address the real
> issues.

Well, in all honesty, I did say so :) It is marked as "very-RFC", and not for being included in the kernel as-is. I also made a short list of the most crucial bits missing.

I know there are real issues, but solving these won't matter if you don't have anything useful to do with it. I decided to start by adding a thin ALSA-driver and then continue to work with the kernel infrastructure. Having something that works-ish makes it a lot easier to test and get others interested in, especially when you are not deeply involved in a subsystem. At one point you get to where you need input from others more intimate with the inner workings of the different subsystems to see how things should be created without making too much of a mess.

So here we are :)

My primary motivation was to
a) gather feedback (which you have provided, and for which I am very grateful)
b) get the discussion going on how/if TSN should be added to the kernel

> I did not review the patches too carefully (because the
> important stuff is missing), but surely configfs is the wrong
> interface for this.

Why is configfs wrong?

Unless you want to implement discovery and enumeration and SRP negotiation in the kernel, you need userspace to handle this. Once userspace has done all that (found priority-codes, streamIDs, vlanIDs and all the required bits), then userspace can create a new link. For that I find ConfigFS to be quite useful and up to the task.

In my opinion, it also makes for a much tidier and saner interface than some obscure dark-magic ioctl().

> In the end, we will be able to support TSN using
> the existing networking and audio interfaces, adding appropriate
> extensions.

I surely hope so, but as I'm not deep into the networking part of the kernel, finding those appropriate extensions is hard, which is why we started writing a standalone module.

> Your patch features a buffer shared by networking and audio. This
> isn't strictly necessary for TSN, and it may be harmful.

At one stage, data has to flow in/out of the network, and whoever is using TSN probably needs to store data somewhere as well, so you need some form of buffering at one place in the path the data flows through.

That being said, one of the bits on my plate is to remove the "TSN-hosted-buffer" and let TSN read/write data via the shim_ops. What the best set of functions is remains to be seen, but it should provide a way to move data from either a single frame or a "few frames" to the shim (err.. ;)

> The
> Listeners are supposed to calculate the delay from frame reception to
> the DA conversion. They can easily include the time needed for a user
> space program to parse the frames, copy (and combine/convert) the
> data, and re
Re: [very-RFC 6/8] Add TSN event-tracing
On Sun, Jun 12, 2016 at 10:22:01PM -0400, Steven Rostedt wrote:
> On Sun, 12 Jun 2016 23:25:10 +0200
> Henrik Austad <hen...@austad.us> wrote:
>
> > > > +#include
> > > > +#include
> > > > +/* #include */
> > > > +
> > > > +/* FIXME: update to TRACE_CLASS to reduce overhead */
> > >
> > > I'm curious to why I didn't do this now. A class would make less
> > > duplication of typing too ;-)
> >
> > Yeah, I found this in a really great article written by some tracing-dude,
> > I hear he talks really, really fast!
>
> I plead the 5th!
>
> >
> > https://lwn.net/Articles/381064/
> >
> > > > +TRACE_EVENT(tsn_buffer_write,
> > > > +
> > > > +        TP_PROTO(struct tsn_link *link,
> > > > +                size_t bytes),
> > > > +
> > > > +        TP_ARGS(link, bytes),
> > > > +
> > > > +        TP_STRUCT__entry(
> > > > +                __field(u64, stream_id)
> > > > +                __field(size_t, size)
> > > > +                __field(size_t, bsize)
> > > > +                __field(size_t, size_left)
> > > > +                __field(void *, buffer)
> > > > +                __field(void *, head)
> > > > +                __field(void *, tail)
> > > > +                __field(void *, end)
> > > > +                ),
> > > > +
> > > > +        TP_fast_assign(
> > > > +                __entry->stream_id = link->stream_id;
> > > > +                __entry->size = bytes;
> > > > +                __entry->bsize = link->used_buffer_size;
> > > > +                __entry->size_left = (link->head - link->tail) %
> > > > link->used_buffer_size;
> > >
> > > Move this logic into the print statement, since you save head and tail.
> >
> > Ok, any particular reason?
>
> Because it removes calculations during the trace. The calculations done
> in TP_printk() are done at the time of reading the trace, and
> calculations done in TP_fast_assign() are done during the recording and
> hence adding more overhead to the trace itself.

Aha! That makes sense, thanks!

(/me goes and updates the tracing-part)

-Henrik
signature.asc Description: Digital signature
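The split Steven describes can be illustrated outside the kernel with a small userspace analogue. This is a sketch only, not the tracing API: `struct buf_event`, `record_event()`, `bytes_between()` and `format_event()` are hypothetical names, and head/tail are modelled as offsets into the buffer rather than the pointers the patch stores. The record path does nothing but copy raw values (the role of TP_fast_assign()), and the derived number is computed only when the event is printed (the role of TP_printk()).

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* What is stored per event: raw values only, no arithmetic. */
struct buf_event {
	size_t head;	/* assumed: offset of the write position */
	size_t tail;	/* assumed: offset of the read position */
	size_t bsize;	/* buffer size in bytes */
};

/* Hot path, runs while tracing: plain copies, like TP_fast_assign(). */
static void record_event(struct buf_event *ev, size_t head, size_t tail,
			 size_t bsize)
{
	ev->head = head;
	ev->tail = tail;
	ev->bsize = bsize;
}

/* Distance from tail to head, taking wrap-around into account. */
static size_t bytes_between(const struct buf_event *ev)
{
	assert(ev->bsize > 0);
	if (ev->head >= ev->tail)
		return ev->head - ev->tail;
	return ev->bsize - (ev->tail - ev->head);
}

/* Read path, runs when the trace is read: like TP_printk(). */
static int format_event(const struct buf_event *ev, char *out, size_t len)
{
	return snprintf(out, len, "avail=%zu [head=%zu, tail=%zu, bsize=%zu]",
			bytes_between(ev), ev->head, ev->tail, ev->bsize);
}
```

The explicit wrap-around branch is a choice of this sketch. If the buffer size in the quoted `(head - tail) % size` expression has an unsigned type, a negative pointer difference is converted to unsigned before the modulo, which only gives the intended result when the size is a power of two.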
Re: [very-RFC 6/8] Add TSN event-tracing
On Sun, Jun 12, 2016 at 12:58:03PM -0400, Steven Rostedt wrote:
> On Sun, 12 Jun 2016 01:01:34 +0200
> Henrik Austad <hen...@austad.us> wrote:
>
> > From: Henrik Austad <haus...@cisco.com>
> >
> > This needs refactoring and should be updated to use TRACE_CLASS, but for
> > now it provides a fair debug-window into TSN.
> >
> > Cc: "David S. Miller" <da...@davemloft.net>
> > Cc: Steven Rostedt <rost...@goodmis.org> (maintainer:TRACING)
> > Cc: Ingo Molnar <mi...@redhat.com> (maintainer:TRACING)
> > Signed-off-by: Henrik Austad <haus...@cisco.com>
> > ---
> >  include/trace/events/tsn.h | 349 +
> >  1 file changed, 349 insertions(+)
> >  create mode 100644 include/trace/events/tsn.h
> >
> > diff --git a/include/trace/events/tsn.h b/include/trace/events/tsn.h
> > new file mode 100644
> > index 000..ac1f31b
> > --- /dev/null
> > +++ b/include/trace/events/tsn.h
> > @@ -0,0 +1,349 @@
> > +#undef TRACE_SYSTEM
> > +#define TRACE_SYSTEM tsn
> > +
> > +#if !defined(_TRACE_TSN_H) || defined(TRACE_HEADER_MULTI_READ)
> > +#define _TRACE_TSN_H
> > +
> > +#include
> > +#include
> > +
> > +#include
> > +#include
> > +/* #include */
> > +
> > +/* FIXME: update to TRACE_CLASS to reduce overhead */
>
> I'm curious to why I didn't do this now. A class would make less
> duplication of typing too ;-)

Yeah, I found this in a really great article written by some tracing-dude,
I hear he talks really, really fast!

https://lwn.net/Articles/381064/

> > +TRACE_EVENT(tsn_buffer_write,
> > +
> > +	TP_PROTO(struct tsn_link *link,
> > +		size_t bytes),
> > +
> > +	TP_ARGS(link, bytes),
> > +
> > +	TP_STRUCT__entry(
> > +		__field(u64, stream_id)
> > +		__field(size_t, size)
> > +		__field(size_t, bsize)
> > +		__field(size_t, size_left)
> > +		__field(void *, buffer)
> > +		__field(void *, head)
> > +		__field(void *, tail)
> > +		__field(void *, end)
> > +		),
> > +
> > +	TP_fast_assign(
> > +		__entry->stream_id = link->stream_id;
> > +		__entry->size = bytes;
> > +		__entry->bsize = link->used_buffer_size;
> > +		__entry->size_left = (link->head - link->tail) % link->used_buffer_size;
>
> Move this logic into the print statement, since you save head and tail.

Ok, any particular reason?

> > +		__entry->buffer = link->buffer;
> > +		__entry->head = link->head;
> > +		__entry->tail = link->tail;
> > +		__entry->end = link->end;
> > +		),
> > +
> > +	TP_printk("stream_id=%llu, copy=%zd, buffer: %zd, avail=%zd, [buffer=%p, head=%p, tail=%p, end=%p]",
> > +		__entry->stream_id, __entry->size, __entry->bsize, __entry->size_left,
>
> 		__entry->stream_id, __entry->size, __entry->bsize,
> 		(__entry->head - __entry->tail) % __entry->bsize,
>

Ok, so is this about saving space by dropping one intermediate value, or is
it some other point I'm missing here?

> > +		__entry->buffer,__entry->head, __entry->tail, __entry->end)
> > +
> > +	);
> > +
> > +TRACE_EVENT(tsn_buffer_write_net,
> > +
> > +	TP_PROTO(struct tsn_link *link,
> > +		size_t bytes),
> > +
> > +	TP_ARGS(link, bytes),
> > +
> > +	TP_STRUCT__entry(
> > +		__field(u64, stream_id)
> > +		__field(size_t, size)
> > +		__field(size_t, bsize)
> > +		__field(size_t, size_left)
> > +		__field(void *, buffer)
> > +		__field(void *, head)
> > +		__field(void *, tail)
> > +		__field(void *, end)
> > +		),
> > +
> > +	TP_fast_assign(
> > +		__entry->stream_id = link->stream_id;
> > +		__entry->size = bytes;
> > +		__entry->bsize = link->used_buffer_size;
> > +		__entry->size_left = (link->head - link->tail) % link->used_buffer_size;
> > +		__entry->buffer = link->buffer;
> > +		__entry->head = link->head;
> > +		__entry->tail = link->tail;
> > +		__entry->end = link->end;
> > +		),
> > +
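The review comment above (store head and tail, derive the available byte count when printing) can be sketched in plain userspace C. This is only an illustration under stated assumptions: `tsn_buf_record`, `record_avail` and `record_format` are made-up names, head and tail are modelled as byte offsets rather than the `void *` fields of the patch, and adding `bsize` before the modulo is my own choice to keep the unsigned arithmetic well defined when tail is ahead of head. The usual argument for this split is that the assign step runs in the traced path, while formatting only runs when the trace is read.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical record: only raw values are stored at "event time". */
struct tsn_buf_record {
	size_t bsize;	/* size of the ring buffer in bytes */
	size_t head;	/* write offset into the buffer */
	size_t tail;	/* read offset into the buffer */
};

/* Derived value, computed when the record is printed, not when stored. */
size_t record_avail(const struct tsn_buf_record *r)
{
	assert(r->bsize > 0);
	assert(r->head < r->bsize && r->tail < r->bsize);
	/* Adding bsize keeps the unsigned subtraction from wrapping. */
	return (r->head + r->bsize - r->tail) % r->bsize;
}

/* Format the record; returns whatever snprintf() reports. */
int record_format(char *out, size_t len, const struct tsn_buf_record *r)
{
	return snprintf(out, len, "buffer: %zu, avail=%zu, [head=%zu, tail=%zu]",
			r->bsize, record_avail(r), r->head, r->tail);
}
```

The stored record shrinks by one field, and the modulo is only paid for when somebody actually reads the event.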
Re: [very-RFC 5/8] Add TSN machinery to drive the traffic from a shim over the network
On Sun, Jun 12, 2016 at 12:35:10AM -0700, Joe Perches wrote:
> On Sun, 2016-06-12 at 00:22 +0200, Henrik Austad wrote:
> > From: Henrik Austad <haus...@cisco.com>
> >
> > In short summary:
> >
> > * tsn_core.c is the main driver of tsn, all new links go through
> >   here and all data to/form the shims are handled here
> >   core also manages the shim-interface.
> []
> > diff --git a/net/tsn/tsn_configfs.c b/net/tsn/tsn_configfs.c
> []
> > +static inline struct tsn_link *to_tsn_link(struct config_item *item)
> > +{
> > +	/* this line causes checkpatch to WARN. making checkpatch happy,
> > +	 * makes code messy..
> > +	 */
> > +	return item ? container_of(to_config_group(item), struct tsn_link, group) : NULL;
> > +}
>
> How about
>
> static inline struct tsn_link *to_tsn_link(struct config_item *item)
> {
> 	if (!item)
> 		return NULL;
> 	return container_of(to_config_group(item), struct tsn_link, group);
> }

Yes, I mulled over this for a while, but I got the impression that the
ternary-approach was the way used in configfs, and I tried staying in line
with that in tsn_configfs.

If you see other parts of the TSN-code, I tend to use the

if (!item)
	...

approach. So, I don't have any technical preferences either way really

--
Henrik Austad
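For readers who want to compare the two forms discussed above outside the kernel, here is a small userspace sketch. It assumes a local stand-in for `container_of()` built on `offsetof`, and simplified, hypothetical versions of `struct config_group` and `struct tsn_link`; the real helper takes a `struct config_item *` and goes through `to_config_group()`, which is left out here.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Simplified, hypothetical stand-ins for the configfs/TSN types. */
struct config_group {
	int dummy;
};

struct tsn_link {
	int id;
	struct config_group group;
};

/* Ternary form, as in the posted patch. */
struct tsn_link *to_tsn_link_ternary(struct config_group *grp)
{
	return grp ? container_of(grp, struct tsn_link, group) : NULL;
}

/* Early-return form, as suggested in the review. */
struct tsn_link *to_tsn_link_if(struct config_group *grp)
{
	if (!grp)
		return NULL;
	return container_of(grp, struct tsn_link, group);
}
```

Both variants behave the same for NULL and non-NULL input, so the choice is purely about line length and local style.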
[very-RFC 4/8] Add TSN header for the driver
From: Henrik Austad <haus...@cisco.com>

This defines the general TSN headers for network packets, the
shim-interface and the central 'tsn_list' structure.

Cc: "David S. Miller" <da...@davemloft.net>
Signed-off-by: Henrik Austad <haus...@cisco.com>
---
 include/linux/tsn.h | 806
 1 file changed, 806 insertions(+)
 create mode 100644 include/linux/tsn.h

diff --git a/include/linux/tsn.h b/include/linux/tsn.h
new file mode 100644
index 000..0e1f732b
--- /dev/null
+++ b/include/linux/tsn.h
@@ -0,0 +1,806 @@
+/*   TSN - Time Sensitive Networking
+ *
+ *   Copyright (C) 2016- Henrik Austad <haus...@cisco.com>
+ *
+ *   This program is free software; you can redistribute it and/or modify
+ *   it under the terms of the GNU General Public License as published by
+ *   the Free Software Foundation; either version 2 of the License, or
+ *   (at your option) any later version.
+ *
+ *   This program is distributed in the hope that it will be useful,
+ *   but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *   GNU General Public License for more details.
+ */
+#ifndef _TSN_H
+#define _TSN_H
+#include
+#include
+#include
+
+/* The naming here can be a bit confusing as we call it TSN but naming
+ * suggests 'AVB'. Reason: IEEE 1722 was written before the working group
+ * was renamed to Time Sensitive Networking.
+ *
+ * To be precise, TSN describes the protocol for shipping data, AVB is a
+ * medialayer which you can build on top of TSN.
+ *
+ * For this reason the frames are given avb-names whereas the functions
+ * use tsn_-naming.
+ */
+
+/* 7 bit value 0x00 - 0x7F */
+enum avtp_subtype {
+	AVTP_61883_IIDC = 0,
+	AVTP_MMA = 0x1,
+	AVTP_MAAP = 0x7e,
+	AVTP_EXPERIMENTAL = 0x7f,
+};
+
+/* NOTE NOTE NOTE !!
+ * The headers below use bitfields extensively and verifications
+ * are needed when using little-endian vs big-endian systems.
+ */
+
+/* Common part of avtph header
+ *
+ * AVB Transport Protocol Common Header
+ *
+ * Defined in 1722-2011 Sec. 5.2
+ */
+struct avtp_ch {
+#if defined(__LITTLE_ENDIAN_BITFIELD)
+	/* use avtp_subtype enum.
+	 */
+	u8 subtype:7;
+
+	/* Controlframe: 1
+	 * Dataframe : 0
+	 */
+	u8 cd:1;
+
+	/* Type specific data, part 1 */
+	u8 tsd_1:4;
+
+	/* In current version of AVB, only 0 is valid, all other values
+	 * are reserved for future versions.
+	 */
+	u8 version:3;
+
+	/* Valid StreamID in frame
+	 *
+	 * ControlData not related to a specific stream should clear
+	 * this (and have stream_id = 0), _all_ other values should set
+	 * this to 1.
+	 */
+	u8 sv:1;
+#elif defined(__BIG_ENDIAN_BITFIELD)
+	u8 cd:1;
+	u8 subtype:7;
+	u8 sv:1;
+	u8 version:3;
+	u8 tsd_1:4;
+#else
+#error "Unknown Endianness, cannot determine bitfield ordering"
+#endif
+	/* Type specific data (adjacent to tsd_1, but split due to bitfield) */
+	u16 tsd_2;
+	u64 stream_id;
+
+	/*
+	 * payload by subtype
+	 */
+	u8 pbs[0];
+} __packed;
+
+/* AVTPDU Common Control header format
+ * IEEE 1722#5.3
+ */
+struct avtpc_header {
+#if defined(__LITTLE_ENDIAN_BITFIELD)
+	u8 subtype:7;
+	u8 cd:1;
+	u8 control_data:4;
+	u8 version:3;
+	u8 sv:1;
+	u16 control_data_length:11;
+	u16 status:5;
+#elif defined(__BIG_ENDIAN_BITFIELD)
+	u8 cd:1;
+	u8 subtype:7;
+	u8 sv:1;
+	u8 version:3;
+	u8 control_data:4;
+	u16 status:5;
+	u16 control_data_length:11;
+#else
+#error "Unknown Endianness, cannot determine bitfield ordering"
+#endif
+	u64 stream_id;
+} __packed;
+
+/* AVTP common stream data AVTPDU header format
+ * IEEE 1722#5.4
+ */
+struct avtpdu_header {
+#if defined(__LITTLE_ENDIAN_BITFIELD)
+	u8 subtype:7;
+	u8 cd:1;
+
+	/* avtp_timestamp valid */
+	u8 tv: 1;
+
+	/* gateway_info valid */
+	u8 gv:1;
+
+	/* reserved */
+	u8 r:1;
+
+	/*
+	 * Media clock Restart toggle
+	 */
+	u8 mr:1;
+
+	u8 version:3;
+
+	/* StreamID valid */
+	u8 sv:1;
+	u8 seqnr;
+
+	/* Timestamp uncertain */
+	u8 tu:1;
+	u8 r2:7;
+#elif defined(__BIG_ENDIAN_BITFIELD)
+	u8 cd:1;
+	u8 subtype:7;
+
+	u8 sv:1;
+	u8 version:3;
+	u8 mr:1;
+	u8 r:1;
+	u8 gv:1;
+	u8 tv: 1;
+
+	u8 seqnr;
+	u8 r2:7;
+	u8 tu:1;
+#else
+#error "Unknown Endianness, cannot determine bitfield ordering"
+#endif
+
+	u64 stream_id;
+
+	u32 avtp_timestamp;
+	u32 gateway_info;
+
+	/* Stream Data Length */
+	u16 sd_len;
+
+	/* Protocol specific header, derived from avt
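The header warns that the bitfield layouts need verification across endianness. One way to cross-check them is to decode the same octets with explicit shifts and masks, which does not depend on how the compiler orders bitfields. The sketch below is not part of the patch: `avtp_ch_fields` and `avtp_ch_decode` are hypothetical names, and the bit positions are simply read off the `__BIG_ENDIAN_BITFIELD` branch of `struct avtp_ch` (first declared field in the most significant bits). It covers only the first two octets; the multi-byte fields would additionally need byte-order conversion.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical decoded view of the first two octets of the common header. */
struct avtp_ch_fields {
	uint8_t cd;		/* 1 = control frame, 0 = data frame */
	uint8_t subtype;	/* 7 bit avtp_subtype value */
	uint8_t sv;		/* stream_id valid */
	uint8_t version;	/* 3 bit version */
	uint8_t tsd_1;		/* 4 bit type specific data, part 1 */
};

/* Decode from raw wire octets using shifts and masks only. */
struct avtp_ch_fields avtp_ch_decode(const uint8_t *wire)
{
	struct avtp_ch_fields f;

	assert(wire);
	f.cd      = (wire[0] >> 7) & 0x01;
	f.subtype =  wire[0]       & 0x7f;
	f.sv      = (wire[1] >> 7) & 0x01;
	f.version = (wire[1] >> 4) & 0x07;
	f.tsd_1   =  wire[1]       & 0x0f;
	return f;
}
```

Comparing the output of such a decoder against the bitfield struct on both a little-endian and a big-endian build is a cheap way to do the verification the comment asks for.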
[very-RFC 6/8] Add TSN event-tracing
From: Henrik Austad <haus...@cisco.com>

This needs refactoring and should be updated to use TRACE_CLASS, but for
now it provides a fair debug-window into TSN.

Cc: "David S. Miller" <da...@davemloft.net>
Cc: Steven Rostedt <rost...@goodmis.org> (maintainer:TRACING)
Cc: Ingo Molnar <mi...@redhat.com> (maintainer:TRACING)
Signed-off-by: Henrik Austad <haus...@cisco.com>
---
 include/trace/events/tsn.h | 349 +
 1 file changed, 349 insertions(+)
 create mode 100644 include/trace/events/tsn.h

diff --git a/include/trace/events/tsn.h b/include/trace/events/tsn.h
new file mode 100644
index 000..ac1f31b
--- /dev/null
+++ b/include/trace/events/tsn.h
@@ -0,0 +1,349 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM tsn
+
+#if !defined(_TRACE_TSN_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_TSN_H
+
+#include
+#include
+
+#include
+#include
+/* #include */
+
+/* FIXME: update to TRACE_CLASS to reduce overhead */
+TRACE_EVENT(tsn_buffer_write,
+
+	TP_PROTO(struct tsn_link *link,
+		size_t bytes),
+
+	TP_ARGS(link, bytes),
+
+	TP_STRUCT__entry(
+		__field(u64, stream_id)
+		__field(size_t, size)
+		__field(size_t, bsize)
+		__field(size_t, size_left)
+		__field(void *, buffer)
+		__field(void *, head)
+		__field(void *, tail)
+		__field(void *, end)
+		),
+
+	TP_fast_assign(
+		__entry->stream_id = link->stream_id;
+		__entry->size = bytes;
+		__entry->bsize = link->used_buffer_size;
+		__entry->size_left = (link->head - link->tail) % link->used_buffer_size;
+		__entry->buffer = link->buffer;
+		__entry->head = link->head;
+		__entry->tail = link->tail;
+		__entry->end = link->end;
+		),
+
+	TP_printk("stream_id=%llu, copy=%zd, buffer: %zd, avail=%zd, [buffer=%p, head=%p, tail=%p, end=%p]",
+		__entry->stream_id, __entry->size, __entry->bsize, __entry->size_left,
+		__entry->buffer,__entry->head, __entry->tail, __entry->end)
+
+	);
+
+TRACE_EVENT(tsn_buffer_write_net,
+
+	TP_PROTO(struct tsn_link *link,
+		size_t bytes),
+
+	TP_ARGS(link, bytes),
+
+	TP_STRUCT__entry(
+		__field(u64, stream_id)
+		__field(size_t, size)
+		__field(size_t, bsize)
+		__field(size_t, size_left)
+		__field(void *, buffer)
+		__field(void *, head)
+		__field(void *, tail)
+		__field(void *, end)
+		),
+
+	TP_fast_assign(
+		__entry->stream_id = link->stream_id;
+		__entry->size = bytes;
+		__entry->bsize = link->used_buffer_size;
+		__entry->size_left = (link->head - link->tail) % link->used_buffer_size;
+		__entry->buffer = link->buffer;
+		__entry->head = link->head;
+		__entry->tail = link->tail;
+		__entry->end = link->end;
+		),
+
+	TP_printk("stream_id=%llu, copy=%zd, buffer: %zd, avail=%zd, [buffer=%p, head=%p, tail=%p, end=%p]",
+		__entry->stream_id, __entry->size, __entry->bsize, __entry->size_left,
+		__entry->buffer,__entry->head, __entry->tail, __entry->end)
+
+	);
+
+
+TRACE_EVENT(tsn_buffer_read,
+
+	TP_PROTO(struct tsn_link *link,
+		size_t bytes),
+
+	TP_ARGS(link, bytes),
+
+	TP_STRUCT__entry(
+		__field(u64, stream_id)
+		__field(size_t, size)
+		__field(size_t, bsize)
+		__field(size_t, size_left)
+		__field(void *, buffer)
+		__field(void *, head)
+		__field(void *, tail)
+		__field(void *, end)
+		),
+
+	TP_fast_assign(
+		__entry->stream_id = link->stream_id;
+		__entry->size = bytes;
+		__entry->bsize = link->used_buffer_size;
+		__entry->size_left = (link->head - link->tail) % link->used_buffer_size;
+		__entry->buffer = link->buffer;
+		__entry->head = link->head;
+		__entry->tail = link->tail;
+		__entry->end = link->end;
+		),
+
+	TP_printk("stream_id=%llu, copy=%zd, buffer: %zd, avail=%zd, [buffer=%p, head=%p, tail=%p, end=%p]",
+		__entry->stream_id, __entry->size, __entry->bsize, __entry->size_left,
+		__entry->buffer,__entry->head, __entry->
[very-RFC 5/8] Add TSN machinery to drive the traffic from a shim over the network
From: Henrik Austad <haus...@cisco.com>

In short summary:

* tsn_core.c is the main driver of tsn, all new links go through
  here and all data to/from the shims are handled here. The core
  also manages the shim-interface.

* tsn_configfs.c is the API to userspace. TSN is driven from userspace
  and a link is created, configured, enabled, disabled and removed
  purely from userspace. All attributes required must be determined by
  userspace, preferably via IEEE 1722.1 (discovery and enumeration).

* tsn_header.c small part that handles the actual header of the frames
  we send. Kept out of core for cleanliness.

* tsn_net.c handles operations towards the networking layer.

The current driver is under development. This means that from the moment it
is enabled with a shim, it will send traffic, either 0-traffic (frames of
reserved length but with payload 0) or actual traffic. This will change
once the driver stabilizes.

For more detail, see Documentation/networking/tsn/

Cc: "David S. Miller" <da...@davemloft.net>
Signed-off-by: Henrik Austad <haus...@cisco.com>
---
 net/Makefile           |   1 +
 net/tsn/Makefile       |   6 +
 net/tsn/tsn_configfs.c | 623 +++
 net/tsn/tsn_core.c     | 975 +
 net/tsn/tsn_header.c   | 203 ++
 net/tsn/tsn_internal.h | 383 +++
 net/tsn/tsn_net.c      | 403
 7 files changed, 2594 insertions(+)
 create mode 100644 net/tsn/Makefile
 create mode 100644 net/tsn/tsn_configfs.c
 create mode 100644 net/tsn/tsn_core.c
 create mode 100644 net/tsn/tsn_header.c
 create mode 100644 net/tsn/tsn_internal.h
 create mode 100644 net/tsn/tsn_net.c

diff --git a/net/Makefile b/net/Makefile
index bdd1455..c15482e 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -79,3 +79,4 @@ ifneq ($(CONFIG_NET_L3_MASTER_DEV),)
 obj-y += l3mdev/
 endif
 obj-$(CONFIG_QRTR) += qrtr/
+obj-$(CONFIG_TSN) += tsn/
diff --git a/net/tsn/Makefile b/net/tsn/Makefile
new file mode 100644
index 000..0d87687
--- /dev/null
+++ b/net/tsn/Makefile
@@ -0,0 +1,6 @@
+#
+# Makefile for the Linux TSN subsystem
+#
+
+obj-$(CONFIG_TSN) += tsn.o
+tsn-objs :=tsn_core.o tsn_configfs.o tsn_net.o tsn_header.o
diff --git a/net/tsn/tsn_configfs.c b/net/tsn/tsn_configfs.c
new file mode 100644
index 000..f3d0986
--- /dev/null
+++ b/net/tsn/tsn_configfs.c
@@ -0,0 +1,623 @@
+/*
+ *   ConfigFS interface to TSN
+ *   Copyright (C) 2015- Henrik Austad <haus...@cisco.com>
+ *
+ *   This program is free software; you can redistribute it and/or modify
+ *   it under the terms of the GNU General Public License as published by
+ *   the Free Software Foundation; either version 2 of the License, or
+ *   (at your option) any later version.
+ *
+ *   This program is distributed in the hope that it will be useful,
+ *   but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *   GNU General Public License for more details.
+ */
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include "tsn_internal.h"
+
+static inline struct tsn_link *to_tsn_link(struct config_item *item)
+{
+	/* this line causes checkpatch to WARN. making checkpatch happy,
+	 * makes code messy..
+	 */
+	return item ? container_of(to_config_group(item), struct tsn_link, group) : NULL;
+}
+
+static inline struct tsn_nic *to_tsn_nic(struct config_group *group)
+{
+	return group ? container_of(group, struct tsn_nic, group) : NULL;
+}
+
+/* ---
+ * Tier2 attributes
+ *
+ * The content of the links userspace can see/modify
+ * ---
+ */
+static ssize_t _tsn_max_payload_size_show(struct config_item *item,
+					  char *page)
+{
+	struct tsn_link *link = to_tsn_link(item);
+
+	if (!link)
+		return -EINVAL;
+	return sprintf(page, "%u\n", (u32)link->max_payload_size);
+}
+
+static ssize_t _tsn_max_payload_size_store(struct config_item *item,
+					   const char *page, size_t count)
+{
+	struct tsn_link *link = to_tsn_link(item);
+	u16 mpl_size = 0;
+	int ret = 0;
+
+	if (!link)
+		return -EINVAL;
+	if (tsn_link_is_on(link)) {
+		pr_err("ERROR: Cannot change Payload size on an enabled link\n");
+		return -EINVAL;
+	}
+	ret = kstrtou16(page, 0, &mpl_size);
+	if (ret)
+		return ret;
+
+	/* 802.1BA-2011 6.4 payload must be <1500 octets (excluding
+	 * headers, tags etc) However, this is not directly mappable to
+	 * how some hw handles things, so to be conservative, we
+	 * restrict it down to [26..1485]
+	 *
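The store handler above is cut off before the actual range check, so the following is only a userspace sketch of the parse-then-validate pattern it describes, not the driver's code. `parse_payload_size`, `TSN_MIN_PAYLOAD` and `TSN_MAX_PAYLOAD` are hypothetical names; the bounds 26 and 1485 are taken from the comment in the patch, `strtoul()` stands in for the kernel's `kstrtou16()`, and the choice of `-ERANGE` for out-of-range values is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define TSN_MIN_PAYLOAD 26	/* lower bound quoted in the patch comment */
#define TSN_MAX_PAYLOAD 1485	/* upper bound quoted in the patch comment */

/*
 * Parse a decimal/hex/octal string into *out and range-check it.
 * Returns 0 on success, -EINVAL on malformed input and -ERANGE when
 * the value is outside [TSN_MIN_PAYLOAD..TSN_MAX_PAYLOAD].
 */
int parse_payload_size(const char *page, uint16_t *out)
{
	char *end;
	unsigned long val;

	assert(page && out);
	errno = 0;
	val = strtoul(page, &end, 0);
	if (errno || end == page)
		return -EINVAL;
	/* Tolerate the single trailing newline that `echo` appends. */
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;
	if (val < TSN_MIN_PAYLOAD || val > TSN_MAX_PAYLOAD)
		return -ERANGE;
	*out = (uint16_t)val;
	return 0;
}
```

Validating before storing means a rejected write leaves the previously configured payload size untouched.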
[very-RFC 0/8] TSN driver for the kernel
Hi all
(series based on v4.7-rc2, now with the correct netdev)

This is a *very* early RFC for a TSN-driver in the kernel. It has been
floating around in my repo for a while and I would appreciate some feedback
on the overall design to avoid making any major blunders.

TSN: Time Sensitive Networking, formerly known as AVB (Audio/Video
Bridging).

There is at least one AVB-driver (the AV-part of TSN) in the kernel
already, however this driver aims to solve a wider scope as TSN can do much
more than just audio. A very basic ALSA-driver is added to the end that
allows you to play music between 2 machines using aplay in one end and
arecord | aplay on the other (some fiddling required). We have plans for
doing the same for v4l2 eventually (but there are other fish to fry
first). The same goes for a TSN_SOCK type approach as well.

TSN is all about providing infrastructure. Although there are a few very
interesting uses for TSN (reliable, deterministic network for audio and
video), once you have that reliable link, you can do a lot more.

Some notes on the design:

The driver is directed via ConfigFS as we need userspace to handle
stream-reservation (MSRP), discovery and enumeration (IEEE 1722.1) and
whatever other management is needed. Once we have all the required
attributes, we can create a link using mkdir, and use write() to set the
attributes. Once ready, specify the 'shim' (basically a thin wrapper
between TSN and another subsystem) and we start pushing out frames.

The network part: it ties directly into the rx-handler for receive and
writes skb's using netdev_start_xmit(). This could probably be
improved. 2 new fields in netdev_ops have been introduced, and the Intel
igb-driver has been updated (as this is available as a PCI-e card). The
igb-driver works-ish.

What remains
- tie to (g)PTP properly, currently using ktime_get() for presentation
  time
- get time from shim into TSN and vice versa
- let shim create/manage buffer

Henrik Austad (8):
  TSN: add documentation
  TSN: Add the standard formerly known as AVB to the kernel
  Adding TSN-driver to Intel I210 controller
  Add TSN header for the driver
  Add TSN machinery to drive the traffic from a shim over the network
  Add TSN event-tracing
  AVB ALSA - Add ALSA shim for TSN
  MAINTAINERS: add TSN/AVB-entries

 Documentation/TSN/tsn.txt                 | 147 +
 MAINTAINERS                               |  14 +
 drivers/media/Kconfig                     |  15 +
 drivers/media/Makefile                    |   3 +-
 drivers/media/avb/Makefile                |   5 +
 drivers/media/avb/avb_alsa.c              | 742 +++
 drivers/media/avb/tsn_iec61883.h          | 124
 drivers/net/ethernet/intel/Kconfig        |  18 +
 drivers/net/ethernet/intel/igb/Makefile   |   2 +-
 drivers/net/ethernet/intel/igb/igb.h      |  19 +
 drivers/net/ethernet/intel/igb/igb_main.c |  10 +-
 drivers/net/ethernet/intel/igb/igb_tsn.c  | 396
 include/linux/netdevice.h                 |  32 +
 include/linux/tsn.h                       | 806
 include/trace/events/tsn.h                | 349 +++
 net/Kconfig                               |   1 +
 net/Makefile                              |   1 +
 net/tsn/Kconfig                           |  32 +
 net/tsn/Makefile                          |   6 +
 net/tsn/tsn_configfs.c                    | 623 +++
 net/tsn/tsn_core.c                        | 975 ++
 net/tsn/tsn_header.c                      | 203 +++
 net/tsn/tsn_internal.h                    | 383
 net/tsn/tsn_net.c                         | 403
 24 files changed, 5306 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/TSN/tsn.txt
 create mode 100644 drivers/media/avb/Makefile
 create mode 100644 drivers/media/avb/avb_alsa.c
 create mode 100644 drivers/media/avb/tsn_iec61883.h
 create mode 100644 drivers/net/ethernet/intel/igb/igb_tsn.c
 create mode 100644 include/linux/tsn.h
 create mode 100644 include/trace/events/tsn.h
 create mode 100644 net/tsn/Kconfig
 create mode 100644 net/tsn/Makefile
 create mode 100644 net/tsn/tsn_configfs.c
 create mode 100644 net/tsn/tsn_core.c
 create mode 100644 net/tsn/tsn_header.c
 create mode 100644 net/tsn/tsn_internal.h
 create mode 100644 net/tsn/tsn_net.c

--
2.7.4
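The mkdir-then-write() flow described in the design notes can be sketched from the userspace side as follows. This is a rough illustration, not code from the series: `tsn_link_create` and `tsn_link_set_attr` are hypothetical helpers, the directory layout under the configfs mount is not spelled out in this cover letter and is therefore left to the caller, and the attribute file is assumed to exist already once the directory has been created (as is normal for configfs), so it is opened without O_CREAT.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Create the link directory; an already existing directory is not an error. */
int tsn_link_create(const char *link_dir)
{
	assert(link_dir);
	if (mkdir(link_dir, 0755) < 0 && errno != EEXIST)
		return -errno;
	return 0;
}

/* Write one attribute value to <link_dir>/<attr> with a single write(). */
int tsn_link_set_attr(const char *link_dir, const char *attr, const char *value)
{
	char path[512];
	size_t len = strlen(value);
	ssize_t written;
	int fd, n, err;

	n = snprintf(path, sizeof(path), "%s/%s", link_dir, attr);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -ENAMETOOLONG;

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -errno;

	written = write(fd, value, len);
	err = errno;
	close(fd);
	if (written < 0)
		return -err;
	/* A short write would hand the kernel a truncated value. */
	return (size_t)written == len ? 0 : -EIO;
}
```

A caller would create the link directory first and then set each attribute in turn, for instance `max_payload_size` (the attribute name suggested by the `_tsn_max_payload_size_store()` handler in patch 5/8), before selecting a shim.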
[very-RFC 5/8] Add TSN machinery to drive the traffic from a shim over the network
From: Henrik Austad In short summary: * tsn_core.c is the main driver of tsn, all new links go through here and all data to/form the shims are handled here core also manages the shim-interface. * tsn_configfs.c is the API to userspace. TSN is driven from userspace and a link is created, configured, enabled, disabled and removed purely from userspace. All attributes requried must be determined by userspace, preferrably via IEEE 1722.1 (discovery and enumeration). * tsn_header.c small part that handles the actual header of the frames we send. Kept out of core for cleanliness. * tsn_net.c handles operations towards the networking layer. The current driver is under development. This means that from the moment it is enabled with a shim, it will send traffic, either 0-traffic (frames of reserved length but with payload 0) or actual traffic. This will change once the driver stabilizes. For more detail, see Documentation/networking/tsn/ Cc: "David S. Miller" Signed-off-by: Henrik Austad --- net/Makefile | 1 + net/tsn/Makefile | 6 + net/tsn/tsn_configfs.c | 623 +++ net/tsn/tsn_core.c | 975 + net/tsn/tsn_header.c | 203 ++ net/tsn/tsn_internal.h | 383 +++ net/tsn/tsn_net.c | 403 7 files changed, 2594 insertions(+) create mode 100644 net/tsn/Makefile create mode 100644 net/tsn/tsn_configfs.c create mode 100644 net/tsn/tsn_core.c create mode 100644 net/tsn/tsn_header.c create mode 100644 net/tsn/tsn_internal.h create mode 100644 net/tsn/tsn_net.c diff --git a/net/Makefile b/net/Makefile index bdd1455..c15482e 100644 --- a/net/Makefile +++ b/net/Makefile @@ -79,3 +79,4 @@ ifneq ($(CONFIG_NET_L3_MASTER_DEV),) obj-y += l3mdev/ endif obj-$(CONFIG_QRTR) += qrtr/ +obj-$(CONFIG_TSN) += tsn/ diff --git a/net/tsn/Makefile b/net/tsn/Makefile new file mode 100644 index 000..0d87687 --- /dev/null +++ b/net/tsn/Makefile @@ -0,0 +1,6 @@ +# +# Makefile for the Linux TSN subsystem +# + +obj-$(CONFIG_TSN) += tsn.o +tsn-objs :=tsn_core.o tsn_configfs.o tsn_net.o tsn_header.o diff --git 
a/net/tsn/tsn_configfs.c b/net/tsn/tsn_configfs.c
new file mode 100644
index 000..f3d0986
--- /dev/null
+++ b/net/tsn/tsn_configfs.c
@@ -0,0 +1,623 @@
+/*
+ * ConfigFS interface to TSN
+ * Copyright (C) 2015- Henrik Austad
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include "tsn_internal.h"
+
+static inline struct tsn_link *to_tsn_link(struct config_item *item)
+{
+	/* this line causes checkpatch to WARN. making checkpatch happy,
+	 * makes code messy..
+	 */
+	return item ? container_of(to_config_group(item), struct tsn_link, group) : NULL;
+}
+
+static inline struct tsn_nic *to_tsn_nic(struct config_group *group)
+{
+	return group ?
container_of(group, struct tsn_nic, group) : NULL;
+}
+
+/* ---
+ * Tier2 attributes
+ *
+ * The content of the links userspace can see/modify
+ * ---
+*/
+static ssize_t _tsn_max_payload_size_show(struct config_item *item,
+					  char *page)
+{
+	struct tsn_link *link = to_tsn_link(item);
+
+	if (!link)
+		return -EINVAL;
+	return sprintf(page, "%u\n", (u32)link->max_payload_size);
+}
+
+static ssize_t _tsn_max_payload_size_store(struct config_item *item,
+					   const char *page, size_t count)
+{
+	struct tsn_link *link = to_tsn_link(item);
+	u16 mpl_size = 0;
+	int ret = 0;
+
+	if (!link)
+		return -EINVAL;
+	if (tsn_link_is_on(link)) {
+		pr_err("ERROR: Cannot change Payload size on an enabled link\n");
+		return -EINVAL;
+	}
+	ret = kstrtou16(page, 0, &mpl_size);
+	if (ret)
+		return ret;
+
+	/* 802.1BA-2011 6.4 payload must be <1500 octets (excluding
+	 * headers, tags etc). However, this is not directly mappable to
+	 * how some hw handles things, so to be conservative, we
+	 * restrict it down to [26..1485]
+	 *
+	 * This is also the _payload_ size, which does not include the
+	 * AVTPDU heade