Re: [PATCH 00/17] Backport rt/deadline crash and the arduous story of FUTEX_UNLOCK_PI to 4.4

2018-12-13 Thread Henrik Austad
On Fri, Dec 14, 2018 at 08:18:26AM +0100, Greg Kroah-Hartman wrote:
> On Mon, Nov 19, 2018 at 12:27:21PM +0100, Henrik Austad wrote:
> > On Fri, Nov 09, 2018 at 11:35:31AM +0100, Henrik Austad wrote:
> > > On Fri, Nov 09, 2018 at 11:07:28AM +0100, Henrik Austad wrote:
> > > > From: Henrik Austad 
> > > > 
> > > > Short story:
> > > 
> > > Sorry for the spam, it looks like I was not very specific in /which/ 
> > > version I targeted this to, as well as not providing a full Cc-list for 
> > > the cover-letter.
> > 
> > Gentle prod. I realize this was sent out just before plumbers and that 
> > people had pretty packed agendas, so a small nudge to gain a spot closer to 
> > the top of the inbox :)
> > 
> > This series has now been running on an arm64 system for 9 days without 
> > any issues, and pi_stress showed a dramatic improvement from ~30 seconds 
> > up to several hours (it finally deadlocked at 3.9e9 inversions).
> > 
> > I'd greatly appreciate if someone could give the list of patches a quick 
> > glance to verify that I got all the required patches and then if it could 
> > be added to 4.4.y.

Hi Greg,

> This is a really intrusive series of patches, and without some testing
> and verification by others, I am really reluctant to take these patches.

Yes, I know they are intrusive, and they touch core parts of the kernel in 
interesting ways.

I completely agree with the need for testing, and I do not _expect_ these 
patches to be merged. It was a "this was useful for us, it is probably 
useful for others" kind of series.

Perhaps there are not that many others out there using a PI futex shared 
between a sched_rr thread and a sched_deadline thread, which is how you 
back yourself into this corner.
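
(Editorial aside: the contention pattern is easy to provoke from user
space. The sketch below is hypothetical and not part of the series; it
pairs one SCHED_DEADLINE and one SCHED_RR thread on a PTHREAD_PRIO_INHERIT
mutex, which is backed by FUTEX_LOCK_PI/FUTEX_UNLOCK_PI on contention. The
deadline parameters mirror the pi_stress arguments quoted below; needs
root.)

/* pi_contend.c - hypothetical sketch; gcc pi_contend.c -lpthread */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE 6	/* older glibc headers may lack it */
#endif

struct sched_attr {		/* sched_setattr() has no glibc wrapper */
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;		/* ns */
	uint64_t sched_deadline;	/* ns */
	uint64_t sched_period;		/* ns */
};

static pthread_mutex_t pi_lock;	/* a PI futex underneath */

static void *dl_thread(void *arg)
{
	struct sched_attr attr = {
		.size		= sizeof(attr),
		.sched_policy	= SCHED_DEADLINE,
		.sched_runtime	= 10 * 1000 * 1000,
		.sched_deadline	= 20 * 1000 * 1000,
		.sched_period	= 20 * 1000 * 1000,
	};

	(void)arg;
	if (syscall(SYS_sched_setattr, 0, &attr, 0))
		perror("sched_setattr");

	for (;;) {	/* contended lock -> FUTEX_LOCK_PI in the kernel */
		pthread_mutex_lock(&pi_lock);
		pthread_mutex_unlock(&pi_lock);
	}
	return NULL;
}

static void *rr_thread(void *arg)
{
	struct sched_param sp = { .sched_priority = 2 };

	(void)arg;
	pthread_setschedparam(pthread_self(), SCHED_RR, &sp);
	for (;;) {
		pthread_mutex_lock(&pi_lock);
		pthread_mutex_unlock(&pi_lock);
	}
	return NULL;
}

int main(void)
{
	pthread_mutexattr_t ma;
	pthread_t t1, t2;

	pthread_mutexattr_init(&ma);
	/* waiters boost the owner via the kernel's rt_mutex */
	pthread_mutexattr_setprotocol(&ma, PTHREAD_PRIO_INHERIT);
	pthread_mutex_init(&pi_lock, &ma);

	pthread_create(&t1, NULL, dl_thread, NULL);
	pthread_create(&t2, NULL, rr_thread, NULL);
	pthread_join(t1, NULL);		/* runs until interrupted */
	return 0;
}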

> Why not just move to the 4.9.y tree, or better yet, 4.19.y to resolve
> this issue for your systems?

That would indeed be the best solution, but the vendor will not update the 
kernel past 4.4 for this particular SoC, so we have no way of moving to a 
later kernel :(

Anyway, I'm happy to carry these in our local tree for our own use. If 
something pops up in our internal testing that requires an update to the 
series, I'll send an update for others to see should they experience the 
same issue. :)

Thanks for the reply!

-- 
Henrik Austad




Re: [PATCH 00/17] Backport rt/deadline crash and the arduous story of FUTEX_UNLOCK_PI to 4.4

2018-11-19 Thread Henrik Austad
On Fri, Nov 09, 2018 at 11:35:31AM +0100, Henrik Austad wrote:
> On Fri, Nov 09, 2018 at 11:07:28AM +0100, Henrik Austad wrote:
> > From: Henrik Austad 
> > 
> > Short story:
> 
> Sorry for the spam, it looks like I was not very specific in /which/ 
> version I targeted this to, as well as not providing a full Cc-list for the 
> cover-letter.

Gentle prod. I realize this was sent out just before plumbers and that 
people had pretty packed agendas, so a small nudge to gain a spot closer to 
the top of the inbox :)

This series has now been running on an arm64 system for 9 days without any 
issues, and pi_stress showed a dramatic improvement from ~30 seconds up to 
several hours (it finally deadlocked at 3.9e9 inversions).

I'd greatly appreciate if someone could give the list of patches a quick 
glance to verify that I got all the required patches and then if it could 
be added to 4.4.y.

Thanks!

-Henrik


> The series is targeted at stable v4.4.162.
> 
> Expanding Cc-list to those missing from the first attempt.
> 
> -Henrik
> 
> > The following patches are needed on a 4.4 kernel to avoid an
> > Oops in the scheduler when a sched_rr and a sched_deadline task contend
> > on the same futex (with PI).
> > 
> > Longer story:
> > 
> > On one of our arm64 systems, we occasionally crash with an Oops in the
> > scheduler with the following backtrace.
> > 
> > [] enqueue_task_dl+0x1f0/0x420
> > [] activate_task+0x7c/0x90
> > [] push_dl_task+0x164/0x1c8
> > [] push_dl_tasks+0x20/0x30
> > [] __balance_callback+0x44/0x68
> > [] __schedule+0x6f0/0x728
> > [] schedule+0x78/0x98
> > [] __rt_mutex_slowlock+0x9c/0x108
> > [] rt_mutex_slowlock+0xd8/0x198
> > [] rt_mutex_timed_futex_lock+0x30/0x40
> > [] futex_lock_pi+0x200/0x3b0
> > [] do_futex+0x1c4/0x550
> > [] compat_SyS_futex+0x10c/0x138
> > [] __sys_trace_return+0x0/0x4
> > 
> > This seems to be the same bug Xunlei Pang triggered and fixed in
> > e96a7705e7d3 "sched/rtmutex/deadline: Fix a PI crash for deadline
> > tasks". As noted by Peter Zijlstra in the previous attempt, this fix
> > requires a few other patches, most notably the FUTEX_UNLOCK_PI series
> > [1].
> > 
> > Testing this on a dual-core VM I have not been able to reproduce the
> > same crash, but pi_stress (part of the rt-tests suite) reveals that
> > vanilla 4.4.162 behaves rather badly with a mix of deadline and
> > sched_(rr|fifo) tasks:
> > 
> > time pi_stress --rr --mlockall --sched id=high,policy=deadline,runtime=10,deadline=20,period=20
> > Starting PI Stress Test
> > Number of thread groups: 1
> > Duration of test run: infinite
> > Number of inversions per group: unlimited
> >  Admin thread SCHED_RR priority 4
> > 1 groups of 3 threads will be created
> >   High thread SCHED_DEADLINE runtime 10 deadline 20 period 20
> >Med thread SCHED_RR priority 2
> >Low thread SCHED_RR priority 1
> > Current Inversions: 141627
> > WATCHDOG triggered: group 0 is deadlocked!
> > reporter stopping due to watchdog event
> > Stopping test
> > Terminated
> > 
> > real	0m26.291s
> > user	0m0.148s
> > sys	0m18.819s
> > 
> > With this series applied, the test ran for ~4.5 hours and again for 129
> > minutes (when I remembered to time it) before crashing:
> > 
> > time pi_stress --rr --mlockall --sched id=high,policy=deadline,runtime=10,deadline=20,period=20
> > Starting PI Stress Test
> > Number of thread groups: 1
> > Duration of test run: infinite
> > Number of inversions per group: unlimited
> >  Admin thread SCHED_RR priority 4
> > 1 groups of 3 threads will be created
> >   High thread SCHED_DEADLINE runtime 10 deadline 20 period 20
> >Med thread SCHED_RR priority 2
> >Low thread SCHED_RR priority 1
> > Current Inversions: 51985223
> > WATCHDOG triggered: group 0 is deadlocked!
> > reporter stopping due to watchdog event
> > Stopping test
> > Terminated
> > 
> > real	129m38.807s
> > user	0m59.084s
> > sys	109m53.666s
> > 
> > 
> > So clearly not perfect, but a *lot* better.
> > 
> > The same series on our vendor-4.4 kernel moves pi_stress from ~30
> > seconds before deadlock up to the same level as the VM (the test is
> > still going as of this writing).
> > 
> > I suspect other users of 4.4 would benefit from having these patches
> > backported, so tag them for stable. I assume 4.9 and 4.14 could benefit
> > as well, but I have not had time to look into those.

Re: [PATCH 00/17] Backport rt/deadline crash and the arduous story of FUTEX_UNLOCK_PI to 4.4

2018-11-09 Thread Henrik Austad
On Fri, Nov 09, 2018 at 11:07:28AM +0100, Henrik Austad wrote:
> From: Henrik Austad 
> 
> Short story:

Sorry for the spam, it looks like I was not very specific in /which/ 
version I targeted this to, as well as not providing a full Cc-list for the 
cover-letter.

The series is targeted at stable v4.4.162.

Expanding Cc-list to those missing from the first attempt.

-Henrik

> The following patches are needed on a 4.4 kernel to avoid an
> Oops in the scheduler when a sched_rr and a sched_deadline task contend
> on the same futex (with PI).
> 
> Longer story:
> 
> On one of our arm64 systems, we occasionally crash with an Oops in the
> scheduler with the following backtrace.
> 
> [] enqueue_task_dl+0x1f0/0x420
> [] activate_task+0x7c/0x90
> [] push_dl_task+0x164/0x1c8
> [] push_dl_tasks+0x20/0x30
> [] __balance_callback+0x44/0x68
> [] __schedule+0x6f0/0x728
> [] schedule+0x78/0x98
> [] __rt_mutex_slowlock+0x9c/0x108
> [] rt_mutex_slowlock+0xd8/0x198
> [] rt_mutex_timed_futex_lock+0x30/0x40
> [] futex_lock_pi+0x200/0x3b0
> [] do_futex+0x1c4/0x550
> [] compat_SyS_futex+0x10c/0x138
> [] __sys_trace_return+0x0/0x4
> 
> This seems to be the same bug Xunlei Pang triggered and fixed in
> e96a7705e7d3 "sched/rtmutex/deadline: Fix a PI crash for deadline
> tasks". As noted by Peter Zijlstra in the previous attempt, this fix
> requires a few other patches, most notably the FUTEX_UNLOCK_PI series
> [1].
> 
> Testing this on a dual-core VM I have not been able to reproduce the
> same crash, but pi_stress (part of the rt-tests suite) reveals that
> vanilla 4.4.162 behaves rather badly with a mix of deadline and
> sched_(rr|fifo) tasks:
> 
> time pi_stress --rr --mlockall --sched id=high,policy=deadline,runtime=10,deadline=20,period=20
> Starting PI Stress Test
> Number of thread groups: 1
> Duration of test run: infinite
> Number of inversions per group: unlimited
>  Admin thread SCHED_RR priority 4
> 1 groups of 3 threads will be created
>   High thread SCHED_DEADLINE runtime 10 deadline 20 period 20
>Med thread SCHED_RR priority 2
>Low thread SCHED_RR priority 1
> Current Inversions: 141627
> WATCHDOG triggered: group 0 is deadlocked!
> reporter stopping due to watchdog event
> Stopping test
> Terminated
> 
> real	0m26.291s
> user	0m0.148s
> sys	0m18.819s
> 
> With this series applied, the test ran for ~4.5 hours and again for 129
> minutes (when I remembered to time it) before crashing:
> 
> time pi_stress --rr --mlockall --sched id=high,policy=deadline,runtime=10,deadline=20,period=20
> Starting PI Stress Test
> Number of thread groups: 1
> Duration of test run: infinite
> Number of inversions per group: unlimited
>  Admin thread SCHED_RR priority 4
> 1 groups of 3 threads will be created
>   High thread SCHED_DEADLINE runtime 10 deadline 20 period 20
>Med thread SCHED_RR priority 2
>Low thread SCHED_RR priority 1
> Current Inversions: 51985223
> WATCHDOG triggered: group 0 is deadlocked!
> reporter stopping due to watchdog event
> Stopping test
> Terminated
> 
> real	129m38.807s
> user	0m59.084s
> sys	109m53.666s
> 
> 
> So clearly not perfect, but a *lot* better.
> 
> The same series on our vendor-4.4 kernel moves pi_stress from ~30
> seconds before deadlock up to the same level as the VM (the test is
> still going as of this writing).
> 
> I suspect other users of 4.4 would benefit from having these patches
> backported, so tag them for stable. I assume 4.9 and 4.14 could benefit
> as well, but I have not had time to look into those.
> 
> 1) https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1359667.html
> 
> Peter Zijlstra (13):
>   futex: Cleanup variable names for futex_top_waiter()
>   futex: Use smp_store_release() in mark_wake_futex()
>   futex: Remove rt_mutex_deadlock_account_*()
>   futex,rt_mutex: Provide futex specific rt_mutex API
>   futex: Change locking rules
>   futex: Cleanup refcounting
>   futex: Rework inconsistent rt_mutex/futex_q state
>   futex: Pull rt_mutex_futex_unlock() out from under hb->lock
>   futex,rt_mutex: Introduce rt_mutex_init_waiter()
>   futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock()
>   futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock()
>   futex: Futex_unlock_pi() determinism
>   futex: Drop hb->lock before enqueueing on the rtmutex
> 
> Thomas Gleixner (2):
>   rtmutex: Make wait_lock irq safe
>   futex: Rename free_pi_state() to put_pi_state()
> 
> Xunlei Pang (2):
>   rtmutex: Deboost before waking up the top waiter
>   sched/rtmutex/deadline: Fix a PI crash for deadline tasks

[PATCH 06/17] futex: Change locking rules

2018-11-09 Thread Henrik Austad
From: Peter Zijlstra 

commit 734009e96d1983ad739e5b656e03430b3660c913 upstream.

Currently futex-pi relies on hb->lock to serialize everything. But hb->lock
creates another set of problems, especially priority inversions on RT where
hb->lock becomes a rt_mutex itself.

The rt_mutex::wait_lock is the most obvious protection for keeping the
futex user space value and the kernel internal pi_state in sync.

Rework and document the locking so rt_mutex::wait_lock is held across all
operations which modify the user space value and the pi state.

This allows invoking rt_mutex_unlock() (including deboost) without holding
hb->lock as a next step.

Nothing yet relies on the new locking rules.
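
(Editorial aside: a rough userspace sketch of the lock nesting this patch
documents. The pthread mutexes are merely hypothetical stand-ins for
hb->lock, pi_mutex->wait_lock and p->pi_lock; the point is the fixed
acquisition order, which is what makes it safe to later drop hb->lock
while wait_lock alone covers {uval, pi_state}.)

#include <pthread.h>

static pthread_mutex_t hb_lock   = PTHREAD_MUTEX_INITIALIZER; /* hb->lock */
static pthread_mutex_t wait_lock = PTHREAD_MUTEX_INITIALIZER; /* pi_mutex->wait_lock */
static pthread_mutex_t pi_lock   = PTHREAD_MUTEX_INITIALIZER; /* p->pi_lock */

/* Always acquire in this order to avoid ABBA deadlocks. */
void update_pi_state(void)
{
	pthread_mutex_lock(&hb_lock);	/* hb -> futex_q relations */
	pthread_mutex_lock(&wait_lock);	/* serializes {uval, pi_state} */
	pthread_mutex_lock(&pi_lock);	/* p->pi_state_list -> pi_state->list */
	/* ... modify state covered by all three ... */
	pthread_mutex_unlock(&pi_lock);
	pthread_mutex_unlock(&wait_lock);
	pthread_mutex_unlock(&hb_lock);
}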

Signed-off-by: Peter Zijlstra (Intel) 
Cc: juri.le...@arm.com
Cc: bige...@linutronix.de
Cc: xlp...@redhat.com
Cc: rost...@goodmis.org
Cc: mathieu.desnoy...@efficios.com
Cc: jdesfos...@efficios.com
Cc: dvh...@infradead.org
Cc: bris...@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.751993...@infradead.org
Signed-off-by: Thomas Gleixner 
Tested-by: Henrik Austad 
---
 kernel/futex.c | 165 +
 1 file changed, 132 insertions(+), 33 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index e1200b9..52e3678 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -967,6 +967,39 @@ void exit_pi_state_list(struct task_struct *curr)
  *
  * [10] There is no transient state which leaves owner and user space
  * TID out of sync.
+ *
+ *
+ * Serialization and lifetime rules:
+ *
+ * hb->lock:
+ *
+ * hb -> futex_q, relation
+ * futex_q -> pi_state, relation
+ *
+ * (cannot be raw because hb can contain arbitrary amount
+ *  of futex_q's)
+ *
+ * pi_mutex->wait_lock:
+ *
+ * {uval, pi_state}
+ *
+ * (and pi_mutex 'obviously')
+ *
+ * p->pi_lock:
+ *
+ * p->pi_state_list -> pi_state->list, relation
+ *
+ * pi_state->refcount:
+ *
+ * pi_state lifetime
+ *
+ *
+ * Lock order:
+ *
+ *   hb->lock
+ * pi_mutex->wait_lock
+ *   p->pi_lock
+ *
  */
 
 /*
@@ -974,10 +1007,12 @@ void exit_pi_state_list(struct task_struct *curr)
  * the pi_state against the user space value. If correct, attach to
  * it.
  */
-static int attach_to_pi_state(u32 uval, struct futex_pi_state *pi_state,
+static int attach_to_pi_state(u32 __user *uaddr, u32 uval,
+ struct futex_pi_state *pi_state,
  struct futex_pi_state **ps)
 {
pid_t pid = uval & FUTEX_TID_MASK;
+   int ret, uval2;
 
/*
 * Userspace might have messed up non-PI and PI futexes [3]
@@ -985,9 +1020,34 @@ static int attach_to_pi_state(u32 uval, struct futex_pi_state *pi_state,
if (unlikely(!pi_state))
return -EINVAL;
 
+   /*
+* We get here with hb->lock held, and having found a
+* futex_top_waiter(). This means that futex_lock_pi() of said futex_q
+* has dropped the hb->lock in between queue_me() and unqueue_me_pi(),
+* which in turn means that futex_lock_pi() still has a reference on
+* our pi_state.
+*/
	WARN_ON(!atomic_read(&pi_state->refcount));
 
/*
+* Now that we have a pi_state, we can acquire wait_lock
+* and do the state validation.
+*/
+   raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
+
+   /*
+* Since {uval, pi_state} is serialized by wait_lock, and our current
+* uval was read without holding it, it can have changed. Verify it
+* still is what we expect it to be, otherwise retry the entire
+* operation.
+*/
+   if (get_futex_value_locked(&uval2, uaddr))
+   goto out_efault;
+
+   if (uval != uval2)
+   goto out_eagain;
+
+   /*
 * Handle the owner died case:
 */
if (uval & FUTEX_OWNER_DIED) {
@@ -1002,11 +1062,11 @@ static int attach_to_pi_state(u32 uval, struct futex_pi_state *pi_state,
 * is not 0. Inconsistent state. [5]
 */
if (pid)
-   return -EINVAL;
+   goto out_einval;
/*
 * Take a ref on the state and return success. [4]
 */
-   goto out_state;
+   goto out_attach;
}
 
/*
@@ -1018,14 +1078,14 @@ static int attach_to_pi_state(u32 uval, struct futex_pi_state *pi_state,
 * Take a ref on the state and return success. [6]
 */
if (!pid)
-   goto out_state;
+   goto out_attach;
} else {
/*
 * If the owner died bit is not set, then the pi_state
 * must have an owner. [7]
 */

[PATCH 07/17] futex: Cleanup refcounting

2018-11-09 Thread Henrik Austad
From: Peter Zijlstra 

commit bf92cf3a5100f5a0d5f9834787b130159397cb22 upstream.

Add a put_pi_state() as counterpart for get_pi_state() so the refcounting
becomes consistent.
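
(Editorial aside: the get/put pairing follows the usual refcounting idiom;
a minimal, hypothetical userspace analogue for illustration only:)

#include <stdatomic.h>
#include <stdlib.h>

struct obj {
	atomic_int refcount;
};

static struct obj *alloc_obj(void)
{
	struct obj *o = malloc(sizeof(*o));

	if (o)
		atomic_init(&o->refcount, 1);	/* caller holds the first ref */
	return o;
}

static void get_obj(struct obj *o)
{
	/* only legal while the caller already holds a reference */
	atomic_fetch_add(&o->refcount, 1);
}

static void put_obj(struct obj *o)
{
	/* the last put frees it (the futex code frees or caches) */
	if (atomic_fetch_sub(&o->refcount, 1) == 1)
		free(o);
}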

Signed-off-by: Peter Zijlstra (Intel) 
Cc: juri.le...@arm.com
Cc: bige...@linutronix.de
Cc: xlp...@redhat.com
Cc: rost...@goodmis.org
Cc: mathieu.desnoy...@efficios.com
Cc: jdesfos...@efficios.com
Cc: dvh...@infradead.org
Cc: bris...@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.801778...@infradead.org
Signed-off-by: Thomas Gleixner 

Conflicts:
kernel/futex.c
Tested-by: Henrik Austad 
---
 kernel/futex.c | 19 ++-
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 52e3678..9d7d462 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -799,7 +799,7 @@ static int refill_pi_state_cache(void)
return 0;
 }
 
-static struct futex_pi_state * alloc_pi_state(void)
+static struct futex_pi_state *alloc_pi_state(void)
 {
struct futex_pi_state *pi_state = current->pi_state_cache;
 
@@ -809,6 +809,11 @@ static struct futex_pi_state * alloc_pi_state(void)
return pi_state;
 }
 
+static void get_pi_state(struct futex_pi_state *pi_state)
+{
+   WARN_ON_ONCE(!atomic_inc_not_zero(&pi_state->refcount));
+}
+
 /*
  * Must be called with the hb lock held.
  */
@@ -850,7 +855,7 @@ static void free_pi_state(struct futex_pi_state *pi_state)
  * Look up the task based on what TID userspace gave us.
  * We dont trust it.
  */
-static struct task_struct * futex_find_get_task(pid_t pid)
+static struct task_struct *futex_find_get_task(pid_t pid)
 {
struct task_struct *p;
 
@@ -1097,7 +1102,7 @@ static int attach_to_pi_state(u32 __user *uaddr, u32 uval,
goto out_einval;
 
 out_attach:
-   atomic_inc(&pi_state->refcount);
+   get_pi_state(pi_state);
	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
*ps = pi_state;
return 0;
@@ -2019,8 +2024,12 @@ retry_private:
 * of requeue_pi if we couldn't acquire the lock atomically.
 */
if (requeue_pi) {
-   /* Prepare the waiter to take the rt_mutex. */
-   atomic_inc(&pi_state->refcount);
+   /*
+* Prepare the waiter to take the rt_mutex. Take a
+* refcount on the pi_state and store the pointer in
+* the futex_q object of the waiter.
+*/
+   get_pi_state(pi_state);
this->pi_state = pi_state;
ret = rt_mutex_start_proxy_lock(&pi_state->pi_mutex,
this->rt_waiter,
-- 
2.7.4



[PATCH 04/17] rtmutex: Make wait_lock irq safe

2018-11-09 Thread Henrik Austad
From: Thomas Gleixner 

commit b4abf91047cf054f203dcfac97e1038388826937 upstream.

Sasha reported a lockdep splat about a potential deadlock between RCU boosting
rtmutex and the posix timer it_lock.

CPU0				CPU1

rtmutex_lock(&rcu->rt_mutex)
  spin_lock(&rcu->rt_mutex.wait_lock)
				local_irq_disable()
				spin_lock(&timer->it_lock)
				spin_lock(&rcu->mutex.wait_lock)
--> Interrupt
				spin_lock(&timer->it_lock)

This is caused by the following code sequence on CPU1

 rcu_read_lock()
 x = lookup();
 if (x)
 spin_lock_irqsave(&x->it_lock);
 rcu_read_unlock();
 return x;

We could fix that in the posix timer code by keeping rcu read locked across
the spinlocked and irq disabled section, but the above sequence is common and
there is no reason not to support it.

Taking rt_mutex.wait_lock irq safe prevents the deadlock.
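
(Editorial aside: a loosely analogous, hypothetical userspace pattern. A
lock that is also taken from a signal handler must be acquired with that
signal blocked; pthread_sigmask() plays the role of irq disabling here.)

#include <pthread.h>
#include <signal.h>

static pthread_mutex_t wait_lock = PTHREAD_MUTEX_INITIALIZER;

void locked_section(void)
{
	sigset_t block, old;

	sigemptyset(&block);
	sigaddset(&block, SIGALRM);	/* stand-in for the interrupt */

	pthread_sigmask(SIG_BLOCK, &block, &old);	/* "irqsave" */
	pthread_mutex_lock(&wait_lock);
	/* ... cannot deadlock against the SIGALRM handler now ... */
	pthread_mutex_unlock(&wait_lock);
	pthread_sigmask(SIG_SETMASK, &old, NULL);	/* "irqrestore" */
}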

Reported-by: Sasha Levin 
Signed-off-by: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Paul McKenney 
Tested-by: Henrik Austad 
---
 kernel/futex.c   |  18 +++
 kernel/locking/rtmutex.c | 135 +--
 2 files changed, 81 insertions(+), 72 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 9e92f12..0f44952 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1307,7 +1307,7 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter
	if (pi_state->owner != current)
		return -EINVAL;

-	raw_spin_lock(&pi_state->pi_mutex.wait_lock);
+	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
	new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);
 
/*
@@ -1343,22 +1343,22 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter
		ret = -EINVAL;
	}
	if (ret) {
-		raw_spin_unlock(&pi_state->pi_mutex.wait_lock);
+		raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
		return ret;
	}

-	raw_spin_lock_irq(&pi_state->owner->pi_lock);
+	raw_spin_lock(&pi_state->owner->pi_lock);
	WARN_ON(list_empty(&pi_state->list));
	list_del_init(&pi_state->list);
-	raw_spin_unlock_irq(&pi_state->owner->pi_lock);
+	raw_spin_unlock(&pi_state->owner->pi_lock);

-	raw_spin_lock_irq(&new_owner->pi_lock);
+	raw_spin_lock(&new_owner->pi_lock);
	WARN_ON(!list_empty(&pi_state->list));
	list_add(&pi_state->list, &new_owner->pi_state_list);
	pi_state->owner = new_owner;
-	raw_spin_unlock_irq(&new_owner->pi_lock);
+	raw_spin_unlock(&new_owner->pi_lock);

-	raw_spin_unlock(&pi_state->pi_mutex.wait_lock);
+	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);

	deboost = rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q);
 
@@ -2269,11 +2269,11 @@ static int fixup_owner(u32 __user *uaddr, struct futex_q *q, int locked)
		 * we returned due to timeout or signal without taking the
		 * rt_mutex. Too late.
		 */
-		raw_spin_lock(&q->pi_state->pi_mutex.wait_lock);
+		raw_spin_lock_irq(&q->pi_state->pi_mutex.wait_lock);
		owner = rt_mutex_owner(&q->pi_state->pi_mutex);
		if (!owner)
			owner = rt_mutex_next_owner(&q->pi_state->pi_mutex);
-		raw_spin_unlock(&q->pi_state->pi_mutex.wait_lock);
+		raw_spin_unlock_irq(&q->pi_state->pi_mutex.wait_lock);
		ret = fixup_pi_state_owner(uaddr, q, owner);
		goto out;
	}
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 6cf9dab7..b8d08c7 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -163,13 +163,14 @@ static inline void mark_rt_mutex_waiters(struct rt_mutex *lock)
 * 2) Drop lock->wait_lock
 * 3) Try to unlock the lock with cmpxchg
 */
-static inline bool unlock_rt_mutex_safe(struct rt_mutex *lock)
+static inline bool unlock_rt_mutex_safe(struct rt_mutex *lock,
+					unsigned long flags)
	__releases(lock->wait_lock)
{
	struct task_struct *owner = rt_mutex_owner(lock);

	clear_rt_mutex_waiters(lock);
-	raw_spin_unlock(&lock->wait_lock);
+	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
	/*
	 * If a new waiter comes in between the unlock and the cmpxchg
	 * we have two situations:
@@ -211,11 +212,12 @@ static inline void mark_rt_mutex_waiters(struct rt_mutex *lock)
/*
 * Simple slow path only version: lock->owner is protected by lock->wait_lock.
 */
-static inline bool unlock_rt_mutex_safe(struct rt_mutex *lock)
+static inline bool unlock_rt_mutex_safe(struct rt_mutex *lock,
+					unsigned long flags)
	__releases(lock->wait_lock)
{
	lock->owner = NULL;
-	raw_spin_unlock(&lock->wait_lock);
+	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
	return true;

[PATCH 02/17] futex: Use smp_store_release() in mark_wake_futex()

2018-11-09 Thread Henrik Austad
From: Peter Zijlstra 

commit 1b367ece0d7e696cab1c8501bab282cc6a538b3f upstream.

Since the futex_q can disappear the instruction after assigning NULL,
this really should be a RELEASE barrier. That stops loads from hitting
dead memory too.
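
(Editorial aside: a small, hypothetical userspace analogue of the pairing,
with C11 atomics standing in for smp_store_release() and the matching
acquire on the reader side:)

#include <stdatomic.h>
#include <stddef.h>

struct waiter {
	_Atomic(void *) lock_ptr;
	int queued;
};

static void mark_woken(struct waiter *w)
{
	w->queued = 0;		/* the "plist_del" step */
	/* release: the store above must be visible before lock_ptr == NULL */
	atomic_store_explicit(&w->lock_ptr, NULL, memory_order_release);
}

static int woken_up(struct waiter *w)
{
	/* acquire pairs with the release store in mark_woken() */
	return atomic_load_explicit(&w->lock_ptr, memory_order_acquire) == NULL;
}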

Signed-off-by: Peter Zijlstra (Intel) 
Cc: juri.le...@arm.com
Cc: bige...@linutronix.de
Cc: xlp...@redhat.com
Cc: rost...@goodmis.org
Cc: mathieu.desnoy...@efficios.com
Cc: jdesfos...@efficios.com
Cc: dvh...@infradead.org
Cc: bris...@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.604296...@infradead.org
Signed-off-by: Thomas Gleixner 
Tested-by: Henrik Austad 
---
 kernel/futex.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index bb87324..9e92f12 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1284,8 +1284,7 @@ static void mark_wake_futex(struct wake_q_head *wake_q, struct futex_q *q)
 * memory barrier is required here to prevent the following
 * store to lock_ptr from getting ahead of the plist_del.
 */
-   smp_wmb();
-   q->lock_ptr = NULL;
+   smp_store_release(&q->lock_ptr, NULL);
 }
 
 static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter,
-- 
2.7.4



[PATCH 00/17] Backport rt/deadline crash and the arduous story of FUTEX_UNLOCK_PI to 4.4

2018-11-09 Thread Henrik Austad
From: Henrik Austad 

Short story:

The following patches are needed on a 4.4 kernel to avoid an
Oops in the scheduler when a sched_rr and a sched_deadline task contend
on the same futex (with PI).

Longer story:

On one of our arm64 systems, we occasionally crash with an Oops in the
scheduler with the following backtrace.

[] enqueue_task_dl+0x1f0/0x420
[] activate_task+0x7c/0x90
[] push_dl_task+0x164/0x1c8
[] push_dl_tasks+0x20/0x30
[] __balance_callback+0x44/0x68
[] __schedule+0x6f0/0x728
[] schedule+0x78/0x98
[] __rt_mutex_slowlock+0x9c/0x108
[] rt_mutex_slowlock+0xd8/0x198
[] rt_mutex_timed_futex_lock+0x30/0x40
[] futex_lock_pi+0x200/0x3b0
[] do_futex+0x1c4/0x550
[] compat_SyS_futex+0x10c/0x138
[] __sys_trace_return+0x0/0x4

This seems to be the same bug Xunlei Pang triggered and fixed in
e96a7705e7d3 "sched/rtmutex/deadline: Fix a PI crash for deadline
tasks". As noted by Peter Zijlstra in the previous attempt, this fix
requires a few other patches, most notably the FUTEX_UNLOCK_PI series
[1].

Testing this on a dual-core VM I have not been able to reproduce the
same crash, but pi_stress (part of the rt-tests suite) reveals that
vanilla 4.4.162 behaves rather badly with a mix of deadline and
sched_(rr|fifo) tasks:

time pi_stress --rr --mlockall --sched id=high,policy=deadline,runtime=10,deadline=20,period=20
Starting PI Stress Test
Number of thread groups: 1
Duration of test run: infinite
Number of inversions per group: unlimited
 Admin thread SCHED_RR priority 4
1 groups of 3 threads will be created
  High thread SCHED_DEADLINE runtime 10 deadline 20 period 20
   Med thread SCHED_RR priority 2
   Low thread SCHED_RR priority 1
Current Inversions: 141627
WATCHDOG triggered: group 0 is deadlocked!
reporter stopping due to watchdog event
Stopping test
Terminated

real	0m26.291s
user	0m0.148s
sys	0m18.819s

With this series applied, the test ran for ~4.5 hours and again for 129
minutes (when I remembered to time it) before crashing:

time pi_stress --rr --mlockall --sched id=high,policy=deadline,runtime=10,deadline=20,period=20
Starting PI Stress Test
Number of thread groups: 1
Duration of test run: infinite
Number of inversions per group: unlimited
 Admin thread SCHED_RR priority 4
1 groups of 3 threads will be created
  High thread SCHED_DEADLINE runtime 10 deadline 20 period 20
   Med thread SCHED_RR priority 2
   Low thread SCHED_RR priority 1
Current Inversions: 51985223
WATCHDOG triggered: group 0 is deadlocked!
reporter stopping due to watchdog event
Stopping test
Terminated

real	129m38.807s
user	0m59.084s
sys	109m53.666s


So clearly not perfect, but a *lot* better.

The same series on our vendor-4.4 kernel moves pi_stress from ~30
seconds before deadlock up to the same level as the VM (the test is
still going as of this writing).

I suspect other users of 4.4 would benefit from having these patches
backported, so tag them for stable. I assume 4.9 and 4.14 could benefit
as well, but I have not had time to look into those.

1) https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1359667.html

Peter Zijlstra (13):
  futex: Cleanup variable names for futex_top_waiter()
  futex: Use smp_store_release() in mark_wake_futex()
  futex: Remove rt_mutex_deadlock_account_*()
  futex,rt_mutex: Provide futex specific rt_mutex API
  futex: Change locking rules
  futex: Cleanup refcounting
  futex: Rework inconsistent rt_mutex/futex_q state
  futex: Pull rt_mutex_futex_unlock() out from under hb->lock
  futex,rt_mutex: Introduce rt_mutex_init_waiter()
  futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock()
  futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock()
  futex: Futex_unlock_pi() determinism
  futex: Drop hb->lock before enqueueing on the rtmutex

Thomas Gleixner (2):
  rtmutex: Make wait_lock irq safe
  futex: Rename free_pi_state() to put_pi_state()

Xunlei Pang (2):
  rtmutex: Deboost before waking up the top waiter
  sched/rtmutex/deadline: Fix a PI crash for deadline tasks

 include/linux/init_task.h   |   1 +
 include/linux/sched.h   |   2 +
 include/linux/sched/rt.h|   1 +
 kernel/fork.c   |   1 +
 kernel/futex.c  | 532 ++--
 kernel/locking/rtmutex-debug.c  |   9 -
 kernel/locking/rtmutex-debug.h  |   3 -
 kernel/locking/rtmutex.c| 406 ++
 kernel/locking/rtmutex.h|   2 -
 kernel/locking/rtmutex_common.h |  24 +-
 kernel/sched/core.c |   2 +
 11 files changed, 620 insertions(+), 363 deletions(-)

-- 
2.7.4



[PATCH 09/17] futex: Rename free_pi_state() to put_pi_state()

2018-11-09 Thread Henrik Austad
From: Thomas Gleixner 

commit 29e9ee5d48c35d6cf8afe09bdf03f77125c9ac11 upstream.

free_pi_state() is confusing as it is in fact only freeing/caching the
pi state when the last reference is gone. Rename it to put_pi_state(),
which better reflects what it is doing.

Signed-off-by: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Darren Hart 
Cc: Davidlohr Bueso 
Cc: bhuvanesh_surach...@mentor.com
Cc: Andy Lowe 
Link: http://lkml.kernel.org/r/20151219200607.259636...@linutronix.de
Signed-off-by: Thomas Gleixner 
Tested-by: Henrik Austad 
---
 kernel/futex.c | 17 ++---
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 91acb65..09f698a 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -815,9 +815,12 @@ static void get_pi_state(struct futex_pi_state *pi_state)
 }
 
 /*
+ * Drops a reference to the pi_state object and frees or caches it
+ * when the last reference is gone.
+ *
  * Must be called with the hb lock held.
  */
-static void free_pi_state(struct futex_pi_state *pi_state)
+static void put_pi_state(struct futex_pi_state *pi_state)
 {
if (!pi_state)
return;
@@ -1959,7 +1962,7 @@ retry_private:
case 0:
break;
case -EFAULT:
-   free_pi_state(pi_state);
+   put_pi_state(pi_state);
pi_state = NULL;
double_unlock_hb(hb1, hb2);
hb_waiters_dec(hb2);
@@ -1976,7 +1979,7 @@ retry_private:
 *   exit to complete.
 * - The user space value changed.
 */
-   free_pi_state(pi_state);
+   put_pi_state(pi_state);
pi_state = NULL;
double_unlock_hb(hb1, hb2);
hb_waiters_dec(hb2);
@@ -2049,7 +2052,7 @@ retry_private:
} else if (ret) {
/* -EDEADLK */
this->pi_state = NULL;
-   free_pi_state(pi_state);
+   put_pi_state(pi_state);
goto out_unlock;
}
}
@@ -2058,7 +2061,7 @@ retry_private:
}
 
 out_unlock:
-   free_pi_state(pi_state);
+   put_pi_state(pi_state);
double_unlock_hb(hb1, hb2);
	wake_up_q(&wake_q);
hb_waiters_dec(hb2);
@@ -2211,7 +2214,7 @@ static void unqueue_me_pi(struct futex_q *q)
__unqueue_futex(q);
 
BUG_ON(!q->pi_state);
-   free_pi_state(q->pi_state);
+   put_pi_state(q->pi_state);
q->pi_state = NULL;
 
spin_unlock(q->lock_ptr);
@@ -2993,7 +2996,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 * Drop the reference to the pi state which
 * the requeue_pi() code acquired for us.
 */
-   free_pi_state(q.pi_state);
+   put_pi_state(q.pi_state);
spin_unlock(q.lock_ptr);
}
} else {
-- 
2.7.4



[PATCH 11/17] futex,rt_mutex: Introduce rt_mutex_init_waiter()

2018-11-09 Thread Henrik Austad
From: Peter Zijlstra 

commit 50809358dd7199aa7ce232f6877dd09ec30ef374 upstream.

Since there's already two copies of this code, introduce a helper now
before adding a third one.

Signed-off-by: Peter Zijlstra (Intel) 
Cc: juri.le...@arm.com
Cc: bige...@linutronix.de
Cc: xlp...@redhat.com
Cc: rost...@goodmis.org
Cc: mathieu.desnoy...@efficios.com
Cc: jdesfos...@efficios.com
Cc: dvh...@infradead.org
Cc: bris...@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.950039...@infradead.org
Signed-off-by: Thomas Gleixner 
Tested-by: Henrik Austad 
---
 kernel/futex.c  |  5 +
 kernel/locking/rtmutex.c| 12 +---
 kernel/locking/rtmutex_common.h |  1 +
 3 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 7054ca3..4d70fd7 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2969,10 +2969,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, 
unsigned int flags,
 * The waiter is allocated on our stack, manipulated by the requeue
 * code while we sleep on uaddr.
 */
-   debug_rt_mutex_init_waiter(&rt_waiter);
-   RB_CLEAR_NODE(&rt_waiter.pi_tree_entry);
-   RB_CLEAR_NODE(&rt_waiter.tree_entry);
-   rt_waiter.task = NULL;
+   rt_mutex_init_waiter(&rt_waiter);
 
ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2, VERIFY_WRITE);
if (unlikely(ret != 0))
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 28c1d40..8778ac3 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1151,6 +1151,14 @@ void rt_mutex_adjust_pi(struct task_struct *task)
   next_lock, NULL, task);
 }
 
+void rt_mutex_init_waiter(struct rt_mutex_waiter *waiter)
+{
+   debug_rt_mutex_init_waiter(waiter);
+   RB_CLEAR_NODE(&waiter->pi_tree_entry);
+   RB_CLEAR_NODE(&waiter->tree_entry);
+   waiter->task = NULL;
+}
+
 /**
  * __rt_mutex_slowlock() - Perform the wait-wake-try-to-take loop
  * @lock:   the rt_mutex to take
@@ -1233,9 +1241,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
unsigned long flags;
int ret = 0;
 
-   debug_rt_mutex_init_waiter(&waiter);
-   RB_CLEAR_NODE(&waiter.pi_tree_entry);
-   RB_CLEAR_NODE(&waiter.tree_entry);
+   rt_mutex_init_waiter(&waiter);
 
/*
 * Technically we could use raw_spin_[un]lock_irq() here, but this can
diff --git a/kernel/locking/rtmutex_common.h b/kernel/locking/rtmutex_common.h
index 2441c2d..d16de236 100644
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -103,6 +103,7 @@ extern void rt_mutex_init_proxy_locked(struct rt_mutex 
*lock,
   struct task_struct *proxy_owner);
 extern void rt_mutex_proxy_unlock(struct rt_mutex *lock,
  struct task_struct *proxy_owner);
+extern void rt_mutex_init_waiter(struct rt_mutex_waiter *waiter);
 extern int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
 struct rt_mutex_waiter *waiter,
 struct task_struct *task);
-- 
2.7.4



[PATCH 16/17] rtmutex: Deboost before waking up the top waiter

2018-11-09 Thread Henrik Austad
From: Xunlei Pang 

commit 2a1c6029940675abb2217b590512dbf691867ec4 upstream.

We should deboost before waking the high-priority task, such that we
don't run two tasks with the same "state" (priority, deadline,
sched_class, etc).

In order to make sure the boosting task doesn't start running between
unlock and deboost (due to 'spurious' wakeup), we move the deboost
under the wait_lock; that way it's serialized against the wait loop in
__rt_mutex_slowlock().

Doing the deboost early can however lead to priority-inversion if
current would get preempted after the deboost but before waking our
high-prio task, hence we disable preemption before doing deboost, and
enabling it after the wake up is over.

This gets us the right semantic order, but most importantly,
this change ensures pointer stability for the next patch, where we
have rt_mutex_setprio() cache a pointer to the top-most waiter task.
If we, as before this change, do the wakeup first and then deboost,
this pointer might point into thin air.

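Userspace has no preempt_disable(), but the deboost-then-wake ordering,
serialized by the lock, can be mirrored with POSIX primitives. A loose
sketch (wait_lock/top_waiter are stand-ins, not the kernel objects):

#include <pthread.h>

static pthread_mutex_t wait_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  top_waiter = PTHREAD_COND_INITIALIZER;
static int lock_released;

/* Release path: deboost while still serialized by wait_lock, and only
 * then wake the waiter, so both never run boosted at the same time. */
static void unlock_deboost_then_wake(int base_prio)
{
	pthread_mutex_lock(&wait_lock);
	lock_released = 1;
	/* deboost first, still under wait_lock */
	pthread_setschedprio(pthread_self(), base_prio);
	pthread_cond_signal(&top_waiter);
	pthread_mutex_unlock(&wait_lock);
}
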
[peterz: Changelog + patch munging]
Suggested-by: Peter Zijlstra 
Signed-off-by: Xunlei Pang 
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Steven Rostedt 
Cc: juri.le...@arm.com
Cc: bige...@linutronix.de
Cc: mathieu.desnoy...@efficios.com
Cc: jdesfos...@efficios.com
Cc: bris...@redhat.com
Link: http://lkml.kernel.org/r/20170323150216.110065...@infradead.org
Signed-off-by: Thomas Gleixner 
Tested-by: Henrik Austad 
---
 kernel/futex.c  |  5 +---
 kernel/locking/rtmutex.c| 59 ++---
 kernel/locking/rtmutex_common.h |  2 +-
 3 files changed, 34 insertions(+), 32 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index afb02a7..63fa840 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1457,10 +1457,7 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, 
struct futex_pi_state *pi_
 out_unlock:
raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
 
-   if (deboost) {
-   wake_up_q(&wake_q);
-   rt_mutex_adjust_prio(current);
-   }
+   rt_mutex_postunlock(&wake_q, deboost);
 
return ret;
 }
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index b061a79..c01d7f4 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -371,24 +371,6 @@ static void __rt_mutex_adjust_prio(struct task_struct 
*task)
 }
 
 /*
- * Adjust task priority (undo boosting). Called from the exit path of
- * rt_mutex_slowunlock() and rt_mutex_slowlock().
- *
- * (Note: We do this outside of the protection of lock->wait_lock to
- * allow the lock to be taken while or before we readjust the priority
- * of task. We do not use the spin_xx_mutex() variants here as we are
- * outside of the debug path.)
- */
-void rt_mutex_adjust_prio(struct task_struct *task)
-{
-   unsigned long flags;
-
-   raw_spin_lock_irqsave(&task->pi_lock, flags);
-   __rt_mutex_adjust_prio(task);
-   raw_spin_unlock_irqrestore(&task->pi_lock, flags);
-}
-
-/*
  * Deadlock detection is conditional:
  *
  * If CONFIG_DEBUG_RT_MUTEXES=n, deadlock detection is only conducted
@@ -1049,6 +1031,7 @@ static void mark_wakeup_next_waiter(struct wake_q_head 
*wake_q,
 * lock->wait_lock.
 */
rt_mutex_dequeue_pi(current, waiter);
+   __rt_mutex_adjust_prio(current);
 
/*
 * As we are waking up the top waiter, and the waiter stays
@@ -1391,6 +1374,16 @@ static bool __sched rt_mutex_slowunlock(struct rt_mutex 
*lock,
 */
mark_wakeup_next_waiter(wake_q, lock);
 
+   /*
+* We should deboost before waking the top waiter task such that
+* we don't run two tasks with the 'same' priority. This however
+* can lead to prio-inversion if we would get preempted after
+* the deboost but before waking our high-prio task, hence the
+* preempt_disable before unlock. Pairs with preempt_enable() in
+* rt_mutex_postunlock();
+*/
+   preempt_disable();
+
raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 
/* check PI boosting */
@@ -1440,6 +1433,18 @@ rt_mutex_fasttrylock(struct rt_mutex *lock,
return slowfn(lock);
 }
 
+/*
+ * Undo pi boosting (if necessary) and wake top waiter.
+ */
+void rt_mutex_postunlock(struct wake_q_head *wake_q, bool deboost)
+{
+   wake_up_q(wake_q);
+
+   /* Pairs with preempt_disable() in rt_mutex_slowunlock() */
+   if (deboost)
+   preempt_enable();
+}
+
 static inline void
 rt_mutex_fastunlock(struct rt_mutex *lock,
bool (*slowfn)(struct rt_mutex *lock,
@@ -1453,11 +1458,7 @@ rt_mutex_fastunlock(struct rt_mutex *lock,
 
deboost = slowfn(lock, &wake_q);
 
-   wake_up_q(&wake_q);
-
-   /* Undo pi boosting if necessary: */
-   if (deboost)
-   rt_mutex_adjust_prio(current);
+   rt_mutex_postunlock(&wake_q, deboost);
 }
 
 /**
@@ -1570,6 +1571,13 @@ bool __sched __rt_mutex_futex_u

[PATCH 08/17] futex: Rework inconsistent rt_mutex/futex_q state

2018-11-09 Thread Henrik Austad
From: Peter Zijlstra 

commit 73d786bd043ebc855f349c81ea805f6b11cbf2aa upstream.

There is a weird state in the futex_unlock_pi() path when it interleaves
with a concurrent futex_lock_pi() at the point where it drops hb->lock.

In this case, it can happen that the rt_mutex wait_list and the futex_q
disagree on pending waiters, in particular rt_mutex will find no pending
waiters where futex_q thinks there are. In this case the rt_mutex unlock
code cannot assign an owner.

The futex side fixup code has to cleanup the inconsistencies with quite a
bunch of interesting corner cases.

Simplify all this by changing wake_futex_pi() to return -EAGAIN when this
situation occurs. This then gives the futex_lock_pi() code the opportunity
to continue and the retried futex_unlock_pi() will now observe a coherent
state.

The only problem is that this breaks RT timeliness guarantees. That
is, consider the following scenario:

  T1 and T2 are both pinned to CPU0. prio(T2) > prio(T1)

    CPU0

    T1
      lock_pi()
      queue_me()  <- Waiter is visible

    preemption

    T2
      unlock_pi()
        loops with -EAGAIN forever

Which is undesirable for PI primitives. Future patches will rectify
this.

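For context, the userspace half of the PI-futex protocol that drives these
kernel paths looks roughly like this (a sketch following the futex(2) man
page; error handling and the FUTEX_WAITERS handling are elided):

#include <linux/futex.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

static _Atomic uint32_t futex_word;	/* 0 = unlocked, else owner TID */

static void pi_lock(void)
{
	uint32_t unlocked = 0;
	uint32_t tid = (uint32_t)syscall(SYS_gettid);

	/* fast path: uncontended acquire, no syscall at all */
	if (atomic_compare_exchange_strong(&futex_word, &unlocked, tid))
		return;
	/* slow path: the kernel queues us and PI-boosts the owner */
	syscall(SYS_futex, &futex_word, FUTEX_LOCK_PI, 0, NULL, NULL, 0);
}

static void pi_unlock(void)
{
	uint32_t tid = (uint32_t)syscall(SYS_gettid);

	/* fast path: no waiters recorded in the futex word */
	if (atomic_compare_exchange_strong(&futex_word, &tid, 0))
		return;
	/* slow path: the kernel (the code below) selects the new owner */
	syscall(SYS_futex, &futex_word, FUTEX_UNLOCK_PI, 0, NULL, NULL, 0);
}
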
Signed-off-by: Peter Zijlstra (Intel) 
Cc: juri.le...@arm.com
Cc: bige...@linutronix.de
Cc: xlp...@redhat.com
Cc: rost...@goodmis.org
Cc: mathieu.desnoy...@efficios.com
Cc: jdesfos...@efficios.com
Cc: dvh...@infradead.org
Cc: bris...@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.850383...@infradead.org
Signed-off-by: Thomas Gleixner 
Tested-by: Henrik Austad 
---
 kernel/futex.c | 50 ++
 1 file changed, 14 insertions(+), 36 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 9d7d462..91acb65 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1398,12 +1398,19 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, 
struct futex_q *top_waiter
new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);
 
/*
-* It is possible that the next waiter (the one that brought
-* top_waiter owner to the kernel) timed out and is no longer
-* waiting on the lock.
+* When we interleave with futex_lock_pi() where it does
+* rt_mutex_timed_futex_lock(), we might observe @this futex_q waiter,
+* but the rt_mutex's wait_list can be empty (either still, or again,
+* depending on which side we land).
+*
+* When this happens, give up our locks and try again, giving the
+* futex_lock_pi() instance time to complete, either by waiting on the
+* rtmutex or removing itself from the futex queue.
 */
-   if (!new_owner)
-   new_owner = top_waiter->task;
+   if (!new_owner) {
+   raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
+   return -EAGAIN;
+   }
 
/*
 * We pass it to the next owner. The WAITERS bit is always
@@ -2342,7 +2349,6 @@ static long futex_wait_restart(struct restart_block 
*restart);
  */
 static int fixup_owner(u32 __user *uaddr, struct futex_q *q, int locked)
 {
-   struct task_struct *owner;
int ret = 0;
 
if (locked) {
@@ -2356,43 +2362,15 @@ static int fixup_owner(u32 __user *uaddr, struct 
futex_q *q, int locked)
}
 
/*
-* Catch the rare case, where the lock was released when we were on the
-* way back before we locked the hash bucket.
-*/
-   if (q->pi_state->owner == current) {
-   /*
-* Try to get the rt_mutex now. This might fail as some other
-* task acquired the rt_mutex after we removed ourself from the
-* rt_mutex waiters list.
-*/
-   if (rt_mutex_futex_trylock(&q->pi_state->pi_mutex)) {
-   locked = 1;
-   goto out;
-   }
-
-   /*
-* pi_state is incorrect, some other task did a lock steal and
-* we returned due to timeout or signal without taking the
-* rt_mutex. Too late.
-*/
-   raw_spin_lock_irq(&q->pi_state->pi_mutex.wait_lock);
-   owner = rt_mutex_owner(&q->pi_state->pi_mutex);
-   if (!owner)
-   owner = rt_mutex_next_owner(&q->pi_state->pi_mutex);
-   raw_spin_unlock_irq(&q->pi_state->pi_mutex.wait_lock);
-   ret = fixup_pi_state_owner(uaddr, q, owner);
-   goto out;
-   }
-
-   /*
 * Paranoia check. If we did not take the lock, then we should not be
 * the owner of the rt_mutex.
 */
-   if (rt_mutex_owner(&q->pi_state->pi_mutex) == current)
+   if (rt_mutex_owner(&q->pi_state->pi_mutex) == current) {
printk(KERN_ERR "fixup_owner: ret = %d pi-mutex: %p "
"

[PATCH 05/17] futex,rt_mutex: Provide futex specific rt_mutex API

2018-11-09 Thread Henrik Austad
From: Peter Zijlstra 

commit 5293c2efda37775346885c7e924d4ef7018ea60b upstream.

Part of what makes futex_unlock_pi() intricate is that
rt_mutex_futex_unlock() -> rt_mutex_slowunlock() can drop
rt_mutex::wait_lock.

This means it cannot rely on the atomicity of wait_lock, which would be
preferred in order to not rely on hb->lock so much.

The reason rt_mutex_slowunlock() needs to drop wait_lock is that it can
race with the rt_mutex fastpath; futexes, however, have their own fast path.

Since futexes already have a bunch of separate rt_mutex accessors, complete
that set and implement a rt_mutex variant without fastpath for them.

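The fastpath/slowpath split referred to here can be sketched generically
(plain C with invented names; the real rt_mutex fastpath is a cmpxchg on
the owner field):

#include <stdatomic.h>

struct mini_mutex {
	atomic_long owner;	/* 0 == unlocked, else owner id */
	/* ... wait list protected by a wait_lock ... */
};

static void mini_slowlock(struct mini_mutex *m, long me)
{
	/* ... enqueue under the wait_lock, block, take over, ... */
	(void)m; (void)me;
}

/* Regular API: atomic fastpath first, slowpath only on contention. */
static void mini_lock(struct mini_mutex *m, long me)
{
	long unlocked = 0;

	if (atomic_compare_exchange_strong(&m->owner, &unlocked, me))
		return;		/* fastpath: never touched the wait_lock */
	mini_slowlock(m, me);
}

/* Futex-style API: no fastpath, so every state change is guaranteed
 * to happen under the wait_lock, visible to the other slowpaths. */
static void mini_lock_no_fastpath(struct mini_mutex *m, long me)
{
	mini_slowlock(m, me);
}
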
Signed-off-by: Peter Zijlstra (Intel) 
Cc: juri.le...@arm.com
Cc: bige...@linutronix.de
Cc: xlp...@redhat.com
Cc: rost...@goodmis.org
Cc: mathieu.desnoy...@efficios.com
Cc: jdesfos...@efficios.com
Cc: dvh...@infradead.org
Cc: bris...@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.702962...@infradead.org
Signed-off-by: Thomas Gleixner 
Tested-by: Henrik Austad 
---
 kernel/futex.c  | 30 +++---
 kernel/locking/rtmutex.c| 55 ++---
 kernel/locking/rtmutex_common.h |  9 +--
 3 files changed, 62 insertions(+), 32 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 0f44952..e1200b9 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -910,7 +910,7 @@ void exit_pi_state_list(struct task_struct *curr)
pi_state->owner = NULL;
raw_spin_unlock_irq(&curr->pi_lock);
 
-   rt_mutex_unlock(&pi_state->pi_mutex);
+   rt_mutex_futex_unlock(&pi_state->pi_mutex);
 
spin_unlock(&hb->lock);
 
@@ -1358,20 +1358,18 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, 
struct futex_q *top_waiter
pi_state->owner = new_owner;
raw_spin_unlock(&new_owner->pi_lock);
 
-   raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
-
-   deboost = rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q);
-
/*
-* First unlock HB so the waiter does not spin on it once he got woken
-* up. Second wake up the waiter before the priority is adjusted. If we
-* deboost first (and lose our higher priority), then the task might get
-* scheduled away before the wake up can take place.
+* We've updated the uservalue, this unlock cannot fail.
 */
+   deboost = __rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q);
+
+   raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
spin_unlock(&hb->lock);
-   wake_up_q(&wake_q);
-   if (deboost)
+
+   if (deboost) {
+   wake_up_q(&wake_q);
rt_mutex_adjust_prio(current);
+   }
 
return 0;
 }
@@ -2259,7 +2257,7 @@ static int fixup_owner(u32 __user *uaddr, struct futex_q 
*q, int locked)
 * task acquired the rt_mutex after we removed ourself from the
 * rt_mutex waiters list.
 */
-   if (rt_mutex_trylock(&q->pi_state->pi_mutex)) {
+   if (rt_mutex_futex_trylock(&q->pi_state->pi_mutex)) {
locked = 1;
goto out;
}
@@ -2574,7 +2572,7 @@ retry_private:
if (!trylock) {
ret = rt_mutex_timed_futex_lock(&q.pi_state->pi_mutex, to);
} else {
-   ret = rt_mutex_trylock(&q.pi_state->pi_mutex);
+   ret = rt_mutex_futex_trylock(&q.pi_state->pi_mutex);
/* Fixup the trylock return value: */
ret = ret ? 0 : -EWOULDBLOCK;
}
@@ -2597,7 +2595,7 @@ retry_private:
 * it and return the fault to userspace.
 */
if (ret && (rt_mutex_owner(&q.pi_state->pi_mutex) == current))
-   rt_mutex_unlock(&q.pi_state->pi_mutex);
+   rt_mutex_futex_unlock(&q.pi_state->pi_mutex);
 
/* Unqueue and drop the lock */
unqueue_me_pi();
@@ -2904,7 +2902,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, 
unsigned int flags,
spin_lock(q.lock_ptr);
ret = fixup_pi_state_owner(uaddr2, &q, current);
if (ret && rt_mutex_owner(&q.pi_state->pi_mutex) == current)
-   rt_mutex_unlock(&q.pi_state->pi_mutex);
+   rt_mutex_futex_unlock(&q.pi_state->pi_mutex);
/*
 * Drop the reference to the pi state which
 * the requeue_pi() code acquired for us.
@@ -2944,7 +2942,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, 
unsigned int flags,
 * userspace.
 */
if (ret && rt_mutex_owner(pi_mutex) == current)
-   rt_mutex_unlock(pi_mutex);
+   rt_mutex_futex_unlock(pi_mutex);
 
/* Unqueue and drop the lock. */
unqueue_me_pi();
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index b8d08c7..28c1d40 100644
--- 

[PATCH 10/17] futex: Pull rt_mutex_futex_unlock() out from under hb->lock

2018-11-09 Thread Henrik Austad
From: Peter Zijlstra 

commit 16ffa12d742534d4ff73e8b3a4e81c1de39196f0 upstream.

There's a number of 'interesting' problems, all caused by holding
hb->lock while doing the rt_mutex_unlock() equivalent.

Notably:

 - a PI inversion on hb->lock; and,

 - a SCHED_DEADLINE crash because of pointer instability.

The previous changes:

 - changed the locking rules to cover {uval,pi_state} with wait_lock.

 - allow doing rt_mutex_futex_unlock() without dropping wait_lock, which in
   turn allows relying on wait_lock atomicity completely.

 - simplified the waiter conundrum.

It's now sufficient to hold rtmutex::wait_lock and a reference on the
pi_state to protect the state consistency, so hb->lock can be dropped
before calling rt_mutex_futex_unlock().

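The resulting pattern, reduced to generic C (pthread mutexes standing in
for hb->lock and rt_mutex::wait_lock; all names invented):

#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

struct state {
	atomic_int refcount;
	pthread_mutex_t wait_lock;
};

static void put_state(struct state *s)
{
	if (atomic_fetch_sub(&s->refcount, 1) == 1)
		free(s);
}

/* Pin the object with a reference, drop the container lock early, then
 * work under the object's own wait_lock; it cannot be freed under us. */
static void unlock_pattern(pthread_mutex_t *hb_lock, struct state *s)
{
	atomic_fetch_add(&s->refcount, 1);	/* get_pi_state() analog */
	pthread_mutex_unlock(hb_lock);		/* drop hb->lock before unlock */

	pthread_mutex_lock(&s->wait_lock);	/* rt_mutex wait_lock analog */
	/* ... hand off ownership / wake the top waiter ... */
	pthread_mutex_unlock(&s->wait_lock);

	put_state(s);				/* may free now, safely */
}
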
Signed-off-by: Peter Zijlstra (Intel) 
Cc: juri.le...@arm.com
Cc: bige...@linutronix.de
Cc: xlp...@redhat.com
Cc: rost...@goodmis.org
Cc: mathieu.desnoy...@efficios.com
Cc: jdesfos...@efficios.com
Cc: dvh...@infradead.org
Cc: bris...@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.92...@infradead.org
Signed-off-by: Thomas Gleixner 

Conflicts:
kernel/futex.c
Tested-by: Henrik Austad 
---
 kernel/futex.c | 154 +
 1 file changed, 100 insertions(+), 54 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 09f698a..7054ca3 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -918,10 +918,12 @@ void exit_pi_state_list(struct task_struct *curr)
pi_state->owner = NULL;
raw_spin_unlock_irq(&curr->pi_lock);
 
-   rt_mutex_futex_unlock(&pi_state->pi_mutex);
-
+   get_pi_state(pi_state);
spin_unlock(&hb->lock);
 
+   rt_mutex_futex_unlock(&pi_state->pi_mutex);
+   put_pi_state(pi_state);
+
raw_spin_lock_irq(&curr->pi_lock);
}
raw_spin_unlock_irq(&curr->pi_lock);
@@ -1034,6 +1036,11 @@ static int attach_to_pi_state(u32 __user *uaddr, u32 
uval,
 * has dropped the hb->lock in between queue_me() and unqueue_me_pi(),
 * which in turn means that futex_lock_pi() still has a reference on
 * our pi_state.
+*
+* The waiter holding a reference on @pi_state also protects against
+* the unlocked put_pi_state() in futex_unlock_pi(), futex_lock_pi()
+* and futex_wait_requeue_pi() as it cannot go to 0 and consequently
+* free pi_state before we can take a reference ourselves.
 */
WARN_ON(!atomic_read(&pi_state->refcount));
 
@@ -1377,48 +1384,40 @@ static void mark_wake_futex(struct wake_q_head *wake_q, 
struct futex_q *q)
smp_store_release(&q->lock_ptr, NULL);
 }
 
-static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter,
-struct futex_hash_bucket *hb)
+/*
+ * Caller must hold a reference on @pi_state.
+ */
+static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_pi_state *pi_state)
 {
-   struct task_struct *new_owner;
-   struct futex_pi_state *pi_state = top_waiter->pi_state;
u32 uninitialized_var(curval), newval;
+   struct task_struct *new_owner;
+   bool deboost = false;
WAKE_Q(wake_q);
-   bool deboost;
int ret = 0;
 
-   if (!pi_state)
-   return -EINVAL;
-
-   /*
-* If current does not own the pi_state then the futex is
-* inconsistent and user space fiddled with the futex value.
-*/
-   if (pi_state->owner != current)
-   return -EINVAL;
-
raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);
-
-   /*
-* When we interleave with futex_lock_pi() where it does
-* rt_mutex_timed_futex_lock(), we might observe @this futex_q waiter,
-* but the rt_mutex's wait_list can be empty (either still, or again,
-* depending on which side we land).
-*
-* When this happens, give up our locks and try again, giving the
-* futex_lock_pi() instance time to complete, either by waiting on the
-* rtmutex or removing itself from the futex queue.
-*/
if (!new_owner) {
-   raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
-   return -EAGAIN;
+   /*
+* Since we held neither hb->lock nor wait_lock when coming
+* into this function, we could have raced with futex_lock_pi()
+* such that we might observe @this futex_q waiter, but the
+* rt_mutex's wait_list can be empty (either still, or again,
+* depending on which side we land).
+*
+* When this happens, give up our locks and try again, giving
+* the futex_lock_pi() instance time to complete, either by
+* waiting on the rtmutex or removing itself from the futex
+* queue.
+*/
+ 

[PATCH 15/17] futex: Drop hb->lock before enqueueing on the rtmutex

2018-11-09 Thread Henrik Austad
From: Peter Zijlstra 

commit 56222b212e8edb1cf51f5dd73ff645809b082b40 upstream.

When PREEMPT_RT_FULL does the spinlock -> rt_mutex substitution the PI
chain code will (falsely) report a deadlock and BUG.

The problem is that it holds hb->lock (now an rt_mutex) while doing
task_blocks_on_rt_mutex() on the futex's pi_state::rtmutex. This, when
interleaved just right with futex_unlock_pi(), leads it to believe it
sees an AB-BA deadlock.

  Task1 (holds rt_mutex,            Task2 (does FUTEX_LOCK_PI)
         does FUTEX_UNLOCK_PI)

    lock hb->lock
    lock rt_mutex (as per start_proxy)
                                      lock hb->lock

Which is a trivial AB-BA.

It is not an actual deadlock, because it won't be holding hb->lock by the
time it actually blocks on the rt_mutex, but the chainwalk code doesn't
know that and it would be a nightmare to handle this gracefully.

To avoid this problem, do the same as in futex_unlock_pi() and drop
hb->lock after acquiring wait_lock. This still fully serializes against
futex_unlock_pi(), since adding to the wait_list does the very same lock
dance, and removing it holds both locks.

Aside from solving the RT problem, this makes the lock and unlock mechanism
symmetric and reduces the hb->lock held time.

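The hand-over, reduced to a generic two-lock sketch in C (pthread mutexes
standing in for hb->lock and rt_mutex::wait_lock):

#include <pthread.h>

static pthread_mutex_t hb_lock   = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t wait_lock = PTHREAD_MUTEX_INITIALIZER;

/* Take the inner lock before dropping the outer one, so there is never
 * a window in which we hold neither. Lock and unlock side follow the
 * same dance, so they fully serialize against each other. */
static void handoff(void)
{
	pthread_mutex_lock(&hb_lock);
	/* ... look up and validate state under hb_lock ... */
	pthread_mutex_lock(&wait_lock);		/* overlap: hold both */
	pthread_mutex_unlock(&hb_lock);		/* drop the outer lock */
	/* ... enqueue the waiter / wake under wait_lock only ... */
	pthread_mutex_unlock(&wait_lock);
}
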
Reported-and-tested-by: Sebastian Andrzej Siewior 
Suggested-by: Thomas Gleixner 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: juri.le...@arm.com
Cc: xlp...@redhat.com
Cc: rost...@goodmis.org
Cc: mathieu.desnoy...@efficios.com
Cc: jdesfos...@efficios.com
Cc: dvh...@infradead.org
Cc: bris...@redhat.com
Link: http://lkml.kernel.org/r/20170322104152.161341...@infradead.org
Signed-off-by: Thomas Gleixner 
Tested-by: Henrik Austad 
---
 kernel/futex.c  | 30 +
 kernel/locking/rtmutex.c| 49 +++--
 kernel/locking/rtmutex_common.h |  3 +++
 3 files changed, 52 insertions(+), 30 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 14d270e..afb02a7 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2667,20 +2667,33 @@ retry_private:
goto no_block;
}
 
+   rt_mutex_init_waiter(&rt_waiter);
+
/*
-* We must add ourselves to the rt_mutex waitlist while holding hb->lock
-* such that the hb and rt_mutex wait lists match.
+* On PREEMPT_RT_FULL, when hb->lock becomes an rt_mutex, we must not
+* hold it while doing rt_mutex_start_proxy(), because then it will
+* include hb->lock in the blocking chain, even through we'll not in
+* fact hold it while blocking. This will lead it to report -EDEADLK
+* and BUG when futex_unlock_pi() interleaves with this.
+*
+* Therefore acquire wait_lock while holding hb->lock, but drop the
+* latter before calling rt_mutex_start_proxy_lock(). This still fully
+* serializes against futex_unlock_pi() as that does the exact same
+* lock handoff sequence.
 */
-   rt_mutex_init_waiter(&rt_waiter);
-   ret = rt_mutex_start_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter, current);
+   raw_spin_lock_irq(&q.pi_state->pi_mutex.wait_lock);
+   spin_unlock(q.lock_ptr);
+   ret = __rt_mutex_start_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter, current);
+   raw_spin_unlock_irq(&q.pi_state->pi_mutex.wait_lock);
+
if (ret) {
if (ret == 1)
ret = 0;
 
+   spin_lock(q.lock_ptr);
goto no_block;
}
 
-   spin_unlock(q.lock_ptr);
 
if (unlikely(to))
hrtimer_start_expires(&to->timer, HRTIMER_MODE_ABS);
@@ -2693,6 +2706,9 @@ retry_private:
 * first acquire the hb->lock before removing the lock from the
 * rt_mutex waitqueue, such that we can keep the hb and rt_mutex
 * wait lists consistent.
+*
+* In particular; it is important that futex_unlock_pi() can not
+* observe this inconsistency.
 */
if (ret && !rt_mutex_cleanup_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter))
ret = 0;
@@ -2804,10 +2820,6 @@ retry:
 
get_pi_state(pi_state);
/*
-* Since modifying the wait_list is done while holding both
-* hb->lock and wait_lock, holding either is sufficient to
-* observe it.
-*
 * By taking wait_lock while still holding hb->lock, we ensure
 * there is no point where we hold neither; and therefore
 * wake_futex_pi() must observe a state consistent with what we
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 3025f61..b061a79 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1659,31 +1659,14 @@ void rt_mutex_proxy_unlock(struct rt_mutex *lock,
rt_mutex_set_owner(lock, NULL);
 }
 
-/**
- * rt_mutex_st

[PATCH 14/17] futex: Futex_unlock_pi() determinism

2018-11-09 Thread Henrik Austad
From: Peter Zijlstra 

commit bebe5b514345f09be2c15e414d076b02ecb9cce8 upstream.

The problem with returning -EAGAIN when the waiter state mismatches is that
it becomes very hard to prove a bounded execution time on the
operation. And seeing that this is a RT operation, this is somewhat
important.

While in practice, given the previous patch, it will be very unlikely to
ever really take more than one or two rounds, proving so becomes rather
hard.

However, now that modifying wait_list is done while holding both hb->lock
and wait_lock, the scenario can be avoided entirely by acquiring wait_lock
while still holding hb->lock. Doing a hand-over, without leaving a hole.

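The retry shape this removes, as a generic C sketch (try_unlock_once()
is a stand-in for one wake_futex_pi() attempt, not a real kernel API):

#include <errno.h>

static int try_unlock_once(void)
{
	static int attempts;

	/* stand-in: pretend the waiter state mismatched twice */
	return ++attempts < 3 ? -EAGAIN : 0;
}

static void unlock_with_retry(void)
{
	/* each -EAGAIN drops all locks and starts over; with a pinned,
	 * preempting locker the number of rounds has no obvious bound,
	 * hence the wait_lock hand-over above instead */
	while (try_unlock_once() == -EAGAIN)
		;
}

int main(void)
{
	unlock_with_retry();
	return 0;
}
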
Signed-off-by: Peter Zijlstra (Intel) 
Cc: juri.le...@arm.com
Cc: bige...@linutronix.de
Cc: xlp...@redhat.com
Cc: rost...@goodmis.org
Cc: mathieu.desnoy...@efficios.com
Cc: jdesfos...@efficios.com
Cc: dvh...@infradead.org
Cc: bris...@redhat.com
Link: http://lkml.kernel.org/r/20170322104152.112378...@infradead.org
Signed-off-by: Thomas Gleixner 
Tested-by: Henrik Austad 
---
 kernel/futex.c | 24 +++-
 1 file changed, 11 insertions(+), 13 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 1cc40dd..14d270e 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1395,15 +1395,10 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, 
struct futex_pi_state *pi_
WAKE_Q(wake_q);
int ret = 0;
 
-   raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);
-   if (!new_owner) {
+   if (WARN_ON_ONCE(!new_owner)) {
/*
-* Since we held neither hb->lock nor wait_lock when coming
-* into this function, we could have raced with futex_lock_pi()
-* such that we might observe @this futex_q waiter, but the
-* rt_mutex's wait_list can be empty (either still, or again,
-* depending on which side we land).
+* As per the comment in futex_unlock_pi() this should not happen.
 *
 * When this happens, give up our locks and try again, giving
 * the futex_lock_pi() instance time to complete, either by
@@ -2807,15 +2802,18 @@ retry:
if (pi_state->owner != current)
goto out_unlock;
 
+   get_pi_state(pi_state);
/*
-* Grab a reference on the pi_state and drop hb->lock.
+* Since modifying the wait_list is done while holding both
+* hb->lock and wait_lock, holding either is sufficient to
+* observe it.
 *
-* The reference ensures pi_state lives, dropping the hb->lock
-* is tricky.. wake_futex_pi() will take rt_mutex::wait_lock to
-* close the races against futex_lock_pi(), but in case of
-* _any_ fail we'll abort and retry the whole deal.
+* By taking wait_lock while still holding hb->lock, we ensure
+* there is no point where we hold neither; and therefore
+* wake_futex_pi() must observe a state consistent with what we
+* observed.
 */
-   get_pi_state(pi_state);
+   raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
spin_unlock(&hb->lock);
 
ret = wake_futex_pi(uaddr, uval, pi_state);
-- 
2.7.4



[PATCH 15/17] futex: Drop hb->lock before enqueueing on the rtmutex

2018-11-09 Thread Henrik Austad
From: Peter Zijlstra 

commit 56222b212e8edb1cf51f5dd73ff645809b082b40 upstream.

When PREEMPT_RT_FULL does the spinlock -> rt_mutex substitution the PI
chain code will (falsely) report a deadlock and BUG.

The problem is that it hold hb->lock (now an rt_mutex) while doing
task_blocks_on_rt_mutex on the futex's pi_state::rtmutex. This, when
interleaved just right with futex_unlock_pi() leads it to believe to see an
AB-BA deadlock.

  Task1 (holds rt_mutex,Task2 (does FUTEX_LOCK_PI)
 does FUTEX_UNLOCK_PI)

lock hb->lock
lock rt_mutex (as per start_proxy)
  lock hb->lock

Which is a trivial AB-BA.

It is not an actual deadlock, because it won't be holding hb->lock by the
time it actually blocks on the rt_mutex, but the chainwalk code doesn't
know that and it would be a nightmare to handle this gracefully.

To avoid this problem, do the same as in futex_unlock_pi() and drop
hb->lock after acquiring wait_lock. This still fully serializes against
futex_unlock_pi(), since adding to the wait_list does the very same lock
dance, and removing it holds both locks.

Aside of solving the RT problem this makes the lock and unlock mechanism
symetric and reduces the hb->lock held time.

Reported-and-tested-by: Sebastian Andrzej Siewior 
Suggested-by: Thomas Gleixner 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: juri.le...@arm.com
Cc: xlp...@redhat.com
Cc: rost...@goodmis.org
Cc: mathieu.desnoy...@efficios.com
Cc: jdesfos...@efficios.com
Cc: dvh...@infradead.org
Cc: bris...@redhat.com
Link: http://lkml.kernel.org/r/20170322104152.161341...@infradead.org
Signed-off-by: Thomas Gleixner 
Tested-by: Henrik Austad 
---
 kernel/futex.c  | 30 +
 kernel/locking/rtmutex.c| 49 +++--
 kernel/locking/rtmutex_common.h |  3 +++
 3 files changed, 52 insertions(+), 30 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 14d270e..afb02a7 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2667,20 +2667,33 @@ retry_private:
goto no_block;
}
 
+   rt_mutex_init_waiter(_waiter);
+
/*
-* We must add ourselves to the rt_mutex waitlist while holding hb->lock
-* such that the hb and rt_mutex wait lists match.
+* On PREEMPT_RT_FULL, when hb->lock becomes an rt_mutex, we must not
+* hold it while doing rt_mutex_start_proxy(), because then it will
+* include hb->lock in the blocking chain, even through we'll not in
+* fact hold it while blocking. This will lead it to report -EDEADLK
+* and BUG when futex_unlock_pi() interleaves with this.
+*
+* Therefore acquire wait_lock while holding hb->lock, but drop the
+* latter before calling rt_mutex_start_proxy_lock(). This still fully
+* serializes against futex_unlock_pi() as that does the exact same
+* lock handoff sequence.
 */
-   rt_mutex_init_waiter(_waiter);
-   ret = rt_mutex_start_proxy_lock(_state->pi_mutex, _waiter, 
current);
+   raw_spin_lock_irq(_state->pi_mutex.wait_lock);
+   spin_unlock(q.lock_ptr);
+   ret = __rt_mutex_start_proxy_lock(_state->pi_mutex, _waiter, 
current);
+   raw_spin_unlock_irq(_state->pi_mutex.wait_lock);
+
if (ret) {
if (ret == 1)
ret = 0;
 
+   spin_lock(q.lock_ptr);
goto no_block;
}
 
-   spin_unlock(q.lock_ptr);
 
if (unlikely(to))
hrtimer_start_expires(>timer, HRTIMER_MODE_ABS);
@@ -2693,6 +2706,9 @@ retry_private:
 * first acquire the hb->lock before removing the lock from the
 * rt_mutex waitqueue, such that we can keep the hb and rt_mutex
 * wait lists consistent.
+*
+* In particular; it is important that futex_unlock_pi() can not
+* observe this inconsistency.
 */
if (ret && !rt_mutex_cleanup_proxy_lock(_state->pi_mutex, 
_waiter))
ret = 0;
@@ -2804,10 +2820,6 @@ retry:
 
get_pi_state(pi_state);
/*
-* Since modifying the wait_list is done while holding both
-* hb->lock and wait_lock, holding either is sufficient to
-* observe it.
-*
 * By taking wait_lock while still holding hb->lock, we ensure
 * there is no point where we hold neither; and therefore
 * wake_futex_pi() must observe a state consistent with what we
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 3025f61..b061a79 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1659,31 +1659,14 @@ void rt_mutex_proxy_unlock(struct rt_mutex *lock,
rt_mutex_set_owner(lock, NULL);
 }
 
-/**
- * rt_mutex_st

[PATCH 14/17] futex: Futex_unlock_pi() determinism

2018-11-09 Thread Henrik Austad
From: Peter Zijlstra 

commit bebe5b514345f09be2c15e414d076b02ecb9cce8 upstream.

The problem with returning -EAGAIN when the waiter state mismatches is that
it becomes very hard to proof a bounded execution time on the
operation. And seeing that this is a RT operation, this is somewhat
important.

While in practise; given the previous patch; it will be very unlikely to
ever really take more than one or two rounds, proving so becomes rather
hard.

However, now that modifying wait_list is done while holding both hb->lock
and wait_lock, the scenario can be avoided entirely by acquiring wait_lock
while still holding hb-lock. Doing a hand-over, without leaving a hole.

Signed-off-by: Peter Zijlstra (Intel) 
Cc: juri.le...@arm.com
Cc: bige...@linutronix.de
Cc: xlp...@redhat.com
Cc: rost...@goodmis.org
Cc: mathieu.desnoy...@efficios.com
Cc: jdesfos...@efficios.com
Cc: dvh...@infradead.org
Cc: bris...@redhat.com
Link: http://lkml.kernel.org/r/20170322104152.112378...@infradead.org
Signed-off-by: Thomas Gleixner 
Tested-by: Henrik Austad 
---
 kernel/futex.c | 24 +++-
 1 file changed, 11 insertions(+), 13 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 1cc40dd..14d270e 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1395,15 +1395,10 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, 
struct futex_pi_state *pi_
WAKE_Q(wake_q);
int ret = 0;
 
-   raw_spin_lock_irq(_state->pi_mutex.wait_lock);
new_owner = rt_mutex_next_owner(_state->pi_mutex);
-   if (!new_owner) {
+   if (WARN_ON_ONCE(!new_owner)) {
/*
-* Since we held neither hb->lock nor wait_lock when coming
-* into this function, we could have raced with futex_lock_pi()
-* such that we might observe @this futex_q waiter, but the
-* rt_mutex's wait_list can be empty (either still, or again,
-* depending on which side we land).
+* As per the comment in futex_unlock_pi() this should not 
happen.
 *
 * When this happens, give up our locks and try again, giving
 * the futex_lock_pi() instance time to complete, either by
@@ -2807,15 +2802,18 @@ retry:
if (pi_state->owner != current)
goto out_unlock;
 
+   get_pi_state(pi_state);
/*
-* Grab a reference on the pi_state and drop hb->lock.
+* Since modifying the wait_list is done while holding both
+* hb->lock and wait_lock, holding either is sufficient to
+* observe it.
 *
-* The reference ensures pi_state lives, dropping the hb->lock
-* is tricky.. wake_futex_pi() will take rt_mutex::wait_lock to
-* close the races against futex_lock_pi(), but in case of
-* _any_ fail we'll abort and retry the whole deal.
+* By taking wait_lock while still holding hb->lock, we ensure
+* there is no point where we hold neither; and therefore
+* wake_futex_pi() must observe a state consistent with what we
+* observed.
 */
-   get_pi_state(pi_state);
+   raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
spin_unlock(&hb->lock);
 
ret = wake_futex_pi(uaddr, uval, pi_state);
-- 
2.7.4



[PATCH 17/17] sched/rtmutex/deadline: Fix a PI crash for deadline tasks

2018-11-09 Thread Henrik Austad
From: Xunlei Pang 

commit e96a7705e7d3fef96aec9b590c63b2f6f7d2ba22 upstream.

A crash happened while I was playing with deadline PI rtmutex.

BUG: unable to handle kernel NULL pointer dereference at 0018
IP: [] rt_mutex_get_top_task+0x1f/0x30
PGD 232a75067 PUD 230947067 PMD 0
Oops:  [#1] SMP
CPU: 1 PID: 10994 Comm: a.out Not tainted

Call Trace:
[] enqueue_task+0x2c/0x80
[] activate_task+0x23/0x30
[] pull_dl_task+0x1d5/0x260
[] pre_schedule_dl+0x16/0x20
[] __schedule+0xd3/0x900
[] schedule+0x29/0x70
[] __rt_mutex_slowlock+0x4b/0xc0
[] rt_mutex_slowlock+0xd1/0x190
[] rt_mutex_timed_lock+0x53/0x60
[] futex_lock_pi.isra.18+0x28c/0x390
[] do_futex+0x190/0x5b0
[] SyS_futex+0x80/0x180

This is because rt_mutex_enqueue_pi() and rt_mutex_dequeue_pi()
are only protected by pi_lock when operating on pi waiters, while
rt_mutex_get_top_task() will access them with rq lock held but
without holding pi_lock.

In order to tackle it, we introduce a new "pi_top_task" pointer
cached in task_struct, and add a new rt_mutex_update_top_task()
to update its value. It can be called by rt_mutex_setprio(),
which holds both the owner's pi_lock and rq lock. Thus "pi_top_task"
can be safely accessed by enqueue_task_dl() under rq lock.
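
As a rough userspace analogue of this caching scheme (illustrative
names, not the kernel code): the writer updates the cached pointer while
holding both locks, so a reader holding either lock observes a stable
value:

#include <pthread.h>
#include <stddef.h>

struct waiter {
        int prio;
        struct waiter *next;      /* sorted, highest priority first */
};

struct owner {
        pthread_mutex_t pi_lock;  /* protects the waiter list   */
        pthread_mutex_t rq_lock;  /* held by the scheduler side */
        struct waiter *waiters;   /* head == top waiter         */
        struct waiter *top_cache; /* analogue of p->pi_top_task */
};

/* Writer: caller must hold BOTH pi_lock and rq_lock, mirroring
 * rt_mutex_update_top_task() in the hunk below. */
static void update_top_cache(struct owner *o)
{
        o->top_cache = o->waiters; /* NULL when there are no waiters */
}

/* Reader: holding rq_lock alone is enough, because any writer that
 * could have changed top_cache also had to take rq_lock. */
static struct waiter *top_waiter(struct owner *o)
{
        return o->top_cache;
}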

Originally-From: Peter Zijlstra 
Signed-off-by: Xunlei Pang 
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Steven Rostedt 
Reviewed-by: Thomas Gleixner 
Cc: juri.le...@arm.com
Cc: bige...@linutronix.de
Cc: mathieu.desnoy...@efficios.com
Cc: jdesfos...@efficios.com
Cc: bris...@redhat.com
Link: http://lkml.kernel.org/r/20170323150216.157682...@infradead.org
Signed-off-by: Thomas Gleixner 

Conflicts:
include/linux/sched.h
Tested-by: Henrik Austad 
---
 include/linux/init_task.h |  1 +
 include/linux/sched.h |  2 ++
 include/linux/sched/rt.h  |  1 +
 kernel/fork.c |  1 +
 kernel/locking/rtmutex.c  | 29 +
 kernel/sched/core.c   |  2 ++
 6 files changed, 28 insertions(+), 8 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 1c1ff7e..a561ce0c 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -162,6 +162,7 @@ extern struct task_group root_task_group;
 #ifdef CONFIG_RT_MUTEXES
 # define INIT_RT_MUTEXES(tsk)  \
.pi_waiters = RB_ROOT,  \
+   .pi_top_task = NULL,\
.pi_waiters_leftmost = NULL,
 #else
 # define INIT_RT_MUTEXES(tsk)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b30540d..89cd0d0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1617,6 +1617,8 @@ struct task_struct {
/* PI waiters blocked on a rt_mutex held by this task */
struct rb_root pi_waiters;
struct rb_node *pi_waiters_leftmost;
+   /* Updated under owner's pi_lock and rq lock */
+   struct task_struct  *pi_top_task;
/* Deadlock detection and priority inheritance handling */
struct rt_mutex_waiter *pi_blocked_on;
 #endif
diff --git a/include/linux/sched/rt.h b/include/linux/sched/rt.h
index a30b172..60d0c47 100644
--- a/include/linux/sched/rt.h
+++ b/include/linux/sched/rt.h
@@ -19,6 +19,7 @@ static inline int rt_task(struct task_struct *p)
 extern int rt_mutex_getprio(struct task_struct *p);
 extern void rt_mutex_setprio(struct task_struct *p, int prio);
 extern int rt_mutex_get_effective_prio(struct task_struct *task, int newprio);
+extern void rt_mutex_update_top_task(struct task_struct *p);
 extern struct task_struct *rt_mutex_get_top_task(struct task_struct *task);
 extern void rt_mutex_adjust_pi(struct task_struct *p);
 static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
diff --git a/kernel/fork.c b/kernel/fork.c
index dd2f79a..9376270 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1242,6 +1242,7 @@ static void rt_mutex_init_task(struct task_struct *p)
 #ifdef CONFIG_RT_MUTEXES
p->pi_waiters = RB_ROOT;
p->pi_waiters_leftmost = NULL;
+   p->pi_top_task = NULL;
p->pi_blocked_on = NULL;
 #endif
 }
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index c01d7f4..dd3b1e9 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -321,6 +321,19 @@ rt_mutex_dequeue_pi(struct task_struct *task, struct 
rt_mutex_waiter *waiter)
 }
 
 /*
+ * Must hold both p->pi_lock and task_rq(p)->lock.
+ */
+void rt_mutex_update_top_task(struct task_struct *p)
+{
+   if (!task_has_pi_waiters(p)) {
+   p->pi_top_task = NULL;
+   return;
+   }
+
+   p->pi_top_task = task_top_pi_waiter(p)->task;
+}
+
+/*
  * Calculate task priority from the waiter tree priority
  *
  * Return task->normal_prio when the waiter tree is empty or when
@@ -335,12 +348,12 @@ int rt_mutex_getprio(struct task_struct *task)
   task->normal_prio);
 }

[PATCH 12/17] futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock()

2018-11-09 Thread Henrik Austad
From: Peter Zijlstra 

commit 38d589f2fd08f1296aea3ce62bebd185125c6d81 upstream.

With the ultimate goal of keeping rt_mutex wait_list and futex_q waiters
consistent it's necessary to split 'rt_mutex_futex_lock()' into finer
parts, such that only the actual blocking can be done without hb->lock
held.

Split rt_mutex_finish_proxy_lock() into two parts, one that does the
blocking and one that does remove_waiter() when the lock acquire failed.

When the rtmutex was acquired successfully the waiter can be removed in the
acquisition path safely, since there is no concurrency on the lock owner.

This means that, except for futex_lock_pi(), all wait_list modifications
are done with both hb->lock and wait_lock held.
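
The resulting two-phase caller pattern can be read off the
futex_wait_requeue_pi() hunk below; in outline (kernel context assumed,
comments added here for clarity):

        ret = rt_mutex_wait_proxy_lock(pi_mutex, to, &rt_waiter); /* block */
        spin_lock(q.lock_ptr);                         /* retake hb->lock  */
        if (ret && !rt_mutex_cleanup_proxy_lock(pi_mutex, &rt_waiter))
                ret = 0;  /* we won the lock after all; disregard failure */
        debug_rt_mutex_free_waiter(&rt_waiter);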

[bige...@linutronix.de: fix for futex_requeue_pi_signal_restart]

Signed-off-by: Peter Zijlstra (Intel) 
Cc: juri.le...@arm.com
Cc: bige...@linutronix.de
Cc: xlp...@redhat.com
Cc: rost...@goodmis.org
Cc: mathieu.desnoy...@efficios.com
Cc: jdesfos...@efficios.com
Cc: dvh...@infradead.org
Cc: bris...@redhat.com
Link: http://lkml.kernel.org/r/20170322104152.001659...@infradead.org
Signed-off-by: Thomas Gleixner 
Tested-by: Henrik Austad 
---
 kernel/futex.c  |  7 --
 kernel/locking/rtmutex.c| 52 +++--
 kernel/locking/rtmutex_common.h |  8 ---
 3 files changed, 55 insertions(+), 12 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 4d70fd7..dce3250 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -3045,10 +3045,13 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, 
unsigned int flags,
 */
WARN_ON(!q.pi_state);
pi_mutex = &q.pi_state->pi_mutex;
-   ret = rt_mutex_finish_proxy_lock(pi_mutex, to, &rt_waiter);
-   debug_rt_mutex_free_waiter(&rt_waiter);
+   ret = rt_mutex_wait_proxy_lock(pi_mutex, to, &rt_waiter);
 
spin_lock(q.lock_ptr);
+   if (ret && !rt_mutex_cleanup_proxy_lock(pi_mutex, &rt_waiter))
+   ret = 0;
+
+   debug_rt_mutex_free_waiter(&rt_waiter);
/*
 * Fixup the pi_state owner and possibly acquire the lock if we
 * haven't already.
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 8778ac3..78ecea6 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1743,21 +1743,23 @@ struct task_struct *rt_mutex_next_owner(struct rt_mutex 
*lock)
 }
 
 /**
- * rt_mutex_finish_proxy_lock() - Complete lock acquisition
+ * rt_mutex_wait_proxy_lock() - Wait for lock acquisition
  * @lock:  the rt_mutex we were woken on
  * @to:the timeout, null if none. hrtimer should 
already have
  * been started.
  * @waiter:the pre-initialized rt_mutex_waiter
  *
- * Complete the lock acquisition started our behalf by another thread.
+ * Wait for the the lock acquisition started on our behalf by
+ * rt_mutex_start_proxy_lock(). Upon failure, the caller must call
+ * rt_mutex_cleanup_proxy_lock().
  *
  * Returns:
  *  0 - success
  * <0 - error, one of -EINTR, -ETIMEDOUT
  *
- * Special API call for PI-futex requeue support
+ * Special API call for PI-futex support
  */
-int rt_mutex_finish_proxy_lock(struct rt_mutex *lock,
+int rt_mutex_wait_proxy_lock(struct rt_mutex *lock,
   struct hrtimer_sleeper *to,
   struct rt_mutex_waiter *waiter)
 {
@@ -1770,9 +1772,6 @@ int rt_mutex_finish_proxy_lock(struct rt_mutex *lock,
/* sleep on the mutex */
ret = __rt_mutex_slowlock(lock, TASK_INTERRUPTIBLE, to, waiter);
 
-   if (unlikely(ret))
-   remove_waiter(lock, waiter);
-
/*
 * try_to_take_rt_mutex() sets the waiter bit unconditionally. We might
 * have to fix that up.
@@ -1783,3 +1782,42 @@ int rt_mutex_finish_proxy_lock(struct rt_mutex *lock,
 
return ret;
 }
+
+/**
+ * rt_mutex_cleanup_proxy_lock() - Cleanup failed lock acquisition
+ * @lock:  the rt_mutex we were woken on
+ * @waiter:the pre-initialized rt_mutex_waiter
+ *
+ * Attempt to clean up after a failed rt_mutex_wait_proxy_lock().
+ *
+ * Unless we acquired the lock; we're still enqueued on the wait-list and can
+ * in fact still be granted ownership until we're removed. Therefore we can
+ * find we are in fact the owner and must disregard the
+ * rt_mutex_wait_proxy_lock() failure.
+ *
+ * Returns:
+ *  true  - did the cleanup, we done.
+ *  false - we acquired the lock after rt_mutex_wait_proxy_lock() returned,
+ *  caller should disregards its return value.
+ *
+ * Special API call for PI-futex support
+ */
+bool rt_mutex_cleanup_proxy_lock(struct rt_mutex *lock,
+struct rt_mutex_waiter *waiter)
+{
+   bool cleanup = false;
+
+   raw_spin_lock_irq(&lock->wait_lock);
+   /*
+* Unless

[PATCH 13/17] futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock()

2018-11-09 Thread Henrik Austad
From: Peter Zijlstra 

commit cfafcd117da0216520568c195cb2f6cd1980c4bb upstream.

By changing futex_lock_pi() to use rt_mutex_*_proxy_lock() all wait_list
modifications are done under both hb->lock and wait_lock.

This closes the obvious interleave pattern between futex_lock_pi() and
futex_unlock_pi(), but not entirely so. See below:

Before:

futex_lock_pi()                 futex_unlock_pi()
  unlock hb->lock

                                  lock hb->lock
                                  unlock hb->lock

                                  lock rt_mutex->wait_lock
                                  unlock rt_mutex_wait_lock
                                    -EAGAIN

  lock rt_mutex->wait_lock
  list_add
  unlock rt_mutex->wait_lock

  schedule()

  lock rt_mutex->wait_lock
  list_del
  unlock rt_mutex->wait_lock

                                  <idem>
                                    -EAGAIN

  lock hb->lock

After:

futex_lock_pi()                 futex_unlock_pi()

  lock hb->lock
  lock rt_mutex->wait_lock
  list_add
  unlock rt_mutex->wait_lock
  unlock hb->lock

  schedule()
                                  lock hb->lock
                                  unlock hb->lock
  lock hb->lock
  lock rt_mutex->wait_lock
  list_del
  unlock rt_mutex->wait_lock

                                  lock rt_mutex->wait_lock
                                  unlock rt_mutex_wait_lock
                                    -EAGAIN

  unlock hb->lock

It does, however, solve the earlier starvation/live-lock scenario that was
introduced with the -EAGAIN: unlike the before scenario, where the
-EAGAIN happens while futex_unlock_pi() does not hold any locks, in the
after scenario it happens while futex_unlock_pi() actually holds a lock,
and it is then serialized on that lock.
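
In other words, the -EAGAIN path now executes with hb->lock held, so it
degenerates into a bounded retry loop of roughly this shape (a hedged
sketch only; try_wake() stands in for the wake_futex_pi() attempt and is
not a real kernel function):

        for (;;) {
                spin_lock(&hb->lock); /* the same lock futex_lock_pi() needs */
                ret = try_wake();
                if (ret != -EAGAIN)
                        break;
                spin_unlock(&hb->lock);
                /* futex_lock_pi() could only force -EAGAIN by making
                 * progress under hb->lock itself, so each iteration
                 * implies forward progress: no more live-lock. */
        }
        spin_unlock(&hb->lock);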

Signed-off-by: Peter Zijlstra (Intel) 
Cc: juri.le...@arm.com
Cc: bige...@linutronix.de
Cc: xlp...@redhat.com
Cc: rost...@goodmis.org
Cc: mathieu.desnoy...@efficios.com
Cc: jdesfos...@efficios.com
Cc: dvh...@infradead.org
Cc: bris...@redhat.com
Link: http://lkml.kernel.org/r/20170322104152.062785...@infradead.org
Signed-off-by: Thomas Gleixner 
Tested-by: Henrik Austad 
---
 kernel/futex.c  | 77 +
 kernel/locking/rtmutex.c| 26 --
 kernel/locking/rtmutex_common.h |  1 -
 3 files changed, 62 insertions(+), 42 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index dce3250..1cc40dd 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2112,20 +2112,7 @@ queue_unlock(struct futex_hash_bucket *hb)
hb_waiters_dec(hb);
 }
 
-/**
- * queue_me() - Enqueue the futex_q on the futex_hash_bucket
- * @q: The futex_q to enqueue
- * @hb:The destination hash bucket
- *
- * The hb->lock must be held by the caller, and is released here. A call to
- * queue_me() is typically paired with exactly one call to unqueue_me().  The
- * exceptions involve the PI related operations, which may use unqueue_me_pi()
- * or nothing if the unqueue is done as part of the wake process and the 
unqueue
- * state is implicit in the state of woken task (see futex_wait_requeue_pi() 
for
- * an example).
- */
-static inline void queue_me(struct futex_q *q, struct futex_hash_bucket *hb)
-   __releases(&hb->lock)
+static inline void __queue_me(struct futex_q *q, struct futex_hash_bucket *hb)
 {
int prio;
 
@@ -2142,6 +2129,24 @@ static inline void queue_me(struct futex_q *q, struct 
futex_hash_bucket *hb)
plist_node_init(&q->list, prio);
plist_add(&q->list, &hb->chain);
q->task = current;
+}
+
+/**
+ * queue_me() - Enqueue the futex_q on the futex_hash_bucket
+ * @q: The futex_q to enqueue
+ * @hb:The destination hash bucket
+ *
+ * The hb->lock must be held by the caller, and is released here. A call to
+ * queue_me() is typically paired with exactly one call to unqueue_me().  The
+ * exceptions involve the PI related operations, which may use unqueue_me_pi()
+ * or nothing if the unqueue is done as part of the wake process and the 
unqueue
+ * state is implicit in the state of woken task (see futex_wait_requeue_pi() 
for
+ * an example).
+ */
+static inline void queue_me(struct futex_q *q, struct futex_hash_bucket *hb)
+   __releases(&hb->lock)
+{
+   __queue_me(q, hb);
+   spin_unlock(&hb->lock);
 }
 
@@ -2600,6 +2605,7 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int 
flags,
 {
struct hrtimer_sleeper timeout, *to = NULL;
struct futex_pi_state *pi_state = NULL;
+   struct rt_mutex_waiter rt_waiter;
struct futex_hash_bucket *hb;
struct futex_q q = futex_q_init;
int res, ret;
@@ -2652,25 +2658,52 @@ retry_private:
}
}
 
+   WARN_ON(!q.pi_state);
+
/*
 * Only actually queue now that the atomic ops are done:
 */
-   queue_me(&q, hb);
+   __q

[PATCH 01/17] futex: Cleanup variable names for futex_top_waiter()

2018-11-09 Thread Henrik Austad
From: Peter Zijlstra 

commit 499f5aca2cdd5e958b27e2655e7e7f82524f46b1 upstream.

futex_top_waiter() returns the top-waiter on the pi_mutex. Assigning
this to a variable 'match' totally obscures the code.

Signed-off-by: Peter Zijlstra (Intel) 
Cc: juri.le...@arm.com
Cc: bige...@linutronix.de
Cc: xlp...@redhat.com
Cc: rost...@goodmis.org
Cc: mathieu.desnoy...@efficios.com
Cc: jdesfos...@efficios.com
Cc: dvh...@infradead.org
Cc: bris...@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.554710...@infradead.org
Signed-off-by: Thomas Gleixner 
Tested-by: Henrik Austad 
---
 kernel/futex.c | 30 +++---
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index a26d217..bb87324 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1116,14 +1116,14 @@ static int attach_to_pi_owner(u32 uval, union futex_key 
*key,
 static int lookup_pi_state(u32 uval, struct futex_hash_bucket *hb,
   union futex_key *key, struct futex_pi_state **ps)
 {
-   struct futex_q *match = futex_top_waiter(hb, key);
+   struct futex_q *top_waiter = futex_top_waiter(hb, key);
 
/*
 * If there is a waiter on that futex, validate it and
 * attach to the pi_state when the validation succeeds.
 */
-   if (match)
-   return attach_to_pi_state(uval, match->pi_state, ps);
+   if (top_waiter)
+   return attach_to_pi_state(uval, top_waiter->pi_state, ps);
 
/*
 * We are the first waiter - try to look up the owner based on
@@ -1170,7 +1170,7 @@ static int futex_lock_pi_atomic(u32 __user *uaddr, struct 
futex_hash_bucket *hb,
struct task_struct *task, int set_waiters)
 {
u32 uval, newval, vpid = task_pid_vnr(task);
-   struct futex_q *match;
+   struct futex_q *top_waiter;
int ret;
 
/*
@@ -1196,9 +1196,9 @@ static int futex_lock_pi_atomic(u32 __user *uaddr, struct 
futex_hash_bucket *hb,
 * Lookup existing state first. If it exists, try to attach to
 * its pi_state.
 */
-   match = futex_top_waiter(hb, key);
-   if (match)
-   return attach_to_pi_state(uval, match->pi_state, ps);
+   top_waiter = futex_top_waiter(hb, key);
+   if (top_waiter)
+   return attach_to_pi_state(uval, top_waiter->pi_state, ps);
 
/*
 * No waiter and user TID is 0. We are here because the
@@ -1288,11 +1288,11 @@ static void mark_wake_futex(struct wake_q_head *wake_q, 
struct futex_q *q)
q->lock_ptr = NULL;
 }
 
-static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this,
+static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q 
*top_waiter,
 struct futex_hash_bucket *hb)
 {
struct task_struct *new_owner;
-   struct futex_pi_state *pi_state = this->pi_state;
+   struct futex_pi_state *pi_state = top_waiter->pi_state;
u32 uninitialized_var(curval), newval;
WAKE_Q(wake_q);
bool deboost;
@@ -1313,11 +1313,11 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, 
struct futex_q *this,
 
/*
 * It is possible that the next waiter (the one that brought
-* this owner to the kernel) timed out and is no longer
+* top_waiter owner to the kernel) timed out and is no longer
 * waiting on the lock.
 */
if (!new_owner)
-   new_owner = this->task;
+   new_owner = top_waiter->task;
 
/*
 * We pass it to the next owner. The WAITERS bit is always
@@ -2639,7 +2639,7 @@ static int futex_unlock_pi(u32 __user *uaddr, unsigned 
int flags)
u32 uninitialized_var(curval), uval, vpid = task_pid_vnr(current);
union futex_key key = FUTEX_KEY_INIT;
struct futex_hash_bucket *hb;
-   struct futex_q *match;
+   struct futex_q *top_waiter;
int ret;
 
 retry:
@@ -2663,9 +2663,9 @@ retry:
 * all and we at least want to know if user space fiddled
 * with the futex value instead of blindly unlocking.
 */
-   match = futex_top_waiter(hb, &key);
-   if (match) {
-   ret = wake_futex_pi(uaddr, uval, match, hb);
+   top_waiter = futex_top_waiter(hb, &key);
+   if (top_waiter) {
+   ret = wake_futex_pi(uaddr, uval, top_waiter, hb);
/*
 * In case of success wake_futex_pi dropped the hash
 * bucket lock.
-- 
2.7.4



[PATCH 03/17] futex: Remove rt_mutex_deadlock_account_*()

2018-11-09 Thread Henrik Austad
From: Peter Zijlstra 

commit fffa954fb528963c2fb7b0c0084eb77e2be7ab52 upstream

These are unused and clutter up the code.

Signed-off-by: Peter Zijlstra (Intel) 
Cc: juri.le...@arm.com
Cc: bige...@linutronix.de
Cc: xlp...@redhat.com
Cc: rost...@goodmis.org
Cc: mathieu.desnoy...@efficios.com
Cc: jdesfos...@efficios.com
Cc: dvh...@infradead.org
Cc: bris...@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.652692...@infradead.org
Signed-off-by: Thomas Gleixner 

Conflicts:
kernel/locking/rtmutex.c (WAKE_Q)
Tested-by: Henrik Austad 
---
 kernel/locking/rtmutex-debug.c |  9 
 kernel/locking/rtmutex-debug.h |  3 ---
 kernel/locking/rtmutex.c   | 47 --
 kernel/locking/rtmutex.h   |  2 --
 4 files changed, 18 insertions(+), 43 deletions(-)

diff --git a/kernel/locking/rtmutex-debug.c b/kernel/locking/rtmutex-debug.c
index 62b6cee..0613c4b 100644
--- a/kernel/locking/rtmutex-debug.c
+++ b/kernel/locking/rtmutex-debug.c
@@ -173,12 +173,3 @@ void debug_rt_mutex_init(struct rt_mutex *lock, const char 
*name)
lock->name = name;
 }
 
-void
-rt_mutex_deadlock_account_lock(struct rt_mutex *lock, struct task_struct *task)
-{
-}
-
-void rt_mutex_deadlock_account_unlock(struct task_struct *task)
-{
-}
-
diff --git a/kernel/locking/rtmutex-debug.h b/kernel/locking/rtmutex-debug.h
index d0519c3..b585af9 100644
--- a/kernel/locking/rtmutex-debug.h
+++ b/kernel/locking/rtmutex-debug.h
@@ -9,9 +9,6 @@
  * This file contains macros used solely by rtmutex.c. Debug version.
  */
 
-extern void
-rt_mutex_deadlock_account_lock(struct rt_mutex *lock, struct task_struct 
*task);
-extern void rt_mutex_deadlock_account_unlock(struct task_struct *task);
 extern void debug_rt_mutex_init_waiter(struct rt_mutex_waiter *waiter);
 extern void debug_rt_mutex_free_waiter(struct rt_mutex_waiter *waiter);
 extern void debug_rt_mutex_init(struct rt_mutex *lock, const char *name);
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index b066724d..6cf9dab7 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -937,8 +937,6 @@ takeit:
 */
rt_mutex_set_owner(lock, task);
 
-   rt_mutex_deadlock_account_lock(lock, task);
-
return 1;
 }
 
@@ -1331,8 +1329,6 @@ static bool __sched rt_mutex_slowunlock(struct rt_mutex 
*lock,
 
debug_rt_mutex_unlock(lock);
 
-   rt_mutex_deadlock_account_unlock(current);
-
/*
 * We must be careful here if the fast path is enabled. If we
 * have no waiters queued we cannot set owner to NULL here
@@ -1398,11 +1394,10 @@ rt_mutex_fastlock(struct rt_mutex *lock, int state,
struct hrtimer_sleeper *timeout,
enum rtmutex_chainwalk chwalk))
 {
-   if (likely(rt_mutex_cmpxchg_acquire(lock, NULL, current))) {
-   rt_mutex_deadlock_account_lock(lock, current);
+   if (likely(rt_mutex_cmpxchg_acquire(lock, NULL, current)))
return 0;
-   } else
-   return slowfn(lock, state, NULL, RT_MUTEX_MIN_CHAINWALK);
+
+   return slowfn(lock, state, NULL, RT_MUTEX_MIN_CHAINWALK);
 }
 
 static inline int
@@ -1414,21 +1409,19 @@ rt_mutex_timed_fastlock(struct rt_mutex *lock, int 
state,
  enum rtmutex_chainwalk chwalk))
 {
if (chwalk == RT_MUTEX_MIN_CHAINWALK &&
-   likely(rt_mutex_cmpxchg_acquire(lock, NULL, current))) {
-   rt_mutex_deadlock_account_lock(lock, current);
+   likely(rt_mutex_cmpxchg_acquire(lock, NULL, current)))
return 0;
-   } else
-   return slowfn(lock, state, timeout, chwalk);
+
+   return slowfn(lock, state, timeout, chwalk);
 }
 
 static inline int
 rt_mutex_fasttrylock(struct rt_mutex *lock,
 int (*slowfn)(struct rt_mutex *lock))
 {
-   if (likely(rt_mutex_cmpxchg_acquire(lock, NULL, current))) {
-   rt_mutex_deadlock_account_lock(lock, current);
+   if (likely(rt_mutex_cmpxchg_acquire(lock, NULL, current)))
return 1;
-   }
+
return slowfn(lock);
 }
 
@@ -1438,19 +1431,18 @@ rt_mutex_fastunlock(struct rt_mutex *lock,
   struct wake_q_head *wqh))
 {
WAKE_Q(wake_q);
+   bool deboost;
 
-   if (likely(rt_mutex_cmpxchg_release(lock, current, NULL))) {
-   rt_mutex_deadlock_account_unlock(current);
+   if (likely(rt_mutex_cmpxchg_release(lock, current, NULL)))
+   return;
 
-   } else {
-   bool deboost = slowfn(lock, _q);
+   deboost = slowfn(lock, _q);
 
-   wake_up_q(_q);
+   wake_up_q(_q);
 
-   /* Undo pi boosting if necessary: */
-   if (deboost)
-   rt_mutex_adjust_prio(current);
-   }
+   /* Undo pi boosting if necessary: */
+   if (deboost)
+   rt_mutex_adjust_p

Re: [PATCH] backport: sched/rtmutex/deadline: Fix a PI crash for deadline tasks

2018-11-06 Thread Henrik Austad
On Tue, Nov 06, 2018 at 02:22:10PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 06, 2018 at 01:47:21PM +0100, Henrik Austad wrote:
> > From: Xunlei Pang 
> > 
> > On some of our systems, we notice this error popping up on occasion,
> > completely hanging the system.
> > 
> >[] enqueue_task_dl+0x1f0/0x420
> >[] activate_task+0x7c/0x90
> >[] push_dl_task+0x164/0x1c8
> >[] push_dl_tasks+0x20/0x30
> >[] __balance_callback+0x44/0x68
> >[] __schedule+0x6f0/0x728
> >[] schedule+0x78/0x98
> >[] __rt_mutex_slowlock+0x9c/0x108
> >[] rt_mutex_slowlock+0xd8/0x198
> >[] rt_mutex_timed_futex_lock+0x30/0x40
> >[] futex_lock_pi+0x200/0x3b0
> >[] do_futex+0x1c4/0x550
> > 
> > It runs a 4.4 kernel on an arm64 rig. The signature looks suspiciously
> > similar to what Xunlei Pang observed in his crash, and with this fix, my
> > issue goes away (my system has survived approx 1500 reboots and a few
> > nasty tests so far)
> > 
> > Alongside this patch in the tree, there are a few other bits and pieces
> > pertaining to futex, rtmutex and kernel/sched/, but those patches create
> > weird crashes that I have not been able to dissect yet. Once (if) I have
> > been able to figure those out (and test), they will be sent later.
> > 
> > I am sure other users of LTS that also use sched_deadline will run into
> > this issue, so I think it is a good candidate for 4.4-stable. Possibly also
> > to 4.9 and 4.14, but I have not had time to test for those versions.
> 
> But this patch relies on:
> 
>   2a1c60299406 ("rtmutex: Deboost before waking up the top waiter")

Yes, I have that one in my other queue (that crashes)

> for pointer stability, but that patch in turn relies on the whole
> FUTEX_UNLOCK_PI patch set:
> 
>  $ git log --oneline 
> 499f5aca2cdd5e958b27e2655e7e7f82524f46b1..56222b212e8edb1cf51f5dd73ff645809b082b40
> 
>   56222b212e8e futex: Drop hb->lock before enqueueing on the rtmutex
>   bebe5b514345 futex: Futex_unlock_pi() determinism
>   cfafcd117da0 futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock()
>   38d589f2fd08 futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock()
>   50809358dd71 futex,rt_mutex: Introduce rt_mutex_init_waiter()
>   16ffa12d7425 futex: Pull rt_mutex_futex_unlock() out from under hb->lock
>   73d786bd043e futex: Rework inconsistent rt_mutex/futex_q state
>   bf92cf3a5100 futex: Cleanup refcounting
>   734009e96d19 futex: Change locking rules
>   5293c2efda37 futex,rt_mutex: Provide futex specific rt_mutex API
>   fffa954fb528 futex: Remove rt_mutex_deadlock_account_*()
>   1b367ece0d7e futex: Use smp_store_release() in mark_wake_futex()
> 
> and possibly some follow-up fixes on that (I have vague memories of
> that).

ok, so this looks a bit like the queue I have, thanks!

> As is, just the one patch you propose isn't correct :/
> 
> Yes, that was a ginormous amount of work to fix a seemingly simple splat
> :-(

Yep, well, on the positive side, I now know that I have to figure out the 
crashes, which is useful knowledge! Thanks!

I'll hammer away at the full series of backports for this then and resend 
once I've hammered out the issues.

Thanks for the feedback, much appreciated!

-- 
Henrik Austad
CVTG Eng - Endpoints
Cisco Systems Norway


[PATCH] backport: sched/rtmutex/deadline: Fix a PI crash for deadline tasks

2018-11-06 Thread Henrik Austad
From: Xunlei Pang 

On some of our systems, we notice this error popping up on occasion,
completely hanging the system.

   [] enqueue_task_dl+0x1f0/0x420
   [] activate_task+0x7c/0x90
   [] push_dl_task+0x164/0x1c8
   [] push_dl_tasks+0x20/0x30
   [] __balance_callback+0x44/0x68
   [] __schedule+0x6f0/0x728
   [] schedule+0x78/0x98
   [] __rt_mutex_slowlock+0x9c/0x108
   [] rt_mutex_slowlock+0xd8/0x198
   [] rt_mutex_timed_futex_lock+0x30/0x40
   [] futex_lock_pi+0x200/0x3b0
   [] do_futex+0x1c4/0x550

It runs a 4.4 kernel on an arm64 rig. The signature looks suspiciously
similar to what Xunlei Pang observed in his crash, and with this fix, my
issue goes away (my system has survived approx 1500 reboots and a few
nasty tests so far)

Alongside this patch in the tree, there are a few other bits and pieces
pertaining to futex, rtmutex and kernel/sched/, but those patches create
weird crashes that I have not been able to dissect yet. Once (if) I have
been able to figure those out (and test), they will be sent later.

I am sure other users of LTS that also use sched_deadline will run into
this issue, so I think it is a good candidate for 4.4-stable. Possibly also
to 4.9 and 4.14, but I have not had time to test for those versions.

Apart from a minor conflict in sched.h, the patch applied cleanly.

(Tested on arm64 running 4.4.)

-Henrik

A crash happened while I was playing with deadline PI rtmutex.

BUG: unable to handle kernel NULL pointer dereference at
0018
IP: [] rt_mutex_get_top_task+0x1f/0x30
PGD 232a75067 PUD 230947067 PMD 0
Oops:  [#1] SMP
CPU: 1 PID: 10994 Comm: a.out Not tainted

Call Trace:
[] enqueue_task+0x2c/0x80
[] activate_task+0x23/0x30
[] pull_dl_task+0x1d5/0x260
[] pre_schedule_dl+0x16/0x20
[] __schedule+0xd3/0x900
[] schedule+0x29/0x70
[] __rt_mutex_slowlock+0x4b/0xc0
[] rt_mutex_slowlock+0xd1/0x190
[] rt_mutex_timed_lock+0x53/0x60
[] futex_lock_pi.isra.18+0x28c/0x390
[] do_futex+0x190/0x5b0
[] SyS_futex+0x80/0x180

This is because rt_mutex_enqueue_pi() and rt_mutex_dequeue_pi()
are only protected by pi_lock when operating on pi waiters, while
rt_mutex_get_top_task() will access them with rq lock held but
without holding pi_lock.

In order to tackle it, we introduce a new "pi_top_task" pointer
cached in task_struct, and add a new rt_mutex_update_top_task()
to update its value. It can be called by rt_mutex_setprio(),
which holds both the owner's pi_lock and rq lock. Thus "pi_top_task"
can be safely accessed by enqueue_task_dl() under rq lock.

Originally-From: Peter Zijlstra 
Signed-off-by: Xunlei Pang 
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Steven Rostedt 
Reviewed-by: Thomas Gleixner 
Cc: juri.le...@arm.com
Cc: bige...@linutronix.de
Cc: mathieu.desnoy...@efficios.com
Cc: jdesfos...@efficios.com
Cc: bris...@redhat.com
Link: http://lkml.kernel.org/r/20170323150216.157682...@infradead.org
Signed-off-by: Thomas Gleixner 

(cherry picked from commit e96a7705e7d3fef96aec9b590c63b2f6f7d2ba22)

Conflicts:
include/linux/sched.h

Backported-and-tested-by: Henrik Austad 
Cc: Greg Kroah-Hartman 
---
 include/linux/init_task.h |  1 +
 include/linux/sched.h |  2 ++
 include/linux/sched/rt.h  |  1 +
 kernel/fork.c |  1 +
 kernel/locking/rtmutex.c  | 29 +
 kernel/sched/core.c   |  2 ++
 6 files changed, 28 insertions(+), 8 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 1c1ff7e4faa4..a561ce0c5d7f 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -162,6 +162,7 @@ extern struct task_group root_task_group;
 #ifdef CONFIG_RT_MUTEXES
 # define INIT_RT_MUTEXES(tsk)  \
.pi_waiters = RB_ROOT,  \
+   .pi_top_task = NULL,\
.pi_waiters_leftmost = NULL,
 #else
 # define INIT_RT_MUTEXES(tsk)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a464ba71a993..19a3f946caf0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1628,6 +1628,8 @@ struct task_struct {
/* PI waiters blocked on a rt_mutex held by this task */
struct rb_root pi_waiters;
struct rb_node *pi_waiters_leftmost;
+   /* Updated under owner's pi_lock and rq lock */
+   struct task_struct  *pi_top_task;
/* Deadlock detection and priority inheritance handling */
struct rt_mutex_waiter *pi_blocked_on;
 #endif
diff --git a/include/linux/sched/rt.h b/include/linux/sched/rt.h
index a30b172df6e1..60d0c4740b9f 100644
--- a/include/linux/sched/rt.h
+++ b/include/linux/sched/rt.h
@@ -19,6 +19,7 @@ static inline int rt_task(struct task_struct *p)
 extern int rt_mutex_getprio(struct task_struct *p);
 extern void rt_mutex_setprio(struct task_struct *p, int prio);
 extern int rt_mutex_get_effective_prio(struct ta

Re: [RFD/RFC PATCH 0/8] Towards implementing proxy execution

2018-10-10 Thread Henrik Austad
On Tue, Oct 09, 2018 at 11:24:26AM +0200, Juri Lelli wrote:
> Hi all,

Hi, nice series, I have a lot of details to grok, but I like the idea of PE

> Proxy Execution (also goes under several other names) isn't a new
> concept, it has been mentioned already in the past to this community
> (both in email discussions and at conferences [1, 2]), but no actual
> implementation that applies to a fairly recent kernel exists as of today
> (of which I'm aware, at least - happy to be proven wrong).
> 
> Very broadly speaking, more info below, proxy execution enables a task
> to run using the context of some other task that is "willing" to
> participate in the mechanism, as this helps both tasks to improve
> performance (w.r.t. the latter task not participating to proxy
> execution).

From what I remember, PEP was originally proposed for a global EDF, and as 
far as my head has been able to read this series, this implementation is 
planned for not only deadline, but eventually also for sched_(rr|fifo|other) 
- is that correct?

I have a bit of a concern when it comes to affinities and where the 
lock owner will actually execute while in the context of the proxy, 
especially when you run into the situation where you have disjoint CPU 
affinities for _rr tasks to ensure the deadlines.

I believe there were some papers circulated last year that looked at 
something similar to this when you had overlapping or completely disjoint 
CPUsets; I think it would be nice to drag those into the discussion. Has this been 
considered? (if so, sorry for adding line-noise!)

Let me know if my attempt at translating brainlanguage into semi-coherent 
English failed and I'll do another attempt.

> This RFD/proof of concept aims at starting a discussion about how we can
> get proxy execution in mainline. But, first things first, why do we even
> care about it?
> 
> I'm pretty confident with saying that the line of development that is
> mainly interested in this at the moment is the one that might benefit
> in allowing non privileged processes to use deadline scheduling [3].
> The main missing bit before we can safely relax the root privileges
> constraint is a proper priority inheritance mechanism, which translates
> to bandwidth inheritance [4, 5] for deadline scheduling, or to some sort
> of interpretation of the concept of running a task holding a (rt_)mutex
> within the bandwidth allotment of some other task that is blocked on the
> same (rt_)mutex.
> 
> The concept itself is pretty general however, and it is not hard to
> foresee possible applications in other scenarios (say for example nice
> values/shares across co-operating CFS tasks or clamping values [6]).
> But I'm already digressing, so let's get back to the code that comes
> with this cover letter.
> 
> One can define the scheduling context of a task as all the information
> in task_struct that the scheduler needs to implement a policy and the
> execution context as all the state required to actually "run" the task.
> An example of scheduling context might be the information contained in
> task_struct se, rt and dl fields; affinity pertains instead to execution
> context (and I guess deciding what pertains to what is actually up for
> discussion as well ;-). Patch 04/08 implements such distinction.

I really like the idea of splitting scheduling ctx and execution context!

> As implemented in this set, a link between scheduling contexts of
> different tasks might be established when a task blocks on a mutex held
> by some other task (blocked_on relation). In this case the former task
> starts to be considered a potential proxy for the latter (mutex owner).
> One key change in how mutexes work made in here is that waiters don't
> really sleep: they are not dequeued, so they can be picked up by the
> scheduler when it runs.  If a waiter (potential proxy) task is selected
> by the scheduler, the blocked_on relation is used to find the mutex
> owner and put that to run on the CPU, using the proxy task scheduling
> context.
> 
>Follow the blocked-on relation:
>   
>   ,-> task   <- proxy, picked by scheduler
>   | | blocked-on
>   | v
>  blocked-task |   mutex
>   | | owner
>   | v
>   `-- task   <- gets to run using proxy info
> 
> Now, the situation is (of course) more tricky than depicted so far
> because we have to deal with all sort of possible states the mutex
> owner might be in while a potential proxy is selected by the scheduler,
> e.g. owner might be sleeping, running on a different CPU, blocked on
> another mutex itself... so, I'd kindly refer people to have a look at
> 05/08 proxy() implementation and comments.

My head hurts already.. :)
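
Just to check that I follow the blocked-on walk above, here is my
(probably naive) reading of it in pseudo-C - the names are made up,
the real logic lives in proxy() in 05/08:

	/* scheduler picked 'p', but p is blocked on a mutex */
	static struct task_struct *proxy_resolve(struct task_struct *p)
	{
		struct task_struct *owner = p;

		/* follow the blocked_on chain to a task that can run */
		while (owner->blocked_on)
			owner = mutex_owner(owner->blocked_on);

		/* run 'owner' here, billed to p's scheduling context */
		return owner;
	}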

> Peter kindly shared his WIP patches with us (me, Luca, Tommaso, Claudio,
> Daniel, the Pisa gang) a while ago, but I could seriously have a decent
> look at them only recently (thanks a lot to the 

[PATCH] net: export netdev_txq_to_tc to allow sch_mqprio to compile as module

2017-10-17 Thread Henrik Austad
In commit 32302902ff09 ("mqprio: Reserve last 32 classid values for HW
traffic classes and misc IDs") sch_mqprio started using netdev_txq_to_tc
to find the correct tc instead of dev->tc_to_txq[].

However, when mqprio is compiled as a module, it cannot resolve the
symbol, leading to this error:

 ERROR: "netdev_txq_to_tc" [net/sched/sch_mqprio.ko] undefined!

This adds an EXPORT_SYMBOL(), since the other in-kernel users are either
EXPORT_SYMBOL() themselves (netif_set_xps_queue, notably not _GPL) or
sit in a sysfs callback.
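
With the export in place, a module such as sch_mqprio can resolve the
symbol at load time and map a txq back to its traffic class, along the
lines of (illustrative, not the exact sch_mqprio code):

	/* module side: map a tx queue index back to its tc */
	int tc = netdev_txq_to_tc(dev, skb_get_queue_mapping(skb));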

Cc: Alexander Duyck <alexander.h.du...@intel.com>
Cc: Jesus Sanchez-Palencia <jesus.sanchez-palen...@intel.com>
Cc: David S. Miller <da...@davemloft.net>
Signed-off-by: Henrik Austad <haus...@cisco.com>
---
 net/core/dev.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index fcddccb..d2b20e7 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2040,6 +2040,7 @@ int netdev_txq_to_tc(struct net_device *dev, unsigned int txq)
 
return 0;
 }
+EXPORT_SYMBOL(netdev_txq_to_tc);
 
 #ifdef CONFIG_XPS
 static DEFINE_MUTEX(xps_map_mutex);
-- 
2.7.4



Re: [TSN RFC v2 0/9] TSN driver for the kernel

2016-12-17 Thread Henrik Austad
ations must 
be aware of.

Could be that we are talking about the same thing, just from different 
perspectives.

> * Kernel Space
>
> 1. Providing frames with a future transmit time.  For normal sockets,
>this can be in the CMSG data.  For mmap'ed buffers, we will need a
>new format.  (I think Arnd is working on a new layout.)

I need to revisit that discussion again I think.
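
(For my own notes, I imagine the per-packet launch time riding along
roughly like this - SCM_TXTIME is a name invented for illustration
here, not an existing ABI, and fd/now_ns are assumed to exist:)

	__u64 txtime = now_ns + 500 * 1000;	/* launch 500 us from now */
	char cbuf[CMSG_SPACE(sizeof(txtime))];
	struct msghdr msg = { 0 };
	struct cmsghdr *cm;

	msg.msg_control = cbuf;
	msg.msg_controllen = sizeof(cbuf);
	cm = CMSG_FIRSTHDR(&msg);
	cm->cmsg_level = SOL_SOCKET;
	cm->cmsg_type = SCM_TXTIME;		/* hypothetical type */
	cm->cmsg_len = CMSG_LEN(sizeof(txtime));
	memcpy(CMSG_DATA(cm), &txtime, sizeof(txtime));
	/* msg_iov with the actual frame payload omitted for brevity */
	sendmsg(fd, &msg, 0);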

> 2. Time based qdisc for transmitted frames.  For MACs that support
>this (like the i210), we only have to place the frame into the
>correct queue.  For normal HW, we want to be able to reserve a time
>window in which non-TSN frames are blocked. This is some work, but
>in the end it should be a generic solution that not only works
>"perfectly"  with TSN HW but also provides best effort service using
>any NIC.

Yes, indeed, that would be one good solution, and quite a lot of work.

> 3. ALSA support for tunable AD/DA clocks.  The rate of the Listener's
>DA clock must match that of the Talker and the other Listeners.

To nitpick a bit, all AD/DAs should match that of the gPTP grandmaster 
(which in most settings would be the Talker). But yes, you need to adjust 
the AD/DA. SRC is slow and painful, best to avoid.

>Either you adjust it in HW using a VCO or similar, or you do
>adaptive sample rate conversion in the application. (And that is
>another reason for *not* having a shared kernel buffer.)  For the
>Talker, either you adjust the AD clock to match the PTP time, or
>you measure the frequency offset.

Yes, some hook into adjusting the clock is needed, I wonder if this is 
possible via V4L2, or if the monitor-world is a completely different beast.

> 4. ALSA support for time triggered playback.  The patch series
>completely ignore the critical issue of media clock recovery.  The 
>Listener must buffer the stream in order to play it exactly at a 
>specified time. It cannot simply send the stream ASAP to the audio 
>HW, because some other Listener might need longer.  AFAICT, there is 
>nothing in ALSA that allows you to say, sample X should be played at 
>time Y.

Yes, and this requires a lot of change to ALSA (and probably something in 
V4L2 as well?), so before we get to that, perhaps have a set of patches 
that does this best effort and *then* work on getting time-triggered 
playback into the kernel?

Another item that was brought up last round was getting timing-information 
to/from ALSA, see driver/media/avb/avb_alsa.c; as a start it updates the 
time for the last incoming/outgoing frame so that userspace can get that 
information. Probably buggy as heck :)

*Back to your email from last night*

> You are trying to put tons of code into the kernel that really belongs
> in user space, and at the same time, you omit critical functions that
> only the kernel can provide.

Some (well, to be honest, most) of the critical functions that my 
driver omits are omitted because they require substantial effort to 
implement - and before there's a need for this, that won't happen. So, 
consider the TSN-driver such a need!

I'd love to use a qdisc that uses a time-triggered transmit, that would 
drop the need for a lot of the stuff in tsn_core.c. The same goes for 
time-triggered playback in media.

> > There are at least one AVB-driver (the AV-part of TSN) in the kernel
> > already.
> 
> And which driver is that?

Ah, a proverbial slip of the changelog, we visited this the last iteration: 
that would be the ravb-driver (which is an AVB capable NIC), but it does 
not include much in the way of AVB-support *in* kernel. Sorry about that!

Since then, the iMX7 from NXP has arrived, and this also has HW-support for 
TSN, but not in the kernel AFAICT.

So, the next issue I plan to tackle is how I do buffers; the current 
approach where tsn_core allocates memory is on its way out and I'll let the 
shim (which means alsa/v4l2) provide a buffer. Then I'll start looking 
at qdisc.

Thanks!

-- 
Henrik Austad


signature.asc
Description: Digital signature


Re: [TSN RFC v2 5/9] Add TSN header for the driver

2016-12-17 Thread Henrik Austad
On Fri, Dec 16, 2016 at 11:09:38PM +0100, Richard Cochran wrote:
> On Fri, Dec 16, 2016 at 06:59:09PM +0100, hen...@austad.us wrote:
> > +/*
> > + * List of current subtype fields in the common header of AVTPDU
> > + *
> > + * Note: AVTPDU is a remnant of the standards from when it was AVB.
> > + *
> > + * The list has been updated with the recent values from IEEE 1722, draft 
> > 16.
> > + */
> > +enum avtp_subtype {
> > +   TSN_61883_IIDC = 0, /* IEC 61883/IIDC Format */
> > +   TSN_MMA_STREAM, /* MMA Streams */
> > +   TSN_AAF,/* AVTP Audio Format */
> > +   TSN_CVF,/* Compressed Video Format */
> > +   TSN_CRF,/* Clock Reference Format */
> > +   TSN_TSCF,   /* Time-Synchronous Control Format */
> > +   TSN_SVF,/* SDI Video Format */
> > +   TSN_RVF,/* Raw Video Format */
> > +   /* 0x08 - 0x6D reserved */
> > +   TSN_AEF_CONTINOUS = 0x6e, /* AES Encrypted Format Continous */
> > +   TSN_VSF_STREAM, /* Vendor Specific Format Stream */
> > +   /* 0x70 - 0x7e reserved */
> > +   TSN_EF_STREAM = 0x7f,   /* Experimental Format Stream */
> > +   /* 0x80 - 0x81 reserved */
> > +   TSN_NTSCF = 0x82,   /* Non Time-Synchronous Control Format */
> > +   /* 0x83 - 0xed reserved */
> > +   TSN_ESCF = 0xec,/* ECC Signed Control Format */
> > +   TSN_EECF,   /* ECC Encrypted Control Format */
> > +   TSN_AEF_DISCRETE,   /* AES Encrypted Format Discrete */
> > +   /* 0xef - 0xf9 reserved */
> > +   TSN_ADP = 0xfa, /* AVDECC Discovery Protocol */
> > +   TSN_AECP,   /* AVDECC Enumeration and Control Protocol */
> > +   TSN_ACMP,   /* AVDECC Connection Management Protocol */
> > +   /* 0xfd reserved */
> > +   TSN_MAAP = 0xfe,/* MAAP Protocol */
> > +   TSN_EF_CONTROL, /* Experimental Format Control */
> > +};
> 
> The kernel shouldn't be in the business of assembling media packets.

No, but assembling the packets and shipping frames to a destination is not 
necessarily the same thing.

A nice workflow would be to signal to the shim that "I'm sending a 
compressed video format" and then the shim/tsn_core will ship out the 
frames over the network - and then you need to set TSN_CVF as subtype in 
each header.
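
Concretely, I picture the shim doing little more than stamping the
common header before handing the frame off - a sketch with made-up
struct and field names:

	/* tsn_core stamping the subtype on each outgoing AVTPDU */
	struct avtpdu_header *hdr = (struct avtpdu_header *)frame;

	hdr->subtype = TSN_CVF;	/* compressed video, from the enum above */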

That does not mean you should do H.264 encode/decode *in* the kernel.

Perhaps this is better placed in include/uapi/tsn.h so that userspace and 
kernel share the same header?

-- 
Henrik Austad


signature.asc
Description: PGP signature


Re: [TSN RFC v2 0/9] TSN driver for the kernel

2016-12-16 Thread Henrik Austad
On Fri, Dec 16, 2016 at 01:20:57PM -0500, David Miller wrote:
> From: Greg <gvrose8...@gmail.com>
> Date: Fri, 16 Dec 2016 10:12:44 -0800
> 
> > On Fri, 2016-12-16 at 18:59 +0100, hen...@austad.us wrote:
> >> From: Henrik Austad <haus...@cisco.com>
> >> 
> >> 
> >> The driver is directed via ConfigFS as we need userspace to handle
> >> stream-reservation (MSRP), discovery and enumeration (IEEE 1722.1) and
> >> whatever other management is needed. This also includes running an
> >> appropriate PTP daemon (TSN favors gPTP).
> > 
> > I suggest using a generic netlink interface to communicate with the
> > driver to set up and/or configure your drivers.
> > 
> > I think configfs is frowned upon for network drivers.  YMMV.
> 
> Agreed.

Ok - thanks!

I will have a look at netlink and see if I can wrap my head around it, and if 
I can apply it to how to bring the media-devices up once the TSN-link has 
been configured.
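
From a quick skim, the generic netlink boilerplate looks manageable -
something along these lines, with family/command names invented purely
for illustration:

	/* minimal generic netlink family skeleton */
	static const struct genl_ops tsn_genl_ops[] = {
		{
			.cmd	= TSN_CMD_LINK_UP,	/* hypothetical */
			.doit	= tsn_link_up_doit,
		},
	};

	static struct genl_family tsn_genl_family = {
		.name		= "tsn",		/* hypothetical */
		.version	= 1,
		.maxattr	= TSN_ATTR_MAX,
	};

	err = genl_register_family_with_ops(&tsn_genl_family, tsn_genl_ops);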

Thanks! :)

-- 
Henrik Austad


signature.asc
Description: PGP signature


Re: [PATCH] tracing: (backport) Replace kmap with copy_from_user() in trace_marker

2016-12-09 Thread Henrik Austad
On Fri, Dec 09, 2016 at 08:22:05AM +0100, Greg KH wrote:
> On Fri, Dec 09, 2016 at 07:34:04AM +0100, Henrik Austad wrote:
> > Instead of using get_user_pages_fast() and kmap_atomic() when writing
> > to the trace_marker file, just allocate enough space on the ring buffer
> > directly, and write into it via copy_from_user().
> > 
> > Writing into the trace_marker file used to allocate a temporary buffer
> > to perform the copy_from_user(), as we didn't want to write into the
> > ring buffer if the copy failed. But as a trace_marker write is supposed
> > to be extremely fast, and allocating memory causes other tracepoints to
> > trigger, Peter Zijlstra suggested using get_user_pages_fast() and
> > kmap_atomic() to keep the user space pages in memory and reading it
> > directly.
> > 
> > Instead, just allocate the space in the ring buffer and use
> > copy_from_user() directly. If it faults, return -EFAULT and write
> > "" into the ring buffer.
> > 
> > On architectures without an arch-specific get_user_pages_fast(), this
> > will end up in the generic get_user_pages_fast() and this grabs
> > mm->mmap_sem. Once you do this, then suddenly writing to the
> > trace_marker can cause priority-inversions.
> > 
> > This is a backport of Steven Rostedt's patch [1] applied to 3.10.x, so the
> > signed-off-by chain is somewhat uncertain at this stage.
> > 
> > The patch compiles, boots and does not immediately explode on impact. By
> > definition [2] it must therefore be perfect.
> > 
> > 1) https://www.spinics.net/lists/kernel/msg2400769.html
> > 2) http://lkml.iu.edu/hypermail/linux/kernel/9804.1/0149.html
> > 
> > Cc: Ingo Molnar <mi...@kernel.org>
> > Cc: Henrik Austad <hen...@austad.us>
> > Cc: Peter Zijlstra <pet...@infradead.org>
> > Cc: Steven Rostedt <rost...@goodmis.org>
> > Cc: sta...@vger.kernel.org
> > 
> > Suggested-by: Thomas Gleixner <t...@linutronix.de>
> > Used-to-be-signed-off-by: Steven Rostedt <rost...@goodmis.org>
> > Backported-by: Henrik Austad <haus...@cisco.com>
> > Tested-by: Henrik Austad <haus...@cisco.com>
> > Signed-off-by: Henrik Austad <haus...@cisco.com>
> > ---
> >  kernel/trace/trace.c | 78 +++-
> >  1 file changed, 22 insertions(+), 56 deletions(-)
> 
> What is the git commit id of this patch in Linus's tree?  And what
> stable trees do you feel it should be applied to?

Ah, perhaps I jumped the gun here. I don't think Linus has picked this one 
up yet, Steven sent out the patch yesterday.

Since then, I've backported it to 3.10 and ran the first set of tests 
overnight and it looks good. So ideally this would find its way into 
3.10(.104).

Do you want me to resubmit when Steven's patch is merged upstream?

-Henrik


[PATCH] tracing: (backport) Replace kmap with copy_from_user() in trace_marker

2016-12-08 Thread Henrik Austad
Instead of using get_user_pages_fast() and kmap_atomic() when writing
to the trace_marker file, just allocate enough space on the ring buffer
directly, and write into it via copy_from_user().

Writing into the trace_marker file used to allocate a temporary buffer
to perform the copy_from_user(), as we didn't want to write into the
ring buffer if the copy failed. But as a trace_marker write is supposed
to be extremely fast, and allocating memory causes other tracepoints to
trigger, Peter Zijlstra suggested using get_user_pages_fast() and
kmap_atomic() to keep the user space pages in memory and reading it
directly.

Instead, just allocate the space in the ring buffer and use
copy_from_user() directly. If it faults, return -EFAULT and write
"" into the ring buffer.

On architectures without an arch-specific get_user_pages_fast(), this
will end up in the generic get_user_pages_fast() and this grabs
mm->mmap_sem. Once you do this, then suddenly writing to the
trace_marker can cause priority-inversions.
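
(For reference, the path being exercised is simply userspace writing
free-form markers into the trace buffer, roughly:)

	/* userspace side: drop a marker into the ftrace ring buffer */
	int fd = open("/sys/kernel/debug/tracing/trace_marker", O_WRONLY);

	write(fd, "hello from userspace", 20);
	close(fd);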

This is a backport of Steven Rostedt's patch [1] applied to 3.10.x, so the
signed-off-by chain is somewhat uncertain at this stage.

The patch compiles, boots and does not immediately explode on impact. By
definition [2] it must therefore be perfect.

1) https://www.spinics.net/lists/kernel/msg2400769.html
2) http://lkml.iu.edu/hypermail/linux/kernel/9804.1/0149.html

Cc: Ingo Molnar 
Cc: Henrik Austad 
Cc: Peter Zijlstra 
Cc: Steven Rostedt 
Cc: sta...@vger.kernel.org

Suggested-by: Thomas Gleixner 
Used-to-be-signed-off-by: Steven Rostedt 
Backported-by: Henrik Austad 
Tested-by: Henrik Austad 
Signed-off-by: Henrik Austad 
---
 kernel/trace/trace.c | 78 +++-
 1 file changed, 22 insertions(+), 56 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 18cdf91..94eb1ee 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4501,15 +4501,13 @@ tracing_mark_write(struct file *filp, const char __user *ubuf,
struct ring_buffer *buffer;
struct print_entry *entry;
unsigned long irq_flags;
-   struct page *pages[2];
-   void *map_page[2];
-   int nr_pages = 1;
+   const char faulted[] = "<faulted>";
ssize_t written;
-   int offset;
int size;
int len;
-   int ret;
-   int i;
+
+/* Used in tracing_mark_raw_write() as well */
+#define FAULTED_SIZE (sizeof(faulted) - 1) /* '\0' is already accounted for */
 
if (tracing_disabled)
return -EINVAL;
@@ -4520,60 +4518,34 @@ tracing_mark_write(struct file *filp, const char __user *ubuf,
if (cnt > TRACE_BUF_SIZE)
cnt = TRACE_BUF_SIZE;
 
-   /*
-* Userspace is injecting traces into the kernel trace buffer.
-* We want to be as non intrusive as possible.
-* To do so, we do not want to allocate any special buffers
-* or take any locks, but instead write the userspace data
-* straight into the ring buffer.
-*
-* First we need to pin the userspace buffer into memory,
-* which, most likely it is, because it just referenced it.
-* But there's no guarantee that it is. By using get_user_pages_fast()
-* and kmap_atomic/kunmap_atomic() we can get access to the
-* pages directly. We then write the data directly into the
-* ring buffer.
-*/
BUILD_BUG_ON(TRACE_BUF_SIZE >= PAGE_SIZE);
 
-   /* check if we cross pages */
-   if ((addr & PAGE_MASK) != ((addr + cnt) & PAGE_MASK))
-   nr_pages = 2;
-
-   offset = addr & (PAGE_SIZE - 1);
-   addr &= PAGE_MASK;
-
-   ret = get_user_pages_fast(addr, nr_pages, 0, pages);
-   if (ret < nr_pages) {
-   while (--ret >= 0)
-   put_page(pages[ret]);
-   written = -EFAULT;
-   goto out;
-   }
+   local_save_flags(irq_flags);
+   size = sizeof(*entry) + cnt + 2; /* add '\0' and possible '\n' */
 
-   for (i = 0; i < nr_pages; i++)
-   map_page[i] = kmap_atomic(pages[i]);
+   /* If less than "<faulted>", then make sure we can still add that */
+   if (cnt < FAULTED_SIZE)
+   size += FAULTED_SIZE - cnt;
 
-   local_save_flags(irq_flags);
-   size = sizeof(*entry) + cnt + 2; /* possible \n added */
buffer = tr->trace_buffer.buffer;
event = trace_buffer_lock_reserve(buffer, TRACE_PRINT, size,
  irq_flags, preempt_count());
-   if (!event) {
-   /* Ring buffer disabled, return as if not open for write */
-   written = -EBADF;
-   goto out_unlock;
-   }
+
+   if (unlikely(!event))
+   /* Ring buffer disabled, return as if not open for write */
+   return -EBADF;
 
entry = ring_buffer_event_data(event);
ent

Re: [RFD] sched/deadline: Support single CPU affinity

2016-11-10 Thread Henrik Austad
On Thu, Nov 10, 2016 at 01:38:40PM +0100, luca abeni wrote:
> Hi Henrik,

Hi Luca,

> On Thu, 10 Nov 2016 13:21:00 +0100
> Henrik Austad <hen...@austad.us> wrote:
> > On Thu, Nov 10, 2016 at 09:08:07AM +0100, Peter Zijlstra wrote:
> [...]
> > > We define the time to fail as:
> > > 
> > >   ttf(t) := t_d - t_b; where
> > > 
> > >   t_d is t's absolute deadline
> > >   t_b is t's remaining budget
> > > 
> > > This is the last possible moment we must schedule this task such
> > > that it can complete its work and not miss its deadline.  
> > 
> > To elaborate a bit on this (this is a modified LLF approach if my
> > memory serves):
> > 
> > You have the dynamic time-to-failure (TtF), i.e. as the task
> > progresses (scheduled to run), the relative time-to-failure will
> > remain constant. This can be used to compare tasks to a running task
> > and should minimize the number of calculations required.
> > 
> > Then you have the static Time-of-failure (ToF), which is the absolute
> > time when a task will no longer be able to meet its deadline. This is
> > what you use for keeping a sorted list of tasks in the runqueue. As
> > this is a fixed point in time, you do not have to dynamically update
> > or do crazy calculation when inserting/removing threads from the rq.
> 
> Sorry, I am missing something here: if ttf is defined as
>   ttf_i = d_i - q_i

So I picked the naming somewhat independently of Peter; his approach is 
the _absolute_ time of failure, the actual time X, irrespective of the task 
running or not.

I added 2 different measures for the same thing:

* ToF: 
The absolute time of failure is the point in time when the task will no 
longer be able to meet its deadline. If a task is scheduled and is running 
on a CPU, this value will move forward at the speed of execution, i.e. when 
the task is running, this value is changing. When the task is waiting in 
the runqueue, this value is constant.

* TtF:
The relative time to failure is the value that is tied to the local CPU so 
to speak. When a task is running, this value is constant as it is the 
remaining time until the task is no longer able to meet its deadline. When 
the task is enqueued, this value will steadily decrease as it draws closer 
to the time when it will fail.

So when a task is running on a CPU you use TtF; when it is in the runqueue 
you compare ToF.
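
Or, in shorthand (my notation, not from Peter's patch), with d_i the 
absolute deadline of task i and q_i(t) its remaining budget at time t:

    ToF_i(t) = d_i - q_i(t)	(absolute point in time)
    TtF_i(t) = ToF_i(t) - t	(relative time left)

While i runs, q_i drains at the same rate t advances, so ToF_i moves 
forward and TtF_i stays constant; while i waits, ToF_i is fixed and 
TtF_i shrinks.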

> (where d_i is the deadline of thask i and q_i is its remaining budget),
> then it also is the time before which you have to schedule task i if
> you do not want to miss the deadline... No? So, I do not understand the
> difference with tof.

So you can calculate one from the other given the absolute deadline and 
remaining budget (or consumed CPU-time). But it is handy to use both, as it 
removes a lot of duplication and, once you get the hang of the terms, makes it 
a bit easier to reason about the system.

> > > If we then augment the regular EDF rules by, for local tasks,
> > > considering the time to fail and let this measure override the
> > > regular EDF pick when the time to fail can be overran by the EDF
> > > pick.  
> > 
> > Then, if you do this - do you need to constrict this to a local CPU?
> > I *think* you could do this in a global scheduler: if you use ToF/TtF
> > for all deadline-tasks, you should be able to meet deadlines.
> I think the ToF/TtF scheduler will be equivalent to LLF (see previous
> emails)... Or am I misunderstanding something? (see above)
> And LLF is not optimal on multiple CPUs, so I do not think it will be
> able to meet deadlines if you use it as a global scheduler.

I think I called it Earliest Failure First (I really wanted to call it 
failure-driven scheduling but that implied a crappy scheduler ;)

LLF is prone to a high task-switch count when multiple threads get close to 
0 laxity. But as I said, it's been a while since I last worked through the 
theory, so I have some homework to do before arguing too hard about this.

> > I had a rant about this way back [1,2 Sec 11.4], I need to sit down
> > and re-read most of it, it has been a few too many years, but the
> > idea was to minimize the number of context-switches (which LLF is
> > prone to get a lot of) as well as minimize the computational overhead
> > by avoiding re-calculating time-of-failure/time-to-failure a lot.
> > 
> > > That is, when:
> > > 
> > >   now + left_b > min(ttf)  
> > 
> > Why not just use ttf/tof for all deadline-tasks? We have all the 
> > information available anyway and it would probably make the internal
> > logic easier?
> I think LLF causes more preemptions and migrations than EDF.

yes, it does, which is why you need to adjust LLF to minimize the number of 
task-switches.

-Henrik


signature.asc
Description: PGP signature


Re: [RFD] sched/deadline: Support single CPU affinity

2016-11-10 Thread Henrik Austad
On Thu, Nov 10, 2016 at 09:08:07AM +0100, Peter Zijlstra wrote:
> 
> 
> Add support for single CPU affinity to SCHED_DEADLINE; the supposed reason for
> wanting single CPU affinity is better QoS than provided by G-EDF.
> 
> Therefore the aim is to provide harder guarantees, similar to UP, for single
> CPU affine tasks. This then leads to a mixed criticality scheduling
> requirement for the CPU scheduler. G-EDF like for the non-affine (global)
> tasks and UP like for the single CPU tasks.
> 
> 
> 
> ADMISSION CONTROL
> 
> Do simple UP admission control on the CPU local tasks, and subtract the
> admitted bandwidth from the global total when doing global admission control.
> 
>   single cpu: U[n] := \Sum tl_u,n <= 1
>   global: \Sum tg_u <= N - \Sum U[n]
> 
> 
> 
> MIXED CRITICALITY SCHEDULING
> 
> Since we want to provide better guarantees for single CPU affine tasks than
> the G-EDF scheduler provides for the single CPU tasks, we need to somehow
> alter the scheduling algorithm.
> 
> The trivial layered EDF/G-EDF approach is obviously flawed in that it will
> result in many unnecessary deadline misses. The trivial example is having a
> single CPU task with a deadline after a runnable global task. By always
> running single CPU tasks over global tasks we can make the global task miss
> its deadline even though we could easily have run both within the allotted
> time.
> 
> Therefore we must use a more complicated scheme. By adding a second measure
> present in the sporadic task model to the scheduling function we can try and
> distinguish between the constraints of handling the two cases in a single
> scheduler.
> 
> We define the time to fail as:
> 
>   ttf(t) := t_d - t_b; where
> 
>   t_d is t's absolute deadline
>   t_b is t's remaining budget
> 
> This is the last possible moment we must schedule this task such that it can
> complete its work and not miss its deadline.

To elaborate a bit on this (this is a modified LLF approach if my memory 
serves):

You have the dynamic time-to-failure (TtF), i.e. as the task progresses 
(scheduled to run), the relative time-to-failure will remain constant. This 
can be used to compare tasks to the running task and should minimize the 
number of calculations required.

Then you have the static time-of-failure (ToF), which is the absolute time 
when a task will no longer be able to meet its deadline. This is what you 
use for keeping a sorted list of tasks in the runqueue. As this is a fixed 
point in time, you do not have to dynamically update or do crazy 
calculations when inserting/removing threads from the rq.
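
To make that a bit more concrete, something along these lines (a sketch 
only; the field names loosely follow sched_dl_entity, where 'deadline' is 
the absolute deadline and 'runtime' the remaining budget -- not the actual 
implementation):

static inline u64 dl_tof(const struct sched_dl_entity *dl)
{
        /* absolute time of failure: deadline minus remaining budget */
        return dl->deadline - dl->runtime;
}

static inline u64 dl_ttf(const struct sched_dl_entity *dl, u64 now)
{
        u64 tof = dl_tof(dl);

        /*
         * time to failure relative to now; 0 means the deadline can no
         * longer be met unless the task is already running.
         */
        return tof > now ? tof - now : 0;
}

ToF is what you keep sorted in the tree; TtF is only needed when comparing 
against the currently running task.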

> If we then augment the regular EDF rules by, for local tasks, considering the
> time to fail and let this measure override the regular EDF pick when the
> time to fail can be overran by the EDF pick.

Then, if you do this - do you need to restrict this to a local CPU? I 
*think* you could do this in a global scheduler: if you use ToF/TtF for all 
deadline-tasks, you should be able to meet deadlines.

I had a rant about this way back [1, 2 Sec 11.4]. I need to sit down and 
re-read most of it, it has been a few too many years, but the idea was to 
minimize the number of context-switches (which LLF is prone to get a lot 
of) as well as minimize the computational overhead by avoiding 
re-calculating time-of-failure/time-to-failure a lot.

> That is, when:
> 
>   now + left_b > min(ttf)

Why not just use ttf/tof for all deadline-tasks? We have all the 
information available anyway, and it would probably make the internal logic 
easier?
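
Just to illustrate the override rule, the pick could look roughly like this 
(pure sketch, all helper names are made up):

static struct sched_dl_entity *pick_next_dl(struct dl_rq *dl_rq, u64 now)
{
        /*
         * leftmost_edf(): earliest absolute deadline in the rq,
         * min_ttf_local(): CPU-local task with the smallest time to fail.
         */
        struct sched_dl_entity *edf = leftmost_edf(dl_rq);
        struct sched_dl_entity *local = min_ttf_local(dl_rq);

        if (!local || local == edf)
                return edf;

        /*
         * Letting the EDF pick run for its remaining budget would push
         * the local task past its time of failure, so override EDF.
         */
        if (now + edf->runtime > dl_tof(local))
                return local;

        return edf;
}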

> Use augmented RB-tree to store absolute deadlines of all rq local tasks and
> keep the heap sorted on the earliest time to fail of a locally affine task.
> 
> TODO
> 
>  - finish patch, this only sketches the outlines
>  - formal analysis of the proposed scheduling function; this is only a hunch.

I think you are on the right track, but then again, you agree with some of 
the stuff I messed around with a while ago, so no wonder I think you're 
right :)

1) https://lkml.org/lkml/2009/7/10/380
2) 
https://brage.bibsys.no/xmlui/bitstream/handle/11250/259744/347756_FULLTEXT01.pdf

-Henrik

> ---
>  include/linux/sched.h   |   1 +
>  kernel/sched/core.c     |  75 ++---
>  kernel/sched/deadline.c | 142 
>  kernel/sched/sched.h    |  12 ++--
>  4 files changed, 191 insertions(+), 39 deletions(-)
>
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 3762fe4e3a80..32f948615d4c 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1412,6 +1412,7 @@ struct sched_rt_entity {
>  
>  struct sched_dl_entity {
>   struct rb_node  rb_node;
> + u64 __subtree_ttf;

Didn't you say augmented rb-tree?
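
Assuming it is, I'd expect the plumbing to look a lot like the 
interval-tree example in Documentation/rbtree.txt, something like the 
sketch below (my guess, not taken from the patch):

static inline u64 entity_ttf(struct sched_dl_entity *dl)
{
        /* absolute time of failure: deadline minus remaining budget */
        return dl->deadline - dl->runtime;
}

static u64 compute_subtree_ttf(struct sched_dl_entity *dl)
{
        u64 ttf = entity_ttf(dl);

        /* keep the subtree minimum so the tree can be used as a heap */
        if (dl->rb_node.rb_left)
                ttf = min(ttf, rb_entry(dl->rb_node.rb_left,
                                        struct sched_dl_entity,
                                        rb_node)->__subtree_ttf);
        if (dl->rb_node.rb_right)
                ttf = min(ttf, rb_entry(dl->rb_node.rb_right,
                                        struct sched_dl_entity,
                                        rb_node)->__subtree_ttf);
        return ttf;
}

RB_DECLARE_CALLBACKS(static, ttf_callbacks, struct sched_dl_entity,
                     rb_node, u64, __subtree_ttf, compute_subtree_ttf)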

>   /*
>* Original scheduling parameters. Copied here from sched_attr
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index bee18baa603a..46995c060a89 100644
> --- a/kernel/sched/core.c
> +++ 

Re: [Intel-wired-lan] [PATCH] igb: add missing fields to TXDCTL-register

2016-10-19 Thread Henrik Austad
On Wed, Oct 19, 2016 at 07:25:10AM -0700, Jesse Brandeburg wrote:
> On Wed, 19 Oct 2016 14:37:59 +0200
> Henrik Austad <hen...@austad.us> wrote:
> 
> > The current list of E1000_TXDCTL-registers is incomplete. This adds
> > the missing parts for the Transmit Descriptor Control (TXDCTL)
> > register.
> > 
> > The rest of these values (threshold for descriptor read/write) for
> > TXDCTL seems to be defined in igb/igb.h, not sure why this is split
> > though.
> 
> Hi Henrik, thanks for helping with our code.
> 
> While totally correct, having defines added to the kernel that are not
> being used anywhere in the code isn't really very useful.  Often the
> upstream maintainers/reviewers will reject a patch like this that just
> adds to a .h file, because there are no actual users of the defines.

Yes, I agree, best to avoid bloat whenever possible.

> If the transmit or ethtool code were to use these (via the same patch)
> or something like that, then the patch would be more likely to be
> accepted.

Ah, good to know. I am in the process of spinning out a new set of 
TSN-patches (previous version: see [1]) and setting the priority-bit for 
the Tx-queues is required. This means that I'm hacking more at igb_main.c.

So this was more about laying the groundwork for the series.
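
To show where this is headed: the TSN series will flip the priority bit 
when configuring a Tx ring, roughly like below (sketch of the follow-up 
patch; the helper is made up and details may change):

/* mark a Tx queue as high priority, e.g. from igb_configure_tx_ring() */
static void igb_set_tx_queue_prio(struct igb_adapter *adapter,
                                  struct igb_ring *ring, bool high)
{
        struct e1000_hw *hw = &adapter->hw;
        u32 txdctl = rd32(E1000_TXDCTL(ring->reg_idx));

        if (high)
                txdctl |= E1000_TXDCTL_PRIORITY;
        else
                txdctl &= ~E1000_TXDCTL_PRIORITY;

        wr32(E1000_TXDCTL(ring->reg_idx), txdctl);
}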

I'll leave this patch in the tsn-series then, and resend once I'm ready and 
hope you can provide some feedback on the rest of the series then :)

> Jesse
> 
> PS In the future no need to copy linux-kernel for patches going to our
> submaintainer list.

Ok, I'll remember that, thanks!

1) https://lkml.org/lkml/2016/6/11/187

-- 
Henrik Austad


signature.asc
Description: Digital signature


[PATCH] igb: add missing fields to TXDCTL-register

2016-10-19 Thread Henrik Austad
The current list of E1000_TXDCTL-registers is incomplete. This adds the
missing parts for the Transmit Descriptor Control (TXDCTL) register.

The rest of these values (threshold for descriptor read/write) for
TXDCTL seems to be defined in igb/igb.h, not sure why this is split
though.

It seems that this was left out in commit 9d5c8243 ("igb: PCI-Express
82575 Gigabit Ethernet driver"), which added support for the 82575.

Signed-off-by: Henrik Austad <hen...@austad.us>
Cc: linux-kernel@vger.kernel.org
Cc: Jeff Kirsher <jeffrey.t.kirs...@intel.com>
Cc: intel-wired-...@lists.osuosl.org
Signed-off-by: Henrik Austad <hen...@austad.us>
---
 drivers/net/ethernet/intel/igb/e1000_82575.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/ethernet/intel/igb/e1000_82575.h 
b/drivers/net/ethernet/intel/igb/e1000_82575.h
index 199ff98..212dbb8 100644
--- a/drivers/net/ethernet/intel/igb/e1000_82575.h
+++ b/drivers/net/ethernet/intel/igb/e1000_82575.h
@@ -158,7 +158,11 @@ struct e1000_adv_tx_context_desc {

 /* Additional Transmit Descriptor Control definitions */
 #define E1000_TXDCTL_QUEUE_ENABLE  0x0200 /* Enable specific Tx Queue */
+
+/* Transmit Software Flush, sw-triggered desc writeback */
+#define E1000_TXDCTL_SWFLSH    0x0400
 /* Tx Queue Arbitration Priority 0=low, 1=high */
+#define E1000_TXDCTL_PRIORITY  0x0800

 /* Additional Receive Descriptor Control definitions */
 #define E1000_RXDCTL_QUEUE_ENABLE  0x0200 /* Enable specific Rx Queue */
--
2.7.4



Re: [alsa-devel] [very-RFC 0/8] TSN driver for the kernel

2016-06-23 Thread Henrik Austad
On Tue, Jun 21, 2016 at 10:45:18AM -0700, Pierre-Louis Bossart wrote:
> On 6/20/16 5:18 AM, Richard Cochran wrote:
> >On Mon, Jun 20, 2016 at 01:08:27PM +0200, Pierre-Louis Bossart wrote:
> >>The ALSA API provides support for 'audio' timestamps (playback/capture rate
> >>defined by audio subsystem) and 'system' timestamps (typically linked to
> >>TSC/ART) with one option to take synchronized timestamps should the hardware
> >>support them.
> >
> >Thanks for the info.  I just skimmed 
> >Documentation/sound/alsa/timestamping.txt.
> >
> >That is fairly new, only since v4.1.  Are then any apps in the wild
> >that I can look at?  AFAICT, OpenAVB, gstreamer, etc, don't use the
> >new API.
> 
> The ALSA API supports a generic .get_time_info callback, its implementation
> is for now limited to a regular 'DMA' or 'link' timestamp for HDaudio - the
> difference being which counters are used and how close they are to the link
> serializer. The synchronized part is still WIP but should come 'soon'

Interesting, would you mind CCing me in on those patches?

> >>The intent was that the 'audio' timestamps are translated to a shared time
> >>reference managed in userspace by gPTP, which in turn would define if
> >>(adaptive) audio sample rate conversion is needed. There is no support at
> >>the moment for a 'play_at' function in ALSA, only means to control a
> >>feedback loop.
> >
> >Documentation/sound/alsa/timestamping.txt says:
> >
> >  If supported in hardware, the absolute link time could also be used
> >  to define a precise start time (patches WIP)
> >
> >Two questions:
> >
> >1. Where are the patches?  (If some are coming, I would appreciate
> >   being on CC!)
> >
> >2. Can you mention specific HW that would support this?
> 
> You can experiment with the 'dma' and 'link' timestamps today on any
> HDaudio-based device. Like I said the synchronized part has not been
> upstreamed yet (delays + dependency on ART-to-TSC conversions that made it
> in the kernel recently)

Ok, I think I see a way to hook this into timestamps from the skbuff on 
incoming frames, and a somewhat messy way on outgoing. Having time coupled 
with 'avail' and 'delay' is useful, and from the looks of it, 'link'-time 
is the appropriate level to add this.

I'm working on storing the time in the tsn_link struct I use, and then 
reading that from the avb_alsa shim. Details are still a bit fuzzy though, 
but I plan to do that and then see what audio-time gives me once it is up 
and running.

Richard: is it fair to assume that if ptp4l is running and is part of a PTP 
domain, ktime_get() will return PTP-adjusted time for the system? Or do I 
also need to run phc2sys in order to sync the system-time to PTP-time? Note 
that this is for outgoing traffic; Rx should perhaps use the timestamp in 
the skb.
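
For Rx I am thinking along these lines (sketch; tsn_link is from my 
series, the last_rx_ts field is hypothetical):

static void tsn_link_stamp_rx(struct tsn_link *link, struct sk_buff *skb)
{
        /*
         * prefer the NIC's hardware timestamp, fall back to the
         * software timestamp set by the stack
         */
        ktime_t ts = skb_hwtstamps(skb)->hwtstamp;

        if (!ktime_to_ns(ts))
                ts = skb->tstamp;

        link->last_rx_ts = ts;
}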

Hooking into ktime_get() instead of directly to the PTP-subsystem (if that 
is even possible) makes it a lot easier to debug when running this in a VM 
as it doesn't *have* to use PTP-time when I'm crashing a new kernel :)

Thanks!

-- 
Henrik Austad


signature.asc
Description: Digital signature


Re: [alsa-devel] [very-RFC 0/8] TSN driver for the kernel

2016-06-20 Thread Henrik Austad
On Mon, Jun 20, 2016 at 01:08:27PM +0200, Pierre-Louis Bossart wrote:
> 
> >Presentation time is either set by
> >a) Local sound card performing capture (in which case it will be 'capture
> >   time')
> >b) Local media application sending a stream accross the network
> >   (time when the sample should be played out remotely)
> >c) Remote media application streaming data *to* host, in which case it will
> >   be local presentation time on local  soundcard
> >
> >>This value is dominant to the number of events included in an IEC 61883-1
> >>packet. If this TSN subsystem decides it, most of these items don't need
> >>to be in ALSA.
> >
> >Not sure if I understand this correctly.
> >
> >TSN should have a reference to the timing-domain of each *local*
> >sound-device (for local capture or playback) as well as the shared
> >time-reference provided by gPTP.
> >
> >Unless an End-station acts as GrandMaster for the gPTP-domain, time set
> >forth by gPTP is inmutable and cannot be adjusted. It follows that the
> >sample-frequency of the local audio-devices must be adjusted, or the
> >audio-streams to/from said devices must be resampled.
> 
> The ALSA API provides support for 'audio' timestamps
> (playback/capture rate defined by audio subsystem) and 'system'
> timestamps (typically linked to TSC/ART) with one option to take
> synchronized timestamps should the hardware support them.

Ok, this sounds promising, and very much in line with what AVB would need.

> The intent was that the 'audio' timestamps are translated to a
> shared time reference managed in userspace by gPTP, which in turn
> would define if (adaptive) audio sample rate conversion is needed.
> There is no support at the moment for a 'play_at' function in ALSA,
> only means to control a feedback loop.

Ok, I understand that 'play_at' is difficult to obtain, but it sounds like 
it is still doable to achieve something useful.

Looks like I will be looking into what to put in the .trigger-handler in 
the ALSA shim and experimenting with this to see how it makes sense to 
connect it to the TSN-stream.
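
A first stab at that trigger handler would be something like this (sketch; 
avb_chip and the tsn_{start,stop}_stream() hooks are hypothetical):

static int avb_pcm_trigger(struct snd_pcm_substream *substream, int cmd)
{
        struct avb_chip *chip = snd_pcm_substream_chip(substream);

        switch (cmd) {
        case SNDRV_PCM_TRIGGER_START:
                /* arm the TSN stream; presentation-time handling TBD */
                return tsn_start_stream(chip->link);
        case SNDRV_PCM_TRIGGER_STOP:
                return tsn_stop_stream(chip->link);
        default:
                return -EINVAL;
        }
}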

Thanks!

-- 
Henrik Austad


signature.asc
Description: Digital signature


Re: [very-RFC 0/8] TSN driver for the kernel

2016-06-20 Thread Henrik Austad
On Sun, Jun 19, 2016 at 11:46:29AM +0200, Richard Cochran wrote:
> On Sun, Jun 19, 2016 at 12:45:50AM +0200, Henrik Austad wrote:
> > edit: this turned out to be a somewhat lengthy answer. I have tried to 
> > shorten it down somewhere. it is getting late and I'm getting increasingly 
> > incoherent (Richard probably knows what I'm talking about ;) so I'll stop 
> > for now.
> 
> Thanks for your responses, Henrik.  I think your explanations are on spot.
> 
> > note that an adjustable sample-clock is not a *requirement* but in general 
> > you'd want to avoid resampling in software.
> 
> Yes, but..
> 
> Adjusting the local clock rate to match the AVB network rate is
> essential.  You must be able to *continuously* adjust the rate in
> order to compensate drift.  Again, there are exactly two ways to do
> it, namely in hardware (think VCO) or in software (dynamic
> resampling).

Don't get me wrong, having an adjustable clock for the sampling is 
essential, but it is not *required*.

> What you cannot do is simply buffer the AV data and play it out
> blindly at the local clock rate.

No, you cannot do that; that would not be pretty :)

> Regarding the media clock, if I understand correctly, there the talker
> has two possibilities.  Either the talker samples the stream at the
> gPTP rate, or the talker must tell the listeners the relationship
> (phase offset and frequency ratio) between the media clock and the
> gPTP time.  Please correct me if I got the wrong impression...

Last first: AFAIK, there is no way for the Talker to tell a Listener the 
phase offset/freq ratio other than how each end-station/bridge in the 
gPTP-domain calculates this on psync_update event messages. I could be 
wrong, though, and different encoding formats can probably convey such 
information; I have not seen any such mechanism in the underlying 1722 
format.

So a Talker should send a stream sampled as if the gPTP time drove the 
AD/DA sample frequency directly. Whether the local sampling is driven by 
gPTP or resampled to match gPTP-time prior to transmit is left as an 
implementation detail for the end-station.

Did all that make sense?

Thanks!
-- 
Henrik Austad


signature.asc
Description: Digital signature


Re: [very-RFC 0/8] TSN driver for the kernel

2016-06-18 Thread Henrik Austad
> >> [...] kernel
> >> land. In alsa-lib, sampling rate conversion is implemented in shared 
> >> object.
> >> When userspace applications start playbacking/capturing, depending on PCM
> >> node to access, these applications load the shared object and convert PCM
> >> frames from buffer in userspace to mmapped DMA-buffer, then commit them.
> > 
> > The AVB use case places an additional requirement on the rate
> > conversion.  You will need to adjust the frequency on the fly, as the
> > stream is playing.  I would guess that ALSA doesn't have that option?
> 
> In ALSA kernel/userspace interfaces , the specification cannot be
> supported, at all.
> 
> Please explain about this requirement, where it comes from, which
> specification and clause describe it (802.1AS or 802.1Q?). As long as I
> read IEEE 1722, I cannot find such a requirement.

1722 only describes how the L2 frames are constructed and transmitted. You 
are correct that it does not mention adjustable clocks there.

- 802.1BA gives an overview of AVB

- 802.1Q-2011 Sec 34 and 35 describes forwarding and queueing and Stream 
  Reservation (basically what the network needs in order to correctly 
  prioritize TSN streams)

- 802.1AS-2011 (gPTP) describes the timing in great detail (from a PTP 
  point of view) and describes in more detail how the clocks should be 
  syntonized (802.1AS-2011, 7.3.3).

Since the clock that drives the sample-rate for the DA/AD must be 
controlled by the shared clock, the fact that gPTP can adjust the time 
means that the DA/AD circuit needs to be adjustable as well. As an example, 
if gPTP time runs 100 ppm fast relative to the local oscillator, a nominal 
48 kHz stream has to be clocked at 48004.8 Hz (or resampled by the same 
ratio) to stay locked to the network.

note that an adjustable sample-clock is not a *requirement* but in general 
you'd want to avoid resampling in software.

> (When considering about actual hardware codecs, on-board serial bus such
> as Inter-IC Sound, corresponding controller, immediate change of
> sampling rate is something imaginary for semi-realtime applications. And
> the idea has no meaning for typical playback/capture softwares.)

Yes, and no. When you play back a stored file to your soundcard, data is 
pulled by the card from memory, so you only have a single timing-domain to 
worry about. So I'd say the idea has meaning in normal scenarios as well; 
you just don't have to worry about it there.

When you send a stream across the network, you cannot let the Listener 
pull data from you; you have to have some common notion of time in order to 
send just enough data, and that is why the gPTP domain is so important.

802.1Q gives you low latency through the network, but more importantly, no 
dropped frames. gPTP gives you a central reference to time.

> [1] [alsa-lib][PATCH 0/9 v3] ctl: add APIs for control element set
> http://mailman.alsa-project.org/pipermail/alsa-devel/2016-June/109274.html
> [2] IEEE 1722-2011
> http://ieeexplore.ieee.org/servlet/opac?punumber=5764873
> [3] 5.5 Timing and Synchronization
> op. cit.
> [4] 1394 Open Host Controller Interface Specification
> http://download.microsoft.com/download/1/6/1/161ba512-40e2-4cc9-843a-923143f3456c/ohci_11.pdf

I hope this cleared up some of the questions.

-- 
Henrik Austad


signature.asc
Description: Digital signature


Re: [very-RFC 7/8] AVB ALSA - Add ALSA shim for TSN

2016-06-15 Thread Henrik Austad
On Wed, Jun 15, 2016 at 01:49:08PM +0200, Richard Cochran wrote:
> Now that I understand better...
> 
> On Sun, Jun 12, 2016 at 01:01:35AM +0200, Henrik Austad wrote:
> > Userspace is supposed to reserve bandwidth, find StreamID etc.
> > 
> > To use as a Talker:
> > 
> > mkdir /config/tsn/test/eth0/talker
> > cd /config/tsn/test/eth0/talker
> > echo 65535 > buffer_size
> > echo 08:00:27:08:9f:c3 > remote_mac
> > echo 42 > stream_id
> > echo alsa > enabled
> 
> This is exactly why configfs is the wrong interface.  If you implement
> the AVB device in alsa-lib user space, then you can handle the
> reservations, configuration, UDP sockets, etc, in a way transparent to
> the aplay program.

And how would v4l2 benefit from this being in alsa-lib? Should we require 
both V4L and ALSA to implement the same thing, or should we place it in a 
common place for all?

And what about those systems that want to use TSN but are not 
media-devices? They should be given a raw socket to send traffic over; 
should they also have to implement something in a library?

So no, here I think configfs is an apt choice.
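
The 'enabled' attribute above boils down to a small store-op on the kernel 
side, roughly like this (sketch using the 4.x configfs API; to_tsn_link() 
and tsn_link_enable_alsa() are illustrative names):

static ssize_t tsn_link_enabled_store(struct config_item *item,
                                      const char *page, size_t count)
{
        struct tsn_link *link = to_tsn_link(item);
        int ret;

        /* "echo alsa > enabled" selects the ALSA shim as consumer */
        if (sysfs_streq(page, "alsa"))
                ret = tsn_link_enable_alsa(link);
        else
                ret = -EINVAL;

        return ret ? ret : count;
}
CONFIGFS_ATTR_WO(tsn_link_, enabled);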

> Heck, if done properly, your layer could discover the AVB nodes in the
> network and present each one as a separate device...

No, you definitely do not want the kernel to automagically add devices 
whenever something pops up on the network; for this you need userspace to 
be in control. 1722.1 should not be handled in-kernel.


-- 
Henrik Austad


signature.asc
Description: Digital signature


Re: [very-RFC 0/8] TSN driver for the kernel

2016-06-15 Thread Henrik Austad
On Wed, Jun 15, 2016 at 09:04:41AM +0200, Richard Cochran wrote:
> On Tue, Jun 14, 2016 at 10:38:10PM +0200, Henrik Austad wrote:
> > Whereas I want to do 
> > 
> > aplay some_song.wav
> 
> Can you please explain how your patches accomplish this?

In short:

modprobe tsn
modprobe avb_alsa
mkdir /sys/kernel/config/eth0/link
cd /sys/kernel/config/eth0/link

echo alsa > enabled
aplay -Ddefault:CARD=avb some_song.wav

Likewise on the receiver side, except add 'Listener' to end_station 
attribute

arecord -c2 -r48000 -f S16_LE -Ddefault:CARD=avb > some_recording.wav

I've not had time to fully fix the hw_params for ALSA, so some manual 
tweaking of arecord is required.


Again, this is a very early attempt to get something useful done with TSN; 
I know there are rough edges, and I know buffer handling and timestamping 
are not finished.


Note: if you don't have an Intel card, load tsn in debug mode and it will 
let you use all NICs present.

modprobe tsn in_debug=1


-- 
Henrik Austad


signature.asc
Description: Digital signature


Re: [very-RFC 0/8] TSN driver for the kernel

2016-06-14 Thread Henrik Austad
On Tue, Jun 14, 2016 at 08:26:15PM +0200, Richard Cochran wrote:
> On Tue, Jun 14, 2016 at 11:30:00AM +0200, Henrik Austad wrote:
> > So loop data from kernel -> userspace -> kernelspace and finally back to 
> > userspace and the media application?
> 
> Huh?  I wonder where you got that idea.  Let me show an example of
> what I mean.
> 
>   void listener()
>   {
>   int in = socket();
>   int out = open("/dev/dsp");
>   char buf[];
> 
>   while (1) {
>   recv(in, buf, packetsize);
>   write(out, buf + offset, datasize);
>   }
>   }
> 
> See?

Where is your media-application in this? You only loop the audio from the 
network to the dsp; is the media-application attached to the dsp-device?

Whereas I want to do 

aplay some_song.wav
or mplayer
or spotify
or ..


> > Yes, I know some audio apps "use networking", I can stream netradio, I can 
> > use jack to connect devices using RTP and probably a whole lot of other 
> > applications do similar things. However, AVB is more about using the 
> > network as a virtual sound-card.
> 
> That is news to me.  I don't recall ever having seen AVB described
> like that before.
> 
> > For the media application, it should not 
> > have to care if the device it is using is a soudncard inside the box or a 
> > set of AVB-capable speakers somewhere on the network.
> 
> So you would like a remote listener to appear in the system as a local
> PCM audio sink?  And a remote talker would be like a local media URL?
> Sounds unworkable to me, but even if you were to implement it, the
> logic would surely belong in alsa-lib and not in the kernel.  Behind
> the emulated device, the library would run a loop like the example,
> above.
> 
> In any case, your patches don't implement that sort of thing at all,
> do they?

Subject: [very-RFC 7/8] AVB ALSA - Add ALSA shim for TSN

Did you even bother to look?

-- 
Henrik Austad


signature.asc
Description: Digital signature


Re: [very-RFC 0/8] TSN driver for the kernel

2016-06-14 Thread Henrik Austad
On Mon, Jun 13, 2016 at 09:32:10PM +0200, Richard Cochran wrote:
> On Mon, Jun 13, 2016 at 03:00:59PM +0200, Henrik Austad wrote:
> > On Mon, Jun 13, 2016 at 01:47:13PM +0200, Richard Cochran wrote:
> > > Which driver is that?
> > 
> > drivers/net/ethernet/renesas/
> 
> That driver is merely a PTP capable MAC driver, nothing more.
> Although AVB is in the device name, the driver doesn't implement
> anything beyond the PTP bits.

Yes, I think they do the rest from userspace, not sure though :)

> > What is the rationale for no new sockets? To avoid cluttering? or do 
> > sockets have a drawback I'm not aware of?
> 
> The current raw sockets will work just fine.  Again, there should be a
> application that sits in between with the network socket and the audio
> interface.

So loop data from kernel -> userspace -> kernelspace and finally back to 
userspace and the media application? I agree that you need a way to pipe 
the incoming data directly from the network to userspace for those TSN 
users that can handle it. But again, for media-applications that don't know 
(or care) about AVB, it should be fed to ALSA/v4l2 directly, not take an 
extra round trip between kernel and userspace.

I get the point of not including every single audio/video encoder in the 
kernel, but raw audio should be piped directly to ALSA. V4L2 has a way of 
piping encoded video through the system and to the media application (in 
order to support cameras that do encoding). The same approach should be 
doable for AVB, no? (someone from alsa/v4l2 should probably comment on 
this)

> > Why is configfs wrong?
> 
> Because the application will use the already existing network and
> audio interfaces to configure the system.

Configuring this via the audio-interface is going to be a challenge since 
you need to configure the stream through the network before you can create 
the audio interface. If not, you will have to either drop data or block the 
caller until the link has been fully configured.

This is actually the reason why configfs is used in the series now, as it 
allows userspace to figure out all the different attributes and configure 
the link before letting ALSA start pushing data.

> > > Lets take a look at the big picture.  One aspect of TSN is already
> > > fully supported, namely the gPTP.  Using the linuxptp user stack and a
> > > modern kernel, you have a complete 802.1AS-2011 solution.
> > 
> > Yes, I thought so, which is also why I have put that to the side and why 
> > I'm using ktime_get() for timestamps at the moment. There's also the issue 
> > of hooking the time into ALSA/V4L2
> 
> So lets get that issue solved before anything else.  It is absolutely
> essential for TSN.  Without the synchronization, you are only playing
> audio over the network.  We already have software for that.

Yes, I agree, presentation-time and local time need to be handled 
properly. The same goes for adjusting the sample-rate etc. This is a lot of 
work, so I hope you can understand why I started out with a simple approach 
to spark a discussion before moving on to the larger bits.

> > > 2. A user space audio application that puts it all together, making
> > >use of the services in #1, the linuxptp gPTP service, the ALSA
> > >services, and the network connections.  This program will have all
> > >the knowledge about packet formats, AV encodings, and the local HW
> > >capabilities.  This program cannot yet be written, as we still need
> > >some kernel work in the audio and networking subsystems.
> > 
> > Why?
> 
> Because user space is right place to place the knowledge of the myriad
> formats and options.

See my response above; better to let anything but uncompressed raw data 
trickle through.

> > the whole point should be to make it as easy for userspace as 
> > possible. If you need to tailor each individual media-application to use 
> > AVB, it is not going to be very useful outside pro-Audio. Sure, there will 
> > be challenges, but one key element here should be to *not* require 
> > upgrading every single media application.
> > 
> > Then, back to the suggestion of adding a TSN_SOCKET (which you didn't like, 
> > but can we agree on a term "raw interface to TSN", and mode of transport 
> > can be defined later? ), was to let those applications that are TSN-aware 
> > to do what they need to do, whether it is controlling robots or media 
> > streams.
> 
> First you say you don't want to upgrade media applications, but then
> you invent a new socket type.  That is a contradiction in terms.

Hehe, no, bad phrasing on my part. I want *both* (hence the shim-interface) 
:)

> Audio apps already use networking, and the

Re: [very-RFC 0/8] TSN driver for the kernel

2016-06-14 Thread Henrik Austad
On Mon, Jun 13, 2016 at 08:56:44AM -0700, John Fastabend wrote:
> On 16-06-13 04:47 AM, Richard Cochran wrote:
> > [...]
> > Here is what is missing to support audio TSN:
> > 
> > * User Space
> > 
> > 1. A proper userland stack for AVDECC, MAAP, FQTSS, and so on.  The
> >OpenAVB project does not offer much beyond simple examples.
> > 
> > 2. A user space audio application that puts it all together, making
> >use of the services in #1, the linuxptp gPTP service, the ALSA
> >services, and the network connections.  This program will have all
> >the knowledge about packet formats, AV encodings, and the local HW
> >capabilities.  This program cannot yet be written, as we still need
> >some kernel work in the audio and networking subsystems.
> > 
> > * Kernel Space
> > 
> > 1. Providing frames with a future transmit time.  For normal sockets,
> >this can be in the CMESG data.  For mmap'ed buffers, we will need a
> >new format.  (I think Arnd is working on a new layout.)
> > 
> > 2. Time based qdisc for transmitted frames.  For MACs that support
> >this (like the i210), we only have to place the frame into the
> >correct queue.  For normal HW, we want to be able to reserve a time
> >window in which non-TSN frames are blocked.  This is some work, but
> >in the end it should be a generic solution that not only works
> >"perfectly" with TSN HW but also provides best effort service using
> >any NIC.
> > 
> 
> When I looked at this awhile ago I convinced myself that it could fit
> fairly well into the DCB stack (DCB is also part of 802.1Q). A lot of
> the traffic class to queue mappings and priories could be handled here.
> It might be worth taking a look at ./net/sched/mqprio.c and ./net/dcb/.

Interesting, I'll have a look at DCB and mqprio; I'm not familiar with 
those systems. Thanks for pointing those out!

I hope the complexity doesn't run wild though; TSN is not aimed at 
datacentres, and a lot of the endpoints are going to be embedded devices. 
Introducing a massive stack for handling every eventuality in 802.1Q is 
going to be counterproductive.

> Unfortunately I didn't get too far along but we probably don't want
> another mechanism to map hw queues/tcs/etc if the existing interfaces
> work or can be extended to support this.

Sure, I get that, as long as the complexity for setting up a link doesn't 
go through the roof :)
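
For what it's worth, here is a hedged sketch of what riding on mqprio could 
look like from an application's point of view. The qdisc layout in the 
comment is an example of my own, not something from this series:

/* Example mqprio layout (assumption, configured out-of-band):
 *
 *   tc qdisc add dev eth0 root mqprio num_tc 3 \
 *      map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0
 *
 * skb->priority selects the traffic class and mqprio maps the class
 * onto a hardware queue, so userspace only has to pick SO_PRIORITY.
 */
#include <sys/socket.h>

static int use_class_a_queue(int fd)
{
	int prio = 3;	/* prio 3 -> tc 0 -> queue 0 in the map above */

	return setsockopt(fd, SOL_SOCKET, SO_PRIORITY, &prio, sizeof(prio));
}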

Thanks!

-- 
Henrik Austad


signature.asc
Description: Digital signature


Re: [very-RFC 0/8] TSN driver for the kernel

2016-06-13 Thread Henrik Austad
On Mon, Jun 13, 2016 at 01:47:13PM +0200, Richard Cochran wrote:
> Henrik,

Hi Richard,

> On Sun, Jun 12, 2016 at 01:01:28AM +0200, Henrik Austad wrote:
> > There are at least one AVB-driver (the AV-part of TSN) in the kernel
> > already,
> 
> Which driver is that?

drivers/net/ethernet/renesas/

> > however this driver aims to solve a wider scope as TSN can do
> > much more than just audio. A very basic ALSA-driver is added to the end
> > that allows you to play music between 2 machines using aplay in one end
> > and arecord | aplay on the other (some fiddling required) We have plans
> > for doing the same for v4l2 eventually (but there are other fishes to
> > fry first). The same goes for a TSN_SOCK type approach as well.
> 
> Please, no new socket type for this.

The idea was to create a tsn-driver and then allow userspace to use it 
either for media or for whatever else they'd like - and then a socket made 
sense. Or so I thought :)

What is the rationale for no new sockets? To avoid clutter? Or do 
sockets have a drawback I'm not aware of?

> > What remains
> > - tie to (g)PTP properly, currently using ktime_get() for presentation
> >   time
> > - get time from shim into TSN and vice versa
> 
> ... and a whole lot more, see below.
> 
> > - let shim create/manage buffer
> 
> (BTW, shim is a terrible name for that.)

So something thin that is placed between two subsystems should rather be 
called.. flimsy? The point of the name was to indicate that it glued 2 
pieces together. If you have a better suggestion, I'm all ears.

> [sigh]
> 
> People have been asking me about TSN and Linux, and we've made some
> thoughts about it.  The interest is there, and so I am glad to see
> discussion on this topic.

I'm not aware of any such discussions; could you point me to where TSN has 
been discussed? It would be nice to see other people's thoughts on the 
matter (which was one of the ideas behind this series in the first place).

> Having said that, your series does not even begin to address the real
> issues. 

Well, in all honesty, I did say so :) It is marked as "very-RFC", and not 
meant for inclusion in the kernel as-is. I also made a short list of the 
most crucial bits missing.

I know there are real issues, but solving these won't matter if you don't 
have anything useful to do with it. I decided to start by adding a thin 
ALSA-driver and then continue to work with the kernel infrastructure. 
Having something that works-ish makes it a lot easier to test and to get 
others interested, especially when you are not deeply involved in a 
subsystem.

At some point you get to where you need input from others more intimate 
with the inner workings of the different subsystems to see how things 
should be created without making too much of a mess. So here we are :)

My primary motivation was to
a) gather feedback (which you have provided, and for which I am very 
   grateful)
b) get the discussion going on how/if TSN should be added to the kernel

> I did not review the patches too carefully (because the
> important stuff is missing), but surely configfs is the wrong
> interface for this. 

Why is configfs wrong?

Unless you want to implement discovery and enumeration and srp-negotiation 
in the kernel, you need userspace to handle this. Once userspace has done 
all that (found priority-codes, streamIDs, vlanIDs and all the required 
bits), then userspace can create a new link. For that I find ConfigFS to be 
quite useful and up to the task.

In my opinion, it also makes for a much tidier and saner interface than 
some obscure dark-magic ioctl()
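
As a hedged illustration of that flow - the configfs paths and attribute 
layout below are my assumptions modelled on this series, not verified 
against tsn_configfs.c - userspace could do roughly:

#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
	/* mkdir instantiates a new tsn_link in the kernel */
	if (mkdir("/sys/kernel/config/tsn/eth0/link42", 0755))
		return 1;

	/* attributes are plain files; write() configures the link */
	FILE *f = fopen("/sys/kernel/config/tsn/eth0/link42/max_payload_size", "w");

	if (!f)
		return 1;
	fprintf(f, "512\n");
	fclose(f);
	return 0;
}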

> In the end, we will be able to support TSN using
> the existing networking and audio interfaces, adding appropriate
> extensions.

I surely hope so, but as I'm not deep into the networking part of the 
kernel, finding those appropriate extensions is hard - which is why we 
started writing a standalone module.

> Your patch features a buffer shared by networking and audio.  This
> isn't strictly necessary for TSN, and it may be harmful. 

At one stage, data has to flow in/out of the network, and whoever's using 
TSN probably needs to store data somewhere as well, so you need some form 
of buffering somewhere in the path the data flows through.

That being said, one of the bits on my plate is to remove the 
"TSN-hosted-buffer" and let TSN read/write data via the shim_ops. What the 
best set of functions here is remains to be seen, but it should provide a 
way to move data from either a single frame or a "few frames" to the shime 
(err..  ;)
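
To be explicit about the direction I have in mind - purely a sketch, the 
names below are made up and not the ones in tsn_internal.h - something 
along these lines:

/* hypothetical shim interface: core moves data through callbacks
 * instead of exposing its internal buffer to the shim
 */
struct tsn_shim_ops_sketch {
	/* core pulls payload for the next outgoing frame(s) */
	size_t (*copy_to_net)(struct tsn_link *link, void *dst, size_t len);

	/* core pushes payload from a received frame to the shim */
	size_t (*copy_from_net)(struct tsn_link *link, const void *src, size_t len);

	/* shim owns allocation so e.g. ALSA can provide its own ring */
	void *(*alloc_buffer)(struct tsn_link *link, size_t size);
	void (*free_buffer)(struct tsn_link *link, void *buffer);
};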

> The
> Listeners are supposed to calculate the delay from frame reception to
> the DA conversion.  They can easily include the time needed for a user
> space program to parse the frames, copy (and combine/convert) the
> data, and re

Re: [very-RFC 6/8] Add TSN event-tracing

2016-06-13 Thread Henrik Austad
On Sun, Jun 12, 2016 at 10:22:01PM -0400, Steven Rostedt wrote:
> On Sun, 12 Jun 2016 23:25:10 +0200
> Henrik Austad <hen...@austad.us> wrote:
> 
> > > > +#include 
> > > > +#include 
> > > > +/* #include  */
> > > > +
> > > > +/* FIXME: update to TRACE_CLASS to reduce overhead */  
> > > 
> > > I'm curious to why I didn't do this now. A class would make less
> > > duplication of typing too ;-)  
> > 
> > Yeah, I found this in a really great article written by some tracing-dude, 
> > I hear he talks really, really fast!
> 
> I plead the 5th!
> 
> > 
> > https://lwn.net/Articles/381064/
> > 
> > > > +TRACE_EVENT(tsn_buffer_write,
> > > > +
> > > > +   TP_PROTO(struct tsn_link *link,
> > > > +   size_t bytes),
> > > > +
> > > > +   TP_ARGS(link, bytes),
> > > > +
> > > > +   TP_STRUCT__entry(
> > > > +   __field(u64, stream_id)
> > > > +   __field(size_t, size)
> > > > +   __field(size_t, bsize)
> > > > +   __field(size_t, size_left)
> > > > +   __field(void *, buffer)
> > > > +   __field(void *, head)
> > > > +   __field(void *, tail)
> > > > +   __field(void *, end)
> > > > +   ),
> > > > +
> > > > +   TP_fast_assign(
> > > > +   __entry->stream_id = link->stream_id;
> > > > +   __entry->size = bytes;
> > > > +   __entry->bsize = link->used_buffer_size;
> > > > +   __entry->size_left = (link->head - link->tail) % 
> > > > link->used_buffer_size;  
> > > 
> > > Move this logic into the print statement, since you save head and tail.  
> > 
> > Ok, any particular reason?
> 
> Because it removes calculations during the trace. The calculations done
> in TP_printk() are done at the time of reading the trace, and
> calculations done in TP_fast_assign() are done during the recording and
> hence adding more overhead to the trace itself.

Aha! that makes sense, thanks!
(/me goes and updates the tracing-part)
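
Combining both suggestions - one shared class for the buffer events, and 
the "avail" computed at read time in TP_printk() - would look roughly like 
this (a sketch against the fields in the patch, not the final code):

DECLARE_EVENT_CLASS(tsn_buffer_template,

	TP_PROTO(struct tsn_link *link, size_t bytes),

	TP_ARGS(link, bytes),

	TP_STRUCT__entry(
		__field(u64, stream_id)
		__field(size_t, size)
		__field(size_t, bsize)
		__field(void *, buffer)
		__field(void *, head)
		__field(void *, tail)
		__field(void *, end)
	),

	TP_fast_assign(
		__entry->stream_id = link->stream_id;
		__entry->size = bytes;
		__entry->bsize = link->used_buffer_size;
		__entry->buffer = link->buffer;
		__entry->head = link->head;
		__entry->tail = link->tail;
		__entry->end = link->end;
	),

	/* avail is computed here, at read time, keeping the record path lean */
	TP_printk("stream_id=%llu, copy=%zd, buffer: %zd, avail=%zd, [buffer=%p, head=%p, tail=%p, end=%p]",
		__entry->stream_id, __entry->size, __entry->bsize,
		(__entry->head - __entry->tail) % __entry->bsize,
		__entry->buffer, __entry->head, __entry->tail, __entry->end)
);

DEFINE_EVENT(tsn_buffer_template, tsn_buffer_write,
	TP_PROTO(struct tsn_link *link, size_t bytes),
	TP_ARGS(link, bytes));

DEFINE_EVENT(tsn_buffer_template, tsn_buffer_write_net,
	TP_PROTO(struct tsn_link *link, size_t bytes),
	TP_ARGS(link, bytes));

DEFINE_EVENT(tsn_buffer_template, tsn_buffer_read,
	TP_PROTO(struct tsn_link *link, size_t bytes),
	TP_ARGS(link, bytes));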

-Henrik


signature.asc
Description: Digital signature


Re: [very-RFC 6/8] Add TSN event-tracing

2016-06-12 Thread Henrik Austad
On Sun, Jun 12, 2016 at 12:58:03PM -0400, Steven Rostedt wrote:
> On Sun, 12 Jun 2016 01:01:34 +0200
> Henrik Austad  wrote:
> 
> > From: Henrik Austad 
> > 
> > This needs refactoring and should be updated to use TRACE_CLASS, but for
> > now it provides a fair debug-window into TSN.
> > 
> > Cc: "David S. Miller" 
> > Cc: Steven Rostedt  (maintainer:TRACING)
> > Cc: Ingo Molnar  (maintainer:TRACING)
> > Signed-off-by: Henrik Austad 
> > ---
> >  include/trace/events/tsn.h | 349 
> > +
> >  1 file changed, 349 insertions(+)
> >  create mode 100644 include/trace/events/tsn.h
> > 
> > diff --git a/include/trace/events/tsn.h b/include/trace/events/tsn.h
> > new file mode 100644
> > index 000..ac1f31b
> > --- /dev/null
> > +++ b/include/trace/events/tsn.h
> > @@ -0,0 +1,349 @@
> > +#undef TRACE_SYSTEM
> > +#define TRACE_SYSTEM tsn
> > +
> > +#if !defined(_TRACE_TSN_H) || defined(TRACE_HEADER_MULTI_READ)
> > +#define _TRACE_TSN_H
> > +
> > +#include 
> > +#include 
> > +
> > +#include 
> > +#include 
> > +/* #include  */
> > +
> > +/* FIXME: update to TRACE_CLASS to reduce overhead */
> 
> I'm curious to why I didn't do this now. A class would make less
> duplication of typing too ;-)

Yeah, I found this in a really great article written by some tracing-dude, 
I hear he talks really, really fast!

https://lwn.net/Articles/381064/

> > +TRACE_EVENT(tsn_buffer_write,
> > +
> > +   TP_PROTO(struct tsn_link *link,
> > +   size_t bytes),
> > +
> > +   TP_ARGS(link, bytes),
> > +
> > +   TP_STRUCT__entry(
> > +   __field(u64, stream_id)
> > +   __field(size_t, size)
> > +   __field(size_t, bsize)
> > +   __field(size_t, size_left)
> > +   __field(void *, buffer)
> > +   __field(void *, head)
> > +   __field(void *, tail)
> > +   __field(void *, end)
> > +   ),
> > +
> > +   TP_fast_assign(
> > +   __entry->stream_id = link->stream_id;
> > +   __entry->size = bytes;
> > +   __entry->bsize = link->used_buffer_size;
> > +   __entry->size_left = (link->head - link->tail) % 
> > link->used_buffer_size;
> 
> Move this logic into the print statement, since you save head and tail.

Ok, any particular reason?

> > +   __entry->buffer = link->buffer;
> > +   __entry->head = link->head;
> > +   __entry->tail = link->tail;
> > +   __entry->end = link->end;
> > +   ),
> > +
> > +   TP_printk("stream_id=%llu, copy=%zd, buffer: %zd, avail=%zd, 
> > [buffer=%p, head=%p, tail=%p, end=%p]",
> > +   __entry->stream_id, __entry->size, __entry->bsize, 
> > __entry->size_left,
> 
>  __entry->stream_id, __entry->size, __entry->bsize,
>  (__entry->head - __entry->tail) % __entry->bsize,
> 

Ok, so is this about saving space by dropping one intermediate value, or is 
it some other point I'm missing here?

> > +   __entry->buffer,__entry->head, __entry->tail,  __entry->end)
> > +
> > +   );
> > +
> > +TRACE_EVENT(tsn_buffer_write_net,
> > +
> > +   TP_PROTO(struct tsn_link *link,
> > +   size_t bytes),
> > +
> > +   TP_ARGS(link, bytes),
> > +
> > +   TP_STRUCT__entry(
> > +   __field(u64, stream_id)
> > +   __field(size_t, size)
> > +   __field(size_t, bsize)
> > +   __field(size_t, size_left)
> > +   __field(void *, buffer)
> > +   __field(void *, head)
> > +   __field(void *, tail)
> > +   __field(void *, end)
> > +   ),
> > +
> > +   TP_fast_assign(
> > +   __entry->stream_id = link->stream_id;
> > +   __entry->size = bytes;
> > +   __entry->bsize = link->used_buffer_size;
> > +   __entry->size_left = (link->head - link->tail) % 
> > link->used_buffer_size;
> > +   __entry->buffer = link->buffer;
> > +   __entry->head = link->head;
> > +   __entry->tail = link->tail;
> > +   __entry->end = link->end;
> > +   ),
> > +
> > +   TP_printk("stream_id=%llu, copy=%zd, buffer: %zd, avail=%zd, 
> > [buffer=%p, head=%p, tail=%p, end=%p]",
> > +

Re: [very-RFC 5/8] Add TSN machinery to drive the traffic from a shim over the network

2016-06-12 Thread Henrik Austad
On Sun, Jun 12, 2016 at 12:35:10AM -0700, Joe Perches wrote:
> On Sun, 2016-06-12 at 00:22 +0200, Henrik Austad wrote:
> > From: Henrik Austad <haus...@cisco.com>
> > 
> > In short summary:
> > 
> > * tsn_core.c is the main driver of tsn, all new links go through
> >   here and all data to/form the shims are handled here
> >   core also manages the shim-interface.
> []
> > diff --git a/net/tsn/tsn_configfs.c b/net/tsn/tsn_configfs.c
> []
> > +static inline struct tsn_link *to_tsn_link(struct config_item *item)
> > +{
> > +   /* this line causes checkpatch to WARN. making checkpatch happy,
> > +    * makes code messy..
> > +    */
> > +   return item ? container_of(to_config_group(item), struct tsn_link, 
> > group) : NULL;
> > +}
> 
> How about
> 
> static inline struct tsn_link *to_tsn_link(struct config_item *item)
> {
>   if (!item)
>   return NULL;
>   return container_of(to_config_group(item), struct tsn_link, group);
> }

Yes, I mulled over this for a while, but I got the impression that the 
ternary approach was the style used in configfs, and I tried staying in 
line with that in tsn_configfs.

If you look at other parts of the TSN code, I tend to use the if (!item) ... 
approach. So I don't have any strong technical preference either way, really.

-- 
Henrik Austad


signature.asc
Description: Digital signature


[very-RFC 6/8] Add TSN event-tracing

2016-06-11 Thread Henrik Austad
From: Henrik Austad 

This needs refactoring and should be updated to use TRACE_CLASS, but for
now it provides a fair debug-window into TSN.

Cc: "David S. Miller" 
Cc: Steven Rostedt  (maintainer:TRACING)
Cc: Ingo Molnar  (maintainer:TRACING)
Signed-off-by: Henrik Austad 
---
 include/trace/events/tsn.h | 349 +
 1 file changed, 349 insertions(+)
 create mode 100644 include/trace/events/tsn.h

diff --git a/include/trace/events/tsn.h b/include/trace/events/tsn.h
new file mode 100644
index 000..ac1f31b
--- /dev/null
+++ b/include/trace/events/tsn.h
@@ -0,0 +1,349 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM tsn
+
+#if !defined(_TRACE_TSN_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_TSN_H
+
+#include 
+#include 
+
+#include 
+#include 
+/* #include  */
+
+/* FIXME: update to TRACE_CLASS to reduce overhead */
+TRACE_EVENT(tsn_buffer_write,
+
+   TP_PROTO(struct tsn_link *link,
+   size_t bytes),
+
+   TP_ARGS(link, bytes),
+
+   TP_STRUCT__entry(
+   __field(u64, stream_id)
+   __field(size_t, size)
+   __field(size_t, bsize)
+   __field(size_t, size_left)
+   __field(void *, buffer)
+   __field(void *, head)
+   __field(void *, tail)
+   __field(void *, end)
+   ),
+
+   TP_fast_assign(
+   __entry->stream_id = link->stream_id;
+   __entry->size = bytes;
+   __entry->bsize = link->used_buffer_size;
+   __entry->size_left = (link->head - link->tail) % 
link->used_buffer_size;
+   __entry->buffer = link->buffer;
+   __entry->head = link->head;
+   __entry->tail = link->tail;
+   __entry->end = link->end;
+   ),
+
+   TP_printk("stream_id=%llu, copy=%zd, buffer: %zd, avail=%zd, 
[buffer=%p, head=%p, tail=%p, end=%p]",
+   __entry->stream_id, __entry->size, __entry->bsize, 
__entry->size_left,
+   __entry->buffer,__entry->head, __entry->tail,  __entry->end)
+
+   );
+
+TRACE_EVENT(tsn_buffer_write_net,
+
+   TP_PROTO(struct tsn_link *link,
+   size_t bytes),
+
+   TP_ARGS(link, bytes),
+
+   TP_STRUCT__entry(
+   __field(u64, stream_id)
+   __field(size_t, size)
+   __field(size_t, bsize)
+   __field(size_t, size_left)
+   __field(void *, buffer)
+   __field(void *, head)
+   __field(void *, tail)
+   __field(void *, end)
+   ),
+
+   TP_fast_assign(
+   __entry->stream_id = link->stream_id;
+   __entry->size = bytes;
+   __entry->bsize = link->used_buffer_size;
+   __entry->size_left = (link->head - link->tail) % 
link->used_buffer_size;
+   __entry->buffer = link->buffer;
+   __entry->head = link->head;
+   __entry->tail = link->tail;
+   __entry->end = link->end;
+   ),
+
+   TP_printk("stream_id=%llu, copy=%zd, buffer: %zd, avail=%zd, 
[buffer=%p, head=%p, tail=%p, end=%p]",
+   __entry->stream_id, __entry->size, __entry->bsize, 
__entry->size_left,
+   __entry->buffer,__entry->head, __entry->tail,  __entry->end)
+
+   );
+
+
+TRACE_EVENT(tsn_buffer_read,
+
+   TP_PROTO(struct tsn_link *link,
+   size_t bytes),
+
+   TP_ARGS(link, bytes),
+
+   TP_STRUCT__entry(
+   __field(u64, stream_id)
+   __field(size_t, size)
+   __field(size_t, bsize)
+   __field(size_t, size_left)
+   __field(void *, buffer)
+   __field(void *, head)
+   __field(void *, tail)
+   __field(void *, end)
+   ),
+
+   TP_fast_assign(
+   __entry->stream_id = link->stream_id;
+   __entry->size = bytes;
+   __entry->bsize = link->used_buffer_size;
+   __entry->size_left = (link->head - link->tail) % 
link->used_buffer_size;
+   __entry->buffer = link->buffer;
+   __entry->head = link->head;
+   __entry->tail = link->tail;
+   __entry->end = link->end;
+   ),
+
+   TP_printk("stream_id=%llu, copy=%zd, buffer: %zd, avail=%zd, 
[buffer=%p, head=%p, tail=%p, end=%p]",
+   __entry->stream_id, __entry->size, __entry->bsize, 
__entry->size_left,
+   __entry->buffer,__entry->head, __entry->tail,  __entry->end)
+
+   );
+
+TRACE_EVENT(tsn_refill,
+
+   TP_PROTO(struct tsn_link *link,
+  

[very-RFC 4/8] Add TSN header for the driver

2016-06-11 Thread Henrik Austad
From: Henrik Austad 

This defines the general TSN headers for network packets, the
shim-interface and the central 'tsn_list' structure.

Cc: "David S. Miller" 
Signed-off-by: Henrik Austad 
---
 include/linux/tsn.h | 806 
 1 file changed, 806 insertions(+)
 create mode 100644 include/linux/tsn.h

diff --git a/include/linux/tsn.h b/include/linux/tsn.h
new file mode 100644
index 000..0e1f732b
--- /dev/null
+++ b/include/linux/tsn.h
@@ -0,0 +1,806 @@
+/*   TSN - Time Sensitive Networking
+ *
+ *   Copyright (C) 2016- Henrik Austad 
+ *
+ *   This program is free software; you can redistribute it and/or modify
+ *   it under the terms of the GNU General Public License as published by
+ *   the Free Software Foundation; either version 2 of the License, or
+ *   (at your option) any later version.
+ *
+ *   This program is distributed in the hope that it will be useful,
+ *   but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *   GNU General Public License for more details.
+ */
+#ifndef _TSN_H
+#define _TSN_H
+#include 
+#include 
+#include 
+
+/* The naming here can be a bit confusing as we call it TSN but naming
+ * suggests 'AVB'. Reason: IEEE 1722 was written before the working group
+ * was renamed to Time Sensitive Networking.
+ *
+ * To be precise: TSN describes the protocol for shipping data, AVB is a
+ * media layer which you can build on top of TSN.
+ *
+ * For this reason the frames are given avb-names whereas the functions
+ * use tsn_-naming.
+ */
+
+/* 7 bit value 0x00 - 0x7F */
+enum avtp_subtype {
+   AVTP_61883_IIDC = 0,
+   AVTP_MMA = 0x1,
+   AVTP_MAAP = 0x7e,
+   AVTP_EXPERIMENTAL = 0x7f,
+};
+
+/* NOTE NOTE NOTE !!
+ * The headers below use bitfields extensively and verifications
+ * are needed when using little-endian vs big-endian systems.
+ */
+
+/* Common part of avtph header
+ *
+ * AVB Transport Protocol Common Header
+ *
+ * Defined in 1722-2011 Sec. 5.2
+ */
+struct avtp_ch {
+#if defined(__LITTLE_ENDIAN_BITFIELD)
+   /* use avtp_subtype enum.
+*/
+   u8 subtype:7;
+
+   /* Controlframe: 1
+* Dataframe   : 0
+*/
+   u8 cd:1;
+
+   /* Type specific data, part 1 */
+   u8 tsd_1:4;
+
+   /* In current version of AVB, only 0 is valid, all other values
+* are reserved for future versions.
+*/
+   u8 version:3;
+
+   /* Valid StreamID in frame
+*
+* ControlData not related to a specific stream should clear
+* this (and have stream_id = 0), _all_ other values should set
+* this to 1.
+*/
+   u8 sv:1;
+#elif defined(__BIG_ENDIAN_BITFIELD)
+   u8 cd:1;
+   u8 subtype:7;
+   u8 sv:1;
+   u8 version:3;
+   u8 tsd_1:4;
+#else
+#error "Unknown Endianness, cannot determine bitfield ordering"
+#endif
+   /* Type specific data (adjacent to tsd_1, but split due to bitfield) */
+   u16 tsd_2;
+   u64 stream_id;
+
+   /*
+* payload by subtype
+*/
+   u8 pbs[0];
+} __packed;
+
+/* AVTPDU Common Control header format
+ * IEEE 1722#5.3
+ */
+struct avtpc_header {
+#if defined(__LITTLE_ENDIAN_BITFIELD)
+   u8 subtype:7;
+   u8 cd:1;
+   u8 control_data:4;
+   u8 version:3;
+   u8 sv:1;
+   u16 control_data_length:11;
+   u16 status:5;
+#elif defined(__BIG_ENDIAN_BITFIELD)
+   u8 cd:1;
+   u8 subtype:7;
+   u8 sv:1;
+   u8 version:3;
+   u8 control_data:4;
+   u16 status:5;
+   u16 control_data_length:11;
+#else
+#error "Unknown Endianness, cannot determine bitfield ordering"
+#endif
+   u64 stream_id;
+} __packed;
+
+/* AVTP common stream data AVTPDU header format
+ * IEEE 1722#5.4
+ */
+struct avtpdu_header {
+#if defined(__LITTLE_ENDIAN_BITFIELD)
+   u8 subtype:7;
+   u8 cd:1;
+
+   /* avtp_timestamp valid */
+   u8 tv: 1;
+
+   /* gateway_info valid */
+   u8 gv:1;
+
+   /* reserved */
+   u8 r:1;
+
+   /*
+* Media clock Restart toggle
+*/
+   u8 mr:1;
+
+   u8 version:3;
+
+   /* StreamID valid */
+   u8 sv:1;
+   u8 seqnr;
+
+   /* Timestamp uncertain */
+   u8 tu:1;
+   u8 r2:7;
+#elif defined(__BIG_ENDIAN_BITFIELD)
+   u8 cd:1;
+   u8 subtype:7;
+
+   u8 sv:1;
+   u8 version:3;
+   u8 mr:1;
+   u8 r:1;
+   u8 gv:1;
+   u8 tv: 1;
+
+   u8 seqnr;
+   u8 r2:7;
+   u8 tu:1;
+#else
+#error "Unknown Endianness, cannot determine bitfield ordering"
+#endif
+
+   u64 stream_id;
+
+   u32 avtp_timestamp;
+   u32 gateway_info;
+
+   /* Stream Data Length */
+   u16 sd_len;
+
+   /* Protocol specific header, derived from avtp_subtype */
+   u16 psh;
+
+   /* Stream Payload Data 0 to n octets
+* n so that total 


[very-RFC 0/8] TSN driver for the kernel

2016-06-11 Thread Henrik Austad
Hi all
(series based on v4.7-rc2, now with the correct netdev)

This is a *very* early RFC for a TSN-driver in the kernel. It has been
floating around in my repo for a while and I would appreciate some
feedback on the overall design to avoid some major blunders.

TSN: Time Sensitive Networking, formerly known as AVB (Audio/Video
Bridging).

There is at least one AVB-driver (the AV-part of TSN) in the kernel
already; however, this driver aims at a wider scope, as TSN can do
much more than just audio. A very basic ALSA-driver is added at the end
that allows you to play music between 2 machines using aplay on one end
and arecord | aplay on the other (some fiddling required). We have plans
for doing the same for v4l2 eventually (but there are other fish to
fry first). The same goes for a TSN_SOCK type approach as well.

TSN is all about providing infrastructure. Although there are a few
very interesting uses for TSN (reliable, deterministic network for audio
and video), once you have that reliable link, you can do a lot more.

Some notes on the design:

The driver is directed via ConfigFS as we need userspace to handle
stream-reservation (MSRP), discovery and enumeration (IEEE 1722.1) and
whatever other management is needed. Once we have all the required
attributes, we can create a link using mkdir, and use write() to set the
attributes. Once ready, specify the 'shim' (basically a thin wrapper
between TSN and another subsystem) and we start pushing out frames.

The network part: it ties directly into the rx-handler for receive and
writes skbs using netdev_start_xmit(). This could probably be
improved. Two new fields in netdev_ops have been introduced, and the
Intel igb-driver has been updated (as this is available as a PCI-e
card). The igb-driver works-ish.
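
For the receive side, the hook is conceptually the following - a sketch 
with my own function names, not the code in tsn_net.c; ETH_P_TSN is the 
IEEE 1722 ethertype (0x22F0) already defined in if_ether.h:

#include <linux/etherdevice.h>
#include <linux/netdevice.h>
#include <linux/rtnetlink.h>

static rx_handler_result_t tsn_rx_handler_sketch(struct sk_buff **pskb)
{
	struct sk_buff *skb = *pskb;

	if (eth_hdr(skb)->h_proto != htons(ETH_P_TSN))
		return RX_HANDLER_PASS;	/* not 1722, leave it to the stack */

	/* hand the AVTPDU to the shim here */
	consume_skb(skb);
	return RX_HANDLER_CONSUMED;
}

static int tsn_hook_dev_sketch(struct net_device *dev, void *priv)
{
	int err;

	rtnl_lock();
	err = netdev_rx_handler_register(dev, tsn_rx_handler_sketch, priv);
	rtnl_unlock();

	return err;
}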


What remains
- tie to (g)PTP properly, currently using ktime_get() for presentation
  time
- get time from shim into TSN and vice versa
- let shim create/manage buffer

Henrik Austad (8):
  TSN: add documentation
  TSN: Add the standard formerly known as AVB to the kernel
  Adding TSN-driver to Intel I210 controller
  Add TSN header for the driver
  Add TSN machinery to drive the traffic from a shim over the network
  Add TSN event-tracing
  AVB ALSA - Add ALSA shim for TSN
  MAINTAINERS: add TSN/AVB-entries

 Documentation/TSN/tsn.txt | 147 +
 MAINTAINERS   |  14 +
 drivers/media/Kconfig |  15 +
 drivers/media/Makefile|   3 +-
 drivers/media/avb/Makefile|   5 +
 drivers/media/avb/avb_alsa.c  | 742 +++
 drivers/media/avb/tsn_iec61883.h  | 124 
 drivers/net/ethernet/intel/Kconfig|  18 +
 drivers/net/ethernet/intel/igb/Makefile   |   2 +-
 drivers/net/ethernet/intel/igb/igb.h  |  19 +
 drivers/net/ethernet/intel/igb/igb_main.c |  10 +-
 drivers/net/ethernet/intel/igb/igb_tsn.c  | 396 
 include/linux/netdevice.h |  32 +
 include/linux/tsn.h   | 806 
 include/trace/events/tsn.h| 349 +++
 net/Kconfig   |   1 +
 net/Makefile  |   1 +
 net/tsn/Kconfig   |  32 +
 net/tsn/Makefile  |   6 +
 net/tsn/tsn_configfs.c| 623 +++
 net/tsn/tsn_core.c| 975 ++
 net/tsn/tsn_header.c  | 203 +++
 net/tsn/tsn_internal.h| 383 
 net/tsn/tsn_net.c | 403 
 24 files changed, 5306 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/TSN/tsn.txt
 create mode 100644 drivers/media/avb/Makefile
 create mode 100644 drivers/media/avb/avb_alsa.c
 create mode 100644 drivers/media/avb/tsn_iec61883.h
 create mode 100644 drivers/net/ethernet/intel/igb/igb_tsn.c
 create mode 100644 include/linux/tsn.h
 create mode 100644 include/trace/events/tsn.h
 create mode 100644 net/tsn/Kconfig
 create mode 100644 net/tsn/Makefile
 create mode 100644 net/tsn/tsn_configfs.c
 create mode 100644 net/tsn/tsn_core.c
 create mode 100644 net/tsn/tsn_header.c
 create mode 100644 net/tsn/tsn_internal.h
 create mode 100644 net/tsn/tsn_net.c

--
2.7.4


[very-RFC 5/8] Add TSN machinery to drive the traffic from a shim over the network

2016-06-11 Thread Henrik Austad
From: Henrik Austad 

In short summary:

* tsn_core.c is the main driver of tsn, all new links go through
  here and all data to/from the shims are handled here;
  core also manages the shim-interface.

* tsn_configfs.c is the API to userspace. TSN is driven from userspace
  and a link is created, configured, enabled, disabled and removed
  purely from userspace. All attributes required must be determined by
  userspace, preferably via IEEE 1722.1 (discovery and enumeration).

* tsn_header.c is a small part that handles the actual header of the frames
  we send. Kept out of core for cleanliness.

* tsn_net.c handles operations towards the networking layer.

The current driver is under development. This means that from the moment it
is enabled with a shim, it will send traffic, either 0-traffic (frames of
reserved length but with payload 0) or actual traffic. This will change
once the driver stabilizes.

For more detail, see Documentation/networking/tsn/

Cc: "David S. Miller" 
Signed-off-by: Henrik Austad 
---
 net/Makefile   |   1 +
 net/tsn/Makefile   |   6 +
 net/tsn/tsn_configfs.c | 623 +++
 net/tsn/tsn_core.c | 975 +
 net/tsn/tsn_header.c   | 203 ++
 net/tsn/tsn_internal.h | 383 +++
 net/tsn/tsn_net.c  | 403 
 7 files changed, 2594 insertions(+)
 create mode 100644 net/tsn/Makefile
 create mode 100644 net/tsn/tsn_configfs.c
 create mode 100644 net/tsn/tsn_core.c
 create mode 100644 net/tsn/tsn_header.c
 create mode 100644 net/tsn/tsn_internal.h
 create mode 100644 net/tsn/tsn_net.c

diff --git a/net/Makefile b/net/Makefile
index bdd1455..c15482e 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -79,3 +79,4 @@ ifneq ($(CONFIG_NET_L3_MASTER_DEV),)
 obj-y  += l3mdev/
 endif
 obj-$(CONFIG_QRTR) += qrtr/
+obj-$(CONFIG_TSN)  += tsn/
diff --git a/net/tsn/Makefile b/net/tsn/Makefile
new file mode 100644
index 000..0d87687
--- /dev/null
+++ b/net/tsn/Makefile
@@ -0,0 +1,6 @@
+#
+# Makefile for the Linux TSN subsystem
+#
+
+obj-$(CONFIG_TSN) += tsn.o
+tsn-objs :=tsn_core.o tsn_configfs.o tsn_net.o tsn_header.o
diff --git a/net/tsn/tsn_configfs.c b/net/tsn/tsn_configfs.c
new file mode 100644
index 000..f3d0986
--- /dev/null
+++ b/net/tsn/tsn_configfs.c
@@ -0,0 +1,623 @@
+/*
+ *   ConfigFS interface to TSN
+ *   Copyright (C) 2015- Henrik Austad 
+ *
+ *   This program is free software; you can redistribute it and/or modify
+ *   it under the terms of the GNU General Public License as published by
+ *   the Free Software Foundation; either version 2 of the License, or
+ *   (at your option) any later version.
+ *
+ *   This program is distributed in the hope that it will be useful,
+ *   but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *   GNU General Public License for more details.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "tsn_internal.h"
+
+static inline struct tsn_link *to_tsn_link(struct config_item *item)
+{
+   /* this line causes checkpatch to WARN. making checkpatch happy,
+* makes code messy..
+*/
+   return item ? container_of(to_config_group(item), struct tsn_link, 
group) : NULL;
+}
+
+static inline struct tsn_nic *to_tsn_nic(struct config_group *group)
+{
+   return group ? container_of(group, struct tsn_nic, group) : NULL;
+}
+
+/* ---
+ * Tier2 attributes
+ *
+ * The content of the links userspace can see/modify
+ * ---
+*/
+static ssize_t _tsn_max_payload_size_show(struct config_item *item,
+ char *page)
+{
+   struct tsn_link *link = to_tsn_link(item);
+
+   if (!link)
+   return -EINVAL;
+   return sprintf(page, "%u\n", (u32)link->max_payload_size);
+}
+
+static ssize_t _tsn_max_payload_size_store(struct config_item *item,
+  const char *page, size_t count)
+{
+   struct tsn_link *link = to_tsn_link(item);
+   u16 mpl_size = 0;
+   int ret = 0;
+
+   if (!link)
+   return -EINVAL;
+   if (tsn_link_is_on(link)) {
+   pr_err("ERROR: Cannot change Payload size on an enabled 
link\n");
+   return -EINVAL;
+   }
+   ret = kstrtou16(page, 0, &mpl_size);
+   if (ret)
+   return ret;
+
+   /* 802.1BA-2011 6.4 payload must be <1500 octets (excluding
+* headers, tags etc) However, this is not directly mappable to
+* how some hw handles things, so to be conservative, we
+* restrict it down to [26..1485]
+*
+* This is also the _payload_ size, which does not include the
+* AVTPDU heade
