Re: [PATCH][RT] xfs: Disable preemption when grabbing all icsb counter locks

2015-04-30 Thread Austin Schuh
On Thu, Apr 30, 2015 at 11:32 AM, Steven Rostedt <rost...@goodmis.org> wrote:
> On Thu, 30 Apr 2015 20:07:21 +0200
> Peter Zijlstra <pet...@infradead.org> wrote:
>> The irony, this is distinctly non deterministic code you're putting
>> under a RT specific preempt_disable ;-)
>
> I know :-(
>
> Unfortunately, a RT behaving fix would be much more invasive and would
> probably require the help of the xfs folks. For now, this just prevents
> a live lock that can happen and halt the system, where it becomes a
> deterministic catastrophe.
>
> -- Steve

Would it work to instead create a lock to replace the
preempt_enable_rt/preempt_disable_rt pair in XFS?
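For concreteness, one shape the suggestion could take, assuming the RT patch's
preempt_disable_rt() wraps the icsb lock-all-counters loop; this is an untested
sketch, and the m_icsb_gather_lock field is hypothetical, not part of any real
patch:

```c
/* Hypothetical sketch only: serialize walkers of the per-cpu counter
 * locks with a dedicated lock (which -rt converts to a sleeping
 * rt_mutex) instead of a bare preempt_disable_rt()/preempt_enable_rt()
 * around the loop. */
static void xfs_icsb_lock_all_counters(xfs_mount_t *mp)
{
	xfs_icsb_cnts_t	*cntp;
	int		i;

	spin_lock(&mp->m_icsb_gather_lock);	/* assumed new field */
	for_each_online_cpu(i) {
		cntp = (xfs_icsb_cnts_t *)per_cpu_ptr(mp->m_sb_cnts, i);
		xfs_icsb_lock_cntr(cntp);
	}
}

static void xfs_icsb_unlock_all_counters(xfs_mount_t *mp)
{
	xfs_icsb_cnts_t	*cntp;
	int		i;

	for_each_online_cpu(i) {
		cntp = (xfs_icsb_cnts_t *)per_cpu_ptr(mp->m_sb_cnts, i);
		xfs_icsb_unlock_cntr(cntp);
	}
	spin_unlock(&mp->m_icsb_gather_lock);
}
```

On -rt, spin_lock() is already a sleeping lock, so this would remove the
preemption-disabled region entirely; whether it is deadlock-safe against the
counter rebalance paths is exactly the kind of question the xfs folks would
need to answer.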
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] sched: fix RLIMIT_RTTIME when PI-boosting to RT

2015-03-05 Thread Austin Schuh
ping?

On Wed, Feb 18, 2015 at 4:23 PM, <br...@peloton-tech.com> wrote:
> From: Brian Silverman <br...@peloton-tech.com>
>
> When non-realtime tasks get priority-inheritance boosted to a realtime
> scheduling class, RLIMIT_RTTIME starts to apply to them. However, the
> counter used for checking this (the same one used for SCHED_RR
> timeslices) was not getting reset. This meant that tasks running with a
> non-realtime scheduling class which are repeatedly boosted to a realtime
> one, but never block while they are running realtime, eventually hit the
> timeout without ever running for a time over the limit. This patch
> resets the realtime timeslice counter when un-PI-boosting from an RT to
> a non-RT scheduling class.
>
> I have some test code with two threads and a shared PTHREAD_PRIO_INHERIT
> mutex which induces priority boosting and spins while boosted that gets
> killed by a SIGXCPU on non-fixed kernels but doesn't with this patch
> applied. It happens much faster with a CONFIG_PREEMPT_RT kernel, and
> does happen eventually with PREEMPT_VOLUNTARY kernels.
>
> Signed-off-by: Brian Silverman <br...@peloton-tech.com>
> ---
> I am not subscribed to the list so please CC me on any responses.
>
>  kernel/sched/core.c |2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 87b9814..16ad0ed 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3192,6 +3192,8 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
> } else {
> if (dl_prio(oldprio))
> p->dl.dl_boosted = 0;
> +   if (rt_prio(oldprio))
> +   p->rt.timeout = 0;
> p->sched_class = &fair_sched_class;
> }
>
> --
> 1.7.10.4
>


Re: Filesystem lockup with CONFIG_PREEMPT_RT

2014-07-05 Thread Austin Schuh
On Sat, Jul 5, 2014 at 1:26 PM, Thomas Gleixner <t...@linutronix.de> wrote:
> On Mon, 30 Jun 2014, Austin Schuh wrote:
>> I think I might have an answer for my own question, but I would
>> appreciate someone else to double check.  If list_empty erroneously
>> returns that there is work to do when there isn't work to do, we wake
>> up an extra worker which then goes back to sleep.  Not a big loss.  If
>> list_empty erroneously returns that there isn't work to do when there
>> is, this should only be because someone is modifying the work list.
>> When they finish, as far as I can tell, all callers then check to see
>> if a worker needs to be started up, and start one.
>
> Precisely.

A comment there when you put together a polished patch for inclusion
would be awesome.  I'm assuming that you will put the patch together?


Re: Filesystem lockup with CONFIG_PREEMPT_RT

2014-07-03 Thread Austin Schuh
On Tue, Jul 1, 2014 at 12:32 PM, Austin Schuh <aus...@peloton-tech.com> wrote:
> On Mon, Jun 30, 2014 at 8:01 PM, Austin Schuh <aus...@peloton-tech.com> wrote:
>> On Fri, Jun 27, 2014 at 7:24 AM, Thomas Gleixner <t...@linutronix.de> wrote:
>>> Completely untested patch below.

I've tested it and looked it over now, and feel pretty confident in
the patch.  Thanks Thomas!

I've been running this fix with my extra lock steps for a couple days
now on 3 machines, and haven't seen any issues.  On each of the
machines, I've got a CAN card generating about 1 CPU worth of load
from interrupts, and the rest of the cores are tied up putting work on
the work queues with the killer_module.

What next?

Austin


Re: Filesystem lockup with CONFIG_PREEMPT_RT

2014-07-01 Thread Austin Schuh
On Mon, Jun 30, 2014 at 8:01 PM, Austin Schuh <aus...@peloton-tech.com> wrote:
> On Fri, Jun 27, 2014 at 7:24 AM, Thomas Gleixner <t...@linutronix.de> wrote:
>> Completely untested patch below.
>
> By chance, I found this in my boot logs.  I'll do some more startup
> testing tomorrow.
>
> Jun 30 19:54:40 vpc5 kernel: [0.670955] [ cut here 
> ]
> Jun 30 19:54:40 vpc5 kernel: [0.670962] WARNING: CPU: 0 PID: 4 at
> kernel/workqueue.c:1604 worker_enter_idle+0x65/0x16b()
> Jun 30 19:54:40 vpc5 kernel: [0.670970] Modules linked in:
> Jun 30 19:54:40 vpc5 kernel: [0.670973] CPU: 0 PID: 4 Comm:
> kworker/0:0 Not tainted 3.14.3-rt4abs+ #8
> Jun 30 19:54:40 vpc5 kernel: [0.670974] Hardware name: CompuLab
> Intense-PC/Intense-PC, BIOS CR_2.2.0.377 X64 04/10/2013
> Jun 30 19:54:40 vpc5 kernel: [0.670983]  0009
> 88040ce75de8 81510faf 0002
> Jun 30 19:54:40 vpc5 kernel: [0.670985]  
> 88040ce75e28 81042085 0001
> Jun 30 19:54:40 vpc5 kernel: [0.670987]  81057a60
> 88042d406900 88042da63fc0 88042da64030
> Jun 30 19:54:40 vpc5 kernel: [0.670988] Call Trace:
> Jun 30 19:54:40 vpc5 kernel: [0.670995]  [81510faf]
> dump_stack+0x4f/0x7c
> Jun 30 19:54:40 vpc5 kernel: [0.670999]  [81042085]
> warn_slowpath_common+0x81/0x9c
> Jun 30 19:54:40 vpc5 kernel: [0.671002]  [81057a60] ?
> worker_enter_idle+0x65/0x16b
> Jun 30 19:54:40 vpc5 kernel: [0.671005]  [810420ba]
> warn_slowpath_null+0x1a/0x1c
> Jun 30 19:54:40 vpc5 kernel: [0.671007]  [81057a60]
> worker_enter_idle+0x65/0x16b
> Jun 30 19:54:40 vpc5 kernel: [0.671010]  [8105a0a9]
> worker_thread+0x1b3/0x22b
> Jun 30 19:54:40 vpc5 kernel: [0.671013]  [81059ef6] ?
> rescuer_thread+0x293/0x293
> Jun 30 19:54:40 vpc5 kernel: [0.671015]  [81059ef6] ?
> rescuer_thread+0x293/0x293
> Jun 30 19:54:40 vpc5 kernel: [0.671018]  [8105f7ab]
> kthread+0xdc/0xe4
> Jun 30 19:54:40 vpc5 kernel: [0.671022]  [8105f6cf] ?
> flush_kthread_worker+0xe1/0xe1
> Jun 30 19:54:40 vpc5 kernel: [0.671025]  [8151586c]
> ret_from_fork+0x7c/0xb0
> Jun 30 19:54:40 vpc5 kernel: [0.671027]  [8105f6cf] ?
> flush_kthread_worker+0xe1/0xe1
> Jun 30 19:54:40 vpc5 kernel: [0.671029] ---[ end trace 0001 
> ]---

Bug in my extra locking...  Sorry for the noise.  The second diff is a
cleaner way of destroying the workers.

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 8900da8..590cc26 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1567,10 +1602,16 @@ static void worker_enter_idle(struct worker *worker)
 {
struct worker_pool *pool = worker->pool;

-   if (WARN_ON_ONCE(worker->flags & WORKER_IDLE) ||
-   WARN_ON_ONCE(!list_empty(&worker->entry) &&
-(worker->hentry.next || worker->hentry.pprev)))
+   if (WARN_ON_ONCE(worker->flags & WORKER_IDLE)) return;
+
+   rt_lock_idle_list(pool);
+   if (WARN_ON_ONCE(!list_empty(&worker->entry) &&
+(worker->hentry.next || worker->hentry.pprev))) {
+   rt_unlock_idle_list(pool);
return;
+   } else {
+   rt_unlock_idle_list(pool);
+   }

/* can't use worker_set_flags(), also called from start_worker() */
worker->flags |= WORKER_IDLE;
@@ -3584,8 +3637,14 @@ static void put_unbound_pool(struct worker_pool *pool)
mutex_lock(&pool->manager_mutex);
spin_lock_irq(&pool->lock);

-   while ((worker = first_worker(pool)))
+   rt_lock_idle_list(pool);
+   while ((worker = first_worker(pool))) {
+   rt_unlock_idle_list(pool);
destroy_worker(worker);
+   rt_lock_idle_list(pool);
+   }
+   rt_unlock_idle_list(pool);
+
WARN_ON(pool->nr_workers || pool->nr_idle);

spin_unlock_irq(&pool->lock);


Re: Filesystem lockup with CONFIG_PREEMPT_RT

2014-06-30 Thread Austin Schuh
On Fri, Jun 27, 2014 at 7:24 AM, Thomas Gleixner <t...@linutronix.de> wrote:
> Completely untested patch below.

By chance, I found this in my boot logs.  I'll do some more startup
testing tomorrow.

Jun 30 19:54:40 vpc5 kernel: [0.670955] [ cut here ]
Jun 30 19:54:40 vpc5 kernel: [0.670962] WARNING: CPU: 0 PID: 4 at
kernel/workqueue.c:1604 worker_enter_idle+0x65/0x16b()
Jun 30 19:54:40 vpc5 kernel: [0.670970] Modules linked in:
Jun 30 19:54:40 vpc5 kernel: [0.670973] CPU: 0 PID: 4 Comm:
kworker/0:0 Not tainted 3.14.3-rt4abs+ #8
Jun 30 19:54:40 vpc5 kernel: [0.670974] Hardware name: CompuLab
Intense-PC/Intense-PC, BIOS CR_2.2.0.377 X64 04/10/2013
Jun 30 19:54:40 vpc5 kernel: [0.670983]  0009
88040ce75de8 81510faf 0002
Jun 30 19:54:40 vpc5 kernel: [0.670985]  
88040ce75e28 81042085 0001
Jun 30 19:54:40 vpc5 kernel: [0.670987]  81057a60
88042d406900 88042da63fc0 88042da64030
Jun 30 19:54:40 vpc5 kernel: [0.670988] Call Trace:
Jun 30 19:54:40 vpc5 kernel: [0.670995]  [81510faf]
dump_stack+0x4f/0x7c
Jun 30 19:54:40 vpc5 kernel: [0.670999]  [81042085]
warn_slowpath_common+0x81/0x9c
Jun 30 19:54:40 vpc5 kernel: [0.671002]  [81057a60] ?
worker_enter_idle+0x65/0x16b
Jun 30 19:54:40 vpc5 kernel: [0.671005]  [810420ba]
warn_slowpath_null+0x1a/0x1c
Jun 30 19:54:40 vpc5 kernel: [0.671007]  [81057a60]
worker_enter_idle+0x65/0x16b
Jun 30 19:54:40 vpc5 kernel: [0.671010]  [8105a0a9]
worker_thread+0x1b3/0x22b
Jun 30 19:54:40 vpc5 kernel: [0.671013]  [81059ef6] ?
rescuer_thread+0x293/0x293
Jun 30 19:54:40 vpc5 kernel: [0.671015]  [81059ef6] ?
rescuer_thread+0x293/0x293
Jun 30 19:54:40 vpc5 kernel: [0.671018]  [8105f7ab]
kthread+0xdc/0xe4
Jun 30 19:54:40 vpc5 kernel: [0.671022]  [8105f6cf] ?
flush_kthread_worker+0xe1/0xe1
Jun 30 19:54:40 vpc5 kernel: [0.671025]  [8151586c]
ret_from_fork+0x7c/0xb0
Jun 30 19:54:40 vpc5 kernel: [0.671027]  [8105f6cf] ?
flush_kthread_worker+0xe1/0xe1
Jun 30 19:54:40 vpc5 kernel: [0.671029] ---[ end trace 0001 ]---


Re: Filesystem lockup with CONFIG_PREEMPT_RT

2014-06-30 Thread Austin Schuh
On Mon, Jun 30, 2014 at 5:12 PM, Austin Schuh <aus...@peloton-tech.com> wrote:
> On Fri, Jun 27, 2014 at 7:24 AM, Thomas Gleixner <t...@linutronix.de> wrote:
>> On Thu, 26 Jun 2014, Austin Schuh wrote:
>>> If I'm reading the rt patch correctly, wq_worker_sleeping was moved
>>> out of __schedule to sched_submit_work.  It looks like that changes
>>> the conditions under which wq_worker_sleeping is called.  It used to
>>> be called whenever a task was going to sleep (I think).  It looks like
>>> it is called now if the task is going to sleep, and if the task isn't
>>> blocked on a PI mutex (I think).
>>>
>>> If I try the following experiment
>>>
>>>  static inline void sched_submit_work(struct task_struct *tsk)
>>>  {
>>> +   if (tsk->state && tsk->flags & PF_WQ_WORKER) {
>>> + wq_worker_sleeping(tsk);
>>> + return;
>>> +   }
>>>
>>> and then remove the call later in the function, I am able to pass my test.
>>>
>>> Unfortunately, I then get a recursive pool spinlock BUG_ON after a
>>> while (as I would expect), and it all blows up.
>>
>> Well, that's why the check for !pi_blocked_on is there.
>>
>>> I'm not sure where to go from there.  Any changes to the workpool to
>>> try to fix that will be hard, or could affect latency significantly.
>>
>> Completely untested patch below.
>>
>> Thanks,
>>
>> tglx
>>
>> ->
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 8749d20..621329a 100644
>> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
>> index be0ef50..0ca9d97 100644
>> --- a/kernel/workqueue.c
>> +++ b/kernel/workqueue.c
>> @@ -849,25 +882,18 @@ void wq_worker_sleeping(struct task_struct *task)
>> return;
>>
>> worker->sleeping = 1;
>> -   spin_lock_irq(&pool->lock);
>> +
>> /*
>>  * The counterpart of the following dec_and_test, implied mb,
>>  * worklist not empty test sequence is in insert_work().
>>  * Please read comment there.
>> -*
>> -* NOT_RUNNING is clear.  This means that we're bound to and
>> -* running on the local cpu w/ rq lock held and preemption
>> -* disabled, which in turn means that none else could be
>> -* manipulating idle_list, so dereferencing idle_list without pool
>> -* lock is safe.
>>  */
>> if (atomic_dec_and_test(&pool->nr_running) &&
>> !list_empty(&pool->worklist)) {
>
> What guarantees that this pool->worklist access is safe?  Preemption
> isn't disabled.

I think I might have an answer for my own question, but I would
appreciate someone else to double check.  If list_empty erroneously
returns that there is work to do when there isn't work to do, we wake
up an extra worker which then goes back to sleep.  Not a big loss.  If
list_empty erroneously returns that there isn't work to do when there
is, this should only be because someone is modifying the work list.
When they finish, as far as I can tell, all callers then check to see
if a worker needs to be started up, and start one.

Austin


Re: Filesystem lockup with CONFIG_PREEMPT_RT

2014-06-30 Thread Austin Schuh
On Fri, Jun 27, 2014 at 7:24 AM, Thomas Gleixner <t...@linutronix.de> wrote:
> On Thu, 26 Jun 2014, Austin Schuh wrote:
>> If I'm reading the rt patch correctly, wq_worker_sleeping was moved
>> out of __schedule to sched_submit_work.  It looks like that changes
>> the conditions under which wq_worker_sleeping is called.  It used to
>> be called whenever a task was going to sleep (I think).  It looks like
>> it is called now if the task is going to sleep, and if the task isn't
>> blocked on a PI mutex (I think).
>>
>> If I try the following experiment
>>
>>  static inline void sched_submit_work(struct task_struct *tsk)
>>  {
>> +   if (tsk->state && tsk->flags & PF_WQ_WORKER) {
>> + wq_worker_sleeping(tsk);
>> + return;
>> +   }
>>
>> and then remove the call later in the function, I am able to pass my test.
>>
>> Unfortunately, I then get a recursive pool spinlock BUG_ON after a
>> while (as I would expect), and it all blows up.
>
> Well, that's why the check for !pi_blocked_on is there.
>
>> I'm not sure where to go from there.  Any changes to the workpool to
>> try to fix that will be hard, or could affect latency significantly.
>
> Completely untested patch below.
>
> Thanks,
>
> tglx
>
> ->
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 8749d20..621329a 100644
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index be0ef50..0ca9d97 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -849,25 +882,18 @@ void wq_worker_sleeping(struct task_struct *task)
> return;
>
> worker->sleeping = 1;
> -   spin_lock_irq(&pool->lock);
> +
> /*
>  * The counterpart of the following dec_and_test, implied mb,
>  * worklist not empty test sequence is in insert_work().
>  * Please read comment there.
> -*
> -* NOT_RUNNING is clear.  This means that we're bound to and
> -* running on the local cpu w/ rq lock held and preemption
> -* disabled, which in turn means that none else could be
> -* manipulating idle_list, so dereferencing idle_list without pool
> -* lock is safe.
>  */
> if (atomic_dec_and_test(&pool->nr_running) &&
> !list_empty(&pool->worklist)) {

What guarantees that this pool->worklist access is safe?  Preemption
isn't disabled.

I'm seeing some inconsistency when accessing the idle list.  You seem
to only disable preemption when writing to the idle list?  I'm unsure
if I'm missing something, or what.  With that question in mind, I took
a stab at adding locking around all the idle_list calls.

I went through and double checked to make sure that rt_lock_idle_list
and rt_unlock_idle_list surround all idle_list or entry accesses.  The
following are places where I think you should be locking, but aren't.

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 8900da8..314a098 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -778,8 +805,12 @@ static bool too_many_workers(struct worker_pool *pool)
  * nr_idle and idle_list may disagree if idle rebinding is in
  * progress.  Never return %true if idle_list is empty.
  */
- if (list_empty(&pool->idle_list))
+ rt_lock_idle_list(pool);
+ if (list_empty(&pool->idle_list)) {
+ rt_unlock_idle_list(pool);
  return false;
+ }
+ rt_unlock_idle_list(pool);

  return nr_idle > 2 && (nr_idle - 2) * MAX_IDLE_WORKERS_RATIO >= nr_busy;
 }
@@ -788,7 +819,7 @@ static bool too_many_workers(struct worker_pool *pool)
  * Wake up functions.
  */

-/* Return the first worker.  Safe with preemption disabled */
+/* Return the first worker.  Preemption must be disabled */
 static struct worker *first_worker(struct worker_pool *pool)
 {
  if (unlikely(list_empty(&pool->idle_list)))
@@ -1567,10 +1598,16 @@ static void worker_enter_idle(struct worker *worker)
 {
  struct worker_pool *pool = worker->pool;

- if (WARN_ON_ONCE(worker->flags & WORKER_IDLE) ||
 -WARN_ON_ONCE(!list_empty(&worker->entry) &&
- (worker->hentry.next || worker->hentry.pprev)))
- return;
+ if (WARN_ON_ONCE(worker->flags & WORKER_IDLE)) return;
+
+ rt_lock_idle_list(pool);
+ if (WARN_ON_ONCE(!list_empty(&worker->entry))) {
+ rt_unlock_idle_list(pool);
+ if (worker->hentry.next || worker->hentry.pprev)
+ return;
+ } else {
+ rt_unlock_idle_list(pool);
+ }

  /* can't use worker_set_flags(), also called from start_worker() */
  worker->flags |= WORKER_IDLE;
@@ -1882,7 +1925,9 @@ static void idle_worker_timeout(unsigned long __pool)
  unsigned long expires;

  /* idle_list is kept in LIFO order, check the last one */
+ rt_lock_idle_list(pool);
  worker = list_entry(pool->idle_list.prev, struct worker, entry);
+ rt_unlock_idle_list(pool);
  expires = worker->last_active + IDLE_WORKER_TIMEOUT;

  if (time_before(jiffies, expires))
@@ -2026,7 +2071,9 @@ static bool maybe_destroy_workers(struct worker_pool *pool)
  struct worker *worker;
  unsigned long expires;

+ rt_lock_idle_list(pool);
  worker = list_entry(pool->idle_list.prev, struct worker, entry);
+ rt_unlock_idle_list(pool);
  expires = worker->last_active + IDLE_WORKER_TIMEOUT;

Re: Filesystem lockup with CONFIG_PREEMPT_RT

2014-06-30 Thread Austin Schuh
On Fri, Jun 27, 2014 at 7:24 AM, Thomas Gleixner t...@linutronix.de wrote:
 On Thu, 26 Jun 2014, Austin Schuh wrote:
 If I'm reading the rt patch correctly, wq_worker_sleeping was moved
 out of __schedule to sched_submit_work.  It looks like that changes
 the conditions under which wq_worker_sleeping is called.  It used to
 be called whenever a task was going to sleep (I think).  It looks like
 it is called now if the task is going to sleep, and if the task isn't
 blocked on a PI mutex (I think).

 If I try the following experiment

  static inline void sched_submit_work(struct task_struct *tsk)
  {
 +   if (tsk-state  tsk-flags  PF_WQ_WORKER) {
 + wq_worker_sleeping(tsk);
 + return;
 +   }

 and then remove the call later in the function, I am able to pass my test.

 Unfortunately, I then get a recursive pool spinlock BUG_ON after a
 while (as I would expect), and it all blows up.

 Well, that's why the check for !pi_blocked_on is there.

 I'm not sure where to go from there.  Any changes to the workpool to
 try to fix that will be hard, or could affect latency significantly.

 Completely untested patch below.

 Thanks,

 tglx

 -
 diff --git a/kernel/sched/core.c b/kernel/sched/core.c
 index 8749d20..621329a 100644
 diff --git a/kernel/workqueue.c b/kernel/workqueue.c
 index be0ef50..0ca9d97 100644
 --- a/kernel/workqueue.c
 +++ b/kernel/workqueue.c
 @@ -849,25 +882,18 @@ void wq_worker_sleeping(struct task_struct *task)
 return;

 worker-sleeping = 1;
 -   spin_lock_irq(pool-lock);
 +
 /*
  * The counterpart of the following dec_and_test, implied mb,
  * worklist not empty test sequence is in insert_work().
  * Please read comment there.
 -*
 -* NOT_RUNNING is clear.  This means that we're bound to and
 -* running on the local cpu w/ rq lock held and preemption
 -* disabled, which in turn means that none else could be
 -* manipulating idle_list, so dereferencing idle_list without pool
 -* lock is safe.
  */
 if (atomic_dec_and_test(pool-nr_running) 
 !list_empty(pool-worklist)) {

What guarantees that this pool-worklist access is safe?  Preemption
isn't disabled.

I'm seeing some inconsistency when accessing the idle list.  You seem
to only disable preemption when writing to the idle list?  I'm unsure
if I'm missing something, or what.  With that question in mind, I took
a stab at adding locking around all the idle_list calls.

I went through and double checked to make sure that rt_lock_idle_list
and rt_unlock_idle_list surround all idle_list or entry accesses.  The
following are places where I think you should be locking, but aren't.

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 8900da8..314a098 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -778,8 +805,12 @@ static bool too_many_workers(struct worker_pool *pool)
  * nr_idle and idle_list may disagree if idle rebinding is in
  * progress.  Never return %true if idle_list is empty.
  */
- if (list_empty(pool-idle_list))
+ rt_lock_idle_list(pool);
+ if (list_empty(pool-idle_list)) {
+ rt_unlock_idle_list(pool);
  return false;
+ }
+ rt_unlock_idle_list(pool);

  return nr_idle  2  (nr_idle - 2) * MAX_IDLE_WORKERS_RATIO = nr_busy;
 }
@@ -788,7 +819,7 @@ static bool too_many_workers(struct worker_pool *pool)
  * Wake up functions.
  */

-/* Return the first worker.  Safe with preemption disabled */
+/* Return the first worker.  Preemption must be disabled */
 static struct worker *first_worker(struct worker_pool *pool)
 {
  if (unlikely(list_empty(pool-idle_list)))
@@ -1567,10 +1598,16 @@ static void worker_enter_idle(struct worker *worker)
 {
  struct worker_pool *pool = worker-pool;

- if (WARN_ON_ONCE(worker-flags  WORKER_IDLE) ||
-WARN_ON_ONCE(!list_empty(worker-entry) 
- (worker-hentry.next || worker-hentry.pprev)))
- return;
+ if (WARN_ON_ONCE(worker-flags  WORKER_IDLE)) return;
+
+ rt_lock_idle_list(pool);
+ if (WARN_ON_ONCE(!list_empty(worker-entry))) {
+ rt_unlock_idle_list(pool);
+ if (worker-hentry.next || worker-hentry.pprev)
+ return;
+ } else {
+ rt_unlock_idle_list(pool);
+ }

  /* can't use worker_set_flags(), also called from start_worker() */
  worker-flags |= WORKER_IDLE;
@@ -1882,7 +1925,9 @@ static void idle_worker_timeout(unsigned long __pool)
  unsigned long expires;

  /* idle_list is kept in LIFO order, check the last one */
+ rt_lock_idle_list(pool);
  worker = list_entry(pool-idle_list.prev, struct worker, entry);
+ rt_unlock_idle_list(pool);
  expires = worker-last_active + IDLE_WORKER_TIMEOUT;

  if (time_before(jiffies, expires))
@@ -2026,7 +2071,9 @@ static bool maybe_destroy_workers(struct
worker_pool *pool)
  struct worker *worker;
  unsigned long expires;

+ rt_lock_idle_list(pool);
  worker = list_entry(pool->idle_list.prev, struct worker, entry);
+ rt_unlock_idle_list(pool);
  expires = worker->last_active

Re: Filesystem lockup with CONFIG_PREEMPT_RT

2014-06-30 Thread Austin Schuh
On Mon, Jun 30, 2014 at 5:12 PM, Austin Schuh aus...@peloton-tech.com wrote:
> On Fri, Jun 27, 2014 at 7:24 AM, Thomas Gleixner t...@linutronix.de wrote:
>> On Thu, 26 Jun 2014, Austin Schuh wrote:
>>> If I'm reading the rt patch correctly, wq_worker_sleeping was moved
>>> out of __schedule to sched_submit_work.  It looks like that changes
>>> the conditions under which wq_worker_sleeping is called.  It used to
>>> be called whenever a task was going to sleep (I think).  It looks like
>>> it is called now if the task is going to sleep, and if the task isn't
>>> blocked on a PI mutex (I think).
>>>
>>> If I try the following experiment
>>>
>>>  static inline void sched_submit_work(struct task_struct *tsk)
>>>  {
>>> +   if (tsk->state && tsk->flags & PF_WQ_WORKER) {
>>> + wq_worker_sleeping(tsk);
>>> + return;
>>> +   }
>>>
>>> and then remove the call later in the function, I am able to pass my test.
>>>
>>> Unfortunately, I then get a recursive pool spinlock BUG_ON after a
>>> while (as I would expect), and it all blows up.
>>
>> Well, that's why the check for !pi_blocked_on is there.
>>
>>> I'm not sure where to go from there.  Any changes to the workpool to
>>> try to fix that will be hard, or could affect latency significantly.
>>
>> Completely untested patch below.
>>
>> Thanks,
>>
>> tglx
>>
>> -
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 8749d20..621329a 100644
>> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
>> index be0ef50..0ca9d97 100644
>> --- a/kernel/workqueue.c
>> +++ b/kernel/workqueue.c
>> @@ -849,25 +882,18 @@ void wq_worker_sleeping(struct task_struct *task)
>>  return;
>>
>>  worker->sleeping = 1;
>> -   spin_lock_irq(&pool->lock);
>> +
>>  /*
>>   * The counterpart of the following dec_and_test, implied mb,
>>   * worklist not empty test sequence is in insert_work().
>>   * Please read comment there.
>> - *
>> - * NOT_RUNNING is clear.  This means that we're bound to and
>> - * running on the local cpu w/ rq lock held and preemption
>> - * disabled, which in turn means that none else could be
>> - * manipulating idle_list, so dereferencing idle_list without pool
>> - * lock is safe.
>>   */
>>  if (atomic_dec_and_test(&pool->nr_running) &&
>>  !list_empty(&pool->worklist)) {
>
> What guarantees that this pool->worklist access is safe?  Preemption
> isn't disabled.

I think I might have an answer for my own question, but I would
appreciate someone else to double check.  If list_empty erroneously
returns that there is work to do when there isn't work to do, we wake
up an extra worker which then goes back to sleep.  Not a big loss.  If
list_empty erroneously returns that there isn't work to do when there
is, this should only be because someone is modifying the work list.
When they finish, as far as I can tell, all callers then check to see
if a worker needs to be started up, and start one.

Austin
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Filesystem lockup with CONFIG_PREEMPT_RT

2014-06-30 Thread Austin Schuh
On Fri, Jun 27, 2014 at 7:24 AM, Thomas Gleixner t...@linutronix.de wrote:
> Completely untested patch below.

By chance, I found this in my boot logs.  I'll do some more startup
testing tomorrow.

Jun 30 19:54:40 vpc5 kernel: [0.670955] [ cut here ]
Jun 30 19:54:40 vpc5 kernel: [0.670962] WARNING: CPU: 0 PID: 4 at
kernel/workqueue.c:1604 worker_enter_idle+0x65/0x16b()
Jun 30 19:54:40 vpc5 kernel: [0.670970] Modules linked in:
Jun 30 19:54:40 vpc5 kernel: [0.670973] CPU: 0 PID: 4 Comm:
kworker/0:0 Not tainted 3.14.3-rt4abs+ #8
Jun 30 19:54:40 vpc5 kernel: [0.670974] Hardware name: CompuLab
Intense-PC/Intense-PC, BIOS CR_2.2.0.377 X64 04/10/2013
Jun 30 19:54:40 vpc5 kernel: [0.670983]  0009
88040ce75de8 81510faf 0002
Jun 30 19:54:40 vpc5 kernel: [0.670985]  
88040ce75e28 81042085 0001
Jun 30 19:54:40 vpc5 kernel: [0.670987]  81057a60
88042d406900 88042da63fc0 88042da64030
Jun 30 19:54:40 vpc5 kernel: [0.670988] Call Trace:
Jun 30 19:54:40 vpc5 kernel: [0.670995]  [81510faf]
dump_stack+0x4f/0x7c
Jun 30 19:54:40 vpc5 kernel: [0.670999]  [81042085]
warn_slowpath_common+0x81/0x9c
Jun 30 19:54:40 vpc5 kernel: [0.671002]  [81057a60] ?
worker_enter_idle+0x65/0x16b
Jun 30 19:54:40 vpc5 kernel: [0.671005]  [810420ba]
warn_slowpath_null+0x1a/0x1c
Jun 30 19:54:40 vpc5 kernel: [0.671007]  [81057a60]
worker_enter_idle+0x65/0x16b
Jun 30 19:54:40 vpc5 kernel: [0.671010]  [8105a0a9]
worker_thread+0x1b3/0x22b
Jun 30 19:54:40 vpc5 kernel: [0.671013]  [81059ef6] ?
rescuer_thread+0x293/0x293
Jun 30 19:54:40 vpc5 kernel: [0.671015]  [81059ef6] ?
rescuer_thread+0x293/0x293
Jun 30 19:54:40 vpc5 kernel: [0.671018]  [8105f7ab]
kthread+0xdc/0xe4
Jun 30 19:54:40 vpc5 kernel: [0.671022]  [8105f6cf] ?
flush_kthread_worker+0xe1/0xe1
Jun 30 19:54:40 vpc5 kernel: [0.671025]  [8151586c]
ret_from_fork+0x7c/0xb0
Jun 30 19:54:40 vpc5 kernel: [0.671027]  [8105f6cf] ?
flush_kthread_worker+0xe1/0xe1
Jun 30 19:54:40 vpc5 kernel: [0.671029] ---[ end trace 0001 ]---
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Filesystem lockup with CONFIG_PREEMPT_RT

2014-06-28 Thread Austin Schuh
On Fri, Jun 27, 2014 at 8:32 PM, Mike Galbraith
 wrote:
> On Fri, 2014-06-27 at 18:18 -0700, Austin Schuh wrote:
>
>> It would be more context switches, but I wonder if we could kick the
>> workqueue logic completely out of the scheduler into a thread.  Have
>> the scheduler increment/decrement an atomic pool counter, and wake up
>> the monitoring thread to spawn new threads when needed?  That would
>> get rid of the recursive pool lock problem, and should reduce
>> scheduler latency if we would need to spawn a new thread.
>
> I was wondering the same thing, and not only for workqueue, but also the
> plug pulling.  It's kind of a wart to have that stuff sitting in the
> hear of the scheduler in the first place, would be nice if it just went
> away.  When a task can't help itself, you _could_ wake a proxy do that
> for you.  Trouble is, I can imagine that being a heck of a lot of
> context switches with some loads.. and who's gonna help the helper when
> he blocks while trying to help?
>
> -Mike

For workqueues, as long as the helper doesn't block on a lock which
requires the work queue to be freed up, it will eventually become
unblocked and make progress.  The helper _should_ only need the pool
lock, which will wake the helper back up when it is available again.
Nothing should go to sleep in an un-recoverable way with the work pool
lock held.

To drop the extra context switch, you could have a minimum of 2 worker
threads around at all times, and have the management thread start the
work and delegate to the next management thread.  That thread would
then wait for the first thread to block, spawn a new thread, and then
start the next piece of work.  Definitely a bit more complicated.

Austin
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Filesystem lockup with CONFIG_PREEMPT_RT

2014-06-27 Thread Austin Schuh
On Fri, Jun 27, 2014 at 11:19 AM, Steven Rostedt  wrote:
> On Fri, 27 Jun 2014 20:07:54 +0200
> Mike Galbraith  wrote:
>
>> > Why do we need the wakeup? the owner of the lock should wake it up
>> > shouldn't it?
>>
>> True, but that can take ages.
>
> Can it? If the workqueue is of some higher priority, it should boost
> the process that owns the lock. Otherwise it just waits like anything
> else does.
>
> I much rather keep the paradigm of the mainline kernel than to add a
> bunch of hacks that can cause more unforeseen side effects that may
> cause other issues.
>
> Remember, this would only be for spinlocks converted into a rtmutex,
> not for normal mutex or other sleeps. In mainline, the wake up still
> would not happen so why are we waking it up here?
>
> This seems similar to the BKL crap we had to deal with as well. If we
> were going to sleep because we were blocked on a spinlock converted
> rtmutex we could not release and retake the BKL because we would end up
> blocked on two locks. Instead, we made sure that the spinlock would not
> release or take the BKL. It kept with the paradigm of mainline and
> worked. Sucked, but it worked.
>
> -- Steve

Sounds like you are arguing that we should disable preemption (or
whatever the right mechanism is) while holding the pool lock?

Workqueues spin up more threads when work that they are executing
blocks.  This is done through hooks in the scheduler.  This means that
we have to acquire the pool lock when work blocks on a lock in order
to see if there is more work and whether or not we need to spin up a
new thread.

It would be more context switches, but I wonder if we could kick the
workqueue logic completely out of the scheduler into a thread.  Have
the scheduler increment/decrement an atomic pool counter, and wake up
the monitoring thread to spawn new threads when needed?  That would
get rid of the recursive pool lock problem, and should reduce
scheduler latency if we would need to spawn a new thread.

Austin
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Filesystem lockup with CONFIG_PREEMPT_RT

2014-06-26 Thread Austin Schuh
On Thu, Jun 26, 2014 at 3:35 PM, Thomas Gleixner  wrote:
> On Thu, 26 Jun 2014, Austin Schuh wrote:
>> On Wed, May 21, 2014 at 12:33 AM, Richard Weinberger
>>  wrote:
>> > CC'ing RT folks
>> >
>> > On Wed, May 21, 2014 at 8:23 AM, Austin Schuh  
>> > wrote:
>> >> On Tue, May 13, 2014 at 7:29 PM, Austin Schuh  
>> >> wrote:
>> >>> Hi,
>> >>>
>> >>> I am observing a filesystem lockup with XFS on a CONFIG_PREEMPT_RT
>> >>> patched kernel.  I have currently only triggered it using dpkg.  Dave
>> >>> Chinner on the XFS mailing list suggested that it was a rt-kernel
>> >>> workqueue issue as opposed to a XFS problem after looking at the
>> >>> kernel messages.
>>
>> I've got a 100% reproducible test case that doesn't involve a
>> filesystem.  I wrote a module that triggers the bug when the device is
>> written to, making it easy to enable tracing during the event and
>> capture everything.
>>
>> It looks like rw_semaphores don't trigger wq_worker_sleeping to run
>> when work goes to sleep on a rw_semaphore.  This only happens with the
>> RT patches, not with the mainline kernel.  I'm foreseeing a second
>> deadlock/bug coming into play shortly.  If a task holding the work
>> pool spinlock gets preempted, and we need to schedule more work from
>> another worker thread which was just blocked by a mutex, we'll then
>> end up trying to go to sleep on 2 locks at once.
>
> I remember vaguely, that I've seen and analyzed that quite some time
> ago. I can't page in all the gory details right now, but I have a look
> how the related code changed in the last couple of years tomorrow
> morning with an awake brain.
>
> Steven, you did some analysis on that IIRC, or was that just related
> to rw_locks?
>
> Thanks,
>
> tglx

If I'm reading the rt patch correctly, wq_worker_sleeping was moved
out of __schedule to sched_submit_work.  It looks like that changes
the conditions under which wq_worker_sleeping is called.  It used to
be called whenever a task was going to sleep (I think).  It looks like
it is called now if the task is going to sleep, and if the task isn't
blocked on a PI mutex (I think).

If I try the following experiment

 static inline void sched_submit_work(struct task_struct *tsk)
 {
+   if (tsk->state && tsk->flags & PF_WQ_WORKER) {
+ wq_worker_sleeping(tsk);
+ return;
+   }

and then remove the call later in the function, I am able to pass my test.

Unfortunately, I then get a recursive pool spinlock BUG_ON after a
while (as I would expect), and it all blows up.

I'm not sure where to go from there.  Any changes to the workpool to
try to fix that will be hard, or could affect latency significantly.

Austin
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Filesystem lockup with CONFIG_PREEMPT_RT

2014-06-26 Thread Austin Schuh
On Wed, May 21, 2014 at 12:33 AM, Richard Weinberger
 wrote:
> CC'ing RT folks
>
> On Wed, May 21, 2014 at 8:23 AM, Austin Schuh  wrote:
>> On Tue, May 13, 2014 at 7:29 PM, Austin Schuh  
>> wrote:
>>> Hi,
>>>
>>> I am observing a filesystem lockup with XFS on a CONFIG_PREEMPT_RT
>>> patched kernel.  I have currently only triggered it using dpkg.  Dave
>>> Chinner on the XFS mailing list suggested that it was a rt-kernel
>>> workqueue issue as opposed to a XFS problem after looking at the
>>> kernel messages.

I've got a 100% reproducible test case that doesn't involve a
filesystem.  I wrote a module that triggers the bug when the device is
written to, making it easy to enable tracing during the event and
capture everything.

It looks like rw_semaphores don't trigger wq_worker_sleeping to run
when work goes to sleep on a rw_semaphore.  This only happens with the
RT patches, not with the mainline kernel.  I'm foreseeing a second
deadlock/bug coming into play shortly.  If a task holding the work
pool spinlock gets preempted, and we need to schedule more work from
another worker thread which was just blocked by a mutex, we'll then
end up trying to go to sleep on 2 locks at once.

That is getting a bit deep into the scheduler for me...  Any
suggestions on how to fix it?

Austin
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/fs.h>
#include <asm/uaccess.h>
#include <linux/semaphore.h>

static int device_open(struct inode *, struct file *);
static int device_release(struct inode *, struct file *);
static ssize_t device_read(struct file *, char *, size_t, loff_t *);
static ssize_t device_write(struct file *, const char *, size_t, loff_t *);

// Dev name as it appears in /proc/devices
#define DEVICE_NAME "aschuh"

// Major number assigned to our device driver
static int major;
static struct workqueue_struct *lockup_wq1;
static struct workqueue_struct *lockup_wq2;

static struct file_operations fops = {
  .read = device_read,
  .write = device_write,
  .open = device_open,
  .release = device_release
};

static int __init init_killer_module(void) {

  lockup_wq1 = alloc_workqueue("lockup_wq1", WQ_MEM_RECLAIM, 0);
  if (!lockup_wq1) return -ENOMEM;

  lockup_wq2 = alloc_workqueue("lockup_wq2", WQ_MEM_RECLAIM, 0);
  if (!lockup_wq2) {
destroy_workqueue(lockup_wq1);
return -ENOMEM;
  }

  major = register_chrdev(0, DEVICE_NAME, &fops);
  if (major < 0) {
printk(KERN_ALERT "Registering char device failed with %d\n", major);
destroy_workqueue(lockup_wq1);
destroy_workqueue(lockup_wq2);

return major;
  }

  printk(KERN_INFO "'mknod /dev/%s c %d 0'.\n", DEVICE_NAME, major);

  // A non 0 return means init_module failed; module can't be loaded.
  return 0;
}

// Called when a process tries to open the device file.
static int device_open(struct inode *inode, struct file *file) {
  try_module_get(THIS_MODULE);
  return 0;
}

// Called when a process closes the device file.
static int device_release(struct inode *inode, struct file *file) {
  // Decrement the usage count, or else once you opened the file, you'll never
  // get rid of the module.
  module_put(THIS_MODULE);

  return 0;
}

static ssize_t device_read(struct file *filp, char *buffer, size_t length,
   loff_t *offset) {
  return 0;
}

#if 0

#define SEM_INIT(sem) sema_init(sem, 1)
#define SEM_TYPE struct semaphore
#define SEM_DOWN(sem) down(sem)
#define SEM_UP(sem) up(sem)

#else

#define SEM_INIT(sem) init_rwsem(sem)
#define SEM_TYPE struct rw_semaphore
#define SEM_DOWN(sem) down_write_nested(sem, 0)
#define SEM_UP(sem) up_write(sem)

#endif

struct mywork {
  struct work_struct work;
  int index;
  SEM_TYPE *sem;
};

static void work1(struct work_struct *work) {
  struct mywork *my_work = container_of(work, struct mywork, work);
  trace_printk("work1 Called with index %d\n", my_work->index);
}

static void work2(struct work_struct *work) {
  struct mywork *my_work = container_of(work, struct mywork, work);
  trace_printk("work2 Called with index %d\n", my_work->index);
  SEM_DOWN(my_work->sem);
  SEM_UP(my_work->sem);
  trace_printk("work2 Finished with index %d\n", my_work->index);
}


static ssize_t device_write(struct file *filp, const char *buff, size_t len,
loff_t *off) {
  SEM_TYPE write_sem;
  SEM_INIT(&write_sem);
  
  struct mywork my_work1;
  struct mywork my_work2;
  trace_printk("lockup_wq1 %p lockup_wq2 %p\n", lockup_wq1, lockup_wq2);

  trace_printk("Got a write\n");

  SEM_DOWN(&write_sem);
  my_work1.index = len;
  my_work1.sem = &write_sem;
  INIT_WORK_ONSTACK(&my_work1.work, work1);

  my_work2.index = len;
  my_work2.sem = &write_sem;
  INIT_WORK_ONSTACK(&my_work2.work, work2);

  queue_work(lockup_wq2, &my_work2.work);

  queue_work(lockup_wq1, &my_work1.work);
  flush_work(&my_work1.work);
  destroy_work_on_stack(&my_work1.work);

  SEM_UP(&write_sem);

  flush_work(&my_work2.work);
  destroy_work_on_stack(&my_work2.work);



Re: On-stack work item completion race? (was Re: XFS crash?)

2014-06-25 Thread Austin Schuh
On Wed, Jun 25, 2014 at 7:00 AM, Tejun Heo  wrote:
>
> Hello,
>
> On Tue, Jun 24, 2014 at 08:05:07PM -0700, Austin Schuh wrote:
> > > I can see no reason why manual completion would behave differently
> > > from flush_work() in this case.
> >
> > I went looking for a short trace in my original log to show the problem,
> > and instead found evidence of the second problem.  I still like the shorter
> > flush_work call, but that's not my call.
>
> So, are you saying that the original issue you reported isn't actually
> a problem?  But didn't you imply that changing the waiting mechanism
> fixed a deadlock or was that a false positive?

Correct, that was a false positive.  Sorry for the noise.

> > I spent some more time debugging, and I am seeing that tsk_is_pi_blocked is
> > returning 1 in sched_submit_work (kernel/sched/core.c).  It looks
> > like sched_submit_work is not detecting that the worker task is blocked on
> > a mutex.
>
> The function unplugs the block layer and doesn't have much to do with
> workqueue although it has "_work" in its name.

Thomas moved
+ if (tsk->flags & PF_WQ_WORKER)
+   wq_worker_sleeping(tsk);
into sched_submit_work as part of the RT patchset.

> > This looks very RT related right now.  I see 2 problems from my reading
> > (and experimentation).  The first is that the second worker isn't getting
> > started because tsk_is_pi_blocked is reporting that the task isn't blocked
> > on a mutex.  The second is that even if another worker needs to be
> > scheduled because the original worker is blocked on a mutex, we need the
> > pool lock to schedule another worker.  The pool lock can be acquired by any
> > CPU, and is a spin_lock.  If we end up on the slow path for the pool lock,
> > we hit BUG_ON(rt_mutex_real_waiter(task->pi_blocked_on))
> > in task_blocks_on_rt_mutex in rtmutex.c.  I'm not sure how to deal with
> > either problem.
> >
> > Hopefully I've got all my facts right...  Debugging kernel code is a whole
> > new world from userspace code.
>
> I don't have much idea how RT kernel works either.  Can you reproduce
> the issues that you see on mainline?
>
> Thanks.
>
> --
> tejun

I'll see what I can do.

Thanks!
  Austin
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: On-stack work item completion race? (was Re: XFS crash?)

2014-06-24 Thread Austin Schuh
[Adding tglx to the cc.  Sorry for any double sends]

On Mon, Jun 23, 2014 at 8:25 PM, Tejun Heo  wrote:
> Hello,
>
> On Tue, Jun 24, 2014 at 01:02:40PM +1000, Dave Chinner wrote:
>> start_flush_work() is effectively a special queue_work()
>> implementation, so if if it's not safe to call complete() from the
>> workqueue as the above patch implies then this code has the same
>> problem.
>>
>> Tejun - is this "do it yourself completion" a known issue w.r.t.
>> workqueues? I can't find any documentation that says "don't do
>> that" so...?
>
> It's more complex than using flush_work() but there's nothing
> fundamentally wrong with it.  A work item is completely unlinked
> before its execution starts.  It's safe to free the work item once its
> work function started, whether through kfree() or returning.
>
> One difference between flush_work() and manual completion would be
> that if the work item gets requeued, flush_work() would wait for the
> queued one to finish but given the work item is one-shot this doesn't
> make any difference.
>
> I can see no reason why manual completion would behave differently
> from flush_work() in this case.

I went looking for a short trace in my original log to show the
problem, and instead found evidence of the second problem.  I still
like the shorter flush_work call, but that's not my call.

I did find this comment in the process_one_work function.  Sounds like
this could be better documented.

  /*
   * It is permissible to free the struct work_struct from
   * inside the function that is called from it, this we need to
   * take into account for lockdep too.  To avoid bogus "held
   * lock freed" warnings as well as problems when looking into
   * work->lockdep_map, make a copy and use that here.
   */

>> As I understand it, what then happens is that the workqueue code
>> grabs another kworker thread and runs the next work item in it's
>> queue. IOWs, work items can block, but doing that does not prevent
>> execution of other work items queued on other work queues or even on
>> the same work queue. Tejun, did I get that correct?
>
> Yes, as long as the workqueue is under its @max_active limit and has
> access to an existing kworker or can create a new one, it'll start
> executing the next work item immediately; however, the guaranteed
> level of concurrency is 1 even for WQ_RECLAIM workqueues.  IOW, the
> work items queued on a workqueue must be able to make forward progress
> with single work item if the work items are being depended upon for
> memory reclaim.

I mentioned this to Dave when I initially started this thread, but I'm
running a RT patched kernel.  I don't see forwards progress.  The
workqueue is only used in 1 spot (xfs_alloc_wq), and has
WQ_MEM_RECLAIM set.  I started with a 10,000,000 line trace and pulled
out the entries which referenced the workqueue and pool leading up to
the lockup.

 scp-4110  [001] 1.3   101.184640:
workqueue_queue_work: work struct=8803c782d900
function=xfs_bmapi_allocate_worker workqueue=88040c9f5a00
pool=88042dae3fc0 req_cpu=512 cpu=1
 scp-4110  [001] 1.3   101.184641:
workqueue_activate_work: work struct 8803c782d900
 kworker/1:1-89[001] 1.1   101.184883:
workqueue_nr_running: pool=88042dae3fc0 nr_running=1
 kworker/1:1-89[001] 1..   101.184885:
workqueue_execute_start: work struct 8803c782d900: function
xfs_bmapi_allocate_worker
 irq/44-ahci-275   [001] 1.5   101.185086:
workqueue_queue_work: work struct=8800ae3f01e0 function=xfs_end_io
workqueue=88040b459a00 pool=88042dae3fc0 req_cpu=512 cpu=1
 irq/44-ahci-275   [001] 1.5   101.185088:
workqueue_activate_work: work struct 8800ae3f01e0
 scp-4110  [001] 1..   101.187911: xfs_ilock: dev 8:5
ino 0xf9e flags ILOCK_EXCL caller xfs_iomap_write_allocate
 scp-4110  [001] 1.3   101.187914:
workqueue_queue_work: work struct=8803c782d900
function=xfs_bmapi_allocate_worker workqueue=88040c9f5a00
pool=88042dae3fc0 req_cpu=512 cpu=1
 scp-4110  [001] 1.3   101.187915:
workqueue_activate_work: work struct 8803c782d900
 scp-4110  [001] 1.4   101.187926:
workqueue_queue_work: work struct=88040a6a01c8
function=blk_delay_work workqueue=88040c9f4a00
pool=88042dae44c0 req_cpu=512 cpu=1
 scp-4110  [001] 1.4   101.187926:
workqueue_activate_work: work struct 88040a6a01c8
 kworker/1:1-89[001] 1..   101.187998:
workqueue_execute_end: work struct 8803c782d900
 kworker/1:1-89[001] 1..   101.188000:
workqueue_execute_start: work struct 8800ae3f01e0: function
xfs_end_io
 kworker/1:1-89[001] 1..   101.188001: xfs_ilock: dev 8:5
ino 0xf9e flags ILOCK_EXCL caller xfs_setfilesize

The last work never runs.  Hangcheck triggers shortly after.

I spent some more time debugging, and I am seeing that
tsk_is_pi_blocked is returning 1 in sched_submit_work
(kernel/sched/core.c).  It looks like sched_submit_work is not
detecting that the worker task is blocked on a mutex.


Re: Filesystem lockup with CONFIG_PREEMPT_RT

2014-05-21 Thread Austin Schuh
On Wed, May 21, 2014 at 12:30 PM, John Blackwood
 wrote:
>> Date: Wed, 21 May 2014 03:33:49 -0400
>> From: Richard Weinberger 
>> To: Austin Schuh 
>> CC: LKML , xfs , rt-users
>>   
>> Subject: Re: Filesystem lockup with CONFIG_PREEMPT_RT
>
>>
>> CC'ing RT folks
>>
>> On Wed, May 21, 2014 at 8:23 AM, Austin Schuh 
>> wrote:
>> > > On Tue, May 13, 2014 at 7:29 PM, Austin Schuh
>> > >  wrote:
>> >> >> Hi,
>> >> >>
>> >> >> I am observing a filesystem lockup with XFS on a CONFIG_PREEMPT_RT
>> >> >> patched kernel.  I have currently only triggered it using dpkg.
>> >> >> Dave
>> >> >> Chinner on the XFS mailing list suggested that it was a rt-kernel
>> >> >> workqueue issue as opposed to a XFS problem after looking at the
>> >> >> kernel messages.
>> >> >>
>> >> >> The only modification to the kernel besides the RT patch is that I
>> >> >> have applied tglx's "genirq: Sanitize spurious interrupt detection
>> >> >> of
>> >> >> threaded irqs" patch.
>> > >
>> > > I upgraded to 3.14.3-rt4, and the problem still persists.
>> > >
>> > > I turned on event tracing and tracked it down further.  I'm able to
>> > > lock it up by scping a new kernel debian package to /tmp/ on the
>> > > machine.  scp is locking the inode, and then scheduling
>> > > xfs_bmapi_allocate_worker in the work queue.  The work then never gets
>> > > run.  The kworkers then lock up waiting for the inode lock.
>> > >
>> > > Here are the relevant events from the trace.  8803e9f10288
>> > > (blk_delay_work) gets run later on in the trace, but 8803b4c158d0
>> > > (xfs_bmapi_allocate_worker) never does.  The kernel then warns about
>> > > blocked tasks 120 seconds later.
>
> Austin and Richard,
>
> I'm not 100% sure that the patch below will fix your problem, but we
> saw something that sounds pretty familiar to your issue involving the
> nvidia driver and the preempt-rt patch.  The nvidia driver uses the
> completion support to create their own driver's notion of an internally
> used semaphore.
>
> Some tasks were failing to ever wakeup from wait_for_completion() calls
> due to a race in the underlying do_wait_for_common() routine.

Hi John,

Thanks for the suggestion and patch.  The issue is that the work never
gets run, not that the work finishes but the waiter never gets woken.
I applied it anyways to see if it helps, but I still get the lockup.

Thanks,
Austin


Re: Filesystem lockup with CONFIG_PREEMPT_RT

2014-05-21 Thread Austin Schuh
On Tue, May 13, 2014 at 7:29 PM, Austin Schuh  wrote:
> Hi,
>
> I am observing a filesystem lockup with XFS on a CONFIG_PREEMPT_RT
> patched kernel.  I have currently only triggered it using dpkg.  Dave
> Chinner on the XFS mailing list suggested that it was a rt-kernel
> workqueue issue as opposed to a XFS problem after looking at the
> kernel messages.
>
> The only modification to the kernel besides the RT patch is that I
> have applied tglx's "genirq: Sanitize spurious interrupt detection of
> threaded irqs" patch.

I upgraded to 3.14.3-rt4, and the problem still persists.

I turned on event tracing and tracked it down further.  I'm able to
lock it up by scping a new kernel debian package to /tmp/ on the
machine.  scp is locking the inode, and then scheduling
xfs_bmapi_allocate_worker in the work queue.  The work then never gets
run.  The kworkers then lock up waiting for the inode lock.

Here are the relevant events from the trace.  8803e9f10288
(blk_delay_work) gets run later on in the trace, but 8803b4c158d0
(xfs_bmapi_allocate_worker) never does.  The kernel then warns about
blocked tasks 120 seconds later.


 scp-5393  [001] 1..   148.883383: xfs_writepage: dev
8:5 ino 0xd8e pgoff 0x4a3e000 size 0x16af02b0 offset 0 length 0
delalloc 1 unwritten 0
 scp-5393  [001] 1..   148.883383: xfs_ilock_nowait:
dev 8:5 ino 0xd8e flags ILOCK_SHARED caller xfs_map_blocks
 scp-5393  [001] 1..   148.883384: xfs_iunlock: dev
8:5 ino 0xd8e flags ILOCK_SHARED caller xfs_map_blocks
 scp-5393  [001] 1..   148.883386: xfs_log_reserve:
dev 8:5 type STRAT_WRITE t_ocnt 2 t_cnt 2 t_curr_res 112332 t_unit_res
112332 t_flags XLOG_TIC_INITED|XLOG_TIC_PERM_RESERV reserveq empty
writeq empty grant_reserve_cycle 57 grant_reserve_bytes 9250852
grant_write_cycle 57 grant_write_bytes 9250852 curr_cycle 57
curr_block 18031 tail_cycle 57 tail_block 17993
 scp-5393  [001] 1..   148.883386:
xfs_log_reserve_exit: dev 8:5 type STRAT_WRITE t_ocnt 2 t_cnt 2
t_curr_res 112332 t_unit_res 112332 t_flags
XLOG_TIC_INITED|XLOG_TIC_PERM_RESERV reserveq empty writeq empty
grant_reserve_cycle 57 grant_reserve_bytes 9475516 grant_write_cycle
57 grant_write_bytes 9475516 curr_cycle 57 curr_block 18031 tail_cycle
57 tail_block 17993
 scp-5393  [001] 1..   148.883386: xfs_ilock: dev 8:5
ino 0xd8e flags ILOCK_EXCL caller xfs_iomap_write_allocate
 scp-5393  [001] 1.3   148.883390:
workqueue_queue_work: work struct=8803b4c158d0
function=xfs_bmapi_allocate_worker workqueue=8803ea85dc00
req_cpu=512 cpu=1
 scp-5393  [001] 1.3   148.883390:
workqueue_activate_work: work struct 8803b4c158d0
 scp-5393  [001] 1.4   148.883396:
workqueue_queue_work: work struct=8803e9f10288
function=blk_delay_work workqueue=8803ecf96200 req_cpu=512 cpu=1
 scp-5393  [001] 1.4   148.883396:
workqueue_activate_work: work struct 8803e9f10288
 scp-5393  [001] dN..3.4   148.883399: sched_wakeup:
comm=kworker/1:1H pid=573 prio=100 success=1 target_cpu=001
 scp-5393  [001] dN..3.4   148.883400: sched_stat_runtime:
comm=scp pid=5393 runtime=32683 [ns] vruntime=658862317 [ns]
 scp-5393  [001] d...3.4   148.883400: sched_switch:
prev_comm=scp prev_pid=5393 prev_prio=120 prev_state=R+ ==>
next_comm=kworker/1:1H next_pid=573 next_prio=100

irq/44-ahci-273   [000] d...3.5   148.883647: sched_wakeup:
comm=kworker/0:2 pid=732 prio=120 success=1 target_cpu=000
irq/44-ahci-273   [000] d...3..   148.883667: sched_switch:
prev_comm=irq/44-ahci prev_pid=273 prev_prio=49 prev_state=S ==>
next_comm=kworker/0:2 next_pid=732 next_prio=120
kworker/0:2-732   [000] 1..   148.883674:
workqueue_execute_start: work struct 8800aea30cb8: function
xfs_end_io
kworker/0:2-732   [000] 1..   148.883674: xfs_ilock: dev 8:5
ino 0xd8e flags ILOCK_EXCL caller xfs_setfilesize
kworker/0:2-732   [000] d...3..   148.883677: sched_stat_runtime:
comm=kworker/0:2 pid=732 runtime=11392 [ns] vruntime=46907654869 [ns]
kworker/0:2-732   [000] d...3..   148.883679: sched_switch:
prev_comm=kworker/0:2 prev_pid=732 prev_prio=120 prev_state=D ==>
next_comm=swapper/0 next_pid=0 next_prio=120

   kworker/1:1-99[001] 1..   148.884208:
workqueue_execute_start: work struct 8800aea30c20: function
xfs_end_io
kworker/1:1-99[001] 1..   148.884208: xfs_ilock: dev 8:5
ino 0xd8e flags ILOCK_EXCL caller xfs_setfilesize
kworker/1:1-99[001] d...3..   148.884210: sched_stat_runtime:
comm=kworker/1:1 pid=99 runtime=36705 [ns] vruntime=48354424724 [ns]
kworker/1:1-99[001] d...3..   148.884211: sched_switch:
prev_comm=kworker/1:1 prev_pid=99 prev_prio=120 prev_state=R+ ==>
next_comm=swapper/1 next_pid=0 next_prio=120

  kworker/u16:4-296   [001] 1..   151.560358:
workqueue_execute_start: work struct 8803e9f10660: function
bdi_writeback_workfn



Filesystem lockup with CONFIG_PREEMPT_RT

2014-05-13 Thread Austin Schuh
Hi,

I am observing a filesystem lockup with XFS on a CONFIG_PREEMPT_RT
patched kernel.  I have currently only triggered it using dpkg.  Dave
Chinner on the XFS mailing list suggested that it was a rt-kernel
workqueue issue as opposed to a XFS problem after looking at the
kernel messages.

$ uname -a
Linux vpc5 3.10.24-rt22abs #15 SMP PREEMPT RT Tue May 13 14:42:22 PDT
2014 x86_64 GNU/Linux

The only modification to the kernel besides the RT patch is that I
have applied tglx's "genirq: Sanitize spurious interrupt detection of
threaded irqs" patch.

Any ideas on what could be wrong?

Is there any information that I can pull before I reboot the machine
that would be useful?  I have the output of triggering sysrq with l
and t if that would be useful.  Attached is the kernel blocked task
message output.

Thanks,
Austin


vpc5_xfs_lockup_locks
Description: Binary data




Re: [PATCH] genirq: Sanitize spurious interrupt detection of threaded irqs

2014-04-28 Thread Austin Schuh
On Mon, Apr 7, 2014 at 1:08 PM, Austin Schuh  wrote:
> On Mon, Apr 7, 2014 at 1:07 PM, Thomas Gleixner  wrote:
>> On Mon, 7 Apr 2014, Austin Schuh wrote:
>>> You originally sent the patch out.  I could send your patch out back
>>> to you, but that feels a bit weird ;)
>>
>> Wheee. Let me dig in my archives 
>
> https://lkml.org/lkml/2013/3/7/222 in case that helps.

Did you find the patch?  I didn't see anything go by (but I'm not on
the main mailing list and didn't find anything with a quick Google
search.)  It would be nice to not need to run a custom kernel to keep
my machine running.  I have what is probably a year split between 2
machines of runtime with the patch applied, and I haven't seen any
problems with it.

Austin




Re: [PATCH] genirq: Sanitize spurious interrupt detection of threaded irqs

2014-04-07 Thread Austin Schuh
On Mon, Apr 7, 2014 at 1:07 PM, Thomas Gleixner  wrote:
> On Mon, 7 Apr 2014, Austin Schuh wrote:
>> You originally sent the patch out.  I could send your patch out back
>> to you, but that feels a bit weird ;)
>
> Wheee. Let me dig in my archives 

https://lkml.org/lkml/2013/3/7/222 in case that helps.

Austin


Re: [PATCH] genirq: Sanitize spurious interrupt detection of threaded irqs

2014-04-07 Thread Austin Schuh
On Mon, Apr 7, 2014 at 11:41 AM, Thomas Gleixner  wrote:
> On Mon, 7 Apr 2014, Austin Schuh wrote:
>
>> Hi Thomas,
>>
>> Did anything come of this patch?  Both Oliver and I have found that it
>> fixes real problems.  I have multiple machines which have been running
>> with the patch since December with no ill effects.
>
> No, sorry. It fell through the cracks. Care to resend ?
>
> Thanks,
>
> tglx

You originally sent the patch out.  I could send your patch out back
to you, but that feels a bit weird ;)

Austin


Re: [PATCH] genirq: Sanitize spurious interrupt detection of threaded irqs

2014-04-07 Thread Austin Schuh
Hi Thomas,

Did anything come of this patch?  Both Oliver and I have found that it
fixes real problems.  I have multiple machines which have been running
with the patch since December with no ill effects.

Thanks,
  Austin

On Mon, Jan 6, 2014 at 5:32 AM, Oliver Hartkopp  wrote:
> Hi Thomas,
>
> I just wanted to add my
>
> Tested-by: Oliver Hartkopp 
>
> In my setup with Core i7 and 20 CAN busses SJA1000 PCIe the problem
> disappeared with the discussed patch with the -rt kernel.
>
> The system was running at full CAN bus load over the weekend more than 72
> hours of operation without problems:
>
>CPU0   CPU1   CPU2   CPU3
>   0: 40  0  0  0   IO-APIC-edge  timer
>   1:  1  0  0  0   IO-APIC-edge  i8042
>   8:  0  0  1  0   IO-APIC-edge  rtc0
>   9: 42 45 45 42   IO-APIC-fasteoi   acpi
>  16:  9  8  8  8   IO-APIC-fasteoi   ahci, 
> ehci_hcd:usb1, can4, can5, can6, can7
>  17:  441468642  443275488  443609061  441436145   IO-APIC-fasteoi   can8, 
> can10, can11, can9
>  18:  441975412  438811422  437317802  441209092   IO-APIC-fasteoi   can12, 
> can13, can14, can15
>  19:  427310388  428661677  429813687  428095739   IO-APIC-fasteoi   can0, 
> can1, can2, can3, can16, can17, can18, can19
> (..)
>
> Before having the patch, it lasted 1 minute to 1.5 hours (usually ~3
> minutes) until the irq was killed due to the spurious detection using Linux
> 3.10.11-rt (Debian linux-image-3.10-0.bpo.3-rt-686-pae).
>
> I also tested the patch on different latest 3.13-rc5+ (non-rt) kernels for two
> weeks now without problems.
>
> If you want me to test an improved version (as Austin suggested below) please
> send a patch.
>
> Best regards,
> Oliver
>
> On 23.12.2013 20:25, Austin Schuh wrote:
>> Hi Thomas,
>>
>> Did anything happen with your patch to note_interrupt, originally
>> posted on May 8th of 2013?  (https://lkml.org/lkml/2013/3/7/222)
>>
>> I am seeing an issue on a machine right now running a
>> config-preempt-rt kernel and a SJA1000 CAN card from PEAK.  It works
>> for ~1 day, and then proceeds to die with a "Disabling IRQ #18"
>> message.  I posted on the Linux CAN mailing list, and Oliver Hartkopp
>> was able to reproduce the issue only on a realtime kernel.  A function
>> trace ending when the IRQ was disabled shows that note_interrupt is
>> being called regularly from the IRQ handler threads, and one of the
>> threads is doing work (and therefore calling note_interrupt with
>> IRQ_HANDLED).
>>
>> Oliver Hartkopp and I ran tests over the weekend on numerous machines
>> and verified that the patch that you proposed fixes the problem.  We
>> think that the race condition that Till reported is causing the
>> problem here.
>>
>> In reply to the comment about using the upper bit of
>> threads_handled_last for holding the SPURIOUS_DEFERRED flag, while
>> that may still be an over-optimization, the code should still work.
>> All comparisons are done with the bit set, which just makes it a 31
>> bit counter.  It will take 8 more days for the counter to overflow on
>> my machine, so I won't know for certain until then.
>>
>> My only concern is that there may still be a small race condition with
>> this new code.  If the interrupt handler thread is running at a
>> realtime priority, but lower than another task, it may not get run
>> until a large number of IRQs get triggered, and then process them
>> quickly.  With your new handler code, this would be counted as one
>> single handled interrupt.  With the current constants, this is only a
>> problem if more than 1000 calls to the handler happen between IRQs.  I
>> starved my card's irq threads by running 4 tasks at a higher realtime
>> priority than the handler threads, and saw the number of unhandled
>> IRQs jump from 1/10 to 3/10, so that problem may not show up
>> in practice.
>>
>> Austin Schuh
>>
>> Tested-by: Austin Schuh 
>>

