Re: IRQ affinity notifiers vs RT

2013-10-04 Thread Sebastian Andrzej Siewior
On 10/01/2013 01:44 AM, Ben Hutchings wrote:
> Right, but it looks quite strange to have this thread just for a
> (probably) quite rare event.  Maybe some day someone will work out how
> to make Octeon's IRQ management less unusual and then you can use the
> simpler approach for RT...

We have three threads for some rare events by now if I count correctly.
That is
- mce check on x86 (mce_notify_helper_thread)
- this one
- clock_was_set_delayed() (pending)

So right now I am thinking about a workqueue for rare events, because if
I keep doing this we end up with a bunch of useless tasks for something
that might never occur.
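
One possible shape of such a shared rare-event worker, purely as an
illustration (untested, all names made up; the point is that the queueing
side uses only a raw spinlock and wake_up_process(), which are usable from
hard-IRQ context or under raw locks on RT, unlike schedule_work()):

#include <linux/kthread.h>
#include <linux/list.h>
#include <linux/sched.h>
#include <linux/spinlock.h>

struct rare_work {
	struct list_head node;
	void (*func)(struct rare_work *work);
};

static LIST_HEAD(rare_list);
static DEFINE_RAW_SPINLOCK(rare_lock);
static struct task_struct *rare_task;	/* created once with kthread_run() */

/* queueing side: only a raw lock and wake_up_process(), so this may be
 * called from hard-IRQ context or under raw spinlocks even on RT */
static void rare_work_queue(struct rare_work *work)
{
	unsigned long flags;

	raw_spin_lock_irqsave(&rare_lock, flags);
	list_add_tail(&work->node, &rare_list);
	raw_spin_unlock_irqrestore(&rare_lock, flags);
	wake_up_process(rare_task);
}

static int rare_work_thread(void *unused)
{
	while (!kthread_should_stop()) {
		struct rare_work *work = NULL;
		unsigned long flags;

		set_current_state(TASK_INTERRUPTIBLE);

		raw_spin_lock_irqsave(&rare_lock, flags);
		if (!list_empty(&rare_list)) {
			work = list_first_entry(&rare_list, struct rare_work, node);
			list_del(&work->node);
		}
		raw_spin_unlock_irqrestore(&rare_lock, flags);

		if (work) {
			__set_current_state(TASK_RUNNING);
			work->func(work);	/* process context, sleeping locks are fine here */
		} else {
			schedule();
		}
	}
	__set_current_state(TASK_RUNNING);
	return 0;
}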

> Ben.

Sebastian

Re: IRQ affinity notifiers vs RT

2013-09-30 Thread Ben Hutchings
On Mon, 2013-09-23 at 14:58 +0200, Sebastian Andrzej Siewior wrote:
> On 08/30/2013 11:29 PM, Ben Hutchings wrote:
> > Sebastian, I saw you came up with a fix for this but apparently without
> > seeing my earlier message:
> 
> Yes Ben, I haven't seen it. If I was on Cc, I am very sorry for overlooking
> it.

You weren't, as I didn't realise you were maintaining the RT patch set
then.

> > On Thu, 2013-07-25 at 00:31 +0100, Ben Hutchings wrote:
> 
> >> Workqueue code uses spin_lock_irq() on the workqueue lock, which with
> >> PREEMPT_RT enabled doesn't actually block IRQs.
> >>
> >> In 3.6, the irq_cpu_rmap functions rely on a workqueue flush to finish
> >> any outstanding notifications before freeing the cpu_rmap that they use.
> >> This won't be reliable if the notification is scheduled after releasing
> >> the irq_desc lock.
> >>
> >> However, following commit 896f97ea95c1 ('lib: cpu_rmap: avoid flushing
> >> all workqueues') in 3.8, I think that it is sufficient to do only
> >> kref_get(&desc->affinity_notify->kref) in __irq_set_affinity_locked()
> >> and then call schedule_work() in irq_set_affinity() after releasing the
> >> lock.  Something like this (untested):
> > 
> > Does the following make sense to you?
> 
> This was suggested by the original submitter on rt-users@v.k.o (Joe
> Korty), where I was made aware of this for the first time. This is okay
> except for the part where the workqueue is not scheduled if called via
> the __ function (i.e. the mips case). If I read the code correctly, the
> CPU goes offline, the affinity change should be updated / users notified,
> and this is not the case with this patch.
> 
> It is a valid question why only one mips SoC needs the function. If you
> look at commit 0c3263870f ("MIPS: Octeon: Rewrite interrupt handling
> code.") you can see that tglx himself made this adjustment so it might
> be valid :) Therefore I assume that we may get more callers of this
> function and that the workqueue should be executed, so I made something
> simple that works on RT.

Right, but it looks quite strange to have this thread just for a
(probably) quite rare event.  Maybe some day someone will work out how
to make Octeon's IRQ management less unusual and then you can use the
simpler approach for RT...

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

Re: IRQ affinity notifiers vs RT

2013-09-23 Thread Sebastian Andrzej Siewior
On 08/30/2013 11:29 PM, Ben Hutchings wrote:
> Sebastian, I saw you came up with a fix for this but apparently without
> seeing my earlier message:

Yes Ben, I haven't seen it. If I was on Cc, I am very sorry for overlooking
it.

> On Thu, 2013-07-25 at 00:31 +0100, Ben Hutchings wrote:

>> Workqueue code uses spin_lock_irq() on the workqueue lock, which with
>> PREEMPT_RT enabled doesn't actually block IRQs.
>>
>> In 3.6, the irq_cpu_rmap functions rely on a workqueue flush to finish
>> any outstanding notifications before freeing the cpu_rmap that they use.
>> This won't be reliable if the notification is scheduled after releasing
>> the irq_desc lock.
>>
>> However, following commit 896f97ea95c1 ('lib: cpu_rmap: avoid flushing
>> all workqueues') in 3.8, I think that it is sufficient to do only
>> kref_get(&desc->affinity_notify->kref) in __irq_set_affinity_locked()
>> and then call schedule_work() in irq_set_affinity() after releasing the
>> lock.  Something like this (untested):
> 
> Does the following make sense to you?

This was suggested by the original submitter on rt-users@v.k.o (Joe
Korty), where I was made aware of this for the first time. This is okay
except for the part where the workqueue is not scheduled if called via
the __ function (i.e. the mips case). If I read the code correctly, the
CPU goes offline, the affinity change should be updated / users notified,
and this is not the case with this patch.

It is a valid question why only one mips SoC needs the function. If you
look at commit 0c3263870f ("MIPS: Octeon: Rewrite interrupt handling
code.") you can see that tglx himself made this adjustment so it might
be valid :) Therefore I assume that we may get more callers of this
function and that the workqueue should be executed, so I made something
simple that works on RT.
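
For reference, the reason a dedicated thread is viable at all in this
context is the wakeup primitive; roughly like this (illustration only, not
the actual -rt patch; helper_task and pending are made-up names):

static struct task_struct *helper_task;		/* hypothetical, created at boot */
static struct irq_affinity_notify *pending;	/* hypothetical hand-off slot */

static void notify_via_helper(struct irq_desc *desc)
{
	/* runs with desc->lock (a raw spinlock) held */
	if (desc->affinity_notify) {
		kref_get(&desc->affinity_notify->kref);
		/* real code must handle an already-pending entry */
		pending = desc->affinity_notify;
		/* takes only raw locks, unlike schedule_work() */
		wake_up_process(helper_task);
	}
}

The helper task (not shown) then runs notify->notify() in process context
and drops the reference again with kref_put().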

> 
> Ben.
> 

Sebastian

Re: IRQ affinity notifiers vs RT

2013-08-30 Thread Ben Hutchings
Sebastian, I saw you came up with a fix for this but apparently without
seeing my earlier message:

On Thu, 2013-07-25 at 00:31 +0100, Ben Hutchings wrote:
> Alexandra Kossovsky reported the following lockdep splat when testing an
> out-of-tree version of sfc on 3.6-rt.  The problem is specific to RT,
> and we haven't tested anything later but I think it's still unfixed.
> 
> > ==
> > [ INFO: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected ]
> > 3.6.11.2-rt33.39.el6rt.x86_64.debug #1 Tainted: G   O
> > --
> > insmod/3076 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
> >  (&(&(gcwq->lock)->lock)->wait_lock){+.+...}, at: [81589a78] 
> > rt_spin_lock_slowlock+0x48/0x2f0
> > 
> > and this task is already holding:
> >  (&irq_desc_lock_class){-.}, at: [810ec226] 
> > irq_set_affinity+0x46/0x80
> 
> irq_set_affinity() holds the irq_desc lock, and then schedules a work
> item to call the IRQ affinity notifier.
> 
> > which would create a new lock dependency:
> >  (&irq_desc_lock_class){-.} -> 
> > (&(&(gcwq->lock)->lock)->wait_lock){+.+...}
> > 
> > but this new dependency connects a HARDIRQ-irq-safe lock:
> >  (&irq_desc_lock_class){-.}
> > ... which became HARDIRQ-irq-safe at:
> >   [810adba2] mark_irqflags+0x172/0x190
> >   [810af2c4] __lock_acquire+0x344/0x4e0
> >   [810af4ea] lock_acquire+0x8a/0x140
> >   [8158ac50] _raw_spin_lock+0x40/0x80
> >   [810edfae] handle_level_irq+0x1e/0x100
> >   [810044e1] handle_irq+0x71/0x190
> >   [815943ad] do_IRQ+0x5d/0xe0
> >   [8158b32c] ret_from_intr+0x0/0x13
> >   [81d16d9d] tsc_init+0x24/0x102
> >   [81d13d77] x86_late_time_init+0xf/0x11
> >   [81d10cfe] start_kernel+0x312/0x3c6
> >   [81d1032d] x86_64_start_reservations+0x131/0x136
> >   [81d1041f] x86_64_start_kernel+0xed/0xf4
> 
> Obviously, irq_desc is used in hard-IRQ context.
> 
> > to a HARDIRQ-irq-unsafe lock:
> >  (&(&(gcwq->lock)->lock)->wait_lock){+.+...}
> > ... which became HARDIRQ-irq-unsafe at:
> > ...  [810adb50] mark_irqflags+0x120/0x190
> >   [810af2c4] __lock_acquire+0x344/0x4e0
> >   [810af4ea] lock_acquire+0x8a/0x140
> >   [8158ac50] _raw_spin_lock+0x40/0x80
> >   [81589a78] rt_spin_lock_slowlock+0x48/0x2f0
> >   [8158a1b6] rt_spin_lock+0x16/0x40
> >   [8106cf99] create_worker+0x69/0x220
> >   [81d2bae5] init_workqueues+0x24b/0x3f8
> >   [810001c2] do_one_initcall+0x42/0x170
> >   [81d10775] kernel_init+0xe5/0x17a
> >   [81593c24] kernel_thread_helper+0x4/0x10
> [...]
> 
> Workqueue code uses spin_lock_irq() on the workqueue lock, which with
> PREEMPT_RT enabled doesn't actually block IRQs.
> 
> In 3.6, the irq_cpu_rmap functions rely on a workqueue flush to finish
> any outstanding notifications before freeing the cpu_rmap that they use.
> This won't be reliable if the notification is scheduled after releasing
> the irq_desc lock.
> 
> However, following commit 896f97ea95c1 ('lib: cpu_rmap: avoid flushing
> all workqueues') in 3.8, I think that it is sufficient to do only
> kref_get(&desc->affinity_notify->kref) in __irq_set_affinity_locked()
> and then call schedule_work() in irq_set_affinity() after releasing the
> lock.  Something like this (untested):

Does the following make sense to you?

Ben.

> diff --git a/arch/mips/cavium-octeon/octeon-irq.c 
> b/arch/mips/cavium-octeon/octeon-irq.c
> index 9d36774..0406481 100644
> --- a/arch/mips/cavium-octeon/octeon-irq.c
> +++ b/arch/mips/cavium-octeon/octeon-irq.c
> @@ -635,7 +635,8 @@ static void octeon_irq_cpu_offline_ciu(struct irq_data 
> *data)
> >   cpumask_clear(&new_affinity);
> >   cpumask_set_cpu(cpumask_first(cpu_online_mask), &new_affinity);
>   }
> > - __irq_set_affinity_locked(data, &new_affinity);
> + /* XXX No-one else calls this; why does this chip need it? */
> > + __irq_set_affinity_locked(data, &new_affinity, NULL);
>  }
>  
>  static int octeon_irq_ciu_set_affinity(struct irq_data *data,
> diff --git a/include/linux/irq.h b/include/linux/irq.h
> index f04d3ba..de992f4 100644
> --- a/include/linux/irq.h
> +++ b/include/linux/irq.h
> @@ -380,7 +380,9 @@ extern void remove_percpu_irq(unsigned int irq, struct 
> irqaction *act);
>  
>  extern void irq_cpu_online(void);
>  extern void irq_cpu_offline(void);
> -extern int __irq_set_affinity_locked(struct irq_data *data,  const struct 
> cpumask *cpumask);
> +extern int __irq_set_affinity_locked(struct irq_data *data,
> +  const struct cpumask *cpumask,
> +  struct irq_affinity_notify **notify);
>  
>  #ifdef CONFIG_GENERIC_HARDIRQS
>  
> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> index 514bcfd..157afa2 100644
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -162,12 +162,16 @@ int irq_do_set_affinity(struct irq_data *data, const 
> struct cpumask *mask,
>   return ret;
>  }
>  
> -int __irq_set_affinity_locked(struct irq_data *data, const struct cpumask 
> *mask)
> +int __irq_set_affinity_locked(struct irq_data *data, const struct cpumask 
> *mask,
> +   struct irq_affinity_notify **notify)
>  {
>   struct irq_chip *chip = irq_data_get_irq_chip(data);
>   

IRQ affinity notifiers vs RT

2013-07-24 Thread Ben Hutchings
Alexandra Kossovsky reported the following lockdep splat when testing an
out-of-tree version of sfc on 3.6-rt.  The problem is specific to RT,
and we haven't tested anything later but I think it's still unfixed.

> ==
> [ INFO: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected ]
> 3.6.11.2-rt33.39.el6rt.x86_64.debug #1 Tainted: G   O
> --
> insmod/3076 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
>  (&(&(gcwq->lock)->lock)->wait_lock){+.+...}, at: [81589a78] 
> rt_spin_lock_slowlock+0x48/0x2f0
> 
> and this task is already holding:
>  (&irq_desc_lock_class){-.}, at: [810ec226] 
> irq_set_affinity+0x46/0x80

irq_set_affinity() holds the irq_desc lock, and then schedules a work
item to call the IRQ affinity notifier.

> which would create a new lock dependency:
>  (&irq_desc_lock_class){-.} -> 
> (&(&(gcwq->lock)->lock)->wait_lock){+.+...}
> 
> but this new dependency connects a HARDIRQ-irq-safe lock:
>  (&irq_desc_lock_class){-.}
> ... which became HARDIRQ-irq-safe at:
>   [810adba2] mark_irqflags+0x172/0x190
>   [810af2c4] __lock_acquire+0x344/0x4e0
>   [810af4ea] lock_acquire+0x8a/0x140
>   [8158ac50] _raw_spin_lock+0x40/0x80
>   [810edfae] handle_level_irq+0x1e/0x100
>   [810044e1] handle_irq+0x71/0x190
>   [815943ad] do_IRQ+0x5d/0xe0
>   [8158b32c] ret_from_intr+0x0/0x13
>   [81d16d9d] tsc_init+0x24/0x102
>   [81d13d77] x86_late_time_init+0xf/0x11
>   [81d10cfe] start_kernel+0x312/0x3c6
>   [81d1032d] x86_64_start_reservations+0x131/0x136
>   [81d1041f] x86_64_start_kernel+0xed/0xf4

Obviously, irq_desc is used in hard-IRQ context.

> to a HARDIRQ-irq-unsafe lock:
>  (&(&(gcwq->lock)->lock)->wait_lock){+.+...}
> ... which became HARDIRQ-irq-unsafe at:
> ...  [810adb50] mark_irqflags+0x120/0x190
>   [810af2c4] __lock_acquire+0x344/0x4e0
>   [810af4ea] lock_acquire+0x8a/0x140
>   [8158ac50] _raw_spin_lock+0x40/0x80
>   [81589a78] rt_spin_lock_slowlock+0x48/0x2f0
>   [8158a1b6] rt_spin_lock+0x16/0x40
>   [8106cf99] create_worker+0x69/0x220
>   [81d2bae5] init_workqueues+0x24b/0x3f8
>   [810001c2] do_one_initcall+0x42/0x170
>   [81d10775] kernel_init+0xe5/0x17a
>   [81593c24] kernel_thread_helper+0x4/0x10
[...]

Workqueue code uses spin_lock_irq() on the workqueue lock, which with
PREEMPT_RT enabled doesn't actually block IRQs.

In 3.6, the irq_cpu_rmap functions rely on a workqueue flush to finish
any outstanding notifications before freeing the cpu_rmap that they use.
This won't be reliable if the notification is scheduled after releasing
the irq_desc lock.
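
The flush-free scheme referred to in the next paragraph boils down to the
usual kref pattern; a generic sketch (made-up names, not the actual
lib/cpu_rmap.c code):

#include <linux/kref.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

struct glue {				/* hypothetical */
	struct kref kref;		/* kref_init()ed at allocation time */
	struct work_struct work;
	/* ... whatever state the notification needs ... */
};

static void glue_release(struct kref *kref)
{
	struct glue *glue = container_of(kref, struct glue, kref);

	kfree(glue);
}

static void glue_work_fn(struct work_struct *work)
{
	struct glue *glue = container_of(work, struct glue, work);

	/* ... deliver the notification ... */
	kref_put(&glue->kref, glue_release);	/* drop the ref taken at queue time */
}

/*
 * Queue side: take a reference before the work can possibly run, so the
 * object stays alive until the notification has been delivered.  Teardown
 * then needs no workqueue flush, only a final kref_put().
 */
static void glue_queue(struct glue *glue)
{
	kref_get(&glue->kref);
	schedule_work(&glue->work);
}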

However, following commit 896f97ea95c1 ('lib: cpu_rmap: avoid flushing
all workqueues') in 3.8, I think that it is sufficient to do only
kref_get(&desc->affinity_notify->kref) in __irq_set_affinity_locked()
and then call schedule_work() in irq_set_affinity() after releasing the
lock.  Something like this (untested):

diff --git a/arch/mips/cavium-octeon/octeon-irq.c 
b/arch/mips/cavium-octeon/octeon-irq.c
index 9d36774..0406481 100644
--- a/arch/mips/cavium-octeon/octeon-irq.c
+++ b/arch/mips/cavium-octeon/octeon-irq.c
@@ -635,7 +635,8 @@ static void octeon_irq_cpu_offline_ciu(struct irq_data 
*data)
	cpumask_clear(&new_affinity);
	cpumask_set_cpu(cpumask_first(cpu_online_mask), &new_affinity);
}
-   __irq_set_affinity_locked(data, &new_affinity);
+   /* XXX No-one else calls this; why does this chip need it? */
+   __irq_set_affinity_locked(data, &new_affinity, NULL);
 }
 
 static int octeon_irq_ciu_set_affinity(struct irq_data *data,
diff --git a/include/linux/irq.h b/include/linux/irq.h
index f04d3ba..de992f4 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -380,7 +380,9 @@ extern void remove_percpu_irq(unsigned int irq, struct 
irqaction *act);
 
 extern void irq_cpu_online(void);
 extern void irq_cpu_offline(void);
-extern int __irq_set_affinity_locked(struct irq_data *data,  const struct 
cpumask *cpumask);
+extern int __irq_set_affinity_locked(struct irq_data *data,
+const struct cpumask *cpumask,
+struct irq_affinity_notify **notify);
 
 #ifdef CONFIG_GENERIC_HARDIRQS
 
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 514bcfd..157afa2 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -162,12 +162,16 @@ int irq_do_set_affinity(struct irq_data *data, const 
struct cpumask *mask,
return ret;
 }
 
-int __irq_set_affinity_locked(struct irq_data *data, const struct cpumask 
*mask)
+int __irq_set_affinity_locked(struct irq_data *data, const struct cpumask 
*mask,
+ struct irq_affinity_notify **notify)
 {
struct irq_chip *chip = irq_data_get_irq_chip(data);
struct irq_desc *desc = irq_data_to_desc(data);
int ret = 0;
 
+   if (notify)
+   *notify = NULL;
+
if (!chip || !chip->irq_set_affinity)
return -EINVAL;
 
@@ -178,9 +182,9 @@ int __irq_set_affinity_locked(struct irq_data *data, const 
struct cpumask *mask)
irq_copy_pending(desc, mask);
}
 
-   if (desc->affinity_notify) {
+   if (notify 
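
The irq_set_affinity() hunk is cut off above; going by the description, the
caller side would end up looking roughly like this (an illustration of the
approach, not the exact patch):

int irq_set_affinity(unsigned int irq, const struct cpumask *mask)
{
	struct irq_desc *desc = irq_to_desc(irq);
	struct irq_affinity_notify *notify;
	unsigned long flags;
	int ret;

	if (!desc)
		return -EINVAL;

	raw_spin_lock_irqsave(&desc->lock, flags);
	ret = __irq_set_affinity_locked(irq_desc_get_irq_data(desc), mask,
					&notify);
	raw_spin_unlock_irqrestore(&desc->lock, flags);

	/*
	 * Only here, outside the raw irq_desc lock, is it safe on RT to
	 * take the sleeping workqueue lock.
	 */
	if (notify)
		schedule_work(&notify->work);

	return ret;
}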
