Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-10-09 Thread Rich Felker
On Sun, Oct 09, 2016 at 11:14:20AM +0200, Thomas Gleixner wrote:
> Rich,
> 
> On Sat, 8 Oct 2016, Rich Felker wrote:
> > On Sat, Oct 08, 2016 at 07:03:30PM +0200, Thomas Gleixner wrote:
> > > Because you drop out the idle spin due to an interrupt, but no interrupt 
> > > is
> > > handled according to the trace. You just go back to sleep and the trace is
> > > full of this behaviour.
> > 
> > Found the problem. IRQD_IRQ_INPROGRESS is causing the interrupt to be
> > ignored if the same interrupt is being handled on the other cpu.
> > Modifying the jcore aic driver to call handle_percpu_irq rather than
> > handle_simple_irq when the irq was registered as percpu solves the
> > problem, but I'm actually concerned that handle_simple_irq would also
> > be wrong in the case of non-percpu irqs if they could be delivered on
> > either core -- if another irq arrives after the driver is finished
> > checking for new device status, but before the IRQD_IRQ_INPROGRESS
> > flag is removed, it seems handle_simple_irq loses the irq. This is not
> > a problem for the current hardware since non-percpu irqs always arrive
> > on cpu0, but I'd rather improve the driver to properly handle possible
> > future hardware that allows irq delivery on any core.
> 
> We guarantee no-rentrancy for the handlers. That's why we have special
> treatment for per cpu interrupts and edge type interrupts, where we mark
> the interrupt pending when it arrives before the in progress flag is
> cleared and then call the handler again when it returns. For level type
> interrupts this is not required because level is going to stay raised in
> the hardware.

I understand the no-reentrancy guarantee, but it seems that, for a
"simple" irq with no ack/etc. mechanisms, if another interrupt has
arrives during handling, a flag for that has to be kept and the
interrupt handler re-invoked after the first return. Otherwise
interrupts are lost. Perhaps handle_simple_irq is written with the
intent that individual drivers have to do something ack-like with
their hardware. If so then it's not the right handler (although it
works at present as long as non-percpu interrupts are bound to a
single cpu at the hardware level) and I should probably write an
appropriate handle function.

Or, if we want to consider the current behavior a guarantee so that
changing it would require a new compatible string, then
handle_percpu_irq or an equivalently simpler function could be used
even for non-percpu irqs and avoid all the locking overhead. This
would save a lot more time in the hot path than eliminating the one
conditional (as discussed below below).

> > request_irq ends up working fine still because IRQF_PERCPU causes
> > __setup_irq to mark the irqdesc/irq_data as percpu, so that my handle
> > function can treat it differently. Here's the patch that has it working:
>  
> > diff --git a/drivers/irqchip/irq-jcore-aic.c 
> > b/drivers/irqchip/irq-jcore-aic.c
> > index 5e5e3bb..b53a8a5 100644
> > --- a/drivers/irqchip/irq-jcore-aic.c
> > +++ b/drivers/irqchip/irq-jcore-aic.c
> > @@ -25,12 +25,20 @@
> >  
> >  static struct irq_chip jcore_aic;
> >  
> > +static void handle_jcore_irq(struct irq_desc *desc)
> > +{
> > +   if (irq_is_percpu(irq_desc_get_irq(desc)))
> > +   handle_percpu_irq(desc);
> > +   else
> > +   handle_simple_irq(desc);
> > +}
> > +
> >  static int jcore_aic_irqdomain_map(struct irq_domain *d, unsigned int irq,
> >irq_hw_number_t hwirq)
> >  {
> > struct irq_chip *aic = d->host_data;
> >  
> > -   irq_set_chip_and_handler(irq, aic, handle_simple_irq);
> > +   irq_set_chip_and_handler(irq, aic, handle_jcore_irq);
> 
> What you should do is:
> 
>   if (jcore_irq_per_cpu(hwirq))
>   irq_set_chip_and_handler(irq, aic, handle_percpu_irq);
>   else
>   irq_set_chip_and_handler(irq, aic, handle_simple_irq);
>   
> That avoids the conditional in the irq delivery path,

I'll follow up on this in the thread on the patch.

> which is overly
> expensive as you end up looking up the irq descriptor which you already
> have. You can avoid the lookup by using:
> 
>   if (irqd_is_per_cpu(irq_desc_get_irq_data(desc)))
> 
> but that's still a conditional in the hotpath which you can avoid entirely
> by setting up the proper handler when you map it.

Indeed that helps. Having CONFIG_FRAME_POINTER on was making both
versions huge (because of the implicit -fno-optimize-sibling-calls)
but with that off, it's much more reasonable.

Rich


Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-10-09 Thread Rich Felker
On Sun, Oct 09, 2016 at 11:14:20AM +0200, Thomas Gleixner wrote:
> Rich,
> 
> On Sat, 8 Oct 2016, Rich Felker wrote:
> > On Sat, Oct 08, 2016 at 07:03:30PM +0200, Thomas Gleixner wrote:
> > > Because you drop out the idle spin due to an interrupt, but no interrupt 
> > > is
> > > handled according to the trace. You just go back to sleep and the trace is
> > > full of this behaviour.
> > 
> > Found the problem. IRQD_IRQ_INPROGRESS is causing the interrupt to be
> > ignored if the same interrupt is being handled on the other cpu.
> > Modifying the jcore aic driver to call handle_percpu_irq rather than
> > handle_simple_irq when the irq was registered as percpu solves the
> > problem, but I'm actually concerned that handle_simple_irq would also
> > be wrong in the case of non-percpu irqs if they could be delivered on
> > either core -- if another irq arrives after the driver is finished
> > checking for new device status, but before the IRQD_IRQ_INPROGRESS
> > flag is removed, it seems handle_simple_irq loses the irq. This is not
> > a problem for the current hardware since non-percpu irqs always arrive
> > on cpu0, but I'd rather improve the driver to properly handle possible
> > future hardware that allows irq delivery on any core.
> 
> We guarantee no-rentrancy for the handlers. That's why we have special
> treatment for per cpu interrupts and edge type interrupts, where we mark
> the interrupt pending when it arrives before the in progress flag is
> cleared and then call the handler again when it returns. For level type
> interrupts this is not required because level is going to stay raised in
> the hardware.

I understand the no-reentrancy guarantee, but it seems that, for a
"simple" irq with no ack/etc. mechanisms, if another interrupt has
arrives during handling, a flag for that has to be kept and the
interrupt handler re-invoked after the first return. Otherwise
interrupts are lost. Perhaps handle_simple_irq is written with the
intent that individual drivers have to do something ack-like with
their hardware. If so then it's not the right handler (although it
works at present as long as non-percpu interrupts are bound to a
single cpu at the hardware level) and I should probably write an
appropriate handle function.

Or, if we want to consider the current behavior a guarantee so that
changing it would require a new compatible string, then
handle_percpu_irq or an equivalently simpler function could be used
even for non-percpu irqs and avoid all the locking overhead. This
would save a lot more time in the hot path than eliminating the one
conditional (as discussed below below).

> > request_irq ends up working fine still because IRQF_PERCPU causes
> > __setup_irq to mark the irqdesc/irq_data as percpu, so that my handle
> > function can treat it differently. Here's the patch that has it working:
>  
> > diff --git a/drivers/irqchip/irq-jcore-aic.c 
> > b/drivers/irqchip/irq-jcore-aic.c
> > index 5e5e3bb..b53a8a5 100644
> > --- a/drivers/irqchip/irq-jcore-aic.c
> > +++ b/drivers/irqchip/irq-jcore-aic.c
> > @@ -25,12 +25,20 @@
> >  
> >  static struct irq_chip jcore_aic;
> >  
> > +static void handle_jcore_irq(struct irq_desc *desc)
> > +{
> > +   if (irq_is_percpu(irq_desc_get_irq(desc)))
> > +   handle_percpu_irq(desc);
> > +   else
> > +   handle_simple_irq(desc);
> > +}
> > +
> >  static int jcore_aic_irqdomain_map(struct irq_domain *d, unsigned int irq,
> >irq_hw_number_t hwirq)
> >  {
> > struct irq_chip *aic = d->host_data;
> >  
> > -   irq_set_chip_and_handler(irq, aic, handle_simple_irq);
> > +   irq_set_chip_and_handler(irq, aic, handle_jcore_irq);
> 
> What you should do is:
> 
>   if (jcore_irq_per_cpu(hwirq))
>   irq_set_chip_and_handler(irq, aic, handle_percpu_irq);
>   else
>   irq_set_chip_and_handler(irq, aic, handle_simple_irq);
>   
> That avoids the conditional in the irq delivery path,

I'll follow up on this in the thread on the patch.

> which is overly
> expensive as you end up looking up the irq descriptor which you already
> have. You can avoid the lookup by using:
> 
>   if (irqd_is_per_cpu(irq_desc_get_irq_data(desc)))
> 
> but that's still a conditional in the hotpath which you can avoid entirely
> by setting up the proper handler when you map it.

Indeed that helps. Having CONFIG_FRAME_POINTER on was making both
versions huge (because of the implicit -fno-optimize-sibling-calls)
but with that off, it's much more reasonable.

Rich


Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-10-09 Thread Thomas Gleixner
Rich,

On Sat, 8 Oct 2016, Rich Felker wrote:
> On Sat, Oct 08, 2016 at 07:03:30PM +0200, Thomas Gleixner wrote:
> > Because you drop out the idle spin due to an interrupt, but no interrupt is
> > handled according to the trace. You just go back to sleep and the trace is
> > full of this behaviour.
> 
> Found the problem. IRQD_IRQ_INPROGRESS is causing the interrupt to be
> ignored if the same interrupt is being handled on the other cpu.
> Modifying the jcore aic driver to call handle_percpu_irq rather than
> handle_simple_irq when the irq was registered as percpu solves the
> problem, but I'm actually concerned that handle_simple_irq would also
> be wrong in the case of non-percpu irqs if they could be delivered on
> either core -- if another irq arrives after the driver is finished
> checking for new device status, but before the IRQD_IRQ_INPROGRESS
> flag is removed, it seems handle_simple_irq loses the irq. This is not
> a problem for the current hardware since non-percpu irqs always arrive
> on cpu0, but I'd rather improve the driver to properly handle possible
> future hardware that allows irq delivery on any core.

We guarantee no-rentrancy for the handlers. That's why we have special
treatment for per cpu interrupts and edge type interrupts, where we mark
the interrupt pending when it arrives before the in progress flag is
cleared and then call the handler again when it returns. For level type
interrupts this is not required because level is going to stay raised in
the hardware.

> > which is the wrong thing to do. You need request_percpu_irq() here, but of
> > course with the irq chip you are using, which has every callback as a noop,
> > it does not matter at all. So that's not an issue and not the reason for
> > this wreckage.
> 
> Mentioning this was helpful to get me looking at the right things
> tracking from the irq entry point to where the irq was lost/dropped,
> so thanks for bringing it up.

Duh, forgot about reentrancy. I should have thought about it when staring
at the traces. I noticed that the irq on the other cpu was handled exactly
at the same point, but I did not make the connection. In all hte
problematic cases there is this sequence:

  -0 [000] d.s.   150.861703: tick_irq_enter: update jiffies 
via irq
  -0 [001] d.h.   150.861827: irq_handler_entry: irq=16 
name=jcore_pit

i.e. the handler on cpu1 is invoked before cpu0 has reached
handle_simple_irq(). 

> request_irq ends up working fine still because IRQF_PERCPU causes
> __setup_irq to mark the irqdesc/irq_data as percpu, so that my handle
> function can treat it differently. Here's the patch that has it working:
 
> diff --git a/drivers/irqchip/irq-jcore-aic.c b/drivers/irqchip/irq-jcore-aic.c
> index 5e5e3bb..b53a8a5 100644
> --- a/drivers/irqchip/irq-jcore-aic.c
> +++ b/drivers/irqchip/irq-jcore-aic.c
> @@ -25,12 +25,20 @@
>  
>  static struct irq_chip jcore_aic;
>  
> +static void handle_jcore_irq(struct irq_desc *desc)
> +{
> + if (irq_is_percpu(irq_desc_get_irq(desc)))
> + handle_percpu_irq(desc);
> + else
> + handle_simple_irq(desc);
> +}
> +
>  static int jcore_aic_irqdomain_map(struct irq_domain *d, unsigned int irq,
>  irq_hw_number_t hwirq)
>  {
>   struct irq_chip *aic = d->host_data;
>  
> - irq_set_chip_and_handler(irq, aic, handle_simple_irq);
> + irq_set_chip_and_handler(irq, aic, handle_jcore_irq);

What you should do is:

if (jcore_irq_per_cpu(hwirq))
irq_set_chip_and_handler(irq, aic, handle_percpu_irq);
else
irq_set_chip_and_handler(irq, aic, handle_simple_irq);

That avoids the conditional in the irq delivery path, which is overly
expensive as you end up looking up the irq descriptor which you already
have. You can avoid the lookup by using:

  if (irqd_is_per_cpu(irq_desc_get_irq_data(desc)))

but that's still a conditional in the hotpath which you can avoid entirely
by setting up the proper handler when you map it.

> Apologies for this big distraction for what was ultimately a driver
> bug on my side. I was misled by the way it seemed to be a regression,
> since the symptoms did not appear in earlier kernel versions for
> whatever reason (likely just chance).

No problem. Glad that I was able to help.

Thanks,

tglx


Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-10-09 Thread Thomas Gleixner
Rich,

On Sat, 8 Oct 2016, Rich Felker wrote:
> On Sat, Oct 08, 2016 at 07:03:30PM +0200, Thomas Gleixner wrote:
> > Because you drop out the idle spin due to an interrupt, but no interrupt is
> > handled according to the trace. You just go back to sleep and the trace is
> > full of this behaviour.
> 
> Found the problem. IRQD_IRQ_INPROGRESS is causing the interrupt to be
> ignored if the same interrupt is being handled on the other cpu.
> Modifying the jcore aic driver to call handle_percpu_irq rather than
> handle_simple_irq when the irq was registered as percpu solves the
> problem, but I'm actually concerned that handle_simple_irq would also
> be wrong in the case of non-percpu irqs if they could be delivered on
> either core -- if another irq arrives after the driver is finished
> checking for new device status, but before the IRQD_IRQ_INPROGRESS
> flag is removed, it seems handle_simple_irq loses the irq. This is not
> a problem for the current hardware since non-percpu irqs always arrive
> on cpu0, but I'd rather improve the driver to properly handle possible
> future hardware that allows irq delivery on any core.

We guarantee no-rentrancy for the handlers. That's why we have special
treatment for per cpu interrupts and edge type interrupts, where we mark
the interrupt pending when it arrives before the in progress flag is
cleared and then call the handler again when it returns. For level type
interrupts this is not required because level is going to stay raised in
the hardware.

> > which is the wrong thing to do. You need request_percpu_irq() here, but of
> > course with the irq chip you are using, which has every callback as a noop,
> > it does not matter at all. So that's not an issue and not the reason for
> > this wreckage.
> 
> Mentioning this was helpful to get me looking at the right things
> tracking from the irq entry point to where the irq was lost/dropped,
> so thanks for bringing it up.

Duh, forgot about reentrancy. I should have thought about it when staring
at the traces. I noticed that the irq on the other cpu was handled exactly
at the same point, but I did not make the connection. In all hte
problematic cases there is this sequence:

  -0 [000] d.s.   150.861703: tick_irq_enter: update jiffies 
via irq
  -0 [001] d.h.   150.861827: irq_handler_entry: irq=16 
name=jcore_pit

i.e. the handler on cpu1 is invoked before cpu0 has reached
handle_simple_irq(). 

> request_irq ends up working fine still because IRQF_PERCPU causes
> __setup_irq to mark the irqdesc/irq_data as percpu, so that my handle
> function can treat it differently. Here's the patch that has it working:
 
> diff --git a/drivers/irqchip/irq-jcore-aic.c b/drivers/irqchip/irq-jcore-aic.c
> index 5e5e3bb..b53a8a5 100644
> --- a/drivers/irqchip/irq-jcore-aic.c
> +++ b/drivers/irqchip/irq-jcore-aic.c
> @@ -25,12 +25,20 @@
>  
>  static struct irq_chip jcore_aic;
>  
> +static void handle_jcore_irq(struct irq_desc *desc)
> +{
> + if (irq_is_percpu(irq_desc_get_irq(desc)))
> + handle_percpu_irq(desc);
> + else
> + handle_simple_irq(desc);
> +}
> +
>  static int jcore_aic_irqdomain_map(struct irq_domain *d, unsigned int irq,
>  irq_hw_number_t hwirq)
>  {
>   struct irq_chip *aic = d->host_data;
>  
> - irq_set_chip_and_handler(irq, aic, handle_simple_irq);
> + irq_set_chip_and_handler(irq, aic, handle_jcore_irq);

What you should do is:

if (jcore_irq_per_cpu(hwirq))
irq_set_chip_and_handler(irq, aic, handle_percpu_irq);
else
irq_set_chip_and_handler(irq, aic, handle_simple_irq);

That avoids the conditional in the irq delivery path, which is overly
expensive as you end up looking up the irq descriptor which you already
have. You can avoid the lookup by using:

  if (irqd_is_per_cpu(irq_desc_get_irq_data(desc)))

but that's still a conditional in the hotpath which you can avoid entirely
by setting up the proper handler when you map it.

> Apologies for this big distraction for what was ultimately a driver
> bug on my side. I was misled by the way it seemed to be a regression,
> since the symptoms did not appear in earlier kernel versions for
> whatever reason (likely just chance).

No problem. Glad that I was able to help.

Thanks,

tglx


Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-10-08 Thread Rich Felker
On Sat, Oct 08, 2016 at 07:03:30PM +0200, Thomas Gleixner wrote:
> On Sat, 8 Oct 2016, Rich Felker wrote:
> > On Sat, Oct 08, 2016 at 01:32:06PM +0200, Thomas Gleixner wrote:
> > > CPU spins and waits for an interrupt to happen
> > > 
> > > 
> > >   -0 [000] d...   150.841530: rcu_dyntick: End 0 1
> > > 
> > > Dropping out of the spin about the time we expect the PIT interrupt to
> > > fire. And an interrupt is the reason why we drop out because the 'need
> > > resched' flag is not set and we end up in:
> > 
> > OK, I missed that.
> > 
> > >   -0 [000] d.s.   150.841724: tick_irq_enter: update 
> > > jiffies via irq
> > > 
> > > which is called from irq_enter()
> > > 
> > >   -0 [000] d.s.   150.841918: tick_do_update_jiffies64: 
> > > finished do_timer(1)
> > >   -0 [000] d.s.   150.842348: tick_do_update_jiffies64: 
> > > finished updating jiffies
> > > 
> > > 
> > > So here we would expect:
> > > 
> > >irq_handler_entry: irq=16 name=jcore_pit
> > >...
> > >irq_handler_exit .
> > >
> > > 
> > > or any other interrupt being invoked. But we see this here:
> > 
> > According to the trace the 'irq' is soft ('s'). I couldn't find the
> > code path from the idle loop to softirq but so maybe this is a bug. Is
> > it possible that in some cases the arch irq entry does not get
> > identified as a hard irq or traced and then the subsequent tick code
> > thinks it's running in a softirq and behaves differently? I could add
> > more tracing around irq entry.
> 
> No.
> 
> irq_enter()
>   if (is_idle_task(current) && !in_interrupt()) {
>   local_bh_disable();
>   tick_irq_enter();
>   local_bh_enable();
>   }
>   __irq_enter()
>   preempt_count_add(HARDIRQ_OFFSET);
> 
> So the 's' comes from local_bh_disable() and the code does not behave
> special because of that. It's just always that way. Look at all the other
> instances of tick_irq_enter() which do not show that issue.

Thank you -- that was really confusing me. Now that I know there's
actually a hardirq handling I know where to look for the problem.

> > >   -0 [000] d...   150.842603: __tick_nohz_idle_enter: 
> > > can stop idle tick
> > > 
> > > And that's just wrong. 
> > 
> > Can you explain?
> 
> Because you drop out the idle spin due to an interrupt, but no interrupt is
> handled according to the trace. You just go back to sleep and the trace is
> full of this behaviour.

Found the problem. IRQD_IRQ_INPROGRESS is causing the interrupt to be
ignored if the same interrupt is being handled on the other cpu.
Modifying the jcore aic driver to call handle_percpu_irq rather than
handle_simple_irq when the irq was registered as percpu solves the
problem, but I'm actually concerned that handle_simple_irq would also
be wrong in the case of non-percpu irqs if they could be delivered on
either core -- if another irq arrives after the driver is finished
checking for new device status, but before the IRQD_IRQ_INPROGRESS
flag is removed, it seems handle_simple_irq loses the irq. This is not
a problem for the current hardware since non-percpu irqs always arrive
on cpu0, but I'd rather improve the driver to properly handle possible
future hardware that allows irq delivery on any core.

> > > Both CPUs have the same interrupt number for their per cpu PIT interrupt.
> > > 
> > > IIRC you said earlier when the percpu interrupt discussion happened, that
> > > your per cpu PITs have distinct interrupt numbers, but that does not seem
> > > the case here.
> > 
> > No, I said the opposite. They use the same irq number but they're each
> > wired to their own cpu, and there's no way for them to fire on the
> > wrong cpu.
> 
> Your patch does:
> 
> > +   err = request_irq(pit_irq, jcore_timer_interrupt,
> > + IRQF_TIMER | IRQF_PERCPU,
> > + "jcore_pit", jcore_pit_percpu);
> 
> which is the wrong thing to do. You need request_percpu_irq() here, but of
> course with the irq chip you are using, which has every callback as a noop,
> it does not matter at all. So that's not an issue and not the reason for
> this wreckage.

Mentioning this was helpful to get me looking at the right things
tracking from the irq entry point to where the irq was lost/dropped,
so thanks for bringing it up. request_irq ends up working fine still
because IRQF_PERCPU causes __setup_irq to mark the irqdesc/irq_data as
percpu, so that my handle function can treat it differently. Here's
the patch that has it working:

diff --git a/drivers/irqchip/irq-jcore-aic.c b/drivers/irqchip/irq-jcore-aic.c
index 5e5e3bb..b53a8a5 100644
--- a/drivers/irqchip/irq-jcore-aic.c
+++ b/drivers/irqchip/irq-jcore-aic.c
@@ -25,12 +25,20 @@
 
 static struct irq_chip jcore_aic;
 
+static void handle_jcore_irq(struct irq_desc *desc)
+{
+   if (irq_is_percpu(irq_desc_get_irq(desc)))
+   handle_percpu_irq(desc);
+   

Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-10-08 Thread Rich Felker
On Sat, Oct 08, 2016 at 07:03:30PM +0200, Thomas Gleixner wrote:
> On Sat, 8 Oct 2016, Rich Felker wrote:
> > On Sat, Oct 08, 2016 at 01:32:06PM +0200, Thomas Gleixner wrote:
> > > CPU spins and waits for an interrupt to happen
> > > 
> > > 
> > >   -0 [000] d...   150.841530: rcu_dyntick: End 0 1
> > > 
> > > Dropping out of the spin about the time we expect the PIT interrupt to
> > > fire. And an interrupt is the reason why we drop out because the 'need
> > > resched' flag is not set and we end up in:
> > 
> > OK, I missed that.
> > 
> > >   -0 [000] d.s.   150.841724: tick_irq_enter: update 
> > > jiffies via irq
> > > 
> > > which is called from irq_enter()
> > > 
> > >   -0 [000] d.s.   150.841918: tick_do_update_jiffies64: 
> > > finished do_timer(1)
> > >   -0 [000] d.s.   150.842348: tick_do_update_jiffies64: 
> > > finished updating jiffies
> > > 
> > > 
> > > So here we would expect:
> > > 
> > >irq_handler_entry: irq=16 name=jcore_pit
> > >...
> > >irq_handler_exit .
> > >
> > > 
> > > or any other interrupt being invoked. But we see this here:
> > 
> > According to the trace the 'irq' is soft ('s'). I couldn't find the
> > code path from the idle loop to softirq but so maybe this is a bug. Is
> > it possible that in some cases the arch irq entry does not get
> > identified as a hard irq or traced and then the subsequent tick code
> > thinks it's running in a softirq and behaves differently? I could add
> > more tracing around irq entry.
> 
> No.
> 
> irq_enter()
>   if (is_idle_task(current) && !in_interrupt()) {
>   local_bh_disable();
>   tick_irq_enter();
>   local_bh_enable();
>   }
>   __irq_enter()
>   preempt_count_add(HARDIRQ_OFFSET);
> 
> So the 's' comes from local_bh_disable() and the code does not behave
> special because of that. It's just always that way. Look at all the other
> instances of tick_irq_enter() which do not show that issue.

Thank you -- that was really confusing me. Now that I know there's
actually a hardirq handling I know where to look for the problem.

> > >   -0 [000] d...   150.842603: __tick_nohz_idle_enter: 
> > > can stop idle tick
> > > 
> > > And that's just wrong. 
> > 
> > Can you explain?
> 
> Because you drop out the idle spin due to an interrupt, but no interrupt is
> handled according to the trace. You just go back to sleep and the trace is
> full of this behaviour.

Found the problem. IRQD_IRQ_INPROGRESS is causing the interrupt to be
ignored if the same interrupt is being handled on the other cpu.
Modifying the jcore aic driver to call handle_percpu_irq rather than
handle_simple_irq when the irq was registered as percpu solves the
problem, but I'm actually concerned that handle_simple_irq would also
be wrong in the case of non-percpu irqs if they could be delivered on
either core -- if another irq arrives after the driver is finished
checking for new device status, but before the IRQD_IRQ_INPROGRESS
flag is removed, it seems handle_simple_irq loses the irq. This is not
a problem for the current hardware since non-percpu irqs always arrive
on cpu0, but I'd rather improve the driver to properly handle possible
future hardware that allows irq delivery on any core.

> > > Both CPUs have the same interrupt number for their per cpu PIT interrupt.
> > > 
> > > IIRC you said earlier when the percpu interrupt discussion happened, that
> > > your per cpu PITs have distinct interrupt numbers, but that does not seem
> > > the case here.
> > 
> > No, I said the opposite. They use the same irq number but they're each
> > wired to their own cpu, and there's no way for them to fire on the
> > wrong cpu.
> 
> Your patch does:
> 
> > +   err = request_irq(pit_irq, jcore_timer_interrupt,
> > + IRQF_TIMER | IRQF_PERCPU,
> > + "jcore_pit", jcore_pit_percpu);
> 
> which is the wrong thing to do. You need request_percpu_irq() here, but of
> course with the irq chip you are using, which has every callback as a noop,
> it does not matter at all. So that's not an issue and not the reason for
> this wreckage.

Mentioning this was helpful to get me looking at the right things
tracking from the irq entry point to where the irq was lost/dropped,
so thanks for bringing it up. request_irq ends up working fine still
because IRQF_PERCPU causes __setup_irq to mark the irqdesc/irq_data as
percpu, so that my handle function can treat it differently. Here's
the patch that has it working:

diff --git a/drivers/irqchip/irq-jcore-aic.c b/drivers/irqchip/irq-jcore-aic.c
index 5e5e3bb..b53a8a5 100644
--- a/drivers/irqchip/irq-jcore-aic.c
+++ b/drivers/irqchip/irq-jcore-aic.c
@@ -25,12 +25,20 @@
 
 static struct irq_chip jcore_aic;
 
+static void handle_jcore_irq(struct irq_desc *desc)
+{
+   if (irq_is_percpu(irq_desc_get_irq(desc)))
+   handle_percpu_irq(desc);
+   

Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-10-08 Thread Thomas Gleixner
On Sat, 8 Oct 2016, Rich Felker wrote:
> On Sat, Oct 08, 2016 at 01:32:06PM +0200, Thomas Gleixner wrote:
> > CPU spins and waits for an interrupt to happen
> > 
> > 
> >   -0 [000] d...   150.841530: rcu_dyntick: End 0 1
> > 
> > Dropping out of the spin about the time we expect the PIT interrupt to
> > fire. And an interrupt is the reason why we drop out because the 'need
> > resched' flag is not set and we end up in:
> 
> OK, I missed that.
> 
> >   -0 [000] d.s.   150.841724: tick_irq_enter: update 
> > jiffies via irq
> > 
> > which is called from irq_enter()
> > 
> >   -0 [000] d.s.   150.841918: tick_do_update_jiffies64: 
> > finished do_timer(1)
> >   -0 [000] d.s.   150.842348: tick_do_update_jiffies64: 
> > finished updating jiffies
> > 
> > 
> > So here we would expect:
> > 
> >  irq_handler_entry: irq=16 name=jcore_pit
> >  ...
> >  irq_handler_exit .
> >  
> > 
> > or any other interrupt being invoked. But we see this here:
> 
> According to the trace the 'irq' is soft ('s'). I couldn't find the
> code path from the idle loop to softirq but so maybe this is a bug. Is
> it possible that in some cases the arch irq entry does not get
> identified as a hard irq or traced and then the subsequent tick code
> thinks it's running in a softirq and behaves differently? I could add
> more tracing around irq entry.

No.

irq_enter()
if (is_idle_task(current) && !in_interrupt()) {
local_bh_disable();
tick_irq_enter();
local_bh_enable();
}
__irq_enter()
preempt_count_add(HARDIRQ_OFFSET);

So the 's' comes from local_bh_disable() and the code does not behave
special because of that. It's just always that way. Look at all the other
instances of tick_irq_enter() which do not show that issue.

> >   -0 [000] d...   150.842603: __tick_nohz_idle_enter: can 
> > stop idle tick
> > 
> > And that's just wrong. 
> 
> Can you explain?

Because you drop out the idle spin due to an interrupt, but no interrupt is
handled according to the trace. You just go back to sleep and the trace is
full of this behaviour.

> > Both CPUs have the same interrupt number for their per cpu PIT interrupt.
> > 
> > IIRC you said earlier when the percpu interrupt discussion happened, that
> > your per cpu PITs have distinct interrupt numbers, but that does not seem
> > the case here.
> 
> No, I said the opposite. They use the same irq number but they're each
> wired to their own cpu, and there's no way for them to fire on the
> wrong cpu.

Your patch does:

> +   err = request_irq(pit_irq, jcore_timer_interrupt,
> + IRQF_TIMER | IRQF_PERCPU,
> + "jcore_pit", jcore_pit_percpu);

which is the wrong thing to do. You need request_percpu_irq() here, but of
course with the irq chip you are using, which has every callback as a noop,
it does not matter at all. So that's not an issue and not the reason for
this wreckage.

But what you need to figure out is why this happens:

100.00x  hrtimer_start(expires = 100.01)
100.00x  idle spin
100.01x  tick_irq_enter()
100.01x  tick_stop()

i.e. why do you drop out of your idle spin, take the low level interrupt
entry path which calls irq_enter() -> tick_irq_enter() and then you do not
handle any interrupt at all and drop back into the idle loop. That's the
real question and that's a problem in your architecture low level interrupt
entry code or some hardware related issue. But certainly not a bug in the
core tick code.

You can come up with another boat load of weird theories about bugs in
that code, but I won't even have a look _before_ you can explain why the
above sequence happens and the PIT timer interrupt is not handled.

Thanks,

tglx










Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-10-08 Thread Thomas Gleixner
On Sat, 8 Oct 2016, Rich Felker wrote:
> On Sat, Oct 08, 2016 at 01:32:06PM +0200, Thomas Gleixner wrote:
> > CPU spins and waits for an interrupt to happen
> > 
> > 
> >   -0 [000] d...   150.841530: rcu_dyntick: End 0 1
> > 
> > Dropping out of the spin about the time we expect the PIT interrupt to
> > fire. And an interrupt is the reason why we drop out because the 'need
> > resched' flag is not set and we end up in:
> 
> OK, I missed that.
> 
> >   -0 [000] d.s.   150.841724: tick_irq_enter: update 
> > jiffies via irq
> > 
> > which is called from irq_enter()
> > 
> >   -0 [000] d.s.   150.841918: tick_do_update_jiffies64: 
> > finished do_timer(1)
> >   -0 [000] d.s.   150.842348: tick_do_update_jiffies64: 
> > finished updating jiffies
> > 
> > 
> > So here we would expect:
> > 
> >  irq_handler_entry: irq=16 name=jcore_pit
> >  ...
> >  irq_handler_exit .
> >  
> > 
> > or any other interrupt being invoked. But we see this here:
> 
> According to the trace the 'irq' is soft ('s'). I couldn't find the
> code path from the idle loop to softirq but so maybe this is a bug. Is
> it possible that in some cases the arch irq entry does not get
> identified as a hard irq or traced and then the subsequent tick code
> thinks it's running in a softirq and behaves differently? I could add
> more tracing around irq entry.

No.

irq_enter()
if (is_idle_task(current) && !in_interrupt()) {
local_bh_disable();
tick_irq_enter();
local_bh_enable();
}
__irq_enter()
preempt_count_add(HARDIRQ_OFFSET);

So the 's' comes from local_bh_disable() and the code does not behave
special because of that. It's just always that way. Look at all the other
instances of tick_irq_enter() which do not show that issue.

> >   -0 [000] d...   150.842603: __tick_nohz_idle_enter: can 
> > stop idle tick
> > 
> > And that's just wrong. 
> 
> Can you explain?

Because you drop out the idle spin due to an interrupt, but no interrupt is
handled according to the trace. You just go back to sleep and the trace is
full of this behaviour.

> > Both CPUs have the same interrupt number for their per cpu PIT interrupt.
> > 
> > IIRC you said earlier when the percpu interrupt discussion happened, that
> > your per cpu PITs have distinct interrupt numbers, but that does not seem
> > the case here.
> 
> No, I said the opposite. They use the same irq number but they're each
> wired to their own cpu, and there's no way for them to fire on the
> wrong cpu.

Your patch does:

> +   err = request_irq(pit_irq, jcore_timer_interrupt,
> + IRQF_TIMER | IRQF_PERCPU,
> + "jcore_pit", jcore_pit_percpu);

which is the wrong thing to do. You need request_percpu_irq() here, but of
course with the irq chip you are using, which has every callback as a noop,
it does not matter at all. So that's not an issue and not the reason for
this wreckage.

But what you need to figure out is why this happens:

100.00x  hrtimer_start(expires = 100.01)
100.00x  idle spin
100.01x  tick_irq_enter()
100.01x  tick_stop()

i.e. why do you drop out of your idle spin, take the low level interrupt
entry path which calls irq_enter() -> tick_irq_enter() and then you do not
handle any interrupt at all and drop back into the idle loop. That's the
real question and that's a problem in your architecture low level interrupt
entry code or some hardware related issue. But certainly not a bug in the
core tick code.

You can come up with another boat load of weird theories about bugs in
that code, but I won't even have a look _before_ you can explain why the
above sequence happens and the PIT timer interrupt is not handled.

Thanks,

tglx










Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-10-08 Thread Rich Felker
On Sat, Oct 08, 2016 at 01:32:06PM +0200, Thomas Gleixner wrote:
> On Fri, 7 Oct 2016, Rich Felker wrote:
> > 
> > If I'm not mistaken, the bug is in tick_nohz_restart. According to the
> 
> I think you are mistaken. Let's look at CPU0 only:

OK.

>   -0 [000] d...   150.829698: __tick_nohz_idle_enter: can 
> stop idle tick
>   -0 [000] d...   150.829774: __tick_nohz_idle_enter: 
> entering stop_sched_tick
>   -0 [000] d...   150.830001: tick_stop: success=1 
> dependency=NONE
>   -0 [000] d...   150.830127: hrtimer_cancel: 
> hrtimer=109e949c
>   -0 [000] d...   150.830380: jcore_pit_set_next_event: pit 
> 109eccb0 0 set delta 82124660 en=441 cntr=75 throt=82124660
>   -0 [000] d...   150.830498: hrtimer_start: 
> hrtimer=109e949c function=tick_sched_timer expires=15084000 
> softexpires=15084000
>   -0 [000] d...   150.830720: jcore_pit_set_next_event: pit 
> 109eccb0 0 set delta 539823 en=441 cntr=34 throt=539823
> 
> Tick is programmed to fire at 150.84
> 
>   -0 [000] d...   150.830822: __tick_nohz_idle_enter: 
> returning from stop_sched_tick
>   -0 [000] d...   150.830940: rcu_dyntick: Start 
> 140 0
>   -0 [000] d...   150.831072: rcu_dyntick: End 0 1
>   -0 [000] d...   150.831165: cpu_idle: state=0 cpu_id=0
>   -0 [000] d...   150.831269: rcu_dyntick: Start 1 0
>   -0 [000]    150.831368: cpu_startup_entry: enabled 
> irqs, spinning
> 
> CPU spins and waits for an interrupt to happen
> 
> 
>   -0 [000] d...   150.841530: rcu_dyntick: End 0 1
> 
> Dropping out of the spin about the time we expect the PIT interrupt to
> fire. And an interrupt is the reason why we drop out because the 'need
> resched' flag is not set and we end up in:

OK, I missed that.

>   -0 [000] d.s.   150.841724: tick_irq_enter: update 
> jiffies via irq
> 
> which is called from irq_enter()
> 
>   -0 [000] d.s.   150.841918: tick_do_update_jiffies64: 
> finished do_timer(1)
>   -0 [000] d.s.   150.842348: tick_do_update_jiffies64: 
> finished updating jiffies
> 
> 
> So here we would expect:
> 
>irq_handler_entry: irq=16 name=jcore_pit
>...
>irq_handler_exit .
>
> 
> or any other interrupt being invoked. But we see this here:

According to the trace the 'irq' is soft ('s'). I couldn't find the
code path from the idle loop to softirq but so maybe this is a bug. Is
it possible that in some cases the arch irq entry does not get
identified as a hard irq or traced and then the subsequent tick code
thinks it's running in a softirq and behaves differently? I could add
more tracing around irq entry.

>   -0 [000] d...   150.842603: __tick_nohz_idle_enter: can 
> stop idle tick
> 
> And that's just wrong. 

Can you explain?

> Now looking at CPU1 at the same time:
> 
>   -0 [001] d.h.   150.841829: irq_handler_entry: irq=16 
> name=jcore_
>   -0 [001] d.h.   150.842054: hrtimer_cancel: 
> hrtimer=109f449c
>   -0 [001] d.h.   150.842218: hrtimer_expire_entry: 
> hrtimer=109f449c function=tick_sched_timer now=150840579940
> 
> So there the PIT interrupt enters and runs the interrupt handler.
> 
> Now looking deeper at the PIT interrupts:
> 
>  kworker/0:1-208   [000] d.h.   150.641822: irq_handler_entry: irq=16 
> name=jcore_pit
>  kworker/0:1-208   [000] d.h.   150.644928: irq_handler_exit: irq=16 
> ret=handled
> 
> 
>   -0 [001] dnh.   150.645370: irq_handler_entry: irq=16 
> name=jcore_pit
>   -0 [001] dnh.   150.647444: irq_handler_exit: irq=16 
> ret=handled
> 
> Both CPUs have the same interrupt number for their per cpu PIT interrupt.
> 
> IIRC you said earlier when the percpu interrupt discussion happened, that
> your per cpu PITs have distinct interrupt numbers, but that does not seem
> the case here.

No, I said the opposite. They use the same irq number but they're each
wired to their own cpu, and there's no way for them to fire on the
wrong cpu.

Rich


Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-10-08 Thread Rich Felker
On Sat, Oct 08, 2016 at 01:32:06PM +0200, Thomas Gleixner wrote:
> On Fri, 7 Oct 2016, Rich Felker wrote:
> > 
> > If I'm not mistaken, the bug is in tick_nohz_restart. According to the
> 
> I think you are mistaken. Let's look at CPU0 only:

OK.

>   -0 [000] d...   150.829698: __tick_nohz_idle_enter: can 
> stop idle tick
>   -0 [000] d...   150.829774: __tick_nohz_idle_enter: 
> entering stop_sched_tick
>   -0 [000] d...   150.830001: tick_stop: success=1 
> dependency=NONE
>   -0 [000] d...   150.830127: hrtimer_cancel: 
> hrtimer=109e949c
>   -0 [000] d...   150.830380: jcore_pit_set_next_event: pit 
> 109eccb0 0 set delta 82124660 en=441 cntr=75 throt=82124660
>   -0 [000] d...   150.830498: hrtimer_start: 
> hrtimer=109e949c function=tick_sched_timer expires=15084000 
> softexpires=15084000
>   -0 [000] d...   150.830720: jcore_pit_set_next_event: pit 
> 109eccb0 0 set delta 539823 en=441 cntr=34 throt=539823
> 
> Tick is programmed to fire at 150.84
> 
>   -0 [000] d...   150.830822: __tick_nohz_idle_enter: 
> returning from stop_sched_tick
>   -0 [000] d...   150.830940: rcu_dyntick: Start 
> 140 0
>   -0 [000] d...   150.831072: rcu_dyntick: End 0 1
>   -0 [000] d...   150.831165: cpu_idle: state=0 cpu_id=0
>   -0 [000] d...   150.831269: rcu_dyntick: Start 1 0
>   -0 [000]    150.831368: cpu_startup_entry: enabled 
> irqs, spinning
> 
> CPU spins and waits for an interrupt to happen
> 
> 
>   -0 [000] d...   150.841530: rcu_dyntick: End 0 1
> 
> Dropping out of the spin about the time we expect the PIT interrupt to
> fire. And an interrupt is the reason why we drop out because the 'need
> resched' flag is not set and we end up in:

OK, I missed that.

>   -0 [000] d.s.   150.841724: tick_irq_enter: update 
> jiffies via irq
> 
> which is called from irq_enter()
> 
>   -0 [000] d.s.   150.841918: tick_do_update_jiffies64: 
> finished do_timer(1)
>   -0 [000] d.s.   150.842348: tick_do_update_jiffies64: 
> finished updating jiffies
> 
> 
> So here we would expect:
> 
>irq_handler_entry: irq=16 name=jcore_pit
>...
>irq_handler_exit .
>
> 
> or any other interrupt being invoked. But we see this here:

According to the trace the 'irq' is soft ('s'). I couldn't find the
code path from the idle loop to softirq but so maybe this is a bug. Is
it possible that in some cases the arch irq entry does not get
identified as a hard irq or traced and then the subsequent tick code
thinks it's running in a softirq and behaves differently? I could add
more tracing around irq entry.

>   -0 [000] d...   150.842603: __tick_nohz_idle_enter: can 
> stop idle tick
> 
> And that's just wrong. 

Can you explain?

> Now looking at CPU1 at the same time:
> 
>   -0 [001] d.h.   150.841829: irq_handler_entry: irq=16 
> name=jcore_
>   -0 [001] d.h.   150.842054: hrtimer_cancel: 
> hrtimer=109f449c
>   -0 [001] d.h.   150.842218: hrtimer_expire_entry: 
> hrtimer=109f449c function=tick_sched_timer now=150840579940
> 
> So there the PIT interrupt enters and runs the interrupt handler.
> 
> Now looking deeper at the PIT interrupts:
> 
>  kworker/0:1-208   [000] d.h.   150.641822: irq_handler_entry: irq=16 
> name=jcore_pit
>  kworker/0:1-208   [000] d.h.   150.644928: irq_handler_exit: irq=16 
> ret=handled
> 
> 
>   -0 [001] dnh.   150.645370: irq_handler_entry: irq=16 
> name=jcore_pit
>   -0 [001] dnh.   150.647444: irq_handler_exit: irq=16 
> ret=handled
> 
> Both CPUs have the same interrupt number for their per cpu PIT interrupt.
> 
> IIRC you said earlier when the percpu interrupt discussion happened, that
> your per cpu PITs have distinct interrupt numbers, but that does not seem
> the case here.

No, I said the opposite. They use the same irq number but they're each
wired to their own cpu, and there's no way for them to fire on the
wrong cpu.

Rich


Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-10-08 Thread Thomas Gleixner
On Fri, 7 Oct 2016, Rich Felker wrote:
> 
> If I'm not mistaken, the bug is in tick_nohz_restart. According to the

I think you are mistaken. Let's look at CPU0 only:

  -0 [000] d...   150.829698: __tick_nohz_idle_enter: can 
stop idle tick
  -0 [000] d...   150.829774: __tick_nohz_idle_enter: 
entering stop_sched_tick
  -0 [000] d...   150.830001: tick_stop: success=1 
dependency=NONE
  -0 [000] d...   150.830127: hrtimer_cancel: hrtimer=109e949c
  -0 [000] d...   150.830380: jcore_pit_set_next_event: pit 
109eccb0 0 set delta 82124660 en=441 cntr=75 throt=82124660
  -0 [000] d...   150.830498: hrtimer_start: hrtimer=109e949c 
function=tick_sched_timer expires=15084000 softexpires=15084000
  -0 [000] d...   150.830720: jcore_pit_set_next_event: pit 
109eccb0 0 set delta 539823 en=441 cntr=34 throt=539823

Tick is programmed to fire at 150.84

  -0 [000] d...   150.830822: __tick_nohz_idle_enter: 
returning from stop_sched_tick
  -0 [000] d...   150.830940: rcu_dyntick: Start 
140 0
  -0 [000] d...   150.831072: rcu_dyntick: End 0 1
  -0 [000] d...   150.831165: cpu_idle: state=0 cpu_id=0
  -0 [000] d...   150.831269: rcu_dyntick: Start 1 0
  -0 [000]    150.831368: cpu_startup_entry: enabled 
irqs, spinning

CPU spins and waits for an interrupt to happen


  -0 [000] d...   150.841530: rcu_dyntick: End 0 1

Dropping out of the spin about the time we expect the PIT interrupt to
fire. And an interrupt is the reason why we drop out because the 'need
resched' flag is not set and we end up in:

  -0 [000] d.s.   150.841724: tick_irq_enter: update jiffies 
via irq

which is called from irq_enter()

  -0 [000] d.s.   150.841918: tick_do_update_jiffies64: 
finished do_timer(1)
  -0 [000] d.s.   150.842348: tick_do_update_jiffies64: 
finished updating jiffies


So here we would expect:

 irq_handler_entry: irq=16 name=jcore_pit
 ...
 irq_handler_exit .
 

or any other interrupt being invoked. But we see this here:

  -0 [000] d...   150.842603: __tick_nohz_idle_enter: can 
stop idle tick

And that's just wrong. 

Now looking at CPU1 at the same time:

  -0 [001] d.h.   150.841829: irq_handler_entry: irq=16 
name=jcore_  -0 [001] d.h.   150.842054: hrtimer_cancel: 
hrtimer=109f449c
  -0 [001] d.h.   150.842218: hrtimer_expire_entry: 
hrtimer=109f449c function=tick_sched_timer now=150840579940

So there the PIT interrupt enters and runs the interrupt handler.

Now looking deeper at the PIT interrupts:

 kworker/0:1-208   [000] d.h.   150.641822: irq_handler_entry: irq=16 
name=jcore_pit
 kworker/0:1-208   [000] d.h.   150.644928: irq_handler_exit: irq=16 
ret=handled


  -0 [001] dnh.   150.645370: irq_handler_entry: irq=16 
name=jcore_pit
  -0 [001] dnh.   150.647444: irq_handler_exit: irq=16 
ret=handled

Both CPUs have the same interrupt number for their per cpu PIT interrupt.

IIRC you said earlier when the percpu interrupt discussion happened, that
your per cpu PITs have distinct interrupt numbers, but that does not seem
the case here.

That's where you have to dig.

Thanks,

tglx




Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-10-08 Thread Thomas Gleixner
On Fri, 7 Oct 2016, Rich Felker wrote:
> 
> If I'm not mistaken, the bug is in tick_nohz_restart. According to the

I think you are mistaken. Let's look at CPU0 only:

  -0 [000] d...   150.829698: __tick_nohz_idle_enter: can 
stop idle tick
  -0 [000] d...   150.829774: __tick_nohz_idle_enter: 
entering stop_sched_tick
  -0 [000] d...   150.830001: tick_stop: success=1 
dependency=NONE
  -0 [000] d...   150.830127: hrtimer_cancel: hrtimer=109e949c
  -0 [000] d...   150.830380: jcore_pit_set_next_event: pit 
109eccb0 0 set delta 82124660 en=441 cntr=75 throt=82124660
  -0 [000] d...   150.830498: hrtimer_start: hrtimer=109e949c 
function=tick_sched_timer expires=15084000 softexpires=15084000
  -0 [000] d...   150.830720: jcore_pit_set_next_event: pit 
109eccb0 0 set delta 539823 en=441 cntr=34 throt=539823

Tick is programmed to fire at 150.84

  -0 [000] d...   150.830822: __tick_nohz_idle_enter: 
returning from stop_sched_tick
  -0 [000] d...   150.830940: rcu_dyntick: Start 
140 0
  -0 [000] d...   150.831072: rcu_dyntick: End 0 1
  -0 [000] d...   150.831165: cpu_idle: state=0 cpu_id=0
  -0 [000] d...   150.831269: rcu_dyntick: Start 1 0
  -0 [000]    150.831368: cpu_startup_entry: enabled 
irqs, spinning

CPU spins and waits for an interrupt to happen


  -0 [000] d...   150.841530: rcu_dyntick: End 0 1

Dropping out of the spin about the time we expect the PIT interrupt to
fire. And an interrupt is the reason why we drop out because the 'need
resched' flag is not set and we end up in:

  -0 [000] d.s.   150.841724: tick_irq_enter: update jiffies 
via irq

which is called from irq_enter()

  -0 [000] d.s.   150.841918: tick_do_update_jiffies64: 
finished do_timer(1)
  -0 [000] d.s.   150.842348: tick_do_update_jiffies64: 
finished updating jiffies


So here we would expect:

 irq_handler_entry: irq=16 name=jcore_pit
 ...
 irq_handler_exit .
 

or any other interrupt being invoked. But we see this here:

  -0 [000] d...   150.842603: __tick_nohz_idle_enter: can 
stop idle tick

And that's just wrong. 

Now looking at CPU1 at the same time:

  -0 [001] d.h.   150.841829: irq_handler_entry: irq=16 
name=jcore_  -0 [001] d.h.   150.842054: hrtimer_cancel: 
hrtimer=109f449c
  -0 [001] d.h.   150.842218: hrtimer_expire_entry: 
hrtimer=109f449c function=tick_sched_timer now=150840579940

So there the PIT interrupt enters and runs the interrupt handler.

Now looking deeper at the PIT interrupts:

 kworker/0:1-208   [000] d.h.   150.641822: irq_handler_entry: irq=16 
name=jcore_pit
 kworker/0:1-208   [000] d.h.   150.644928: irq_handler_exit: irq=16 
ret=handled


  -0 [001] dnh.   150.645370: irq_handler_entry: irq=16 
name=jcore_pit
  -0 [001] dnh.   150.647444: irq_handler_exit: irq=16 
ret=handled

Both CPUs have the same interrupt number for their per cpu PIT interrupt.

IIRC you said earlier when the percpu interrupt discussion happened, that
your per cpu PITs have distinct interrupt numbers, but that does not seem
the case here.

That's where you have to dig.

Thanks,

tglx




Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-10-07 Thread Rich Felker
On Fri, Sep 30, 2016 at 03:15:11PM +0200, Thomas Gleixner wrote:
> On Tue, 27 Sep 2016, Rich Felker wrote:
> > I've managed to get a trace with a stall. I'm not sure what the best
> > way to share the full thing is, since it's large, but here are the
> > potentially interesting parts.
> 
> Upload it somewhere.
>  
> > The first is a big time gap with no events, from 82.446093 to
> > 109.852709:
> > 
> 
> >   -0 [001] d.h.82.443398: irq_handler_entry: irq=16 
> > name=jcore_pit
> >   -0 [001] d.h.82.443573: hrtimer_cancel: 
> > hrtimer=109f449c
> >   -0 [001] d.h.82.443689: hrtimer_expire_entry: 
> > hrtimer=109f449c function=tick_sched_timer now=82356112320
> 
> Tick timer expires
> 
> >   -0 [001] d.h.82.443911: softirq_raise: vec=1 
> > [action=TIMER]
> >   -0 [001] d.h.82.444089: softirq_raise: vec=9 
> > [action=RCU]
> >   -0 [001] d.h.82.444306: softirq_raise: vec=7 
> > [action=SCHED]
> 
> Raises softirqs
> 
> >   -0 [001] d.h.82.48: hrtimer_expire_exit: 
> > hrtimer=109f449c
> >   -0 [001] d.h.82.444592: hrtimer_start: 
> > hrtimer=109f449c function=tick_sched_timer expires=8236000 
> > softexpires=8236000
> 
> Rearms the next period
> 
> >   -0 [001] d.h.82.444821: irq_handler_exit: irq=16 
> > ret=handled
> >   -0 [001] ..s.82.445034: softirq_entry: vec=1 
> > [action=TIMER]
> >   -0 [001] ..s.82.445222: softirq_exit: vec=1 
> > [action=TIMER]
> >   -0 [001] ..s.82.445341: softirq_entry: vec=7 
> > [action=SCHED]
> >   -0 [001] ..s.82.445745: softirq_exit: vec=7 
> > [action=SCHED]
> >   -0 [001] ..s.82.445880: softirq_entry: vec=9 
> > [action=RCU]
> >   -0 [001] ..s.82.446093: softirq_exit: vec=9 
> > [action=RCU]
> 
> Softirqs done and the machine vanishes into lala land 
> 
> And now after that gap we have a cancel/start on CPU 0
> 
> >   -0 [000] dn..   109.852709: hrtimer_cancel: 
> > hrtimer=109e949c
> >   -0 [000] dn..   109.853043: hrtimer_start: 
> > hrtimer=109e949c function=tick_sched_timer expires=10977000 
> > softexpires=10977000
> 
> >   sh-388   [000] d.h.   109.855851: irq_handler_entry: irq=18 
> > name=uartlite
> >   sh-388   [000] d.h.   109.856751: irq_handler_exit: irq=18 
> > ret=handled
> >   sh-388   [000] d.h.   109.857264: irq_handler_entry: irq=18 
> > name=uartlite
> >   sh-388   [000] d.h.   109.857458: irq_handler_exit: irq=18 
> > ret=unhandled
> >   -0 [001] d.h.   109.857684: irq_handler_entry: irq=16 
> > name=jcore_pit
> >   -0 [001] d.h.   109.857934: hrtimer_cancel: 
> > hrtimer=109f449c
> >   -0 [001] d.h.   109.858087: hrtimer_expire_entry: 
> > hrtimer=109f449c function=tick_sched_timer now=109770452820
> 
> And this expiry happens on CPU 1, but the expiry time looks like the one
> which got armed on CPU 0 above !?!?!
> 
> >   -0 [001] d.h.   109.858639: softirq_raise: vec=1 
> > [action=TIMER]
> >   -0 [001] d.h.   109.858839: softirq_raise: vec=9 
> > [action=RCU]
> >   -0 [001] d.h.   109.859097: softirq_raise: vec=7 
> > [action=SCHED]
> >   -0 [001] d.h.   109.859282: hrtimer_expire_exit: 
> > hrtimer=109f449c
> >   -0 [001] d.h.   109.859428: hrtimer_start: 
> > hrtimer=109f449c function=tick_sched_timer expires=10978000 
> > softexpires=10978000
> >   -0 [001] d.h.   109.859637: irq_handler_exit: irq=16 
> > ret=handled
> >   -0 [001] ..s.   109.859806: softirq_entry: vec=1 
> > [action=TIMER]
> >   -0 [001] ..s.   109.859984: softirq_exit: vec=1 
> > [action=TIMER]
> >   -0 [001] ..s.   109.860098: softirq_entry: vec=7 
> > [action=SCHED]
> >   -0 [001] ..s.   109.860533: softirq_exit: vec=7 
> > [action=SCHED]
> >   -0 [001] ..s.   109.860663: softirq_entry: vec=9 
> > [action=RCU]
> >   -0 [001] .ns.   109.861424: softirq_exit: vec=9 
> > [action=RCU]
> >   sh-388   [000] d.h.   109.861831: irq_handler_entry: irq=16 
> > name=jcore_pit
> >   sh-388   [000] d.h.   109.862071: hrtimer_cancel: 
> > hrtimer=109e949c
> >rcu_sched-7 [001]    109.862141: timer_init: timer=160a9eb0
> >   sh-388   [000] d.h.   109.862233: hrtimer_expire_entry: 
> > hrtimer=109e949c function=tick_sched_timer now=109774586740
> 
> And CPU0 has it's timer firing 4.5ms late .
> 
> >rcu_sched-7 [001] d...   109.862317: timer_start: timer=160a9eb0 
> > function=process_timeout expires=4294948275 [timeout=1] flags=0x0001
> > 
> > However no stall message appeared on the console here, and I suspect
> > it might have just been a gap in event recording because it doesn't
> > look terribly 

Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-10-07 Thread Rich Felker
On Fri, Sep 30, 2016 at 03:15:11PM +0200, Thomas Gleixner wrote:
> On Tue, 27 Sep 2016, Rich Felker wrote:
> > I've managed to get a trace with a stall. I'm not sure what the best
> > way to share the full thing is, since it's large, but here are the
> > potentially interesting parts.
> 
> Upload it somewhere.
>  
> > The first is a big time gap with no events, from 82.446093 to
> > 109.852709:
> > 
> 
> >   -0 [001] d.h.82.443398: irq_handler_entry: irq=16 
> > name=jcore_pit
> >   -0 [001] d.h.82.443573: hrtimer_cancel: 
> > hrtimer=109f449c
> >   -0 [001] d.h.82.443689: hrtimer_expire_entry: 
> > hrtimer=109f449c function=tick_sched_timer now=82356112320
> 
> Tick timer expires
> 
> >   -0 [001] d.h.82.443911: softirq_raise: vec=1 
> > [action=TIMER]
> >   -0 [001] d.h.82.444089: softirq_raise: vec=9 
> > [action=RCU]
> >   -0 [001] d.h.82.444306: softirq_raise: vec=7 
> > [action=SCHED]
> 
> Raises softirqs
> 
> >   -0 [001] d.h.82.48: hrtimer_expire_exit: 
> > hrtimer=109f449c
> >   -0 [001] d.h.82.444592: hrtimer_start: 
> > hrtimer=109f449c function=tick_sched_timer expires=8236000 
> > softexpires=8236000
> 
> Rearms the next period
> 
> >   -0 [001] d.h.82.444821: irq_handler_exit: irq=16 
> > ret=handled
> >   -0 [001] ..s.82.445034: softirq_entry: vec=1 
> > [action=TIMER]
> >   -0 [001] ..s.82.445222: softirq_exit: vec=1 
> > [action=TIMER]
> >   -0 [001] ..s.82.445341: softirq_entry: vec=7 
> > [action=SCHED]
> >   -0 [001] ..s.82.445745: softirq_exit: vec=7 
> > [action=SCHED]
> >   -0 [001] ..s.82.445880: softirq_entry: vec=9 
> > [action=RCU]
> >   -0 [001] ..s.82.446093: softirq_exit: vec=9 
> > [action=RCU]
> 
> Softirqs done and the machine vanishes into lala land 
> 
> And now after that gap we have a cancel/start on CPU 0
> 
> >   -0 [000] dn..   109.852709: hrtimer_cancel: 
> > hrtimer=109e949c
> >   -0 [000] dn..   109.853043: hrtimer_start: 
> > hrtimer=109e949c function=tick_sched_timer expires=10977000 
> > softexpires=10977000
> 
> >   sh-388   [000] d.h.   109.855851: irq_handler_entry: irq=18 
> > name=uartlite
> >   sh-388   [000] d.h.   109.856751: irq_handler_exit: irq=18 
> > ret=handled
> >   sh-388   [000] d.h.   109.857264: irq_handler_entry: irq=18 
> > name=uartlite
> >   sh-388   [000] d.h.   109.857458: irq_handler_exit: irq=18 
> > ret=unhandled
> >   -0 [001] d.h.   109.857684: irq_handler_entry: irq=16 
> > name=jcore_pit
> >   -0 [001] d.h.   109.857934: hrtimer_cancel: 
> > hrtimer=109f449c
> >   -0 [001] d.h.   109.858087: hrtimer_expire_entry: 
> > hrtimer=109f449c function=tick_sched_timer now=109770452820
> 
> And this expiry happens on CPU 1, but the expiry time looks like the one
> which got armed on CPU 0 above !?!?!
> 
> >   -0 [001] d.h.   109.858639: softirq_raise: vec=1 
> > [action=TIMER]
> >   -0 [001] d.h.   109.858839: softirq_raise: vec=9 
> > [action=RCU]
> >   -0 [001] d.h.   109.859097: softirq_raise: vec=7 
> > [action=SCHED]
> >   -0 [001] d.h.   109.859282: hrtimer_expire_exit: 
> > hrtimer=109f449c
> >   -0 [001] d.h.   109.859428: hrtimer_start: 
> > hrtimer=109f449c function=tick_sched_timer expires=10978000 
> > softexpires=10978000
> >   -0 [001] d.h.   109.859637: irq_handler_exit: irq=16 
> > ret=handled
> >   -0 [001] ..s.   109.859806: softirq_entry: vec=1 
> > [action=TIMER]
> >   -0 [001] ..s.   109.859984: softirq_exit: vec=1 
> > [action=TIMER]
> >   -0 [001] ..s.   109.860098: softirq_entry: vec=7 
> > [action=SCHED]
> >   -0 [001] ..s.   109.860533: softirq_exit: vec=7 
> > [action=SCHED]
> >   -0 [001] ..s.   109.860663: softirq_entry: vec=9 
> > [action=RCU]
> >   -0 [001] .ns.   109.861424: softirq_exit: vec=9 
> > [action=RCU]
> >   sh-388   [000] d.h.   109.861831: irq_handler_entry: irq=16 
> > name=jcore_pit
> >   sh-388   [000] d.h.   109.862071: hrtimer_cancel: 
> > hrtimer=109e949c
> >rcu_sched-7 [001]    109.862141: timer_init: timer=160a9eb0
> >   sh-388   [000] d.h.   109.862233: hrtimer_expire_entry: 
> > hrtimer=109e949c function=tick_sched_timer now=109774586740
> 
> And CPU0 has it's timer firing 4.5ms late .
> 
> >rcu_sched-7 [001] d...   109.862317: timer_start: timer=160a9eb0 
> > function=process_timeout expires=4294948275 [timeout=1] flags=0x0001
> > 
> > However no stall message appeared on the console here, and I suspect
> > it might have just been a gap in event recording because it doesn't
> > look terribly 

Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-10-05 Thread Paul E. McKenney
On Tue, Oct 04, 2016 at 05:48:18PM -0400, Rich Felker wrote:
> On Tue, Oct 04, 2016 at 02:14:51PM -0700, Paul E. McKenney wrote:
> > On Tue, Oct 04, 2016 at 04:58:37PM -0400, Rich Felker wrote:
> > > On Tue, Oct 04, 2016 at 12:06:23AM -0700, Paul E. McKenney wrote:
> > > > On Mon, Oct 03, 2016 at 06:10:39PM -0400, Rich Felker wrote:
> > > > > On Mon, Sep 26, 2016 at 11:27:14PM +0200, Daniel Lezcano wrote:
> > > > > > On 26/09/2016 23:07, Rich Felker wrote:
> > > > > > > Ping. Is there anything that still needs to be changed for this 
> > > > > > > driver
> > > > > > > to be acceptable?
> > > > > > 
> > > > > > It is on my radar. I'm reviewing it.
> > > > > > 
> > > > > > Can you elaborate the workaround mentioned in the changelog. I have 
> > > > > > been
> > > > > > digging into the lkml@ thread but it is not clear if the issue is
> > > > > > related to the time framework, the driver itself or whatever else. 
> > > > > > Can
> > > > > > you clarify that ?
> > > > > 
> > > > > Do you have comments on any remaining issues other than this
> > > > > workaround? If not I don't mind removing the workaround and trying to
> > > > > solve the issue separately later. Let me know and either way I'll
> > > > > submit a v8.
> > > > 
> > > > One question of interest to me is whether this patchset prevents the
> > > > RCU CPU stall warnings that you are seeing.
> > > 
> > > With the 5ms minimum delta, I didn't observe any rcu_sched stall
> > > warnings. At 2.5ms I thought it was gone but eventually saw one. With
> > > the previous mindelta = 1, i.e. 1 hardware bus period, I get the
> > > stalls regularly.
> > 
> > Sounds to me like your low-level clock drivers or your clock hardware is
> > having trouble dealing with short timeouts.  I suggest writing a focused
> > test.  It is of course quite possible that the failure could occur for
> > any time period, but simply becomes less probable with longer time
> > periods.
> 
> I don't see any indication of such a problem on the hardware side, and
> it didn't happen in earlier kernels, though of course it's possible
> that they were programming different intervals. And all of the
> intervals logged when I added trace logging were way above the scale
> where it would be plausible for the hardware to be an issue.

Is it possible that the problem occurs when the process setting up the
timeout is delayed for longer than the timeout duration, so that the
time has already expired at the time that the hardware is programmed?

> > Or perhaps better, do the tracing that Thomas Gleixner suggested.  Or both,
> > in order to get the most information in the shortest period of time.
> 
> I'm starting with new traces with "nohlt" on the kernel command line,
> since many of the stalls I've observed up to now may have a different
> mechanism -- since the arch_cpu_idle code doesn't work (returns
> immediately) the rapid rcu idle enter/exit in the cpu idle loop might
> be part of what was interfering with rcu_sched.

That doesn't sound good, though I cannot immediately think of a reason
why that would cause RCU much trouble.

> With my new traces and LED debugging (which I still need to tune with
> more logging, then I'll post some more results) what I'm seeing when
> the stall happens is the timer interrupt firing as expected and making
> it into __hrtimer_run_queues, but tick_sched_timer not getting called
> and thus rcu_sched never waking up from its 1s schedule_timeout (via
> swait).

It would be very interesting to track down exactly the point at which
the system decides not to fire the timeout.

Thanx, Paul



Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-10-05 Thread Paul E. McKenney
On Tue, Oct 04, 2016 at 05:48:18PM -0400, Rich Felker wrote:
> On Tue, Oct 04, 2016 at 02:14:51PM -0700, Paul E. McKenney wrote:
> > On Tue, Oct 04, 2016 at 04:58:37PM -0400, Rich Felker wrote:
> > > On Tue, Oct 04, 2016 at 12:06:23AM -0700, Paul E. McKenney wrote:
> > > > On Mon, Oct 03, 2016 at 06:10:39PM -0400, Rich Felker wrote:
> > > > > On Mon, Sep 26, 2016 at 11:27:14PM +0200, Daniel Lezcano wrote:
> > > > > > On 26/09/2016 23:07, Rich Felker wrote:
> > > > > > > Ping. Is there anything that still needs to be changed for this 
> > > > > > > driver
> > > > > > > to be acceptable?
> > > > > > 
> > > > > > It is on my radar. I'm reviewing it.
> > > > > > 
> > > > > > Can you elaborate the workaround mentioned in the changelog. I have 
> > > > > > been
> > > > > > digging into the lkml@ thread but it is not clear if the issue is
> > > > > > related to the time framework, the driver itself or whatever else. 
> > > > > > Can
> > > > > > you clarify that ?
> > > > > 
> > > > > Do you have comments on any remaining issues other than this
> > > > > workaround? If not I don't mind removing the workaround and trying to
> > > > > solve the issue separately later. Let me know and either way I'll
> > > > > submit a v8.
> > > > 
> > > > One question of interest to me is whether this patchset prevents the
> > > > RCU CPU stall warnings that you are seeing.
> > > 
> > > With the 5ms minimum delta, I didn't observe any rcu_sched stall
> > > warnings. At 2.5ms I thought it was gone but eventually saw one. With
> > > the previous mindelta = 1, i.e. 1 hardware bus period, I get the
> > > stalls regularly.
> > 
> > Sounds to me like your low-level clock drivers or your clock hardware is
> > having trouble dealing with short timeouts.  I suggest writing a focused
> > test.  It is of course quite possible that the failure could occur for
> > any time period, but simply becomes less probable with longer time
> > periods.
> 
> I don't see any indication of such a problem on the hardware side, and
> it didn't happen in earlier kernels, though of course it's possible
> that they were programming different intervals. And all of the
> intervals logged when I added trace logging were way above the scale
> where it would be plausible for the hardware to be an issue.

Is it possible that the problem occurs when the process setting up the
timeout is delayed for longer than the timeout duration, so that the
time has already expired at the time that the hardware is programmed?

> > Or perhaps better, do the tracing that Thomas Gleixner suggested.  Or both,
> > in order to get the most information in the shortest period of time.
> 
> I'm starting with new traces with "nohlt" on the kernel command line,
> since many of the stalls I've observed up to now may have a different
> mechanism -- since the arch_cpu_idle code doesn't work (returns
> immediately) the rapid rcu idle enter/exit in the cpu idle loop might
> be part of what was interfering with rcu_sched.

That doesn't sound good, though I cannot immediately think of a reason
why that would cause RCU much trouble.

> With my new traces and LED debugging (which I still need to tune with
> more logging, then I'll post some more results) what I'm seeing when
> the stall happens is the timer interrupt firing as expected and making
> it into __hrtimer_run_queues, but tick_sched_timer not getting called
> and thus rcu_sched never waking up from its 1s schedule_timeout (via
> swait).

It would be very interesting to track down exactly the point at which
the system decides not to fire the timeout.

Thanx, Paul



Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-10-04 Thread Rich Felker
On Tue, Oct 04, 2016 at 02:14:51PM -0700, Paul E. McKenney wrote:
> On Tue, Oct 04, 2016 at 04:58:37PM -0400, Rich Felker wrote:
> > On Tue, Oct 04, 2016 at 12:06:23AM -0700, Paul E. McKenney wrote:
> > > On Mon, Oct 03, 2016 at 06:10:39PM -0400, Rich Felker wrote:
> > > > On Mon, Sep 26, 2016 at 11:27:14PM +0200, Daniel Lezcano wrote:
> > > > > On 26/09/2016 23:07, Rich Felker wrote:
> > > > > > Ping. Is there anything that still needs to be changed for this 
> > > > > > driver
> > > > > > to be acceptable?
> > > > > 
> > > > > It is on my radar. I'm reviewing it.
> > > > > 
> > > > > Can you elaborate the workaround mentioned in the changelog. I have 
> > > > > been
> > > > > digging into the lkml@ thread but it is not clear if the issue is
> > > > > related to the time framework, the driver itself or whatever else. Can
> > > > > you clarify that ?
> > > > 
> > > > Do you have comments on any remaining issues other than this
> > > > workaround? If not I don't mind removing the workaround and trying to
> > > > solve the issue separately later. Let me know and either way I'll
> > > > submit a v8.
> > > 
> > > One question of interest to me is whether this patchset prevents the
> > > RCU CPU stall warnings that you are seeing.
> > 
> > With the 5ms minimum delta, I didn't observe any rcu_sched stall
> > warnings. At 2.5ms I thought it was gone but eventually saw one. With
> > the previous mindelta = 1, i.e. 1 hardware bus period, I get the
> > stalls regularly.
> 
> Sounds to me like your low-level clock drivers or your clock hardware is
> having trouble dealing with short timeouts.  I suggest writing a focused
> test.  It is of course quite possible that the failure could occur for
> any time period, but simply becomes less probable with longer time
> periods.

I don't see any indication of such a problem on the hardware side, and
it didn't happen in earlier kernels, though of course it's possible
that they were programming different intervals. And all of the
intervals logged when I added trace logging were way above the scale
where it would be plausible for the hardware to be an issue.

> Or perhaps better, do the tracing that Thomas Gleixner suggested.  Or both,
> in order to get the most information in the shortest period of time.

I'm starting with new traces with "nohlt" on the kernel command line,
since many of the stalls I've observed up to now may have a different
mechanism -- since the arch_cpu_idle code doesn't work (returns
immediately) the rapid rcu idle enter/exit in the cpu idle loop might
be part of what was interfering with rcu_sched.

With my new traces and LED debugging (which I still need to tune with
more logging, then I'll post some more results) what I'm seeing when
the stall happens is the timer interrupt firing as expected and making
it into __hrtimer_run_queues, but tick_sched_timer not getting called
and thus rcu_sched never waking up from its 1s schedule_timeout (via
swait).

Rich


Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-10-04 Thread Rich Felker
On Tue, Oct 04, 2016 at 02:14:51PM -0700, Paul E. McKenney wrote:
> On Tue, Oct 04, 2016 at 04:58:37PM -0400, Rich Felker wrote:
> > On Tue, Oct 04, 2016 at 12:06:23AM -0700, Paul E. McKenney wrote:
> > > On Mon, Oct 03, 2016 at 06:10:39PM -0400, Rich Felker wrote:
> > > > On Mon, Sep 26, 2016 at 11:27:14PM +0200, Daniel Lezcano wrote:
> > > > > On 26/09/2016 23:07, Rich Felker wrote:
> > > > > > Ping. Is there anything that still needs to be changed for this 
> > > > > > driver
> > > > > > to be acceptable?
> > > > > 
> > > > > It is on my radar. I'm reviewing it.
> > > > > 
> > > > > Can you elaborate the workaround mentioned in the changelog. I have 
> > > > > been
> > > > > digging into the lkml@ thread but it is not clear if the issue is
> > > > > related to the time framework, the driver itself or whatever else. Can
> > > > > you clarify that ?
> > > > 
> > > > Do you have comments on any remaining issues other than this
> > > > workaround? If not I don't mind removing the workaround and trying to
> > > > solve the issue separately later. Let me know and either way I'll
> > > > submit a v8.
> > > 
> > > One question of interest to me is whether this patchset prevents the
> > > RCU CPU stall warnings that you are seeing.
> > 
> > With the 5ms minimum delta, I didn't observe any rcu_sched stall
> > warnings. At 2.5ms I thought it was gone but eventually saw one. With
> > the previous mindelta = 1, i.e. 1 hardware bus period, I get the
> > stalls regularly.
> 
> Sounds to me like your low-level clock drivers or your clock hardware is
> having trouble dealing with short timeouts.  I suggest writing a focused
> test.  It is of course quite possible that the failure could occur for
> any time period, but simply becomes less probable with longer time
> periods.

I don't see any indication of such a problem on the hardware side, and
it didn't happen in earlier kernels, though of course it's possible
that they were programming different intervals. And all of the
intervals logged when I added trace logging were way above the scale
where it would be plausible for the hardware to be an issue.

> Or perhaps better, do the tracing that Thomas Gleixner suggested.  Or both,
> in order to get the most information in the shortest period of time.

I'm starting with new traces with "nohlt" on the kernel command line,
since many of the stalls I've observed up to now may have a different
mechanism -- since the arch_cpu_idle code doesn't work (returns
immediately) the rapid rcu idle enter/exit in the cpu idle loop might
be part of what was interfering with rcu_sched.

With my new traces and LED debugging (which I still need to tune with
more logging, then I'll post some more results) what I'm seeing when
the stall happens is the timer interrupt firing as expected and making
it into __hrtimer_run_queues, but tick_sched_timer not getting called
and thus rcu_sched never waking up from its 1s schedule_timeout (via
swait).

Rich


Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-10-04 Thread Paul E. McKenney
On Tue, Oct 04, 2016 at 04:58:37PM -0400, Rich Felker wrote:
> On Tue, Oct 04, 2016 at 12:06:23AM -0700, Paul E. McKenney wrote:
> > On Mon, Oct 03, 2016 at 06:10:39PM -0400, Rich Felker wrote:
> > > On Mon, Sep 26, 2016 at 11:27:14PM +0200, Daniel Lezcano wrote:
> > > > On 26/09/2016 23:07, Rich Felker wrote:
> > > > > Ping. Is there anything that still needs to be changed for this driver
> > > > > to be acceptable?
> > > > 
> > > > It is on my radar. I'm reviewing it.
> > > > 
> > > > Can you elaborate the workaround mentioned in the changelog. I have been
> > > > digging into the lkml@ thread but it is not clear if the issue is
> > > > related to the time framework, the driver itself or whatever else. Can
> > > > you clarify that ?
> > > 
> > > Do you have comments on any remaining issues other than this
> > > workaround? If not I don't mind removing the workaround and trying to
> > > solve the issue separately later. Let me know and either way I'll
> > > submit a v8.
> > 
> > One question of interest to me is whether this patchset prevents the
> > RCU CPU stall warnings that you are seeing.
> 
> With the 5ms minimum delta, I didn't observe any rcu_sched stall
> warnings. At 2.5ms I thought it was gone but eventually saw one. With
> the previous mindelta = 1, i.e. 1 hardware bus period, I get the
> stalls regularly.

Sounds to me like your low-level clock drivers or your clock hardware is
having trouble dealing with short timeouts.  I suggest writing a focused
test.  It is of course quite possible that the failure could occur for
any time period, but simply becomes less probable with longer time
periods.

Or perhaps better, do the tracing that Thomas Gleixner suggested.  Or both,
in order to get the most information in the shortest period of time.

Thanx, Paul



Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-10-04 Thread Paul E. McKenney
On Tue, Oct 04, 2016 at 04:58:37PM -0400, Rich Felker wrote:
> On Tue, Oct 04, 2016 at 12:06:23AM -0700, Paul E. McKenney wrote:
> > On Mon, Oct 03, 2016 at 06:10:39PM -0400, Rich Felker wrote:
> > > On Mon, Sep 26, 2016 at 11:27:14PM +0200, Daniel Lezcano wrote:
> > > > On 26/09/2016 23:07, Rich Felker wrote:
> > > > > Ping. Is there anything that still needs to be changed for this driver
> > > > > to be acceptable?
> > > > 
> > > > It is on my radar. I'm reviewing it.
> > > > 
> > > > Can you elaborate the workaround mentioned in the changelog. I have been
> > > > digging into the lkml@ thread but it is not clear if the issue is
> > > > related to the time framework, the driver itself or whatever else. Can
> > > > you clarify that ?
> > > 
> > > Do you have comments on any remaining issues other than this
> > > workaround? If not I don't mind removing the workaround and trying to
> > > solve the issue separately later. Let me know and either way I'll
> > > submit a v8.
> > 
> > One question of interest to me is whether this patchset prevents the
> > RCU CPU stall warnings that you are seeing.
> 
> With the 5ms minimum delta, I didn't observe any rcu_sched stall
> warnings. At 2.5ms I thought it was gone but eventually saw one. With
> the previous mindelta = 1, i.e. 1 hardware bus period, I get the
> stalls regularly.

Sounds to me like your low-level clock drivers or your clock hardware is
having trouble dealing with short timeouts.  I suggest writing a focused
test.  It is of course quite possible that the failure could occur for
any time period, but simply becomes less probable with longer time
periods.

Or perhaps better, do the tracing that Thomas Gleixner suggested.  Or both,
in order to get the most information in the shortest period of time.

Thanx, Paul



Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-10-04 Thread Rich Felker
On Tue, Oct 04, 2016 at 12:06:23AM -0700, Paul E. McKenney wrote:
> On Mon, Oct 03, 2016 at 06:10:39PM -0400, Rich Felker wrote:
> > On Mon, Sep 26, 2016 at 11:27:14PM +0200, Daniel Lezcano wrote:
> > > On 26/09/2016 23:07, Rich Felker wrote:
> > > > Ping. Is there anything that still needs to be changed for this driver
> > > > to be acceptable?
> > > 
> > > It is on my radar. I'm reviewing it.
> > > 
> > > Can you elaborate the workaround mentioned in the changelog. I have been
> > > digging into the lkml@ thread but it is not clear if the issue is
> > > related to the time framework, the driver itself or whatever else. Can
> > > you clarify that ?
> > 
> > Do you have comments on any remaining issues other than this
> > workaround? If not I don't mind removing the workaround and trying to
> > solve the issue separately later. Let me know and either way I'll
> > submit a v8.
> 
> One question of interest to me is whether this patchset prevents the
> RCU CPU stall warnings that you are seeing.

With the 5ms minimum delta, I didn't observe any rcu_sched stall
warnings. At 2.5ms I thought it was gone but eventually saw one. With
the previous mindelta = 1, i.e. 1 hardware bus period, I get the
stalls regularly.

Rich


Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-10-04 Thread Rich Felker
On Tue, Oct 04, 2016 at 12:06:23AM -0700, Paul E. McKenney wrote:
> On Mon, Oct 03, 2016 at 06:10:39PM -0400, Rich Felker wrote:
> > On Mon, Sep 26, 2016 at 11:27:14PM +0200, Daniel Lezcano wrote:
> > > On 26/09/2016 23:07, Rich Felker wrote:
> > > > Ping. Is there anything that still needs to be changed for this driver
> > > > to be acceptable?
> > > 
> > > It is on my radar. I'm reviewing it.
> > > 
> > > Can you elaborate the workaround mentioned in the changelog. I have been
> > > digging into the lkml@ thread but it is not clear if the issue is
> > > related to the time framework, the driver itself or whatever else. Can
> > > you clarify that ?
> > 
> > Do you have comments on any remaining issues other than this
> > workaround? If not I don't mind removing the workaround and trying to
> > solve the issue separately later. Let me know and either way I'll
> > submit a v8.
> 
> One question of interest to me is whether this patchset prevents the
> RCU CPU stall warnings that you are seeing.

With the 5ms minimum delta, I didn't observe any rcu_sched stall
warnings. At 2.5ms I thought it was gone but eventually saw one. With
the previous mindelta = 1, i.e. 1 hardware bus period, I get the
stalls regularly.

Rich


Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-10-04 Thread Paul E. McKenney
On Mon, Oct 03, 2016 at 06:10:39PM -0400, Rich Felker wrote:
> On Mon, Sep 26, 2016 at 11:27:14PM +0200, Daniel Lezcano wrote:
> > On 26/09/2016 23:07, Rich Felker wrote:
> > > Ping. Is there anything that still needs to be changed for this driver
> > > to be acceptable?
> > 
> > It is on my radar. I'm reviewing it.
> > 
> > Can you elaborate the workaround mentioned in the changelog. I have been
> > digging into the lkml@ thread but it is not clear if the issue is
> > related to the time framework, the driver itself or whatever else. Can
> > you clarify that ?
> 
> Do you have comments on any remaining issues other than this
> workaround? If not I don't mind removing the workaround and trying to
> solve the issue separately later. Let me know and either way I'll
> submit a v8.

One question of interest to me is whether this patchset prevents the
RCU CPU stall warnings that you are seeing.

Thanx, Paul



Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-10-04 Thread Paul E. McKenney
On Mon, Oct 03, 2016 at 06:10:39PM -0400, Rich Felker wrote:
> On Mon, Sep 26, 2016 at 11:27:14PM +0200, Daniel Lezcano wrote:
> > On 26/09/2016 23:07, Rich Felker wrote:
> > > Ping. Is there anything that still needs to be changed for this driver
> > > to be acceptable?
> > 
> > It is on my radar. I'm reviewing it.
> > 
> > Can you elaborate the workaround mentioned in the changelog. I have been
> > digging into the lkml@ thread but it is not clear if the issue is
> > related to the time framework, the driver itself or whatever else. Can
> > you clarify that ?
> 
> Do you have comments on any remaining issues other than this
> workaround? If not I don't mind removing the workaround and trying to
> solve the issue separately later. Let me know and either way I'll
> submit a v8.

One question of interest to me is whether this patchset prevents the
RCU CPU stall warnings that you are seeing.

Thanx, Paul



Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-10-03 Thread Rich Felker
On Mon, Sep 26, 2016 at 11:27:14PM +0200, Daniel Lezcano wrote:
> On 26/09/2016 23:07, Rich Felker wrote:
> > Ping. Is there anything that still needs to be changed for this driver
> > to be acceptable?
> 
> It is on my radar. I'm reviewing it.
> 
> Can you elaborate the workaround mentioned in the changelog. I have been
> digging into the lkml@ thread but it is not clear if the issue is
> related to the time framework, the driver itself or whatever else. Can
> you clarify that ?

Do you have comments on any remaining issues other than this
workaround? If not I don't mind removing the workaround and trying to
solve the issue separately later. Let me know and either way I'll
submit a v8.

Rich


Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-10-03 Thread Rich Felker
On Mon, Sep 26, 2016 at 11:27:14PM +0200, Daniel Lezcano wrote:
> On 26/09/2016 23:07, Rich Felker wrote:
> > Ping. Is there anything that still needs to be changed for this driver
> > to be acceptable?
> 
> It is on my radar. I'm reviewing it.
> 
> Can you elaborate the workaround mentioned in the changelog. I have been
> digging into the lkml@ thread but it is not clear if the issue is
> related to the time framework, the driver itself or whatever else. Can
> you clarify that ?

Do you have comments on any remaining issues other than this
workaround? If not I don't mind removing the workaround and trying to
solve the issue separately later. Let me know and either way I'll
submit a v8.

Rich


Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-10-02 Thread Paul E. McKenney
On Sat, Oct 01, 2016 at 08:00:49PM -0400, Rich Felker wrote:
> On Sat, Oct 01, 2016 at 10:58:37AM -0700, Paul E. McKenney wrote:
> > On Sat, Oct 01, 2016 at 01:05:08PM -0400, Rich Felker wrote:
> > > On Fri, Sep 30, 2016 at 06:48:35AM -0700, Paul E. McKenney wrote:
> > > > On Fri, Sep 30, 2016 at 03:15:11PM +0200, Thomas Gleixner wrote:
> > > > > On Tue, 27 Sep 2016, Rich Felker wrote:
> > > > > > I've managed to get a trace with a stall. I'm not sure what the best
> > > > > > way to share the full thing is, since it's large, but here are the
> > > > > > potentially interesting parts.
> > > > 
> > > > [ . . . ]
> > > > 
> > > > Some RCU commentary, on the off-chance that it helps...
> > > > 
> > > > > So that should kick rcu_sched-7 in 10ms, latest 20ms from now and 
> > > > > CPU1 goes
> > > > > into a NOHZ idle sleep.
> > > > > 
> > > > > >   -0 [001] d...   109.953436: tick_stop: 
> > > > > > success=1 dependency=NONE
> > > > > >   -0 [001] d...   109.953617: hrtimer_cancel: 
> > > > > > hrtimer=109f449c
> > > > > >   -0 [001] d...   109.953818: hrtimer_start: 
> > > > > > hrtimer=109f449c function=tick_sched_timer expires=10988000 
> > > > > > softexpires=10988000
> > > > > 
> > > > > which is (using the 0.087621us delta between the trace clock and clock
> > > > > MONO) at: 109.880 + 0.087621 = 109.968
> > > > > 
> > > > > Which is about correct as we expect the RCU timer to fire at:
> > > > >   
> > > > >   109.952633 + 0.01 = 109.963633
> > > > > 
> > > > > or latest at
> > > > > 
> > > > >   109.952633 + 0.02 = 109.983633
> > > > >
> > > > > There is another caveat. That nohz stuff can queue the rcu timer on 
> > > > > CPU0, which
> > > > > it did not because:
> > > > 
> > > > Just for annoying completeness, the location of the timer depends on how
> > > > the rcuo callback-offload kthreads are constrained.  And yes, in the 
> > > > most
> > > > constrained case where all CPUs except for CPU 0 are nohz CPUs, they 
> > > > will
> > > > by default all run on CPU 0.
> > > 
> > > In default full nohz configuration, am I correct that all cpus except
> > > cpu0 willd be nohz and that the rcu callbacks then have to run on
> > > cpu0?
> > 
> > In recent kernels, it looks to me that the default is that none of the
> > CPUs will be nohz by default.  You have to build with both NO_HZ_FULL
> > and NO_HZ_FULL_ALL to get the everything-on-CPU-0 behavior.
> 
> Actually NO_HZ_FULL is not supported because
> HAVE_VIRT_CPU_ACCOUNTING_GEN and HAVE_CONTEXT_TRACKING are missing, so
> it's just normal NO_HZ. I listed the relevant config items somewhere
> earlier in this thread I think.

Very good.  Please feel free to go back and double-check, if you wish.

> > But are there rcu0 kthreads running on your system?
> 
> Apparently not:
> 
> # ps aux | grep rcu
> 7 root   0:00 [rcu_sched]
> 8 root   0:00 [rcu_bh]
>   395 root   0:00 grep rcu

OK, you do not have callback-offloading enabled.  We can therefore eliminate
the possibility that you are being hit by callback-offloading logic.

> > > > > >rcu_sched-7 [001] d...   109.952633: timer_start: 
> > > > > > timer=160a9eb0 function=process_timeout expires=4294948284 
> > > > > > [timeout=1] flags=0x0001
> > > > > 
> > > > > The CPU nr encoded in flags is: 1
> > > > > 
> > > > > Now we cancel and restart the timer w/o seing the interrupt expiring
> > > > > it. And that expiry should have happened at 109.968000 !?!
> > > > > 
> > > > > >   -0 [001] d...   109.968225: hrtimer_cancel: 
> > > > > > hrtimer=109f449c
> > > > > >   -0 [001] d...   109.968526: hrtimer_start: 
> > > > > > hrtimer=109f449c function=tick_sched_timer expires=10989000 
> > > > > > softexpires=10989000
> > > > > 
> > > > > So this advances the next tick even further out. And CPU 0 sets the 
> > > > > timer to
> > > > > the exact smae value:
> > > > > 
> > > > > >   -0 [000] d.h.   109.969104: hrtimer_start: 
> > > > > > hrtimer=109e949c function=tick_sched_timer expires=10989000 
> > > > > > softexpires=10989000
> > > > > 
> > > > > 
> > > > > >   -0 [000] d.h.   109.977690: irq_handler_entry: 
> > > > > > irq=16 name=jcore_pit
> > > > > >   -0 [000] d.h.   109.977911: hrtimer_cancel: 
> > > > > > hrtimer=109e949c
> > > > > >   -0 [000] d.h.   109.978053: 
> > > > > > hrtimer_expire_entry: hrtimer=109e949c function=tick_sched_timer 
> > > > > > now=109890434160
> > > > > 
> > > > > Which expires here. And CPU1 instead of getting an interrupt and 
> > > > > expiring
> > > > > the timer does the cancel/restart to the next jiffie again:
> > > > > 
> > > > > >   -0 [001] d...   109.978206: hrtimer_cancel: 
> > > > > > hrtimer=109f449c
> > > > > >   -0 [001] d...   109.978495: hrtimer_start: 
> > > > > > hrtimer=109f449c function=tick_sched_timer expires=1099 
> > > > > > softexpires=1099

Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-10-02 Thread Paul E. McKenney
On Sat, Oct 01, 2016 at 08:00:49PM -0400, Rich Felker wrote:
> On Sat, Oct 01, 2016 at 10:58:37AM -0700, Paul E. McKenney wrote:
> > On Sat, Oct 01, 2016 at 01:05:08PM -0400, Rich Felker wrote:
> > > On Fri, Sep 30, 2016 at 06:48:35AM -0700, Paul E. McKenney wrote:
> > > > On Fri, Sep 30, 2016 at 03:15:11PM +0200, Thomas Gleixner wrote:
> > > > > On Tue, 27 Sep 2016, Rich Felker wrote:
> > > > > > I've managed to get a trace with a stall. I'm not sure what the best
> > > > > > way to share the full thing is, since it's large, but here are the
> > > > > > potentially interesting parts.
> > > > 
> > > > [ . . . ]
> > > > 
> > > > Some RCU commentary, on the off-chance that it helps...
> > > > 
> > > > > So that should kick rcu_sched-7 in 10ms, latest 20ms from now and 
> > > > > CPU1 goes
> > > > > into a NOHZ idle sleep.
> > > > > 
> > > > > >   -0 [001] d...   109.953436: tick_stop: 
> > > > > > success=1 dependency=NONE
> > > > > >   -0 [001] d...   109.953617: hrtimer_cancel: 
> > > > > > hrtimer=109f449c
> > > > > >   -0 [001] d...   109.953818: hrtimer_start: 
> > > > > > hrtimer=109f449c function=tick_sched_timer expires=10988000 
> > > > > > softexpires=10988000
> > > > > 
> > > > > which is (using the 0.087621us delta between the trace clock and clock
> > > > > MONO) at: 109.880 + 0.087621 = 109.968
> > > > > 
> > > > > Which is about correct as we expect the RCU timer to fire at:
> > > > >   
> > > > >   109.952633 + 0.01 = 109.963633
> > > > > 
> > > > > or latest at
> > > > > 
> > > > >   109.952633 + 0.02 = 109.983633
> > > > >
> > > > > There is another caveat. That nohz stuff can queue the rcu timer on 
> > > > > CPU0, which
> > > > > it did not because:
> > > > 
> > > > Just for annoying completeness, the location of the timer depends on how
> > > > the rcuo callback-offload kthreads are constrained.  And yes, in the 
> > > > most
> > > > constrained case where all CPUs except for CPU 0 are nohz CPUs, they 
> > > > will
> > > > by default all run on CPU 0.
> > > 
> > > In default full nohz configuration, am I correct that all cpus except
> > > cpu0 willd be nohz and that the rcu callbacks then have to run on
> > > cpu0?
> > 
> > In recent kernels, it looks to me that the default is that none of the
> > CPUs will be nohz by default.  You have to build with both NO_HZ_FULL
> > and NO_HZ_FULL_ALL to get the everything-on-CPU-0 behavior.
> 
> Actually NO_HZ_FULL is not supported because
> HAVE_VIRT_CPU_ACCOUNTING_GEN and HAVE_CONTEXT_TRACKING are missing, so
> it's just normal NO_HZ. I listed the relevant config items somewhere
> earlier in this thread I think.

Very good.  Please feel free to go back and double-check, if you wish.

> > But are there rcu0 kthreads running on your system?
> 
> Apparently not:
> 
> # ps aux | grep rcu
> 7 root   0:00 [rcu_sched]
> 8 root   0:00 [rcu_bh]
>   395 root   0:00 grep rcu

OK, you do not have callback-offloading enabled.  We can therefore eliminate
the possibility that you are being hit by callback-offloading logic.

> > > > > >rcu_sched-7 [001] d...   109.952633: timer_start: 
> > > > > > timer=160a9eb0 function=process_timeout expires=4294948284 
> > > > > > [timeout=1] flags=0x0001
> > > > > 
> > > > > The CPU nr encoded in flags is: 1
> > > > > 
> > > > > Now we cancel and restart the timer w/o seing the interrupt expiring
> > > > > it. And that expiry should have happened at 109.968000 !?!
> > > > > 
> > > > > >   -0 [001] d...   109.968225: hrtimer_cancel: 
> > > > > > hrtimer=109f449c
> > > > > >   -0 [001] d...   109.968526: hrtimer_start: 
> > > > > > hrtimer=109f449c function=tick_sched_timer expires=10989000 
> > > > > > softexpires=10989000
> > > > > 
> > > > > So this advances the next tick even further out. And CPU 0 sets the 
> > > > > timer to
> > > > > the exact smae value:
> > > > > 
> > > > > >   -0 [000] d.h.   109.969104: hrtimer_start: 
> > > > > > hrtimer=109e949c function=tick_sched_timer expires=10989000 
> > > > > > softexpires=10989000
> > > > > 
> > > > > 
> > > > > >   -0 [000] d.h.   109.977690: irq_handler_entry: 
> > > > > > irq=16 name=jcore_pit
> > > > > >   -0 [000] d.h.   109.977911: hrtimer_cancel: 
> > > > > > hrtimer=109e949c
> > > > > >   -0 [000] d.h.   109.978053: 
> > > > > > hrtimer_expire_entry: hrtimer=109e949c function=tick_sched_timer 
> > > > > > now=109890434160
> > > > > 
> > > > > Which expires here. And CPU1 instead of getting an interrupt and 
> > > > > expiring
> > > > > the timer does the cancel/restart to the next jiffie again:
> > > > > 
> > > > > >   -0 [001] d...   109.978206: hrtimer_cancel: 
> > > > > > hrtimer=109f449c
> > > > > >   -0 [001] d...   109.978495: hrtimer_start: 
> > > > > > hrtimer=109f449c function=tick_sched_timer expires=1099 
> > > > > > softexpires=1099

Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-10-02 Thread Paul E. McKenney
On Sat, Oct 01, 2016 at 11:59:25PM -0400, Rich Felker wrote:
> On Sat, Oct 01, 2016 at 08:00:49PM -0400, Rich Felker wrote:
> > > > > > > - During the whole sequence, hrtimer expiration times are being 
> > > > > > > set to
> > > > > > >   exact jiffies (@ 100 Hz), whereas before it they're quite 
> > > > > > > arbitrary.
> > > > > > 
> > > > > > When a CPU goes into NOHZ idle and the next (timer/hrtimer) is 
> > > > > > farther out
> > > > > > than the next tick, then tick_sched_timer is set to this next event 
> > > > > > which
> > > > > > can be far out. So that's expected.
> > > > > > 
> > > > > > > - The CLOCK_MONOTONIC hrtimer times do not match up with the
> > > > > > >   timestamps; they're off by about 0.087s. I assume this is just
> > > > > > >   sched_clock vs clocksource time and not a big deal.
> > > > > > 
> > > > > > Yes. You can tell the tracer to use clock monotonic so then they 
> > > > > > should match.
> > > > > > 
> > > > > > > - The rcu_sched process is sleeping with timeout=1. This seems
> > > > > > >   odd/excessive.
> > > > > > 
> > > > > > Why is that odd? That's one tick, i.e. 10ms in your case. And 
> > > > > > that's not
> > > > > > the problem at all. The problem is your timer not firing, but the 
> > > > > > cpu is
> > > > > > obviously either getting out of idle and then moves the tick ahead 
> > > > > > for some
> > > > > > unknown reason.
> > > > > 
> > > > > And a one-jiffy timeout is in fact expected behavior when HZ=100.
> > > > > You have to be running HZ=250 or better to have two-jiffy timeouts,
> > > > > and HZ=500 or better for three-jiffy timeouts.
> > > > 
> > > > One possible theory I'm looking at is that the two cpus are both
> > > > waking up (leaving cpu_idle_poll or cpuidle_idle_call) every jiffy
> > > > with sufficient consistency that every time the rcu_gp_fqs_check_wake
> > > > loop wakes up in rcu_gp_kthread, the other cpu is in cpu_idle_loop but
> > > > outside the rcu_idle_enter/rcu_idle_exit range. Would this block
> > > > forward process? I added an LED indicator in rcu_gp_fqs_check_wake
> > > > that shows the low 2 bits of rnp->qsmask every time it's called, and
> > > > under normal operation the LEDs just flash on momentarily or just one
> > > > stays on for a few seconds then goes off. During a stall both are
> > > > stuck on. I'm still trying to make sense of the code but my impression
> > > > so far is that, on a 2-cpu machine, this is a leaf node and the 2 bits
> > > > correspond directly to cpus; is that right? If so I'm a bit confused
> > > > because I don't see how forward progress could ever happen if the cpu
> > > > on which rcu_gp_kthread is blocking forward progress of
> > > > rcu_gp_kthread.
> > > 
> > > No.  If the CPUs are entering and leaving idle, and if your timers
> > > were waking up rcu_sched every few jiffies like it asks, then the
> > > repeated idle entry/exit events would be noticed, courtesy of the atomic
> > > increments of ->dynticks and the rcu_sched kthread's snapshotting and
> > > checking of this value.
> > 
> > I don't see how rcu_sched could notice the changes if it's stuck in
> > the wait loop I think it's stuck in. There is no check of ->dynticks
> > in rcu_gp_fqs_check_wake. Just in case checking *gfp & RCU_GP_FLAG_FQS
> > accomplishes this, I updated my LED hacks to clear the LEDs in that
> > exit path (and killed the other place that could turn them back on
> > from cpu_idle_loop) but I still get 2 LEDs on for 21s followed by a
> > stall message.
> > 
> > > Even if the CPUs were always non-idle on every
> > > time force_quiescent_state() is invoked, give or take the possibility
> > > of counter wrap -- but even on a 32-bit system, that takes awhile.
> > 
> > Perhaps force_quiescent_state is not getting invoked? Does that sound
> > plausible, and if so, how should I go about debugging it? I see it is
> > called from the stall reporting code, so that's presumably what's
> > breaking the stalls.
> 
> I can confirm that force_quiescent_state is not being called at all
> except from the stall handler. Where is it supposed to be called from?
> I can't find any code paths to it except the stall handler and
> __call_rcu_core, but the latter seems to only be called when adding
> new rcu callbacks, not as a response to a stalled rcu_sched thread.
> Maybe I'm missing something but this seems like incorrect logic in the
> rcu subsystem.

This is the RCU logic you are missing within the RCU grace-period kthread:

ret = swait_event_interruptible_timeout(rsp->gp_wq,
rcu_gp_fqs_check_wake(rsp, ), j);

On your system, j==1, which means that when the RCU grace-period kthread
sleeps during a grace period, it is supposed to be awakened after one
jiffy regardless of anything else.  On your system (and apparently -only-
your system), this wakeup is not happening.

RCU reacts to your system's failure to awaken it by giving you an RCU
CPU stall warning.

So please do two 

Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-10-02 Thread Paul E. McKenney
On Sat, Oct 01, 2016 at 11:59:25PM -0400, Rich Felker wrote:
> On Sat, Oct 01, 2016 at 08:00:49PM -0400, Rich Felker wrote:
> > > > > > > - During the whole sequence, hrtimer expiration times are being 
> > > > > > > set to
> > > > > > >   exact jiffies (@ 100 Hz), whereas before it they're quite 
> > > > > > > arbitrary.
> > > > > > 
> > > > > > When a CPU goes into NOHZ idle and the next (timer/hrtimer) is 
> > > > > > farther out
> > > > > > than the next tick, then tick_sched_timer is set to this next event 
> > > > > > which
> > > > > > can be far out. So that's expected.
> > > > > > 
> > > > > > > - The CLOCK_MONOTONIC hrtimer times do not match up with the
> > > > > > >   timestamps; they're off by about 0.087s. I assume this is just
> > > > > > >   sched_clock vs clocksource time and not a big deal.
> > > > > > 
> > > > > > Yes. You can tell the tracer to use clock monotonic so then they 
> > > > > > should match.
> > > > > > 
> > > > > > > - The rcu_sched process is sleeping with timeout=1. This seems
> > > > > > >   odd/excessive.
> > > > > > 
> > > > > > Why is that odd? That's one tick, i.e. 10ms in your case. And 
> > > > > > that's not
> > > > > > the problem at all. The problem is your timer not firing, but the 
> > > > > > cpu is
> > > > > > obviously either getting out of idle and then moves the tick ahead 
> > > > > > for some
> > > > > > unknown reason.
> > > > > 
> > > > > And a one-jiffy timeout is in fact expected behavior when HZ=100.
> > > > > You have to be running HZ=250 or better to have two-jiffy timeouts,
> > > > > and HZ=500 or better for three-jiffy timeouts.
> > > > 
> > > > One possible theory I'm looking at is that the two cpus are both
> > > > waking up (leaving cpu_idle_poll or cpuidle_idle_call) every jiffy
> > > > with sufficient consistency that every time the rcu_gp_fqs_check_wake
> > > > loop wakes up in rcu_gp_kthread, the other cpu is in cpu_idle_loop but
> > > > outside the rcu_idle_enter/rcu_idle_exit range. Would this block
> > > > forward process? I added an LED indicator in rcu_gp_fqs_check_wake
> > > > that shows the low 2 bits of rnp->qsmask every time it's called, and
> > > > under normal operation the LEDs just flash on momentarily or just one
> > > > stays on for a few seconds then goes off. During a stall both are
> > > > stuck on. I'm still trying to make sense of the code but my impression
> > > > so far is that, on a 2-cpu machine, this is a leaf node and the 2 bits
> > > > correspond directly to cpus; is that right? If so I'm a bit confused
> > > > because I don't see how forward progress could ever happen if the cpu
> > > > on which rcu_gp_kthread is blocking forward progress of
> > > > rcu_gp_kthread.
> > > 
> > > No.  If the CPUs are entering and leaving idle, and if your timers
> > > were waking up rcu_sched every few jiffies like it asks, then the
> > > repeated idle entry/exit events would be noticed, courtesy of the atomic
> > > increments of ->dynticks and the rcu_sched kthread's snapshotting and
> > > checking of this value.
> > 
> > I don't see how rcu_sched could notice the changes if it's stuck in
> > the wait loop I think it's stuck in. There is no check of ->dynticks
> > in rcu_gp_fqs_check_wake. Just in case checking *gfp & RCU_GP_FLAG_FQS
> > accomplishes this, I updated my LED hacks to clear the LEDs in that
> > exit path (and killed the other place that could turn them back on
> > from cpu_idle_loop) but I still get 2 LEDs on for 21s followed by a
> > stall message.
> > 
> > > Even if the CPUs were always non-idle on every
> > > time force_quiescent_state() is invoked, give or take the possibility
> > > of counter wrap -- but even on a 32-bit system, that takes awhile.
> > 
> > Perhaps force_quiescent_state is not getting invoked? Does that sound
> > plausible, and if so, how should I go about debugging it? I see it is
> > called from the stall reporting code, so that's presumably what's
> > breaking the stalls.
> 
> I can confirm that force_quiescent_state is not being called at all
> except from the stall handler. Where is it supposed to be called from?
> I can't find any code paths to it except the stall handler and
> __call_rcu_core, but the latter seems to only be called when adding
> new rcu callbacks, not as a response to a stalled rcu_sched thread.
> Maybe I'm missing something but this seems like incorrect logic in the
> rcu subsystem.

This is the RCU logic you are missing within the RCU grace-period kthread:

ret = swait_event_interruptible_timeout(rsp->gp_wq,
rcu_gp_fqs_check_wake(rsp, ), j);

On your system, j==1, which means that when the RCU grace-period kthread
sleeps during a grace period, it is supposed to be awakened after one
jiffy regardless of anything else.  On your system (and apparently -only-
your system), this wakeup is not happening.

RCU reacts to your system's failure to awaken it by giving you an RCU
CPU stall warning.

So please do two 

Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-10-01 Thread Rich Felker
On Sat, Oct 01, 2016 at 08:00:49PM -0400, Rich Felker wrote:
> > > > > > - During the whole sequence, hrtimer expiration times are being set 
> > > > > > to
> > > > > >   exact jiffies (@ 100 Hz), whereas before it they're quite 
> > > > > > arbitrary.
> > > > > 
> > > > > When a CPU goes into NOHZ idle and the next (timer/hrtimer) is 
> > > > > farther out
> > > > > than the next tick, then tick_sched_timer is set to this next event 
> > > > > which
> > > > > can be far out. So that's expected.
> > > > > 
> > > > > > - The CLOCK_MONOTONIC hrtimer times do not match up with the
> > > > > >   timestamps; they're off by about 0.087s. I assume this is just
> > > > > >   sched_clock vs clocksource time and not a big deal.
> > > > > 
> > > > > Yes. You can tell the tracer to use clock monotonic so then they 
> > > > > should match.
> > > > > 
> > > > > > - The rcu_sched process is sleeping with timeout=1. This seems
> > > > > >   odd/excessive.
> > > > > 
> > > > > Why is that odd? That's one tick, i.e. 10ms in your case. And that's 
> > > > > not
> > > > > the problem at all. The problem is your timer not firing, but the cpu 
> > > > > is
> > > > > obviously either getting out of idle and then moves the tick ahead 
> > > > > for some
> > > > > unknown reason.
> > > > 
> > > > And a one-jiffy timeout is in fact expected behavior when HZ=100.
> > > > You have to be running HZ=250 or better to have two-jiffy timeouts,
> > > > and HZ=500 or better for three-jiffy timeouts.
> > > 
> > > One possible theory I'm looking at is that the two cpus are both
> > > waking up (leaving cpu_idle_poll or cpuidle_idle_call) every jiffy
> > > with sufficient consistency that every time the rcu_gp_fqs_check_wake
> > > loop wakes up in rcu_gp_kthread, the other cpu is in cpu_idle_loop but
> > > outside the rcu_idle_enter/rcu_idle_exit range. Would this block
> > > forward process? I added an LED indicator in rcu_gp_fqs_check_wake
> > > that shows the low 2 bits of rnp->qsmask every time it's called, and
> > > under normal operation the LEDs just flash on momentarily or just one
> > > stays on for a few seconds then goes off. During a stall both are
> > > stuck on. I'm still trying to make sense of the code but my impression
> > > so far is that, on a 2-cpu machine, this is a leaf node and the 2 bits
> > > correspond directly to cpus; is that right? If so I'm a bit confused
> > > because I don't see how forward progress could ever happen if the cpu
> > > on which rcu_gp_kthread is blocking forward progress of
> > > rcu_gp_kthread.
> > 
> > No.  If the CPUs are entering and leaving idle, and if your timers
> > were waking up rcu_sched every few jiffies like it asks, then the
> > repeated idle entry/exit events would be noticed, courtesy of the atomic
> > increments of ->dynticks and the rcu_sched kthread's snapshotting and
> > checking of this value.
> 
> I don't see how rcu_sched could notice the changes if it's stuck in
> the wait loop I think it's stuck in. There is no check of ->dynticks
> in rcu_gp_fqs_check_wake. Just in case checking *gfp & RCU_GP_FLAG_FQS
> accomplishes this, I updated my LED hacks to clear the LEDs in that
> exit path (and killed the other place that could turn them back on
> from cpu_idle_loop) but I still get 2 LEDs on for 21s followed by a
> stall message.
> 
> > Even if the CPUs were always non-idle on every
> > time force_quiescent_state() is invoked, give or take the possibility
> > of counter wrap -- but even on a 32-bit system, that takes awhile.
> 
> Perhaps force_quiescent_state is not getting invoked? Does that sound
> plausible, and if so, how should I go about debugging it? I see it is
> called from the stall reporting code, so that's presumably what's
> breaking the stalls.

I can confirm that force_quiescent_state is not being called at all
except from the stall handler. Where is it supposed to be called from?
I can't find any code paths to it except the stall handler and
__call_rcu_core, but the latter seems to only be called when adding
new rcu callbacks, not as a response to a stalled rcu_sched thread.
Maybe I'm missing something but this seems like incorrect logic in the
rcu subsystem.

Rich


Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-10-01 Thread Rich Felker
On Sat, Oct 01, 2016 at 08:00:49PM -0400, Rich Felker wrote:
> > > > > > - During the whole sequence, hrtimer expiration times are being set 
> > > > > > to
> > > > > >   exact jiffies (@ 100 Hz), whereas before it they're quite 
> > > > > > arbitrary.
> > > > > 
> > > > > When a CPU goes into NOHZ idle and the next (timer/hrtimer) is 
> > > > > farther out
> > > > > than the next tick, then tick_sched_timer is set to this next event 
> > > > > which
> > > > > can be far out. So that's expected.
> > > > > 
> > > > > > - The CLOCK_MONOTONIC hrtimer times do not match up with the
> > > > > >   timestamps; they're off by about 0.087s. I assume this is just
> > > > > >   sched_clock vs clocksource time and not a big deal.
> > > > > 
> > > > > Yes. You can tell the tracer to use clock monotonic so then they 
> > > > > should match.
> > > > > 
> > > > > > - The rcu_sched process is sleeping with timeout=1. This seems
> > > > > >   odd/excessive.
> > > > > 
> > > > > Why is that odd? That's one tick, i.e. 10ms in your case. And that's 
> > > > > not
> > > > > the problem at all. The problem is your timer not firing, but the cpu 
> > > > > is
> > > > > obviously either getting out of idle and then moves the tick ahead 
> > > > > for some
> > > > > unknown reason.
> > > > 
> > > > And a one-jiffy timeout is in fact expected behavior when HZ=100.
> > > > You have to be running HZ=250 or better to have two-jiffy timeouts,
> > > > and HZ=500 or better for three-jiffy timeouts.
> > > 
> > > One possible theory I'm looking at is that the two cpus are both
> > > waking up (leaving cpu_idle_poll or cpuidle_idle_call) every jiffy
> > > with sufficient consistency that every time the rcu_gp_fqs_check_wake
> > > loop wakes up in rcu_gp_kthread, the other cpu is in cpu_idle_loop but
> > > outside the rcu_idle_enter/rcu_idle_exit range. Would this block
> > > forward process? I added an LED indicator in rcu_gp_fqs_check_wake
> > > that shows the low 2 bits of rnp->qsmask every time it's called, and
> > > under normal operation the LEDs just flash on momentarily or just one
> > > stays on for a few seconds then goes off. During a stall both are
> > > stuck on. I'm still trying to make sense of the code but my impression
> > > so far is that, on a 2-cpu machine, this is a leaf node and the 2 bits
> > > correspond directly to cpus; is that right? If so I'm a bit confused
> > > because I don't see how forward progress could ever happen if the cpu
> > > on which rcu_gp_kthread is blocking forward progress of
> > > rcu_gp_kthread.
> > 
> > No.  If the CPUs are entering and leaving idle, and if your timers
> > were waking up rcu_sched every few jiffies like it asks, then the
> > repeated idle entry/exit events would be noticed, courtesy of the atomic
> > increments of ->dynticks and the rcu_sched kthread's snapshotting and
> > checking of this value.
> 
> I don't see how rcu_sched could notice the changes if it's stuck in
> the wait loop I think it's stuck in. There is no check of ->dynticks
> in rcu_gp_fqs_check_wake. Just in case checking *gfp & RCU_GP_FLAG_FQS
> accomplishes this, I updated my LED hacks to clear the LEDs in that
> exit path (and killed the other place that could turn them back on
> from cpu_idle_loop) but I still get 2 LEDs on for 21s followed by a
> stall message.
> 
> > Even if the CPUs were always non-idle on every
> > time force_quiescent_state() is invoked, give or take the possibility
> > of counter wrap -- but even on a 32-bit system, that takes awhile.
> 
> Perhaps force_quiescent_state is not getting invoked? Does that sound
> plausible, and if so, how should I go about debugging it? I see it is
> called from the stall reporting code, so that's presumably what's
> breaking the stalls.

I can confirm that force_quiescent_state is not being called at all
except from the stall handler. Where is it supposed to be called from?
I can't find any code paths to it except the stall handler and
__call_rcu_core, but the latter seems to only be called when adding
new rcu callbacks, not as a response to a stalled rcu_sched thread.
Maybe I'm missing something but this seems like incorrect logic in the
rcu subsystem.

Rich


Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-10-01 Thread Rich Felker
On Sat, Oct 01, 2016 at 10:58:37AM -0700, Paul E. McKenney wrote:
> On Sat, Oct 01, 2016 at 01:05:08PM -0400, Rich Felker wrote:
> > On Fri, Sep 30, 2016 at 06:48:35AM -0700, Paul E. McKenney wrote:
> > > On Fri, Sep 30, 2016 at 03:15:11PM +0200, Thomas Gleixner wrote:
> > > > On Tue, 27 Sep 2016, Rich Felker wrote:
> > > > > I've managed to get a trace with a stall. I'm not sure what the best
> > > > > way to share the full thing is, since it's large, but here are the
> > > > > potentially interesting parts.
> > > 
> > > [ . . . ]
> > > 
> > > Some RCU commentary, on the off-chance that it helps...
> > > 
> > > > So that should kick rcu_sched-7 in 10ms, latest 20ms from now and CPU1 
> > > > goes
> > > > into a NOHZ idle sleep.
> > > > 
> > > > >   -0 [001] d...   109.953436: tick_stop: success=1 
> > > > > dependency=NONE
> > > > >   -0 [001] d...   109.953617: hrtimer_cancel: 
> > > > > hrtimer=109f449c
> > > > >   -0 [001] d...   109.953818: hrtimer_start: 
> > > > > hrtimer=109f449c function=tick_sched_timer expires=10988000 
> > > > > softexpires=10988000
> > > > 
> > > > which is (using the 0.087621us delta between the trace clock and clock
> > > > MONO) at: 109.880 + 0.087621 = 109.968
> > > > 
> > > > Which is about correct as we expect the RCU timer to fire at:
> > > >   
> > > >   109.952633 + 0.01 = 109.963633
> > > > 
> > > > or latest at
> > > > 
> > > >   109.952633 + 0.02 = 109.983633
> > > >
> > > > There is another caveat. That nohz stuff can queue the rcu timer on 
> > > > CPU0, which
> > > > it did not because:
> > > 
> > > Just for annoying completeness, the location of the timer depends on how
> > > the rcuo callback-offload kthreads are constrained.  And yes, in the most
> > > constrained case where all CPUs except for CPU 0 are nohz CPUs, they will
> > > by default all run on CPU 0.
> > 
> > In default full nohz configuration, am I correct that all cpus except
> > cpu0 willd be nohz and that the rcu callbacks then have to run on
> > cpu0?
> 
> In recent kernels, it looks to me that the default is that none of the
> CPUs will be nohz by default.  You have to build with both NO_HZ_FULL
> and NO_HZ_FULL_ALL to get the everything-on-CPU-0 behavior.

Actually NO_HZ_FULL is not supported because
HAVE_VIRT_CPU_ACCOUNTING_GEN and HAVE_CONTEXT_TRACKING are missing, so
it's just normal NO_HZ. I listed the relevant config items somewhere
earlier in this thread I think.

> But are there rcu0 kthreads running on your system?

Apparently not:

# ps aux | grep rcu
7 root   0:00 [rcu_sched]
8 root   0:00 [rcu_bh]
  395 root   0:00 grep rcu
  
> > > > >rcu_sched-7 [001] d...   109.952633: timer_start: 
> > > > > timer=160a9eb0 function=process_timeout expires=4294948284 
> > > > > [timeout=1] flags=0x0001
> > > > 
> > > > The CPU nr encoded in flags is: 1
> > > > 
> > > > Now we cancel and restart the timer w/o seing the interrupt expiring
> > > > it. And that expiry should have happened at 109.968000 !?!
> > > > 
> > > > >   -0 [001] d...   109.968225: hrtimer_cancel: 
> > > > > hrtimer=109f449c
> > > > >   -0 [001] d...   109.968526: hrtimer_start: 
> > > > > hrtimer=109f449c function=tick_sched_timer expires=10989000 
> > > > > softexpires=10989000
> > > > 
> > > > So this advances the next tick even further out. And CPU 0 sets the 
> > > > timer to
> > > > the exact smae value:
> > > > 
> > > > >   -0 [000] d.h.   109.969104: hrtimer_start: 
> > > > > hrtimer=109e949c function=tick_sched_timer expires=10989000 
> > > > > softexpires=10989000
> > > > 
> > > > 
> > > > >   -0 [000] d.h.   109.977690: irq_handler_entry: 
> > > > > irq=16 name=jcore_pit
> > > > >   -0 [000] d.h.   109.977911: hrtimer_cancel: 
> > > > > hrtimer=109e949c
> > > > >   -0 [000] d.h.   109.978053: hrtimer_expire_entry: 
> > > > > hrtimer=109e949c function=tick_sched_timer now=109890434160
> > > > 
> > > > Which expires here. And CPU1 instead of getting an interrupt and 
> > > > expiring
> > > > the timer does the cancel/restart to the next jiffie again:
> > > > 
> > > > >   -0 [001] d...   109.978206: hrtimer_cancel: 
> > > > > hrtimer=109f449c
> > > > >   -0 [001] d...   109.978495: hrtimer_start: 
> > > > > hrtimer=109f449c function=tick_sched_timer expires=1099 
> > > > > softexpires=1099
> > > > 
> > > > And this repeats;
> > > > 
> > > > >   -0 [000] d.h.   109.987726: irq_handler_entry: 
> > > > > irq=16 name=jcore_pit
> > > > >   -0 [000] d.h.   109.987954: hrtimer_cancel: 
> > > > > hrtimer=109e949c
> > > > >   -0 [000] d.h.   109.988095: hrtimer_expire_entry: 
> > > > > hrtimer=109e949c function=tick_sched_timer now=109900474620
> > > > 
> > > > >   -0 [001] d...   109.988243: hrtimer_cancel: 
> > > > > hrtimer=109f449c
> > > > > 

Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-10-01 Thread Rich Felker
On Sat, Oct 01, 2016 at 10:58:37AM -0700, Paul E. McKenney wrote:
> On Sat, Oct 01, 2016 at 01:05:08PM -0400, Rich Felker wrote:
> > On Fri, Sep 30, 2016 at 06:48:35AM -0700, Paul E. McKenney wrote:
> > > On Fri, Sep 30, 2016 at 03:15:11PM +0200, Thomas Gleixner wrote:
> > > > On Tue, 27 Sep 2016, Rich Felker wrote:
> > > > > I've managed to get a trace with a stall. I'm not sure what the best
> > > > > way to share the full thing is, since it's large, but here are the
> > > > > potentially interesting parts.
> > > 
> > > [ . . . ]
> > > 
> > > Some RCU commentary, on the off-chance that it helps...
> > > 
> > > > So that should kick rcu_sched-7 in 10ms, latest 20ms from now and CPU1 
> > > > goes
> > > > into a NOHZ idle sleep.
> > > > 
> > > > >   -0 [001] d...   109.953436: tick_stop: success=1 
> > > > > dependency=NONE
> > > > >   -0 [001] d...   109.953617: hrtimer_cancel: 
> > > > > hrtimer=109f449c
> > > > >   -0 [001] d...   109.953818: hrtimer_start: 
> > > > > hrtimer=109f449c function=tick_sched_timer expires=10988000 
> > > > > softexpires=10988000
> > > > 
> > > > which is (using the 0.087621us delta between the trace clock and clock
> > > > MONO) at: 109.880 + 0.087621 = 109.968
> > > > 
> > > > Which is about correct as we expect the RCU timer to fire at:
> > > >   
> > > >   109.952633 + 0.01 = 109.963633
> > > > 
> > > > or latest at
> > > > 
> > > >   109.952633 + 0.02 = 109.983633
> > > >
> > > > There is another caveat. That nohz stuff can queue the rcu timer on 
> > > > CPU0, which
> > > > it did not because:
> > > 
> > > Just for annoying completeness, the location of the timer depends on how
> > > the rcuo callback-offload kthreads are constrained.  And yes, in the most
> > > constrained case where all CPUs except for CPU 0 are nohz CPUs, they will
> > > by default all run on CPU 0.
> > 
> > In default full nohz configuration, am I correct that all cpus except
> > cpu0 willd be nohz and that the rcu callbacks then have to run on
> > cpu0?
> 
> In recent kernels, it looks to me that the default is that none of the
> CPUs will be nohz by default.  You have to build with both NO_HZ_FULL
> and NO_HZ_FULL_ALL to get the everything-on-CPU-0 behavior.

Actually NO_HZ_FULL is not supported because
HAVE_VIRT_CPU_ACCOUNTING_GEN and HAVE_CONTEXT_TRACKING are missing, so
it's just normal NO_HZ. I listed the relevant config items somewhere
earlier in this thread I think.

> But are there rcu0 kthreads running on your system?

Apparently not:

# ps aux | grep rcu
7 root   0:00 [rcu_sched]
8 root   0:00 [rcu_bh]
  395 root   0:00 grep rcu
  
> > > > >rcu_sched-7 [001] d...   109.952633: timer_start: 
> > > > > timer=160a9eb0 function=process_timeout expires=4294948284 
> > > > > [timeout=1] flags=0x0001
> > > > 
> > > > The CPU nr encoded in flags is: 1
> > > > 
> > > > Now we cancel and restart the timer w/o seing the interrupt expiring
> > > > it. And that expiry should have happened at 109.968000 !?!
> > > > 
> > > > >   -0 [001] d...   109.968225: hrtimer_cancel: 
> > > > > hrtimer=109f449c
> > > > >   -0 [001] d...   109.968526: hrtimer_start: 
> > > > > hrtimer=109f449c function=tick_sched_timer expires=10989000 
> > > > > softexpires=10989000
> > > > 
> > > > So this advances the next tick even further out. And CPU 0 sets the 
> > > > timer to
> > > > the exact smae value:
> > > > 
> > > > >   -0 [000] d.h.   109.969104: hrtimer_start: 
> > > > > hrtimer=109e949c function=tick_sched_timer expires=10989000 
> > > > > softexpires=10989000
> > > > 
> > > > 
> > > > >   -0 [000] d.h.   109.977690: irq_handler_entry: 
> > > > > irq=16 name=jcore_pit
> > > > >   -0 [000] d.h.   109.977911: hrtimer_cancel: 
> > > > > hrtimer=109e949c
> > > > >   -0 [000] d.h.   109.978053: hrtimer_expire_entry: 
> > > > > hrtimer=109e949c function=tick_sched_timer now=109890434160
> > > > 
> > > > Which expires here. And CPU1 instead of getting an interrupt and 
> > > > expiring
> > > > the timer does the cancel/restart to the next jiffie again:
> > > > 
> > > > >   -0 [001] d...   109.978206: hrtimer_cancel: 
> > > > > hrtimer=109f449c
> > > > >   -0 [001] d...   109.978495: hrtimer_start: 
> > > > > hrtimer=109f449c function=tick_sched_timer expires=1099 
> > > > > softexpires=1099
> > > > 
> > > > And this repeats;
> > > > 
> > > > >   -0 [000] d.h.   109.987726: irq_handler_entry: 
> > > > > irq=16 name=jcore_pit
> > > > >   -0 [000] d.h.   109.987954: hrtimer_cancel: 
> > > > > hrtimer=109e949c
> > > > >   -0 [000] d.h.   109.988095: hrtimer_expire_entry: 
> > > > > hrtimer=109e949c function=tick_sched_timer now=109900474620
> > > > 
> > > > >   -0 [001] d...   109.988243: hrtimer_cancel: 
> > > > > hrtimer=109f449c
> > > > > 

Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-10-01 Thread Paul E. McKenney
On Sat, Oct 01, 2016 at 01:05:08PM -0400, Rich Felker wrote:
> On Fri, Sep 30, 2016 at 06:48:35AM -0700, Paul E. McKenney wrote:
> > On Fri, Sep 30, 2016 at 03:15:11PM +0200, Thomas Gleixner wrote:
> > > On Tue, 27 Sep 2016, Rich Felker wrote:
> > > > I've managed to get a trace with a stall. I'm not sure what the best
> > > > way to share the full thing is, since it's large, but here are the
> > > > potentially interesting parts.
> > 
> > [ . . . ]
> > 
> > Some RCU commentary, on the off-chance that it helps...
> > 
> > > So that should kick rcu_sched-7 in 10ms, latest 20ms from now and CPU1 
> > > goes
> > > into a NOHZ idle sleep.
> > > 
> > > >   -0 [001] d...   109.953436: tick_stop: success=1 
> > > > dependency=NONE
> > > >   -0 [001] d...   109.953617: hrtimer_cancel: 
> > > > hrtimer=109f449c
> > > >   -0 [001] d...   109.953818: hrtimer_start: 
> > > > hrtimer=109f449c function=tick_sched_timer expires=10988000 
> > > > softexpires=10988000
> > > 
> > > which is (using the 0.087621us delta between the trace clock and clock
> > > MONO) at: 109.880 + 0.087621 = 109.968
> > > 
> > > Which is about correct as we expect the RCU timer to fire at:
> > >   
> > >   109.952633 + 0.01 = 109.963633
> > > 
> > > or latest at
> > > 
> > >   109.952633 + 0.02 = 109.983633
> > >
> > > There is another caveat. That nohz stuff can queue the rcu timer on CPU0, 
> > > which
> > > it did not because:
> > 
> > Just for annoying completeness, the location of the timer depends on how
> > the rcuo callback-offload kthreads are constrained.  And yes, in the most
> > constrained case where all CPUs except for CPU 0 are nohz CPUs, they will
> > by default all run on CPU 0.
> 
> In default full nohz configuration, am I correct that all cpus except
> cpu0 willd be nohz and that the rcu callbacks then have to run on
> cpu0?

In recent kernels, it looks to me that the default is that none of the
CPUs will be nohz by default.  You have to build with both NO_HZ_FULL
and NO_HZ_FULL_ALL to get the everything-on-CPU-0 behavior.

But are there rcu0 kthreads running on your system?

> > > >rcu_sched-7 [001] d...   109.952633: timer_start: 
> > > > timer=160a9eb0 function=process_timeout expires=4294948284 [timeout=1] 
> > > > flags=0x0001
> > > 
> > > The CPU nr encoded in flags is: 1
> > > 
> > > Now we cancel and restart the timer w/o seing the interrupt expiring
> > > it. And that expiry should have happened at 109.968000 !?!
> > > 
> > > >   -0 [001] d...   109.968225: hrtimer_cancel: 
> > > > hrtimer=109f449c
> > > >   -0 [001] d...   109.968526: hrtimer_start: 
> > > > hrtimer=109f449c function=tick_sched_timer expires=10989000 
> > > > softexpires=10989000
> > > 
> > > So this advances the next tick even further out. And CPU 0 sets the timer 
> > > to
> > > the exact smae value:
> > > 
> > > >   -0 [000] d.h.   109.969104: hrtimer_start: 
> > > > hrtimer=109e949c function=tick_sched_timer expires=10989000 
> > > > softexpires=10989000
> > > 
> > > 
> > > >   -0 [000] d.h.   109.977690: irq_handler_entry: 
> > > > irq=16 name=jcore_pit
> > > >   -0 [000] d.h.   109.977911: hrtimer_cancel: 
> > > > hrtimer=109e949c
> > > >   -0 [000] d.h.   109.978053: hrtimer_expire_entry: 
> > > > hrtimer=109e949c function=tick_sched_timer now=109890434160
> > > 
> > > Which expires here. And CPU1 instead of getting an interrupt and expiring
> > > the timer does the cancel/restart to the next jiffie again:
> > > 
> > > >   -0 [001] d...   109.978206: hrtimer_cancel: 
> > > > hrtimer=109f449c
> > > >   -0 [001] d...   109.978495: hrtimer_start: 
> > > > hrtimer=109f449c function=tick_sched_timer expires=1099 
> > > > softexpires=1099
> > > 
> > > And this repeats;
> > > 
> > > >   -0 [000] d.h.   109.987726: irq_handler_entry: 
> > > > irq=16 name=jcore_pit
> > > >   -0 [000] d.h.   109.987954: hrtimer_cancel: 
> > > > hrtimer=109e949c
> > > >   -0 [000] d.h.   109.988095: hrtimer_expire_entry: 
> > > > hrtimer=109e949c function=tick_sched_timer now=109900474620
> > > 
> > > >   -0 [001] d...   109.988243: hrtimer_cancel: 
> > > > hrtimer=109f449c
> > > >   -0 [001] d...   109.988537: hrtimer_start: 
> > > > hrtimer=109f449c fun9c
> > > 
> > > There is something badly wrong here.
> > > 
> > > >   -0 [000] ..s.   110.019443: softirq_entry: vec=1 
> > > > [action=TIMER]
> > > >   -0 [000] ..s.   110.019617: softirq_exit: vec=1 
> > > > [action=TIMER]
> > > >   -0 [000] ..s.   110.019730: softirq_entry: vec=7 
> > > > [action=SCHED]
> > > >   -0 [000] ..s.   110.020174: softirq_exit: vec=7 
> > > > [action=SCHED]
> > > >   -0 [000] d.h.   110.027674: irq_handler_entry: 
> > > > irq=16 name=jcore_pit
> > > > 
> > > > The 

Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-10-01 Thread Paul E. McKenney
On Sat, Oct 01, 2016 at 01:05:08PM -0400, Rich Felker wrote:
> On Fri, Sep 30, 2016 at 06:48:35AM -0700, Paul E. McKenney wrote:
> > On Fri, Sep 30, 2016 at 03:15:11PM +0200, Thomas Gleixner wrote:
> > > On Tue, 27 Sep 2016, Rich Felker wrote:
> > > > I've managed to get a trace with a stall. I'm not sure what the best
> > > > way to share the full thing is, since it's large, but here are the
> > > > potentially interesting parts.
> > 
> > [ . . . ]
> > 
> > Some RCU commentary, on the off-chance that it helps...
> > 
> > > So that should kick rcu_sched-7 in 10ms, latest 20ms from now and CPU1 
> > > goes
> > > into a NOHZ idle sleep.
> > > 
> > > >   -0 [001] d...   109.953436: tick_stop: success=1 
> > > > dependency=NONE
> > > >   -0 [001] d...   109.953617: hrtimer_cancel: 
> > > > hrtimer=109f449c
> > > >   -0 [001] d...   109.953818: hrtimer_start: 
> > > > hrtimer=109f449c function=tick_sched_timer expires=10988000 
> > > > softexpires=10988000
> > > 
> > > which is (using the 0.087621us delta between the trace clock and clock
> > > MONO) at: 109.880 + 0.087621 = 109.968
> > > 
> > > Which is about correct as we expect the RCU timer to fire at:
> > >   
> > >   109.952633 + 0.01 = 109.963633
> > > 
> > > or latest at
> > > 
> > >   109.952633 + 0.02 = 109.983633
> > >
> > > There is another caveat. That nohz stuff can queue the rcu timer on CPU0, 
> > > which
> > > it did not because:
> > 
> > Just for annoying completeness, the location of the timer depends on how
> > the rcuo callback-offload kthreads are constrained.  And yes, in the most
> > constrained case where all CPUs except for CPU 0 are nohz CPUs, they will
> > by default all run on CPU 0.
> 
> In default full nohz configuration, am I correct that all cpus except
> cpu0 willd be nohz and that the rcu callbacks then have to run on
> cpu0?

In recent kernels, it looks to me that the default is that none of the
CPUs will be nohz by default.  You have to build with both NO_HZ_FULL
and NO_HZ_FULL_ALL to get the everything-on-CPU-0 behavior.

But are there rcu0 kthreads running on your system?

> > > >rcu_sched-7 [001] d...   109.952633: timer_start: 
> > > > timer=160a9eb0 function=process_timeout expires=4294948284 [timeout=1] 
> > > > flags=0x0001
> > > 
> > > The CPU nr encoded in flags is: 1
> > > 
> > > Now we cancel and restart the timer w/o seing the interrupt expiring
> > > it. And that expiry should have happened at 109.968000 !?!
> > > 
> > > >   -0 [001] d...   109.968225: hrtimer_cancel: 
> > > > hrtimer=109f449c
> > > >   -0 [001] d...   109.968526: hrtimer_start: 
> > > > hrtimer=109f449c function=tick_sched_timer expires=10989000 
> > > > softexpires=10989000
> > > 
> > > So this advances the next tick even further out. And CPU 0 sets the timer 
> > > to
> > > the exact smae value:
> > > 
> > > >   -0 [000] d.h.   109.969104: hrtimer_start: 
> > > > hrtimer=109e949c function=tick_sched_timer expires=10989000 
> > > > softexpires=10989000
> > > 
> > > 
> > > >   -0 [000] d.h.   109.977690: irq_handler_entry: 
> > > > irq=16 name=jcore_pit
> > > >   -0 [000] d.h.   109.977911: hrtimer_cancel: 
> > > > hrtimer=109e949c
> > > >   -0 [000] d.h.   109.978053: hrtimer_expire_entry: 
> > > > hrtimer=109e949c function=tick_sched_timer now=109890434160
> > > 
> > > Which expires here. And CPU1 instead of getting an interrupt and expiring
> > > the timer does the cancel/restart to the next jiffie again:
> > > 
> > > >   -0 [001] d...   109.978206: hrtimer_cancel: 
> > > > hrtimer=109f449c
> > > >   -0 [001] d...   109.978495: hrtimer_start: 
> > > > hrtimer=109f449c function=tick_sched_timer expires=1099 
> > > > softexpires=1099
> > > 
> > > And this repeats;
> > > 
> > > >   -0 [000] d.h.   109.987726: irq_handler_entry: 
> > > > irq=16 name=jcore_pit
> > > >   -0 [000] d.h.   109.987954: hrtimer_cancel: 
> > > > hrtimer=109e949c
> > > >   -0 [000] d.h.   109.988095: hrtimer_expire_entry: 
> > > > hrtimer=109e949c function=tick_sched_timer now=109900474620
> > > 
> > > >   -0 [001] d...   109.988243: hrtimer_cancel: 
> > > > hrtimer=109f449c
> > > >   -0 [001] d...   109.988537: hrtimer_start: 
> > > > hrtimer=109f449c fun9c
> > > 
> > > There is something badly wrong here.
> > > 
> > > >   -0 [000] ..s.   110.019443: softirq_entry: vec=1 
> > > > [action=TIMER]
> > > >   -0 [000] ..s.   110.019617: softirq_exit: vec=1 
> > > > [action=TIMER]
> > > >   -0 [000] ..s.   110.019730: softirq_entry: vec=7 
> > > > [action=SCHED]
> > > >   -0 [000] ..s.   110.020174: softirq_exit: vec=7 
> > > > [action=SCHED]
> > > >   -0 [000] d.h.   110.027674: irq_handler_entry: 
> > > > irq=16 name=jcore_pit
> > > > 
> > > > The 

Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-10-01 Thread Rich Felker
On Fri, Sep 30, 2016 at 06:48:35AM -0700, Paul E. McKenney wrote:
> On Fri, Sep 30, 2016 at 03:15:11PM +0200, Thomas Gleixner wrote:
> > On Tue, 27 Sep 2016, Rich Felker wrote:
> > > I've managed to get a trace with a stall. I'm not sure what the best
> > > way to share the full thing is, since it's large, but here are the
> > > potentially interesting parts.
> 
> [ . . . ]
> 
> Some RCU commentary, on the off-chance that it helps...
> 
> > So that should kick rcu_sched-7 in 10ms, latest 20ms from now and CPU1 goes
> > into a NOHZ idle sleep.
> > 
> > >   -0 [001] d...   109.953436: tick_stop: success=1 
> > > dependency=NONE
> > >   -0 [001] d...   109.953617: hrtimer_cancel: 
> > > hrtimer=109f449c
> > >   -0 [001] d...   109.953818: hrtimer_start: 
> > > hrtimer=109f449c function=tick_sched_timer expires=10988000 
> > > softexpires=10988000
> > 
> > which is (using the 0.087621us delta between the trace clock and clock
> > MONO) at: 109.880 + 0.087621 = 109.968
> > 
> > Which is about correct as we expect the RCU timer to fire at:
> >   
> >   109.952633 + 0.01 = 109.963633
> > 
> > or latest at
> > 
> >   109.952633 + 0.02 = 109.983633
> >
> > There is another caveat. That nohz stuff can queue the rcu timer on CPU0, 
> > which
> > it did not because:
> 
> Just for annoying completeness, the location of the timer depends on how
> the rcuo callback-offload kthreads are constrained.  And yes, in the most
> constrained case where all CPUs except for CPU 0 are nohz CPUs, they will
> by default all run on CPU 0.

In default full nohz configuration, am I correct that all cpus except
cpu0 willd be nohz and that the rcu callbacks then have to run on
cpu0?

> > >rcu_sched-7 [001] d...   109.952633: timer_start: 
> > > timer=160a9eb0 function=process_timeout expires=4294948284 [timeout=1] 
> > > flags=0x0001
> > 
> > The CPU nr encoded in flags is: 1
> > 
> > Now we cancel and restart the timer w/o seing the interrupt expiring
> > it. And that expiry should have happened at 109.968000 !?!
> > 
> > >   -0 [001] d...   109.968225: hrtimer_cancel: 
> > > hrtimer=109f449c
> > >   -0 [001] d...   109.968526: hrtimer_start: 
> > > hrtimer=109f449c function=tick_sched_timer expires=10989000 
> > > softexpires=10989000
> > 
> > So this advances the next tick even further out. And CPU 0 sets the timer to
> > the exact smae value:
> > 
> > >   -0 [000] d.h.   109.969104: hrtimer_start: 
> > > hrtimer=109e949c function=tick_sched_timer expires=10989000 
> > > softexpires=10989000
> > 
> > 
> > >   -0 [000] d.h.   109.977690: irq_handler_entry: irq=16 
> > > name=jcore_pit
> > >   -0 [000] d.h.   109.977911: hrtimer_cancel: 
> > > hrtimer=109e949c
> > >   -0 [000] d.h.   109.978053: hrtimer_expire_entry: 
> > > hrtimer=109e949c function=tick_sched_timer now=109890434160
> > 
> > Which expires here. And CPU1 instead of getting an interrupt and expiring
> > the timer does the cancel/restart to the next jiffie again:
> > 
> > >   -0 [001] d...   109.978206: hrtimer_cancel: 
> > > hrtimer=109f449c
> > >   -0 [001] d...   109.978495: hrtimer_start: 
> > > hrtimer=109f449c function=tick_sched_timer expires=1099 
> > > softexpires=1099
> > 
> > And this repeats;
> > 
> > >   -0 [000] d.h.   109.987726: irq_handler_entry: irq=16 
> > > name=jcore_pit
> > >   -0 [000] d.h.   109.987954: hrtimer_cancel: 
> > > hrtimer=109e949c
> > >   -0 [000] d.h.   109.988095: hrtimer_expire_entry: 
> > > hrtimer=109e949c function=tick_sched_timer now=109900474620
> > 
> > >   -0 [001] d...   109.988243: hrtimer_cancel: 
> > > hrtimer=109f449c
> > >   -0 [001] d...   109.988537: hrtimer_start: 
> > > hrtimer=109f449c fun9c
> > 
> > There is something badly wrong here.
> > 
> > >   -0 [000] ..s.   110.019443: softirq_entry: vec=1 
> > > [action=TIMER]
> > >   -0 [000] ..s.   110.019617: softirq_exit: vec=1 
> > > [action=TIMER]
> > >   -0 [000] ..s.   110.019730: softirq_entry: vec=7 
> > > [action=SCHED]
> > >   -0 [000] ..s.   110.020174: softirq_exit: vec=7 
> > > [action=SCHED]
> > >   -0 [000] d.h.   110.027674: irq_handler_entry: irq=16 
> > > name=jcore_pit
> > > 
> > > The rcu_sched process does not run again after the tick_stop until
> > > 132s, and only a few RCU softirqs happen (all shown above). During
> > > this time, cpu1 has no interrupt activity and nothing in the trace
> > > except the above hrtimer_cancel/hrtimer_start pairs (not sure how
> > > they're happening without any interrupts).
> > 
> > If you drop out of the arch idle into the core idle loop then you might end
> > up with this. You want to add a few trace points or trace_printks() to the
> > involved functions. tick_nohz_restart() 

Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-10-01 Thread Rich Felker
On Fri, Sep 30, 2016 at 06:48:35AM -0700, Paul E. McKenney wrote:
> On Fri, Sep 30, 2016 at 03:15:11PM +0200, Thomas Gleixner wrote:
> > On Tue, 27 Sep 2016, Rich Felker wrote:
> > > I've managed to get a trace with a stall. I'm not sure what the best
> > > way to share the full thing is, since it's large, but here are the
> > > potentially interesting parts.
> 
> [ . . . ]
> 
> Some RCU commentary, on the off-chance that it helps...
> 
> > So that should kick rcu_sched-7 in 10ms, latest 20ms from now and CPU1 goes
> > into a NOHZ idle sleep.
> > 
> > >   -0 [001] d...   109.953436: tick_stop: success=1 
> > > dependency=NONE
> > >   -0 [001] d...   109.953617: hrtimer_cancel: 
> > > hrtimer=109f449c
> > >   -0 [001] d...   109.953818: hrtimer_start: 
> > > hrtimer=109f449c function=tick_sched_timer expires=10988000 
> > > softexpires=10988000
> > 
> > which is (using the 0.087621us delta between the trace clock and clock
> > MONO) at: 109.880 + 0.087621 = 109.968
> > 
> > Which is about correct as we expect the RCU timer to fire at:
> >   
> >   109.952633 + 0.01 = 109.963633
> > 
> > or latest at
> > 
> >   109.952633 + 0.02 = 109.983633
> >
> > There is another caveat. That nohz stuff can queue the rcu timer on CPU0, 
> > which
> > it did not because:
> 
> Just for annoying completeness, the location of the timer depends on how
> the rcuo callback-offload kthreads are constrained.  And yes, in the most
> constrained case where all CPUs except for CPU 0 are nohz CPUs, they will
> by default all run on CPU 0.

In default full nohz configuration, am I correct that all cpus except
cpu0 willd be nohz and that the rcu callbacks then have to run on
cpu0?

> > >rcu_sched-7 [001] d...   109.952633: timer_start: 
> > > timer=160a9eb0 function=process_timeout expires=4294948284 [timeout=1] 
> > > flags=0x0001
> > 
> > The CPU nr encoded in flags is: 1
> > 
> > Now we cancel and restart the timer w/o seing the interrupt expiring
> > it. And that expiry should have happened at 109.968000 !?!
> > 
> > >   -0 [001] d...   109.968225: hrtimer_cancel: 
> > > hrtimer=109f449c
> > >   -0 [001] d...   109.968526: hrtimer_start: 
> > > hrtimer=109f449c function=tick_sched_timer expires=10989000 
> > > softexpires=10989000
> > 
> > So this advances the next tick even further out. And CPU 0 sets the timer to
> > the exact smae value:
> > 
> > >   -0 [000] d.h.   109.969104: hrtimer_start: 
> > > hrtimer=109e949c function=tick_sched_timer expires=10989000 
> > > softexpires=10989000
> > 
> > 
> > >   -0 [000] d.h.   109.977690: irq_handler_entry: irq=16 
> > > name=jcore_pit
> > >   -0 [000] d.h.   109.977911: hrtimer_cancel: 
> > > hrtimer=109e949c
> > >   -0 [000] d.h.   109.978053: hrtimer_expire_entry: 
> > > hrtimer=109e949c function=tick_sched_timer now=109890434160
> > 
> > Which expires here. And CPU1 instead of getting an interrupt and expiring
> > the timer does the cancel/restart to the next jiffie again:
> > 
> > >   -0 [001] d...   109.978206: hrtimer_cancel: 
> > > hrtimer=109f449c
> > >   -0 [001] d...   109.978495: hrtimer_start: 
> > > hrtimer=109f449c function=tick_sched_timer expires=1099 
> > > softexpires=1099
> > 
> > And this repeats;
> > 
> > >   -0 [000] d.h.   109.987726: irq_handler_entry: irq=16 
> > > name=jcore_pit
> > >   -0 [000] d.h.   109.987954: hrtimer_cancel: 
> > > hrtimer=109e949c
> > >   -0 [000] d.h.   109.988095: hrtimer_expire_entry: 
> > > hrtimer=109e949c function=tick_sched_timer now=109900474620
> > 
> > >   -0 [001] d...   109.988243: hrtimer_cancel: 
> > > hrtimer=109f449c
> > >   -0 [001] d...   109.988537: hrtimer_start: 
> > > hrtimer=109f449c fun9c
> > 
> > There is something badly wrong here.
> > 
> > >   -0 [000] ..s.   110.019443: softirq_entry: vec=1 
> > > [action=TIMER]
> > >   -0 [000] ..s.   110.019617: softirq_exit: vec=1 
> > > [action=TIMER]
> > >   -0 [000] ..s.   110.019730: softirq_entry: vec=7 
> > > [action=SCHED]
> > >   -0 [000] ..s.   110.020174: softirq_exit: vec=7 
> > > [action=SCHED]
> > >   -0 [000] d.h.   110.027674: irq_handler_entry: irq=16 
> > > name=jcore_pit
> > > 
> > > The rcu_sched process does not run again after the tick_stop until
> > > 132s, and only a few RCU softirqs happen (all shown above). During
> > > this time, cpu1 has no interrupt activity and nothing in the trace
> > > except the above hrtimer_cancel/hrtimer_start pairs (not sure how
> > > they're happening without any interrupts).
> > 
> > If you drop out of the arch idle into the core idle loop then you might end
> > up with this. You want to add a few trace points or trace_printks() to the
> > involved functions. tick_nohz_restart() 

Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-09-30 Thread Paul E. McKenney
On Fri, Sep 30, 2016 at 03:15:11PM +0200, Thomas Gleixner wrote:
> On Tue, 27 Sep 2016, Rich Felker wrote:
> > I've managed to get a trace with a stall. I'm not sure what the best
> > way to share the full thing is, since it's large, but here are the
> > potentially interesting parts.

[ . . . ]

Some RCU commentary, on the off-chance that it helps...

> So that should kick rcu_sched-7 in 10ms, latest 20ms from now and CPU1 goes
> into a NOHZ idle sleep.
> 
> >   -0 [001] d...   109.953436: tick_stop: success=1 
> > dependency=NONE
> >   -0 [001] d...   109.953617: hrtimer_cancel: 
> > hrtimer=109f449c
> >   -0 [001] d...   109.953818: hrtimer_start: 
> > hrtimer=109f449c function=tick_sched_timer expires=10988000 
> > softexpires=10988000
> 
> which is (using the 0.087621us delta between the trace clock and clock
> MONO) at: 109.880 + 0.087621 = 109.968
> 
> Which is about correct as we expect the RCU timer to fire at:
>   
>   109.952633 + 0.01 = 109.963633
> 
> or latest at
> 
>   109.952633 + 0.02 = 109.983633
>
> There is another caveat. That nohz stuff can queue the rcu timer on CPU0, 
> which
> it did not because:

Just for annoying completeness, the location of the timer depends on how
the rcuo callback-offload kthreads are constrained.  And yes, in the most
constrained case where all CPUs except for CPU 0 are nohz CPUs, they will
by default all run on CPU 0.

> >rcu_sched-7 [001] d...   109.952633: timer_start: timer=160a9eb0 
> > function=process_timeout expires=4294948284 [timeout=1] flags=0x0001
> 
> The CPU nr encoded in flags is: 1
> 
> Now we cancel and restart the timer w/o seing the interrupt expiring
> it. And that expiry should have happened at 109.968000 !?!
> 
> >   -0 [001] d...   109.968225: hrtimer_cancel: 
> > hrtimer=109f449c
> >   -0 [001] d...   109.968526: hrtimer_start: 
> > hrtimer=109f449c function=tick_sched_timer expires=10989000 
> > softexpires=10989000
> 
> So this advances the next tick even further out. And CPU 0 sets the timer to
> the exact smae value:
> 
> >   -0 [000] d.h.   109.969104: hrtimer_start: 
> > hrtimer=109e949c function=tick_sched_timer expires=10989000 
> > softexpires=10989000
> 
> 
> >   -0 [000] d.h.   109.977690: irq_handler_entry: irq=16 
> > name=jcore_pit
> >   -0 [000] d.h.   109.977911: hrtimer_cancel: 
> > hrtimer=109e949c
> >   -0 [000] d.h.   109.978053: hrtimer_expire_entry: 
> > hrtimer=109e949c function=tick_sched_timer now=109890434160
> 
> Which expires here. And CPU1 instead of getting an interrupt and expiring
> the timer does the cancel/restart to the next jiffie again:
> 
> >   -0 [001] d...   109.978206: hrtimer_cancel: 
> > hrtimer=109f449c
> >   -0 [001] d...   109.978495: hrtimer_start: 
> > hrtimer=109f449c function=tick_sched_timer expires=1099 
> > softexpires=1099
> 
> And this repeats;
> 
> >   -0 [000] d.h.   109.987726: irq_handler_entry: irq=16 
> > name=jcore_pit
> >   -0 [000] d.h.   109.987954: hrtimer_cancel: 
> > hrtimer=109e949c
> >   -0 [000] d.h.   109.988095: hrtimer_expire_entry: 
> > hrtimer=109e949c function=tick_sched_timer now=109900474620
> 
> >   -0 [001] d...   109.988243: hrtimer_cancel: 
> > hrtimer=109f449c
> >   -0 [001] d...   109.988537: hrtimer_start: 
> > hrtimer=109f449c fun9c
> 
> There is something badly wrong here.
> 
> >   -0 [000] ..s.   110.019443: softirq_entry: vec=1 
> > [action=TIMER]
> >   -0 [000] ..s.   110.019617: softirq_exit: vec=1 
> > [action=TIMER]
> >   -0 [000] ..s.   110.019730: softirq_entry: vec=7 
> > [action=SCHED]
> >   -0 [000] ..s.   110.020174: softirq_exit: vec=7 
> > [action=SCHED]
> >   -0 [000] d.h.   110.027674: irq_handler_entry: irq=16 
> > name=jcore_pit
> > 
> > The rcu_sched process does not run again after the tick_stop until
> > 132s, and only a few RCU softirqs happen (all shown above). During
> > this time, cpu1 has no interrupt activity and nothing in the trace
> > except the above hrtimer_cancel/hrtimer_start pairs (not sure how
> > they're happening without any interrupts).
> 
> If you drop out of the arch idle into the core idle loop then you might end
> up with this. You want to add a few trace points or trace_printks() to the
> involved functions. tick_nohz_restart() tick_nohz_stop_sched_tick()
> tick_nohz_restart_sched_tick() and the idle code should be a good starting
> point.
> 
> > This pattern repeats until almost 131s, where cpu1 goes into a frenzy
> > of hrtimer_cancel/start:
> 
> It's not a frenzy. It's the same pattern as above. It arms the timer to the
> next tick, but that timer never ever fires. And it does that every tick 
> 
> Please put a tracepoint into your set_next_event() callback as well. SO
> this 

Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-09-30 Thread Paul E. McKenney
On Fri, Sep 30, 2016 at 03:15:11PM +0200, Thomas Gleixner wrote:
> On Tue, 27 Sep 2016, Rich Felker wrote:
> > I've managed to get a trace with a stall. I'm not sure what the best
> > way to share the full thing is, since it's large, but here are the
> > potentially interesting parts.

[ . . . ]

Some RCU commentary, on the off-chance that it helps...

> So that should kick rcu_sched-7 in 10ms, latest 20ms from now and CPU1 goes
> into a NOHZ idle sleep.
> 
> >   -0 [001] d...   109.953436: tick_stop: success=1 
> > dependency=NONE
> >   -0 [001] d...   109.953617: hrtimer_cancel: 
> > hrtimer=109f449c
> >   -0 [001] d...   109.953818: hrtimer_start: 
> > hrtimer=109f449c function=tick_sched_timer expires=10988000 
> > softexpires=10988000
> 
> which is (using the 0.087621us delta between the trace clock and clock
> MONO) at: 109.880 + 0.087621 = 109.968
> 
> Which is about correct as we expect the RCU timer to fire at:
>   
>   109.952633 + 0.01 = 109.963633
> 
> or latest at
> 
>   109.952633 + 0.02 = 109.983633
>
> There is another caveat. That nohz stuff can queue the rcu timer on CPU0, 
> which
> it did not because:

Just for annoying completeness, the location of the timer depends on how
the rcuo callback-offload kthreads are constrained.  And yes, in the most
constrained case where all CPUs except for CPU 0 are nohz CPUs, they will
by default all run on CPU 0.

> >rcu_sched-7 [001] d...   109.952633: timer_start: timer=160a9eb0 
> > function=process_timeout expires=4294948284 [timeout=1] flags=0x0001
> 
> The CPU nr encoded in flags is: 1
> 
> Now we cancel and restart the timer w/o seing the interrupt expiring
> it. And that expiry should have happened at 109.968000 !?!
> 
> >   -0 [001] d...   109.968225: hrtimer_cancel: 
> > hrtimer=109f449c
> >   -0 [001] d...   109.968526: hrtimer_start: 
> > hrtimer=109f449c function=tick_sched_timer expires=10989000 
> > softexpires=10989000
> 
> So this advances the next tick even further out. And CPU 0 sets the timer to
> the exact smae value:
> 
> >   -0 [000] d.h.   109.969104: hrtimer_start: 
> > hrtimer=109e949c function=tick_sched_timer expires=10989000 
> > softexpires=10989000
> 
> 
> >   -0 [000] d.h.   109.977690: irq_handler_entry: irq=16 
> > name=jcore_pit
> >   -0 [000] d.h.   109.977911: hrtimer_cancel: 
> > hrtimer=109e949c
> >   -0 [000] d.h.   109.978053: hrtimer_expire_entry: 
> > hrtimer=109e949c function=tick_sched_timer now=109890434160
> 
> Which expires here. And CPU1 instead of getting an interrupt and expiring
> the timer does the cancel/restart to the next jiffie again:
> 
> >   -0 [001] d...   109.978206: hrtimer_cancel: 
> > hrtimer=109f449c
> >   -0 [001] d...   109.978495: hrtimer_start: 
> > hrtimer=109f449c function=tick_sched_timer expires=1099 
> > softexpires=1099
> 
> And this repeats;
> 
> >   -0 [000] d.h.   109.987726: irq_handler_entry: irq=16 
> > name=jcore_pit
> >   -0 [000] d.h.   109.987954: hrtimer_cancel: 
> > hrtimer=109e949c
> >   -0 [000] d.h.   109.988095: hrtimer_expire_entry: 
> > hrtimer=109e949c function=tick_sched_timer now=109900474620
> 
> >   -0 [001] d...   109.988243: hrtimer_cancel: 
> > hrtimer=109f449c
> >   -0 [001] d...   109.988537: hrtimer_start: 
> > hrtimer=109f449c fun9c
> 
> There is something badly wrong here.
> 
> >   -0 [000] ..s.   110.019443: softirq_entry: vec=1 
> > [action=TIMER]
> >   -0 [000] ..s.   110.019617: softirq_exit: vec=1 
> > [action=TIMER]
> >   -0 [000] ..s.   110.019730: softirq_entry: vec=7 
> > [action=SCHED]
> >   -0 [000] ..s.   110.020174: softirq_exit: vec=7 
> > [action=SCHED]
> >   -0 [000] d.h.   110.027674: irq_handler_entry: irq=16 
> > name=jcore_pit
> > 
> > The rcu_sched process does not run again after the tick_stop until
> > 132s, and only a few RCU softirqs happen (all shown above). During
> > this time, cpu1 has no interrupt activity and nothing in the trace
> > except the above hrtimer_cancel/hrtimer_start pairs (not sure how
> > they're happening without any interrupts).
> 
> If you drop out of the arch idle into the core idle loop then you might end
> up with this. You want to add a few trace points or trace_printks() to the
> involved functions. tick_nohz_restart() tick_nohz_stop_sched_tick()
> tick_nohz_restart_sched_tick() and the idle code should be a good starting
> point.
> 
> > This pattern repeats until almost 131s, where cpu1 goes into a frenzy
> > of hrtimer_cancel/start:
> 
> It's not a frenzy. It's the same pattern as above. It arms the timer to the
> next tick, but that timer never ever fires. And it does that every tick 
> 
> Please put a tracepoint into your set_next_event() callback as well. SO
> this 

Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-09-30 Thread Thomas Gleixner
On Tue, 27 Sep 2016, Rich Felker wrote:
> I've managed to get a trace with a stall. I'm not sure what the best
> way to share the full thing is, since it's large, but here are the
> potentially interesting parts.

Upload it somewhere.
 
> The first is a big time gap with no events, from 82.446093 to
> 109.852709:
> 

>   -0 [001] d.h.82.443398: irq_handler_entry: irq=16 
> name=jcore_pit
>   -0 [001] d.h.82.443573: hrtimer_cancel: 
> hrtimer=109f449c
>   -0 [001] d.h.82.443689: hrtimer_expire_entry: 
> hrtimer=109f449c function=tick_sched_timer now=82356112320

Tick timer expires

>   -0 [001] d.h.82.443911: softirq_raise: vec=1 
> [action=TIMER]
>   -0 [001] d.h.82.444089: softirq_raise: vec=9 
> [action=RCU]
>   -0 [001] d.h.82.444306: softirq_raise: vec=7 
> [action=SCHED]

Raises softirqs

>   -0 [001] d.h.82.48: hrtimer_expire_exit: 
> hrtimer=109f449c
>   -0 [001] d.h.82.444592: hrtimer_start: 
> hrtimer=109f449c function=tick_sched_timer expires=8236000 
> softexpires=8236000

Rearms the next period

>   -0 [001] d.h.82.444821: irq_handler_exit: irq=16 
> ret=handled
>   -0 [001] ..s.82.445034: softirq_entry: vec=1 
> [action=TIMER]
>   -0 [001] ..s.82.445222: softirq_exit: vec=1 
> [action=TIMER]
>   -0 [001] ..s.82.445341: softirq_entry: vec=7 
> [action=SCHED]
>   -0 [001] ..s.82.445745: softirq_exit: vec=7 
> [action=SCHED]
>   -0 [001] ..s.82.445880: softirq_entry: vec=9 
> [action=RCU]
>   -0 [001] ..s.82.446093: softirq_exit: vec=9 
> [action=RCU]

Softirqs done and the machine vanishes into lala land 

And now after that gap we have a cancel/start on CPU 0

>   -0 [000] dn..   109.852709: hrtimer_cancel: 
> hrtimer=109e949c
>   -0 [000] dn..   109.853043: hrtimer_start: 
> hrtimer=109e949c function=tick_sched_timer expires=10977000 
> softexpires=10977000

>   sh-388   [000] d.h.   109.855851: irq_handler_entry: irq=18 
> name=uartlite
>   sh-388   [000] d.h.   109.856751: irq_handler_exit: irq=18 
> ret=handled
>   sh-388   [000] d.h.   109.857264: irq_handler_entry: irq=18 
> name=uartlite
>   sh-388   [000] d.h.   109.857458: irq_handler_exit: irq=18 
> ret=unhandled
>   -0 [001] d.h.   109.857684: irq_handler_entry: irq=16 
> name=jcore_pit
>   -0 [001] d.h.   109.857934: hrtimer_cancel: 
> hrtimer=109f449c
>   -0 [001] d.h.   109.858087: hrtimer_expire_entry: 
> hrtimer=109f449c function=tick_sched_timer now=109770452820

And this expiry happens on CPU 1, but the expiry time looks like the one
which got armed on CPU 0 above !?!?!

>   -0 [001] d.h.   109.858639: softirq_raise: vec=1 
> [action=TIMER]
>   -0 [001] d.h.   109.858839: softirq_raise: vec=9 
> [action=RCU]
>   -0 [001] d.h.   109.859097: softirq_raise: vec=7 
> [action=SCHED]
>   -0 [001] d.h.   109.859282: hrtimer_expire_exit: 
> hrtimer=109f449c
>   -0 [001] d.h.   109.859428: hrtimer_start: 
> hrtimer=109f449c function=tick_sched_timer expires=10978000 
> softexpires=10978000
>   -0 [001] d.h.   109.859637: irq_handler_exit: irq=16 
> ret=handled
>   -0 [001] ..s.   109.859806: softirq_entry: vec=1 
> [action=TIMER]
>   -0 [001] ..s.   109.859984: softirq_exit: vec=1 
> [action=TIMER]
>   -0 [001] ..s.   109.860098: softirq_entry: vec=7 
> [action=SCHED]
>   -0 [001] ..s.   109.860533: softirq_exit: vec=7 
> [action=SCHED]
>   -0 [001] ..s.   109.860663: softirq_entry: vec=9 
> [action=RCU]
>   -0 [001] .ns.   109.861424: softirq_exit: vec=9 
> [action=RCU]
>   sh-388   [000] d.h.   109.861831: irq_handler_entry: irq=16 
> name=jcore_pit
>   sh-388   [000] d.h.   109.862071: hrtimer_cancel: 
> hrtimer=109e949c
>rcu_sched-7 [001]    109.862141: timer_init: timer=160a9eb0
>   sh-388   [000] d.h.   109.862233: hrtimer_expire_entry: 
> hrtimer=109e949c function=tick_sched_timer now=109774586740

And CPU0 has it's timer firing 4.5ms late .

>rcu_sched-7 [001] d...   109.862317: timer_start: timer=160a9eb0 
> function=process_timeout expires=4294948275 [timeout=1] flags=0x0001
> 
> However no stall message appeared on the console here, and I suspect
> it might have just been a gap in event recording because it doesn't
> look terribly suspicious. At first I thought (and discussed with

Well. If there are gaps in the traces then decoding this is going to be a
nightmare.

> Daniel off-list) that the hrtimer_cancel at the end of the long gap
> might have been spinning for the whole time, but this does not match
> the behavior of the stalls I'm 

Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-09-30 Thread Thomas Gleixner
On Tue, 27 Sep 2016, Rich Felker wrote:
> I've managed to get a trace with a stall. I'm not sure what the best
> way to share the full thing is, since it's large, but here are the
> potentially interesting parts.

Upload it somewhere.
 
> The first is a big time gap with no events, from 82.446093 to
> 109.852709:
> 

>   -0 [001] d.h.82.443398: irq_handler_entry: irq=16 
> name=jcore_pit
>   -0 [001] d.h.82.443573: hrtimer_cancel: 
> hrtimer=109f449c
>   -0 [001] d.h.82.443689: hrtimer_expire_entry: 
> hrtimer=109f449c function=tick_sched_timer now=82356112320

Tick timer expires

>   -0 [001] d.h.82.443911: softirq_raise: vec=1 
> [action=TIMER]
>   -0 [001] d.h.82.444089: softirq_raise: vec=9 
> [action=RCU]
>   -0 [001] d.h.82.444306: softirq_raise: vec=7 
> [action=SCHED]

Raises softirqs

>   -0 [001] d.h.82.48: hrtimer_expire_exit: 
> hrtimer=109f449c
>   -0 [001] d.h.82.444592: hrtimer_start: 
> hrtimer=109f449c function=tick_sched_timer expires=8236000 
> softexpires=8236000

Rearms the next period

>   -0 [001] d.h.82.444821: irq_handler_exit: irq=16 
> ret=handled
>   -0 [001] ..s.82.445034: softirq_entry: vec=1 
> [action=TIMER]
>   -0 [001] ..s.82.445222: softirq_exit: vec=1 
> [action=TIMER]
>   -0 [001] ..s.82.445341: softirq_entry: vec=7 
> [action=SCHED]
>   -0 [001] ..s.82.445745: softirq_exit: vec=7 
> [action=SCHED]
>   -0 [001] ..s.82.445880: softirq_entry: vec=9 
> [action=RCU]
>   -0 [001] ..s.82.446093: softirq_exit: vec=9 
> [action=RCU]

Softirqs done and the machine vanishes into lala land 

And now after that gap we have a cancel/start on CPU 0

>   -0 [000] dn..   109.852709: hrtimer_cancel: 
> hrtimer=109e949c
>   -0 [000] dn..   109.853043: hrtimer_start: 
> hrtimer=109e949c function=tick_sched_timer expires=10977000 
> softexpires=10977000

>   sh-388   [000] d.h.   109.855851: irq_handler_entry: irq=18 
> name=uartlite
>   sh-388   [000] d.h.   109.856751: irq_handler_exit: irq=18 
> ret=handled
>   sh-388   [000] d.h.   109.857264: irq_handler_entry: irq=18 
> name=uartlite
>   sh-388   [000] d.h.   109.857458: irq_handler_exit: irq=18 
> ret=unhandled
>   -0 [001] d.h.   109.857684: irq_handler_entry: irq=16 
> name=jcore_pit
>   -0 [001] d.h.   109.857934: hrtimer_cancel: 
> hrtimer=109f449c
>   -0 [001] d.h.   109.858087: hrtimer_expire_entry: 
> hrtimer=109f449c function=tick_sched_timer now=109770452820

And this expiry happens on CPU 1, but the expiry time looks like the one
which got armed on CPU 0 above !?!?!

>   -0 [001] d.h.   109.858639: softirq_raise: vec=1 
> [action=TIMER]
>   -0 [001] d.h.   109.858839: softirq_raise: vec=9 
> [action=RCU]
>   -0 [001] d.h.   109.859097: softirq_raise: vec=7 
> [action=SCHED]
>   -0 [001] d.h.   109.859282: hrtimer_expire_exit: 
> hrtimer=109f449c
>   -0 [001] d.h.   109.859428: hrtimer_start: 
> hrtimer=109f449c function=tick_sched_timer expires=10978000 
> softexpires=10978000
>   -0 [001] d.h.   109.859637: irq_handler_exit: irq=16 
> ret=handled
>   -0 [001] ..s.   109.859806: softirq_entry: vec=1 
> [action=TIMER]
>   -0 [001] ..s.   109.859984: softirq_exit: vec=1 
> [action=TIMER]
>   -0 [001] ..s.   109.860098: softirq_entry: vec=7 
> [action=SCHED]
>   -0 [001] ..s.   109.860533: softirq_exit: vec=7 
> [action=SCHED]
>   -0 [001] ..s.   109.860663: softirq_entry: vec=9 
> [action=RCU]
>   -0 [001] .ns.   109.861424: softirq_exit: vec=9 
> [action=RCU]
>   sh-388   [000] d.h.   109.861831: irq_handler_entry: irq=16 
> name=jcore_pit
>   sh-388   [000] d.h.   109.862071: hrtimer_cancel: 
> hrtimer=109e949c
>rcu_sched-7 [001]    109.862141: timer_init: timer=160a9eb0
>   sh-388   [000] d.h.   109.862233: hrtimer_expire_entry: 
> hrtimer=109e949c function=tick_sched_timer now=109774586740

And CPU0 has it's timer firing 4.5ms late .

>rcu_sched-7 [001] d...   109.862317: timer_start: timer=160a9eb0 
> function=process_timeout expires=4294948275 [timeout=1] flags=0x0001
> 
> However no stall message appeared on the console here, and I suspect
> it might have just been a gap in event recording because it doesn't
> look terribly suspicious. At first I thought (and discussed with

Well. If there are gaps in the traces then decoding this is going to be a
nightmare.

> Daniel off-list) that the hrtimer_cancel at the end of the long gap
> might have been spinning for the whole time, but this does not match
> the behavior of the stalls I'm 

Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-09-27 Thread Rich Felker
On Mon, Sep 26, 2016 at 08:42:58PM -0400, Rich Felker wrote:
> On Mon, Sep 26, 2016 at 07:55:13PM -0400, Thomas Gleixner wrote:
> > On Mon, 26 Sep 2016, Rich Felker wrote:
> > > On Mon, Sep 26, 2016 at 11:27:14PM +0200, Daniel Lezcano wrote:
> > > Based on use of ftrace, I was able to see situations where a second
> > > timer hardirq happened immediately after one occurred, before the
> > > timer softirq could run. My theory is that this is causing some kind
> > > of feedback loop where new timer expirations keep getting scheduled
> > > with a very short interval such that the softirq never gets to run
> > > (until other interrupt activity disrups the feedback loop). I tried
> > > reverting 4e85876a9d2a977b4a07389da8c07edf76d10825 which seemed
> > > relevant and it didn't help, but on further review (right now) there
> > > seem to be a few related commits just before it that might be
> > > responsible for the regression. I'll see if I can dig up anything else
> > > useful.
> > 
> > Interesting theory, but lets look at the data from the other thread:
> > 
> >  -0 [001] d.h.  2646.702790: irq_handler_entry: irq=72 
> > name=jcore_pit
> >   -0 [001] d.h.  2646.703398: softirq_raise: vec=1 
> > [action=TIMER]
> >   -0 [001] d.h.  2646.703607: softirq_raise: vec=9 
> > [action=RCU]
> >   -0 [001] d.h.  2646.703884: softirq_raise: vec=7 
> > [action=SCHED]
> >   -0 [001] d.h.  2646.704191: irq_handler_exit: irq=72 
> > ret=handled
> > 
> > So it takes 1.4ms to finish the timer irq handler
> > 
> >   -0 [001] d.H.  2646.704509: irq_handler_entry: irq=72 
> > name=jcore_pit
> >   -0 [001] d.H.  2646.704990: softirq_raise: vec=1 
> > [action=TIMER]
> >   -0 [001] d.H.  2646.705182: softirq_raise: vec=9 
> > [action=RCU]
> >   -0 [001] d.H.  2646.705465: softirq_raise: vec=7 
> > [action=SCHED]
> >   -0 [001] d.H.  2646.705761: irq_handler_exit: irq=72 
> > ret=handled
> > 
> > And 1.2ms in this case
> >   
> > 
> >   -0 [001] d.H.  2646.706071: irq_handler_entry: irq=72 
> > name=jcore_pit
> >   -0 [001] d.H.  2646.706328: irq_handler_exit: irq=72 
> > ret=handled
> > 
> > So this one is actually short enough that the soft interrupt can excute and
> > obviously no new timer is scheduled close enough.
> > 
> >   -0 [001] ..s.  2646.706530: softirq_entry: vec=1 
> > [action=TIMER]
> > 
> > Now the obvious question arises: Why is that handler taking that long? And
> 
> I'm not sure. Even at our clock rate of 50 MHz, 1.2ms is 6 cycles.
> It's possible much of that overhead is coming from ftrace. I can play
> around again with just logging the minimum intervals that are
> programmed, and not using ftrace.
> 
> > sure the next event being programmed shortely after the other has to do
> > with various factors (HZ setting, High-resolution timers enabled?). W/o
> > having the corresponding hrtimer tracepoints available (start and expire)
> > it's hard to tell what's going on.
> 
> I do have hrtimers enabled. If it would be helpful I could try again
> with traces for them enabled too.
> 
> > But I can tell for sure that there is no feedback loop and its not
> > disrupted by some other interrupt.
> 
> FYI no rcu_sched messages or noticable stalls were observed with
> tracing active. The cpu load from tracing is so high that I would not
> expect to see the same behavior. I probably failed to make that clear
> before; sorry.

I've managed to get a trace with a stall. I'm not sure what the best
way to share the full thing is, since it's large, but here are the
potentially interesting parts.

The first is a big time gap with no events, from 82.446093 to
109.852709:

  -0 [001] .ns.82.431940: timer_expire_exit: 
timer=160a9eb0
  -0 [001] .ns.82.432088: softirq_exit: vec=1 
[action=TIMER]
  -0 [001] .ns.82.432180: softirq_entry: vec=7 
[action=SCHED]
  -0 [001] .ns.82.432380: softirq_exit: vec=7 
[action=SCHED]
 ksoftirqd/1-14[001] ..s.82.433042: softirq_entry: vec=1 
[action=TIMER]
 ksoftirqd/1-14[001] ..s.82.433166: softirq_exit: vec=1 
[action=TIMER]
 ksoftirqd/1-14[001] ..s.82.433258: softirq_entry: vec=7 
[action=SCHED]
 ksoftirqd/1-14[001] ..s.82.433428: softirq_exit: vec=7 
[action=SCHED]
   rcu_sched-7 [001] 82.434269: timer_init: timer=160a9eb0
   rcu_sched-7 [001] d...82.434410: timer_start: timer=160a9eb0 
function=process_timeout expires=4294945532 [timeout=1] flags=0x0001
 cat-394   [000] d.h.82.437621: irq_handler_entry: irq=16 
name=jcore_pit
 cat-394   [000] d.h.82.437851: hrtimer_cancel: hrtimer=109e949c
 cat-394   [000] d.h.82.438006: hrtimer_expire_entry: 
hrtimer=109e949c function=tick_sched_timer now=82350361520
 cat-394   [000] d.h.82.438555: 

Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-09-27 Thread Rich Felker
On Mon, Sep 26, 2016 at 08:42:58PM -0400, Rich Felker wrote:
> On Mon, Sep 26, 2016 at 07:55:13PM -0400, Thomas Gleixner wrote:
> > On Mon, 26 Sep 2016, Rich Felker wrote:
> > > On Mon, Sep 26, 2016 at 11:27:14PM +0200, Daniel Lezcano wrote:
> > > Based on use of ftrace, I was able to see situations where a second
> > > timer hardirq happened immediately after one occurred, before the
> > > timer softirq could run. My theory is that this is causing some kind
> > > of feedback loop where new timer expirations keep getting scheduled
> > > with a very short interval such that the softirq never gets to run
> > > (until other interrupt activity disrups the feedback loop). I tried
> > > reverting 4e85876a9d2a977b4a07389da8c07edf76d10825 which seemed
> > > relevant and it didn't help, but on further review (right now) there
> > > seem to be a few related commits just before it that might be
> > > responsible for the regression. I'll see if I can dig up anything else
> > > useful.
> > 
> > Interesting theory, but lets look at the data from the other thread:
> > 
> >  -0 [001] d.h.  2646.702790: irq_handler_entry: irq=72 
> > name=jcore_pit
> >   -0 [001] d.h.  2646.703398: softirq_raise: vec=1 
> > [action=TIMER]
> >   -0 [001] d.h.  2646.703607: softirq_raise: vec=9 
> > [action=RCU]
> >   -0 [001] d.h.  2646.703884: softirq_raise: vec=7 
> > [action=SCHED]
> >   -0 [001] d.h.  2646.704191: irq_handler_exit: irq=72 
> > ret=handled
> > 
> > So it takes 1.4ms to finish the timer irq handler
> > 
> >   -0 [001] d.H.  2646.704509: irq_handler_entry: irq=72 
> > name=jcore_pit
> >   -0 [001] d.H.  2646.704990: softirq_raise: vec=1 
> > [action=TIMER]
> >   -0 [001] d.H.  2646.705182: softirq_raise: vec=9 
> > [action=RCU]
> >   -0 [001] d.H.  2646.705465: softirq_raise: vec=7 
> > [action=SCHED]
> >   -0 [001] d.H.  2646.705761: irq_handler_exit: irq=72 
> > ret=handled
> > 
> > And 1.2ms in this case
> >   
> > 
> >   -0 [001] d.H.  2646.706071: irq_handler_entry: irq=72 
> > name=jcore_pit
> >   -0 [001] d.H.  2646.706328: irq_handler_exit: irq=72 
> > ret=handled
> > 
> > So this one is actually short enough that the soft interrupt can excute and
> > obviously no new timer is scheduled close enough.
> > 
> >   -0 [001] ..s.  2646.706530: softirq_entry: vec=1 
> > [action=TIMER]
> > 
> > Now the obvious question arises: Why is that handler taking that long? And
> 
> I'm not sure. Even at our clock rate of 50 MHz, 1.2ms is 6 cycles.
> It's possible much of that overhead is coming from ftrace. I can play
> around again with just logging the minimum intervals that are
> programmed, and not using ftrace.
> 
> > sure the next event being programmed shortely after the other has to do
> > with various factors (HZ setting, High-resolution timers enabled?). W/o
> > having the corresponding hrtimer tracepoints available (start and expire)
> > it's hard to tell what's going on.
> 
> I do have hrtimers enabled. If it would be helpful I could try again
> with traces for them enabled too.
> 
> > But I can tell for sure that there is no feedback loop and its not
> > disrupted by some other interrupt.
> 
> FYI no rcu_sched messages or noticable stalls were observed with
> tracing active. The cpu load from tracing is so high that I would not
> expect to see the same behavior. I probably failed to make that clear
> before; sorry.

I've managed to get a trace with a stall. I'm not sure what the best
way to share the full thing is, since it's large, but here are the
potentially interesting parts.

The first is a big time gap with no events, from 82.446093 to
109.852709:

  -0 [001] .ns.82.431940: timer_expire_exit: 
timer=160a9eb0
  -0 [001] .ns.82.432088: softirq_exit: vec=1 
[action=TIMER]
  -0 [001] .ns.82.432180: softirq_entry: vec=7 
[action=SCHED]
  -0 [001] .ns.82.432380: softirq_exit: vec=7 
[action=SCHED]
 ksoftirqd/1-14[001] ..s.82.433042: softirq_entry: vec=1 
[action=TIMER]
 ksoftirqd/1-14[001] ..s.82.433166: softirq_exit: vec=1 
[action=TIMER]
 ksoftirqd/1-14[001] ..s.82.433258: softirq_entry: vec=7 
[action=SCHED]
 ksoftirqd/1-14[001] ..s.82.433428: softirq_exit: vec=7 
[action=SCHED]
   rcu_sched-7 [001] 82.434269: timer_init: timer=160a9eb0
   rcu_sched-7 [001] d...82.434410: timer_start: timer=160a9eb0 
function=process_timeout expires=4294945532 [timeout=1] flags=0x0001
 cat-394   [000] d.h.82.437621: irq_handler_entry: irq=16 
name=jcore_pit
 cat-394   [000] d.h.82.437851: hrtimer_cancel: hrtimer=109e949c
 cat-394   [000] d.h.82.438006: hrtimer_expire_entry: 
hrtimer=109e949c function=tick_sched_timer now=82350361520
 cat-394   [000] d.h.82.438555: 

Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-09-26 Thread Rich Felker
On Mon, Sep 26, 2016 at 07:55:13PM -0400, Thomas Gleixner wrote:
> On Mon, 26 Sep 2016, Rich Felker wrote:
> > On Mon, Sep 26, 2016 at 11:27:14PM +0200, Daniel Lezcano wrote:
> > Based on use of ftrace, I was able to see situations where a second
> > timer hardirq happened immediately after one occurred, before the
> > timer softirq could run. My theory is that this is causing some kind
> > of feedback loop where new timer expirations keep getting scheduled
> > with a very short interval such that the softirq never gets to run
> > (until other interrupt activity disrups the feedback loop). I tried
> > reverting 4e85876a9d2a977b4a07389da8c07edf76d10825 which seemed
> > relevant and it didn't help, but on further review (right now) there
> > seem to be a few related commits just before it that might be
> > responsible for the regression. I'll see if I can dig up anything else
> > useful.
> 
> Interesting theory, but lets look at the data from the other thread:
> 
>  -0 [001] d.h.  2646.702790: irq_handler_entry: irq=72 
> name=jcore_pit
>   -0 [001] d.h.  2646.703398: softirq_raise: vec=1 
> [action=TIMER]
>   -0 [001] d.h.  2646.703607: softirq_raise: vec=9 
> [action=RCU]
>   -0 [001] d.h.  2646.703884: softirq_raise: vec=7 
> [action=SCHED]
>   -0 [001] d.h.  2646.704191: irq_handler_exit: irq=72 
> ret=handled
> 
> So it takes 1.4ms to finish the timer irq handler
> 
>   -0 [001] d.H.  2646.704509: irq_handler_entry: irq=72 
> name=jcore_pit
>   -0 [001] d.H.  2646.704990: softirq_raise: vec=1 
> [action=TIMER]
>   -0 [001] d.H.  2646.705182: softirq_raise: vec=9 
> [action=RCU]
>   -0 [001] d.H.  2646.705465: softirq_raise: vec=7 
> [action=SCHED]
>   -0 [001] d.H.  2646.705761: irq_handler_exit: irq=72 
> ret=handled
> 
> And 1.2ms in this case
> 
> 
>   -0 [001] d.H.  2646.706071: irq_handler_entry: irq=72 
> name=jcore_pit
>   -0 [001] d.H.  2646.706328: irq_handler_exit: irq=72 
> ret=handled
> 
> So this one is actually short enough that the soft interrupt can excute and
> obviously no new timer is scheduled close enough.
> 
>   -0 [001] ..s.  2646.706530: softirq_entry: vec=1 
> [action=TIMER]
> 
> Now the obvious question arises: Why is that handler taking that long? And

I'm not sure. Even at our clock rate of 50 MHz, 1.2ms is 6 cycles.
It's possible much of that overhead is coming from ftrace. I can play
around again with just logging the minimum intervals that are
programmed, and not using ftrace.

> sure the next event being programmed shortely after the other has to do
> with various factors (HZ setting, High-resolution timers enabled?). W/o
> having the corresponding hrtimer tracepoints available (start and expire)
> it's hard to tell what's going on.

I do have hrtimers enabled. If it would be helpful I could try again
with traces for them enabled too.

> But I can tell for sure that there is no feedback loop and its not
> disrupted by some other interrupt.

FYI no rcu_sched messages or noticable stalls were observed with
tracing active. The cpu load from tracing is so high that I would not
expect to see the same behavior. I probably failed to make that clear
before; sorry.

> The above trace is merily seing the PIT
> irq and nothing else.

It is showing another one happening before the softirq side of the
first could run, right? Of course this is probably due to overall
ftrace slowness. CONFIG_DYNAMIC_FTRACE does not work on big-endian sh
(the objdump needed at build time is hard-coded to assume sh is
little-endian and I haven't yet worked out what all needs to be fixed)
so I suspect that may be the cause of ftrace being so slow when it's
turned on.

Oddly, the min_delta cutoff needed to prevent the stalls was on the
same order of magnitude as these numbers (2.5ms almost entirely made
them go away; 5ms completely).

> Can you please provide a trace with hrtimer tracepoints enabled and tell us
> the settings of CONFIG_HZ and whether high resolution timers are enabled or
> not.

Yes. I think these are the relevant config items:

CONFIG_HZ=100
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
CONFIG_NO_HZ_IDLE=y
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y

Let me know if I missed any. I'll try getting a trace and follow up
again.

> That workaround of setting min_delta to some absurd big number looks fishy
> at best.

I agree. I'd like to find the right fix, but usability badly suffers
without either a proper fix or a workaround.

Rich


Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-09-26 Thread Rich Felker
On Mon, Sep 26, 2016 at 07:55:13PM -0400, Thomas Gleixner wrote:
> On Mon, 26 Sep 2016, Rich Felker wrote:
> > On Mon, Sep 26, 2016 at 11:27:14PM +0200, Daniel Lezcano wrote:
> > Based on use of ftrace, I was able to see situations where a second
> > timer hardirq happened immediately after one occurred, before the
> > timer softirq could run. My theory is that this is causing some kind
> > of feedback loop where new timer expirations keep getting scheduled
> > with a very short interval such that the softirq never gets to run
> > (until other interrupt activity disrups the feedback loop). I tried
> > reverting 4e85876a9d2a977b4a07389da8c07edf76d10825 which seemed
> > relevant and it didn't help, but on further review (right now) there
> > seem to be a few related commits just before it that might be
> > responsible for the regression. I'll see if I can dig up anything else
> > useful.
> 
> Interesting theory, but lets look at the data from the other thread:
> 
>  -0 [001] d.h.  2646.702790: irq_handler_entry: irq=72 
> name=jcore_pit
>   -0 [001] d.h.  2646.703398: softirq_raise: vec=1 
> [action=TIMER]
>   -0 [001] d.h.  2646.703607: softirq_raise: vec=9 
> [action=RCU]
>   -0 [001] d.h.  2646.703884: softirq_raise: vec=7 
> [action=SCHED]
>   -0 [001] d.h.  2646.704191: irq_handler_exit: irq=72 
> ret=handled
> 
> So it takes 1.4ms to finish the timer irq handler
> 
>   -0 [001] d.H.  2646.704509: irq_handler_entry: irq=72 
> name=jcore_pit
>   -0 [001] d.H.  2646.704990: softirq_raise: vec=1 
> [action=TIMER]
>   -0 [001] d.H.  2646.705182: softirq_raise: vec=9 
> [action=RCU]
>   -0 [001] d.H.  2646.705465: softirq_raise: vec=7 
> [action=SCHED]
>   -0 [001] d.H.  2646.705761: irq_handler_exit: irq=72 
> ret=handled
> 
> And 1.2ms in this case
> 
> 
>   -0 [001] d.H.  2646.706071: irq_handler_entry: irq=72 
> name=jcore_pit
>   -0 [001] d.H.  2646.706328: irq_handler_exit: irq=72 
> ret=handled
> 
> So this one is actually short enough that the soft interrupt can excute and
> obviously no new timer is scheduled close enough.
> 
>   -0 [001] ..s.  2646.706530: softirq_entry: vec=1 
> [action=TIMER]
> 
> Now the obvious question arises: Why is that handler taking that long? And

I'm not sure. Even at our clock rate of 50 MHz, 1.2ms is 6 cycles.
It's possible much of that overhead is coming from ftrace. I can play
around again with just logging the minimum intervals that are
programmed, and not using ftrace.

> sure the next event being programmed shortely after the other has to do
> with various factors (HZ setting, High-resolution timers enabled?). W/o
> having the corresponding hrtimer tracepoints available (start and expire)
> it's hard to tell what's going on.

I do have hrtimers enabled. If it would be helpful I could try again
with traces for them enabled too.

> But I can tell for sure that there is no feedback loop and its not
> disrupted by some other interrupt.

FYI no rcu_sched messages or noticable stalls were observed with
tracing active. The cpu load from tracing is so high that I would not
expect to see the same behavior. I probably failed to make that clear
before; sorry.

> The above trace is merily seing the PIT
> irq and nothing else.

It is showing another one happening before the softirq side of the
first could run, right? Of course this is probably due to overall
ftrace slowness. CONFIG_DYNAMIC_FTRACE does not work on big-endian sh
(the objdump needed at build time is hard-coded to assume sh is
little-endian and I haven't yet worked out what all needs to be fixed)
so I suspect that may be the cause of ftrace being so slow when it's
turned on.

Oddly, the min_delta cutoff needed to prevent the stalls was on the
same order of magnitude as these numbers (2.5ms almost entirely made
them go away; 5ms completely).

> Can you please provide a trace with hrtimer tracepoints enabled and tell us
> the settings of CONFIG_HZ and whether high resolution timers are enabled or
> not.

Yes. I think these are the relevant config items:

CONFIG_HZ=100
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
CONFIG_NO_HZ_IDLE=y
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y

Let me know if I missed any. I'll try getting a trace and follow up
again.

> That workaround of setting min_delta to some absurd big number looks fishy
> at best.

I agree. I'd like to find the right fix, but usability badly suffers
without either a proper fix or a workaround.

Rich


Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-09-26 Thread Thomas Gleixner
On Mon, 26 Sep 2016, Rich Felker wrote:
> On Mon, Sep 26, 2016 at 11:27:14PM +0200, Daniel Lezcano wrote:
> Based on use of ftrace, I was able to see situations where a second
> timer hardirq happened immediately after one occurred, before the
> timer softirq could run. My theory is that this is causing some kind
> of feedback loop where new timer expirations keep getting scheduled
> with a very short interval such that the softirq never gets to run
> (until other interrupt activity disrups the feedback loop). I tried
> reverting 4e85876a9d2a977b4a07389da8c07edf76d10825 which seemed
> relevant and it didn't help, but on further review (right now) there
> seem to be a few related commits just before it that might be
> responsible for the regression. I'll see if I can dig up anything else
> useful.

Interesting theory, but lets look at the data from the other thread:

 -0 [001] d.h.  2646.702790: irq_handler_entry: irq=72 
name=jcore_pit
  -0 [001] d.h.  2646.703398: softirq_raise: vec=1 
[action=TIMER]
  -0 [001] d.h.  2646.703607: softirq_raise: vec=9 
[action=RCU]
  -0 [001] d.h.  2646.703884: softirq_raise: vec=7 
[action=SCHED]
  -0 [001] d.h.  2646.704191: irq_handler_exit: irq=72 
ret=handled

So it takes 1.4ms to finish the timer irq handler

  -0 [001] d.H.  2646.704509: irq_handler_entry: irq=72 
name=jcore_pit
  -0 [001] d.H.  2646.704990: softirq_raise: vec=1 
[action=TIMER]
  -0 [001] d.H.  2646.705182: softirq_raise: vec=9 
[action=RCU]
  -0 [001] d.H.  2646.705465: softirq_raise: vec=7 
[action=SCHED]
  -0 [001] d.H.  2646.705761: irq_handler_exit: irq=72 
ret=handled

And 1.2ms in this case
  

  -0 [001] d.H.  2646.706071: irq_handler_entry: irq=72 
name=jcore_pit
  -0 [001] d.H.  2646.706328: irq_handler_exit: irq=72 
ret=handled

So this one is actually short enough that the soft interrupt can excute and
obviously no new timer is scheduled close enough.

  -0 [001] ..s.  2646.706530: softirq_entry: vec=1 
[action=TIMER]

Now the obvious question arises: Why is that handler taking that long? And
sure the next event being programmed shortely after the other has to do
with various factors (HZ setting, High-resolution timers enabled?). W/o
having the corresponding hrtimer tracepoints available (start and expire)
it's hard to tell what's going on.

But I can tell for sure that there is no feedback loop and its not
disrupted by some other interrupt. The above trace is merily seing the PIT
irq and nothing else.

Can you please provide a trace with hrtimer tracepoints enabled and tell us
the settings of CONFIG_HZ and whether high resolution timers are enabled or
not.

That workaround of setting min_delta to some absurd big number looks fishy
at best.

Thanks,

tglx


Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-09-26 Thread Thomas Gleixner
On Mon, 26 Sep 2016, Rich Felker wrote:
> On Mon, Sep 26, 2016 at 11:27:14PM +0200, Daniel Lezcano wrote:
> Based on use of ftrace, I was able to see situations where a second
> timer hardirq happened immediately after one occurred, before the
> timer softirq could run. My theory is that this is causing some kind
> of feedback loop where new timer expirations keep getting scheduled
> with a very short interval such that the softirq never gets to run
> (until other interrupt activity disrups the feedback loop). I tried
> reverting 4e85876a9d2a977b4a07389da8c07edf76d10825 which seemed
> relevant and it didn't help, but on further review (right now) there
> seem to be a few related commits just before it that might be
> responsible for the regression. I'll see if I can dig up anything else
> useful.

Interesting theory, but lets look at the data from the other thread:

 -0 [001] d.h.  2646.702790: irq_handler_entry: irq=72 
name=jcore_pit
  -0 [001] d.h.  2646.703398: softirq_raise: vec=1 
[action=TIMER]
  -0 [001] d.h.  2646.703607: softirq_raise: vec=9 
[action=RCU]
  -0 [001] d.h.  2646.703884: softirq_raise: vec=7 
[action=SCHED]
  -0 [001] d.h.  2646.704191: irq_handler_exit: irq=72 
ret=handled

So it takes 1.4ms to finish the timer irq handler

  -0 [001] d.H.  2646.704509: irq_handler_entry: irq=72 
name=jcore_pit
  -0 [001] d.H.  2646.704990: softirq_raise: vec=1 
[action=TIMER]
  -0 [001] d.H.  2646.705182: softirq_raise: vec=9 
[action=RCU]
  -0 [001] d.H.  2646.705465: softirq_raise: vec=7 
[action=SCHED]
  -0 [001] d.H.  2646.705761: irq_handler_exit: irq=72 
ret=handled

And 1.2ms in this case
  

  -0 [001] d.H.  2646.706071: irq_handler_entry: irq=72 
name=jcore_pit
  -0 [001] d.H.  2646.706328: irq_handler_exit: irq=72 
ret=handled

So this one is actually short enough that the soft interrupt can excute and
obviously no new timer is scheduled close enough.

  -0 [001] ..s.  2646.706530: softirq_entry: vec=1 
[action=TIMER]

Now the obvious question arises: Why is that handler taking that long? And
sure the next event being programmed shortely after the other has to do
with various factors (HZ setting, High-resolution timers enabled?). W/o
having the corresponding hrtimer tracepoints available (start and expire)
it's hard to tell what's going on.

But I can tell for sure that there is no feedback loop and its not
disrupted by some other interrupt. The above trace is merily seing the PIT
irq and nothing else.

Can you please provide a trace with hrtimer tracepoints enabled and tell us
the settings of CONFIG_HZ and whether high resolution timers are enabled or
not.

That workaround of setting min_delta to some absurd big number looks fishy
at best.

Thanks,

tglx


Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-09-26 Thread Rich Felker
On Mon, Sep 26, 2016 at 11:27:14PM +0200, Daniel Lezcano wrote:
> On 26/09/2016 23:07, Rich Felker wrote:
> > Ping. Is there anything that still needs to be changed for this driver
> > to be acceptable?
> 
> It is on my radar. I'm reviewing it.
> 
> Can you elaborate the workaround mentioned in the changelog. I have been
> digging into the lkml@ thread but it is not clear if the issue is
> related to the time framework, the driver itself or whatever else. Can
> you clarify that ?

It does not seem to be related to the driver. I'd be happy to have a
workaround (or even better a fix) at a higher level outside the
driver. I'll try to summarize what I think is happening.

The symptom is heavy irq load (much higher than the 1-2 irqs/sec I
observed on 4.6, something like 30+ on average) and frequent rcu_sched
stall messages. Under these conditions the system "feels" responsive
still, but that seems to be a consequence of other interrupts breaking
the stall; by inserting and removing the SD card, which is probed
periodically (1s period) by a kernel thread when there's no irq hooked
up for the slot, I was able to observe that the card-removed message
did not appear for several seconds, so apparently the kernel really is
stalling.

Based on use of ftrace, I was able to see situations where a second
timer hardirq happened immediately after one occurred, before the
timer softirq could run. My theory is that this is causing some kind
of feedback loop where new timer expirations keep getting scheduled
with a very short interval such that the softirq never gets to run
(until other interrupt activity disrups the feedback loop). I tried
reverting 4e85876a9d2a977b4a07389da8c07edf76d10825 which seemed
relevant and it didn't help, but on further review (right now) there
seem to be a few related commits just before it that might be
responsible for the regression. I'll see if I can dig up anything else
useful.

> Regarding the previous version, did you reach a consensus regarding
> per_cpu irq with Mark Rutland ?

I'm not sure. I think I can reasonably say that the existign percpu
framework is not suitable or necessary for modeling what the jcore
hardware is doing. The interrupt controller driver seems to have been
accepted already without use of percpu stuff. I know there was some
concern that not letting the kernel know whether an irq is percpu or
not could lead to wrong behavior, but I believe that's only possible
in the other direction (wrongly registering an irq as percpu when it's
actually a normal one that could happen on either cpu, in which case
locking might be wrongly omitted, etc.).

Rich


Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-09-26 Thread Rich Felker
On Mon, Sep 26, 2016 at 11:27:14PM +0200, Daniel Lezcano wrote:
> On 26/09/2016 23:07, Rich Felker wrote:
> > Ping. Is there anything that still needs to be changed for this driver
> > to be acceptable?
> 
> It is on my radar. I'm reviewing it.
> 
> Can you elaborate the workaround mentioned in the changelog. I have been
> digging into the lkml@ thread but it is not clear if the issue is
> related to the time framework, the driver itself or whatever else. Can
> you clarify that ?

It does not seem to be related to the driver. I'd be happy to have a
workaround (or even better a fix) at a higher level outside the
driver. I'll try to summarize what I think is happening.

The symptom is heavy irq load (much higher than the 1-2 irqs/sec I
observed on 4.6, something like 30+ on average) and frequent rcu_sched
stall messages. Under these conditions the system "feels" responsive
still, but that seems to be a consequence of other interrupts breaking
the stall; by inserting and removing the SD card, which is probed
periodically (1s period) by a kernel thread when there's no irq hooked
up for the slot, I was able to observe that the card-removed message
did not appear for several seconds, so apparently the kernel really is
stalling.

Based on use of ftrace, I was able to see situations where a second
timer hardirq happened immediately after one occurred, before the
timer softirq could run. My theory is that this is causing some kind
of feedback loop where new timer expirations keep getting scheduled
with a very short interval such that the softirq never gets to run
(until other interrupt activity disrups the feedback loop). I tried
reverting 4e85876a9d2a977b4a07389da8c07edf76d10825 which seemed
relevant and it didn't help, but on further review (right now) there
seem to be a few related commits just before it that might be
responsible for the regression. I'll see if I can dig up anything else
useful.

> Regarding the previous version, did you reach a consensus regarding
> per_cpu irq with Mark Rutland ?

I'm not sure. I think I can reasonably say that the existign percpu
framework is not suitable or necessary for modeling what the jcore
hardware is doing. The interrupt controller driver seems to have been
accepted already without use of percpu stuff. I know there was some
concern that not letting the kernel know whether an irq is percpu or
not could lead to wrong behavior, but I believe that's only possible
in the other direction (wrongly registering an irq as percpu when it's
actually a normal one that could happen on either cpu, in which case
locking might be wrongly omitted, etc.).

Rich


Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-09-26 Thread Daniel Lezcano
On 26/09/2016 23:07, Rich Felker wrote:
> Ping. Is there anything that still needs to be changed for this driver
> to be acceptable?
> 

It is on my radar. I'm reviewing it.

Can you elaborate the workaround mentioned in the changelog. I have been
digging into the lkml@ thread but it is not clear if the issue is
related to the time framework, the driver itself or whatever else. Can
you clarify that ?

Regarding the previous version, did you reach a consensus regarding
per_cpu irq with Mark Rutland ?

Thanks.

  -- Daniel

-- 
  Linaro.org │ Open source software for ARM SoCs

Follow Linaro:   Facebook |
 Twitter |
 Blog



Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-09-26 Thread Daniel Lezcano
On 26/09/2016 23:07, Rich Felker wrote:
> Ping. Is there anything that still needs to be changed for this driver
> to be acceptable?
> 

It is on my radar. I'm reviewing it.

Can you elaborate the workaround mentioned in the changelog. I have been
digging into the lkml@ thread but it is not clear if the issue is
related to the time framework, the driver itself or whatever else. Can
you clarify that ?

Regarding the previous version, did you reach a consensus regarding
per_cpu irq with Mark Rutland ?

Thanks.

  -- Daniel

-- 
  Linaro.org │ Open source software for ARM SoCs

Follow Linaro:   Facebook |
 Twitter |
 Blog



Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-09-26 Thread Rich Felker
Ping. Is there anything that still needs to be changed for this driver
to be acceptable?

On Mon, Sep 17, 2001 at 04:00:00AM +, Rich Felker wrote:
> At the hardware level, the J-Core PIT is integrated with the interrupt
> controller, but it is represented as its own device and has an
> independent programming interface. It provides a 12-bit countdown
> timer, which is not presently used, and a periodic timer. The interval
> length for the latter is programmable via a 32-bit throttle register
> whose units are determined by a bus-period register. The periodic
> timer is used to implement both periodic and oneshot clock event
> modes; in oneshot mode the interrupt handler simply disables the timer
> as soon as it fires.
> 
> Despite its device tree node representing an interrupt for the PIT,
> the actual irq generated is programmable, not hard-wired. The driver
> is responsible for programming the PIT to generate the hardware irq
> number that the DT assigns to it.
> 
> On SMP configurations, J-Core provides cpu-local instances of the PIT;
> no broadcast timer is needed. This driver supports the creation of the
> necessary per-cpu clock_event_device instances.
> 
> A nanosecond-resolution clocksource is provided using the J-Core "RTC"
> registers, which give a 64-bit seconds count and 32-bit nanoseconds
> that wrap every second. The driver converts these to a full-range
> 32-bit nanoseconds count.
> 
> Signed-off-by: Rich Felker 
> ---
>  drivers/clocksource/Kconfig |  10 ++
>  drivers/clocksource/Makefile|   1 +
>  drivers/clocksource/jcore-pit.c | 240 
> 
>  include/linux/cpuhotplug.h  |   1 +
>  4 files changed, 252 insertions(+)
>  create mode 100644 drivers/clocksource/jcore-pit.c
> 
> diff --git a/drivers/clocksource/Kconfig b/drivers/clocksource/Kconfig
> index 5677886..95dd78b 100644
> --- a/drivers/clocksource/Kconfig
> +++ b/drivers/clocksource/Kconfig
> @@ -407,6 +407,16 @@ config SYS_SUPPORTS_SH_TMU
>  config SYS_SUPPORTS_EM_STI
>  bool
>  
> +config CLKSRC_JCORE_PIT
> + bool "J-Core PIT timer driver"
> + depends on OF && (SUPERH || COMPILE_TEST)
> + depends on GENERIC_CLOCKEVENTS
> + depends on HAS_IOMEM
> + select CLKSRC_MMIO
> + help
> +   This enables build of clocksource and clockevent driver for
> +   the integrated PIT in the J-Core synthesizable, open source SoC.
> +
>  config SH_TIMER_CMT
>   bool "Renesas CMT timer driver" if COMPILE_TEST
>   depends on GENERIC_CLOCKEVENTS
> diff --git a/drivers/clocksource/Makefile b/drivers/clocksource/Makefile
> index fd9d6df..cf87f40 100644
> --- a/drivers/clocksource/Makefile
> +++ b/drivers/clocksource/Makefile
> @@ -5,6 +5,7 @@ obj-$(CONFIG_ATMEL_TCB_CLKSRC)+= tcb_clksrc.o
>  obj-$(CONFIG_X86_PM_TIMER)   += acpi_pm.o
>  obj-$(CONFIG_SCx200HR_TIMER) += scx200_hrt.o
>  obj-$(CONFIG_CS5535_CLOCK_EVENT_SRC) += cs5535-clockevt.o
> +obj-$(CONFIG_CLKSRC_JCORE_PIT)   += jcore-pit.o
>  obj-$(CONFIG_SH_TIMER_CMT)   += sh_cmt.o
>  obj-$(CONFIG_SH_TIMER_MTU2)  += sh_mtu2.o
>  obj-$(CONFIG_SH_TIMER_TMU)   += sh_tmu.o
> diff --git a/drivers/clocksource/jcore-pit.c b/drivers/clocksource/jcore-pit.c
> new file mode 100644
> index 000..4c512d7
> --- /dev/null
> +++ b/drivers/clocksource/jcore-pit.c
> @@ -0,0 +1,240 @@
> +/*
> + * J-Core SoC PIT/clocksource driver
> + *
> + * Copyright (C) 2015-2016 Smart Energy Instruments, Inc.
> + *
> + * This file is subject to the terms and conditions of the GNU General Public
> + * License.  See the file "COPYING" in the main directory of this archive
> + * for more details.
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#define PIT_IRQ_SHIFT12
> +#define PIT_PRIO_SHIFT   20
> +#define PIT_ENABLE_SHIFT 26
> +#define PIT_IRQ_MASK 0x3f
> +#define PIT_PRIO_MASK0xf
> +
> +#define REG_PITEN0x00
> +#define REG_THROT0x10
> +#define REG_COUNT0x14
> +#define REG_BUSPD0x18
> +#define REG_SECHI0x20
> +#define REG_SECLO0x24
> +#define REG_NSEC 0x28
> +
> +struct jcore_pit {
> + struct clock_event_device   ced;
> + __iomem void*base;
> + unsigned long   periodic_delta;
> + unsignedcpu;
> + u32 enable_val;
> +};
> +
> +static __iomem void  *jcore_pit_base;
> +struct jcore_pit __percpu*jcore_pit_percpu;
> +
> +static notrace u64 jcore_sched_clock_read(void)
> +{
> + u32 seclo, nsec, seclo0;
> + __iomem void *base = jcore_pit_base;
> +
> + seclo = readl(base + REG_SECLO);
> + do {
> + seclo0 = seclo;
> + nsec  = readl(base + REG_NSEC);
> + seclo = readl(base + REG_SECLO);
> + } while 

Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-09-26 Thread Rich Felker
Ping. Is there anything that still needs to be changed for this driver
to be acceptable?

On Mon, Sep 17, 2001 at 04:00:00AM +, Rich Felker wrote:
> At the hardware level, the J-Core PIT is integrated with the interrupt
> controller, but it is represented as its own device and has an
> independent programming interface. It provides a 12-bit countdown
> timer, which is not presently used, and a periodic timer. The interval
> length for the latter is programmable via a 32-bit throttle register
> whose units are determined by a bus-period register. The periodic
> timer is used to implement both periodic and oneshot clock event
> modes; in oneshot mode the interrupt handler simply disables the timer
> as soon as it fires.
> 
> Despite its device tree node representing an interrupt for the PIT,
> the actual irq generated is programmable, not hard-wired. The driver
> is responsible for programming the PIT to generate the hardware irq
> number that the DT assigns to it.
> 
> On SMP configurations, J-Core provides cpu-local instances of the PIT;
> no broadcast timer is needed. This driver supports the creation of the
> necessary per-cpu clock_event_device instances.
> 
> A nanosecond-resolution clocksource is provided using the J-Core "RTC"
> registers, which give a 64-bit seconds count and 32-bit nanoseconds
> that wrap every second. The driver converts these to a full-range
> 32-bit nanoseconds count.
> 
> Signed-off-by: Rich Felker 
> ---
>  drivers/clocksource/Kconfig |  10 ++
>  drivers/clocksource/Makefile|   1 +
>  drivers/clocksource/jcore-pit.c | 240 
> 
>  include/linux/cpuhotplug.h  |   1 +
>  4 files changed, 252 insertions(+)
>  create mode 100644 drivers/clocksource/jcore-pit.c
> 
> diff --git a/drivers/clocksource/Kconfig b/drivers/clocksource/Kconfig
> index 5677886..95dd78b 100644
> --- a/drivers/clocksource/Kconfig
> +++ b/drivers/clocksource/Kconfig
> @@ -407,6 +407,16 @@ config SYS_SUPPORTS_SH_TMU
>  config SYS_SUPPORTS_EM_STI
>  bool
>  
> +config CLKSRC_JCORE_PIT
> + bool "J-Core PIT timer driver"
> + depends on OF && (SUPERH || COMPILE_TEST)
> + depends on GENERIC_CLOCKEVENTS
> + depends on HAS_IOMEM
> + select CLKSRC_MMIO
> + help
> +   This enables build of clocksource and clockevent driver for
> +   the integrated PIT in the J-Core synthesizable, open source SoC.
> +
>  config SH_TIMER_CMT
>   bool "Renesas CMT timer driver" if COMPILE_TEST
>   depends on GENERIC_CLOCKEVENTS
> diff --git a/drivers/clocksource/Makefile b/drivers/clocksource/Makefile
> index fd9d6df..cf87f40 100644
> --- a/drivers/clocksource/Makefile
> +++ b/drivers/clocksource/Makefile
> @@ -5,6 +5,7 @@ obj-$(CONFIG_ATMEL_TCB_CLKSRC)+= tcb_clksrc.o
>  obj-$(CONFIG_X86_PM_TIMER)   += acpi_pm.o
>  obj-$(CONFIG_SCx200HR_TIMER) += scx200_hrt.o
>  obj-$(CONFIG_CS5535_CLOCK_EVENT_SRC) += cs5535-clockevt.o
> +obj-$(CONFIG_CLKSRC_JCORE_PIT)   += jcore-pit.o
>  obj-$(CONFIG_SH_TIMER_CMT)   += sh_cmt.o
>  obj-$(CONFIG_SH_TIMER_MTU2)  += sh_mtu2.o
>  obj-$(CONFIG_SH_TIMER_TMU)   += sh_tmu.o
> diff --git a/drivers/clocksource/jcore-pit.c b/drivers/clocksource/jcore-pit.c
> new file mode 100644
> index 000..4c512d7
> --- /dev/null
> +++ b/drivers/clocksource/jcore-pit.c
> @@ -0,0 +1,240 @@
> +/*
> + * J-Core SoC PIT/clocksource driver
> + *
> + * Copyright (C) 2015-2016 Smart Energy Instruments, Inc.
> + *
> + * This file is subject to the terms and conditions of the GNU General Public
> + * License.  See the file "COPYING" in the main directory of this archive
> + * for more details.
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#define PIT_IRQ_SHIFT12
> +#define PIT_PRIO_SHIFT   20
> +#define PIT_ENABLE_SHIFT 26
> +#define PIT_IRQ_MASK 0x3f
> +#define PIT_PRIO_MASK0xf
> +
> +#define REG_PITEN0x00
> +#define REG_THROT0x10
> +#define REG_COUNT0x14
> +#define REG_BUSPD0x18
> +#define REG_SECHI0x20
> +#define REG_SECLO0x24
> +#define REG_NSEC 0x28
> +
> +struct jcore_pit {
> + struct clock_event_device   ced;
> + __iomem void*base;
> + unsigned long   periodic_delta;
> + unsignedcpu;
> + u32 enable_val;
> +};
> +
> +static __iomem void  *jcore_pit_base;
> +struct jcore_pit __percpu*jcore_pit_percpu;
> +
> +static notrace u64 jcore_sched_clock_read(void)
> +{
> + u32 seclo, nsec, seclo0;
> + __iomem void *base = jcore_pit_base;
> +
> + seclo = readl(base + REG_SECLO);
> + do {
> + seclo0 = seclo;
> + nsec  = readl(base + REG_NSEC);
> + seclo = readl(base + REG_SECLO);
> + } while (seclo0 != 

[PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-09-23 Thread Rich Felker
At the hardware level, the J-Core PIT is integrated with the interrupt
controller, but it is represented as its own device and has an
independent programming interface. It provides a 12-bit countdown
timer, which is not presently used, and a periodic timer. The interval
length for the latter is programmable via a 32-bit throttle register
whose units are determined by a bus-period register. The periodic
timer is used to implement both periodic and oneshot clock event
modes; in oneshot mode the interrupt handler simply disables the timer
as soon as it fires.

Despite its device tree node representing an interrupt for the PIT,
the actual irq generated is programmable, not hard-wired. The driver
is responsible for programming the PIT to generate the hardware irq
number that the DT assigns to it.

On SMP configurations, J-Core provides cpu-local instances of the PIT;
no broadcast timer is needed. This driver supports the creation of the
necessary per-cpu clock_event_device instances.

A nanosecond-resolution clocksource is provided using the J-Core "RTC"
registers, which give a 64-bit seconds count and 32-bit nanoseconds
that wrap every second. The driver converts these to a full-range
32-bit nanoseconds count.

Signed-off-by: Rich Felker 
---
 drivers/clocksource/Kconfig |  10 ++
 drivers/clocksource/Makefile|   1 +
 drivers/clocksource/jcore-pit.c | 240 
 include/linux/cpuhotplug.h  |   1 +
 4 files changed, 252 insertions(+)
 create mode 100644 drivers/clocksource/jcore-pit.c

diff --git a/drivers/clocksource/Kconfig b/drivers/clocksource/Kconfig
index 5677886..95dd78b 100644
--- a/drivers/clocksource/Kconfig
+++ b/drivers/clocksource/Kconfig
@@ -407,6 +407,16 @@ config SYS_SUPPORTS_SH_TMU
 config SYS_SUPPORTS_EM_STI
 bool
 
+config CLKSRC_JCORE_PIT
+   bool "J-Core PIT timer driver"
+   depends on OF && (SUPERH || COMPILE_TEST)
+   depends on GENERIC_CLOCKEVENTS
+   depends on HAS_IOMEM
+   select CLKSRC_MMIO
+   help
+ This enables build of clocksource and clockevent driver for
+ the integrated PIT in the J-Core synthesizable, open source SoC.
+
 config SH_TIMER_CMT
bool "Renesas CMT timer driver" if COMPILE_TEST
depends on GENERIC_CLOCKEVENTS
diff --git a/drivers/clocksource/Makefile b/drivers/clocksource/Makefile
index fd9d6df..cf87f40 100644
--- a/drivers/clocksource/Makefile
+++ b/drivers/clocksource/Makefile
@@ -5,6 +5,7 @@ obj-$(CONFIG_ATMEL_TCB_CLKSRC)  += tcb_clksrc.o
 obj-$(CONFIG_X86_PM_TIMER) += acpi_pm.o
 obj-$(CONFIG_SCx200HR_TIMER)   += scx200_hrt.o
 obj-$(CONFIG_CS5535_CLOCK_EVENT_SRC)   += cs5535-clockevt.o
+obj-$(CONFIG_CLKSRC_JCORE_PIT) += jcore-pit.o
 obj-$(CONFIG_SH_TIMER_CMT) += sh_cmt.o
 obj-$(CONFIG_SH_TIMER_MTU2)+= sh_mtu2.o
 obj-$(CONFIG_SH_TIMER_TMU) += sh_tmu.o
diff --git a/drivers/clocksource/jcore-pit.c b/drivers/clocksource/jcore-pit.c
new file mode 100644
index 000..4c512d7
--- /dev/null
+++ b/drivers/clocksource/jcore-pit.c
@@ -0,0 +1,240 @@
+/*
+ * J-Core SoC PIT/clocksource driver
+ *
+ * Copyright (C) 2015-2016 Smart Energy Instruments, Inc.
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License.  See the file "COPYING" in the main directory of this archive
+ * for more details.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define PIT_IRQ_SHIFT  12
+#define PIT_PRIO_SHIFT 20
+#define PIT_ENABLE_SHIFT   26
+#define PIT_IRQ_MASK   0x3f
+#define PIT_PRIO_MASK  0xf
+
+#define REG_PITEN  0x00
+#define REG_THROT  0x10
+#define REG_COUNT  0x14
+#define REG_BUSPD  0x18
+#define REG_SECHI  0x20
+#define REG_SECLO  0x24
+#define REG_NSEC   0x28
+
+struct jcore_pit {
+   struct clock_event_device   ced;
+   __iomem void*base;
+   unsigned long   periodic_delta;
+   unsignedcpu;
+   u32 enable_val;
+};
+
+static __iomem void*jcore_pit_base;
+struct jcore_pit __percpu  *jcore_pit_percpu;
+
+static notrace u64 jcore_sched_clock_read(void)
+{
+   u32 seclo, nsec, seclo0;
+   __iomem void *base = jcore_pit_base;
+
+   seclo = readl(base + REG_SECLO);
+   do {
+   seclo0 = seclo;
+   nsec  = readl(base + REG_NSEC);
+   seclo = readl(base + REG_SECLO);
+   } while (seclo0 != seclo);
+
+   return seclo * NSEC_PER_SEC + nsec;
+}
+
+static cycle_t jcore_clocksource_read(struct clocksource *cs)
+{
+   return jcore_sched_clock_read();
+}
+
+static int jcore_pit_disable(struct jcore_pit *pit)
+{
+   writel(0, pit->base + REG_PITEN);
+   return 0;
+}
+
+static int jcore_pit_set(unsigned long delta, struct 

[PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

2016-09-23 Thread Rich Felker
At the hardware level, the J-Core PIT is integrated with the interrupt
controller, but it is represented as its own device and has an
independent programming interface. It provides a 12-bit countdown
timer, which is not presently used, and a periodic timer. The interval
length for the latter is programmable via a 32-bit throttle register
whose units are determined by a bus-period register. The periodic
timer is used to implement both periodic and oneshot clock event
modes; in oneshot mode the interrupt handler simply disables the timer
as soon as it fires.

Despite its device tree node representing an interrupt for the PIT,
the actual irq generated is programmable, not hard-wired. The driver
is responsible for programming the PIT to generate the hardware irq
number that the DT assigns to it.

On SMP configurations, J-Core provides cpu-local instances of the PIT;
no broadcast timer is needed. This driver supports the creation of the
necessary per-cpu clock_event_device instances.

A nanosecond-resolution clocksource is provided using the J-Core "RTC"
registers, which give a 64-bit seconds count and 32-bit nanoseconds
that wrap every second. The driver converts these to a full-range
32-bit nanoseconds count.

Signed-off-by: Rich Felker 
---
 drivers/clocksource/Kconfig |  10 ++
 drivers/clocksource/Makefile|   1 +
 drivers/clocksource/jcore-pit.c | 240 
 include/linux/cpuhotplug.h  |   1 +
 4 files changed, 252 insertions(+)
 create mode 100644 drivers/clocksource/jcore-pit.c

diff --git a/drivers/clocksource/Kconfig b/drivers/clocksource/Kconfig
index 5677886..95dd78b 100644
--- a/drivers/clocksource/Kconfig
+++ b/drivers/clocksource/Kconfig
@@ -407,6 +407,16 @@ config SYS_SUPPORTS_SH_TMU
 config SYS_SUPPORTS_EM_STI
 bool
 
+config CLKSRC_JCORE_PIT
+   bool "J-Core PIT timer driver"
+   depends on OF && (SUPERH || COMPILE_TEST)
+   depends on GENERIC_CLOCKEVENTS
+   depends on HAS_IOMEM
+   select CLKSRC_MMIO
+   help
+ This enables build of clocksource and clockevent driver for
+ the integrated PIT in the J-Core synthesizable, open source SoC.
+
 config SH_TIMER_CMT
bool "Renesas CMT timer driver" if COMPILE_TEST
depends on GENERIC_CLOCKEVENTS
diff --git a/drivers/clocksource/Makefile b/drivers/clocksource/Makefile
index fd9d6df..cf87f40 100644
--- a/drivers/clocksource/Makefile
+++ b/drivers/clocksource/Makefile
@@ -5,6 +5,7 @@ obj-$(CONFIG_ATMEL_TCB_CLKSRC)  += tcb_clksrc.o
 obj-$(CONFIG_X86_PM_TIMER) += acpi_pm.o
 obj-$(CONFIG_SCx200HR_TIMER)   += scx200_hrt.o
 obj-$(CONFIG_CS5535_CLOCK_EVENT_SRC)   += cs5535-clockevt.o
+obj-$(CONFIG_CLKSRC_JCORE_PIT) += jcore-pit.o
 obj-$(CONFIG_SH_TIMER_CMT) += sh_cmt.o
 obj-$(CONFIG_SH_TIMER_MTU2)+= sh_mtu2.o
 obj-$(CONFIG_SH_TIMER_TMU) += sh_tmu.o
diff --git a/drivers/clocksource/jcore-pit.c b/drivers/clocksource/jcore-pit.c
new file mode 100644
index 000..4c512d7
--- /dev/null
+++ b/drivers/clocksource/jcore-pit.c
@@ -0,0 +1,240 @@
+/*
+ * J-Core SoC PIT/clocksource driver
+ *
+ * Copyright (C) 2015-2016 Smart Energy Instruments, Inc.
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License.  See the file "COPYING" in the main directory of this archive
+ * for more details.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define PIT_IRQ_SHIFT  12
+#define PIT_PRIO_SHIFT 20
+#define PIT_ENABLE_SHIFT   26
+#define PIT_IRQ_MASK   0x3f
+#define PIT_PRIO_MASK  0xf
+
+#define REG_PITEN  0x00
+#define REG_THROT  0x10
+#define REG_COUNT  0x14
+#define REG_BUSPD  0x18
+#define REG_SECHI  0x20
+#define REG_SECLO  0x24
+#define REG_NSEC   0x28
+
+struct jcore_pit {
+   struct clock_event_device   ced;
+   __iomem void*base;
+   unsigned long   periodic_delta;
+   unsignedcpu;
+   u32 enable_val;
+};
+
+static __iomem void*jcore_pit_base;
+struct jcore_pit __percpu  *jcore_pit_percpu;
+
+static notrace u64 jcore_sched_clock_read(void)
+{
+   u32 seclo, nsec, seclo0;
+   __iomem void *base = jcore_pit_base;
+
+   seclo = readl(base + REG_SECLO);
+   do {
+   seclo0 = seclo;
+   nsec  = readl(base + REG_NSEC);
+   seclo = readl(base + REG_SECLO);
+   } while (seclo0 != seclo);
+
+   return seclo * NSEC_PER_SEC + nsec;
+}
+
+static cycle_t jcore_clocksource_read(struct clocksource *cs)
+{
+   return jcore_sched_clock_read();
+}
+
+static int jcore_pit_disable(struct jcore_pit *pit)
+{
+   writel(0, pit->base + REG_PITEN);
+   return 0;
+}
+
+static int jcore_pit_set(unsigned long delta, struct jcore_pit *pit)
+{
+